3D Computer Vision
Foundations and Advanced Methodologies
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publishers, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publishers nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publishers remain neutral with
regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
graduation projects and dissertations; on the other hand, this book is also suitable for corporate research and development personnel who want to follow the latest progress, and it serves as a reference for scientific research.
This book has approximately 500,000 words. It has 11 chapters, comprising a total of 59 sections (second-level headings) and 156 subsections (third-level headings). There are 215 numbered figures, 37 numbered tables, and 660 numbered equations. The list of more than 300 cited references (of which over 100 are from the 2020s) and more than 500 subject terms used for indexing appear at the end of the book to facilitate further access to the related literature.
Finally, I would like to thank my wife Yun He, daughter Heming Zhang, and
other family members for their understanding and support in all aspects.
1 Introduction ............................................................................................... 1
1.1 Introduction to Computer Vision .................................................. 2
1.1.1 Visual Essentials .............................................................. 2
1.1.2 The Goal of Computer Vision .......................................... 3
1.1.3 Related Disciplines ............................................................ 4
1.2 Computer Vision Theory and Framework .................................... 7
1.2.1 Visual Computational Theory .......................................... 7
1.2.2 Framework Issues and Improvements ............................. 13
1.2.3 A Discussion on Marr’s Reconstruction Theory ............. 15
1.2.4 Research on New Theoretical Framework ...................... 18
1.2.5 Discussion from the Perspective of Psychological Cognition ............. 21
1.3 Introduction to Image Engineering ............................................... 24
1.3.1 Three Levels of Technology in Image Engineering ............. 25
1.3.2 Research and Application of Image Engineering ........... 27
1.4 Introduction to Deep Learning ...................................................... 28
1.4.1 Deep Learning Overview .................................................. 28
1.4.2 Deep Learning Core Technology ..................................... 34
1.4.3 Deep Learning in Computer Vision ................................. 36
1.5 Organization and Content of this Book ........................................ 38
References ................................................................................................... 40
2 Camera Imaging and Calibration .......................................................... 41
2.1 Lightness Imaging Model ............................................................... 42
2.1.1 Photometric Concepts ................................................... 42
2.1.2 Basic Lightness Imaging Model ..................................... 43
2.2 Space Imaging Model ........................................................ 44
2.2.1 Projection Imaging Geometry ............................................ 44
2.2.2 Basic Space Imaging Model ............................................. 46
About the Author
Yu-Jin Zhang received his doctor’s degree in Applied Science from the University of Liège in Belgium in 1989. From 1989 to 1993, he successively engaged in postdoctoral research and served as a full-time researcher at Delft University in the Netherlands. Since 1993, he has worked in the Department of Electronic Engineering of Tsinghua University in Beijing, China. He has been a professor since 1997, a Ph.D. supervisor since 1998, and a tenured professor since 2014. During a sabbatical year in 2003, he was a visiting professor at Nanyang Technological University in Singapore.
At Tsinghua University, he has offered and taught over 10 undergraduate and graduate courses, including “Image Processing,” “Image Analysis,” “Image Understanding,” and “Content-Based Visual Information Retrieval.” At Nanyang Technological University, he offered and taught a postgraduate course: “Advanced Image Analysis (English).” More than 30 Chinese and English textbooks have been written and published (with a total of over 300,000 printed copies). More than 30 teaching research papers have been published both domestically and internationally.
His main scientific research fields are image engineering (image processing, image analysis, image understanding, and their technical applications), a discipline he actively advocates, and related areas. He has published over 500 research papers on image engineering both domestically and internationally. He published the monographs Image Segmentation and Content-Based Visual Information Retrieval (Science Press, China) and Subspace-Based Face Recognition (Tsinghua University Press, China); has written the English-Chinese Dictionary of Image Engineering (1st, 2nd, and 3rd editions; Tsinghua University Press, China); has written Selected Works of Image Engineering Technology and Selected Works of Image Engineering Technology (2) (Tsinghua University Press, China); has translated Computational Geometry, Topology, and Physics of Digital Images with Applications (Springer-Verlag, Germany) into Chinese; has led the compilation of Advances in Image and Video Segmentation and Semantic-Based Visual Information Retrieval (IRM Press, USA), as well as Advances in Face Image Analysis: Techniques and Technologies (IGI Global, USA); has written the Handbook of Image Engineering (Springer Nature, Singapore); and has written A Selection of Image Processing Techniques, A Selection of
1 Introduction
Vision is an important function and means for human beings to observe and
recognize the world. Computer vision, as a discipline that uses computers to realize human visual functions, has not only received great attention and in-depth research but has also been widely used [1].
The visual process can be seen as a complex process from sensing (feeling the image of the objective world) to perception (understanding the objective world from the image). It involves knowledge of optics, geometry, chemistry, physiology, psychology, and so on. To complete such a process, computer vision should not only have its own corresponding theory and technology but also combine the achievements of various disciplines and the development of various engineering technologies.
The sections of this chapter are arranged as follows.
Section 1.1 gives a general introduction to computer vision, including the key points of human vision, the research methods and objectives of computer vision, and the connections and differences with several major related disciplines.
Section 1.2 introduces the theory and framework of computer vision, mainly including the important visual computational theory, its existing problems and improvements; it also discusses some other theoretical frameworks.
Section 1.3 gives an overview of the various image processing technologies that form the basis of computer vision technology. Under the overall framework of image engineering, the three levels of image technology, as well as various recent research directions and application fields, are discussed in detail.
Section 1.4 provides an overview of the deep learning methods that have rapidly promoted the development of computer vision technology in recent years. In addition to listing some basic concepts of convolutional neural networks, it also discusses the core technology of deep learning and its connection with computer vision.
Section 1.5 introduces the main content of the book and its organizational structure.
1.1 Introduction to Computer Vision
The following is a general introduction to the origin, objectives, and related disci
plines of computer vision.
Computer vision originates from human vision, which is generally simply called vision. Vision is a human function that plays an important role in human observation and cognition of the objective world. According to statistics, about 75% of the information humans obtain from the outside world comes through the visual system, which shows not only that the amount of visual information is huge but also that humans make high use of visual information.
The human visual process can be seen as a complex process from sensing (feeling the image obtained by 2D projection of the 3D world) to perception (recognizing the content and meaning of the 3D world from the 2D image).
Vision is a very familiar function that not only helps people obtain information
but also helps people process it. Vision can be further divided into two levels: visual
sensation and visual perception. Here, sensation is at a lower level, which mainly
receives external stimuli, while perception is at a higher level, which converts
external stimuli into meaningful content. In general, sensation receives external
stimuli almost indiscriminately and completely, while perception determines
which parts of the external stimulus could combine together to form the “object”
of interest.
Visual sensation mainly explains, at the molecular level, the basic properties (such as brightness and color) of people’s response to light (i.e., visible radiation); it mainly involves physics, chemistry, and related disciplines. The main research contents of visual sensation are (1) the physical properties of light, such as light quanta, light waves, and the spectrum, and (2) the degree to which light stimulates the visual receptors, including photometry, the structure of the eye, visual adaptation, visual intensity and sensitivity, and the temporal and spatial characteristics of vision.
Visual perception mainly discusses how people respond to visual stimuli from the objective world and the ways in which they respond. It studies how vision enables people to form an interpretation of the spatial representation of the external world, so it also involves psychological factors. As a reflection of currently present objective things, visual perception cannot be fully explained by the principle of light projecting onto the retina to form a retinal image together with the known mechanisms of the eye and nervous system alone. Visual perception is a group of activities carried out in the nerve centers; it organizes scattered stimuli in the visual field into a whole with a certain shape in order to understand the world. As early as 2000 years ago, Aristotle defined the task of visual perception as determining “what is where” [2].
Computer vision is the use of computers to realize human visual functions, that is, the sensation, perception, processing, and interpretation of three-dimensional scenes in the objective world. The original purpose of vision research is to grasp and understand the image of a scene; to identify and locate the objects in it; to determine their structure, spatial arrangement, and distribution; and to explain the relationships between objects. The research goal of computer vision is to make meaningful judgments about actual objects and scenes in the objective world based on perceived images [3].
There are currently two main research approaches in computer vision. One is the bionic approach, which refers to the structural principles of the human visual system and establishes corresponding processing modules to complete similar functions and tasks. The other is the engineering approach, which starts from analyzing the functions of the human visual process; it does not deliberately simulate the internal structure of the human visual system but considers only the input and output of the system and adopts any existing and feasible means to realize the system’s functions. This book mainly discusses the second approach, from an engineering point of view.
The main research goals of computer vision can be summarized as two interrelated and complementary goals. The first research goal is to build computer vision systems to accomplish various vision tasks. That is, the computer obtains images of the scene with the help of various visual sensors (such as CCD and CMOS camera devices), perceives and recovers from them the geometric properties, posture and structure, motion, mutual positions, etc. of objects in the 3D environment, identifies, describes, and explains the objective scene, and then makes judgments and decisions. What is mainly studied here is the technical mechanism for accomplishing these tasks. At present, work in this area focuses on building various specialized systems to complete the specialized visual tasks that arise in various practical occasions; in the long run, the aim is to build more general systems (closer to the human visual system) to complete general vision tasks. The second research goal is to use this research as a means to explore the visual working mechanism of the human brain and to master and understand it (as in computational neuroscience). What is mainly studied here is the biological mechanism. For a long time, people have carried out a lot of research on the human brain's visual system from the aspects of physiology,
psychology, neuroscience, cognition, etc., but this is still far from revealing all the mysteries of the visual process; in particular, research on and understanding of the visual mechanism still lag far behind the research on and mastery of visual information processing. It should be pointed out that a full understanding of human brain vision will also promote in-depth research in computer vision [2]. This book mainly considers the first research objective.
It can be seen from the above that computer vision uses computers to realize human visual functions, and its research has obtained many inspirations from human vision. Much important research in computer vision has been accomplished by drawing on the understanding of the human visual system; typical examples include the use of pyramids as an efficient data structure, the use of the concept of local orientation, the use of filtering techniques to detect motion, and, more recently, artificial neural networks. In addition, understanding of and research on the functions of the human visual system can also help people develop new computer vision algorithms.
The research and application of computer vision has a long history. Overall, early computer vision systems mainly relied on 2D projected images of objective 3D scenes. The research goal of computer vision was to improve the quality of images, so that users could obtain information more clearly and conveniently, or to focus on automatically obtaining various characteristic data from the image to help users analyze and recognize the scenery. This line of work can be attributed to 2D computer vision, which is currently relatively mature, with many application products available. With the development of theory and technology, more and more research focuses on fully utilizing the 3D spatial information obtained from the objective scenery (often combined with temporal information) and on automatically analyzing and understanding the objective world, so as to make judgments and decisions. This includes further obtaining depth information on the basis of 2D projection images in order to comprehensively grasp the 3D world. This area of work is still being explored and requires the introduction of technologies such as artificial intelligence; it is currently the focus of research in computer vision. This more recent work can be categorized as 3D computer vision and will be the main subject of this book.
Fig. 1.1 The connections and differences between related disciplines and fields
of the environment through visual sensors, building a system with visual perception functions, and realizing algorithms for detecting and identifying objects. On the other hand, robot vision places more emphasis on the machine vision of robots, so that a robot has the function of visual perception.
2. Computer graphics
Graphics refers to the science of expressing data and information in the form of graphics, charts, drawings, etc. Computer graphics studies how to use computer technology to generate these forms, and it is also closely related to computer vision. Computer graphics is generally referred to as the inverse problem of computer vision, because vision extracts 3D information from 2D images, while graphics uses 3D models to generate 2D scene images (more generally, it generates realistic images from nonimage forms of data description). It should be noted that, compared with the many uncertainties in computer vision, computer graphics mostly deals with deterministic problems that can be solved through mathematical methods. In many practical applications, people are more concerned with the speed and accuracy of graphics generation, that is, with achieving some compromise between real-time performance and fidelity.
3. Image engineering
Image engineering is a very rich discipline, including three levels (three subdisciplines) that are both related and different: image processing, image analysis, and image understanding, as well as their engineering applications.
Image processing emphasizes the conversion between images (image in and image out). Although image processing in a broad sense refers to various image technologies, image processing in a narrow sense mainly focuses on the visual observation effect of the output image [4]. This includes making various adjustments to an image to improve its visual effect and to facilitate subsequent high-level processing; compressing and encoding an image to reduce the required storage space or transmission time while preserving the required visual quality, so as to meet the requirements of a given transmission path; adding some additional information to an image without affecting the appearance of the original image; and so on.
Image analysis mainly detects and measures the objects of interest in the image to obtain their objective information, thereby establishing a description of the objects in the image (image in and data out) [5]. If image processing is a process from image to image, image analysis is a process from image to data. The data here can be the results of measurements of object characteristics, symbolic representations based on such measurements, or conclusions identifying the object categories; they describe the characteristics and properties of the objects in the image.
The focus of image understanding is to further study, on the basis of image analysis, the nature of the objects in the image and their mutual relations, and to obtain an understanding of the meaning of the whole image content and an explanation of the original objective scene being imaged, so that people can make judgments (know the world) and guide and plan actions (transform the world) [6]. If image analysis studies the objective world mainly centered on the observer (mainly observable things), image understanding is, to a certain extent, centered on the objective world, grasping and explaining the entire objective world (including things not directly observed) with the help of knowledge and experience.
4. Pattern recognition
Patterns refer to categories of objective things or phenomena that are similar but not identical. Patterns cover a wide range, and images are one of them. (Image) pattern recognition is similar to image analysis in that they have the same input, while their different outputs can easily be converted into each other. Recognition refers to the mathematics and techniques for automatically establishing symbolic descriptions or logical inferences from objective facts, so pattern recognition is defined as the discipline of classifying and describing objects and processes in the objective world. At present, the recognition of image patterns mainly focuses on the classification, analysis, and description of the content of interest (objects) in the image; on this basis, the goal of computer vision can be further pursued. At the same time, many concepts and methods of pattern recognition are used in computer vision research; however, visual information has its own particularity and complexity, so traditional pattern recognition (the competitive learning model) cannot cover all of computer vision.
5. Artificial intelligence
Artificial intelligence can be counted as a body of new theories, new tools, and new technologies that have been widely studied and applied in the field of computer vision in recent years. Human intelligence mainly refers to the ability of human beings to understand the world, judge things, learn from the environment, plan behavior, reason, solve problems, and so on. Visual function is one manifestation of human intelligence, and, correspondingly, computer vision is closely related to artificial intelligence. Many artificial intelligence technologies are used in computer vision research; conversely, computer vision can also be regarded as an important application field of artificial intelligence, which requires the help of the theoretical research results and system implementation experience of artificial intelligence. Machine learning is the core of artificial intelligence.
1.2 Computer Vision Theory and Framework
As a discipline, computer vision has its own origins, theories, and frameworks. The origin of computer vision can be traced back to the invention and application of computers; the earliest computer vision techniques were already being studied and applied in the 1960s.
In its early days, research on computer vision did not have a comprehensive theoretical framework. In the 1970s, research on object recognition and scene understanding basically detected linear edges as the primitives of the scene and then combined them to form more complex scene structures. However, in practical applications, comprehensive primitive detection proved difficult and unstable, so such computer vision systems could only take simple lines and corners as input, forming the so-called building-block world.
Marr’s book Vision, published in 1982, summarizes a series of research results obtained by him and his colleagues on human vision, proposes a visual computational theory, and provides a framework for understanding visual information. This framework is both comprehensive and refined, and it is the key to making the study of visual information understanding rigorous and to moving visual research from a descriptive level to the level of mathematical science. Marr’s theory holds that the purpose of vision must be understood before going into the details, and this applies to a variety of information processing tasks. The gist of the theory is as follows.
Marr believes that vision is a far more complex information processing task and
process than people imagine, and its difficulty is often not recognized by people. A
major reason here is that while it is difficult for a computer to understand an image, it
is often a breeze for a human.
To understand the complex process of vision, two issues must first be addressed.
One is the representation of visual information; the other is the processing of visual
information. “Representation” here refers to a formal system (such as Arabic
numeral system, binary numeral system, etc.) that can clearly express certain entities
or certain types of information as well as some rules that explain how the system
works. In the representation, some information is prominent and explicit, while other
information is hidden and vague. Representation has a great influence on the
difficulty of subsequent information processing. The “processing” of visual information refers to the transformation and gradual abstraction of different forms of representation through continuous processing, analysis, and understanding of the information.
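The text's own example of numeral systems can be made concrete with a few lines of Python (a minimal sketch, using an arbitrary number; nothing here is taken from the book): the same value in different formal systems makes different information explicit and different operations cheap, which is exactly why the choice of representation affects the difficulty of later processing.

n = 156
print(bin(n))        # '0b10011100': the powers of two contained in n are explicit
print(bin(n)[-1])    # the last bit is '0', so evenness can be read off directly
print(n % 10)        # in the decimal representation, divisibility by ten is the explicit property
print(bin(n << 1))   # multiplying by 2 is just a one-position shift in the binary representation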
Solving the problem of representation and processing of visual information is
actually solving the problem of computability. If a task needs to be done by a
computer, it should be computable, which is the problem of computability. In
general, for a particular problem, a problem is computable if there is a program
that gives an output in finite steps for a given input.
To fully understand and interpret visual information, three essential factors need to be grasped at the same time, namely, computational theory, algorithm implementation, and hardware implementation.
First, the highest level of visual information understanding is abstract computational theory. Whether vision can be computed by modern computers is a question to be answered by computational theory, but there is no clear answer so far. Vision is a process of sensation and perception, and from the perspectives of microscopic anatomical knowledge and objective visual psychology, people still lack a grasp of the mechanism of human visual function. Therefore, the discussion of visual computability is still rather limited, mainly focusing on using the numerical and symbolic processing abilities of existing computers to complete specific visual tasks.
Table 1.1 The meaning of the three essential factors of visual information processing
1. Computational theory: What is the goal of the computation? Why is it computed in this way?
2. Representation and algorithm: How can the computational theory be realized? What are the input and output representations? What algorithm is used to realize the conversion between representations?
3. Hardware implementation: How can the representations and algorithms be physically implemented? What are the specific details of the computing structures?
Second, the objects operated on by a computer are discrete numbers or symbols, and the storage capacity of a computer is limited. Therefore, once the computational theory is available, the implementation of the algorithm must also be considered, and an appropriate representation must be selected for the entities to be processed. On the one hand, the input and output representations of the processing should be selected; on the other hand, the algorithm that accomplishes the transformation between the representations should be determined. Representation and algorithm restrict each other, so three points deserve attention: (1) in general, there are many optional representations; (2) the determination of the algorithm often depends on the selected representation; and (3) given a representation, there can be many algorithms that complete the task. Generally, the instructions and rules used for the processing are called the algorithm.
Finally, how the algorithm is physically implemented must also be considered. Especially with continuously increasing real-time requirements, the problem of dedicated hardware implementation is often raised. It should be noted that the determination of an algorithm usually depends on the characteristics of the hardware that physically implements it, and the same algorithm can be implemented through different technical approaches.
Summarizing the above discussion yields the content shown in Table 1.1.
There is a certain logical causal connection between the above three essential
factors, but there is no absolute dependence. In fact, there are many different options
for each essential factor. In many cases, the problems involved in explaining each
essential factor are basically irrelevant to the other two essential factors (each
essential factor is relatively independent), or one or two essential factors can be
used to explain certain visual phenomena. The above three essential factors are also
called by many people the three levels of visual information processing, and they
point out that different problems need to be explained at different levels. The
relationship among the three essential factors is often shown as in Fig. 1.2 (in fact, it is more appropriate to regard them as two levels), in which a forward arrow indicates a guiding role and a reverse arrow indicates serving as a basis. Note that once the computational theory is given, the representations and algorithms as well as the hardware implementations influence each other.
The 2.5D sketch map is actually an intrinsic image (see Sect. 1.3.2), because it shows the orientation of the surface elements of the object and thus gives information about the surface shape. Surface orientation is an intrinsic characteristic, and depth is also an intrinsic characteristic. The 2.5D sketch map can be converted into a (relative) depth map.
3. 3D representation
3D representation is a representation form centered on the object (i.e., it also includes the invisible parts of the object). It describes the shape and spatial organization of 3D objects in an object-centered coordinate system. Some basic 3D entity representations can be found in Chap. 9.
Now let us come back to the problem of visual computability. From the perspective of computer or information processing, the problem of visual computability can be divided into several steps. Between the steps is a certain form of representation, and each step consists of a calculation/processing method that connects two forms of representation (see Fig. 1.4).
According to the abovementioned three-level representation viewpoint, the problem to be solved by visual computability is how to start from the pixel representation of the original image and, through the primal sketch representation and the 2.5D sketch representation, finally obtain the 3D representation. This is summarized in Table 1.2.
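The data flow of this three-level representation can be sketched, very schematically, as a chain of transformations. The following minimal Python/NumPy sketch uses placeholder computations only: a gradient-based edge map stands in for the primal sketch, a dummy relative-depth map and surface normals stand in for the 2.5D sketch, and a simple back-projected point set stands in for the 3D representation. All function names are hypothetical, and none of this is an actual implementation of Marr's theory; the point is only the shape of the pipeline, in which each stage consumes one form of representation and produces a more abstract one.

import numpy as np

def primal_sketch(image):
    # Stand-in for the primal sketch: local intensity changes (edge strength).
    gy, gx = np.gradient(image.astype(float))
    return np.hypot(gx, gy)

def sketch_2_5d(edges):
    # Stand-in for the 2.5D sketch: viewer-centered relative depth and surface orientation.
    # A real system would use stereo, shading, motion, etc.; here only a dummy depth map.
    depth = 1.0 / (1.0 + edges)
    ny, nx = np.gradient(depth)
    normals = np.dstack([-nx, -ny, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return depth, normals

def representation_3d(depth):
    # Stand-in for the object-centered 3D representation: back-project pixels to a point set.
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.column_stack([xs.ravel(), ys.ravel(), depth.ravel()])

image = np.random.rand(64, 64)        # stand-in input image
edges = primal_sketch(image)          # image -> primal sketch
depth, normals = sketch_2_5d(edges)   # primal sketch -> 2.5D sketch
points = representation_3d(depth)     # 2.5D sketch -> 3D representation
print(edges.shape, depth.shape, points.shape)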
The idea of viewing the visual information system as a set of relatively independent functional modules is supported not only by evolutionary and epistemological arguments in computing but also by the fact that some functional modules can be separated by experimental methods.
In addition, psychological research shows that people obtain various kinds of intrinsic visual information by using a variety of cues or combinations of them. This suggests that the visual information system should include many modules, each of which obtains specific visual cues and performs certain processing, so that, according to the environment, different modules can be combined with different weight coefficients to complete the visual information understanding task. According to this
point of view, complex processing can be completed with some simple independent
functional modules, which can simplify research methods and reduce the difficulty
of specific implementation. This is also very important from an engineering
perspective.
During the image acquisition process, the information in the original scene will
undergo various changes, including the following:
1. When a 3D scene is projected as a 2D image, the depth of the objects and the information about their invisible parts are lost (a short numerical illustration follows this list).
2. Images are always obtained from a specific viewing direction. Different perspective images of the same scene will differ, and information will also be lost due to mutual occlusion of objects or mutual occlusion of their various parts.
3. Imaging projection merges the illumination, the object geometry and surface reflection characteristics, the camera characteristics, and the spatial relationships among the light source, the object, and the camera into a single image gray value, and these factors are difficult to disentangle.
4. Noise and distortion will inevitably be introduced in the imaging process.
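To illustrate point 1 above, the following short sketch (plain Python/NumPy, assuming a pinhole camera with unit focal length) applies the perspective projection x = fX/Z, y = fY/Z to two different 3D points lying on the same viewing ray; they land on the same image coordinates, so depth cannot be recovered from a single image without additional information.

import numpy as np

def project(point_3d, f=1.0):
    # Pinhole (perspective) projection: (X, Y, Z) -> (f*X/Z, f*Y/Z); Z itself is discarded.
    X, Y, Z = point_3d
    return np.array([f * X / Z, f * Y / Z])

p_near = np.array([1.0, 2.0, 4.0])   # a 3D point
p_far = 2.5 * p_near                 # a different point on the same viewing ray

print(project(p_near), project(p_far))   # both give the image point (0.25, 0.5): depth is lost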
For a problem, if its solution exists, is unique, and depends continuously on the initial data, then it is well-posed; if one or more of these conditions is not satisfied, it is ill-posed (under-determined). Owing to the information changes in the original scene listed above, solving the vision problem as the inverse problem of the optical imaging process becomes an ill-posed (ill-conditioned) problem, so it is very difficult to solve. In order to solve it, it is necessary to find the constraints of the relevant problem according to the general characteristics of the external objective world and to turn them into precise assumptions, so as to draw conclusive and testable conclusions. Constraints are generally obtained with the aid of prior knowledge. The use of constraints can transform ill-conditioned problems, because adding constraints to the calculation makes its meaning clear, thus enabling the problem to be solved.
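A minimal numerical sketch of this idea, assuming NumPy and using Tikhonov regularization as the added constraint (the 2x2 matrix, the noise level, and the regularization weight are arbitrary illustrative choices, not from the book): direct inversion of a nearly singular "imaging" operator amplifies even tiny noise, while adding a small-norm prior keeps the solution stable. The regularized solution cannot recover the component that the data barely constrain, but it no longer blows up, which is exactly the role the text assigns to constraints drawn from prior knowledge.

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0],
              [1.0, 1.001]])                     # nearly singular "imaging" operator
x_true = np.array([1.0, 2.0])
b = A @ x_true + 0.01 * rng.standard_normal(2)   # observation with small noise

x_naive = np.linalg.solve(A, b)                  # direct inversion amplifies the noise
lam = 1e-2                                       # weight of the prior constraint
x_reg = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ b)   # Tikhonov-regularized solution

print(np.linalg.norm(x_naive - x_true))   # large error: the unconstrained inverse problem is unstable
print(np.linalg.norm(x_reg - x_true))     # much smaller error: the constraint stabilizes the solution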
Marr’s visual computational theory is the first theory to have had a major impact on visual research. It has actively promoted research in this field and has played an important role in the research and development of image understanding and computer vision.
Marr’s theory also has its shortcomings, including four problems concerning the overall framework (see Fig. 1.6):
1. The input in the framework is passive: whatever image is input, the system processes that image.
2. The processing goal in the framework remains unchanged: it always recovers the position and shape of the objects in the scene.
3. The framework lacks, or does not pay enough attention to, the guiding role of high-level knowledge.
4. The information processing flow in the entire framework is basically bottom-up and one-way, with no feedback.
In response to the above problems, a series of improvement ideas have been proposed in recent years. Corresponding to the framework of Fig. 1.4, these improvements can be incorporated as new modules to obtain the framework of Fig. 1.5.
In the following, in conjunction with Fig. 1.5, the four aspects in which the original framework of Fig. 1.4 is improved are discussed in detail.
1. Human vision has initiative
People change their line of sight or viewing angle as needed to help observation and cognition. Active vision means that the vision system can determine the movement of the camera, according to the existing analysis results and the current requirements of the vision task, in order to obtain the corresponding image from an appropriate position and perspective. Human vision is also selective: one can stare (observing a region of interest at higher resolution), or one can turn a blind eye to certain parts of the scene. Selective vision means that the vision system can determine the focus of attention of the camera, based on the existing analysis results and the current requirements of the vision task, to obtain the corresponding image. Taking these factors into account, an image acquisition module is added to the improved framework and is considered together with the other modules in the framework. This module chooses the image acquisition mode according to the visual purpose.
The aforementioned active vision and selective vision can also be regarded as two forms of active vision: one is to move the camera to focus on a specific object of interest in the current environment; the other is to focus on a specific region of the image and dynamically interact with it to obtain an interpretation. Although the two forms look very similar, in the first form, the initiative is mainly reflected in the observation by the camera, while in the second form, the initiative is mainly reflected in the processing level and strategy. Both forms involve interaction, that is, vision has initiative; however, moving a camera to record and store the complete scene is a very expensive process, and the overall interpretation obtained in this way is not necessarily used. Collecting only the most useful parts of the scene, narrowing their scope, and enhancing their quality to obtain useful interpretations better mimics the process of human interpretation of scenes.
2. Human vision can be adjusted for different purposes
Purposive vision means that the vision system makes decisions based on the purpose of vision, for example, whether to fully recover information such as the position and shape of objects in the scene or merely to detect whether there is an object in the scene. It may give a simpler solution to vision problems. The key issue here is to determine the purpose of the task. Therefore, a visual purpose box (vision goal) is added to the improved framework. Qualitative analysis or quantitative analysis can be chosen according to the different purposes of understanding (in practice, there are quite a lot of occasions where only qualitative results are sufficient and no complex quantitative results are needed). However, qualitative analysis still lacks complete mathematical tools. The motivation of purposive vision is to clarify only the part of the information that is needed. For example, collision avoidance for autonomous vehicles does not require precise shape descriptions, and some qualitative results are sufficient. This line of thinking does not yet have a solid theoretical basis, but the study of biological vision systems provides many examples.
Qualitative vision, which is closely related to purposive vision, seeks a qualitative description of an object or scene. Its motivation is not to represent geometric information that is not needed for qualitative (nongeometric) tasks or decisions. The advantage of qualitative information is that it is less sensitive than quantitative information to various unwanted transformations (such as slightly changing viewing angles) or noise. Qualitative representations or invariants allow easy interpretation of observed events at different levels of complexity.
3. Humans have the ability to completely solve visual problems using only partial information obtained from images
Humans have this ability because of the implicit use of various kinds of knowledge. For example, obtaining object shape information with the aid of CAD design data (using an object model library) can help solve the difficulty of recovering the object shape from a single image. The use of high-level (domain) knowledge can solve the problem of insufficient low-level information, so a high-level knowledge frame (module) is added to the improved framework.
4. There is interaction between the sequential processing stages in human vision
The human visual process has a certain sequence in time and different levels in meaning, and there is a certain interaction between the various steps. Although the mechanism of this interaction is not yet fully understood, the important role of high-level knowledge and of feedback from later results to low-level processing has been widely recognized. From this perspective, feedback control flow is added to the improved framework, and the existing results and high-level knowledge are used to improve visual efficiency.
Marr’s theory emphasizes the reconstruction of the scene and uses the reconstruction
as the basis for understanding the scene.
According to Marr’s theory, the common core concept of different visual tasks/jobs
is representation, and the common processing goal is to recover the scene from
visual stimuli and incorporate it into the representation. If the vision system can
recover the characteristics of the scene, such as the reflective properties of the surface
of the object, the direction and speed of the object’s movement, the surface structure
of the object, etc., then there needs to be a representation that can help with various
recovery tasks. Under such a theory, different jobs should have the same conceptual
core, understanding process, and data structure.
In his theory, Marr showed how the representations that construct the visual world internally can be extracted from various cues. If the construction of such a unified representation is regarded as the ultimate goal of visual information processing and decision-making, then vision can be viewed as a reconstruction process that starts from the stimuli and proceeds by sequential acquisition and accumulation. This idea of reconstructing the scene first and then interpreting it can simplify the visual task, but it is not completely consistent with human visual function. In fact, reconstruction and interpretation are not always serial, and they need to be adjusted according to the visual purpose.
The above assumptions have also been challenged. Some of Marr’s contemporaries questioned treating the vision process as a hierarchical, single-pass data processing process. One meaningful contribution is that the single-path hypothesis has been shown to be untenable, based on long-standing research in psychophysics and neuropsychology. At the time Marr wrote Vision, there was little psychological research that took into account information about the primate’s higher-level vision, and little was known about the anatomy and functional organization of higher-level visual areas. With the continuous acquisition of new data and the deepening of the understanding of the entire visual process, people have found that the visual process is less and less like a single-channel processing process [7].
Fundamentally, a correct representation of the objective scene should be available for any visual task. If this were not the case, then the visual world itself (which is an external display of internal representations) could not support visual behavior. Nonetheless, further research has revealed that representation by reconstruction is in many respects (shown below) a poor explanation for understanding vision, or at least has a series of problems [7].
Let us first look at the implications of reconstruction for identification or classification. Even if the visual world could be rebuilt internally, this by itself would not make the rest of the visual system unnecessary. In fact, acquiring an image, building a 3D model, or even listing the locations of important stimulus features cannot guarantee recognition or classification. Of all the possible methods for interpreting a scene, the method involving reconstruction takes the most roundabout route, since the reconstruction itself does not directly contribute to the interpretation.
Secondly, it is also difficult to achieve the representation only by reconstruction
from the original image. From a computer vision point of view, it is very difficult to
recover scene representations from original images; there are now many findings in
biological vision that support other representation theories.
Finally, the reconstruction theory is also problematic conceptually. The source of the problem lies in the theoretical claim that reconstruction can be applied to any representation task. Leaving aside the question of whether reconstruction is achievable in concrete terms, one might first ask whether it is worthwhile to seek a universally unified representation. Since the best representation should be the one best suited to the task, a universally uniform representation is not necessarily required. In fact, according to the theory of information processing, the importance of choosing an appropriate and correct representation for a given computational problem is self-evident; Marr himself also pointed out this importance.
Some studies and experiments in recent years have shown that the interpretation of a scene is not necessarily based on a 3D restoration (reconstruction) of the scene, or rather, it is not necessarily based on a complete 3D reconstruction of the scene.
Since there are a series of problems in realizing representation by reconstruction, other forms of representation have also been studied and have received attention. For example, there is a representation that was first proposed by Locke in An Essay Concerning Human Understanding, which is now generally discussed under the semantics of mental representations [7]. Locke’s proposal treats representation in a natural and predictable way. According to this view, a sufficiently reliable feature detector constitutes a primitive representation of the existence of a certain feature in the visual world. The representation of an entire object or scene can then be constructed from these primitives (if there are enough of them).
In the theory of natural computing, the original concept of a feature hierarchy was developed under the influence of the discovery of “insect detectors” in the frog retina. Recent computer vision and computational neuroscience research results suggest that modifications of the original feature-hierarchy representation hypothesis can serve as an alternative to the reconstruction theory. Today’s feature detection differs from traditional feature detection in two ways: one is that a set of feature detectors can have much greater expressive power than any single one of them; the other is that many theoretical researchers have realized that “symbols” are not the only elements that can combine features.
Consider the representation of spatial resolution as an example. In a typical situation, the observer can see two straight line segments that are very close to each other (their separation may even be smaller than the distance between the photoreceptors in the fovea). An early hypothesis was that, at some stage of cortical processing, the visual input is reconstructed with sub-pixel accuracy, making it possible to obtain distances in the scene that are smaller than a pixel. Proponents of reconstruction theory did not believe that feature detectors could be used to build visual functions, and Marr believed that “the world is so complex that it is impossible to analyze it with feature detectors.” This view is now challenged. Taking the
Due to historical factors, Marr did not study how to use mathematical methods to strictly describe visual information. Although he studied early vision rather fully, he basically did not discuss the representation and utilization of visual knowledge, or recognition based on visual knowledge. In recent years, there have been many
3. Find the spatial correspondence by solving for the unknown observation points and model parameters, so that the projection of the 3D model directly matches the image features.
In the above process, there is no need to measure the 3D object surface (no reconstruction); information about the surface is computed using perceptual principles. This theoretical framework shows high stability in handling occlusion and incomplete data. It introduces feedback and emphasizes the guiding role of high-level knowledge in vision. However, practice has shown that on some occasions, such as judging the size of an object or estimating its distance, recognition alone is not enough, and 3D reconstruction must be carried out. In fact, 3D reconstruction still has a very wide range of applications. For example, in the virtual human plan, a large amount of human body information can be obtained by 3D reconstruction from a series of slices. As another example, the 3D distribution of cells can be obtained by 3D reconstruction of tissue slices, which provides very good assistance for localization.
The active vision theoretical framework is mainly based on the initiative of human vision (or, more generally, biological vision). Human vision has two special mechanisms:
1. Selective attention
Not everything the human eye sees is of concern, and useful visual information is usually distributed only over a certain spatial range and time period. Therefore, human vision does not treat all parts of the scene equally but, according to need, pays special attention to some parts while giving others only general observation or even ignoring them. Based on this feature of the selective attention mechanism, multi-azimuth and multi-resolution sampling can be performed when acquiring images, and the information relevant to a specific task can be selected or retained.
2. Gaze control
People can adjust their eyeballs so as to “look” at different positions in the environment at different times, according to their needs, to obtain useful information; this is gaze control. Correspondingly, the camera parameters can be adjusted so that the camera always obtains visual information suitable for the specific task. Gaze control can be divided into gaze stabilization and gaze change. The former is a localization process, such as object tracking; the latter is similar to the rotation of the eyeball, which controls the next fixation point according to the needs of the specific task.
The theoretical framework of active vision proposed according to these human visual mechanisms is shown in Fig. 1.7.
The active vision theoretical framework emphasizes that the visual system should be task-oriented and purpose-oriented and that it should have the ability to perceive actively. According to the existing analysis results and the current requirements of the vision task, an active vision system can control the motion of the camera through the mechanism of actively controlling the camera parameters and can coordinate the relationship between the processing task and the external signals. These parameters include the camera position, orientation, focal length, aperture, etc. In addition, active vision also incorporates the ability of “attention”: by changing the camera parameters or processing the acquired data, the “points of attention” can be controlled to achieve selective perception in space, time, resolution, and so on.
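The control structure just described can be pictured, purely schematically, as a perception-action loop; the camera interface, the analysis step, and the parameter-update rule in the following Python sketch are hypothetical placeholders rather than any real API or actual system.

from dataclasses import dataclass

@dataclass
class CameraParams:
    pan: float = 0.0      # orientation (degrees)
    tilt: float = 0.0
    focal: float = 35.0   # focal length (mm), i.e., zoom

def acquire(params):
    # Placeholder for grabbing an image with the current camera parameters.
    return {"params": params}

def analyze(image, task):
    # Placeholder analysis: returns the image location most relevant to the task
    # and a confidence that the task is already satisfied.
    return {"attention_xy": (0.2, -0.1), "confidence": 0.4}

def update_camera(params, result):
    # Steer the gaze toward the point of attention and zoom in on it.
    dx, dy = result["attention_xy"]
    return CameraParams(pan=params.pan + 5.0 * dx,
                        tilt=params.tilt + 5.0 * dy,
                        focal=min(params.focal * 1.2, 200.0))

params, task = CameraParams(), "find the cup"
for step in range(3):                       # active perception loop
    image = acquire(params)
    result = analyze(image, task)
    if result["confidence"] > 0.9:          # purpose fulfilled: stop early
        break
    params = update_camera(params, result)  # selective attention + gaze control
print(params)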
Similar to the knowledge-based theoretical framework, the theoretical framework of active vision also attaches great importance to knowledge, holding that knowledge is a high-level capability that guides visual activities and should be used to complete visual tasks. However, the current theoretical framework of active vision lacks feedback. On the one hand, such a structure without feedback does not conform to the biological vision system; on the other hand, it often leads to poor accuracy of the results, strong influence of noise, and high computational complexity. At the same time, it also lacks adaptability to applications and environments.
Computer vision is closely related to human vision: it tries to achieve the functions of human vision while also gaining many insights from human vision. Computer vision needs to grasp the objective information of the scene through the visual stimulus signals it receives, and there is a process from perception to cognition. Cognitive science is the result of cross-disciplinary cooperation among psychology, linguistics, neuroscience, computer science, anthropology, philosophy, artificial intelligence, and other disciplines. Its goal is to explore the essence and mechanism of human
1.2.5.2 Connectionism
The theory of embodied cognition holds that cognition cannot be separated from the body and largely depends on and originates from the body. Human cognition is closely related to the structure of the human body, the structure of the nervous system, the ways in which the sensory and motor systems operate, and so on; these factors determine a person’s style of thinking and way of understanding the world. The theory of embodied cognition holds that cognition is cognition of the body, which gives the body a pivotal role and decisive significance in the shaping of cognition and raises the importance of the body and its activities in the interpretation of cognition [14].
1.3 Introduction to Image Engineering
In order to realize the vision function, computer vision needs to use a series of
technologies. Among them, the most direct and closely related is image technology.
Fig. 1.8 The three layers of image engineering: from pixels (lower abstraction level, larger data amount) through objects to symbols (higher abstraction level, smaller data amount)
symbols abstracted from image descriptions, which have many similarities with
human thinking and reasoning.
It can be seen from Fig. 1.8 that among the three layers, the amount of data
gradually decreases as the abstraction level increases. Specifically, the raw image
data is gradually transformed after a series of processing, becoming more organized
and more abstractly expressed. In this process, semantic information is continuously
introduced, the operation objects also change, and the amount of data is gradually
compressed. In addition, high-level operations have a guiding role for low-level
operations, which can improve the efficiency of low-level operations.
From the perspective of comparison and combination with computer vision, the main components of image engineering can also be represented by the overall framework shown in Fig. 1.9, where the dashed box is the basic module of image engineering. Various image techniques are used here to help people obtain information from the scene. The first step is to acquire images from the scene in various ways. The subsequent low-level processing of the image mainly improves the visual effect of the image, or reduces the amount of image data while maintaining the visual effect, and the processing results are mainly intended for users to view. The middle-level analysis of the image mainly detects, extracts, and measures the objects of interest in the image; the results of the analysis provide the user with data describing the characteristics and properties of the image objects. Finally, the high-level understanding of the image aims to understand and grasp the content of the image and to explain the original objective scene by studying the nature of the objects in the image and the relationships between them; the results of the understanding provide the user with objective-world information that can guide and plan actions. These image technologies, from low level to high level, are strongly supported by new theories, new tools, and new technologies, including artificial intelligence, neural networks, genetic algorithms, fuzzy logic, image algebra,
machine learning, and deep learning. Appropriate strategies are also required to
control these tasks.
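The division of labor among the three levels can be sketched with toy type signatures (plain Python/NumPy; the contrast stretching, thresholded region measurement, and rule-based interpretation below are deliberately crude placeholders, not methods from this book): processing maps an image to an image, analysis maps an image to object data, and understanding maps object data to a symbolic statement about the scene.

import numpy as np

def process(image):
    # Image processing: image in, image out (here, simple contrast stretching).
    lo, hi = image.min(), image.max()
    return (image - lo) / (hi - lo + 1e-9)

def analyze(image, threshold=0.5):
    # Image analysis: image in, data out (here, crude "object" measurements).
    mask = image > threshold
    area = int(mask.sum())
    ys, xs = np.nonzero(mask)
    centroid = (float(xs.mean()), float(ys.mean())) if area else None
    return {"area": area, "centroid": centroid}

def understand(object_data):
    # Image understanding: data in, symbolic interpretation out.
    if object_data["area"] > 100:
        return "a large bright object is present in the scene"
    return "no significant object found"

image = np.random.rand(64, 64)        # stand-in for an acquired image
print(understand(analyze(process(image))))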
Incidentally, computer vision technology has evolved over many years, and there are many types of existing techniques. Although there are some classification methods for these techniques, no stable and consistent classification has emerged to date. For example, some people divide computer vision into low-level vision, middle-level vision, and 3D vision, while others divide it into early vision (further divided into single-image and multi-image), middle-level vision, and high-level vision (geometric methods). Even the classification schemes adopted by the same researcher in different periods are not completely consistent; for example, one researcher once divided computer vision into early vision (further divided into single-image and multi-image), middle-level vision, and high-level vision (further divided into geometric methods and probabilistic and inference methods). These different schemes all use three layers, somewhat similar to the stable and consistent three levels of image engineering, although they do not correspond exactly.
Among the three levels of image engineering, the image understanding level is
most closely related to the current research and application of computer vision,
which has many historical origins. In building an image/visual information system
and using computers to assist humans in completing various visual tasks, both image
understanding and computer vision require theories in projection geometry, proba
bility theory and stochastic processes, and artificial intelligence. For example, they
all rely on two types of intelligent activities: (1) perception, such as perceiving the
distance, orientation, shape, movement speed, interrelationship, etc. of the visible
parts of the scene, and (2) thinking, such as analyzing the behavior of objects
according to the scene structure, inferring the development and changes of the
scene, and deciding and planning main actions. It can be said that image under
standing (based on image processing and analysis) has the same goal as computer
vision, and both use engineering techniques to realize the understanding and inter
pretation of the scene through the images obtained from the objective scene. In fact,
the terms image understanding and computer vision are often used interchangeably.
Essentially, they are interrelated; and in many cases, their coverages and contents
overlap, with no absolute boundaries in terms of concept or practicality. In many
contexts and situations, they have different focuses but are often complementary, so
it is more appropriate to think of them as different terms used by people of different
professions and backgrounds.
Simultaneously with the proposal of image engineering (IE), a review series devoted to the statistical classification of the image engineering literature was started, which has now been running for 28 years [22]. This review series selects all literature related to image engineering in 15 journals for analysis; the main categories, subcategories, and numbers of papers are summarized in Table 1.3.
Deep learning uses cascaded multilayer nonlinear processing units for feature extraction and transformation, realizing multilevel feature representation and concept abstraction learning [23]. Deep learning still belongs to the category of machine learning, but compared with traditional machine learning methods, it avoids the manual design of features that traditional methods require and shows clear performance advantages when big data are available. Compared with traditional machine learning methods, deep learning is also more general and requires less prior knowledge and annotation data. However, the theoretical framework of deep learning has not been fully established; at present, there is still no powerful and complete theoretical explanation of how deep neural networks operate and why they perform so well.
The current mainstream deep learning methods are based on neural networks (NN). Neural networks can learn pattern features directly from training data, without features being designed and extracted first, and can easily be trained end to end. The study of neural networks has a long history. In 1989, the universal approximation theorem of the multilayer perceptron (MLP) was proven, and one of the basic deep learning models, the convolutional neural network (CNN), was used for handwritten digit recognition. The concept of deep learning was formally proposed in 2006 and has led to extensive research on and application of deep neural network technology.
Table 1.3 Main categories and quantities of image engineering research and application literature in the past 18 years

Image processing (4050 in total):
- Image acquisition (including various imaging methods, image capturing, representation and storage, camera calibration, etc.): 832
- Image reconstruction (including image reconstruction from projection, indirect imaging, etc.): 375
- Image enhancement/image restoration (including transformation, filtering, restoration, repair, replacement, correction, visual quality evaluation, etc.): 1313
- Image/video coding and compression (including algorithm research, implementation and improvement of related international standards, etc.): 505
- Image information security (including digital watermarking, information hiding, image authentication and forensics, etc.): 705
- Image multi-resolution processing (including super-resolution reconstruction, image decomposition and interpolation, resolution conversion, etc.): 320

Image analysis (4820 in total):
- Image segmentation and primitive detection (including edges, corners, control points, points of interest, etc.): 1564
- Object representation, object description, feature measurement (including binary image morphology analysis, etc.): 150
- Object feature extraction and analysis (including color, texture, shape, space, structure, motion, saliency, attributes, etc.): 466
- Object detection and object recognition (including object 2D positioning, tracking, extraction, identification and classification, etc.): 1474
- Human body biological feature extraction and verification (including detection, positioning and recognition of human body, face and organs, etc.): 1166

Image understanding (2213 in total):
- Image matching and fusion (including registration of sequence and stereo images, mosaic, etc.): 1070
- Scene recovering (including 3D scene representation, modeling, reconstruction, etc.): 256
- Image perception and interpretation (including semantic description, scene model, machine learning, cognitive reasoning, etc.): 123
- Content-based image/video retrieval (including corresponding labeling, classification, etc.): 450
- Spatial-temporal techniques (including high-dimensional motion analysis, object 3D posture detection, spatial-temporal tracking, behavior judgment and behavior understanding, etc.): 314

Technical applications (3164 in total):
- Hardware, system devices, and fast/parallel algorithms: 348
- Communication, video transmission and broadcasting (including TV, network, radio, etc.): 244
- Documents and texts (including words, numbers, symbols, etc.): 163
- Biology and medicine (physiology, hygiene, health, etc.): 590
- Remote sensing, radar, sonar, surveying and mapping, etc.: 1279
- Other (no directly/explicitly included technology application): 540
Fig. 1.10 Schematic diagram of the basic structure of a typical convolutional neural network
The convolutional neural network is closely related to the BP (back-propagation) network. The main difference in the input is that the input of the BP network is a 1D vector, while that of the convolutional neural network is a 2D matrix. The convolutional neural network consists of a layer-by-layer structure, mainly including the input layer, convolution layer, pooling layer, output layer, fully connected layer, batch normalization layer, etc. In addition, convolutional neural networks also use activation functions (excitation functions), cost functions, etc.
Figure 1.10 shows a part of the basic structure of a typical convolutional neural
network.
The convolutional neural network has four similarities with the general fully
connected neural network (multilayer perceptron):
1. Both construct the sum of products.
2. Both superimpose a bias (see below).
3. Both let the result go through an activation function (see below).
4. Both use the activation function value as a single input to the next layer.
Convolutional neural networks are different from general fully connected neural
networks in four ways:
1. The input of the convolutional neural network is a 2D image, and the input of the
fully connected neural network is a 1D vector.
2. Convolutional neural networks can directly learn 2D features from raw image data, while fully connected neural networks cannot.
3. In a fully connected neural network, the output of every neuron in a layer is provided directly to each neuron in the next layer, while in a convolutional neural network the outputs of the neurons in the previous layer are first combined by convolution over their spatial neighborhoods (receptive fields) into a single value, which is then provided to each neuron in the next layer.
4. In a convolutional neural network, the 2D image input to the next layer is first
subsampled to reduce sensitivity to translation.
In the following, the convolutional layers, pooling layers, activation functions,
and cost functions are explained in more detail.
The convolutional layer mainly implements the convolution operation, and the
convolutional neural network is named after the convolution operation. At each
location in the image, add a bias value to the convolution value (sum of products) at
that location, and convert their sum to a single value through an activation function.
This value is used as the input to the next layer at that location. If the above operation
is performed on all positions of the input image, a set of 2D values is obtained, which
can be called a feature map (because the convolution extracts features). Different convolution layers have different numbers of convolution kernels. A convolution kernel is actually a numerical matrix; the commonly used convolution kernel sizes are 1 x 1, 3 x 3, 5 x 5, 7 x 7, etc. Each convolution kernel has a constant bias, and the
elements in all matrices plus the bias form the weight of the convolution layer, and
the weight participates in the iterative update of the network.
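A minimal NumPy sketch of the operation just described: a single kernel is slid over the image, the sum of products at each location is added to the shared bias, and an activation converts it to one value of the feature map. The 3 x 3 kernel, zero bias, and ReLU activation are illustrative assumptions of this example.

```python
import numpy as np

def conv2d_feature_map(image, kernel, bias=0.0):
    """Valid convolution: at each location take the sum of products with the
    shared kernel (weight sharing), add the single shared bias, then apply
    an activation to obtain one feature map."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            receptive_field = image[i:i + kH, j:j + kW]   # local receptive field
            out[i, j] = np.sum(receptive_field * kernel) + bias
    return np.maximum(out, 0.0)                           # ReLU activation

image = np.random.rand(8, 8)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)  # a vertical-edge kernel
feature_map = conv2d_feature_map(image, kernel)
print(feature_map.shape)   # (6, 6)
```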
Two important concepts in convolution operations are local receptive field and
weight sharing (also called parameter sharing). The size of the local receptive field is
the scope of the convolution kernel convolution operation, and each convolution
operation only needs to consider the information within this range. Weight sharing means that the values of a convolution kernel remain unchanged while it is slid over the image during a convolution operation; the weights are only updated from one training iteration to the next. In other words, the same
weights and a single bias are used to generate the values of the feature map
corresponding to all locations of the receptive field of the input image. This allows
the same feature to be detected at all locations in the image. Each convolution kernel
only extracts a certain feature, so the values in different convolution kernels are different.
The operation after convolution and activation is pooling. The pooling layer mainly
implements down-sampling and dimension reduction operations, so it is also called
down-sampling layer or subsampling layer. The design of the pooling layer is based
on a model of the mammalian visual cortex. The model considers the visual cortex to
include two types of cells: simple cells and complex cells. Simple cells perform
feature extraction, while complex cells combine (merge) these features into a more
meaningful entirety. Pooling layers generally do not have weight updates.
The role of pooling includes reducing the spatial size of the data volume (reduc
ing the spatial resolution to obtain translation invariance), reducing the number of
parameters in the network and the amount of data to be processed, and thereby
reducing the overhead of computing resources and effectively controlling
overfitting. Pooled feature maps can be thought of as the result of subsampling (for each feature map there is a corresponding pooled feature map). In other words, pooled feature maps are feature maps with reduced spatial resolution. Pooling first decomposes the feature map into a set of small regions (neighborhoods) and then replaces all elements in each neighborhood with a single value. These regions are called pooling neighborhoods, and here it can be assumed that the pooling neighborhoods are contiguous (i.e., nonoverlapping).
There are multiple ways to compute pooled values, and they are all collectively
referred to as pooling methods. Common pooling methods include the following:
1. Average pooling, also known as mean pooling, selects the average of all values
in each neighborhood.
2. Maximum pooling, also called maximum value pooling, selects the maximum
value of all values in each neighborhood.
3. L2 pooling selects the square root value of the sum of squares of all values in each
neighborhood.
4. Random value pooling selects a value at random from each neighborhood, according to criteria based on the corresponding values in that neighborhood.
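The following sketch implements pooling over the contiguous (nonoverlapping) 2 x 2 neighborhoods described above, with maximum and average pooling as two of the listed options; the 2 x 2 neighborhood size is an assumption of the example.

```python
import numpy as np

def pool2x2(feature_map, mode="max"):
    """Decompose the feature map into nonoverlapping 2x2 neighborhoods and
    replace each neighborhood with a single value (max or average pooling)."""
    H, W = feature_map.shape
    blocks = feature_map[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))      # average (mean) pooling

fm = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(fm, "max"))    # pooled feature map with reduced spatial resolution
print(pool2x2(fm, "mean"))
```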
The activation function is also called the excitation function. Its function is to
selectively activate or suppress the features of the neuron nodes, which can enhance
the activation of useful object features and weaken the useless background features,
so that the convolutional neural network can solve nonlinear problems. If no nonlinear activation function is added, the network model is equivalent to a linear mapping, which limits its expressive ability. Therefore, a nonlinear activation function is necessary so that the network model has the ability to perform nonlinear mappings of the feature space.
The activation function should have some basic properties: (1) monotonicity, which ensures that a single-layer network model corresponds to a convex optimization problem, and (2) differentiability, which allows error gradients to be used to fine-tune the updates of the model weights. Commonly used activation functions include the following:
1. Sigmoid function (as shown in Fig. 1.11a)

$$h(z) = \frac{1}{1 + e^{-z}} \qquad (1.1)$$

Its derivative is

$$h'(z) = h(z)[1 - h(z)] \qquad (1.2)$$

2. Hyperbolic tangent function (as shown in Fig. 1.11b)

$$h(z) = \tanh(z) = \frac{1 - e^{-2z}}{1 + e^{-2z}} \qquad (1.3)$$

Its derivative is

$$h'(z) = 1 - h^2(z) \qquad (1.4)$$

The hyperbolic tangent function is similar in shape to the sigmoid function, but the hyperbolic tangent function is symmetric about the origin.
3. Rectifier function (as shown in Fig. 1.11c)

Because the unit it uses is also called a rectified linear unit (ReLU), the corresponding activation function is often called the ReLU activation function:

$$h(z) = \max(0, z) \qquad (1.5)$$

Its derivative is

$$h'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases} \qquad (1.6)$$
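A small NumPy sketch of the three activation functions and their derivatives, following the forms written in Eqs. (1.1)-(1.6) above.

```python
import numpy as np

def sigmoid(z):                       # Eq. (1.1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):                 # Eq. (1.2): h'(z) = h(z)(1 - h(z))
    h = sigmoid(z)
    return h * (1.0 - h)

def tanh_prime(z):                    # Eq. (1.4): h'(z) = 1 - h(z)^2
    return 1.0 - np.tanh(z) ** 2

def relu(z):                          # Eq. (1.5): h(z) = max(0, z)
    return np.maximum(0.0, z)

def relu_prime(z):                    # Eq. (1.6): 1 for z > 0, 0 for z < 0
    return (z > 0).astype(float)

z = np.linspace(-3, 3, 7)
print(sigmoid(z), np.tanh(z), relu(z), sep="\n")
```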
The loss function is also called the cost function. In machine learning tasks, every algorithm has an objective function, and the principle of the algorithm is to optimize this objective function, driving it toward its maximum or minimum value. When the objective function is minimized (possibly under constraints), it is called the loss function. In the convolutional neural network, the loss function is used to drive the network training, so that the network weights are updated.
The most commonly used loss function in convolutional neural network model training is the softmax loss function, that is, the cross-entropy loss function applied to the softmax output. Softmax is a commonly used classifier whose expression is
$$h(\mathbf{x}_i) = \frac{\exp(\mathbf{w}^T\mathbf{x}_i)}{\sum_{j=1}^{n}\exp(\mathbf{w}_j^T\mathbf{x}_i)} \qquad (1.7)$$
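A sketch of the softmax classifier of Eq. (1.7) together with the cross-entropy (softmax) loss described above, for a single sample; the max-shift used for numerical stability is an implementation detail added here, not part of the text.

```python
import numpy as np

def softmax(scores):
    """scores: vector of class scores w_j^T x_i for one sample."""
    shifted = scores - scores.max()          # numerical stability (does not change the result)
    e = np.exp(shifted)
    return e / e.sum()

def softmax_cross_entropy(scores, true_class):
    """Cross-entropy loss of the softmax output for one labeled sample."""
    p = softmax(scores)
    return -np.log(p[true_class])

scores = np.array([2.0, 1.0, 0.1])           # class scores for one sample
print(softmax(scores))                       # probabilities summing to 1
print(softmax_cross_entropy(scores, true_class=0))
```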
Since 2012, deep learning algorithms have achieved excellent results in image
classification, video classification, object detection, object tracking, semantic seg
mentation, depth estimation, image/video generation, and other tasks. Deep learning has gradually replaced traditional statistical learning and become the mainstream framework and methodology of computer vision.
1. Image classification
The goal of image classification is to classify an image into a given category.
Some classical models in image classification have also become the backbone networks for other tasks such as detection and segmentation, from AlexNet to VGG, GoogLeNet, ResNet, and DenseNet. Neural network models have grown deeper and deeper, from several layers to thousands of layers.
2. Video classification
An early and effective deep learning method is the two-stream convolutional network, which combines appearance features and motion features. The two-stream convolutional network is based on 2D convolution kernels. In recent
years, many scholars have proposed some 3D CNN to realize video classification
by extending 2D convolution kernel to 3D or combining 2D and 3D, including
I3D, C3D, P3D, etc. For video motion detection, the boundary-sensitive network (BSN), attention clustering networks, generalized compact non-local networks, and so on have been proposed.
3. Object detection
Object detection is to identify objects in an image and determine boundaries
and labels for each object. The commonly used deep learning-based models
mainly include the one-stage model and the two-stage model. The two-stage
method is based on image classification, that is, the potential candidate regions of
the object are first determined, and then the classification method is used to
identify them. A typical two-stage method is the R-CNN family. From R-CNN
to Fast R-CNN, R-FCN, and Faster R-CNN, the detection efficiency continues to
improve. The one-stage method is based on regression; it performs detection in a single pass with shared features and can greatly improve the speed while ensuring a certain accuracy. Some relatively important methods include the YOLO series, the SSD series, and, more recently, the deeply supervised object detector (DSOD) method and the receptive field block (RFB) network.
4. Object tracking
Object tracking is to track one or more specific objects of interest in a specific
scene. Multi-object tracking is to track multiple object trajectories of interest in
video images and extract their motion trajectory information through temporal
correlation. Object tracking methods can be divided into two categories: gener
ative methods and discriminative methods. Generative methods mainly use a
generative model to describe the apparent features of the object and then search
for candidate objects to minimize the reconstruction error. The discriminative
method distinguishes the object from the background by training the classifier, so
it is also called the tracking-by-detection method. Its performance is more stable,
and it has gradually become the main research method in the field of object
tracking. Relatively popular methods in recent years include a series of tracking methods based on Siamese networks.
5. Semantic segmentation
Semantic segmentation needs to label the semantic category of each pixel in
the image. A typical deep learning approach is to use a fully convolutional
network (FCN). After inputting an image, the FCN model directly obtains the
category of each pixel at the output, so as to achieve end-to-end image semantic
segmentation. Further improvements include U-Net, dilated convolution,
DeepLab series, pyramid scene parsing network (PSPNet), etc.
6. Depth estimation
Monocular depth estimation usually uses the image data of a single view as input and directly predicts the depth value corresponding to each pixel in the image. The baseline deep learning model for monocular depth estimation is the CNN. In order to overcome the difficulty that monocular depth
estimation usually requires a large amount of depth-annotated data and the high
cost of such data acquisition, a single view stereo matching (SVSM) model is
proposed, which can achieve good results with only a small amount of depth-
annotated data.
7. Image/video generation
Image/video generation is closer to computer graphics technology. The input
is the abstract attribute of the image, while the output is the image distribution
corresponding to the attribute. With the development of deep learning, the
automatic generation of image/video, the expansion of database, and the com
pletion of image information have been paid attention to. Two popular depth
generation models are variational auto-encoder (VAE) and generative adver
sarial network (GAN). As an unsupervised deep learning method, GAN can
learn by playing games between two neural networks, which can alleviate the
problem of data sparsity to a certain extent. Based on GAN, from Pix2Pix that
needs to prepare paired data, to CycleGAN that only needs unpaired data, and to
StarGAN that can span multiple domains, such methods have gradually come closer to practical use (e.g., AI news anchors).
This book has 11 chapters, divided into 4 levels, and its organizational structure is
shown in Fig. 1.12. Among them, four levels are given on the left: background
knowledge, image acquisition, scene recovery, and scene interpretation; on the right
are the chapters and their titles corresponding to these four levels, which are
described in detail below (see [24, 25] for the explanation of relevant names).
1. Background knowledge: providing an overview of computer vision and related
background information.
Chapter 1 briefly introduces the basics of computer vision, image engineering,
and deep learning and discusses the theory and framework of computer vision.
Chapter 4 introduces the point cloud data and its processing. Point cloud data
sources, preprocessing (including registration of point cloud data with the help
of biomimetic optimization), laser point cloud modeling, texture mapping, local
feature description, and deep learning methods in scene understanding are
discussed.
3. Scene restoration: discussing various technical principles for reconstructing
objective 3D scenes from images.
Chapter 5 introduces binocular stereovision technology, mainly region-based
matching technology and feature-based matching technology. In addition, an
error correction algorithm, an overview of a recent stereo matching network
based on deep learning technology, and a specific method are also presented.
Chapter 6 introduces the multi-ocular stereovision technology, first discusses
the two specific modes of horizontal multi-ocular and orthogonal tri-ocular, and
then generalizes to the general multi-ocular situation. Finally, two systems, one consisting of five cameras and the other of a single camera combined with multiple mirrors, are analyzed separately.
Chapter 7 introduces monocular multi-image scene recovery and discusses
the principles and methods of shape recovery from illumination and shape
recovery from motion, respectively. In addition to an overview of recent
research progress in photometric stereo technology, the corresponding technol
ogies using GAN and CNN are also introduced in detail.
Chapter 8 introduces the scene restoration from monocular and single image
and discusses the principles and methods of restoring shape from shading,
texture, and focal length, respectively. Recent work on shape recovery from shading under mixed surface perspective projection using different models is also presented.
4. Scene interpretation: analyzing how to realize the understanding and judgment
of the scene with the help of the reconstructed 3D scene.
Chapter 9 introduces generalized matching, mainly object matching, relation
matching, matching with the help of graph theory, and line drawing labeling. It
also analyzes some specific matching techniques, the connection between
matching and registration, and provides a recent overview of multimodal
image matching.
Chapter 10 introduces simultaneous localization and mapping (SLAM). The
composition, process, and modules of laser SLAM and visual SLAM, their
comparison and fusion, and their combination with deep learning and multi-agent systems are discussed. Some typical algorithms are also analyzed in
detail.
Chapter 11 introduces the understanding of spatiotemporal behavior. On the
basis of its concept, definition, development, and hierarchical research, this
chapter focuses on the discussions of joint modeling of agent and action, activity
and behavior recognition, and automatic activity analysis, especially the detec
tion of abnormal events.
References
1. Szeliski R (2022) Computer Vision: Algorithms and Applications, 2nd ed. Springer Nature, Switzerland.
2. Finkel L H, Sajda P (1994) Constructing Visual Perception. American Scientist, 82(3): 224-237.
3. Shapiro L, Stockman G (2001) Computer Vision. Prentice Hall, London.
4. Zhang Y-J (2017) Image Engineering, Vol.1: Image Processing. De Gruyter, Germany.
5. Zhang Y-J (2017) Image Engineering, Vol.2: Image Analysis. De Gruyter, Germany.
6. Zhang Y-J (2017) Image Engineering, Vol.3: Image Understanding. De Gruyter, Germany.
7. Edelman S (1999) Representation and Recognition in Vision. MIT Press, Cambridge.
8. Grossberg S, Mingolla E (1987) Neural dynamics of surface perception: boundary webs,
illuminants and shape-from-shading. Computer Vision, Graphics and Image Processing,
37(1): 116-165.
9. Kuvich G (2004) Active vision and image/video understanding systems for intelligent
manufacturing. SPIE 5605: 74-86.
10. Lowe D G (1987) Three-dimensional object recognition from single two-dimensional images.
Artificial Intelligence, 31(3): 355-395.
11. Lowe D G (1988) Four steps towards general-purpose robot vision. Proceedings of the 4th
International Symposium on Robotics Research, 221-228.
12. Ye H S (2017) Embodied Cognition: Principles and Applications. Commercial Press, Beijing.
13. Osbeck L M (2009) Transformations in cognitive science: Implementations and issues posed.
Journal of Theoretical and Philosophical Psychology, 29(1): 16-33.
14. Chen W, Yin R, Zhang J (2021) Embodied Cognition in Psychology: A Dialogue among Brain,
Body and Mind. Science Press, Beijing.
15. Alban M W, Kelley C M (2013) Embodiment meets metamemory: Weight as a cue for
metacognitive judgements. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 39(8): 1-7.
16. Varela F, Thompson E, Rosch E (1991) The Embodied Mind: Cognitive Science and Human Experience. MIT Press, Cambridge.
17. Li H W, Xiao J Y (2006) The embodied view of cognition. Journal of Dialectics of Nature, (1): 29-34.
18. Meng F K (2023) Embodied intelligence: A new stage of intelligent evolution. China Industry
and Information Technology, (7): 6-10.
19. Bonsignorio F (2023) Editorial: Novel methods in embodied and enactive AI and cognition.
Front. Neurorobotics, 17:1162568 (DOI: 10.3389/fnbot.2023.1162568).
20. Zhang Y-J (1996) Image engineering in China: 1995. Journal of Image and Graphics, 1(1):
78-83.
21. Zhang Y-J (1996) Image engineering and bibliography in China. Technical Digest of International Symposium on Information Science and Technology, 158-160.
22. Zhang Y-J (2023) Image engineering in China: 2022. Journal of Image and Graphics, 28(4):
879-892.
23. Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press, Massachusetts.
24. Zhang Y-J (2021) English-Chinese Dictionary of Image Engineering. Tsinghua University
Press, Beijing.
25. Zhang Y-J (2021) Handbook of Image Engineering. Springer Nature, Singapore.
Chapter 2
Camera Imaging and Calibration
Photometry is the study of the strength of (visible) light; the more general radiometry is the study of the strength of (electromagnetic) radiation. The brightness of a scene element is related to the intensity of its light radiation.
For a luminous scene element (a light source), its brightness is proportional to its own radiated power, or amount of light radiation. In photometry, luminous flux is used to express the power or amount of optical radiation, and its unit is lm (lumen).
An image f(x, y) formed from such a scene can be modeled as the product of an illumination component and a reflection component,

$$f(x, y) = I(x, y)R(x, y) \qquad (2.2)$$

where I(x, y) is the amount of light incident on the scene point and R(x, y) is the reflectivity of the corresponding surface, with

$$0 < I(x, y) < \infty \qquad (2.3)$$

$$0 < R(x, y) < 1 \qquad (2.4)$$
Equation (2.3) shows that the incident amount is always greater than zero (only
considering the incident case), but it is not infinite (because it should be physically
realizable). Equation (2.4) shows that the reflectivity is between 0 (total absorption)
and 1 (total reflection). The values given by the two equations are theoretical limits.
It should be noted that the value of I(x, y) is determined by the lighting source, while
the value of R(x, y) is determined by the surface characteristics of objects in the
scene.
Generally, the luminance value of the monochrome image f(x, y) at the coordinates (x, y) is called the gray value of the image at this point (represented by g). According to Eqs. (2.2) and (2.3), g takes positive and bounded values.
The purpose of building a space imaging model is to determine the image coordinates (x, y), that is, the 2D position onto which a point of the 3D objective scene is projected [4].
When discussing the conversion between different coordinate systems, if the coordinates are expressed in homogeneous form, the conversion between any two coordinate systems can be expressed as a linear matrix transformation. Let us first consider the homogeneous representation of lines and points.
A straight line on the plane can be represented by the equation of line
ax + by + c = 0. Different a, b, c can represent different straight lines, so a straight
line can also be represented by a vector l = [a, b, c]T. Because the line ax + by + c = 0
and the line (ka)x + (kb)y + kc = 0 are the same when k is not 0, so when k is not
0, the vector [a, b, c]T and the vector k[a, b, c]T represent the same straight line. In
fact, these vectors that differ by only one scale can be considered equivalent. A
vector set that satisfies this equivalence relationship is called a homogeneous
vector, and any specific vector [a, b, c]T is the representative of the vector set.
For a line l = [a, b, c]T, the point x = [x, y]T is on the line if and only if
ax + by + c = 0. This can be represented by the inner product of the vector [x, y, 1]T corresponding to the point and the vector l = [a, b, c]T corresponding to the line, that is, [x, y, 1] · [a, b, c]T = [x, y, 1] · l = 0. Here, the point vector [x, y]T is represented by a 3D vector with the value 1 added as the last item. Note that for any nonzero constant k and any straight line l, [kx, ky, k] · l = 0 if and only if [x, y, 1] · l = 0. Therefore, it can
be considered that all vectors [kx, ky, k]T (obtained by the variation of k) are
expressions of points [x, y]T. Thus, like lines, points can also be represented by
homogeneous vectors.
In general, the homogeneous coordinates of the Cartesian coordinates XYZ
corresponding to a point in space are defined as (kX, kY, kZ, k), where k is an
arbitrary nonzero constant. Obviously, the transformation from homogeneous coor
dinates back to Cartesian coordinates can be obtained by dividing the first three coordinate quantities by the fourth coordinate quantity. Thus, a point in a Cartesian world coordinate system can be represented in vector form as

$$\mathbf{W} = [X \ \ Y \ \ Z]^T \qquad (2.6)$$

and in homogeneous form as

$$\mathbf{W}_h = [kX \ \ kY \ \ kZ \ \ k]^T \qquad (2.7)$$
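A small sketch of these conventions with NumPy: the incidence of a point and a line is tested with the inner product described above, and a 3D point is converted to and from the homogeneous form of Eqs. (2.6) and (2.7); the sample line and points are arbitrary.

```python
import numpy as np

def to_homogeneous(point, k=1.0):
    """Cartesian -> homogeneous: (X, Y, Z) -> (kX, kY, kZ, k), with k != 0."""
    return np.append(k * np.asarray(point, dtype=float), k)

def to_cartesian(ph):
    """Homogeneous -> Cartesian: divide the first entries by the last one."""
    ph = np.asarray(ph, dtype=float)
    return ph[:-1] / ph[-1]

# Point-line incidence in the plane: [x, y, 1] . [a, b, c] = 0
line = np.array([1.0, -1.0, 0.0])            # the line x - y = 0
point = np.array([2.0, 2.0, 1.0])            # homogeneous form of (2, 2)
print(np.dot(point, line))                   # 0.0 -> the point lies on the line

W = [1.0, 2.0, 3.0]
Wh = to_homogeneous(W, k=2.0)                # [2, 4, 6, 2], as in Eq. (2.7)
print(to_cartesian(Wh))                      # back to [1, 2, 3], as in Eq. (2.6)
```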
Consider the most basic and simplest space imaging model first; suppose that the world coordinate system XYZ coincides with the camera coordinate system xyz and that the xy plane of the camera coordinate system also coincides with the image plane x′y′ (the computer image coordinate system is not considered for now).
The camera projects the 3D objective world scene onto the 2D image plane through
perspective projection. This projection can be spatially described by an imaging
transformation (also called a geometric perspective transformation or perspective
transformation). Figure 2.2 shows a schematic diagram of the geometric perspective
transformation model of an imaging process, in which the camera optical axis
(through the center of the lens) is directed outward along the positive z-axis. In
this way, the center of the image plane is at the origin, and the coordinates of the lens center are (0, 0, λ), where λ represents the focal length of the lens.
$$\frac{x}{\lambda} = \frac{-X}{Z - \lambda} = \frac{X}{\lambda - Z} \qquad (2.8)$$

$$\frac{y}{\lambda} = \frac{-Y}{Z - \lambda} = \frac{Y}{\lambda - Z} \qquad (2.9)$$

In these equations, the negative signs before X and Y mean that the image points are inverted. The image plane coordinates of the 3D point after perspective projection can be obtained from these two equations:

$$x = \frac{\lambda X}{\lambda - Z} \qquad (2.10)$$

$$y = \frac{\lambda Y}{\lambda - Z} \qquad (2.11)$$
This perspective projection can be expressed with a 4 x 4 perspective transformation matrix

$$\mathbf{P} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & -1/\lambda & 1 \end{bmatrix} \qquad (2.12)$$

which maps the homogeneous world coordinates to homogeneous camera coordinates:

$$\mathbf{c}_h = \mathbf{P}\mathbf{W}_h \qquad (2.13)$$
Here the elements of ch are the camera coordinates in homogeneous form, and these coordinates can be converted into Cartesian form by dividing the first three elements of ch by its fourth element. Therefore, the Cartesian coordinates of any point in the camera coordinate system can be expressed in vector form as

$$\mathbf{c} = [x \ \ y \ \ z]^T = \left[ \frac{\lambda X}{\lambda - Z} \ \ \ \frac{\lambda Y}{\lambda - Z} \ \ \ \frac{\lambda Z}{\lambda - Z} \right]^T \qquad (2.14)$$
where the first two items of c are the coordinates (x, y) of the 3D space point (X, Y, Z)
projected onto the image plane.
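A small numerical check of this model: the perspective transformation matrix of Eq. (2.12) is applied to a world point in homogeneous form and the result is converted to Cartesian form as in Eq. (2.14), matching Eqs. (2.10) and (2.11); the focal length and the point are arbitrary sample values.

```python
import numpy as np

lam = 0.05                                    # focal length (arbitrary sample value)
P = np.array([[1, 0, 0,        0],
              [0, 1, 0,        0],
              [0, 0, 1,        0],
              [0, 0, -1 / lam, 1]])           # perspective transformation matrix, Eq. (2.12)

X, Y, Z = 0.4, 0.3, 2.0                       # a 3D point in front of the camera
Wh = np.array([X, Y, Z, 1.0])                 # homogeneous world coordinates (k = 1)
ch = P @ Wh                                   # homogeneous camera coordinates, Eq. (2.13)
c = ch[:3] / ch[3]                            # Cartesian form, Eq. (2.14)

x = lam * X / (lam - Z)                       # Eq. (2.10)
y = lam * Y / (lam - Z)                       # Eq. (2.11)
print(c[:2], (x, y))                          # the first two entries agree with (x, y)
```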
The inverse perspective transformation matrix is

$$\mathbf{P}^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1/\lambda & 1 \end{bmatrix} \qquad (2.16)$$

Applying it to an image point expressed in homogeneous form yields

$$\mathbf{W}_h = [kx_0 \ \ ky_0 \ \ 0 \ \ k]^T \qquad (2.18)$$

that is, in Cartesian form,

$$\mathbf{W} = [X \ \ Y \ \ Z]^T = [x_0 \ \ y_0 \ \ 0]^T \qquad (2.19)$$
Equation (2.19) shows that the Z coordinate of the 3D space point cannot be uniquely determined by the image point (x0, y0) (because the inverse transformation gives Z = 0 for any image point). The problem here is caused by the many-to-one transformation from the 3D objective scene to the image plane. The image point (x0, y0) actually corresponds to the set of all collinear 3D spatial points on the straight line passing through (x0, y0, 0) and (0, 0, λ) (see the line between image points and space points in Fig. 2.2). In the
world coordinate system, X and Y can be inversely solved from Eqs. (2.10) and
(2.11), respectively:
$$X = \frac{x_0}{\lambda}(\lambda - Z) \qquad (2.20)$$

$$Y = \frac{y_0}{\lambda}(\lambda - Z) \qquad (2.21)$$
The above two equations show that it is impossible to completely recover the
coordinates of a 3D space point from its image unless there is some prior knowledge
about the 3D space point mapped to the image point (such as knowing its
Z coordinate). In other words, it is necessary to know at least one world coordinate
of the point in order to recover the 3D space point from its image by using the inverse
perspective transformation.
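A small sketch of this inverse relation: the depth cannot be recovered from the image point alone, but if the Z coordinate is known (the prior knowledge assumed here), Eqs. (2.20) and (2.21) return X and Y; the numbers are arbitrary sample values.

```python
import numpy as np

lam = 0.05                                   # focal length
X, Y, Z = 0.4, 0.3, 2.0                      # ground-truth world point
x0 = lam * X / (lam - Z)                     # image-plane coordinates from Eqs. (2.10), (2.11)
y0 = lam * Y / (lam - Z)

# With only (x0, y0) the depth is ambiguous; knowing Z, invert with Eqs. (2.20), (2.21):
X_rec = (x0 / lam) * (lam - Z)
Y_rec = (y0 / lam) * (lam - Z)
print(X_rec, Y_rec)                          # 0.4 0.3
```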
Further consider the case that the camera coordinate system is separated from the
world coordinate system, but the camera coordinate system coincides with the image
plane coordinate system (the computer image coordinate system is still not consid
ered). Figure 2.3 shows a schematic diagram of the geometric model for imaging at
this time. The position deviation between the center of the image plane (origin) and
the world coordinate system is recorded as vector D, and its components are Dx, Dy,
and Dz, respectively. Here, it is assumed that the camera scans horizontally at an
angle of θ between the X- and x-axes and tilts vertically at an angle of α between the
Z- and z-axes. If the XY plane is taken as the equatorial plane of the Earth, and the Z-
axis points to the north pole of the Earth, the pan angle corresponds to the longitude
and the tilt angle corresponds to the latitude.
The above model can be converted from the basic camera space imaging model
when the world coordinate system coincides with the camera coordinate system
through the following series of steps: (1) move the origin of the image plane out of
the origin of the world coordinate system according to the vector D; (2) pan the x-axis by the pan angle θ (rotating around the Z-axis); (3) tilt the z-axis (rotating around the x-axis) by the tilt angle α.
Moving the camera relative to the world coordinate system is also equivalent to
moving the world coordinate system opposite to the camera. Specifically, the three
steps taken to convert the above geometric relationship can be performed for each
point in the world coordinate system. The following transformation matrix can be
used to translate the origin of the world coordinate system to the origin of the image
plane:

$$\mathbf{T} = \begin{bmatrix} 1 & 0 & 0 & -D_x \\ 0 & 1 & 0 & -D_y \\ 0 & 0 & 1 & -D_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (2.22)$$

The pan of the x-axis by the angle θ is a rotation around the Z-axis:

$$\mathbf{R}_\theta = \begin{bmatrix} \cos\theta & \sin\theta & 0 & 0 \\ -\sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (2.23)$$

The tilt of the z-axis by the angle α is a rotation around the x-axis:

$$\mathbf{R}_\alpha = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\alpha & \sin\alpha & 0 \\ 0 & -\sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (2.24)$$
A position without tilt (α = 0°) corresponds to the z- and Z-axes being parallel.
The transformation matrices that complete the above two rotations can be concatenated into a single rotation matrix:

$$\mathbf{R} = \mathbf{R}_\alpha\mathbf{R}_\theta \qquad (2.25)$$

The complete transformation from homogeneous world coordinates to homogeneous camera coordinates is then

$$\mathbf{c}_h = \mathbf{P}\mathbf{R}\mathbf{T}\mathbf{W}_h \qquad (2.26)$$
Expanding Eq. (2.26) and converting to Cartesian coordinates gives the image-plane coordinates of a point (X, Y, Z) in the world coordinate system.
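As an illustration of how these transformations compose, the following sketch builds the translation matrix of Eq. (2.22), the pan and tilt rotations of Eqs. (2.23) and (2.24), and the perspective matrix of Eq. (2.12), and applies Eq. (2.26) to one world point. The sign conventions of the rotations, the displacement, the angles, and the focal length are assumptions of this example.

```python
import numpy as np

def translation(Dx, Dy, Dz):                          # Eq. (2.22)
    T = np.eye(4)
    T[:3, 3] = [-Dx, -Dy, -Dz]
    return T

def pan(theta):                                       # Eq. (2.23), rotation around the Z-axis
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[ c, s, 0, 0],
                     [-s, c, 0, 0],
                     [ 0, 0, 1, 0],
                     [ 0, 0, 0, 1]], dtype=float)

def tilt(alpha):                                      # Eq. (2.24), rotation around the x-axis
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[1,  0, 0, 0],
                     [0,  c, s, 0],
                     [0, -s, c, 0],
                     [0,  0, 0, 1]], dtype=float)

def perspective(lam):                                 # Eq. (2.12)
    P = np.eye(4)
    P[3, 2] = -1.0 / lam
    return P

lam = 0.05
R = tilt(np.deg2rad(30)) @ pan(np.deg2rad(20))        # the two rotations concatenated, Eq. (2.25)
T = translation(1.0, 2.0, 3.0)
Wh = np.array([2.0, 1.0, 0.5, 1.0])                   # a homogeneous world point
ch = perspective(lam) @ R @ T @ Wh                    # Eq. (2.26): ch = P R T Wh
x, y = ch[0] / ch[3], ch[1] / ch[3]                   # image-plane coordinates
print(x, y)
```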
In a more comprehensive imaging model than the general space imaging model
above, there are two factors to consider in addition to the misalignment of the world
coordinate system, the camera coordinate system, and the image plane coordinate
system (so transformations are required). First, the camera lens introduces distortion, so the imaging position on the image plane is offset from the perspective projection result calculated by the above equations. Second, the image coordinates used in the computer are indices of discrete pixels in memory, so the coordinates on the image plane need to be converted and rounded (here, continuous coordinates are
still used on the image plane). Figure 2.4 presents a schematic diagram of the
complete space imaging model when these factors are taken into account.
In this way, the complete space imaging transformation from an objective scene
to a digital image can be viewed as consisting of four steps:
1. Transformation from world coordinates (X, Y, Z) to camera 3D coordinates (x, y,
z). Considering the case of a rigid body, the transformation can be expressed as
$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \mathbf{R}\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + \mathbf{T} \qquad (2.29)$$

where R is the rotation matrix and T is the translation vector:

$$\mathbf{R} = \begin{bmatrix} r_1 & r_2 & r_3 \\ r_4 & r_5 & r_6 \\ r_7 & r_8 & r_9 \end{bmatrix} \qquad (2.30)$$

$$\mathbf{T} = [T_x \ \ T_y \ \ T_z]^T \qquad (2.31)$$
2. Transformation from the camera 3D coordinates (x, y, z) to the undistorted image plane coordinates (x′, y′) by perspective projection:

$$x' = \lambda\frac{x}{z} \qquad (2.32)$$

$$y' = \lambda\frac{y}{z} \qquad (2.33)$$
3. The transformation from the undistorted image plane coordinates (x′, y′) to the actual image plane coordinates (x*, y*), offset by the radial distortion of the lens
(see Sect. 2.3.2) are
$$x^* = x' - R_x \qquad (2.34)$$

$$y^* = y' - R_y \qquad (2.35)$$
where Rx and Ry represent the radial distortion of the lens. Most lenses have
certain radial distortion. Although it generally has little impact on human eye
observation, it still needs to be corrected in optical measurement; otherwise, large
errors will occur. Theoretically speaking, there are two main types of lens distortion,
namely, radial distortion and tangential distortion. Since the tangential distortion is
relatively small, only radial distortion is often considered in general industrial
machine vision applications:
$$R_x = kx^*r^2 \qquad (2.36)$$

$$R_y = ky^*r^2 \qquad (2.37)$$

where

$$r = \sqrt{x^{*2} + y^{*2}} \qquad (2.38)$$
In Eqs. (2.36) and (2.37), k = k1 is taken. On the one hand, the reason for this
approximate simplification is that, in practice, the higher-order term of R can be
ignored, so k2 can be ignored and only k1 considered. On the other hand, this takes into account the fact that the radial distortion is usually symmetrical about the main optical axis of the camera lens; in this case, the radial distortion of a point in the image is proportional to the distance from the point to the optical axis of the lens [5].
4. The transformation from the actual image plane coordinates (x*, y*) to the
computer image coordinates (M, N) is

$$M = \frac{\mu x^*}{S_xL_x} + O_m \qquad (2.39)$$

$$N = \frac{y^*}{S_y} + O_n \qquad (2.40)$$

where M and N are the row and column numbers of the pixel in the computer memory (computer coordinates), respectively; Om and On are the row and column numbers of the center of the computer memory; Sx is the distance between the centers of two adjacent sensor elements along the X direction (the scanning line direction); Sy is the distance between the centers of two adjacent sensor elements along the Y direction; and Lx and μ are scale factors relating the sensor elements to the sampled pixels in a row (μ is the uncertainty image scale factor; see Sect. 2.4). Combining the above relations gives

$$\lambda\frac{x}{z} = x' = x^* + R_x = x^*(1 + kr^2) = \frac{(M - O_m)S_xL_x}{\mu}(1 + kr^2) \qquad (2.41)$$

$$\lambda\frac{y}{z} = y' = y^* + R_y = y^*(1 + kr^2) = (N - O_n)S_y(1 + kr^2) \qquad (2.42)$$
Substituting Eqs. (2.30) and (2.31) into the above two equations finally gives the complete transformation from the world coordinates (X, Y, Z) to the computer image coordinates (M, N).
The camera model represents the relationship between the coordinates of the scene
in the world coordinate system and its coordinates in the image coordinate system,
that is, the projection relationship between the object point (space point) and the
image point is given. Camera models can be mainly divided into two types: linear
models and nonlinear models.
The linear camera model is also called the pinhole model. In this model, the image of any 3D space point on the image coordinate system is considered to be formed according to the principle of pinhole imaging.
For linear camera models, the distortion caused by nonideal lenses does not have
to be considered, but the coordinates on the image plane are rounded. In this way,
Fig. 2.5 can be used to illustrate the transformation from the 3D world coordinate
system through the camera coordinate system and the image plane coordinate system
to the computer image coordinate system. Here, the three transformations (steps) are
represented by T1, T2, and T3, respectively.
The calibration parameters involved in camera calibration can be divided into two
categories: external parameters (outside the camera) and internal parameters (inside
the camera).
1. External parameters
Fig. 2.5 Schematic diagram of conversion from 3D world coordinate system to computer image
coordinate system under the linear camera model
The first transformation (T1) in Fig. 2.5 is to transform from the 3D world
coordinate system to the 3D camera coordinate system whose center is located at
the optical center of the camera. The transformation parameters are called exter
nal parameters, also known as camera attitude parameters. The rotation
matrix R has a total of nine elements but actually only has three degrees of
freedom, which can be represented by the three Euler angles of the rigid body
rotation. The schematic diagram of the Euler angles is shown in Fig. 2.6 (here the line of sight is opposite to the X-axis), where the intersection line AB of the XY plane and the xy plane is called the nodal line. The angle θ between AB and the x-axis is the first Euler angle, called the yaw angle (also called the deflection angle), which is the angle of rotation around the z-axis; the angle ψ between AB and the X-axis is the second Euler angle, called the tilt angle, which is the angle of rotation around the Z-axis; the angle φ between the Z-axis and the z-axis is the third Euler angle, called the pitch angle, which is the angle of rotation around the pitch line.
Using the Euler angles, the rotation matrix can be expressed as a function of θ, φ, and ψ, as given in Eq. (2.45).
It can be seen that the rotation matrix has three degrees of freedom. In addition,
the translation matrix also has three degrees of freedom (translation coefficients in
three directions). Thus, the camera has six independent external parameters, namely,
the three Euler angles θ, φ, and ψ in R and the three elements Tx, Ty, and Tz in T.
2. Internal parameters
The second transformation (T2) and the third transformation (T3) in Fig. 2.5
are the transformations from the 3D camera coordinate system through the image plane coordinate system to the computer image coordinate system. The transformation parameters involved are called internal parameters.
According to the general space imaging model discussed in Sect. 2.2.3, if a series of
transformations PRTWh are performed on the homogeneous coordinates Wh of the
space points, the world coordinate system and the camera coordinate system can be
coincident. Here, P is the imaging projection transformation matrix, R is the camera
rotation matrix, and T is the camera translation matrix. Let A = PRT; the elements in
A include camera translation, rotation, and projection parameters; then there is a
homogeneous expression of image coordinates: Ch = AWh. If k = 1 in the homo
geneous expression, we get
$$\begin{bmatrix} C_{h1} \\ C_{h2} \\ C_{h3} \\ C_{h4} \end{bmatrix} = \mathbf{A}\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (2.46)$$

where the image-plane coordinates are

$$x = C_{h1}/C_{h4} \qquad (2.47)$$

$$y = C_{h2}/C_{h4} \qquad (2.48)$$
Substitute Eqs. (2.47) and (2.48) into (2.46) and expand the matrix product to get
$$(a_{11} - a_{41}x)X + (a_{12} - a_{42}x)Y + (a_{13} - a_{43}x)Z + (a_{14} - a_{44}x) = 0 \qquad (2.52)$$

$$(a_{21} - a_{41}y)X + (a_{22} - a_{42}y)Y + (a_{23} - a_{43}y)Z + (a_{24} - a_{44}y) = 0 \qquad (2.53)$$
It can be seen that a calibration procedure should include the following steps: (1) obtain M > 6 space points with known world coordinates (Xi, Yi, Zi), i = 1, 2, ..., M (in practical applications, more than 25 points are often taken, and least squares fitting is then used to reduce the error); (2) photograph these points with the camera at a given position to obtain their corresponding image plane coordinates (xi, yi), i = 1, 2, ..., M; (3) substitute these coordinates into Eqs. (2.52) and (2.53) to solve for the unknown coefficients.
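A numerical sketch of this procedure on synthetic data: the equations (2.52) and (2.53) are stacked for M points and the coefficients are solved in the least-squares sense with an SVD, which is one common way to implement the least squares fitting mentioned above. The ground-truth matrix and the points are generated only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ground truth: rows 1, 2, and 4 of the 4x4 matrix A (row 3 does not
# appear in Eqs. (2.52) and (2.53)). Values are arbitrary for the demonstration.
A_true = rng.normal(size=(3, 4))

# M space points with known world coordinates and their image coordinates
M = 8
W = np.hstack([rng.uniform(-1, 1, size=(M, 3)), np.ones((M, 1))])
proj = W @ A_true.T                                        # (Ch1, Ch2, Ch4) per point
x, y = proj[:, 0] / proj[:, 2], proj[:, 1] / proj[:, 2]    # Eqs. (2.47), (2.48)

# Stack Eqs. (2.52) and (2.53) for every point into a homogeneous system G a = 0
rows = []
for i in range(M):
    Xi, Yi, Zi, _ = W[i]
    rows.append([Xi, Yi, Zi, 1, 0, 0, 0, 0, -x[i]*Xi, -x[i]*Yi, -x[i]*Zi, -x[i]])
    rows.append([0, 0, 0, 0, Xi, Yi, Zi, 1, -y[i]*Xi, -y[i]*Yi, -y[i]*Zi, -y[i]])
G = np.array(rows)

# Least-squares solution up to scale: right singular vector of the smallest singular value
a = np.linalg.svd(G)[2][-1].reshape(3, 4)
print(a / a[0, 0] - A_true / A_true[0, 0])   # ~0: coefficients recovered up to scale
```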
In order to realize the above calibration procedure, it is necessary to obtain spatial
points and image points with corresponding relationship. To precisely determine
these points, a calibrator (also called a calibration target, i.e., a standard reference) is
used, which has a fixed pattern of marked points (reference points) on it. The most
commonly used 2D calibration targets have a series of regularly arranged square
patterns (similar to a chessboard), and the vertices of these squares (crosshair
intersections) can be used as reference points for calibration. If the coplanar refer
ence point calibration algorithm is used, the calibration target corresponds to one
plane; if the noncoplanar reference point calibration algorithm is used, the calibra
tion target generally corresponds to two orthogonal planes.
In practical situations, the camera usually forms images through a lens (often consisting of multiple lens elements). Given current lens processing and camera
manufacturing technology, the projection relationship of the camera cannot be
simply described as a pinhole model. In other words, due to the influence of various
factors, such as lens processing and installation, the projection relationship of the
camera will not be a linear projection relationship, that is, the linear model cannot
accurately describe the imaging geometric relationship of the camera.
Real optical systems do not work exactly according to the idealized pinhole
imaging principle but have lens distortion. Due to various distortion factors, there
is a deviation between the real position of the 3D space point projected onto the 2D
image plane and the ideal image point position without distortion. Optical distortion
errors are more pronounced near the edge of the lens. Especially when using wide-
angle lenses, there is often a lot of distortion in the image plane far from the center. In
this way, there will be deviations in the measured coordinates, and the accuracy of
the obtained world coordinates will be reduced. Therefore, a nonlinear camera
model that takes into account the distortion must be used for camera calibration.
Due to the influence of various distortion factors, when the 3D space point is
projected onto the 2D image plane, there is a deviation between the actual coordi
nates (xa, ya) and the ideal coordinates (xi, yi) without distortion, which can be
expressed as
$$x_a = x_i + d_x \qquad (2.54)$$

$$y_a = y_i + d_y \qquad (2.55)$$
where dx and dy are the total nonlinear distortion deviation values in the x and
y directions, respectively. There are two basic types of common distortion: radial
distortion and tangential distortion. They can be seen in Fig. 2.7, where dr
represents the deviation caused by radial distortion and dt represents the deviation
caused by tangential distortion. Most other distortions are the combination of these
two basic distortions, and the most typical combined distortions are eccentric
distortion (centrifugal distortion) and thin prism distortion.
1. Radial distortion
Radial distortion is mainly caused by irregularities in the lens shape (surface
curvature errors); the resulting aberrations are generally symmetrical about the
main optical axis of the camera lens and are more pronounced along the lens
radius away from the optical axis. Generally, positive radial distortion is called
pincushion distortion, and negative radial distortion is called barrel distortion,
as shown in Fig. 2.8 (where the square represents the original shape, pincushion
distortion causes four right angles to become acute angles, and barrel distortion
causes four right angles to become rounded). Their mathematical models are both
$$d_{xr} = x_i(k_1r^2 + k_2r^4 + \cdots), \qquad d_{yr} = y_i(k_1r^2 + k_2r^4 + \cdots)$$

Fig. 2.7 Schematic diagram of radial distortion and tangential distortion

Fig. 2.8 Illustration of pincushion distortion and barrel distortion

Fig. 2.9 Schematic diagram of tangential distortion
Here, r = (xi² + yi²)^(1/2) is the distance from the image point to the center of the image, and k1, k2, etc. are the radial distortion coefficients.
2. Tangential distortion
The tangential distortion is mainly caused by the noncollinearity of the
optical centers of the lens groups, which causes the actual image point to move
tangentially on the image plane. Tangential distortion has a certain orientation in
space, so there are a maximum axis of distortion in a certain direction and a
minimum axis of distortion in a direction perpendicular to this direction. The
schematic diagram is shown in Fig. 2.9, where the solid line represents the case
without distortion and the dashed line represents the result caused by tangential
distortion. Generally, the influence of tangential distortion is relatively small, and it is seldom modeled separately.
3. Eccentric distortion
The eccentric distortion is caused by the discrepancy between the optical
center and the geometric center of the optical system, that is, the optical center of
the lens device is not strictly collinear. Its mathematical model is
$$d_{xe} = l_1(r^2 + 2x_i^2) + 2l_2x_iy_i, \qquad d_{ye} = 2l_1x_iy_i + l_2(r^2 + 2y_i^2)$$
where r = (xi² + yi²)^(1/2) is the distance from the image point to the image center and l1, l2, etc. are the eccentric distortion coefficients.
4. Thin prism distortion
The thin prism distortion is caused by the improper design and assembly of
lens. This kind of distortion is equivalent to adding a thin prism to the optical
system, which will not only cause radial deviation but also cause tangential
deviation. Its mathematical model is

$$d_{xp} = m_1(x_i^2 + y_i^2) = m_1r^2, \qquad d_{yp} = m_2(x_i^2 + y_i^2) = m_2r^2$$
If the terms higher than the third order are ignored and n1 = l1 + m1, n2 = l2 + m2, n3 = 2l1, and n4 = 2l2 are defined, the combined distortion can be written compactly in terms of n1, n2, n3, and n4 (Eqs. (2.64) and (2.65)).
In practical applications, the radial distortion of the camera lens often has the greatest
impact. If other distortions are ignored, the transformation from the undistorted
image plane coordinates (x′, y′) to the actual image plane coordinates (x*, y*) offset
by the radial distortion of the lens is given by Eqs. (2.34) and (2.35).
Considering the transformation from (x′, y′) to (x*, y*), the transformation from
3D world coordinate system to computer image coordinate system realized
according to the nonlinear camera model is shown in Fig. 2.10. The original
transformation T3 is now decomposed into two transformations (T31 and T32), and
Eqs. (2.39) and (2.40) can still be used to define T32 (only x* and y* are required to
replace x′ and y′).
Fig. 2.10 Schematic diagram of transformation from 3D world coordinate system to computer
image coordinate system under nonlinear camera model
Although only radial distortion is considered when Eqs. (2.34) and (2.35) are
used here, the forms of Eqs. (2.62) and (2.63) or Eqs. (2.64) and (2.65) are actually
applicable to a variety of distortions. In this sense, the conversion process in
Fig. 2.10 is applicable to the case of various distortions, as long as the corresponding
T31 is selected according to the type of distortion. Comparing Fig. 2.10 with Fig. 2.5, the "nonlinearity" is reflected in the conversion from (x′, y′) to (x*, y*).
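A small sketch of this nonlinear step, relating the undistorted coordinates (x′, y′) and the actual coordinates (x*, y*) through the first-order radial distortion consistent with Eqs. (2.34)-(2.38); the distortion coefficient is a sample value, and the fixed-point iteration used for the inverse mapping is an implementation choice, not part of the text.

```python
import numpy as np

k = -0.15          # first-order radial distortion coefficient (sample value; negative -> barrel)

def undistort(x_star, y_star):
    """Actual (distorted) image-plane coordinates -> undistorted coordinates,
    using x' = x*(1 + k r^2), y' = y*(1 + k r^2) with r^2 = x*^2 + y*^2."""
    r2 = x_star**2 + y_star**2
    return x_star * (1 + k * r2), y_star * (1 + k * r2)

def distort(x_p, y_p, iters=20):
    """Inverse mapping (x', y') -> (x*, y*); there is no closed form, so a
    simple fixed-point iteration is used here."""
    x_star, y_star = x_p, y_p
    for _ in range(iters):
        r2 = x_star**2 + y_star**2
        x_star, y_star = x_p / (1 + k * r2), y_p / (1 + k * r2)
    return x_star, y_star

xp, yp = undistort(0.8, 0.6)          # the deviation grows with the distance from the center
print(xp, yp)
print(distort(xp, yp))                # ~ (0.8, 0.6)
```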
Many camera calibration methods have been proposed. The classification of cali
bration methods is discussed first, and then several typical methods are introduced.
For the camera calibration methods, there are different classification ways according
to different criteria. For example, according to the characteristics of the camera
model, it can be divided into linear methods and nonlinear methods; according to
whether or not to require calibration targets, it can be divided into traditional camera
calibration methods, camera self-calibration methods, and active vision-based cali
bration methods (some people also group the latter two into one class). When
using the calibration target, according to the dimension of the calibration target, it
can also be divided into the method of using 2D plane target and the method of using
3D volumetric target; according to the results of solving parameters, it can be divided
into explicit methods and implicit methods; according to whether the internal
parameters of the camera can be changed, it can be divided into methods with
variable internal parameters and methods with immutable internal parameters;
according to the movement mode of the camera, it can be divided into methods
that limit the movement mode and methods that do not limit the movement mode;
according to the number of cameras used in the vision system, it can be divided into
single-camera calibration method and multi-camera calibration method. In addition,
when the spectra are different, the calibration method (and calibration target) often
needs to be adjusted, such as [6]. Table 2.1 gives a classification table of calibration
methods, which lists some classification criteria, categories, and typical methods.
In Table 2.1, the nonlinear methods are generally more complex and slower and
require a good initial value. In addition, the nonlinear search cannot guarantee that
the parameters converge to the global optimal solution. The implicit method takes
the elements of the transformation matrix as calibration parameters and uses a
transformation matrix to represent the correspondence between 3D space points
and 2D plane image points. Because the parameters themselves do not have clear
physical meanings, they are also called implicit parameter methods. Since the
implicit parameter method only needs to solve the linear equation, this method can
obtain higher efficiency when the accuracy requirement is not very high. The direct linear transformation (DLT) method takes the linear model as its object and uses a 3 x 4 matrix to
represent the correspondence between 3D space points and 2D plane image points,
ignoring the intermediate imaging process (or, comprehensively considering the
factors in the process). The most common multi-camera calibration method is the
dual-camera calibration method. Compared with the single-camera calibration, the
dual-camera calibration not only needs to know the internal and external parameters
of each camera itself but also needs to measure the relative relationship (location and
orientation) between the two cameras through calibration.
The traditional camera calibration procedure requires the use of a known calibration target (a 2D calibration plate or a 3D calibration block with known data); that is, the size and shape of the calibration target (the position and distribution of the calibration points) must be known, and the internal and external parameters of the camera are then determined by establishing the correspondence between points on the calibration target and the corresponding points in the captured image. The advan
tage is that the theory is clear, the solution is simple, and the calibration accuracy is
high. The disadvantage is that the calibration process is relatively complicated and the accuracy requirements for the calibration target itself are relatively high.
Referring to the complete space imaging model introduced in Sect. 2.2.4 and the
nonlinear camera model introduced in Sect. 2.3.2, the calibration of the camera can
be performed along the transformation direction from 3D world coordinates to
computer image coordinates. As shown in Fig. 2.11, there are four steps in the
conversion from the world coordinate system to the computer image coordinate
system, and each step has parameters to be calibrated.
Step 1: The parameters to be calibrated are the rotation matrix R and the
translation matrix T.
Step 2: The parameter to be calibrated is the lens focal length λ.
Step 3: The parameters to be calibrated are the radial distortion coefficient k of the
lens, the eccentric distortion coefficient l, and the thin prism distortion coefficient m.
Step 4: The parameter to be calibrated is the uncertainty image scale factor μ.
Let si have the following relationship with the rotation parameters r1, r2, r4, and r5 and
the translation parameters Tx, Ty:
Set the vector u = [x1 x2 ... xM]T; then we can first solve s with the following
linear equations:
$$\mathbf{A}\mathbf{s} = \mathbf{u} \qquad (2.68)$$
Then, the various rotation and translation parameters can be calculated according
to the following steps:
1. Let S = s1² + s2² + s3² + s4², and calculate

$$T_y^2 = \begin{cases} \dfrac{S - \sqrt{S^2 - 4(s_1s_4 - s_2s_3)^2}}{2(s_1s_4 - s_2s_3)^2} & s_1s_4 - s_2s_3 \neq 0 \\[2ex] \dfrac{1}{s_1^2 + s_2^2} & s_1^2 + s_2^2 \neq 0 \\[2ex] \dfrac{1}{s_3^2 + s_4^2} & s_3^2 + s_4^2 \neq 0 \end{cases} \qquad (2.69)$$
2. Set Ty = (Ty²)^(1/2), that is, take the positive square root, and then calculate r1, r2, r4, r5, and Tx from the components of s and this Ty.
3. Select a point whose world coordinates are (X, Y, Z) and whose image plane coordinates (x, y) are far from the center of the image, and use it to check (and if necessary correct) the sign of Ty.
4. Calculate the remaining elements of the rotation matrix R from the orthonormality of its rows:

$$r_3 = (1 - r_1^2 - r_2^2)^{1/2}, \quad r_6 = (1 - r_4^2 - r_5^2)^{1/2}, \quad r_7 = r_2r_6 - r_3r_5, \quad r_8 = r_3r_4 - r_1r_6, \quad r_9 = r_1r_5 - r_2r_4 \qquad (2.73)$$
Note that if the sign of r1r4 + r2r5 is positive, then r6 should be negative, and
the signs of r7 and r8 should be adjusted after calculating the focal length λ.
5. Establish another set of linear equations to calculate the focal length λ and the translation parameter Tz in the z direction. A matrix B can be constructed first, with rows bi built from the datum points, and the unknown vector t = [λ  Tz]^T is then solved from

$$\mathbf{B}\mathbf{t} = \mathbf{v} \qquad (2.76)$$
6. If λ < 0, then in order to use a right-handed coordinate system, r3, r6, r7, r8, λ, and Tz must be negated.
7. Calculate the radial distortion coefficient k of the lens using the estimates obtained above, and adjust the values of λ and Tz. Using the perspective projection equation including distortion, the following nonlinear equations can be obtained:
The values of k, λ, and Tz can be obtained by solving the above equations with a nonlinear regression method.
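As a rough illustration of the closed-form part of this two-stage procedure, the following Python sketch recovers Ty and the rotation matrix R from the vector s obtained from Eq. (2.68). The mapping s = [r1, r2, r4, r5]/Ty, the sign handling, and the function name recover_pose_from_s are illustrative assumptions, since the defining relation for s is not reproduced above.

import numpy as np

def recover_pose_from_s(s):
    """Sketch of the closed-form part of the two-stage calibration.

    Assumes the parameterization s = [r1, r2, r4, r5]/Ty (the Tx/Ty term is
    omitted here); this mapping is an assumption, since the defining
    equation is not reproduced in the text.
    """
    s1, s2, s3, s4 = s[:4]
    S = s1**2 + s2**2 + s3**2 + s4**2
    C = s1 * s4 - s2 * s3
    if abs(C) > 1e-12:                      # Eq. (2.69), first case
        Ty2 = (S - np.sqrt(S**2 - 4.0 * C**2)) / (2.0 * C**2)
    elif s1**2 + s2**2 > 0:                 # degenerate cases
        Ty2 = 1.0 / (s1**2 + s2**2)
    else:
        Ty2 = 1.0 / (s3**2 + s4**2)
    Ty = np.sqrt(Ty2)                       # positive root first (step 2)

    r1, r2, r4, r5 = s1 * Ty, s2 * Ty, s3 * Ty, s4 * Ty
    r3 = np.sqrt(max(0.0, 1.0 - r1**2 - r2**2))
    r6 = np.sqrt(max(0.0, 1.0 - r4**2 - r5**2))
    if r1 * r4 + r2 * r5 > 0:               # sign rule quoted after Eq. (2.73)
        r6 = -r6
    row1 = np.array([r1, r2, r3])
    row2 = np.array([r4, r5, r6])
    row3 = np.cross(row1, row2)             # completes the rotation matrix
    R = np.vstack([row1, row2, row3])
    return R, Ty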
The above two-level calibration method only considers the radial distortion of the
camera lens. If the tangential distortion of the lens is further considered on this basis,
it is possible to further improve the accuracy of camera calibration.
Referring to Eqs. (2.62) and (2.63), the total distortion deviations dx and dy
considering radial distortion and tangential distortion are
Considering the fourth-order term for radial distortion and the second-order term
for tangential distortion, we have
The calibration of the camera can be divided into the following two steps.
1. Set the initial values of the lens distortion coefficients k1, k2, l1, and l2 to 0, and calculate the values of R, T, and λ.
Referring to Eqs. (2.32) and (2.33), and referring to the derivation of
Eq. (2.77), we can get
   x = λ (r1X + r2Y + r3Z + Tx) / (r7X + r8Y + r9Z + Tz),   (2.82)
   y = λ (r4X + r5Y + r6Z + Ty) / (r7X + r8Y + r9Z + Tz).   (2.83)
Equation (2.84) holds for all datum points; that is, an equation can be established from the 3D world coordinates and 2D image coordinates of each datum point. There are eight unknowns in Eq. (2.84), so if eight datum points can be determined, an equation system with eight equations can be constructed, and the values of r1, r2, r3, r4, r5, r6, Tx, and Ty can be calculated. Because R is an orthogonal matrix, the values of r7, r8, and r9 can be calculated according to its orthogonality. Substituting the calculated values into Eqs. (2.82) and (2.83) and taking the 3D world coordinates and 2D image coordinates of any two datum points, the values of Tz and λ can then be calculated.
2. Calculate the values of the lens distortion coefficients k1, k2, l1, and l2.
According to Eqs. (2.54) and (2.55), as well as Eqs. (2.78)-(2.81), the following can be obtained:
With the help of the R and T already obtained, (X, Y, Z) can be calculated by using Eq. (2.84) and then substituted into Eqs. (2.85) and (2.86) to get
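The explicit expressions of Eqs. (2.85) and (2.86) are not reproduced above, but the spirit of this second step can be sketched as a small nonlinear least-squares problem. The sketch below assumes the common radial-plus-tangential distortion model with coefficients k1, k2, l1, l2; the function name and the data layout are hypothetical.

import numpy as np
from scipy.optimize import least_squares

def distortion_residuals(p, xy_ideal, xy_observed):
    """Residuals for fitting k1, k2, l1, l2 (a sketch; the exact form of
    Eqs. (2.78)-(2.81) is not reproduced here, so the common radial plus
    tangential model is assumed)."""
    k1, k2, l1, l2 = p
    x, y = xy_ideal[:, 0], xy_ideal[:, 1]
    r2 = x**2 + y**2
    radial = 1.0 + k1 * r2 + k2 * r2**2
    dx = 2.0 * l1 * x * y + l2 * (r2 + 2.0 * x**2)     # tangential terms
    dy = l1 * (r2 + 2.0 * y**2) + 2.0 * l2 * x * y
    x_d = x * radial + dx
    y_d = y * radial + dy
    return np.concatenate([x_d - xy_observed[:, 0],
                           y_d - xy_observed[:, 1]])

# xy_ideal: distortion-free projections computed from the R, T, and focal
# length obtained in step 1; xy_observed: measured image coordinates.
# res = least_squares(distortion_residuals, x0=np.zeros(4),
#                     args=(xy_ideal, xy_observed))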
The camera self-calibration method was proposed in the early 1990s. Camera self-calibration can compute the camera model parameters online and in real time from geometric constraints obtained from image sequences, without resorting to high-precision calibration targets, which is especially suitable for cameras that often need to move. Since all the self-calibration methods are only related to the parameters of the camera and have nothing to do with the external environment and the motion of the camera, the self-calibration method is more flexible than the traditional calibration method. However, the existing self-calibration methods are not very accurate and robust.
The idea of the basic self-calibration method is to first establish the constraint equations on the camera's internal parameters through the absolute conic; these constraints are called the Kruppa equations. Then, the Kruppa equations are solved to determine the matrix C (C = K^(-T)K^(-1), where K is the internal parameter matrix). Finally, the matrix K is obtained from C by Cholesky decomposition.
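Assuming the relation C = K^(-T)K^(-1) (the image of the absolute conic), the last step can be sketched as follows; the function name intrinsics_from_iac is illustrative, and the sign and scale conventions of actual Kruppa-based solvers may differ.

import numpy as np

def intrinsics_from_iac(C):
    """Recover the internal parameter matrix K from C = K^{-T} K^{-1}.

    A minimal sketch: C must be symmetric positive definite; the Cholesky
    factor of C gives K^{-1} up to transposition and scale.
    """
    L = np.linalg.cholesky(C)          # C = L L^T, L lower triangular
    K = np.linalg.inv(L.T)             # since C = (K^{-1})^T (K^{-1})
    return K / K[2, 2]                 # normalize so that K[2, 2] = 1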
The self-calibration method can be realized with the help of active vision tech
nology. However, some researchers have put forward the calibration method based
on active vision technology as a separate category. Active vision system means that
the system can control the camera to obtain multiple images in motion and then use
the camera’s motion trajectory and the corresponding relationship between the
obtained images to calibrate the camera. The method of active vision-based calibration is generally used when the motion parameters of the camera in the world coordinate system are known; it can usually be solved linearly, and the obtained results have high robustness.
In practical applications, the method based on active vision calibration generally
installs the camera accurately on the controllable platform, and actively controls the
platform to perform special movements to obtain multiple images, and then uses the
correspondence between the images and the camera motion parameters to determine
camera parameters. However, this method cannot be used if the camera motion
parameters are unknown or in situations where the camera motion cannot be
controlled. In addition, the motion platform required by this method must have high precision, which makes it costly.
A typical self-calibration method can be introduced with reference to Fig. 2.12
[8]. The optical center of the camera is translated from O1 to O2, and the two images
formed are I1 and I2, respectively (the coordinate origins are o 1 and o2, respectively).
A point P in space is imaged as point p 1 on I1 and is imaged as point p2 on I2. Here, p 1
and p2 constitute a pair of corresponding points. If a point p2‘ is marked on I1
according to the coordinate value of point p2 on I2, then the connection between
p2‘ and p1 is called the connection of the corresponding point on I1. It can be proven
that when the camera performs pure translational motion, the lines connecting the corresponding points of all spatial points on I1 intersect at the same point e, and the direction of O1e is the movement direction of the camera (here e is on the line connecting O1 and O2, and O1O2 is the translational motion trajectory).
According to the analysis of Fig. 2.12, by determining the intersection of the lines
connecting the corresponding points, the camera translation direction in the camera
coordinate system can be obtained. In this way, by controlling the camera to perform
translational motions in three directions, respectively, during calibration, and using
the corresponding point connection line to calculate the corresponding intersection ei
(i = 1, 2, 3) before and after each motion, the three translational motion directions
can be obtained.
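A minimal sketch of the intersection computation is given below: each pair of corresponding points defines a line on I1, and the common intersection e is found by a homogeneous least-squares fit. The function name and the use of SVD are illustrative choices, not the book's prescribed procedure.

import numpy as np

def estimate_intersection(p1_list, p2_list):
    """Least-squares intersection of the lines joining corresponding points.

    p1_list, p2_list: (N, 2) arrays of points p1 and p2' on image I1.
    Each pair defines a line a*x + b*y + c = 0; their common intersection e
    is found by solving the stacked homogeneous system with SVD (a sketch).
    """
    p1 = np.asarray(p1_list, dtype=float)
    p2 = np.asarray(p2_list, dtype=float)
    h1 = np.hstack([p1, np.ones((len(p1), 1))])   # homogeneous coordinates
    h2 = np.hstack([p2, np.ones((len(p2), 1))])
    lines = np.cross(h1, h2)                      # line through each pair
    _, _, vt = np.linalg.svd(lines)               # lines @ e ~= 0
    e = vt[-1]
    return e[:2] / e[2]                           # inhomogeneous coordinates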
Referring to Eqs. (2.39) and (2.40), considering the ideal case where the uncertainty image scale factor p is 1 and taking each sensor element in the x direction to sample one pixel per row, Eqs. (2.39) and (2.40) can be written as

   M = x'/Sx + Om,   (2.89)
   N = y'/Sy + On.   (2.90)
Equations (2.89) and (2.90) establish the conversion relationship between the image plane coordinate system x'y', expressed in physical units (such as mm), and the computer image coordinate system MN, expressed in pixels. Denote the coordinates on I1 of the intersection points ei (i = 1, 2, 3) in Fig. 2.12 by (xi, yi). From Eqs. (2.89) and (2.90), the coordinates of ei in the camera coordinate system can be obtained.
If the camera is translated three times along mutually orthogonal directions, then ei^T ej = 0 (i ≠ j), and it follows that
   Q1 = λ/Sy,   (2.98)
   Q2 = λ/Sx.   (2.99)
Then, Eqs. (2.95), (2.96), and (2.97) become three equations including four
unknown quantities of Om, On, Q1, and Q2. These equations are nonlinear, and if
Eqs. (2.96) and (2.97) are subtracted from Eq. (2.95), respectively, two linear
equations are obtained:
Represent On·Q1 in Eqs. (2.100) and (2.101) with an intermediate variable Q3:

   Q3 = On Q1.   (2.102)
Then Eqs. (2.100) and (2.101) become two linear equations about three
unknowns including Om, Q1, and Q3. Since the two equations have three unknowns,
the solutions of Eqs. (2.100) and (2.101) are generally not unique. In order to obtain
a unique solution, the camera can be moved three times in other orthogonal directions to obtain three more intersection points ei (i = 4, 5, 6). If these three translational movements have directions different from those of the previous three translational movements, two more equations similar to Eqs. (2.100) and (2.101) can be obtained. In this way, a total of four equations are obtained, and any three of them can be taken, or the least squares method can be used, to solve Om, Q1, and Q3 from the four
equations. Next, solve On from Eq. (2.102), and then substitute Om, On, and Q1
into Eq. (2.97) to solve for Q2. In this way, all the internal parameters of the camera
can be obtained by controlling the camera to perform two sets of three-orthogonal
translational motions.
The structured light active vision system can be regarded as mainly composed of a
camera and a projector, and the accuracy of the 3D reconstruction of the system is
mainly determined by their calibration. There are many methods for camera calibration, which are often realized by means of calibration targets and feature points. The
projector is generally regarded as a camera with a reverse light path. The biggest
difficulty in projector calibration is to obtain the world coordinates of the feature
points. A common solution is to project the projection pattern onto the calibration
target used to calibrate the camera and obtain the world coordinates of the projection
point according to the known feature points on the calibration target and the
calibrated camera parameter matrix. This method requires the camera to be calibrated in advance, so the camera calibration error will be superimposed onto the projector calibration, resulting in an increase in the projector calibration error.
Another method is to project the encoded structured light onto a calibration target
containing several feature points and then use the phase technique to obtain the
coordinate points of the feature points on the projection plane. This method does not
need to calibrate the camera in advance but needs to project the sinusoidal grating
many times, and the total number of collected images will be relatively large.
The following introduces a calibration method for active vision system based on
color concentric circle array [9]. The projector projects a color concentric circle
pattern to the calibration plate drawn with the concentric circle array and separates
the projected concentric circle and the calibration plate concentric circle from the
acquired image through color channel filtering. Using the geometric constraints
satisfied by the concentric circle projection, the pixel coordinates of the center of
the circle on the image are calculated, and the homography relationship between the
calibration plane, the projector projection plane, and the camera imaging plane is
established, and then the system calibration is realized. This method only needs to
collect at least three images to achieve calibration.
The projection process of the projector and the imaging process of the camera have
the same principle but opposite directions, and the reverse pinhole camera model can
be used as the mathematical model of the projector.
Similar to the camera imaging model, the projection model of the projector is also
designed as a conversion between three coordinate systems (the world coordinate
system, the projector coordinate system, and the projection plane coordinate system,
respectively), and the coordinate system in the computer is not considered first. The
world coordinate system is still represented by XYZ. The projector coordinate
system is a coordinate system xyz centered on the projector and generally takes
the optical axis of the projector as the z-axis. The projection plane coordinate
system is the coordinate system x’y’ on the imaging plane of the projector.
For simplicity, the corresponding axes of the world coordinate system XYZ and
the projector coordinate system xyz can be taken to coincide (and the projector
optical center is located at the origin). Then, the xy plane of the projector coordinate
system and the imaging plane of the projector can be taken to coincide, so that the
origin of projection plane is on the optical axis of the projector, and the z-axis of the
projector coordinate system is perpendicular to the projection plane and points
toward the projection plane, as shown in Fig. 2.13. Among them, the space point (X, Y, Z) is projected through the optical center of the projector to the projection point (x, y) on the projection plane, and the line connecting them is a spatial projection ray.
The coordinate system and transformation ideas in the calibration are as follows.
First, use the projector to project the calibration pattern to the calibration plate [with
the world coordinate system W = (X, Y, Z)] and then use the camera [with the camera coordinate system c = (x, y, z)] to collect the projected calibration plate image and separate the calibration pattern on the calibration plate from the projected pattern. By acquiring and matching the feature points on these patterns, the direct linear transformation (DLT) algorithm [10] can be used to calculate the homography matrix Hwc between the calibration plate and the camera imaging plane, as well as the homography matrix Hcp, induced by the calibration plate plane, between the camera imaging plane and the projector [with the projector coordinate system p = (x', y')] projection plane. They are both 3×3 non-singular matrices representing a 2D projective transformation between two planes.
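As an illustration of the DLT step, the following sketch estimates a homography from matched points; it omits the coordinate normalization usually recommended in practice, and the function name dlt_homography is an assumption for this example.

import numpy as np

def dlt_homography(src, dst):
    """Estimate the 3x3 homography H with dst ~ H @ src by the DLT algorithm.

    src, dst: (N, 2) arrays of matched points (N >= 4), e.g., concentric
    circle centers on the calibration plate and their image coordinates.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(A))       # null vector of the system
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]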
After obtaining Hwc and Hcp, the pixel coordinates I’c and J’c on the camera
imaging plane and the pixel coordinates I’p and J’p on the projector’s projection
plane, of the virtual circle points I = [1, i, 0]T and J = [1, -i, 0]T, can be obtained as
follows:
By changing the position and direction of the calibration plate to obtain the pixel
coordinates of at least three sets of virtual circle points in different planes on the
camera and the projector, the absolute conic curve images Sc and Sp in the camera
imaging plane and projector projection plane can be fitted. Then, by performing
Cholesky decomposition on Sc and Sp, the internal parameter matrices Kc and Kp of
the camera and projector can be obtained, respectively. Finally, using Kc and Kp, as
well as Hwc and Hcp, the external parameter matrices of the camera and projector can
be obtained.
Fig. 2.14 Extraction of projected pattern from overlapping calibration plate pattern and projected pattern

When a projector is used to project a new pattern onto a calibration plate on which a pattern has already been drawn, and a camera is then used to capture the projected calibration plate
image, the two patterns in the captured image are overlapping and need to be
separated. For this purpose, it is possible to consider using two patterns of different
colors, with the aid of color filtering to separate the two patterns.
Specifically, a calibration plate with a magenta concentric circle array (7 × 9 concentric circles) patterned on a white background can be used, and a cyan concentric circle array (also 7 × 9 concentric circles) patterned on a yellow background is projected onto the calibration plate by a projector. When the patterns are
projected onto the calibration plate Ib with a projector, the calibration plate pattern
and the projected pattern overlap, as shown in Fig. 2.14a, where only a pair of each
of the two circular patterns is drawn as an example. The area where the two patterns
overlap will change color, where the intersection of the magenta circle and the
yellow background turns into red, the intersection of the magenta circle and the
cyan circle turns into blue, and the intersection of the white background of the
calibration board and the projected pattern turns into the color of the projected
pattern. First convert it to the camera image Ic with the help of the homography
matrix Hwc (as shown in Fig. 2.14b), and then convert it to the projector image Ip
with the help of the homography matrix Hcp (as shown in Fig. 2.14c).
In the color filtering process, the image is first passed through the green, red, and
blue filter channels, respectively. After passing through the green filter channel,
since the circle pattern on the calibration plate has no green component, it will appear
black, and other areas will appear white, which can separate the calibration plate
pattern. After passing through the red filter channel, the projected circular pattern
appears black because there is no red component in it, while the yellow background
portion and the calibration plate circular pattern appear close to white. After passing
through the blue filter channel, since the yellow background area projected onto the
calibration plate and the red circle pattern on the calibration plate have no blue
component, they appear close to black, while the projected cyan circle pattern
appears close to white. Since the color difference of each pattern part is relatively
large, the overlapping patterns can be separated relatively easily. Taking the centers of the separated concentric rings as feature points and obtaining their image coordinates, the homography matrix Hwc and the homography matrix Hcp can be calculated.
In order to calculate the homography matrices relating the calibration plate plane and the projector projection plane to the camera imaging plane, it is necessary to determine the image coordinates of the centers of the concentric circles on the calibration plate and of the concentric circles projected onto the calibration plate.
Here, consider a pair of concentric circles C1 and C2 with the center O on a plane in
the space. In vector form, the polar line l of any point p on the plane with respect to the circle C1 is l = C1p, and the pole of the line l with respect to the circle C2 is q = C2^(-1)l. The point p can be on the circumference of the circle C1 (as shown in
Fig. 2.15a), outside the circumference of the circle C1 (as shown in Fig. 2.15b), or on
the inside of the circumference of the circle C1 (as shown in Fig. 2.15c). However, in
these three cases, according to the constraint relationship between the poles and the
polar lines of the conic, the line connecting the point p and the point q will pass
through the center O.
The projection transformation maps the concentric circles C1 and C2 with the
center O on the plane S to the camera imaging plane Sc, the corresponding point of
the circle center O on Sc is Oc, and the corresponding conic curves of the concentric
circles C1 and C2 on Sc are G1 and G2, respectively. If the polar line of any point pi on
the plane Sc relative to G1 is li‘, and the pole of li‘ relative to G2 is qi, then according
to the projection invariance of the collinear relationship and the polar line-pole
relationship, it can be known that the connection between pi and qi goes through
Oc. If the connection between pi and qi is recorded as mi, then we have
Fig. 2.15 Constraints between polar lines and poles of concentric circles
From its local minimum point, the optimal projection position of the circle center
can be obtained.
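A minimal sketch of this pole-polar construction is given below, assuming the conics are available as 3×3 symmetric matrices; the sampling of the points pi and the final local refinement mentioned above are not implemented here, and the function name is illustrative.

import numpy as np

def projected_center(G1, G2, sample_points):
    """Estimate the projected circle center Oc from a pair of concentric
    conics G1, G2 using the pole-polar constraint.

    For each sample point p, the line through p and its pole q = G2^{-1} G1 p
    passes through Oc; Oc is recovered as the least-squares intersection of
    these lines (a sketch)."""
    lines = []
    G2_inv = np.linalg.inv(G2)
    for p in sample_points:                  # p given as (x, y)
        ph = np.array([p[0], p[1], 1.0])
        l = G1 @ ph                          # polar line of p w.r.t. G1
        q = G2_inv @ l                       # pole of l w.r.t. G2
        lines.append(np.cross(ph, q))        # line m_i through p and q
    _, _, vt = np.linalg.svd(np.asarray(lines))
    Oc = vt[-1]
    return Oc[:2] / Oc[2]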
In order to automatically extract and match concentric circle images, Canny
operator can be used for sub-pixel edge detection to extract circle boundaries and
fit conic curves. In the large number of conic curves detected in each image, the
conic curve pairs from the same concentric circle are first found by using the rank
constraint of concentric circles [11]. Consider two conic curves G1 and G2 whose generalized eigenvalues are λ1, λ2, and λ3, respectively. If λ1 = λ2 = λ3, G1 and G2 are the same conic; if λ1 = λ2 ≠ λ3, G1 and G2 are projections of a pair of concentric circles; if λ1 ≠ λ2 ≠ λ3, G1 and G2 come from different concentric circles.
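A rough sketch of this rank-constraint test, using generalized eigenvalues computed with SciPy, is given below; the tolerance and the sorting of the eigenvalues are illustrative choices.

import numpy as np
from scipy.linalg import eigvals

def classify_conic_pair(G1, G2, tol=1e-6):
    """Classify a pair of fitted conics by their generalized eigenvalues
    (a sketch of the rank-constraint test described above)."""
    w = np.sort(np.real(eigvals(G1, G2)))    # generalized eigenvalues of (G1, G2)
    lam1, lam2, lam3 = w
    if np.allclose([lam1, lam2], lam3, atol=tol):
        return "same conic"
    if abs(lam1 - lam2) < tol or abs(lam2 - lam3) < tol:
        return "projections of a pair of concentric circles"
    return "from different concentric circles"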
After pairing the conic curves, it is also necessary to match the concentric circles
on the calibration plate with the curve pairs in the image. Here, we can use cross ratio
invariance to automatically match concentric circles. As shown in Fig. 2.16a, let the
straight line on the diameter of the concentric circle intersect with the concentric
circles at p1, p2, p3, and p4, and these intersection points are mapped to p1', p2', p3', and p4' after the projection transformation (as shown in Fig. 2.16b). According to the cross-ratio invariance, the following relationship can be obtained (where |pipj| represents the distance from point pi to point pj):

   (|p1p3| · |p2p4|) / (|p2p3| · |p1p4|) = (|p1'p3'| · |p2'p4'|) / (|p2'p3'| · |p1'p4'|)
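A small sketch of the cross-ratio computation is shown below; the grouping convention follows the relation written above and may differ from other presentations.

import numpy as np

def cross_ratio(p1, p2, p3, p4):
    """Cross ratio of four collinear points, using the ordering convention
    (|p1p3| * |p2p4|) / (|p2p3| * |p1p4|); this value is preserved under
    projective transformation (a sketch)."""
    d = lambda a, b: np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))
    return (d(p1, p3) * d(p2, p4)) / (d(p2, p3) * d(p1, p4))

# The value computed from p1..p4 on the calibration plate equals the value
# computed from their images p1'..p4', which identifies the radius ratio.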
For concentric circles with different radius ratios, the cross ratios formed by the four intersections of the diameter line with the concentric circles are different, so the radius ratio can be used to identify the
concentric circles. When designing the calibration plate pattern and projection
pattern, different radius ratios can be set according to the positions of different
concentric circles to uniquely identify different concentric circles in the pattern. In
practice, only part of the concentric circles can be set with different radius ratios.
After the corresponding homography matrix is obtained, the positions of other
concentric circles and the projection point of the center of the circle can be obtained
with the help of the homography matrix.
After the homography matrix Hwc and the homography matrix Hcp are determined,
the internal and external parameters of the camera and the projector can be calculated. First, the homography matrix Hwc between the calibration plate plane and the
camera imaging plane can be expressed as
Among them, Oi = (xi, yi, 1)T is the coordinates of the center of the concentric
circles on the calibration plate in the calibration plate coordinate system, and
Oi' = (ui, vi, 1)^T is the image coordinates of the projection of Oi. Hwc can be obtained by calculating the image coordinates of four or more concentric circle centers on the calibration plate and using the DLT algorithm.
Similar to the above process, the homography matrix Hcp between the projection
plane of the projector and the imaging plane of the camera can also be calculated.
Then, with the help of Eqs. (2.103) and (2.104), the internal parameter matrices Kc
and Kp of the camera and the projector can be calculated.
Further, compute the external parameter matrices of the camera and projector. Set the calibration plate plane to coincide with the XwYw plane of the world coordinate system; the homogeneous coordinates of a point X on it in the world coordinate system are Xw = [xw, yw, 0, 1]^T, and its image point on the camera, xc = [uc, vc, 1]^T, satisfies (where Rc and tc are the rotation matrix and translation vector of the camera relative to the world coordinate system, respectively)

   xc ≅ Kc[Rc | tc]Xw.   (2.110)
Denote the 2D point corresponding to Xw = [xw, yw, 0, 1]^T on the calibration plate plane as xw = [xw, yw, 1]^T, and use rc1 and rc2 to represent the first two columns of Rc, respectively; then Kc[Rc | tc]Xw = Kc[rc1, rc2, tc]xw, and substituting into Eq. (2.110) gives
If rc1, rc2, and tc are not coplanar, that is, the plane of the calibration plate does not
pass through the optical center of the camera, there is a homography matrix Hwc
between the plane of the calibration plate and the image plane of the camera, which
can be known from Eq. (2.111):
From the above equation, rc1, rc2, and tc can be obtained. Because Rc is a unit orthogonal matrix, its third column follows as rc3 = rc1 × rc2.
Similar to the above process, since the corresponding relationship is also satisfied
between the calibration plate plane and the projector projection plane, so the rotation
matrix Rp and translation vector tp of the projector coordinate system relative to the
world coordinate system can be obtained. In this case, the rotation matrix R and
translation vector t between the camera coordinate system and the projector coordinate system can be expressed as R = Rc^(-1)Rp and t = Rc^(-1)(tp − tc), respectively.
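A minimal sketch of recovering the external parameters from a plane-induced homography, following the relationships described above, is given below; fixing the scale by the norm of the first column and skipping an orthogonality refinement are simplifying assumptions.

import numpy as np

def extrinsics_from_homography(H, K):
    """Recover [R | t] from a plane-induced homography H ~ K [r1, r2, t]
    (a sketch; no re-orthogonalization of R is performed)."""
    A = np.linalg.inv(K) @ H
    lam = 1.0 / np.linalg.norm(A[:, 0])      # scale fixed by ||r1|| = 1
    r1 = lam * A[:, 0]
    r2 = lam * A[:, 1]
    r3 = np.cross(r1, r2)                    # third column from orthogonality
    t = lam * A[:, 2]
    R = np.column_stack([r1, r2, r3])
    return R, t

# With (Rc, tc) and (Rp, tp) obtained in this way for the camera and the
# projector, the relative pose follows as R = Rc^(-1)Rp and t = Rc^(-1)(tp - tc),
# as stated above.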
Conventional extrinsic calibration methods for onboard cameras [14] usually rely on ground signs that meet the constraints of specific points, lines, or surfaces and are mainly suitable for offline
calibration. In addition, due to maintenance and structural deformation, the external
parameters of the camera may change significantly in the life cycle of the vehicle.
How to calibrate and adjust the external parameters online is also very important.
To solve these problems, a real-time camera external parameter calibration
method based on the matching of camera and high-precision map without using
precise and expensive calibration field is proposed [15].
The basic idea of this method is as follows. Firstly, the lane line in the image is
detected by using the deep learning technology. By assuming an initial external
parameter matrix T, and according to this matrix, the lane line point Pw in the world
coordinate system W(XYZ) is projected to the camera coordinate system C(xyz) to
obtain the 3D image point Pc, which is matched with the map. Then, the projection
error L(Tcv) between Pc and the lane point Dc detected by the camera is evaluated by
reasonably designing the error function L, and the external parameter matrix Tcv is
solved by using the idea of bundle adjustment (BA) to minimize the reprojection
error from the lane line curve to the image plane [16]. Here, Tcv determines the
coordinate system transformation between the camera coordinate system C(xyz) and
the vehicle coordinate system V(x’y’z’). Tcv is composed of rotation matrix R and
translation vector T. The three degrees of freedom of R can be expressed by three
Euler angles (rotation angles; see Sect. 2.3.1). Considering that the onboard camera
needs to detect obstacles such as pedestrians and vehicles within 200 m, its detection
accuracy is about 1 m. Assuming that the horizontal field of view of the camera is about 57°, the requirements for the accuracy of the external parameters of the camera
are that the rotation angle is about 0.2°, and the translation is about 0.2 m.
If the coordinates of the lane line points detected on the image plane collected by the camera are (x', y'), then according to the pinhole imaging model,
Among them, zc is the distance between the lane line point Pc and the camera, M
is the parameter matrix of the camera, and Tvw is the coordinate transformation
matrix between the world coordinate system W(XYZ) and the vehicle coordinate
system V(x’y’z’), which expresses the pose of the vehicle.
The detection of lane lines can be carried out with the help of a deep learning
method based on the network structure U-Net++ [17]. After obtaining the lane line
features in the image plane, the 3D world coordinate system position cannot be
directly recovered from the 2D features in the image plane, so it is necessary to
project the true value of the lane lines to the image plane, and set the loss function in
the image plane to perform optimization.
To prevent over-optimization and improve computational efficiency, the detected
features need to be selected/screened. Lane lines are usually composed of curves and
straight lines, and the actual curvature is relatively small. When the vehicle is driving
normally, in most cases, the lane line does not provide useful information for the translation component Tx, and it is necessary to select vehicle-turning scenes for calibration. Therefore, the video captured by the vehicle camera can be divided into useless frames, data frames, and key frames according to the following rules (a classification sketch in code is given after the list):
1. When the number of lane line pixels detected in the frame image is less than a certain threshold, the frame is regarded as a useless frame; this handles cases such as intersections and traffic jams, where no obvious lane lines appear in the image.
2. The frame images when the vehicle driving distance from the previous key frame
and the vehicle yaw angle are both less than a certain threshold are classified as
useless frames to avoid repeated collection of lane line information.
3. When Rules 1 and 2 are not satisfied and the angle between the vehicle and the true value of the lane line (map data) is greater than a certain threshold, the frame image is classified as a key frame.
4. The frame images collected in other cases are classified as data frames.
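A compact sketch of this frame classification is given below; all threshold names are illustrative, and the quantities passed in are assumed to have been computed from the detection results and the vehicle pose.

def classify_frame(num_lane_pixels, dist_since_keyframe, yaw_since_keyframe,
                   angle_to_map_lane, pixel_thresh, dist_thresh, yaw_thresh,
                   angle_thresh):
    """Classify a frame as 'useless', 'key', or 'data' following the four
    rules above (a sketch; all thresholds are hypothetical parameters)."""
    if num_lane_pixels < pixel_thresh:                       # Rule 1
        return "useless"
    if dist_since_keyframe < dist_thresh and yaw_since_keyframe < yaw_thresh:
        return "useless"                                     # Rule 2
    if angle_to_map_lane > angle_thresh:                     # Rule 3
        return "key"
    return "data"                                            # Rule 4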
Since the useless frame does not contain lane line information or only contains the
lane line information that has been counted, it can be ignored in the optimization of
the loss function to reduce the amount of data.
In actual driving, because the vehicle is parallel to the lane line, in most of the
time, the number of key frame images collected is less than the number of data
frames. As pointed out above, the lane lines in the data frames do not provide useful information for the translation component Tx, so not distinguishing between key frames and data frames may lead to over-optimization of the other external parameters. To this end, a threshold can be
set. If the number of collected key frames is small, only parameters other than Tx are
optimized; if the number of collected key frames is sufficient, all external parameters
are optimized.
Defining the reprojection errors of lane line observation points and map reference
points as loss, the loss function can be expressed as
Among them, Pw is the position of the lane line in the high-precision map in the
world coordinate system, and Tvw can be obtained through a global positioning
system (GPS) or the like. In this way, the loss function can be determined by
determining Tcv. When the loss of different poses of the vehicle traversing the lane
during driving is combined, the camera external parameter calibration problem can
be reduced to an optimization problem that minimizes the loss:
In practice, the lane line has no obvious texture features along the direction of the
vehicle, so it is impossible to establish a one-to-one mapping between Pw and (x‘, y‘,
1)^T to solve Eq. (2.116). To address this, the point-to-point error in Eq. (2.115) is converted to a point-set-to-point-set error:
where (xnw, ynw) is the projection of the lane line in the map on the image plane. The
calculation of the normal direction can be found in [18].
To sum up, the reprojection error calculation process includes the following steps (a code sketch follows the list):
1. Project the lane line point set in the map (within a range of 200 m from the
vehicle) into the camera coordinate system based on the camera external parameter matrix Tcv and the vehicle pose matrix Tvw.
2. Project the point set that has undergone coordinate system transformation into the
image plane according to the parameter matrix in the camera.
3. Calculate the normal directions of the lane line point set in projected map and of
the detected lane line point set.
4. Determine the association between the map lane line points and the detected lane
line points by matching.
5. Determine the reprojection error according to Eq. (2.118) and minimize it; for example, a simple steepest descent method can be used.
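A simplified sketch of steps 1, 2, and 5 is given below. It replaces the normal-direction matching of steps 3 and 4 (and the exact error of Eq. (2.118)) with a plain nearest-neighbor point-set distance, so it should be read as an illustration of the data flow rather than as the method itself; all names are hypothetical.

import numpy as np

def lane_reprojection_error(P_w, d_c, T_cv, T_vw, M):
    """Sketch of the reprojection error between projected map lane points and
    detected lane points.

    P_w : (N, 3) lane points from the high-precision map (world coordinates)
    d_c : (M, 2) lane points detected in the image
    T_cv, T_vw : 4x4 homogeneous transforms (camera<-vehicle, vehicle<-world)
    M : 3x3 internal parameter matrix of the camera
    """
    d_c = np.asarray(d_c, dtype=float)
    P_h = np.hstack([P_w, np.ones((len(P_w), 1))])
    P_c = (T_cv @ T_vw @ P_h.T)[:3]              # map points in camera frame
    P_c = P_c[:, P_c[2] > 0]                     # keep points in front of camera
    proj = M @ P_c
    uv = proj[:2] / proj[2]                      # pinhole projection
    # nearest detected lane point for every projected map point
    diff = uv.T[:, None, :] - d_c[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    return dist.mean()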
References
1. Khurana A, Nagla K S (2022) Extrinsic calibration methods for laser range finder and camera: A systematic review. MAPAN - Journal of Metrology Society of India, 36(3): 669-690.
2. Tian J Y, Wu J G, Zhao Q C (2021) Research progress of camera calibration methods in vision system. Chinese Journal of Liquid Crystals and Displays, 36(12): 1674-1692.
3 Depth Image Acquisition
The image obtained by imaging with the camera can be represented by f(x, y). f(x, y)
can also represent an attribute f in space (x, y), and f in a general grayscale image
represents the gray scale or brightness at the pixel (x, y). If the image attribute
f represents depth, such images are called depth maps or depth images, which
reflect the 3D spatial information of the scene.
During general imaging, the 3D scene is projected to the 2D plane, so the collected
2D image f(x, y) does not directly contain the depth (or distance) information of the
scene (with information loss). The objective world is 3D. In order to express it
completely, the images collected from the scene also need to be 3D. The depth image
can be expressed as z = f(x, y). The 3D image f(x, y, z) can be further obtained from
the depth image. The 3D image f(x, y, z) including the depth information can express
the complete information of the scene (including the depth information).
An image is a form of description of an objective scene, and images can be divided into two categories according to the nature of the scene properties they describe: intrinsic images and extrinsic images [2]. An image of the scene is obtained by an observer or collector (sensor). The scene and the scenery in it have some objectively existing characteristics that are independent of the nature of the observer and the collector, such as the surface reflectance, transparency, surface orientation, and movement speed of the scenery, as well as the relative distances between scene objects and their orientations in space. These properties are called (scene) intrinsic properties,
and the images representing the physical quantities of these intrinsic properties are
called intrinsic images. There are many kinds of intrinsic images. Each intrinsic
image can only represent one intrinsic characteristic of the scene, without the
influence of other properties. If the intrinsic image can be obtained, it is very useful
to correctly explain the scenery represented by the image. Depth image is one of the
most commonly used intrinsic images, in which each pixel value represents the
distance (depth, also known as the elevation of the scene) between the scene point
represented by the pixel and the camera. These pixel values actually directly reflect
the shape of the visible surface of the scene (intrinsic property). From the depth
image, the geometric shape of the scene itself and the spatial relationship between
the scenes can be easily obtained.
The physical quantity represented by an extrinsic image is not only related to the
scene but also to the nature of the observer/collector, the conditions of image
acquisition, or the surrounding environment. A typical representative of extrinsic
images is the common grayscale image (gray scale corresponds to brightness or
illumination). The grayscale image reflects the received radiation intensity at the
observation site, and its intensity value is often the result of the combination of
multiple factors, such as the intensity of the radiation source, the orientation of the
radiation mode, the reflection property of the scene surface, as well as the position
and performance of the collector.
The difference between depth image and grayscale image can be explained with
the help of Fig. 3.1. There is an object in the figure. Considering a section (profile) on
it, the depth acquired from the section has the following two characteristics compared with the grayscale image:
1. The pixel values corresponding to the same outer plane of the object change at a certain rate in the depth image (when the plane is inclined relative to the image plane). This rate of change depends on the shape and orientation of the object but has nothing to do with the external lighting conditions. The corresponding pixel value in the
grayscale image depends not only on the illuminance of the surface (this is not
only related to the shape and orientation of the object but also related to the
external lighting conditions) but also depends on the reflection coefficient of the
surface.
2. There are two kinds of edge lines in depth images: one is the (distance) step edge between the object and the background; the other is the ridge edge at the intersection of different regions inside the object (where the depth reaches an extremum but remains continuous). In the grayscale image, both appear as step edges.
Solving many computer vision problems requires the use of extrinsic images to
recover intrinsic properties, that is, to obtain intrinsic images, which can further
explain the scene. In order to recover the intrinsic structure of the scene from
extrinsic images, various image (pre)processing methods are often required. For
example, in the imaging process of grayscale images, many physical information
about the scene are mixed and integrated in the pixel gray value, so the imaging
process can be regarded as a degenerate transformation. However, the physical
information about the scene is not completely lost after being mixed in the grayscale
image. Various preprocessing techniques (such as filtering, edge detection, distance
transformation, etc.) can be used to eliminate the degradation in the imaging process
with the help of redundant information in the image.
Many image understanding problems can be solved with the help of depth images.
There are many ways of depth imaging, which are mainly determined by the mutual
position and movement of the light source, the collector, and the scene. The most
basic imaging method is monocular imaging, that is, one collector is used to collect
an image of the scene at a fixed position. Although depth information about the scene
is not directly reflected in the image at this time, it is still implicit in the imaged
geometric distortion, shading, texture, surface contour, and other factors (Chaps. 7
and 8 will describe how to recover depth information from such images). If two
collectors are used to take images of the same scene at two positions (or one collector
can be used to take images of the same scene at two positions successively, or one
collector is used to obtain two images with the help of an optical imaging system), it
is binocular imaging (see Sect. 3.3.1 and Chap. 5). The parallax (disparity) generated
between the two images (similar to the human eyes) at this time can be used to help
determine the distance between the collector and the scene. If more than two
collectors are used to take images of the same scene at different locations (or one
collector can be used to take images of the same scene at multiple locations one after
the other), it is multi-ocular (multi-eye) imaging (see Chap. 6). Monocular, binoc
ular, or multi-ocular methods can obtain sequence images by continuous shooting in
addition to still images. Compared with binocular imaging, monocular imaging is
simpler to acquire, but it is more complicated to obtain depth information from
it. Conversely, binocular imaging increases the acquisition complexity but reduces
the complexity of acquiring depth information.
In the above discussion, it is considered that the light source is fixed in several
imaging methods. If the collector is fixed relative to the scene and the light source is
moved around the scene, this imaging mode is called light shift imaging (also called
photometric stereo imaging). Since the surface of the same scene has different brightness under different lighting conditions, the images obtained by light shift imaging can be used to recover the surface orientation of the object (but absolute depth information cannot be obtained; see Sect. 7.2 for details). If you keep the light source
fixed and let the collector move to track the scene, or let the collector and the scene
move at the same time, it constitutes active vision imaging (refer to the initiative of
human vision, that is, people will move the body or head according to the needs of
observation to change the perspective and selectively pay special attention to part of
the scene), the latter of which is also called active visual self-motion imaging.
Alternatively, if a controllable light source is used to illuminate the scene,
interpreting the surface shape of the scene through the captured projection pattern
is structured light imaging (see Sect. 3.2.4). In this way, the light source and the
collector can be fixed, while the scenery can be rotated; or the scenery can be fixed,
while the light source and the collector can be rotated around the scenery together.
Some of the properties of light sources, collectors, and sceneries in these modes
are summarized in Table 3.1.
Direct depth imaging refers to the use of specific equipment and devices to directly
obtain distance information to acquire 3D depth images. At present, the most
commonly used methods are mostly based on 3D laser scanning technology, and
other methods include Moire fringe method, holographic interferometry, Fresnel
diffraction, and other technologies. The direct depth imaging methods are mostly
active from the point of view of the signal source.
3D laser scanning can quickly reconstruct various data, such as lines, surfaces, volumes, and 3D models of the measured object. Here are a few related concepts:
1. Laser ranging: a laser beam is emitted to the object by the transmitter, the laser
beam reflected by the object is received by the photoelectric element, and the
timer measures the time from the emission to the reception of the laser beam,
thereby calculating the distance from the transmitter to the target. This kind of
imaging that collects information one point at a time can be regarded as an
extreme special case of monocular imaging, and the result obtained is
z = f(x, y). If such acquisition is repeated to obtain information of a region, it is
closer to ordinary monocular imaging.
2. Reverse engineering: it often refers to a reproduction process of product design
technology, that is, reverse analysis and research on an object product, so as to
deduce and obtain the design elements such as the processing flow, organizational
structure, functional characteristics, and technical specifications of the product, to
produce products with similar functions, but not exactly the same. The data
collection of the objective scene is also a reverse process. The objective information of the scene is obtained, and after analysis and processing, the scene
model is constructed, and the structure of the scene and the spatial relationship
between the scenes are reversed. In reverse engineering, the collection of points
on the surface of product appearance obtained by measuring instruments is called
point cloud. Point clouds are massive collections of points that express the spatial
distribution of the object and the characteristics of the object surface under the
same spatial reference system. A laser point cloud is a large collection of points
acquired by a laser scanner (more introduction in Chap. 4).
3. Reflection intensity: it represents the energy value returned by the reflected laser
wave, similar to the brightness value of gray level. The point cloud is obtained
according to the principle of laser measurement, including 3D coordinates (XYZ)
and laser reflection intensity. The point cloud obtained according to the photogrammetry principle includes 3D coordinates (XYZ) and color information
(RGB).
The attribute of laser point cloud can be expressed with different parameters,
including point cloud density, point position accuracy, and surface normal
vector [3].
1. Point cloud density (ρ): the number of laser points per unit area, corresponding to the point cloud spacing (the average spacing Δd of the laser points) by ρ = 1/Δd².
2. Positional accuracy: the plane and elevation accuracy of the laser point is related
to the conditions of the laser scanner and other hardware, the density of the point
cloud, the surface properties of the object, and the coordinate transformation.
3. Surface normal vector: a single laser point can represent limited object attri
butes, and often the object attributes are expressed jointly by multiple laser points
in the neighborhood around the laser point. If the pixels in the neighborhood are
considered to form an approximate plane or surface, they can be represented by
normal vectors. The vector represented by a straight line perpendicular to a plane is called the normal vector of that plane.
3. Vehicle-mounted laser scanning system
By performing verification and coordinate conversion on the acquired images and point clouds, the system performs geolocation, generates high-precision 3D coordinate information, and produces 3D point clouds of roads and surrounding objects.
4. Airborne laser scanning system
Similar to the vehicle-mounted laser scanning system, the airborne laser
scanning system also includes laser scanners and high-resolution digital cameras.
It integrates GPS and INS and uses various low-, medium-, and high-altitude
aircraft as platforms to obtain 3D spatial information of the observation area.
5. Spaceborne laser scanning system
Based on the satellite platform, it has the ability to actively obtain 3D
information of the global surface and objects. Some satellites are equipped with
geoscience laser altimeter system (GLAS) and advanced topographic laser
altimeter system (ATLAS).
From the principle of 3D laser scanning ranging, the methods can be mainly divided into time-based and space-based modes. The time-based mode can be further divided into the pulse method and the phase method, while the space-based mode is mainly trigonometry.
1. Pulse method
Also known as the time of flight (TOF) method. The distance D is calculated by measuring the round-trip time of the laser from the transmitter to the object and back:

   D = (1/2) c0 t.   (3.1)

Among them, c0 is the speed of light in vacuum (299,792,458 m/s), and t is the round-trip time of the laser.
The distance that can be measured by the pulse method is relatively large, often up to several hundred meters or even several kilometers, but the accuracy is poor (limited by the measurement of time), generally at the centimeter level.
2. Phase method
The ranging principle is similar to that of the pulse method, but the laser signal is modulated. When calculating the distance D, the round-trip optical path is expressed as m whole wavelengths of the wavelength λ plus a fraction dλ, with d ∈ (0, 1), of one wavelength, so that

   D = (m + d) λ/2.   (3.2)

The half wavelength λ/2 is also called the length of the precision measuring ruler,

   λ/2 = c0/(2rF) = c/(2F),   (3.3)
where c is the true speed of light at the time of measurement, r is the refractive
index of the medium, and F is the frequency of the measuring ruler.
In Eq. (3.2), m and d can be determined by the following different methods:
(a) With the help of phase angle measurement: the distance corresponding to the phase delay Δφ is calculated by measuring Δφ of the round-trip laser and changing the wavelength λ of the light; that is,

   d = Δφ/(2π).   (3.4)

(b) With the help of optical path measurement: if the measurement distance is changed by δD, the term dλ/2 in Eq. (3.2) can be eliminated:

   D − δD = D − (λ/2)(Δφ/2π) = (λ/2) m.   (3.5)

In this way, the distance D can be determined by measuring only the integer number of measuring rulers.
(c) With the help of measurement of the modulation frequency: it can be seen from Eq. (3.2) that, by changing the frequency of the modulation light, the mantissa less than the length of one measuring ruler can be made 0:

   D = (λ/2) m.   (3.6)

In this way, the distance D can again be determined by measuring only the integer number of measuring rulers.
The measurable distance of the phase method is generally about 100 m, and the
accuracy is generally at the millimeter level.
3. Trigonometry
The most commonly used is the oblique type (the laser emission axis forms a
certain angle with the normal of the object surface) trigonometry. Consider
Fig. 3.2, where the laser is at the origin of the coordinate system, the Z-axis
points from the laser to the sensor, the Y-axis points from the inside paper, and
the X-axis points from the bottom to the top. In addition, the triangle consisting
of the laser, the sensor, and the object is in the XZ plane, where the distance
L between the laser and the sensors is the known length baseline of the system.
The position of the object point in this coordinate system is determined by the
angle α between the emitted ray and the baseline, the angle β between the reflected ray and the baseline, and the angle φ by which the triangle is rotated around the Z-axis:
   X = L sinα sinβ cosφ / sin(α + β),
   Y = L sinα sinβ sinφ / sin(α + β),   (3.7)
   Z = L cosα sinβ / sin(α + β).
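A direct transcription of Eq. (3.7), as reconstructed above, is given below; the assignment of the three components to the X, Y, and Z axes follows the axis convention stated in the text and should be checked against Fig. 3.2.

import numpy as np

def triangulate(L, alpha, beta, phi):
    """Position of the object point from oblique laser triangulation,
    Eq. (3.7) as reconstructed above (angles in radians, baseline length L)."""
    s = np.sin(alpha + beta)
    X = L * np.sin(alpha) * np.sin(beta) * np.cos(phi) / s
    Y = L * np.sin(alpha) * np.sin(beta) * np.sin(phi) / s
    Z = L * np.cos(alpha) * np.sin(beta) / s
    return np.array([X, Y, Z])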
This method obtains distance information by measuring the time it takes for a light
wave to return from the light source to the sensor after being reflected by the
measured object. Generally, the light source and the sensor are placed in the same
position, so the relationship between the propagation time t and the measured
distance D is shown in Eq. (3.1). Considering that light is actually transmitted in
the air, it should be corrected according to the medium:

   D = (1/2)(c0/r) t,   (3.8)

where r is the refractive index of air, which depends on the air temperature, air pressure, and humidity; in practice, it is generally about 1.00025. For simplicity, in many cases the speed of light can be taken as 3 × 10⁸ m/s and r = 1.
The depth image acquisition method based on time of flight is a typical method to
obtain distance information by measuring the travel time of light waves. Because a
point light source is generally used, it is also called the flying spot method. To obtain
a 2D image, the beam needs to be scanned in 2D, or the object being measured needs
to be moved in 2D. The key to distance measurement in this method is to measure
time accurately: because the speed of light is 3 × 10⁸ m/s, if the spatial distance resolution is required to be 0.001 m (i.e., to be able to distinguish two points or two parallel lines that are 0.001 m apart in space), the time resolution needs to reach about 6.7 × 10⁻¹² s.
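The required timing resolution can be checked with a few lines of arithmetic (illustrative values only):

# Required timing resolution for a given spatial resolution (pulse TOF).
c = 3.0e8                      # speed of light, m/s
delta_d = 0.001                # desired spatial resolution, m
delta_t = 2.0 * delta_d / c    # round trip: ~6.7e-12 s, i.e., a few picoseconds
distance = 0.5 * c * 100e-9    # e.g., a 100 ns round-trip time gives 15 m
print(delta_t, distance)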
This method uses the pulse interval to measure the time, specifically by measuring
the time difference of the pulse wave. Its basic principle block diagram is shown in
Fig. 3.3. The specific frequency laser emitted by the pulsed laser source is directed
forward through the optical lens and the beam scanning system and is reflected after
touching the object. The reflected light is received by another optical lens and enters
the time difference measurement module after photoelectric conversion. The module
simultaneously receives the laser directly sent by the pulsed laser source and
measures the time difference between the emitted pulse and the received pulse.
According to the time difference, the measured distance can be calculated by using
Eq. (3.8). It should be noted here that the initial pulse and echo pulse of the laser
cannot overlap within the working distance range.
Using the above principle, the distance measurement can also be performed by
replacing the pulsed laser with ultrasonic waves. Ultrasound can work not only
under natural light but also inside water. Because the propagation speed of sound
waves is relatively slow, the requirement on the accuracy of time measurement is relatively low; however, because the absorption of sound by the medium is generally large, the
sensitivity of the receiver is required to be high. In addition, due to the large
divergence of sound waves, very high-resolution distance information cannot be
obtained.
Fig. 3.3 Principle block diagram of pulse time interval measurement method
The time difference can also be measured by measuring the phase difference. A
block diagram of the basic principle of a typical method can be seen in Fig. 3.4.
In Fig. 3.4, the laser emitted by the continuous laser source is amplitude-modulated in light intensity at a certain frequency and is split into two paths. One path is directed forward through the optical scanning system and is reflected after reaching the object; the reflected light is filtered to extract its phase. The other path enters the phase difference measurement module and is
compared with the phase of the reflected light. Because the phase has a period of
2π and the measured phase difference ranges from 0 to 2π, the depth measurement D is

   D = (1/2)[(Δφ/2π)(c/Fmod) + k(c/Fmod)] = (λ/2)(Δφ/2π + k),   (3.9)

where c is the speed of light, Fmod is the modulation frequency, Δφ is the phase difference (in radians), k is an integer, and λ = c/Fmod. The possible ambiguity of the depth measurement can be overcome by limiting the range of measurement depth (limiting the value of k). The parameter λ introduced in Eq. (3.9) is a measurement scale: the smaller λ is, the higher the accuracy of the distance measurement. In order to obtain a smaller λ, a higher modulation frequency Fmod should be used.
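A small sketch of the depth computation from Eq. (3.9), as reconstructed above, follows; the example modulation frequency is illustrative.

import numpy as np

def amcw_depth(delta_phi, f_mod, k=0, c=3.0e8):
    """Depth from the measured phase difference, Eq. (3.9) as reconstructed:
    D = (lambda/2) * (delta_phi / (2*pi) + k), with lambda = c / f_mod.
    k must be fixed by limiting the measurement range (a sketch)."""
    lam = c / f_mod
    return 0.5 * lam * (delta_phi / (2.0 * np.pi) + k)

# Example: f_mod = 10 MHz gives lambda = 30 m, hence an unambiguous range of
# 15 m; delta_phi = pi then corresponds to a depth of 7.5 m.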
Fig. 3.4 The principle block diagram of the phase measurement method of amplitude modulation

The time difference can also be measured by measuring the frequency change. The laser emitted by the continuous laser source can be frequency modulated with a linear waveform of a certain frequency. Let the laser frequency be F and the modulating wave frequency be Fmod; the modulated laser frequency then exhibits a linear periodic change between F ± ΔF/2 (where ΔF is the frequency deviation of the laser frequency after modulation). One part of the modulated laser is used as the reference light, and the other part is projected to the object to be measured. After
touching the object, it is reflected and then received by the receiver. The two optical
signals coherently produce a beat frequency signal FB, which is equal to the product
of the slope of the laser frequency change and the propagation time, F_B = 2ΔF Fmod t, so that

   D = c F_B / (4 Fmod ΔF).   (3.11)

Then, from the phase change between the outgoing light wave and the returning light wave,

   Δφ = 2π ΔF t = 4π ΔF D / c,   (3.12)

which again gives

   D = (c / (2ΔF)) (Δφ / 2π).   (3.13)
Comparing Eqs. (3.1) and (3.13), the number of coherent fringes N (which is also
the number of zero crossings of the beat frequency signal in the half-cycle of the
modulation frequency) is obtained:

   N = Δφ/2π = F_B/(2 Fmod).   (3.14)
In practice, the actual distance can be obtained by calibration, that is, according to
the accurate reference distance dref and the measured reference coherent fringe
number Nref using the following equation (by counting the actual coherent fringe
number):
   D = (dref / Nref) N.   (3.15)
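The distance computations of Eqs. (3.11) and (3.15), as reconstructed above, can be sketched as follows; the numerical values in the example are illustrative only.

def fmcw_distance(f_beat, f_mod, delta_f, c=3.0e8):
    """Distance from the beat frequency, Eq. (3.11) as reconstructed:
    D = c * F_B / (4 * F_mod * delta_F)."""
    return c * f_beat / (4.0 * f_mod * delta_f)

def fmcw_distance_calibrated(n_fringes, d_ref, n_ref):
    """Calibrated form of Eq. (3.15): D = (d_ref / N_ref) * N."""
    return d_ref / n_ref * n_fringes

# Example: F_mod = 1 kHz, delta_F = 100 MHz, and a measured beat of 1 MHz
# give D = 3e8 * 1e6 / (4 * 1e3 * 1e8) = 750 m (illustrative numbers only).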
3.2.3 LiDAR
The data obtained by only laser scanning lacks brightness information. If the laser
scanning process is supplemented with camera shooting, the depth information and
brightness information in the scene can be obtained at the same time. Light
detection and ranging (LiDAR) is a typical example.
The principle of LiDAR can be illustrated with the help of Fig. 3.5 [5]. The whole
device is placed on a platform that can tilt and pan and can radiate and receive
amplitude-modulated laser waves. For each point on the surface of the 3D scene, the
waves radiated to and received from that point are compared to obtain information.
These information include both spatial information and intensity information. Spe
cifically, the spatial coordinates X and Y of each point are related to the pitch and
horizontal motion of the platform, and the depth Z is closely related to the phase
difference. The reflection properties of the surface for light at the laser wavelength can be determined by the amplitude difference of the waves. In this way, LiDAR can obtain two registered
images at the same time: one is the depth image and the other is the luminance image.
Note that the depth range of the depth image is related to the modulation period of
the laser wave. If the corresponding modulation wavelength is λ, the same depth will be calculated every λ/2, so the depth measurement range needs to be limited.
Compared with camera acquisition equipment, the acquisition speed of LiDAR is
relatively slow because the phase is calculated for each 3D surface point. According to a similar idea, there are also systems that combine an independent laser scanning
device with a camera acquisition device to simultaneously acquire depth and color
information. One problem this brings is the need for data registration (see Chap. 4).
Structured light method is a commonly used method of active sensing and direct
acquisition of depth images. Its basic idea is to use the geometric information in
lighting to help extracting the geometric information of the scene. Structured light
ranging is carried out by trigonometry. The imaging system is mainly composed of a
camera and a light source, which are arranged in a triangle with the observed object.
The light source generates a series of point or line lasers to illuminate the surface of
the object, and the light-sensitive camera records the illuminated part and then
obtains depth information through triangulation calculation, so it is also called active
triangulation. The ranging accuracy of the active structured light method can reach
the micrometer level, and the measurable depth field range can reach hundreds to
tens of thousands of times of the accuracy.
There are many specific ways to use structured light imaging, including light strip
method, grid method, circular light strip method, crossline method, thick light strip
method, spatial coding template method, color coding strip method, density ratio
method, etc. In addition, with the development of tunable flat optics, there will be
more structured light imaging methods [6]. Due to the different geometric structures
of the projected beams they use, the camera shooting methods and the depth distance
calculation methods are also different, but the common point is that they all utilize
the geometric structure relationship between the camera and the light source.
In the basic light strip method, a single light plane is used to illuminate parts of the
scene in sequence so that a light strip appears on the scene and only this part of the
light strip is detectable by the camera. In this way, a 2D entity (the light plane) is
obtained for each illumination, and then the third dimension (distance) information
of the spatial point corresponding to the visible image point on the light strip can
be obtained by calculating the intersection of the camera line of sight and the light
plane.
When using structured light imaging, the camera and light source should be
calibrated first. Figure 3.6 shows the geometric relationship of a structured light
system. Here, the XZ plane where the lens is located and perpendicular to the light
source is given (the Y-axis goes from the inside of the paper to the outside, and the
light source is a strip along the Y-axis). The laser emitted through the narrow slot
irradiates from the origin O of the world coordinate system to the spatial point
W (on the object surface) to generate a linear projection, and the optical axis of the
camera intersects with the laser beam. In this way, the camera can collect the linear
projection, so as to obtain the distance information at the point W on the object
surface.
In Fig. 3.6, F and H determine the position of the lens center in the world
coordinate system, α is the angle between the optical axis and the projection line,
β is the angle between the z- and Z-axes, φ is the angle between the projection line
and the Z-axis, λ is the focal length of the camera, h is the imaging height (the
distance from the image point to the optical axis of the camera), and r is the distance from
the lens center to the intersection of the z- and Z-axes. It can be seen from the figure
that the distance Z between the light source and the object is the sum of s and d,
where s is determined by the system and d can be obtained by the following
equation:
Z = s + d = s + r·h·csc φ / (λ - h·cot φ)    (3.17)
Equation (3.17) links Z with h (the rest are all system parameters), providing a
way to calculate the object distance according to the imaging height. It can be seen
that the imaging height contains 3D depth information, or the depth is a function of
the imaging height.
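As a minimal numerical sketch of Eq. (3.17), using the symbols as reconstructed above (the function name and the sample values are hypothetical):

```python
import math

def depth_from_imaging_height(h, s, r, phi, lam):
    """Object distance Z from the imaging height h, following Eq. (3.17)
    as reconstructed above: Z = s + r*h*csc(phi) / (lam - h*cot(phi)).
    h   : imaging height (distance of the image point from the optical axis)
    s   : offset determined by the system geometry
    r   : distance from the lens center to the intersection of the z- and Z-axes
    phi : angle between the projection line and the Z-axis, in radians
    lam : focal length of the camera (same length unit as h, s, r)
    """
    d = r * h / math.sin(phi) / (lam - h / math.tan(phi))
    return s + d

# Illustrative numbers only (millimeters and radians)
print(depth_from_imaging_height(h=2.0, s=300.0, r=500.0,
                                phi=math.radians(40.0), lam=35.0))
```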
Structured light imaging can not only give the distance Z of the space point but also
the thickness of the object along the Y direction. For this, the imaging width is
analyzed when the camera views the top plane of the object, as shown in
Fig. 3.7.
Figure 3.7 shows a schematic diagram of the plane determined by the Y-axis and
the center of the lens, where w is the imaging width:
w = (λ' / t) · Y,    (3.18)

where t is the distance from the center of the lens to the vertical projection of point
W on the Z-axis (see Fig. 3.6):

t = [(Z - F)² + H²]^(1/2)    (3.19)

and λ' is the distance along the z-axis from the center of the lens to the imaging plane
(see Fig. 3.6):

λ' = (h² + λ²)^(1/2)    (3.20)

Therefore,

Y = w·t / λ' = w·[(Z - F)² + H²]^(1/2) / (h² + λ²)^(1/2)    (3.21)
In this way, Eq. (3.21) links the object thickness coordinate Y to the imaging
height, system parameters, and object distance.
Moire stripes can be formed when two gratings have a certain inclination and
overlap. The distribution of Moire contour stripes obtained by a certain method
can contain the distance information of the scene surface.
When the grating is projected onto the surface of the scene with projection light, the
undulation of the surface will change the distribution of the projected image. If this
deformed projection image is reflected from the scene surface and then passes
through another grating, Moire contour stripes can be obtained. According to the
transmission principle of the optical signal, the above process can be described as the
result of the secondary spatial modulation of the optical signal. If both gratings are
linear sinusoidal transmission gratings, and the parameter defining the grating period
variation is l, the observed output optical signal is

f(l) = f1{1 + m1 cos[w1 l + θ1(l)]} · f2{1 + m2 cos[w2 l + θ2(l)]}    (3.22)

where fi is the light intensity, mi is the modulation coefficient, θi(l) is the phase change
caused by the fluctuation of the scene surface, and wi is the spatial frequency
determined by the grating period. In Eq. (3.22), the first term on the right side
corresponds to the modulation function of the first grating passed by the optical
signal, and the second term on the right side corresponds to the modulation function
of the second grating passed by the optical signal.
There are four periodic variables of spatial frequency in the output signal f(l) of
Eq. (3.22), namely w1, w2, w1 + w2, and w1 - w2. Since the receiving process of the detector
acts as a low-pass filter on the spatial frequency, only the difference-frequency term
w1 - w2, whose phase is θ1(l) - θ2(l), survives in the light intensity of the Moire
stripes. It can be seen that the distance information from the scene surface is directly
reflected in the phase change of the Moire stripes.
Figure 3.8 shows a schematic diagram of distance measurement using the Moire
stripe method. The light source and the viewpoint are at a distance D, and they have
the same distance from the grating G, and both are H. The grating is a transmissive
line grating with alternating black and white (period R). According to the coordinate
system in the figure, the grating surface is on the XOY plane; the measured height is
along the Z-axis, which is represented by the Z coordinate.
Considering a point A, whose coordinates are (x, y) on the measured surface, the
illuminance of the light source passing through the grating to it is the product of the
intensity of the light source and the transmittance of the grating at point A*. The light
intensity distribution at point A can be expressed as

T1(x, y) = C1 {1/2 + (2/π) Σn (1/n) sin[(2πn/R) · xH/(z + H)]}    (3.25)

where n runs over the odd integers and C1 is a constant related to the intensity. After
T1 passes through the grating G again, it is equivalent to another transmission
modulation at the point A′, and the light intensity distribution at A′ becomes

T2(x, y) = C2 {1/2 + (2/π) Σm (1/m) sin[(2πm/R) · (xH + Dz)/(z + H)]}    (3.26)
where m runs over the odd integers and C2 is a constant related to the intensity. The
final received light intensity at the viewpoint is the product of the two distributions:

T(x, y) = T1(x, y) · T2(x, y)    (3.27)

Expanding Eq. (3.27) and applying the low-pass filtering of the
receiving system, a partial sum containing only the variable z can be obtained [7]:

T(z) = B + S Σn (1/n²) cos[2πnDz / (R(z + H))]    (3.28)

where n runs over the odd integers, B is the background intensity of the Moire stripes,
and S is the contrast of the stripes. Equation (3.28) gives the mathematical description
of Moire contour stripes. Generally, only the fundamental frequency term of
n = 1 is used to approximately describe the distribution of Moire stripes, that is,
Eq. (3.28) can be simplified as

T(z) ≈ B + S cos[2πDz / (R(z + H))]    (3.29)

The following can be seen from Eq. (3.29):
1. Bright stripes appear where the cosine term reaches its maximum, that is, at the heights

ZN = NRH / (D - NR),    N = 1, 2, ...    (3.30)
2. The height difference between any two bright stripes is not equal, so the height
cannot be determined by the number of stripes; only the height difference
between two adjacent bright stripes can be calculated.
3. If the distribution of the phase term θ can be obtained, the height distribution of
the surface of the measured object can be obtained:

Z = RθH / (2πD - Rθ)    (3.31)
The abovementioned basic method requires the use of a grating of the same size as
the measured object, which brings inconvenience to the use and manufacture of the
device. An improved method is to install the grating in the projection system of the
light source and use the magnification capability of the optical system to obtain the
effect of a large grating. Specifically, two gratings are used, which are placed close to
the light source and the viewpoint, respectively. The light source transmits the light
beam through the grating, and the viewpoint is imaged behind the grating.
A practical schematic diagram of ranging using the above projection principle is
shown in Fig. 3.9. Two sets of imaging systems with the same parameters are used,
their optical axes are parallel, two gratings with the same spacing are geometrically
imaged at the same imaging distance, and the projection images of the two
gratings are coincident.
Suppose Moire stripes are observed behind the grating G2, and G1 is used as the
projection grating, then the projection center O1 of the projection system L1 and the
convergence center O2 of the receiving system L2 are equivalent to the light source
point S and the viewpoint W in the basic method, respectively. In this way, as long as
R in Eqs. (3.29) and (3.31) is replaced by MR (M = H/H0 is the imaging
magnification of two optical paths), the distribution of Moire stripes can be described
as above, and the height distribution on the surface of the object to be measured can
be calculated.
In practical applications, the grating in front of the projection system L1 can be
omitted, while the computer software is used to complete its function. At this time,
the projected grating image containing the depth information of the measured object
surface is directly received by the camera.
It can be known from Eq. (3.31) that if the distribution of the phase term θ can be
obtained, the distribution of the height Z of the surface of the measured object can be
obtained. The phase distribution can be obtained by using multiple Moire images
with a certain phase shift. This method is often referred to simply as the phase-shift
method. Taking three images as an example, after obtaining the first image, move the
projection grating horizontally by R/3 distance to obtain the second image, and then
move the projection grating horizontally by R/3 distance to obtain the third image.
Referring to Eq. (3.29), these three images can be expressed as three Moire distributions whose phase terms differ successively by 2π/3.
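A minimal sketch of the phase retrieval, assuming the standard three-step model in which the three images follow Eq. (3.29) with phase offsets of 0, 2π/3, and 4π/3 (the function name and the synthetic self-check are illustrative, not from the text):

```python
import numpy as np

def three_step_phase(i1, i2, i3):
    """Wrapped phase from three fringe images with phase shifts of
    0, 2*pi/3 and 4*pi/3 (a grating translation of R/3 per image shifts
    the fringe phase by 2*pi/3). Derived from I_k = B + S*cos(phase + shift_k).
    i1, i2, i3 : arrays of equal shape (the three captured images)
    Returns the wrapped phase in (-pi, pi].
    """
    i1, i2, i3 = (np.asarray(a, dtype=float) for a in (i1, i2, i3))
    return np.arctan2(np.sqrt(3.0) * (i3 - i2), 2.0 * i1 - i2 - i3)

# Quick self-check with synthetic fringes
phase = np.linspace(-np.pi + 0.1, np.pi - 0.1, 5)
B, S = 100.0, 40.0
imgs = [B + S * np.cos(phase + k * 2 * np.pi / 3) for k in range(3)]
print(np.allclose(three_step_phase(*imgs), phase))   # True
```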
Indirect depth imaging means that the directly obtained images do not have depth
information or do not directly reflect depth information, and they need to be
processed to extract depth information. The human binocular depth vision function
is a typical example. The objective world seen by each eye (imaged on the retina) is
equivalent to a 2D image (which does not directly reflect depth information), but
from the two 2D images seen by both eyes, people perceive the distance of the scene.
This is the result of information processing by the human brain. There are other
methods to obtain depth information indirectly, such as various 3D layered imaging
methods. Indirect depth imaging methods are mostly passive from the perspective of
signal sources and are commonly used in various image processing, analysis, and
understanding techniques.
Binocular imaging can obtain two images of the same scene with different
viewpoints (similar to human eyes), and the binocular imaging model can be
regarded as a combination of two monocular imaging models. In actual imaging,
either two monocular systems can be used to acquire at the same time, or one
monocular system can be used to acquire at two poses in succession (at this time,
the subject and the light source are generally assumed to have no movement
changes).
By generalizing binocular imaging, multi-ocular imaging can also be achieved,
and some examples will be discussed in Chap. 6.
Depending on the relative poses of the two cameras, there are multiple modes of
binocular imaging, and several typical situations are described below.
It can be seen from Fig. 3.10 that the same 3D space point corresponds to points in
two images (two image plane coordinate systems), respectively, and the position
difference between them is called parallax. The relationship between parallax and
depth (object distance) in binocular lateral mode is discussed below with the help of
Fig. 3.11. It is a schematic diagram of the plane (XZ plane) where the two lenses are
connected. Among them, the world coordinate system coincides with the first camera
coordinate system and only has a translation amount B in the X-axis direction with
the second camera coordinate system.
Considering the geometric relationship between the coordinate X of the point
W in 3D space and the coordinate x1 of the projected point on the first image plane,
we can get
|X| / (Z - λ) = |x1| / λ    (3.34)

Similarly, for the projected point on the second image plane,

(B - |X|) / (Z - λ) = (|x2| - B) / λ    (3.35)

Combining the two relations, the parallax is

d = |x1| + |x2| - B = λB / (Z - λ)    (3.36)

so that

Z = λ (1 + B / d)    (3.37)
Equation (3.37) directly relates the distance Z between the object and the image
plane (i.e., the depth in 3D information) to the parallax d. In turn, it also shows that
the size of parallax is related to depth. That is, parallax contains the spatial infor
mation of 3D objects. According to Eq. (3.37), when the baseline and focal length
are known, it is very simple to calculate the Z coordinate of point W after determining
the parallax d. In addition, after the Z coordinate is determined, the world coordinates
X and Y of point W can be calculated by (x1, y1) or (x2, y2) referring to Eqs. (3.34) and
(3.35).
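A short computational sketch of Eqs. (3.34), (3.37), and (3.39) as reconstructed above (function names and numerical values are illustrative assumptions):

```python
def depth_from_parallax(d, B, lam):
    """Depth in the binocular lateral mode, Eq. (3.37): Z = lam * (1 + B / d).
    d : parallax (same length unit as B and lam), B : baseline, lam : focal length."""
    return lam * (1.0 + B / d)

def world_xy(x1, y1, Z, lam):
    """World X and Y of the point from its first-image coordinates,
    following Eq. (3.34): X / (Z - lam) = x1 / lam."""
    return x1 * (Z - lam) / lam, y1 * (Z - lam) / lam

def depth_error(e, Z, B, lam):
    """Depth deviation for a parallax deviation e, Eq. (3.39):
    dZ = e * (Z - lam)**2 / (lam * B + e * (Z - lam))."""
    return e * (Z - lam) ** 2 / (lam * B + e * (Z - lam))

# Illustrative values: 50 mm lenses, 200 mm baseline, 2 mm parallax, 0.01 mm deviation
Z = depth_from_parallax(d=2.0, B=200.0, lam=50.0)          # 5050 mm
print(Z, world_xy(0.5, 0.2, Z, 50.0), depth_error(0.01, Z, 200.0, 50.0))
```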
Now let’s look at the ranging accuracy obtained by parallax. According to
Eq. (3.37), depth information is related to parallax, which is related to imaging
coordinates. Suppose x1 produces a deviation e, that is, x1e = x1 + e; then
de = x1 + e + |x2| - B = d + e, so the distance deviation is

ΔZ = e(Z - λ)² / [λB + e(Z - λ)] ≈ eZ² / (λB + eZ)    (3.39)
The last step is a simplification that considers Z >> λ in general. It can be seen
from Eq. (3.39) that the ranging accuracy is related to the camera focal length, the baseline
length between the cameras, and the object distance. The longer the focal length, the longer
the baseline, the higher the accuracy; the greater the object distance, the lower the
accuracy.
Equation (3.37) gives the expression of the relationship between absolute depth
and parallax. With the help of differentiation, the relationship between depth change
and parallax change is

δZ = -(λB / d²) δd = -[(Z - λ)² / (λB)] δd    (3.40)

So,

|δZ| / Z ≈ (Z / λB) |δd|    (3.41)
If both parallax and parallax change are measured in pixels, it can be known that
the measurement error of relative depth in the scene is (1) proportional to the pixel
size, (2) proportional to the depth Z, and (3) inversely proportional to the baseline
length B between cameras.
In addition, it can also be obtained from Eq. (3.41) that

|δZ| / Z ≈ |δd| / d    (3.42)
It can be seen that the measurement error of relative depth and the measurement
error of relative parallax are numerically the same.
Observe the cylindrical object with a circular cross-section of local radius r using
two cameras, as shown in Fig. 3.12. There is a certain distance between the intersection
of the two camera sight lines and the boundary point of the circular section,
which is the error δ. The task now is to obtain the equation for calculating the error δ.
Fig. 3.13 Schematic diagram of the simplified geometrical structure of the calculated measurement error
δ = r sec(θ/2) - r    (3.44)

tan(θ/2) = B / (2Z)    (3.45)

Substituting θ, we get

δ = r [1 + (B/2Z)²]^(1/2) - r ≈ rB² / (8Z²)    (3.46)

The above equation provides the formula for calculating the error δ. It can be seen
that the error is proportional to r and inversely proportional to Z².
Let the baseline B = Z1 - Z2, and assume that B << Z1 and B << Z2; then we can get
(taking Z² = Z1Z2)

d = λRB / Z²    (3.50)

where R is the radial distance of the space point from the optical axis. In terms of the
(average) radial distance R0 of the image point from the optical axis, this becomes

d = BR0 / Z    (3.51)

so that

Z = BR0 / d = BR0 / (R2 - R1)    (3.52)

Equation (3.51) can be compared with Eq. (3.36); here the parallax depends on
the (average) radial distance R0 between the image point and the optical axis of the
camera, whereas in Eq. (3.36) it is independent of the radial distance. Then, Eq. (3.52) can be
compared with Eq. (3.37); here the depth information of an object point on the
optical axis cannot be given. For other object points, the accuracy of depth information
depends on the radial distance.
In the above binocular lateral imaging mode, in order to determine the information of
a 3D space point, the point needs to be in the common field of view of the two
cameras. If you rotate the two cameras (around the X-axis), you can increase the
common field of view and capture panoramic images. This can be referred to as
stereoscopic imaging with an angular scanning camera, that is, a binocular
angular scanning mode, where the coordinates of the imaging point are determined
by the camera's azimuth and elevation angles. In Fig. 3.15, θ1 and θ2 give the two
azimuth angles (corresponding to the saccade movement around the Y-axis), respectively,
and the elevation angle φ is the angle between the XZ plane and the plane
defined by the two optical centers and the space point W.
Generally, the azimuth angle of the lens can be used to represent the spatial
distance between object images. Using the coordinate system shown in Fig. 3.15, we
have
tan θ1 = |X| / Z    (3.53)

tan θ2 = (B - |X|) / Z    (3.54)

so that

Z = B / (tan θ1 + tan θ2)    (3.55)
Equation (3.55) actually relates the distance Z between the object and the image
plane (i.e., the depth in 3D information) with the tangents of the two azimuths
directly. Comparing Eqs. (3.55) with (3.37), it can be seen that the effects of parallax
and focal length are implicit in the azimuth angle. According to the Z coordinate of
the space point W, its X and Y coordinates can also be obtained as
X = Z tan θ1    (3.56)

Y = Z tan φ    (3.57)
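A minimal sketch of Eqs. (3.55)-(3.57) (hypothetical function name, illustrative angles):

```python
import math

def point_from_angles(theta1, theta2, phi, B):
    """3D coordinates in the binocular angular scanning mode,
    Eqs. (3.55)-(3.57): Z = B / (tan(theta1) + tan(theta2)),
    X = Z * tan(theta1), Y = Z * tan(phi).
    theta1, theta2 : azimuth angles of the two cameras (radians)
    phi            : elevation angle (radians)
    B              : baseline length
    """
    Z = B / (math.tan(theta1) + math.tan(theta2))
    return Z * math.tan(theta1), Z * math.tan(phi), Z

# Example with assumed angles (illustrative only)
print(point_from_angles(math.radians(20), math.radians(25), math.radians(5), B=0.5))
```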
To achieve greater field of view (FOV) coincidence, two cameras can be placed side
by side but with the two optical axes converging. This binocular convergent
horizontal mode can be regarded as a generalization of the binocular lateral mode
(the vergence between binoculars is not zero at this time).
Consider only the case shown in Fig. 3.16, which is obtained by rotating the two
monocular systems in Fig. 3.11 toward each other around their respective centers.
Figure 3.16 shows the plane (XZ plane) where the two lenses are connected. The
distance between the centers of the two lenses (i.e., the baseline) is B. The two
optical axes intersect at the (0, 0, Z ) point in the XZ plane, and the intersection angle
is 29. Now let’s see how to find the coordinates (X, Y, Z) of the 3D space point W if
two image plane coordinate points (x1, y1) and (x2, y2) are known.
First of all, it can be known from the triangle enclosed by the two world
coordinate axes and the camera optical axis that

Z = (B cos θ) / (2 sin θ) + λ cos θ    (3.58)

Now draw vertical lines from point W to the optical axes of the two cameras;
because the angle between the two vertical lines and the X-axis is θ, according to
the relationship of similar triangles, we can get

|x1| / λ = X cos θ / (r - X sin θ)    (3.59)

|x2| / λ = X cos θ / (r + X sin θ)    (3.60)
where r is the distance from the (either) lens center to the point where the two optical
axes converge.
Combining Eqs. (3.59) and (3.60) and eliminating r and X, we get (refer to
Fig. 3.16)
λ cos θ = 2|x1|·|x2| sin θ / (|x1| - |x2|) = 2|x1|·|x2| sin θ / d    (3.61)

Substituting Eq. (3.61) into Eq. (3.58) gives

Z = (B cos θ) / (2 sin θ) + 2|x1|·|x2| sin θ / d    (3.62)

Equation (3.62), like Eq. (3.37), also directly relates the distance Z between the
object and the image plane with the parallax d. In addition, from Fig. 3.16, we can
get
r = B / (2 sin θ)    (3.63)

Substituting this into Eqs. (3.59) and (3.60) gives the X coordinate of point W:

X = B|x1| / [2 sin θ (λ cos θ + |x1| sin θ)]    (3.64)
The case of binocular convergence can also be converted to the case of binocular
parallelism. Image rectification is the process of geometric transformation of the
image obtained by the camera whose optical axis converges to obtain the image
equivalent to the image obtained by the camera whose optical axis is parallel
[8]. Considering the images before and after correction in Fig. 3.17 (represented
by trapezoid and square, respectively), the light emitted from the object point
W intersects with the left image at (x, y) and (X, Y) before and after rectification,
respectively. Each point on the image before rectification can be connected to the
lens center and extended to intersect with the image after rectification. Therefore, for
each point on the image before rectification, its corresponding point on the image
after rectification can be determined. The coordinates of the points before and after
rectification are connected by the projection transformation (a1 to a8 are the coefficients
of the projection transformation matrix):
x = (a1 X + a2 Y + a3) / (a4 X + a5 Y + 1)    (3.65)

y = (a6 X + a7 Y + a8) / (a4 X + a5 Y + 1)    (3.66)
The eight coefficients in the above two equations can be determined with the help
of four groups of corresponding points on the image before and after rectification
(see [9]). Here, it can be considered to use the horizontal polar line [the intersection
of the plane composed of the baseline and a point in the scene and the imaging plane
(see Sect. 5.1.2)]. Therefore, it is necessary to select two polar lines in the image
before rectification and map them to two horizontal lines in the image after rectifi
cation, as shown in Fig. 3.18. The corresponding relationship can be
X1 = x1,  X2 = x2,  X3 = x3,  X4 = x4    (3.67)
The above correspondence can maintain the width of the image before and after
rectification, but there will be scale changes in the vertical direction (in order to map
non-horizontal polar lines to horizontal polar lines). In order to obtain the rectified
image, for each point (X, Y) on the rectified image, Eqs. (3.65) and (3.66) are to be
used to find the corresponding point (x, y) on the image before rectification.
Then, the gray level at the point (x, y) is assigned to the point (X, Y).
The above process should also be repeated for the right image. In order to ensure
that the corresponding polar lines on the rectified left image and right image
represent the same scanning line, it is necessary to map the corresponding polar
lines of the two images to the same scanning line in the rectified images, so the
Y coordinate in Eq. (3.68) should be used when correcting both the left image and
right image.
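The eight coefficients can be estimated by stacking the two equations for each of the four point correspondences into one linear system; a minimal sketch (hypothetical function name, illustrative points) is:

```python
import numpy as np

def projective_coefficients(rectified_pts, original_pts):
    """Estimate a1..a8 of Eqs. (3.65)-(3.66) from four corresponding points.
    rectified_pts : 4x2 array of (X, Y) on the rectified image
    original_pts  : 4x2 array of (x, y) on the image before rectification
    Returns the vector (a1, ..., a8) such that
        x = (a1*X + a2*Y + a3) / (a4*X + a5*Y + 1)
        y = (a6*X + a7*Y + a8) / (a4*X + a5*Y + 1)
    """
    A, b = [], []
    for (X, Y), (x, y) in zip(np.asarray(rectified_pts, float),
                              np.asarray(original_pts, float)):
        A.append([X, Y, 1, -X * x, -Y * x, 0, 0, 0]); b.append(x)
        A.append([0, 0, 0, -X * y, -Y * y, X, Y, 1]); b.append(y)
    return np.linalg.solve(np.asarray(A), np.asarray(b))

# Example: four assumed correspondences (illustrative values only)
rect = [(0, 0), (100, 0), (100, 100), (0, 100)]
orig = [(2, 3), (98, 5), (95, 104), (4, 99)]
print(projective_coefficients(rect, orig))
```

The gray level sampled at the computed point (x, y) is then assigned to (X, Y), as described above.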
When using the binocular lateral mode or the binocular convergent horizontal mode, the
parallax needs to be calculated according to the triangle method, so the baseline
cannot be too short; otherwise, the accuracy of depth calculation will be affected.
However, when the baseline is long, the problem caused by field of view
misalignment will be more serious. At this time, the binocular axial mode, also
known as binocular longitudinal mode, can be considered. That is, the two cameras
are arranged in turn along the optical axis. This situation can also be seen as moving
the camera along the optical axis and acquiring the second image at a position closer
to the object than the first image, as shown in Fig. 3.19. In Fig. 3.19, only the XZ
plane is drawn, and the Y-axis is pointed from the inside to the outside of the paper.
The origin of the two camera coordinate systems that obtain the first image and the
second image only differs by B in the Z direction, where B is also the distance
(baseline) between the optical centers of the two cameras.
According to the geometric relationship in Fig. 3.19, we have
X / (Z - λ) = |x1| / λ    (3.69)

X / (Z - λ - B) = |x2| / λ    (3.70)
Combining Eqs. (3.69) and (3.70) can provide (only X is considered, similar for
Y)
X = B |x1|·|x2| / [λ (|x2| - |x1|)] = B |x1|·|x2| / (λd)    (3.71)

Z = λ + B |x2| / (|x2| - |x1|) = λ + B |x2| / d    (3.72)
Compared with the binocular lateral mode, the common field of view of the two
cameras in binocular axial mode is the field of view of the camera in front (the
camera that obtained the second image in Fig. 3.19), so the boundary of the common
field of view is easily determined, and the problem that the 3D space point is only
seen by one camera due to occlusion can be basically ruled out. However, since the
two cameras basically use the same angle of view to observe the scene, the benefits
of lengthening the baseline for depth calculation accuracy cannot be fully reflected.
In addition, the accuracy of parallax and depth calculation is related to the distance
between the 3D space point and the optical axis of the camera (e.g., as in Eq. 3.72,
the depth Z is related to |x2|, that is, the distance between the projection of the 3D
space point and the optical axis), which is different from the binocular
horizontal mode.
The relative height of the ground object can be obtained by taking two images of
the object in the air with the camera carried by an aircraft. In Fig. 3.20, W represents
the moving distance of the camera, H is the camera height, h is the relative
height difference between the two measuring points A and B, and (d1 - d2) corresponds
to the parallax between A and B in the two images. When d1 and d2 are much less
than W and h is much less than H, h can be simplified as follows:

h = H (1 - d2 / d1)    (3.73)
If the above conditions are not met, the X and Y coordinates in the image need to
be corrected as follows:
x′ = x (H - h) / H    (3.74)

y′ = y (H - h) / H    (3.75)
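A small sketch of the aerial-height relations, assuming Eq. (3.73) in the reconstructed form h = H(1 - d2/d1) and the corrections of Eqs. (3.74)-(3.75); names and values are illustrative:

```python
def relative_height(d1, d2, H):
    """Relative height difference between two ground points from their
    parallaxes d1 and d2 in the two aerial images (Eq. 3.73 as
    reconstructed here): h = H * (1 - d2 / d1)."""
    return H * (1.0 - d2 / d1)

def correct_image_coords(x, y, h, H):
    """Coordinate correction of Eqs. (3.74)-(3.75) when the simplifying
    assumptions do not hold: x' = x*(H - h)/H, y' = y*(H - h)/H."""
    s = (H - h) / H
    return x * s, y * s

# Example with assumed flight height and parallaxes (illustrative only)
h = relative_height(d1=12.4, d2=11.9, H=1500.0)
print(h, correct_image_coords(35.0, -12.0, h, 1500.0))
```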
When the object is close, the object can be rotated to obtain two images. A
schematic diagram is given in Fig. 3.21a, where δ represents a given rotation angle.
At this time, the horizontal distances between the two object points A and B are
different in the two images, d1 and d2, respectively, as shown in Fig. 3.21b.
Fig. 3.21 Rotating the object to obtain two images to measure the relative height
The inclination angle θ of the line connecting A and B then satisfies

θ = tan⁻¹[(cos δ - d2/d1) / sin δ]    (3.76)
The origin of single-pixel imaging can be traced back to the point-by-point scanning
imaging more than 100 years ago. Nowadays, single-pixel imaging mainly refers to
using only a single-pixel detector without spatial resolution to record image infor
mation. The term single-pixel imaging first appeared in a 2008 publication [10],
a work combining it with compressed sensing. In that work, a digital
image of the object, and a single-pixel detector is used to obtain the total energy of
the modulated image. In the next reconstruction process, the Haar wavelet basis is
used as the sparse sampling basis to realize the sparse transformation of the image, so
that the system can recover a clear reconstructed image from the underdetermined
sampling data. In the same year, some researchers proposed a theoretical model of
computational ghost imaging (CGI) based on the correlation of thermal light
intensity [11]. By using a spatial light modulator to generate a light field with
known spatial intensity distribution, this provides another implementation scheme
for single-pixel imaging.
The imaging schemes for single-pixel imaging and computational ghost imaging are
essentially the same [12]. In terms of implementation methods, the main difference
between early single-pixel imaging and computational ghost imaging is the location
of the spatial modulation of the optical signal in the imaging system. Figure 3.22
shows the flowcharts of the two schemes. Among them, Fig. 3.22a
shows the flowchart of the single-pixel imaging scheme. The light source illuminates
the scene, and the reflected or transmitted light signal will be spatially modulated by
a digital micromirror device (DMD), then passed through the lens, and finally
received by the single-pixel detector. Figure 3.22b shows the flowchart of the
computational ghost imaging scheme. The light signal is first spatially modulated
by the DMD and then illuminates the scene. The reflected or transmitted light signal
passes through the lens and is finally received by the single-pixel detector. In
contrast, the computational ghost imaging scheme uses a pre-modulation strategy
to modulate the light emitted by the light source, also known as structured illumi
nation; the single-pixel imaging scheme uses a post-modulation strategy, and the
light signal reflected or transmitted from the scene is modulated, also known as
structured detection. Although the processes are somewhat different, their image
reconstruction algorithms can be generalized.
The imaging models for single-pixel imaging and computational ghost imaging
are as follows. Consider a 2D image I ∈ R^(K×L) with N pixels: N = K × L. To acquire
this image, a series of modulated mask patterns with spatial resolution needs to be fed
to the DMD. For the modulation mask sequence, P = [P1, P2, ..., PM] ∈ R^(M×K×L),
where Pi ∈ R^(K×L) represents the i-th frame modulation mask and M represents the
number of modulation masks. A single-pixel detector captures M total light intensity
values S = [s1, s2, ..., sM] ∈ R^M. If the 2D image I is expanded into a vector form,
that is, I ∈ R^N, and the modulation mask sequence is expressed in a 2D matrix form
P ∈ R^(M×N), then

P I = S    (3.78)

The imaging task is to use the measurement signal sequence S captured by the detector to solve for the 2D image I. If
we multiply both sides of Eq. (3.78) by the inverse matrix of P, we can get the image:

I = P⁻¹ S    (3.79)

The premise for Eq. (3.79) to hold is that the number of modulation masks
M = N. In addition, the matrix P needs to be orthogonal to ensure that Eq. (3.79) has a unique
solution.
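A minimal numerical sketch of the model of Eqs. (3.78)-(3.79), using an orthogonal Hadamard pattern matrix so that the inversion is trivial; in a real system the DMD produces 0/1 masks, and +/-1 patterns are usually realized differentially, so everything here is illustrative:

```python
import numpy as np
from scipy.linalg import hadamard

# Minimal numerical sketch of P I = S of Eq. (3.78), with an orthogonal
# Hadamard pattern matrix so that Eq. (3.79) reduces to I = P^T S / N.
K = L = 8                      # image size (N = K*L must be a power of 2 here)
N = K * L

rng = np.random.default_rng(0)
image = rng.random((K, L))     # stand-in for the unknown scene image
I_vec = image.reshape(N)       # flatten the image into a vector

P = hadamard(N).astype(float)  # M = N orthogonal +/-1 masks, one per row
S = P @ I_vec                  # single-pixel measurements (total intensities)

I_rec = (P.T @ S) / N          # reconstruction, since P P^T = N * Identity
print(np.allclose(I_rec.reshape(K, L), image))   # True
```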
The practical single-pixel camera has flexible imaging ability and high photoelectric
conversion efficiency, which greatly reduces the requirements for high-complexity
and high-cost photodetectors. Its imaging process is shown in Fig. 3.23.
An optical lens is used to project the object illuminated by the light source in the
scene onto the digital micromirror device (DMD). The DMD is a device that realizes light
modulation by reflecting incident light with a large number of tiny mirrors. Each unit
in the micromirror array can be controlled by a voltage signal to mechanically
flip by plus or minus 12°, so that the incident light is either reflected at a
symmetrical angle or absorbed without output. Thus, a random
measurement matrix consisting of 1s and 0s is formed. The light reflected at a
symmetrical angle is received by a photosensitive diode (a fast, sensitive,
low-cost, and efficient single-pixel sensor commonly used at present; an
avalanche photodiode is also used in low light). Its voltage changes with the intensity of
the reflected light. After quantization, a measurement value can be given. The
random measurement mode of each DMD corresponds to a row in the measurement
matrix. At this time, if the input image is regarded as a vector, the measurement
result is their dot product. Repeat this projection operation M times, and
M measurement results can be obtained by randomly configuring the turning angle
of each micromirror on the DMD for M times. According to the total variation
reconstruction method (which can be realized by DSP), the image can be
reconstructed with M measured values far smaller than the original scene image
pixels (its resolution is the same as that of the micromirror array). This is equivalent
to the realization of image data compression in the process of image acquisition.
Compared with traditional multi-pixel array scan cameras, single-pixel cameras
have the following advantages:
1. The energy collection efficiency of single-pixel detection is higher, and the dark
noise is lower, which is suitable for extremely weak light and long-distance
scenes.
2. The single-pixel detection sensitivity is high, the advantages in invisible light
band and unconventional imaging are obvious, and the cost is low.
3. High temporal resolution, which can be used for 3D imaging of objects.
Of course, the biggest disadvantage of single-pixel imaging is that it requires
multiple measurements to image (essentially trading time for space). Theoretically, if
an image with N pixels is to be captured, at least N mask patterns are required for
modulation, that is, at least N measurements are required, which greatly limits the
development of single-pixel imaging applications. In practice, due to the sparseness
of natural image signals, the compressive sensing (CS) algorithm can be used to
reduce the number of measurements and make it practical.
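As an illustration of how sparsity allows M << N measurements, the following sketch measures a DCT-sparse test image with random +/-1 masks and estimates it by iterative soft-thresholding (ISTA); this is a generic compressive-sensing demonstration, not the specific algorithm used in [10], and all sizes and parameters are illustrative:

```python
import numpy as np
from scipy.fft import dctn, idctn

# Generic compressive-sensing sketch: measure a DCT-sparse test image with
# M << N random +/-1 masks and estimate it by iterative soft-thresholding
# (ISTA) applied to its 2D DCT coefficients.
rng = np.random.default_rng(1)
K = L = 16
N = K * L
M = N // 2                                    # half of N measurements

alpha_true = np.zeros((K, L))                 # sparse DCT coefficients
alpha_true[tuple(rng.integers(0, K, (2, 8)))] = rng.normal(0.0, 1.0, 8)
image = idctn(alpha_true, norm='ortho')       # test scene

Phi = rng.choice([-1.0, 1.0], size=(M, N))    # random measurement masks
y = Phi @ image.reshape(N)                    # single-pixel measurements

A = lambda a: Phi @ idctn(a, norm='ortho').reshape(N)          # coeffs -> data
At = lambda r: dctn((Phi.T @ r).reshape(K, L), norm='ortho')   # adjoint

step = 1.0 / np.linalg.norm(Phi, 2) ** 2      # DCT is orthonormal, so ||A|| = ||Phi||
lam = 0.05                                    # l1 regularization weight
alpha = np.zeros((K, L))
for _ in range(500):                          # ISTA iterations
    g = alpha + step * At(y - A(alpha))                              # gradient step
    alpha = np.sign(g) * np.maximum(np.abs(g) - lam * step, 0.0)     # soft threshold

# relative error of the ISTA estimate
print(np.linalg.norm(idctn(alpha, norm='ortho') - image) / np.linalg.norm(image))
```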
The imaging effect of a single-pixel camera is still a certain distance away from
the imaging effect of a common CCD camera. Figure 3.24 presents a set of
examples. Figure 3.24a is the imaging effect of the white letter R on a piece of
black paper, and the number of pixels is 256 × 256; Fig. 3.24b is the imaging effect
of a single-pixel camera, where M is 16% of the number of pixels (11,000 measurements
were taken); Fig. 3.24c is another imaging effect with a single-pixel camera,
where M is 2% of the number of pixels (1300 measurements were taken).
As can be seen from the figure, although the number of measurements has been
reduced, the quality is not comparable. In addition, mechanically flipping the
micromirror takes a certain amount of time, so the imaging time to obtain a sufficient
number of measurements is relatively long (often measured in minutes). In fact, in
the visible light range, imaging with a single-pixel camera costs more than a typical
CCD camera or CMOS camera. This is because the visible spectrum is consistent
with the photoelectric response region of silicon materials, so the integration of CCD
or CMOS devices is high and the price is low. But in other spectral ranges, such as
the infrared spectrum, single-pixel cameras have advantages. Since detection devices
in the infrared spectrum are expensive, single-pixel cameras can compete with them.
Other application areas of single-pixel imaging include imaging radar (LiDAR), terahertz
imaging, and X-ray imaging. Since single-pixel imaging provides a solution that can
be imaged with a single detector and a large field of view illumination, single-pixel
cameras have potential in some imaging needs where detection and illumination
technologies are immature.
The previous discussion is all about 2D imaging, and it is also convenient to extend
single-pixel imaging from 2D imaging to 3D imaging. At present, the main methods
can be divided into two types: direct method and reconstruction method.
1. Direct method
The expansion of single-pixel imaging from 2D imaging to 3D imaging has
natural advantages. Since the single-pixel detector acquires the scene point by point,
the arrival time of the light at the detector can be recorded, and depth information can be
measured by measuring the time of flight of light with the help of the time-of-flight
method in Sect. 3.2.2 [13]. In order to obtain high-precision depth information,
such methods require detectors with high temporal resolution.
2. Reconstruction method
Combining 2D images obtained from different angles with single-pixel imag
ing results in a 3D image with depth information. For example, 3D reconstruction
can be performed using photometric stereo techniques (see Sect. 7.2) and multiple
single-pixel detectors located at different locations [14]. Among them, the corre
lation between the light intensity sequence captured by each single-pixel detector
and the corresponding structured mask sequence is used to reconstruct the 2D
image. Since each 2D image can be regarded as the imaging result from different
angles of the scene, the 3D image can be reconstructed accurately through pixel
matching.
Two types of schemes that are more common in current single-pixel 3D imaging
research and applications are introduced below.
Methods based on intensity information mainly rely on the technique of shape from
shading (see Sect. 9.1). An experimental device that reconstructs the scene surface
based on the shape from shading places four single-pixel detectors equidistant from
the top, bottom, left, and right of the light source to detect the light field projected on
the scene surface and reflected back [15]. Because the single-pixel detector has no
spatial resolution capability, the projection device at the light source will uniquely
determine the resolution of the reconstructed object. According to the reciprocity
principle of the imaging system, the distribution orientation of the single-pixel
detector will determine the shade distribution of the reconstructed object. Here,
four single-pixel detectors with different viewing angles will obtain different
detection values due to the change of the orientation of the scene surface. After
calculation, 2D images with different light and dark distributions at different viewing
angles can be obtained.
Further, under the premise that the reflectivity of the scene surface is the same
everywhere, the brightness value is reconstructed according to the detection values
of different detectors, and the surface normal vectors of different pixel points can be
obtained. With the surface normal vectors, the gradient distribution among adjacent
pixels can be further obtained. Starting from a given point on the surface of the
object, the depth change of the object can be initially obtained, and the 3D recon
struction effect of classic stereovision can be obtained after the subsequent optimi
zation steps. On this basis, a 3D single-pixel video imaging method has also
emerged, which uses a single-pixel compression method called evolutionary com
pressed sensing, which can preserve spatial resolution to a large extent while
projecting at high frame rates [16].
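A minimal sketch of the normal-estimation step under a Lambertian assumption, given the four shaded 2D reconstructions and the (assumed known) detector directions; this is a generic photometric-stereo least-squares solve, not the exact pipeline of [15]:

```python
import numpy as np

def normals_from_four_views(images, directions):
    """Least-squares surface normals under a Lambertian model.
    images     : array of shape (4, H, W), the four shaded 2D reconstructions
    directions : array of shape (4, 3), unit vectors of the four detector
                 (or effective lighting) directions, assumed known
    Returns (normals, albedo) with shapes (H, W, 3) and (H, W)."""
    I = np.asarray(images, float)
    Lmat = np.asarray(directions, float)
    H, W = I.shape[1:]
    # Solve Lmat @ g = i for every pixel at once; g encodes albedo * normal
    G, *_ = np.linalg.lstsq(Lmat, I.reshape(4, H * W), rcond=None)
    G = G.T.reshape(H, W, 3)
    albedo = np.linalg.norm(G, axis=-1)
    normals = G / np.maximum(albedo[..., None], 1e-12)
    return normals, albedo

# Synthetic check: a single normal observed from four assumed directions
dirs = np.array([[0.3, 0, 1], [-0.3, 0, 1], [0, 0.3, 1], [0, -0.3, 1]], float)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
n_true = np.array([0.2, -0.1, 0.97]); n_true /= np.linalg.norm(n_true)
imgs = (dirs @ n_true).reshape(4, 1, 1) * np.ones((4, 8, 8))
n_est, _ = normals_from_four_views(imgs, dirs)
print(np.allclose(n_est[0, 0], n_true))   # True
```

The recovered normals can then be integrated into a depth map, as the text describes, by accumulating the gradients from a chosen starting point.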
The principle of this type of method is similar to structured light imaging (see Sect.
3.2.4). The common ones mainly include Fourier single-pixel 3D imaging technol
ogy and single-pixel 3D imaging technology based on digital grating.
1. Fourier single-pixel 3D imaging technology
Fourier single-pixel imaging (FSI) reconstructs high-quality images by
acquiring the Fourier spectrum of the object image [17]. FSI uses phase shift to
generate structured light illumination for spectrum acquisition and then performs
inverse Fourier transform on the obtained spectrum to achieve a reconstructed
image. For a rectangular image composed of N pixels with a length K and a width
L, let u and v denote the spatial frequencies of the image in the X direction and the
Y direction, respectively; the resulting Fourier pattern can be expressed as

Pφ(x, y; u, v) = a + b cos[2π(ux/K + vy/L) + φ]    (3.80)

where φ represents the phase and a and b are the average intensity and the contrast of
the pattern. Specifically, in order to obtain the Fourier coefficients,
it is necessary to set different phase values at the same frequency to solve the
spectrum. The commonly used four-step phase-shift method uses four equally
spaced phases between 0 and 2π. If the corresponding four single-pixel detection
values are recorded as D0, Dπ/2, Dπ, and D3π/2, the Fourier coefficient corresponding
to the spatial frequency (u, v) can be expressed as

F(u, v) ∝ (D0 - Dπ) + j (Dπ/2 - D3π/2)    (3.81)

Equation (3.81) shows that for an image with N pixels, 4N samplings are required
to fully recover its image information.
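A small sketch of assembling a Fourier coefficient from the four detector readings and inverting the acquired spectrum; with the pattern of Eq. (3.80) as written above, the combination below equals 2b·F(u, v), but sign and scaling conventions can differ between implementations:

```python
import numpy as np

def fourier_coefficient(d0, d_half_pi, d_pi, d_3half_pi):
    """Complex Fourier coefficient of one spatial frequency from the four
    detector readings of the four-step scheme; with the pattern of
    Eq. (3.80), (D0 - Dpi) + j*(Dpi/2 - D3pi/2) equals 2b * F(u, v)."""
    return (d0 - d_pi) + 1j * (d_half_pi - d_3half_pi)

def fsi_reconstruct(spectrum):
    """Image from the fully acquired (conjugate-symmetric) spectrum, a K x L
    complex array indexed as a standard DFT, via an inverse FFT."""
    return np.real(np.fft.ifft2(spectrum))
```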
φ(x, y) = arctan{[I1(x, y) - I3(x, y)] / [I4(x, y) - I2(x, y)]}    (3.83)

As above, a wrapped phase with phase jumps within [-π, π] can be obtained.
After spatial phase unwrapping, the absolute phase distribution of the object can be
obtained.
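A minimal sketch of Eq. (3.83) followed by 2D spatial phase unwrapping (using scikit-image's unwrap_phase; the subsequent phase-to-height conversion depends on the particular system and is not included):

```python
import numpy as np
from skimage.restoration import unwrap_phase

def fringe_phase(i1, i2, i3, i4):
    """Wrapped phase of Eq. (3.83), arctan[(I1 - I3) / (I4 - I2)], computed
    with arctan2 so the result covers (-pi, pi]."""
    i1, i2, i3, i4 = (np.asarray(a, dtype=float) for a in (i1, i2, i3, i4))
    return np.arctan2(i1 - i3, i4 - i2)

# wrapped phase -> absolute phase by 2D spatial unwrapping:
# absolute_phase = unwrap_phase(fringe_phase(I1, I2, I3, I4))
```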
Binocular vision can realize human stereoscopic vision, that is, to perceive the
depth based on the difference of image positions between the left and right retinas,
because each eye has a slightly different angle to observe the scene (resulting in
binocular parallax). In order to make full use of this ability of human vision, we
should not only consider the characteristics of human vision when obtaining images
from nature but also consider the characteristics of human vision when displaying
these images to people.
In helmet-mounted displays (HMD), there are mainly three types of visual displays
used [19]: (1) monocular (one eye is used to view one image), (2) binocular
(biocular; two eyes are used to view two identical images), and (3) binocular
stereo (two eyes are used to view two different images). The connections and
differences of three types are shown in Fig. 3.25.
Currently, there is a trend from monocular to binocular in terms of display. The
original display was monocular, and the current display is binocular but has two
independent optical paths, so it is possible to use different images to display
information to create stereo depth in binocular stereo mode (to display stereo or
3D image).
In the real world, when a person focuses the gaze on an object, the two eyes converge
so that the object falls on the fovea of each retina, with little or no parallax. This
situation is shown in Fig. 3.26.
In Fig. 3.26, Object F is the object in focus, and the arc passing through this
fixation point is called the horopter (the locus of all the
points in space that fall on corresponding image points of the retinas of the two
eyes). This is equivalent to setting a baseline from which the relative depth can be
judged. On either side of the line of sight, there is a region where images can be
fused, where a single object at a different depth from the focal object can be
perceived. This spatial region is known as the Panum’s zone of fusion or zone of
clear single binocular vision (ZCSBV). Objects falling in front of Panum’s zone
(cross-parallax zone) or behind (non-cross-parallax zone) will exhibit diplopia,
which is less reliable or accurate though still facilitating depth perception in the
form of qualitative stereopsis. For example, Object A is inside the Panum’s zone of
fusion and will therefore be considered as a single image point, while Object B is
outside the Panum’s zone of fusion and thus will have diplopia.
The spatial positioning of the horopter and the resulting constant changes in the
Panum's zone of fusion will depend on where the individual focuses and on the
vergence of the eyes. Its size will vary between and within individuals, depending on
factors such as fatigue, brightness, and pupil size, but a difference of about 10-15
arcmin from the focus can provide a clear depth cue when viewed in the fovea. However,
the range generally considered to provide comfortable binocular viewing without
adverse symptoms is only the middle third of the Panum's zone of fusion (approximately
0.5 crossed and uncrossed diopters).
References
1. Liu XS, Li AH, Deng ZJ, et al. (2022) Advances in three-dimensional imaging technologies
based on single-camera stereo vision. Laser and Optoelectronics Progress, 59(14): 87-105.
2. Ballard DH, Brown CM (1982) Computer Vision, Prentice-Hall, London.
3. Li F, Wang J, Liu XY, et al. (2020) Principle and Application of 3D Laser Scanning. Earthquake
Press, Beijing.
4. Yang BS, Dong Z (2020) Point Cloud Intelligent Processing. Science Press, Beijing.
5. Shapiro L, Stockman G (2001) Computer Vision. Prentice Hall, London.
6. Dorrah AH, Capasso F (2022) Tunable structured light with flat optics. Science 376(6591):
367-377.
7. Liu XL (1998) Optical Vision Sensing. Beijing Science and Technology Press, Beijing.
8. Goshtasby AA (2006) 2-D and 3-D Image Registration: for Medical, Remote Sensing, and
Industrial Applications. Wiley-Interscience, Hoboken.
9. Zhang Y-J (2017) Image Engineering, Vol.1: Image Processing. De Gruyter, Germany.
10. Duarte MF, Davenport MA, Takbar D, et al. (2008) Single-pixel imaging via compressive
sampling. IEEE Signal Processing Magazine 25(2): 83-89.
11. Shapiro JH (2008) Computational ghost imaging. Physical Review A 78(6): 061802.
12. Zhai XL, Wu XY, Sun YW, et al (2021) Theory and approach of single-pixel imaging. Infrared
and Laser Engineering 50(12): 20211061.
13. Howland GA, Lum DJ, Ware MR, et al. (2013) Photon counting compressive depth mapping.
Optics Express 21(20): 23822-23837.
14. Sun BQ, Edgar MP, Bowman R, et al. (2013) 3D computational imaging with single-pixel
detectors. Science 340(6134): 844-847.
15. Sun BQ, Jiang S, Ma YY, et al. (2020) Application and development of single pixel imaging in
the special wavebands and 3D imaging. Infrared and Laser Engineering 49(3): 0303016.
16. Zhang Y, Edgar MP, Sun BQ, et al. (2016) 3D single-pixel video. Journal of Optics 18(3):
35203.
17. Zhang ZB, Ma X, Zhang JG (2015) Single-pixel imaging by means of Fourier spectrum
acquisition. Nature Communications 6(1): 1-6.
18. Radwell N, Mitchell KJ, Gibson GM, et al. (2014) Single-pixel infrared and visible microscope.
Optica 1(5): 285-289.
19. Posselt BN, Winterbottom M (2021) Are new vision standards and tests needed for military
aircrew using 3D stereo helmet-mounted displays? BMJ Military Health 167: 442-445.
Chapter 4
3D Point Cloud Data and Processing
3D point cloud data can be obtained by laser scanning or photogrammetry and can
also be seen as a representation of 3D digitization of the physical world. Point cloud
data is a kind of temporal and spatial data. Its data structure is relatively simple, its
storage space is relatively compact, and its representation of local details of complex
surfaces is relatively complete. It has been widely used in many fields [1]. However,
3D point cloud data often lacks correlations with each other, and the amount of data
is very large, which brings many challenges to its processing [2].
The sections of this chapter will be arranged as follows.
Section 4.1 first gives an overview of point cloud data, mainly the sources of point
cloud data, its different forms, and its processing tasks.
Section 4.2 discusses the preprocessing of point cloud data, which includes point
cloud hole filling, point cloud data denoising, point cloud data reduction or
compression, multi-platform point cloud data registration, as well as point cloud
data and image data registration.
Section 4.3 introduces the modeling of laser point clouds and discusses the
Delaunay triangulation method and the patch fitting method, respectively.
Section 4.4 introduces texture mapping for 3D models and discusses color texture
mapping, geometric texture mapping, and procedural texture mapping, respectively.
Section 4.5 introduces the description of local features of the point cloud and
presents description methods using the orientation histogram label, the rotational
projection statistics, and the tri-orthogonal local depth map, respectively.
Section 4.6 discusses deep learning methods in point cloud scene understanding,
mainly the challenges faced and various network models.
Section 4.7 introduces the registration of point cloud data with the help of bionic
optimization, analyzes cuckoo search and improved cuckoo search techniques, and
discusses their application in point cloud registration.
Point cloud data has its particularity in terms of acquisition method, acquisition
equipment, data form, storage format, etc., which also leads to differences in its
processing tasks and requirements.
At present, the acquisition methods of point cloud data mainly include laser scanning
methods and photogrammetry methods, which complement each other in practice,
such as LiDAR described in Sect. 3.2.3.
The laser scanning mode integrates laser scanners, global positioning systems,
and inertial measurement units on different platforms to jointly calculate the position
and attitude information of the laser transmitter and the distance to the object region
to obtain 3D point cloud in the object region.
The photogrammetry mode restores the positions and attitudes of the captured
multi-view images with specific professional software and generates a dense image
point cloud with color information.
The difference between point cloud data obtained by 3D laser scanning technol
ogy and photogrammetry technology is as follows:
1. Different data sources: 3D laser scanning technology directly collects laser point
clouds (without reprocessing); the obtained point clouds have high precision,
uniform data, regular distribution, and obvious separation between objects.
Photogrammetry technology uses continuous shooting with certain characteristics:
the overlapping images are spatially positioned, and a point cloud of
densified points is obtained by means of adjustment. These point clouds often
have large fluctuations, cluttered distribution, high noise, and poor accuracy. All
points in the point cloud are often connected as a whole.
2. The data registration methods are different: the registration of the laser point
cloud is carried out through the coordinate registration of the points with the same
name between each station; the point cloud of photogrammetry generates the
overall point cloud according to the methods of internal orientation, relative
orientation, and absolute orientation.
3. Different coordinate conversion methods: 3D laser scanning only needs to mea
sure control points during the transformation of geodetic coordinates, and relative
coordinates can also be used; photogrammetry often requires auxiliary control
measurements for high-precision reconstruction of 3D point cloud data.
4. The measurement accuracy is different: the distribution of laser point clouds is
regular and uniform, and the accuracy is high; the photogrammetric point cloud is
generated by point densification, a process that is greatly affected by
the image matching accuracy, so its accuracy is relatively low.
5. The construction methods of 3D models are different: laser point clouds generate
3D ground models by filtering out ground points, and accurate 3D building
models can then be constructed.
Point cloud data based on different acquisition methods have their own characteris
tics, and some advantages and disadvantages of different point cloud data types are
shown in Table 4.2 [3].
After acquiring the 3D point cloud, multiple processing steps need to be performed to make
full use of it. The main contents include the following [4]:
1. Point cloud quality improvement
Point cloud quality improvement includes point cloud position correction,
point cloud reflection intensity correction, point cloud data attribute
integration, etc.
2. Point cloud model construction
The point cloud model construction performs the construction of a data
model (responsible for basic operations such as point cloud storage, management,
query, index, etc., as well as the design of data model and logic model), a
processing model (responsible for point cloud preprocessing, point cloud feature
extraction, point cloud classification, etc.), and a representation model (responsi
ble for application analysis of point cloud processing results).
3. Point cloud feature description
The point cloud feature description is to characterize the point cloud mor
phological structure, which is mainly divided into artificially designed features
and deep network learning features. The features of artificial design depend on the
prior knowledge of the designer and often have certain parameter sensitivity, for
example, eigenvalue-based descriptors, spin images, fast point feature histo
grams, rotational projection statistical feature descriptions, binary shape contexts,
etc. The features of deep network learning are automatically learned from a large
amount of training data based on deep learning, which can contain a large number
of parameters and have strong description ability. According to different deep
learning models, the features learned by deep networks are divided into three
categories: voxel-based, multi-view-based, and irregular point-based.
4. Point cloud semantic information extraction
Point cloud semantic information extraction refers to identifying and
extracting object elements from a large number of disorganized point cloud
data, providing the underlying objects and analysis basis for high-level under
standing of the scene. On the one hand, the point cloud scene contains the high-
density and high-precision 3D information of the object, which provides the real
3D perspective and miniature of the object. On the other hand, the high-density,
massive, spatial discrete characteristics of point cloud and the incomplete data of
3D objects in the scene, the overlap, the occlusion, the similarity, and other
phenomena among objects also bring great challenges to semantic information
extraction.
5. Structural reconstruction of point cloud objects and scene understanding
In order to describe the function and structure of the object in the point cloud
scene and the positional relationship between multiple objects, it is necessary to
perform structural reconstruction of point cloud objects and scene under
standing, i.e., to represent the object in the point cloud scene in a structured way
to support complex computational analysis and further scene interpretation. The
point cloud-based 3D object structural reconstruction is different from the mesh
structure-based digital surface model reconstruction. The key of the former is to
accurately extract the 3D boundaries of different functional structures, so as to
convert the discrete and disordered point cloud into a geometric primitive com
bination model with topological characteristics.
Numerous point cloud datasets have been released by many universities and
industries in recent years, which can provide a fair comparison for testing various
methods. These public benchmark datasets consist of virtual or real scenes.
When the point cloud data is acquired, it is imperfect and incomplete due to various
reasons, and the amount of data is very large, so it often needs to be preprocessed
first (for an algorithm overview, see [19]). Common point cloud data preprocessing
tasks include hole filling, denoising, reduction/compression, registration, etc.
Due to factors related to the scanned object itself (such as self-occlusion, the surface
normal being nearly parallel to the incident laser line, and various factors that cause
insufficient reflected light intensity), the point cloud data will be missing in some
positions, forming holes. For example, the 3D point cloud obtained by scanning
the human body often has holes at the top of the head, the underarms, and other
body parts.
For hole filling of point cloud data, in addition to using common repair methods
(such as reverse engineering preprocessing), the point cloud data can also be
converted into mesh form first and then repaired by mesh model-based methods
[20]. The main steps of a three-stage hole-filling method are as follows: (1) reconstruct
the scanned point cloud into a triangular mesh model, and identify the
holes on it; (2) determine the type of each hole boundary; and (3) repair the hole
according to the type of its boundary. The missing points
can be calculated using nonuniform rational B-spline (NURBS) curves [21].
The factors that cause noise points in point cloud data mainly include the following
[22]:
1. Errors caused by the surface factors of the measured object, such as surface
roughness, material texture, distance, angle, etc. (the surface reflectivity may be
so low that the incident laser is absorbed without sufficient reflection, the
distance may be too far, or the incident angle may be too large, resulting in a weak reflected
signal). This often needs to be solved by adjusting the position, angle, distance,
etc. of the equipment.
2. Errors caused by the scanning system itself, such as equipment ranging, posi
tioning accuracy, resolution, laser spot size, scanner vibration, etc.
3. Accidental noise points, such as the interference of external factors (moving
objects, flying insects) during the acquisition process.
The statistical outlier removal filter is mainly used to remove outliers. Its basic idea
is to discriminate outlier points by examining the distribution density of points
in the input point cloud. Where the points are more concentrated, the
distribution density is greater; where the distribution density is small, the points are
sparse. For example, the average distance between each point and its k nearest
neighbors can be calculated, and points whose average distance deviates too much
from the global mean (e.g., beyond a multiple of the standard deviation) are regarded
as outliers and removed.
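A compact sketch of such a statistical outlier removal filter using a k-d tree (parameter names and defaults are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def statistical_outlier_removal(points, k=16, std_ratio=1.0):
    """Remove points whose mean distance to their k nearest neighbors exceeds
    (global mean + std_ratio * global standard deviation) of that quantity.
    points : (N, 3) array; returns the filtered (M, 3) array."""
    pts = np.asarray(points, float)
    dists, _ = cKDTree(pts).query(pts, k=k + 1)   # first neighbor is the point itself
    mean_d = dists[:, 1:].mean(axis=1)
    keep = mean_d <= mean_d.mean() + std_ratio * mean_d.std()
    return pts[keep]
```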
The radius outlier point removal filter uses the radius as the criterion to eliminate
all the points that do not reach a sufficient number of neighbor points in a certain
range of the input point cloud. Set the threshold of the number of neighbor points to
N, take (one by one) the current point as the center, and determine a sphere of radius
d. Calculate the number of neighbor points in the current sphere: when the number is
greater than N, the point is retained; otherwise, it is eliminated.
The main steps of the radius outlier removal method are as follows:
1. Calculate the number of neighbors n in the sphere with the radius d of each point
in the input point cloud.
2. Set the number of points threshold N.
3. If n < N, mark the point as an outlier point; if n ≥ N, do not mark it.
4. Repeat the above process for all points, and finally remove the marked points.
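A compact sketch of the radius outlier removal steps above using scipy's k-d tree (parameter names and defaults are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def radius_outlier_removal(points, d=0.05, N_min=5):
    """Keep only points having at least N_min neighbor points within radius d,
    following the steps listed above.
    points : (M, 3) array; d and N_min are the radius and count thresholds."""
    pts = np.asarray(points, float)
    tree = cKDTree(pts)
    # query_ball_point counts the point itself, hence the +1 below
    counts = np.array([len(idx) for idx in tree.query_ball_point(pts, r=d)])
    return pts[counts >= N_min + 1]
```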
In point cloud data obtained from aerial photography, the ground region is often included. The ground region needs to be removed when generating a digital ground model from LiDAR data; otherwise, it will interfere with the segmentation of the scene. The main filtering methods for points above the ground region include elevation-based filtering, model-based filtering, region-growing-based filtering, moving-window-based filtering, and triangulation-based filtering. Their comparison is shown in Table 4.5 [3].
Due to the fast speed and high acquisition density of 3D laser scanning technology,
the scale of point cloud data is often large. This may lead to problems such as large
amount of calculation, high memory usage, and slow processing speed.
3D laser point clouds can be compressed in different ways. The simplest method
is the barycentric compression method, that is, only the points closest to the
barycenter in the neighborhood after point cloud rasterization are retained. However,
this method easily leads to missing features of point cloud data. In addition, a data
structure called octree can be used for compression by the bounding box method,
which can not only reduce the amount of data but also facilitate the calculation of
normal vectors, tangent planes, and curvature values of local neighborhood
data [24].
The octree structure is implemented by recursively dividing the point cloud space. First, the space bounding box (circumscribed cube) of the point cloud data is constructed as the root; it is then divided into eight sub-cubes of the same size as the child nodes of the root. This recursive division is continued until the side length of the smallest sub-cube equals the given point distance, at which point each axis of the point cloud space has been divided into a power-of-2 number of sub-cubes.
For the recursive octal subdivision of the space bounding box, assuming that the number of division layers is N, the octree space model can be represented by an N-layer octree. Each cube in the octree space model has a one-to-one correspondence with a node in the octree, and its position in the octree space model can be represented by the octree code Q of the corresponding node:

Q = q_{N−1} q_{N−2} ... q_1 q_0        (4.1)

In the equation, each node serial number q_m is an octal digit, m ∈ {0, 1, ..., N − 1}; q_m is the serial number of the node among its sibling nodes, and q_{m+1} is the serial number of the node's parent among its sibling nodes. In this way, from q_0 to q_{N−1}, the path from each leaf node of the octree to the root of the tree is completely represented.
The specific steps of point cloud data encoding are as follows:
1. Determine the number of division layers N of the point cloud octree: N should satisfy d_0 · 2^N > d_max, where d_0 is the point distance specified for simplification and d_max is the maximum side length of the point cloud bounding box.
2. Determine the spatial index value (i, j, k) of the sub-cube in which a point cloud data point is located: if the data point is P(x, y, z), then

i = ⌊(x − x_min)/d_0⌋,  j = ⌊(y − y_min)/d_0⌋,  k = ⌊(z − z_min)/d_0⌋        (4.2)

Among them, (x_min, y_min, z_min) represents the minimum vertex coordinates of the bounding box corresponding to the root node.
3. Determine the encoding of the sub-cube in which the point cloud data point is located: convert the index value (i, j, k) to binary representation,

i = (i_{N−1} i_{N−2} ... i_0)_2,  j = (j_{N−1} j_{N−2} ... j_0)_2,  k = (k_{N−1} k_{N−2} ... k_0)_2,  q_m = i_m + 2 j_m + 4 k_m        (4.3)

where i_m, j_m, k_m ∈ {0, 1} and m ∈ {0, 1, ..., N − 1}. According to Eq. (4.1), the octree code Q of the node corresponding to the sub-cube can then be obtained.
4. From the octree code Q of the node corresponding to the sub-cube, the spatial index value (i, j, k) can also be calculated inversely:
i = Σ_{m=0}^{N−1} (q_m mod 2) · 2^m
j = Σ_{m=0}^{N−1} (⌊q_m / 2⌋ mod 2) · 2^m        (4.4)
k = Σ_{m=0}^{N−1} (⌊q_m / 4⌋ mod 2) · 2^m
Here, the gap between adjacent nodes in the X direction is set to 1, the gap
between adjacent nodes in the Y direction is 2, and the gap between adjacent nodes in
the Z direction is 4.
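Under the interpretation of Eqs. (4.1), (4.3), and (4.4) given above, the encoding and its inverse can be sketched as follows; the function names are hypothetical.

def encode_octree_code(i, j, k, N):
    """Interleave the binary digits of (i, j, k) into octal digits q_{N-1} ... q_0."""
    q = []
    for m in range(N - 1, -1, -1):
        i_m, j_m, k_m = (i >> m) & 1, (j >> m) & 1, (k >> m) & 1
        q.append(i_m + 2 * j_m + 4 * k_m)   # gaps of 1, 2, 4 in the X, Y, Z directions
    return q                                 # q[0] holds q_{N-1}, q[-1] holds q_0

def decode_octree_code(q):
    """Recover the spatial index (i, j, k) from the octal digits, cf. Eq. (4.4)."""
    N = len(q)
    i = j = k = 0
    for m in range(N):                       # q[N-1-m] is q_m in the notation above
        qm = q[N - 1 - m]
        i += (qm % 2) * 2 ** m
        j += ((qm // 2) % 2) * 2 ** m
        k += ((qm // 4) % 2) * 2 ** m
    return i, j, k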
In addition to the traditional bounding-box-based data reduction methods above, there are also scan-line-based methods and methods based on reducing the polygon count. The scan-line-based data reduction method exploits the characteristics of scanning: during line scanning, the points on each scan line lie in the same scan plane and are ordered, so the change in slope between successive points on a scan line can be used to decide whether a point can be removed [25]. For point cloud data for which a triangulated irregular network (TIN) model has already been constructed, data reduction can be performed by reducing the number of polygons in the model; a commonly used method is the common vertex compression method [25].
Registration methods can be divided into three categories: methods based on geometric features, methods based on surface features, and algorithms based on the iterative closest point (ICP) and its improvements.
The basic algorithm for iterative closest point registration has two main steps:
1. Quickly search for the closest point pair in adjacent point cloud datasets.
2. Calculate the corresponding transformation (translation and rotation) parameters
according to the correspondence between the coordinates of the closest
neighbors.
Let P and Q be two sets of point cloud data with P ⊆ Q. The specific steps of the algorithm are as follows (see Fig. 4.1):
1. Sample the reference point cloud and the object point cloud to determine the initial corresponding feature points and to improve the subsequent stitching speed.
2. Registration is performed by calculating the closest points. The registration can be divided into two steps: coarse registration and precise registration. Coarse registration establishes an approximate transformation between the coordinate systems of the two point clouds:

[X, Y, Z]^T = k R_x(α) R_y(β) R_z(γ) [x, y, z]^T + [x_0, y_0, z_0]^T        (4.5)

Fig. 4.1 Schematic diagram of coordinate system conversion for the complete imaging process
Among them, k is the scaling factor between the two coordinate systems; x_0, y_0, and z_0 are the translation amounts along the X, Y, and Z coordinate axes, respectively; α, β, and γ are the rotation angles around the three coordinate axes, respectively; and R_x(α)R_y(β)R_z(γ) represents the rotation matrix

R = R_x(α) R_y(β) R_z(γ)        (4.6)

It can be seen from Eq. (4.5) that, in order to match the point cloud data, it is necessary to extract three (or more) pairs of feature points (control points), or to arrange three (or more) object points in the common region, so as to calculate the six transformation parameters (α, β, γ, x_0, y_0, z_0).
Precise registration iterates on the point cloud data, starting from the coarse registration, to minimize the objective function value and achieve accurate, optimized registration. According to how the corresponding point set is determined, three methods can be used: point-to-point, point-to-projection, and point-to-surface. They define the closest point using the shortest spatial distance, the shortest projection distance, and the shortest normal distance, respectively.
Only the point-to-point-based registration method [24] is introduced below. Let the two point sets be P = {p_i, i = 0, 1, 2, ..., m} and U = {u_i, i = 0, 1, 2, ..., n}. A one-to-one correspondence between the points of the two point sets is not required, nor do the two point sets need to contain the same number of points; let m > n. The registration process is to compute the translation and rotation matrices between the two coordinate systems such that the distance between homologous points from P and U is minimized.
The main steps of precise registration are as follows:
1. Calculate the closest point, that is, for each point in the point set U, find the
closest corresponding point in the point set P by means of the distance measure,
and set the new point set composed of these points in the point set P as Q1 = {qi,
i = 0, 1, 2, ..., n}.
2. Using the quaternion method (see below), the registration between the point set
U and the point set Q1 is calculated, and the registration transformation matrices
R and T are obtained.
3. Perform coordinate transformation, that is, apply the transformation matrices R and T to the point set U to obtain U_1 = RU + T.
4. Calculate the distance measure between U_1 and Q_1: d_j = (Σ ‖U_1 − Q_1‖²)/N. Then calculate the closest point set of U_1 in P, transform to obtain U_2, and calculate d_{j+1} = (Σ ‖U_2 − Q_2‖²)/N. If d_{j+1} − d_j < ε (a preset threshold), end; otherwise, replace U with the point set U_1 and repeat the above steps.
The quaternion in Step 2 represents the rigid-body motion; a quaternion is a vector with four elements, which can be regarded as a 3 × 1 vector part plus a scalar part. The steps for using it to calculate the translation and rotation matrices are as follows:
1. Calculate the centroids of the point sets P = {p_i} and U = {u_i}, respectively:

p' = (1/m) Σ_{i=1}^{m} p_i        (4.7)

u' = (1/n) Σ_{i=1}^{n} u_i        (4.8)
2. Translate the point sets P = {p_i} and U = {u_i} relative to their centroids: q_i = p_i − p', v_i = u_i − u'.
3. Calculate the correlation matrix K from the translated point sets {q_i} and {v_i}:

K = (1/N) Σ_{i=1}^{N} q_i v_i^T        (4.9)
4. Construct the 4 × 4 symmetric matrix K' from the elements k_{ij} of K:

K' = [ k11 + k22 + k33     k23 − k32            k13 − k31            k12 − k21
       k23 − k32           k11 − k22 − k33      k12 + k21            k13 + k31
       k13 − k31           k12 + k21            −k11 + k22 − k33     k23 + k32
       k12 − k21           k13 + k31            k23 + k32            −k11 − k22 + k33 ]        (4.10)
5. Calculate the unit eigenvector (optimal selection vector) s* = [s_0, s_1, s_2, s_3]^T corresponding to the largest eigenvalue of K'.
6. Calculate the rotation matrix R with the help of the relationship between s* and R:

R = [ s_0² + s_1² − s_2² − s_3²    2(s_1 s_2 − s_0 s_3)          2(s_1 s_3 + s_0 s_2)
      2(s_1 s_2 + s_0 s_3)         s_0² − s_1² + s_2² − s_3²     2(s_2 s_3 − s_0 s_1)
      2(s_1 s_3 − s_0 s_2)         2(s_2 s_3 + s_0 s_1)          s_0² − s_1² − s_2² + s_3² ]        (4.11)
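To make the above steps concrete, the following is a minimal NumPy sketch that estimates the rotation and translation from a set of corresponding point pairs with Horn's quaternion method; the function name horn_quaternion_alignment is illustrative, the translation step (aligning the rotated centroids) is a standard completion not spelled out above, and the index conventions may differ slightly from those of Eqs. (4.9)-(4.11).

import numpy as np

def horn_quaternion_alignment(source, target):
    """Estimate R, t such that R @ source_i + t approximates target_i (Horn's quaternion method)."""
    cs = source.mean(axis=0)              # centroid of the source point set (steps 1-2)
    ct = target.mean(axis=0)              # centroid of the target point set
    Sc, Tc = source - cs, target - ct     # centered point sets
    M = Sc.T @ Tc                         # 3x3 cross-covariance matrix (step 3)
    Sxx, Sxy, Sxz = M[0]; Syx, Syy, Syz = M[1]; Szx, Szy, Szz = M[2]
    # 4x4 symmetric matrix whose largest-eigenvalue eigenvector is the optimal unit quaternion (steps 4-5)
    N4 = np.array([
        [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx],
        [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz],
        [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy],
        [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz]])
    _, V = np.linalg.eigh(N4)
    s0, s1, s2, s3 = V[:, -1]             # eigenvector of the largest eigenvalue
    R = np.array([                        # rotation matrix from the quaternion, cf. Eq. (4.11) (step 6)
        [s0*s0 + s1*s1 - s2*s2 - s3*s3, 2*(s1*s2 - s0*s3),             2*(s1*s3 + s0*s2)],
        [2*(s1*s2 + s0*s3),             s0*s0 - s1*s1 + s2*s2 - s3*s3, 2*(s2*s3 - s0*s1)],
        [2*(s1*s3 - s0*s2),             2*(s2*s3 + s0*s1),             s0*s0 - s1*s1 - s2*s2 + s3*s3]])
    t = ct - R @ cs                       # translation that aligns the centroids after rotation
    return R, t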
3D laser point clouds lack texture and spectral information, so many systems are also
equipped with color cameras to obtain color and texture information of the scene. By
registering 3D laser point clouds with color images, color point clouds with texture
attributes can be generated [24].
The registration of 2D optical images and 3D laser point clouds here is different
from the registration between the previous 3D point clouds. The methods are mainly
divided into three categories:
1. 2D-3D registration algorithm based on feature matching.
The basic principle of this type of algorithm is to use the corresponding
geometric features between the laser point cloud and the optical image for
registration, that is, to determine the relative registration parameters (translation
and rotation parameters).
2. 2D-3D registration algorithm based on statistics.
Here we mainly consider the method of automatically modeling the laser point cloud with a 3D surface model [24].
3. Optimize the newly formed triangle according to the optimization criterion (such
as swapping the diagonals), and add the formed triangle to the triangle linked list.
4. Repeat Step 2 until all points in the point set are inserted.
The insertion process of Step 2 can be described with the help of Fig. 4.2: Fig. 4.2a shows the insertion of a new point P into the existing triangle set formed by ΔABC and ΔBCD; Fig. 4.2b shows that the circumcircles of ΔABC and ΔBCD both contain point P, so both are influence triangles of point P; Fig. 4.2c shows the result of deleting the common edge of the influence triangles; Fig. 4.2d shows that the inserted point P is connected to all vertices of the two influence triangles (the newly formed triangles are added to the triangle linked list).
If the laser point cloud is first segmented to obtain patches, these patches can then be fitted to form parts of the 3D model. Point cloud segmentation divides the whole point cloud into multiple subregions, each corresponding one-to-one to a natural surface, so that each subregion contains only the scan points collected from a specific natural surface. Point cloud fitting constructs the geometric shape of the object represented by the point cloud by means of mathematical geometry from the segmented point cloud with certain characteristics.
There are many algorithms for segmentation of point clouds, which can be mainly
divided into edge-based, region-based, model-based, graph theory-based, and
cluster-based algorithms [3]. Among them, the simpler ones are K-means clustering
algorithm and region growing algorithm.
K-means clustering is a simple laser point cloud segmentation algorithm. The basic idea is to perform unsupervised classification on the data. The specific method is to update the value of each cluster center successively through an iterative process until the best clustering effect is obtained. Assuming that the sample set is to be divided into K categories, the algorithm can be described as follows:
1. Appropriately select the initial centers of the K classes.
2. In the i-th iteration, for any sample, calculate the distance to K centers, and assign
the sample to the class where the nearest center is located.
3. Use the mean to update the center value of each class:

Z_j = (1/n_j) Σ_{x ∈ K_j} x        (4.12)

where x represents a sample, n_j is the number of samples in the class, and K_j represents the j-th class.
4. For all K cluster centers, iterate according to Step 2 and Step 3 until the maximum number of iterations is reached, or until the change in the objective function between consecutive iterations is smaller than a preset threshold.
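The algorithm can be sketched in a few lines of NumPy; the random initialization of the centers and the convergence test on center movement are assumptions of this sketch.

import numpy as np

def kmeans_segment(points, K, max_iter=50, tol=1e-6):
    """Cluster 3D points into K groups by iteratively updating the cluster centers."""
    rng = np.random.default_rng(0)
    centers = points[rng.choice(len(points), K, replace=False)]    # step 1: initial centers
    for _ in range(max_iter):
        # step 2: assign every point to the class of its nearest center
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 3: update each center as the mean of the points assigned to it, cf. Eq. (4.12)
        new_centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(K)])
        # step 4: stop when the centers no longer move noticeably
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers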
The object represented by the laser point cloud can have various geometric features, the most basic being the plane. Let the space plane equation be ax + by + cz = d, where (a, b, c) is the unit normal vector of the plane, with a² + b² + c² = 1, and d is the distance from the origin to the plane, d > 0.
Assuming that a point cloud of N points approximately lying in a plane, {(x_i, y_i, z_i), i = 1, 2, ..., N}, has been obtained, the distance from any point to the plane is d_i = |a x_i + b y_i + c z_i − d|. Under the condition a² + b² + c² = 1, the best-fit plane should minimize

f = Σ_{i=1}^{N} (a x_i + b y_i + c z_i − d)²        (4.15)
Taking the derivative of Eq. (4.15) with respect to d and setting it to zero, we get

∂f/∂d = −2 Σ_{i=1}^{N} (a x_i + b y_i + c z_i − d) = 0        (4.16)

Solving this gives

d = a (Σ_i x_i)/N + b (Σ_i y_i)/N + c (Σ_i z_i)/N        (4.17)
Substituting Eq. (4.17) into the expression for the distance from a point to the plane, we have

d_i = |a (x_i − x̄) + b (y_i − ȳ) + c (z_i − z̄)|        (4.18)

where x̄, ȳ, and z̄ are the means of the x, y, and z coordinates. Further taking the derivatives of Eq. (4.15) with respect to a, b, and c, respectively, and setting them to zero leads to an eigenvalue problem for the 3 × 3 matrix A built from the centered coordinates. Therefore, the minimum eigenvalue of matrix A needs to be calculated, and the eigenvector [a b c]^T corresponding to the minimum eigenvalue is the plane normal vector to be calculated. With a, b, c, and d all calculated, the plane can be fitted.
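The eigenvalue formulation of the plane fit can be sketched as follows, assuming that A is the 3 × 3 matrix of second moments of the centered coordinates; the function name fit_plane is illustrative.

import numpy as np

def fit_plane(points):
    """Least-squares plane a*x + b*y + c*z = d through an N x 3 point cloud."""
    centroid = points.mean(axis=0)            # yields d via Eq. (4.17)
    centered = points - centroid
    A = centered.T @ centered                 # 3x3 matrix built from the centered coordinates
    _, eigvecs = np.linalg.eigh(A)            # eigenvalues are returned in ascending order
    normal = eigvecs[:, 0]                    # eigenvector of the minimum eigenvalue = plane normal
    d = float(normal @ centroid)              # distance term of the plane equation
    if d < 0:                                 # keep d non-negative, as assumed in the text
        normal, d = -normal, -d
    return normal, d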
Texture mapping can be seen as the process of mapping texture pixels in texture
space (via scene space) to pixels in screen space. It enhances realism by attaching an
image to the surface of a 3D object. The essence of texture mapping is to establish
two mapping relationships from screen space to texture space and from texture space
to screen space.
Depending on the texture function used, textures can be divided into 2D textures and 3D textures. 2D texture patterns are defined in 2D space. 3D texture, also known as solid texture, is a texture function with a one-to-one mapping between texture points and points in 3D object space.
According to the representation of texture, textures can be divided into color texture, geometric texture, and procedural texture. Color texture refers to patterns on a smooth surface that convey detail through variations in color or shading. It is generally used at the macro level to paste a real color pattern onto the surface of the scene to simulate realistic color. Geometric texture is composed of rough or irregular small bumps (concave-convex structures); it is a surface texture based on the microscopic geometric shape of the scene surface and is generally used at the sub-macro level to represent the unevenness, texture details, and changes of light and shade on the surface of the scene. Procedural textures describe dynamically changing natural phenomena (regular or irregular, such as water waves, clouds, smoke, etc.) and can be used to simulate complex, continuous surfaces.
From a mathematical point of view, if (u, v) denotes texture space and (x, y, z) denotes scene space, the mapping can be represented as (u, v) = F(x, y, z); if F is invertible, then (x, y, z) = F^{-1}(u, v). The texture mapping algorithm consists of three steps:
1. Define the texture object and obtain the texture.
2. Define the mapping function between texture space and scene space.
3. Select the resampling method for the texture to reduce deformation and aliasing of the mapped texture.
Color is mainly used to reflect the color of each point on the surface of the scene. If
the colors of adjacent points are combined, the texture characteristics can also be
reflected. There are different methods for color texture mapping [24].
The forward mapping method generally uses the Catmull algorithm. The algorithm
projects the texture pixel coordinates to the scene surface coordinates one by one
through the mapping function and then displays them in the screen space through
surface parameterization, as shown in Fig. 4.3. Specifically, first calculate the
position and size of the texture pixel coordinates on the scene surface by means of
projection, assign the center gray value of the texture pixels to the corresponding
point on the scene surface according to hyperbolic interpolation, and take the texture
color value assigned at the corresponding point as the surface texture attribute of the
sampling point in the center of the pixel on the surface. Then, use the lighting model
to simulate the calculation of the brightness at the surface point, and assign its gray
value. The forward mapping algorithm can be represented as (u, v) → (x, y) = [p(u, v), q(u, v)], where p and q both represent projection functions.
Forward mapping is relatively simple to implement. Because it accesses the texture pattern sequentially, it can save a lot of storage space and improve computing speed. The disadvantage is that the texture mapping value is only the gray value of the image, and the pixels of scene space and texture space do not correspond one-to-one: some regions have no corresponding texture pixels while others receive redundant ones, resulting in holes or multiple hits and causing deformation and confusion in the rendered graphics.
The reverse mapping method is also called the screen scanning method. It reversely
projects the coordinates of the scene surface to the coordinates of the texture space
through the mapping function, which can ensure the one-to-one correspondence
between the scene space and the texture space. After resampling the texture image
[Figure: texture space → scene space (projection) → screen space (surface parameterization)]
according to the projection, the pixel value of the coordinate center of the
corresponding scene surface space is calculated, and then the calculation result is
assigned to the scene surface. The reverse mapping algorithm can be represented as (x, y) → (u, v) = [f(x, y), g(x, y)], where f and g both represent projection functions.
The reverse mapping method needs to scan and search the scene space pixel by pixel and resample each pixel on the fly. In order to improve calculation efficiency, it needs to keep the texture pattern dynamically accessible, so it requires a lot of storage space. To improve computational efficiency, on the one hand, the search and matching of the texture image can be optimized; on the other hand, pixel information can be obtained preferentially for reconstructing 3D scenes.
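A minimal sketch of reverse (screen-scan) mapping is shown below; the user-supplied mapping function F, the normalized texture coordinates, and the nearest-neighbor resampling are assumptions of this sketch, not the book's specific implementation.

import numpy as np

def inverse_map_texture(texture, scene_shape, F):
    """For every scene pixel (x, y), sample the texture at (u, v) = F(x, y)."""
    H, W = texture.shape
    out = np.zeros(scene_shape, dtype=texture.dtype)
    for y in range(scene_shape[0]):
        for x in range(scene_shape[1]):
            u, v = F(x, y)                    # mapping from scene/screen space to texture space
            u = min(max(u, 0.0), 1.0)
            v = min(max(v, 0.0), 1.0)
            # nearest-neighbor resampling; bilinear interpolation would further reduce aliasing
            out[y, x] = texture[int(round(v * (H - 1))), int(round(u * (W - 1)))]
    return out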
Complex scene surfaces are often nonlinear, and it is difficult to directly parameter
ize with mathematical functions, that is, to directly establish the analytical relation
ship between texture space and scene space. One way to solve this problem is to
build an intermediate 3D surface, which decomposes the mapping from texture
space to scene space into a combination of two simple mappings from texture
space to intermediate 3D surface and then from intermediate 3D surface to scene
space.
The basic process of two-step texture mapping is as follows:
1. Establish a mapping from the 2D texture space to the intermediate 3D surface space, called the T mapping: T(u, v) → T'(x', y', z').
2. Remap the texture mapped onto the intermediate 3D surface space to the scene surface, which is called the T' mapping: T'(x', y', z') → O(x, y, z), where O(x, y, z) represents the scene space.
Geometric texture mapping is required for scene surfaces that are not smooth but bumpy and that produce random diffuse reflections under illumination. One method is to change the microscopic geometry of the scene surface by slightly perturbing the position of each sampling point on the scene surface, thereby changing the normal direction of the light at the scene surface and causing the surface brightness to change abruptly, which produces an uneven, realistic appearance (so geometric texture mapping is also called bump texture mapping).
In practice, in order to improve the sense of realism, it is necessary not only to
map the texture of the scene itself but also to map the environment of the scene, so
the flowchart of geometric texture mapping is shown in Fig. 4.4.
In Fig. 4.4, the environment map is implemented with the help of environment texture mapping, which maps the texture of the environment scene to the scene space, that is, it simulates the reflection of the surrounding environment by the scene surface.
The effective description of point cloud features is the basis of point cloud scene
understanding. The existing 3D point cloud feature descriptors include global feature
descriptors and local feature descriptors.
Global feature descriptors mainly describe the global features of an object. Common global feature descriptors include the ensemble of shape functions (ESF) and the viewpoint feature histogram (VFH) [3]. There are many kinds of local feature descriptors, such as the spin image, 3D shape context (3DSC), point feature histogram (PFH), fast point feature histogram (FPFH), etc. The principles and characteristics of these descriptors are shown in Table 4.6.
Since global feature descriptors require pre-segmentation of objects and ignore
shape details, it is difficult to identify incomplete or only partially visible scenes
from cluttered scenes where objects are interlaced and overlapped. In contrast, local
feature descriptors can describe local surface features within a certain neighborhood
range and have strong robustness to object interlacing, occlusion, and overlapping
and are more suitable for the recognition of incomplete or only partially visible
scenes.
A good local feature descriptor should be both descriptive and robust. On the one
hand, the feature descriptor should have a broad description ability, which can
describe the local surface shape, texture, echo intensity, and other information as
much as possible. On the other hand, feature descriptors should be robust to noise,
occlusion and overlap between objects, point density changes, etc.
The steps for obtaining the other three local feature descriptors will be described
in detail below.
Triple orthogonal local depth images (TOLDI) borrow the idea of two feature
descriptors, signature of histogram of orientation and rotational projection statistics
[30]. The main steps to obtain triple orthogonal local depth map feature descriptors
are as follows:
1. For a point p in the point cloud, first use the neighborhood points of p to construct a covariance matrix, and decompose the covariance matrix to obtain three eigenvalues (λ1 > λ2 > λ3) and the corresponding eigenvectors (e1, e2, e3).
2. Adjust the direction of e3 to make it consistent with the direction of the connections between most of the neighborhood points and p, and use p and e3 as the origin and Z-axis of the local coordinate system, respectively.
3. Project all neighborhood points to the tangent plane of the Z-axis, and calculate
the weighted sum of the connection vector between the projected point and p (the
farther the projection distance is, the smaller the weight, and the closer the
projection distance is, the greater the weight) as the X-axis, and then determine
the Y-axis from it.
4. Project all neighborhood points into the local coordinate system established
above, and project the converted point cloud onto the three coordinate planes of
XY, XZ, and YZ.
5. Grid the projected point cloud, and use the minimum projected distance of the
point in the grid as the grid value to form a projected distance image.
6. Concatenate the projection distance maps of the three projection surfaces to
obtain the TOLDI descriptor (in the form of a histogram).
The improvement of computing power and the development of tensor data theory
have promoted the widespread application of deep learning in scene understanding
[31]. At present, point cloud understanding based on deep learning mainly faces
three challenges [3]:
1. Point cloud data consists of points distributed arbitrarily in space and is therefore unstructured. Convolutional neural network filters cannot be used directly because there is no structured grid.
2. Point cloud data is essentially a series of points in 3D space. In a geometric sense, the order of the points does not affect the shape they represent, but it does change the underlying matrix representation, so the same point cloud can be represented by two completely different matrices.
3. The number of points in the point cloud data is different from the pixels in the
image; the number of pixels is a given constant and only depends on the camera.
However, the number of points in the point cloud is uncertain, and its number
depends on factors such as the sensor and the scanned scene. In addition, the
number of points of various objects in the point cloud is also different, which
leads to the problem of sample imbalance during network training.
Researchers have proposed a variety of deep networks for point cloud under
standing, which are mainly divided into three categories: 2D projection-based deep
learning network, 3D voxelization-based deep learning network, and network
model based on a single point in the point cloud [3].
1. 2D projection-based deep learning network.
With the progress of deep learning in 2D image segmentation and classifica
tion, people project 3D point clouds onto 2D images as the input of CNN
[32, 33]. Commonly used 2D projection images include virtual camera-based
RGB images, virtual camera-based depth maps, sensor-based distance images,
and panoramic images. These projection methods allow object detection and semantic segmentation network models already trained on large numbers of 2D images to be used as pretrained models and fine-tuned, making it easier to obtain good detection and classification results on 2D images. However, this approach may also lose some 3D information.
There are also methods that use multi-view projection techniques for point cloud classification, such as MVCNN [32], SnapNet [34], and DeePr3SS [35]. This kind of method easily loses 3D structure information; in addition, different projection angles represent the object with different fidelity, and the choice of projection angle also affects the generalization ability of the network.
2. 3D voxelization-based deep learning network.
On the basis of the 2D CNN model, a 3D CNN model can be built by voxelizing the point cloud, which retains more 3D structural information and is conducive to a high-resolution representation of point cloud data. Some results have been achieved in labeling point clouds and classifying objects. To further improve the ability to represent voxels, a variety of multi-scale voxel CNN methods have also been proposed, such as MS3_DVS [36] and MSNet [37].
The basis of this class of methods is voxelization [38]. In practice, voxelization often uses a 0-1 occupancy method to indicate whether a voxel contains a point or not; a voxel density-based method and a grid point-based method can also be used. The voxelization dimensions are mainly 11 × 11 × 11, 16 × 16 × 16, 20 × 20 × 20, 32 × 32 × 32, etc. The amount of data can be reduced by downsampling during voxelization.
The CNN method based on 3D voxelization provides structure to the point cloud through meshing, uses the mesh transformation to solve the alignment problem, and also yields a constant number of voxels. However, 3D CNNs are computationally intensive during convolution. To address this problem, the voxel resolution is usually reduced, but then the quantization error increases. In addition, only the structural information of the point cloud is used in the network; information such as the color and intensity of the points is not considered.
3. Network model based on a single point in the point cloud.
The network model based on a single point in the point cloud can make full use
of the multimodal information of the point cloud and reduce the computational
complexity in the preprocessing process. For example, the PointNet network
model for indoor point cloud scenes can classify, partially segment, and seman
tically segment indoor point cloud data; its improved version, PointNet++ net
work model, can obtain multi-scale and comprehensive local features
[39]. Another improvement of the PointNet network model is the PointCNN network model [40]. By analyzing the characteristics of point clouds, it proposes an X-transform learned from the points, which is used to simultaneously weight the input features associated with each point and rearrange them into a latent, implicit canonical order, after which product-and-sum operations are applied to the elements; this improves performance on point cloud convolution processing. Another improvement of the PointNet network model is the
PointFlowNet network model [41]. It can estimate scene flow. In addition, there
are also studies that first segment or block large-scale point clouds and then use
the PointNet network model for classification on the results to overcome the
limitations of the original PointNet network model for large-scale point cloud
processing, such as SPGraph [42].
The cuckoo search algorithm has the characteristics of few parameters and simple
model, so it has good versatility and robustness and also has global convergence. But
there are some limitations:
1. In the iterative process, the algorithm generates new positions by random walks, which is somewhat blind; this prevents it from quickly finding the global optimum, and the search accuracy is difficult to improve.
2. After searching the current position, the algorithm always selects the better solution in a greedy way, which makes it easy to fall into local optima.
3. The algorithm always discards bad solutions and generates new solutions with a fixed probability. Without learning from and inheriting the good experience of the dominant individuals in the population, the search time is increased.
In response to the above problems, many improvement methods have been
proposed, including the following:
1. Combining pattern search with a coarse-to-fine strategy.
Although the CS algorithm has good global exploration ability, its local search performance is relatively weak. Therefore, within the framework of the CS algorithm, pattern search, with its efficient coarse-to-fine search ability, can be embedded to enhance the accuracy of local solutions. The principle of the pattern search method is to find the lowest point of the search region: first determine a valley leading toward the center of the region, and then search along the direction of the valley line [45]. The essence of this strategy is to improve the way the search step size is adjusted.
Consider the point cloud registration problem in Sect. 4.2.5. The registration process
needs to obtain the translation and rotation matrix between the coordinate systems of
the two point sets to minimize the distance between the homologous points from the
two point sets.
In practical application, it is generally necessary to sample the point set first to
reduce the amount of data processed subsequently by the point cloud and improve
the operation efficiency. Selecting feature points is an effective means to reduce the
amount of data. There are many methods to select feature points. If the point cloud
dataset includes N points, and the coordinates of any point pi are (xi, yi, zi), the main
steps of selecting feature points using intrinsic shape signatures (ISS) include the
following:
1. Define a local coordinate system for each point pi, and set the search radius of
each point as r.
2. Query all points within the radius r distance around each point pi and calculate
their weights:
4. Calculate the eigenvalues {λ_{i1}, λ_{i2}, λ_{i3}} of the covariance matrix cov(p_i) of each point p_i, sorted in descending order.
5. Set the thresholds T_1 and T_2, and determine the points that satisfy the following conditions as feature points:

λ_{i2}/λ_{i1} < T_1,  λ_{i3}/λ_{i2} < T_2        (4.27)
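A sketch of this selection procedure is given below; since the weight and covariance equations are not reproduced above, the sketch uses the standard ISS weighting (each neighbor weighted by the inverse of its own neighbor count), which is an assumption, and the function name iss_keypoints is illustrative.

import numpy as np
from scipy.spatial import cKDTree

def iss_keypoints(points, r, T1=0.7, T2=0.7):
    """Select ISS feature points via the eigenvalue-ratio test of Eq. (4.27)."""
    tree = cKDTree(points)
    neighbors = tree.query_ball_point(points, r)
    # weight of each point: inverse of its own neighbor count (standard ISS weighting, assumed)
    w = np.array([1.0 / max(len(nb) - 1, 1) for nb in neighbors])
    keep = []
    for i, nb in enumerate(neighbors):
        nb = [j for j in nb if j != i]
        if len(nb) < 3:
            continue
        diffs = points[nb] - points[i]
        wi = w[nb]
        cov = (wi[:, None] * diffs).T @ diffs / wi.sum()   # weighted covariance of the neighborhood
        lam = np.sort(np.linalg.eigvalsh(cov))[::-1]       # eigenvalues in descending order
        if lam[0] <= 0 or lam[1] <= 0:
            continue
        if lam[1] / lam[0] < T1 and lam[2] / lam[1] < T2:  # Eq. (4.27)
            keep.append(i)
    return np.array(keep)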
error is zero. But in practice, there are various reasons leading to errors. In this way,
the 3D point cloud registration problem becomes an optimization problem. It is
necessary to find the optimal transformation matrix to minimize the Euclidean
distance between the two point sets.
If the improved cuckoo search algorithm is used, the minimum corresponding distance should be used as the global search criterion to achieve effective registration of the point cloud sets. Here, six transformation parameters need to be encoded. Since the value ranges of the rotation parameters α, β, and γ and the translation parameters x_0, y_0, and z_0 are different, the parameter encoding needs to be normalized. For example, randomly generate six solutions s_1, s_2, s_3, s_4, s_5, and s_6 within the constraints of the parameter encoding, and form a set of solutions S = [s_1, s_2, s_3, s_4, s_5, s_6], normalized to obtain S' = [s_1', s_2', s_3', s_4', s_5', s_6'], where s_i' = (s_i − l_i)/(u_i − l_i), i = 1, 2, ..., 6, and u_i and l_i are the upper and lower bounds of s_i, respectively, so that the encoded parameter values lie in the range [0, 1]. In this way, each parameter corresponds to the position of a nest in the algorithm, and the entire point cloud registration problem is transformed into a function optimization problem in 6D space.
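The normalization of the six parameters can be sketched as follows; the bounds l_i and u_i below are illustrative placeholders, not values from the text.

import numpy as np

# value ranges of the six parameters (alpha, beta, gamma, x0, y0, z0); placeholder bounds
lower = np.array([-np.pi, -np.pi, -np.pi, -1.0, -1.0, -1.0])
upper = np.array([ np.pi,  np.pi,  np.pi,  1.0,  1.0,  1.0])

def normalize(s):
    """Map a candidate solution s = [s1, ..., s6] into [0, 1]^6: s_i' = (s_i - l_i) / (u_i - l_i)."""
    return (s - lower) / (upper - lower)

def denormalize(s_norm):
    """Recover the physical parameters from an encoded nest position in [0, 1]^6."""
    return lower + s_norm * (upper - lower)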
References
1. Engin I C, Maerz N H (2022) Investigation on the processing of LiDAR point cloud data for
particle size measurement of aggregates as an alternative to image analysis. Journal of Applied
Remote Sensing, 16(1): #016511 (DOI: 10.1117/1.JRS.16.016511).
2. Mirzaei K, Arashpour M, Asadi E, et al. (2022) 3D point cloud data processing with machine
learning for construction and infrastructure applications: A comprehensive review. Advanced
Engineering Informatics, 51: #101501 (DOI: 10.1016/j.aei.2021.101501).
3. Li Y, Tong G F, Yang J C, et al. (2019) 3D point cloud scene data acquisition and its key
technologies for scene understanding. Laser & Optoelectronics Progress 56(4): 040002-
1~040002-14.
4. Yang B S, Dong Z (2020) Point cloud intelligent processing. Beijing: Science Press.
5. Qin J, Wang W B, Zou Q J, et al. (2023) Review of 3D target detection methods based on
LiDAR point clouds. Computer Science, 50(6A): 259-265.
6. Song X, Wang P, Zhou D, et al. (2019) Apollocar3D: A large 3D car instance understanding
benchmark for autonomous driving. Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition 5452-5462.
7. Pomerleau F, Liu M, Colas F, et al. (2012) Challenging data sets for point cloud registration
algorithms. International Journal of Robotic Research 31: 1705-1711.
8. Xue J, Fang J, Li T, et al. (2019) BLVD: Building A large-scale 5D semantics benchmark for
autonomous driving. Proceedings of the International Conference on Robotics and Automation
20-24.
9. Chen Y, Wang J, Li J, et al. (2018) Lidar-video driving dataset: Learning driving policies
effectively. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
5870-5878.
10. Vallet B, Bredif M, Serna A, et al. (2015) TerraMobilita/iQmulus urban point cloud classification benchmark. Computers & Graphics 49: 126-133.
11. Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? The KITTI vision
benchmark suite. Proceedings of the Conference on Computer Vision and Pattern Recognition,
3642-3649.
12. Wang C, Hou S, Wen C, et al. (2018) Semantic line framework-based indoor building modeling
using backpacked laser scanning point cloud. ISPRS J. Photogrammetric Remote Sensing
143, 150-166.
13. Carlevaris-Bianco N, Ushani A K, Eustice R M (2016) University of Michigan North Campus
long-term vision and LiDAR dataset. International Journal of Robotic Research 35, 1023-1035.
14. Roynard X, Deschaud J E, Goulette F (2018) Paris-Lille-3D: A large and high-quality ground
truth urban point cloud dataset for automatic segmentation and classification. International
Journal of Robotic Research 37, 545-557.
15. Caesar H, Bankiti V, Lang A H, et al. (2019) nuScenes: A multimodal dataset for autonomous
driving. arXiv:1903.11027.
16. Maddern W, Pascoe G, Linegar C, et al. (2017) 1 year, 1000 km: The Oxford RobotCar dataset. International Journal of Robotic Research 36, 3-15.
17. Behley J, Garbade M, Milioto A, et al. (2019) SemanticKITTI: A dataset for semantic scene
understanding of LiDAR sequences. Proceedings of IEEE/CVF International Conference on
Computer Vision.
18. Dong Z, Liang F, Yang B, et al. (2020) Registration of large-scale terrestrial laser scanner point
clouds: A review and benchmark. ISPRS J. Photogrammetric Remote Sensing 163, 327-342.
19. Fan Y C, Zhang J Q, Cui C, et al. (2023) Overview of point cloud preprocessing algorithms.
Information and Computer, 35(6): 206-209.
20. Sun X D (2021) Point cloud data research and shape analysis. Beijing: Electronic Industry
Press.
21. Shi F Z (2013) Computer Aided Geometric Design and Nonuniform Rational B-Splines
(Revised Edition). Beijing: Higher Education Press.
22. Zhao Z X, Dong X J, Lv B X, et al. (2019) Application Theory and Practice of Ground 3D Laser
Scanning Technology. Beijing: China Water Conservancy and Hydropower Press.
23. Han X-F, Jin J S, Wang M-J. (2017) A review of algorithms for filtering the 3D point cloud.
Signal Processing: Image Communication 57: 103-112.
24. Li F, Wang J, Liu X Y, et al. (2020) Principle and Application of 3D Laser Scanning. Beijing:
Earthquake Press.
25. Wu Q H, Qu J K, Zhou B X. (2020) 3D Laser Scanning Data Processing Technology and Its
Engineering Application. Jinan: Shandong University Press.
26. Tombari F, Salti S, Di Stefano L (2010) Unique signatures of histograms for local surface
description. Proceedings of the European Conference on Computer Vision 356-369.
27. Tombari F, Salti S, Di Stefano L (2011) A combined texture-shape descriptor for enhanced 3D
feature matching. Proceedings of International Conference on Image Processing, 809-812.
28. Seib V, Paulus D (2021) Shortened color-shape descriptors for point cloud classification from RGB-D cameras. IEEE International Conference on Autonomous Robot Systems and Competitions, 203-208.
29. Guo Y, Sohel F, Bennamoun M, et al. (2013) Rotational projection statistics for 3D local
surface description and object recognition. International Journal of Computer Vision 105(1):
63-86.
30. Yang J, Zhang Q, Xiao Y, et al. (2017) TOLDI: An effective and robust approach for 3D local
shape description. Pattern Recognition 65: 175-187.
31. Gong J Y, Lou Y J, Liu F Q, et al. (2023) Scene point cloud understanding and reconstruction technologies in 3D space. Journal of Image and Graphics 28(6): 1741-1766.
32. Su H, Maji S, Kalogerakis E, et al. (2015) Multi-view convolutional neural networks for 3D shape recognition. IEEE International Conference on Computer Vision 945-953.
33. Qi C R, Su H, Nießner M, et al. (2016) Volumetric and multi-view CNNs for object classification on 3D data. IEEE Conference on Computer Vision and Pattern Recognition 5648-5656.
34. Boulch A, Guerry J, Le Saux B, et al. (2018) SnapNet: 3D point cloud semantic labeling with 2D deep segmentation networks. Computers & Graphics 189-198.
35. Lawin F J, Danelljan M, Tosteberg P, et al. (2017) Deep projective 3D semantic segmentation.
International Conference on Computer Analysis of Images and Patterns 95-107.
36. Roynard X, Deschaud J E, Goulette F (2018) Classification of point cloud scenes with multiscale voxel deep network. https://arxiv.org/abs/1804.03583.
37. Wang L, Huang Y C, Shan J, et al. (2018) MSNet: Multi-scale convolutional network for point cloud classification. Remote Sensing 10(4): 612.
38. Xu Y S, Tong X H, Stilla U (2021) A voxel-based representation of 3D point clouds: Methods,
applications, and its potential use in the construction industry. Automation in Construction 126:
103675 (DOI: 10.1016/j.autcon.2021.103675).
39. Charles R Q, Su H, Mo K, et al. (2017) PointNet: Deep learning on point sets for 3D classification and segmentation. IEEE Conference on Computer Vision and Pattern Recognition 77-85.
40. Li Y Y, Bu R, Sun M C, et al. (2018) PointCNN: Convolution on X-transformed points. https://arxiv.org/abs/1801.07791.
41. Behl A, Paschalidou D, Donne S, et al. (2018) PointFlowNet: Learning representations for 3D scene flow estimation from point clouds. https://arxiv.org/abs/1806.02170.
42. Landrieu L, Simonovsky M (2018) Large scale point cloud semantic segmentation with
superpoint graphs. IEEE Conference on Computer Vision and Pattern Recognition 4558-4567.
43. Yang X S, Deb S (2010) Engineering optimization by cuckoo search. International Journal of
Mathematical Modelling and Numerical Optimisation, 1(4): 330-343.
44. Yang X S, Deb S (2009) Cuckoo search via Levy flights. 2009 World Congress on Nature &
Biologically Inspired Computing (NaBIC) 210-214.
45. Hooke R, Jeeves T A. (1961) “Direct search” solution of numerical and statistical problems.
Journal of the ACM, 8(2): 212-229.
46. Ma W (2021) Bionic swarm intelligence optimization algorithm and its application in point
cloud registration. Nanjing: Southeast University Press.
Chapter 5
Binocular Stereovision
The human visual system is a natural stereovision system. The human eyes (each
equivalent to a camera) observe the same scene from two viewpoints, and the
information obtained is combined in the human brain to give a 3D objective
world. In computer vision, by collecting one set of two (or more) images from
different viewing angles, the parallax (disparity) between corresponding pixels in
different images can be obtained by means of the principle of triangulation. That is,
the parallax is the difference between the positions of a 3D space point projected
onto these 2D images. The depth information and the reconstructions of the 3D scene
can be further obtained according to the parallax.
In stereovision, computing parallax is a key step to obtain depth information, and
the main challenge is to determine the projected image points of 3D space points on
different images (two images for binocular stereovision, multiple images for multi
ocular stereovision). The determination of the correspondence after projection is a
matching problem. This chapter considers only binocular stereovision (multi-ocular
stereovision will be discussed in the next chapter).
The sections of this chapter will be arranged as follows.
Section 5.1 discusses the principle and several common techniques of binocular stereo matching based on regional grayscale correlation.
Section 5.2 introduces the basic steps and methods for feature-based binocular stereo matching and also analyzes the two commonly used feature points SIFT and SURF in recent years.
Section 5.3 presents an algorithm to detect and correct errors in parallax maps obtained by stereo matching.
Section 5.4 introduces recent stereo matching methods based on deep learning techniques, including various stereo matching networks and a specific method.
The most basic matching method here is generally called template matching (the
principle can also be used for more complex matching), and its essence is to use a
small image (template) to match a part (sub-image) of a large image. The result of the
[Figure: template w(x − s, y − t) of size J × K at displacement (s, t) within the image f(x, y) of size M × N]
matching is to determine whether there is the small image in the large image and, if
so, further determine the position of the small image in the large image. Templates
are usually square in template matching but can also be rectangular or other shapes.
Now consider finding the matching position between a template image w(x, y) of size J × K and a large image f(x, y) of size M × N, and let J < M and K < N. In the simplest case, the correlation function between f(x, y) and w(x, y) can be written as

c(s, t) = Σ_x Σ_y f(x, y) w(x − s, y − t)        (5.1)
The correlation function defined by Eq. 5.1 has the disadvantage that it is
sensitive to changes in the magnitude of f(x, y) and w(x, y). For example, when the values of f(x, y) are doubled, c(s, t) is also doubled. To overcome this problem, the following correlation coefficient can be defined:

c(s, t) = Σ_x Σ_y [f(x, y) − f̄(x, y)] [w(x − s, y − t) − w̄] / { Σ_x Σ_y [f(x, y) − f̄(x, y)]² · Σ_x Σ_y [w(x − s, y − t) − w̄]² }^{1/2}        (5.4)
where s = 0, 1, 2, ..., M − 1; t = 0, 1, 2, ..., N − 1; w̄ is the mean value of w (which only needs to be computed once); and f̄(x, y) represents the mean value of the region of f(x, y) corresponding to the current position of w.
The summation in Eq. 5.4 is for the common coordinates of f(x, y) and w(x, y).
Because the correlation coefficient has been scaled to the interval [-1, 1], the change
of its value is independent of the amplitude change of f(x, y) and w(x, y).
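A direct (unoptimized) sketch of matching with this correlation coefficient follows; here (s, t) is taken as the top-left offset of the template inside the image, a simplification of the displacement convention used above, and the function name ncc_match is illustrative.

import numpy as np

def ncc_match(f, w):
    """Slide template w over image f and return the offset with the largest correlation coefficient."""
    M, N = f.shape
    J, K = w.shape
    w_zero = w - w.mean()                      # template mean is computed only once
    w_norm = np.sqrt((w_zero ** 2).sum())
    best, best_pos = -2.0, (0, 0)
    for s in range(M - J + 1):
        for t in range(N - K + 1):
            patch = f[s:s + J, t:t + K]
            p_zero = patch - patch.mean()      # mean of the image region under the template
            denom = np.sqrt((p_zero ** 2).sum()) * w_norm
            if denom == 0:
                continue
            c = (p_zero * w_zero).sum() / denom   # correlation coefficient in [-1, 1]
            if c > best:
                best, best_pos = c, (s, t)
    return best_pos, best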
Another method is to calculate the grayscale difference between the template and the sub-image and to establish the correspondence between the two groups of pixels that minimizes the mean square difference (MSD). The advantage of this kind of method is that the matching result does not depend on detecting features with high accuracy and density, so high positioning accuracy and a dense parallax surface can be obtained [2]. The disadvantage is that it depends on the statistical characteristics of the image gray levels, so it is sensitive to the surface structure of the scene and to light reflections; it therefore has difficulty when the surface of the spatial scene lacks sufficient texture detail or when the imaging distortion is large (such as when the baseline is too long). In actual matching, quantities derived from the gray level can also be used, but some experiments show that, among matching with gray level, grayscale gradient magnitude and direction, grayscale Laplacian value, and grayscale curvature as matching parameters, using the gray level itself gives the best results [3].
As a basic matching technique (see Chap. 9 for some other typical matching techniques), template matching has been applied in many areas, especially when the image only undergoes translation. The above calculation of the correlation coefficient normalizes the correlation function and overcomes the problem caused by amplitude changes. However, it is difficult to normalize for image size and rotation. Normalizing for size requires a spatial scale transformation, which requires a lot of calculation. Normalizing for rotation is more complicated. If the rotation angle of f(x, y) is known, it suffices to rotate w(x, y) by the same angle to align it with f(x, y). However, without knowing the rotation angle of f(x, y), finding the best match requires rotating w(x, y) over all possible angles. In practice, this method is therefore not practical.
The above equation does not change under the affine transformation. That is, the
value of (s, t) is only related to the three noncollinear points and has nothing to do
with the affine transformation itself. In this way, the value of (s, t) can be regarded as
the affine coordinate of the point Q. This property also applies to line segments: three
nonparallel line segments can be used to define an affine datum.
Geometric hashing builds a hash table that helps matching algorithms quickly determine the potential location of a template in an image. The hash table can be constructed as follows: for any three noncollinear points in the template (a datum point group), calculate the affine coordinates (s, t) of the other points. The affine coordinates
(s, t) of these points will be used as indexes into the hash table. For each point, the
hash table retains the index (serial number) for the current datum group. If you want
to search multiple templates in an image, you need to keep more template indexes.
To search for a template, a set of datum points is randomly selected in the image,
and the affine coordinates (s, t) of the other points are calculated. Using this affine
coordinate (s, t) as the index of the hash table, the index of the datum point group can
be obtained. This results in a vote for the presence of this datum group in the image.
If the randomly selected point does not correspond to the set of datum points on the
template, no vote is accepted. However, if the randomly selected point corresponds
to the set of datum points on the template, the vote is accepted. If many votes are
accepted, it indicates that the template is likely to be present in the image, and
metrics for the datum group are available. Because there is a certain probability that
the selected set of datum points will not be suitable, the algorithm needs to iterate to
increase the probability of finding the correct match. In fact, it is only necessary to
find a correct set of datum points to determine the matching template. So, if k of the N template points are found in the image, the probability that a correct set of datum points is selected at least once in m attempts is

p = 1 − [1 − (k/N)³]^m        (5.6)

If the ratio k/N of the number of template points appearing in the image to the total number of template points is 0.2, and the desired probability of matching the template is 99% (i.e., p = 0.99), then the number of attempts m required is 574.
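The quoted number of attempts can be reproduced directly from Eq. (5.6), as the following small sketch shows.

import math

def attempts_needed(k_over_N, p_target):
    """Smallest m with 1 - (1 - (k/N)^3)^m >= p_target, from Eq. (5.6)."""
    q = 1.0 - k_over_N ** 3                 # probability that one random basis triple is not correct
    return math.ceil(math.log(1.0 - p_target) / math.log(q))

print(attempts_needed(0.2, 0.99))           # prints 574, matching the value quoted in the text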
physical properties in the two images can be matched, also known as the
photometric compatibility constraint.
2. Uniqueness constraint: it means that a single black point in one image can only
match a single black point in another image.
3. Continuous constraint: it means that the parallax change near the matching
point is smooth (change gradually) in most of the points in the whole image
except the occlusion region or discontinuous region, also known as the disparity
smoothness constraint.
When discussing stereo matching, in addition to the above three constraints, the
epipolar line constraints introduced below and the order constraints introduced in
Sect. 5.2.4 can also be considered.
Epipolar line constraints can help reduce the search range and speed up the search
process.
First, the two important concepts of epipole and epipolar line are introduced with
the help of the binocular horizontal convergence mode in Fig. 5.2. In Fig. 5.2, the
coordinate origin is the optical center of the left eye, the X-axis connects the optical
centers of the left and right eyes, the Z-axis points to the observation direction, the
distance between the left and right eyes is B (also commonly referred to as the system
baseline), the optical axes of the left and right image planes are in the XZ plane, and
the intersection angle is 9. Consider the connection between the left and right image
planes. O1 and O2 are the optical centers of the left and right image planes,
respectively. The connecting line between them is called the optical center line.
The intersections e1 and e2 of the optical center line and the left and right image
planes are called the epipoles of the left and right image planes, respectively (the
epipole coordinates are represented by e1 and e2, respectively). The optical center
line and the space point W are in the same plane. This plane is called the epipolar
plane. The intersection lines L1 and L2 of the epipolar plane with the left and right
image planes are called the epipolar lines of the projection points of the space point
W on the left and right image planes, respectively. The epipolar line defines the
position of the corresponding point of the binocular image, and the projection point
p2 (coordinate p2) of the right image plane corresponding to the projection point p1
(coordinate p1) of the space point W on the left image plane must be on the epipolar
line L2. On the contrary, the projection point of the left image plane corresponding to
the projection point of the space point W on the right image plane must be on the
epipolar line L1.
From the above discussion, it can be seen that the epipole and the epipolar line
have correspondence. The epipolar line defines the position of the corresponding
point on the binocular image, and the projection point of the right image
plane corresponding to the projection point of the space point W on the left image
plane must be on the epipolar line L2; conversely, the projection point on the left image plane corresponding to the projection of W on the right image plane must be on the epipolar line L1. This is the epipolar line constraint.
In binocular vision, when the ideal parallel optical axis model is adopted (i.e., the
line of sight of each camera is parallel), the epipolar line and the image scanning line
are coincident, and the stereovision system at this time is called a parallel
stereovision system. In parallel stereovision systems, the search range of stereo
matching can also be reduced by means of epipolar line constraints. Ideally, a search over the entire image can be reduced to a search along a single line of the image using the epipolar line constraint. However, it should be pointed out that the epipolar line constraint is
only a local constraint. For a space point, there may be more than one projection
point on the epipolar line.
An illustration of the epipolar line constraint is shown in Fig. 5.3. A camera (left)
is used to observe a point W in space, and the imaged point p1 should be on the line
connecting the optical center of the camera and point W. But all points on this line
will be imaged at point p1, so the position/distance of a particular point W cannot be
completely determined by point p1. Now use the second camera to observe the same
spatial point W, and the imaged point p2 should also be on the connection line
between the optical center of the camera and the point W. All points W on this line
are projected onto a straight line on the imaging plane 2, which is the epipolar line.
According to the geometric relationship in Fig. 5.3, for any point p1 on imaging plane 1, all of its possible corresponding points on imaging plane 2 are constrained to lie on the same straight line; this is the abovementioned epipolar line constraint.
The relationship between the projected coordinate points of the space point W on the
two images can be described by an essential matrix E with five degrees of freedom
[6], which can be decomposed into an orthogonal rotation matrix R followed by a
translation matrix T (E = RT). If the coordinates of the projected point in the left image are represented by p1, and the coordinates of the projected point in the right image are represented by p2, then

p_2^T E p_1 = 0        (5.7)

On the corresponding images, the epipolar line on which p2 lies is L2 = E p1 and the epipolar line on which p1 lies is L1 = E^T p2; the epipoles satisfy E e1 = 0 and E^T e2 = 0, respectively.
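These relations can be evaluated directly, as in the sketch below; the epipoles are obtained as the null vectors of E via an SVD, and the function names are illustrative.

import numpy as np

def epipolar_lines(E, p1, p2):
    """Epipolar lines L1 = E^T p2 and L2 = E p1 for homogeneous image points p1, p2."""
    return E.T @ p2, E @ p1

def epipoles(E):
    """Epipoles as the null vectors of E: E e1 = 0 and E^T e2 = 0 (computed via SVD)."""
    U, _, Vt = np.linalg.svd(E)
    e1 = Vt[-1]        # right null vector: E e1 = 0
    e2 = U[:, -1]      # left null vector:  E^T e2 = 0
    return e1, e2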
Refer to Fig. 5.4 for the derivation of the essential matrix. In Fig. 5.4, suppose that the projection positions p1 and p2 of point W on the two images can be observed, and the rotation matrix R and the translation matrix T between the two cameras are also known; then three 3D vectors O1O2, O1W, and O2W can be obtained. These three 3D vectors must be coplanar. In mathematics, the criterion that three 3D vectors a, b, and c are coplanar can be written as a · (b × c) = 0, and this can be used to derive the essential matrix. The essential matrix expresses the relationship between the coordinates of the projection points of the same space point W (coordinate W) on the two images.
According to the perspective relationship of the second camera, vector O1W ∝ R p1, vector O1O2 ∝ T, and vector O2W ∝ p2. Combining these relationships with the coplanarity condition yields the desired result:

p_2 · (T × R p_1) = 0
The above discussion assumes that p1 and p2 are the camera-calibrated pixel
coordinates. If the cameras are not calibrated, the original pixel coordinates q1 and q2
need to be used. Suppose the internal parameter matrices of the two cameras are G1
and G2; then

p1 = G1^(−1) q1    p2 = G2^(−1) q2

Substituting the above two equations into Eq. 5.7, we get
q2^T (G2^(−1))^T E G1^(−1) q1 = 0, which can be written as

q2^T F q1 = 0  (5.11)

where

F = (G2^(−1))^T E G1^(−1)

is called the fundamental matrix because it contains all the information for camera
calibration. The fundamental matrix has seven degrees of freedom (two parameters
are required for each epipole, plus three parameters for the projective transformation
that maps the epipolar lines of one image to those of the other, since a projective
transformation between two 1D projective spaces has three degrees of freedom),
while the essential matrix has five degrees of freedom, so the fundamental matrix has
two more free parameters than the essential matrix. Comparing Eqs. 5.7 and 5.11,
however, it can be seen that the roles of the two matrices are similar.
The essential and fundamental matrices are related to the internal and external
parameters of the camera. If the internal and external parameters of the camera are
given, according to the epipolar line constraint, for any point on the imaging plane
1, only a 1D search is needed in the imaging plane 2 to determine the position of
the corresponding point. Further, the correspondence constraint is a function of the
internal and external parameters of the camera. Given the internal parameters, the
external parameters can be determined by means of the observed pattern of
corresponding points, and then the geometric relationship between the two cameras
can be established.
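To make the preceding relations concrete, the following NumPy sketch builds an essential matrix from an assumed rotation R and translation t, derives the corresponding fundamental matrix for assumed intrinsic matrices G1 and G2, and checks the constraints of Eqs. 5.7 and 5.11 for a simulated point pair; all numerical values (pose, intrinsics, point) are illustrative assumptions, not data from the book.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]x, so that skew(t) @ a == np.cross(t, a)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Assumed relative pose of camera 2 with respect to camera 1 (illustrative values),
# using the convention X2 = R @ X1 + t for a point X1 in camera-1 coordinates.
angle = np.deg2rad(5.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([-0.2, 0.0, 0.0])                 # baseline roughly along X

# Essential matrix from the coplanarity condition p2 . (t x R p1) = 0
E = skew(t) @ R

# Assumed internal parameter matrices G1, G2
G1 = G2 = np.array([[800.0, 0.0, 320.0],
                    [0.0, 800.0, 240.0],
                    [0.0, 0.0, 1.0]])

# Fundamental matrix F = (G2^-1)^T E G1^-1
F = np.linalg.inv(G2).T @ E @ np.linalg.inv(G1)

# One space point in camera-1 coordinates and its two normalized projections
X1 = np.array([0.3, -0.1, 4.0])
X2 = R @ X1 + t
p1 = X1 / X1[2]                                # homogeneous normalized coordinates
p2 = X2 / X2[2]

# Pixel coordinates q = G p
q1, q2 = G1 @ p1, G2 @ p2

print("p2^T E p1 =", p2 @ E @ p1)              # ~0: essential-matrix constraint (Eq. 5.7)
print("q2^T F q1 =", q2 @ F @ q1)              # ~0: fundamental-matrix constraint (Eq. 5.11)
print("p2 lies on epipolar line L2 = E p1:", np.isclose(p2 @ (E @ p1), 0.0))
```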
There are still some specific problems to be considered and solved when using the
region matching method in practice. Here are two examples:
1. Due to the shape of the scenery itself or the mutual occlusion of the scenery when
shooting the scene, not all the scenes captured by the left camera can be captured
by the right camera. Therefore, some templates determined by the left image may
not be able to find the exact matching position in the right image. At this time, it is
often necessary to interpolate according to the matching results of other matching
positions to get the data of these unmatched points.
2. When using the pattern of template image to express the characteristics of a single
pixel, the premise is that different template images should have different patterns,
so that the matching can be differentiated, which can reflect the characteristics of
different pixels. However, sometimes there are some smooth regions in the
image, and the template images obtained in these smooth regions have the
same or similar patterns, so there will be uncertainty in the matching, and it
leads to false matching. In order to solve this problem, it is sometimes necessary
to project some random textures onto these surfaces to convert smooth regions
into texture regions, so as to obtain template images with different patterns to
eliminate uncertainty.
The following is an example of the error caused by stereo matching when there is
a gray level smooth region along the binocular baseline direction. See Fig. 5.5,
where Figs. 5.5a and b are the left and right of a pair of perspective views,
Fig. 5.5 Examples of binocular stereo matching affected by smooth regions of the image
respectively. Figure 5.5c is the disparity map obtained by binocular stereo matching
(for the sake of clarity, only the result of scene matching is retained), the dark color
in the figure represents a farther distance (larger depth), and the light color represents
a short distance (smaller depth). Figure 5.5d is a 3D perspective (contour map)
display corresponding to Fig. 5.5c. Comparing the figures, it can be seen that at some
positions in the scene (such as the horizontal eaves of towers, houses, and other
buildings) the gray values are largely similar along the horizontal direction, so it is
difficult to search for and match the corresponding points along the epipolar line
direction. The resulting mismatches produce many errors, which appear in Fig. 5.5c
as white or black regions that are inconsistent with their surroundings and in
Fig. 5.5d as sharp burrs.
In Eq. 5.14, p, a, and k are coefficients related to the optical properties of the
surface, which can be calculated from the image data.
The first term on the right side of Eq. 5.14 considers the scattering effect, which
does not vary with the viewing angle; the second term considers the specular
reflection effect. Let H be the unit vector in the angular direction of the specular
reflection:
H = (S + V) / √(2[1 + (S · V)])  (5.15)
The second term on the right of the equal sign in Eq. 5.14 reflects the change of
line of sight vector V through vector H. In the coordinate system adopted in Fig. 5.2,
For an image f(x, y), the feature point image can be obtained by calculating the edge
points:
Then, t(x, y) is divided into nonoverlapping small regions W, and the point with
the largest calculated value is selected as the feature point in each small region.
Now consider matching an image pair composed of a left image and a right
image. For each feature point of the left image, all possible matching points in the
right image can be formed into a set of possible matching points. In this way, a label
set can be obtained for each feature point of the left image, in which the label l is
either the parallax between the feature point of the left image and its possible
matching point or a special label representing no matching point. For each possible
matching point, calculate the following quantity to set the initial matching
probability P(0)(l):

A(l) = Σ(x,y)∈W [fL(x, y) − fR(x + lx, y + ly)]²  (5.22)
where l = (lx, ly) is the possible parallax. A(l) represents the gray level fitting degree
between the two regions, which is inversely proportional to the initial matching
probability P(0)(l). In other words, P(0)(l) is related to the similarity in the
neighborhood of possible matching points. According to this, the relaxation iteration method
can be used to iteratively update P(0)(l) by giving positive increments to the points
with close parallax in the neighborhood of possible matching points and negative
increments to the points with far parallax in the neighborhood of possible matching
points. With the iteration, the iterative matching probability P(k)(l) of the correct
matching point will gradually increase, while the matching probability P(k)(l) of
other points will gradually decrease. After a certain number of iterations, the point
with the maximum matching probability P(k)(l) is determined as the matching point.
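As a concrete illustration of Eq. 5.22 and the initialization of the matching probabilities, the following NumPy sketch computes A(l) over a small window for a set of candidate disparities and converts the fitting degrees into initial probabilities P(0)(l) by a simple inverse weighting; the window size, candidate set, and the 1/(1 + A) weighting are illustrative assumptions rather than the book's exact scheme.

```python
import numpy as np

def initial_match_probabilities(f_left, f_right, x, y, candidates, half_w=2):
    """Compute A(l) of Eq. 5.22 for each candidate disparity l = (lx, ly)
    around feature point (x, y), and turn them into initial probabilities."""
    win = np.s_[y - half_w:y + half_w + 1, x - half_w:x + half_w + 1]
    patch_l = f_left[win].astype(float)

    fits = {}
    for lx, ly in candidates:
        shifted = np.s_[y + ly - half_w:y + ly + half_w + 1,
                        x + lx - half_w:x + lx + half_w + 1]
        patch_r = f_right[shifted].astype(float)
        fits[(lx, ly)] = np.sum((patch_l - patch_r) ** 2)   # A(l)

    # Smaller A(l) (better fit) -> larger initial probability; normalize to sum 1
    weights = {l: 1.0 / (1.0 + a) for l, a in fits.items()}
    total = sum(weights.values())
    return {l: w / total for l, w in weights.items()}

# Tiny synthetic example: the right image is the left image shifted by 3 pixels
rng = np.random.default_rng(0)
f_left = rng.integers(0, 256, size=(40, 60))
f_right = np.roll(f_left, shift=3, axis=1)

P0 = initial_match_probabilities(f_left, f_right, x=20, y=20,
                                 candidates=[(d, 0) for d in range(0, 7)])
print(max(P0, key=P0.get))    # the disparity (3, 0) should receive the largest P(0)
```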
When matching feature points, the zero-crossing pattern can also be used to obtain
matching primitives [7]. Zero-crossings can be obtained by convolution with the
Laplacian of Gaussian functions (e.g., see [8]). Considering the connectivity of the
In the following, Fig. 5.7 (which is obtained by removing the epipolar line in Fig. 5.2
and then moving the baseline to the X-axis to facilitate description, where the
meanings of the letters are the same as those in Fig. 5.2) will be used to explain
the corresponding relationship between the spatial feature points and the image
feature points.
In the 3D space coordinate system, a feature point W(x, y, -z) is orthogonally
projected on the left and right images as
The calculation of u″ here is carried out according to the coordinate transformation
of first translation and then rotation. Equation 5.24 can also be derived with the
help of Fig. 5.8 (a schematic diagram parallel to the XZ plane in Fig. 5.7 is given
here):
If u" has been determined by u‘ (i.e., the matching between the feature points has
been established), the depth of the feature points projected to u‘ and u" can be
inversely solved from Eq. 5.24 as
It can be seen from the above discussion that the feature points are just some specific
points on the object, and there is a certain interval between them. The dense disparity
field cannot be directly obtained from only sparse matching points, so the shape of
the object may not be accurately recovered. For example, Fig. 5.9a gives four points
that are coplanar in space (equidistant from another space plane). These points are
sparse matching points obtained by disparity calculation. These points are assumed
to be located on the outer surface of the object, but there can be infinitely many
surfaces passing through these four points, and several possible examples are given
as shown in Fig. 5.9b-d. It can be seen that only the sparse matching points cannot
uniquely restore the shape of the object, and it is necessary to combine some other
Fig. 5.9 Only sparse matching points cannot uniquely recover object shape
conditions (such as region matching) or interpolate the sparse matching points to
obtain a dense disparity map.
G(x, y, σ) = [1/(2πσ²)] exp[−(x² + y²)/(2σ²)]  (5.28)

where σ is the scale factor. The image multiscale representation after convolution
with the Gaussian convolution kernel is

L(x, y, σ) = G(x, y, σ) ⊗ f(x, y)  (5.29)
The Gaussian function is a low-pass filter, so convolving it with the image smooths
the image. The scale factor controls the degree of smoothing: a large σ corresponds
to a large scale, and after convolution mainly the coarse structure of the image
remains; a small σ corresponds to a small scale, and the details of the image are
preserved after convolution. In order to make full use of the image information at
different scales, a series of Gaussian convolution kernels with different scale factors
is used to construct the Gaussian pyramid. It is generally assumed that the ratio of
the scale factors between two adjacent layers of the Gaussian pyramid is k. If the
scale factor of the first layer is σ, the scale factor of the second layer is kσ, that of the
third layer is k²σ, and so on.
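The construction of such a multiscale representation can be sketched as follows; this example uses scipy's Gaussian filter as a stand-in for the convolution in Eq. 5.29, and the base scale σ, factor k, and number of levels are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, sigma=1.6, k=2 ** 0.5, levels=5):
    """Return the list [L(x, y, sigma), L(x, y, k*sigma), L(x, y, k^2*sigma), ...]."""
    return [gaussian_filter(image.astype(float), sigma * (k ** i))
            for i in range(levels)]

def dog_pyramid(gaussians):
    """Difference-of-Gaussian layers D = L(., k*sigma) - L(., sigma) (cf. Eq. 5.30)."""
    return [g2 - g1 for g1, g2 in zip(gaussians[:-1], gaussians[1:])]

rng = np.random.default_rng(0)
img = rng.random((128, 128))
L = gaussian_pyramid(img)
D = dog_pyramid(L)
print(len(L), "Gaussian layers,", len(D), "DoG layers")
```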
SIFT then searches for salient feature points in the multiscale representation of
the image using the difference of Gaussian (DoG) operator. DoG is the difference
of the convolution results obtained with two Gaussian kernels of different scales and is
similar to the Laplacian of Gaussian (LoG) operator. If h and k are used to represent
different scale factor coefficients, the DoG pyramid can be expressed as
D(x, y, σ) = [G(x, y, kσ) − G(x, y, hσ)] ⊗ f(x, y) = L(x, y, kσ) − L(x, y, hσ)  (5.30)
The gradient direction at each point of L(x, y) is computed as

θ(x, y) = arctan{[L(x, y + 1) − L(x, y − 1)] / [L(x + 1, y) − L(x − 1, y)]}  (5.32)
After obtaining the orientation of each point, the orientations of the pixels in the
neighborhood can be combined to obtain the orientation of the salient feature points.
See Fig. 5.10 for details. First (on the basis of determining the position and scale of
the salient feature points), take a 16 x 16 window centered on the salient feature
points, as shown in Fig. 5.10a. Divide the window into 16 groups of 4 x 4, as shown
in Fig. 5.10b. Calculate the gradient of each pixel in each group to obtain the
gradient of the pixels in the group, as shown in Fig. 5.10c in which the direction
of the arrow indicates the direction of the gradient and the length of the arrow is
proportional to the magnitude of the gradient. Use the eight-direction (interval 45°)
histogram to count the gradient directions of the pixels in each group, and take the
peak direction as the gradient direction of the group, as shown in Fig. 5.10d. In this
way, each of the 16 groups provides an 8D direction vector, and these vectors are
concatenated to form a 16 × 8 = 128D vector. This vector is normalized and finally used as
the description vector of each salient feature point, that is, the SIFT descriptor. In
practice, the coverage region of SIFT descriptors can be square or circular, also
known as salient patch.
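The assembly of the 128D vector from 4 × 4 groups of 8-bin orientation histograms can be sketched as follows; gradient computation via np.gradient and the omission of Gaussian weighting and rotation normalization are simplifying assumptions relative to the full SIFT procedure.

```python
import numpy as np

def sift_like_descriptor(patch16):
    """Build a 128D descriptor from a 16x16 patch: 4x4 groups, 8-bin
    orientation histograms weighted by gradient magnitude, then normalized."""
    assert patch16.shape == (16, 16)
    gy, gx = np.gradient(patch16.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)              # direction in [0, 2*pi)

    descriptor = []
    for by in range(4):
        for bx in range(4):
            sl = np.s_[4 * by:4 * by + 4, 4 * bx:4 * bx + 4]
            bins = (ang[sl] / (np.pi / 4)).astype(int) % 8   # 8 directions, 45 deg apart
            hist = np.bincount(bins.ravel(), weights=mag[sl].ravel(), minlength=8)
            descriptor.append(hist)
    v = np.concatenate(descriptor)                            # 16 groups x 8 bins = 128D
    return v / (np.linalg.norm(v) + 1e-12)                    # normalization step

patch = np.random.default_rng(0).random((16, 16))
print(sift_like_descriptor(patch).shape)                      # (128,)
```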
SIFT descriptors are invariant to image scaling, rotation, and illumination
changes, and they are also stable to affine transformations, viewing angle changes,
local shape distortion, and noise interference. This is because in the process of
acquiring SIFT descriptors, the influence of rotation is eliminated by calculating
and adjusting the gradient direction; the influence of illumination changes is eliminated
by vector normalization; and the combination of pixel direction information in
the neighborhood is used to enhance robustness. In addition, the SIFT descriptor is
rich in its own information and has good uniqueness (compared to edge points or
corner points that only contain position and extreme value information, the SIFT
descriptor has a 128D description vector). Also owing to this distinctiveness, a large
number of salient patches can often be identified in an image, from which selections
can be made in different applications. Of course, due to the high dimension of the
description vector, the computational complexity of the SIFT descriptor is often
relatively large (the next subsection describes a descriptor for accelerating SIFT).
There are also many improvements to SIFT, including replacing the gradient
histogram with PCA (effective dimensionality reduction), limiting the amplitude of the
histogram in each direction (some nonlinear illumination changes mainly affect the
amplitude), etc.
With the help of SIFT, a large number of local regions (generally more than a
hundred can be obtained for a 256 x 384 image) covering the image that do not
change with the translation, rotation, and zooming of the image can be determined in
the image scale space, and they are very little affected by noise and interference. For
example, Fig. 5.11 shows two results of salient patch detection. On the left is a ship
image and on the right is a beach image. All detected SIFT salient patches are
represented by circles (here, circular salient patches) covering on the image.
The SURF algorithm determines the position and scale information of the points of
interest by calculating the determinant of the second-order Hessian matrix of the
image. The Hessian matrix of image f(x, y) at position (x, y) and scale σ is defined as
follows:

H(x, y, σ) = [hxx(x, y, σ)  hxy(x, y, σ); hxy(x, y, σ)  hyy(x, y, σ)]  (5.33)

where hxx(x, y, σ), hxy(x, y, σ), and hyy(x, y, σ) are the results of convolving the
Gaussian second-order derivatives ∂²G(σ)/∂x², ∂²G(σ)/∂x∂y, and ∂²G(σ)/∂y²
with the image f(x, y) at (x, y), respectively.
The determinant of the Hessian matrix is

det(H) = hxx hyy − hxy²  (5.34)

A point that is a maximum in both scale space and image space is called a point of
interest. The value of the determinant of the Hessian matrix serves as the characteristic
(response) value at the point, and whether the point is an extreme point can be determined
according to the sign of the determinant at that image point.
The Gaussian filter is optimal for scale space analysis, but in practice, after
discretization and quantization, it loses repeatability (because the square template is
anisotropic) when the image is rotated by odd multiples of 45°. For
example, Fig. 5.12a and b show the discretized and quantized Gaussian second-order
partial derivative responses along the X direction and along the bisector of the X and
Y directions, respectively, which are quite different.
In practice, a box filter can be used to approximate the Hessian matrix, resulting in
faster computations (independent of filter size) with integral images (e.g., see [8]).
For example, Fig. 5.12c and d are approximations to the Gaussian second-order
partial differential responses of Fig. 5.12a and b, respectively, where the 9x9 box
filter is an approximation of Gaussian filter of scale 1.2, which also represents the
lowest scale (i.e., the highest spatial resolution) at which the response is computed.
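The speed advantage of box filters comes from integral images, with which the sum over any rectangle costs four array lookups regardless of the filter size; the following sketch shows this basic mechanism (the specific rectangle used is an arbitrary example, not one of the SURF filter layouts).

```python
import numpy as np

def integral_image(img):
    """Integral image ii, where ii[y, x] is the sum of img[:y+1, :x+1]."""
    return np.cumsum(np.cumsum(img.astype(float), axis=0), axis=1)

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] from the integral image with 4 lookups."""
    total = ii[y1 - 1, x1 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1 - 1]
    if x0 > 0:
        total -= ii[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

rng = np.random.default_rng(0)
img = rng.random((100, 100))
ii = integral_image(img)

# The cost of box_sum does not depend on the rectangle size,
# which is why larger and larger box filters can be applied at constant cost.
print(np.isclose(box_sum(ii, 10, 20, 40, 60), img[10:40, 20:60].sum()))
```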
Fig. 5.12 Gaussian second-order partial differential response and its approximation (light colors
represent positive values, dark colors represent negative values, and middle gray represents 0)
Denote the approximate values of hxx(x, y, σ), hxy(x, y, σ), and hyy(x, y, σ) as Axx, Axy,
and Ayy, respectively; then the determinant of the approximate Hessian matrix is

det(Happrox) = Axx Ayy − (W Axy)²

where W is the relative weight that balances the filter responses (i.e., it compensates
for using the box-filter approximation instead of the Gaussian convolution kernel)
so as to maintain the energy balance between the Gaussian kernel and its approximation.
The detection of points of interest needs to be done at different scales. The scale
space is generally represented by a pyramid structure (e.g., see [1]). But because of
the use of box filters and integral images, instead of using the same filter for each
level of the pyramid, box filters of different sizes are used directly on the original
image (at the same computational speed). So, it is possible to up-sample the filter
(Gaussian kernel) without iteratively reducing the image size. Taking the output of the
previous 9x9 box filter as the initial scale layer, the subsequent scale layers can be
obtained by filtering the image with larger and larger templates. Since the image is
Fig. 5.13 Filters between two adjacent scale layers (9 x 9 and 15 x 15)
Fig. 5.14 Illustration of filter side length in different groups (logarithmic horizontal axis)
larger than the size of the corresponding filter, a fourth group can also be computed,
using filters of sizes 51, 99, 147, and 195. Figure 5.14 shows the full
picture of the filter used, and each group overlaps with each other to ensure smooth
coverage of all possible scales. In typical scale space analysis, the number of points
of interest that can be detected in each group decreases very quickly.
The large scale change, especially between the first filters of these groups (the
change from 9 to 15 is a factor of about 1.7), makes the scale sampling quite coarse.
For this reason, a scale space with finer scale sampling can also be used. In this case,
the image is first scaled up by a factor of 2, and the filter of size 15 is then used to
start the first group. The next filter sizes are 21, 27, 33, and 39. Then the second
group starts, whose sizes increase in steps of 12 pixels, and so on. Thus, the scale
change between the first two filters is only 1.4 (21/15), and the minimum scale that
can be detected by quadratic interpolation is σ = (1.2 × 18/9)/2 = 1.2.
Since the Frobenius norm remains constant for any filter size, the filters can be
considered already normalized in scale, and it is no longer necessary to weight the
filter responses.
Around each point of interest, the Haar wavelet responses dx and dy are computed
in 4 × 4 subregions; the responses dx and dy are summed over each subregion, and
the absolute values |dx| and |dy| of the wavelet responses are also summed, respectively. In
this way, a 4D description vector V can be obtained from each subregion,
V = (Σdx, Σdy, Σ|dx|, Σ|dy|). For all 16 subregions, the description vectors are
concatenated to form a description vector of length 64D (see the code sketch given
after this SURF discussion). The wavelet responses thus obtained are insensitive
to changes in illumination, and invariance to contrast (a scalar factor) is obtained
by converting the descriptor into a unit vector.
Figure 5.17 is a schematic diagram of three different brightness patterns and the
descriptors obtained from the corresponding subregions. On the left is a uniform
pattern, and each component of the descriptor is very small; in the middle there is
an alternating pattern along the X direction, so only Σ|dx| is large and the rest are
small; on the right the brightness gradually increases along the horizontal direction,
so the values of Σdx and Σ|dx| are both large. It can be seen that the descriptors
differ markedly for different brightness patterns. It is also conceivable that if these
three local brightness patterns are combined, a specific descriptor is obtained.
Fig. 5.18 An image subregion without noise and with noise
The principle of SURF is similar to that of SIFT to some extent. Both of them
are based on the spatial distribution of gradient information. But in practice,
SURF often has better performance than SIFT. The reason here is that SURF
gathers all gradient information in the subregion, while SIFT only depends on the
orientation of each independent gradient. This difference makes SURF more
noise resistant. An example is shown in Fig. 5.18. When there is no noise,
SIFT has only one gradient direction; when there is noise (the edge is no longer
smooth), SIFT has a certain gradient component in other directions except that the
main gradient direction remains unchanged. However, the SURF response is
basically the same in both cases (the noise is smoothed).
The evaluation experiments on the number of sampling points and subregions
show that the square subregion divided by 4 x 4 can give the best results. Further
subdivision will lead to poor robustness and greatly increase the matching time.
On the other hand, the short descriptor (SURF-36, i.e., 3 x 3 = 9 subregions,
4 responses per subregion) obtained by using the subregion of 3 x 3 will slightly
reduce the performance (acceptable compared with other descriptors), but the
calculation is much faster.
Besides, there is a variant of the SURF descriptor, namely, SURF-128. It uses
the same sums as before but splits the values more finely: the sums of dx and |dx|
are computed separately according to whether dy < 0 or dy ≥ 0, and similarly
the sums of dy and |dy| are computed separately according to whether dx < 0 or
dx ≥ 0. In this way, the number of features is doubled, and the robustness and
reliability of the descriptor are improved. However, although the descriptor itself
is still fast to calculate, the amount of computation during matching increases
because of the higher dimension.
3. Quick index for matching
In order to index quickly during matching, the sign of the Laplacian value (i.e.,
of the trace of the Hessian matrix) of the point of interest can be considered. Generally, the
points of interest are detected at blob-like structures. The sign of
Laplacian value can distinguish the bright patch in the dark background from the
dark patch in the bright background. No additional calculation is required here,
because the sign of Laplacian value has been calculated in the detection step. In
the matching step, only the signs of the Laplacian values need to be compared. With
this information, the matching speed can be increased without degrading the
performance of the descriptors.
The advantages of the SURF algorithm include that it is not affected by image
rotation and scale changes and that it is robust to blur. Its disadvantages include
that it is strongly affected by changes of viewpoint and illumination.
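Under the simplifying assumptions that the Haar wavelet responses are approximated by plain finite differences and that no Gaussian weighting or orientation assignment is applied, the 64D SURF-style descriptor described above can be sketched as follows; the function and parameter names are illustrative.

```python
import numpy as np

def surf_like_descriptor(patch20):
    """Build a 64D SURF-style descriptor from a 20x20 patch around a point of
    interest: 4x4 subregions, each contributing (sum dx, sum dy, sum |dx|, sum |dy|)."""
    assert patch20.shape == (20, 20)
    p = patch20.astype(float)
    # Crude stand-ins for the Haar wavelet responses dx and dy
    dx = np.zeros_like(p)
    dy = np.zeros_like(p)
    dx[:, 1:-1] = p[:, 2:] - p[:, :-2]
    dy[1:-1, :] = p[2:, :] - p[:-2, :]

    features = []
    for by in range(4):
        for bx in range(4):
            sl = np.s_[5 * by:5 * by + 5, 5 * bx:5 * bx + 5]
            features += [dx[sl].sum(), dy[sl].sum(),
                         np.abs(dx[sl]).sum(), np.abs(dy[sl]).sum()]
    v = np.asarray(features)                     # 16 subregions x 4 values = 64D
    return v / (np.linalg.norm(v) + 1e-12)       # contrast invariance: unit vector

patch = np.random.default_rng(1).random((20, 20))
d = surf_like_descriptor(patch)
print(d.shape)                                   # (64,)
```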
The selection method of feature points is often closely related to the matching
method used for them. For the matching of feature points, it is necessary to establish
the corresponding relationship between feature points. Therefore, the ordering
constraint conditions can be used and the dynamic programming method can be
used [10].
Taking Fig. 5.19a as an example, consider three feature points on the visible
surface of the observed object, named A, B, and C in sequence. Their projections
(along the epipolar line) on the two images appear in exactly the reverse order (see
c, b, and a as well as c', b', and a', respectively). This regularity of opposite order is
called the ordering constraint. The ordering constraint describes an ideal situation
and cannot be guaranteed to always hold in actual scenes. For example, in the case
shown in Fig. 5.19b, a small object passes in front of a large object behind it and
blocks part of the large object, so that the original point c and point a' cannot be
seen in the images, and the order of the projections no longer satisfies the ordering
constraint.
In practical applications, the parallax map will contain errors due to periodic
patterns, smooth regions, occlusion effects, and the looseness of the constraint
principles. Since the parallax map is the basis for subsequent 3D reconstruction and
other processing, such errors need to be detected and corrected.
With the help of the ordering constraints discussed above, we first define the concept
of ordering matching constraints. Suppose that fL(x, y) and fR(x, y) are a pair of
(horizontal) images and OL and OR are their imaging centers, respectively. Let P and
Q be two points that do not coincide in space, PL and QL be the projections of P and
Q on fL(x, y), and PR and QR the projections of P and Q on fR(x, y), as shown in
Fig. 5.21 (see the discussion on binocular imaging in Sect. 3.3).
Assuming that X(·) is used to represent the X coordinate of a pixel, it can be seen
from Fig. 5.21 that, in a correct match, if X(P) < X(Q), then X(PL) < X(QL) and
X(PR) < X(QR); and if X(P) > X(Q), then X(PL) > X(QL) and X(PR) > X(QR). So, if
the following condition holds (⇒ denotes implication),

X(PL) > X(QL) ⇒ X(PR) > X(QR)  (5.37)
Then it is said that PR and QR satisfy the ordering matching constraint; otherwise,
it is said that there is a crossover here. It can be seen from Fig. 5.21 that the ordering
matching constraints have certain restrictions on the Z coordinates of points P and Q,
which are relatively easy to determine in practical applications.
Matching crossover regions can be detected based on the concept of the ordering
matching constraint. Let PR = fR(i, j) and QR = fR(k, j) be any two pixels in the j-th
row of fR(x, y); their matching points in fL(x, y) can then be written as
PL = fL(i + d(i, j), j) and QL = fL(k + d(k, j), j), respectively. Define C(PR, QR)
as the cross label between PR and QR: if Eq. 5.37 holds, denote it as C(PR, QR) = 0;
otherwise, denote it as C(PR, QR) = 1. The crossing number Nc of the pixel PR is
defined as

Nc(i, j) = Σ_{k=0, k≠i}^{N−1} C(PR, QR)  (5.38)

where N is the number of pixels in the image row.
If an interval in which the crossing numbers are not zero is called a crossing
interval, the mismatches in a crossing interval can be corrected by the following
algorithm. Suppose {fR(i, j) | i ∈ [p, q]} is the crossing interval; then the total
crossing number Ntc of all pixels in this interval is

Ntc = Σ_{i=p}^{q} Nc(i, j)  (5.39)

1. Find the pixel fR(k, j) with the largest crossing number Nc in the crossing interval
(Eq. 5.40).
2. Determine the new search range, {fL(i, j) | i ∈ [s, t]}, for the matching point
fR(k, j), where s and t are given by Eq. 5.41.
3. Find a new matching point from the search range that can reduce the total number
of crossings Ntc (e.g., the maximum grayscale correlation matching technology
can be used).
4. Use the new matching point to correct d(k, j) to eliminate the mismatch
corresponding to the pixel with the current maximum number of crossings.
The above steps can be used iteratively, and after correcting one mismatched
pixel, continue to correct each remaining error pixel. After d(k, j ) is corrected,
Nc(i, j) in the cross interval is recalculated by Eq. 5.38, and Ntc is calculated
according to Eq. 5.39. Then, the next round of correction process is performed
according to the above iteration until Ntc = 0. Because the correction principle is to
make Ntc = 0, it can be called a zero-cross correction algorithm. After correction, a
parallax map that complies with the ordering matching constraints can be obtained.
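The detection part of this zero-cross correction scheme can be sketched as follows: given the disparities d(i, j) of one image row, the code counts, for each pixel, how many other pixels violate the ordering matching constraint with it (Eq. 5.38) and sums these counts (Eq. 5.39); the simple O(N²) loop, the data layout, and the synthetic disparity values are illustrative assumptions.

```python
import numpy as np

def crossing_numbers(disparity_row):
    """Nc for each pixel of a row: the number of other pixels whose matched
    positions in the left image violate the ordering matching constraint."""
    n = len(disparity_row)
    matched = np.arange(n) + np.asarray(disparity_row)   # x_L = i + d(i, j)
    nc = np.zeros(n, dtype=int)
    for i in range(n):
        for k in range(n):
            if k == i:
                continue
            # A crossover: the order of i, k in the right image differs from
            # the order of their matches in the left image
            if (i - k) * (matched[i] - matched[k]) < 0:
                nc[i] += 1
    return nc

# Synthetic row of disparities in which three pixels are given a wrong, smaller
# disparity (a situation analogous to the worked example around Table 5.2)
d_row = [28, 28, 28, 28, 28, 28, 28, 21, 21, 21, 28]
nc = crossing_numbers(d_row)
print("Nc per pixel:", nc)
print("total crossing number Ntc:", nc.sum())   # > 0 signals a crossing interval
```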
The above process of detecting and eliminating errors can be exemplified as
follows. Assume that the results of parallax calculation in the interval [153, 163] in
the j-th row of an image are shown in Table 5.2, and the distribution of each
matching point in this interval is shown in Fig. 5.22. According to the correspondence
between fL(x, y) and fR(x, y), the matching points in the interval [160, 162] are
mismatched points. Table 5.3 can be obtained by calculating the number of crosses
according to Eq. 5.38.
It can be seen from Table 5.3 that the interval [fR(154, j), fR(162, j)] is a crossing
interval. From Eq. 5.39, Ntc = 28 is obtained, and from Eq. 5.40 it follows that the
pixel with the largest number of crossings is fR(160, j). Then, according to Eq. 5.41,
the search range of the new matching point for fR(160, j) can be determined as
{fL(i, j) | i ∈ [181, 190]}. A new matching point can then be found in this range
according to the maximum grayscale correlation matching technique, and d(160, j)
is corrected accordingly.
Fig. 5.22 Matching points between pixels 153–163 of fR(x, y) and pixels 181–191 of fL(x, y)
With the development of deep learning technology, it has been widely applied to stereo
matching. Different from traditional matching algorithms based on man-made
features, stereo matching algorithms based on deep learning can extract richer
image features for cost calculation by nonlinearly transforming the images through
convolution, pooling, and fully connected layers. Compared with man-made features, deep
learning features provide more context, make better use of the global
information of the image, and obtain the model parameters through training, which
improves the robustness of the algorithm. At the same time, GPU acceleration
can provide faster processing and meet the real-time requirements
of many application fields [12].
Currently, image networks for stereo matching mainly include image pyramid
networks, Siamese networks, and generative adversarial networks [13].
A spatial pyramid pooling layer is inserted between the convolution layers and the
fully connected layer; it converts image features of different sizes into a fixed-length
representation [14]. This avoids repeated convolution computations and removes the
constraint that all input images must have the same size.
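A minimal sketch of such a spatial pyramid pooling layer, assuming PyTorch is available; the pyramid levels (4 × 4, 2 × 2, 1 × 1) are a common choice and are used here only as an illustration.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(features, levels=(4, 2, 1)):
    """Pool a feature map of arbitrary spatial size into a fixed-length vector.

    features: tensor of shape (batch, channels, H, W) with arbitrary H, W.
    Returns a tensor of shape (batch, channels * sum(l*l for l in levels)).
    """
    pooled = []
    for level in levels:
        # Adaptive pooling produces a level x level grid regardless of H, W
        p = F.adaptive_max_pool2d(features, output_size=(level, level))
        pooled.append(p.flatten(start_dim=1))
    return torch.cat(pooled, dim=1)

# Feature maps of different spatial sizes yield representations of the same length
small = torch.randn(1, 64, 13, 17)
large = torch.randn(1, 64, 40, 55)
print(spatial_pyramid_pool(small).shape)   # torch.Size([1, 64 * (16 + 4 + 1)])
print(spatial_pyramid_pool(large).shape)   # same fixed length
```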
Table 5.4 lists some typical methods using image pyramid networks, together with
their characteristics, principles, and effects.

Table 5.4 Several typical methods using image pyramid networks and their characteristics, principles, and effects

Methods | Features and principles | Effect
[15] | A convolutional neural network is used to extract image features for cost calculation | The man-made features are replaced by deep learning features
[16] | A pyramid pooling module is introduced into feature extraction, and multiscale analysis and a 3D-CNN structure are adopted | It solves the problems of gradient vanishing and gradient explosion and is suitable for weak texture, occlusion, uneven lighting, and so on
[17] | Group-wise cost calculation is built | Computational efficiency is improved by replacing the 3D convolution layer
[18] | A semi-global aggregation layer and a local guided aggregation layer are designed | Computational efficiency is improved by replacing the 3D convolution layer
The basic structure of the Siamese network is shown in Fig. 5.25 [19]. First, the two
input images to be matched are converted into two feature vectors by two
weight-sharing convolutional neural networks (CNNs), and then the similarity
between the two images is determined according to the L1 distance between the two
feature vectors.
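A minimal PyTorch sketch of this structure, with a small illustrative CNN as the shared branch; the layer sizes are arbitrary assumptions, and only the weight sharing and the L1-distance comparison follow the description above.

```python
import torch
import torch.nn as nn

class SiameseBranch(nn.Module):
    """One shared feature extractor; both inputs pass through the same weights."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

class SiameseMatcher(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch = SiameseBranch()          # single instance => shared weights

    def forward(self, patch_a, patch_b):
        fa, fb = self.branch(patch_a), self.branch(patch_b)
        # L1 distance between the two feature vectors; smaller => more similar
        return torch.sum(torch.abs(fa - fb), dim=1)

matcher = SiameseMatcher()
left_patch = torch.randn(8, 1, 32, 32)         # a batch of left-image patches
right_patch = torch.randn(8, 1, 32, 32)        # candidate right-image patches
print(matcher(left_patch, right_patch).shape)  # torch.Size([8]) similarity scores
```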
Current methods improve on this basic Siamese structure; Table 5.5 shows their
characteristics, principles, and effects.

Table 5.5 Several improvements using Siamese networks and their characteristics, principles, and effects

Methods | Characteristics and principles | Effects
[20] | The convolutional layers are deepened, using the ReLU function and small convolution kernels | The matching accuracy is improved
[21] | When extracting features, the disparity map is first calculated from a low-resolution cost volume, and a hierarchical refinement network is then used to introduce high-frequency details | The color input is used as a guide; high-quality boundaries can be generated
[22] | Pyramid pooling is used to connect two subnetworks. The first subnetwork is composed of a Siamese network and a 3D convolutional network, which generates a low-precision disparity map; the second subnetwork is a fully convolutional network, which restores the initial disparity map to the original resolution | Multiscale features can be obtained
[23] | Depth discontinuities are processed on the low-resolution disparity map and restored to the original resolution in the disparity refinement stage | The continuity at depth discontinuities is improved
Table 5.6 Several improvements using GAN and their characteristics, principles, and effects

Methods | Characteristics and principles | Effects
[25] | A binocular-vision-based GAN framework is used, including two generative subnetworks and one discriminative network. The two generative networks are trained to reconstruct the disparity map in the adversarial learning; through mutual restriction and supervision, two disparity maps from different perspectives are generated, and the final result is output after fusion | This unsupervised model works well under uneven lighting conditions
[26] | Generative models are used to process occluded regions | A good parallax effect can be recovered
[27] | Generative adversarial models using deep convolutions obtain multiple depth maps from adjacent frames | The visual effect of the depth map in occluded regions is improved
[28] | Two images from the left and right cameras are used to generate a brand-new image that improves the poorly matched parts of the disparity map | The disparity map in regions with poor lighting is improved
In a generative adversarial network (GAN), the generative model is used to produce a
generated image similar to the original image, while the discriminative model is used
to distinguish the "generated" image from the real image [24]. This process runs
iteratively until the discrimination result reaches the Nash equilibrium, that is,
the probabilities of judging an image as true or false are both 0.5.
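A minimal PyTorch sketch of one adversarial training step, using toy fully connected generator and discriminator networks; the architectures, data, and optimizer settings are illustrative assumptions and are unrelated to the specific stereo matching networks of Table 5.6.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))      # generator
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))       # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, 32)                     # stand-in for real samples
noise = torch.randn(8, 16)

# Discriminator step: label real samples 1 and generated samples 0
fake = G(noise).detach()
loss_d = bce(D(real), torch.ones(8, 1)) + bce(D(fake), torch.zeros(8, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make the discriminator label generated samples as real
loss_g = bce(D(G(noise)), torch.ones(8, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# At equilibrium, D's output probability approaches 0.5 for both kinds of input
print(torch.sigmoid(D(G(noise))).mean().item())
```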
Several modifications have been made to this basic GAN approach; their
characteristics, principles, and effects are shown in Table 5.6.
References
1. Zhang Y-J (2017) Image Engineering, Vol. 1: Image Processing. De Gruyter, Germany.
2. Kanade T, Yoshida A, Oda K, et al. (1996) A stereo machine for video-rate dense depth
mapping and its new applications. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 196-202.
3. Lew MS, Huang TS, Wong K (1994) Learning and feature selection in stereo matching. IEEE
Transactions on Pattern Analysis and Machine Intelligence 16(9): 869-881.
4. Zhang Y-J (2002) Image Engineering (Addendum)—Teaching Reference and Problem Solving.
Tsinghua University Press, Beijing.
5. Forsyth D, Ponce J (2012) Computer Vision: A Modern Approach, 2nd Ed. Prentice Hall,
London.
6. Davies ER (2005) Machine Vision: Theory, Algorithms, Practicalities, 3rd Ed. Elsevier,
Amsterdam.
7. Kim YC, Aggarwal JK (1987) Positioning three-dimensional objects using stereo images. IEEE
Transactions on Robotics and Automation 1: 361-373.
8. Zhang Y-J (2017) Image Engineering, Vol. 2: Image analysis. De Gruyter, Germany.
9. Nixon MS, Aguado AS (2008) Feature Extraction and Image Processing. 2nd Ed. Academic
Press, Maryland.
10. Forsyth D, Ponce J (2003) Computer Vision: A Modern Approach. Prentice Hall, London.
11. Jia B, Zhang Y-J, Lin XG (2000) General and fast algorithm for disparity error detection and
correction. Journal of Tsinghua University (Science & Technology) 40(1): 28-31.
12. Li, JI, Liu T, Wang XF (2022) Advanced pavement distress recognition and 3D reconstruction
by using GA-DenseNet and binocular stereo vision. Measurement, 201: 111760 https://doi.org/
10.1016/j.measurement.2022.111760.
13. Chen Y, Yang LL, Wang ZP (2020) Literature survey on stereo vision matching algorithms.
Journal of Graphics 41(5): 702-708.
14. He KM, Zhang XY, Ren SQ, et al. (2015) Spatial pyramid pooling in deep convolutional
networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(9): 1904-1916.
15. Zbontar J, Lecun Y (2015). Computing the stereo matching cost with a convolutional neural
network. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1592-1599.
16. Chang J, Chen Y (2018) Pyramid stereo matching network. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 5410-5418.
17. Guo XY, Yang K, Yang WK, et al. (2019) Group-wise correlation stereo network. IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 3268-3277.
18. Zhang FH, Prisacariu V, Yang RG, et al. (2019) GA-NET: Guided aggregation net for end-to-
end stereo matching. IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
185-194.
19. Bromley J, Bentz JW, Bottou L, et al. (1993) Signature verification using a “Siamese” time
delay neural network. International Journal of Pattern Recognition and Artificial Intelligence
7(4): 669-688.
20. Zagoruyko S, Komodakis N (2015) Learning to compare image patches via convolutional
neural networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
4353-4361.
21. Khamis S, Fanello S, Rhemann C, et al. (2018) StereoNet: Guided hierarchical refinement for
real-time edge-aware depth prediction. European Conference on Computer Vision (ECCV)
596-613.
22. Liu GD, Jiang GL, Xiong R, et al. (2019) Binocular depth estimation using convolutional neural
network with Siamese branches. IEEE International Conference on Robotics and Biomimetics
(ROBIO) 1717-1722.
23. Guo CG, Chen DY, Huang ZQ. (2019) Learning efficient stereo matching network with depth
discontinuity aware super-resolution. IEEE Access 7: 159712-159723.
24. Luo JY, Xu Y, Tang CW, et al. (2017) Learning inverse mapping by AutoEncoder based
generative adversarial nets. Neural Information Processing 207-216.
25. Pilzer A, Xu D, Puscas M, et al. (2018) Unsupervised adversarial depth estimation using cycled
generative networks. International Conference on 3D Vision (3DV) 587-595.
26. Matias LPN, Sons M, Souza JR, et al. (2019) VeIGAN: Vectorial inpainting generative
adversarial network for depth maps object removal. IEEE Intelligent Vehicles Symposium
(IV) 310-316.
27. Lore KG, Reddy K, Giering M, et al. (2018) Generative adversarial networks for depth map
estimation from RGB video. IEEE Conference on Computer Vision and Pattern Recognition
Workshops (CVPRW) 1177-1185.
28. Liang H, Qi L, Wang ST, et al. (2019) Photometric stereo with only two images: a generative
approach. IEEE 2nd International Conference on Information Communication and Signal
Processing (ICICSP) 363-368.
29. Wu JJ, Chen Z, Zhang CX (2021) Binocular stereo matching based on feature cascade
convolutional network. Acta Electronica Sinica 49(4): 690-695.
30. Huang G, Liu Z, Van Der Maaten L, et al. (2017) Densely connected convolutional networks.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 4700-4708.
Chapter 6
Multi-ocular Stereovision
From the discussion in Sect. 3.3.1, in stereovision using the binocular horizontal
mode, the parallax d between the two images has the following relationship with the
baseline B between the two cameras (λ represents the camera focal length):

d = λB/(Z − λ) ≈ λB/Z  (6.1)

where the last step is the simplification that applies when, as is generally the case, Z >> λ.
It can be seen from Eq. 6.1 that for a given object distance Z, the parallax d is
proportional to the baseline length B. The larger the baseline length B, the more
accurate the distance calculation will be. However, the problem brought by the
longer baseline length is that a larger parallax range needs to be searched to find
matching points, which not only increases the amount of calculation but also
increases the probability of false matching when there are periodic repetitive features
in the image (see below).
In order to solve the above problems, the method of multi-ocular stereovision [1]
can be used, that is, using multiple cameras to obtain multiple pairs of related
images, and the baseline used for each pair of images is not very long (so the
searched parallax range is not very large), but combining multiple pairs of images
is equivalent to form a longer baseline, which can improve parallax measurement
accuracy without increasing the probability of false matches when there are periodic
repeating features in the images.
From binocular to multi-ocular, the most direct method is to add cameras along the
extension line of the original binocular baseline to form the multi-ocular. For the
binocular horizontal mode, a set of image sequences along the (horizontal) baseline
direction is used for stereo matching to become the multi-ocular horizontal mode.
The basic idea of this method is to reduce the overall mismatch by computing the
sum of squared differences (SSD) between pairs of images [2]. Assuming the
camera moves along a horizontal line perpendicular to the optical axis (multiple
cameras can also be used), acquire a series of images fi(x, y) at points P0, P1, P2, ...,
PM, i = 0, 1, ..., M (see Fig. 6.1), resulting in a series of image pairs whose baseline
lengths are B1, B2, ..., BM, respectively.
According to Fig. 6.1, the parallax between the image captured at point P0 and the
image captured at point Pi is

di = λBi/Z
Because only the horizontal direction is considered here, the image function
f(x, y) can be simplified to f(x), and the image obtained at each position can be
written as

fi(x) = f(x − di) + ni(x),  i = 0, 1, ..., M

It is assumed that the noise ni(x) follows a Gaussian distribution with mean 0 and
variance σn², that is, ni(x) ~ N(0, σn²).
In f0(x), the SSD value at position x is (W is the matching window)

Sd(x; d̂i) = Σ_{j∈W} [f0(x + j) − fi(x + d̂i + j)]²  (6.4)

where d̂i is the parallax estimate at position x. Since the SSD is a random variable, its
expected value can be calculated as a global evaluation function (let NW be the
number of pixels in the matching window):

E[Sd(x; d̂i)] = E{ Σ_{j∈W} [f(x + j) − f(x + d̂i − di + j) + n0(x + j) − ni(x + d̂i + j)]² }
             = Σ_{j∈W} [f(x + j) − f(x + d̂i − di + j)]² + 2NW σn²  (6.5)
The above equation shows that E[Sd(x; d̂i)] attains its minimum value when d̂i = di. If
the image has the same grayscale pattern at x and x + p (p ≠ 0), i.e.,

f(x + j) = f(x + p + j),  j ∈ W  (6.6)
This shows that the expected value of the SSD is likely to be extreme at both x and
x + p, i.e., there is an uncertainty problem, which will result in errors (mismatches).
Mismatching at x + p occurs for all image pairs (the location of the mismatch is not
related to the baseline length and baseline number), and the error cannot be avoided
even with multi-ocular images.
Now introduce the concept of inverse distance (or inverse depth), and search for the
correct parallax by searching for the correct inverse distance. The inverse distance
t satisfies

t = 1/Z  (6.8)

The true parallax and the parallax estimate can then be expressed in terms of the
inverse distance as

ti = di/(λBi)  (6.9)

t̂i = d̂i/(λBi)  (6.10)

where ti and t̂i are the true and estimated inverse distances, respectively.
Substituting Eq. 6.10 into Eq. 6.4, the SSD expressed as a function of the inverse
distance is

St(x; t̂i) = Σ_{j∈W} [f0(x + j) − fi(x + λBi t̂i + j)]²

Adding all the SSDs corresponding to the M inverse distance estimates provides the
sum of SSDs (SSSD), which can be expressed as

S^(S)_{t(12...M)}(x; t̂) = Σ_{i=1}^{M} St(x; t̂i)  (6.13)

and its expected value is

E[S^(S)_{t(12...M)}(x; t̂)] = Σ_{i=1}^{M} E[St(x; t̂i)]  (6.14)
Now consider the aforementioned problem of an image having the same pattern at
x and x + p (see Eq. 6.6). It should be noted that the uncertainty problem still exists
here, because there is still a minimum at the inverse distance tp = ti + p/(Bi λ).
However, tp now consists of two terms: as Bi changes, tp changes, whereas ti does
not. In other words, for each camera the parallax is proportional to the inverse
distance, but the location of the false minimum differs from camera to camera. This is an
important property when using inverse distance in SSSD, and it can help eliminate
uncertainty problems caused by periodic patterns. Specifically, by choosing different
baselines, the minimum values of the sum of squared differences between each pair
of images are located at different positions. Taking the use of two baselines B1 and
B2 (B1 ≠ B2) as an example, it follows from Eq. 6.14 that only at the correct matching
position t does the expected SSSD E[S^(S)_{t(12)}(x; t)] attain a true (common) minimum.
It can be seen that by using two baselines with different lengths, the uncertainty
caused by repeated patterns can be resolved with the help of the new metric function.
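The way combining several baselines suppresses the false minima of a periodic pattern can be illustrated numerically; the following sketch builds 1D images from a periodic signal according to the model fi(x) = f(x − di) + ni(x), computes the SSD of Eq. 6.4 for each baseline over a grid of inverse distances (using d̂i = λBi t̂ from Eq. 6.10), and sums them into the SSSD of Eq. 6.13. The signal, noise level, focal length, and baselines are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, Z = 1.0, 0.1                 # assumed focal length and true depth
t_true = 1.0 / Z                  # true inverse distance (Eq. 6.8)
baselines = [0.8, 1.0, 1.3]       # three different baselines B_i

x = np.arange(400)
f = np.sin(2 * np.pi * x / 8.0)   # periodic scene signal (period 8 pixels)

# Reference image f0 and the shifted images f_i with d_i = lam * B_i / Z
f0 = f + 0.05 * rng.standard_normal(f.size)
shifted = []
for B in baselines:
    d_i = lam * B / Z
    shifted.append(np.roll(f, int(round(d_i))) + 0.05 * rng.standard_normal(f.size))

def ssd(f0, fi, x0, d_hat, half_w=8):
    j = np.arange(-half_w, half_w + 1)
    return np.sum((f0[x0 + j] - fi[x0 + int(round(d_hat)) + j]) ** 2)

x0 = 200
t_grid = np.linspace(1.0, 20.0, 191)          # candidate inverse distances
sssd = np.zeros_like(t_grid)
for fi, B in zip(shifted, baselines):
    # d_hat = lam * B_i * t_hat (Eq. 6.10); each baseline adds its own SSD curve
    sssd += np.array([ssd(f0, fi, x0, lam * B * t) for t in t_grid])

print("true inverse distance:", t_true)
print("SSSD minimum at t =", t_grid[np.argmin(sssd)])   # close to t_true
```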
Figure 6.2 gives an example of the effect of the new metric function
[1]. Figure 6.2a shows the curve of f(x). Let d1 = 5 and σn² = 0.2, corresponding to
the baseline B1, and let the window size be 5. Figure 6.2b plots E[Sd1(x; d̂)], which
has minima at both d̂ = 5 and d̂ = 13, so the match is ambiguous.
Fig. 6.2 Example curves of E[S^(S)_{t(12)}(x, t)] versus t
It can be seen from the figure that, because the SSD curves are superimposed, the
minimum value at the correct matching position is smaller than the minimum
values at the false matching positions (and the difference between the two increases
with the number of images that are combined). In other words, the correct matching
position corresponds to the global minimum, which resolves the uncertainty problem.
Consider the case where f(x) is a periodic function with period T. Each St(x, t) is
then a periodic function of t whose period is Ti = T/(Bi λ), so a minimum occurs every
T/(Bi λ) along the t axis. When two baselines are used, the resulting S^(S)_{t(12)}(x; t) is
still a periodic function of t, but its period T12 increases to

T12 = LCM(T1, T2)  (6.19)

Here LCM stands for the least common multiple, so T12 will not be smaller than
T1 or T2. Further, by selecting appropriate baselines B1 and B2, it is
possible to make only one minimum value in the search interval, that is, the
uncertainty problem will be eliminated.
It can be seen from Fig. 4.5 that when binocular vision processes the matching
parallel to the epipolar line direction (i.e., the horizontal scanning line direction),
there will be a mismatch due to the lack of obvious features in the grayscale smooth
region. At this time, the gray values in the matching window will take the same
value within a certain range, so the matching position cannot be determined. This
kind of mismatching problem caused by the smooth grayscale region is inevitable in
the use of binocular stereo matching. The parallel-baseline multi-ocular stereo
matching method described earlier in Sect. 6.1 does not eliminate mismatches due
to this cause (although it can eliminate mismatches due to periodic patterns).
In practical applications, regions with relatively smooth gray levels in the horizontal
direction often have obvious grayscale differences in the vertical direction; in other
words, they are not smooth vertically. This suggests that one can use the
image pairs in the vertical direction to perform a vertical search to solve the problem
of mismatches that are easily generated by matching in the horizontal direction in
these regions. Of course, for the grayscale smooth region in the vertical direction,
only using the image pairs in the vertical direction may also cause a mismatch
problem, and it is necessary to use the image pairs in the horizontal direction to
perform horizontal matching.
Since both the horizontal grayscale smooth region and the vertical grayscale smooth
region may appear in the image, it is necessary to collect the horizontal image pair
and the vertical image pair at the same time. In the simplest case, two pairs of
orthogonal acquisition positions can be arranged on the plane (see Fig. 6.3). Here,
the left image L and the right image R form a horizontal stereo image pair, whose
baseline is Bh, and the left image L and the top image T form a vertical stereo image
pair, whose baseline is Bv. These two pairs of images constitute a set of orthogonal
trinocular images.
The characteristics of stereovision matching using orthogonal trinocular images
can be analyzed by referring to the method in Sect. 6.1. The images obtained at the
three acquisition positions can be represented as follows (since this is an orthogonal
acquisition, the images are represented by f(x, y))
where dh and dv are the horizontal and vertical parallaxes, respectively (see Eq. 6.3).
In the following discussion, let dh = dv = d, and the SSDs corresponding to the
horizontal and vertical directions are
Adding them up gives the orthogonal parallax metric function O(S)(x, y; d):

O(S)(x, y; d) = Sh(x, y; d) + Sv(x, y; d)  (6.23)
It can be seen that E[O(S)(x, y; d)] achieves a minimum at the correct parallax value.
It also follows from the above discussion that, when orthogonal trinocular images are
used, it is not necessary to resort to the inverse distance in order to eliminate
repeating patterns in one direction.
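A small sketch of the combined metric of Eq. 6.23: the horizontal and vertical SSDs are computed from a left/right and a left/top image pair and added; the synthetic images, window size, and the assumption dh = dv = d follow the simplifications above, with all numerical values being illustrative.

```python
import numpy as np

def ssd(a, b, y, x, dy, dx, half_w=3):
    """Window SSD between a (reference) and b displaced by (dy, dx)."""
    win = np.s_[y - half_w:y + half_w + 1, x - half_w:x + half_w + 1]
    shifted = np.s_[y + dy - half_w:y + dy + half_w + 1,
                    x + dx - half_w:x + dx + half_w + 1]
    return float(np.sum((a[win].astype(float) - b[shifted].astype(float)) ** 2))

rng = np.random.default_rng(0)
f_left = rng.integers(0, 256, size=(64, 64)).astype(float)
d_true = 4
f_right = np.roll(f_left, d_true, axis=1)      # horizontal pair: shift along x
f_top = np.roll(f_left, d_true, axis=0)        # vertical pair: shift along y

y0, x0 = 30, 30
candidates = range(0, 9)
# O(S)(x, y; d) = Sh(x, y; d) + Sv(x, y; d)  (cf. Eq. 6.23)
O = [ssd(f_left, f_right, y0, x0, 0, d) + ssd(f_left, f_top, y0, x0, d, 0)
     for d in candidates]
print("estimated parallax d =", int(np.argmin(O)))   # expected: d_true
```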
An example of using the orthogonal trinocular method to eliminate the mismatches
brought to the binocular method by grayscale smooth regions in one direction is
shown in Fig. 6.4. Figure 6.4a–c are a group of square-cone images (the left image,
right image, and top image, in turn) with smooth grayscale regions in the horizontal
and vertical directions; Fig. 6.4d is the parallax map obtained by stereo matching
using only the horizontal binocular images; Fig. 6.4e is the parallax map obtained by
stereo matching using only the vertical binocular images; and Fig. 6.4f is the
parallax map obtained by stereo matching using the orthogonal trinocular images.
Figure 6.4g–i are the 3D perspective views corresponding to Fig. 6.4d–f, respectively.
From these images, it can be seen that in the parallax map obtained from the
horizontal binocular images, there are obvious mismatches (horizontal black strips)
at the horizontally smooth grayscale regions; in the parallax map obtained from the
vertical binocular images, obvious mismatches (vertical black strips) occur in the
vertically smooth grayscale regions; and in the parallax map obtained from the
orthogonal trinocular images, the mismatches caused by the various unidirectional
grayscale smooth regions are eliminated. That is, the parallax calculations are
correct in the various regions, and these results are also clearly visible in the 3D
perspective views.
Fig. 6.4 Examples of orthogonal trinocular stereo matching to eliminate mismatches in grayscale
smooth regions
The orthogonal trinocular stereo matching method can reduce not only the
mismatches caused by grayscale smooth regions but also those caused by
periodic patterns. The following is an example of a scene that has periodically
repeating patterns in both the horizontal and vertical directions. Suppose f(x, y) is a
periodic function whose horizontal and vertical periods are Tx and Ty, respectively,
that is,

f(x, y) = f(x + Tx, y) = f(x, y + Ty)  (6.25)
where Tx ≠ 0 and Ty ≠ 0 are constants. Using Eqs. 6.21–6.24, it can be deduced that

E[Sh(x, y; d)] = E[Sh(x, y; d + Tx)]  (6.26)

E[Sv(x, y; d)] = E[Sv(x, y; d + Ty)]  (6.27)

E[O(S)(x, y; d)] = E[Sh(x, y; d + Tx) + Sv(x, y; d + Ty)]
It can be seen from Eq. 6.29 that, when Tx ≠ Ty, the period Txy of
O(S)(x, y; d) is generally larger than both the period Tx of Sh(x, y; d) and the
period Ty of Sv(x, y; d).
Consider further the range over which the parallax search is performed for
matching. If d ∈ [dmin, dmax] is set, then the numbers of minima Nh, Nv,
and N that occur in E[Sh(x, y; d)], E[Sv(x, y; d)], and E[O(S)(x, y; d)], respectively, are

Nh = (dmax − dmin)/Tx    Nv = (dmax − dmin)/Ty

N = (dmax − dmin)/LCM(Tx, Ty)  (6.32)

This shows that when Sh(x, y; d) and Sv(x, y; d) are replaced by O(S)(x, y; d) as the
similarity measure function, E[O(S)(x, y; d)] has fewer minima than both E[Sh(x, y;
d)] and E[Sv(x, y; d)] in the same parallax search range. In other words, the
probability of a mismatch is reduced. In practical applications, it is possible to try
to limit the parallax search range to further avoid false matching.
An example of using the orthogonal trinocular method to eliminate the false
matches that periodic patterns cause for the binocular method is shown in Fig. 6.5.
Figure 6.5a-c are the left image, right image, and top image, respectively, of a group
of images with periodically repeating texture (the ratio of the horizontal period to the
vertical period is 2:3). Figure 6.5d shows the parallax map obtained by stereo
matching using only the horizontal binocular images; Fig. 6.5e is the parallax map
obtained by stereo matching using only vertical binocular images; Fig. 6.5f is the
parallax map obtained by stereo matching using orthogonal trinocular images.
Figure 6.5g-i are 3D perspective views corresponding to Fig. 6.5d-f, respectively.
Due to the effect of the periodic patterns, there are many mismatches in both Fig. 6.5d
and e, while in the parallax map obtained from the orthogonal trinocular images,
most of the mismatches are eliminated. The effect of orthogonal trinocular stereo
matching is also clearly seen in the 3D perspective views.
Fig. 6.5 An example of eliminating periodic pattern mismatch by the orthogonal trinocular method
In trinocular vision, in order to reduce ambiguity as much as possible and ensure
the accuracy of feature location, it is necessary to generate two epipolar lines. These
two epipolar lines should be as orthogonal as possible in at least one image space,
which will help to uniquely determine all matching features. The projection center of
the third camera should not be on the same line with the projection centers of the
other two cameras; otherwise, the epipolar lines will be collinear. Once a feature is
uniquely defined, using more cameras does not reduce the influence of ambiguity.
However, the use of more cameras may produce further supporting data. It can help
further reduce the positioning error with the help of averaging, and it is possible to
obtain a slight increase in the accuracy and range of 3D depth perception.
Because the orthogonal trinocular stereo matching method can reduce a variety of
errors, there are many implementation methods. The main steps of an orthogonal
trinocular stereo matching method are as follows [3]: (1) two independent complete
parallax maps are obtained through the horizontal image pair and the vertical image
pair, respectively, using certain correlation matching algorithms (e.g., taking edge
points as the matching feature; see Sect. 5.2.1); (2) according to certain fusion
criteria, the two parallax maps are combined into one parallax map using a relaxation
technique. This method needs to use dynamic programming algorithms, fusion
technique. This method needs to use dynamic programming algorithm, fusion
criteria, relaxation technology, and other complex synthesis operations, so the
amount of calculation is large and the implementation is complex. A fast orthogonal
stereo matching method based on gradient classification is introduced as follows.
The basic idea of this method is to compare the smoothness of each region of
the image along the horizontal and vertical directions. For the smoother region in
the horizontal direction, the vertical image pair is used for matching, and for the
smoother region in the vertical direction, the horizontal image pair is used for
matching. In this way, it is not necessary to calculate two complete parallax maps,
respectively, and the synthesis of the parallax of the two regions is very simple.
Whether a region is smoother horizontally or vertically can be determined by
calculating the gradient direction of the region. The flowchart of the algorithm is
shown in Fig. 6.6, which is mainly composed of the following four specific steps:
1. The gradient direction information of each point in fL(x, y) is obtained by
calculating the gradient of fL(x, y).
2. According to the gradient direction information of each point in fL(x, y), fL(x, y)
can be divided into two parts by using the classification decision criteria: the
Fig. 6.6 Flowchart of 2D search stereo matching algorithm using gradient direction
horizontal region whose gradient direction is closer to the horizontal direction and
the vertical region whose gradient direction is closer to the vertical direction.
3. A horizontal image pair is used to match and calculate parallax in the region
where the gradient direction is closer to the horizontal direction, and a vertical
image pair is used to match and calculate parallax in the region where the gradient
direction is closer to the vertical direction.
4. The two parallax values are combined into a complete parallax map, and then the
depth map is obtained.
When calculating the gradient map, considering that it is only necessary to
compare or judge whether the gradient direction is closer to the horizontal direction
or the vertical direction, the following simple methods with low computational
complexity can be used. For any pixel (x, y) in fL(x, y), the horizontal gradient value
Gh and the vertical gradient value Gv are obtained by summing the grayscale
differences along the horizontal and vertical directions, respectively, over a window
of size W × W centered at (x, y).
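As an illustration of this classification step, the following sketch compares window sums of horizontal and vertical grayscale differences and selects the image pair to be used for each pixel. The difference-based gradient formula, the window size W, and the SciPy dependency are assumptions of this sketch, not the exact method of the text.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def select_matching_pair(f_left: np.ndarray, W: int = 7) -> np.ndarray:
    """Return a boolean mask: True where the horizontal image pair should be used.

    Gh and Gv approximate the strength of grayscale variation along the
    horizontal and vertical directions inside a W x W window centered at
    each pixel (an assumed variant based on absolute adjacent differences).
    """
    f = f_left.astype(np.float64)
    dh = np.abs(np.diff(f, axis=1, prepend=f[:, :1]))   # variation along x
    dv = np.abs(np.diff(f, axis=0, prepend=f[:1, :]))   # variation along y
    Gh = uniform_filter(dh, size=W)   # window average, proportional to the window sum
    Gv = uniform_filter(dv, size=W)
    # Gradient direction closer to horizontal (Gh >= Gv): the region is not
    # smooth horizontally, so the horizontal image pair is used for matching;
    # otherwise the vertical image pair is used.
    return Gh >= Gv
```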
Fig. 6.7 An example of eliminating the influence of grayscale smooth regions of image by
orthogonal trinocular method
In the above method, two templates (masks, windows) with different sizes are used.
The gradient template is used to calculate the gradient direction information, and the
matching (searching) template is used to calculate the relevant information of the
grayscale region. Here, the size of gradient template and the size of matching
(searching) template have a great impact on the matching performance [4].
The influence of gradient template size can be illustrated by taking Fig. 6.8 as an
example. In the figure, two regions with different gray levels are given, with A, B,
and C as vertices as well as B, C, E, and D as vertices, respectively. Assuming that
the point P to be matched is located near the horizontal edge segment BC, if the
gradient template is too small (such as the rectangle in Fig. 6.8a, which does not
include the point on the edge BC), it is difficult to distinguish the horizontal region
from the vertical region because Gh ≈ Gv, and the point P may then be matched with
the horizontal image pair as if the horizontal direction were relatively smooth, which may
cause false matching. If the gradient template is large enough (such as the rectangle
in Fig. 6.8b, which includes the points on the edge BC), then there must be Gh < Gv,
and then the point P will be matched with vertical image pairs to avoid mismatching.
However, in addition to the larger amount of computation, a template that is too
large may cover multiple different edges and lead to a wrong direction
determination.
The size of the matching (searching) template also has a great impact on the
performance. If the matching template is large enough to contain large enough
intensity changes for matching, then false matching will be reduced, but large
matching blur may occur. It can be divided into two cases (see Fig. 6.9). The two
regions with A, B, and C as vertices as well as B, C, E, and D as vertices,
respectively, have different textures (the rest are smooth regions).
1. When matching the boundary parts of the texture region and the smooth region
(as shown in Fig. 6.9a): if the template is small and only covers the smooth
region, the matching calculation will be random; if the template is large and
covers two regions, the appropriate matching image pair can be determined and
the correct matching can be obtained.
2. The boundary parts adjacent to two texture regions are matched (as shown in
Fig. 6.9b): since the template is always contained in the texture region, the correct
matching is guaranteed regardless of the template size.
The algorithm for detecting and correcting errors in parallax maps introduced in
Sect. 4.3 is also applicable to parallax maps obtained from orthogonal trinocular
stereovision. The definition of ordering matching constraint can be used for both
horizontal image pairs and vertical image pairs (after corresponding adjustment).
The flowchart of the basic algorithm for error detection and correction of parallax
map in orthogonal trinocular is shown in Fig. 6.10. The images involved here
include the left image fL(x, y), the right image fR(x, y), the top image fT(x, y), and
the parallax map d(x, y). First, the parallax map dX(x, y) corrected along the
horizontal direction is obtained by means of the ordering matching constraint in
the horizontal direction, then the resulted parallax map is corrected by means of the
ordering matching constraint in the vertical direction, and finally a new parallax map
dXY(x, y) satisfying the global (both along the horizontal X direction and along the
vertical Y direction) ordering matching constraint is obtained.
The orthogonal trinocular stereo matching described in Sect. 6.2 is a special case of
multi-ocular stereo matching. In more general cases, more than three cameras can be
used to form a stereo system, and the baseline of each image pair can also be
non-orthogonal. Two special cases, which are more general than orthogonal trinocular
matching, are discussed below.
In a trinocular stereo imaging system, three cameras can be arranged in any form
other than on a straight line or on a right triangle. Figure 6.11 shows the schematic
diagram of an arbitrarily arranged trinocular stereo imaging system, where C1, C2,
and C3 are the optical centers of three image planes, respectively, which can
determine a trifocal plane. Referring to the introduction of epipolar constraint in
Sect. 5.1.2, it can be seen that a given object point W (generally not located on the
trifocal plane) and any two optical center points can determine an epipolar plane. The
intersection of this plane with the image plane of the corresponding optical center is
the epipolar line. The epipolar line Lij represents the epipolar line in Image i, which
corresponds to Image j. Matching is always done on the epipolar line. In the
trinocular stereo imaging system, there are two epipolar lines on each image plane,
and the intersection of these two epipolar lines is the point where the line joining
the object point W and the corresponding optical center meets that image plane.
If all three cameras observe the object point W, the coordinates of the three image
points obtained are p1, p2, and p3, respectively. Each pair of cameras can determine
an epipolar constraint. If Eij is used to represent the essential matrix between Image
i and Image j, there are
$$p_1^T E_{12}\, p_2 = 0 \qquad (6.36)$$
$$p_2^T E_{23}\, p_3 = 0 \qquad (6.37)$$
$$p_3^T E_{31}\, p_1 = 0 \qquad (6.38)$$
If eij is used to represent the (homogeneous) coordinates of the epipole in Image i
with respect to Image j, the above three equations are not independent, because
e31^T E12 e32 = e12^T E23 e13 = e23^T E31 e21 = 0. However, any two of the above
equations are independent, so when the essential matrices are known, the coordinates
of the third image point can be predicted from the coordinates of any two image
points.
Compared with the binocular system, the trinocular system adds a third camera,
which can eliminate many uncertainties caused by only using the binocular image
matching. Although the methods introduced in Sect. 6.2 directly use two pairs of
images to match at the same time, in most trinocular stereo matching algorithms, a
pair of images is often used to establish the corresponding relationship, and then the
third image is used for verification, that is, the third image is used to check the match
made with the first two images [5].
A typical approach is shown in Fig. 6.12. Consider using three cameras to image a
scene with four points A, B, C, and D. In Fig. 6.12, the six points labeled 1, 2, 3, 4,
5, and 6 represent incorrect reconstructed positions for the four points in the first two
images (corresponding to optical centers O1 and O2, respectively). Take the point
marked 1 as an example, which is the result of a mismatch between a2 and b1. When
the 3D space point 1 reconstructed from the first two images is reprojected to the
third image, the problem of mismatching can be found. It neither coincides with a3
nor with b3, so it can be judged as an incorrect reconstruction position.
The above method first reconstructs the 3D space points corresponding to the
matching points in the first two images and then projects them to the third image. If
there is no compatible point near the projected point obtained above in the third
image, then the match is likely to be a false match. In practical applications, explicit
reconstruction and reprojection are not required. If the camera has been calibrated
(even only weakly calibrated [5]), and a 3D space point is known to correspond to
the two image points of the first image and the second image, respectively, then
taking the intersection of the corresponding epipolar lines can predict the position of
the 3D space point in the third image.
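A minimal sketch of this verification step is given below; the fundamental matrices F13 and F23 (taken here to map points of Images 1 and 2 to epipolar lines in Image 3) and the pixel tolerance are assumptions introduced for the illustration.

```python
import numpy as np

def predict_in_third_image(p1, p2, F13, F23):
    """Predict where a candidate match (p1 in Image 1, p2 in Image 2) should appear in Image 3.

    F13 @ p1 and F23 @ p2 are taken to be the epipolar lines of p1 and p2 in
    Image 3 (this convention is an assumption of the sketch).
    """
    h1 = np.array([p1[0], p1[1], 1.0])
    h2 = np.array([p2[0], p2[1], 1.0])
    l1 = F13 @ h1             # epipolar line of p1 in Image 3, as (a, b, c)
    l2 = F23 @ h2             # epipolar line of p2 in Image 3
    p3 = np.cross(l1, l2)     # intersection of the two lines (homogeneous coordinates)
    return p3[:2] / p3[2]

def match_is_consistent(p1, p2, p3_observed, F13, F23, tol=2.0):
    """Accept the candidate match only if a compatible point is observed in
    Image 3 near the predicted intersection of the two epipolar lines."""
    p3_pred = predict_in_third_image(p1, p2, F13, F23)
    return np.linalg.norm(np.asarray(p3_observed, dtype=float) - p3_pred) <= tol
```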
Several different matching methods are described below.
This approach relies on searching along epipolar lines to disambiguate and achieve
matching [6]. Referring to Fig. 6.13, let Lj^i represent the j-th epipolar line in Image i.
If it is known that L1^1 and L1^2 are corresponding epipolar lines in Image 1 and
Image 2, then to find in Image 2 the point corresponding to Point a in Image 1, it is
only necessary to search for edges along L1^2. Suppose two possible points b and
c are found along L1^2, but it is not known which one to choose. Let the epipolar
lines passing through points b and c in Image 2 be L2^2 and L3^2, respectively, and
the corresponding epipolar lines in Image 3 be L2^3 and L3^3, respectively. Now
consider the stereo image pair consisting of Image 1 and Image 3; if L1^1 and L1^3
are corresponding epipolar lines and there is only one point d along L1^3, located at
the intersection of L1^3 and L2^3, then it can be concluded that Point a in Image
1 and Point b in Image 2 correspond to each other, because they both correspond to
the same Point d in Image 3.
This method utilizes the detected edge segments from the image to achieve trinocular
stereo matching [7]. First, the edge segments in the image are detected, and then a
segment adjacency graph is defined. The nodes in the graph represent edge
segments, and the arcs between the nodes indicate that the corresponding edge
segments are adjacent. For each edge segment, it can be expressed with local
geometric features such as its length and direction, midpoint position, etc. After
obtaining the line segment adjacency graphs G1, G2, and G3 of the three images in
this way, matching can be performed as follows (see Fig. 6.14):
1. For a Segment S1 in G1, calculate the Epipolar line L21 in Image 2 corresponding
   to the Midpoint p1 of S1; the Point p2 of Image 2 corresponding to p1 will lie on
   the Epipolar line L21.
2. Consider the segments S2 in G2 that intersect the Epipolar line L21, and let the
   intersection of L21 and S2 be p2; for each such Segment S2, compare its length
   and direction with those of Segment S1. If the differences are less than the given
   thresholds, they are considered a possible match.
3. For each possible matching segment, further calculate the Epipolar line L32 of p2
   in Image 3, and set its intersection with the Epipolar line L31 of p1 in Image 3 as
   p3; near p3, search for a Segment S3 whose differences in length and direction
   from Segments S1 and S2 are less than the given thresholds (see the sketch
   below). If such a segment can be found, S1, S2, and S3 form a group of
   matching segments.
Carry out the above steps on all line segments in the graph, and finally get all
matching line segments to realize image matching.
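The three steps can be sketched compactly as follows; the helper functions `crossing` and `nearest_segment` that operate on the adjacency graphs, the segment attributes, and the thresholds are assumptions introduced for the illustration.

```python
import numpy as np

def epipolar_line(F, p):
    """Epipolar line (a, b, c) of pixel p under the fundamental matrix F (assumed convention)."""
    return F @ np.array([p[0], p[1], 1.0])

def similar(s1, s2, len_tol=5.0, dir_tol=0.1):
    """Length/direction comparison used in steps 2 and 3 (illustrative thresholds)."""
    d_dir = np.abs(np.angle(np.exp(1j * (s1["dir"] - s2["dir"]))))
    return abs(s1["len"] - s2["len"]) < len_tol and d_dir < dir_tol

def match_segments(G1, G2, G3, F12, F13, F23, crossing, nearest_segment):
    """Sketch of the three-step segment matching; `crossing(line, graph)` yields
    (segment, intersection point) pairs and `nearest_segment(point, graph)`
    returns the closest segment, both assumed helpers."""
    matches = []
    for s1 in G1:
        L21 = epipolar_line(F12, s1["mid"])                  # step 1
        for s2, p2 in crossing(L21, G2):                     # step 2
            if not similar(s1, s2):
                continue
            L31 = epipolar_line(F13, s1["mid"])              # step 3
            L32 = epipolar_line(F23, p2)
            p3 = np.cross(L31, L32)
            p3 = p3[:2] / p3[2]
            s3 = nearest_segment(p3, G3)
            if s3 is not None and similar(s1, s3) and similar(s2, s3):
                matches.append((s1, s2, s3))
    return matches
```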
In the previous matching based on edge segments, it is implicitly believed that the
contour of the scene to be matched is approximated by polygons. If the scene is
composed of polyhedra, the contour representation will be very compact. But for
many natural scenes, as the complexity of their contours increases, the number of
sides of the polygons used to express them may have to increase severalfold to
maintain the accuracy of the approximation. In addition, due to the change of the angle of view,
the corresponding polygons in the two images cannot ensure that their vertices are on
the corresponding epipolar line. In this case, more calculations are needed, such as
the use of improved epipolar constraints [8].
To solve the above problem, curve-based matching can be performed (i.e., the
local contour of the scene is approximated by a polynomial higher than the first
order). Referring to Fig. 6.15, assume that a curve T1^1 has been detected in Image
1 (the superscript indicates the image and the subscript indicates the serial number,
so Tj^i denotes the j-th curve in Image i), which is the image of a 3D curve on the
surface of the scene. The next step is to search for the corresponding curve in
Image 2. To this end, choose a Point p1^1 on T1^1 (the unit tangent vector at this
point is t1^1, and the curvature is k1^1). Consider the epipolar line L21 in Image
2 (the epipolar line in Image 2 corresponding to Image 1), and let it intersect the
curve family Tj^2 in Image 2. Here, j = 2 in the figure, that is, the epipolar line L21
intersects the curves T1^2 and T2^2 at the points p1^2 and p2^2 (the unit tangent
vectors at these two points are t1^2 and t2^2, respectively, and the curvatures are
k1^2 and k2^2, respectively). Next, in Image 3, consider the epipolar lines from
Image 2 that intersect the epipolar line L31 on which the correspondence of Point
p1^1 must lie. Here, j = 2 in the figure, that is, the epipolar lines L32,1 and L32,2
corresponding to the points p1^2 and p2^2 are considered (the number after the
comma in the subscript indicates the serial number). These two epipolar lines
intersect the epipolar line L31 at the points p1^3 and p2^3, respectively.
If the points p1^1 and p1^2 are corresponding, then theoretically a Point p1^3, whose
unit tangent vector and curvature can be calculated from the unit tangent vectors and
curvatures of the two points p1^1 and p1^2, should be found on the curve T1^3. If it
is not found, it may be that (1) there is no point very close to p1^3; (2) there is a curve
passing through the point p1^3, but its unit tangent vector is not as expected; or (3) there is
a curve passing through the point p1^3 and its unit tangent vector is as expected, but its
curvature is not as expected. Any of the above indicates that the points p1^1 and p1^2
should not correspond.
In general, for each pair of points p1^1 and pj^2, the intersection point pj^3 of the
epipolar line L31 of the point p1^1 and the epipolar line L32,j of the point pj^2 is
calculated in Image 3, together with the unit tangent vector tj^3 and the curvature
kj^3 predicted at the point pj^3. For each intersection point pj^3, search for the
closest curve Tj^3, and judge and act according to the following three conditions of
increasing stringency: (1) if the distance between the curve Tj^3 and the point pj^3
exceeds a certain threshold, cancel the correspondence; otherwise (2) calculate the
unit tangent vector at the corresponding point on Tj^3, and if its difference from the
unit tangent vector predicted from the points p1^1 and pj^2 exceeds a certain
threshold, cancel the correspondence; otherwise (3) calculate the curvature at that
point, and if its difference from the curvature predicted from the points p1^1 and
pj^2 exceeds a certain threshold, cancel the correspondence.
After the above filtering, only one possible corresponding Point pj^2 in Image 2 is
retained for the Point p1^1 in Image 1, and the nearest curves Tj^2 and Tj^3 are
further searched in the neighborhoods of the points pj^2 and pj^3. The above process
is performed for all selected points in Image 1, and the final result is that a series of
corresponding points pj^1, pj^2, and pj^3 are determined on the curves Tj^1, Tj^2,
and Tj^3, respectively.
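The three checks of increasing stringency can be expressed directly in code; the threshold values and the dictionary layout of the candidate curve are assumptions of this sketch.

```python
import numpy as np

def accept_curve_correspondence(p3_pred, t3_pred, k3_pred, curve3,
                                d_tol=2.0, t_tol=0.1, k_tol=0.05):
    """Apply the three conditions to the curve in Image 3 closest to the predicted point.

    p3_pred, t3_pred, k3_pred: predicted position, unit tangent, and curvature,
    computed from the candidate pair of points in Images 1 and 2.
    curve3: point, unit tangent, and curvature of the closest curve in Image 3.
    """
    # (1) the curve must pass sufficiently close to the predicted point
    if np.linalg.norm(np.asarray(curve3["point"]) - np.asarray(p3_pred)) > d_tol:
        return False
    # (2) its unit tangent vector must agree with the predicted tangent
    if np.linalg.norm(np.asarray(curve3["tangent"]) - np.asarray(t3_pred)) > t_tol:
        return False
    # (3) its curvature must agree with the predicted curvature
    if abs(curve3["curvature"] - k3_pred) > k_tol:
        return False
    return True
```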
It has been pointed out in Sect. 6.1 that replacing the unidirectional binocular images
with unidirectional multi-ocular images can eliminate the effect of the unidirectional
periodic pattern. In Sect. 6.2.1, it is also pointed out that replacing the unidirectional
binocular images (or multi-ocular images) with orthogonal trinocular images can
eliminate the influence of grayscale smooth regions. The combination of the two can
form an orthogonal multi-ocular image sequence, and it is better to use the orthogonal
multi-ocular stereo matching method to eliminate the above two effects at the
same time [9]. Figure 6.16 shows a schematic diagram of the shooting position of the
orthogonal multi-camera image sequence. Let the camera shoot along each point
L, R1, R2, ... on the horizontal line and each point L, T1, T2, ... on the vertical line,
and a stereo image series of orthogonal baselines can be obtained. The analysis of
orthogonal multi-ocular images can be obtained by combining the method of
unidirectional multi-ocular image analysis in Sect. 6.1 with the method and results of
orthogonal trinocular image analysis in Sect. 6.2.1.
The test results of the real image using the orthogonal multi-ocular stereo
matching method are as follows. Figure 6.17a is a parallax calculation result. The
orthogonal multi-ocular images used include, in addition to Fig. 4.5a and b and
Fig. 6.7a, one more image acquired along each of the horizontal and vertical
directions of Fig. 6.16; this is equivalent to adding the two positions R2 and T2 in
Fig. 6.16 for image acquisition. Figure 6.17b shows the corresponding 3D perspective
view. Comparing Fig. 6.17a and b with Fig. 6.7g and h, respectively, it can be
seen that the effect here is even better (fewer mismatch points).
Theoretically, multiple images can be acquired not only in the horizontal and
vertical directions but even in the depth direction (along the Z-axis). For example, as
shown in Fig. 6.16, the two positions D1 and D2 along the Z-axis can also be used.
However, practice shows that the contribution effect of depth-direction images to
recovering the 3D information of the scene is not obvious.
In addition, various cases of multi-ocular stereo matching can also be regarded as
the generalization of the method in this section. For example, a schematic diagram of
a four-ocular stereo matching is shown in Fig. 6.18. Figure 6.18a is a schematic
diagram of the projection imaging of the scene point W, and its imaging points for
the four images are p1, p2, p3, and p4, respectively. They are the intersections of the
four rays R1, R2, R3, and R4 with the four image planes in turn. Figure 6.18b shows
the projection imaging of a straight line L passing through the space point W. The
imaging results of the straight line on the four images are four straight lines l1, l2, l3,
and l4 on four planes Q1, Q2, Q3, and Q4, respectively. Geometrically, a ray passing
through C1 and p1 must also pass through the intersection of planes Q2, Q3, and Q4.
Algebraically, given the quad-focal tensor and any three straight lines passing
through the three image points, the position of the fourth image point can be
deduced [5].
6.4 Equal Baseline Multiple Camera Set
There are many forms of multi-ocular stereovision. For example, literature [10] also
provides the source code of a multi-ocular stereovision measurement system and some
photos and videos; literature [11] introduces a trinocular stereovision system composed
of a projector and two cameras. The following is a brief introduction to an
equal baseline multiple camera set (EBMCS), in which a total of five cameras are
used [12].
The equal baseline multiple camera set arranges five cameras in a cross that shoots in
parallel, as shown in Fig. 6.19. Among them, C0 is the center camera, C1 is the right
camera, C2 is the top camera, C3 is the left camera, and C4 is the bottom camera.
It can be seen from Fig. 6.19 that the four cameras around the center camera form
with the center camera four pairs of stereo cameras in binocular parallel mode,
respectively. Their baselines are of equal length, hence the name equal baseline
multiple camera set.
From the perspective of image processing, the images collected by C0 and C1 form a pair of
horizontal binocular stereo images. For convenience, each of the other pairs can also be
regarded as images obtained in the horizontal binocular mode. Of course, some
conversion is required here. A pair of stereo images collected for C0 and C2 needs to
be rotated 90° counterclockwise; a pair of stereo images collected for C0 and C4
needs to be rotated 90° clockwise; and a pair of stereo images collected for C0 and
C3 can be mirror-flipped. In this way, it is equivalent to calibrating the four pairs of
stereo images relative to the images collected by the central camera, and their results
can be combined and compared when calculating the parallax map.
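These conversions amount to simple array operations; the sketch below (the exact rotation directions depend on how the cameras are mounted and are taken from the description above) brings each C0-Ci pair into the horizontal binocular form.

```python
import numpy as np

def align_pair(center_img: np.ndarray, side_img: np.ndarray, which: str):
    """Re-orient a C0/Ci image pair so that it can be processed as a horizontal pair.

    which: 'right' (C1), 'top' (C2), 'left' (C3), or 'bottom' (C4).
    """
    if which == "right":        # C0-C1: already a horizontal pair
        return center_img, side_img
    if which == "top":          # C0-C2: rotate 90 degrees counterclockwise
        return np.rot90(center_img, 1), np.rot90(side_img, 1)
    if which == "bottom":       # C0-C4: rotate 90 degrees clockwise
        return np.rot90(center_img, -1), np.rot90(side_img, -1)
    if which == "left":         # C0-C3: mirror flip
        return np.fliplr(center_img), np.fliplr(side_img)
    raise ValueError(f"unknown camera position: {which}")
```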
A series of chessboard images with 11 × 8 black and white squares
(each 24 mm × 24 mm) is taken for the (geometric) calibration of the cameras
and the rectification of the acquired images, so that the inner corners, ten per row in
the horizontal direction and seven per column in the vertical direction, can be used.
The series of images consists of ten groups, each containing five images from five
cameras. The calculation of the calibration parameters can be found in [13], and the
rectification algorithm can be found in [14].
Due to the use of multiple cameras, in addition to geometric calibration, images
from EBMCS were color-calibrated. Because the parallax map used here is a
grayscale map, the color calibration is actually the calibration of the pixel intensity.
The expression for the triangular filter used to adjust the intensity is given by
$$f' = f + k\,(1 - |M - f|) \qquad (6.39)$$

Among them, f is the intensity before calibration, f' is the intensity after calibration,
k is the intensity correction factor selected for the feature points in the image,
and M is the middle value of the grayscale range of the image. Generally, the
grayscale range is from 0 to 255, and M is preferably greater than or equal to 128.
Image transformation was also performed on the parallax maps obtained from the
images captured by the five cameras. Parallax maps are obtained from images
modified by scaling, rectification, as well as transformations such as rotation and
specular reflection. Therefore, the points of each parallax map correspond to the
points of the center image after these transformations. However, the transformation
parameters are different in different stereo cameras. Therefore, the center image
requires various modifications depending on the stereo camera used. Here the
parallax maps need to be merged to get a higher quality map, so the requirement is
to unify the maps by making them refer to the same image. The unification of
parallax maps is obtained by performing a transformation on them that is the inverse
of the transformation performed on the images from which these parallax maps were
obtained. All points in the generated parallax map correspond to points in the input
center image before calibration and rectification.
One parallax map can be obtained from each pair of cameras, and EBMCS can
provide four parallax maps. The next step is to combine these four parallax maps into
a single parallax map. Two different methods can be used to combine: arithmetic
mean merging method (AMMM) and exception excluding merging method
(EEMM).
No matter which method is used, the parallax value of each coordinate point in the
resulting parallax map depends on the parallax value of each coordinate point
located in the same coordinate position in each pre-merging parallax map. However,
since it is possible that some parts of the scene are occluded during image acquisition,
the parallax values of the corresponding positions in one or several parallax
maps cannot be calculated, so the number of combined parallax values of certain
positions in the final result parallax map may be smaller than the number of camera
pairs included in the EBMCS. If N is the number of camera pairs in the EBMCS and
Mx is the number of pre-merging parallax maps that contain a parallax value at
coordinate x, then Mx ≤ N. The parallax value at coordinate x in the final result
parallax map obtained with AMMM is

$$D_f(x) = \frac{1}{M_x}\sum_{i=1}^{M_x} D_i(x) \qquad (6.40)$$

where Di(x) represents the parallax value at coordinate x in the pre-merging parallax
map with index i.
There may be significant differences between parallax values located at the same
coordinates in different pre-merging parallax maps. AMMM does not exclude any
values but only averages them. However, a significant difference would indicate that
at least one of the pre-merging parallax maps contained incorrect parallax values. To
eliminate potential false discrepancies, EEMM can be used.
Let the parallax after performing EEMM merging be denoted by E(x), which
depends on each parallax value Di(x) at coordinate x in the parallax map i before
merging. If a pre-merging parallax map does not contain a parallax value for
coordinate x, the value of E(x) is equal to 0. The function E(x) is calculated differently
depending on the number of pre-merging parallax maps that contain parallax values
at coordinate x.
If only one pre-merge parallax map with index i contains parallax value Di(x),
then the value of E(x) is equal to Di(x). When the number of pre-merging parallax
maps with parallax at coordinate x is equal to 2, the EEMM calculates the difference
between these parallax values. The parallax value is equal to |Di(x) - Dj(x)|, where
i and j are the indices of the parallax map before merging under consideration.
EEMM specifies a maximum acceptable difference, denoted as T. A difference
greater than T indicates that the parallax at this coordinate cannot be determined reliably.
Therefore, the EEMM states that the parallax value is undefined, and the value of E(x) is equal to
zero. If the difference between the parallax values is not greater than T, then E(x) is
equal to the arithmetic mean of the parallax values Di(x) and Dj(x):
$$E(x) = \begin{cases} \dfrac{D_i(x) + D_j(x)}{2} & \text{if } |D_i(x) - D_j(x)| \le T \\[2mm] 0 & \text{if } |D_i(x) - D_j(x)| > T \end{cases} \qquad (6.41)$$
In the case of merging three parallax values Di(x), Dj(x), and Dk(x), from different
pre-merging parallax maps, it is necessary to calculate the difference value between
each two parallax values and then use these differences’ value to determine the
calculation of E(x). Since there are a total of three difference values to be judged, the
condition of the maximum acceptable difference is set more stringent (the maximum
acceptable difference value S = T/2 at this time). The calculation of E(x) is divided
into four cases, as follows:
$$E(x) = \begin{cases}
\dfrac{D_i(x) + D_j(x) + D_k(x)}{3} & \text{if } |D_i(x)-D_j(x)| \le S,\ |D_i(x)-D_k(x)| \le S,\ |D_j(x)-D_k(x)| \le S \\[2mm]
D_i(x) & \text{if } |D_i(x)-D_j(x)| \le S,\ |D_i(x)-D_k(x)| \le S,\ |D_j(x)-D_k(x)| > S \\[2mm]
\dfrac{D_i(x) + D_j(x)}{2} & \text{if } |D_i(x)-D_j(x)| \le S,\ |D_i(x)-D_k(x)| > S,\ |D_j(x)-D_k(x)| > S \\[2mm]
0 & \text{if } |D_i(x)-D_j(x)| > S,\ |D_i(x)-D_k(x)| > S,\ |D_j(x)-D_k(x)| > S
\end{cases} \qquad (6.42)$$
It can be seen from Eq. (6.42) that when all three difference values are not greater
than S, the resulting parallax value E(x) of the merged parallax map is equal to the
arithmetic mean of all the pre-merging parallax values. When one difference value is
greater than S, E(x) is equal to the parallax value that appears in both of the two
conditions that are still satisfied. When two difference values are greater than S, E(x)
is equal to the arithmetic mean of the two parallax values in the single condition that
is satisfied. When all three difference values are greater than S, the resulting parallax
value E(x) is indeterminate.
The last case in EEMM occurs when the four pre-merging parallax maps i, j, k,
and l all have parallax values at coordinate x. In this case, the merge method first
sorts the parallax values from different pre-merging parallax maps. Two extreme
values (maximum and minimum) are removed after sorting. Then, the arithmetic
mean is calculated from the two remaining parallax values, and this mean is taken as
the result of the merge method:
$$E(x) = \frac{D_j(x) + D_k(x)}{2} \qquad \text{if } D_i(x) \le D_j(x) \le D_k(x) \le D_l(x) \qquad (6.43)$$
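The merging rules of Eqs. (6.40)-(6.43) can be collected in a single routine. The sketch below implements EEMM for the values available at one coordinate (AMMM would simply average them); treating an undefined result as 0 and using S = T/2 follow the description above, while the function layout is an assumption.

```python
def merge_parallax_eemm(values, T):
    """EEMM merge of the parallax values available at one coordinate (0 to 4 values).

    Returns 0.0 when the merged parallax is undefined.
    """
    v = sorted(values)
    if len(v) == 0:
        return 0.0
    if len(v) == 1:                       # only one map has a value here
        return v[0]
    if len(v) == 2:                       # Eq. (6.41)
        return (v[0] + v[1]) / 2 if abs(v[0] - v[1]) <= T else 0.0
    if len(v) == 3:                       # Eq. (6.42), with S = T / 2
        S = T / 2
        pairs = [(0, 1), (0, 2), (1, 2)]
        close = [p for p in pairs if abs(v[p[0]] - v[p[1]]) <= S]
        if len(close) == 3:               # all differences small: mean of all three
            return sum(v) / 3
        if len(close) == 2:               # the value common to both small differences
            counts = [sum(k in p for p in close) for k in range(3)]
            return v[counts.index(2)]
        if len(close) == 1:               # mean of the single compatible pair
            i, j = close[0]
            return (v[i] + v[j]) / 2
        return 0.0                        # all differences too large
    # four values, Eq. (6.43): drop the minimum and maximum, average the rest
    return (v[1] + v[2]) / 2
```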
6.5 Single-Camera Multi-mirror Catadioptric System
The overall system structure is shown in Fig. 6.20. It is composed of one camera and
five mirrors, combining a horizontal and a vertical layout. The top mirror, whose
focus is O1, is the main mirror, and the rest are secondary mirrors (sub-mirrors).
Four secondary mirrors are symmetrically placed in a plane perpendicular to the
optical axis of the main mirror and the camera, as shown in the top view (along the
optical axis of the camera, aerial view) of Fig. 6.20a. By arranging the primary and
secondary mirrors in two layers, the advantages of vertical and horizontal structures
are combined to achieve a longer baseline in a compact manner. The camera can
shoot the reflected images of five mirrors at a time, forming four stereo image pairs.
Taking O1-XYZ as the reference system and considering the (side view) XZ plane as
shown in Fig. 6.20b and c, the relative position of secondary mirror O2 can be
expressed by P = [Bx, 0, -Bz]^T, where Bx and Bz are the horizontal and vertical
baselines of the system, respectively. The relative positions of other secondary
mirrors can also be obtained according to symmetry.
The system design has the following three characteristics:
1. The primary mirror and four secondary mirrors form a system containing four
binocular stereo pairs. In practice, it may not be possible for all mirrors to capture
the desired scene due to potential occlusion in the system. However, such a
design enables each object in the scene to still be seen by at least two stereo
pairs, which makes it possible to achieve higher reconstruction accuracy by
fusing stereo pairs.
2. Compared with the general purely horizontal [16] or purely vertical [17] baseline
layouts, the special layout between the primary and secondary mirrors achieves a
longer stereo baseline in a compact manner.
3. Flexible selection of mirrors and cameras. Unlike traditional central catadioptric
systems, which are only able to use a limited combination of camera and mirror
types, this system can be built with either a central or noncentral configuration. As
shown in Fig. 6.20b, the orthographic cameras pointed at five parabolic mirrors
can be regarded as five different central cameras. As shown in Fig. 6.20c,
perspective cameras directed at multiple parabolic, hyperbolic, or spherical mirrors
can constitute multiple noncentral cameras. Through efficient system modeling,
the 3D reconstruction process can be unified and simplified.
$$P_t = \left[\frac{x_s + q_1}{z_s + q_3},\ \frac{y_s + q_2}{z_s + q_3},\ 1\right]^T \qquad (6.44)$$
3. Considering the distortion of the real camera, radial distortion compensation is
applied to Pt.
4. Finally, apply the generalized perspective projection to get the pixels [19]:
$$p = K P_t = \begin{bmatrix} g_1 & g_1\alpha & u_0 \\ 0 & g_2 & v_0 \\ 0 & 0 & 1 \end{bmatrix} P_t \qquad (6.45)$$
In the concept of a virtual camera array, each mirror and the region it occupies in the
pixel plane are considered a virtual sub-camera. Integrate mirror parameters into
virtual sub-cameras to convert the relative positions between mirrors into rigid body
transformations between virtual sub-cameras. After each sub-camera is independently
calibrated, the relative positions of the sub-cameras need to be jointly
optimized to improve the consistency between the sub-cameras.
Let c1 be the reference coordinate system of the main camera. The rigid body
transformation of ci (i = 2, 3, 4, 5) relative to c1 can be represented by T(ci-c1):

$$T(c_i\text{-}c_1) = \begin{bmatrix} R_{3\times 3} & t_{3\times 1} \\ 0 & 1 \end{bmatrix}, \quad i = 2, 3, 4, 5 \qquad (6.46)$$
where R3×3 is the rotation matrix and t3×1 is the translation vector.
Given each 3D point Pw,ij in the world coordinate system and its corresponding
imaging pixel p1,ij in the main camera and pi,ij in the i-th camera, the steps to
calculate the reprojection error are as follows:
1. First transform the world point Pw,ij to Pc1,ij in the c1 coordinate system using Tw-c1:

$$P_{c1,ij} = T_{w\text{-}c1}\, P_{w,ij} \qquad (6.47)$$

2. Then transform Pc1,ij to Pci,ij in the i-th sub-camera coordinate system by means
of the rigid body transformation T(ci-c1).
3. Next, use the internal parameter matrix Ki of the i-th sub-camera to convert Pci,ij
to the reprojection pixel coordinates p'i,ij:
If the function G denotes the whole process of obtaining p'i,ij from Pw,ij, then the
optimal rigid body transformation T(ci-c1) can be calculated by minimizing the
reprojection error shown in Eq. (6.50):
Since each virtual camera has been calibrated according to the previous virtual
sub-camera model description, the initial values of the parameters in G have been
obtained. Equation 6.51 can then be solved using a nonlinear optimization algorithm
such as the Levenberg-Marquardt algorithm.
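A minimal sketch of this refinement using SciPy's Levenberg-Marquardt solver is shown below; the parameterization of T(ci-c1) as a rotation vector plus a translation and the simple pinhole projection standing in for the function G are assumptions, not the book's implementation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, K_i, P_c1, p_obs):
    """Reprojection residuals of one sub-camera.

    params: 6-vector [rotation vector (3), translation (3)] describing T(ci-c1).
    P_c1:   (N, 3) array of 3D points already expressed in the c1 frame.
    p_obs:  (N, 2) array of observed pixels in the i-th sub-camera.
    """
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t = params[3:]
    P_ci = P_c1 @ R.T + t                 # transform the points into the ci frame
    proj = P_ci @ K_i.T                   # pinhole projection with intrinsics K_i
    p_pred = proj[:, :2] / proj[:, 2:3]
    return (p_pred - p_obs).ravel()

def refine_extrinsics(params0, K_i, P_c1, p_obs):
    """Refine T(ci-c1), starting from the independently calibrated initial value."""
    result = least_squares(residuals, params0, args=(K_i, P_c1, p_obs), method="lm")
    return result.x
```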
References
4. Jia B, Zhang Y-J, Lin X G (1998) Study of a fast tri-nocular stereo algorithm and the influence
of mask size on matching. Proceedings of International Workshop on Image, Speech, Signal
Processing and Robotics, 169-173.
5. Forsyth D, Ponce J (2003). Computer Vision: A Modern Approach. Prentice Hall, London.
6. Goshtasby A A (2005) 2-D and 3-D Image Registration—for Medical, Remote Sensing, and
Industrial Applications. Wiley-Interscience, Hoboken, USA.
7. Ayache N, Lustman F (1987) Fast and reliable passive trinocular stereovision. Proceedings of
First ICCV, 422-427.
8. Faugeras O (1993) Three-dimensional Computer Vision: A Geometric Viewpoint. MIT Press,
Cambridge, USA.
9. Jia B, Zhang Y-J, Lin X G (2000) Stereo matching using both orthogonal and multiple image
pairs. Proceedings of the ICASSP, 4: 2139-2142.
10. Wu H (2022) Code and dataset. https://ieee-dataport.org/documents/code-and-dataset.
11. Zhou D, Wang P, Sun C K, et al. (2021) Calibration method for trinocular stereo vision system
comprising projector and dual cameras. Acta Optica Sinica, 41(11): 120-130.
12. Kaczmarek A L (2017) Stereo vision with equal baseline multiple camera set (EBMCS) for
obtaining depth maps of plants. Computers and Electronics in Agriculture, 135: 23-37.
13. Zhang Z (2000) A flexible new technique for camera calibration. IEEE Transactions on Pattern
Analysis and Machine Intelligence 22: 1330-1334.
14. Hartley RI (1999) Theory and practice of projective rectification. International Journal of
Computer Vision, 35, 115-127.
15. Chen S Y, Xiang Z Y, Zou N (2020) Multi-stereo 3D reconstruction with a single-camera multi-mirror catadioptric system. Measurement Science and Technology, 31: 015102.
16. Caron G, Marchand E, Mouaddib E M (2009) 3D model based pose estimation for omnidirectional stereovision. Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, 5228-5233.
17. Lui W L D, Jarvis R (2010) Eye-full tower: A GPU-based variable multi-baseline omnidirectional stereovision system with automatic baseline selection for outdoor mobile robot navigation. Robotics and Autonomous Systems, 58(6): 747-761.
18. Xiang Z, Dai X, Gong X (2013) Noncentral catadioptric camera calibration using a generalized unified model. Optics Letters, 38: 1367-1369.
Chapter 7
Monocular Multi-image Scene Restoration
The stereo vision method introduced in the first two chapters restores the depth of the
scene according to two or more images obtained by the cameras in different
positions. Here, the depth information (distance information) can be regarded as
the redundant information from multiple images. Acquiring multiple images with
redundant information can also be achieved by collecting images of light change
and/or scene change at the same location. These images can be obtained with only
one (fixed) camera, so they can also be collectively referred to as monocular methods
(stereo vision methods are all based on multiple cameras and multiple images.
Although one camera can be used to shoot in multiple positions, it is still equivalent
to multiple cameras due to different angles of view). From the (monocular) multiple
images obtained in this way, the surface orientation of the scene can be determined,
and the relative depth between the parts of the scene can be directly obtained from
the surface orientation of the scene. In practice, it is often possible to further calculate
the absolute depth of the scene [1].
The sections of this chapter will be arranged as follows.
Section 7.1 first gives a general introduction to monocular scene restoration methods
and classifies them according to monocular multiple images and monocular single
images.
Section 7.2 introduces the basic principle of shape restoration by illumination and
discusses the photometric stereo method of determining the orientation of the
scene surface by using a series of images with the same viewing angle but
different illumination.
Section 7.3 discusses the basic principle of restoring shape from motion. Based on
the establishment of the optical flow field describing the moving scene, the
relationship between the optical flow and the surface orientation and relative
depth of the scene is analyzed.
Section 7.4 introduces a method to restore shape from contour, which combines
segmentation technology and convolutional neural network technology to decompose
the visual shell and estimate human posture.
The stereo vision method introduced in the first two chapters is an important method
of scene restoration. Its advantage is that the geometric relationship is very clear, but
its disadvantage is that it is necessary to determine the corresponding points in the
binocular or multi-ocular image. As can be seen from the first two chapters,
determining the corresponding points is the main work of stereo vision method,
and it is also a very difficult problem, especially when the lighting is inconsistent and
there are shadows. In addition, the stereo vision method needs to make several points
on the scene appear in all the images that need to determine the corresponding points
at the same time. In practice, due to the influence of line of sight occlusion, different
cameras cannot be guaranteed to have the same field of view, which leads to the
difficulty of corresponding point detection and affects the matching. At this time, if
the baseline length is shortened, the influence of occlusion may be weakened, but the
matching accuracy of the corresponding points will be reduced.
In order to avoid the complex problem of matching corresponding points, monocular
image scene restoration methods are also often used; that is, only a single
camera with a fixed position (which can shoot a single image or multiple images) is
used to collect images, and various 3D clues in the obtained images are used to
restore the scene [2]. Due to the loss of 1D information (depth information) when projecting the
3D world onto the 2D image, the key to restore the scene here is to restore the lost 1D
depth information and realize the 3D reconstruction of the scene [3, 4].
From a more general point of view, restoring the scenery is to restore the intrinsic
characteristics of the scenery. Among the various intrinsic characteristics of the
scene, the shape of 3D object is the most basic and important. On the one hand,
many other features of the object, such as surface normal and object boundaries, can
be derived from the shape; on the other hand, people usually define the object with
shape first and then use other characteristics of the object to further describe the
object. Various methods of restoring scenery from shape are often labeled with the
name “shape from X.” Here, X can represent tone change, texture change, scene
motion, illumination change, focal length size, scene pose, contour position, shadow
size, etc.
It is worth pointing out that in the process of obtaining 2D images from 3D
scenes, some useful information is indeed lost due to projection, but some
information is retained after conversion (or there are 3D clues of the scene in the 2D
images). Here are some examples:
1. If the light source position is changed during imaging, multiple images under
different lighting conditions can be obtained. The image brightness of the same
scenery surface varies with the shape of the scenery, so it can be used to help
determine the shape of the 3D scenery. At this time, multiple images do not
correspond to different viewpoints but to different illumination, which is called
shape from illumination.
2. If the scene moves during image acquisition, optical flow will be generated in the
image sequence composed of multiple images. The size and direction of optical
flow vary with the orientation of the scene surface, so it can be used to help
determine the 3D structure of the moving object, which is called shape from
motion; some people also refer to it as the structure from motion [5].
3. If the scenery rotates around itself in the process of image acquisition, the contour
of the scenery (the boundary between the object and the background, also known
as silhouette) will be easily obtained in each image. Combining these contours,
the surface shape of the scenery can also be restored, which is called shape from
contour, or shape from silhouette (SfS).
4. During the imaging process, some information about the shape of the original
scenery will be converted into the brightness information corresponding to the
shape of the original scenery in the image (or, when the illumination is determined,
the brightness change in the image is related to the shape of the scenery).
Therefore, according to the shading of the image, we can try to recover the surface
shape of the scenery, which is called shape from shading.
5. In the case of perspective projection, some information about the shape of the
scenery will remain in the changes of the surface texture of the object (different
orientations of the scenery surface will lead to different surface texture changes).
Therefore, through the analysis of the texture changes, we can determine the
different orientations of the object surface and then try to recover its surface
shape, which is called shape from texture.
6. There is also a close relationship between the focal length change caused by
focusing on the scene at different distances and the depth of the scene. Therefore,
the distance of the corresponding scene can be determined according to the focal
length of the clear imaging of the scene, which is called shape from focal length.
7. In addition, if the 3D scenery model and the camera focal length are known, the
perspective projection can establish the corresponding relationship between the
3D scene points and the imaging points on the image, so that the geometric shape
and pose of the 3D scenery can be calculated by using the relationship between
several corresponding points (a pose estimation from three-point perspective will
be discussed in Sect. 7.4).
Among the seven examples of scenery restoration listed above, the first three
cases need to collect multiple images, which will be introduced in the following
sections of this chapter, respectively; the last four cases only need to collect a single
image, which will be introduced in Chap. 8. The above methods can also be used in
The shape from illumination is realized according to the photometric stereo principle.
Photometric stereo, also known as photometric stereoscopic or photometric
stereo vision, is a method of reconstructing 3D information of the scene by using
photometric information (illumination direction, intensity, etc.) in the scene. The
specific method is to restore the surface orientation (normal direction) of the scene
with the help of a series of images collected under the same viewing angle (the same
viewpoint) but in different light source directions and, on this basis, restore the 3D
geometric structure of the scene.
Photometric stereo method is based on three conditions: (1) the incident light is
parallel light or comes from a point source at an infinite distance; (2) it is
assumed that the reflection model of the object surface is a Lambert reflection
model, that is, the incident light is uniformly scattered in all directions, and it is
the same for the observer to observe from any angle; (3) the camera model is an
orthogonal projection model.
The photometric stereo method is often used in environments where the lighting
conditions are easy to control or determine; it has a low cost and can often recover
finer local details. For an ideal Lambert surface (see Sect. 7.2.2), a good effect can
often be obtained. The four steps of shape recovery are as follows: (1) establishing a
lighting model, (2) calibrating light source information, (3) solving the surface
reflectance and/or normal information, and (4) calculating the depth information.
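Under the three conditions above, step (3) reduces to a linear least-squares problem once at least three images with known light directions are available; the following sketch (an illustration under the Lambertian assumption, not the book's algorithm) recovers the albedo and unit normals pixel by pixel.

```python
import numpy as np

def photometric_stereo(intensities: np.ndarray, light_dirs: np.ndarray):
    """Recover albedo and unit surface normals from k >= 3 images.

    intensities: (k, H, W) stack of images taken under the k light directions.
    light_dirs:  (k, 3) unit vectors pointing toward the light sources.
    Solves I = albedo * (N . L) for every pixel in the least-squares sense.
    """
    k, H, W = intensities.shape
    I = intensities.reshape(k, -1)                        # (k, H*W)
    g, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)    # g = albedo * normal, (3, H*W)
    albedo = np.linalg.norm(g, axis=0)
    normals = g / np.maximum(albedo, 1e-12)               # unit normals, (3, H*W)
    return albedo.reshape(H, W), normals.reshape(3, H, W)
```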
Scene brightness and image brightness are two related but different concepts in
photometry. In imaging, the former is related to radiant intensity or radiance,
while the latter is related to irradiance or illuminance. Specifically, the former
corresponds to the luminous flux emitted by the surface of the scene (regarded as a
light source), which is the power emitted by the unit area of the light source surface
within the unit solid angle, and the unit is Wm-2 sr-1. The latter corresponds to the
luminous flux irradiating on the surface of the scene, which is the power per unit area
irradiating on the surface of the scene, and the unit is Wm-2. In optical imaging, the
scene is imaged on the image plane (of the imaging system), so the scene brightness
corresponds to the luminous flux emitted from the surface of the scene, and the
image brightness corresponds to the luminous flux obtained by the image plane.
It should be noted that the image brightness obtained after imaging the 3D scene
depends on many factors. For example, the light intensity reflected by an ideal
diffuse surface when illuminated by a point light source is proportional to the
incident light intensity, the surface reflection coefficient, and the cosine of the light
incidence angle (the angle between the surface normal and the incoming ray). In a more
general case, the image brightness is affected by the shape of the scene itself, the
attitude in space, the surface reflection characteristics, the relative orientation and
position between the scene and the image acquisition system, the sensitivity of the
acquisition device, and the radiation intensity and distribution of the light source,
which does not represent the intrinsic characteristics of the scene.
For a scene surface patch of area δO imaged onto an image patch of area δI, with
z the distance of the patch from the lens along the optical axis, f the focal length,
α the angle between the ray through the center of the lens and the optical axis, and
θ the angle between the surface normal and that ray, the two areas are related by

$$\frac{\delta O}{\delta I} = \frac{\cos\alpha}{\cos\theta}\left(\frac{z}{f}\right)^2 \qquad (7.1)$$
Let’s see how much light from the surface of the scene will pass through the lens.
Because the lens area is n(d/2)2, it can be seen from Fig. 7.1 that the solid angle of the
lens seen from the scene patch is:
$$\frac{\pi (d/2)^2}{(z/\cos\alpha)^2}\,\cos\alpha = \frac{\pi}{4}\left(\frac{d}{z}\right)^2 \cos^3\alpha \qquad (7.2)$$
In this way, the power emitted by the surface patch SO of the scene and passing
through the lens is:
where L is the brightness of the scene surface in the direction toward the lens. Since
the light from other areas of the scene will not reach the image patch SI, the
illumination obtained by this patch is:
From Eq. (7.5), it can be seen that the measured patch illuminance E is directly
proportional to the brightness L of the scene of interest, is directly proportional to
the area of the lens, and is inversely proportional to the square of the focal length of
the lens. The illumination change caused by camera movement is reflected in the
included angle α.
When imaging the observed scene, the brightness L of the scene is related not only to
the luminous flux incident on the surface of the scene and the proportion of the
incident light reflected but also to the geometric factors of light reflection, that is, to
the direction of illumination and line of sight. Now let’s look at the coordinate
system shown in Fig. 7.2, where N is the normal of the surface patch, OR is an
arbitrary reference line, and the direction of a light L can be expressed by the
included angle θ (called polar angle) between the light and the normal of the patch
and the included angle φ (called azimuth) between the orthographic projection of the
light on the surface of the scene and the reference line.
With the help of such a coordinate system, the direction of light incident on the
surface of the scene can be represented by (θi, φi), and the direction reflected to the
observer's line of sight can be represented by (θe, φe), as shown in Fig. 7.3.
Thus, the bidirectional reflection distribution function (BRDF), which is very
important for understanding surface reflection, can be defined, and it is written as
Fig. 7.3 Schematic diagram of bidirectional reflection distribution function
Fig. 7.4 Schematic diagram of calculating surface brightness under the condition of extended light source
f(θi, φi; θe, φe) below. It represents the brightness of the surface observed by the
observer in the direction V(θe, φe) when light enters the surface of the scene in the
direction L(θi, φi). The unit of the bidirectional reflection distribution function is the
reciprocal of the solid angle (sr^-1), and its value ranges from zero to infinity (in the
latter case, an arbitrarily small incident illumination leads to observable radiation).
Note that f(θi, φi; θe, φe) = f(θe, φe; θi, φi), that is, the bidirectional reflection
distribution function is symmetrical with respect to the incident and reflection
directions. Let the illuminance obtained by light incident on the object surface along
the direction (θi, φi) be δE(θi, φi) and the reflected (emitted) brightness observed in
the direction (θe, φe) be δL(θe, φe). The bidirectional reflection distribution function
is the ratio of brightness to illuminance, that is:

$$f(\theta_i, \phi_i; \theta_e, \phi_e) = \frac{\delta L(\theta_e, \phi_e)}{\delta E(\theta_i, \phi_i)} \qquad (7.6)$$
Now consider further the case of an extended light source (e.g., see [8]). In
Fig. 7.4, the width of an infinitesimal patch on the sky (which can be considered as
a sphere of radius 1) is δθi along the polar angle and δφi along the azimuth. The solid
angle corresponding to this patch is δω = sin θi δθi δφi (where sin θi takes into account
the reduced spherical radius). If E0(θi, φi) is the illuminance per unit solid angle along
the direction (θi, φi), the illumination of the patch is E0(θi, φi) sin θi δθi δφi, and the
illumination received by the whole surface is:

$$E = \int_{-\pi}^{\pi}\int_{0}^{\pi/2} E_0(\theta_i, \phi_i)\,\sin\theta_i \cos\theta_i \, d\theta_i\, d\phi_i \qquad (7.7)$$
where cos θi takes into account the influence of projecting the surface along the
direction (θi, φi) (onto a plane perpendicular to the normal).
In order to obtain the brightness of the whole surface, the product of the
bidirectional reflection distribution function and the patch illumination needs to be
integrated over the hemisphere containing the possible incident light. With the help
of Eq. (7.6), we have:

$$L(\theta_e, \phi_e) = \int_{-\pi}^{\pi}\int_{0}^{\pi/2} f(\theta_i, \phi_i; \theta_e, \phi_e)\, E_0(\theta_i, \phi_i)\,\sin\theta_i \cos\theta_i\, d\theta_i\, d\phi_i \qquad (7.8)$$
The above result is a function of the two variables (θe, φe), which indicate the
direction of the light reaching the observer.
The bidirectional reflection distribution function is related to both the incidence of
light and the observation of light. Common light incidence and observation methods
include the four basic forms shown in Fig. 7.5, where θ represents the angle of
incidence and φ represents the azimuth. They are combinations of diffuse incidence
di or directional incidence (θi, φi) with diffuse reflection de or directional observation
(θe, φe). Their reflectances are as follows: diffuse incidence-diffuse reflection ρ(di;
de), directional incidence-diffuse reflection ρ(θi, φi; de), diffuse incidence-directional
observation ρ(di; θe, φe), and directional incidence-directional observation ρ(θi, φi;
θe, φe).
Only two extreme cases are considered below: ideal scattering surface and ideal
specular reflection surface.
1. Ideal scattering surface
The ideal scattering surface, also known as Lambert surface or diffuse
reflection surface, is equally bright from all observation directions (independent
of the angle between the observation line of sight and the surface normal), and it
reflects all incident light completely unabsorbed. Therefore, f(θi, φi; θe, φe) of the
ideal scattering surface is a constant (independent of angle), which can be
calculated as follows. For a surface, its brightness integrated over all observation
directions should be equal to the total illumination obtained by the surface, that is:

$$\int_{-\pi}^{\pi}\int_{0}^{\pi/2} f(\theta_i, \phi_i; \theta_e, \phi_e)\, E(\theta_i, \phi_i)\cos\theta_i\, \sin\theta_e \cos\theta_e\, d\theta_e\, d\phi_e = E(\theta_i, \phi_i)\cos\theta_i \qquad (7.9)$$

where both sides are multiplied by cos θi to convert to the N direction. From the
above equation, it can be solved that the BRDF of the ideal scattering surface is:

$$f(\theta_i, \phi_i; \theta_e, \phi_e) = \frac{1}{\pi} \qquad (7.10)$$

It can be seen from the above that, for an ideal scattering surface, the relationship
between brightness L and illuminance E is:

$$L = \frac{E}{\pi} \qquad (7.11)$$
In practice, the common frosted (matte) surface will reflect light divergently,
and the ideal frosted surface model is the Lambertian model. The reflectivity of
a Lambert surface depends only on the incident angle i; more precisely, the reflected
brightness varies with i as cos i. For a given reflected light intensity L, it can be seen
that the incident angle satisfies cos i = C · L, where C is a constant (related to the
constant reflection coefficient, i.e., the albedo). Therefore, i is also a constant. It can be concluded
that the normal of the surface is on a directional cone around the direction of
the incident light, the half angle of the cone is i, and the axis of the cone points to
the point source of illumination, that is, the cone is centered on the direction of the
incident light.
Two such cones generally intersect along two lines, which define two possible
directions in space, as shown in Fig. 7.6. Therefore, in order to make the surface
normal completely unambiguous, a third cone is needed. When using three light
sources, the surface normal must lie on each of the three cones: the first two cones
intersect along two lines, and the third cone, in general position, reduces the
possibilities to a single line, thus giving a unique interpretation and estimate of the
direction of the surface normal. It should be noted that if
some points are hidden behind and are not shot by the light of a light source, there
will still be ambiguity. In fact, the three light sources cannot be in the same
straight line, and they should be relatively separated on the surface without
blocking each other.
If the absolute reflection coefficient R of the surface is unknown, a fourth cone
can be considered. Using four light sources can help determine the orientation of
an unknown or nonideal characteristic surface. But this is not always necessary.
For example, when the three rays are orthogonal to each other, the sum of the squares
of the cosines of the included angles relative to each axis must be 1, which indicates that only
two angles are independent. Therefore, three sets of data are used to determine
R and two independent angles, that is, a complete solution is obtained. The use of
four light sources in practical applications can help determine any inconsistent
interpretation, which may come from the presence of specular elements.
2. Ideal specular surface
The ideal specular reflection surface reflects like a mirror (e.g., the highlight
region on the object is the result of the specular reflection of the light source by
the object), so the reflected light wavelength only depends on the light source and
has nothing to do with the color of the reflection surface. Unlike the ideal
scattering surface, an ideal specular reflection surface can reflect all the light
emitted from the (θi, φi) direction to the (θe, φe) direction. At this time, the
incident angle and the reflection angle are equal, as shown in Fig. 7.7. The BRDF
of the ideal specular reflection surface will be proportional (with the scale factor k) to
the product of the two pulses δ(θe − θi) and δ(φe − φi − π).
In order to calculate the scale factor k, the brightness of the surface in all
directions is integrated, which should be equal to the total illumination obtained
by the surface, that is:
$$\int_{-\pi}^{\pi}\int_{0}^{\pi/2} k\,\delta(\theta_e - \theta_i)\,\delta(\phi_e - \phi_i - \pi)\,\sin\theta_e \cos\theta_e\, d\theta_e\, d\phi_e = k \sin\theta_i \cos\theta_i = 1 \qquad (7.12)$$
From it, the BRDF of the ideal specular reflection surface can be solved:
$$f(\theta_i, \phi_i; \theta_e, \phi_e) = \frac{\delta(\theta_e - \theta_i)\,\delta(\phi_e - \phi_i - \pi)}{\sin\theta_i \cos\theta_i} \qquad (7.13)$$
When the light source is an extended light source, substituting the above equation
into Eq. (7.8) gives the brightness of the ideal specular reflection surface:

$$L(\theta_e, \phi_e) = \int_{-\pi}^{\pi}\int_{0}^{\pi/2} \frac{\delta(\theta_e - \theta_i)\,\delta(\phi_e - \phi_i - \pi)}{\sin\theta_i \cos\theta_i}\, E(\theta_i, \phi_i)\,\sin\theta_i \cos\theta_i\, d\theta_i\, d\phi_i = E(\theta_e, \phi_e - \pi) \qquad (7.14)$$
It can be seen that the polar angle has not changed, but the azimuth has turned
180°.
In practice, both ideal scattering surface and ideal specular reflection surface
are extreme cases, which are relatively rare. Many surfaces can be regarded as
having the properties of both parts of the ideal scattering surface and of the ideal
specular reflection surface (see Sect. 7.5.2 for further discussion). In other words,
the BRDF of the actual surface is the weighted sum of Eqs. (7.10) and (7.13).
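Written out explicitly, such a weighted combination can take the following form, where the weights kd and ks (with kd + ks = 1) are introduced here only for illustration:

$$f(\theta_i, \phi_i; \theta_e, \phi_e) = k_d\,\frac{1}{\pi} + k_s\,\frac{\delta(\theta_e - \theta_i)\,\delta(\phi_e - \phi_i - \pi)}{\sin\theta_i \cos\theta_i}, \qquad k_d + k_s = 1,\ k_d, k_s \ge 0$$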
The orientation of the scene surface is an important description of the surface. For a
smooth surface, each point on it has a corresponding tangent plane, and the orientation of
this tangent plane can be used to represent the orientation of the surface at that point. The
normal vector of the surface, that is, the (unit) vector perpendicular to the tangent
plane, can indicate the orientation of the tangent plane. If the Gaussian spherical
coordinate system is used (see [8]) and the tail of the normal vector is placed in the
center of the ball, the top of the vector and the sphere will intersect at a specific point,
which can be used to mark the surface orientation. The normal vector has two
degrees of freedom, so the position of the intersection on the sphere can be
represented by two variables, such as polar angle and azimuth, or longitude and
latitude.
The selection of these variables is related to the setting of the coordinate system.
Generally, for convenience, one axis of the coordinate system is often overlapped
with the optical axis of the imaging system, and the system origin is placed at the center of the lens, so that the other two axes are parallel to the image plane. In a right-handed system, the Z-axis can be pointed toward the image, as shown in Fig. 7.8. In this way, the scene surface can be described by its distance −z from the lens plane (which is parallel to the image plane).
Now express the surface normal vector in terms of z and the partial derivatives of z with respect to x and y. The surface normal is perpendicular to all lines in the tangent plane, so it can be obtained by calculating the outer (cross) product of any two nonparallel vectors in the tangent plane, as shown in Fig. 7.9.
If a small step δx is taken from a given point (x, y) along the X-axis direction, then according to the Taylor expansion the change along the Z-axis direction is δz = δx ∂z/∂x + ε, where ε contains the higher-order terms. In the following, p and q are used to represent the partial derivatives of z with respect to x and y, respectively; (p, q) is generally called the surface gradient. In this way, the vector along the X-axis direction is [δx 0 pδx]^T, which is parallel to the vector r_x = [1 0 p]^T of a line through (x, y) in the tangent plane. Similarly, a line parallel to the vector r_y = [0 1 q]^T also passes through (x, y) in the tangent plane. The surface normal can be obtained by taking the outer product of these two vectors. Finally, it is necessary to determine whether the normal points toward or away from the observer. If it is chosen to point toward the observer, then:
$$\mathbf{n} = \frac{\mathbf{N}}{|\mathbf{N}|} = \frac{[-p\ \ {-q}\ \ 1]^T}{\sqrt{1+p^2+q^2}} \quad (7.16)$$
The included angle θe between the normal of the scene surface and the direction toward the lens is calculated next. If the scene is quite close to the optical axis, the unit
observation vector V from the scene to the lens can be considered as [0 0 1]T, so
the dot product of the two unit vectors can be obtained:
$$\mathbf{N}\cdot\mathbf{V} = \cos\theta_e = \frac{1}{\sqrt{1+p^2+q^2}} \quad (7.17)$$
When the distance between the light source and the scene is much larger than the scale of the scene itself, the direction of the light source can be indicated by a single fixed vector; the surface orientation corresponding to this vector is the one that is orthogonal to the light emitted by the light source. If the normal of such a surface is expressed as [−ps −qs 1]^T, then, when the light source and the observer are on the same side of the scene, the direction of the light source can be indicated by the gradient (ps, qs).
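As a small numerical illustration (not from the book) of how gradients encode orientation, the following NumPy sketch builds the unit vector of Eq. (7.16) for a surface gradient (p, q) and, with the same convention, for a light source described by (ps, qs), and evaluates cos θe of Eq. (7.17) together with cos θi; all numerical values are arbitrary.

```python
import numpy as np

def unit_normal(p, q):
    """Unit vector [-p, -q, 1]/sqrt(1 + p^2 + q^2), as in Eq. (7.16)."""
    n = np.array([-p, -q, 1.0])
    return n / np.linalg.norm(n)

V = np.array([0.0, 0.0, 1.0])        # unit viewing vector (scene near the axis)

def cos_angles(p, q, ps, qs):
    """Return (cos theta_e, cos theta_i) for surface gradient (p, q) and a
    light source whose direction is described by the gradient (ps, qs)."""
    N = unit_normal(p, q)            # surface normal
    S = unit_normal(ps, qs)          # light source direction, same convention
    return float(N @ V), float(N @ S)

print(cos_angles(0.3, -0.2, 0.5, 0.5))   # arbitrary illustrative values
```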
Now consider linking pixel gray scale (image brightness) to pixel gradient (surface
orientation).
Consider that a point light source irradiates a Lambert surface, and the illumination is E. According to Eq. (7.10), its brightness L is:

$$L = \frac{1}{\pi}E\cos\theta_i \quad (7.18)$$

where θi is the angle between the surface normal vector [−p −q 1]^T and the vector pointing to the light source [−ps −qs 1]^T. Note that since the brightness cannot be negative, 0 ≤ θi ≤ π/2. The inner product of these two unit vectors gives:

$$\cos\theta_i = \frac{1 + p_s p + q_s q}{\sqrt{1 + p^2 + q^2}\,\sqrt{1 + p_s^2 + q_s^2}} \quad (7.19)$$
By substituting it into Eq. (7.18), the relationship between the brightness of the
scene and the orientation of the surface can be obtained. The relation function
obtained in this way is recorded as R( p, q) and drawn as a function of gradient
(p, q) in the form of isolines. The obtained graph is called the reflection map. Generally, the PQ plane is called gradient space, in which each point (p, q) corresponds to a
specific surface orientation. In particular, the point at the origin represents all planes
perpendicular to the viewing direction. The reflection map depends on the properties
of the target surface material and the location of the light source, or in other words,
the information of the surface reflection characteristics and the distribution of the
light source are integrated in the reflection map.
The image illumination is proportional to several constants, including the reciprocal of the square of the focal length and the fixed brightness of the light source. In
practice, the reflection map is often normalized for unified description. For the
Lambert surface illuminated by a distant point light source, there is:

$$R(p,q) = \frac{1 + p_s p + q_s q}{\sqrt{1 + p^2 + q^2}\,\sqrt{1 + p_s^2 + q_s^2}} \quad (7.20)$$

From the above formula, the relationship between the brightness of the scene and the orientation of the surface can be obtained from the reflection map. For the Lambert surface, the isolines will be nested conics, because from R(p, q) = c one obtains (1 + p_s p + q_s q)² = c²(1 + p² + q²)(1 + p_s² + q_s²). The maximum value of R(p, q) is obtained at (p, q) = (p_s, q_s).
Figure 7.10 shows three examples of Lambert surface reflection diagrams, of
which Fig. 7.10a is the case when ps = 0, qs = 0 (corresponding to nested concentric circles); Fig. 7.10b shows the situation when ps ≠ 0 and qs = 0 (corresponding to ellipses or hyperbolas); Fig. 7.10c shows the situation when ps ≠ 0 and qs ≠ 0 (corresponding to hyperbolas).
Now consider another extreme case, called the isotropic radiation surface. If the
surface of an object can radiate evenly in all directions (physically impossible), it
will feel brighter when you look at it obliquely. This is because the tilt reduces
the visible surface area, and it is assumed that the radiation itself does not change, so
the amount of radiation per unit area will be larger. At this time, the brightness of the
surface depends on the reciprocal of the cosine of the radiation angle. Considering the projection of the object surface in the direction of the light source, it can be seen that the brightness is proportional to cos θi/cos θe. Because cos θe = 1/(1 + p² + q²)^{1/2}, there is:

$$R(p,q) = \frac{1 + p_s p + q_s q}{\sqrt{1 + p_s^2 + q_s^2}} \quad (7.21)$$
The isolines are now parallel straight lines, because (1 + p_s p + q_s q) = c(1 + p_s² + q_s²)^{1/2} can be obtained from R(p, q) = c. These lines are orthogonal to the direction (p_s, q_s).
Figure 7.11 is an example of an isotropic radiation surface reflection diagram.
Here, let ps/qs = 1/2, so the slope of the isoline (straight line) is 2.
The reflection map shows the dependence between surface brightness and surface
orientation. The illumination E(x, y) of a point on the image is proportional to the
brightness of the corresponding point on the surface of the scene. If the surface
gradient at this point is (p, q), the brightness of this point can be recorded as R(p, q).
If the scale coefficient is set to the unit value by normalization, then:

$$I(x, y) = R(p, q) \quad (7.22)$$
This equation is called the image brightness constraint equation, which shows
that the gray level I(x, y) of the pixel at (x, y) in the image depends on the reflection
characteristic R(p, q) expressed by (p, q) of the pixel. The image brightness
constraint equation connects the brightness of any position (x, y) in the image
plane XY with the orientation (p, q) of the sampling unit expressed in a gradient
space PQ. The image brightness constraint equation plays an important role in
restoring the shape of the object surface from the image.
Suppose a sphere with a Lambert surface is illuminated by a point light source, and the observer is located at the position of the point source. Since θe = θi and (ps, qs) = (0, 0), the relationship between brightness and gradient can be seen from
Eq. (7.20):
$$R(p,q) = \frac{1}{\sqrt{1+p^2+q^2}} \quad (7.23)$$
If the center of the sphere is on the optical axis, its surface equation is:

$$z = z_0 + \sqrt{r^2 - x^2 - y^2}, \qquad x^2 + y^2 \le r^2 \quad (7.24)$$

where r is the radius of the ball and −z_0 is the distance between the ball center and the lens (see Fig. 7.12). According to p = −x/(z − z_0) and q = −y/(z − z_0), we can get (1 + p² + q²)^{1/2} = r/(z − z_0) and finally:

$$R(p,q) = \frac{z - z_0}{r} = \frac{\sqrt{r^2 - x^2 - y^2}}{r} \quad (7.25)$$
As can be seen from the above equation, the brightness gradually decreases from
the maximum value in the center of the image to the zero value at the edge of the
image. The same conclusion can also be obtained by considering the light source
direction S, line of sight direction V, and surface direction N marked in Fig. 7.12.
When people observe such a shadow change, they will think that the image is imaged
by a circular or spherical object. However, if each part of the surface of the ball has
different reflection characteristics, the image and feeling will be different. For
example, when the reflection map is represented by Eq. (7.21) and (ps,
qs) = (0, 0), a disk with uniform brightness is obtained. For people who are used
to observing the reflection characteristics of Lambert surface, such a sphere will
appear relatively flat.
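The shading of such a sphere is easy to synthesize numerically. The following sketch is only an illustration under the same assumptions as above (Lambert surface, light source at the observer, sphere centered on the optical axis); it renders the brightness pattern that falls from its maximum at the image center to zero at the rim.

```python
import numpy as np

def sphere_brightness(size=129, radius=1.0):
    """Brightness of a Lambert sphere lit from the viewing direction:
    E = sqrt(1 - (x^2 + y^2)/r^2) inside the disk, 0 outside."""
    xs = np.linspace(-radius, radius, size)
    x, y = np.meshgrid(xs, xs)
    rho2 = x**2 + y**2
    img = np.zeros_like(x)
    inside = rho2 <= radius**2
    img[inside] = np.sqrt(1.0 - rho2[inside] / radius**2)
    return img

img = sphere_brightness()
print(img[64, 64], img[64, 0])       # ~1.0 at the center, 0.0 at the rim
```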
For a given image, people often hope to restore the shape of the original imaging
object. The correspondence between the surface orientation determined by p and
q and the brightness determined by the reflection map R(p, q) is unique, but the
reverse is not necessarily true. In practice, there are often an infinite number of
surface orientations that can give the same brightness. On the reflection map, these
orientations corresponding to the same brightness are connected by isolines. In some
cases, special points with maximum or minimum brightness can often be used to
help determine the surface orientation. According to Eq. (7.20), for a Lambert
surface, R( p, q) = 1 can only be true when ( p, q) = ( ps, qs), so given the surface
brightness, the surface orientation can be uniquely determined. However, in general,
the corresponding relationship from image brightness to surface orientation is not
unique, because the brightness has only one degree of freedom (brightness value) in
each spatial position, while the orientation has two degrees of freedom (two gradient
values).
Thus, in order to restore the surface orientation, new information needs to be
introduced. To determine the two unknowns p and q, there should be two equations.
Two equations can be generated for each image point by using the two images
collected under different lighting conditions (see Fig. 7.13):
$$R_1(p,q) = E_1, \qquad R_2(p,q) = E_2 \quad (7.26)$$
If these equations are linear and independent, then there is a unique solution for p and q. If these equations are nonlinear, then for p and q, there may be either no
solutions or multiple solutions. The correspondence between brightness and surface
orientation is not unique, which is an ill-conditioned problem. Acquiring two images
is equivalent to adding equipment to provide additional conditions to solve the
ill-conditioned problem.
The calculation of solving the image brightness constraint equation can be carried
out as follows. Set:
$$R_1(p,q) = \frac{1 + p_1 p + q_1 q}{r_1}, \qquad R_2(p,q) = \frac{1 + p_2 p + q_2 q}{r_2} \quad (7.27)$$

where

$$r_j = \sqrt{1 + p_j^2 + q_j^2}, \quad j = 1, 2 \quad (7.28)$$

It can be seen that as long as p_1/q_1 ≠ p_2/q_2, p and q can be solved from the above equations:

$$p = \frac{(E_1 r_1 - 1)q_2 - (E_2 r_2 - 1)q_1}{p_1 q_2 - p_2 q_1}, \qquad q = \frac{(E_2 r_2 - 1)p_1 - (E_1 r_1 - 1)p_2}{p_1 q_2 - p_2 q_1} \quad (7.29)$$
It can be seen from the above that given two corresponding images collected
under different lighting conditions, the unique solution can be obtained for the
surface orientation of each point on the imaging object.
An example of solving the image brightness constraint equation is shown in
Fig. 7.14. Figure 7.14a and b are two corresponding images acquired from the same sphere under different lighting conditions (the same light source is placed in two different positions). Figure 7.14c shows the result of drawing the orientation vector
of each point after calculating the surface orientation with the above method. It can
be seen that the orientation close to the center of the ball is relatively perpendicular to
the paper, while the orientation close to the edge of the ball is relatively parallel to
the paper. Note that the orientation of the surface cannot be determined where the
light cannot reach or where only one image is illuminated.
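For the linearized reflectance maps of Eq. (7.27), the two-image solution can be written in a few lines. The sketch below assumes exactly that simplified model (so it is not a general photometric stereo solver) and returns the gradient (p, q) from two brightness measurements E1 and E2; the test values are made up.

```python
import numpy as np

def gradient_from_two_sources(E1, E2, p1, q1, p2, q2):
    """Solve R1(p, q) = E1, R2(p, q) = E2 for the linearized reflectance
    maps R_j = (1 + p_j p + q_j q)/r_j; requires p1/q1 != p2/q2."""
    r1 = np.sqrt(1.0 + p1**2 + q1**2)
    r2 = np.sqrt(1.0 + p2**2 + q2**2)
    det = p1 * q2 - p2 * q1
    a, b = E1 * r1 - 1.0, E2 * r2 - 1.0
    return (a * q2 - b * q1) / det, (b * p1 - a * p2) / det

print(gradient_from_two_sources(0.9, 0.8, 0.5, 0.0, 0.0, 0.5))
```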
In many practical situations, three different light sources are often used, which
can not only linearize the equation but also improve the accuracy and increase the
solvable surface orientation range. In addition, this newly added third image can also
help restore the surface reflection coefficient. The following is a specific description.
The surface reflection property can often be described by the product of two
factors (coefficients). One is the geometric term, which represents the dependence on
the light reflection angle; the other is the proportion of incident light reflected by the
surface, which is called the reflection coefficient.
Generally, the reflection characteristics of each part of the object surface are
inconsistent. In the simplest case, brightness is only the product of reflection
coefficient and some orientation function. Here, the reflection coefficient is between
0 and 1. There is a surface similar to the Lambert surface (it has the same brightness from all directions but does not reflect all incident light), and its brightness can be expressed as ρ cos θi, where ρ is the surface reflection coefficient (it may change with
the position on the surface). In order to recover the reflection coefficient and gradient
( p, q), three kinds of information are needed, which can be obtained from the
measurement of three images.
Now introduce the unit vectors in the three light source directions:

$$\mathbf{S}_j = \frac{[-p_j\ \ {-q_j}\ \ 1]^T}{\sqrt{1 + p_j^2 + q_j^2}}, \quad j = 1, 2, 3 \quad (7.30)$$

and recall that

$$\mathbf{N} = \frac{[-p\ \ {-q}\ \ 1]^T}{\sqrt{1 + p^2 + q^2}} \quad (7.32)$$

is the unit vector of the surface normal. In this way, three equations can be obtained for the unit vector N and the reflection coefficient ρ, one per light source:

$$E_j = \rho\,(\mathbf{S}_j \cdot \mathbf{N}), \quad j = 1, 2, 3 \quad (7.33)$$

or, collecting them in matrix form:

$$\mathbf{E} = \rho\,\mathbf{S}\,\mathbf{N} \quad (7.34)$$
The rows of matrix S are the light source direction vectors S1, S2, and S3, and the
elements of vector E are the three luminance measurements.
Let S be nonsingular; then it can be obtained from Eq. (7.34) that:

$$\rho\,\mathbf{N} = \mathbf{S}^{-1}\mathbf{E} \quad (7.35)$$

Since N is a unit vector, ρ is the magnitude of S⁻¹E, and N is obtained by normalizing it.
The direction of the surface normal is thus the product of a constant and a linear combination of three vectors, each of which is perpendicular to the directions of two of the light sources. Each such vector is multiplied by the brightness obtained when the remaining (third) light source is used, and the reflection coefficient can then be determined uniquely from the magnitude of the resulting vector.
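A minimal NumPy sketch of this three-source computation (assuming the simple model E = ρ S N of Eq. (7.34), with hypothetical light directions and measurements) recovers ρ and N by inverting S, as in Eq. (7.35):

```python
import numpy as np

def photometric_stereo(E, S):
    """Recover (rho, N) from three brightness values E and the 3x3 matrix S
    whose rows are the unit light source directions (E = rho S N)."""
    rho_N = np.linalg.solve(S, E)    # rho N = S^{-1} E
    rho = np.linalg.norm(rho_N)
    return rho, rho_N / rho

# Hypothetical light directions (rows are unit vectors) and a test surface
# with albedo 0.8 and normal along the Z-axis.
S = np.array([[0.0, 0.0, 1.0],
              [0.6, 0.0, 0.8],
              [0.0, 0.6, 0.8]])
N_true = np.array([0.0, 0.0, 1.0])
E = 0.8 * S @ N_true
print(photometric_stereo(E, S))      # ~(0.8, [0, 0, 1])
```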
Finally, an example of using three images to restore the reflection coefficient is given. Set a light source at three positions (−3.4, −0.8, −1.0), (0.0, 0.0, −1.0), and (−4.7, −3.9, −1.0) in space to collect three images. According to the brightness constraint equation, three sets of equations can be obtained, so that the surface orientation and reflection coefficient ρ can be calculated. Figure 7.15a shows the three groups of reflection characteristic curves. It can be seen from Fig. 7.15b that when the reflection coefficient ρ = 0.8, the three reflection characteristic curves intersect at a point p = −0.1, q = −0.1; in other cases, there is no common intersection.
In the shape from illumination, the orientation of each surface of the scene is revealed by changing the illumination, that is, by moving the light source. In fact, if the
light source is fixed but the pose of the scene is changed, it is also possible to reveal
different surfaces of the scene. The pose change of the scene can be realized by the
movement of the scene, so the shape and structure of each part of the scene can also
be revealed by using sequence images or videos and detecting the movement of the
scene.
The detection of motion can be based on the change of image brightness with
time. It should be noted that although the movement of the camera or the movement
of the scenery will lead to the brightness change between each image frame in the
video, the change of the lighting conditions in the video image will also lead to the
change of the image brightness with time, so the change of the brightness with time
on the image plane does not always correspond to the movement. Generally, optical
flow (vector) is used to represent the change of image brightness over time, but
sometimes it is different from the actual motion in the scene.
Under perspective projection, the image position r_i of a scene point r_o satisfies:

$$\frac{1}{\lambda}\,\mathbf{r}_i = \frac{1}{\mathbf{r}_o \cdot \hat{\mathbf{z}}}\,\mathbf{r}_o \quad (7.37)$$

where λ is the focal length of the lens and r_o · ẑ = z is the distance from the lens center to the object along the optical axis. Differentiating the above equation with respect to time gives the velocity vectors assigned to each pixel, and these velocity vectors constitute the motion field.
Visual psychology believes that when relative motion occurs between people and
the observed object, the movement of the parts with optical characteristics on the
surface of the observed object provides people with information about motion and
structure. When there is relative motion between the camera and the scene object, the
observed brightness mode motion is called optical flow or image flow. In other
words, the movement of the object with optical characteristics is projected onto the
retinal plane (i.e., the image plane) to form optical flow. Optical flow expresses the
change of image, which contains the information of object motion, and can be used
to determine the movement of the observer relative to the object. Optical flow has
three elements: (1) motion (velocity field), which is the necessary condition for the
formation of optical flow; (2) parts with optical characteristics (such as gray-scale
pixel points) that can carry information; and (3) imaging projection (from scene to
image plane), so optical flow can be observed.
Although optical flow is closely related to the motion field, they are not
completely corresponding. The object motion in the scene leads to the brightness
mode motion in the image, and the visible motion of the brightness mode generates
optical flow. In the ideal case, the optical flow corresponds to the motion field, but in
practice, there are also cases where they do not correspond. In other words, motion produces optical flow, so where there is optical flow there is generally motion, but where there is motion there need not be optical flow.
Here are a few examples to illustrate the difference between optical flow and
motion field. First, when the light source is fixed, a sphere with uniform reflection
characteristics rotates in front of the camera, as shown in Fig. 7.17a. At this time,
there are spatial changes in brightness everywhere in the spherical image, but this
spatial change does not change with the rotation of the spherical surface, so the
image does not change (in gray level) with time. In this case, although the motion
field is not zero, the optical flow is zero everywhere. Next, consider the case that the
fixed ball is illuminated by a moving light source, as shown in Fig. 7.17b. The gray
scale everywhere in the image will change with the movement of the light source due
to the change of lighting conditions. In this case, although the optical flow is not
zero, the motion field of the ball is zero everywhere. This motion is also called
apparent motion (optical flow is the apparent motion of brightness mode). The above
two cases can be regarded as optical illusion.
As can be seen from the above example, optical flow is not equivalent to a motion
field. However, in most cases, there is still a certain corresponding relationship
between optical flow and motion field. So in many cases, the relative motion can
be estimated by image changes according to the corresponding relationship between
optical flow and motion field. However, it should be noted that there is also a
problem of determining the corresponding points between different images.
Refer to Fig. 7.18, where each closed curve represents an equal brightness curve.
Consider that there is an image point P with brightness E at time t, as shown in
Fig. 7.18a. At time t + δt, which image point does P correspond to? In other words, to solve this problem, we need to know how the brightness mode changes. Generally, many points near P have the same brightness E. If the brightness changes continuously in this part of the region, P should be on an equal brightness curve C. At t + δt, there will be some equal brightness curves C′ with the same brightness near the original C, as shown in Fig. 7.18b. However, it is difficult to say which point P′ on C′ corresponds to the point P on the original C, because the shapes of the two equal brightness curves C and C′ may be completely different. Therefore, although it can be determined that curve C corresponds to curve C′, it cannot be determined that point P corresponds to point P′.
It can be seen from the above example that only relying on the local information
in the changing image cannot uniquely determine the optical flow. Further, consider
Fig. 7.17 again. If there is a region in the image where the brightness is uniform and
does not change with time, the optical flow that is most likely to produce is zero
everywhere, but in fact any vector movement mode can be assigned to the region
with uniform brightness.
Optical flow can represent the changes in the image, which includes not only the
motion information of the observed object but also the related scene structure
information. Through the analysis of optical flow, we can determine the 3D structure
of the scene and the relative motion between the observer and the moving object.
Motion analysis can describe image changes and calculate object structure and
motion with the help of optical flow. The first step is to represent the changes in
the image with 2D optical flow (or the speed of the corresponding reference point),
and the second step is to calculate the 3D structure of the moving object and its
movement relative to the observer according to the optical flow calculation results.
The movement of the scenery in the scene will cause the scenery to be in different
relative positions in the image obtained during the movement. This difference in
position can be called parallax, which corresponds to the displacement vector
(including size and direction) reflected by the scenery movement on the image. If
the parallax is divided by the time difference, the velocity vector (also known as
the instantaneous displacement vector) is obtained. Optical flow can be regarded as
the instantaneous velocity field generated by the movement of gray-scale pixels on
the image plane. Based on this, the basic optical flow constraint equation, also
known as optical flow equation or image flow equation, can be established.
At time t, a specific image point is at (x, y). At time t + dt, the image point moves to (x + dx, y + dy). If the time interval dt is small, it can be expected (or assumed) that the gray level of the image point remains unchanged; in other words:

$$f(x, y, t) = f(x + \mathrm{d}x, y + \mathrm{d}y, t + \mathrm{d}t) \quad (7.38)$$

Expanding the right side of the above equation with a Taylor series, letting dt → 0, taking the limit, and omitting the higher-order terms gives:

$$-\frac{\partial f}{\partial t} = \frac{\partial f}{\partial x}\frac{\mathrm{d}x}{\mathrm{d}t} + \frac{\partial f}{\partial y}\frac{\mathrm{d}y}{\mathrm{d}t} = \frac{\partial f}{\partial x}u + \frac{\partial f}{\partial y}v \quad (7.39)$$
where u and v are the moving speeds of image points in the X and Y directions,
respectively, and they form a velocity vector. If we write

$$f_x = \frac{\partial f}{\partial x}, \qquad f_y = \frac{\partial f}{\partial y}, \qquad f_t = \frac{\partial f}{\partial t} \quad (7.40)$$

then the optical flow equation can be written compactly as:

$$f_x u + f_y v + f_t = 0 \quad (7.41)$$
The optical flow equation shows that the temporal change rate of the gray level of
a point in the moving image is the product of the spatial change rate of the gray level
of the point and the spatial motion speed of the point.
In practice, the time change rate of the gray scale can be estimated by averaging first-order differences along the time direction, and the spatial change rate of the gray scale can be estimated by averaging first-order differences along the X and Y directions, respectively. The velocity components u and v can then be estimated by the least squares method. Take N pixels at different positions on the same object (sharing the same u and v) in two consecutive images f(x, y, t) and f(x, y, t + 1), and denote the estimates of f_t, f_x, and f_y at the k-th position (k = 1, 2, ..., N) by f_t^{(k)}, f_x^{(k)}, and f_y^{(k)}, respectively. Collect the spatial gradient estimates into an N × 2 matrix F_xy, whose k-th row is [f_x^{(k)}  f_y^{(k)}], and the temporal estimates into an N-vector f_t. The least squares solution is then:

$$[u\ \ v]^T = -\left(\mathbf{F}_{xy}^T \mathbf{F}_{xy}\right)^{-1}\mathbf{F}_{xy}^T \mathbf{f}_t \quad (7.47)$$
Figure 7.19 shows an example of optical flow detection. Figure 7.19a is a side
image of a patterned sphere, and Fig. 7.19b is an image obtained by rotating the
sphere (around the up and down axis) to the right by a small angle. The motion of the
sphere in 3D space is basically translational motion reflected in the 2D image, so in
the optical flow detected in Fig. 7.19c, the parts with large optical flow are distrib
uted along the longitude, reflecting the result of the horizontal movement of the
edge.
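A compact sketch of the least squares estimate of Eq. (7.47) is given below; it assumes the derivative estimates f_x, f_y, f_t are already available for N pixels sharing the same motion, and the numerical values are made up for illustration.

```python
import numpy as np

def flow_least_squares(fx, fy, ft):
    """Estimate a single (u, v) shared by N pixels from their spatial and
    temporal derivative estimates: solve Fxy [u v]^T = -ft in the least
    squares sense, as in Eq. (7.47)."""
    Fxy = np.column_stack([fx, fy])          # N x 2 matrix of gradients
    uv, *_ = np.linalg.lstsq(Fxy, -np.asarray(ft), rcond=None)
    return uv                                 # [u, v]

# Hypothetical derivatives consistent with a motion of (u, v) = (1.0, -0.5):
fx = np.array([0.2, 0.5, -0.3, 0.1])
fy = np.array([0.4, -0.1, 0.2, 0.3])
ft = -(fx * 1.0 + fy * (-0.5))               # from fx*u + fy*v + ft = 0
print(flow_least_squares(fx, fy, ft))        # ~ [1.0, -0.5]
```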
How to solve the optical flow equation given by Eq. (7.41)? The essence of this
problem is to calculate the optical flow component according to the gradient of gray
value of image points. This needs to be considered in different situations. Here are
some common situations.
The calculation of optical flow is to solve the optical flow equation, that is, to find the
optical flow components according to the gray value gradients of the image point.
The optical flow equation restricts the relationship between the three directional
gradients and the optical flow components. It can be seen from Eq. (7.41) that this is
a linear constraint equation about the velocity components u and v. If a velocity
space is established with the velocity components as the axes (the coordinate system
is shown in Fig. 7.20), then the u and v values satisfying the constraint Eq. (7.41) are
on a straight line. It can be obtained from Fig. 7.20:
$$u_0 = -\frac{f_t}{f_x}, \qquad v_0 = -\frac{f_t}{f_y}, \qquad \theta = \arctan\frac{f_x}{f_y} \quad (7.48)$$
Note that each point on the line is the solution of the optical flow equation. In
other words, only one optical flow equation is not enough to uniquely determine the
two quantities u and v. In fact, solving two variables with only one equation is an
ill-conditioned problem that must be solved with additional constraints.
In many cases, the research object can be regarded as a non-deformable rigid
body, and each adjacent point on it has the same optical flow velocity. This condition
can be used to help solve the optical flow equation. According to the condition that
the adjacent points on the object have the same optical flow velocity, it can be known
that the spatial variation rate of the optical flow velocity is zero, that is:
$$(\nabla u)^2 = \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 = 0 \quad (7.49)$$

$$(\nabla v)^2 = \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 = 0 \quad (7.50)$$
These two conditions can be combined with the optical flow equation to calculate
the optical flow by solving a minimization problem. Assume:

$$\varepsilon = \iint \left[(f_x u + f_y v + f_t)^2 + \lambda^2\left((\nabla u)^2 + (\nabla v)^2\right)\right]\mathrm{d}x\,\mathrm{d}y \quad (7.51)$$

The value of λ should take into account the noise in the image. If the noise is strong, it means that the confidence in the image data itself is low and more reliance must be placed on the optical flow smoothness constraint, so λ should take a larger value; otherwise, λ should take a smaller value.
In order to minimize the total error in Eq. (7.51), ε can be differentiated with respect to u and v, respectively, and the derivatives set to zero, so that:

$$\lambda^2 \nabla^2 u = f_x\left(f_x u + f_y v + f_t\right) \quad (7.52)$$

$$\lambda^2 \nabla^2 v = f_y\left(f_x u + f_y v + f_t\right) \quad (7.53)$$

The above two equations are also called the Euler equations. If we let ū and v̄ denote the means in the neighborhoods of u and v, respectively (which can be calculated by an image local smoothing operator) and approximate the Laplacians as ∇²u ≈ ū − u and ∇²v ≈ v̄ − v, then Eqs. (7.52) and (7.53) can be changed into a pair of linear equations in u and v.
Equations (7.56) and (7.57) provide the basis for an iterative solution to u(x, y) and v(x, y). In practice, the following relaxation iterative equations are often used:

$$u^{(n+1)} = \bar{u}^{(n)} - \frac{f_x\left[f_x \bar{u}^{(n)} + f_y \bar{v}^{(n)} + f_t\right]}{\lambda^2 + f_x^2 + f_y^2} \quad (7.58)$$

$$v^{(n+1)} = \bar{v}^{(n)} - \frac{f_y\left[f_x \bar{u}^{(n)} + f_y \bar{v}^{(n)} + f_t\right]}{\lambda^2 + f_x^2 + f_y^2} \quad (7.59)$$

Here we can take u^{(0)} = 0, v^{(0)} = 0 (a straight line through the origin). The above
two equations have a simple geometric interpretation, that is, the iteration value at a
new (u, v) point is the average value in the neighborhood of the point minus an
adjustment amount, which is in the direction of the brightness gradient (see
Fig. 7.21). Therefore, the iterative process is a process of moving a straight line
along the brightness gradient, and the straight line is always perpendicular to the
direction of the brightness gradient. For the specific flowchart of solving Eqs. (7.58)
and (7.59), please refer to Fig. 8.10 in the next chapter.
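The relaxation iteration of Eqs. (7.58) and (7.59) can be sketched as follows; the neighborhood mean is computed here with a simple 4-neighbor average (one possible choice of local smoothing operator), and the values of λ² and the iteration count are arbitrary illustrative defaults.

```python
import numpy as np

def horn_schunck_like(fx, fy, ft, lam2=100.0, n_iter=100):
    """Iterate Eqs. (7.58) and (7.59); fx, fy, ft are derivative images of
    equal shape, lam2 plays the role of lambda^2."""
    u = np.zeros_like(fx, dtype=float)
    v = np.zeros_like(fx, dtype=float)

    def local_mean(w):
        # simple 4-neighbor average used as the local smoothing operator
        return 0.25 * (np.roll(w, 1, 0) + np.roll(w, -1, 0) +
                       np.roll(w, 1, 1) + np.roll(w, -1, 1))

    for _ in range(n_iter):
        u_bar, v_bar = local_mean(u), local_mean(v)
        t = (fx * u_bar + fy * v_bar + ft) / (lam2 + fx**2 + fy**2)
        u = u_bar - fx * t           # Eq. (7.58)
        v = v_bar - fy * t           # Eq. (7.59)
    return u, v
```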
Further analysis of the previous Eqs. (7.52) and (7.53) shows that the optical flow in
the region where the brightness gradient is completely zero is actually
undeterminable, while in the region where the brightness gradient changes rapidly,
the resulting error of the optical flow calculation may be large. A common method
for solving the optical flow equation is to consider the smooth condition that the
motion field is generally slow and stable in most parts of the image. In this case, we
can consider minimizing a measure of deviation from smoothness. A commonly
used measure is the integral of the square of the magnitude of the gradient of the
optical flow velocity:
$$e_s = \iint \left[\left(u_x^2 + u_y^2\right) + \left(v_x^2 + v_y^2\right)\right]\mathrm{d}x\,\mathrm{d}y \quad (7.60)$$

Also consider minimizing the error of the optical flow constraint equation:

$$e_c = \iint \left(f_x u + f_y v + f_t\right)^2 \mathrm{d}x\,\mathrm{d}y \quad (7.61)$$
Optical flow will have discontinuities at the edges where objects overlap each other.
To generalize the above optical flow detection method from one region to another, it
is necessary to determine the discontinuity. This brings up a chicken-and-egg
problem. If there is an accurate optical flow estimation, it is easy to find the places
where the optical flow changes rapidly and divide the image into different regions;
on the contrary, if the image can be well divided into different regions, an accurate
estimation of the optical flow can be obtained. The solution to this contradiction is to
incorporate the segmentation of regions into the iterative solution of optical flow.
Specifically, after each iteration, look for places where the optical flow changes
rapidly, and mark these places to avoid the smooth solution obtained in the next
iteration from crossing these discontinuities. In practical applications, the threshold
for determining the degree of optical flow change is generally set high to avoid
prematurely and finely dividing the image, and then the threshold is gradually
reduced as the optical flow estimation becomes better and better.
More generally speaking, the optical flow constraint equation is not only applicable to the continuous region of gray level but also applicable to the region with
abrupt changes in gray level. In other words, one condition for the optical flow
constraint equation to apply is that there can be (finite) abrupt discontinuities in the
image, but the changes around the discontinuities should be uniform.
Let’s see Fig. 7.23a; XY is the image plane, I is the gray-scale axis, and the object
moves along the X direction with velocity (u, v). At t0, the gray scale at point P0 is I0,
and the gray scale at point Pd is Id; at t0 + dt, the gray scale at P0 moves to Pd to form
optical flow. In this way, there is an abrupt gray-scale change between P0 and Pd, and the gray-scale gradient is ∇f = (f_x, f_y). Now look at Fig. 7.23b; if you look at the gray
scale change from the path, because the gray scale at Pd is the gray scale at P0 plus
the gray-scale difference between P0 and Pd, so there is:
Fig. 7.23 Solving the optical flow equation when the gray scale is abruptly changed
$$I_d = \int_{P_0}^{P_d} \nabla f \cdot \mathrm{d}\mathbf{l} + I_0 \quad (7.62)$$
If you look at the gray-scale changes from the time course, because the observer
sees the gray scale changing from Id to I0 at Pd, so there is:
$$I_0 = \int_{t_0}^{t_0+\mathrm{d}t} f_t\,\mathrm{d}t + I_d \quad (7.63)$$
Since the change of gray levels should be the same in these two cases, it can be
solved by combining Eqs. (7.62) and (7.63):
$$\int_{P_0}^{P_d} \nabla f \cdot \mathrm{d}\mathbf{l} = -\int_{t_0}^{t_0+\mathrm{d}t} f_t\,\mathrm{d}t \quad (7.64)$$
Substituting dl = [u v]^T dt and considering that the line integration limits and the time integration limits should correspond, we can get:

$$\nabla f \cdot [u\ \ v]^T = -f_t \quad (7.65)$$

This shows that the optical flow constraint still holds, so the equation can still be solved by the previous method when dealing with such discontinuities.
It can be proved that the optical flow constraint equation is also applicable to the
discontinuous velocity field caused by the transition between the background and the
object under certain conditions, provided that the image has sufficient sampling
density. For example, in order to obtain the proper information from the texture
image sequence, the sampling rate of the space should be smaller than the scale of
the image texture. The sampling distance in time should also be smaller than the
scale of the velocity field change, or even much smaller, so that the displacement is
smaller than the scale of the image texture. Another condition for the optical flow
constraint equation to apply is that the gray-scale change at each point in the image
plane should be entirely due to the motion of a specific pattern in the image and
should not include the effects of changes in reflection properties. This condition can
also be expressed as a change in the position of a mode in the image at different times
produces an optical flow velocity field, but the mode itself does not change.
The previous solution to the optical flow equation only utilizes the first-order
gradient of the image gray levels. There is a view that the optical flow constraint
equation itself already contains the smoothness constraint on the optical flow field,
so in order to solve the optical flow constraint equation, it is necessary to consider the
continuity of the image itself on the gray level (i.e., consider the high-order gradient
of the image gray level) to constrain the gray-level field.
Expand the terms in the optical flow constraint equation with the Taylor series at
(x, y, t), and take the second order to get:
$$\frac{\partial f(x+\mathrm{d}x, y+\mathrm{d}y, t)}{\partial x} = \frac{\partial f(x,y,t)}{\partial x} + \frac{\partial^2 f(x,y,t)}{\partial x^2}\,\mathrm{d}x + \frac{\partial^2 f(x,y,t)}{\partial x\,\partial y}\,\mathrm{d}y \quad (7.66)$$

$$\frac{\partial f(x+\mathrm{d}x, y+\mathrm{d}y, t)}{\partial y} = \frac{\partial f(x,y,t)}{\partial y} + \frac{\partial^2 f(x,y,t)}{\partial y\,\partial x}\,\mathrm{d}x + \frac{\partial^2 f(x,y,t)}{\partial y^2}\,\mathrm{d}y \quad (7.67)$$

$$\frac{\partial f(x+\mathrm{d}x, y+\mathrm{d}y, t)}{\partial t} = \frac{\partial f(x,y,t)}{\partial t} + \frac{\partial^2 f(x,y,t)}{\partial t\,\partial x}\,\mathrm{d}x + \frac{\partial^2 f(x,y,t)}{\partial t\,\partial y}\,\mathrm{d}y \quad (7.68)$$

$$u(x+\mathrm{d}x, y+\mathrm{d}y, t) = u(x,y,t) + u_x(x,y,t)\,\mathrm{d}x + u_y(x,y,t)\,\mathrm{d}y \quad (7.69)$$

$$v(x+\mathrm{d}x, y+\mathrm{d}y, t) = v(x,y,t) + v_x(x,y,t)\,\mathrm{d}x + v_y(x,y,t)\,\mathrm{d}y \quad (7.70)$$
Substituting the above five equations into the optical flow constraint equation, we get:
Because these terms are independent, six equations can be obtained, respectively:
To solve the two unknowns from these three equations, the method of least squares can be used.
When solving the optical flow constraint equation with the help of gradient, it is
assumed that the image is differentiable, that is, the movement of the object between
frame images should be small enough (less than one pixel/frame). If it is too large, the
aforementioned assumption does not hold, and the optical flow constraint equation
cannot be solved accurately. One method that can be taken at this time is to reduce the resolution of the image, which is equivalent to performing low-pass filtering on the image and has the effect of reducing the speed of the optical flow.
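A minimal sketch of this resolution-reduction idea is given below; the 2×2 averaging used as the low-pass filter and the number of pyramid levels are arbitrary choices for illustration, not the specific scheme of any particular method.

```python
import numpy as np

def downsample(img):
    """Halve the resolution by 2x2 averaging (a crude low-pass filter),
    which roughly halves the optical flow speed in pixels/frame."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] +
                   img[0::2, 1::2] + img[1::2, 1::2])

def pyramid(img, levels=3):
    """Coarse-to-fine image pyramid; flow can be estimated at the coarsest
    level first and then refined at finer levels."""
    pyr = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr
```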
The optical flow contains information about the structure of the scene, so the
orientation of the surface can be solved from the optical flow of the object surface
movement. The orientation of each point in the objective world and the surface of the
object can be represented by an orthogonal coordinate system XYZ centered on the
observer. Consider a monocular observer located at the origin of coordinates, and
assume that the observer has a spherical retina, so that the objective world can be
considered to be projected onto a unit image sphere. The image sphere has a
coordinate system consisting of longitude φ and latitude θ. Points in the objective
world can be represented by these two image spherical coordinates plus a distance
r from the origin (see Fig. 7.24).
The transformations from spherical coordinates to Cartesian coordinates and from Cartesian coordinates to spherical coordinates are, respectively:

$$x = r\sin\theta\cos\phi, \qquad y = r\sin\theta\sin\phi, \qquad z = r\cos\theta$$

and

$$r = \sqrt{x^2 + y^2 + z^2} \quad (7.84)$$

$$\theta = \arccos(z/r) \quad (7.85)$$

$$\phi = \arctan(y/x) \quad (7.86)$$

If a scene point moves with velocity (u, v, w) in Cartesian coordinates, the resulting optical flow components in the φ and θ directions are:

$$\delta = \frac{v\cos\phi - u\sin\phi}{r\sin\theta} \quad (7.87)$$

$$\varepsilon = \frac{(u\cos\phi + v\sin\phi)\cos\theta - w\sin\theta}{r} \quad (7.88)$$
The above two equations are general representations for the optical flow in the φ and θ directions. Consider the optical flow calculation in a simple case below. Suppose the scene is stationary and the observer is moving along the Z-axis (in the positive direction) with speed S. At this time, u = 0, v = 0, and w = −S, and substituting them into Eqs. (7.87) and (7.88) gives, respectively:
$$\delta = 0 \quad (7.89)$$

$$\varepsilon = \frac{S\sin\theta}{r} \quad (7.90)$$
They form a simplified optical flow equation and are the basis for solving surface
orientation (and edge detection). According to the solution of the optical flow
equation, it can be judged whether each point in the optical flow field is a boundary
point, a surface point, or a space point. Among them, the type of boundary and the
orientation of the surface can also be determined in the two cases of boundary point
and surface point [9].
Here we only introduce how to obtain the surface orientation with the help of
optical flow. Looking at Fig. 7.25a first, let R be a point on a given surface patch on
the surface of the object, and the monocular observer with focus at O observes the
surface patch along the line of sight OR. Let the normal vector of the surface patch be
N; N can be decomposed into two mutually perpendicular directions: one is in the ZR plane, whose included angle with OR is σ (as shown in Fig. 7.25b), and the other is in a plane perpendicular to the ZR plane (parallel to the XY plane), whose angle with OR is τ (as shown in Fig. 7.25c, where the Z-axis points out of the paper). In Fig. 7.25b, φ is a constant, while in Fig. 7.25c, θ is a constant. In Fig. 7.25b, the ZOR plane constitutes a "depth profile" along the line of sight, while in Fig. 7.25c the "depth profile" is parallel to the XY plane.
Fig. 7.25 Schematic diagram for obtaining surface orientation by means of optical flow

How to determine σ and τ is now discussed. Consider first σ in the ZR plane (see Fig. 7.25b). If the vector angle θ is given a small increment Δθ, the change in the vector radius r is Δr. Drawing an auxiliary segment p through R, it can be seen that p/r = tan(Δθ) ≈ Δθ, on the one hand, and p/Δr = tan σ, on the other hand. Eliminating p from these two relations gives:

$$\cot\sigma = \frac{1}{r}\frac{\Delta r}{\Delta\theta} \quad (7.91)$$
Consider next τ in the plane perpendicular to the ZR plane (see Fig. 7.25c). If the vector angle φ is given a small increment Δφ, the length of the vector radius r changes by Δr. Again drawing the auxiliary segment p, it can be seen that p/r = tan(Δφ) ≈ Δφ, on the one hand, and p/Δr = tan τ, on the other hand. Eliminating p gives:

$$\cot\tau = \frac{1}{r}\frac{\Delta r}{\Delta\phi} \quad (7.92)$$
Further, taking the limits of Eqs. (7.91) and (7.92), respectively, we can get:
$$\cot\sigma = \frac{1}{r}\frac{\partial r}{\partial\theta} \quad (7.93)$$

$$\cot\tau = \frac{1}{r}\frac{\partial r}{\partial\phi} \quad (7.94)$$

From Eq. (7.90), the vector radius can be expressed in terms of the optical flow as:

$$r = \frac{S\sin\theta}{\varepsilon(\phi,\theta)} \quad (7.95)$$

Taking the partial derivatives of Eq. (7.95) with respect to φ and θ gives:

$$\frac{\partial r}{\partial\phi} = -\frac{S\sin\theta}{\varepsilon^2}\,\frac{\partial\varepsilon}{\partial\phi} \quad (7.96)$$

$$\frac{\partial r}{\partial\theta} = S\left(\frac{\cos\theta}{\varepsilon} - \frac{\sin\theta}{\varepsilon^2}\,\frac{\partial\varepsilon}{\partial\theta}\right) \quad (7.97)$$

Substituting these into Eqs. (7.93) and (7.94) yields the surface orientation angles:

$$\sigma = \operatorname{arccot}\left[\cot\theta - \frac{\partial(\ln\varepsilon)}{\partial\theta}\right] \quad (7.98)$$

$$\tau = \operatorname{arccot}\left[-\frac{\partial(\ln\varepsilon)}{\partial\phi}\right] \quad (7.99)$$
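Given the flow component ε sampled on a (φ, θ) grid, Eqs. (7.98) and (7.99) can be evaluated numerically, for example as in the following sketch; the discretization by np.gradient and the choice of arccot branch are implementation assumptions, not prescribed by the text.

```python
import numpy as np

def orientation_from_flow(eps, phi, theta):
    """Evaluate sigma and tau of Eqs. (7.98) and (7.99) on a grid, where
    eps[i, j] is the flow component in the theta direction at
    (theta[i], phi[j]); theta should stay away from 0 and pi."""
    ln_eps = np.log(eps)
    dln_dtheta = np.gradient(ln_eps, theta, axis=0)
    dln_dphi = np.gradient(ln_eps, phi, axis=1)
    cot_theta = 1.0 / np.tan(theta)[:, None]
    sigma = np.arctan2(1.0, cot_theta - dln_dtheta)   # arccot(x) = arctan2(1, x)
    tau = np.arctan2(1.0, -dln_dphi)
    return sigma, tau
```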
Using optical flow to analyze the motion can also obtain the mutual velocities u, v,
and w in the X, Y, and Z directions between the camera and the object in the world
coordinate system. If the coordinates of an object point at t0 = 0 are (X0, Y0, Z0), the
focal length of the optical system is set to 1, and the object moving speed is constant,
then the image coordinates of this point at time t are:
$$(x, y) = \left(\frac{X_0 + ut}{Z_0 + wt},\ \frac{Y_0 + vt}{Z_0 + wt}\right) \quad (7.100)$$
$$\frac{D(t)}{V(t)} = \frac{Z(t)}{w(t)} \quad (7.101)$$
The above equation is the basis for determining the distance between moving
objects. Assuming the motion is toward the camera, the ratio Z/w gives the time it
takes for an object moving at a constant speed w to reach the image plane. Based on
the knowledge of the distance of any point in an image that is moving along the Z-
axis with velocity w, the distance of any other point on that image that is moving
with the same velocity w can be calculated:
$$Z'(t) = \frac{Z(t)\,V(t)\,D'(t)}{D(t)\,V'(t)} \quad (7.102)$$

where Z(t) is the known distance and Z′(t) is the unknown distance.
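As a toy illustration of Eqs. (7.101) and (7.102) (with made-up numbers; D and V here simply stand for the image-plane quantities appearing in those equations):

```python
def time_to_reach(Z, w):
    """Z/w from Eq. (7.101): time for an object at distance Z approaching
    at constant speed w to reach the image plane."""
    return Z / w

def relative_depth(Z_known, D_known, V_known, D_other, V_other):
    """Eq. (7.102): unknown distance Z' of a second point moving with the
    same w, from the known distance Z and the image quantities D, V."""
    return Z_known * V_known * D_other / (D_known * V_other)

print(time_to_reach(10.0, 2.0))                   # 5.0
print(relative_depth(10.0, 1.0, 0.2, 2.0, 0.2))   # 20.0
```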
According to Eq. (7.102), the relationship between the world coordinates X and Y and the image coordinates x and y can also be given by the observation position and velocity.
$$F_t = \frac{M}{N} \quad (7.104)$$
where M is the number of pixels with an intensity value in the range of [1, 254] and
N is the number of pixels in the partial projection image with an intensity value of
255. By applying a simple threshold, Ft, to each contour part image, only high-
quality parts of the body contour are obtained for reconstruction, thereby improving
the quality of the final visual hull.
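One possible reading of Eq. (7.104) is sketched below: F_t is computed per contour part image, and parts are kept when the ratio is small; both the comparison direction and the cutoff value are assumptions made purely for illustration.

```python
import numpy as np

def contour_quality(part_img):
    """F_t = M/N of Eq. (7.104): M = pixels with intensity in [1, 254],
    N = pixels with intensity 255 in the partial projection image."""
    part_img = np.asarray(part_img)
    M = np.count_nonzero((part_img >= 1) & (part_img <= 254))
    N = np.count_nonzero(part_img == 255)
    return M / N if N > 0 else float("inf")

def keep_high_quality(part_images, cutoff=0.1):
    """Keep the parts whose ratio is small (assumed here to be the
    high-quality ones); the cutoff value is an arbitrary placeholder."""
    return [p for p in part_images if contour_quality(p) <= cutoff]
```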
The light source is an important device to realize the photometric stereo technology.
There are many kinds of light sources. A simple light source classification diagram is
shown in Fig. 7.27. Light sources can be divided into two categories: infinite point
light sources and near-field light sources. The premise of photometric stereo is that
the incident light is parallel. In reality, it is often difficult to create a large region of parallel light, so a point light source placed far away (generally, at a distance of about ten times the width of the scene) is used, and its light is approximately regarded as parallel. For a near-field light source, it is difficult to treat the light as parallel because the source is too close. In practice, a light source may also have a certain spatial extent and cannot be regarded as a point light source; it is then called an extended light source. Extended light sources are divided into linear and planar ones according to their shape.
Light source calibration refers to placing auxiliary objects such as calibration
targets to estimate the information of the light source, including the direction and
intensity of the light source. The accuracy of light source information greatly affects
the performance and effect of photometric stereo technology. There are many
methods of light source calibration. Figure 7.28 shows a classification diagram of
light source calibration methods.
(Fig. 7.28 classifies light source calibration methods by the calibration information used: tone/shading information, special reflection properties information, or shadow information; by whether a calibration target is used: no calibration target vs. with calibration target; by the number of calibration targets: single or multiple; and by the type of calibration target: near plane target or sphere with specific reflection properties.)
From the information used for calibration, the tone/shading information, shadow information, or reflection characteristic information of the scene surface is often used (the three kinds of information can also be used in combination). From the perspective of light source properties, the light source information to be estimated includes the intensity, direction, and position, and the number of light sources (single or multiple) is also distinguished. Calibration targets are used to obtain more accurate light source information by exploiting the reflection properties of different targets. Commonly used calibration targets include cubes, diffuse spheres, hollow transparent glass spheres, mirrors, etc.
The light source calibration method also depends on the type of light source. For
example, near-field light sources will cause uneven illumination distribution, and
white paper with Lambertian reflection characteristics can be used as a calibration
target to compensate for the intensity distribution of different light sources [20].
Finally, it should be pointed out that the calibration of the light source usually
requires the selection of a special calibration object and a separate calibration
experiment, which increases the difficulty of the application of photometric stereo
and limits the application of photometric stereo. How to simplify or omit this step by means of light source self-calibration is a very valuable research direction [21]. A recent
work can be found in [22].
The ideal scattering surface and the ideal specular reflecting surface discussed in
Sect. 7.2.2 are rare in practice, and the actual scene surface often has very different
reflection characteristics, which are generally called non-Lambertian surfaces.
For example, a four-light-source photometric stereo technique has been proposed for surfaces in the presence of highlights and shadows; it can detect the highlights of the object and calculate the normal direction of the object surface at the same time [29].
With the help of color photometric stereo technology, the original three gray
scale images can also be replaced with the three channels of the acquired color
image, and then the surface reconstruction can be achieved through a single color
image. This method can avoid the influence of position changes due to time-sharing
and realize fast 3D reconstruction and even real-time 3D reconstruction. Additionally, a work using convolutional neural networks for multispectral photometric stereo can be seen in [30].
Methods based on deep learning have also been introduced into calibrated photometric stereo methods in recent years; see, for example, [41]. Instead of constructing complex reflectivity models, these methods directly learn a mapping from reflectivity observations in a given direction to normal information.
Traditional photometric stereo vision methods mostly assume a simplified reflectivity model (an ideal Lambertian model or a simple reflectivity model). However, in the real world, most scenes have non-Lambertian surfaces (a combination of diffuse and specular reflections), and a particular simple model is only valid for a small subset of materials. At the same time, the calibration of light source information is also a complex and tedious process. Solving this problem requires uncalibrated photometric stereo vision technology [42], that is, it is necessary to directly calculate the normal information in the image only through multiple images under a fixed viewpoint.

Table 7.1 Principles, advantages, and disadvantages of some typical 3D reconstruction methods

Path integral method [31]. Principle: direct integration of the gradient according to Green's formula. Advantages: easy to implement and fast. Disadvantages: error accumulation; greatly affected by data error and noise.

Least squares method. Principle: search for the best fitting surface by minimizing a function, sacrificing local information. Advantages: better overall optimization effect. Disadvantages: loss of local information; large amount of calculation when the amount of data is large.

Fourier basis function method [32]. Principle: approximate the gradient data with basis functions to obtain the best approximation surface. Advantages: good global effect and high computational efficiency. Disadvantages: the derivation is complex and difficult to apply to other basis functions.

Poisson's equation method [33]. Principle: transform the functional problem of minimizing a function into the problem of solving the Poisson equation. Advantages: can be based on Fourier basis functions or extended to sine and cosine functions. Disadvantages: it is necessary to determine which basis function to use for projection according to different boundary conditions.

Variational method [34]. Principle: solve the Poisson equation using an iterative method based on global thinking. Advantages: the iterative process is simple and can solve the overall distortion problem. Disadvantages: long calculation time and accumulation of errors.

Pyramid method [35]. Principle: the subsurface is obtained based on an iterative process in the height space, the sampling interval is continuously reduced, and finally the entire surface is stitched together. Advantages: can ensure global shape optimization and has certain noise immunity. Disadvantages: local detail information is lost; iterative correction is required.

Algebraic method. Principle: by correcting the gradient error through the curl value, the surface is finally reconstructed with the Poisson equation. Advantages: independent of integration paths; local error accumulation can be suppressed. Disadvantages: there is a certain deviation in the reconstruction of local surface detail.

Singular value decomposition method [36]. Principle: obtain a vector that differs from the true normal vector by a transformation matrix through singular value decomposition. Advantages: no need to calibrate the light source. Disadvantages: less computationally efficient; suffers from the generalized bas-relief problem.
The following introduces a method to obtain the normal information of the scene
with the help of a multi-scale aggregation generative adversarial network (GAN)
to realize the uncalibrated photometric stereo vision technology [43].
After being concatenated with the feature map of the corresponding down-sampling part, the feature map is fed into the convolutional layer. Finally, the obtained results are subjected to maximum pooling and normalization to obtain the final result.
Several considerations for the network structure are as follows:
1. Using skip connections to achieve multi-scale feature aggregation: For the features in each image, using multi-scale feature aggregation can achieve better
fusion of local and global features, so that more comprehensive information can
be observed in each image.
2. Aggregate multiple features using the maximum pooling method: Photometric
stereo has multiple inputs, and maximum pooling can naturally capture strong
features in images from different light directions; maximum pooling can easily be
used during training, ignoring inactive features to make the network more robust.
3. Perform L2 normalization on the pooled features: A coarse-grained normal
information map can be obtained.
4. Residual structure is adopted in the multi-scale aggregation network and fine-tuning module: The problem of gradient vanishing can be overcome by using skip
connections [44].
The loss function LG of the generator model consists of two parts: the cosine
similarity loss Lnormal of the normal vector and the adversarial loss Lgen:
Among them, N_{x,y} represents the real normal information at the point (x, y) and N′_{x,y} the predicted normal information. If the real normal information is very close to the predicted normal information, the dot product of N_{x,y} and N′_{x,y} is close to 1 and the L_normal value will be small, and vice versa. The analysis of the second term on the right-hand side of Eq. (7.106) is similar.
The generator adversarial loss Lgen is defined as follows:
Among them, x ~ pg indicates that the input data x conforms to the pg distribution,
and N ~ pr indicates that the true normal information N conforms to the pr
distribution.
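Since the exact loss expressions are given in [43] and are not reproduced here, only a plausible form of the cosine-similarity term is sketched for illustration: the mean of 1 minus the per-pixel dot product between true and predicted unit normals, which is small when the two normal maps agree.

```python
import numpy as np

def normal_cosine_loss(N_true, N_pred):
    """One plausible cosine-similarity loss on normals: mean of
    1 - <N_xy, N'_xy> over all pixels of two H x W x 3 unit normal maps."""
    dots = np.sum(N_true * N_pred, axis=-1)     # per-pixel dot products
    return float(np.mean(1.0 - dots))

N_true = np.zeros((4, 4, 3)); N_true[..., 2] = 1.0   # all normals along Z
print(normal_cosine_loss(N_true, N_true))            # 0.0 for a perfect match
```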
References
1. Lee J H, Kim C S (2022) Single-image depth estimation using relative depths. Journal of Visual Communication and Image Representation 84: 103459. https://doi.org/10.1016/j.jvcir.2022.103459.
2. Pizlo Z, Rosenfeld A (1992) Recognition of planar shapes from perspective images using
contour-based invariants. CVGIP: Image Understanding 56(3): 330-350.
3. Song W, Zhu M F, Zhang M H, et al. (2022). A review of monocular depth estimation
techniques based on deep learning. Journal of Image and Graphics, 27(2): 292-328.
4. Luo H L, Zhou Y F. (2022). Review of monocular depth estimation based on deep learning.
Journal of Image and Graphics, 27(2): 390-403.
5. Swanborn D J B, Stefanoudis P V, Huvenne V A I, et al. (2022) Structure-from-motion photogrammetry demonstrates that fine-scale seascape heterogeneity is essential in shaping mesophotic fish assemblages. Remote Sensing in Ecology and Conservation 8(6): 904-920.
6. Wang S, Wu T H, Wang K P, et al. (2021) 3-D particle surface reconstruction from multiview
2-D images with structure from motion and shape from shading. IEEE Transactions on Industrial Electronics 68(2): 1626-1635.
7. Horn B K P (1986) Robot Vision. MIT Press, USA. Cambridge.
8. Zhang Y-J (2017) Image Engineering, Vol. 3: image understanding. De Gruyter, Germany.
9. Ballard D H, Brown C M (1982) Computer Vision. Prentice-Hall, London.
10. Sonka M, Hlavac V, Boyle R (2008) Image Processing, Analysis, and Machine Vision. 3rd
Ed. Thomson, USA.
11. Zhang Y-J (2017) Image Engineering, Vol. 2: image analysis. De Gruyter, Germany.
12. Krajnik W, Markiewicz L, Sitnik R (2022) sSfS: Segmented shape from silhouette reconstruction of the human body. Sensors 22: 925.
13. Lu E, Cole F, Dekel T, et al. (2021) Omnimatte: Associating objects and their effects in video.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 4505-4513.
14. Lin K, Wang L, Luo K, et al. (2021) Cross-domain complementary learning using pose for multi-person part segmentation. IEEE Transactions on Circuits and Systems for Video Technology 31: 1066-1078.
15. Li P, Xu Y, Wei Y, et al. (2022) Self-correction for human parsing. IEEE Transactions on
Pattern Analysis and Machine Intelligence 44(6): 3260-3271.
16. Xiao B, Wu H, Wei Y (2018) Simple baselines for human pose estimation and tracking.
Proceedings of the European Conference on Computer Vision (ECCV), 8-14.
17. Jertec A, Bojanic D, Bartol K, et al. (2019) On using PointNet architecture for human body
segmentation. Proceedings of the 2019 11th International Symposium on Image and Signal
Processing and Analysis (ISPA), 23-25.
18. Ueshima T, Hotta K, Tokai S (2021) Training PointNet for human point cloud segmentation
with 3D meshes. Proceedings of the Fifteenth International Conference on Quality Control by
Artificial Vision 12-14.
19. Deng X L, He Y B, Zhou J P (2021). Review of three-dimensional reconstruction methods
based on photometric stereo. Modern Computer 27(23): 133-143.
20. Xie W, Song Z, Zhang X (2010) A novel photometric method for real-time 3D reconstruction of
fingerprint. International Symposium on Visual Computing 31-40.
21. Shi B, Matsushita Y, Wei Y, et al. (2010) Self-calibrating photometric stereo. Proceedings of the International Conference on Computer Vision and Pattern Recognition 1118-1125.
22. Abzal A, Saadatseresht M, Varshosaz M, et al. (2020) Development of an automatic map
drawing system for ancient bas-reliefs. Journal of Cultural Heritage 45: 204-214.
23. Phong B T (1975) Illumination for computer generated pictures. Communications of the ACM 18(6): 311-317.
24. Tozza S, Mecca R, Duocastella M, et al. (2016) Direct differential photometric stereo shape
recovery of diffuse and specular surfaces. Journal of Mathematical Imaging and Vision 56(1):
57-76.
25. Torrance K E, Sparrow E M (1967) Theory for off-specular reflection from roughened surfaces. Journal of the Optical Society of America 57(9): 1105-1114.
26. Cook R L, Torrance K E (1982) A reflectance model for computer graphics. ACM Transactions
on Graphics 1(1): 7-24.
27. Ward G J (1992) Measuring and modeling anisotropic reflection. Proceedings of the 19th
Annual Conference on Computer Graphics and Interactive Techniques 265-272.
28. Shih Y C, Krishnan D, Durand F, et al. (2015) Reflection removal using ghosting cues. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 3193-3201.
29. Barsky S, Petrou M (2003) The 4-source photometric stereo technique for three-dimensional
surfaces in the presence of highlights and shadows. IEEE Transactions on Pattern Analysis and
Machine Intelligence 25(10): 1239-1252.
30. Lu L, Qi L, Luo Y, et al. (2018) Three-dimensional reconstruction from single image base on
combination of CNN and multi-spectral photometric stereo. Sensors 18(3): 764.
31. Horn B K P (1990) Height and gradient from shading. International Journal of Computer Vision
5(1): 37-75.
32. Frankot R T, Chellappa R (1988) A method for enforcing integrability in shape from shading algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 10(4): 439-451.
33. Simchony T, Chellappa R, Shao M (1990) Direct analytical methods for solving Poisson
equations in computer vision problems. IEEE Transactions on Pattern Analysis and Machine
Intelligence 12(5): 435-446.
34. Lv D H, Zhang D, Sun J A (2010) Simulation and evaluation of 3D reconstruction algorithm
based on photometric stereo technique. Computer Engineering and Design 31(16): 3635-3639.
35. Chen Y F, Tan W J, Wang H T, et al. (2005) Photometric stereo 3D reconstruction and
application. Journal of Computer-aided Design and Computer Graphics (11): 28-34.
As pointed out in Sect. 7.1, this chapter introduces the method of scene restoration
based on monocular single image. According to the introduction and discussion in
Sect. 2.2.2, it is actually an ill-conditioned problem to use only monocular single
image for scene restoration. This is because when the 3D scene is projected onto the
2D image, the depth information is lost. However, from the practice of human visual
system, especially the ability of spatial perception (see [1]), in many cases, many
depth clues are still retained in the image, so it is possible to recover the scene from it
under the condition of certain constraints or prior knowledge [2-4].
The sections of this chapter will be arranged as follows.
Section 8.1 discusses how to reconstruct the shape of the scene surface according to
the image tone (light and dark information) generated by the spatial change of the
brightness of the scene surface during imaging.
Section 8.2 introduces three technical principles for restoring the surface orientation according to the change (distortion) of the surface texture elements of the scene after projection.
Section 8.3 describes the relationship between the change of the camera focal length and the depth of the scene, caused by focusing on scenes at different distances during imaging, so that the distance of the corresponding scene can be determined according to the focal length at which it is imaged sharply.
Section 8.4 introduces a method to calculate the geometry and pose of a 3D scene by using the coordinates of three points on an image when the 3D scene model and the camera focal length are known.
Section 8.5 further introduces non-Lambertian illumination models and the corresponding new techniques for recovering shape from shading under relaxed conditions, that is, for mixed (diffuse and specular) surfaces imaged under perspective projection.
When the object in the scene is illuminated by light, the brightness will be different
due to the different orientations of the various parts of the surface. This spatial
change (light and dark change) of the brightness will appear as different shadings or
tones on the image (also often called different shadows) after imaging. The shape
information of the object can be obtained according to the distribution and change of
the tones, which is called shape from shading.
In the following, the relationship between the shades on the image and the surface
shapes of the object in the scene is first discussed, and then how to represent the
change of orientation is introduced.
According to Fig. 8.1, if the incident light intensity on the 3D object surface S is L, and the reflection coefficient ρ is a constant, then the reflection intensity along the normal N is:
If the light source is behind the observer and emits parallel rays, then cos i = cos e. Assuming that the line of sight intersects the imaging plane XY perpendicularly, and the object has a Lambertian scattering surface, that is, the surface reflection intensity does not change with the observation position, the observed light intensity can be written as:
In order to establish the relationship between the surface orientation and the image brightness, the gradient coordinates PQ are also placed on the XY plane, and the normal is taken along the direction away from the observer; then, according to N = [p q −1]^T and V = [0 0 −1]^T, it can be obtained:

cos e = cos i = ([p q −1]^T · [0 0 −1]^T) / (|[p q −1]^T| · |[0 0 −1]^T|) = 1/√(p² + q² + 1)   (8.3)
Substituting Eq. (8.3) into Eq. (8.1), the observed image gray scale is:
Now consider the general case where the ray is not incident at the angle i = e. Let the incident light vector through the surface element be L = [p_i q_i −1]^T; since cos i is the cosine of the angle between N and L, there is:

cos i = ([p q −1]^T · [p_i q_i −1]^T) / (|[p q −1]^T| · |[p_i q_i −1]^T|) = (p p_i + q q_i + 1) / (√(p² + q² + 1) √(p_i² + q_i² + 1))   (8.5)
Substituting Eq. (8.5) into Eq. (8.1), the observed image gray scale when incident
at any angle is:
Now consider the gray-scale changes of the image caused by changes in the orientation of the surface elements. A 3D surface can be represented as z = f(x, y), and the surface normal on it can be represented as N = [p q −1]^T. It can be seen that, as far as its orientation is concerned, a surface element in 3D space corresponds to a single point G(p, q) in the 2D gradient space, as shown in Fig. 8.2.
Using this gradient space approach to study 3D surfaces can act as a dimensionality
reduction (to 2D), but the representation of the gradient space does not determine the
position of the 3D surface in 3D coordinates. In other words, a point in the gradient
space represents all surface elements facing the same orientation, but the spatial
locations of these surface elements can vary.
Structures formed by plane intersections can be analyzed and interpreted with the
help of gradient space methods. For example, the intersection of multiple planes may
form convex or concave structures. To judge whether it is a convex structure or a
concave structure, the gradient information can be used. Let’s first look at the
situation where the two planes S1 and S2 intersect to form the intersection line l,as
shown in Fig. 8.3 (where the gradient coordinate PQ coincides with the spatial
coordinate XY). Here G1 and G2 represent the gradient space points corresponding to
the normal lines of the two planes, respectively, and the connection between them is
perpendicular to the projection l' of l.
If S and G of the same face have the same sign (on the same side of the projection
l' of l), it means that the two faces form a convex structure (see Fig. 8.4a). If the S and
Fig. 8.4 Two spatial planes form a convex structure and a concave structure
G of the same face have different signs, it means that the two faces form a concave
structure (see Fig. 8.4b).
Further consider the case where three planes A, B, and C intersect, and the
intersection lines are l1, l2, and l3, respectively (see Fig. 8.5a). If the faces on both
sides of each intersection line and the corresponding gradient point have the same
sign (the faces are AABBCC in turn clockwise), it means that the three faces form a
convex structure, as shown in Fig. 8.5b. If the faces on both sides of each intersection
and the corresponding gradient points have different signs (the faces are CBACBA in
turn clockwise), it means that the three faces form a concave structure, as shown in
Fig. 8.5c.
Now go back to Eq. (8.4) and rewrite it as:

p² + q² = [L(x, y)ρ/E(x, y)]² − 1 = 1/K² − 1   (8.8)

where K represents the relative reflection intensity observed by the observer. Equation (8.8) corresponds to the equation of a series of concentric circles in the PQ plane, and each circle represents the locus of orientations of surface elements observed with the same gray level. At i = e, the reflection map consists of concentric circles. For
the general case of i ≠ e, the reflection map consists of a series of ellipses and
hyperbolas.
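To make the reflection map concrete, the following short sketch (a minimal illustration, not from the book) evaluates the Lambertian cosine factor of Eq. (8.5) over a grid of gradient-space points (p, q); its iso-value contours are exactly the loci discussed above, circles for i = e and families of ellipses/hyperbolas otherwise. The light-source gradient (ps, qs) and the grid range are assumed example values.

```python
import numpy as np

def lambertian_reflectance_map(p, q, ps, qs):
    """R(p, q) for a Lambertian surface lit from gradient direction (ps, qs).

    R is the cosine of the angle between the surface normal [p, q, -1] and the
    light direction [ps, qs, -1], clipped at zero for self-shadowed orientations."""
    num = p * ps + q * qs + 1.0
    den = np.sqrt(p**2 + q**2 + 1.0) * np.sqrt(ps**2 + qs**2 + 1.0)
    return np.clip(num / den, 0.0, None)

# Assumed example: light from gradient direction (ps, qs) = (0.5, 0.3).
p, q = np.meshgrid(np.linspace(-2, 2, 201), np.linspace(-2, 2, 201))
R = lambertian_reflectance_map(p, q, 0.5, 0.3)
# Iso-brightness loci: gradient-space points with (almost) the same R value.
print(np.sum(np.abs(R - 0.9) < 0.005), "grid points lie near the R = 0.9 contour")
```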
An example application of a reflection map is now given. Suppose the observer
can see three planes A, B, and C, which form the plane intersection angle as shown in
Fig. 8.6a, but the actual inclination of each plane is unknown. Using the reflection
map, the angle between the three planes can be determined. Assuming that L and V
are in the same direction, it has (the relative reflection intensity can be measured
from the image) KA = 0.707, KB = 0.807, and KC = 0.577. According to
the characteristic that the line between G( p, q) of the two faces is perpendicular to
the intersection of the two faces, the triangle shown in Fig. 8.6b can be obtained (i.e.,
the condition that the orientation of the three planes satisfies). Now find GA, GB, and
GC on the reflection map shown in Fig. 8.6c. Substitute each value of K into Eq. (8.8)
to obtain the following two sets of solutions:
The first set of solutions corresponds to the small triangles in Fig. 8.6c, and the
second set of solutions corresponds to the large triangles in Fig. 8.6c. Both sets of
solutions satisfy the condition of relative reflection intensity, so there are two
possible combinations of the orientations of the three planes, corresponding to the
two structures of the convex and concave points of intersection between the three
intersecting lines.
Since the image brightness constraint equation relates the gray level of the pixel with
the orientation, the orientation ( p, q) at that location can be obtained from the gray
level L(x, y) of the pixel at (x, y) in the image. But measuring the brightness of a
single point on an image can only provide one constraint, while the orientation of the
surface has two degrees of freedom. In other words, suppose that the visible surface of the object in the image consists of N pixels, each with a gray value L(x, y); solving Eq. (8.7) means obtaining the (p, q) value at each pixel position. Because only N equations can be formed from the image brightness equation for the N pixels, while there are 2N unknowns (two gradient values to be solved for each gray value), this is an ill-conditioned problem and a unique solution cannot be obtained. It is generally necessary to solve this ill-conditioned
problem by adding additional conditions to establish additional equations. In other
words, the surface orientation cannot be recovered from the image luminance
equation alone without additional information.
A simple way to consider additional information is to exploit constraints in
monocular images. The main ones that can be considered include uniqueness,
continuity (surface, shape), compatibility (symmetry, epipolar lines), etc. In practical
applications, there are many factors that affect brightness, so it is only possible to
recover the shape of an object well from shadow tones if the environment is highly
controlled.
In practice, people often estimate the shape of each part of the human face by just
observing a flat picture. This indicates that the picture contains sufficient information
or that people implicitly introduce additional assumptions based on empirical
knowledge while observing. In fact, many real object surfaces are smooth, that is, continuous in depth, and their partial derivatives are also continuous. The more general case is that the object has a piecewise-continuous surface, with discontinuities only at the edges. The above information provides a strong constraint. For the two adjacent
surface elements on the surface, their orientations are related to a certain extent, and
together they should give a continuous smooth surface. It can be seen that the
method of macroscopic smoothness constraint can be used to provide additional
information to help solve the image brightness constraint equation. The following
three cases are introduced from simple to complex.
where a and b are constants, and the reflection map is shown in Fig. 8.7. The
contours (isolines) of the gradient space in the figure are parallel lines.
In Eq. (8.11), f is a strictly monotonic function (see Fig. 8.8), and its inverse function f⁻¹ exists. From the image brightness equation, we know:
Fig. 8.8 s = ap + bq can be recovered from E(x, y)
Now choose a specific direction θ0 (see Fig. 8.7), with tan θ0 = b/a; that is:

m(θ0) = (ap + bq)/√(a² + b²) = (1/√(a² + b²)) f⁻¹[E(x, y)]   (8.15)
Starting from a specific image point, take a small step of size δs; the change of z is then δz = m δs, that is:
Fig. 8.9 Restoring a surface from parallel surface profiles
First find the solution at a point (x0, y0, z0) on the surface, and then integrate the above differential equation to get:

z(s) = z0 + (1/√(a² + b²)) ∫ f⁻¹[E(x, y)] ds   (8.18)
In this way a surface profile is obtained along a line in the direction given above
(one of the parallel lines in Fig. 8.9). When the reflection map is a function of a linear
combination of gradient elements, the surface profiles are parallel straight lines. The
surface can be recovered by integrating along these lines as long as the initial height
z0(t) is given. Of course, the integrals are calculated using numerical algorithms in
practice.
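As a small numerical illustration of Eq. (8.18) (a sketch only; the linear reflectance map, the coefficients a and b, the synthetic brightness samples, and the starting height z0 are all assumed here, with f taken as the identity so that f⁻¹ is trivial), the height along one profile can be accumulated with a simple Riemann sum:

```python
import numpy as np

def integrate_profile(E_along_line, a, b, z0, ds, f_inverse):
    """Recover heights along one characteristic line for R(p, q) = f(a*p + b*q).

    E_along_line: brightness sampled along the line with step ds.
    f_inverse:    inverse of the (strictly monotonic) reflectance function f.
    Implements z(s) = z0 + (1/sqrt(a^2 + b^2)) * integral of f^{-1}[E] ds."""
    slope = f_inverse(np.asarray(E_along_line, dtype=float))
    return z0 + np.cumsum(slope) * ds / np.sqrt(a**2 + b**2)

# Assumed toy data: constant brightness along the profile, f equal to the identity.
E_line = np.full(100, 0.4)
z = integrate_profile(E_line, a=1.0, b=1.0, z0=0.0, ds=0.1, f_inverse=lambda E: E)
print(z[-1])   # height gained after 100 steps along this profile
```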
It should be noted that if you want to know the absolute distance, you need to
know the z0 value at a certain point, but the (surface) shape can be recovered without
this absolute distance. In addition, the absolute distance cannot be determined only
by the integral constant z0, because z0 itself does not affect the shade, and only the
change of the depth can affect the shade.
Now consider a more general case. If the distribution of the light source is
rotationally symmetric to the observer, then the reflection map is also rotationally
symmetric. For example, when the observer looks at the hemispherical sky from the
bottom up, the resulting reflection map is rotationally symmetric; when the point
light source is at the same position as the observer, the resulting reflection map is
also rotationally symmetric. In these cases, there are:
Now suppose that the function f is strictly monotone and differentiable, and its inverse function is f⁻¹; then, according to the image brightness equation:
If the angle between the direction of the fastest ascent of the surface and the x-axis is θs, where tan θs = q/p, then:
According to Eq. (8.13), the slope in the direction of steepest ascent is:
In this case, if you know the brightness of the surface, you can know its slope, but you don't know the direction of the fastest rise, that is, you don't know the respective values of p and q. Now suppose the direction of steepest ascent is given by (p, q); if a small step of length δs is taken in the direction of steepest ascent, the resulting changes in x and y should be:
To simplify these equations, the step size can be taken as √(p² + q²) δs, so that δx = p δs and δy = q δs. Differentiating the image brightness equation E(x, y) = f(p² + q²) with respect to x and y then gives:

E_x = 2f′(p ∂²z/∂x² + q ∂²z/∂x∂y),   E_y = 2f′(p ∂²z/∂y∂x + q ∂²z/∂y²)   (8.26)

where f′(r) is the derivative of f(r) with respect to its unique variable r.
Now let's determine the changes δp and δq due to taking the step (δx, δy) in the image plane. By differentiating p and q, we get:

δp = [E_x/(2f′)] δs,   δq = [E_y/(2f′)] δs   (8.30)
In this way, in the limit case of δs → 0, the following set of five differential equations is obtained (the derivatives are all taken with respect to s):

ẋ = p,   ẏ = q,   ż = p² + q²,   ṗ = E_x/(2f′),   q̇ = E_y/(2f′)   (8.31)
Given initial values, the above five ordinary differential equations can be solved
numerically to obtain a curve on the object surface. The curve thus obtained is called
the characteristic curve, and in this case it is the steepest ascent curve. This type of
curve is perpendicular to the contour line. Note that when R(p, q) is a linear function
of p and q, the characteristic curve is parallel to the object surface.
In addition, another set of equations can be obtained by differentiating ẋ = p and ẏ = q in Eq. (8.31) with respect to s again:
Since both Ex and Ey are measurements of image brightness, the above equations
need to be solved numerically.
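The five ordinary differential equations in Eq. (8.31) can be integrated with a simple Euler scheme. The sketch below is only illustrative: it assumes a synthetic brightness image whose gradients Ex and Ey can be sampled as functions, a rotationally symmetric reflectance f with known derivative f′, and an initial point on the surface (all made-up values).

```python
import numpy as np

def characteristic_curve(Ex, Ey, f_prime, x0, y0, z0, p0, q0, ds=0.01, steps=200):
    """Euler integration of Eq. (8.31):
       x' = p, y' = q, z' = p^2 + q^2, p' = Ex/(2 f'), q' = Ey/(2 f').
    Ex(x, y), Ey(x, y): image brightness derivatives (callables).
    f_prime(r):         derivative of the reflectance f at r = p^2 + q^2."""
    x, y, z, p, q = x0, y0, z0, p0, q0
    curve = [(x, y, z)]
    for _ in range(steps):
        r = p * p + q * q
        fp = f_prime(r)
        x, y, z = x + p * ds, y + q * ds, z + r * ds
        p, q = p + Ex(x, y) / (2 * fp) * ds, q + Ey(x, y) / (2 * fp) * ds
        curve.append((x, y, z))
    return np.array(curve)

# Assumed toy surface z = (x^2 + y^2)/2, so E = 1 - 0.1*(x^2 + y^2) and f'(r) = -0.1.
curve = characteristic_curve(lambda x, y: -0.2 * x, lambda x, y: -0.2 * y,
                             lambda r: -0.1, x0=0.1, y0=0.0, z0=0.0, p0=0.1, q0=0.0)
print(curve[-1])   # end point of the steepest ascent (characteristic) curve
```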
In general, the surface of the object is relatively smooth (although there are discontinuities between objects); this smoothness condition can be used as an additional
constraint. If the surface of the object is considered to be smooth (within the contour
of the object), the following two equations hold:
(∇p)² = (∂p/∂x)² + (∂p/∂y)² = 0   (8.33)

(∇q)² = (∂q/∂x)² + (∂q/∂y)² = 0   (8.34)
When combined with the image brightness constraint equation, solving the
surface orientation problem can be transformed into a problem of minimizing a
total error as follows:
The above equation can be interpreted as follows: the orientation distribution of the surface elements of the object is sought such that the weighted sum of the overall gray-level error and the overall smoothness error is minimized (see Sect. 7.3.3). Let p̄ and q̄ represent the mean values of p and q in their respective neighborhoods, take the derivatives of the total error with respect to p and q and set them to zero, and then substitute ∇p = p − p̄ and ∇q = q − q̄ into the result; we get:

p(x, y) = p̄(x, y) + (1/λ)[E(x, y) − R(p, q)] ∂R/∂p   (8.36)

q(x, y) = q̄(x, y) + (1/λ)[E(x, y) − R(p, q)] ∂R/∂q   (8.37)
The equations for iteratively solving the above two equations are as follows (the
boundary point values can be used as the initial values):
It should be noted here that smoothness does not hold across the object outline, between its inside and outside, where there are jumps.
The flowchart for solving Eqs. (8.38) and (8.39) is shown in Fig. 8.10, and its basic
framework can also be used to solve the relaxed iterative Eqs. (7.58) and (7.59) of
the optical flow equations.
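A minimal sketch of this relaxation scheme is given below. It is illustrative only: a Lambertian reflectance map with light gradient (ps, qs), a synthetic constant image E, a hand-picked weight lam, numerical derivatives of R, and simple 4-neighbour averaging for p̄ and q̄ are all assumed, and boundary handling is left out for brevity.

```python
import numpy as np

def lambertian_R(p, q, ps, qs):
    return (p * ps + q * qs + 1) / (np.sqrt(p**2 + q**2 + 1) * np.sqrt(ps**2 + qs**2 + 1))

def sfs_relaxation(E, ps, qs, lam=100.0, iters=200, eps=1e-4):
    """Iteratively update (p, q) so that R(p, q) matches E while staying smooth,
    following the structure p <- p_bar + (1/lam)*[E - R]*dR/dp (and similarly for q)."""
    p = np.zeros_like(E)
    q = np.zeros_like(E)
    for _ in range(iters):
        # 4-neighbour means (p_bar, q_bar); edges handled crudely by wrapping.
        p_bar = 0.25 * (np.roll(p, 1, 0) + np.roll(p, -1, 0) + np.roll(p, 1, 1) + np.roll(p, -1, 1))
        q_bar = 0.25 * (np.roll(q, 1, 0) + np.roll(q, -1, 0) + np.roll(q, 1, 1) + np.roll(q, -1, 1))
        # Numerical partial derivatives of R evaluated at (p_bar, q_bar).
        R0 = lambertian_R(p_bar, q_bar, ps, qs)
        dRdp = (lambertian_R(p_bar + eps, q_bar, ps, qs) - R0) / eps
        dRdq = (lambertian_R(p_bar, q_bar + eps, ps, qs) - R0) / eps
        p = p_bar + (E - R0) * dRdp / lam
        q = q_bar + (E - R0) * dRdq / lam
    return p, q

# Assumed synthetic test: a constant image of a nearly frontally lit plane.
E = np.full((64, 64), 0.95)
p, q = sfs_relaxation(E, ps=0.2, qs=0.1)
print(p.mean(), q.mean())
```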
Finally, two sets of examples of shape recovery from shades are given, as shown
in Fig. 8.11. Figure 8.11a is an image of a sphere, and Fig. 8.11b is a needle map of
the spherical surface obtained from Fig. 8.11a by using the shade information (the
short line in the figure indicates the normal direction of the place); Fig. 8.11c is
another image of a sphere, and Fig. 8.11d is the surface orientation needle map
obtained from Fig. 8.11c using the shade information. The direction of the light
source is relatively close to the line-of-sight direction in the group of Fig. 8.11a and
b, so the orientation of each point can basically be determined for the entire visible
surface. In Fig. 8.11c and d, the angle between the direction of the light source and
the direction of the line of sight is relatively large, so it is impossible to determine the
direction of the visible surface that is not illuminated by the light.
When a person looks at a surface covered with texture, the tilt of the surface can be
observed with only one eye, because the texture of the surface will appear distorted
due to the tilt, and from the prior knowledge, one can obtain the information from the
distortion which direction the surface is facing. The role of texture in restoring
surface orientation was elucidated as early as 1950 [5]. This type of approach to
estimating surface orientation based on observed texture distortions is described
below, which is the problem of shape from texture.
The projection of the above two endpoints W1 and W2 can be represented, with the help of homogeneous coordinates (Sect. 2.2.1), as PW1 = [kX1 kY1 kZ1 q1]^T and PW2 = [kX2 kY2 kZ2 q2]^T, where q1 = k(λ − Z1)/λ and q2 = k(λ − Z2)/λ. The point on the straight line between the original W1 and W2 after projection can be represented as (0 < s < 1):
P[sW1 + (1 − s)W2] = s[kX1  kY1  kZ1  q1]^T + (1 − s)[kX2  kY2  kZ2  q2]^T   (8.41)
In other words, the image plane coordinates of all points on this space line can be
obtained by dividing the first three terms by the fourth term of the homogeneous
coordinates, which can be represented as (0 < s < 1):
w = [x  y]^T = [ (sX1 + (1 − s)X2)/(sq1 + (1 − s)q2)   (sY1 + (1 − s)Y2)/(sq1 + (1 − s)q2) ]^T   (8.42)
The above is the projection transformation result when the space point is represented by s. On the other hand, in the image plane, we have w1 = [λX1/(λ − Z1)  λY1/(λ − Z1)]^T and w2 = [λX2/(λ − Z2)  λY2/(λ − Z2)]^T, and the points on the line connecting them can be represented as (0 < t < 1):
t w1 + (1 − t)w2 = t[λX1/(λ − Z1)  λY1/(λ − Z1)]^T + (1 − t)[λX2/(λ − Z2)  λY2/(λ − Z2)]^T   (8.43)
So the coordinates of w1 and w2 and of the points on the line between them on the image plane (parameterized by t) are (0 < t < 1):

w = [x  y]^T = [ tλX1/(λ − Z1) + (1 − t)λX2/(λ − Z2)   tλY1/(λ − Z1) + (1 − t)λY2/(λ − Z2) ]^T   (8.44)
Since the projection result represented by s and the image point represented by t are the same point, Eq. (8.42) and Eq. (8.44) must be equal, from which it can be solved that:

s = tq2/[tq2 + (1 − t)q1]   (8.45)

t = sq1/[sq1 + (1 − s)q2]   (8.46)
It can be seen from the above two equations that s and t have a one-to-one relationship: in 3D space, the point represented by s corresponds to one and only one point represented by t in the 2D image plane. All the space points represented by
s are connected into a straight line, and all the image points represented by t are also
connected in a straight line. After a straight line in the visible 3D space is projected
onto the 2D image plane, as long as it is not projected vertically, the result is still a
straight line (but the length can be changed). In the case of vertical projection, the
projection result is just a point (this is a special case). Its inverse proposition is also
true, that is, a straight line on the 2D image plane must be produced by the projection
of a straight line in 3D space (in special cases, it can also be produced by a plane
projection).
Next, consider the distortion of parallel lines, because parallelism is a line-to-line relationship that is characteristic of linear systems. In 3D space, a point (X, Y, Z) on a line can be represented as:

X = X0 + ka,   Y = Y0 + kb,   Z = Z0 + kc   (8.47)
Among them, (X0, Y0, Z0) is the starting point of the straight line; (a, b, c) is the
direction cosine of the straight line; k is an arbitrary coefficient.
For a set of parallel lines, their (a, b, c) are all the same, but (X0, Y0, Z0) are
different. The distance between parallel lines is determined by their (X0, Y0, Z0)
differences. Substitute Eq. (8.47) into Eqs. (2.27) and (2.28) to get:
(8.49)
When the straight line extends infinitely to both ends, k → ±∞, and the above two equations simplify to:

x_i = λ(a cos γ + b sin γ)/(−a sin α sin γ + b sin α cos γ − c cos α)   (8.50)

y_i = λ(−a sin γ cos α + b cos α cos γ + c sin α)/(−a sin α sin γ + b sin α cos γ − c cos α)   (8.51)
It can be seen that the projected trajectory of parallel lines is only related to (a, b,
c) but not to (X0, Y0, Z0). In other words, parallel lines with the same (a, b, c) will
meet at a point after extending infinitely. This point can be in the image plane or
outside the image plane, so it is also called vanishing point or imaginary point. The
calculation of the vanishing point will be introduced in Sect. 8.2.3.
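Under the simple perspective model used elsewhere in this chapter (x = λX/(λ − Z), y = λY/(λ − Z), with the camera axes aligned with the world axes), the vanishing point of a direction (a, b, c) follows by letting k → ∞ in Eq. (8.47). The sketch below is only illustrative; the focal length and the two parallel lines are assumed values, and it checks the limit result by projecting a very distant point of each line.

```python
import numpy as np

lam = 1.0  # assumed focal length

def project(P):
    X, Y, Z = P
    return np.array([lam * X / (lam - Z), lam * Y / (lam - Z)])

def vanishing_point(a, b, c):
    # Limit of project((X0, Y0, Z0) + k*(a, b, c)) as k -> infinity (needs c != 0).
    return np.array([-lam * a / c, -lam * b / c])

# Two parallel 3D lines: same direction, different starting points.
d = np.array([1.0, 2.0, 3.0])
far = 1e7                                     # a "large k" to approximate infinity
p1 = project(np.array([0.0, 0.0, -5.0]) + far * d)
p2 = project(np.array([4.0, -1.0, -9.0]) + far * d)
print(vanishing_point(*d), p1, p2)            # all three are (nearly) the same point
```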
Using the texture on the surface of an object can help determine the orientation of the
surface and therefore restore the shape of the surface. The description of texture here
is mainly based on the idea of structural method (e.g., see [6]): Complex texture is
composed of some simple texture primitives (texels) that are repeatedly arranged and
combined in a regular form. In other words, texels can be viewed as visual primitives
with repetition and invariance in a region. Here, repetition means that these primitives appear repeatedly in different positions and directions. Of course, this repetition
is only possible under a certain resolution (the number of texels in a given visual
range). Invariance means that pixels that make up the same primitive have basically
the same characteristics, which may be related only to the gray level or may also
depend on other properties such as their shape.
Using the texture of the object surface to determine its orientation should consider
the influence of the imaging process, which is related to the relationship between the
scene texture and the image texture. In the process of acquiring the image, the texture
structure on the original scene may change on the image (producing a gradient
change in both size and direction). This change may be different with the orientation
of the surface on which the texture is located, so it brings the 3D information about
the orientation of the surface of the object. Note that this does not mean that the
surface texture itself has 3D information but that the changes in the texture during the
imaging process have 3D information. The changes of texture can be mainly divided
into three categories (it is assumed that the texture is limited to a planar surface); see
the schematic diagram in Fig. 8.12. Commonly used information recovery methods
can also be divided into the following three categories:
(1) Using the change of size of texture elements
In perspective projection, nearby objects appear large and distant ones appear small, so texture elements at different positions will undergo different changes in size after projection. This is
evident when looking in the direction in which the floor or tile is laid. According to
the maximum value of the texel projected size change rate, the orientation of the
plane where the texel is located can be determined (see Fig. 8.12a). The direction of
this maximum value is also the direction of the texture gradient. Assuming that the
image plane coincides with the paper surface and the line of sight comes out of the
paper, the direction of the texture gradient depends on the rotation angle of the texel
around the camera line of sight, and the value of the texture gradient gives the
degree of inclination of the texel relative to the line of sight. Therefore, with the help
of the geometric information placed by the camera, the orientation of the texture
element and the plane where it is located can be determined.
Figure 8.13 presents two pictures to illustrate that changes in texel size can give
clues to the depth of the scene. Figure 8.13a has many petals in the front (they are
equivalent to texels), and the petal size gradually decreases from front to back
(from near to far). This texel size change gives a sense of depth to the scene. The
building in Fig. 8.13b has many columns and windows (which are equivalent to
regular texels), and their size changes also give a sense of depth to the scene and
easily help the viewer to make the judgment that the corners of the building are
farthest.
It should be noted that the regular texture of the 3D scene surface will generate
texture gradients in the 2D image, but in turn the texture gradient in the 2D image
does not necessarily come from the regular texture of the 3D scene surface.
(2) Using the change of shape of texture elements
The shape of the texel on the surface of the object may change to a certain extent
after the perspective projection and orthogonal projection imaging. If the original
shape of the texel is known, the surface orientation can also be calculated from the
result of the change in the shape of the texel. The orientation of the plane is
determined by two angles (the angle of rotation relative to the camera axis and the
angle of inclination relative to the line of sight). For a given original texel, these two
angles can be determined according to the change results after imaging. For example, on a plane, a texture composed of circles will become ellipses on an inclined
plane (see Fig. 8.12b), where the orientation of the major axis of the ellipse
determines the angle of rotation relative to the camera axis, and the ratio of lengths
of the major and minor axes reflects the angle of inclination relative to the line of
sight. This ratio is also called the aspect ratio, and its calculation process is
described below. Let the equation of the plane on which the circular texture primitive
resides be:
ax + by + cz + d = 0 (8:52)
The circle that constitutes the texture can be regarded as the intersection line
between the plane and the sphere (the intersection line between the plane and the
sphere is always a circle, but when the line of sight is not perpendicular to the plane,
the deformation causes the observed intersection lines to always be elliptical); here
the spherical equation is set as:
x2 + y2 + z2 = r2 (8:53)
Combining the above two equations can provide the solution (equivalent to
projecting the sphere onto the plane):
From the above equation, the coordinates of the center point of the ellipse can be
obtained, and the long and short semiaxes of the ellipse can be determined, so that
the rotation angle and inclination angle can be calculated.
Another method to judge the deformation of circular texture is to calculate
directly the long and short semiaxes of different ellipses, respectively. Refer to
Fig. 8.14 (where the world coordinates coincide with the camera coordinates); the
included angle between the plane of the circular texture primitive and the Y-axis
Fig. 8.14 The position of circular texture primitive plane in coordinate system
(also the included angle between the texture plane and the image plane) is α. At this
time, not only the circular texture primitive becomes an ellipse, but also the density
of the upper primitive is greater than that of the middle, forming a density gradient.
In addition, the appearance ratio of each ellipse, that is, the length ratio of the short
half axis to the long half axis, is not constant, forming an appearance ratio gradient.
At this time, both the size and shape of texture elements change.
If the diameter of the original circle is D, then for the circle at the center of the scene, the major axis of the corresponding ellipse in the image is:

d_major(0, 0) = λD/Z   (8.56)

where λ is the focal length of the camera and Z is the object distance. At this time, the appearance ratio (the ratio of the minor to the major axis) is the cosine of the inclination angle, that is:

d_minor(0, 0)/d_major(0, 0) = cos α   (8.57)
Now consider a primitive in the scene that is not on the optical axis of the camera (such as the light-colored ellipse in Fig. 8.14). If the Y coordinate of the primitive is y, and the angle between its line to the origin and the Z-axis is Ω, then [7]:
At this time, the appearance ratio is cos α(1 − tan Ω tan α), which decreases as Ω increases, forming an appearance-ratio gradient.
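A small numerical sketch of this shape-change cue is given below (the measured axis lengths and angles are made-up values; the off-axis correction simply evaluates the appearance ratio cos α(1 − tan Ω tan α) quoted above):

```python
import numpy as np

def tilt_from_aspect_ratio(d_minor, d_major):
    """On the optical axis the appearance ratio of a projected circle is cos(alpha),
    so the tilt angle of the texture plane is alpha = arccos(d_minor / d_major)."""
    return np.degrees(np.arccos(d_minor / d_major))

def appearance_ratio_off_axis(alpha_deg, omega_deg):
    """Appearance ratio of a circle whose line of sight makes angle omega with the
    optical axis: cos(alpha) * (1 - tan(omega) * tan(alpha))."""
    a, w = np.radians(alpha_deg), np.radians(omega_deg)
    return np.cos(a) * (1.0 - np.tan(w) * np.tan(a))

print(tilt_from_aspect_ratio(0.5, 1.0))         # 60 degrees of tilt
print(appearance_ratio_off_axis(60.0, 10.0))    # smaller ratio away from the axis
```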
Incidentally, the above idea of using the shape change of texture elements to
determine the orientation of the plane where the texture elements are located can also
be extended, but more factors often need to be considered here. For example, the
Fig. 8.15 Schematic diagram of texture element grid and vanishing point
shape of the 3D scene can sometimes be inferred from the shape of the boundary of
the 2D region in the image. For example, the direct explanation of the ellipse in the
image often comes from the disk or ball in the scene. At this time, if the light and
shade changes and the texture patterns in the ellipse are uniform, the explanation of
the disk is often more reasonable; however, if both shading and texture patterns have
radial changes toward the boundary, the explanation of the ball is often more
reasonable.
(3) Using the change of spatial relationship among texture elements
If the texture is composed of a regular grid of texels, the surface orientation information can be restored by calculating its vanishing points (see Sect. 8.2.3). A vanishing point is the common intersection of all segments in a set of intersecting segments. For a perspective image, the vanishing point of a plane is formed by infinitely distant texture elements projecting onto the image plane in a certain direction, that is, it is the convergence point of parallel lines at infinity. For example, Fig. 8.15a shows a
perspective view of a box with parallel grid lines on each surface, and Fig. 8.15b is a
schematic diagram of the vanishing point of its surface texture.
If you look at the vanishing point along the surface, you can see the change of the
spatial relationship between texture elements, that is, the increase of the distribution
density of texture elements. The orientation of the surface can be determined by
using two vanishing points obtained from the same surface texture element grid. The
straight line where these two points are located is also called vanishing line/hidden
line, which is composed of vanishing points of parallel lines in different directions
on the same plane (e.g., the vanishing points of parallel lines in different directions
on the ground constitute the line of horizon). The direction of the vanishing line
indicates the rotation angle of the texture element relative to the camera axis, and the
intersection of the vanishing line and x = 0 indicates the inclination angle of the
texture element relative to the line of sight, as shown in Fig. 8.12c. The above
situation can be easily explained with the help of perspective projection model.
Finally, the above three methods of using texture element changes to determine
the orientation of the object surface can be summarized in Table 8.1.
Table 8.1 Comparison of three methods of using texture element changes to determine the orientation of the object surface

Method                                 | Rotation angle around viewing line                     | Tilt angle with respect to viewing line
Using texel change in size             | Texture gradient direction                             | Texture gradient value
Using texel change in shape            | Direction of the major principal axis of texel         | Ratio of texel major and minor principal axes
Using texel change in spatial relation | Direction of the line connecting two vanishing points  | Cross point of the line connecting two vanishing points and x = 0
Table 8.2 Overview of some typical methods for recovering shape from texture

Surface clue                          | Type of original surface | Type of texture | Type of projection | Analysis method        | Analysis unit | Unit property
Texture gradient                      | Plane                    | Unknown         | Perspective        | Statistical            | Wave          | Wavelength
Texture gradient                      | Plane                    | Unknown         | Perspective        | Statistical            | Region        | Area
Texture gradient density              | Plane                    | Uniform         | Perspective        | Statistical/structural | Edge/region   | Density
Convergence line                      | Plane                    | Parallel line   | Perspective        | Statistical            | Edge          | Direction
Convergence line                      | Plane                    | Parallel line   | Perspective        | Statistical            | Edge          | Direction
Normalized texture characteristic map | Plane                    | Known           | Orthogonal         | Structural             | Line          | Length
Normalized texture characteristic map | Surface                  | Known           | Spherical          | Structural             | Region        | Axis
Shape distortion                      | Plane                    | Isotropy        | Orthogonal         | Statistical            | Edge          | Direction
Shape distortion                      | Plane                    | Unknown         | Orthogonal         | Structural             | Region        | Shape
The specific effect of determining the surface orientation and restoring the surface
shape from the texture is related to the gradient of the surface itself, the distance
between the observation point and the surface, and the angle between the line of
sight and the image. Table 8.2 gives an overview of some typical methods, which
also lists various terms for obtaining shapes from textures [8]. The various methods
that have been proposed to determine the surface by texture are mostly based on
different combinations of them.
In Table 8.2, the difference between the methods lies mainly in the surface clues used: the texture gradient (the rate and direction of the maximum change of texture roughness on the surface); the convergence line (which can constrain the orientation of a planar surface: assuming these lines are parallel in 3D space, the convergence line can determine the vanishing point in the image); the normalized texture characteristic map (similar to the reflection map in shape from shading); and shape distortion (if the original shape of a pattern on the surface is known, the observed shape in the image can be determined for various orientations of the surface). In most cases the surface is a plane, but it can also be a curved surface; the analysis method can be either a structural method or a statistical method.
In Table 8.2, perspective projection is often used for projection type, but it can
also be orthogonal projection or spherical projection. In spherical projection, the
observer is at the center of the sphere, the image is formed on the sphere, and the line
of sight is perpendicular to the sphere. When restoring the orientation of the surface
from the texture, the 3D solid should be reconstructed according to the distortion of
the original texture element shape after projection. Shape distortion is mainly related
to two factors: (1) the distance between the observer and the object, which affects the
size of texture element distortion, and (2) the angle between the normal of the object
surface and the line of sight (also known as the surface inclination), which affects the
shape of the texture element after distortion. In orthogonal projection, the first factor
does not work; only the second factor will work. In perspective projection, the first
factor works, while the second factor only works when the object surface is curved
(if the object surface is flat, it will not produce distortion that affects the shape). The
projection form that can make the above two factors work together on the shape of
the object is spherical perspective projection. At this time, the change of distance
between the observer and the object will cause the change of texture element size,
and the change of object surface inclination will cause the change of object shape
after projection.
In the process of restoring surface orientation from texture, it is often necessary to
have certain assumptions about texture pattern. Two typical assumptions are as
follows:
Isotropy Assumption
The isotropy assumption holds that for isotropic textures, the probability of finding
a texture primitive in the texture plane is independent of the orientation of the texture
primitive. In other words, the probability model of isotropic texture does not need to
consider the orientation of the coordinate system on the texture plane [9].
Homogeneity Assumption
The uniformity of texture in the image refers to that the texture of a window selected
at any position in the image is consistent with that of the window selected at other
positions. More strictly, the probability distribution of a pixel value only depends on
the nature of the pixel neighborhood and has nothing to do with the spatial coordinates of the pixel itself [9]. According to the homogeneity assumption, if the texture
of a window in the image is collected as a sample, the texture outside the window can
be modeled according to the nature of the sample.
In the image obtained by orthogonal projection, even assuming that the texture is
uniform, the orientation of the texture plane cannot be restored, because the uniform
texture is still uniform after viewing angle transformation. However, if the image
obtained by perspective projection is considered, the restoration of the orientation of
the texture plane is possible.
This problem can be explained as follows: According to the uniformity assumption, the texture is considered to be composed of uniform patterns of points. At this
time, if the texture plane is sampled with equally spaced meshes, the number of
texture points obtained by each mesh should be the same or very close. However, if
the texture plane covered by equi-spaced meshes is used for perspective projection,
some meshes will be mapped into larger quadrangles, while others will be mapped
into smaller quadrangles. That is, the texture on the image plane is no longer
uniform. Because the mesh is mapped into different sizes, the number of texture
patterns (originally uniform) contained in it is no longer consistent. According to this
property, the relative orientation of the imaging plane and the texture plane can be
determined by the proportional relationship of the number of texture modes
contained in different windows.
The combination of texture method and stereo vision method is called texture stereo
technology. It estimates the direction of the scene surface by acquiring two images
of the scene at the same time, avoiding the complex problem of corresponding point
matching. In this method, the two imaging systems used are connected by rotation
transformation.
In Fig. 8.16, the straight line orthogonal to the direction of the texture gradient
and parallel to the object surface is called the characteristic line, and there is no
change in the texture structure on this line. The angle between the feature line and the
X-axis is called the feature angle, which can be calculated by comparing the Fourier
energy spectrum of the texture region. According to the feature lines and feature
angles obtained from the two images, the surface normal vector N = [Nx Ny Nz]T can
be determined:
where θ1 and θ2 are the angles between the characteristic line and the X-axis in the two images (measured counterclockwise), respectively; the coefficient aij is the direction cosine between the corresponding axes of the two imaging systems.
If the symbol “⇒” is used to represent the transformation from one space to another, the transformation {x, y} ⇒ {ρ, θ} maps a line in the image space XY to a point in the parameter space ρθ, while the collection of lines with the same vanishing point (xv, yv) in the image space XY is projected to a circle in the parameter space ρθ. To illustrate this, ρ = √(x² + y²) and θ = arctan(y/x) can be substituted into the following equation:
Turning the result back into the rectangular coordinate system, one gets:
The above equation represents a circle with center (xv/2, yv/2) and radius √((xv/2)² + (yv/2)²), as shown in Fig. 8.17b. This circle is the trajectory projected into the ρθ space by the set of line segments with (xv, yv) as the vanishing point. In other words, the vanishing point can be detected by mapping the line segment set from the XY space to the ρθ space with the transformation {x, y} ⇒ {ρ, θ}.
The above method of determining the vanishing point has two disadvantages: one is that the detection of circles is more difficult than the detection of straight lines, and the amount of calculation is also large; the other is that when xv → ∞ or yv → ∞, ρ → ∞ as well (the symbol “→” here indicates a trend). To overcome these shortcomings, the transformation {x, y} ⇒ {k/ρ, θ} can be used instead, where k is a constant (related to the value range of the Hough transform space). At this time, Eq. (8.64) becomes:
k/ρ = xv cos θ + yv sin θ   (8.66)

Converting Eq. (8.66) into the Cartesian coordinate system (letting s = ρ cos θ, t = ρ sin θ), we get:

k = xv s + yv t   (8.67)
This is a straight-line equation. In this way, the vanishing point at infinity can be
projected to the origin, and the locus of the point corresponding to the line segment
with the same vanishing point (xv, yv) in ST space becomes a straight line, as shown in Fig. 8.17c. The slope of this line is given by Eq. (8.67) as −xv/yv, so this line is orthogonal to the vector from the origin to the vanishing point (xv, yv) and lies at a distance k/√(xv² + yv²) from the origin. This straight line can be detected by another
Hough transform, that is, the space ST where the straight line is located is taken as
the original space, and it is detected in the (new) Hough transform space RW. In this
way, the straight line in the space ST is a point in the space RW, as shown in
Fig. 8.17d, and its position is:
r = k/√(xv² + yv²)   (8.68)

w = arctan(yv/xv)   (8.69)

From the above two equations, the coordinates of the vanishing point can be solved as:

xv = k/(r√(1 + tan²w))   (8.70)

yv = k tan w/(r√(1 + tan²w))   (8.71)
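The {x, y} ⇒ {k/ρ, θ} mapping can be checked numerically. The sketch below is a toy example (the vanishing point, the set of lines, and the constant k are assumed values): each line through the vanishing point is converted into a point (s, t), and (xv, yv) is then recovered by a least-squares fit of the line xv·s + yv·t = k. The book detects this line with a second Hough transform, as in Eqs. (8.68)-(8.71); least squares is used here only to keep the sketch short.

```python
import numpy as np

k = 100.0                       # assumed constant of the {x, y} => {k/rho, theta} transform
xv, yv = 250.0, -80.0           # assumed (possibly off-image) vanishing point

# A few image lines through the vanishing point, in normal form rho = x*cos(t) + y*sin(t).
thetas = np.radians([10.0, 35.0, 60.0, 95.0, 130.0])
rhos = xv * np.cos(thetas) + yv * np.sin(thetas)

# Map each line to the point (s, t) of the transformed space.
s = (k / rhos) * np.cos(thetas)
t = (k / rhos) * np.sin(thetas)

# All (s, t) satisfy xv*s + yv*t = k, so (xv, yv) follows from a least-squares fit.
A = np.column_stack([s, t])
xv_est, yv_est = np.linalg.lstsq(A, np.full_like(s, k), rcond=None)[0]
print(xv_est, yv_est)           # close to (250, -80)
```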
The above method has no problem when the vanishing point is within the range of
the original image. In practice, however, the vanishing point is often outside the
image range (as shown in Fig. 8.18), or even at infinity, where problems with the
general image parameter space are encountered. For long-distance vanishing points,
the peaks of the parameter space are distributed over a large distance, so the
detection sensitivity will be poor, and the positioning accuracy will be low.
y1(y3 − y2)/[y2(y3 − y1)] = x1/x2 = a/(a + b)   (8.72)

where y3 can be calculated from Eq. (8.72):

y3 = b y1 y2/(a y1 + b y1 − a y2)   (8.73)
In practice, the position and angle of the camera relative to the ground should be adjusted so that a = b, and then:

y3 = y1 y2/(2y1 − y2)   (8.74)
This simple formula shows that the absolute values of a and b are not important; as
long as the ratio is known, it can be calculated. Further, the above calculation does
not assume that the points V1, V2, and V3 are vertically above the point O, nor does it assume that the points O, H1, and H2 are on the horizontal line, only that they are on
two coplanar lines, and C is also in this plane.
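As a quick numerical check of Eqs. (8.73)-(8.75) (illustrative only; the measured image ordinates y1 and y2, the ratio a : b, and the ellipse measurements are assumed numbers):

```python
def y3_from_ratio(y1, y2, a, b):
    """Eq. (8.73): position of the third point from y1, y2 and the ratio a:b."""
    return b * y1 * y2 / (a * y1 + b * y1 - a * y2)

def ellipse_center_offset(b_axis, d):
    """Eq. (8.75): offset of the projected circle centre, from the ellipse's short
    semiaxis b and its distance d to the vanishing line."""
    return b_axis ** 2 / d

y1, y2 = 40.0, 70.0
print(y3_from_ratio(y1, y2, a=1.0, b=1.0))   # with a = b this equals y1*y2/(2*y1 - y2)
print(ellipse_center_offset(b_axis=12.0, d=90.0))
```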
In perspective projection, the ellipse is projected as an ellipse, but its center is
shifted a bit, because perspective projection doesn’t preserve length ratios (the
midpoint is no longer the midpoint). Assuming that the position of the vanishing
point of the plane can be determined from the image, the offset of the center can be
easily calculated using the previous method. Consider first the special case of
the ellipse, namely a circle, which is projected as an ellipse. Referring to Fig. 8.21, let b be the short semiaxis of the projected ellipse, d be the distance between the projected ellipse and the vanishing line, e be the offset of the center of the circle after projection, and point P be the projected center. Taking b + e as y1, 2b as y2, and b + d as y3, it can be obtained from Eq. (8.74) that:

e = b²/d   (8.75)
The difference from the previous method is that y3 is known here, and it is used to
calculate y1 and then calculate e. If the vanishing line is not known, but the
orientation of the plane where the ellipse is located and the orientation of the
image plane are known, the vanishing line can be deduced and calculated as above.
If the original object is an ellipse, the problem is more complicated because not
only the longitudinal position of the center of the ellipse is not known but also its
lateral position. At this time, two pairs of parallel tangent lines of the ellipse should
be considered. After projection imaging, one pair intersects at P1, and the other
intersects at P2. Both intersections are on the vanishing line, as shown in Fig. 8.22.
Since for each pair of tangents, the chord connecting the tangent points passes
through the center O of the original ellipse (this property does not vary with
projection), the projection center should be on the chord. The intersection of the
two chords corresponding to the two pairs of tangents is the projection center C.
When using the optical system to image the scene, the lens actually used can only
clearly image the scene within a certain distance interval. In other words, when the
optical system is focused at a certain distance, it can only image the scene within a
certain range above and below this distance with sufficient clarity (the defocused
image will become blurred [11]). This distance range is called the depth of field of the lens. The depth of field is determined by the farthest and nearest points (or the farthest and nearest planes) that still satisfy the required degree of sharpness. It is conceivable that if the depth of field can be controlled, then when the depth of field is small, the farthest and nearest points on the scene that satisfy the required clarity are very close together, and the depth of the scene can then be determined. The middle of the depth-of-field range basically corresponds to the focused distance [12], so this method is often referred to as shape from focus (or shape from focal length).
Figure 8.23 gives a schematic representation of the depth of field of a thin lens.
When the lens is focused on a point on the scene plane (the distance between the
scene and the lens is do), it is imaged on the image plane (the distance between the
image and the lens is di). If you reduce the distance between the scene and the lens to
do1, the image will be imaged at a distance of di1 from the lens, and the point image
on the original image plane will spread into a blurred disk of diameter D. If the
distance between the scene and the lens is increased to do2, the image will be imaged
at a distance of di2 from the lens, and the point image on the original image plane will
also spread into a blurred disk of diameter D. If D is the largest diameter acceptable
for sharpness, the difference between do1 and do2 is the depth of field.
The diameter of the blurred disk is related to both camera resolution and depth of
field. The resolution of the camera depends on the number, size, and arrangement of
the camera imaging units. In the common square grid arrangement, if there are N x N
cells, then N/2 lines can be distinguished in each direction. That is, there is an
interval of one cell between two adjacent lines. In a typical grating, the black and white lines are equally spaced, so it can be said that N/2 pairs of lines can be
distinguished. The resolution capability of a camera can also be expressed in terms
of resolving power. If the spacing of the imaging units is A and the unit is mm, the
resolution power of the camera is 0.5/A and the unit is line/mm. If the side length of
the imaging element array of a CCD camera is 8 mm, and there are 512 x 512
elements in total, its resolution is 0.5 x 512/8 = 32 line/mm.
Assuming that the focal length of the lens is λ, then according to the thin lens imaging formula, we have:

1/λ = 1/do + 1/di   (8.76)
Now denote the lens aperture (diameter) by A; then when the scene is at the closest point, the distance di1 between the image and the lens is:

di1 = A di/(A − D)   (8.77)

According to Eq. (8.76), the distance to the closest point of the scene is:

do1 = Aλdo/[Aλ + D(do − λ)]   (8.78)
Similarly, the distance of the farthest point of the scene can be obtained as:
do2 = λ[A di/(A + D)]/[A di/(A + D) − λ] = Aλdo/[Aλ − D(do − λ)]   (8.80)

It can be seen from the denominator on the right side of Eq. (8.80) that when

do = [(A + D)/D] λ = H   (8.81)

then do2 is infinite, and the depth of field is also infinite. H is called the hyperfocal distance, and when do ≥ H, the depth of field is infinite. For do < H, the depth of field is:

Δdo = do2 − do1 = 2AλD do(do − λ)/[(Aλ)² − D²(do − λ)²]   (8.82)
It can be seen from Eq. (8.82) that the depth of field increases with the increase of D: if a larger blur disk is allowed (tolerated), the depth of field is also larger. In addition, Eq. (8.82) shows that the depth of field decreases with the increase of λ; that is, a lens with a short focal length gives a larger depth of field.
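The sketch below evaluates the near-point and far-point formulas together with Eqs. (8.81)-(8.82). The focal length λ, aperture A, acceptable blur diameter D, and focused distance do are all assumed example values (in millimetres):

```python
def depth_of_field(lam, A, D, do):
    """Near point, far point, hyperfocal distance and depth of field of a thin lens.
    lam: focal length, A: aperture diameter, D: acceptable blur-circle diameter,
    do:  focused object distance (all in the same length unit)."""
    do1 = A * lam * do / (A * lam + D * (do - lam))          # nearest sharp point
    H = (A + D) / D * lam                                    # hyperfocal distance
    if do >= H:
        return do1, float("inf"), H, float("inf")
    do2 = A * lam * do / (A * lam - D * (do - lam))          # farthest sharp point
    return do1, do2, H, do2 - do1                            # last item: depth of field

# Assumed example: 50 mm lens, 25 mm aperture, 0.03 mm blur circle, focused at 3 m.
print(depth_of_field(lam=50.0, A=25.0, D=0.03, do=3000.0))
```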
Since the depth of field obtained with a lens having a longer focal length will be
smaller (closest and farthest points are close), it is possible to determine the distance
of the scene from the determination of the focal length. In fact, the human visual
system does this too. When people observe a scene, in order to see clearly, the
refractive power of the lens is controlled by adjusting the pressure of the ciliary
body, so that the depth information is connected with the pressure of the ciliary body,
and the distance of the scene is judged according to the pressure adjustment. The
automatic focus function of the camera is also realized based on this principle. If the
focal length of the camera changes smoothly within a certain range, edge detection
can be performed on the image obtained at each focal length value. For each pixel in
the image, determine the focal length value that produces a sharp edge, and use the
focal length value to determine the distance (depth) between the 3D scene surface
point corresponding to the pixel and the camera lens. In practical applications, for a
given scene, adjust the focal length to make the image of the camera clear; then the
focal length value at this time indicates the distance between the camera and it,
while, for an image shot with a certain focal length, the depth of the scene point
corresponding to the clear pixel point can also be calculated.
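A minimal sketch of the depth-from-focus idea described above follows. It assumes that a stack of images taken at known focus settings is already available as a NumPy array, and it uses the squared response of a local Laplacian as the sharpness measure; both the stack and the focus settings here are synthetic placeholders, not the book's procedure in detail.

```python
import numpy as np

def laplacian_sharpness(img):
    """Simple focus measure: squared response of a 4-neighbour Laplacian."""
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
           np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
    return lap ** 2

def depth_from_focus(stack, focus_settings):
    """stack: (K, H, W) images taken at K focus settings.
    For every pixel, pick the setting that maximizes the sharpness measure and
    return the corresponding (known) object distance."""
    sharpness = np.stack([laplacian_sharpness(img) for img in stack])   # (K, H, W)
    best = np.argmax(sharpness, axis=0)                                  # (H, W)
    return np.asarray(focus_settings)[best]

# Synthetic placeholder data: 5 random "images" and 5 focus distances (in mm).
rng = np.random.default_rng(0)
stack = rng.random((5, 32, 32))
depth_map = depth_from_focus(stack, focus_settings=[300, 500, 800, 1200, 2000])
print(depth_map.shape, depth_map.min(), depth_map.max())
```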
The following introduces a method to calculate the geometric shape and pose of a 3D scene by using the coordinates of three image points, under the condition that the 3D scene model and the focal length of the camera are known [13]; that is, the pairwise distances between the three scene points are known.
Equation (8.85) gives quadratic equations in the three ki, in which the three d²mn on the left side are known from the 3D scene model, and the three dot products vm·vn can also be calculated from the coordinates of the image points. The P3P problem of determining the coordinates of the points Wi thus becomes a problem of solving three quadratic equations in three unknowns. In theory, there are eight solutions to Eq. (8.85) (i.e., eight sets of [k1 k2 k3]), but as can be seen from Fig. 8.24, due to symmetry, if [k1 k2 k3] is a set of solutions, then [−k1 −k2 −k3] must also be a set of solutions. Since the object can only be on one side of the camera, there are at most four sets of real solutions. It has also been shown that, although there may be four sets of solutions in certain cases, there are generally only two sets of solutions.
Now solve for ki in the following three functions:
Suppose the initial value is near [k1 k2 k3], but f(k1, k2, k3) ≠ 0. An increment [Δ1 Δ2 Δ3] is now required so that f(k1 + Δ1, k2 + Δ2, k3 + Δ3) tends to 0. Expanding f(k1 + Δ1, k2 + Δ2, k3 + Δ3) in the neighborhood of [k1 k2 k3] and omitting higher-order terms, we get:

f(k1 + Δ1, k2 + Δ2, k3 + Δ3) = f(k1, k2, k3) + [∂f/∂k1  ∂f/∂k2  ∂f/∂k3] [Δ1  Δ2  Δ3]^T   (8.87)
Setting the left-hand side of the above equation to 0 yields a linear equation in [Δ1 Δ2 Δ3]. Similarly, the functions g and h in Eq. (8.86) can also be transformed into linear equations. Putting them together:

[0]   [f(k1, k2, k3)]   [∂f/∂k1  ∂f/∂k2  ∂f/∂k3] [Δ1]
[0] = [g(k1, k2, k3)] + [∂g/∂k1  ∂g/∂k2  ∂g/∂k3] [Δ2]   (8.88)
[0]   [h(k1, k2, k3)]   [∂h/∂k1  ∂h/∂k2  ∂h/∂k3] [Δ3]
The above matrix of partial derivatives is the Jacobian matrix J. For the three functions of Eq. (8.86), the Jacobian matrix has the following form (where vmn = vm · vn):
J(k1, k2, k3) = [J11 J12 J13; J21 J22 J23; J31 J32 J33]
  = [ (2k1 − 2v12k2)   (2k2 − 2v12k1)   0
      0                (2k2 − 2v23k3)   (2k3 − 2v23k2)
      (2k1 − 2v31k3)   0                (2k3 − 2v31k1) ]   (8.89)
If the Jacobian matrix J is invertible at the point (k1, k2, k3), the parameter increment can be obtained as:

[Δ1  Δ2  Δ3]^T = −J⁻¹(k1, k2, k3) [f(k1, k2, k3)  g(k1, k2, k3)  h(k1, k2, k3)]^T   (8.90)
Adding the above increment to the parameter values of the previous step, and using K^(l) to represent the l-th iterate of the parameters, we get (Newton's method):
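A compact sketch of this Newton iteration is given below. It is illustrative only: it assumes the three viewing directions v1, v2, v3 have been normalized to unit length, so that each quadratic has the form km² + kn² − 2 km kn (vm·vn) = dmn², and the starting guess, directions, and distances are made-up test values.

```python
import numpy as np

def p3p_newton(v, d2, k0, iters=50):
    """Solve the three P3P quadratics for k = (k1, k2, k3) by Newton's method.
    v:  3x3 array of unit viewing directions (rows v1, v2, v3).
    d2: squared model distances (d12^2, d23^2, d31^2).
    k0: initial guess for (k1, k2, k3)."""
    v12, v23, v31 = v[0] @ v[1], v[1] @ v[2], v[2] @ v[0]
    k = np.asarray(k0, dtype=float)
    for _ in range(iters):
        k1, k2, k3 = k
        F = np.array([k1**2 + k2**2 - 2*v12*k1*k2 - d2[0],
                      k2**2 + k3**2 - 2*v23*k2*k3 - d2[1],
                      k1**2 + k3**2 - 2*v31*k1*k3 - d2[2]])
        J = np.array([[2*k1 - 2*v12*k2, 2*k2 - 2*v12*k1, 0.0],
                      [0.0, 2*k2 - 2*v23*k3, 2*k3 - 2*v23*k2],
                      [2*k1 - 2*v31*k3, 0.0, 2*k3 - 2*v31*k1]])
        k = k + np.linalg.solve(J, -F)        # Newton update: k <- k - J^{-1} F
    return k

# Assumed toy problem: true scales (2, 3, 4) along three unit viewing directions.
v = np.array([[0.0, 0.0, 1.0], [0.3, 0.0, 1.0], [0.0, 0.4, 1.0]])
v /= np.linalg.norm(v, axis=1, keepdims=True)
k_true = np.array([2.0, 3.0, 4.0])
W = v * k_true[:, None]
d2 = [np.sum((W[0]-W[1])**2), np.sum((W[1]-W[2])**2), np.sum((W[2]-W[0])**2)]
print(p3p_newton(v, d2, k0=[1.5, 2.5, 3.5]))   # should approach one solution set, here (2, 3, 4)
```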
The method of recovering shape from shading was proposed in the early days using
some assumptions, for example, the light source is located at infinity, the camera
follows the orthogonal projection model, the reflection characteristics of the object
surface obey the ideal diffuse reflection, etc., to simplify the imaging model. These
assumptions reduce the complexity of the SFS method but may also generate large
reconstruction errors in practical applications. For example, the actual scene surface
is rarely an ideal diffuse reflection surface, and the specular reflection factor is often
mixed in. For another example, when the distance between the camera and the scene surface is relatively small, the imaging is close to a perspective projection, and the orthographic assumption will then lead to relatively obvious reconstruction errors.
Considering that the surface of the actual scene is mostly a mixture of diffuse
reflection and specular reflection, Ward proposed a reflection model [14], which
uses a Gaussian model to describe the specular component in the surface reflection.
Ward represents this model using the bidirectional reflectance distribution function (BRDF; see Sect. 7.2.1):

f(θi, φi; θe, φe) = b_l + [b_m/(4πσ²)] · [1/√(cos θi cos θe)] · exp(−tan²δ/σ²)   (8.92)

Among them, b_l and b_m are the diffuse reflection and specular reflection coefficients, respectively; σ is the surface roughness coefficient; and δ is the angle between the bisector direction (L + V)/||L + V|| of the light source and viewing directions and the surface normal vector (where L is the light direction vector and V is the viewing direction vector).
The Ward reflection model is a concrete physical realization of the Phong
reflection model. The Ward reflection model is actually a linear combination of
diffuse and specular reflections, where the diffuse part still uses the Lambertian
model. Since the use of the Lambertian model to calculate the radiance of the scene
surface is not accurate enough for the actual diffuse reflection surface, a more
accurate reflection model was proposed [15]. In this model, the scene surface is
considered to be composed of many “V”-shaped grooves, and the slopes of the two
micro-planes in the “V”-shaped grooves are the same in magnitude but opposite in
direction. Define the surface roughness as a probability distribution function of the
micro-facet orientation. Using the Gaussian probability distribution function, the
formula for calculating the radiance of the diffuse surface can be obtained as:
fV(0j, ^i; 0e, ^e) = ~ cos 0ifA + B max [0, cos (^e — ^i)] sin a sin fig (8:93)
Substituting Eq. (8.93) for the diffuse reflection term (the first term on the right
side of the equal sign) in Eq. (8.92), an improved Ward reflection model can be
obtained:
f′(θi, φi; θe, φe) = b_l cos θi {A + B max[0, cos(φe − φi)] sin α tan β}
  + [b_m/(4πσ²)] · [1/√(cos θi cos θe)] · exp(−tan²δ/σ²)   (8.94)
The improved Ward model should better describe hybrid surfaces with both diffuse
reflection and specular reflection [16].
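For reference, the sketch below evaluates the Ward-style mixed reflectance of Eq. (8.92) for assumed material parameters (b_l, b_m, and σ are made-up values; δ is computed as the angle between the light/view bisector and the surface normal):

```python
import numpy as np

def ward_brdf(n, l, v, b_l, b_m, sigma):
    """Isotropic Ward-style reflectance of Eq. (8.92).
    n, l, v: unit surface normal, light direction, and viewing direction.
    b_l, b_m: diffuse and specular coefficients; sigma: surface roughness."""
    cos_i = max(float(n @ l), 1e-6)
    cos_e = max(float(n @ v), 1e-6)
    h = (l + v) / np.linalg.norm(l + v)          # bisector of light and view directions
    cos_d = np.clip(float(n @ h), 1e-6, 1.0)     # cos(delta)
    tan2_d = (1.0 - cos_d**2) / cos_d**2
    specular = b_m / (4 * np.pi * sigma**2) / np.sqrt(cos_i * cos_e) * np.exp(-tan2_d / sigma**2)
    return b_l + specular

# Assumed example: light and camera 30 degrees on either side of the normal.
n = np.array([0.0, 0.0, 1.0])
l = np.array([np.sin(np.radians(30)), 0.0, np.cos(np.radians(30))])
v = np.array([-np.sin(np.radians(30)), 0.0, np.cos(np.radians(30))])
print(ward_brdf(n, l, v, b_l=0.3, b_m=0.4, sigma=0.2))   # strong mirror-direction response
```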
Consider the perspective projection when the camera is relatively close to the scene surface, as shown in Fig. 8.25. The optical axis of the camera coincides with the Z-axis, the optical center of the camera is located at the projection center, and the focal length of the camera is λ. Let the image plane xy lie at Z = −λ. At this time θi = θe = α = β and φi = φe, and Eq. (8.94) becomes:
Suppose the surface shape of the scene in the image can be represented by the function T: Ω → R³:

T(x) = (z(x)/λ) [x  −λ]^T   (8.96)

x = [x  y]^T ∈ Ω   (8.97)
Among them, z(x) = −Z(X) > 0 represents the depth of the point on the scene surface along the optical axis, and Ω is an open subset of the real plane R², representing the domain (size) of the image.
The normal vector n(x) at any point P on the scene surface is:

\mathbf{n}(\mathbf{x}) = \begin{bmatrix} \lambda\nabla z(\mathbf{x}) \\ z(\mathbf{x}) + \mathbf{x}\cdot\nabla z(\mathbf{x}) \end{bmatrix}    (8.98)
where \nabla z(x) is the gradient of z(x). The vector in the direction of the ray cast through point P is:

\mathbf{L}(\mathbf{x}) = \frac{1}{\sqrt{\|\mathbf{x}\|^2 + \lambda^2}}\begin{bmatrix} -\mathbf{x} \\ \lambda \end{bmatrix}    (8.99)
Because \theta_i is the included angle between n(x) and L(x), if we let v(x) = \ln z(x), there is:

\theta_i = \arccos\frac{\mathbf{n}^{\mathrm{T}}(\mathbf{x})\,\mathbf{L}(\mathbf{x})}{\|\mathbf{n}(\mathbf{x})\|\,\|\mathbf{L}(\mathbf{x})\|} = \arccos\frac{Q(\mathbf{x})}{\sqrt{\lambda^2\|\nabla v(\mathbf{x})\|^2 + [1 + \mathbf{x}\cdot\nabla v(\mathbf{x})]^2}}    (8.100)
where Q(\mathbf{x}) = \lambda/\sqrt{\|\mathbf{x}\|^2 + \lambda^2}. Substituting Eq. (8.100) into Eq. (8.95) yields the image brightness constraint equation under perspective projection, Eq. (8.101), in which F(\mathbf{x}, \nabla v) = \lambda^2\|\nabla v(\mathbf{x})\|^2 + [1 + \mathbf{x}\cdot\nabla v(\mathbf{x})]^2 and \nabla v is the abbreviation of \nabla v(\mathbf{x}) = [p, q]^{\mathrm{T}}. Equation (8.101) is a first-order partial differential equation, and the corresponding Hamiltonian function H(\mathbf{x}, p, q), Eq. (8.102), can be obtained from it.
Considering the Dirichlet boundary conditions, the above equation can be written as a static Hamilton-Jacobi equation:

H(\mathbf{x}, p, q) = 0 \quad \forall \mathbf{x} \in \Omega, \qquad v(\mathbf{x}) = \omega(\mathbf{x}) \quad \forall \mathbf{x} \in \partial\Omega    (8.103)

Introducing a pseudo-time variable t, the corresponding evolution form is:

v_t + H(\mathbf{x}, p, q) = 0 \quad \forall \mathbf{x} \in \Omega, \qquad v(\mathbf{x}, t) = \omega(\mathbf{x}) \quad \forall \mathbf{x} \in \partial\Omega, \qquad v(\mathbf{x}, 0) = v_0(\mathbf{x})    (8.104)
Then the fixed-point iterative sweeping method [17] and the 2D central Hamiltonian function [18] are used to solve it.
Consider the mesh points x_{i,j} = (ih, jw) in the m × n image \Omega, where i = 1, 2, ..., m, j = 1, 2, ..., n, and (h, w) define the size of the discrete mesh in the numerical algorithm. It is now required to solve for the discrete approximate solution v_{i,j} = v(x_{i,j}) of the unknown function v(x).
The forward Euler formula is applied to expand Eq. (8.104) in the time domain, giving:

v_{i,j}^{n+1} = v_{i,j}^{n} - \Delta t\, H\!\left(p_{i,j}^{-}, p_{i,j}^{+};\ q_{i,j}^{-}, q_{i,j}^{+}\right)    (8.105)

where \Delta t = \gamma/[(\sigma_x/h) + (\sigma_y/w)] and \gamma is the CFL coefficient; \sigma_x and \sigma_y are artificial viscosity factors, which meet the following requirements:
\sigma_x = \max_{p, q}\left|\frac{\partial H(p, q)}{\partial p}\right|    (8.106)

\sigma_y = \max_{p, q}\left|\frac{\partial H(p, q)}{\partial q}\right|    (8.107)
where p^-, p^+ and q^-, q^+ represent the backward and forward differences of p and q, respectively, as given by Eqs. (8.108) to (8.112). Substituting Eq. (8.106) to Eq. (8.112) into Eq. (8.105), the final iterative equation, Eq. (8.113), is obtained. The solution then proceeds in the following steps:
1. Initialization: The values of the boundary points are set by the boundary condition and remain fixed during the iteration. Assign the image-domain points (i.e., the points in \Omega) a larger value, v^0_{i,j} = M, where M should be greater than the maximum of all height values; the values of these points will be updated in the iteration process.
2. Alternate direction scanning: In step k + 1, use iterative Eq. (8.113) to update vi,j.
The scanning process adopts the Gauss-Seidel method from the following four
directions: (1) From top left to bottom right, i = 1: m, j = 1: n; (2) from bottom
left to top right, i = m:1,j = 1: n; (3) from bottom right to top left, i = m:1,j = n:
1; and (4) from top right to bottom left, i = 1: m, j = n: 1. When v_{i,j}^{k+1} < v_{i,j}^{k}, update v_{i,j}^{new} = v_{i,j}^{k+1}.
3. Iteration stop criteria: When \|v^{k+1} - v^k\|_1 < \varepsilon = 10^{-5}, stop the iteration; otherwise, return to step 2. A minimal sketch of this sweeping loop is given after this list.
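A minimal Python sketch of this sweeping loop, under the assumptions stated here, is as follows. The discrete Hamiltonian update of Eq. (8.113) is not reproduced in the text above, so it is represented by a user-supplied placeholder local_update(); the boundary mask, the large initial value, and the L1 stopping tolerance follow the three steps just listed.

```python
import numpy as np

def fixed_point_sweeping(v0, boundary_mask, local_update, eps=1e-5, max_iter=1000):
    """Fixed-point iterative sweeping with four alternating Gauss-Seidel scan orders.

    v0            : initial grid (boundary values set; interior set to a large M)
    boundary_mask : True where v is fixed by the Dirichlet boundary condition
    local_update  : callable (v, i, j) -> candidate value from the discrete
                    Hamiltonian update (Eq. (8.113)); assumed to be given
    """
    v = v0.copy()
    m, n = v.shape
    sweeps = [
        (range(m), range(n)),                          # top-left -> bottom-right
        (range(m - 1, -1, -1), range(n)),              # bottom-left -> top-right
        (range(m - 1, -1, -1), range(n - 1, -1, -1)),  # bottom-right -> top-left
        (range(m), range(n - 1, -1, -1)),              # top-right -> bottom-left
    ]
    for _ in range(max_iter):
        v_old = v.copy()
        for rows, cols in sweeps:                      # Gauss-Seidel: update in place
            for i in rows:
                for j in cols:
                    if boundary_mask[i, j]:
                        continue
                    cand = local_update(v, i, j)
                    if cand < v[i, j]:                 # keep only decreasing updates
                        v[i, j] = cand
        if np.abs(v - v_old).sum() < eps:              # L1 stopping criterion
            break
    return v
```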
Due to the complexity of the Ward reflection model, it is difficult to find optimal artificial viscosity factors for the fixed-point iterative sweeping method, resulting in slow convergence of the calculation. Therefore, the Blinn-Phong reflection model [19] can be used to establish the equation instead.
Based on the Blinn-Phong reflection model to characterize the hybrid reflection characteristics of the object surface, the image brightness constraint equation is [20]:

I(u, v) = k_l\cos\theta_i + k_m\cos^{a}\delta    (8.116)

where I(u, v) is the gray value of the image at (u, v); k_l and k_m are the weighting factors of the diffuse reflection and specular reflection components of the scene surface, respectively, with k_l + k_m ≤ 1; a > 0 is the specular reflection index; \theta_i is the angle between the normal vector N(u, v) at a point P(x, y, z) on the scene surface corresponding to (u, v) and the light source direction L(u, v); and \delta is defined as in Eq. (8.92).
Considering that the point light source is approximately located at the projection center, then under perspective projection the scene point and its image point (u, v) satisfy:

\frac{x}{u} = \frac{y}{v} = \frac{Z}{-\lambda}    (8.117)
Therefore, the point P(x, y, z) on the hybrid surface can be represented as:

P(x, y, z) = \frac{z(u, v)}{\lambda}\,(u, v, -\lambda), \quad (u, v) \in \Omega    (8.118)
Among them, z(u, v) > 0, and \Omega is the image region captured by the camera.
With the help of Eq. (8.118), the normal vector at point P can be calculated as:

N(u, v) = \left(\lambda\frac{\partial z}{\partial u},\ \lambda\frac{\partial z}{\partial v},\ u\frac{\partial z}{\partial u} + v\frac{\partial z}{\partial v} + z\right)    (8.119)

and the unit vector from P toward the light source (and the viewpoint) is:

L(u, v) = \frac{1}{\sqrt{u^2 + v^2 + \lambda^2}}\,[-u, -v, \lambda]    (8.120)
Because \theta_i is the included angle between N(u, v) and L(u, v), there is:

\cos\theta_i = \frac{N(u, v)\cdot L(u, v)}{\|N(u, v)\|\,\|L(u, v)\|} = \frac{Q(u, v)\, z}{\sqrt{\lambda^2 z_u^2 + \lambda^2 z_v^2 + (u z_u + v z_v + z)^2}}    (8.121)

where Q(u, v) = \lambda/(u^2 + v^2 + \lambda^2)^{1/2} > 0 and z_u, z_v denote the partial derivatives of z(u, v). Let Z = \ln[z(u, v)]; substituting Eq. (8.121) into Eq. (8.116), the image brightness constraint equation of the hybrid surface under perspective projection can be obtained:
I(u, v) = k_l\,\frac{Q(u, v)}{U(u, v, \nabla Z)} + k_m\,\frac{Q^{a}(u, v)}{U^{a}(u, v, \nabla Z)}    (8.122)

where

U(u, v, \nabla Z) = \sqrt{\lambda^2\left(\frac{\partial Z}{\partial u}\right)^2 + \lambda^2\left(\frac{\partial Z}{\partial v}\right)^2 + \left(u\frac{\partial Z}{\partial u} + v\frac{\partial Z}{\partial v} + 1\right)^2} > 0    (8.123)
Let T = Q(u, v)/U(u, v, \nabla Z); then Eq. (8.122) becomes the scalar equation F(T) = k_l T + k_m T^{a} - I(u, v) = 0, which can be solved with the Newton iteration:

T_k = T_{k-1} - \frac{F(T_{k-1})}{F'(T_{k-1})}    (8.126)
thus obtaining:

U(u, v, \nabla Z) = \frac{Q(u, v)}{T_k}    (8.127)
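As an illustration of this step, the following Python sketch applies the Newton iteration of Eq. (8.126) pixel by pixel to the scalar equation k_l T + k_m T^a = I(u, v) obtained by reading Eq. (8.122) with T = Q/U; the function name, starting value, and sample parameters are illustrative assumptions.

```python
def solve_T(I, k_l, k_m, a, T0=0.5, n_iter=20, tol=1e-8):
    """Newton iteration (Eq. (8.126)) for F(T) = k_l*T + k_m*T**a - I = 0,
    where T = Q(u, v)/U(u, v, grad Z) as in Eq. (8.127)."""
    T = T0
    for _ in range(n_iter):
        F = k_l * T + k_m * T ** a - I
        dF = k_l + a * k_m * T ** (a - 1)       # derivative F'(T)
        T_new = T - F / dF
        if abs(T_new - T) < tol:
            return T_new
        T = T_new
    return T

# A pixel with gray value 0.7, k_l = 0.6, k_m = 0.3, specular index a = 4
print(solve_T(0.7, 0.6, 0.3, 4))
```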
Substitute Eq. (8.127) into Eq. (8.123) to obtain a new image brightness constraint
equation:
T_k\,\sqrt{\lambda^2\left(\frac{\partial Z}{\partial u}\right)^2 + \lambda^2\left(\frac{\partial Z}{\partial v}\right)^2 + \left(u\frac{\partial Z}{\partial u} + v\frac{\partial Z}{\partial v} + 1\right)^2} - Q(u, v) = 0    (8.128)
It can be seen that Eq. (8.128) is a partial differential equation of the Hamilton-Jacobi
type. It does not have a solution in the usual sense in general, so the solution in the
viscous sense needs to be calculated. First, the Hamiltonian function H(u, v, g) of Eq. (8.128) is formed (Eq. (8.129)), where g denotes the gradient (\partial Z/\partial u, \partial Z/\partial v). The Legendre transform is then used to obtain the control form corresponding to Eq. (8.129), in which the matrices R(u, v) and D(u, v) are defined as:
R(u, v) = \begin{cases} \begin{bmatrix} u/\sqrt{u^2 + v^2} & v/\sqrt{u^2 + v^2} \\ -v/\sqrt{u^2 + v^2} & u/\sqrt{u^2 + v^2} \end{bmatrix}, & u^2 + v^2 \neq 0 \\ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, & u^2 + v^2 = 0 \end{cases}    (8.131)

D(u, v) = \begin{bmatrix} \sqrt{u^2 + v^2 + \lambda^2} & 0 \\ 0 & \lambda \end{bmatrix}    (8.133)

Combining these, the discretized (numerical) Hamiltonian can be written as:

H(u, v, g) \approx -Q(u, v) + \sup_{a \in \bar{B}_2(0, 1)} \big\{ -l_c(u, v, a) + \min[-f_1(u, v, a), 0]\,g_1^{+} + \max[-f_1(u, v, a), 0]\,g_1^{-} + \min[-f_2(u, v, a), 0]\,g_2^{+} + \max[-f_2(u, v, a), 0]\,g_2^{-} \big\}    (8.134)
where f_m is the mth (m = 1, 2) component of f_c, and g_m^{+} and g_m^{-} are the forward and backward differences of the mth component, respectively. The computation here is an optimization problem (see [21]).
Finally, define Z^k = Z(u, v, k\Delta t), and expand with the forward Euler formula in the time domain to obtain the numerical update Z^{k+1} = Z^{k} - \Delta t\, H(u, v, g), where \Delta t is the time increment. Using the iterative fast marching strategy [22], the viscous solution of Z can be accurately approximated after several iterations, and the exponential exp(Z) of this viscous solution gives the height of the hybrid reflective surface.
References
1. Zhang Y-J (2017) Image engineering, Vol. 3: Image understanding. De Gruyter, Germany.
2. Lee JH, Kim CS (2022) Single-image depth estimation using relative depths. Journal of Visual Communication and Image Representation 84: 103459 (DOI: https://doi.org/10.1016/j.jvcir.2022.103459).
3. Heydrich T, Yang Y, Du S (2022) A lightweight self-supervised training framework for
monocular depth estimation. International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2265-2269.
4. Anunay, Pankaj, Dhiman C (2021) DepthNet: A monocular depth estimation framework. 7th
International Conference on Engineering and Emerging Technologies (ICEET), 495-500.
5. Gibson JJ (1950) The perception of the visual world. Houghton Mifflin, Boston.
6. Zhang Y-J (2017) Image engineering, Vol. 2: Image analysis. De Gruyter, Germany.
7. Jain R, Kasturi R, Schunck BG (1995) Machine Vision. McGraw-Hill Companies, Inc., New York.
8. Tomita F, Tsuji S (1990) Computer Analysis of Visual Textures. Kluwer Academic Publishers,
Amsterdam.
9. Forsyth D, Ponce J (2003) Computer Vision: A Modern Approach. Prentice Hall, UK London.
10. Davies ER (2005) Machine Vision: Theory, Algorithms, Practicalities, 3rd Ed. Elsevier, Amsterdam.
11. Anwar S, Hayder Z, Porikli F (2021) Deblur and deep depth from single defocus image. Machine Vision and Applications 32(1): #34 (DOI: https://doi.org/10.1007/s00138-020-01162-6).
12. Gladines J, Sels S, Hillen M, et al. (2022) A continuous motion shape-from-focus method for geometry measurement during 3D printing. Sensors 22(24): #9805 (DOI: https://doi.org/10.3390/s22249805).
13. Shapiro L, Stockman G (2001) Computer Vision. Prentice Hall, UK London.
14. Ward GJ (1992) Measuring and modeling anisotropic reflection. Proceedings of the 19th
Annual Conference on Computer Graphics and Interactive Techniques 265-272.
15. Oren M, Nayar SK (1995) Generalization of the Lambertian model and implications for
machine vision. International Journal of Computer Vision 14(3): 227-251.
16. Wang GH, Han JQ, Zhang XM, et al. (2011) A new shape from shading algorithm for hybrid
surfaces. Journal of Astronautics 32(5): 1124-1129.
17. Zhao HK (2005) A fast sweeping method for Eikonal equations. Mathematics of Computation 74(250): 603-627.
18. Shu CW (2007) High order numerical methods for time dependent Hamilton-Jacobi equations.
World Scientific Publishing, Singapore.
19. Tozza S, Mecca R, Duocastella M, et al. (2016) Direct differential photometric stereo shape
recovery of diffuse and specular surfaces. Journal of Mathematical Imaging and Vision 56(1):
57-76.
20. Wang GH, Zhang X (2021) Fast shape-from-shading algorithm for 3D reconstruction of hybrid
surfaces under perspective projection. Acta Optica Sinica 41(12): 1215003 (1-9).
21. Wang GH, Han JQ, Jia HH, et al. (2009) Fast viscosity solutions for shape from shading under a
more realistic imaging model. Optical Engineering 48(11): 117201.
22. Wang GH, Han JQ, Zhang XM (2009) Three-dimensional reconstruction of endoscope images
by a fast shape from shading method. Measurement Science and Technology 20(12): 125801.
Chapter 9
Generalized Matching
Vision can be considered to include two aspects: "seeing" and "perception." On the one hand, "seeing" should be purposeful "seeing"; that is, according to certain knowledge (including the description of the object and the interpretation of the scene), the object should be found in the scene with the help of images. On the other hand, "perception" should be "perception" with cognition; that is, the characteristics of the scene are extracted from the input image and then matched with existing scene models, so as to achieve the purpose of understanding the meaning of the scene.
comparable to the matching in the object space. In the case of occlusion, the
smoothness assumption will be affected, and the image matching algorithm will
encounter difficulties.
Image matching algorithms can be further classified according to the image representation model they use.
1. Raster-based matching
Raster-based matching uses a raster representation of the image; that is, it attempts to find a mapping function between image regions by directly comparing gray levels or grayscale functions. This class of methods can be highly accurate but is sensitive to occlusion.
2. Feature-based matching
In feature-based matching, a symbolic description of the image is first formed from salient features extracted with feature extraction operators, and then the corresponding features of different images are searched for based on assumptions about the local geometric properties of the object to be described and on the geometric mapping. These methods are more suitable than raster-based methods for situations with surface discontinuities and approximate data.
3. Relationship-based matching
Relation matching, also known as structural matching, is based on the similarity of topological relationships (topological properties do not change under perspective transformation) between features; these similarities exist in feature adjacency graphs rather than in grayscale or point distributions. Matching of relational descriptions can be applied in many situations, but it may generate a very complex search tree, so its computational complexity may be very large.
Based on the template matching theory in Subsection 5.1.1, it is believed that in order to recognize the content of an image, one must have its "memory trace" or basic model from past experience, which is also called a "mask." If the present stimulus matches the mask in the brain, one can tell what the stimulus is. However, template matching theory requires the external stimulus to exactly match the template, whereas in real life people can recognize not only images that are consistent with the basic pattern but also images that do not completely conform to it.
Gestalt psychologists came up with the prototype matching theory. This theory
holds that a presently observed image of the letter “A,” no matter what shape it is or
where it is placed, bears a resemblance to an “A” known to have been perceived in
the past. Humans do not store countless templates of different shapes in long-term
memory but use the similarity abstracted from various images as prototypes to test
the images to be recognized. If a prototype resemblance can be found from the image
to be recognized, then the recognition of the image is achieved. This model of image
cognition is more plausible than template matching in terms of both neurological processes and memory search, and it can also account for the recognition of images that are irregular but similar to the prototype in some respects. According to this model, an idealized prototype of the letter "A" can be formed, which summarizes the common characteristics of the various images similar to this prototype. On this basis, it becomes possible to recognize, by matching, all other "A"s that are not identical to the prototype but only similar to it.
Although prototype matching theory can more reasonably explain some phenomena in image cognition, it does not explain how humans discriminate and process similar stimuli. Prototype matching theory does not give a clear image recognition model or mechanism, and it is difficult to realize in computer programs. Further research remains a topic for vision science and computer vision.
Although matching theory is not perfect, the matching task still needs to be completed, so evaluation criteria for matching are needed. In turn, research on matching evaluation criteria will also promote the development of matching theory. Commonly used image matching evaluation criteria mainly include accuracy, reliability, robustness, and computational complexity [5].
Accuracy refers to the difference between the true value and the estimated value.
The smaller the difference, the more accurate the estimate. In image-level matching,
accuracy can refer to statistics such as the mean, median, maximum, or root mean
square of the distance between two image points to be matched (or a reference image
point and a matching image point). Accuracy can also be measured from synthetic or
simulated images when the correspondence has been determined; another approach
is to place fiducial markers in the scene and use the location of the fiducial markers to
evaluate the accuracy of the match. The unit of accuracy is often pixels or voxels.
Reliability refers to how many times the matching algorithm has achieved
satisfactory results in a total of multiple tests. If N pairs of images are tested, wherein
M tests give satisfactory results, when N is large enough and N pairs of images are
representative, then M/N represents reliability. The closer M/N is to 1, the more
reliable it is. In this sense, the reliability of the algorithm is predictable.
Robustness refers to the stability of accuracy or the reliability of an algorithm
under different changes in its parameters. Robustness can be measured in terms of
noise, density, geometric differences, percentage of dissimilar regions between
images, etc. The robustness of an algorithm can be obtained by determining how
stable the algorithm’s accuracy is or how reliable it is when the input parameters
change (e.g., by using their variance, the smaller the variance, the more robust the
algorithm). If there are many input parameters, each of which may affect the
accuracy or reliability of the algorithm, then the robustness of the algorithm can be
defined with respect to each parameter. For example, an algorithm might be robust to
noise but not robust to geometric distortions. Saying that an algorithm is robust
generally means that the performance of the algorithm does not change significantly
as the parameters involved change.
Computational complexity determines the speed of an algorithm, indicating its
usefulness in a specific application. For example, in image-guided neurosurgery, the
images used to plan the surgery need to be matched within seconds to images that
reflect the conditions of the surgery at a particular time. However, matching the
aerial imagery acquired by the aircraft often needs to be completed in the order of
milliseconds. Computational complexity can be expressed as a function of image
size (considering the number of additions or multiplications required for each unit);
it is generally hoped that the computational complexity of a good matching algorithm is a linear function of the image size.
Image matching takes pixels as the unit, the amount of calculation is generally large,
and the matching efficiency is low. In practice, objects of interest are often detected
and extracted first, and then objects are matched. If a concise object representation is
used, the matching effort can be greatly reduced. Since the object can be represented
in different ways, the matching of the object can also take a variety of methods.
The effect of object matching should be judged by a certain measure, the core of
which is the similarity of the objects.
In an image, the object is composed of points (pixels), and the matching of two
objects is, in a certain sense, the matching between two sets of points. The method of
using Hausdorff distance (HD) to describe the similarity between point sets and
matching through feature point sets is widely used. Given two finite point sets
A = {a 1, a2, ..., am} and B = {b1, b2, ..., bn}, the Hausdorff distance between
them is defined as:
Among them, the function h(A, B) is called the directed Hausdorff distance from set A to set B; it is the largest, over all points a ∈ A, of the distance from a to its nearest point in B. Similarly, the function h(B, A) is called the directed Hausdorff distance from set B to set A, the largest, over all points b ∈ B, of the distance from b to its nearest point in A. Since h(A, B) and h(B, A) are not symmetrical, the maximum of the two is generally taken as the Hausdorff distance between the two point sets.
The geometric meaning of the Hausdorff distance can be explained as follows: if the Hausdorff distance between two point sets A and B is d, then for any point in either set, a circle of radius d centered on that point contains at least one point of the other set. If the Hausdorff distance between two point sets is 0, the two point sets coincide. In the schematic of Fig. 9.2, h(A, B) = d_{21} and h(B, A) = d_{22} = H(A, B).
The Hausdorff distance as defined above is sensitive to noise points or outliers of the point sets. A commonly used improvement adopts the concept of statistical averaging and replaces the maximum value with the average value; this is called the modified Hausdorff distance (MHD) [6], that is, Eq. (9.2) and Eq. (9.3) are changed to:

h_{\mathrm{MHD}}(A, B) = \frac{1}{N_A}\sum_{a \in A}\min_{b \in B}\|a - b\|    (9.4)

h_{\mathrm{MHD}}(B, A) = \frac{1}{N_B}\sum_{b \in B}\min_{a \in A}\|b - a\|    (9.5)

where N_A represents the number of points in point set A and N_B represents the number of points in point set B. Substituting them into Eq. (9.1), we get:

H_{\mathrm{MHD}}(A, B) = \max[h_{\mathrm{MHD}}(A, B), h_{\mathrm{MHD}}(B, A)]    (9.6)
When using the Hausdorff distance to calculate the correlation between images, it
does not require a clear point relationship between the two images; in other words, it
does not need to establish a one-to-one relationship of point correspondences
between the two point sets, which is one of its important advantages.
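The following NumPy sketch computes the directed distances and the symmetric HD/MHD values for two 2D point sets. It is a direct transcription of the definitions discussed above; the function names and the example point sets are illustrative.

```python
import numpy as np

def directed_h(A, B):
    """Directed Hausdorff distance h(A, B): for each point of A take the distance
    to its nearest point in B, then take the maximum over A."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise distances
    return d.min(axis=1).max()

def directed_h_mod(A, B):
    """Averaged (modified) directed distance: the mean replaces the maximum."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return d.min(axis=1).mean()

def hausdorff(A, B):
    return max(directed_h(A, B), directed_h(B, A))              # H(A, B)

def modified_hausdorff(A, B):
    return max(directed_h_mod(A, B), directed_h_mod(B, A))      # MHD

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.array([[0.1, 0.1], [1.2, 0.0], [0.0, 0.9], [3.0, 3.0]])  # one outlier point
print(hausdorff(A, B), modified_hausdorff(A, B))
```

The outlier (3, 3) dominates the ordinary Hausdorff distance but only mildly raises the modified value, which is exactly the noise sensitivity that the MHD is meant to alleviate.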
Objects can often be broken down into their individual components. Different
objects can have the same components but different structures. For structure
matching, most matching measures can be explained by the so-called “template
and spring” physical analogy model [7]. Considering that structure matching is the
matching between the reference structure and the structure to be matched, if the
reference structure is viewed as a structure depicted on a transparency, the matching
can be seen as moving the transparency on the structure to be matched and
deforming it to obtain a fit of the two structures.
Matching often involves similarities that can be quantitatively described. A
matching is not a simple correspondence, but a correspondence that is quantitatively
described according to a certain goodness index, and this goodness corresponds to
the matching measure. For example, the goodness of fit of two structures depends
both on how well the components of the two structures match each other and on the
amount of work required to deform the transparencies.
In practice, to achieve deformation, consider the model as a set of rigid templates
connected by springs, such as the template and spring model of a face as shown in
Fig. 9.3. Here the templates are connected by springs, and the spring functions
describe the relationship between the templates. The relationship between templates
generally has certain constraints. For example, in a face image, the two eyes are
generally on the same horizontal line, and the distance is always within a certain
range. The quality of the matching is a function of the goodness of the local fit of the
template and the energy required to elongate the spring to fit the structure to be
matched to the reference structure.
The matching measure of template and spring can be represented in the general form of Eq. (9.7), in which C_T represents the dissimilarity between the template d and the structure to be matched, C_S represents the dissimilarity between the structure to be matched and the object part e, C_M represents the penalty for missing parts, and F(·) is the mapping for
transformation of the reference structure template to the structure components to be
matched. F divides reference structures into two categories: structures that can be
found in the structures to be matched (belonging to set Y) and structures that are not
found in the structures to be matched (belonging to set N). Similarly, components
can also be divided into components that exist in the structure to be matched
(belonging to set E) and components that do not exist in the structure to be matched
(belonging to set M).
Normalization issues need to be considered in structure matching metrics because
the number of matched parts may affect the value of the final matching metric. For
example, if a “spring” always has a finite cost, such that the more elements matched,
the greater the total energy, that doesn’t mean that having more parts matched is
worse than having fewer parts. Conversely, delicate matching of a part of the
structure to be matched with a specific reference object often makes the remaining
part unmatched, and this “sub-match” is not as good as making most of the parts to
be matched closely matched. In Eq. (9.7), this is avoided by a penalty for missing
parts.
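As a deliberately simplified reading of this measure (Eq. (9.7) itself is not reproduced here), the sketch below sums a template cost over matched parts, a spring cost over pairs of matched parts, and a fixed penalty for each missing part; the cost functions, the dictionary-based mapping F, and the pairwise spring summation are illustrative assumptions rather than the book's exact notation.

```python
from itertools import combinations

def template_spring_cost(F, templates, cost_T, cost_S, missing_penalty):
    """Simplified template-and-spring matching measure.

    F               : dict mapping each matched reference template d to an image part e;
                      templates absent from F are treated as missing
    templates       : list of all reference templates
    cost_T(d, e)    : local dissimilarity between template d and part e
    cost_S(e1, e2)  : 'spring' cost between the parts assigned to two templates
    missing_penalty : fixed cost charged for every unmatched template
    """
    matched = [d for d in templates if d in F]
    c = sum(cost_T(d, F[d]) for d in matched)                               # template terms
    c += sum(cost_S(F[d1], F[d2]) for d1, d2 in combinations(matched, 2))   # spring terms
    c += missing_penalty * (len(templates) - len(matched))                  # missing parts
    return c
```

The explicit missing-part penalty is what keeps a small but tight partial match from automatically beating a broader match, as discussed above.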
The matching between two objects (or a model and an object) can be done by means
of the correspondence between them when there are feature points (see Sect. 5.2) or
specific landmark points (see [8]) on the object. In 2D space, if these feature points or
landmark points are different from each other (have different properties), matching
can be done according to two pairs of corresponding points. If these landmark points
or feature points are the same as each other (have the same attributes), at least three
noncollinear corresponding points (three points must be coplanar) need to be
determined on each of the two 2D objects.
In 3D space, if perspective projection is used, since any set of three points can
match any other set of three points, the correspondence between the two sets of
points cannot be determined at this point. Whereas if a weak perspective projection
is used, the matching is much less ambiguous.
Consider a simple case. Suppose a set of three points P1, P2, and P3 on the object
are on the same circle, as shown in Fig. 9.4a. Suppose the center of gravity of the
triangle is C, and the straight lines connecting C and P1, P2, and P3 intersect the
circle at points Q1, Q2, and Q3, respectively. Under weak perspective projection
conditions, the distance ratio PiC:CQi remains unchanged after projection. In this
way, the circle will become an ellipse after projection (but the line will still be a line
after projection, and the distance ratio will not change), as shown in Fig. 9.4b. If P1, P2, and P3 can be observed in the image, C can be calculated, and then the positions of Q1, Q2, and Q3 can be determined. This leaves six points to determine the position and parameters of the ellipse (actually at least five points are required).
Once the ellipse is determined, the matching becomes an ellipse matching (see
Subsection 9.2.3).
If the distance ratio is calculated incorrectly, Qi will not fall on the circle, as shown in Fig. 9.4c. In this way, the ellipse passing through P1, P2, P3 and Q1, Q2, Q3 cannot be obtained after projection, and the above calculation becomes impossible.
For more general ambiguity cases, see Table 9.1, which gives the number of
solutions obtained when matching objects with corresponding points in the image in
each case. When the number of solutions is ≥ 2, there is ambiguity. All ambiguities occur when the points are coplanar, corresponding to perspective inversion. Any noncoplanar point (when there are more than three corresponding points) provides enough information to disambiguate. In Table 9.1, the two cases of coplanar points and noncoplanar points are considered, respectively, and perspective projection and weak perspective projection are also compared.
The matching between objects can also be done by means of their inertia equivalent
ellipses, which have been used in the registration work for 3D object reconstruction
from sequence images [9]. Unlike the matching based on the object contour, the
matching based on the equivalent ellipse of inertia is performed based on the entire
object region. For any object region, an inertia ellipse corresponding to it can be
obtained (e.g., see [8]). With the help of the inertia ellipse corresponding to the
object, an equivalent ellipse can be further calculated for each object. From the point
of view of object matching, since each object in the image pair to be matched can be
represented by its equivalent ellipse, the matching problem of the object can be
transformed into the matching of its equivalent ellipse. A schematic diagram is
shown in Fig. 9.5.
In general object matching, the main consideration is the deviation caused by
translation, rotation, and scale transformation, and the geometric parameters
corresponding to these transformations need to be obtained. For this purpose, the parameters required for translation, rotation, and scale transformation can be calculated from the center coordinates of the equivalent ellipse, the orientation angle (defined as the angle between the major axis of the ellipse and the positive X-axis), and the lengths of the major axes.
First consider the center coordinates (x_c, y_c) of the equivalent ellipse, that is, the barycentric coordinates of the object. Assuming that the object region contains N pixels in total, then:

x_c = \frac{1}{N}\sum_{(x, y) \in \text{object}} x    (9.8)

y_c = \frac{1}{N}\sum_{(x, y) \in \text{object}} y    (9.9)
The translation parameter can be calculated according to the difference of the center
coordinates of the two equivalent ellipses. Secondly, the orientation angle \phi of the equivalent ellipse can be obtained from the slopes k and l of the two main axes of the corresponding inertia ellipse (let A be the moment of inertia of the object rotating around the X-axis and B be the moment of inertia of the object rotating around the Y-axis).
The rotation parameter can be calculated from the difference in the orientation angle
of the two ellipses. Finally, the two semi-axis lengths (a and b) of the equivalent ellipse reflect the object size. If the object itself is an ellipse, it is identical to its equivalent ellipse. In general, the equivalent ellipse of the object approximates the object in terms of moment of inertia and area (but not both exactly at the same time). Here, the axis length needs to be normalized by the object area M. After normalization, when A < B, the length a of the semi-major axis of the equivalent ellipse can be calculated from A, B, and the product of inertia H.
The scaling parameter can be calculated from the length ratio of the major axes of the two ellipses. The three transformation parameters of the geometric correction required for matching the two objects can be calculated independently, so each transformation in the equivalent ellipse matching can be performed sequentially [10].
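The quantities used above can be estimated from a binary object mask with standard image moments, as in the following sketch; the particular normalization of the axis lengths is one common variant and need not coincide exactly with the book's area-based normalization.

```python
import numpy as np

def equivalent_ellipse(mask):
    """Centroid, orientation angle, and semi-axes of the inertia-equivalent ellipse
    of a binary region, computed from (per-pixel) central second moments."""
    ys, xs = np.nonzero(mask)
    N = xs.size
    xc, yc = xs.sum() / N, ys.sum() / N                 # centroid, Eqs. (9.8)-(9.9)
    x, y = xs - xc, ys - yc
    mu20, mu02, mu11 = (x * x).mean(), (y * y).mean(), (x * y).mean()
    phi = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)       # orientation of the major axis
    common = np.sqrt((mu20 - mu02) ** 2 + 4 * mu11 ** 2)
    a = np.sqrt(2 * (mu20 + mu02 + common))             # semi-major axis
    b = np.sqrt(2 * (mu20 + mu02 - common))             # semi-minor axis
    return (xc, yc), phi, (a, b)

# Example: an axis-aligned filled rectangle
mask = np.zeros((60, 80), dtype=bool)
mask[20:40, 10:70] = True
print(equivalent_ellipse(mask))
```

The translation, rotation, and scaling parameters between two objects then follow from the differences of the centroids, the orientation angles, and the axis lengths of their equivalent ellipses, as described above.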
Matching with inertia equivalent ellipse is more suitable for matching irregular
objects. Figure 9.6 shows an example of matching two adjacent cell profile images in
the process of reconstructing 3D cells from sequential medical profile images.
Figure 9.6a shows two cross-sectional views of the same cell on two adjacent
sections. The size and shape of the two cell sections, as well as their position and
orientation in the image, are different due to the effects of translation and rotation
during sectioning. Considering that the changes in the structures inside and around
the cell are large, matching only considering the contour does not work very well.
Figure 9.6b shows the result obtained after calculating the equivalent ellipses of the
cell profiles and then matching them. It can be seen that the positions and
orientations of the two cell profiles are reasonably matched, which lays a solid
foundation for the subsequent 3D reconstruction.
In this way, all profiles on a sequence of sections can be registered (Fig. 9.7 only takes one profile as an example). This strategy essentially relies on spatial relationships [12] for matching.
Referring to the flow diagram in Fig. 9.7, it can be seen that there are six main
steps in dynamic pattern matching:
1. Select a matched profile from the matched section.
2. Construct the pattern representation of the selected matched profile.
3. Determine the candidate region of the to-be-matched profile (a priori knowledge
can be used to reduce the amount of computation and ambiguity).
4. Select the profile to-be-matched in the candidate region.
5. Construct the pattern representation of each selected profile to-be-matched.
6. Use the similarity between profile patterns to determine the correspondence between profiles.
Since the distribution of cell profiles on a section is not uniform, in order to complete the above matching steps, it is necessary to dynamically establish a pattern representation for each profile that can be used for matching. Here, the unique pattern of a profile can be constructed by using the relative positional relationship of the profile to its several adjacent profiles. The pattern thus constructed can be represented by a pattern vector. Assuming that the relationship used is the length and orientation of the line between each profile and its adjacent profiles (or the included angle between the lines), then the two profile patterns P_l and P_r that need to be matched on two adjacent sections (both using vector representation) can be written as:

P_l = [x_{l0}, y_{l0}, d_{l1}, \theta_{l1}, \ldots, d_{lm}, \theta_{lm}]^{\mathrm{T}}    (9.12)

P_r = [x_{r0}, y_{r0}, d_{r1}, \theta_{r1}, \ldots, d_{rn}, \theta_{rn}]^{\mathrm{T}}    (9.13)
In these equations, (x_{l0}, y_{l0}) and (x_{r0}, y_{r0}) are the coordinates of the two central (matching) profiles, respectively; each d represents the length of the connecting line between another profile on the same section and the matching profile; and each \theta represents the angle between the connecting lines from the matching profile to two adjacent surrounding profiles. Note that m and n can be different here. When m differs from n, only a part of the points constructing the patterns is used for matching. In
from n, a part of the points constructing patterns can also be selected for matching. In
addition, the selection of m and n should be the result of the balance between the
amount of calculation and the uniqueness of the pattern, and the specific value can be
adjusted by determining the radius of the pattern (i.e., the largest d, such as d2 in
Fig. 9.8a). The entire pattern can be seen as contained in a circle with a defined radius
of action.
In order to match the profiles, the corresponding patterns need to be translated and
rotated. The pattern constructed above can be called an absolute pattern because it contains the absolute coordinates of the central profile. Figure 9.8a gives an example of P_l. The absolute pattern has rotation invariance about the origin (the central profile); that is, after the entire pattern is rotated, d and \theta remain unchanged. However, as can be seen from Fig. 9.8b, it does not have translation invariance, because after the entire pattern is translated, both x_0 and y_0 change.
In order to obtain translation invariance, the coordinates of the center point in the
absolute pattern can be removed, and the relative pattern can be constructed as
follows:
Q_l = [d_{l1}, \theta_{l1}, \ldots, d_{lm}, \theta_{lm}]^{\mathrm{T}}    (9.14)

Q_r = [d_{r1}, \theta_{r1}, \ldots, d_{rn}, \theta_{rn}]^{\mathrm{T}}    (9.15)
The relative pattern corresponding to the absolute pattern in Fig. 9.8a is shown in
Fig. 9.9a.
It can be seen from Fig. 9.9b that the relative pattern is not only rotationally invariant but also translationally invariant. In this way, two relative pattern representations can be matched by rotation and translation, and their similarity can be calculated, so as to achieve the purpose of matching profiles.
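A small sketch of this idea follows: the relative pattern of a profile is built from the distances and angles to its nearest neighboring profiles, and two patterns are compared after cyclically aligning their angular components. The neighbor count, the alignment-by-shift step, and the similarity score are illustrative choices, not the book's exact procedure.

```python
import numpy as np

def relative_pattern(center, points, k=4):
    """Relative pattern of a profile: (distance, angle) pairs to its k nearest
    neighbors, sorted by angle so the pattern does not depend on point order."""
    vec = points - center
    d = np.linalg.norm(vec, axis=1)
    idx = np.argsort(d)[1:k + 1]                   # skip the profile itself (distance 0)
    ang = np.arctan2(vec[idx, 1], vec[idx, 0])
    order = np.argsort(ang)
    return d[idx][order], ang[idx][order]

def pattern_distance(p1, p2):
    """Compare two relative patterns of equal size; all cyclic shifts of the second
    pattern are tried, as a crude stand-in for the rotation alignment in the text."""
    d1, a1 = p1
    d2, a2 = p2
    rel1 = np.angle(np.exp(1j * (a1 - a1[0])))     # angles relative to the first neighbor
    best = np.inf
    for s in range(len(d2)):
        d2s, a2s = np.roll(d2, s), np.roll(a2, s)
        rel2 = np.angle(np.exp(1j * (a2s - a2s[0])))
        cost = np.abs(d1 - d2s).sum() + np.abs(np.angle(np.exp(1j * (rel1 - rel2)))).sum()
        best = min(best, cost)
    return best
```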
Figure 9.10 shows the distribution of cell profiles on two adjacent medical
sections in practice [12], where each cell profile is represented by a dot. Since the
diameter of the cells is much larger than the thickness of the sections, many cells span multiple sections. In other words, there should be many corresponding cell profiles on adjacent sections. However, as can be seen from Fig. 9.10, the distribution of
points on each section is very different, and the number of points is also very
different; there are 112 in Fig. 9.10a and 137 in Fig. 9.10b. The reasons include
the following: some cell profiles in Fig. 9.10a are the last profiles of cells and do not
continue to extend to Fig. 9.10b, and some cell profiles in Fig. 9.10b are new
beginnings of cells and not continued from Fig. 9.10a.
Using dynamic pattern matching to match the cell profiles in these two sections
resulted in 104 profiles in Fig. 9.10a finding the correct corresponding profiles
(92.86%) in Fig. 9.10b, while there are eight profiles with errors (7.14%).
From the analysis of dynamic pattern matching, its main characteristics are as
follows: The pattern is established dynamically and the matching is completely
automatic. This method is quite general and flexible, and its basic idea can be
applied to a variety of applications [9].
Matching and registration are two closely related concepts with many technical
similarities. Many registration tasks are accomplished with the aid of various
matching techniques. A careful analysis, however, reveals some differences between the two. The meaning of registration is generally narrower, mainly referring to the establishment of correspondence between images obtained at different times or from different spatial positions, especially the geometric correspondence (geometric correction). The final effect is often reflected at the pixel level. Matching, in contrast, can consider the geometric properties of the image, the grayscale properties of the image, and even other abstract properties and attributes of the image. From this point of view, registration
can be seen as matching of lower-level representations, while generalized matching
can include registration. By the way, the main difference between image registration
and stereo matching introduced in Chaps. 4 and 5 is that the former needs to establish
the relationship between point pairs and calculate the coordinate transformation
between the two images from this correspondence. The latter only needs to establish
the correspondence between point pairs and then calculate the disparity/parallax for
each pair of points separately.
(the Fourier power spectrum can be used to calculate when the rotation and scale
change). The calculation of the phase correlation between the two images can be
carried out by means of phase estimation of the cross-power spectrum. Suppose two
images f1(x, y) and f2(x, y) have the following simple translation relationship in the
spatial domain:

f_2(x, y) = f_1(x - x_0, y - y_0)    (9.16)

According to the translation property of the Fourier transform, their transforms satisfy:

F_2(u, v) = F_1(u, v)\exp[-\mathrm{j}2\pi(ux_0 + vy_0)]    (9.17)

so the normalized cross-power spectrum is:

\frac{F_2(u, v)\,F_1^{*}(u, v)}{|F_2(u, v)\,F_1^{*}(u, v)|} = \exp[-\mathrm{j}2\pi(ux_0 + vy_0)]    (9.18)

where the inverse Fourier transform of exp[-j2\pi(ux_0 + vy_0)] is \delta(x - x_0, y - y_0). It can be seen that the relative translation of the two images f_1(x, y) and f_2(x, y) in space is (x_0, y_0). The amount of translation can be determined by searching for the location of the maximum value (the pulse) in the inverse-transform result.
The steps of the Fourier transform-based phase correlation algorithm are sum
marized below:
1. Calculate the Fourier transforms F1(u, v) and F2(u, v) of the two images f1(x, y)
and f2(x, y) to be registered.
2. Filter out the DC component and high-frequency noise in the spectrum, and
calculate the product of the spectrum components.
3. Calculate the normalized cross-power spectrum using Eq. (9.18).
4. Perform inverse Fourier transform on the normalized cross-power spectrum.
5. Search for the coordinates of the peak in the result; these give the relative translation. A minimal sketch of the whole procedure is given after this list.
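A compact NumPy sketch of the procedure (omitting the spectrum filtering of step 2 and assuming two equal-sized single-channel images) is given below; the recovered shift is defined modulo the image size.

```python
import numpy as np

def phase_correlation(f1, f2):
    """Estimate the integer translation between two equal-sized images with the
    Fourier-transform-based phase correlation method (steps 1 and 3-5 above)."""
    F1, F2 = np.fft.fft2(f1), np.fft.fft2(f2)
    cross = F1 * np.conj(F2)                        # cross-power spectrum
    cross /= np.maximum(np.abs(cross), 1e-12)       # keep only the phase
    corr = np.real(np.fft.ifft2(cross))             # impulse at the relative shift
    return np.unravel_index(np.argmax(corr), corr.shape)

# Shift a random image by (5, 12) and recover the shift
rng = np.random.default_rng(0)
f1 = rng.random((128, 128))
f2 = np.roll(f1, shift=(5, 12), axis=(0, 1))
print(phase_correlation(f2, f1))                    # -> (5, 12)
```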
The calculation amount of the abovementioned registration method is only related
to the size of the images and has nothing to do with the relative positions between the
images or whether they overlap. The method only utilizes the phase information in
the cross-power spectrum, which is simple to calculate and insensitive to the
brightness change between images and can effectively overcome the influence of
illumination changes. Since the obtained correlation peaks are relatively sharp and
prominent, higher registration accuracy can be obtained.
closest Euclidean distance to the second closest Euclidean distance is large (greater
than a threshold). Generally, the smaller the distance of the first matching point pair, the better the matching quality. Next, the average distance over all matching point pairs is calculated and subtracted from each nearest Euclidean distance. If the result is negative, the matching point pair is kept, and RANSAC is then performed to finally select the true matching point pairs.
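The selection rule described in this paragraph can be sketched as follows. The descriptor arrays, the ratio threshold, and the final robust-estimation step are illustrative assumptions; the text does not fix particular values or a particular model estimator.

```python
import numpy as np

def select_matches(desc1, desc2, ratio=1.5):
    """Candidate match selection: a nearest/second-nearest ratio test followed by
    keeping only pairs whose nearest distance lies below the average distance."""
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    two_nearest = np.argsort(d, axis=1)[:, :2]
    matches = []
    for i, (j1, j2) in enumerate(two_nearest):
        if d[i, j2] / max(d[i, j1], 1e-12) >= ratio:     # second neighbor much farther
            matches.append((i, j1, d[i, j1]))
    if not matches:
        return []
    mean_d = np.mean([dist for _, _, dist in matches])
    kept = [(i, j) for i, j, dist in matches if dist - mean_d < 0]  # below-average only
    # A robust estimator such as RANSAC on a geometric model would then be applied
    # to the kept pairs to select the final true matches.
    return kept
```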
Scene images usually contain many objects. The representation of image space
relationship is to describe the geometric relationship of image objects in Euclidean
space. Due to the complexity of the real world and the randomness of scene
shooting, the imaging of the same object on different images can vary significantly.
It is difficult to accurately match images by relying only on an overall representation of the image to calculate image similarity. On the other hand, when
the same object is imaged in different images, its imaging morphology changes
significantly, but its spatial relationship to neighboring objects generally remains
stable (see Sect. 9.3). A method to solve the whole image matching problem by
inferentially analyzing the spatial proximity of objects in the image exploits this
fact [15].
The flowchart of the image matching algorithm is shown in Fig. 9.12. First, for
image pairs from the scene, object blocks are detected and depth features are
extracted for matching, thereby determining the spatial proximity of objects and
constructing a spatial proximity map of the objects in the scene. Then, based on the
constructed spatial proximity map, the spatial proximity relationship of the objects in
the image is analyzed, and the proximity of the image pair is quantitatively calcu
lated. Finally, find matching images.
Some details are as follows:
1. In order to extract the deep features of the object block, an object block feature
extraction network based on the contrast mechanism is constructed. The network
contains two identical channels with shared weights. Each channel is a deep
convolutional network with seven convolutional layers and two fully connected
layers. Based on the depth features, the same object blocks in the two images are
matched.
2. The spatial proximity map of the scene objects is constructed, and the spatial
proximity relationship of different objects in the scene is analyzed by reasoning
according to the distribution of each object on the prior image. The construction
process is an iterative search process that includes an initialization step and an
update step [15]. The constructed spatial proximity map summarizes all objects
present in the scene and quantitatively represents the proximity between different
objects, where the same object blocks on different images are aggregated in the
same node.
3. In order to determine the matched image, the nodes of the object in the image are
searched in the spatial proximity graph, and the proximity relationship between
the objects in the image is determined according to the connection weight
between the nodes. Each test image may include several object blocks, and
their belonging nodes can be searched in the node set.
4. In order to calculate the spatial proximity of image pairs, it is necessary to detect
the object blocks contained in the image and determine the nodes to which each
object block belongs to form a node set. The connection weights between the
belonging nodes represent the proximity relationship between the object blocks in the image. The spatial relationship between two images can be represented by the proximity relationship of the object blocks in the images, and the spatial relationship matching of the images is completed by quantitatively calculating the spatial proximity of the images, as sketched below.
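One way to read steps 3 and 4 in code is sketched below: each image is reduced to the set of graph nodes its object blocks belong to, and the proximity of an image pair is accumulated from the connection weights between the two node sets. The adjacency-dictionary layout and the summation rule are illustrative assumptions about details that reference [15] specifies.

```python
def image_pair_proximity(nodes_a, nodes_b, proximity_graph):
    """Quantitative spatial proximity of an image pair.

    nodes_a, nodes_b : sets of spatial-proximity-graph nodes to which the object
                       blocks detected in the two images belong
    proximity_graph  : dict mapping frozenset({n1, n2}) to a connection weight
    """
    score = 0.0
    for na in nodes_a:
        for nb in nodes_b:
            if na == nb:
                score += 1.0                       # the same object seen in both images
            else:
                score += proximity_graph.get(frozenset((na, nb)), 0.0)
    return score

graph = {frozenset(("desk", "chair")): 0.8, frozenset(("desk", "lamp")): 0.5}
print(image_pair_proximity({"desk", "lamp"}, {"chair", "desk"}, graph))
```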
The objective scene can be decomposed into multiple objects, and each object can be
decomposed into multiple components/parts, and there are different relationships
between them. The images collected from the objective scene can be represented by
the collection of various interrelationships between objects, so relationship matching is an important step in image understanding. Similarly, the object in the image can be represented by the set of interrelationships among its various parts, so the object can also be identified by using relationship matching. The two representations to be matched in relationship matching are relations; one of them is usually called the object to-be-matched, and the other is called the model.
The main steps of relationship matching are described below. Here we consider
the case that an object to-be-matched is given, and it is required to find a model that
matches it. There are two relation sets: Xl and Xr, where Xl belongs to the object to
be-matched and Xr belongs to the model; they are respectively represented as:
In the equations, R_{l1}, R_{l2}, ..., R_{lm} and R_{r1}, R_{r2}, ..., R_{rn} represent the representations of the different relationships between the components in the object to-be-matched and in the model, respectively.
For example, Fig. 9.13a shows a schematic representation of an object in an image (think of a front view of a table). It has three elements, which can be expressed as Q_l = {A, B, C}, and the set of relations between these elements can be expressed as X_l = {R1, R2, R3}. Among them, R1 represents the connection relationship, R1 = {(A, B), (A, C)}; R2 represents the upper-lower relationship, R2 = {(A, B), (A, C)}; and R3 represents the left-right relationship, R3 = {(B, C)}. Figure 9.13b gives a schematic diagram of the object in another image (it can be seen as a front view of a table with a middle drawer), which has four elements and can be represented as Q_r = {1, 2, 3, 4}. The set of relationships between the elements can be expressed as X_r = {R1, R2, R3}. Among them, R1 represents the connection relationship, R1 = {(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)}; R2 represents the upper-lower relationship, R2 = {(1, 2), (1, 3), (1, 4)}; and R3 represents the left-right relationship, R3 = {(2, 3), (2, 4), (4, 3)}.
Now consider the distance between X_l and X_r, denoted dis(X_l, X_r). dis(X_l, X_r) is composed of the differences of the corresponding items expressed by each pair of corresponding relations in X_l and X_r, namely, of each dis(R_l, R_r). The matching of X_l and X_r is thus the matching of each pair of corresponding relations in the two sets. The following first considers one such pair and uses R_l and R_r to represent the corresponding relations, respectively.
The distance between the two relational expressions R_l and R_r is defined as the weighted sum of the errors in Eq. (9.25) (the effects of the errors are weighted, with weights W).
From the previous analysis, to match R_l and R_r, we should try to find a corresponding mapping that minimizes the error (i.e., the distance) between R_l and R_r. Note that E is a function of p, so the required corresponding mapping p is the one that minimizes this error.
Going further back to Eq. (9.19) and Eq. (9.20), to match two relation sets X_l and X_r, a series of corresponding mappings p_j should be found such that:

\mathrm{dis}_C(X_l, X_r) = \inf_{p_j}\left\{\sum_{j=1}^{m} W_j\, C[E_j(p_j)]\right\}    (9.29)
R_r = \{(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)\}

When there is no element 4 in Q_r, R_r = {(1, 2), (1, 3)}, which gives p = {(A, 1), (B, 2), (C, 3)}, p^{-1} = {(1, A), (2, B), (3, C)}, R_l ⊕ p = {(1, 2), (1, 3)}, and R_r ⊕ p^{-1} = {(A, B), (A, C)}. The four errors in Eq. (9.25) can be computed in the same way; for the full R_r above they include, for example:

E_4 = \{R_l - (R_r \oplus p^{-1})\} = \{(A, B), (A, C)\} - \{(A, B), (A, C)\} = \varnothing

E_3 = \{(B, A), (C, A)\} - \{(A, B), (A, C)\} = \varnothing

E_2 = \{(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)\} - \{(2, 4), (3, 4)\} = \{(1, 2), (1, 3), (1, 4)\}

If only the connection relationship is considered, the order of the two elements in each pair can be exchanged. From the above results, dis(R_l, R_r) = {(1, 2), (1, 3), (1, 4)}. In terms of error terms, C(E_1) = 0, C(E_2) = 3, C(E_3) = 0, C(E_4) = 0, so dis_C(R_l, R_r) = 3.
Matching is to use the model stored in the computer to identify unknown patterns
in the object to be matched, so after finding a series of corresponding mappings pj, it
is necessary to determine their corresponding models. Assuming that the object
to-be-recognized, X, defined by Eq. (9.19), can find a correspondence that conforms to Eq. (9.29) for each of the multiple models Y_1, Y_2, ..., Y_L (each represented as in Eq. (9.20)), and that these correspondences are p_1, p_2, ..., p_L, respectively. The distance dis_C(X, Y_q) after X is matched with each model according to its corresponding mapping can then be obtained. If, for model Y_q, its distance from X satisfies:

\mathrm{dis}_C(X, Y_q) = \min_{1 \le i \le L} \mathrm{dis}_C(X, Y_i)    (9.30)

then for q ≤ L, X ∈ Y_q holds; that is, the object to-be-matched, X, is considered to match the model Y_q.
Summarizing the above discussion, it can be seen that the matching process can
be summarized into four steps.
1. Determine the same relationship (relationship between parts), that is, for a given
relationship in Xl, determine a relationship in Xr that is the same as it. This
requires m x n comparisons:
X_l = \begin{bmatrix} R_{l1} \\ R_{l2} \\ \vdots \\ R_{lm} \end{bmatrix} \quad\Longleftrightarrow\quad X_r = \begin{bmatrix} R_{r1} \\ R_{r2} \\ \vdots \\ R_{rn} \end{bmatrix}    (9.31)
2. Determine the corresponding mappings; that is, for each pair of same relations R_l and R_r, find the candidate mappings and their distances:

R_l \Rightarrow R_r: \quad p_1: \mathrm{dis}_C(R_l, R_r),\ \ p_2: \mathrm{dis}_C(R_l, R_r),\ \ \ldots,\ \ p_k: \mathrm{dis}_C(R_l, R_r)    (9.32)
3. Determine the distance between the two relation sets; that is, collect the distances of the corresponding relation pairs:

\mathrm{dis}_C(X_l, X_r) \Longleftarrow \{\mathrm{dis}_C(R_{l1}, R_{r1}),\ \mathrm{dis}_C(R_{l2}, R_{r2}),\ \ldots,\ \mathrm{dis}_C(R_{lm}, R_{rm})\}    (9.33)

Note that m ≤ n is assumed in the above equation; that is, only m pairs of relations can find correspondences, and the remaining n − m relations exist only in the relation set X_r.
4. Determine the model to which the object belongs (find the minimum among the L values of dis_C(X_l, X_r)):

X: \quad p_1 \Rightarrow Y_1 \Rightarrow \mathrm{dis}_C(X, Y_1),\ \ p_2 \Rightarrow Y_2 \Rightarrow \mathrm{dis}_C(X, Y_2),\ \ \ldots,\ \ p_L \Rightarrow Y_L \Rightarrow \mathrm{dis}_C(X, Y_L)    (9.34)

A small sketch of this set-difference-based distance computation is given below.
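The set-difference bookkeeping used in this section can be sketched as follows. The error terms are formed by mapping one relation through the correspondence p (or its inverse) and differencing with the other relation, and the cost is simply the number of unmatched tuples; this is one natural reading and does not reproduce the weighting of Eq. (9.29).

```python
def map_relation(R, p):
    """Apply the correspondence p (a dict) to every tuple of relation R; tuples with
    unmapped elements are dropped. Pairs are sorted, i.e., treated as unordered."""
    return {tuple(sorted(p[x] for x in t)) for t in R if all(x in p for x in t)}

def relation_distance(R_l, R_r, p):
    """Unweighted relational distance between R_l and R_r under the mapping p:
    the number of tuples that fail to correspond in either direction."""
    p_inv = {v: k for k, v in p.items()}
    Rl_n = {tuple(sorted(t)) for t in R_l}
    Rr_n = {tuple(sorted(t)) for t in R_r}
    E1 = map_relation(R_l, p) - Rr_n          # mapped object tuples absent from the model
    E2 = Rr_n - map_relation(R_l, p)          # model tuples not covered by the object
    E3 = map_relation(R_r, p_inv) - Rl_n
    E4 = Rl_n - map_relation(R_r, p_inv)
    return len(E1) + len(E2) + len(E3) + len(E4)

R_l = {("A", "B"), ("A", "C")}
R_r = {(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)}
print(relation_distance(R_l, R_r, {"A": 1, "B": 2, "C": 3}))   # -> 3
```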
The following first introduces some basic concepts and definitions of graph theory.
Below, the elements in set V are represented by uppercase letters, and the
elements in set E are represented by lowercase letters. Generally, the edge
e formed by the unordered pair of vertices A and B is recorded as e ↔ AB or e ↔ BA, and A and B are called the endpoints of e; the edge e is said to join A and B. In this case, vertices A and B are incident with edge e, and edge e is incident with vertices A and B. Two vertices incident with the same edge are adjacent, as are two edges that share a common vertex. Two edges are called multiple edges or parallel edges if they have the same two endpoints. If the two endpoints of an edge are the same, it is called a loop; otherwise, it is called a link.
In the definition of a graph, the two elements (i.e., two vertices) of each unordered
pair can be the same or different, and any two unordered pairs (i.e., two edges) can be
the same or different. Different elements can be represented by vertices of different
colors, which is called the chromaticity of vertices (meaning that vertices are labeled
with different colors). Different relationships between elements can be represented
by edges of different colors, which is called edge chromaticity (meaning that edges
are marked with different colors). So a generalized colored graph G can be
expressed as:
where V is the vertex set, C is the vertex chromaticity set, E is the edge set, and S is
the edge chromaticity set. They are, respectively:
S = {sViVjlVi, Vj 2 V} (9:40)
Among them, each vertex can have a color, and each edge can also have a color.
If the vertices of the graph are represented by dots, and the edges are represented by
straight lines or curves connecting the vertices, the geometric representation or
geometric realization of the graph can be obtained. Graphs with edges greater than
or equal to 1 can have infinitely many geometric representations.
For example, suppose V(G) = {A, B, C} and E(G) = {a, b, c, d}, where a ↔ AB, b ↔ AB, c ↔ BC, d ↔ CC. The graph G can then be represented by the diagram given in Fig. 9.14.
In Fig. 9.14, edges a, b, and c are adjacent to each other, and edges c and d are
adjacent to each other, but edges a and b are not adjacent to edge d. Likewise,
vertices A and B are adjacent, and vertices B and C are adjacent, but vertices A and
C are not adjacent. In terms of edge types, edges a and b are multiple edges, edge d is
a loop, and edges a, b, and c are links.
According to the geometric representation of the graph introduced above, the two
objects in Fig. 9.13 can be represented by two colored graphs as shown in Fig. 9.15,
where the vertex chromaticity is distinguished by the vertex shape, and the edge
chromaticity is distinguished by the line type. It can be seen that the information
reflected by the colored map is more comprehensive and intuitive.
For two graphs G and H, if V(H) ⊆ V(G) and E(H) ⊆ E(G), then graph H is called a subgraph of graph G, denoted H ⊆ G. Conversely, graph G is called a supergraph of graph H. If graph H is a subgraph of graph G but H ≠ G, then graph H is called a proper subgraph of graph G, and graph G is called a proper supergraph of graph H [16].
If H ⊆ G and V(H) = V(G), then graph H is called a spanning subgraph of graph G, and graph G is called a spanning supergraph of graph H. For example, Fig. 9.16a gives a graph G, while Fig. 9.16b, Fig. 9.16c, and Fig. 9.16d each give a spanning subgraph of graph G (they are all spanning subgraphs of G but distinct from each other).
If all double edges and loops are removed from a graph G, the resulting simple
spanning subgraph is called the underlying simple graph of the graph G. The three
spanning subgraphs given in Fig. 9.16b-d have only one underlying simple graph,
Fig. 9.16d. The four operations to obtain the underlying simple graph are described
below with the help of the graph G given in Fig. 9.17a.
1. For a non-empty vertex subset V′(G) ⊆ V(G) of graph G, the subgraph of graph G that takes V′(G) as its vertex set and takes all edges with both endpoints in V′(G) as its edge set is called the induced subgraph of graph G, denoted G[V′(G)] or G[V′]. Figure 9.17b gives the graph G[A, B, C] = G[a, b, c].
2. Similarly, for a non-empty edge subset E′(G) ⊆ E(G) of graph G, the subgraph of graph G that takes E′(G) as its edge set and takes all the endpoints of these edges as its vertex set is called the edge-induced subgraph of graph G, denoted G[E′(G)] or G[E′]. Figure 9.17c gives the graph G[a, d] = G[A, B, D].
3. For a non-empty proper vertex subset V′(G) ⊂ V(G) of graph G, the subgraph of graph G whose vertex set is obtained by removing V′(G) from V(G), and whose edge set is obtained by removing all edges incident with vertices in V′(G), is called the remaining subgraph of graph G, denoted G − V′. Here G − V′ = G[V \ V′]. Figure 9.17d gives the graph G − {A, D} = G[B, C] = G[{A, B, C, D} − {A, D}].
4. For a non-empty proper edge subset E′(G) ⊂ E(G) of graph G, the subgraph of graph G whose edge set is obtained by removing E′(G) from E(G) is the spanning subgraph G − E′ of graph G. Note here that G − E′ and G[E \ E′] have the same set of edges, but they are not necessarily identical: the former is always a spanning subgraph, while the latter is not necessarily so. An example of the former is given in Fig. 9.17e, G − {c} = G[a, b, d, e]. An example of the latter is given in Fig. 9.17f, G[{a, b, c, d, e} − {a, b}] = G[c, d, e].
According to the definition of a graph, two graphs G and H are identical if and only if V(G) = V(H) and E(G) = E(H), and two identical graphs can be represented by the same geometric representation. For example, graphs G and H in Fig. 9.18 are identical. However, if two graphs can be represented
(Fig. 9.18 shows three graphs: G = [V, E], H = [V, E], and I = [V′, E′].)
by the same geometric representation, they are not necessarily identical. For example, the graphs G and I in Fig. 9.18 are not identical (the labels of the vertices and edges are different), although they can be represented by two geometric representations of the same shape.
For two graphs that have the same geometric representation but are not identical,
as long as the labels of the vertices and edges of one graph are appropriately
renamed, a graph identical to the other graph can be obtained; such two graphs are called isomorphic. In other words, an isomorphism of two graphs indicates that there is a one-to-one correspondence between the vertices and edges of the two graphs. The isomorphism of two graphs G and H can be written as G ≅ H, and the necessary and sufficient condition is that one-to-one mappings P: V(G) → V(H) and Q: E(G) → E(H) exist between V(G) and V(H) as well as between E(G) and E(H), respectively, and that the mappings P and Q maintain the incidence relationship, that is, Q(e) = P(A)P(B), ∀e ↔ AB ∈ E(G), as shown in Fig. 9.19.
It can be seen from the previous definition that isomorphic graphs have the same
structure, and the only difference is that the labels of vertices or edges are not exactly
the same. Graph isomorphism is more focused on describing mutual relationships, so
graph isomorphism can have no geometric requirements, that is, it is more abstract
(of course, it can also have geometric requirements, i.e., more specific). Graph
isomorphism matching is essentially a tree search problem, where different branches
represent heuristics on different combinations of correspondences.
Now consider several cases of graph-to-graph isomorphism. For the sake of simplicity, all graph vertices and edges are unlabeled here; that is, all vertices are considered to have the same color, and all edges also have the same color. For clarity, monochromatic line graphs (a special case of G) are used. If, for example, it is necessary to find a common object in two scenes, the task can be transformed into the problem of double-subgraph isomorphism.
into the problem of double-subgraph isomorphism.
There are many algorithms for finding graph isomorphism. For example, each
graph to be determined can be converted into a certain standard form, so that
isomorphism can be easily determined. In addition, it is also possible to perform
an exhaustive search on the tree of possible matches between corresponding
vertices in the line graph, but this method requires a large amount of computation
when the number of vertices in the line graph is large.
A method that is less restrictive and converges faster than isomorphic methods
is association graph matching [17]. In associative graph matching, the graph is
defined as G = [V, P, R], where V represents the set of nodes, P represents the set
of unit predicates used for the nodes, and R represents the set of binary relations
between nodes. Here the predicate represents a statement that takes only one of
the two values TRUE or FALSE, and the binary relation describes the properties
that a pair of nodes has. Given two graphs, an associative graph can be
constructed. Associative graph matching is the matching between nodes and
nodes as well as binary relationships and binary relationships in two graphs.
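As a hedged illustration of how such an associative graph can be constructed (the data layout and helper names below are assumptions, not the formulation of [17]), each association node pairs one vertex from each graph whose unary predicates agree, and two association nodes are joined when the corresponding binary relations also agree. A large set of mutually connected association nodes (a large clique) then represents a set of mutually compatible correspondences.

# Sketch of association-graph construction from G1 = [V1, P1, R1] and G2 = [V2, P2, R2].
from itertools import combinations

def build_association_graph(v1, p1, r1, v2, p2, r2):
    # p1, p2: dict vertex -> set of TRUE predicates; r1, r2: dict (u, v) -> relation label
    nodes = [(a, b) for a in v1 for b in v2 if p1[a] == p2[b]]
    edges = set()
    for (a, b), (c, d) in combinations(nodes, 2):
        # join two association nodes when the binary relations between their members agree
        if a != c and b != d and r1.get((a, c)) == r2.get((b, d)):
            edges.add(((a, b), (c, d)))
    return nodes, edges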
When observing a 3D scene, what one sees is its (visible) surfaces, and when the 3D
scene is projected onto a 2D image, the individual surfaces form regions. The
boundaries of individual surfaces are shown as contours in the 2D image, and
these contours representing the object can form a line drawing of the object. For
relatively simple scenes, the line drawing can be labeled, that is, a 2D image with
contour labels can be used to represent the relationship between the various surfaces
of the 3D scene [18]. With the aid of such labels, 3D scenes can also be matched to
corresponding models in order to interpret the scene.
9.7.1.2 Limb
If a continuous surface in a 3D scene not only occludes a part of another surface but
also occludes other parts of itself (i.e., is self-occluding), and the change of the surface
normal direction is smooth, continuous, and perpendicular to the line of sight,
then the contour there is called a limb (often formed when a smooth
3D surface is viewed from the side). To represent a limb, two opposite arrows
"↔" can be added to the curve. The orientation of the 3D surface does not change
when travelling along the limb, whereas it changes
continuously when travelling in a direction that is not parallel to the limb.
A blade is a true (physical) edge of a 3D scene, while a limb is only an
(apparent) edge. When a blade or limb crosses the boundary or contour between
the occluding object surface and the occluded background surface, a jump edge with a
discontinuity in depth is created.
9.7.1.3 Crease

A crease is formed where two surfaces of the 3D scene meet at a (physical) edge; it is labeled "+" if the edge is convex and "-" if it is concave.
9.7.1.4 Mark
Marks are formed if parts of the 3D surface have different reflectivity. The marks are
not due to the 3D surface shape and can be labeled with an “M.”
9.7.1.5 Shadow
If a continuous surface in a 3D scene does not block a part of another surface from
the viewpoint but does block the light from the light source falling on that part, it
casts a shade (shadow) on that part of the second surface. Shades on surfaces are not
caused by the shape of the surface itself but are the result of other parts' effects on the
lighting. Shades can be marked with "S." There is a sudden change in lighting at the
shade boundary, called the lighting boundary.
Figure 9.21 gives examples of the contour labels introduced above. The
picture shows a hollow cylinder placed on a platform; there is a mark M on the
cylinder, and the cylinder casts a shade S on the platform. There are two limbs "↔"
on the two sides of the cylinder, and the top contour is divided into two parts by the
two limbs. The creases everywhere on the platform are convex, while the crease
between the platform and the cylinder is concave.
Next, we consider the inference analysis of the structure of the 3D object with the
help of the contour structure in the 2D image. Here, it is assumed that the surfaces of
the objects are all planes, and all the intersecting corners are formed by the intersection of three faces. Such a 3D object can be called a trihedral corner object,
such as the object represented by the two line drawings in Fig. 9.22. A small change
in viewpoint at this time will not cause a change in the topology of the line drawing,
that is, it will not cause the disappearance of faces, edges, and connections, and the
object is said to be in general position in this case.
The two line drawings in Fig. 9.22 are geometrically identical, but there are two
different 3D interpretations of them. The difference is that Fig. 9.22b labels three
more concave creases than Fig. 9.22a, so that the object in Fig. 9.22a appears to be
floating in the air, while the object in Fig. 9.22b appears to be attached to the
back wall.
In the drawings labeled only with {+, -, →}, "+" denotes a convex crease, "-" denotes
a concave crease, and "→" denotes an occluding edge. At this time, there
are four categories and 16 types of (topological) combinations of edge connections:
six types of L connections, four types of T connections, three types of arrow
connections, and three types of fork connections (Y connections),
as shown in Fig. 9.23.
If we consider all the ways of labeling vertices formed by the intersection of three faces,
there should be 64 labeling methods, but only the above 16 connection types are
reasonable. In other words, only line drawings that can be labeled with the 16 connection
types shown in Fig. 9.23 can physically exist. When a line drawing can be
labeled, its labeling provides a qualitative interpretation of the drawing.
Backtracking labeling assigns a label to each edge in turn and checks the consistency
of the new label against the labels already assigned. If the connection produced with the new label
contradicts or does not conform to the situations in Fig. 9.23, fall back and consider
another path; otherwise, continue to consider the next edge. If the labels assigned to
all the edges in this order are consistent, a labeling result is obtained (a complete path
to the leaves is obtained). Generally, more than one labeling result can be obtained
from the same line drawing, and it is necessary to use some additional information or
prior knowledge to obtain the final and unique judgment result.
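The following is a minimal sketch of this backtracking procedure. The catalog of legal junction labelings used here is a hypothetical, abbreviated stand-in for the 16 junction types of Fig. 9.23, and the label symbols are illustrative; a real implementation would enumerate the full catalog.

# Backtracking line-drawing labeling (illustrative sketch with an abbreviated catalog).
LABELS = ['+', '-', '>', '<']              # convex, concave, occluding (two directions)

LEGAL = {                                  # junction type -> allowed label tuples (abbreviated)
    'L':     {('>', '<'), ('<', '>'), ('+', '>'), ('<', '+'), ('-', '<'), ('>', '-')},
    'ARROW': {('>', '<', '+'), ('+', '+', '-'), ('-', '-', '+')},
    'FORK':  {('+', '+', '+'), ('-', '-', '-'), ('+', '-', '>')},
}

def label_drawing(junctions, edges):
    # junctions: dict name -> (type, ordered list of incident edge names)
    # edges: list of edge names, labeled one by one
    assignment, solutions = {}, []

    def consistent():
        for jtype, incident in junctions.values():
            labels = tuple(assignment.get(e) for e in incident)
            if None in labels:
                continue                   # junction not fully labeled yet
            if labels not in LEGAL[jtype]:
                return False
        return True

    def extend(i):
        if i == len(edges):
            solutions.append(dict(assignment))   # one complete, consistent labeling
            return
        for lab in LABELS:
            assignment[edges[i]] = lab
            if consistent():
                extend(i + 1)
            del assignment[edges[i]]       # backtrack and try another label
    extend(0)
    return solutions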
Now use the pyramid shown in Fig. 9.24 as an example to explain the basic steps
of backtracking labeling. The process of labeling with backtracking and the resulting
interpretation tree (including each step and final result) are shown in Table 9.2. In
Table 9.2, considering vertex A first, there are three cases that conform to Fig. 9.23.
For the first case according to Fig. 9.23, consider vertex B in turn. At this time,
among the three possible situations that may conform to Fig. 9.23, there are two
kinds of interpretations of edge AB that are inconsistent with Fig. 9.23 and one that
does not conform to Fig. 9.23 if we continue to consider vertex C. In other words,
there is no correct explanation for vertex A conforming to case 1 of Fig. 9.23. Next,
consider the second case according to Fig. 9.23, and so on.
As can be seen from all the interpretation trees in the table, there are three
complete paths (marked down to the leaves) that give three different interpretations
of the same line drawing. The search space of the whole explanation tree is quite
small, which indicates that the trihedral corner object has a rather strong constraint
mechanism.
9.8 Multimodal Image Matching
Generalized image matching aims to identify and correspond to the same or similar
relationship/structure/content from two or more images. Multimodal image
matching (MMIM) can be seen as a special case. Often, the images and/or objects to
be matched have significant nonlinear appearance differences, which are caused not
only by different imaging sensors but also by different imaging conditions (such as
day and night, weather changes, or across seasons) and input data types, such as
image-drawing-sketch and image-text.
The multimodal image matching problem can be formulated as: Given a reference
image IR and a matching image IM of different modalities, find the correspondence
between them (or the objects in them) according to their similarity. Objects can be
represented by the region they occupy or the features they have. Therefore, matching
techniques can be divided into region-based techniques and feature-based
techniques.
Region-based techniques take into account the intensity information of the object.
Two groups can be distinguished: the traditional group with handcrafted framework
and the recent group with learned framework.
A flowchart of the traditional region-based technique is shown in Fig. 9.25. It
includes three important modules: (i) metrics, (ii) transformation models, and (iii)
optimization methods [19].
9.8.1.1 Metrics
The accuracy of matching results depends on the metrics (matching criteria). Different metrics can be designed based on assumptions about the intensity relationship
between the two images. Commonly used manual metrics can be simply divided into
correlation-based methods and information theory-based methods.
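As a hedged illustration (assuming NumPy and gray-level patches of equal size; this is a sketch, not the implementation of any cited method), the following shows one metric of each family: normalized cross-correlation for the correlation-based group, and mutual information for the information theory-based group.

# Two region-based similarity metrics for gray-level patches a and b.
import numpy as np

def ncc(a, b):
    # normalized cross-correlation: invariant to linear intensity changes
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float((a * b).mean())

def mutual_information(a, b, bins=32):
    # joint intensity histogram -> joint and marginal probabilities
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])))

Mutual information only assumes a statistical (not linear) relationship between the two intensity distributions, which is why it is popular for multimodal matching.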
Transformation models can be divided into linear models and nonlinear models. The latter can be further divided into
physical models (derived from physical phenomena and represented by partial differential equations) and interpolation models (derived from interpolation or approximation theory).
Optimization methods are used to search for the best transformation based on a given
metric to achieve the desired matching accuracy and efficiency. Considering the
nature of the variables inferred by optimization methods, they can be simply divided
into continuous methods and discrete methods. Continuous optimization assumes
that the variables are real-valued and the objective function is differentiable;
discrete optimization assumes that the solution space is a discrete set.
The method classes and some typical technologies in each module are shown in
Table 9.3.
In recent years, deep learning techniques have been used to drive iterative
optimization processes or directly estimate geometric transformation parameters in
an end-to-end manner. The first type of methods is called deep iterative learning
methods, and the second type of methods is called deep transformation learning
methods. Depending on the training strategy, the latter can be roughly divided into
two categories: supervised methods and unsupervised methods. Table 9.4 lists some
typical methods in these three categories and their principles.
Table 9.3 The method classes and some typical technologies in each module

Metrics
  Correlation-based: cross-correlation [20]; normalized correlation coefficient (NCC) [21]
  Information theory-based: mutual information (MI) [22]; normalized mutual information (NMI) [23]; conditional mutual information (CMI) [24]

Transformation models
  Linear models: rigid-body, affine, and projective transformations [25]
  Nonlinear physical models: diffeomorphisms [26]; large deformation diffeomorphic metric mapping (LDDMM) [27]
  Nonlinear interpolation models: radial basis functions (RBF) [28]; thin plate splines (TPS) [29]; free-form deformation (FFD) [30]

Optimization methods
  Continuous methods: gradient descent [31]; conjugate gradient [31]
  Discrete methods: graph theory-based methods [32]; message passing [33]; linear programming [34]
Feature-based techniques usually have three steps: (1) feature detection, (2) feature
description, and (3) feature matching. In the feature detection and feature description
steps, the modal differences are suppressed, so the feature matching step can be done
well using the general approach. Depending on whether local image descriptors are
used or not, the feature matching step can be performed indirectly or directly. The
flowchart of the feature-based technique is shown in Fig. 9.26 [19].
The detected features usually represent specific semantic structures in images or the
real world. Commonly used features can be divided into corner features (the
intersection of two straight lines, usually located at textured regions or edges),
blob features (locally closed regions where pixels are considered similar and therefore distinct from surrounding neighborhoods), line/edge, and morphological region
features. The core idea of feature detection is to construct a response function to
distinguish different features, as well as flat and nonunique image regions.
Commonly used response functions can be further classified into gradient-based, intensity-based,
second-derivative-based, contour-curvature-based, region-segmentation-based, and learning-based functions.
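As one concrete example of a gradient-based response function, the following is a Harris-style corner response computed from the smoothed structure tensor. It is an illustrative sketch, not the specific detector of any method cited above; NumPy and SciPy are assumed to be available.

# Harris-style corner response (large positive values indicate corner-like regions).
import numpy as np
from scipy.ndimage import uniform_filter

def harris_response(img, k=0.04, window=3):
    img = img.astype(np.float64)
    gy, gx = np.gradient(img)                       # image gradients
    sxx = uniform_filter(gx * gx, size=window)      # smoothed structure-tensor entries
    syy = uniform_filter(gy * gy, size=window)
    sxy = uniform_filter(gx * gy, size=window)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace ** 2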
Deep learning has shown great potential for key-point detection, especially in two
images with significant appearance differences, which usually occurs in cross-modal
image matching. Three groups of commonly used convolutional neural network
(CNN)-based detectors are as follows: (i) supervised [43], (ii) self-supervised [44],
and (iii) unsupervised [45].
Feature description refers to the mapping of local intensities around feature points
into a stable and discriminative vector form, so that detected features can be matched
quickly and easily. Existing descriptors can be classified into floating-point descriptors, binary descriptors, and learnable descriptors according to the image cues used
(e.g., gradient, intensity) and the form of descriptor generation (e.g., comparison,
statistics, and learning). Floating-point descriptors are usually generated by gradient
or intensity cue-based statistical methods. The core idea of gradient statistics-based
descriptors is to compute the direction of the gradient to form a floating-point vector
for feature description. Binary descriptors are usually based on local intensity
comparison strategies. Learnable descriptors are deep descriptors with high-order
image cues or semantic information extracted in CNNs. These descriptors can be
further classified into gradient-based statistics, local intensity comparison, local
intensity order statistics, and learning-based descriptors [46].
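The following is a hedged sketch of a binary descriptor built from local intensity comparisons (in the spirit of BRIEF-like descriptors); the random sampling pattern and the descriptor length are illustrative assumptions, not the design of any specific published descriptor.

# Toy binary descriptor from local intensity comparisons; matching uses Hamming distance.
import numpy as np

def binary_descriptor(patch, n_bits=256, seed=0):
    # patch: square gray-level patch centered on the feature point
    rng = np.random.default_rng(seed)
    h, w = patch.shape
    pts = rng.integers(0, [h, w], size=(n_bits, 2, 2))     # n_bits pairs of (row, col) samples
    a = patch[pts[:, 0, 0], pts[:, 0, 1]]
    b = patch[pts[:, 1, 0], pts[:, 1, 1]]
    return (a < b).astype(np.uint8)                        # one intensity comparison per bit

def hamming_distance(d1, d2):
    return int(np.count_nonzero(d1 != d2))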
In the second stage, false matches are culled by imposing additional local and/or
global geometric constraints.
Random sample consensus (RANSAC) is a classical resampling-based
mismatch cancellation and parameter estimation method. Inspired by classical
RANSAC, a learning technique [47] that removes outliers and/or estimates model
parameters by training a deep regressor has been proposed to estimate the
transformed model. In addition to learning with multilayer perceptrons (MLPs),
graph convolutional networks (GCNs) [48] can also be used.
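To make the classical RANSAC idea concrete, the following is a minimal sketch (thresholds, iteration counts, and the choice of an affine model are illustrative assumptions) of culling false matches while estimating a transformation from putative correspondences src[i] -> dst[i].

# RANSAC sketch for robust affine estimation from putative correspondences.
import numpy as np

def estimate_affine(src, dst):
    # least-squares affine fit: dst ~ [x, y, 1] @ A, returned as a 2 x 3 matrix
    X = np.hstack([src, np.ones((len(src), 1))])
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return A.T

def ransac_affine(src, dst, n_iter=1000, thresh=3.0, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(src), size=3, replace=False)    # minimal sample
        A = estimate_affine(src[idx], dst[idx])
        pred = src @ A[:, :2].T + A[:, 2]
        inliers = np.linalg.norm(pred - dst, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        return None, best_inliers
    # refit the model on all inliers of the best hypothesis
    return estimate_affine(src[best_inliers], dst[best_inliers]), best_inliers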
References
1. Huang DW, Pettie S (2022) Approximate generalized matching: f-Matchings and f-Edge
covers. Algorithmica, 84(7): 1952-1992.
2. Zhang J, Wang X, Bai X, et al. (2022) Revisiting domain generalized stereo matching networks
from a feature consistency perspective. IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 12991-13001.
3. Liu B, Yu H, Qi G (2022) GraftNet: Towards domain generalized stereo matching with a broad
spectrum and task-oriented feature. IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 13002-13011.
4. Kropatsch WG, Bischof H (eds.) (2001) Digital Image Analysis—Selected Techniques and
Applications. Berlin: Springer.
5. Goshtasby AA (2005) 2-D and 3-D Image Registration—for Medical, Remote Sensing, and
Industrial Applications. USA. Hoboken: Wiley-Interscience.
6. Dubuisson M, Jain AK (1994) A modified Hausdorff distance for object matching. Proceedings
of the 12th ICPR, 566-568.
7. Ballard DH, Brown CM (1982) Computer Vision, UK London: Prentice-Hall.
8. Zhang Y-J (2017) Image Engineering, Vol. 3: Image understanding. De Gruyter, Germany.
9. Zhang Y-J (1991) 3-D image analysis system and megakaryocyte quantitation. Cytometry, 12:
308-315.
10. Zhang Y-J (1997) Ellipse matching and its application to 3D registration of serial cell images.
Journal of Image and Graphics 2(8,9): 574-577.
11. Zhang Y-J (1990) Automatic correspondence finding in deformed serial sections. Scientific
Computing and Automation (Europe) Chapter 5 (39-54).
12. Li Q, You X, Li K, et al. (2021) Spatial relation reasoning and representation for image
matching. Acta Geodaetica et Cartographica Sinica, 50(1): 117-131.
13. Lohmann G (1998) Volumetric Image Analysis. USA, Hoboken: John Wiley & Sons and
Teubner Publishers.
14. Lan CZ, Lu WJ, Yu JM, et al. (2021) Deep learning algorithm for feature matching of cross-modality remote sensing image. Acta Geodaetica et Cartographica Sinica 50(2): 189-202.
15. Li Q, You X, Li K, et al. (2021) Spatial relation reasoning and representation for image
matching. Acta Geodaetica et Cartographica Sinica 50(1): 117-131.
16. Sun HQ (2004) Graph Theory and Its Application. Beijing: Science Press.
17. Snyder WE, Qi H (2004) Machine Vision. UK: Cambridge University Press.
18. Shapiro L, Stockman G (2001) Computer Vision. UK London: Prentice Hall.
19. Jiang XY, Ma JY, Xiao GB, et al. (2021) A review of multimodal image matching: Methods and
applications. Information Fusion 73: 22-71.
20. Avants BB, Epstein CL, Grossmann M, et al. (2008) Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical Image Analysis 12(1): 26-41.
21. Luo J, Konofagou EE (2010) A fast normalized cross-correlation calculation method for motion
estimation, IEEE Trans. Ultrason. Ferroelectr. Freq. Control 57(6): 1347-1357.
22. Viola P, Wells III WM (1997) Alignment by maximization of mutual information, International
Journal of Computer Vision 24(2): 137-154.
23. Studholme C, Hill DLG, Hawkes DJ (1999) An overlap invariant entropy measure of 3D
medical image alignment. Pattern Recognition 32(1): 71-86.
24. Loeckx D, Slagmolen P, Maes F, et al. (2009) Nonrigid image registration using conditional
mutual information. IEEE Trans. Med. Imaging, 29(1): 19-29.
25. Zhang X, Yu FX, Karaman S, et al. (2017) Learning discriminative and transformation
covariant local feature detectors. Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition 6818-6826.
26. Trouve A (1998) Diffeomorphisms groups and pattern matching in image analysis. International Journal of Computer Vision 28(3): 213-221.
27. Marsland S, Twining CJ (2004) Constructing diffeomorphic representations for the groupwise
analysis of nonrigid registrations of medical images. IEEE Trans. Med. Imaging 23(8):
1006-1020.
28. Zagorchev L, Goshtasby A (2006) A comparative study of transformation functions for
nonrigid image registration. IEEE Trans. Image Process 15(3): 529-538.
29. Bookstein FL (1989) Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(6): 567-585.
30. Sederberg TW, Parry SR (1986) Free-form deformation of solid geometric models. Proceedings
of the 13th Annual Conference on Computer Graphics and Interactive Techniques 151-160.
31. Zhang Y-J (2021) Handbook of Image Engineering. Singapore: Springer Nature.
32. Ford Jr LR, Fulkerson DR. Flows in Networks. USA: Princeton University Press.
33. Pearl J (2014) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
The Netherlands: Elsevier.
34. Komodakis N, Tziritas G (2007) Approximate labeling via graph cuts based on linear programming. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(8): 1436-1453.
35. Cheng X, Zhang L, Zheng Y (2018) Deep similarity learning for multimodal medical images.
Computer Methods in Biomechanics and Biomedical Engineering: Imaging and Visualization
6(3): 248-252.
36. Blendowski M, Heinrich MP (2019) Combining MRF-based deformable registration and deep
binary 3D-CNN descriptors for large lung motion estimation in COPD patients. Int. J. Comput.
Assist. Radiol. Surg. 14(1): 43-52.
37. Liao R, Miao S, de Tournemirf P, et al. (2016) An artificial agent for robust image registration.
arXiv preprint, arXiv: 1611.10336.
38. Uzunova H, Wilms M, Handels H, et al. (2017) Training CNNs for image registration from few
samples with model-based data augmentation. Proceedings of International Conference on
Medical Image Computing and Computer-Assisted Intervention 223-231.
39. Hering A, Kuckertz S, Heldmann S, et al. (2019) Enhancing label-driven deep deformable
image registration with local distance metrics for state-of-the-art cardiac motion tracking.
Bildverarbeitung Fur Die Medizin 309-314.
40. Yan P, Xu S, Rastinehad AR, et al. (2018) Adversarial image registration with application for
MR and TRUS image fusion. International Workshop on Machine Learning in Medical Imaging
197-204.
41. Sun L, Zhang S (2018) Deformable MRI-ultrasound registration using 3D convolutional neural
network. Simulation, Image Processing, and Ultrasound Systems for Assisted Diagnosis and
Navigation 152-158.
42. Kori A, Krishnamurthi G (2019) Zero shot learning for multi-modal real time image registration, arXiv preprint, arXiv:1908.06213.
43. Zhang Y-J (2017) Image Engineering, Vol.1: Image Processing. Germany: De Gruyter.
10 Simultaneous Location and Mapping
Laser SLAM infers the motion of the LiDAR itself and the surrounding environment
based on the point cloud data of continuous motion frame by frame. Laser SLAM
can accurately measure the angle and distance of object points in the environment;
without pre-arranging the scene, it can work in a poor light environment and
generate an environment map that is easy to navigate.
Laser SLAM mainly solves three problems: (i) extracting useful information from
the environment, that is, feature extraction; (ii) establishing the relationship between
environmental information observed at different times, that is, data association; and
(iii) describing the environment, that is, the map representation problem.
The process framework of laser SLAM is shown in Fig. 10.2, which mainly
includes five modules:
1. Laser scanner: Receive the distance and angle information returned by the emitted
laser from the surrounding environment to form point cloud data.
form a pose graph. By adjusting the nodes in the pose graph to satisfy the
spatial constraints to the greatest extent, the pose information and map are
obtained.
5. Map update: The obtained point cloud data of each frame and the corresponding
pose are spliced into a global map to complete the map update. There are different
types of maps: scale maps, topological maps, and semantic maps. In 2D laser
SLAM, grid maps and feature maps in scale maps are mainly used [2]:
(a) Grid map: Divide the environment space into grid cells of equal size, whose
attribute is the probability that the grid is occupied by objects. If the grid is
occupied, the probability value is closer to 1. When there are no objects in the
grid, the probability value is closer to 0. If it is uncertain whether there is an
object in the grid, the probability value is equal to 0.5. The grid map has high
accuracy and can fully reflect the structural characteristics of the environ
ment, so the grid map can be directly used for autonomous navigation and
positioning of mobile robots.
(b) Feature map: It is also known as geometric map, which is composed of
geometric features such as points, lines, or arcs extracted from environmental
information. Because it occupies fewer resources and allows a customizable map
precision, it is suitable for building maps of small scenes.
Visual SLAM generally tracks selected key points through continuous camera
frames, locates their spatial positions by triangulation, and uses this position information
to infer its own pose [12]. At present, deep learning technology has been widely
applied in visual SLAM [13].
The cameras used in visual SLAM mainly include three classes: monocular
cameras, binocular cameras, and depth cameras (RGB-D). Other special cameras
such as panorama and fish-eye are used less frequently.
The advantages of monocular cameras are that they are low cost, are not affected
by the size of the environment, and can be used both indoors and outdoors; the
disadvantage is that they cannot obtain absolute depth and can only estimate relative
depth. The binocular camera can directly obtain the depth information, but it is
restricted by the length of the baseline (the size itself needs to be larger), the amount
of calculation to obtain the depth data is large, and the configuration and calibration
are complicated. Depth cameras can directly measure the depth of many points, but
they are mainly used indoors, while outdoor applications are limited by light
interference.
The process frame diagram of visual SLAM (vSLAM) is shown in Fig. 10.3,
which mainly includes five modules:
1. Vision sensor: It reads image information and performs data preprocessing
(feature extraction and matching).
2. Visual odometry (VO): It is also known as (SLAM’s) front-end. It is able to
estimate camera motion with the help of adjacent image frames and recover the
spatial structure of the scene. It is called visual odometry because, like an odometer, it only estimates the camera motion incrementally between adjacent frames.
Laser SLAM and visual SLAM have their own characteristics, and some comparisons are listed in Table 10.3.
Due to the complexity of application scenarios, both laser SLAM and visual
SLAM have certain limitations when used alone. In order to take advantage of
different sensors, these two can be combined to fuse the two kinds of
information [14].
Extended Kalman filter (EKF) itself is a filtering method for online SLAM
systems. It can also be used to combine laser SLAM with visual SLAM with
RGB-D cameras [15]. When camera matching fails, a laser device is used to
supplement the camera’s 3D point cloud data and generate a map. However, this
method essentially only adopts a switching working mechanism between the work
ing modes of the two sensors and has not really fused the data of the two sensors.
Using visual SLAM alone may not be able to effectively extract the depth informa
tion of feature points. Laser SLAM works better in this regard, so the depth of the
scene can be measured with laser SLAM first, and then the point cloud can be
projected onto the video frame [16].
When using laser SLAM alone, there will be some difficulties in extracting the
features of the object region. Using visual SLAM to extract ORB features and
perform closed-loop detection can improve the performance of laser SLAM in this
regard [17].
To make laser SLAM and visual SLAM more tightly coupled, it is possible to use
both laser SLAM and visual SLAM at the same time and use the measurement
residuals of both modes at the back-end for back-end optimization [18]. In addition,
visual LiDAR odometry and real-time mapping (VLOAM) [19] can also be
designed, which combines low-frequency LiDAR odometry and high-frequency
visual odometry. This can quickly improve motion estimation accuracy and suppress
drift.
10.2 Laser SLAM Algorithms
People have developed many laser SLAM algorithms based on different technolo
gies and with different characteristics. These algorithms can be mainly divided into
two categories: filtering methods and optimization methods.
In the following introduction, it is assumed for simplicity that the laser device and
its carrier (which can be a car, robot, drone, etc.) use the same coordinate system, and
the laser device is used to refer to the combined device of the laser device and its
carrier. Let xk represent the pose of the laser device, use mi to represent the marked
point in the environment (map), and use zk-1,i+1 to represent the marked feature
mi+1 observed by the laser device at xk-1. In addition, use uk to represent the motion
displacement between two adjacent poses on the motion trajectory.
The basic idea of RBPF is to deal with the localization and mapping problems in
SLAM separately; the joint posterior is factorized as

P(x1:t, m | z1:t, u1:t-1) = P(m | x1:t, z1:t) P(x1:t | z1:t, u1:t-1)     (10.1)

Specifically, first use P(x1:t | z1:t, u1:t-1) to estimate the trajectory x1:t of the laser
device, and then continue to estimate the map m with the help of x1:t.
Mapping with P(m | x1:t, z1:t) is straightforward given the pose of the laser device. The
following only discusses the positioning problem represented by P(x1:t|z1:t, u1:t-1).
Here a particle filter algorithm called sampling importance resampling (SIR) filter
is used. It mainly has four steps:
Sampling
Taking the probabilistic motion model of the laser device as the proposed distribution D,
the new particle point set {xt(i)} at the current moment is sampled from the
proposed distribution D based on the particle point set {xt-1(i)} owned at the previous
moment. Therefore, the generation of the new particle point set {xt(i)} can be
represented as xt(i) ~ P(xt | xt-1(i), ut-1).
Importance Weighting
Considering the entire motion process, each possible trajectory of the laser device
can be represented by a particle point x1:t(i), and the importance weight of each
trajectory corresponding to the particle point x1:t(i) can be defined as the following
equation:

wt(i) = P(x1:t(i) | z1:t, u1:t-1) / D(x1:t(i) | z1:t, u1:t-1)     (10.2)
Resampling
Resampling refers to redrawing the newly generated particle points according to their importance
weights. Since the total number of particle points remains the same, when the
particle points with smaller weights are deleted, the particle points with larger
weights need to be copied to keep the total number of particle points unchanged.
After resampling, the weights of particle points become the same and then proceed to
the next round of sampling and resampling.
Map Estimation
Under the condition that each trajectory corresponds to the particle point x1:t(i),
P(m(i) | x1:t(i), z1:t) can be used to calculate a map m(i), and then the final map m is
obtained by integrating the maps calculated for each trajectory.
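The following is a schematic sketch of one SIR update for the localization part. The functions motion_model() and observation_likelihood() are hypothetical placeholders for the probabilistic motion and observation models of the laser device, which are not spelled out here.

# One SIR filter update: sampling, importance weighting, resampling (map estimation omitted).
import numpy as np

def sir_step(particles, weights, u, z, motion_model, observation_likelihood, rng):
    # 1. Sampling: draw each new particle from the proposal P(x_t | x_{t-1}, u_{t-1})
    particles = np.array([motion_model(x, u, rng) for x in particles])
    # 2. Importance weighting: reweight by the observation likelihood P(z_t | x_t)
    weights = weights * np.array([observation_likelihood(z, x) for x in particles])
    weights = weights / weights.sum()
    # 3. Resampling: duplicate heavy particles, drop light ones, keep the particle count fixed
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))
    # 4. Map estimation per particle would use P(m | x_{1:t}, z_{1:t}) and is omitted here
    return particles, weights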
The GMapping algorithm improves RBPF in two aspects, namely, proposed distri
bution and resampling strategy.
The proposed distribution D is discussed first. It can be seen from Eq. (10.2) that
each calculation needs to calculate the weight corresponding to the entire trajectory.
Over time, the trajectory will become very long, and the amount of computation will
increase. An improved method for this is to derive a recursive calculation method for
the weights based on Eq. (10.2):
wt(i) = P(x1:t(i) | z1:t, u1:t-1) / D(x1:t(i) | z1:t, u1:t-1)
      = [P(zt | x1:t(i), z1:t-1) P(xt(i) | xt-1(i), ut-1) P(x1:t-1(i) | z1:t-1, u1:t-2)]
        / [P(zt | z1:t-1, u1:t-1) D(xt(i) | x1:t-1(i), z1:t, u1:t-1) D(x1:t-1(i) | z1:t-1, u1:t-2)]
      ∝ [P(zt | x1:t(i), z1:t-1) P(xt(i) | xt-1(i), ut-1)] / D(xt(i) | x1:t-1(i), z1:t, u1:t-1) × wt-1(i)     (10.4)
However, directly using the motion model as the proposed distribution will cause a
problem, that is, when the reliability of the observation data is relatively high, the
number of new particles generated by sampling the motion model will fall in the
observation distribution interval, resulting in a lower accuracy of the observation
update. To this end, the observation update process can be divided into two cases:
When the observation reliability is low, the default motion model of Eq. (10.3) is
used to generate a new particle point set {xt(i)} and the corresponding weight; when
the observation reliability is high, then directly sample from the interval of the
observation distribution, and approximate the distribution of the sampling point set
{xk} as a Gaussian distribution; by using the point set {xk} to calculate the parameters
μt(i) and Σt(i) of the Gaussian distribution, sampling from xt(i) ~ N(μt(i), Σt(i))
finally generates a new particle point set {xt(i)} and the corresponding weights.
After generating the new particle point set {xt(i)} and the corresponding weights,
the resampling strategy can be considered. If resampling is performed every time the
particle point set {xt(i)} is updated, then, when the particle weights change little
during the update, or when noise makes some bad particle points receive larger
weights than good ones, resampling will result in the loss of good particle points.
So before performing resampling, its validity needs to be ensured. To this end, the improved
resampling strategy measures the effectiveness of resampling with the help of the
parameter defined by the following equation:

Neff = 1 / Σ_{i=1}^{N} (w̃(i))²     (10.5)

where w̃(i) represents the normalized weight of the i-th particle point. When the similarity
between the proposed distribution and the target distribution is high, the weights
of the particle points are relatively close; when the similarity between the proposed
distribution and the target distribution is low, the weight differences among the particle
points are relatively large. In this way, a threshold can be set to judge validity using the
parameter Neff: resampling is performed when Neff is less than the threshold;
otherwise, resampling is skipped.
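A short sketch of this effectiveness test follows; the threshold value (half the particle count) is an assumption, chosen only for illustration.

# Adaptive resampling based on the effective number of particles of Eq. (10.5).
import numpy as np

def adaptive_resample(particles, weights, rng, threshold_ratio=0.5):
    # particles: N x d array; weights: length-N array
    w = weights / weights.sum()                      # normalized weights
    n_eff = 1.0 / np.sum(w ** 2)                     # Eq. (10.5)
    if n_eff < threshold_ratio * len(particles):     # resample only when Neff is too small
        idx = rng.choice(len(particles), size=len(particles), p=w)
        return particles[idx], np.full(len(particles), 1.0 / len(particles))
    return particles, w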
Local mapping is the process of building a local map using sensor scan data.
Let us first introduce the structure of the Cartographer map. The Cartographer map is
composed of many local sub-maps, and each local sub-map contains several scan
frames (scan), as shown in Fig. 10.5. Maps, local sub-maps, and scan frames are
related by pose relationships. The scan frame and the sub-map are related by the
local pose qij, the sub-map and the map are related by the global pose qim, and the
scan frame and the map are related by the global pose qjs.
The pose coordinates can be expressed as q = (qx, qy, qθ). Assuming that the
initial pose is q1 = (0, 0, 0), the scan frame here is Scan(1), and Sub-map(1) is
initialized with Scan(1). The corresponding pose q2 of Scan(2) is calculated using
the update method of scan-to-map matching, and Scan(2) is added to Sub-map(1)
based on the pose q2. Each newly obtained scan frame is then added to the current
sub-map by the same scan-to-map matching method, until the new scan frame is completely contained
in Sub-map(1); that is, the creation of Sub-map(1) ends when no new information
beyond Sub-map(1) is observed in the new scan frame. Then repeat the
above steps to construct a new local sub-map, Sub-map(2). All local sub-maps {Sub-map(m)}
constitute the final global map. In Fig. 10.5, it is assumed that Sub-map(1)
is constructed from Scan(1) and Scan(2), and Sub-map(2) is constructed from Scan(3),
Scan(4), and Scan(5).
As can be seen from Fig. 10.5, each scan frame corresponds to a global coordinate
in the global map coordinate system and also corresponds to a local coordinate in the
local sub-map coordinate system (because the scan frame is also included in the
corresponding local sub-map). Each local sub-map starts with the first inserted
scan frame, and the global coordinates of this initial scan frame are also the global
coordinates of the local sub-map. So, the global poses Qs = {qjs} (j = 1, 2, ..., n)
corresponding to all scan frames and the global poses Qm = {qim} (i = 1, 2, ..., m)
corresponding to all local sub-maps are associated with the local poses qij generated
by scan-to-map matching, and these constraints constitute the pose graph, which will
be applied in the global mapping later.
The construction of local sub-maps involves the transformation of multiple
coordinate systems. First, the distance points {dk} (k = 1, 2, ..., K) obtained by
one scanning cycle of the laser device are expressed in a coordinate system centered at
the rotation center of the laser device. Then, in a local sub-map, the pose of the first
scan frame is used as a reference, and the pose of the subsequent scan frames can be
represented by a relative transition matrix Tq = (Rq, tq). In this way, the data points in
the scan frame can be transformed into the local sub-map coordinate system using
the following equation:
Tq · dk = Rq dk + tq,  with  Rq = [cos qθ, -sin qθ; sin qθ, cos qθ]  and  tq = (qx, qy)ᵀ     (10.6)
In other words, the data points dk in the scan frame are transformed into the local
sub-map coordinate system.
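A direct sketch of Eq. (10.6), assuming NumPy and scan points given as a K x 2 array, is shown below.

# Transform scan points from the laser device frame into the local sub-map frame.
import numpy as np

def scan_to_submap(points, q):
    # points: K x 2 array of scan points d_k; q: pose (q_x, q_y, q_theta)
    qx, qy, qtheta = q
    R = np.array([[np.cos(qtheta), -np.sin(qtheta)],
                  [np.sin(qtheta),  np.cos(qtheta)]])
    t = np.array([qx, qy])
    return points @ R.T + t          # T_q * d_k = R_q d_k + t_q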
The sub-maps in Cartographer are in the form of probabilistic grid maps, that is,
the continuous 2D space is divided into discrete grids, and the side length of the grid
(usually 5 cm) represents the resolution of the map. The scanned scene point is
replaced by the grid occupied by the scene point, and the probability is used to
describe whether there is a scenery in the grid. The larger the probability value, the
higher the possibility of the existence of the scenery.
Next, consider the process of adding scan data to a sub-map. If the data is
converted to the sub-map coordinate system according to Eq. (10.6), these data
will cover some grids {Mold} of the sub-map. The grid in the sub-map has three
states: occupied (hit), not occupied (miss), and unknown. The grid covered by the
scan points should be occupied. The region between the start and end of the scanning
beam should be free of scenery (light can pass through), so the corresponding grid
should be unoccupied. Due to scan resolution and range limitations, grids not
covered by scan points should be unknown. Because the grid in the sub-map may
be covered by more than one scan frame, the grid state needs to be iteratively
updated in two cases:
1. In the grid {Mold} covered by data points in the current frame, if the grid has never
been covered by data points before (i.e., unknown state), then use the following
formula for initial update:
Mnew(x) = Phit     if state(x) = hit
Mnew(x) = Pmiss    if state(x) = miss     (10.7)
Among them, if the grid x is marked as an occupied state by a data point, then the
occupancy probability Phit is used to assign an initial value to the grid; if the grid x is
marked as a non-occupied state by a data point, then the non-occupied probability
Pmiss is used to assign an initial value to the grid.
2. In the grid {Mold} covered by the data points of the current frame, if the grid
has been covered by data points before (i.e., it already has the value Mold), then use the
following formula to iteratively update:

Mnew(x) = clip( inv⁻¹( inv(Mold(x)) · inv(Phit) ) )     if state(x) = hit
Mnew(x) = clip( inv⁻¹( inv(Mold(x)) · inv(Pmiss) ) )    if state(x) = miss     (10.8)
Among them, if grid x is marked as occupied by data points, then Mold is updated
with the occupancy probability Phit; if grid x is marked as non-occupied by data points,
then Mold is updated with the non-occupancy probability Pmiss. Here inv is an inverse
proportion function, inv(p) = p/(1 - p); inv⁻¹ is the inverse function of inv; and clip
is an interval-limiting function: when the function value is higher than the maximum
of the set interval, the maximum value is taken, and when it is lower than the minimum
of the set interval, the minimum value is taken.
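A compact sketch of these update rules follows; the concrete values of Phit, Pmiss, and the clipping interval are illustrative assumptions, not values prescribed by the text.

# Occupancy grid cell update following Eqs. (10.7) and (10.8).
P_HIT, P_MISS = 0.55, 0.49          # assumed example probabilities
P_MIN, P_MAX = 0.12, 0.97           # assumed clipping interval

def inv(p):
    return p / (1.0 - p)            # inverse proportion (odds) function

def inv_inverse(o):
    return o / (1.0 + o)            # inverse of inv

def clip(p, lo=P_MIN, hi=P_MAX):
    return max(lo, min(hi, p))

def update_cell(m_old, hit):
    p = P_HIT if hit else P_MISS
    if m_old is None:               # grid never covered before: initial update, Eq. (10.7)
        return p
    # grid covered before: iterative update, Eq. (10.8)
    return clip(inv_inverse(inv(m_old) * inv(p)))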
The Cartographer algorithm uses the above iterative update mechanism to effectively reduce the interference of dynamic scenes in the environment. Because the
dynamic scene will make the grid state transition between occupancy and
non-occupancy, each state transition will make the probability value of the grid
smaller, which reduces the interference of the dynamic scene.
Finally, considering that the pose predicted by the motion model may have a large
error, it is also necessary to use the observation data to correct the predicted pose
before adding it to the map. Here, the scan-to-map matching method is still used to
search and match in the neighborhood of the predicted pose to locally optimize
the pose:
arg min_q  Σ_{k=1}^{K} ( 1 - Msmooth(Tq · dk) )     (10.9)
Closed-Loop Detection
The local optimization of the pose in Eq. (10.9) can reduce the cumulative error in
local mapping, but when the scale of the mapping is large, the total cumulative
error will lead to ghosting on the map; this typically appears when the motion
trajectory returns to a position it had previously reached. This requires closed-loop
detection: closed-loop constraints are added to the overall mapping constraints,
and a global optimization of the global poses is performed. In closed-loop
detection, a search matching algorithm with high computational efficiency and
high precision is required.
The closed-loop detection can be represented by the following formula
(W represents the search window):

arg max_{q ∈ W}  Σ_{k=1}^{K} Mnearest(Tq · dk)     (10.10)
where the Mnearest function value is the probability value of the grid covered by
Tq · dk. When the search result is the current real pose, the matching degree is very
high, that is, the value of each Mnearest function is large, and the entire summation
result is also the largest.
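The brute-force form of this search can be sketched as follows; m_nearest() stands in for the map lookup and is a placeholder, and, as discussed next, the real algorithm replaces this exhaustive loop with branch and bound.

# Exhaustive scoring of candidate poses in the search window W, following Eq. (10.10).
import numpy as np

def closed_loop_search(scan, window_poses, m_nearest):
    # scan: K x 2 scan points; window_poses: candidate poses (q_x, q_y, q_theta) in W
    best_pose, best_score = None, -np.inf
    for q in window_poses:
        qx, qy, qtheta = q
        R = np.array([[np.cos(qtheta), -np.sin(qtheta)],
                      [np.sin(qtheta),  np.cos(qtheta)]])
        transformed = scan @ R.T + np.array([qx, qy])
        score = float(np.sum(m_nearest(transformed)))   # summed grid probabilities
        if score > best_score:
            best_pose, best_score = q, score
    return best_pose, best_score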
If Eq. (10.10) is computed with exhaustive search, the computation is too large to
be performed in real time. For this reason, a branch-and-bound method is used to
improve efficiency. Branch and bound is to first match with a low-resolution map
and then gradually increase the resolution to match until the highest resolution.
Cartographer uses a depth-first strategy to search here.
Global Mapping
Cartographer uses a sparse pose map for global optimization, and the constraint
relationship of the sparse pose map can be constructed according to Fig. 10.5. Global
poses Qs = {qjs} (j = 1, 2, ..., n) corresponding to all scan frames and the global
poses Qm = {qim} (i = 1, 2, ..., m) corresponding to all local sub-maps are associated
with the local poses qij generated by scan-to-map matching:

arg min_{Qs, Qm}  (1/2) Σ_{ij} L( E²(qim, qjs; Σij, qij) )     (10.11)

where

E²(qim, qjs; Σij, qij) = e(qim, qjs; qij)ᵀ Σij⁻¹ e(qim, qjs; qij)     (10.12)
In the above two equations, i is the sequence number of the sub-map, j is the
sequence number of the scan frame, and qij represents the local pose of the scan
frame with the sequence number j in the local sub-map with the sequence number i.
The loss function L is used to penalize too large error terms, and the Huber function
can be used.
Equation (10.11) is actually a nonlinear least squares problem. When a closed
loop is detected, all pose quantities in the whole pose graph are globally optimized,
and then all pose quantities in Qs and Qm will be corrected, and the corresponding
map points on each pose will be corrected accordingly, which is called global
mapping.
The function of the point cloud registration module is to extract feature points from the
point cloud data. It calculates the smoothness of each point in the point cloud data of
the current frame, determines the points whose smoothness is less than the given
threshold as corners, and determines the points whose smoothness is greater than the
given threshold as surface points. Put all corners into the corner cloud set and all
surface points into the surface cloud set.
Mapping Module
The mapping module uses the method of scan-to-map matching for high-precision
positioning. It takes the above low-precision odometer as the initial value of the pose and
matches the corrected feature point cloud with the map. This matching can get a
high-precision odometer (1 Hz odometer). Based on the pose provided by such a
high-precision odometer, the corrected feature point cloud can be added to the
existing map.
Although the accuracy of LiDAR odometer used for positioning is low, its update
speed is high. Although the odometer output by the mapping module has high
accuracy, its update speed is low. If these two are integrated, an odometer with high
speed and high accuracy can be obtained. Fusion is achieved by interpolation. If the
1 Hz high-precision odometer is used as the benchmark and the 10 Hz low-precision
odometer is used to interpolate it, the 1 Hz high-precision odometer can be output at
the speed of 10 Hz (equivalent to 10 Hz odometer).
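The following is a hedged sketch of this fusion-by-interpolation idea for planar poses (x, y, theta); the data layout is an assumption made for illustration: the latest high-precision pose serves as an anchor, and the relative motion reported by the low-precision odometry since that anchor is composed onto it.

# Upgrade low-rate, high-precision poses to the high rate of the low-precision odometry.
import numpy as np

def fuse_odometry(anchor_pose, ref_low_pose, low_poses):
    # anchor_pose: latest high-precision pose (x, y, theta)
    # ref_low_pose: low-precision pose recorded at the same time as the anchor
    # low_poses: subsequent low-precision poses to be upgraded
    def rot(theta):
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s], [s, c]])
    fused = []
    for p in low_poses:
        # relative motion measured by the low-precision odometer, in its own reference frame
        rel_xy = rot(-ref_low_pose[2]) @ (p[:2] - ref_low_pose[:2])
        rel_th = p[2] - ref_low_pose[2]
        # compose the relative motion onto the high-precision anchor pose
        xy = anchor_pose[:2] + rot(anchor_pose[2]) @ rel_xy
        fused.append(np.array([xy[0], xy[1], anchor_pose[2] + rel_th]))
    return np.array(fused)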
It should be pointed out that if the frequency of the laser device itself is high
enough, or if an inertial measurement unit (IMU), visual odometer (VO), wheel
odometer, etc., provides external information to speed up inter-frame feature
registration so as to respond to posture changes and correct motion distortion,
then this fusion is not necessary.
LOAM algorithm has two characteristics worth pointing out: First, it solves
motion distortion; second, it improves the efficiency of mapping. Motion distortion
comes from the interference in data acquisition. The problem of motion distortion is
more prominent in low-cost laser devices because of low scanning frequency and
rotation speed. The lOAM algorithm uses the odometer obtained by inter-frame
registration to correct the motion distortion, so that the low-cost laser device can be
applied. The problem of mapping efficiency is more prominent when processing a
large number of 3D point cloud data. LOAM algorithm uses low-precision odometer
and high-precision odometer to decompose simultaneous positioning and mapping
into independent positioning and independent mapping, which can be processed
separately, reducing the amount of calculation, so that low-power computer equipment can also be applied.
10.3 Visual SLAM Algorithms

According to the different processing methods of image data, visual SLAM algorithms
can be divided into the feature point method, the direct method, and the semi-direct
method.
The feature point method first extracts features from the image and performs
feature matching and then uses the obtained data association information to calculate
the camera motion, that is, the front-end VO, and finally performs back-end optimization and global mapping (see Fig. 10.3).
The direct method directly uses the image gray-scale information for data
association and calculates the camera motion. The front-end VO in the direct method
is directly carried out on the image pixels, and there is no need for feature extraction
and matching. The subsequent back-end optimization and global mapping are
similar to the feature point method.
Semi-direct method combines the robustness advantage of feature point method
in using feature extraction and matching and the computational speed advantage of
direct method without feature extraction and matching. It often has more stable and
faster performance.
ORB-SLAM algorithm uses the optimization method for solution. It uses three
threads: front-end, back-end, and closed loop. Its process flow diagram is shown in
Fig. 10.7. The front-end combines the logic related to positioning such as feature
extraction, feature matching, and visual odometer in a separate thread (not dragged
by the relatively slow back-end thread to ensure the real-time positioning), which
extracts feature points by oriented FAST and rotated BRIEF (ORB) from the
image [26]. The back-end combines the mapping related logic of global optimization
and local optimization in a separate thread. It first performs local optimization
mapping and triggers global optimization when the closed-loop detection is successful (the global optimization process is carried out on the camera pose map, without
considering the map features to speed up the optimization speed). In addition, the
algorithm uses key frames (representative frames in image input). Generally, the
image frames directly input into the system tracking thread from the camera are
called ordinary frames, which are only used for positioning tracking. The number of
ordinary frames is very large, and the redundancy between frames is also large. If
only some frames with more feature points, rich attributes, large differences between
front and back frames, and more common visibility relationship with surrounding
frames are selected as key frames, the amount of calculation is smaller, and the
robustness is higher when generating map points. ORB-SLAM algorithm maintains
a sequence of key frames in operation. In this way, the front-end can quickly relocate
with the help of key frame information when positioning is lost, and the back-end
can optimize the key frame sequence to avoid bringing a large number of redundant
input frames into the optimization process and wasting computing resources.
Figure 10.7 mainly includes six modules, which are briefly introduced below.
Map Module
The map module corresponds to the data structure of SLAM system. The map
module of the ORB-SLAM algorithm includes map points, key frames, the common visibility map, and the spanning tree. The running process of the algorithm is to dynamically
maintain the map, in which there is a mechanism responsible for adding and deleting
the data in the key frame and also a mechanism responsible for adding and deleting
the data in the map point cloud, so as to maintain the efficiency and robustness of
the map.
Map Initialization
Location Recognition
If the tracking thread is lost during the mapping process of the SLAM system, it is
necessary to start relocation to retrieve the lost information; if the SLAM system
needs to judge whether the current position has been reached before (after building a
large map), it needs to carry out closed-loop detection. In order to realize relocation
and closed-loop detection, position recognition technology is needed. In a large
environment, image-to-image matching is often used for position recognition.
Among the options, the bag of words (BoW) model is often used to build a visual
vocabulary recognition database for matching.
Tracking Thread
The tracking thread obtains the input image from the camera, completes the map
initialization, and then extracts the ORB feature. The next initial pose estimation
corresponds to coarse positioning, while local map tracking corresponds to fine
positioning. On the basis of coarse positioning, precise positioning uses the current
frame and multiple key frames on the local map to establish a common visibility
relationship and uses the projection relationship between the point cloud of the
common visibility map and the current frame to solve the pose of the camera more
accurately. Finally, some new candidate key frames are selected.
Local Mapping Thread

The local mapping thread first calculates the feature vector of the candidate key
frame selected by the tracking thread with the help of the bag of word model, that is,
the key frame is added to the database of the bag of word model. Then, the cloud
points (points in the point cloud) in the map that have a common visibility relation
ship but do not map with the key frame are associated (these cloud points are called
recent map points), and the key frame is inserted into the map structure. Poor-quality
recent map points are then deleted. For the newly inserted key
frames, the common visibility map can be used to match the adjacent key frames,
and the triangulation method can be used to reconstruct the new map cloud points.
Next, several key frames and map cloud points near the current frame are optimized
by bundle adjustment (BA). Finally, the key frames in the local map are filtered
again, and the redundant key frames are eliminated to ensure robustness.
Closed-Loop Thread
Closed-loop thread is divided into two parts: closed-loop detection and closed-loop
correction. The closed-loop detection uses the bag of word model to select the
frames in the database with high similarity to the current key frame as candidate
closed-loop frames and then calculate the similarity transformation relationship
between each candidate closed-loop frame and the current frame. If there is enough
data to calculate the similarity transformation, and the transformation can ensure that
there are enough common viewpoints, the closed-loop detection is successful. Next,
the transform is used to correct the cumulative error of the current key frame and its
adjacent frames, and those map points that are inconsistent due to the cumulative
error are fused together. Finally, with the help of global optimization, those frames
that do not have a common visibility relationship with the current key frame are
corrected. Here, the key frame pose on the global map is taken as the optimization
variable, so it is also called pose map optimization.
The direct method, that is, the direct visual odometer method, does not need feature
extraction and matching (reducing the calculation time), directly uses the attributes
of image pixels to establish data association, and constructs the corresponding model
by minimizing the photometric error to solve the camera pose and map point cloud.
Re-Projection Error
Suppose a spatial point P1 is projected by the camera projection D onto the pixel point p1 in the first image:

p1 = D(P1)     (10.13)

Conversely, the back-projection relationship from the pixel point p1 to the spatial
point P1 is:

P1 = D⁻¹(p1)     (10.14)

With the pose transformation T = [R t; 0 1] between the two frames, the point is transformed into the second camera frame:

P2 = T P1 = [R t; 0 1] P1     (10.15)

and re-projected into the second image:

p2′ = D(P2)     (10.16)

Ideally, the re-projected pixel p2′ should coincide with the actually observed pixel
p2. However, in practice, the projection is disturbed by noise and the transformation
has certain errors, so they do not coincide. The difference between the two is called the
re-projection error:

e = p2 - p2′     (10.17)

Equation (10.17) gives the re-projection error of one point. If the re-projection errors of
all feature points are considered, the camera pose transformation and the map point cloud
can be optimized by minimizing the total re-projection error:

min_T  (1/2) Σ_k || p2,k - p2,k′ ||²     (10.18)
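A compact sketch of these equations follows, assuming a pinhole projection with a known intrinsic matrix K (an assumption made for the example, since the projection D is not specified in detail here).

# Accumulate the squared re-projection error over observed feature points.
import numpy as np

def project(K, P):
    # D(P): pinhole projection of a 3D point P to pixel coordinates
    x = K @ P
    return x[:2] / x[2]

def reprojection_error(K, R, t, points_3d, pixels_2d):
    total = 0.0
    for P1, p2 in zip(points_3d, pixels_2d):
        P2 = R @ P1 + t               # Eq. (10.15): transform into the second camera frame
        p2_hat = project(K, P2)       # Eq. (10.16): re-projected pixel p2'
        e = p2 - p2_hat               # Eq. (10.17): re-projection error of one point
        total += float(e @ e)         # summed squared error, as in Eq. (10.18)
    return total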
Photometric Error

The photometric error of a point compares the gray value of the pixel p1 in the first image I1 with the gray value of the re-projected pixel p2′ in the second image I2:

e = I1(p1) - I2(p2′)     (10.19)
Equation (10.19) gives the photometric error of one point. If the photometric errors of all
considered pixel points are taken into account, the camera pose transformation and the map point
cloud can be optimized by minimizing the total photometric error:

min_T  (1/2) Σ_k || I1(p1,k) - I2(p2,k′) ||²     (10.20)
Comparing the re-projection error and the photometric error, it can be seen that the
calculation of the re-projection error needs the help of feature extraction and feature
matching, and the calculated error corresponds to a distance between pixels;
feature extraction and matching are not needed to calculate the photometric error. The
calculated error is the difference between the gray value of pixels in one image and
the gray value of pixels re-projected into another image. The difference between the
two also explains the relative advantages and disadvantages between the feature
point method and the direct method.
Tracking Module
The tracking module uses the new input frame and the current key frame to calculate
the pose transformation of the new input frame, which is achieved by minimizing the
photometric error between them:
E(qji) = Σ_{p ∈ Ω_Di} || rp²(p, qji) / σ_rp²(p, qji) ||_δ     (10.21)

Among them, qji is a similarity transformation on the Lie algebra (used to describe the pose
transfer); || · ||_δ is the Huber norm; rp(p, qji) is the photometric error:

rp(p, qji) = Ii(p) - Ij( w(p, Di(p), qji) )     (10.22)

where the warp function w(p, d, q) projects the pixel p with inverse depth d into the other frame:

w(p, d, q) = ( x′/z′, y′/z′, 1/z′ )ᵀ,  with  ( x′, y′, z′, 1 )ᵀ = exp(q) ( px/d, py/d, 1/d, 1 )ᵀ     (10.23)

and σ_rp²(p, qji) is the variance of the photometric error:

σ_rp²(p, qji) = 2σI² + ( ∂rp(p, qji) / ∂Di(p) )² Vi(p)     (10.24)
where Di is the inverse depth map and Vi is the covariance of the inverse depth of the
image.
Depth Estimation Module

The depth estimation module receives the photometric error calculated by the
tracking module for each new input frame and the current key frame and determines
whether to replace the current key frame with the new input frame or improve the
current key frame. If the photometric error is large enough, the current key frame is
added to the map, and then the new input frame is used as the current key frame.
Specifically, first calculate the similarity transformation between the new input
frame and the current key frame, and then project the depth information of the
current key frame to the new input frame through the similarity transformation and
calculate its depth estimation value. If the photometric error is relatively small, the
new input frame is used to filter and update the depth estimation value of the current
key frame.
Map Optimization Module

The map optimization module receives the new key frame from the depth estimation
module and calculates its similarity transformation with the other key frames in the map
before adding it to the map. Here, the photometric error and the depth error of the image
should be minimized at the same time, where rd(p, qji) and σ_rd²(p, qji) denote the depth
error and the variance of the depth error, respectively.
Omnidirectional cameras have a wide field of view (FOV), and the field of view of
some fish-eye cameras can even exceed 180°. This feature of omnidirectional camera
is more suitable for SLAM applications [29]. However, the wide field of view
inevitably brings the problem of image distortion. The practical omnidirectional
camera is still a monocular camera, so the main challenge in extending monocular
SLAM to omnidirectional cameras lies in the camera projection model. In this model,
the projection function maps a 3D point x to pixel coordinates using the focal lengths
fx and fy and the principal point (cx, cy) (Eq. (10.28)), and the corresponding
back-projection recovers the viewing ray from the normalized coordinates
((u - cx)/fx, (v - cy)/fy) (Eqs. (10.29) and (10.30)).
One of the main characteristics of this model is that the projection function, back-
projection function, and their derivatives are easy to calculate.
Tracking Module
The tracking module uses the new input (stereo) frames and the current key (stereo)
frames to calculate the pose transformation of the new input frame, which is still
achieved by calculating the minimum error between them, that is, the photometric
error.
The estimation of scene geometry is carried out in key frames. Each key frame
maintains a Gaussian probability distribution on the inverse depth of the pixel
subset. This subset is selected as pixels with high image gradient amplitude, because
these pixels provide rich structural information and more robust disparity estimation
than pixels in non-textured regions.
The depth estimation combines the temporal stereo (TS) of the original monocular LSD-SLAM with the static stereo (SS) obtained here from a fixed baseline
stereo camera. For each pixel, the binocular LSD-SLAM algorithm integrates static
and temporal stereo cues into depth estimation according to availability. In this way,
the characteristics of monocular structure from motion are combined with fixed
baseline stereo depth estimation in single SLAM method. Static stereo effectively
removes the scale as a free parameter, and temporal stereo cues can help estimate the
depth from a baseline other than the small baseline of the stereo camera.
In the binocular LSD-SLAM algorithm, the depth of the key frame can be estimated directly from the static stereo (see Fig. 10.13). Compared with methods that depend only on temporal or only on static stereo, this has several advantages. Static stereo allows the absolute scale of the world to be estimated and is independent of the camera motion. However, static stereo is limited by a constant baseline (in many cases, with a fixed direction), which effectively restricts its performance to a specific depth range. Temporal stereo, in contrast, does not limit the performance to a specific range: the same sensor can be used in both very small and very large-scale environments, with a seamless transition between the two.
If a new key frame is generated (initialized), the propagated depth (PD) map can be updated and pruned with the help of static stereo.
Camera motion between two images can be determined by means of direct image alignment. This method tracks the motion of the camera with respect to reference key frames. It can also be used to estimate relative pose constraints between key frames for pose graph optimization. Of course, there is also a need to compensate for changes in affine lighting.
The semi-direct method uses some threads or modules of the feature point method
and the direct method in combination. Figure 10.14 can be used to introduce their
connection.
As can be seen from Fig. 10.14, the feature point method first establishes the association between images through feature extraction and feature matching and then solves the camera pose and map point cloud by minimizing the re-projection error. The direct method, in contrast, directly uses the attributes of image pixels to establish the associations between the images, and the camera pose and map point cloud are then solved by minimizing the photometric error. The semi-direct method combines the feature extraction module of the feature point method with the direct association module of the direct method. Because feature points are more robust than raw pixels, and direct association is more efficient than feature matching followed by re-projection error minimization, the semi-direct method combines the advantages of both: it ensures robustness while also improving efficiency.
The flowchart of the SVO algorithm is shown in Fig. 10.15, which mainly includes
two threads: motion estimation and mapping.
The motion estimation thread mainly has three modules: image alignment, feature
alignment, and posture and structure optimization.
The image alignment module uses a sparse model to align the new input image with the previous frame. The alignment is achieved by re-projecting the extracted feature points (FAST corners) into the new image and computing the camera pose transformation by minimizing the photometric error. This is equivalent to replacing the pixels used in the direct method with (sparse) feature points.
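For illustration, here is a minimal Python sketch of this kind of sparse image alignment: the 3D points attached to the previous frame's feature locations are transformed by a candidate pose, re-projected into the new image, and the photometric residuals are minimized with a robust solver. The small-motion pose parameterization, the helper names, and the use of scipy.optimize.least_squares are assumptions of the sketch, not the actual SVO implementation.

import numpy as np
from scipy.optimize import least_squares

def se3_small(xi):
    # First-order (small-motion) approximation of the SE(3) exponential map;
    # xi = (wx, wy, wz, tx, ty, tz). Adequate for a frame-to-frame sketch.
    wx, wy, wz, tx, ty, tz = xi
    R = np.array([[1.0, -wz,  wy],
                  [ wz, 1.0, -wx],
                  [-wy,  wx, 1.0]])
    return R, np.array([tx, ty, tz])

def sparse_align_residuals(xi, pts3d_prev, intens_prev, img_new, K):
    # Photometric residuals of sparse image alignment: the 3D points attached to
    # the previous frame's feature locations (FAST corners) are transformed by the
    # candidate pose, re-projected into the new image, and compared by intensity.
    R, t = se3_small(xi)
    res = np.zeros(len(pts3d_prev))
    for k, (X, i_prev) in enumerate(zip(pts3d_prev, intens_prev)):
        Xc = R @ X + t
        u = K[0, 0] * Xc[0] / Xc[2] + K[0, 2]
        v = K[1, 1] * Xc[1] / Xc[2] + K[1, 2]
        ui, vi = int(round(float(u))), int(round(float(v)))
        if 0 <= vi < img_new.shape[0] and 0 <= ui < img_new.shape[1]:
            res[k] = img_new[vi, ui] - i_prev
    return res

# Usage sketch (arrays assumed to be given):
# xi_hat = least_squares(sparse_align_residuals, np.zeros(6), loss="huber",
#                        args=(pts3d_prev, intens_prev, img_new, K)).x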
Using the previously computed (rough) camera posture transformation, the feature
points that are already in the map and co-viewed by the new image can be
re-projected back (from the map to the new image). Considering that the
re-projected feature point position may not coincide with the real feature point
position in the image, it needs to be corrected by minimizing the photometric
error. To align feature points, an affine transformation can be used.
It should be pointed out that although both alignment modules need to determine the six parameters of the camera pose, using only the first one makes pose drift very likely, while using only the second one makes the amount of computation very large.
Based on the rough camera posture obtained in Step (1) and the corrected feature
points obtained in Step (2), the camera pose and map point cloud can be optimized
by minimizing the re-projection error of the map point cloud to the new image. Note
that both Step (1) and Step (2) use the idea of the direct method, both of which
minimize the photometric error. The idea of the feature point method is used here,
which is to minimize the re-projection error. If only the camera pose is optimized,
only motion BA is used; if only map point cloud is optimized, only structure BA
is used.
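As a sketch of this step, the fragment below refines only the camera pose (motion-only BA) by minimizing the re-projection error of fixed map points against the corrected feature positions; the small-motion parameterization and the scipy-based solver are illustrative assumptions rather than the actual SVO code.

import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(xi, pts3d, obs_uv, K):
    # Motion-only bundle adjustment: map points are kept fixed and only the 6-DoF
    # camera pose (small-motion parameterization) is refined so that the projected
    # map points match the corrected feature positions obs_uv.
    wx, wy, wz, tx, ty, tz = xi
    R = np.array([[1.0, -wz, wy], [wz, 1.0, -wx], [-wy, wx, 1.0]])
    Xc = pts3d @ R.T + np.array([tx, ty, tz])
    proj = Xc[:, :2] / Xc[:, 2:3] * np.array([K[0, 0], K[1, 1]]) + K[:2, 2]
    return (proj - obs_uv).ravel()

# pose = least_squares(reprojection_residuals, np.zeros(6),
#                      args=(pts3d, obs_uv, K), loss="huber").x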
The mapping thread estimates the depth of a 3D point given an image and its pose. It mainly includes three modules: feature extraction, depth filter initialization, and depth filter update. They work under the guidance of two judgments.
A swarm robot is a decentralized system that can collectively accomplish tasks that
a single robot cannot do alone. The properties of swarm robots, such as scalability,
flexibility, and fault tolerance, have been greatly improved due to the development
of localization awareness and communication, self-organization and redundancy
techniques, etc., which make swarm robots an ideal candidate for performing
tasks. In large unknown environments, swarm robots can autonomously perform
simultaneous localization and mapping (SLAM) by navigating dangerous dynamic
environments using a self-organizing exploration scheme.
Swarm robots have some characteristics that distinguish them from centralized
multi-robot systems [34].
10.4.1.1 Scalability
Robots in swarms interact only with close peers and the immediate environment. Contrary to most centralized multi-robot systems, they do not require global knowledge or supervision to operate. Therefore, modifying the size of the swarm does not require reprogramming of individual robots, nor does it have a significant impact on the qualitative collective behavior. This enables swarm robots to achieve scalability—that is, maintaining performance as more agents join the system—as they can cope with environments of any size over a considerable range. Of course, an approach that only works with very expensive robots does not actually scale in practice, as economic constraints may prevent the acquisition of large numbers of robots. Therefore, the design of swarm SLAM methods should take into account the cost of a single robot.
10.4.1.2 Flexibility
Since swarms are decentralized and self-organizing, individual robots can dynamically assign themselves to different tasks to meet the requirements of specific environmental and operating conditions, even if those conditions change during operation. This adaptability provides flexibility for swarm SLAM: flexible swarm SLAM methods do not have to be tailored to very specialized hardware configurations or rely on existing infrastructure or global information sources to achieve good results.
10.4.1.3 Fault Tolerance

Swarm robots are composed of a large number of robots with high redundancy. This high redundancy, coupled with the lack of centralized control, prevents swarm robots from having a single point of failure. Therefore, the swarm SLAM method can achieve fault tolerance, as it can cope with the loss or failure of some robots (as well as with measurement noise). Likewise, fault tolerance makes economic sense: losing a robot should not have a significant impact on the cost of the task or its success.
Taking these characteristics into account, it can be seen that swarm SLAM should
have a different application than multi-robot SLAM: Swarm robots are best suited
for applications where the main constraint is time or cost rather than high accuracy.
Therefore, they should be more suitable for generating rough abstract graphs, such as
topological maps or simple semantic maps, rather than precise metric maps. In fact,
when an accurate map is required, there is usually enough time to construct it, and
when time (or cost) is the main constraint, it is often acceptable to generate an
approximate but informative map. The method of swarm SLAM should also be
suitable for mapping dangerous dynamic environments. As the environment evolves
over time, a single or small swarm of robots takes time to update the map, while a
large enough swarm can do so quickly.
In recent years, in addition to being combined with bionics [45], SLAM has increasingly been combined with deep learning and with multi-agent systems.
With the help of deep learning, the performance of odometry and closed-loop
detection can be improved, and the SLAM system’s understanding of environmental
semantics can be enhanced. For example, a visual odometry method called DeepVO
uses a convolutional neural network (CNN) on raw image sequences to learn
features and a recurrent neural network (RNN) to learn dynamical connections
between images [46]. This structure, which combines convolutional and recurrent networks, can efficiently extract the effective information between adjacent frames and at the same time has good generalization ability. As another example, in closed-loop detection, a ConvNet has been used to compute features of landmark regions and to compare the similarity of these regions, so as to judge the similarity between whole images and to improve detection robustness under partial occlusion and severe scene changes [47]. In fact, deep learning-based closed-loop detection methods are more robust to changing environmental conditions, seasonal changes, and occlusions caused by dynamic objects.
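The matching step behind such closed-loop detection can be sketched very simply once per-frame deep descriptors are available. The fragment below, which assumes precomputed ConvNet descriptors (e.g., pooled activations of some backbone), ranks earlier keyframes by cosine similarity; the threshold and minimum frame gap are illustrative parameters, not values from [47].

import numpy as np

def detect_loop_closures(descriptors, current, threshold=0.85, min_gap=50):
    # Loop-closure candidates from deep features: the current frame's (precomputed)
    # ConvNet descriptor is compared with all sufficiently old keyframe descriptors
    # using cosine similarity; frames above the threshold are returned, best first.
    d = descriptors[current]
    d = d / (np.linalg.norm(d) + 1e-12)
    candidates = []
    for k in range(max(0, current - min_gap)):
        v = descriptors[k] / (np.linalg.norm(descriptors[k]) + 1e-12)
        score = float(d @ v)
        if score > threshold:
            candidates.append((k, score))
    return sorted(candidates, key=lambda c: -c[1])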
An overview of some recent deep learning algorithms for closed-loop detection is
shown in Table 10.4.
In Table 10.4, see Sect. 5.2 for SIFT and SURF, Sect. 10.3.1 for ORB, and Sect.
10.1.1 for NDT.
The agents in a multi-agent system can communicate with each other, coordinate with each other, and solve problems in parallel, which can improve the efficiency of SLAM. Each agent is also relatively independent, with good fault tolerance and anti-interference ability, which helps SLAM in large-scale environments. For example, the multi-agent distributed architecture [63]
uses successive over-relaxation (SOR) and Jacobi over-relaxation (JOR) to solve
normal equations, which can effectively save data bandwidth.
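As a toy illustration of the Jacobi over-relaxation idea mentioned above, the following sketch iteratively solves a (dense) normal-equation system; in a distributed setting each agent would update only its own block of unknowns and exchange the current iterate, which is what makes the scheme bandwidth-friendly. The relaxation factor and stopping rule here are illustrative choices.

import numpy as np

def jacobi_over_relaxation(A, b, omega=0.8, tol=1e-8, max_iter=500):
    # Jacobi over-relaxation (JOR) for the normal equations A x = b.
    # Each component is updated from the previous iterate only, so agents can
    # compute their blocks in parallel and exchange a single iterate per round.
    D = np.diag(A)
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        x_jacobi = (b - A @ x + D * x) / D          # plain Jacobi update
        x_new = (1 - omega) * x + omega * x_jacobi  # relaxation with factor omega
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x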
Visual SLAM systems assisted by using inertial measurement units (IMUs) are
often referred to as visual-inertial navigation systems. Multi-agent collaborative
visual SLAM systems often have a moving subject equipped with one or more
visual sensors, which can estimate the change of its own posture and reconstruct a
3D map of the unknown environment through the perception of environmental
information.
An overview of several existing multi-agent vision SLAM system schemes is
shown in Table 10.5 [64].
In Table 10.5, CCM-SLAM is a multi-agent visual SLAM framework [65] that
incorporates IMU, each agent only runs a visual odometry with a limited number of
key frames, and the agent will detect the key frame information and send it to the
server (reducing the cost and communication burden of a single agent); the server
constructs a local map based on this information and fuses the local map information
through the method of location recognition. In the server, posture estimation and
bundle adjustment are applied to refine the map.
References
5. Zhao J, Zhao L, Huang SD, et al. (2020) 2D Laser SLAM with general features represented by
implicit functions. IEEE Robotics and Automation Letters 5(3): 4329-4336.
6. Biber P, Strasser W (2003) The normal distributions transform: A new approach to laser scan
matching. Proceedings of International Conference on Intelligent Robotics and Systems
2743-2748.
7. Olson E (2015) M3RSM: Many-to-many multi-resolution scan matching. Proceedings of IEEE
International Conference on Robotics and Automation 5815-5821.
8. Yin H, Ding XQ, Tang L, et al. (2017) Efficient 3D LiDAR based loop closing using deep
neural network. Proceedings of IEEE International Conference on Robotics and Biomimetics
481-486.
9. Arshad S, Kim GW (2021) Role of deep learning in loop closure detection for visual and
LiDAR SLAM: A survey. Sensors 21, #1243 (DOI: https://doi.org/10.3390/s21041243).
10. Magnusson M, Lilienthal AJ, Duckett T (2007) Scan registration for autonomous mining
vehicles using 3D-NDT. J. Field Robot 24, 803-827.
11. Douillard B, Underwood J, Kuntz N, et al. (2011) On the segmentation of 3D LiDAR point
clouds. Proceedings of the IEEE International Conference on Robotics and Automation 9-13.
12. Ning LR, Pang L, Dong D, et al. (2020) The combination of new technology and research status
of simultaneous location and mapping. Proceedings of the 6th Symposium on Novel Optoelec
tronic Detection Technology and Applications, #11455 (DOI: https://doi.org/10.1117/12.
2565347).
13. Huang ZX, Shao CL (2023) Survey of visual SLAM based on deep learning. Robot, DOI: https://doi.org/10.13973/j.cnki.robot.220426.
14. Wang JK, Jia X (2020) Survey of SLAM with camera-laser fusion sensor. Journal of Liaoning
University of Technology (Natural Science Edition) 40(6): 356-361.
15. Xu Y, Ou Y, Xu T. (2018) SLAM of robot based on the fusion of vision and LiDAR.
Proceedings of International Conference on Cyborg and Bionic Systems 121-126.
16. Graeter J, Wilczynski A, Lauer M (2018) LIMO: LiDAR-monocular visual odometry. Pro
ceedings of International Conference on Intelligent Robots and Systems 7872-7879.
17. Liang X, Chen H, Li Y, et al. (2016) Visual laser-SLAM in large-scale indoor environments.
Proceedings of International Conference on Robotics and Biomimetics 19-24.
18. Seo Y, Chou C (2019) A tight coupling of Visual-LiDAR measurements for an effective
odometry. Intelligent Vehicles Symposium 1118-1123.
19. Zhang J, Singh S (2015) Visual-LiDAR odometry and mapping: Low-drift, robust, and fast.
Proceedings of International Conference on Robotics and Automation 2174-2181.
20. Grisetti G, Stachniss C, Burgard W (2007) Improved techniques for grid mapping with
Rao-Blackwellized particle filters. IEEE Transactions on Robotics 23(1): 34-46.
21. Hess W, Kohler D, Rapp H, et al. (2016) Real-time loop closure in 2D LiDAR SLAM. Proc.
International Conference on Robotics and Automation 1271-1278.
22. Zhang J, Singh S (2014) LOAM: LiDAR odometry and mapping in real-time. Robotics: Science
and Systems Conference 1-9.
23. Mur-Artal R, Montiel JMM, Tardos JD (2015) ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics 31(5): 1147-1163.
24. Mur-Artal R, Tardos JD (2017) ORB-SLAM2: An open-source SLAM system for monocular,
stereo and RGB-D cameras. IEEE Transactions on Robotics 33(5): 1255-1262.
25. Campos C, Elvira R, Rodriguez J JG, et al. (2021) ORB-SLAM3: An accurate open-source
library for visual, visual-inertial, and Multimap SLAM. IEEE Transactions on Robotics 37(6):
1874-1890.
26. Rublee E, Rabaud V, Konolige K, et al. (2011) ORB: An efficient alternative to SIFT or SURF.
Proceedings of ICCV, 2564-2571.
27. Engel J, Schöps T, Cremers D (2014) LSD-SLAM: Large-scale direct monocular SLAM. Proceedings of ECCV 834-849.
28. Engel J, Stuckler J, Cremers D (2015) Large-scale direct SLAM with stereo cameras. IEEE
International Conference on Intelligent Robots and Systems 1935-1942.
29. Caruso D, Engel J, Cremers D (2015) Large-scale direct SLAM for omnidirectional cameras.
IEEE International Conference on Intelligent Robots and Systems 141-148.
30. Geyer C, Daniilidis K (2000) A unifying theory for central panoramic systems and practical
implications. Proceedings of ECCV 445-461.
31. Ying XH, Hu ZY (2004) Can we consider central catadioptric cameras and fisheye cameras
within a unified imaging model. Lecture Notes in Computer Science 3021: 442-455.
32. Barreto JP (2006) Unifying image plane liftings for central catadioptric and dioptric cameras.
Imaging Beyond the Pinhole Camera 21-38.
33. Forster C, Pizzoli M, Scaramuzza D (2014) SVO: Fast semi-direct monocular visual odometry.
Proceedings of IEEE International Conference on Robotics and Automation 15-22.
34. Kegeleirs M, Grisetti G, Birattari M (2021) Swarm SLAM: Challenges and Perspectives.
Frontiers in Robotics and AI 8: #618268 (DOI: https://doi.org/10.3389/frobt.2021.618268 ).
35. Dimidov C, Oriolo G, Trianni V (2016) Random walks in swarm robotics: An experiment with
kilobots, Swarm Intelligence 9882: 185-196.
36. Kegeleirs M, Garzon RD, Birattari M (2019) Random walk exploration for swarm mapping.
LNCS, in Towards Autonomous Robotic Systems 11650: 211-222.
37. Birattari M, Ligot A, Bozhinoski D, et al. (2019) Automatic off-line design of robot swarms: A
manifesto. Frontiers in Robotics and AI 6: 59.
38. Birattari M, Ligot A, Hasselmann K (2020). Disentangling automatic and semi-automatic
approaches to the optimization-based design of control software for robot swarms. Nature
Machine Intelligence 2, 494-499.
39. Spacy G, Kegeleirs M, Garzon RD (2020) Evaluation of alternative exploration schemes in the
automatic modular design of robot swarms of CCIS. Proceedings of the 31st Benelux Confer
ence on Artificial Intelligence 1196: 18-33.
40. Saeedi S, Trentini M, Seto M, et al. (2016) Multiple-robot simultaneous localization and
mapping: A review. Journal of Field Robotics 33, 3-46.
41. Fox D, Ko J, Konolige K, et al. (2006) Distributed multirobot exploration and mapping.
Proceedings of IEEE 94: 1325-1339.
42. Ghosh R, Hsieh C, Misailovic S, et al. (2020) Koord: a language for programming and verifying
distributed robotics application. Proceedings of ACM Program Language 4, 1-30.
43. Lajoie P-Y, Ramtoula B, Chang Y, et al. (2020) DOOR-SLAM: Distributed, online, and outlier
resilient SLAM for robotic teams. IEEE Robotics and Automation Letters 5(2): 1656-1663.
44. Kummerle R, Grisetti G, Strasdat H, et al. (2011) G2o: A general framework for graph
optimization. Proceedings of the IEEE International Conference on Robotics and Automation
3607-3613.
45. Li WL, Wu DW, Zhu HN, et al. (2021) A bionic simultaneous location and mapping with
closed-loop correction based on dynamic recognition threshold. Proceedings of the 33rd
Chinese Control and Decision Conference (CCDC), 737-742.
46. Wang S, Clark R, Wen H, et al. (2017) DeepVO: Towards end-to-end visual odometry with
deep recurrent convolutional neural networks. Proceedings of International Conference on
Robotics and Automation 2043-2050.
47. Sunderhauf N, Shirazi S, Dayoub F (2015) On the performance of ConvNet features for place
recognition. Proceedings of International Conference on Intelligent Robots and Systems
4297-4304.
48. Chen B, Yuan D, Liu C, et al. (2019) Loop closure detection based on multi-scale deep feature
fusion. Applied Science 9, 1120.
49. Merrill N, Huang G (2018) Lightweight unsupervised deep loop closure. Robotics: Science and
Systems.
50. Dube R, Cramariuc A, Dugas D, et al. (2018) SegMap: 3D segment mapping using data-driven
descriptors. Robotics: Science and Systems XIV.
51. Dube R, Cramariuc A, Dugas D, et al. (2019). SegMap: segment-based mapping and localiza
tion using data-driven descriptors. Int. J. Robot. Res. 39, 339-355.
52. Hu M, Li S, Wu J, et al. (2019) Loop closure detection for visual SLAM fusing semantic
information. IEEE Chinese Control Conference 4136-4141.
53. Liu Y, Xiang R, Zhang Q, et al. (2019) Loop closure detection based on improved hybrid deep
learning architecture. Proceedings of 2019 IEEE International Conferences on Ubiquitous
Computing and Communications and Data Science and Computational Intelligence and
Smart Computing, Networking and Services, 312-317.
54. Xia Y, Li J, Qi L, et al. (2016) Loop closure detection for visual SLAM using PCANet features.
Proceedings of the International Joint Conference on Neural Networks 2274-2281.
55. Zaganidis A, Zerntev A, Duckett T, et al. (2019) Semantically assisted loop closure in SLAM
using NDT histograms. Proceedings of the 2019 IEEE/RSJ International Conference on Intel
ligent Robots and Systems (IROS) 4562-4568.
56. Chen X, Labe T, Milioto A, et al. (2020) OverlapNet: Loop closing for LiDAR-based SLAM.
Proceedings of the Robotics: Science and Systems (RSS), Online Proceedings.
57. Wang S, Lv X, Liu X, et al. (2020) Compressed holistic ConvNet representations for detecting
loop closures in dynamic environments. IEEE Access 8, 60552-60574.
58. Facil JM, Olid D, Montesano L, et al. (2019) Condition-invariant multi-view place recognition.
arXiv:1902.09516.
59. Yin H, Tang L, Ding X, et al. (2018) LocNet: Global localization in 3D point clouds for mobile
vehicles. Proceedings of IEEE Intelligent Vehicles Symposium 728-733.
60. Olid D, Facil JM, Civera J. (2018) Single-view place recognition under seasonal changes.
arXiv:1808.06516.
61. Zywanowski K, Banaszczyk A, Nowicki M (2020) Comparison of camera-based and 3D
LiDAR-based loop closures across weather conditions. arXiv:2009.03705.
62. Wang Y, Zell A (2018) Improving feature-based visual SLAM by semantics. Proceedings of the
2018 IEEE International Conference on Image Processing, Applications and Systems 7-12.
63. Cieslewski T, Choudhary S, Scaramuzza D (2018) Data-efficient decentralized visual SLAM.
Proceedings of IEEE International Conference on Robotics and Automation 2466-2473.
64. Wang L, Yang GL, Cai QZ, et al. (2020) Research progress in collaborative visual SLAM for
multiple agents Navigation Positioning and Timing 7(3): 84-92.
65. Schmuck P, Chli M (2019) CCM-SLAM: Robust and efficient centralized collaborative mon
ocular simultaneous localization and mapping for robotic teams. Journal of Field Robotics
36(4): 763-781.
Chapter 11
Spatial-Temporal Behavior Understanding
The image engineering survey series mentioned in Chap. 1 has been going on for
28 years since the beginning of the literature statistics in 1995 [4]. In the second
decade of the image engineering survey series (starting with the literature statistics in
2005), with the emergence of new hotspots in image engineering research and
applications, a new subcategory has been added to the image understanding
category—C5: spatial-temporal techniques (3D motion analysis, posture detection,
object tracking, behavior judgment, and understanding) [5]. What is emphasized
here is the comprehensive use of various information in images/videos to make
corresponding judgments and interpretations of the scene and the dynamic situation
of the objects in it.
In the past 18 years, a total of 314 articles in the C5 subcategory have been
collected in the survey series, and their distribution in each year is shown by
histogram in Fig. 11.1. The figure also shows the development trend obtained by
fitting the number of documents in each year with a third-order polynomial. It can be
seen that the number of documents in the first few years fluctuated noticeably. Later, it was relatively stable, though the output was modest; however, the number of documents in the past 4 years has increased significantly, with an average of more than 30 per year, while in the previous 14 years the average was only about 13 articles per year.
Fig. 11.1 The numbers and their change for spatial-temporal technical documents
etc.) or return the ball (including moving, extending the arm, turning the wrist,
drawing the ball, etc.). Going to the paddle and picking up the ball is often seen as an
activity. In addition, two players hitting the ball back and forth to win points is also a
typical activity scene. The competition between sports teams is generally regarded as
an event, and the awarding of awards after the competition is also a typical event.
Although a player's self-motivation by making a fist after winning the game can be
regarded as an action, it is more often regarded as a behavioral performance of the
player. When the players play a beautiful counterattack, the audience’s applause,
shouting, cheering, etc., are also attributed to the behavior of the audience.
It should be pointed out that the concepts of the last three levels are often used
loosely in many studies. For example, when an activity is called an event, it
generally refers to some abnormal activities (such as a dispute between two people,
an old man walking and falling, etc.); when an activity is called an action, the
meaning (behavior) and nature of the activity (such as stealing, as the act of stealing
or breaking into a house is called theft) are emphasized.
The research corresponding to the content of the first two levels is relatively
mature [7], and related technologies have been widely used in many other tasks. The
following sections in this chapter mainly focus on the last three levels and make
some distinctions among them as much as possible.
Fig. 11.3 Example pictures of actions in the Weizmann action recognition database
The time state model models the probabilities between states and between states and observations. Each state summarizes the action appearance at a certain moment, and the observation at that time corresponds to the image representation. The time state model is either generative or discriminative.
The generative model learns a joint distribution between observations and
actions, to model each action class (considering all variations). Discriminative
models learn the probabilities of action classes under observation conditions; they
do not model classes but focus on differences between classes.
The most typical of the generative models is the hidden Markov Model (HMM),
in which the hidden states correspond to the various steps of the action. Hidden states
model state transition probabilities and observation probabilities. There are two
independent assumptions here. One is that the state transition only depends on the
previous state, and the other is that the observation only depends on the present state.
Variations of HMM include the maximum entropy Markov model (MEMM), the factored-state hierarchical HMM (FS-HHMM), and the hierarchical variable transition hidden Markov model (HVT-HMM).
On the other hand, discriminative models model the conditional distribution of action classes given the observations, combining multiple observations to distinguish different action classes. Such models are beneficial for distinguishing related actions. The conditional random field (CRF) is a typical discriminative model, and its improvements include the factorial CRF (FCRF), generalizations of the CRF, etc.
the foreground, background, contour, optical flow, and changes of the image) and
(2) human body model-based methods, which use a human body model to represent the structural features of the actor, for example, describing the action with a sequence of human joint points. No matter which kind of method is adopted, detecting the human body and detecting and tracking its important parts (such as the head, hands, and feet) play an important role.
The representation and recognition of actions and activities is a relatively new but
immature field [11]. The method used is largely dependent on the researcher’s
purpose. In scene interpretation, representations can be independent of the object
(such as a person or car) that led to the activity; in surveillance applications, human
activities and interactions between people are generally concerned. In a holistic
approach, global information is preferred over component information, such as
when determining a person’s gender. For simple actions such as walking or running,
a local approach can also be considered, in which more attention is paid to detailed
actions or action primitives. A framework for action recognition can be found in [12].
The representation and calculation methods of human body posture can be mainly
divided into three types:
1. Appearance-based method: Instead of modeling the physical structure of the
human directly, the human posture is analyzed using information such as color,
texture, and contour. It is difficult to estimate human posture since only the
apparent information in 2D images is exploited.
2. Human body model-based methods: The human body is first modeled using a line
graph model, 2D or 3D model, and then the human posture is estimated by
analyzing these parameterized human models. Such methods usually require
high image resolution and object detection accuracy.
3. Method based on 3D reconstruction: First, the 2D moving objects obtained by
multiple cameras at different positions are reconstructed into 3D moving objects
through corresponding point matching, and then the camera parameters and
imaging formulas are used to estimate the human posture in 3D space.
Posture can be modeled based on spatial-temporal interest points (see [9]). If only
the spatial-temporal Harris interest point detector is used (see [13]), the obtained
spatial-temporal interest points are mostly located in the region of sudden motion.
The number of such points is small (a sparse representation), so important motion information in the video is easily lost, resulting in detection failure. To
overcome this problem, dense spatial-temporal interest points can also be extracted
with the help of motion intensity to fully capture motion-induced changes. Here, the
motion intensity can be calculated by convolving the image with a spatial Gaussian
filter and a temporal Gabor filter (see [10]). After the spatial-temporal interest points
are extracted, a descriptor is first established for each point, and then each posture is
modeled. A specific method is to first extract the spatial-temporal feature points of
the posture in the training sample library as the underlying feature, so that one
posture corresponds to a set of spatial-temporal feature points. The posture samples
are then classified using unsupervised classification methods to obtain clustering
results of typical postures. Finally, each typical posture category is modeled using an
EM-based Gaussian mixture model.
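A minimal sketch of this clustering-plus-mixture-modeling step is given below using scikit-learn; the descriptor array, the number of typical postures, and the number of mixture components are hypothetical parameters, and the library choice is an assumption of the sketch rather than the method used in the cited work.

import numpy as np
from sklearn.mixture import GaussianMixture

def model_typical_postures(descriptors, n_postures=10, n_components=3):
    # Unsupervised grouping of posture samples followed by an EM-based Gaussian
    # mixture model per typical posture. `descriptors` is an (N, D) array of
    # spatial-temporal interest-point descriptors, one row per posture sample.
    coarse = GaussianMixture(n_components=n_postures, covariance_type="diag",
                             random_state=0).fit(descriptors)
    labels = coarse.predict(descriptors)
    models = {}
    for c in range(n_postures):
        members = descriptors[labels == c]
        if len(members) >= n_components:
            models[c] = GaussianMixture(n_components=n_components,
                                        covariance_type="diag").fit(members)
    return models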
A recent trend in posture estimation in natural scenes is to use a single frame for
posture detection in order to overcome the problem of tracking with a single view in
unstructured scenes. For example, robust part detection and probabilistic combina
tion of parts have resulted in better estimates of 2D postures in complex movies.
Actions lead to changes in posture. If each static posture of the human body is
defined as a state, then by means of the state space method (also known as the
probability network method), the states are switched through the transition proba
bility, and the construction of an activity sequence can be obtained by performing a
traversal between the states of the corresponding posture.
Interactive activities are more complex activities. They can be divided into two
categories: (1) the interaction between people and the environment, such as people
driving a car and holding a book, and (2) interpersonal interaction, which often refers
to the communication activities or contact behaviors of two people (or multiple
people), which is obtained by combining the (atomic) activities of a single person.
Single-person activities can be described with the help of probabilistic graph models.
The probabilistic graph model is a powerful tool for modeling continuous dynamic
feature sequences and has a relatively mature theoretical basis. Its disadvantage is
that the topology of its model depends on the structural information of the activity
itself, so a large amount of training data is required for complex interactive activities
to learn the topology of the graph model. To combine single-person activities, the
method of statistical relational learning (SRL) can be used. SRL is a machine
learning method that integrates relational/logical representation, probabilistic rea
soning, machine learning, and data mining to obtain a likelihood model of
relational data.
group object activities are semantically analyzed, in order to explain the trend and
situation of the whole scene.
In the analysis of group activity, the statistics of the number of individuals
participating in the activity is a basic data. For example, in many public places,
such as squares, stadium entrances, and exits, it is necessary to have certain statistics
on the number of people. Fig. 11.4 shows a picture of people counting in a
surveillance scenario [13]. Although there are many people in the scene with
different movement patterns, the concern here is the number of people in a certain
region (in the region enclosed by the box).
With the deepening of research, the categories of actors and actions that need to be
considered in the spatial-temporal behavior understanding are increasing. To do
this, the actor and action need to be jointly modeled [14]. In fact, jointly detecting an
ensemble of several objects in an image is more robust than detecting individual
objects individually. Therefore, joint modeling is necessary when considering mul
tiple different types of actors performing multiple different types of actions.
Consider the video as a 3D image f(x, y, t) and represent the video using the graph
structure G = (N, A). Among them, the node set N = (n 1, ..., nM) represents M voxels
(or M super-voxels), and the arc set A(n) represents the voxel set in the neighborhood
of a certain n in N. Assume that the body label set is denoted by X, and the action
label set is denoted by Y.
Consider a set of random variables {x} representing actors and a set of random
variables {y} representing actions. The actor-action understanding problem of inter
est can be viewed as a maximum a posteriori problem:
It is assumed that the actor and the action are independent of each other, that is, any
actor can initiate any action. At this point, a set of classifiers need to be trained in the
action space to classify different actions. This is the simplest method but does not
emphasize the existence of actor-action tuples, that is, some actors may not initiate
all actions, or some actors can only initiate certain actions. In this way, when there
are many different actors and different actions, sometimes unreasonable combina
tions (such as people can fly, birds can swim, etc.) occur when using the naive Bayes
model.
The joint product space model uses the actor space X and the action space Y to generate a new label space Z; here, the product relationship Z = X × Y is used. In the
joint product space, a classifier can be learned directly for each actor-action tuple.
Obviously, this method emphasizes the existence of actor-action tuples, which can
eliminate the appearance of unreasonable combinations, and it is possible to use
more cross-actor-action features to learn more discriminative classifiers. However,
this approach may not take advantage of commonalities across different actors or
different actions, such as steps and arm swings for both adults and children to walk.
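The construction of this product label space is straightforward; the short Python fragment below builds it from the actor and action label sets of the A2D setting described later (the explicit label lists are written out here only for illustration).

from itertools import product

ACTORS = ["adult", "infant", "cat", "dog", "bird", "car", "ball"]          # label set X
ACTIONS = ["walking", "running", "jumping", "rolling", "climbing",
           "crawling", "flying", "eating", "none"]                          # label set Y

# Joint product label space Z = X x Y: one class per actor-action tuple.
# A classifier trained over Z scores tuples directly, so unreasonable pairs
# simply never receive training data and are never predicted.
Z = list(product(ACTORS, ACTIONS))
print(len(Z))   # 63 raw combinations; only a subset of them is reasonable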
The three-layer model unifies the naive Bayes model and the joint product space
model. It simultaneously learns a classifier in actor space X, action space Y, and joint
actor-action space Z. At inference time, it infers Bayesian terms and joint product
space terms separately and then combines them linearly to get the final result. It not
only models actor-action intersection but also models different actions initiated by
the same actor and the same action initiated by different actors.
In practice, many videos have multiple actors and/or initiate multiple actions, which
is the case for multi-labels. At this point, both x and y are binary vectors with
dimensions |X| and |Y|. The value of x_i is 1 if the ith actor type exists in the video, and 0 otherwise. Similarly, the value of y_j is 1 if the jth action type exists in the video, and
0 otherwise. This generalized definition does not confine specific elements in x to
specific elements in y. This facilitates an independent comparison of the multi-label
performance of actors and actions with the multi-label performance of actor-action
tuples.
To study the situation where multiple actors initiate multiple actions, a
corresponding video database has been constructed [15]. This database is called
the actor-action database (A2D). In it, a total of seven actor categories are considered (adults, infants, cats, dogs, birds, cars, and balls) and nine action categories (walking, running, jumping, rolling, climbing, crawling, flying, eating, and no action, i.e., none of the first eight). The actors include both articulated bodies, such as adults, infants, cats, dogs, and birds, and rigid bodies, such as cars and balls. Many actors can initiate the same action, but no actor can initiate all of them. So, while there are 63 combinations, some of them are unreasonable (or hardly ever occur), resulting in a total of 43 reasonable actor-action tuples. Using the
text of these 43 actor-action tuples, 3782 videos were collected from YouTube, ranging in length from 24 to 332 frames (136 frames per segment on average). The number of video segments corresponding to each actor-action tuple is shown in Table 11.1. The blanks in the table correspond to unreasonable actor-action tuples, for which no video was collected. It can be seen from Table 11.1 that the number of video segments corresponding to each actor-action tuple is on the order of one hundred.

Table 11.1 Number of video segments corresponding to actor-action labels in the database

           Walking  Running  Jumping  Rolling  Climbing  Crawling  Flying  Eating  No action
Adults       282      175      174      105      101       105       -      105       761
Infants      113       -        -       107      104       106       -       -         36
Cats         113       99      105      103      106        -        -      110        53
Dogs         176      110      104      104       -        109       -      107        46
Birds        112       -       107      107       99        -       106     105        26
Cars          -       120      107      104       -         -       102      -         99
Balls         -        -       105      117       -         -       109      -         87
Among these 3782 videos, the number of video segments containing different
numbers (1-5) of actors, the number of video segments containing different numbers
(1-5) of actions, and the number of video segments containing different numbers of
actors-actions, respectively, are shown in Table 11.2. It can be seen from Table 11.2
that in more than one third of the video segments, the number of actors or actions is
greater than 1 (the last four columns of the bottom row in the table, including one
actor initiated more than two actions, or more than two actors initiated one action).
For the case of multi-label actor-action recognition, three classifiers can still be
considered similar to single-label actor-action recognition: a multi-label actor-action
classifier using naive Bayes, a multi-label actor-action classifier in the joint product
space, and an actor-action classifier based on a three-layer model that combines the
first two classifiers.
Multi-label actor-action recognition can be viewed as a retrieval problem. Experiments on the database introduced earlier (with 3036 segments as training set and
746 segments as test set, with basically similar ratios for various combinations) show
that the multi-label actor-action classifier in the joint product space performs better
than naive Bayes classifier, while the effect of the multi-label actor-action classifier
based on the three-layer model can still be improved [14].
then introduce a method based on the joint product space model, which utilizes the
tuple [x, y] to jointly consider the actor and action. Next consider a two-layer model
that considers the association of actor and action variables. Finally, a three-layer
model is introduced, which considers both intra-category linkages and inter-category
linkages.
Similar to the case in single-label actor-action recognition, the naive Bayes model can be represented as:

P(x, y \mid f) \propto \prod_{i \in M} q_i(x_i)\, r_i(y_i) \prod_{i \in M} \prod_{j \in A(i)} q_{ij}(x_i, x_j)\, r_{ij}(y_i, y_j) \qquad (11.2)
Among them, qi and ri encode the potential functions defined in the actor and
action models, respectively, and qij and rij encode the potential functions in the actor
node set and the action node set, respectively.
Now, it is required to train the classifiers {f_c | c ∈ X} on the actor features and to train the classifiers {g_c | c ∈ Y} on the action features. The pairwise edge potential functions have the form of the following contrast-sensitive Potts model:

q_{ij}(x_i, x_j) = \begin{cases} 1 & x_i = x_j \\ \exp\!\left[-k\,(1 + \chi^2_{ij})\right] & \text{otherwise} \end{cases} \qquad (11.3)

r_{ij}(y_i, y_j) = \begin{cases} 1 & y_i = y_j \\ \exp\!\left[-k\,(1 + \chi^2_{ij})\right] & \text{otherwise} \end{cases} \qquad (11.4)

where \chi^2_{ij} is the \chi^2 distance between the feature histograms of nodes i and j, and k is a parameter to be learned from the training data. Actor-action semantic segmentation can be obtained by solving these two flat conditional random fields independently [15].
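A direct transcription of the pairwise term of Eqs. (11.3)/(11.4), as reconstructed above, is shown below; the chi-square computation over feature histograms and the parameter k are written out explicitly, and the exact functional form should be treated as a reconstruction rather than a verbatim copy of the original model.

import numpy as np

def potts_pairwise(label_i, label_j, hist_i, hist_j, k=1.0):
    # Contrast-sensitive Potts potential, cf. Eqs. (11.3)/(11.4): equal labels get
    # potential 1; different labels are penalized depending on the chi-square
    # distance between the feature histograms of the two super-voxels.
    if label_i == label_j:
        return 1.0
    chi2 = 0.5 * float(np.sum((hist_i - hist_j) ** 2 / (hist_i + hist_j + 1e-12)))
    return float(np.exp(-k * (1.0 + chi2)))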
Consider a new set of random variables z = {z1, ..., zM}; they are also defined over
all super-voxels in a video and pick labels from the actor and action product space
Z = X × Y. In this way, the actor-action tuple is captured jointly as a single element, but the common factors of actor and action across different tuples cannot be modeled (the model introduced below will solve this problem). This gives a single-layer graph model:
P(z \mid f) \propto \prod_{i \in M} s_i(z_i) \prod_{i \in M} \prod_{j \in A(i)} s_{ij}(z_i, z_j) = \prod_{i \in M} s_i([x_i, y_i]) \prod_{i \in M} \prod_{j \in A(i)} s_{ij}([x_i, y_i], [x_j, y_j]) \qquad (11.5)
where s_i is the potential function of the joint actor-action product-space label and s_{ij} is the pairwise potential function between the two nodes of the corresponding tuple [x, y]. Specifically, s_i contains the classification score obtained by applying the trained actor-action classifier {h_c | c ∈ Z} to node i, and s_{ij} has the same form as Eq. (11.3) or Eq. (11.4). For illustration, see Fig. 11.5a, b.
Given an actor node x and an action node y, the two-layer model uses edges that encode the potential function of the tuple to connect each random variable pair {(x_i, y_i)}_{i=1}^{M} and directly obtains the covariance of the cross actor-action labels:
P(x, y \mid f) \propto \prod_{i \in M} q_i(x_i)\, r_i(y_i)\, t_i(x_i, y_i) \prod_{i \in M} \prod_{j \in A(i)} q_{ij}(x_i, x_j)\, r_{ij}(y_i, y_j) \qquad (11.6)
where t_i(x_i, y_i) is the potential function learned for the labels of the entire product space, which can be obtained in the same way as s_i; see Fig. 11.5c. Here, connecting edges across the layers are added.
The naive Bayes model represented by the previous Eq. (11.2) does not consider the
connection between the actor variable x and the action variable y. The joint product
space model of Eq. (11.5) combines features across actors and actions as well as
interaction features within the neighborhood of an actor-action node. The two-layer
model of Eq. (11.6) adds actor-action interactions between separate actor nodes and
action nodes but does not account for the spatial-temporal variation of these
interactions.
A three-layer model is given below, which can explicitly model this spatial-temporal variation (Fig. 11.5d). It combines the nodes of the joint product space with all the actor nodes and action nodes. In the corresponding potential functions, w(y_i | x_i) and w(x_i | y_i) are the classification scores of conditional classifiers specially trained for this three-layer model.
These conditional classifiers are the main reason for the performance improve
ment: An action-specific, disjunctive classifier based on the actor-type condition can
take advantage of properties specific to the actor-action tuple. For example, when
training a conditional classifier on the action “eating” given an actor adult, all other
actions of the actor adult can be treated as negative training samples. In this way, this
three-layer model takes into account all the connections in each actor space and each
action space, as well as in the joint product space. In other words, the first three basic
models are all special cases of the three-layer model. It can be shown that the (x*, y*, z*) maximizing Eq. (11.7) also maximizes Eq. (11.1) [14].
Fig. 11.6 Classification diagram of action and activity modeling recognition techniques
The methods of action modeling can be divided into three categories: nonparametric modeling, volumetric modeling, and parametric time-series modeling. Nonparametric methods extract a set of features from each frame of the video and match these features to stored templates. Volumetric approaches do not extract features frame by frame but treat the video as a 3D volume of pixel intensities and extend standard image features (e.g., scale-space extrema, spatial filter responses) to 3D. Parametric time-series methods model the temporal dynamics of motion, estimating parameters specific to a set of actions from a training set.
2D Template
This type of method involves the steps of performing motion detection and then
tracking objects in the scene. After tracking, build a cropped sequence containing the
object. Changes in scale can be compensated for by normalizing the object size.
Calculate a periodic index for a given action, and if the periodicity is strong, perform
action recognition. For identification, an estimate of the period is used to segment the
sequence of periods into individual periods. The average period is decomposed into
several temporal segments, and flow-based features are computed for each spatial
point in each segment. Average the flow features in each segment into a single frame.
The average flow frame in this activity cycle constitutes the template for each action
group.
A typical approach is to model actions as temporal templates. The background is first subtracted, and the foreground patches extracted from a sequence are combined into a still image. There are two ways of combining: One is to weight all frames in
the sequence with the same weight, and the resulting representation can be called a
motion energy image (MEI); the other is to use different weights for different
frames in the sequence, generally using larger weights for new frames and smaller
weights for old frames; the resulting representation can be called a motion history
image (MHI). For a given action, use the combined images to form a template. Then,
the region invariant moments of the template are calculated and identified.
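The two combination schemes can be sketched in a few lines; in the sketch below, simple frame differencing stands in for the background subtraction step, and the threshold and decay duration tau are illustrative parameters.

import numpy as np

def motion_history_image(frames, threshold=30, tau=None):
    # Motion energy image (MEI) and motion history image (MHI) from a gray-level
    # frame sequence: frame differences are thresholded; the MEI weights all
    # frames equally, while the MHI gives newer frames larger weights.
    tau = tau if tau is not None else len(frames) - 1
    h, w = frames[0].shape
    mei = np.zeros((h, w), dtype=bool)
    mhi = np.zeros((h, w), dtype=float)
    for t in range(1, len(frames)):
        moving = np.abs(frames[t].astype(int) - frames[t - 1].astype(int)) > threshold
        mei |= moving                                        # equal weighting
        mhi[moving] = tau                                    # refresh where motion occurs
        mhi[~moving] = np.maximum(mhi[~moving] - 1, 0)       # decay elsewhere
    return mei, mhi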
3D Object Model
The 3D object model is a model established for the spatial-temporal object, such as
the generalized cylinder model (see [10]), the 2D contour stacking model, and so
on. The motion and shape information of the object is included in the 2D contour
stacking model, from which geometric features of the object surface, such as peaks,
pits, valleys, ridges, etc., can be extracted (see [10]). If the 2D contours are replaced by blobs obtained from background subtraction, a binary space-time volume is obtained.
A lot of action recognition involves data in high-dimensional space. Since the feature
space becomes exponentially sparse with dimensionality, a large number of samples
are required to construct an effective model. The inherent dimension of the data can
be determined by using the manifold where the learning data is located, which has a
relatively small degree of freedom and can help design an effective model in a
low-dimensional space. The easiest way to reduce dimensionality is to use principal
component analysis (PCA) techniques, where the data are assumed to be in a linear
subspace. In practice, except in very special cases, the data is not in a linear
subspace, so methods that can learn the eigen-geometry of the manifold from a
large number of samples are needed. Nonlinear dimensionality reduction techniques
allow data points to be represented according to how close they are to each other in a
nonlinear manifold; typical methods include local linear embedding (LLE) and
Laplacian eigenmaps.
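A linear baseline for this dimensionality-reduction step is shown below as a plain SVD-based PCA projection; it is only a sketch of the simplest case discussed above, and nonlinear embeddings would replace it when the action data lie on a curved manifold.

import numpy as np

def pca_embed(samples, dim=3):
    # Linear dimensionality reduction by PCA: project the (N, D) action feature
    # matrix onto the `dim` directions of largest variance.
    X = samples - samples.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:dim].T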
Space-Time Filtering
Component-Based Approach
Sub-volume Matching
Sub-volume matching refers to the matching between the sub-volumes in the video
and the template. For example, the action can be matched with the template from the
perspective of spatial-temporal motion correlation. The main difference between this
method and the component-based method is that it does not need to extract the action
descriptor from the extreme point of the scale space but checks the similarity
between two local spatial-temporal patches (by comparing the motion between the
two patches). However, it is time-consuming to calculate the whole video volume.
One way to solve this problem is to extend the successful fast Haar feature (box
feature) in object detection to 3D. The Haar feature of 3D is the output of 3D filter
bank, and the filter coefficients are 1 and - 1. Combining the output of these filters
with the bootstrap method (see [10]) can obtain robust performance. Another method
is to regard the video volume as a collection of sub-volumes of any shape. Each
sub-volume is a spatially consistent stereo region, which can be obtained by clus
tering the pixels that are close in appearance and space. Then the given video is over
segmented into many sub-volumes or super-voxels. The action template is matched
by searching for the smallest set of regions in these sub-volumes that can maximize
the overlap between the sub-volume set and the template.
The advantage of sub-volume matching is that it is robust to noise and occlusion. If combined with optical flow features, it is also robust to appearance changes. The disadvantage of sub-volume matching is that it is easily affected by background changes.
(4) Tensor-based method
A tensor is the generalization of a 2D matrix to multidimensional space. A 3D space-time volume can naturally be regarded as a tensor with three independent dimensions. For example, human action, human identity, and joint trajectory can be regarded as three independent dimensions of a tensor. By decomposing the overall data tensor into dominant modes (similar to a generalization of PCA), the signatures of the corresponding action and identity (the person who performs the action) can be extracted. Of course, the three dimensions of the tensor can also be taken directly as the three dimensions of the spatial-temporal domain, that is, (x, y, t).
Tensor-based method provides a direct method to match video as a whole, which
does not need to consider the middle-level representation used in the previous
methods. In addition, other kinds of features (such as optical flow, spatial-temporal
filter response, etc.) can also be easily combined by increasing the tensor dimension.
The first two modeling methods are more suitable for simple actions, and the
modeling methods introduced below are more suitable for complex actions that
span the time domain, such as complex dance steps in ballet video, special gestures
of instrument players, etc.
Hidden Markov model (HMM) is a typical model in state space. It is very effective
in modeling time-series data, has good generalization and discrimination, and is
suitable for the work that needs recursive probability estimation. In the process of
constructing discrete hidden Markov model, the state space is regarded as a finite set
of discrete points. It is modeled as a series of probabilistic steps from one state to
another over time. The three key problems of the hidden Markov model are evaluation (inference), decoding, and learning. The hidden Markov model was first used to recognize tennis shot
actions, such as forehand, forehand volley, backhand, backhand volley, smash, etc.
Among them, a series of images with background subtraction are modeled into
hidden Markov models corresponding to specific categories. Hidden Markov model
can also be used to model time-varying actions (such as gait).
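A minimal sketch of such class-specific HMM recognition is given below using the third-party hmmlearn package (an assumption of this sketch, not a tool named in the text); each action class gets its own Gaussian HMM, and a test sequence is assigned to the class whose model gives the highest log-likelihood.

import numpy as np
from hmmlearn import hmm   # third-party HMM library, used here for illustration

def train_action_hmms(sequences_by_class, n_states=4):
    # One Gaussian HMM per action class (e.g., forehand, backhand, smash): each
    # model is trained on the concatenated feature sequences of its class, with
    # the hidden states playing the role of the action's successive steps.
    models = {}
    for action, seqs in sequences_by_class.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        models[action] = m.fit(X, lengths)
    return models

def classify_action(models, seq):
    # The class whose HMM assigns the highest log-likelihood wins.
    return max(models, key=lambda a: models[a].score(seq))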
A single hidden Markov model can be used to model a single person's action. For multi-person actions or interactive actions, a pair of hidden Markov models can be used to represent alternating actions. In addition, domain knowledge
can also be combined into the construction of hidden Markov models, or hidden
Markov models can be combined with object detection to take advantage of the
relationship between actions and action objects. For example, the prior knowledge of
state duration can be combined into the framework of hidden Markov model, and the
resulting model is called semi-HMM. If the state space is added with a discrete label
for modeling high-level behavior, the hidden Markov model of mixed state can be
used for modeling nonstationary behavior.
Linear dynamic systems (LDS) are more general than hidden Markov models, in
which the state space is not limited to a set of finite symbols, but can be continuous
values in Rk space, where k is the dimension of the state space. The simplest linear
dynamic system is a first-order time invariant Gaussian Markov process, which can
be expressed as:

x(t+1) = A\,x(t) + w(t) \qquad (11.10)

y(t) = C\,x(t) + v(t) \qquad (11.11)

where x ∈ R^d is the d-D state vector, y ∈ R^n is the n-D observation vector, d << n, and w and v are the process noise and observation noise, respectively; both are zero-mean Gaussian with covariance matrices P and Q, respectively. The linear dynamic system can be regarded as an extension of the hidden Markov model with a Gaussian observation model to a continuous state space; it is more suitable for processing high-dimensional time-series data, but it is still not suitable for nonstationary actions.
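To make Eqs. (11.10)/(11.11) concrete, the following sketch simulates such a first-order Gauss-Markov process; the matrices and noise covariances are user-supplied, and the function is purely illustrative rather than part of any particular tracking system.

import numpy as np

def simulate_lds(A, C, P, Q, x0, steps, rng=np.random.default_rng(0)):
    # First-order time-invariant Gauss-Markov process, cf. Eqs. (11.10)/(11.11):
    # x(t+1) = A x(t) + w(t),  y(t) = C x(t) + v(t),
    # with zero-mean Gaussian noise of covariance P (process) and Q (observation).
    x, states, obs = np.asarray(x0, dtype=float), [], []
    for _ in range(steps):
        x = A @ x + rng.multivariate_normal(np.zeros(len(x)), P)
        y = C @ x + rng.multivariate_normal(np.zeros(C.shape[0]), Q)
        states.append(x.copy())
        obs.append(y)
    return np.array(states), np.array(obs)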
In contrast to Eqs. (11.10) and (11.11), both A and C may now change with time. In order to handle such complex dynamics, the commonly used approach is the switching linear dynamic system (SLDS) or jump linear
system (JLS). Switched linear dynamic system includes a group of linear dynamic
systems and a switching function, which changes model parameters by switching
between models. In order to recognize complex motion, a multi-layer method
including multiple different levels of abstraction can be adopted. At the lowest
level, there is a series of input images; the next level includes the regions of consistent motion, called blobs; the level above that combines the trajectories of the blobs over time; and the highest level includes a hidden Markov model that represents complex behavior.
Although switched linear dynamic systems have stronger modeling and descrip
tion capabilities than hidden Markov models and linear dynamic systems, learning
and reasoning are much more complex in switched linear dynamic systems, so
approximate methods are generally required. In practice, it is difficult to determine
the appropriate number of switching states, which often requires a lot of training data
or complicated manual adjustment.
Compared with action, activity not only lasts a long time, but also most of the
activity applications that people pay attention to, such as monitoring and content
based indexing, include multiple action people. Their activities interact not only with
each other but also with contextual entities. In order to model complex scenes, it is
necessary to represent and infer the intrinsic structure and semantics of complex
behaviors at a high level.
Fig. 11.7 Diagram showing the probability Petri net of car picking up activities
DBN is more general than HMM if the dependence between multiple random
variables is considered. However, in DBN, the time model is also a Markov model as
in HMM, so the basic DBN model can only deal with the behavior of sequences. The
development of graph models for learning and reasoning enables them to model
structured behavior. However, for large-scale network, learning local CPD often
requires a lot of training data or complicated manual adjustment by experts, both of
which have brought certain restrictions on the use of DBN in large-scale
environments.
(2) Petri net
Petri net is a mathematical tool to describe the relationship between conditions
and events. It is especially suitable for modeling and visualizing behaviors such as
sequencing, concurrency, synchronization, and resource sharing. A Petri net is a bipartite graph containing two kinds of nodes, locations (places) and transitions, where a location refers to the state of an entity and a transition refers to a change of that state. Consider an example of using a probabilistic Petri net to represent a
car pickup activity, as shown in Fig. 11.7. In the figure, the position marks are p1, p2, p3, p4, and p5, and the transition marks are t1, t2, t3, t4, t5, and t6. In this Petri net, p1 and p3 are the starting nodes and p5 is the ending node. A car enters the scene and puts a token in position p1. Transition t1 can be enabled at this time, but it will not fire until the relevant condition (i.e., the car is parked in the nearby parking space) is met. At that point, the token at p1 is removed and placed at p2. Similarly, when a person enters the parking space, a token is placed at p3, and the corresponding transition fires after the person leaves the parked car. The token is then removed from p3 and placed at p4.
Now a token is present at each input place of transition t6, so that t6 can fire once the relevant condition (here, the car leaving the parking space) is met. Once the car leaves, t6 fires, these tokens are removed, and a token is placed at the final place p5. In this example, sequencing, concurrency, and synchronization all occur.
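The token play of this example can be mimicked with a very small data structure. The sketch below is hypothetical code (the condition flags stand in for real detections, and the person-related transitions t2 to t5 are abbreviated); it encodes the places p1 to p5, the transitions t1 and t6, and fires a transition only when every input place holds a token and the attached condition is satisfied.

```python
# Minimal Petri-net sketch for the car-pickup example (illustrative only).
tokens = {"p1": 0, "p2": 0, "p3": 0, "p4": 0, "p5": 0}

# transition name -> (input places, output places, condition flag)
transitions = {
    "t1": (["p1"], ["p2"], "car_parked_nearby"),
    "t6": (["p2", "p4"], ["p5"], "car_leaves_space"),
}
conditions = {"car_parked_nearby": True, "car_leaves_space": True}

def fire(name):
    """Fire a transition if all input places hold a token and its condition is met."""
    inputs, outputs, cond = transitions[name]
    if all(tokens[p] > 0 for p in inputs) and conditions[cond]:
        for p in inputs:
            tokens[p] -= 1          # remove tokens from the input places
        for p in outputs:
            tokens[p] += 1          # place tokens in the output places
        return True
    return False

tokens["p1"] += 1   # a car enters the scene
fire("t1")          # the car parks: the token moves from p1 to p2
tokens["p4"] += 1   # person-related transitions (t2-t5, abbreviated) have moved a token to p4
fire("t6")          # the car leaves: a token is placed at the final place p5
print(tokens)       # {'p1': 0, 'p2': 0, 'p3': 0, 'p4': 0, 'p5': 1}
```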
Petri nets have been used to develop systems for the high-level interpretation of image sequences. In such systems, the structure of the Petri net needs to be determined in advance, which is very complicated work for large networks representing complex activities. This work can be semi-automated by automatically mapping a small set of logical, spatial, and temporal operations to the graph structure. With this method, interactive tools for querying video surveillance can be developed.
The synthesis method is mainly realized with the help of grammatical concepts and
rules.
Grammar
Subsequently, context-free grammar (CFG) was applied to model and recognize human motions and multi-person interactions. A hierarchical process is used here: at the low level, HMMs and BNs are combined, and at the high level, a CFG is used to model the interaction. The context-free grammar approach has a strong theoretical basis and can model structured processes. In the synthesis method, one only needs to enumerate the primitive events to be detected and to define the production rules of the high-level activities. Once the rules of the CFG are constructed, existing parsing algorithms can be used.
Because deterministic grammars expect very high accuracy at the low level, they are not suitable for situations where tracking errors and missed observations introduce errors at the low level. In complex scenarios with multiple temporal relations (such as parallelism, overlap, synchronization, etc.), it is often difficult to build grammar rules manually. Learning grammar rules from training data is a promising alternative, but it has proved to be very difficult in general situations.
Stochastic Grammar
An example of such production rules (with attached attribute constraints) for a boarding activity is:

S → BOARDING
BOARDING → appear0 CHECK1 disappear1
    [isPerson(appear.class) ∧ isInside(appear.loc, Gate) ∧ isInside(disappear.loc, Plane)]
CHECK → moveclose0 CHECK1
CHECK → moveaway0 CHECK1
CHECK → moveclose0 moveaway1 CHECK1
    [isPerson(moveclose.class) ∧ moveclose.idr = moveaway.idr]
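As a toy illustration of how such productions can be checked against a stream of detected primitive events, the following sketch accepts sequences of the form appear (moveclose | moveaway)* disappear. It ignores the attribute constraints and assumes, for simplicity, that CHECK may terminate with the empty string, which the excerpt above does not state; in a stochastic grammar each production would in addition carry a probability.

```python
def matches_boarding(events):
    """Return True if the primitive-event sequence fits the simplified BOARDING pattern."""
    if len(events) < 2 or events[0] != "appear" or events[-1] != "disappear":
        return False
    # CHECK expands to any interleaving of 'moveclose' and 'moveaway' events.
    return all(e in ("moveclose", "moveaway") for e in events[1:-1])

print(matches_boarding(["appear", "moveclose", "moveaway", "moveclose", "disappear"]))  # True
print(matches_boarding(["appear", "disappear", "moveclose"]))                           # False
```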
Logic-Based Approach
Skeleton joint point data correspond to higher-level features of the human body and are not easily affected by the appearance of the scene. In addition, they can better avoid noise caused by background occlusion, lighting changes, and viewing-angle changes. At the same time, they are also very efficient in terms of computation and storage.
Joint point data are usually represented as a sequence of coordinate vectors of points (4D points in space-time). That is, a joint (articulation point) can be represented by a 5D function J(l, x, y, z, t), where l is the label, (x, y, z) are the spatial coordinates, and t is the time coordinate. In different deep learning networks and algorithms, joint point data are often represented in different forms (such as pseudo-images, vector sequences, and topological graphs).
Research on joint data based on deep learning mainly involves three aspects: the data processing method, the network architecture, and the data fusion method. Among them, the data processing method mainly concerns whether to perform preprocessing and data denoising, and the methods used for data fusion are relatively consistent. At present, more attention is paid to the network architecture, and three architectures are commonly used: convolutional neural networks (CNN), recurrent neural networks (RNN), and graph convolutional networks (GCN). Their corresponding representations of the joint point data are pseudo-images, vector sequences, and topological graphs, respectively [20].
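The following sketch (with a hypothetical joint count, bone list, and ordering) shows how one skeleton sequence of T frames and J joints can be arranged into the three common input forms: a pseudo-image for a CNN, a vector sequence for an RNN, and a node-feature/adjacency pair for a GCN.

```python
import numpy as np

T, J = 64, 25                       # frames and joints (e.g., a 25-joint skeleton; illustrative)
joints = np.random.rand(T, J, 3)    # (x, y, z) coordinates per joint per frame

# 1) Pseudo-image for a CNN: channels = coordinates, height = time, width = joints.
pseudo_image = joints.transpose(2, 0, 1)          # shape (3, T, J)

# 2) Vector sequence for an RNN: one flattened joint vector per time step.
vector_sequence = joints.reshape(T, J * 3)        # shape (T, 75)

# 3) Topological graph for a GCN: node features per frame plus a joint adjacency matrix.
adjacency = np.zeros((J, J))
bones = [(0, 1), (1, 2), (2, 3)]                  # a few example bone connections (hypothetical)
for i, j in bones:
    adjacency[i, j] = adjacency[j, i] = 1.0
node_features = joints                            # shape (T, J, 3)

print(pseudo_image.shape, vector_sequence.shape, adjacency.shape)
```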
Recurrent neural networks (RNNs) can process sequence data of varying lengths. Behavior recognition methods based on RNNs first represent the joint point data as a vector sequence, which contains the position information of all relevant joints at each step of a time (state) sequence; the vector sequence is then fed into a behavior recognition network with an RNN as the backbone. The long short-term memory (LSTM) model is a variant of the RNN whose cell state can decide which temporal information should be retained and which should be forgotten, giving it greater advantages in processing time-series data such as joint-point videos.
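A minimal PyTorch sketch of such an RNN-based recognizer is given below; the layer sizes, the number of joints, and the number of behavior classes are arbitrary assumptions rather than values taken from the works cited in the tables. The joint data enter as a vector sequence, and the final LSTM hidden state is classified.

```python
import torch
import torch.nn as nn

class SkeletonLSTM(nn.Module):
    """LSTM backbone over a vector sequence of joint coordinates (illustrative sizes)."""
    def __init__(self, num_joints=25, num_classes=10, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_joints * 3, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):            # x: (batch, T, num_joints * 3)
        _, (h_n, _) = self.lstm(x)   # h_n: (num_layers, batch, hidden)
        return self.fc(h_n[-1])      # classify from the final hidden state of the last layer

model = SkeletonLSTM()
dummy = torch.randn(4, 64, 25 * 3)   # a batch of 4 sequences, 64 frames each
logits = model(dummy)                # (4, 10) class scores
```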
Table 11.4 briefly summarizes some recent techniques for behavior recognition
using LSTMs as network structures.
Joint point-based behavior recognition can also use hybrid networks, making full use of the feature extraction capabilities of CNNs and GCNs in the spatial domain and the advantages of RNNs in time-series classification. In this case, the original joint point data should be represented in the data format required by each hybrid network.
Table 11.6 briefly summarizes some recent techniques using hybrid networks.
Based on the detection and recognition of actions and activities, activities can be analyzed automatically, and an interpretation and judgment of the scene can be established. Automated activity analysis is a broad term, within which the detection of abnormal events is an important task.
Once the scenario model is established, the behavior and activities of the object can
be analyzed. A basic function of surveillance video is to verify events of interest.
1. Intrusion detection (virtual fence): The most basic task is to determine whether a moving object appears within the monitored range. This is equivalent to establishing a virtual fence at the boundary of the monitoring range and triggering analysis once there is an intrusion, such as controlling a high-resolution pan-tilt-zoom (PTZ) camera to obtain the details of the intrusion and starting to count the number of intrusions.
2. Speed analysis: The virtual fence only uses location information; with the help of tracking technology, dynamic information can also be obtained to realize speed-based early warning, for example, for vehicle speeding or road congestion.
3. Path classification: Speed analysis only utilizes the currently tracked data; in practice, activity paths (AP) obtained from historical motion patterns can also be used. The behavior of a newly appearing object can be described with the aid of a maximum a posteriori (MAP) path,

l* = arg max_k p(l_k | G) = arg max_k p(G | l_k) p(l_k),

which helps determine which activity path best explains the new trajectory data G. Since the prior path distribution p(l_k) can be estimated from the training set, the problem reduces to maximum likelihood estimation using the HMM.
4. Abnormality detection (anomaly detection): The detection of abnormal events is often an important task of a monitoring system. Because the activity paths describe typical activities, an anomaly can be detected when a new trajectory does not match any existing path. Anomalous patterns can be detected with intelligent thresholding: a new trajectory G is judged anomalous when even the activity path l* most similar to G yields a match value that is still smaller than the threshold L_l (a minimal scoring sketch is given after this list).
5. Online activity analysis: Being able to analyze, identify, and evaluate activities online is more important than describing the movement only after the entire trajectory is available. A real-time system needs to be able to quickly reason about what is happening from incomplete data (often with the help of graph models). Two cases are considered here:
(a) Path prediction: The tracking data obtained so far can be used to predict future behavior, and the prediction can be refined as more data are collected. Predicting the activity from an incomplete trajectory can be represented as

L = arg max_j p(l_j | W_t, G_{t+k})  (11.16)
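The following sketch illustrates the scoring logic of items 3 and 4 only. The paths, priors, and threshold are made-up values, and a simple Gaussian distance serves as a stand-in for the per-path HMM likelihoods mentioned in the text.

```python
import numpy as np

# Learned activity paths, each summarized here by a mean trajectory (a stand-in for an HMM).
paths = {
    "enter_gate": np.linspace([0, 0], [10, 0], 20),
    "cross_lot":  np.linspace([0, 0], [0, 10], 20),
}
priors = {"enter_gate": 0.7, "cross_lot": 0.3}
threshold = -50.0    # hypothetical log-likelihood threshold L

def log_likelihood(traj, path, sigma=1.0):
    """Stand-in likelihood: Gaussian distance between the trajectory and the path points."""
    return -np.sum((traj - path) ** 2) / (2 * sigma ** 2)

def classify(traj):
    # MAP path: combine the path prior with the (stand-in) likelihood.
    scores = {k: log_likelihood(traj, p) + np.log(priors[k]) for k, p in paths.items()}
    best = max(scores, key=scores.get)
    is_anomaly = scores[best] < threshold    # even the best path explains the data poorly
    return best, is_anomaly

new_traj = np.linspace([0, 0], [9, 1], 20) + np.random.default_rng(0).normal(0, 0.2, (20, 2))
print(classify(new_traj))
```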
Intuitively, anomalies are defined relative to what is normal. However, the definition of normal can change with time, environment, purpose, conditions, and so on. In particular, normal and abnormal are relatively subjective concepts, so abnormal events often cannot be defined objectively and quantitatively.
The detection of abnormal events is mostly carried out with the help of video, so it is often called video anomaly detection (VAD), or video anomaly detection and localization (VADL) when it is emphasized that one must not only detect the abnormal events appearing in the video but also identify where in the video they occur.
Video abnormal event detection can be divided into two parts: video feature extraction and the establishment of an abnormal event detection model. Commonly used video features are mainly divided into hand-designed features and features extracted by deep models. Video abnormal event detection models can be divided into models based on traditional probabilistic reasoning and models based on deep learning. There are therefore various schemes for classifying abnormal event detection methods. In the following, they are divided into methods based on traditional machine learning and methods based on deep learning, and into methods with supervised learning and methods with unsupervised learning.
Table 11.7 Classification of abnormal event detection methods from a development perspective

Method category / input model: discriminate criterion
Traditional machine learning
  Point model: cluster discrimination; co-occurrence discrimination; reconstruction discrimination; other discriminations
  Sequence model: generative probabilistic discrimination
  Graph model: graph inference discrimination; graph structure discrimination
  Composite model
Deep learning
  Point model: cluster discrimination; reconstruction discrimination; joint discrimination
  Sequence model: prediction error discrimination
  Composite model
Hybrid learning
  Point model: cluster discrimination; reconstruction discrimination; other discriminations
Point Model
There are presently five types of anomaly discrimination criteria: (1) cluster discrimination (according to the distribution of feature points in the feature space, points far from the cluster centers, points belonging to small clusters, or points with low distribution probability density are judged as abnormal); (2) co-occurrence discrimination (according to the probability of co-occurrence between feature points and normal samples, feature points with a low probability of co-occurring with normal samples are judged as abnormal); (3) reconstruction discrimination (a low-dimensional subspace/manifold is used to describe the distribution of normal feature points in the feature space, and abnormality is then judged by the reconstruction error, i.e., the distance from a feature point to the normal-sample subspace/manifold); (4) joint discrimination (the model uses the above three types of discrimination jointly); and (5) other discriminations (including hypothesis-testing discrimination, semantic-analysis discrimination, etc.).
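As a concrete instance of criterion (3), the sketch below fits a low-dimensional subspace to the feature points of normal samples with PCA (scikit-learn) and scores new points by their reconstruction error; the data, the subspace dimension, and the 95th-percentile threshold are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normal_features = rng.normal(0, 1, (500, 64))            # feature points of normal samples

pca = PCA(n_components=8).fit(normal_features)           # low-dimensional subspace of normal data

def reconstruction_error(x):
    """Distance from feature points to the normal-sample subspace."""
    recon = pca.inverse_transform(pca.transform(x))
    return np.linalg.norm(x - recon, axis=1)

# Threshold chosen from the errors of the normal training samples themselves.
tau = np.percentile(reconstruction_error(normal_features), 95)

test = rng.normal(0, 1, (5, 64)) + 4.0                    # shifted samples standing in for anomalies
print(reconstruction_error(test) > tau)                   # True -> judged abnormal
```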
Sequence Model
The sequence model mainly uses prediction error discrimination: the degree to which the input sequence obeys the normal transition law is judged according to the prediction error, and samples with large prediction errors are judged as abnormal.
Graph Model
From the perspective of abnormal event detection technology, if there are clear boundaries between normal and abnormal events and corresponding samples of both exist, supervised learning can be used to classify them. If there is no prior knowledge of normal and abnormal events and only the clustering distribution of the event samples is considered, unsupervised learning is needed. If abnormal events are defined as all events except normal events, only the prior knowledge of normal events is used for training: normal samples are used to learn the pattern of normal events, and all samples that do not obey the normal pattern are then judged as abnormal; this is the semi-supervised learning technique. Of course, these technologies can also be combined at different levels, which is generally referred to as integrated technology. The results of classification from this perspective are shown in Table 11.8 [43].
In Table 11.8, the division of semi-supervised learning techniques is relatively fine; in fact, there are many studies on semi-supervised learning techniques. In practical applications, on the one hand, normal samples are easier to obtain than abnormal samples, so semi-supervised learning techniques are easier to apply than supervised ones; on the other hand, since prior knowledge of normal events is used, semi-supervised learning techniques perform better than unsupervised ones. For these reasons, researchers pay more attention to semi-supervised learning techniques, and more work has been done on them.
For the complete representation and description of video events, multiple features
are often required. The fusion of multiple features has stronger representative power
Table 11.8 Classification of abnormal event detection methods from a technical perspective

Method category / specific technology: main points
Supervised learning
  Binary classification: support vector machines
  Multi-example learning: multiple networks
Unsupervised learning
  Hypothesis test method: binary classifier
  Unmask method: binary classifier
Semi-supervised learning
  Traditional machine learning
    Distance method: one-class classifier
    KNN method
    Probabilistic method: distribution probability; Bayesian probability
    Reconstruction error method: sparse coding
  Deep learning
    Deep distance method: deep one-class classification
    Deep KNN method
    Deep probabilistic method: autoregressive network; variational auto-encoder; generative adversarial networks
    Deep generative error method
Integrated learning
  Weighted sum method: deep generative network
  Sorting method: multiple detectors
  Cascade method: multiple detectors
Fig. 11.11 The flowchart of video abnormal event detection using convolutional auto-encoder
For the optical flow features and the HOG features of each block, an anomaly detection convolutional auto-encoder (AD-ConvAE) is set up and trained and tested separately. The AD-ConvAE of a block region attends only to the motion in that region of the video frame, and this block-wise learning can capture local features more effectively. During training, the video contains only normal samples, and the AD-ConvAE learns the normal motion pattern of a region through the optical flow and HOG features of the blocks in the video frames. During testing, the optical flow and HOG features of the blocks in the test video frames are fed into the AD-ConvAE for reconstruction, and a weighted reconstruction error is computed from the optical flow reconstruction error and the HOG reconstruction error. If the reconstruction error is large enough, the block is judged to contain an abnormal event. In this way, in addition to detecting the abnormal event, the localization of the abnormal event is also accomplished.
The network structure of the AD-ConvAE includes two parts: an encoder and a decoder. In the encoder part, multiple pairs of convolutional and pooling layers are used to obtain deep features. In the decoder part, multiple pairs of convolution and up-sampling operations are used to reconstruct, from the deep feature representation, an image of the same size as the input image.
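A minimal PyTorch sketch of such an encoder-decoder and of reconstruction-error scoring for one block is shown below; the layer counts, channel sizes, and block size are arbitrary assumptions and not those of the AD-ConvAE in [44].

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    """Small convolutional auto-encoder for fixed-size feature blocks (illustrative)."""
    def __init__(self, channels=2):                      # e.g., optical-flow (u, v) channels
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(), nn.Upsample(scale_factor=2),
            nn.Conv2d(16, channels, 3, padding=1), nn.Upsample(scale_factor=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))             # output has the same size as the input

def anomaly_score(model, block):
    """Reconstruction error of one block; large values suggest an abnormal event."""
    with torch.no_grad():
        recon = model(block)
    return torch.mean((block - recon) ** 2).item()

model = ConvAE()
block = torch.randn(1, 2, 32, 32)                        # one 32x32 block of flow-like features
print(anomaly_score(model, block))
```

In the method described above, two such auto-encoders per block (one for optical flow, one for HOG features) would be trained on normal samples only, and their reconstruction errors combined with weights before thresholding.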
Fig. 11.12 The flowchart of video abnormal event detection using ONN
An ONN can be regarded as a neural network designed using a loss function equivalent to that of the OC-SVM. In an ONN, the data representation in the hidden layer is driven directly by the one-class objective, so it can be tailored to the anomaly detection task, combining the two stages of feature extraction and anomaly detection into a joint optimization. The ONN combines the layer-by-layer representation ability of the auto-encoder with one-class classification, which makes it possible to distinguish normal samples from abnormal samples.
In the video anomaly detection method based on ONN [45], ONNs are trained separately on equally sized local region blocks of the video frames and of the optical flow maps to detect appearance anomalies and motion anomalies, and the two are fused to obtain the final detection result. The flowchart of video anomaly detection using ONN is shown in Fig. 11.12. In the training phase, two auto-encoder networks are learned from the RGB images and the optical flow images of the training samples, respectively, and the encoder layers of the pre-trained auto-encoders are combined with the ONN to optimize the parameters and learn the anomaly detection models. In the test phase, the RGB image and the optical flow image of the test region are input into the appearance anomaly detection model and the motion anomaly detection model, respectively; the output scores are fused, and a detection threshold is set to judge whether the region is abnormal.
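The sketch below illustrates a one-class training objective in the spirit of ONN/OC-SVM; the encoder, the value of nu, and the quantile update of the bias r are illustrative choices rather than the exact formulation of [45]. Features g(Vx) are mapped to a score w.g(Vx), and the loss pushes the scores of normal samples above r while the hinge penalty and the -r term balance how large r can become.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 64), nn.Sigmoid())  # g(Vx): hidden layer
w = nn.Linear(64, 1, bias=False)                                             # linear scoring layer
opt = torch.optim.Adam(list(encoder.parameters()) + list(w.parameters()), lr=1e-3)

nu, r = 0.1, 0.0
normal_batch = torch.rand(128, 1, 32, 32)          # training uses normal samples only

for epoch in range(5):
    scores = w(encoder(normal_batch)).squeeze(1)   # score w . g(Vx) for each sample
    hinge = torch.clamp(r - scores, min=0.0)       # penalize normal samples scoring below r
    reg = 0.5 * sum((p ** 2).sum() for p in w.parameters())
    loss = reg + hinge.mean() / nu - r             # simplified one-class objective
    opt.zero_grad(); loss.backward(); opt.step()
    r = torch.quantile(scores.detach(), nu).item() # update r as the nu-quantile of the scores

def is_abnormal(x):
    """Test-time decision: scores below r are judged abnormal."""
    with torch.no_grad():
        return w(encoder(x)).squeeze(1) < r

print(is_abnormal(torch.rand(4, 1, 32, 32)))
```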
References
12. Afza F, Khan MA, Sharif M, et al. (2021) A framework of human action recognition using length control features fusion and weighted entropy-variances based feature selection. Image and Vision Computing 106: #104090 (DOI: https://doi.org/10.1016/j.imavis.2020.104090).
13. Jia HX, Zhang Y-J (2009) Automatic people counting based on machine learning in intelligent
video surveillance. Video Engineering (4): 78-81.
14. Xu CL, Hsieh SH, Xiong CM, et al. (2015) Can humans fly? Action understanding with
multiple classes of actors. Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition 2264-2273.
15. Xu CL, Corso JJ (2016) Actor-action semantic segmentation with grouping process models.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 3083-3092.
16. Turaga P, Chellappa R, Subrahmanian VS, et al. (2008) Machine recognition of human
activities: A survey. IEEE-CSVT 18(11): 1473-1488.
17. Zhao JW, Li XP, Zhang WG (2023) Vehicle trajectory prediction method based on modeling of multi-agent interaction behavior. CAAI Transactions on Intelligent Systems 18(3): 480-488.
18. Lu YX, Xu GH, Tang B (2023) Worker behavior recognition based on temporal and spatial self-attention of vision Transformer. Journal of Zhejiang University (Engineering Science) 57(3): 446-454.
19. Jiang HY, Han J (2023) Behavior recognition based on improved spatiotemporal heterogeneous two-stream network. Computer Engineering and Design 44(7): 2163-2168.
20. Liu Y, Xue PP, Li H, et al. (2021) A review of action recognition using joints based on deep
learning. Journal of Electronics and Information Technology 43(6): 1789-1802.
21. Ji XF, Qin LL, Wang YY (2019) Human interaction recognition based on RGB and skeleton
data fusion model. Journal of Computer Applications 39(11): 3349-3354.
22. Yan A, Wang YL, Li ZF, et al. (2019) PA3D: Pose-action 3D machine for video recognition.
IEEE Conference on Computer Vision and Pattern Recognition 7922-7931.
23. Caetano C, Bremond F, Schwartz WR (2019) Skeleton image representation for 3D action
recognition based on tree structure and reference joints. SIBGRAPI Conference on Graphics,
Patterns and Images 16-23.
24. Caetano C, Sena J, Bremond F, et al. (2019) SkeleMotion: A new representation of skeleton
joint sequences based on motion information for 3D action recognition. IEEE International
Conference on Advanced Video and Signal Based Surveillance 1-8.
25. Li YS, Xia RJ, Liu X, et al. (2019) Learning shape motion representations from geometric algebra spatiotemporal model for skeleton-based action recognition. IEEE International Conference on Multimedia and Expo 1066-1071.
26. Liu J, Shahroudy A, Xu D, et al. (2017) Skeleton-based action recognition using spatiotemporal LSTM network with trust gates. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12): 3007-3021.
27. Liu J, Wang G, Hu P, et al. (2017) Global context-aware attention LSTM networks for 3D action recognition. IEEE Conference on Computer Vision and Pattern Recognition 1647-1656.
28. Liu J, Wang G, Duan LY, et al. (2018) Skeleton-based human action recognition with global
context-aware attention LSTM networks. IEEE Transactions on Image Processing 27(4):
1586-1599.
29. Zheng W, Li L, Zhang ZX, et al. (2019) Relational network for skeleton-based action recognition. IEEE International Conference on Multimedia and Expo 826-831.
30. Li MS, Chen SH, Chen X, et al. (2019) Actional-structural graph convolutional networks for skeleton-based action recognition. IEEE Conference on Computer Vision and Pattern Recognition 3595-3603.
31. Peng W, Hong XP, Chen HY, et al. (2019) Learning graph convolutional network for skeleton-based human action recognition by neural searching. arXiv preprint, arXiv: 1911.04131.
32. Wu C, Wu XJ, Kittler J (2019) Spatial residual layer and dense connection block enhanced
spatial temporal graph convolutional network for skeleton-based action recognition. IEEE
International Conference on Computer Vision Workshop 1-5.
33. Shi L, Zhang YF, Cheng J, et al. (2019) Skeleton-based action recognition with directed graph
neural networks. IEEE Conference on Computer Vision and Pattern Recognition 7912-7921.
34. Li MS, Chen SH, Chen X, et al. (2019) Symbiotic graph neural networks for 3D skeleton-based
human action recognition and motion prediction. arXiv preprint, arXiv: 1910.02212.
35. Yang HY, Gu YZ, Zhu JC, et al. (2020) PGCNTCA: Pseudo graph convolutional network with temporal and channel-wise attention for skeleton-based action recognition. IEEE Access 8: 10040-10047.
36. Zhang PF, Lan CL, Xing JL, et al. (2019) View adaptive neural networks for high performance
skeleton-based human action recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence 41(8): 1963-1978.
37. Hu GY, Cui B, Yu S (2019) Skeleton-based action recognition with synchronous local and
non-local spatiotemporal learning and frequency attention. IEEE International Conference on
Multimedia and Expo 1216-1221.
38. Si CY, Chen WT, Wang W, et al. (2019) An attention enhanced graph convolutional LSTM
network for skeleton-based action recognition. IEEE Conference on Computer Vision and
Pattern Recognition 1227-1236.
39. Gao JL, He T, Zhou X, et al. (2019) Focusing and diffusion: Bidirectional attentive graph
convolutional networks for skeleton-based action recognition. arXiv preprint, arXiv:
1912.11521.
40. Zhang PF, Lan CL, Zeng WJ, et al. (2020) Semantics-guided neural networks for efficient
skeleton-based human action recognition. IEEE Conference on Computer Vision and Pattern
Recognition 1109-1118.
41. Zhang WL, Qi H, Li S (2022) Application of spatial temporal graph convolutional networks in human abnormal behavior recognition. Computer Engineering and Application 58(12): 122-131.
42. Wang ZG, Zhang Y-J (2020) Anomaly detection in surveillance videos: A survey. Journal of
Tsinghua University (Science & Technology) 60(6): 518-529.
43. Wang ZG (2022) Researches on Semi-Supervised and Deep Generative Model-Based Surveillance Video Anomaly Detection Algorithms (Dissertation). Tsinghua University.
44. Li XL, Ji GL, Zhao B (2021) Convolutional auto-encoder patch learning based video anomaly
event detection and localization. Journal of Data Acquisition and Processing 36(3): 489-497.
45. Jiang WY, Li G (2021) One-class neural network for video anomaly detection and localization.
Journal of Electronic Measurement and Instrumentation 35(7): 60-65.
Index
D
Data augmentation, 35
Declarative model, 446
Decoding, 440
Deep iterative learning methods, 371
Deep learning, 7, 413
Deeply supervised object detector (DSOD), 37
Deep photometric stereo network (DPSN), 279
Deep transformation learning methods, 371
Depth image, 86
Depth imaging, 88
Depth map, 86
Depth of focus, 316
Difference of Gaussian (DoG), 182
Diffuse reflection surface, 245
Digital micro-mirror device (DMD), 117, 119
Dilated convolution, 37
Diplopic, 124
Direct depth imaging, 89

F
Factored-state hierarchical HMM (FS-HHMM), 424
Factorial CRF (FCRF), 424
Fast point feature histogram (FPFH), 152
Feature adjacency graph, 336
Feature cascaded convolutional neural networks, 199
Feature map, 31
Feature point method, 394
Feature points, 177
Field of view (FOV), 18, 112, 404
Fitting, 335
Fixed-point iterative sweeping, 325
Focus of expansion (FOE), 273
Forward mapping, 149
Fourier single-pixel imaging (FSI), 122
4-point mapping, 47
Front-end matching, 379
G
Gaze change, 20
Gaze control, 20
Gaze stabilization, 20
Generalization of the CRF, 424
Generalized compact non-local networks, 36
Generalized matching, 334
Generalized unified model (GUM), 232
Generative adversarial networks (GAN), 37, 198, 280
Generative model, 424
Geometric feature-based methods, 139
Geometric hashing, 169
Geometric realization, 359
Geometric representation, 359
Geometric texture, 148
Geometric texture mapping, 150
Geoscience laser altimeter system (GLAS), 92
Global feature descriptor, 152
Global positioning system (GPS), 91
GMapping algorithm, 385
Gradient space, 250, 290
Graph, 192, 358
Graph convolutional networks (GCN), 374, 448
Graph isomorphism, 363
Gray-scale smooth region, 209
Gray value, 44
Gray value range, 44

H
Hardware implementation, 9
Harris interest point detector, 426, 439
Hausdorff distance (HD), 338
Helmet mounted display (HMD), 123
Hessian matrix, 184
Hidden line, 307
Hidden Markov model (HMM), 440
Hierarchical variable transition hidden Markov model (HVT-HMM), 424
Holistic recognition, 425
Homogeneity assumption, 309
Homogeneous coordinate, 45
Homogeneous vector, 45
Homography matrix, 74
Horopter, 124
Human intelligence, 6
Human stereoscopic vision, 123

I
Ideal scattering surface, 245
Ideal specular reflecting surface, 277
Identical, 361
Illuminance, 43, 240
Illumination component, 43
Image analysis, 6, 25
Image brightness constraint equation, 251, 290, 293, 325, 327
Image coordinate system, 45
Image coordinate system in computer, 45
Image engineering, 5, 25, 27
Image flow, 258
Image flow equation, 260
Image matching, 334
Image processing, 5, 25
Image pyramid network, 197
Image rectification, 113
Image-to-image, 382, 396
Image-to-map, 382
Image understanding, 6, 25
Imaginary point, 302
Incident, 359
Indirect depth imaging, 105
Induced subgraph, 360
Inertia equivalent ellipse, 343
Inertial measurement unit (IMU), 91, 394, 399, 414
Inertial navigation system (INS), 91
In general position, 366
Internal parameter, 57
Intrinsic image, 86
Intrinsic property, 86
Intrinsic shape signatures (ISS), 159
Inverse distance, 206
Inverse perspective, 319
Inverse perspective transformation, 48
Irradiance, 240
Isomorphism, 362
Isotropy assumption, 309
Isotropy radiation surface, 251
Iterative closest point (ICP), 138, 379
Iterative closest point registration, 139
Iterative fast marching, 330

J
Jacobi over-relaxation (JOR), 414
Join, 359
U
Underlying simple graph, 360
U-Net, 37
Uniqueness constraint, 171
Unsupervised learning, 457

V
Vanishing line, 307
Vanishing point, 302, 307

W
Ward reflection model, 278, 322
Window, 166, 218
World coordinate system, 45

Z
Zero-cross correction algorithm, 195
Zero-crossing pattern, 178
Zone of clear single binocular vision (ZCSBV), 124