3D Computer Vision
Foundations and Advanced Methodologies
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publishers, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publishers nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publishers remain neutral with
regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
graduation projects and dissertations; on the other hand, this book is also suitable for corporate research and development personnel who want to follow the latest progress, and it serves as a reference for scientific research.
This book has approximately 500,000 words. It has 11 chapters, comprising a total of 59 sections (second-level headings) and 156 subsections (third-level headings). There are 215 numbered figures, 37 numbered tables, and 660 numbered equations. The list of more than 300 cited references (of which over 100 are from the 2020s) and more than 500 subject terms used for indexing appear at the end of the book to facilitate further access to the related literature.
Finally, I would like to thank my wife Yun He, daughter Heming Zhang, and
other family members for their understanding and support in all aspects.
1 Introduction ............................................................................................... 1
1.1 Introduction to Computer Vision .................................................. 2
1.1.1 Visual Essentials .............................................................. 2
1.1.2 The Goal of Computer Vision .......................................... 3
1.1.3 Related Disciplines ............................................................ 4
1.2 Computer Vision Theory and Framework .................................... 7
1.2.1 Visual Computational Theory .......................................... 7
1.2.2 Framework Issues and Improvements ............................. 13
1.2.3 A Discussion on Marr’s Reconstruction Theory ............. 15
1.2.4 Research on New Theoretical Framework ...................... 18
1.2.5 Discussion from the Perspective of Psychological Cognition ............. 21
1.3 Introduction to Image Engineering ............................................... 24
1.3.1 Three Levels of Technology in Image Engineering ............. 25
1.3.2 Research and Application of Image Engineering ........... 27
1.4 Introduction to Deep Learning ...................................................... 28
1.4.1 Deep Learning Overview .................................................. 28
1.4.2 Deep Learning Core Technology ..................................... 34
1.4.3 Deep Learning in Computer Vision ................................. 36
1.5 Organization and Content of this Book ........................................ 38
References ................................................................................................... 40
2 Camera Imaging and Calibration .......................................................... 41
2.1 Lightness Imaging Model ............................................................... 42
2.1.1 Photometric Concepts ................................................... 42
2.1.2 Basic Lightness Imaging Model ..................................... 43
2.2 Space Imaging Model ........................................................ 44
2.2.1 Projection Imaging Geometry ............................................ 44
2.2.2 Basic Space Imaging Model ............................................. 46
About the Author
Yu-Jin Zhang received his doctor’s degree in Applied Science from the University of Liège in Belgium in 1989. From 1989 to 1993, he successively engaged in postdoctoral research and served as a full-time researcher at Delft University in the Netherlands. Since 1993, he has worked in the Department of Electronic Engineering of Tsinghua University in Beijing, China. He has been a professor since 1997, a Ph.D. supervisor since 1998, and a tenured professor since 2014. During a sabbatical year in 2003, he was a visiting professor at Nanyang Technological University in Singapore.
At Tsinghua University, he has offered and taught over 10 undergraduate and graduate courses, including “Image Processing,” “Image Analysis,” “Image Understanding,” and “Content-Based Visual Information Retrieval.” At Nanyang Technological University, he offered and taught a postgraduate course: “Advanced Image Analysis (English).” More than 30 Chinese and English textbooks have been written and published (with a total of over 300,000 printed copies). More than 30 teaching research papers have been published both domestically and internationally.
His main scientific research fields are image engineering (image processing, image analysis, image understanding, and their technical applications), a discipline he actively advocates, and related areas. He has published over 500 research papers on image engineering both domestically and internationally. He published the monographs Image Segmentation and Content-Based Visual Information Retrieval (Science Press, China) and Subspace-Based Face Recognition (Tsinghua University Press, China); has written the English-Chinese Dictionary of Image Engineering (1st, 2nd, and 3rd editions; Tsinghua University Press, China); has written Selected Works of Image Engineering Technology and Selected Works of Image Engineering Technology (2) (Tsinghua University Press, China); has translated Computational Geometry, Topology, and Physics of Digital Images with Applications (Springer-Verlag, Germany) into Chinese; has led the compilation of Advances in Image and Video Segmentation and Semantic-Based Visual Information Retrieval (IRM Press, USA), as well as Advances in Face Image Analysis: Techniques and Technologies (IGI Global, USA); has written the Handbook of Image Engineering (Springer Nature, Singapore); and has written A Selection of Image Processing Techniques, A Selection of
1 Introduction
Vision is an important function and means for human beings to observe and
recognize the world. Computer vision, as a discipline that uses computers to realize human visual functions, has not only received great attention and in-depth research but has also been widely used [1].
The visual process can be seen as a complex process from sensing (feeling the image of the objective world) to perception (understanding the objective world from the image). It involves knowledge of optics, geometry, chemistry, physiology, psychology, and so on. To complete such a process, computer vision should not only have its own corresponding theory and technology but also combine the achievements of various disciplines and the development of various engineering technologies.
The sections of this chapter are arranged as follows.
Section 1.1 gives a general introduction to computer vision, including the key points of human vision, the research methods and objectives of computer vision, and the connections and differences with several major related disciplines.
Section 1.2 introduces the theory and framework of computer vision, mainly including the important visual computational theory, its existing problems and improvements; it also discusses some other theoretical frameworks.
Section 1.3 gives an overview of the various image processing technologies that form the basis of computer vision technology. Under the overall framework of image engineering, the three levels of image technology, as well as various recent research directions and application fields, are discussed in detail.
Section 1.4 provides an overview of the deep learning methods that have rapidly promoted the development of computer vision technology in recent years. In addition to listing some basic concepts of convolutional neural networks, it also discusses the core technology of deep learning and its connection with computer vision.
Section 1.5 introduces the main content of the book and its organizational structure.
1.1 Introduction to Computer Vision
The following is a general introduction to the origin, objectives, and related disci
plines of computer vision.
Computer vision originates from human vision, which is generally simply called vision. Vision is a human function that plays an important role in human observation and cognition of the objective world. According to statistics, about 75% of the information humans obtain from the outside world comes through the visual system, which shows not only that the amount of visual information is huge but also that humans make high use of visual information.
The human visual process can be seen as a complex process from sensing (feeling the image obtained by 2D projection of the 3D world) to perception (recognizing the content and meaning of the 3D world from the 2D image).
Vision is a very familiar function that not only helps people obtain information
but also helps people process it. Vision can be further divided into two levels: visual
sensation and visual perception. Here, sensation is at a lower level, which mainly
receives external stimuli, while perception is at a higher level, which converts
external stimuli into meaningful content. In general, sensation receives external
stimuli almost indiscriminately and completely, while perception determines
which parts of the external stimulus could combine together to form the “object”
of interest.
Visual sensation mainly explains, at the molecular level, the basic properties (such as brightness and color) of people’s response to light (i.e., visible radiation); it mainly involves physics, chemistry, and related disciplines. The main research contents of visual sensation are (1) the physical properties of light, such as light quanta, light waves, and the spectrum, and (2) the degree to which light stimulates the visual receptors, including photometry, the structure of the eye, visual adaptation, visual intensity and sensitivity, and the temporal and spatial characteristics of vision.
Visual perception mainly discusses how people respond to visual stimuli from the objective world and the ways in which they respond. It studies how vision enables people to form an interpretation of the spatial representation of the external world, so it also involves psychological factors. As a reflection of currently present objective things, visual perception cannot be fully explained by the principle of light projecting onto the retina to form a retinal image together with the known mechanisms of the eye and nervous system alone. Visual perception is a group of activities carried out in the nerve centers; it organizes scattered stimuli in the visual field into a whole with a certain shape in order to understand the world. As early as 2000 years ago, Aristotle defined the task of visual perception as determining “what is where” [2].
Computer vision is the use of computers to realize human visual functions, that is, the sensation, perception, processing, and interpretation of three-dimensional scenes in the objective world. The original purpose of vision research is to grasp and understand the image of a scene; to identify and locate the objects in it; to determine their structure, spatial arrangement, and distribution; and to explain the relationships between objects. The research goal of computer vision is to make meaningful judgments about actual objects and scenes in the objective world based on perceived images [3].
There are currently two main research approaches in computer vision. One is the bionic approach, which refers to the structural principles of the human visual system and establishes corresponding processing modules to complete similar functions and tasks. The other is the engineering approach, which starts from analyzing the functions of the human visual process; it does not deliberately simulate the internal structure of the human visual system but considers only the input and output of the system and adopts any existing and feasible means to realize the system’s functions. This book mainly discusses the second approach, from an engineering point of view.
The main research goals of computer vision can be summarized as two interrelated and complementary goals. The first research goal is to build computer vision systems to accomplish various vision tasks. That is, the computer obtains images of the scene with the help of various visual sensors (such as CCD and CMOS camera devices), perceives and recovers from them the geometric properties, posture and structure, motion, mutual positions, etc. of objects in the 3D environment, identifies, describes, and explains the objective scene, and then makes judgments and decisions. What is mainly studied here is the technical mechanism for accomplishing these tasks. At present, work in this area focuses on building various specialized systems to complete the specialized visual tasks that arise in various practical occasions; in the long run, the aim is to build more general systems (closer to the human visual system) to complete general vision tasks. The second research goal is to use this research as a means to explore the visual working mechanism of the human brain and to master and understand it (as in computational neuroscience). What is mainly studied here is the biological mechanism. For a long time, people have carried out a lot of research on the human brain's visual system from the aspects of physiology,
psychology, neuroscience, cognition, etc., but this is still far from revealing all the mysteries of the visual process; in particular, research on and understanding of the visual mechanism still lag far behind the research on and mastery of visual information processing. It should be pointed out that a full understanding of human brain vision will also promote in-depth research in computer vision [2]. This book mainly considers the first research objective.
It can be seen from the above that computer vision uses computers to realize human visual functions, and its research has obtained many inspirations from human vision. Much important research in computer vision has been accomplished by drawing on the understanding of the human visual system; typical examples include the use of pyramids as an efficient data structure, the use of the concept of local orientation, the use of filtering techniques to detect motion, and, more recently, artificial neural networks. In addition, understanding of and research on the functions of the human visual system can also help people develop new computer vision algorithms.
The research and application of computer vision has a long history. Overall, early computer vision systems mainly relied on 2D projected images of objective 3D scenes. The research goal of computer vision was to improve the quality of images, so that users could obtain information more clearly and conveniently, or to focus on automatically obtaining various characteristic data from the image to help users analyze and recognize the scenery. This line of work can be attributed to 2D computer vision, which is currently relatively mature, with many application products available. With the development of theory and technology, more and more research focuses on fully utilizing the 3D spatial information obtained from the objective scenery (often combined with temporal information) and on automatically analyzing and understanding the objective world, so as to make judgments and decisions. This includes further obtaining depth information on the basis of 2D projection images in order to comprehensively grasp the 3D world. This area of work is still being explored and requires the introduction of technologies such as artificial intelligence; it is currently the focus of research in computer vision. This more recent work can be categorized as 3D computer vision and will be the main subject of this book.
Fig. 1.1 The connections and differences between related disciplines and fields
of the environment through visual sensors, building a system with visual perception functions, and realizing algorithms for detecting and identifying objects. On the other hand, robot vision places more emphasis on the machine vision of robots, so that a robot has the function of visual perception.
2. Computer graphics
Graphics refers to the science of expressing data and information in the form of graphics, charts, drawings, etc. Computer graphics studies how to use computer technology to generate these forms, and it is also closely related to computer vision. Computer graphics is generally referred to as the inverse problem of computer vision, because vision extracts 3D information from 2D images, while graphics uses 3D models to generate 2D scene images (more generally, it generates realistic images from nonimage forms of data description). It should be noted that, compared with the many uncertainties in computer vision, computer graphics mostly deals with deterministic problems that can be solved through mathematical methods. In many practical applications, people are more concerned with the speed and accuracy of graphics generation, that is, with achieving some compromise between real-time performance and fidelity.
3. Image engineering
Image engineering is a very rich discipline, including three levels (three subdisciplines) that are both related and different: image processing, image analysis, and image understanding, as well as their engineering applications.
Image processing emphasizes the conversion between images (image in and image out). Although image processing in a broad sense refers to various image technologies, image processing in a narrow sense mainly focuses on the visual observation effect of the output image [4]. This includes making various adjustments to an image to improve its visual effect and to facilitate subsequent high-level processing; compressing and encoding an image to reduce the required storage space or transmission time while preserving the required visual quality, so as to meet the requirements of a given transmission path; adding some additional information to an image without affecting the appearance of the original image; and so on.
Image analysis mainly detects and measures the objects of interest in the image to obtain their objective information, thereby establishing a description of the objects in the image (image in and data out) [5]. If image processing is a process from image to image, image analysis is a process from image to data. The data here can be the results of measurements of object characteristics, symbolic representations based on such measurements, or conclusions identifying the object categories; they describe the characteristics and properties of the objects in the image.
The focus of image understanding is to further study, on the basis of image analysis, the nature of the objects in the image and their mutual relations, and to obtain an understanding of the meaning of the whole image content and an explanation of the original objective scene being imaged, so that people can make judgments (know the world) and guide and plan actions (transform the world) [6]. If image analysis studies the objective world mainly centered on the observer (mainly observable things), image understanding is, to a certain extent, centered on the objective world, grasping and explaining the entire objective world (including things not directly observed) with the help of knowledge and experience.
4. Pattern recognition
Patterns refer to categories of objective things or phenomena that are similar but not identical. Patterns cover a wide range, and images are one of them. (Image) pattern recognition is similar to image analysis in that they have the same input, while their different outputs can easily be converted into each other. Recognition refers to the mathematics and techniques for automatically establishing symbolic descriptions or logical inferences from objective facts, so pattern recognition is defined as the discipline of classifying and describing objects and processes in the objective world. At present, the recognition of image patterns mainly focuses on the classification, analysis, and description of the content of interest (objects) in the image; on this basis, the goal of computer vision can be further pursued. At the same time, many concepts and methods of pattern recognition are used in computer vision research; however, visual information has its own particularity and complexity, so traditional pattern recognition (the competitive learning model) cannot cover all of computer vision.
5. Artificial intelligence
Artificial intelligence can be counted as a body of new theories, new tools, and new technologies that have been widely studied and applied in the field of computer vision in recent years. Human intelligence mainly refers to the ability of human beings to understand the world, judge things, learn from the environment, plan behavior, reason, solve problems, and so on. Visual function is one manifestation of human intelligence, and, correspondingly, computer vision is closely related to artificial intelligence. Many artificial intelligence technologies are used in computer vision research; conversely, computer vision can also be regarded as an important application field of artificial intelligence, which requires the help of the theoretical research results and system implementation experience of artificial intelligence. Machine learning is the core of artificial intelligence.
1.2 Computer Vision Theory and Framework
As a discipline, computer vision has its own origins, theories, and frameworks. The origin of computer vision can be traced back to the invention and application of computers; the earliest computer vision techniques were already being studied and applied in the 1960s.
In its early days, research on computer vision did not have a comprehensive theoretical framework. In the 1970s, research on object recognition and scene understanding basically detected linear edges as the primitives of the scene and then combined them to form more complex scene structures. However, in practical applications, comprehensive primitive detection proved difficult and unstable, so such computer vision systems could only take simple lines and corners as input, forming the so-called building-block world.
Marr’s book Vision, published in 1982, summarizes a series of research results obtained by him and his colleagues on human vision, proposes a visual computational theory, and provides a framework for understanding visual information. This framework is both comprehensive and refined, and it is the key to making the study of visual information understanding rigorous and to moving visual research from a descriptive level to the level of mathematical science. Marr’s theory holds that the purpose of vision must be understood before going into the details, and this applies to a variety of information processing tasks. The gist of the theory is as follows.
Marr believes that vision is a far more complex information processing task and
process than people imagine, and its difficulty is often not recognized by people. A
major reason here is that while it is difficult for a computer to understand an image, it
is often a breeze for a human.
To understand the complex process of vision, two issues must first be addressed.
One is the representation of visual information; the other is the processing of visual
information. “Representation” here refers to a formal system (such as Arabic
numeral system, binary numeral system, etc.) that can clearly express certain entities
or certain types of information as well as some rules that explain how the system
works. In the representation, some information is prominent and explicit, while other
information is hidden and vague. Representation has a great influence on the
difficulty of subsequent information processing. The “processing” of visual information refers to the transformation and gradual abstraction of different forms of representation through continuous processing, analysis, and understanding of the information.
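The text's own example of numeral systems can be made concrete with a few lines of Python (a minimal sketch, using an arbitrary number; nothing here is taken from the book): the same value in different formal systems makes different information explicit and different operations cheap, which is exactly why the choice of representation affects the difficulty of later processing.

n = 156
print(bin(n))        # '0b10011100': the powers of two contained in n are explicit
print(bin(n)[-1])    # the last bit is '0', so evenness can be read off directly
print(n % 10)        # in the decimal representation, divisibility by ten is the explicit property
print(bin(n << 1))   # multiplying by 2 is just a one-position shift in the binary representation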
Solving the problem of representation and processing of visual information is
actually solving the problem of computability. If a task needs to be done by a
computer, it should be computable, which is the problem of computability. In
general, for a particular problem, a problem is computable if there is a program
that gives an output in finite steps for a given input.
To fully understand and interpret visual information, three essential factors need to be grasped at the same time, namely, computational theory, algorithm implementation, and hardware implementation.
First, the highest level of visual information understanding is abstract computational theory. Whether vision can be computed by modern computers is a question to be answered by computational theory, but there is no clear answer so far. Vision is a process of sensation and perception, and from the perspectives of microscopic anatomical knowledge and objective visual psychology, people still lack a grasp of the mechanism of human visual function. Therefore, the discussion of visual computability is still rather limited, mainly focusing on using the numerical and symbolic processing abilities of existing computers to complete specific visual tasks.
Table 1.1 The meaning of the three essential factors of visual information processing
1. Computational theory: What is the goal of the computation? Why is it computed in this way?
2. Representation and algorithm: How can the computational theory be realized? What are the input and output representations? What algorithm is used to realize the conversion between representations?
3. Hardware implementation: How can the representations and algorithms be physically implemented? What are the specific details of the computing structures?
Second, the objects operated on by a computer are discrete numbers or symbols, and the storage capacity of a computer is limited. Therefore, once the computational theory is available, the implementation of the algorithm must also be considered, and an appropriate representation must be selected for the entities to be processed. On the one hand, the input and output representations of the processing should be selected; on the other hand, the algorithm that accomplishes the transformation between the representations should be determined. Representation and algorithm restrict each other, so three points deserve attention: (1) in general, there are many optional representations; (2) the determination of the algorithm often depends on the selected representation; and (3) given a representation, there can be many algorithms that complete the task. Generally, the instructions and rules used for the processing are called the algorithm.
Finally, how the algorithm is physically implemented must also be considered. Especially with continuously increasing real-time requirements, the problem of dedicated hardware implementation is often raised. It should be noted that the determination of an algorithm usually depends on the characteristics of the hardware that physically implements it, and the same algorithm can be implemented through different technical approaches.
Summarizing the above discussion yields the content shown in Table 1.1.
There is a certain logical causal connection between the above three essential
factors, but there is no absolute dependence. In fact, there are many different options
for each essential factor. In many cases, the problems involved in explaining each
essential factor are basically irrelevant to the other two essential factors (each
essential factor is relatively independent), or one or two essential factors can be
used to explain certain visual phenomena. The above three essential factors are also
called by many people the three levels of visual information processing, and they
point out that different problems need to be explained at different levels. The
relationship among the three essential factors is often shown as in Fig. 1.2 (in fact, it is more appropriate to regard them as two levels), in which a forward arrow indicates a guiding role and a reverse arrow indicates serving as a basis. Note that once the computational theory is given, the representations and algorithms as well as the hardware implementations influence each other.
The 2.5D sketch map is actually an intrinsic image (see Sect. 1.3.2), because it shows the orientation of the surface elements of the object and thus gives information about the surface shape. Surface orientation is an intrinsic characteristic, and depth is also an intrinsic characteristic. The 2.5D sketch map can be converted into a (relative) depth map.
3. 3D representation
3D representation is a representation form centered on the object (i.e., it also includes the invisible parts of the object). It describes the shape and spatial organization of 3D objects in an object-centered coordinate system. Some basic 3D entity representations can be found in Chap. 9.
Now let us come back to the problem of visual computability. From the perspective of computer or information processing, the problem of visual computability can be divided into several steps. Between the steps is a certain form of representation, and each step consists of a calculation/processing method that connects two forms of representation (see Fig. 1.4).
According to the abovementioned three-level representation viewpoint, the problem to be solved by visual computability is how to start from the pixel representation of the original image and, through the primal sketch representation and the 2.5D sketch representation, finally obtain the 3D representation. This is summarized in Table 1.2.
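The data flow of this three-level representation can be sketched, very schematically, as a chain of transformations. The following minimal Python/NumPy sketch uses placeholder computations only: a gradient-based edge map stands in for the primal sketch, a dummy relative-depth map and surface normals stand in for the 2.5D sketch, and a simple back-projected point set stands in for the 3D representation. All function names are hypothetical, and none of this is an actual implementation of Marr's theory; the point is only the shape of the pipeline, in which each stage consumes one form of representation and produces a more abstract one.

import numpy as np

def primal_sketch(image):
    # Stand-in for the primal sketch: local intensity changes (edge strength).
    gy, gx = np.gradient(image.astype(float))
    return np.hypot(gx, gy)

def sketch_2_5d(edges):
    # Stand-in for the 2.5D sketch: viewer-centered relative depth and surface orientation.
    # A real system would use stereo, shading, motion, etc.; here only a dummy depth map.
    depth = 1.0 / (1.0 + edges)
    ny, nx = np.gradient(depth)
    normals = np.dstack([-nx, -ny, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return depth, normals

def representation_3d(depth):
    # Stand-in for the object-centered 3D representation: back-project pixels to a point set.
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.column_stack([xs.ravel(), ys.ravel(), depth.ravel()])

image = np.random.rand(64, 64)        # stand-in input image
edges = primal_sketch(image)          # image -> primal sketch
depth, normals = sketch_2_5d(edges)   # primal sketch -> 2.5D sketch
points = representation_3d(depth)     # 2.5D sketch -> 3D representation
print(edges.shape, depth.shape, points.shape)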
The idea of viewing the visual information system as a set of relatively independent functional modules is supported not only by evolutionary and epistemological arguments in computing but also by the fact that some functional modules can be separated by experimental methods.
In addition, psychological research shows that people obtain various kinds of intrinsic visual information by using a variety of cues or combinations of them. This suggests that the visual information system should include many modules, each of which obtains specific visual cues and performs certain processing, so that, according to the environment, different modules can be combined with different weight coefficients to complete the visual information understanding task. According to this
point of view, complex processing can be completed with some simple independent
functional modules, which can simplify research methods and reduce the difficulty
of specific implementation. This is also very important from an engineering
perspective.
During the image acquisition process, the information in the original scene will
undergo various changes, including the following:
1. When a 3D scene is projected as a 2D image, the depth of the objects and the information about their invisible parts are lost (a short numerical illustration follows this list).
2. Images are always obtained from a specific viewing direction. Different perspective images of the same scene will differ, and information will also be lost due to mutual occlusion of objects or mutual occlusion of their various parts.
3. Imaging projection merges the illumination, the object geometry and surface reflection characteristics, the camera characteristics, and the spatial relationships among the light source, the object, and the camera into a single image gray value, and these factors are difficult to disentangle.
4. Noise and distortion will inevitably be introduced in the imaging process.
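To illustrate point 1 above, the following short sketch (plain Python/NumPy, assuming a pinhole camera with unit focal length) applies the perspective projection x = fX/Z, y = fY/Z to two different 3D points lying on the same viewing ray; they land on the same image coordinates, so depth cannot be recovered from a single image without additional information.

import numpy as np

def project(point_3d, f=1.0):
    # Pinhole (perspective) projection: (X, Y, Z) -> (f*X/Z, f*Y/Z); Z itself is discarded.
    X, Y, Z = point_3d
    return np.array([f * X / Z, f * Y / Z])

p_near = np.array([1.0, 2.0, 4.0])   # a 3D point
p_far = 2.5 * p_near                 # a different point on the same viewing ray

print(project(p_near), project(p_far))   # both give the image point (0.25, 0.5): depth is lost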
For a problem, if its solution exists, is unique, and depends continuously on the initial data, then it is well-posed; if one or more of these conditions is not satisfied, it is ill-posed (under-determined). Owing to the information changes in the original scene listed above, solving the vision problem as the inverse problem of the optical imaging process becomes an ill-posed (ill-conditioned) problem, so it is very difficult to solve. In order to solve it, it is necessary to find the constraints of the relevant problem according to the general characteristics of the external objective world and to turn them into precise assumptions, so as to draw conclusive and testable conclusions. Constraints are generally obtained with the aid of prior knowledge. The use of constraints can transform ill-conditioned problems, because adding constraints to the calculation makes its meaning clear, thus enabling the problem to be solved.
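A minimal numerical sketch of this idea, assuming NumPy and using Tikhonov regularization as the added constraint (the 2x2 matrix, the noise level, and the regularization weight are arbitrary illustrative choices, not from the book): direct inversion of a nearly singular "imaging" operator amplifies even tiny noise, while adding a small-norm prior keeps the solution stable. The regularized solution cannot recover the component that the data barely constrain, but it no longer blows up, which is exactly the role the text assigns to constraints drawn from prior knowledge.

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0],
              [1.0, 1.001]])                     # nearly singular "imaging" operator
x_true = np.array([1.0, 2.0])
b = A @ x_true + 0.01 * rng.standard_normal(2)   # observation with small noise

x_naive = np.linalg.solve(A, b)                  # direct inversion amplifies the noise
lam = 1e-2                                       # weight of the prior constraint
x_reg = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ b)   # Tikhonov-regularized solution

print(np.linalg.norm(x_naive - x_true))   # large error: the unconstrained inverse problem is unstable
print(np.linalg.norm(x_reg - x_true))     # much smaller error: the constraint stabilizes the solution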
Marr’s visual computational theory is the first theory to have had a major impact on visual research. It has actively promoted research in this field and has played an important role in the research and development of image understanding and computer vision.
Marr’s theory also has its shortcomings, including four problems concerning the overall framework (see Fig. 1.6):
1. The input in the framework is passive: whatever image is input, the system processes that image.
2. The processing goal in the framework remains unchanged: it always recovers the position and shape of the objects in the scene.
3. The framework lacks, or does not pay enough attention to, the guiding role of high-level knowledge.
4. The information processing flow in the entire framework is basically bottom-up and one-way, with no feedback.
In response to the above problems, a series of improvement ideas have been proposed in recent years. Corresponding to the framework of Fig. 1.4, these improvements can be incorporated as new modules to obtain the framework of Fig. 1.5.
In the following, in conjunction with Fig. 1.5, the four aspects in which the original framework of Fig. 1.4 is improved are discussed in detail.
1. Human vision has initiative
People change their line of sight or viewing angle as needed to help observation and cognition. Active vision means that the vision system can determine the movement of the camera, according to the existing analysis results and the current requirements of the vision task, in order to obtain the corresponding image from an appropriate position and perspective. Human vision is also selective: one can stare (observing a region of interest at higher resolution), or one can turn a blind eye to certain parts of the scene. Selective vision means that the vision system can determine the focus of attention of the camera, based on the existing analysis results and the current requirements of the vision task, to obtain the corresponding image. Taking these factors into account, an image acquisition module is added to the improved framework and is considered together with the other modules in the framework. This module chooses the image acquisition mode according to the visual purpose.
The aforementioned active vision and selective vision can also be regarded as two forms of active vision: one is to move the camera to focus on a specific object of interest in the current environment; the other is to focus on a specific region of the image and dynamically interact with it to obtain an interpretation. Although the two forms look very similar, in the first form, the initiative is mainly reflected in the observation by the camera, while in the second form, the initiative is mainly reflected in the processing level and strategy. Both forms involve interaction, that is, vision has initiative; however, moving a camera to record and store the complete scene is a very expensive process, and the overall interpretation obtained in this way is not necessarily used. Collecting only the most useful parts of the scene, narrowing their scope, and enhancing their quality to obtain useful interpretations better mimics the process of human interpretation of scenes.
2. Human vision can be adjusted for different purposes
Purposive vision means that the vision system makes decisions based on the purpose of vision, for example, whether to fully recover information such as the position and shape of objects in the scene or merely to detect whether there is an object in the scene. It may give a simpler solution to vision problems. The key issue here is to determine the purpose of the task. Therefore, a visual purpose box (vision goal) is added to the improved framework. Qualitative analysis or quantitative analysis can be chosen according to the different purposes of understanding (in practice, there are quite a lot of occasions where only qualitative results are sufficient and no complex quantitative results are needed). However, qualitative analysis still lacks complete mathematical tools. The motivation of purposive vision is to clarify only the part of the information that is needed. For example, collision avoidance for autonomous vehicles does not require precise shape descriptions, and some qualitative results are sufficient. This line of thinking does not yet have a solid theoretical basis, but the study of biological vision systems provides many examples.
Qualitative vision, which is closely related to purposive vision, seeks a qualitative description of an object or scene. Its motivation is not to represent geometric information that is not needed for qualitative (nongeometric) tasks or decisions. The advantage of qualitative information is that it is less sensitive than quantitative information to various unwanted transformations (such as slightly changing viewing angles) or noise. Qualitative representations or invariants allow easy interpretation of observed events at different levels of complexity.
3. Humans have the ability to completely solve visual problems using only partial information obtained from images
Humans have this ability because of the implicit use of various kinds of knowledge. For example, obtaining object shape information with the aid of CAD design data (using an object model library) can help solve the difficulty of recovering the object shape from a single image. The use of high-level (domain) knowledge can solve the problem of insufficient low-level information, so a high-level knowledge frame (module) is added to the improved framework.
4. There is interaction between the sequential processing stages in human vision
The human visual process has a certain sequence in time and different levels in meaning, and there is a certain interaction between the various steps. Although the mechanism of this interaction is not yet fully understood, the important role of high-level knowledge and of feedback from later results to low-level processing has been widely recognized. From this perspective, feedback control flow is added to the improved framework, and the existing results and high-level knowledge are used to improve visual efficiency.
Marr’s theory emphasizes the reconstruction of the scene and uses the reconstruction
as the basis for understanding the scene.
According to Marr’s theory, the common core concept of different visual tasks/jobs
is representation, and the common processing goal is to recover the scene from
visual stimuli and incorporate it into the representation. If the vision system can
recover the characteristics of the scene, such as the reflective properties of the surface
of the object, the direction and speed of the object’s movement, the surface structure
of the object, etc., then there needs to be a representation that can help with various
recovery tasks. Under such a theory, different jobs should have the same conceptual
core, understanding process, and data structure.
In his theory, Marr showed how the representations that construct the visual world internally can be extracted from various cues. If the construction of such a unified representation is regarded as the ultimate goal of visual information processing and decision-making, then vision can be viewed as a reconstruction process that starts from the stimuli and proceeds by sequential acquisition and accumulation. This idea of reconstructing the scene first and then interpreting it can simplify the visual task, but it is not completely consistent with human visual function. In fact, reconstruction and interpretation are not always serial, and they need to be adjusted according to the visual purpose.
The above assumptions have also been challenged. Some of Marr’s contemporaries questioned treating the vision process as a hierarchical, single-pass data processing process. One meaningful contribution is that the single-path hypothesis has been shown to be untenable, based on long-standing research in psychophysics and neuropsychology. At the time Marr wrote Vision, there was little psychological research that took into account information about the primate’s higher-level vision, and little was known about the anatomy and functional organization of higher-level visual areas. With the continuous acquisition of new data and the deepening of the understanding of the entire visual process, people have found that the visual process is less and less like a single-channel processing process [7].
Fundamentally, a correct representation of the objective scene should be available for any visual task. If this were not the case, then the visual world itself (which is an external display of internal representations) could not support visual behavior. Nonetheless, further research has revealed that representation by reconstruction is in many respects (shown below) a poor explanation for understanding vision, or at least has a series of problems [7].
Let us first look at the implications of reconstruction for identification or classification. Even if the visual world could be rebuilt internally, this by itself would not make the rest of the visual system unnecessary. In fact, acquiring an image, building a 3D model, or even listing the locations of important stimulus features cannot guarantee recognition or classification. Of all the possible methods for interpreting a scene, the method involving reconstruction takes the most roundabout route, since the reconstruction itself does not directly contribute to the interpretation.
Secondly, it is also difficult to achieve the representation only by reconstruction
from the original image. From a computer vision point of view, it is very difficult to
recover scene representations from original images; there are now many findings in
biological vision that support other representation theories.
Finally, the reconstruction theory is also problematic conceptually. The source of the problem lies in the theoretical claim that reconstruction can be applied to any representation task. Leaving aside the question of whether reconstruction is achievable in concrete terms, one might first ask whether it is worthwhile to seek a universally unified representation. Since the best representation should be the one best suited to the task, a universally uniform representation is not necessarily required. In fact, according to the theory of information processing, the importance of choosing an appropriate and correct representation for a given computational problem is self-evident; Marr himself also pointed out this importance.
Some studies and experiments in recent years have shown that the interpretation of a scene is not necessarily based on a 3D restoration (reconstruction) of the scene, or rather, it is not necessarily based on a complete 3D reconstruction of the scene.
Since there are a series of problems in realizing representation by reconstruction, other forms of representation have also been studied and have received attention. For example, there is a representation that was first proposed by Locke in An Essay Concerning Human Understanding, which is now generally discussed under the semantics of mental representations [7]. Locke’s proposal treats representation in a natural and predictable way. According to this view, a sufficiently reliable feature detector constitutes a primitive representation of the existence of a certain feature in the visual world. The representation of an entire object or scene can then be constructed from these primitives (if there are enough of them).
In the theory of natural computing, the original concept of a feature hierarchy was developed under the influence of the discovery of “insect detectors” in the frog retina. Recent computer vision and computational neuroscience research results suggest that modifications of the original feature-hierarchy representation hypothesis can serve as an alternative to the reconstruction theory. Today’s feature detection differs from traditional feature detection in two ways: one is that a set of feature detectors can have much greater expressive power than any single one of them; the other is that many theoretical researchers have realized that “symbols” are not the only elements that can combine features.
Consider the representation of spatial resolution as an example. In a typical situation, the observer can see two straight line segments that are very close to each other (their separation may even be smaller than the distance between the photoreceptors in the fovea). An early hypothesis was that, at some stage of cortical processing, the visual input is reconstructed with sub-pixel accuracy, making it possible to obtain distances in the scene that are smaller than a pixel. Proponents of reconstruction theory did not believe that feature detectors could be used to build visual functions, and Marr believed that “the world is so complex that it is impossible to analyze it with feature detectors.” This view is now challenged. Taking the
Due to historical factors, Marr did not study how to use mathematical methods to strictly describe visual information. Although he studied early vision rather fully, he basically did not discuss the representation and utilization of visual knowledge, or recognition based on visual knowledge. In recent years, there have been many
3. Find the spatial correspondence by solving for the unknown observation points and model parameters, so that the projection of the 3D model directly matches the image features.
In the above process, there is no need to measure the 3D object surface (no reconstruction); information about the surface is computed using perceptual principles. This theoretical framework shows high stability in handling occlusion and incomplete data. It introduces feedback and emphasizes the guiding role of high-level knowledge in vision. However, practice has shown that on some occasions, such as judging the size of an object or estimating its distance, recognition alone is not enough, and 3D reconstruction must be carried out. In fact, 3D reconstruction still has a very wide range of applications. For example, in the virtual human plan, a large amount of human body information can be obtained by 3D reconstruction from a series of slices. As another example, the 3D distribution of cells can be obtained by 3D reconstruction of tissue slices, which provides very good assistance for localization.
The active vision theoretical framework is mainly based on the initiative of human vision (or, more generally, biological vision). Human vision has two special mechanisms:
1. Selective attention
Not everything the human eye sees is of concern, and useful visual information is usually distributed only over a certain spatial range and time period. Therefore, human vision does not treat all parts of the scene equally but, according to need, pays special attention to some parts while giving others only general observation or even ignoring them. Based on this feature of the selective attention mechanism, multi-azimuth and multi-resolution sampling can be performed when acquiring images, and the information relevant to a specific task can be selected or retained.
2. Gaze control
People can adjust their eyeballs so as to “look” at different positions in the environment at different times, according to their needs, to obtain useful information; this is gaze control. Correspondingly, the camera parameters can be adjusted so that the camera always obtains visual information suitable for the specific task. Gaze control can be divided into gaze stabilization and gaze change. The former is a localization process, such as object tracking; the latter is similar to the rotation of the eyeball, which controls the next fixation point according to the needs of the specific task.
The theoretical framework of active vision proposed according to these human visual mechanisms is shown in Fig. 1.7.
The active vision theoretical framework emphasizes that the visual system should be task-oriented and purpose-oriented and that it should have the ability to perceive actively. According to the existing analysis results and the current requirements of the vision task, an active vision system can control the motion of the camera through the mechanism of actively controlling the camera parameters and can coordinate the relationship between the processing task and the external signals. These parameters include the camera position, orientation, focal length, aperture, etc. In addition, active vision also incorporates the ability of “attention”: by changing the camera parameters or processing the acquired data, the “points of attention” can be controlled to achieve selective perception in space, time, resolution, and so on.
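The control structure just described can be pictured, purely schematically, as a perception-action loop; the camera interface, the analysis step, and the parameter-update rule in the following Python sketch are hypothetical placeholders rather than any real API or actual system.

from dataclasses import dataclass

@dataclass
class CameraParams:
    pan: float = 0.0      # orientation (degrees)
    tilt: float = 0.0
    focal: float = 35.0   # focal length (mm), i.e., zoom

def acquire(params):
    # Placeholder for grabbing an image with the current camera parameters.
    return {"params": params}

def analyze(image, task):
    # Placeholder analysis: returns the image location most relevant to the task
    # and a confidence that the task is already satisfied.
    return {"attention_xy": (0.2, -0.1), "confidence": 0.4}

def update_camera(params, result):
    # Steer the gaze toward the point of attention and zoom in on it.
    dx, dy = result["attention_xy"]
    return CameraParams(pan=params.pan + 5.0 * dx,
                        tilt=params.tilt + 5.0 * dy,
                        focal=min(params.focal * 1.2, 200.0))

params, task = CameraParams(), "find the cup"
for step in range(3):                       # active perception loop
    image = acquire(params)
    result = analyze(image, task)
    if result["confidence"] > 0.9:          # purpose fulfilled: stop early
        break
    params = update_camera(params, result)  # selective attention + gaze control
print(params)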
Similar to the knowledge-based theoretical framework, the theoretical framework of active vision also attaches great importance to knowledge, holding that knowledge is a high-level capability that guides visual activities and should be used to complete visual tasks. However, the current theoretical framework of active vision lacks feedback. On the one hand, such a structure without feedback does not conform to the biological vision system; on the other hand, it often leads to poor accuracy of the results, strong influence of noise, and high computational complexity. At the same time, it also lacks adaptability to applications and environments.
Computer vision is closely related to human vision: it tries to achieve the functions of human vision while also gaining many insights from human vision. Computer vision needs to grasp the objective information of the scene through the visual stimulus signals it receives, and there is a process from perception to cognition. Cognitive science is the result of cross-disciplinary cooperation among psychology, linguistics, neuroscience, computer science, anthropology, philosophy, artificial intelligence, and other disciplines. Its goal is to explore the essence and mechanism of human
1.2.5.2 Connectionism
The theory of embodied cognition holds that cognition cannot be separated from the body and largely depends on and originates from the body. Human cognition is closely related to the structure of the human body, the structure of the nervous system, the ways in which the sensory and motor systems operate, and so on; these factors determine a person’s style of thinking and way of understanding the world. The theory of embodied cognition holds that cognition is cognition of the body, which gives the body a pivotal role and decisive significance in the shaping of cognition and raises the importance of the body and its activities in the interpretation of cognition [14].
1.3 Introduction to Image Engineering
In order to realize the vision function, computer vision needs to use a series of
technologies. Among them, the most direct and closely related is image technology.
Fig. 1.8 The three layers of image engineering: from pixels (lower abstraction level, larger data amount) through objects to symbols (higher abstraction level, smaller data amount)
symbols abstracted from image descriptions, which have many similarities with
human thinking and reasoning.
It can be seen from Fig. 1.8 that among the three layers, the amount of data
gradually decreases as the abstraction level increases. Specifically, the raw image
data is gradually transformed after a series of processing, becoming more organized
and more abstractly expressed. In this process, semantic information is continuously
introduced, the operation objects also change, and the amount of data is gradually
compressed. In addition, high-level operations have a guiding role for low-level
operations, which can improve the efficiency of low-level operations.
From the perspective of comparison and combination with computer vision, the main components of image engineering can also be represented by the overall framework shown in Fig. 1.9, where the dashed box is the basic module of image engineering. Various image techniques are used here to help people obtain information from the scene. The first step is to acquire images from the scene in various ways. The subsequent low-level processing of the image mainly improves the visual effect of the image, or reduces the amount of image data while maintaining the visual effect, and the processing results are mainly intended for users to view. The middle-level analysis of the image mainly detects, extracts, and measures the objects of interest in the image; the results of the analysis provide the user with data describing the characteristics and properties of the image objects. Finally, the high-level understanding of the image aims to understand and grasp the content of the image and to explain the original objective scene by studying the nature of the objects in the image and the relationships between them; the results of the understanding provide the user with objective-world information that can guide and plan actions. These image technologies, from low level to high level, are strongly supported by new theories, new tools, and new technologies, including artificial intelligence, neural networks, genetic algorithms, fuzzy logic, image algebra,
machine learning, and deep learning. Appropriate strategies are also required to
control these tasks.
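The division of labor among the three levels can be sketched with toy type signatures (plain Python/NumPy; the contrast stretching, thresholded region measurement, and rule-based interpretation below are deliberately crude placeholders, not methods from this book): processing maps an image to an image, analysis maps an image to object data, and understanding maps object data to a symbolic statement about the scene.

import numpy as np

def process(image):
    # Image processing: image in, image out (here, simple contrast stretching).
    lo, hi = image.min(), image.max()
    return (image - lo) / (hi - lo + 1e-9)

def analyze(image, threshold=0.5):
    # Image analysis: image in, data out (here, crude "object" measurements).
    mask = image > threshold
    area = int(mask.sum())
    ys, xs = np.nonzero(mask)
    centroid = (float(xs.mean()), float(ys.mean())) if area else None
    return {"area": area, "centroid": centroid}

def understand(object_data):
    # Image understanding: data in, symbolic interpretation out.
    if object_data["area"] > 100:
        return "a large bright object is present in the scene"
    return "no significant object found"

image = np.random.rand(64, 64)        # stand-in for an acquired image
print(understand(analyze(process(image))))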
Incidentally, computer vision technology has evolved over many years, and there are many types of existing techniques. Although there are some classification methods for these techniques, no stable and consistent classification has emerged to date. For example, some people divide computer vision into low-level vision, middle-level vision, and 3D vision, while others divide it into early vision (further divided into single-image and multi-image), middle-level vision, and high-level vision (geometric methods). Even the classification schemes adopted by the same researcher in different periods are not completely consistent; for example, one researcher once divided computer vision into early vision (further divided into single-image and multi-image), middle-level vision, and high-level vision (further divided into geometric methods and probabilistic and inference methods). These different schemes all use three layers, somewhat similar to the stable and consistent three levels of image engineering, although they do not correspond exactly.
Among the three levels of image engineering, the image understanding level is
most closely related to the current research and application of computer vision,
which has many historical origins. In building an image/visual information system
and using computers to assist humans in completing various visual tasks, both image
understanding and computer vision require theories in projection geometry, proba
bility theory and stochastic processes, and artificial intelligence. For example, they
all rely on two types of intelligent activities: (1) perception, such as perceiving the
distance, orientation, shape, movement speed, interrelationship, etc. of the visible
parts of the scene, and (2) thinking, such as analyzing the behavior of objects
according to the scene structure, inferring the development and changes of the
scene, and deciding and planning main actions. It can be said that image under
standing (based on image processing and analysis) has the same goal as computer
vision, and both use engineering techniques to realize the understanding and inter
pretation of the scene through the images obtained from the objective scene. In fact,
the terms image understanding and computer vision are often used interchangeably.
Essentially, they are interrelated; and in many cases, their coverages and contents
overlap, with no absolute boundaries in terms of concept or practicality. In many
contexts and situations, they have different focuses but are often complementary, so
it is more appropriate to think of them as different terms used by people of different
professions and backgrounds.
Simultaneously with the proposal of image engineering (IE), a review series devoted to the statistical classification of the image engineering literature was started, which has now been running for 28 years [22]. This review series selects all literature related to image engineering in 15 journals for analysis; the main categories, subcategories, and numbers of papers are summarized in Table 1.3.
Deep learning uses cascaded multilayer nonlinear processing units for feature extraction and transformation, realizing multilevel feature representation and concept abstraction learning [23]. Deep learning still belongs to the category of machine learning, but compared with traditional machine learning methods, it avoids the manual design of features that traditional methods require and shows clear performance advantages when big data are available. Compared with traditional machine learning methods, deep learning is also more general and requires less prior knowledge and annotation data. However, the theoretical framework of deep learning has not been fully established; at present, there is still no powerful and complete theoretical explanation of how deep neural networks operate and why they perform so well.
The current mainstream deep learning methods are based on neural networks (NN). Neural networks can learn pattern features directly from training data, without features being designed and extracted first, and can easily be trained end to end. The study of neural networks has a long history. In 1989, the universal approximation theorem of the multilayer perceptron (MLP) was proven, and one of the basic deep learning models, the convolutional neural network (CNN), was used for handwritten digit recognition. The concept of deep learning was formally proposed in 2006 and has led to extensive research on and application of deep neural network technology.
Table 1.3 Main categories and quantities of image engineering research and application literature in the past 18 years

Image processing (4050 in total):
- Image acquisition (including various imaging methods, image capturing, representation and storage, camera calibration, etc.): 832
- Image reconstruction (including image reconstruction from projection, indirect imaging, etc.): 375
- Image enhancement/image restoration (including transformation, filtering, restoration, repair, replacement, correction, visual quality evaluation, etc.): 1313
- Image/video coding and compression (including algorithm research, implementation and improvement of related international standards, etc.): 505
- Image information security (including digital watermarking, information hiding, image authentication and forensics, etc.): 705
- Image multi-resolution processing (including super-resolution reconstruction, image decomposition and interpolation, resolution conversion, etc.): 320

Image analysis (4820 in total):
- Image segmentation and primitive detection (including edges, corners, control points, points of interest, etc.): 1564
- Object representation, object description, feature measurement (including binary image morphology analysis, etc.): 150
- Object feature extraction and analysis (including color, texture, shape, space, structure, motion, saliency, attributes, etc.): 466
- Object detection and object recognition (including object 2D positioning, tracking, extraction, identification and classification, etc.): 1474
- Human body biological feature extraction and verification (including detection, positioning and recognition of human body, face and organs, etc.): 1166

Image understanding (2213 in total):
- Image matching and fusion (including registration of sequence and stereo images, mosaic, etc.): 1070
- Scene recovering (including 3D scene representation, modeling, reconstruction, etc.): 256
- Image perception and interpretation (including semantic description, scene model, machine learning, cognitive reasoning, etc.): 123
- Content-based image/video retrieval (including corresponding labeling, classification, etc.): 450
- Spatial-temporal techniques (including high-dimensional motion analysis, object 3D posture detection, spatial-temporal tracking, behavior judgment and behavior understanding, etc.): 314

Technical applications (3164 in total):
- Hardware, system devices, and fast/parallel algorithms: 348
- Communication, video transmission and broadcasting (including TV, network, radio, etc.): 244
- Documents and texts (including words, numbers, symbols, etc.): 163
- Biology and medicine (physiology, hygiene, health, etc.): 590
- Remote sensing, radar, sonar, surveying and mapping, etc.: 1279
- Other (no directly/explicitly included technology application): 540
Fig. 1.10 Schematic diagram of the basic structure of a typical convolutional neural network
The convolutional neural network is closely related to the BP (back-propagation) network. The main difference in the input is that the input of the BP network is a 1D vector, while that of the convolutional neural network is a 2D matrix. The convolutional neural network consists of a layer-by-layer structure, mainly including the input layer, convolution layer, pooling layer, output layer, fully connected layer, batch normalization layer, etc. In addition, convolutional neural networks also use activation functions (excitation functions), cost functions, etc.
Figure 1.10 shows a part of the basic structure of a typical convolutional neural
network.
The convolutional neural network has four similarities with the general fully
connected neural network (multilayer perceptron):
1. Both construct the sum of products.
2. Both superimpose a bias (see below).
3. Both let the result go through an activation function (see below).
4. Both use the activation function value as a single input to the next layer.
Convolutional neural networks are different from general fully connected neural
networks in four ways:
1. The input of the convolutional neural network is a 2D image, and the input of the
fully connected neural network is a 1D vector.
2. Convolutional neural networks can directly learn 2D features from raw image data, while fully connected neural networks cannot.
3. In a fully connected neural network, the output of every neuron in a layer is provided directly to each neuron in the next layer, while in a convolutional neural network the outputs of the neurons in the previous layer are first combined by convolution over their spatial neighborhoods (receptive fields) into a single value, which is then provided to each neuron in the next layer.
4. In a convolutional neural network, the 2D image input to the next layer is first
subsampled to reduce sensitivity to translation.
In the following, the convolutional layers, pooling layers, activation functions,
and cost functions are explained in more detail.
The convolutional layer mainly implements the convolution operation, and the
convolutional neural network is named after the convolution operation. At each
location in the image, add a bias value to the convolution value (sum of products) at
that location, and convert their sum to a single value through an activation function.
This value is used as the input to the next layer at that location. If the above operation
is performed on all positions of the input image, a set of 2D values is obtained, which
can be called a feature map (because the convolution extracts features). Different convolution layers have different numbers of convolution kernels. A convolution kernel is actually a numerical matrix; the commonly used convolution kernel sizes are 1 x 1, 3 x 3, 5 x 5, 7 x 7, etc. Each convolution kernel has a constant bias, and the
elements in all matrices plus the bias form the weight of the convolution layer, and
the weight participates in the iterative update of the network.
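A minimal NumPy sketch of the operation just described: a single kernel is slid over the image, the sum of products at each location is added to the shared bias, and an activation converts it to one value of the feature map. The 3 x 3 kernel, zero bias, and ReLU activation are illustrative assumptions of this example.

```python
import numpy as np

def conv2d_feature_map(image, kernel, bias=0.0):
    """Valid convolution: at each location take the sum of products with the
    shared kernel (weight sharing), add the single shared bias, then apply
    an activation to obtain one feature map."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            receptive_field = image[i:i + kH, j:j + kW]   # local receptive field
            out[i, j] = np.sum(receptive_field * kernel) + bias
    return np.maximum(out, 0.0)                           # ReLU activation

image = np.random.rand(8, 8)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)  # a vertical-edge kernel
feature_map = conv2d_feature_map(image, kernel)
print(feature_map.shape)   # (6, 6)
```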
Two important concepts in convolution operations are local receptive field and
weight sharing (also called parameter sharing). The size of the local receptive field is
the scope of the convolution kernel convolution operation, and each convolution
operation only needs to consider the information within this range. Weight sharing means that the values of a convolution kernel remain unchanged while it is slid over the image during a convolution operation; the weights are only updated from one training iteration to the next. In other words, the same
weights and a single bias are used to generate the values of the feature map
corresponding to all locations of the receptive field of the input image. This allows
the same feature to be detected at all locations in the image. Each convolution kernel
only extracts a certain feature, so the values in different convolution kernels are different.
The operation after convolution and activation is pooling. The pooling layer mainly
implements down-sampling and dimension reduction operations, so it is also called
down-sampling layer or subsampling layer. The design of the pooling layer is based
on a model of the mammalian visual cortex. The model considers the visual cortex to
include two types of cells: simple cells and complex cells. Simple cells perform
feature extraction, while complex cells combine (merge) these features into a more
meaningful entirety. Pooling layers generally do not have weight updates.
The role of pooling includes reducing the spatial size of the data volume (reduc
ing the spatial resolution to obtain translation invariance), reducing the number of
parameters in the network and the amount of data to be processed, and thereby
reducing the overhead of computing resources and effectively controlling
overfitting. Pooled feature maps can be thought of as the result of subsampling (for each feature map there is a corresponding pooled feature map). In other words, pooled feature maps are feature maps with reduced spatial resolution. Pooling first decomposes the feature map into a set of small regions (neighborhoods) and then replaces all elements in each neighborhood with a single value. These regions are called pooling neighborhoods, and here it can be assumed that the pooling neighborhoods are contiguous (i.e., nonoverlapping).
There are multiple ways to compute pooled values, and they are all collectively
referred to as pooling methods. Common pooling methods include the following:
1. Average pooling, also known as mean pooling, selects the average of all values
in each neighborhood.
2. Maximum pooling, also called maximum value pooling, selects the maximum
value of all values in each neighborhood.
3. L2 pooling selects the square root value of the sum of squares of all values in each
neighborhood.
4. Random value pooling selects a value at random from each neighborhood, according to criteria based on the corresponding values in that neighborhood.
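The following sketch implements pooling over the contiguous (nonoverlapping) 2 x 2 neighborhoods described above, with maximum and average pooling as two of the listed options; the 2 x 2 neighborhood size is an assumption of the example.

```python
import numpy as np

def pool2x2(feature_map, mode="max"):
    """Decompose the feature map into nonoverlapping 2x2 neighborhoods and
    replace each neighborhood with a single value (max or average pooling)."""
    H, W = feature_map.shape
    blocks = feature_map[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))      # average (mean) pooling

fm = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(fm, "max"))    # pooled feature map with reduced spatial resolution
print(pool2x2(fm, "mean"))
```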
The activation function is also called the excitation function. Its function is to
selectively activate or suppress the features of the neuron nodes, which can enhance
the activation of useful object features and weaken the useless background features,
so that the convolutional neural network can solve nonlinear problems. If no nonlinear activation function is added, the network model is equivalent to a linear mapping, which limits its expressive ability. Therefore, a nonlinear activation function is necessary so that the network model has the ability to perform nonlinear mappings of the feature space.
The activation function should have some basic properties: (1) monotonicity, which ensures that a single-layer network model corresponds to a convex optimization problem, and (2) differentiability, which allows error gradients to be used to fine-tune the updates of the model weights. Commonly used activation functions include the following:
1. Sigmoid function (as shown in Fig. 1.11a)

$$h(z) = \frac{1}{1 + e^{-z}} \qquad (1.1)$$

Its derivative is

$$h'(z) = h(z)[1 - h(z)] \qquad (1.2)$$

2. Hyperbolic tangent function (as shown in Fig. 1.11b)

$$h(z) = \tanh(z) = \frac{1 - e^{-2z}}{1 + e^{-2z}} \qquad (1.3)$$

Its derivative is

$$h'(z) = 1 - h^2(z) \qquad (1.4)$$

The hyperbolic tangent function is similar in shape to the sigmoid function, but the hyperbolic tangent function is symmetric about the origin.
3. Rectifier function (as shown in Fig. 1.11c)

Because the unit it uses is also called a rectified linear unit (ReLU), the corresponding activation function is often called the ReLU activation function:

$$h(z) = \max(0, z) \qquad (1.5)$$

Its derivative is

$$h'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases} \qquad (1.6)$$
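A small NumPy sketch of the three activation functions and their derivatives, following the forms written in Eqs. (1.1)-(1.6) above.

```python
import numpy as np

def sigmoid(z):                       # Eq. (1.1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):                 # Eq. (1.2): h'(z) = h(z)(1 - h(z))
    h = sigmoid(z)
    return h * (1.0 - h)

def tanh_prime(z):                    # Eq. (1.4): h'(z) = 1 - h(z)^2
    return 1.0 - np.tanh(z) ** 2

def relu(z):                          # Eq. (1.5): h(z) = max(0, z)
    return np.maximum(0.0, z)

def relu_prime(z):                    # Eq. (1.6): 1 for z > 0, 0 for z < 0
    return (z > 0).astype(float)

z = np.linspace(-3, 3, 7)
print(sigmoid(z), np.tanh(z), relu(z), sep="\n")
```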
The loss function is also called the cost function. In machine learning tasks, every algorithm has an objective function, and the principle of the algorithm is to optimize this objective function, driving it toward its maximum or minimum value. When the objective function is minimized (possibly under constraints), it is called the loss function. In the convolutional neural network, the loss function is used to drive the network training, so that the network weights are updated.
The most commonly used loss function in convolutional neural network model training is the softmax loss function, that is, the cross-entropy loss function applied to the softmax output. Softmax is a commonly used classifier whose expression is
$$h(\mathbf{x}_i) = \frac{\exp(\mathbf{w}^T\mathbf{x}_i)}{\sum_{j=1}^{n}\exp(\mathbf{w}_j^T\mathbf{x}_i)} \qquad (1.7)$$
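A sketch of the softmax classifier of Eq. (1.7) together with the cross-entropy (softmax) loss described above, for a single sample; the max-shift used for numerical stability is an implementation detail added here, not part of the text.

```python
import numpy as np

def softmax(scores):
    """scores: vector of class scores w_j^T x_i for one sample."""
    shifted = scores - scores.max()          # numerical stability (does not change the result)
    e = np.exp(shifted)
    return e / e.sum()

def softmax_cross_entropy(scores, true_class):
    """Cross-entropy loss of the softmax output for one labeled sample."""
    p = softmax(scores)
    return -np.log(p[true_class])

scores = np.array([2.0, 1.0, 0.1])           # class scores for one sample
print(softmax(scores))                       # probabilities summing to 1
print(softmax_cross_entropy(scores, true_class=0))
```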
Since 2012, deep learning algorithms have achieved excellent results in image
classification, video classification, object detection, object tracking, semantic seg
mentation, depth estimation, image/video generation, and other tasks. Deep learning has gradually replaced traditional statistical learning and become the mainstream framework and methodology of computer vision.
1. Image classification
The goal of image classification is to classify an image into a given category.
Some classical models in image classification have also become the backbone networks for other tasks such as detection and segmentation, from AlexNet to VGG, GoogLeNet, ResNet, and DenseNet. Neural network models have grown deeper and deeper, from several layers to thousands of layers.
2. Video classification
An early and effective deep learning method is the two-stream convolutional network, which combines appearance features and motion features. The two-stream convolutional network is based on 2D convolution kernels. In recent
years, many scholars have proposed some 3D CNN to realize video classification
by extending 2D convolution kernel to 3D or combining 2D and 3D, including
I3D, C3D, P3D, etc. For video motion detection, the boundary-sensitive network (BSN), attention clustering networks, generalized compact non-local networks, and so on have been proposed.
3. Object detection
Object detection is to identify objects in an image and determine boundaries
and labels for each object. The commonly used deep learning-based models
mainly include the one-stage model and the two-stage model. The two-stage
method is based on image classification, that is, the potential candidate regions of
the object are first determined, and then the classification method is used to
identify them. A typical two-stage method is the R-CNN family. From R-CNN
to Fast R-CNN, R-FCN, and Faster R-CNN, the detection efficiency continues to
improve. The one-stage method is based on regression; it performs detection in a single pass with shared features and can greatly improve the speed while ensuring a certain accuracy. Some relatively important methods include the YOLO series, the SSD series, and, more recently, the deeply supervised object detector (DSOD) method and the receptive field block (RFB) network.
4. Object tracking
Object tracking is to track one or more specific objects of interest in a specific
scene. Multi-object tracking is to track multiple object trajectories of interest in
video images and extract their motion trajectory information through temporal
correlation. Object tracking methods can be divided into two categories: gener
ative methods and discriminative methods. Generative methods mainly use a
generative model to describe the apparent features of the object and then search
for candidate objects to minimize the reconstruction error. The discriminative
method distinguishes the object from the background by training the classifier, so
it is also called the tracking-by-detection method. Its performance is more stable,
and it has gradually become the main research method in the field of object
tracking. Relatively popular methods in recent years include a series of tracking methods based on Siamese networks.
5. Semantic segmentation
Semantic segmentation needs to label the semantic category of each pixel in
the image. A typical deep learning approach is to use a fully convolutional
network (FCN). After inputting an image, the FCN model directly obtains the
category of each pixel at the output, so as to achieve end-to-end image semantic
segmentation. Further improvements include U-Net, dilated convolution,
DeepLab series, pyramid scene parsing network (PSPNet), etc.
6. Depth estimation
Monocular depth estimation usually uses the image data of a single view as input and directly predicts the depth value corresponding to each pixel in the image. The baseline deep learning model for monocular depth estimation is the CNN. In order to overcome the difficulty that monocular depth
estimation usually requires a large amount of depth-annotated data and the high
cost of such data acquisition, a single view stereo matching (SVSM) model is
proposed, which can achieve good results with only a small amount of depth-
annotated data.
7. Image/video generation
Image/video generation is closer to computer graphics technology. The input
is the abstract attribute of the image, while the output is the image distribution
corresponding to the attribute. With the development of deep learning, the
automatic generation of image/video, the expansion of database, and the com
pletion of image information have been paid attention to. Two popular depth
generation models are variational auto-encoder (VAE) and generative adver
sarial network (GAN). As an unsupervised deep learning method, GAN can
learn by playing games between two neural networks, which can alleviate the
problem of data sparsity to a certain extent. Based on GAN, from Pix2Pix that
needs to prepare paired data, to CycleGAN that only needs unpaired data, and to
StarGAN that can span multiple domains, such methods have gradually come closer to practical use (e.g., AI news anchors).
This book has 11 chapters, divided into 4 levels, and its organizational structure is
shown in Fig. 1.12. Among them, four levels are given on the left: background
knowledge, image acquisition, scene recovery, and scene interpretation; on the right
are the chapters and their titles corresponding to these four levels, which are
described in detail below (see [24, 25] for the explanation of relevant names).
1. Background knowledge: providing an overview of computer vision and related
background information.
Chapter 1 briefly introduces the basics of computer vision, image engineering,
and deep learning and discusses the theory and framework of computer vision.
Chapter 4 introduces the point cloud data and its processing. Point cloud data
sources, preprocessing (including registration of point cloud data with the help
of biomimetic optimization), laser point cloud modeling, texture mapping, local
feature description, and deep learning methods in scene understanding are
discussed.
3. Scene restoration: discussing various technical principles for reconstructing
objective 3D scenes from images.
Chapter 5 introduces binocular stereovision technology, mainly region-based
matching technology and feature-based matching technology. In addition, an
error correction algorithm, an overview of a recent stereo matching network
based on deep learning technology, and a specific method are also presented.
Chapter 6 introduces the multi-ocular stereovision technology, first discusses
the two specific modes of horizontal multi-ocular and orthogonal tri-ocular, and
then generalizes to the general multi-ocular situation. Finally, two systems, one consisting of five cameras and the other of a single camera combined with multiple mirrors, are analyzed separately.
Chapter 7 introduces monocular multi-image scene recovery and discusses
the principles and methods of shape recovery from illumination and shape
recovery from motion, respectively. In addition to an overview of recent
research progress in photometric stereo technology, the corresponding technol
ogies using GAN and CNN are also introduced in detail.
Chapter 8 introduces the scene restoration from monocular and single image
and discusses the principles and methods of restoring shape from shading,
texture, and focal length, respectively. Recent work on shape recovery from shading under mixed surface perspective projection using different models is also presented.
4. Scene interpretation: analyzing how to realize the understanding and judgment
of the scene with the help of the reconstructed 3D scene.
Chapter 9 introduces generalized matching, mainly object matching, relation
matching, matching with the help of graph theory, and line drawing labeling. It
also analyzes some specific matching techniques, the connection between
matching and registration, and provides a recent overview of multimodal
image matching.
Chapter 10 introduces simultaneous localization and mapping (SLAM). The
composition, process, and modules of laser SLAM and visual SLAM, their
comparison and fusion, and their combination with deep learning and multi-agent systems are discussed. Some typical algorithms are also analyzed in
detail.
Chapter 11 introduces the understanding of spatiotemporal behavior. On the
basis of its concept, definition, development, and hierarchical research, this
chapter focuses on the discussions of joint modeling of agent and action, activity
and behavior recognition, and automatic activity analysis, especially the detec
tion of abnormal events.
References
1. Szeliski R (2022) Computer Vision: Algorithms and Applications, 2nd ed. Springer Nature, Switzerland.
2. Finkel L H, Sajda P (1994) Constructing Visual Perception. American Scientist, 82(3): 224-237.
3. Shapiro L, Stockman G (2001) Computer Vision. Prentice Hall, London.
4. Zhang Y-J (2017) Image Engineering, Vol.1: Image Processing. De Gruyter, Germany.
5. Zhang Y-J (2017) Image Engineering, Vol.2: Image Analysis. De Gruyter, Germany.
6. Zhang Y-J (2017) Image Engineering, Vol.3: Image Understanding. De Gruyter, Germany.
7. Edelman S (1999) Representation and Recognition in Vision. MIT Press, Cambridge.
8. Grossberg S, Mingolla E (1987) Neural dynamics of surface perception: boundary webs,
illuminants and shape-from-shading. Computer Vision, Graphics and Image Processing,
37(1): 116-165.
9. Kuvich G (2004) Active vision and image/video understanding systems for intelligent
manufacturing. SPIE 5605: 74-86.
10. Lowe D G (1987) Three-dimensional object recognition from single two-dimensional images.
Artificial Intelligence, 31(3): 355-395.
11. Lowe D G (1988) Four steps towards general-purpose robot vision. Proceedings of the 4th
International Symposium on Robotics Research, 221-228.
12. Ye H S (2017) Embodied Cognition: Principles and Applications. Commercial Press, Beijing.
13. Osbeck L M (2009) Transformations in cognitive science: Implementations and issues posed.
Journal of Theoretical and Philosophical Psychology, 29(1): 16-33.
14. Chen W, Yin R, Zhang J (2021) Embodied Cognition in Psychology: A Dialogue among Brain,
Body and Mind. Science Press, Beijing.
15. Alban M W, Kelley C M (2013) Embodiment meets metamemory: Weight as a cue for
metacognitive judgements. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 39(8): 1-7.
16. Varela F, Thompson E, Rosch E (1991) The Embodied Mind: Cognitive Science and Human Experience. MIT Press, Cambridge.
17. Li H W, Xiao J Y (2006) The embodied view of cognition. Journal of Dialectics of Nature, (1): 29-34.
18. Meng F K (2023) Embodied intelligence: A new stage of intelligent evolution. China Industry
and Information Technology, (7): 6-10.
19. Bonsignorio F (2023) Editorial: Novel methods in embodied and enactive AI and cognition.
Front. Neurorobotics, 17:1162568 (DOI: 10.3389/fnbot.2023.1162568).
20. Zhang Y-J (1996) Image engineering in China: 1995. Journal of Image and Graphics, 1(1):
78-83.
21. Zhang Y-J (1996) Image engineering and bibliography in China. Technical Digest of International Symposium on Information Science and Technology, 158-160.
22. Zhang Y-J (2023) Image engineering in China: 2022. Journal of Image and Graphics, 28(4):
879-892.
23. Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press, Massachusetts.
24. Zhang Y-J (2021) English-Chinese Dictionary of Image Engineering. Tsinghua University
Press, Beijing.
25. Zhang Y-J (2021) Handbook of Image Engineering. Springer Nature, Singapore.
Chapter 2
Camera Imaging and Calibration
Photometry is the study of the strength of (visible) light; the more general radiometry is the study of the strength of (electromagnetic) radiation. The brightness of a scene element is related to the intensity of its light radiation.
For a luminous scene element (a light source), its brightness is proportional to its own radiated power, or amount of light radiation. In photometry, luminous flux is used to express the power or amount of optical radiation, and its unit is lm (lumen).
An image f(x, y) formed from such a scene can be modeled as the product of an illumination component and a reflection component,

$$f(x, y) = I(x, y)R(x, y) \qquad (2.2)$$

where I(x, y) is the amount of light incident on the scene point and R(x, y) is the reflectivity of the corresponding surface, with

$$0 < I(x, y) < \infty \qquad (2.3)$$

$$0 < R(x, y) < 1 \qquad (2.4)$$
Equation (2.3) shows that the incident amount is always greater than zero (only
considering the incident case), but it is not infinite (because it should be physically
realizable). Equation (2.4) shows that the reflectivity is between 0 (total absorption)
and 1 (total reflection). The values given by the two equations are theoretical limits.
It should be noted that the value of I(x, y) is determined by the lighting source, while
the value of R(x, y) is determined by the surface characteristics of objects in the
scene.
Generally, the luminance value of the monochrome image f(x, y) at the coordinates (x, y) is called the gray value of the image at this point (represented by g). According to Eqs. (2.2) and (2.3), g takes positive and bounded values.
The purpose of building a space imaging model is to determine the image coordinates (x, y), that is, the 2D position onto which a point of the 3D objective scene is projected [4].
When discussing the conversion between different coordinate systems, if the coordinates are expressed in homogeneous form, the conversion between any two coordinate systems can be expressed as a linear matrix transformation. Let us first consider the homogeneous representation of lines and points.
A straight line on the plane can be represented by the equation of line
ax + by + c = 0. Different a, b, c can represent different straight lines, so a straight
line can also be represented by a vector l = [a, b, c]T. Because the line ax + by + c = 0
and the line (ka)x + (kb)y + kc = 0 are the same when k is not 0, so when k is not
0, the vector [a, b, c]T and the vector k[a, b, c]T represent the same straight line. In
fact, these vectors that differ by only one scale can be considered equivalent. A
vector set that satisfies this equivalence relationship is called a homogeneous
vector, and any specific vector [a, b, c]T is the representative of the vector set.
For a line l = [a, b, c]T, the point x = [x, y]T is on the line if and only if
ax + by + c = 0. This can be represented by the inner product of the vector [x, y, 1]T corresponding to the point and the vector l = [a, b, c]T corresponding to the line, that is, [x, y, 1] · [a, b, c]T = [x, y, 1] · l = 0. Here, the point vector [x, y]T is represented by a 3D vector with the value 1 added as the last item. Note that for any nonzero constant k and any straight line l, [kx, ky, k] · l = 0 if and only if [x, y, 1] · l = 0. Therefore, it can
be considered that all vectors [kx, ky, k]T (obtained by the variation of k) are
expressions of points [x, y]T. Thus, like lines, points can also be represented by
homogeneous vectors.
In general, the homogeneous coordinates of the Cartesian coordinates XYZ
corresponding to a point in space are defined as (kX, kY, kZ, k), where k is an
arbitrary nonzero constant. Obviously, the transformation from homogeneous coor
dinates back to Cartesian coordinates can be obtained by dividing the first three coordinate quantities by the fourth coordinate quantity. Thus, a point in a Cartesian world coordinate system can be represented in vector form as

$$\mathbf{W} = [X \ \ Y \ \ Z]^T \qquad (2.6)$$

and in homogeneous form as

$$\mathbf{W}_h = [kX \ \ kY \ \ kZ \ \ k]^T \qquad (2.7)$$
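A small sketch of these conventions with NumPy: the incidence of a point and a line is tested with the inner product described above, and a 3D point is converted to and from the homogeneous form of Eqs. (2.6) and (2.7); the sample line and points are arbitrary.

```python
import numpy as np

def to_homogeneous(point, k=1.0):
    """Cartesian -> homogeneous: (X, Y, Z) -> (kX, kY, kZ, k), with k != 0."""
    return np.append(k * np.asarray(point, dtype=float), k)

def to_cartesian(ph):
    """Homogeneous -> Cartesian: divide the first entries by the last one."""
    ph = np.asarray(ph, dtype=float)
    return ph[:-1] / ph[-1]

# Point-line incidence in the plane: [x, y, 1] . [a, b, c] = 0
line = np.array([1.0, -1.0, 0.0])            # the line x - y = 0
point = np.array([2.0, 2.0, 1.0])            # homogeneous form of (2, 2)
print(np.dot(point, line))                   # 0.0 -> the point lies on the line

W = [1.0, 2.0, 3.0]
Wh = to_homogeneous(W, k=2.0)                # [2, 4, 6, 2], as in Eq. (2.7)
print(to_cartesian(Wh))                      # back to [1, 2, 3], as in Eq. (2.6)
```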
Consider the most basic and simplest space imaging model first; suppose that the world coordinate system XYZ coincides with the camera coordinate system xyz and that the xy plane of the camera coordinate system also coincides with the image plane x′y′ (the computer image coordinate system is not considered for now).
The camera projects the 3D objective world scene onto the 2D image plane through
perspective projection. This projection can be spatially described by an imaging
transformation (also called a geometric perspective transformation or perspective
transformation). Figure 2.2 shows a schematic diagram of the geometric perspective
transformation model of an imaging process, in which the camera optical axis
(through the center of the lens) is directed outward along the positive z-axis. In
this way, the center of the image plane is at the origin, and the coordinates of the lens center are (0, 0, λ), where λ represents the focal length of the lens.
$$\frac{x}{\lambda} = \frac{-X}{Z - \lambda} = \frac{X}{\lambda - Z} \qquad (2.8)$$

$$\frac{y}{\lambda} = \frac{-Y}{Z - \lambda} = \frac{Y}{\lambda - Z} \qquad (2.9)$$

In these equations, the negative signs before X and Y mean that the image points are inverted. The image plane coordinates of the 3D point after perspective projection can be obtained from these two equations:

$$x = \frac{\lambda X}{\lambda - Z} \qquad (2.10)$$

$$y = \frac{\lambda Y}{\lambda - Z} \qquad (2.11)$$
This perspective projection can be expressed with a 4 x 4 perspective transformation matrix

$$\mathbf{P} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & -1/\lambda & 1 \end{bmatrix} \qquad (2.12)$$

which maps the homogeneous world coordinates to homogeneous camera coordinates:

$$\mathbf{c}_h = \mathbf{P}\mathbf{W}_h \qquad (2.13)$$
Here the elements of ch are the camera coordinates in homogeneous form, and these coordinates can be converted into Cartesian form by dividing the first three elements of ch by its fourth element. Therefore, the Cartesian coordinates of any point in the camera coordinate system can be expressed in vector form as

$$\mathbf{c} = [x \ \ y \ \ z]^T = \left[ \frac{\lambda X}{\lambda - Z} \ \ \ \frac{\lambda Y}{\lambda - Z} \ \ \ \frac{\lambda Z}{\lambda - Z} \right]^T \qquad (2.14)$$
where the first two items of c are the coordinates (x, y) of the 3D space point (X, Y, Z)
projected onto the image plane.
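A small numerical check of this model: the perspective transformation matrix of Eq. (2.12) is applied to a world point in homogeneous form and the result is converted to Cartesian form as in Eq. (2.14), matching Eqs. (2.10) and (2.11); the focal length and the point are arbitrary sample values.

```python
import numpy as np

lam = 0.05                                    # focal length (arbitrary sample value)
P = np.array([[1, 0, 0,        0],
              [0, 1, 0,        0],
              [0, 0, 1,        0],
              [0, 0, -1 / lam, 1]])           # perspective transformation matrix, Eq. (2.12)

X, Y, Z = 0.4, 0.3, 2.0                       # a 3D point in front of the camera
Wh = np.array([X, Y, Z, 1.0])                 # homogeneous world coordinates (k = 1)
ch = P @ Wh                                   # homogeneous camera coordinates, Eq. (2.13)
c = ch[:3] / ch[3]                            # Cartesian form, Eq. (2.14)

x = lam * X / (lam - Z)                       # Eq. (2.10)
y = lam * Y / (lam - Z)                       # Eq. (2.11)
print(c[:2], (x, y))                          # the first two entries agree with (x, y)
```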
The inverse perspective transformation matrix is

$$\mathbf{P}^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1/\lambda & 1 \end{bmatrix} \qquad (2.16)$$

Applying it to an image point expressed in homogeneous form yields

$$\mathbf{W}_h = [kx_0 \ \ ky_0 \ \ 0 \ \ k]^T \qquad (2.18)$$

that is, in Cartesian form,

$$\mathbf{W} = [X \ \ Y \ \ Z]^T = [x_0 \ \ y_0 \ \ 0]^T \qquad (2.19)$$
Equation (2.19) shows that the Z coordinate of the 3D space point cannot be uniquely determined by the image point (x0, y0) (because the inverse transformation gives Z = 0 for any image point). The problem here is caused by the many-to-one transformation from the 3D objective scene to the image plane. The image point (x0, y0) actually corresponds to the set of all collinear 3D spatial points on the straight line passing through (x0, y0, 0) and (0, 0, λ) (see the line between image points and space points in Fig. 2.2). In the
world coordinate system, X and Y can be inversely solved from Eqs. (2.10) and
(2.11), respectively:
$$X = \frac{x_0}{\lambda}(\lambda - Z) \qquad (2.20)$$

$$Y = \frac{y_0}{\lambda}(\lambda - Z) \qquad (2.21)$$
The above two equations show that it is impossible to completely recover the
coordinates of a 3D space point from its image unless there is some prior knowledge
about the 3D space point mapped to the image point (such as knowing its
Z coordinate). In other words, it is necessary to know at least one world coordinate
of the point in order to recover the 3D space point from its image by using the inverse
perspective transformation.
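A small sketch of this inverse relation: the depth cannot be recovered from the image point alone, but if the Z coordinate is known (the prior knowledge assumed here), Eqs. (2.20) and (2.21) return X and Y; the numbers are arbitrary sample values.

```python
import numpy as np

lam = 0.05                                   # focal length
X, Y, Z = 0.4, 0.3, 2.0                      # ground-truth world point
x0 = lam * X / (lam - Z)                     # image-plane coordinates from Eqs. (2.10), (2.11)
y0 = lam * Y / (lam - Z)

# With only (x0, y0) the depth is ambiguous; knowing Z, invert with Eqs. (2.20), (2.21):
X_rec = (x0 / lam) * (lam - Z)
Y_rec = (y0 / lam) * (lam - Z)
print(X_rec, Y_rec)                          # 0.4 0.3
```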
Further consider the case that the camera coordinate system is separated from the
world coordinate system, but the camera coordinate system coincides with the image
plane coordinate system (the computer image coordinate system is still not consid
ered). Figure 2.3 shows a schematic diagram of the geometric model for imaging at
this time. The position deviation between the center of the image plane (origin) and
the world coordinate system is recorded as vector D, and its components are Dx, Dy,
and Dz, respectively. Here, it is assumed that the camera scans horizontally at an
angle of θ between the X- and x-axes and tilts vertically at an angle of α between the
Z- and z-axes. If the XY plane is taken as the equatorial plane of the Earth, and the Z-
axis points to the north pole of the Earth, the pan angle corresponds to the longitude
and the tilt angle corresponds to the latitude.
The above model can be converted from the basic camera space imaging model
when the world coordinate system coincides with the camera coordinate system
through the following series of steps: (1) move the origin of the image plane out of
the origin of the world coordinate system according to the vector D; (2) pan the x-axis by the pan angle θ (rotating around the Z-axis); (3) tilt the z-axis (rotating around the x-axis) by the tilt angle α.
Moving the camera relative to the world coordinate system is also equivalent to
moving the world coordinate system opposite to the camera. Specifically, the three
steps taken to convert the above geometric relationship can be performed for each
point in the world coordinate system. The following transformation matrix can be
used to translate the origin of the world coordinate system to the origin of the image
plane:

$$\mathbf{T} = \begin{bmatrix} 1 & 0 & 0 & -D_x \\ 0 & 1 & 0 & -D_y \\ 0 & 0 & 1 & -D_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (2.22)$$

The pan of the x-axis by the angle θ is a rotation around the Z-axis:

$$\mathbf{R}_\theta = \begin{bmatrix} \cos\theta & \sin\theta & 0 & 0 \\ -\sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (2.23)$$

The tilt of the z-axis by the angle α is a rotation around the x-axis:

$$\mathbf{R}_\alpha = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\alpha & \sin\alpha & 0 \\ 0 & -\sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (2.24)$$
A position without tilt (α = 0°) corresponds to the z- and Z-axes being parallel.
The transformation matrices that complete the above two rotations can be concatenated into a single rotation matrix:

$$\mathbf{R} = \mathbf{R}_\alpha\mathbf{R}_\theta \qquad (2.25)$$

The complete transformation from homogeneous world coordinates to homogeneous camera coordinates is then

$$\mathbf{c}_h = \mathbf{P}\mathbf{R}\mathbf{T}\mathbf{W}_h \qquad (2.26)$$
Expanding Eq. (2.26) and converting to Cartesian coordinates gives the image-plane coordinates of a point (X, Y, Z) in the world coordinate system.
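As an illustration of how these transformations compose, the following sketch builds the translation matrix of Eq. (2.22), the pan and tilt rotations of Eqs. (2.23) and (2.24), and the perspective matrix of Eq. (2.12), and applies Eq. (2.26) to one world point. The sign conventions of the rotations, the displacement, the angles, and the focal length are assumptions of this example.

```python
import numpy as np

def translation(Dx, Dy, Dz):                          # Eq. (2.22)
    T = np.eye(4)
    T[:3, 3] = [-Dx, -Dy, -Dz]
    return T

def pan(theta):                                       # Eq. (2.23), rotation around the Z-axis
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[ c, s, 0, 0],
                     [-s, c, 0, 0],
                     [ 0, 0, 1, 0],
                     [ 0, 0, 0, 1]], dtype=float)

def tilt(alpha):                                      # Eq. (2.24), rotation around the x-axis
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[1,  0, 0, 0],
                     [0,  c, s, 0],
                     [0, -s, c, 0],
                     [0,  0, 0, 1]], dtype=float)

def perspective(lam):                                 # Eq. (2.12)
    P = np.eye(4)
    P[3, 2] = -1.0 / lam
    return P

lam = 0.05
R = tilt(np.deg2rad(30)) @ pan(np.deg2rad(20))        # the two rotations concatenated, Eq. (2.25)
T = translation(1.0, 2.0, 3.0)
Wh = np.array([2.0, 1.0, 0.5, 1.0])                   # a homogeneous world point
ch = perspective(lam) @ R @ T @ Wh                    # Eq. (2.26): ch = P R T Wh
x, y = ch[0] / ch[3], ch[1] / ch[3]                   # image-plane coordinates
print(x, y)
```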
In a more comprehensive imaging model than the general space imaging model
above, there are two factors to consider in addition to the misalignment of the world
coordinate system, the camera coordinate system, and the image plane coordinate
system (so transformations are required). First, the camera lens introduces distortion, so the imaging position on the image plane is offset from the perspective projection result calculated by the above equations. Second, the image coordinates used in the computer are indices of discrete pixels in memory, so the coordinates on the image plane need to be converted and rounded (here, continuous coordinates are
still used on the image plane). Figure 2.4 presents a schematic diagram of the
complete space imaging model when these factors are taken into account.
In this way, the complete space imaging transformation from an objective scene
to a digital image can be viewed as consisting of four steps:
1. Transformation from world coordinates (X, Y, Z) to camera 3D coordinates (x, y,
z). Considering the case of a rigid body, the transformation can be expressed as
$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \mathbf{R}\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + \mathbf{T} \qquad (2.29)$$

where R is the rotation matrix and T is the translation vector:

$$\mathbf{R} = \begin{bmatrix} r_1 & r_2 & r_3 \\ r_4 & r_5 & r_6 \\ r_7 & r_8 & r_9 \end{bmatrix} \qquad (2.30)$$

$$\mathbf{T} = [T_x \ \ T_y \ \ T_z]^T \qquad (2.31)$$
2. Transformation from the camera 3D coordinates (x, y, z) to the undistorted image plane coordinates (x′, y′) by perspective projection:

$$x' = \lambda\frac{x}{z} \qquad (2.32)$$

$$y' = \lambda\frac{y}{z} \qquad (2.33)$$
3. The transformation from the undistorted image plane coordinates (x′, y′) to the actual image plane coordinates (x*, y*), offset by the radial distortion of the lens
(see Sect. 2.3.2) are
$$x^* = x' - R_x \qquad (2.34)$$

$$y^* = y' - R_y \qquad (2.35)$$
where Rx and Ry represent the radial distortion of the lens. Most lenses have
certain radial distortion. Although it generally has little impact on human eye
observation, it still needs to be corrected in optical measurement; otherwise, large
errors will occur. Theoretically speaking, there are two main types of lens distortion,
namely, radial distortion and tangential distortion. Since the tangential distortion is
relatively small, only radial distortion is often considered in general industrial
machine vision applications:
$$R_x = kx^*r^2 \qquad (2.36)$$

$$R_y = ky^*r^2 \qquad (2.37)$$

where

$$r = \sqrt{x^{*2} + y^{*2}} \qquad (2.38)$$
In Eqs. (2.36) and (2.37), k = k1 is taken. On the one hand, the reason for this
approximate simplification is that, in practice, the higher-order term of R can be
ignored, so k2 can be ignored and only k1 considered. On the other hand, this takes into account the fact that the radial distortion is usually symmetrical about the main optical axis of the camera lens; in this case, the radial distortion of a point in the image is proportional to the distance from the point to the optical axis of the lens [5].
4. The transformation from the actual image plane coordinates (x*, y*) to the
computer image coordinates (M, N) is

$$M = \frac{\mu x^*}{S_xL_x} + O_m \qquad (2.39)$$

$$N = \frac{y^*}{S_y} + O_n \qquad (2.40)$$

where M and N are the row and column numbers of the pixel in the computer memory (computer coordinates), respectively; Om and On are the row and column numbers of the center of the computer memory; Sx is the distance between the centers of two adjacent sensor elements along the X direction (the scanning line direction); Sy is the distance between the centers of two adjacent sensor elements along the Y direction; and Lx and μ are scale factors relating the sensor elements to the sampled pixels in a row (μ is the uncertainty image scale factor; see Sect. 2.4). Combining the above relations gives

$$\lambda\frac{x}{z} = x' = x^* + R_x = x^*(1 + kr^2) = \frac{(M - O_m)S_xL_x}{\mu}(1 + kr^2) \qquad (2.41)$$

$$\lambda\frac{y}{z} = y' = y^* + R_y = y^*(1 + kr^2) = (N - O_n)S_y(1 + kr^2) \qquad (2.42)$$
Substituting Eqs. (2.30) and (2.31) into the above two equations finally gives the complete transformation from the world coordinates (X, Y, Z) to the computer image coordinates (M, N).
The camera model represents the relationship between the coordinates of the scene
in the world coordinate system and its coordinates in the image coordinate system,
that is, the projection relationship between the object point (space point) and the
image point is given. Camera models can be mainly divided into two types: linear
models and nonlinear models.
The linear camera model is also called the pinhole model. In this model, the image of any 3D space point on the image coordinate system is considered to be formed according to the principle of pinhole imaging.
For linear camera models, the distortion caused by nonideal lenses does not have
to be considered, but the coordinates on the image plane are rounded. In this way,
Fig. 2.5 can be used to illustrate the transformation from the 3D world coordinate
system through the camera coordinate system and the image plane coordinate system
to the computer image coordinate system. Here, the three transformations (steps) are
represented by T1, T2, and T3, respectively.
The calibration parameters involved in camera calibration can be divided into two
categories: external parameters (outside the camera) and internal parameters (inside
the camera).
1. External parameters
Fig. 2.5 Schematic diagram of conversion from 3D world coordinate system to computer image
coordinate system under the linear camera model
The first transformation (T1) in Fig. 2.5 is to transform from the 3D world
coordinate system to the 3D camera coordinate system whose center is located at
the optical center of the camera. The transformation parameters are called exter
nal parameters, also known as camera attitude parameters. The rotation
matrix R has a total of nine elements but actually only has three degrees of
freedom, which can be represented by the three Euler angles of the rigid body
rotation. The schematic diagram of the Euler angles is shown in Fig. 2.6 (here the line of sight is opposite to the X-axis), where the intersection line AB of the XY plane and the xy plane is called the nodal line. The angle θ between AB and the x-axis is the first Euler angle, called the yaw angle (also called the deflection angle), which is the angle of rotation around the z-axis; the angle ψ between AB and the X-axis is the second Euler angle, called the tilt angle, which is the angle of rotation around the Z-axis; the angle φ between the Z-axis and the z-axis is the third Euler angle, called the pitch angle, which is the angle of rotation around the pitch line.
Using the Euler angles, the rotation matrix can be expressed as a function of θ, φ, and ψ, as given in Eq. (2.45).
It can be seen that the rotation matrix has three degrees of freedom. In addition,
the translation matrix also has three degrees of freedom (translation coefficients in
three directions). Thus, the camera has six independent external parameters, namely,
the three Euler angles θ, φ, and ψ in R and the three elements Tx, Ty, and Tz in T.
2. Internal parameters
The second transformation (T2) and the third transformation (T3) in Fig. 2.5
are the transformations from the 3D camera coordinate system through the image plane coordinate system to the computer image coordinate system. The transformation parameters involved are called internal parameters.
According to the general space imaging model discussed in Sect. 2.2.3, if a series of
transformations PRTWh are performed on the homogeneous coordinates Wh of the
space points, the world coordinate system and the camera coordinate system can be
coincident. Here, P is the imaging projection transformation matrix, R is the camera
rotation matrix, and T is the camera translation matrix. Let A = PRT; the elements in
A include camera translation, rotation, and projection parameters; then there is a
homogeneous expression of image coordinates: Ch = AWh. If k = 1 in the homo
geneous expression, we get
$$\begin{bmatrix} C_{h1} \\ C_{h2} \\ C_{h3} \\ C_{h4} \end{bmatrix} = \mathbf{A}\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (2.46)$$

where the image-plane coordinates are

$$x = C_{h1}/C_{h4} \qquad (2.47)$$

$$y = C_{h2}/C_{h4} \qquad (2.48)$$
Substitute Eqs. (2.47) and (2.48) into (2.46) and expand the matrix product to get
$$(a_{11} - a_{41}x)X + (a_{12} - a_{42}x)Y + (a_{13} - a_{43}x)Z + (a_{14} - a_{44}x) = 0 \qquad (2.52)$$

$$(a_{21} - a_{41}y)X + (a_{22} - a_{42}y)Y + (a_{23} - a_{43}y)Z + (a_{24} - a_{44}y) = 0 \qquad (2.53)$$
It can be seen that a calibration procedure should include the following steps: (1) obtain M > 6 space points with known world coordinates (Xi, Yi, Zi), i = 1, 2, ..., M (in practical applications, more than 25 points are often taken, and least squares fitting is then used to reduce the error); (2) photograph these points with the camera at a given position to obtain their corresponding image plane coordinates (xi, yi), i = 1, 2, ..., M; (3) substitute these coordinates into Eqs. (2.52) and (2.53) to solve for the unknown coefficients.
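A numerical sketch of this procedure on synthetic data: the equations (2.52) and (2.53) are stacked for M points and the coefficients are solved in the least-squares sense with an SVD, which is one common way to implement the least squares fitting mentioned above. The ground-truth matrix and the points are generated only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ground truth: rows 1, 2, and 4 of the 4x4 matrix A (row 3 does not
# appear in Eqs. (2.52) and (2.53)). Values are arbitrary for the demonstration.
A_true = rng.normal(size=(3, 4))

# M space points with known world coordinates and their image coordinates
M = 8
W = np.hstack([rng.uniform(-1, 1, size=(M, 3)), np.ones((M, 1))])
proj = W @ A_true.T                                        # (Ch1, Ch2, Ch4) per point
x, y = proj[:, 0] / proj[:, 2], proj[:, 1] / proj[:, 2]    # Eqs. (2.47), (2.48)

# Stack Eqs. (2.52) and (2.53) for every point into a homogeneous system G a = 0
rows = []
for i in range(M):
    Xi, Yi, Zi, _ = W[i]
    rows.append([Xi, Yi, Zi, 1, 0, 0, 0, 0, -x[i]*Xi, -x[i]*Yi, -x[i]*Zi, -x[i]])
    rows.append([0, 0, 0, 0, Xi, Yi, Zi, 1, -y[i]*Xi, -y[i]*Yi, -y[i]*Zi, -y[i]])
G = np.array(rows)

# Least-squares solution up to scale: right singular vector of the smallest singular value
a = np.linalg.svd(G)[2][-1].reshape(3, 4)
print(a / a[0, 0] - A_true / A_true[0, 0])   # ~0: coefficients recovered up to scale
```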
In order to realize the above calibration procedure, it is necessary to obtain spatial
points and image points with corresponding relationship. To precisely determine
these points, a calibrator (also called a calibration target, i.e., a standard reference) is
used, which has a fixed pattern of marked points (reference points) on it. The most
commonly used 2D calibration targets have a series of regularly arranged square
patterns (similar to a chessboard), and the vertices of these squares (crosshair
intersections) can be used as reference points for calibration. If the coplanar refer
ence point calibration algorithm is used, the calibration target corresponds to one
plane; if the noncoplanar reference point calibration algorithm is used, the calibra
tion target generally corresponds to two orthogonal planes.
In practical situations, the camera usually forms images through a lens (often consisting of multiple lens elements). Given current lens processing and camera
manufacturing technology, the projection relationship of the camera cannot be
simply described as a pinhole model. In other words, due to the influence of various
factors, such as lens processing and installation, the projection relationship of the
camera will not be a linear projection relationship, that is, the linear model cannot
accurately describe the imaging geometric relationship of the camera.
Real optical systems do not work exactly according to the idealized pinhole
imaging principle but have lens distortion. Due to various distortion factors, there
is a deviation between the real position of the 3D space point projected onto the 2D
image plane and the ideal image point position without distortion. Optical distortion
errors are more pronounced near the edge of the lens. Especially when using wide-
angle lenses, there is often a lot of distortion in the image plane far from the center. In
this way, there will be deviations in the measured coordinates, and the accuracy of
the obtained world coordinates will be reduced. Therefore, a nonlinear camera
model that takes into account the distortion must be used for camera calibration.
Due to the influence of various distortion factors, when the 3D space point is
projected onto the 2D image plane, there is a deviation between the actual coordi
nates (xa, ya) and the ideal coordinates (xi, yi) without distortion, which can be
expressed as
$$x_a = x_i + d_x \qquad (2.54)$$

$$y_a = y_i + d_y \qquad (2.55)$$
where dx and dy are the total nonlinear distortion deviation values in the x and
y directions, respectively. There are two basic types of common distortion: radial
distortion and tangential distortion. They can be seen in Fig. 2.7, where dr
represents the deviation caused by radial distortion and dt represents the deviation
caused by tangential distortion. Most other distortions are the combination of these
two basic distortions, and the most typical combined distortions are eccentric
distortion (centrifugal distortion) and thin prism distortion.
1. Radial distortion
Radial distortion is mainly caused by irregularities in the lens shape (surface
curvature errors); the resulting aberrations are generally symmetrical about the
main optical axis of the camera lens and are more pronounced along the lens
radius away from the optical axis. Generally, positive radial distortion is called
pincushion distortion, and negative radial distortion is called barrel distortion,
as shown in Fig. 2.8 (where the square represents the original shape, pincushion
distortion causes four right angles to become acute angles, and barrel distortion
causes four right angles to become rounded). Their mathematical models are both
$$d_{xr} = x_i(k_1r^2 + k_2r^4 + \cdots), \qquad d_{yr} = y_i(k_1r^2 + k_2r^4 + \cdots)$$

Fig. 2.7 Schematic diagram of radial distortion and tangential distortion

Fig. 2.8 Illustration of pincushion distortion and barrel distortion

Fig. 2.9 Schematic diagram of tangential distortion
Here, r = (xi² + yi²)^(1/2) is the distance from the image point to the center of the image, and k1, k2, etc. are the radial distortion coefficients.
2. Tangential distortion
The tangential distortion is mainly caused by the noncollinearity of the
optical centers of the lens groups, which causes the actual image point to move
tangentially on the image plane. Tangential distortion has a certain orientation in
space, so there are a maximum axis of distortion in a certain direction and a
minimum axis of distortion in a direction perpendicular to this direction. The
schematic diagram is shown in Fig. 2.9, where the solid line represents the case
without distortion and the dashed line represents the result caused by tangential
distortion. Generally, the influence of tangential distortion is relatively small, and it is seldom modeled separately.
3. Eccentric distortion
The eccentric distortion is caused by the discrepancy between the optical
center and the geometric center of the optical system, that is, the optical center of
the lens device is not strictly collinear. Its mathematical model is
$$d_{xe} = l_1(r^2 + 2x_i^2) + 2l_2x_iy_i, \qquad d_{ye} = 2l_1x_iy_i + l_2(r^2 + 2y_i^2)$$
where r = (xi² + yi²)^(1/2) is the distance from the image point to the image center and l1, l2, etc. are the eccentric distortion coefficients.
4. Thin prism distortion
The thin prism distortion is caused by the improper design and assembly of
lens. This kind of distortion is equivalent to adding a thin prism to the optical
system, which will not only cause radial deviation but also cause tangential
deviation. Its mathematical model is

$$d_{xp} = m_1(x_i^2 + y_i^2) = m_1r^2, \qquad d_{yp} = m_2(x_i^2 + y_i^2) = m_2r^2$$
If the terms higher than the third order are ignored and n1 = l1 + m1, n2 = l2 + m2, n3 = 2l1, and n4 = 2l2 are defined, the combined distortion can be written compactly in terms of n1, n2, n3, and n4 (Eqs. (2.64) and (2.65)).
In practical applications, the radial distortion of the camera lens often has the greatest
impact. If other distortions are ignored, the transformation from the undistorted
image plane coordinates (x′, y′) to the actual image plane coordinates (x*, y*) offset
by the radial distortion of the lens is given by Eqs. (2.34) and (2.35).
Considering the transformation from (x′, y′) to (x*, y*), the transformation from
3D world coordinate system to computer image coordinate system realized
according to the nonlinear camera model is shown in Fig. 2.10. The original
transformation T3 is now decomposed into two transformations (T31 and T32), and
Eqs. (2.39) and (2.40) can still be used to define T32 (only x* and y* are required to
replace x′ and y′).
Fig. 2.10 Schematic diagram of transformation from 3D world coordinate system to computer
image coordinate system under nonlinear camera model
Although only radial distortion is considered when Eqs. (2.34) and (2.35) are
used here, the forms of Eqs. (2.62) and (2.63) or Eqs. (2.64) and (2.65) are actually
applicable to a variety of distortions. In this sense, the conversion process in
Fig. 2.10 is applicable to the case of various distortions, as long as the corresponding
T31 is selected according to the type of distortion. Comparing Fig. 2.10 with Fig. 2.5, the "nonlinearity" is reflected in the conversion from (x′, y′) to (x*, y*).
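A small sketch of this nonlinear step, relating the undistorted coordinates (x′, y′) and the actual coordinates (x*, y*) through the first-order radial distortion consistent with Eqs. (2.34)-(2.38); the distortion coefficient is a sample value, and the fixed-point iteration used for the inverse mapping is an implementation choice, not part of the text.

```python
import numpy as np

k = -0.15          # first-order radial distortion coefficient (sample value; negative -> barrel)

def undistort(x_star, y_star):
    """Actual (distorted) image-plane coordinates -> undistorted coordinates,
    using x' = x*(1 + k r^2), y' = y*(1 + k r^2) with r^2 = x*^2 + y*^2."""
    r2 = x_star**2 + y_star**2
    return x_star * (1 + k * r2), y_star * (1 + k * r2)

def distort(x_p, y_p, iters=20):
    """Inverse mapping (x', y') -> (x*, y*); there is no closed form, so a
    simple fixed-point iteration is used here."""
    x_star, y_star = x_p, y_p
    for _ in range(iters):
        r2 = x_star**2 + y_star**2
        x_star, y_star = x_p / (1 + k * r2), y_p / (1 + k * r2)
    return x_star, y_star

xp, yp = undistort(0.8, 0.6)          # the deviation grows with the distance from the center
print(xp, yp)
print(distort(xp, yp))                # ~ (0.8, 0.6)
```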
Many camera calibration methods have been proposed. The classification of cali
bration methods is discussed first, and then several typical methods are introduced.
For the camera calibration methods, there are different classification ways according
to different criteria. For example, according to the characteristics of the camera
model, it can be divided into linear methods and nonlinear methods; according to
whether or not to require calibration targets, it can be divided into traditional camera
calibration methods, camera self-calibration methods, and active vision-based cali
bration methods (some people also group the latter two into one class). When
using the calibration target, according to the dimension of the calibration target, it
can also be divided into the method of using 2D plane target and the method of using
3D volumetric target; according to the results of solving parameters, it can be divided
into explicit methods and implicit methods; according to whether the internal
parameters of the camera can be changed, it can be divided into methods with
variable internal parameters and methods with immutable internal parameters;
according to the movement mode of the camera, it can be divided into methods
that limit the movement mode and methods that do not limit the movement mode;
according to the number of cameras used in the vision system, it can be divided into
single-camera calibration method and multi-camera calibration method. In addition,
when the spectra are different, the calibration method (and calibration target) often
needs to be adjusted, such as [6]. Table 2.1 gives a classification table of calibration
methods, which lists some classification criteria, categories, and typical methods.
In Table 2.1, the nonlinear methods are generally more complex and slower and
require a good initial value. In addition, the nonlinear search cannot guarantee that
the parameters converge to the global optimal solution. The implicit method takes
the elements of the transformation matrix as calibration parameters and uses a
transformation matrix to represent the correspondence between 3D space points
and 2D plane image points. Because the parameters themselves do not have clear
physical meanings, they are also called implicit parameter methods. Since the
implicit parameter method only needs to solve the linear equation, this method can
obtain higher efficiency when the accuracy requirement is not very high. The direct linear transformation (DLT) method takes the linear model as its object and uses a 3 x 4 matrix to
represent the correspondence between 3D space points and 2D plane image points,
ignoring the intermediate imaging process (or, comprehensively considering the
factors in the process). The most common multi-camera calibration method is the
dual-camera calibration method. Compared with the single-camera calibration, the
dual-camera calibration not only needs to know the internal and external parameters
of each camera itself but also needs to measure the relative relationship (location and
orientation) between the two cameras through calibration.
The traditional camera calibration procedure requires the use of a known calibration target (a 2D calibration plate or a 3D calibration block with known data); that is, the size and shape of the calibration target (the position and distribution of the calibration points) must be known, and the internal and external parameters of the camera are then determined by establishing the correspondence between points on the calibration target and the corresponding points in the captured image. The advan
tage is that the theory is clear, the solution is simple, and the calibration accuracy is
high. The disadvantage is that the calibration process is relatively complicated and the accuracy requirements for the calibration target itself are relatively high.
Referring to the complete space imaging model introduced in Sect. 2.2.4 and the
nonlinear camera model introduced in Sect. 2.3.2, the calibration of the camera can
be performed along the transformation direction from 3D world coordinates to
computer image coordinates. As shown in Fig. 2.11, there are four steps in the
conversion from the world coordinate system to the computer image coordinate
system, and each step has parameters to be calibrated.
Step 1: The parameters to be calibrated are the rotation matrix R and the
translation matrix T.
Step 2: The parameter to be calibrated is the lens focal length λ.
Step 3: The parameters to be calibrated are the radial distortion coefficient k of the
lens, the eccentric distortion coefficient l, and the thin prism distortion coefficient m.
Step 4: The parameter to be calibrated is the uncertainty image scale factor μ.
Let si have the following relationship with the rotation parameters r1, r2, r4, and r5 and
the translation parameters Tx, Ty:
Set the vector u = [x1 x2 ... xM]T; then we can first solve s with the following
linear equations:
$$\mathbf{A}\mathbf{s} = \mathbf{u} \qquad (2.68)$$
Then, the various rotation and translation parameters can be calculated according
to the following steps:
1. Let S = s1² + s2² + s3² + s4², and calculate

$$T_y^2 = \begin{cases} \dfrac{S - \sqrt{S^2 - 4(s_1s_4 - s_2s_3)^2}}{2(s_1s_4 - s_2s_3)^2} & s_1s_4 - s_2s_3 \neq 0 \\[2ex] \dfrac{1}{s_1^2 + s_2^2} & s_1^2 + s_2^2 \neq 0 \\[2ex] \dfrac{1}{s_3^2 + s_4^2} & s_3^2 + s_4^2 \neq 0 \end{cases} \qquad (2.69)$$
2. Set Ty = (Ty²)^(1/2), that is, take the positive square root, and then calculate r1, r2, r4, r5, and Tx from the components of s and this Ty.
3. Select a point whose world coordinates are (X, Y, Z) and whose image plane coordinates (x, y) are far from the center of the image, and use it to check (and if necessary correct) the sign of Ty.
4. Calculate the remaining elements of the rotation matrix R from the orthonormality of its rows:

$$r_3 = (1 - r_1^2 - r_2^2)^{1/2}, \quad r_6 = (1 - r_4^2 - r_5^2)^{1/2}, \quad r_7 = r_2r_6 - r_3r_5, \quad r_8 = r_3r_4 - r_1r_6, \quad r_9 = r_1r_5 - r_2r_4 \qquad (2.73)$$
Note that if the sign of r1r4 + r2r5 is positive, then r6 should be negative, and
the signs of r7 and r8 should be adjusted after calculating the focal length λ.
5. Establish another set of linear equations to calculate the focal length λ and the translation parameter Tz in the z direction. A matrix B can be constructed first, with rows bi built from the datum points, and the unknown vector t = [λ  Tz]^T is then solved from

$$\mathbf{B}\mathbf{t} = \mathbf{v} \qquad (2.76)$$
6. If λ < 0, then in order to use a right-handed coordinate system, r3, r6, r7, r8, λ, and Tz must be negated.
7. Calculate the radial distortion coefficient k of the lens using the estimates obtained above, and adjust the values of λ and Tz. Using the perspective projection equation including distortion, the following nonlinear equations can be obtained:
The values of k, λ, and Tz can be obtained by solving the above equations with a nonlinear regression method.
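As a rough illustration of the closed-form part of this two-stage procedure, the following Python sketch recovers Ty and the rotation matrix R from the vector s obtained from Eq. (2.68). The mapping s = [r1, r2, r4, r5]/Ty, the sign handling, and the function name recover_pose_from_s are illustrative assumptions, since the defining relation for s is not reproduced above.

import numpy as np

def recover_pose_from_s(s):
    """Sketch of the closed-form part of the two-stage calibration.

    Assumes the parameterization s = [r1, r2, r4, r5]/Ty (the Tx/Ty term is
    omitted here); this mapping is an assumption, since the defining
    equation is not reproduced in the text.
    """
    s1, s2, s3, s4 = s[:4]
    S = s1**2 + s2**2 + s3**2 + s4**2
    C = s1 * s4 - s2 * s3
    if abs(C) > 1e-12:                      # Eq. (2.69), first case
        Ty2 = (S - np.sqrt(S**2 - 4.0 * C**2)) / (2.0 * C**2)
    elif s1**2 + s2**2 > 0:                 # degenerate cases
        Ty2 = 1.0 / (s1**2 + s2**2)
    else:
        Ty2 = 1.0 / (s3**2 + s4**2)
    Ty = np.sqrt(Ty2)                       # positive root first (step 2)

    r1, r2, r4, r5 = s1 * Ty, s2 * Ty, s3 * Ty, s4 * Ty
    r3 = np.sqrt(max(0.0, 1.0 - r1**2 - r2**2))
    r6 = np.sqrt(max(0.0, 1.0 - r4**2 - r5**2))
    if r1 * r4 + r2 * r5 > 0:               # sign rule quoted after Eq. (2.73)
        r6 = -r6
    row1 = np.array([r1, r2, r3])
    row2 = np.array([r4, r5, r6])
    row3 = np.cross(row1, row2)             # completes the rotation matrix
    R = np.vstack([row1, row2, row3])
    return R, Ty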
The above two-level calibration method only considers the radial distortion of the
camera lens. If the tangential distortion of the lens is further considered on this basis,
it is possible to further improve the accuracy of camera calibration.
Referring to Eqs. (2.62) and (2.63), the total distortion deviations dx and dy
considering radial distortion and tangential distortion are
Considering the fourth-order term for radial distortion and the second-order term
for tangential distortion, we have
The calibration of the camera can be divided into the following two steps.
1. Set the initial values of the lens distortion coefficients k1, k2, l1, and l2 to 0, and calculate the values of R, T, and λ.
Referring to Eqs. (2.32) and (2.33), and referring to the derivation of
Eq. (2.77), we can get
   x = λ (r1X + r2Y + r3Z + Tx) / (r7X + r8Y + r9Z + Tz),   (2.82)
   y = λ (r4X + r5Y + r6Z + Ty) / (r7X + r8Y + r9Z + Tz).   (2.83)
Equation (2.84) holds for all datum points; that is, an equation can be established from the 3D world coordinates and 2D image coordinates of each datum point. There are eight unknowns in Eq. (2.84), so if eight datum points can be determined, an equation system with eight equations can be constructed, and the values of r1, r2, r3, r4, r5, r6, Tx, and Ty can be calculated. Because R is an orthogonal matrix, the values of r7, r8, and r9 can be calculated according to its orthogonality. Substituting the calculated values into Eqs. (2.82) and (2.83) and taking the 3D world coordinates and 2D image coordinates of any two datum points, the values of Tz and λ can then be calculated.
2. Calculate the values of the lens distortion coefficients k1, k2, l1, and l2.
According to Eqs. (2.54) and (2.55), as well as Eqs. (2.78)-(2.81), the following can be obtained:
With the help of the R and T already obtained, (X, Y, Z) can be calculated by using Eq. (2.84) and then substituted into Eqs. (2.85) and (2.86) to get
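The explicit expressions of Eqs. (2.85) and (2.86) are not reproduced above, but the spirit of this second step can be sketched as a small nonlinear least-squares problem. The sketch below assumes the common radial-plus-tangential distortion model with coefficients k1, k2, l1, l2; the function name and the data layout are hypothetical.

import numpy as np
from scipy.optimize import least_squares

def distortion_residuals(p, xy_ideal, xy_observed):
    """Residuals for fitting k1, k2, l1, l2 (a sketch; the exact form of
    Eqs. (2.78)-(2.81) is not reproduced here, so the common radial plus
    tangential model is assumed)."""
    k1, k2, l1, l2 = p
    x, y = xy_ideal[:, 0], xy_ideal[:, 1]
    r2 = x**2 + y**2
    radial = 1.0 + k1 * r2 + k2 * r2**2
    dx = 2.0 * l1 * x * y + l2 * (r2 + 2.0 * x**2)     # tangential terms
    dy = l1 * (r2 + 2.0 * y**2) + 2.0 * l2 * x * y
    x_d = x * radial + dx
    y_d = y * radial + dy
    return np.concatenate([x_d - xy_observed[:, 0],
                           y_d - xy_observed[:, 1]])

# xy_ideal: distortion-free projections computed from the R, T, and focal
# length obtained in step 1; xy_observed: measured image coordinates.
# res = least_squares(distortion_residuals, x0=np.zeros(4),
#                     args=(xy_ideal, xy_observed))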
The camera self-calibration method was proposed in the early 1990s. Camera self-calibration can compute the camera model parameters online and in real time from geometric constraints obtained from image sequences, without resorting to high-precision calibration targets, which is especially suitable for cameras that often need to move. Since all the self-calibration methods are only related to the parameters of the camera and have nothing to do with the external environment and the motion of the camera, the self-calibration method is more flexible than the traditional calibration method. However, the existing self-calibration methods are not very accurate and robust.
The idea of the basic self-calibration method is to first establish the constraint equations on the camera's internal parameters through the absolute conic; these constraints are called the Kruppa equations. Then, the Kruppa equations are solved to determine the matrix C (C = K^(-T)K^(-1), where K is the internal parameter matrix). Finally, the matrix K is obtained from C by Cholesky decomposition.
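Assuming the relation C = K^(-T)K^(-1) (the image of the absolute conic), the last step can be sketched as follows; the function name intrinsics_from_iac is illustrative, and the sign and scale conventions of actual Kruppa-based solvers may differ.

import numpy as np

def intrinsics_from_iac(C):
    """Recover the internal parameter matrix K from C = K^{-T} K^{-1}.

    A minimal sketch: C must be symmetric positive definite; the Cholesky
    factor of C gives K^{-1} up to transposition and scale.
    """
    L = np.linalg.cholesky(C)          # C = L L^T, L lower triangular
    K = np.linalg.inv(L.T)             # since C = (K^{-1})^T (K^{-1})
    return K / K[2, 2]                 # normalize so that K[2, 2] = 1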
The self-calibration method can be realized with the help of active vision tech
nology. However, some researchers have put forward the calibration method based
on active vision technology as a separate category. Active vision system means that
the system can control the camera to obtain multiple images in motion and then use
the camera’s motion trajectory and the corresponding relationship between the
obtained images to calibrate the camera. The method of active vision-based calibration is generally used when the motion parameters of the camera in the world coordinate system are known; it can usually be solved linearly, and the obtained results have high robustness.
In practical applications, the method based on active vision calibration generally
installs the camera accurately on the controllable platform, and actively controls the
platform to perform special movements to obtain multiple images, and then uses the
correspondence between the images and the camera motion parameters to determine
camera parameters. However, this method cannot be used if the camera motion
parameters are unknown or in situations where the camera motion cannot be
controlled. In addition, the motion platform required by this method must have high precision, which makes it costly.
A typical self-calibration method can be introduced with reference to Fig. 2.12
[8]. The optical center of the camera is translated from O1 to O2, and the two images
formed are I1 and I2, respectively (the coordinate origins are o 1 and o2, respectively).
A point P in space is imaged as point p 1 on I1 and is imaged as point p2 on I2. Here, p 1
and p2 constitute a pair of corresponding points. If a point p2‘ is marked on I1
according to the coordinate value of point p2 on I2, then the connection between
p2‘ and p1 is called the connection of the corresponding point on I1. It can be proven
that when the camera performs pure translational motion, the lines connecting the corresponding points of all spatial points on I1 intersect at the same point e, and the direction of O1e is the movement direction of the camera (here e is on the line connecting O1 and O2, and O1O2 is the translational motion trajectory).
According to the analysis of Fig. 2.12, by determining the intersection of the lines
connecting the corresponding points, the camera translation direction in the camera
coordinate system can be obtained. In this way, by controlling the camera to perform
translational motions in three directions, respectively, during calibration, and using
the corresponding point connection line to calculate the corresponding intersection ei
(i = 1, 2, 3) before and after each motion, the three translational motion directions
can be obtained.
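A minimal sketch of the intersection computation is given below: each pair of corresponding points defines a line on I1, and the common intersection e is found by a homogeneous least-squares fit. The function name and the use of SVD are illustrative choices, not the book's prescribed procedure.

import numpy as np

def estimate_intersection(p1_list, p2_list):
    """Least-squares intersection of the lines joining corresponding points.

    p1_list, p2_list: (N, 2) arrays of points p1 and p2' on image I1.
    Each pair defines a line a*x + b*y + c = 0; their common intersection e
    is found by solving the stacked homogeneous system with SVD (a sketch).
    """
    p1 = np.asarray(p1_list, dtype=float)
    p2 = np.asarray(p2_list, dtype=float)
    h1 = np.hstack([p1, np.ones((len(p1), 1))])   # homogeneous coordinates
    h2 = np.hstack([p2, np.ones((len(p2), 1))])
    lines = np.cross(h1, h2)                      # line through each pair
    _, _, vt = np.linalg.svd(lines)               # lines @ e ~= 0
    e = vt[-1]
    return e[:2] / e[2]                           # inhomogeneous coordinates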
Referring to Eqs. (2.39) and (2.40), considering the ideal case where the uncertainty image scale factor p is 1 and taking each sensor element in the x direction to sample one pixel per row, Eqs. (2.39) and (2.40) can be written as

   M = x'/Sx + Om,   (2.89)
   N = y'/Sy + On.   (2.90)
Equations (2.89) and (2.90) establish the conversion relationship between the image plane coordinate system x'y', expressed in physical units (such as mm), and the computer image coordinate system MN, expressed in pixels. Denote the coordinates on I1 of the intersection points ei (i = 1, 2, 3) in Fig. 2.12 by (xi, yi). From Eqs. (2.89) and (2.90), the coordinates of ei in the camera coordinate system can be obtained.
If the camera is translated three times along mutually orthogonal directions, then ei^T ej = 0 (i ≠ j), and it follows that
   Q1 = λ/Sy,   (2.98)
   Q2 = λ/Sx.   (2.99)
Then, Eqs. (2.95), (2.96), and (2.97) become three equations including four
unknown quantities of Om, On, Q1, and Q2. These equations are nonlinear, and if
Eqs. (2.96) and (2.97) are subtracted from Eq. (2.95), respectively, two linear
equations are obtained:
Represent On·Q1 in Eqs. (2.100) and (2.101) with an intermediate variable Q3:

   Q3 = On Q1.   (2.102)
Then Eqs. (2.100) and (2.101) become two linear equations about three
unknowns including Om, Q1, and Q3. Since the two equations have three unknowns,
the solutions of Eqs. (2.100) and (2.101) are generally not unique. In order to obtain
a unique solution, the camera can be moved three times in other orthogonal directions to obtain three more intersection points ei (i = 4, 5, 6). If these three translational movements have directions different from those of the previous three translational movements, two more equations similar to Eqs. (2.100) and (2.101) can be obtained. In this way, a total of four equations are obtained, and any three of them can be taken, or the least squares method can be used, to solve Om, Q1, and Q3 from the four
equations. Next, solve On from Eq. (2.102), and then substitute Om, On, and Q1
into Eq. (2.97) to solve for Q2. In this way, all the internal parameters of the camera
can be obtained by controlling the camera to perform two sets of three-orthogonal
translational motions.
The structured light active vision system can be regarded as mainly composed of a
camera and a projector, and the accuracy of the 3D reconstruction of the system is
mainly determined by their calibration. There are many methods for camera calibration, which are often realized by means of calibration targets and feature points. The
projector is generally regarded as a camera with a reverse light path. The biggest
difficulty in projector calibration is to obtain the world coordinates of the feature
points. A common solution is to project the projection pattern onto the calibration
target used to calibrate the camera and obtain the world coordinates of the projection
point according to the known feature points on the calibration target and the
calibrated camera parameter matrix. This method requires the camera to be calibrated in advance, so the camera calibration error will be superimposed onto the projector calibration, resulting in an increase in the projector calibration error.
Another method is to project the encoded structured light onto a calibration target
containing several feature points and then use the phase technique to obtain the
coordinate points of the feature points on the projection plane. This method does not
need to calibrate the camera in advance but needs to project the sinusoidal grating
many times, and the total number of collected images will be relatively large.
The following introduces a calibration method for active vision system based on
color concentric circle array [9]. The projector projects a color concentric circle
pattern to the calibration plate drawn with the concentric circle array and separates
the projected concentric circle and the calibration plate concentric circle from the
acquired image through color channel filtering. Using the geometric constraints
satisfied by the concentric circle projection, the pixel coordinates of the center of
the circle on the image are calculated, and the homography relationship between the
calibration plane, the projector projection plane, and the camera imaging plane is
established, and then the system calibration is realized. This method only needs to
collect at least three images to achieve calibration.
The projection process of the projector and the imaging process of the camera have
the same principle but opposite directions, and the reverse pinhole camera model can
be used as the mathematical model of the projector.
Similar to the camera imaging model, the projection model of the projector is also
designed as a conversion between three coordinate systems (the world coordinate
system, the projector coordinate system, and the projection plane coordinate system,
respectively), and the coordinate system in the computer is not considered first. The
world coordinate system is still represented by XYZ. The projector coordinate
system is a coordinate system xyz centered on the projector and generally takes
the optical axis of the projector as the z-axis. The projection plane coordinate
system is the coordinate system x’y’ on the imaging plane of the projector.
For simplicity, the corresponding axes of the world coordinate system XYZ and
the projector coordinate system xyz can be taken to coincide (and the projector
optical center is located at the origin). Then, the xy plane of the projector coordinate
system and the imaging plane of the projector can be taken to coincide, so that the
origin of projection plane is on the optical axis of the projector, and the z-axis of the
projector coordinate system is perpendicular to the projection plane and points
toward the projection plane, as shown in Fig. 2.13. Among them, the space point (X, Y, Z) is projected through the optical center of the projector to the projection point (x, y) on the projection plane, and the line connecting them is a spatial projection ray.
The coordinate system and transformation ideas in the calibration are as follows.
First, use the projector to project the calibration pattern to the calibration plate [with
the world coordinate system W = (X, Y, Z)] and then use the camera [with the camera coordinate system c = (x, y, z)] to collect the projected calibration plate image and separate the calibration pattern on the calibration plate from the projected pattern. By acquiring and matching the feature points on these patterns, the direct linear transformation (DLT) algorithm [10] can be used to calculate the homography matrix Hwc between the calibration plate and the camera imaging plane, as well as the homography matrix Hcp, induced by the calibration plate plane, between the camera imaging plane and the projector [with the projector coordinate system p = (x', y')] projection plane. They are both 3×3 non-singular matrices representing a 2D projective transformation between two planes.
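As an illustration of the DLT step, the following sketch estimates a homography from matched points; it omits the coordinate normalization usually recommended in practice, and the function name dlt_homography is an assumption for this example.

import numpy as np

def dlt_homography(src, dst):
    """Estimate the 3x3 homography H with dst ~ H @ src by the DLT algorithm.

    src, dst: (N, 2) arrays of matched points (N >= 4), e.g., concentric
    circle centers on the calibration plate and their image coordinates.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(A))       # null vector of the system
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]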
After obtaining Hwc and Hcp, the pixel coordinates I’c and J’c on the camera
imaging plane and the pixel coordinates I’p and J’p on the projector’s projection
plane, of the virtual circle points I = [1, i, 0]T and J = [1, -i, 0]T, can be obtained as
follows:
By changing the position and direction of the calibration plate to obtain the pixel
coordinates of at least three sets of virtual circle points in different planes on the
camera and the projector, the absolute conic curve images Sc and Sp in the camera
imaging plane and projector projection plane can be fitted. Then, by performing
Cholesky decomposition on Sc and Sp, the internal parameter matrices Kc and Kp of
the camera and projector can be obtained, respectively. Finally, using Kc and Kp, as
well as Hwc and Hcp, the external parameter matrices of the camera and projector can
be obtained.
Fig. 2.14 Extraction of projected pattern from overlapping calibration plate pattern and projected pattern

When a projector is used to project a new pattern onto a calibration plate on which a pattern has already been drawn, and a camera is then used to capture the projected calibration plate
image, the two patterns in the captured image are overlapping and need to be
separated. For this purpose, it is possible to consider using two patterns of different
colors, with the aid of color filtering to separate the two patterns.
Specifically, a calibration plate with a magenta concentric circle array (7 × 9 concentric circles) patterned on a white background can be used, and a cyan concentric circle array (also 7 × 9 concentric circles) patterned on a yellow background is projected onto the calibration plate by a projector. When the patterns are
projected onto the calibration plate Ib with a projector, the calibration plate pattern
and the projected pattern overlap, as shown in Fig. 2.14a, where only a pair of each
of the two circular patterns is drawn as an example. The area where the two patterns
overlap will change color, where the intersection of the magenta circle and the
yellow background turns into red, the intersection of the magenta circle and the
cyan circle turns into blue, and the intersection of the white background of the
calibration board and the projected pattern turns into the color of the projected
pattern. First convert it to the camera image Ic with the help of the homography
matrix Hwc (as shown in Fig. 2.14b), and then convert it to the projector image Ip
with the help of the homography matrix Hcp (as shown in Fig. 2.14c).
In the color filtering process, the image is first passed through the green, red, and
blue filter channels, respectively. After passing through the green filter channel,
since the circle pattern on the calibration plate has no green component, it will appear
black, and other areas will appear white, which can separate the calibration plate
pattern. After passing through the red filter channel, the projected circular pattern
appears black because there is no red component in it, while the yellow background
portion and the calibration plate circular pattern appear close to white. After passing
through the blue filter channel, since the yellow background area projected onto the
calibration plate and the red circle pattern on the calibration plate have no blue
component, they appear close to black, while the projected cyan circle pattern
appears close to white. Since the color difference of each pattern part is relatively
large, the overlapping patterns can be separated relatively easily. Taking the centers of the separated concentric rings as feature points and obtaining their image coordinates, the homography matrix Hwc and the homography matrix Hcp can be calculated.
In order to calculate the homography matrices relating the calibration plate plane and the projector projection plane to the camera imaging plane, it is necessary to determine the image coordinates of the centers of the concentric circles on the calibration plate and of the concentric circles projected onto the calibration plate.
Here, consider a pair of concentric circles C1 and C2 with the center O on a plane in
the space. In vector form, the polar line l of any point p on the plane with respect to the circle C1 is l = C1p, and the pole of the line l with respect to the circle C2 is q = C2^(-1)l. The point p can be on the circumference of the circle C1 (as shown in
Fig. 2.15a), outside the circumference of the circle C1 (as shown in Fig. 2.15b), or on
the inside of the circumference of the circle C1 (as shown in Fig. 2.15c). However, in
these three cases, according to the constraint relationship between the poles and the
polar lines of the conic, the line connecting the point p and the point q will pass
through the center O.
The projection transformation maps the concentric circles C1 and C2 with the
center O on the plane S to the camera imaging plane Sc, the corresponding point of
the circle center O on Sc is Oc, and the corresponding conic curves of the concentric
circles C1 and C2 on Sc are G1 and G2, respectively. If the polar line of any point pi on
the plane Sc relative to G1 is li‘, and the pole of li‘ relative to G2 is qi, then according
to the projection invariance of the collinear relationship and the polar line-pole
relationship, it can be known that the connection between pi and qi goes through
Oc. If the connection between pi and qi is recorded as mi, then we have
Fig. 2.15 Constraints between polar lines and poles of concentric circles
From its local minimum point, the optimal projection position of the circle center
can be obtained.
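A minimal sketch of this pole-polar construction is given below, assuming the conics are available as 3×3 symmetric matrices; the sampling of the points pi and the final local refinement mentioned above are not implemented here, and the function name is illustrative.

import numpy as np

def projected_center(G1, G2, sample_points):
    """Estimate the projected circle center Oc from a pair of concentric
    conics G1, G2 using the pole-polar constraint.

    For each sample point p, the line through p and its pole q = G2^{-1} G1 p
    passes through Oc; Oc is recovered as the least-squares intersection of
    these lines (a sketch)."""
    lines = []
    G2_inv = np.linalg.inv(G2)
    for p in sample_points:                  # p given as (x, y)
        ph = np.array([p[0], p[1], 1.0])
        l = G1 @ ph                          # polar line of p w.r.t. G1
        q = G2_inv @ l                       # pole of l w.r.t. G2
        lines.append(np.cross(ph, q))        # line m_i through p and q
    _, _, vt = np.linalg.svd(np.asarray(lines))
    Oc = vt[-1]
    return Oc[:2] / Oc[2]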
In order to automatically extract and match concentric circle images, Canny
operator can be used for sub-pixel edge detection to extract circle boundaries and
fit conic curves. In the large number of conic curves detected in each image, the
conic curve pairs from the same concentric circle are first found by using the rank
constraint of concentric circles [11]. Consider two conic curves G1 and G2 whose generalized eigenvalues are λ1, λ2, and λ3, respectively. If λ1 = λ2 = λ3, G1 and G2 are the same conic; if λ1 = λ2 ≠ λ3, G1 and G2 are projections of a pair of concentric circles; if λ1 ≠ λ2 ≠ λ3, G1 and G2 come from different concentric circles.
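A rough sketch of this rank-constraint test, using generalized eigenvalues computed with SciPy, is given below; the tolerance and the sorting of the eigenvalues are illustrative choices.

import numpy as np
from scipy.linalg import eigvals

def classify_conic_pair(G1, G2, tol=1e-6):
    """Classify a pair of fitted conics by their generalized eigenvalues
    (a sketch of the rank-constraint test described above)."""
    w = np.sort(np.real(eigvals(G1, G2)))    # generalized eigenvalues of (G1, G2)
    lam1, lam2, lam3 = w
    if np.allclose([lam1, lam2], lam3, atol=tol):
        return "same conic"
    if abs(lam1 - lam2) < tol or abs(lam2 - lam3) < tol:
        return "projections of a pair of concentric circles"
    return "from different concentric circles"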
After pairing the conic curves, it is also necessary to match the concentric circles
on the calibration plate with the curve pairs in the image. Here, we can use cross ratio
invariance to automatically match concentric circles. As shown in Fig. 2.16a, let the
straight line on the diameter of the concentric circle intersect with the concentric
circles at p1, p2, p3, and p4, and these intersection points are mapped to p1', p2', p3', and p4' after the projection transformation (as shown in Fig. 2.16b). According to the cross-ratio invariance, the following relationship can be obtained (where |pipj| represents the distance from point pi to point pj):

   (|p1p3| · |p2p4|) / (|p2p3| · |p1p4|) = (|p1'p3'| · |p2'p4'|) / (|p2'p3'| · |p1'p4'|)
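A small sketch of the cross-ratio computation is shown below; the grouping convention follows the relation written above and may differ from other presentations.

import numpy as np

def cross_ratio(p1, p2, p3, p4):
    """Cross ratio of four collinear points, using the ordering convention
    (|p1p3| * |p2p4|) / (|p2p3| * |p1p4|); this value is preserved under
    projective transformation (a sketch)."""
    d = lambda a, b: np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))
    return (d(p1, p3) * d(p2, p4)) / (d(p2, p3) * d(p1, p4))

# The value computed from p1..p4 on the calibration plate equals the value
# computed from their images p1'..p4', which identifies the radius ratio.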
For concentric circles with different radius ratios, the cross ratios formed by the four intersections of the diameter line with the concentric circles are different, so the radius ratio can be used to identify the
concentric circles. When designing the calibration plate pattern and projection
pattern, different radius ratios can be set according to the positions of different
concentric circles to uniquely identify different concentric circles in the pattern. In
practice, only part of the concentric circles can be set with different radius ratios.
After the corresponding homography matrix is obtained, the positions of other
concentric circles and the projection point of the center of the circle can be obtained
with the help of the homography matrix.
After the homography matrix Hwc and the homography matrix Hcp are determined,
the internal and external parameters of the camera and the projector can be calculated. First, the homography matrix Hwc between the calibration plate plane and the
camera imaging plane can be expressed as
Among them, Oi = (xi, yi, 1)T is the coordinates of the center of the concentric
circles on the calibration plate in the calibration plate coordinate system, and
Oi' = (ui, vi, 1)^T is the image coordinates of the projection of Oi. Hwc can be obtained by calculating the image coordinates of four or more concentric circle centers on the calibration plate and using the DLT algorithm.
Similar to the above process, the homography matrix Hcp between the projection
plane of the projector and the imaging plane of the camera can also be calculated.
Then, with the help of Eqs. (2.103) and (2.104), the internal parameter matrices Kc
and Kp of the camera and the projector can be calculated.
Further, compute the external parameter matrices of the camera and projector. Set the calibration plate plane to coincide with the XwYw plane of the world coordinate system; the homogeneous coordinates of a point X on it in the world coordinate system are Xw = [xw, yw, 0, 1]^T, and its image point on the camera, xc = [uc, vc, 1]^T, satisfies (where Rc and tc are the rotation matrix and translation vector of the camera relative to the world coordinate system, respectively)

   xc ≅ Kc[Rc | tc]Xw.   (2.110)
Denote the 2D point corresponding to Xw = [xw, yw, 0, 1]^T on the calibration plate plane as xw = [xw, yw, 1]^T, and use rc1 and rc2 to represent the first two columns of Rc, respectively; then Kc[Rc | tc]Xw = Kc[rc1, rc2, tc]xw, and substituting into Eq. (2.110) gives
If rc1, rc2, and tc are not coplanar, that is, the plane of the calibration plate does not
pass through the optical center of the camera, there is a homography matrix Hwc
between the plane of the calibration plate and the image plane of the camera, which
can be known from Eq. (2.111):
From the above equation, rc1, rc2, and tc can be obtained. Because Rc is a unit orthogonal matrix, its third column follows as rc3 = rc1 × rc2.
Similar to the above process, since the corresponding relationship is also satisfied
between the calibration plate plane and the projector projection plane, so the rotation
matrix Rp and translation vector tp of the projector coordinate system relative to the
world coordinate system can be obtained. In this case, the rotation matrix R and
translation vector t between the camera coordinate system and the projector coordinate system can be expressed as R = Rc^(-1)Rp and t = Rc^(-1)(tp − tc), respectively.
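A minimal sketch of recovering the external parameters from a plane-induced homography, following the relationships described above, is given below; fixing the scale by the norm of the first column and skipping an orthogonality refinement are simplifying assumptions.

import numpy as np

def extrinsics_from_homography(H, K):
    """Recover [R | t] from a plane-induced homography H ~ K [r1, r2, t]
    (a sketch; no re-orthogonalization of R is performed)."""
    A = np.linalg.inv(K) @ H
    lam = 1.0 / np.linalg.norm(A[:, 0])      # scale fixed by ||r1|| = 1
    r1 = lam * A[:, 0]
    r2 = lam * A[:, 1]
    r3 = np.cross(r1, r2)                    # third column from orthogonality
    t = lam * A[:, 2]
    R = np.column_stack([r1, r2, r3])
    return R, t

# With (Rc, tc) and (Rp, tp) obtained in this way for the camera and the
# projector, the relative pose follows as R = Rc^(-1)Rp and t = Rc^(-1)(tp - tc),
# as stated above.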
Conventional extrinsic calibration methods for onboard cameras [14] usually rely on ground signs that meet the constraints of specific points, lines, or surfaces and are mainly suitable for offline
calibration. In addition, due to maintenance and structural deformation, the external
parameters of the camera may change significantly in the life cycle of the vehicle.
How to calibrate and adjust the external parameters online is also very important.
To solve these problems, a real-time camera external parameter calibration
method based on the matching of camera and high-precision map without using
precise and expensive calibration field is proposed [15].
The basic idea of this method is as follows. Firstly, the lane line in the image is
detected by using the deep learning technology. By assuming an initial external
parameter matrix T, and according to this matrix, the lane line point Pw in the world
coordinate system W(XYZ) is projected to the camera coordinate system C(xyz) to
obtain the 3D image point Pc, which is matched with the map. Then, the projection
error L(Tcv) between Pc and the lane point Dc detected by the camera is evaluated by
reasonably designing the error function L, and the external parameter matrix Tcv is
solved by using the idea of bundle adjustment (BA) to minimize the reprojection
error from the lane line curve to the image plane [16]. Here, Tcv determines the
coordinate system transformation between the camera coordinate system C(xyz) and
the vehicle coordinate system V(x’y’z’). Tcv is composed of rotation matrix R and
translation vector T. The three degrees of freedom of R can be expressed by three
Euler angles (rotation angles; see Sect. 2.3.1). Considering that the onboard camera
needs to detect obstacles such as pedestrians and vehicles within 200 m, its detection
accuracy is about 1 m. Assuming that the horizontal field of view of the camera is about 57°, the requirements for the accuracy of the external parameters of the camera
are that the rotation angle is about 0.2°, and the translation is about 0.2 m.
If the coordinates of the lane line points detected on the image plane collected by the camera are (x', y'), then according to the pinhole imaging model,
Among them, zc is the distance between the lane line point Pc and the camera, M
is the parameter matrix of the camera, and Tvw is the coordinate transformation
matrix between the world coordinate system W(XYZ) and the vehicle coordinate
system V(x’y’z’), which expresses the pose of the vehicle.
The detection of lane lines can be carried out with the help of a deep learning
method based on the network structure U-Net++ [17]. After obtaining the lane line
features in the image plane, the 3D world coordinate system position cannot be
directly recovered from the 2D features in the image plane, so it is necessary to
project the true value of the lane lines to the image plane, and set the loss function in
the image plane to perform optimization.
To prevent over-optimization and improve computational efficiency, the detected
features need to be selected/screened. Lane lines are usually composed of curves and
straight lines, and the actual curvature is relatively small. When the vehicle is driving
normally, in most cases, the lane line does not provide useful information for the translation component Tx, and it is necessary to select vehicle-turning scenes for calibration. Therefore, the video captured by the vehicle camera can be divided into useless frames, data frames, and key frames according to the following rules (a classification sketch in code is given after the list):
1. When the number of lane line pixels detected in the frame image is less than a certain threshold, the frame is regarded as a useless frame; this handles cases such as intersections and traffic jams, where no obvious lane lines appear in the image.
2. The frame images when the vehicle driving distance from the previous key frame
and the vehicle yaw angle are both less than a certain threshold are classified as
useless frames to avoid repeated collection of lane line information.
3. When Rules 1 and 2 are not satisfied and the angle between the vehicle and the true value of the lane line (map data) is greater than a certain threshold, the frame image is classified as a key frame.
4. The frame images collected in other cases are classified as data frames.
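A compact sketch of this frame classification is given below; all threshold names are illustrative, and the quantities passed in are assumed to have been computed from the detection results and the vehicle pose.

def classify_frame(num_lane_pixels, dist_since_keyframe, yaw_since_keyframe,
                   angle_to_map_lane, pixel_thresh, dist_thresh, yaw_thresh,
                   angle_thresh):
    """Classify a frame as 'useless', 'key', or 'data' following the four
    rules above (a sketch; all thresholds are hypothetical parameters)."""
    if num_lane_pixels < pixel_thresh:                       # Rule 1
        return "useless"
    if dist_since_keyframe < dist_thresh and yaw_since_keyframe < yaw_thresh:
        return "useless"                                     # Rule 2
    if angle_to_map_lane > angle_thresh:                     # Rule 3
        return "key"
    return "data"                                            # Rule 4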
Since the useless frame does not contain lane line information or only contains the
lane line information that has been counted, it can be ignored in the optimization of
the loss function to reduce the amount of data.
In actual driving, because the vehicle is parallel to the lane line, in most of the
time, the number of key frame images collected is less than the number of data
frames. As pointed out above, the lane lines in the data frames do not provide useful information for the translation component Tx, so not distinguishing between key frames and data frames may lead to over-optimization of the other external parameters. To this end, a threshold can be
set. If the number of collected key frames is small, only parameters other than Tx are
optimized; if the number of collected key frames is sufficient, all external parameters
are optimized.
Defining the reprojection errors of lane line observation points and map reference
points as loss, the loss function can be expressed as
Among them, Pw is the position of the lane line in the high-precision map in the
world coordinate system, and Tvw can be obtained through a global positioning
system (GPS) or the like. In this way, the loss function can be determined by
determining Tcv. When the loss of different poses of the vehicle traversing the lane
during driving is combined, the camera external parameter calibration problem can
be reduced to an optimization problem that minimizes the loss:
In practice, the lane line has no obvious texture features along the direction of the
vehicle, so it is impossible to establish a one-to-one mapping between Pw and (x‘, y‘,
1)^T to solve Eq. (2.116). To address this, the point-to-point error in Eq. (2.115) is converted to a point-set-to-point-set error:
where (xnw, ynw) is the projection of the lane line in the map on the image plane. The
calculation of the normal direction can be found in [18].
To sum up, the reprojection error calculation process includes the following steps (a code sketch follows the list):
1. Project the lane line point set in the map (within a range of 200 m from the
vehicle) into the camera coordinate system based on the camera external parameter matrix Tcv and the vehicle pose matrix Tvw.
2. Project the point set that has undergone coordinate system transformation into the
image plane according to the parameter matrix in the camera.
3. Calculate the normal directions of the lane line point set in projected map and of
the detected lane line point set.
4. Determine the association between the map lane line points and the detected lane
line points by matching.
5. Determine the reprojection error according to Eq. (2.118) and minimize it; for example, a simple steepest descent method can be used.
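A simplified sketch of steps 1, 2, and 5 is given below. It replaces the normal-direction matching of steps 3 and 4 (and the exact error of Eq. (2.118)) with a plain nearest-neighbor point-set distance, so it should be read as an illustration of the data flow rather than as the method itself; all names are hypothetical.

import numpy as np

def lane_reprojection_error(P_w, d_c, T_cv, T_vw, M):
    """Sketch of the reprojection error between projected map lane points and
    detected lane points.

    P_w : (N, 3) lane points from the high-precision map (world coordinates)
    d_c : (M, 2) lane points detected in the image
    T_cv, T_vw : 4x4 homogeneous transforms (camera<-vehicle, vehicle<-world)
    M : 3x3 internal parameter matrix of the camera
    """
    d_c = np.asarray(d_c, dtype=float)
    P_h = np.hstack([P_w, np.ones((len(P_w), 1))])
    P_c = (T_cv @ T_vw @ P_h.T)[:3]              # map points in camera frame
    P_c = P_c[:, P_c[2] > 0]                     # keep points in front of camera
    proj = M @ P_c
    uv = proj[:2] / proj[2]                      # pinhole projection
    # nearest detected lane point for every projected map point
    diff = uv.T[:, None, :] - d_c[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    return dist.mean()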
References
1. Khurana A, Nagla K S (2022) Extrinsic calibration methods for laser range finder and camera: A systematic review. MAPAN - Journal of Metrology Society of India, 36(3): 669-690.
2. Tian J Y, Wu J G, Zhao Q C (2021) Research progress of camera calibration methods in vision system. Chinese Journal of Liquid Crystals and Displays, 36(12): 1674-1692.
3 Depth Image Acquisition
The image obtained by imaging with the camera can be represented by f(x, y). f(x, y)
can also represent an attribute f in space (x, y), and f in a general grayscale image
represents the gray scale or brightness at the pixel (x, y). If the image attribute
f represents depth, such images are called depth maps or depth images, which
reflect the 3D spatial information of the scene.
During general imaging, the 3D scene is projected to the 2D plane, so the collected
2D image f(x, y) does not directly contain the depth (or distance) information of the
scene (with information loss). The objective world is 3D. In order to express it
completely, the images collected from the scene also need to be 3D. The depth image
can be expressed as z = f(x, y). The 3D image f(x, y, z) can be further obtained from
the depth image. The 3D image f(x, y, z) including the depth information can express
the complete information of the scene (including the depth information).
An image is a form of description of an objective scene, and images can be divided into two categories according to the nature of the scene properties they describe: intrinsic images and extrinsic images [2]. An image of the scene is obtained by an observer or collector (sensor). The scene and the scenery in it have some objectively existing characteristics that are independent of the nature of the observer and the collector, such as the surface reflectance, transparency, surface orientation, and movement speed of the scenery, as well as the relative distances between scene objects and their orientations in space. These properties are called (scene) intrinsic properties,
and the images representing the physical quantities of these intrinsic properties are
called intrinsic images. There are many kinds of intrinsic images. Each intrinsic
image can only represent one intrinsic characteristic of the scene, without the
influence of other properties. If the intrinsic image can be obtained, it is very useful
to correctly explain the scenery represented by the image. Depth image is one of the
most commonly used intrinsic images, in which each pixel value represents the
distance (depth, also known as the elevation of the scene) between the scene point
represented by the pixel and the camera. These pixel values actually directly reflect
the shape of the visible surface of the scene (intrinsic property). From the depth
image, the geometric shape of the scene itself and the spatial relationship between
the scenes can be easily obtained.
The physical quantity represented by an extrinsic image is not only related to the
scene but also to the nature of the observer/collector, the conditions of image
acquisition, or the surrounding environment. A typical representative of extrinsic
images is the common grayscale image (gray scale corresponds to brightness or
illumination). The grayscale image reflects the received radiation intensity at the
observation site, and its intensity value is often the result of the combination of
multiple factors, such as the intensity of the radiation source, the orientation of the
radiation mode, the reflection property of the scene surface, as well as the position
and performance of the collector.
The difference between depth image and grayscale image can be explained with
the help of Fig. 3.1. There is an object in the figure. Considering a section (profile) on
it, the depth acquired from the section has the following two characteristics compared with the grayscale image:
1. The pixel values corresponding to the same outer plane of the object change at a certain rate in the depth image (when the plane is inclined relative to the image plane). This rate of change depends on the shape and orientation of the object but has nothing to do with the external lighting conditions. The corresponding pixel value in the
grayscale image depends not only on the illuminance of the surface (this is not
only related to the shape and orientation of the object but also related to the
external lighting conditions) but also depends on the reflection coefficient of the
surface.
2. There are two kinds of edge lines in depth images: one is the (distance) step edge between the object and the background; the other is the ridge edge at the intersection of different regions inside the object (where the depth reaches an extremum but remains continuous). In the grayscale image, both appear as step edges.
Solving many computer vision problems requires the use of extrinsic images to
recover intrinsic properties, that is, to obtain intrinsic images, which can further
explain the scene. In order to recover the intrinsic structure of the scene from
extrinsic images, various image (pre)processing methods are often required. For
example, in the imaging process of grayscale images, many physical information
about the scene are mixed and integrated in the pixel gray value, so the imaging
process can be regarded as a degenerate transformation. However, the physical
information about the scene is not completely lost after being mixed in the grayscale
image. Various preprocessing techniques (such as filtering, edge detection, distance
transformation, etc.) can be used to eliminate the degradation in the imaging process
with the help of redundant information in the image.
Many image understanding problems can be solved with the help of depth images.
There are many ways of depth imaging, which are mainly determined by the mutual
position and movement of the light source, the collector, and the scene. The most
basic imaging method is monocular imaging, that is, one collector is used to collect
an image of the scene at a fixed position. Although depth information about the scene
is not directly reflected in the image at this time, it is still implicit in the imaged
geometric distortion, shading, texture, surface contour, and other factors (Chaps. 7
and 8 will describe how to recover depth information from such images). If two
collectors are used to take images of the same scene at two positions (or one collector
can be used to take images of the same scene at two positions successively, or one
collector is used to obtain two images with the help of an optical imaging system), it
is binocular imaging (see Sect. 3.3.1 and Chap. 5). The parallax (disparity) generated
between the two images (similar to the human eyes) at this time can be used to help
determine the distance between the collector and the scene. If more than two
collectors are used to take images of the same scene at different locations (or one
collector can be used to take images of the same scene at multiple locations one after
the other), it is multi-ocular (multi-eye) imaging (see Chap. 6). Monocular, binoc
ular, or multi-ocular methods can obtain sequence images by continuous shooting in
addition to still images. Compared with binocular imaging, monocular imaging is
simpler to acquire, but it is more complicated to obtain depth information from
it. Conversely, binocular imaging increases the acquisition complexity but reduces
the complexity of acquiring depth information.
In the above discussion, it is considered that the light source is fixed in several
imaging methods. If the collector is fixed relative to the scene and the light source is
moved around the scene, this imaging mode is called light shift imaging (also called
photometric stereo imaging). Since the surface of the same scene has different brightness under different lighting conditions, the images obtained by light shift imaging can be used to recover the surface orientation of the object (but absolute depth information cannot be obtained; see Sect. 7.2 for details). If you keep the light source
fixed and let the collector move to track the scene, or let the collector and the scene
move at the same time, it constitutes active vision imaging (refer to the initiative of
human vision, that is, people will move the body or head according to the needs of
observation to change the perspective and selectively pay special attention to part of
the scene), the latter of which is also called active visual self-motion imaging.
Alternatively, if a controllable light source is used to illuminate the scene,
interpreting the surface shape of the scene through the captured projection pattern
is structured light imaging (see Sect. 3.2.4). In this way, the light source and the
collector can be fixed, while the scenery can be rotated; or the scenery can be fixed,
while the light source and the collector can be rotated around the scenery together.
Some of the properties of light sources, collectors, and sceneries in these modes
are summarized in Table 3.1.
Direct depth imaging refers to the use of specific equipment and devices to directly
obtain distance information to acquire 3D depth images. At present, the most
commonly used methods are mostly based on 3D laser scanning technology, and
other methods include Moire fringe method, holographic interferometry, Fresnel
diffraction, and other technologies. The direct depth imaging methods are mostly
active from the point of view of the signal source.
3D laser scanning can quickly reconstruct various data, such as lines, surfaces, volumes, and 3D models of the measured object. Here are a few related concepts:
1. Laser ranging: a laser beam is emitted to the object by the transmitter, the laser
beam reflected by the object is received by the photoelectric element, and the
timer measures the time from the emission to the reception of the laser beam,
thereby calculating the distance from the transmitter to the target. This kind of
imaging that collects information one point at a time can be regarded as an
extreme special case of monocular imaging, and the result obtained is
z = f(x, y). If such acquisition is repeated to obtain information of a region, it is
closer to ordinary monocular imaging.
2. Reverse engineering: it often refers to a reproduction process of product design
technology, that is, reverse analysis and research on an object product, so as to
deduce and obtain the design elements such as the processing flow, organizational
structure, functional characteristics, and technical specifications of the product, to
produce products with similar functions, but not exactly the same. The data
collection of the objective scene is also a reverse process. The objective information of the scene is obtained, and after analysis and processing, the scene
model is constructed, and the structure of the scene and the spatial relationship
between the scenes are reversed. In reverse engineering, the collection of points
on the surface of product appearance obtained by measuring instruments is called
point cloud. Point clouds are massive collections of points that express the spatial
distribution of the object and the characteristics of the object surface under the
same spatial reference system. A laser point cloud is a large collection of points
acquired by a laser scanner (more introduction in Chap. 4).
3. Reflection intensity: it represents the energy value returned by the reflected laser
wave, similar to the brightness value of gray level. The point cloud is obtained
according to the principle of laser measurement, including 3D coordinates (XYZ)
and laser reflection intensity. The point cloud obtained according to the photogrammetry principle includes 3D coordinates (XYZ) and color information
(RGB).
The attribute of laser point cloud can be expressed with different parameters,
including point cloud density, point position accuracy, and surface normal
vector [3].
1. Point cloud density (ρ): the number of laser points per unit area, corresponding to the point cloud spacing (the average spacing Δd of the laser points) by ρ = 1/Δd².
2. Positional accuracy: the plane and elevation accuracy of the laser point is related
to the conditions of the laser scanner and other hardware, the density of the point
cloud, the surface properties of the object, and the coordinate transformation.
3. Surface normal vector: a single laser point can represent limited object attri
butes, and often the object attributes are expressed jointly by multiple laser points
in the neighborhood around the laser point. If the pixels in the neighborhood are
considered to form an approximate plane or surface, they can be represented by
normal vectors. The vector represented by a straight line perpendicular to a plane is called the normal vector of that plane.
3. Vehicle-mounted laser scanning system
By performing verification and coordinate conversion on the acquired images and point clouds, the system performs geolocation, generates high-precision 3D coordinate information, and produces 3D point clouds of roads and surrounding objects.
4. Airborne laser scanning system
Similar to the vehicle-mounted laser scanning system, the airborne laser
scanning system also includes laser scanners and high-resolution digital cameras.
It integrates GPS and INS and uses various low-, medium-, and high-altitude
aircraft as platforms to obtain 3D spatial information of the observation area.
5. Spaceborne laser scanning system
Based on the satellite platform, it has the ability to actively obtain 3D
information of the global surface and objects. Some satellites are equipped with
geoscience laser altimeter system (GLAS) and advanced topographic laser
altimeter system (ATLAS).
From the principle of 3D laser scanning ranging, the methods can be mainly divided into time-based and space-based modes. The time-based mode can be further divided into the pulse method and the phase method, while the space-based mode is mainly trigonometry.
1. Pulse method
Also known as the time of flight (TOF) method. The distance D is calculated by measuring the round-trip time of the laser from the transmitter to the object and back:

   D = (1/2) c0 t.   (3.1)

Among them, c0 is the speed of light in vacuum (299,792,458 m/s), and t is the round-trip time of the laser.
The distance that can be measured by the pulse method is relatively large, often up to several hundred meters or even several kilometers, but the accuracy is poor (limited by the measurement of time), generally at the centimeter level.
2. Phase method
The ranging principle is similar to that of the pulse method, but the laser signal is modulated. When calculating the distance D, the round-trip optical path is expressed as m whole wavelengths of the wavelength λ plus a fraction dλ, with d ∈ (0, 1), of one wavelength, so that

   D = (m + d) λ/2.   (3.2)

The half wavelength λ/2 is also called the length of the precision measuring ruler,

   λ/2 = c0/(2rF) = c/(2F),   (3.3)
where c is the true speed of light at the time of measurement, r is the refractive
index of the medium, and F is the frequency of the measuring ruler.
In Eq. (3.2), m and d can be determined by the following different methods:
(a) With the help of phase angle measurement: the distance corresponding to the phase delay Δφ is calculated by measuring Δφ of the round-trip laser and changing the wavelength λ of the light; that is,

   d = Δφ/(2π).   (3.4)

(b) With the help of optical path measurement: if the measurement distance is changed by δD, the term dλ/2 in Eq. (3.2) can be eliminated:

   D − δD = D − (λ/2)(Δφ/2π) = (λ/2) m.   (3.5)

In this way, the distance D can be determined by measuring only the integer number of measuring rulers.
(c) With the help of measurement of the modulation frequency: it can be seen from Eq. (3.2) that, by changing the frequency of the modulation light, the mantissa less than the length of one measuring ruler can be made 0:

   D = (λ/2) m.   (3.6)

In this way, the distance D can again be determined by measuring only the integer number of measuring rulers.
The measurable distance of the phase method is generally about 100 m, and the
accuracy is generally at the millimeter level.
3. Trigonometry
The most commonly used is the oblique type (the laser emission axis forms a
certain angle with the normal of the object surface) trigonometry. Consider
Fig. 3.2, where the laser is at the origin of the coordinate system, the Z-axis
points from the laser to the sensor, the Y-axis points from the inside paper, and
the X-axis points from the bottom to the top. In addition, the triangle consisting
of the laser, the sensor, and the object is in the XZ plane, where the distance
L between the laser and the sensors is the known length baseline of the system.
The position of the object point in this coordinate system is determined by the
angle α between the emitted ray and the baseline, the angle β between the reflected ray and the baseline, and the angle φ by which the triangle is rotated around the Z-axis:
   X = L sinα sinβ cosφ / sin(α + β),
   Y = L sinα sinβ sinφ / sin(α + β),   (3.7)
   Z = L cosα sinβ / sin(α + β).
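A direct transcription of Eq. (3.7), as reconstructed above, is given below; the assignment of the three components to the X, Y, and Z axes follows the axis convention stated in the text and should be checked against Fig. 3.2.

import numpy as np

def triangulate(L, alpha, beta, phi):
    """Position of the object point from oblique laser triangulation,
    Eq. (3.7) as reconstructed above (angles in radians, baseline length L)."""
    s = np.sin(alpha + beta)
    X = L * np.sin(alpha) * np.sin(beta) * np.cos(phi) / s
    Y = L * np.sin(alpha) * np.sin(beta) * np.sin(phi) / s
    Z = L * np.cos(alpha) * np.sin(beta) / s
    return np.array([X, Y, Z])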
This method obtains distance information by measuring the time it takes for a light
wave to return from the light source to the sensor after being reflected by the
measured object. Generally, the light source and the sensor are placed in the same
position, so the relationship between the propagation time t and the measured
distance D is shown in Eq. (3.1). Considering that light is actually transmitted in
the air, it should be corrected according to the medium:

   D = (1/2)(c0/r) t,   (3.8)

where r is the refractive index of air, which depends on the air temperature, air pressure, and humidity; in practice, it is generally about 1.00025. For simplicity, in many cases the speed of light can be taken as 3 × 10⁸ m/s and r = 1.
The depth image acquisition method based on time of flight is a typical method to
obtain distance information by measuring the travel time of light waves. Because a
point light source is generally used, it is also called the flying spot method. To obtain
a 2D image, the beam needs to be scanned in 2D, or the object being measured needs
to be moved in 2D. The key to distance measurement in this method is to measure
time accurately: because the speed of light is 3 × 10⁸ m/s, if the spatial distance resolution is required to be 0.001 m (i.e., to be able to distinguish two points or two parallel lines that are 0.001 m apart in space), the time resolution needs to reach about 6.7 × 10⁻¹² s.
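The required timing resolution can be checked with a few lines of arithmetic (illustrative values only):

# Required timing resolution for a given spatial resolution (pulse TOF).
c = 3.0e8                      # speed of light, m/s
delta_d = 0.001                # desired spatial resolution, m
delta_t = 2.0 * delta_d / c    # round trip: ~6.7e-12 s, i.e., a few picoseconds
distance = 0.5 * c * 100e-9    # e.g., a 100 ns round-trip time gives 15 m
print(delta_t, distance)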
This method uses the pulse interval to measure the time, specifically by measuring
the time difference of the pulse wave. Its basic principle block diagram is shown in
Fig. 3.3. The specific frequency laser emitted by the pulsed laser source is directed
forward through the optical lens and the beam scanning system and is reflected after
touching the object. The reflected light is received by another optical lens and enters
the time difference measurement module after photoelectric conversion. The module
simultaneously receives the laser directly sent by the pulsed laser source and
measures the time difference between the emitted pulse and the received pulse.
According to the time difference, the measured distance can be calculated by using
Eq. (3.8). It should be noted here that the initial pulse and echo pulse of the laser
cannot overlap within the working distance range.
Using the above principle, the distance measurement can also be performed by
replacing the pulsed laser with ultrasonic waves. Ultrasound can work not only
under natural light but also inside water. Because the propagation speed of sound
waves is relatively slow, the requirement on the accuracy of time measurement is relatively low; however, because the absorption of sound by the medium is generally large, the
sensitivity of the receiver is required to be high. In addition, due to the large
divergence of sound waves, very high-resolution distance information cannot be
obtained.
Fig. 3.3 Principle block diagram of pulse time interval measurement method
The time difference can also be measured by measuring the phase difference. A
block diagram of the basic principle of a typical method can be seen in Fig. 3.4.
In Fig. 3.4, the laser emitted by the continuous laser source is amplitude-modulated in light intensity at a certain frequency and is split into two paths. One path is directed forward through the optical scanning system and is reflected after reaching the object; the reflected light is filtered to extract its phase. The other path enters the phase difference measurement module and is
compared with the phase of the reflected light. Because the phase has a period of
2π and the measured phase difference ranges from 0 to 2π, the depth measurement D is

   D = (1/2)[(Δφ/2π)(c/Fmod) + k(c/Fmod)] = (λ/2)(Δφ/2π + k),   (3.9)

where c is the speed of light, Fmod is the modulation frequency, Δφ is the phase difference (in radians), k is an integer, and λ = c/Fmod. The possible ambiguity of the depth measurement can be overcome by limiting the range of measurement depth (limiting the value of k). The parameter λ introduced in Eq. (3.9) is a measurement scale: the smaller λ is, the higher the accuracy of the distance measurement. In order to obtain a smaller λ, a higher modulation frequency Fmod should be used.
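A small sketch of the depth computation from Eq. (3.9), as reconstructed above, follows; the example modulation frequency is illustrative.

import numpy as np

def amcw_depth(delta_phi, f_mod, k=0, c=3.0e8):
    """Depth from the measured phase difference, Eq. (3.9) as reconstructed:
    D = (lambda/2) * (delta_phi / (2*pi) + k), with lambda = c / f_mod.
    k must be fixed by limiting the measurement range (a sketch)."""
    lam = c / f_mod
    return 0.5 * lam * (delta_phi / (2.0 * np.pi) + k)

# Example: f_mod = 10 MHz gives lambda = 30 m, hence an unambiguous range of
# 15 m; delta_phi = pi then corresponds to a depth of 7.5 m.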
Fig. 3.4 The principle block diagram of the phase measurement method of amplitude modulation

The time difference can also be measured by measuring the frequency change. The laser emitted by the continuous laser source can be frequency modulated with a linear waveform of a certain frequency. Let the laser frequency be F and the modulating wave frequency be Fmod; the modulated laser frequency then exhibits a linear periodic change between F ± ΔF/2 (where ΔF is the frequency deviation of the laser frequency after modulation). One part of the modulated laser is used as the reference light, and the other part is projected to the object to be measured. After
touching the object, it is reflected and then received by the receiver. The two optical
signals coherently produce a beat frequency signal FB, which is equal to the product
of the slope of the laser frequency change and the propagation time, F_B = 2ΔF Fmod t, so that

   D = c F_B / (4 Fmod ΔF).   (3.11)

Then, from the phase change between the outgoing light wave and the returning light wave,

   Δφ = 2π ΔF t = 4π ΔF D / c,   (3.12)

which again gives

   D = (c / (2ΔF)) (Δφ / 2π).   (3.13)
Comparing Eqs. (3.1) and (3.13), the number of coherent fringes N (which is also
the number of zero crossings of the beat frequency signal in the half-cycle of the
modulation frequency) is obtained:

   N = Δφ/2π = F_B/(2 Fmod).   (3.14)
In practice, the actual distance can be obtained by calibration, that is, according to
the accurate reference distance dref and the measured reference coherent fringe
number Nref using the following equation (by counting the actual coherent fringe
number):
   D = (dref / Nref) N.   (3.15)
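The distance computations of Eqs. (3.11) and (3.15), as reconstructed above, can be sketched as follows; the numerical values in the example are illustrative only.

def fmcw_distance(f_beat, f_mod, delta_f, c=3.0e8):
    """Distance from the beat frequency, Eq. (3.11) as reconstructed:
    D = c * F_B / (4 * F_mod * delta_F)."""
    return c * f_beat / (4.0 * f_mod * delta_f)

def fmcw_distance_calibrated(n_fringes, d_ref, n_ref):
    """Calibrated form of Eq. (3.15): D = (d_ref / N_ref) * N."""
    return d_ref / n_ref * n_fringes

# Example: F_mod = 1 kHz, delta_F = 100 MHz, and a measured beat of 1 MHz
# give D = 3e8 * 1e6 / (4 * 1e3 * 1e8) = 750 m (illustrative numbers only).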
3.2.3 LiDAR
The data obtained by only laser scanning lacks brightness information. If the laser
scanning process is supplemented with camera shooting, the depth information and
brightness information in the scene can be obtained at the same time. Light
detection and ranging (LiDAR) is a typical example.
The principle of LiDAR can be illustrated with the help of Fig. 3.5 [5]. The whole
device is placed on a platform that can tilt and pan and can radiate and receive
amplitude-modulated laser waves. For each point on the surface of the 3D scene, the
waves radiated to and received from that point are compared to obtain information.
These information include both spatial information and intensity information. Spe
cifically, the spatial coordinates X and Y of each point are related to the pitch and
horizontal motion of the platform, and the depth Z is closely related to the phase
difference. The reflection properties of the surface for light at the laser wavelength can be determined by the amplitude difference of the waves. In this way, LiDAR can obtain two registered
images at the same time: one is the depth image and the other is the luminance image.
Note that the depth range of the depth image is related to the modulation period of
the laser wave. If the corresponding modulation wavelength is λ, the same depth will be calculated every λ/2, so the depth measurement range needs to be limited.
Compared with camera acquisition equipment, the acquisition speed of LiDAR is
relatively slow because the phase is calculated for each 3D surface point. According to a similar idea, there are also systems that combine an independent laser scanning
device with a camera acquisition device to simultaneously acquire depth and color
information. One problem this brings is the need for data registration (see Chap. 4).
Structured light method is a commonly used method of active sensing and direct
acquisition of depth images. Its basic idea is to use the geometric information in
lighting to help extracting the geometric information of the scene. Structured light
ranging is carried out by trigonometry. The imaging system is mainly composed of a
camera and a light source, which are arranged in a triangle with the observed object.
The light source generates a series of point or line lasers to illuminate the surface of
the object, and the light-sensitive camera records the illuminated part and then
obtains depth information through triangulation calculation, so it is also called active
triangulation. The ranging accuracy of the active structured light method can reach
the micrometer level, and the measurable depth field range can reach hundreds to
tens of thousands of times of the accuracy.
There are many specific ways to use structured light imaging, including light strip
method, grid method, circular light strip method, crossline method, thick light strip
method, spatial coding template method, color coding strip method, density ratio
method, etc. In addition, with the development of tunable flat optics, there will be
more structured light imaging methods [6]. Due to the different geometric structures
of the projected beams they use, the camera shooting methods and the depth distance
calculation methods are also different, but the common point is that they all utilize
the geometric structure relationship between the camera and the light source.
In the basic light strip method, a single light plane is used to illuminate parts of the
scene in sequence so that a light strip appears on the scene and only this part of the
light strip is detectable by the camera. In this way, a 2D entity (the light plane) is
obtained for each illumination, and then the third dimension (distance) information
of the spatial point corresponding to the visible image point on the light strip can
be obtained by calculating the intersection of the camera line of sight and the light
plane.
When using structured light imaging, the camera and light source should be
calibrated first. Figure 3.6 shows the geometric relationship of a structured light
system. Here, the XZ plane where the lens is located and perpendicular to the light
source is given (the Y-axis goes from the inside of the paper to the outside, and the
light source is a strip along the Y-axis). The laser emitted through the narrow slot
irradiates from the origin O of the world coordinate system to the spatial point
W (on the object surface) to generate a linear projection, and the optical axis of the
camera intersects with the laser beam. In this way, the camera can collect the linear
projection, so as to obtain the distance information at the point W on the object
surface.
In Fig. 3.6, F and H determine the position of the lens center in the world
coordinate system, α is the angle between the optical axis and the projection line,
β is the angle between the z- and Z-axes, φ is the angle between the projection line
and the Z-axis, λ is the focal length of the camera, h is the imaging height (the
distance from the image point to the optical axis of the camera), and r is the distance from
the lens center to the intersection of the z- and Z-axes. It can be seen from the figure
that the distance Z between the light source and the object is the sum of s and d,
where s is determined by the system and d can be obtained by the following
equation:
Z = s + d = s + r·h·csc φ / (λ - h·cot φ)    (3.17)
Equation (3.17) links Z with h (the rest are all system parameters), providing a
way to calculate the object distance according to the imaging height. It can be seen
that the imaging height contains 3D depth information, or the depth is a function of
the imaging height.
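As a minimal numerical sketch of Eq. (3.17), using the symbols as reconstructed above (the function name and the sample values are hypothetical):

```python
import math

def depth_from_imaging_height(h, s, r, phi, lam):
    """Object distance Z from the imaging height h, following Eq. (3.17)
    as reconstructed above: Z = s + r*h*csc(phi) / (lam - h*cot(phi)).
    h   : imaging height (distance of the image point from the optical axis)
    s   : offset determined by the system geometry
    r   : distance from the lens center to the intersection of the z- and Z-axes
    phi : angle between the projection line and the Z-axis, in radians
    lam : focal length of the camera (same length unit as h, s, r)
    """
    d = r * h / math.sin(phi) / (lam - h / math.tan(phi))
    return s + d

# Illustrative numbers only (millimeters and radians)
print(depth_from_imaging_height(h=2.0, s=300.0, r=500.0,
                                phi=math.radians(40.0), lam=35.0))
```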
Structured light imaging can not only give the distance Z of the space point but also
the thickness of the object along the Y direction. For this, the imaging width is
analyzed when the camera views the top plane of the object, as shown in
Fig. 3.7.
Figure 3.7 shows a schematic diagram of the plane determined by the Y-axis and
the center of the lens, where w is the imaging width:
w = (λ' / t) · Y,    (3.18)

where t is the distance from the center of the lens to the vertical projection of point
W on the Z-axis (see Fig. 3.6):

t = [(Z - F)² + H²]^(1/2)    (3.19)

and λ' is the distance along the z-axis from the center of the lens to the imaging plane
(see Fig. 3.6):

λ' = (h² + λ²)^(1/2)    (3.20)

Therefore,

Y = w·t / λ' = w·[(Z - F)² + H²]^(1/2) / (h² + λ²)^(1/2)    (3.21)
In this way, Eq. (3.21) links the object thickness coordinate Y to the imaging
height, system parameters, and object distance.
Moire stripes can be formed when two gratings have a certain inclination and
overlap. The distribution of Moire contour stripes obtained by a certain method
can contain the distance information of the scene surface.
When the grating is projected onto the surface of the scene with projection light, the
undulation of the surface will change the distribution of the projected image. If this
deformed projection image is reflected from the scene surface and then passes
through another grating, Moire contour stripes can be obtained. According to the
transmission principle of the optical signal, the above process can be described as the
result of the secondary spatial modulation of the optical signal. If both gratings are
linear sinusoidal transmission gratings, and the parameter defining the grating period
variation is l, the observed output optical signal is

f(l) = f1{1 + m1 cos[w1 l + θ1(l)]} · f2{1 + m2 cos[w2 l + θ2(l)]}    (3.22)

where fi is the light intensity, mi is the modulation coefficient, θi(l) is the phase change
caused by the fluctuation of the scene surface, and wi is the spatial frequency
determined by the grating period. In Eq. (3.22), the first term on the right side
corresponds to the modulation function of the first grating passed by the optical
signal, and the second term on the right side corresponds to the modulation function
of the second grating passed by the optical signal.
There are four periodic variables of spatial frequency in the output signal f(l) of
Eq. (3.22), namely w1, w2, w1 + w2, and w1 - w2. Since the receiving process of the detector
acts as a low-pass filter on the spatial frequency, only the difference-frequency term
w1 - w2, whose phase is θ1(l) - θ2(l), survives in the light intensity of the Moire
stripes. It can be seen that the distance information from the scene surface is directly
reflected in the phase change of the Moire stripes.
Figure 3.8 shows a schematic diagram of distance measurement using the Moire
stripe method. The light source and the viewpoint are at a distance D, and they have
the same distance from the grating G, and both are H. The grating is a transmissive
line grating with alternating black and white (period R). According to the coordinate
system in the figure, the grating surface is on the XOY plane; the measured height is
along the Z-axis, which is represented by the Z coordinate.
Considering a point A, whose coordinates are (x, y) on the measured surface, the
illuminance of the light source passing through the grating to it is the product of the
intensity of the light source and the transmittance of the grating at point A*. The light
intensity distribution at point A can be expressed as

T1(x, y) = C1 {1/2 + (2/π) Σn (1/n) sin[(2πn/R) · xH/(z + H)]}    (3.25)

where n runs over the odd integers and C1 is a constant related to the intensity. After
T1 passes through the grating G again, it is equivalent to another transmission
modulation at the point A′, and the light intensity distribution at A′ becomes

T2(x, y) = C2 {1/2 + (2/π) Σm (1/m) sin[(2πm/R) · (xH + Dz)/(z + H)]}    (3.26)
where m runs over the odd integers and C2 is a constant related to the intensity. The
final received light intensity at the viewpoint is the product of the two distributions:

T(x, y) = T1(x, y) · T2(x, y)    (3.27)

Expanding Eq. (3.27) and applying the low-pass filtering of the
receiving system, a partial sum containing only the variable z can be obtained [7]:

T(z) = B + S Σn (1/n²) cos[2πnDz / (R(z + H))]    (3.28)

where n runs over the odd integers, B is the background intensity of the Moire stripes,
and S is the contrast of the stripes. Equation (3.28) gives the mathematical description
of Moire contour stripes. Generally, only the fundamental frequency term of
n = 1 is used to approximately describe the distribution of Moire stripes, that is,
Eq. (3.28) can be simplified as

T(z) ≈ B + S cos[2πDz / (R(z + H))]    (3.29)

The following can be seen from Eq. (3.29):
1. Bright stripes appear where the cosine term reaches its maximum, that is, at the heights

ZN = NRH / (D - NR),    N = 1, 2, ...    (3.30)
2. The height difference between any two bright stripes is not equal, so the height
cannot be determined by the number of stripes; only the height difference
between two adjacent bright stripes can be calculated.
3. If the distribution of the phase term θ can be obtained, the height distribution of
the surface of the measured object can be obtained:

Z = RθH / (2πD - Rθ)    (3.31)
The abovementioned basic method requires the use of a grating of the same size as
the measured object, which brings inconvenience to the use and manufacture of the
device. An improved method is to install the grating in the projection system of the
light source and use the magnification capability of the optical system to obtain the
effect of a large grating. Specifically, two gratings are used, which are placed close to
the light source and the viewpoint, respectively. The light source transmits the light
beam through the grating, and the viewpoint is imaged behind the grating.
A practical schematic diagram of ranging using the above projection principle is
shown in Fig. 3.9. Two sets of imaging systems with the same parameters are used,
their optical axes are parallel, two gratings with the same spacing are geometrically
imaged at the same imaging distance, and the projection images of the two
gratings are coincident.
Suppose Moire stripes are observed behind the grating G2, and G1 is used as the
projection grating, then the projection center O1 of the projection system L1 and the
convergence center O2 of the receiving system L2 are equivalent to the light source
point S and the viewpoint W in the basic method, respectively. In this way, as long as
R in Eqs. (3.29) and (3.31) is replaced by MR (M = H/H0 is the imaging
magnification of two optical paths), the distribution of Moire stripes can be described
as above, and the height distribution on the surface of the object to be measured can
be calculated.
In practical applications, the grating in front of the projection system L1 can be
omitted, while the computer software is used to complete its function. At this time,
the projected grating image containing the depth information of the measured object
surface is directly received by the camera.
It can be known from Eq. (3.31) that if the distribution of the phase term θ can be
obtained, the distribution of the height Z of the surface of the measured object can be
obtained. The phase distribution can be obtained by using multiple Moire images
with a certain phase shift. This method is often referred to simply as the phase-shift
method. Taking three images as an example, after obtaining the first image, move the
projection grating horizontally by R/3 distance to obtain the second image, and then
move the projection grating horizontally by R/3 distance to obtain the third image.
Referring to Eq. (3.29), these three images can be expressed as three Moire distributions whose phase terms differ successively by 2π/3.
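A minimal sketch of the phase retrieval, assuming the standard three-step model in which the three images follow Eq. (3.29) with phase offsets of 0, 2π/3, and 4π/3 (the function name and the synthetic self-check are illustrative, not from the text):

```python
import numpy as np

def three_step_phase(i1, i2, i3):
    """Wrapped phase from three fringe images with phase shifts of
    0, 2*pi/3 and 4*pi/3 (a grating translation of R/3 per image shifts
    the fringe phase by 2*pi/3). Derived from I_k = B + S*cos(phase + shift_k).
    i1, i2, i3 : arrays of equal shape (the three captured images)
    Returns the wrapped phase in (-pi, pi].
    """
    i1, i2, i3 = (np.asarray(a, dtype=float) for a in (i1, i2, i3))
    return np.arctan2(np.sqrt(3.0) * (i3 - i2), 2.0 * i1 - i2 - i3)

# Quick self-check with synthetic fringes
phase = np.linspace(-np.pi + 0.1, np.pi - 0.1, 5)
B, S = 100.0, 40.0
imgs = [B + S * np.cos(phase + k * 2 * np.pi / 3) for k in range(3)]
print(np.allclose(three_step_phase(*imgs), phase))   # True
```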
Indirect depth imaging means that the directly obtained images do not have depth
information or do not directly reflect depth information, and they need to be
processed to extract depth information. The human binocular depth vision function
is a typical example. The objective world seen by each eye (imaged on the retina) is
equivalent to a 2D image (which does not directly reflect depth information), but
from the two 2D images seen by both eyes, people perceive the distance of the scene.
This is the result of information processing by the human brain. There are other
methods to obtain depth information indirectly, such as various 3D layered imaging
methods. Indirect depth imaging methods are mostly passive from the perspective of
signal sources and are commonly used in various image processing, analysis, and
understanding techniques.
Binocular imaging can obtain two images of the same scene with different
viewpoints (similar to human eyes), and the binocular imaging model can be
regarded as a combination of two monocular imaging models. In actual imaging,
either two monocular systems can be used to acquire at the same time, or one
monocular system can be used to acquire at two poses in succession (at this time,
the subject and the light source are generally assumed to have no movement
changes).
By generalizing binocular imaging, multi-ocular imaging can also be achieved,
and some examples will be discussed in Chap. 6.
Depending on the relative poses of the two cameras, there are multiple modes of
binocular imaging, and several typical situations are described below.
It can be seen from Fig. 3.10 that the same 3D space point corresponds to points in
two images (two image plane coordinate systems), respectively, and the position
difference between them is called parallax. The relationship between parallax and
depth (object distance) in binocular lateral mode is discussed below with the help of
Fig. 3.11. It is a schematic diagram of the plane (XZ plane) where the two lenses are
connected. Among them, the world coordinate system coincides with the first camera
coordinate system and only has a translation amount B in the X-axis direction with
the second camera coordinate system.
Considering the geometric relationship between the coordinate X of the point
W in 3D space and the coordinate x1 of the projected point on the first image plane,
we can get
|X| / (Z - λ) = |x1| / λ    (3.34)

Similarly, for the projected point on the second image plane,

(B - |X|) / (Z - λ) = (|x2| - B) / λ    (3.35)

Combining the two relations, the parallax is

d = |x1| + |x2| - B = λB / (Z - λ)    (3.36)

so that

Z = λ (1 + B / d)    (3.37)
Equation (3.37) directly relates the distance Z between the object and the image
plane (i.e., the depth in 3D information) to the parallax d. In turn, it also shows that
the size of parallax is related to depth. That is, parallax contains the spatial infor
mation of 3D objects. According to Eq. (3.37), when the baseline and focal length
are known, it is very simple to calculate the Z coordinate of point W after determining
the parallax d. In addition, after the Z coordinate is determined, the world coordinates
X and Y of point W can be calculated by (x1, y1) or (x2, y2) referring to Eqs. (3.34) and
(3.35).
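A short computational sketch of Eqs. (3.34), (3.37), and (3.39) as reconstructed above (function names and numerical values are illustrative assumptions):

```python
def depth_from_parallax(d, B, lam):
    """Depth in the binocular lateral mode, Eq. (3.37): Z = lam * (1 + B / d).
    d : parallax (same length unit as B and lam), B : baseline, lam : focal length."""
    return lam * (1.0 + B / d)

def world_xy(x1, y1, Z, lam):
    """World X and Y of the point from its first-image coordinates,
    following Eq. (3.34): X / (Z - lam) = x1 / lam."""
    return x1 * (Z - lam) / lam, y1 * (Z - lam) / lam

def depth_error(e, Z, B, lam):
    """Depth deviation for a parallax deviation e, Eq. (3.39):
    dZ = e * (Z - lam)**2 / (lam * B + e * (Z - lam))."""
    return e * (Z - lam) ** 2 / (lam * B + e * (Z - lam))

# Illustrative values: 50 mm lenses, 200 mm baseline, 2 mm parallax, 0.01 mm deviation
Z = depth_from_parallax(d=2.0, B=200.0, lam=50.0)          # 5050 mm
print(Z, world_xy(0.5, 0.2, Z, 50.0), depth_error(0.01, Z, 200.0, 50.0))
```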
Now let’s look at the ranging accuracy obtained by parallax. According to
Eq. (3.37), depth information is related to parallax, which is related to imaging
coordinates. Suppose x1 produces a deviation e, that is, x1e = x1 + e; then
de = x1 + e + |x2| - B = d + e, so the distance deviation is

ΔZ = e(Z - λ)² / [λB + e(Z - λ)] ≈ eZ² / (λB + eZ)    (3.39)
The last step is a simplification that considers Z >> λ in general. It can be seen
from Eq. (3.39) that the ranging accuracy is related to the camera focal length, the baseline
length between the cameras, and the object distance. The longer the focal length, the longer
the baseline, the higher the accuracy; the greater the object distance, the lower the
accuracy.
Equation (3.37) gives the expression of the relationship between absolute depth
and parallax. With the help of differentiation, the relationship between depth change
and parallax change is

δZ = -(λB / d²) δd = -[(Z - λ)² / (λB)] δd    (3.40)

So,

|δZ| / Z ≈ (Z / λB) |δd|    (3.41)
If both parallax and parallax change are measured in pixels, it can be known that
the measurement error of relative depth in the scene is (1) proportional to the pixel
size, (2) proportional to the depth Z, and (3) inversely proportional to the baseline
length B between cameras.
In addition, it can also be obtained from Eq. (3.41) that

|δZ| / Z ≈ |δd| / d    (3.42)
It can be seen that the measurement error of relative depth and the measurement
error of relative parallax are numerically the same.
Observe the cylindrical object with a circular cross-section of local radius r using
two cameras, as shown in Fig. 3.12. There is a certain distance between the intersection
of the two camera sight lines and the boundary point of the circular section,
which is the error δ. The task now is to obtain the equation for calculating the error δ.
Fig. 3.13 Schematic diagram of the simplified geometrical structure of the calculated measurement error
δ = r sec(θ/2) - r    (3.44)

tan(θ/2) = B / (2Z)    (3.45)

Substituting θ, we get

δ = r [1 + (B/2Z)²]^(1/2) - r ≈ rB² / (8Z²)    (3.46)

The above equation provides the formula for calculating the error δ. It can be seen
that the error is proportional to r and inversely proportional to Z².
Let the baseline B = Z1 - Z2, and assume that B << Z1 and B << Z2; then we can get
(taking Z² = Z1Z2)

d = λRB / Z²    (3.50)

where R is the radial distance of the space point from the optical axis. In terms of the
(average) radial distance R0 of the image point from the optical axis, this becomes

d = BR0 / Z    (3.51)

so that

Z = BR0 / d = BR0 / (R2 - R1)    (3.52)

Equation (3.51) can be compared with Eq. (3.36); here the parallax depends on
the (average) radial distance R0 between the image point and the optical axis of the
camera, whereas in Eq. (3.36) it is independent of the radial distance. Then, Eq. (3.52) can be
compared with Eq. (3.37); here the depth information of an object point on the
optical axis cannot be given. For other object points, the accuracy of depth information
depends on the radial distance.
In the above binocular lateral imaging mode, in order to determine the information of
a 3D space point, the point needs to be in the common field of view of the two
cameras. If you rotate the two cameras (around the X-axis), you can increase the
common field of view and capture panoramic images. This can be referred to as
stereoscopic imaging with an angular scanning camera, that is, a binocular
angular scanning mode, where the coordinates of the imaging point are determined
by the camera's azimuth and elevation angles. In Fig. 3.15, θ1 and θ2 give the two
azimuth angles (corresponding to the saccade movement around the Y-axis), respectively,
and the elevation angle φ is the angle between the XZ plane and the plane
defined by the two optical centers and the space point W.
Generally, the azimuth angle of the lens can be used to represent the spatial
distance between object images. Using the coordinate system shown in Fig. 3.15, we
have
tan θ1 = |X| / Z    (3.53)

tan θ2 = (B - |X|) / Z    (3.54)

so that

Z = B / (tan θ1 + tan θ2)    (3.55)
Equation (3.55) actually relates the distance Z between the object and the image
plane (i.e., the depth in 3D information) with the tangents of the two azimuths
directly. Comparing Eqs. (3.55) with (3.37), it can be seen that the effects of parallax
and focal length are implicit in the azimuth angle. According to the Z coordinate of
the space point W, its X and Y coordinates can also be obtained as
X = Z tan θ1    (3.56)

Y = Z tan φ    (3.57)
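A minimal sketch of Eqs. (3.55)-(3.57) (hypothetical function name, illustrative angles):

```python
import math

def point_from_angles(theta1, theta2, phi, B):
    """3D coordinates in the binocular angular scanning mode,
    Eqs. (3.55)-(3.57): Z = B / (tan(theta1) + tan(theta2)),
    X = Z * tan(theta1), Y = Z * tan(phi).
    theta1, theta2 : azimuth angles of the two cameras (radians)
    phi            : elevation angle (radians)
    B              : baseline length
    """
    Z = B / (math.tan(theta1) + math.tan(theta2))
    return Z * math.tan(theta1), Z * math.tan(phi), Z

# Example with assumed angles (illustrative only)
print(point_from_angles(math.radians(20), math.radians(25), math.radians(5), B=0.5))
```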
To achieve greater field of view (FOV) coincidence, two cameras can be placed side
by side but with the two optical axes converging. This binocular convergent
horizontal mode can be regarded as a generalization of the binocular lateral mode
(the vergence between binoculars is not zero at this time).
Consider only the case shown in Fig. 3.16, which is obtained by rotating the two
monocular systems in Fig. 3.11 toward each other around their respective centers.
Figure 3.16 shows the plane (XZ plane) where the two lenses are connected. The
distance between the centers of the two lenses (i.e., the baseline) is B. The two
optical axes intersect at the (0, 0, Z ) point in the XZ plane, and the intersection angle
is 29. Now let’s see how to find the coordinates (X, Y, Z) of the 3D space point W if
two image plane coordinate points (x1, y1) and (x2, y2) are known.
First of all, it can be known from the triangle enclosed by the two world
coordinate axes and the camera optical axis that

Z = (B cos θ) / (2 sin θ) + λ cos θ    (3.58)

Now draw vertical lines from point W to the optical axes of the two cameras;
because the angle between the two vertical lines and the X-axis is θ, according to
the relationship of similar triangles, we can get

|x1| / λ = X cos θ / (r - X sin θ)    (3.59)

|x2| / λ = X cos θ / (r + X sin θ)    (3.60)
where r is the distance from the (either) lens center to the point where the two optical
axes converge.
Combining Eqs. (3.59) and (3.60) and eliminating r and X, we get (refer to
Fig. 3.16)
λ cos θ = 2|x1|·|x2| sin θ / (|x1| - |x2|) = 2|x1|·|x2| sin θ / d    (3.61)

Substituting Eq. (3.61) into Eq. (3.58) gives

Z = (B cos θ) / (2 sin θ) + 2|x1|·|x2| sin θ / d    (3.62)

Equation (3.62), like Eq. (3.37), also directly relates the distance Z between the
object and the image plane with the parallax d. In addition, from Fig. 3.16, we can
get
r = B / (2 sin θ)    (3.63)

Substituting this into Eqs. (3.59) and (3.60) gives the X coordinate of point W:

X = B|x1| / [2 sin θ (λ cos θ + |x1| sin θ)]    (3.64)
The case of binocular convergence can also be converted to the case of binocular
parallelism. Image rectification is the process of geometric transformation of the
image obtained by the camera whose optical axis converges to obtain the image
equivalent to the image obtained by the camera whose optical axis is parallel
[8]. Considering the images before and after correction in Fig. 3.17 (represented
by trapezoid and square, respectively), the light emitted from the object point
W intersects with the left image at (x, y) and (X, Y) before and after rectification,
respectively. Each point on the image before rectification can be connected to the
lens center and extended to intersect with the image after rectification. Therefore, for
each point on the image before rectification, its corresponding point on the image
after rectification can be determined. The coordinates of the points before and after
rectification are connected by the projection transformation (a1 to a8 are the coefficients
of the projection transformation matrix):
x = (a1 X + a2 Y + a3) / (a4 X + a5 Y + 1)    (3.65)

y = (a6 X + a7 Y + a8) / (a4 X + a5 Y + 1)    (3.66)
The eight coefficients in the above two equations can be determined with the help
of four groups of corresponding points on the image before and after rectification
(see [9]). Here, it can be considered to use the horizontal polar line [the intersection
of the plane composed of the baseline and a point in the scene and the imaging plane
(see Sect. 5.1.2)]. Therefore, it is necessary to select two polar lines in the image
before rectification and map them to two horizontal lines in the image after rectifi
cation, as shown in Fig. 3.18. The corresponding relationship can be
X1 = x1,  X2 = x2,  X3 = x3,  X4 = x4    (3.67)
The above correspondence can maintain the width of the image before and after
rectification, but there will be scale changes in the vertical direction (in order to map
non-horizontal polar lines to horizontal polar lines). In order to obtain the rectified
image, for each point (X, Y) on the rectified image, Eqs. (3.65) and (3.66) are to be
used to find the corresponding point (x, y) on the image before rectification.
Then, the gray level at the point (x, y) is assigned to the point (X, Y).
The above process should also be repeated for the right image. In order to ensure
that the corresponding polar lines on the rectified left image and right image
represent the same scanning line, it is necessary to map the corresponding polar
lines of the two images to the same scanning line in the rectified images, so the
Y coordinate in Eq. (3.68) should be used when correcting both the left image and
right image.
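The eight coefficients can be estimated by stacking the two equations for each of the four point correspondences into one linear system; a minimal sketch (hypothetical function name, illustrative points) is:

```python
import numpy as np

def projective_coefficients(rectified_pts, original_pts):
    """Estimate a1..a8 of Eqs. (3.65)-(3.66) from four corresponding points.
    rectified_pts : 4x2 array of (X, Y) on the rectified image
    original_pts  : 4x2 array of (x, y) on the image before rectification
    Returns the vector (a1, ..., a8) such that
        x = (a1*X + a2*Y + a3) / (a4*X + a5*Y + 1)
        y = (a6*X + a7*Y + a8) / (a4*X + a5*Y + 1)
    """
    A, b = [], []
    for (X, Y), (x, y) in zip(np.asarray(rectified_pts, float),
                              np.asarray(original_pts, float)):
        A.append([X, Y, 1, -X * x, -Y * x, 0, 0, 0]); b.append(x)
        A.append([0, 0, 0, -X * y, -Y * y, X, Y, 1]); b.append(y)
    return np.linalg.solve(np.asarray(A), np.asarray(b))

# Example: four assumed correspondences (illustrative values only)
rect = [(0, 0), (100, 0), (100, 100), (0, 100)]
orig = [(2, 3), (98, 5), (95, 104), (4, 99)]
print(projective_coefficients(rect, orig))
```

The gray level sampled at the computed point (x, y) is then assigned to (X, Y), as described above.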
When using the binocular lateral mode or the binocular convergent horizontal mode, the
parallax needs to be calculated according to the triangle method, so the baseline
cannot be too short; otherwise, the accuracy of depth calculation will be affected.
However, when the baseline is long, the problem caused by field of view
misalignment will be more serious. At this time, the binocular axial mode, also
known as binocular longitudinal mode, can be considered. That is, the two cameras
are arranged in turn along the optical axis. This situation can also be seen as moving
the camera along the optical axis and acquiring the second image at a position closer
to the object than the first image, as shown in Fig. 3.19. In Fig. 3.19, only the XZ
plane is drawn, and the Y-axis is pointed from the inside to the outside of the paper.
The origin of the two camera coordinate systems that obtain the first image and the
second image only differs by B in the Z direction, where B is also the distance
(baseline) between the optical centers of the two cameras.
According to the geometric relationship in Fig. 3.19, we have
X / (Z - λ) = |x1| / λ    (3.69)

X / (Z - λ - B) = |x2| / λ    (3.70)
Combining Eqs. (3.69) and (3.70) can provide (only X is considered, similar for
Y)
X = B |x1|·|x2| / [λ (|x2| - |x1|)] = B |x1|·|x2| / (λd)    (3.71)

Z = λ + B |x2| / (|x2| - |x1|) = λ + B |x2| / d    (3.72)
Compared with the binocular lateral mode, the common field of view of the two
cameras in binocular axial mode is the field of view of the camera in front (the
camera that obtained the second image in Fig. 3.19), so the boundary of the common
field of view is easily determined, and the problem that the 3D space point is only
seen by one camera due to occlusion can be basically ruled out. However, since the
two cameras basically use the same angle of view to observe the scene, the benefits
of lengthening the baseline for depth calculation accuracy cannot be fully reflected.
In addition, the accuracy of parallax and depth calculation is related to the distance
between the 3D space point and the optical axis of the camera (e.g., as in Eq. 3.72,
the depth Z is related to |x2|, that is, the distance between the projection of the 3D
space point and the optical axis), which is different from the binocular
horizontal mode.
The relative height of the ground object can be obtained by taking two images of
the object in the air with the camera carried by an aircraft. In Fig. 3.20, W represents
the moving distance of the camera, H is the camera height, h is the relative
height difference between the two measuring points A and B, and (d1 - d2) corresponds
to the parallax between A and B in the two images. When d1 and d2 are much less
than W and h is much less than H, h can be simplified as follows:

h = H (1 - d2 / d1)    (3.73)
If the above conditions are not met, the X and Y coordinates in the image need to
be corrected as follows:
x′ = x (H - h) / H    (3.74)

y′ = y (H - h) / H    (3.75)
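A small sketch of the aerial-height relations, assuming Eq. (3.73) in the reconstructed form h = H(1 - d2/d1) and the corrections of Eqs. (3.74)-(3.75); names and values are illustrative:

```python
def relative_height(d1, d2, H):
    """Relative height difference between two ground points from their
    parallaxes d1 and d2 in the two aerial images (Eq. 3.73 as
    reconstructed here): h = H * (1 - d2 / d1)."""
    return H * (1.0 - d2 / d1)

def correct_image_coords(x, y, h, H):
    """Coordinate correction of Eqs. (3.74)-(3.75) when the simplifying
    assumptions do not hold: x' = x*(H - h)/H, y' = y*(H - h)/H."""
    s = (H - h) / H
    return x * s, y * s

# Example with assumed flight height and parallaxes (illustrative only)
h = relative_height(d1=12.4, d2=11.9, H=1500.0)
print(h, correct_image_coords(35.0, -12.0, h, 1500.0))
```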
When the object is close, the object can be rotated to obtain two images. A
schematic diagram is given in Fig. 3.21a, where δ represents a given rotation angle.
At this time, the horizontal distances between the two object points A and B are
different in the two images, d1 and d2, respectively, as shown in Fig. 3.21b.
Fig. 3.21 Rotating the object to obtain two images to measure the relative height
The inclination angle θ of the line connecting A and B then satisfies

θ = tan⁻¹[(cos δ - d2/d1) / sin δ]    (3.76)
The origin of single-pixel imaging can be traced back to the point-by-point scanning
imaging more than 100 years ago. Nowadays, single-pixel imaging mainly refers to
using only a single-pixel detector without spatial resolution to record image infor
mation. The term single-pixel imaging first appeared in a 2008 publication [10],
a work combining it with compressed sensing. In that work, a digital
image of the object, and a single-pixel detector is used to obtain the total energy of
the modulated image. In the next reconstruction process, the Haar wavelet basis is
used as the sparse sampling basis to realize the sparse transformation of the image, so
that the system can recover a clear reconstructed image from the underdetermined
sampling data. In the same year, some researchers proposed a theoretical model of
computational ghost imaging (CGI) based on the correlation of thermal light
intensity [11]. By using a spatial light modulator to generate a light field with
known spatial intensity distribution, this provides another implementation scheme
for single-pixel imaging.
The imaging schemes for single-pixel imaging and computational ghost imaging are
essentially the same [12]. In terms of implementation methods, the main difference
between early single-pixel imaging and computational ghost imaging is the location
of the spatial modulation of the optical signal in the imaging system. Figure 3.22
shows the flowcharts of the two schemes. Among them, Fig. 3.22a
shows the flowchart of the single-pixel imaging scheme. The light source illuminates
the scene, and the reflected or transmitted light signal will be spatially modulated by
a digital micromirror device (DMD), then passed through the lens, and finally
received by the single-pixel detector. Figure 3.22b shows the flowchart of the
computational ghost imaging scheme. The light signal is first spatially modulated
by the DMD and then illuminates the scene. The reflected or transmitted light signal
passes through the lens and is finally received by the single-pixel detector. In
contrast, the computational ghost imaging scheme uses a pre-modulation strategy
to modulate the light emitted by the light source, also known as structured illumi
nation; the single-pixel imaging scheme uses a post-modulation strategy, and the
light signal reflected or transmitted from the scene is modulated, also known as
structured detection. Although the processes are somewhat different, their image
reconstruction algorithms can be generalized.
The imaging models for single-pixel imaging and computational ghost imaging
are as follows. Consider a 2D image I ∈ R^(K×L) with N pixels: N = K × L. To acquire
this image, a series of modulated mask patterns with spatial resolution needs to be fed
to the DMD. For the modulation mask sequence, P = [P1, P2, ..., PM] ∈ R^(M×K×L),
where Pi ∈ R^(K×L) represents the i-th frame modulation mask and M represents the
number of modulation masks. A single-pixel detector captures M total light intensity
values S = [s1, s2, ..., sM] ∈ R^M. If the 2D image I is expanded into a vector form,
that is, I ∈ R^N, and the modulation mask sequence is expressed in a 2D matrix form
P ∈ R^(M×N), then

P I = S    (3.78)

The imaging task is to use the measurement signal sequence S captured by the detector to solve for the 2D image I. If
we multiply both sides of Eq. (3.78) by the inverse matrix of P, we can get the image:

I = P⁻¹ S    (3.79)

The premise for Eq. (3.79) to hold is that the number of modulation masks
M = N. In addition, the matrix P needs to be orthogonal to ensure that Eq. (3.79) has a unique
solution.
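A minimal numerical sketch of the model of Eqs. (3.78)-(3.79), using an orthogonal Hadamard pattern matrix so that the inversion is trivial; in a real system the DMD produces 0/1 masks, and +/-1 patterns are usually realized differentially, so everything here is illustrative:

```python
import numpy as np
from scipy.linalg import hadamard

# Minimal numerical sketch of P I = S of Eq. (3.78), with an orthogonal
# Hadamard pattern matrix so that Eq. (3.79) reduces to I = P^T S / N.
K = L = 8                      # image size (N = K*L must be a power of 2 here)
N = K * L

rng = np.random.default_rng(0)
image = rng.random((K, L))     # stand-in for the unknown scene image
I_vec = image.reshape(N)       # flatten the image into a vector

P = hadamard(N).astype(float)  # M = N orthogonal +/-1 masks, one per row
S = P @ I_vec                  # single-pixel measurements (total intensities)

I_rec = (P.T @ S) / N          # reconstruction, since P P^T = N * Identity
print(np.allclose(I_rec.reshape(K, L), image))   # True
```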
The practical single-pixel camera has flexible imaging ability and high photoelectric
conversion efficiency, which greatly reduces the requirements for high-complexity
and high-cost photodetectors. Its imaging process is shown in Fig. 3.23.
An optical lens is used to project the object illuminated by the light source in the
scene onto the digital micromirror device (DMD). The DMD is a device that realizes light
modulation by reflecting incident light with a large number of tiny mirrors. Each unit
in the micromirror array can be controlled by a voltage signal to mechanically
flip by plus or minus 12°, so that the incident light is either reflected at a
symmetrical angle or absorbed without output. Thus, a random
measurement matrix consisting of 1s and 0s is formed. The light reflected at a
symmetrical angle is received by a photosensitive diode (a fast, sensitive,
low-cost, and efficient single-pixel sensor commonly used at present; an
avalanche photodiode is also used in low light). Its voltage changes with the intensity of
the reflected light. After quantization, a measurement value can be given. The
random measurement mode of each DMD corresponds to a row in the measurement
matrix. At this time, if the input image is regarded as a vector, the measurement
result is their dot product. Repeat this projection operation M times, and
M measurement results can be obtained by randomly configuring the turning angle
of each micromirror on the DMD for M times. According to the total variation
reconstruction method (which can be realized by DSP), the image can be
reconstructed with M measured values far smaller than the original scene image
pixels (its resolution is the same as that of the micromirror array). This is equivalent
to the realization of image data compression in the process of image acquisition.
Compared with traditional multi-pixel array scan cameras, single-pixel cameras
have the following advantages:
1. The energy collection efficiency of single-pixel detection is higher, and the dark
noise is lower, which is suitable for extremely weak light and long-distance
scenes.
2. The single-pixel detection sensitivity is high, the advantages in invisible light
band and unconventional imaging are obvious, and the cost is low.
3. High temporal resolution, which can be used for 3D imaging of objects.
Of course, the biggest disadvantage of single-pixel imaging is that it requires
multiple measurements to image (essentially trading time for space). Theoretically, if
an image with N pixels is to be captured, at least N mask patterns are required for
modulation, that is, at least N measurements are required, which greatly limits the
development of single-pixel imaging applications. In practice, due to the sparseness
of natural image signals, the compressive sensing (CS) algorithm can be used to
reduce the number of measurements and make it practical.
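As an illustration of how sparsity allows M << N measurements, the following sketch measures a DCT-sparse test image with random +/-1 masks and estimates it by iterative soft-thresholding (ISTA); this is a generic compressive-sensing demonstration, not the specific algorithm used in [10], and all sizes and parameters are illustrative:

```python
import numpy as np
from scipy.fft import dctn, idctn

# Generic compressive-sensing sketch: measure a DCT-sparse test image with
# M << N random +/-1 masks and estimate it by iterative soft-thresholding
# (ISTA) applied to its 2D DCT coefficients.
rng = np.random.default_rng(1)
K = L = 16
N = K * L
M = N // 2                                    # half of N measurements

alpha_true = np.zeros((K, L))                 # sparse DCT coefficients
alpha_true[tuple(rng.integers(0, K, (2, 8)))] = rng.normal(0.0, 1.0, 8)
image = idctn(alpha_true, norm='ortho')       # test scene

Phi = rng.choice([-1.0, 1.0], size=(M, N))    # random measurement masks
y = Phi @ image.reshape(N)                    # single-pixel measurements

A = lambda a: Phi @ idctn(a, norm='ortho').reshape(N)          # coeffs -> data
At = lambda r: dctn((Phi.T @ r).reshape(K, L), norm='ortho')   # adjoint

step = 1.0 / np.linalg.norm(Phi, 2) ** 2      # DCT is orthonormal, so ||A|| = ||Phi||
lam = 0.05                                    # l1 regularization weight
alpha = np.zeros((K, L))
for _ in range(500):                          # ISTA iterations
    g = alpha + step * At(y - A(alpha))                              # gradient step
    alpha = np.sign(g) * np.maximum(np.abs(g) - lam * step, 0.0)     # soft threshold

# relative error of the ISTA estimate
print(np.linalg.norm(idctn(alpha, norm='ortho') - image) / np.linalg.norm(image))
```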
The imaging effect of a single-pixel camera is still a certain distance away from
the imaging effect of a common CCD camera. Figure 3.24 presents a set of
examples. Figure 3.24a is the imaging effect of the white letter R on a piece of
black paper, and the number of pixels is 256 × 256; Fig. 3.24b is the imaging effect
of a single-pixel camera, where M is 16% of the number of pixels (11,000 measurements
were taken); Fig. 3.24c is another imaging effect with a single-pixel camera,
where M is 2% of the number of pixels (1300 measurements were taken).
As can be seen from the figure, although the number of measurements has been
reduced, the quality is not comparable. In addition, mechanically flipping the
micromirror takes a certain amount of time, so the imaging time to obtain a sufficient
number of measurements is relatively long (often measured in minutes). In fact, in
the visible light range, imaging with a single-pixel camera costs more than a typical
CCD camera or CMOS camera. This is because the visible spectrum is consistent
with the photoelectric response region of silicon materials, so the integration of CCD
or CMOS devices is high and the price is low. But in other spectral ranges, such as
the infrared spectrum, single-pixel cameras have advantages. Since detection devices
in the infrared spectrum are expensive, single-pixel cameras can compete with them.
Other application areas of single-pixel imaging include imaging radar (LiDAR), terahertz
imaging, and X-ray imaging. Since single-pixel imaging provides a solution that can
be imaged with a single detector and a large field of view illumination, single-pixel
cameras have potential in some imaging needs where detection and illumination
technologies are immature.
The previous discussion is all about 2D imaging, and it is also convenient to extend
single-pixel imaging from 2D imaging to 3D imaging. At present, the main methods
can be divided into two types: direct method and reconstruction method.
1. Direct method
The expansion of single-pixel imaging from 2D imaging to 3D imaging has
natural advantages. Since the single-pixel detector acquires the scene point by point,
the arrival time of the light at the detector can be recorded, and depth information can be
measured by measuring the time of flight of light with the help of the time-of-flight
method in Sect. 3.2.2 [13]. In order to obtain high-precision depth information,
such methods require detectors with high temporal resolution.
2. Reconstruction method
Combining 2D images obtained from different angles with single-pixel imag
ing results in a 3D image with depth information. For example, 3D reconstruction
can be performed using photometric stereo techniques (see Sect. 7.2) and multiple
single-pixel detectors located at different locations [14]. Among them, the corre
lation between the light intensity sequence captured by each single-pixel detector
and the corresponding structured mask sequence is used to reconstruct the 2D
image. Since each 2D image can be regarded as the imaging result from different
angles of the scene, the 3D image can be reconstructed accurately through pixel
matching.
Two types of schemes that are more common in current single-pixel 3D imaging
research and applications are introduced below.
Methods based on intensity information mainly rely on the technique of shape from
shading (see Sect. 9.1). An experimental device that reconstructs the scene surface
based on the shape from shading places four single-pixel detectors equidistant from
the top, bottom, left, and right of the light source to detect the light field projected on
the scene surface and reflected back [15]. Because the single-pixel detector has no
spatial resolution capability, the projection device at the light source will uniquely
determine the resolution of the reconstructed object. According to the reciprocity
principle of the imaging system, the distribution orientation of the single-pixel
detector will determine the shade distribution of the reconstructed object. Here,
four single-pixel detectors with different viewing angles will obtain different
detection values due to the change of the orientation of the scene surface. After
calculation, 2D images with different light and dark distributions at different viewing
angles can be obtained.
Further, under the premise that the reflectivity of the scene surface is the same
everywhere, the brightness value is reconstructed according to the detection values
of different detectors, and the surface normal vectors of different pixel points can be
obtained. With the surface normal vectors, the gradient distribution among adjacent
pixels can be further obtained. Starting from a given point on the surface of the
object, the depth change of the object can be initially obtained, and the 3D recon
struction effect of classic stereovision can be obtained after the subsequent optimi
zation steps. On this basis, a 3D single-pixel video imaging method has also
emerged, which uses a single-pixel compression method called evolutionary com
pressed sensing, which can preserve spatial resolution to a large extent while
projecting at high frame rates [16].
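A minimal sketch of the normal-estimation step under a Lambertian assumption, given the four shaded 2D reconstructions and the (assumed known) detector directions; this is a generic photometric-stereo least-squares solve, not the exact pipeline of [15]:

```python
import numpy as np

def normals_from_four_views(images, directions):
    """Least-squares surface normals under a Lambertian model.
    images     : array of shape (4, H, W), the four shaded 2D reconstructions
    directions : array of shape (4, 3), unit vectors of the four detector
                 (or effective lighting) directions, assumed known
    Returns (normals, albedo) with shapes (H, W, 3) and (H, W)."""
    I = np.asarray(images, float)
    Lmat = np.asarray(directions, float)
    H, W = I.shape[1:]
    # Solve Lmat @ g = i for every pixel at once; g encodes albedo * normal
    G, *_ = np.linalg.lstsq(Lmat, I.reshape(4, H * W), rcond=None)
    G = G.T.reshape(H, W, 3)
    albedo = np.linalg.norm(G, axis=-1)
    normals = G / np.maximum(albedo[..., None], 1e-12)
    return normals, albedo

# Synthetic check: a single normal observed from four assumed directions
dirs = np.array([[0.3, 0, 1], [-0.3, 0, 1], [0, 0.3, 1], [0, -0.3, 1]], float)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
n_true = np.array([0.2, -0.1, 0.97]); n_true /= np.linalg.norm(n_true)
imgs = (dirs @ n_true).reshape(4, 1, 1) * np.ones((4, 8, 8))
n_est, _ = normals_from_four_views(imgs, dirs)
print(np.allclose(n_est[0, 0], n_true))   # True
```

The recovered normals can then be integrated into a depth map, as the text describes, by accumulating the gradients from a chosen starting point.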
The principle of this type of method is similar to structured light imaging (see Sect.
3.2.4). The common ones mainly include Fourier single-pixel 3D imaging technol
ogy and single-pixel 3D imaging technology based on digital grating.
1. Fourier single-pixel 3D imaging technology
Fourier single-pixel imaging (FSI) reconstructs high-quality images by
acquiring the Fourier spectrum of the object image [17]. FSI uses phase shift to
generate structured light illumination for spectrum acquisition and then performs
inverse Fourier transform on the obtained spectrum to achieve a reconstructed
image. For a rectangular image composed of N pixels with a length K and a width
L, let u and v denote the spatial frequencies of the image in the X direction and the
Y direction, respectively; the resulting Fourier pattern can be expressed as

Pφ(x, y; u, v) = a + b cos[2π(ux/K + vy/L) + φ]    (3.80)

where φ represents the phase and a and b are the average intensity and the contrast of
the pattern. Specifically, in order to obtain the Fourier coefficients,
it is necessary to set different phase values at the same frequency to solve the
spectrum. The commonly used four-step phase-shift method uses four equally
spaced phases between 0 and 2π. If the corresponding four single-pixel detection
values are recorded as D0, Dπ/2, Dπ, and D3π/2, the Fourier coefficient corresponding
to the spatial frequency (u, v) can be expressed as

F(u, v) ∝ (D0 - Dπ) + j (Dπ/2 - D3π/2)    (3.81)

Equation (3.81) shows that for an image with N pixels, 4N samplings are required
to fully recover its image information.
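A small sketch of assembling a Fourier coefficient from the four detector readings and inverting the acquired spectrum; with the pattern of Eq. (3.80) as written above, the combination below equals 2b·F(u, v), but sign and scaling conventions can differ between implementations:

```python
import numpy as np

def fourier_coefficient(d0, d_half_pi, d_pi, d_3half_pi):
    """Complex Fourier coefficient of one spatial frequency from the four
    detector readings of the four-step scheme; with the pattern of
    Eq. (3.80), (D0 - Dpi) + j*(Dpi/2 - D3pi/2) equals 2b * F(u, v)."""
    return (d0 - d_pi) + 1j * (d_half_pi - d_3half_pi)

def fsi_reconstruct(spectrum):
    """Image from the fully acquired (conjugate-symmetric) spectrum, a K x L
    complex array indexed as a standard DFT, via an inverse FFT."""
    return np.real(np.fft.ifft2(spectrum))
```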
φ(x, y) = arctan{[I1(x, y) - I3(x, y)] / [I4(x, y) - I2(x, y)]}    (3.83)

As above, a wrapped phase with phase jumps within [-π, π] can be obtained.
After spatial phase unwrapping, the absolute phase distribution of the object can be
obtained.
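A minimal sketch of Eq. (3.83) followed by 2D spatial phase unwrapping (using scikit-image's unwrap_phase; the subsequent phase-to-height conversion depends on the particular system and is not included):

```python
import numpy as np
from skimage.restoration import unwrap_phase

def fringe_phase(i1, i2, i3, i4):
    """Wrapped phase of Eq. (3.83), arctan[(I1 - I3) / (I4 - I2)], computed
    with arctan2 so the result covers (-pi, pi]."""
    i1, i2, i3, i4 = (np.asarray(a, dtype=float) for a in (i1, i2, i3, i4))
    return np.arctan2(i1 - i3, i4 - i2)

# wrapped phase -> absolute phase by 2D spatial unwrapping:
# absolute_phase = unwrap_phase(fringe_phase(I1, I2, I3, I4))
```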
Binocular vision can realize human stereoscopic vision, that is, to perceive the
depth based on the difference of image positions between the left and right retinas,
because each eye has a slightly different angle to observe the scene (resulting in
binocular parallax). In order to make full use of this ability of human vision, we
should not only consider the characteristics of human vision when obtaining images
from nature but also consider the characteristics of human vision when displaying
these images to people.
In helmet-mounted displays (HMD), there are mainly three types of visual displays
used [19]: (1) monocular (one eye is used to view one image), (2) binocular
(biocular; two eyes are used to view two identical images), and (3) binocular
stereo (two eyes are used to view two different images). The connections and
differences of three types are shown in Fig. 3.25.
Currently, there is a trend from monocular to binocular in terms of display. The
original display was monocular, and the current display is binocular but has two
independent optical paths, so it is possible to use different images to display
information to create stereo depth in binocular stereo mode (to display stereo or
3D image).
In the real world, when a person focuses the gaze on an object, the two eyes converge
so that the object falls on the fovea of each retina, with little or no parallax. This
situation is shown in Fig. 3.26.
In Fig. 3.26, Object F is the object in focus, and the arc passing through this
fixation point is called the horopter (the locus of all the
points in space that fall on corresponding image points of the retinas of the two
eyes). This is equivalent to setting a baseline from which the relative depth can be
judged. On either side of the line of sight, there is a region where images can be
fused, where a single object at a different depth from the focal object can be
perceived. This spatial region is known as the Panum’s zone of fusion or zone of
clear single binocular vision (ZCSBV). Objects falling in front of Panum’s zone
(cross-parallax zone) or behind (non-cross-parallax zone) will exhibit diplopia,
which is less reliable or accurate though still facilitating depth perception in the
form of qualitative stereopsis. For example, Object A is inside the Panum’s zone of
fusion and will therefore be considered as a single image point, while Object B is
outside the Panum’s zone of fusion and thus will have diplopia.
The spatial positioning of the horopter and the resulting constant changes in the
Panum's zone of fusion will depend on where the individual focuses and on the
vergence of the eyes. Its size will vary between and within individuals, depending on
factors such as fatigue, brightness, and pupil size, but a difference of about 10-15
arcmin from the focus can provide a clear depth cue when viewed in the fovea. However,
the range generally considered to provide comfortable binocular viewing without
adverse symptoms is only the middle third of the Panum's zone of fusion (approximately
0.5 crossed and uncrossed diopters).
References
1. Liu XS, Li AH, Deng ZJ, et al. (2022) Advances in three-dimensional imaging technologies
based on single-camera stereo vision. Laser and Optoelectronics Progress, 59(14): 87-105.
2. Ballard DH, Brown CM (1982) Computer Vision, Prentice-Hall, London.
3. Li F, Wang J, Liu XY, et al. (2020) Principle and Application of 3D Laser Scanning. Earthquake
Press, Beijing.
4. Yang BS, Dong Z (2020) Point Cloud Intelligent Processing. Science Press, Beijing.
5. Shapiro L, Stockman G (2001) Computer Vision. Prentice Hall, London.
6. Dorrah AH, Capasso F (2022) Tunable structured light with flat optics. Science 376(6591):
367-377.
7. Liu XL (1998) Optical Vision Sensing. Beijing Science and Technology Press, Beijing.
8. Goshtasby AA (2006) 2-D and 3-D Image Registration: for Medical, Remote Sensing, and
Industrial Applications. Wiley-Interscience, Hoboken.
9. Zhang Y-J (2017) Image Engineering, Vol.1: Image Processing. De Gruyter, Germany.
10. Duarte MF, Davenport MA, Takbar D, et al. (2008) Single-pixel imaging via compressive
sampling. IEEE Signal Processing Magazine 25(2): 83-89.
11. Shapiro JH (2008) Computational ghost imaging. Physical Review A 78(6): 061802.
12. Zhai XL, Wu XY, Sun YW, et al (2021) Theory and approach of single-pixel imaging. Infrared
and Laser Engineering 50(12): 20211061.
13. Howland GA, Lum DJ, Ware MR, et al. (2013) Photon counting compressive depth mapping.
Optics Express 21(20): 23822-23837.
14. Sun BQ, Edgar MP, Bowman R, et al. (2013) 3D computational imaging with single-pixel
detectors. Science 340(6134): 844-847.
15. Sun BQ, Jiang S, Ma YY, et al. (2020) Application and development of single pixel imaging in
the special wavebands and 3D imaging. Infrared and Laser Engineering 49(3): 0303016.
16. Zhang Y, Edgar MP, Sun BQ, et al. (2016) 3D single-pixel video. Journal of Optics 18(3):
35203.
17. Zhang ZB, Ma X, Zhang JG (2015) Single-pixel imaging by means of Fourier spectrum
acquisition. Nature Communications 6(1): 1-6.
18. Radwell N, Mitchell KJ, Gibson GM, et al. (2014) Single-pixel infrared and visible microscope.
Optica 1(5): 285-289.
19. Posselt BN, Winterbottom M (2021) Are new vision standards and tests needed for military
aircrew using 3D stereo helmet-mounted displays? BMJ Military Health 167: 442-445.
Chapter 4
3D Point Cloud Data and Processing
3D point cloud data can be obtained by laser scanning or photogrammetry and can
also be seen as a representation of 3D digitization of the physical world. Point cloud
data is a kind of temporal and spatial data. Its data structure is relatively simple, its
storage space is relatively compact, and its representation of local details of complex
surfaces is relatively complete. It has been widely used in many fields [1]. However,
3D point cloud data often lacks correlations with each other, and the amount of data
is very large, which brings many challenges to its processing [2].
The sections of this chapter will be arranged as follows.
Section 4.1 first gives an overview of point cloud data, mainly the sources of point
cloud data, its different forms, and its processing tasks.
Section 4.2 discusses the preprocessing of point cloud data, which includes point
cloud hole filling, point cloud data denoising, point cloud data reduction or
compression, multi-platform point cloud data registration, as well as point cloud
data and image data registration.
Section 4.3 introduces the modeling of laser point clouds and discusses the
Delaunay triangulation method and the patch fitting method, respectively.
Section 4.4 introduces texture mapping for 3D models and discusses color texture
mapping, geometric texture mapping, and procedural texture mapping, respectively.
Section 4.5 introduces the description of local features of the point cloud and
presents description methods using the orientation histogram label, the rotational
projection statistics, and the tri-orthogonal local depth map, respectively.
Section 4.6 discusses deep learning methods in point cloud scene understanding,
mainly the challenges faced and various network models.
Section 4.7 introduces the registration of point cloud data with the help of bionic
optimization, analyzes cuckoo search and improved cuckoo search techniques, and
discusses their application in point cloud registration.
Point cloud data has its particularity in terms of acquisition method, acquisition
equipment, data form, storage format, etc., which also leads to differences in its
processing tasks and requirements.
At present, the acquisition methods of point cloud data mainly include laser scanning
methods and photogrammetry methods, which complement each other in practice,
such as LiDAR described in Sect. 3.2.3.
The laser scanning mode integrates laser scanners, global positioning systems,
and inertial measurement units on different platforms to jointly calculate the position
and attitude information of the laser transmitter and the distance to the object region
to obtain 3D point cloud in the object region.
The photogrammetry mode restores the positions and attitudes of the captured
multi-view images with specific professional software and generates a dense image
point cloud with color information.
The difference between point cloud data obtained by 3D laser scanning technol
ogy and photogrammetry technology is as follows:
1. Different data sources: 3D laser scanning technology directly collects laser point
clouds (without reprocessing); the obtained point clouds have high precision,
uniform data, regular distribution, and obvious separation between objects.
Photogrammetry technology uses continuous shooting with certain characteristics:
the overlapping images are spatially positioned, and a point cloud of
densified points is obtained by means of adjustment. These point clouds often
have large fluctuations, cluttered distribution, high noise, and poor accuracy. All
points in the point cloud are often connected as a whole.
2. The data registration methods are different: the registration of the laser point
cloud is carried out through the coordinate registration of the points with the same
name between each station; the point cloud of photogrammetry generates the
overall point cloud according to the methods of internal orientation, relative
orientation, and absolute orientation.
3. Different coordinate conversion methods: 3D laser scanning only needs to mea
sure control points during the transformation of geodetic coordinates, and relative
coordinates can also be used; photogrammetry often requires auxiliary control
measurements for high-precision reconstruction of 3D point cloud data.
4. The measurement accuracy is different: the distribution of laser point clouds is
regular and uniform, and the accuracy is high; the photogrammetric point cloud is
generated by point densification, a process that is greatly affected by
the image matching accuracy, so its accuracy is relatively low.
5. The construction methods of 3D models are different: laser point clouds generate
3D ground models by filtering out ground points, and accurate 3D building
models can then be constructed.
Point cloud data based on different acquisition methods have their own characteris
tics, and some advantages and disadvantages of different point cloud data types are
shown in Table 4.2 [3].
After acquiring the 3D point cloud, multiple processing steps need to be performed to make
full use of it. The main contents include the following [4]:
1. Point cloud quality improvement
Point cloud quality improvement includes point cloud position correction,
point cloud reflection intensity correction, point cloud data attribute
integration, etc.
2. Point cloud model construction
The point cloud model construction performs the construction of a data
model (responsible for basic operations such as point cloud storage, management,
query, index, etc., as well as the design of data model and logic model), a
processing model (responsible for point cloud preprocessing, point cloud feature
extraction, point cloud classification, etc.), and a representation model (responsi
ble for application analysis of point cloud processing results).
3. Point cloud feature description
The point cloud feature description is to characterize the point cloud mor
phological structure, which is mainly divided into artificially designed features
and deep network learning features. The features of artificial design depend on the
prior knowledge of the designer and often have certain parameter sensitivity, for
example, eigenvalue-based descriptors, spin images, fast point feature histo
grams, rotational projection statistical feature descriptions, binary shape contexts,
etc. The features of deep network learning are automatically learned from a large
amount of training data based on deep learning, which can contain a large number
of parameters and have strong description ability. According to different deep
learning models, the features learned by deep networks are divided into three
categories: voxel-based, multi-view-based, and irregular point-based.
4. Point cloud semantic information extraction
Point cloud semantic information extraction refers to identifying and
extracting object elements from a large number of disorganized point cloud
data, providing the underlying objects and analysis basis for high-level under
standing of the scene. On the one hand, the point cloud scene contains the high-
density and high-precision 3D information of the object, which provides the real
3D perspective and miniature of the object. On the other hand, the high-density,
massive, spatial discrete characteristics of point cloud and the incomplete data of
3D objects in the scene, the overlap, the occlusion, the similarity, and other
phenomena among objects also bring great challenges to semantic information
extraction.
5. Structural reconstruction of point cloud objects and scene understanding
In order to describe the function and structure of the object in the point cloud
scene and the positional relationship between multiple objects, it is necessary to
perform structural reconstruction of point cloud objects and scene under
standing, i.e., to represent the object in the point cloud scene in a structured way
to support complex computational analysis and further scene interpretation. The
point cloud-based 3D object structural reconstruction is different from the mesh
structure-based digital surface model reconstruction. The key of the former is to
accurately extract the 3D boundaries of different functional structures, so as to
convert the discrete and disordered point cloud into a geometric primitive com
bination model with topological characteristics.
Numerous point cloud datasets have been released by many universities and
industries in recent years, which can provide a fair comparison for testing various
methods. These public benchmark datasets consist of virtual or real scenes.
When the point cloud data is acquired, it is imperfect and incomplete due to various
reasons, and the amount of data is very large, so it often needs to be preprocessed
first (for an algorithm overview, see [19]). Common point cloud data preprocessing
tasks include hole filling, denoising, reduction/compression, registration, etc.
Due to factors related to the scanned object itself (such as self-occlusion, the surface
normal being nearly parallel to the incident laser line, and various factors that cause
insufficient reflected light intensity), the point cloud data will be missing in some
positions, forming holes. For example, the 3D point cloud obtained by scanning
the human body often has holes at the top of the head, the underarms, and other
body parts.
For hole filling of point cloud data, in addition to using common repair methods
(such as reverse engineering preprocessing), the point cloud data can also be
converted into mesh form first and then repaired by mesh model-based methods
[20]. The main steps of a three-stage hole-filling method are as follows: (1) reconstruct
the scanned point cloud into a triangular mesh model, and identify the
holes on it; (2) determine the type of each hole boundary; and (3) repair the hole
according to the type of its boundary. The missing points
can be calculated using nonuniform rational B-spline (NURBS) curves [21].
The factors that cause noise points in point cloud data mainly include the following
[22]:
1. Errors caused by the surface factors of the measured object, such as surface
roughness, material texture, distance, angle, etc. (the surface reflectivity may be
so low that the incident laser is absorbed without sufficient reflection, the
distance may be too far, or the incident angle may be too large, resulting in a weak reflected
signal). This often needs to be solved by adjusting the position, angle, distance,
etc. of the equipment.
2. Errors caused by the scanning system itself, such as equipment ranging, posi
tioning accuracy, resolution, laser spot size, scanner vibration, etc.
3. Accidental noise points, such as the interference of external factors (moving
objects, flying insects) during the acquisition process.
The statistical outlier removal filter is mainly used to remove outliers. Its basic idea
is to discriminate outlier points by examining the distribution density of points
in the input point cloud. Where the points are more concentrated, the
distribution density is greater; where the distribution density is small, the points are
sparse. For example, the average distance between each point and its k nearest
neighbors can be calculated, and points whose average distance deviates too much
from the global mean (e.g., beyond a multiple of the standard deviation) are regarded
as outliers and removed.
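A compact sketch of such a statistical outlier removal filter using a k-d tree (parameter names and defaults are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def statistical_outlier_removal(points, k=16, std_ratio=1.0):
    """Remove points whose mean distance to their k nearest neighbors exceeds
    (global mean + std_ratio * global standard deviation) of that quantity.
    points : (N, 3) array; returns the filtered (M, 3) array."""
    pts = np.asarray(points, float)
    dists, _ = cKDTree(pts).query(pts, k=k + 1)   # first neighbor is the point itself
    mean_d = dists[:, 1:].mean(axis=1)
    keep = mean_d <= mean_d.mean() + std_ratio * mean_d.std()
    return pts[keep]
```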
The radius outlier point removal filter uses the radius as the criterion to eliminate
all the points that do not reach a sufficient number of neighbor points in a certain
range of the input point cloud. Set the threshold of the number of neighbor points to
N, take (one by one) the current point as the center, and determine a sphere of radius
d. Calculate the number of neighbor points in the current sphere: when the number is
greater than N, the point is retained; otherwise, it is eliminated.
The main steps of the radius outlier removal method are as follows:
1. Calculate the number of neighbors n in the sphere with the radius d of each point
in the input point cloud.
2. Set the number of points threshold N.
3. If n < N, mark the point as an outlier point; if n ≥ N, do not mark it.
4. Repeat the above process for all points, and finally remove the marked points.
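A compact sketch of the radius outlier removal steps above using scipy's k-d tree (parameter names and defaults are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def radius_outlier_removal(points, d=0.05, N_min=5):
    """Keep only points having at least N_min neighbor points within radius d,
    following the steps listed above.
    points : (M, 3) array; d and N_min are the radius and count thresholds."""
    pts = np.asarray(points, float)
    tree = cKDTree(pts)
    # query_ball_point counts the point itself, hence the +1 below
    counts = np.array([len(idx) for idx in tree.query_ball_point(pts, r=d)])
    return pts[counts >= N_min + 1]
```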
In point cloud data obtained from aerial photography, the ground region is often included. The ground region needs to be removed when generating a digital ground model from LiDAR data; otherwise, it will interfere with the segmentation of the scene. The main filtering methods for points above the ground region include elevation-based filtering, model-based filtering, region-growing-based filtering, moving-window-based filtering, and triangulation-based filtering. Their comparison is shown in Table 4.5 [3].
Due to the fast speed and high acquisition density of 3D laser scanning technology,
the scale of point cloud data is often large. This may lead to problems such as large
amount of calculation, high memory usage, and slow processing speed.
3D laser point clouds can be compressed in different ways. The simplest method
is the barycentric compression method, that is, only the points closest to the
barycenter in the neighborhood after point cloud rasterization are retained. However,
this method easily leads to missing features of point cloud data. In addition, a data
structure called octree can be used for compression by the bounding box method,
which can not only reduce the amount of data but also facilitate the calculation of
normal vectors, tangent planes, and curvature values of local neighborhood
data [24].
The octree structure is implemented by recursively dividing the point cloud space. First, the space bounding box (circumscribed cube) of the point cloud data is constructed as the root; it is then divided into eight sub-cubes of the same size as the child nodes of the root. This recursive division is continued until the side length of the smallest sub-cube equals the given point distance, at which point each axis of the point cloud space has been divided into a power-of-2 number of sub-cubes.
For the recursive octal subdivision of the space bounding box, assuming that the number of division layers is N, the octree space model can be represented by an N-layer octree. Each cube in the octree space model has a one-to-one correspondence with a node in the octree, and its position in the octree space model can be represented by the octree code Q of the corresponding node:

Q = q_{N−1} q_{N−2} ... q_1 q_0        (4.1)

In the equation, each node serial number q_m is an octal digit, m ∈ {0, 1, ..., N − 1}; q_m is the serial number of the node among its sibling nodes, and q_{m+1} is the serial number of the node's parent among its sibling nodes. In this way, from q_0 to q_{N−1}, the path from each leaf node of the octree to the root of the tree is completely represented.
The specific steps of point cloud data encoding are as follows:
1. Determine the number of division layers N of the point cloud octree: N should satisfy d_0 · 2^N > d_max, where d_0 is the point distance specified for simplification and d_max is the maximum side length of the point cloud bounding box.
2. Determine the spatial index value (i, j, k) of the sub-cube in which a point cloud data point is located: if the data point is P(x, y, z), then

i = ⌊(x − x_min)/d_0⌋,  j = ⌊(y − y_min)/d_0⌋,  k = ⌊(z − z_min)/d_0⌋        (4.2)

Among them, (x_min, y_min, z_min) represents the minimum vertex coordinates of the bounding box corresponding to the root node.
3. Determine the encoding of the sub-cube in which the point cloud data point is located: convert the index value (i, j, k) to binary representation,

i = (i_{N−1} i_{N−2} ... i_0)_2,  j = (j_{N−1} j_{N−2} ... j_0)_2,  k = (k_{N−1} k_{N−2} ... k_0)_2,  q_m = i_m + 2 j_m + 4 k_m        (4.3)

where i_m, j_m, k_m ∈ {0, 1} and m ∈ {0, 1, ..., N − 1}. According to Eq. (4.1), the octree code Q of the node corresponding to the sub-cube can then be obtained.
4. From the octree code Q of the node corresponding to the sub-cube, the spatial index value (i, j, k) can also be calculated inversely:
i = Σ_{m=0}^{N−1} (q_m mod 2) · 2^m
j = Σ_{m=0}^{N−1} (⌊q_m / 2⌋ mod 2) · 2^m        (4.4)
k = Σ_{m=0}^{N−1} (⌊q_m / 4⌋ mod 2) · 2^m
Here, the gap between adjacent nodes in the X direction is set to 1, the gap
between adjacent nodes in the Y direction is 2, and the gap between adjacent nodes in
the Z direction is 4.
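Under the interpretation of Eqs. (4.1), (4.3), and (4.4) given above, the encoding and its inverse can be sketched as follows; the function names are hypothetical.

def encode_octree_code(i, j, k, N):
    """Interleave the binary digits of (i, j, k) into octal digits q_{N-1} ... q_0."""
    q = []
    for m in range(N - 1, -1, -1):
        i_m, j_m, k_m = (i >> m) & 1, (j >> m) & 1, (k >> m) & 1
        q.append(i_m + 2 * j_m + 4 * k_m)   # gaps of 1, 2, 4 in the X, Y, Z directions
    return q                                 # q[0] holds q_{N-1}, q[-1] holds q_0

def decode_octree_code(q):
    """Recover the spatial index (i, j, k) from the octal digits, cf. Eq. (4.4)."""
    N = len(q)
    i = j = k = 0
    for m in range(N):                       # q[N-1-m] is q_m in the notation above
        qm = q[N - 1 - m]
        i += (qm % 2) * 2 ** m
        j += ((qm // 2) % 2) * 2 ** m
        k += ((qm // 4) % 2) * 2 ** m
    return i, j, k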
In addition to the traditional bounding-box-based data reduction methods above, there are also scan-line-based methods and methods based on reducing the polygon count. The scan-line-based data reduction method exploits the characteristics of scanning: during line scanning, the points on each scan line lie in the same scan plane and are ordered, so the change in slope between successive points on a scan line can be used to decide whether a point can be removed [25]. For point cloud data for which a triangulated irregular network (TIN) model has already been constructed, data reduction can be performed by reducing the number of polygons in the model; a commonly used method is the common vertex compression method [25].
Registration methods can be divided into three categories: methods based on geometric features, methods based on surface features, and algorithms based on the iterative closest point (ICP) and its improvements.
The basic algorithm for iterative closest point registration has two main steps:
1. Quickly search for the closest point pair in adjacent point cloud datasets.
2. Calculate the corresponding transformation (translation and rotation) parameters
according to the correspondence between the coordinates of the closest
neighbors.
Let P and Q be two sets of point cloud data with P ⊆ Q. The specific steps of the algorithm are as follows (see Fig. 4.1):
1. Sample the reference point cloud and the object point cloud to determine the initial corresponding feature points and to improve the subsequent stitching speed.
2. Registration is performed by calculating the closest points. The registration can be divided into two steps: coarse registration and precise registration. Coarse registration establishes an approximate transformation between the coordinate systems of the two point clouds:

[X, Y, Z]^T = k R_x(α) R_y(β) R_z(γ) [x, y, z]^T + [x_0, y_0, z_0]^T        (4.5)

Fig. 4.1 Schematic diagram of coordinate system conversion for the complete imaging process
Among them, k is the scaling factor between the two coordinate systems; x_0, y_0, and z_0 are the translation amounts along the X, Y, and Z coordinate axes, respectively; α, β, and γ are the rotation angles around the three coordinate axes, respectively; and R_x(α)R_y(β)R_z(γ) represents the rotation matrix

R = R_x(α) R_y(β) R_z(γ)        (4.6)

It can be seen from Eq. (4.5) that, in order to match the point cloud data, it is necessary to extract three (or more) pairs of feature points (control points), or to arrange three (or more) object points in the common region, so as to calculate the six transformation parameters (α, β, γ, x_0, y_0, z_0).
Precise registration iterates on the point cloud data, starting from the coarse registration, to minimize the objective function value and achieve accurate, optimized registration. According to how the corresponding point set is determined, three methods can be used: point-to-point, point-to-projection, and point-to-surface. They define the closest point using the shortest spatial distance, the shortest projection distance, and the shortest normal distance, respectively.
Only the point-to-point-based registration method [24] is introduced below. Let the two point sets be P = {p_i, i = 0, 1, 2, ..., m} and U = {u_i, i = 0, 1, 2, ..., n}. A one-to-one correspondence between the points of the two point sets is not required, nor do the two point sets need to contain the same number of points; let m > n. The registration process is to compute the translation and rotation matrices between the two coordinate systems such that the distance between homologous points from P and U is minimized.
The main steps of precise registration are as follows:
1. Calculate the closest point, that is, for each point in the point set U, find the
closest corresponding point in the point set P by means of the distance measure,
and set the new point set composed of these points in the point set P as Q1 = {qi,
i = 0, 1, 2, ..., n}.
2. Using the quaternion method (see below), the registration between the point set
U and the point set Q1 is calculated, and the registration transformation matrices
R and T are obtained.
3. Perform coordinate transformation, that is, apply the transformation matrices R and T to the point set U to obtain U_1 = RU + T.
4. Calculate the distance measure between U_1 and Q_1: d_j = (Σ ‖U_1 − Q_1‖²)/N. Then calculate the closest point set of U_1 in P, transform to obtain U_2, and calculate d_{j+1} = (Σ ‖U_2 − Q_2‖²)/N. If d_{j+1} − d_j < ε (a preset threshold), end; otherwise, replace U with the point set U_1 and repeat the above steps.
The quaternion in Step 2 represents the rigid-body motion; a quaternion is a vector with four elements, which can be regarded as a 3 × 1 vector part plus a scalar part. The steps for using it to calculate the translation and rotation matrices are as follows:
1. Calculate the centroids of the point sets P = {p_i} and U = {u_i}, respectively:

p' = (1/m) Σ_{i=1}^{m} p_i        (4.7)

u' = (1/n) Σ_{i=1}^{n} u_i        (4.8)
2. Translate the point sets P = {p_i} and U = {u_i} relative to their centroids: q_i = p_i − p', v_i = u_i − u'.
3. Calculate the correlation matrix K from the translated point sets {q_i} and {v_i}:

K = (1/N) Σ_{i=1}^{N} q_i v_i^T        (4.9)
4. Construct the 4 × 4 symmetric matrix K' from the elements k_{ij} of K:

K' = [ k11 + k22 + k33     k23 − k32            k13 − k31            k12 − k21
       k23 − k32           k11 − k22 − k33      k12 + k21            k13 + k31
       k13 − k31           k12 + k21            −k11 + k22 − k33     k23 + k32
       k12 − k21           k13 + k31            k23 + k32            −k11 − k22 + k33 ]        (4.10)
5. Calculate the unit eigenvector (optimal selection vector) s* = [s_0, s_1, s_2, s_3]^T corresponding to the largest eigenvalue of K'.
6. Calculate the rotation matrix R with the help of the relationship between s* and R:

R = [ s_0² + s_1² − s_2² − s_3²    2(s_1 s_2 − s_0 s_3)          2(s_1 s_3 + s_0 s_2)
      2(s_1 s_2 + s_0 s_3)         s_0² − s_1² + s_2² − s_3²     2(s_2 s_3 − s_0 s_1)
      2(s_1 s_3 − s_0 s_2)         2(s_2 s_3 + s_0 s_1)          s_0² − s_1² − s_2² + s_3² ]        (4.11)
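To make the above steps concrete, the following is a minimal NumPy sketch that estimates the rotation and translation from a set of corresponding point pairs with Horn's quaternion method; the function name horn_quaternion_alignment is illustrative, the translation step (aligning the rotated centroids) is a standard completion not spelled out above, and the index conventions may differ slightly from those of Eqs. (4.9)-(4.11).

import numpy as np

def horn_quaternion_alignment(source, target):
    """Estimate R, t such that R @ source_i + t approximates target_i (Horn's quaternion method)."""
    cs = source.mean(axis=0)              # centroid of the source point set (steps 1-2)
    ct = target.mean(axis=0)              # centroid of the target point set
    Sc, Tc = source - cs, target - ct     # centered point sets
    M = Sc.T @ Tc                         # 3x3 cross-covariance matrix (step 3)
    Sxx, Sxy, Sxz = M[0]; Syx, Syy, Syz = M[1]; Szx, Szy, Szz = M[2]
    # 4x4 symmetric matrix whose largest-eigenvalue eigenvector is the optimal unit quaternion (steps 4-5)
    N4 = np.array([
        [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx],
        [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz],
        [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy],
        [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz]])
    _, V = np.linalg.eigh(N4)
    s0, s1, s2, s3 = V[:, -1]             # eigenvector of the largest eigenvalue
    R = np.array([                        # rotation matrix from the quaternion, cf. Eq. (4.11) (step 6)
        [s0*s0 + s1*s1 - s2*s2 - s3*s3, 2*(s1*s2 - s0*s3),             2*(s1*s3 + s0*s2)],
        [2*(s1*s2 + s0*s3),             s0*s0 - s1*s1 + s2*s2 - s3*s3, 2*(s2*s3 - s0*s1)],
        [2*(s1*s3 - s0*s2),             2*(s2*s3 + s0*s1),             s0*s0 - s1*s1 - s2*s2 + s3*s3]])
    t = ct - R @ cs                       # translation that aligns the centroids after rotation
    return R, t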
3D laser point clouds lack texture and spectral information, so many systems are also
equipped with color cameras to obtain color and texture information of the scene. By
registering 3D laser point clouds with color images, color point clouds with texture
attributes can be generated [24].
The registration of 2D optical images and 3D laser point clouds here is different
from the registration between the previous 3D point clouds. The methods are mainly
divided into three categories:
1. 2D-3D registration algorithm based on feature matching.
The basic principle of this type of algorithm is to use the corresponding
geometric features between the laser point cloud and the optical image for
registration, that is, to determine the relative registration parameters (translation
and rotation parameters).
2. 2D-3D registration algorithm based on statistics.
Here we mainly consider the method of automatically modeling the laser point cloud with a 3D surface model [24].
3. Optimize the newly formed triangle according to the optimization criterion (such
as swapping the diagonals), and add the formed triangle to the triangle linked list.
4. Repeat Step 2 until all points in the point set are inserted.
The insertion process of Step 2 can be described with the help of Fig. 4.2: Fig. 4.2a shows the insertion of a new point P into the existing triangle set formed by ΔABC and ΔBCD; Fig. 4.2b shows that the circumcircles of ΔABC and ΔBCD both contain point P, so both are influence triangles of point P; Fig. 4.2c shows the result of deleting the common edge of the influence triangles; Fig. 4.2d shows that the inserted point P is connected to all vertices of the two influence triangles (the newly formed triangles are added to the triangle linked list).
If the laser point cloud is first segmented to obtain patches, these patches can then be fitted to form parts of the 3D model. Point cloud segmentation divides the whole point cloud into multiple subregions, each corresponding one-to-one to a natural surface, so that each subregion contains only the scan points collected from a specific natural surface. Point cloud fitting constructs the geometric shape of the object represented by the point cloud by means of mathematical geometry from the segmented point cloud with certain characteristics.
There are many algorithms for segmentation of point clouds, which can be mainly
divided into edge-based, region-based, model-based, graph theory-based, and
cluster-based algorithms [3]. Among them, the simpler ones are K-means clustering
algorithm and region growing algorithm.
K-means clustering is a simple laser point cloud segmentation algorithm. The basic idea is to perform unsupervised classification on the data. The specific method is to update the value of each cluster center successively through an iterative process until the best clustering effect is obtained. Assuming that the sample set is to be divided into K categories, the algorithm can be described as follows:
1. Appropriately select the initial centers of the K classes.
2. In the i-th iteration, for any sample, calculate the distance to K centers, and assign
the sample to the class where the nearest center is located.
3. Use the mean to update the center value of each class:

Z_j = (1/n_j) Σ_{x ∈ K_j} x        (4.12)

where x represents a sample, n_j is the number of samples in the class, and K_j represents the j-th class.
4. For all K cluster centers, iterate according to Step 2 and Step 3 until the maximum number of iterations is reached, or until the change in the objective function between consecutive iterations is smaller than a preset threshold.
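The algorithm can be sketched in a few lines of NumPy; the random initialization of the centers and the convergence test on center movement are assumptions of this sketch.

import numpy as np

def kmeans_segment(points, K, max_iter=50, tol=1e-6):
    """Cluster 3D points into K groups by iteratively updating the cluster centers."""
    rng = np.random.default_rng(0)
    centers = points[rng.choice(len(points), K, replace=False)]    # step 1: initial centers
    for _ in range(max_iter):
        # step 2: assign every point to the class of its nearest center
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 3: update each center as the mean of the points assigned to it, cf. Eq. (4.12)
        new_centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(K)])
        # step 4: stop when the centers no longer move noticeably
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers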
The object represented by the laser point cloud can have various geometric features, the most basic being the plane. Let the space plane equation be ax + by + cz = d, where (a, b, c) is the unit normal vector of the plane, with a² + b² + c² = 1, and d is the distance from the origin to the plane, d > 0.
Assuming that a point cloud of N points approximately lying in a plane, {(x_i, y_i, z_i), i = 1, 2, ..., N}, has been obtained, the distance from any point to the plane is d_i = |a x_i + b y_i + c z_i − d|. Under the condition a² + b² + c² = 1, the best-fit plane should minimize

f = Σ_{i=1}^{N} (a x_i + b y_i + c z_i − d)²        (4.15)
Taking the derivative of Eq. (4.15) with respect to d and setting it to zero, we get

∂f/∂d = −2 Σ_{i=1}^{N} (a x_i + b y_i + c z_i − d) = 0        (4.16)

Solving this gives

d = a (Σ_i x_i)/N + b (Σ_i y_i)/N + c (Σ_i z_i)/N        (4.17)
Substituting Eq. (4.17) into the expression for the distance from a point to the plane, we have

d_i = |a (x_i − x̄) + b (y_i − ȳ) + c (z_i − z̄)|        (4.18)

where x̄, ȳ, and z̄ are the means of the x, y, and z coordinates. Further taking the derivatives of Eq. (4.15) with respect to a, b, and c, respectively, and setting them to zero leads to an eigenvalue problem for the 3 × 3 matrix A built from the centered coordinates. Therefore, the minimum eigenvalue of matrix A needs to be calculated, and the eigenvector [a b c]^T corresponding to the minimum eigenvalue is the plane normal vector to be calculated. With a, b, c, and d all calculated, the plane can be fitted.
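The eigenvalue formulation of the plane fit can be sketched as follows, assuming that A is the 3 × 3 matrix of second moments of the centered coordinates; the function name fit_plane is illustrative.

import numpy as np

def fit_plane(points):
    """Least-squares plane a*x + b*y + c*z = d through an N x 3 point cloud."""
    centroid = points.mean(axis=0)            # yields d via Eq. (4.17)
    centered = points - centroid
    A = centered.T @ centered                 # 3x3 matrix built from the centered coordinates
    _, eigvecs = np.linalg.eigh(A)            # eigenvalues are returned in ascending order
    normal = eigvecs[:, 0]                    # eigenvector of the minimum eigenvalue = plane normal
    d = float(normal @ centroid)              # distance term of the plane equation
    if d < 0:                                 # keep d non-negative, as assumed in the text
        normal, d = -normal, -d
    return normal, d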
Texture mapping can be seen as the process of mapping texture pixels in texture
space (via scene space) to pixels in screen space. It enhances realism by attaching an
image to the surface of a 3D object. The essence of texture mapping is to establish
two mapping relationships from screen space to texture space and from texture space
to screen space.
Depending on the texture function used, textures can be divided into 2D textures and 3D textures. 2D texture patterns are defined in 2D space. 3D texture, also known as solid texture, is a texture function with a one-to-one mapping between texture points and points in 3D object space.
According to the representation of texture, textures can be divided into color texture, geometric texture, and procedural texture. Color texture refers to patterns on a smooth surface that convey detail through variations in color or shading. It is generally used at the macro level to paste a real color pattern onto the surface of the scene to simulate realistic color. Geometric texture is composed of rough or irregular small bumps (concave-convex structures); it is a surface texture based on the microscopic geometric shape of the scene surface and is generally used at the sub-macro level to represent the unevenness, texture details, and changes of light and shade on the surface of the scene. Procedural textures describe dynamically changing natural phenomena (regular or irregular, such as water waves, clouds, smoke, etc.) and can be used to simulate complex, continuous surfaces.
From a mathematical point of view, if (u, v) denotes texture space and (x, y, z) denotes scene space, the mapping can be represented as (u, v) = F(x, y, z); if F is invertible, then (x, y, z) = F^{-1}(u, v). The texture mapping algorithm consists of three steps:
1. Define the texture object and obtain the texture.
2. Define the mapping function between texture space and scene space.
3. Select the resampling method for the texture to reduce deformation and aliasing of the mapped texture.
Color is mainly used to reflect the color of each point on the surface of the scene. If
the colors of adjacent points are combined, the texture characteristics can also be
reflected. There are different methods for color texture mapping [24].
The forward mapping method generally uses the Catmull algorithm. The algorithm
projects the texture pixel coordinates to the scene surface coordinates one by one
through the mapping function and then displays them in the screen space through
surface parameterization, as shown in Fig. 4.3. Specifically, first calculate the
position and size of the texture pixel coordinates on the scene surface by means of
projection, assign the center gray value of the texture pixels to the corresponding
point on the scene surface according to hyperbolic interpolation, and take the texture
color value assigned at the corresponding point as the surface texture attribute of the
sampling point in the center of the pixel on the surface. Then, use the lighting model
to simulate the calculation of the brightness at the surface point, and assign its gray
value. The forward mapping algorithm can be represented as (u, v) → (x, y) = [p(u, v), q(u, v)], where p and q both represent projection functions.
Forward mapping is relatively simple to implement. Because it accesses the texture pattern sequentially, it can save a lot of storage space and improve computing speed. The disadvantage is that the texture mapping value is only the gray value of the image, and the pixels of scene space and texture space do not correspond one-to-one: some regions have no corresponding texture pixels while others receive redundant ones, resulting in holes or multiple hits and causing deformation and confusion in the rendered graphics.
The reverse mapping method is also called the screen scanning method. It reversely
projects the coordinates of the scene surface to the coordinates of the texture space
through the mapping function, which can ensure the one-to-one correspondence
between the scene space and the texture space. After resampling the texture image
[Figure: texture space → scene space (projection) → screen space (surface parameterization)]
according to the projection, the pixel value of the coordinate center of the
corresponding scene surface space is calculated, and then the calculation result is
assigned to the scene surface. The reverse mapping algorithm can be represented as (x, y) → (u, v) = [f(x, y), g(x, y)], where f and g both represent projection functions.
The reverse mapping method needs to scan and search the scene space pixel by pixel and resample each pixel on the fly. In order to improve calculation efficiency, it needs to keep the texture pattern dynamically accessible, so it requires a lot of storage space. To improve computational efficiency, on the one hand, the search and matching of the texture image can be optimized; on the other hand, pixel information can be obtained preferentially for reconstructing 3D scenes.
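A minimal sketch of reverse (screen-scan) mapping is shown below; the user-supplied mapping function F, the normalized texture coordinates, and the nearest-neighbor resampling are assumptions of this sketch, not the book's specific implementation.

import numpy as np

def inverse_map_texture(texture, scene_shape, F):
    """For every scene pixel (x, y), sample the texture at (u, v) = F(x, y)."""
    H, W = texture.shape
    out = np.zeros(scene_shape, dtype=texture.dtype)
    for y in range(scene_shape[0]):
        for x in range(scene_shape[1]):
            u, v = F(x, y)                    # mapping from scene/screen space to texture space
            u = min(max(u, 0.0), 1.0)
            v = min(max(v, 0.0), 1.0)
            # nearest-neighbor resampling; bilinear interpolation would further reduce aliasing
            out[y, x] = texture[int(round(v * (H - 1))), int(round(u * (W - 1)))]
    return out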
Complex scene surfaces are often nonlinear, and it is difficult to directly parameter
ize with mathematical functions, that is, to directly establish the analytical relation
ship between texture space and scene space. One way to solve this problem is to
build an intermediate 3D surface, which decomposes the mapping from texture
space to scene space into a combination of two simple mappings from texture
space to intermediate 3D surface and then from intermediate 3D surface to scene
space.
The basic process of two-step texture mapping is as follows:
1. Establish a mapping from the 2D texture space to the intermediate 3D surface space, called the T mapping: T(u, v) → T'(x', y', z').
2. Remap the texture mapped onto the intermediate 3D surface space to the scene surface, which is called the T' mapping: T'(x', y', z') → O(x, y, z), where O(x, y, z) represents the scene space.
Geometric texture mapping is required for scene surfaces that are not smooth but bumpy and that produce random diffuse reflections under illumination. One method is to change the microscopic geometry of the scene surface by slightly perturbing the position of each sampling point on the scene surface, thereby changing the normal direction of the light at the scene surface and causing the surface brightness to change abruptly, which produces an uneven, realistic appearance (so geometric texture mapping is also called bump texture mapping).
In practice, in order to improve the sense of realism, it is necessary not only to
map the texture of the scene itself but also to map the environment of the scene, so
the flowchart of geometric texture mapping is shown in Fig. 4.4.
In Fig. 4.4, the environment map is implemented with the help of environment texture mapping, which maps the texture of the environment scene to the scene space, that is, it simulates the reflection of the surrounding environment by the scene surface.
The effective description of point cloud features is the basis of point cloud scene
understanding. The existing 3D point cloud feature descriptors include global feature
descriptors and local feature descriptors.
Global feature descriptors mainly describe the global features of an object. Common global feature descriptors include the ensemble of shape functions (ESF) and the viewpoint feature histogram (VFH) [3]. There are many kinds of local feature descriptors, such as the spin image, 3D shape context (3DSC), point feature histogram (PFH), fast point feature histogram (FPFH), etc. The principles and characteristics of these descriptors are shown in Table 4.6.
Since global feature descriptors require pre-segmentation of objects and ignore
shape details, it is difficult to identify incomplete or only partially visible scenes
from cluttered scenes where objects are interlaced and overlapped. In contrast, local
feature descriptors can describe local surface features within a certain neighborhood
range and have strong robustness to object interlacing, occlusion, and overlapping
and are more suitable for the recognition of incomplete or only partially visible
scenes.
A good local feature descriptor should be both descriptive and robust. On the one
hand, the feature descriptor should have a broad description ability, which can
describe the local surface shape, texture, echo intensity, and other information as
much as possible. On the other hand, feature descriptors should be robust to noise,
occlusion and overlap between objects, point density changes, etc.
The steps for obtaining the other three local feature descriptors will be described
in detail below.
Triple orthogonal local depth images (TOLDI) borrow the idea of two feature
descriptors, signature of histogram of orientation and rotational projection statistics
[30]. The main steps to obtain triple orthogonal local depth map feature descriptors
are as follows:
1. For a point p in the point cloud, first use the neighborhood points of p to construct a covariance matrix, and decompose the covariance matrix to obtain three eigenvalues (λ1 > λ2 > λ3) and the corresponding eigenvectors (e1, e2, e3).
2. Adjust the direction of e3 to make it consistent with the direction of the connections between most of the neighborhood points and p, and use p and e3 as the origin and Z-axis of the local coordinate system, respectively.
3. Project all neighborhood points to the tangent plane of the Z-axis, and calculate
the weighted sum of the connection vector between the projected point and p (the
farther the projection distance is, the smaller the weight, and the closer the
projection distance is, the greater the weight) as the X-axis, and then determine
the Y-axis from it.
4. Project all neighborhood points into the local coordinate system established
above, and project the converted point cloud onto the three coordinate planes of
XY, XZ, and YZ.
5. Grid the projected point cloud, and use the minimum projected distance of the
point in the grid as the grid value to form a projected distance image.
6. Concatenate the projection distance maps of the three projection surfaces to
obtain the TOLDI descriptor (in the form of a histogram).
The improvement of computing power and the development of tensor data theory
have promoted the widespread application of deep learning in scene understanding
[31]. At present, point cloud understanding based on deep learning mainly faces
three challenges [3]:
1. Point cloud data consists of points distributed arbitrarily in space and is therefore unstructured. Convolutional neural network filters cannot be used directly because there is no structured grid.
2. Point cloud data is essentially a series of points in 3D space. In a geometric sense, the order of the points does not affect the shape they represent, but it does change the underlying matrix representation, so the same point cloud can be represented by two completely different matrices.
3. The number of points in the point cloud data is different from the pixels in the
image; the number of pixels is a given constant and only depends on the camera.
However, the number of points in the point cloud is uncertain, and its number
depends on factors such as the sensor and the scanned scene. In addition, the
number of points of various objects in the point cloud is also different, which
leads to the problem of sample imbalance during network training.
Researchers have proposed a variety of deep networks for point cloud under
standing, which are mainly divided into three categories: 2D projection-based deep
learning network, 3D voxelization-based deep learning network, and network
model based on a single point in the point cloud [3].
1. 2D projection-based deep learning network.
With the progress of deep learning in 2D image segmentation and classifica
tion, people project 3D point clouds onto 2D images as the input of CNN
[32, 33]. Commonly used 2D projection images include virtual camera-based
RGB images, virtual camera-based depth maps, sensor-based distance images,
and panoramic images. These projection methods allow object detection and semantic segmentation network models already trained on large numbers of 2D images to be used as pretrained models and fine-tuned, making it easier to obtain good detection and classification results on 2D images. However, this approach may also lose some 3D information.
There are also methods that use multi-view projection techniques for point cloud classification, such as MVCNN [32], SnapNet [34], and DeePr3SS [35]. This kind of method easily loses 3D structure information; in addition, different projection angles represent the object with different fidelity, and the choice of projection angle also affects the generalization ability of the network.
2. 3D voxelization-based deep learning network.
On the basis of the 2D CNN model, a 3D CNN model can be built by voxelizing the point cloud, which retains more 3D structural information and is conducive to a high-resolution representation of point cloud data. Some results have been achieved in labeling point clouds and classifying objects. To further improve the ability to represent voxels, a variety of multi-scale voxel CNN methods have also been proposed, such as MS3_DVS [36] and MSNet [37].
The basis of this class of methods is voxelization [38]. In practice, voxelization often uses a 0-1 occupancy method to indicate whether a voxel contains a point or not; a voxel density-based method and a grid point-based method can also be used. The voxelization dimensions are mainly 11 × 11 × 11, 16 × 16 × 16, 20 × 20 × 20, 32 × 32 × 32, etc. The amount of data can be reduced by downsampling during voxelization.
The CNN method based on 3D voxelization provides structure to the point cloud through meshing, uses the mesh transformation to solve the alignment problem, and also yields a constant number of voxels. However, 3D CNNs are computationally intensive during convolution. To address this problem, the voxel resolution is usually reduced, but then the quantization error increases. In addition, only the structural information of the point cloud is used in the network; information such as the color and intensity of the points is not considered.
3. Network model based on a single point in the point cloud.
The network model based on a single point in the point cloud can make full use
of the multimodal information of the point cloud and reduce the computational
complexity in the preprocessing process. For example, the PointNet network
model for indoor point cloud scenes can classify, partially segment, and seman
tically segment indoor point cloud data; its improved version, PointNet++ net
work model, can obtain multi-scale and comprehensive local features
[39]. Another improvement of the PointNet network model is the PointCNN network model [40]. By analyzing the characteristics of point clouds, it proposes an X-transform learned from the points, which is used to simultaneously weight the input features associated with each point and rearrange them into a latent, implicit canonical order, after which product-and-sum operations are applied to the elements; this improves performance on point cloud convolution processing. Another improvement of the PointNet network model is the
PointFlowNet network model [41]. It can estimate scene flow. In addition, there
are also studies that first segment or block large-scale point clouds and then use
the PointNet network model for classification on the results to overcome the
limitations of the original PointNet network model for large-scale point cloud
processing, such as SPGraph [42].
The cuckoo search algorithm has the characteristics of few parameters and simple
model, so it has good versatility and robustness and also has global convergence. But
there are some limitations:
1. In the iterative process, the algorithm generates new positions by random walks, which is somewhat blind; this prevents it from quickly finding the global optimum, and the search accuracy is difficult to improve.
2. After searching the current position, the algorithm always selects the better solution in a greedy way, which makes it easy to fall into local optima.
3. The algorithm always discards bad solutions and generates new solutions with a fixed probability. Without learning from and inheriting the good experience of the dominant individuals in the population, the search time is increased.
In response to the above problems, many improvement methods have been
proposed, including the following:
1. Combining pattern search with a coarse-to-fine strategy.
Although the CS algorithm has good global exploration ability, its local search performance is relatively weak. Therefore, within the framework of the CS algorithm, pattern search, with its efficient coarse-to-fine search ability, can be embedded to enhance the accuracy of local solutions. The principle of the pattern search method is to find the lowest point of the search region: first determine a valley leading toward the center of the region, and then search along the direction of the valley line [45]. The essence of this strategy is to improve the way the search step size is adjusted.
Consider the point cloud registration problem in Sect. 4.2.5. The registration process
needs to obtain the translation and rotation matrix between the coordinate systems of
the two point sets to minimize the distance between the homologous points from the
two point sets.
In practical application, it is generally necessary to sample the point set first to
reduce the amount of data processed subsequently by the point cloud and improve
the operation efficiency. Selecting feature points is an effective means to reduce the
amount of data. There are many methods to select feature points. If the point cloud
dataset includes N points, and the coordinates of any point pi are (xi, yi, zi), the main
steps of selecting feature points using intrinsic shape signatures (ISS) include the
following:
1. Define a local coordinate system for each point pi, and set the search radius of
each point as r.
2. Query all points within the radius r distance around each point pi and calculate
their weights:
4. Calculate the eigenvalues {λ_{i1}, λ_{i2}, λ_{i3}} of the covariance matrix cov(p_i) of each point p_i, sorted in descending order.
5. Set the thresholds T_1 and T_2, and determine the points that satisfy the following conditions as feature points:

λ_{i2}/λ_{i1} < T_1,  λ_{i3}/λ_{i2} < T_2        (4.27)
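A sketch of this selection procedure is given below; since the weight and covariance equations are not reproduced above, the sketch uses the standard ISS weighting (each neighbor weighted by the inverse of its own neighbor count), which is an assumption, and the function name iss_keypoints is illustrative.

import numpy as np
from scipy.spatial import cKDTree

def iss_keypoints(points, r, T1=0.7, T2=0.7):
    """Select ISS feature points via the eigenvalue-ratio test of Eq. (4.27)."""
    tree = cKDTree(points)
    neighbors = tree.query_ball_point(points, r)
    # weight of each point: inverse of its own neighbor count (standard ISS weighting, assumed)
    w = np.array([1.0 / max(len(nb) - 1, 1) for nb in neighbors])
    keep = []
    for i, nb in enumerate(neighbors):
        nb = [j for j in nb if j != i]
        if len(nb) < 3:
            continue
        diffs = points[nb] - points[i]
        wi = w[nb]
        cov = (wi[:, None] * diffs).T @ diffs / wi.sum()   # weighted covariance of the neighborhood
        lam = np.sort(np.linalg.eigvalsh(cov))[::-1]       # eigenvalues in descending order
        if lam[0] <= 0 or lam[1] <= 0:
            continue
        if lam[1] / lam[0] < T1 and lam[2] / lam[1] < T2:  # Eq. (4.27)
            keep.append(i)
    return np.array(keep)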
error is zero. But in practice, there are various reasons leading to errors. In this way,
the 3D point cloud registration problem becomes an optimization problem. It is
necessary to find the optimal transformation matrix to minimize the Euclidean
distance between the two point sets.
If the improved cuckoo search algorithm is used, the minimum corresponding distance should be used as the global search criterion to achieve effective registration of the point cloud sets. Here, six transformation parameters need to be encoded. Since the value ranges of the rotation parameters α, β, and γ and the translation parameters x_0, y_0, and z_0 are different, the parameter encoding needs to be normalized. For example, randomly generate six solutions s_1, s_2, s_3, s_4, s_5, and s_6 within the constraints of the parameter encoding, and form a set of solutions S = [s_1, s_2, s_3, s_4, s_5, s_6], normalized to obtain S' = [s_1', s_2', s_3', s_4', s_5', s_6'], where s_i' = (s_i − l_i)/(u_i − l_i), i = 1, 2, ..., 6, and u_i and l_i are the upper and lower bounds of s_i, respectively, so that the encoded parameter values lie in the range [0, 1]. In this way, each parameter corresponds to the position of a nest in the algorithm, and the entire point cloud registration problem is transformed into a function optimization problem in 6D space.
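The normalization of the six parameters can be sketched as follows; the bounds l_i and u_i below are illustrative placeholders, not values from the text.

import numpy as np

# value ranges of the six parameters (alpha, beta, gamma, x0, y0, z0); placeholder bounds
lower = np.array([-np.pi, -np.pi, -np.pi, -1.0, -1.0, -1.0])
upper = np.array([ np.pi,  np.pi,  np.pi,  1.0,  1.0,  1.0])

def normalize(s):
    """Map a candidate solution s = [s1, ..., s6] into [0, 1]^6: s_i' = (s_i - l_i) / (u_i - l_i)."""
    return (s - lower) / (upper - lower)

def denormalize(s_norm):
    """Recover the physical parameters from an encoded nest position in [0, 1]^6."""
    return lower + s_norm * (upper - lower)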
References
1. Engin I C, Maerz N H (2022) Investigation on the processing of LiDAR point cloud data for
particle size measurement of aggregates as an alternative to image analysis. Journal of Applied
Remote Sensing, 16(1): #016511 (DOI: 10.1117/1.JRS.16.016511).
2. Mirzaei K, Arashpour M, Asadi E, et al. (2022) 3D point cloud data processing with machine
learning for construction and infrastructure applications: A comprehensive review. Advanced
Engineering Informatics, 51: #101501 (DOI: 10.1016/j.aei.2021.101501).
3. Li Y, Tong G F, Yang J C, et al. (2019) 3D point cloud scene data acquisition and its key
technologies for scene understanding. Laser & Optoelectronics Progress 56(4): 040002-
1~040002-14.
4. Yang B S, Dong Z (2020) Point cloud intelligent processing. Beijing: Science Press.
5. Qin J, Wang W B, Zou Q J, et al. (2023) Review of 3D target detection methods based on
LiDAR point clouds. Computer Science, 50(6A): 259-265.
6. Song X, Wang P, Zhou D, et al. (2019) Apollocar3D: A large 3D car instance understanding
benchmark for autonomous driving. Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition 5452-5462.
7. Pomerleau F, Liu M, Colas F, et al. (2012) Challenging data sets for point cloud registration
algorithms. International Journal of Robotic Research 31: 1705-1711.
8. Xue J, Fang J, Li T, et al. (2019) BLVD: Building A large-scale 5D semantics benchmark for
autonomous driving. Proceedings of the International Conference on Robotics and Automation
20-24.
9. Chen Y, Wang J, Li J, et al. (2018) Lidar-video driving dataset: Learning driving policies
effectively. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
5870-5878.
10. Vallet B, Bredif M, Serna A, et al. (2015) TerraMobilita/iQmulus urban point cloud classification benchmark. Computers & Graphics 49: 126-133.
11. Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? The KITTI vision
benchmark suite. Proceedings of the Conference on Computer Vision and Pattern Recognition,
3642-3649.
12. Wang C, Hou S, Wen C, et al. (2018) Semantic line framework-based indoor building modeling
using backpacked laser scanning point cloud. ISPRS J. Photogrammetric Remote Sensing
143, 150-166.
13. Carlevaris-Bianco N, Ushani A K, Eustice R M (2016) University of Michigan North Campus
long-term vision and LiDAR dataset. International Journal of Robotic Research 35, 1023-1035.
14. Roynard X, Deschaud J E, Goulette F (2018) Paris-Lille-3D: A large and high-quality ground
truth urban point cloud dataset for automatic segmentation and classification. International
Journal of Robotic Research 37, 545-557.
15. Caesar H, Bankiti V, Lang A H, et al. (2019) nuScenes: A multimodal dataset for autonomous
driving. arXiv:1903.11027.
16. Maddern W, Pascoe G, Linegar C, et al. (2017) 1 year, 1000 km: The Oxford RobotCar dataset. International Journal of Robotic Research 36, 3-15.
17. Behley J, Garbade M, Milioto A, et al. (2019) SemanticKITTI: A dataset for semantic scene
understanding of LiDAR sequences. Proceedings of IEEE/CVF International Conference on
Computer Vision.
18. Dong Z, Liang F, Yang B, et al. (2020) Registration of large-scale terrestrial laser scanner point
clouds: A review and benchmark. ISPRS J. Photogrammetric Remote Sensing 163, 327-342.
19. Fan Y C, Zhang J Q, Cui C, et al. (2023) Overview of point cloud preprocessing algorithms.
Information and Computer, 35(6): 206-209.
20. Sun X D (2021) Point cloud data research and shape analysis. Beijing: Electronic Industry
Press.
21. Shi F Z (2013) Computer Aided Geometric Design and Nonuniform Rational B-Splines
(Revised Edition). Beijing: Higher Education Press.
22. Zhao Z X, Dong X J, Lv B X, et al. (2019) Application Theory and Practice of Ground 3D Laser
Scanning Technology. Beijing: China Water Conservancy and Hydropower Press.
23. Han X-F, Jin J S, Wang M-J. (2017) A review of algorithms for filtering the 3D point cloud.
Signal Processing: Image Communication 57: 103-112.
24. Li F, Wang J, Liu X Y, et al. (2020) Principle and Application of 3D Laser Scanning. Beijing:
Earthquake Press.
25. Wu Q H, Qu J K, Zhou B X. (2020) 3D Laser Scanning Data Processing Technology and Its
Engineering Application. Jinan: Shandong University Press.
26. Tombari F, Salti S, Di Stefano L (2010) Unique signatures of histograms for local surface
description. Proceedings of the European Conference on Computer Vision 356-369.
27. Tombari F, Salti S, Di Stefano L (2011) A combined texture-shape descriptor for enhanced 3D
feature matching. Proceedings of International Conference on Image Processing, 809-812.
28. Seib V, Paulus D (2021) Shortened color-shape descriptors for point cloud classification from RGB-D cameras. IEEE International Conference on Autonomous Robot Systems and Competitions, 203-208.
29. Guo Y, Sohel F, Bennamoun M, et al. (2013) Rotational projection statistics for 3D local
surface description and object recognition. International Journal of Computer Vision 105(1):
63-86.
30. Yang J, Zhang Q, Xiao Y, et al. (2017) TOLDI: An effective and robust approach for 3D local
shape description. Pattern Recognition 65: 175-187.
31. Gong J Y, Lou Y J, Liu F Q, et al. (2023) Scene point cloud understanding and reconstruction technologies in 3D space. Journal of Image and Graphics 28(6): 1741-1766.
32. Su H, Maji S, Kalogerakis E, et al. (2015) Multi-view convolutional neural networks for 3D shape recognition. IEEE International Conference on Computer Vision 945-953.
33. Qi C R, Su H, Nießner M, et al. (2016) Volumetric and multi-view CNNs for object classification on 3D data. IEEE Conference on Computer Vision and Pattern Recognition 5648-5656.
34. Boulch A, Guerry J, Le Saux B, et al. (2018) SnapNet: 3D point cloud semantic labeling with 2D deep segmentation networks. Computers & Graphics 189-198.
35. Lawin F J, Danelljan M, Tosteberg P, et al. (2017) Deep projective 3D semantic segmentation.
International Conference on Computer Analysis of Images and Patterns 95-107.
36. Roynard X, Deschaud J E, Goulette F (2018) Classification of point cloud scenes with multiscale voxel deep network. https://arxiv.org/abs/1804.03583.
37. Wang L, Huang Y C, Shan J, et al. (2018) MSNet: Multi-scale convolutional network for point cloud classification. Remote Sensing 10(4): 612.
38. Xu Y S, Tong X H, Stilla U (2021) A voxel-based representation of 3D point clouds: Methods,
applications, and its potential use in the construction industry. Automation in Construction 126:
103675 (DOI: 10.1016/j.autcon.2021.103675).
39. Charles R Q, Su H, Mo K, et al. (2017) PointNet: Deep learning on point sets for 3D classification and segmentation. IEEE Conference on Computer Vision and Pattern Recognition 77-85.
40. Li Y Y, Bu R, Sun M C, et al. (2018) PointCNN: Convolution on X-transformed points. https://arxiv.org/abs/1801.07791.
41. Behl A, Paschalidou D, Donne S, et al. (2018) PointFlowNet: Learning representations for 3D scene flow estimation from point clouds. https://arxiv.org/abs/1806.02170.
42. Landrieu L, Simonovsky M (2018) Large scale point cloud semantic segmentation with
superpoint graphs. IEEE Conference on Computer Vision and Pattern Recognition 4558-4567.
43. Yang X S, Deb S (2010) Engineering optimization by cuckoo search. International Journal of
Mathematical Modelling and Numerical Optimisation, 1(4): 330-343.
44. Yang X S, Deb S (2009) Cuckoo search via Levy flights. 2009 World Congress on Nature &
Biologically Inspired Computing (NaBIC) 210-214.
45. Hooke R, Jeeves T A. (1961) “Direct search” solution of numerical and statistical problems.
Journal of the ACM, 8(2): 212-229.
46. Ma W (2021) Bionic swarm intelligence optimization algorithm and its application in point
cloud registration. Nanjing: Southeast University Press.
Chapter 5
Binocular Stereovision
The human visual system is a natural stereovision system. The human eyes (each
equivalent to a camera) observe the same scene from two viewpoints, and the
information obtained is combined in the human brain to give a 3D objective
world. In computer vision, by collecting one set of two (or more) images from
different viewing angles, the parallax (disparity) between corresponding pixels in
different images can be obtained by means of the principle of triangulation. That is,
the parallax is the difference between the positions of a 3D space point projected
onto these 2D images. The depth information and the reconstructions of the 3D scene
can be further obtained according to the parallax.
In stereovision, computing parallax is a key step to obtain depth information, and
the main challenge is to determine the projected image points of 3D space points on
different images (two images for binocular stereovision, multiple images for multi
ocular stereovision). The determination of the correspondence after projection is a
matching problem. This chapter considers only binocular stereovision (multi-ocular
stereovision will be discussed in the next chapter).
The sections of this chapter will be arranged as follows.
Section 5.1 discusses the principle and several common techniques of binocular stereo matching based on regional grayscale correlation.
Section 5.2 introduces the basic steps and methods for feature-based binocular stereo matching and also analyzes the two commonly used feature points SIFT and SURF in recent years.
Section 5.3 presents an algorithm to detect and correct errors in parallax maps obtained by stereo matching.
Section 5.4 introduces recent stereo matching methods based on deep learning techniques, including various stereo matching networks and a specific method.
The most basic matching method here is generally called template matching (the
principle can also be used for more complex matching), and its essence is to use a
small image (template) to match a part (sub-image) of a large image. The result of the
[Figure: template w(x − s, y − t) of size J × K at displacement (s, t) within the image f(x, y) of size M × N]
matching is to determine whether there is the small image in the large image and, if
so, further determine the position of the small image in the large image. Templates
are usually square in template matching but can also be rectangular or other shapes.
Now consider finding the matching position between a template image w(x, y) of size J × K and a large image f(x, y) of size M × N, and let J < M and K < N. In the simplest case, the correlation function between f(x, y) and w(x, y) can be written as

c(s, t) = Σ_x Σ_y f(x, y) w(x − s, y − t)        (5.1)
The correlation function defined by Eq. 5.1 has the disadvantage that it is
sensitive to changes in the magnitude of f(x, y) and w(x, y). For example, when the values of f(x, y) are doubled, c(s, t) is also doubled. To overcome this problem, the following correlation coefficient can be defined:

c(s, t) = Σ_x Σ_y [f(x, y) − f̄(x, y)] [w(x − s, y − t) − w̄] / { Σ_x Σ_y [f(x, y) − f̄(x, y)]² · Σ_x Σ_y [w(x − s, y − t) − w̄]² }^{1/2}        (5.4)
where s = 0, 1, 2, ..., M − 1; t = 0, 1, 2, ..., N − 1; w̄ is the mean value of w (which only needs to be computed once); and f̄(x, y) represents the mean value of the region of f(x, y) corresponding to the current position of w.
The summation in Eq. 5.4 is for the common coordinates of f(x, y) and w(x, y).
Because the correlation coefficient has been scaled to the interval [-1, 1], the change
of its value is independent of the amplitude change of f(x, y) and w(x, y).
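A direct (unoptimized) sketch of matching with this correlation coefficient follows; here (s, t) is taken as the top-left offset of the template inside the image, a simplification of the displacement convention used above, and the function name ncc_match is illustrative.

import numpy as np

def ncc_match(f, w):
    """Slide template w over image f and return the offset with the largest correlation coefficient."""
    M, N = f.shape
    J, K = w.shape
    w_zero = w - w.mean()                      # template mean is computed only once
    w_norm = np.sqrt((w_zero ** 2).sum())
    best, best_pos = -2.0, (0, 0)
    for s in range(M - J + 1):
        for t in range(N - K + 1):
            patch = f[s:s + J, t:t + K]
            p_zero = patch - patch.mean()      # mean of the image region under the template
            denom = np.sqrt((p_zero ** 2).sum()) * w_norm
            if denom == 0:
                continue
            c = (p_zero * w_zero).sum() / denom   # correlation coefficient in [-1, 1]
            if c > best:
                best, best_pos = c, (s, t)
    return best_pos, best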
Another method is to calculate the grayscale difference between the template and the sub-image and to establish the correspondence between the two groups of pixels that minimizes the mean square difference (MSD). The advantage of this kind of method is that the matching result does not depend on detecting features with high accuracy and density, so high positioning accuracy and a dense parallax surface can be obtained [2]. The disadvantage is that it depends on the statistical characteristics of the image gray levels, so it is sensitive to the surface structure of the scene and to light reflections; it therefore has difficulty when the surface of the spatial scene lacks sufficient texture detail or when the imaging distortion is large (such as when the baseline is too long). In actual matching, quantities derived from the gray level can also be used, but some experiments show that, among matching with gray level, grayscale gradient magnitude and direction, grayscale Laplacian value, and grayscale curvature as matching parameters, using the gray level itself gives the best results [3].
As a basic matching technique (see Chap. 9 for some other typical matching techniques), template matching has been applied in many areas, especially when the image only undergoes translation. The above calculation of the correlation coefficient normalizes the correlation function and overcomes the problem caused by amplitude changes. However, it is difficult to normalize for image size and rotation. Normalizing for size requires a spatial scale transformation, which requires a lot of calculation. Normalizing for rotation is more complicated. If the rotation angle of f(x, y) is known, it suffices to rotate w(x, y) by the same angle to align it with f(x, y). However, without knowing the rotation angle of f(x, y), finding the best match requires rotating w(x, y) over all possible angles. In practice, this method is therefore not practical.
The above equation does not change under the affine transformation. That is, the
value of (s, t) is only related to the three noncollinear points and has nothing to do
with the affine transformation itself. In this way, the value of (s, t) can be regarded as
the affine coordinate of the point Q. This property also applies to line segments: three
nonparallel line segments can be used to define an affine datum.
Geometric hashing builds a hash table that helps matching algorithms quickly determine the potential location of a template in an image. The hash table can be constructed as follows: for any three noncollinear points in the template (a datum point group), calculate the affine coordinates (s, t) of the other points. The affine coordinates
(s, t) of these points will be used as indexes into the hash table. For each point, the
hash table retains the index (serial number) for the current datum group. If you want
to search multiple templates in an image, you need to keep more template indexes.
To search for a template, a set of datum points is randomly selected in the image,
and the affine coordinates (s, t) of the other points are calculated. Using this affine
coordinate (s, t) as the index of the hash table, the index of the datum point group can
be obtained. This results in a vote for the presence of this datum group in the image.
If the randomly selected point does not correspond to the set of datum points on the
template, no vote is accepted. However, if the randomly selected point corresponds
to the set of datum points on the template, the vote is accepted. If many votes are
accepted, it indicates that the template is likely to be present in the image, and
metrics for the datum group are available. Because there is a certain probability that
the selected set of datum points will not be suitable, the algorithm needs to iterate to
increase the probability of finding the correct match. In fact, it is only necessary to
find a correct set of datum points to determine the matching template. So, if k of the N template points are found in the image, the probability that a correct set of datum points is selected at least once in m attempts is

p = 1 − [1 − (k/N)³]^m        (5.6)

If the ratio k/N of the number of template points appearing in the image to the total number of template points is 0.2, and the desired probability of matching the template is 99% (i.e., p = 0.99), then the number of attempts m required is 574.
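The quoted number of attempts can be reproduced directly from Eq. (5.6), as the following small sketch shows.

import math

def attempts_needed(k_over_N, p_target):
    """Smallest m with 1 - (1 - (k/N)^3)^m >= p_target, from Eq. (5.6)."""
    q = 1.0 - k_over_N ** 3                 # probability that one random basis triple is not correct
    return math.ceil(math.log(1.0 - p_target) / math.log(q))

print(attempts_needed(0.2, 0.99))           # prints 574, matching the value quoted in the text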
physical properties in the two images can be matched, also known as the
photometric compatibility constraint.
2. Uniqueness constraint: it means that a single black point in one image can only
match a single black point in another image.
3. Continuous constraint: it means that the parallax change near the matching
point is smooth (change gradually) in most of the points in the whole image
except the occlusion region or discontinuous region, also known as the disparity
smoothness constraint.
When discussing stereo matching, in addition to the above three constraints, the
epipolar line constraints introduced below and the order constraints introduced in
Sect. 5.2.4 can also be considered.
Epipolar line constraints can help reduce the search range and speed up the search
process.
First, the two important concepts of epipole and epipolar line are introduced with
the help of the binocular horizontal convergence mode in Fig. 5.2. In Fig. 5.2, the
coordinate origin is the optical center of the left eye, the X-axis connects the optical
centers of the left and right eyes, the Z-axis points to the observation direction, the
distance between the left and right eyes is B (also commonly referred to as the system
baseline), the optical axes of the left and right image planes are in the XZ plane, and
the intersection angle is 9. Consider the connection between the left and right image
planes. O1 and O2 are the optical centers of the left and right image planes,
respectively. The connecting line between them is called the optical center line.
The intersections e1 and e2 of the optical center line and the left and right image
planes are called the epipoles of the left and right image planes, respectively (the
epipole coordinates are represented by e1 and e2, respectively). The optical center
line and the space point W are in the same plane. This plane is called the epipolar
plane. The intersection lines L1 and L2 of the epipolar plane with the left and right
image planes are called the epipolar lines of the projection points of the space point
W on the left and right image planes, respectively. The epipolar line defines the
position of the corresponding point of the binocular image, and the projection point
p2 (coordinate p2) of the right image plane corresponding to the projection point p1
(coordinate p1) of the space point W on the left image plane must be on the epipolar
line L2. On the contrary, the projection point of the left image plane corresponding to
the projection point of the space point W on the right image plane must be on the
epipolar line L1.
From the above discussion, it can be seen that the epipole and the epipolar line
have correspondence. The epipolar line defines the position of the corresponding
point on the binocular image, and the projection point of the right image
plane corresponding to the projection point of the space point W on the left image
plane must be on the epipolar line L2; conversely, the projection point on the left image plane corresponding to the projection of W on the right image plane must be on the epipolar line L1. This is the epipolar line constraint.
In binocular vision, when the ideal parallel optical axis model is adopted (i.e., the
line of sight of each camera is parallel), the epipolar line and the image scanning line
are coincident, and the stereovision system at this time is called a parallel
stereovision system. In parallel stereovision systems, the search range of stereo
matching can also be reduced by means of epipolar line constraints. Ideally, a search over the entire image can be reduced to a search along a single line of the image using the epipolar line constraint. However, it should be pointed out that the epipolar line constraint is
only a local constraint. For a space point, there may be more than one projection
point on the epipolar line.
An illustration of the epipolar line constraint is shown in Fig. 5.3. A camera (left)
is used to observe a point W in space, and the imaged point p1 should be on the line
connecting the optical center of the camera and point W. But all points on this line
will be imaged at point p1, so the position/distance of a particular point W cannot be
completely determined by point p1. Now use the second camera to observe the same
spatial point W, and the imaged point p2 should also be on the connection line
between the optical center of the camera and the point W. All points W on this line
are projected onto a straight line on the imaging plane 2, which is the epipolar line.
According to the geometric relationship in Fig. 5.3, for any point p1 on imaging plane 1, all of its possible corresponding points on imaging plane 2 are constrained to lie on the same straight line; this is the abovementioned epipolar line constraint.
The relationship between the projected coordinate points of the space point W on the
two images can be described by an essential matrix E with five degrees of freedom
[6], which can be decomposed into an orthogonal rotation matrix R followed by a
translation matrix T (E = RT). If the coordinates of the projected point in the left image are represented by p1, and the coordinates of the projected point in the right image are represented by p2, then

p_2^T E p_1 = 0        (5.7)

On the corresponding images, the epipolar line on which p2 lies is L2 = E p1 and the epipolar line on which p1 lies is L1 = E^T p2; the epipoles satisfy E e1 = 0 and E^T e2 = 0, respectively.
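These relations can be evaluated directly, as in the sketch below; the epipoles are obtained as the null vectors of E via an SVD, and the function names are illustrative.

import numpy as np

def epipolar_lines(E, p1, p2):
    """Epipolar lines L1 = E^T p2 and L2 = E p1 for homogeneous image points p1, p2."""
    return E.T @ p2, E @ p1

def epipoles(E):
    """Epipoles as the null vectors of E: E e1 = 0 and E^T e2 = 0 (computed via SVD)."""
    U, _, Vt = np.linalg.svd(E)
    e1 = Vt[-1]        # right null vector: E e1 = 0
    e2 = U[:, -1]      # left null vector:  E^T e2 = 0
    return e1, e2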
Refer to Fig. 5.4 for the derivation of the essential matrix. In Fig. 5.4, suppose that the projection positions p1 and p2 of point W on the two images can be observed, and the rotation matrix R and the translation matrix T between the two cameras are also known; then three 3D vectors O1O2, O1W, and O2W can be obtained. These three 3D vectors must be coplanar. In mathematics, the criterion that three 3D vectors a, b, and c are coplanar can be written as a · (b × c) = 0, and this can be used to derive the essential matrix. The essential matrix expresses the relationship between the coordinates of the projection points of the same space point W (coordinate W) on the two images.
According to the perspective relationship of the second camera, vector O1W ∝ R p1, vector O1O2 ∝ T, and vector O2W ∝ p2. Combining these relationships with the coplanarity condition yields the desired result:

p_2 · (T × R p_1) = 0
The above discussion assumes that p1 and p2 are the camera-calibrated pixel
coordinates. If the cameras are not calibrated, the original pixel coordinates q1 and q2
need to be used. Suppose the internal parameter matrices of the two cameras are G1
and G2; then

p1 = G1^(−1) q1    p2 = G2^(−1) q2

Substituting the above two equations into Eq. 5.7, we get
q2^T (G2^(−1))^T E G1^(−1) q1 = 0, which can be written as

q2^T F q1 = 0  (5.11)

where

F = (G2^(−1))^T E G1^(−1)

is called the fundamental matrix because it contains all the information for camera
calibration. The fundamental matrix has seven degrees of freedom (two parameters
are required for each epipole, plus three parameters for the projective transformation
that maps the epipolar lines of one image to those of the other, since a projective
transformation between two 1D projective spaces has three degrees of freedom),
while the essential matrix has five degrees of freedom, so the fundamental matrix has
two more free parameters than the essential matrix. Comparing Eqs. 5.7 and 5.11,
however, it can be seen that the roles of the two matrices are similar.
The essential and fundamental matrices are related to the internal and external
parameters of the camera. If the internal and external parameters of the camera are
given, according to the epipolar line constraint, for any point on the imaging plane
1, only a 1D search is needed in the imaging plane 2 to determine the position of
the corresponding point. Further, the correspondence constraint is a function of the
internal and external parameters of the camera. Given the internal parameters, the
external parameters can be determined by means of the observed pattern of
corresponding points, and then the geometric relationship between the two cameras
can be established.
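To make the preceding relations concrete, the following NumPy sketch builds an essential matrix from an assumed rotation R and translation t, derives the corresponding fundamental matrix for assumed intrinsic matrices G1 and G2, and checks the constraints of Eqs. 5.7 and 5.11 for a simulated point pair; all numerical values (pose, intrinsics, point) are illustrative assumptions, not data from the book.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]x, so that skew(t) @ a == np.cross(t, a)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Assumed relative pose of camera 2 with respect to camera 1 (illustrative values),
# using the convention X2 = R @ X1 + t for a point X1 in camera-1 coordinates.
angle = np.deg2rad(5.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([-0.2, 0.0, 0.0])                 # baseline roughly along X

# Essential matrix from the coplanarity condition p2 . (t x R p1) = 0
E = skew(t) @ R

# Assumed internal parameter matrices G1, G2
G1 = G2 = np.array([[800.0, 0.0, 320.0],
                    [0.0, 800.0, 240.0],
                    [0.0, 0.0, 1.0]])

# Fundamental matrix F = (G2^-1)^T E G1^-1
F = np.linalg.inv(G2).T @ E @ np.linalg.inv(G1)

# One space point in camera-1 coordinates and its two normalized projections
X1 = np.array([0.3, -0.1, 4.0])
X2 = R @ X1 + t
p1 = X1 / X1[2]                                # homogeneous normalized coordinates
p2 = X2 / X2[2]

# Pixel coordinates q = G p
q1, q2 = G1 @ p1, G2 @ p2

print("p2^T E p1 =", p2 @ E @ p1)              # ~0: essential-matrix constraint (Eq. 5.7)
print("q2^T F q1 =", q2 @ F @ q1)              # ~0: fundamental-matrix constraint (Eq. 5.11)
print("p2 lies on epipolar line L2 = E p1:", np.isclose(p2 @ (E @ p1), 0.0))
```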
There are still some specific problems to be considered and solved when using the
region matching method in practice. Here are two examples:
1. Due to the shape of the scenery itself or the mutual occlusion of the scenery when
shooting the scene, not all the scenes captured by the left camera can be captured
by the right camera. Therefore, some templates determined by the left image may
not be able to find the exact matching position in the right image. At this time, it is
often necessary to interpolate according to the matching results of other matching
positions to get the data of these unmatched points.
2. When using the pattern of template image to express the characteristics of a single
pixel, the premise is that different template images should have different patterns,
so that the matching can be differentiated, which can reflect the characteristics of
different pixels. However, sometimes there are some smooth regions in the
image, and the template images obtained in these smooth regions have the
same or similar patterns, so there will be uncertainty in the matching, and it
leads to false matching. In order to solve this problem, it is sometimes necessary
to project some random textures onto these surfaces to convert smooth regions
into texture regions, so as to obtain template images with different patterns to
eliminate uncertainty.
The following is an example of the error caused by stereo matching when there is
a gray level smooth region along the binocular baseline direction. See Fig. 5.5,
where Figs. 5.5a and b are the left and right of a pair of perspective views,
Fig. 5.5 Examples of binocular stereo matching affected by smooth regions of the image
respectively. Figure 5.5c is the disparity map obtained by binocular stereo matching
(for the sake of clarity, only the result of scene matching is retained), the dark color
in the figure represents a farther distance (larger depth), and the light color represents
a short distance (smaller depth). Figure 5.5d is a 3D perspective (contour map)
display corresponding to Fig. 5.5c. Comparing the figures, it can be seen that at some
positions in the scene (such as the horizontal eaves of towers, houses, and other
buildings) the gray values are largely similar along the horizontal direction, so it is
difficult to search for and match the corresponding points along the epipolar line
direction. The resulting mismatches produce many errors, which appear in Fig. 5.5c
as white or black regions that are inconsistent with their surroundings and in
Fig. 5.5d as sharp burrs.
In Eq. 5.14, p, a, and k are coefficients related to the optical properties of the
surface, which can be calculated from the image data.
The first term on the right side of Eq. 5.14 considers the scattering effect, which
does not vary with the viewing angle; the second term considers the specular
reflection effect. Let H be the unit vector in the angular direction of the specular
reflection:
H = (S + V) / √(2[1 + (S · V)])  (5.15)
The second term on the right of the equal sign in Eq. 5.14 reflects the change of
line of sight vector V through vector H. In the coordinate system adopted in Fig. 5.2,
For an image f(x, y), the feature point image can be obtained by calculating the edge
points:
Then, t(x, y) is divided into nonoverlapping small regions W, and the point with
the largest calculated value is selected as the feature point in each small region.
Now consider matching an image pair composed of a left image and a right
image. For each feature point of the left image, all possible matching points in the
right image can be formed into a set of possible matching points. In this way, a label
set can be obtained for each feature point of the left image, in which the label l is
either the parallax between the feature point of the left image and its possible
matching point or a special label representing no matching point. For each possible
matching point, calculate the following quantity to set the initial matching
probability P(0)(l):

A(l) = Σ(x,y)∈W [fL(x, y) − fR(x + lx, y + ly)]²  (5.22)
where l = (lx, ly) is the possible parallax. A(l) represents the gray level fitting degree
between the two regions, which is inversely proportional to the initial matching
probability P(0)(l). In other words, P(0)(l) is related to the similarity in the
neighborhood of possible matching points. According to this, the relaxation iteration method
can be used to iteratively update P(0)(l) by giving positive increments to the points
with close parallax in the neighborhood of possible matching points and negative
increments to the points with far parallax in the neighborhood of possible matching
points. With the iteration, the iterative matching probability P(k)(l) of the correct
matching point will gradually increase, while the matching probability P(k)(l) of
other points will gradually decrease. After a certain number of iterations, the point
with the maximum matching probability P(k)(l) is determined as the matching point.
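As a concrete illustration of Eq. 5.22 and the initialization of the matching probabilities, the following NumPy sketch computes A(l) over a small window for a set of candidate disparities and converts the fitting degrees into initial probabilities P(0)(l) by a simple inverse weighting; the window size, candidate set, and the 1/(1 + A) weighting are illustrative assumptions rather than the book's exact scheme.

```python
import numpy as np

def initial_match_probabilities(f_left, f_right, x, y, candidates, half_w=2):
    """Compute A(l) of Eq. 5.22 for each candidate disparity l = (lx, ly)
    around feature point (x, y), and turn them into initial probabilities."""
    win = np.s_[y - half_w:y + half_w + 1, x - half_w:x + half_w + 1]
    patch_l = f_left[win].astype(float)

    fits = {}
    for lx, ly in candidates:
        shifted = np.s_[y + ly - half_w:y + ly + half_w + 1,
                        x + lx - half_w:x + lx + half_w + 1]
        patch_r = f_right[shifted].astype(float)
        fits[(lx, ly)] = np.sum((patch_l - patch_r) ** 2)   # A(l)

    # Smaller A(l) (better fit) -> larger initial probability; normalize to sum 1
    weights = {l: 1.0 / (1.0 + a) for l, a in fits.items()}
    total = sum(weights.values())
    return {l: w / total for l, w in weights.items()}

# Tiny synthetic example: the right image is the left image shifted by 3 pixels
rng = np.random.default_rng(0)
f_left = rng.integers(0, 256, size=(40, 60))
f_right = np.roll(f_left, shift=3, axis=1)

P0 = initial_match_probabilities(f_left, f_right, x=20, y=20,
                                 candidates=[(d, 0) for d in range(0, 7)])
print(max(P0, key=P0.get))    # the disparity (3, 0) should receive the largest P(0)
```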
When matching feature points, the zero-crossing pattern can also be used to obtain
matching primitives [7]. Zero-crossings can be obtained by convolution with the
Laplacian of Gaussian functions (e.g., see [8]). Considering the connectivity of the
In the following, Fig. 5.7 (which is obtained by removing the epipolar line in Fig. 5.2
and then moving the baseline to the X-axis to facilitate description, where the
meanings of the letters are the same as those in Fig. 5.2) will be used to explain
the corresponding relationship between the spatial feature points and the image
feature points.
In the 3D space coordinate system, a feature point W(x, y, -z) is orthogonally
projected on the left and right images as
The calculation of u″ here is carried out according to the coordinate transformation
of first translation and then rotation. Equation 5.24 can also be derived with the
help of Fig. 5.8 (a schematic diagram parallel to the XZ plane in Fig. 5.7 is given
here):
If u" has been determined by u‘ (i.e., the matching between the feature points has
been established), the depth of the feature points projected to u‘ and u" can be
inversely solved from Eq. 5.24 as
It can be seen from the above discussion that the feature points are just some specific
points on the object, and there is a certain interval between them. The dense disparity
field cannot be directly obtained from only sparse matching points, so the shape of
the object may not be accurately recovered. For example, Fig. 5.9a gives four points
that are coplanar in space (equidistant from another space plane). These points are
sparse matching points obtained by disparity calculation. These points are assumed
to be located on the outer surface of the object, but there can be infinitely many
surfaces passing through these four points, and several possible examples are given
as shown in Fig. 5.9b-d. It can be seen that only the sparse matching points cannot
uniquely restore the shape of the object, and it is necessary to combine some other
Fig. 5.9 Only sparse matching points cannot uniquely recover object shape
conditions (such as region matching) or interpolate the sparse matching points to
obtain a dense disparity map.
G(x, y, σ) = [1/(2πσ²)] exp[−(x² + y²)/(2σ²)]  (5.28)

where σ is the scale factor. The image multiscale representation after convolution
with the Gaussian convolution kernel is

L(x, y, σ) = G(x, y, σ) ⊗ f(x, y)  (5.29)
The Gaussian function is a low-pass filter, so convolving it with the image smooths
the image. The scale factor controls the degree of smoothing: a large σ corresponds
to a large scale, and after convolution mainly the coarse structure of the image
remains; a small σ corresponds to a small scale, and the details of the image are
preserved after convolution. In order to make full use of the image information at
different scales, a series of Gaussian convolution kernels with different scale factors
is used to construct the Gaussian pyramid. It is generally assumed that the ratio of
the scale factors between two adjacent layers of the Gaussian pyramid is k. If the
scale factor of the first layer is σ, the scale factor of the second layer is kσ, that of the
third layer is k²σ, and so on.
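The construction of such a multiscale representation can be sketched as follows; this example uses scipy's Gaussian filter as a stand-in for the convolution in Eq. 5.29, and the base scale σ, factor k, and number of levels are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, sigma=1.6, k=2 ** 0.5, levels=5):
    """Return the list [L(x, y, sigma), L(x, y, k*sigma), L(x, y, k^2*sigma), ...]."""
    return [gaussian_filter(image.astype(float), sigma * (k ** i))
            for i in range(levels)]

def dog_pyramid(gaussians):
    """Difference-of-Gaussian layers D = L(., k*sigma) - L(., sigma) (cf. Eq. 5.30)."""
    return [g2 - g1 for g1, g2 in zip(gaussians[:-1], gaussians[1:])]

rng = np.random.default_rng(0)
img = rng.random((128, 128))
L = gaussian_pyramid(img)
D = dog_pyramid(L)
print(len(L), "Gaussian layers,", len(D), "DoG layers")
```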
SIFT then searches for salient feature points in the multiscale representation of
the image using the difference of Gaussian (DoG) operator. DoG is the difference
of the convolution results obtained with two Gaussian kernels of different scales and is
similar to the Laplacian of Gaussian (LoG) operator. If h and k are used to represent
different scale factor coefficients, the DoG pyramid can be expressed as
D(x, y, σ) = [G(x, y, kσ) − G(x, y, hσ)] ⊗ f(x, y) = L(x, y, kσ) − L(x, y, hσ)  (5.30)
The gradient direction at each point of L(x, y) is computed as

θ(x, y) = arctan{[L(x, y + 1) − L(x, y − 1)] / [L(x + 1, y) − L(x − 1, y)]}  (5.32)
After obtaining the orientation of each point, the orientations of the pixels in the
neighborhood can be combined to obtain the orientation of the salient feature points.
See Fig. 5.10 for details. First (on the basis of determining the position and scale of
the salient feature points), take a 16 x 16 window centered on the salient feature
points, as shown in Fig. 5.10a. Divide the window into 16 groups of 4 x 4, as shown
in Fig. 5.10b. Calculate the gradient of each pixel in each group to obtain the
gradient of the pixels in the group, as shown in Fig. 5.10c in which the direction
of the arrow indicates the direction of the gradient and the length of the arrow is
proportional to the magnitude of the gradient. Use the eight-direction (interval 45°)
histogram to count the gradient directions of the pixels in each group, and take the
peak direction as the gradient direction of the group, as shown in Fig. 5.10d. In this
way, each of the 16 groups provides an 8D direction vector, and these vectors are
concatenated to form a 16 × 8 = 128D vector. This vector is normalized and finally used as
the description vector of each salient feature point, that is, the SIFT descriptor. In
practice, the coverage region of SIFT descriptors can be square or circular, also
known as salient patch.
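The assembly of the 128D vector from 4 × 4 groups of 8-bin orientation histograms can be sketched as follows; gradient computation via np.gradient and the omission of Gaussian weighting and rotation normalization are simplifying assumptions relative to the full SIFT procedure.

```python
import numpy as np

def sift_like_descriptor(patch16):
    """Build a 128D descriptor from a 16x16 patch: 4x4 groups, 8-bin
    orientation histograms weighted by gradient magnitude, then normalized."""
    assert patch16.shape == (16, 16)
    gy, gx = np.gradient(patch16.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)              # direction in [0, 2*pi)

    descriptor = []
    for by in range(4):
        for bx in range(4):
            sl = np.s_[4 * by:4 * by + 4, 4 * bx:4 * bx + 4]
            bins = (ang[sl] / (np.pi / 4)).astype(int) % 8   # 8 directions, 45 deg apart
            hist = np.bincount(bins.ravel(), weights=mag[sl].ravel(), minlength=8)
            descriptor.append(hist)
    v = np.concatenate(descriptor)                            # 16 groups x 8 bins = 128D
    return v / (np.linalg.norm(v) + 1e-12)                    # normalization step

patch = np.random.default_rng(0).random((16, 16))
print(sift_like_descriptor(patch).shape)                      # (128,)
```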
SIFT descriptors are invariant to image scaling, rotation, and illumination
changes, and they are also stable to affine transformations, viewing angle changes,
local shape distortion, and noise interference. This is because in the process of
acquiring SIFT descriptors, the influence of rotation is eliminated by calculating
and adjusting the gradient direction; the influence of illumination changes is eliminated
by vector normalization; and the combination of pixel direction information in
the neighborhood is used to enhance robustness. In addition, the SIFT descriptor is
rich in its own information and has good uniqueness (compared to edge points or
corner points that only contain position and extreme value information, the SIFT
descriptor has a 128D description vector). Also owing to this distinctiveness, a large
number of salient patches can often be identified in an image, from which selections
can be made in different applications. Of course, due to the high dimension of the
description vector, the computational complexity of the SIFT descriptor is often
relatively large (the next subsection describes a descriptor for accelerating SIFT).
There are also many improvements to SIFT, including replacing the gradient
histogram with PCA (effective dimensionality reduction), limiting the amplitude of the
histogram in each direction (some nonlinear illumination changes mainly affect the
amplitude), etc.
With the help of SIFT, a large number of local regions (generally more than a
hundred can be obtained for a 256 x 384 image) covering the image that do not
change with the translation, rotation, and zooming of the image can be determined in
the image scale space, and they are very little affected by noise and interference. For
example, Fig. 5.11 shows two results of salient patch detection. On the left is a ship
image and on the right is a beach image. All detected SIFT salient patches are
represented by circles (here, circular salient patches) covering on the image.
The SURF algorithm determines the position and scale information of the points of
interest by calculating the determinant of the second-order Hessian matrix of the
image. The Hessian matrix of image f(x, y) at position (x, y) and scale σ is defined as
follows:

H(x, y, σ) = [hxx(x, y, σ)  hxy(x, y, σ); hxy(x, y, σ)  hyy(x, y, σ)]  (5.33)

where hxx(x, y, σ), hxy(x, y, σ), and hyy(x, y, σ) are the results of convolving the
Gaussian second-order derivatives ∂²G(σ)/∂x², ∂²G(σ)/∂x∂y, and ∂²G(σ)/∂y²
with the image f(x, y) at (x, y), respectively.
The determinant of the Hessian matrix is

det(H) = hxx hyy − hxy²  (5.34)

A point that is a maximum in both scale space and image space is called a point of
interest. The value of the determinant of the Hessian matrix serves as the characteristic
(response) value at the point, and whether the point is an extreme point can be determined
according to the sign of the determinant at that image point.
The Gaussian filter is optimal for scale space analysis, but in practice, after
discretization and quantization, it loses repeatability (because the square template is
anisotropic) when the image is rotated by odd multiples of 45°. For
example, Fig. 5.12a and b show the discretized and quantized Gaussian second-order
partial derivative responses along the X direction and along the bisector of the X and
Y directions, respectively, which are quite different.
In practice, a box filter can be used to approximate the Hessian matrix, resulting in
faster computations (independent of filter size) with integral images (e.g., see [8]).
For example, Fig. 5.12c and d are approximations to the Gaussian second-order
partial differential responses of Fig. 5.12a and b, respectively, where the 9x9 box
filter is an approximation of Gaussian filter of scale 1.2, which also represents the
lowest scale (i.e., the highest spatial resolution) at which the response is computed.
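The speed advantage of box filters comes from integral images, with which the sum over any rectangle costs four array lookups regardless of the filter size; the following sketch shows this basic mechanism (the specific rectangle used is an arbitrary example, not one of the SURF filter layouts).

```python
import numpy as np

def integral_image(img):
    """Integral image ii, where ii[y, x] is the sum of img[:y+1, :x+1]."""
    return np.cumsum(np.cumsum(img.astype(float), axis=0), axis=1)

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] from the integral image with 4 lookups."""
    total = ii[y1 - 1, x1 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1 - 1]
    if x0 > 0:
        total -= ii[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

rng = np.random.default_rng(0)
img = rng.random((100, 100))
ii = integral_image(img)

# The cost of box_sum does not depend on the rectangle size,
# which is why larger and larger box filters can be applied at constant cost.
print(np.isclose(box_sum(ii, 10, 20, 40, 60), img[10:40, 20:60].sum()))
```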
Fig. 5.12 Gaussian second-order partial differential response and its approximation (light colors
represent positive values, dark colors represent negative values, and middle gray represents 0)
Denote the approximate values of hxx(x, y, σ), hxy(x, y, σ), and hyy(x, y, σ) as Axx, Axy,
and Ayy, respectively; then the determinant of the approximate Hessian matrix is

det(Happrox) = Axx Ayy − (W Axy)²

where W is the relative weight that balances the filter responses (i.e., it compensates
for using the box-filter approximation instead of the Gaussian convolution kernel)
so as to maintain the energy balance between the Gaussian kernel and its approximation.
The detection of points of interest needs to be done at different scales. The scale
space is generally represented by a pyramid structure (e.g., see [1]). But because of
the use of box filters and integral images, instead of using the same filter for each
level of the pyramid, box filters of different sizes are used directly on the original
image (at the same computational speed). So, it is possible to up-sample the filter
(Gaussian kernel) without iteratively reducing the image size. Taking the output of the
previous 9x9 box filter as the initial scale layer, the subsequent scale layers can be
obtained by filtering the image with larger and larger templates. Since the image is
Fig. 5.13 Filters between two adjacent scale layers (9 x 9 and 15 x 15)
Fig. 5.14 Illustration of filter side length in different groups (logarithmic horizontal axis)
larger than the size of the corresponding filter, a fourth group can also be computed,
using filters of sizes 51, 99, 147, and 195. Figure 5.14 shows the full
picture of the filter used, and each group overlaps with each other to ensure smooth
coverage of all possible scales. In typical scale space analysis, the number of points
of interest that can be detected in each group decreases very quickly.
The large scale change, especially between the first filters of these groups (the
change from 9 to 15 is a factor of about 1.7), makes the scale sampling quite coarse.
For this reason, a scale space with finer scale sampling can also be used. In this case,
the image is first scaled up by a factor of 2, and the filter of size 15 is then used to
start the first group. The next filter sizes are 21, 27, 33, and 39. Then the second
group starts, whose sizes increase in steps of 12 pixels, and so on. Thus, the scale
change between the first two filters is only 1.4 (21/15), and the minimum scale that
can be detected by quadratic interpolation is σ = (1.2 × 18/9)/2 = 1.2.
Since the Frobenius norm remains constant for any filter size, the filters can be
considered already normalized in scale, and it is no longer necessary to weight the
filter responses.
Around each point of interest, the Haar wavelet responses dx and dy are computed
in 4 × 4 subregions; the responses dx and dy are summed over each subregion, and
the absolute values |dx| and |dy| of the wavelet responses are also summed, respectively. In
this way, a 4D description vector V can be obtained from each subregion,
V = (Σdx, Σdy, Σ|dx|, Σ|dy|). For all 16 subregions, the description vectors are
concatenated to form a description vector of length 64D (see the code sketch given
after this SURF discussion). The wavelet responses thus obtained are insensitive
to changes in illumination, and invariance to contrast (a scalar factor) is obtained
by converting the descriptor into a unit vector.
Figure 5.17 is a schematic diagram of three different brightness patterns and the
descriptors obtained from the corresponding subregions. On the left is a uniform
pattern, and each component of the descriptor is very small; in the middle there is
an alternating pattern along the X direction, so only Σ|dx| is large and the rest are
small; on the right the brightness gradually increases along the horizontal direction,
so the values of Σdx and Σ|dx| are both large. It can be seen that the descriptors
differ markedly for different brightness patterns. It is also conceivable that if these
three local brightness patterns are combined, a specific descriptor is obtained.
Fig. 5.18 An image subregion without noise and with noise
The principle of SURF is similar to that of SIFT to some extent. Both of them
are based on the spatial distribution of gradient information. But in practice,
SURF often has better performance than SIFT. The reason here is that SURF
gathers all gradient information in the subregion, while SIFT only depends on the
orientation of each independent gradient. This difference makes SURF more
noise resistant. An example is shown in Fig. 5.18. When there is no noise,
SIFT has only one gradient direction; when there is noise (the edge is no longer
smooth), SIFT has a certain gradient component in other directions except that the
main gradient direction remains unchanged. However, the SURF response is
basically the same in both cases (the noise is smoothed).
The evaluation experiments on the number of sampling points and subregions
show that the square subregion divided by 4 x 4 can give the best results. Further
subdivision will lead to poor robustness and greatly increase the matching time.
On the other hand, the short descriptor (SURF-36, i.e., 3 x 3 = 9 subregions,
4 responses per subregion) obtained by using the subregion of 3 x 3 will slightly
reduce the performance (acceptable compared with other descriptors), but the
calculation is much faster.
Besides, there is a variant of the SURF descriptor, namely, SURF-128. It uses
the same sums as before but splits the values more finely: the sums of dx and |dx|
are computed separately according to whether dy < 0 or dy ≥ 0, and similarly
the sums of dy and |dy| are computed separately according to whether dx < 0 or
dx ≥ 0. In this way, the number of features is doubled, and the robustness and
reliability of the descriptor are improved. However, although the descriptor itself
is still fast to calculate, the amount of computation during matching increases
because of the higher dimension.
3. Quick index for matching
In order to index quickly during matching, the sign of the Laplacian value (i.e.,
of the trace of the Hessian matrix) of the point of interest can be considered. Generally, the
points of interest are detected at blob-like structures. The sign of
Laplacian value can distinguish the bright patch in the dark background from the
dark patch in the bright background. No additional calculation is required here,
because the sign of Laplacian value has been calculated in the detection step. In
the matching step, only the signs of the Laplacian values need to be compared. With
this information, the matching speed can be increased without degrading the
performance of the descriptors.
The advantages of the SURF algorithm include that it is not affected by image
rotation and scale changes and that it is robust to blur. Its disadvantages include
that it is strongly affected by changes of viewpoint and illumination.
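Under the simplifying assumptions that the Haar wavelet responses are approximated by plain finite differences and that no Gaussian weighting or orientation assignment is applied, the 64D SURF-style descriptor described above can be sketched as follows; the function and parameter names are illustrative.

```python
import numpy as np

def surf_like_descriptor(patch20):
    """Build a 64D SURF-style descriptor from a 20x20 patch around a point of
    interest: 4x4 subregions, each contributing (sum dx, sum dy, sum |dx|, sum |dy|)."""
    assert patch20.shape == (20, 20)
    p = patch20.astype(float)
    # Crude stand-ins for the Haar wavelet responses dx and dy
    dx = np.zeros_like(p)
    dy = np.zeros_like(p)
    dx[:, 1:-1] = p[:, 2:] - p[:, :-2]
    dy[1:-1, :] = p[2:, :] - p[:-2, :]

    features = []
    for by in range(4):
        for bx in range(4):
            sl = np.s_[5 * by:5 * by + 5, 5 * bx:5 * bx + 5]
            features += [dx[sl].sum(), dy[sl].sum(),
                         np.abs(dx[sl]).sum(), np.abs(dy[sl]).sum()]
    v = np.asarray(features)                     # 16 subregions x 4 values = 64D
    return v / (np.linalg.norm(v) + 1e-12)       # contrast invariance: unit vector

patch = np.random.default_rng(1).random((20, 20))
d = surf_like_descriptor(patch)
print(d.shape)                                   # (64,)
```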
The selection method of feature points is often closely related to the matching
method used for them. For the matching of feature points, it is necessary to establish
the corresponding relationship between feature points. Therefore, the ordering
constraint conditions can be used and the dynamic programming method can be
used [10].
Taking Fig. 5.19a as an example, consider three feature points on the visible
surface of the observed object, named A, B, and C in sequence. Their projections
(along the epipolar line) on the two images appear in exactly the reverse order (see
c, b, and a as well as c', b', and a', respectively). This regularity of opposite order is
called the ordering constraint. The ordering constraint describes an ideal situation
and cannot be guaranteed to always hold in actual scenes. For example, in the case
shown in Fig. 5.19b, a small object passes in front of a large object behind it and
blocks part of the large object, so that the original point c and point a' cannot be
seen in the images, and the order of the projections no longer satisfies the ordering
constraint.
In practical applications, the parallax map will contain errors due to periodic
patterns, smooth regions, occlusion effects, and the looseness of the constraint
principles. Since the parallax map is the basis for subsequent 3D reconstruction and
other processing, such errors need to be detected and corrected.
With the help of the ordering constraints discussed above, we first define the concept
of ordering matching constraints. Suppose that fL(x, y) and fR(x, y) are a pair of
(horizontal) images and OL and OR are their imaging centers, respectively. Let P and
Q be two points that do not coincide in space, PL and QL be the projections of P and
Q on fL(x, y), and PR and QR the projections of P and Q on fR(x, y), as shown in
Fig. 5.21 (see the discussion on binocular imaging in Sect. 3.3).
Assuming that X(·) is used to represent the X coordinate of a pixel, it can be seen
from Fig. 5.21 that, in a correct match, if X(P) < X(Q), then X(PL) < X(QL) and
X(PR) < X(QR); and if X(P) > X(Q), then X(PL) > X(QL) and X(PR) > X(QR). So, if
the following condition holds (⇒ denotes implication),

X(PL) > X(QL) ⇒ X(PR) > X(QR)  (5.37)
Then it is said that PR and QR satisfy the ordering matching constraint; otherwise,
it is said that there is a crossover here. It can be seen from Fig. 5.21 that the ordering
matching constraints have certain restrictions on the Z coordinates of points P and Q,
which are relatively easy to determine in practical applications.
Matching crossover regions can be detected based on the concept of the ordering
matching constraint. Let PR = fR(i, j) and QR = fR(k, j) be any two pixels in the j-th
row of fR(x, y); their matching points in fL(x, y) can then be written as
PL = fL(i + d(i, j), j) and QL = fL(k + d(k, j), j), respectively. Define C(PR, QR)
as the cross label between PR and QR: if Eq. 5.37 holds, denote it as C(PR, QR) = 0;
otherwise, denote it as C(PR, QR) = 1. The crossing number Nc of the pixel PR is
defined as

Nc(i, j) = Σ_{k=0, k≠i}^{N−1} C(PR, QR)  (5.38)

where N is the number of pixels in the image row.
If an interval in which the crossing numbers are not zero is called a crossing
interval, the mismatches in a crossing interval can be corrected by the following
algorithm. Suppose {fR(i, j) | i ∈ [p, q]} is the crossing interval; then the total
crossing number Ntc of all pixels in this interval is

Ntc = Σ_{i=p}^{q} Nc(i, j)  (5.39)

1. Find the pixel fR(k, j) with the largest crossing number Nc in the crossing interval
(Eq. 5.40).
2. Determine the new search range, {fL(i, j) | i ∈ [s, t]}, for the matching point
fR(k, j), where s and t are given by Eq. 5.41.
3. Find a new matching point from the search range that can reduce the total number
of crossings Ntc (e.g., the maximum grayscale correlation matching technology
can be used).
4. Use the new matching point to correct d(k, j) to eliminate the mismatch
corresponding to the pixel with the current maximum number of crossings.
The above steps can be used iteratively, and after correcting one mismatched
pixel, continue to correct each remaining error pixel. After d(k, j ) is corrected,
Nc(i, j) in the cross interval is recalculated by Eq. 5.38, and Ntc is calculated
according to Eq. 5.39. Then, the next round of correction process is performed
according to the above iteration until Ntc = 0. Because the correction principle is to
make Ntc = 0, it can be called a zero-cross correction algorithm. After correction, a
parallax map that complies with the ordering matching constraints can be obtained.
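The detection part of this zero-cross correction scheme can be sketched as follows: given the disparities d(i, j) of one image row, the code counts, for each pixel, how many other pixels violate the ordering matching constraint with it (Eq. 5.38) and sums these counts (Eq. 5.39); the simple O(N²) loop, the data layout, and the synthetic disparity values are illustrative assumptions.

```python
import numpy as np

def crossing_numbers(disparity_row):
    """Nc for each pixel of a row: the number of other pixels whose matched
    positions in the left image violate the ordering matching constraint."""
    n = len(disparity_row)
    matched = np.arange(n) + np.asarray(disparity_row)   # x_L = i + d(i, j)
    nc = np.zeros(n, dtype=int)
    for i in range(n):
        for k in range(n):
            if k == i:
                continue
            # A crossover: the order of i, k in the right image differs from
            # the order of their matches in the left image
            if (i - k) * (matched[i] - matched[k]) < 0:
                nc[i] += 1
    return nc

# Synthetic row of disparities in which three pixels are given a wrong, smaller
# disparity (a situation analogous to the worked example around Table 5.2)
d_row = [28, 28, 28, 28, 28, 28, 28, 21, 21, 21, 28]
nc = crossing_numbers(d_row)
print("Nc per pixel:", nc)
print("total crossing number Ntc:", nc.sum())   # > 0 signals a crossing interval
```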
The above process of detecting and eliminating errors can be exemplified as
follows. Assume that the results of parallax calculation in the interval [153, 163] in
the j-th row of an image are shown in Table 5.2, and the distribution of each
matching point in this interval is shown in Fig. 5.22. According to the correspondence
between fL(x, y) and fR(x, y), the matching points in the interval [160, 162] are
mismatched points. Table 5.3 can be obtained by calculating the number of crosses
according to Eq. 5.38.
It can be seen from Table 5.3 that the interval [fR(154, j), fR(162, j)] is a crossing
interval. From Eq. 5.39, Ntc = 28 is obtained, and from Eq. 5.40 it follows that the
pixel with the largest number of crossings is fR(160, j). Then, according to Eq. 5.41,
the search range of the new matching point for fR(160, j) can be determined as
{fL(i, j) | i ∈ [181, 190]}. A new matching point can then be found in this range
according to the maximum grayscale correlation matching technique, and d(160, j)
is corrected accordingly.
Fig. 5.22 Matching points between pixels 153–163 of fR(x, y) and pixels 181–191 of fL(x, y)
With the development of deep learning technology, it has been widely applied to stereo
matching. Different from traditional matching algorithms based on man-made
features, stereo matching algorithms based on deep learning can extract richer
image features for cost calculation by nonlinearly transforming the images through
convolution, pooling, and fully connected layers. Compared with man-made features, deep
learning features provide more context, make better use of the global
information of the image, and obtain the model parameters through training, which
improves the robustness of the algorithm. At the same time, GPU acceleration
can provide faster processing and meet the real-time requirements
of many application fields [12].
Currently, image networks for stereo matching mainly include image pyramid
networks, Siamese networks, and generative adversarial networks [13].
A spatial pyramid pooling layer is inserted between the convolution layers and the
fully connected layer; it converts image features of different sizes into a fixed-length
representation [14]. This avoids repeated convolution computations and removes the
constraint that all input images must have the same size.
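A minimal sketch of such a spatial pyramid pooling layer, assuming PyTorch is available; the pyramid levels (4 × 4, 2 × 2, 1 × 1) are a common choice and are used here only as an illustration.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(features, levels=(4, 2, 1)):
    """Pool a feature map of arbitrary spatial size into a fixed-length vector.

    features: tensor of shape (batch, channels, H, W) with arbitrary H, W.
    Returns a tensor of shape (batch, channels * sum(l*l for l in levels)).
    """
    pooled = []
    for level in levels:
        # Adaptive pooling produces a level x level grid regardless of H, W
        p = F.adaptive_max_pool2d(features, output_size=(level, level))
        pooled.append(p.flatten(start_dim=1))
    return torch.cat(pooled, dim=1)

# Feature maps of different spatial sizes yield representations of the same length
small = torch.randn(1, 64, 13, 17)
large = torch.randn(1, 64, 40, 55)
print(spatial_pyramid_pool(small).shape)   # torch.Size([1, 64 * (16 + 4 + 1)])
print(spatial_pyramid_pool(large).shape)   # same fixed length
```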
Table 5.4 lists some typical methods using image pyramid networks, together with
their characteristics, principles, and effects.

Table 5.4 Several typical methods using image pyramid networks and their characteristics, principles, and effects

Methods | Features and principles | Effect
[15] | A convolutional neural network is used to extract image features for cost calculation | The man-made features are replaced by deep learning features
[16] | A pyramid pooling module is introduced into feature extraction, and multiscale analysis and a 3D-CNN structure are adopted | It solves the problems of gradient vanishing and gradient explosion and is suitable for weak texture, occlusion, uneven lighting, and so on
[17] | Group-wise cost calculation is built | Computational efficiency is improved by replacing the 3D convolution layer
[18] | A semi-global aggregation layer and a local guided aggregation layer are designed | Computational efficiency is improved by replacing the 3D convolution layer
The basic structure of the Siamese network is shown in Fig. 5.25 [19]. First, the two
input images to be matched are converted into two feature vectors by two
weight-sharing convolutional neural networks (CNNs), and then the similarity
between the two images is determined according to the L1 distance between the two
feature vectors.
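A minimal PyTorch sketch of this structure, with a small illustrative CNN as the shared branch; the layer sizes are arbitrary assumptions, and only the weight sharing and the L1-distance comparison follow the description above.

```python
import torch
import torch.nn as nn

class SiameseBranch(nn.Module):
    """One shared feature extractor; both inputs pass through the same weights."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

class SiameseMatcher(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch = SiameseBranch()          # single instance => shared weights

    def forward(self, patch_a, patch_b):
        fa, fb = self.branch(patch_a), self.branch(patch_b)
        # L1 distance between the two feature vectors; smaller => more similar
        return torch.sum(torch.abs(fa - fb), dim=1)

matcher = SiameseMatcher()
left_patch = torch.randn(8, 1, 32, 32)         # a batch of left-image patches
right_patch = torch.randn(8, 1, 32, 32)        # candidate right-image patches
print(matcher(left_patch, right_patch).shape)  # torch.Size([8]) similarity scores
```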
Current methods improve on this basic Siamese structure; Table 5.5 shows their
characteristics, principles, and effects.

Table 5.5 Several improvements using Siamese networks and their characteristics, principles, and effects

Methods | Characteristics and principles | Effects
[20] | The convolutional layers are deepened, using the ReLU function and small convolution kernels | The matching accuracy is improved
[21] | When extracting features, the disparity map is first calculated from a low-resolution cost volume, and a hierarchical refinement network is then used to introduce high-frequency details | The color input is used as a guide; high-quality boundaries can be generated
[22] | Pyramid pooling is used to connect two subnetworks. The first subnetwork is composed of a Siamese network and a 3D convolutional network, which generates a low-precision disparity map; the second subnetwork is a fully convolutional network, which restores the initial disparity map to the original resolution | Multiscale features can be obtained
[23] | Depth discontinuities are processed on the low-resolution disparity map and restored to the original resolution in the disparity refinement stage | The continuity at depth discontinuities is improved
Table 5.6 Several improvements using GAN and their characteristics, principles, and effects

Methods | Characteristics and principles | Effects
[25] | A binocular-vision-based GAN framework is used, including two generative subnetworks and one discriminative network. The two generative networks are trained to reconstruct the disparity map in the adversarial learning; through mutual restriction and supervision, two disparity maps from different perspectives are generated, and the final result is output after fusion | This unsupervised model works well under uneven lighting conditions
[26] | Generative models are used to process occluded regions | A good parallax effect can be recovered
[27] | Generative adversarial models using deep convolutions obtain multiple depth maps from adjacent frames | The visual effect of the depth map in occluded regions is improved
[28] | Two images from the left and right cameras are used to generate a brand-new image that improves the poorly matched parts of the disparity map | The disparity map in regions with poor lighting is improved
In a generative adversarial network (GAN), the generative model is used to produce a
generated image similar to the original image, while the discriminative model is used
to distinguish the "generated" image from the real image [24]. This process runs
iteratively until the discrimination result reaches the Nash equilibrium, that is,
the probabilities of judging an image as true or false are both 0.5.
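A minimal PyTorch sketch of one adversarial training step, using toy fully connected generator and discriminator networks; the architectures, data, and optimizer settings are illustrative assumptions and are unrelated to the specific stereo matching networks of Table 5.6.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))      # generator
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))       # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, 32)                     # stand-in for real samples
noise = torch.randn(8, 16)

# Discriminator step: label real samples 1 and generated samples 0
fake = G(noise).detach()
loss_d = bce(D(real), torch.ones(8, 1)) + bce(D(fake), torch.zeros(8, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make the discriminator label generated samples as real
loss_g = bce(D(G(noise)), torch.ones(8, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# At equilibrium, D's output probability approaches 0.5 for both kinds of input
print(torch.sigmoid(D(G(noise))).mean().item())
```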
Several modifications have been made to this basic GAN approach; their
characteristics, principles, and effects are shown in Table 5.6.
References
1. Zhang Y-J (2017) Image Engineering, Vol. 1: Image Processing. De Gruyter, Germany.
2. Kanade T, Yoshida A, Oda K, et al. (1996) A stereo machine for video-rate dense depth
mapping and its new applications. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 196-202.
3. Lew MS, Huang TS, Wong K (1994) Learning and feature selection in stereo matching. IEEE
Transactions on Pattern Analysis and Machine Intelligence 16(9): 869-881.
4. Zhang Y-J (2002) Image Engineering (Addendum)—Teaching Reference and Problem Solving.
Tsinghua University Press, Beijing.
5. Forsyth D, Ponce J (2012) Computer Vision: A Modern Approach, 2nd Ed. Prentice Hall,
London.
6. Davies ER (2005) Machine Vision: Theory, Algorithms, Practicalities, 3rd Ed. Elsevier,
Amsterdam.
7. Kim YC, Aggarwal JK (1987) Positioning three-dimensional objects using stereo images. IEEE
Transactions on Robotics and Automation 1: 361-373.
8. Zhang Y-J (2017) Image Engineering, Vol. 2: Image analysis. De Gruyter, Germany.
9. Nixon MS, Aguado AS (2008) Feature Extraction and Image Processing. 2nd Ed. Academic
Press, Maryland.
10. Forsyth D, Ponce J (2003) Computer Vision: A Modern Approach. Prentice Hall, London.
11. Jia B, Zhang Y-J, Lin XG (2000) General and fast algorithm for disparity error detection and
correction. Journal of Tsinghua University (Science & Technology) 40(1): 28-31.
12. Li, JI, Liu T, Wang XF (2022) Advanced pavement distress recognition and 3D reconstruction
by using GA-DenseNet and binocular stereo vision. Measurement, 201: 111760 https://doi.org/
10.1016/j.measurement.2022.111760.
13. Chen Y, Yang LL, Wang ZP (2020) Literature survey on stereo vision matching algorithms.
Journal of Graphics 41(5): 702-708.
14. He KM, Zhang XY, Ren SQ, et al. (2015) Spatial pyramid pooling in deep convolutional
networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(9): 1904-1916.
15. Zbontar J, Lecun Y (2015). Computing the stereo matching cost with a convolutional neural
network. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1592-1599.
16. Chang J, Chen Y (2018) Pyramid stereo matching network. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 5410-5418.
17. Guo XY, Yang K, Yang WK, et al. (2019) Group-wise correlation stereo network. IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 3268-3277.
18. Zhang FH, Prisacariu V, Yang RG, et al. (2019) GA-NET: Guided aggregation net for end-to-
end stereo matching. IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
185-194.
19. Bromley J, Bentz JW, Bottou L, et al. (1993) Signature verification using a “Siamese” time
delay neural network. International Journal of Pattern Recognition and Artificial Intelligence
7(4): 669-688.
20. Zagoruyko S, Komodakis N (2015) Learning to compare image patches via convolutional
neural networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
4353-4361.
21. Khamis S, Fanello S, Rhemann C, et al. (2018) StereoNet: Guided hierarchical refinement for
real-time edge-aware depth prediction. European Conference on Computer Vision (ECCV)
596-613.
22. Liu GD, Jiang GL, Xiong R, et al. (2019) Binocular depth estimation using convolutional neural
network with Siamese branches. IEEE International Conference on Robotics and Biomimetics
(ROBIO) 1717-1722.
23. Guo CG, Chen DY, Huang ZQ. (2019) Learning efficient stereo matching network with depth
discontinuity aware super-resolution. IEEE Access 7: 159712-159723.
24. Luo JY, Xu Y, Tang CW, et al. (2017) Learning inverse mapping by AutoEncoder based
generative adversarial nets. Neural Information Processing 207-216.
25. Pilzer A, Xu D, Puscas M, et al. (2018) Unsupervised adversarial depth estimation using cycled
generative networks. International Conference on 3D Vision (3DV) 587-595.
26. Matias LPN, Sons M, Souza JR, et al. (2019) VeIGAN: Vectorial inpainting generative
adversarial network for depth maps object removal. IEEE Intelligent Vehicles Symposium
(IV) 310-316.
27. Lore KG, Reddy K, Giering M, et al. (2018) Generative adversarial networks for depth map
estimation from RGB video. IEEE Conference on Computer Vision and Pattern Recognition
Workshops (CVPRW) 1177-1185.
28. Liang H, Qi L, Wang ST, et al. (2019) Photometric stereo with only two images: a generative
approach. IEEE 2nd International Conference on Information Communication and Signal
Processing (ICICSP) 363-368.
29. Wu JJ, Chen Z, Zhang CX (2021) Binocular stereo matching based on feature cascade
convolutional network. Acta Electronica Sinica 49(4): 690-695.
30. Huang G, Liu Z, Van Der Maaten L, et al. (2017) Densely connected convolutional networks.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 4700-4708.
Chapter 6
Multi-ocular Stereovision
From the discussion in Sect. 3.3.1, in stereovision using the binocular horizontal
mode, the parallax d between the two images has the following relationship with the
baseline B between the two cameras (λ represents the camera focal length):

d = λB/(Z − λ) ≈ λB/Z  (6.1)

where the last step is the simplification that applies when, as is generally the case, Z >> λ.
It can be seen from Eq. 6.1 that for a given object distance Z, the parallax d is
proportional to the baseline length B. The larger the baseline length B, the more
accurate the distance calculation will be. However, the problem brought by the
longer baseline length is that a larger parallax range needs to be searched to find
matching points, which not only increases the amount of calculation but also
increases the probability of false matching when there are periodic repetitive features
in the image (see below).
In order to solve the above problems, the method of multi-ocular stereovision [1]
can be used, that is, using multiple cameras to obtain multiple pairs of related
images, and the baseline used for each pair of images is not very long (so the
searched parallax range is not very large), but combining multiple pairs of images
is equivalent to form a longer baseline, which can improve parallax measurement
accuracy without increasing the probability of false matches when there are periodic
repeating features in the images.
From binocular to multi-ocular, the most direct method is to add cameras along the
extension line of the original binocular baseline to form the multi-ocular. For the
binocular horizontal mode, a set of image sequences along the (horizontal) baseline
direction is used for stereo matching to become the multi-ocular horizontal mode.
The basic idea of this method is to reduce the overall mismatch by computing the
sum of squared differences (SSD) between pairs of images [2]. Assuming the
camera moves along a horizontal line perpendicular to the optical axis (multiple
cameras can also be used), acquire a series of images fi(x, y) at points P0, P1, P2, ...,
PM, i = 0, 1, ..., M (see Fig. 6.1), resulting in a series of image pairs whose baseline
lengths are B1, B2, ..., BM, respectively.
According to Fig. 6.1, the parallax between the image captured at point P0 and the
image captured at point Pi is

di = λBi/Z
Because only the horizontal direction is considered here, the image function
f(x, y) can be simplified to f(x), and the image obtained at each position can be
written as

fi(x) = f(x − di) + ni(x),  i = 0, 1, ..., M

It is assumed that the noise ni(x) follows a Gaussian distribution with mean 0 and
variance σn², that is, ni(x) ~ N(0, σn²).
In f0(x), the SSD value at position x is (W is the matching window)

Sd(x; d̂i) = Σ_{j∈W} [f0(x + j) − fi(x + d̂i + j)]²  (6.4)

where d̂i is the parallax estimate at position x. Since the SSD is a random variable, its
expected value can be calculated as a global evaluation function (let NW be the
number of pixels in the matching window):

E[Sd(x; d̂i)] = E{ Σ_{j∈W} [f(x + j) − f(x + d̂i − di + j) + n0(x + j) − ni(x + d̂i + j)]² }
             = Σ_{j∈W} [f(x + j) − f(x + d̂i − di + j)]² + 2NW σn²  (6.5)
The above equation shows that E[Sd(x; d̂i)] attains its minimum value when d̂i = di. If
the image has the same grayscale pattern at x and x + p (p ≠ 0), i.e.,

f(x + j) = f(x + p + j),  j ∈ W  (6.6)
This shows that the expected value of the SSD is likely to be extreme at both x and
x + p, i.e., there is an uncertainty problem, which will result in errors (mismatches).
Mismatching at x + p occurs for all image pairs (the location of the mismatch is not
related to the baseline length and baseline number), and the error cannot be avoided
even with multi-ocular images.
Now introduce the concept of inverse distance (or inverse depth), and search for the
correct parallax by searching for the correct inverse distance. The inverse distance
t satisfies

t = 1/Z  (6.8)

The true parallax and the parallax estimate can then be expressed in terms of the
inverse distance as

ti = di/(λBi)  (6.9)

t̂i = d̂i/(λBi)  (6.10)

where ti and t̂i are the true and estimated inverse distances, respectively.
Substituting Eq. 6.10 into Eq. 6.4, the SSD expressed as a function of the inverse
distance is

St(x; t̂i) = Σ_{j∈W} [f0(x + j) − fi(x + λBi t̂i + j)]²

Adding all the SSDs corresponding to the M inverse distance estimates provides the
sum of SSDs (SSSD), which can be expressed as

S^(S)_{t(12...M)}(x; t̂) = Σ_{i=1}^{M} St(x; t̂i)  (6.13)

and its expected value is

E[S^(S)_{t(12...M)}(x; t̂)] = Σ_{i=1}^{M} E[St(x; t̂i)]  (6.14)
Now consider the aforementioned problem of an image having the same pattern at
x and x + p (see Eq. 6.6). It should be noted that the uncertainty problem still exists
here, because there is still a minimum at the inverse distance tp = ti + p/(Bi λ).
However, tp now consists of two terms: as Bi changes, tp changes, whereas ti does
not. In other words, for each camera the parallax is proportional to the inverse
distance, but the location of the false minimum differs from camera to camera. This is an
important property when using inverse distance in SSSD, and it can help eliminate
uncertainty problems caused by periodic patterns. Specifically, by choosing different
baselines, the minimum values of the sum of squared differences between each pair
of images are located at different positions. Taking the use of two baselines B1 and
B2 (B1 ≠ B2) as an example, it follows from Eq. 6.14 that only at the correct matching
position t does the expected SSSD E[S^(S)_{t(12)}(x; t)] attain a true (common) minimum.
It can be seen that by using two baselines with different lengths, the uncertainty
caused by repeated patterns can be resolved with the help of the new metric function.
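The way combining several baselines suppresses the false minima of a periodic pattern can be illustrated numerically; the following sketch builds 1D images from a periodic signal according to the model fi(x) = f(x − di) + ni(x), computes the SSD of Eq. 6.4 for each baseline over a grid of inverse distances (using d̂i = λBi t̂ from Eq. 6.10), and sums them into the SSSD of Eq. 6.13. The signal, noise level, focal length, and baselines are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, Z = 1.0, 0.1                 # assumed focal length and true depth
t_true = 1.0 / Z                  # true inverse distance (Eq. 6.8)
baselines = [0.8, 1.0, 1.3]       # three different baselines B_i

x = np.arange(400)
f = np.sin(2 * np.pi * x / 8.0)   # periodic scene signal (period 8 pixels)

# Reference image f0 and the shifted images f_i with d_i = lam * B_i / Z
f0 = f + 0.05 * rng.standard_normal(f.size)
shifted = []
for B in baselines:
    d_i = lam * B / Z
    shifted.append(np.roll(f, int(round(d_i))) + 0.05 * rng.standard_normal(f.size))

def ssd(f0, fi, x0, d_hat, half_w=8):
    j = np.arange(-half_w, half_w + 1)
    return np.sum((f0[x0 + j] - fi[x0 + int(round(d_hat)) + j]) ** 2)

x0 = 200
t_grid = np.linspace(1.0, 20.0, 191)          # candidate inverse distances
sssd = np.zeros_like(t_grid)
for fi, B in zip(shifted, baselines):
    # d_hat = lam * B_i * t_hat (Eq. 6.10); each baseline adds its own SSD curve
    sssd += np.array([ssd(f0, fi, x0, lam * B * t) for t in t_grid])

print("true inverse distance:", t_true)
print("SSSD minimum at t =", t_grid[np.argmin(sssd)])   # close to t_true
```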
Figure 6.2 gives an example of the effect of the new metric function
[1]. Figure 6.2a shows the curve of f(x). Let d1 = 5 and σn² = 0.2, corresponding to
the baseline B1, and let the window size be 5. Figure 6.2b plots E[Sd1(x; d̂)], which
has minima at both d̂ = 5 and d̂ = 13, so the match is ambiguous.
Fig. 6.2 Example curves of E[S^(S)_{t(12)}(x, t)] versus t
It can be seen from the figure that, because the SSD curves are superimposed, the
minimum value at the correct matching position is smaller than the minimum
values at the false matching positions (and the difference between the two increases
with the number of images that are combined). In other words, the correct matching
position corresponds to the global minimum, which resolves the uncertainty problem.
Consider the case where f(x) is a periodic function with period T. Each St(x, t) is
then a periodic function of t whose period is Ti = T/(Bi λ), so a minimum occurs every
T/(Bi λ) along the t axis. When two baselines are used, the resulting S^(S)_{t(12)}(x; t) is
still a periodic function of t, but its period T12 increases to

T12 = LCM(T1, T2)  (6.19)

Here LCM stands for the least common multiple, so T12 will not be smaller than
T1 or T2. Further, by selecting appropriate baselines B1 and B2, it is
possible to make only one minimum value in the search interval, that is, the
uncertainty problem will be eliminated.
It can be seen from Fig. 4.5 that when binocular vision processes the matching
parallel to the epipolar line direction (i.e., the horizontal scanning line direction),
there will be a mismatch due to the lack of obvious features in the grayscale smooth
region. At this time, the gray values in the matching window will take the same
value within a certain range, so the matching position cannot be determined. This
kind of mismatching problem caused by the smooth grayscale region is inevitable in
the use of binocular stereo matching. The parallel-baseline multi-ocular stereo
matching method described earlier in Sect. 6.1 does not eliminate mismatches due
to this cause (although it can eliminate mismatches due to periodic patterns).
In practical applications, regions with relatively smooth gray levels in the horizontal
direction often have obvious grayscale differences in the vertical direction; in other
words, they are not smooth vertically. This suggests that one can use the
image pairs in the vertical direction to perform a vertical search to solve the problem
of mismatches that are easily generated by matching in the horizontal direction in
these regions. Of course, for the grayscale smooth region in the vertical direction,
only using the image pairs in the vertical direction may also cause a mismatch
problem, and it is necessary to use the image pairs in the horizontal direction to
perform horizontal matching.
Since both the horizontal grayscale smooth region and the vertical grayscale smooth
region may appear in the image, it is necessary to collect the horizontal image pair
and the vertical image pair at the same time. In the simplest case, two pairs of
orthogonal acquisition positions can be arranged on the plane (see Fig. 6.3). Here,
the left image L and the right image R form a horizontal stereo image pair, whose
baseline is Bh, and the left image L and the top image T form a vertical stereo image
pair, whose baseline is Bv. These two pairs of images constitute a set of orthogonal
trinocular images.
The characteristics of stereovision matching using orthogonal trinocular images
can be analyzed by referring to the method in Sect. 6.1. The images obtained at the
three acquisition positions can be represented as follows (since this is an orthogonal
acquisition, the images are represented by f(x, y))
where dh and dv are the horizontal and vertical parallaxes, respectively (see Eq. 6.3).
In the following discussion, let dh = dv = d, and the SSDs corresponding to the
horizontal and vertical directions are
Adding them up gives the orthogonal parallax metric function O(S)(x, y; d):

O(S)(x, y; d) = Sh(x, y; d) + Sv(x, y; d)  (6.23)
It can be seen that E[O(S)(x, y; d)] achieves a minimum at the correct parallax value.
It also follows from the above discussion that, when orthogonal trinocular images are
used, it is not necessary to resort to the inverse distance in order to eliminate
repeating patterns in one direction.
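A small sketch of the combined metric of Eq. 6.23: the horizontal and vertical SSDs are computed from a left/right and a left/top image pair and added; the synthetic images, window size, and the assumption dh = dv = d follow the simplifications above, with all numerical values being illustrative.

```python
import numpy as np

def ssd(a, b, y, x, dy, dx, half_w=3):
    """Window SSD between a (reference) and b displaced by (dy, dx)."""
    win = np.s_[y - half_w:y + half_w + 1, x - half_w:x + half_w + 1]
    shifted = np.s_[y + dy - half_w:y + dy + half_w + 1,
                    x + dx - half_w:x + dx + half_w + 1]
    return float(np.sum((a[win].astype(float) - b[shifted].astype(float)) ** 2))

rng = np.random.default_rng(0)
f_left = rng.integers(0, 256, size=(64, 64)).astype(float)
d_true = 4
f_right = np.roll(f_left, d_true, axis=1)      # horizontal pair: shift along x
f_top = np.roll(f_left, d_true, axis=0)        # vertical pair: shift along y

y0, x0 = 30, 30
candidates = range(0, 9)
# O(S)(x, y; d) = Sh(x, y; d) + Sv(x, y; d)  (cf. Eq. 6.23)
O = [ssd(f_left, f_right, y0, x0, 0, d) + ssd(f_left, f_top, y0, x0, d, 0)
     for d in candidates]
print("estimated parallax d =", int(np.argmin(O)))   # expected: d_true
```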
An example of using the orthogonal trinocular method to eliminate the mismatches
brought to the binocular method by grayscale smooth regions in one direction is
shown in Fig. 6.4. Figure 6.4a–c are a group of square-cone images (the left image,
right image, and top image, in turn) with smooth grayscale regions in the horizontal
and vertical directions; Fig. 6.4d is the parallax map obtained by stereo matching
using only the horizontal binocular images; Fig. 6.4e is the parallax map obtained by
stereo matching using only the vertical binocular images; and Fig. 6.4f is the
parallax map obtained by stereo matching using the orthogonal trinocular images.
Figure 6.4g–i are the 3D perspective views corresponding to Fig. 6.4d–f, respectively.
From these images, it can be seen that in the parallax map obtained from the
horizontal binocular images, there are obvious mismatches (horizontal black strips)
at the horizontally smooth grayscale regions; in the parallax map obtained from the
vertical binocular images, obvious mismatches (vertical black strips) occur in the
vertically smooth grayscale regions; and in the parallax map obtained from the
orthogonal trinocular images, the mismatches caused by the various unidirectional
grayscale smooth regions are eliminated. That is, the parallax calculations are
correct in the various regions, and these results are also clearly visible in the 3D
perspective views.
Fig. 6.4 Examples of orthogonal trinocular stereo matching to eliminate mismatches in grayscale
smooth regions
The orthogonal trinocular stereo matching method can reduce not only the
mismatches caused by grayscale smooth regions but also those caused by
periodic patterns. The following is an example of a scene that has periodically
repeating patterns in both the horizontal and vertical directions. Suppose f(x, y) is a
periodic function whose horizontal and vertical periods are Tx and Ty, respectively,
that is,

f(x, y) = f(x + Tx, y) = f(x, y + Ty)  (6.25)
where Tx ≠ 0 and Ty ≠ 0 are constants. Using Eqs. 6.21–6.24, it can be deduced that

E[Sh(x, y; d)] = E[Sh(x, y; d + Tx)]  (6.26)

E[Sv(x, y; d)] = E[Sv(x, y; d + Ty)]  (6.27)

E[O(S)(x, y; d)] = E[Sh(x, y; d + Tx) + Sv(x, y; d + Ty)]
It can be seen from Eq. 6.29 that, when Tx ≠ Ty, the period Txy of
O(S)(x, y; d) is generally larger than both the period Tx of Sh(x, y; d) and the
period Ty of Sv(x, y; d).
Consider further the range over which the parallax search is performed for
matching. If d ∈ [dmin, dmax] is set, then the numbers of minima Nh, Nv,
and N that occur in E[Sh(x, y; d)], E[Sv(x, y; d)], and E[O(S)(x, y; d)], respectively, are

Nh = (dmax − dmin)/Tx    Nv = (dmax − dmin)/Ty

N = (dmax − dmin)/LCM(Tx, Ty)  (6.32)

This shows that when Sh(x, y; d) and Sv(x, y; d) are replaced by O(S)(x, y; d) as the
similarity measure function, E[O(S)(x, y; d)] has fewer minima than both E[Sh(x, y;
d)] and E[Sv(x, y; d)] in the same parallax search range. In other words, the
probability of a mismatch is reduced. In practical applications, it is possible to try
to limit the parallax search range to further avoid false matching.
An example of using the orthogonal trinocular method to eliminate the false
matches that periodic patterns cause for the binocular method is shown in Fig. 6.5.
Figure 6.5a-c are the left image, right image, and top image, respectively, of a group
of images with periodically repeating texture (the ratio of the horizontal period to the
vertical period is 2:3). Figure 6.5d shows the parallax map obtained by stereo
matching using only the horizontal binocular images; Fig. 6.5e is the parallax map
obtained by stereo matching using only vertical binocular images; Fig. 6.5f is the
parallax map obtained by stereo matching using orthogonal trinocular images.
Figure 6.5g-i are 3D perspective views corresponding to Fig. 6.5d-f, respectively.
Due to the effect of the periodic patterns, there are many mismatches in both Fig. 6.5d
and e, while in the parallax map obtained from the orthogonal trinocular images,
most of the mismatches are eliminated. The effect of orthogonal trinocular stereo
matching is also clearly seen in the 3D perspective views.
Fig. 6.5 An example of eliminating periodic pattern mismatch by the orthogonal trinocular method
In trinocular vision, in order to reduce ambiguity as much as possible and ensure
the accuracy of feature location, it is necessary to generate two epipolar lines. These
two epipolar lines should be as orthogonal as possible in at least one image space,
which will help to uniquely determine all matching features. The projection center of
the third camera should not be on the same line with the projection centers of the
other two cameras; otherwise, the epipolar lines will be collinear. Once a feature is
uniquely defined, using more cameras does not reduce the influence of ambiguity.
However, the use of more cameras may produce further supporting data. It can help
further reduce the positioning error with the help of averaging, and it is possible to
obtain a slight increase in the accuracy and range of 3D depth perception.
Because the orthogonal trinocular stereo matching method can reduce a variety of
errors, there are many implementation methods. The main steps of an orthogonal
trinocular stereo matching method are as follows [3]: (1) two independent complete
parallax maps are obtained through the horizontal image pair and the vertical image
pair, respectively, using certain correlation matching algorithms (e.g., taking edge
points as the matching feature; see Sect. 5.2.1); (2) according to certain fusion
criteria, the two parallax maps are combined into one parallax map using a relaxation
technique. This method needs to use dynamic programming algorithms, fusion
technique. This method needs to use dynamic programming algorithm, fusion
criteria, relaxation technology, and other complex synthesis operations, so the
amount of calculation is large and the implementation is complex. A fast orthogonal
stereo matching method based on gradient classification is introduced as follows.
The basic idea of this method is to compare the smoothness of each region of
the image along the horizontal and vertical directions. For the smoother region in
the horizontal direction, the vertical image pair is used for matching, and for the
smoother region in the vertical direction, the horizontal image pair is used for
matching. In this way, it is not necessary to calculate two complete parallax maps,
respectively, and the synthesis of the parallax of the two regions is very simple.
Whether a region is smoother horizontally or vertically can be determined by
calculating the gradient direction of the region. The flowchart of the algorithm is
shown in Fig. 6.6, which is mainly composed of the following four specific steps:
1. The gradient direction information of each point in fL(x, y) is obtained by
calculating the gradient of fL(x, y).
2. According to the gradient direction information of each point in fL(x, y), fL(x, y)
can be divided into two parts by using the classification decision criteria: the
Fig. 6.6 Flowchart of 2D search stereo matching algorithm using gradient direction
horizontal region whose gradient direction is closer to the horizontal direction and
the vertical region whose gradient direction is closer to the vertical direction.
3. A horizontal image pair is used to match and calculate parallax in the region
where the gradient direction is closer to the horizontal direction, and a vertical
image pair is used to match and calculate parallax in the region where the gradient
direction is closer to the vertical direction.
4. The two parallax values are combined into a complete parallax map, and then the
depth map is obtained.
When calculating the gradient map, considering that it is only necessary to
compare or judge whether the gradient direction is closer to the horizontal direction
or the vertical direction, the following simple methods with low computational
complexity can be used. For any pixel (x, y) in fL(x, y), the horizontal gradient value
Gh and the vertical gradient value Gv are obtained by summing the grayscale
differences along the horizontal and vertical directions, respectively, over a window
of size W × W centered at (x, y).
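As an illustration of this classification step, the following sketch compares window sums of horizontal and vertical grayscale differences and selects the image pair to be used for each pixel. The difference-based gradient formula, the window size W, and the SciPy dependency are assumptions of this sketch, not the exact method of the text.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def select_matching_pair(f_left: np.ndarray, W: int = 7) -> np.ndarray:
    """Return a boolean mask: True where the horizontal image pair should be used.

    Gh and Gv approximate the strength of grayscale variation along the
    horizontal and vertical directions inside a W x W window centered at
    each pixel (an assumed variant based on absolute adjacent differences).
    """
    f = f_left.astype(np.float64)
    dh = np.abs(np.diff(f, axis=1, prepend=f[:, :1]))   # variation along x
    dv = np.abs(np.diff(f, axis=0, prepend=f[:1, :]))   # variation along y
    Gh = uniform_filter(dh, size=W)   # window average, proportional to the window sum
    Gv = uniform_filter(dv, size=W)
    # Gradient direction closer to horizontal (Gh >= Gv): the region is not
    # smooth horizontally, so the horizontal image pair is used for matching;
    # otherwise the vertical image pair is used.
    return Gh >= Gv
```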
Fig. 6.7 An example of eliminating the influence of grayscale smooth regions of image by
orthogonal trinocular method
In the above method, two templates (masks, windows) with different sizes are used.
The gradient template is used to calculate the gradient direction information, and the
matching (searching) template is used to calculate the relevant information of the
grayscale region. Here, the size of gradient template and the size of matching
(searching) template have a great impact on the matching performance [4].
The influence of gradient template size can be illustrated by taking Fig. 6.8 as an
example. In the figure, two regions with different gray levels are given, with A, B,
and C as vertices as well as B, C, E, and D as vertices, respectively. Assuming that
the point P to be matched is located near the horizontal edge segment BC, if the
gradient template is too small (such as the rectangle in Fig. 6.8a, which does not
include the point on the edge BC), it is difficult to distinguish the horizontal region
from the vertical region because Gh ≈ Gv, and the point P may then be matched with
the horizontal image pair as if the horizontal direction were relatively smooth, which may
cause false matching. If the gradient template is large enough (such as the rectangle
in Fig. 6.8b, which includes the points on the edge BC), then there must be Gh < Gv,
and then the point P will be matched with vertical image pairs to avoid mismatching.
However, in addition to the larger amount of computation, a template that is too
large may cover multiple different edges and lead to a wrong direction
determination.
The size of the matching (searching) template also has a great impact on the
performance. If the matching template is large enough to contain large enough
intensity changes for matching, then false matching will be reduced, but large
matching blur may occur. It can be divided into two cases (see Fig. 6.9). The two
regions with A, B, and C as vertices as well as B, C, E, and D as vertices,
respectively, have different textures (the rest are smooth regions).
1. When matching the boundary parts of the texture region and the smooth region
(as shown in Fig. 6.9a): if the template is small and only covers the smooth
region, the matching calculation will be random; if the template is large and
covers two regions, the appropriate matching image pair can be determined and
the correct matching can be obtained.
2. The boundary parts adjacent to two texture regions are matched (as shown in
Fig. 6.9b): since the template is always contained in the texture region, the correct
matching is guaranteed regardless of the template size.
The algorithm for detecting and correcting errors in parallax maps introduced in
Sect. 4.3 is also applicable to parallax maps obtained from orthogonal trinocular
stereovision. The definition of ordering matching constraint can be used for both
horizontal image pairs and vertical image pairs (after corresponding adjustment).
The flowchart of the basic algorithm for error detection and correction of parallax
map in orthogonal trinocular is shown in Fig. 6.10. The images involved here
include the left image fL(x, y), the right image fR(x, y), the top image fT(x, y), and
the parallax map d(x, y). First, the parallax map dX(x, y) corrected along the
horizontal direction is obtained by means of the ordering matching constraint in
the horizontal direction, then the resulted parallax map is corrected by means of the
ordering matching constraint in the vertical direction, and finally a new parallax map
dXY(x, y) satisfying the global (both along the horizontal X direction and along the
vertical Y direction) ordering matching constraint is obtained.
The orthogonal trinocular stereo matching described in Sect. 6.2 is a special case of
multi-ocular stereo matching. In more general cases, more than three cameras can be
used to form a stereo system, and the baseline of each image pair can also be
non-orthogonal. Two special cases, which are more general than orthogonal trinocular
matching, are discussed below.
In a trinocular stereo imaging system, three cameras can be arranged in any form
other than on a straight line or on a right triangle. Figure 6.11 shows the schematic
diagram of an arbitrarily arranged trinocular stereo imaging system, where C1, C2,
and C3 are the optical centers of three image planes, respectively, which can
determine a trifocal plane. Referring to the introduction of epipolar constraint in
Sect. 5.1.2, it can be seen that a given object point W (generally not located on the
trifocal plane) and any two optical center points can determine an epipolar plane. The
intersection of this plane with the image plane of the corresponding optical center is
the epipolar line. The epipolar line Lij represents the epipolar line in Image i, which
corresponds to Image j. Matching is always done on the epipolar line. In the
trinocular stereo imaging system, there are two epipolar lines on each image plane,
and the intersection of these two epipolar lines is the point where the line joining
the object point W and the corresponding optical center meets that image plane.
If all three cameras observe the object point W, the coordinates of the three image
points obtained are p1, p2, and p3, respectively. Each pair of cameras can determine
an epipolar constraint. If Eij is used to represent the essential matrix between Image
i and Image j, there are
$$p_1^T E_{12}\, p_2 = 0 \qquad (6.36)$$
$$p_2^T E_{23}\, p_3 = 0 \qquad (6.37)$$
$$p_3^T E_{31}\, p_1 = 0 \qquad (6.38)$$
If eij is used to represent the (homogeneous) coordinates of the epipole in Image i
with respect to Image j, the above three equations are not independent, because
e31^T E12 e32 = e12^T E23 e13 = e23^T E31 e21 = 0. However, any two of the above
equations are independent, so when the essential matrices are known, the coordinates
of the third image point can be predicted from the coordinates of any two image
points.
Compared with the binocular system, the trinocular system adds a third camera,
which can eliminate many uncertainties caused by only using the binocular image
matching. Although the methods introduced in Sect. 6.2 directly use two pairs of
images to match at the same time, in most trinocular stereo matching algorithms, a
pair of images is often used to establish the corresponding relationship, and then the
third image is used for verification, that is, the third image is used to check the match
made with the first two images [5].
A typical approach is shown in Fig. 6.12. Consider using three cameras to image a
scene with four points A, B, C, and D. In Fig. 6.12, the six points labeled 1, 2, 3, 4,
5, and 6 represent incorrect reconstructed positions for the four points in the first two
images (corresponding to optical centers O1 and O2, respectively). Take the point
marked 1 as an example, which is the result of a mismatch between a2 and b1. When
the 3D space point 1 reconstructed from the first two images is reprojected to the
third image, the problem of mismatching can be found. It neither coincides with a3
nor with b3, so it can be judged as an incorrect reconstruction position.
The above method first reconstructs the 3D space points corresponding to the
matching points in the first two images and then projects them to the third image. If
there is no compatible point near the projected point obtained above in the third
image, then the match is likely to be a false match. In practical applications, explicit
reconstruction and reprojection are not required. If the camera has been calibrated
(even only weakly calibrated [5]), and a 3D space point is known to correspond to
the two image points of the first image and the second image, respectively, then
taking the intersection of the corresponding epipolar lines can predict the position of
the 3D space point in the third image.
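A minimal sketch of this verification step is given below; the fundamental matrices F13 and F23 (taken here to map points of Images 1 and 2 to epipolar lines in Image 3) and the pixel tolerance are assumptions introduced for the illustration.

```python
import numpy as np

def predict_in_third_image(p1, p2, F13, F23):
    """Predict where a candidate match (p1 in Image 1, p2 in Image 2) should appear in Image 3.

    F13 @ p1 and F23 @ p2 are taken to be the epipolar lines of p1 and p2 in
    Image 3 (this convention is an assumption of the sketch).
    """
    h1 = np.array([p1[0], p1[1], 1.0])
    h2 = np.array([p2[0], p2[1], 1.0])
    l1 = F13 @ h1             # epipolar line of p1 in Image 3, as (a, b, c)
    l2 = F23 @ h2             # epipolar line of p2 in Image 3
    p3 = np.cross(l1, l2)     # intersection of the two lines (homogeneous coordinates)
    return p3[:2] / p3[2]

def match_is_consistent(p1, p2, p3_observed, F13, F23, tol=2.0):
    """Accept the candidate match only if a compatible point is observed in
    Image 3 near the predicted intersection of the two epipolar lines."""
    p3_pred = predict_in_third_image(p1, p2, F13, F23)
    return np.linalg.norm(np.asarray(p3_observed, dtype=float) - p3_pred) <= tol
```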
Several different matching methods are described below.
This approach relies on searching along epipolar lines to disambiguate and achieve
matching [6]. Referring to Fig. 6.13, let Lj^i represent the j-th epipolar line in Image i.
If it is known that L1^1 and L1^2 are corresponding epipolar lines in Image 1 and
Image 2, then to find in Image 2 the point corresponding to Point a in Image 1, it is
only necessary to search for edges along L1^2. Suppose two possible points b and
c are found along L1^2, but it is not known which one to choose. Let the epipolar
lines passing through points b and c in Image 2 be L2^2 and L3^2, respectively, and
the corresponding epipolar lines in Image 3 be L2^3 and L3^3, respectively. Now
consider the stereo image pair consisting of Image 1 and Image 3; if L1^1 and L1^3
are corresponding epipolar lines and there is only one point d along L1^3, located at
the intersection of L1^3 and L2^3, then it can be concluded that Point a in Image
1 and Point b in Image 2 correspond to each other, because they both correspond to
the same Point d in Image 3.
This method utilizes the detected edge segments from the image to achieve trinocular
stereo matching [7]. First, the edge segments in the image are detected, and then a
segment adjacency graph is defined. The nodes in the graph represent edge
segments, and the arcs between the nodes indicate that the corresponding edge
segments are adjacent. For each edge segment, it can be expressed with local
geometric features such as its length and direction, midpoint position, etc. After
obtaining the line segment adjacency graphs G1, G2, and G3 of the three images in
this way, matching can be performed as follows (see Fig. 6.14):
1. For a Segment S1 in G1, calculate the Epipolar line L21 in Image 2 corresponding
   to the Midpoint p1 of S1; the Point p2 of Image 2 corresponding to p1 will lie on
   the Epipolar line L21.
2. Consider the segments S2 in G2 that intersect the Epipolar line L21, and let the
   intersection of L21 and S2 be p2; for each such Segment S2, compare its length
   and direction with those of Segment S1. If the differences are less than the given
   thresholds, they are considered a possible match.
3. For each possible matching segment, further calculate the Epipolar line L32 of p2
   in Image 3, and set its intersection with the Epipolar line L31 of p1 in Image 3 as
   p3; near p3, search for a Segment S3 whose differences in length and direction
   from Segments S1 and S2 are less than the given thresholds (see the sketch
   below). If such a segment can be found, S1, S2, and S3 form a group of
   matching segments.
Carry out the above steps on all line segments in the graph, and finally get all
matching line segments to realize image matching.
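The three steps can be sketched compactly as follows; the helper functions `crossing` and `nearest_segment` that operate on the adjacency graphs, the segment attributes, and the thresholds are assumptions introduced for the illustration.

```python
import numpy as np

def epipolar_line(F, p):
    """Epipolar line (a, b, c) of pixel p under the fundamental matrix F (assumed convention)."""
    return F @ np.array([p[0], p[1], 1.0])

def similar(s1, s2, len_tol=5.0, dir_tol=0.1):
    """Length/direction comparison used in steps 2 and 3 (illustrative thresholds)."""
    d_dir = np.abs(np.angle(np.exp(1j * (s1["dir"] - s2["dir"]))))
    return abs(s1["len"] - s2["len"]) < len_tol and d_dir < dir_tol

def match_segments(G1, G2, G3, F12, F13, F23, crossing, nearest_segment):
    """Sketch of the three-step segment matching; `crossing(line, graph)` yields
    (segment, intersection point) pairs and `nearest_segment(point, graph)`
    returns the closest segment, both assumed helpers."""
    matches = []
    for s1 in G1:
        L21 = epipolar_line(F12, s1["mid"])                  # step 1
        for s2, p2 in crossing(L21, G2):                     # step 2
            if not similar(s1, s2):
                continue
            L31 = epipolar_line(F13, s1["mid"])              # step 3
            L32 = epipolar_line(F23, p2)
            p3 = np.cross(L31, L32)
            p3 = p3[:2] / p3[2]
            s3 = nearest_segment(p3, G3)
            if s3 is not None and similar(s1, s3) and similar(s2, s3):
                matches.append((s1, s2, s3))
    return matches
```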
In the previous matching based on edge segments, it is implicitly believed that the
contour of the scene to be matched is approximated by polygons. If the scene is
composed of polyhedra, the contour representation will be very compact. But for
many natural scenes, as the complexity of their contours increases, the number of
sides of the polygons used to express them may have to increase severalfold to
maintain the accuracy of the approximation. In addition, due to the change of the angle of view,
the corresponding polygons in the two images cannot ensure that their vertices are on
the corresponding epipolar line. In this case, more calculations are needed, such as
the use of improved epipolar constraints [8].
To solve the above problem, curve-based matching can be performed (i.e., the
local contour of the scene is approximated by a polynomial higher than the first
order). Referring to Fig. 6.15, assume that a curve T1^1 has been detected in Image
1 (the superscript indicates the image and the subscript indicates the serial number,
so Tj^i denotes the j-th curve in Image i), which is the image of a 3D curve on the
surface of the scene. The next step is to search for the corresponding curve in
Image 2. To this end, choose a Point p1^1 on T1^1 (the unit tangent vector at this
point is t1^1, and the curvature is k1^1). Consider the epipolar line L21 in Image
2 (the epipolar line in Image 2 corresponding to Image 1), and let it intersect the
curve family Tj^2 in Image 2. Here, j = 2 in the figure, that is, the epipolar line L21
intersects the curves T1^2 and T2^2 at the points p1^2 and p2^2 (the unit tangent
vectors at these two points are t1^2 and t2^2, respectively, and the curvatures are
k1^2 and k2^2, respectively). Next, in Image 3, consider the epipolar lines from
Image 2 that intersect the epipolar line L31 on which the correspondence of Point
p1^1 must lie. Here, j = 2 in the figure, that is, the epipolar lines L32,1 and L32,2
corresponding to the points p1^2 and p2^2 are considered (the number after the
comma in the subscript indicates the serial number). These two epipolar lines
intersect the epipolar line L31 at the points p1^3 and p2^3, respectively.
If the points p1^1 and p1^2 are corresponding, then theoretically a Point p1^3, whose
unit tangent vector and curvature can be calculated from the unit tangent vectors and
curvatures of the two points p1^1 and p1^2, should be found on the curve T1^3. If it
is not found, it may be that (1) there is no point very close to p1^3; (2) there is a curve
passing through the point p1^3, but its unit tangent vector is not as expected; or (3) there is
a curve passing through the point p1^3 and its unit tangent vector is as expected, but its
curvature is not as expected. Any of the above indicates that the points p1^1 and p1^2
should not correspond.
In general, for each pair of points p1^1 and pj^2, the intersection point pj^3 of the
epipolar line L31 of the point p1^1 and the epipolar line L32,j of the point pj^2 is
calculated in Image 3, together with the unit tangent vector tj^3 and the curvature
kj^3 predicted at the point pj^3. For each intersection point pj^3, search for the
closest curve Tj^3, and judge and act according to the following three conditions of
increasing stringency: (1) if the distance between the curve Tj^3 and the point pj^3
exceeds a certain threshold, cancel the correspondence; otherwise (2) calculate the
unit tangent vector at the corresponding point on Tj^3, and if its difference from the
unit tangent vector predicted from the points p1^1 and pj^2 exceeds a certain
threshold, cancel the correspondence; otherwise (3) calculate the curvature at that
point, and if its difference from the curvature predicted from the points p1^1 and
pj^2 exceeds a certain threshold, cancel the correspondence.
After the above filtering, only one possible corresponding Point pj^2 in Image 2 is
retained for the Point p1^1 in Image 1, and the nearest curves Tj^2 and Tj^3 are
further searched in the neighborhoods of the points pj^2 and pj^3. The above process
is performed for all selected points in Image 1, and the final result is that a series of
corresponding points pj^1, pj^2, and pj^3 are determined on the curves Tj^1, Tj^2,
and Tj^3, respectively.
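The three checks of increasing stringency can be expressed directly in code; the threshold values and the dictionary layout of the candidate curve are assumptions of this sketch.

```python
import numpy as np

def accept_curve_correspondence(p3_pred, t3_pred, k3_pred, curve3,
                                d_tol=2.0, t_tol=0.1, k_tol=0.05):
    """Apply the three conditions to the curve in Image 3 closest to the predicted point.

    p3_pred, t3_pred, k3_pred: predicted position, unit tangent, and curvature,
    computed from the candidate pair of points in Images 1 and 2.
    curve3: point, unit tangent, and curvature of the closest curve in Image 3.
    """
    # (1) the curve must pass sufficiently close to the predicted point
    if np.linalg.norm(np.asarray(curve3["point"]) - np.asarray(p3_pred)) > d_tol:
        return False
    # (2) its unit tangent vector must agree with the predicted tangent
    if np.linalg.norm(np.asarray(curve3["tangent"]) - np.asarray(t3_pred)) > t_tol:
        return False
    # (3) its curvature must agree with the predicted curvature
    if abs(curve3["curvature"] - k3_pred) > k_tol:
        return False
    return True
```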
It has been pointed out in Sect. 6.1 that replacing the unidirectional binocular images
with unidirectional multi-ocular images can eliminate the effect of the unidirectional
periodic pattern. In Sect. 6.2.1, it is also pointed out that replacing the unidirectional
binocular images (or multi-ocular images) with orthogonal trinocular images can
eliminate the influence of grayscale smooth regions. The combination of the two can
form an orthogonal multi-ocular image sequence, and it is better to use the orthogonal
multi-ocular stereo matching method to eliminate the above two effects at the
same time [9]. Figure 6.16 shows a schematic diagram of the shooting position of the
orthogonal multi-camera image sequence. Let the camera shoot along each point
L, R1, R2, ... on the horizontal line and each point L, T1, T2, ... on the vertical line,
and a stereo image series of orthogonal baselines can be obtained. The analysis of
orthogonal multi-ocular images can be obtained by combining the method of
unidirectional multi-ocular image analysis in Sect. 6.1 with the method and results of
orthogonal trinocular image analysis in Sect. 6.2.1.
The test results of the real image using the orthogonal multi-ocular stereo
matching method are as follows. Figure 6.17a is a parallax calculation result. The
orthogonal multi-ocular images used include, in addition to Fig. 4.5a and b and
Fig. 6.7a, one more image acquired along each of the horizontal and vertical
directions of Fig. 6.16; this is equivalent to adding the two positions R2 and T2 in
Fig. 6.16 for image acquisition. Figure 6.17b shows the corresponding 3D perspective
view. Comparing Fig. 6.17a and b with Fig. 6.7g and h, respectively, it can be
seen that the effect here is even better (fewer mismatch points).
Theoretically, multiple images can be acquired not only in the horizontal and
vertical directions but even in the depth direction (along the Z-axis). For example, as
shown in Fig. 6.16, the two positions D1 and D2 along the Z-axis can also be used.
However, practice shows that the contribution effect of depth-direction images to
recovering the 3D information of the scene is not obvious.
In addition, various cases of multi-ocular stereo matching can also be regarded as
the generalization of the method in this section. For example, a schematic diagram of
a four-ocular stereo matching is shown in Fig. 6.18. Figure 6.18a is a schematic
diagram of the projection imaging of the scene point W, and its imaging points for
the four images are p1, p2, p3, and p4, respectively. They are the intersections of the
four rays R1, R2, R3, and R4 with the four image planes in turn. Figure 6.18b shows
the projection imaging of a straight line L passing through the space point W. The
imaging results of the straight line on the four images are four straight lines l1, l2, l3,
and l4 on four planes Q1, Q2, Q3, and Q4, respectively. Geometrically, a ray passing
through C1 and p1 must also pass through the intersection of planes Q2, Q3, and Q4.
Algebraically, given the quad-focal tensor and any three straight lines passing
through the three image points, the position of the fourth image point can be
deduced [5].
6.4 Equal Baseline Multiple Camera Set
There are many forms of multi-ocular stereovision. For example, literature [10] also
provides the source code of a multi-ocular stereovision measurement system and some
photos and videos; literature [11] introduces a trinocular stereovision system composed
of a projector and two cameras. The following is a brief introduction to an
equal baseline multiple camera set (EBMCS), in which a total of five cameras are
used [12].
The equal baseline multiple camera set arranges five cameras in a cross that shoots in
parallel, as shown in Fig. 6.19. Among them, C0 is the center camera, C1 is the right
camera, C2 is the top camera, C3 is the left camera, and C4 is the bottom camera.
It can be seen from Fig. 6.19 that the four cameras around the center camera form
with the center camera four pairs of stereo cameras in binocular parallel mode,
respectively. Their baselines are of equal length, hence the name equal baseline
multiple camera set.
From the perspective of image processing, the images collected by C0 and C1 form a pair of
horizontal binocular stereo images. For convenience, each of the other pairs can also be
regarded as images obtained in the horizontal binocular mode. Of course, some
conversion is required here. A pair of stereo images collected for C0 and C2 needs to
be rotated 90° counterclockwise; a pair of stereo images collected for C0 and C4
needs to be rotated 90° clockwise; and a pair of stereo images collected for C0 and
C3 can be mirror-flipped. In this way, it is equivalent to calibrating the four pairs of
stereo images relative to the images collected by the central camera, and their results
can be combined and compared when calculating the parallax map.
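These conversions amount to simple array operations; the sketch below (the exact rotation directions depend on how the cameras are mounted and are taken from the description above) brings each C0-Ci pair into the horizontal binocular form.

```python
import numpy as np

def align_pair(center_img: np.ndarray, side_img: np.ndarray, which: str):
    """Re-orient a C0/Ci image pair so that it can be processed as a horizontal pair.

    which: 'right' (C1), 'top' (C2), 'left' (C3), or 'bottom' (C4).
    """
    if which == "right":        # C0-C1: already a horizontal pair
        return center_img, side_img
    if which == "top":          # C0-C2: rotate 90 degrees counterclockwise
        return np.rot90(center_img, 1), np.rot90(side_img, 1)
    if which == "bottom":       # C0-C4: rotate 90 degrees clockwise
        return np.rot90(center_img, -1), np.rot90(side_img, -1)
    if which == "left":         # C0-C3: mirror flip
        return np.fliplr(center_img), np.fliplr(side_img)
    raise ValueError(f"unknown camera position: {which}")
```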
A series of chessboard images with 11 × 8 black and white squares
(each 24 mm × 24 mm) is taken for the (geometric) calibration of the cameras
and the rectification of the acquired images, so that the inner corners, ten per row in
the horizontal direction and seven per column in the vertical direction, can be used.
The series of images consists of ten groups, each containing five images from five
cameras. The calculation of the calibration parameters can be found in [13], and the
rectification algorithm can be found in [14].
Due to the use of multiple cameras, in addition to geometric calibration, images
from EBMCS were color-calibrated. Because the parallax map used here is a
grayscale map, the color calibration is actually the calibration of the pixel intensity.
The expression for the triangular filter used to adjust the intensity is given by
$$f' = f + k\,(1 - |M - f|) \qquad (6.39)$$

Among them, f is the intensity before calibration, f' is the intensity after calibration,
k is the intensity correction factor selected for the feature points in the image,
and M is the middle value of the grayscale range of the image. Generally, the
grayscale range is from 0 to 255, and M is preferably greater than or equal to 128.
Image transformation was also performed on the parallax maps obtained from the
images captured by the five cameras. Parallax maps are obtained from images
modified by scaling, rectification, as well as transformations such as rotation and
specular reflection. Therefore, the points of each parallax map correspond to the
points of the center image after these transformations. However, the transformation
parameters are different in different stereo cameras. Therefore, the center image
requires various modifications depending on the stereo camera used. Here the
parallax maps need to be merged to get a higher quality map, so the requirement is
to unify the maps by making them refer to the same image. The unification of
parallax maps is obtained by performing a transformation on them that is the inverse
of the transformation performed on the images from which these parallax maps were
obtained. All points in the generated parallax map correspond to points in the input
center image before calibration and rectification.
One parallax map can be obtained from each pair of cameras, and EBMCS can
provide four parallax maps. The next step is to combine these four parallax maps into
a single parallax map. Two different methods can be used to combine: arithmetic
mean merging method (AMMM) and exception excluding merging method
(EEMM).
No matter which method is used, the parallax value of each coordinate point in the
resulting parallax map depends on the parallax value of each coordinate point
located in the same coordinate position in each pre-merging parallax map. However,
since it is possible that some parts of the scene are occluded during image acquisition,
the parallax values of the corresponding positions in one or several parallax
maps cannot be calculated, so the number of combined parallax values of certain
positions in the final result parallax map may be smaller than the number of camera
pairs included in the EBMCS. If N is the number of camera pairs in the EBMCS and
Mx is the number of pre-merging parallax maps that contain a parallax value at
coordinate x, then Mx ≤ N. The parallax value at coordinate x in the final result
parallax map obtained with AMMM is

$$D_f(x) = \frac{1}{M_x}\sum_{i=1}^{M_x} D_i(x) \qquad (6.40)$$

where Di(x) represents the parallax value at coordinate x in the pre-merging parallax
map with index i.
There may be significant differences between parallax values located at the same
coordinates in different pre-merging parallax maps. AMMM does not exclude any
values but only averages them. However, a significant difference would indicate that
at least one of the pre-merging parallax maps contained incorrect parallax values. To
eliminate potential false discrepancies, EEMM can be used.
Let the parallax after performing EEMM merging be denoted by E(x), which
depends on each parallax value Di(x) at coordinate x in the parallax map i before
merging. If a pre-merging parallax map does not contain a parallax value for
coordinate x, the value of E(x) is equal to 0. The function E(x) is calculated differently
depending on the number of pre-merging parallax maps that contain parallax values
at coordinate x.
If only one pre-merge parallax map with index i contains parallax value Di(x),
then the value of E(x) is equal to Di(x). When the number of pre-merging parallax
maps with parallax at coordinate x is equal to 2, the EEMM calculates the difference
between these parallax values. The parallax value is equal to |Di(x) - Dj(x)|, where
i and j are the indices of the parallax map before merging under consideration.
EEMM specifies a maximum acceptable difference, denoted as T. A difference
greater than T indicates that the parallax at this coordinate cannot be determined reliably.
Therefore, the EEMM states that the parallax value is undefined, and the value of E(x) is equal to
zero. If the difference between the parallax values is not greater than T, then E(x) is
equal to the arithmetic mean of the parallax values Di(x) and Dj(x):
$$E(x) = \begin{cases} \dfrac{D_i(x) + D_j(x)}{2} & \text{if } |D_i(x) - D_j(x)| \le T \\[2mm] 0 & \text{if } |D_i(x) - D_j(x)| > T \end{cases} \qquad (6.41)$$
In the case of merging three parallax values Di(x), Dj(x), and Dk(x), from different
pre-merging parallax maps, it is necessary to calculate the difference value between
each two parallax values and then use these differences’ value to determine the
calculation of E(x). Since there are a total of three difference values to be judged, the
condition of the maximum acceptable difference is set more stringent (the maximum
acceptable difference value S = T/2 at this time). The calculation of E(x) is divided
into four cases, as follows:
$$E(x) = \begin{cases}
\dfrac{D_i(x) + D_j(x) + D_k(x)}{3} & \text{if } |D_i(x)-D_j(x)| \le S,\ |D_i(x)-D_k(x)| \le S,\ |D_j(x)-D_k(x)| \le S \\[2mm]
D_i(x) & \text{if } |D_i(x)-D_j(x)| \le S,\ |D_i(x)-D_k(x)| \le S,\ |D_j(x)-D_k(x)| > S \\[2mm]
\dfrac{D_i(x) + D_j(x)}{2} & \text{if } |D_i(x)-D_j(x)| \le S,\ |D_i(x)-D_k(x)| > S,\ |D_j(x)-D_k(x)| > S \\[2mm]
0 & \text{if } |D_i(x)-D_j(x)| > S,\ |D_i(x)-D_k(x)| > S,\ |D_j(x)-D_k(x)| > S
\end{cases} \qquad (6.42)$$
It can be seen from Eq. (6.42) that when all three difference values are not greater
than S, the resulting parallax value E(x) of the merged parallax map is equal to the
arithmetic mean of all the pre-merging parallax values. When one difference value is
greater than S, E(x) is equal to the parallax value that appears in both of the two
conditions that are still satisfied. When two difference values are greater than S, E(x)
is equal to the arithmetic mean of the two parallax values in the single condition that
is satisfied. When all three difference values are greater than S, the resulting parallax
value E(x) is indeterminate.
The last case in EEMM occurs when the four pre-merging parallax maps i, j, k,
and l all have parallax values at coordinate x. In this case, the merge method first
sorts the parallax values from different pre-merging parallax maps. Two extreme
values (maximum and minimum) are removed after sorting. Then, the arithmetic
mean is calculated from the two remaining parallax values, and this mean is taken as
the result of the merge method:
$$E(x) = \frac{D_j(x) + D_k(x)}{2} \qquad \text{if } D_i(x) \le D_j(x) \le D_k(x) \le D_l(x) \qquad (6.43)$$
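The merging rules of Eqs. (6.40)-(6.43) can be collected in a single routine. The sketch below implements EEMM for the values available at one coordinate (AMMM would simply average them); treating an undefined result as 0 and using S = T/2 follow the description above, while the function layout is an assumption.

```python
def merge_parallax_eemm(values, T):
    """EEMM merge of the parallax values available at one coordinate (0 to 4 values).

    Returns 0.0 when the merged parallax is undefined.
    """
    v = sorted(values)
    if len(v) == 0:
        return 0.0
    if len(v) == 1:                       # only one map has a value here
        return v[0]
    if len(v) == 2:                       # Eq. (6.41)
        return (v[0] + v[1]) / 2 if abs(v[0] - v[1]) <= T else 0.0
    if len(v) == 3:                       # Eq. (6.42), with S = T / 2
        S = T / 2
        pairs = [(0, 1), (0, 2), (1, 2)]
        close = [p for p in pairs if abs(v[p[0]] - v[p[1]]) <= S]
        if len(close) == 3:               # all differences small: mean of all three
            return sum(v) / 3
        if len(close) == 2:               # the value common to both small differences
            counts = [sum(k in p for p in close) for k in range(3)]
            return v[counts.index(2)]
        if len(close) == 1:               # mean of the single compatible pair
            i, j = close[0]
            return (v[i] + v[j]) / 2
        return 0.0                        # all differences too large
    # four values, Eq. (6.43): drop the minimum and maximum, average the rest
    return (v[1] + v[2]) / 2
```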
6.5 Single-Camera Multi-mirror Catadioptric System
The overall system structure is shown in Fig. 6.20. It is composed of one camera and
five mirrors, combining a horizontal and a vertical layout. The top mirror, whose
focus is O1, is the main mirror, and the rest are secondary mirrors (sub-mirrors).
Four secondary mirrors are symmetrically placed in a plane perpendicular to the
optical axis of the main mirror and the camera, as shown in the top view (along the
optical axis of the camera, aerial view) of Fig. 6.20a. By arranging the primary and
secondary mirrors in two layers, the advantages of vertical and horizontal structures
are combined to achieve a longer baseline in a compact manner. The camera can
shoot the reflected images of five mirrors at a time, forming four stereo image pairs.
Taking O1-XYZ as the reference system and considering the (side view) XZ plane as
shown in Fig. 6.20b and c, the relative position of secondary mirror O2 can be
expressed by P = [Bx, 0, -Bz]^T, where Bx and Bz are the horizontal and vertical
baselines of the system, respectively. The relative positions of other secondary
mirrors can also be obtained according to symmetry.
The system design has the following three characteristics:
1. The primary mirror and four secondary mirrors form a system containing four
binocular stereo pairs. In practice, it may not be possible for all mirrors to capture
the desired scene due to potential occlusion in the system. However, such a
design enables each object in the scene to still be seen by at least two stereo
pairs, which makes it possible to achieve higher reconstruction accuracy by
fusing stereo pairs.
2. Compared with the general purely horizontal [16] or purely vertical [17] baseline
layouts, the special layout between the primary and secondary mirrors achieves a
longer stereo baseline in a compact manner.
3. Flexible selection of mirrors and cameras. Unlike traditional central catadioptric
systems, which are only able to use a limited combination of camera and mirror
types, this system can be built with either a central or noncentral configuration. As
shown in Fig. 6.20b, the orthographic cameras pointed at five parabolic mirrors
can be regarded as five different central cameras. As shown in Fig. 6.20c,
perspective cameras directed at multiple parabolic, hyperbolic, or spherical mirrors
can constitute multiple noncentral cameras. Through efficient system modeling,
the 3D reconstruction process can be unified and simplified.
$$P_t = \left[\frac{x_s + q_1}{z_s + q_3},\ \frac{y_s + q_2}{z_s + q_3},\ 1\right]^T \qquad (6.44)$$
3. Considering the distortion of the real camera, radial distortion compensation is
applied to Pt.
4. Finally, apply the generalized perspective projection to get the pixels [19]:
$$p = K P_t = \begin{bmatrix} g_1 & g_1\alpha & u_0 \\ 0 & g_2 & v_0 \\ 0 & 0 & 1 \end{bmatrix} P_t \qquad (6.45)$$
In the concept of a virtual camera array, each mirror and the region it occupies in the
pixel plane are considered a virtual sub-camera. Integrate mirror parameters into
virtual sub-cameras to convert the relative positions between mirrors into rigid body
transformations between virtual sub-cameras. After each sub-camera is independently
calibrated, the relative positions of the sub-cameras need to be jointly
optimized to improve the consistency between the sub-cameras.
Let c1 be the reference coordinate system of the main camera. The rigid body
transformation of ci (i = 2, 3, 4, 5) relative to c1 can be represented by T(ci-c1):

$$T(c_i\text{-}c_1) = \begin{bmatrix} R_{3\times 3} & t_{3\times 1} \\ 0 & 1 \end{bmatrix}, \quad i = 2, 3, 4, 5 \qquad (6.46)$$
where R3×3 is the rotation matrix and t3×1 is the translation vector.
Given each 3D point Pw,ij in the world coordinate system and its corresponding
imaging pixel p1,ij in the main camera and pi,ij in the i-th camera, the steps to
calculate the reprojection error are as follows:
1. First transform the world point Pw,ij to Pc1,ij in the c1 coordinate system using Tw-c1:

$$P_{c1,ij} = T_{w\text{-}c1}\, P_{w,ij} \qquad (6.47)$$

2. Then transform Pc1,ij to Pci,ij in the i-th sub-camera coordinate system by means
of the rigid body transformation T(ci-c1).
3. Next, use the internal parameter matrix Ki of the i-th sub-camera to convert Pci,ij
to the reprojection pixel coordinates p'i,ij:
If the function G denotes the whole process of obtaining p'i,ij from Pw,ij, then the
optimal rigid body transformation T(ci-c1) can be calculated by minimizing the
reprojection error shown in Eq. (6.50):
Since each virtual camera has been calibrated according to the previous virtual
sub-camera model description, the initial values of the parameters in G have been
obtained. Equation 6.51 can then be solved using a nonlinear optimization algorithm
such as the Levenberg-Marquardt algorithm.
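A minimal sketch of this refinement using SciPy's Levenberg-Marquardt solver is shown below; the parameterization of T(ci-c1) as a rotation vector plus a translation and the simple pinhole projection standing in for the function G are assumptions, not the book's implementation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, K_i, P_c1, p_obs):
    """Reprojection residuals of one sub-camera.

    params: 6-vector [rotation vector (3), translation (3)] describing T(ci-c1).
    P_c1:   (N, 3) array of 3D points already expressed in the c1 frame.
    p_obs:  (N, 2) array of observed pixels in the i-th sub-camera.
    """
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t = params[3:]
    P_ci = P_c1 @ R.T + t                 # transform the points into the ci frame
    proj = P_ci @ K_i.T                   # pinhole projection with intrinsics K_i
    p_pred = proj[:, :2] / proj[:, 2:3]
    return (p_pred - p_obs).ravel()

def refine_extrinsics(params0, K_i, P_c1, p_obs):
    """Refine T(ci-c1), starting from the independently calibrated initial value."""
    result = least_squares(residuals, params0, args=(K_i, P_c1, p_obs), method="lm")
    return result.x
```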
References
4. Jia B, Zhang Y-J, Lin X G (1998) Study of a fast tri-nocular stereo algorithm and the influence
of mask size on matching. Proceedings of International Workshop on Image, Speech, Signal
Processing and Robotics, 169-173.
5. Forsyth D, Ponce J (2003). Computer Vision: A Modern Approach. Prentice Hall, London.
6. Goshtasby A A (2005) 2-D and 3-D Image Registration—for Medical, Remote Sensing, and
Industrial Applications. Wiley-Interscience, Hoboken, USA.
7. Ayache N, Lustman F (1987) Fast and reliable passive trinocular stereovision. Proceedings of
First ICCV, 422-427.
8. Faugeras O (1993) Three-dimensional Computer Vision: A Geometric Viewpoint. MIT Press,
Cambridge, USA.
9. Jia B, Zhang Y-J, Lin X G (2000) Stereo matching using both orthogonal and multiple image
pairs. Proceedings of the ICASSP, 4: 2139-2142.
10. Wu H (2022) Code and dataset. https://ieee-dataport.org/documents/code-and-dataset.
11. Zhou D, Wang P, Sun C K, et al. (2021) Calibration method for trinocular stereo vision system
comprising projector and dual cameras. Acta Optica Sinica, 41(11): 120-130.
12. Kaczmarek A L (2017) Stereo vision with equal baseline multiple camera set (EBMCS) for
obtaining depth maps of plants. Computers and Electronics in Agriculture, 135: 23-37.
13. Zhang Z (2000) A flexible new technique for camera calibration. IEEE Transactions on Pattern
Analysis and Machine Intelligence 22: 1330-1334.
14. Hartley RI (1999) Theory and practice of projective rectification. International Journal of
Computer Vision, 35, 115-127.
15. Chen S Y, Xiang Z Y, Zou N (2020) Multi-stereo 3D reconstruction with a single-camera multi-mirror catadioptric system. Measurement Science and Technology, 31: 015102.
16. Caron G, Marchand E, Mouaddib E M (2009) 3D model based pose estimation for omnidirectional stereovision. Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, 5228-5233.
17. Lui W L D, Jarvis R (2010) Eye-full tower: A GPU-based variable multi-baseline omnidirectional stereovision system with automatic baseline selection for outdoor mobile robot navigation. Robotics and Autonomous Systems, 58(6): 747-761.
18. Xiang Z, Dai X, Gong X (2013) Noncentral catadioptric camera calibration using a generalized unified model. Optics Letters, 38: 1367-1369.
Chapter 7
Monocular Multi-image Scene Restoration
The stereo vision method introduced in the first two chapters restores the depth of the
scene according to two or more images obtained by the cameras in different
positions. Here, the depth information (distance information) can be regarded as
the redundant information from multiple images. Acquiring multiple images with
redundant information can also be achieved by collecting images of light change
and/or scene change at the same location. These images can be obtained with only
one (fixed) camera, so they can also be collectively referred to as monocular methods
(stereo vision methods are all based on multiple cameras and multiple images.
Although one camera can be used to shoot in multiple positions, it is still equivalent
to multiple cameras due to different angles of view). From the (monocular) multiple
images obtained in this way, the surface orientation of the scene can be determined,
and the relative depth between the parts of the scene can be directly obtained from
the surface orientation of the scene. In practice, it is often possible to further calculate
the absolute depth of the scene [1].
The sections of this chapter will be arranged as follows.
Section 7.1 first gives a general introduction to monocular scene restoration methods
and classifies them according to monocular multiple images and monocular single
images.
Section 7.2 introduces the basic principle of shape restoration by illumination and
discusses the photometric stereo method of determining the orientation of the
scene surface by using a series of images with the same viewing angle but
different illumination.
Section 7.3 discusses the basic principle of restoring shape from motion. Based on
the establishment of the optical flow field describing the moving scene, the
relationship between the optical flow and the surface orientation and relative
depth of the scene is analyzed.
Section 7.4 introduces a method to restore shape from contour, which combines
segmentation technology and convolutional neural network technology to decompose
the visual shell and estimate human posture.
The stereo vision method introduced in the first two chapters is an important method
of scene restoration. Its advantage is that the geometric relationship is very clear, but
its disadvantage is that it is necessary to determine the corresponding points in the
binocular or multi-ocular image. As can be seen from the first two chapters,
determining the corresponding points is the main work of stereo vision method,
and it is also a very difficult problem, especially when the lighting is inconsistent and
there are shadows. In addition, the stereo vision method needs to make several points
on the scene appear in all the images that need to determine the corresponding points
at the same time. In practice, due to the influence of line of sight occlusion, different
cameras cannot be guaranteed to have the same field of view, which leads to the
difficulty of corresponding point detection and affects the matching. At this time, if
the baseline length is shortened, the influence of occlusion may be weakened, but the
matching accuracy of the corresponding points will be reduced.
In order to avoid the complex problem of matching corresponding points, monocular
image scene restoration methods are also often used; that is, only a single
camera with a fixed position (which can shoot a single image or multiple images) is
used to collect images, and various 3D clues in the obtained images are used to
restore the scene [2]. Due to the loss of 1D information (depth information) when projecting the
3D world onto the 2D image, the key to restore the scene here is to restore the lost 1D
depth information and realize the 3D reconstruction of the scene [3, 4].
From a more general point of view, restoring the scenery is to restore the intrinsic
characteristics of the scenery. Among the various intrinsic characteristics of the
scene, the shape of 3D object is the most basic and important. On the one hand,
many other features of the object, such as surface normal and object boundaries, can
be derived from the shape; on the other hand, people usually define the object with
shape first and then use other characteristics of the object to further describe the
object. Various methods of restoring scenery from shape are often labeled with the
name “shape from X.” Here, X can represent tone change, texture change, scene
motion, illumination change, focal length size, scene pose, contour position, shadow
size, etc.
It is worth pointing out that in the process of obtaining 2D images from 3D
scenes, some useful information is indeed lost due to projection, but some
information is retained after conversion (or there are 3D clues of the scene in the 2D
images). Here are some examples:
1. If the light source position is changed during imaging, multiple images under
different lighting conditions can be obtained. The image brightness of the same
scenery surface varies with the shape of the scenery, so it can be used to help
determine the shape of the 3D scenery. At this time, multiple images do not
correspond to different viewpoints but to different illumination, which is called
shape from illumination.
2. If the scene moves during image acquisition, optical flow will be generated in the
image sequence composed of multiple images. The size and direction of optical
flow vary with the orientation of the scene surface, so it can be used to help
determine the 3D structure of the moving object, which is called shape from
motion; some people also refer to it as the structure from motion [5].
3. If the scenery rotates around itself in the process of image acquisition, the contour
of the scenery (the boundary between the object and the background, also known
as silhouette) will be easily obtained in each image. Combining these contours,
the surface shape of the scenery can also be restored, which is called shape from
contour, or shape from silhouette (SfS).
4. During the imaging process, some information about the shape of the original
scenery will be converted into the brightness information corresponding to the
shape of the original scenery in the image (or, when the illumination is determined,
the brightness change in the image is related to the shape of the scenery).
Therefore, according to the shading of the image, we can try to recover the surface
shape of the scenery, which is called shape from shading.
5. In the case of perspective projection, some information about the shape of the
scenery will remain in the changes of the surface texture of the object (different
orientations of the scenery surface will lead to different surface texture changes).
Therefore, through the analysis of the texture changes, we can determine the
different orientations of the object surface and then try to recover its surface
shape, which is called shape from texture.
6. There is also a close relationship between the focal length change caused by
focusing on the scene at different distances and the depth of the scene. Therefore,
the distance of the corresponding scene can be determined according to the focal
length of the clear imaging of the scene, which is called shape from focal length.
7. In addition, if the 3D scenery model and the camera focal length are known, the
perspective projection can establish the corresponding relationship between the
3D scene points and the imaging points on the image, so that the geometric shape
and pose of the 3D scenery can be calculated by using the relationship between
several corresponding points (a pose estimation from three-point perspective will
be discussed in Sect. 7.4).
Among the seven examples of scenery restoration listed above, the first three
cases need to collect multiple images, which will be introduced in the following
sections of this chapter, respectively; the last four cases only need to collect a single
image, which will be introduced in Chap. 8. The above methods can also be used in
The shape from illumination is realized according to the photometric stereo principle.
Photometric stereo, also known as photometric stereoscopic or photometric
stereo vision, is a method of reconstructing 3D information of the scene by using
photometric information (illumination direction, intensity, etc.) in the scene. The
specific method is to restore the surface orientation (normal direction) of the scene
with the help of a series of images collected under the same viewing angle (the same
viewpoint) but in different light source directions and, on this basis, restore the 3D
geometric structure of the scene.
Photometric stereo method is based on three conditions: (1) the incident light is
parallel light or comes from a point source at an infinite distance; (2) it is
assumed that the reflection model of the object surface is a Lambert reflection
model, that is, the incident light is uniformly scattered in all directions, and it is
the same for the observer to observe from any angle; (3) the camera model is an
orthogonal projection model.
The photometric stereo method is often used in environments where the lighting
conditions are easy to control or determine; it has a low cost and can often recover
finer local details. For an ideal Lambert surface (see Sect. 7.2.2), a good effect can
often be obtained. The four steps of shape recovery are as follows: (1) establishing a
lighting model, (2) calibrating light source information, (3) solving the surface
reflectance and/or normal information, and (4) calculating the depth information.
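Under the three conditions above, step (3) reduces to a linear least-squares problem once at least three images with known light directions are available; the following sketch (an illustration under the Lambertian assumption, not the book's algorithm) recovers the albedo and unit normals pixel by pixel.

```python
import numpy as np

def photometric_stereo(intensities: np.ndarray, light_dirs: np.ndarray):
    """Recover albedo and unit surface normals from k >= 3 images.

    intensities: (k, H, W) stack of images taken under the k light directions.
    light_dirs:  (k, 3) unit vectors pointing toward the light sources.
    Solves I = albedo * (N . L) for every pixel in the least-squares sense.
    """
    k, H, W = intensities.shape
    I = intensities.reshape(k, -1)                        # (k, H*W)
    g, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)    # g = albedo * normal, (3, H*W)
    albedo = np.linalg.norm(g, axis=0)
    normals = g / np.maximum(albedo, 1e-12)               # unit normals, (3, H*W)
    return albedo.reshape(H, W), normals.reshape(3, H, W)
```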
Scene brightness and image brightness are two related but different concepts in
photometry. In imaging, the former is related to radiant intensity or radiance,
while the latter is related to irradiance or illuminance. Specifically, the former
corresponds to the luminous flux emitted by the surface of the scene (regarded as a
light source), which is the power emitted by the unit area of the light source surface
within the unit solid angle, and the unit is Wm-2 sr-1. The latter corresponds to the
luminous flux irradiating on the surface of the scene, which is the power per unit area
irradiating on the surface of the scene, and the unit is Wm-2. In optical imaging, the
scene is imaged on the image plane (of the imaging system), so the scene brightness
corresponds to the luminous flux emitted from the surface of the scene, and the
image brightness corresponds to the luminous flux obtained by the image plane.
It should be noted that the image brightness obtained after imaging the 3D scene
depends on many factors. For example, the light intensity reflected by an ideal
diffuse surface when illuminated by a point light source is proportional to the
incident light intensity, the surface reflection coefficient, and the cosine of the light
incidence angle (the angle between the surface normal and the incoming ray). In a more
general case, the image brightness is affected by the shape of the scene itself, the
attitude in space, the surface reflection characteristics, the relative orientation and
position between the scene and the image acquisition system, the sensitivity of the
acquisition device, and the radiation intensity and distribution of the light source,
which does not represent the intrinsic characteristics of the scene.
For a scene surface patch of area δO imaged onto an image patch of area δI, with
z the distance of the patch from the lens along the optical axis, f the focal length,
α the angle between the ray through the center of the lens and the optical axis, and
θ the angle between the surface normal and that ray, the two areas are related by

$$\frac{\delta O}{\delta I} = \frac{\cos\alpha}{\cos\theta}\left(\frac{z}{f}\right)^2 \qquad (7.1)$$
Let’s see how much light from the surface of the scene will pass through the lens.
Because the lens area is n(d/2)2, it can be seen from Fig. 7.1 that the solid angle of the
lens seen from the scene patch is:
$$\frac{\pi (d/2)^2}{(z/\cos\alpha)^2}\,\cos\alpha = \frac{\pi}{4}\left(\frac{d}{z}\right)^2 \cos^3\alpha \qquad (7.2)$$
In this way, the power emitted by the surface patch SO of the scene and passing
through the lens is:
where L is the brightness of the scene surface in the direction toward the lens. Since
the light from other areas of the scene will not reach the image patch SI, the
illumination obtained by this patch is:
From Eq. (7.5), it can be seen that the measured patch illuminance E is directly
proportional to the brightness L of the scene of interest, is directly proportional to
the area of the lens, and is inversely proportional to the square of the focal length of
the lens. The illumination change caused by camera movement is reflected in the
included angle α.
When imaging the observed scene, the brightness L of the scene is related not only to
the luminous flux incident on the surface of the scene and the proportion of the
incident light reflected but also to the geometric factors of light reflection, that is, to
the direction of illumination and line of sight. Now let’s look at the coordinate
system shown in Fig. 7.2, where N is the normal of the surface patch, OR is an
arbitrary reference line, and the direction of a light L can be expressed by the
included angle θ (called polar angle) between the light and the normal of the patch
and the included angle φ (called azimuth) between the orthographic projection of the
light on the surface of the scene and the reference line.
With the help of such a coordinate system, the direction of light incident on the
surface of the scene can be represented by (θi, φi), and the direction reflected to the
observer's line of sight can be represented by (θe, φe), as shown in Fig. 7.3.
Thus, the bidirectional reflection distribution function (BRDF), which is very
important for understanding surface reflection, can be defined, and it is written as
Fig. 7.3 Schematic diagram of bidirectional reflection distribution function
Fig. 7.4 Schematic diagram of calculating surface brightness under the condition of extended light source
f(θi, φi; θe, φe) below. It represents the brightness of the surface observed by the
observer in the direction V(θe, φe) when light enters the surface of the scene in the
direction L(θi, φi). The unit of the bidirectional reflection distribution function is the
reciprocal of the solid angle (sr^-1), and its value ranges from zero to infinity (in the
latter case, an arbitrarily small incident illumination leads to observable radiation).
Note that f(θi, φi; θe, φe) = f(θe, φe; θi, φi), that is, the bidirectional reflection
distribution function is symmetrical with respect to the incident and reflection
directions. Let the illuminance obtained by light incident on the object surface along
the direction (θi, φi) be δE(θi, φi) and the reflected (emitted) brightness observed in
the direction (θe, φe) be δL(θe, φe). The bidirectional reflection distribution function
is the ratio of brightness to illuminance, that is:

$$f(\theta_i, \phi_i; \theta_e, \phi_e) = \frac{\delta L(\theta_e, \phi_e)}{\delta E(\theta_i, \phi_i)} \qquad (7.6)$$
Now consider further the case of an extended light source (e.g., see [8]). In
Fig. 7.4, the width of an infinitesimal patch on the sky (which can be considered as
a sphere of radius 1) is δθi along the polar angle and δφi along the azimuth. The solid
angle corresponding to this patch is δω = sin θi δθi δφi (where sin θi takes into account
the reduced spherical radius). If E0(θi, φi) is the illuminance per unit solid angle along
the direction (θi, φi), the illumination of the patch is E0(θi, φi) sin θi δθi δφi, and the
illumination received by the whole surface is:

$$E = \int_{-\pi}^{\pi}\int_{0}^{\pi/2} E_0(\theta_i, \phi_i)\,\sin\theta_i \cos\theta_i \, d\theta_i\, d\phi_i \qquad (7.7)$$
where cos θi takes into account the influence of projecting the surface along the
direction (θi, φi) (onto a plane perpendicular to the normal).
In order to obtain the brightness of the whole surface, the product of the
bidirectional reflection distribution function and the patch illumination needs to be
integrated over the hemisphere containing the possible incident light. With the help
of Eq. (7.6), we have:

$$L(\theta_e, \phi_e) = \int_{-\pi}^{\pi}\int_{0}^{\pi/2} f(\theta_i, \phi_i; \theta_e, \phi_e)\, E_0(\theta_i, \phi_i)\,\sin\theta_i \cos\theta_i\, d\theta_i\, d\phi_i \qquad (7.8)$$
The above result is a function of the two variables (θe, φe), which indicate the
direction of the light reaching the observer.
The bidirectional reflection distribution function is related to both the incidence of
light and the observation of light. Common light incidence and observation methods
include the four basic forms shown in Fig. 7.5, where θ represents the angle of
incidence and φ represents the azimuth. They are combinations of diffuse incidence
di or directional incidence (θi, φi) with diffuse reflection de or directional observation
(θe, φe). Their reflectances are as follows: diffuse incidence-diffuse reflection ρ(di;
de), directional incidence-diffuse reflection ρ(θi, φi; de), diffuse incidence-directional
observation ρ(di; θe, φe), and directional incidence-directional observation ρ(θi, φi;
θe, φe).
Only two extreme cases are considered below: ideal scattering surface and ideal
specular reflection surface.
1. Ideal scattering surface
The ideal scattering surface, also known as Lambert surface or diffuse
reflection surface, is equally bright from all observation directions (independent
of the angle between the observation line of sight and the surface normal), and it
reflects all incident light completely unabsorbed. Therefore, f(θi, φi; θe, φe) of the
ideal scattering surface is a constant (independent of angle), which can be
calculated as follows. For a surface, its brightness integrated over all observation
directions should be equal to the total illumination obtained by the surface, that is:

$$\int_{-\pi}^{\pi}\int_{0}^{\pi/2} f(\theta_i, \phi_i; \theta_e, \phi_e)\, E(\theta_i, \phi_i)\cos\theta_i\, \sin\theta_e \cos\theta_e\, d\theta_e\, d\phi_e = E(\theta_i, \phi_i)\cos\theta_i \qquad (7.9)$$

where both sides are multiplied by cos θi to convert to the N direction. From the
above equation, it can be solved that the BRDF of the ideal scattering surface is:

$$f(\theta_i, \phi_i; \theta_e, \phi_e) = \frac{1}{\pi} \qquad (7.10)$$

It can be seen from the above that, for an ideal scattering surface, the relationship
between brightness L and illuminance E is:

$$L = \frac{E}{\pi} \qquad (7.11)$$
In practice, the common frosted (matte) surface will reflect light divergently,
and the ideal frosted surface model is the Lambertian model. The reflectivity of
a Lambert surface depends only on the incident angle i; more precisely, the reflected
brightness varies with i as cos i. For a given reflected light intensity L, it can be seen
that the incident angle satisfies cos i = C · L, where C is a constant (related to the
constant reflection coefficient, i.e., the albedo). Therefore, i is also a constant. It can be concluded
that the normal of the surface is on a directional cone around the direction of
the incident light, the half angle of the cone is i, and the axis of the cone points to
the point source of illumination, that is, the cone is centered on the direction of the
incident light.
Two such cones generally intersect along two lines, which define two possible
directions in space, as shown in Fig. 7.6. Therefore, in order to make the surface
normal completely unambiguous, a third cone is needed. When using three light
sources, the surface normal must lie on each of the three cones: the first two cones
intersect along two lines, and the third cone, in general position, reduces the
possibilities to a single line, thus giving a unique interpretation and estimate of the
direction of the surface normal. It should be noted that if
some points are hidden behind and are not shot by the light of a light source, there
will still be ambiguity. In fact, the three light sources cannot be in the same
straight line, and they should be relatively separated on the surface without
blocking each other.
If the absolute reflection coefficient R of the surface is unknown, a fourth cone
can be considered. Using four light sources can help determine the orientation of
an unknown or nonideal characteristic surface. But this is not always necessary.
For example, when the three rays are orthogonal to each other, the sum of the squares
of the cosines of the included angles relative to each axis must be 1, which indicates that only
two angles are independent. Therefore, three sets of data are used to determine
R and two independent angles, that is, a complete solution is obtained. The use of
four light sources in practical applications can help determine any inconsistent
interpretation, which may come from the presence of specular elements.
2. Ideal specular surface
The ideal specular reflection surface reflects like a mirror (e.g., the highlight
region on the object is the result of the specular reflection of the light source by
the object), so the reflected light wavelength only depends on the light source and
has nothing to do with the color of the reflection surface. Unlike the ideal
scattering surface, an ideal specular reflection surface can reflect all the light
emitted from the (θi, φi) direction to the (θe, φe) direction. At this time, the
incident angle and the reflection angle are equal, as shown in Fig. 7.7. The BRDF
of the ideal specular reflection surface will be proportional (with the scale factor k) to
the product of the two pulses δ(θe − θi) and δ(φe − φi − π).
In order to calculate the scale factor k, the brightness of the surface in all
directions is integrated, which should be equal to the total illumination obtained
by the surface, that is:
$$\int_{-\pi}^{\pi}\int_{0}^{\pi/2} k\,\delta(\theta_e - \theta_i)\,\delta(\phi_e - \phi_i - \pi)\,\sin\theta_e \cos\theta_e\, d\theta_e\, d\phi_e = k \sin\theta_i \cos\theta_i = 1 \qquad (7.12)$$
From it, the BRDF of the ideal specular reflection surface can be solved:
$$f(\theta_i, \phi_i; \theta_e, \phi_e) = \frac{\delta(\theta_e - \theta_i)\,\delta(\phi_e - \phi_i - \pi)}{\sin\theta_i \cos\theta_i} \qquad (7.13)$$
When the light source is an extended light source, substituting the above equation
into Eq. (7.8) gives the brightness of the ideal specular reflection surface:

$$L(\theta_e, \phi_e) = \int_{-\pi}^{\pi}\int_{0}^{\pi/2} \frac{\delta(\theta_e - \theta_i)\,\delta(\phi_e - \phi_i - \pi)}{\sin\theta_i \cos\theta_i}\, E(\theta_i, \phi_i)\,\sin\theta_i \cos\theta_i\, d\theta_i\, d\phi_i = E(\theta_e, \phi_e - \pi) \qquad (7.14)$$
It can be seen that the polar angle has not changed, but the azimuth has turned
180°.
In practice, both ideal scattering surface and ideal specular reflection surface
are extreme cases, which are relatively rare. Many surfaces can be regarded as
having the properties of both parts of the ideal scattering surface and of the ideal
specular reflection surface (see Sect. 7.5.2 for further discussion). In other words,
the BRDF of the actual surface is the weighted sum of Eqs. (7.10) and (7.13).
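Written out explicitly, such a weighted combination can take the following form, where the weights kd and ks (with kd + ks = 1) are introduced here only for illustration:

$$f(\theta_i, \phi_i; \theta_e, \phi_e) = k_d\,\frac{1}{\pi} + k_s\,\frac{\delta(\theta_e - \theta_i)\,\delta(\phi_e - \phi_i - \pi)}{\sin\theta_i \cos\theta_i}, \qquad k_d + k_s = 1,\ k_d, k_s \ge 0$$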
The orientation of the scene surface is an important description of the surface. For a
smooth surface, each point on it has a corresponding tangent plane, and the orientation of
this tangent plane can be used to represent the orientation of the surface at that point. The
normal vector of the surface, that is, the (unit) vector perpendicular to the tangent
plane, can indicate the orientation of the tangent plane. If the Gaussian spherical
coordinate system is used (see [8]) and the tail of the normal vector is placed in the
center of the ball, the top of the vector and the sphere will intersect at a specific point,
which can be used to mark the surface orientation. The normal vector has two
degrees of freedom, so the position of the intersection on the sphere can be
represented by two variables, such as polar angle and azimuth, or longitude and
latitude.
The selection of these variables is related to the setting of the coordinate system.
Generally, for convenience, one axis of the coordinate system is often overlapped
with the optical axis of the imaging system, and the system origin is placed at the center of the lens, so that the other two axes are parallel to the image plane. In a right-handed system, the Z-axis can be pointed toward the image, as shown in Fig. 7.8. In this way, the scene surface can be described by its distance −z from the lens plane (which is parallel to the image plane).
Now express the surface normal vector in terms of z and the partial derivatives of z with respect to x and y. The surface normal is perpendicular to all lines in the tangent plane, so it can be obtained by calculating the outer (cross) product of any two nonparallel vectors in the tangent plane, as shown in Fig. 7.9.
If a small step δx is taken from a given point (x, y) along the X-axis direction, then according to the Taylor expansion the change along the Z-axis direction is δz = δx ∂z/∂x + ε, where ε contains the higher-order terms. In the following, p and q are used to represent the partial derivatives of z with respect to x and y, respectively; (p, q) is generally called the surface gradient. In this way, the vector along the X-axis direction is [δx 0 pδx]^T, which is parallel to the vector r_x = [1 0 p]^T of a line through (x, y) in the tangent plane. Similarly, a line parallel to the vector r_y = [0 1 q]^T also passes through (x, y) in the tangent plane. The surface normal can be obtained by taking the outer product of these two vectors. Finally, it is necessary to determine whether the normal points toward or away from the observer. If it is chosen to point toward the observer, then:
$$\mathbf{n} = \frac{\mathbf{N}}{|\mathbf{N}|} = \frac{[-p\ \ {-q}\ \ 1]^T}{\sqrt{1+p^2+q^2}} \quad (7.16)$$
The included angle θe between the normal of the scene surface and the direction toward the lens is calculated next. If the scene is quite close to the optical axis, the unit
observation vector V from the scene to the lens can be considered as [0 0 1]T, so
the dot product of the two unit vectors can be obtained:
$$\mathbf{N}\cdot\mathbf{V} = \cos\theta_e = \frac{1}{\sqrt{1+p^2+q^2}} \quad (7.17)$$
When the distance between the light source and the scene is much larger than the scale of the scene itself, the direction of the light source can be indicated by a single fixed vector; the surface orientation corresponding to this vector is the one that is orthogonal to the light emitted by the light source. If the normal of such a surface is expressed as [−ps −qs 1]^T, then, when the light source and the observer are on the same side of the scene, the direction of the light source can be indicated by the gradient (ps, qs).
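As a small numerical illustration (not from the book) of how gradients encode orientation, the following NumPy sketch builds the unit vector of Eq. (7.16) for a surface gradient (p, q) and, with the same convention, for a light source described by (ps, qs), and evaluates cos θe of Eq. (7.17) together with cos θi; all numerical values are arbitrary.

```python
import numpy as np

def unit_normal(p, q):
    """Unit vector [-p, -q, 1]/sqrt(1 + p^2 + q^2), as in Eq. (7.16)."""
    n = np.array([-p, -q, 1.0])
    return n / np.linalg.norm(n)

V = np.array([0.0, 0.0, 1.0])        # unit viewing vector (scene near the axis)

def cos_angles(p, q, ps, qs):
    """Return (cos theta_e, cos theta_i) for surface gradient (p, q) and a
    light source whose direction is described by the gradient (ps, qs)."""
    N = unit_normal(p, q)            # surface normal
    S = unit_normal(ps, qs)          # light source direction, same convention
    return float(N @ V), float(N @ S)

print(cos_angles(0.3, -0.2, 0.5, 0.5))   # arbitrary illustrative values
```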
Now consider linking pixel gray scale (image brightness) to pixel gradient (surface
orientation).
Consider that a point light source irradiates a Lambert surface, and the illumination is E. According to Eq. (7.10), its brightness L is:

$$L = \frac{1}{\pi}E\cos\theta_i \quad (7.18)$$

where θi is the angle between the surface normal vector [−p −q 1]^T and the vector pointing to the light source [−ps −qs 1]^T. Note that since the brightness cannot be negative, 0 ≤ θi ≤ π/2. The inner product of these two unit vectors gives:

$$\cos\theta_i = \frac{1 + p_s p + q_s q}{\sqrt{1 + p^2 + q^2}\,\sqrt{1 + p_s^2 + q_s^2}} \quad (7.19)$$
By substituting it into Eq. (7.18), the relationship between the brightness of the
scene and the orientation of the surface can be obtained. The relation function
obtained in this way is recorded as R( p, q) and drawn as a function of gradient
(p, q) in the form of isolines. The obtained graph is called the reflection map. Generally, the PQ plane is called gradient space, in which each point (p, q) corresponds to a
specific surface orientation. In particular, the point at the origin represents all planes
perpendicular to the viewing direction. The reflection map depends on the properties
of the target surface material and the location of the light source, or in other words,
the information of the surface reflection characteristics and the distribution of the
light source are integrated in the reflection map.
The image illumination is proportional to several constants, including the reciprocal of the square of the focal length and the fixed brightness of the light source. In
practice, the reflection map is often normalized for unified description. For the
Lambert surface illuminated by a distant point light source, there is:

$$R(p,q) = \frac{1 + p_s p + q_s q}{\sqrt{1 + p^2 + q^2}\,\sqrt{1 + p_s^2 + q_s^2}} \quad (7.20)$$

From the above formula, the relationship between the brightness of the scene and the orientation of the surface can be obtained from the reflection map. For the Lambert surface, the isolines will be nested conics, because from R(p, q) = c one obtains (1 + p_s p + q_s q)² = c²(1 + p² + q²)(1 + p_s² + q_s²). The maximum value of R(p, q) is obtained at (p, q) = (p_s, q_s).
Figure 7.10 shows three examples of Lambert surface reflection diagrams, of
which Fig. 7.10a is the case when ps = 0, qs = 0 (corresponding to nested concentric circles); Fig. 7.10b shows the situation when ps ≠ 0 and qs = 0 (corresponding to ellipses or hyperbolas); Fig. 7.10c shows the situation when ps ≠ 0 and qs ≠ 0 (corresponding to hyperbolas).
Now consider another extreme case, called the isotropic radiation surface. If the
surface of an object can radiate evenly in all directions (physically impossible), it
will feel brighter when you look at it obliquely. This is because the tilt reduces
the visible surface area, and it is assumed that the radiation itself does not change, so
the amount of radiation per unit area will be larger. At this time, the brightness of the
surface depends on the reciprocal of the cosine of the radiation angle. Considering the projection of the object surface in the direction of the light source, it can be seen that the brightness is proportional to cos θi/cos θe. Because cos θe = 1/(1 + p² + q²)^{1/2}, there is:

$$R(p,q) = \frac{1 + p_s p + q_s q}{\sqrt{1 + p_s^2 + q_s^2}} \quad (7.21)$$
The isolines are now parallel straight lines, because (1 + p_s p + q_s q) = c(1 + p_s² + q_s²)^{1/2} can be obtained from R(p, q) = c. These lines are orthogonal to the direction (p_s, q_s).
Figure 7.11 is an example of an isotropic radiation surface reflection diagram.
Here, let ps/qs = 1/2, so the slope of the isoline (straight line) is 2.
The reflection map shows the dependence between surface brightness and surface
orientation. The illumination E(x, y) of a point on the image is proportional to the
brightness of the corresponding point on the surface of the scene. If the surface
gradient at this point is (p, q), the brightness of this point can be recorded as R(p, q).
If the scale coefficient is set to the unit value by normalization, then:

$$I(x, y) = R(p, q) \quad (7.22)$$
This equation is called the image brightness constraint equation, which shows
that the gray level I(x, y) of the pixel at (x, y) in the image depends on the reflection
characteristic R(p, q) expressed by (p, q) of the pixel. The image brightness
constraint equation connects the brightness of any position (x, y) in the image
plane XY with the orientation (p, q) of the sampling unit expressed in a gradient
space PQ. The image brightness constraint equation plays an important role in
restoring the shape of the object surface from the image.
Suppose a sphere with a Lambert surface is illuminated by a point light source, and the observer is located at the position of the point source. Since θe = θi and (ps, qs) = (0, 0), the relationship between brightness and gradient can be seen from
Eq. (7.20):
$$R(p,q) = \frac{1}{\sqrt{1+p^2+q^2}} \quad (7.23)$$
If the center of the sphere is on the optical axis, its surface equation is:

$$z = z_0 + \sqrt{r^2 - x^2 - y^2}, \qquad x^2 + y^2 \le r^2 \quad (7.24)$$

where r is the radius of the ball and −z_0 is the distance between the ball center and the lens (see Fig. 7.12). According to p = −x/(z − z_0) and q = −y/(z − z_0), we can get (1 + p² + q²)^{1/2} = r/(z − z_0) and finally:

$$R(p,q) = \frac{z - z_0}{r} = \frac{\sqrt{r^2 - x^2 - y^2}}{r} \quad (7.25)$$
As can be seen from the above equation, the brightness gradually decreases from
the maximum value in the center of the image to the zero value at the edge of the
image. The same conclusion can also be obtained by considering the light source
direction S, line of sight direction V, and surface direction N marked in Fig. 7.12.
When people observe such a shadow change, they will think that the image is imaged
by a circular or spherical object. However, if each part of the surface of the ball has
different reflection characteristics, the image and feeling will be different. For
example, when the reflection map is represented by Eq. (7.21) and (ps,
qs) = (0, 0), a disk with uniform brightness is obtained. For people who are used
to observing the reflection characteristics of Lambert surface, such a sphere will
appear relatively flat.
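The shading of such a sphere is easy to synthesize numerically. The following sketch is only an illustration under the same assumptions as above (Lambert surface, light source at the observer, sphere centered on the optical axis); it renders the brightness pattern that falls from its maximum at the image center to zero at the rim.

```python
import numpy as np

def sphere_brightness(size=129, radius=1.0):
    """Brightness of a Lambert sphere lit from the viewing direction:
    E = sqrt(1 - (x^2 + y^2)/r^2) inside the disk, 0 outside."""
    xs = np.linspace(-radius, radius, size)
    x, y = np.meshgrid(xs, xs)
    rho2 = x**2 + y**2
    img = np.zeros_like(x)
    inside = rho2 <= radius**2
    img[inside] = np.sqrt(1.0 - rho2[inside] / radius**2)
    return img

img = sphere_brightness()
print(img[64, 64], img[64, 0])       # ~1.0 at the center, 0.0 at the rim
```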
For a given image, people often hope to restore the shape of the original imaging
object. The correspondence between the surface orientation determined by p and
q and the brightness determined by the reflection map R(p, q) is unique, but the
reverse is not necessarily true. In practice, there are often an infinite number of
surface orientations that can give the same brightness. On the reflection map, these
orientations corresponding to the same brightness are connected by isolines. In some
cases, special points with maximum or minimum brightness can often be used to
help determine the surface orientation. According to Eq. (7.20), for a Lambert
surface, R( p, q) = 1 can only be true when ( p, q) = ( ps, qs), so given the surface
brightness, the surface orientation can be uniquely determined. However, in general,
the corresponding relationship from image brightness to surface orientation is not
unique, because the brightness has only one degree of freedom (brightness value) in
each spatial position, while the orientation has two degrees of freedom (two gradient
values).
Thus, in order to restore the surface orientation, new information needs to be
introduced. To determine the two unknowns p and q, there should be two equations.
Two equations can be generated for each image point by using the two images
collected under different lighting conditions (see Fig. 7.13):
$$R_1(p,q) = E_1, \qquad R_2(p,q) = E_2 \quad (7.26)$$
If these equations are linear and independent, then there is a unique solution for p and q. If these equations are nonlinear, then for p and q, there may be either no
solutions or multiple solutions. The correspondence between brightness and surface
orientation is not unique, which is an ill-conditioned problem. Acquiring two images
is equivalent to adding equipment to provide additional conditions to solve the
ill-conditioned problem.
The calculation of solving the image brightness constraint equation can be carried
out as follows. Set:
$$R_1(p,q) = \frac{1 + p_1 p + q_1 q}{r_1}, \qquad R_2(p,q) = \frac{1 + p_2 p + q_2 q}{r_2} \quad (7.27)$$

where

$$r_j = \sqrt{1 + p_j^2 + q_j^2}, \quad j = 1, 2 \quad (7.28)$$

It can be seen that as long as p_1/q_1 ≠ p_2/q_2, p and q can be solved from the above equations:

$$p = \frac{(E_1 r_1 - 1)q_2 - (E_2 r_2 - 1)q_1}{p_1 q_2 - p_2 q_1}, \qquad q = \frac{(E_2 r_2 - 1)p_1 - (E_1 r_1 - 1)p_2}{p_1 q_2 - p_2 q_1} \quad (7.29)$$
It can be seen from the above that given two corresponding images collected
under different lighting conditions, the unique solution can be obtained for the
surface orientation of each point on the imaging object.
An example of solving the image brightness constraint equation is shown in
Fig. 7.14. Figure 7.14a and b are two corresponding images acquired from the same sphere under different lighting conditions (the same light source is placed in two different positions). Figure 7.14c shows the result of drawing the orientation vector
of each point after calculating the surface orientation with the above method. It can
be seen that the orientation close to the center of the ball is relatively perpendicular to
the paper, while the orientation close to the edge of the ball is relatively parallel to
the paper. Note that the orientation of the surface cannot be determined where the
light cannot reach or where only one image is illuminated.
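For the linearized reflectance maps of Eq. (7.27), the two-image solution can be written in a few lines. The sketch below assumes exactly that simplified model (so it is not a general photometric stereo solver) and returns the gradient (p, q) from two brightness measurements E1 and E2; the test values are made up.

```python
import numpy as np

def gradient_from_two_sources(E1, E2, p1, q1, p2, q2):
    """Solve R1(p, q) = E1, R2(p, q) = E2 for the linearized reflectance
    maps R_j = (1 + p_j p + q_j q)/r_j; requires p1/q1 != p2/q2."""
    r1 = np.sqrt(1.0 + p1**2 + q1**2)
    r2 = np.sqrt(1.0 + p2**2 + q2**2)
    det = p1 * q2 - p2 * q1
    a, b = E1 * r1 - 1.0, E2 * r2 - 1.0
    return (a * q2 - b * q1) / det, (b * p1 - a * p2) / det

print(gradient_from_two_sources(0.9, 0.8, 0.5, 0.0, 0.0, 0.5))
```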
In many practical situations, three different light sources are often used, which
can not only linearize the equation but also improve the accuracy and increase the
solvable surface orientation range. In addition, this newly added third image can also
help restore the surface reflection coefficient. The following is a specific description.
The surface reflection property can often be described by the product of two
factors (coefficients). One is the geometric term, which represents the dependence on
the light reflection angle; the other is the proportion of incident light reflected by the
surface, which is called the reflection coefficient.
Generally, the reflection characteristics of each part of the object surface are
inconsistent. In the simplest case, brightness is only the product of reflection
coefficient and some orientation function. Here, the reflection coefficient is between
0 and 1. There is a surface similar to the Lambert surface (it has the same brightness from all directions but does not reflect all incident light), and its brightness can be expressed as ρ cos θi, where ρ is the surface reflection coefficient (it may change with
the position on the surface). In order to recover the reflection coefficient and gradient
( p, q), three kinds of information are needed, which can be obtained from the
measurement of three images.
Now introduce the unit vectors in the three light source directions:

$$\mathbf{S}_j = \frac{[-p_j\ \ {-q_j}\ \ 1]^T}{\sqrt{1 + p_j^2 + q_j^2}}, \quad j = 1, 2, 3 \quad (7.30)$$

and recall that

$$\mathbf{N} = \frac{[-p\ \ {-q}\ \ 1]^T}{\sqrt{1 + p^2 + q^2}} \quad (7.32)$$

is the unit vector of the surface normal. In this way, three equations can be obtained for the unit vector N and the reflection coefficient ρ, one per light source:

$$E_j = \rho\,(\mathbf{S}_j \cdot \mathbf{N}), \quad j = 1, 2, 3 \quad (7.33)$$

or, collecting them in matrix form:

$$\mathbf{E} = \rho\,\mathbf{S}\,\mathbf{N} \quad (7.34)$$
The rows of matrix S are the light source direction vectors S1, S2, and S3, and the
elements of vector E are the three luminance measurements.
Let S be nonsingular; then it can be obtained from Eq. (7.34) that:

$$\rho\,\mathbf{N} = \mathbf{S}^{-1}\mathbf{E} \quad (7.35)$$

Since N is a unit vector, ρ is the magnitude of S⁻¹E, and N is obtained by normalizing it.
The direction of the surface normal is thus the product of a constant and a linear combination of three vectors, each of which is perpendicular to the directions of two of the light sources. Each such vector is multiplied by the brightness obtained when the remaining (third) light source is used, and the reflection coefficient can then be determined uniquely from the magnitude of the resulting vector.
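A minimal NumPy sketch of this three-source computation (assuming the simple model E = ρ S N of Eq. (7.34), with hypothetical light directions and measurements) recovers ρ and N by inverting S, as in Eq. (7.35):

```python
import numpy as np

def photometric_stereo(E, S):
    """Recover (rho, N) from three brightness values E and the 3x3 matrix S
    whose rows are the unit light source directions (E = rho S N)."""
    rho_N = np.linalg.solve(S, E)    # rho N = S^{-1} E
    rho = np.linalg.norm(rho_N)
    return rho, rho_N / rho

# Hypothetical light directions (rows are unit vectors) and a test surface
# with albedo 0.8 and normal along the Z-axis.
S = np.array([[0.0, 0.0, 1.0],
              [0.6, 0.0, 0.8],
              [0.0, 0.6, 0.8]])
N_true = np.array([0.0, 0.0, 1.0])
E = 0.8 * S @ N_true
print(photometric_stereo(E, S))      # ~(0.8, [0, 0, 1])
```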
Finally, an example of using three images to restore the reflection coefficient is given. Set a light source at three positions (−3.4, −0.8, −1.0), (0.0, 0.0, −1.0), and (−4.7, −3.9, −1.0) in space to collect three images. According to the brightness constraint equation, three sets of equations can be obtained, so that the surface orientation and reflection coefficient ρ can be calculated. Figure 7.15a shows the three groups of reflection characteristic curves. It can be seen from Fig. 7.15b that when the reflection coefficient ρ = 0.8, the three reflection characteristic curves intersect at a point p = −0.1, q = −0.1; in other cases, there is no common intersection.
In the shape from illumination, the orientation of each surface of the scene is revealed by changing the illumination, that is, by moving the light source. In fact, if the
light source is fixed but the pose of the scene is changed, it is also possible to reveal
different surfaces of the scene. The pose change of the scene can be realized by the
movement of the scene, so the shape and structure of each part of the scene can also
be revealed by using sequence images or videos and detecting the movement of the
scene.
The detection of motion can be based on the change of image brightness with
time. It should be noted that although the movement of the camera or the movement
of the scenery will lead to the brightness change between each image frame in the
video, the change of the lighting conditions in the video image will also lead to the
change of the image brightness with time, so the change of the brightness with time
on the image plane does not always correspond to the movement. Generally, optical
flow (vector) is used to represent the change of image brightness over time, but
sometimes it is different from the actual motion in the scene.
Under perspective projection, the image position r_i of a scene point r_o satisfies:

$$\frac{1}{\lambda}\,\mathbf{r}_i = \frac{1}{\mathbf{r}_o \cdot \hat{\mathbf{z}}}\,\mathbf{r}_o \quad (7.37)$$

where λ is the focal length of the lens and r_o · ẑ = z is the distance from the lens center to the object along the optical axis. Differentiating the above equation with respect to time gives the velocity vectors assigned to each pixel, and these velocity vectors constitute the motion field.
Visual psychology believes that when relative motion occurs between people and
the observed object, the movement of the parts with optical characteristics on the
surface of the observed object provides people with information about motion and
structure. When there is relative motion between the camera and the scene object, the
observed brightness mode motion is called optical flow or image flow. In other
words, the movement of the object with optical characteristics is projected onto the
retinal plane (i.e., the image plane) to form optical flow. Optical flow expresses the
change of image, which contains the information of object motion, and can be used
to determine the movement of the observer relative to the object. Optical flow has
three elements: (1) motion (velocity field), which is the necessary condition for the
formation of optical flow; (2) parts with optical characteristics (such as gray-scale
pixel points) that can carry information; and (3) imaging projection (from scene to
image plane), so optical flow can be observed.
Although optical flow is closely related to the motion field, they are not
completely corresponding. The object motion in the scene leads to the brightness
mode motion in the image, and the visible motion of the brightness mode generates
optical flow. In the ideal case, the optical flow corresponds to the motion field, but in
practice, there are also cases where they do not correspond. In other words, motion produces optical flow, so where there is optical flow there is generally motion, but where there is motion there need not be optical flow.
Here are a few examples to illustrate the difference between optical flow and
motion field. First, when the light source is fixed, a sphere with uniform reflection
characteristics rotates in front of the camera, as shown in Fig. 7.17a. At this time,
there are spatial changes in brightness everywhere in the spherical image, but this
spatial change does not change with the rotation of the spherical surface, so the
image does not change (in gray level) with time. In this case, although the motion
field is not zero, the optical flow is zero everywhere. Next, consider the case that the
fixed ball is illuminated by a moving light source, as shown in Fig. 7.17b. The gray
scale everywhere in the image will change with the movement of the light source due
to the change of lighting conditions. In this case, although the optical flow is not
zero, the motion field of the ball is zero everywhere. This motion is also called
apparent motion (optical flow is the apparent motion of brightness mode). The above
two cases can be regarded as optical illusion.
As can be seen from the above example, optical flow is not equivalent to a motion
field. However, in most cases, there is still a certain corresponding relationship
between optical flow and motion field. So in many cases, the relative motion can
be estimated by image changes according to the corresponding relationship between
optical flow and motion field. However, it should be noted that there is also a
problem of determining the corresponding points between different images.
Refer to Fig. 7.18, where each closed curve represents an equal brightness curve.
Consider that there is an image point P with brightness E at time t, as shown in
Fig. 7.18a. At time t + δt, which image point does P correspond to? In other words, to solve this problem, we need to know how the brightness mode changes. Generally, many points near P have the same brightness E. If the brightness changes continuously in this part of the region, P should be on an equal brightness curve C. At t + δt, there will be some equal brightness curves C′ with the same brightness near the original C, as shown in Fig. 7.18b. However, it is difficult to say which point P′ on C′ corresponds to the point P on the original C, because the shapes of the two equal brightness curves C and C′ may be completely different. Therefore, although it can be determined that curve C corresponds to curve C′, it cannot be determined that point P corresponds to point P′.
It can be seen from the above example that only relying on the local information
in the changing image cannot uniquely determine the optical flow. Further, consider
Fig. 7.17 again. If there is a region in the image where the brightness is uniform and
does not change with time, the optical flow that is most likely to produce is zero
everywhere, but in fact any vector movement mode can be assigned to the region
with uniform brightness.
Optical flow can represent the changes in the image, which includes not only the
motion information of the observed object but also the related scene structure
information. Through the analysis of optical flow, we can determine the 3D structure
of the scene and the relative motion between the observer and the moving object.
Motion analysis can describe image changes and calculate object structure and
motion with the help of optical flow. The first step is to represent the changes in
the image with 2D optical flow (or the speed of the corresponding reference point),
and the second step is to calculate the 3D structure of the moving object and its
movement relative to the observer according to the optical flow calculation results.
The movement of the scenery in the scene will cause the scenery to be in different
relative positions in the image obtained during the movement. This difference in
position can be called parallax, which corresponds to the displacement vector
(including size and direction) reflected by the scenery movement on the image. If
the parallax is divided by the time difference, the velocity vector (also known as
the instantaneous displacement vector) is obtained. Optical flow can be regarded as
the instantaneous velocity field generated by the movement of gray-scale pixels on
the image plane. Based on this, the basic optical flow constraint equation, also
known as optical flow equation or image flow equation, can be established.
At time t, a specific image point is at (x, y). At time t + dt, the image point moves to (x + dx, y + dy). If the time interval dt is small, it can be expected (or assumed) that the gray level of the image point remains unchanged; in other words:

$$f(x, y, t) = f(x + \mathrm{d}x, y + \mathrm{d}y, t + \mathrm{d}t) \quad (7.38)$$

Expanding the right side of the above equation with a Taylor series, letting dt → 0, taking the limit, and omitting the higher-order terms gives:

$$-\frac{\partial f}{\partial t} = \frac{\partial f}{\partial x}\frac{\mathrm{d}x}{\mathrm{d}t} + \frac{\partial f}{\partial y}\frac{\mathrm{d}y}{\mathrm{d}t} = \frac{\partial f}{\partial x}u + \frac{\partial f}{\partial y}v \quad (7.39)$$
where u and v are the moving speeds of image points in the X and Y directions,
respectively, and they form a velocity vector. If we write

$$f_x = \frac{\partial f}{\partial x}, \qquad f_y = \frac{\partial f}{\partial y}, \qquad f_t = \frac{\partial f}{\partial t} \quad (7.40)$$

then the optical flow equation can be written compactly as:

$$f_x u + f_y v + f_t = 0 \quad (7.41)$$
The optical flow equation shows that the temporal change rate of the gray level of
a point in the moving image is the product of the spatial change rate of the gray level
of the point and the spatial motion speed of the point.
In practice, the time change rate of the gray scale can be estimated by averaging first-order differences along the time direction, and the spatial change rate of the gray scale can be estimated by averaging first-order differences along the X and Y directions, respectively. The velocity components u and v can then be estimated by the least squares method. Take N pixels at different positions on the same object (sharing the same u and v) in two consecutive images f(x, y, t) and f(x, y, t + 1), and denote the estimates of f_t, f_x, and f_y at the k-th position (k = 1, 2, ..., N) by f_t^{(k)}, f_x^{(k)}, and f_y^{(k)}, respectively. Collect the spatial gradient estimates into an N × 2 matrix F_xy, whose k-th row is [f_x^{(k)}  f_y^{(k)}], and the temporal estimates into an N-vector f_t. The least squares solution is then:

$$[u\ \ v]^T = -\left(\mathbf{F}_{xy}^T \mathbf{F}_{xy}\right)^{-1}\mathbf{F}_{xy}^T \mathbf{f}_t \quad (7.47)$$
Figure 7.19 shows an example of optical flow detection. Figure 7.19a is a side
image of a patterned sphere, and Fig. 7.19b is an image obtained by rotating the
sphere (around the up and down axis) to the right by a small angle. The motion of the
sphere in 3D space is basically translational motion reflected in the 2D image, so in
the optical flow detected in Fig. 7.19c, the parts with large optical flow are distrib
uted along the longitude, reflecting the result of the horizontal movement of the
edge.
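A compact sketch of the least squares estimate of Eq. (7.47) is given below; it assumes the derivative estimates f_x, f_y, f_t are already available for N pixels sharing the same motion, and the numerical values are made up for illustration.

```python
import numpy as np

def flow_least_squares(fx, fy, ft):
    """Estimate a single (u, v) shared by N pixels from their spatial and
    temporal derivative estimates: solve Fxy [u v]^T = -ft in the least
    squares sense, as in Eq. (7.47)."""
    Fxy = np.column_stack([fx, fy])          # N x 2 matrix of gradients
    uv, *_ = np.linalg.lstsq(Fxy, -np.asarray(ft), rcond=None)
    return uv                                 # [u, v]

# Hypothetical derivatives consistent with a motion of (u, v) = (1.0, -0.5):
fx = np.array([0.2, 0.5, -0.3, 0.1])
fy = np.array([0.4, -0.1, 0.2, 0.3])
ft = -(fx * 1.0 + fy * (-0.5))               # from fx*u + fy*v + ft = 0
print(flow_least_squares(fx, fy, ft))        # ~ [1.0, -0.5]
```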
How to solve the optical flow equation given by Eq. (7.41)? The essence of this
problem is to calculate the optical flow component according to the gradient of gray
value of image points. This needs to be considered in different situations. Here are
some common situations.
The calculation of optical flow is to solve the optical flow equation, that is, to find the
optical flow components according to the gray value gradients of the image point.
The optical flow equation restricts the relationship between the three directional
gradients and the optical flow components. It can be seen from Eq. (7.41) that this is
a linear constraint equation about the velocity components u and v. If a velocity
space is established with the velocity components as the axes (the coordinate system
is shown in Fig. 7.20), then the u and v values satisfying the constraint Eq. (7.41) are
on a straight line. It can be obtained from Fig. 7.20:
$$u_0 = -\frac{f_t}{f_x}, \qquad v_0 = -\frac{f_t}{f_y}, \qquad \theta = \arctan\frac{f_x}{f_y} \quad (7.48)$$
Note that each point on the line is the solution of the optical flow equation. In
other words, only one optical flow equation is not enough to uniquely determine the
two quantities u and v. In fact, solving two variables with only one equation is an
ill-conditioned problem that must be solved with additional constraints.
In many cases, the research object can be regarded as a non-deformable rigid
body, and each adjacent point on it has the same optical flow velocity. This condition
can be used to help solve the optical flow equation. According to the condition that
the adjacent points on the object have the same optical flow velocity, it can be known
that the spatial variation rate of the optical flow velocity is zero, that is:
$$(\nabla u)^2 = \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 = 0 \quad (7.49)$$

$$(\nabla v)^2 = \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 = 0 \quad (7.50)$$
These two conditions can be combined with the optical flow equation to calculate
the optical flow by solving a minimization problem. Assume:

$$\varepsilon = \iint \left[(f_x u + f_y v + f_t)^2 + \lambda^2\left((\nabla u)^2 + (\nabla v)^2\right)\right]\mathrm{d}x\,\mathrm{d}y \quad (7.51)$$

The value of λ should take into account the noise in the image. If the noise is strong, it means that the confidence in the image data itself is low and more reliance must be placed on the optical flow smoothness constraint, so λ should take a larger value; otherwise, λ should take a smaller value.
In order to minimize the total error in Eq. (7.51), ε can be differentiated with respect to u and v, respectively, and the derivatives set to zero, so that:

$$\lambda^2 \nabla^2 u = f_x\left(f_x u + f_y v + f_t\right) \quad (7.52)$$

$$\lambda^2 \nabla^2 v = f_y\left(f_x u + f_y v + f_t\right) \quad (7.53)$$

The above two equations are also called the Euler equations. If we let ū and v̄ denote the means in the neighborhoods of u and v, respectively (which can be calculated by an image local smoothing operator) and approximate the Laplacians as ∇²u ≈ ū − u and ∇²v ≈ v̄ − v, then Eqs. (7.52) and (7.53) can be changed into a pair of linear equations in u and v.
Equations (7.56) and (7.57) provide the basis for an iterative solution to u(x, y) and v(x, y). In practice, the following relaxation iterative equations are often used:

$$u^{(n+1)} = \bar{u}^{(n)} - \frac{f_x\left[f_x \bar{u}^{(n)} + f_y \bar{v}^{(n)} + f_t\right]}{\lambda^2 + f_x^2 + f_y^2} \quad (7.58)$$

$$v^{(n+1)} = \bar{v}^{(n)} - \frac{f_y\left[f_x \bar{u}^{(n)} + f_y \bar{v}^{(n)} + f_t\right]}{\lambda^2 + f_x^2 + f_y^2} \quad (7.59)$$

Here we can take u^{(0)} = 0, v^{(0)} = 0 (a straight line through the origin). The above
two equations have a simple geometric interpretation, that is, the iteration value at a
new (u, v) point is the average value in the neighborhood of the point minus an
adjustment amount, which is in the direction of the brightness gradient (see
Fig. 7.21). Therefore, the iterative process is a process of moving a straight line
along the brightness gradient, and the straight line is always perpendicular to the
direction of the brightness gradient. For the specific flowchart of solving Eqs. (7.58)
and (7.59), please refer to Fig. 8.10 in the next chapter.
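The relaxation iteration of Eqs. (7.58) and (7.59) can be sketched as follows; the neighborhood mean is computed here with a simple 4-neighbor average (one possible choice of local smoothing operator), and the values of λ² and the iteration count are arbitrary illustrative defaults.

```python
import numpy as np

def horn_schunck_like(fx, fy, ft, lam2=100.0, n_iter=100):
    """Iterate Eqs. (7.58) and (7.59); fx, fy, ft are derivative images of
    equal shape, lam2 plays the role of lambda^2."""
    u = np.zeros_like(fx, dtype=float)
    v = np.zeros_like(fx, dtype=float)

    def local_mean(w):
        # simple 4-neighbor average used as the local smoothing operator
        return 0.25 * (np.roll(w, 1, 0) + np.roll(w, -1, 0) +
                       np.roll(w, 1, 1) + np.roll(w, -1, 1))

    for _ in range(n_iter):
        u_bar, v_bar = local_mean(u), local_mean(v)
        t = (fx * u_bar + fy * v_bar + ft) / (lam2 + fx**2 + fy**2)
        u = u_bar - fx * t           # Eq. (7.58)
        v = v_bar - fy * t           # Eq. (7.59)
    return u, v
```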
Further analysis of the previous Eqs. (7.52) and (7.53) shows that the optical flow in
the region where the brightness gradient is completely zero is actually
undeterminable, while in the region where the brightness gradient changes rapidly,
the resulting error of the optical flow calculation may be large. A common method
for solving the optical flow equation is to consider the smooth condition that the
motion field is generally slow and stable in most parts of the image. In this case, we
can consider minimizing a measure of deviation from smoothness. A commonly
used measure is the integral of the square of the magnitude of the gradient of the
optical flow velocity:
$$e_s = \iint \left[\left(u_x^2 + u_y^2\right) + \left(v_x^2 + v_y^2\right)\right]\mathrm{d}x\,\mathrm{d}y \quad (7.60)$$

Also consider minimizing the error of the optical flow constraint equation:

$$e_c = \iint \left(f_x u + f_y v + f_t\right)^2 \mathrm{d}x\,\mathrm{d}y \quad (7.61)$$
Optical flow will have discontinuities at the edges where objects overlap each other.
To generalize the above optical flow detection method from one region to another, it
is necessary to determine the discontinuity. This brings up a chicken-and-egg
problem. If there is an accurate optical flow estimation, it is easy to find the places
where the optical flow changes rapidly and divide the image into different regions;
on the contrary, if the image can be well divided into different regions, an accurate
estimation of the optical flow can be obtained. The solution to this contradiction is to
incorporate the segmentation of regions into the iterative solution of optical flow.
Specifically, after each iteration, look for places where the optical flow changes
rapidly, and mark these places to avoid the smooth solution obtained in the next
iteration from crossing these discontinuities. In practical applications, the threshold
for determining the degree of optical flow change is generally set high to avoid
prematurely and finely dividing the image, and then the threshold is gradually
reduced as the optical flow estimation becomes better and better.
More generally speaking, the optical flow constraint equation is not only applicable to the continuous region of gray level but also applicable to the region with
abrupt changes in gray level. In other words, one condition for the optical flow
constraint equation to apply is that there can be (finite) abrupt discontinuities in the
image, but the changes around the discontinuities should be uniform.
Let’s see Fig. 7.23a; XY is the image plane, I is the gray-scale axis, and the object
moves along the X direction with velocity (u, v). At t0, the gray scale at point P0 is I0,
and the gray scale at point Pd is Id; at t0 + dt, the gray scale at P0 moves to Pd to form
optical flow. In this way, there is an abrupt gray-scale change between P0 and Pd, and the gray-scale gradient is ∇f = (f_x, f_y). Now look at Fig. 7.23b; if you look at the gray
scale change from the path, because the gray scale at Pd is the gray scale at P0 plus
the gray-scale difference between P0 and Pd, so there is:
Fig. 7.23 Solving the optical flow equation when the gray scale is abruptly changed
$$I_d = \int_{P_0}^{P_d} \nabla f \cdot \mathrm{d}\mathbf{l} + I_0 \quad (7.62)$$
If you look at the gray-scale changes from the time course, because the observer
sees the gray scale changing from Id to I0 at Pd, so there is:
$$I_0 = \int_{t_0}^{t_0+\mathrm{d}t} f_t\,\mathrm{d}t + I_d \quad (7.63)$$
Since the change of gray levels should be the same in these two cases, it can be
solved by combining Eqs. (7.62) and (7.63):
$$\int_{P_0}^{P_d} \nabla f \cdot \mathrm{d}\mathbf{l} = -\int_{t_0}^{t_0+\mathrm{d}t} f_t\,\mathrm{d}t \quad (7.64)$$
Substituting dl = [u v]^T dt and considering that the line integration limits and the time integration limits should correspond, we can get:

$$\nabla f \cdot [u\ \ v]^T = -f_t \quad (7.65)$$

This shows that the optical flow constraint still holds, so the equation can still be solved by the previous method when dealing with such discontinuities.
It can be proved that the optical flow constraint equation is also applicable to the
discontinuous velocity field caused by the transition between the background and the
object under certain conditions, provided that the image has sufficient sampling
density. For example, in order to obtain the proper information from the texture
image sequence, the sampling rate of the space should be smaller than the scale of
the image texture. The sampling distance in time should also be smaller than the
scale of the velocity field change, or even much smaller, so that the displacement is
smaller than the scale of the image texture. Another condition for the optical flow
constraint equation to apply is that the gray-scale change at each point in the image
plane should be entirely due to the motion of a specific pattern in the image and
should not include the effects of changes in reflection properties. This condition can
also be expressed as a change in the position of a mode in the image at different times
produces an optical flow velocity field, but the mode itself does not change.
The previous solution to the optical flow equation only utilizes the first-order
gradient of the image gray levels. There is a view that the optical flow constraint
equation itself already contains the smoothness constraint on the optical flow field,
so in order to solve the optical flow constraint equation, it is necessary to consider the
continuity of the image itself on the gray level (i.e., consider the high-order gradient
of the image gray level) to constrain the gray-level field.
Expand the terms in the optical flow constraint equation with the Taylor series at
(x, y, t), and take the second order to get:
$$\frac{\partial f(x+\mathrm{d}x, y+\mathrm{d}y, t)}{\partial x} = \frac{\partial f(x,y,t)}{\partial x} + \frac{\partial^2 f(x,y,t)}{\partial x^2}\,\mathrm{d}x + \frac{\partial^2 f(x,y,t)}{\partial x\,\partial y}\,\mathrm{d}y \quad (7.66)$$

$$\frac{\partial f(x+\mathrm{d}x, y+\mathrm{d}y, t)}{\partial y} = \frac{\partial f(x,y,t)}{\partial y} + \frac{\partial^2 f(x,y,t)}{\partial y\,\partial x}\,\mathrm{d}x + \frac{\partial^2 f(x,y,t)}{\partial y^2}\,\mathrm{d}y \quad (7.67)$$

$$\frac{\partial f(x+\mathrm{d}x, y+\mathrm{d}y, t)}{\partial t} = \frac{\partial f(x,y,t)}{\partial t} + \frac{\partial^2 f(x,y,t)}{\partial t\,\partial x}\,\mathrm{d}x + \frac{\partial^2 f(x,y,t)}{\partial t\,\partial y}\,\mathrm{d}y \quad (7.68)$$

$$u(x+\mathrm{d}x, y+\mathrm{d}y, t) = u(x,y,t) + u_x(x,y,t)\,\mathrm{d}x + u_y(x,y,t)\,\mathrm{d}y \quad (7.69)$$

$$v(x+\mathrm{d}x, y+\mathrm{d}y, t) = v(x,y,t) + v_x(x,y,t)\,\mathrm{d}x + v_y(x,y,t)\,\mathrm{d}y \quad (7.70)$$
Substituting the above five equations into the optical flow constraint equation, we get:
Because these terms are independent, six equations can be obtained, respectively:
To solve the two unknowns from these three equations, the method of least squares can be used.
When solving the optical flow constraint equation with the help of gradient, it is
assumed that the image is differentiable, that is, the movement of the object between
frame images should be small enough (less than one pixel/frame). If it is too large, the
aforementioned assumption does not hold, and the optical flow constraint equation
cannot be solved accurately. One method that can be taken at this time is to reduce the resolution of the image, which is equivalent to performing low-pass filtering on the image and has the effect of reducing the speed of the optical flow.
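A minimal sketch of this resolution-reduction idea is given below; the 2×2 averaging used as the low-pass filter and the number of pyramid levels are arbitrary choices for illustration, not the specific scheme of any particular method.

```python
import numpy as np

def downsample(img):
    """Halve the resolution by 2x2 averaging (a crude low-pass filter),
    which roughly halves the optical flow speed in pixels/frame."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] +
                   img[0::2, 1::2] + img[1::2, 1::2])

def pyramid(img, levels=3):
    """Coarse-to-fine image pyramid; flow can be estimated at the coarsest
    level first and then refined at finer levels."""
    pyr = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr
```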
The optical flow contains information about the structure of the scene, so the
orientation of the surface can be solved from the optical flow of the object surface
movement. The orientation of each point in the objective world and the surface of the
object can be represented by an orthogonal coordinate system XYZ centered on the
observer. Consider a monocular observer located at the origin of coordinates, and
assume that the observer has a spherical retina, so that the objective world can be
considered to be projected onto a unit image sphere. The image sphere has a
coordinate system consisting of longitude φ and latitude θ. Points in the objective
world can be represented by these two image spherical coordinates plus a distance
r from the origin (see Fig. 7.24).
The transformations from spherical coordinates to Cartesian coordinates and from Cartesian coordinates to spherical coordinates are, respectively:

$$x = r\sin\theta\cos\phi, \qquad y = r\sin\theta\sin\phi, \qquad z = r\cos\theta$$

and

$$r = \sqrt{x^2 + y^2 + z^2} \quad (7.84)$$

$$\theta = \arccos(z/r) \quad (7.85)$$

$$\phi = \arctan(y/x) \quad (7.86)$$

If a scene point moves with velocity (u, v, w) in Cartesian coordinates, the resulting optical flow components in the φ and θ directions are:

$$\delta = \frac{v\cos\phi - u\sin\phi}{r\sin\theta} \quad (7.87)$$

$$\varepsilon = \frac{(u\cos\phi + v\sin\phi)\cos\theta - w\sin\theta}{r} \quad (7.88)$$
The above two equations are general representations for the optical flow in the φ and θ directions. Consider the optical flow calculation in a simple case below. Suppose the scene is stationary and the observer is moving along the Z-axis (in the positive direction) with speed S. At this time, u = 0, v = 0, and w = −S, and substituting them into Eqs. (7.87) and (7.88) gives, respectively:
$$\delta = 0 \quad (7.89)$$

$$\varepsilon = \frac{S\sin\theta}{r} \quad (7.90)$$
They form a simplified optical flow equation and are the basis for solving surface
orientation (and edge detection). According to the solution of the optical flow
equation, it can be judged whether each point in the optical flow field is a boundary
point, a surface point, or a space point. Among them, the type of boundary and the
orientation of the surface can also be determined in the two cases of boundary point
and surface point [9].
Here we only introduce how to obtain the surface orientation with the help of
optical flow. Looking at Fig. 7.25a first, let R be a point on a given surface patch on
the surface of the object, and the monocular observer with focus at O observes the
surface patch along the line of sight OR. Let the normal vector of the surface patch be
N; N can be decomposed into two mutually perpendicular directions: one is in the ZR plane, whose included angle with OR is σ (as shown in Fig. 7.25b), and the other is in a plane perpendicular to the ZR plane (parallel to the XY plane), whose angle with OR is τ (as shown in Fig. 7.25c, where the Z-axis points out of the paper). In Fig. 7.25b, φ is a constant, while in Fig. 7.25c, θ is a constant. In Fig. 7.25b, the ZOR plane constitutes a "depth profile" along the line of sight, while in Fig. 7.25c the "depth profile" is parallel to the XY plane.
Fig. 7.25 Schematic diagram for obtaining surface orientation by means of optical flow

How to determine σ and τ is now discussed. Consider first σ in the ZR plane (see Fig. 7.25b). If the vector angle θ is given a small increment Δθ, the change in the vector radius r is Δr. Drawing an auxiliary segment p through R, it can be seen that p/r = tan(Δθ) ≈ Δθ, on the one hand, and p/Δr = tan σ, on the other hand. Eliminating p from these two relations gives:

$$\cot\sigma = \frac{1}{r}\frac{\Delta r}{\Delta\theta} \quad (7.91)$$
Consider next τ in the plane perpendicular to the ZR plane (see Fig. 7.25c). If the vector angle φ is given a small increment Δφ, the length of the vector radius r changes by Δr. Again drawing the auxiliary segment p, it can be seen that p/r = tan(Δφ) ≈ Δφ, on the one hand, and p/Δr = tan τ, on the other hand. Eliminating p gives:

$$\cot\tau = \frac{1}{r}\frac{\Delta r}{\Delta\phi} \quad (7.92)$$
Further, taking the limits of Eqs. (7.91) and (7.92), respectively, we can get:
$$\cot\sigma = \frac{1}{r}\frac{\partial r}{\partial\theta} \quad (7.93)$$

$$\cot\tau = \frac{1}{r}\frac{\partial r}{\partial\phi} \quad (7.94)$$

From Eq. (7.90), the vector radius can be expressed in terms of the optical flow as:

$$r = \frac{S\sin\theta}{\varepsilon(\phi,\theta)} \quad (7.95)$$

Taking the partial derivatives of Eq. (7.95) with respect to φ and θ gives:

$$\frac{\partial r}{\partial\phi} = -\frac{S\sin\theta}{\varepsilon^2}\,\frac{\partial\varepsilon}{\partial\phi} \quad (7.96)$$

$$\frac{\partial r}{\partial\theta} = S\left(\frac{\cos\theta}{\varepsilon} - \frac{\sin\theta}{\varepsilon^2}\,\frac{\partial\varepsilon}{\partial\theta}\right) \quad (7.97)$$

Substituting these into Eqs. (7.93) and (7.94) yields the surface orientation angles:

$$\sigma = \operatorname{arccot}\left[\cot\theta - \frac{\partial(\ln\varepsilon)}{\partial\theta}\right] \quad (7.98)$$

$$\tau = \operatorname{arccot}\left[-\frac{\partial(\ln\varepsilon)}{\partial\phi}\right] \quad (7.99)$$
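Given the flow component ε sampled on a (φ, θ) grid, Eqs. (7.98) and (7.99) can be evaluated numerically, for example as in the following sketch; the discretization by np.gradient and the choice of arccot branch are implementation assumptions, not prescribed by the text.

```python
import numpy as np

def orientation_from_flow(eps, phi, theta):
    """Evaluate sigma and tau of Eqs. (7.98) and (7.99) on a grid, where
    eps[i, j] is the flow component in the theta direction at
    (theta[i], phi[j]); theta should stay away from 0 and pi."""
    ln_eps = np.log(eps)
    dln_dtheta = np.gradient(ln_eps, theta, axis=0)
    dln_dphi = np.gradient(ln_eps, phi, axis=1)
    cot_theta = 1.0 / np.tan(theta)[:, None]
    sigma = np.arctan2(1.0, cot_theta - dln_dtheta)   # arccot(x) = arctan2(1, x)
    tau = np.arctan2(1.0, -dln_dphi)
    return sigma, tau
```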
Using optical flow to analyze the motion can also obtain the mutual velocities u, v,
and w in the X, Y, and Z directions between the camera and the object in the world
coordinate system. If the coordinates of an object point at t0 = 0 are (X0, Y0, Z0), the
focal length of the optical system is set to 1, and the object moving speed is constant,
then the image coordinates of this point at time t are:
$$(x, y) = \left(\frac{X_0 + ut}{Z_0 + wt},\ \frac{Y_0 + vt}{Z_0 + wt}\right) \quad (7.100)$$
$$\frac{D(t)}{V(t)} = \frac{Z(t)}{w(t)} \quad (7.101)$$
The above equation is the basis for determining the distance between moving
objects. Assuming the motion is toward the camera, the ratio Z/w gives the time it
takes for an object moving at a constant speed w to reach the image plane. Based on
the knowledge of the distance of any point in an image that is moving along the Z-
axis with velocity w, the distance of any other point on that image that is moving
with the same velocity w can be calculated:
$$Z'(t) = \frac{Z(t)\,V(t)\,D'(t)}{D(t)\,V'(t)} \quad (7.102)$$

where Z(t) is the known distance and Z′(t) is the unknown distance.
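As a toy illustration of Eqs. (7.101) and (7.102) (with made-up numbers; D and V here simply stand for the image-plane quantities appearing in those equations):

```python
def time_to_reach(Z, w):
    """Z/w from Eq. (7.101): time for an object at distance Z approaching
    at constant speed w to reach the image plane."""
    return Z / w

def relative_depth(Z_known, D_known, V_known, D_other, V_other):
    """Eq. (7.102): unknown distance Z' of a second point moving with the
    same w, from the known distance Z and the image quantities D, V."""
    return Z_known * V_known * D_other / (D_known * V_other)

print(time_to_reach(10.0, 2.0))                   # 5.0
print(relative_depth(10.0, 1.0, 0.2, 2.0, 0.2))   # 20.0
```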
According to Eq. (7.102), the relationship between the world coordinates X and Y and the image coordinates x and y can also be given by the observation position and velocity.
$$F_t = \frac{M}{N} \quad (7.104)$$
where M is the number of pixels with an intensity value in the range of [1, 254] and
N is the number of pixels in the partial projection image with an intensity value of
255. By applying a simple threshold, Ft, to each contour part image, only high-
quality parts of the body contour are obtained for reconstruction, thereby improving
the quality of the final visual hull.
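One possible reading of Eq. (7.104) is sketched below: F_t is computed per contour part image, and parts are kept when the ratio is small; both the comparison direction and the cutoff value are assumptions made purely for illustration.

```python
import numpy as np

def contour_quality(part_img):
    """F_t = M/N of Eq. (7.104): M = pixels with intensity in [1, 254],
    N = pixels with intensity 255 in the partial projection image."""
    part_img = np.asarray(part_img)
    M = np.count_nonzero((part_img >= 1) & (part_img <= 254))
    N = np.count_nonzero(part_img == 255)
    return M / N if N > 0 else float("inf")

def keep_high_quality(part_images, cutoff=0.1):
    """Keep the parts whose ratio is small (assumed here to be the
    high-quality ones); the cutoff value is an arbitrary placeholder."""
    return [p for p in part_images if contour_quality(p) <= cutoff]
```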
The light source is an important device to realize the photometric stereo technology.
There are many kinds of light sources. A simple light source classification diagram is
shown in Fig. 7.27. Light sources can be divided into two categories: infinite point
light sources and near-field light sources. The premise of photometric stereo is that
the incident light is parallel. In reality, it is often difficult to create a large region of parallel light, so a point light source placed far away (generally, at a distance of about ten times the width of the scene) is used, and its light is approximately regarded as parallel. For a near-field light source, it is difficult to treat the light as parallel because the source is too close. In practice, a light source may also have a certain spatial extent and cannot be regarded as a point light source; it is then called an extended light source. Extended light sources are divided into linear and planar ones according to their shape.
Light source calibration refers to placing auxiliary objects such as calibration
targets to estimate the information of the light source, including the direction and
intensity of the light source. The accuracy of light source information greatly affects
the performance and effect of photometric stereo technology. There are many
methods of light source calibration. Figure 7.28 shows a classification diagram of
light source calibration methods.
(Fig. 7.28 classifies light source calibration methods by the calibration information used: tone/shading information, special reflection properties information, or shadow information; by whether a calibration target is used: no calibration target vs. with calibration target; by the number of calibration targets: single or multiple; and by the type of calibration target: near plane target or sphere with specific reflection properties.)
From the information used for calibration, the tone/shading information, shadow information, or reflection characteristic information of the scene surface is often used (the three kinds of information can also be used in combination). From the perspective of light source properties, the light source information to be estimated includes the intensity, direction, and position, and the number of light sources (single or multiple) is also distinguished. Calibration targets are used to obtain more accurate light source information by exploiting the reflection properties of different targets. Commonly used calibration targets include cubes, diffuse spheres, hollow transparent glass spheres, mirrors, etc.
The light source calibration method also depends on the type of light source. For
example, near-field light sources will cause uneven illumination distribution, and
white paper with Lambertian reflection characteristics can be used as a calibration
target to compensate for the intensity distribution of different light sources [20].
Finally, it should be pointed out that the calibration of the light source usually
requires the selection of a special calibration object and a separate calibration
experiment, which increases the difficulty of the application of photometric stereo
and limits the application of photometric stereo. How to simplify or omit this step by means of light source self-calibration is a very valuable research direction [21]. A recent
work can be found in [22].
The ideal scattering surface and the ideal specular reflecting surface discussed in
Sect. 7.2.2 are rare in practice, and the actual scene surface often has very different
reflection characteristics, which are generally called non-Lambertian surfaces.
For example, a four-light-source photometric stereo technique has been proposed for surfaces in the presence of highlights and shadows; it can detect the highlights of the object and calculate the normal direction of the object surface at the same time [29].
With the help of color photometric stereo technology, the original three gray
scale images can also be replaced with the three channels of the acquired color
image, and then the surface reconstruction can be achieved through a single color
image. This method can avoid the influence of position changes due to time-sharing
and realize fast 3D reconstruction and even real-time 3D reconstruction. Additionally, a work using convolutional neural networks for multispectral photometric stereo can be seen in [30].
Methods based on deep learning have also been introduced into calibrated photometric stereo methods in recent years; see, for example, [41]. Instead of constructing complex reflectivity models, these methods directly learn a mapping from reflectivity observations in a given direction to normal information.
Traditional photometric stereo vision methods mostly assume a simplified reflectivity model (an ideal Lambertian model or a simple reflectivity model). However, in the real world, most scenes have non-Lambertian surfaces (a combination of diffuse and specular reflections), and a particular simple model is only valid for a small subset of materials. At the same time, the calibration of light source information is also a complex and tedious process. Solving this problem requires uncalibrated photometric stereo vision technology [42], that is, it is necessary to directly calculate the normal information in the image only through multiple images under a fixed viewpoint.

Table 7.1 Principles, advantages, and disadvantages of some typical 3D reconstruction methods

Path integral method [31]. Principle: direct integration of the gradient according to Green's formula. Advantages: easy to implement and fast. Disadvantages: error accumulation; greatly affected by data error and noise.

Least squares method. Principle: search for the best fitting surface by minimizing a function, sacrificing local information. Advantages: better overall optimization effect. Disadvantages: loss of local information; large amount of calculation when the amount of data is large.

Fourier basis function method [32]. Principle: approximate the gradient data with basis functions to obtain the best approximation surface. Advantages: good global effect and high computational efficiency. Disadvantages: the derivation is complex and difficult to apply to other basis functions.

Poisson's equation method [33]. Principle: transform the functional problem of minimizing a function into the problem of solving the Poisson equation. Advantages: can be based on Fourier basis functions or extended to sine and cosine functions. Disadvantages: it is necessary to determine which basis function to use for projection according to different boundary conditions.

Variational method [34]. Principle: solve the Poisson equation using an iterative method based on global thinking. Advantages: the iterative process is simple and can solve the overall distortion problem. Disadvantages: long calculation time and accumulation of errors.

Pyramid method [35]. Principle: the subsurface is obtained based on an iterative process in the height space, the sampling interval is continuously reduced, and finally the entire surface is stitched together. Advantages: can ensure global shape optimization and has certain noise immunity. Disadvantages: local detail information is lost; iterative correction is required.

Algebraic method. Principle: by correcting the gradient error through the curl value, the surface is finally reconstructed with the Poisson equation. Advantages: independent of integration paths; local error accumulation can be suppressed. Disadvantages: there is a certain deviation in the reconstruction of local surface detail.

Singular value decomposition method [36]. Principle: obtain a vector that differs from the true normal vector by a transformation matrix through singular value decomposition. Advantages: no need to calibrate the light source. Disadvantages: less computationally efficient; suffers from the generalized bas-relief problem.
The following introduces a method to obtain the normal information of the scene
with the help of a multi-scale aggregation generative adversarial network (GAN)
to realize the uncalibrated photometric stereo vision technology [43].
After being concatenated with the feature map of the corresponding down-sampling part, the feature map is fed into the convolutional layer. Finally, the obtained results are subjected to maximum pooling and normalization to obtain the final result.
Several considerations for the network structure are as follows:
1. Using skip connections to achieve multi-scale feature aggregation: For the features in each image, using multi-scale feature aggregation can achieve better
fusion of local and global features, so that more comprehensive information can
be observed in each image.
2. Aggregate multiple features using the maximum pooling method: Photometric
stereo has multiple inputs, and maximum pooling can naturally capture strong
features in images from different light directions; maximum pooling can easily be
used during training, ignoring inactive features to make the network more robust.
3. Perform L2 normalization on the pooled features: A coarse-grained normal
information map can be obtained.
4. Residual structure is adopted in the multi-scale aggregation network and fine-tuning module: The problem of gradient vanishing can be overcome by using skip
connections [44].
The loss function LG of the generator model consists of two parts: the cosine
similarity loss Lnormal of the normal vector and the adversarial loss Lgen:
Among them, N_{x,y} represents the real normal information at the point (x, y) and N′_{x,y} the predicted normal information. If the real normal information is very close to the predicted normal information, the dot product of N_{x,y} and N′_{x,y} is close to 1 and the L_normal value will be small, and vice versa. The analysis of the second term on the right-hand side of Eq. (7.106) is similar.
The generator adversarial loss Lgen is defined as follows:
Among them, x ~ pg indicates that the input data x conforms to the pg distribution,
and N ~ pr indicates that the true normal information N conforms to the pr
distribution.
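Since the exact loss expressions are given in [43] and are not reproduced here, only a plausible form of the cosine-similarity term is sketched for illustration: the mean of 1 minus the per-pixel dot product between true and predicted unit normals, which is small when the two normal maps agree.

```python
import numpy as np

def normal_cosine_loss(N_true, N_pred):
    """One plausible cosine-similarity loss on normals: mean of
    1 - <N_xy, N'_xy> over all pixels of two H x W x 3 unit normal maps."""
    dots = np.sum(N_true * N_pred, axis=-1)     # per-pixel dot products
    return float(np.mean(1.0 - dots))

N_true = np.zeros((4, 4, 3)); N_true[..., 2] = 1.0   # all normals along Z
print(normal_cosine_loss(N_true, N_true))            # 0.0 for a perfect match
```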
References
1. Lee J H, Kim C S (2022) Single-image depth estimation using relative depths. Journal of Visual Communication and Image Representation 84: 103459. https://doi.org/10.1016/j.jvcir.2022.103459.
2. Pizlo Z, Rosenfeld A (1992) Recognition of planar shapes from perspective images using
contour-based invariants. CVGIP: Image Understanding 56(3): 330-350.
3. Song W, Zhu M F, Zhang M H, et al. (2022). A review of monocular depth estimation
techniques based on deep learning. Journal of Image and Graphics, 27(2): 292-328.
4. Luo H L, Zhou Y F. (2022). Review of monocular depth estimation based on deep learning.
Journal of Image and Graphics, 27(2): 390-403.
5. Swanborn D J B, Stefanoudis P V, Huvenne V A I, et al. (2022) Structure-from-motion photogrammetry demonstrates that fine-scale seascape heterogeneity is essential in shaping mesophotic fish assemblages. Remote Sensing in Ecology and Conservation 8(6): 904-920.
6. Wang S, Wu T H, Wang K P, et al. (2021) 3-D particle surface reconstruction from multiview
2-D images with structure from motion and shape from shading. IEEE Transactions on Industrial Electronics 68(2): 1626-1635.
7. Horn B K P (1986) Robot Vision. MIT Press, USA. Cambridge.
8. Zhang Y-J (2017) Image Engineering, Vol. 3: image understanding. De Gruyter, Germany.
9. Ballard D H, Brown C M (1982) Computer Vision. Prentice-Hall, London.
10. Sonka M, Hlavac V, Boyle R (2008) Image Processing, Analysis, and Machine Vision. 3rd
Ed. Thomson, USA.
11. Zhang Y-J (2017) Image Engineering, Vol. 2: image analysis. De Gruyter, Germany.
12. Krajnik W, Markiewicz L, Sitnik R (2022) sSfS: Segmented shape from silhouette reconstruction of the human body. Sensors 22: 925.
13. Lu E, Cole F, Dekel T, et al. (2021) Omnimatte: Associating objects and their effects in video.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 4505-4513.
14. Lin K, Wang L, Luo K, et al. (2021) Cross-domain complementary learning using pose for multi-person part segmentation. IEEE Transactions on Circuits and Systems for Video Technology 31: 1066-1078.
15. Li P, Xu Y, Wei Y, et al. (2022) Self-correction for human parsing. IEEE Transactions on
Pattern Analysis and Machine Intelligence 44(6): 3260-3271.
16. Xiao B, Wu H, Wei Y (2018) Simple baselines for human pose estimation and tracking.
Proceedings of the European Conference on Computer Vision (ECCV), 8-14.
17. Jertec A, Bojanic D, Bartol K, et al. (2019) On using PointNet architecture for human body
segmentation. Proceedings of the 2019 11th International Symposium on Image and Signal
Processing and Analysis (ISPA), 23-25.
18. Ueshima T, Hotta K, Tokai S (2021) Training PointNet for human point cloud segmentation
with 3D meshes. Proceedings of the Fifteenth International Conference on Quality Control by
Artificial Vision 12-14.
19. Deng X L, He Y B, Zhou J P (2021). Review of three-dimensional reconstruction methods
based on photometric stereo. Modern Computer 27(23): 133-143.
20. Xie W, Song Z, Zhang X (2010) A novel photometric method for real-time 3D reconstruction of
fingerprint. International Symposium on Visual Computing 31-40.
21. Shi B, Matsushita Y, Wei Y, et al. (2010) Self-calibrating photometric stereo. Proceedings of the International Conference on Computer Vision and Pattern Recognition 1118-1125.
22. Abzal A, Saadatseresht M, Varshosaz M, et al. (2020) Development of an automatic map
drawing system for ancient bas-reliefs. Journal of Cultural Heritage 45: 204-214.
23. Phong B T (1975) Illumination for computer generated pictures. Communications of the ACM 18(6): 311-317.
24. Tozza S, Mecca R, Duocastella M, et al. (2016) Direct differential photometric stereo shape
recovery of diffuse and specular surfaces. Journal of Mathematical Imaging and Vision 56(1):
57-76.
25. Torrance K E, Sparrow E M (1967) Theory for off-specular reflection from roughened surfaces. Journal of the Optical Society of America 57(9): 1105-1114.
26. Cook R L, Torrance K E (1982) A reflectance model for computer graphics. ACM Transactions
on Graphics 1(1): 7-24.
27. Ward G J (1992) Measuring and modeling anisotropic reflection. Proceedings of the 19th
Annual Conference on Computer Graphics and Interactive Techniques 265-272.
28. Shih Y C, Krishnan D, Durand F, et al. (2015) Reflection removal using ghosting cues. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 3193-3201.
29. Barsky S, Petrou M (2003) The 4-source photometric stereo technique for three-dimensional
surfaces in the presence of highlights and shadows. IEEE Transactions on Pattern Analysis and
Machine Intelligence 25(10): 1239-1252.
30. Lu L, Qi L, Luo Y, et al. (2018) Three-dimensional reconstruction from single image base on
combination of CNN and multi-spectral photometric stereo. Sensors 18(3): 764.
31. Horn B K P (1990) Height and gradient from shading. International Journal of Computer Vision
5(1): 37-75.
32. Frankot R T, Chellappa R (1988) A method for enforcing integrability in shape from shading algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 10(4): 439-451.
33. Simchony T, Chellappa R, Shao M (1990) Direct analytical methods for solving Poisson
equations in computer vision problems. IEEE Transactions on Pattern Analysis and Machine
Intelligence 12(5): 435-446.
34. Lv D H, Zhang D, Sun J A (2010) Simulation and evaluation of 3D reconstruction algorithm
based on photometric stereo technique. Computer Engineering and Design 31(16): 3635-3639.
35. Chen Y F, Tan W J, Wang H T, et al. (2005) Photometric stereo 3D reconstruction and
application. Journal of Computer-aided Design and Computer Graphics (11): 28-34.
As pointed out in Sect. 7.1, this chapter introduces the method of scene restoration
based on monocular single image. According to the introduction and discussion in
Sect. 2.2.2, it is actually an ill-conditioned problem to use only monocular single
image for scene restoration. This is because when the 3D scene is projected onto the
2D image, the depth information is lost. However, from the practice of human visual
system, especially the ability of spatial perception (see [1]), in many cases, many
depth clues are still retained in the image, so it is possible to recover the scene from it
under the condition of certain constraints or prior knowledge [2-4].
The sections of this chapter will be arranged as follows.
Section 8.1 discusses how to reconstruct the shape of the scene surface according to
the image tone (light and dark information) generated by the spatial change of the
brightness of the scene surface during imaging.
Section 8.2 introduces three technical principles for restoring the surface orientation according to the change (distortion) of the surface texture elements of the scene after projection.
Section 8.3 describes the relationship between the change of the camera focal length and the depth of the scene, caused by focusing on scenes at different distances during imaging, so that the distance of the corresponding scene can be determined according to the focal length at which it is imaged sharply.
Section 8.4 introduces a method to calculate the geometry and pose of a 3D scene by using the coordinates of three points on an image when the 3D scene model and the camera focal length are known.
Section 8.5 further introduces non-Lambertian illumination models and the corresponding new techniques for recovering shape from shading under relaxed conditions, that is, for mixed (diffuse and specular) surfaces imaged under perspective projection.
When the object in the scene is illuminated by light, the brightness will be different
due to the different orientations of the various parts of the surface. This spatial
change (light and dark change) of the brightness will appear as different shadings or
tones on the image (also often called different shadows) after imaging. The shape
information of the object can be obtained according to the distribution and change of
the tones, which is called shape from shading.
In the following, the relationship between the shades on the image and the surface
shapes of the object in the scene is first discussed, and then how to represent the
change of orientation is introduced.
According to Fig. 8.1, if the incident light intensity on the 3D object surface S is L, and the reflection coefficient ρ is a constant, then the reflection intensity along the normal N is:
If the light source is behind the observer and emits parallel rays, then cos i = cos e. Assuming that the line of sight intersects the imaging plane XY perpendicularly, and the object has a Lambertian scattering surface, that is, the surface reflection intensity does not change with the observation position, the observed light intensity can be written as:
In order to establish the relationship between the surface orientation and the image brightness, the gradient coordinates PQ are also placed on the XY plane, and the normal is taken along the direction away from the observer; then, according to N = [p q −1]^T and V = [0 0 −1]^T, it can be obtained:

cos e = cos i = ([p q −1]^T · [0 0 −1]^T) / (|[p q −1]^T| · |[0 0 −1]^T|) = 1/√(p² + q² + 1)   (8.3)
Substituting Eq. (8.3) into Eq. (8.1), the observed image gray scale is:
Now consider the general case where the ray is not incident at the angle i = e. Let the incident light vector through the surface element be L = [p_i q_i −1]^T; since cos i is the cosine of the angle between N and L, there is:

cos i = ([p q −1]^T · [p_i q_i −1]^T) / (|[p q −1]^T| · |[p_i q_i −1]^T|) = (p p_i + q q_i + 1) / (√(p² + q² + 1) √(p_i² + q_i² + 1))   (8.5)
Substituting Eq. (8.5) into Eq. (8.1), the observed image gray scale when incident
at any angle is:
Now consider the gray-scale changes of the image caused by changes in the orientation of the surface elements. A 3D surface can be represented as z = f(x, y), and the surface normal on it can be represented as N = [p q −1]^T. It can be seen that, as far as its orientation is concerned, a surface element in 3D space corresponds to a single point G(p, q) in the 2D gradient space, as shown in Fig. 8.2.
Using this gradient space approach to study 3D surfaces can act as a dimensionality
reduction (to 2D), but the representation of the gradient space does not determine the
position of the 3D surface in 3D coordinates. In other words, a point in the gradient
space represents all surface elements facing the same orientation, but the spatial
locations of these surface elements can vary.
Structures formed by plane intersections can be analyzed and interpreted with the
help of gradient space methods. For example, the intersection of multiple planes may
form convex or concave structures. To judge whether it is a convex structure or a
concave structure, the gradient information can be used. Let’s first look at the
situation where the two planes S1 and S2 intersect to form the intersection line l,as
shown in Fig. 8.3 (where the gradient coordinate PQ coincides with the spatial
coordinate XY). Here G1 and G2 represent the gradient space points corresponding to
the normal lines of the two planes, respectively, and the connection between them is
perpendicular to the projection l' of l.
If S and G of the same face have the same sign (on the same side of the projection
l' of l), it means that the two faces form a convex structure (see Fig. 8.4a). If the S and
Fig. 8.4 Two spatial planes form a convex structure and a concave structure
G of the same face have different signs, it means that the two faces form a concave
structure (see Fig. 8.4b).
Further consider the case where three planes A, B, and C intersect, and the
intersection lines are l1, l2, and l3, respectively (see Fig. 8.5a). If the faces on both
sides of each intersection line and the corresponding gradient point have the same
sign (the faces are AABBCC in turn clockwise), it means that the three faces form a
convex structure, as shown in Fig. 8.5b. If the faces on both sides of each intersection
and the corresponding gradient points have different signs (the faces are CBACBA in
turn clockwise), it means that the three faces form a concave structure, as shown in
Fig. 8.5c.
Now go back to Eq. (8.4) and rewrite it as:

p² + q² = [L(x, y)ρ/E(x, y)]² − 1 = 1/K² − 1   (8.8)

where K represents the relative reflection intensity observed by the observer. Equation (8.8) corresponds to the equation of a series of concentric circles in the PQ plane, and each circle represents the locus of orientations of surface elements observed with the same gray level. At i = e, the reflection map consists of concentric circles. For
the general case of i ≠ e, the reflection map consists of a series of ellipses and
hyperbolas.
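To make the reflection map concrete, the following short sketch (a minimal illustration, not from the book) evaluates the Lambertian cosine factor of Eq. (8.5) over a grid of gradient-space points (p, q); its iso-value contours are exactly the loci discussed above, circles for i = e and families of ellipses/hyperbolas otherwise. The light-source gradient (ps, qs) and the grid range are assumed example values.

```python
import numpy as np

def lambertian_reflectance_map(p, q, ps, qs):
    """R(p, q) for a Lambertian surface lit from gradient direction (ps, qs).

    R is the cosine of the angle between the surface normal [p, q, -1] and the
    light direction [ps, qs, -1], clipped at zero for self-shadowed orientations."""
    num = p * ps + q * qs + 1.0
    den = np.sqrt(p**2 + q**2 + 1.0) * np.sqrt(ps**2 + qs**2 + 1.0)
    return np.clip(num / den, 0.0, None)

# Assumed example: light from gradient direction (ps, qs) = (0.5, 0.3).
p, q = np.meshgrid(np.linspace(-2, 2, 201), np.linspace(-2, 2, 201))
R = lambertian_reflectance_map(p, q, 0.5, 0.3)
# Iso-brightness loci: gradient-space points with (almost) the same R value.
print(np.sum(np.abs(R - 0.9) < 0.005), "grid points lie near the R = 0.9 contour")
```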
An example application of a reflection map is now given. Suppose the observer
can see three planes A, B, and C, which form the plane intersection angle as shown in
Fig. 8.6a, but the actual inclination of each plane is unknown. Using the reflection
map, the angle between the three planes can be determined. Assuming that L and V
are in the same direction, it has (the relative reflection intensity can be measured
from the image) KA = 0.707, KB = 0.807, and KC = 0.577. According to
the characteristic that the line between G( p, q) of the two faces is perpendicular to
the intersection of the two faces, the triangle shown in Fig. 8.6b can be obtained (i.e.,
the condition that the orientation of the three planes satisfies). Now find GA, GB, and
GC on the reflection map shown in Fig. 8.6c. Substitute each value of K into Eq. (8.8)
to obtain the following two sets of solutions:
The first set of solutions corresponds to the small triangles in Fig. 8.6c, and the
second set of solutions corresponds to the large triangles in Fig. 8.6c. Both sets of
solutions satisfy the condition of relative reflection intensity, so there are two
possible combinations of the orientations of the three planes, corresponding to the
two structures of the convex and concave points of intersection between the three
intersecting lines.
Since the image brightness constraint equation relates the gray level of the pixel with
the orientation, the orientation ( p, q) at that location can be obtained from the gray
level L(x, y) of the pixel at (x, y) in the image. But measuring the brightness of a
single point on an image can only provide one constraint, while the orientation of the
surface has two degrees of freedom. In other words, suppose that the visible surface of the object in the image consists of N pixels, each with a gray value L(x, y); solving Eq. (8.7) means obtaining the (p, q) value at each pixel position. Because only N equations can be formed from the image brightness equation for the N pixels, while there are 2N unknowns (two gradient values to be solved for each gray value), this is an ill-conditioned problem and a unique solution cannot be obtained. It is generally necessary to solve this ill-conditioned
problem by adding additional conditions to establish additional equations. In other
words, the surface orientation cannot be recovered from the image luminance
equation alone without additional information.
A simple way to consider additional information is to exploit constraints in
monocular images. The main ones that can be considered include uniqueness,
continuity (surface, shape), compatibility (symmetry, epipolar lines), etc. In practical
applications, there are many factors that affect brightness, so it is only possible to
recover the shape of an object well from shadow tones if the environment is highly
controlled.
In practice, people often estimate the shape of each part of the human face by just
observing a flat picture. This indicates that the picture contains sufficient information
or that people implicitly introduce additional assumptions based on empirical
knowledge while observing. In fact, many real object surfaces are smooth, that is, continuous in depth, and their partial derivatives are also continuous. The more general case is that the object has a piecewise-continuous surface, with discontinuities only at the edges. The above information provides a strong constraint. For the two adjacent
surface elements on the surface, their orientations are related to a certain extent, and
together they should give a continuous smooth surface. It can be seen that the
method of macroscopic smoothness constraint can be used to provide additional
information to help solve the image brightness constraint equation. The following
three cases are introduced from simple to complex.
where a and b are constants, and the reflection map is shown in Fig. 8.7. The
contours (isolines) of the gradient space in the figure are parallel lines.
In Eq. (8.11), f is a strictly monotonic function (see Fig. 8.8), and its inverse function f⁻¹ exists. From the image brightness equation, we know:
Fig. 8.8 s = ap + bq can be recovered from E(x, y)
Now choose a specific direction θ0 (see Fig. 8.7), with tan θ0 = b/a; that is:

m(θ0) = (ap + bq)/√(a² + b²) = (1/√(a² + b²)) f⁻¹[E(x, y)]   (8.15)
Starting from a specific image point, take a small step of size δs; the change of z is then δz = m δs, that is:
Fig. 8.9 Restoring a surface from parallel surface profiles
First find the solution at a point (x0, y0, z0) on the surface, and then integrate the above differential equation to get:

z(s) = z0 + (1/√(a² + b²)) ∫ f⁻¹[E(x, y)] ds   (8.18)
In this way a surface profile is obtained along a line in the direction given above
(one of the parallel lines in Fig. 8.9). When the reflection map is a function of a linear
combination of gradient elements, the surface profiles are parallel straight lines. The
surface can be recovered by integrating along these lines as long as the initial height
z0(t) is given. Of course, the integrals are calculated using numerical algorithms in
practice.
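As a small numerical illustration of Eq. (8.18) (a sketch only; the linear reflectance map, the coefficients a and b, the synthetic brightness samples, and the starting height z0 are all assumed here, with f taken as the identity so that f⁻¹ is trivial), the height along one profile can be accumulated with a simple Riemann sum:

```python
import numpy as np

def integrate_profile(E_along_line, a, b, z0, ds, f_inverse):
    """Recover heights along one characteristic line for R(p, q) = f(a*p + b*q).

    E_along_line: brightness sampled along the line with step ds.
    f_inverse:    inverse of the (strictly monotonic) reflectance function f.
    Implements z(s) = z0 + (1/sqrt(a^2 + b^2)) * integral of f^{-1}[E] ds."""
    slope = f_inverse(np.asarray(E_along_line, dtype=float))
    return z0 + np.cumsum(slope) * ds / np.sqrt(a**2 + b**2)

# Assumed toy data: constant brightness along the profile, f equal to the identity.
E_line = np.full(100, 0.4)
z = integrate_profile(E_line, a=1.0, b=1.0, z0=0.0, ds=0.1, f_inverse=lambda E: E)
print(z[-1])   # height gained after 100 steps along this profile
```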
It should be noted that if you want to know the absolute distance, you need to
know the z0 value at a certain point, but the (surface) shape can be recovered without
this absolute distance. In addition, the absolute distance cannot be determined only
by the integral constant z0, because z0 itself does not affect the shade, and only the
change of the depth can affect the shade.
Now consider a more general case. If the distribution of the light source is
rotationally symmetric to the observer, then the reflection map is also rotationally
symmetric. For example, when the observer looks at the hemispherical sky from the
bottom up, the resulting reflection map is rotationally symmetric; when the point
light source is at the same position as the observer, the resulting reflection map is
also rotationally symmetric. In these cases, there are:
Now suppose that the function f is strictly monotone and differentiable, and its inverse function is f⁻¹; then, according to the image brightness equation:
If the angle between the direction of the fastest ascent of the surface and the x-axis is θs, where tan θs = q/p, then:
According to Eq. (8.13), the slope in the direction of steepest ascent is:
In this case, if you know the brightness of the surface, you can know its slope, but you don't know the direction of the fastest rise, that is, you don't know the respective values of p and q. Now suppose the direction of steepest ascent is given by (p, q); if a small step of length δs is taken in the direction of steepest ascent, the resulting changes in x and y should be:
To simplify these equations, the step size can be taken as √(p² + q²) δs, so that δx = p δs and δy = q δs. Differentiating the image brightness equation E(x, y) = f(p² + q²) with respect to x and y then gives:

E_x = 2f′(p ∂²z/∂x² + q ∂²z/∂x∂y),   E_y = 2f′(p ∂²z/∂y∂x + q ∂²z/∂y²)   (8.26)

where f′(r) is the derivative of f(r) with respect to its unique variable r.
Now let's determine the changes δp and δq due to taking the step (δx, δy) in the image plane. By differentiating p and q, we get:

δp = [E_x/(2f′)] δs,   δq = [E_y/(2f′)] δs   (8.30)
In this way, in the limit case of δs → 0, the following set of five differential equations is obtained (the derivatives are all taken with respect to s):

ẋ = p,   ẏ = q,   ż = p² + q²,   ṗ = E_x/(2f′),   q̇ = E_y/(2f′)   (8.31)
Given initial values, the above five ordinary differential equations can be solved
numerically to obtain a curve on the object surface. The curve thus obtained is called
the characteristic curve, and in this case it is the steepest ascent curve. This type of
curve is perpendicular to the contour line. Note that when R(p, q) is a linear function
of p and q, the characteristic curve is parallel to the object surface.
In addition, another set of equations can be obtained by differentiating ẋ = p and ẏ = q in Eq. (8.31) with respect to s again:
Since both Ex and Ey are measurements of image brightness, the above equations
need to be solved numerically.
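The five ordinary differential equations in Eq. (8.31) can be integrated with a simple Euler scheme. The sketch below is only illustrative: it assumes a synthetic brightness image whose gradients Ex and Ey can be sampled as functions, a rotationally symmetric reflectance f with known derivative f′, and an initial point on the surface (all made-up values).

```python
import numpy as np

def characteristic_curve(Ex, Ey, f_prime, x0, y0, z0, p0, q0, ds=0.01, steps=200):
    """Euler integration of Eq. (8.31):
       x' = p, y' = q, z' = p^2 + q^2, p' = Ex/(2 f'), q' = Ey/(2 f').
    Ex(x, y), Ey(x, y): image brightness derivatives (callables).
    f_prime(r):         derivative of the reflectance f at r = p^2 + q^2."""
    x, y, z, p, q = x0, y0, z0, p0, q0
    curve = [(x, y, z)]
    for _ in range(steps):
        r = p * p + q * q
        fp = f_prime(r)
        x, y, z = x + p * ds, y + q * ds, z + r * ds
        p, q = p + Ex(x, y) / (2 * fp) * ds, q + Ey(x, y) / (2 * fp) * ds
        curve.append((x, y, z))
    return np.array(curve)

# Assumed toy surface z = (x^2 + y^2)/2, so E = 1 - 0.1*(x^2 + y^2) and f'(r) = -0.1.
curve = characteristic_curve(lambda x, y: -0.2 * x, lambda x, y: -0.2 * y,
                             lambda r: -0.1, x0=0.1, y0=0.0, z0=0.0, p0=0.1, q0=0.0)
print(curve[-1])   # end point of the steepest ascent (characteristic) curve
```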
In general, the surface of the object is relatively smooth (although there are discontinuities between objects); this smoothness condition can be used as an additional
constraint. If the surface of the object is considered to be smooth (within the contour
of the object), the following two equations hold:
(∇p)² = (∂p/∂x)² + (∂p/∂y)² = 0   (8.33)

(∇q)² = (∂q/∂x)² + (∂q/∂y)² = 0   (8.34)
When combined with the image brightness constraint equation, solving the
surface orientation problem can be transformed into a problem of minimizing a
total error as follows:
The above equation can be interpreted as follows: the orientation distribution of the surface elements of the object is sought such that the weighted sum of the overall gray-level error and the overall smoothness error is minimized (see Sect. 7.3.3). Let p̄ and q̄ represent the mean values of p and q in their respective neighborhoods, take the derivatives of the total error with respect to p and q and set them to zero, and then substitute ∇p = p − p̄ and ∇q = q − q̄ into the result; we get:

p(x, y) = p̄(x, y) + (1/λ)[E(x, y) − R(p, q)] ∂R/∂p   (8.36)

q(x, y) = q̄(x, y) + (1/λ)[E(x, y) − R(p, q)] ∂R/∂q   (8.37)
The equations for iteratively solving the above two equations are as follows (the
boundary point values can be used as the initial values):
It should be noted here that smoothness does not hold across the object outline, between its inside and outside, where there are jumps.
The flowchart for solving Eqs. (8.38) and (8.39) is shown in Fig. 8.10, and its basic
framework can also be used to solve the relaxed iterative Eqs. (7.58) and (7.59) of
the optical flow equations.
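A minimal sketch of this relaxation scheme is given below. It is illustrative only: a Lambertian reflectance map with light gradient (ps, qs), a synthetic constant image E, a hand-picked weight lam, numerical derivatives of R, and simple 4-neighbour averaging for p̄ and q̄ are all assumed, and boundary handling is left out for brevity.

```python
import numpy as np

def lambertian_R(p, q, ps, qs):
    return (p * ps + q * qs + 1) / (np.sqrt(p**2 + q**2 + 1) * np.sqrt(ps**2 + qs**2 + 1))

def sfs_relaxation(E, ps, qs, lam=100.0, iters=200, eps=1e-4):
    """Iteratively update (p, q) so that R(p, q) matches E while staying smooth,
    following the structure p <- p_bar + (1/lam)*[E - R]*dR/dp (and similarly for q)."""
    p = np.zeros_like(E)
    q = np.zeros_like(E)
    for _ in range(iters):
        # 4-neighbour means (p_bar, q_bar); edges handled crudely by wrapping.
        p_bar = 0.25 * (np.roll(p, 1, 0) + np.roll(p, -1, 0) + np.roll(p, 1, 1) + np.roll(p, -1, 1))
        q_bar = 0.25 * (np.roll(q, 1, 0) + np.roll(q, -1, 0) + np.roll(q, 1, 1) + np.roll(q, -1, 1))
        # Numerical partial derivatives of R evaluated at (p_bar, q_bar).
        R0 = lambertian_R(p_bar, q_bar, ps, qs)
        dRdp = (lambertian_R(p_bar + eps, q_bar, ps, qs) - R0) / eps
        dRdq = (lambertian_R(p_bar, q_bar + eps, ps, qs) - R0) / eps
        p = p_bar + (E - R0) * dRdp / lam
        q = q_bar + (E - R0) * dRdq / lam
    return p, q

# Assumed synthetic test: a constant image of a nearly frontally lit plane.
E = np.full((64, 64), 0.95)
p, q = sfs_relaxation(E, ps=0.2, qs=0.1)
print(p.mean(), q.mean())
```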
Finally, two sets of examples of shape recovery from shades are given, as shown
in Fig. 8.11. Figure 8.11a is an image of a sphere, and Fig. 8.11b is a needle map of
the spherical surface obtained from Fig. 8.11a by using the shade information (the
short line in the figure indicates the normal direction of the place); Fig. 8.11c is
another image of a sphere, and Fig. 8.11d is the surface orientation needle map
obtained from Fig. 8.11c using the shade information. The direction of the light
source is relatively close to the line-of-sight direction in the group of Fig. 8.11a and
b, so the orientation of each point can basically be determined for the entire visible
surface. In Fig. 8.11c and d, the angle between the direction of the light source and
the direction of the line of sight is relatively large, so it is impossible to determine the
direction of the visible surface that is not illuminated by the light.
When a person looks at a surface covered with texture, the tilt of the surface can be
observed with only one eye, because the texture of the surface will appear distorted
due to the tilt, and from the prior knowledge, one can obtain the information from the
distortion which direction the surface is facing. The role of texture in restoring
surface orientation was elucidated as early as 1950 [5]. This type of approach to
estimating surface orientation based on observed texture distortions is described
below, which is the problem of shape from texture.
The projection of the above two endpoints W1 and W2 can be represented, with the help of homogeneous coordinates (Sect. 2.2.1), as PW1 = [kX1 kY1 kZ1 q1]^T and PW2 = [kX2 kY2 kZ2 q2]^T, where q1 = k(λ − Z1)/λ and q2 = k(λ − Z2)/λ. The point on the straight line between the original W1 and W2 after projection can be represented as (0 < s < 1):
P[sW1 + (1 − s)W2] = s[kX1  kY1  kZ1  q1]^T + (1 − s)[kX2  kY2  kZ2  q2]^T   (8.41)
In other words, the image plane coordinates of all points on this space line can be
obtained by dividing the first three terms by the fourth term of the homogeneous
coordinates, which can be represented as (0 < s < 1):
w = [x  y]^T = [ (sX1 + (1 − s)X2)/(sq1 + (1 − s)q2)   (sY1 + (1 − s)Y2)/(sq1 + (1 − s)q2) ]^T   (8.42)
The above is the projection transformation result when the space point is represented by s. On the other hand, in the image plane, we have w1 = [λX1/(λ − Z1)  λY1/(λ − Z1)]^T and w2 = [λX2/(λ − Z2)  λY2/(λ − Z2)]^T, and the points on the line connecting them can be represented as (0 < t < 1):
t w1 + (1 − t)w2 = t[λX1/(λ − Z1)  λY1/(λ − Z1)]^T + (1 − t)[λX2/(λ − Z2)  λY2/(λ − Z2)]^T   (8.43)
So the coordinates of w1 and w2 and of the points on the line between them on the image plane (parameterized by t) are (0 < t < 1):

w = [x  y]^T = [ tλX1/(λ − Z1) + (1 − t)λX2/(λ − Z2)   tλY1/(λ − Z1) + (1 − t)λY2/(λ − Z2) ]^T   (8.44)
Since the projection result represented by s and the image point represented by t are the same point, Eq. (8.42) and Eq. (8.44) must be equal, from which it can be solved that:

s = tq2/[tq2 + (1 − t)q1]   (8.45)

t = sq1/[sq1 + (1 − s)q2]   (8.46)
It can be seen from the above two equations that s and t have a one-to-one relationship: in 3D space, the point represented by s corresponds to one and only one point represented by t in the 2D image plane. All the space points represented by
s are connected into a straight line, and all the image points represented by t are also
connected in a straight line. After a straight line in the visible 3D space is projected
onto the 2D image plane, as long as it is not projected vertically, the result is still a
straight line (but the length can be changed). In the case of vertical projection, the
projection result is just a point (this is a special case). Its inverse proposition is also
true, that is, a straight line on the 2D image plane must be produced by the projection
of a straight line in 3D space (in special cases, it can also be produced by a plane
projection).
Next, consider the distortion of parallel lines, because parallelism is a line-to-line relationship that is characteristic of linear systems. In 3D space, a point (X, Y, Z) on a line can be represented as:

X = X0 + ka,   Y = Y0 + kb,   Z = Z0 + kc   (8.47)
Among them, (X0, Y0, Z0) is the starting point of the straight line; (a, b, c) is the
direction cosine of the straight line; k is an arbitrary coefficient.
For a set of parallel lines, their (a, b, c) are all the same, but (X0, Y0, Z0) are
different. The distance between parallel lines is determined by their (X0, Y0, Z0)
differences. Substitute Eq. (8.47) into Eqs. (2.27) and (2.28) to get:
(8.49)
When the straight line extends infinitely to both ends, k → ±∞, and the above two equations simplify to:

x_i = λ(a cos γ + b sin γ)/(−a sin α sin γ + b sin α cos γ − c cos α)   (8.50)

y_i = λ(−a sin γ cos α + b cos α cos γ + c sin α)/(−a sin α sin γ + b sin α cos γ − c cos α)   (8.51)
It can be seen that the projected trajectory of parallel lines is only related to (a, b,
c) but not to (X0, Y0, Z0). In other words, parallel lines with the same (a, b, c) will
meet at a point after extending infinitely. This point can be in the image plane or
outside the image plane, so it is also called vanishing point or imaginary point. The
calculation of the vanishing point will be introduced in Sect. 8.2.3.
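Under the simple perspective model used elsewhere in this chapter (x = λX/(λ − Z), y = λY/(λ − Z), with the camera axes aligned with the world axes), the vanishing point of a direction (a, b, c) follows by letting k → ∞ in Eq. (8.47). The sketch below is only illustrative; the focal length and the two parallel lines are assumed values, and it checks the limit result by projecting a very distant point of each line.

```python
import numpy as np

lam = 1.0  # assumed focal length

def project(P):
    X, Y, Z = P
    return np.array([lam * X / (lam - Z), lam * Y / (lam - Z)])

def vanishing_point(a, b, c):
    # Limit of project((X0, Y0, Z0) + k*(a, b, c)) as k -> infinity (needs c != 0).
    return np.array([-lam * a / c, -lam * b / c])

# Two parallel 3D lines: same direction, different starting points.
d = np.array([1.0, 2.0, 3.0])
far = 1e7                                     # a "large k" to approximate infinity
p1 = project(np.array([0.0, 0.0, -5.0]) + far * d)
p2 = project(np.array([4.0, -1.0, -9.0]) + far * d)
print(vanishing_point(*d), p1, p2)            # all three are (nearly) the same point
```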
Using the texture on the surface of an object can help determine the orientation of the
surface and therefore restore the shape of the surface. The description of texture here
is mainly based on the idea of structural method (e.g., see [6]): Complex texture is
composed of some simple texture primitives (texels) that are repeatedly arranged and
combined in a regular form. In other words, texels can be viewed as visual primitives
with repetition and invariance in a region. Here, repetition means that these primitives appear repeatedly in different positions and directions. Of course, this repetition
is only possible under a certain resolution (the number of texels in a given visual
range). Invariance means that pixels that make up the same primitive have basically
the same characteristics, which may be related only to the gray level or may also
depend on other properties such as their shape.
Using the texture of the object surface to determine its orientation should consider
the influence of the imaging process, which is related to the relationship between the
scene texture and the image texture. In the process of acquiring the image, the texture
structure on the original scene may change on the image (producing a gradient
change in both size and direction). This change may be different with the orientation
of the surface on which the texture is located, so it brings the 3D information about
the orientation of the surface of the object. Note that this does not mean that the
surface texture itself has 3D information but that the changes in the texture during the
imaging process have 3D information. The changes of texture can be mainly divided
into three categories (it is assumed that the texture is limited to a planar surface); see
the schematic diagram in Fig. 8.12. Commonly used information recovery methods
can also be divided into the following three categories:
(1) Using the change of size of texture elements
In perspective projection, nearby objects appear large and distant ones appear small, so texture elements at different positions will undergo different changes in size after projection. This is
evident when looking in the direction in which the floor or tile is laid. According to
the maximum value of the texel projected size change rate, the orientation of the
plane where the texel is located can be determined (see Fig. 8.12a). The direction of
this maximum value is also the direction of the texture gradient. Assuming that the
image plane coincides with the paper surface and the line of sight comes out of the
paper, the direction of the texture gradient depends on the rotation angle of the texel
around the camera line of sight, and the value of the texture gradient gives the
degree of inclination of the texel relative to the line of sight. Therefore, with the help
of the geometric information placed by the camera, the orientation of the texture
element and the plane where it is located can be determined.
Figure 8.13 presents two pictures to illustrate that changes in texel size can give
clues to the depth of the scene. Figure 8.13a has many petals in the front (they are
equivalent to texels), and the petal size gradually decreases from front to back
(from near to far). This texel size change gives a sense of depth to the scene. The
building in Fig. 8.13b has many columns and windows (which are equivalent to
regular texels), and their size changes also give a sense of depth to the scene and
easily help the viewer to make the judgment that the corners of the building are
farthest.
It should be noted that the regular texture of the 3D scene surface will generate
texture gradients in the 2D image, but in turn the texture gradient in the 2D image
does not necessarily come from the regular texture of the 3D scene surface.
(2) Using the change of shape of texture elements
The shape of the texel on the surface of the object may change to a certain extent
after the perspective projection and orthogonal projection imaging. If the original
shape of the texel is known, the surface orientation can also be calculated from the
result of the change in the shape of the texel. The orientation of the plane is
determined by two angles (the angle of rotation relative to the camera axis and the
angle of inclination relative to the line of sight). For a given original texel, these two
angles can be determined according to the change results after imaging. For example, on a plane, a texture composed of circles will become ellipses on an inclined
plane (see Fig. 8.12b), where the orientation of the major axis of the ellipse
determines the angle of rotation relative to the camera axis, and the ratio of lengths
of the major and minor axes reflects the angle of inclination relative to the line of
sight. This ratio is also called the aspect ratio, and its calculation process is
described below. Let the equation of the plane on which the circular texture primitive
resides be:
ax + by + cz + d = 0 (8:52)
The circle that constitutes the texture can be regarded as the intersection line
between the plane and the sphere (the intersection line between the plane and the
sphere is always a circle, but when the line of sight is not perpendicular to the plane,
the deformation causes the observed intersection lines to always be elliptical); here
the spherical equation is set as:
x2 + y2 + z2 = r2 (8:53)
Combining the above two equations can provide the solution (equivalent to
projecting the sphere onto the plane):
From the above equation, the coordinates of the center point of the ellipse can be
obtained, and the long and short semiaxes of the ellipse can be determined, so that
the rotation angle and inclination angle can be calculated.
Another method to judge the deformation of circular texture is to calculate
directly the long and short semiaxes of different ellipses, respectively. Refer to
Fig. 8.14 (where the world coordinates coincide with the camera coordinates); the
included angle between the plane of the circular texture primitive and the Y-axis
Fig. 8.14 The position of circular texture primitive plane in coordinate system
(also the included angle between the texture plane and the image plane) is α. At this
time, not only the circular texture primitive becomes an ellipse, but also the density
of the upper primitive is greater than that of the middle, forming a density gradient.
In addition, the appearance ratio of each ellipse, that is, the length ratio of the short
half axis to the long half axis, is not constant, forming an appearance ratio gradient.
At this time, both the size and shape of texture elements change.
If the diameter of the original circle is D, then for the circle at the center of the scene, the major axis of the corresponding ellipse in the image is:

d_major(0, 0) = λD/Z   (8.56)

where λ is the focal length of the camera and Z is the object distance. At this time, the appearance ratio (the ratio of the minor to the major axis) is the cosine of the inclination angle, that is:

d_minor(0, 0)/d_major(0, 0) = cos α   (8.57)
Now consider a primitive in the scene that is not on the optical axis of the camera (such as the light-colored ellipse in Fig. 8.14). If the Y coordinate of the primitive is y, and the angle between its line to the origin and the Z-axis is Ω, then [7]:
At this time, the appearance ratio is cos α(1 − tan Ω tan α), which decreases as Ω increases, forming an appearance-ratio gradient.
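A small numerical sketch of this shape-change cue is given below (the measured axis lengths and angles are made-up values; the off-axis correction simply evaluates the appearance ratio cos α(1 − tan Ω tan α) quoted above):

```python
import numpy as np

def tilt_from_aspect_ratio(d_minor, d_major):
    """On the optical axis the appearance ratio of a projected circle is cos(alpha),
    so the tilt angle of the texture plane is alpha = arccos(d_minor / d_major)."""
    return np.degrees(np.arccos(d_minor / d_major))

def appearance_ratio_off_axis(alpha_deg, omega_deg):
    """Appearance ratio of a circle whose line of sight makes angle omega with the
    optical axis: cos(alpha) * (1 - tan(omega) * tan(alpha))."""
    a, w = np.radians(alpha_deg), np.radians(omega_deg)
    return np.cos(a) * (1.0 - np.tan(w) * np.tan(a))

print(tilt_from_aspect_ratio(0.5, 1.0))         # 60 degrees of tilt
print(appearance_ratio_off_axis(60.0, 10.0))    # smaller ratio away from the axis
```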
Incidentally, the above idea of using the shape change of texture elements to
determine the orientation of the plane where the texture elements are located can also
be extended, but more factors often need to be considered here. For example, the
Fig. 8.15 Schematic diagram of texture element grid and vanishing point
shape of the 3D scene can sometimes be inferred from the shape of the boundary of
the 2D region in the image. For example, the direct explanation of the ellipse in the
image often comes from the disk or ball in the scene. At this time, if the light and
shade changes and the texture patterns in the ellipse are uniform, the explanation of
the disk is often more reasonable; however, if both shading and texture patterns have
radial changes toward the boundary, the explanation of the ball is often more
reasonable.
(3) Using the change of spatial relationship among texture elements
If the texture is composed of a regular grid of texels, the surface orientation information can be restored by calculating its vanishing points (see Sect. 8.2.3). A vanishing point is the common intersection of all segments in a set of intersecting segments. For a perspective image, the vanishing point of a plane is formed by infinitely distant texture elements projecting onto the image plane in a certain direction, that is, it is the convergence point of parallel lines at infinity. For example, Fig. 8.15a shows a
perspective view of a box with parallel grid lines on each surface, and Fig. 8.15b is a
schematic diagram of the vanishing point of its surface texture.
If you look at the vanishing point along the surface, you can see the change of the
spatial relationship between texture elements, that is, the increase of the distribution
density of texture elements. The orientation of the surface can be determined by
using two vanishing points obtained from the same surface texture element grid. The
straight line where these two points are located is also called vanishing line/hidden
line, which is composed of vanishing points of parallel lines in different directions
on the same plane (e.g., the vanishing points of parallel lines in different directions
on the ground constitute the line of horizon). The direction of the vanishing line
indicates the rotation angle of the texture element relative to the camera axis, and the
intersection of the vanishing line and x = 0 indicates the inclination angle of the
texture element relative to the line of sight, as shown in Fig. 8.12c. The above
situation can be easily explained with the help of perspective projection model.
Finally, the above three methods of using texture element changes to determine
the orientation of the object surface can be summarized in Table 8.1.
Table 8.1 Comparison of three methods of using texture element changes to determine the orientation of the object surface

Method                                 | Rotation angle around viewing line                     | Tilt angle with respect to viewing line
Using texel change in size             | Texture gradient direction                             | Texture gradient value
Using texel change in shape            | Direction of the major principal axis of texel         | Ratio of texel major and minor principal axes
Using texel change in spatial relation | Direction of the line connecting two vanishing points  | Cross point of the line connecting two vanishing points and x = 0
Table 8.2 Overview of some typical methods for recovering shape from texture

Surface clue                          | Type of original surface | Type of texture | Type of projection | Analysis method        | Analysis unit | Unit property
Texture gradient                      | Plane                    | Unknown         | Perspective        | Statistical            | Wave          | Wavelength
Texture gradient                      | Plane                    | Unknown         | Perspective        | Statistical            | Region        | Area
Texture gradient density              | Plane                    | Uniform         | Perspective        | Statistical/structural | Edge/region   | Density
Convergence line                      | Plane                    | Parallel line   | Perspective        | Statistical            | Edge          | Direction
Convergence line                      | Plane                    | Parallel line   | Perspective        | Statistical            | Edge          | Direction
Normalized texture characteristic map | Plane                    | Known           | Orthogonal         | Structural             | Line          | Length
Normalized texture characteristic map | Surface                  | Known           | Spherical          | Structural             | Region        | Axis
Shape distortion                      | Plane                    | Isotropy        | Orthogonal         | Statistical            | Edge          | Direction
Shape distortion                      | Plane                    | Unknown         | Orthogonal         | Structural             | Region        | Shape
The specific effect of determining the surface orientation and restoring the surface
shape from the texture is related to the gradient of the surface itself, the distance
between the observation point and the surface, and the angle between the line of
sight and the image. Table 8.2 gives an overview of some typical methods, which
also lists various terms for obtaining shapes from textures [8]. The various methods
that have been proposed to determine the surface by texture are mostly based on
different combinations of them.
In Table 8.2, the difference between the methods lies mainly in the surface clues used: the texture gradient (the rate and direction of the maximum change of texture roughness on the surface); the convergence line (which can constrain the orientation of a planar surface: assuming these lines are parallel in 3D space, the convergence line can determine the vanishing point in the image); the normalized texture characteristic map (similar to the reflection map in shape from shading); and shape distortion (if the original shape of a pattern on the surface is known, the observed shape in the image can be determined for various orientations of the surface). In most cases the surface is a plane, but it can also be a curved surface; the analysis method can be either a structural method or a statistical method.
In Table 8.2, perspective projection is often used for projection type, but it can
also be orthogonal projection or spherical projection. In spherical projection, the
observer is at the center of the sphere, the image is formed on the sphere, and the line
of sight is perpendicular to the sphere. When restoring the orientation of the surface
from the texture, the 3D solid should be reconstructed according to the distortion of
the original texture element shape after projection. Shape distortion is mainly related
to two factors: (1) the distance between the observer and the object, which affects the
size of texture element distortion, and (2) the angle between the normal of the object
surface and the line of sight (also known as the surface inclination), which affects the
shape of the texture element after distortion. In orthogonal projection, the first factor
does not work; only the second factor will work. In perspective projection, the first
factor works, while the second factor only works when the object surface is curved
(if the object surface is flat, it will not produce distortion that affects the shape). The
projection form that can make the above two factors work together on the shape of
the object is spherical perspective projection. At this time, the change of distance
between the observer and the object will cause the change of texture element size,
and the change of object surface inclination will cause the change of object shape
after projection.
In the process of restoring surface orientation from texture, it is often necessary to
have certain assumptions about texture pattern. Two typical assumptions are as
follows:
Isotropy Assumption
The isotropy assumption holds that for isotropic textures, the probability of finding
a texture primitive in the texture plane is independent of the orientation of the texture
primitive. In other words, the probability model of isotropic texture does not need to
consider the orientation of the coordinate system on the texture plane [9].
Homogeneity Assumption
The uniformity of texture in the image refers to that the texture of a window selected
at any position in the image is consistent with that of the window selected at other
positions. More strictly, the probability distribution of a pixel value only depends on
the nature of the pixel neighborhood and has nothing to do with the spatial coordinates of the pixel itself [9]. According to the homogeneity assumption, if the texture
of a window in the image is collected as a sample, the texture outside the window can
be modeled according to the nature of the sample.
In the image obtained by orthogonal projection, even assuming that the texture is
uniform, the orientation of the texture plane cannot be restored, because the uniform
texture is still uniform after viewing angle transformation. However, if the image
obtained by perspective projection is considered, the restoration of the orientation of
the texture plane is possible.
This problem can be explained as follows: According to the uniformity assumption, the texture is considered to be composed of uniform patterns of points. At this
time, if the texture plane is sampled with equally spaced meshes, the number of
texture points obtained by each mesh should be the same or very close. However, if
the texture plane covered by equi-spaced meshes is used for perspective projection,
some meshes will be mapped into larger quadrangles, while others will be mapped
into smaller quadrangles. That is, the texture on the image plane is no longer
uniform. Because the mesh is mapped into different sizes, the number of texture
patterns (originally uniform) contained in it is no longer consistent. According to this
property, the relative orientation of the imaging plane and the texture plane can be
determined by the proportional relationship of the number of texture modes
contained in different windows.
The combination of texture method and stereo vision method is called texture stereo
technology. It estimates the direction of the scene surface by acquiring two images
of the scene at the same time, avoiding the complex problem of corresponding point
matching. In this method, the two imaging systems used are connected by rotation
transformation.
In Fig. 8.16, the straight line orthogonal to the direction of the texture gradient
and parallel to the object surface is called the characteristic line, and there is no
change in the texture structure on this line. The angle between the feature line and the
X-axis is called the feature angle, which can be calculated by comparing the Fourier
energy spectrum of the texture region. According to the feature lines and feature
angles obtained from the two images, the surface normal vector N = [Nx Ny Nz]T can
be determined:
where θ1 and θ2 are the angles between the characteristic line and the X-axis in the two images (measured counterclockwise), respectively; the coefficient aij is the direction cosine between the corresponding axes of the two imaging systems.
If the symbol “⇒” is used to represent the transformation from one space to another, the transformation {x, y} ⇒ {ρ, θ} maps a line in the image space XY to a point in the parameter space ρθ, while the collection of lines with the same vanishing point (xv, yv) in the image space XY is projected to a circle in the parameter space ρθ. To illustrate this, ρ = √(x² + y²) and θ = arctan(y/x) can be substituted into the following equation:
Turning the result back into the rectangular coordinate system, one gets:
The above equation represents a circle with center (xv/2, yv/2) and radius √((xv/2)² + (yv/2)²), as shown in Fig. 8.17b. This circle is the trajectory projected into the ρθ space by the set of line segments with (xv, yv) as the vanishing point. In other words, the vanishing point can be detected by mapping the line segment set from the XY space to the ρθ space with the transformation {x, y} ⇒ {ρ, θ}.
The above method of determining the vanishing point has two disadvantages: one is that the detection of circles is more difficult than the detection of straight lines, and the amount of calculation is also large; the other is that when xv → ∞ or yv → ∞, ρ → ∞ as well (the symbol “→” here indicates a trend). To overcome these shortcomings, the transformation {x, y} ⇒ {k/ρ, θ} can be used instead, where k is a constant (related to the value range of the Hough transform space). At this time, Eq. (8.64) becomes:
k/ρ = xv cos θ + yv sin θ   (8.66)

Converting Eq. (8.66) into the Cartesian coordinate system (letting s = ρ cos θ, t = ρ sin θ), we get:

k = xv s + yv t   (8.67)
This is a straight-line equation. In this way, the vanishing point at infinity can be
projected to the origin, and the locus of the point corresponding to the line segment
with the same vanishing point (xv, yv) in ST space becomes a straight line, as shown in Fig. 8.17c. The slope of this line is given by Eq. (8.67) as −xv/yv, so this line is orthogonal to the vector from the origin to the vanishing point (xv, yv) and lies at a distance k/√(xv² + yv²) from the origin. This straight line can be detected by another
Hough transform, that is, the space ST where the straight line is located is taken as
the original space, and it is detected in the (new) Hough transform space RW. In this
way, the straight line in the space ST is a point in the space RW, as shown in
Fig. 8.17d, and its position is:
r = k/√(xv² + yv²)   (8.68)

w = arctan(yv/xv)   (8.69)

From the above two equations, the coordinates of the vanishing point can be solved as:

xv = k/(r√(1 + tan²w))   (8.70)

yv = k tan w/(r√(1 + tan²w))   (8.71)
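The {x, y} ⇒ {k/ρ, θ} mapping can be checked numerically. The sketch below is a toy example (the vanishing point, the set of lines, and the constant k are assumed values): each line through the vanishing point is converted into a point (s, t), and (xv, yv) is then recovered by a least-squares fit of the line xv·s + yv·t = k. The book detects this line with a second Hough transform, as in Eqs. (8.68)-(8.71); least squares is used here only to keep the sketch short.

```python
import numpy as np

k = 100.0                       # assumed constant of the {x, y} => {k/rho, theta} transform
xv, yv = 250.0, -80.0           # assumed (possibly off-image) vanishing point

# A few image lines through the vanishing point, in normal form rho = x*cos(t) + y*sin(t).
thetas = np.radians([10.0, 35.0, 60.0, 95.0, 130.0])
rhos = xv * np.cos(thetas) + yv * np.sin(thetas)

# Map each line to the point (s, t) of the transformed space.
s = (k / rhos) * np.cos(thetas)
t = (k / rhos) * np.sin(thetas)

# All (s, t) satisfy xv*s + yv*t = k, so (xv, yv) follows from a least-squares fit.
A = np.column_stack([s, t])
xv_est, yv_est = np.linalg.lstsq(A, np.full_like(s, k), rcond=None)[0]
print(xv_est, yv_est)           # close to (250, -80)
```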
The above method has no problem when the vanishing point is within the range of
the original image. In practice, however, the vanishing point is often outside the
image range (as shown in Fig. 8.18), or even at infinity, where problems with the
general image parameter space are encountered. For long-distance vanishing points,
the peaks of the parameter space are distributed over a large distance, so the
detection sensitivity will be poor, and the positioning accuracy will be low.
y1(y3 − y2)/[y2(y3 − y1)] = x1/x2 = a/(a + b)   (8.72)

where y3 can be calculated from Eq. (8.72):

y3 = b y1 y2/(a y1 + b y1 − a y2)   (8.73)
In practice, the position and angle of the camera relative to the ground should be adjusted so that a = b, and then:

y3 = y1 y2/(2y1 − y2)   (8.74)
This simple formula shows that the absolute values of a and b are not important; as
long as the ratio is known, it can be calculated. Further, the above calculation does
not assume that the points V1, V2, and V3 are vertically above the point O, nor does it assume that the points O, H1, and H2 are on the horizontal line, only that they are on
two coplanar lines, and C is also in this plane.
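As a quick numerical check of Eqs. (8.73)-(8.75) (illustrative only; the measured image ordinates y1 and y2, the ratio a : b, and the ellipse measurements are assumed numbers):

```python
def y3_from_ratio(y1, y2, a, b):
    """Eq. (8.73): position of the third point from y1, y2 and the ratio a:b."""
    return b * y1 * y2 / (a * y1 + b * y1 - a * y2)

def ellipse_center_offset(b_axis, d):
    """Eq. (8.75): offset of the projected circle centre, from the ellipse's short
    semiaxis b and its distance d to the vanishing line."""
    return b_axis ** 2 / d

y1, y2 = 40.0, 70.0
print(y3_from_ratio(y1, y2, a=1.0, b=1.0))   # with a = b this equals y1*y2/(2*y1 - y2)
print(ellipse_center_offset(b_axis=12.0, d=90.0))
```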
In perspective projection, the ellipse is projected as an ellipse, but its center is
shifted a bit, because perspective projection doesn’t preserve length ratios (the
midpoint is no longer the midpoint). Assuming that the position of the vanishing
point of the plane can be determined from the image, the offset of the center can be
easily calculated using the previous method. Consider first the special case of
the ellipse, namely a circle, which is projected as an ellipse. Referring to Fig. 8.21, let b be the short semiaxis of the projected ellipse, d be the distance between the projected ellipse and the vanishing line, e be the offset of the center of the circle after projection, and point P be the projected center. Taking b + e as y1, 2b as y2, and b + d as y3, it can be obtained from Eq. (8.74) that:

e = b²/d   (8.75)
The difference from the previous method is that y3 is known here, and it is used to
calculate y1 and then calculate e. If the vanishing line is not known, but the
orientation of the plane where the ellipse is located and the orientation of the
image plane are known, the vanishing line can be deduced and calculated as above.
If the original object is an ellipse, the problem is more complicated because not
only the longitudinal position of the center of the ellipse is not known but also its
lateral position. At this time, two pairs of parallel tangent lines of the ellipse should
be considered. After projection imaging, one pair intersects at P1, and the other
intersects at P2. Both intersections are on the vanishing line, as shown in Fig. 8.22.
Since for each pair of tangents, the chord connecting the tangent points passes
through the center O of the original ellipse (this property does not vary with
projection), the projection center should be on the chord. The intersection of the
two chords corresponding to the two pairs of tangents is the projection center C.
When using the optical system to image the scene, the lens actually used can only
clearly image the scene within a certain distance interval. In other words, when the
optical system is focused at a certain distance, it can only image the scene within a
certain range above and below this distance with sufficient clarity (the defocused
image will become blurred [11]). This distance range is called the depth of field of the lens. The depth of field is determined by the farthest and nearest points (or the farthest and nearest planes) that still satisfy the required degree of sharpness. It is conceivable that if the depth of field can be controlled, then when the depth of field is small, the farthest and nearest points on the scene that satisfy the required clarity are very close together, and the depth of the scene can then be determined. The middle of the depth-of-field range basically corresponds to the focused distance [12], so this method is often referred to as shape from focus (or shape from focal length).
Figure 8.23 gives a schematic representation of the depth of field of a thin lens.
When the lens is focused on a point on the scene plane (the distance between the
scene and the lens is do), it is imaged on the image plane (the distance between the
image and the lens is di). If you reduce the distance between the scene and the lens to
do1, the image will be imaged at a distance of di1 from the lens, and the point image
on the original image plane will spread into a blurred disk of diameter D. If the
distance between the scene and the lens is increased to do2, the image will be imaged
at a distance of di2 from the lens, and the point image on the original image plane will
also spread into a blurred disk of diameter D. If D is the largest diameter acceptable
for sharpness, the difference between do1 and do2 is the depth of field.
The diameter of the blurred disk is related to both camera resolution and depth of
field. The resolution of the camera depends on the number, size, and arrangement of
the camera imaging units. In the common square grid arrangement, if there are N x N
cells, then N/2 lines can be distinguished in each direction. That is, there is an
interval of one cell between two adjacent lines. In a typical grating, the black and white lines are equally spaced, so it can be said that N/2 pairs of lines can be
distinguished. The resolution capability of a camera can also be expressed in terms
of resolving power. If the spacing of the imaging units is A and the unit is mm, the
resolution power of the camera is 0.5/A and the unit is line/mm. If the side length of
the imaging element array of a CCD camera is 8 mm, and there are 512 x 512
elements in total, its resolution is 0.5 x 512/8 = 32 line/mm.
Assuming that the focal length of the lens is λ, then according to the thin lens imaging formula, we have:

1/λ = 1/do + 1/di   (8.76)
Now denote the lens aperture (diameter) by A; then when the scene is at the closest point, the distance di1 between the image and the lens is:

di1 = A di/(A − D)   (8.77)

According to Eq. (8.76), the distance to the closest point of the scene is:

do1 = Aλdo/[Aλ + D(do − λ)]   (8.78)
Similarly, the distance of the farthest point of the scene can be obtained as:
do2 = λ[A di/(A + D)]/[A di/(A + D) − λ] = Aλdo/[Aλ − D(do − λ)]   (8.80)

It can be seen from the denominator on the right side of Eq. (8.80) that when

do = [(A + D)/D] λ = H   (8.81)

then do2 is infinite, and the depth of field is also infinite. H is called the hyperfocal distance, and when do ≥ H, the depth of field is infinite. For do < H, the depth of field is:

Δdo = do2 − do1 = 2AλD do(do − λ)/[(Aλ)² − D²(do − λ)²]   (8.82)
It can be seen from Eq. (8.82) that the depth of field increases with the increase of D: if a larger blur disk is allowed (tolerated), the depth of field is also larger. In addition, Eq. (8.82) shows that the depth of field decreases with the increase of λ; that is, a lens with a short focal length gives a larger depth of field.
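The sketch below evaluates the near-point and far-point formulas together with Eqs. (8.81)-(8.82). The focal length λ, aperture A, acceptable blur diameter D, and focused distance do are all assumed example values (in millimetres):

```python
def depth_of_field(lam, A, D, do):
    """Near point, far point, hyperfocal distance and depth of field of a thin lens.
    lam: focal length, A: aperture diameter, D: acceptable blur-circle diameter,
    do:  focused object distance (all in the same length unit)."""
    do1 = A * lam * do / (A * lam + D * (do - lam))          # nearest sharp point
    H = (A + D) / D * lam                                    # hyperfocal distance
    if do >= H:
        return do1, float("inf"), H, float("inf")
    do2 = A * lam * do / (A * lam - D * (do - lam))          # farthest sharp point
    return do1, do2, H, do2 - do1                            # last item: depth of field

# Assumed example: 50 mm lens, 25 mm aperture, 0.03 mm blur circle, focused at 3 m.
print(depth_of_field(lam=50.0, A=25.0, D=0.03, do=3000.0))
```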
Since the depth of field obtained with a lens having a longer focal length will be
smaller (closest and farthest points are close), it is possible to determine the distance
of the scene from the determination of the focal length. In fact, the human visual
system does this too. When people observe a scene, in order to see clearly, the
refractive power of the lens is controlled by adjusting the pressure of the ciliary
body, so that the depth information is connected with the pressure of the ciliary body,
and the distance of the scene is judged according to the pressure adjustment. The
automatic focus function of the camera is also realized based on this principle. If the
focal length of the camera changes smoothly within a certain range, edge detection
can be performed on the image obtained at each focal length value. For each pixel in
the image, determine the focal length value that produces a sharp edge, and use the
focal length value to determine the distance (depth) between the 3D scene surface
point corresponding to the pixel and the camera lens. In practical applications, for a
given scene, adjust the focal length to make the image of the camera clear; then the
focal length value at this time indicates the distance between the camera and it,
while, for an image shot with a certain focal length, the depth of the scene point
corresponding to the clear pixel point can also be calculated.
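A minimal sketch of the depth-from-focus idea described above follows. It assumes that a stack of images taken at known focus settings is already available as a NumPy array, and it uses the squared response of a local Laplacian as the sharpness measure; both the stack and the focus settings here are synthetic placeholders, not the book's procedure in detail.

```python
import numpy as np

def laplacian_sharpness(img):
    """Simple focus measure: squared response of a 4-neighbour Laplacian."""
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
           np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
    return lap ** 2

def depth_from_focus(stack, focus_settings):
    """stack: (K, H, W) images taken at K focus settings.
    For every pixel, pick the setting that maximizes the sharpness measure and
    return the corresponding (known) object distance."""
    sharpness = np.stack([laplacian_sharpness(img) for img in stack])   # (K, H, W)
    best = np.argmax(sharpness, axis=0)                                  # (H, W)
    return np.asarray(focus_settings)[best]

# Synthetic placeholder data: 5 random "images" and 5 focus distances (in mm).
rng = np.random.default_rng(0)
stack = rng.random((5, 32, 32))
depth_map = depth_from_focus(stack, focus_settings=[300, 500, 800, 1200, 2000])
print(depth_map.shape, depth_map.min(), depth_map.max())
```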
The following introduces a method to calculate the geometric shape and pose of a 3D scene by using the coordinates of three image points, under the condition that the 3D scene model and the focal length of the camera are known [13]; that is, the pairwise distances between the three scene points are known.
Equation (8.85) gives quadratic equations in the three ki, in which the three d²mn on the left side are known from the 3D scene model, and the three dot products vm·vn can also be calculated from the coordinates of the image points. The P3P problem of determining the coordinates of the points Wi thus becomes a problem of solving three quadratic equations in three unknowns. In theory, there are eight solutions to Eq. (8.85) (i.e., eight sets of [k1 k2 k3]), but as can be seen from Fig. 8.24, due to symmetry, if [k1 k2 k3] is a set of solutions, then [−k1 −k2 −k3] must also be a set of solutions. Since the object can only be on one side of the camera, there are at most four sets of real solutions. It has also been shown that, although there may be four sets of solutions in certain cases, there are generally only two sets of solutions.
Now solve for ki in the following three functions:
Suppose the initial value is near [k1 k2 k3], but f(k1, k2, k3) ≠ 0. An increment [Δ1 Δ2 Δ3] is now required so that f(k1 + Δ1, k2 + Δ2, k3 + Δ3) tends to 0. Expanding f(k1 + Δ1, k2 + Δ2, k3 + Δ3) in the neighborhood of [k1 k2 k3] and omitting higher-order terms, we get:

f(k1 + Δ1, k2 + Δ2, k3 + Δ3) = f(k1, k2, k3) + [∂f/∂k1  ∂f/∂k2  ∂f/∂k3] [Δ1  Δ2  Δ3]^T   (8.87)
Setting the left-hand side of the above equation to 0 yields a linear equation in [Δ1 Δ2 Δ3]. Similarly, the functions g and h in Eq. (8.86) can also be transformed into linear equations. Putting them together:

[0]   [f(k1, k2, k3)]   [∂f/∂k1  ∂f/∂k2  ∂f/∂k3] [Δ1]
[0] = [g(k1, k2, k3)] + [∂g/∂k1  ∂g/∂k2  ∂g/∂k3] [Δ2]   (8.88)
[0]   [h(k1, k2, k3)]   [∂h/∂k1  ∂h/∂k2  ∂h/∂k3] [Δ3]
The above matrix of partial derivatives is the Jacobian matrix J. For the three functions of Eq. (8.86), the Jacobian matrix has the following form (where vmn = vm · vn):
J(k1, k2, k3) = [J11 J12 J13; J21 J22 J23; J31 J32 J33]
  = [ (2k1 − 2v12k2)   (2k2 − 2v12k1)   0
      0                (2k2 − 2v23k3)   (2k3 − 2v23k2)
      (2k1 − 2v31k3)   0                (2k3 − 2v31k1) ]   (8.89)
If the Jacobian matrix J is invertible at the point (k1, k2, k3), the parameter increment can be obtained as:

[Δ1  Δ2  Δ3]^T = −J⁻¹(k1, k2, k3) [f(k1, k2, k3)  g(k1, k2, k3)  h(k1, k2, k3)]^T   (8.90)
Adding the above increment to the parameter values of the previous step, and using K^(l) to represent the l-th iterate of the parameters, we get (Newton's method):
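A compact sketch of this Newton iteration is given below. It is illustrative only: it assumes the three viewing directions v1, v2, v3 have been normalized to unit length, so that each quadratic has the form km² + kn² − 2 km kn (vm·vn) = dmn², and the starting guess, directions, and distances are made-up test values.

```python
import numpy as np

def p3p_newton(v, d2, k0, iters=50):
    """Solve the three P3P quadratics for k = (k1, k2, k3) by Newton's method.
    v:  3x3 array of unit viewing directions (rows v1, v2, v3).
    d2: squared model distances (d12^2, d23^2, d31^2).
    k0: initial guess for (k1, k2, k3)."""
    v12, v23, v31 = v[0] @ v[1], v[1] @ v[2], v[2] @ v[0]
    k = np.asarray(k0, dtype=float)
    for _ in range(iters):
        k1, k2, k3 = k
        F = np.array([k1**2 + k2**2 - 2*v12*k1*k2 - d2[0],
                      k2**2 + k3**2 - 2*v23*k2*k3 - d2[1],
                      k1**2 + k3**2 - 2*v31*k1*k3 - d2[2]])
        J = np.array([[2*k1 - 2*v12*k2, 2*k2 - 2*v12*k1, 0.0],
                      [0.0, 2*k2 - 2*v23*k3, 2*k3 - 2*v23*k2],
                      [2*k1 - 2*v31*k3, 0.0, 2*k3 - 2*v31*k1]])
        k = k + np.linalg.solve(J, -F)        # Newton update: k <- k - J^{-1} F
    return k

# Assumed toy problem: true scales (2, 3, 4) along three unit viewing directions.
v = np.array([[0.0, 0.0, 1.0], [0.3, 0.0, 1.0], [0.0, 0.4, 1.0]])
v /= np.linalg.norm(v, axis=1, keepdims=True)
k_true = np.array([2.0, 3.0, 4.0])
W = v * k_true[:, None]
d2 = [np.sum((W[0]-W[1])**2), np.sum((W[1]-W[2])**2), np.sum((W[2]-W[0])**2)]
print(p3p_newton(v, d2, k0=[1.5, 2.5, 3.5]))   # should approach one solution set, here (2, 3, 4)
```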
The method of recovering shape from shading was proposed in the early days using
some assumptions, for example, the light source is located at infinity, the camera
follows the orthogonal projection model, the reflection characteristics of the object
surface obey the ideal diffuse reflection, etc., to simplify the imaging model. These
assumptions reduce the complexity of the SFS method but may also generate large
reconstruction errors in practical applications. For example, the actual scene surface
is rarely an ideal diffuse reflection surface, and the specular reflection factor is often
mixed in. For another example, when the distance between the camera and the scene surface is relatively small, the imaging is close to a perspective projection, and the orthographic assumption will then lead to relatively obvious reconstruction errors.
Considering that the surface of the actual scene is mostly a mixture of diffuse
reflection and specular reflection, Ward proposed a reflection model [14], which
uses a Gaussian model to describe the specular component in the surface reflection.
Ward represents this model using the bidirectional reflectance distribution function (BRDF; see Sect. 7.2.1):

f(θi, φi; θe, φe) = b_l + [b_m/(4πσ²)] · [1/√(cos θi cos θe)] · exp(−tan²δ/σ²)   (8.92)

Among them, b_l and b_m are the diffuse reflection and specular reflection coefficients, respectively; σ is the surface roughness coefficient; and δ is the angle between the bisector direction (L + V)/||L + V|| of the light source and viewing directions and the surface normal vector (where L is the light direction vector and V is the viewing direction vector).
The Ward reflection model is a concrete physical realization of the Phong
reflection model. The Ward reflection model is actually a linear combination of
diffuse and specular reflections, where the diffuse part still uses the Lambertian
model. Since the use of the Lambertian model to calculate the radiance of the scene
surface is not accurate enough for the actual diffuse reflection surface, a more
accurate reflection model was proposed [15]. In this model, the scene surface is
considered to be composed of many “V”-shaped grooves, and the slopes of the two
micro-planes in the “V”-shaped grooves are the same in magnitude but opposite in
direction. Define the surface roughness as a probability distribution function of the
micro-facet orientation. Using the Gaussian probability distribution function, the
formula for calculating the radiance of the diffuse surface can be obtained as:
fV(0j, ^i; 0e, ^e) = ~ cos 0ifA + B max [0, cos (^e — ^i)] sin a sin fig (8:93)
Substituting Eq. (8.93) for the diffuse reflection term (the first term on the right
side of the equal sign) in Eq. (8.92), an improved Ward reflection model can be
obtained:
f′(θi, φi; θe, φe) = b_l cos θi {A + B max[0, cos(φe − φi)] sin α tan β}
  + [b_m/(4πσ²)] · [1/√(cos θi cos θe)] · exp(−tan²δ/σ²)   (8.94)
The improved Ward model should better describe hybrid surfaces with both diffuse
reflection and specular reflection [16].
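For reference, the sketch below evaluates the Ward-style mixed reflectance of Eq. (8.92) for assumed material parameters (b_l, b_m, and σ are made-up values; δ is computed as the angle between the light/view bisector and the surface normal):

```python
import numpy as np

def ward_brdf(n, l, v, b_l, b_m, sigma):
    """Isotropic Ward-style reflectance of Eq. (8.92).
    n, l, v: unit surface normal, light direction, and viewing direction.
    b_l, b_m: diffuse and specular coefficients; sigma: surface roughness."""
    cos_i = max(float(n @ l), 1e-6)
    cos_e = max(float(n @ v), 1e-6)
    h = (l + v) / np.linalg.norm(l + v)          # bisector of light and view directions
    cos_d = np.clip(float(n @ h), 1e-6, 1.0)     # cos(delta)
    tan2_d = (1.0 - cos_d**2) / cos_d**2
    specular = b_m / (4 * np.pi * sigma**2) / np.sqrt(cos_i * cos_e) * np.exp(-tan2_d / sigma**2)
    return b_l + specular

# Assumed example: light and camera 30 degrees on either side of the normal.
n = np.array([0.0, 0.0, 1.0])
l = np.array([np.sin(np.radians(30)), 0.0, np.cos(np.radians(30))])
v = np.array([-np.sin(np.radians(30)), 0.0, np.cos(np.radians(30))])
print(ward_brdf(n, l, v, b_l=0.3, b_m=0.4, sigma=0.2))   # strong mirror-direction response
```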
Consider the perspective projection when the camera is relatively close to the scene surface, as shown in Fig. 8.25. The optical axis of the camera coincides with the Z-axis, the optical center of the camera is located at the projection center, and the focal length of the camera is λ. Let the image plane xy lie at Z = −λ. At this time θi = θe = α = β and φi = φe, and Eq. (8.94) becomes:
Suppose the surface shape of the scene in the image can be represented by the function T: Ω → R³:

T(x) = (z(x)/λ) [x  −λ]^T   (8.96)

x = [x  y]^T ∈ Ω   (8.97)
Among them, z(x) = −Z(X) > 0 represents the depth of the point on the scene surface along the optical axis, and Ω is an open subset of the real plane R², representing the domain (size) of the image.
The normal vector n(x) at any point P on the scene surface is:

\mathbf{n}(\mathbf{x}) = \begin{bmatrix} \lambda\nabla z(\mathbf{x}) \\ z(\mathbf{x}) + \mathbf{x}\cdot\nabla z(\mathbf{x}) \end{bmatrix}    (8.98)
where \nabla z(x) is the gradient of z(x). The vector in the direction of the ray cast through point P is:

\mathbf{L}(\mathbf{x}) = \frac{1}{\sqrt{\|\mathbf{x}\|^2 + \lambda^2}}\begin{bmatrix} -\mathbf{x} \\ \lambda \end{bmatrix}    (8.99)
Because \theta_i is the included angle between n(x) and L(x), if we let v(x) = \ln z(x), there is:

\theta_i = \arccos\frac{\mathbf{n}^{\mathrm{T}}(\mathbf{x})\,\mathbf{L}(\mathbf{x})}{\|\mathbf{n}(\mathbf{x})\|\,\|\mathbf{L}(\mathbf{x})\|} = \arccos\frac{Q(\mathbf{x})}{\sqrt{\lambda^2\|\nabla v(\mathbf{x})\|^2 + [1 + \mathbf{x}\cdot\nabla v(\mathbf{x})]^2}}    (8.100)
where Q(\mathbf{x}) = \lambda/\sqrt{\|\mathbf{x}\|^2 + \lambda^2}. Substituting Eq. (8.100) into Eq. (8.95) yields the image brightness constraint equation under perspective projection, Eq. (8.101), in which F(\mathbf{x}, \nabla v) = \lambda^2\|\nabla v(\mathbf{x})\|^2 + [1 + \mathbf{x}\cdot\nabla v(\mathbf{x})]^2 and \nabla v is the abbreviation of \nabla v(\mathbf{x}) = [p, q]^{\mathrm{T}}. Equation (8.101) is a first-order partial differential equation, and the corresponding Hamiltonian function H(\mathbf{x}, p, q), Eq. (8.102), can be obtained from it.
Considering the Dirichlet boundary conditions, the above equation can be written as a static Hamilton-Jacobi equation:

H(\mathbf{x}, p, q) = 0 \quad \forall \mathbf{x} \in \Omega, \qquad v(\mathbf{x}) = \omega(\mathbf{x}) \quad \forall \mathbf{x} \in \partial\Omega    (8.103)

Introducing a pseudo-time variable t, the corresponding evolution form is:

v_t + H(\mathbf{x}, p, q) = 0 \quad \forall \mathbf{x} \in \Omega, \qquad v(\mathbf{x}, t) = \omega(\mathbf{x}) \quad \forall \mathbf{x} \in \partial\Omega, \qquad v(\mathbf{x}, 0) = v_0(\mathbf{x})    (8.104)
Then the fixed-point iterative sweeping method [17] and the 2D central Hamiltonian function [18] are used to solve it.
Consider the mesh points x_{i,j} = (ih, jw) in the m × n image \Omega, where i = 1, 2, ..., m, j = 1, 2, ..., n, and (h, w) define the size of the discrete mesh in the numerical algorithm. It is now required to solve for the discrete approximate solution v_{i,j} = v(x_{i,j}) of the unknown function v(x).
The forward Euler formula is applied to expand Eq. (8.104) in the time domain, giving:

v_{i,j}^{n+1} = v_{i,j}^{n} - \Delta t\, H\!\left(p_{i,j}^{-}, p_{i,j}^{+};\ q_{i,j}^{-}, q_{i,j}^{+}\right)    (8.105)

where \Delta t = \gamma/[(\sigma_x/h) + (\sigma_y/w)] and \gamma is the CFL coefficient; \sigma_x and \sigma_y are artificial viscosity factors, which meet the following requirements:
\sigma_x = \max_{p, q}\left|\frac{\partial H(p, q)}{\partial p}\right|    (8.106)

\sigma_y = \max_{p, q}\left|\frac{\partial H(p, q)}{\partial q}\right|    (8.107)
where p^-, p^+ and q^-, q^+ represent the backward and forward differences of p and q, respectively, as given by Eqs. (8.108) to (8.112). Substituting Eq. (8.106) to Eq. (8.112) into Eq. (8.105), the final iterative equation, Eq. (8.113), is obtained. The solution then proceeds in the following steps:
1. Initialization: The values of the boundary points are set by the boundary condition and remain fixed during the iteration. Assign the image-domain points (i.e., the points in \Omega) a larger value, v^0_{i,j} = M, where M should be greater than the maximum of all height values; the values of these points will be updated in the iteration process.
2. Alternate direction scanning: In step k + 1, use iterative Eq. (8.113) to update vi,j.
The scanning process adopts the Gauss-Seidel method from the following four
directions: (1) From top left to bottom right, i = 1: m, j = 1: n; (2) from bottom
left to top right, i = m:1,j = 1: n; (3) from bottom right to top left, i = m:1,j = n:
1; and (4) from top right to bottom left, i = 1: m, j = n: 1. When v_{i,j}^{k+1} < v_{i,j}^{k}, update v_{i,j}^{new} = v_{i,j}^{k+1}.
3. Iteration stop criteria: When \|v^{k+1} - v^k\|_1 < \varepsilon = 10^{-5}, stop the iteration; otherwise, return to step 2. A minimal sketch of this sweeping loop is given after this list.
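A minimal Python sketch of this sweeping loop, under the assumptions stated here, is as follows. The discrete Hamiltonian update of Eq. (8.113) is not reproduced in the text above, so it is represented by a user-supplied placeholder local_update(); the boundary mask, the large initial value, and the L1 stopping tolerance follow the three steps just listed.

```python
import numpy as np

def fixed_point_sweeping(v0, boundary_mask, local_update, eps=1e-5, max_iter=1000):
    """Fixed-point iterative sweeping with four alternating Gauss-Seidel scan orders.

    v0            : initial grid (boundary values set; interior set to a large M)
    boundary_mask : True where v is fixed by the Dirichlet boundary condition
    local_update  : callable (v, i, j) -> candidate value from the discrete
                    Hamiltonian update (Eq. (8.113)); assumed to be given
    """
    v = v0.copy()
    m, n = v.shape
    sweeps = [
        (range(m), range(n)),                          # top-left -> bottom-right
        (range(m - 1, -1, -1), range(n)),              # bottom-left -> top-right
        (range(m - 1, -1, -1), range(n - 1, -1, -1)),  # bottom-right -> top-left
        (range(m), range(n - 1, -1, -1)),              # top-right -> bottom-left
    ]
    for _ in range(max_iter):
        v_old = v.copy()
        for rows, cols in sweeps:                      # Gauss-Seidel: update in place
            for i in rows:
                for j in cols:
                    if boundary_mask[i, j]:
                        continue
                    cand = local_update(v, i, j)
                    if cand < v[i, j]:                 # keep only decreasing updates
                        v[i, j] = cand
        if np.abs(v - v_old).sum() < eps:              # L1 stopping criterion
            break
    return v
```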
Due to the complexity of the Ward reflection model, it is difficult to find optimal artificial viscosity factors for the fixed-point iterative sweeping method, resulting in slow convergence of the calculation. Therefore, the Blinn-Phong reflection model [19] can be used to establish the equation instead.
Based on the Blinn-Phong reflection model to characterize the hybrid reflection characteristics of the object surface, the image brightness constraint equation is [20]:

I(u, v) = k_l\cos\theta_i + k_m\cos^{a}\delta    (8.116)

where I(u, v) is the gray value of the image at (u, v); k_l and k_m are the weighting factors of the diffuse reflection and specular reflection components of the scene surface, respectively, with k_l + k_m ≤ 1; a > 0 is the specular reflection index; \theta_i is the angle between the normal vector N(u, v) at a point P(x, y, z) on the scene surface corresponding to (u, v) and the light source direction L(u, v); and \delta is defined as in Eq. (8.92).
Considering that the point light source is approximately located at the projection center, then under perspective projection the scene point and its image point (u, v) satisfy:

\frac{x}{u} = \frac{y}{v} = \frac{Z}{-\lambda}    (8.117)
Therefore, the point P(x, y, z) on the hybrid surface can be represented as:

P(x, y, z) = \frac{z(u, v)}{\lambda}\,(u, v, -\lambda), \quad (u, v) \in \Omega    (8.118)
Among them, z(u, v) > 0, and \Omega is the image region captured by the camera.
With the help of Eq. (8.118), the normal vector at point P can be calculated as:

N(u, v) = \left(\lambda\frac{\partial z}{\partial u},\ \lambda\frac{\partial z}{\partial v},\ u\frac{\partial z}{\partial u} + v\frac{\partial z}{\partial v} + z\right)    (8.119)

and the unit vector from P toward the light source (and the viewpoint) is:

L(u, v) = \frac{1}{\sqrt{u^2 + v^2 + \lambda^2}}\,[-u, -v, \lambda]    (8.120)
Because \theta_i is the included angle between N(u, v) and L(u, v), there is:

\cos\theta_i = \frac{N(u, v)\cdot L(u, v)}{\|N(u, v)\|\,\|L(u, v)\|} = \frac{Q(u, v)\, z}{\sqrt{\lambda^2 z_u^2 + \lambda^2 z_v^2 + (u z_u + v z_v + z)^2}}    (8.121)

where Q(u, v) = \lambda/(u^2 + v^2 + \lambda^2)^{1/2} > 0 and z_u, z_v denote the partial derivatives of z(u, v). Let Z = \ln[z(u, v)]; substituting Eq. (8.121) into Eq. (8.116), the image brightness constraint equation of the hybrid surface under perspective projection can be obtained:
I(u, v) = k_l\,\frac{Q(u, v)}{U(u, v, \nabla Z)} + k_m\,\frac{Q^{a}(u, v)}{U^{a}(u, v, \nabla Z)}    (8.122)

where

U(u, v, \nabla Z) = \sqrt{\lambda^2\left(\frac{\partial Z}{\partial u}\right)^2 + \lambda^2\left(\frac{\partial Z}{\partial v}\right)^2 + \left(u\frac{\partial Z}{\partial u} + v\frac{\partial Z}{\partial v} + 1\right)^2} > 0    (8.123)
Let T = Q(u, v)/U(u, v, \nabla Z); then Eq. (8.122) becomes the scalar equation F(T) = k_l T + k_m T^{a} - I(u, v) = 0, which can be solved with the Newton iteration:

T_k = T_{k-1} - \frac{F(T_{k-1})}{F'(T_{k-1})}    (8.126)
thus obtaining:

U(u, v, \nabla Z) = \frac{Q(u, v)}{T_k}    (8.127)
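As an illustration of this step, the following Python sketch applies the Newton iteration of Eq. (8.126) pixel by pixel to the scalar equation k_l T + k_m T^a = I(u, v) obtained by reading Eq. (8.122) with T = Q/U; the function name, starting value, and sample parameters are illustrative assumptions.

```python
def solve_T(I, k_l, k_m, a, T0=0.5, n_iter=20, tol=1e-8):
    """Newton iteration (Eq. (8.126)) for F(T) = k_l*T + k_m*T**a - I = 0,
    where T = Q(u, v)/U(u, v, grad Z) as in Eq. (8.127)."""
    T = T0
    for _ in range(n_iter):
        F = k_l * T + k_m * T ** a - I
        dF = k_l + a * k_m * T ** (a - 1)       # derivative F'(T)
        T_new = T - F / dF
        if abs(T_new - T) < tol:
            return T_new
        T = T_new
    return T

# A pixel with gray value 0.7, k_l = 0.6, k_m = 0.3, specular index a = 4
print(solve_T(0.7, 0.6, 0.3, 4))
```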
Substitute Eq. (8.127) into Eq. (8.123) to obtain a new image brightness constraint
equation:
T_k\,\sqrt{\lambda^2\left(\frac{\partial Z}{\partial u}\right)^2 + \lambda^2\left(\frac{\partial Z}{\partial v}\right)^2 + \left(u\frac{\partial Z}{\partial u} + v\frac{\partial Z}{\partial v} + 1\right)^2} - Q(u, v) = 0    (8.128)
It can be seen that Eq. (8.128) is a partial differential equation of the Hamilton-Jacobi
type. It does not have a solution in the usual sense in general, so the solution in the
viscous sense needs to be calculated. First, the Hamiltonian function H(u, v, g) of Eq. (8.128) is formed (Eq. (8.129)), where g denotes the gradient (\partial Z/\partial u, \partial Z/\partial v). The Legendre transform is then used to obtain the control form corresponding to Eq. (8.129), in which the matrices R(u, v) and D(u, v) are defined as:
R(u, v) = \begin{cases} \begin{bmatrix} u/\sqrt{u^2 + v^2} & v/\sqrt{u^2 + v^2} \\ -v/\sqrt{u^2 + v^2} & u/\sqrt{u^2 + v^2} \end{bmatrix}, & u^2 + v^2 \neq 0 \\ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, & u^2 + v^2 = 0 \end{cases}    (8.131)

D(u, v) = \begin{bmatrix} \sqrt{u^2 + v^2 + \lambda^2} & 0 \\ 0 & \lambda \end{bmatrix}    (8.133)

Combining these, the discretized (numerical) Hamiltonian can be written as:

H(u, v, g) \approx -Q(u, v) + \sup_{a \in \bar{B}_2(0, 1)} \big\{ -l_c(u, v, a) + \min[-f_1(u, v, a), 0]\,g_1^{+} + \max[-f_1(u, v, a), 0]\,g_1^{-} + \min[-f_2(u, v, a), 0]\,g_2^{+} + \max[-f_2(u, v, a), 0]\,g_2^{-} \big\}    (8.134)
where f_m is the mth (m = 1, 2) component of f_c, and g_m^{+} and g_m^{-} are the forward and backward differences of the mth component, respectively. The computation here is an optimization problem (see [21]).
Finally, define Z^k = Z(u, v, k\Delta t), and expand with the forward Euler formula in the time domain to obtain the numerical update Z^{k+1} = Z^{k} - \Delta t\, H(u, v, g), where \Delta t is the time increment. Using the iterative fast marching strategy [22], the viscous solution of Z can be accurately approximated after several iterations, and the exponential exp(Z) of this viscous solution gives the height of the hybrid reflective surface.
References
1. Zhang Y-J (2017) Image engineering, Vol. 3: Image understanding. De Gruyter, Germany.
2. Lee JH, Kim CS (2022) Single-image depth estimation using relative depths. Journal of Visual Communication and Image Representation 84: 103459 (DOI: https://doi.org/10.1016/j.jvcir.2022.103459).
3. Heydrich T, Yang Y, Du S (2022) A lightweight self-supervised training framework for
monocular depth estimation. International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2265-2269.
4. Anunay, Pankaj, Dhiman C (2021) DepthNet: A monocular depth estimation framework. 7th
International Conference on Engineering and Emerging Technologies (ICEET), 495-500.
5. Gibson JJ (1950) The perception of the visual world. Houghton Mifflin, Boston.
6. Zhang Y-J (2017) Image engineering, Vol. 2: Image analysis. De Gruyter, Germany.
7. Jain R, Kasturi R, Schunck BG (1995) Machine Vision. McGraw-Hill Companies, Inc., New York.
8. Tomita F, Tsuji S (1990) Computer Analysis of Visual Textures. Kluwer Academic Publishers,
Amsterdam.
9. Forsyth D, Ponce J (2003) Computer Vision: A Modern Approach. Prentice Hall, UK London.
10. Davies ER (2005) Machine Vision: Theory, Algorithms, Practicalities, 3rd Ed. Elsevier, Amsterdam.
11. Anwar S, Hayder Z, Porikli F (2021) Deblur and deep depth from single defocus image. Machine Vision and Applications 32(1): #34 (DOI: https://doi.org/10.1007/s00138-020-01162-6).
12. Gladines J, Sels S, Hillen M, et al. (2022) A continuous motion shape-from-focus method for geometry measurement during 3D printing. Sensors 22(24): #9805 (DOI: https://doi.org/10.3390/s22249805).
13. Shapiro L, Stockman G (2001) Computer Vision. Prentice Hall, UK London.
14. Ward GJ (1992) Measuring and modeling anisotropic reflection. Proceedings of the 19th
Annual Conference on Computer Graphics and Interactive Techniques 265-272.
15. Oren M, Nayar SK (1995) Generalization of the Lambertian model and implications for
machine vision. International Journal of Computer Vision 14(3): 227-251.
16. Wang GH, Han JQ, Zhang XM, et al. (2011) A new shape from shading algorithm for hybrid
surfaces. Journal of Astronautics 32(5): 1124-1129.
17. Zhao HK (2005) A fast sweeping method for Eikonal equations. Mathematics of Computation 74(250): 603-627.
18. Shu CW (2007) High order numerical methods for time dependent Hamilton-Jacobi equations.
World Scientific Publishing, Singapore.
19. Tozza S, Mecca R, Duocastella M, et al. (2016) Direct differential photometric stereo shape
recovery of diffuse and specular surfaces. Journal of Mathematical Imaging and Vision 56(1):
57-76.
20. Wang GH, Zhang X (2021) Fast shape-from-shading algorithm for 3D reconstruction of hybrid
surfaces under perspective projection. Acta Optica Sinica 41(12): 1215003 (1-9).
21. Wang GH, Han JQ, Jia HH, et al. (2009) Fast viscosity solutions for shape from shading under a
more realistic imaging model. Optical Engineering 48(11): 117201.
22. Wang GH, Han JQ, Zhang XM (2009) Three-dimensional reconstruction of endoscope images
by a fast shape from shading method. Measurement Science and Technology 20(12): 125801.
Chapter 9
Generalized Matching
Vision can be considered to include two aspects: "seeing" and "perception." On the one hand, "seeing" should be purposeful "seeing"; that is, according to certain knowledge (including the description of the object and the interpretation of the scene), the object should be found in the scene with the help of images. On the other hand, "perception" should be "perception" with cognition; that is, the characteristics of the scene are extracted from the input image and then matched with existing scene models, so as to achieve the purpose of understanding the meaning of the scene.
comparable to the matching in the object space. In the case of occlusion, the
smoothness assumption will be affected, and the image matching algorithm will
encounter difficulties.
Image matching algorithms can be further classified according to the image representation model they use.
1. Raster-based matching
Raster-based matching uses a raster representation of the image; that is, it attempts to find a mapping function between image regions by directly comparing gray levels or grayscale functions. This class of methods can be highly accurate but is sensitive to occlusion.
2. Feature-based matching
In feature-based matching, a symbolic description of the image is first formed from salient features extracted with feature extraction operators, and then the corresponding features of different images are searched for based on assumptions about the local geometric properties of the object to be described and on the geometric mapping. These methods are more suitable than raster-based methods for situations with surface discontinuities and approximate data.
3. Relationship-based matching
Relation matching, also known as structural matching, is based on the similarity of topological relationships (topological properties do not change under perspective transformation) between features; these similarities exist in feature adjacency graphs rather than in grayscale or point distributions. Matching of relational descriptions can be applied in many situations, but it may generate a very complex search tree, so its computational complexity may be very large.
Based on the template matching theory in Subsection 5.1.1, it is believed that in order to recognize the content of an image, one must have its "memory trace" or basic model from past experience, which is also called a "mask." If the present stimulus matches the mask in the brain, one can tell what the stimulus is. However, template matching theory requires the external stimulus to exactly match the template, whereas in real life people can recognize not only images that are consistent with the basic pattern but also images that do not completely conform to it.
Gestalt psychologists came up with the prototype matching theory. This theory
holds that a presently observed image of the letter “A,” no matter what shape it is or
where it is placed, bears a resemblance to an “A” known to have been perceived in
the past. Humans do not store countless templates of different shapes in long-term
memory but use the similarity abstracted from various images as prototypes to test
the images to be recognized. If a prototype resemblance can be found from the image
to be recognized, then the recognition of the image is achieved. This model of image
cognition is more plausible than template matching in terms of both neurological processes and memory search, and it can also account for the recognition of images that are irregular but similar to the prototype in some respects. According to this model, an idealized prototype of the letter "A" can be formed, which summarizes the common characteristics of the various images similar to this prototype. On this basis, it becomes possible to recognize, by matching, all other "A"s that are not identical to the prototype but only similar to it.
Although prototype matching theory can more reasonably explain some phenomena in image cognition, it does not explain how humans discriminate and process similar stimuli. Prototype matching theory does not give a clear image recognition model or mechanism, and it is difficult to realize in computer programs. Further research remains a topic for vision science and computer vision.
Although matching theory is not perfect, the matching task still needs to be completed, so evaluation criteria for matching are needed. In turn, research on matching evaluation criteria will also promote the development of matching theory. Commonly used image matching evaluation criteria mainly include accuracy, reliability, robustness, and computational complexity [5].
Accuracy refers to the difference between the true value and the estimated value.
The smaller the difference, the more accurate the estimate. In image-level matching,
accuracy can refer to statistics such as the mean, median, maximum, or root mean
square of the distance between two image points to be matched (or a reference image
point and a matching image point). Accuracy can also be measured from synthetic or
simulated images when the correspondence has been determined; another approach
is to place fiducial markers in the scene and use the location of the fiducial markers to
evaluate the accuracy of the match. The unit of accuracy is often pixels or voxels.
Reliability refers to how many times the matching algorithm has achieved
satisfactory results in a total of multiple tests. If N pairs of images are tested, wherein
M tests give satisfactory results, when N is large enough and N pairs of images are
representative, then M/N represents reliability. The closer M/N is to 1, the more
reliable it is. In this sense, the reliability of the algorithm is predictable.
Robustness refers to the stability of accuracy or the reliability of an algorithm
under different changes in its parameters. Robustness can be measured in terms of
noise, density, geometric differences, percentage of dissimilar regions between
images, etc. The robustness of an algorithm can be obtained by determining how
stable the algorithm’s accuracy is or how reliable it is when the input parameters
change (e.g., by using their variance, the smaller the variance, the more robust the
algorithm). If there are many input parameters, each of which may affect the
accuracy or reliability of the algorithm, then the robustness of the algorithm can be
defined with respect to each parameter. For example, an algorithm might be robust to
noise but not robust to geometric distortions. Saying that an algorithm is robust
generally means that the performance of the algorithm does not change significantly
as the parameters involved change.
Computational complexity determines the speed of an algorithm, indicating its
usefulness in a specific application. For example, in image-guided neurosurgery, the
images used to plan the surgery need to be matched within seconds to images that
reflect the conditions of the surgery at a particular time. However, matching the
aerial imagery acquired by the aircraft often needs to be completed in the order of
milliseconds. Computational complexity can be expressed as a function of image
size (considering the number of additions or multiplications required for each unit);
it is generally hoped that the computational complexity of a good matching algorithm is a linear function of the image size.
Image matching takes pixels as the unit, the amount of calculation is generally large,
and the matching efficiency is low. In practice, objects of interest are often detected
and extracted first, and then objects are matched. If a concise object representation is
used, the matching effort can be greatly reduced. Since the object can be represented
in different ways, the matching of the object can also take a variety of methods.
The effect of object matching should be judged by a certain measure, the core of
which is the similarity of the objects.
In an image, the object is composed of points (pixels), and the matching of two
objects is, in a certain sense, the matching between two sets of points. The method of
using Hausdorff distance (HD) to describe the similarity between point sets and
matching through feature point sets is widely used. Given two finite point sets
A = {a 1, a2, ..., am} and B = {b1, b2, ..., bn}, the Hausdorff distance between
them is defined as:
Among them, the function h(A, B) is called the directed Hausdorff distance from set A to set B; it is the largest, over all points a ∈ A, of the distance from a to its nearest point in B. Similarly, the function h(B, A) is called the directed Hausdorff distance from set B to set A, the largest, over all points b ∈ B, of the distance from b to its nearest point in A. Since h(A, B) and h(B, A) are not symmetrical, the maximum of the two is generally taken as the Hausdorff distance between the two point sets.
The geometric meaning of the Hausdorff distance can be explained as follows: if the Hausdorff distance between two point sets A and B is d, then for any point in either set, a circle of radius d centered on that point contains at least one point of the other set. If the Hausdorff distance between two point sets is 0, the two point sets coincide. In the schematic of Fig. 9.2, h(A, B) = d_{21} and h(B, A) = d_{22} = H(A, B).
The Hausdorff distance as defined above is sensitive to noise points or outliers of the point sets. A commonly used improvement adopts the concept of statistical averaging and replaces the maximum value with the average value; this is called the modified Hausdorff distance (MHD) [6], that is, Eq. (9.2) and Eq. (9.3) are changed to:

h_{\mathrm{MHD}}(A, B) = \frac{1}{N_A}\sum_{a \in A}\min_{b \in B}\|a - b\|    (9.4)

h_{\mathrm{MHD}}(B, A) = \frac{1}{N_B}\sum_{b \in B}\min_{a \in A}\|b - a\|    (9.5)

where N_A represents the number of points in point set A and N_B represents the number of points in point set B. Substituting them into Eq. (9.1), we get:

H_{\mathrm{MHD}}(A, B) = \max[h_{\mathrm{MHD}}(A, B), h_{\mathrm{MHD}}(B, A)]    (9.6)
When using the Hausdorff distance to calculate the correlation between images, it
does not require a clear point relationship between the two images; in other words, it
does not need to establish a one-to-one relationship of point correspondences
between the two point sets, which is one of its important advantages.
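The following NumPy sketch computes the directed distances and the symmetric HD/MHD values for two 2D point sets. It is a direct transcription of the definitions discussed above; the function names and the example point sets are illustrative.

```python
import numpy as np

def directed_h(A, B):
    """Directed Hausdorff distance h(A, B): for each point of A take the distance
    to its nearest point in B, then take the maximum over A."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise distances
    return d.min(axis=1).max()

def directed_h_mod(A, B):
    """Averaged (modified) directed distance: the mean replaces the maximum."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return d.min(axis=1).mean()

def hausdorff(A, B):
    return max(directed_h(A, B), directed_h(B, A))              # H(A, B)

def modified_hausdorff(A, B):
    return max(directed_h_mod(A, B), directed_h_mod(B, A))      # MHD

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.array([[0.1, 0.1], [1.2, 0.0], [0.0, 0.9], [3.0, 3.0]])  # one outlier point
print(hausdorff(A, B), modified_hausdorff(A, B))
```

The outlier (3, 3) dominates the ordinary Hausdorff distance but only mildly raises the modified value, which is exactly the noise sensitivity that the MHD is meant to alleviate.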
Objects can often be broken down into their individual components. Different
objects can have the same components but different structures. For structure
matching, most matching measures can be explained by the so-called “template
and spring” physical analogy model [7]. Considering that structure matching is the
matching between the reference structure and the structure to be matched, if the
reference structure is viewed as a structure depicted on a transparency, the matching
can be seen as moving the transparency on the structure to be matched and
deforming it to obtain a fit of the two structures.
Matching often involves similarities that can be quantitatively described. A
matching is not a simple correspondence, but a correspondence that is quantitatively
described according to a certain goodness index, and this goodness corresponds to
the matching measure. For example, the goodness of fit of two structures depends
both on how well the components of the two structures match each other and on the
amount of work required to deform the transparencies.
In practice, to achieve deformation, consider the model as a set of rigid templates
connected by springs, such as the template and spring model of a face as shown in
Fig. 9.3. Here the templates are connected by springs, and the spring functions
describe the relationship between the templates. The relationship between templates
generally has certain constraints. For example, in a face image, the two eyes are
generally on the same horizontal line, and the distance is always within a certain
range. The quality of the matching is a function of the goodness of the local fit of the
template and the energy required to elongate the spring to fit the structure to be
matched to the reference structure.
The matching measure of template and spring can be represented in the general form of Eq. (9.7), in which C_T represents the dissimilarity between the template d and the structure to be matched, C_S represents the dissimilarity between the structure to be matched and the object part e, C_M represents the penalty for missing parts, and F(·) is the mapping for
transformation of the reference structure template to the structure components to be
matched. F divides reference structures into two categories: structures that can be
found in the structures to be matched (belonging to set Y) and structures that are not
found in the structures to be matched (belonging to set N). Similarly, components
can also be divided into components that exist in the structure to be matched
(belonging to set E) and components that do not exist in the structure to be matched
(belonging to set M).
Normalization issues need to be considered in structure matching metrics because
the number of matched parts may affect the value of the final matching metric. For
example, if a “spring” always has a finite cost, such that the more elements matched,
the greater the total energy, that doesn’t mean that having more parts matched is
worse than having fewer parts. Conversely, delicate matching of a part of the
structure to be matched with a specific reference object often makes the remaining
part unmatched, and this “sub-match” is not as good as making most of the parts to
be matched closely matched. In Eq. (9.7), this is avoided by a penalty for missing
parts.
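As a deliberately simplified reading of this measure (Eq. (9.7) itself is not reproduced here), the sketch below sums a template cost over matched parts, a spring cost over pairs of matched parts, and a fixed penalty for each missing part; the cost functions, the dictionary-based mapping F, and the pairwise spring summation are illustrative assumptions rather than the book's exact notation.

```python
from itertools import combinations

def template_spring_cost(F, templates, cost_T, cost_S, missing_penalty):
    """Simplified template-and-spring matching measure.

    F               : dict mapping each matched reference template d to an image part e;
                      templates absent from F are treated as missing
    templates       : list of all reference templates
    cost_T(d, e)    : local dissimilarity between template d and part e
    cost_S(e1, e2)  : 'spring' cost between the parts assigned to two templates
    missing_penalty : fixed cost charged for every unmatched template
    """
    matched = [d for d in templates if d in F]
    c = sum(cost_T(d, F[d]) for d in matched)                               # template terms
    c += sum(cost_S(F[d1], F[d2]) for d1, d2 in combinations(matched, 2))   # spring terms
    c += missing_penalty * (len(templates) - len(matched))                  # missing parts
    return c
```

The explicit missing-part penalty is what keeps a small but tight partial match from automatically beating a broader match, as discussed above.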
The matching between two objects (or a model and an object) can be done by means
of the correspondence between them when there are feature points (see Sect. 5.2) or
specific landmark points (see [8]) on the object. In 2D space, if these feature points or
landmark points are different from each other (have different properties), matching
can be done according to two pairs of corresponding points. If these landmark points
or feature points are the same as each other (have the same attributes), at least three
noncollinear corresponding points (three points must be coplanar) need to be
determined on each of the two 2D objects.
In 3D space, if perspective projection is used, since any set of three points can
match any other set of three points, the correspondence between the two sets of
points cannot be determined at this point. Whereas if a weak perspective projection
is used, the matching is much less ambiguous.
Consider a simple case. Suppose a set of three points P1, P2, and P3 on the object
are on the same circle, as shown in Fig. 9.4a. Suppose the center of gravity of the
triangle is C, and the straight lines connecting C and P1, P2, and P3 intersect the
circle at points Q1, Q2, and Q3, respectively. Under weak perspective projection
conditions, the distance ratio PiC:CQi remains unchanged after projection. In this
way, the circle will become an ellipse after projection (but the line will still be a line
after projection, and the distance ratio will not change), as shown in Fig. 9.4b. If P1, P2, and P3 can be observed in the image, C can be calculated, and then the positions of Q1, Q2, and Q3 can be determined. This leaves six points to determine the position and parameters of the ellipse (actually at least five points are required).
Once the ellipse is determined, the matching becomes an ellipse matching (see
Subsection 9.2.3).
If the distance ratio is calculated incorrectly, Qi will not fall on the circle, as shown in Fig. 9.4c. In this way, the ellipse passing through P1, P2, P3 and Q1, Q2, Q3 cannot be obtained after projection, and the above calculation becomes impossible.
For more general ambiguity cases, see Table 9.1, which gives the number of
solutions obtained when matching objects with corresponding points in the image in
each case. When the number of solutions is ≥ 2, there is ambiguity. All ambiguities occur when the points are coplanar, corresponding to perspective inversion. Any noncoplanar point (when there are more than three corresponding points) provides enough information to disambiguate. In Table 9.1, the two cases of coplanar points and noncoplanar points are considered, respectively, and perspective projection and weak perspective projection are also compared.
The matching between objects can also be done by means of their inertia equivalent
ellipses, which have been used in the registration work for 3D object reconstruction
from sequence images [9]. Unlike the matching based on the object contour, the
matching based on the equivalent ellipse of inertia is performed based on the entire
object region. For any object region, an inertia ellipse corresponding to it can be
obtained (e.g., see [8]). With the help of the inertia ellipse corresponding to the
object, an equivalent ellipse can be further calculated for each object. From the point
of view of object matching, since each object in the image pair to be matched can be
represented by its equivalent ellipse, the matching problem of the object can be
transformed into the matching of its equivalent ellipse. A schematic diagram is
shown in Fig. 9.5.
In general object matching, the main consideration is the deviation caused by
translation, rotation, and scale transformation, and the geometric parameters
corresponding to these transformations need to be obtained. For this purpose, the parameters required for translation, rotation, and scale transformation can be calculated from the center coordinates of the equivalent ellipse, the orientation angle (defined as the angle between the major axis of the ellipse and the positive X-axis), and the lengths of the major axes.
First consider the center coordinates (x_c, y_c) of the equivalent ellipse, that is, the barycentric coordinates of the object. Assuming that the object region contains N pixels in total, then:

x_c = \frac{1}{N}\sum_{(x, y) \in \text{object}} x    (9.8)

y_c = \frac{1}{N}\sum_{(x, y) \in \text{object}} y    (9.9)
The translation parameter can be calculated according to the difference of the center
coordinates of the two equivalent ellipses. Secondly, the orientation angle \phi of the equivalent ellipse can be obtained from the slopes k and l of the two main axes of the corresponding inertia ellipse (let A be the moment of inertia of the object rotating around the X-axis and B be the moment of inertia of the object rotating around the Y-axis).
The rotation parameter can be calculated from the difference in the orientation angle
of the two ellipses. Finally, the two semi-axis lengths (a and b) of the equivalent ellipse reflect the object size. If the object itself is an ellipse, it is identical to its equivalent ellipse. In general, the equivalent ellipse of the object approximates the object in terms of moment of inertia and area (but not both exactly at the same time). Here, the axis length needs to be normalized by the object area M. After normalization, when A < B, the length a of the semi-major axis of the equivalent ellipse can be calculated from A, B, and the product of inertia H.
The scaling parameter can be calculated from the length ratio of the major axes of the two ellipses. The three transformation parameters of the geometric correction required for matching the two objects can be calculated independently, so each transformation in the equivalent ellipse matching can be performed sequentially [10].
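The quantities used above can be estimated from a binary object mask with standard image moments, as in the following sketch; the particular normalization of the axis lengths is one common variant and need not coincide exactly with the book's area-based normalization.

```python
import numpy as np

def equivalent_ellipse(mask):
    """Centroid, orientation angle, and semi-axes of the inertia-equivalent ellipse
    of a binary region, computed from (per-pixel) central second moments."""
    ys, xs = np.nonzero(mask)
    N = xs.size
    xc, yc = xs.sum() / N, ys.sum() / N                 # centroid, Eqs. (9.8)-(9.9)
    x, y = xs - xc, ys - yc
    mu20, mu02, mu11 = (x * x).mean(), (y * y).mean(), (x * y).mean()
    phi = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)       # orientation of the major axis
    common = np.sqrt((mu20 - mu02) ** 2 + 4 * mu11 ** 2)
    a = np.sqrt(2 * (mu20 + mu02 + common))             # semi-major axis
    b = np.sqrt(2 * (mu20 + mu02 - common))             # semi-minor axis
    return (xc, yc), phi, (a, b)

# Example: an axis-aligned filled rectangle
mask = np.zeros((60, 80), dtype=bool)
mask[20:40, 10:70] = True
print(equivalent_ellipse(mask))
```

The translation, rotation, and scaling parameters between two objects then follow from the differences of the centroids, the orientation angles, and the axis lengths of their equivalent ellipses, as described above.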
Matching with inertia equivalent ellipse is more suitable for matching irregular
objects. Figure 9.6 shows an example of matching two adjacent cell profile images in
the process of reconstructing 3D cells from sequential medical profile images.
Figure 9.6a shows two cross-sectional views of the same cell on two adjacent
sections. The size and shape of the two cell sections, as well as their position and
orientation in the image, are different due to the effects of translation and rotation
during sectioning. Considering that the changes in the structures inside and around
the cell are large, matching only considering the contour does not work very well.
Figure 9.6b shows the result obtained after calculating the equivalent ellipses of the
cell profiles and then matching them. It can be seen that the positions and
orientations of the two cell profiles are reasonably matched, which lays a solid
foundation for the subsequent 3D reconstruction.
In this way, all profiles on a sequence of sections can be registered (Fig. 9.7 only takes one profile as an example). This strategy essentially relies on spatial relationships [12] for matching.
Referring to the flow diagram in Fig. 9.7, it can be seen that there are six main
steps in dynamic pattern matching:
1. Select a matched profile from the matched section.
2. Construct the pattern representation of the selected matched profile.
3. Determine the candidate region of the to-be-matched profile (a priori knowledge
can be used to reduce the amount of computation and ambiguity).
4. Select the profile to-be-matched in the candidate region.
5. Construct the pattern representation of each selected profile to-be-matched.
6. Use the similarity between profile patterns to determine the correspondence between profiles.
Since the distribution of cell profiles on a section is not uniform, in order to complete the above matching steps, it is necessary to dynamically establish a pattern representation for each profile that can be used for matching. Here, the unique pattern of a profile can be constructed by using the relative positional relationship of the profile to its several adjacent profiles. The pattern thus constructed can be represented by a pattern vector. Assuming that the relationship used is the length and orientation of the line between each profile and its adjacent profiles (or the included angle between the lines), then the two profile patterns P_l and P_r that need to be matched on two adjacent sections (both using vector representation) can be written as:

P_l = [x_{l0}, y_{l0}, d_{l1}, \theta_{l1}, \ldots, d_{lm}, \theta_{lm}]^{\mathrm{T}}    (9.12)

P_r = [x_{r0}, y_{r0}, d_{r1}, \theta_{r1}, \ldots, d_{rn}, \theta_{rn}]^{\mathrm{T}}    (9.13)
In these equations, (x_{l0}, y_{l0}) and (x_{r0}, y_{r0}) are the coordinates of the two central (matching) profiles, respectively; each d represents the length of the connecting line between another profile on the same section and the matching profile; and each \theta represents the angle between the connecting lines from the matching profile to two adjacent surrounding profiles. Note that m and n can be different here. When m differs from n, only a part of the points constructing the patterns is used for matching. In
from n, a part of the points constructing patterns can also be selected for matching. In
addition, the selection of m and n should be the result of the balance between the
amount of calculation and the uniqueness of the pattern, and the specific value can be
adjusted by determining the radius of the pattern (i.e., the largest d, such as d2 in
Fig. 9.8a). The entire pattern can be seen as contained in a circle with a defined radius
of action.
In order to match the profiles, the corresponding patterns need to be translated and
rotated. The pattern constructed above can be called an absolute pattern because it contains the absolute coordinates of the central profile. Figure 9.8a gives an example of P_l. The absolute pattern has rotation invariance about the origin (the central profile); that is, after the entire pattern is rotated, d and \theta remain unchanged. However, as can be seen from Fig. 9.8b, it does not have translation invariance, because after the entire pattern is translated, both x_0 and y_0 change.
In order to obtain translation invariance, the coordinates of the center point in the
absolute pattern can be removed, and the relative pattern can be constructed as
follows:
Q_l = [d_{l1}, \theta_{l1}, \ldots, d_{lm}, \theta_{lm}]^{\mathrm{T}}    (9.14)

Q_r = [d_{r1}, \theta_{r1}, \ldots, d_{rn}, \theta_{rn}]^{\mathrm{T}}    (9.15)
The relative pattern corresponding to the absolute pattern in Fig. 9.8a is shown in
Fig. 9.9a.
It can be seen from Fig. 9.9b that the relative pattern is not only rotationally invariant but also translationally invariant. In this way, two relative pattern representations can be matched by rotation and translation, and their similarity can be calculated, so as to achieve the purpose of matching profiles.
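A small sketch of this idea follows: the relative pattern of a profile is built from the distances and angles to its nearest neighboring profiles, and two patterns are compared after cyclically aligning their angular components. The neighbor count, the alignment-by-shift step, and the similarity score are illustrative choices, not the book's exact procedure.

```python
import numpy as np

def relative_pattern(center, points, k=4):
    """Relative pattern of a profile: (distance, angle) pairs to its k nearest
    neighbors, sorted by angle so the pattern does not depend on point order."""
    vec = points - center
    d = np.linalg.norm(vec, axis=1)
    idx = np.argsort(d)[1:k + 1]                   # skip the profile itself (distance 0)
    ang = np.arctan2(vec[idx, 1], vec[idx, 0])
    order = np.argsort(ang)
    return d[idx][order], ang[idx][order]

def pattern_distance(p1, p2):
    """Compare two relative patterns of equal size; all cyclic shifts of the second
    pattern are tried, as a crude stand-in for the rotation alignment in the text."""
    d1, a1 = p1
    d2, a2 = p2
    rel1 = np.angle(np.exp(1j * (a1 - a1[0])))     # angles relative to the first neighbor
    best = np.inf
    for s in range(len(d2)):
        d2s, a2s = np.roll(d2, s), np.roll(a2, s)
        rel2 = np.angle(np.exp(1j * (a2s - a2s[0])))
        cost = np.abs(d1 - d2s).sum() + np.abs(np.angle(np.exp(1j * (rel1 - rel2)))).sum()
        best = min(best, cost)
    return best
```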
Figure 9.10 shows the distribution of cell profiles on two adjacent medical
sections in practice [12], where each cell profile is represented by a dot. Since the
diameter of the cells is much larger than the thickness of the sections, many cells span multiple sections. In other words, there should be many corresponding cell profiles on adjacent sections. However, as can be seen from Fig. 9.10, the distribution of
points on each section is very different, and the number of points is also very
different; there are 112 in Fig. 9.10a and 137 in Fig. 9.10b. The reasons include
the following: some cell profiles in Fig. 9.10a are the last profiles of cells and do not
continue to extend to Fig. 9.10b, and some cell profiles in Fig. 9.10b are new
beginnings of cells and not continued from Fig. 9.10a.
Using dynamic pattern matching to match the cell profiles in these two sections
resulted in 104 profiles in Fig. 9.10a finding the correct corresponding profiles
(92.86%) in Fig. 9.10b, while there are eight profiles with errors (7.14%).
From the analysis of dynamic pattern matching, its main characteristics are as
follows: The pattern is established dynamically and the matching is completely
automatic. This method is quite general and flexible, and its basic idea can be
applied to a variety of applications [9].
Matching and registration are two closely related concepts with many technical
similarities. Many registration tasks are accomplished with the aid of various
matching techniques. A careful analysis, however, reveals some differences between the two. The meaning of registration is generally narrower, mainly referring to the establishment of correspondence between images obtained at different times or from different spatial positions, especially the geometric correspondence (geometric correction). The final effect is often reflected at the pixel level. Matching, in contrast, can consider the geometric properties of the image, the grayscale properties of the image, and even other abstract properties and attributes of the image. From this point of view, registration
can be seen as matching of lower-level representations, while generalized matching
can include registration. By the way, the main difference between image registration
and stereo matching introduced in Chaps. 4 and 5 is that the former needs to establish
the relationship between point pairs and calculate the coordinate transformation
between the two images from this correspondence. The latter only needs to establish
the correspondence between point pairs and then calculate the disparity/parallax for
each pair of points separately.
(the Fourier power spectrum can be used to calculate when the rotation and scale
change). The calculation of the phase correlation between the two images can be
carried out by means of phase estimation of the cross-power spectrum. Suppose two
images f1(x, y) and f2(x, y) have the following simple translation relationship in the
spatial domain:

f_2(x, y) = f_1(x - x_0, y - y_0)    (9.16)

According to the translation property of the Fourier transform, their transforms satisfy:

F_2(u, v) = F_1(u, v)\exp[-\mathrm{j}2\pi(ux_0 + vy_0)]    (9.17)

so the normalized cross-power spectrum is:

\frac{F_2(u, v)\,F_1^{*}(u, v)}{|F_2(u, v)\,F_1^{*}(u, v)|} = \exp[-\mathrm{j}2\pi(ux_0 + vy_0)]    (9.18)

where the inverse Fourier transform of exp[-j2\pi(ux_0 + vy_0)] is \delta(x - x_0, y - y_0). It can be seen that the relative translation of the two images f_1(x, y) and f_2(x, y) in space is (x_0, y_0). The amount of translation can be determined by searching for the location of the maximum value (the pulse) in the inverse-transform result.
The steps of the Fourier transform-based phase correlation algorithm are sum
marized below:
1. Calculate the Fourier transforms F1(u, v) and F2(u, v) of the two images f1(x, y)
and f2(x, y) to be registered.
2. Filter out the DC component and high-frequency noise in the spectrum, and
calculate the product of the spectrum components.
3. Calculate the normalized cross-power spectrum using Eq. (9.18).
4. Perform inverse Fourier transform on the normalized cross-power spectrum.
5. Search for the coordinates of the peak in the result; these give the relative translation. A minimal sketch of the whole procedure is given after this list.
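A compact NumPy sketch of the procedure (omitting the spectrum filtering of step 2 and assuming two equal-sized single-channel images) is given below; the recovered shift is defined modulo the image size.

```python
import numpy as np

def phase_correlation(f1, f2):
    """Estimate the integer translation between two equal-sized images with the
    Fourier-transform-based phase correlation method (steps 1 and 3-5 above)."""
    F1, F2 = np.fft.fft2(f1), np.fft.fft2(f2)
    cross = F1 * np.conj(F2)                        # cross-power spectrum
    cross /= np.maximum(np.abs(cross), 1e-12)       # keep only the phase
    corr = np.real(np.fft.ifft2(cross))             # impulse at the relative shift
    return np.unravel_index(np.argmax(corr), corr.shape)

# Shift a random image by (5, 12) and recover the shift
rng = np.random.default_rng(0)
f1 = rng.random((128, 128))
f2 = np.roll(f1, shift=(5, 12), axis=(0, 1))
print(phase_correlation(f2, f1))                    # -> (5, 12)
```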
The calculation amount of the abovementioned registration method is only related
to the size of the images and has nothing to do with the relative positions between the
images or whether they overlap. The method only utilizes the phase information in
the cross-power spectrum, which is simple to calculate and insensitive to the
brightness change between images and can effectively overcome the influence of
illumination changes. Since the obtained correlation peaks are relatively sharp and
prominent, higher registration accuracy can be obtained.
closest Euclidean distance to the second closest Euclidean distance is large (greater
than a threshold). Generally, the smaller the distance of the first matching point pair, the better the matching quality. Next, the average distance over all matching point pairs is calculated and subtracted from each nearest Euclidean distance. If the result is negative, the matching point pair is kept, and RANSAC is then performed to finally select the true matching point pairs.
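The selection rule described in this paragraph can be sketched as follows. The descriptor arrays, the ratio threshold, and the final robust-estimation step are illustrative assumptions; the text does not fix particular values or a particular model estimator.

```python
import numpy as np

def select_matches(desc1, desc2, ratio=1.5):
    """Candidate match selection: a nearest/second-nearest ratio test followed by
    keeping only pairs whose nearest distance lies below the average distance."""
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    two_nearest = np.argsort(d, axis=1)[:, :2]
    matches = []
    for i, (j1, j2) in enumerate(two_nearest):
        if d[i, j2] / max(d[i, j1], 1e-12) >= ratio:     # second neighbor much farther
            matches.append((i, j1, d[i, j1]))
    if not matches:
        return []
    mean_d = np.mean([dist for _, _, dist in matches])
    kept = [(i, j) for i, j, dist in matches if dist - mean_d < 0]  # below-average only
    # A robust estimator such as RANSAC on a geometric model would then be applied
    # to the kept pairs to select the final true matches.
    return kept
```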
Scene images usually contain many objects. The representation of image space
relationship is to describe the geometric relationship of image objects in Euclidean
space. Due to the complexity of the real world and the randomness of scene
shooting, the imaging of the same object on different images can vary significantly.
It is difficult to accurately match images by relying only on an overall representation of the image to calculate image similarity. On the other hand, when
the same object is imaged in different images, its imaging morphology changes
significantly, but its spatial relationship to neighboring objects generally remains
stable (see Sect. 9.3). A method to solve the whole image matching problem by
inferentially analyzing the spatial proximity of objects in the image exploits this
fact [15].
The flowchart of the image matching algorithm is shown in Fig. 9.12. First, for
image pairs from the scene, object blocks are detected and depth features are
extracted for matching, thereby determining the spatial proximity of objects and
constructing a spatial proximity map of the objects in the scene. Then, based on the
constructed spatial proximity map, the spatial proximity relationship of the objects in
the image is analyzed, and the proximity of the image pair is quantitatively calcu
lated. Finally, find matching images.
Some details are as follows:
1. In order to extract the deep features of the object block, an object block feature
extraction network based on the contrast mechanism is constructed. The network
contains two identical channels with shared weights. Each channel is a deep
convolutional network with seven convolutional layers and two fully connected
layers. Based on the depth features, the same object blocks in the two images are
matched.
2. The spatial proximity map of the scene objects is constructed, and the spatial
proximity relationship of different objects in the scene is analyzed by reasoning
according to the distribution of each object on the prior image. The construction
process is an iterative search process that includes an initialization step and an
update step [15]. The constructed spatial proximity map summarizes all objects
present in the scene and quantitatively represents the proximity between different
objects, where the same object blocks on different images are aggregated in the
same node.
3. In order to determine the matched image, the nodes of the object in the image are
searched in the spatial proximity graph, and the proximity relationship between
the objects in the image is determined according to the connection weight
between the nodes. Each test image may include several object blocks, and
their belonging nodes can be searched in the node set.
4. In order to calculate the spatial proximity of image pairs, it is necessary to detect
the object blocks contained in the image and determine the nodes to which each
object block belongs to form a node set. The connection weights between the
belonging nodes represent the proximity relationship between the object blocks in the image. The spatial relationship between two images can be represented by the proximity relationship of the object blocks in the images, and the spatial relationship matching of the images is completed by quantitatively calculating the spatial proximity of the images, as sketched below.
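One way to read steps 3 and 4 in code is sketched below: each image is reduced to the set of graph nodes its object blocks belong to, and the proximity of an image pair is accumulated from the connection weights between the two node sets. The adjacency-dictionary layout and the summation rule are illustrative assumptions about details that reference [15] specifies.

```python
def image_pair_proximity(nodes_a, nodes_b, proximity_graph):
    """Quantitative spatial proximity of an image pair.

    nodes_a, nodes_b : sets of spatial-proximity-graph nodes to which the object
                       blocks detected in the two images belong
    proximity_graph  : dict mapping frozenset({n1, n2}) to a connection weight
    """
    score = 0.0
    for na in nodes_a:
        for nb in nodes_b:
            if na == nb:
                score += 1.0                       # the same object seen in both images
            else:
                score += proximity_graph.get(frozenset((na, nb)), 0.0)
    return score

graph = {frozenset(("desk", "chair")): 0.8, frozenset(("desk", "lamp")): 0.5}
print(image_pair_proximity({"desk", "lamp"}, {"chair", "desk"}, graph))
```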
The objective scene can be decomposed into multiple objects, and each object can be
decomposed into multiple components/parts, and there are different relationships
between them. The images collected from the objective scene can be represented by
the collection of various interrelationships between objects, so relationship matching is an important step in image understanding. Similarly, the object in the image can be represented by the set of interrelationships among its various parts, so the object can also be identified by using relationship matching. The two representations to be matched in relationship matching are relations; one of them is usually called the object to-be-matched, and the other is called the model.
The main steps of relationship matching are described below. Here we consider
the case that an object to-be-matched is given, and it is required to find a model that
matches it. There are two relation sets: Xl and Xr, where Xl belongs to the object to
be-matched and Xr belongs to the model; they are respectively represented as:
In the equations, R_{l1}, R_{l2}, ..., R_{lm} and R_{r1}, R_{r2}, ..., R_{rn} represent the representations of the different relationships between the components in the object to-be-matched and in the model, respectively.
For example, Fig. 9.13a shows a schematic representation of an object in an image (think of a front view of a table). It has three elements, which can be expressed as Q_l = {A, B, C}, and the set of relations between these elements can be expressed as X_l = {R1, R2, R3}. Among them, R1 represents the connection relationship, R1 = {(A, B), (A, C)}; R2 represents the upper-lower relationship, R2 = {(A, B), (A, C)}; and R3 represents the left-right relationship, R3 = {(B, C)}. Figure 9.13b gives a schematic diagram of the object in another image (it can be seen as a front view of a table with a middle drawer), which has four elements and can be represented as Q_r = {1, 2, 3, 4}. The set of relationships between the elements can be expressed as X_r = {R1, R2, R3}. Among them, R1 represents the connection relationship, R1 = {(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)}; R2 represents the upper-lower relationship, R2 = {(1, 2), (1, 3), (1, 4)}; and R3 represents the left-right relationship, R3 = {(2, 3), (2, 4), (4, 3)}.
Now consider the distance between X_l and X_r, denoted dis(X_l, X_r). dis(X_l, X_r) is composed of the differences of the corresponding items expressed by each pair of corresponding relations in X_l and X_r, namely, of each dis(R_l, R_r). The matching of X_l and X_r is thus the matching of each pair of corresponding relations in the two sets. The following first considers one such pair and uses R_l and R_r to represent the corresponding relations, respectively.
The distance between the two relational expressions R_l and R_r is defined as the weighted sum of the errors in Eq. (9.25) (the effects of the errors are weighted, with weights W).
From the previous analysis, to match R_l and R_r, we should try to find a corresponding mapping that minimizes the error (i.e., the distance) between R_l and R_r. Note that E is a function of p, so the required corresponding mapping p is the one that minimizes this error.
Going further back to Eq. (9.19) and Eq. (9.20), to match two relation sets X_l and X_r, a series of corresponding mappings p_j should be found such that:

\mathrm{dis}_C(X_l, X_r) = \inf_{p_j}\left\{\sum_{j=1}^{m} W_j\, C[E_j(p_j)]\right\}    (9.29)
R_r = \{(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)\}

When there is no element 4 in Q_r, R_r = {(1, 2), (1, 3)}, which gives p = {(A, 1), (B, 2), (C, 3)}, p^{-1} = {(1, A), (2, B), (3, C)}, R_l ⊕ p = {(1, 2), (1, 3)}, and R_r ⊕ p^{-1} = {(A, B), (A, C)}. The four errors in Eq. (9.25) can be computed in the same way; for the full R_r above they include, for example:

E_4 = \{R_l - (R_r \oplus p^{-1})\} = \{(A, B), (A, C)\} - \{(A, B), (A, C)\} = \varnothing

E_3 = \{(B, A), (C, A)\} - \{(A, B), (A, C)\} = \varnothing

E_2 = \{(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)\} - \{(2, 4), (3, 4)\} = \{(1, 2), (1, 3), (1, 4)\}

If only the connection relationship is considered, the order of the two elements in each pair can be exchanged. From the above results, dis(R_l, R_r) = {(1, 2), (1, 3), (1, 4)}. In terms of error terms, C(E_1) = 0, C(E_2) = 3, C(E_3) = 0, C(E_4) = 0, so dis_C(R_l, R_r) = 3.
Matching is to use the model stored in the computer to identify unknown patterns
in the object to be matched, so after finding a series of corresponding mappings pj, it
is necessary to determine their corresponding models. Assuming that the object
to-be-recognized, X, defined by Eq. (9.19), can find a correspondence that conforms to Eq. (9.29) for each of the multiple models Y_1, Y_2, ..., Y_L (each represented as in Eq. (9.20)), and that these correspondences are p_1, p_2, ..., p_L, respectively. The distance dis_C(X, Y_q) after X is matched with each model according to its corresponding mapping can then be obtained. If, for model Y_q, its distance from X satisfies:

\mathrm{dis}_C(X, Y_q) = \min_{1 \le i \le L} \mathrm{dis}_C(X, Y_i)    (9.30)

then for q ≤ L, X ∈ Y_q holds; that is, the object to-be-matched, X, is considered to match the model Y_q.
Summarizing the above discussion, it can be seen that the matching process can
be summarized into four steps.
1. Determine the same relationship (relationship between parts), that is, for a given
relationship in Xl, determine a relationship in Xr that is the same as it. This
requires m x n comparisons:
X_l = \begin{bmatrix} R_{l1} \\ R_{l2} \\ \vdots \\ R_{lm} \end{bmatrix} \quad\Longleftrightarrow\quad X_r = \begin{bmatrix} R_{r1} \\ R_{r2} \\ \vdots \\ R_{rn} \end{bmatrix}    (9.31)
2. Determine the corresponding mappings; that is, for each pair of same relations R_l and R_r, find the candidate mappings and their distances:

R_l \Rightarrow R_r: \quad p_1: \mathrm{dis}_C(R_l, R_r),\ \ p_2: \mathrm{dis}_C(R_l, R_r),\ \ \ldots,\ \ p_k: \mathrm{dis}_C(R_l, R_r)    (9.32)
3. Determine the distance between the two relation sets; that is, collect the distances of the corresponding relation pairs:

\mathrm{dis}_C(X_l, X_r) \Longleftarrow \{\mathrm{dis}_C(R_{l1}, R_{r1}),\ \mathrm{dis}_C(R_{l2}, R_{r2}),\ \ldots,\ \mathrm{dis}_C(R_{lm}, R_{rm})\}    (9.33)

Note that m ≤ n is assumed in the above equation; that is, only m pairs of relations can find correspondences, and the remaining n − m relations exist only in the relation set X_r.
4. Determine the model to which the object belongs (find the minimum among the L values of dis_C(X_l, X_r)):

X: \quad p_1 \Rightarrow Y_1 \Rightarrow \mathrm{dis}_C(X, Y_1),\ \ p_2 \Rightarrow Y_2 \Rightarrow \mathrm{dis}_C(X, Y_2),\ \ \ldots,\ \ p_L \Rightarrow Y_L \Rightarrow \mathrm{dis}_C(X, Y_L)    (9.34)

A small sketch of this set-difference-based distance computation is given below.
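The set-difference bookkeeping used in this section can be sketched as follows. The error terms are formed by mapping one relation through the correspondence p (or its inverse) and differencing with the other relation, and the cost is simply the number of unmatched tuples; this is one natural reading and does not reproduce the weighting of Eq. (9.29).

```python
def map_relation(R, p):
    """Apply the correspondence p (a dict) to every tuple of relation R; tuples with
    unmapped elements are dropped. Pairs are sorted, i.e., treated as unordered."""
    return {tuple(sorted(p[x] for x in t)) for t in R if all(x in p for x in t)}

def relation_distance(R_l, R_r, p):
    """Unweighted relational distance between R_l and R_r under the mapping p:
    the number of tuples that fail to correspond in either direction."""
    p_inv = {v: k for k, v in p.items()}
    Rl_n = {tuple(sorted(t)) for t in R_l}
    Rr_n = {tuple(sorted(t)) for t in R_r}
    E1 = map_relation(R_l, p) - Rr_n          # mapped object tuples absent from the model
    E2 = Rr_n - map_relation(R_l, p)          # model tuples not covered by the object
    E3 = map_relation(R_r, p_inv) - Rl_n
    E4 = Rl_n - map_relation(R_r, p_inv)
    return len(E1) + len(E2) + len(E3) + len(E4)

R_l = {("A", "B"), ("A", "C")}
R_r = {(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)}
print(relation_distance(R_l, R_r, {"A": 1, "B": 2, "C": 3}))   # -> 3
```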
The following first introduces some basic concepts and definitions of graph theory.
Below, the elements in set V are represented by uppercase letters, and the
elements in set E are represented by lowercase letters. Generally, the edge
e formed by the unordered pair of vertices A and B is recorded as e ↔ AB or e ↔ BA, and A and B are called the endpoints of e; the edge e is said to join A and B. In this case, vertices A and B are incident with edge e, and edge e is incident with vertices A and B. Two vertices incident with the same edge are adjacent, as are two edges that share a common vertex. Two edges are called multiple edges or parallel edges if they have the same two endpoints. If the two endpoints of an edge are the same, it is called a loop; otherwise, it is called a link.
In the definition of a graph, the two elements (i.e., two vertices) of each unordered
pair can be the same or different, and any two unordered pairs (i.e., two edges) can be
the same or different. Different elements can be represented by vertices of different
colors, which is called the chromaticity of vertices (meaning that vertices are labeled
with different colors). Different relationships between elements can be represented
by edges of different colors, which is called edge chromaticity (meaning that edges
are marked with different colors). So a generalized colored graph G can be
expressed as:
where V is the vertex set, C is the vertex chromaticity set, E is the edge set, and S is
the edge chromaticity set. They are, respectively:
S = {sViVjlVi, Vj 2 V} (9:40)
Among them, each vertex can have a color, and each edge can also have a color.
If the vertices of the graph are represented by dots, and the edges are represented by
straight lines or curves connecting the vertices, the geometric representation or
geometric realization of the graph can be obtained. Graphs with edges greater than
or equal to 1 can have infinitely many geometric representations.
For example, suppose V(G) = {A, B, C} and E(G) = {a, b, c, d}, where a ↔ AB, b ↔ AB, c ↔ BC, d ↔ CC. The graph G can then be represented by the diagram given in Fig. 9.14.
In Fig. 9.14, edges a, b, and c are adjacent to each other, and edges c and d are
adjacent to each other, but edges a and b are not adjacent to edge d. Likewise,
vertices A and B are adjacent, and vertices B and C are adjacent, but vertices A and
C are not adjacent. In terms of edge types, edges a and b are multiple edges, edge d is
a loop, and edges a, b, and c are links.
According to the geometric representation of the graph introduced above, the two
objects in Fig. 9.13 can be represented by two colored graphs as shown in Fig. 9.15,
where the vertex chromaticity is distinguished by the vertex shape, and the edge
chromaticity is distinguished by the line type. It can be seen that the information
reflected by the colored map is more comprehensive and intuitive.
For two graphs G and H, if V(H) ⊆ V(G) and E(H) ⊆ E(G), then graph H is called a subgraph of graph G, denoted H ⊆ G. Conversely, graph G is called a supergraph of graph H. If graph H is a subgraph of graph G but H ≠ G, then graph H is called a proper subgraph of graph G, and graph G is called a proper supergraph of graph H [16].
If H ⊆ G and V(H) = V(G), then graph H is called a spanning subgraph of graph G, and graph G is called a spanning supergraph of graph H. For example, Fig. 9.16a gives a graph G, while Fig. 9.16b, Fig. 9.16c, and Fig. 9.16d each give a spanning subgraph of graph G (they are all spanning subgraphs of G but distinct from each other).
If all double edges and loops are removed from a graph G, the resulting simple
spanning subgraph is called the underlying simple graph of the graph G. The three
spanning subgraphs given in Fig. 9.16b-d have only one underlying simple graph,
Fig. 9.16d. The four operations to obtain the underlying simple graph are described
below with the help of the graph G given in Fig. 9.17a.
1. For a non-empty vertex subset V′(G) ⊆ V(G) of graph G, the subgraph of graph G that takes V′(G) as its vertex set and takes all edges with both endpoints in V′(G) as its edge set is called the induced subgraph of graph G, denoted G[V′(G)] or G[V′]. Figure 9.17b gives the graph G[A, B, C] = G[a, b, c].
2. Similarly, for a non-empty edge subset E′(G) ⊆ E(G) of graph G, the subgraph of graph G that takes E′(G) as its edge set and takes all the endpoints of these edges as its vertex set is called the edge-induced subgraph of graph G, denoted G[E′(G)] or G[E′]. Figure 9.17c gives the graph G[a, d] = G[A, B, D].
3. For a non-empty proper vertex subset V′(G) ⊂ V(G) of graph G, the subgraph of graph G whose vertex set is obtained by removing V′(G) from V(G), and whose edge set is obtained by removing all edges incident with vertices in V′(G), is called the remaining subgraph of graph G, denoted G − V′. Here G − V′ = G[V \ V′]. Figure 9.17d gives the graph G − {A, D} = G[B, C] = G[{A, B, C, D} − {A, D}].
4. For a non-empty proper edge subset E′(G) ⊂ E(G) of graph G, the subgraph of graph G whose edge set is obtained by removing E′(G) from E(G) is the spanning subgraph G − E′ of graph G. Note here that G − E′ and G[E \ E′] have the same set of edges, but they are not necessarily identical: the former is always a spanning subgraph, while the latter is not necessarily so. An example of the former is given in Fig. 9.17e, G − {c} = G[a, b, d, e]. An example of the latter is given in Fig. 9.17f, G[{a, b, c, d, e} − {a, b}] = G[c, d, e].
According to the definition of a graph, two graphs G and H are identical if and only if V(G) = V(H) and E(G) = E(H), and two identical graphs can be represented by the same geometric representation. For example, graphs G and H in Fig. 9.18 are identical. However, if two graphs can be represented
(Fig. 9.18 shows three graphs: G = [V, E], H = [V, E], and I = [V′, E′].)
by the same geometric representation, they are not necessarily identical. For example, the graphs G and I in Fig. 9.18 are not identical (the labels of the vertices and edges are different), although they can be represented by two geometric representations of the same shape.
For two graphs that have the same geometric representation but are not identical,
as long as the labels of the vertices and edges of one graph are appropriately
renamed, a graph identical to the other graph can be obtained; such two graphs are called isomorphic. In other words, an isomorphism of two graphs indicates that there is a one-to-one correspondence between the vertices and edges of the two graphs. The isomorphism of two graphs G and H can be written as G ≅ H, and the necessary and sufficient condition is that one-to-one mappings P: V(G) → V(H) and Q: E(G) → E(H) exist between V(G) and V(H) as well as between E(G) and E(H), respectively, and that the mappings P and Q maintain the incidence relationship, that is, Q(e) = P(A)P(B), ∀e ↔ AB ∈ E(G), as shown in Fig. 9.19.
It can be seen from the previous definition that isomorphic graphs have the same
structure, and the only difference is that the labels of vertices or edges are not exactly
the same. Graph isomorphism is more focused on describing mutual relationships, so
graph isomorphism can have no geometric requirements, that is, it is more abstract
(of course, it can also have geometric requirements, i.e., more specific). Graph
isomorphism matching is essentially a tree search problem, where different branches
represent heuristics on different combinations of correspondences.
Now consider several cases of graph-to-graph isomorphism. For the sake of simplicity, all graph vertices and edges are unlabeled here; that is, all vertices are considered to have the same color, and all edges also have the same color. For clarity, monochromatic line graphs (a special case of G) are used. If, for example, it is necessary to find a common object in two scenes, the task can be transformed into the problem of double-subgraph isomorphism.
into the problem of double-subgraph isomorphism.
There are many algorithms for finding graph isomorphism. For example, each
graph to be determined can be converted into a certain standard form, so that
isomorphism can be easily determined. In addition, it is also possible to perform
an exhaustive search on the tree of possible matches between corresponding
vertices in the line graph, but this method requires a large amount of computation
when the number of vertices in the line graph is large.
A method that is less restrictive and converges faster than isomorphic methods
is association graph matching [17]. In associative graph matching, the graph is
defined as G = [V, P, R], where V represents the set of nodes, P represents the set
of unit predicates used for the nodes, and R represents the set of binary relations
between nodes. Here the predicate represents a statement that takes only one of
the two values TRUE or FALSE, and the binary relation describes the properties
that a pair of nodes has. Given two graphs, an associative graph can be
constructed. Associative graph matching is the matching between nodes and
nodes as well as binary relationships and binary relationships in two graphs.
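As a hedged illustration of how such an associative graph can be constructed (the data layout and helper names below are assumptions, not the formulation of [17]), each association node pairs one vertex from each graph whose unary predicates agree, and two association nodes are joined when the corresponding binary relations also agree. A large set of mutually connected association nodes (a large clique) then represents a set of mutually compatible correspondences.

# Sketch of association-graph construction from G1 = [V1, P1, R1] and G2 = [V2, P2, R2].
from itertools import combinations

def build_association_graph(v1, p1, r1, v2, p2, r2):
    # p1, p2: dict vertex -> set of TRUE predicates; r1, r2: dict (u, v) -> relation label
    nodes = [(a, b) for a in v1 for b in v2 if p1[a] == p2[b]]
    edges = set()
    for (a, b), (c, d) in combinations(nodes, 2):
        # join two association nodes when the binary relations between their members agree
        if a != c and b != d and r1.get((a, c)) == r2.get((b, d)):
            edges.add(((a, b), (c, d)))
    return nodes, edges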
When observing a 3D scene, what one sees is its (visible) surfaces, and when the 3D
scene is projected onto a 2D image, the individual surfaces form regions. The
boundaries of individual surfaces are shown as contours in the 2D image, and
these contours representing the object can form a line drawing of the object. For
relatively simple scenes, the line drawing can be labeled, that is, a 2D image with
contour labels can be used to represent the relationship between the various surfaces
of the 3D scene [18]. With the aid of such labels, 3D scenes can also be matched to
corresponding models in order to interpret the scene.
9.7.1.2 Limb
If a continuous surface in a 3D scene not only occludes a part of another surface but
also occludes other parts of itself (i.e., is self-occluding), and the change of the surface
normal direction is smooth, continuous, and perpendicular to the line of sight,
then the contour there is called a limb (often formed when a smooth
3D surface is viewed from the side). To represent a limb, two opposite arrows
"↔" can be added to the curve. The orientation of the 3D surface does not change
when travelling along the limb, whereas it changes
continuously when travelling in a direction that is not parallel to the limb.
A blade is a true (physical) edge of a 3D scene, while a limb is only an
(apparent) edge. When a blade or limb crosses the boundary or contour between
the occluding object surface and the occluded background surface, a jump edge with a
discontinuity in depth is created.
9.7.1.3 Crease

A crease is formed where two surfaces of the 3D scene meet at a (physical) edge; it is labeled "+" if the edge is convex and "-" if it is concave.
9.7.1.4 Mark
Marks are formed if parts of the 3D surface have different reflectivity. The marks are
not due to the 3D surface shape and can be labeled with an “M.”
9.7.1.5 Shadow
If a continuous surface in a 3D scene does not block a part of another surface from
the viewpoint but does block the light from the light source falling on that part, it
casts a shade (shadow) on that part of the second surface. Shades on surfaces are not
caused by the shape of the surface itself but are the result of other parts' effects on the
lighting. Shades can be marked with "S." There is a sudden change in lighting at the
shade boundary, called the lighting boundary.
Figure 9.21 gives examples of the contour labels introduced above. The
picture shows a hollow cylinder placed on a platform; there is a mark M on the
cylinder, and the cylinder casts a shade S on the platform. There are two limbs "↔"
on the two sides of the cylinder, and the top contour is divided into two parts by the
two limbs. The creases everywhere on the platform are convex, while the crease
between the platform and the cylinder is concave.
Next, we consider the inference analysis of the structure of the 3D object with the
help of the contour structure in the 2D image. Here, it is assumed that the surfaces of
the objects are all planes, and all the intersecting corners are formed by the intersection of three faces. Such a 3D object can be called a trihedral corner object,
such as the object represented by the two line drawings in Fig. 9.22. A small change
in viewpoint at this time will not cause a change in the topology of the line drawing,
that is, it will not cause the disappearance of faces, edges, and connections, and the
object is said to be in general position in this case.
The two line drawings in Fig. 9.22 are geometrically identical, but there are two
different 3D interpretations of them. The difference is that Fig. 9.22b labels three
more concave creases than Fig. 9.22a, so that the object in Fig. 9.22a appears to be
floating in the air, while the object in Fig. 9.22b appears to be attached to the
back wall.
In the drawings labeled only with {+, -, →}, "+" denotes a convex crease, "-" denotes
a concave crease, and "→" denotes an occluding edge. At this time, there
are four categories and 16 types of (topological) combinations of edge connections:
six types of L connections, four types of T connections, three types of arrow
connections, and three types of fork connections (Y connections),
as shown in Fig. 9.23.
If we consider all the ways of labeling vertices formed by the intersection of three faces,
there should be 64 labeling methods, but only the above 16 connection types are
reasonable. In other words, only line drawings that can be labeled with the 16 connection
types shown in Fig. 9.23 can physically exist. When a line drawing can be
labeled, its labeling provides a qualitative interpretation of the drawing.
Backtracking labeling assigns a label to each edge in turn and checks the consistency
of the new label against the labels already assigned. If the connection produced with the new label
contradicts or does not conform to the situations in Fig. 9.23, fall back and consider
another path; otherwise, continue to consider the next edge. If the labels assigned to
all the edges in this order are consistent, a labeling result is obtained (a complete path
to the leaves is obtained). Generally, more than one labeling result can be obtained
from the same line drawing, and it is necessary to use some additional information or
prior knowledge to obtain the final and unique judgment result.
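The following is a minimal sketch of this backtracking procedure. The catalog of legal junction labelings used here is a hypothetical, abbreviated stand-in for the 16 junction types of Fig. 9.23, and the label symbols are illustrative; a real implementation would enumerate the full catalog.

# Backtracking line-drawing labeling (illustrative sketch with an abbreviated catalog).
LABELS = ['+', '-', '>', '<']              # convex, concave, occluding (two directions)

LEGAL = {                                  # junction type -> allowed label tuples (abbreviated)
    'L':     {('>', '<'), ('<', '>'), ('+', '>'), ('<', '+'), ('-', '<'), ('>', '-')},
    'ARROW': {('>', '<', '+'), ('+', '+', '-'), ('-', '-', '+')},
    'FORK':  {('+', '+', '+'), ('-', '-', '-'), ('+', '-', '>')},
}

def label_drawing(junctions, edges):
    # junctions: dict name -> (type, ordered list of incident edge names)
    # edges: list of edge names, labeled one by one
    assignment, solutions = {}, []

    def consistent():
        for jtype, incident in junctions.values():
            labels = tuple(assignment.get(e) for e in incident)
            if None in labels:
                continue                   # junction not fully labeled yet
            if labels not in LEGAL[jtype]:
                return False
        return True

    def extend(i):
        if i == len(edges):
            solutions.append(dict(assignment))   # one complete, consistent labeling
            return
        for lab in LABELS:
            assignment[edges[i]] = lab
            if consistent():
                extend(i + 1)
            del assignment[edges[i]]       # backtrack and try another label
    extend(0)
    return solutions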
Now use the pyramid shown in Fig. 9.24 as an example to explain the basic steps
of backtracking labeling. The process of labeling with backtracking and the resulting
interpretation tree (including each step and final result) are shown in Table 9.2. In
Table 9.2, considering vertex A first, there are three cases that conform to Fig. 9.23.
For the first case according to Fig. 9.23, consider vertex B in turn. At this time,
among the three possible situations that may conform to Fig. 9.23, there are two
kinds of interpretations of edge AB that are inconsistent with Fig. 9.23 and one that
does not conform to Fig. 9.23 if we continue to consider vertex C. In other words,
there is no correct explanation for vertex A conforming to case 1 of Fig. 9.23. Next,
consider the second case according to Fig. 9.23, and so on.
As can be seen from all the interpretation trees in the table, there are three
complete paths (marked down to the leaves) that give three different interpretations
of the same line drawing. The search space of the whole explanation tree is quite
small, which indicates that the trihedral corner object has a rather strong constraint
mechanism.
9.8 Multimodal Image Matching
Generalized image matching aims to identify and correspond to the same or similar
relationship/structure/content from two or more images. Multimodal image
matching (MMIM) can be seen as a special case. Often, the images and/or objects to
be matched have significant nonlinear appearance differences, which are caused not
only by different imaging sensors but also by different imaging conditions (such as
day and night, weather changes, or across seasons) and input data types, such as
image-drawing-sketch and image-text.
The multimodal image matching problem can be formulated as: Given a reference
image IR and a matching image IM of different modalities, find the correspondence
between them (or the objects in them) according to their similarity. Objects can be
represented by the region they occupy or the features they have. Therefore, matching
techniques can be divided into region-based techniques and feature-based
techniques.
Region-based techniques take into account the intensity information of the object.
Two groups can be distinguished: the traditional group with handcrafted framework
and the recent group with learned framework.
A flowchart of the traditional region-based technique is shown in Fig. 9.25. It
includes three important modules: (i) metrics, (ii) transformation models, and (iii)
optimization methods [19].
9.8.1.1 Metrics
The accuracy of matching results depends on the metrics (matching criteria). Different metrics can be designed based on assumptions about the intensity relationship
between the two images. Commonly used manual metrics can be simply divided into
correlation-based methods and information theory-based methods.
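As a hedged illustration (assuming NumPy and gray-level patches of equal size; this is a sketch, not the implementation of any cited method), the following shows one metric of each family: normalized cross-correlation for the correlation-based group, and mutual information for the information theory-based group.

# Two region-based similarity metrics for gray-level patches a and b.
import numpy as np

def ncc(a, b):
    # normalized cross-correlation: invariant to linear intensity changes
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float((a * b).mean())

def mutual_information(a, b, bins=32):
    # joint intensity histogram -> joint and marginal probabilities
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])))

Mutual information only assumes a statistical (not linear) relationship between the two intensity distributions, which is why it is popular for multimodal matching.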
Transformation models can be divided into linear models and nonlinear models. The latter can be further divided into
physical models (derived from physical phenomena and represented by partial differential equations) and interpolation models (derived from interpolation or approximation theory).
Optimization methods are used to search for the best transformation based on a given
metric to achieve the desired matching accuracy and efficiency. Considering the
nature of the variables inferred by optimization methods, they can be simply divided
into continuous methods and discrete methods. Continuous optimization assumes
that the variables are real-valued and the objective function is differentiable;
discrete optimization assumes that the solution space is a discrete set.
The method classes and some typical technologies in each module are shown in
Table 9.3.
In recent years, deep learning techniques have been used to drive iterative
optimization processes or directly estimate geometric transformation parameters in
an end-to-end manner. The first type of methods is called deep iterative learning
methods, and the second type of methods is called deep transformation learning
methods. Depending on the training strategy, the latter can be roughly divided into
two categories: supervised methods and unsupervised methods. Table 9.4 lists some
typical methods in these three categories and their principles.
Table 9.3 The method classes and some typical technologies in each module

Metrics
  Correlation-based: cross-correlation [20]; normalized correlation coefficient (NCC) [21]
  Information theory-based: mutual information (MI) [22]; normalized mutual information (NMI) [23]; conditional mutual information (CMI) [24]

Transformation models
  Linear models: rigid-body, affine, and projective transformations [25]
  Nonlinear physical models: diffeomorphisms [26]; large deformation diffeomorphic metric mapping (LDDMM) [27]
  Nonlinear interpolation models: radial basis functions (RBF) [28]; thin plate splines (TPS) [29]; free-form deformation (FFD) [30]

Optimization methods
  Continuous methods: gradient descent [31]; conjugate gradient [31]
  Discrete methods: graph theory-based methods [32]; message passing [33]; linear programming [34]
Feature-based techniques usually have three steps: (1) feature detection, (2) feature
description, and (3) feature matching. In the feature detection and feature description
steps, the modal differences are suppressed, so the feature matching step can be done
well using the general approach. Depending on whether local image descriptors are
used or not, the feature matching step can be performed indirectly or directly. The
flowchart of the feature-based technique is shown in Fig. 9.26 [19].
The detected features usually represent specific semantic structures in images or the
real world. Commonly used features can be divided into corner features (the
intersection of two straight lines, usually located at textured regions or edges),
blob features (locally closed regions where pixels are considered similar and therefore distinct from surrounding neighborhoods), line/edge, and morphological region
features. The core idea of feature detection is to construct a response function to
distinguish different features, as well as flat and nonunique image regions.
Commonly used response functions can be further classified into gradient-based, intensity-based,
second-derivative-based, contour-curvature-based, region-segmentation-based, and learning-based functions.
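As one concrete example of a gradient-based response function, the following is a Harris-style corner response computed from the smoothed structure tensor. It is an illustrative sketch, not the specific detector of any method cited above; NumPy and SciPy are assumed to be available.

# Harris-style corner response (large positive values indicate corner-like regions).
import numpy as np
from scipy.ndimage import uniform_filter

def harris_response(img, k=0.04, window=3):
    img = img.astype(np.float64)
    gy, gx = np.gradient(img)                       # image gradients
    sxx = uniform_filter(gx * gx, size=window)      # smoothed structure-tensor entries
    syy = uniform_filter(gy * gy, size=window)
    sxy = uniform_filter(gx * gy, size=window)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace ** 2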
Deep learning has shown great potential for key-point detection, especially in two
images with significant appearance differences, which usually occurs in cross-modal
image matching. Three groups of commonly used convolutional neural network
(CNN)-based detectors are as follows: (i) supervised [43], (ii) self-supervised [44],
and (iii) unsupervised [45].
Feature description refers to the mapping of local intensities around feature points
into a stable and discriminative vector form, so that detected features can be matched
quickly and easily. Existing descriptors can be classified into floating-point descriptors, binary descriptors, and learnable descriptors according to the image cues used
(e.g., gradient, intensity) and the form of descriptor generation (e.g., comparison,
statistics, and learning). Floating-point descriptors are usually generated by gradient
or intensity cue-based statistical methods. The core idea of gradient statistics-based
descriptors is to compute the direction of the gradient to form a floating-point vector
for feature description. Binary descriptors are usually based on local intensity
comparison strategies. Learnable descriptors are deep descriptors with high-order
image cues or semantic information extracted in CNNs. These descriptors can be
further classified into gradient-based statistics, local intensity comparison, local
intensity order statistics, and learning-based descriptors [46].
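The following is a hedged sketch of a binary descriptor built from local intensity comparisons (in the spirit of BRIEF-like descriptors); the random sampling pattern and the descriptor length are illustrative assumptions, not the design of any specific published descriptor.

# Toy binary descriptor from local intensity comparisons; matching uses Hamming distance.
import numpy as np

def binary_descriptor(patch, n_bits=256, seed=0):
    # patch: square gray-level patch centered on the feature point
    rng = np.random.default_rng(seed)
    h, w = patch.shape
    pts = rng.integers(0, [h, w], size=(n_bits, 2, 2))     # n_bits pairs of (row, col) samples
    a = patch[pts[:, 0, 0], pts[:, 0, 1]]
    b = patch[pts[:, 1, 0], pts[:, 1, 1]]
    return (a < b).astype(np.uint8)                        # one intensity comparison per bit

def hamming_distance(d1, d2):
    return int(np.count_nonzero(d1 != d2))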
In the second stage, false matches are culled by imposing additional local and/or
global geometric constraints.
Random sample consensus (RANSAC) is a classical resampling-based
mismatch cancellation and parameter estimation method. Inspired by classical
RANSAC, a learning technique [47] that removes outliers and/or estimates model
parameters by training a deep regressor has been proposed to estimate the
transformed model. In addition to learning with multilayer perceptrons (MLPs),
graph convolutional networks (GCNs) [48] can also be used.
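To make the classical RANSAC idea concrete, the following is a minimal sketch (thresholds, iteration counts, and the choice of an affine model are illustrative assumptions) of culling false matches while estimating a transformation from putative correspondences src[i] -> dst[i].

# RANSAC sketch for robust affine estimation from putative correspondences.
import numpy as np

def estimate_affine(src, dst):
    # least-squares affine fit: dst ~ [x, y, 1] @ A, returned as a 2 x 3 matrix
    X = np.hstack([src, np.ones((len(src), 1))])
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return A.T

def ransac_affine(src, dst, n_iter=1000, thresh=3.0, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(src), size=3, replace=False)    # minimal sample
        A = estimate_affine(src[idx], dst[idx])
        pred = src @ A[:, :2].T + A[:, 2]
        inliers = np.linalg.norm(pred - dst, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        return None, best_inliers
    # refit the model on all inliers of the best hypothesis
    return estimate_affine(src[best_inliers], dst[best_inliers]), best_inliers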
References
1. Huang DW, Pettie S (2022) Approximate generalized matching: f-Matchings and f-Edge
covers. Algorithmica, 84(7): 1952-1992.
2. Zhang J, Wang X, Bai X, et al. (2022) Revisiting domain generalized stereo matching networks
from a feature consistency perspective. IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 12991-13001.
3. Liu B, Yu H, Qi G (2022) GraftNet: Towards domain generalized stereo matching with a broad
spectrum and task-oriented feature. IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 13002-13011.
4. Kropatsch WG, Bischof H (eds.) (2001) Digital Image Analysis—Selected Techniques and
Applications. Berlin: Springer.
5. Goshtasby AA (2005) 2-D and 3-D Image Registration—for Medical, Remote Sensing, and
Industrial Applications. USA. Hoboken: Wiley-Interscience.
6. Dubuisson M, Jain AK (1994) A modified Hausdorff distance for object matching. Proceedings
of the 12th ICPR, 566-568.
7. Ballard DH, Brown CM (1982) Computer Vision, UK London: Prentice-Hall.
8. Zhang Y-J (2017) Image Engineering, Vol. 3: Image understanding. De Gruyter, Germany.
9. Zhang Y-J (1991) 3-D image analysis system and megakaryocyte quantitation. Cytometry, 12:
308-315.
10. Zhang Y-J (1997) Ellipse matching and its application to 3D registration of serial cell images.
Journal of Image and Graphics 2(8,9): 574-577.
11. Zhang Y-J (1990) Automatic correspondence finding in deformed serial sections. Scientific
Computing and Automation (Europe) Chapter 5 (39-54).
12. Li Q, You X, Li K, et al. (2021) Spatial relation reasoning and representation for image
matching. Acta Geodaetica et Cartographica Sinica, 50(1): 117-131.
13. Lohmann G (1998) Volumetric Image Analysis. USA, Hoboken: John Wiley & Sons and
Teubner Publishers.
14. Lan CZ, Lu WJ, Yu JM, et al. (2021) Deep learning algorithm for feature matching of cross-modality remote sensing image. Acta Geodaetica et Cartographica Sinica 50(2): 189-202.
15. Li Q, You X, Li K, et al. (2021) Spatial relation reasoning and representation for image
matching. Acta Geodaetica et Cartographica Sinica 50(1): 117-131.
16. Sun HQ (2004) Graph Theory and Its Application. Beijing: Science Press.
17. Snyder WE, Qi H (2004) Machine Vision. UK: Cambridge University Press.
18. Shapiro L, Stockman G (2001) Computer Vision. UK London: Prentice Hall.
19. Jiang XY, Ma JY, Xiao GB, et al. (2021) A review of multimodal image matching: Methods and
applications. Information Fusion 73: 22-71.
20. Avants BB, Epstein CL, Grossmann M, et al. (2008) Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical Image Analysis 12(1): 26-41.
21. Luo J, Konofagou EE (2010) A fast normalized cross-correlation calculation method for motion
estimation, IEEE Trans. Ultrason. Ferroelectr. Freq. Control 57(6): 1347-1357.
22. Viola P, Wells III WM (1997) Alignment by maximization of mutual information, International
Journal of Computer Vision 24(2): 137-154.
23. Studholme C, Hill DLG, Hawkes DJ (1999) An overlap invariant entropy measure of 3D
medical image alignment. Pattern Recognition 32(1): 71-86.
24. Loeckx D, Slagmolen P, Maes F, et al. (2009) Nonrigid image registration using conditional
mutual information. IEEE Trans. Med. Imaging, 29(1): 19-29.
25. Zhang X, Yu FX, Karaman S, et al. (2017) Learning discriminative and transformation
covariant local feature detectors. Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition 6818-6826.
26. Trouve A (1998) Diffeomorphisms groups and pattern matching in image analysis. International Journal of Computer Vision 28(3): 213-221.
27. Marsland S, Twining CJ (2004) Constructing diffeomorphic representations for the groupwise
analysis of nonrigid registrations of medical images. IEEE Trans. Med. Imaging 23(8):
1006-1020.
28. Zagorchev L, Goshtasby A (2006) A comparative study of transformation functions for
nonrigid image registration. IEEE Trans. Image Process 15(3): 529-538.
29. Bookstein FL (1989) Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(6): 567-585.
30. Sederberg TW, Parry SR (1986) Free-form deformation of solid geometric models. Proceedings
of the 13th Annual Conference on Computer Graphics and Interactive Techniques 151-160.
31. Zhang Y-J (2021) Handbook of Image Engineering. Singapore: Springer Nature.
32. Ford Jr LR, Fulkerson DR. Flows in Networks. USA: Princeton University Press.
33. Pearl J (2014) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
The Netherlands: Elsevier.
34. Komodakis N, Tziritas G (2007) Approximate labeling via graph cuts based on linear programming. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(8): 1436-1453.
35. Cheng X, Zhang L, Zheng Y (2018) Deep similarity learning for multimodal medical images.
Computer Methods in Biomechanics and Biomedical Engineering: Imaging and Visualization
6(3): 248-252.
36. Blendowski M, Heinrich MP (2019) Combining MRF-based deformable registration and deep
binary 3D-CNN descriptors for large lung motion estimation in COPD patients. Int. J. Comput.
Assist. Radiol. Surg. 14(1): 43-52.
37. Liao R, Miao S, de Tournemirf P, et al. (2016) An artificial agent for robust image registration.
arXiv preprint, arXiv: 1611.10336.
38. Uzunova H, Wilms M, Handels H, et al. (2017) Training CNNs for image registration from few
samples with model-based data augmentation. Proceedings of International Conference on
Medical Image Computing and Computer-Assisted Intervention 223-231.
39. Hering A, Kuckertz S, Heldmann S, et al. (2019) Enhancing label-driven deep deformable
image registration with local distance metrics for state-of-the-art cardiac motion tracking.
Bildverarbeitung Fur Die Medizin 309-314.
40. Yan P, Xu S, Rastinehad AR, et al. (2018) Adversarial image registration with application for
MR and TRUS image fusion. International Workshop on Machine Learning in Medical Imaging
197-204.
41. Sun L, Zhang S (2018) Deformable MRI-ultrasound registration using 3D convolutional neural
network. Simulation, Image Processing, and Ultrasound Systems for Assisted Diagnosis and
Navigation 152-158.
42. Kori A, Krishnamurthi G (2019) Zero shot learning for multi-modal real time image registration, arXiv preprint, arXiv:1908.06213.
43. Zhang Y-J (2017) Image Engineering, Vol.1: Image Processing. Germany: De Gruyter.
10 Simultaneous Location and Mapping
Laser SLAM infers the motion of the LiDAR itself and the surrounding environment
based on the point cloud data of continuous motion frame by frame. Laser SLAM
can accurately measure the angle and distance of object points in the environment;
without pre-arranging the scene, it can work in a poor light environment and
generate an environment map that is easy to navigate.
Laser SLAM mainly solves three problems: (i) extracting useful information from
the environment, that is, feature extraction; (ii) establishing the relationship between
environmental information observed at different times, that is, data association; and
(iii) describing the environment, that is, the map representation problem.
The process framework of laser SLAM is shown in Fig. 10.2, which mainly
includes five modules:
1. Laser scanner: Receive the distance and angle information returned by the emitted
laser from the surrounding environment to form point cloud data.
form a pose graph. By adjusting the nodes in the pose graph to satisfy the
spatial constraints to the greatest extent, the pose information and map are
obtained.
5. Map update: The obtained point cloud data of each frame and the corresponding
pose are spliced into a global map to complete the map update. There are different
types of maps: scale maps, topological maps, and semantic maps. In 2D laser
SLAM, grid maps and feature maps in scale maps are mainly used [2]:
(a) Grid map: Divide the environment space into grid cells of equal size, whose
attribute is the probability that the grid is occupied by objects. If the grid is
occupied, the probability value is closer to 1. When there are no objects in the
grid, the probability value is closer to 0. If it is uncertain whether there is an
object in the grid, the probability value is equal to 0.5. The grid map has high
accuracy and can fully reflect the structural characteristics of the environ
ment, so the grid map can be directly used for autonomous navigation and
positioning of mobile robots.
(b) Feature map: It is also known as geometric map, which is composed of
geometric features such as points, lines, or arcs extracted from environmental
information. Because it occupies fewer resources and allows a customizable map
precision, it is suitable for building maps of small scenes.
Visual SLAM generally tracks selected key points through continuous camera
frames, locates their spatial positions by triangulation, and uses this position information
to infer its own pose [12]. At present, deep learning technology has been widely
applied in visual SLAM [13].
The cameras used in visual SLAM mainly include three classes: monocular
cameras, binocular cameras, and depth cameras (RGB-D). Other special cameras
such as panorama and fish-eye are used less frequently.
The advantages of monocular cameras are that they are low cost, are not affected
by the size of the environment, and can be used both indoors and outdoors; the
disadvantage is that they cannot obtain absolute depth and can only estimate relative
depth. The binocular camera can directly obtain the depth information, but it is
restricted by the length of the baseline (the size itself needs to be larger), the amount
of calculation to obtain the depth data is large, and the configuration and calibration
are complicated. Depth cameras can directly measure the depth of many points, but
they are mainly used indoors, while outdoor applications are limited by light
interference.
The process frame diagram of visual SLAM (vSLAM) is shown in Fig. 10.3,
which mainly includes five modules:
1. Vision sensor: It reads image information and performs data preprocessing
(feature extraction and matching).
2. Visual odometry (VO): It is also known as (SLAM’s) front-end. It is able to
estimate camera motion with the help of adjacent image frames and recover the
spatial structure of the scene. It is called visual odometry because, like an odometer, it only estimates the camera motion incrementally between adjacent frames.
Laser SLAM and visual SLAM have their own characteristics, and some comparisons are listed in Table 10.3.
Due to the complexity of application scenarios, both laser SLAM and visual
SLAM have certain limitations when used alone. In order to take advantage of
different sensors, these two can be combined to fuse the two kinds of
information [14].
Extended Kalman filter (EKF) itself is a filtering method for online SLAM
systems. It can also be used to combine laser SLAM with visual SLAM with
RGB-D cameras [15]. When camera matching fails, a laser device is used to
supplement the camera’s 3D point cloud data and generate a map. However, this
method essentially only adopts a switching working mechanism between the work
ing modes of the two sensors and has not really fused the data of the two sensors.
Using visual SLAM alone may not be able to effectively extract the depth informa
tion of feature points. Laser SLAM works better in this regard, so the depth of the
scene can be measured with laser SLAM first, and then the point cloud can be
projected onto the video frame [16].
When using laser SLAM alone, there will be some difficulties in extracting the
features of the object region. Using visual SLAM to extract ORB features and
perform closed-loop detection can improve the performance of laser SLAM in this
regard [17].
To make laser SLAM and visual SLAM more tightly coupled, it is possible to use
both laser SLAM and visual SLAM at the same time and use the measurement
residuals of both modes at the back-end for back-end optimization [18]. In addition,
visual LiDAR odometry and real-time mapping (VLOAM) [19] can also be
designed, which combines low-frequency LiDAR odometry and high-frequency
visual odometry. This can quickly improve motion estimation accuracy and suppress
drift.
10.2 Laser SLAM Algorithms
People have developed many laser SLAM algorithms based on different technolo
gies and with different characteristics. These algorithms can be mainly divided into
two categories: filtering methods and optimization methods.
In the following introduction, it is assumed for simplicity that the laser device and
its carrier (which can be a car, robot, drone, etc.) use the same coordinate system, and
the laser device is used to refer to the combined device of the laser device and its
carrier. Let xk represent the pose of the laser device, use mi to represent the marked
point in the environment (map), and use zk-1,i+1 to represent the marked feature
mi+1 observed by the laser device at xk-1. In addition, use uk to represent the motion
displacement between two adjacent poses on the motion trajectory.
The basic idea of RBPF is to deal with the localization and mapping problems in
SLAM separately; the joint posterior is factorized as

P(x1:t, m | z1:t, u1:t-1) = P(m | x1:t, z1:t) P(x1:t | z1:t, u1:t-1)     (10.1)

Specifically, first use P(x1:t | z1:t, u1:t-1) to estimate the trajectory x1:t of the laser
device, and then continue to estimate the map m with the help of x1:t.
Mapping with P(m | x1:t, z1:t) is straightforward given the pose of the laser device. The
following only discusses the positioning problem represented by P(x1:t|z1:t, u1:t-1).
Here a particle filter algorithm called sampling importance resampling (SIR) filter
is used. It mainly has four steps:
Sampling
Taking the probabilistic motion model of the laser device as the proposed distribution D,
the new particle point set {xt(i)} at the current moment is sampled from the
proposed distribution D based on the particle point set {xt-1(i)} owned at the previous
moment. Therefore, the generation of the new particle point set {xt(i)} can be
represented as xt(i) ~ P(xt | xt-1(i), ut-1).
Importance Weighting
Considering the entire motion process, each possible trajectory of the laser device
can be represented by a particle point x1:t(i), and the importance weight of each
trajectory corresponding to the particle point x1:t(i) can be defined as the following
equation:

wt(i) = P(x1:t(i) | z1:t, u1:t-1) / D(x1:t(i) | z1:t, u1:t-1)     (10.2)
Resampling
Resampling refers to redrawing the newly generated particle points according to their importance
weights. Since the total number of particle points remains the same, when the
particle points with smaller weights are deleted, the particle points with larger
weights need to be copied to keep the total number of particle points unchanged.
After resampling, the weights of particle points become the same and then proceed to
the next round of sampling and resampling.
Map Estimation
Under the condition that each trajectory corresponds to the particle point x1:t(i),
P(m(i) | x1:t(i), z1:t) can be used to calculate a map m(i), and then the final map m is
obtained by integrating the maps calculated for each trajectory.
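The following is a schematic sketch of one SIR update for the localization part. The functions motion_model() and observation_likelihood() are hypothetical placeholders for the probabilistic motion and observation models of the laser device, which are not spelled out here.

# One SIR filter update: sampling, importance weighting, resampling (map estimation omitted).
import numpy as np

def sir_step(particles, weights, u, z, motion_model, observation_likelihood, rng):
    # 1. Sampling: draw each new particle from the proposal P(x_t | x_{t-1}, u_{t-1})
    particles = np.array([motion_model(x, u, rng) for x in particles])
    # 2. Importance weighting: reweight by the observation likelihood P(z_t | x_t)
    weights = weights * np.array([observation_likelihood(z, x) for x in particles])
    weights = weights / weights.sum()
    # 3. Resampling: duplicate heavy particles, drop light ones, keep the particle count fixed
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))
    # 4. Map estimation per particle would use P(m | x_{1:t}, z_{1:t}) and is omitted here
    return particles, weights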
The GMapping algorithm improves RBPF in two aspects, namely, proposed distri
bution and resampling strategy.
The proposed distribution D is discussed first. It can be seen from Eq. (10.2) that
each calculation needs to calculate the weight corresponding to the entire trajectory.
Over time, the trajectory will become very long, and the amount of computation will
increase. An improved method for this is to derive a recursive calculation method for
the weights based on Eq. (10.2):
wt(i) = P(x1:t(i) | z1:t, u1:t-1) / D(x1:t(i) | z1:t, u1:t-1)
      = [P(zt | x1:t(i), z1:t-1) P(xt(i) | xt-1(i), ut-1) P(x1:t-1(i) | z1:t-1, u1:t-2)]
        / [P(zt | z1:t-1, u1:t-1) D(xt(i) | x1:t-1(i), z1:t, u1:t-1) D(x1:t-1(i) | z1:t-1, u1:t-2)]
      ∝ [P(zt | x1:t(i), z1:t-1) P(xt(i) | xt-1(i), ut-1)] / D(xt(i) | x1:t-1(i), z1:t, u1:t-1) × wt-1(i)     (10.4)
However, directly using the motion model as the proposed distribution will cause a
problem, that is, when the reliability of the observation data is relatively high, the
number of new particles generated by sampling the motion model will fall in the
observation distribution interval, resulting in a lower accuracy of the observation
update. To this end, the observation update process can be divided into two cases:
When the observation reliability is low, the default motion model of Eq. (10.3) is
used to generate a new particle point set {xt(i)} and the corresponding weight; when
the observation reliability is high, then directly sample from the interval of the
observation distribution, and approximate the distribution of the sampling point set
{xk} as a Gaussian distribution; by using the point set {xk} to calculate the parameters
μt(i) and Σt(i) of the Gaussian distribution, sampling from xt(i) ~ N(μt(i), Σt(i))
finally generates a new particle point set {xt(i)} and the corresponding weights.
After generating the new particle point set {xt(i)} and the corresponding weights,
the resampling strategy can be considered. If resampling is performed every time the
particle point set {xt(i)} is updated, then, when the particle weights change little
during the update, or when noise makes some bad particle points receive larger
weights than good ones, resampling will result in the loss of good particle points.
So before performing resampling, its validity needs to be ensured. To this end, the improved
resampling strategy measures the effectiveness of resampling with the help of the
parameter defined by the following equation:

Neff = 1 / Σ_{i=1}^{N} (w̃(i))²     (10.5)

where w̃(i) represents the normalized weight of the i-th particle point. When the similarity
between the proposed distribution and the target distribution is high, the weights
of the particle points are relatively close; when the similarity between the proposed
distribution and the target distribution is low, the weight differences among the particle
points are relatively large. In this way, a threshold can be set to judge validity using the
parameter Neff: resampling is performed when Neff is less than the threshold;
otherwise, resampling is skipped.
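A short sketch of this effectiveness test follows; the threshold value (half the particle count) is an assumption, chosen only for illustration.

# Adaptive resampling based on the effective number of particles of Eq. (10.5).
import numpy as np

def adaptive_resample(particles, weights, rng, threshold_ratio=0.5):
    # particles: N x d array; weights: length-N array
    w = weights / weights.sum()                      # normalized weights
    n_eff = 1.0 / np.sum(w ** 2)                     # Eq. (10.5)
    if n_eff < threshold_ratio * len(particles):     # resample only when Neff is too small
        idx = rng.choice(len(particles), size=len(particles), p=w)
        return particles[idx], np.full(len(particles), 1.0 / len(particles))
    return particles, w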
Local mapping is the process of building a local map using sensor scan data.
Let us first introduce the structure of the Cartographer map. The Cartographer map is
composed of many local sub-maps, and each local sub-map contains several scan
frames (scan), as shown in Fig. 10.5. Maps, local sub-maps, and scan frames are
related by pose relationships. The scan frame and the sub-map are related by the
local pose qij, the sub-map and the map are related by the global pose qim, and the
scan frame and the map are related by the global pose qjs.
The pose coordinates can be expressed as q = (qx, qy, qθ). Assuming that the
initial pose is q1 = (0, 0, 0), the scan frame here is Scan(1), and Sub-map(1) is
initialized with Scan(1). The corresponding pose q2 of Scan(2) is calculated using
the update method of scan-to-map matching, and Scan(2) is added to Sub-map(1)
based on the pose q2. Each newly obtained scan frame is then added to the current
sub-map by the same scan-to-map matching method, until the new scan frame is completely contained
in Sub-map(1); that is, the creation of Sub-map(1) ends when no new information
beyond Sub-map(1) is observed in the new scan frame. Then repeat the
above steps to construct a new local sub-map, Sub-map(2). All local sub-maps {Sub-map(m)}
constitute the final global map. In Fig. 10.5, it is assumed that Sub-map(1)
is constructed from Scan(1) and Scan(2), and Sub-map(2) is constructed from Scan(3),
Scan(4), and Scan(5).
As can be seen from Fig. 10.5, each scan frame corresponds to a global coordinate
in the global map coordinate system and also corresponds to a local coordinate in the
local sub-map coordinate system (because the scan frame is also included in the
corresponding local sub-map). Each local sub-map starts with the first inserted
scan frame, and the global coordinates of this initial scan frame are also the global
coordinates of the local sub-map. So, the global poses Qs = {qjs} (j = 1, 2, ..., n)
corresponding to all scan frames and the global poses Qm = {qim} (i = 1, 2, ..., m)
corresponding to all local sub-maps are associated with the local poses qij generated
by scan-to-map matching, and these constraints constitute the pose graph, which will
be applied in the global mapping later.
The construction of local sub-maps involves the transformation of multiple
coordinate systems. First, the distance points {dk} (k = 1, 2, ..., K) obtained by
one scanning cycle of the laser device are expressed in a coordinate system centered at
the rotation center of the laser device. Then, in a local sub-map, the pose of the first
scan frame is used as a reference, and the pose of the subsequent scan frames can be
represented by a relative transition matrix Tq = (Rq, tq). In this way, the data points in
the scan frame can be transformed into the local sub-map coordinate system using
the following equation:
Tq · dk = Rq dk + tq,  with  Rq = [cos qθ, -sin qθ; sin qθ, cos qθ]  and  tq = (qx, qy)ᵀ     (10.6)
In other words, the data points dk in the scan frame are transformed into the local
sub-map coordinate system.
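A direct sketch of Eq. (10.6), assuming NumPy and scan points given as a K x 2 array, is shown below.

# Transform scan points from the laser device frame into the local sub-map frame.
import numpy as np

def scan_to_submap(points, q):
    # points: K x 2 array of scan points d_k; q: pose (q_x, q_y, q_theta)
    qx, qy, qtheta = q
    R = np.array([[np.cos(qtheta), -np.sin(qtheta)],
                  [np.sin(qtheta),  np.cos(qtheta)]])
    t = np.array([qx, qy])
    return points @ R.T + t          # T_q * d_k = R_q d_k + t_q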
The sub-maps in Cartographer are in the form of probabilistic grid maps, that is,
the continuous 2D space is divided into discrete grids, and the side length of the grid
(usually 5 cm) represents the resolution of the map. The scanned scene point is
replaced by the grid occupied by the scene point, and the probability is used to
describe whether there is a scenery in the grid. The larger the probability value, the
higher the possibility of the existence of the scenery.
Next, consider the process of adding scan data to a sub-map. If the data is
converted to the sub-map coordinate system according to Eq. (10.6), these data
will cover some grids {Mold} of the sub-map. The grid in the sub-map has three
states: occupied (hit), not occupied (miss), and unknown. The grid covered by the
scan points should be occupied. The region between the start and end of the scanning
beam should be free of scenery (light can pass through), so the corresponding grid
should be unoccupied. Due to scan resolution and range limitations, grids not
covered by scan points should be unknown. Because the grid in the sub-map may
be covered by more than one scan frame, the grid state needs to be iteratively
updated in two cases:
1. In the grid {Mold} covered by data points in the current frame, if the grid has never
been covered by data points before (i.e., unknown state), then use the following
formula for initial update:
Mnew(x) = Phit     if state(x) = hit
Mnew(x) = Pmiss    if state(x) = miss     (10.7)
Among them, if the grid x is marked as an occupied state by a data point, then the
occupancy probability Phit is used to assign an initial value to the grid; if the grid x is
marked as a non-occupied state by a data point, then the non-occupied probability
Pmiss is used to assign an initial value to the grid.
2. In the grid {Mold} covered by the data points of the current frame, if the grid
has been covered by data points before (i.e., it already has the value Mold), then use the
following formula to iteratively update:

Mnew(x) = clip( inv⁻¹( inv(Mold(x)) · inv(Phit) ) )     if state(x) = hit
Mnew(x) = clip( inv⁻¹( inv(Mold(x)) · inv(Pmiss) ) )    if state(x) = miss     (10.8)
Among them, if grid x is marked as occupied by data points, then Mold is updated
with the occupancy probability Phit; if grid x is marked as non-occupied by data points,
then Mold is updated with the non-occupancy probability Pmiss. Here inv is an inverse
proportion function, inv(p) = p/(1 - p); inv⁻¹ is the inverse function of inv; and clip
is an interval-limiting function: when the function value is higher than the maximum
of the set interval, the maximum value is taken, and when it is lower than the minimum
of the set interval, the minimum value is taken.
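A compact sketch of these update rules follows; the concrete values of Phit, Pmiss, and the clipping interval are illustrative assumptions, not values prescribed by the text.

# Occupancy grid cell update following Eqs. (10.7) and (10.8).
P_HIT, P_MISS = 0.55, 0.49          # assumed example probabilities
P_MIN, P_MAX = 0.12, 0.97           # assumed clipping interval

def inv(p):
    return p / (1.0 - p)            # inverse proportion (odds) function

def inv_inverse(o):
    return o / (1.0 + o)            # inverse of inv

def clip(p, lo=P_MIN, hi=P_MAX):
    return max(lo, min(hi, p))

def update_cell(m_old, hit):
    p = P_HIT if hit else P_MISS
    if m_old is None:               # grid never covered before: initial update, Eq. (10.7)
        return p
    # grid covered before: iterative update, Eq. (10.8)
    return clip(inv_inverse(inv(m_old) * inv(p)))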
The Cartographer algorithm uses the above iterative update mechanism to effectively reduce the interference of dynamic scenes in the environment. Because the
dynamic scene will make the grid state transition between occupancy and
non-occupancy, each state transition will make the probability value of the grid
smaller, which reduces the interference of the dynamic scene.
Finally, considering that the pose predicted by the motion model may have a large
error, it is also necessary to use the observation data to correct the predicted pose
before adding it to the map. Here, the scan-to-map matching method is still used to
search and match in the neighborhood of the predicted pose to locally optimize
the pose:
arg min_q  Σ_{k=1}^{K} ( 1 - Msmooth(Tq · dk) )     (10.9)
Closed-Loop Detection
The local optimization of the pose in Eq. (10.9) can reduce the cumulative error in
local mapping, but when the scale of the mapping is large, the total cumulative
error will lead to ghosting on the map; this typically appears when the motion
trajectory returns to a position it had previously reached. This requires closed-loop
detection: closed-loop constraints are added to the overall mapping constraints,
and a global optimization of the global poses is performed. In closed-loop
detection, a search matching algorithm with high computational efficiency and
high precision is required.
The closed-loop detection can be represented by the following formula
(W represents the search window):

arg max_{q ∈ W}  Σ_{k=1}^{K} Mnearest(Tq · dk)     (10.10)
where the Mnearest function value is the probability value of the grid covered by
Tq · dk. When the search result is the current real pose, the matching degree is very
high, that is, the value of each Mnearest function is large, and the entire summation
result is also the largest.
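The brute-force form of this search can be sketched as follows; m_nearest() stands in for the map lookup and is a placeholder, and, as discussed next, the real algorithm replaces this exhaustive loop with branch and bound.

# Exhaustive scoring of candidate poses in the search window W, following Eq. (10.10).
import numpy as np

def closed_loop_search(scan, window_poses, m_nearest):
    # scan: K x 2 scan points; window_poses: candidate poses (q_x, q_y, q_theta) in W
    best_pose, best_score = None, -np.inf
    for q in window_poses:
        qx, qy, qtheta = q
        R = np.array([[np.cos(qtheta), -np.sin(qtheta)],
                      [np.sin(qtheta),  np.cos(qtheta)]])
        transformed = scan @ R.T + np.array([qx, qy])
        score = float(np.sum(m_nearest(transformed)))   # summed grid probabilities
        if score > best_score:
            best_pose, best_score = q, score
    return best_pose, best_score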
If Eq. (10.10) is computed with exhaustive search, the computation is too large to
be performed in real time. For this reason, a branch-and-bound method is used to
improve efficiency. Branch and bound is to first match with a low-resolution map
and then gradually increase the resolution to match until the highest resolution.
Cartographer uses a depth-first strategy to search here.
Global Mapping
Cartographer uses a sparse pose map for global optimization, and the constraint
relationship of the sparse pose map can be constructed according to Fig. 10.5. Global
poses Qs = {qjs} (j = 1, 2, ..., n) corresponding to all scan frames and the global
poses Qm = {qim} (i = 1, 2, ..., m) corresponding to all local sub-maps are associated
with the local poses qij generated by scan-to-map matching:

arg min_{Qs, Qm}  (1/2) Σ_{ij} L( E²(qim, qjs; Σij, qij) )     (10.11)

where

E²(qim, qjs; Σij, qij) = e(qim, qjs; qij)ᵀ Σij⁻¹ e(qim, qjs; qij)     (10.12)
In the above two equations, i is the sequence number of the sub-map, j is the
sequence number of the scan frame, and qij represents the local pose of the scan
frame with the sequence number j in the local sub-map with the sequence number i.
The loss function L is used to penalize too large error terms, and the Huber function
can be used.
Equation (10.11) is actually a nonlinear least squares problem. When a closed
loop is detected, all pose quantities in the whole pose graph are globally optimized,
and then all pose quantities in Qs and Qm will be corrected, and the corresponding
map points on each pose will be corrected accordingly, which is called global
mapping.
The function of the point cloud registration module is to extract feature points from the
point cloud data. It calculates the smoothness of each point in the point cloud data of
the current frame, determines the points whose smoothness is less than the given
threshold as corners, and determines the points whose smoothness is greater than the
given threshold as surface points. Put all corners into the corner cloud set and all
surface points into the surface cloud set.
Mapping Module
The mapping module uses the method of scan-to-map matching for high-precision
positioning. It takes the above low-precision odometer as the initial value of the pose and
matches the corrected feature point cloud with the map. This matching can get a
high-precision odometer (1 Hz odometer). Based on the pose provided by such a
high-precision odometer, the corrected feature point cloud can be added to the
existing map.
Although the accuracy of LiDAR odometer used for positioning is low, its update
speed is high. Although the odometer output by the mapping module has high
accuracy, its update speed is low. If these two are integrated, an odometer with high
speed and high accuracy can be obtained. Fusion is achieved by interpolation. If the
1 Hz high-precision odometer is used as the benchmark and the 10 Hz low-precision
odometer is used to interpolate it, the 1 Hz high-precision odometer can be output at
the speed of 10 Hz (equivalent to 10 Hz odometer).
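The following is a hedged sketch of this fusion-by-interpolation idea for planar poses (x, y, theta); the data layout is an assumption made for illustration: the latest high-precision pose serves as an anchor, and the relative motion reported by the low-precision odometry since that anchor is composed onto it.

# Upgrade low-rate, high-precision poses to the high rate of the low-precision odometry.
import numpy as np

def fuse_odometry(anchor_pose, ref_low_pose, low_poses):
    # anchor_pose: latest high-precision pose (x, y, theta)
    # ref_low_pose: low-precision pose recorded at the same time as the anchor
    # low_poses: subsequent low-precision poses to be upgraded
    def rot(theta):
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s], [s, c]])
    fused = []
    for p in low_poses:
        # relative motion measured by the low-precision odometer, in its own reference frame
        rel_xy = rot(-ref_low_pose[2]) @ (p[:2] - ref_low_pose[:2])
        rel_th = p[2] - ref_low_pose[2]
        # compose the relative motion onto the high-precision anchor pose
        xy = anchor_pose[:2] + rot(anchor_pose[2]) @ rel_xy
        fused.append(np.array([xy[0], xy[1], anchor_pose[2] + rel_th]))
    return np.array(fused)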
It should be pointed out that if the frequency of the laser device itself is high
enough, or if an inertial measurement unit (IMU), visual odometer (VO), wheel
odometer, etc., provides external information to speed up inter-frame feature
registration so as to respond to posture changes and correct motion distortion,
then this fusion is not necessary.
LOAM algorithm has two characteristics worth pointing out: First, it solves
motion distortion; second, it improves the efficiency of mapping. Motion distortion
comes from the interference in data acquisition. The problem of motion distortion is
more prominent in low-cost laser devices because of low scanning frequency and
rotation speed. The lOAM algorithm uses the odometer obtained by inter-frame
registration to correct the motion distortion, so that the low-cost laser device can be
applied. The problem of mapping efficiency is more prominent when processing a
large number of 3D point cloud data. LOAM algorithm uses low-precision odometer
and high-precision odometer to decompose simultaneous positioning and mapping
into independent positioning and independent mapping, which can be processed
separately, reducing the amount of calculation, so that low-power computer equipment can also be applied.
10.3 Visual SLAM Algorithms

According to the different processing methods of image data, visual SLAM algorithms
can be divided into the feature point method, the direct method, and the semi-direct
method.
The feature point method first extracts features from the image and performs
feature matching and then uses the obtained data association information to calculate
the camera motion, that is, the front-end VO, and finally performs back-end optimization and global mapping (see Fig. 10.3).
The direct method directly uses the image gray-scale information for data
association and calculates the camera motion. The front-end VO in the direct method
is directly carried out on the image pixels, and there is no need for feature extraction
and matching. The subsequent back-end optimization and global mapping are
similar to the feature point method.
Semi-direct method combines the robustness advantage of feature point method
in using feature extraction and matching and the computational speed advantage of
direct method without feature extraction and matching. It often has more stable and
faster performance.
ORB-SLAM algorithm uses the optimization method for solution. It uses three
threads: front-end, back-end, and closed loop. Its process flow diagram is shown in
Fig. 10.7. The front-end combines the logic related to positioning such as feature
extraction, feature matching, and visual odometer in a separate thread (not dragged
by the relatively slow back-end thread to ensure the real-time positioning), which
extracts feature points by oriented FAST and rotated BRIEF (ORB) from the
image [26]. The back-end combines the mapping related logic of global optimization
and local optimization in a separate thread. It first performs local optimization
mapping and triggers global optimization when the closed-loop detection is successful (the global optimization process is carried out on the camera pose map, without
considering the map features to speed up the optimization speed). In addition, the
algorithm uses key frames (representative frames in image input). Generally, the
image frames directly input into the system tracking thread from the camera are
called ordinary frames, which are only used for positioning tracking. The number of
ordinary frames is very large, and the redundancy between frames is also large. If
only some frames with more feature points, rich attributes, large differences between
front and back frames, and more common visibility relationship with surrounding
frames are selected as key frames, the amount of calculation is smaller, and the
robustness is higher when generating map points. ORB-SLAM algorithm maintains
a sequence of key frames in operation. In this way, the front-end can quickly relocate
with the help of key frame information when positioning is lost, and the back-end
can optimize the key frame sequence to avoid bringing a large number of redundant
input frames into the optimization process and wasting computing resources.
Figure 10.7 mainly includes six modules, which are briefly introduced below.
Map Module
The map module corresponds to the data structure of SLAM system. The map
module of the ORB-SLAM algorithm includes map points, key frames, the common visibility map, and the spanning tree. The running process of the algorithm is to dynamically
maintain the map, in which there is a mechanism responsible for adding and deleting
the data in the key frame and also a mechanism responsible for adding and deleting
the data in the map point cloud, so as to maintain the efficiency and robustness of
the map.
Map Initialization
Location Recognition
If the tracking thread is lost during the mapping process of the SLAM system, it is
necessary to start relocation to retrieve the lost information; if the SLAM system
needs to judge whether the current position has been reached before (after building a
large map), it needs to carry out closed-loop detection. In order to realize relocation
and closed-loop detection, position recognition technology is needed. In a large
environment, image-to-image matching is often used for position recognition.
Among the options, the bag of words (BoW) model is often used to build a visual
vocabulary recognition database for matching.
Tracking Thread
The tracking thread obtains the input image from the camera, completes the map
initialization, and then extracts the ORB feature. The next initial pose estimation
corresponds to coarse positioning, while local map tracking corresponds to fine
positioning. On the basis of coarse positioning, precise positioning uses the current
frame and multiple key frames on the local map to establish a common visibility
relationship and uses the projection relationship between the point cloud of the
common visibility map and the current frame to solve the pose of the camera more
accurately. Finally, some new candidate key frames are selected.
Local Mapping Thread

The local mapping thread first calculates the feature vector of the candidate key
frame selected by the tracking thread with the help of the bag of word model, that is,
the key frame is added to the database of the bag of word model. Then, the cloud
points (points in the point cloud) in the map that have a common visibility relation
ship but do not map with the key frame are associated (these cloud points are called
recent map points), and the key frame is inserted into the map structure. Poor-quality
recent map points are then deleted. For the newly inserted key
frames, the common visibility map can be used to match the adjacent key frames,
and the triangulation method can be used to reconstruct the new map cloud points.
Next, several key frames and map cloud points near the current frame are optimized
by bundle adjustment (BA). Finally, the key frames in the local map are filtered
again, and the redundant key frames are eliminated to ensure robustness.
Closed-Loop Thread
Closed-loop thread is divided into two parts: closed-loop detection and closed-loop
correction. The closed-loop detection uses the bag of word model to select the
frames in the database with high similarity to the current key frame as candidate
closed-loop frames and then calculate the similarity transformation relationship
between each candidate closed-loop frame and the current frame. If there is enough
data to calculate the similarity transformation, and the transformation can ensure that
there are enough common viewpoints, the closed-loop detection is successful. Next,
the transform is used to correct the cumulative error of the current key frame and its
adjacent frames, and those map points that are inconsistent due to the cumulative
error are fused together. Finally, with the help of global optimization, those frames
that do not have a common visibility relationship with the current key frame are
corrected. Here, the key frame pose on the global map is taken as the optimization
variable, so it is also called pose map optimization.
The direct method, that is, the direct visual odometer method, does not need feature
extraction and matching (reducing the calculation time), directly uses the attributes
of image pixels to establish data association, and constructs the corresponding model
by minimizing the photometric error to solve the camera pose and map point cloud.
Re-Projection Error
Suppose a spatial point P1 is projected by the camera projection D onto the pixel point p1 in the first image:

p1 = D(P1)     (10.13)

Conversely, the back-projection relationship from the pixel point p1 to the spatial
point P1 is:

P1 = D⁻¹(p1)     (10.14)

With the pose transformation T = [R t; 0 1] between the two frames, the point is transformed into the second camera frame:

P2 = T P1 = [R t; 0 1] P1     (10.15)

and re-projected into the second image:

p2′ = D(P2)     (10.16)

Ideally, the re-projected pixel p2′ should coincide with the actually observed pixel
p2. However, in practice, the projection is disturbed by noise and the transformation
has certain errors, so they do not coincide. The difference between the two is called the
re-projection error:

e = p2 - p2′     (10.17)

Equation (10.17) gives the re-projection error of one point. If the re-projection errors of
all feature points are considered, the camera pose transformation and the map point cloud
can be optimized by minimizing the total re-projection error:

min_T  (1/2) Σ_k || p2,k - p2,k′ ||²     (10.18)
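A compact sketch of these equations follows, assuming a pinhole projection with a known intrinsic matrix K (an assumption made for the example, since the projection D is not specified in detail here).

# Accumulate the squared re-projection error over observed feature points.
import numpy as np

def project(K, P):
    # D(P): pinhole projection of a 3D point P to pixel coordinates
    x = K @ P
    return x[:2] / x[2]

def reprojection_error(K, R, t, points_3d, pixels_2d):
    total = 0.0
    for P1, p2 in zip(points_3d, pixels_2d):
        P2 = R @ P1 + t               # Eq. (10.15): transform into the second camera frame
        p2_hat = project(K, P2)       # Eq. (10.16): re-projected pixel p2'
        e = p2 - p2_hat               # Eq. (10.17): re-projection error of one point
        total += float(e @ e)         # summed squared error, as in Eq. (10.18)
    return total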
Photometric Error

The photometric error of a point compares the gray value of the pixel p1 in the first image I1 with the gray value of the re-projected pixel p2′ in the second image I2:

e = I1(p1) - I2(p2′)     (10.19)
Equation (10.19) gives the photometric error of one point. If the photometric errors of all
considered pixel points are taken into account, the camera pose transformation and the map point
cloud can be optimized by minimizing the total photometric error:

min_T  (1/2) Σ_k || I1(p1,k) - I2(p2,k′) ||²     (10.20)
Comparing the re-projection error and the photometric error, it can be seen that the
calculation of the re-projection error needs the help of feature extraction and feature
matching, and the calculated error corresponds to a distance between pixels;
feature extraction and matching are not needed to calculate the photometric error. The
calculated error is the difference between the gray value of pixels in one image and
the gray value of pixels re-projected into another image. The difference between the
two also explains the relative advantages and disadvantages between the feature
point method and the direct method.
Tracking Module
The tracking module uses the new input frame and the current key frame to calculate
the pose transformation of the new input frame, which is achieved by minimizing the
photometric error between them:
E(qji) = Σ_{p ∈ Ω_Di} || rp²(p, qji) / σ_rp²(p, qji) ||_δ     (10.21)

Among them, qji is a similarity transformation on the Lie algebra (used to describe the pose
transfer); || · ||_δ is the Huber norm; rp(p, qji) is the photometric error:

rp(p, qji) = Ii(p) - Ij( w(p, Di(p), qji) )     (10.22)

where the warp function w(p, d, q) projects the pixel p with inverse depth d into the other frame:

w(p, d, q) = ( x′/z′, y′/z′, 1/z′ )ᵀ,  with  ( x′, y′, z′, 1 )ᵀ = exp(q) ( px/d, py/d, 1/d, 1 )ᵀ     (10.23)

and σ_rp²(p, qji) is the variance of the photometric error:

σ_rp²(p, qji) = 2σI² + ( ∂rp(p, qji) / ∂Di(p) )² Vi(p)     (10.24)
where Di is the inverse depth map and Vi is the covariance of the inverse depth of the
image.
Depth Estimation Module

The depth estimation module receives the photometric error calculated by the
tracking module for each new input frame and the current key frame and determines
whether to replace the current key frame with the new input frame or improve the
current key frame. If the photometric error is large enough, the current key frame is
added to the map, and then the new input frame is used as the current key frame.
Specifically, first calculate the similarity transformation between the new input
frame and the current key frame, and then project the depth information of the
current key frame to the new input frame through the similarity transformation and
calculate its depth estimation value. If the photometric error is relatively small, the
new input frame is used to filter and update the depth estimation value of the current
key frame.
Map Optimization Module

The map optimization module receives the new key frame from the depth estimation
module and calculates its similarity transformation with the other key frames in the map
before adding it to the map. Here, the photometric error and the depth error of the image
should be minimized at the same time, where rd(p, qji) and σ_rd²(p, qji) denote the depth
error and the variance of the depth error, respectively.
Omnidirectional cameras have a wide field of view (FOV), and the field of view of
some fish-eye cameras can even exceed 180°. This feature of omnidirectional camera
is more suitable for SLAM applications [29]. However, the wide field of view
inevitably brings the problem of image distortion. The practical omnidirectional
camera is still a monocular camera, so the main challenge in extending monocular
SLAM to omnidirectional cameras lies in the camera projection model. In this model,
the projection function maps a 3D point x to pixel coordinates using the focal lengths
fx and fy and the principal point (cx, cy) (Eq. (10.28)), and the corresponding
back-projection recovers the viewing ray from the normalized coordinates
((u - cx)/fx, (v - cy)/fy) (Eqs. (10.29) and (10.30)).
One of the main characteristics of this model is that the projection function, back-
projection function, and their derivatives are easy to calculate.
Tracking Module
The tracking module uses the new input (stereo) frames and the current key (stereo)
frames to calculate the pose transformation of the new input frame, which is still
achieved by calculating the minimum error between them, that is, the photometric
error.
The estimation of scene geometry is carried out in key frames. Each key frame
maintains a Gaussian probability distribution on the inverse depth of the pixel
subset. This subset is selected as pixels with high image gradient amplitude, because
these pixels provide rich structural information and more robust disparity estimation
than pixels in non-textured regions.
The depth estimation combines the temporal stereo (TS) of the original monocular LSD-SLAM with the static stereo (SS) obtained here from a fixed baseline
stereo camera. For each pixel, the binocular LSD-SLAM algorithm integrates static
and temporal stereo cues into depth estimation according to availability. In this way,
the characteristics of monocular structure from motion are combined with fixed
baseline stereo depth estimation in single SLAM method. Static stereo effectively
removes the scale as a free parameter, and temporal stereo cues can help estimate the
depth from a baseline other than the small baseline of the stereo camera.
In the binocular LSD-SLAM algorithm, the depth of the key frame can be estimated directly from the static stereo (see Fig. 10.13). Compared with methods that depend only on temporal or only on static stereo, this has several advantages. Static stereo allows the absolute scale of the world to be estimated and is independent of the camera motion. However, static stereo is limited by a constant baseline (in many cases, with a fixed direction), which effectively restricts its performance to a specific depth range. Temporal stereo, in contrast, does not limit the performance to a specific range: the same sensor can be used in both very small and very large-scale environments, with a seamless transition between the two.
If a new key frame is generated (initialized), the propagated depth (PD) map can be updated and pruned with the help of static stereo.
Camera motion between two images can be determined by means of direct image alignment. This method tracks the motion of the camera with respect to reference key frames. It can also be used to estimate relative pose constraints between key frames for pose graph optimization. Of course, there is also a need to compensate for changes in affine lighting.
The semi-direct method uses some threads or modules of the feature point method
and the direct method in combination. Figure 10.14 can be used to introduce their
connection.
As can be seen from Fig. 10.14, the feature point method first establishes the association between images through feature extraction and feature matching and then solves the camera pose and map point cloud by minimizing the re-projection error. The direct method, in contrast, directly uses the attributes of image pixels to establish the associations between the images, and the camera pose and map point cloud are then solved by minimizing the photometric error. The semi-direct method combines the feature extraction module of the feature point method with the direct association module of the direct method. Because feature points are more robust than raw pixels, and direct association is more efficient than feature matching followed by re-projection error minimization, the semi-direct method combines the advantages of both: it ensures robustness while also improving efficiency.
The flowchart of the SVO algorithm is shown in Fig. 10.15, which mainly includes
two threads: motion estimation and mapping.
The motion estimation thread mainly has three modules: image alignment, feature
alignment, and posture and structure optimization.
The image alignment module uses a sparse model to align the new input image with the previous frame. The alignment is achieved by re-projecting the extracted feature points (FAST corners) into the new image and computing the camera pose transformation by minimizing the photometric error. This is equivalent to replacing the pixels used in the direct method with (sparse) feature points.
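For illustration, here is a minimal Python sketch of this kind of sparse image alignment: the 3D points attached to the previous frame's feature locations are transformed by a candidate pose, re-projected into the new image, and the photometric residuals are minimized with a robust solver. The small-motion pose parameterization, the helper names, and the use of scipy.optimize.least_squares are assumptions of the sketch, not the actual SVO implementation.

import numpy as np
from scipy.optimize import least_squares

def se3_small(xi):
    # First-order (small-motion) approximation of the SE(3) exponential map;
    # xi = (wx, wy, wz, tx, ty, tz). Adequate for a frame-to-frame sketch.
    wx, wy, wz, tx, ty, tz = xi
    R = np.array([[1.0, -wz,  wy],
                  [ wz, 1.0, -wx],
                  [-wy,  wx, 1.0]])
    return R, np.array([tx, ty, tz])

def sparse_align_residuals(xi, pts3d_prev, intens_prev, img_new, K):
    # Photometric residuals of sparse image alignment: the 3D points attached to
    # the previous frame's feature locations (FAST corners) are transformed by the
    # candidate pose, re-projected into the new image, and compared by intensity.
    R, t = se3_small(xi)
    res = np.zeros(len(pts3d_prev))
    for k, (X, i_prev) in enumerate(zip(pts3d_prev, intens_prev)):
        Xc = R @ X + t
        u = K[0, 0] * Xc[0] / Xc[2] + K[0, 2]
        v = K[1, 1] * Xc[1] / Xc[2] + K[1, 2]
        ui, vi = int(round(float(u))), int(round(float(v)))
        if 0 <= vi < img_new.shape[0] and 0 <= ui < img_new.shape[1]:
            res[k] = img_new[vi, ui] - i_prev
    return res

# Usage sketch (arrays assumed to be given):
# xi_hat = least_squares(sparse_align_residuals, np.zeros(6), loss="huber",
#                        args=(pts3d_prev, intens_prev, img_new, K)).x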
Using the previously computed (rough) camera posture transformation, the feature
points that are already in the map and co-viewed by the new image can be
re-projected back (from the map to the new image). Considering that the
re-projected feature point position may not coincide with the real feature point
position in the image, it needs to be corrected by minimizing the photometric
error. To align feature points, an affine transformation can be used.
It should be pointed out that although both alignment modules need to determine the six parameters of the camera pose, using only the first one makes pose drift very likely, while using only the second one makes the amount of computation very large.
Based on the rough camera posture obtained in Step (1) and the corrected feature
points obtained in Step (2), the camera pose and map point cloud can be optimized
by minimizing the re-projection error of the map point cloud to the new image. Note
that both Step (1) and Step (2) use the idea of the direct method, both of which
minimize the photometric error. The idea of the feature point method is used here,
which is to minimize the re-projection error. If only the camera pose is optimized,
only motion BA is used; if only map point cloud is optimized, only structure BA
is used.
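As a sketch of this step, the fragment below refines only the camera pose (motion-only BA) by minimizing the re-projection error of fixed map points against the corrected feature positions; the small-motion parameterization and the scipy-based solver are illustrative assumptions rather than the actual SVO code.

import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(xi, pts3d, obs_uv, K):
    # Motion-only bundle adjustment: map points are kept fixed and only the 6-DoF
    # camera pose (small-motion parameterization) is refined so that the projected
    # map points match the corrected feature positions obs_uv.
    wx, wy, wz, tx, ty, tz = xi
    R = np.array([[1.0, -wz, wy], [wz, 1.0, -wx], [-wy, wx, 1.0]])
    Xc = pts3d @ R.T + np.array([tx, ty, tz])
    proj = Xc[:, :2] / Xc[:, 2:3] * np.array([K[0, 0], K[1, 1]]) + K[:2, 2]
    return (proj - obs_uv).ravel()

# pose = least_squares(reprojection_residuals, np.zeros(6),
#                      args=(pts3d, obs_uv, K), loss="huber").x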
The mapping thread estimates the depth of a 3D point given an image and its pose. It mainly includes three modules: feature extraction, depth filter initialization, and depth filter update. They work under the guidance of two judgments.
A swarm robot is a decentralized system that can collectively accomplish tasks that
a single robot cannot do alone. The properties of swarm robots, such as scalability,
flexibility, and fault tolerance, have been greatly improved due to the development
of localization awareness and communication, self-organization and redundancy
techniques, etc., which make swarm robots an ideal candidate for performing
tasks. In large unknown environments, swarm robots can autonomously perform
simultaneous localization and mapping (SLAM) by navigating dangerous dynamic
environments using a self-organizing exploration scheme.
Swarm robots have some characteristics that distinguish them from centralized
multi-robot systems [34].
10.4.1.1 Scalability
Robots in swarms interact only with close peers and the immediate environment. Contrary to most centralized multi-robot systems, they do not require global knowledge or supervision to operate. Therefore, modifying the size of the swarm does not require reprogramming of individual robots, nor does it have a significant impact on the qualitative collective behavior. This enables swarm robots to achieve scalability—that is, maintaining performance as more agents join the system—as they can cope with environments of any size over a considerable range. Of course, an approach that only works with very expensive robots does not actually scale in practice, as economic constraints may prevent the acquisition of large numbers of robots. Therefore, the design of swarm SLAM methods should take into account the cost of a single robot.
10.4.1.2 Flexibility
Since swarms are decentralized and self-organizing, individual robots can dynamically assign themselves to different tasks to meet the requirements of specific environmental and operating conditions, even if those conditions change during operation. This adaptability provides flexibility for swarm SLAM: flexible swarm SLAM methods do not have to be tailored to very specialized hardware configurations or rely on existing infrastructure or global information sources to achieve good results.
10.4.1.3 Fault Tolerance

Swarm robots are composed of a large number of robots with high redundancy. This high redundancy, coupled with the lack of centralized control, prevents swarm robots from having a single point of failure. Therefore, the swarm SLAM method can achieve fault tolerance, as it can cope with the loss or failure of some robots (as well as with measurement noise). Likewise, fault tolerance makes economic sense: losing a robot should not have a significant impact on the cost of the task or its success.
Taking these characteristics into account, it can be seen that swarm SLAM should
have a different application than multi-robot SLAM: Swarm robots are best suited
for applications where the main constraint is time or cost rather than high accuracy.
Therefore, they should be more suitable for generating rough abstract graphs, such as
topological maps or simple semantic maps, rather than precise metric maps. In fact,
when an accurate map is required, there is usually enough time to construct it, and
when time (or cost) is the main constraint, it is often acceptable to generate an
approximate but informative map. The method of swarm SLAM should also be
suitable for mapping dangerous dynamic environments. As the environment evolves
over time, a single or small swarm of robots takes time to update the map, while a
large enough swarm can do so quickly.
In recent years, in addition to being combined with bionics [45], SLAM has increasingly been combined with deep learning and with multi-agent systems.
With the help of deep learning, the performance of odometry and closed-loop
detection can be improved, and the SLAM system’s understanding of environmental
semantics can be enhanced. For example, a visual odometry method called DeepVO
uses a convolutional neural network (CNN) on raw image sequences to learn
features and a recurrent neural network (RNN) to learn dynamical connections
between images [46]. This structure, which combines convolutional and recurrent networks, can efficiently extract the effective information between adjacent frames and at the same time has good generalization ability. As another example, in closed-loop detection, a ConvNet has been used to compute features of landmark regions and to compare the similarity of these regions, so as to judge the similarity between whole images and to improve detection robustness under partial occlusion and severe scene changes [47]. In fact, deep learning-based closed-loop detection methods are more robust to changing environmental conditions, seasonal changes, and occlusions caused by dynamic objects.
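The matching step behind such closed-loop detection can be sketched very simply once per-frame deep descriptors are available. The fragment below, which assumes precomputed ConvNet descriptors (e.g., pooled activations of some backbone), ranks earlier keyframes by cosine similarity; the threshold and minimum frame gap are illustrative parameters, not values from [47].

import numpy as np

def detect_loop_closures(descriptors, current, threshold=0.85, min_gap=50):
    # Loop-closure candidates from deep features: the current frame's (precomputed)
    # ConvNet descriptor is compared with all sufficiently old keyframe descriptors
    # using cosine similarity; frames above the threshold are returned, best first.
    d = descriptors[current]
    d = d / (np.linalg.norm(d) + 1e-12)
    candidates = []
    for k in range(max(0, current - min_gap)):
        v = descriptors[k] / (np.linalg.norm(descriptors[k]) + 1e-12)
        score = float(d @ v)
        if score > threshold:
            candidates.append((k, score))
    return sorted(candidates, key=lambda c: -c[1])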
An overview of some recent deep learning algorithms for closed-loop detection is
shown in Table 10.4.
In Table 10.4, see Sect. 5.2 for SIFT and SURF, Sect. 10.3.1 for ORB, and Sect.
10.1.1 for NDT.
The agents in a multi-agent system can communicate with each other, coordinate with each other, and solve problems in parallel, which can improve the efficiency of SLAM. Each agent is also relatively independent, with good fault tolerance and anti-interference ability, which helps SLAM in large-scale environments. For example, the multi-agent distributed architecture [63]
uses successive over-relaxation (SOR) and Jacobi over-relaxation (JOR) to solve
normal equations, which can effectively save data bandwidth.
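As a toy illustration of the Jacobi over-relaxation idea mentioned above, the following sketch iteratively solves a (dense) normal-equation system; in a distributed setting each agent would update only its own block of unknowns and exchange the current iterate, which is what makes the scheme bandwidth-friendly. The relaxation factor and stopping rule here are illustrative choices.

import numpy as np

def jacobi_over_relaxation(A, b, omega=0.8, tol=1e-8, max_iter=500):
    # Jacobi over-relaxation (JOR) for the normal equations A x = b.
    # Each component is updated from the previous iterate only, so agents can
    # compute their blocks in parallel and exchange a single iterate per round.
    D = np.diag(A)
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        x_jacobi = (b - A @ x + D * x) / D          # plain Jacobi update
        x_new = (1 - omega) * x + omega * x_jacobi  # relaxation with factor omega
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x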
Visual SLAM systems assisted by using inertial measurement units (IMUs) are
often referred to as visual-inertial navigation systems. Multi-agent collaborative
visual SLAM systems often have a moving subject equipped with one or more
visual sensors, which can estimate the change of its own posture and reconstruct a
3D map of the unknown environment through the perception of environmental
information.
An overview of several existing multi-agent vision SLAM system schemes is
shown in Table 10.5 [64].
In Table 10.5, CCM-SLAM is a multi-agent visual SLAM framework [65] that
incorporates IMU, each agent only runs a visual odometry with a limited number of
key frames, and the agent will detect the key frame information and send it to the
server (reducing the cost and communication burden of a single agent); the server
constructs a local map based on this information and fuses the local map information
through the method of location recognition. In the server, posture estimation and
bundle adjustment are applied to refine the map.
References
5. Zhao J, Zhao L, Huang SD, et al. (2020) 2D Laser SLAM with general features represented by
implicit functions. IEEE Robotics and Automation Letters 5(3): 4329-4336.
6. Biber P, Strasser W (2003) The normal distributions transform: A new approach to laser scan
matching. Proceedings of International Conference on Intelligent Robotics and Systems
2743-2748.
7. Olson E (2015) M3RSM: Many-to-many multi-resolution scan matching. Proceedings of IEEE
International Conference on Robotics and Automation 5815-5821.
8. Yin H, Ding XQ, Tang L, et al. (2017) Efficient 3D LiDAR based loop closing using deep
neural network. Proceedings of IEEE International Conference on Robotics and Biomimetics
481-486.
9. Arshad S, Kim GW (2021) Role of deep learning in loop closure detection for visual and
LiDAR SLAM: A survey. Sensors 21, #1243 (DOI: https://doi.org/10.3390/s21041243).
10. Magnusson M, Lilienthal AJ, Duckett T (2007) Scan registration for autonomous mining
vehicles using 3D-NDT. J. Field Robot 24, 803-827.
11. Douillard B, Underwood J, Kuntz N, et al. (2011) On the segmentation of 3D LiDAR point
clouds. Proceedings of the IEEE International Conference on Robotics and Automation 9-13.
12. Ning LR, Pang L, Dong D, et al. (2020) The combination of new technology and research status
of simultaneous location and mapping. Proceedings of the 6th Symposium on Novel Optoelec
tronic Detection Technology and Applications, #11455 (DOI: https://doi.org/10.1117/12.
2565347).
13. Huang ZX, Shao CL (2023) Survey of visual SLAM based on deep learning. Robot, DOI: https://doi.org/10.13973/j.cnki.robot.220426.
14. Wang JK, Jia X (2020) Survey of SLAM with camera-laser fusion sensor. Journal of Liaoning
University of Technology (Natural Science Edition) 40(6): 356-361.
15. Xu Y, Ou Y, Xu T. (2018) SLAM of robot based on the fusion of vision and LiDAR.
Proceedings of International Conference on Cyborg and Bionic Systems 121-126.
16. Graeter J, Wilczynski A, Lauer M (2018) LIMO: LiDAR-monocular visual odometry. Pro
ceedings of International Conference on Intelligent Robots and Systems 7872-7879.
17. Liang X, Chen H, Li Y, et al. (2016) Visual laser-SLAM in large-scale indoor environments.
Proceedings of International Conference on Robotics and Biomimetics 19-24.
18. Seo Y, Chou C (2019) A tight coupling of Visual-LiDAR measurements for an effective
odometry. Intelligent Vehicles Symposium 1118-1123.
19. Zhang J, Singh S (2015) Visual-LiDAR odometry and mapping: Low-drift, robust, and fast.
Proceedings of International Conference on Robotics and Automation 2174-2181.
20. Grisetti G, Stachniss C, Burgard W (2007) Improved techniques for grid mapping with
Rao-Blackwellized particle filters. IEEE Transactions on Robotics 23(1): 34-46.
21. Hess W, Kohler D, Rapp H, et al. (2016) Real-time loop closure in 2D LiDAR SLAM. Proc.
International Conference on Robotics and Automation 1271-1278.
22. Zhang J, Singh S (2014) LOAM: LiDAR odometry and mapping in real-time. Robotics: Science
and Systems Conference 1-9.
23. Mur-Artal R, Montiel JMM, Tardos JD (2015) ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics 31(5): 1147-1163.
24. Mur-Artal R, Tardos JD (2017) ORB-SLAM2: An open-source SLAM system for monocular,
stereo and RGB-D cameras. IEEE Transactions on Robotics 33(5): 1255-1262.
25. Campos C, Elvira R, Rodriguez J JG, et al. (2021) ORB-SLAM3: An accurate open-source
library for visual, visual-inertial, and Multimap SLAM. IEEE Transactions on Robotics 37(6):
1874-1890.
26. Rublee E, Rabaud V, Konolige K, et al. (2011) ORB: An efficient alternative to SIFT or SURF.
Proceedings of ICCV, 2564-2571.
27. Engel J, Schöps T, Cremers D (2014) LSD-SLAM: Large-scale direct monocular SLAM. Proceedings of ECCV 834-849.
28. Engel J, Stuckler J, Cremers D (2015) Large-scale direct SLAM with stereo cameras. IEEE
International Conference on Intelligent Robots and Systems 1935-1942.
29. Caruso D, Engel J, Cremers D (2015) Large-scale direct SLAM for omnidirectional cameras.
IEEE International Conference on Intelligent Robots and Systems 141-148.
30. Geyer C, Daniilidis K (2000) A unifying theory for central panoramic systems and practical
implications. Proceedings of ECCV 445-461.
31. Ying XH, Hu ZY (2004) Can we consider central catadioptric cameras and fisheye cameras
within a unified imaging model. Lecture Notes in Computer Science 3021: 442-455.
32. Barreto JP (2006) Unifying image plane liftings for central catadioptric and dioptric cameras.
Imaging Beyond the Pinhole Camera 21-38.
33. Forster C, Pizzoli M, Scaramuzza D (2014) SVO: Fast semi-direct monocular visual odometry.
Proceedings of IEEE International Conference on Robotics and Automation 15-22.
34. Kegeleirs M, Grisetti G, Birattari M (2021) Swarm SLAM: Challenges and Perspectives.
Frontiers in Robotics and AI 8: #618268 (DOI: https://doi.org/10.3389/frobt.2021.618268 ).
35. Dimidov C, Oriolo G, Trianni V (2016) Random walks in swarm robotics: An experiment with
kilobots, Swarm Intelligence 9882: 185-196.
36. Kegeleirs M, Garzon RD, Birattari M (2019) Random walk exploration for swarm mapping.
LNCS, in Towards Autonomous Robotic Systems 11650: 211-222.
37. Birattari M, Ligot A, Bozhinoski D, et al. (2019) Automatic off-line design of robot swarms: A
manifesto. Frontiers in Robotics and AI 6: 59.
38. Birattari M, Ligot A, Hasselmann K (2020). Disentangling automatic and semi-automatic
approaches to the optimization-based design of control software for robot swarms. Nature
Machine Intelligence 2, 494-499.
39. Spacy G, Kegeleirs M, Garzon RD (2020) Evaluation of alternative exploration schemes in the
automatic modular design of robot swarms of CCIS. Proceedings of the 31st Benelux Confer
ence on Artificial Intelligence 1196: 18-33.
40. Saeedi S, Trentini M, Seto M, et al. (2016) Multiple-robot simultaneous localization and
mapping: A review. Journal of Field Robotics 33, 3-46.
41. Fox D, Ko J, Konolige K, et al. (2006) Distributed multirobot exploration and mapping.
Proceedings of IEEE 94: 1325-1339.
42. Ghosh R, Hsieh C, Misailovic S, et al. (2020) Koord: a language for programming and verifying
distributed robotics application. Proceedings of ACM Program Language 4, 1-30.
43. Lajoie P-Y, Ramtoula B, Chang Y, et al. (2020) DOOR-SLAM: Distributed, online, and outlier
resilient SLAM for robotic teams. IEEE Robotics and Automation Letters 5(2): 1656-1663.
44. Kummerle R, Grisetti G, Strasdat H, et al. (2011) G2o: A general framework for graph
optimization. Proceedings of the IEEE International Conference on Robotics and Automation
3607-3613.
45. Li WL, Wu DW, Zhu HN, et al. (2021) A bionic simultaneous location and mapping with
closed-loop correction based on dynamic recognition threshold. Proceedings of the 33rd
Chinese Control and Decision Conference (CCDC), 737-742.
46. Wang S, Clark R, Wen H, et al. (2017) DeepVO: Towards end-to-end visual odometry with
deep recurrent convolutional neural networks. Proceedings of International Conference on
Robotics and Automation 2043-2050.
47. Sunderhauf N, Shirazi S, Dayoub F (2015) On the performance of ConvNet features for place
recognition. Proceedings of International Conference on Intelligent Robots and Systems
4297-4304.
48. Chen B, Yuan D, Liu C, et al. (2019) Loop closure detection based on multi-scale deep feature
fusion. Applied Science 9, 1120.
49. Merrill N, Huang G (2018) Lightweight unsupervised deep loop closure. Robotics: Science and
Systems.
50. Dube R, Cramariuc A, Dugas D, et al. (2018) SegMap: 3D segment mapping using data-driven
descriptors. Robotics: Science and Systems XIV.
51. Dube R, Cramariuc A, Dugas D, et al. (2019). SegMap: segment-based mapping and localiza
tion using data-driven descriptors. Int. J. Robot. Res. 39, 339-355.
52. Hu M, Li S, Wu J, et al. (2019) Loop closure detection for visual SLAM fusing semantic
information. IEEE Chinese Control Conference 4136-4141.
53. Liu Y, Xiang R, Zhang Q, et al. (2019) Loop closure detection based on improved hybrid deep
learning architecture. Proceedings of 2019 IEEE International Conferences on Ubiquitous
Computing and Communications and Data Science and Computational Intelligence and
Smart Computing, Networking and Services, 312-317.
54. Xia Y, Li J, Qi L, et al. (2016) Loop closure detection for visual SLAM using PCANet features.
Proceedings of the International Joint Conference on Neural Networks 2274-2281.
55. Zaganidis A, Zerntev A, Duckett T, et al. (2019) Semantically assisted loop closure in SLAM
using NDT histograms. Proceedings of the 2019 IEEE/RSJ International Conference on Intel
ligent Robots and Systems (IROS) 4562-4568.
56. Chen X, Labe T, Milioto A, et al. (2020) OverlapNet: Loop closing for LiDAR-based SLAM.
Proceedings of the Robotics: Science and Systems (RSS), Online Proceedings.
57. Wang S, Lv X, Liu X, et al. (2020) Compressed holistic ConvNet representations for detecting
loop closures in dynamic environments. IEEE Access 8, 60552-60574.
58. Facil JM, Olid D, Montesano L, et al. (2019) Condition-invariant multi-view place recognition.
arXiv:1902.09516.
59. Yin H, Tang L, Ding X, et al. (2018) LocNet: Global localization in 3D point clouds for mobile
vehicles. Proceedings of IEEE Intelligent Vehicles Symposium 728-733.
60. Olid D, Facil JM, Civera J. (2018) Single-view place recognition under seasonal changes.
arXiv:1808.06516.
61. Zywanowski K, Banaszczyk A, Nowicki M (2020) Comparison of camera-based and 3D
LiDAR-based loop closures across weather conditions. arXiv:2009.03705.
62. Wang Y, Zell A (2018) Improving feature-based visual SLAM by semantics. Proceedings of the
2018 IEEE International Conference on Image Processing, Applications and Systems 7-12.
63. Cieslewski T, Choudhary S, Scaramuzza D (2018) Data-efficient decentralized visual SLAM.
Proceedings of IEEE International Conference on Robotics and Automation 2466-2473.
64. Wang L, Yang GL, Cai QZ, et al. (2020) Research progress in collaborative visual SLAM for
multiple agents Navigation Positioning and Timing 7(3): 84-92.
65. Schmuck P, Chli M (2019) CCM-SLAM: Robust and efficient centralized collaborative mon
ocular simultaneous localization and mapping for robotic teams. Journal of Field Robotics
36(4): 763-781.
Chapter 11
Spatial-Temporal Behavior Understanding
The image engineering survey series mentioned in Chap. 1 has been going on for
28 years since the beginning of the literature statistics in 1995 [4]. In the second
decade of the image engineering survey series (starting with the literature statistics in
2005), with the emergence of new hotspots in image engineering research and
applications, a new subcategory has been added to the image understanding
category—C5: spatial-temporal techniques (3D motion analysis, posture detection,
object tracking, behavior judgment, and understanding) [5]. What is emphasized
here is the comprehensive use of various information in images/videos to make
corresponding judgments and interpretations of the scene and the dynamic situation
of the objects in it.
In the past 18 years, a total of 314 articles in the C5 subcategory have been
collected in the survey series, and their distribution in each year is shown by
histogram in Fig. 11.1. The figure also shows the development trend obtained by
fitting the number of documents in each year with a third-order polynomial. It can be
seen that the number of documents in the first few years fluctuated noticeably. Later, it was relatively stable, though the output was modest; however, the number of documents in the past 4 years has increased significantly, with an average of more than 30 per year, while in the previous 14 years the average was only about 13 articles per year.
Fig. 11.1 The numbers and their change for spatial-temporal technical documents
etc.) or return the ball (including moving, extending the arm, turning the wrist,
drawing the ball, etc.). Going to the paddle and picking up the ball is often seen as an
activity. In addition, two players hitting the ball back and forth to win points is also a
typical activity scene. The competition between sports teams is generally regarded as
an event, and the awarding of awards after the competition is also a typical event.
Although a player's self-motivation by making a fist after winning the game can be
regarded as an action, it is more often regarded as a behavioral performance of the
player. When the players play a beautiful counterattack, the audience’s applause,
shouting, cheering, etc., are also attributed to the behavior of the audience.
It should be pointed out that the concepts of the last three levels are often used
loosely in many studies. For example, when an activity is called an event, it
generally refers to some abnormal activities (such as a dispute between two people,
an old man walking and falling, etc.); when an activity is called an action, the
meaning (behavior) and nature of the activity (such as stealing, as the act of stealing
or breaking into a house is called theft) are emphasized.
The research corresponding to the content of the first two levels is relatively
mature [7], and related technologies have been widely used in many other tasks. The
following sections in this chapter mainly focus on the last three levels and make
some distinctions among them as much as possible.
Fig. 11.3 Example pictures of actions in the Weizmann action recognition database
The time state model models the probabilities between states and between states and observations. Each state summarizes the action appearance at a certain moment, and the observation at that time corresponds to the image representation. The time state model is either generative or discriminative.
The generative model learns a joint distribution between observations and
actions, to model each action class (considering all variations). Discriminative
models learn the probabilities of action classes under observation conditions; they
do not model classes but focus on differences between classes.
The most typical of the generative models is the hidden Markov Model (HMM),
in which the hidden states correspond to the various steps of the action. Hidden states
model state transition probabilities and observation probabilities. There are two
independent assumptions here. One is that the state transition only depends on the
previous state, and the other is that the observation only depends on the present state.
Variations of HMM include the maximum entropy Markov model (MEMM), the factored-state hierarchical HMM (FS-HHMM), and the hierarchical variable transition hidden Markov model (HVT-HMM).
On the other hand, discriminative models model the conditional distribution of action classes given the observations, combining multiple observations to distinguish different action classes. Such models are beneficial for distinguishing related actions. The conditional random field (CRF) is a typical discriminative model, and its improvements include the factorial CRF (FCRF), generalizations of the CRF, etc.
the foreground, background, contour, optical flow, and changes of the image) and
(2) human body model-based methods, which use a human body model to represent the structural features of the actor, for example, describing the action with a sequence of human joint points. No matter which kind of method is adopted, detecting the human body and detecting and tracking its important parts (such as the head, hands, and feet) play an important role.
The representation and recognition of actions and activities is a relatively new but
immature field [11]. The method used is largely dependent on the researcher’s
purpose. In scene interpretation, representations can be independent of the object
(such as a person or car) that led to the activity; in surveillance applications, human
activities and interactions between people are generally concerned. In a holistic
approach, global information is preferred over component information, such as
when determining a person’s gender. For simple actions such as walking or running,
a local approach can also be considered, in which more attention is paid to detailed
actions or action primitives. A framework for action recognition can be found in [12].
The representation and calculation methods of human body posture can be mainly
divided into three types:
1. Appearance-based method: Instead of modeling the physical structure of the
human directly, the human posture is analyzed using information such as color,
texture, and contour. It is difficult to estimate human posture since only the
apparent information in 2D images is exploited.
2. Human body model-based methods: The human body is first modeled using a line
graph model, 2D or 3D model, and then the human posture is estimated by
analyzing these parameterized human models. Such methods usually require
high image resolution and object detection accuracy.
3. Method based on 3D reconstruction: First, the 2D moving objects obtained by
multiple cameras at different positions are reconstructed into 3D moving objects
through corresponding point matching, and then the camera parameters and
imaging formulas are used to estimate the human posture in 3D space.
Posture can be modeled based on spatial-temporal interest points (see [9]). If only
the spatial-temporal Harris interest point detector is used (see [13]), the obtained
spatial-temporal interest points are mostly located in the region of sudden motion.
The number of such points is small (a sparse representation), so important motion information in the video is easily lost, resulting in detection failure. To
overcome this problem, dense spatial-temporal interest points can also be extracted
with the help of motion intensity to fully capture motion-induced changes. Here, the
motion intensity can be calculated by convolving the image with a spatial Gaussian
filter and a temporal Gabor filter (see [10]). After the spatial-temporal interest points
are extracted, a descriptor is first established for each point, and then each posture is
modeled. A specific method is to first extract the spatial-temporal feature points of
the posture in the training sample library as the underlying feature, so that one
posture corresponds to a set of spatial-temporal feature points. The posture samples
are then classified using unsupervised classification methods to obtain clustering
results of typical postures. Finally, each typical posture category is modeled using an
EM-based Gaussian mixture model.
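A minimal sketch of this clustering-plus-mixture-modeling step is given below using scikit-learn; the descriptor array, the number of typical postures, and the number of mixture components are hypothetical parameters, and the library choice is an assumption of the sketch rather than the method used in the cited work.

import numpy as np
from sklearn.mixture import GaussianMixture

def model_typical_postures(descriptors, n_postures=10, n_components=3):
    # Unsupervised grouping of posture samples followed by an EM-based Gaussian
    # mixture model per typical posture. `descriptors` is an (N, D) array of
    # spatial-temporal interest-point descriptors, one row per posture sample.
    coarse = GaussianMixture(n_components=n_postures, covariance_type="diag",
                             random_state=0).fit(descriptors)
    labels = coarse.predict(descriptors)
    models = {}
    for c in range(n_postures):
        members = descriptors[labels == c]
        if len(members) >= n_components:
            models[c] = GaussianMixture(n_components=n_components,
                                        covariance_type="diag").fit(members)
    return models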
A recent trend in posture estimation in natural scenes is to use a single frame for
posture detection in order to overcome the problem of tracking with a single view in
unstructured scenes. For example, robust part detection and probabilistic combina
tion of parts have resulted in better estimates of 2D postures in complex movies.
Actions lead to changes in posture. If each static posture of the human body is
defined as a state, then by means of the state space method (also known as the
probability network method), the states are switched through the transition proba
bility, and the construction of an activity sequence can be obtained by performing a
traversal between the states of the corresponding posture.
Interactive activities are more complex activities. They can be divided into two
categories: (1) the interaction between people and the environment, such as people
driving a car and holding a book, and (2) interpersonal interaction, which often refers
to the communication activities or contact behaviors of two people (or multiple
people), which is obtained by combining the (atomic) activities of a single person.
Single-person activities can be described with the help of probabilistic graph models.
The probabilistic graph model is a powerful tool for modeling continuous dynamic
feature sequences and has a relatively mature theoretical basis. Its disadvantage is
that the topology of its model depends on the structural information of the activity
itself, so a large amount of training data is required for complex interactive activities
to learn the topology of the graph model. To combine single-person activities, the
method of statistical relational learning (SRL) can be used. SRL is a machine
learning method that integrates relational/logical representation, probabilistic rea
soning, machine learning, and data mining to obtain a likelihood model of
relational data.
group object activities are semantically analyzed, in order to explain the trend and
situation of the whole scene.
In the analysis of group activity, the statistics of the number of individuals
participating in the activity is a basic data. For example, in many public places,
such as squares, stadium entrances, and exits, it is necessary to have certain statistics
on the number of people. Fig. 11.4 shows a picture of people counting in a
surveillance scenario [13]. Although there are many people in the scene with
different movement patterns, the concern here is the number of people in a certain
region (in the region enclosed by the box).
With the deepening of research, the categories of actors and actions that need to be
considered in the spatial-temporal behavior understanding are increasing. To do
this, the actor and action need to be jointly modeled [14]. In fact, jointly detecting an
ensemble of several objects in an image is more robust than detecting individual
objects individually. Therefore, joint modeling is necessary when considering mul
tiple different types of actors performing multiple different types of actions.
Consider the video as a 3D image f(x, y, t) and represent the video using the graph
structure G = (N, A). Among them, the node set N = (n 1, ..., nM) represents M voxels
(or M super-voxels), and the arc set A(n) represents the voxel set in the neighborhood
of a certain n in N. Assume that the body label set is denoted by X, and the action
label set is denoted by Y.
Consider a set of random variables {x} representing actors and a set of random
variables {y} representing actions. The actor-action understanding problem of inter
est can be viewed as a maximum a posteriori problem:
It is assumed that the actor and the action are independent of each other, that is, any
actor can initiate any action. At this point, a set of classifiers need to be trained in the
action space to classify different actions. This is the simplest method but does not
emphasize the existence of actor-action tuples, that is, some actors may not initiate
all actions, or some actors can only initiate certain actions. In this way, when there
are many different actors and different actions, sometimes unreasonable combina
tions (such as people can fly, birds can swim, etc.) occur when using the naive Bayes
model.
The joint product space model uses the actor space X and the action space Y to generate a new label space Z; here, the product relationship Z = X × Y is used. In the
joint product space, a classifier can be learned directly for each actor-action tuple.
Obviously, this method emphasizes the existence of actor-action tuples, which can
eliminate the appearance of unreasonable combinations, and it is possible to use
more cross-actor-action features to learn more discriminative classifiers. However,
this approach may not take advantage of commonalities across different actors or
different actions, such as steps and arm swings for both adults and children to walk.
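The construction of this product label space is straightforward; the short Python fragment below builds it from the actor and action label sets of the A2D setting described later (the explicit label lists are written out here only for illustration).

from itertools import product

ACTORS = ["adult", "infant", "cat", "dog", "bird", "car", "ball"]          # label set X
ACTIONS = ["walking", "running", "jumping", "rolling", "climbing",
           "crawling", "flying", "eating", "none"]                          # label set Y

# Joint product label space Z = X x Y: one class per actor-action tuple.
# A classifier trained over Z scores tuples directly, so unreasonable pairs
# simply never receive training data and are never predicted.
Z = list(product(ACTORS, ACTIONS))
print(len(Z))   # 63 raw combinations; only a subset of them is reasonable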
The three-layer model unifies the naive Bayes model and the joint product space
model. It simultaneously learns a classifier in actor space X, action space Y, and joint
actor-action space Z. At inference time, it infers Bayesian terms and joint product
space terms separately and then combines them linearly to get the final result. It not
only models actor-action intersection but also models different actions initiated by
the same actor and the same action initiated by different actors.
In practice, many videos have multiple actors and/or initiate multiple actions, which
is the case for multi-labels. At this point, both x and y are binary vectors with
dimensions |X| and |Y|. The value of x_i is 1 if the ith actor type exists in the video, and 0 otherwise. Similarly, the value of y_j is 1 if the jth action type exists in the video, and
0 otherwise. This generalized definition does not confine specific elements in x to
specific elements in y. This facilitates an independent comparison of the multi-label
performance of actors and actions with the multi-label performance of actor-action
tuples.
To study the situation where multiple actors initiate multiple actions, a
corresponding video database has been constructed [15]. This database is called
the actor-action database (A2D). In it, a total of seven actor categories are considered (adults, infants, cats, dogs, birds, cars, and balls) and nine action categories (walking, running, jumping, rolling, climbing, crawling, flying, eating, and no action, i.e., none of the first eight). The actors include both articulated bodies, such as adults, infants, cats, dogs, and birds, and rigid bodies, such as cars and balls. Many actors can initiate the same action, but no actor can initiate all of them. So, while there are 63 combinations, some of them are unreasonable (or hardly ever occur), resulting in a total of 43 reasonable actor-action tuples. Using the
text of these 43 actor-action tuples, 3782 videos were collected from YouTube, ranging in length from 24 to 332 frames (136 frames per segment on average). The number of video segments corresponding to each actor-action tuple is shown in Table 11.1. The blanks in the table correspond to unreasonable actor-action tuples, for which no video was collected. It can be seen from Table 11.1 that the number of video segments corresponding to each actor-action tuple is on the order of one hundred.

Table 11.1 Number of video segments corresponding to actor-action labels in the database

           Walking  Running  Jumping  Rolling  Climbing  Crawling  Flying  Eating  No action
Adults       282      175      174      105      101       105       -      105       761
Infants      113       -        -       107      104       106       -       -         36
Cats         113       99      105      103      106        -        -      110        53
Dogs         176      110      104      104       -        109       -      107        46
Birds        112       -       107      107       99        -       106     105        26
Cars          -       120      107      104       -         -       102      -         99
Balls         -        -       105      117       -         -       109      -         87
Among these 3782 videos, the number of video segments containing different
numbers (1-5) of actors, the number of video segments containing different numbers
(1-5) of actions, and the number of video segments containing different numbers of
actors-actions, respectively, are shown in Table 11.2. It can be seen from Table 11.2
that in more than one third of the video segments, the number of actors or actions is
greater than 1 (the last four columns of the bottom row in the table, including one
actor initiated more than two actions, or more than two actors initiated one action).
For the case of multi-label actor-action recognition, three classifiers can still be
considered similar to single-label actor-action recognition: a multi-label actor-action
classifier using naive Bayes, a multi-label actor-action classifier in the joint product
space, and an actor-action classifier based on a three-layer model that combines the
first two classifiers.
Multi-label actor-action recognition can be viewed as a retrieval problem. Experiments on the database introduced earlier (with 3036 segments as training set and
746 segments as test set, with basically similar ratios for various combinations) show
that the multi-label actor-action classifier in the joint product space performs better
than naive Bayes classifier, while the effect of the multi-label actor-action classifier
based on the three-layer model can still be improved [14].
then introduce a method based on the joint product space model, which utilizes the
tuple [x, y] to jointly consider the actor and action. Next consider a two-layer model
that considers the association of actor and action variables. Finally, a three-layer
model is introduced, which considers both intra-category linkages and inter-category
linkages.
Similar to the case in single-label actor-action recognition, the naive Bayes model can be represented as:

P(x, y \mid f) \propto \prod_{i \in M} q_i(x_i)\, r_i(y_i) \prod_{i \in M} \prod_{j \in A(i)} q_{ij}(x_i, x_j)\, r_{ij}(y_i, y_j) \qquad (11.2)
Among them, qi and ri encode the potential functions defined in the actor and
action models, respectively, and qij and rij encode the potential functions in the actor
node set and the action node set, respectively.
Now, it is required to train the classifiers {f_c | c ∈ X} on the actor features and to train the classifiers {g_c | c ∈ Y} on the action features. The pairwise edge potential functions have the form of the following contrast-sensitive Potts model:

q_{ij}(x_i, x_j) = \begin{cases} 1 & x_i = x_j \\ \exp\!\left[-k\,(1 + \chi^2_{ij})\right] & \text{otherwise} \end{cases} \qquad (11.3)

r_{ij}(y_i, y_j) = \begin{cases} 1 & y_i = y_j \\ \exp\!\left[-k\,(1 + \chi^2_{ij})\right] & \text{otherwise} \end{cases} \qquad (11.4)

where \chi^2_{ij} is the \chi^2 distance between the feature histograms of nodes i and j, and k is a parameter to be learned from the training data. Actor-action semantic segmentation can be obtained by solving these two flat conditional random fields independently [15].
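A direct transcription of the pairwise term of Eqs. (11.3)/(11.4), as reconstructed above, is shown below; the chi-square computation over feature histograms and the parameter k are written out explicitly, and the exact functional form should be treated as a reconstruction rather than a verbatim copy of the original model.

import numpy as np

def potts_pairwise(label_i, label_j, hist_i, hist_j, k=1.0):
    # Contrast-sensitive Potts potential, cf. Eqs. (11.3)/(11.4): equal labels get
    # potential 1; different labels are penalized depending on the chi-square
    # distance between the feature histograms of the two super-voxels.
    if label_i == label_j:
        return 1.0
    chi2 = 0.5 * float(np.sum((hist_i - hist_j) ** 2 / (hist_i + hist_j + 1e-12)))
    return float(np.exp(-k * (1.0 + chi2)))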
Consider a new set of random variables z = {z1, ..., zM}; they are also defined over
all super-voxels in a video and pick labels from the actor and action product space
Z = X × Y. In this way, the actor-action tuple is captured jointly as a single element, but the common factors of actor and action across different tuples cannot be modeled (the model introduced below will solve this problem). This gives a single-layer graph model:
P(z \mid f) \propto \prod_{i \in M} s_i(z_i) \prod_{i \in M} \prod_{j \in A(i)} s_{ij}(z_i, z_j) = \prod_{i \in M} s_i([x_i, y_i]) \prod_{i \in M} \prod_{j \in A(i)} s_{ij}([x_i, y_i], [x_j, y_j]) \qquad (11.5)
where s_i is the potential function of the joint actor-action product-space label and s_{ij} is the pairwise potential function between the two nodes of the corresponding tuple [x, y]. Specifically, s_i contains the classification score obtained by applying the trained actor-action classifier {h_c | c ∈ Z} to node i, and s_{ij} has the same form as Eq. (11.3) or Eq. (11.4). For illustration, see Fig. 11.5a, b.
Given an actor node x and an action node y, the two-layer model uses edges that encode the potential function of the tuple to connect each random variable pair {(x_i, y_i)}_{i=1}^{M} and directly obtains the covariance of the cross actor-action labels:
P(x, y \mid f) \propto \prod_{i \in M} q_i(x_i)\, r_i(y_i)\, t_i(x_i, y_i) \prod_{i \in M} \prod_{j \in A(i)} q_{ij}(x_i, x_j)\, r_{ij}(y_i, y_j) \qquad (11.6)
where t_i(x_i, y_i) is the potential function learned for the labels of the entire product space, which can be obtained in the same way as s_i; see Fig. 11.5c. Here, connecting edges across the layers are added.
The naive Bayes model represented by the previous Eq. (11.2) does not consider the
connection between the actor variable x and the action variable y. The joint product
space model of Eq. (11.5) combines features across actors and actions as well as
interaction features within the neighborhood of an actor-action node. The two-layer
model of Eq. (11.6) adds actor-action interactions between separate actor nodes and
action nodes but does not account for the spatial-temporal variation of these
interactions.
A three-layer model is given below, which can explicitly model this spatial-temporal variation (Fig. 11.5d). It combines the nodes of the joint product space with all the actor nodes and action nodes. In the corresponding potential functions, w(y_i | x_i) and w(x_i | y_i) are the classification scores of conditional classifiers specially trained for this three-layer model.
These conditional classifiers are the main reason for the performance improve
ment: An action-specific, disjunctive classifier based on the actor-type condition can
take advantage of properties specific to the actor-action tuple. For example, when
training a conditional classifier on the action “eating” given an actor adult, all other
actions of the actor adult can be treated as negative training samples. In this way, this
three-layer model takes into account all the connections in each actor space and each
action space, as well as in the joint product space. In other words, the first three basic
models are all special cases of the three-layer model. It can be shown that the (x*, y*, z*) maximizing Eq. (11.7) also maximizes Eq. (11.1) [14].
Fig. 11.6 Classification diagram of action and activity modeling recognition techniques
The methods of action modeling can be divided into three categories: nonparametric modeling, volumetric modeling, and parametric time-series modeling. Nonparametric methods extract a set of features from each frame of the video and match these features to stored templates. Volumetric approaches do not extract features frame by frame but treat the video as a 3D volume of pixel intensities and extend standard image features (e.g., scale-space extrema, spatial filter responses) to 3D. Parametric time-series methods model the temporal dynamics of motion, estimating parameters specific to a set of actions from a training set.
2D Template
This type of method involves the steps of performing motion detection and then
tracking objects in the scene. After tracking, build a cropped sequence containing the
object. Changes in scale can be compensated for by normalizing the object size.
Calculate a periodic index for a given action, and if the periodicity is strong, perform
action recognition. For identification, an estimate of the period is used to segment the
sequence of periods into individual periods. The average period is decomposed into
several temporal segments, and flow-based features are computed for each spatial
point in each segment. Average the flow features in each segment into a single frame.
The average flow frame in this activity cycle constitutes the template for each action
group.
A typical approach is to model actions as temporal templates. The background is first subtracted, and the foreground patches extracted from a sequence are combined into a still image. There are two ways of combining: One is to weight all frames in
the sequence with the same weight, and the resulting representation can be called a
motion energy image (MEI); the other is to use different weights for different
frames in the sequence, generally using larger weights for new frames and smaller
weights for old frames; the resulting representation can be called a motion history
image (MHI). For a given action, use the combined images to form a template. Then,
the region invariant moments of the template are calculated and identified.
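The two combination schemes can be sketched in a few lines; in the sketch below, simple frame differencing stands in for the background subtraction step, and the threshold and decay duration tau are illustrative parameters.

import numpy as np

def motion_history_image(frames, threshold=30, tau=None):
    # Motion energy image (MEI) and motion history image (MHI) from a gray-level
    # frame sequence: frame differences are thresholded; the MEI weights all
    # frames equally, while the MHI gives newer frames larger weights.
    tau = tau if tau is not None else len(frames) - 1
    h, w = frames[0].shape
    mei = np.zeros((h, w), dtype=bool)
    mhi = np.zeros((h, w), dtype=float)
    for t in range(1, len(frames)):
        moving = np.abs(frames[t].astype(int) - frames[t - 1].astype(int)) > threshold
        mei |= moving                                        # equal weighting
        mhi[moving] = tau                                    # refresh where motion occurs
        mhi[~moving] = np.maximum(mhi[~moving] - 1, 0)       # decay elsewhere
    return mei, mhi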
3D Object Model
The 3D object model is a model established for the spatial-temporal object, such as
the generalized cylinder model (see [10]), the 2D contour stacking model, and so
on. The motion and shape information of the object is included in the 2D contour
stacking model, from which geometric features of the object surface, such as peaks,
pits, valleys, ridges, etc., can be extracted (see [10]). If the 2D contours are replaced by blobs obtained from background subtraction, a binary space-time volume is obtained.
A lot of action recognition involves data in high-dimensional space. Since the feature
space becomes exponentially sparse with dimensionality, a large number of samples
are required to construct an effective model. The inherent dimension of the data can
be determined by using the manifold where the learning data is located, which has a
relatively small degree of freedom and can help design an effective model in a
low-dimensional space. The easiest way to reduce dimensionality is to use principal
component analysis (PCA) techniques, where the data are assumed to be in a linear
subspace. In practice, except in very special cases, the data is not in a linear
subspace, so methods that can learn the eigen-geometry of the manifold from a
large number of samples are needed. Nonlinear dimensionality reduction techniques
allow data points to be represented according to how close they are to each other in a
nonlinear manifold; typical methods include local linear embedding (LLE) and
Laplacian eigenmaps.
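A linear baseline for this dimensionality-reduction step is shown below as a plain SVD-based PCA projection; it is only a sketch of the simplest case discussed above, and nonlinear embeddings would replace it when the action data lie on a curved manifold.

import numpy as np

def pca_embed(samples, dim=3):
    # Linear dimensionality reduction by PCA: project the (N, D) action feature
    # matrix onto the `dim` directions of largest variance.
    X = samples - samples.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:dim].T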
Space-Time Filtering
Component-Based Approach
Sub-volume Matching
Sub-volume matching refers to the matching between the sub-volumes in the video
and the template. For example, the action can be matched with the template from the
perspective of spatial-temporal motion correlation. The main difference between this
method and the component-based method is that it does not need to extract the action
descriptor from the extreme point of the scale space but checks the similarity
between two local spatial-temporal patches (by comparing the motion between the
two patches). However, it is time-consuming to calculate the whole video volume.
One way to solve this problem is to extend the successful fast Haar feature (box
feature) in object detection to 3D. The Haar feature of 3D is the output of 3D filter
bank, and the filter coefficients are 1 and - 1. Combining the output of these filters
with the bootstrap method (see [10]) can obtain robust performance. Another method
is to regard the video volume as a collection of sub-volumes of any shape. Each
sub-volume is a spatially consistent stereo region, which can be obtained by clus
tering the pixels that are close in appearance and space. Then the given video is over
segmented into many sub-volumes or super-voxels. The action template is matched
by searching for the smallest set of regions in these sub-volumes that can maximize
the overlap between the sub-volume set and the template.
The advantage of sub-volume matching is that it is robust to noise and occlusion. If combined with optical flow features, it is also robust to appearance changes. The disadvantage of sub-volume matching is that it is easily affected by background changes.
(4) Tensor-based method
A tensor is the generalization of a 2D matrix to multidimensional space. A 3D space-time volume can naturally be regarded as a tensor with three independent dimensions. For example, human action, human identity, and joint trajectory can be regarded as three independent dimensions of a tensor. By decomposing the overall data tensor into dominant modes (similar to a generalization of PCA), the signatures of the corresponding action and identity (the person who performs the action) can be extracted. Of course, the three dimensions of the tensor can also be taken directly as the three dimensions of the spatial-temporal domain, that is, (x, y, t).
Tensor-based method provides a direct method to match video as a whole, which
does not need to consider the middle-level representation used in the previous
methods. In addition, other kinds of features (such as optical flow, spatial-temporal
filter response, etc.) can also be easily combined by increasing the tensor dimension.
The first two modeling methods are more suitable for simple actions, and the
modeling methods introduced below are more suitable for complex actions that
span the time domain, such as complex dance steps in ballet video, special gestures
of instrument players, etc.
Hidden Markov model (HMM) is a typical model in state space. It is very effective
in modeling time-series data, has good generalization and discrimination, and is
suitable for the work that needs recursive probability estimation. In the process of
constructing discrete hidden Markov model, the state space is regarded as a finite set
of discrete points. It is modeled as a series of probabilistic steps from one state to
another over time. The three key problems of the hidden Markov model are evaluation (inference), decoding, and learning. The hidden Markov model was first used to recognize tennis shot
actions, such as forehand, forehand volley, backhand, backhand volley, smash, etc.
Among them, a series of images with background subtraction are modeled into
hidden Markov models corresponding to specific categories. Hidden Markov model
can also be used to model time-varying actions (such as gait).
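A minimal sketch of such class-specific HMM recognition is given below using the third-party hmmlearn package (an assumption of this sketch, not a tool named in the text); each action class gets its own Gaussian HMM, and a test sequence is assigned to the class whose model gives the highest log-likelihood.

import numpy as np
from hmmlearn import hmm   # third-party HMM library, used here for illustration

def train_action_hmms(sequences_by_class, n_states=4):
    # One Gaussian HMM per action class (e.g., forehand, backhand, smash): each
    # model is trained on the concatenated feature sequences of its class, with
    # the hidden states playing the role of the action's successive steps.
    models = {}
    for action, seqs in sequences_by_class.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        models[action] = m.fit(X, lengths)
    return models

def classify_action(models, seq):
    # The class whose HMM assigns the highest log-likelihood wins.
    return max(models, key=lambda a: models[a].score(seq))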
A single hidden Markov model can be used to model a single person's action. For multi-person actions or interactive actions, a pair of hidden Markov models can be used to represent alternating actions. In addition, domain knowledge
can also be combined into the construction of hidden Markov models, or hidden
Markov models can be combined with object detection to take advantage of the
relationship between actions and action objects. For example, the prior knowledge of
state duration can be combined into the framework of hidden Markov model, and the
resulting model is called semi-HMM. If the state space is added with a discrete label
for modeling high-level behavior, the hidden Markov model of mixed state can be
used for modeling nonstationary behavior.
Linear dynamic systems (LDS) are more general than hidden Markov models, in
which the state space is not limited to a set of finite symbols, but can be continuous
values in Rk space, where k is the dimension of the state space. The simplest linear
dynamic system is a first-order time invariant Gaussian Markov process, which can
be expressed as:

x(t+1) = A\,x(t) + w(t) \qquad (11.10)

y(t) = C\,x(t) + v(t) \qquad (11.11)

where x ∈ R^d is the d-D state vector, y ∈ R^n is the n-D observation vector, d << n, and w and v are the process noise and observation noise, respectively; both are zero-mean Gaussian with covariance matrices P and Q, respectively. The linear dynamic system can be regarded as an extension of the hidden Markov model with a Gaussian observation model to a continuous state space; it is more suitable for processing high-dimensional time-series data, but it is still not suitable for nonstationary actions.
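To make Eqs. (11.10)/(11.11) concrete, the following sketch simulates such a first-order Gauss-Markov process; the matrices and noise covariances are user-supplied, and the function is purely illustrative rather than part of any particular tracking system.

import numpy as np

def simulate_lds(A, C, P, Q, x0, steps, rng=np.random.default_rng(0)):
    # First-order time-invariant Gauss-Markov process, cf. Eqs. (11.10)/(11.11):
    # x(t+1) = A x(t) + w(t),  y(t) = C x(t) + v(t),
    # with zero-mean Gaussian noise of covariance P (process) and Q (observation).
    x, states, obs = np.asarray(x0, dtype=float), [], []
    for _ in range(steps):
        x = A @ x + rng.multivariate_normal(np.zeros(len(x)), P)
        y = C @ x + rng.multivariate_normal(np.zeros(C.shape[0]), Q)
        states.append(x.copy())
        obs.append(y)
    return np.array(states), np.array(obs)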
In contrast to Eqs. (11.10) and (11.11), both A and C may now change with time. In order to handle such complex dynamics, the commonly used approach is the switching linear dynamic system (SLDS) or jump linear
system (JLS). Switched linear dynamic system includes a group of linear dynamic
systems and a switching function, which changes model parameters by switching
between models. In order to recognize complex motion, a multi-layer method
including multiple different levels of abstraction can be adopted. At the lowest
level, there is a series of input images; the next level includes the regions of consistent motion, called blobs; the level above that combines the trajectories of the blobs over time; and the highest level includes a hidden Markov model that represents complex behavior.
Although switched linear dynamic systems have stronger modeling and descrip
tion capabilities than hidden Markov models and linear dynamic systems, learning
and reasoning are much more complex in switched linear dynamic systems, so
approximate methods are generally required. In practice, it is difficult to determine
the appropriate number of switching states, which often requires a lot of training data
or complicated manual adjustment.
Compared with action, activity not only lasts a long time, but also most of the
activity applications that people pay attention to, such as monitoring and content
based indexing, include multiple action people. Their activities interact not only with
each other but also with contextual entities. In order to model complex scenes, it is
necessary to represent and infer the intrinsic structure and semantics of complex
behaviors at a high level.
Fig. 11.7 Diagram showing the probability Petri net of car picking up activities
DBN is more general than HMM if the dependence between multiple random
variables is considered. However, in DBN, the time model is also a Markov model as
in HMM, so the basic DBN model can only deal with the behavior of sequences. The
development of graph models for learning and reasoning enables them to model
structured behavior. However, for large-scale network, learning local CPD often
requires a lot of training data or complicated manual adjustment by experts, both of
which have brought certain restrictions on the use of DBN in large-scale
environments.
(2) Petri net
Petri net is a mathematical tool to describe the relationship between conditions
and events. It is especially suitable for modeling and visualizing behaviors such as
sequencing, concurrency, synchronization, and resource sharing. A Petri net is a bipartite graph containing two kinds of nodes, locations (places) and transitions, where a location refers to the state of an entity and a transition refers to a change of that state. Consider an example of using a probabilistic Petri net to represent a
car pickup activity, as shown in Fig. 11.7. In the figure, the position marks are p1, p2, p3, p4, and p5, and the transition marks are t1, t2, t3, t4, t5, and t6. In this Petri net, p1 and p3 are the starting nodes and p5 is the ending node. A car enters the scene and puts a token in position p1. Transition t1 can be enabled at this time, but it will not fire until the relevant condition (i.e., the car is parked in the nearby parking space) is met. At that point, the token at p1 is removed and placed at p2. Similarly, when a person enters the parking space, a token is placed at p3, and the corresponding transition fires after the person leaves the parked car. The token is then removed from p3 and placed at p4.
Now a token is present at each input place of transition t6, so that t6 can fire once the relevant condition (here, the car leaving the parking space) is met. Once the car leaves, t6 fires, these tokens are removed, and a token is placed at the final place p5. In this example, sequencing, concurrency, and synchronization all occur.
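The token play of this example can be mimicked with a very small data structure. The sketch below is hypothetical code (the condition flags stand in for real detections, and the person-related transitions t2 to t5 are abbreviated); it encodes the places p1 to p5, the transitions t1 and t6, and fires a transition only when every input place holds a token and the attached condition is satisfied.

```python
# Minimal Petri-net sketch for the car-pickup example (illustrative only).
tokens = {"p1": 0, "p2": 0, "p3": 0, "p4": 0, "p5": 0}

# transition name -> (input places, output places, condition flag)
transitions = {
    "t1": (["p1"], ["p2"], "car_parked_nearby"),
    "t6": (["p2", "p4"], ["p5"], "car_leaves_space"),
}
conditions = {"car_parked_nearby": True, "car_leaves_space": True}

def fire(name):
    """Fire a transition if all input places hold a token and its condition is met."""
    inputs, outputs, cond = transitions[name]
    if all(tokens[p] > 0 for p in inputs) and conditions[cond]:
        for p in inputs:
            tokens[p] -= 1          # remove tokens from the input places
        for p in outputs:
            tokens[p] += 1          # place tokens in the output places
        return True
    return False

tokens["p1"] += 1   # a car enters the scene
fire("t1")          # the car parks: the token moves from p1 to p2
tokens["p4"] += 1   # person-related transitions (t2-t5, abbreviated) have moved a token to p4
fire("t6")          # the car leaves: a token is placed at the final place p5
print(tokens)       # {'p1': 0, 'p2': 0, 'p3': 0, 'p4': 0, 'p5': 1}
```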
Petri nets have been used to develop systems for the high-level interpretation of image sequences. In such systems, the structure of the Petri net needs to be determined in advance, which is very complicated work for large networks representing complex activities. This work can be semi-automated by automatically mapping a small set of logical, spatial, and temporal operations to the graph structure. With this method, interactive tools for querying video surveillance can be developed.
The synthesis method is mainly realized with the help of grammatical concepts and
rules.
Grammar
Subsequently, context-free grammar (CFG) was applied to model and recognize human motions and multi-person interactions. A hierarchical process is used here: at the low level, HMMs and BNs are combined, and at the high level, a CFG is used to model the interaction. The context-free grammar approach has a strong theoretical basis and can model structured processes. In the synthesis method, one only needs to enumerate the primitive events to be detected and to define the production rules of the high-level activities. Once the rules of the CFG are constructed, existing parsing algorithms can be used.
Because deterministic grammars expect very high accuracy at the low level, they are not suitable for situations where tracking errors and missed observations introduce errors at the low level. In complex scenarios with multiple temporal relations (such as parallelism, overlap, synchronization, etc.), it is often difficult to build grammar rules manually. Learning grammar rules from training data is a promising alternative, but it has proved to be very difficult in general situations.
Stochastic Grammar
An example of such production rules (with attached attribute constraints) for a boarding activity is:

S → BOARDING
BOARDING → appear0 CHECK1 disappear1
    [isPerson(appear.class) ∧ isInside(appear.loc, Gate) ∧ isInside(disappear.loc, Plane)]
CHECK → moveclose0 CHECK1
CHECK → moveaway0 CHECK1
CHECK → moveclose0 moveaway1 CHECK1
    [isPerson(moveclose.class) ∧ moveclose.idr = moveaway.idr]
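As a toy illustration of how such productions can be checked against a stream of detected primitive events, the following sketch accepts sequences of the form appear (moveclose | moveaway)* disappear. It ignores the attribute constraints and assumes, for simplicity, that CHECK may terminate with the empty string, which the excerpt above does not state; in a stochastic grammar each production would in addition carry a probability.

```python
def matches_boarding(events):
    """Return True if the primitive-event sequence fits the simplified BOARDING pattern."""
    if len(events) < 2 or events[0] != "appear" or events[-1] != "disappear":
        return False
    # CHECK expands to any interleaving of 'moveclose' and 'moveaway' events.
    return all(e in ("moveclose", "moveaway") for e in events[1:-1])

print(matches_boarding(["appear", "moveclose", "moveaway", "moveclose", "disappear"]))  # True
print(matches_boarding(["appear", "disappear", "moveclose"]))                           # False
```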
Logic-Based Approach
Skeleton joint point data correspond to higher-level features of the human body and are not easily affected by the appearance of the scene. In addition, they can better avoid noise caused by background occlusion, lighting changes, and viewing-angle changes. At the same time, they are also very efficient in terms of computation and storage.
Joint point data are usually represented as a sequence of coordinate vectors of points (4D points in space-time). That is, a joint (articulation point) can be represented by a 5D function J(l, x, y, z, t), where l is the label, (x, y, z) are the spatial coordinates, and t is the time coordinate. In different deep learning networks and algorithms, joint point data are often represented in different forms (such as pseudo-images, vector sequences, and topological graphs).
Research on joint data based on deep learning mainly involves three aspects: the data processing method, the network architecture, and the data fusion method. Among them, the data processing method mainly concerns whether to perform preprocessing and data denoising, and the methods used for data fusion are relatively consistent. At present, more attention is paid to the network architecture, and three architectures are commonly used: convolutional neural networks (CNN), recurrent neural networks (RNN), and graph convolutional networks (GCN). Their corresponding representations of the joint point data are pseudo-images, vector sequences, and topological graphs, respectively [20].
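The following sketch (with a hypothetical joint count, bone list, and ordering) shows how one skeleton sequence of T frames and J joints can be arranged into the three common input forms: a pseudo-image for a CNN, a vector sequence for an RNN, and a node-feature/adjacency pair for a GCN.

```python
import numpy as np

T, J = 64, 25                       # frames and joints (e.g., a 25-joint skeleton; illustrative)
joints = np.random.rand(T, J, 3)    # (x, y, z) coordinates per joint per frame

# 1) Pseudo-image for a CNN: channels = coordinates, height = time, width = joints.
pseudo_image = joints.transpose(2, 0, 1)          # shape (3, T, J)

# 2) Vector sequence for an RNN: one flattened joint vector per time step.
vector_sequence = joints.reshape(T, J * 3)        # shape (T, 75)

# 3) Topological graph for a GCN: node features per frame plus a joint adjacency matrix.
adjacency = np.zeros((J, J))
bones = [(0, 1), (1, 2), (2, 3)]                  # a few example bone connections (hypothetical)
for i, j in bones:
    adjacency[i, j] = adjacency[j, i] = 1.0
node_features = joints                            # shape (T, J, 3)

print(pseudo_image.shape, vector_sequence.shape, adjacency.shape)
```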
Recurrent neural networks (RNNs) can process sequence data of varying lengths. Behavior recognition methods based on RNNs first represent the joint point data as a vector sequence, which contains the position information of all relevant joints at each step of a time (state) sequence; the vector sequence is then fed into a behavior recognition network with an RNN as the backbone. The long short-term memory (LSTM) model is a variant of the RNN whose cell state can decide which temporal information should be retained and which should be forgotten, giving it greater advantages in processing time-series data such as joint-point videos.
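A minimal PyTorch sketch of such an RNN-based recognizer is given below; the layer sizes, the number of joints, and the number of behavior classes are arbitrary assumptions rather than values taken from the works cited in the tables. The joint data enter as a vector sequence, and the final LSTM hidden state is classified.

```python
import torch
import torch.nn as nn

class SkeletonLSTM(nn.Module):
    """LSTM backbone over a vector sequence of joint coordinates (illustrative sizes)."""
    def __init__(self, num_joints=25, num_classes=10, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_joints * 3, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):            # x: (batch, T, num_joints * 3)
        _, (h_n, _) = self.lstm(x)   # h_n: (num_layers, batch, hidden)
        return self.fc(h_n[-1])      # classify from the final hidden state of the last layer

model = SkeletonLSTM()
dummy = torch.randn(4, 64, 25 * 3)   # a batch of 4 sequences, 64 frames each
logits = model(dummy)                # (4, 10) class scores
```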
Table 11.4 briefly summarizes some recent techniques for behavior recognition
using LSTMs as network structures.
Joint point-based behavior recognition can also use hybrid networks, making full use of the feature extraction capabilities of CNNs and GCNs in the spatial domain and the advantages of RNNs in time-series classification. In this case, the original joint point data should be represented in the data format required by each hybrid network.
Table 11.6 briefly summarizes some recent techniques using hybrid networks.
Based on the detection and recognition of actions and activities, activities can be analyzed automatically, and an interpretation and judgment of the scene can be established. Automated activity analysis is a broad term, within which the detection of abnormal events is an important task.
Once the scenario model is established, the behavior and activities of the object can
be analyzed. A basic function of surveillance video is to verify events of interest.
1. Intrusion detection (virtual fence): The most basic task is to determine whether a moving object appears within the monitored range. This is equivalent to establishing a virtual fence at the boundary of the monitoring range and triggering analysis once there is an intrusion, such as controlling a high-resolution pan-tilt-zoom (PTZ) camera to obtain the details of the intrusion and starting to count the number of intrusions.
2. Speed analysis: The virtual fence only uses location information; with the help of tracking technology, dynamic information can also be obtained to realize speed-based early warning, for example, for vehicle speeding or road congestion.
3. Path classification: Speed analysis only utilizes the currently tracked data; in practice, activity paths (AP) obtained from historical motion patterns can also be used. The behavior of a newly appearing object can be described with the aid of a maximum a posteriori (MAP) path,

l* = arg max_k p(l_k | G) = arg max_k p(G | l_k) p(l_k),

which helps determine which activity path best explains the new trajectory data G. Since the prior path distribution p(l_k) can be estimated from the training set, the problem reduces to maximum likelihood estimation using the HMM.
4. Abnormality detection (anomaly detection): The detection of abnormal events is often an important task of a monitoring system. Because the activity paths describe typical activities, an anomaly can be detected when a new trajectory does not match any existing path. Anomalous patterns can be detected with intelligent thresholding: a new trajectory G is judged anomalous when even the activity path l* most similar to G yields a match value that is still smaller than the threshold L_l (a minimal scoring sketch is given after this list).
5. Online activity analysis: Being able to analyze, identify, and evaluate activities online is more important than describing the movement only after the entire trajectory is available. A real-time system needs to be able to quickly reason about what is happening from incomplete data (often with the help of graph models). Two cases are considered here:
(a) Path prediction: The tracking data obtained so far can be used to predict future behavior, and the prediction can be refined as more data are collected. Predicting the activity from an incomplete trajectory can be represented as

L = arg max_j p(l_j | W_t, G_{t+k})  (11.16)
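The following sketch illustrates the scoring logic of items 3 and 4 only. The paths, priors, and threshold are made-up values, and a simple Gaussian distance serves as a stand-in for the per-path HMM likelihoods mentioned in the text.

```python
import numpy as np

# Learned activity paths, each summarized here by a mean trajectory (a stand-in for an HMM).
paths = {
    "enter_gate": np.linspace([0, 0], [10, 0], 20),
    "cross_lot":  np.linspace([0, 0], [0, 10], 20),
}
priors = {"enter_gate": 0.7, "cross_lot": 0.3}
threshold = -50.0    # hypothetical log-likelihood threshold L

def log_likelihood(traj, path, sigma=1.0):
    """Stand-in likelihood: Gaussian distance between the trajectory and the path points."""
    return -np.sum((traj - path) ** 2) / (2 * sigma ** 2)

def classify(traj):
    # MAP path: combine the path prior with the (stand-in) likelihood.
    scores = {k: log_likelihood(traj, p) + np.log(priors[k]) for k, p in paths.items()}
    best = max(scores, key=scores.get)
    is_anomaly = scores[best] < threshold    # even the best path explains the data poorly
    return best, is_anomaly

new_traj = np.linspace([0, 0], [9, 1], 20) + np.random.default_rng(0).normal(0, 0.2, (20, 2))
print(classify(new_traj))
```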
Intuitively, anomalies are defined relative to what is normal. However, the definition of normal can change with time, environment, purpose, conditions, and so on. In particular, normal and abnormal are relatively subjective concepts, so abnormal events often cannot be defined objectively and quantitatively.
The detection of abnormal events is mostly carried out with the help of video, so it is often called video anomaly detection (VAD), or video anomaly detection and localization (VADL) when it is emphasized that one must not only detect the abnormal events appearing in the video but also identify where in the video they occur.
Video abnormal event detection can be divided into two parts: video feature extraction and the establishment of an abnormal event detection model. Commonly used video features are mainly divided into hand-designed features and features extracted by deep models. Video abnormal event detection models can be divided into models based on traditional probabilistic reasoning and models based on deep learning. There are therefore various schemes for classifying abnormal event detection methods. In the following, they are divided into methods based on traditional machine learning and methods based on deep learning, and into methods with supervised learning and methods with unsupervised learning.
Table 11.7 Classification of abnormal event detection methods from a development perspective

Method category / input model: discriminate criterion
Traditional machine learning
  Point model: cluster discrimination; co-occurrence discrimination; reconstruction discrimination; other discriminations
  Sequence model: generative probabilistic discrimination
  Graph model: graph inference discrimination; graph structure discrimination
  Composite model
Deep learning
  Point model: cluster discrimination; reconstruction discrimination; joint discrimination
  Sequence model: prediction error discrimination
  Composite model
Hybrid learning
  Point model: cluster discrimination; reconstruction discrimination; other discriminations
Point Model
There are presently five types of anomaly discrimination criteria: (1) cluster discrimination (according to the distribution of feature points in the feature space, points far from the cluster centers, points belonging to small clusters, or points with low distribution probability density are judged as abnormal); (2) co-occurrence discrimination (according to the probability of co-occurrence between feature points and normal samples, feature points with a low probability of co-occurring with normal samples are judged as abnormal); (3) reconstruction discrimination (a low-dimensional subspace/manifold is used to describe the distribution of normal feature points in the feature space, and abnormality is then judged by the reconstruction error, i.e., the distance from a feature point to the normal-sample subspace/manifold); (4) joint discrimination (the model uses the above three types of discrimination jointly); and (5) other discriminations (including hypothesis-testing discrimination, semantic-analysis discrimination, etc.).
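As a concrete instance of criterion (3), the sketch below fits a low-dimensional subspace to the feature points of normal samples with PCA (scikit-learn) and scores new points by their reconstruction error; the data, the subspace dimension, and the 95th-percentile threshold are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normal_features = rng.normal(0, 1, (500, 64))            # feature points of normal samples

pca = PCA(n_components=8).fit(normal_features)           # low-dimensional subspace of normal data

def reconstruction_error(x):
    """Distance from feature points to the normal-sample subspace."""
    recon = pca.inverse_transform(pca.transform(x))
    return np.linalg.norm(x - recon, axis=1)

# Threshold chosen from the errors of the normal training samples themselves.
tau = np.percentile(reconstruction_error(normal_features), 95)

test = rng.normal(0, 1, (5, 64)) + 4.0                    # shifted samples standing in for anomalies
print(reconstruction_error(test) > tau)                   # True -> judged abnormal
```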
Sequence Model
The sequence model mainly uses prediction error discrimination: the degree to which the input sequence obeys the normal transition law is judged according to the prediction error, and samples with large prediction errors are judged as abnormal.
Graph Model
From the perspective of abnormal event detection technology, if there are clear boundaries between normal and abnormal events and corresponding samples of both exist, supervised learning can be used to classify them. If there is no prior knowledge of normal and abnormal events and only the clustering distribution of the event samples is considered, unsupervised learning is needed. If abnormal events are defined as all events except normal events, only the prior knowledge of normal events is used for training: normal samples are used to learn the pattern of normal events, and all samples that do not obey the normal pattern are then judged as abnormal; this is the semi-supervised learning technique. Of course, these technologies can also be combined at different levels, which is generally referred to as integrated technology. The results of classification from this perspective are shown in Table 11.8 [43].
In Table 11.8, the division of semi-supervised learning techniques is relatively fine; in fact, there are many studies on semi-supervised learning techniques. In practical applications, on the one hand, normal samples are easier to obtain than abnormal samples, so semi-supervised learning techniques are easier to apply than supervised ones; on the other hand, since prior knowledge of normal events is used, semi-supervised learning techniques perform better than unsupervised ones. For these reasons, researchers pay more attention to semi-supervised learning techniques, and more work has been done on them.
For the complete representation and description of video events, multiple features
are often required. The fusion of multiple features has stronger representative power
Table 11.8 Classification of abnormal event detection methods from a technical perspective

Method category / specific technology: main points
Supervised learning
  Binary classification: support vector machines
  Multi-example learning: multiple networks
Unsupervised learning
  Hypothesis test method: binary classifier
  Unmask method: binary classifier
Semi-supervised learning
  Traditional machine learning
    Distance method: one-class classifier
    KNN method
    Probabilistic method: distribution probability; Bayesian probability
    Reconstruction error method: sparse coding
  Deep learning
    Deep distance method: deep one-class classification
    Deep KNN method
    Deep probabilistic method: autoregressive network; variational auto-encoder; generative adversarial networks
    Deep generative error method
Integrated learning
  Weighted sum method: deep generative network
  Sorting method: multiple detectors
  Cascade method: multiple detectors
Fig. 11.11 The flowchart of video abnormal event detection using convolutional auto-encoder
For the optical flow features and the HOG features of each block, an anomaly detection convolutional auto-encoder (AD-ConvAE) is set up and trained and tested separately. The AD-ConvAE of a block region attends only to the motion in that region of the video frame, and this block-wise learning can capture local features more effectively. During training, the video contains only normal samples, and the AD-ConvAE learns the normal motion pattern of a region through the optical flow and HOG features of the blocks in the video frames. During testing, the optical flow and HOG features of the blocks in the test video frames are fed into the AD-ConvAE for reconstruction, and a weighted reconstruction error is computed from the optical flow reconstruction error and the HOG reconstruction error. If the reconstruction error is large enough, the block is judged to contain an abnormal event. In this way, in addition to detecting the abnormal event, the localization of the abnormal event is also accomplished.
The network structure of the AD-ConvAE includes two parts: an encoder and a decoder. In the encoder part, multiple pairs of convolutional and pooling layers are used to obtain deep features. In the decoder part, multiple pairs of convolution and up-sampling operations are used to reconstruct, from the deep feature representation, an image of the same size as the input image.
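A minimal PyTorch sketch of such an encoder-decoder and of reconstruction-error scoring for one block is shown below; the layer counts, channel sizes, and block size are arbitrary assumptions and not those of the AD-ConvAE in [44].

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    """Small convolutional auto-encoder for fixed-size feature blocks (illustrative)."""
    def __init__(self, channels=2):                      # e.g., optical-flow (u, v) channels
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(), nn.Upsample(scale_factor=2),
            nn.Conv2d(16, channels, 3, padding=1), nn.Upsample(scale_factor=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))             # output has the same size as the input

def anomaly_score(model, block):
    """Reconstruction error of one block; large values suggest an abnormal event."""
    with torch.no_grad():
        recon = model(block)
    return torch.mean((block - recon) ** 2).item()

model = ConvAE()
block = torch.randn(1, 2, 32, 32)                        # one 32x32 block of flow-like features
print(anomaly_score(model, block))
```

In the method described above, two such auto-encoders per block (one for optical flow, one for HOG features) would be trained on normal samples only, and their reconstruction errors combined with weights before thresholding.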
Fig. 11.12 The flowchart of video abnormal event detection using ONN
An ONN can be regarded as a neural network designed using a loss function equivalent to that of the OC-SVM. In an ONN, the data representation in the hidden layer is driven directly by the one-class objective, so it can be tailored to the anomaly detection task, combining the two stages of feature extraction and anomaly detection into a joint optimization. The ONN combines the layer-by-layer representation ability of the auto-encoder with one-class classification, which makes it possible to distinguish normal samples from abnormal samples.
In the video anomaly detection method based on ONN [45], ONNs are trained separately on equally sized local region blocks of the video frames and of the optical flow maps to detect appearance anomalies and motion anomalies, and the two are fused to obtain the final detection result. The flowchart of video anomaly detection using ONN is shown in Fig. 11.12. In the training phase, two auto-encoder networks are learned from the RGB images and the optical flow images of the training samples, respectively, and the encoder layers of the pre-trained auto-encoders are combined with the ONN to optimize the parameters and learn the anomaly detection models. In the test phase, the RGB image and the optical flow image of the test region are input into the appearance anomaly detection model and the motion anomaly detection model, respectively; the output scores are fused, and a detection threshold is set to judge whether the region is abnormal.
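The sketch below illustrates a one-class training objective in the spirit of ONN/OC-SVM; the encoder, the value of nu, and the quantile update of the bias r are illustrative choices rather than the exact formulation of [45]. Features g(Vx) are mapped to a score w.g(Vx), and the loss pushes the scores of normal samples above r while the hinge penalty and the -r term balance how large r can become.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 64), nn.Sigmoid())  # g(Vx): hidden layer
w = nn.Linear(64, 1, bias=False)                                             # linear scoring layer
opt = torch.optim.Adam(list(encoder.parameters()) + list(w.parameters()), lr=1e-3)

nu, r = 0.1, 0.0
normal_batch = torch.rand(128, 1, 32, 32)          # training uses normal samples only

for epoch in range(5):
    scores = w(encoder(normal_batch)).squeeze(1)   # score w . g(Vx) for each sample
    hinge = torch.clamp(r - scores, min=0.0)       # penalize normal samples scoring below r
    reg = 0.5 * sum((p ** 2).sum() for p in w.parameters())
    loss = reg + hinge.mean() / nu - r             # simplified one-class objective
    opt.zero_grad(); loss.backward(); opt.step()
    r = torch.quantile(scores.detach(), nu).item() # update r as the nu-quantile of the scores

def is_abnormal(x):
    """Test-time decision: scores below r are judged abnormal."""
    with torch.no_grad():
        return w(encoder(x)).squeeze(1) < r

print(is_abnormal(torch.rand(4, 1, 32, 32)))
```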
References
12. Afza F, Khan MA, Sharif M, et al. (2021) A framework of human action recognition using length control features fusion and weighted entropy-variances based feature selection. Image and Vision Computing 106: #104090 (DOI: https://doi.org/10.1016/j.imavis.2020.104090).
13. Jia HX, Zhang Y-J (2009) Automatic people counting based on machine learning in intelligent
video surveillance. Video Engineering (4): 78-81.
14. Xu CL, Hsieh SH, Xiong CM, et al. (2015) Can humans fly? Action understanding with
multiple classes of actors. Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition 2264-2273.
15. Xu CL, Corso JJ (2016) Actor-action semantic segmentation with grouping process models.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 3083-3092.
16. Turaga P, Chellappa R, Subrahmanian VS, et al. (2008) Machine recognition of human
activities: A survey. IEEE-CSVT 18(11): 1473-1488.
17. Zhao JW, Li XP, Zhang WG (2023) Vehicle trajectory prediction method based on modeling of multi-agent interaction behavior. CAAI Transactions on Intelligent Systems 18(3): 480-488.
18. Lu YX, Xu GH, Tang B (2023) Worker behavior recognition based on temporal and spatial self-attention of vision Transformer. Journal of Zhejiang University (Engineering Science) 57(3): 446-454.
19. Jiang HY, Han J (2023) Behavior recognition based on improved spatiotemporal heterogeneous two-stream network. Computer Engineering and Design 44(7): 2163-2168.
20. Liu Y, Xue PP, Li H, et al. (2021) A review of action recognition using joints based on deep
learning. Journal of Electronics and Information Technology 43(6): 1789-1802.
21. Ji XF, Qin LL, Wang YY (2019) Human interaction recognition based on RGB and skeleton
data fusion model. Journal of Computer Applications 39(11): 3349-3354.
22. Yan A, Wang YL, Li ZF, et al. (2019) PA3D: Pose-action 3D machine for video recognition.
IEEE Conference on Computer Vision and Pattern Recognition 7922-7931.
23. Caetano C, Bremond F, Schwartz WR (2019) Skeleton image representation for 3D action
recognition based on tree structure and reference joints. SIBGRAPI Conference on Graphics,
Patterns and Images 16-23.
24. Caetano C, Sena J, Bremond F, et al. (2019) SkeleMotion: A new representation of skeleton
joint sequences based on motion information for 3D action recognition. IEEE International
Conference on Advanced Video and Signal Based Surveillance 1-8.
25. Li YS, Xia RJ, Liu X, et al. (2019) Learning shape motion representations from geometric algebra spatiotemporal model for skeleton-based action recognition. IEEE International Conference on Multimedia and Expo 1066-1071.
26. Liu J, Shahroudy A, Xu D, et al. (2017) Skeleton-based action recognition using spatiotemporal LSTM network with trust gates. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12): 3007-3021.
27. Liu J, Wang G, Hu P, et al. (2017) Global context-aware attention LSTM networks for 3D action recognition. IEEE Conference on Computer Vision and Pattern Recognition 1647-1656.
28. Liu J, Wang G, Duan LY, et al. (2018) Skeleton-based human action recognition with global
context-aware attention LSTM networks. IEEE Transactions on Image Processing 27(4):
1586-1599.
29. Zheng W, Li L, Zhang ZX, et al. (2019) Relational network for skeleton-based action recognition. IEEE International Conference on Multimedia and Expo 826-831.
30. Li MS, Chen SH, Chen X, et al. (2019) Actional-structural graph convolutional networks for skeleton-based action recognition. IEEE Conference on Computer Vision and Pattern Recognition 3595-3603.
31. Peng W, Hong XP, Chen HY, et al. (2019) Learning graph convolutional network for skeleton-based human action recognition by neural searching. arXiv preprint, arXiv: 1911.04131.
32. Wu C, Wu XJ, Kittler J (2019) Spatial residual layer and dense connection block enhanced
spatial temporal graph convolutional network for skeleton-based action recognition. IEEE
International Conference on Computer Vision Workshop 1-5.
33. Shi L, Zhang YF, Cheng J, et al. (2019) Skeleton-based action recognition with directed graph
neural networks. IEEE Conference on Computer Vision and Pattern Recognition 7912-7921.
34. Li MS, Chen SH, Chen X, et al. (2019) Symbiotic graph neural networks for 3D skeleton-based
human action recognition and motion prediction. arXiv preprint, arXiv: 1910.02212.
35. Yang HY, Gu YZ, Zhu JC, et al. (2020) PGCNTCA: Pseudo graph convolutional network with temporal and channel-wise attention for skeleton-based action recognition. IEEE Access 8: 10040-10047.
36. Zhang PF, Lan CL, Xing JL, et al. (2019) View adaptive neural networks for high performance
skeleton-based human action recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence 41(8): 1963-1978.
37. Hu GY, Cui B, Yu S (2019) Skeleton-based action recognition with synchronous local and
non-local spatiotemporal learning and frequency attention. IEEE International Conference on
Multimedia and Expo 1216-1221.
38. Si CY, Chen WT, Wang W, et al. (2019) An attention enhanced graph convolutional LSTM
network for skeleton-based action recognition. IEEE Conference on Computer Vision and
Pattern Recognition 1227-1236.
39. Gao JL, He T, Zhou X, et al. (2019) Focusing and diffusion: Bidirectional attentive graph
convolutional networks for skeleton-based action recognition. arXiv preprint, arXiv:
1912.11521.
40. Zhang PF, Lan CL, Zeng WJ, et al. (2020) Semantics-guided neural networks for efficient
skeleton-based human action recognition. IEEE Conference on Computer Vision and Pattern
Recognition 1109-1118.
41. Zhang WL, Qi H, Li S (2022) Application of spatial temporal graph convolutional networks in human abnormal behavior recognition. Computer Engineering and Application 58(12): 122-131.
42. Wang ZG, Zhang Y-J (2020) Anomaly detection in surveillance videos: A survey. Journal of
Tsinghua University (Science & Technology) 60(6): 518-529.
43. Wang ZG (2022) Researches on Semi-Supervised and Deep Generative Model-Based Surveillance Video Anomaly Detection Algorithms (Dissertation). Tsinghua University.
44. Li XL, Ji GL, Zhao B (2021) Convolutional auto-encoder patch learning based video anomaly
event detection and localization. Journal of Data Acquisition and Processing 36(3): 489-497.
45. Jiang WY, Li G (2021) One-class neural network for video anomaly detection and localization.
Journal of Electronic Measurement and Instrumentation 35(7): 60-65.
Index
D
Data augmentation, 35
Declarative model, 446
Decoding, 440
Deep iterative learning methods, 371
Deep learning, 7, 413
Deeply supervised object detector (DSOD), 37
Deep photometric stereo network (DPSN), 279
Deep transformation learning methods, 371
Depth image, 86
Depth imaging, 88
Depth map, 86
Depth of focus, 316
Difference of Gaussian (DoG), 182
Diffuse reflection surface, 245
Digital micro-mirror device (DMD), 117, 119
Dilated convolution, 37
Diplopic, 124
Direct depth imaging, 89

F
Factored-state hierarchical HMM (FS-HHMM), 424
Factorial CRF (FCRF), 424
Fast point feature histogram (FPFH), 152
Feature adjacency graph, 336
Feature cascaded convolutional neural networks, 199
Feature map, 31
Feature point method, 394
Feature points, 177
Field of view (FOV), 18, 112, 404
Fitting, 335
Fixed-point iterative sweeping, 325
Focus of expansion (FOE), 273
Forward mapping, 149
Fourier single-pixel imaging (FSI), 122
4-point mapping, 47
Front-end matching, 379
G
Gaze change, 20
Gaze control, 20
Gaze stabilization, 20
Generalization of the CRF, 424
Generalized compact non-local networks, 36
Generalized matching, 334
Generalized unified model (GUM), 232
Generative adversarial networks (GAN), 37, 198, 280
Generative model, 424
Geometric feature-based methods, 139
Geometric hashing, 169
Geometric realization, 359
Geometric representation, 359
Geometric texture, 148
Geometric texture mapping, 150
Geoscience laser altimeter system (GLAS), 92
Global feature descriptor, 152
Global positioning system (GPS), 91
GMapping algorithm, 385
Gradient space, 250, 290
Graph, 192, 358
Graph convolutional networks (GCN), 374, 448
Graph isomorphism, 363
Gray-scale smooth region, 209
Gray value, 44
Gray value range, 44

H
Hardware implementation, 9
Harris interest point detector, 426, 439
Hausdorff distance (HD), 338
Helmet mounted display (HMD), 123
Hessian matrix, 184
Hidden line, 307
Hidden Markov model (HMM), 440
Hierarchical variable transition hidden Markov model (HVT-HMM), 424
Holistic recognition, 425
Homogeneity assumption, 309
Homogeneous coordinate, 45
Homogeneous vector, 45
Homography matrix, 74
Horopter, 124
Human intelligence, 6
Human stereoscopic vision, 123

I
Ideal scattering surface, 245
Ideal specular reflecting surface, 277
Identical, 361
Illuminance, 43, 240
Illumination component, 43
Image analysis, 6, 25
Image brightness constraint equation, 251, 290, 293, 325, 327
Image coordinate system, 45
Image coordinate system in computer, 45
Image engineering, 5, 25, 27
Image flow, 258
Image flow equation, 260
Image matching, 334
Image processing, 5, 25
Image pyramid network, 197
Image rectification, 113
Image-to-image, 382, 396
Image-to-map, 382
Image understanding, 6, 25
Imaginary point, 302
Incident, 359
Indirect depth imaging, 105
Induced subgraph, 360
Inertia equivalent ellipse, 343
Inertial measurement unit (IMU), 91, 394, 399, 414
Inertial navigation system (INS), 91
In general position, 366
Internal parameter, 57
Intrinsic image, 86
Intrinsic property, 86
Intrinsic shape signatures (ISS), 159
Inverse distance, 206
Inverse perspective, 319
Inverse perspective transformation, 48
Irradiance, 240
Isomorphism, 362
Isotropy assumption, 309
Isotropy radiation surface, 251
Iterative closest point (ICP), 138, 379
Iterative closest point registration, 139
Iterative fast marching, 330

J
Jacobi over-relaxation (JOR), 414
Join, 359
U
Underlying simple graph, 360
U-Net, 37
Uniqueness constraint, 171
Unsupervised learning, 457

V
Vanishing line, 307
Vanishing point, 302, 307

W
Ward reflection model, 278, 322
Window, 166, 218
World coordinate system, 45

Z
Zero-cross correction algorithm, 195
Zero-crossing pattern, 178
Zone of clear single binocular vision (ZCSBV), 124