Machine Vision
This book is an accessible and comprehensive introduction to machine vision. It provides all
the necessary theoretical tools and shows how they are applied in actual image processing
and machine vision systems. A key feature is the inclusion of many programming exercises
that give insights into the development of practical image processing algorithms.
The authors begin with a review of mathematical principles and go on to discuss key
issues in image processing such as the description and characterization of images, edge
detection, feature extraction, segmentation, texture, and shape. They also discuss image
matching, statistical pattern recognition, syntactic pattern recognition, clustering, diffusion,
adaptive contours, parametric transforms, and consistent labeling. Important applications
are described, including automatic target recognition. Two recurrent themes in the book are
consistency (a principal philosophical construct for solving machine vision problems) and
optimization (the mathematical tool used to implement those methods).
Software and data used in the book can be found at www.cambridge.org/9780521830461.
The book is aimed at graduate students in electrical engineering, computer science,
and mathematics. It will also be a useful reference for practitioners.
Wesley E. Snyder received his Ph.D. from the University of Illinois, and is currently
Professor of Electrical and Computer Engineering at North Carolina State University. He has
written over 100 scientific papers and is the author of the book Industrial Robots. He was
a founder of both the IEEE Robotics and Automation Society and the IEEE Neural
Networks Council. He has served as an advisor to the National Science Foundation,
NASA, Sandia Laboratories, and the US Army Research Office.
Hairong Qi received her Ph.D. from North Carolina State University and is currently an
Assistant Professor of Electrical and Computer Engineering at the University of
Tennessee, Knoxville.
MACHINE VISION
Wesley E. Snyder
North Carolina State University, Raleigh
Hairong Qi
University of Tennessee, Knoxville
Cambridge University Press
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore,
São Paulo, Delhi, Dubai, Tokyo, Mexico City
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521169813
A catalogue record for this publication is available from the British Library
1 Introduction
8 Segmentation
9 Shape
15 Clustering
17 Applications
Sample syllabus.
Lecture | Topics | Assignment (time) | Reading assignment
1 | Introduction, terminology, operations on images, pattern classification and computer vision, image formation, resolution, dynamic range, pixels | 2.2–2.5 and 2.9 (1) | Read Chapter 2. Convince yourself that you have the background for this course
2 | The image as a function. Image degradation. Point spread function. Restoration | 3.1 (1) | Chapters 1 and 3
3 | Properties of an image, isophotes, ridges, connectivity | 3.2, 4.1 (2) | Sections 4.1–4.5
4 | Kernel operators: Application of kernels to estimate edge locations | 4.A1, 4.A2 (1) | Sections 5.1 and 5.2
5 | Fitting a function (a biquadratic) to an image. Taking derivatives of vectors to minimize a function | 5.1, 5.2 (1) | Sections 5.3–5.4 (skip hexagonal pixels)
6 | Vector representations of images, image basis functions. Edge detection, Gaussian blur, second and higher derivatives | 5.4, 5.5 (2) and 5.7, 5.8, 5.9 (1) | Sections 5.5 and 5.6 (skip section 5.7)
7 | Introduction to scale space. Discussion of homeworks | 5.10, 5.11 (1) | Section 5.8 (skip section 5.9)
10 | Equivalence of MFA and diffusion | 6.7 and 6.8 (1) | Section 6A.4
17 | 2D shape features, invariant moments, Fourier descriptors, medial axis | 9.2, 9.4, 9.10 (1) | Sections 9.3–9.7
23 | Hough transform, parametric transforms | 11.1 (2) | Sections 11.1, 11.2, 11.3.3
25 | Iconic matching, springs and templates, association graphs | 13.2 and 13.3 (1) | Sections 13.1–13.3
The assignments are projects which must include a formal report. Since there is usu-
ally programming involved, we allow more time to accomplish these assignments –
suggested times are in parentheses in column 3. It is also possible, by careful selec-
tion of the students and the topics, to use this book in an advanced undergraduate
course.
For advanced students, the “Topics” sections of this book should serve as a col-
lection of pointers to the literature. Be sure to emphasize to your students (as we
do in the text) that no textbook can provide the details available in the literature,
and any “real” (that is, for a paying customer) machine vision project will require
the development engineer to go to the published journal and conference literature.
As stated above, the two recurrent themes throughout this book are consistency
and optimization. The concept of consistency occurs throughout the discipline as a
principal philosophical construct for solving machine vision problems. When con-
fronted with a machine vision application, the engineer should seek to find ways to
determine sources of information which are consistent. Optimization is the princi-
pal mathematical tool for solving machine vision problems, including determining
consistency. At the end of each chapter which introduces techniques, we remind the
student where consistency fits into the problems of that chapter, as well as where
and which optimization methods are used.
Acknowledgements
WES
I’d like to express my sincere thanks to Dr. Wesley Snyder for inviting me to coau-
thor this book. I have greatly enjoyed this collaboration and have gained valuable
experience.
The final delivery of the book was scheduled around Christmas when my parents
were visiting me from China. Instead of touring around the city and enjoying the
holidays, they simply stayed with me and supported me through the final submission
of the book. I owe my deepest gratitude to them. And to Feiyi, my forever technical
support and emergency reliever.
HQ
1 Introduction
We have written this book at two levels, the principal level being introductory.
“Introductory” does not mean “easy” or “simple” or “doesn’t require math.” Rather,
the introductory topics are those which need to be mastered before the advanced
topics can be understood.
In addition, the book is intended to be useful as a reference. When you have to
study a topic in more detail than is covered here, in order, for example, to implement a
practical system, we have tried to provide adequate citations to the relevant literature
to get you off to a good start.
Margin note: This is an important observation: This book does NOT have enough
information to tell you how to implement significant large systems. It teaches general
principles. You MUST make use of the literature when you get down to the gnitty gritty.
We have tried to write in a style aimed directly toward the student and in a
conversational tone.
We have also tried to make the text readable and entertaining. Words which are
deluberately missppelled for humorous affects should be ubvious. Some of the humor
runs to exaggeration and to puns; we hope you forgive us.
We did not attempt to cover every topic in the machine vision area. In particu-
lar, nearly all papers in the general areas of optical character recognition and face
recognition have been omitted; not to slight these very important and very success-
ful application areas, but rather because the papers tend to be rather specialized; in
addition, we simply cannot cover everything.
There are two themes which run through this book: consistency and optimization.
Consistency is a conceptual tool, implemented as a variety of algorithms, which helps
machines to recognize images – they fuse information from local measurements to
make global conclusions about the image. Optimization is the mathematical mech-
anism used in virtually every chapter to accomplish the objectives of that chapter,
be they pattern classification or image matching.
1 Ja-Chen Lin and Wen-Hsiang Tsai, “Feature-preserving Clustering of 2-D Data for Two-class Problems Using
Analytical Formulas: An Automatic and Fast Approach,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(5), 1994.
These two topics, consistency and optimization, are so important and so pervasive,
that we point out to the student, in the conclusion of each chapter, exactly where those
concepts turned up in that chapter. So read the chapter conclusions. Who knows, it
might be on a test.
The target audience for this book is graduate students or advanced undergraduates
in electrical engineering, computer engineering, computer science, math, statistics,
or physics. To do the work in this book, you must have had a graduate-level course
in advanced calculus, and in statistics and/or probability. You need either a formal
course or experience in linear algebra.
Margin note: To find out if you meet this criterion, answer the following question: What
do the following words mean? “transpose,” “inverse,” “determinant,” “eigenvalue.” If you do
not have any idea, do not take this course!
Many of the homeworks will be projects of sorts, and will be computer-based.
To complete these assignments, you will need a hardware and software environment
capable of
(1) declaring large arrays (256 × 256) in C
(2) displaying an image
(3) printing an image.
Margin note: You will have to write programs in C (yes, C or C++, not Matlab) to
complete this course.
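As a quick check of requirement (1), the following is a minimal sketch of declaring and filling a 256 × 256 image in C. The names ROWS, COLS, image, fill_ramp, get_pixel, and mean_brightness are hypothetical and chosen only for illustration; displaying and printing the image depend on your environment and are not shown.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* A 256 x 256 8-bit image. It is declared static so that it lives in
   static storage rather than on the stack, which may be small. */
static unsigned char image[ROWS][COLS];

/* Fill the image with a horizontal ramp: pixel value = column index. */
void fill_ramp(void)
{
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            image[r][c] = (unsigned char)c;
}

/* Read one pixel, checking that the indices are inside the image. */
unsigned char get_pixel(size_t r, size_t c)
{
    assert(r < ROWS && c < COLS);
    return image[r][c];
}

/* Mean brightness over all pixels. */
double mean_brightness(void)
{
    unsigned long sum = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            sum += image[r][c];
    return (double)sum / ((double)ROWS * (double)COLS);
}
```

If this compiles and runs in your environment, arrays of the size used in the assignments should not be a problem.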
Students usually confuse machine vision with image processing. In this section, we
define some terminology that will clarify the differences between the contents and
objectives of these two topics.
Enhancement
Enhancement systems perform operations which make the image look better, as
perceived by a human observer. Typical operations include contrast stretching
(including functions like histogram equalization), brightness scaling, edge sharp-
ening, etc.
Coding
Coding is the process of finding efficient and effective ways to represent the infor-
mation in an image. These include quantization methods and redundancy removal.
Coding may also include methods for making the representation robust to bit-errors
which occur when the image is transmitted or stored.
Compression
Compression includes many of the same techniques as coding, but with the specific
objective of reducing the number of bits required to store and/or transmit the image.
Restoration
Restoration concerns itself with fixing what is wrong with the image. It is unlike
enhancement, which is just concerned with making images look better. In order
to “correct” an image, there must be some model of the image degradation. It is
common in restoration applications to assume a deterministic blur operator, followed
by additive random noise.
Reconstruction
Measurement of features
The measurement of features is the principal focus of this book. Except for
Chapters 14 and 15, in this book, we focus on processing the elements of images
(pixels) and from those pixels and collections of pixels, extract sets of measurements
which characterize either the entire image or some component thereof.
Pattern classification
2 Sometimes, CT is referred to as “CAT scanning.” In that case, CAT stands for “computed axial tomography.”
There are other types of tomography as well.
5 1.4 Organization of a machine vision system
example, the set of possible classes might be men and women and one measurement
which we could make to distinguish men from women would be height (clearly,
height is not a very good measurement to use to distinguish men from women, for
if our decision is that anyone over five foot six is male we will surely be wrong in
many instances).
Pattern recognition
Fig. 1.1 shows schematically, at the most basic level, the organization of a machine
vision system. The unknown is first measured and the values of a number of features
are determined. In an industrial application, such features might include the length,
width, and area of the image of the part being measured. Once the features are
measured, their numerical values are passed to a process which implements a decision
rule. This decision rule is typically implemented by a subroutine which performs
calculations to determine to which class the unknown is most likely to belong based
on the measurements made.
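The two-stage organization described above (measure features, then apply a decision rule) can be sketched in a few lines of C. This is only an illustration under stated assumptions: the decision rule used here is a nearest-mean rule, which is one simple choice and is not taken from the text, and the names NUM_FEATURES, sq_dist, and classify are hypothetical.

```c
#include <assert.h>
#include <float.h>

#define NUM_FEATURES 3   /* for example: length, width, area */

/* Squared Euclidean distance between two feature vectors. */
static double sq_dist(const double *a, const double *b)
{
    double d = 0.0;
    for (int i = 0; i < NUM_FEATURES; i++)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

/* Decision rule (assumed: nearest class mean). means holds num_classes
   feature vectors stored one after another. Returns the index of the
   class whose mean is closest to the measured features x. */
int classify(const double *x, const double *means, int num_classes)
{
    assert(num_classes > 0);
    int best = 0;
    double best_d = DBL_MAX;
    for (int k = 0; k < num_classes; k++) {
        double d = sq_dist(x, means + k * NUM_FEATURES);
        if (d < best_d) {
            best_d = d;
            best = k;
        }
    }
    return best;
}
```

The feature measurement stage would fill x from the image; the classifier never looks at pixels directly.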
As Fig. 1.1 illustrates, a machine vision system is really a fairly simple architec-
tural structure. The details of each module may be quite complex, however, and many
different options exist for designing the classifier and the feature measuring system.
In this book, we mention the process of classifier design. However, the process of
determining and measuring features is the principal topic of this book.
The “feature measurement” box can be further broken down into more detailed
operations as illustrated in Fig. 1.2. At that level, the organization chart becomes
more complex because the specific operations to be performed vary with the type
of image and the objective of the tasks. Not every operation is performed in every
application.
[Figure 1.2: block diagram with the blocks Raw data, Noise removal, Segmentation, Shape analysis, Consistency analysis, Matching, and Features.]
Fig. 1.2. Some components of a feature characterization system. Many machine vision
applications do not use every block, and information often flows in other ways. For
example, it is possible to perform matching directly on the image data.
We will pay much more attention to the nature of images in Chapter 4. We will
observe that there are several different types of images as well as several different
ways to represent images. The types of images include what we call “pictures,” that
is, two-dimensional images. In addition, however, we will discuss three-dimensional
images and range images. We will also consider different representations for images,
including iconic, functional, linear, and relational representations.
Margin note: Some equivalent words.
We will learn many different operations to perform on images. The emphasis in this
course is “image analysis,” or “computer vision,” or “machine vision,” or “image
understanding.” All these phrases mean the same thing. We are interested in making
measurements on images with the objective of providing our machine (usually, but
not always, a computer) with the ability to recognize what is in the image. This
process includes several steps:
• denoising – all images are noisy, most are blurred, many have other distortions
as well. These distortions need to be removed or reduced before any further
operations can be carried out. We discuss two general approaches for denoising
in Chapters 6 and 7.
• segmentation – we must segment the image into meaningful regions. Segmentation
is covered in Chapter 8.
• feature extraction – making measurements, geometric or otherwise, on those
regions is discussed in Chapter 9.
2 Review of mathematical principles
Let us imagine a statistical experiment: rolling two dice. It is possible to roll any
number between two and twelve (inclusive), but as we know, some numbers are
more likely than others. To see this, consider the possible ways to roll a five.
We see from Fig. 2.1 that there are four possible ways to roll a five with two dice.
Each event is independent. That is, the chance of rolling a two with the second die
(1 in 6) does not depend at all on what is rolled with die number 1.
Independence of events has an important implication. It means that the joint
probability of the two events is equal to the product of their individual probabilities
and the conditional probabilities:
\[ \Pr(a \mid b)\Pr(b) = \Pr(a)\Pr(b) = \Pr(b \mid a)\Pr(a) = \Pr(a, b). \tag{2.1} \]
In Eq. (2.1), the symbols a and b represent events, e.g., the rolling of a six. Pr (b) is the
probability of such an event occurring, and Pr (a | b) is the conditional probability
of event a occurring, given that event b has occurred.
In Fig. 2.1, we tabulate all the possible ways of rolling two dice, and show the
resulting number of different ways that the numbers from 2 to 12 can occur. We
note that 6 different events can lead to a 7 being rolled. Since each of these events
is equally probable (1 in 36), then a 7 is the most likely roll of two dice. In Fig. 2.2
the information from Fig. 2.1 is presented in graphical form.
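The tabulation of Fig. 2.1 is easy to reproduce by brute force. The following is a minimal sketch in C that enumerates all 36 equally probable outcomes; the function names count_two_dice and prob_of_sum are hypothetical.

```c
#include <assert.h>

/* Count the ways each sum of two dice can occur.
   ways[s] holds the count for sum s, for s = 0..12. */
void count_two_dice(int ways[13])
{
    for (int s = 0; s <= 12; s++)
        ways[s] = 0;
    for (int d1 = 1; d1 <= 6; d1++)
        for (int d2 = 1; d2 <= 6; d2++)
            ways[d1 + d2]++;
}

/* Probability of rolling the sum s: number of ways divided by 36. */
double prob_of_sum(int s)
{
    int ways[13];
    assert(s >= 0 && s <= 12);
    count_two_dice(ways);
    return ways[s] / 36.0;
}
```

Calling count_two_dice and printing ways[2] through ways[12] gives the right-hand column of Fig. 2.1.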
In pattern classification, we are most often interested in the probability of a par-
ticular measurement occurring. We have a problem, however, when we try to plot a
graph such as Fig. 2.2 for a continuously-valued function. For example, how do we
ask the question: “What is the probability that a man is six feet tall?” Clearly, the
answer is zero, for an infinite number of possibilities could occur (we might equally
well ask, “What is the probability that a man is (exactly) 6.314 159 267 feet tall?”).
Still, we know intuitively that the likelihood of a man being six feet tall is higher
than the likelihood of his being ten feet tall. We need some way of quantifying this
intuitive notion of likelihood.
9 2.1 A brief review of probability
Sum | Outcomes | Number of ways
0 | (none) | 0
1 | (none) | 0
2 | 1–1 | 1
3 | 2–1, 1–2 | 2
4 | 1–3, 3–1, 2–2 | 3
5 | 2–3, 3–2, 4–1, 1–4 | 4
6 | 1–5, 5–1, 2–4, 4–2, 3–3 | 5
7 | 3–4, 4–3, 2–5, 5–2, 1–6, 6–1 | 6
8 | 2–6, 6–2, 3–5, 5–3, 4–4 | 5
9 | 3–6, 6–3, 4–5, 5–4 | 4
10 | 4–6, 6–4, 5–5 | 3
11 | 6–5, 5–6 | 2
12 | 6–6 | 1
[Figure 2.2: plot of the number of ways (vertical axis, 0 to 6) against the sum of the two dice (horizontal axis, 0 to 12).]
One question that does make sense is, “What is the probability that a man is
less than six feet tall?” Such a function is referred to as a probability distribution
function
Dividing by Δx and taking the limit as Δx → 0, we see that we may define the
probability density function as the derivative of the distribution function:
\[ p(x) = \frac{d}{dx} P(x). \tag{2.3} \]
Fig. 2.3. The probability distribution of Fig. 2.2, showing the probability of rolling two dice to get
a number LESS than x. Note that the curve is steeper at the more likely numbers.
p(x) has all the properties that we desire. It is well defined for continuously-valued
measurements and it has a maximum value for those values of the measurement
which are intuitively most likely.
Furthermore:
\[ \int_{-\infty}^{\infty} p(x)\,dx = 1, \tag{2.4} \]
In this section, we very briefly review vector and matrix operations. Generally, we
denote vectors in boldface, scalars in lowercase Roman, and matrices in uppercase
Roman.
Margin note: This section will serve more as a reference than a teaching aid, since you
should know this material already.
Vectors are always considered to be column vectors. If we need to write one
horizontally for the purpose of saving space in a document, we use transpose notation.
For example, we denote a vector which consists of three scalar elements as:
\[ v = [x_1 \; x_2 \; x_3]^T. \]
The inner product of two vectors is a scalar, $v = a^T b$. Its value is the sum of products
11 2.2 A review of linear algebra
You will also sometimes see the notation $\langle x, y \rangle$ used for inner product. We do not like
this because it looks like an expected value of a random variable. One sometimes
also sees the “dot product” notation $x \cdot y$ for inner product.
The magnitude of a vector is $|x| = \sqrt{x^T x}$. If $|x| = 1$, x is said to be a “unit vector.”
If $x^T y = 0$, then x and y are “orthogonal.” If x and y are orthogonal unit vectors,
they are “orthonormal.”
The concept of orthogonality can easily be extended to continuous functions by
simply thinking of a function as an infinite-dimensional vector. Just list all the values
of f (x) as x varies between, say, a and b. If x is continuous, then there are an infinite
number of possible values of x between a and b. But that should not stop us – we
cannot enumerate them, but we can still think of a vector containing all the values
of f (x). Now, the concept of summation which we defined for finite-dimensional
vectors turns into integration, and an inner product may be written
\[ \langle f(x), g(x) \rangle = \int_a^b f(x)\, g(x)\, dx. \tag{2.5} \]
The concepts of orthogonality and orthonormality hold for this definition of the
inner product as well. If the integral is equal to zero, we say the two functions are
orthogonal. So the transition from orthogonal vectors to orthogonal functions is not
that difficult. With an infinite number of dimensions, it is impossible to visualize
orthogonal as “perpendicular,” of course, so you need to give up on thinking about
things being perpendicular. Just recall the definition and use it.
Suppose we have n vectors $x_1, x_2, \ldots, x_n$; if we can write $v = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n$,
then v is said to be a “linear combination” of $x_1, x_2, \ldots, x_n$.
A set of vectors $x_1, x_2, \ldots, x_n$ is said to be “linearly independent” if it is impossible
to write any of the vectors as a linear combination of the others.
Given d linearly independent vectors, of d dimensions, $x_1, x_2, \ldots, x_d$ defined on
$\mathbb{R}^d$, then any vector y in the space may be written $y = a_1 x_1 + a_2 x_2 + \cdots + a_d x_d$.
Since any d-dimensional real-valued vector y may be written as a linear combination
of $x_1, \ldots, x_d$, then the set $\{x_i\}$ is called a “basis” set and the vectors are said
to “span the space” $\mathbb{R}^d$. Any linearly independent set of d vectors can be used as a
basis (necessary and sufficient). It is often particularly convenient to choose basis
sets which are orthonormal.
For example, the following two vectors form a basis for $\mathbb{R}^2$
[Figure 2.4: the vector y, the basis vectors x1 and x2, and the projection a1 of y onto x1.]
Fig. 2.4. x1 and x2 are orthonormal bases. The projection of y onto x1 has length a1 .
This is the familiar Cartesian coordinate system. Here’s another basis set for $\mathbb{R}^2$
If $x_1, x_2, \ldots, x_d$ span $\mathbb{R}^d$, and $y = a_1 x_1 + a_2 x_2 + \cdots + a_d x_d$, then the “components”
of y may be found by
\[ a_i = y^T x_i \tag{2.6} \]
and
\[ A A^T = A^T A = I \tag{2.9} \]
then obviously, the transpose of the matrix is the inverse as well, and A is said to
be an “orthonormal transformation” (OT), which will correspond geometrically to
a rotation. If A is a d × d orthonormal transformation, then the columns of A are
orthonormal, linearly independent, and form a basis spanning the space of $\mathbb{R}^d$. For
$\mathbb{R}^3$, three convenient OTs are the rotations about the Cartesian axes:
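Margin note: Some example orthonormal transformations.
\[
R_x = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix}
\quad
R_y = \begin{bmatrix} \cos\theta & 0 & -\sin\theta \\ 0 & 1 & 0 \\ \sin\theta & 0 & \cos\theta \end{bmatrix}
\quad
R_z = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]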
\[ y = x^T A x > 0 \quad \forall x \in \mathbb{R}^d,\; x \neq 0 \]
and is called the “gradient.” This will be often used when we talk about edges in
images, and f (x) will be the brightness as a function of the two spatial directions.
Eq. (2.16) is used. However, remember that the same concepts apply to operators of
arbitrary dimension):
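\[ \operatorname{div} f = \nabla f = \begin{bmatrix} \dfrac{\partial}{\partial x} & \dfrac{\partial}{\partial y} \end{bmatrix} \begin{bmatrix} f_1 \\ f_2 \end{bmatrix} = \frac{\partial f_1}{\partial x} + \frac{\partial f_2}{\partial y}. \tag{2.17} \]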
We will also have opportunity to use the outer product of the del operator with a
matrix:
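\[ \nabla \times f = \begin{bmatrix} \dfrac{\partial}{\partial x} \\[2ex] \dfrac{\partial}{\partial y} \end{bmatrix} \begin{bmatrix} f_1 & f_2 \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x} & \dfrac{\partial f_2}{\partial x} \\[2ex] \dfrac{\partial f_1}{\partial y} & \dfrac{\partial f_2}{\partial y} \end{bmatrix}. \tag{2.18} \]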
\[ A x = \lambda x, \quad \lambda \in \mathbb{R}. \tag{2.19} \]
Minimization of functions is a pervasive element of engineering: One is always trying
to find the set of parameters which minimizes some function of those parameters.
Notationally, we state the problem as: Find the vector x which produces a minimum
of some function H(x):
\[ \hat{H} = \min_{x} H(x) \tag{2.20} \]
Margin note: In this book, essentially EVERY machine vision topic will be discussed in
terms of some sort of minimization, so get used to it!
1 “Eigen-” is the German prefix meaning “own” or “characteristic.” These are NOT named for Mr Eigen.
minimal H as x
Margin note: The authors get VERY annoyed at improper use of the word “optimal.” If
you didn’t solve a formal optimization problem to get your result, you didn’t
come up with the “optimal” anything.
The most straightforward way to minimize a function is to set its derivative to zero:
\[ \nabla H(x) = 0, \tag{2.22} \]
where ∇ is the gradient operator – the set of partial derivatives. Eq. (2.22) results in
a set of equations, one for each element of x, which must be solved simultaneously:
\[
\begin{aligned}
\frac{\partial}{\partial x_1} H(x) &= 0 \\
\frac{\partial}{\partial x_2} H(x) &= 0 \\
&\;\;\vdots \\
\frac{\partial}{\partial x_d} H(x) &= 0.
\end{aligned}
\tag{2.23}
\]
Such an approach is practical only if the system of Eq. (2.23) is solvable. This may
be true if d = 1, or if H is at most quadratic in x.
EXERCISE
Find the vector x = [x1 , x2 , x3 ]T which minimizes
Solution
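\[ \frac{\partial H}{\partial x_1} = 2 a x_1 + b \]
\[ \frac{\partial H}{\partial x_2} = 2 c x_2 \]
\[ \frac{\partial H}{\partial x_3} = 2 d x_3 \]
minimized by
\[ x_3 = x_2 = 0, \quad x_1 = \frac{-b}{2a}. \]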
If H is some function of order higher than two, or is transcendental, the technique
of setting the derivative equal to zero will not work (at least, not in general) and we
must resort to numerical techniques. The first of these is gradient descent.
In one dimension, the utility of the gradient is easy to see. At a point $x^{(k)}$
(Fig. 2.5), the derivative points AWAY FROM the minimum. That is, in one
dimension, its sign will be positive on an “uphill” slope.
17 2.3 Introduction to function minimization
Fig. 2.5. The sign of the derivative is always away from the minimum.
2.3.1 Newton–Raphson
It is not immediately obvious in Eq. (2.25) how to choose the variable α. If α is too
small, the iteration of Eq. (2.25) will take too long to converge. If α is too large, the
algorithm may become unstable and never find the minimum.
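The role of α is easiest to appreciate by experiment. The following is a minimal sketch of one-dimensional gradient descent in C, assuming the iteration has the usual form in which each new estimate is the old one minus α times the derivative. The stopping rule (step size below tol, or max_iter reached) and the example function H(x) = x^4 − 3x are assumptions chosen for illustration; they are not taken from the text.

```c
#include <assert.h>
#include <math.h>

typedef double (*func1d)(double);

/* One-dimensional gradient descent (assumed update form):
       x_next = x - alpha * dH(x)
   dH is the derivative of the function being minimized. The loop stops
   when the step is smaller than tol or after max_iter iterations. */
double gradient_descent(func1d dH, double x0, double alpha,
                        double tol, int max_iter)
{
    assert(alpha > 0.0 && max_iter > 0);
    double x = x0;
    for (int k = 0; k < max_iter; k++) {
        double step = alpha * dH(x);
        x -= step;            /* move against the derivative */
        if (fabs(step) < tol)
            break;
    }
    return x;
}

/* Illustrative function (hypothetical): H(x) = x^4 - 3x, whose
   derivative is 4x^3 - 3 and whose minimum is at x = cbrt(3/4). */
double example_dH(double x)
{
    return 4.0 * x * x * x - 3.0;
}
```

Starting from x0 = 0 with alpha = 0.05, the iteration settles near x = 0.9086. Trying much smaller or much larger values of alpha shows the slow convergence and the instability described above.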
We can find an estimate for α by considering the well-known Newton–Raphson
method for finding roots. In one dimension, we expand the function H(x) in a
Taylor series about the point $x^{(k)}$ and truncate, assuming all higher order terms are
zero,
\[ H(x^{(k+1)}) = H(x^{(k)}) + (x^{(k+1)} - x^{(k)})\, H'(x^{(k)}). \]
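Setting the left-hand side of the truncated series to zero and solving for the new point gives the familiar root-finding update, in which the new estimate is the old one minus H divided by its derivative. The following is a minimal sketch of that update in C; the stopping rule, the guard against a zero derivative, and the example function H(x) = x^2 − 2 are assumptions made for illustration.

```c
#include <assert.h>
#include <math.h>

typedef double (*func1d)(double);

/* Newton-Raphson root finding in one dimension:
       x_next = x - H(x) / H'(x)
   Stops when the step is smaller than tol, when the derivative is
   exactly zero (the update is undefined), or after max_iter steps. */
double newton_raphson(func1d H, func1d dH, double x0,
                      double tol, int max_iter)
{
    assert(max_iter > 0);
    double x = x0;
    for (int k = 0; k < max_iter; k++) {
        double slope = dH(x);
        if (slope == 0.0)
            break;
        double step = H(x) / slope;
        x -= step;
        if (fabs(step) < tol)
            break;
    }
    return x;
}

/* Illustrative function (hypothetical): H(x) = x^2 - 2, root sqrt(2). */
double example_H(double x)  { return x * x - 2.0; }
double example_dH(double x) { return 2.0 * x; }
```

Comparing this update with the gradient descent iteration suggests why the method gives an estimate of the step size: the reciprocal of the derivative plays the part of α.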
EXAMPLE
Given a set of x, y data pairs $\{(x_i, y_i)\}$ and a function of the form
\[ y = a e^{bx}, \tag{2.31} \]
Solution
We can solve this problem with the linear approach by observing that
$\ln y = \ln a + bx$ and re-defining variables $g = \ln y$ and $r = \ln a$.
With these substitutions, Eq. (2.32) becomes
\[ H(r, b) = \sum_i (g_i - r - b x_i)^2 \tag{2.33} \]
\[ \frac{\partial H}{\partial b} = 2 \sum_i (g_i - r - b x_i)(-x_i) \tag{2.34} \]
\[ \frac{\partial H}{\partial r} = 2 \sum_i (g_i - r - b x_i)(-1). \tag{2.35} \]
or
\[ N r + b \sum_i x_i = \sum_i g_i \tag{2.39} \]
where N is the number of data points. Eqs. (2.37) and (2.39) are two simultaneous
linear equations in two unknowns which are readily solved. (See [2.2, 2.3, 2.4] for
more sophisticated descent techniques such as the conjugate gradient method.)
It is possible that in order to compute this probability, we must know all of the
history, or it is possible that we need only know the class of the last few symbols.
One particularly interesting case is when we need only know the class of the last
symbol received. In that case, we could say that the probability of class assignments
for symbol y(t), given all of the history, is precisely the same as the probability
knowing only the last symbol:
where we have simplified the notation slightly by omitting the set element symbols.
That is, y(k) does not denote the fact that the kth symbol was received, but rather
that the kth symbol belongs to some particular class. If this is the case – that the
probability conditioned on all of history is identical to the probability conditioned
on the last symbol received – we refer to this as a Markov process.3
This relationship implies that
\[ P(y(N) \in w_N, \ldots, y(1) \in w_1) = \left[ \prod_{t=2}^{N} P\big(y(t) \in w_t \mid y(t-1) \in w_{t-1}\big) \right] P(y(1) \in w_1). \]
3 To be perfectly correct, this is a first-order Markov process, but we will not be dealing with any other types in
this chapter.
Suppose there are only two classes possible, say 0 and 1. Then we need to know only
four possible “transition probabilities,” which we define using subscripts as follows:
In general, there could be more than two classes, so we denote the transition probabilities
by $P_{ij}$, and can therefore describe a Markov chain by a c × c matrix P whose
elements are $P_{ij}$.
We will take another look at Markov processes when we think about Markov
random fields in Chapter 6.
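The product formula for the joint probability of a class sequence can be evaluated with a single loop. The following is a minimal sketch in C for the two-class case. The names are hypothetical, and the indexing convention is an assumption: trans[i][j] is taken to be the probability of class i given that the previous symbol was class j, and prior[i] is the probability that the first symbol is class i.

```c
#include <assert.h>

#define NUM_CLASSES 2

/* Joint probability of the class sequence seq[0..n-1] under a
   first-order Markov chain.
   trans[i][j]: probability of class i given previous class j (assumed
                convention).
   prior[i]:    probability that the first symbol belongs to class i. */
double sequence_probability(const int *seq, int n,
                            double trans[NUM_CLASSES][NUM_CLASSES],
                            const double prior[NUM_CLASSES])
{
    assert(n >= 1);
    double p = prior[seq[0]];
    for (int t = 1; t < n; t++)
        p *= trans[seq[t]][seq[t - 1]];   /* one factor per transition */
    return p;
}
```

Only the previous symbol enters each factor, which is exactly the Markov property. For long sequences the product underflows quickly, so in practice one usually sums logarithms instead.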
Assignment 2.1
Is the matrix P symmetric? Why or why not? Does P have
any interesting properties? Do its rows (or columns)
add up to anything interesting?
[Figure 2.6: two Markov processes, labelled 1 and 2, feeding a switch whose output is y(t).]
Fig. 2.6. A hidden Markov model may be viewed as a process which switches randomly
between two signals.
23 2.4 Markov models
[Figure 2.7: state diagram with transitions labelled “Switch up” and “Switch down.”]
state machine (FSM), which at each time instant may stay in the same state or may
switch, as shown in Fig. 2.7.
Here is our problem: We observe a sequence of symbols
What can we infer? The transition probabilities? The state sequence? The structure
of the FSM? The rules governing the FSM? Let’s begin by estimating the state
sequence.
Let $s(t)$, $t = 1, \ldots, N$, denote the state associated with measurement $y(t)$, and denote
the sequence of states $S = [s(1), s(2), \ldots, s(N)]$, where each $s(t) \in \{s_1, s_2, \ldots, s_m\}$.
We seek a sequence of states, S, which maximizes the conditional probability that
the sequence is correct, given the measurements: $P(S \mid Y)$.
Using Bayes’ rule
\[ P(S \mid Y) = \frac{p(Y \mid S)\, P(S)}{p(Y)}. \tag{2.42} \]
We assume the states form a Markov chain, so
\[ P(S) = \prod_{t=2}^{N} P_{s(t),s(t-1)} \, P_{s(0)}. \tag{2.43} \]
Now, let’s make a temporarily unbelievable assumption, that the probability density
of the output depends only on the state. Denote that relationship by $p(y(t) \mid s(t))$.
Then the posterior conditional probability of the sequence can be written:
\[ p(Y \mid S)\, P(S) = \prod_{t=1}^{N} p(y(t) \mid s(t)) \prod_{t=2}^{N} P_{s(t),s(t-1)} \, P_{s(0)}. \tag{2.44} \]
\[ p(Y \mid S)\, P(S) = \prod_{t=1}^{N} p(y(t) \mid s(t))\, P_{s(t),s(t-1)}. \tag{2.45} \]
Now look back at Eq. (2.42). The choice of S does not affect the denominator, so
all we need to do is find the sequence S which maximizes
\[ E = \prod_{t=1}^{N} p(y(t) \mid s(t))\, P_{s(t),s(t-1)}. \tag{2.46} \]
\[ L \equiv \ln E = \sum_{t=1}^{N} \Big( \ln p(y(t) \mid s(t)) + \ln P_{s(t),s(t-1)} \Big) \tag{2.47} \]
[Figure 2.8: graph with one column of nodes s1, s2, s3, …, sm per time step; one node is labelled sq.]
Fig. 2.8. Every possible sequence of states can be thought of as a path through such a graph.
[Figure 2.9: graph with four states, s1 to s4, at each of four time values.]
Fig. 2.9. A path through a problem with four states and four time values.
There is a value associated with each edge in the graph as well, the function
determined by the associated transition probability. So every possible path through
the graph has a corresponding value of the objective function L. We describe the
algorithm to find the best path as an induction: Suppose at time t we have already
found the best path to each node, with cost denoted $LB_i(t)$, $i = 1, \ldots, m$. Then we
can compute the cost of going from each node at time t to each node at time t + 1
($m^2$ calculations) by
The best path to node j at time t + 1 is the maximum of these. When we finally
reach time step N, the node which terminates the best path is the final node.
The computational complexity of this algorithm is thus $N m^2$, which is a lot less
than $m^N$, the complexity of a simple exhaustive search of all possible paths.
That is, given the observation sequence, what is the probability that we went from
state i to state j at time t? We can compute that quantity using the methods of
section 2.4.2 if we know the transition probabilities Pi, j and the output probabilities
k, l . Suppose we do know those. Then, we estimate the transition probability by
averaging the probabilities over all the inputs.
\[ P_{i,j} = \frac{\sum_{t=2}^{N} P_{i,j|Y}(t)}{\sum_{t=2}^{N} P_{j|Y}(t)} \tag{2.50} \]
where, since in order to go into state j, the system had to go there from somewhere,
\[ P_{j|Y}(t) = \sum_{i=1}^{N} P_{i,j|Y}(t). \tag{2.51} \]
Then we estimate the probability of the observation by again averaging all the
observations.
N
t=1,y(t)= j Pi|Y (t)
k,l = N . (2.52)
t=1 Pi|Y (t)
At each iteration, we use Eqs. (2.50) and (2.52) to update the parameters. We then
use Eqs. (2.49) and (2.51) to update the conditional probabilities. The process then
repeats until it converges.
Assignment 2.2
(Trivia question) In what novel did a character named
Markov Chaney occur?
Assignment 2.3
Find the OT corresponding to a rotation of 30° about
the z axis. Prove the columns of the resulting matrix
are a basis for $\mathbb{R}^3$.
Assignment 2.4
Prove Eq. (2.10). Hint: Use Eq. (2.7); Eq. (2.9) might
be useful as well.
Assignment 2.5
A positive definite matrix has positive eigenvalues.
Prove this. (For that matter, is it even true?)
Assignment 2.6
Does the function $y = x e^{-x}$ have a unique value x which
minimizes y? If so, can you find it by taking a
derivative and setting it equal to zero? Suppose
this problem requires gradient descent to solve.
Write the algorithm you would use to find the x which
minimizes y.
Assignment 2.7
We need to solve a minimization problem using gradient
descent. The function we are minimizing is sin x + ln y.
Which of the following is the expression for the
gradient which you need in order to do gradient
descent?
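(a) $\cos x + \frac{1}{y}$   (b) $y = -\frac{1}{\cos x}$   (c) $-\infty$
(d) $\begin{bmatrix} \cos x \\ 1/y \end{bmatrix}$   (e) $\frac{\partial}{\partial y} \sin x + \frac{\partial}{\partial x} \ln y$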
Assignment 2.8
(a) Write the algorithm which uses gradient descent to
find the vector $[x, y]^T$ which minimizes the function
28 Review of mathematical principles
Assignment 2.9
Determine whether the functions sin x and sin 2x might
be orthonormal or orthogonal functions.
3 Writing programs to process images
Computer Science is not about computers any more than astronomy is about telescopes
E. W. Dijkstra
One may take two approaches to writing software for image analysis, depending on
what one is required to optimize. One may write in a style which optimizes/minimizes
programmer time, or one may write to minimize computer time. In this course,
computer time will not be a concern (at least not usually), but your time will be far
more valuable. For that reason, we want to follow a programming philosophy which
produces correct, operational code in a minimal amount of programmer time.
The programming assignments in this book are specified to be written in C or
C++, rather than in MATLAB or JAVA. This is a conscious and deliberate decision.
MATLAB in particular hides many of the details of data structures and data
manipulation from the user. In the course of teaching variations of this course for many
years, the authors have found that many of those details are precisely the details
that students need to grasp in order to effectively understand what image processing
(particularly at the pixel level) is all about.
The objective of quickly writing good software is accomplished by using the image
access subroutines in IFS. IFS is a collection of subroutines and applications based
on those subroutines which support the development of image processing software.
Advantages of IFS include the following.
• IFS supports any data type including char, unsigned char, short, unsigned short,
  int, unsigned int, float, double, complex float, complex double, complex short,
  and structure.
• IFS supports any image size, and any number of dimensions. One may do signal
  processing by simply considering a signal as a one-dimensional image.
• IFS is available on most current computer systems, including Windows on the PC,
  Linux on the PC, Unix on the SUN, and OS-X on the Macintosh.1 Files written on
  one platform may be read on any of the other platforms. Conversion to the format
  native to the platform is done by the read routine, without user intervention.
• A large collection of functions is available, including two-dimensional Fourier
  transforms, filters, segmenters, etc.
1 Regrettably, IFS does not support Macintosh operating systems prior to OS-X.
v = ifsfgp(img,x,y)
31 3.2 Basic programming structure for image processing
then a floating point number will be returned independent of the data type of the
image. That is, the subroutine will do data conversions for you. Similarly, ifsigp will
return an integer, no matter what the internal data type is. This can, of course, get you
in trouble. Suppose the internal data type is float, and you have an image consisting
of numbers less than one. Then, the process of the conversion from float to int will
truncate all your values to zero.
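The truncation problem is easy to reproduce in a few lines of C. The sketch below does not call IFS at all; `read_as_float` and `read_as_int` are hypothetical stand-ins for a float-returning and an int-returning pixel read, applied to a plain array of floats.

```c
#include <assert.h>

/* Hypothetical stand-in for a read that returns the pixel as a float. */
float read_as_float(const float *pixels, int index)
{
    assert(pixels != 0 && index >= 0);
    return pixels[index];
}

/* Hypothetical stand-in for a read that converts the pixel to an int.
   The C conversion from float to int discards the fractional part. */
int read_as_int(const float *pixels, int index)
{
    assert(pixels != 0 && index >= 0);
    return (int)pixels[index];
}
```

For an image holding 0.25, 0.5, and 0.99, `read_as_int` returns 0 for every pixel, while `read_as_float` returns the stored values unchanged.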
For some projects you will have three-dimensional data. That means you must
access the images using a set of different subroutines, ifsigp3d, ifsfgp3d, ifsipp3d,
and ifsfpp3d. For example,
y = ifsigp3d(img,frame,row,col)
(1) ifsipp(img,x,y,exp(-t*t)) will give you trouble because ifsipp expects a fourth
argument which is an integer, and exp will return a double. You should use
ifsfpp.
(2) ifsigp(img,x,y,z) is improperly formed. ifsigp expects three arguments, and does
not check the dimension of the input image to determine number of arguments
(it could, however; sounds like a good student project . . . ). To access a three-
dimensional image, either use pointers, or use ifsigp3d(img,x,y,z), where the
second argument is the frame number.
......
int row, col;
......
for (row = 0; row < 128; row++)
{
    for (col = 0; col < 128; col++)
    {
        /* pixel processing */
        ......
    }
    ......
}
......
int row, col, frame;
......
for (frame = 0; frame < 224; frame++)
{
    for (row = 0; row < 128; row++)
    {
        for (col = 0; col < 128; col++)
        {
            /* pixel processing */
            ......
        }
        ......
    }
}
In this example, we use two integers (row and col) as the indices to the row and
column of the image. By increasing row and col in steps of one, we scan the image
pixel by pixel from left to right, top to bottom.
If the image has more than two dimensions (e.g. hyperspectral images), a third
integer is used as the index to the additional dimension and, correspondingly, three
nested for-loops are needed, as shown in Fig. 3.2.
You should follow one important programming construct: all programs should be
written so that they will work for any size image. You are not required to write
your programs so that they will work for any number of dimensions (although that
is possible too), or any data type; only size. One implication of this requirement
is that you cannot declare a static array, and copy all your data into that array
(which novice programmers like to do). Instead, you must use the image access
subroutines.
Another important programming guideline is that except in rare instances, global
variables are forbidden. A global variable is a variable which is declared outside of a
subroutine. Use of such globals is poor programming practice, and causes more bugs
than anything else in all of programming. Good structured programming practice
requires that everything a subroutine needs to know is included in its argument
list.
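Both guidelines can be seen in a short sketch. The `Image` structure and the functions `image_create`, `image_free`, `image_get`, `image_put`, and `threshold_image` below are hypothetical and are not part of IFS; they only illustrate taking the image size from the image itself instead of hard-coding it, and passing everything a subroutine needs through its argument list rather than through globals.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical image container (NOT the IFS header). */
typedef struct
{
    int nrows;      /* number of rows, carried with the image */
    int ncols;      /* number of columns */
    float *data;    /* nrows * ncols pixels, stored row by row */
} Image;

/* Allocates a zero-filled image; returns NULL if allocation fails. */
Image *image_create(int nrows, int ncols)
{
    Image *img;

    assert(nrows > 0 && ncols > 0);
    img = (Image *)malloc(sizeof(Image));
    if (img == NULL)
    {
        return NULL;
    }
    img->nrows = nrows;
    img->ncols = ncols;
    img->data = (float *)calloc((size_t)nrows * (size_t)ncols, sizeof(float));
    if (img->data == NULL)
    {
        free(img);
        return NULL;
    }
    return img;
}

void image_free(Image *img)
{
    if (img != NULL)
    {
        free(img->data);
        free(img);
    }
}

float image_get(const Image *img, int row, int col)
{
    return img->data[row * img->ncols + col];
}

void image_put(Image *img, int row, int col, float v)
{
    img->data[row * img->ncols + col] = v;
}

/* Thresholds src into dst. The loop limits come from the image,
   so the same code works for any image size. No globals are used. */
void threshold_image(const Image *src, Image *dst, float threshold)
{
    int row, col;

    assert(src->nrows == dst->nrows && src->ncols == dst->ncols);
    for (row = 0; row < src->nrows; row++)
    {
        for (col = 0; col < src->ncols; col++)
        {
            float v = image_get(src, row, col);
            image_put(dst, row, col, (v >= threshold) ? 255.0f : 0.0f);
        }
    }
}
```

Nothing in `threshold_image` mentions 128 or any other fixed size, and no static array is declared; all pixel access goes through the access functions.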
Following these simple programming guidelines will allow you to write
general-purpose code easily and efficiently, and with few bugs. As you become more skilled
you can take advantage of the pointer manipulation capabilities to increase the run
speed of your programs, if this is needed later.
33 3.4 Example programs
Besides the aforementioned programming guidelines, it is very important to also
follow the indenting, spacing, and commenting rules in order to make your code
“readable.” You will lose major points if your code does not follow these guidelines.
There are four commonly used indenting styles, the K&R style (or kernel style),
Allman style (or BSD style), Whitesmiths style, and the GNU style, as shown in
Fig. 3.3. In this book, we use the Allman indenting style. Adding some space (e.g.
a blank line) between different segments of your code also improves readability.
We emphasize the importance of the comments. However, do not add too many
comments since that will break the flow of your code. In general, you should add a
block comment at the top of each function implementation, including descriptions of
what the function does, who wrote this function, how to call this function, and what
the function returns. You should also add a description for each variable declaration.
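To make the four indenting styles concrete, the sketch below writes the same small function once in each style. The function names are invented for the illustration, and the layouts follow the commonly described conventions for each style rather than any particular figure.

```c
#include <assert.h>

/* K&R style: the opening brace of a control statement stays on the
   same line as the statement. */
int sum_kr(int n)
{
    int i, total = 0;
    for (i = 1; i <= n; i++) {
        total += i;
    }
    return total;
}

/* Allman style: every brace on its own line, aligned with the
   statement that owns it. */
int sum_allman(int n)
{
    int i, total = 0;
    for (i = 1; i <= n; i++)
    {
        total += i;
    }
    return total;
}

/* Whitesmiths style: braces on their own lines, indented to the
   same level as the body they enclose. */
int sum_whitesmiths(int n)
    {
    int i, total = 0;
    for (i = 1; i <= n; i++)
        {
        total += i;
        }
    return total;
    }

/* GNU style: braces on their own lines, indented two spaces, with
   the body indented two more. */
int
sum_gnu (int n)
{
  int i, total = 0;
  for (i = 1; i <= n; i++)
    {
      total += i;
    }
  return total;
}
```

All four functions compile to the same behavior; only the placement of braces and the indentation differ.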
Take another look at the IFS manual. It will help you to follow these example
programs. Also pay attention to the programming style we use. The comments
might be too detailed, but they are included for pedagogical purposes.
A typical program is given in Fig. 3.4, which is probably as simple an example
program as one could write. Fig. 3.5 lists another example which implements the
same function as Fig. 3.4 does but is written in a more flexible way such that it can
handle images of different size.
Both these examples make use of the subroutine calls ifsigp, ifsipp, ifsfgp, and
ifsfpp to access the images utilizing either integer-valued or floating point data. The
advantage of these subroutines is convenience: No matter what data type the image
is stored in, ifsigp will return an integer, and ifsfgp will return a float. Internally,
these subroutines determine precisely where the data is stored, access the data, and
convert it. All these operations, however, take computer time. For class projects,
the authors strongly recommend using these subroutines. However, for production
operations, IFS supports methods to access data directly using pointers, trading
additional programmer time for shorter run times.
/* Example1.c
This program thresholds an image. It uses a fixed image size.
Written by Harry Putter, October, 2006
*/
#include <stdio.h>
#include <ifs.h>
main( )
{
    IFSIMG img1, img2;   /* Declare pointers to headers */
    int len[3];          /* len is an array of dimensions, used by ifscreate */
    int threshold;       /* threshold is an int here */
    int row,col;         /* counters */
    int v;
    /* read in image */
    img1 = ifspin("infile.ifs");   /* read in file by this name */
Fig. 3.4. Example IFS program to threshold an image using specified values of dimensions and
predetermined data type.
3.5 Makefiles
You really should use makefiles. They are far superior to just typing commands.
If you are doing your software development using Microsoft C++, Lcc, or some
other compiler, then the makefiles are sort of hidden from you, but it is helpful to
know how they operate. Basically, a makefile specifies how to build your project, as
illustrated by the example makefile in Fig. 3.6.
The example in Fig. 3.6 is just about as simple a makefile as one can write. It
states that the executable named myprogram depends on only one thing, the object
module myprogram.o. It then shows how to make myprogram from myprogram.o
and the IFS library.
Similarly, myprogram.o is made by compiling (but not linking) the source file,
myprogram.c, utilizing header files found in an “include” directory on the CDROM,
named hdr. Note: To specify a library, as in the link step, one must specify the library
name (e.g. libifs.a), but to specify an include file (e.g. ifs.h), one specifies only the
directory in which that file is located, since the file name was given in the #include
preprocessor directive.
/* Example2.c
   Thresholds an image using information about its data type and the dimensionality.
   Written by Sherlock Holmes, May 16, 1885
*/
#include <stdio.h>
#include <ifs.h>
main( )
{
    IFSIMG img1, img2;    /* Declare pointers to headers */
    int *len;             /* len is an array of dimensions, used by ifscreate */
    int frame, row, col;  /* counters */
    float threshold, v;   /* threshold is a float here */
Fig. 3.5. An example IFS program to threshold an image using number of dimensions, size of
dimensions, and data type determined by the input image.
myprogram: myprogram.o
	cc -o myprogram myprogram.o /CDROM/Solaris/ifslib/libifs.a
myprogram.o: myprogram.c
	cc -c myprogram.c -I/CDROM/Solaris/hdr
Fig. 3.6. An example makefile which compiles a program and links it with the IFS library.
In WIN32 the makefiles look like the example shown in Fig. 3.7. Here, many of
the symbolic definition capabilities of the make program are demonstrated, and the
location of the compiler is specified explicitly.
The programs generated by IFS are (with the exception of ifsview) console-based.
That is, you need to run them inside an MSDOS window on the PC, inside a terminal
window under Linux, Solaris, or on the Mac, using OS-X.
In this chapter, we describe how images are formed and how they are represented.
Representations include both mathematical representations for the information
contained in an image and the ways in which images are stored and manipulated in a
digital machine. In this chapter, we also introduce a way of thinking about images –
as surfaces with varying height – which we will find to be a powerful way to describe
both the properties of images and operations on those images.
39 4.1 Image representations
or a quadric:
$$ax^2 + by^2 + cz^2 + dxy + exz + fyz + gx + hy + iz + j = 0. \tag{4.2}$$
Implicit and explicit representations
The form given in Eq. (4.1), in which one variable is defined in terms of the others,
is often referred to as an explicit representation, whereas the form of Eq. (4.2) is an
implicit representation [4.23], which may be equivalently represented in terms of
the zero set, $\{(x, y, z) : f(x, y, z) = 0\}$. Implicit polynomials have some convenient
properties. For example consider a point $(x_0, y_0)$ which is not in the zero set of
$f(x, y)$, that is, the set of points $x, y$ which satisfy
$$f(x, y) \equiv x^2 + y^2 - R^2 = 0. \tag{4.3}$$
If we substitute $x_0$ and $y_0$ into the equation for $f(x, y)$, we know we get a nonzero
result (since we said this point is not in the zero set); if that value is negative, we
40 Images: Formation and representation
know that $(x_0, y_0)$ is inside the curve; otherwise, it is outside [4.3]. This inside/outside
property holds for all closed curves (and surfaces) representable by polynomials.
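For the circle of Eq. (4.3), the inside/outside test is a single evaluation of the implicit polynomial followed by a sign check. A minimal sketch, in which the function names `circle_f` and `circle_side` are invented for the example:

```c
#include <assert.h>

/* Implicit polynomial of Eq. (4.3): f(x, y) = x^2 + y^2 - R^2. */
double circle_f(double x, double y, double r)
{
    return x * x + y * y - r * r;
}

/* Returns -1 if (x0, y0) is inside the circle of radius r,
   +1 if it is outside, and 0 if it lies in the zero set. */
int circle_side(double x0, double y0, double r)
{
    double v;

    assert(r > 0.0);
    v = circle_f(x0, y0, r);
    if (v < 0.0)
    {
        return -1;
    }
    if (v > 0.0)
    {
        return 1;
    }
    return 0;
}
```

With $R = 2$, the point $(1, 1)$ gives $f = -2$ and is inside, while $(3, 0)$ gives $f = 5$ and is outside.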
Fig. 4.1. (a) An image with lower horizontal frequency content. (b) An image with higher
horizontal frequency content.
Fig. 4.2. (L) An image. (R) A low-frequency iconic representation of that image. The right-hand
image is blurred in the horizontal direction only, a blur which results when the camera
is panned while taking the picture. Notice that horizontal edges are sharp.
The spatial frequency content of an image can be modified by filters which block
specific frequency ranges. For example, Fig. 4.2 illustrates an original image and an
iconic representation of that image which has been passed through a low-pass filter,
that is, a filter which permits low frequencies to pass from input to output, but which
blocks higher frequencies. As you can see, the frequency response is one way of
characterizing sharpness. Images with lots of high-frequency content are perceived
as sharp.
Although we will make little use of frequency domain representations for
images in this course, you should be aware of a few aspects of frequency domain
representations.
First, as you should have already observed, spatial frequencies differ with direction.
Fig. 4.1 illustrates much more rapid variation, higher spatial frequencies in the
vertical direction than in the horizontal. Furthermore, in general, an image contains
many spatial frequencies. We can extract the spatial frequency content of an image
using the two-dimensional Fourier transform, given by
$$F(u, v) = \frac{1}{K} \sum_{x} \sum_{y} f(x, y) \exp(-i 2\pi (ux + vy)) \tag{4.4}$$
The second observation made here is that spatial frequencies vary over an image.
That is, if one were to take subimages of an image, one would find significant
variation in the Fourier transforms of those subimages.
Third, take a look at the computational complexity implied in Eq. (4.4). The
Fourier transform of an image is a function of spatial frequencies, $u$ and $v$, and may
thus be considered as an image (its values are complex, but that should not worry
you). If our image is $N \times N$, we must sum over $x$ and $y$ to get a SINGLE $u, v$ value –
a complexity of $N^2$. If the frequency domain space is also sampled at $N \times N$, we
have a total complexity of $N^4$ to compute the Fourier transform. BUT, there exists
an algorithm called the fast Fourier transform which very cleverly computes the entire
$N \times N$ transform in on the order of $N^2 \log_2 N$ operations rather than $N^4$, resulting in
a significant saving. Thus it is sometimes faster to compute things in the frequency domain.
Finally, there is an equivalence between convolution, which we will discuss in
the next chapter, and multiplication in the frequency domain. More on that in section
5.8.
Now suppose the image is sampled; that is, x and y take on only discrete, integer
values, and also suppose f is quantized ( f takes on only a set of integer values).
Such an image could be stored in a computer memory and called a “digital image.”
A lens is used to form an image on the surface of the CCD. When a photon of the
appropriate wavelength strikes the special material of the device, a quantum of charge
is created (an electron–hole pair). Since the conductivity of the material is quite low,
these charges tend to remain in the same general area where they were created. Thus,
to a good approximation, the charge, $q$, in a local area of the CCD follows
$$q = \int_{0}^{t_f} i \, dt$$
where $i$ is the incident light intensity, measured in photons per second. If the incident
light is a constant over the integration time, then $q = i t_f$, where $t_f$ is called the frame
time.
In vidicon-like devices, the accumulated (positive) charge is cancelled by a scanning
electron beam. The cancellation process produces a current which is amplified
and becomes the video signal. In a CCD, charge is shifted from one cell to the next
synchronously with a digital clock. The mechanism for reading out the charge, be it
electron beam, or charge coupling, is always designed so that as much of the charge
is set to zero as possible. We start the integration process with zero accumulated
charge, build up the charge at a rate proportional to local light intensity, and then
read it out. Thus, the signal measured at a point will be proportional to both the light
intensity at that point and to the amount of time between read operations.
Since we are interested only in the intensities and not in the integration time, we
remove the effect of integration time by making it the same everywhere in the picture.
This process, called scanning, requires that each point on the device be interrogated
and its charge accumulation zeroed, repetitively and cyclically. Probably the most
straightforward, and certainly the most common way in which to accomplish this is
in a top-to-bottom, left-to-right scanning process called raster scanning (Fig. 4.3).
Fig. 4.3. Raster scanning: Active video is indicated by a solid line, blanking (retrace) by a dashed
line. In an electron beam device, the beam is turned off as it is repositioned. Blanking
has no physical meaning in a CCD, but is imposed for compatibility. This simplified
figure represents noninterlaced scanning.
Fig. 4.4. Composite and noncomposite outputs of a television camera, voltage as a function
of time. [Figure labels: active video, blanking, sync; traces: composite video,
noncomposite video.]
To be consistent with standards put in place when scanning was done by electron
beam devices (and there needed to be a time when the beam was shut off), the
television signal has a pause at the end of each scan line called blanking. While
charge is being shifted out from the bottom of the detector, charge is once again built
up at the top. Since charge continues to accumulate over the entire surface of the
detector at all times, it is necessary for the read/shift process to return immediately
to the top of the detector and begin shifting again. This scanning process is repeated
many times each second. In American television, the entire faceplate is scanned once
every 33.33 ms (in Europe, the frame time is 40 ms).
To compute exactly how fast the electron beam is moving, we compute
$$\frac{1\ \text{s}}{30\ \text{frame}} \div \frac{525\ \text{lines}}{\text{frame}} = 63.5\ \mu\text{s/line}. \tag{4.5}$$
Using the European standard of 625 lines and 25 frames per second, we arrive at
almost exactly the same answer, 64 μs per line. This 63.5 μs includes not only
the active video signal but also the blanking period, approximately 18 percent of
the line time. Subtracting this dead time, we arrive at the active video time, 52 μs
per line.
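The arithmetic behind these timing figures is short enough to check in code. In the sketch below, the function names `line_time_us` and `active_time_us` are invented for the example; the frame rates, line counts, and the 18 percent blanking fraction are the values quoted in the text.

```c
#include <assert.h>

/* Time spent on one scan line, in microseconds, blanking included. */
double line_time_us(double frames_per_second, double lines_per_frame)
{
    assert(frames_per_second > 0.0 && lines_per_frame > 0.0);
    return 1.0e6 / (frames_per_second * lines_per_frame);
}

/* Active video time per line, given the fraction of the line time
   that is spent in blanking. */
double active_time_us(double line_us, double blanking_fraction)
{
    assert(blanking_fraction >= 0.0 && blanking_fraction < 1.0);
    return line_us * (1.0 - blanking_fraction);
}
```

Calling `line_time_us(30.0, 525.0)` gives about 63.5, `line_time_us(25.0, 625.0)` gives 64, and removing 18 percent of the first value with `active_time_us` leaves about 52.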
Fig. 4.4 shows the output of a television camera as it scans three successive lines.
One immediately observes that the raster scanning process effectively converts a
picture from a two-dimensional signal to a one-dimensional signal, where voltage is
a function of time. Fig. 4.4 shows both composite and noncomposite video signals,
45 4.2 The digital image
that is, whether the signal does or does not include the sync and blanking timing
pulses.
The sync signal, while critical to operation of conventional television, is not
particularly relevant to our understanding of digital image processing at this time.
The blanking signal, however, is the single most important timing signal in a raster
scan system. Blanking refers to the time that there is no video. There are two distinct
blanking events: horizontal blanking, which occurs at the end of each line, and
vertical blanking, which occurs at the bottom of the picture. In a digital system, both
blanking events may be represented by pulses on separate digital wires. Composite
video is constructed by shifting these special timing pulses negative and adding them
to the video signal.
Now that we recognize that horizontal blanking signifies the beginning of a new
line of video data, we can concentrate on that line and learn how a computer might
acquire the brightness information encoded in that voltage.
Resolution
The number of samples on a single line defines the horizontal resolution of a video
system. Similarly, the number of lines in a single image defines the vertical
resolution. It is interesting to note that European television, with 625 lines per picture,
has a greater vertical resolution than American. This is why viewers observe that
European TV has a “better picture” than American.
The term resolution may also refer to the physical size of the smallest thing the
imaging system can clearly image. For example, the resolution of mammographic
x-ray film is around 50 microns, meaning that a dot on the film as small as that may
be discovered.
For computer monitors, there are many resolution standards, and we will not list
them all here. However the approach to calculating clock rates is the same.
Dynamic range
The sampled analog signal is converted to digital form by the quantization process,
as shown in Fig. 4.7. The digital representation of any signal can have only a finite
number of possible values, which are defined by the number of bits in the output
word. Video signals are often quantized to 8 bits of accuracy, thus allowing a signal
to be represented as one of 256 possible values.
One definition of the dynamic range of the imaging system is the number of
bits of the digital representation. An alternative definition specifies the dynamic
range as the range of input signal over which a camera successfully operates. Both
meanings are accepted and are in common use, but they differ according to the
context.
Since a digital image is raster scanned and sampled, there is a one-to-one
relationship between time and space. That is, if we refer to the sampling time, we must speak
of it relative to the top-of-picture signal (vertical blanking). That timing relationship
identifies a unique position on the screen.
Fig. 4.8. Image of a face represented using 16 shades of gray (4 bits) on left, and with eight
shades (3 bits) on right.
that exact reconstruction requires a sampling rate of at least twice the highest
frequency in the signal.
In machine vision, we are usually not concerned with exact reconstruction of the
most subtle details of the image, but wish to extract just the information we need to
accomplish the task at hand.
Quantization error is the term used to refer to the fact that information is lost
whenever the continuously valued analog signal is partitioned into discrete ranges.
Quantization error is often observed as contouring, as illustrated in Fig. 4.8.
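Contouring can be reproduced by requantizing an 8-bit image to fewer bits. The following sketch uses an invented function name, `requantize`, and assumes one simple scheme: the low-order bits are discarded and the surviving level is rescaled to the range 0 to 255 for display.

```c
#include <assert.h>

/* Requantizes an 8-bit gray level (0..255) to the given number of bits
   by discarding low-order bits, then rescales the result to 0..255. */
unsigned char requantize(unsigned char value, int bits)
{
    int levels, index;

    assert(bits >= 1 && bits <= 8);
    levels = 1 << bits;            /* e.g. 16 levels for 4 bits */
    index = value >> (8 - bits);   /* which level this value falls into */
    return (unsigned char)(index * 255 / (levels - 1));
}
```

With 4 bits, the input values 192 through 207 all map to the same output, 204, and the output jumps to 221 at an input of 208. A smooth brightness ramp therefore turns into a staircase, which the eye sees as contours.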
Stereopsis
Most animals have two eyes, and we know from experience that with two views, we
can extract three-dimensional information. It is not hard to work that out
geometrically (Fig. 4.9).
If we know the distance between the cameras, the angle of observation of each
camera (in most stereopsis systems, the cameras are set up so their midlines are
parallel), and we can measure where particular points in the scene appear in both
images, then we can calculate distance to the object, which will be referred to as
“range.” If we can do this in the general case, that is, if we can always identify which
point in the left image corresponds to which point in the right image, we have solved
the correspondence problem.
Fig. 4.9. From the images extracted from two cameras, it is possible to compute the location
in 3-space of any point, provided we can solve the correspondence problem.
The correspondence problem is one of the fundamental problems of machine vision! Some
people would say it is THE problem in machine vision.
An often-used simplifying assumption is that the two cameras are set up so that
they are exactly parallel, and if a point occurs on a particular line in the left image
then it will appear on the same “epipolar line” in the right image. In other words,
the epipolar line connects a point and its correspondence. This assumption makes it
possible to reduce the complexity of the correspondence problem dramatically.
The literature is filled with papers on approaches to the correspondence problem.
Most of them focus on point matching. That is, find a point on the epipolar line in the
second image which in some way resembles a point in the first image. For example,
Bokil and Khotanzad [4.5] extended the work of Marr and Poggio [4.27], which
makes use of the epipolar assumption. They accomplish point matching by establishing
a gray level compatibility matrix (GLCM). The pixel values of the left and right
images are labels at the bottom and left of the matrix. The i, j element of the matrix
is determined by computing the absolute value of the difference between brightness
values in the ith row of the left-hand image and the jth column in the right-hand
image. The GLCM values are then normalized. Row-to-row correlations are then
established and a best match is selected.
The correspondence problem may be made easier by hierarchical matching [4.26]
(two low-level features like epipolar edges correspond only if they belong to regions
which correspond).
There are methods for finding curves in 3-space which do not explicitly require a
solution to the correspondence problem. For example, Cohen and Wang [4.10, 4.11]
solve for the best matching curve rather than for the individual points.
Camera calibration [4.28, 4.37] is important for stereo [4.31], and at the same
time, stereopsis can be used for camera calibration, since it establishes a relationship
between the two-dimensional image and the three-dimensional world. A great deal
of effort has been expended to determine the minimum set of correspondences [4.1,
4.32] or other relationships [4.33, 4.34] required to calibrate the cameras.
A set of correspondences implies a transformation determining the pose of an
object (the position and orientation of the object in 3-space is known as the “pose”).
In a given scene, there may be multiple sets of correspondences [4.20, 4.38].
There are many special case applications [4.13], including how to obtain stereo
information from panoramic cameras [4.30].
We will revisit stereopsis in Chapter 11 after we have learned how the concepts of
parametric transformations will help in the solution of the correspondence problem.
Structured illumination
One variation eliminates the correspondence problem – replace one camera with
a light source (e.g. a laser beam passing through a cylindrical mirror). However,
this is really no longer stereopsis. Instead, it is a method referred to as structured
illumination. To see how this works, look back at Fig. 4.9, and think of one of
49 4.3 Describing image formation
those cameras as being replaced by a projector which shines a very narrow, very
bright slit on the scene, as illustrated in Fig. 4.10. Now, one angle, $\theta$, is known
from the projector; the other angle, $\phi$, is measured by finding the bright spot in the
camera image, counting over pixels, and knowing the relationship between pixels
and angle. Finally, knowledge of the distance between the camera and the projector, $d$,
makes the triangle solvable.
[Fig. 4.10. Figure labels: camera, projector, angles $\theta$ and $\phi$, separation $d$.]
One observation seems relevant to this point in describing images. An interesting
problem occurs when one uses structured illumination to look at specular reflectors
such as metal surfaces. With specular reflectors, either not enough or too much light
may be reflected (polarization filters help [4.29]).
We will see more about using structured illumination when we get to the
“shape-from-X” sections of this book.
A measurement system corrupts the input image to produce the measured image:
where D is some distortion function which typically includes some random noise
process.
where $\delta(r)$ represents the delta function which is equal to zero when its argument
is nonzero, equal to infinity when its argument is zero, and has an integral of one.
Equation (4.8) is not profound – it simply defines the way the delta samples a
function. However, now let us suppose our function $f$ is corrupted by some operator
which changes $f$ at every point $x$. Then,
$$D(f(x)) = D\left(\int_{-\infty}^{\infty} f(x')\,\delta(x - x')\,dx'\right). \tag{4.9}$$
First assumption: $D$ is linear.
Now if $D$ is a linear operator, then we can interchange the operator $D$ and the integral
to obtain
$$D(f(x)) = \int_{-\infty}^{\infty} D(f(x')\,\delta(x - x'))\,dx'. \tag{4.10}$$
Now observe that $D$ may depend on $x$, or it may depend on the difference between $x$
and $x'$, but in any case, it is the distortion operator applied to just the delta function.
So any LINEAR distortion of $f$ can be written as an integral of the product of $f$ with a
function which is the distortion applied to the delta. Since the delta function is really
just a very bright spot, with infinite height and zero width, in one dimension, we call
it an impulse, and call $D(\delta(x - x'))$ the impulse response. The two-dimensional
51 4.4 The image as a surface
delta function is a point of light, so we call the result of applying the distortion to it
the point spread function. The impulse response and the point spread function are
precisely the same thing; the only difference is in usage.
Since the impulse response might depend on both $x$ and $x'$, we introduce a new
notation which we call $h$:
$$h(x, x') = D(\delta(x - x')). \tag{4.12}$$
Look out! Another assumption: space-invariant.
If we make another assumption, we can get a simpler expression: Let’s assume that
$D$ depends not on $x$, but only on the difference between $x$ and $x'$. In that case, we
can write $h(x, x') = h(x - x')$, and Eq. (4.11) simplifies to
$$g(x) = D(f(x)) = \int_{-\infty}^{\infty} f(x')\,h(x - x')\,dx' \tag{4.13}$$
where we have introduced $g$, the output of the system. This, you will come to
recognize as the convolution integral. This integral is very important for a variety
of reasons, including the fact that it can be computed rapidly using the fast Fourier
transform (FFT). Even more significant is the observation that ANY distortion of
an image (as long as it is linear and space-invariant) can be computed by an integral
like this.
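For sampled signals the convolution integral becomes a sum, $g[x] = \sum_{x'} f[x']\,h[x - x']$. A minimal one-dimensional sketch follows; the function name `convolve1d` is invented for the example, and samples outside the arrays are assumed to be zero.

```c
#include <assert.h>

/* Discrete counterpart of the convolution integral:
   g[x] = sum over x' of f[x'] * h[x - x'].
   f has nf samples, h has nh samples, and g must have room for
   nf + nh - 1 samples. Samples outside the arrays count as zero. */
void convolve1d(const double *f, int nf, const double *h, int nh, double *g)
{
    int x, xp;

    assert(nf > 0 && nh > 0);
    for (x = 0; x < nf + nh - 1; x++)
    {
        g[x] = 0.0;
        for (xp = 0; xp < nf; xp++)
        {
            int k = x - xp;   /* index into h, the argument x - x' */
            if (k >= 0 && k < nh)
            {
                g[x] += f[xp] * h[k];
            }
        }
    }
}
```

Feeding in a single impulse, such as the input 0, 1, 0, returns a shifted copy of $h$ itself, which is exactly what the name impulse response suggests.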
4.4.1 Isophotes
Consider the value of $f(x, y)$ as a surface in space, described by $z = f(x, y)$. Then
the ordered triple $[x, y, f(x, y)]^T$ describes this surface. For every point, $(x, y)$, there
is a corresponding value in the third dimension. It is important to observe that there
is just ONE such $z$ value for any $x, y$ pair ($f(x, y)$ is a function). Therefore, $z$ is a
surface.
Consider the set of all points satisfying $f(x, y) = C$ for some constant $C$. If $f$
represents brightness, then this set of points is a set of points, all of which have the
same brightness. We therefore refer to this set as an “isophote.”
Theorem
At any image point, $(x, y)$, the isophote passing through that point is perpendicular
to the gradient.2
Proof of this theorem will be a test question.
2 The gradient vector is defined in Eq. 2.11 and elaborated upon in Eq. 5.22.
Fig. 4.11. Contour lines on an elevation map are equivalent to isophotes. The gradient vector at
a point is perpendicular to the isophote at that point. [Figure label: direction of gradient.]
4.4.2 Ridges
Now let’s think about z (x, y), a surface in space, as a mountain (see Fig. 4.11). If
we draw a geological contour map of this mountain, the lines on the map are lines
of equal elevation. However, if we think of “elevation” denoting brightness, then
the contour lines are isophotes. Stand at a point on this “mountain,” and look in the
direction of the gradient. The direction you are looking is the way you would go to
undertake the steepest ascent.
Look to your right or left and you are looking along the isophote. Note that the
direction of the gradient is the steepest direction at that particular point. It does not
necessarily point at the peak.
Let’s climb this mountain by taking small steps in the direction of the local
gradient. What happens at the ridge line? How would you know you were on a
ridge? How can you describe this process mathematically?
Think about taking steps in the direction of the gradient. Your steps are generally
in the same direction until you reach the ridge; then the direction shifts radically.
So, one useful definition of a ridge is the locus of points which are local maxima of
the rate of change of gradient direction. That is, we need to find the points where
$\partial\theta/\partial v$ is maximized. Here, $\theta$ denotes the gradient direction and $v$ represents a derivative taken in the direction of the
gradient. In Cartesian coordinates,
\[ \frac{\partial\theta}{\partial v} = \frac{2 f_x f_y f_{xy} - f_y^2 f_{xx} - f_x^2 f_{yy}}{\left(f_x^2 + f_y^2\right)^{3/2}}. \tag{4.14} \]
(Margin note: A local maximum is any point which does not have a larger neighbor.)
Maintz et al. [4.24] point out that it is essentially equivalent to a slightly simpler
formulation based on simply the second derivative of brightness in the v direction,
which leads to maximizing
\[ \frac{f_y^2 f_{xx} - 2 f_x f_y f_{xy} + f_x^2 f_{yy}}{f_x^2 + f_y^2}, \tag{4.15} \]
where the subscript denotes “partial derivative with respect to.” In three-dimensional
data, the concepts of ridges are the same, just harder to visualize. In that case, the
gradient is a 3-vector, pointing in the direction of increasing density. Isophotes are
surfaces instead of curves. In that same paper [4.24], Maintz et al. also consider the
concept of ridges in three-dimensional data; check it out if you have to implement
such things.
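A minimal sketch of how the expression in Eq. (4.15) might be evaluated on a sampled image, assuming the partial derivatives are estimated with NumPy's finite differences. The function name `ridge_measure`, the array layout `f[y, x]`, and the test surface are assumptions made for illustration.

```python
import numpy as np

def ridge_measure(f):
    """Evaluate the expression of Eq. (4.15) at every pixel of f[y, x]."""
    fy, fx = np.gradient(f)        # axis 0 is y (rows), axis 1 is x (columns)
    fxy, fxx = np.gradient(fx)     # derivatives of fx along y and along x
    fyy = np.gradient(fy, axis=0)
    num = fy**2 * fxx - 2.0 * fx * fy * fxy + fx**2 * fyy
    den = fx**2 + fy**2
    out = np.zeros_like(f, dtype=float)
    np.divide(num, den, out=out, where=den > 0)   # stay 0 where the gradient vanishes
    return out

# Illustrative surface: a bowl x^2 + y^2, sampled away from its minimum.
y, x = np.mgrid[1:12, 1:12].astype(float)
bowl = x**2 + y**2
r = ridge_measure(bowl)
```

For this bowl the expression equals 2 at every interior sample, which is a convenient sanity check. Real images are noisy, so some smoothing before differentiation would normally be needed.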
4.5 Neighborhood relations
We may define neighborhoods in a variety of ways, but the most common and most
intuitive way is to say that two pixels are neighbors if they share a side (4-connected)
or they share either a side or a vertex (8-connected). The neighborhood of a pixel
is the set of pixels which are neighbors (surprise!). The 4-neighbors of the center
point are illustrated in Fig. 4.12. Denote the neighborhood of a point s by $\aleph_s$. Later
we will discuss operations on sets of points, neighborhoods of points, and on sets of
neighborhoods. For example, let A and B be sets of points in the image, and let s be
a point in the set A. We may define the aura [4.15] of a set A with respect to a set B,
for a neighborhood structure $\aleph_s$, by
\[ O_B(A) = \bigcup_{s \in A} (\aleph_s \cap B). \]
That is, the aura of a set of pixels A relative to a set B is the collection of all points
in B that are neighbors of pixels in A, where the concept of “neighbor” is given
by a problem-specific definition. Fig. 4.13 illustrates an image containing (a) a set
A (defined to be the set of shaded pixels), a set B (defined to be the set of blank
pixels), a neighborhood relation given by (b), and the aura of set B with respect to
set A in (c) [4.15]. We will see more about relationships like this when we discuss
morphology.
Fig. 4.12. The 4-neighbors of the center pixel are shaded.
(Margin note: The neighbors of a pixel are usually adjacent to that pixel, but there is no
fundamental requirement that they be. We will see this again when we talk about “cliques.”)
Fig. 4.13. (a) Set A (shaded pixels) and set B (blank pixels); (b) a neighborhood relation.
The shaded pixels are neighbors by definition of the center pixel. (c) The aura of
the set of white pixels in (a) relative to the set of shaded pixels is given by the dark
pixels. It is important to observe that this example uses the standard 4-connected
definition of neighbor; in general, there is no requirement that neighbors even be spatially
adjacent.
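The aura definition translates almost directly into set operations. The sketch below assumes pixels are stored as (row, column) tuples and uses the 4-connected neighborhood; the names `aura` and `FOUR_NEIGHBORS` and the tiny 3 × 3 example are hypothetical.

```python
FOUR_NEIGHBORS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # (row, col) offsets

def aura(A, B, offsets=FOUR_NEIGHBORS):
    """Union over s in A of (neighborhood of s) intersected with B."""
    result = set()
    for (r, c) in A:
        neighborhood = {(r + dr, c + dc) for dr, dc in offsets}
        result |= neighborhood & B
    return result

# Illustrative 3 x 3 image: A is the center pixel, B is everything else.
pixels = {(r, c) for r in range(3) for c in range(3)}
A = {(1, 1)}
B = pixels - A
aura_of_A = aura(A, B)   # the four side-sharing pixels of the center
aura_of_B = aura(B, A)   # the center itself
```

Passing a different `offsets` list is all it takes to use a problem-specific neighborhood, including one whose members are not spatially adjacent.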
4.6 Conclusion
In this chapter, you have been introduced to a variety of ways to represent images,
and the information in images. In subsequent chapters, we will build on these representations,
developing algorithms which extract and categorize that information.
4.7 Vocabulary
Correspondence problem
Curvature
Dynamic range
Functional representation
Graph
Iconic representation
Isophote
Linear system
Medial axis
Probabilistic representation
Quantization
Range image
Raster scan
Ridge
Resolution
Sampling
Spatial frequency
Stereo
Structured illumination
3 To learn how to use an IFS program, either type program name -h or look it up in the manual.
Topic 4A Image representations
In a number of papers [4.36], imaging sensors have been described which use hexagonally-
organized pixel arrays. Hexagons are the minimum-energy solution when a collection of
tangent circles with flexible boundaries is subjected to pressure. The beehive is the best known
such naturally occurring organization, but many others occur too, including the organization
of cones in the human retina.
Fig. 4.14. A coordinate system which is natural to hexagonal tessellation of the plane. The
u and v directions are not orthogonal. Unit vectors u and v describe this coordinate
system. (Axes shown in the figure: u, v, x, and y.)
Traditionally, electronic imaging sensors have been arranged in rectangular arrays mainly
because an electron beam needed to be swept in a raster-scan way, and more recently because
it is slightly more convenient to arrange charge-coupled devices in a rectangular organization.
Rectangular arrays, however, introduce an ambiguity in attempts to define neighborhoods.
On the other hand, we see no connectivity paradoxes in hexagonal connectivity analysis:
Every pixel has exactly six neighbors, foreground, background, or other colors.
Notation
We denote a point in $\mathbb{R}^2$ by $p = u\mathbf{u} + v\mathbf{v}$, where the unbolded character denotes the magnitude
in the direction of the unit vector denoted by the bold character. In the case that we discuss two
or more points, we will denote different vectors by using subscripts, with the same subscripts
on the components, e.g., $p_i = u_i\mathbf{u} + v_i\mathbf{v}$.
\[ P_i = [u_i, v_i]^T. \tag{4.16} \]
In some cases, we will be interested in the location of points in the familiar Cartesian
representation, $[x, y]^T$. In this case, we will denote points by subscripts as well, e.g.
$P_i = [u_i, v_i]^T = [x_i, y_i]^T$ with corresponding values for u, v, x, and y.
Lemma 1
Any ordered pair [u, v] corresponds to exactly one pair [x, y].
Proof
Using simple trigonometry, and noting that the cosine of 60 degrees is 1/2, it is straightforward
to derive that
\[ x = u + \frac{v}{2} \quad\text{and}\quad y = \frac{\sqrt{3}\,v}{2}. \tag{4.17} \]
Lemma 2
Any ordered pair of Cartesian coordinates [x, y] corresponds to exactly one pair [u, v].
Proof
By solving Eq. (4.17) for u and v, we find
\[ u = x - \frac{y}{\sqrt{3}} \quad\text{and}\quad v = \frac{2y}{\sqrt{3}}. \tag{4.18} \]
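The u and v directions are not orthogonal, as shown
in Eq. (4.19), where the inner product of u and v in Cartesian coordinates does not equal
zero.
\[ \mathbf{u} = \mathbf{x}, \quad \mathbf{v} = \frac{1}{2}\mathbf{x} + \frac{\sqrt{3}}{2}\mathbf{y}, \quad\text{so}\quad \mathbf{u}^T\mathbf{v} = [1, 0] \begin{bmatrix} 1/2 \\ \sqrt{3}/2 \end{bmatrix} = \frac{1}{2}. \tag{4.19} \]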
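Theorem
The vectors u and v form a basis for $\mathbb{R}^2$.
Proof
Since x and y obviously are a basis for $\mathbb{R}^2$, we can write any point p in $\mathbb{R}^2$ as an ordered pair
$p = [x, y]^T = x\mathbf{x} + y\mathbf{y}$. But from Eq. (4.19), we have
\[ p = x\mathbf{u} + y\,\frac{2\mathbf{v} - \mathbf{u}}{\sqrt{3}} = \left(x - \frac{y}{\sqrt{3}}\right)\mathbf{u} + \frac{2y}{\sqrt{3}}\,\mathbf{v}. \]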
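Equations (4.17) and (4.18) can be checked with a few lines of code. This is only a sketch; the function names are hypothetical, and the neighbor offsets are the six listed for the hexagonal neighborhood of a pixel.

```python
import math

SQRT3 = math.sqrt(3.0)

def hex_to_cartesian(u, v):
    """Eq. (4.17): x = u + v/2, y = sqrt(3) v / 2."""
    return u + v / 2.0, SQRT3 * v / 2.0

def cartesian_to_hex(x, y):
    """Eq. (4.18): u = x - y/sqrt(3), v = 2y/sqrt(3)."""
    return x - y / SQRT3, 2.0 * y / SQRT3

# Round trip for an arbitrary point, and distances to the six hexagonal neighbors.
x, y = hex_to_cartesian(3.0, -2.0)
u_back, v_back = cartesian_to_hex(x, y)
neighbor_offsets = [(-1, 0), (-1, 1), (0, 1), (1, 0), (1, -1), (0, -1)]
distances = [math.hypot(*hex_to_cartesian(du, dv)) for du, dv in neighbor_offsets]
```

All six distances come out equal to 1, which is the property that makes the hexagonal neighborhood free of the side-versus-vertex ambiguity of rectangular grids.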
Given a pixel with coordinates u, v (assumed integer), the coordinates of the neighbors are
illustrated in Fig. 4.15.
We can use the following loop to efficiently access all six neighbors of pixel u, v. We
observe that no “if” statements are required to determine if the center pixel is on an even- or
odd-numbered row.
    ou = {-1, -1, 0, 1, 1, 0}
    ov = {0, 1, 1, 0, -1, -1}
    for i = 1 to 6
        nu = u + ou[i]
        nv = v + ov[i]
        value = image[nu][nv]
We note that this method of accessing the neighbors of a pixel is also useful in rectangular
grids for 8-neighbors and is more efficient than the doubly indexed loop
    for i = -1 to 1
        for j = -1 to 1
            if (i != 0) or (j != 0)
                value = image[u + i][v + j]
Fig. 4.15. The coordinates of the six neighbors of pixel u, v:
        u - 1, v + 1        u, v + 1
    u - 1, v          u, v          u + 1, v
        u, v - 1        u + 1, v - 1
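The offset-table idea above can be written as runnable code. The sketch below is one possible rendering, assuming a zero-based `image[u][v]` indexing and an interior pixel (no boundary handling); the helper name and the small test image are made up for illustration.

```python
HEX_OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 0), (1, -1), (0, -1)]
RECT8_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                 (0, 1), (1, -1), (1, 0), (1, 1)]

def neighbor_values(image, u, v, offsets):
    """Values of the neighbors of an interior pixel; no boundary handling."""
    return [image[u + du][v + dv] for du, dv in offsets]

# Illustrative 5 x 5 image whose value encodes its own coordinates: 5*u + v.
image = [[5 * u + v for v in range(5)] for u in range(5)]
hex_vals = neighbor_values(image, 2, 2, HEX_OFFSETS)
rect_vals = neighbor_values(image, 2, 2, RECT8_OFFSETS)
```

The same function serves both grids; only the offset table changes, and no test on the center pixel or on row parity is needed.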
So far, we have only spoken of images as if they were always brightness. Well actually, we
did mention 2½D images, which represent range as a function of x and y. However, there are other
things which we could compute to represent image properties. Curvature is one.
4A.2.1 Curvature
The computation of local curvature could be performed at every point in an image. For 2½D
images (surfaces), the curvature cannot be described adequately by a single scalar, but
rather takes the form of a matrix. (See doCarmo’s book [4.12] or other texts on differential
geometry for details.)
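\[ K = \begin{bmatrix} E & F \\ F & G \end{bmatrix}^{-1} \begin{bmatrix} e & f \\ f & g \end{bmatrix} \tag{4.20} \]
where
\[ E = 1 + \left(\frac{\partial z}{\partial x}\right)^2, \quad F = \frac{\partial z}{\partial x}\,\frac{\partial z}{\partial y}, \quad G = 1 + \left(\frac{\partial z}{\partial y}\right)^2, \]
\[ e = \frac{1}{\sqrt{H}}\,\frac{\partial^2 z}{\partial x^2}, \quad f = \frac{1}{\sqrt{H}}\,\frac{\partial^2 z}{\partial x\,\partial y}, \quad g = \frac{1}{\sqrt{H}}\,\frac{\partial^2 z}{\partial y^2}, \]
and finally,
\[ H = \left(\frac{\partial z}{\partial x}\right)^2 + \left(\frac{\partial z}{\partial y}\right)^2 + 1. \]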
The principal curvatures K 1 and K 2 are defined as the two eigenvalues of the matrix K, and
the corresponding eigenvectors determine the directions of the curvature.
For many of our purposes, we will need scalar measurements of curvature which are
invariant to viewpoint. Two such scalars are easily defined, the mean curvature
\[ K_m = \frac{1}{2}(K_1 + K_2) = \frac{1}{2}\,\mathrm{Tr}(K) \tag{4.21} \]
and the Gauss curvature
\[ K_G = K_1 K_2 = \det(K). \tag{4.22} \]
Since it is a product, the Gauss curvature is zero whenever either of the two principal curvatures
is zero, a condition which routinely occurs with industrial parts. For this reason, we seldom
use the Gauss curvature.
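The following sketch computes the curvature matrix and the two scalar curvatures from the partial derivatives of a surface z(x, y). It assumes the standard differential-geometry normalization, in which the second derivatives are divided by the square root of 1 + z_x² + z_y²; the function names and the cylinder example are illustrative assumptions.

```python
import numpy as np

def curvature_matrix(zx, zy, zxx, zxy, zyy):
    """K of Eq. (4.20), assuming e, f, g are second derivatives over sqrt(H)."""
    H = 1.0 + zx**2 + zy**2
    first = np.array([[1.0 + zx**2, zx * zy],
                      [zx * zy, 1.0 + zy**2]])
    second = np.array([[zxx, zxy],
                       [zxy, zyy]]) / np.sqrt(H)
    return np.linalg.inv(first) @ second

def mean_and_gauss(K):
    """Eqs. (4.21) and (4.22): half the trace, and the determinant."""
    return 0.5 * np.trace(K), np.linalg.det(K)

# Illustrative surface: cylinder z = sqrt(R^2 - x^2) with R = 2, evaluated at x = 1.
R, x = 2.0, 1.0
z = np.sqrt(R**2 - x**2)
K = curvature_matrix(zx=-x / z, zy=0.0, zxx=-R**2 / z**3, zxy=0.0, zyy=0.0)
K_m, K_G = mean_and_gauss(K)
```

For the cylinder one principal curvature is zero, so the Gauss curvature vanishes while the mean curvature does not. That is the situation the text warns about for industrial parts.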
4A.2.2 Texture
Texture is one of those words that everybody seems to know, but knows without a definition.
There are at least two different definitions of texture – “natural” textures, which are
best characterized by random process descriptions, and “regular” textures, which are best
characterized by frequency–domain representations.
Haralick and Shapiro [4.18] describe textures as “having one or more of the properties of
fineness, coarseness, smoothness, granulation, randomness, lineation, or as being mottled,
irregular, or hummocky.” In detecting that clusters of pixels are different, many features may
be used including moments of the power spectrum [4.4, 4.14], fractal dimension [8.12], and
the cepstrum [4.35]. Texture segmentation involves representing [4.17, 4.21] an image in a
way which incorporates both spatial and spatial–frequency information, and then using that
information to identify regions with similar characteristics [4.7, 4.14, 4.16].
The fact that textures can be effectively represented by self-similar (fractal) processes is
addressed in a number of papers, the first of which was presented in the classic work by
Mandelbrot and Van Ness [4.25]. Kaplan and Kuo [4.22] point out that true textures do not
necessarily keep the same exact textures over scale, and the concept of self-similarity should
be modified.
Fig. 4.16. (a) Examples of wool textures [4.2]. (b) Examples of tree bark textures [4.2].
(c) Examples comparing natural and regular textures [4.4]. Used with permission.
Assignment 4.A1
Suppose the image f(x,y) is describable by $f(x, y) = x^4/4 - x^3 + y^2$.
At the point x = 1, y = 2, which of the following is a
unit vector which points along the isophote passing through
that point?
(a) $[2/\sqrt{5} \;\; 1/\sqrt{5}]^T$    (c) $[-1/\sqrt{5} \;\; 2/\sqrt{5}]^T$    (e) $[2 \;\; 1]^T$
(b) $[1/\sqrt{5} \;\; 2/\sqrt{5}]^T$    (d) $[-2 \;\; 4]^T$    (f) $[-2/\sqrt{5} \;\; 1/\sqrt{5}]^T$
Assignment 4.A2
Imagine you are standing on a surface. You cannot see the entire
surface, but you can see a fairly large portion. If you
measure the curvature at all the points you can see, you find
that one of the two principal curvatures is zero. The other
principal curvature varies monotonically in one direction.
You cannot measure it precisely, but you suspect that variation
of curvature is linear in that one direction. On what
type of surface are you standing?
References
[4.1] T. Alter, “3-D Pose from 3 Points Using Weak-perspective,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 16(8), 1994.
[4.2] D. Badler, J. JáJá, and R. Chellappa, “Scalable Data Parallel Algorithms for
Texture Synthesis and Compression using Gibbs Random Fields,” IEEE Transactions
on Image Processing, 4(10), 1995.
[4.3] R. Bajcsy and F. Solina, “Three Dimensional Object Representation Revisited,”
International Conference on Computer Vision, London, May, 1987.
[4.4] J. Bigün and J. du Buf, “N-folded Symmetries by Complex Moments in Gabor Space
and Their Application to Unsupervised Texture Segmentation,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[4.5] A. Bokil and A. Khotanzad, “A Constraint Learning Feedback Dynamic Model for
Stereopsis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(11),
1995.
[4.6] K. Castleman, Digital Image Processing, Englewood Cliffs, NJ, Prentice-Hall, 1996.
[4.7] J. Chen and A. Kundu, “Rotation and Gray Scale Transformation Invariant Texture
Identification using Wavelet Decomposition and Hidden Markov Models,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(2), 1994.
[4.8] R. Chien and W. Snyder, “Hardware for Visual Image Processing,” IEEE Transactions
on Circuits and Systems, 22(6), 1975.
[4.9] D. Clausi, “Texture Segmentation Example,” Web publication,
http://www.eng.uwaterloo.ca/~dclausi/texture.html, Spring 2001.
[4.10] F. Cohen and J. Wang, “Part I: Modeling Image Curves Using Invariant 3-D Object
Curve Models – A Path to 3-D Reconstruction and Shape Estimation from Image
Contours Using B-Splines, Shape Invariant Matching and Neural Network,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[4.11] F. Cohen and J. Wang, “Part II: 3-D Object Recognition and Shape Estimation from
Image Contours,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(1), 1994.
[4.12] M. doCarmo, Differential Geometry of Curves and Surfaces, Englewood Cliffs, NJ,
Prentice-Hall, 1976.
[4.13] U. Dhond and J. Aggarwal, “Stereo Matching in the Presence of Narrow Occluding
Objects using Dynamic Disparity Search,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(7), 1995.
[4.14] D. Dunn, W. Higgins, and J. Wakeley, “Texture Segmentation using 2-D Gabor
Elementary Functions,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(2), 1994.
[4.15] I. Elfadel and R. Picard, “Gibbs Random Fields, Co-occurrences, and Texture Modeling,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[4.16] H. Greenspan, R. Goodman, R. Chellappa, and C. Anderson, “Learning Texture Discrimination
Rules in a Multiresolution System,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 16(9), 1994.
[4.17] M. Gürelli and L. Onural, “On a Parameter Estimation Method for Gibbs–Markov
Random Fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(4), 1994.
[4.18] R. Haralick and L. Shapiro, Computer and Robot Vision, Volume I, Reading, MA,
Addison-Wesley, 1992.
[4.19] G. Healey and R. Kondepudy, “Radiometric CCD Camera Calibration and Noise
Estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(3),
1994.
[4.20] Y. Hel-Or and M. Werman, “Pose Estimation by Fusing Noisy Data of Different
Dimensions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(2),
1995.
[4.21] A. Jain and K. Karu, “Learning Texture Discrimination Masks,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 18(2), 1996.
[4.22] L. Kaplan and C. Kuo, “Texture Roughness Analysis and Synthesis via Extended
Self-similar (ESS) Model,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(11), 1995.
[4.23] D. Keren, D. Cooper, and J. Subrahmonia, “Describing Complicated Objects by Implicit
Polynomials,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(1), 1994.
[4.24] J. Maintz, P. van den Elsen, and M. Viergever, “Evaluation of Ridge Seeking Operations
for Multimodality Medical Image Matching,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 18(4), 1996.
[4.25] B. Mandelbrot and J. Van Ness, “Fractional Brownian Motions, Fractional Noises,
and Applications,” SIAM Review, 10, October, 1968.
[4.26] S. Marapan and M. Trivedi, “Multi-primitive Hierarchical (MPH) Stereo Analysis,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(3), 1994.
[4.27] D. Marr and T. Poggio, “Cooperative Computation of Stereo Disparity,” Science, 194,
pp. 283–287, October, 1976.
[4.28] P. McLauchlan and D. Murray, “Active Camera Calibration for a Head-eye Platform
Using Variable State-dimension Filter,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(1), 1996.
[4.29] N. Page, W. Snyder, and S. Rajala, “Turbine Blade Image Processing System,”
In Advanced Software for Robotics, ed. A Danthine, Amsterdam, North-Holland,
1984.
[4.30] S. Peleg, M. Ben-Ezra, and Y. Pritch, “Omnistereo: Panoramic Stereo Imaging,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 23(3), 2001.
[4.31] L. Quan, “Invariants of Six Points and Projective Reconstruction from Three Uncalibrated
Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(1), 1995.
[4.32] A. Shashua, “Projective Structure from Uncalibrated Images: Structure from Motion
and Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(8), 1994.
[4.33] A. Shashua, “Algebraic Functions for Recognition,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 17(8), 1995.
[4.34] A. Shashua and N. Navab, “Relative Affine Structure: Canonical Model for 3D From
2D Geometry and Applications,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(9), 1996.
[4.35] P. Smith and N. Nandhakumar, “An Improved Power Cepstrum Based Stereo Correspondence
Method for Textured Scenes,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(3), 1996.
[4.36] W. Snyder, H. Qi, and W. Sander, “A Hexagonal Coordinate System,” SPIE Medical
Imaging: Image Processing, Pt. 1–2, pp. 716–727, February, 1999.
[4.37] G. Wei and S. Ma, “Implicit and Explicit Camera Calibration: Theory and Experiments,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5),
1994.
[4.38] X. Zhuang and Y. Huang, “Robust 3-D – 3-D Pose Estimation,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(8), 1994.
5 Linear operators and kernels
where $f_1$ and $f_2$ are images, and $\alpha$ and $\beta$ are scalar multipliers, then we say that D is a
“linear operator.”
A gedankenexperiment
\[ g = D(f) = af + b, \qquad a, b \in \mathbb{R} \]
Is D a linear operator?
We suggest you work this out for yourself before reading the solution. It certainly
LOOKS linear. Multiplication by a constant followed by addition of a constant. If
f were a scalar variable, then D describes the equation of a line, which SURELY is
linear (isn’t it?)! OK. Let’s prove it. Using Eq. (5.1), we evaluate
1 The authors are grateful to Bilge Karacali, Rajeev Ramanath, and Lena Soderberg for their assistance in
producing the images used in this chapter.
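For readers who like to check such questions numerically, here is a small sketch. It assumes that the linearity condition of Eq. (5.1) is the usual superposition test, D(αf1 + βf2) = αD(f1) + βD(f2); the helper `is_linear`, the random test images, and the constants are illustrative only.

```python
import numpy as np

def is_linear(D, f1, f2, alpha, beta):
    """Check the superposition condition for one choice of images and scalars."""
    lhs = D(alpha * f1 + beta * f2)
    rhs = alpha * D(f1) + beta * D(f2)
    return np.allclose(lhs, rhs)

rng = np.random.default_rng(0)
f1, f2 = rng.random((4, 4)), rng.random((4, 4))   # two arbitrary small "images"
a, b = 2.0, 3.0

def scale_only(f):
    return a * f            # D(f) = a f

def scale_and_offset(f):
    return a * f + b        # D(f) = a f + b

passes = is_linear(scale_only, f1, f2, 0.5, 1.5)
fails = is_linear(scale_and_offset, f1, f2, 0.5, 1.5)
```

With b ≠ 0 the left side contains b once while the right side contains (α + β)b, so the test fails whenever α + β ≠ 1. A numerical check like this can refute linearity, but passing it for one pair of images does not prove it.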
Since f is now digital, many authors choose to write f as a matrix, $f_{ij}$, instead of the
functional notation f (x, y). However, we prefer the x, y notation, for reasons which
shall become apparent later. We will find it more convenient to use a single subscript,
$f_i$, at several points later, but for now, let’s stick with f (x, y) and remember that x
and y take on only a small range of integer values, e.g. 0 < x < 511.
Think about a one-dimensional image named f with five pixels and another one-
dimensional image, which we will call a kernel named h with three pixels, as illus-
trated in Fig. 5.1.
Place the kernel down so its center, pixel $h_0$, is over some pixel of f, say $f_2$; we get
$g_2 = f_1 h_{-1} + f_2 h_0 + f_3 h_1$, which is a sum of products of elements of the kernel and
elements of the image. With that understanding, let us consider the most common
example of Eq. (5.2), the case we will come to call application of a 3 × 3 kernel:
\[ g(x, y) = \sum_{\alpha} \sum_{\beta} f(x + \alpha, y + \beta)\, h(\alpha, \beta). \tag{5.2} \]
This case occurs when both $\alpha$ and $\beta$ take on only the values −1, 0, and 1. In this
case, Eq. (5.2) expands to
    f1    f2    f3    f4    f5
          h−1   h0    h1
Fig. 5.1. A one-dimensional image with five pixels and a one-dimensional kernel with three
pixels. The subscript is the x-coordinate of the pixel.
Table 5.1. Values of the elements of a kernel.
    h(−1, −1)    h(0, −1)    h(1, −1)
    h(−1, 0)     h(0, 0)     h(1, 0)
    h(−1, 1)     h(0, 1)     h(1, 1)
(Margin note: Note the order of the arguments here, x (column) and y (row).
Sometimes the reverse convention is followed.)
To better capture the essence of Eq. (5.3), let us write h as a 3 × 3 grid of numbers
(yes, we used the word “grid” rather than “array” intentionally), as in Table 5.1.
Now we imagine that we place this grid down on top of the image so that the
center of the grid is directly over pixel f (x, y); then each h value in the grid is
multiplied by the corresponding point in the image. We will refer to the grid of h
values henceforth as a “kernel.”
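A minimal sketch of "placing the grid on the image" as in Eq. (5.2), assuming the image is stored as `f[y, x]`, the kernel as `h[β + 1, α + 1]` (the layout of Table 5.1), and that border pixels are simply left at zero. The function name and the ramp image are illustrative.

```python
import numpy as np

def apply_kernel(f, h):
    """g(x, y) = sum of f(x + a, y + b) * h(a, b) over the kernel, interior pixels only."""
    r = h.shape[0] // 2                  # kernel radius: 1 for a 3 x 3 kernel
    rows, cols = f.shape
    g = np.zeros_like(f, dtype=float)
    for y in range(r, rows - r):
        for x in range(r, cols - r):
            region = f[y - r:y + r + 1, x - r:x + r + 1]
            g[y, x] = np.sum(region * h)   # multiply corresponding points, then add
    return g

# Illustrative image: brightness equal to the x-coordinate (a horizontal ramp).
f = np.tile(np.arange(6.0), (5, 1))
h = np.array([[-1.0, 0.0, 1.0]] * 3)
g = apply_kernel(f, h)
```

On the ramp every interior output is 6: three rows, each contributing (x + 1) − (x − 1) = 2. How to treat the border is a separate design decision that this sketch deliberately avoids.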
The observant student will have noticed a discrepancy in order between Eqs. (5.4)
and (5.5). In formal convolution, as given by Eq. (5.5), the arguments reverse: the
right-most pixel of the kernel ($h_1$) is multiplied by the left-most pixel in the corresponding
region of the image ($f_2$). However, in Eq. (5.4), we think of “placing” the
kernel down over the image and multiplying corresponding pixels. If we multiply
corresponding pixels, left–left and right–right, we have correlation. There is, unfortunately,
a misnomer in much of the literature – both may be called “convolution.”
We advise the student to watch for this. In many publications, the authors use the
term “convolution” when they really mean “sum of products.” In order to avoid confusion,
in this book, we will avoid the use of the word “convolve” unless we really
do mean the application of Eq. (5.5), and instead use the term “kernel operator,”
when we mean Eq. (5.4).
(Margin note: Mathematically, convolution and correlation differ in the left–right order of
coordinates.)
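The distinction is easy to see numerically with an asymmetric kernel. The sketch below uses NumPy's `correlate` for the sum-of-products ("kernel operator") form and `convolve` for formal convolution; the signal and kernel values are illustrative.

```python
import numpy as np

f = np.array([1.0, 2.0, 4.0, 1.0, 7.0])   # illustrative five-pixel image
h = np.array([1.0, 0.0, -1.0])            # h[-1], h[0], h[1]; asymmetric on purpose

# Sum of products with the kernel placed as-is (left-left, right-right).
kernel_op = np.correlate(f, h, mode="valid")
# Formal convolution: the kernel is reversed left-to-right before the sum of products.
convolved = np.convolve(f, h, mode="valid")
# Convolution is the kernel operator applied with the flipped kernel.
flipped = np.correlate(f, h[::-1], mode="valid")
```

For a symmetric kernel the two operations coincide, which is one reason the terminology is so often used loosely.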
    [−1  1]    (figure label: “Derivative estimated at this point”)
But this kernel is aesthetically unpleasing – the estimate at x depends on the value
at x and at x + 1, but not at x − 1; why? We actually like a symmetric definition
better, such as
\[ \left.\frac{\partial f}{\partial x}\right|_{x_0} = \lim_{\Delta x \to 0} \frac{f(x_0 + \Delta x) - f(x_0 - \Delta x)}{2\,\Delta x}. \]
We cannot get $\Delta x$ smaller than 1, and we end up with this kernel:
    [−1/2  0  1/2] = 1/2 [−1  0  1].
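To see the difference between the two kernels, the sketch below applies both to samples of a function whose derivative is known. The test signal (x² sampled at integers) is an illustrative assumption.

```python
import numpy as np

f = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])   # samples of x**2 at x = 0, ..., 5

# Asymmetric kernel [-1 1]: f(x + 1) - f(x), available at x = 0, ..., 4.
forward = np.correlate(f, [-1.0, 1.0], mode="valid")
# Symmetric kernel [-1/2 0 1/2]: (f(x + 1) - f(x - 1)) / 2, available at x = 1, ..., 4.
central = np.correlate(f, [-0.5, 0.0, 0.5], mode="valid")

true_derivative = 2.0 * np.arange(1, 5)          # d/dx of x**2 at x = 1, ..., 4
```

The symmetric estimate reproduces 2x exactly for this quadratic, while the asymmetric one is off by one at every sample.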
This approach presents yet another way to make use of the continuous representa-
tion of an image f (x, y). Think of the brightness as a function of the two spatial
coordinates, and consider a plane which is tangent to that brightness surface at a
point, as illustrated in Fig. 5.2.
In this case, we may write the continuous image representation using the equation
of a plane
f (x, y) = ax + by + c. (5.8)
Then, we may consider the edge strength using the two numbers ∂ f /∂ x = a,
∂ f /∂ y = b, and the rate of change of brightness at the point (x, y) is represented by
the gradient vector
\[ \nabla f = \left[ \frac{\partial f}{\partial x} \;\; \frac{\partial f}{\partial y} \right]^T = [a \;\; b]^T. \tag{5.9} \]
The approach followed here is to find a, b, and c given some noisy, blurred measure-
ment of f, and the assumption of Eq. (5.8).
To find those parameters, first observe that Eq. (5.8) may be written as $f(x, y) = A^T X$,
where the vectors A and X are $A^T = [a \;\; b \;\; c]$ and $X^T = [x \;\; y \;\; 1]$.
Expanding the square and eliminating the functional notation for simplicity, we find
\[ E = \sum_{\aleph} \left[ (A^T X)(A^T X) - 2 A^T X g + g^2 \right]. \]
Fig. 5.2. The brightness in an image can be thought of as a surface, a function of two variables.
The slopes of the tangent plane are the two spatial partial derivatives.
through, we have
\[ E = \sum_{\aleph} A^T X X^T A - 2 \sum_{\aleph} A^T X g + \sum_{\aleph} g^2
     = A^T \left( \sum_{\aleph} X X^T \right) A - 2 A^T \sum_{\aleph} X g + \sum_{\aleph} g^2. \]
Now we wish to find the A (the parameters of the plane) which minimizes E; so we
may take derivatives and set the result to zero.
\[ \frac{dE}{dA} = 2 \left( \sum_{\aleph} X X^T \right) A - 2 \sum_{\aleph} X g = 0. \tag{5.10} \]
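Let’s call $\sum_{\aleph} X X^T \equiv S$ (it is the “scatter matrix”) and see what Eq. (5.10) means:
consider a neighborhood $\aleph$ which is symmetric about the origin. In that neighborhood,
suppose x and y only take on values of −1, 0, and 1, then
\[ S = \sum_{\aleph} X X^T = \sum_{\aleph} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} [x \;\; y \;\; 1]
     = \begin{bmatrix} \sum x^2 & \sum xy & \sum x \\ \sum xy & \sum y^2 & \sum y \\ \sum x & \sum y & \sum 1 \end{bmatrix} \]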
You do not see where those values came from?
Ok, here’s how you get them. Look at the top left point, at coordinates x =
−1, y = −1. At that point, $x^2 = (-1)^2 = 1$. Now look at the top middle point, at
coordinates x = 0, y = −1. At that point, $x^2 = 0$. Do this for all 9 points in the
neighborhood, and you obtain $\sum x^2 = 6$. Got it?
(Margin note: More detail on how the elements of the scatter matrix are derived. Do not
forget the positive direction for y is down.)
Useful observation: If you make the neighborhood symmetric about the origin,
all the terms in the scatter matrix which contain x or y to an odd power will be zero.
(Margin note: Convince yourself that this is true.)
Also: A common miss steak is to put a 1 in the lower right corner rather than a
9 – be careful!
(Margin note: Be sure you carefully proofread whatever you write!)
So now we have the matrix equation
\[ 2 \begin{bmatrix} 6 & 0 & 0 \\ 0 & 6 & 0 \\ 0 & 0 & 9 \end{bmatrix}
     \begin{bmatrix} a \\ b \\ c \end{bmatrix}
   = 2 \begin{bmatrix} \sum g(x, y)\,x \\ \sum g(x, y)\,y \\ \sum g(x, y) \end{bmatrix}. \]
−1 0 1
−1 0 1
−1 0 1
That is precisely the kernel of Eq. (5.6), which we derived intuitively. Now, we have
it derived formally. Doesn’t it give you a warm fuzzy feeling when theory agrees with
intuition?!? (Whoops, we forgot to multiply each term by 1/6, but we can simply
factor the 1/6 out, and when we get the answer, we will just divide by 6.)
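The whole derivation can be reproduced in a few lines, which is a handy way to convince yourself of the scatter matrix and of the 1/6 factor. The patch values and the names below are illustrative assumptions; the patch is indexed as `g[y + 1, x + 1]` with y increasing downward.

```python
import numpy as np

# The nine (x, y) positions of a 3 x 3 neighborhood centered on the origin.
offsets = [(x, y) for y in (-1, 0, 1) for x in (-1, 0, 1)]
X = np.array([[x, y, 1.0] for x, y in offsets])   # one row X^T per neighborhood point
S = X.T @ X                                        # scatter matrix: sum of X X^T

def fit_plane(g):
    """Least-squares a, b, c of Eq. (5.8) for a 3 x 3 patch, by solving Eq. (5.10)."""
    gvec = np.array([g[y + 1, x + 1] for x, y in offsets])
    return np.linalg.solve(S, X.T @ gvec)

patch = np.array([[1.0, 2.0, 4.0],
                  [7.0, 3.0, 2.0],
                  [9.0, 2.0, 1.0]])               # illustrative measurements g(x, y)
a, b, c = fit_plane(patch)
kernel_x = np.array([[-1.0, 0.0, 1.0]] * 3) / 6.0  # the derived kernel, 1/6 included
```

Here `a` equals the sum of products of `kernel_x` with the patch, and `c` is simply the mean of the nine values, as the diagonal scatter matrix predicts.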
We accomplished this by using an optimization method, in this case, minimizing
the squared error, to find the coefficients of a function f (x) in an equation of the
form y = f (x), where f is polynomial. Recall from section 4.1.2 that this form is
referred to as an explicit functional representation.
One more terminology issue: In future material, we will use the expression radius
of a kernel. The radius is the number of pixels from the center to the nearest edge.
For example, a 3 × 3 kernel has a radius of one. A 5 × 5 kernel has a radius of 2,
etc. It is possible to design kernels which are circular, but most of the time, we use
squares.
In this section, we find image gradients again, in exactly the same way, but this
time with hexagonally arranged pixels. Refer to section 4A.1 for a discussion of the
coordinate system. It is a different presentation of the same material, and if you read
both presentations carefully, you will understand the concepts more clearly.
To find the gradient of intensity in an image, we will fit a plane to the data
in a small neighborhood. This plane will be represented in the form of Eq. (5.8).
We then take partial derivatives with respect to u and v to find the gradient of
intensity in those corresponding directions. We choose a neighborhood of six points,
surrounding a central point, and fit the plane to them. Define the set of data points as
$z_i$, (i = 1, . . . , 6). Then the following expression represents the error in fitting these
six points to a plane parameterized by a, b, and c.
\[ E = \sum_{i=1}^{6} \left( z_i - (a u_i + b v_i + c) \right)^2. \tag{5.11} \]
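and set the partial derivative of Eq. (5.15) equal to zero to produce a pair of
simultaneous equations,
\[ \begin{aligned} 4a - 2b &= \Upsilon_u \\ -4a + 8b &= 2\Upsilon_v \end{aligned} \tag{5.16} \]
with solution
\[ b = \frac{1}{6}\,(2\Upsilon_v + \Upsilon_u). \tag{5.17} \]
Similarly,
\[ a = \frac{1}{6}\,(\Upsilon_v + 2\Upsilon_u). \tag{5.18} \]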
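As a check on Eqs. (5.17) and (5.18), the sketch below fits a plane to six hexagonal neighbors and compares the closed-form answer with a general least-squares solve. It assumes that Υu and Υv stand for the sums of z_i u_i and z_i v_i over the six neighbors, and that the neighbors sit at the offsets of the hexagonal neighborhood of Topic 4A; both are assumptions made for this illustration.

```python
import numpy as np

HEX_OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 0), (1, -1), (0, -1)]   # (u_i, v_i)

def hex_gradient(z):
    """a and b from Eqs. (5.17)-(5.18); z lists neighbor values in HEX_OFFSETS order."""
    ups_u = sum(zi * u for zi, (u, v) in zip(z, HEX_OFFSETS))   # assumed: sum of z_i u_i
    ups_v = sum(zi * v for zi, (u, v) in zip(z, HEX_OFFSETS))   # assumed: sum of z_i v_i
    a = (ups_v + 2.0 * ups_u) / 6.0
    b = (2.0 * ups_v + ups_u) / 6.0
    return a, b

# Illustrative data lying exactly on the plane z = 3u - 2v + 5.
z = [3.0 * u - 2.0 * v + 5.0 for u, v in HEX_OFFSETS]
a, b = hex_gradient(z)

# Independent check: ordinary least squares on the same six points.
M = np.array([[u, v, 1.0] for u, v in HEX_OFFSETS])
a_ls, b_ls, c_ls = np.linalg.lstsq(M, np.array(z), rcond=None)[0]
```

Expanding 2u + v and u + 2v at the six offsets gives the integer weights −1, 1, −2, 2, −1, 1 and 1, 2, −1, 1, −2, −1, up to the common factor 1/6.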
Fig. 5.3. The kernels used to estimate the gradient of brightness in the u direction, and in the v
direction.
        u direction              v direction
         −1     1                  1     2
      −2     ·     2           −1     ·     1
         −1     1                 −2    −1
5.4 Vector representations of images
Suppose we list every pixel in an image in raster scan order, as one long vector. For
example, for the 4 × 4 image
\[ f(x, y) = \begin{bmatrix} 1 & 2 & 4 & 1 \\ 7 & 3 & 2 & 8 \\ 9 & 2 & 1 & 4 \\ 4 & 1 & 2 & 3 \end{bmatrix} \]
\[ F = [1 \; 2 \; 4 \; 1 \; 7 \; 3 \; 2 \; 8 \; 9 \; 2 \; 1 \; 4 \; 4 \; 1 \; 2 \; 3]^T. \]
74 Linear operators and kernels
This is called the “lexicographic” representation. If we write the image in this
way, each pixel may be identified by a single index, e.g., $F_0 = 1$, $F_4 = 7$, $F_{15} = 3$,
where the indexing starts with zero.
(Margin note: 0, 0 is the UPPER left corner.)
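Now suppose we want to apply the following kernel to this image:
\[ h = \begin{bmatrix} -1 & 0 & 2 \\ -2 & 0 & 4 \\ 3 & 9 & 1 \end{bmatrix} \]
\[ H_5 = [-1 \; 0 \; 2 \; 0 \; -2 \; 0 \; 4 \; 0 \; 3 \; 9 \; 1 \; 0 \; 0 \; 0 \; 0 \; 0]^T. \]
Now you try it. Determine what vector to use to apply this kernel at (2, 2). Did you
get this?
\[ H_{10} = [0 \; 0 \; 0 \; 0 \; 0 \; -1 \; 0 \; 2 \; 0 \; -2 \; 0 \; 4 \; 0 \; 3 \; 9 \; 1]^T. \]
\[ H_6 = [0 \; -1 \; 0 \; 2 \; 0 \; -2 \; 0 \; 4 \; 0 \; 3 \; 9 \; 1 \; 0 \; 0 \; 0 \; 0]^T. \]
Compare $H_5$ at (1, 1) and $H_6$ at (2, 1). They are the same except for a rotation.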
We could convolve the entire image by constructing a matrix in which each column
is one such H. Doing so would result in a matrix such as the one illustrated. By
producing the product $G = H^T F$, G will be the (vector form of) convolution of
image F with kernel H.
         H5    H6
    …    −1     0    …     0    …
    …     0    −1    …     0    …
    …     2     0    …     0    …
    …     0     2    …     0    …
    …    −2     0    …     0    …
    …     0    −2    …    −1    …
    …     4     0    …     0    …
    …     0     4    …     2    …
    …     3     0    …     0    …
    …     9     3    …    −2    …
    …     1     9    …     0    …
    …     0     1    …     4    …
    …     0     0    …     0    …
    …     0     0    …     3    …
    …     0     0    …     9    …
    …     0     0    …     1    …
Some observations about this process:
• The resulting matrix is a “circulant” matrix. Each column is simply a rotation of
  the adjacent column.
• The fact that we can apply kernels in this way is yet another demonstration of the
  fact that kernel operators are linear operators.
• This form suggests one approach for dealing with the nasty problem of boundary
  conditions (you did think about that, didn’t you?). Specifically, how do you
  multiply by the data value above when you are on the top line? One answer is to
  rotate around the image and pull from the bottom.
• The matrix H is VERY large. If f is a typical image of 256 × 256 pixels, then H is
  (256 × 256) × (256 × 256) which is a large number (although still smaller than
  the US national debt). Do not worry about the monstrous size of H. Nobody (well,
  almost nobody) ever computes H and uses it in this way. This form is useful for
  thinking about images and for proving theorems about image operators – it is a
  conceptual, not a computational tool.
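For the small 4 × 4 example the matrix is tiny, so it is harmless to build it and look at it. The sketch below assumes the wrap-around ("rotate and pull from the other side") treatment of boundaries mentioned in the list above, so that every column is a rotation of $H_5$; the variable names are illustrative.

```python
import numpy as np

f = np.array([[1, 2, 4, 1],
              [7, 3, 2, 8],
              [9, 2, 1, 4],
              [4, 1, 2, 3]])
h = np.array([[-1, 0, 2],
              [-2, 0, 4],
              [ 3, 9, 1]])

F = f.flatten()                      # raster-scan (lexicographic) vector

# Lay the kernel on an empty image so that its center sits on (x, y) = (1, 1).
H5_image = np.zeros((4, 4), dtype=int)
H5_image[0:3, 0:3] = h
H5 = H5_image.flatten()

# Column k is H5 rotated by k - 5 positions, so the matrix is circulant.
H = np.column_stack([np.roll(H5, k - 5) for k in range(16)])
G = H.T @ F                          # all sixteen kernel applications at once
```

`G[5]` is the kernel applied at (1, 1): −1 + 8 − 14 + 8 + 27 + 18 + 1 = 47. Columns 6 and 10 of `H` reproduce the vectors $H_6$ and $H_{10}$ given in the text.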
5.5 Basis vectors for images
     1   √2    1          1    0   −1          0   −1   √2
     0    0    0         √2    0  −√2          1    0   −1
    −1  −√2   −1          1    0   −1        −√2    1    0
         u1                   u2                   u3

    √2   −1    0          0    1    0         −1    0    1
    −1    0    1         −1    0   −1          0    0    0
     0    1  −√2          0    1    0          1    0   −1
         u4                   u5                   u6

     1   −2    1         −2    1   −2          1    1    1
    −2    4   −2          1    4    1          1    1    1
     1   −2    1         −2    1   −2          1    1    1
         u7                   u8                   u9
One way to determine how similar a neighborhood about some point is to a vertical
edge is to compute the inner product of the neighborhood vector with the vertical
edge basis vector. One final question: What is the difference between calculating
this projection and convolving the image at that point with a kernel which estimates
∂ f /∂ x? The answer is left as an exercise to the student. (Don’t you wish they were
all this easy?)
(Margin note: Do you think you could develop a similar basis set for the seven pixels in a
hexagonal neighborhood? Hint: Seven pixels defines a seven-dimensional vector space. You
will need to find seven such vectors.)
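The projection described above is just a matrix–vector product once the nine basis vectors are written in lexicographic form. The sketch below assumes the entries of u1 to u4 that survive in the figure as “2” are √2 (the form usually attributed to Frei and Chen), normalizes each vector to unit length, and uses a made-up neighborhood.

```python
import numpy as np

s = np.sqrt(2.0)   # assumption: the irrational entries of u1-u4 are sqrt(2)
basis = np.array([
    [ 1,  s,  1,  0, 0,  0, -1, -s, -1],   # u1
    [ 1,  0, -1,  s, 0, -s,  1,  0, -1],   # u2
    [ 0, -1,  s,  1, 0, -1, -s,  1,  0],   # u3
    [ s, -1,  0, -1, 0,  1,  0,  1, -s],   # u4
    [ 0,  1,  0, -1, 0, -1,  0,  1,  0],   # u5
    [-1,  0,  1,  0, 0,  0,  1,  0, -1],   # u6
    [ 1, -2,  1, -2, 4, -2,  1, -2,  1],   # u7
    [-2,  1, -2,  1, 4,  1, -2,  1, -2],   # u8
    [ 1,  1,  1,  1, 1,  1,  1,  1,  1],   # u9
], dtype=float)
basis /= np.linalg.norm(basis, axis=1, keepdims=True)   # make each vector unit length

# Illustrative 3 x 3 neighborhood containing a vertical step edge.
neighborhood = np.array([[0, 0, 9],
                         [0, 0, 9],
                         [0, 0, 9]], dtype=float).flatten()

coefficients = basis @ neighborhood        # inner product with each basis vector
reconstructed = basis.T @ coefficients     # exact, because the basis is orthonormal
```

Because the nine vectors are mutually orthogonal, the coefficients describe the neighborhood completely, and comparing their magnitudes shows which patterns it resembles most.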
So now you know all there is to know (almost) about linear operators and kernel
operators. Let’s move on to an application to which we have already alluded – finding
edges.
5.6 Edge detection
Edges are areas in the image where the brightness changes suddenly; where the
derivative (or more correctly, some derivative) has a large magnitude. We can categorize
edges as step, roof, or ramp [5.20], as illustrated in Fig. 5.5.
Fig. 5.5. Types of commonly occurring edges. Note that the term positive or negative generally
refers to the sign of the first instance of the first derivative.
77 5.6 Edge detection
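$$h_x = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix} \tag{5.19}$$
$$h_y = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}
\quad \text{or} \quad
\begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix} \tag{5.20}$$
estimates ∂f/∂y.
Margin note: Which is correct? It depends on which direction you have chosen as positive y.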
Some other forms have appeared in the literature that you should know about for
historical purposes.
Important
(This will give you trouble for the entire semester, so you may as well start now.)
In software implementations, the positive y direction is DOWN! This results from
the fact that scanning is top-to-bottom, left-to-right. So pixel (0, 0) is the upper left
corner of the image. Furthermore, numbering starts at zero, not one. We find the best
way to avoid confusion is to never use the words “x” and “y” in writing programs,
but instead use “row” and “column” remembering that now 0 is on top.
However, in these notes, we will use conventional Cartesian coordinates in order
to get the math right, and to further confuse the student (which is, after all, what
Professors are there for. Right?).
Having cleared up that muddle, let us proceed. Given the gradient vector
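$$\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x} & \dfrac{\partial f}{\partial y} \end{bmatrix}^T \equiv [G_x \;\; G_y]^T \tag{5.22}$$
we are interested in its magnitude
$$|\nabla f| = \sqrt{G_x^2 + G_y^2} \tag{5.23}$$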
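As a rough sketch of how the gradient magnitude might be computed at one pixel, the Python below applies the kernels of Eqs. (5.19) and (5.20) as sums of products. The function names, the `image[row][col]` indexing with row 0 at the top, and the particular choice of sign for `HY` are assumptions made for this example.

```python
import math

HX = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]     # Eq. (5.19)
# One of the two forms in Eq. (5.20); with row 0 on top this one is positive
# when brightness increases downward (an arbitrary choice for the sketch).
HY = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

def apply_kernel(image, kernel, row, col):
    """Sum of products of a 3x3 kernel with the neighborhood centered at (row, col)."""
    return sum(kernel[i][j] * image[row - 1 + i][col - 1 + j]
               for i in range(3) for j in range(3))

def gradient_magnitude(image, row, col):
    """Magnitude of the estimated gradient, as in Eq. (5.23)."""
    gx = apply_kernel(image, HX, row, col)
    gy = apply_kernel(image, HY, row, col)
    return math.sqrt(gx * gx + gy * gy)

# A made-up image containing a dark-to-bright vertical step edge.
step = [[0, 0, 10, 10] for _ in range(4)]
flat = [[7, 7, 7] for _ in range(3)]
```

The sketch only handles interior pixels; what to do at the image border is a separate design decision that is not addressed here.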
78 Linear operators and kernels
We hope you have realized by now that all the edge detector operators you have used
so far are doing two things simultaneously: smoothing (read “low-pass filtering,”
“noise removal,” “averaging,” or “blurring”) and differentiation (read “high-pass
filtering” or “sharpening”). The kernel of Eq. (5.6) actually takes the vertical average
of three derivative estimates. It is counter-intuitive, however, since it weights
the center pixel the same as the lines above and below.
Consider the result you got on Assignment 5.6. If you did it correctly, the kernel
values increase as they are farther from the center. That is even worse, right? Why
should data points farther away from the point where we are estimating the derivative
contribute more heavily to the estimate? Wrong! Wrong! Wrong! It is an artifact of
the assumption we made that all the pixels fit the same plane. They obviously don’t.
So here’s a better way – weight the center pixel more heavily. You already saw
this – the Sobel operator, Eq. (5.7) does it. But now, let’s get a bit more rigorous.
Let’s blur the image by applying a kernel which is bigger in the middle and then
differentiate. We have lots of choices for a kernel like that, e.g., a triangle or a
Gaussian, but thorough research [5.28] has shown that a Gaussian works best for
this sort of thing. We can write this process as
$$d = \frac{\partial}{\partial x}(g \otimes h)$$
where now g is the measured image, h is a Gaussian, and d will be our new derivative
estimate image. Now, a crucial point from linear systems theory:
For linear operators D and ⊗,
$$D(g \otimes h) = D(g) \otimes h = g \otimes D(h). \tag{5.25}$$
Margin note: Recall what you learned about linear systems.
Equation (5.25) means we do not have to do blurring in one step and differentiation
in the next; instead, we can pre-compute the derivative of the blur kernel and simply
apply the resultant kernel.
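A small one-dimensional check of this idea, written as a sketch: the scanline `g`, the integer blur kernel `h`, and the difference kernel `d` are all made-up values, and a simple difference kernel stands in for the derivative operator.

```python
def convolve(a, b):
    """Full 1D discrete convolution of two sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

g = [3, 3, 3, 8, 8, 8, 8, 2, 2]   # a made-up "measured" scanline
h = [1, 2, 1]                      # small blur kernel (unnormalized, integer)
d = [1, 0, -1]                     # difference kernel standing in for the derivative

two_step = convolve(convolve(g, h), d)   # blur first, then differentiate
dh = convolve(h, d)                      # precomputed derivative-of-blur kernel
one_step = convolve(g, dh)               # apply the single combined kernel
```

Because the values are integers, the two results agree exactly; with floating-point kernels one would expect agreement only to rounding error.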
79 5.7 A kernel as a sampled differentiable function
Let’s see if we can remember how to take a derivative of a 2D Gaussian (did you
forget it is a 2D function?).
A d-dimensional multivariate Gaussian has the general form
$$\frac{1}{(2\pi)^{d/2}\,|K|^{1/2}} \exp\left(-\frac{[x-\mu]^T K^{-1} [x-\mu]}{2}\right) \tag{5.26}$$
where K is the covariance matrix and μ is the mean vector. Since we want a Gaussian
centered at the origin (which will be the center pixel), μ = 0, and since we have no
reason to prefer one direction over another, we choose K to be diagonal (isotropic)
$$K = \begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{bmatrix} = \sigma^2 I. \tag{5.27}$$
Let’s look in a bit more detail about how to make use of these formulae and their
two-dimensional equivalents to derive kernels.
The simplest way to get the kernel values for the derivatives of a Gaussian is
to simply substitute x = 0, 1, 2, etc. along with their negative values, which yields
numbers for the kernel. The first problem to arise is “what should σ be?” To address
these questions, we will derive the elements of the kernel used for the second
derivative of a one-dimensional Gaussian. The other derivatives can be developed
using the same philosophy. Take a look at Fig. 5.6(a) and ask, “is there a value of
σ such that the maximum of the second derivative occurs at x = −1 and x = 1?”
Clearly there is, and its value is σ = 1/√3. Given this value of σ, we can compute
the values of the second derivative of a Gaussian at the integer points x = {−1, 0, 1}.
At x = 0, we find G_xx(1/√3, 0) = −2.07, and at x = 1, G_xx(1/√3, 1) = 0.9251.
So are we finished? That wasn’t so hard, was it? Unfortunately, we are not done. It is
very important that the elements of the kernel sum to zero. If they don’t, then iterative
algorithms like those described in Chapter 6 will not maintain the proper brightness
levels over many iterations. The kernel also needs to be symmetric. That essentially
defines the second derivative of a Gaussian. The most reasonable set of values close
to those given, which satisfy symmetry and summation to zero are {1, −2, 1}.
Margin note: Proving this is the correct value of σ is a homework problem.
Margin note: The elements of any kernel which approximates a derivative must sum to zero.
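The sampled values quoted above can be reproduced with a short sketch. It assumes the usual normalized one-dimensional Gaussian, whose second derivative is (x²/σ⁴ − 1/σ²) times the Gaussian itself; the function name `gaussian_xx` is invented for the example.

```python
import math

def gaussian_xx(sigma, x):
    """Second derivative of a normalized 1D Gaussian with standard deviation sigma."""
    g = math.exp(-x * x / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    return (x * x / sigma ** 4 - 1 / sigma ** 2) * g

sigma = 1 / math.sqrt(3)
samples = [gaussian_xx(sigma, x) for x in (-2, -1, 0, 1, 2)]
three_point_sum = sum(samples[1:4])   # samples at x = -1, 0, 1
print([round(s, 4) for s in samples])
print(round(three_point_sum, 4))
```

The three-point sum is not zero, which is why the raw samples cannot be used directly as a kernel.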
However, this does not teach us very much. Let’s look at a 5 × 1 kernel and see
if we can learn a bit more. We require the following.
• The elements of the kernel should approximate the values of the appropriate
derivative of a Gaussian as closely as possible.
• The elements must sum to zero.
• The kernel should be symmetric about its center, unless you want to do special
processing.
very important that the kernel values integrate to zero, not quite so important that the
actual values be precise. So what do we do in a case like this? We use constrained
optimization. One strategy is to set up a problem to find a second derivative of a
Gaussian, which has these values as closely as possible, but which integrates to
zero. For more complex problems, the authors use Interopt [5.3] to solve numerical
optimization problems, but you can solve this problem without using numerical
methods. This is accomplished as follows. First, understand the problem (presented
for the case of five points given above): We wish to find five numbers as close as
possible to [0.0565, 0.9251, −2.0730, 0.9251, 0.0565] which satisfy the constraint
that the five sum to zero. By symmetry, we actually only have three numbers, which
we will denote [a, b, c]. For notational convenience, introduce three constants α =
0.0565, β = 0.9251, γ = −2.073. Thus, to find a, b, and c which resemble these
numbers, we write the mean squared error (MSE) form
$$H_0(a, b, c) = 2(a - \alpha)^2 + 2(b - \beta)^2 + (c - \gamma)^2. \tag{5.31}$$
Using the concept of Lagrange multipliers, we can find the best choice of a, b, and
c by minimizing a different objective function
$$H(a, b, c) = 2(a - \alpha)^2 + 2(b - \beta)^2 + (c - \gamma)^2 + \lambda(2a + 2b + c). \tag{5.32}$$
Margin note: H is the constrained version of H0.
A few words of explanation are in order for those students who are not familiar with
constrained optimization using Lagrange multipliers. The term with the λ in front
(λ is the Lagrange multiplier) is the constraint. It is formulated such that it is exactly
equal to zero, if we should find the proper a, b, and c. By minimizing H, we will find
the parameters which minimize H0 while simultaneously satisfying the constraint.
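To minimize H, take the partials and set them equal to zero:
$$\begin{aligned}
\frac{\partial H}{\partial a} &= 4a - 4\alpha + 2\lambda \\
\frac{\partial H}{\partial b} &= 4b - 4\beta + 2\lambda \\
\frac{\partial H}{\partial c} &= 2c - 2\gamma + \lambda.
\end{aligned} \tag{5.33}$$
Setting the partial derivatives equal to zero, simplifying, and adding the constraint,
we find the following set of linear equations:
$$\begin{aligned}
a &= \alpha - \frac{\lambda}{2} \\
b &= \beta - \frac{\lambda}{2} \\
c &= \gamma - \frac{\lambda}{2} \\
2a + 2b + c &= 0
\end{aligned} \tag{5.34}$$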
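These four linear equations can be solved by hand: substituting the first three into the constraint gives (5/2)λ = 2α + 2β + γ. The sketch below simply carries out that arithmetic with the constants from the text; the variable names are chosen for the example.

```python
alpha, beta, gamma = 0.0565, 0.9251, -2.073

# Substituting a = alpha - lam/2, b = beta - lam/2, c = gamma - lam/2
# into 2a + 2b + c = 0 gives (5/2) * lam = 2*alpha + 2*beta + gamma.
lam = (2 * alpha + 2 * beta + gamma) / 2.5

a = alpha - lam / 2
b = beta - lam / 2
c = gamma - lam / 2
kernel = [a, b, c, b, a]      # symmetric 5 x 1 kernel
print([round(v, 4) for v in kernel], round(sum(kernel), 12))
```

Each element moves by the same small amount, λ/2, which is what one would expect when the squared-error penalty weights every kernel position equally.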
0.0261 0 −0.0261
0.1080 0 −0.1080
0.0261 0 −0.0261
We can proceed in the same way to compute the kernels to estimate the partial
derivatives using Gaussians in two dimensions.
One implementation of the first derivative with respect to x, assuming an isotropic
Gaussian, is presented in Fig. 5.7. You will have the opportunity to derive others as
homework.
In this chapter, we have explored the idea of edge operators based on kernel operators.
We discovered that no matter what, noisy images result in edges which are:
• too thick in places
• missing in places
• extraneous in places.
That is just life – we cannot do any better with simple kernels. In Chapter 6, we
will explore some approaches to these problems.
As we hope you have guessed, there are other ways of finding edges in images
besides simply thresholding a derivative. In later sections, we will mention a few of
them.
−1 2 −1
2 −4 2 .
−1 2 −1
Remembering that the only difference between convolution and a kernel operator is
the direction of x and y (section 5.2.1), for any kernel operator there is an equivalent
convolution kernel. Therefore efficient ways to calculate convolution are also efficient
ways to apply kernel operators. The convolution operation may be computed
directly as discussed above. It is simply a sum of products, calculated in the neighborhood
of each pixel. However, it may also be computed by the Fourier transform.
The Fourier transform of a convolution is the product of the Fourier transforms of the
two arguments. That is (denoting convolution by the operator ⊗), we are concerned
with computing
$$g = f \otimes h.$$
Let the Fourier transforms of the two images and the convolution kernel be defined
by
$$F = \mathcal{F}(f), \qquad G = \mathcal{F}(g), \qquad H = \mathcal{F}(h),$$
where the symbol $\mathcal{F}$ denotes the process of taking the Fourier transform. Remember
from section 4.1.5, the Fourier transform of an image (a function of two variables)
is itself a function of two variables. We refer to those variables, ω_x and ω_y, as the
spatial frequencies in the x and y directions, respectively. Then G is the product of
F and H.
The “product” of two transforms means, for each spatial frequency value (each
combination of ω_x and ω_y), multiply the values of the two functions. (Just in case
you do not remember the details, in general, these values are complex numbers.)
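To make the pointwise product concrete, here is a one-dimensional sketch that compares direct convolution with convolution through the transform. It is deliberately naive: the `dft` helper is a slow textbook sum, not a fast transform, and the one-dimensional setting, the function names, and the sample data are assumptions made for the illustration.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def convolve_direct(f, h):
    """Sum-of-products convolution."""
    out = [0.0] * (len(f) + len(h) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(h):
            out[i + j] += a * b
    return out

def convolve_fourier(f, h):
    """Convolution computed as a product of transforms."""
    n = len(f) + len(h) - 1              # zero-pad so both transforms have the same size
    F = dft(list(f) + [0.0] * (n - len(f)))
    H = dft(list(h) + [0.0] * (n - len(h)))
    G = [a * b for a, b in zip(F, H)]    # complex product at each frequency
    return [v.real for v in dft(G, inverse=True)]

f = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0]
h = [1.0, 0.0, -1.0]
direct = convolve_direct(f, h)
via_fourier = convolve_fourier(f, h)
```

The zero padding matters: without it, the product of transforms corresponds to a circular convolution, in which the kernel wraps around the ends of the signal.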
• Transform f: N² log N.
• Transform h: L² log L.
• Perform appropriate operations, such as padding, to get H and F the same size.
• Multiply H by F: N².
• Inverse transform the result: N² log N.
“Scale space” is a recent addition to the well-known concept of image pyramids, first
used in picture processing by Kelly [5.19] and later extended in a number of ways
(see [5.5, 5.8, 5.30, 5.32], and many others). In a pyramid, a series of representations
of the same image are generated, each created by a 2 : 1 subsampling (or averaging)
of the image at the next higher level (Fig. 5.9).
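A minimal sketch of the averaging form of a pyramid follows. It assumes the image is a list of rows whose dimensions are powers of two, and the function names are invented for the example.

```python
def reduce_level(image):
    """Halve each dimension: each output pixel is the mean of a 2x2 block."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [[(image[2 * r][2 * c] + image[2 * r][2 * c + 1] +
              image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(cols)] for r in range(rows)]

def build_pyramid(image):
    """Return the list of levels, from the full image down to a single pixel."""
    levels = [image]
    while len(levels[-1]) > 1 and len(levels[-1][0]) > 1:
        levels.append(reduce_level(levels[-1]))
    return levels

image = [[r * 4 + c for c in range(4)] for r in range(4)]   # made-up 4 x 4 image
pyramid = build_pyramid(image)
```

For this 4 × 4 example the pyramid has three levels, the last being a single pixel holding the mean of the whole image.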
In Fig. 5.10, a Gaussian pyramid is illustrated. It is generated by blurring each
level with a Gaussian prior to 2 :1 subsampling. An interesting question should arise
as you look at this figure. Could you, from all the data in this pyramid, reconstruct
the original image? The answer is “no, because at each level, you are throwing away
high-frequency information.”
Although the Gaussian pyramid alone does not contain sufficient information
to reconstruct the original image, we could construct a pyramid that does contain
sufficient information. To do that, we use a “Laplacian” pyramid, constructed by
computing a similar representation of the image; this preserves the high-frequency
information (Fig. 5.11). Combining the two pyramid representations allows
reconstruction of the original image.
Margin note: Usually, when we say “scale-space”, we do not mean a pyramid, we mean
a varying blur.
In a modern scale space representation we preserve the concept that each level
is a blurring of the previous level, but do not subsample – each level is the same
size as the previous level, but more blurred. Normally, each level is generated by
convolving the original image with a Gaussian of variance σ², and σ varies from
one level to the next. This variance then becomes the “scale parameter.” Clearly, at
high levels of scale (σ large), only the largest features are visible. We will see more
about scale space later in this chapter, when we talk about wavelets.
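The following sketch builds such a representation for a one-dimensional signal. Several choices here are assumptions made for the example rather than part of the text: the kernel is truncated at three standard deviations, renormalized to sum to one, and border samples are replicated.

```python
import math

def gaussian_kernel(sigma):
    """Sampled Gaussian, truncated at about 3 sigma and normalized to sum to 1."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(signal, sigma):
    """Blur a 1D signal; the output has the same length as the input."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)   # replicate border samples
            acc += w * signal[j]
        out.append(acc)
    return out

def scale_space(signal, sigmas):
    """One blurred copy of the original signal per value of sigma (no subsampling)."""
    return [blur(signal, s) for s in sigmas]

impulse = [0.0] * 41
impulse[20] = 1.0
levels = scale_space(impulse, [1.0, 2.0, 4.0])
```

Every level has the same length as the input, and the peak of the blurred impulse drops as σ grows, which is the behavior the paragraph above describes.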
Fig. 5.9. A pyramid is a data structure which is a series of images, in which each pixel is the
average of four pixels at the next lower level.
Fig. 5.10. A Gaussian pyramid, constructed by blurring each level with a Gaussian and then 2 : 1
subsampling.
Fig. 5.12. An image is divided into four blocks. Each inhomogeneous block is further divided.
This partitioning may be represented by a tree.
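The partitioning in the caption can be sketched recursively. This example assumes a square image whose side is a power of two, treats a block as homogeneous when all its pixels are equal, and orders the four children as upper left, upper right, lower left, lower right; that ordering and the names are choices made for the sketch.

```python
def build_quadtree(image, row=0, col=0, size=None):
    """Return a pixel value for a homogeneous block, else a list of four subtrees."""
    if size is None:
        size = len(image)
    first = image[row][col]
    homogeneous = all(image[r][c] == first
                      for r in range(row, row + size)
                      for c in range(col, col + size))
    if homogeneous:
        return first
    half = size // 2
    return [build_quadtree(image, row, col, half),                 # upper left
            build_quadtree(image, row, col + half, half),          # upper right
            build_quadtree(image, row + half, col, half),          # lower left
            build_quadtree(image, row + half, col + half, half)]   # lower right

def count_leaves(tree):
    """Number of homogeneous blocks stored in the tree."""
    return sum(count_leaves(t) for t in tree) if isinstance(tree, list) else 1

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 0, 1],
         [0, 0, 1, 1]]
tree = build_quadtree(image)
```

Only the lower right quadrant of this made-up image is inhomogeneous, so it is the only one that is subdivided.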
87 5.9 Scale space
sequence, have shown that this is not true. Since the difference image is only nonzero
where things are moving, it seems obvious that this, mostly zero, image would be
efficiently stored in a quad tree. Not so. Even in that case, the overhead of managing
the tree overwhelms the storage gains. So, surprisingly, the quad tree is not an
efficient image compression technique. When used as a means for representing a
pyramid, it does, however, have advantages as a way of representing scale space.
Another disadvantage of using quad trees is that a slight movement of an object
can result in radically different tree representations, that is, the tree representation
is not rotation or translation invariant. In fact, it is not even robust. Here, “robust”
means a small translation of an object results in a correspondingly small change in the
representation. One can get around this problem, to some extent, by not representing
the entire image, but instead, representing each object subimage with a quad tree.
The generalization of the quad tree to three dimensions is called an “octree.” The
same principles apply.
The last property is certainly debatable, since convolution formally requires a linear,
space-invariant operator. One interesting approach to scale space which violates
this requirement is to produce a scale space by using gray scale morphological
smoothing (we will discuss this later) with larger and larger structuring elements
[5.16].
You could use scale space concepts to represent texture [4.16] or even a prob-
ability density function (in which case, your scale space representation becomes a
clustering algorithm [5.24]) as well as brightness. We will see applications of scale
representations as we proceed through the course.
One of the most interesting aspects of scale space representations is the behavior
of our old friend, the Gaussian. The second derivative of the Gaussian (in two
dimensions, the Laplacian of Gaussian: LOG) has been shown [5.27] to have some
very nice properties when used as a kernel. In particular, the zero crossings of the
LOG are good indicators of the location of an edge. One might be inclined to ask,
“Is the Gaussian the best smoothing operator to use to develop a kernel like this?”
Said another way: We want a kernel whose second derivative never generates a new
zero crossing as we move to larger scale. In fact, we could state this desire in the
following more general form.
Let our concept of a “feature” be a point where some operator has an extremum,
either maximum or minimum. The concept of scale space causality says that as
scale increases, as images become more blurred, new features are never created. The
Gaussian is the ONLY kernel (linear operator) with this property [5.1, 5.2]. Studies
of nonlinear operators have been done to see under what conditions these operators
are scale space causal [5.22].
This idea of scale space causality is illustrated in the following example. Fig. 5.13
illustrates the brightness profile along a single line from an image, and the scale space
created by blurring that single line with one-dimensional Gaussians of increasing
variance. In Fig. 5.14, we see the Laplacian of the Gaussian, and the points where
the Laplacian changes sign. The features in this example, the zero crossings (which
are good candidates for edges), are indicated in the right image. Observe that as
scale increases, feature points (in this case, zero crossings) are never created.
As we go from top (low scale) to bottom (high scale), some features
disappear, but no new ones are created.
One obvious application of this idea is to identify the important edges in the image
first. We can do that by going up in scale, finding those few edges, and then tracking
them down to lower scale.
Since there are many options in the design of an edge detection algorithm, we need
some objective ways to say that one edge detector works better than another. Pratt
[5.33] has suggested a simple formula to address this question. The formula is just
89 5.10 Quantifying the accuracy of an edge detector
Fig. 5.13. (a) Brightness profile of a scanline through an image. (b) Scale space representation
of that scanline. Scale increases toward the bottom, so no new features should be
created as one goes from top to bottom.
[Fig. 5.14 panel titles: “Increasing variance Laplacian blur” (left); “Increasing variance Gaussian blur, superimposed with zero crossings of Laplacian” (right).]
Fig. 5.14. Laplacian of the scale space representation, and the zero crossings of the Laplacian.
Since this is one-dimensional data, there is no difference between the Laplacian and
the second derivative.
$$R = \frac{1}{I_N} \sum_{i=1}^{I_a} \frac{1}{1 + \alpha d^2} \tag{5.36}$$
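A sketch of how a figure of merit of this form might be evaluated is given below. The interpretation of the symbols is an assumption taken from the way Pratt's measure is commonly stated, since the definitions are not reproduced here: I_N is taken as the larger of the number of ideal and detected edge points, d as the distance from a detected point to the nearest ideal edge point, and α = 1/9 as a typical scaling constant. The function name and data layout are also invented for the example.

```python
def figure_of_merit(ideal, detected, alpha=1.0 / 9.0):
    """Edge-detector score in [0, 1]; ideal and detected are lists of (row, col)."""
    if not ideal or not detected:
        return 0.0
    total = 0.0
    for (r, c) in detected:
        # Assumed meaning of d: distance to the nearest ideal edge point.
        d2 = min((r - ri) ** 2 + (c - ci) ** 2 for (ri, ci) in ideal)
        total += 1.0 / (1.0 + alpha * d2)
    # Assumed meaning of I_N: the larger of the two point counts.
    return total / max(len(ideal), len(detected))

ideal_edge = [(r, 2) for r in range(5)]       # made-up vertical edge in column 2
shifted_edge = [(r, 3) for r in range(5)]     # the same edge, displaced by one pixel
```

Under these assumptions a perfect detection scores 1, and displacing every detected point by one pixel lowers the score to 0.9.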
Two neurophysiologists, David Hubel and Torsten Wiesel [5.13, 5.14], stuck some
electrodes in the brains – specifically the visual cortex – first of cats3 and later
of monkeys. While recording the firing of neurons, they provided the animal with
visual stimuli of various types. They observed some fascinating results: First, there
are cells which fire only when specific types of patterns are observed. For example,
a particular cell might only fire if it observed an edge, bright to dark, at a particular
angle. There was evidence that each of the cells they measured received input from
a neighborhood of cells called a “receptive field.” There were a variety of types
of receptive fields, possibly all connected to the same light detectors, which were
organized in such a way as to accomplish edge detection and other processing. Jones
and Palmer [5.17] mapped receptive field functions carefully and confirmed [5.9,
5.10] that the function of receptive fields could be accurately represented by Gabor
functions, which have the form of Eq. (5.37):
2
1 x y2
G(x, y) = exp − + exp(i[ x + y]). (5.37)
2 2 2
The first exponential is a two-dimensional Gaussian whose isophotes form ellipses
with major and minor axes aligned with the x and y axes. (If you happen to be
dealing with a receptive field which is tilted with respect to x and y, you need to
rotate your coordinate system to make this equation still hold.) The second (complex)
exponential represents a plane wave. Eq. (5.37) assumes the origin is at the center
of the Gaussian. Fig. 5.15 illustrates a Gabor filter.
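The structure just described, a Gaussian envelope multiplied by a complex plane wave, can be sketched as follows. The parameterization is an assumption: the names `sigma_x`, `sigma_y`, `freq_x`, and `freq_y` and the normalization used here are one common convention and may differ from the constants in Eq. (5.37).

```python
import cmath
import math

def gabor(x, y, sigma_x, sigma_y, freq_x, freq_y):
    """Axis-aligned Gaussian envelope times a complex plane wave, origin at the center."""
    envelope = math.exp(-0.5 * ((x / sigma_x) ** 2 + (y / sigma_y) ** 2))
    envelope /= 2.0 * math.pi * sigma_x * sigma_y
    wave = cmath.exp(1j * (freq_x * x + freq_y * y))
    return envelope * wave

# Real part sampled on a 7 x 7 grid. The ellipse is elongated 2 : 1 along x and the
# wave travels along y, the short axis of the ellipse (values chosen for illustration).
kernel = [[gabor(x, y, 2.0, 1.0, 0.0, 1.5).real for x in range(-3, 4)]
          for y in range(-3, 4)]
```

The real part is an even function and the imaginary part an odd one, so the two parts behave like a matched pair of symmetric and antisymmetric kernels.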
The following interesting observations have been made [5.23] regarding the values
of the parameters in Eq. (5.37), when those parameters are actually measured in living
organisms:
r The aspect ratio, /␣, of the ellipse is 2 : 1.
r The plane wave tends to propagate along the short axis of the ellipse.
r The half-amplitude bandwidth of the frequency response is about 1 to 1.5 octaves
along the optimal orientation.
3 We were going to put a dead cat joke here, something like “One of the cats died during the procedure, but its
behavior was unchanged,” but the publisher told us people would be offended, so we had to remove it.
91 5.11 So how do people do it?
Fig. 5.15. Gabor filter. Note that the positive/negative response is very similar to those that we
derived earlier in this chapter.
[Fig. 5.16 panel title: “Gabor Function”.]
Fig. 5.16. Comparison of a section through a Gabor filter and a similar section through a fourth
derivative of a Gaussian.
through a Gabor and a fourth derivative of a Gaussian. You can see noticeable
differences, the principal one being that the Gabor goes on forever, whereas the fourth
derivative has only three extrema. However, to the precision available to neurology
experiments, they are the same. The problem is simply that the measurement of
the data is not sufficiently accurate, and it is possible to fit a variety of curves
to it.
Bottom line: We have barely a clue how the brain works, and we don’t really
know very much about the retina. There are two or three mathematical models
which adequately model the behavior of receptive fields.
5.12 Conclusion
Margin note: Consistency in edge detection.
Explicit use of consistency has not been made in this chapter. However, in
Assignment 10.1, you will see an application of consistency to edge detection. In that
problem, you will be asked to develop an algorithm which makes use of the fact
that adjacent edge pixels have parallel gradients. That is, if pixel A is a neighbor
of pixel B, and the gradient at pixel A is parallel (or nearly parallel) to the gradient
at pixel B, this increases the confidence that both pixels are members of the same
edge.
In this chapter, we have looked at several ways to derive kernel operators which,
when applied to images, result in strong responses for types of edges.
• We applied the definition of the derivative.
• We fit an analytic function to a surface, by minimizing the sum squared error.
• We converted subimages into vectors, and projected those vectors onto special
basis vectors which described edge-like characteristics.
• We made use of the linearity of kernel operators to interchange the role of blur and
differentiation to construct kernels which are the derivatives of special blurring
kernels. We used constrained optimization and Lagrange multipliers to solve
this problem.
Margin note: Minimize the sum-squared error.
Margin note: Constrained optimization and Lagrange multipliers.
5.13 Vocabulary
Basis vector
Convolution
Correlation
Gabor filter
Image gradient
Inner product
Kernel operator
Lagrange multiplier
Lexicographic
Linear operator
LOG
Projection
Pyramid
Quad tree
Scale space
Sum-squared error
Assignment 5.1
The previous section showed how to estimate the first
derivative by fitting a plane. Clearly that will
not work for the second derivative, since the second
derivative of a plane is zero everywhere. Use the same
approach, but use a biquadratic
f(x,y) = ax² + by² + cx + dy + e.
Then [a b c d e]^T = A, [x² y² x y 1]^T = X.
Find the 3 × 3 kernel which estimates ∂²f/∂x².
Assignment 5.2
Oh no! Another part of the assignment! (hey, this one
is much easier than the one above -- a real piece of
cake). Using the same methods, find the 5 × 5 kernel
which estimates ∂f/∂x at the center point, using the
equation of a plane.
Assignment 5.3
Determine whether u1 and u2 in Fig. 5.4 are in fact
orthonormal. If not, recommend a modification or other
approach which will allow all the us to be used as
basis functions.
Assignment 5.4
(1) Write a program to generate an image which is
64 × 64, as illustrated below.
50 100
32
25 10 16
32
1 1 1
1/10
1 2 1
1 1 1
Assignment 5.5
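Write a program to apply the following two kernels
(referred to in the literature as the “Sobel operators”)
to images SYNTH1, BLUR1, and BLUR1.V1.
$$h_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \qquad
h_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$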
Assignment 5.6
In Assignment 5.2, you derived a 5 × 5 kernel. Repeat
Assignment 5.5 using that kernel, for ∂/∂ x and the
appropriate version for ∂/∂ y.
Assignment 5.7
(1) Verify the mathematics we did in Eq. (5.30). Find
a 3 × 3 kernel which implements the derivative-of-
Gaussian vertical edge operator of Eq. (5.30). Use
σ = 1 and σ = 2 and determine two kernels. Repeat for
a 5 × 5 kernel. Discuss the impact of the choice of
σ and its relation to kernel size. Assume the kernel
may contain real (floating point) numbers.
Assignment 5.8
In section 5.7, parameters useful for developing discrete
Gaussian kernels were discussed. Prove that the
value of σ such that the maximum of the second derivative
occurs at x = −1 and x = 1 is σ = 1/√3.
Assignment 5.9
Use the method of fitting a polynomial to estimate
∂²f/∂y². Which of the following polynomials would be
most appropriate to choose?
(a) f = ax² + by + cxy
(b) f = ax² + by² + cxy + d
(c) f = ax³ + by³ + cxy
(d) f = ax + by + c
Assignment 5.10
Fit the following expression to pixel data in a 3 × 3
neighborhood: f(x,y) = ax² + bx + cy + d. From this fit,
determine a kernel which will estimate the second
derivative with respect to x.
Assignment 5.11
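Use the function f = ax² + by² + cxy to find a 3 × 3 kernel
which estimates ∂²f/∂y². Which of the following is the
kernel which results? (Note: The following answers do
not include the scale factor. Thus, the best choice
below will be the one proportional to the correct
answer.)
$$\text{(a)} \begin{bmatrix} 6 & 4 & 0 \\ 4 & 6 & 0 \\ 0 & 0 & 4 \end{bmatrix} \quad
\text{(b)} \begin{bmatrix} 1 & -2 & 1 \\ 3 & 0 & 3 \\ 1 & -2 & 1 \end{bmatrix} \quad
\text{(c)} \begin{bmatrix} 2 & 6 & 2 \\ -4 & 0 & -4 \\ 2 & 6 & 2 \end{bmatrix}$$
$$\text{(d)} \begin{bmatrix} 1 & 0 & -1 \\ -2 & 0 & 2 \\ 1 & 0 & -1 \end{bmatrix} \quad
\text{(e)} \begin{bmatrix} 6 & 4 & 0 \\ 4 & 6 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad
\text{(f)} \begin{bmatrix} 1 & 2 & 1 \\ -2 & 0 & -2 \\ -1 & -2 & -1 \end{bmatrix}$$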
Assignment 5.12
Suppose the kernel that estimates ∂²f/∂x∂y is
$$\begin{bmatrix} 2 & -1 & 0 \\ -1 & 0 & 1 \\ 0 & 1 & -2 \end{bmatrix}.$$
4 1 0
1 0 1
0 1 4
2 2
estimates ∂ f/∂ x ∂ y ? Explain your answer.
The process of edge detection includes more than simply thresholding the gradient. We want
to know the location of the edge more precisely than simple gradient thresholds reveal. Two
methods which seem to have acquired a substantial reputation in this area are the so-called
“Canny edge detector” [5.6] and the “facet model” [4.18]. Here, we describe only the Canny
edge detector.
The edge detection algorithm begins with finding estimates of the gradient magnitude at
each point. Canny uses 2 × 2 rather than 3 × 3 kernels as we have, but it does not affect the
philosophy of the approach. Once we have estimates of the two partial derivatives, we use
Eqs. (5.22) through (5.24) to calculate the magnitude and direction of the gradient, producing
two images, M(x, y) and THETA (x, y). We now have a result which easily identifies the
pixels where the magnitude of the gradient is large. That is not sufficient, however, as we
now need to thin the magnitude array, leaving only points which are maxima, creating a new
image N (x, y). This process is called nonmaximum4 suppression (NMS).
NMS may be accomplished in a number of ways. The essential idea, however, is as follows:
First, initialize N(x, y) to M(x, y). Then, at each point, (x, y), look one pixel in the direction
of the gradient, and one pixel in the reverse direction. If M(x, y) (the point in question) is
not the maximum of these three, set its value in N(x, y) to zero. Otherwise, the value of N is
unchanged.
4 This is sometimes written using the plural, nonmaxima suppression. The expression is ambiguous. It could be
a compression of suppress every point which is not a maximum, or suppress all points which are not maxima.
We choose to use the singular.
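One way the essential idea might look in code is sketched below. Several details are assumptions made for the example: the direction image holds angles in radians, the one-pixel step is obtained by rounding the cosine and sine to the nearest of the eight neighbors, columns play the role of x and rows the role of y, and ties are left unsuppressed.

```python
import math

def nonmaximum_suppression(M, THETA):
    """Zero every pixel of M that is not the largest of itself and its two
    neighbors along the gradient direction; returns the thinned image N."""
    rows, cols = len(M), len(M[0])
    N = [row[:] for row in M]                        # initialize N to M
    for r in range(rows):
        for c in range(cols):
            # One-pixel step along the gradient, rounded to the pixel grid.
            dc = int(round(math.cos(THETA[r][c])))
            dr = int(round(math.sin(THETA[r][c])))
            for sr, sc in ((r + dr, c + dc), (r - dr, c - dc)):
                if 0 <= sr < rows and 0 <= sc < cols and M[sr][sc] > M[r][c]:
                    N[r][c] = 0
    return N

# Made-up magnitude image: a thick vertical edge whose gradient points along the columns.
M = [[1, 3, 9, 4, 1] for _ in range(3)]
THETA = [[0.0] * 5 for _ in range(3)]
N = nonmaximum_suppression(M, THETA)
```

In this example only the ridge of the thick response survives, leaving an edge one pixel wide.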
After NMS, we have edges which are properly located and are only one pixel wide. These
new edges, however, still suffer from the problems we identified earlier – extra edge points
due to noise (false hits) and missing edge points due either to blur or to noise (false misses).
Some improvement can be gained by using a dual-threshold approach. Two thresholds are
used, τ1 and τ2, where τ2 is significantly larger than τ1. Application of these two different
thresholds to N(x, y) produces two binary edge images, denoted T1 and T2 respectively. Since
T1 was created using a lower threshold, it will contain more false hits than T2. Points in T2
are therefore considered to be parts of true edges. Connected points in T2 are copied to the
output edge image. When the end of an edge is found, points are sought in T1 which could
be continuations of the edge. The continuation proceeds until it connects with another T2
edge point, or no connected T1 points are found.
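A simplified sketch of the dual-threshold idea follows. It is not the exact tracing procedure described above: instead of following each edge to its end, it grows outward from every T2 point through 8-connected T1 points, which tends to give a similar result. The function and parameter names are assumptions made for the example.

```python
def hysteresis(N, low, high):
    """Keep strong edge points plus any weak points connected to them."""
    rows, cols = len(N), len(N[0])
    T1 = [[v >= low for v in row] for row in N]     # lower threshold: more false hits
    T2 = [[v >= high for v in row] for row in N]    # higher threshold: taken as true edges
    out = [[False] * cols for _ in range(rows)]
    stack = [(r, c) for r in range(rows) for c in range(cols) if T2[r][c]]
    for r, c in stack:
        out[r][c] = True
    while stack:
        r, c = stack.pop()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (0 <= rr < rows and 0 <= cc < cols
                        and T1[rr][cc] and not out[rr][cc]):
                    out[rr][cc] = True              # weak point continues a true edge
                    stack.append((rr, cc))
    return out

N = [[0, 5, 5, 9, 5, 0, 5]]      # made-up thinned magnitudes along one row
edges = hysteresis(N, low=4, high=8)
```

The weak points adjacent to the strong response are kept, while the isolated weak point at the end of the row is discarded as a probable false hit.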
In [5.6] Canny also illustrates some clever approximations which provide significant
speedups.
Margin note: The derivative of g(x) is the second derivative of the image, so look for
zero crossings.
Tagare and deFigueiredo [5.34] (see also [5.1]) describe the process of edge detection as
follows.
(1) The input is convolved with a filter which smoothly differentiates the input and produces
high values at and near the location of the edge. The output g(x) is the sum of the
differentiated step edge and filtered noise.
(2) A decision mechanism isolates regions where the output of the filter is significantly higher
than that due to noise.
(3) A mechanism identifies the zero crossing in the derivative of g(x) in the isolated region
and declares it to be the location of the edge.
A Gaussian low-pass filter followed by finding a zero in the (second) derivative accurately
(to subpixel resolution) finds the exact location of the edge, but only if the edge is straight
[5.38]. If the edge is curved, errors are introduced. For example, the second derivative in
the gradient direction (SDGD) and the Laplacian both make errors in estimating the location
of the edge, but interestingly, in opposite directions, which prompts Verbeek and van Vliet
[5.38] to suggest an operator which is the sum of the two.
All the methods cited or described in this section perform signal processing in a direction
normal to the edge [5.18] in order to better locate the actual edge. Taratorin and Sideman
[5.35] present a way to make use of the fact that the images are known to have properties such
as positivity and finite support to improve the accuracy with which derivatives may be esti-
mated. Iverson and Zucker [5.15] improve the results of the Canny by adding logical/Boolean
reasoning. This improves edge detection over simple thresholding of the derivative, but does
not provide results as good as active contours (see Chapter 9) or optimization (see Chapter 6).
There are a wide variety of papers on signal processing techniques applied to edge detection
[5.36].
99 Topic 5A Edge detectors
Examination of the literature in biological imaging systems, all the way back to the pio-
neering work of Hubel and Wiesel in the 1960s [5.13] suggests that biological systems analyze
images by making local measurements which quantify orientation, scale, and motion.
Keeping this in mind, suppose we wish to ask a question like “is there an edge at orientation θ
at this point?” How might we construct a kernel that is specifically sensitive to edges at that
orientation? A straightforward approach [5.37] is to construct a weighted sum of the two
Gaussian first derivative kernels, G_x and G_y, using a weighting something like
After we have chosen the very best operators to estimate derivatives, have chosen the best
thresholds, and selected the best estimates of edge position, we still have nothing but a set
of pixels, some of which have been marked as probably part of an edge. If those points are
adjacent, one could “walk” from one pixel to the next, eventually circumnavigating a region,
and there are representations such as the chain code which make this process easy. However,
the points are unlikely to be connected the way we would like them to be. Some points may
be missing due to blur, noise, or partial occlusion. There are many ways to approach this
problem, including relaxation labeling, and parametric transforms, both of which will be
discussed in detail later in this book. In addition, there are combination methods, such as the
work of Deng and Iyengar [5.11] which combines relaxation and Bayes’ methods as well as
other methods [5.29] which we do not have space to discuss.
Wavelets are very important, but a thorough examination of this area is beyond the scope of
this book. Therefore we present only a rather superficial description here, and provide some
pointers to literature. For example, Castleman [4.6] has a readable chapter on wavelets.
Consider the image illustrated in Fig. 5.17. Clearly, the spatial frequencies appear to be differ-
ent, depending on where one looks in the image. The Fourier transform has no mechanism for
capturing this intuitive need to represent both frequency and location. The Fourier transform
of this image will be a two-dimensional array of numbers, representing the amount of energy
in the entire image at each spatial frequency. Clearly, since the Fourier transform is invertible,
it captures all this spatial and frequency information, but there is no obvious way to answer
the question: at each position, what are the local spatial frequencies?
100 Linear operators and kernels
The wavelet approach is to add a degree of freedom to the representation. Since the
Fourier transform is complete and invertible, all we really need to characterize the image is a
single two-dimensional array. Instead however, following the space/frequency philosophy as
described in section 5.8, we use a three-(or higher) dimensional data structure. In this sense,
the space/frequency representation is redundant (or overcomplete), and requires significantly
more storage than the Fourier transform.
We define a basic wavelet ψ(x, y) as any function of the two spatial variables x and y which
meets a certain criterion that we need not concern ourselves with here. Basically, we desire a
function which is symmetric about the origin and has almost finite support. By “almost finite
support,” we mean that the magnitude of this function drops off to zero rapidly (in a particular
way defined by the admissibility criterion) as it goes away from its center. A one-dimensional
example basic wavelet is
$$\psi(x) = \frac{2}{\sqrt{3}\,\pi^{1/4}} \left(1 - x^2\right) \exp\left(-\frac{x^2}{2}\right) \tag{5.39}$$
graphed in Fig. 5.18. From one such wavelet, one may then generate a (potentially infinite) set
of similar functions by translation and scaling of the original. That is, we define a translated,
scaled version of ψ by (again, in one dimension)
$$\psi_{a,b}(x) = \frac{1}{\sqrt{a}}\,\psi\left(\frac{x-b}{a}\right). \tag{5.40}$$
The wavelet transform of a function f is computed by the inner products of f with the wavelet
at each of the possible values of a and b:
$$W_f(a,b) = \int_{-\infty}^{\infty} f(x)\,\psi_{a,b}(x)\,dx. \tag{5.41}$$
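As a rough numerical illustration of Eq. (5.41), the sketch below approximates the inner product by a Riemann sum over a sampled signal. The wavelet is the one-dimensional example of Eq. (5.39), written here with the unit-energy normalization 2/(√3 π^(1/4)), which is an assumption, as are the function names, the sampling grid, and the chosen scales and shifts.

```python
import numpy as np

def basic_wavelet(x):
    """Mexican-hat wavelet; the leading constant is an assumed unit-energy normalization."""
    return 2.0 / (np.sqrt(3.0) * np.pi**0.25) * (1.0 - x**2) * np.exp(-x**2 / 2.0)

def scaled_wavelet(x, a, b):
    """Translated (by b) and scaled (by a) copy of the basic wavelet."""
    return basic_wavelet((x - b) / a) / np.sqrt(a)

def wavelet_transform(f, x, scales, shifts):
    """Approximate W_f(a, b) by a Riemann sum on the uniform grid x."""
    dx = x[1] - x[0]
    W = np.empty((len(scales), len(shifts)))
    for i, a in enumerate(scales):
        for j, b in enumerate(shifts):
            W[i, j] = np.sum(f * scaled_wavelet(x, a, b)) * dx
    return W

x = np.linspace(-20.0, 20.0, 4001)
f = scaled_wavelet(x, 2.0, 3.0)            # test signal: the wavelet itself at a = 2, b = 3
scales = np.array([1.0, 2.0, 4.0])
shifts = np.linspace(-10.0, 10.0, 41)
W = wavelet_transform(f, x, scales, shifts)
```

Because the test signal is itself a scaled, shifted wavelet, the magnitude of W peaks at the matching scale and shift, which is the sense in which the transform reports both "how big" and "where."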
Fig. 5.19. Original and the result of the inner product of the original and three different two-
dimensional wavelets (three slices through the wavelet transform).
Observe that the transform is a function of the scale and the shift. The same idea holds in two
dimensions, where the equation for the inner product becomes
$$W_f(a,b_x,b_y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,\psi_{a,b_x,b_y}(x,y)\,dx\,dy. \tag{5.42}$$
Fig. 5.19 illustrates a cross section of W for different values of a. Clearly, this process pro-
duces a scale space representation.
Lee [5.23] takes the neurophysiological evidence and derives a mother wavelet of the
following form:
$$\psi(x,y) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{4x^2+y^2}{8}\right) \left[\exp(i\kappa x) - \exp\left(-\frac{\kappa^2}{2}\right)\right], \tag{5.43}$$
where κ is a constant whose value depends on assumptions about the bandwidth, but is
approximately equal to 3.5. Scaling and translating this “mother wavelet” produces a collection
of filters which (in the same way that the Frei–Chen basis set did) represents how much of
an image neighborhood resembles the feature, and from which the image can be completely
reconstructed.
5A.5 Vocabulary
Assignment 5.A1
At high levels of scale, only ______ objects are
visible. (Fill in the blank.)
Assignment 5.A2
What does the following expression estimate? (f ⊗ h1)^2 +
(f ⊗ h2)^2 = E, where the kernel h1 is defined by
Assignment 5.A3
You are to use the idea of differentiating a Gaussian to
derive a kernel. What variance does a one-dimensional (zero
mean) Gaussian need to have to have the property that the
extrema of its first derivative occur at x = ±1?
Assignment 5.A4
What is the advantage of the quadratic variation as compared
to the Laplacian?
Assignment 5.A5
Let E = (f − Hg)^T (f − Hg). Using the equivalence between the
kernel form of a linear operator and the matrix form, write
an expression for E using kernel notation.
Assignment 5.A6
The purpose of this assignment is to walk you through the
construction of an image pyramid, so that you will more fully
understand the potential utility of this data structure, and
can use it in a coding and transmission application. Along
the way, you will pick up some additional understanding of
the general area of image coding. Coding is not a princi-
pal learning objective of this course, but in your research
career, you are sure to encounter people who deal with coding
References
[5.1] V. Anh, J. Shi, and H. Tsai, “Scaling Theorems for Zero Crossings of Bandlimited
Signals,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(3),
1996.
[5.2] J. Babaud, A. Witkin, M. Baudin, and R. Duda, “Uniqueness of the Gaussian Ker-
nel for Scale-space Filtering,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 8(1), 1986.
[5.3] G. Bilbro and W. Snyder, “Optimization of Functions with Many Minima,” IEEE
Transactions on Systems, Man, and Cybernetics, 21(4), July/August, 1991.
[5.4] K. Boyer and S. Sarkar, “On the Localization Performance Measure and Optimal Edge
Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1),
1994.
[5.5] P. Burt and E. Adelson, “The Laplacian Pyramid as a Compact Image Code,” Computer
Vision, Graphics, and Image Processing, 16, pp. 20–51, 1981.
[5.6] J. Canny, “A Computational Approach to Edge Detection,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 8(6), 1986.
[5.7] W. Chojnacki, M. Brooks, A. van der Hengel, and D. Gawley, “On the Fitting of Sur-
faces to Data with Covariances,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(11), 2000.
[5.8] J. Crowley, A Representation for Visual Information, Ph.D. Thesis, Carnegie-Mellon
University, 1981.
[5.9] J. Daugman, “Two-dimensional Spectral Analysis of Cortical Receptive Fields,”
Vision Research, 20, pp. 847–856, 1980.
[5.10] J. Daugman, “Uncertainty Relation for Resolution in Space, Spatial Frequency, and
Orientation Optimized by Two-dimensional Visual Cortical Filters,” Journal of the
Optical Society of America, 2(7), 1985.
[5.11] W. Deng and S. Iyengar, “A New Probabilistic Scheme and Its Application to Edge
Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4),
1996.
[5.12] W. Frei and C. Chen, “Fast Boundary Detection: A Generalization and a New Algo-
rithm,” IEEE Transactions on Computers, 26(2), 1977.
[5.13] D. Hubel and T. Wiesel, “Receptive Fields, Binocular Interaction, and Functional
Architecture in the Cat’s Visual Cortex,” Journal of Physiology (London), 160,
pp. 106–154, 1962.
[5.14] D. Hubel and T. Wiesel, “Functional Architecture of Macaque Monkey Visual Cortex,”
Proceedings of the Royal Society of London, B, 198, pp. 1–59, 1977.
[5.15] L. Iverson and S. Zucker, “Logical/Linear Operators for Image Curves,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 17(10), 1995.
[5.16] P. Jackway and M. Deriche, “Scale-space Properties of the Multiscale Morphological
Dilation–Erosion,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
18(1), 1996.
[5.17] J. Jones and L. Palmer, “An Evaluation of the Two-dimensional Gabor Filter Model
of Simple Receptive Fields in the Cat Striate Cortex,” Journal of Neurophysiology,
58, pp. 1233–1258, 1987.
[5.18] E. Joseph and T. Pavlidis, “Bar Code Waveform Recognition using Peak Locations,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6), 1994.
[5.19] M. Kelly, in Machine Intelligence, volume 6, University of Edinburgh Press,
1971.
[5.20] M. Kisworo, S. Venkatesh, and G. West, “Modeling Edges at Subpixel Accuracy using
the Local Energy Approach,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(4), 1994.
[5.21] A. Klinger, “Pattern and Search Statistics,” in Optimizing Methods in Statistics,
New York, Academic Press, 1971.
[5.22] P. Kube and P. Perona, “Scale-space Properties of Quadratic Feature Detectors,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(10), 1996.
[5.23] T. Lee, “Image Representation Using 2-D Gabor Wavelets,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 18(10), 1996.
[5.24] Y. Leung, J. Zhang, and Z. Xu, “Clustering by Scale-space Filtering,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 22(12), 2000.
[5.25] T. Lindeberg, “Scale-space for Discrete Signals,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 12(3), 1990.
[5.26] T. Lindeberg, “Scale-space Theory, A Basic Tool for Analysing Structures at Different
Scales,” Journal of Applied Statistics, 21(2), 1994.
[5.27] D. Marr and E. Hildreth, “Theory of Edge Detection,” Proceedings of the Royal Society
of London, B, 207, pp. 187–217, 1980.
[5.28] D. Marr and T. Poggio, “A Computational Theory of Human Stereo Vision,” Proceed-
ings of the Royal Society of London, B, 204, pp. 301–328, 1979.
[5.29] R. Nelson, “Finding Line Segments by Stick Growing,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(5), 1994.
[5.30] E. Pauwels, L. Van Gool, P. Fiddelaers, and T. Moons, “An Extended Class of Scale-
invariant and Recursive Scale Space Filters,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(7), 1995.
[5.31] P. Perona, “Deformable Kernels for Early Vision,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 17(5), 1995.
[5.32] P. Perona and J. Malik, “Scale-space and Edge Detection using Anisotropic Diffusion,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7), pp. 629–639,
1990.
[5.33] W. Pratt, Digital Image Processing, Chichester, John Wiley and Sons, 1978.
[5.34] H. Tagare and R. deFigueiredo, “Reply to ‘On the Localization Performance Measure
and Optimal Edge Detection’,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(1), 1994.
[5.35] A. Taratorin and S. Sideman, “Constrained Regularized Differentiation,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[5.36] F. van der Heijden, “Edge and Line Feature Extraction Based on Covariance Models,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1), 1995.
[5.37] M. Van Horn, W. Snyder, and D. Herrington, “A Radial Filtering Scheme Applied to
Intracoronary Ultrasound Images,” Computers in Cardiology, September, 1993.
[5.38] P. Verbeek and L. van Vliet, “On the Location Error of Curved Edges in Low-pass
Filtered 2-D and 3-D Images,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(7), 1994.
[5.39] I. Weiss, “High-order Differentiation Filters that Work,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(7), 1994.
[5.40] M. Werman and Z. Geyzel, “Fitting a Second Degree Curve in the Presence of Error,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(2), 1995.
[5.41] R. Young, “The Gaussian Derivative Model for Spatial Vision: I. Retinal Mechanisms,”
Spatial Vision, 2, pp. 273–293, 1987.
6 Image relaxation: Restoration and
feature extraction
To change, and to change for the better are two different things
German proverb
In this chapter, we move toward developing techniques which remove noise and
degradations so that features can be derived more cleanly for segmentation. The
techniques of a posteriori image restoration and iterative image feature extraction
are described and compared. While image restoration methods remove degradations
from an image [6.3], image feature extraction methods extract features such as edges
from noisy images. Both are shown to perform the same basic operation: image re-
laxation. In the advanced topics section, image feature extraction methods, known
as graduated nonconvexity (GNC) and variable conductance diffusion (VCD), are
compared with a restoration/feature extraction method known as mean field anneal-
ing (MFA). This equivalence shows the relationship between energy minimization
methods and spatial analysis methods and between their respective parameters of
temperature and scale. The chapter concludes by discussing the general philosophy
of extracting features from images.
6.1 Relaxation
The term “relaxation” was originally used to describe a collection of iterative numeri-
cal techniques for solving simultaneous nonlinear equations (see [6.18] for a review).
The term was extended to a set of iterative classification methods by Rosenfeld and
Kak [6.64] because of their similarity. Here, we provide a general definition of the
term which will encompass these methods as well as those more recent techniques
which are the emphasis of this discussion.
Definition
[Margin note: A relaxation process is an iteration.]
A relaxation process is a multistep algorithm with the property that (1) the output of
a single step is of the same form as the input, so that the algorithm may be applied
iteratively, and (2) it converges to a bounded result. Some researchers also require
that the operation on any element (any pixel, in our application) be dependent only
on the state of the pixels in some well defined, finite “neighborhood” of that element.
We will see that all the algorithms discussed here are relaxation processes, according
to these criteria.
6.2 Restoration
In an image restoration problem, we assume that an ideal image, f, has been corrupted
to create the measured image, g. The usual model for the corruption is a distortion
operation, denoted by D, followed by the addition of random noise
g = D( f ) + n, (6.1)
where g = [g_1, . . . , g_N]^T and g_i denotes the ith pixel in a column vector
representation of the image g. f and n are similarly defined. The restoration problem,
then, is the problem of finding a best estimate of f given the measurement, g, some
knowledge of the distortion (often called “blur”), and the statistics of the noise.
Restoration is often referred to as an inverse problem. That is, we have a process
(in this case blur) which takes an input and produces an output. We can only measure
the output, and we wish to infer the input.
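[Margin note: Definition of an ill-posed problem.]
If these three conditions (existence of a solution, uniqueness of that solution, and continuous
dependence of the solution on the data) do not all hold, the problem is said to be “ill-posed.”
Ill-posedness is normally caused by the ill-conditioning of the problem. Conditioning of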
a mathematical problem is measured by the sensitivity of output to changes in input.
For a well-conditioned problem, a small change of input does not affect the output
much; while for an ill-conditioned problem, a small change of input can change the
output a great deal.
Condition number is the measurement of the conditioning of a problem. Generally,
it is defined as in Eq. (6.2). The larger the condition number, the more ill-conditioned
the problem is:
$$\text{condition number} \approx \frac{\text{change in output}}{\text{change in input}}. \tag{6.2}$$
The conditioning of a linear system Ax = b is determined by the condition number
of matrix A. The relative condition number K is defined as in Eq. (6.3),
$$K = \|A\|\,\|A^{-1}\| \tag{6.3}$$
where ‖·‖ usually indicates the 2-norm. K is in the range of [1, ∞). When K ≫ 1,
the linear system is ill-conditioned.
ga = 0.5b + 0.5d
gb = 0.33a + 0.33c + 0.33e
gc = 0.5b + 0.5f
gd = 0.33a + 0.33e + 0.33g
ge = 0.25b + 0.25d + 0.25f + 0.25h
gf = 0.33c + 0.33e + 0.33i
gg = 0.5d + 0.5h
gh = 0.33e + 0.33g + 0.33i
gi = 0.5f + 0.5h
Denote G = [ga gb gc gd ge gf gg gh gi]^T, F = [a b c d e f g h i]^T.
[Margin note: If the blur process turns out to be linear and space-invariant, we call
the process of inverting it deconvolution.]
In Eq. (6.1), suppose we know how the blur process works. Then it would seem
that we should be able to undo it. We will see why that might not be the case.
As an example, look at a very simple image, just about the simplest image we can
come up with, a 3 × 3 image, and give each pixel a name, using the letters a, . . . , i.
Now, suppose this image is blurred by a linear blur process in which each pixel is
replaced by the average of its neighbors (suppose the 4-neighbor definition is used).
In the case of edge or corner pixels, fewer pixels are in the neighborhood. This new,
blurred image has values ga, . . . , gi, and these are related to the original values by
the system of linear equations shown in Table 6.1.
a b c
d e f
g h i
$$H = \begin{bmatrix}
0 & 0.5 & 0 & 0.5 & 0 & 0 & 0 & 0 & 0 \\
0.33 & 0 & 0.33 & 0 & 0.33 & 0 & 0 & 0 & 0 \\
0 & 0.5 & 0 & 0 & 0 & 0.5 & 0 & 0 & 0 \\
0.33 & 0 & 0 & 0 & 0.33 & 0 & 0.33 & 0 & 0 \\
0 & 0.25 & 0 & 0.25 & 0 & 0.25 & 0 & 0.25 & 0 \\
0 & 0 & 0.33 & 0 & 0.33 & 0 & 0 & 0 & 0.33 \\
0 & 0 & 0 & 0.5 & 0 & 0 & 0 & 0.5 & 0 \\
0 & 0 & 0 & 0 & 0.33 & 0 & 0.33 & 0 & 0.33 \\
0 & 0 & 0 & 0 & 0 & 0.5 & 0 & 0.5 & 0
\end{bmatrix}$$
Then, we may represent the process of blur by G = H F and solve for the values
before the blur using F = H^{-1} G. It appears that this should be simple. If we have a
model for the distortion process (the model in this case is the matrix H), all we need
do is invert it, and multiply. Let us look to see why that might not work too well.
First, calculate the inverse of H numerically. Whoops! Our matrix inverter program
tells us that the matrix is singular. Guess that will not work. Is it a bad choice of blur
operations?
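The singularity can be checked directly. The sketch below rebuilds the blur matrix for the 3 × 3 image from the 4-neighbor averaging rule and asks NumPy for its rank; the function name is an assumption, and the exact weight 1/3 is used where the table prints the rounded value 0.33.

```python
import numpy as np

def neighbor_average_matrix(rows=3, cols=3):
    """Matrix H that replaces each pixel by the mean of its 4-neighbors."""
    n = rows * cols
    H = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            neighbors = [(r + dr, c + dc)
                         for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                         if 0 <= r + dr < rows and 0 <= c + dc < cols]
            for nr, nc in neighbors:
                # pixels are numbered row by row: a=0, b=1, ..., i=8
                H[r * cols + c, nr * cols + nc] = 1.0 / len(neighbors)
    return H

H = neighbor_average_matrix()
rank = np.linalg.matrix_rank(H)
print("rank of H:", rank, "out of", H.shape[0])
```

A rank below 9 means no inverse exists. One way to see why: the five blurred values ga, gc, ge, gg, gi depend only on the four unknowns b, d, f, h, so their five rows of H cannot be linearly independent.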
It is hard, it turns out, to contrive a numerical example which is not singular. It is
possible, of course, but difficult. Here is the real key: Even if the distortion matrix
turns out to be nonsingular, the problem is still likely to be “ill-conditioned.”
That is, (let’s review) we measure ga, . . . , gi, and use matrix multiplication to
determine a, . . . , i. If H is not singular, then in a perfect world it should work. As
engineers know, however, in fact there is always noise, so instead of measuring ga,
we actually measure ga + ε, where ε is some perturbation due to noise. If the system,
that is, the distortion matrix, is ill-conditioned (and it is), then this small change in
ga may produce large differences in the estimates of a, . . . , i. For this reason, simple
matrix inversion does not work, even if the system is linear.
Another, perhaps simpler example of ill-conditioning [6.36] is as follows: Consider
the linear system described by a blur A, an unknown image f, and a measurement g, where
$$g = Af, \qquad A = \begin{bmatrix} 1 & 1 \\ 1 & 1.01 \end{bmatrix}, \quad f = \begin{bmatrix} f_1 \\ f_2 \end{bmatrix}, \quad g = \begin{bmatrix} 1 \\ 1 \end{bmatrix}. \tag{6.4}$$
The condition number for matrix A is 402.0075, which is much larger than 1. This
system has solution f_1 = 1, f_2 = 0. Now, suppose the measurement, g, is corrupted
by noise, producing g = [1 1.01]^T. Then, the solution is f_1 = 0, f_2 = 1. A trivially
small change in the measured data caused a dramatic change in the solution.
[Margin note: You decide whether this change is “trivially small.”]
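The numbers in this example are easy to reproduce. The following sketch uses NumPy's 2-norm condition number and its linear solver; the variable names are illustrative.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.01]])
g_clean = np.array([1.0, 1.0])
g_noisy = np.array([1.0, 1.01])    # second measurement perturbed by 0.01

cond = np.linalg.cond(A)           # ratio of largest to smallest singular value
f_clean = np.linalg.solve(A, g_clean)
f_noisy = np.linalg.solve(A, g_noisy)

print(f"condition number: {cond:.4f}")
print("solution from clean data:", f_clean)
print("solution from noisy data:", f_noisy)
```

A 1% change in one measurement moves the solution from approximately (1, 0) to approximately (0, 1), which is the behavior the condition number warns about.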
There are many ways to approach these ill-posed restoration problems. They
all share a common structure: regularization theory. Generally speaking, any
regularization method tries to analyze a related well-posed problem whose solution
approximates the solution of the original ill-posed problem [6.57].
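To make the idea concrete, here is a minimal sketch using Tikhonov regularization, one standard member of this family. It is chosen here purely for illustration and is not the method developed in this chapter; the function name and the weight lam = 0.01 are assumptions. The related well-posed problem is to minimize ‖Af − g‖² + lam‖f‖² for the matrix A of Eq. (6.4).

```python
import numpy as np

def tikhonov_solve(A, g, lam):
    """Minimize ||A f - g||^2 + lam * ||f||^2 via the regularized normal equations."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ g)

A = np.array([[1.0, 1.0],
              [1.0, 1.01]])
f_a = tikhonov_solve(A, np.array([1.0, 1.0]), lam=0.01)    # clean measurement
f_b = tikhonov_solve(A, np.array([1.0, 1.01]), lam=0.01)   # noisy measurement
change = np.linalg.norm(f_a - f_b)
print("regularized solutions:", f_a, f_b, "difference:", change)
```

The two regularized solutions differ only slightly, so the modified problem is stable. The price is bias: both answers sit near (0.5, 0.5) rather than at either unregularized solution, which is why the choice of the added term matters so much in what follows.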
The first approach one might think of is to produce an image estimate which has
the minimum expected mean square error. That is, find the unknown image f which
minimizes
$$E = \sum_i \left(g_i - (f_i \otimes h)\right)^2, \tag{6.5}$$
where the sum is over all the pixels in the image, and the distortion is represented by
the application of a kernel operator corresponding to a blur h. Simply minimizing E does
not work, as the problem is still ill-conditioned. Making some assumptions about the
noise can give us a bit better performance. If the distortion is linear, space-invariant,
and the noise is stationary, the Wiener filter gives the optimal solution according to
this criterion. (See [6.28] for a tutorial presentation.)
[Margin note: Not only is this version also ill-posed, it is the same problem! (Recall the
way to write application of kernel operators by using a matrix?)]
6.3 The MAP approach
In this section, we introduce a bit of mathematics we need before we can move any
further.
We can relate the three probability functions just defined by using Bayes’ rule:
$$p(f|g) = \frac{p(g|f)\,p(f)}{\text{something}} \tag{6.6}$$
$$\text{something} = p(g). \tag{6.7}$$
In Eq. (6.6) we used “something” to represent the denominator for the conditional
probability density. We used the word “something” to call attention to the fact that this
number represents the probability density of that value of g occurring, independent
of the original, uncorrupted image. Since this number is independent of f, and is the
same for all possible fs, it therefore does not provide us any help in distinguishing
which class is most likely. Instead, it is a normalization constant which we use to
ensure that the number p( f |g) has the desirable properties of a probability; that is,
it lies between 0 and 1 and sums to 1 when summed over all possible images (that is,
the observed object belongs to at least one of the classes which we are considering).
We wish to maximize the a posteriori probability density p(f|g) of the unknown
correct image f given measured image g. We will be relating the probabilities of the
entire image to the probabilities of individual pixels. Using Bayes’ rule, we have the
proportionality
$$p(f_i|g_i) \propto p(g_i|f_i)\,p(f_i). \tag{6.8}$$
[Margin note: Watch for the notation change here! Now, the subscript on f refers to a
pixel.]
Since the only difference between the measurement gi and the true but unknown
image pixel f i is the noise, and if we assume a Gaussian noise model, we can
replace the conditional density of the individual pixels with the density of the noise,
producing
$$p(g|f) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{n_i^2}{2\sigma^2}\right). \tag{6.10}$$
Moving the product inside the exponential allows us to write [6.8, 6.27, 6.37, 6.38,
6.80]:
$$p(g|f) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{N} \exp\left(-\frac{\sum_i (f_i - g_i)^2}{2\sigma^2}\right). \tag{6.11}$$
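The step from Eq. (6.10) to Eq. (6.11) can be checked numerically. The sketch below evaluates the logarithm of the likelihood both ways, first as a product over pixels and then as a single exponential of a sum, for a small synthetic image. The function names, the image size, and the noise level are assumptions made for illustration.

```python
import numpy as np

def log_likelihood_product(f, g, sigma):
    """log of the per-pixel product of Gaussian noise densities, as in Eq. (6.10)."""
    n = g - f                                   # noise at each pixel
    per_pixel = np.exp(-n**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return np.sum(np.log(per_pixel))

def log_likelihood_sum(f, g, sigma):
    """log of the single-exponential form, as in Eq. (6.11)."""
    N = f.size
    return (-N * np.log(np.sqrt(2.0 * np.pi) * sigma)
            - np.sum((f - g)**2) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
sigma = 0.1
f = rng.uniform(0.0, 1.0, size=16)              # hypothetical "true" pixels
g = f + rng.normal(0.0, sigma, size=16)         # measurement with additive Gaussian noise

ll_product = log_likelihood_product(f, g, sigma)
ll_sum = log_likelihood_sum(f, g, sigma)
```

Since the first term of the second form does not depend on f, maximizing this likelihood over f is the same as minimizing the sum of squared differences between f and g.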
The sum is taken over ℵi , the neighborhood of pixel i. Recall from Chapter 4 that
the aura of a set of pixels A relative to a set B is the collection of all points in B that
are neighbors of pixels in A, where the concept of “neighbor” is given by a problem-
specific definition. Here, the concept is similar, except that we consider only the
neighborhood of a single pixel rather than the neighborhood of a set. Like the defi-
nition of aura, the definition of neighborhood is allowed to be problem-dependent,
and in this sense two pixels do not have to be adjacent or even necessarily “close”
in the image to be considered neighbors. In essentially all practical applications,
however, the neighbors of a particular pixel are those pixels which are adjacent. T
is an adjustable width parameter, and the Vs are potential functions which are, in
general, functions of the pixels in the neighborhood.
The way we formulate the prior probability for the entire image is to once again
write a product
$$p(f) = \prod_i p(f_i). \tag{6.13}$$
Forming the product of Eqs. (6.11) and (6.12) as indicated in Eq. (6.8) and eliminating
the constant terms,¹ we then take natural logarithms and change the sign, thereby
converting the problem from maximizing a probability to minimizing an objective
function:
$$H(f,g) = \sum_i \frac{(f_i - g_i)^2}{2\sigma^2} + \sum_i \sum_{j\in\aleph_i} V_{ij}. \tag{6.14}$$
[Margin note: Check this out! The answer which maximized the conditional probability
is the same as the answer which minimized the squared error! This happens because the
noise is additive Gaussian.]
We will refer to the first, conditional, term of Eq. (6.14) as the “noise” term [6.27]
and to the second as the “prior.” This gives the following form:
$$H(f,g) = H_n(f,g) + H_p(f). \tag{6.15}$$
¹ These terms do not affect the location of the minimum. The σ which remains in Eq. (6.14) allows us to weight
the relative importance of the noise and prior terms.
having a brightness which varies smoothly, except at boundaries [6.9, 6.16]; or the
most common: having brightness which is constant in local areas and discontinuous
at boundaries [6.16, 6.38].
[Figure: penalty plotted against variation.]
Fig. 6.1. The penalty should be stronger for higher noise. Assuming local variations in brightness
are due to noise, a larger variation is penalized more. However, local brightness
variations can also be due to edges, which should not be penalized (otherwise, they
will be blurred). Therefore we want our penalty function to have an upper limit.
$$H_p(f) = -\frac{b}{\sqrt{2\pi}\,\tau} \sum_i \exp\left(-\frac{\Lambda_i^2}{2\tau^2}\right). \tag{6.16}$$
In Eq. (6.16), the constants are irrelevant. We include them only so that it looks like a
Gaussian. τ is a soft threshold: It represents a priori knowledge of the roughness of
the surface. The form for Λ makes this knowledge explicit. It is hoped that the spatial
derivatives Λ_i will become small almost everywhere as the algorithm proceeds. This
concept will be explored in more detail in the next section.
[Margin note: The √(2π) does not mean anything at all!]
Combining Eqs. (6.15) and (6.16), we have an objective function which, if mini-
mized, will result in a restored image which resembles the data (in the least squares
sense) while at the same time consisting of regions of uniform brightness sepa-
rated by abrupt edges. We could use mean field annealing (MFA) to minimize this
objective function.
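A minimal sketch of such an objective follows. It assumes that the local roughness at each pixel is measured by horizontal and vertical first differences, which is only one possible choice, and the function names and the values of sigma, tau, and b are illustrative assumptions.

```python
import numpy as np

def noise_term(f, g, sigma):
    """Squared-error fidelity of the estimate f to the measurement g."""
    return np.sum((f - g)**2) / (2.0 * sigma**2)

def prior_term(f, tau, b=1.0):
    """Inverted-Gaussian penalty on first differences (an assumed roughness measure)."""
    dx = np.diff(f, axis=1)                     # horizontal differences
    dy = np.diff(f, axis=0)                     # vertical differences
    scale = b / (np.sqrt(2.0 * np.pi) * tau)
    return -scale * (np.sum(np.exp(-dx**2 / (2.0 * tau**2)))
                     + np.sum(np.exp(-dy**2 / (2.0 * tau**2))))

def objective(f, g, sigma, tau, b=1.0):
    """Sum of the noise term and the prior term."""
    return noise_term(f, g, sigma) + prior_term(f, tau, b)

step = np.zeros((8, 8))
step[:, 4:] = 1.0                               # two flat regions separated by an edge
rng = np.random.default_rng(1)
noisy = step + rng.normal(0.0, 0.1, step.shape)

prior_step = prior_term(step, tau=0.2)
prior_noisy = prior_term(noisy, tau=0.2)
```

The prior is lower (better) for the clean step image than for its noisy version. Each difference contributes a bounded amount, so one sharp edge costs no more than a fixed penalty while many small fluctuations add up, which is the behavior sketched in Fig. 6.1.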
Mean field annealing (MFA) is a technique for finding a good minimum of complex
functions which typically have many minima. The mean field approximation in
statistical mechanics allows a continuous representation of the energy states of a
collection of particles. In the same sense, MFA approximates the stochastic algorithm
called “simulated annealing” (SA). SA has been shown to converge in probability
to the global minimum, even for nonconvex problems [6.27]. Because SA also takes
an unacceptably long time to converge, a number of techniques have been derived
to accomplish speedups [6.43]; MFA is one of those techniques.
Since its introduction in 1989 [6.8, 6.11], this technique has found applications in
restoration of locally homogeneous [6.37, 6.38] and locally smooth [6.8, 6.9] images;
in image segmentation [6.69, 6.70]; in motion analysis [6.1]; and in sensor fusion
Here we have a (not very interesting) image consisting of only two pixels, f 1 and f 2 ,
which have been corrupted by noise to result in measured pixel values g1 and g2 . The
prior term chosen in this case encourages solutions in which f 1 = f 2 . The principal
result of the MFA derivations, when applied to a function of the type described
by Eq. (6.17), is the replacement of τ by τ + T, where T is a parameter (called
“temperature” in the literature) which is initialized to a “large” value, and gradually
117 6.4 Mean field annealing
for some small scalar constant α. To see how this works, consider the case of large
T: Making T large results in
[Plot for Fig. 6.2: curves labeled τ + T = 10, 6, 5.5, 5, and 4.5; horizontal axis from −5 to 15.]
Fig. 6.2. Continuous deformation of a function which is initially convex to find the global
minimum of a nonconvex function.
with scalar constants k and l. At high values of T (T + τ = 10), the curve is
completely convex, and as T is reduced, it assumes its true form. At each iteration, the
minimum is tracked, as indicated by the arrows, concluding in the global minimum.
EXAMPLE
Piecewise-constant images
Consider this form for the prior
$$R^2(f) = \left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2. \tag{6.24}$$
In order for this term to be zero, both partial derivatives must be zero. The only type of
surface which satisfies this condition is one which does not vary in either direction –
flat, but not flat everywhere. To see why the solution is piecewise-constant rather than
completely constant, you need to recognize that the total function being minimized
is the sum of both the prior and the noise terms. The prior term seeks a constant
solution, but the noise term seeks a solution which is faithful to the measurement.
The optimal solution to this problem is a solution which is flat in segments, as illus-
trated in one dimension in Fig. 6.3. The function R(f) is nonzero only for the points
where f undergoes an abrupt edge. To see more clearly what this produces, consider
the extension to continuous functions. If x is continuous, then the summation in
Eq. (6.23) becomes an integral. The argument of the integral is nonzero at only
a small, finite number of points (referred to as a set of measure zero), which is
insignificant compared to the rest of the integral.
Fig. 6.3. A piecewise-constant solution to the fitting of a surface. The solution has a derivative
equal to zero at almost every point. Nonzero derivatives exist only at the points where
steps exist.
EXAMPLE
Piecewise-planar images
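Let’s try another example and see what that does. Consider
$$R^2(f) = \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2. \tag{6.25}$$
[Margin note: This is the quadratic variation. The Laplacian also involves second
derivatives. Ask yourself how it differs from the quadratic variation.]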
What does this do? What kind of function has a zero for all its second derivatives?
Answer: a plane. Thus, using this form for R(f) will produce an image which is
planar, but still maintains fidelity to the data: a piecewise-planar image. Another
alternative operator which is also based on the second derivative is the Laplacian,
$$\frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}.$$
We saw both of these back in Chapter 5.
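The difference between the two priors can be seen by evaluating discrete versions of Eqs. (6.24) and (6.25) on simple test surfaces. The sketch below uses np.gradient as the derivative estimate; the function names and the test surfaces are assumptions made for illustration.

```python
import numpy as np

def gradient_penalty(f):
    """Discrete version of Eq. (6.24): sum of squared first derivatives."""
    fy, fx = np.gradient(f)                 # derivatives along rows (y) and columns (x)
    return fx**2 + fy**2

def quadratic_variation(f):
    """Discrete version of Eq. (6.25): squared second derivatives with the cross term."""
    fy, fx = np.gradient(f)
    fyy, fyx = np.gradient(fy)
    fxy, fxx = np.gradient(fx)
    return fxx**2 + fyy**2 + 2.0 * fxy**2

y, x = np.mgrid[0:16, 0:16].astype(float)
constant = np.full_like(x, 5.0)             # flat surface
plane = 2.0 * x + 3.0 * y                   # tilted plane

print(gradient_penalty(constant).max())     # zero on a flat surface
print(gradient_penalty(plane).max())        # nonzero on a tilted plane
print(quadratic_variation(plane).max())     # zero on any plane
```

The first-derivative prior penalizes the tilted plane everywhere, so it pulls solutions toward constant patches, whereas the second-derivative prior leaves any plane unpenalized and therefore admits piecewise-planar solutions.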
You might ask your instructor, “Breaking a brightness image into piecewise-linear
segments is the same as assuming the actual surfaces are planar, right?” To which you
would probably get “Yes, except for variations in lighting, reflectance, and albedo.”
Ignoring that, you charge on, saying “But real surfaces aren’t all planar.” The answer
is twofold: First is the trivial and useless observation that all surfaces are planar, you
just have to examine a sufficiently small region. More seriously, whether breaking
an image into planar patches makes sense depends on the application. For example
[6.14, 6.74], one could do a piecewise-constant segmentation to remove noise and
then treat each patch as a plane and get improved estimates of optic flow [6.41] or
stereo [6.72] that way. Some of the underlying theory of planar approximations to
images may be found in [6.62].
Interestingly, Yi and Chelberg [6.83] observe that second-order priors like this
one require a great deal more computation than first-order priors, and that first-order
priors can be made approximately invariant. However, in our own experiments, we
have not found that second-order priors impose such a severe computational penalty,
and they do provide more flexibility in reconstruction.
A one-dimensional example of such a solution is shown in Fig. 6.4. The idea
of modeling an image as piecewise-planar has recently received additional support
from some work by Elder and co-workers [6.22, 6.23, 6.24] which suggests that
“the edge representation of an image is, to a good approximation, invertible.” They
accomplish this remarkable result by assuming that except at edges, an image satisfies
the Laplace equation ∇ 2 f (x, y) = 0.
In conclusion, you can choose any function for the argument of the exponential
that you wish, as long as the image you want is produced by setting the argument to
zero. Some more general properties of prior models have been stated by Li [6.51]
as a function of the local image gradient : A prior should (1) be continuous in its
first derivative; (2) be even (h( ) = h(− )); (3) be positive h( ) > 0; and (4) have
Fig. 6.4. A piecewise-linear solution to the data of Fig. 6.3. Clearly, the piecewise-linear solution
preserves fidelity to the data more accurately than the piecewise-constant.
The delta function is not convenient, since it is not differentiable, and we will want
to use gradient descent to solve this problem. But there is another problem with
this formulation: If the image is continuously valued (or even if it is represented
in floating point), what does it mean for f_i to equal f_j? How close should they
be before they are considered equal? How about |f_i − f_j| < 0.01? Is this small
enough? How about 0.001? OK? Do you accept that? So we insist that two points
which differ by more than 0.001 contribute to the error. What about two points that
differ by 0.000 999? That pair does not contribute at all. Does that make sense?
What we have generated is very similar to the problem of describing the proba-
bility that a measure has a particular value. The probability, for example, of being
exactly 6.000 000 (for an arbitrary number of zeros) is zero, and we therefore resort
to a different representation for the concept of likelihood: the probability
density. We pursue the same philosophy in this problem. Instead of using the
Kronecker delta, we replace the delta function with a continuous,
differentiable function which represents the same intuition,
\[
H_p = -\frac{1}{\sqrt{2\pi}\,\tau} \sum_i \sum_{j \in \aleph_i} \exp\left(-\frac{(f_i - f_j)^2}{2\tau^2}\right). \tag{6.29}
\]
This form also allows the concept of annealing: Start with τ large and reduce τ until
it approaches zero. The square root of a constant is not particularly meaningful, but
it does serve to ensure that the function remains bounded in the appropriate way.
The details of why this process of annealing avoids local minima are described
elsewhere [6.8, 6.11, 6.12], but result from the analogy to simulated annealing [6.27].
MFA is based upon the mathematics of simulated annealing. One can show that in
simulated annealing, a global minimum can be achieved by following a logarithmic
annealing schedule like
\[
\tau(K) = \frac{1}{\ln K} \tag{6.31}
\]
where K is the iteration number. This schedule decreases extremely slowly; so
slowly as to be impractical. Instead, one could choose a schedule like
\[
\tau(K) = 0.99\,\tau(K - 1) \tag{6.32}
\]
which has been shown to work satisfactorily in many applications, and reduces τ
much faster than the logarithmic schedule.
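To give a feel for the difference between the two schedules, the following Python sketch counts how many iterations each needs to bring the annealing parameter down to a given value. The function names, the starting value, and the target of 0.1 are illustrative assumptions, not part of the text.

```python
import math

def logarithmic_schedule(k):
    """Eq. (6.31): tau(K) = 1 / ln K, defined for K >= 2."""
    return 1.0 / math.log(k)

def geometric_schedule(tau0, k, rate=0.99):
    """Eq. (6.32) unrolled: tau(K) = rate**K * tau(0)."""
    return tau0 * rate ** k

def iterations_to_reach(target, schedule, start=2, limit=10**7):
    """First iteration K >= start at which schedule(K) <= target, or None."""
    for k in range(start, limit):
        if schedule(k) <= target:
            return k
    return None

if __name__ == "__main__":
    tau0 = logarithmic_schedule(2)  # start both schedules at the same value
    target = 0.1                    # hypothetical final value of the parameter
    k_geo = iterations_to_reach(target, lambda k: geometric_schedule(tau0, k))
    k_log = iterations_to_reach(target, logarithmic_schedule)
    print("geometric schedule iterations:  ", k_geo)
    print("logarithmic schedule iterations:", k_log)
```

Under the logarithmic schedule the iteration count grows like exp(1/target), which is why it is impractical for small final values.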
\[
+ (f_3 h_{-2} + f_4 h_{-1} + f_5 h_0 + f_6 h_1 + f_7 h_2 - g_5)^2
+ (f_4 h_{-2} + f_5 h_{-1} + f_6 h_0 + f_7 h_1 + f_8 h_2 - g_6)^2
\]
where $(f \otimes h)_i$ denotes the application of kernel h to image f with the origin of h (in
this case, the center) located at pixel $f_i$. The derivative of $H_n$ with respect to pixel
$f_4$ can then be derived as Eq. (6.35), and further generalized as Eq. (6.36),
\[
\frac{\partial H_n}{\partial f_4} = 2((f \otimes h)_2 - g_2)h_2 + 2((f \otimes h)_3 - g_3)h_1 + 2((f \otimes h)_4 - g_4)h_0
+ 2((f \otimes h)_5 - g_5)h_{-1} + 2((f \otimes h)_6 - g_6)h_{-2} \tag{6.35}
\]
\[
\frac{\partial H_n}{\partial f_4} = 2\,(((f \otimes h) - g) \otimes h_{\mathrm{rev}})_4 \tag{6.36}
\]
where $h_{\mathrm{rev}} = [h_2, h_1, h_0, h_{-1}, h_{-2}]$, and $(f \otimes h - g)$ is computed at all points. The
… f2 f3 f4 f5 f6 …
h–2 h–1 h0 h1 h2
Fig. 6.5. A one-dimensional image and a one-dimensional kernel with five pixels.
[Figure: the row … n2 n3 n4 n5 n6 … with the reversed kernel h2 h1 h0 h–1 h–2 drawn beneath it five times, once for each position of the kernel.]
Fig. 6.6. The reverse kernel in the derivation of the noise term.
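The reverse-kernel form of the derivative can be checked numerically. The sketch below, in Python, assumes the noise term has the form expanded above, $H_n = \sum_i ((f \otimes h)_i - g_i)^2$, treats pixels outside the image as zero, and keeps the factor of 2 from Eq. (6.35) explicit. All function names are illustrative.

```python
def apply_kernel(f, h):
    """(f ⊗ h)_i = sum_k f[i+k] * h[k] for k = -r..r, with zeros outside f."""
    r = len(h) // 2
    n = len(f)
    return [sum(f[i + k] * h[k + r] for k in range(-r, r + 1) if 0 <= i + k < n)
            for i in range(n)]

def noise_energy(f, g, h):
    """H_n = sum_i ((f ⊗ h)_i - g_i)**2."""
    return sum((c - gi) ** 2 for c, gi in zip(apply_kernel(f, h), g))

def noise_gradient(f, g, h):
    """Analytic gradient: 2 * (((f ⊗ h) - g) ⊗ h_rev), one value per pixel."""
    residual = [c - gi for c, gi in zip(apply_kernel(f, h), g)]
    return [2.0 * v for v in apply_kernel(residual, h[::-1])]

def numeric_gradient(f, g, h, eps=1e-4):
    """Central-difference estimate of dH_n/df_i, used only as a check."""
    grad = []
    for i in range(len(f)):
        fp = list(f); fp[i] += eps
        fm = list(f); fm[i] -= eps
        grad.append((noise_energy(fp, g, h) - noise_energy(fm, g, h)) / (2 * eps))
    return grad

if __name__ == "__main__":
    f = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0]
    g = [2.0, 3.0, 1.0, 9.0, 4.0, 6.0, 5.0, 2.0]
    h = [0.05, 0.1, 0.4, 0.3, 0.15]  # deliberately asymmetric so reversal matters
    worst = max(abs(a - b) for a, b in
                zip(noise_gradient(f, g, h), numeric_gradient(f, g, h)))
    print("largest disagreement:", worst)
```

Using an asymmetric kernel in the check matters: with a symmetric h the reversed kernel equals h and a missing reversal would go unnoticed.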
where (R( f ))i will be the quadratic variation of Eq. (6.25) at pixel i. Of course, in
specific applications, different priors may be indicated. To perform gradient descent,
we must find the derivative with respect to f . This problem becomes complicated
when we recognize that the numerator of the argument of the exponential varies in
both x and y. We have two choices.
• We could recognize that R is a sum of three terms, that the exponential of a sum is
the product of the exponentials, and use the product rule of derivatives to construct
a rather complicated expression for the derivative.
• We could say “instead of putting the summation in the argument of the exponential,
let’s just add the exponentials.”
Of course, these are not equivalent expressions. However, minimizing either gets
at the same idea – a piecewise-linear image. Since the second is simpler to implement,
being engineers, we choose the second option. We know how to do the derivatives,
so we end up with the following algorithm.
The derivative of the noise term is trivial: On each iteration, simply change pixel i
by $d_{\mathrm{noise},i} = (f_i - g_i)/\sigma^2$.
To determine the derivative of the prior requires a bit more work: according to
Eq. (6.40), the derivative of the prior term is
\[
\left[ \frac{\beta\,(f \otimes r)}{\tau^2} \exp\left(-\frac{(f \otimes r)^2}{2\tau^2}\right) \right] \otimes r_{\mathrm{rev}}.
\]
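Define three kernel operators which we will use to estimate the three partial second
derivatives in the quadratic variation
\[
r_{xx} = \frac{1}{\sqrt{6}} \begin{bmatrix} 0 & 0 & 0 \\ 1 & -2 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \quad
r_{yy} = \frac{1}{\sqrt{6}} \begin{bmatrix} 0 & 1 & 0 \\ 0 & -2 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \quad \text{and} \quad
r_{xy} = 2 \begin{bmatrix} -0.25 & 0 & 0.25 \\ 0 & 0 & 0 \\ 0.25 & 0 & -0.25 \end{bmatrix}.
\]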
Compute three images, where the ith pixels of those images are
The gradient descent rule says to change each element of f using $f_i \Leftarrow f_i - \alpha d_i$,
where $d_i = d_{\mathrm{noise},i} + d_{\mathrm{prior},i}$.
The learning coefficient α should be $\alpha = \gamma\sigma/\mathrm{RMS}(d)$, where γ is a small
dimensionless number, like 0.04; RMS(d) is the root mean square norm of the gradient
d; σ can be determined as the standard deviation of the noise in the image (note that
this is NOT a good estimate in synthetic images). We observe that in this form,
α changes every iteration.
The coefficient β is on the order of σ, and choosing β = σ is usually adequate.
Implementing this algorithm and annealing over a couple of orders of magnitude
of τ should give reductions of noise similar to those illustrated in Figs. 6.10–6.13.
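The loop described above can be sketched in one dimension. This Python sketch is an illustration under several assumptions that are not in the text: it uses only a one-dimensional slice [1, −2, 1]/√6 of the second-derivative kernel, skips the prior at the two border pixels, normalizes the step by the RMS of the gradient, and anneals the control parameter geometrically from `tau_start` to `tau_end`. The parameter names and default values are hypothetical.

```python
import math
import random

S = 1.0 / math.sqrt(6.0)
R_XX = [S, -2.0 * S, S]  # assumed 1-D slice of the second-derivative kernel

def correlate_zero(f, h):
    """(f ⊗ h)_i = sum_k f[i+k] * h[k], with zeros assumed outside f."""
    r = len(h) // 2
    n = len(f)
    return [sum(f[i + k] * h[k + r] for k in range(-r, r + 1) if 0 <= i + k < n)
            for i in range(n)]

def mfa_restore(g, sigma, beta=None, gamma=0.04,
                tau_start=1.0, tau_end=0.01, rate=0.99):
    """Gradient descent with annealing on a 1-D signal g (a sketch)."""
    beta = sigma if beta is None else beta      # beta = sigma, as suggested
    f = list(g)
    n = len(f)
    tau = tau_start
    while tau > tau_end:
        d_noise = [(fi - gi) / sigma ** 2 for fi, gi in zip(f, g)]
        u = correlate_zero(f, R_XX)
        u[0] = u[-1] = 0.0                      # no estimate at the borders
        w = [beta * ui / tau ** 2 * math.exp(-ui ** 2 / (2.0 * tau ** 2))
             for ui in u]
        d_prior = correlate_zero(w, R_XX[::-1])  # apply the reversed kernel
        d = [a + b for a, b in zip(d_noise, d_prior)]
        rms = math.sqrt(sum(di * di for di in d) / n)
        if rms == 0.0:                          # already stationary
            break
        alpha = gamma * sigma / rms             # step size changes each pass
        f = [fi - alpha * di for fi, di in zip(f, d)]
        tau *= rate                             # geometric annealing step
    return f

if __name__ == "__main__":
    random.seed(0)
    noisy = [(0.0 if i < 32 else 1.0) + random.gauss(0.0, 0.1) for i in range(64)]
    restored = mfa_restore(noisy, sigma=0.1)
    print(["%.2f" % v for v in restored[28:36]])
```

Because the step is normalized by RMS(d), every iteration moves the signal by the same RMS amount, so the quality of the result depends on the annealing range and on γ; both would need tuning for real data.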
6.5 Conclusion
chapter, we use gradient descent with annealing, but other, more sophisticated and
faster techniques, such as conjugate gradient, could be used.
6.6 Vocabulary
Assignment 6.1
Equation (6.34) illustrates the partial derivative of
an expression involving a kernel, by expanding the ker-
nel into a sum. Use this approach to prove that Eq.
(6.40) can be derived from Eq. (6.39). Do your proof
using a one-dimensional problem, and use a kernel which
is 3 × 1 (denote the elements of the kernel as $h_{-1}$, $h_0$,
and $h_1$).
Assignment 6.2
Implement Eq. (6.65) on the image angio.ifs, or some
other image which your instructor selects. Experiment
with various run times and parameter settings.
Assignment 6.3
In Eq. (6.25), the quadratic variation is presented as
a prior term. A very similar prior would be the Lapla-
cian. What is the difference? That is, are there image
features which would minimize the Laplacian and not
minimize the quadratic variation? Vice versa?
Assignment 6.4
Which of the following expressions represents a Lapla-
cian?
2 2 2 2 2 2 2
∂2 f ∂2 f ∂ f ∂ ∂2 f ∂
(a) + 2 (c) + (e) +
∂x 2 ∂y ∂x 2 ∂y2 ∂x 2 ∂ y2
2 2 2 2
∂f ∂f ∂f ∂f ∂f ∂f
(b) + (d) + (f) +
∂x ∂y ∂x ∂x ∂x ∂x
Assignment 6.5
One form of the diffusion equation is written
df/dt = hx ⊗ (c(hx ⊗ f)) + hy ⊗ (c(hy ⊗ f)) where hx and hy
estimate the first derivatives in the x and y
directions, respectively. This suggests that four
kernels must be applied to compute this result. Simple
algebra, however, suggests that this could be rewritten
as df/dt= c(hxx ⊗ f + hyy ⊗ f), which requires application
of only two kernels. Is this simplification of the
algorithm correct? If not, explain why not, or under
what conditions it would be true.
Assignment 6.6
Consider the following image Hamiltonian
!
fi − gi2 1
(h ⊗ f)2
H (f) = − exp − ≡ H n ( f) + H p ( f)
i
2 i
2
where ⊗ denotes application of a kernel operator, the
pixels in the image are lexicographically indexed by i,
and the kernel h is
\[
\begin{bmatrix} -1 & 2 & -1 \\ -2 & 4 & -2 \\ -1 & 2 & -1 \end{bmatrix}.
\]
Let G p (fk) denote the partial derivative of H p with
respect to pixel k. G p (fk) = ∂/∂ fk H p (f). Write an expres-
sion for G p (fk). Use kernel notation.
Assignment 6.7
Continuing problem Assignment 6.6, you are to consider
ONLY the prior term. Write an equation which describes
129 Topic 6A Alternative and equivalent algorithms
Assignment 6.8
Continuing problem Assignment 6.7, expand this differ-
ential equation (assume the brightness varies only in
the x direction) by substituting the form for G p (fk)
which you derived. Is this a type of diffusion equa-
tion? Discuss. (Hint: Replace the application of ker-
nels with appropriate derivatives.)
Assignment 6.9
In a diffusion problem, you are asked to diffuse a VEC-
TOR quantity, instead of the brightness which you did
in your project. Replace the terms in the diffusion
equation with the appropriate vector quantities, and
write the new differential equation. (Hint: You may
find the algebra easier if you denote the vector as
$[a, b]^T$.)
Assignment 6.10
The time that the diffusion runs is somehow related to
blur. This is why some people refer to diffusions of
this type as “scale space.” Discuss this use of termi-
nology.
Just as the MFA approach described in the previous section minimizes an objective function
in order to find an image with sharp edges, graduated nonconvexity (GNC) does the same,
but uses an objective function which treats the presence of edges explicitly.
We consider the case in which our a priori knowledge states that the image is uniform
in brightness, except for step discontinuities. Blake and Zisserman [6.16] refer to this case
as the “weak membrane,” and a similar MFA instance is referred to in [6.12] as “piecewise-uniform.”
The similarities can be seen both in the objective function (compare Fig. 6.7 and
Fig. 6.9) and in the restorations (Figs. 6.10–6.13).
[Figure: prior energy $H_p^{MFA}$ plotted against $\nabla_i(f)$.]
Fig. 6.7. Prior energy for MFA, for various values of T. Smaller T results in sharper peaks.
There exist other formulations [6.12] which
pose the MFA problem in a manner even more similar to GNC, a similarity first noted by
Geiger and Girosi [6.25]. In the “weak membrane” application of GNC, the minimization
problem is
where
\[
H_{GNC} = H_n + S + P, \quad S = \lambda^2 \sum_i |\nabla_i(f)|^2 (1 - l_i), \quad \text{and} \quad P = \alpha \sum_i l_i \tag{6.43}
\]
and the notation $\nabla_i(f)$ is interpreted to mean “the gradient of the image at point i.” Here,
$l_i \in \{0, 1\}$ denotes a discontinuity in f at the ith pixel. That is, if $l_i = 1$, the pixel at point i
is interpreted as an edge point. Similarly, $f_i$ will denote the brightness of the ith pixel. It has
been shown [6.16] that minimizing $H_{GNC}$ can be reduced to the following problem, which
involves only continuous variables
\[
\min_f \left( H_n + \sum_i v(\nabla_i(f)) \right). \tag{6.44}
\]
[Figure: the penalty v plotted against $\nabla_i(f)$.]
Fig. 6.8. Prior energy of the GNC algorithm.
In Eqs. (6.43) and (6.44), the |∇(.)| represents any operator which returns a scalar measure
of the local “edginess” of the image such as $(\partial f/\partial x)^2 + (\partial f/\partial y)^2$, and the v function of
Eq. (6.44) is the “clipped parabola” illustrated in Fig. 6.8.
Note: This “clipped parabola” has the same general shape as the inverted Gaussian we
mentioned in Eq. (6.16), for the same reasons: We want to penalize noise (so the bigger the
noise, the bigger the penalty) but we don’t want to penalize edges (so at some point, we stop
making the penalty any larger).
The minimization problem posed by Eq. (6.44) is unsolvable by techniques such as gradient
descent, since the function defined by Eq. (6.44) is in general not convex. That is,
it may possess many minima. Instead, GNC approximates v with the piecewise-smooth
function
\[
v^*(t) = \begin{cases}
\lambda^2 t^2 & \text{if } |t| < q \\
\alpha - c^* (|t| - r)^2 / 2 & \text{if } q \le |t| < r \\
\alpha & \text{if } |t| \ge r
\end{cases} \tag{6.45}
\]
where
\[
r^2 = \alpha \left( \frac{2}{c^*} + \frac{1}{\lambda^2} \right), \quad \text{and} \quad q = \frac{\alpha}{\lambda^2 r}. \tag{6.46}
\]
[Figure: $v^*$ plotted against t.]
Fig. 6.9. Smoothed approximations to the energy of Fig. 6.8. Smaller values of p result in
approximations which are closer to the desired prior. The t used here (in GNC) is
equivalent to the edge strength ∇ of MFA.
Equations (6.45) and (6.46) then define the algorithm. Reducing the parameter p from 1 to
0 steadily changes v ∗ until it becomes precisely equal to v. This produces a family of prior
energies, illustrated in Fig. 6.9.
The process of gradually reducing p begins by minimizing a function which is convex, and
therefore has a unique minimum. Then, from that minimum, the local minimum is tracked
continuously as p is reduced from 1 to 0.
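The clipped parabola and its smoothed approximation are easy to evaluate directly. The Python sketch below implements Eqs. (6.45) and (6.46) with the curvature coefficient passed in as a plain parameter `c`. How p enters is not shown in the equations above; we assume here, following Blake and Zisserman's formulation, that the coefficient grows as p shrinks (for example c = c0/p), so a very large `c` stands in for p near 0. The function names are illustrative.

```python
import math

def clipped_parabola(t, lam, alpha):
    """v(t): a quadratic penalty that stops growing once it reaches alpha."""
    return min(lam ** 2 * t ** 2, alpha)

def gnc_breakpoints(lam, alpha, c):
    """Breakpoints q and r of Eq. (6.46) for curvature coefficient c."""
    r = math.sqrt(alpha * (2.0 / c + 1.0 / lam ** 2))
    q = alpha / (lam ** 2 * r)
    return q, r

def smoothed_penalty(t, lam, alpha, c):
    """v*(t) of Eq. (6.45): parabola, inverted-parabola blend, then constant."""
    q, r = gnc_breakpoints(lam, alpha, c)
    a = abs(t)
    if a < q:
        return lam ** 2 * t ** 2
    if a < r:
        return alpha - c * (a - r) ** 2 / 2.0
    return alpha

if __name__ == "__main__":
    lam, alpha = 1.0, 1.0
    for c in (0.25, 1.0, 4.0, 1e8):   # assumed: larger c corresponds to smaller p
        row = [smoothed_penalty(t / 4.0, lam, alpha, c) for t in range(0, 9)]
        print(c, ["%.3f" % v for v in row])
    print("v ", ["%.3f" % clipped_parabola(t / 4.0, lam, alpha) for t in range(0, 9)])
```

The three pieces of v* meet continuously at |t| = q and |t| = r, which is what makes the approximation usable with gradient-based minimization.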
Variable conductance diffusion (VCD) [6.31, 6.59, 6.62] is a powerful method of image
feature extraction in which blurring is allowed to occur except at edges. The term “edge”
may be liberally interpreted to mean any image feature which is of interest. For example,
Whitaker [6.79] operates on the gradient of the image rather than the image itself and allows
smoothing to occur except where ridges (sharp changes of gradient direction) occur. Such
an operation is not a restoration by any means, since most of the information in the original
image is lost. It is, however, a very robust way to extract a central axis of an object in a gray
scale image.
Note: Relation of spatial derivatives (RHS) to temporal derivatives (LHS).
VCD operates by simulating the diffusion equation
\[
\frac{\partial f_i}{\partial t} = \nabla \cdot (c_i \cdot \nabla_i f) \tag{6.47}
\]
where t is time, and ∇i f denotes the spatial gradient of f at pixel i. The diffusion equation
models the flow of some quantity (the most commonly used example is heat) through a
material with conductivity (e.g., thermal conductivity) c.
If ci is constant, independent of the pixel number i, (ci = c) then the partial differential
Eq. (6.47) has a solution which is the same as convolution with a Gaussian, in which the
variance of that Gaussian depends on c and on the time over which the diffusion is run.
Specifically, let f, a function of space and time, be described by a specific partial differential
equation (PDE). If it is possible to write f in the following form:
\[
f(x, t) = \int_{-\infty}^{\infty} G(x, x', t)\, f(x', 0)\, dx', \tag{6.48}
\]
then we say that $G(x, x', t)$ is the Green’s function of the PDE. The special case of isotropic
diffusion may be stated formally in one dimension:
Theorem
The Gaussian is the Green’s function of the PDE
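\[
\frac{\partial f}{\partial t} = c\,\frac{\partial^2 f}{\partial x^2}. \tag{6.49}
\]
Proof
This is accomplished by writing the Gaussian as
\[
G(x, x', t) = \frac{1}{\sigma} \exp\left(-\frac{(x - x')^2}{2\sigma^2}\right),
\]
where σ will turn out to be a function of time (we have omitted the $1/\sqrt{2\pi}$ since it occurs
on both sides of the PDE and cancels out).
Substitute Eq. (6.48) into Eq. (6.49), producing on the left-hand side (LHS) the partial
with respect to t of an integral in which σ is a function of t. Taking this partial, make the LHS
equal to
\[
\left( \frac{(x - x')^2}{\sigma^4} - \frac{1}{\sigma^2} \right) \frac{\partial \sigma}{\partial t}. \tag{6.50}
\]
Similarly, we can take the second partial derivative with respect to x to create the right-hand
side (RHS):
\[
\frac{c}{\sigma} \left( \frac{(x - x')^2}{\sigma^4} - \frac{1}{\sigma^2} \right). \tag{6.51}
\]
(Both expressions multiply the same exponential factor under the integral, which is omitted here.)
Equating Eqs. (6.50) and (6.51) produces the equation
\[
\frac{\partial \sigma}{\partial t} = \frac{c}{\sigma} \tag{6.52}
\]
whose solution, taking σ = 0 at t = 0, is
\[
\sigma = \sqrt{2ct}. \tag{6.53}
\]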
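The relationship between diffusion time and Gaussian width can be observed numerically. The Python sketch below runs an explicit finite-difference simulation of Eq. (6.49) on an impulse and measures the variance of the result, which should equal 2ct when Eq. (6.52) is solved with σ = 0 at t = 0. The grid spacing of one pixel, the time step, and the function names are assumptions made for the illustration.

```python
import math

def diffuse(f, c, dt, steps):
    """Explicit Euler simulation of f_t = c * f_xx (unit pixel spacing).

    The two end pixels are held fixed; stability requires c * dt <= 0.5.
    """
    f = list(f)
    for _ in range(steps):
        lap = ([0.0]
               + [f[i - 1] - 2.0 * f[i] + f[i + 1] for i in range(1, len(f) - 1)]
               + [0.0])
        f = [fi + c * dt * li for fi, li in zip(f, lap)]
    return f

def variance(f):
    """Variance of position, treating f as an unnormalized distribution."""
    total = sum(f)
    mean = sum(i * fi for i, fi in enumerate(f)) / total
    return sum((i - mean) ** 2 * fi for i, fi in enumerate(f)) / total

def gaussian(x, sigma):
    """Unit-area Gaussian with standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

if __name__ == "__main__":
    n, c, dt, steps = 201, 1.0, 0.2, 50
    impulse = [0.0] * n
    impulse[n // 2] = 1.0
    result = diffuse(impulse, c, dt, steps)
    t = dt * steps
    print("measured variance:", variance(result), " 2ct:", 2.0 * c * t)
    print("peak:", result[n // 2], " Gaussian peak:", gaussian(0.0, math.sqrt(2.0 * c * t)))
```

The image is kept wide enough that the impulse never reaches the fixed end pixels during the run, so the boundary treatment does not affect the comparison.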
In the case of VCD, the conductance becomes a function of the spatial coordinates, in
this instance parameterized by i. In particular it becomes a property of the local image
intensities themselves. The conductance ci is usefully seen as a factor by which space is
locally compressed.
To smooth, except at edges, we let ci be small if i is an edge pixel, i.e., if a selected image
property is locally nonuniform. If ci is small (in the heat transfer analogy), little heat flows
(space is stretched), and in the image, little smoothing occurs. If, on the other hand, ci is
large, then much smoothing is allowed in the vicinity of pixel i (space is compressed). VCD
then, just as the forms of MFA and GNC discussed, implements an operation which after
repetition produces a nearly piecewise uniform result.
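A one-dimensional sketch in Python shows the behavior described here. The conductance function used, exp(−(gradient/k)²), is an assumption chosen for illustration (the text only requires that the conductance be small at edges), and `k`, `dt`, and the function names are hypothetical.

```python
import math

def vcd_step(f, k, dt=0.25):
    """One explicit step of f_t = d/dx(c * df/dx) with zero flux at both ends."""
    n = len(f)
    # One difference per boundary between neighbouring pixels.
    grad = [f[i + 1] - f[i] for i in range(n - 1)]
    # Assumed conductance: near 1 where the signal is flat, near 0 across edges.
    cond = [math.exp(-(g / k) ** 2) for g in grad]
    flux = [c * g for c, g in zip(cond, grad)]
    out = []
    for i in range(n):
        right = flux[i] if i < n - 1 else 0.0   # flow from the right neighbour
        left = flux[i - 1] if i > 0 else 0.0    # flow toward the left neighbour
        out.append(f[i] + dt * (right - left))
    return out

def vcd(f, k, steps, dt=0.25):
    """Run the diffusion for a number of steps and return the new signal."""
    for _ in range(steps):
        f = vcd_step(f, k, dt)
    return f

if __name__ == "__main__":
    ripple = [0.05 * (-1) ** i for i in range(10)]
    signal = ripple + [10.0 + v for v in ripple]   # a step edge with small ripple
    smoothed = vcd(signal, k=1.0, steps=100)
    print(["%.3f" % v for v in smoothed])
```

With the conductance bounded by 1 and dt at most 0.5, each new pixel value is a weighted average of itself and its neighbours, so the scheme cannot create new extrema; the ripple on each side is averaged away while the large step is left almost untouched.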
As we observed in Eq. (6.48), the Gaussian is the Green’s function of the diffusion equation.
That is, a diffusion process running on an image produces the same result as convolution with
a Gaussian, where the variance of the Gaussian depends on how long the diffusion has been
running. The constant-conductance diffusion equation is
\[
f_t = c(f_{xx} + f_{yy}). \tag{6.54}
\]
If there is an edge in the image, we want to remove noise on both sides of the edge, but not
blur the edge. Therefore it makes sense to diffuse in a direction tangent to the edge. The
normal and tangent vectors to an edge in a two-dimensional image are given by
\[
N = \frac{[f_x \;\; f_y]^T}{\sqrt{f_x^2 + f_y^2}} \quad \text{and} \quad T = \frac{[-f_y \;\; f_x]^T}{\sqrt{f_x^2 + f_y^2}}.
\]
Consider now the second partial derivatives of f taken in the N and T directions: $f_{NN}$ and
$f_{TT}$.
Since the Laplacian is rotation-invariant, we can write the diffusion PDE (Eq. (6.54)) in
the new coordinates by $f_t = c(f_{NN} + f_{TT})$.
One may derive the following relationships between the partials
\[
\begin{aligned}
f_{NN} &= \left( f_x^2 f_{xx} + 2 f_x f_y f_{xy} + f_y^2 f_{yy} \right) / \left( f_x^2 + f_y^2 \right) \\
f_{TT} &= \left( f_y^2 f_{xx} - 2 f_x f_y f_{xy} + f_x^2 f_{yy} \right) / \left( f_x^2 + f_y^2 \right).
\end{aligned} \tag{6.55}
\]
Substituting this form into Eq. (6.54) and subtracting the normal flow, we end up with a PDE
which smooths along edges without smoothing across edges:
\[
f_t = \left( f_y^2 f_{xx} - 2 f_x f_y f_{xy} + f_x^2 f_{yy} \right) / \left( f_x^2 + f_y^2 \right), \tag{6.56}
\]
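The relationships in Eq. (6.55) can be checked at a single point. The small Python function below takes the first and second partial derivatives at a pixel (however they were estimated) and returns the two directional second derivatives; the function name and argument order are illustrative.

```python
def directional_second_derivatives(fx, fy, fxx, fxy, fyy):
    """Return (f_NN, f_TT) from Eq. (6.55); the gradient must be nonzero."""
    g2 = fx * fx + fy * fy
    if g2 == 0.0:
        raise ValueError("normal direction is undefined where the gradient vanishes")
    f_nn = (fx * fx * fxx + 2.0 * fx * fy * fxy + fy * fy * fyy) / g2
    f_tt = (fy * fy * fxx - 2.0 * fx * fy * fxy + fx * fx * fyy) / g2
    return f_nn, f_tt

if __name__ == "__main__":
    # f(x, y) = x**2 + 3*y at (1, 0): fx = 2, fy = 3, fxx = 2, fxy = fyy = 0.
    f_nn, f_tt = directional_second_derivatives(2.0, 3.0, 2.0, 0.0, 0.0)
    print("f_NN + f_TT =", f_nn + f_tt, " Laplacian =", 2.0 + 0.0)
    # A straight edge profile that varies only in x has no tangential curvature.
    print(directional_second_derivatives(5.0, 0.0, -1.0, 0.0, 0.0))
```

The sum of the two values always equals the Laplacian, and for an image that varies only across a straight edge the tangential term vanishes, so the flow of Eq. (6.56) leaves such an edge unchanged.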
In both cases, we have an energy function which increasingly penalizes the presence of gra-
dients in the image. In the GNC case the prior retains its original shape, and the “annealing”
process (that is, the lowering of p) results in successively closer fits to the predetermined
shape of the prior. In the MFA case, the shape of the prior itself changes, retaining a con-
stant area, but becoming narrower as T is reduced. With this observation, it should not be
surprising that piecewise-constant MFA and the weak membrane of GNC achieve the same
result.
Fig. 6.10. Original image.
Fig. 6.11. Corrupted image.
Fig. 6.12. MFA restoration.
Fig. 6.13. GNC restoration.
The equivalence of the techniques is demonstrated in the next section and proven formally
in [6.12], to which we refer the reader for in-depth analysis. The following experiments are
also described in that paper and are reprinted here to assist the reader in understanding the
action of the algorithms.
The two approaches were each used to restore the same image with various signal-to-
noise ratios (SNRs). On each application of MFA and GNC to a noisy image, the respective
parameters were varied to achieve the best possible image restoration. Several hundred runs
with distinct parameter values were completed for each algorithm. We found that for each
noisy image there existed some parameter set for each algorithm such that the restored images
were of comparable quality.
The resulting image quality achieved is depicted above with the original image (Fig. 6.10),
the image corrupted with SNR = 2 (Fig. 6.11), the MFA restored image (Fig. 6.12), and the
GNC restored image (Fig. 6.13).
An implementation of the GNC algorithm can be found in [6.16], which performs descent using
successive over-relaxation (SOR). Using an implementation of MFA which also uses
SOR, we found MFA to run roughly ten times faster than GNC for high
noise cases (SNR < 3). For cleaner images (SNR ≥ 4), GNC was faster.
Before we can complete this comparison, we need to elaborate on the nature of a spatial
derivative. A derivative of brightness with respect to distance in an image could be written
\[
\frac{\partial f}{\partial x} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}. \tag{6.57}
\]
In a sampled image (as all digital images are), however, the limiting process makes no sense,
for as ter Haar Romeny et al. point out [6.73], one cannot take differences at scales smaller
than a pixel. Instead, one must estimate the derivative at a point by operations performed on
some neighborhood of that point. The topic of estimating derivatives is an old one, and we
will not examine it further except to point out that most analyses have concluded that such
estimates (for higher order derivatives as well) are generally computed by a kernel operator
in one dimension and by the Euclidean norm of an array of n operators in n dimensions.
For the purposes of this derivation, we consider only derivatives in the x direction. We will
generalize this in a few paragraphs. Thus, we may rewrite the prior term as
\[
H_p(f) = \frac{-b}{\sqrt{2\pi}\,T} \sum_i \exp\left(-\frac{(f \otimes r)_i^2}{2T^2}\right). \tag{6.58}
\]
By the notation ( f ⊗ r )i we mean the result of applying a kernel r to the image f at point i. The
kernel may be chosen to emphasize problem-dependent image characteristics. This general
form has been used to remove noise from piecewise-constant [6.38] and piecewise-linear
[6.8, 6.9] images.
In the following derivation, we consider only the prior term.
To perform gradient descent we will need the derivative, so we calculate
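\[
\frac{\partial H_p}{\partial f_i} = \frac{b}{\sqrt{2\pi}\,T} \left( \left[ \frac{(f \otimes r)}{T^2} \exp\left(-\frac{(f \otimes r)^2}{2T^2}\right) \right] \otimes r_{\mathrm{rev}} \right)_i \tag{6.59}
\]
\[
\frac{\partial H_p}{\partial f} = -(\nabla((\nabla f) \exp(-(\nabla f)^2))). \tag{6.60}
\]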
∂f
In the equation above, we have lumped the constants together and set the annealing
control parameter, T, to 1 for clarity. We have then made use of the observation that for
centered first derivative kernels, $f \otimes h = -(f \otimes h_{\mathrm{rev}})$. We will discuss the impact of T in
the next section.
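Both the kernel form of the prior derivative in Eq. (6.59) and the sign flip for a centered first-derivative kernel can be checked numerically. The Python sketch below assumes the kernel r = [−0.5, 0, 0.5], zero values outside the signal, and b = T = 1 by default; the function names are illustrative.

```python
import math

def apply_kernel(f, h):
    """(f ⊗ h)_i = sum_k f[i+k] * h[k] for k = -r..r, with zeros outside f."""
    r = len(h) // 2
    n = len(f)
    return [sum(f[i + k] * h[k + r] for k in range(-r, r + 1) if 0 <= i + k < n)
            for i in range(n)]

def prior_energy(f, r, b=1.0, T=1.0):
    """H_p of Eq. (6.58)."""
    u = apply_kernel(f, r)
    scale = b / (math.sqrt(2.0 * math.pi) * T)
    return -scale * sum(math.exp(-ui ** 2 / (2.0 * T ** 2)) for ui in u)

def prior_gradient(f, r, b=1.0, T=1.0):
    """Eq. (6.59): weight the kernel response, then apply the reversed kernel."""
    u = apply_kernel(f, r)
    w = [ui / T ** 2 * math.exp(-ui ** 2 / (2.0 * T ** 2)) for ui in u]
    scale = b / (math.sqrt(2.0 * math.pi) * T)
    return [scale * v for v in apply_kernel(w, r[::-1])]

def numeric_gradient(f, r, eps=1e-5):
    """Central-difference estimate of dH_p/df_i, used only as a check."""
    grad = []
    for i in range(len(f)):
        fp = list(f); fp[i] += eps
        fm = list(f); fm[i] -= eps
        grad.append((prior_energy(fp, r) - prior_energy(fm, r)) / (2.0 * eps))
    return grad

if __name__ == "__main__":
    f = [0.3, 1.2, -0.7, 2.0, 0.5, 1.1, -0.4]
    r = [-0.5, 0.0, 0.5]          # centered first-derivative kernel
    forward = apply_kernel(f, r)
    backward = apply_kernel(f, r[::-1])
    print("sign flip holds:", all(a == -b for a, b in zip(forward, backward)))
    worst = max(abs(a - b) for a, b in
                zip(prior_gradient(f, r), numeric_gradient(f, r)))
    print("largest disagreement with finite differences:", worst)
```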
Finally, we consider the use of $\partial H(f)/\partial f$ in a gradient descent algorithm. In the simplest
implementations of gradient descent, f is updated by (compare with Eq. (6.20))
\[
f_i^{k+1} = f_i^k - \alpha \frac{\partial H}{\partial f_i} \tag{6.61}
\]
where $f^k$ denotes the value of f at iteration k, and where α is some small constant (or, in more
sophisticated algorithms, a function of the Hessian of H). Rewriting Eq. (6.61),
\[
-\frac{\partial H}{\partial f_i} = \frac{f_i^{k+1} - f_i^k}{\alpha} \tag{6.62}
\]
we note that the RHS of Eq. (6.62) then represents a change in f between iterations k and
k + 1, and in fact bears a strong resemblance to the form of the derivative of f. We make
this similarity explicit by defining that iteration k is calculated at time t and iteration k + 1
calculated at time t + Δt. (In similar contexts, t is sometimes known as the “evolution
parameter.”) Since t is an artificially introduced parameter, without physically meaningful
which we showed in Eq. (6.11) is a measure of the effect of Gaussian noise in a blur-free
imaging system. Thus we see that biased anisotropic diffusion (BAD) [6.59] may be thought of
as a maximum a posteriori restoration of an image. This observation now permits researchers
in VCD/BAD to consider use of different stabilizing costs, if they have additional information
concerning the noise generation process.
The relationship between a Hopfield neural network and an optimization problem is well-
known. Given an objective of the “Ising” type, it is straightforward to find a recurrent neural
network whose stable states are minima of the objective functions (see [6.39] as well as
[6.40]).
The most straightforward forms of this type of recurrent nets deal with binary-valued
variables, with one neuron representing each variable. In this usage, a “neuron” is a sum-
of-products operator which produces a weighted sum of its inputs. That sum is then passed
through a monotonic nonlinearity – usually a limiter of some sort, e.g., a sigmoid. Following
this definition, we may describe the operations represented by Eq. (6.59) as a two-layer
network, in which each layer is only locally connected as shown in Fig. 6.14 (see also
[6.13]).
The local connectivity is important from an implementation point of view, since that aspect
makes construction of parallel, real-time hardware very feasible.
6A.6 Conclusion
Note: MFA combines annealing with gradient descent.
An equivalence exists between the problems of image optimization and diffusion. Others
have made similar observations. In addition to Nordström [6.59], Geiger and Yuille came to a
similar conclusion [6.26] for an energy function requiring explicit line processes. Since it has
been shown [6.38] that the line processes are not required, their result may now be interpreted
more generally. In an image optimization problem (in particular, restoration), one defines a
criterion function which is to be minimized, and applies some minimization scheme to find a
global (or at least a good) minimum. Thus, an image restoration problem may be considered
as a combination of goals: (1) to preserve the information in the original image, and thereby
produce a resulting image which resembles that original (or the result of an operation on it)
in some way; and (2) to produce an image which possesses certain properties, such as local
smoothness except at boundaries. If one abandons the first goal, then restoration problems
become iconic feature extraction problems. Wu and Doerschuk [6.81] have developed an
attractive extension to this work.
Finally, recall that in [6.19], we demonstrated that MFA operations could be calculated
by a two-layer, locally connected, recurrent neural network. From this paper one may then
conclude that GNC and VCD are likewise implementable by straightforward neural networks.
These results lead us to conjecture the following guiding principles for the design of feature
extraction algorithms.
• Relaxation is a central concept. A relaxation algorithm possesses the following properties.
(1) It must be iterative. That is, the output of one pass of the algorithm is the same format
as the input, so that the algorithm may be applied to its own output. (2) It must converge.
• The relaxation algorithm needs to be local in nature. That is, at any time, changes to any
pixel depend only on the local neighborhood of that pixel. Compliance with this guideline
allows global interactions to occur smoothly over time (iteration number) and space, and
provides a theoretical basis, via the Gibbs/Markov field equivalence for analyzing the
operations.
• The equivalence between diffusion and optimization is helpful in understanding the per-
formance of both forms of algorithms. For those designing diffusion algorithms, seeing
them as an optimizing relaxation is helpful: Since all relaxation algorithms minimize
something, it is useful to look at the integral of the diffusion step (although this process
is often intractable) to see what properties are actually being minimized in the develop-
ment/application of an ad hoc technique. For those designing optimization methods, seeing
the result as a spatial deformation according to the local degree of nonuniformity followed
by an averaging can help to understand the spatiotemporal effects of the optimization.
• In designing feature extraction algorithms, it is helpful to consider the algorithm as a
restoration, even if no residual is explicitly minimized, for then, the precise effect of the
algorithm on the image may be better comprehended.
• Scale variation in spatial analysis algorithms and temperature control in annealing algo-
rithms are closely related. The power that has been found for both of them is related.
• Finally, the nonlinear operations (exponentials) indicated in all the algorithms mentioned
are absolutely essential to success. The Kolmogorov theorem [6.49] demonstrates suffi-
ciency by stating that a linear operation followed by the appropriate nonlinearity allows
computation of arbitrary mapping. Here, we claim that such nonlinearity is not only suffi-
cient but necessary. This fact probably contributes most significantly to the recent success
of a number of neural network applications.
Bibliography
[6.3] H. Andrews and B. Hunt, Digital Image Restoration, Englewood Cliffs, NJ, Prentice-
Hall, 1977.
[6.4] D. Baker and J. Aggarwal, “Geometry Guided Segmentation of Outdoor Scenes,”
SPIE Applications of Artificial Intelligence, VI, pp. 576–583, 1988.
[6.5] J. Besag, “Spatial Interaction and the Statistical Analysis of Lattice Systems,” Journal
of the Royal Statistical Society, B, 36, pp. 192–236, 1974.
[6.6] J. Besag, “On the Statistical Analysis of Dirty Pictures,” Journal of the Royal Statistical
Society, B, 48 (3), 1986.
[6.7] G. Bilbro and W. Snyder, “Fusion of Range and Luminance Data,” IEEE Symposium
on Intelligent Control, Arlington, August, 1988.
[6.8] G. Bilbro and W. Snyder, “Range Image Restoration using Mean Field Annealing,”
In Advances in Neural Network Information Processing Systems, San Mateo, CA,
Morgan-Kaufmann, 1989.
[6.9] G. Bilbro and W. Snyder, “Mean Field Annealing, an Application to Image Noise
Removal,” Journal of Neural Network Computing, Fall, 1990.
[6.10] G. Bilbro and W. Snyder, “Optimization of Functions with Many Minima,” IEEE
Transactions on Systems, Man, and Cybernetics, 21(4), July/August, 1991.
[6.11] G. Bilbro, R. Mann, T. Miller, W. Snyder, D. Van den Bout and M. White, “Opti-
mization by Mean Field Annealing,” In Advances in Neural Information Processing
Systems, San Mateo, CA, Morgan-Kaufmann, 1989.
[6.12] G. Bilbro, W. Snyder, S. Garnier, and J. Gault, “Mean Field Annealing: a Formalism
for Constructing GNC-like Algorithms,” IEEE Transactions on Neural Networks, 3(1)
pp. 131–138, 1992.
[6.13] G. Bilbro, W. Snyder, and R. Mann, “Mean Field Approximation Minimizes Relative
Entropy,” Journal of the Optical Society of America, A, 8(2), February 1991.
[6.14] M. Black and A. Jepson, “Estimating Optical Flow in Segmented Images Using
Variable-order Parametric Models with Local Deformations,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 18(10), 1996.
[6.15] A. Blake, “Comparison of the Efficiency of Deterministic and Stochastic Algorithms
for Visual Reconstruction,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 11(1), 1989.
[6.16] A. Blake and A. Zisserman, Visual Reconstruction, Cambridge, MA, MIT Press, 1987.
[6.17] E. Brezin, J. Le Guillou, and J. Zinn-Justin, “Field Theoretical Approaches to Critical
Phenomena,” Phase Transitions and Critical Phenomena, volume 6, eds. C. Domb
and M. Green, New York, Academic Press, 1976.
[6.18] R. Burden, J. Faires, and A. Reynolds, Numerical Analysis, Boston, Prindle, 1981.
[6.19] H. Chang and M. Fitzpatrick, “Geometrical Image Transformation to Compensate
for MRI Distortions,” SPIE Medical Imaging IV, 1233, pp. 116–127, February,
1990.
[6.20] H. Derin and H. Elliot, “Modeling and Segmentation of Noisy and Textured Images
using Gibbs Random Fields,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 9, pp. 39–55, 1987.
[6.21] H. Ehricke, “Problems and Approaches for Tissue Segmentation in 3D-MR Imaging,”
SPIE Medical Imaging IV: Image Processing, 1233, pp. 128–137, February, 1990.
[6.22] J. Elder, “Are Edges Incomplete?” International Journal of Computer Vision, 34(2),
1999.
[6.23] J. Elder and R. Goldberg, “Image Editing in the Contour Domain,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 23(3), 2001.
[6.24] J. Elder and S. Zucker, “Scale Space Localization, Blur, and Contour-based
Image Coding,” IEEE Conference on Computer Vision and Pattern Recognition, San
Francisco, CA, 1996.
[6.25] D. Geiger and F. Girosi, “Parallel and Deterministic Algorithms for MRFs: Sur-
face Reconstruction and Integration,” AI Memo, No 1114, Cambridge, MA, MIT,
1989.
[6.26] D. Geiger and A. Yuille, “A Common Framework for Image Segmentation by Energy
Functions and Nonlinear Diffusion,” MIT AI Lab Report, Cambridge, MA, 1989.
[6.27] D. Geman and S. Geman, “Stochastic Relaxation, Gibbs Distributions, and the
Bayesian Restoration of Images,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 6(6), November, 1984.
[6.28] R. Gonzalez and P. Wintz, Digital Image Processing, 2nd edn, Reading, MA, Addison-
Wesley, 1987.
[6.29] A. Gray, J. Kay, and D. Titterington, “An Empirical Study of the Simulation of Var-
ious Models used for Images,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(5), 1994.
[6.30] B. Groshong, G. Bilbro, and W. Snyder, “Restoration of Eddy Current Images by
Constrained Gradient Descent,” Journal of Nondestructive Evaluation, December,
1991.
[6.31] S. Grossberg, “Neural Dynamics of Brightness Perception: Features, Boundaries,
Diffusion, and Resonance,” Perception and Psychophysics, 36(5), pp. 428–456,
1984.
[6.32] J. Hadamard, Lectures on the Cauchy Problem in Linear Partial Differential Equations,
New Haven, CT, Yale University Press, 1923.
[6.33] J. Hammersley and P. Clifford, “Markov Field on Finite Graphs and Lattices,”
unpublished.
[6.34] F. Hansen and H. Elliot, “Image Segmentation using Simple Markov Field Models,”
Computer Graphics and Image Processing, 20, pp. 101–132, 1982.
[6.35] R. Haralick and L. Shapiro, “Image Segmentation Techniques,” Computer Vision,
Graphics, and Image Processing, 29, pp. 100–132, 1985.
[6.36] E. Hensel, Inverse Theory and Applications for Engineers, Englewood Cliffs, NJ,
Prentice-Hall, 1991.
[6.37] H. Hiriyannaiah, Signal Reconstruction using Mean Field Annealing. Ph.D. Thesis,
North Carolina State University, Raleigh, NC, 1990.
[6.38] H. Hiriyannaiah, G. Bilbro, W. Snyder, and R. Mann, “Restoration of Locally Ho-
mogeneous Images using Mean Field Annealing,” Journal of the Optical Society of
America A, 6, pp. 1901–1912, December, 1989.
[6.39] J. Hopfield, “Neurons with Graded Response Have Collective Computational Prop-
erties Like Those of Two-state Neurons,” Proceedings of the National Academy of
Science USA, 81, pp. 3058–3092.
[6.77] D. Van den Bout and T. Miller, “Graph Partitioning using Annealed Neural Networks,”
IEEE Transactions on Neural Networks, 1(2), pp. 192–203, 1990.
[6.78] C. Wang, W. Snyder, and G. Bilbro, “Optimal Interpolation of Images,” Neural Net-
works for Computing Conference, Snowbird, UT, April, 1995.
[6.79] R. Whitaker, “Geometry-limited Diffusion in the Characterization of Geometric
Patches in Images,” TR91-039, Dept. of Computer Science, UNC, Chapel Hill, NC,
1991.
[6.80] G. Wolberg and T. Pavlidis, “Restoration of Binary Images Using Stochastic Relax-
ation With Annealing,” Pattern Recognition Letters, 3(6), pp. 375–388, December,
1985.
[6.81] C. Wu and P. Doerschuk, “Cluster Expansions for the Deterministic Computation
of Bayesian Estimators Based on Markov Random Fields,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(3), 1995.
[6.82] M. Yaou and W. Chang, “Fast Surface Interpolation Using Multiresolution Wavelet
Transform,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(7),
1994.
[6.83] J. Yi and D. Chelberg, “Discontinuity-preserving and Viewpoint Invariant Reconstruc-
tion of Visible Surfaces Using a First-order Regularization,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(6), 1995.
7 Mathematical morphology
A man’s discourse is like to a rich Persian carpet, the beautiful figures and patterns of which
can be shown only by spreading and extending it out; when it is contracted and folded up,
they are obscured and lost
Plutarch
The suffix “-ology” means “study of-,” so obviously, “morphology” is the study of
morphs; answering critical questions like: “How come they only come out at night,
and then fly toward the light?” and “Why is it that bug zappers only toast the harmless
critters, leaving the ’skeeters alone?” and – HOLD IT! That’s MORPH-ology, the
study of SHAPE, not moths! Try again . . .
7.1 Binary morphology
7.1.1 Dilation
First, the intuitive definition: The dilation of a (BINARY) image is that same image
with all the foreground regions made just a little bit bigger.
Now, formally: We consider two images, f A and f B , and let A and B be sets of
ordered pairs, consisting of the coordinates of each foreground pixel in f A and f B ,
respectively.
Consider one pixel in f B , and its corresponding element (ordered pair) of B, call
that element b ∈ B. Create a new set by adding the ordered pair b to EVERY ordered
pair in A. Let’s look at a tiny example.
For this image, A = {(2, 8), (3, 6), (4, 4), (5, 6), (6, 4), (7, 6), (8, 8)}. Adding the
pair (−1, 1) results in the set A_(−1,1) = {(1, 9), (2, 7), (3, 5), (4, 7), (5, 5), (6, 7),
(7, 9)}. The corresponding image is also shown in Fig. 7.1, and we hope you
observed that A_(−1,1) is nothing more than a translation of A. With this concept
firmly understood, think about what would happen if you constructed a SET of
translations of A, one for each pair in B. We denote that as {A_b, b ∈ B}, that is, b is
one of the ordered pairs in B.
[Fig. 7.1. Example of dilation. (a) The original binary image, f_A. (b) The binary image dilated by
B = {(−1, 1)}, f_A(−1,1).]
Formally, we define the DILATION of A by B as A ⊕ B = {a + b | a ∈ A,
b ∈ B}, which is the same as the union of all those translations of A,
$$A \oplus B = \bigcup_{b \in B} A_b \tag{7.1}$$
and we use the same symbol to denote the dilation of images: f_A ⊕ f_B. Here is
another example.
[Figure: the binary images f_A and f_B, with A = {(2, 8), (3, 6), (4, 4), (5, 6), (6, 4),
(7, 6), (8, 8)} and B = {(0, 0), (0, 1)}, and their dilation f_A ⊕ f_B.]
Denote by #A the number of elements in set A. In this example, #A = 7, and #(A ⊕ B)
= 14. Here #(A ⊕ B) = #A · #B, but only coincidentally, because there was no overlap
between A_(0,0) and A_(0,1), or said another way: A_(0,0) ∩ A_(0,1) = Ø. (It’s time to
start getting used to set notation.)
For a more general problem, this will not be the case. In general, #(A ⊕ B) ≤ #A · #B.
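The definition of dilation as a union of translations can be tried directly on sets of ordered pairs. The following is a minimal sketch in Python; the function names `translate` and `dilate` are our own choices for illustration, and A and B are the sets from the example above.

```python
def translate(A, b):
    """Return A_b: every ordered pair in A shifted by the ordered pair b."""
    bx, by = b
    return {(x + bx, y + by) for (x, y) in A}


def dilate(A, B):
    """A (+) B computed as the union of the translations A_b, one per b in B."""
    result = set()
    for b in B:
        result |= translate(A, b)
    return result


A = {(2, 8), (3, 6), (4, 4), (5, 6), (6, 4), (7, 6), (8, 8)}
B = {(0, 0), (0, 1)}

shifted = translate(A, (-1, 1))   # the single translation of Fig. 7.1
D = dilate(A, B)

print(sorted(shifted))
print(len(A), len(D))             # number of elements before and after dilation
```

Because the two translations of A do not overlap here, the dilated set has exactly #A · #B elements; with overlapping translations the count would be smaller.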
To go further, we need to define some notation: If x is an ordered pair, then (1) the
translation¹ of a set A by x is A_x, (2) the reflection of A is Ã = {(−x, −y) | (x, y) ∈ A},
and (3) the complement of set A is A^c. An example of reflection is
[Figure: a binary image f_A and its reflection f_Ã.]
In this example, A = {(0, 0), (1, 0), (1, 1)} and the reflection of A is {(0, 0), (−1, 0),
(−1, −1)}.
¹ We do not have to be limited to a 2-space. As long as x and A are drawn from the same space, it works. More
generally, if A and B are sets in a space ε, and x ∈ ε, the translation of A by x is denoted A_x = {y | for some
a ∈ A, y = a + x}.
7.1.2 Erosion
Now, we define the (sort of) inverse of dilation, erosion,
$$A \ominus B = \bigcap_{b \in \tilde{B}} A_b$$
Notice two things: The second set, B, is reflected, and the intersection symbol is
used. Let’s do an example.
[Figure: the binary image f_A.]
Rather than the tedious job of listing all the 17 elements of A, just draw
[Figure: the translations f_A(0, 0) and f_A(−1, 0), and their intersection f_A(0, 0) ∩ f_A(−1, 0).]
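Erosion as an intersection of translations can be sketched in the same set-based style. In the sketch below the names `translate`, `reflect` and `erode` are ours, the small set A is a made-up example (not the 17-element image above), and B = {(0, 0), (1, 0)} is assumed because its reflection gives the two translations (0, 0) and (−1, 0) used in the example.

```python
def translate(A, b):
    """Return A_b: every ordered pair in A shifted by b."""
    bx, by = b
    return {(x + bx, y + by) for (x, y) in A}


def reflect(B):
    """Return the reflection of B through the origin."""
    return {(-x, -y) for (x, y) in B}


def erode(A, B):
    """Erosion as the intersection of the translations of A by each
    element of the reflected structuring element (B must be non-empty)."""
    translations = [translate(A, b) for b in reflect(B)]
    return set.intersection(*translations)


A = {(1, 1), (2, 1), (3, 1), (2, 2)}   # hypothetical foreground pixels
B = {(0, 0), (1, 0)}                   # assumed s.e.; reflection is {(0, 0), (-1, 0)}

E = erode(A, B)

# Equivalent pointwise reading: p survives when p + b is foreground for every b in B.
pointwise = {p for p in A if all((p[0] + bx, p[1] + by) in A for (bx, by) in B)}

print(sorted(E))
```

The pointwise form is restricted to points of A only because this B contains the origin; for a general s.e. the intersection form is the safer one to use.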
So now we have dilation and erosion defined. You will observe that usually (for all
practical purposes) one of the images is “small” compared to the other; that is, in
the example above
#A ≫ #B. (7.7)
When this is the case, we refer to the smaller image, f B , as the “structuring element”
(s.e.).
• Commutative property. A ⊕ B = B ⊕ A. (7.8)
• Associative property. (A ⊕ B) ⊕ C = A ⊕ (B ⊕ C). (7.9)
• Distributive property. A ⊕ (B ∪ C) = (A ⊕ B) ∪ (A ⊕ C). (7.10)
• Increasing property. If A ⊆ B, then
A ⊕ K ⊆ B ⊕ K, (7.11)
for any s.e. K. When this property holds, we say the operator is “increasing.”
An example proof: Dilation is increasing
Let the set A consist of elements A_i, A = {A_1, A_2, . . . , A_n}, and let B be similarly
denoted. Furthermore, suppose B ⊆ A. Now, suppose both A and B are dilated by
the same s.e., K. Take a single element of K, say K_1, and dilate each element of
A by K_1, A ⊕ K_1 = {A_1 + K_1, A_2 + K_1, . . . , A_n + K_1}, and similarly dilate B.
Since every element of B was also an element of A, every element of B ⊕ K_i is in
A ⊕ K_i. Since this observation is true for an arbitrary element of K, it is true for
all elements of K. Now consider the union of the results of applying two elements
of the s.e. K to A: A_12 = (A ⊕ K_1) ∪ (A ⊕ K_2). Since B ⊕ K_1 ⊆ A ⊕ K_1 and
B ⊕ K_2 ⊆ A ⊕ K_2, we know from set theory that if R_1 ⊆ S_1 and R_2 ⊆ S_2 then
R_1 ∪ R_2 ⊆ S_1 ∪ S_2 and we are done.
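A numerical check is no substitute for the proof, but it is a quick way to catch a misremembered property. The sketch below tests the commutative, associative, distributive and increasing properties of dilation on randomly generated sets; the helper `random_set`, the seed, and the set sizes are arbitrary choices made for illustration.

```python
import random


def dilate(A, B):
    """A (+) B = {a + b | a in A, b in B}."""
    return {(ax + bx, ay + by) for (ax, ay) in A for (bx, by) in B}


rng = random.Random(0)  # fixed seed so that the run is repeatable


def random_set(n, lo, hi):
    """Up to n random ordered pairs with both coordinates in [lo, hi]."""
    return {(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(n)}


A = random_set(12, 0, 9)
B = random_set(4, -1, 1)
C = random_set(4, -1, 1)
K = random_set(3, -1, 1)
A_sub = set(sorted(A)[:5])           # a subset of A

commutative = dilate(A, B) == dilate(B, A)
associative = dilate(dilate(A, B), C) == dilate(A, dilate(B, C))
distributive = dilate(A, B | C) == dilate(A, B) | dilate(A, C)
increasing = dilate(A_sub, K) <= dilate(A, K)   # "<=" is the subset test for sets

print(commutative, associative, distributive, increasing)
```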
• Extensive and anti-extensive properties. If we say an operator is “extensive,” we
mean that applying this operator to a set, A, produces an answer which contains
A. If the s.e. contains the origin (that is, element (0, 0)), dilation is extensive:
A ⊕ K ⊇ A. (7.12)
As you might guess, erosion has some extensive properties as well: That is, erosion
(A ⊖ B)^c = A^c ⊕ B̃
(A ⊕ B)^c = A^c ⊖ B̃ (7.13)
A ⊖ (B ⊕ C) = (A ⊖ B) ⊖ C (7.14)
(A ∪ B) ⊖ C ⊇ (A ⊖ C) ∪ (B ⊖ C) (7.15)
A ⊖ (B ∩ C) ⊇ (A ⊖ B) ∪ (A ⊖ C) (7.16)
A ⊖ (B ∪ C) = (A ⊖ B) ∩ (A ⊖ C). (7.17)
A cautionary note: You cannot do cancellation with morphological operators as
you might suspect. For example, if A = B ⊖ C, we could dilate both sides by C
to get A ⊕ C = (B ⊖ C) ⊕ C and if dilation and erosion were true inverses, the
RHS would be just B. However, the RHS is in fact the opening of B by C and not
simply B. Erosion and dilation are not inverses of each other.
f_A ∘ f_B = (f_A ⊖ f_B) ⊕ f_B (7.18)
An application
So what is the purpose of all this? Let’s do an example: Inspection of printed circuit
(PC) boards. Here’s a picture of a PC board with two traces on it shorted together by
a hair which was stuck to the board when it went through the wave solder machine.
We will use opening to identify the short.
First, erode the image using a small s.e. We choose an s.e. which is smaller than the
features of interest (the traces), but larger than the defect. The erosion then looks
like this:
and surprise, surprise, the defect is gone. For inspection purposes, one could now
subtract the opened image from the original, and the difference image would contain
only the defect. Furthermore these operations can be done in hardware, blindingly fast.
This example illustrates first of all that morphological concepts can be extended to a
continuous domain. (For the time being, remember, however, that this is continuous
in resolution, not in brightness value; still binary. We will fix that soon.) Second, it
illustrates the fact that opening preserves exactly the geometry of objects which are
“big enough,” and totally erases smaller objects. In this sense, opening resembles
the functioning of the median filter, where each pixel is replaced by the median of
its neighbors.
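The inspection idea can be sketched on a toy board. Everything in the example below is made up for illustration: two horizontal traces three pixels thick, a one-pixel-wide "hair" bridging them, and a 3 × 3 s.e. that fits inside a trace but not inside the hair. The function names are ours.

```python
def dilate(A, K):
    """A (+) K = {a + k | a in A, k in K}."""
    return {(ax + kx, ay + ky) for (ax, ay) in A for (kx, ky) in K}


def erode(A, K):
    """Keep the points p for which p + k is foreground for every k in K."""
    candidates = {(ax - kx, ay - ky) for (ax, ay) in A for (kx, ky) in K}
    return {(px, py) for (px, py) in candidates
            if all((px + kx, py + ky) in A for (kx, ky) in K)}


def opening(A, K):
    """Opening: erosion followed by dilation with the same s.e."""
    return dilate(erode(A, K), K)


# Hypothetical board: two traces on rows 1-3 and 7-9, shorted by a hair in column 5.
traces = {(x, y) for x in range(12) for y in (1, 2, 3, 7, 8, 9)}
hair = {(5, y) for y in (4, 5, 6)}
board = traces | hair

K = {(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}   # 3 x 3, origin at the center

opened = opening(board, K)
defect = board - opened      # what opening removed

print(sorted(defect))
```

On this toy board the opening returns the two traces unchanged and the difference image is exactly the hair, which is the behavior described above: objects that are "big enough" keep their geometry, smaller ones vanish.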
• Duality: (A ∘ K)^c = A^c • K̃.
Proof of duality. Notice how this proof is done. We expect students to do proofs
this carefully.
1. (A ∘ K)^c = [(A ⊖ K) ⊕ K]^c    definition of opening
2. = (A ⊖ K)^c ⊖ K̃    complement of dilation
3. = (A^c ⊕ K̃) ⊖ K̃    complement of erosion
4. = A^c • K̃    definition of closing.
• Idempotency: Opening and closing are idempotent. That is, repetitions of the
same operation have no further effect: (A ∘ K) ∘ K = A ∘ K and (A • K) • K = A • K.
(Can you prove this? Is it even true?)
• Closing is extensive: A • K ⊇ A.
• Opening is anti-extensive: A ∘ K ⊆ A.
• Images dilated by f_K remain invariant under opening by f_K. That is, f_A ⊕ f_K =
(f_A ⊕ f_K) ∘ f_K.
Proof
1. A • K ⊇ A    since closing is extensive
2. (A • K) ⊕ K ⊇ A ⊕ K    dilation is increasing
3. ((A ⊕ K) ⊖ K) ⊕ K ⊇ A ⊕ K    definition of closing
4. (A ⊕ K) ∘ K ⊇ A ⊕ K    definition of opening
5. for any B, B ∘ K ⊆ B    opening is anti-extensive
6. therefore, (A ⊕ K) ∘ K ⊆ A ⊕ K    substitution of A ⊕ K for B
7. (A ⊕ K) ∘ K = A ⊕ K    since (A ⊕ K) ∘ K both contains and is contained in
A ⊕ K, so only equality can be true.
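The properties of opening and closing listed above can also be checked numerically on a small example. In this sketch the set A and the three-element s.e. K are arbitrary choices of ours, and closing is taken as dilation followed by erosion with the same s.e., as in step 3 of the proof.

```python
def dilate(A, K):
    """A (+) K = {a + k | a in A, k in K}."""
    return {(ax + kx, ay + ky) for (ax, ay) in A for (kx, ky) in K}


def erode(A, K):
    """Keep the points p for which p + k is foreground for every k in K."""
    candidates = {(ax - kx, ay - ky) for (ax, ay) in A for (kx, ky) in K}
    return {(px, py) for (px, py) in candidates
            if all((px + kx, py + ky) in A for (kx, ky) in K)}


def opening(A, K):
    return dilate(erode(A, K), K)


def closing(A, K):
    return erode(dilate(A, K), K)


# Hypothetical test data.
A = {(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (4, 1), (4, 3), (5, 3), (5, 4), (4, 4)}
K = {(0, 0), (1, 0), (0, 1)}

opening_idempotent = opening(opening(A, K), K) == opening(A, K)
closing_idempotent = closing(closing(A, K), K) == closing(A, K)
opening_anti_extensive = opening(A, K) <= A
closing_extensive = closing(A, K) >= A
dilated_invariant = opening(dilate(A, K), K) == dilate(A, K)

print(opening_idempotent, closing_idempotent,
      opening_anti_extensive, closing_extensive, dilated_invariant)
```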
Up to now, we have assumed that the images with which we are dealing are binary.
Now, we relax that requirement and allow f A to take on continuous values in a finite
range f min A ≤ f A ≤ f max A . With this extension, we no longer have a simple
definition for such operations as dilation, since the union operator is not defined. To
define gray-scale morphological operations, we need first to define a new concept,
the “umbra.”
The umbra, U ( f A ) of a two-dimensional gray-scale image f A is the set of all
ordered triples, (x, y, U ), which satisfy 0 < U ≤ f A (x, y). If we think of f A as
being continuously valued, then U ( f A ) is an infinite set. To make morphological
operations feasible, we assume f A is quantized to M values,
#U ( f A ) ≤ M · #A (7.23)
To illustrate the concept of umbra, let f A be one dimensional. (We illustrate a one-
dimensional function here as a two-dimensional function in which one dimension is
always zero. That way, you have an example that is easily extended to two dimen-
sions):
A = {(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0)}
and the pixel value at the corresponding coordinate is
f A (x, 0) = [1, 2, 3, 1, 2, 3, 3].
Notice the new notation: Since f A takes on various values, depending on which
element of A is being considered, we use functional notation. Drawing f A , we have
Fig. 7.2.
In Fig. 7.2, the heavy black line represents f A , and the umbra is the shaded area
under f A . Following this figure, the heavy black line is the TOP of the umbra. We
could write the umbra as a set of ordered triples:
U ( f A ) = {(0, 0, 1), (1, 0, 1), (1, 0, 2), (2, 0, 1), (2, 0, 2), (2, 0, 3), (3, 0, 1),
(4, 0, 1), (4, 0, 2), (5, 0, 1), (5, 0, 2), (5, 0, 3), (6, 0, 1), (6, 0, 2), (6, 0, 3)}.
153 7.3 The distance transform
[Fig. 7.2. The function f_A(x, 0) plotted against x = 0, . . . , 6 (heavy line), with the umbra shaded beneath it.]
Here’s the trick: Although the gray-level image is no longer binary (and therefore
not representable by set membership) the umbra does have those properties.
We may therefore define the dilation of a gray-scale image f A by a gray-scale
s.e., f B as
and erosion is similarly defined. Furthermore gray-scale opening and closing can be
defined in terms of gray-scale dilation and erosion.
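The umbra of the one-dimensional example can be built and inverted in a few lines. The sketch below also computes a gray-scale dilation, under the assumption that it is taken as the top surface of the set dilation of the two umbrae (the usual umbra construction); the function names and the small s.e. f_B are ours, chosen only for illustration.

```python
def umbra(f):
    """All triples (x, y, u) with 0 < u <= f(x, y), for an integer-valued image f."""
    return {(x, y, u) for (x, y), v in f.items() for u in range(1, v + 1)}


def top(U):
    """Recover the image from an umbra: the largest u stored at each (x, y)."""
    f = {}
    for (x, y, u) in U:
        f[(x, y)] = max(u, f.get((x, y), 0))
    return f


def dilate3(U, V):
    """Set dilation of two sets of ordered triples."""
    return {(x1 + x2, y1 + y2, u1 + u2) for (x1, y1, u1) in U for (x2, y2, u2) in V}


def gray_dilate(f, g):
    """Gray-scale dilation, assumed here to be the top of the dilated umbrae."""
    return top(dilate3(umbra(f), umbra(g)))


f_A = {(x, 0): v for x, v in enumerate([1, 2, 3, 1, 2, 3, 3])}
f_B = {(0, 0): 1, (1, 0): 1}          # hypothetical gray-scale s.e.

U = umbra(f_A)
g = gray_dilate(f_A, f_B)

print(len(U))
print([g[(x, 0)] for x in range(8)])
```

For this flat two-pixel s.e. the result at each x is the larger of f_A(x) and f_A(x − 1), plus one, which is the familiar "local maximum" reading of gray-scale dilation.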
Generalizing this concept to two-dimensional images, the umbra becomes three
dimensional, a set of triples
for (x1 , y1 ) ∈
⊂ Z × Z , where
denotes the set of possible pixel locations,
assumed here to be positive and integer.
∇ DT (x) = 1 (7.27)
4 3 4
3 +0
This process is repeated at each pixel until all pixels in the image have been
processed, and then iterated until all pixels in the DT are marked with a finite value.
Masks other than that of Fig. 7.4 produce other variations of the DT. In particular,
Fig. 7.6 produces the chamfer map. If divided by three, the chamfer map produces
a DT which is not a bad approximation to the Euclidian distance.
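The iteration described above can be sketched as follows. This is an illustrative version, not the book's listing: it assumes the 3-4 chamfer mask (3 for horizontal and vertical neighbors, 4 for diagonal ones), starts with 0 at edge pixels and infinity elsewhere, and sweeps the image until no value changes. The function name and the 5 × 5 test image are ours.

```python
import math

INF = float("inf")

# Assumed 3-4 chamfer mask: cost of stepping to each of the eight neighbors.
MASK = {(-1, -1): 4, (0, -1): 3, (1, -1): 4,
        (-1, 0): 3,              (1, 0): 3,
        (-1, 1): 4,  (0, 1): 3,  (1, 1): 4}


def chamfer_dt(edges, width, height):
    """Iterate dt(p) = min(dt(p), dt(q) + cost) over the neighbors q of p
    until a full sweep of the image produces no change."""
    dt = [[0 if (x, y) in edges else INF for x in range(width)]
          for y in range(height)]
    changed = True
    while changed:
        changed = False
        for y in range(height):
            for x in range(width):
                best = dt[y][x]
                for (dx, dy), cost in MASK.items():
                    nx, ny = x + dx, y + dy
                    if 0 <= nx < width and 0 <= ny < height:
                        best = min(best, dt[ny][nx] + cost)
                if best < dt[y][x]:
                    dt[y][x] = best
                    changed = True
    return dt


dt = chamfer_dt({(2, 2)}, 5, 5)      # a single edge pixel in the center
for row in dt:
    print(row)

# Compare chamfer / 3 with the Euclidian distance at a knight's-move offset.
print(dt[3][4] / 3, math.hypot(2, 1))
```

At the offset (2, 1) the chamfer value is 7, so the estimate is 7/3 while the Euclidian distance is the square root of 5; the two agree exactly only along rows and columns.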
that points in the Voronoi diagram do not belong to the Voronoi domain of any
region.
7.4 Conclusion
7.5 Vocabulary
Closing
Dilation
Distance transform
Erosion
Extensive
Increasing
Opening
Umbra
Voronoi diagram
Assignment 7.1
In section 7.1.3, we stated that dilation is commuta-
tive because addition is commutative. Erosion also in-
volves addition, but one of the two images is reversed.
Is erosion commutative? Prove or disprove it.
Assignment 7.2
In section 7.3, a mask is given and it is stated that
application of this mask produces a distance transform
which is “not a bad approximation” to the Euclidian
distance from the interior point to the nearest edge
point. How bad is it? Contrive an example where the
value produced by the application of the mask is dif-
ferent from the Euclidian distance to the nearest edge
point.
157 Assignments
Assignment 7.3
Consider a region with an area of 500 pixels with 120
pixels on the boundary. You need to find the distance
transform from each pixel in the interior to the bound-
ary, using the Euclidian distance. (Note: Pixels ON the
boundary are not considered IN the region -- at least
in this problem.) What is the computational complex-
ity? Note: You may come up with some algorithm more
clever than that used to produce any of the answers
below, so if your algorithm does not produce one of
these answers, explain it.
(a) 60 000
(b) 120 000
(c) 45 600
(d) 91 200
Assignment 7.4
Trick question: For Problem Assignment 7.3, how many
square roots must you calculate to determine this
distance transform? Remember, this is the Euclidian
distance.
Assignment 7.5
Prove (or disprove) that binary images eroded by a
kernel, K, remain invariant under closing by K. That
is, prove that A ⊖ K == (A ⊖ K) • K.
Assignment 7.6
Show that dilation is not necessarily extensive if the
origin is not in the s.e.
Assignment 7.7
Prove that dilation is increasing.
Assignment 7.8
Let C be the class of binary images that have only one
dark pixel. For a particular image, let that pixel be
located at (i0 , j0 ).
Using erosion and dilation by kernels that have
{(0, 0)} as an element, devise an operator, that is,
Assignment 7.9
Which of the following statements is correct? (You
should be able to reason through this, without doing
a proof.)
(a) (A ⊖ B) ⊖ C == A ⊖ (B ⊖ C) (7.30)
(b) (A ⊖ B) ⊖ C == A ⊖ (B ⊕ C) (7.31)
Assignment 7.10
Use the thresholded images you created in Assignment
5.5 and Assignment 5.6. Choose a suitable structuring
element, and apply opening to remove the noise.
Topic 7A Morphology
Equation (7.14) (or Eq. (7.31)), as we hope you determined from Assignment 7.9, is correct.
It says that erosion by a large s.e., say K, can be broken down into two sequential erosions,
first by B and then by C, provided we are able to find B and C such that K = B ⊕ C. This is
sometimes referred to as the “chain rule” for erosion. It has substantial impact on hardware
implementations.
Suppose we have custom hardware that can compute erosion by a 3 × 3 s.e. at video rates,
but in some application we need to erode by a particular 4 × 4. The chain rule says that if we
can (somehow) find two 3 × 3 s.e.s such that their dilation is the 4 × 4 we want, we can get the
same result by passing the input image through our special hardware twice. But how to find B
and C ? To give an indication of how this is done, let us consider a very simple decomposition,
into a set of s.e.s each of which contains only two elements, the origin and one other point.
That is, we wish to find H_1, H_2, . . . , H_N such that A ⊕ H = (. . . [(A ⊕ H_1) ⊕ H_2] . . .), and
H_i = (0, p_i). The p_i are found [7.12] using the following approach: Search among pairs of
points in H for a pair p_1 and p_2 such that H is invariant under opening by the difference of the
two points, H = H ∘ {0, p_1 − p_2}. If two such points can be found, then H_i = (0, p_1 − p_2),
and we reduce H by H = H ⊖ H_i. The process is repeated recursively. If no such pair of
points can be found, then a search is made of quadruples of points p_1, p_2, p_3, p_4, to see if
H = H ∘ {p_1 − p_2, p_3 − p_4}, and so on.
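The chain rule for erosion is easy to confirm on a small case. The sketch below assumes one particular decomposition, chosen by us for illustration: a full 3 × 3 s.e. B and a 2 × 2 s.e. C (which also fits in a 3 × 3 window), whose dilation is a 4 × 4 square K. The image A is a made-up rectangle with one missing pixel.

```python
def dilate(A, K):
    """A (+) K = {a + k | a in A, k in K}."""
    return {(ax + kx, ay + ky) for (ax, ay) in A for (kx, ky) in K}


def erode(A, K):
    """Keep the points p for which p + k is foreground for every k in K."""
    candidates = {(ax - kx, ay - ky) for (ax, ay) in A for (kx, ky) in K}
    return {(px, py) for (px, py) in candidates
            if all((px + kx, py + ky) in A for (kx, ky) in K)}


B = {(x, y) for x in range(3) for y in range(3)}   # 3 x 3 s.e.
C = {(x, y) for x in range(2) for y in range(2)}   # 2 x 2 s.e., fits in a 3 x 3 window
K = dilate(B, C)                                   # the 4 x 4 s.e. we actually want

# Hypothetical image: a 10 x 8 rectangle with a one-pixel hole.
A = {(x, y) for x in range(10) for y in range(8)} - {(6, 5)}

one_pass = erode(A, K)                  # erosion by the large s.e.
two_pass = erode(erode(A, B), C)        # two erosions by the small s.e.s

print(len(K), one_pass == two_pass)
```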
Matheron [7.21] proved that any of a large class of morphological operations can be
computed as a union of erosions, or by using duality, as an intersection of dilations. Choosing
the set of “basis sets” so that a given operation by a given structuring element may be
calculated in this way has been the subject of considerable research [7.18, 7.20].
In the following, we illustrate the technique of Park and Chin [7.23], using their examples.
The algorithm is presented here. The reader is referred to the original papers for the theory.
The example given in this section is from [7.23].
In order to understand this method for decomposing structuring elements, we need knowledge
of the chain code from section 9.5. In a chain code, we represent a counterclockwise
walk around a region by a sequence of numbers, all between 0 and 7, designating the direction
of each step (Fig. 7.8). In this work, the concept of a chain code is slightly generalized to
include not only single pixel steps in one of the eight cardinal directions, but also to include
specific concave boundary segments.
[Fig. 7.8. The eight directions in which one might step in going from one pixel to another around a
boundary. Direction 0 points right, and the directions are numbered counterclockwise: 2 is up, 4 is left,
6 is down.]
The structuring element will be decomposed into a set of 3 × 3 elements. First, observe
that there are only 28 distinct concave boundary segments which fit into a 3 × 3 area. These
are listed and named in Fig. 7.9.
We now define a rather general class of structuring elements which are simply connected
(no holes) and whose boundaries have a form given by a regular expression involving the
concave segments and the chain code directions. For example, the regular expression L_0 2^2 4^2
denotes the curve created by starting with the L_0 segment, followed by two steps in the “2”
direction (the superscript denotes repetition) followed by two steps in the “4” direction. This
is illustrated in Fig. 7.10.
[Fig. 7.9. Possible concave boundaries in 3 × 3 images: U0, U2, U4, U6; J0 to J7; L0, L2, L4, L6;
V1, V3, V5, V7; R0 to R7. The number denotes the direction of the first chain code.]
The class of structuring elements which can be decomposed using this algorithm is the set
of all simply connected structuring elements whose boundary can be written in the form (the
order of the subscripts is important to the definition)
$$U_0^{S_{U_0}} J_0^{S_{J_0}} L_0^{S_{L_0}} R_0^{S_{R_0}} 0^{S_0} \; J_1^{S_{J_1}} V_1^{S_{V_1}} R_1^{S_{R_1}} 1^{S_1} \; \ldots \; J_7^{S_{J_7}} V_7^{S_{V_7}} R_7^{S_{R_7}} 7^{S_7} \tag{7.32}$$
where any of the superscripts may be zero. For example, V_1 1^2 2 R_4^2 4 6^3 R_7 is a member of this
set, but J_7 1 J_2 is not.
Definition
An image A is a factor of an image S if and only if it is possible to write S as the dilation of
A, that is, S = A ⊕ B. A factor A is a prime factor of S if and only if A cannot be factored
into anything other than itself and single-pixel images. In Table 7.1 are listed all the prime
factors which start with R_0. The prime factors are not required to be in the form of Eq. (7.32).
In Table 7.2, we present other prime factors, listing only their chain code representation.
Now we present the approach to decomposition of the structuring element using an example.
We will decompose the structuring element illustrated in Fig. 7.11, whose chain code is
S = L_0 0^3 1 2^4 R_4 4^3 R_6 6^2. The concave portions of this boundary are denoted by v_1 = L_0,
v_2 = R_4, v_3 = R_6. The convex portions by d_1 = 0, d_2 = 1, d_3 = 2, d_4 = 4, d_5 = 6.
First, we identify all the prime factors involving L_0, R_4, and R_6, which are compatible
with this image. These are illustrated in Fig. 7.12. To understand more clearly how these are
compatible with the image, consider the segment R_4 6^2 0 1. Observe that this matches the R_4
segment which is at the upper right of Fig. 7.11.
[Fig. 7.11. A structuring element to be decomposed into 3 × 3 elements.]
The next step in the process is to construct a matrix whose element (i, j) represents the number
of times v_i occurs in A_j. In this example, this matrix is
$$\begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 \end{bmatrix},$$
where we can see that R_4 occurs once in A_2 and A_3, but not at all in A_1, A_4, or A_5.
[Fig. 7.12. The prime factors which match segments of the boundary of Fig. 7.11. The chain codes
in parentheses indicate what the boundary would be if rotated to R_0 equivalents.]
Next, we construct a second matrix whose element (i, j) counts the number of times d_i occurs
in A_j. Here, this matrix is
$$\begin{bmatrix} 0 & 1 & 2 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 2 & 0 & 0 & 2 & 1 \\ 2 & 0 & 0 & 0 & 0 \\ 0 & 2 & 1 & 0 & 0 \end{bmatrix}.$$
Two vectors are defined: Y, representing the number of times v_i occurs in the original boundary,
and Z, the number of times d_i occurs in the original boundary. In this example, Y = [1 1 1]^T
and Z = [3 1 4 3 2]^T. A vector X is a solution for the decomposition if the first matrix times X
equals Y, and the second matrix times X is less than or equal to Z. (7.33)
In this example, X = [1 0 1 1 0]^T satisfies these two conditions. Note that the second matrix
times X is [3 0 4 2 1]^T, which is less than or equal to Z in the sense that each element
of the product is less than or equal to the corresponding element of Z. Thus, we can decompose the
boundary by dilating by A_1 once, by A_2 zero times, A_3 once, A_4 once, and A_5 zero times, or
S = A_1 ⊕ A_3 ⊕ A_4 ⊕ B.
All that remains is to determine a kernel B. This is accomplished by looking at the difference
between Z and the second matrix times X, which is [0 1 0 1 1]^T. Thus, we need a kernel whose
boundary is described by the sequence d_2, d_4, d_5, each repeated once. This sequence is 146, as in
Fig. 7.13.
[Fig. 7.13. The s.e. described by the sequence 146.]
Thus, we have a sequence of structuring elements, each 3 × 3, whose sequential application
produces the same result as the kernel in Fig. 7.11.
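The arithmetic of this example can be replayed in a few lines. In the sketch below the names `concave_counts` and `convex_counts` are hypothetical labels of ours for the two matrices, the rows of the second matrix are read so that its product with X is [3 0 4 2 1], and Y and Z are counted directly from the chain code of S (in which direction 6 occurs twice).

```python
# Chain code of S as (symbol, number of repetitions), read off the example above.
S = [("L0", 1), ("0", 3), ("1", 1), ("2", 4),
     ("R4", 1), ("4", 3), ("R6", 1), ("6", 2)]

concave = ["L0", "R4", "R6"]          # v1, v2, v3
convex = ["0", "1", "2", "4", "6"]    # d1 ... d5

counts = dict(S)
Y = [counts[v] for v in concave]      # occurrences of each concave segment in S
Z = [counts[d] for d in convex]       # occurrences of each direction in S

# Hypothetical names for the two matrices; columns are the prime factors A1 ... A5.
concave_counts = [[1, 0, 0, 0, 0],
                  [0, 1, 1, 0, 0],
                  [0, 0, 0, 1, 1]]
convex_counts = [[0, 1, 2, 1, 0],
                 [0, 1, 0, 0, 1],
                 [2, 0, 0, 2, 1],
                 [2, 0, 0, 0, 0],
                 [0, 2, 1, 0, 0]]


def matvec(M, x):
    """Plain matrix-vector product on lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


X = [1, 0, 1, 1, 0]                   # use A1, A3 and A4 once each

used = matvec(convex_counts, X)
feasible = (matvec(concave_counts, X) == Y
            and all(u <= z for u, z in zip(used, Z)))

remainder = [z - u for z, u in zip(Z, used)]
kernel_B = "".join(d * r for d, r in zip(convex, remainder))

print(Y, Z, used, feasible, remainder, kernel_B)
```

The leftover directions spell out the boundary of the remaining kernel B, which is how the sequence 146 arises.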
Here, we are going to be talking about sampling, much as Shannon did in his famous sampling
theorem, except that we will start with an image which has been sampled already, and
represented on a digital grid, and consider (sub)sampling to a smaller grid.
First, define a sampling grid. A sampling grid, as we use the term here, is just an image,
with a foreground point at every point at which we will want to sample our image. Any old
grid will do, except that it must satisfy
S⊕S=S (7.34)
and
S = S̃. (7.35)
Equation (7.34) represents an example of the property that this particular S is “closed under
dilation.” A convenient grid to use is every third point, as illustrated in Fig. 7.14. You might
want to verify that this grid satisfies the properties mentioned above.
[Fig. 7.14. An image which represents sampling at every third point.]
So now, sampling means to read and remember the image values at all the pixels where
the grid is black. Now suppose K is some s.e. We propose a rather simple reconstruction
algorithm. Here is the idea: Let’s sample our image using the sampling grid specified. Then,
we ask the question: Under what conditions will the dilation of the sampled image by the
s.e. K be the original image? Florencio and Schafer [7.7] have shown that the following
properties are required: First, the sampling grid itself, when dilated by the s.e., must be the
entire space:
S ⊕ K = , (7.36)
the translations of K by all the points in S form a partition of the space:
∀((x, y) ∈ S, x ≠ y), K_x ∩ K_y = Ø, (7.37)
and K must contain the origin.
For the sampling grid shown above, here’s an s.e. which satisfies these three conditions:
Actually, this is not very interesting – just the center pixel (the origin) and its eight neighbors.
But it does satisfy the three conditions.
If these properties hold for S and K, then the theorem is as follows. Let F be some image,
let P be the set of images F_i satisfying F_i = A ⊕ K for some A ⊆ S, and let Q be the set of
images F_j satisfying F_j = F_j ∘ K = F_j • K. Then:
Part A The samples of F ∈ P at the points in S are necessary and sufficient to perfectly
reconstruct F.
Part B The samples of F ∈ Q at the points in S are necessary to reconstruct F with bounded
error and sufficient to reconstruct F with an error at most r(K), where r(K) is the
radius of the smallest circle containing K.
There is a lot to talk about regarding this theorem: First, what does part A mean? Can you
figure it out from the set notation? You should be able to, but you may say to yourself “if this
means what I think it means, it’s trivial.” Well, what it means is this: If F can be generated
by taking some collection of the points in S, and dilating them with kernel K, then all you
have to remember is which points in S were needed. Yes, it is kind of trivial, isn’t it? But
there are some profound implications, which we will get to when we talk about the Nyquist
rate.
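Part A can be tried out on the every-third-point grid. The sketch below is illustrative only: the grid is truncated to a 10 × 10 window, K is the 3 × 3 s.e. (origin plus its eight neighbors), and the particular points chosen for A and for the counterexample G are ours.

```python
def dilate(A, K):
    """A (+) K = {a + k | a in A, k in K}."""
    return {(ax + kx, ay + ky) for (ax, ay) in A for (kx, ky) in K}


# Sampling grid: every third point, truncated to a 10 x 10 window.
GRID = {(x, y) for x in range(0, 10, 3) for y in range(0, 10, 3)}
K = {(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}

# The translations of K to distinct grid points do not overlap (Eq. 7.37).
disjoint = all(not (dilate({s}, K) & dilate({t}, K))
               for s in GRID for t in GRID if s != t)

# An image in P: generated by dilating some of the grid points with K.
A = {(3, 3), (6, 3), (6, 6)}
F = dilate(A, K)

samples = F & GRID                    # sampling: keep F only at the grid points
reconstructed = dilate(samples, K)    # reconstruction by dilation

# An image that is NOT of the form A (+) K is not recovered this way.
G = {(1, 1), (1, 2)}
G_reconstructed = dilate(G & GRID, K)

print(disjoint, samples == A, reconstructed == F, G_reconstructed == G)
```

The samples of F are exactly the generating points A, so remembering which grid points were "on" is enough to rebuild F, which is the (admittedly trivial) content of part A.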
[Fig. 7.15. (a) Original image. (b) Result of sampling the original image with S of Fig. 7.14.]
Now to understand part B: The Hausdorff distance is a measure of the difference between
two (sub)images. Given two sets of points S = {s_1, s_2, . . . , s_n} and T = {t_1, t_2, . . . , t_m}, the
Hausdorff distance is defined by
H(S, T) = max(h(S, T), h(T, S)),
where h(S, T) = max_{s∈S} min_{t∈T} ‖s − t‖. That is, the Hausdorff distance is the largest of the
minimum distances between elements of one set to the elements of the other set.
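A direct, brute-force computation of this distance is sketched below. It assumes the usual symmetric form, the larger of the two directed distances h(S, T) and h(T, S), with the Euclidian norm; the function names and the two small point sets are ours. `math.dist` requires Python 3.8 or later.

```python
import math


def directed(S, T):
    """h(S, T): for each s find its nearest t, then take the worst case over s."""
    return max(min(math.dist(s, t) for t in T) for s in S)


def hausdorff(S, T):
    """Symmetric Hausdorff distance, assumed to be max(h(S, T), h(T, S))."""
    return max(directed(S, T), directed(T, S))


S = {(0, 0), (1, 0)}     # hypothetical point sets
T = {(0, 0), (4, 0)}

print(directed(S, T), directed(T, S), hausdorff(S, T))
```

The two directed distances differ here because the point (4, 0) of T is far from everything in S, while every point of S has a neighbor in T within one pixel; the symmetric form reports the larger of the two.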
Part B says that if F is closed under opening by K, (that’s quite an expression, don’t you
think? “Closed under opening”), that is, if F does not change when it’s opened by K and
closed under closing by K, then the samples of F at the points of S are sufficient to represent
F almost exactly. By “almost exactly” we mean the set of sample points, when dilated by K,
is very nearly equal to the original F.
Now let’s do that example: Fig. 7.15 represents an original image, FA and what we get
when we sample it with S.
Since we know that we could only get data on rows 0, 3, 6, or 9 and similarly for columns,
we could toss out those rows and columns in our reduced resolution version, and get a
smaller image. So there, on the left of Fig. 7.16, we have the subsampled, smaller version
of the original image. Now let’s zoom it back, using dilation: Place K down at each of the
sampling points and we get the reconstruction of Fig. 7.16. Hmmm. Doesn’t look much like
the original, unsampled image, does it? (Whatever happened to “exact reconstruction”?) And
here’s a tough question. This theorem claims to say that one can reconstruct a signal by
sampling every third point. Doesn’t this contradict Shannon? Doesn’t Shannon say we have
to sample every other point anyway? (Actually, Shannon’s theorem was defined for analog
signals. Students: How DOES the Shannon theorem apply to this case? What IS the Nyquist
rate?) The sampling theorem does not say we can reconstruct any signal, it says something
about frequencies, doesn’t it? (Say yes.) In fact, the sampling theorem basically says that a
sampling grid with a particular frequency (set by the grid spacing) cannot be
used to store information which changes faster than half that frequency. In morphological
sampling, the theorem is complicated by the fact that we not only can choose a grid, we can
also choose a kernel. The restrictions are given by part B of the theorem. Unless the image is
one which has already been “prefiltered” by K, K cannot restore it. Haralick [7.12, p. 252] says
it this way: “The morphological sampling theorem cannot produce a reconstruction whose
positional accuracy is better than the radius of the circumscribing disk of the reconstruction
structuring element K.”
[Fig. 7.16. Subsampled image and reconstruction by dilation. Students: Note the original image
in Fig. 7.15(a) – does that image belong to P or Q?]
Generally, we choose structuring elements which are small and symmetric, with the origin
in the center. Schonfeld [7.34] provides some suggestions, but generally, these are the only
guidelines.
The algorithm presented in this section is presented in a bit more detail than usual in this
book. A graduate student in the machine vision discipline needs to learn how to write a
good journal paper, and this chapter is written to illustrate the components of such a paper.
It includes an introduction with references to relevant literature, a body that describes and
explains an algorithm, and a results section that illustrates performance of the algorithm in
comparison to other published results. The student should not only read this chapter for its
technical content, but should pay attention to style and organization.
7A.4.1 Introduction
In this section, we apply the concepts of the distance transform, connected components (see
Chapter 8), and binary morphology-like operations to solve the problem of closing gaps in
edges (see [7.16] for alternative strategies). In two dimensions, an edge is a curve, and in
three dimensions, a surface. As we have seen, any edge operator is certain to occasionally
fail, resulting in both extraneous edges and gaps in edges. If an edge has gaps in it, then
connected component labeling routines will fail, labeling interior and exterior points the
same. Thus, we must develop techniques which correct such edge detection errors. We will
relate these techniques to morphological operators. In [7.10], we developed a technique
called distance-transform-based closing, and have implemented this technique in both two
and three dimensions. This new technique is compared with binary morphology (mask erosion
techniques [7.1]) and iterative parallel thinning [7.44], a 3D parallel thinning technique [7.42]
and with 3D morphological techniques. In every case, the technique reported here produces
superior performance by better preserving the shape of sharp corners.
2 Several of the figures in this chapter are from a paper written by one of the authors [7.36].
Problem definition
We are given an edge image. Due to noise, blur, or other error, the edge/surface resulting
from edge detection in a real image will occasionally have “gaps” – areas in which the edge
detector response is not sufficiently strong to make a positive determination. (See Pratt [7.25,
Section 17] for an excellent discussion of edge detector errors.) Such gaps may be closed by
various types of morphological operators. The algorithm described here is denoted “DT-driven
closing.” The word “closing” refers to closing gaps in edges, not the morphological
closing operator.
We have already seen that the distance transform, D(x, y), gives some measure of the distance
from point x, y to the nearest edge point [7.28, 7.29, 7.31, 7.32]. We use a particular variant
of the distance transform called the “chamfer map,” denoted C(x, y), and refer to the values
of this map as the “DT distance.”
Similarly, we may extend the concept of the distance transform to three-dimensional
space. The transform D(x, y, z) is parameterized identically to g(x, y, z) and contains the
DT distance from point x, y, z to the closest edge.
The distance transform may be used to assist in the measurement of the properties of
regions: At every point, one knows the maximum size kernel which may be applied without
crossing an edge [7.35].
The k-neighbors of a point x, y are defined in either two or three dimensions as:
1 1 1        4 3 4
1 0 1        3 0 3
1 1 1        4 3 4
Fig. 7.17. Two distance definitions which may be used to compute distance transforms. The one
on the right produces the chamfer map, closely approximating the Euclidean distance.
[Fig. 7.18: array of distance transform values near a gap in an edge.]
To construct a complete DT, Eq. (7.28) is iterated until no more changes occur between
iterations. This is the approach followed by Bister et al. [7.3]. For the application discussed
here, however, we assume some a priori knowledge of maximum gap size. This allows us to
define a value K_max reflecting this knowledge. K_max will represent a distance so great that any
pixel this far away could not possibly be part of the edge. Normally, we use values K_max ≤ 4,
since this allows gaps of six pixels to be bridged. Fig. 7.18 illustrates a distance transform
near a gap in an edge.
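The iteration described above can be sketched in C as follows. This is only a minimal sketch under stated assumptions, not the implementation used for the figures: since Eq. (7.28) is not reproduced here, we assume the usual chamfer relaxation, in which each pixel takes the minimum of its own value and each neighbor's value plus the corresponding weight of the 3-4 mask of Fig. 7.17. The names chamfer_dt, W, H, and cap are illustrative; cap plays the role of K_max and is assumed to be expressed in the same units as the mask weights.

```c
#include <assert.h>

#define W 7   /* image width  (illustrative) */
#define H 5   /* image height (illustrative) */

/* Iterative chamfer distance transform with the 3-4 mask.
 * edge[y][x] != 0 marks an edge pixel. On return, d[y][x] holds the
 * chamfer distance to the nearest edge pixel, clipped at cap.
 * Returns the number of sweeps made over the image. */
int chamfer_dt(unsigned char edge[H][W], int d[H][W], int cap)
{
    static const int dx[8] = {-1, 0, 1, -1, 1, -1, 0, 1};
    static const int dy[8] = {-1, -1, -1, 0, 0, 1, 1, 1};
    static const int wt[8] = { 4, 3, 4, 3, 3, 4, 3, 4};
    int sweeps = 0, changed = 1;

    assert(cap > 0);

    /* edge pixels start at 0, everything else at the cap */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            d[y][x] = edge[y][x] ? 0 : cap;

    /* repeat until no value changes between iterations */
    while (changed) {
        changed = 0;
        sweeps++;
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                for (int k = 0; k < 8; k++) {
                    int nx = x + dx[k], ny = y + dy[k];
                    if (nx < 0 || nx >= W || ny < 0 || ny >= H)
                        continue;
                    if (d[ny][nx] + wt[k] < d[y][x]) {
                        d[y][x] = d[ny][nx] + wt[k];
                        changed = 1;
                    }
                }
    }
    return sweeps;
}
```

Because values never grow beyond cap, pixels farther than cap from every edge pixel are left at cap, which is the behavior wanted when only distances up to K_max matter.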
Generation of the three-dimensional distance transform proceeds in an analogous manner.
Observe that the computational complexity of distance transform generation is proportional
to the number of edge points in the image and to the size of K_max. Larger values of K_max
increase the size of the kernel ℵ_k substantially in three dimensions.
Pixels whose DT distance from an edge is less than K_max are labeled as “TBA” (to be
assigned), meaning that they could possibly belong to an edge. The remaining pixels are segmented
into disjoint regions. The straightforward way to accomplish this segmentation is to use
connected component analysis on the pixels which are not labeled TBA. However, one could use
more sophisticated cooperative processes such as those described in Chapter 8 and in [7.14].
Whichever method is used, the result is a label image like those we will discuss in Chapter 8, in
which L(x, y) = j is interpreted as “the pixel at x, y belongs to region j.” Pixels in L close
to an edge are labeled as potentially part of the edge:
L(x, y) = TBA if DT(x, y) < K_max.
Bister et al. [7.3] identify local maxima in the DT, and each maximum results in a potential
region. They note, however, “Since the distance transform is sensitive to noise producing
irregularities in the borders of the regions, one cavity (region) can contain many local maxima
very close to one another. In order to eliminate these spurious maxima, a filter merges the
maxima for which the sum of the heights is much larger than the geometrical distance
between them.” We conjecture that the performance of this filter is equivalent to our choice
of K_max. Instead of searching for a maximum, we use connected components (Chapter 8).
Both techniques identify an area within a region which robustly characterizes that region.
The last step in the algorithm relabels the TBA points, using the distance transform DT(x, y)
and the label image L(x, y) to create a new label image L′(x, y). Working inward from those
points where DT(x, y) ≥ K_max (the set of points K_max or farther from an edge point), each
point is relabeled by assigning it the label of the most appropriate neighbor. This “erosion” of
the questionable pixels is performed iteratively. At iteration i, only the TBA pixels in
the label image L(x, y) that correspond to pixels such that DT(x, y) = K_max − i are
relabeled.
TBA pixels are relabeled only when there is strong evidence (defined in the next section)
indicating to which region the pixel should be assigned. If the reassignment is uncertain, then
the pixel is left TBA and the decision is postponed until the next pass. When all of the TBA
pixels in L(x, y) corresponding to the k-valued pixels of the distance transform have been
reassigned to valid image regions or when no further change takes place in an iteration, k is
decremented and the next set of TBA points is considered for relabeling. The relabeling is
complete when all of the edge pixels (k = 0) have been assigned to valid image regions. When
single-pixel resolution between regions is required, more sophisticated region normalization
techniques may be used [7.8].
The key to closing gaps lies in the strategy used to select the most appropriate region to
which to reassign the currently considered pixel L(x, y). To accomplish this in two dimensions,
we examine the pixel’s eight surrounding neighbors. In three dimensions, we examine the
voxel’s 26 neighbors. We refer to this as finding the pixel’s or voxel’s “best neighbor.”
(Margin note: A voxel is a three-dimensional pixel.)
The surrounding pixels may come from one of three possible classes, depending on how
they are labeled at the current iteration:
(1) A label corresponding to an object or a region of an object in the label image L(x, y).
(2) A label corresponding to the image background.
(3) The TBA label.
We relabel the current pixel to one of the first two classes by simply counting how many
neighboring pixels come from that class and selecting the largest as the “best” class. This
strategy will only fail if all the neighbors are TBA, or constraints (see next paragraph) we
have placed on the relabeling make it undesirable to reassign the pixel to the apparent best
neighbor.
One constraint on the relabeling algorithm is the selection of the background region as the
best neighbor. Neighbors belonging to the background are considered in a more restricted
fashion than those belonging to regions. For the background value to be selected as the best
neighbor, the background pixel must be face-connected to the pixel under consideration.
This avoids the undesirable result of occasionally finding isolated background pixels inside
a closed boundary. (See [7.30] for more discussion on this connectivity paradox.)
When k = 0, the TBA pixels we seek to relabel are either true edges or noise pixels.
An edge in an image, by definition, represents an interface between an object region and
some other region in that image. The other region may be either another object region or the
background. We choose to relabel an edge pixel as part of an image region regardless of the
outcome of the enumeration. Only when the edge pixel is completely surrounded
by background do we choose background as the best neighbor.
copyarray(L, Ltemp);
}/* end while */
}/* end for k */
}/* end relabel */
/*==============================================================*/
/* in this function, p and n are data structures containing the frame, row, and col*/
/* coordinates of a voxel*/
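int Best26Neighbor(p)
{
/* Card[] is assumed to be cleared to zero before each call */
/* count the labels of the neighbors of p, skipping EDGE voxels */
while ((n = neighbor(p)) != NULL)
{
if (L(n) != EDGE)
{
if (L(n) != BACKGROUND) Card[L(n)]++;
/* background is counted only when face-connected to p */
else if (faceconnected(n, p)) Card[BACKGROUND]++;
}
}
/* n is NULL once the loop ends, so the edge test is made on p itself; */
/* NextMax is assumed to return the second most frequent label */
if ((maximum(Card) == BACKGROUND) && (L(p) == EDGE))
return NextMax(Card);
else return maximum(Card);
} /* end Best26Neighbor */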
In this algorithm, “Card” is simply an array which keeps a count (cardinality) of the number
of times a particular label is adjacent to the voxel of interest.
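A two-dimensional counterpart of this routine might look like the following sketch. It is not the code used to produce the figures, and several details are assumptions made for illustration: the names best8neighbor, TBA, BACKGROUND, and MAXLAB, the label values, the fixed image size, and the tie-breaking rule (a region wins a tie with the background, and the lowest-numbered region wins a tie between regions).

```c
#include <assert.h>

#define W 5
#define H 5
#define TBA (-1)        /* "to be assigned" marker (illustrative value) */
#define BACKGROUND 0    /* background label (illustrative value) */
#define MAXLAB 16       /* labels are assumed to lie in 0..MAXLAB-1 */

/* Returns the label with the most votes among the 8 neighbors of (x, y),
 * or TBA if no neighbor casts a vote. Background neighbors vote only when
 * they are face-connected (4-adjacent). If is_edge is nonzero, a region
 * label is preferred over background whenever any region got a vote. */
int best8neighbor(int lab[H][W], int x, int y, int is_edge)
{
    int card[MAXLAB] = {0};
    int best_region = TBA, best;

    assert(x >= 0 && x < W && y >= 0 && y < H);

    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int nx = x + dx, ny = y + dy;
            if ((dx == 0 && dy == 0) ||
                nx < 0 || nx >= W || ny < 0 || ny >= H)
                continue;
            int l = lab[ny][nx];
            if (l == TBA)
                continue;                      /* TBA neighbors do not vote */
            if (l == BACKGROUND && dx != 0 && dy != 0)
                continue;                      /* diagonal background ignored */
            assert(l >= 0 && l < MAXLAB);
            card[l]++;
        }

    /* most frequent region label, if any */
    for (int l = 1; l < MAXLAB; l++)
        if (card[l] > 0 &&
            (best_region == TBA || card[l] > card[best_region]))
            best_region = l;

    /* background wins only with strictly more votes than every region */
    best = best_region;
    if (card[BACKGROUND] > 0 &&
        (best_region == TBA || card[BACKGROUND] > card[best_region]))
        best = BACKGROUND;

    /* an edge pixel takes a region label whenever one is available */
    if (is_edge && best == BACKGROUND && best_region != TBA)
        return best_region;
    return best;
}
```

A return value of TBA corresponds to the case in which the decision is postponed until the next pass.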
7A.4.5 Examples
In this section, we will compare the technique described above with competing morphological
strategies.
Fig. 7.19. Two regions with large gaps in their boundaries. From [7.36]. Used with permission.
Fig. 7.20. Distance transform and result of connected component labeling of Fig. 7.19. From
[7.36]. Used with permission.
Fig. 7.21. Segmentation resulting from applying DT-driven closing. Note that the regions are
correctly partitioned, with a precision accurate to the individual pixel. From [7.36]. Used
with permission.
To evaluate the performance of DT-driven edge closing in two-dimensional images, we
compared it with two other types of binary thinning techniques – iterative erosion and iterative
thinning. As input to these algorithms we used the bit-mapped “cutlery” image shown in
Fig. 7.22 after it was dilated by a 5 × 5 kernel to close all gaps. The basic idea is this:
Use dilation to close the gaps in the boundary. Then use thinning to reduce the “fat”
boundary to a thin one. The results of these comparisons are described in the following
sections.
(Margin note: Thinning differs from skeletonization in that thinning preserves
connectivity: A boundary which is intact when wide will still be intact after thinning.)
Note that although thinning and skeletonization perform similar operations, thinning
preserves the connectivity of the edge, while skeletonization does not. A skeleton may be defined in
terms of morphological operations as follows.
The kth order homothetic of a structuring element (s.e.) A, denoted kA, is the result of applying
the operator of interest – in this case dilation – to A k times. That is, {0} is dilated by A, and the
result is dilated by A again, until k such dilations have been done.
A conventional morphological skeleton of an image X is formed by decomposing an image
into a number of skeleton subsets, Si where
Fig. 7.22. (a) Original cutlery image, containing gaps in edges. (b) Distance transform of original
cutlery image. From [7.36]. Used with permission.
Fig. 7.23. (c) Thickened edge image. (d) Label image based on dilated edges. From [7.36]. Used
with permission.
Fig. 7.24. (e) Relabeled cutlery image. From [7.36]. Used with permission.
where X\Y denotes all the elements of X which are not in Y. The skeleton is then
constructed by
Skeleton = ⋃_{i=0}^{N} S_i ⊕ iA. (7.40)
The subsets contain information about size, orientation, and connectivity. A minimal skeleton
has the property of being able to exactly reconstruct the original image, but it does not
necessarily preserve path or surface connectivity [7.20]. An alternative to the morphological
skeleton, not considered in this book, is morphological shape decomposition (MSD) [7.24].
MSD and the morphological skeleton are compared by Reinhardt and Higgins [7.27] who
conclude that MSD performs slightly better. While morphological skeletonization can be
used for many applications (such as image coding) and has been widely studied, it is not
directly comparable to 2D or 3D thinning, in which connectivity is preserved. Since, in this
application of edge/surface gap closing, we insist on preservation of connectivity, we consider
below only techniques which possess this property.
Fig. 7.25. 3 × 3 masks A1–A4 and B1–B4 (left) and result of erosion thinning on the cutlery
image (right). From [7.36]. Used with permission.
Fig. 7.26. Results of iterative thinning. From [7.36]. Used with permission.
to the next iteration until all possible pixels have been eroded. In each mask shown in Fig.
7.25 [7.1, 7.26], the positions marked in black denote edge pixels, white denotes background,
and hashed are pixels which do not become involved in the computation. A mask is said to
“match” an image at coordinates (k, l) if, when the center of the mask is registered to pixel
(k, l), then all image pixels covered by the mask are edge or background as denoted by the
corresponding mask pixel. If a mask matches an image at a particular edge point, then the
edge at that point in the image may be reset to background. The masks are applied in the
following order: A1, B1, A2, B2, A3, B3, A4, B4. The result of thinning Fig. 7.23 using this
algorithm is shown in Fig. 7.25. Note how this technique distorts the shape of sharp vertices
like the fork tines.
pixel, the number of 0–1 transitions in a sequence around the pixel, and on two sets of back-
ground neighbor configurations. The result of iteratively thinning the dilated cutlery image
(Fig. 7.23) is shown in Fig. 7.26. It is interesting to note the similarity with the technique of
Arcelli et al.
DT-driven closing was applied to a three-dimensional image of a box with broken interior
partitions; Figs. 7.27–7.29 show the results. The order of the images is the same as shown for
the 2D results. Note that even though the gaps are large in x, y, they are successfully bridged,
and the edges are still sharp in the final relabeled image.
The three-dimensional algorithm was also tested on a synthetic ellipsoid (Fig. 7.30) which
was deliberately undersampled to produce large gaps. The results of DT-driven closing with
a K_max value of 3 are shown in Fig. 7.30.
Fig. 7.27. (a) Original broken box image, frame 20. (b) Distance transform of broken box.
Fig. 7.28. (c) Thickened edge image. (d) Label image based on dilated edges.
Fig. 7.30. Ellipsoid with gaps (a) and relabeled ellipsoid (b).
Other research in three-dimensional thinning and skeletonization [7.11, 7.19, 7.41, 7.42]
is primarily based on computing the Euler [7.13] connectivity number, N, for each three-
dimensional solid where N = V − E + F and V, E, and F denote the numbers of vertices,
edges and faces of the object, respectively. For a solid which does not contain tunnels or
cavities, N = 2. Each tunnel or hole penetrating an object decreases N by two; each cavity
in the solid increases N by two. The connectivity number must be preserved during thinning
to maintain the topology of the original object [7.26, 7.33].
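The arithmetic of the connectivity number is easy to check on small polyhedra. The sketch below is only an illustration; the function names euler_number and expected_euler_number are ours. It computes N = V − E + F and the value predicted by the rule just stated, N = 2 − 2 (tunnels) + 2 (cavities).

```c
#include <assert.h>

/* Euler connectivity number N = V - E + F of a polyhedral object. */
int euler_number(int vertices, int edges, int faces)
{
    return vertices - edges + faces;
}

/* Value of N predicted from the topology: 2 for a simple solid,
 * minus two per tunnel, plus two per cavity. */
int expected_euler_number(int tunnels, int cavities)
{
    assert(tunnels >= 0 && cavities >= 0);
    return 2 - 2 * tunnels + 2 * cavities;
}
```

For a cube, V = 8, E = 12, and F = 6, so N = 2. A rectangular "picture frame" with one tunnel has V = 16, E = 32, and F = 16, giving N = 0. A cube containing a cubic cavity has V = 16, E = 24, and F = 12, giving N = 4.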
Probably the most significant aspect of the performance of DT-driven closing is its ability
to preserve the geometry of surfaces, particularly near vertices. Its two-dimensional per-
formance is particularly well demonstrated in Fig. 7.24. To demonstrate how it processes
vertex geometry in three dimensions, we synthesized a cone and extracted the surface with a
three-dimensional edge detector.
Fig. 7.32. (a) Dilated ellipsoid and (b) Tsao and Fu thinned ellipsoid. From [7.36]. Used with
permission.
Fig. 7.33. A cross section through the apex of a cone which has been processed by DT-based
closing (a) and by Tsao–Fu thinning (b). From [7.36]. Used with permission.
Gaps in the surface were then closed with DT-driven closing. The same cone surface
was dilated using conventional dilation to close gaps, and then thinned using the Tsao–Fu
algorithm. The results are shown in Fig. 7.33. Since the process of dilation replaces the edge
information, the Tsao–Fu thinning algorithm has no memory of the original surface when
it thins the dilated image. Of course, neither technique processes the geometry perfectly.
However, since DT-driven closing retains, via the DT, a “memory” of the original, undilated,
geometry, this technique is better able to restore that geometry after gaps are closed. See
[7.38] for more details and computation speed.
We included this rather long section on DT-driven edge closing for several reasons: First, it
is important for the student to see that the implementation of any real technique requires that
you know a bit about more than just one technique. You had to use knowledge of connected
components, edge detection, morphological dilation, etc. Second, since this book is directed
primarily at graduate students, we wanted you to see how to write a journal paper. This chapter
is similar in many ways to (and includes figures from) [7.38]. Note the format: Introduce the
problem, describe the algorithm, cite the literature, and most important, compare the new
technique with existing techniques. Follow these simple rules, and you too can publish!
7A.5 Vocabulary
You should know the meanings of the following terms.
Chamfer map
Prime factor
Sampling
Assignment 7.A1
In section 7A.1.1 an example decomposition of a structuring
element was given. Prove that this decomposition gives the
Assignment 7.A2
What is the major difference between the output of a thinning
algorithm and the maxima of the distance transform? Choose
the best answer from the following.
Assignment 7.A3
Use the distance transform to compute the medial axis of an
image. The name of the image will be given in class.
Bibliography
[7.1] C. Arcelli, L. Cordella, and S. Levialdi, “Parallel Thinning of Binary Pictures,” Elec-
tronics Letters, 11, pp. 148–149, 1975.
[7.2] H.G. Barrow, “Parametric Correspondence and Chamfer Matching,” Proceedings
of the 5th International Joint Conference on Artificial Intelligence, August, 1977,
pp. 659–663.
[7.3] M. Bister, Y. Taeymans, and J. Cornelis, “Automated Segmentation of Cardiac MR
Images,” Computers in Cardiology, Washington, DC, IEEE Computer Society Press,
pp. 215–218, 1989.
[7.4] G. Borgefors, “Distance Transformations in Arbitrary Dimensions,” Computer Graph-
ics, Vision, and Image Processing, 27, pp. 321–345, 1984.
[7.5] G. Borgefors, “Distance Transformations in Digital Images,” Computer Graphics,
Vision, and Image Processing, 34, pp. 344–371, 1986.
[7.6] H. Breu, J. Gil, D. Kirkpatrick, and M. Werman, “Linear Time Euclidian Distance
Transform Algorithms,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 17(5), 1995.
[7.7] D. Florencio and R. Schafer, “Homotopy and Critical Morphological Sampling,”
Proceedings of the SPIE, 2308, June, 1994.
[7.8] J.D. Foley, A. vanDam, S.K. Feiner, and J.F. Hughes, Computer Graphics: Principles
and Practice, Reading, MA, Addison-Wesley, pp. 91–99, 1990.
[7.9] W. Gong and G. Bertrand, “A Simple Parallel 3D Thinning Algorithm,” 10th Inter-
national Conference on Pattern Recognition, June, 1990.
[7.10] B. R. Groshong and W. E. Snyder, “Using Chamfer Maps to Segment Images,”
Technical Report CCSP-WP-86/11, Center for Communications and Signal Processing,
North Carolina State University, Raleigh, NC, USA, December, 1986.
[7.11] K. Hafford and K. Preston Jr., “Three-dimensional Skeletonization of Elongated
Solids,” Computer Vision, Graphics, and Image Processing, 27, pp. 78–91, 1984.
[7.12] R. Haralick and L. Shapiro, Computer and Robot Vision, Volume 1, New York,
Addison-Wesley, 1992.
[7.13] D. Hilbert and S. Cohn-Vossen, Geometry and the Imagination, New York, Chelsea,
1952.
[7.14] H. Hiriyannaiah, G. Bilbro, and W. Snyder, “Restoration of Locally Homogeneous
Images using Mean Field Annealing,” Journal of the Optical Society of America, A,
6(12), pp. 1901–1912, 1989.
[7.15] C. Huang and O. Mitchell, “A Euclidian Distance Transform using Grayscale
Morphology Decomposition,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(4), 1994.
[7.16] X. Jiang, “An Adaptive Contour Closure Algorithm and its Experimental Evaluation,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11), 2000.
[7.17] R. Jones and I. Svalbe, “Morphological Filtering as Template Matching,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 16(4), 1994.
[7.18] R. Jones and I. Svalbe, “Algorithms for the Decomposition of Gray-scale Morpholog-
ical Operations,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(6), 1994.
[7.19] S. Lobregt, P. Verbeek, and F. Groen, “Three-Dimensional Skeletonization: Principle
and Algorithm,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
2(1), pp. 75–77, 1980.
[7.20] P. Maragos and R. Schafer, “Morphological Skeleton Representation and Coding of
Binary Images,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 34,
pp. 1228–1244, 1986.
[7.21] G. Matheron, Random Sets and Integral Geometry, New York, Wiley, 1975.
[7.22] H. Park and R. Chin, “Optimal Decomposition of Convex Morphological Structuring
Elements for 4-connected Parallel Array Processors,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(3), 1994.
[7.23] H. Park and R. Chin, “Decomposition of Arbitrarily Shaped Morphological Structuring
Elements,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1),
1995.
[7.24] I. Pitas and A. Venetsanopoulos, “Morphological Shape Decomposition,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 12(1), 1990.
[7.25] W. K. Pratt, Digital Image Processing, New York, Wiley, 1978.
[7.26] K. Preston and M. Duff, Modern Cellular Automata Theory and Applications, New
York, Plenum Press, 1984.
[7.27] J. Reinhardt and W. Higgins, “Comparison Between the Morphological Skeleton and
Morphological Shape Decomposition,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(9), 1996.
In many machine vision applications, the set of possible objects in the scene is quite
limited. For example, if the camera is viewing a conveyer, there may be only one
type of part which appears, and the vision task could be to determine the position
and orientation of the part. In other applications, the part being viewed may be one
of a small set of possible parts, and the objective is to both locate and identify each
part. Finally, the camera may be used to inspect parts for quality control.
Fig. 8.1. An image with two foreground regions.
In this section, we will assume that the parts are fairly simple and can be
characterized by their two-dimensional projections, as provided by a single camera view.
Furthermore, we will assume that the shape is adequate to characterize the objects.
That is, color or variation in brightness is not required. We will first consider dividing
the picture into connected regions.
A segmentation of a picture is a partitioning into connected regions, where each
region is homogeneous in some sense and is identified by a unique label. For example,
in Fig. 8.2 (a “label image”), region 1 is identified as the background. Although
region 4 is really background as well, it is labeled as a separate region since it is not
connected to region 1.
11112221111111111111111111111
11122221111113333333333333111
11112221111113333333333333111
11112211111111111111333311111
11112211111111111111333331111
11112211111111111111333333111
11222222111111111113333333111
11224422211111111133333331111
11222222111111111111333111111
11112211111111111111111111111
11111111111111111111111111111
Fig. 8.2. A segmentation and labeling of the image in Fig. 8.1.
The term “homogeneous” deserves some discussion. It could mean all the pixels
are the same brightness, but that criterion is too strong for most practical applications.
It could mean that all pixels are close to some representative (mean) brightness.
182 Segmentation
Stated more formally [8.80], a region is homogeneous if the brightness values are
consistent with having been generated by a particular probability distribution (see
also the analysis by Ng and Lee [8.44]). In the case of range imagery [8.35], where
we (might) have an equation which describes the surface, we could say a region
is homogeneous if it can be described by the combination of that equation and
some probabilistic deformation. For example, if all the points in a region of a range
image lay in the same plane except for deviations whose distance from the plane
may be described by a particular Gaussian distribution, we might say this region is
homogeneous.
There are several ways to perform segmentation. Threshold-based techniques are
guaranteed to form closed regions, for they simply assign all pixels above (or below,
depending on the problem) a specified threshold to be in the same region. Edge-
based techniques assume that regions are separated by neighborhoods where the
edge strength is high. Region-based methods start with elemental (e.g., homoge-
neous) regions and split or merge them. Then, there are a variety of hybrid methods,
including watershed [8.5] techniques. A watershed method generally operates on the
gradient of the image; segmentation consists of flooding the image with (by anal-
ogy) water, in which region boundaries (areas of high edge strength) are erected to
prevent water from different seed points from mixing. Traditional “region growing”
methods are really variations on watershed methods [8.1].
Before we can go much further in our discussion of issues and techniques in
segmentation, you need to understand some of the interesting and unexpected things
that happen to geometry and topology when you deal with digital images. Remember
the connectivity paradox from section 4.5? We discovered that an object could have
a closed boundary but still have an inside and outside which were connected. As
another example, consider the problem of finding the perimeter of a region, or even
the length of a line from a sampled version of that line. That problem has immediate
applications in segmentation, and yet it is not obvious how to estimate it [8.31].
Keep in mind that these sorts of things can happen, because we are going to talk a
lot about connectivity in this chapter.
In applications where specific gray values of regions are not important, we can
segment a picture into “objects” and “background” by simply choosing a threshold
in brightness. We define any region whose brightness is above the threshold as object
and all below the threshold as background.
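As a minimal sketch of this rule (the function name threshold_image and the flat row-by-row array layout are our assumptions), a single global threshold can be applied as follows.

```c
#include <assert.h>
#include <stddef.h>

/* Binary segmentation by a single global threshold:
 * out[i] = 1 (object) if in[i] > t, otherwise 0 (background).
 * Returns the number of object pixels. */
size_t threshold_image(const unsigned char *in, unsigned char *out,
                       size_t n, unsigned char t)
{
    size_t count = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i] = (unsigned char)(in[i] > t);
        count += out[i];
    }
    return count;
}
```

Whether "object" means brighter or darker than the threshold depends on the application; reversing the comparison handles dark objects on a bright background.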
There are several different ways to choose thresholds, ranging from the trivially
simple to the very sophisticated. As the sophistication of the technique increases,
performance improves but at the cost of increased computational complexity.
183 8.2 Segmentation by thresholding
Fig. 8.3. Two detectors forming an image of a rectilinear grid. The light source is uniform;
however, the images exhibit both radiometric (brightness) distortion (the left image
is brighter in the right center and the right image is brighter in the middle) and
geometric distortion (straight lines are distorted in a characteristic pincushion
form).
Probably the most important factor to note is the local nature of thresholding.
That is, a single threshold is almost never appropriate for an entire scene. It is nearly
always the local contrast between object and background that contains the relevant
information. Since camera sensitivity drops off from the center of the picture to the
edges due to parabolic distortion and/or vignetting as shown in Fig. 8.3, it is often
useless to attempt to establish a global threshold. A dramatic example of this effect
can be seen in an image of a rectilinear grid, in which the “uniform” white varies
significantly over the surface.
Effects such as parabolic distortion and vignetting are quite predictable and easy
to correct. In fact, off-the-shelf hardware is available for just such applications.
It is more difficult, however, to predict and correct effects of nonuniform ambi-
ent illumination, such as sunlight through a window, which changes radically over
the day.
Since a single threshold cannot provide sufficient performance, we must choose
local thresholds. The most common approach is called block thresholding, in which
the picture is partitioned into rectangular blocks and different thresholds are used
on each block. Typical block sizes are 32 × 32 or 64 × 64 for 512 × 512 images.
The block is first analyzed and a threshold is chosen; then that block of the image
is thresholded using the results of the analysis. In more sophisticated (but slower)
versions of block thresholding, the block is analyzed and the threshold computed.
Then that threshold is applied only to the single pixel at the center of the block. The
block is then moved over one pixel, and the process is repeated.
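The simpler form of block thresholding can be sketched as follows. The text does not fix how each block is analyzed, so this sketch assumes, purely for illustration, that the threshold of a block is the midpoint between its darkest and brightest pixels; the name block_threshold and the row-by-row storage are likewise our assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Block thresholding of a w x h 8-bit image stored row by row.
 * Each bs x bs block (clipped at the image border) gets its own
 * threshold, here assumed to be the midpoint of the block's minimum
 * and maximum gray values. out receives 1 for object, 0 for background. */
void block_threshold(const unsigned char *in, unsigned char *out,
                     size_t w, size_t h, size_t bs)
{
    assert(bs > 0);

    for (size_t by = 0; by < h; by += bs)
        for (size_t bx = 0; bx < w; bx += bs) {
            size_t y1 = (by + bs < h) ? by + bs : h;
            size_t x1 = (bx + bs < w) ? bx + bs : w;
            unsigned char lo = 255, hi = 0, t;

            /* analyze the block */
            for (size_t y = by; y < y1; y++)
                for (size_t x = bx; x < x1; x++) {
                    unsigned char v = in[y * w + x];
                    if (v < lo) lo = v;
                    if (v > hi) hi = v;
                }
            t = (unsigned char)((lo + hi) / 2);

            /* threshold the block with its own threshold */
            for (size_t y = by; y < y1; y++)
                for (size_t x = bx; x < x1; x++)
                    out[y * w + x] = (unsigned char)(in[y * w + x] > t);
        }
}
```

For a 4 × 2 image whose rows are both 10, 50, 100, 200 and a block size of 2, the left block is thresholded at 30 and the right block at 150, so each block separates its own darker and brighter pixels. A single global threshold of 75 would instead label the entire right block as object.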
Fig. 8.4. A histogram of a bimodal image, with many bright pixels (intensity around 169) and
many dark pixels (intensity around 11).
In [8.7], a technique was developed which finds the global minimum of a function
of several variables, even for functions which have more than one minimum. That
technique, known as tree annealing (TA), may be applied to the problem of histogram
analysis and thresholding in the following way [8.62].
185 8.3 Connected component analysis
If h(f) is properly normalized, one may adjust the usual normalization of the two
component Gaussians so that each sums to unity on the 256 discrete gray levels
(rather than integrating to unity on the continuous interval), and thereby admit the
additional constraint that A1 + A2 = 1. Use of this constraint reduces the number
of parameters to be estimated from six to five; however, we have determined ex-
perimentally that TA actually solves the problem more accurately by using the six
variables, without readjusting the normalization on each iteration. Conventional de-
scent often terminates at a suboptimal local minimum for this two-Gaussian problem
and is even less reliable for a three-Gaussian problem. TA deals easily with either.
The result of fitting a sum of three Gaussians to the histogram of an image is shown
in Fig. 8.5.
Whatever algorithm we use, the philosophy of histogram-based thresholding is
the same: Find peaks in the histogram and choose the threshold to be between them.
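The philosophy of "find the peaks, then threshold between them" can be illustrated with a much cruder method than fitting Gaussians. The sketch below is an assumption-laden illustration, not the tree annealing approach: it takes the tallest histogram bin as the first peak, the tallest bin at least min_sep gray levels away as the second peak, and returns the gray level with the fewest pixels between the two. The names histogram256, valley_threshold, and min_sep are ours.

```c
#include <assert.h>
#include <stddef.h>

/* Fills hist[0..255] with the gray-level counts of an 8-bit image. */
void histogram256(const unsigned char *img, size_t n, unsigned long hist[256])
{
    for (int g = 0; g < 256; g++)
        hist[g] = 0;
    for (size_t i = 0; i < n; i++)
        hist[img[i]]++;
}

/* Returns the gray level of the deepest valley between the two tallest
 * peaks that are at least min_sep gray levels apart. If no second peak
 * exists, the position of the single peak is returned. */
int valley_threshold(const unsigned long hist[256], int min_sep)
{
    int p1 = 0, p2 = -1, lo, hi, t;

    assert(min_sep > 0);

    /* tallest bin */
    for (int g = 1; g < 256; g++)
        if (hist[g] > hist[p1])
            p1 = g;

    /* tallest bin far enough away from the first peak */
    for (int g = 0; g < 256; g++) {
        int dist = (g > p1) ? g - p1 : p1 - g;
        if (dist >= min_sep && (p2 < 0 || hist[g] > hist[p2]))
            p2 = g;
    }
    if (p2 < 0)
        return p1;

    /* lowest bin strictly between the two peaks (first one on ties) */
    lo = (p1 < p2) ? p1 : p2;
    hi = (p1 < p2) ? p2 : p1;
    t = lo;
    for (int g = lo + 1; g < hi; g++)
        if (hist[g] < hist[t])
            t = g;
    return t;
}
```

Real histograms are noisy, so in practice the histogram would normally be smoothed before peaks and valleys are sought; that step is omitted here.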
In many industrial environments, the lighting may be extremely well controlled.
With such control, the best thresholds will be constant over time and may be chosen
interactively during system set up. However, in general, different thresholds are used
in different areas of the picture.
Let us assume, for now, that a good threshold has been chosen and that our picture
has been partitioned into regions of pure black and pure white, as shown in Fig. 8.1.
(1) Find an unlabeled black pixel; that is, L(x, y) = 0. Choose a new label number
for this region, call it N. If all pixels have been labeled, stop.
(2) L(x, y) ← N .
(3) If f (x − 1, y) is black and L(x − 1, y) = 0, push the coordinate pair (x − 1, y)
onto the stack.
If f (x + 1, y) is black and L(x + 1, y) = 0, push (x + 1, y) onto the stack.
If f (x, y − 1) is black and L(x, y − 1) = 0, push (x, y − 1) onto the stack.
If f (x, y + 1) is black and L(x, y + 1) = 0, push (x, y + 1) onto the stack.
(4) Choose a new (x, y) by popping the stack.
(5) If the stack is empty, go to 1, else go to 2.
This labeling operation results in a set of connected regions, each assigned a unique
label number. To find the region to which any given pixel belongs, the computer has
only to interrogate the corresponding location in the L memory and read the region
number.
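The five steps above can be sketched in C as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: black pixels are assumed to be stored as nonzero values of f, arrays are indexed from zero as [row][column], the image size is fixed at compile time, and the name grow and the array-based stack are ours. As in the steps above, a pixel is labeled when it becomes the current pixel, so a coordinate pair may be pushed more than once; the stack is sized generously to allow for this.

```c
#include <assert.h>

#define W 4   /* image width  (matches the 4 x 7 example) */
#define H 7   /* image height */

/* Labels the 4-connected black regions of f (nonzero = black) in L.
 * L must contain all zeros on entry. Returns the number of regions. */
int grow(unsigned char f[H][W], int L[H][W])
{
    static int sx[4 * W * H], sy[4 * W * H];   /* stack of coordinate pairs */
    static const int dx[4] = {-1, 1, 0, 0};
    static const int dy[4] = {0, 0, -1, 1};
    int n = 0;                                   /* last label used */

    for (int y0 = 0; y0 < H; y0++)
        for (int x0 = 0; x0 < W; x0++) {
            /* step 1: find an unlabeled black pixel */
            if (!f[y0][x0] || L[y0][x0] != 0)
                continue;
            int top = 0, x = x0, y = y0;
            n++;
            for (;;) {
                L[y][x] = n;                     /* step 2 */
                for (int k = 0; k < 4; k++) {    /* step 3 */
                    int nx = x + dx[k], ny = y + dy[k];
                    if (nx < 0 || nx >= W || ny < 0 || ny >= H)
                        continue;
                    if (f[ny][nx] && L[ny][nx] == 0) {
                        assert(top < 4 * W * H);
                        sx[top] = nx;
                        sy[top] = ny;
                        top++;
                    }
                }
                if (top == 0)                    /* step 5: stack empty */
                    break;
                top--;                           /* step 4: pop */
                x = sx[top];
                y = sy[top];
            }
        }
    return n;
}
```

Because only the four face-adjacent neighbors are examined, two black pixels that touch only diagonally receive different labels.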
EXAMPLE
Applying region growing
Fig. 8.9 shows a 4 × 7 array of pixels. Assume the initial value of x, y is 2, 4.
Apply algorithm “grow” and show the contents of the stack and L each time step (3)
is executed. Let the initial value of N be 1.
Solution
Pass 1. Immediately after execution of step (3). The algorithm has examined
pixel 2, 4, examined its 4-neighbors, and detected only one 4-neighbor in the
foreground, the pixel at 3, 4. Thus, the coordinates of that pixel are placed on
the stack.
Stack:
3, 4 ← top
L =
7  0 0 0 0
6  0 0 0 0
5  0 0 0 0
4  0 1 0 0
3  0 0 0 0
2  0 0 0 0
1  0 0 0 0
   1 2 3 4
Pass 2. The pixel at 3, 4 was removed from the top of the stack and marked with
a 1 in the L image, its neighbors were examined and two 4-neighbors were found,
the pixels at 3, 3 and at 3, 5; both were put on the stack.
Stack:
3, 5 ← top
3, 3
L =
7  0 0 0 0
6  0 0 0 0
5  0 0 0 0
4  0 1 1 0
3  0 0 0 0
2  0 0 0 0
1  0 0 0 0
   1 2 3 4
Pass 3. The top of the stack contained 3, 5. That pixel was removed from the
stack and marked with a 1 in the L image. All its neighbors were examined and one
4-neighbor was found, the pixel at 3, 6. Thus the coordinates of this pixel are put
on the stack.
Stack:
3, 6 ← top
3, 3
L =
7  0 0 0 0
6  0 0 0 0
5  0 0 1 0
4  0 1 1 0
3  0 0 0 0
2  0 0 0 0
1  0 0 0 0
   1 2 3 4
Pass 4. The stack was “popped” again, this time removing 3, 6, which was marked with
a 1 in the L image. That pixel was examined and determined to have no 4-neighbors
which had not already been labeled.
Pass 5. The stack was popped again, removing 3, 3, which was marked with a 1 in the
L image. That pixel was examined and determined to have no 4-neighbors which
had not already been labeled.
Stack:
( ) ← top
L =
7  0 0 0 0
6  0 0 1 0
5  0 0 1 0
4  0 1 1 0
3  0 0 1 0
2  0 0 0 0
1  0 0 0 0
   1 2 3 4
Pass 6. The stack was popped again, producing a return value of “stack empty” and
the algorithm is complete since all black pixels had been labeled.
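The passes traced in this example can be reproduced with a short program. The following is a minimal sketch in Python, not the text's algorithm “grow” verbatim: the function name, the 0-indexed [row][column] layout, and the test image (which contains only the region traced above) are assumptions made for illustration. As in the example, a pixel is marked in L when it is popped from the stack.

```python
def grow(image, seed, label):
    """Label the 4-connected foreground region containing `seed`.

    `image` is a list of rows of 0/1 values; `seed` is a (row, col) pair
    that is assumed to lie on a foreground pixel.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]  # the L memory, 0 = unlabeled
    stack = [seed]
    while stack:                                # "stack empty" ends the loop
        r, c = stack.pop()
        if labels[r][c]:
            continue                            # pushed twice, already marked
        labels[r][c] = label
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and image[nr][nc] and not labels[nr][nc]:
                stack.append((nr, nc))
    return labels


# Hypothetical image holding only the region traced in the example.
# Rows are listed top to bottom, so the first row is y = 7.
image = [
    [0, 0, 0, 0],  # y = 7
    [0, 0, 1, 0],  # y = 6
    [0, 0, 1, 0],  # y = 5
    [0, 1, 1, 0],  # y = 4
    [0, 0, 1, 0],  # y = 3
    [0, 0, 0, 0],  # y = 2
    [0, 0, 0, 0],  # y = 1
]
labels = grow(image, seed=(3, 1), label=1)      # x = 2, y = 4
```

Because every foreground pixel of this image is connected to the seed, the label image returned for N = 1 is identical to the input image.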
This region growing algorithm is just one of several strategies for performing
connected component analysis. Other strategies exist which are faster than the one
described, including some that run at raster-scan rates [8.6]. We will now consider
one such technique.
• $f(x, y)$ is the gray-scale value of the (x, y) pixel in the image memory.
• $(x, y)_i$ is the ith adjacent neighbor of the (x, y) pixel.
• $f_i(x, y)$ is the gray-scale value of the ith adjacent neighbor of the (x, y) pixel.
• $L(x, y)$ is the region label corresponding to the (x, y) pixel in the image memory.
• $L_i(x, y)$ is the region label number corresponding to the ith adjacent neighbor of
  the (x, y) pixel.
• $K(i)$ is the contents of the ith element in the equivalence memory. This memory
  is a content-addressable memory.
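[Fig. 8.11: flowchart of the labeling algorithm. Recoverable steps: initialize the
memories and set p = 0; for each pixel (x, y) in scan order, examine each neighbor
$(x, y)_i$, i = 1, ..., N; if the neighbor is labeled and $|f(x, y) - f_i(x, y)| < T$,
then either set $L(x, y) = K(L_i(x, y))$ when (x, y) is not yet labeled, or compute
$z = \max(K(L(x, y)), K(L_i(x, y)))$ and $w = \min(K(L(x, y)), K(L_i(x, y)))$ and, when
z and w differ, perform the associative update $K^*(z) = K(w)$; if (x, y) is still
unlabeled after all N neighbors have been examined, set p = p + 1, L(x, y) = p, and
K(p) = p.]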
The algorithm is illustrated in Fig. 8.13 for an arbitrary region adapted from
Milgram et al. [8.41]. The reader should note that the relation | f (x, y) − f i (x, y)| <
T tests if two pixels are similar. There are other similarity measures which could
be used, including local first- and/or second-order statistics. If two pixels meet this
criterion and they are adjacent, then they are in the same region. By definition, if
two pixels are in the same region, then R(a, b) holds. That is,
$$\{\mathrm{ADJACENT}(x, y, x', y') \wedge |f(x, y) - f_i(x, y)| < T\} \leftrightarrow R(x, y, x', y').$$
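A software version of this single-pass labeling may help make the roles of the L and K memories concrete. The sketch below is written under several assumptions that are not part of the text: it uses 4-connectivity and therefore examines only the left and upper neighbors (the ones already labeled in a raster scan), it stores K as a Python list with an unused entry at index 0, and it imitates the associative update by rewriting every cell of K that holds z. The function name and the test image are hypothetical.

```python
def label_regions(f, T):
    """Raster-scan region labeling with an equivalence table K."""
    rows, cols = len(f), len(f[0])
    L = [[0] * cols for _ in range(rows)]    # 0 means "not yet labeled"
    K = [0]                                  # K[p] = lowest label equivalent to p
    for y in range(rows):
        for x in range(cols):
            # Neighbors already visited by the scan: left, then upper.
            for ny, nx in ((y, x - 1), (y - 1, x)):
                if nx < 0 or ny < 0:
                    continue
                if abs(f[y][x] - f[ny][nx]) >= T:
                    continue                 # not similar, so not the same region
                if L[y][x] == 0:
                    L[y][x] = K[L[ny][nx]]
                else:
                    z = max(K[L[y][x]], K[L[ny][nx]])
                    w = min(K[L[y][x]], K[L[ny][nx]])
                    if z != w:
                        # Associative update: every cell holding z now holds w.
                        K = [w if k == z else k for k in K]
            if L[y][x] == 0:                 # no similar labeled neighbor found
                K.append(len(K))             # p = p + 1 and K(p) = p
                L[y][x] = len(K) - 1
    resolved = [[K[p] for p in row] for row in L]
    return resolved, K


# A U-shaped bright region: its two arms receive labels 1 and 3 and are
# found to be equivalent only when the scan reaches the bottom row.
f = [
    [9, 0, 9],
    [9, 0, 9],
    [9, 9, 9],
]
resolved, K = label_regions(f, T=1)
```

Here `resolved` is the L memory interpreted in terms of the K memory, which is the second duty of the interface/processor. Note that the background column is labeled as a region of its own, since the similarity test makes no distinction between foreground and background.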
[Figure: block diagram of the architecture, showing the image memory (f), the
interface/processor, the region label memory (L), the equivalence memory (K), and
the host computer.]
[Fig. 8.13: an example region on a 17-column grid, labeled during the scan with
region numbers 1 through 4.]
Fig. 8.13 illustrates a difficult labeling problem, and Table 8.1 illustrates the
labeling process for this example.
Table 8.1. Contents of the equivalence memory K at selected points (x, y) of the scan.

  x    y  | K(1) K(2) K(3) K(4)
  1    1  |  1
 16    1  |  1    2
  5    4  |  1    2    3
  5   10  |  1    2    1
  9   10  |  1    2    1    4
 11   10  |  1    2    1    1
 16   11  |  1    1    1    1

The f memory and the L memory are both conventional random access memories.
However, the equivalence memory has two modes of operation. It may be used
as a conventional RAM where the address input corresponds to the region table
and data output is the equivalent table. In associative memory mode it is used to
update that table. In this mode, two activities occur in synchronism with a two-phase
clock.
• Phase 1: All memory cells whose contents match the contents of the data bus set
  their corresponding enable flip-flops (see Fig. 8.14).
• Phase 2: All memory cells whose enable flip-flops are set read the contents of the
  data bus.
This operation effectively updates the equivalence table in parallel during the
scan.
[Fig. 8.14: a content-addressable memory cell, with clock phases φ1 and φ2, select
line, address decoder, address bus, and data bus.]
It is also possible to update the table by a search algorithm at the end of the scan.
However, doing the updating in parallel during the scan tremendously reduces the
number of equivalences (since the lowest number is always used), thus reducing
the number of bits needed for the K memory. A content-addressable memory has
the property that memory cells can be accessed or loaded by their contents [8.45,
8.69, 8.77].
The parameter most crucial to the development of a satisfactory memory, and
hence a system capable of operating in real time, is the memory size. A near real-time
system will result if the memory size necessitates a compromise in the access speed.
Issues of memory size are discussed further in [8.60] where simulation involving
real images is presented.
The final component of the architecture is the interface/processor. The primary
purpose of this unit is to execute the algorithm described in this section and
flowcharted in Fig. 8.11. Additionally, it must be capable of (1) processing the video
signal input into gray-scale values for storage in the memory, and (2) interpreting
the L memory in terms of the K memory.
Simulation
The algorithm was applied to a 512 × 512 image of text data that was thresholded
prior to segmentation. Two parameters were of interest: (1) the number of elemental
regions (those whose labels are stored in L), since this affects the word width of L
and K and the length of K; and (2) the number of regions perceived by the algorithm
since this determines the amount of further processing which the host computer must
perform before useful information can be gleaned from the image.
The results of one simulation are summarized below.
• Four-neighbor connectivity: 912 elemental regions, 138 perceived regions.
• Eight-neighbor connectivity: 883 elemental regions, 109 perceived regions.
These results indicate that a 512 × 512 × 10 bit L memory and a 1024 × 10 bit
K memory would be required for this image.
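The arithmetic behind these memory sizes can be checked in a few lines. This is only a sketch: the helper name is hypothetical, and it assumes that label 0 is reserved for “not yet labeled,” so that labels 0 through 912 must all fit in one word.

```python
def label_word_width(n_elemental):
    """Bits needed to represent every label from 0 to n_elemental."""
    return n_elemental.bit_length()


bits_4 = label_word_width(912)     # four-neighbor connectivity
bits_8 = label_word_width(883)     # eight-neighbor connectivity
k_length = 2 ** bits_4             # number of addressable K entries
l_total_bits = 512 * 512 * bits_4  # one label word per pixel
```

Both connectivities give a 10-bit word, and a K memory addressed by a 10-bit label has 1024 entries, which agrees with the sizes quoted above.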
In this section we have addressed the issue of performing image analysis operations
in real time on television-scanned data. We have shown that it is possible to
design hardware which can perform the operation of region growing in this way.
The concept of using equivalence relations to partition an input set is fundamental
to the algorithm. Furthermore, the use of content-addressable read/write memories
facilitates the implementation of such equivalence relation processing in real time.
These concepts were developed by considering potential hardware structures;
however, nothing about the algorithm prevents its implementation on a digital
computer. The program described here was written to simulate the effectiveness of this
approach. Since then, we have used it to label regions in an image segmentation. Its
speed of operation, even in simulation, is superior to our earlier region grower, for
identical performance.
Sometimes, we already have the boundaries of regions, and are interested in describing
the boundary in some way which is appropriate for us to be able to characterize
either the entire boundary or individual segments. While there are many ways to
approach this problem [8.17, 8.42, 8.52], almost all of them involve identifying
distinguished points [8.15, 8.26, 8.55, 8.79] along the boundary, and then characterizing
the curves between these “salient” points. Obviously, the definition of saliency [8.18,
8.23, 8.48] is critical to the performance of the algorithm.
First, recognize that a curve is a one-dimensional function, which is simply bent
in 2-space. That is, a curve can be parameterized using a single parameter, which
is usually arc length. However, the arc length parameterization is not invariant to
affine transforms [9.1]. For this reason, Rivlin and Weiss [8.49] develop invariants
of curves without using a parameter. The x–y coordinates of a point on the curve can
then be written as one-dimensional functions of the arc length, $[x(s), y(s)]$,
where $s$ is how far along the curve we have traveled. One can construct any smooth
curve up to a rigid body motion if the curvature as a function of arc length is known
[8.76]. Of course, we never deal with a smooth, arc-length parameterized curve; we
only have sampled versions of such curves, and the relationship between true arc
length and the number of pixels through which the curve passes is not as simple as
one might think [8.31]. In fact, even for noise-free curves, digitization introduces
errors [8.76]. Interestingly, if one adds noise, curve positions (at least for straight
lines) can be estimated with increased accuracy [8.37].
The speed of a curve at a point s is
$$\left\| \frac{\partial}{\partial s}\,[x(s), y(s)] \right\| = \sqrt{\left(\frac{\partial x}{\partial s}\right)^2 + \left(\frac{\partial y}{\partial s}\right)^2}.$$
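Suppose the curve is closed; then the concepts of INSIDE and OUTSIDE make sense.
Given a point in the plane $\mathbf{x} = [x_i, y_i]$ which is not on the curve, let
$\mathbf{x}'$ represent the closest point on the curve to $\mathbf{x}$ (at this point,
the arc length is defined to be $s_x$). Then we say $\mathbf{x}$ is INSIDE the curve if
$[\mathbf{x} - \mathbf{x}'] \cdot \mathbf{n}(s_x) \le 0$ and OUTSIDE otherwise, where
$\mathbf{n}(s_x)$ is the outward-pointing normal to the curve at $\mathbf{x}'$.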
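For a sampled curve, the INSIDE test can be approximated directly from this definition. The sketch below rests on assumptions that go beyond the text: the curve is given as a list of samples ordered counterclockwise, the closest sample stands in for the closest point on the curve, and the outward normal is obtained by rotating a central-difference tangent by 90 degrees clockwise. The function name and the test circle are illustrative only.

```python
import math


def is_inside(point, curve):
    """Apply the sign test [x - x'] . n <= 0 to a sampled closed curve.

    `curve` is a list of (x, y) samples ordered counterclockwise.
    """
    n = len(curve)
    # Index of the sample closest to the query point (stands in for x').
    k = min(range(n), key=lambda i: (curve[i][0] - point[0]) ** 2
                                    + (curve[i][1] - point[1]) ** 2)
    # Central-difference tangent at sample k; curve[k - 1] wraps around at k = 0.
    tx = curve[(k + 1) % n][0] - curve[k - 1][0]
    ty = curve[(k + 1) % n][1] - curve[k - 1][1]
    # Rotating the tangent clockwise gives the outward normal of a
    # counterclockwise curve. Only its direction matters for the sign test.
    nx, ny = ty, -tx
    dx, dy = point[0] - curve[k][0], point[1] - curve[k][1]
    return dx * nx + dy * ny <= 0


# A circle of radius 2 sampled at 64 points, counterclockwise.
circle = [(2 * math.cos(2 * math.pi * i / 64), 2 * math.sin(2 * math.pi * i / 64))
          for i in range(64)]
inner = is_inside((0.5, 0.3), circle)
outer = is_inside((3.0, 0.0), circle)
```

The approximation degrades where the curve is sampled coarsely relative to its curvature, because the closest sample may then be a poor substitute for the true closest point.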
There is a way [8.54] to perform curve evolution (see section 9.8) such that the
enclosed area remains constant.
You do not necessarily have to find salient points. Chen et al. [8.13] simply
apply an orientation-selective filter at all possible orientations. If two segments have
sufficiently different orientations, the filter response will have multiple peaks, and
the position of the peaks can be used to identify the segments. This approach seems
to work well for images consisting of straight lines with X or T intersections (see
Chapter 10 for a discussion of the types of intersections).
Rosen and West [8.50] propose a slightly different strategy for finding salient
points. They fit the sequence of data points with whatever function seems appropriate
(ellipses or straight lines). The data point which fits most poorly becomes the salient
point. The curve is then divided into two segments, and the fit is repeated recursively
on each segment.
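The recursive structure of this strategy is easy to sketch. The code below is a simplified stand-in rather than Rosen and West's procedure: it fits only straight lines, it uses the chord between the end points of a segment as the fitted line, and it stops splitting when the worst deviation falls below a tolerance. All three choices, and the names used, are assumptions; with them the sketch amounts to the familiar Ramer–Douglas–Peucker splitting scheme.

```python
import math


def salient_points(points, tol):
    """Return indices of salient points of an open curve given as (x, y) samples.

    The end points of every segment are assumed to be distinct.
    """
    found = []

    def split(lo, hi):
        (x0, y0), (x1, y1) = points[lo], points[hi]
        length = math.hypot(x1 - x0, y1 - y0)
        worst, worst_d = None, tol
        for i in range(lo + 1, hi):
            x, y = points[i]
            # Perpendicular distance from sample i to the chord lo-hi.
            d = abs((x1 - x0) * (y0 - y) - (x0 - x) * (y1 - y0)) / length
            if d > worst_d:
                worst, worst_d = i, d
        if worst is not None:        # the worst-fitting sample becomes salient
            split(lo, worst)
            found.append(worst)
            split(worst, hi)

    split(0, len(points) - 1)
    return found


# An L-shaped curve: the corner at (3, 0) is the only salient point.
corner = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
indices = salient_points(corner, tol=0.1)
```

Replacing the chord with a least-squares line or an ellipse fit changes only the distance computation inside `split`.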
The concept of active contours was originally developed to address the fact that
any edge detection algorithm will fail on some images, because in certain areas
of the image, the edge simply does not exist. For example, Fig. 8.15 illustrates a
human heart imaged using nuclear medicine. A radioactive drug is introduced into
the circulation, and an image is made which reflects the radiation at each point. The
brightness at a point is a measure of the integral in a direction perpendicular to the
imaging plane of the amount of blood in the area subtended by that pixel. The volume
of blood within the ventricle can thus be calculated by summing the brightness over
the area of the ventricle. Of course, this requires an accurate segmentation of the
ventricle boundary, a problem made difficult by the fact that there is essentially no
contrast in the upper left corner of the ventricle. This occurs because radiation from
other sources behind the ventricle (superior and inferior vena cava, etc.) contributes
to blur out the contrast. Thus, a technique is required which can bridge these rather
large gaps – gaps which are really too large to be bridged using the closing techniques
of Chapter 7.
Fig. 8.15. The left ventricle imaged using nuclear medicine.
The snake is a closed boundary which moves, retaining its connectivity as it moves.
Following the active contour philosophy, a contour is first initialized, either by
a user or automatically. The boundary then moves, and moves until many/most of
the contour points align with image edge points. An animation of the contour as it
searches for boundary points reminds the viewer of the movements of a snake, hence
these boundaries are often referred to as “snakes.”
Two philosophies are followed in deriving snake algorithms: Energy minimization
and partial differential equations (PDEs).
$$E_I = \alpha \,\|X_i - X_j\| + \|X_{i-1} - 2X_i + X_{i+1}\|,$$
where $X_i = [x_i \; y_i]^T$ is the snake point. Minimizing the first term produces a curve
where the snake points are close together. The curve which minimizes the second
term will have little bending. A negative aspect of the first term is that it is minimized
by a snake which shrinks to a single point. Because of this, many applications also
introduce an “expansion” term which causes the entire curve to grow larger.
The external energy measures edginess of the region through which the boundary
passes. Again, there are many functions which may be used for this. Our favorite is
$$E_E = \exp\left(-\|\nabla f(X_i)\|\right).$$
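The two energies can be evaluated for a candidate snake with a few lines of code. This is a sketch under stated assumptions: the snake is closed, the point X_j in the first term is taken to be the preceding snake point, the per-point energies are simply summed, and the gradient magnitude of the image is supplied as a function. The names `snake_energy` and `grad_mag` are hypothetical.

```python
import math


def snake_energy(points, grad_mag, alpha=1.0):
    """Sum of internal and external energy over a closed snake."""
    n = len(points)
    total = 0.0
    for i in range(n):
        xp, yp = points[i - 1]             # previous point, wraps around at i = 0
        x, y = points[i]
        xn, yn = points[(i + 1) % n]       # next point
        spacing = math.hypot(x - xp, y - yp)
        bending = math.hypot(xp - 2 * x + xn, yp - 2 * y + yn)
        external = math.exp(-grad_mag(x, y))
        total += alpha * spacing + bending + external
    return total


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
flat_energy = snake_energy(square, grad_mag=lambda x, y: 0.0)   # no edges anywhere
edge_energy = snake_energy(square, grad_mag=lambda x, y: 10.0)  # strong edges
```

For this unit square each point contributes a spacing of 1 and a bending term of √2, so the internal energy is 4 + 4√2. The external term adds 1 per point where the image is flat and almost nothing where the gradient is large, which is why the snake prefers to sit on edges.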
For two dimensions, the minimization problem is solvable using dynamic program-
ming [9.19]. However, rather than using local edginess at the boundary, one could
use the difference in average contrast between outside and inside [9.55], which is
only meaningful if the outside is relatively homogeneous.
Observation: This problem fits perfectly into the MAP philosophy, and simulated
annealing (SA) can be used [9.78]. However, the search neighborhood is problematic.
That is, as we have discussed, SA is guaranteed to find the state which globally
minimizes the objective over a set of states. However, the set of states must be
sampled in order for SA to work. In [9.78], an existing contour is used as a starting
point, and at each iteration, the only states sampled are contours within one pixel
of the current contour, and from that set a minimum is chosen. The resultant contour
is the best contour of the set sampled, but not necessarily of the entire region of
interest.
It is also important that the forms chosen for the energy functions be invariant
to scale, translation, and rotation. One way to accomplish this is to use two snakes,
suitably weighted, one outside the hypothesized boundary and contracting, and one
initiated inside and expanding [9.25].
where $s_I(x, y) = \pm 1 - \varepsilon\,\kappa(x, y)$ and $s_E(x, y) = 1/(1 + \Delta(x, y))$.
$\Delta(x, y)$ is a measure of the “edginess” in the image at point x, y, and
$\kappa(x, y)$ measures the curvature of the contour at x, y.
Manhaeghe et al. obtain a snake-like result using Kohonen maps [9.47]; see
also [9.88]. One advantage of this approach is that the computations are local. One
can simply look at a point on the present boundary, and consider where that point
could be moved. Choose one candidate location and determine if moving to that
point increases or decreases the energy (if you are using the energy minimization
method).
Considering only the movement of boundary points, however, introduces some
problems. The first is the difficulty in accurately computing the curvature from the
boundary points. As you know, any derivative-based operator is super-sensitive to
noise. Since the curvature involves a second derivative, it is even worse. Another
problem is that there is no really effective way to allow for the possibility that the
boundary might divide into separate components. The following level-set approach
resolves those differences.
Remember the distance transform? From Chapter 7, the distance transform resulted
in a function DT(x, y) which was equal to zero on boundary pixels and got
larger as one went away from the boundary. Now consider a new version of the
distance transform, which is exactly the same OUTSIDE the contour of interest.
(Remember, the contour is closed, and so the concepts of INSIDE and OUTSIDE
make sense.) Inside the contour, this new function (which we will refer to as the
where s involves something about the brightness variations in the image and also
involves the curvature (in 2D) of the isophote at x, y. Of course, if you insist on
using the 2D curvature of the zero level set, you need to relate that to the function
$\psi$, which fortunately is not too hard. Since the normal vector has already been
worked out, and the curvature can be related to changes in normal direction, it is
possible to show that:
$$\kappa = \frac{\psi_{xx}\psi_y^2 - 2\psi_x\psi_y\psi_{xy} + \psi_{yy}\psi_x^2}{\left(\psi_x^2 + \psi_y^2\right)^{3/2}}, \qquad (8.9)$$
where the functional notation has been dropped.
To be consistent with the literature (and so it will fit on one line), we are using
the subscript notation for partial derivatives here.
Over the course of the algorithm, the metric function evolves following a rule like
Eq. (8.7). As it evolves, it will have zero values at different points, and those points
will define the evolution of the contour.
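The curvature of a level set can be estimated numerically from samples of the metric function. The sketch below uses the standard expression for the curvature of a level curve, (ψxx ψy² − 2 ψx ψy ψxy + ψyy ψx²) / (ψx² + ψy²)^(3/2), with central differences. The symbol ψ for the metric function, the step size h, and the signed-distance test function are assumptions made for illustration.

```python
import math


def level_set_curvature(psi, x, y, h=1e-3):
    """Curvature of the level curve of psi through (x, y), by central differences."""
    px = (psi(x + h, y) - psi(x - h, y)) / (2 * h)
    py = (psi(x, y + h) - psi(x, y - h)) / (2 * h)
    pxx = (psi(x + h, y) - 2 * psi(x, y) + psi(x - h, y)) / h ** 2
    pyy = (psi(x, y + h) - 2 * psi(x, y) + psi(x, y - h)) / h ** 2
    pxy = (psi(x + h, y + h) - psi(x + h, y - h)
           - psi(x - h, y + h) + psi(x - h, y - h)) / (4 * h ** 2)
    numerator = pxx * py ** 2 - 2 * px * py * pxy + pyy * px ** 2
    return numerator / (px ** 2 + py ** 2) ** 1.5


def circle(x, y):
    """Signed distance to a circle of radius 5: negative inside, zero on the contour."""
    return math.hypot(x, y) - 5.0


kappa = level_set_curvature(circle, 3.0, 4.0)   # (3, 4) lies on the zero level set
```

For a circle of radius 5 the curvature of the zero level set is 1/5 everywhere, so `kappa` should be close to 0.2. On a pixel grid, h would be one pixel and the estimate correspondingly noisier, which is the sensitivity to noise mentioned earlier.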
One interesting detail which one must consider when implementing an algorithm
like this is the possibility that the contour might cross itself. For example, consider
the contour segment illustrated in Fig. 8.16. The current contour contains a sharp
concavity. Some typical normal vectors are illustrated. A unit step in the direction of
the normal at the lowest point would place the new contour point inside the contour.
One approach to dealing with this is a simple heuristic which states that points which
were labeled inside can never again be considered as outside.
Fig. 8.16. A contour with a very sharp crease in it. A unit step in the normal direction
near the crease moves the new contour inside the old.
The idea of using level sets for adaptive contours was first proposed by Sethian
[9.66, 9.67]. Malladi et al. [9.46] extended this by observing that there are advantages
to considering only a set of points near the current contour. Taubin and Ronfard [9.84]
use the concept of a level set implicitly in fitting piecewise-linear curves. Kimmel
et al. [9.40] demonstrate that level sets may be used for other things, such as finding
the shortest path on a surface.
Not all algorithms which use a deformable contour philosophy follow the strategies
described in section 8.5. For example, Lai and Chin [8.33] describe a variation
which treats the contour points as a sequence of random variables, which may therefore
be described by a Markov process, and optimized using MAP strategies. Space
does not permit a discussion of those algorithms here, but the reader may find adequate
direction in the sources cited in the bibliography at the end of this chapter.
In range images, we have (typically) numerous surfaces. There are generally two
philosophies which can be followed. First, one may simply seek surfaces which
do not bend too quickly. Following that philosophy produces algorithms which
seek smooth solutions, and segment regions along lines of high surface curvature.
One example of this philosophy was discussed in Chapter 6 where we describe an
algorithm which removes noise while seeking the best piecewise-linear fit to the data
points. Such a fit is equivalent to fitting a surface with a set of planes. Points where
planes meet produce either “roof” edges or “step” edges, depending on viewpoint.
If an annealing algorithm is used such as MFA, good segmentations to more general
surfaces can be produced simply by not running the algorithm all the way to a truly
planar solution [8.6]. A second philosophical approach to range image segmentation
is to assume some equation for the surface, e.g., a quadric (a general second-order
surface, defined in section 8.6.1). Then, all points which satisfy that equation and
which are adjacent belong to the same surface. This philosophy mixes the problems
of segmentation and fitting, for we do not know which points to use to estimate the
parameters of the surface until we have some sort of segmentation [8.7, 8.59]. In the
next section, we look at these two philosophies in a bit more detail.
$$z = ax^2 + by^2 + cxy + dx + ey + f \qquad (8.10)$$
and an implicit form is
$$ax^2 + by^2 + cz^2 + dxy + exz + fyz + gx + hy + iz + j = 0. \qquad (8.11)$$
The expression of Eq. (8.11) is called a quadric, and it is a general form which
describes all second-order surfaces (cones, spheres, planes, ellipsoids, etc.). In Chapter
5 you learned how to fit an explicit function to data, by minimizing the squared error.
Unfortunately, the explicit form does not allow the possibility of higher order terms
in z. You could solve Eq. (8.11) for z, using the quadratic formula, and then have an
explicit form, but now you have a square root on the right-hand side, and lose the
ability to use linear methods for solving for the vector of coefficients.
We can use the implicit form by first defining $f(x, y, z) \equiv ax^2 + by^2 + cz^2 +
dxy + exz + fyz + gx + hy + iz + j$ and making the following observation: If the
point $[x_i, y_i, z_i]^T$ is on the surface described by the parameter vector
$[a, b, c, d, e, f, g, h, i, j]^T$, then $f(x_i, y_i, z_i)$ should be exactly zero.
Thus, we can find the coefficients by minimizing $E = \sum_i (f(x_i, y_i, z_i))^2$
(also known as the algebraic distance from the point $(x_i, y_i, z_i)$ to the surface).
In some cases, this gives good results, but it is not really what we want; we
really should be minimizing $\sum_i d([x_i, y_i, z_i], f(x, y, z))$, where d is some
distance metric, for example the Euclidean distance from the point to the surface (this
is known as the geometric distance [8.66] to the surface). Again, this turns out to be
algebraically intractable. (To implement this see [8.67] and [17.37] for important
details.) Although methods based on the algebraic distance work relatively well most
of the time, they can certainly fail. Whatever distance measure we use, it should have
the following properties [17.37]: (1) The measure should be zero whenever the true
(Euclidean, geometric) distance is zero (the algebraic distance does this); (2) at
the sample points, the derivatives with respect to the parameters are the same for
the true distance and the measure.
We define a level set of a function as the collection of points $[x_i, y_i, z_i]^T$
such that $f(x_i, y_i, z_i) = L$ for some scalar constant L.
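The difference between the two distances is easiest to see on a surface for which the geometric distance has a closed form. The sketch below uses a sphere of radius r centered at the origin, which is the quadric with a = b = c = 1, j = −r², and all other coefficients zero. The choice of surface and the function names are assumptions made for illustration.

```python
import math


def algebraic_distance(p, r):
    """Value of the implicit function f(x, y, z) = x^2 + y^2 + z^2 - r^2 at p."""
    x, y, z = p
    return x * x + y * y + z * z - r * r


def geometric_distance(p, r):
    """Euclidean distance from p to the sphere of radius r centered at the origin."""
    return abs(math.sqrt(sum(c * c for c in p)) - r)


on_surface = (0.0, 1.0, 0.0)
inside_pt = (0.5, 0.0, 0.0)    # 0.5 from the unit sphere, on the inside
outside_pt = (1.5, 0.0, 0.0)   # 0.5 from the unit sphere, on the outside

alg = [algebraic_distance(p, 1.0) for p in (on_surface, inside_pt, outside_pt)]
geo = [geometric_distance(p, 1.0) for p in (on_surface, inside_pt, outside_pt)]
```

Both measures vanish on the surface, which is property (1). Away from it they disagree: the two points at the same geometric distance of 0.5 have algebraic distances of −0.75 and 1.25, so a fit that minimizes the squared algebraic distance weights outside points more heavily than inside points.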
Of course, whatever representation you choose to use (polynomials are popular),
there is always a desire for a representation that is invariant to affine transforms
[8.27].
$$ax^2 + bxy + cy^2 + dx + ey + f = 0. \qquad (8.12)$$
This implicit form describes not only ellipses, but lines, hyperbolae, parabolae, and
circles. To guarantee that the resulting curve is an ellipse we must also ensure that
it satisfies
we get a solution which tends to fit areas of low curvature with hyperbolic arcs
rather than with ellipses. Similar difficulties occur when attempting to fit ellipsoids
to range data. See Wang et al. [8.74], Rosen and West [8.50], and Fitzgibbon et al.
[8.19] for more details.
In performing such fits, it is important to know when a point is simply an “outlier,”
that is, it has been corrupted by substantial noise, but actually belongs to the surface
under consideration, or when it really belongs to another, possibly occluding, surface.
Darrell and Pentland [13.9] examined this question in some detail and demonstrated
that “M-estimates” lead to excellent segmentations. Cabrera and Meer [8.11] remove
the bias from fits of an ellipse using an iterative algorithm called “bootstrapping.”
How you fit a function to data also depends on the nature of the noise or corruption
to the data. If the noise is additive, zero-mean Gaussian (which is what we almost
always assume) then the minimum vertical distance (MMSE) or the minimal normal
distance (which we called eigenvector line fitting) methods work well. If the noise
is not Gaussian, then other methods are more appropriate. For example, nuclear
medicine images are corrupted primarily by counting (Poisson) noise. Such a noise
differs from Gaussian in two important ways – it is never negative, and it is signal
dependent. Well away from zero, Poisson noise may be reasonably modeled by
additive Gaussian with a variance equal to the signal. Other sensors produce other
types of noise. Stewart [8.64] considers the case of inliers and outliers, but assumes
that the bad data are randomly distributed over the dynamic range of the sensor. That
is, the noise is not additive.
Given one segmentation, should you merge two adjacent regions? If they are
adjacent and satisfy the same equation to within some noise measure, they should
be merged [8.8, 8.29, 8.34, 8.56]. Other relevant papers on fitting surfaces include
[8.4, 8.75, 8.78].
One is also faced with the issue of what surface measurements to use as a basis
for segmentation. Curvature would appear to be particularly attractive since the
measurement curvature is invariant to viewpoint. However, “curvature estimates are
very sensitive to quantization noise” [8.71].
As you have concluded by now, there are many algorithms and variations on algo-
rithms for segmentation. But which is the best? Who knows? We need an algorithm
to evaluate the quality of segmentations. But which is the best such evaluation algo-
rithm? We need an algorithm which evaluates the quality of evaluation algorithms
which ... (help!).
There are several approaches to evaluating segmentation quality. Since one
result of a segmentation is edges, you could indirectly infer segmentation quality
by measuring edge positions. Pratt [5.33] provides one such algorithm.
Bilbro and Snyder [8.6] first remove noise, then fit the resulting surface. They only
consider the quality of the noise removal, which can be tested very simply: Subtract
the segmented, cleaned image from the original. What you SHOULD see is just the
noise. If your noise-removal algorithm produces an image containing features, then
it removed something other than noise.
It is really difficult to evaluate the quality of segmentation of a brightness image,
since different human observers will come to different conclusions as to what the
correct answer is. With range images, however, it is somewhat easier to determine
“truth” since surfaces can physically be measured.
Hoover et al. [8.22] propose the following formalism for comparing the quality of a
machine-segmented (MS) image using a human-segmented ground-truth (GT) image
as the “gold standard.” Let M and G denote the MS and GT images respectively; let
$M_i$ ($i = 1, \ldots, m$) denote a region in M; and let $G_j$ ($j = 1, \ldots, n$)
denote a region in G. $|R|$ will denote the number of pixels in region R. Let $O_{ij}$
be the number of pixels which belong to both region i in the MS image and region j in
the GT image. Finally, let T be a threshold, $0.5 < T \le 1.0$.
There are five different segmentation results, defined as follows.
Two correct segmentations can be further compared in the case of range imagery
by computing the normal vectors to the GT region and the corresponding MS region,
and finding the absolute value of the angle between these two vectors.
With these definitions, we can evaluate the quality of a segmentation by counting
correct or erroneous segmentations, and measuring total angle error. By plotting these
measures vs. T and comparing these plots, a measure of segmenter performance may
be determined.
Hoover et al. [8.22] use this approach to do a thorough evaluation of the quality
of four different range image segmentation algorithms.
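The overlap counts O_ij on which this formalism rests can be computed in one pass over the two label images. In the sketch below, the overlap table follows the definition given above; the test for a correct segmentation (the overlap covers at least a fraction T of both the MS region and the GT region) is an assumption based on the usual statement of this criterion and is not spelled out in the text here. The function names and the small label images are hypothetical.

```python
from collections import Counter


def overlap_table(ms, gt):
    """O[(i, j)] = number of pixels labeled i in the MS image and j in the GT image."""
    table = Counter()
    for ms_row, gt_row in zip(ms, gt):
        for i, j in zip(ms_row, gt_row):
            table[(i, j)] += 1
    return table


def correct_pairs(ms, gt, T):
    """Pairs (i, j) whose overlap is at least T of region i and T of region j."""
    table = overlap_table(ms, gt)
    size_m = Counter(i for row in ms for i in row)   # |M_i|
    size_g = Counter(j for row in gt for j in row)   # |G_j|
    return sorted((i, j) for (i, j), o in table.items()
                  if o >= T * size_m[i] and o >= T * size_g[j])


ms = [[1, 1, 2, 2],
      [1, 1, 2, 2]]
gt = [[1, 1, 1, 2],
      [1, 1, 1, 2]]
O = overlap_table(ms, gt)
loose = correct_pairs(ms, gt, T=0.6)
strict = correct_pairs(ms, gt, T=0.8)
```

Raising T from 0.6 to 0.8 removes the only matching pair in this example, which is the kind of change that a plot of the counts against T is meant to reveal.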
8.8 Conclusion
were defined to be in the same region. In the example of section 8.6.1, all the points
satisfying the same surface equation were defined to fall in the same region.
Minimum squared error. In section 8.2.1 we used an optimization method, minimum
squared error, to find the best threshold. In section 8.5, we obtained a closed boundary
using the philosophy of active contours, by specifying a problem-specific objective
function and finding the boundary which minimizes that function. Any appropriate
minimization technique could be used.
Gradient descent. In section 8A.5, we will once again see function minimization
(based on gradient descent with annealing) used in a maximum a posteriori
method which will find the picture which minimizes a particular objective function,
in this case resulting in a segmentation.
8.9 Vocabulary
Active contour
Algebraic distance
Connected component
Explicit form
Geometric distance
Histogram
Homogeneous
Implicit form
Label image
Normal direction
Oversegmentation
Quadric
Region growing
Salient point
Segmentation
Snake
Speed of a curve
Thresholding
Undersegmentation
Assignment 8.1
Section 2.3.5 of the text by Haralick and Shapiro
[4.18] describes a labeling algorithm similar in
some ways to that presented in this section. Write a
Table 8.2
Location   0  1  2  3  4  5  6  7  8  9
Contents   0  0  2  2  4  4  2  4  2  2
Contents   _  _  _  _  _  _  _  _  _  _
Assignment 8.2
In the process of a connected component labeling
scheme, we find at one point that the lookup table ap-
pears as shown in Table 8.2.
Now, we discover the following equivalence: 9 = 7. On
the blank row in Table 8.2, show the contents of the
lookup table after resolving the equivalence.
Topic 8A Segmentation
In this chapter, we have up to this point considered partitioning of images into areas which
differ in some way in their brightness or range. The algorithms we have presented are
applicable, however, to any feature which characterizes a pixel or area around a pixel. For this
reason, we mention here some other measures which may be used, including texture, color,
and motion. In this section, we also mention other approaches to segmentation, including
segmentation based on edges.
In section 4A.2.2, texture is discussed. If one is able to quantify the concept of texture – to
assign two different numbers to two different textures in order to distinguish them – then
texture can be used as a feature in a segmentation algorithm. Instead of defining ADJACENT
as similar brightnesses, one simply defines them as having similar textures. One could also
combine color and texture [8.46, 8.65].
Several researchers have observed that there are really two fundamentally different types
of textures, those that are in some sense “deterministic,” and those that are in some sense
“random,” and which can in principle be modeled by Markov random fields [8.2]. Liu and
Picard [8.38] and others [8.20, 8.21, 8.58] observe that the peaks in the Fourier transform
give hints to how separate these texture characteristics are.
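Table 8.3 (straight line of unit length)
n      0    1     2     ...   n
ε      1    1/2   1/4   ...   1/2^n
M_ε    1    2     4     ...   2^n

Table 8.4 (square with sides of unit length)
n      0    1     2     ...   n
ε      1    1/2   1/4   ...   1/2^n
M_ε    1    4     16    ...   2^(2n)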
The fractal dimension measures the self-similarity of a shape to itself when measured on
a different scale. In that way it provides a measure of the spatial distribution of the points
within the foreground (which presumably are the object of interest). The utility of the fractal
dimension may be seen by considering a set S of points in a 2-space and defining the fractal
dimension of S as
$$\dim(S) = \lim_{\varepsilon \to 0} \frac{\log M_\varepsilon}{\log(1/\varepsilon)} \qquad (8.15)$$
where Mε is the minimum number of ε × ε boxes required to cover S. Let’s look at the fractal
dimension of some example objects. We begin with a single point. Obviously, a point may
be covered by precisely one square box, independent of the size of the box. Therefore, in Eq.
(8.15), Mε is always equal to one, and the limit of the denominator is infinity. Therefore we
find the fractal dimension of a point is simply zero. Now let us consider a straight line of unit
length. Clearly, such a line may be covered by a 1 × 1 box. However, it may also be covered
by two boxes, each 1/2 × 1/2, or four boxes, each 1/4 × 1/4. Each time the size of the box
is halved, the number of boxes required doubles. We could tabulate this process in terms of
a parameter n, equal to the number of halvings that have been done (Table 8.3).
From Table 8.3, we may evaluate dim(S) from
$$\dim(S) = \lim_{n \to \infty} \frac{\log 2^n}{\log\left(1/(1/2)^n\right)} = 1. \qquad (8.16)$$
Finally, let S be a square area, and without loss of generality, let its sides be of length one.
We could cover this with a single 1 × 1 box, or by four 1/2 × 1/2 boxes, or by 16 1/4 × 1/4
boxes, etc. An argument similar to the one for the straight line results in Table 8.4.
From Table 8.4, the fractal dimension is
$$\dim(S) = \lim_{n \to \infty} \frac{\log 2^{2n}}{\log\left(1/(1/2)^n\right)} = 2.$$
Fig. 8.17. The figure on the left has a fractal dimension of 2.0, the figure on the right has a fractal
dimension of 1.58. In the right-hand figure, the covering 2×2 squares are shown with
dark lines.
So, we end up with an intuitively appealing result: A point is a zero-dimensional thing, a line
is one dimensional, and a square is two dimensional. At least, the result agrees with intuition
for these very simple shapes.
Given an image, we must give some thought to how to apply Eq. (8.15) to the discrete
pixel domain, since we obviously cannot take a limiting box size less than ε = 1. Solving
Eq. (8.15) for Mε we find for small values of ε
$$\log M_\varepsilon = \dim(S)\,\log\frac{1}{\varepsilon} \qquad (8.18)$$
which simplifies to
$$\log M_\varepsilon = -\dim(S)\,\log \varepsilon.$$
Thus, for small ε, the log of Mε is linearly related to the log of ε, with the magnitude
of the slope being the dimension of S. We may therefore estimate the dimension of S by
considering the two smallest values we have for ε, 1 and 2, and computing that slope, using
$$\dim(S) \approx \frac{\log M_1 - \log M_2}{\log 2 - \log 1} = \frac{\log M_1 - \log M_2}{\log 2},$$
producing a simple algorithm: Find M1, the number of 1 × 1 boxes required to cover the
object; this is simply the area of the object measured in pixels. Find M2, the number of 2 × 2
boxes required to cover the same object, and take the difference of the logs.
Consider the following example. The image on the left of Fig. 8.17 has a foreground
region which has an area of 36, and which can be covered by 9 2 × 2 squares. Thus,
its fractal dimension is (log 36 − log 9)/ log 2 = 2. The figure on the right has the same
area, but requires 12 2 × 2 squares to cover it, and therefore has a fractal dimension of
(log 36 − log 12)/ log 2 = 1.58.
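The two-scale estimate above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the region is given as a set of (row, col) pixel coordinates, the 2 × 2 boxes are aligned to a fixed grid (which may use more boxes than the best possible cover), and the function names and the two test shapes are hypothetical, not the shapes of Fig. 8.17.

```python
import math

def count_boxes(pixels, size):
    """Count grid-aligned size x size boxes that contain at least one pixel."""
    return len({(row // size, col // size) for row, col in pixels})

def fractal_dimension(pixels):
    """Two-scale estimate (log M1 - log M2) / log 2."""
    m1 = count_boxes(pixels, 1)  # M1: the area in pixels
    m2 = count_boxes(pixels, 2)  # M2: number of 2 x 2 boxes touched
    return (math.log(m1) - math.log(m2)) / math.log(2)

# Hypothetical shapes: a 6 x 6 square (36 pixels, 9 boxes) and
# a 3 x 12 strip (36 pixels, 12 boxes on this grid).
square = {(r, c) for r in range(6) for c in range(6)}
strip = {(r, c) for r in range(3) for c in range(12)}

print(round(fractal_dimension(square), 2))  # 2.0
print(round(fractal_dimension(strip), 2))   # 1.58
```

Both shapes have area 36, so the difference in the estimate comes entirely from the number of 2 × 2 boxes needed, as in the worked example.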
The concept of fractal dimension can be extended to gray-scale images, and gray-scale
image features extracted using this measure [8.12].
210 Segmentation
One way to segment is to use a connected components algorithm, but to assume there are
certain points which cannot be connected to anything. These might be, for example, edge
points [8.75].
The usual problem with using edges for segmentation is that edge detectors, as you well
know, are prone both to producing extra edges and to losing portions of edges, producing
gaps. Jacobs [8.24] has an interesting approach to this problem. He first defines an “acceptable
region” in terms of its edges by requiring that the region be convex and that the set of edges
bounding the region be mostly real (measured rather than inferred). That is, between any set
of points, we could hypothesize that edges exist. However, if a particular set of points is
really connected, with only a few gaps, and those gaps are small in totality compared to the
perimeter of the region, we might believe such a region is real, that is, a salient group. Jacobs
comes up with some very clever heuristics to reduce this combinatorial search of possible
regions into something fairly manageable.
The Hough transform, described in Chapter 11, will provide another means for identifying
segments of edges, even if some parts are missing.
If one could identify all the connected pixels which have the same motion characteristics, one
could apply the connected component methods we have already discussed. Although some
papers emphasize motion segmentation – Patras et al. [8.47] use watershed methods and a
multistage segmentation method – it is critical to any motion segmentation algorithm that it
be based on an effective method for detecting and representing motion.
The problem of characterizing image motion has been the object of a great deal of research,
and is discussed in much more detail in section 9A.3.
Just as texture variation can produce segmentations, so can color variations. Variations on
color segmentation include posing an optimization problem and minimizing an objective
function. The image which produces the minimum of the objective function is then the
segmentation. Liu and Yang [8.36] use simulated annealing to find a good color-based
segmentation. Clustering, which is described briefly in Chapter 15, also gives an approach
[8.72].
In 1991, Snyder et al. [8.63] made the observation that one could convert a restoration problem
into a classification problem in a very straightforward way if one had knowledge of the mean
and, optionally, the variance, of the brightness of each class [8.39, 8.63]. In 2000, the same
conclusion was reached using variational methods [8.53]. In order to “classify” an image
surface into different classes, that is, different segments, we can use the same MAP methods
as we did in solving the image restoration problems.
In order to make use of the MAP methods, all one need do is modify the prior term to
include the expected brightness. For example, suppose we have prior knowledge that an image
should be smooth except for step discontinuities, and in addition, each pixel is allowed to
have only one of three brightnesses (e.g. CSF, white matter, and gray matter). The following
prior term is maximal when the brightness at pixel i is identical to its neighbors or when it
has the value k1 or the value k2 , or the value k3 .
\[ H(f) = \sum_{i} \sum_{j\in\aleph_i} \exp\!\left( -(f_i - f_j)^2 (f_i - k_1)^2 (f_i - k_2)^2 (f_i - k_3)^2 \right). \tag{8.21} \]
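To see how this prior behaves, the following minimal Python sketch evaluates the sum of exp(−(f_i − f_j)²(f_i − k_1)²(f_i − k_2)²(f_i − k_3)²) over all pixels i and their neighbors j. Several things here are assumptions made only for illustration: the neighborhood of a pixel is taken to be its 4-connected neighbors, the image is a small list of lists, and the function name `prior` and the class brightnesses are hypothetical.

```python
import math

def prior(image, classes):
    """Sum over pixels i and 4-neighbors j of exp(-(fi-fj)^2 * prod_k (fi-k)^2)."""
    rows, cols = len(image), len(image[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            fi = image[r][c]
            # Product of squared distances to each allowed class brightness.
            class_term = 1.0
            for k in classes:
                class_term *= (fi - k) ** 2
            # Assumed neighborhood: up, down, left, right (inside the image).
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    fj = image[rr][cc]
                    total += math.exp(-((fi - fj) ** 2) * class_term)
    return total

classes = (0.0, 1.0, 2.0)               # hypothetical k1, k2, k3
clean = [[0.0, 0.0], [1.0, 1.0]]        # every pixel sits on a class value
noisy = [[0.5, 0.0], [1.0, 1.5]]        # two pixels lie between classes

print(prior(clean, classes))            # 8.0: every term equals exp(0) = 1
print(prior(noisy, classes) < prior(clean, classes))  # True
```

Each term reaches its maximum of 1 when the pixel matches a neighbor or takes one of the class values, so an image whose pixels all lie on class values maximizes the sum.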
For further reading, Zhu and Yuille [8.80] show that several similar techniques may be
combined. Further, they illustrate relationships in a variety of image representations.
One of the aspects of segmentation which, regrettably, we do not have time to address in this
book is the issue of how humans do segmentation. That is, what is the “correct” segmentation?
For example, Koenderink and van Doorn [8.28] observed that humans tend to perceive the
projection of three-dimensional objects as collections of ellipses. See [8.57] for an overview.
Bibliography
[8.1] R. Adams and L. Bischof, “Seeded Region Growing,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(6), 1994.
[8.2] P. Andrey and P. Tarroux, “Unsupervised Segmentation of Markov Random Field
Modeled Textured Images using Selectionist Relaxation,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 20(3), 1998.
[8.3] K.E. Batcher, “STARAN Parallel Processor System Hardware,” Proceedings of AFIPS
National Computer Conference, vol 43, pp. 405–410, 1974.
[8.4] J. Berkmann and T. Caelli, “Computation of Surface Geometry and Segmentation
using Covariance Techniques,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(11), 1994.
[8.5] S. Beucher, “Watersheds of Functions and Picture Segmentation,” IEEE International
Conference on Acoustics, Speech and Signal Processing, Paris, May, 1982.
[8.6] G. Bilbro and W. Snyder, “Range Image Restoration using Mean Field Annealing,”
In Advances in Neural Network Information Processing Systems, San Mateo, CA,
Morgan-Kaufmann, 1989.
[8.7] G. Bilbro and W. Snyder, “Optimization of Functions with Many Minima,” IEEE
Transactions on Systems, Man, and Cybernetics, 21(4), July/August, 1991.
[8.8] G. Blais and M. Levine, “Registering Multiview Range Data to Create 3D Computer
Objects,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8),
1995.
[8.9] K. Boyer, M. Mirza, and G. Ganguly, “The Robust Sequential Estimator: A General
Approach and its Application to Surface Organization in Range Data,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(10), 1994.
[8.10] C.R. Brice and C.L. Fennema, “Scene Analysis using Regions,” Artificial Intelligence,
1, pp. 205–226, Fall, 1970.
[8.11] J. Cabrera and P. Meer, “Unbiased Estimation of Ellipses by Bootstrapping,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(7), 1996.
[8.12] B. Chaudhuri and N. Sarkar, “Texture Segmentation using Fractal Dimension,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(1), 1995.
[8.13] J. Chen, Y. Sato, and S. Tamura, “Orientation Space Filtering for Multiple Orientation
Line Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
22(5), 2000.
[8.14] C. Chow and T. Kaneko, “Automatic Detection of the Left Ventricle from
Cineangiograms,” Computers and Biomedical Research, 5, pp. 388–410, 1972.
[8.15] T. Davis, “Fast Decomposition of Digital Curves into Polygons using the Haar
Transform,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8),
August, 1999.
[8.16] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, New York, Wiley,
1973.
[8.17] M. Fischler and R. Bolles, “Perceptual Organization and Curve Partitioning,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 8(1), 1986.
[8.18] M. Fischler and H. Wolf, “Locating Perceptually Salient Points on Planar Curves,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2), 1994.
[8.19] A. Fitzgibbon, M. Pilu, and R. Fisher, “Direct Least Square Fitting of Ellipses,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5), May,
1999.
[8.20] J. Francos, “Orthogonal Decomposition of 2-D Random Fields and Their Applications
in 2-D Spectral Estimation,” Signal Processing and its Applications, ed. N.K. Bose
and C.R. Rao, Amsterdam, North Holland, 1993.
[8.21] J. Francos, Z. Meiri, and B. Porat, “A Unified Texture Model Based on a 2-D Wold
Like Decomposition,” IEEE Transactions on Signal Processing, 41, pp. 2665–2678,
August, 1993.
[8.22] A. Hoover, G. Jean-Baptiste, X. Jiang, P. Flynn, H. Bunke, D. Goldgof, K. Bowyer,
D. Eggert, A. Fitzgibbon, and R. Fisher, “An Experimental Comparison of Range
Image Segmentation Algorithms,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(7), 1996.
[8.23] L. Itti, C. Koch, and E. Niebur, “A Model of Saliency-based Visual Attention for Rapid
Scene Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
20(11), 1998.
[8.24] D. Jacobs, “Robust and Efficient Detection of Salient Convex Groups,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(1), 1996.
[8.25] K. Kanatani, “Statistical Bias of Conic Fitting and Renormalization,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(3), 1994.
[8.26] N. Katzir, M. Lindenbaum, and M. Porat, “Curve Segmentation under Partial
Occlusion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5),
1994.
[8.27] D. Keren, “Using Symbolic Computation to Find Algebraic Invariants,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(11), 1994.
[8.28] J. Koenderink and A. Van Doorn, “The Shape of Smooth Objects and the Way Contours
End,” Perception, 11, pp. 129–137, 1982.
[8.29] K. Köster and M. Spann, “MIR: An Approach to Robust Clustering – Application to
Range Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(5), 2000.
[8.30] L. Krakauer, “Computer Analysis of Visual Properties of Curved Objects,” Project
MAC TR-82, 1971.
[8.31] S. Kulkarni, S. Mitter, T. Richardson, and J. Tsitsiklis, “Local vs. Global Computation
of Length of Digitized Curves,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(7), 1994.
[8.32] S. Kumar, S. Han, D. Goldgof, and K. Bowyer, “On Recovering Hyperquadrics
from Range Data,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(11), 1995.
[8.33] K. Lai and R. Chin, “Deformable Contours: Modeling and Extraction,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(11), pp. 1084–1090, 1995.
[8.34] S. LaValle and S. Hutchinson, “A Bayesian Segmentation Methodology for Parametric
Image Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(2), 1995.
[8.35] K. Lee, P. Meer, and R. Park, “Robust Adaptive Segmentation of Range Images,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(2), 1998.
[8.36] J. Liu and Y. Yang, “Multiresolution Color Image Segmentation,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(7), 1994.
[8.37] X. Liu and R. Ehrich, “Subpixel Edge Location in Binary Images using Dithering,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(6), 1995.
[8.38] F. Liu and R. Picard, “Periodicity, Directionality, and Randomness: Wold Features for
Image Modeling and Retrieval,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(7), 1996.
[8.39] A. Logenthiran, W. Snyder, and P. Santago, “MAP Segmentation of Magnetic
Resonance Images using Mean Field Annealing,” SPIE Symposium on Electronic Imaging,
Science and Technology, February, 1991.
[8.40] A. Matheny and D. Goldgof, “The Use of Three- and Four-dimensional Surface
Harmonics for Rigid and Nonrigid Shape Recovery and Representation,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(10), 1995.
[8.41] D.L. Milgram, A. Rosenfeld, T. Willett, and G. Tisdale, “Algorithms and Hardware
Technology for Image Recognition,” Final Report to U.S. Army Night Vision Lab,
March 31, 1978.
[8.42] F. Mokhtarian and A. Mackworth, “Scale-based Description and Recognition of Planar
Curves and Two-dimensional Shapes,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 8(1), 1986.
[8.43] K. Mori, M. Kidode, H. Shinoda, and H. Asada, “Design of Local Parallel Pattern
Processor for Image Processing,” Proceedings AFIPS National Computer Conference,
vol 47, pp. 1025–1031 June, 1978.
[8.44] W. Ng and C. Lee, “Comment on Using the Uniformity Measure for Performance
Measure in Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(9), 1996.
[8.45] B. Parhami, “Associative Memories and Processors: An Overview and Selected
Bibliography,” Proceedings of the IEEE, 61, pp. 722–730, June, 1973.
[8.46] D. Panjwani and G. Healey, “Markov Random Field Models for Unsupervised
Segmentation of Textured Color Images,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 17(10), 1995.
[8.47] I. Patras, E. Hendriks, and R. Lagendijk, “Video Segmentation by MAP Labeling of
Watershed Segments,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 23(3), 2001.
[8.48] A. Pikaz and I. Dinstein, “Using Simple Decomposition for Smoothing and Feature
Point Detection of Noisy Digital Curves,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(8), 1994.
[8.49] E. Rivlin and I. Weiss, “Local Invariants for Recognition,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(3), 1995.
[8.50] P. Rosin and G. West, “Nonparametric Segmentation of Curves into Various
Representations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(12),
1995.
[8.51] A. Rosenfeld and A. Kak, Digital Picture Processing, 2nd edition, New York,
Academic Press, 1997.
[8.52] G. Roth and M. Levine, “Geometric Primitive Extraction using a Genetic Algorithm,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9), 1994.
[8.53] C. Samson, L. Blanc-Féraud, G. Aubert, and J. Zerubia, “A Variational Model for
Image Classification and Restoration,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(5), 2000.
[8.54] G. Sapiro and A. Tannenbaum, “Area and Length Preserving Geometric Invariant
Scale Spaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(1), 1995.
[8.55] H. Sheu and W. Hu, “Multiprimitive Segmentation of Planar Curves – a Two-level
Breakpoint Classification and Tuning Approach,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 21(8), 1999.
[8.56] H. Shum, K. Ikeuchi, and R. Reddy, “Principal Component Analysis with Missing
Data and Its Application to Polyhedral Object Modeling,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(9), 1995.
[8.57] K. Siddiqi and B. Kimia, “Parts of Visual Form: Computational Aspects,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(3), 1995.
[8.58] R. Sriram, J. Francos, and W. Pearlman, “Texture Coding Using a Wold Decomposition
Model,” Proc. International Conference on Pattern Recognition, Jerusalem, October,
1994.
[8.59] W. Snyder and G. Bilbro, “Segmentation of Range Images,” International Conference
on Robotics and Automation, St. Louis, March, 1985.
[8.60] W. Snyder and A. Cowart, “An Iterative Approach to Region Growing,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 5(3), 1983.
[8.61] W.E. Snyder and C.D. Savage, “Content-Addressable Read-Write Memories for Image
Analysis,” IEEE Transactions on Computers, 31(10), pp. 963–967, 1982.
[8.62] W. Snyder, G. Bilbro, A. Logenthiran, and S. Rajala, “Optimal Thresholding, A New
Approach,” Pattern Recognition Letters, 11(11), 1990.
Space tells matter how to move, and matter tells space to get bent
Douglas Adams
One topic we will want to consider in this chapter is invariance to various types of
linear transformations on regions. That is, consider all the pixels in a region, and
write the x, y coordinates of each pixel as a 2-vector, and operate on that set of
vectors. The first transformations of interest are the orthogonal transformations such
as
\[ R_z = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, \]
which operate on the coordinates of the pixels in a region to produce new coordinate
pairs. For example
\[ \begin{bmatrix} x' \\ y' \end{bmatrix} = R_z \begin{bmatrix} x \\ y \end{bmatrix}, \]
217 9.1 Linear transformations
where $R_z$ is as defined above, represents a rotation about the z axis. Given a
region s, we can easily construct a matrix, which we denote by S, in which each
column contains the x, y coordinates of a pixel in the region. For example,
suppose the region s = {(1, 2), (3, 4), (1, 3), (2, 3)}; then the corresponding coordinate
matrix S is
\[ S = \begin{bmatrix} 1 & 3 & 1 & 2 \\ 2 & 4 & 3 & 3 \end{bmatrix}. \]
That works wonderfully well for rotations, but how can we include translation in this
formalism? To do so, we augment the rotation matrices by adding
a row and a column, all zeros except for a 1 in the lower right corner. With this new
definition,
\[ R_z = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}. \]
We also augment the definition of a point by adding a 1 in the third position, so the
coordinate pair (x, y) in this new notation becomes
\[ \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}. \]
Now, we can combine translation and rotation into a single matrix representation
(called a homogeneous transformation matrix). We accomplish this by changing the
third column to include the translation. For example, to rotate a point about the origin
by an amount $\theta$, and then translate it by an amount dx in the x direction and dy in
the y direction, we perform the matrix multiplication
\[ x' = \begin{bmatrix} \cos\theta & -\sin\theta & dx \\ \sin\theta & \cos\theta & dy \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}. \tag{9.1} \]
Thus, it is possible to represent rotation in the viewing plane (about the z axis) and
translation in that plane by a single matrix multiplication. All the transformations
mentioned above are elements of a class of transformations called “similarity trans-
formations.” Similarity transformations are characterized by the fact that they may
move an object around, but they do not change its shape.
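As a small illustration of Eq. (9.1), the sketch below builds the 3 × 3 homogeneous matrix and applies it to the example region s = {(1, 2), (3, 4), (1, 3), (2, 3)}. It uses plain Python lists rather than a matrix library, and the function names are hypothetical choices made for this example.

```python
import math

def homogeneous_transform(theta, dx, dy):
    """3 x 3 matrix: rotate by theta about the origin, then translate by (dx, dy)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, dx],
            [s,  c, dy],
            [0.0, 0.0, 1.0]]

def apply_transform(matrix, points):
    """Apply the matrix to each (x, y) point, augmented to (x, y, 1)."""
    moved = []
    for x, y in points:
        v = (x, y, 1.0)
        new_x = sum(m * a for m, a in zip(matrix[0], v))
        new_y = sum(m * a for m, a in zip(matrix[1], v))
        moved.append((new_x, new_y))
    return moved

region = [(1, 2), (3, 4), (1, 3), (2, 3)]
# Rotate by 90 degrees, then shift 10 units in x.
moved = apply_transform(homogeneous_transform(math.pi / 2, 10.0, 0.0), region)
print([(round(x, 6), round(y, 6)) for x, y in moved])
```

Rotating (1, 2) by 90 degrees gives (−2, 1), and the translation then moves it to (8, 1). Distances between points are unchanged, which is the sense in which the shape is preserved.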
218 Shape
Fig. 9.1. Affine transformations can scale the coordinate axes. If the axes are scaled differently,
the image undergoes a shear distortion.
Fig. 9.2. Aircraft images which are affine transformations of each other (from [9.1]).
But what can we do, if anything, to represent rotations out of the camera plane?
To answer this, we need to define an affine transformation. An affine transformation
of the 2-vector $x = [x, y]^T$ produces the 2-vector $x' = [x', y']^T$, where
\[ x' = Ax + b, \tag{9.2} \]
and b is likewise a 2-vector. This looks just like the similarity transformations
mentioned above, except that we do not require the matrix A be orthonormal, only
nonsingular. An affine transformation may distort the shape of a region. For example,
shear may result from an affine transformation, as illustrated in Fig. 9.1.
As you probably recognize, rotation out of the field of view of a planar object is
equivalent to an affine transformation of that object. This gives us a (very limited)
way to think about rotation out of the field of view. If an object is nearly planar, and
the rotation out of the field of view is small, that is, nothing gets occluded, we can
represent this 3D motion as a 2D affine transformation. For example, Fig. 9.2 shows
some images of an airplane which are all affine transformations of each other.
The matrix which implements an affine transform (after correction for translation)
can be decomposed into its constituents [9.62] – rotation, zoom, and shear:
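\[ \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} \alpha & 0 \\ 0 & \delta \end{bmatrix} \begin{bmatrix} 1 & \beta \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}. \tag{9.3} \]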
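The decomposition into rotation, zoom, and shear can be checked numerically by composing the three factors. The sketch below is only illustrative: the parameter names `theta`, `alpha`, `delta`, and `beta` (rotation angle, the two zoom factors, and the shear) are labels assumed for this example, and the helper functions are hypothetical.

```python
import math

def matmul2(a, b):
    """Product of two 2 x 2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def compose_affine(theta, alpha, delta, beta):
    """Return A = rotation * zoom * shear for the given parameters."""
    c, s = math.cos(theta), math.sin(theta)
    rotation = [[c, s], [-s, c]]
    zoom = [[alpha, 0.0], [0.0, delta]]
    shear = [[1.0, beta], [0.0, 1.0]]
    return matmul2(matmul2(rotation, zoom), shear)

def determinant(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

A = compose_affine(theta=0.3, alpha=2.0, delta=0.5, beta=0.4)
print(round(determinant(A), 6))  # 1.0: rotation and shear have unit determinant
```

Because the rotation and shear factors each have determinant 1, the determinant of A is the product of the two zoom factors, so A is nonsingular whenever both zoom factors are nonzero.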
Now, what does one do with these concepts of transformations? One can correct
transformations and align objects together through inverse transformations to assist
in shape analysis. For example, one can correct for translation by shifting so the
centroid is at the origin. Correcting for rotation involves rotating the image until the
principal axes of the object are in alignment with the coordinate axes.
219 9.2 Transformation methods based on the covariance matrix
Properties of a metric.
• d(a, a) = 0 ∀a
• d(a, b) = d(b, a) ∀(a, b)
• d(a, b) + d(b, c) ≥ d(a, c) ∀(a, b, c)
Consider the distribution of points illustrated in Fig. 9.3. Each point may be exactly
characterized by its location, the ordered pair, (x1 , x2 ), but neither x1 nor x2 , by itself,
is adequate to describe the point.
Now consider Fig. 9.4, which shows two new axes, y1 and y2 . Again, the ordered
pair y1 , y2 is adequate to exactly describe the point, but y2 is (compared to y1 ) very
nearly zero, most of the time. Thus, we would lose very little if we simply discarded
y2 and used the scalar, y1 to describe each point. Our objective in this section will
be to learn how to determine y1 and y2 in an optimal manner.
Fig. 9.4. A new set of coordinates, derived by a rotation of the original coordinates, in which one
coordinate well represents the data.
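We represent a random vector x as a weighted sum of d vectors,
\[ x = \sum_{i=1}^{d} y_i b_i. \tag{9.4} \]
Here, the vectors $b_i$ are deterministic (and may, in general, be specified in advance).
Since any random vector x may be expressed in terms of the same d vectors, $b_i$ (i =
1, . . . , d), we say the vectors $b_i$ span the space containing x, and refer to them as a
basis set for x. To make further use of the basis set, we will require:¹
\[ b_i^T b_j = \begin{cases} 1 & (i = j) \\ 0 & (i \neq j). \end{cases} \tag{9.5} \]
Under these conditions, the $y_i$ may be found by projections, where the projection
operation is defined by
\[ y_i = b_i^T x, \tag{9.6} \]
and we let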
y = [y1 , . . . , yd ]T . (9.7)
Here, we say the number yi results from projecting x onto the basis vector bi .
Suppose we wish to ignore all but m(m < d) components of y (which we will
call the principal components) and yet we wish to still represent x, although with
some error. We will thus calculate (by projection onto the basis vectors) the first m
1 To be a basis, they do not have to be orthonormal, just not parallel, but here, we will require orthonormality.
elements of y, and replace the others with constants, forming the estimate
\[ \hat{x} = \sum_{i=1}^{m} y_i b_i + \sum_{i=m+1}^{d} \alpha_i b_i. \tag{9.8} \]
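The error which we have introduced by using some arbitrary constants, the alphas
of Eq. (9.8), rather than the elements of y, is given by
\[ \Delta x = x - \hat{x} = x - \left( \sum_{i=1}^{m} y_i b_i + \sum_{i=m+1}^{d} \alpha_i b_i \right) = \sum_{i=m+1}^{d} [y_i - \alpha_i]\, b_i. \tag{9.9} \]
Since y is random, so is $\Delta x$, and we measure the quality of the estimate by the mean
squared error
\[ \varepsilon^2(m) = E\{|\Delta x|^2\} = E\left\{ \sum_{i=m+1}^{d} \sum_{j=m+1}^{d} (y_i - \alpha_i)(y_j - \alpha_j)\, b_i^T b_j \right\}. \tag{9.10} \]
Because the basis vectors are orthonormal, only the terms with i = j survive:
\[ \varepsilon^2(m) = \sum_{i=m+1}^{d} E\{(y_i - \alpha_i)^2\}. \tag{9.11} \]
Setting the derivative with respect to each $\alpha_i$ to zero,
\[ \frac{\partial \varepsilon^2(m)}{\partial \alpha_i} = -2\left( E\{y_i\} - \alpha_i \right) = 0, \tag{9.12} \]
gives
\[ \alpha_i = E\{y_i\}. \tag{9.13} \]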
So, we should replace those elements of y which we do not measure by their expected
values – mathematically and intuitively appealing.
Substituting Eq. (9.13) into Eq. (9.11), we have
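\[ \varepsilon^2(m) = \sum_{i=m+1}^{d} E\left\{ (y_i - E\{y_i\})^2 \right\}. \tag{9.14} \]
Since $y_i = b_i^T x$, this becomes
\[ \varepsilon^2(m) = \sum_{i} E\left\{ \left( b_i^T x - E\{b_i^T x\} \right) \left( x^T b_i - E\{x^T b_i\} \right) \right\} \tag{9.15} \]
\[ \phantom{\varepsilon^2(m)} = \sum_{i} E\left\{ b_i^T (x - E\{x\})(x^T - E\{x^T\})\, b_i \right\}, \tag{9.16} \]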
and we now recognize the term between the bs in Eq. (9.16) as the covariance of x:
\[ \varepsilon^2(m) = \sum_{i=m+1}^{d} b_i^T K_x b_i. \tag{9.17} \]
It can be shown (and we will show it below, for the special case of fitting a straight
line) that the choice of vector bi which minimizes Eq. (9.17) also satisfies
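\[ K_x b_i = \lambda_i b_i. \tag{9.18} \]
That is, the optimal basis vectors are eigenvectors of $K_x$. With $y = B^T x$, the covariance
of y is
\[ K_y = B^T K_x B, \]
where the matrix B has columns made from the basis vectors, $b_1, b_2, \ldots, b_d$.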
Furthermore, in the case that the columns of B are the eigenvectors of Kx , then B
will be the transformation which diagonalizes Kx , resulting in
\[ K_y = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_d \end{bmatrix}. \tag{9.21} \]
Substituting Eq. (9.21) into Eq. (9.17), we find
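\[ \varepsilon^2(m) = \sum_{i=m+1}^{d} b_i^T \lambda_i b_i. \tag{9.22} \]
Since $\lambda_i$ is scalar,
\[ \varepsilon^2(m) = \sum_{i=m+1}^{d} \lambda_i b_i^T b_i, \tag{9.23} \]
and, since the $b_i$ are orthonormal,
\[ \varepsilon^2(m) = \sum_{i=m+1}^{d} \lambda_i. \tag{9.24} \]
The error is therefore smallest when the discarded components are those with the
smallest eigenvalues. That is, we compute
\[ y_i = b_i^T x, \tag{9.25} \]
where the $b_i$ are the eigenvectors of $K_x$, ordered so that the corresponding eigenvalues satisfy
\[ \lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots \geq \lambda_d. \tag{9.26} \]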
Interpretation as a hyperellipsoid
Fig. 9.5. A covariance matrix may be thought of as representing a hyperellipsoid, oriented in the
directions of the eigenvectors, and with extent in those directions equal to the square
root of the corresponding eigenvalues.
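The claim that the mean squared error equals the sum of the discarded eigenvalues can be checked numerically in two dimensions. The sketch below is a minimal illustration, not a general implementation: it assumes d = 2 and m = 1, uses the closed-form eigen-decomposition of a symmetric 2 × 2 matrix, estimates the covariance by dividing by the number of points, and uses hypothetical function names and data.

```python
import math

def principal_axes(points):
    """Mean, eigenvectors (b1, b2), and eigenvalues (l1 >= l2) of the 2-D covariance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) ** 2 for x, _ in points) / n           # var(x)
    c = sum((y - my) ** 2 for _, y in points) / n           # var(y)
    b = sum((x - mx) * (y - my) for x, y in points) / n     # cov(x, y)
    theta = 0.5 * math.atan2(2.0 * b, a - c)                # major-axis angle
    b1 = (math.cos(theta), math.sin(theta))
    b2 = (-math.sin(theta), math.cos(theta))
    half_trace = 0.5 * (a + c)
    spread = math.hypot(0.5 * (a - c), b)
    return (mx, my), (b1, b2), (half_trace + spread, half_trace - spread)

def truncation_error(points):
    """Mean squared error when y2 is replaced by its expected value."""
    (mx, my), (b1, b2), _ = principal_axes(points)
    alpha2 = b2[0] * mx + b2[1] * my                        # E{y2}
    total = 0.0
    for x, y in points:
        y1 = b1[0] * x + b1[1] * y                          # projection onto b1
        est_x = y1 * b1[0] + alpha2 * b2[0]
        est_y = y1 * b1[1] + alpha2 * b2[1]
        total += (x - est_x) ** 2 + (y - est_y) ** 2
    return total / len(points)

data = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)]
_, _, (lam1, lam2) = principal_axes(data)
print(abs(truncation_error(data) - lam2) < 1e-9)            # True
```

Keeping only the component along the major eigenvector and replacing the other by its mean leaves an error equal to the smaller eigenvalue, which is why the eigenvalues are ordered before components are discarded.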
We wish to find the straight line which best fits this set of data. Move the origin
to the center of gravity of the set. Then, characterize the (currently unknown) best
fitting line by its unit normal vector, n. Then, for each point xi , the perpendicular
distance from xi to the best fitting line will be equal to the projection of that point
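onto n. Denote this distance by $d_i(\mathbf{n})$:
\[ d_i(\mathbf{n}) = \mathbf{n}^T x_i. \]
To find the best fitting straight line, we minimize the sum of the squared perpendicular
distances
\[ \varepsilon^2 = \sum_{i=1}^{n} d_i^2(\mathbf{n}) = \sum_{i=1}^{n} (\mathbf{n}^T x_i)^2 = \sum_{i=1}^{n} \mathbf{n}^T x_i x_i^T \mathbf{n} = \mathbf{n}^T \left( \sum_{i=1}^{n} x_i x_i^T \right) \mathbf{n} \tag{9.29} \]
subject to the constraint that $\mathbf{n}$ is a unit vector,
\[ \mathbf{n}^T \mathbf{n} = 1. \tag{9.30} \]
Introducing a Lagrange multiplier $\lambda$ for the constraint and setting the derivative of
$\mathbf{n}^T \left( \sum_i x_i x_i^T \right) \mathbf{n} - \lambda(\mathbf{n}^T \mathbf{n} - 1)$ with respect to $\mathbf{n}$ to zero gives
\[ \left( \sum_{i=1}^{n} x_i x_i^T \right) \mathbf{n} = \lambda \mathbf{n}, \]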
which is the same eigenvalue problem mentioned earlier. Thus we may state:
The best fitting straight line passes through the mean of the set of data points, and
will lie in the direction corresponding to the major eigenvector of the covariance of
the set.
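The statement above translates directly into a short procedure. The following Python sketch is one possible rendering for 2-D points, under stated assumptions: it uses the closed-form orientation of the major eigenvector of a symmetric 2 × 2 scatter matrix, and the function names `fit_line` and `perpendicular_distance` are hypothetical.

```python
import math

def fit_line(points):
    """Return (mean, unit direction) of the line minimizing perpendicular distances."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Scatter matrix of the mean-centered points: [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Orientation of the major eigenvector of a symmetric 2 x 2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def perpendicular_distance(point, mean, direction):
    """Distance from point to the fitted line, via the unit normal."""
    nx, ny = -direction[1], direction[0]
    return abs((point[0] - mean[0]) * nx + (point[1] - mean[1]) * ny)

diagonal = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
vertical = [(2.0, 0.0), (2.0, 1.0), (2.0, 5.0)]
print(fit_line(diagonal))
print(fit_line(vertical))
```

Unlike a fit of y as a function of x, this formulation handles the vertical set of points without special cases, because it never divides by the spread in x.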
We have now seen two different ways to find the straight line which best fits data:
The method of least squares described in section 5.3, if applied to fitting a line rather
than a plane, minimizes the sum of squared vertical distances from the data points to the
line. The method described in this section minimizes the sum of squared perpendicular
distances described by Eq. (9.29). Other methods also exist. For example, [9.53] finds
piecewise representations which preserve moments up to an arbitrarily specified order.
225 9.3 Simple features
Fitting functions to data occurs in many contexts. For example, O’Gorman [9.54]
has looked at fitting not only straight edges, but points, straight lines, and regions
with straight edges. By so doing, subpixel precision can be obtained.
In the following, we will consider a few of the many simple features which may be
used to characterize the shape of regions (for additional features, see also [9.2]).
In this section, we describe several simple features which can be used to describe
the shape of a patch – the output of the segmentation process. Many of these features
can be computed as part of the segmentation process itself. For example, since the
connected component labeling program must touch every pixel in the region, it can
easily keep track of the area.
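As an illustration of accumulating such features in a single pass over the pixels of a region, consider the sketch below. It makes several assumptions that are not fixed by the text: the region is a set of (row, col) coordinates, adjacency means 4-adjacency, the perimeter is the count of region pixels with at least one neighbor outside the region, and the X–Y aspect ratio is the bounding-rectangle height divided by its width. The function name is hypothetical.

```python
def simple_features(region):
    """Area, perimeter, center of gravity and X-Y aspect ratio of a pixel set."""
    area = len(region)
    perimeter = 0
    sum_r = sum_c = 0
    rows = [r for r, _ in region]
    cols = [c for _, c in region]
    for r, c in region:
        sum_r += r
        sum_c += c
        # A pixel is on the perimeter if any 4-neighbor lies outside the region.
        neighbors = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
        if any(p not in region for p in neighbors):
            perimeter += 1
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return {
        "area": area,
        "perimeter": perimeter,
        "center_of_gravity": (sum_r / area, sum_c / area),
        "xy_aspect_ratio": height / width,
    }

# Hypothetical 3 x 6 rectangular region.
rectangle = {(r, c) for r in range(3) for c in range(6)}
print(simple_features(rectangle))
```

For the 3 × 6 rectangle the area is 18 and only the four interior pixels are excluded from the perimeter count, giving a perimeter of 14 under this definition.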
The following is a list of simple features which are likewise simple to calculate.
• Average gray value. In the case of black and white “silhouette” pictures, this is
simple to compute.
• Maximum gray value. This is straightforward to compute.
• Minimum gray value. This is straightforward to compute.
• Area (A). A count of all pixels in the region.
• Perimeter (P). Several different definitions exist. Probably the simplest is a count
of all pixels in the region that are adjacent to a pixel not in the region.
• Diameter (D). The diameter describes the maximum chord – the distance between
those two points on the boundary of the region whose mutual distance is maximum
[9.68, 9.71]. We will discuss computation of this parameter in the next section.
• Thinness (also called compactness)² (T). Two definitions for compactness exist:
$T_a = (P^2/A) - 4\pi$ measures the ratio of the squared perimeter to the area; $T_b =
D/A$ measures the ratio of the diameter to the area. Fig. 9.6 compares these two
measurements on example regions.
• Center of gravity (CG). The x and y coordinates of the center of gravity may be
written as
\[ m_x = \frac{1}{N} \sum x, \qquad m_y = \frac{1}{N} \sum y \]
for all N points in a region. However, we prefer the vector form
\[ m = \frac{1}{N} \sum_{i=1}^{N} \begin{bmatrix} x_i \\ y_i \end{bmatrix}. \tag{9.34} \]
2 Some authors [9.69] prefer not to confuse the mathematical definition of compactness with this definition, and
thus refer to this measure as the isoperimetric measure.
Fig. 9.6. Results of applying two different compactness measures to various regions
(panel labels: Ta small, Tb large; Ta large, Tb large; Ta large, Tb small). Since a
circle has the minimum perimeter for a given area, it minimizes Ta . A starfish, on the
other hand, has a large perimeter for the same area.
• X–Y aspect ratio. See Fig. 9.7. The aspect ratio is the length/width ratio of the
bounding rectangle of the region. This is simple to compute.
• Minimum aspect ratio. See Fig. 9.8. Again, a length/width, but much more
computation is required to find the minimum such rectangle.
The minimum aspect ratio can be a difficult calculation, since it requires a
search for extremal points. A very good approximation can be obtained if we
think of a region as represented by an ellipse-shaped distribution of points. In this
case, as we discussed in Fig. 9.5, the eigenvalues of the covariance of points are
measures of the distribution of points along orthogonal axes – major and minor.
Because the extent along each axis is the square root of the corresponding eigenvalue,
the square root of the ratio of those eigenvalues is quite a good approximation of the
minimum aspect ratio.
Fig. 9.7. y/x is the aspect ratio using one definition, with horizontal and vertical
sides to the bounding rectangle.
Fig. 9.8. y/x is the minimum aspect ratio. The methods of section 9.2.2 give a simple
way to estimate the aspect ratio.
• Number of holes. One feature that is very descriptive and reasonably easy to
compute is the number of holes in a region.
• Triangle similarity. Consider three points on the boundary of a region, P1 , P2 , P3 ,
let d(Pi , P j ) denote the Euclidean distance between two of those points, and
let S = d(P1 , P2 ) + d(P2 , P3 ) + d(P3 , P1 ) be the perimeter of that triangle. The
2-vector
\[ \left[ \frac{d(P_1, P_2)}{S},\ \frac{d(P_2, P_3)}{S} \right], \tag{9.35} \]
simply the ratio of side lengths to perimeter, is invariant to rotation, translation,
and zoom [9.6].
• Symmetry. In two dimensions, a region is said to be mirror-symmetric if it is
invariant under a reflection about a line. That line is referred to as the axis of
symmetry. A region is said to have rotational symmetry of order n if it is invariant
under a rotation of $2\pi/n$ about some point, usually the center of gravity of the
region. There are two challenges in determining the symmetry of regions. The
first is simply determining the axis. The other is to answer the question: “How
symmetric is it?” Most papers prior to 1995 which analyzed symmetry of regions
In this method we first find the major axis of the region and those two points on
the boundary of the region which are closest to that axis. In this approach, minor
deviations, such as the spur shown in Fig. 9.9, will be ignored, even though they
may actually contain one of the extreme points.
The first step in the process is calculation of the major axis. This is performed
by a minimization-of-squared-error technique. It is important that minimization of
error be independent of the coordinate axes; therefore, we use the eigenvector line
fitting technique described earlier in this section rather than the conventional MSE
technique. We define a line to be the best representation of the major axis if it
minimizes the sum of squares of the perpendicular distances from the points in the
region to the line.
Assignment: Is this algorithm new, or is it repeated somewhere else in the text?
Let us assume the region R is described by a set of points R = {(xi , yi )|i =
1, . . . , n}. Let the point (xi , yi ) be denoted by the vector v i , and di be the
perpendicular distance from v i to the major axis. Thus, the major axis of the region is the
line which minimizes:
\[ d^2 = \sum_{i=1}^{n} d_i^2. \tag{9.36} \]
Fig. 9.9. A region with a spur.
It is easily shown that the major axis must pass through the center of gravity of the
region, thus it is necessary only to find the slope of the axis. Since the line passes
through the center of gravity, let us take that point as the origin of our coordinate
system. Then the problem becomes: Given n points with zero mean, find the line
through the origin which minimizes d 2 .
This turns out to be the same eigenvector problem we described earlier. Thus one
can find the major axis by:
(1) relocating the origin by subtracting the center of gravity from each point;
(2) finding the principal eigenvector of the scatter matrix of this modified set of
points; and
(3) solving for the major axis as the line through the center of gravity parallel to the
principal eigenvector.
Having found the major axis, one then treats each point on the boundary as a
vector and projects each onto the major axis. The extrema are then those two points
on opposite sides of the center of gravity whose projections onto the major axis are
maximum in length. (The extrema are not necessarily unique.) This approach yields
a solution which is an accurate representation of the “shape” of the region in the
least-mean-squared-error sense. It may or may not actually find the two points whose
mutual distance is maximum. In many applications, an approximation like this is
exactly what is needed. However, occasionally, one encounters regions (Fig. 9.9)
with spurs on them, where the spur may be relevant. In this case, it is necessary to
use an algorithm which will find the actual extrema (see section 9A.1).
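The three steps above, followed by the projection of the boundary points onto the axis, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `major_axis` and `extrema_along_axis` are hypothetical, and the principal eigenvector of the 2 x 2 scatter matrix is obtained from the standard closed-form angle instead of a general eigen-solver.

```python
import math

def major_axis(points):
    """Return (center of gravity, unit direction) of the major axis."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Scatter matrix entries of the zero-mean points (step 1 and step 2).
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    # Principal eigenvector of a symmetric 2x2 matrix, in closed form.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    # Step 3: the axis passes through (cx, cy) with this direction.
    return (cx, cy), (math.cos(theta), math.sin(theta))

def extrema_along_axis(boundary, center, direction):
    """Boundary points with the smallest and largest projection on the axis."""
    def projection(p):
        return ((p[0] - center[0]) * direction[0]
                + (p[1] - center[1]) * direction[1])
    return min(boundary, key=projection), max(boundary, key=projection)

pts = [(0, 0), (1, 1), (2, 2), (3, 3), (1, 2), (2, 1)]
center, direction = major_axis(pts)
ends = extrema_along_axis(pts, center, direction)
```

For these sample points the axis runs along the diagonal, and the two returned points lie on opposite sides of the center of gravity. As the text notes, they need not be the pair with the greatest mutual distance.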
9.4 Moments
The moments of a shape are easily calculated, and as we shall see, can be robust to
similarity transforms.
A moment of order p + q may be defined on a region as
\[ m_{pq} = \sum_{x} \sum_{y} x^p y^q f(x, y). \tag{9.37} \]
If we assume that the region is uniform in gray value and that gray value is arbitrarily
set to 1 inside the region and 0 outside, the area of the region is then $m_{00}$, and we
find that the center of gravity is
\[ m_x = \frac{m_{10}}{m_{00}}, \qquad m_y = \frac{m_{01}}{m_{00}}. \tag{9.38} \]
We may derive a set of moment-like measurements (the central moments) which are
invariant to translation by moving the origin:
\[ \mu_{pq} = \sum_{x} \sum_{y} (x - m_x)^p (y - m_y)^q f(x, y). \tag{9.39} \]
By taking into account rotation and zoom, we can continue this sort of derivation and
can now define as many features as we wish by choosing higher orders of moments
or combinations thereof. From the central moments, we may define the normalized
central moments by
\[ \eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}, \quad \text{where } \gamma = \frac{p+q}{2} + 1. \]
Finally, the invariant moments [9.21] have the characteristics that they are invariant to
translation, rotation, and scale change, which means that we get the same moment,
even though the image may be moved, rotated, or zoomed.3 They are listed in
Table 9.1.
3 Gonzalez and Wintz [9.21] refer to zooming as “scale change.” Since we use the word “scale” in a slightly
different way, we refer to this as “zoom.”
Table 9.1. Invariant moments.
\[ \phi_1 = \eta_{20} + \eta_{02} \]
\[ \phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \]
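As an illustration of Eqs. (9.37) through (9.39) and the normalized central moments, the sketch below computes the first two invariant moments of a small binary image. It assumes the image is a list of rows with `img[y][x]` holding f(x, y), and it uses Hu's standard first invariant, the sum of the normalized moments of orders (2,0) and (0,2). All function names are hypothetical.

```python
def raw_moment(img, p, q):
    """m_pq: sum of x^p y^q f(x, y) over the image; img[y][x] holds f."""
    return sum((x ** p) * (y ** q) * f
               for y, row in enumerate(img)
               for x, f in enumerate(row))

def central_moment(img, p, q):
    """mu_pq: the same sum taken about the center of gravity."""
    m00 = raw_moment(img, 0, 0)
    mx = raw_moment(img, 1, 0) / m00
    my = raw_moment(img, 0, 1) / m00
    return sum(((x - mx) ** p) * ((y - my) ** q) * f
               for y, row in enumerate(img)
               for x, f in enumerate(row))

def normalized_moment(img, p, q):
    """eta_pq = mu_pq / mu_00^gamma with gamma = (p + q)/2 + 1."""
    gamma = (p + q) / 2 + 1
    return central_moment(img, p, q) / central_moment(img, 0, 0) ** gamma

def first_two_invariants(img):
    n20 = normalized_moment(img, 2, 0)
    n02 = normalized_moment(img, 0, 2)
    n11 = normalized_moment(img, 1, 1)
    return n20 + n02, (n20 - n02) ** 2 + 4 * n11 ** 2

# A 3x2 rectangle, and the same rectangle moved and rotated by 90 degrees.
img_a = [[0, 0, 0, 0],
         [0, 1, 1, 1],
         [0, 1, 1, 1]]
img_b = [[0, 0, 0],
         [1, 1, 0],
         [1, 1, 0],
         [1, 1, 0]]
inv_a = first_two_invariants(img_a)
inv_b = first_two_invariants(img_b)
```

The two rectangles give the same pair of values up to rounding, which is the invariance to translation and rotation described in the text. A rotation by an angle that is not a multiple of 90 degrees would resample the pixels, and the quantization sensitivity mentioned in the text would then appear.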
Since their original development by Hu [9.33], the concept has been extended to
moments which are invariant to affine transforms by Rothe et al. [9.62].
Despite their attractiveness, strategies based on moments do have problems, not
the least of which is sensitivity to quantization and sampling [9.45]. (See Assignment
9.9.)
The use of moments is actually a special case of a much more general approach to
image matching [9.89] referred to as the method of normalization. In this philosophy,
we first transform the region into a “canonical frame” by performing a (typically
linear) transform on all the points. The simplest such transformation is to subtract
the coordinates of the center of gravity (CG) from all the pixels, thus moving the
coordinate origin to the CG of the region. In the more general case, such a trans-
formation might be a general affine transform, including translation, rotation, and
shear. We then do matching in the transform domain, where all objects of the same
class (e.g., triangles) look the same.
Some refinements are also required if moments are to be calculated with gray-
scale images, that is, when the f of Eq. (9.37) is not the result of a thresholding
operation. All the theory of invariance still holds, but as Gruber and Hsu [9.24] point
out, noise corrupts moment features in a data-dependent way.
Once a program has extracted a set of features, some use must be made of this
set, either to match two observations or to match an observation to a model. The use
of simple features in matching is described in section 13.2.
Fig. 9.11. The eight directions (for 8-neighbor) and four directions (for 4-neighbor) in which one
might step in going from one pixel to another around a boundary.
all between 0 and 7 (if using eight directions) or between 0 and 3 (if using four
directions), designating the direction of each step. The eight and four cardinal directions
are defined as illustrated in Fig. 9.11. The boundary of a region may then
be represented by a string of single digits. A more compact representation utilizes
superscripts whenever a direction is repeated. For example, 0012112776660 could
be written $0^2 1 2 1^2 2 7^2 6^3 0$, and illustrates the boundary shown in Fig. 9.12. The ability
to describe boundaries with a sequence of symbols plays a significant role in the
discipline known as “syntactic pattern recognition,” and appears frequently in the
machine vision literature, including other places in this book.
Fig. 9.12. The boundary segment represented by the chain code $0^2 1 2 1^2 2 7^2 6^3 0$.
Can you determine what if anything is wrong with the caption on this figure?
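The compact notation and the meaning of a chain code as a sequence of steps can be sketched as below. The step table is an assumption: it takes direction 0 to point along +x, with codes increasing counterclockwise and the y axis pointing up. The helper names are hypothetical, and the caret stands in for the printed superscript.

```python
from itertools import groupby

# Assumed 8-direction steps: 0 = +x, codes increase counterclockwise.
STEPS_8 = [(1, 0), (1, 1), (0, 1), (-1, 1),
           (-1, 0), (-1, -1), (0, -1), (1, -1)]

def run_length(chain):
    """Group repeated directions: '0012' -> [('0', 2), ('1', 1), ('2', 1)]."""
    return [(d, len(list(g))) for d, g in groupby(chain)]

def compact(chain):
    """Write each run once, with '^' marking the repeat count."""
    return " ".join(d if n == 1 else f"{d}^{n}" for d, n in run_length(chain))

def trace(chain, start=(0, 0)):
    """Follow the chain code from `start`; return every point visited."""
    x, y = start
    points = [(x, y)]
    for d in chain:
        dx, dy = STEPS_8[int(d)]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points

code = "0012112776660"
short = compact(code)
path = trace(code)
```

Changing `start` shifts every traced point by the same offset and leaves the code itself unchanged, which is why a chain code describes a boundary independently of its position.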
We see that these two sequences differ only in the first (DC) term. They therefore
represent two encodings of the same boundary which differ only by a translation.
4 Pun alert! (bet you missed that one) – note the kinds of numbers discussed in the paragraph.
How we represent the movement from one boundary point to another is critical.
Simply using a 4-neighbor chain code produces poor results. Use of an 8-neighbor
code reduces the error by 40 to 80%, but is still not as good as using a subpixel inter-
polation. There are other complications as well, including the observation that the
usual parameterization of the boundary (arc length) is not invariant to affine transformations
[9.1, 9.96]. Experiments [9.39] have compared affine-invariant Fourier
descriptors and autoregressive methods. For more detail on Fourier descriptors, see
[9.1].
In two dimensions, the medial axis of a region is defined as the locus of the centers
of “maximal circles.” A maximal circle is the largest circle that can be located at a
given point in the region. Let’s say that a bit more carefully (we all need to learn
how to use mathematics to make our wording precise). At a point (x, y) inside the region,
draw a circle of radius R about that point. Make R as large as possible, but such
that (1) no points in the circle are outside the region and (2) the circle touches the
boundary of the region at no fewer than two points. Any point on the medial axis
can be shown (see Assignment 7.12) to be a local maximum of a distance transform
(DT). A point with a DT value of k is a local maximum if none of its neighbors
has a greater value. Fig. 9.13(a) repeats Fig. 7.3. The local maxima of this distance
transform are illustrated in Fig. 9.13(b).
Another way to think about the medial axis is as the minimum of an electrostatic
potential field. This approach is relatively easy to develop if the boundary happens
to be straight lines or, in three dimensions, planes [9.12]. See [9.16] for additional
algorithms to efficiently compute the medial axis.
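The definition of the medial axis as the local maxima of a distance transform can be sketched as follows. This is an illustration under stated assumptions, not the book's code: the DT is the 4-neighbor (city-block) distance to the nearest background pixel, computed with the usual two-pass scan, and pixels beyond the image border are treated as background.

```python
def distance_transform_4(region):
    """4-neighbor distance from each region pixel to the background.

    `region` is a list of rows holding 1 inside the region and 0 outside.
    """
    h, w = len(region), len(region[0])
    big = h + w  # larger than any possible city-block distance
    dt = [[big if v else 0 for v in row] for row in region]
    # Forward pass: propagate distances from the top and the left.
    for y in range(h):
        for x in range(w):
            if dt[y][x]:
                up = dt[y - 1][x] if y > 0 else 0
                left = dt[y][x - 1] if x > 0 else 0
                dt[y][x] = min(dt[y][x], up + 1, left + 1)
    # Backward pass: propagate distances from the bottom and the right.
    for y in reversed(range(h)):
        for x in reversed(range(w)):
            if dt[y][x]:
                down = dt[y + 1][x] if y < h - 1 else 0
                right = dt[y][x + 1] if x < w - 1 else 0
                dt[y][x] = min(dt[y][x], down + 1, right + 1)
    return dt

def local_maxima(dt):
    """Region pixels none of whose 4-neighbors has a greater DT value."""
    h, w = len(dt), len(dt[0])
    found = []
    for y in range(h):
        for x in range(w):
            if dt[y][x] == 0:
                continue
            neighbors = [dt[j][i]
                         for i, j in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
                         if 0 <= i < w and 0 <= j < h]
            if all(v <= dt[y][x] for v in neighbors):
                found.append((x, y))
    return found

rect = [[1] * 5 for _ in range(3)]
dt = distance_transform_4(rect)
skeleton = local_maxima(dt)
```

On this small rectangle the maxima are the middle row of interior pixels together with the four corner pixels, whose two in-image neighbors both share their value of 1.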
Fig. 9.13. An example region whose DT, computed using 4-neighbors, is shown in (a). The
morphological skeleton is illustrated in (b), and simply consists of the local maxima of
the DT.
9.8 Deformable templates
Recall active contours (snakes) from section 8.5? If we think of the region surrounded
by a snake, rather than the snake itself, we realize we have a region whose shape
can be deformed, and thus have invented “deformable templates.” Objects can be
tracked utilizing a philosophy which allows for deformation of the template [9.102].
In addition, the concepts of deformable templates may be useful in image data base
access. For example, Bimbo and Pala [13.5] do indexing by comparing shapes in the
image with a user-drawn sketch, an “iconic index,” which is actually a deformable
template. The best matching template is written as
where s is (normalized) arc length, (s) is the template as stored in the data base, and
(s) is the deformation required to make this particular template match a sequence of
boundary points in the image being accessed. We emphasize that (s) is the difference
between the original and the deformed template. The image in the data base which
best matches the template is the image which minimizes “the difference between the
which represents (in the first term) how much the template had to be strained to fit the
object while the second term represents the energy spent to bend the template. This
is thus a deformable templates problem. The optimization problem may be solved
numerically [13.5].
A variation on deformable templates is the idea of “geometric flow” to change
a given initial curve into a form which is better suited for identification by, say,
template matching. The term “geometric” flow means that the flow is completely
determined by the geometry of the curve. Pauwels et al. [9.58] cast this discipline
as the answer to: “Is it possible to use optimization of functionals to crystallize the
geometric content of a curve by reducing the noise while at the same time enhancing
the salient features?”
9.9 Quadric surfaces
This one equation describes all the second-order surfaces, some of which are illustrated
in Fig. 9.14.
Fig. 9.14. The quadric equation describes a wide variety of surfaces [9.103] (CRC Press. Used
with permission). Panels: ellipsoid; hyperboloid of one sheet (L) and of two sheets (R);
elliptic paraboloid; hyperbolic paraboloid.
If the quadric is centered at the origin, and its principal axes happen to be aligned
with the coordinate axes, the quadric will take on a particular form. For example, an
ellipsoid has the special form
\[ \frac{x^2}{a^2} + \frac{y^2}{b^2} + \frac{z^2}{c^2} = 1. \tag{9.43} \]
However, when the axes of the quadric do not align with the coordinate axes, only
the general form of Eq. (9.42) occurs.
From range or other surface data, the quadric coefficients may be determined by
methods such as those in section 8.6.1. Given the coefficients, the type of quadric
may be determined by the following method.5
If there is a constant term, d, divide by the constant, redefining the other coefficients
(e.g. a ⇐ a/d), resulting in a form of the quadric equation in which the constant
term is unity:
Obtain the three eigenvalues, λ1, λ2, and λ3, and find the reciprocal of each which
is nonzero, r1 = 1/λ1, r2 = 1/λ2, and r3 = 1/λ3. At least one reciprocal must be
positive to have a real surface: If exactly one is positive then the surface is a hyperboloid
of two sheets; if exactly two are positive, then it is a hyperboloid of one sheet;
if all three are positive, it is an ellipsoid, and the square roots of r1, r2, r3 are the
lengths of the semi-axes of the ellipsoid. Otherwise the distance between the foci of hyperboloids
is determined by the magnitudes of the rs.
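The classification rule can be sketched in a few lines. The sketch makes two assumptions that should be checked against the normalized equation: the quadric has been brought to the form x^T A x = 1 with A symmetric, and the eigenvalues of A have already been computed by some eigen-solver. The function name and return format are hypothetical, and cases with a zero eigenvalue are only flagged, since the rule in the text does not classify them.

```python
import math

def classify_quadric(eigenvalues, tol=1e-12):
    """Classify x^T A x = 1 from the three eigenvalues of symmetric A.

    Returns (name, semi_axes); semi_axes is filled in only for ellipsoids.
    """
    nonzero = [lam for lam in eigenvalues if abs(lam) > tol]
    reciprocals = [1.0 / lam for lam in nonzero]
    positives = sum(1 for r in reciprocals if r > 0)
    if positives == 0:
        return "no real surface", []
    if len(nonzero) < 3:
        return "degenerate (not covered by the rule in the text)", []
    if positives == 3:
        return "ellipsoid", [math.sqrt(r) for r in reciprocals]
    if positives == 2:
        return "hyperboloid of one sheet", []
    return "hyperboloid of two sheets", []

kind, axes = classify_quadric([0.25, 1.0 / 9.0, 1.0])
```

With eigenvalues 1/4, 1/9 and 1 the result is an ellipsoid whose semi-axes are 2, 3 and 1, matching the special form of Eq. (9.43) with a = 2, b = 3 and c = 1.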
5 The authors are grateful to Dr G. L. Bilbro for his formulation of this method.
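\[ \nabla^2 \Psi(x, y, z) = 0 \tag{9.46} \]
which is
\[ \frac{\partial^2 \Psi}{\partial x^2} + \frac{\partial^2 \Psi}{\partial y^2} + \frac{\partial^2 \Psi}{\partial z^2} = 0 \tag{9.47} \]
in Cartesian coordinates. Most of the work in harmonic representations has not,
however, been in Cartesian coordinates, but rather in spherical coordinates. (See
Matheny and Goldgof [8.40] for a discussion of other forms.) Any continuous function
which can be written as $r = r(\theta, \phi)$ can be expressed as a linear combination
of spherical harmonics. In spherical coordinates, Laplace’s equation is
\[ \frac{\partial}{\partial r}\left( r^2 \frac{\partial \Psi}{\partial r} \right) + \frac{1}{\sin\theta} \frac{\partial}{\partial \theta}\left( \sin\theta \frac{\partial \Psi}{\partial \theta} \right) + \frac{1}{\sin^2\theta} \frac{\partial^2 \Psi}{\partial \phi^2} = 0. \tag{9.48} \]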
We seek solutions which are separable in the sense that they may be written as the
product of functions of a single variable; that is, they have the form
With this restriction, the partial differential equation may be separated into three
ordinary differential equations and the solution can be shown to be
where the parameter l is referred to as the “degree,” m is an integer less than l, and
P is a Legendre polynomial.
Thus, any function may be represented as
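\[ r(\theta, \phi) = \sum_{l=0}^{L} \left\{ U_l^0 P_l(\cos\theta) + \sum_{m=0}^{l} \left[ U_l^m P_l^m(\cos\theta) \cos m\phi + V_l^m P_l^m(\cos\theta) \sin m\phi \right] \right\} \tag{9.51} \]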
where the coefficients are found by the process of fitting this form to the data.
9.11 Superquadrics and hyperquadrics
We discussed in section 8.6 the problem of segmenting a range image using the
surface function and how to fit functions to data. In this section, we describe how to
fit superquadrics and hyperquadrics to range data.
\[ \sum_{i=1}^{N} |A_i x + B_i y + C_i z + D_i|^{\gamma_i} = 1 \tag{9.53} \]
\[ E_{OF} = \sum_{i=1}^{N} (1 - F(x_i, y_i, z_i))^2 \tag{9.54} \]
are presumed to be a good fit. (Since, at every point on the surface F, the value of F
(x, y, z) is supposed to be 1.0.) Observe that this minimizes the algebraic distance
(see section 8.6.1) from the point to the surface! This distance will be zero if the
point lies on the surface, but otherwise, there is not a simple relationship between the
Euclidean distance from the point to the surface and the algebraic distance. Kumar
et al. observe: “This function is biased, especially for oblong objects.” Similar
complaints exist for essentially all applications of the algebraic distance.
To get a somewhat better fit, in the case of hyperquadrics, the following approach
can be used. Suppose we have an initial estimate of the surface, an estimate which
is not too bad, but might not be the best estimate. Let that estimate be defined by a
set of parameters, A, B, C, and D, defining a function F(x, y, z). For a particular
point $(x_i, y_i, z_i)$, substitute these values into F to determine $w_i = F(x_i, y_i, z_i)$. If the
point is actually on the surface, then $w_i$ will be equal to 1. Now, consider the surface
defined by $F(x, y, z) = w_i$. The distance normal to this surface can be approximated
by the distance $d_i$ in the direction of the gradient,
so
\[ d_i = \frac{1 - F(x_i, y_i, z_i)}{\lVert \nabla F(x_i, y_i, z_i) \rVert}. \tag{9.56} \]
This di is really what we want to minimize. It is the distance from a surface through
the point (xi , yi , z i ) which is in some sense parallel to the surface we wish to de-
termine. These are estimates only; it remains to iteratively refine these estimates
until they become the real solutions. To accomplish that, rewrite the squared error in
terms of d,
\[ E = \sum_{i=1}^{N} \frac{(1 - F(x_i, y_i, z_i))^2}{\lVert \nabla F(x_i, y_i, z_i) \rVert^2} \tag{9.57} \]
and minimize this objective function in the following way.
(1) First, determine an initial estimate, probably by numerically finding the value
which minimizes the EOF of Eq. (9.54). This will not be a bad fit, but better fits
are possible because of the bias.
(2) Compute, for each data point, $w_i = 1/\lVert \nabla F(x_i, y_i, z_i) \rVert$.
(3) Minimize $\sum_{i=1}^{N} (w_i (1 - F(x_i, y_i, z_i)))^2$, holding the weights $w_i$ fixed.
(4) If the solution is good enough, stop; otherwise go to (2).
Observe that the general idea presented here for fitting an implicit function is
applicable to more than just hyperquadrics. The idea of minimizing a function normal
to the gradient of a curve through the data points fits many (probably most) problems
in fitting of explicit functions.
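The four-step iteration can be illustrated on a much simpler implicit function than a hyperquadric. The sketch below is an assumption-laden toy: it fits F(x, y) = A x^2 + B y^2 to 2D points lying near the curve F = 1, takes the weights to be the reciprocal of the gradient magnitude, and replaces the "good enough" test of step (4) with a fixed number of passes. Because this F is linear in A and B, step (3) reduces to a 2 x 2 weighted least-squares problem. The function names are hypothetical.

```python
import math

def weighted_fit(points, weights):
    """Minimize sum (w_i (1 - A x_i^2 - B y_i^2))^2 over A and B."""
    s11 = s12 = s22 = t1 = t2 = 0.0
    for (x, y), w in zip(points, weights):
        u, v, w2 = x * x, y * y, w * w
        s11 += w2 * u * u
        s12 += w2 * u * v
        s22 += w2 * v * v
        t1 += w2 * u
        t2 += w2 * v
    # Solve the 2x2 normal equations by Cramer's rule.
    det = s11 * s22 - s12 * s12
    return (t1 * s22 - t2 * s12) / det, (s11 * t2 - s12 * t1) / det

def fit_ellipse(points, iterations=10):
    # Step (1): algebraic-distance fit, i.e. all weights equal to 1.
    a, b = weighted_fit(points, [1.0] * len(points))
    for _ in range(iterations):
        # Step (2): w_i = 1 / |grad F|, with grad F = (2 A x, 2 B y).
        weights = [1.0 / math.hypot(2 * a * x, 2 * b * y) for x, y in points]
        # Step (3): minimize again with the weights held fixed.
        a, b = weighted_fit(points, weights)
    return a, b

# Noise-free samples of x^2/9 + y^2 = 1, so A = 1/9 and B = 1 are expected.
samples = [(3 * math.cos(k * math.pi / 4), math.sin(k * math.pi / 4))
           for k in range(8)]
a_hat, b_hat = fit_ellipse(samples)
```

With exact data every weighting returns the same answer; the reweighting only matters for noisy points, where it reduces the bias of the plain algebraic-distance fit toward oblong shapes.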
Dickinson et al. [13.11] combine superquadric representations with the concept
of aspect graphs.
A cylinder may be described as a circle which translates along a straight line in space,
with the plane of the circle perpendicular to the line. Now, suppose that the line is
allowed to bend in space, in fact to become an arbitrary space curve, parameterized
by arc length, s. Then, the line becomes a vector function x(s), y(s), z(s) of s. Next,
allow the radius of the circle to vary with the point along the curve, R = R(s), and
you have some idea of what a generalized cylinder is [9.4, 9.21, 9.23, 9.74, 9.75,
9.98]. However, the concepts of GCs are more general even than described above.
The object translated does not have to be a circle; it can be any 2D shape.
If we can fit GCs to regions, we can then use the vector function of the line and the
radius function as features to describe the shape of the region. However, there are
significant challenges to fitting GCs to images. We will not pursue the GC idea any
further in this book. The reader can find many interesting papers in the literature, a
few of which are listed in the previous paragraph.
9.13 Conclusion
In this chapter, several features were defined that could be used to quantify the shape
of a region. Some, like the moments, are easy measurements to make. Others, such as
• Constrained minimization. In section 9.2.2 a derivation is presented which finds the straight
line which best fits a set of points, in the sense that the sum of squared perpendicular
distances is minimized. To accomplish that, we were required to use constrained minimization
with Lagrange multipliers.
• Integral squared error. In section 9.8 we find a deformation to a template by performing a
minimization of an integral squared error.
• Pseudo-inverse minimizes the squared error. In section 9A.2, we will encounter a problem
requiring “inversion” of a nonsquare matrix. Of course, one cannot formally invert such a
matrix, and instead we derive the “pseudo-inverse.” We also show that the pseudo-inverse is
really a minimum-squared-error algorithm.
9.14 Vocabulary
Affine transform
Aspect ratio
Basis vector
Center of gravity
Chain code
Compactness
Convex hull
Convex discrepancy
Deformable template
Diameter
Fourier descriptor
Generalized cylinder
Homogeneous transformation matrix
Invariant moment
K–L transform
Linear transformation
Medial axis
Metric
Moment
Orthogonal transformation
Principal component
Similarity transform
Thinness
Topic 9A Shape description
This technique is easy to program and converges rapidly. We have had good luck with it.
Unfortunately, it is not guaranteed to converge to the global extrema, although there is a high
likelihood that it will do so.
An extension of the previous algorithm provides a strategy which is guaranteed to converge.
In addition, this algorithm provides a mechanism for rapidly narrowing the search space.
First, define a linear search function M(Pi, R) which returns the point Pi+1 = M(Pi, R)
such that ∀x ∈ R, d(Pi, x) ≤ d(Pi+1, Pi).
Choose an arbitrary point P1 ∈ R, and find P2 = M(P1, R − {P1}). Next, find P3 = M(P2, R −
{P1, P2}). The relationship between d(P1, P2) and d(P2, P3) can be one of the following three
cases, which are illustrated in Figs. 9.15–9.17:
In Case 3, we compute a new P3, called P3′, so that d(P2, P3′) = d(P2, P3) and P1, P2, and
P3′ are collinear. We therefore have a symmetrical lens shape which encloses all the points in
R using at most two linear searches.
Fig. 9.18. In this figure, B2 is one of the extrema. The other extreme is an element of R11
(redrawn from [9.71]).
Our heuristic states that the “best” direction in which to search is perpendicular to P1, P2
(or P2, P3 in Case 3).
Compute the apexes α1 and α2 as shown in Fig. 9.18. Then find B1 = M(α1, R − {P1, P2}).
If d(α1, B1) ≤ d(P1, P2), stop; otherwise partition R − {P1, P2} into two mutually exclusive
regions R1 and R2, where R1 = {x ∈ R − {P1, P2} | d(x, α1) > d(P1, P2)}. Find
B2 = M(α2, R − R1), and find R2 = {x ∈ R − {P1, P2} − R1 | d(x, α2) > d(P1, P2)}.
Note that: (1) In general, R1 ∪ R2 ⊂ R. However, R − {R1 ∪ R2} contains no points of
interest. (2) If either R1 = ∅ or R2 = ∅, we can stop and P1, P2 is the diameter.
If both R1 and R2 are nonempty, we can define them as “antipodal regions” in the following
sense: If a diameter exists greater than d(P1, P2), one endpoint must lie in R1 and the other
in R2, which effectively cuts the search space in half.
Two new points, β11 and β12, are computed by swinging an arc with center at α1 and radius
d(α1, B1) to intersect the lens. Similarly, β21 and β22 are computed.
Note that d(β21, β12) = d(β11, β22) and this is an upper bound on the diameter. In a digital
picture, having an upper limit on the diameter may allow earlier stopping of the algorithm,
since if we know a diameter candidate, and we know the upper bound, if these two values
differ by less than 1.414 (the diagonal of a pixel), there is no need to search further.
Compute r = MAX(d(B1, B2), d(P1, P2)). Use this as a radius with which to draw arcs.
Using β21 as center and r as radius, partition R1 into two regions, R11 and R1 − R11. Similarly,
use β22 as center and find R12 ⊂ R1. Note that R1 − {R11 ∪ R12} contains no points of interest.
Similarly, find R21 and R22 as shown. If R21 = ∅ and R11 ∩ R12 = ∅, then R21 and R12 is an
antipodal pair as is R11 and R22. In any case, R21 ∪ R22 is antipodal with R11 ∪ R12. These
antipodal regions will be our search space in the next phase.
The strategy to this point has either identified the extrema or it has provided other useful
results, specifically:
At this point, if fewer than K points remain (which is most often the case), the most appropriate
technique is to compute convex hulls. (The optimal choice of K is a function of region
topology. It has been our experience that K = 50 seems to work well.) This computation is
aided by the observations that:
If, on the other hand, many more points remain, the algorithm can be invoked recursively
using the antipodal pairs of regions as subject areas, and choosing new starting points as
those points closest to β21 and β12, or β11 and β22.
For R with n points, the blind exhaustive search takes O(n²) distance calculations and
comparisons. Our search algorithm is exhaustive (hence guarantees convergence), but intelli-
gent by taking the global shape of the region R into consideration. The initial search space R
is sequentially divided into smaller mutually exclusive subspaces by eliminating those points
that cannot be the end points of the extrema.
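For reference, the linear search function M and the blind exhaustive search that the partitioning strategy is designed to avoid can be sketched as follows. The function names are hypothetical, the region is assumed to be a list of distinct (x, y) tuples, and the lens and antipodal-region steps of the full algorithm are deliberately not reproduced here.

```python
import math
from itertools import combinations

def farthest(p, region):
    """The linear search M(P, R): the point of `region` farthest from p."""
    return max(region, key=lambda x: math.dist(p, x))

def diameter_exhaustive(region):
    """Blind O(n^2) search over all pairs; returns the farthest pair."""
    return max(combinations(region, 2), key=lambda pair: math.dist(*pair))

def initial_candidates(region):
    """P1 arbitrary, P2 = M(P1, R - {P1}), P3 = M(P2, R - {P1, P2})."""
    p1 = region[0]
    p2 = farthest(p1, [x for x in region if x != p1])
    p3 = farthest(p2, [x for x in region if x not in (p1, p2)])
    return p1, p2, p3

region = [(0, 0), (1, 0), (5, 0), (2, 3), (2, -1)]
pair = diameter_exhaustive(region)
p1, p2, p3 = initial_candidates(region)
```

Each call to `farthest` costs one pass over the points, so the two linear searches that produce P2 and P3 are cheap compared with the all-pairs search, which is useful mainly as a check on small regions.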
Although the number of subspaces is increased by two for each recursive call of the
procedure, there are many more points that are eliminated from the search space after each
call. Therefore, the search space is rapidly decreasing. How rapid the decreasing rate is
depends on the shape of R.
This method is derived from geometric considerations and consequently its rate of con-
vergence is strongly dependent on the geometry of the region. It is, therefore, difficult to
accurately assess the computational complexity of the method. If used as a pre-processing
technique, it operates in O(4n) time, plus the time required to perform the convex hull cal-
culation on the remaining points, or O(k log k) where k represents the points remaining after
pre-processing.
Certainly in the worst case, almost no points are eliminated and convergence operates
as convex hull in O(n log n). It is in fact worse than the convex hull in this case, since the
program is more complex.
However, due to the large number of branch points, the algorithm exits early for virtually
all regions and it converges very rapidly.
Many papers have been written which extract three-dimensional shape from various sources:
From silhouettes [9.42, 9.43, 9.44, 9.49, 9.101]; from images of specular reflectors [9.64];
from three orthogonal projections (x-ray projections) [9.81]; making use of the assumption
that objects tend to have orthogonality [9.22] or symmetry [9.18]. Ultimately, in all these
algorithms one must address the question of visibility [9.83].
Range images, one might think, already contain a complete description of three-dimen-
sional shape, but, of course, you cannot see the entire surface in one image [9.31]. A tough
problem is how to integrate several range images to form one description of a three-dimen-
sional object [9.76, 9.93].
Even though you may have segmented surfaces with some success, those segmentations are
almost never perfectly correct. One might think that the equations describing the intersection
of edges would be straightforward to find. After all, you have the equations of the surfaces
whose intersections determine the edges. Just calculate the intersection! Ah, but, it is never
quite that easy. The problems occur when you have vertices, the intersections of edges: the
trihedral or multihedral intersection points. Those equations you just derived never intersect
at a point. (They do not intersect because they are in 3D space.) Hoover et al. [9.31] address
this problem, and extend the solution to nonvisible surfaces.
Another important issue in extracting three-dimensional shape is what representation you
choose, including shape from perspective, shape from shading, shape from texture, etc.
n pairs of corresponding angles are observed. Analytic solutions have been determined for
P3P [9.17], P3L [9.14], and PnA [9.95]. A clear explanation of the linear approaches to using
uncalibrated cameras (but assuming correct solutions to the correspondence problem) may
be found in [9.26].
Some work [9.34] has been done on the harder version of the correspondence problem, the
case that the cameras are not only uncalibrated, but are oriented in such a way that the epipolar
assumption is not necessarily valid. In this case the stereo matching problem becomes one of
search for best matching pairs, using both radiometric and geometric information to narrow
the search.
Shape from shading was first introduced by Horn, who argued that some knowledge of how
light is generated, reflected, and observed could substantially improve the performance of
machine vision systems. Consider Fig. 9.19, and assume you know:
• the angle of the light source
• the angle of the observer
• the measured brightness of the pixel
• a law governing how light is scattered
• the albedo of the surface.
Can you find the surface normal? (If you have the normal at every point, how would you
determine the surface?)
The solution may be found by solving differential equations. We start by writing the surface
normal vector as $n = r/|r|$, where the direction vector
\[ r = \left[ \frac{\partial z}{\partial x}, \frac{\partial z}{\partial y}, 1 \right]^T. \]
In most shape-from-shading literature, the partial derivatives are abbreviated p ≡ ∂z/∂ x,
q ≡ ∂z/∂ y.
Despite the fact that we use the term “brightness” constantly, it actually does not have
a rigorous physical definition. Following Horn [9.32], we define irradiance as the power
per unit area falling on a surface, measured in watts per square meter. Then, we can define
radiance as the power per unit foreshortened area per unit solid angle. This dependence on
foreshortened area makes it clear that the angle of observation plays an important role in
scene “brightness.”
Fig. 9.19. Light strikes a surface at an incident angle, relative to the surface normal, n, and is
reflected/scattered in another direction.
Often, the “reflectivity model” of a surface is known, or may be measured. For example,
the brightness observed might be independent of the angle of observation, and depend only on
the angle of incidence. The reflected brightness might then be related to the incident
brightness with a relationship such as
\[ R = a I \cos\theta_I. \tag{9.59} \]
Thus, if we know the incident brightness, I, the albedo (how the surface is painted), a, and
the reflected brightness, R, we should be able to solve for θI, and from that, infer the surface
normal, and from that the surface. The reflectivity function of Eq. (9.59) is known as a
Lambertian model. Note that the angle of observation does not enter the Lambertian model.
Another familiar reflectivity function is the specular model
which describes mirrors: you only get a reflection if the angle of observation equals the
angle of incidence. Of course most surfaces, even “shiny” ones, are not perfect specular
reflectors, and a more realistic model for a mixed surface might be
Although the use of reflectivity functions requires radiometric calibration of cameras [9.28],
that requirement in itself is not the major difficulty. To see the complexity of the problem
[9.51], let us expand Eq. (9.59) in terms of the elements of the observation vector and the
normal vector.
\[ R(x, y) = a I(x, y) \cos(\mathbf{I} \cdot \mathbf{N}) = a I(x, y) \cos\left( I_x \frac{\partial z}{\partial x} + I_y \frac{\partial z}{\partial y} + I_z N_z \right). \tag{9.62} \]
Assuming we know the angle of observation (which we will actually know only approximately
at best), the albedo and the incident brightness, we still have a partial differential equation
which we must solve to determine the surface function z.
Many papers and Horn’s classic text [9.32] address approaches for solving various special
cases of the shape-from-shading problem. A recent paper by Zhang et al. [9.100] surveys
the field up to 1999. (Remember, Equation (9.62) is itself a special case – it assumes the
brightness does not depend on the observation angle.) In the following, we discuss another
special case, photometric stereo.
Photometric stereo
In many cases, it is reasonable to model the reflectivity of a surface as proportional to the
cosine of the angle between the surface normal vector and the illumination vector:
\[ I(x, y) = r_0 (N_i \cdot n) \tag{9.63} \]
where Ni is a unit vector in the direction of light source i. If we are fortunate enough to
have an object, a Lambertian reflector, which satisfies this equation, and which possesses the
same albedo (r0 ), independent of the illumination, we can make use of multiple pictures from
multiple angles to determine the surface normal [9.35, 9.94]. Let us illuminate a particular
pixel with three different light sources (one at a time) and measure the brightness of that pixel
each time. At that pixel, we construct a vector from the three observations
I = [I1 , I2 , I3 ]T . (9.64)
We know the direction of each light source. Let those directions be denoted by unit vectors
from the surface point toward the light source, N 1 , N 2 , and N 3 . Write those three direction
vectors in a single matrix by making each vector a row of the matrix.
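\[ N = \begin{bmatrix} N_1 \\ N_2 \\ N_3 \end{bmatrix} = \begin{bmatrix} n_{11} & n_{12} & n_{13} \\ n_{21} & n_{22} & n_{23} \\ n_{31} & n_{32} & n_{33} \end{bmatrix}, \tag{9.65} \]
and now we have a matrix version of Eq. (9.63):
\[ I = r_0 N n. \tag{9.66} \]
\[ r_0 = |N^{-1} I|, \tag{9.67} \]
\[ E = I^T I - 2 n^T N^T I + n^T N^T N n. \tag{9.70} \]
We wish to find the surface normal vector n which minimizes this sum-squared difference E,
so we differentiate E with respect to n,
\[ \nabla_n E = -2 N^T I + 2 N^T N n \tag{9.71} \]
and set the result to zero, which gives $N^T N n = N^T I$,
or
\[ n = (N^T N)^{-1} N^T I. \tag{9.73} \]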
(In case you have not recognized it yet, the pseudo-inverse just appeared.) That was a lot of
work. Let’s see if there is an easier way:
Go back to Eq. (9.66), again omitting the $r_0$ for clarity, and multiply both sides by $N^T$.
\[ N^T I = N^T N n. \tag{9.74} \]
Multiply both sides by $(N^T N)^{-1}$ and we find the same result as Eq. (9.73).
So why did we go to so much trouble – all the work of Eqs. (9.69) through (9.72) seems
like a waste. Ah, but there is madness in our method! We have now demonstrated to you
that multiplying by the pseudo-inverse $(N^T N)^{-1} N^T$ produces the minimum squared error
estimate of an overdetermined linear system. That is a significant result; important in the case
of photometric stereo, and in many other applications as well.
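The pseudo-inverse solution can be sketched for a single pixel as follows. This is a minimal illustration under assumptions, not the authors' code: the lights are given as unit vectors, at least three of them are not coplanar, and the albedo is kept in the unknown (solving for g = r0 n and splitting it afterwards) rather than omitted. The small 3 x 3 solver and both function names are hypothetical helpers.

```python
import math

def solve3(a, b):
    """Solve the 3x3 system a x = b by Gaussian elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * 3
    for r in reversed(range(3)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (m[r][3] - tail) / m[r][r]
    return x

def photometric_stereo(lights, intensities):
    """Least-squares unit normal and albedo from k >= 3 light directions.

    Solves (N^T N) g = N^T I for g = r0 * n, which is the pseudo-inverse
    solution, then splits g into its length r0 and its direction n.
    """
    ntn = [[sum(l[i] * l[j] for l in lights) for j in range(3)]
           for i in range(3)]
    nti = [sum(l[i] * v for l, v in zip(lights, intensities))
           for i in range(3)]
    g = solve3(ntn, nti)
    r0 = math.sqrt(sum(c * c for c in g))
    return [c / r0 for c in g], r0

# Synthetic pixel: known normal and albedo, four light directions.
true_n, true_r0 = (0.0, 0.6, 0.8), 0.5
lights = [(0.0, 0.0, 1.0), (0.6, 0.0, 0.8), (0.0, 0.6, 0.8), (-0.6, 0.0, 0.8)]
observed = [true_r0 * sum(l[i] * true_n[i] for i in range(3)) for l in lights]
normal, albedo = photometric_stereo(lights, observed)
```

With four lights the system is overdetermined, so the normal equations do real work here; with exactly three lights the same code reduces to inverting N directly.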
Fig. 9.20. A given measured brightness pair can be created by a locus of p, q values in both SE
and BSE images. (The authors are grateful to B. Karacali for this figure.) [Plot of q versus p
showing the SE loci and the BSE loci.]
pair and the vector which is normal to the surface z(x, y), and write that difference as
\[ d_i(x, y) = \left\lVert \left( \frac{\partial z}{\partial x}, \frac{\partial z}{\partial y} \right), (p_i, q_i) \right\rVert. \]
Finally, assuming that the two curves of Fig. 9.20 intersect at m points (we make m a function
of x, y to remind the reader that all this is being done for a single x, y point), we define an
objective function as
\[ E = \sum_{x,y} \left( \sum_{i}^{m(x,y)} (d_i(x, y))^{-1} \right)^{-1} + R, \tag{9.75} \]
In section 4.2.2, the basic concepts of structured illumination were introduced. The key point
is that by controlling the lighting, one or more of the unknowns in the stereopsis problem
may be eliminated. Let’s look at an example in more detail to see how this might work.
Fig. 9.21. The presence of a blade translates the location of the light stripe vertically in the
image.
The problem to be solved is an application in robot vision: A robot is to pick up shiny,
metallic turbine blades from a rack and place them into a machine for further processing. In
order to locate the blades, a horizontal slit of light is projected onto the scene, by passing a
laser beam through a cylindrical lens. The geometry of the resulting images is illustrated in
Fig. 9.21 [9.57]. If there were no blade in the image, the light stripe from the laser would
form a horizontal line in the image as it is reflected off the cart. The presence of the blade
causes a vertical translation of the light stripe. The number of lines of vertical displacement is
directly proportional to an angular difference which produces the angle . Knowing the two
angles and the distance h between the camera and the projector allows a simple calculation
of the distance z:
h tan
z= . (9.76)
tan + tan
Although this relationship is relatively simple, it turns out to be both simpler and more
accurate to simply keep a lookup table of z vs. row displacement.
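The lookup-table idea can be sketched in a few lines. This is only an illustration: the calibration pairs below are hypothetical values, not measurements from the turbine blade system, and linear interpolation between table entries is an assumption about how intermediate displacements might be handled.

```python
import numpy as np

# Hypothetical calibration table: stripe displacement (in image rows) measured
# for targets placed at known distances z. Real values would come from calibration.
calib_rows = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
calib_z = np.array([0.0, 4.8, 9.9, 15.3, 21.0])

def depth_from_displacement(row_displacement):
    """Return z for a measured row displacement by table lookup.

    Values between table entries are linearly interpolated; values outside
    the calibrated range are clamped to the end points by np.interp.
    """
    return np.interp(row_displacement, calib_rows, calib_z)
```

A table of this kind absorbs lens distortion and mounting errors that the closed-form triangulation ignores, which is one plausible reason a table can be more accurate in practice.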
One practical difficulty which arises in this application is the specular nature of the reflections
from the turbine blades; the bright spots may be orders of magnitude brighter than the rest of
the image. This is dealt with by passing the beam through a polarizing filter. By placing another
such filter on the camera lens, the specular spots are considerably reduced in magnitude.
In this example of the use of structured illumination, only one light stripe was projected
at a time, so there was no ambiguity about which bright spot resulted from which projector;
however, in the more general case, one may have multiple light sources, and then some
method for disambiguation is required [9.7, 9.50].
Fig. 9.22. Variations in texture can provide information about shape (from [9.13]).
© 2003 Artists Rights Society (ARS), New York / ADAGP, Paris.
That one can obtain range from focus is obvious. However, the extension to robust shape
from focus is difficult. The principal problem is determining precisely when each pixel is in
focus [9.52, 9.79].
Motion analysis may be viewed as two different problems. The first is the case when the
camera is moving and the objects in the world are stationary. In this case, the extraction of
camera motion is the challenge. In the second case, the camera is stationary, and objects in the
world move. Finally, there is the combination of the two – where both the camera and some
objects in the world are moving. Motion analysis has many of the same issues as stereo. As
in stereo, correspondence is a major problem, but in this case, the correspondence is between
scenes which differ in time rather than in the spatial location of the camera.
One approach to the motion analysis problem is referred to as “optic flow.” Consider two
images of the same object. Assume the image in frame 2 is the same as the image in frame
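1, but displaced by δ:
$$ f_1(x + \delta) = f_2(x). \tag{9.77} $$
Expanding the left-hand side to first order, $f_1(x + \delta) \approx f_1(x) + \delta f_1'(x)$, and solving
for δ gives
$$ \delta = \frac{f_2(x) - f_1(x)}{f_1'(x)}. \tag{9.79} $$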
There is a serious problem here. What happens if the gradient is zero? The gradient being
zero is really saying that there is no information at this point in the image. Imagine you are
looking through a telescope at a semi-tractor–trailer truck which is passing by, and the field
of view of your telescope allows you to see only a small area, a few square inches, of the
truck. When the bumper passes, you have information, and you know motion has occurred.
However, as the top of the trailer is passing, you see no changes for a relatively long time.
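The zero-gradient problem is easy to see in a one-dimensional sketch of Eq. (9.79). The function name, the gradient threshold, and the choice to mark uninformative samples with NaN are illustrative assumptions, not part of the original text.

```python
import numpy as np

def displacement_1d(f1, f2, min_gradient=1e-3):
    """Estimate the per-sample displacement between two 1D signals.

    Implements delta = (f2 - f1) / f1' with f1' estimated by finite
    differences. Where |f1'| is below min_gradient the signal carries no
    motion information, so the estimate is left as NaN.
    """
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    grad = np.gradient(f1)                 # finite-difference estimate of f1'(x)
    delta = np.full(f1.shape, np.nan)
    informative = np.abs(grad) > min_gradient
    delta[informative] = (f2[informative] - f1[informative]) / grad[informative]
    return delta
```

For a linear ramp shifted by half a sample, every estimate is 0.5; for a constant signal (the top of the trailer), every estimate is NaN.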
Another troublesome issue that arises in computing and applying optic flow is that in 2D,
Eq. (9.79) becomes a differential equation (see [9.10]). To deal with these problems, researchers
in optic flow have tried various ways to combine information from local measurements
to infer global knowledge, such as using clustering to identify sets of points which are
moving together [9.41]. As discussed in Chapter 4, one can match either points or boundaries.
For example, Quan [9.61] matches conics; Taylor and Kriegman [9.85] match line segments,
as does Zhang [9.99]; Smith and Nandha Kumar [4.35] match textures. Motion segmentation
is discussed in [4.35, 9.3].
In two dimensions, the δ of Eq. (9.77) becomes a vector. Optic flow algorithms generate
a disparity field, a vector field which associates a vector with each pixel [9.20]. Efficient
implementations for optic flow have been investigated [10.19, 17.13, 17.14], including objects
which deform [17.60].
The optic flow at a point in the image plane of a moving camera is
$$ u_i = \frac{1}{z_i} A_i T + B_i \Omega, \tag{9.80} $$
where A_i and B_i are completely determined by the camera. T is the translational velocity of the
camera and Ω is the rotational velocity of the camera. z_i is the depth to the image point at camera
coordinates (x_i, y_i). The problem of determining T and Ω has been addressed by a number of researchers
[9.36, 9.37]. Earnshaw and Blostein [9.15] compare these methods and introduce a new
variation.
Motion can be estimated from smear. Chen et al. [9.9] make the observation that:
“Psychophysical investigation has shown that the human visual system (HVS) integrates reti-
nal images for 120 ms. Due to the integration, motion smear is inevitable. It has been reported
that when the HVS is presented with an image blurred due to motion, the amount of smear per-
ceived by human observers increases with observing duration at short durations (up to 20 ms).
At longer durations, however, the perceived image becomes sharper. It is our conjecture that
the HVS is performing a deblurring, or sharpening, function on the image.” This observation
leads to a “motion from smear” algorithm [9.9].
Recently, a gradual shift in focus has been observed [9.5] in the motion analysis literature
from analyzing the image or camera motion to labeling the action taking place – from “How
are things moving?” to “What is happening?”
$$ u_{fp} = i_f^T (s_p - t_f), \qquad v_{fp} = j_f^T (s_p - t_f). \tag{9.82} $$
Defining
$$ x_f = -t_f^T i_f, \qquad y_f = -t_f^T j_f, \tag{9.83} $$
$$ u_{fp} = i_f^T s_p + x_f, \qquad v_{fp} = j_f^T s_p + y_f. \tag{9.84} $$
Now we have an inverse problem: We know the image coordinates, and from these we must
determine the spatial position of both the camera and the object points.
Accumulate all the observations of the image points in a matrix:
$$ W = \begin{bmatrix}
u_{11} & \cdots & u_{1P} \\
\vdots & & \vdots \\
u_{F1} & \cdots & u_{FP} \\
v_{11} & \cdots & v_{1P} \\
\vdots & & \vdots \\
v_{F1} & \cdots & v_{FP}
\end{bmatrix}, \tag{9.85} $$
given that there are F frames and P points. We observe that each column in this matrix is
a listing of the image coordinates of a particular point as it is viewed in different frames,
whereas each row is a listing of the locations of all the points in a particular frame. Next, we
define a matrix M which is a 2F × 3 matrix whose rows are the i_f and j_f vectors, and a
matrix S which is a 3 × P “shape matrix” whose columns are the s_p vectors. Finally, define
T to be a 2F-dimensional translation vector with elements x_f and y_f. With these definitions,
we can rewrite Eq. (9.84) for all the points as
$$ W = M S + T\, 1_P, \tag{9.86} $$
where 1_P is a vector of length P, composed of all ones. (Observe that the product of T with
the ROW vector of all ones, an outer product, constructs a matrix of dimension 2F × P.)
[Margin note: Is this parenthetical remark correct?]
If we move the origin to the center of gravity of the object (why not? location of the origin
is arbitrary), we obtain
$$ C = \frac{1}{P} \sum_{p=1}^{P} s_p = 0. \tag{9.87} $$
This location of the origin allows us to solve for T immediately, by observing that since the
sum of any row of S is zero, the sum of any row of W is simply P times the corresponding
element of T, and each element of T can be determined as the sum of the corresponding row of W
divided by P. Now, subtract T from every column of W, producing a new matrix $\tilde{W}$ which satisfies
$$ \tilde{W} = M S. \tag{9.88} $$
Using singular value decomposition, we can find a suitable decomposition of $\tilde{W}$, which we
denote by $\tilde{W} = \hat{M} \hat{S}$. Unfortunately, these are not necessarily the M and S we want, since we
could put any product $A A^{-1}$ between them without changing the value of the product. We
thus search for a matrix A such that
$$ M = \hat{M} A, \qquad S = A^{-1} \hat{S}. \tag{9.89} $$
To find A, we can make use of the fact that the rows of M are the direction vectors of the
camera and are hence orthonormal. With these additional constraints, A is determined, and
we know the positions in 3-space of all the P points as well as knowing the camera angles at
each frame time.
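The first stages of this factorization (removing the translation and taking a rank-3 singular value decomposition) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the synthetic test data are hypothetical, splitting the singular values evenly between the two factors is one convenient choice among many, and the final step of solving for A from the orthonormality constraints is not shown.

```python
import numpy as np

def factor_measurements(W):
    """Split a 2F x P measurement matrix into T and rank-3 factors.

    Returns (T, M_hat, S_hat) with W - T approximately equal to
    M_hat @ S_hat. The factors are only determined up to an invertible
    3 x 3 matrix A.
    """
    W = np.asarray(W, dtype=float)
    # With the object centroid at the origin, each element of T is the
    # mean of the corresponding row of W.
    T = W.mean(axis=1, keepdims=True)
    W_tilde = W - T                        # subtract T from every column
    U, s, Vt = np.linalg.svd(W_tilde, full_matrices=False)
    root = np.sqrt(s[:3])                  # keep the three largest singular values
    M_hat = U[:, :3] * root                # 2F x 3
    S_hat = root[:, None] * Vt[:3, :]      # 3 x P
    return T, M_hat, S_hat

def synthetic_views(n_frames=5, n_points=12, seed=0):
    """Build a noise-free W = M S + T 1_P from random camera axes (test data)."""
    rng = np.random.default_rng(seed)
    S = rng.normal(size=(3, n_points))
    S -= S.mean(axis=1, keepdims=True)     # put the centroid at the origin
    i_rows, j_rows = [], []
    for _ in range(n_frames):
        Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
        i_rows.append(Q[0])                # two orthonormal rows play the
        j_rows.append(Q[1])                # roles of i_f and j_f
    M = np.vstack(i_rows + j_rows)
    T = rng.normal(size=(2 * n_frames, 1))
    return M @ S + T, T
```

On noise-free data the rank-3 product reproduces the translation-free measurements exactly; with noisy measurements, truncating to three singular values gives a least-squares rank-3 approximation.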
9A.4 Vocabulary
Irradiance
Optic flow
Perspective
Photometric stereo
Pseudo-inverse
Reflectivity
Shape from shading
Structured illumination
Assignment 9.1
For each feature described in section 9.3, determine if that
feature is invariant to: (1) rotation in the viewing plane;
(2) translation in that plane; (3) rotation out of that plane
(an affine transform if the object is planar); and (4) zoom.
Assignment 9.2
Let the Euclidean distance between two points be denoted by
the operator d(P1 ,P2 ) (you may want to use this in the next
problem). Design a monotonic metric R(P1 ,P2 ) which maps all
distances to be between 0 and 1. That is, if d(P1 ,P2 ) = ∞,
then R(P1 ,P2 ) = 1 and if d(P1 ,P2 ) = 0 then R(P1 ,P2 ) = 0.
For the metric you developed, show how you would prove your
measure is a formal metric. Just set up the problem. Extra
credit if you actually do the proof.
Assignment 9.3
Following are five points on the boundary of a region. (1,1),
(2,1), (2,2), (2,4), (3,2). Use eigenvector methods to fit a
straight line to this set of points, thus finding the principal
axes of the region. Having found the principal axes, then
estimate the aspect ratio of the region.
Assignment 9.4
Write the chain code for the figure below.
Assignment 9.5
Discuss the following postulate: Let P1 and P2 be the two ex-
trema of a region which determine the diameter. Then P1 and
P2 are on the boundary of the region.
Assignment 9.6
Starting with Eq. (9.19), prove Eq. (9.20).
Assignment 9.7
What is the difference between the intensity axis of symmetry
and the medial axis?
Assignment 9.8
In Table 9.1, prove that the invariant moment 1 is invariant
to zoom.
Assignment 9.9
Your instructor will specify an image containing a single
region with unity brightness and zero background.
Assignment 9.10
Prove that Eq. (9.35) is invariant to: (1) translation;
(2) rotation; and (3) zoom.
Assignment 9.11
Is the caption on Fig. 9.12 correct?
Assignment 9.12
Two silhouettes, A and B, are measured and their boundaries
encoded. Then, Fourier descriptors are computed. The descrip-
tors are as in Table 9.3.
It is possible that these two objects represent similar-
ity transforms of one another. (A similarity transform is
equivalent to a rigid body motion, translation or rotation
only.) Could they be affine transformations of one another?
(An affine transformation is a linear transformation which
includes not only rigid body motion, but also the possibility
for scaling of the coordinate axes. If both axes (in 2D) are
scaled by the same amount, you get zoom. If they are scaled
by different amounts, you get shear.)
If you decide that these two sets of descriptors represent
the same shape, possibly transformed, describe and justify
what type of operations convert A into B. If they are not the
same shape, explain why.
Assignment 9.13
A cylinder with unit radius and height of ten is oriented
vertically about the origin, and is known to have a sur-
face which is a Lambertian reflector. That is, the reflected
brightness is independent of the angle of observation, and
depends on the angle of incidence following the relationship
f = aIcos i, where a is the albedo and I is the brightness of
the source. On this cylinder, the albedo is constant.
The camera is located at x = 0, y = −2, z = 2, and the optical
axis of the camera is pointed at the origin.
[Figure: view of the cylinder showing the y axis, the angle θ, and the annotation “Max brightness observed somewhere along here.”]
Bibliography
[9.29] D. Heeger and A. Jepson, “Subspace Methods for Recovering Rigid Motion I:
Algorithm and Implementation,” International Journal of Computer Vision, 7(2),
pp. 95–117, 1992.
[9.30] D. Helman and J. JáJá, “Efficient Image Processing Algorithms on the Scan Line
Array Processor,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(1), pp. 47–56, 1995.
[9.31] A. Hoover, D. Goldgof, and K. Bowyer, “Extracting a Valid Boundary Representa-
tion from a Segmented Range Image,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 17(9), pp. 920–924, 1995.
[9.32] B.K.P. Horn, Robot Vision, Cambridge, MA, MIT Press, 1986.
[9.33] M. Hu, “Visual Pattern Recognition by Moment Invariants,” IRE Transactions on
Information Theory, 8, pp. 179–187, 1962.
[9.34] X. Hu and N. Ahuja, “Matching Point Features with Ordered Geometric, Rigidity,
and Disparity Constraints,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(10), pp. 1041–1049, 1994.
[9.35] Y. Iwahori, R. Woodham, and A. Bagheri, “Principal Components Analysis and
Neural Network Implementation of Photometric Stereo,” Proceedings IEEE Con-
ference on Physics-Based Modeling in Computer Vision, June, 1995, pp. 117–125,
1995.
[9.36] A. Jepson and D. Heeger, “Linear Subspace Methods for Recovering Translational
Direction,” In Spatial Vision in Humans and Robots, ed. L. Harris and M. Jenkin,
Cambridge, Cambridge University Press, 1993.
[9.37] K. Kanatani, “Unbiased Estimation and Statistical Analysis of 3-D Rigid Motion
from Two Views,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
15(1), pp. 37–50, 1993.
[9.38] K. Kanatani, “Comments on ‘Symmetry as a Continuous Feature’,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 19(3), pp. 246–247, 1997.
[9.39] H. Kauppinen, T. Seppänen, and M. Pietikäinen, “An Experimental Comparison of
Autoregressive and Fourier-based Descriptors in 2D Shape Classification,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(2), pp. 201–207,
February, 1995.
[9.40] R. Kimmel, A. Amir, and A. Bruckstein, “Finding Shortest Paths on Surfaces Us-
ing Level Sets Propagation,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 17(6), pp. 635–640, 1995.
[9.41] D. Kottke and Y. Sun, “Motion Estimation via Cluster Matching,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(11), pp. 1128–1132, 1994.
[9.42] A. Laurentini, “The Visual Hull Concept for Silhouette-based Image Understanding,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2), pp. 150–
162, 1994.
[9.43] A. Laurentini, “How Far 3D Shapes can be Understood from 2D Silhouettes,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(2), pp. 188–195,
1995.
[9.44] S. Lavallée and R. Szeliski, “Recovering the Position and Orientation for Free-form
Objects from Image Contours Using 3D Distance Maps,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(4), pp. 378–390, 1995.
[9.45] S. Liao and M. Pawlak, “On Image Analysis by Moments,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 18(3), pp. 254–266, 1996.
[9.46] R. Malladi, J. Sethian, and B. Vemuri, “Shape Modeling with Front Propagation: A
Level Set Approach,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 17(2), pp. 158–175, 1995.
[9.47] C. Manhaeghe, I. Lemahieu, D. Vogelaers, and F. Colardyn, “Automatic Initial
Estimation of the Left Ventricular Myocardial Midwall in Emission Tomograms
using Kohonen Maps,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 16(3), pp. 259–266, 1994.
[9.48] J. Michel, N. Nandhakumar, and V. Velten, “Thermophysical Algebraic Invariants
from Infrared Imagery for Object Recognition,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 19(1), pp. 41–51, 1997.
[9.49] F. Mokhtarian, “Silhouette-based Isolated Object Recognition through Curvature
Scale Space,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(5), pp. 539–544, 1995.
[9.50] R. Morano, C. Ozturk, R. Conn, S. Dubin, S. Zietz, and J. Nissano, “Structured Light
using Pseudorandom Codes,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 20(3), pp. 322–327, 1998.
[9.51] S. Nayar and R. Bolle, “Reflectance Based Object Recognition,” International Jour-
nal of Computer Vision, 17(3), pp. 219–240, 1996.
[9.52] S. Nayar and Y. Nakagawa, “Shape from Focus,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(8), pp. 824–831, 1994.
[9.53] T. Nguyen and B. Oommen, “Moment-preserving Piecewise Linear Approximations
of Signals and Images,” IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 19(1), pp. 84–91, 1997.
[9.54] L. O’Gorman, “Subpixel Precision of Straight-edged Shapes for Registration and
Measurement,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
18(7), 1996.
[9.55] F. O’Sullivan and M. Qian, “A Regularized Contrast Statistic for Object Bound-
ary Estimation – Implementation and Statistical Evaluation,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 16(6), pp. 561–570, 1994.
[9.56] M. Oren and S. Nayar, “A Theory of Specular Surface Geometry,” International
Journal of Computer Vision, 24(2), pp. 105–124, 1996.
[9.57] N. Page, W. Snyder, and S. Rajala, “Turbine Blade Image Processing System.”
In Advanced Software in Robotics, ed. A. Danthine, Amsterdam, North-Holland,
1984.
[9.58] E. Pauwels, P. Fiddelaers, and L. Van Gool, “Enhancement of Planar Shape Through
Optimization of Functionals for Curves,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(12), 1995.
[9.59] S. Pizer, C. Burbeck, J. Coggins, D. Fritsch, and B. Morse, “Object Shape Before
Boundary Shape: Scale Space Medial Axis,” Journal of Mathematical Imaging and
Vision, 4, pp. 303–313, 1994.
[9.60] C. Poelman and T. Kanade, “A Paraperspective Factorization Method for Shape and
Motion Recovery,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
19(3), pp. 206–218, 1997.
[9.61] L. Quan, “Conic Reconstruction and Correspondence from Two Views,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(2), 1996.
[9.62] I. Rothe, H. Süsse, and K. Voss, “The Method of Normalization to Determine
Invariants,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4),
1996.
[9.63] G. Sandini and V. Tagliasco, “An Anthropomorphic Retina-like Structure for Scene
Analysis,” Computer Graphics and Image Processing, 14, pp. 365–372, 1980.
[9.64] H. Schultz, “Retrieving Shape Information from Multiple Image of a Specular
Surface,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2),
pp. 195–201, 1994.
[9.65] E. Schwartz, “Computational Anatomy and Functional Architecture of Striate Cortex,
Spatial Mapping Approach to Perceptual Coding,” Vision Research, 20, pp. 645–669,
1980.
[9.66] J. Sethian, “Curvature and Evolution of Fronts,” Communications in Mathematical
Physics, 101, pp. 487–499, 1985.
[9.67] J. Sethian, “Numerical Algorithms for Propagating Interfaces: Hamilton–Jacobi
Equations and Conservation Laws,” Journal of Differential Geometry, 31, pp. 131–
161, 1990.
[9.68] M. Shamos, “Geometric Complexity,” 7th Annual ACM Symposium on Theory of
Computing, May, 1975, Albuquerque, NM, pp. 224–233, 1975.
[9.69] D. Sinclair and A. Blake, “Isoperimetric Normalization of Planar Curves,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(8), pp. 769–777,
1994.
[9.70] S. Smith and J. Brady, “ASSET-2: Real-time Motion Segmentation and Shape Track-
ing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8), 1995.
[9.71] W. Snyder and I. Tang, “Finding the Extrema of a Region,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 2, pp. 266–269, 1980.
[9.72] S. Soatto and P. Perona, “Reducing ‘Structure from Motion’: A General Frame-
work for Dynamic Vision.1. Modeling,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 20(9), pp. 933–942, 1998.
[9.73] S. Soatto and P. Perona, “Reducing ‘Structure from Motion’: A General Framework
for Dynamic Vision. 2. Implementation and Experimental Assessment,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 20(9), pp. 943–960, 1998.
[9.74] B. Soroka, “Generalized Cylinders from Parallel Slices,” Proceedings of the Con-
ference on Pattern Recognition and Image Processing, 1979.
[9.75] B. Soroka and R. Bajcsy, “Generalized Cylinders from Serial Sections,” 3rd In-
ternational Joint Conference on Pattern Recognition, November, Coronado, CA,
1976.
[9.76] M. Soucy and D. Laurendeau, “A General Surface Approach to the Integration of a Set
of Range Views,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(4), pp. 344–358, 1995.
[9.77] J. Stone and S. Isard, “Adaptive Scale Filtering: A General Method for Obtaining
Shape from Texture,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 17(7), pp. 713–718, 1995.
The single most challenging problem in all of computer vision is the “local/global
inference problem.” As in the fable of the blind men and the elephant, the computer
must, from a set of local measurements, infer the global properties of what is being
observed. In other words, the next level of the machine vision problem is to interpret
the global scene (which is composed of individual objects) using local information
about each object obtained from segmentation and shape analysis as we have dis-
cussed in Chapters 8 and 9. One way to approach the local/global inference problem
is to introduce the concept of consistency.
10.1 Consistency
Let’s begin with some notation: Define a set of objects {x_1, x_2, . . . , x_n}, and a set
of labels for those objects {λ_1, λ_2, . . . , λ_k}, which we assume for now are mutually
exclusive (each object may have only one label) and collectively exhaustive (each
object has a label). Denote a labeling as the ordered pair (x_i, λ_j). By this notation,
we mean that object i has been assigned label j.
As an example of consistent labeling, we will consider the problem of labeling
objects in a line drawing. Researchers have been interested in analysis of line draw-
ings since the beginnings of work in machine vision for three reasons: First, humans
can obviously look at line drawings and make interpretations with ease. Second, psy-
chological experiments [10.1, 10.6, 10.10] have convincingly demonstrated that it is
the points where brightness changes rapidly that convey the most information, and it
is relatively easy to convert edges to lines. Third, a line drawing is a dramatic reduc-
tion in the amount of data (which is not the same as information) in an image, and
perhaps, just perhaps, learning how to process line drawings would make analysis
algorithms run faster. Most of the groundwork in line drawing analysis was done in the
late 1960s and 1970s [10.5, 10.8]; however, progress continues to be made [10.17].
264 Consistent labeling
[Figure 10.1: line drawing whose lines are marked with +, −, and arrow labels.]
Fig. 10.1. A line drawing illustrating convex, concave, and occluding lines, and a labeling of that
drawing.
We will allow each line in such a drawing to be labeled as either convex (an edge
in three dimensions which points toward the observer, such as the corner of a desk),
concave (an edge in three dimensions which points away from the observer, such as
the joint between the wall and the floor of a room), or occluding (the edge occurs
because one surface is partially hidden by another surface). For example, consider the
drawing in Fig. 10.1. In that drawing, lines resulting from convex edges are labeled
with a plus sign, concave edges are labeled with a minus sign, and occluding edges
are labeled with an arrow. The arrow points in the direction such that the occluded
surface is on the left if one moves in the direction of the arrow. You may note that
not all the lines in this drawing have been labeled. That is deliberate. Thus we have
one type of object, lines, and three types of labels for those lines. Our mission is to
learn how to get a computer to do automatically what human beings did so easily
when we interpreted the lines in that figure.
Before we can do that, we need to address one of those unfortunate ambiguities
that natural language introduces into discussions like this. The term labeling may
have several meanings. It may refer to the label assigned to a single object, to a pair
of objects, or to an entire scene. We will try to make sure that the meaning is clear
from context in this discussion. In order to accomplish the objective of labeling a
scene, we must consider something we call consistency. The simultaneous labeling
of two objects is represented by some function which we will call compatibility
and denote by r(i, λ, j, λ′). This function is defined to have a value of 1 if the
two labelings can exist together (mutually compatible) and −1 if they cannot. For
example, r(i, +, i, −) = −1, since the object i cannot be both concave and convex
(consider the drawing in Fig. 10.2). A labeling of an image is said to be complete
and consistent when
$$ \sum_i \sum_{j \neq i} r(i, \lambda, j, \lambda') = n(n - 1), $$
where any label values at all are allowed. In this chapter, we will utilize several
different realizations of this compatibility function.
[Margin note: The compatibility function r(.) will have differing interpretations through this chapter, depending on the specific labeling algorithm.]
[Margin note: Check it out. Did we do the math right? n objects, with each object labeling consistent with the labeling of all other objects.]
Let’s look at the line labeling in more detail to see how this works. Although we
are interested in labeling lines, it will turn out to be useful to think about vertices
as well. A vertex is where lines meet. If each line can have four labels (concave,
convex, arrow in, arrow out), there should be 4^3 = 64 ways to label a vertex with three
lines meeting. It turns out, however, that not all of those combinations are physically
possible. In Figs. 10.3 to 10.5, we illustrate all the “Y”, “ELL”, and “arrow” vertices
which are physically possible. There are a variety of ways we could make use of this
information. One way is to use depth-first search. The algorithm is as follows:
(1) Choose a starting vertex (call it vertex 1) and label all the lines coming in to it
    in a physically possible way.
(2) Choose an adjacent vertex (call it vertex 2) and label all lines coming in to it
    in a physically possible way, such that the labeling is consistent with previous
    labeling. That is, the line from vertex 1 to 2 can only have one label.
(3) If no consistent labeling is possible, back up.
[Figure 10.6: an object with vertices numbered 1 to 4.]
Fig. 10.6. An object which we wish to label consistently.
[Figure 10.7: labeling steps for vertices 1 to 4, showing some of the possible labelings for the lines of vertex 2 and some of the possible labelings for the lines of vertex 3.]
In Fig. 10.7, we illustrate the labeling process of the 3D object shown in Fig. 10.6,
beginning with a choice of one possible labeling for the lines of vertex 1. Given the
labeling of vertex 1, we can choose any of a set of labelings for the lines of vertex 2,
but all of them must have a + sign on the line between 1 and 2, as illustrated on the
second line of the figure. Now, choose one of those labelings of vertex 2 (let’s pick
the one on the left), and label vertex 3 in such a way that it is consistent with the
labeling of both 1 and 2. Now, we need to assume a “correct” interpretation of vertex
3 in order to label vertex 4. Again, choose the one on the left. In order to label the
lines coming into vertex 4, we must choose a labeling which is consistent with the
(assumed correct) labeling of vertex 3 and vertex 1. Since two of the lines coming in
to vertex 4 are now determined, there is only one way to label vertex 4 consistently,
and that is with an arrow coming in on the third line.
Suppose we had reached the labeling of vertex 4, and found there were no physi-
cally possible labelings. Then clearly one of the earlier assumptions was incorrect.
We now “back up,” and choose a new labeling for vertex 3. If we run through
all the possible labelings of the lines of vertex 3 and from none of them can we
find a consistent labeling of 4, then we back up again and choose a new labeling of
vertex 2. We follow this approach until we either find a consistent labeling of the
entire object, or we fail to find one. If we fail, then the object cannot be consistently
labeled [10.7].
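The search described above can be sketched as a small recursive backtracking routine. Everything concrete here is an assumption made for illustration: the data layout, the function name, and especially the toy junction catalog, which is not the catalog of Figs. 10.3 to 10.5. The sketch also visits vertices in the order given rather than choosing an adjacent vertex at each step, and it uses only + and − labels so that arrow direction need not be modeled.

```python
def label_lines(vertices, catalog):
    """Depth-first search for a consistent line labeling.

    vertices: dict mapping vertex name -> (junction type, tuple of line ids)
    catalog:  dict mapping junction type -> set of allowed label tuples,
              one label per line, in the same order as the line ids
    Returns a dict line id -> label, or None if no consistent labeling exists.
    """
    order = list(vertices)

    def extend(k, assignment):
        if k == len(order):
            return assignment              # every vertex has been labeled
        junction, lines = vertices[order[k]]
        for labels in sorted(catalog[junction]):
            # A candidate is consistent if every line that already has a
            # label keeps that label.
            if all(assignment.get(line, lab) == lab
                   for line, lab in zip(lines, labels)):
                result = extend(k + 1, {**assignment, **dict(zip(lines, labels))})
                if result is not None:
                    return result
        return None                        # no candidate worked: back up

    return extend(0, {})

# Hypothetical example: three two-line vertices sharing lines a, b, and c.
toy_vertices = {1: ("ELL", ("a", "b")),
                2: ("ELL", ("b", "c")),
                3: ("ELL", ("c", "a"))}
toy_catalog = {"ELL": {("+", "-"), ("-", "-")}}
```

With this toy catalog the first choice at vertex 1 leads to a dead end at vertex 3, so the search backs up and relabels vertex 1 before it succeeds.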
Since line labeling was originally developed, many researchers have added en-
hancements. For example, Parodi and Piccioli [10.13] demonstrate that if vanishing
points can be determined, and the 3D coordinates of a single point are known, the
3D coordinates of all the other labeled points may be found.
Now that we have seen an application of consistent labeling, let’s generalize the
idea a bit by allowing a particular object to have more than one label at a time.
267 10.2 Relaxation labeling
The first way one might approach the idea of using consistency is to set up a linear
system that takes into account the initial probabilities and the consistencies. We
define the compatibility of object i having label λ with object j having label λ′
as r_L(i, λ, j, λ′) (the subscript L denotes that this is used in the linear relaxation
algorithm) and require that 0 ≤ r_L ≤ 1 and
$$ \sum_{\lambda} r_L(i, \lambda, j, \lambda') = 1 \quad \text{for all } i, j, \lambda'. \tag{10.1} $$
[Margin note: Be careful! The definition of r will change in the next section.]
The linear relaxation process iteratively updates the label weights following
$$ p_i(\lambda) = \sum_j \sum_{\lambda'} r_L(i, \lambda, j, \lambda')\, p_j(\lambda'). \tag{10.2} $$
Nonlinear relaxation
Don’t worry about the denominator; it is just there to make sure the values of p_i stay
within the range of 0–1. The term q_i(λ) represents a measure of how compatible
p_i(λ_j) is with the labeling of all the other objects. While p is strictly positive, q can
be either positive or negative. If negative, q_i(λ) suggests that the current labeling of
p_i by λ_j is incompatible with most other labelings,
$$ q_i^k(\lambda) = \sum_j C_{ij} \left[ \sum_{\lambda'} r_N(i, \lambda, j, \lambda')\, p_j^k(\lambda') \right], \tag{10.4} $$
[Margin note: Details of a specific form for r are problem-dependent.]
where r_N is similar to the compatibility r(.) that we saw earlier in this chapter, but with a
subscript N to denote that we will use it in nonlinear relaxation. The only difference
is that now we allow r to take on values of not only −1 and 1, but all values in
between. If two labelings are completely consistent, their compatibility is said to be
1. If the labelings are totally inconsistent, the compatibility is −1. If the labeling
of one simply does not affect the other, the compatibility is 0. Let’s examine what
Eq. (10.4) means.
The CHANGE in our confidence that object i has label λ is simply the sum of how
compatible that label for object i is with all the other labelings currently. Notice that
the compatibility is multiplied by the confidence that the other labeling is correct.
That is, if we have little confidence in a labeling of object j, we do not really care
how much it is compatible with a labeling of object i. Finally, C_ij is there for
convenience. It simply weights the influence that object j has on object i, without
regard to labels. It might be zero, for example, if object i and j have no influence on
each other, and we know that ahead of time. C_ij is optional; one could just as well
incorporate it into the compatibility function.
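One iteration of this process can be sketched with NumPy. The computation of q follows Eq. (10.4) directly. The update of p, however, is an assumption: the sketch scales each confidence by (1 + q) and divides by the sum over labels, which is a commonly used form of the nonlinear relaxation update and is not necessarily the exact expression the text refers to. The default choice of C (equal weights over the other objects, summing to one) is also an assumption, made so that q stays within [−1, 1].

```python
import numpy as np

def relaxation_step(p, r, C=None):
    """One iteration of nonlinear relaxation labeling (illustrative sketch).

    p: (n, L) array; p[i, a] is the confidence that object i has label a,
       with each row summing to 1.
    r: (n, L, n, L) array of compatibilities r_N(i, a, j, b) in [-1, 1].
    C: optional (n, n) array of object-to-object weights C_ij.
    """
    n, _ = p.shape
    if C is None:
        # Assumed default: every other object has equal influence and the
        # weights in each row sum to 1, which keeps q within [-1, 1].
        C = (np.ones((n, n)) - np.eye(n)) / max(n - 1, 1)
    # Eq. (10.4): q[i, a] = sum_j C[i, j] * sum_b r[i, a, j, b] * p[j, b]
    q = np.einsum('ij,iajb,jb->ia', C, r, p)
    # Assumed update: raise or lower each confidence in proportion to q,
    # then renormalize so that each row again sums to 1.
    new_p = p * (1.0 + q)
    return new_p / new_p.sum(axis=1, keepdims=True)
```

As a usage example, take two objects and two labels with r equal to 1 when the labels agree and −1 when they differ. Starting from p = [[0.6, 0.4], [0.5, 0.5]], repeated calls drive both objects toward the first label.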
Assume you have done a segmentation of a range image into planes. Now, you wish
to find which of a set of models best matches the object being observed. Assume
the segmentation produces patches which are planar, and since the image is a range
image, we can compute the orientation of those planes in 3-space. The problem now
is to find the set of planar surfaces in the model (or collection of models) which best
matches the set of planar surfaces in the image. Patches in the image are the objects,
regions in the model(s) are the labels. One way to define compatibility of labeling
is as follows. Consider the compatibility of labeling image patch A as model region
1 with B as 2. Let us now consider: Does patch A border patch B? And does region
1 border region 2? There are four possibilities, illustrated in Fig. 10.8. If the two
regions in the image are adjacent, and the two regions in the model are also adjacent,
then we define the compatibility of two labelings as
where AB denotes the angle between patch/regions A and B (remember, this is a
range image, these angles can be measured).
Fig. 10.8. Four possibilities in the definition of compatibility of labeling between two patches
and the corresponding two models.
The point to emphasize here is that the definition of the compatibility function
is totally problem-dependent. The rest of relaxation labeling is a simple iteration
defined by Eqs. (10.3) and (10.4).
Another example: tracking of moving objects
Let us assume that we are tracking four-wheeled vehicles as they move. Our camera
can only record the position of the tires (it is a pretty weird camera). Our objective
then is to determine which tires in one image correspond to which tires in the next
image. This is illustrated in Fig. 10.9, in which the locations of the tires in frame n
are denoted by open circles and the locations in frame n + 1 by closed circles. In
this application, we have another labeling task. The objects (wheels) in frame n are
to be labeled with the labels (also wheels) in frame n + 1. So what shall we use as a
compatibility function? To figure that out, let’s think about some incorrect labelings.
For example, Fig. 10.10 illustrates the labeling with an arrow, from object to label.
See if you think this makes sense.
Fig. 10.9. Objects in frame 1 are represented by open circles, objects in frame 2 are filled
circles. The problem is to find the most consistent labeling of open circles with filled circles.
Fig. 10.10. A labeling which probably is not correct.
If Fig. 10.10 is correct, then the front left tire went to the front right, and the
rear tires did the same. We can only interpret that as the car flipped over, which,
while possible, we certainly hope did not actually happen. A more reasonable
interpretation is in Fig. 10.11, which shows the arrows are almost parallel! Ah, that’s
it! We can use the cosine of the angle between the arrows,
where i and j are wheels in frame n and m and p are wheels in frame n + 1. Equation
(10.6) measures the consistency of assuming that wheel i is wheel m and wheel j is
wheel p. Although included in a notes version of this book several years earlier, this
concept was published in 1995 by Wu [10.19].
Fig. 10.11. A labeling which makes more sense.
10.3 Conclusion
This chapter is all about consistency. We hope to have convinced the student that
the best way to fuse information from diverse sources is to seek labelings which are
consistent.
Optimization is formally used only in the next section, where an optimization is set
up and solved using a numerical optimization technique called conjugate gradient.
The conjugate gradient technique is not explained in this chapter – the reader is
referred to standard texts in numerical methods. However, the technique is in many
ways similar to gradient descent, but runs much faster.
Researchers continue to work on improving the concepts of consistent labeling,
including relaxation labeling [10.4]. Relaxation or similar algorithms have been
used in such diverse applications as optical character recognition [10.12] and edge
detection [10.15]. See [10.9, 10.16] for underlying theory.
10.4 Vocabulary
Compatibility
Concave edge
Consistent
Convex edge
Labeling
Linear relaxation
Local/global inference
Nonlinear relaxation
Occluded
Relaxation labeling
Topic 12A 3D interpretation of 2D line drawings
Assignment 10.1
OK. You have seen two examples of compatibility functions.
Now you have the opportunity to make up your
own. Here is the problem. You have applied an edge
detector to an image. At every pixel in the image, a
gradient has been taken and you know the magnitude and
direction of that gradient. Recognizing that some of
these measurements may be corrupted by noise and blur,
develop an application of relaxation labeling which
helps determine “real” edge pixels. Hint: A “real” edge
pixel would have its gradient vector point in the same
direction as neighboring edge pixels. Develop a
compatibility function using this concept. Describe how to
use it. You may use pseudo-code or words or flowcharts,
or all three. Do not write actual software.
As we have seen, interpretation of line drawings is a difficult problem. A single line drawing
shows only one view of a 3D object, and is therefore ambiguous. The ambiguity may be
resolved through the use of a set of stored 3D models. This approach requires prior knowledge
of what objects are likely to appear in a drawing, and may not give good interpretations of
novel drawings which do not correspond to any of the stored models.
Marill [10.11] proposes another approach for interpreting line drawings which requires no
models. He uses only one heuristic to generate a 3D wire-frame object from a 2D line drawing,
that is, a given 3D interpretation is considered less likely to be correct if some angles between
the wires are much larger than others. Specifically, of all the 3D models consistent with a
given 2D drawing, the preferred interpretation is the one with the least standard deviation of
angles (SDA) as defined in Eq. (10.7), where the angle is illustrated in Fig. 10.12.
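\[ \mathrm{SDA} = \sqrt{\, n\sum\theta^{2} - \Bigl(\sum\theta\Bigr)^{2}} . \tag{10.7} \]
[Fig. 10.12: the angle θ at the vertex B(x_b, y_b, z_b), formed by the vector v1 toward A(x_a, y_a, z_a) and the vector v2 toward C(x_c, y_c, z_c).]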
Fig. 10.13. Left column: 2D line drawings. Right: 3D interpretation using the emulation method.
This algorithm can interpret a wide range of line drawings and seems to consistently generate
the same interpretations that a human does, even without any explicit models.
To simplify the problem, we square the objective function SDA and call it S . We would
like to find the third coordinate (z i ) of the points in the 2D picture which will minimize the
objective function S.
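\[ S = n\sum\theta^{2} - \Bigl(\sum\theta\Bigr)^{2} . \tag{10.8} \]
\[ \frac{\partial S}{\partial z_i} = 2n\sum\theta\,\frac{\partial\theta}{\partial z_i} - 2\Bigl(\sum\theta\Bigr)\Bigl(\sum\frac{\partial\theta}{\partial z_i}\Bigr), \tag{10.9} \]
where the angle θ, formed by two vectors v1 and v2, can be computed by Eq. (10.10).
\[ \theta = \cos^{-1}\!\left[\frac{v_1\cdot v_2}{\|v_1\|\,\|v_2\|}\right] = \cos^{-1}\!\left(\frac{(x_a-x_b)(x_c-x_b)+(y_a-y_b)(y_c-y_b)+(z_a-z_b)(z_c-z_b)}{\sqrt{(x_a-x_b)^2+(y_a-y_b)^2+(z_a-z_b)^2}\,\sqrt{(x_c-x_b)^2+(y_c-y_b)^2+(z_c-z_b)^2}}\right). \tag{10.10} \]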
Some 3D interpretation results of 2D line drawings are shown in Fig. 10.13. The optimization
problem is solved using conjugate gradient.
Wang [10.18] has made several improvements on Marill’s original algorithm, most of
which reduce the computational cost, including the usage of the standard deviation of
segment magnitudes (DSM) as the objective function [10.3] and the application of gradient
descent to solve the minimization problem [10.2].
References
Supposing I was on the other side of the glass, wouldn’t the orange still be in my right hand?
Lewis Carroll
This chapter discusses another approach to the solution of the local/global inference
problem, the use of parametric transformations. In this approach, we assume that the
object for which we are searching in the image may be described by a mathematical
expression, which in turn is represented by a set of parameters. For example, a
straight line may be written in slope–intercept form:
y = ax + b, (11.1)
where a and b are the parameters describing the line. Our approach is as follows:
Given a set of points (or other features), all of which satisfy the same equation, we
will find the parameters of that equation. In a sense, this is the same as fitting a
curve to a set of points, but as we will discover, the parametric transform approach
allows us to find multiple curves, without knowing a priori which point belongs
to which curve. We begin this process by considering the special case of finding
straight lines.
Suppose you are tasked with the problem of finding the straight lines in the image
shown in Fig. 11.1. If only one straight line were present in the image, we could
use straight line fitting to determine the parameters of the curve. But we have two
line segments here. If we could segment this first, then we could fit each segment
separately – yes, this is a segmentation problem, but we are segmenting a boundary
into boundary segments rather than segmenting an image into regions. In this section,
we will learn how to do this.
Fig. 11.1. An image which is the output from an edge detector. The human can
immediately discern that this boundary consists of two straight segments.
First, let us prove an illustrative theorem.
Definition
Given a point in a d-space, and a parameterized expression defining a curve in that
space, the parametric transform of that point is the curve which results from treating
the point as a constant and the parameters as variables. For example, Eq. (11.1)
may be rewritten as
b = y − xa (11.2)
which is itself a straight line in the 2-space a, b. Given the point x = 3, y = 5,
then the parametric transform is b = 5 − 3a.
Theorem
If n points in a 2-space are colinear, all the parametric transforms corresponding to
those points, using the form b = y − xa intersect at a common point in the space
a, b.
Proof
Suppose n points $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ all satisfy the same equation
\[ y = a_0 x + b_0 . \tag{11.3} \]
Consider two of those points, $(x_i, y_i)$ and $(x_j, y_j)$. The parametric transforms of the
points are the curves (which happen to be straight lines)
\[ y_i = x_i a + b, \qquad y_j = x_j a + b, \tag{11.4} \]
which we rewrite to make clear the fact that a and b are independent variables:
\[ b = y_i - x_i a, \qquad b = y_j - x_j a . \tag{11.5} \]
Equating the two expressions for b gives
\[ y_j - y_i = (x_j - x_i)\,a \tag{11.6} \]
and therefore $a = (y_j - y_i)/(x_j - x_i)$.
We substitute a into Eq. (11.5) to find b,
\[ b = y_i - x_i\,\frac{y_j - y_i}{x_j - x_i} \tag{11.7} \]
and we have the a and b values where the two curves intersect. However, we also
know from Eq. (11.3) that all the xs and ys satisfy the same curve. By performing
that substitution into Eq. (11.7), we obtain
\[ b = (a_0 x_i + b_0) - x_i\,\frac{(a_0 x_j + b_0) - (a_0 x_i + b_0)}{x_j - x_i} \tag{11.8} \]
which simplifies to
\[ b = (a_0 x_i + b_0) - x_i a_0 = b_0 . \tag{11.9} \]
Similarly,
\[ a = \frac{y_j - y_i}{x_j - x_i} = \frac{(a_0 x_j + b_0) - (a_0 x_i + b_0)}{x_j - x_i} = a_0 . \tag{11.10} \]
Thus, for any two points along the straight line parameterized by $a_0$ and $b_0$, their
parametric transforms intersect at the point $a = a_0$, and $b = b_0$. Since the transforms
of any two points intersect at that one point, all such transforms intersect at that
common point. QED.
Review of concept: Each POINT in the image produces a CURVE (possibly straight)
in the parameter space. If the points all lie on a straight line in the image, the
corresponding curves will intersect at a common point in parameter space.
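The theorem can be checked numerically with a small sketch. The function below, whose name `transform_intersection` is an assumption made here, intersects the parametric transforms b = y − xa of two image points using Eqs. (11.6) and (11.7).

```c
/*
 * Intersection (a, b) of the parametric transforms of two image points,
 * b = yi - xi*a and b = yj - xj*a.
 * Returns 1 on success and 0 if xi == xj, in which case the two
 * transforms are parallel lines in (a, b) space and never intersect.
 */
int transform_intersection(double xi, double yi, double xj, double yj,
                           double *a, double *b)
{
    if (xi == xj)
        return 0;
    *a = (yj - yi) / (xj - xi);   /* Eq. (11.6) solved for a */
    *b = yi - xi * (*a);          /* Eq. (11.7) */
    return 1;
}
```

For the points (0, 1), (1, 3), and (4, 9), which all lie on y = 2x + 1, every pair yields a = 2 and b = 1. The failure case also shows why the slope-intercept form is awkward: two points on a vertical line have no intersection in (a, b) space.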
Got that? Now, on to the next problem.
Pick values of ρ and θ. Hold those values constant. Then the set of points which
satisfy Eq. (11.11) can be shown to be a straight line. There is a geometric interpretation
of this equation which is illustrated in Fig. 11.3.
This representation of a straight line has a number of advantages. Unlike the use
of the slope, both of these parameters are bounded; ρ can be no larger than the largest
diagonal of the image, and θ need be no larger than 2π. A line at any angle may be
represented without singularity.
The use of this parameterization of a straight line solves one of the problems which
confronts us, the possibility of infinite slopes. The other problem is the calculation
of intersections.
[Margin note: There is a Whoops here. It is true that the maximum value of ρ is the diagonal.
However, when you start calculating ρ while varying θ, you will find points with negative ρ.
Rather than testing negative values and reflecting, it is easier to just let ρ be negative;
which makes the . . .]
Fig. 11.3. In the ρ, θ representation of a line, ρ is the perpendicular distance of the line from the
origin, and θ is the angle made with the x axis.
Fig. 11.5. (a) An image with two line segments which have distinctly different slopes and
intercepts, but whose actual positions are corrupted significantly by noise. (b) The
corresponding Hough transform.
11.3.2 Finding circles when the origin is unknown but the radius is known
Equation (11.12) describes an equation in which x and y are assumed to be variables,
and h, k, and R are assumed to be constants. As before, let us rewrite this equation
\[ (h - x_i)^2 + (k - y_i)^2 = R^2 . \tag{11.13} \]
In the space (h, k) what geometric shape does this describe? You guessed it, a circle.
Each point in image space $(x_i, y_i)$ produces a curve in parameter space, and if all those
points in image space belong to the same circle, where do the curves in parameter
space intersect? You should be able to figure that one out by now.
Now, what if R is also unknown? It is the same problem; however, instead of
allowing h to range over all possible values and computing k, we must now allow
both h and k to range over all values and compute R. We now have a three-dimensional
parameter space. Allowing two variables to vary and computing the third defines
a surface in this 3-space. What type of surface is this (an ellipse, a hyperboloid, a
cone, a paraboloid, a plane)?
[Margin note: However, that might not be the most efficient way to do it. Think about it . . .]
Fig. 11.8. The point $(x_i, y_i)$ is proposed to lie on a circle of radius R. The gradient points in the
direction of the arrow. If the circle is known to be dark relative to the background, the
center can be found by taking a step of length R in the direction opposite to the
gradient. If the center/surround contrast is opposite, then the step should be taken in
the direction of the gradient. If the contrast is not known a priori, steps can be taken
(and accumulators incremented) in both directions.
So far, we have assumed that the shape we are seeking can be represented by an
analytic function, representable by a set of parameters. The concepts we have been
using, of allowing data components which agree to “vote,” can be extended to
generalized shapes. Initially we suppose we have an arbitrarily shaped region, and
suppose we know the orientation, shape, and zoom. Our first problem is to figure
out how to represent this object in a manner which is suitable for use by Hough-like
methods. The following is one such approach [11.2]:
First, define some reference point. The choice of a reference point is arbitrary,
but the center of gravity is convenient. Call that point O. For each point $P_i$ on the
boundary, calculate both the gradient vector at that point and the vector $OP_i$ from
the reference to the boundary point. Quantize the gradient direction into, say, n
values, and create a table with n rows. Each time a point $P_j$ on the boundary has a
gradient direction with value $G_i$ (i = 1, . . . n), a new column is filled in on row i,
containing $OP_j$. Thus, the fact that multiple points on the boundary may have
identical gradient directions is accommodated by placing a separate column in the
table for each entry. In Fig. 11.9, a shape is shown, and three entries in the R-table are
illustrated in Table 11.1.
Fig. 11.9. Boundary points P1 and P2 have gradient vectors which have identical
directions, and therefore correspond to entries on the same row of the R-table. P3
is an entry on the second row of the R-table.
To utilize such a shape representation to perform shape matching and location,
we use the following algorithm.
(1) Form an accumulator array, which will be used to hold candidate locations of
the reference point. Initialize the accumulator to zero.
(2) For each edge point, $P_i$, do the following.
(2.1) Compute gradient direction, and determine which row of the R-table
corresponds to that direction.
(2.2) For each entry, j, on that row:
(a) compute the location of the candidate center by adding the stored
vector to the boundary point location: A = T [i, j] + Pi
(b) increment the accumulator determined by A.
Gradient direction (in degrees) | Vector from boundary point to reference point | Vector from boundary point to reference point
11.5 Conclusion
[Margin note: Accumulator arrays enforce consistency.]
In this chapter, another approach to analysis of consistency was introduced in the
form of the Hough transform and its descendants, all of which use accumulator arrays.
An accumulator array allows for easy detection of consistency, since “things” which
are consistent all add to the same accumulator, or at least to nearby accumulators. By
constructing the accumulator array by adding hypotheses, we also gain a measure
of noise immunity, since inconsistent solutions tend not to contribute to the globally
consistent solution.
11.6 Vocabulary
Accumulator array
Generalized Hough transform
Hough transform
Parametric transform
Wechsler and Sklansky [11.12] have developed one approach to the problem of finding
parabolic curves in images, as described below.
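A parabola is the locus of points each of whose distance from a fixed point, the focus, is
equal to its distance from a fixed straight line, the directrix, as illustrated in Fig. 11.10.
\[ x^2 = (x - 2a)^2 + y^2 \tag{11.15} \]
[Fig. 11.10: a parabola with its directrix, vertex V, and focus F; the angles λ and θ and the distances a, d, Δx, and Δy are marked.]
or
\[ y^2 = 4a(x - a) \tag{11.16} \]
and
\[ \tan\theta = \frac{y}{x - 2a} = \frac{\sqrt{4a(x - a)}}{x - 2a} . \tag{11.20} \]
Substituting for a,
\[ \tan\theta = \frac{2\tan\lambda}{1 - \tan^{2}\lambda} = \tan 2\lambda . \tag{11.21} \]
The solution to Eq. (11.21) is
\[ \theta = 2\lambda . \tag{11.22} \]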
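The substitution step can be written out as follows. This is a sketch of one way to reach the double-angle result; it assumes that λ denotes the angle of the tangent to the parabola, so that tan λ = dy/dx, and that θ is the angle defined by tan θ = y/(x − 2a). The intermediate equations of the original derivation are not reproduced here.

```latex
\begin{align*}
y^2 = 4a(x-a)
  &\;\Rightarrow\; 2y\,\frac{dy}{dx} = 4a
  \;\Rightarrow\; a = \tfrac{1}{2}\,y\tan\lambda, \\
x - 2a &= \frac{y^2}{4a} - a
  = \frac{y}{2\tan\lambda} - \frac{y\tan\lambda}{2}
  = \frac{y\,(1-\tan^{2}\lambda)}{2\tan\lambda}, \\
\tan\theta &= \frac{y}{x-2a}
  = \frac{2\tan\lambda}{1-\tan^{2}\lambda}
  = \tan 2\lambda .
\end{align*}
```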
and
Equation (11.16) assumes the focus lies at x = a. This technique will work if only one
parabola lies in the field of view, for then the position of the origin is arbitrary. In the more
general case, however, we must make the location of the origin explicit.
The derivation of Wechsler and Sklansky [11.12] can be generalized as follows to overcome
these difficulties.
Initially, we continue to assume the parabola is symmetric about a horizontal line,
but having an origin at an arbitrary point x0 , y0 . The parabolic equation becomes
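[Fig. 11.11: a parabola symmetric about a horizontal line, with its origin at $(x_0, y_0)$; the directrix and the distance a are marked.]
\[ (y - y_0)^2 = 4a(x - x_0) . \tag{11.27} \]
Writing $y'$ for the local derivative $dy/dx$ at the point, a can be eliminated:
\[ a = y'^{2}(x - x_0) . \tag{11.29} \]
\[ (y - y_0)^2 = 4y'^{2}(x - x_0)^2 . \tag{11.30} \]
\[ (y - y_0) = 2y'(x - x_0) . \tag{11.31} \]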
Equation (11.31) describes a straight line in the x0 , y0 parameter space. Thus, by making use
of the local derivative, the 3-parameter problem of Eq. (11.27) has been reduced to a much
more tractable two-dimensional problem.
It should be obvious that shapes other than circles and parabolae can be found using this
approach [11.3].
The peak of the accumulator array is, hopefully, at or near the true parameter value, but in
general may be displaced by noise in the original data. There are several ways to approach
this problem, including some sophisticated methods [11.4, 11.11] and some simple ones.
In general, however you look at it, this is a clustering problem [11.4], so any algorithm
which does clustering well is also good at peak finding. Simple techniques such as k-means,
described in section 15.2.2, perform well.
Clustering is the process of finding natural groupings in data. In this chapter the obvious
application is to find the best estimate of the peak(s) of the accumulator; however, these ideas
are much broader in applicability than just finding the mode of a distribution. For example,
McLean and Kotturi [11.9] use clustering [11.5] to locate the vanishing points in an image.
Clustering is discussed in detail in Chapter 15.
The so-called “Gauss map” provides an effective way to represent operations in range
images. It is philosophically a type of parametric transform. The concept is quite simple:
First, tessellate the surface of a sphere into whatever size patches you wish. This tessellation
will determine the resolution of the map. Associate a counter (accumulator) with each cell of
the tessellation. Now consider a range image. In the range image, calculate the surface normal
at each pixel, and increment the accumulator of the Gauss map whose cell contains that surface
normal. This provides essentially a histogram of normal vectors, which in turn can be used
to recognize the orientation of the object in the range image.
Since curvature represents rates of change of normal direction, the Gauss map can be
related back to curvature. The invertibility of the Gauss map and invariance under rotation
and translation are discussed in more detail in [11.7]. The map also finds application in
identifying the vanishing points in an image [11.8].
As we discussed in section 4.2.2, the correspondence problem prohibits simple use of
stereopsis to provide three-dimensional information about the world. In this section, we take a
closer look at this problem and provide a partial solution, based on accumulator arrays.
Consider the two-camera stereo problem. Recall from Fig. 4.9 that the disparity d is the
distance in pixels between two corresponding pixels, z is the distance to the point corresponding
to those two pixels, and B is the baseline (the distance between the two cameras). Then
\[ d = \frac{BF}{z} \tag{11.32} \]
where F is the focal distance of either of the cameras (which are assumed to have the same
focal length). The hard problem, of course, is to determine which pixels correspond to the
same point in 3-space. Suppose we extract a small window from the leftmost image and
use the sum of squared differences (SSD) to template match that small window along a
horizontal line in the other image. We could graph that objective function vs. the disparity,
or the inverse distance (1/z). We find it convenient to use the inverse distance and we will
typically find that the match function has multiple minima, as illustrated in Fig. 11.12. If
we take an image from a third or fourth camera, with a different baseline from camera 1,
we find similar nonconvex curves. However, all the curves will have minima at the same
point, the correct inverse distance. Immediately, we have a consistency! We form a new function,
the sum of such curves taken from multiple baseline pairs, and find that this new function
(called the SSD-in-inverse-distance) has a sharp minimum at the correct answer. Okutomi and
Kanade [11.10] have proven that this function always exhibits a clear minimum at the correct
matching position, and that the uncertainty of the measurement decreases as the number of
baseline pairs increases.
Fig. 11.12. The result of matching a template extracted from one image in a stereo pair
with a second image, as a function of the location of the inverse distance to the matching point.
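The summation step can be sketched as follows, assuming that each SSD curve has already been computed and resampled at the same set of inverse-distance values. The function name `ssd_sum_argmin` and the row-major layout of `curves` are assumptions made here; computing the SSD curves themselves is not shown.

```c
/*
 * Sum n_curves SSD curves, each sampled at the same n_samples
 * inverse-distance values, and return the index of the minimum of the sum.
 * curves[c * n_samples + s] is the SSD of baseline pair c at sample s.
 * Ties are resolved in favor of the smallest index.
 */
int ssd_sum_argmin(const double *curves, int n_curves, int n_samples)
{
    int best = 0;
    double best_val = 0.0;

    for (int s = 0; s < n_samples; s++) {
        double total = 0.0;
        for (int c = 0; c < n_curves; c++)
            total += curves[c * n_samples + s];
        if (s == 0 || total < best_val) {
            best_val = total;
            best = s;
        }
    }
    return best;
}
```

Each individual curve may have several equally good minima, but only the minimum shared by all curves survives the summation.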
11A.5 Conclusion
The general concept of the parametric transform seeks consistency! That is the key here. Many
points which are in some sense consistent contribute to the same cell in the accumulator –
they vote, in a sense. Hopefully, noise averages out in this voting process, and we can arrive
at consistent solutions.
In computed tomography (CT), the signal measured is a line integral along the ray from
the x-ray source to the x-ray detector. The line along which the integral is performed may be
represented using the ρ–θ parameterization of a straight line:
\[ R(\rho, \theta) = \int \delta\bigl(\rho - (x(s)\cos\theta + y(s)\sin\theta)\bigr)\,ds . \tag{11.33} \]
Examination of Eq. (11.33) leads one to the immediate conclusion that the Hough transform
may be formally represented by the Radon transform, and that, except for applications, they
are the same transform.
In addition to the use of these transforms for identifying specific shapes, Leavers [11.6] has
shown that if one considers the shape of the parameter space, rather than simply the location
of the peaks, one may determine the convex hull and several shape parameters of regions.
A fascinating alternative to the Hough transform was proposed by Aghajan and Kailath
[11.1], using wavefront propagation. The idea is to think of each pixel as a radio transmitter
emitting signals which are detected by receivers at the end of each row. Using the mathematics
of direction of arrival signal processing, they show it is possible to detect straight lines with
a computational complexity significantly less than the conventional Hough transform. This
idea of wavefront propagation makes sense as a paradigm for how the brain detects straight
lines.
11A.6 Vocabulary
Gauss map
Parabola
Radon transform
SSD
Assignment 11.1
In the directory named “leadhole” are a set of images of
wires coming through circuit board holes. The holes are
roughly circular and black. Use parametric transform
methods to find the centers of the holes. This is a project and
Assignment 11.2
You are to use the generalized Hough transform approach to
both represent an object and to search for that object in
an image. It turns out that the object is a perfect square,
centered at the origin, with sides two units long, but you do
not know that ahead of time. You only have five points, those
at (0,1), (1,0), (1, 0.5), (-1,0), and (0, -1). Fill out
the “R-table” which will be used in the generalized Hough
transform of this object. (Table 11.2 contains four rows;
that is just a coincidence. You are not required to fill them
all in, and if you need more rows, you can add them.)
Assignment 11.3
Let P1 = [x1 , y1 ] = [3, 0] and P2 = [x2 , y2 ] = [2.39, 1.42] be two
points, both of which lie approximately on the same disk.
We do not know a priori whether the disk is dark inside or
bright. The image gradients at P1 and P2 are 5∠0 and 4.5∠π/4
(using polar notation).
Use Hough methods to estimate the location of the center of
the disk, and radius of the disk, and determine whether the
disk is darker or brighter than the background.
References
[11.1] H. Aghajan and T. Kailath, “SLIDE: Subspace-based Line Detection,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(11), 1994.
[11.2] D. Ballard, “Generalizing the Hough Transform to Detect Arbitrary Shapes,” Pattern
Recognition, 13(2), 1981.
[11.3] N. Bennett, R. Burridge, and N. Saito, “A Method to Detect and Characterize Ellipses
Using the Hough Transform,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 21(7), 1999.
[11.4] Y. Cheng, “Mean Shift, Mode Seeking, and Clustering,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(8), 1995.
[11.5] T. Hofmann and J. Buhmann, “Pairwise Data Clustering by Deterministic
Annealing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1997.
[11.6] V. Leavers, “Use of the Two-dimensional Radon Transform to Generate a Taxonomy
of Shape for the Characterization of Abrasive Powder Particles,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 22(12), 2000.
[11.7] P. Liang and C. Taubes, “Orientation-based Differential Geometric Representations
for Computer Vision Applications,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(3), 1994.
[11.8] E. Lutton, H. Maı̂tre, and J. Lopez-Krahe, “Contribution to the Determination of
Vanishing points using the Hough Transform,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(4), 1994.
[11.9] G. McLean and D. Kotturi, “Vanishing Point Detection by Line Clustering,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(11), 1995.
[11.10] M. Okutomi and T. Kanade, “A Multiple-Baseline Stereo,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 15(4), 1993.
[11.11] J. Princen, J. Illingworth, and J. Kittler, “Hypothesis Testing: A Framework for
Analyzing and Optimizing Hough Transform Performance,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 16(4), 1994.
[11.12] H. Wechsler and J. Sklansky, “Finding the Rib Cage in Chest Radiographs,” Pattern
Recognition, 9, pp. 21–30, 1977.
[11.13] Ylä-Jääski and N. Kiryati, “Adaptive Termination of Voting in the Probabilistic
Circular Hough Transform,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(9), 1994.
12 Graphs and graph-theoretic concepts
Functions are born of functions, and in turn, give birth or death to others. Forms emerge from
forms and others arise or descend from these
L. Sullivan
You have already seen the use of graph-theoretic terminology in connected compo-
nent labeling in Chapter 8. The way we used the term “connected components” in
the past was to consider each pixel as a vertex in a graph, and think of each vertex as
having four, six, or eight edges to other vertices (that is, four-connected neighbors,
six neighbors if hexagonal pixel is used, and eight-connected neighbors). However,
we did not build elaborate set-theoretic or other data structures there. We will do
so in this chapter. The graph-matching techniques discussed in this chapter will be
used a great deal in Chapter 13.
12.1 Graphs
A graph with vertex set V and edge set E is undirected if
∀(a, b ∈ V )[(a, b) ∈ E ⇔ (b, a) ∈ E]. Otherwise the graph is directed (or, in some
special cases, partially directed – a seldom-used term).
In this section, we define the vocabulary we will need to further our discussion of
graphs.
• The degree of a node is the number of edges coming into that node.
• A path between nodes $v_0$ and $v_l$ is a sequence of nodes $v_0, v_1, \ldots, v_l$ such that
there exists an edge between $v_i$ and $v_{i+1}$ for all i.
• A graph is connected if there exists a path between any two nodes.
• A clique (remember this word, you will see it again) is a subgraph in which there
exists an edge between any two nodes.
• A tree is a graph which contains no loops. Tree representations have found
application in speeding up Markov random field applications [12.8].
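The definitions above translate directly into code. The sketch below uses an adjacency matrix for an undirected graph; the type `Graph`, the limit `MAX_NODES`, and the function names are all assumptions made for illustration.

```c
#define MAX_NODES 16   /* hypothetical upper bound on the number of nodes */

typedef struct {
    int n;                            /* number of nodes in use */
    int adj[MAX_NODES][MAX_NODES];    /* adj[a][b] != 0 if (a, b) is an edge */
} Graph;

/* Add an undirected edge: (a, b) in E if and only if (b, a) in E. */
void add_edge(Graph *g, int a, int b)
{
    g->adj[a][b] = 1;
    g->adj[b][a] = 1;
}

/* Degree of node v: the number of edges coming into v. */
int degree(const Graph *g, int v)
{
    int d = 0;
    for (int u = 0; u < g->n; u++)
        d += g->adj[v][u] ? 1 : 0;
    return d;
}

/* Depth-first search marking every node reachable from v. */
static void visit(const Graph *g, int v, int *seen)
{
    seen[v] = 1;
    for (int u = 0; u < g->n; u++)
        if (g->adj[v][u] && !seen[u])
            visit(g, u, seen);
}

/* A graph is connected if a path exists between any two nodes. */
int is_connected(const Graph *g)
{
    int seen[MAX_NODES] = {0};

    if (g->n == 0)
        return 1;
    visit(g, 0, seen);
    for (int v = 0; v < g->n; v++)
        if (!seen[v])
            return 0;
    return 1;
}

/* The k listed nodes form a clique if every pair of them shares an edge. */
int is_clique(const Graph *g, const int *nodes, int k)
{
    for (int i = 0; i < k; i++)
        for (int j = i + 1; j < k; j++)
            if (!g->adj[nodes[i]][nodes[j]])
                return 0;
    return 1;
}
```

An adjacency matrix is used here for brevity; for large sparse graphs a list of neighbors per node is the more usual choice.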
The first data structure used to implement graph structures in computers was the
linked list as shown in Fig. 12.2. Such a data structure contains two types of data:
• nodes, which consist of two pointers (addresses), and
• atoms, which are data.
Atoms are distinguished from nodes by a single bit, usually the most significant
bit of the computer word. Certain nodes contain a zero in their right half, indicating
the end of the list. Such nodes are indicated in Fig. 12.2 by the cross in the right-hand
half. The linked list can also be used to store computer instructions, thus providing
a powerful mechanism for automatic programs. This was the foundation of the
programming language LISP.
Fig. 12.2. In a linked list, each node contains two pointers. A pointer with the value of zero
indicates the end of a list.
Fig. 12.3. A more general data structure may contain both data and pointers.
The concept of the linked list was incorporated into more modern programming
languages and extended to allow more generality. A structure such as
that illustrated in Fig. 12.3 can contain data and pointers to other data structures of
the same or different types. For example, the following C definition describes the
data structure of Fig. 12.3.
struct patch
{
    int area;
    int perimeter;
    struct patch *first;     /* pointer to another patch; member name is illustrative */
    struct patch *second;    /* pointer to another patch; member name is illustrative */
};
In model matching, we will make use of the region adjacency graph (RAG) as
a means for identifying how the regions in a segmented image match (or do not
match) faces in a three-dimensional model. A model of a polyhedron with six faces
is illustrated in Fig. 12.4.
The RAG for the object of Fig. 12.4 is illustrated in Fig. 12.5; Fig. 12.6 shows
another example.
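One way a RAG might be built from a segmented image is sketched below. It assumes the segmentation is available as a label image in which each pixel holds its region number, and it treats two regions as adjacent when they share a four-connected pixel border; `MAX_REGIONS` and `build_rag` are hypothetical names.

```c
#define MAX_REGIONS 8   /* hypothetical upper bound on the number of regions */

/*
 * Build a region adjacency graph as an adjacency matrix.
 * labels is a row-major width x height image whose pixels hold region
 * numbers in 0..MAX_REGIONS-1. On return rag[a][b] == 1 if regions a and b
 * share at least one four-connected pixel border, and 0 otherwise.
 */
void build_rag(const int *labels, int width, int height,
               int rag[MAX_REGIONS][MAX_REGIONS])
{
    for (int a = 0; a < MAX_REGIONS; a++)
        for (int b = 0; b < MAX_REGIONS; b++)
            rag[a][b] = 0;

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int p = labels[y * width + x];

            if (x + 1 < width) {                    /* right neighbor */
                int q = labels[y * width + x + 1];
                if (p != q)
                    rag[p][q] = rag[q][p] = 1;
            }
            if (y + 1 < height) {                   /* lower neighbor */
                int q = labels[(y + 1) * width + x];
                if (p != q)
                    rag[p][q] = rag[q][p] = 1;
            }
        }
    }
}
```

The resulting matrix is symmetric, so the RAG is an undirected graph in the sense defined earlier in this chapter.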
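[Fig. 12.4: a polyhedron whose faces are labeled with letters.]
[Margin note: A polyhedron has all FLAT faces.]
[Fig. 12.5: the region adjacency graph of the polyhedron, with nodes A to F.]
[Margin note: What a mess! Do you think there is a planar way to draw this graph? That
is, can you draw it without any lines crossing?]
[Fig. 12.6: a second object, with regions numbered 1 to 3, and its region adjacency graph.]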
Now here is the problem: Given an observation, and the RAG derived from the
observation, and given a collection of models and their corresponding graphs, which
model best matches the observation? We will address this matching problem later.
Other graph representations are possible and often useful. For example, the
constructive solid geometry (CSG) community uses a collection of primitives subjected
to transformations to represent input to automatic parts manufacturing systems. The
primitives are objects like spheres and cylinders. Methods have been developed
[13.8] to match scenes to models constructed from CSG representations as well as
to RAG representations.
3
2
1
Patch #3
Patch #1 Avail: yes
Avail: yes B. Window
B. Window B. Volume
B. Volume Direction
Direction Normal
Normal Near
Near Next
Next
Patch #2
Avail: yes
B. Window
B. Volume
Direction
Normal
Near
Next
Fig. 12.7. Scene graph for an image with one object segmented into three patches.
Notice the use of “cardinality” rather than area. Why do you suppose this is? Think about a range image viewed obliquely.
• the patch list is sorted by cardinality (number of pixels in a patch), and
• each node has a pointer to a list of adjacent patches.
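The bookkeeping just described can be sketched in C. This is only an illustration under stated assumptions: the names `scene_patch`, `insert_by_cardinality`, and `connect_patches`, the fixed neighbor limit, and the choice of decreasing sort order are ours, not part of the scene graph of Fig. 12.7.

```c
#include <stddef.h>

#define MAX_ADJ 8   /* illustrative cap on the number of neighbors per patch */

/* One node of the scene graph (member names are illustrative). */
struct scene_patch {
    int id;
    int cardinality;                        /* number of pixels in the patch */
    struct scene_patch *adjacent[MAX_ADJ];  /* patches that border this one  */
    int n_adjacent;
    struct scene_patch *next;               /* next patch in the sorted list */
};

/* Insert p so that the list stays sorted by decreasing cardinality
   (the sort direction is an assumption). Returns the new list head. */
struct scene_patch *insert_by_cardinality(struct scene_patch *head,
                                          struct scene_patch *p)
{
    struct scene_patch **link = &head;
    while (*link != NULL && (*link)->cardinality >= p->cardinality)
        link = &(*link)->next;
    p->next = *link;
    *link = p;
    return head;
}

/* Record that a and b border each other. Returns 0 if either
   adjacency list is already full, 1 otherwise. */
int connect_patches(struct scene_patch *a, struct scene_patch *b)
{
    if (a->n_adjacent >= MAX_ADJ || b->n_adjacent >= MAX_ADJ)
        return 0;
    a->adjacent[a->n_adjacent++] = b;
    b->adjacent[b->n_adjacent++] = a;
    return 1;
}
```

Keeping the list sorted at insertion time means the largest patches, which are usually the most reliable, can be visited first without a separate sorting pass.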
There are several kinds of problems in computer science, defined by how long they
take to run as a function of the size of the data set, represented by the variable n.
Consider a polyhedron with its center of gravity located at the origin, and think about
the image formed by a camera pointed at the origin and located at the 3D spatial
location parameterized by spherical coordinates [ρ, θ, φ], where θ denotes rotation
about the x axis and φ denotes rotation about the y axis, as illustrated in Fig. 12.8.
If ρ is a constant, then the locus of possible camera positions is a sphere about the
origin. For now, we will only consider this case. Thus, the camera may be thought
of as moving on a sphere. As can be seen from Fig. 12.9, two different viewpoints
may produce very different views. But consider only a very small camera motion.
Except in rare conditions, small camera motions cause only slight changes in the
image.
Fig. 12.8. The coordinate system is defined so that the object to be characterized has its center
of gravity located at the origin.
Fig. 12.9. Two different aspects of the same object produce very different two-dimensional
images.
296 Graphs and graph-theoretic concepts
Fig. 12.10. Each partition of the VSP is identified by a list of the surfaces visible from that set of
viewpoints (redrawn from [12.5]).
In those rare camera movements, however, radical things happen to the image –
surfaces disappear or appear. That is, the topological structure of the image changes.
Two viewpoints V1 and V2 are defined to be aspect equivalent, denoted V1 ∼ V2 ,
if and only if there exists a sequence of infinitesimal camera motions, a path, from
V1 to V2 such that the topology of the viewed image does not change. The aspect
equivalent property is obviously symmetric, reflexive, and trivially transitive. It is
therefore an equivalence relation and thus imposes a partition, denoted the viewpoint
space partition (VSP), on the set of points on the sphere. Each element of this partition
is referred to as a viewing region. Fig. 12.10, from [12.5], illustrates the VSP for a
tetrahedron with faces A, B, C, and D. The aspect graph (referred to originally as
the visual potential) is the dual of the VSP.
Computing the aspect graph is accomplished [12.2] by first constructing the
labeled image structure graph (LISG), which is an augmented graph in which
each node in the graph corresponds to a vertex in the line drawing, and each arc
corresponds to the line segment between those nodes. The arcs are augmented
with the labels corresponding to the properties +, −, and →, as we used them in
Chapter 10 to mean convex, concave, or occluding. The algorithm in [12.2] partitions
the viewing sphere such that all points in a partition have isomorphic LISGs.
For arbitrary (potentially nonconvex) polyhedra, using orthographic projection,
the algorithm is of high (but still polynomial) computational complexity. For a
polyhedron with n faces, the total worst case time complexity is O(n^8).
Aspect graphs were originally developed by Koenderink and Van Doorn [12.3],
and extended by a variety of authors; Bowyer and Dyer [12.1] provide a good survey
of work prior to 1990. More recent work has focussed on such nasty problems as
the fact that we are dealing with sampled data [12.7].
12.7 Conclusion
The concepts of graphs occur throughout the machine vision literature. See [12.6]
for more on scene structure graphs and Bayesian networks which use them. As we
now understand, the search algorithm used in section 10.1 to label line drawings
actually searched a tree of interpretations.
Consistent labeling searches a tree of interpretations.
12.8 Vocabulary
Aspect graph
Clique
Connected
Degree
Edge
Isomorphic
Node
NP-complete
Path
RAG
Scene graph
Tree
Vertex
References
[12.1] K. Bowyer and C. Dyer, “Aspect Graphs: An Introduction and Survey of Recent
Results,” International Journal of Imaging Systems and Technology, 2, pp. 315–328,
1990.
[12.2] Z. Gigus and J. Malik, “Computing the Aspect Graph for Line Drawings of Polyhedral
Objects,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(2),
1990.
[12.3] J. Koenderink and A. Van Doorn, “The Internal Representation of Solid Shape with
Respect to Vision,” Biological Cybernetics, 32, pp. 211–216, 1979.
[12.4] B. Messmer and H. Bunke, “A New Algorithm for Error-tolerant Subgraph Isomor-
phism Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
20(5), 1998.
[12.5] H. Plantinga and C. Dyer, “Visibility, Occlusion, and the Aspect Graph,” International
Journal of Computer Vision, 5(2), pp. 137–160, 1990.
[12.6] S. Sarkar and P. Soundararajan, “Supervised Learning of Large Perceptual Organiza-
tion: Graph of Spectral Partitioning and Learning Automata,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 22(5), 2000.
[12.7] I. Shimshoni and J. Ponce, “Finite-resolution Aspect Graphs of Polyhedral Objects,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 1997.
[12.8] C. Wu and P. Doerschuk, “Tree Approximations to Markov Random Fields,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(4), 1995.
13 Image matching
In this chapter we will consider issues associated with matching – matching observed
images with models as well as matching images with each other. We will consider
matching iconic representations as well as matching graph-theoretic representations.
Matching establishes an interpretation. That is, it puts two representations into
correspondence.
• Both representations may be of the same form. For example, correlation matches
an observed image with a template. Similarly, subgraph isomorphism matches a
region adjacency graph to a subgraph of a model graph.
• The representations may be of different forms. For example, one image matches
one paragraph describing something. In most such applications, we find ourselves
matching an equation to some data, and in this case, “fitting” might be a better
word.
In the remainder of this chapter we address all of these matching problems except
fitting, which was discussed earlier in this book.
$$ SE(x, y) = \sum_{\alpha=1}^{N} \sum_{\beta=1}^{N} \left( f(x - \alpha, y - \beta) - T(\alpha, \beta) \right)^2, \tag{13.1} $$
(assuming the template is N × N), which provides a measure of how well the template
(T) matches the image (f) at point x, y. If we expand the square and carry the summation through, we obtain
299 13.1 Matching iconic representations
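$$ SE(x, y) = \sum_{\alpha=1}^{N} \sum_{\beta=1}^{N} f^2(x - \alpha, y - \beta) - 2 \sum_{\alpha=1}^{N} \sum_{\beta=1}^{N} f(x - \alpha, y - \beta)\, T(\alpha, \beta) + \sum_{\alpha=1}^{N} \sum_{\beta=1}^{N} T^2(\alpha, \beta). \tag{13.2} $$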
Let’s look at these terms: The first term is the sum of the squared image brightness
values under the template at the point of application. It says nothing about how well
the image matches the template (although it IS dependent on the image). The third term
is simply the sum of the squared elements of the template, and is a constant, no matter
where the template is applied. The second term is the key to matching, and that term
is the correlation.
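To make the three terms concrete, the following C sketch evaluates Eq. (13.1) directly and also returns the three sums separately, so one can check that the first term minus twice the correlation plus the third term reproduces the squared error. The function names, the `match_terms` structure, and the row-major storage convention are assumptions made for illustration.

```c
#include <stddef.h>

/* f(x, y) for a row-major image of the given width (x = column, y = row;
   this storage convention is an assumption). */
static double pixel(const double *f, int width, int x, int y)
{
    return f[(size_t)y * (size_t)width + (size_t)x];
}

/* The three sums that appear when the square in Eq. (13.1) is expanded. */
struct match_terms {
    double image_energy;     /* sum of f^2 under the template     */
    double correlation;      /* sum of f * T                      */
    double template_energy;  /* sum of T^2, the same at every x,y */
};

/* Direct evaluation of Eq. (13.1) for an n x n template t, with
   T(alpha, beta) stored at t[(beta - 1) * n + (alpha - 1)].
   The caller must ensure x >= n and y >= n so all indices are valid. */
double squared_error(const double *f, int width, const double *t,
                     int n, int x, int y)
{
    double se = 0.0;
    for (int a = 1; a <= n; a++)
        for (int b = 1; b <= n; b++) {
            double d = pixel(f, width, x - a, y - b) - t[(b - 1) * n + (a - 1)];
            se += d * d;
        }
    return se;
}

/* The same template placement, with the three sums kept apart. */
struct match_terms expanded_terms(const double *f, int width, const double *t,
                                  int n, int x, int y)
{
    struct match_terms m = {0.0, 0.0, 0.0};
    for (int a = 1; a <= n; a++)
        for (int b = 1; b <= n; b++) {
            double fv = pixel(f, width, x - a, y - b);
            double tv = t[(b - 1) * n + (a - 1)];
            m.image_energy += fv * fv;
            m.correlation += fv * tv;
            m.template_energy += tv * tv;
        }
    return m;
}
```

Since `template_energy` does not depend on (x, y), it can be computed once per template; only the image energy and the correlation need to be evaluated at each placement.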
In matching using an optimization criterion, the assumption is made that the qual-
ity of match can be described by a set of parameters a = {a1 , a2 , . . . , an }, which
could be the pixels themselves. We define a merit function M(a, f (x)) which quan-
tifies the quality of the match between the template and the local image. Matching
consists of determining a so that M is maximized. Typically, a is the x–y coordinates
specifying where the template is placed.
If M is differentiable and has a single maximum in a, we can maximize M by solving
$$ M_{a_j} = \frac{\partial M}{\partial a_j} = 0 \quad \text{for } j = 1, \ldots, n. \tag{13.3} $$
If M has more than one maximum, the process of finding points where the partial derivatives are
zero can terminate in a local maximum. Furthermore, as we have discussed earlier,
it is probably not possible to find an analytic solution to Eq. (13.3). In that case, we
could use hill-climbing:
$$ a_j^{k} = a_j^{k-1} + c M_{a_j}. \tag{13.4} $$
The equation for hill-climbing should be compared with that for gradient descent.
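A minimal C sketch of the update in Eq. (13.4) follows. The callback type `gradient_fn`, the fixed iteration count, and the quadratic merit function used as an example are all assumptions chosen for illustration; a real matcher would supply the gradient of its own merit function and a stopping test.

```c
#include <stddef.h>

/* Gradient callback: writes dM/da_j into grad[j] for j = 0 .. n-1. */
typedef void (*gradient_fn)(const double *a, double *grad, size_t n);

/* Applies the update of Eq. (13.4), a_j <- a_j + c * M_aj, `steps` times.
   `grad` is caller-provided scratch space of length n. */
void hill_climb(double *a, size_t n, gradient_fn grad_m,
                double c, int steps, double *grad)
{
    for (int k = 0; k < steps; k++) {
        grad_m(a, grad, n);
        for (size_t j = 0; j < n; j++)
            a[j] += c * grad[j];
    }
}

/* Illustrative merit function M(a) = -(a0 - 3)^2 - (a1 + 1)^2, whose
   single maximum is at (3, -1). This routine returns its gradient. */
void example_gradient(const double *a, double *grad, size_t n)
{
    (void)n;  /* the example is fixed at two parameters */
    grad[0] = -2.0 * (a[0] - 3.0);
    grad[1] = -2.0 * (a[1] + 1.0);
}
```

Starting from a = (0, 0) with c = 0.1, a call such as `hill_climb(a, 2, example_gradient, 0.1, 200, grad)` moves a toward (3, -1). Replacing the plus sign in the update with a minus sign gives gradient descent on a function to be minimized.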
13.1.4 Eigenimages
The eigenimage approach has been an effective solution to problems like object
identification and recognition [13.49, 13.50], where the image of an unknown object
is compared to images of known objects in a data base (or a training set) and the
unknown object can be identified or recognized when a close match is found. We
can surely do a pixel-by-pixel comparison. However, this is very time-consuming,
especially when the size of the image is large and the number of images included in
the data base is large as well.
The eigenimage approach has its origin in principal component analysis (PCA),
which is a popular technique for dimensionality reduction. In section 9.2.1, one type
of PCA, the K–L transform, is described in detail. PCA constructs a representation
of the data with a set of orthogonal basis vectors that are the eigenvectors of the
covariance matrix generated from the data. By projecting the data onto the dominant
eigenvectors (corresponding to the larger eigenvalues), the dimension of the original
data set can be reduced with minimal loss of information. Similarly, in the eigen-
image approach, each image is represented as a linear combination of a set of dom-
inant principal components (the eigenimages). Matching is then conducted based
on the coefficients of the linear combination (or the weight of projections onto the
eigenimages) which greatly speeds up the process. The projection preserves most of
the energy, and thus captures the highest amount of variation in the data base. Here,
we discuss the calculation of the eigenimages in detail.
Let $f_1, f_2, \ldots, f_p$ represent a set of images of known objects in the data base.
Without loss of generality, assume these images are of the same dimension m × n.
The following steps lead us to the eigenimages.
$$ C = E \Lambda E^T $$
Step 5. Calculate the projection coefficients of image f i onto the selected eigen-
images for comparison purpose by
$$ W_i = I_i^T \times [E_1 \ldots E_k] \tag{13.6} $$
$$ W_{test} = I_{test}^T \times [E_1 \ldots E_k]. $$
Compare the distance between Wtest and all the Wi s in the data base (a Euclidean
distance might be the simplest approach); the one with the closest distance is selected
as a match.
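Step 5 and the comparison that follows it can be sketched in C as below. The sketch assumes the image vectors have already been prepared as the earlier steps require and that the k selected eigenimages are available as rows of an array; the names `project` and `closest_match` are hypothetical.

```c
#include <float.h>
#include <stddef.h>

/* w[j] = sum_p image[p] * eig[j * npix + p]: the projection of a vectorized
   image onto k eigenimages, stored one eigenimage per row of `eig`. */
void project(const double *image, const double *eig,
             size_t npix, size_t k, double *w)
{
    for (size_t j = 0; j < k; j++) {
        w[j] = 0.0;
        for (size_t p = 0; p < npix; p++)
            w[j] += image[p] * eig[j * npix + p];
    }
}

/* Index of the row of w_db (count rows of k coefficients each) with the
   smallest squared Euclidean distance to w_test. */
size_t closest_match(const double *w_db, size_t count, size_t k,
                     const double *w_test)
{
    size_t best = 0;
    double best_d = DBL_MAX;
    for (size_t i = 0; i < count; i++) {
        double d = 0.0;
        for (size_t j = 0; j < k; j++) {
            double diff = w_db[i * k + j] - w_test[j];
            d += diff * diff;
        }
        if (d < best_d) {
            best_d = d;
            best = i;
        }
    }
    return best;
}
```

The comparison touches only k coefficients per data base image rather than all mn pixels, which is where the speed advantage over pixel-by-pixel comparison comes from.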
We now show an interesting example of applying the eigenimage approach to
face recognition [13.51]. Assume we have three images in the data base (Lena,
Einstein, and the Clock), and the unknown image is Monalisa. Following Steps 1 through
3 described above, we can derive 64 × 64 eigenimages. We use only two of the
dominant ones since the ratio between the summation of the first two eigenvalues
and the summation of all the eigenvalues is close to 1, as stated in Step 4. Fig. 13.1
shows all the original images and the two eigenimages derived. Following Step
5, we compute the projection coefficients of all four original images on the two
eigenimages; these are listed in Fig. 13.1 as well. Based on a simple Euclidean
distance calculation, it turns out that the closest match to Monalisa is Einstein. Is
that a surprise? Not really. In the “eyes” of the computer, these two images are indeed
more alike than Monalisa and Lena.
Even though the eigenimage approach has great potential for image matching, from
the procedure described above, we see that the most time-consuming step is the
derivation of the eigensystem. When the size of images is large, the calculation of
the covariance matrix (which is mn × mn) can take up a lot of computation resources
or be completely infeasible. For more efficient calculations, readers are referred to
[13.34, 13.35].
We illustrate one approach to reducing computation through an example. Assume
that each image has only three pixels and that there are only two such images in the
set. Let them be
$$ f_1 = [1 \; 2 \; 3]^T, \qquad f_2 = [5 \; 8 \; 9]^T. $$
[Fig. 13.1 panels: original images I1, I2, I3, and Itest; eigenimages E1 and E2. The distances are d1,test = 1.2872, d2,test = 0.3963, d3,test = 3.5831.]
Fig. 13.1. Demonstration of the eigenimage approach on a data base of three images (Lena,
Einstein, Clock) and a test image (Monalisa). The images have been rescaled for
display purpose.
Observe that if p < mn, then S is the scatter matrix, identical to the covariance
except for the multiplicative scale factor. S is huge: if the image is 256 × 256, then
S is $256^2 \times 256^2$. However, if there are only, say, five images in the set, I is $256^2 \times 5$,
$$ I^T I v_i = \lambda_i v_i. \tag{13.7} $$
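The saving promised by Eq. (13.7) rests on a standard fact: if v is an eigenvector of the small matrix I^T I with eigenvalue λ, then Iv is an eigenvector of the large matrix I I^T with the same eigenvalue. The C sketch below checks this on the two three-pixel images f_1 and f_2 given above. It assumes the images are used as given (no mean subtraction), finds the dominant eigenpair by power iteration, which is our choice and not necessarily the method the text has in mind, and uses hypothetical function names throughout.

```c
#define NPIX 3   /* pixels per image */
#define NIMG 2   /* images in the set */

/* G = I^T I (NIMG x NIMG). Image a is column a of I, stored as row a of img. */
void gram_matrix(double img[NIMG][NPIX], double g[NIMG][NIMG])
{
    for (int a = 0; a < NIMG; a++)
        for (int b = 0; b < NIMG; b++) {
            g[a][b] = 0.0;
            for (int k = 0; k < NPIX; k++)
                g[a][b] += img[a][k] * img[b][k];
        }
}

/* Dominant eigenpair of the small symmetric matrix G by power iteration.
   v receives the eigenvector, scaled so its largest entry has magnitude 1;
   the Rayleigh quotient v^T G v / v^T v is returned as the eigenvalue. */
double dominant_eigenpair(double g[NIMG][NIMG], double v[NIMG], int iters)
{
    double t[NIMG];
    for (int a = 0; a < NIMG; a++)
        v[a] = 1.0;
    for (int it = 0; it < iters; it++) {
        double big = 0.0;
        for (int a = 0; a < NIMG; a++) {
            t[a] = 0.0;
            for (int b = 0; b < NIMG; b++)
                t[a] += g[a][b] * v[b];
            double mag = t[a] < 0.0 ? -t[a] : t[a];
            if (mag > big)
                big = mag;
        }
        for (int a = 0; a < NIMG; a++)
            v[a] = t[a] / big;
    }
    double num = 0.0, den = 0.0;
    for (int a = 0; a < NIMG; a++) {
        double gv = 0.0;
        for (int b = 0; b < NIMG; b++)
            gv += g[a][b] * v[b];
        num += v[a] * gv;
        den += v[a] * v[a];
    }
    return num / den;
}

/* u = I v: lifts an eigenvector of I^T I to an eigenvector of I I^T. */
void lift(double img[NIMG][NPIX], const double v[NIMG], double u[NPIX])
{
    for (int k = 0; k < NPIX; k++) {
        u[k] = 0.0;
        for (int a = 0; a < NIMG; a++)
            u[k] += img[a][k] * v[a];
    }
}

/* Largest |(I I^T u - lambda u)_k|, computed as I (I^T u) so that the
   NPIX x NPIX matrix is never formed. */
double lifted_residual(double img[NIMG][NPIX], const double u[NPIX], double lambda)
{
    double c[NIMG], worst = 0.0;
    for (int a = 0; a < NIMG; a++) {
        c[a] = 0.0;
        for (int k = 0; k < NPIX; k++)
            c[a] += img[a][k] * u[k];
    }
    for (int k = 0; k < NPIX; k++) {
        double su = 0.0;
        for (int a = 0; a < NIMG; a++)
            su += img[a][k] * c[a];
        double r = su - lambda * u[k];
        if (r < 0.0)
            r = -r;
        if (r > worst)
            worst = r;
    }
    return worst;
}
```

For these two images the small matrix is 2 × 2 with entries 14, 48, 48, and 170, so the eigenproblem that has to be solved has the size of the image set rather than the size of an image.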
The most straightforward way to use the simple features we described in Chapter 9 is
to use them in a pattern classifier. To do this, we will extract a statistical representation
for the model and the object and match those representations. The strategy is as
follows.
• Decide which measurements you wish to use to describe the shape. For example,
one might build a system which measures seven invariant moments and the aspect
ratio, for a total of eight “features.” The best collection of features is
application-dependent, and methods for optimally choosing feature sets are beyond the scope of
this book (see [14.4, 14.11, 18.30], which are just a few of many texts on statistical
methods). Organize these eight features into a vector, $x = [x_1, x_2, \ldots, x_8]^T$.
• Describe a “model” object using a collection of example images (called a “training
set”), from which feature vectors have been extracted. Continuing the example
with eight features, we could collect a set of n images of axes, measure the feature
vector of each axe, and characterize the model axe by its average over this set,
$\mu_{axe} = \frac{1}{n} \sum_{x \in axe} x$. Hatchets might be similarly characterized by an average over
a set of sample hatchets.
The unknown is distinguished from the examples by lack of a subscript.
• Now, given an unknown region, characterized by its feature vector x, shape matching
consists of finding the model which is “closest” in some sense to the observed
region. Probably the simplest definition of “close” uses the Euclidean distance
$$ d(model_{axe}, observation) = \sqrt{(x - \mu_{axe})^T (x - \mu_{axe})} $$
$$ d(model_{hatchet}, observation) = \sqrt{(x - \mu_{hatchet})^T (x - \mu_{hatchet})}. $$
The unknown is assigned to the model at the smaller Euclidean distance. In Chapter 14, the concepts in this discipline of statistical pattern
classification are covered in more detail.
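The strategy above amounts to a nearest-mean classifier, which can be sketched in a few lines of C. The function names and the flat array layout are assumptions for illustration; the feature extraction itself is taken as given.

```c
#include <stddef.h>

#define NFEAT 8   /* eight features, as in the running example */

/* mean[j] = (1/n) * sum of feature j over n training vectors, which are
   stored one after another in `train` (n * NFEAT values). */
void class_mean(const double *train, size_t n, double *mean)
{
    for (size_t j = 0; j < NFEAT; j++) {
        mean[j] = 0.0;
        for (size_t i = 0; i < n; i++)
            mean[j] += train[i * NFEAT + j];
        mean[j] /= (double)n;
    }
}

/* (x - mu)^T (x - mu). The square root is omitted because it does not
   change which model is nearest. */
double squared_distance(const double *x, const double *mu)
{
    double d = 0.0;
    for (size_t j = 0; j < NFEAT; j++) {
        double diff = x[j] - mu[j];
        d += diff * diff;
    }
    return d;
}

/* Index of the model mean (n_models rows of NFEAT values) nearest to x. */
size_t nearest_model(const double *means, size_t n_models, const double *x)
{
    size_t best = 0;
    double best_d = squared_distance(x, means);
    for (size_t m = 1; m < n_models; m++) {
        double d = squared_distance(x, means + m * NFEAT);
        if (d < best_d) {
            best_d = d;
            best = m;
        }
    }
    return best;
}
```

With the axe mean in row 0 and the hatchet mean in row 1 of `means`, `nearest_model(means, 2, x)` returns 0 when the unknown x is closer to the model axe.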
Recall the formal definition of a metric. This paragraph refers to subgraph isomorphism as a metric. Is that correct?
In this section, we consider the problem of matching image representations which
are fundamentally graph-based. However, we allow the data stored at a node in the
graph to include images or templates.
Recall that a clique of size N is a totally connected subgraph of size N . We
also will want to use matching metrics, such as the mean squared error or correla-
tions mentioned in section 13.1.1. A matching metric measures the “goodness” of a
match.
In a totally graph-based representation, one matching metric might be sub-
graph isomorphism. But, subgraph isomorphism does not really allow for close
but not perfect matches. Most machine vision specialists would say that it is too
inflexible.
The graph-matching problem can be approached using annealed neural networks
[13.8].
As we saw earlier, relaxation labeling can also provide a mechanism for a type of
graph matching. In the example we saw in section 10.2.2, we can match a subgraph
of the scene graph (i.e. the two surfaces in the example given) with a subset of
the model graph. Other variations are possible, for example, Gold and Rangarajan
[13.14] describe a variation on graph matching which utilizes a nonlinear optimiza-
tion method which they say runs much faster and more accurately than relaxation
labeling.
Two other approaches are described here – association graphs and spring-loaded
templates.
These are methods which will produce matchings on hybrid representations,
that is, those which are fundamentally graph-based, but which include image
information.
Definition
Here, a graph is denoted G = ⟨V, P, R⟩, where V represents a set of nodes,
P represents a set of unary predicates on nodes, and R represents binary relations
between nodes.
A predicate is a statement which takes on only the values TRUE or FALSE.
For example let x denote a region in a range image. Then CYLINDRICAL(x) is a
predicate which is true or false depending on whether all the pixels in x lie on a
cylindrical surface.
A binary relation describes a property possessed by a pair of nodes. It may be
considered as a set of ordered pairs $R = \{(a_1, b_1), (a_2, b_2), \ldots, (a_n, b_n)\}$. In most
applications, order is important. It is possible to think of a relation as a predicate,
since for any given pair, say $(a_k, b_k)$, either it is an element of the set R or it is not.
However, it seems more descriptive to use the word relation in this context.
Problem 1. Is the largest clique the best match? The largest clique is the largest set
of consistent matches. Is this really the best match?
[Fig. 13.2 drawing: the image is segmented into patches A, B, C, and D; the model has regions 1, 2, and 3.]
Fig. 13.2. A range camera has observed a scene and segmented it into segments which satisfy
the same equation; however, an error has occurred.
[Fig. 13.3 nodes: 1A, 3D, 2B, 3C, 2C, 3B, 2D.]
Fig. 13.3. Candidate matches.
Here is the challenge: Identify what it means to be consistent, that is, determine
$r_A(i, \lambda, j, \lambda')$, or in this example, determine $r_A(1, A, 2, B)$, where the compatibility
function r has the same meaning as in Chapter 10 and the subscript A simply denotes
that an association graph is being used. It is often easier to do this by determining
what is NOT consistent, and that is a problem-dependent decision. Here, we define
any two labelings as consistent if they do not involve the same region. Some example
consistencies for this example are
$r_A(1, A, 2, B) = 1$
$r_A(2, B, 2, C) = -1$
$r_A(2, B, 3, B) = -1$.
The second line says that patch B in the image could not be region 2 in the model
while, simultaneously, patch C in the image is the same region. In both examples,
inconsistencies are really based on the assumption that the segmenter is working
correctly. However, one could allow the segmenter to fail. In that case, new edges are
added, because new relationships are now consistent. For example, $r_A(3, C, 3, D) = 1$
since we believe that two patches could be part of the same region (the segmenter
can fail by oversegmentation); however, $r_A(2, D, 3, D) = -1$ still holds because
we still believe the segmenter will not merge patches (fail by undersegmentation).
Allowing for oversegmentation produces the association graph of Fig. 13.4.
[Fig. 13.4 nodes: 3D, 1A, 2B, 3C, 2C, 3B, 2D.]
Fig. 13.4. Solid lines denote the edges which are present if we assume the segmenter
does not fail. The dotted lines are added if we believe the segmenter can fail by
oversegmentation. The heavy lines indicate a maximal clique.
Note one other type of inconsistency which prevents an edge from being constructed:
3D and 3B are not connected since B and D do not border. That is, we
believe that if the segmentation fails by oversegmentation, the segmenter will not
introduce an entire new patch between the two. We must emphasize that how you
develop these rules is totally problem-dependent!
The maximal clique is not unique.
Once you have the allowable consistencies, the matching is straightforward.
Simply find all maximal cliques. The maximal clique is not unique, since there
may be several cliques of the same size.
In this case, there are at least two maximal cliques:
{(1, A), (2, B), (2, C), (3, D)} and {(1, A), (3, B), (2, C), (2, D)}.
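For a small association graph, the search for a largest clique can be done by brute force. The C sketch below tests every subset of nodes, so its running time doubles with each added node and it is usable only for graphs of roughly a dozen or two nodes. The bitmask representation and the function names are assumptions made for illustration; it does not use the edges of Fig. 13.4, which would have to be entered by hand.

```c
#include <stdint.h>

/* adj[i] is a bitmask of the nodes adjacent to node i
   (bit j set means there is an edge between i and j). */
static int is_clique(const uint32_t adj[], int n, uint32_t subset)
{
    for (int i = 0; i < n; i++) {
        if (!(subset & (1u << i)))
            continue;
        uint32_t others = subset & ~(1u << i);
        if ((adj[i] & others) != others)
            return 0;   /* node i is not connected to every other member */
    }
    return 1;
}

/* Number of set bits in x. */
static int popcount32(uint32_t x)
{
    int c = 0;
    while (x) {
        x &= x - 1;
        c++;
    }
    return c;
}

/* Exhaustive search over all non-empty subsets of n nodes (n < 32).
   Returns the size of a largest clique and stores one such subset in *best. */
int largest_clique(const uint32_t adj[], int n, uint32_t *best)
{
    int best_size = 0;
    *best = 0;
    for (uint32_t s = 1; s < (1u << n); s++) {
        int size = popcount32(s);
        if (size > best_size && is_clique(adj, n, s)) {
            best_size = size;
            *best = s;
        }
    }
    return best_size;
}
```

Because only the first clique of each new size is recorded, ties are not reported; collecting every subset whose size equals the final `best_size` would list all of the largest cliques, which is what the matching step asks for.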
[Figure: templates labeled hair, eye, ear, nose, and mouth.]
In Eq. (13.9), d is a template and F(d) is the point in the image where that template
is applied. TemplateCost is therefore a function indicating how well a particular
template matches the image when applied at its best matching point. SpringCost is
a measure of how much the model must be distorted (the springs stretched) to apply
those particular templates at those particular locations. Finally, it may be that not
every template can be located – in some images, the left eye may not be visible – and a
cost may be imposed for things missing. All these costs are empirically determined.
However, once they are determined, it becomes relatively easy to determine how
well any given image matches any given model.
One significant problem (among several) with spring matching is that
the number of elements matched affects the magnitude of the cost. The costs are
summed, so the total for a poor match of only a few things may be less than the total
for a good match of many (and the poor match is therefore judged better by a minimal cost algorithm).
This is a problem which is not unique to springs and templates. The usual solution
is to normalize the calculations, using a technique such as
13.4 Conclusion
Association graphs are a kind of consistent labeling.
Association graphs use the concepts and formalisms of consistent labeling directly.
The advantage of using a graph structure is that the search for the largest clique is
aided by a body of available software for performing such searches as quickly as
computational complexity allows. Similarly, the springs and templates ideas measure
both consistency and deviation from consistency. The springs and templates concepts
also illustrate both how one might construct an appropriate objective function, and
a problem that can easily arise if one does not pay attention to interpretation of the
objective function: if we are summing match quality, a good match of many things
(adding up lots of small numbers) may be more than (and therefore worse than) a
poor match of only a few things (adding up just a few rather large numbers).
Objective functions need suitable normalization.
SSD is a common objective function.
We began this chapter by pointing out that formal optimization methods, either
descent or “hill-climbing,” are hard to apply to image matching because the search
space is littered with local minima. However, if we initialize the algorithm
sufficiently close to the solution, such techniques will work. We used the sum of squared
differences (SSD), also sometimes called the sum-squared error, as the objective
function.
Eigenimages are lower dimensionality representations of the original images. The
projections are chosen by minimizing the error between the original data and the
projected data.
13.5 Vocabulary
Association graph
Clique
Correspondence
Deformable template
Eigenimage
Hill-climbing
Matching metric
PCA
Template
Assignment 13.1
In this chapter, we stated that the problem of find-
ing the largest clique is NP-complete. What does that
really mean? Suppose you have an association graph with
ten nodes, interconnected with 20 edges. How many tests
must you perform to find all cliques (which you must do
in order to identify which of these are maximal)? You
ARE permitted (encouraged!) to look up clique-finding
in a graph theory text.
Assignment 13.2
In section 13.3.1, an example problem is presented
which involves an association graph which allows for
segmentation errors. The result of that graph is two
maximal cliques, which (presumably) mean two different
interpretations of the scene. Describe in words these
two interpretations.
Assignment 13.3
In the bibliography for this chapter, there is an
incomplete citation to Olson [13.36]. First, locate
a copy of that paper. You may use a search engine, the
Web, the library, or any other resource you wish. In
that paper, the author does template matching in a dif-
ferent way: Using a binary (edge) image and a similar
template, he does not ask “Does the template match the
image at this point?” Instead he asks, “At this point,
how far is it to the nearest edge point?”
How does he perform this operation, apparently a
search, efficiently?
Once he knows the distance to the nearest edge point,
how does he make use of that information to compute a
quality of match measure?
Assignment 13.4
In an image-matching problem, we have two types of ob-
jects, lions and antelope (which occupy only one pixel
each).
Assignment 13.5
Do you think the concepts of springs and templates
would be applicable to Assignment 13.4? Discuss.
Assignment 13.6
[Sketch of the scene: five animals, numbered 1 to 5.]
Still thinking about lions and antelope, you observe
the scene opposite: The sketch is not to scale.
However, for your convenience, we have tabulated the
distance between each pair of animals (Table 13.1).
Lions are yellow (which is denoted by a dotted
interior) or brown (denoted by a black interior -- that’s
right; there aren’t any). Antelope are white (denoted
by a white interior) or yellow. You wish to use
association graph methods to solve this problem; and
since this technique is not as powerful as nonlinear
Table 13.1
Pair    Distance (arbitrary units)
1, 2    5.5
1, 3    2
1, 4    3
1, 5    2
2, 3    2
2, 4    3
2, 5    4
3, 4    2
3, 5    3.8
4, 5    3.4
Recall the correspondence problem, described in section 4.2.2, which may be restated as
“given a set of features in two images, identify which feature in image 1 corresponds to
which feature in image 2.” There is little difference between the model-matching problem
and the correspondence problem, except for the fact that in the correspondence problem, both
of the images may be corrupted by noise.
To solve the correspondence problem, we use a consistency-based philosophy. In this
section, we describe this philosophy through an example, which we will use again in section
13A.2. The first step is to identify feature points which are relatively distinctive. Then the
algorithm will make use of relationships between these points. Points on the boundary of a
region where the curvature of the boundary changes sign meet this requirement, as illustrated
in Fig. 13.6.
Fig. 13.6. Boundary of an object. Points where the sign of the curvature changes are marked.
This example derivation was first described by Shapiro and Brady [13.46] who use eigen-
vector methods as follows.
As in the original springs and templates formulation, we are finding which of a collection
of feature point sets best matches one particular set. Let $d_{ij}$ be the Euclidean distance between
feature points $x_i$ and $x_j$, and construct a matrix of weights
$$ H = [H_{ij}], \quad \text{where } H_{ij} = \exp\left( -\frac{d_{ij}^2}{2\sigma^2} \right). \tag{13.10} $$
The matrix H is diagonalized using standard methods into the product of three matrices
$$ H = E \Lambda E^T \tag{13.11} $$
where E is a matrix with the eigenvectors of H as its columns, and Λ is a diagonal matrix
with the eigenvalues on the diagonal. Let us assume the rows and columns of E and Λ are
sorted so that the eigenvalues are sorted along the diagonal in decreasing size. We think of
each row of E as a feature vector, denoted $F_i$. Thus
$$ E = \begin{bmatrix} F_1 \\ \vdots \\ F_m \end{bmatrix}. $$
Suppose we have two images, $f_1$ and $f_2$, and suppose $f_1$ has m feature points while $f_2$ has n
feature points, and suppose m < n. Then by treating each set of feature points independently,
we have $H_1 = E_1 \Lambda_1 E_1^T$ for image $f_1$, and $H_2 = E_2 \Lambda_2 E_2^T$ for image $f_2$. Since the images have
different numbers of points, the matrices $H_1$ and $H_2$ have different numbers of eigenvalues.
We therefore choose to use only the most significant k features for comparison purposes.
It is important that the directions of the eigenvectors to be matched be consistent, but
changing the sign does not affect the orthonormality. We choose E 1 as a reference and then
orient the axes of E 2 by choosing the direction that best aligns the two sets of feature vectors;
see [13.46] for details. After aligning the axes, a matrix Z characterizing the match between
image 1 and image 2 is defined by
The best matches are indicated by the elements of Z which are the smallest in their row
and column. We will revisit this example in the next section.
Sclaroff and Pentland [13.44] present a further alternative to the springs and templates
formulation: First, compute a description of the entire shape which is robust to sampling
and parameterization error. Then, using this description of the entire shape, find a coordinate
system which effectively describes the shape. Doing this on the image and the model makes
it straightforward to determine cardinal directions.
Wu [10.19] takes the problem of computing optic flow and uses relaxation labeling to find
consistent template matches.
The concept of deformable templates can be combined with graph representations to
produce an approach [13.1] to matching of objects which are similar, but not identical in
shape (e.g. x-rays of hands). The idea of deformable templates can be viewed as an extension
of MAP methods. See [13.26] for a well-written concise description. Methods like this also
find applications in target tracking and automatic target recognition (ATR) [13.12].
We have already discussed the fact that pattern recognition techniques provide for us a means
of making “what is it?” decisions, when we have been presented with a set of measure-
ments (features) which describe the item being observed. There are many ways to develop
classifiers, and methods which follow the neural networks paradigm have been among the
most successful. Neural networks accept features as inputs and produce decisions as outputs.
They are based on mathematical abstractions of what we know about how individual neurons
compute.
There are two types of neural networks which can perform matching, feedforward and
recurrent.
Fig. 13.7. Computation performed by a single neuron. Each input (xi ) is multiplied by a weight
(wi ) and the results are added, producing a signal u, which is passed through a
sigmoid-like nonlinearity function (S) producing the neuron output y.
315 Topic 13A Matching
[Fig. 13.8 panel labels: single-layer perceptron (decision boundary is a hyperplane); two-layer; multilayer, usually 3 (arbitrary decision regions).]
Fig. 13.8. Types of feedforward neural networks, and the decision regions which they can
implement.
Fig. 13.9. A feedforward neural network with three inputs and two outputs. Each circle denotes
a neuron. Weights are not explicitly shown, but exist on the connections.
$$ w_{ij}(t + \Delta t) = w_{ij}(t) - c_k \frac{\partial}{\partial w_{ij}} MSE. $$
Using the three-level neural network model illustrated in Fig. 13.9, the gradient descent rule
may be readily implemented by making use of the chain rule for derivatives. Hussain and
Kabuka [13.24] demonstrate use of a neural network for character recognition.
A recurrent neural network (NN) is one which feeds the output back to the input at run time,
as illustrated in Fig. 13.10. Following the same notation used earlier, in the steady state, the
output of neuron i satisfies
$$ v_i = S(y_i) = S\left( \sum_{j=1}^{d} w_{ij} v_j - I_i \right). \tag{13.13} $$
Fig. 13.10. A recurrent neural network with 3 neurons. The weights are not shown, but
each input to each neuron has an associated weight.
This model of the behavior of a neuron is true only in the steady state. That is, since the
output is dependent on the input, which is the output, which is dependent . . . (to iterate is
human, to recurse, divine²). But such a description is woefully inadequate when things are
changing. In that case, we need some model of the dynamics of the system. Many different
models can be used, and the reader is referred to the literature [13.15, 13.20, 13.23] for a
closer examination. Here, we consider a single, rather simple model, one in which the rate
of change of output from the summer is dependent on the input, and can be represented by a
first-order differential equation:
$$ \frac{d}{dt} y_i(t) = -y_i(t) + \sum_{j=1}^{n} w_{ij} S(u_j) - I_i \tag{13.14} $$
where the $y_i$ are the neuron outputs, the $w_{ij}$ are the weights, as before, and the $I_i$ are
inputs to each neuron from the external world (not shown in the figure). Thus the change
depends on the current state, the inputs from all the other neurons, and the external input.
In operation, a recurrent NN is presented with a particular input, and then allowed to run.
It should converge to a particular state.
This model was described by Hopfield [13.23], among others. In Hopfield’s model, the
rate constant, , resulted from a lumped-constant model of capacitance and resistance in an
operational-amplifier implementation of such a recurrent network.
Now, forget about Eq. (13.14) for a moment, and consider the objective function below,
which we wish to minimize:
$$ E = -\frac{1}{2} \sum_i \sum_j w_{ij} v_i v_j + \sum_i \int_0^{v_i} S^{-1}(\xi)\, d\xi + \sum_i I_i v_i . \tag{13.15} $$
If we are to find the vs which minimize this, we need to differentiate E with respect to those
vs. Doing that, we find
$$ \frac{\partial E}{\partial v_i} = -\sum_j w_{ij} v_j + S^{-1}(v_i) + I_i . \tag{13.16} $$
Now, we observe that the derivative of E with respect to the variables v has the same form as
the dynamics of a Hopfield neural network, or
$$ \frac{\partial E}{\partial v_i} = -\frac{d u_i}{dt} . \tag{13.17} $$
2 L. P. Deutsch.
317 Topic 13A Matching
Fig. 13.11. The angle θi between neighboring feature points is a measurement local to feature point i.
Think about the steady state of the system described by Eq. (13.17). When the network has
finished changing (all the derivatives with respect to time are zero), all the partials of the
energy function are also zero. So we are at an extremum. It is relatively easy to show that one
may ignore that annoying integral in Eq. (13.15), and therefore a Hopfield neural network
finds the set of variables v i which minimize an objective function of the form described by
Eq. (13.15) (without the integral). We illustrate the use of such a network for matching through
an example.
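Before turning to the matching example, the link between the dynamics of Eq. (13.14) and the energy of Eq. (13.15) can be checked numerically. The following is only a minimal sketch under assumptions that are not in the text: the sigmoid S is taken to be tanh (so the integral in Eq. (13.15) has a closed form), the rate constant is 1, the weights are symmetric, and the values of `W`, `I`, and the starting state are made up. Forward-Euler integration stands in for the continuous dynamics.

```python
import numpy as np

# Hypothetical 3-neuron network: symmetric weights W and external inputs I.
W = np.array([[0.0, 1.0, -0.5],
              [1.0, 0.0, 0.8],
              [-0.5, 0.8, 0.0]])
I = np.array([0.1, -0.2, 0.3])

def energy(v):
    """Eq. (13.15) with S = tanh; the integral of arctanh is written in closed form."""
    integral = v * np.arctanh(v) + 0.5 * np.log(1.0 - v ** 2)
    return -0.5 * v @ W @ v + integral.sum() + I @ v

def run(y0, dt=0.01, steps=5000):
    """Forward-Euler integration of Eq. (13.14); returns the final y and the energy trace."""
    y = np.array(y0, dtype=float)
    energies = [energy(np.tanh(y))]
    for _ in range(steps):
        dy = -y + W @ np.tanh(y) - I      # right-hand side of Eq. (13.14)
        y = y + dt * dy
        energies.append(energy(np.tanh(y)))
    return y, np.array(energies)

y_final, energies = run([0.5, -0.3, 0.2])
residual = -y_final + W @ np.tanh(y_final) - I
print("energy at start and end:", energies[0], energies[-1])
print("largest |dy/dt| at the end:", np.abs(residual).max())
```

With a small enough step the energy trace should never increase, which is the sense in which the network acts as a minimizer. Note that this sketch keeps the integral term rather than ignoring it.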
Using the same set of features as the previous section, the zero crossings of the boundary
curvature, we assign a local measure to each feature point, in this case, the angle between the
vectors to neighboring points, as illustrated in Fig. 13.11. We will use this and a more global
feature, the distance between feature points, to solve the correspondence problem.
Assume image 1 (which you can think of as a model, if you wish) has n feature points,
and image 2 has m feature points. We define a matrix of neurons which has n rows and
m columns. The neuron at row i, column j should have a value between zero and one depending
on the degree to which feature point i in the first image matches feature point j in the second
image.
The matching process is posed as minimizing the expression
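$$ E = -\frac{A}{2} \sum_i \sum_j \sum_k \sum_l C_{ijkl} V_{ik} V_{jl} + \frac{q}{2} \left( \sum_i \sum_k \sum_{l \neq k} V_{ik} V_{il} + \sum_k \sum_i \sum_{j \neq i} V_{ik} V_{jk} \right). \tag{13.18} $$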
The first term quantifies the compatibility of matches ik and jl. The last two terms are
included to encourage uniqueness of matches. This form is chosen to allow for occlusions. The
compatibility coefficient is the sum of three terms
where
$$ (a, b) = \begin{cases} 1 & \text{if } |a - b| < T \\ -1 & \text{otherwise} \end{cases} $$
for threshold T; θi is a measurement local to feature point i as illustrated in Fig. 13.11; and
r_ij is a measure of similarity of relational measures between feature points. For example, if
the distance between points i and j is the same as the distance between points k and l, then
labelings ik and jl are consistent.
Proper manipulation of Eq. (13.18) allows it to be put in the form of Eq. (13.15), enabling
minimization by a neural network. More details are available in [13.28, 13.54]. In Fig. 13.12
an outline is shown of a pistol partially occluded by a hammer. A neural network using these
principles is able to identify both objects from this image.
Fig. 13.12. Silhouette of a gun partially occluded by a hammer (redrawn from [13.28]).
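To make the roles of the three terms of Eq. (13.18) concrete, the objective can be evaluated directly for candidate match matrices. This is a sketch only. It assumes that both uniqueness terms share the weight q/2, that `V[i, k]` holds the match strength between point i of image 1 and point k of image 2, and that the compatibility array `C` (here filled with ones) is supplied by the caller; none of the values come from the text.

```python
import numpy as np

def matching_energy(C, V, A=1.0, q=1.0):
    """Evaluate the objective of Eq. (13.18) for an n-by-m match matrix V.

    C[i, j, k, l] is the compatibility of pairing (i with k) and (j with l).
    """
    compat = np.einsum("ijkl,ik,jl->", C, V, V)
    # sum_i sum_k sum_{l != k} V_ik V_il = sum_i [(sum_k V_ik)^2 - sum_k V_ik^2]
    row_clash = (V.sum(axis=1) ** 2).sum() - (V ** 2).sum()
    # sum_k sum_i sum_{j != i} V_ik V_jk: the same identity applied down each column
    col_clash = (V.sum(axis=0) ** 2).sum() - (V ** 2).sum()
    return -0.5 * A * compat + 0.5 * q * (row_clash + col_clash)

n = m = 3
C = np.ones((n, n, m, m))            # hypothetical: every pair of matches is compatible
one_to_one = np.eye(n)               # point i matched to point i and nothing else
ambiguous = np.ones((n, m)) / m      # every point spread evenly over all candidates

print(matching_energy(C, one_to_one), matching_energy(C, ambiguous))
```

Both matrices earn the same compatibility reward here, so the difference in energy comes entirely from the uniqueness terms, which vanish for the one-to-one assignment.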
Up to this point, we have considered the process of image matching as searching a data
base of models for the model which best matches the observation. We have not addressed
the process of “search” itself. One could, of course, simply try all models, but that could
be prohibitively time-consuming, particularly in instances which involve large data bases
of models. In applications like automatic target recognition, where matching requires both
high speed and large data bases [13.45], better methods are required. The alternate paradigm,
indexing (sometimes called image hashing), is analyzed in [9.6]. In an indexing scheme, a set
of parameters is extracted from the image. Such parameters need to be invariant
to as many image transformations as possible and also need to be robust [13.1]. The resulting
parameter vector is then used as an index into a lookup table containing references to models.
The lookup process returns a list of candidate models consistent with this particular parameter
vector. To see how this works, consider the following algorithm.
Begin by looking at local areas around the boundary and attempting to match each local
area with a data base of feature descriptors such as lines, circular arcs, and minima and maxima
of curvature. Assuming a successful segmentation of an unoccluded object, we start with an
edge image, where the edges are not required to be connected. About some point [x0 , y0 ]
on the edge, we sample the edge in that neighborhood using a sampling scheme3 which is
invariant to zoom. Form all possible combinations of that point with two other nearby points
and generate an invariant parameter vector similar to that described in [9.37]. That parameter
vector is then used to index a data base of local shapes. For each entry selected, a feature
instance is extracted and after all the triples have been considered, the feature instance with
the highest number of votes is selected.
Now, the boundary is represented by a sequence of feature instances, and the indexing
method may be repeated, using a look-up table of object models which are indexed by
geometry and occurrence of feature instances.
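The lookup-and-vote step of the indexing paradigm can be sketched as follows. This is an illustration under assumptions, not the scheme of the cited papers: the invariant vectors are taken as already computed, the table key is obtained by simple uniform quantization (the hypothetical `quantize` helper with a made-up `step`), and the model names and numbers are invented.

```python
from collections import Counter, defaultdict

def quantize(params, step=0.1):
    """Turn a real-valued invariant vector into a hashable table key."""
    return tuple(round(p / step) for p in params)

def build_table(models, step=0.1):
    """models maps a model name to a list of invariant parameter vectors."""
    table = defaultdict(set)
    for name, vectors in models.items():
        for vec in vectors:
            table[quantize(vec, step)].add(name)
    return table

def vote(table, observed, step=0.1):
    """Each observed vector votes once for every model stored under its key."""
    votes = Counter()
    for vec in observed:
        for name in table.get(quantize(vec, step), ()):
            votes[name] += 1
    return votes

# Hypothetical local-shape models and three observed parameter vectors.
models = {"arc": [(0.52, 1.31), (0.48, 1.29)], "corner": [(1.90, 0.11)]}
table = build_table(models)
votes = vote(table, [(0.51, 1.30), (0.49, 1.32), (1.88, 0.12)])
best = votes.most_common(1)[0][0]
print(votes, best)
```

The lookup cost does not grow with the number of stored models, which is the attraction of indexing. Uniform quantization is the weak point of this sketch: a vector that falls near a bin edge can land in a neighboring key, so a practical table would also probe adjacent bins.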
Numerous other approaches to indexing exist [13.5, 13.32]; an excellent review is included
in [13.48]. The space requirements for some indexing schemes are analyzed in [13.25].
As data bases get larger, one must consider the image indexing problem in the context of
the entire digital library. The reader is directed to an entire special issue of IEEE Transactions
on Pattern Analysis and Machine Intelligence (August, 1996) which addresses this.
We start with simply finding a set of numbers which are invariant. The approach will be to
find five points in the 3D model and calculate from them some properties which uniquely
characterize them in an invariant way. Then, we will find five points in the image and determine
which model they best match.
Choose a set of five feature points {X_1, X_2, X_3, X_4, X_5} from the 3D model, at least four
of which are noncoplanar. Since five points cannot be linearly independent, we can write one
of them as a linear combination of the others. We choose to represent point X_5 in this way,
$$ X_5 = a X_1 + b X_2 + c X_3 + d X_4 . \tag{13.20} $$
3 To avoid cluttering this description of the indexing paradigm with lots of details, we ask the reader to tolerate
the omission of some details. They are in the cited paper.
We make use of the observation that the determinant of a matrix of points is invariant to
rigid body motions,4 and write the determinant which is constructed from any four of the five
points, using as a subscript, the index of the point we omitted. For example,
$$ M_1 = |X_2\ X_3\ X_4\ X_5| . \tag{13.21} $$
From the linear dependence of X_5 in Eq. (13.20), we substitute for X_5 in each case, deriving
$$ M_1 = a|X_2\ X_3\ X_4\ X_1| + b|X_2\ X_3\ X_4\ X_2| + c|X_2\ X_3\ X_4\ X_3| + d|X_2\ X_3\ X_4\ X_4| . \tag{13.22} $$
This can be simplified by observing that the determinant of a matrix which has two identical
columns is zero:
$$ M_1 = a|X_2\ X_3\ X_4\ X_1| . \tag{13.23} $$
But this can be simplified even more by observing that if you interchange two columns, you
flip the sign of the determinant. Moving X_1 to the first column takes three such interchanges.
So
$$ M_1 = -a M_5 . \tag{13.25} $$
Similarly
$$ \begin{aligned} M_2 &= b M_5 \\ M_3 &= -c M_5 \\ M_4 &= d M_5 . \end{aligned} \tag{13.26} $$
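The sign pattern in Eqs. (13.25) and (13.26) is easy to get wrong by hand, so a quick numerical check is worthwhile. The sketch below assumes the points are written as homogeneous 4-vectors (three coordinates followed by a 1), which is what makes a 4 × 4 determinant of four points meaningful; the random points and the helper `M` are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Five hypothetical 3D points as homogeneous column vectors (last row is 1).
X = np.vstack([rng.normal(size=(3, 5)), np.ones((1, 5))])

# Coefficients of Eq. (13.20): X5 = a X1 + b X2 + c X3 + d X4.
a, b, c, d = np.linalg.solve(X[:, :4], X[:, 4])

def M(omit):
    """Determinant of the four points left after omitting point `omit` (1-based)."""
    keep = [k for k in range(5) if k != omit - 1]
    return np.linalg.det(X[:, keep])

M1, M2, M3, M4, M5 = (M(k) for k in range(1, 6))
print(M1 + a * M5, M2 - b * M5, M3 + c * M5, M4 - d * M5)   # all close to zero
```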
We construct 3 × 3 matrices by leaving out two indices, and denoting by subscript the indices
left out:
$$ m_{12} = |x_3\ x_4\ x_5| . \tag{13.29} $$
4 In fact, absolute invariants of linear forms are always ratios of powers of determinants [13.19].
At this point, we simplify the notation, get rid of the xs and just keep track of the subscripts,
rewriting the definition of $m_{12}$:
$$ m_{12} = |3\ 4\ 5| . \tag{13.30} $$
As above, we can do algebra to relate the determinants and the coefficients, for example,
$$ \begin{aligned} m_{12} &= a|3\ 4\ 1| + b|3\ 4\ 2| \\ &= a|1\ 3\ 4| + b|2\ 3\ 4| \\ &= a\, m_{25} + b\, m_{15} \end{aligned} \tag{13.31} $$
and
$$ \begin{aligned} m_{13} &= a\, m_{35} - c\, m_{15} \\ m_{14} &= a\, m_{45} + d\, m_{15} . \end{aligned} \tag{13.32} $$
We have determined forms for the coefficients in terms of the $M_i$s, and substituting those relations
into the equations we just derived produces
$$ \begin{aligned} M_5 m_{12} + M_1 m_{25} - M_2 m_{15} &= 0 \\ M_5 m_{13} + M_1 m_{35} - M_3 m_{15} &= 0 \\ M_5 m_{14} + M_1 m_{45} - M_4 m_{15} &= 0. \end{aligned} \tag{13.33} $$
These relations are invariant to both 3D and 2D motions except for a multiplicative scale
which affects all the Mi s the same. We can eliminate this dependence by using ratios and
define 3D invariants
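$$ I_1 = \frac{M_1}{M_5} \qquad I_2 = \frac{M_2}{M_5} \qquad I_3 = \frac{M_3}{M_5} \tag{13.34} $$
and 2D invariants
$$ i_{12} = \frac{m_{12}}{m_{15}} \qquad i_{13} = \frac{m_{13}}{m_{15}} \qquad i_{25} = \frac{m_{25}}{m_{15}} \qquad i_{35} = \frac{m_{35}}{m_{15}} . \tag{13.35} $$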
The denominators are not zero, since they are the determinants of matrices which we know
are nonsingular. Look at Eq. (13.33) and divide the top line by M5 :
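$$ \frac{M_5}{M_5} m_{12} + \frac{M_1}{M_5} m_{25} - \frac{M_2}{M_5} m_{15} = 0, \tag{13.36} $$
which simplifies to
$$ m_{12} + I_1 m_{25} - I_2 m_{15} = 0. \tag{13.37} $$
Dividing by $m_{15}$, and treating the second line of Eq. (13.33) in the same way, gives
$$ \begin{aligned} i_{12} + I_1 i_{25} - I_2 &= 0 \\ i_{13} + I_1 i_{35} - I_3 &= 0. \end{aligned} \tag{13.38} $$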
So if we have 2D invariants we have two equations for the 3D invariants. The two equations of
Eq. (13.38) do not, unfortunately, determine the three 3D invariants. Still those two equations
determine a space line in the 3-space of the Is.
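The claim that every view yields a line through the same point (I1, I2, I3) can be tested numerically. The sketch below is illustrative only: it assumes homogeneous coordinates and an affine camera (a 3 × 4 matrix whose last row is [0, 0, 0, 1]), consistent with the simplified, non-perspective setting of this derivation; the random points, the random cameras, and the helper names are all made up.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(3, 5)), np.ones((1, 5))])   # 3D points, homogeneous

def det_omitting(P, omit):
    """Determinant of the columns of P that remain after omitting the 1-based indices in `omit`."""
    keep = [k for k in range(5) if k + 1 not in omit]
    return np.linalg.det(P[:, keep])

M = {k: det_omitting(X, (k,)) for k in range(1, 6)}
I1, I2, I3 = M[1] / M[5], M[2] / M[5], M[3] / M[5]           # Eq. (13.34)

def line_residuals(camera):
    """Project with a 3x4 affine camera and evaluate both lines of Eq. (13.38)."""
    x = camera @ X                                            # 2D points, homogeneous
    m = {pair: det_omitting(x, pair) for pair in combinations(range(1, 6), 2)}
    i12, i13 = m[1, 2] / m[1, 5], m[1, 3] / m[1, 5]           # Eq. (13.35)
    i25, i35 = m[2, 5] / m[1, 5], m[3, 5] / m[1, 5]
    return i12 + I1 * i25 - I2, i13 + I1 * i35 - I3

def random_affine_camera():
    return np.vstack([rng.normal(size=(2, 4)), [0.0, 0.0, 0.0, 1.0]])

views = [line_residuals(random_affine_camera()) for _ in range(3)]
print(views)   # every residual should be close to zero
```

Each camera produces different 2D invariants, and hence a different line, yet the model point satisfies both equations in every view.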
How do we use an idea like this? Given a 3D model of an object, and any five points,
four of which are not coplanar, we can find I1 , I2 , and I3 , a point in a 3D space. To perform
recognition, we first extract from the 2D image (generally several) 5-tuples of feature points
and from them construct the 2D invariants. Each 5-tuple gives rise to two equations in I1 ,
I2 , I3 space, that is, a straight line in the 3D invariant space. If a 5-tuple in the 2D image
is a projection of some 5-tuple in 3D, then the line so obtained will pass through the single
point representing the model. If we have a different projection of those five points, we get a
different straight line, but it still passes through the model point.
Implementing this for realistic scenes is slightly more complicated than this description
because one must actually make use of projective geometry rather than assuming orthogonal
projections. Other complications arise in determining a suitable way to choose 5-tuples, and
a means for dealing with the fact that the line may “almost” pass through the point. Weiss
and Ray [13.52] address these issues.
13A.5 Conclusion
So far, we have described quite a collection of representations for objects, but certainly not
all that are in the literature. Other methods include variations on deformable models [13.10,
13.11], especially for range images [13.21].
Consider Fig. 13.13: Should you match it to a circle or a six-sided polygon? Clearly, there
is no simple answer to this question. If you have prior, problem-specific knowledge that you
are always dealing with circular objects, you might choose to use the circular model, which
is certainly less complex than that of a polygon. The idea of minimum description length
(MDL) provides some help along these lines. The MDL paradigm states that the optimal
representation for a given image may be determined by minimizing the combined length of
the encoding of the representation and the residual error. Interestingly, a MAP representation
can be shown [13.9, 13.30] to be equivalent to the MDL representation where the prior truly
represents the signal.
Fig. 13.13. A set of points produced by an edge detector which might come from a circle or a polygon.
Schweitzer [13.43] uses the MDL philosophy to develop algorithms for computing the
optic flow, and Lanterman [13.29] uses it to characterize infrared scenes in ATR applications –
“if there are several descriptions compatible with the observed data, we select the most
parsimonious” [13.29].
Rissanen [13.41] suggests that the quality of an object/model match could be represented
by
where x is the observed object, the model is represented as a vector of parameters, P(x | model) is
the conditional probability of making this particular measurement given the model, and L(model)
denotes the number of bits required to represent the model. The logarithm of the conditional
probability is then a measure of how well the data fit the model. We thus may trade off a
more precise fit of a more complex model with a less accurate fit of a simpler model [13.6].
Ultimately, machine vision is not going to be solved by one program, one algorithm, or
one set of mathematical concepts. Ultimately, its solution will depend on the ability to build
systems which integrate a collection of specialists. The jury is still out on how to accomplish
this. Regrettably, only a few papers have undertaken this formidable task. For example, Grosso
and Tistarelli [13.18] combine stereopsis and motion. Bilbro and Snyder [13.4] fuse luminance
and range to improve the quality of the range imagery, and Pankanti and Jain [13.38] fuse
stereo, shading, and relaxation labeling. Zhu and Yuille [8.80] incorporate the MDL approach,
including active contours and region growing, into a unified look at segmentation. Gong and
Kulikowski [13.16] use a planning strategy, primarily in the medical application area.
In sections 13A.1 and 13A.2, the first step is to identify feature points which are relatively
distinctive. Then the algorithm makes use of relationships between these points, relying on
consistency to find the best match.
In the discussion of recurrent neural networks, we showed that such a network achieves a
stable state which is in fact the minimum of the objective function of Eq. (13.15).
(Margin note: A recurrent neural network is an optimization engine!)
13A.6 Vocabulary
Eigenvector
Feedforward neural net
Geometric invariant
Image indexing
Recurrent neural net
Bibliography
[13.1] Y. Amit and A. Kong, “Graphical Templates for Model Registration,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 18(3), 1996.
[13.2] K. Astrom, “Fundamental Limitations on Projective Invariants of Planar Curves,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1), 1995.
[13.3] B. Bhanu and O. Faugeras, “Shape Matching of Two Dimensional Objects,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 6(2), 1984.
[13.4] G. Bilbro and W. Snyder, “Fusion of Range and Luminance Data,” IEEE Symposium
on Intelligent Control, Arlington, August, 1988.
[13.5] A. Bimbo and P. Pala, “Visual Image Retrieval by Elastic Matching of User Sketches,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2), 1997.
[13.6] J. Canning, “A Minimum Description Length Model for Recognizing Objects with
Variable Appearances (The VAPOR Model),” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 16(10), 1994.
[13.7] Q. Chen, M. Defrise, and F. Deconinck, “Symmetric Phase-only Matched Filtering
of Fourier–Mellin Transforms for Image Reconstruction and Recognition,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(12), 1994.
[13.8] T. Chen and W. Lin, “A Neural Network Approach to CSG-based 3-D Object Recog-
nition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(7),
1994.
[13.9] T. Darrell and A. Pentland, “Cooperative Robust Estimation Using Layers of
Support,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(5),
1995.
[13.10] D. DeCarlo and D. Metaxas, “Blended Deformable Models,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 18(4), 1996.
[13.11] S. Dickinson, D. Metaxas, and A. Pentland, “The Role of Model-based Segmentation
in the Recovery of Volumetric Parts From Range Data,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 19(3), 1997.
[13.12] M. Dubuisson Jolly, S. Lakshmanan, and A. Jain, “Vehicle Segmentation and
Classification using Deformable Templates,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 18(3), 1996.
[13.13] M. Fischler and R. Elschlager, “The Representation and Matching of Pictorial Struc-
tures,” IEEE Transactions on Computers, 22(1), 1973.
[13.14] S. Gold and A. Rangarajan, “A Graduated Assignment Algorithm for Graph Match-
ing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4), 1996.
[13.15] R. Golden, Mathematical Methods for Neural Network Analysis and Design,
Cambridge, MA, MIT Press, 1996.
[13.16] L. Gong and C. Kulikowski, “Composition of Image Analysis Processes Through
Object-centered Hierarchical Planning,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 17(10), 1995.
[13.17] F. Goudail, E. Lange, T. Iwamoto, K. Kyuma, and N. Otsu, “Face Recognition
System Using Local Autocorrelation and Multiscale Integration,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 18(10), 1996.
[13.18] E. Grosso and M. Tistarelli, “Active/Dynamic Stereo Vision,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(9), 1995.
[13.19] G. Gurevich, Foundations of the Theory of Algebraic Invariants, Transl. Raddock
and Spencer, Groningen, The Netherlands, Nordcliff Ltd, 1964.
[13.20] S. Haykin, Neural Networks, A Comprehensive Foundation, Englewood Cliffs, NJ,
Prentice-Hall, 1999.
[13.21] M. Hebert, K. Ikeuchi, and H. Delingette, “Spherical Representation for Recogni-
tion of Free-form Surfaces,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 17(7), 1995.
[13.22] D. Heisterkamp and P. Bhattacharya, “Matching of 3D Polygonal Arcs,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1997.
[13.23] J. Hopfield, “Neural Networks and Physical Systems with Emergent Collective
Computational Abilities,” Proceedings of the National Academy of Sciences, 79,
pp. 2554–2558, 1982.
[13.24] B. Hussain and M. Kabuka, “A Novel Feature Recognition Neural Network and its
Application to Character Recognition,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(1), 1994.
[13.25] D. Jacobs, “The Space Requirements of Indexing Under Perspective Projections,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(3), 1996.
[13.43] H. Schweitzer, “Occam Algorithms for Computing Visual Motion,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 17(11), 1995.
[13.44] S. Sclaroff and A. Pentland, “Model Matching for Correspondence and Recognition,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(6), 1995.
[13.45] K. Sengupta and K. Boyer, “Organizing Large Structural Modelbases,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 17(4), 1995.
[13.46] L. Shapiro and J. M. Brady, “Feature-based Correspondence: an Eigenvector
Approach,” Image and Vision Computing, 10(5), 1992.
[13.47] X. Shen and P. Palmer, “Uncertainty Propagation and Matching of Junctions as Fea-
ture Groupings,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
22(12), 2000.
[13.48] A. Smeulders, M. Worring, S. Santini, G. Gupta, and R. Jain, “Content-based Image
Retrieval at the End of the Early Years,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(12), 2000.
[13.49] D. Swets and J. Weng, “Using Discriminant Eigenfeatures for Image Retrieval,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 1996.
[13.50] M. Turk, and A. Pentland, “Eigenfaces for Recognition,” Journal of Cognitive
Neuroscience, 3(1), pp. 71–86, 1991.
[13.51] X. Wang and H. Qi, “Face Recognition Using Optimal Non-orthogonal Wavelet
Basis Evaluated by Information Complexity,” International Conference on Pattern
Recognition, vol. 1, pp. 164–167, Quebec, Canada, August, 2002.
[13.52] I. Weiss and M. Ray, “Model-based Recognition of 3D Objects from Single Images,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), 2001.
[13.53] M. Yang and J. Lee, “Object Identification from Multiple Images Based on Point
Matching under a General Transformation,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 16(7), 1994.
[13.54] S. Yoon, A New Multiresolution Approximation Approach to Object Recognition,
Ph.D. Thesis, North Carolina State University, 1995.
14 Statistical pattern recognition
Statistics are used much like a drunk uses a lamppost: for support, not illumination
Vin Scully
The discipline of statistical pattern recognition by itself can fill textbooks (and in fact,
it does). For that reason, no effort is made to cover the topic in detail in this single
chapter. However, the student in machine vision needs to know at least something
about statistical pattern recognition in order to read the literature and to properly
put the other machine vision topics in context. For that reason, a brief overview
of the field of statistical methods is included here. To do serious research in
machine vision, however, this chapter is not sufficient, and the student must take a
full course in statistical pattern recognition. For texts, we recommend several. The
original version of the text by Duda and Hart [14.3] included both statistical pattern
classification and machine vision; the new version [14.4] is pretty much
limited to classification, and we recommend it for completeness. The much older
text by Fukunaga [14.6] still retains a lot of useful information, and we recommend
[14.11] for readability.
Recall the example described in section 13.2. In that example, we are given models
for axes and hatchets which were derived statistically by computing averages of
samples known to be either axes or hatchets. We called these collections “training
sets.” In section 13.2, we compared an unknown, represented by a feature vector,
with both models and decided to assign the unknown to the class of the model it
most closely resembled, where “closely resembled” was simple Euclidean distance
between the unknown feature vector and the two models. In this chapter, we
demonstrate that this “closest mean” decision rule is actually a simplification of maximum
likelihood assuming a Gaussian probability density for the classes. Further, we show
that decision rules other than closest mean may be used, and may perform better
and/or be more effectively computed.
We begin by discussing some of the options in decision rules.
327 14.1 Design of a classifier
[Fig. 14.1: scatter graph with axes Area and Length; each measurement is marked x or o.]
Figure 14.1 illustrates the result of a large number of measurements of two different
industrial parts. The features, area and length, have both been measured and each
measurement is indicated in the figure by a single mark. (A chart like this is called a
“scatter graph.”) The x points represent one class, flanges, and the o points represent
a second class, gaskets. A linear decision boundary has been drawn on the figure. A
linear decision rule would be as follows: “Decide the unknown object is a flange if the
result of the measurements lies on the left of the decision boundary; otherwise decide
it is a gasket.” Linear decision rules are particularly attractive because they can be
implemented by linear machines which have a great deal of potential parallelism
and therefore high speed. As can be seen from the figure, these two classes are
not linearly separable. That is, there does not exist any one straight line which
completely partitions the two classes. The choice of the best such straight line is the
result of the linear classifier design process. A variation on linear machines is given
in section 14A.2, where we introduce support vector machines.
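A linear decision rule of the kind just quoted amounts to checking the sign of one linear expression. The sketch below is an illustration only; the weight vector `w`, the offset `b`, and the two sample measurements are invented and are not read off Fig. 14.1.

```python
import numpy as np

# Hypothetical boundary w . [length, area] + b = 0; not taken from Fig. 14.1.
w = np.array([1.0, 0.5])
b = -10.0

def classify(length, area):
    """Linear decision rule: one side of the line is 'flange', the other 'gasket'."""
    g = w @ np.array([length, area]) + b
    return "flange" if g < 0 else "gasket"

samples = [(4.0, 6.0), (9.0, 7.0)]
labels = [classify(length, area) for length, area in samples]
print(labels)
```

Designing the classifier means choosing `w` and `b`; the rule itself costs one dot product per object, which is why linear machines are fast and easy to parallelize.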
Supervised learning
If we are given one training set for each class and from those training sets we can
develop the statistical representations of the classes, then this process is known as
“supervised learning.” The word supervised refers to the fact that each data point is
independently labeled according to the class to which it belongs. Each class may then
be characterized either statistically by its mean, variance or other statistical measures,
or by some other parametric representation. Fig. 14.1 illustrates the data distribution
resulting from supervised sampling, i.e., the x points are identified as belonging to
one class and the o points to another. The example we have used previously in section
13.2, of distinguishing axes from hatchets, is a supervised learning problem, since
it was assumed that we had training sets of both classes.
Unsupervised learning
Fig. 14.2. In unsupervised learning, only one set of measurements is taken; however, such data
may fall into natural clusters.
In this section we will design a classifier based upon assumptions concerning the
statistical nature of the training sets. If these assumptions are valid for the
training sets in question, then the classifier which results will give the best
performance. We will also investigate the performance of such a classifier, including error
rates.
The attentive student will note similarities between the descriptions of statistical
concepts presented here and those presented in Chapter 6. The similarities are
correct and deliberate. In that chapter, we sought an image which minimized certain
properties. In this chapter we seek a decision: to which class does an object belong?
Fig. 14.3. The same measurement on two objects may have different average values, but due to
noise in the measurement, or actual variation, these values may overlap.
times as many flanges as gaskets. Flanges and gaskets may come down the conveyor
at random times. But because of our a priori knowledge that the plant manufactures
nine times as many flanges as gaskets, we know that we are much more likely to see
a flange than a gasket if we choose to look at the conveyor at some random time. Thus
the a priori probability of flanges is 0.9 and the a priori probability of gaskets is 0.1.
We define p(x|w_i) to represent the conditional probability density of a measurement
x occurring given that the sample is known to come from class w_i. For a
particular w_i, p(x|w_i) should be thought of as a function of x. Suppose we have a
factory which manufactures axes and hatchets. Then we might find the probability
densities for the lengths of axes and hatchets to be represented by Fig. 14.3. In that
figure we see that an axe is most likely to be 30 inches long and a hatchet most likely
to be 12 inches, but that some variation in length can occur.
The probability density function may be characterized in several possible ways.
One is by simply tabulating the number of times a particular value occurs for each
possible value of the variable, in this case, length. Such a tabulation is referred to
as a histogram of the variables. Properly normalized, a histogram can be a useful
representation of a probability density function, but of course requires that only a
finite number of possible values may exist. One may also describe a density function
in a parametric way using some analytic function (e.g., the Gaussian) to represent
the density.
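Both ways of characterizing a density can be tried on a small sample. The following sketch uses made-up hatchet lengths in whole inches; it builds the normalized histogram and, as the parametric alternative, a Gaussian fitted by the sample mean and standard deviation. The names `probabilities` and `gaussian` are illustrative.

```python
import numpy as np

# Hypothetical hatchet lengths in inches.
lengths = np.array([11, 12, 12, 13, 12, 11, 13, 12, 14, 12])

# Histogram: count how often each value occurs, then normalize so the entries sum to 1.
values, counts = np.unique(lengths, return_counts=True)
probabilities = counts / counts.sum()

# Parametric alternative: a Gaussian described by just two numbers.
mu, sigma = lengths.mean(), lengths.std()

def gaussian(x):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

for v, p in zip(values, probabilities):
    print(v, p, gaussian(v))
```

The histogram needs one number per possible value, whereas the Gaussian needs two numbers in total but commits to a particular shape.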
Finally we define P(w_i|x), the posterior conditional probability, to represent the
conditional probability that the object being observed belongs to a class w_i given a
measurement x. P(w_i|x) is what we are looking for. We will use it as our decision
rule or, more correctly, as our discriminant function. Our decision rule will then be
as follows: For a measurement x made on an unknown object, compute P(w_i|x)
for each class, that is, for each possible value of i. Then decide that the unknown
belongs to the class i for which P(w_i|x) is greater than P(w_j|x) for all j ≠ i.
(Margin note: The term likelihood is used rather than probability because we will actually maximize some function of the probability.)
331 14.2 Bayes’ rule and the maximum likelihood classifier
When we make a classification decision based on P(w i |x), we are using a maximum
likelihood classifier.
We can relate the three functions just defined by using Bayes’ rule:
$$ P(w_i|x) = \frac{p(x|w_i)\, P(w_i)}{\text{Something}} \tag{14.1} $$
$$ \text{Something} = p(x) = \sum_{j=1}^{c} p(x|w_j)\, P(w_j). \tag{14.2} $$
In Eq. (14.1) we used “something” to represent the denominator for the conditional
probability density. We used the word “something” to call attention to the fact that this
number represents the probability density of that value of x occurring, independent
of the class of the observation. Since this number is independent of the class, and is
the same for all classes, it therefore does not provide us any help in distinguishing
which class is most likely. Instead, it is a normalization constant which we use to
ensure that the number P(w i |x) has the desirable properties of a probability; that
is, it lies between 0 and 1 and when summed over all the classes, it sums to 1 (the
observed object belongs to at least one of the classes which we are considering).
In a sense, Equation (14.1) solves the pattern recognition problem. It tells us how
to make a decision, assuming we know each of the components of the RHS. In the
next section we consider how one goes about determining those components.
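As a concrete instance of Eqs. (14.1) and (14.2), consider the axe and hatchet lengths again. The sketch below assumes Gaussian class-conditional densities; the means follow the 30 inch and 12 inch figures quoted earlier, while the standard deviations and the equal priors are invented for the illustration.

```python
import math

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

# Means follow the text; the spreads and priors are made up.
classes = {
    "axe":     {"mu": 30.0, "sigma": 4.0, "prior": 0.5},
    "hatchet": {"mu": 12.0, "sigma": 3.0, "prior": 0.5},
}

def posteriors(x):
    """Eq. (14.1): likelihood times prior, divided by p(x) from Eq. (14.2)."""
    joint = {name: gaussian(x, c["mu"], c["sigma"]) * c["prior"]
             for name, c in classes.items()}
    p_x = sum(joint.values())            # the "Something" in the denominator
    return {name: value / p_x for name, value in joint.items()}

post = posteriors(18.0)
decision = max(post, key=post.get)
print(post, decision)
```

Dividing by `p_x` changes none of the comparisons between classes; it only rescales the scores so that they sum to 1.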
These two numbers serve to completely describe the conditional probability density
of the variable x for class i assuming that the density has a Gaussian form.
332 Statistical pattern recognition
Assuming that the samples are drawn independently, the probability of drawing
the entire set $X_i$ is determined by
$$ p(X_i) = \prod_{k=1}^{n_i} p(x_{ik}). \tag{14.6} $$
The maximum likelihood estimate of $\theta_i$ is then defined as that value $\hat{\theta}_i$ which
maximizes $p(X_i|\theta_i)$. Eq. (14.7) describes the likelihood of any particular training set
occurring, given that the probability distribution is described by the parameter
vector $\theta_i$. Since we are dealing with the Gaussian density, we rewrite Eq. (14.7) as
$$ p(X_i|\theta_i) = \prod_{k=1}^{n_i} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\frac{(x_{ik} - \mu_i)^2}{2\sigma_i^2} \right). \tag{14.8} $$
(Margin note: A second use of the term “maximum likelihood.”)
Now an important observation: The value of $\theta_i$ which maximizes $p(X_i|\theta_i)$ also
maximizes $\ln[p(X_i|\theta_i)]$. This is true because the natural logarithm is a monotonically
increasing function. Thus, we have our choice of finding the parameter vector $\theta_i$
which maximizes either the density or its logarithm. The logarithm will be much
easier to use. Taking the log of the RHS we find
$$ \ln(p(X_i|\theta_i)) = \sum_{k=1}^{n_i} \ln\frac{1}{\sqrt{2\pi}\,\sigma_i} - \sum_{k=1}^{n_i} \frac{1}{2}\left(\frac{x_{ik} - \mu_i}{\sigma_i}\right)^2 . \tag{14.9} $$
which simplifies to
$$ \sum_{k=1}^{n_i} x_{ik} - \sum_{k=1}^{n_i} \hat{\mu}_i = 0 \tag{14.11} $$
$$ \hat{\mu}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} x_{ik} . \tag{14.12} $$
$\hat{\mu}_i$ is known as the sample mean and the fact that it is equal to the average value is
certainly intuitively satisfying.
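Rewrite Eq. (14.9), to make it a bit simpler and the log of the probability becomes
$$ L = \sum_{k=1}^{n} \left[ -\ln\sqrt{2\pi} - \ln\sigma - \frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^2 \right] \tag{14.14} $$
and we find (from $\partial L/\partial\mu = 0$),
$$ \frac{1}{\sigma^2} \sum_{k=1}^{n} (x_k - \hat{\mu}) = 0 \tag{14.15} $$
and
$$ \hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k \tag{14.18} $$
as before.
Equation (14.16) simplifies similarly to yield
$$ \frac{n}{\hat{\sigma}} = \frac{1}{\hat{\sigma}^3} \sum_{k=1}^{n} (x_k - \hat{\mu})^2 , \tag{14.19} $$
and therefore
$$ \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2 . \tag{14.20} $$
Thus we see that the best estimates for the parameters of a normal density are the
familiar sample mean and sample variance.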
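That the sample mean and the sample variance (with divisor n) really do maximize the log-likelihood can be confirmed by comparing against nearby parameter values. The sketch below uses a small invented training set; `log_likelihood` sums the per-sample log of the Gaussian density, which is an assumption about how the log-probability above is accumulated over the set.

```python
import numpy as np

def log_likelihood(x, mu, sigma):
    """Log of the product of Gaussian densities over the training set."""
    return np.sum(-np.log(np.sqrt(2 * np.pi)) - np.log(sigma)
                  - 0.5 * ((x - mu) / sigma) ** 2)

x = np.array([11.0, 12.5, 12.0, 13.5, 11.5, 12.5])   # hypothetical training set

mu_hat = x.mean()                                    # Eq. (14.18)
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))      # Eq. (14.20): divides by n, not n - 1

best = log_likelihood(x, mu_hat, sigma_hat)
nearby = [log_likelihood(x, mu_hat + dm, sigma_hat + ds)
          for dm in (-0.1, 0.0, 0.1) for ds in (-0.1, 0.0, 0.1)]
print(best, max(nearby))
```

Note the divisor: the maximum likelihood variance uses n, which is NumPy's default for `np.var`, rather than the n − 1 of the unbiased estimate.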
$$ \hat{\boldsymbol{\mu}}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} \mathbf{x}_{ik} \tag{14.21} $$
$$ K_i = \frac{1}{n_i} \sum_{k=1}^{n_i} (\mathbf{x}_{ik} - \hat{\boldsymbol{\mu}}_i)(\mathbf{x}_{ik} - \hat{\boldsymbol{\mu}}_i)^T . \tag{14.22} $$
(Margin note: Look at how K is defined. Is it a matrix? A scalar? A vector?)
Thus we have essentially the same results for the multivariate case as for the univariate
case: That the best estimate (in the maximum likelihood sense) of the mean and
variance of a Gaussian are the sample mean and sample (co)variance.
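Eqs. (14.21) and (14.22) translate almost directly into array operations. The sketch below draws a made-up two-dimensional training set, forms the estimates, and compares the result with NumPy's own covariance routine called with the matching divisor; the data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2D training set for one class: 200 samples, one per row.
samples = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [0.6, 0.5]])

n = samples.shape[0]
mu_hat = samples.mean(axis=0)                 # Eq. (14.21)
centered = samples - mu_hat
K_hat = centered.T @ centered / n             # Eq. (14.22): sum of outer products, divided by n

print(mu_hat)
print(K_hat)
print(np.cov(samples, rowvar=False, bias=True))   # same matrix, computed by NumPy
```

Each term in the sum is a column vector times a row vector, so the estimate is a d × d symmetric matrix, here 2 × 2.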
Now what is there to remember from this chapter? To perform the maximum
likelihood estimate of a set of parameters, given a training set, assume independence
(if you can) and write the probability of the entire training set occurring as a product.
Take logs, differentiate, and set to zero to produce a set of simultaneous equations
which, when solved, will be the best estimates of the parameters. This approach
works for any distribution, not just the Gaussian. However, for some cases, the
process of solving the system of simultaneous equations may be intractable.
Finally, there are other ways to find parameters, other than maximum likelihood;
techniques which space and time do not permit us to cover here.
As mentioned above, remember that p(x) is the same regardless of whether x belongs
to class 1 or 2. Since this denominator is unaffected by the classification decision,
we can ignore it in making that decision.
In the two-class case, we choose class 1 if $P(w_1|x) > P(w_2|x)$, or, substituting
Bayes’ rule and dropping the common denominator $p(x)$, we choose class 1 if
$$p(x|w_1)P(w_1) > p(x|w_2)P(w_2),$$
that is,
$$\frac{p(x|w_1)}{p(x|w_2)} > \frac{P(w_2)}{P(w_1)}. \tag{14.25}$$
The expression on the left is known as the likelihood ratio. The relationship in
Eq. (14.25) provides a true–false relationship between the likelihood ratio and the
prior information. If it is false, we choose class two. Observe that this form was
derived by making the decision which maximizes the probability of making a correct
decision, using knowledge of the measurement and the prior probability of the
classes. We could use other criteria as well. For example, instead of maximizing
the probability, we could choose to minimize the conditional risk.
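The test in Eq. (14.25) is easy to try numerically. The sketch below assumes one-dimensional Gaussian class-conditional densities; the function names, the class parameters, and the priors are illustrative assumptions only.

```python
import math

def gaussian_pdf(x, mean, var):
    """One-dimensional normal density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def choose_class(x, params1, params2, prior1=0.5, prior2=0.5):
    """Return 1 if the likelihood ratio exceeds P(w2)/P(w1), else 2."""
    ratio = gaussian_pdf(x, *params1) / gaussian_pdf(x, *params2)
    return 1 if ratio > prior2 / prior1 else 2

# Hypothetical classes: (mean, variance)
class1 = (0.0, 1.0)
class2 = (4.0, 1.0)
```

With equal priors the boundary sits midway between the two means; raising the prior of class 1 moves the boundary toward class 2, so a point such as x = 2.3 changes from class 2 to class 1.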
The effect of any decision rule is to partition the feature space into c decision
regions $\Omega_1, \ldots, \Omega_c$. Suppose we define a set of discriminant functions $g_i(x)$. Then,
if $g_i(x) > g_j(x)$ for all $j \neq i$, then $x \in \Omega_i$ and we decide $w_i$. The equation of a
decision boundary is $g_i(x) = g_j(x)$ when $\Omega_i$ borders $\Omega_j$.
In the two-class case, we may compute the probability of error in terms of these
decision regions by
$$P(\text{error}) = P(x\in\Omega_2, w_1) + P(x\in\Omega_1, w_2). \tag{14.26}$$
$$P(\text{error}) = P(x\in\Omega_2|w_1)P(w_1) + P(x\in\Omega_1|w_2)P(w_2) = \int_{\Omega_2} p(x|w_1)P(w_1)\,dx + \int_{\Omega_1} p(x|w_2)P(w_2)\,dx. \tag{14.27}$$
We also use the notation P(error|w 2 ) to mean the probability that we make an
incorrect decision when w 2 is the true state. Fig. 14.4 illustrates the a posteriori
probability density of two classes, the process of deriving the decision boundary,
and the probability of error.
In general, if $p(x|w_1)P(w_1) > p(x|w_2)P(w_2)$, we should decide that x is in $\Omega_1$,
so that the smaller term contributes to the error integral. That is exactly what Bayes’
decision rule does.
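Because the Bayes rule always leaves the smaller of the two weighted densities in the error integral, Eq. (14.27) under that rule can be approximated by integrating the pointwise minimum. The following sketch does this with a simple midpoint rule for two one-dimensional Gaussians; the integration limits, step count, and class parameters are assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """One-dimensional normal density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def bayes_error(params1, params2, prior1, prior2,
                lo=-20.0, hi=20.0, steps=40000):
    """Approximate P(error) for the Bayes rule by integrating
    min(p(x|w1)P(w1), p(x|w2)P(w2)) with the midpoint rule."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        a = prior1 * gaussian_pdf(x, *params1)
        b = prior2 * gaussian_pdf(x, *params2)
        total += min(a, b) * dx
    return total

# Hypothetical classes (mean, variance) with equal priors
err = bayes_error((0.0, 1.0), (4.0, 1.0), 0.5, 0.5)
```

For these two unit-variance classes with means four units apart, the result is close to 0.0228, the area of a standard normal tail beyond two standard deviations.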
337 14.4 Conditional risk
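[Figure 14.4: the curves $p(x|w_1)P(w_1)$ and $p(x|w_2)P(w_2)$ plotted over the decision
regions $\Omega_1$ and $\Omega_2$. The shaded error areas are $\int_{\Omega_1} p(x|w_2)P(w_2)\,dx$ and
$\int_{\Omega_2} p(x|w_1)P(w_1)\,dx$; one portion of the error is marked "Eliminated by moving
boundary to the left".]
Fig. 14.4. The a posteriori probability density function of two classes as Gaussian, the decision
boundary, and the probability of error.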
In the multiclass case, since there are more ways to be wrong than right, it is
simpler to compute the probability of being correct:
$$P(\text{correct}) = \sum_{i=1}^{c} P(x\in\Omega_i, w_i) = \sum_{i=1}^{c} P(x\in\Omega_i|w_i)P(w_i) = \sum_{i=1}^{c}\int_{\Omega_i} p(x|w_i)P(w_i)\,dx. \tag{14.28}$$
This result will be valid no matter how the feature space is partitioned. The Bayes’
classifier maximizes this probability by choosing regions which maximize the
integrals.
What value of x would you expect to observe? Pretty obvious, isn’t it? Of course
you will always observe x = 2. Now let’s repeat this for a slightly more complicated
case,
Would you agree that the expected value of x is 2.5? That is, half the time one would
expect to see x equal to 2, and half the time, x will be 3.
Now, generalize that concept to lots of possible values with a probability associated
with each value. If the number of possible values is finite, the expected value is a
sum
$$\langle x\rangle = \sum_{x} x\,P(x). \tag{14.29}$$
In the more general case, when x is continuous, we will replace the probability with
a density, and replace the summation with an integral.
[Margin note: Watch out for this replacement of summation by integral to occur. We are
not going to warn you . . .]
Suppose we observe x and contemplate taking action $\alpha_i$. If the true state of nature
is $w_j$, we incur loss $C_{ij}$. Since $P(w_j|x)$ is the probability that $w_j$ is true, the expected
loss associated with $\alpha_i$ is
$$r_i = \sum_{j=1}^{c} C_{ij}P(w_j|x). \tag{14.30}$$
The expected loss is called the risk. We write $r_i$ as $r(\alpha_i|x)$ to make it clear that this
is the conditional risk.
We wish to minimize the total risk, which we denote as r, by choosing the best $\alpha_i$.
A decision rule is a function $\alpha(x)$ that tells us what action to take to minimize r.
For every x the decision function $\alpha(x)$ assumes a value $\alpha_1, \ldots, \alpha_a$. The overall
risk is associated, then, with the decision rule.
Since $r_i$ or $r(\alpha_i|x)$ is the conditional risk, the overall risk is
$$r = \int r(\alpha(x)|x)\,p(x)\,dx. \tag{14.31}$$
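To make the conditional risk of Eq. (14.30) concrete, here is a minimal sketch that evaluates it for every action and picks the cheapest one. The cost matrix and posterior values are invented for illustration, and `cost[i][j]` is assumed to hold the cost of taking action i when class j is the truth (zero-based indices).

```python
def conditional_risks(cost, posteriors):
    """r_i = sum_j C_ij * P(w_j | x) for every action i."""
    return [sum(c_ij * p_j for c_ij, p_j in zip(row, posteriors))
            for row in cost]

def best_action(cost, posteriors):
    """Index of the action with the smallest conditional risk."""
    risks = conditional_risks(cost, posteriors)
    return min(range(len(risks)), key=risks.__getitem__)

# Hypothetical costs: wrongly choosing class 1 is ten times worse
# than wrongly choosing class 2; correct decisions are free.
cost = [[0.0, 10.0],
        [1.0, 0.0]]
posteriors = [0.8, 0.2]   # P(w1|x), P(w2|x)
risks = conditional_risks(cost, posteriors)
```

Here the risks are about 2.0 and 0.8, so the minimum-risk action is the second one even though class 1 is the more probable class. With a symmetric cost matrix the same posteriors select the first action.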
If there are only two classes, we have four possibilities, two that we made the
correct decision, and two that we were wrong. The total risk therefore is:
$$r = C_{11}\int_{\Omega_1}P(w_1|x)\,dx + C_{12}\int_{\Omega_1}P(w_2|x)\,dx + C_{21}\int_{\Omega_2}P(w_1|x)\,dx + C_{22}\int_{\Omega_2}P(w_2|x)\,dx. \tag{14.32}$$
[Margin note: As you read these terms, consider the concepts of “false negative,” “false
positive,” “true positive,” and “true negative.” Which of these terms represents the total
amount of each of these?]
[Margin note: Reminder: $C_{ij}$ is the cost of deciding i when reality is j.]
That is, the probability that we guessed it was in class 1, integrated over the region
of x in which our decision rule says we SHOULD guess class 1, plus . . .
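Rewriting all that, we get
$$r = \int_{\Omega_1}[C_{11}P(w_1|x)+C_{12}P(w_2|x)]\,dx + \int_{\Omega_2}[C_{21}P(w_1|x)+C_{22}P(w_2|x)]\,dx, \tag{14.33}$$
which, because $\Omega_1$ and $\Omega_2$ together cover the whole feature space, reorganizes to
$$r = \int [C_{21}P(w_1|x)+C_{22}P(w_2|x)]\,dx + \int_{\Omega_1}\left((C_{11}-C_{21})P(w_1|x)+(C_{12}-C_{22})P(w_2|x)\right)dx, \tag{14.35}$$
where the first integral is taken over the entire feature space and therefore does not
depend on the choice of $\Omega_1$.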
Our objective is to minimize this quantity (remember, it is the risk incurred in making
all four possible decisions). That is, the decision rule is really the determination of
the decision region(s). In this case, since there are only two decision regions, and
everywhere that we do not decide class one, we decide class two, all we need to
do is to determine the region $\Omega_1$. To accomplish that, first, we need to make an
assumption. We assume that the cost of making a correct decision is always less than
the cost of an error. So $(C_{11} - C_{21}) < 0$, etc.
[Margin note: For some reason, students get upset when we say “make the integrand
negative everywhere.” Remember, we are optimizing by finding the limits on the integral!]
How do we choose the limits of an integral (and remember, the limits of the integral
are in fact the boundaries of the region where we decide class 1) such that the integral
is as small as possible? Simply choose the decision region such that the integrand is
negative everywhere. Doing so produces the condition required for region 1 to be
chosen: Choose $\Omega_1$ such that
$$(C_{11}-C_{21})P(w_1|x)+(C_{12}-C_{22})P(w_2|x) < 0. \tag{14.36}$$
Replace the posterior probabilities with the product of the conditional densities and
prior probabilities to get
(C11 − C21 ) p(x|w 1 )P(w 1 ) < (C22 − C12 ) p(x|w 2 )P(w 2 ), (14.37)
which after appropriate algebraic manipulation becomes the decision rule: Choose
class 1 if
$$\frac{p(x|w_1)}{p(x|w_2)} > \frac{(C_{12}-C_{22})P(w_2)}{(C_{21}-C_{11})P(w_1)}, \tag{14.38}$$
else choose class 2.
This expression is called the “likelihood ratio test.”
[Margin note: Try this: Substitute the symmetrical cost function into Eq. (14.38) and see
how the likelihood ratio test simplifies.]
Consider the symmetrical loss function:
$$C_{ij} = \begin{cases}0 & i=j\\ 1 & i\neq j\end{cases} \tag{14.39}$$
so that all errors are equally costly and there is no cost for a correct decision. We
may now rewrite the conditional risk, the cost of making decision i:
$$r_i = \sum_{j=1}^{c}C_{ij}P(w_j|x) = \sum_{j\neq i}P(w_j|x) = 1-P(w_i|x). \tag{14.40}$$
Thus, to minimize the average probability of error, we select the i which maximizes
the a posteriori probability $P(w_i|x)$. That is, for minimum cost, we decide
$w_i$ if $P(w_i|x) > P(w_j|x)$ for all $j \neq i$, which we have already seen is the simple
maximum likelihood classifier. Thus, we see that the maximum likelihood classifier
minimizes the Bayes’ risk associated with a symmetric cost function.
Consider the general multivariate Gaussian classifier, with two classes. As in
Assignment 14.1, if we take logs, we can work out a decision rule based on a likelihood
g(x) ≡ xT Ax + bT x + c. (14.43)
And the decision rule becomes: Decide class 1 if g(x) < T . In this formulation, we
see clearly why the Gaussian parametric classifier is known as a quadratic classifier.
Let’s examine the implications of this rule. Consider the quantity
$(x-\mu_1)^T K_1^{-1}(x-\mu_1)$. This is some sort of measure involving a measurement, x, and a
class parameterized by mean vector and a covariance matrix. This quantity is known
as the Mahalanobis distance.
[Margin note: The Mahalanobis distance has the properties of a metric. Can you prove
that? (Do you recall the definition of a metric?)]
First, let’s look at the case that the covariance is the identity. Then, the Mahalanobis
distance simplifies to $(x-\mu_1)^T(x-\mu_1)$. That is, take the difference between
the measurement and the mean. That is a vector. Then take the inner product of that
vector with itself, which is, of course, the squared magnitude of that vector. What is
this quantity? Of course! It is just the (squared) Euclidean distance between the
measurement and the mean of the class. If the prior probabilities are the same and we use
symmetric costs, the threshold T works out to be zero, and the decision rule simplifies to:
Decide class 1 if
$$(x-\mu_1)^T(x-\mu_1) - (x-\mu_2)^T(x-\mu_2) < 0 \tag{14.44}$$
else decide class 2. If the measurement is closer to the mean of class 1 than class 2,
this quantity is less than zero. Therefore, we refer to this (very simplified) classifier
as a nearest mean classifier, or nearest mean decision rule.
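The nearest mean rule of Eq. (14.44) takes only a few lines. This is a minimal sketch with made-up class means; the function names are ours, not the book's.

```python
def squared_distance(x, mean):
    """(x - mean)^T (x - mean), the squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, mean))

def nearest_mean(x, mean1, mean2):
    """Decide class 1 when the difference of squared distances is negative."""
    g = squared_distance(x, mean1) - squared_distance(x, mean2)
    return 1 if g < 0 else 2

# Hypothetical class means in a two-dimensional feature space
mean1 = [0.0, 0.0]
mean2 = [4.0, 4.0]
```

A measurement such as [1, 1] is assigned to class 1 and [3, 3] to class 2. A point exactly equidistant from both means falls to class 2 here, because the test is a strict inequality.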
Now, let’s complicate the rule a bit. We no longer assume the covariances are
equal to the identity, but do assume they are equal to each other (K 1 = K 2 ≡ K ). In
this case, look at Eq. (14.42) and notice that the A matrix becomes zero. Now, the
operations are not quadratic any more. We have a linear classifier.
We could choose to ignore the ratio of the determinants of the covariance matrices,
or, more appropriately, to include that number in the threshold T. Then we have a
minimum distance decision rule, but now the distance used is not the Euclidean
distance. We refer to this as a minimum Mahalanobis distance classifier.
Here is another special case: What if the covariance is not only equal, but diagonal?
Now, the Mahalanobis distance takes on a special form. We illustrate this by using
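a three-dimensional measurement vector, and letting the mean be zero:
$$\begin{bmatrix}x_1 & x_2 & x_3\end{bmatrix}\begin{bmatrix}\frac{1}{\sigma_{11}} & 0 & 0\\ 0 & \frac{1}{\sigma_{22}} & 0\\ 0 & 0 & \frac{1}{\sigma_{33}}\end{bmatrix}\begin{bmatrix}x_1\\ x_2\\ x_3\end{bmatrix}$$
which we expand to
$$\frac{x_1^2}{\sigma_{11}} + \frac{x_2^2}{\sigma_{22}} + \frac{x_3^2}{\sigma_{33}}.$$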
This is the equation of an ellipsoid, centered at the origin (or, in the case that the
mean is not zero, centered at the mean) with axes located along the coordinate axes.
[Margin note: Do you recall seeing this ellipse discussion somewhere else in this book?]
In the more general case, with covariance which is not diagonal, the only thing
that happens is that this ellipsoid may rotate. So, the equation which represents the
Mahalanobis distance from a point to a class produces an ellipsoid.
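The two cases just discussed can be compared directly. The sketch below evaluates the quadratic form $(x-\mu)^T K^{-1}(x-\mu)$ for a diagonal covariance of any dimension, and for a general 2 x 2 covariance using the closed-form inverse. The function names and test values are illustrative assumptions.

```python
def mahalanobis_sq_diag(x, mean, variances):
    """Quadratic form for a diagonal covariance whose diagonal is `variances`."""
    return sum((a - m) ** 2 / v for a, m, v in zip(x, mean, variances))

def mahalanobis_sq_2d(x, mean, K):
    """Quadratic form for a general (nonsingular) 2 x 2 covariance K."""
    (a, b), (c, d) = K
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # K^{-1} = (1/det) * [[d, -b], [-c, a]]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

origin3 = [0.0, 0.0, 0.0]
variances = [4.0, 1.0, 1.0]   # hypothetical diagonal covariance
```

With these variances the points [2, 0, 0] and [0, 1, 0] both give a value of 1: they lie on the same ellipsoid even though their Euclidean distances from the mean differ.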
Here is one more interesting case: Suppose the covariances are the same, diagonal,
and proportional to the identity, $K_i = \sigma^2 I$. Now, the discriminant function for class
i takes on the form
$$g_i(x) = \frac{2\mu_i^T x}{\sigma^2} - \frac{|\mu_i|^2}{\sigma^2} + 2\ln P(w_i). \tag{14.45}$$
[Margin note: Remember! An inner product computes a projection.]
Assume further that the magnitudes of all the means are the same. That is, all the
means are located on a hypersphere centered at the origin. Then, we do not need to
consider the second term in Eq. (14.45), and the discriminant function simplifies to
$$g_i(x) = \mu_i^T x = \sum_{k=1}^{d}\mu_{ik}x_k, \tag{14.46}$$
Sometimes the a priori probabilities are unknown. In this case, a fixed decision rule
will not yield the minimum risk, so we use the minimax rule and attempt to minimize
the maximum possible risk. Suppose we have c classes; then the overall expected
risk is
$$r = \sum_i P(w_i)\sum_j C_{ij}\int_{\Omega_i}p(x|w_j)\,dx, \tag{14.47}$$
that is, the probability of a particular state of nature times all the decisions we could
make if that were the state of nature, and the cost associated with those decisions.
To see what this means clearly, think about the two-class case and let i = 1. We
observe that r is linear in P(w i ). Thus, r is maximized at one extreme of P(w 1 ) or
the other, e.g., P(w 1 ) = 0 or P(w 1 ) = 1. If we let C11 = C22 = 0 then the maximum
of r becomes either
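$$C_{12}\int_{\Omega_1}p(x|w_2)\,dx \tag{14.48}$$
or
$$C_{21}\int_{\Omega_2}p(x|w_1)\,dx. \tag{14.49}$$
Since $\Omega_1\cup\Omega_2$ is the complete space, then
$$\max\left(C_{12}\int_{\Omega_1}p(x|w_2)\,dx,\; C_{21}\int_{\Omega_2}p(x|w_1)\,dx\right) \tag{14.50}$$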
In previous sections, we have assumed we have a model for the density, usually a
Gaussian. However, if we simply have a training set, that data may or may not fit a
Gaussian (or any other parametric model for that matter). A simple heuristic which
we might use is called the “nearest neighbor rule” – assign the unknown to the same
class as the class of the nearest neighbor in the training set.
In this section, we extend the nearest neighbor rule, and show that it is equivalent to
a maximum likelihood classifier with the density estimated by the extended nearest
neighbor rule.
This method utilizes a volume¹ V around the unknown. We simply count the
number of points from the various classes which occur. Then the class-conditional
density is estimated by
$$p(x|w_m) = \frac{k_m}{n_m V}, \tag{14.52}$$
where km is the number of samples in class m inside the volume V centered at x, and
n m is the total number of samples in class m in the training set.
Use of a constant volume is a problem, because in regions which are densely
populated (many training set points nearby) the volume will contain many points,
resulting in too much smoothing, whereas in more sparsely populated areas, the same
volume results in estimates which are not sufficiently representative. The simple
solution is to let the volume depend on the data. For example, to estimate p(x) from
n samples, one can center a cell about x and let it grow until it contains kn samples,
where kn is some (yet to be specified) function of n. If the density of samples near x
is high, then the volume will be small, resulting in good resolution. If the density is
small, then the region will grow, providing smoothing. Duda et al. [14.4] point out
that $k_n = \sqrt{n}$ provides one form for $k_n$ which behaves in a reasonable way.
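A minimal one-dimensional sketch of this growing-cell estimate follows. It assumes $k_n$ is $\sqrt{n}$ rounded to the nearest integer and that the cell is an interval centered at x; both choices, and the function name, are ours for illustration.

```python
import math

def knn_density(x, samples):
    """Estimate p(x) as k_n / (n * V), where V is the length of the
    smallest interval centered at x that contains k_n samples."""
    n = len(samples)
    k = max(1, round(math.sqrt(n)))
    # Distance to the k-th nearest sample sets the half-width of the cell.
    radius = sorted(abs(s - x) for s in samples)[k - 1]
    volume = 2.0 * radius   # would be zero if x coincides with k samples
    return k / (n * volume)

samples = list(range(16))   # hypothetical, evenly spread training points
estimate = knn_density(7.5, samples)
```

With 16 samples, $k_n$ is 4; around x = 7.5 the cell must reach out to a distance of 1.5 to capture four points, giving an estimate of 4 / (16 x 3) = 1/12.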
The k-nearest-neighbor (k-NN) rule can be extended slightly to allow us to use the
strategy directly for classification. Given c training sets, we combine all the sample
points from all the training sets into one data set of n points, where now
$$n = \sum_{i=1}^{c} n_i \tag{14.53}$$
1 Of course, in more than three dimensions, this is a hypervolume. For simplicity, we will continue to use the
word “volume” with the understanding that no limit on dimensionality exists.
This rule tells us to look in a neighborhood of the unknown feature vector for k
samples. If, within that neighborhood, more samples lie in class i than any other
class, we assign the unknown as belonging to class i. We thus have the k-nearest-
neighbor classification rule.
The student should note that in the k-NN strategy, we have never defined precisely
how nearest should be computed. The Euclidean metric is generally assumed to be
the most reasonable measure for distance, but others may certainly be used.
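The k-nearest-neighbor classification rule can be sketched in a few lines. Since the text leaves the distance measure open, the sketch below assumes the squared Euclidean distance; the training data, labels, and the default k = 3 are invented for illustration.

```python
from collections import Counter

def knn_classify(x, training, k=3):
    """training is a list of (feature_vector, label) pairs.
    Returns the label held by most of the k nearest samples."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # Brute force: the distance to every stored sample is computed.
    neighbors = sorted(training, key=lambda item: dist2(x, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    # On a tie, the label encountered first (the nearer one) wins.
    return votes.most_common(1)[0][0]

training = [([0, 0], "a"), ([0, 1], "a"), ([1, 0], "a"),
            ([5, 5], "b"), ([5, 6], "b"), ([6, 5], "b")]
```

The sort over the whole training set makes the storage and computation burden discussed below easy to see: every query touches every stored sample.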
In the authors’ own experience in classifying large data sets of industrial data, we
have found nearest neighbor algorithms to work surprisingly well.
A major practical disadvantage of the k-NN strategy for classification is the fact
that all the data must be stored. This can be a massive storage burden, especially
when compared with parametric methods which require only a few points. The
computational burden associated with the k-NN techniques can likewise be significant,
since, in order to find the k nearest neighbors, the distance from the unknown to all
the neighbors must be determined. Heuristics have been published which speed up
this process significantly, and the student is referred to the literature for suggestions.
See, for example, the condensed-nearest-neighbor rule described in Hand [15.7].
14.8 Conclusion
In this brief introduction to statistical pattern recognition, you have seen how statistical
methods can assist in the process of making decisions. You have also noticed how
pervasive the optimization approach to problem solving is.
[Margin note: Maximum likelihood and minimum squared error.]
Probability densities are estimated using maximum likelihood methods, where the
likelihood is a product of probabilities. For Gaussian forms, maximum likelihood
simplifies to sum-squared error.
[Margin note: Minimize the risk.]
You learned how to find decision regions which minimize the total risk, even
when different decisions have different costs, by finding limits of integration which
make the argument of the integral negative.
[Margin note: Minimize the maximum risk.]
Even in the case that the risk cannot be computed, we can develop a scheme which
minimizes the maximum risk.
[Margin note: Minimum distance.]
Classification is often considered a “minimum distance” process. That is, we
make the decision which minimizes some sort of distance, and you have seen several
examples in this chapter.
14.9 Vocabulary
Bayes’ rule
Cluster
Conditional density
Decision boundary
Decision rule
Discriminant function
Feature vector
Likelihood ratio
Linear machine
Linearly separable
Maximum likelihood
Minimax
Multivariate
Prior probability
Quadratic classifier
Risk
Supervised learning
Training set
Univariate
Unsupervised learning
Assignment 14.1
Assume class 1 and 2 are well represented by Gaussian
densities with the following parameters: Class 1 mean
= 0, variance = 1. Class 2 mean = 3, variance = 4.
Substitute the forms for the Gaussian into Eq. (14.25)
and derive an equation which gives the range of x in
which class 1 is chosen. You will need to make a reasonable
assumption about prior probabilities (equal
probabilities are often chosen).
Hint: After doing the substitution, take natural
logarithms of both sides.
Assignment 14.2
In a one-dimensional problem, the conditional density
for class 1 is Gaussian with mean 0 and variance 2; for
class 2, the conditional density is also Gaussian with
mean 3 and variance 1. That is:
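$$p(x|w_1) = \frac{1}{\sqrt{2\pi}\sqrt{2}}\exp\left(-\frac{1}{2}\left(\frac{x}{\sqrt{2}}\right)^2\right)$$
$$p(x|w_2) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(x-3)^2\right)$$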
(1) Sketch the two densities on the same axis.
(2) What is the likelihood ratio?
347 Topic 14A Statistical pattern recognition
Assignment 14.3
In a one-dimensional problem the class-conditional densities
for a feature x are
$$p(x|w_1) = \begin{cases}\exp(-(x-r)) & x \ge r\\ 0 & \text{otherwise}\end{cases}$$
and
$$p(x|w_2) = \begin{cases}\exp(x-3) & x < 3\\ 0 & \text{otherwise}\end{cases},$$
Statistical pattern recognition, as mentioned above, is a process worthy of an entire book, and
in fact many books have been written on the topic. Here, through a simple example [13.17],
we will present just a glimpse of what the discipline entails. Our problem is to recognize
faces. Let’s first collect images which contain only faces (and thereby avoid the segmentation
problem) by requiring the subjects all wear black clothing and stand against a black wall. We
acquire relatively low-resolution images, 180 × 120 pixels. We then scan over the image with
a collection of feature extractors, shown in Fig. 14.5. Each feature extractor operates on the
neighborhood of each pixel, in much the same way that a kernel operator does, but instead of
a sum of products, this operator returns the product of the image pixels corresponding to the
black pixels in the kernel. First, we observe that each kernel, used in this way, is returning a
very local autocorrelation of the image, in a particular direction. Denote the result of applying
kernel i to the neighborhood of pixel j by $\Phi_{ij}$. Then, the sum
$$x_i = \sum_{j=1}^{n}\Phi_{ij} \tag{14.58}$$
is computed, producing a 25-element vector which in some sense describes the image.
Fig. 14.5. The collection of 25 kernels used to extract a 25-element feature vector from an image.
So for every image, we have a vector consisting of 25 numbers. Using that 25-vector, the
challenge is to properly make a decision. The first step is to reduce the dimensionality to
something more manageable than 25.
We look for a method for reducing the dimensionality from, in general, d dimensions, to
c − 1 dimensions, where we are hoping to classify the data into c classes. (Somehow, we
must know c, which in this example is the number of individual faces.) The following strategy
is an extension of a method known in the literature as “Fisher’s linear discriminant.”
Assume we have c different classes, and a training set, X i of examples from each class.
Thus, this is a supervised learning problem. Define the within-class scatter matrix to be
$$S_W = \sum_{i=1}^{c}S_i \tag{14.59}$$
where
$$S_i = \sum_{x\in X_i}(x-\mu_i)(x-\mu_i)^T, \tag{14.60}$$
and $\mu_i$ is the mean of class i,
$$\mu_i = \frac{1}{n_i}\sum_{x\in X_i}x. \tag{14.61}$$
Thus, $S_i$ is a measure of how much each class varies from its average.
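The scatter matrices of Eqs. (14.59) to (14.61) are straightforward to compute. The sketch below uses plain nested lists as matrices; the helper names and the two tiny training sets are assumptions made for illustration.

```python
def class_mean(samples):
    """Mean vector of one class, Eq. (14.61)."""
    n, d = len(samples), len(samples[0])
    return [sum(x[j] for x in samples) / n for j in range(d)]

def scatter(samples):
    """S_i: sum of outer products (x - mu_i)(x - mu_i)^T, Eq. (14.60)."""
    mu = class_mean(samples)
    d = len(mu)
    S = [[0.0] * d for _ in range(d)]
    for x in samples:
        diff = [x[j] - mu[j] for j in range(d)]
        for r in range(d):
            for c in range(d):
                S[r][c] += diff[r] * diff[c]
    return S

def within_class_scatter(classes):
    """S_W: sum of the per-class scatter matrices, Eq. (14.59)."""
    d = len(classes[0][0])
    SW = [[0.0] * d for _ in range(d)]
    for samples in classes:
        S = scatter(samples)
        for r in range(d):
            for c in range(d):
                SW[r][c] += S[r][c]
    return SW

X1 = [[0.0, 0.0], [2.0, 0.0]]   # hypothetical class 1, spread along the first axis
X2 = [[5.0, 1.0], [5.0, 3.0]]   # hypothetical class 2, spread along the second axis
SW = within_class_scatter([X1, X2])
```

Each class contributes scatter only along the axis on which it varies, so here $S_W$ comes out as twice the identity.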
y = Wx (14.63)
such that first, y is of lower dimension than x, and second, the classes are better separated
after they are projected.
The projection from d-dimensional space to c − 1 dimensional space is accomplished by
c − 1 linear discriminant functions
$$y_i = w_i^T x. \tag{14.64}$$
$$y = W^T x. \tag{14.65}$$
We now define a criterion function which is a function of W and measures the ratio of between-
class scatter to within-class scatter. That is, we want to maximize $S_B$ relative to $S_W$, or rather,
to maximize some measure of $S_W^{-1}S_B$. The trace of $S_W^{-1}S_B$ is the sum of the spreads of $S_W^{-1}S_B$
in the direction of the principal components of $S_W^{-1}S_B$. We can see clearly what this means
in the two-class case.
$$J = \mathrm{tr}\,S_W^{-1}S_B = \frac{n_1n_2}{n_1+n_2}\,\mathrm{tr}\,S_W^{-1}(\mu_1-\mu_2)(\mu_1-\mu_2)^T = \frac{n_1n_2}{n_1+n_2}D^2 \tag{14.66}$$
Since the determinant is the product of the eigenvalues, it is therefore the product of the
“spread” in the principal directions. As in the case of Fisher’s linear discriminant, the solution
to this equation can be found by eigenvector analysis. The columns of the optimal W are the
eigenvectors which correspond to the largest eigenvalues in
$$S_B w_i = \lambda_i S_W w_i. \tag{14.68}$$
One can find the eigenvalues as the roots of the characteristic equation
$$|S_B - \lambda_i S_W| = 0 \tag{14.69}$$
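For 2 x 2 matrices the characteristic equation (14.69) is a quadratic in the eigenvalue, so it can be solved by hand. The following sketch does exactly that and then recovers an eigenvector satisfying Eq. (14.68). It assumes a nonsingular $S_W$ and real roots; the matrices are invented, and a real application would normally call a linear algebra library instead.

```python
import math

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def generalized_eigenvalues_2x2(SB, SW):
    """Roots of |SB - lam * SW| = 0, largest first."""
    # Expanding the determinant gives a*lam^2 + b*lam + c = 0.
    a = det2(SW)
    b = -(SB[0][0] * SW[1][1] + SW[0][0] * SB[1][1]
          - SB[0][1] * SW[1][0] - SW[0][1] * SB[1][0])
    c = det2(SB)
    disc = math.sqrt(max(0.0, b * b - 4.0 * a * c))
    return sorted(((-b + disc) / (2 * a), (-b - disc) / (2 * a)), reverse=True)

def eigenvector_2x2(SB, SW, lam):
    """A vector w with (SB - lam * SW) w = 0, assuming the first row
    of that matrix is not all zeros."""
    m00 = SB[0][0] - lam * SW[0][0]
    m01 = SB[0][1] - lam * SW[0][1]
    return [m01, -m00]

SW = [[2.0, 0.0], [0.0, 2.0]]   # hypothetical within-class scatter
SB = [[8.0, 4.0], [4.0, 2.0]]   # hypothetical rank-one between-class scatter
lams = generalized_eigenvalues_2x2(SB, SW)
w = eigenvector_2x2(SB, SW, lams[0])
```

Because this $S_B$ has rank one, only one eigenvalue is nonzero, and the corresponding eigenvector points along [2, 1], the direction from which $S_B$ was built.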
Pattern classifiers based on the concepts of support vectors are relatively new. They were
first introduced by Vapnik [14.12] based on the concept of minimizing structural risk. We
present them here because they appear to provide performance superior to most other pattern
classification methods, and to illustrate the approach – to set up an optimization problem which
maximizes the margin – a derivation of the simplest support vector approach is provided.
[Figures 14.7 to 14.9: scatter plots of two training sets, labeled class 1 and class 2, with a
dividing line. In Fig. 14.9 the projection line w is marked with the values q − ρ, q, and q + ρ.]
Fig. 14.7. A poor choice of the dividing hyperplane (a line in this 2D example) produces a margin
which is small.
Fig. 14.8. A good choice of the dividing hyperplane results in a large margin.
Fig. 14.9. Two classes denoted by 1 and 2 are both projected onto the same line, w. The line
dividing the sets is orthogonal to w, and we seek the points in the training sets which are
closest to that line.
In this section, we assume the training sets are separable by a linear surface. In the next
subsection, we discuss this assumption and demonstrate how to ensure it.
As before, we seek to divide the feature space by a hyperplane into two segments, in which
the examples of the training sets are linearly separable. Define the distance from the closest
point in class 1 to the hyperplane as d1 . Similarly let d2 be the distance from the closest point
in class 2 to the hyperplane. Define the margin as d1 + d2 . We will seek the hyperplane which
maximizes the margin (see Figs. 14.7 and 14.8).
Given an example x, we project this sample onto some unit vector $\phi$, and make a decision
using the rule: decide class 1 if $\phi^T x - q > 0$, where q is a constant.
[Margin note: We don’t know $\phi$ yet. Finding it is the challenge.]
Let $x_1$ be a point in class 1, and similarly, let $x_2$ be a point in class 2. Since we are seeking
the points closest to the decision line, we wish to choose $x_1$ and $x_2$ so that their projections
onto $\phi$ are as close together as possible. As illustrated in Fig. 14.9, we seek points $x_1$ and $x_2$
such that, with minimal positive $\rho$,
$$\phi^T x_1 = q + \rho, \qquad \phi^T x_2 = q - \rho. \tag{14.71}$$
[Margin note: $I_1$ is the set of points in the training set representing class 1. $\Omega_1$ is the
decision region for class 1.]
Define $I_i$ to be the set of training examples from class i. For any point in $I_1$, $\phi^T x - q \ge \rho$,
and for any point in $I_2$, $\phi^T x - q \le -\rho$. We need to find two things. (1) A pair of points, one² in
each class, which are as close together as possible. We will call these points support vectors.
(2) A vector onto which to project the support vectors so that their projections are maximally
far apart. We solve this problem as follows.
Recall that $\phi$ was a unit vector. It is thus equal to some other vector in the same
direction divided by its magnitude, $\phi = w/\|w\|$. We will look for one such vector, with certain
properties which will be introduced in a moment: For now, let $x_1$ denote any point in $I_1$, not
necessarily a support vector, and similarly for $x_2$. Then
$$\left(\frac{w}{\|w\|}\right)^T x_1 - q \ge \rho, \qquad \left(\frac{w}{\|w\|}\right)^T x_2 - q \le -\rho \tag{14.72}$$
[Margin note: Any other vector in the same direction, actually.]
2 It is possible to have more than one support vector in each class, since two points might both be precisely the
same distance from the hyperplane.
which leads to
$$w^T x_1 - q\|w\| \ge \rho\|w\|, \qquad w^T x_2 - q\|w\| \le -\rho\|w\|. \tag{14.73}$$
Define $b = -q\|w\|$, and we then add a constraint to w to require that its magnitude have a
particular property:
$$\|w\| = \frac{1}{\rho}. \tag{14.74}$$
Now we have two equations which describe behavior for any points in class 1 or class 2:
$$w^T x_1 + b \ge 1, \qquad w^T x_2 + b \le -1. \tag{14.75}$$
Since we wish to find the line which maximizes the margin, $\rho$, from Eq. (14.74), we see that
this is the same as finding the projection vector w whose magnitude is minimal, thus we seek
a minimizer $w = \arg\min(\frac{1}{2}w^T w)$. Unfortunately, the null vector would minimize this, so we
need to add some constraints to avoid this trivial solution.
[Margin note: From this point on in this derivation, the subscript on the x no longer denotes
the class to which x belongs, but rather just its index, as an element of the training set.]
Let $y_i$ be the label for point $x_i$, and define the labels as
$$y_i = \begin{cases}1 & \text{if } x_i \in I_1\\ -1 & \text{if } x_i \in I_2\end{cases} \tag{14.76}$$
and consider the expression $y_i(w^T x_i + b)$. This will always be greater than or equal to
1, regardless of the class of $x_i$. We thus have a constraint, and our minimization problem
becomes: Find the w which minimizes $w^T w$ such that $y_i(w^T x_i + b) \ge 1$.
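The constrained problem just stated can be explored without an optimizer: for any candidate pair (w, b) we can test the constraints and evaluate the objective. The sketch below does only that; it does not train an SVM. The points, labels, and candidate vectors are invented for illustration.

```python
def satisfies_constraints(w, b, points, labels):
    """True when y_i * (w^T x_i + b) >= 1 for every training point."""
    return all(
        y * (sum(wj * xj for wj, xj in zip(w, x)) + b) >= 1.0
        for x, y in zip(points, labels)
    )

def objective(w):
    """The quantity being minimized, (1/2) w^T w."""
    return 0.5 * sum(wj * wj for wj in w)

# Hypothetical training set: two points per class
points = [[3.0, 0.0], [4.0, 1.0], [1.0, 0.0], [0.0, 2.0]]
labels = [1, 1, -1, -1]
```

For this data both w = [1, 0], b = −2 and w = [2, 0], b = −4 satisfy every constraint, but the first has the smaller objective and is therefore preferred. Shrinking w further to [0.5, 0] with b = −1 violates the constraints, which is what rules out the null vector.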
This can be accomplished by setting up the following constrained optimization problem:
$$L(w, b, \lambda) = \frac{1}{2}w^T w - \sum_{i=1}^{l}\lambda_i\left(y_i(w^T x_i + b) - 1\right) \tag{14.77}$$
where the first and second term are the same except for the 1/2. The third term is zero.
In principle, Eq. (14.84) may be solved for b using any i, but it is numerically better to use
an average.
Similarly, we note the dimension of A is the same as the number of samples in the training
set. Thus, unless some “filtering” is done on the training set prior to building the SVM, the
computational complexity can be substantial.
Instead of dealing with the actual samples, we apply a nonlinear transformation which
produces a vector of higher dimension, $y_i = \Phi(x_i)$. For example, if $x = [x_1, x_2]^T$ is of dimension
2, y might be
$$y_i = \begin{bmatrix}x_1^2 & x_2^2 & x_1x_2 & x_1 & x_2 & 1\end{bmatrix}^T, \tag{14.85}$$
which is of dimension 6. Surprisingly, this increase in dimension does not seem to destroy the
capability of the classifier. How can an expansion to more “degrees of freedom” improve both
accuracy on the training set and generalizability? Concerning this question, Burges [14.2]
says:
Usually, mapping your data to a “feature space” with an enormous number of dimensions
would bode ill for the generalization performance of the resulting machine. After all, the set of
all hyperplanes {w, b} are parameterized by dim(H ) + 1 numbers. Most pattern recognition
systems with billions, or even an infinite number, of parameters would not make it past the
start gate. How come SVMs do so well? One might argue that, given the form of solution,
there are at most l + 1 adjustable parameters (where l is the number of training samples), but
this seems to be begging the question. It must have something to do with our requirement of
maximum margin hyperplanes that is saving the day. . . . a strong case can be made for this
claim.
An expansion of the form described in Eq. (14.85) increases the dimensionality of the space,
and increases the likelihood that the classes will be linearly separable in the higher dimensional
space (for reasons beyond the scope of this brief explanation). It also provides a simple
mechanism for incorporating nonlinear mixtures of the information from the measurements.
The polynomial form of Eq. (14.85) is but one way of expanding the dimensionality of the
measurement vector. A more interesting collection of ways comes to mind when one looks
at Eq. (14.83), and observes that to compute the optimal separating hyperplane, one does not
need to know the vectors themselves, but only the scalars which result from computing all
possible inner products. Thus, we do not need to map each vector to a high-dimensionality
space and then take the inner product of those vectors, not if we can figure out ahead of time
what those inner products should be.
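A small example may make this point concrete. The sketch below uses a simpler map than Eq. (14.85), chosen by us because its inner product has a well-known closed form: for two-dimensional inputs, the map $[x_1^2, \sqrt{2}x_1x_2, x_2^2]$ has inner products equal to $(a^T b)^2$. The function names are illustrative.

```python
import math

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector
    (an illustrative map, not the one in Eq. 14.85)."""
    x1, x2 = v
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def poly_kernel(a, b):
    """(a^T b)^2, computed in the original space without forming phi."""
    return dot(a, b) ** 2

a, b = [1.0, 2.0], [3.0, 4.0]
explicit = dot(phi(a), phi(b))   # map first, then take the inner product
implicit = poly_kernel(a, b)     # never leaves the original 2-D space
```

Both routes give 121 for these vectors (up to rounding), but the second never builds the higher-dimensional vectors.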
In Eq. (14.86), the subscript i denotes the ith element of the vector-valued function $\Phi$. Thus,
that expression represents an inner product. Notice that Mercer’s conditions simply state
that if K satisfies these conditions, then K may be decomposed into an inner product of two
instances of a function $\Phi$. It does not say what $\Phi$ is, nor does it say what the dimensionality
of $\Phi$ is. But that’s OK. We do not have to know. In fact, the vector $\Phi$ may have infinite
dimensionality. That’s still OK.
One kernel which is known to satisfy Mercer’s condition, and which is very popular in the
SVM literature, is the radial basis function
$$K(a, b) = \exp\left(-\frac{(a-b)^T(a-b)}{2\sigma^2}\right). \tag{14.87}$$
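Equation (14.87) translates directly into code. The sketch below also builds the matrix of all pairwise kernel values for a small set of points, which is the only information about the data that the optimization needs. The default width sigma = 1 and the sample points are assumptions for illustration.

```python
import math

def rbf_kernel(a, b, sigma=1.0):
    """Radial basis function kernel of Eq. (14.87)."""
    sq = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))

def gram_matrix(points, sigma=1.0):
    """All pairwise kernel values K(x_i, x_j)."""
    return [[rbf_kernel(p, q, sigma) for q in points] for p in points]

points = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]]   # hypothetical samples
G = gram_matrix(points)
```

The diagonal entries are all 1, the matrix is symmetric, and entries shrink toward zero as the two points move apart; a larger sigma slows that decay.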
In the literature, SVMs have been applied to various problems such as face recognition [14.10]
and breast cancer detection [14.1]. In previous studies [14.7] and in comparative analysis in
the literature, they have empirically been shown to outperform classical classification tools
such as neural networks and nearest neighbor rules [14.9, 14.13, 14.14]. Interestingly, in
a comparison with a classifier based on hyperspectral data, an SVM-based classifier using
multispectral data (derived from the original hyperspectral data by filtering) performed better
than classifiers based on the original data [14.8].
14A.3 Conclusion
Statistical methods provide tools for making decisions, based on measurements. If the
measurements are sufficiently discriminating that simple thresholds may be used, sophisticated
statistical methods may not be required. On the other hand, most collections of measurements
are not sufficient to make such decisions based on trivial feature comparisons.
We have seen in section 14A.1 an example of what is in the discipline of statistical pattern
recognition, and just one application to machine vision. There is NOT enough information
in this book to teach you all you need to know about statistical methods. You really need to
take a full course. We hope this chapter has given you enough motivation to do that.
In section 14A.1 we derived an objective function which, if maximized, would result in
projected data with classes maximally separated. It turned out that this maximization problem
turns into an eigenvalue problem.
[Margin note: Maximize the margin.]
An SVM finds the decision boundary which maximizes the margin, where the margin is
the distance between the closest points and the decision boundary, and the derivation of the
machine requires use of constrained optimization with Lagrange multipliers.
14A.4 Vocabulary
Between-class scatter
Fisher’s linear discriminant
Margin
Mercer’s conditions
Support vector
Within-class scatter
References
In this chapter, we approach the problem alluded to in Chapter 14 where the training
set simply contains points, and those points are not marked in any way to indicate
from which class they may have come. As in the previous chapter, we present only
a brief overview of the field, and refer the reader to other texts [14.4, 15.7] for
more thorough coverage. One very important area which we omit here is the use
of biologically inspired models for clustering [15.4, 15.5, 15.6], and the reader is
strongly encouraged to look into these.
We will discuss the issues of clustering in a rather general sense, but note one
particular application, which is identification of peaks in the Hough transform
array.
Consider this example from satellite pattern classification: We imagine a
downward-looking satellite orbiting the earth, which, at each observed point, makes
a number of measurements of the light emitted/reflected from that point on the
earth’s surface. Typically, as many as seven different measurements might be taken
from a given point, each measurement in a different spectral band. Each “pixel”
in the resulting image would then be a 7-vector where the elements of this vector
might represent the intensity in the far-infrared, the near-infrared, blue, green, etc.
Now suppose we have labeled training sets indicating examples of pixels containing
wheat, corn, grass, and trees. With these training sets, it would seem that we should
be able to build a pattern classifier; and indeed we can. Furthermore, the problem as
stated so far is a supervised learning problem. Let’s consider for a moment, however,
the class which we call “trees.” This class consists of evergreen and deciduous trees
and, depending upon the time of year, these two subclasses will give radically dif-
ferent spectral signatures. We thus have a pattern classification problem which is not
easily approached with parametric classifiers. While we could use a nonparametric
approach, parametric methods are very attractive. An alternative to nonparametric
classifiers is to consider methods for determining the existence of the subclasses,
assigning points in the training set to the correct subclass, and then representing that
subclass parametrically. Fig. 15.1 illustrates a two-dimensional problem in which
the existence of two classes within the same training set is readily apparent. We will
refer to these subclasses as “clusters” for the duration of this discussion. Each cluster
could be represented fairly accurately by a 2D Gaussian. The entire measurement
space, however, is obviously bimodal.
Fig. 15.1. A training set with two clusters.
Such clustering is easy (for us humans) to visualize in a 2-space, and essentially
impossible to visualize in problems with more than three dimensions.
15.1 Distances between clusters
where $K_A^{-1}$ is the inverse of the covariance which best represents cluster A.
We have not yet, however, considered measures of distance between two clusters,
and for our discussion of clustering algorithms we will need such a measure. It is
not quite obvious (as we will see from the following discussion) how one should
define the distance between two clusters since each cluster contains potentially many
points. The simplest such definition would be just the distance between the sample
means of the two clusters (also called the centroid distance)
One might also consider some generalization of the Mahalanobis distance from
(point, cluster) to (cluster, cluster), defining
$$d_{fisher}(A, B) = \frac{|\mu_A - \mu_B|}{\sqrt{\sigma_A^2 + \sigma_B^2}}, \qquad (15.3)$$
where the sigmas are computed by first projecting the two clusters onto the line
between the two means. The sample means and sample variances of the projected
data are the parameters of Eq. (15.3).
We can provide a more formal statement about how to define a distance between
two clusters by first stating the desirable properties of such a distance. We require
d(A, B) ≥ 0
d(A, B) = 0 if A = B (15.4)
d(A, B) = d(B, A).
358 Clustering
There are many measures which satisfy these conditions; for example, we could
integrate the densities over all the sample space to obtain the divergence,
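$$d_{div}(A, B) = \int [\,p(x|A) - p(x|B)\,] \ln \frac{p(x|A)}{p(x|B)} \, dx. \qquad (15.5)$$
In the case of multivariate Gaussians, this becomes
$$\frac{1}{2}\left[ (\mu_1 - \mu_2)^T \left(K_1^{-1} + K_2^{-1}\right) (\mu_1 - \mu_2) + \mathrm{tr}\left(K_1^{-1} K_2 + K_2^{-1} K_1 - 2I\right) \right], \qquad (15.6)$$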
which simplifies to Eq. (15.7) if the two covariance matrices are equal to each other.
That is, over all pairs of points, one from cluster A and one from cluster B, choose
those two points which are closest together and define that to be the distance between
the two clusters.
Similarly, one may define the furthest neighbor distance
Each of the definitions given above simply gives us a scalar measure for representing,
in some sense, how far apart two clusters are.
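The centroid, nearest neighbor, and furthest neighbor definitions can be compared on a small example. The following is a minimal sketch, assuming Euclidean distance between points; the function names d_mean, d_min, and d_max and the two toy clusters are illustrative choices, not notation taken from the text.

```python
from itertools import product
from math import dist

def mean(points):
    """Sample mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def d_mean(A, B):
    """Centroid distance: distance between the two sample means."""
    return dist(mean(A), mean(B))

def d_min(A, B):
    """Nearest neighbor distance: closest pair, one point from each cluster."""
    return min(dist(a, b) for a, b in product(A, B))

def d_max(A, B):
    """Furthest neighbor distance: most distant pair, one point from each cluster."""
    return max(dist(a, b) for a, b in product(A, B))

# Two hypothetical clusters lying on the x axis.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]
print(d_mean(A, B), d_min(A, B), d_max(A, B))  # 4.5 3.0 6.0
```

All three functions satisfy the symmetry condition of Eq. (15.4), yet they give three different numbers for the same pair of clusters, which is why the choice of measure matters in the algorithms that follow.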
Any time you use a distance measure on vector-valued quantities, you should
pay attention to the possibility that scaling of the coordinate axes might change the
results. For example, consider the set of points shown in Fig. 15.2. Another example
that fairly well shows the impact of clustering is a classification problem involving
the vector [a, b]T , where a represents population and b represents the number of
and in this case, the second feature is essentially not considered because it is so
small in magnitude compared with the first feature. A common normalization which
is often used to deal with such problems is to divide each feature by its standard
deviation.
Fig. 15.2. Simple scaling of the coordinate axes can change the apparent clusters.
15.2 Clustering algorithms
We will consider only two clustering algorithms here, agglomerative clustering and
k-means, and will also formulate the clustering problem as an optimization problem.
There are a number of other algorithms in the literature, but these are less often used
in applied systems.
Fig. 15.3. Before the clustering iteration (left), the data points have already been assigned to
three clusters. Clusters B and C are determined (somehow) to be closer than A, B or
A, C, so B and C are merged and renamed to become a new cluster B.
Fig. 15.4. Illustrating the distances chosen at each iteration by a line, we arrive at the
minimum spanning tree.
The clusters which result from this algorithm are highly dependent on the measure
of cluster distance which is used. If, for example, one uses dmin, and illustrates
the distances used by drawing a line in Fig. 15.4, one gets the minimum spanning
tree (MST) of the graph which represents the data points by simply continuing the
algorithm until there is only one cluster. If we want three clusters, then we need only
cut the longest two edges in this graph. One then realizes that if formal graph theoretic
operations result from clustering, then the converse is likewise true: Whatever we
know about graphs may help us in designing clustering algorithms. In particular, the
following algorithm constructs the MST of a graph very quickly:
Define the operation y = FIND(x) as returning the name of the set containing x.
Similarly, UNION(A, B, C) creates a new set C = A ∪ B, and then deletes the sets
A and B.
(1) Compute all edges in the graph (in clustering, this means finding all distances
between pairs of points – but not all possible clusters; just the points).
(2) Sort the edges by length (if there are N points, we have N(N − 1)/2 edges, which
is on the order of N^2).
(3) Beginning with the shortest edge, for each edge, between nodes u and v, perform
the following operations:
(3.1) A = FIND(u)
(3.2) B = FIND(v)
(3.3) IF (A ≠ B) THEN C = UNION(A, B, C), and erase sets A and B.
Generally, in step 3.3, we index each set by an integer index, and rather than discard-
ing indices when A and B are erased, we use the index for A or for B (whichever is
smaller) as the index for the new set C.
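Steps (1) to (3), together with the indexing convention just described, can be sketched in a few lines. This is an illustrative serial version, assuming Euclidean distances between points; the list named label plays the role of the FIND lookup table, and the relabeling loop is a deliberately simple UNION rather than an efficient one.

```python
from itertools import combinations
from math import dist

def minimum_spanning_tree(points):
    """Return the MST edges as (u, v, length), with u and v indexing `points`."""
    # Step (1): all edges, i.e. all distances between pairs of points.
    edges = [(dist(points[u], points[v]), u, v)
             for u, v in combinations(range(len(points)), 2)]
    # Step (2): sort the edges by length.
    edges.sort()
    # label[x] is the integer index of the set containing point x (FIND).
    label = list(range(len(points)))
    tree = []
    # Step (3): visit the edges from shortest to longest.
    for length, u, v in edges:
        a, b = label[u], label[v]          # (3.1) and (3.2)
        if a != b:                         # (3.3)
            keep, drop = min(a, b), max(a, b)
            # UNION: the merged set reuses the smaller of the two indices.
            label = [keep if s == drop else s for s in label]
            tree.append((u, v, length))
    return tree

points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
tree = minimum_spanning_tree(points)
print(tree)  # [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 7.0)]
```

Cutting the longest edge of this tree, the one of length 7, leaves two clusters; cutting the two longest leaves three.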
As has been discussed in the literature [15.1, 15.8] there exist parallel algorithms
for performing the union–find operations in constant time. The parallel algorithm
assumes the existence of a lookup table which performs the FIND operation. Then
a UNION in parallel hardware is implemented as follows: to do a UNION–FIND
operation on points u and v:
Fig. 15.6. Three examples of two-dimensional clustering problems (from [14.3]). Used with permission.
Fig. 15.7. The result of using the minimum distance metric on the example of Fig. 15.6 (from [14.3]). Used with permission.
Fig. 15.8. The result of using the maximum distance metric dmax, to cluster the data of Fig. 15.6 (from [14.3]). Used with permission.
case. In particular, dmin will tend to choose clusters which are long and thin, and
dmax will choose clusters which are basically round. Students often get confused
right here, concerning the maximum and minimum criteria, so let us spend just a
few extra words to reiterate what we are doing: In this algorithm (the agglomerative
clustering algorithm) we are ALWAYS merging the two CLOSEST clusters. We get
into the maximum distance thing when we define the distance between the clusters.
dmax says to use as a measure of distance between the clusters, the distance between
those points, in those clusters, whose mutual distance is maximum.
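The distinction between "which clusters to merge" and "how cluster distance is defined" may be easier to see in code. The sketch below is an assumption-laden illustration, not the book's implementation: it starts with every point in its own cluster, always merges the two closest clusters, and takes the cluster distance function as a parameter. The one-dimensional data set, with slowly growing gaps, is invented for the example.

```python
from itertools import combinations, product
from math import dist

def d_min(A, B):
    """Cluster distance defined by the closest pair of points."""
    return min(dist(a, b) for a, b in product(A, B))

def d_max(A, B):
    """Cluster distance defined by the most distant pair of points."""
    return max(dist(a, b) for a, b in product(A, B))

def agglomerate(points, n_clusters, cluster_distance):
    """Merge the two CLOSEST clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Pick the pair (i, j), i < j, with the smallest cluster distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]],
                                                   clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical data: points on a line whose spacing grows slowly.
points = [(0.0,), (1.0,), (2.0,), (3.1,), (4.3,), (5.6,)]
chained = agglomerate(points, 2, d_min)
compact = agglomerate(points, 2, d_max)
print([len(c) for c in chained])  # [5, 1]
print([len(c) for c in compact])  # [4, 2]
```

With d_min the first cluster keeps absorbing its nearest neighbor and grows into one long chain; with d_max a merge is penalized by the full extent of the merged cluster, so the result is more compact. In both runs the merge rule itself is identical.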
Step 1. In an arbitrary way, assign samples to clusters. Or, if you don’t like being that
arbitrary, choose an arbitrary set of cluster centers and then assign all the samples
to the nearest cluster. How you pick the cluster centers is problem-dependent. For
example, if you were clustering points in a color space, where the dimensions are
red, green, and blue, you might scatter all your cluster centers uniformly over this
3-space, or you might put all of them along the line from 0, 0, 0 to maxred, maxgreen,
maxblue.
Step 2. Compute the mean of the samples in each cluster.
Step 3. Reassign each sample as belonging to the cluster with the nearest mean.
Step 4. If nothing changed this iteration, exit, otherwise go to step 2.
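The steps of k-means can be sketched as follows. This is a minimal illustration under several assumptions: Euclidean distance, initialization by a caller-supplied list of cluster centers (the second option in Step 1), Step 2 taken to be the recomputation of each cluster mean, and an empty cluster simply keeping its previous center. The helper names and the sample data are hypothetical.

```python
from math import dist

def mean(points):
    """Sample mean of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def nearest(x, centers):
    """Index of the center closest to x (ties go to the lower index)."""
    return min(range(len(centers)), key=lambda k: dist(x, centers[k]))

def k_means(samples, initial_centers, max_iter=100):
    centers = list(initial_centers)
    # Step 1: assign every sample to the nearest initial center.
    labels = [nearest(x, centers) for x in samples]
    for _ in range(max_iter):
        # Step 2: recompute the mean of each cluster.
        for k in range(len(centers)):
            members = [x for x, lab in zip(samples, labels) if lab == k]
            if members:                      # an empty cluster keeps its center
                centers[k] = mean(members)
        # Step 3: reassign each sample to the cluster with the nearest mean.
        new_labels = [nearest(x, centers) for x in samples]
        # Step 4: stop when nothing changed.
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels

samples = [(0, 0), (0, 1), (0, 2), (0, 10), (0, 11), (0, 12)]
centers, labels = k_means(samples, [(0, 0), (0, 1)])
print(centers)  # [(0.0, 1.0), (0.0, 11.0)]
print(labels)   # [0, 0, 0, 1, 1, 1]
```

Even with both initial centers placed inside the same group, two iterations are enough here for the centers to settle on the two groups.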
In Fig. 15.9 we illustrate the use of k-means to identify the peaks in a Hough
accumulator array. We use the same accumulator array illustrated in Fig. 11.5. Each
point in the accumulator array is treated as if it contains a number of points equal to
the value of the number stored at that point. The initial cluster centers were chosen
well separated and far from the actual centers. The simplest implementation of
k-means does not work in this application because there are very many points in
which the accumulator array contained only one point. All those ones add up to
place the mean at a point not near the peak we seek. The solution is to simply ignore
points with low values. In this example, any point not containing at least three points
was ignored. Other heuristics are possible (see Assignment 15.4).
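The effect of the many low-valued cells can be reproduced on a toy accumulator. The sketch below is illustrative only: the 8 × 8 array, the helper names, and the single-cluster setting are assumptions, and the threshold of three follows the value quoted above. Each cell is treated as a number of coincident points equal to its count, so the cluster mean becomes a count-weighted mean.

```python
def accumulator_cells(acc, min_count=1):
    """List ((row, col), count) for every cell holding at least min_count."""
    return [((r, c), v)
            for r, row in enumerate(acc)
            for c, v in enumerate(row)
            if v >= min_count]

def weighted_mean(cells):
    """Mean position when each cell stands for `count` coincident points."""
    total = sum(w for _, w in cells)
    return tuple(sum(p[d] * w for p, w in cells) / total for d in range(2))

# Hypothetical accumulator: a background of ones and one peak at (1, 1).
acc = [[1] * 8 for _ in range(8)]
acc[1][1] = 9

naive = weighted_mean(accumulator_cells(acc))                # all cells
thresholded = weighted_mean(accumulator_cells(acc, min_count=3))
print(naive)        # pulled toward the middle of the array
print(thresholded)  # (1.0, 1.0)
```

In a full k-means run, this weighted mean would take the place of the ordinary cluster mean in the update step.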
The ISODATA algorithm [15.2] extends k-means. ISODATA allows the algorithm
to pick its own version of the number of clusters, and provides for more flexibility
in specifying maximum and minimum cluster sizes.
Fig. 15.9. Path followed by two cluster centers initialized far from their final positions in a Hough
accumulator array. The length of the lines indicates the move from one estimate of the
center to the estimate calculated in the next iteration.
15.3 Optimization methods in clustering
Let’s see if we can figure out how to do clustering in a rigorous, formal way, by
posing and solving an optimization problem.
We want to find the best clusterING. That is, that assignment of points to clusters
which is in some sense the best assignment. In order to begin to approach this
problem, we must first invent a scalar measure of the quality of an assignment. As
a reasonable start on defining such a measure, consider the within-class scatter
$$S_w = \sum_{i=1}^{c} \sum_{x \in X_i} (x - \mu_i)(x - \mu_i)^T \qquad (15.14)$$
$$\mathrm{Tr}(S_w) = \sum_{i=1}^{c} \sum_{x \in X_i} \mathrm{Tr}\left((x - \mu_i)(x - \mu_i)^T\right) = \sum_{i=1}^{c} \sum_{x \in X_i} (x - \mu_i)^T (x - \mu_i) \qquad (15.15)$$
is simply the sum of the squared deviations of the points from their means. The
principal disadvantage of the trace criterion is that when used in a clustering algorithm,
it can yield different results when the variables are scaled.
The determinant of Sw changes only by a constant factor when the axes are scaled, so
the clustering it selects is invariant to axis scaling, but use of the determinant
imposes the assumption that all clusters have roughly the same shape.
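The scale dependence of the trace criterion is easy to demonstrate numerically. In the sketch below, the function name trace_sw, the four points, and the factor of 10 are all invented for illustration; the function simply evaluates the sum of squared deviations of the points from their cluster means, as in Eq. (15.15).

```python
def trace_sw(clusters):
    """Trace of the within-class scatter: sum of squared deviations from the means."""
    total = 0.0
    for X in clusters:
        mu = [sum(c) / len(X) for c in zip(*X)]
        total += sum((xi - mi) ** 2 for x in X for xi, mi in zip(x, mu))
    return total

def scale_y(clusters, s):
    """Multiply the second coordinate of every point by s."""
    return [[(x, s * y) for x, y in X] for X in clusters]

# Four points at the corners of a 2-by-1 rectangle, partitioned two ways.
split_by_x = [[(0, 0), (0, 1)], [(2, 0), (2, 1)]]
split_by_y = [[(0, 0), (2, 0)], [(0, 1), (2, 1)]]

print(trace_sw(split_by_x), trace_sw(split_by_y))    # 1.0 4.0
print(trace_sw(scale_y(split_by_x, 10)),
      trace_sw(scale_y(split_by_y, 10)))             # 100.0 4.0
```

Before scaling, the criterion prefers the split along x; after the second axis is stretched by 10, the same criterion prefers the other split, although the data are geometrically the same up to a change of units.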
Assumption
(required to make this technique work). Adding a point to a clustering always in-
creases the cost, independent of what that decision is. That is, suppose we have
evaluated the cost of 112X; that is, assume J(112X) = a. When we make the deci-
sion about L, we know that the cost increases. J(1121) > a and J(1122) > a.
With that assumption, we can define the branch and bound search algorithm. Since
we have no better way to search, let’s just start evaluating possibilities in order –
let’s try 1XXX, 11XX, 111X, 1111, 1112, 112X, etc. Suppose, along the way, we
have determined that J(12XX) is greater than J(1112). Then, there is no point in
evaluating any of the children of 12XX, since, if adding a decision to a cluster can
only INCREASE the criterion function, we can only get a higher answer than we
already have. That is the essence of the branch and bound algorithm. It’s a search of
all possibilities, enumerated in sequential order, but remembering the lowest value
encountered and culling the search tree based on the assumption given above.
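A compact sketch of this search follows. It is one possible reading of the procedure, under stated assumptions: the cost J is taken to be the trace criterion (sum of squared deviations from the cluster means), a partial assignment such as 112X is represented by the tuple (1, 1, 2), and a branch is culled as soon as its cost is no smaller than the best complete assignment found so far. The function names and the data are hypothetical.

```python
def within_scatter(points, labels):
    """J for a (possibly partial) assignment; unlabeled trailing points are ignored."""
    cost = 0.0
    for k in set(labels):
        X = [p for p, lab in zip(points, labels) if lab == k]
        mu = [sum(c) / len(X) for c in zip(*X)]
        cost += sum((xi - mi) ** 2 for x in X for xi, mi in zip(x, mu))
    return cost

def branch_and_bound(points, n_clusters):
    best = {"cost": float("inf"), "labels": None, "evaluated": 0}

    def search(labels):
        best["evaluated"] += 1
        cost = within_scatter(points, labels)
        if cost >= best["cost"]:
            return                      # cull: children cannot cost less
        if len(labels) == len(points):
            best["cost"], best["labels"] = cost, labels
            return
        for k in range(1, n_clusters + 1):
            search(labels + (k,))

    search((1,))                        # 1XXX: the first point goes in cluster 1
    return best

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
result = branch_and_bound(points, 2)
print(result["labels"], result["cost"])  # (1, 1, 2, 2) 1.0
print(result["evaluated"])               # fewer than the 15 nodes of the full tree
```

The culling is valid for this cost because adding a point to a cluster can never reduce the sum of squared deviations; for a criterion without that property the pruned search could miss the optimum.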
Suppose cluster center α wins. Then, adjust the weights of α using
$$\alpha_j \Leftarrow \alpha_j + \varepsilon (v_j - \alpha_j) \qquad (15.17)$$
where the scalar ε is known as a “learning parameter.” Typical values of this
parameter are small, on the order of 0.01. The input data is presented to the
algorithm repeatedly, in randomized order. Each presentation of the entire data set is
called an epoch. After several epochs, the cluster centers will move to accurately
represent the clusters in the data.
(Compare this with the step size in gradient descent. Can you speculate on a
relationship between the two algorithms?)
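The update of Eq. (15.17) can be tried on a toy problem. The following sketch makes several assumptions that are not in the text: the winner is the center nearest to the presented datum in Euclidean distance, the number of epochs and the random seed are arbitrary, and the data and initial centers are invented.

```python
import random
from math import dist

def winner_take_all(data, initial_centers, epsilon=0.01, epochs=300, seed=0):
    rng = random.Random(seed)
    centers = [list(c) for c in initial_centers]
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)                      # randomized order each epoch
        for v in data:
            # The winner is the center closest to the presented datum.
            win = min(range(len(centers)), key=lambda k: dist(centers[k], v))
            alpha = centers[win]
            for j in range(len(alpha)):
                alpha[j] += epsilon * (v[j] - alpha[j])   # Eq. (15.17)
    return centers

data = [(-0.5,), (0.5,), (9.5,), (10.5,)]
centers = winner_take_all(data, [(2.0,), (8.0,)])
print(centers)  # close to [[0.0], [10.0]]
```

Each update moves the winner a fraction ε of the way toward the datum, so a center settles near the mean of the points it wins, with a small residual jitter proportional to ε.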
Often, the same cluster center wins every time. Since this is clearly not desirable, to
allow other cluster centers to sometimes be chosen, we include a parameter known
as loneliness. On each epoch, any cluster center that did not win any points has its
loneliness increased slightly. The choice of cluster centers is then made with the
decision strongly biased by the distance between the cluster center and slightly by
the loneliness.
where v is the data presented to the algorithm at this iteration; F is some nonincreasing
scalar function of $d_{ij}$, a measure of the distance between clusters i and j; and is a
maximum on that distance. This algorithm is easily programmed and converges to
excellent clusterings.
15.4 Conclusion
As we have noted, the form of the clustering algorithm significantly affects the results
of the clustering. Some attempts to reduce the dependency on algorithm form have
been made [15.3], but this remains a fertile area for new ideas.
We view clustering as a collection of methods for determining consistency (as
in determining the peaks of the Hough transform) but not for using consistency to
solve other problems.
Minimize the trace of the scatter matrix. Clustering algorithms are totally dependent
on optimization methods. In section 15.3, we minimized the trace of the scatter matrix
to find a good clustering measure.
We used branch and bound to speed up the combinatorial problem which results
when points must simply be switched between clusters.
Gradient descent. Although couched in the terminology of neural networks, the
winner-take-all methods of section 15.3.3 use Eq. (15.17) (which is quite reminiscent
of gradient descent) to find the best cluster center.
15.5 Vocabulary
Assignment 15.1
Prove that for the equal-covariance case, the
Bhattacharyya distance becomes the same as the measure
given in Eq. (15.7).
Assignment 15.2
The following points are to be partitioned into two
clusters:
[0,0],[0,1],[0,2],[0,3],[0,4],[0,5],[0,7],[0,8].
Assignment 15.3
In your images directory are three images, called
facered.ifs
faceblue.ifs
Assignment 15.4
In Fig. 15.9 a Hough accumulator is illustrated. That
same accumulator array is available on the CDROM as
hough.ifs. Invent a new way to find the peaks using a
clustering algorithm. (Do NOT simply find the brightest
point.) Suggestions might include weighting each point
by the square of the accumulator value, doing something
with the exponential of the square of the value, etc.
References
[15.1] R. Anderson and H. Woll, “Wait-free Parallel Algorithms for the Union-Find Prob-
lem,” Proc. 22nd ACM Symposium on Theory of Computing, pp. 370–380, New York,
ACM Press, 1991.
[15.2] G. Ball, “Data Analysis in the Social Sciences: What about the Details?” Proc. AFIPS
Fall Joint Computer Conference, Washington, DC, Spartan Books, 1965.
[15.3] G. Beni and X. Liu, “A Least Biased Fuzzy Clustering Method,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(9), 1994.
[15.4] G. Carpenter and S. Grossberg, “A Massively Parallel Architecture for a Self-
organizing Neural Pattern Recognition Machine,” Computer Vision, Graphics, and
Image Processing, 37, pp. 54–115, 1987.
[15.5] G. Carpenter and S. Grossberg, “ART-2: Stable Self-organization of Stable Category
Recognition Codes for Analog Input Patterns,” Applied Optics, 26(23), p. 4919, 1987.
[15.6] G. Carpenter and S. Grossberg, “ART-3: Hierarchical Search using Chemical Trans-
mitters in Self-Organizing Pattern Recognition Architectures,” Neural Networks, 3,
pp. 129–152, 1990.
[15.7] D. Hand, Discrimination and Classification, New York, Wiley, 1989.
[15.8] W. Snyder and C. Savage, “Content-Addressable Read–Write Memories for Image
Analysis,” IEEE Transactions on Computers, 31(10), pp. 963–967, 1982.
16 Syntactic pattern recognition
16.1 Terminology
To make more progress in this area, we need to define some terminology. The
definitions are in reference to analysis of strings of symbols, such as occur in language
analysis.
Terminal symbol. A terminal symbol: A word, like “horse,” “aardvark,” “professor,” “runs,” “grades.”
Terminal symbols may also be line segments, parts of a picture, or other features.
Generally, we denote terminal symbols using lower case. Most often, terminal sym-
bols are denoted by a single symbol, like “a” or “0” but in the example of words
from English, the terminal symbols are words, not letters.
Nonterminal symbol. A nonterminal symbol: A symbol that describes a grammatical construct, like “noun,”
“verb,” “verb phrase,” “adverb phrase,” etc. Generally, we denote nonterminal symbols
with an upper case character such as “A” or a string of upper case characters like
“VP” abbreviating “Verb Phrase.”
S → NP VP
VP → VP ADV
VP → V
NP → ADJ N
NP → ADJ NP
NP → ART N
N → horse
N → professor
V → runs
V → sleeps
ADV → quickly
ADJ → green
ART → the
Grammar. A grammar: A set of terminal symbols, a set of nonterminal symbols, and a set of
rules that generates a set of strings. The critical technical component of syntactic
pattern recognition is the fact that for every grammar there exists a machine (e.g. finite
state machine, pushdown automaton, Turing machine) that recognizes the language
generated by that grammar.
Start symbol: S.
Derivation. In the application of a grammar, any production may be applied any number of
times, in any order. A derivation is one instance of an application of the productions
S ⇒ NP VP ⇒ ART N VP ⇒ ART N V ADV ⇒ The professor sleeps quickly.
16.2 Types of grammars
The set of possible grammars can be divided into four categories, depending on
restrictions on the types of allowable productions.
1 Type 1 grammars are sometimes called “context-sensitive,” but we will not use that term because some students
find it confusing.
Bc → aabCC
is not allowable, because the LHS contains more than one symbol; however,
B → aabbCC
S → 0S1
S → 01
What is the language generated by this grammar? Is it obvious to you? It is the
set of strings of zeros followed by exactly the same number of ones, denoted $0^n 1^n$.
(Is it ALL such strings? In particular, is the string 01 generated by this grammar?)
In this chapter, only type 2 and type 3 grammars are discussed in detail.
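The two productions S → 0S1 and S → 01 are small enough to apply mechanically. The sketch below is an illustration with hypothetical function names: derive applies the first production n − 1 times and then the second, recording each sentential form, and in_language tests membership directly from the description "n zeros followed by n ones," with n at least 1 because every derivation must end with S → 01.

```python
def derive(n):
    """Sentential forms obtained from S using S -> 0S1 (n - 1 times), then S -> 01."""
    if n < 1:
        raise ValueError("every derivation ends with S -> 01, so n >= 1")
    form = "S"
    steps = [form]
    for _ in range(n - 1):
        form = form.replace("S", "0S1")
        steps.append(form)
    steps.append(form.replace("S", "01"))
    return steps

def in_language(string):
    """True if string is n zeros followed by n ones, with n >= 1."""
    n = len(string) // 2
    return n >= 1 and string == "0" * n + "1" * n

print(derive(3))          # ['S', '0S1', '00S11', '000111']
print(in_language("01"))  # True
print(in_language(""))    # False
```

The shortest derivation, derive(1), yields 01 directly, while the empty string is not generated because neither production can erase S.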
In this section, we illustrate a few examples of the use of the syntactic approach to
image shape recognition.
Fig. 16.4. Atrioventricular block: Delay between the P and the R, piiritipiiiritip.
Fig. 16.5. Myocardial infarction: Signal does not return to isoelectric between R and T, pirtii.
S → pA
A → iC
C → rD
D → iE
E → tF
F → iG
G → i
G → iH
H → i
H → iS
A → a, construct a state change of the form δ(A, a) = Q. Finally, if a is any input
symbol, δ(Q, a) = ∅, where ∅ denotes the “empty” symbol.
The state change description of the machine which recognizes the language gen-
erated by this grammar is shown in Table 16.4.
Do you see why this is called “nondeterministic?” This machine goes from H to
Q under input i or H to S under the same input i. We do not mean sometimes it goes
to Q and sometimes to S. We mean it really does both, which of course is impossible
in a physically realizable machine.
To convert this into something that could be built, we construct a machine M as
follows.
Fig. 16.6. Simple deterministic FSM which recognizes normal ECGs. Accepting states are
circled.
The states of the new machine are all possible subsets of states of the original
machine, including ∅ (but not all will necessarily be used). In this example, there
are $2^9$ such states. These states will be denoted by square brackets and a list of the
original state names. If a transition involves such a set-valued state on the left, the
new state will be the union of the states to which the original machine went. (Whew!
was that awkward enough?) The accepting states of the new machine are any states
containing Q or any other accepting state of the original machine. This process
produces the physically implementable machine
illustrated in Table 16.5 and Fig. 16.6. In this example, although there are $2^9$ states
in the new machine, only a few of them are used.
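The construction can be carried out mechanically for the ECG grammar. The sketch below is an illustration under stated assumptions: the nondeterministic transitions are read off the ten productions listed earlier (A → aB gives δ(A, a) = B, and A → a gives δ(A, a) = Q), only subsets reachable from [S] are generated, and the names NFA, ALPHABET, subset_construction, and accepts are hypothetical.

```python
# Nondeterministic transitions derived from the productions of the grammar.
NFA = {
    ("S", "p"): {"A"}, ("A", "i"): {"C"}, ("C", "r"): {"D"},
    ("D", "i"): {"E"}, ("E", "t"): {"F"}, ("F", "i"): {"G"},
    ("G", "i"): {"Q", "H"},      # G -> i   and  G -> iH
    ("H", "i"): {"Q", "S"},      # H -> i   and  H -> iS
}
ALPHABET = "pirt"
ACCEPTING = {"Q"}

def subset_construction(nfa, start):
    """Build the deterministic machine, visiting only reachable subsets."""
    dfa = {}
    todo = [frozenset([start])]
    while todo:
        state = todo.pop()
        if state in dfa:
            continue
        dfa[state] = {}
        for symbol in ALPHABET:
            # New state: union of everything the original machine could reach.
            target = frozenset().union(
                *(nfa.get((q, symbol), ()) for q in state))
            dfa[state][symbol] = target
            todo.append(target)
    return dfa

def accepts(dfa, string, start="S"):
    state = frozenset([start])
    for symbol in string:
        state = dfa[state][symbol]
    return bool(state & ACCEPTING)

DFA = subset_construction(NFA, "S")
print(len(DFA))                  # 10 reachable subsets, including the empty set
print(accepts(DFA, "piritii"))   # True
print(accepts(DFA, "pirtii"))    # False: the pattern of Fig. 16.5
```

Only 10 of the 512 possible subsets are ever reached, and the empty set acts as the error state from which no input can recover.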
Thus, we have a machine which recognizes heart rhythms. We hope you have
observed that we have only listed what to do if the “normal” or expected input
occurs. Now, just for the fun of it, we could modify the state diagram (Fig. 16.6)
to include pathological conditions. For example, we could add δ(D, t) = Y which
would cause a transition to an alarm state indicating the patient might be having a
myocardial infarction (or aortic dissection or other bad stuff).
Next, we will give one more example of a type 3 grammar, this one using a chain
code. But first, another way to describe regular languages: Regular expressions.
Given a set of terminal symbols, T, a regular expression is a string constructed by
concatenating elements of T and the symbol * (denoting repetition), with parentheses
to delineate order of operation, and comma to denote the logical OR operation.
In this section, we will use the terminals {0, 1, 2, 3, 4, 5, 6, 7} (the elements of a
chain code).
One element of the language generated by (0, 7)(0, 7)*(7, 6)(7, 6)(61, 72)(1, 2)
(1, 2)0(0, 1)* is illustrated in Fig. 16.7. The FSM which recognizes the set of all
strings generated by this regular expression is given in Fig. 16.8.
Fig. 16.7. Boundary segment corresponding to the chain code 0776612100.
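The same expression can be checked with an ordinary regular expression library. The sketch below uses Python's re module; the translation is an assumption about notation only: the comma (logical OR) between single symbols becomes a character class, the comma between the two-symbol strings 61 and 72 becomes an alternation, and * keeps its meaning. The names CHAIN and in_language are hypothetical.

```python
import re

# (0,7)(0,7)*(7,6)(7,6)(61,72)(1,2)(1,2)0(0,1)* in Python's syntax.
CHAIN = re.compile(r"[07][07]*[67][67](?:61|72)[12][12]0[01]*")

def in_language(chain_code):
    """True if the whole chain code string is generated by the expression."""
    return CHAIN.fullmatch(chain_code) is not None

print(in_language("0776612100"))  # True: the boundary segment of Fig. 16.7
print(in_language("0776612"))     # False: the string stops too early
```

fullmatch is used rather than search because membership in the language requires the entire string, not just a substring, to match.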
377 16.3 Shape recognition using grammatical structure
Fig. 16.8. State diagram for the nondeterministic FSM that recognizes strings generated by the
regular expression above. Two numbers separated by a comma denote that either
input will cause this transition. Any other input will cause a transition to an error state
which is not shown.
Recognition of chromosomes
The following example is abstracted from the text by Gonzalez and Thomason [16.6],
based originally upon the work of Ledley et al. [16.9], and illustrates the use of a
context-free grammar to recognize types of chromosomes.
The terminals in this grammar are boundary segments, denoted by a, b, c, d, and e,
and illustrated in Fig. 16.9. In the recognition setting, these might be called boundary
primitives. A chromosome will be described by a sequence of symbols a–e. Note that
except for the symbol d, which can appear either way, the symbols have associated
directions.
Fig. 16.9. Primitive boundary segments used for syntactic pattern recognition. Note that
segment size and direction are important.
Fig. 16.10. (a) A submedian chromosome. (b) A telocentric chromosome. (Redrawn from [16.6].)
There are two types of chromosomes recognized by this grammar, telocentric and
submedian, as illustrated in Fig. 16.10. Each may be described by a sequence of
boundary segments. The following grammar (Table 16.6) will generate either type
of chromosome.
These productions were not invented without some thought. The first two pro-
ductions, those involving the start symbol, S, control which type of chromosome
image is being generated: S1 denotes a submedian chromosome, and S2 designates
a telocentric chromosome. In addition, the other symbols connote components of
the chromosome boundary. That is, A will result in generation of armpair, B will
result in generation of bottom, C will result in generation of side, D will result in
generation of arm, E will result in generation of rightside, F will result in generation
of leftside.
Fig. 16.11. Two of the productions used to generate a hexagonal texture. (Redrawn from [9.2].)
Shape grammars
Finally one last example from [9.2] makes use of shape grammars [17.57] to generate
and recognize textures. In a shape grammar, both the set of terminal symbols, $V_T$,
and the set of nonterminal symbols, $V_N$, are sets of shapes, with the restriction that
$V_T \cap V_N = \emptyset$.
In this example, the set of terminals contains only one element, a hexagon:
Pushdown automata. Type 2 languages are recognized by pushdown automata. A pushdown automaton
is a finite state machine which has been augmented with an unbounded memory
in the form of a pushdown stack. Such a memory is a last-in-first-out memory. The
operation of storing information in such a memory is called PUSH, and the operation
of reading whatever is on the top of the stack is called POP. Note that POP does two
things: It reads whatever is stored there, and it changes the memory, so the next POP
operation will return the next-from-top value.
To implement a pushdown automaton, we add the symbol on the top of the stack
to the criteria for the state change. That is: δ(A, i, j) = (C, q) now means “if the
state is A, the input is i, and the symbol on the top of the stack is j, change to state
C and push the symbol q onto the stack.” Using such a machine, it is now possible
to recognize languages such as the example in section 16.2.3, where the number
of ones must equal the number of zeros. The idea is simple: Every time we see a
zero, push a zero onto the stack and stay in the same state. When we first see a one,
change state and pop the stack. Subsequently, every time we see a one, pop the stack.
If we ever see another zero, go to an error state. Otherwise, if the stack goes empty
exactly when the input ends, the number of ones is equal to the number of zeros, and
we go to an accepting state.
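The recipe above translates almost line for line into code. The following is a minimal sketch rather than a general pushdown automaton: the two state names are invented, a Python list serves as the stack, and acceptance is checked at the end of the input, when the stack must be empty and at least one 1 must have been read.

```python
def accepts(string):
    """Recognize n zeros followed by n ones (n >= 1) with a single stack."""
    stack = []
    state = "reading_zeros"
    for symbol in string:
        if state == "reading_zeros" and symbol == "0":
            stack.append("0")            # PUSH a zero, stay in the same state
        elif symbol == "1" and stack:
            state = "reading_ones"       # first 1 changes state; later ones keep it
            stack.pop()                  # POP one zero for every one
        else:
            return False                 # error state: stray 0, extra 1, or bad symbol
    return state == "reading_ones" and not stack

print(accepts("000111"))  # True
print(accepts("0101"))    # False: a zero appears after a one
print(accepts("001"))     # False: zeros left on the stack
```

A finite state machine cannot do this for unbounded n, because it would need a distinct state for every possible count of zeros; the stack supplies that unbounded memory.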
One of the principal concerns of practitioners of syntactic pattern recognition is
illustrated well by both examples of chromosome recognition and ECG recognition.
Both systems assume that a recognizer exists which can identify the primitives such
as T waves. The underlying assumption is that such a primitive preprocessor would
be simple, perhaps a template matcher, and robust to noise. In practice, this may
be difficult to accomplish, and may require that the grammar itself be designed for
some degree of noise tolerance. The reader is referred to textbooks [16.5, 16.6] on
syntactic methods for details.
16.4 Conclusion
16.5 Vocabulary
Derivation
Finite state machine
Grammar
Nonterminal symbol
Primitive
Production
Pushdown automaton
Regular expression
Regular grammar
Shape grammar
Terminal symbol
Assignment 16.1
Show that the string representing the submedian chromo-
some in Fig. 16.10(a) can be generated by the grammar
of Table 16.6.
Assignment 16.2
The statement was made earlier that the grammar of
Table 16.6 is a type 2 grammar. Prove or disprove
this statement.
References
[16.1] M. Chen, A. Kundu, and J. Zhou, “Off-line Handwritten Word Recognition using
a Hidden Markov Model Type Stochastic Network,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(5), 1994.
[16.2] N. Chomsky, “Three Models for the Description of Language,” IRE Transactions on
Information Theory, 2(3), 1956.
[16.3] A. Corazza, R. De Mori, R. Gretter, and G. Satta, “Optimal Probabilistic Evalua-
tion Functions for Search Controlled by Stochastic Context-free Grammars,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(10), 1994.
[16.4] C. Fermuller and W. Kropatsch, “A Syntactic Approach to Scale-space-based
Corner Description,” IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 16(7), 1994.
[16.5] K.S. Fu, Syntactic Pattern Recognition and Applications, Englewood Cliffs, NJ,
Prentice-Hall, 1982.
[16.6] R. Gonzalez and M. Thomason, Syntactic Pattern Recognition, Reading, MA,
Addison-Wesley, 1978.
[16.7] J. Hopcroft and J. Ullman, Introduction to Automata Theory, Languages, and
Computation, Reading, MA, Addison-Wesley, 1979.
[16.8] W. Kropatsch, “Curve Representation in Multiple Resolution,” Pattern Recognition
Letters, 6(3), 1987.
[16.9] R. Ledley, L. Rotolo, R. Kirsch, M. Ginsberg, and J. Wilson, “FIDAC: Film Input
to Digital Automatic Computer and Associated Syntax-directed Pattern-recognition
Programming System,” In Optical and Electro-optical Information Processing, ed.
J. Tippet, D. Beckowitz, L. Clapp, C. Koester, and A. Vanderburgh, Cambridge, MA,
MIT Press, 1965.
[16.10] B. Olstad and A. Torp, “Encoding of a priori Information in Active Contour Models,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9), 1996.
17 Applications
Machine vision has found a wide set of applications from astronomy [17.44] to
industrial inspection, to automatic target recognition. It would be impossible to cover
them all in the detail which they deserve. In this chapter, we choose to provide the
reader with more of an annotated bibliography than a pedagogical text. We mention
a few applications very briefly, and provide a few references. In the next chapter,
we choose one application discipline, automatic target recognition, to cover in a bit
more detail.
The strategy of multispectral image analysis combines spatial and spectral information
in a single representation in which each pixel is a vector, an ordered set of
measurements. Color, where the elements of the vector are [r, g, b], is the obvious
example, and there is a great deal of work in the literature in color processing. Most
of the reported work has been intended for image quality enhancement. Only some
recent papers elaborate on the use of color for recognition [17.14, 17.18, 17.53,
17.58].
The methods we have studied for univariate images, for example using Markov
random field methods to remove noise, are applicable to multispectral images [17.3].
Often, all that is necessary is to use a vector description instead of scalar pixels.
Despite our love for this topic and the huge number of papers devoted to it (in this
paragraph, we cite only a few references [16.1, 17.32, 17.64]), we cannot take the
space to cover it in the kind of detail it deserves. One of the first problems in OCR is
automatic zoning [17.28]: that is, locating the text on the page [17.37]. Many OCR
17.4 Inspection/quality control
One would suspect that the area of manufacturing inspection would be intensely
researched, as industry strives to utilize the latest technology to gain competitive
advantage. However, this seems not to be the case. For example, a conference was
sponsored by the US National Science Foundation in the spring of 2000 to discuss
how industry and universities might collaborate more effectively. Over two hundred
CEOs of machine vision companies (most of which are involved in manufacturing
inspection) were invited, and fewer than 30 showed up. Why was there such a poor
turnout when the topic is seemingly so important?
One possible answer is that most machine vision companies are small. But one
might ask, “Why are so many machine vision companies so small?” We speculate
that the uniqueness of this field comes in the answer to that question. The capital
investment required to set up a machine vision company is actually quite small. One
can get into the business with a computer, some inexpensive hardware, and some
good ideas. Unless you go into custom hardware or sophisticated specializations,
you might be able to get your company running without venture capital. You do not
need to be big to be in the machine vision business. Because the companies are so
small, they are intensely market-driven, and do not see that basic academic research
can help them in the short term. Sometimes they are right.
Still, some basic research in industrial inspection is getting done, such as regis-
tration using fiducial marks [17.61], automatic extraction of features [17.62], recog-
nition of overlapping parts [17.21], as well as applications in assembly [17.43].
If a company has been in the business of manufacturing a particular line of prod-
ucts for many years, it is likely that many of those product designs were not entered
into CAD data bases. Reverse engineering is the process of going from legacy de-
signs into modern data bases. It may require reading of blueprints [17.13], and
it may also require generation of CAD models and data bases [17.59] from actual
objects, which in turn may require extraction of geometric primitives such as
spheres, cylinders, cones, etc. from range data [17.40] or from data acquired with
coordinate measuring machines.
Microscopy is another application area in which machine vision plays an important
role. For example, Pap smears are often screened using automated systems, and white
blood cell counts may be done by computers as well. Tracking tubular molecules
in epi-fluorescence microscopy [17.46] has been recently reported in the research
literature.
Many industrial parts are specular reflectors, and their shape and roughness can
be extracted using multiple light sources [17.17, 17.56].
However it is done, and whatever the application, machine vision requires system
building and work on sensor modeling, together with hypothesis generation [17.73].
Fig. 17.1. Three possible views of a penny (Lincoln’s head, Memorial, and on edge). Although
possible, the view of a penny standing on edge is so unlikely as to be discarded.
estimation of the spatial pose of a 3D object from 2D images [17.36]. Recent robotic
surgical successes include coronary artery bypass graft on the beating heart [17.9],
stomach surgery [17.7], and gall bladder surgery [17.24].
Bibliography
[17.26] X. Jia and M. Nixon, “Extending the Feature Vector for Automatic Face Recog-
nition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(12),
1995.
[17.27] B. F. Jones, “A Reappraisal of the Use of Infrared Thermal Image Analysis in
Medicine,” IEEE Transactions on Medical Imaging, 17(6), pp. 1019–1027, 1998.
[17.28] J. Kanai, S. Rice, T. Nartker, and G. Nagy, “Automated Evaluation of OCR Zoning,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1), 1995.
[17.29] T. Kanungo, R. Haralick, H. Baird, W. Stuezle, and D. Madigan, “A Statistical,
Nonparametric Methodology for Document Degradation Model Validation,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 22(11), 2000.
[17.30] A. Katz and P. Thrift, “Generating Image Filters for Target Recognition by Genetic
Learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9),
1994.
[17.31] J. R. Keyserlingk, P. D. Ahlgren, E. Yu, N. Belliveau, and M. Yassa, “Functional In-
frared Imaging of the Breast,” IEEE Engineering in Medicine and Biology, May/June,
pp. 30–41, 2000.
[17.32] G. Kopec and P. Chou, “Document Image Decoding using Markov Source
Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6),
1994.
[17.33] A. Kumar, Y. Bar-Shalom, and E. Oron, “Precision Tracking Based on Segmentation
with Optimal Layering for Imaging Sensors,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(2), 1995.
[17.34] Y. Kwoh, J. Hou, E. Jonckheere, and S. Hayati, “A Robot with Improved Absolute
Positioning Accuracy for CT Guided Stereotactic Brain Surgery,” IEEE Transactions
on Biomedical Engineering, 35(2), 1988.
[17.35] S. Lakshmanan and D. Grimmer, “A Deformable Template Approach to Detect-
ing Straight Edges in Radar Images,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(4), 1996.
[17.36] S. Lavallée and R. Szeliski, “Recovering the Position and Orientation of Free-form
Objects from Image Contours Using 3D Distance Maps,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(4), 1995.
[17.37] K. Lee, Y. Choy, and S. Cho, “Geometric Structure Analysis of Document Images: A
Knowledge-based Approach,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(11), 2000.
[17.38] P. Le Roux, H. Das, S. Esquenzai, and P. Kelly, “Robot-assisted Microsurgery; Fea-
sibility in a Rat Microsurgical Model,” Neurosurgery, 48, 2001.
[17.39] E. Marchand and F. Chaumette, “Active Vision for Complete Scene Reconstruction
and Exploration,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
21(1), 1999.
[17.40] D. Marshall, G. Lukacs, and R. Martin, “Robust Segmentation of Primitives from
Range Data in the Presence of Geometric Degeneracy,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 23(3), 2001.
[17.41] N. Merlet and J. Zerubia, “New Prospects in Line Detection by Dynamic Program-
ming,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4),
1996.
[17.57] G. Stiny and J. Gips, Algorithmic Aesthetics: Computer Models for Criticism and
Design in the Arts, University of California Press, 1972.
[17.58] M. Swain and D. Ballard, “Color Indexing,” International Journal of Computer
Vision, 7(1), 1991.
[17.59] T. Syeda-Mahmood, “Indexing of Technical Line Drawing Databases,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 21(8), 1999.
[17.60] H. Takeda, C. Facchinetti, and J. Latombe, “Planning the Motions of a Mobile Robot
in a Sensory Uncertainty Field,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(10), 1994.
[17.61] M. Tichem and M. Cohen, “Submicron Registration of Fiducial Marks using Machine
Vision,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(8),
1994.
[17.62] S. Trika and R. Kashyap, “Geometric Reasoning for Extraction of Manufacturing
Features in Iso-oriented Polyhedrons,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(11), 1994.
[17.63] L. Tsap, D. Goldgof, and S. Sarkar, “Nonrigid Motion Analysis Based on Dynamic
Refinement of Finite Element Models,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(5), 2000.
[17.64] T. Wakahara, “Shape Matching using LAT and its Application to Handwritten
Numeral Recognition,” IEEE Transactions on Pattern Analysis and Machine In-
telligence, 16(6), 1994.
[17.65] R. Wallace, A. Stentz, C. Thorpe, H. Moravec, W. Whittaker, and T. Kanade, “First
Results in Robot Road-Following,” Proceedings of the International Joint Confer-
ence on Artificial Intelligence, 1985.
[17.66] C. Wang, “Collision Detection of a Moving Polygon in the Presence of Polygo-
nal Obstacles in the Plane,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(6), 1994.
[17.67] C. Wang and W. Snyder, “MAP Transmission Image Reconstruction via Mean Field
Annealing for Segmented Attenuation Correction of PET Imaging,” 17th Interna-
tional Conference of the IEEE Engineering in Medicine and Biology Society, Mon-
treal, September, 1995.
[17.68] C. Wang and W. Snyder, “Frequency Characteristic Study Of Filtered-Backprojection
Reconstruction And Maximum Reconstruction For PET Images,” 17th International
Conference of the IEEE Engineering in Medicine and Biology Society, Montreal,
September, 1995.
[17.69] C. Wang, W. Snyder, and G. Bilbro, “Performance Evaluation of Filtered Back-
projection Reconstruction and Iterative Reconstruction Methods for PET Images,”
Computers in Medicine and Biology, 9(3), 1998.
[17.70] C. Wang, W. Snyder, G. Bilbro, and P. Santago, “A Performance Evaluation of FBP
and ML Algorithms for PET Imaging,” SPIE Medical Imaging, 1996.
[17.71] D. Weinshall and W. Werman, “On View Likelihood and Stability,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 19(2), 1997.
[17.72] J. Weng and S. Chen, “Vision-guided Navigation using SHOSLIF,” Neural
Networks, 1998.
This is the principal application chapter of this book.1 We have selected one application
area, automatic target recognition (ATR), to illustrate how the mathematics
and algorithms previously covered are used in this application. The point to be made
is that almost all applications similarly benefit from not one, but fusions of most of
the techniques previously described. As in previous chapters, we provide the reader
with both an explanation of concepts and pointers to more advanced literature. However,
since this chapter emphasizes the application, we do not include a “Topics”
section in this chapter.
Automatic target/object recognition (ATR) is the term given to the field of engineering
sciences that deals with the study of systems and techniques designed to
identify, to locate, and to characterize specific physical objects (referred to as targets)
[18.7, 18.9, 18.69], usually in a military environment. Limited surveys of the field
are available [18.3, 18.8, 18.21, 18.66, 18.74, 18.79, 18.89]. In this chapter, the only
ATR systems considered are those that make use of images. Therefore, our use of
terminology (e.g., clutter) will be restricted to terms that make sense in an imaging
scenario.
There are lots of transducers which can provide information about targets. In this book, we
only consider imaging sensors.
In this section, we define a few popularly used terms and acronyms in the ATR
[18.57] world, starting with the five levels in the ATR hierarchy.
1 The authors are indebted to Rajeev Ramanath, who assisted significantly in the generation of this chapter and
in fact wrote some sections, and to Richard Sims and John Irvine, who provided careful reviews and extremely
helpful feedforward.
393 18.1 The hierarchy of levels of ATR
Recognition. Distinguish targets from similar kinds. For example, distinguish tanks
from front-end loaders, jeeps from automobiles, rocket launchers from school buses,
etc.
Identification. Identify the type of target such as the type of tank (whether it is T90
or M1, etc.).
Each level of the ATR hierarchy is a refinement to the target description. Target
characterization reveals the most detail of the target.
Chip. A small image usually containing the image of a single target, extracted from a
large image of a scene. Target cueing algorithms, which identify the likely presence
of a target, often produce chips as output.
Detection rate. Fraction of targets correctly detected by the system.
False alarm rate. Generally, the fraction of the number of detections that do not
correspond to actual targets. However, this definition may be modified if the task is
classification rather than detection. We observe that false alarm rate is not the same
as probability of false alarm. The false alarm rate is usually given in false alarms per
square kilometer. See section 18.3.
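As a minimal sketch of the two definitions above, the following computes a detection rate and a false alarm rate expressed per square kilometer. It assumes the detections have already been scored against ground truth; the function names, argument names, and the numbers in the example are illustrative and do not come from the text.

```python
def detection_rate(num_targets_detected, num_targets_present):
    """Fraction of the targets actually present that the system detected."""
    if num_targets_present == 0:
        raise ValueError("no targets present; detection rate is undefined")
    return num_targets_detected / num_targets_present


def false_alarm_rate(num_false_alarms, area_km2):
    """False alarms per square kilometer of imaged terrain."""
    if area_km2 <= 0:
        raise ValueError("imaged area must be positive")
    return num_false_alarms / area_km2


# Hypothetical scoring of one test set: 40 targets present, 34 of them found,
# 6 detections with no corresponding target, 12.5 square kilometers of imagery.
pd_estimate = detection_rate(34, 40)
far_estimate = false_alarm_rate(6, 12.5)
print(f"detection rate = {pd_estimate:.2f}, false alarms per km^2 = {far_estimate:.2f}")
```

Note that the second quantity is a rate per unit area, not a probability, which is why it can exceed 1 for a cluttered scene.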
394 Automatic target recognition
[Figure: block diagram with stages Image, Preprocessing, Detection, Segmentation, Classification, Recognition.]
Fig. 18.2. Different imaging modalities. (a) Visible spectrum image. (b) Thermal IR image (notice
hot tank engine on far left). (c) Ground truth (from [18.5]). Used with permission.
395 18.3 Evaluating performance of ATR algorithms
Given an image of the scene (which consists of a possible target and back-
ground), we need to detect the target. Target detection methods can be viewed as
having two steps [18.86]. In the first step, appropriate measurements are extracted
from an image using low-level image processing techniques. These measurements
are then utilized to derive a primary segmentation of the image into regions. In
the second step, higher-level descriptors of the segmented regions are used to de-
termine the presence or absence of the target (detect) and possibly classify that
target.
In this section, we consider some of the issues with evaluation of the performance of
an ATR system. In this description, we use the term “classifier” assuming the job of
the system is to correctly classify the targets in the scene. The same terminology –
“false alarms,” etc. – would be used if the objective of the system were any of the
other levels of the ATR hierarchy.
For purposes of vocabulary, we will define these terms in the context of a pattern
recognition problem which is to classify a result as “there” or “not there.” Some
applications of such classifiers are in automatic target detection (enemy target or
not target), medical diagnosis (tumor or not tumor), and the digital communication
channel (send a “1” or “0” at one end and receive it at the other). There will be
several references back to Chapter 14, so you might want to glance at that first.
To illustrate a few key ideas in detection theory, we will begin with a simple
example. Consider a digital communication system, sending a symbol a for “0” and
b for “1”.
We say we have hypothesis 0 (H0 ) when a “0” is sent and hypothesis 1 (H1 ) when
“1” is sent. (In an ATR problem, we would have H0 for a target absent and H1 for a
target present in the scene.) As the laws of nature state, there WILL be noise in the
system. So, when we send a “0” or, in other words, a, we will receive a + n, where
n is a noise sample. We have
$$H_0: Z = a + n, \qquad H_1: Z = b + n. \tag{18.1}$$
[Figure: block diagram with blocks Source (H0, H1), Noise source, Measurement space (Z), and Decision rule.]
True Positive. The object is there (e.g., the patient has a tumor; a “1” was transmitted)
and our classifier says it is there (we decide the patient has a tumor; we received a
“1”).
True Negative. The object is not there (e.g., the patient does not have a tumor; a
“0” was transmitted) and our classifier concludes no object (there is no tumor; we
received a “0”).
False Negative. The object is there (e.g., the patient has a tumor, a “1” was trans-
mitted) and our classifier says otherwise (the patient has no tumor; we received a
“0”).
False Positive. The object is not there (e.g., the patient is healthy; a “0” was trans-
mitted) and our classifier says there is an object (the patient has a tumor; we received
a “1”).
Clearly both “false” conditions are not favorable. False negative might be worse
since we might miss a really dangerous target or overlook a malignancy. However,
both types of error have associated with them different costs (refer to Chapter 14).
The terms above are sometimes referred to by other names, such as “false alarms”
(false positives) and “false misses” (false negatives). Based on these four values,
two probabilities can be derived:
Sensitivity. Probability of a true positive, that is, the ratio of true positives
to the sum of true positives and false negatives, P(D1|H1); weighted by the prior,
this gives the joint probability P(D1|H1)P(H1). In the
specific application of target detection (as distinct from classification, recognition,
and identification), the sensitivity is referred to as the probability of detection and
denoted Pd.
Specificity. Probability of a true negative, that is, the ratio of true negatives to
the sum of true negatives and false positives, P(D0|H0); weighted by the prior,
this gives the joint probability P(D0|H0)P(H0).
Observe that P(Di|Hi)P(Hi) = P(Di, Hi). Now, the probability of a correct
decision is
$$P(\text{correct}) = \sum_i P(D_i, H_i) = \sum_i P(H_i) \int_{\Omega_i} p(x \mid H_i)\, dx,$$
where $\Omega_i$ is the region of measurement space over which we decide class i. P(Hi) is
the prior probability of seeing an example of class i, and p(x|Hi) is the conditional
probability density of whatever we are measuring (e.g., brightness) given that the
example is from class i. Of course, we do not know the TRUE probability densities;
we only know our estimates of those densities, determined from the training sets.
Furthermore, we derived the decision regions $\Omega_i$ from those (estimated) densities.
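The quantities defined in this section can be estimated by simulating the communication example. The sketch below assumes Gaussian noise and a simple threshold decision rule (decide H1 when the measurement exceeds the threshold); neither assumption is stated in the text, and all names and default values are illustrative.

```python
import random


def simulate(a=0.0, b=1.0, sigma=0.5, p_h1=0.5, threshold=0.5, trials=20000, seed=0):
    """Simulate Z = a + n (H0) or Z = b + n (H1) and count the four outcomes."""
    rng = random.Random(seed)
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for _ in range(trials):
        h1 = rng.random() < p_h1                 # which hypothesis is true
        z = (b if h1 else a) + rng.gauss(0.0, sigma)
        d1 = z > threshold                       # decision rule: decide H1 above threshold
        if h1:
            counts["TP" if d1 else "FN"] += 1
        else:
            counts["FP" if d1 else "TN"] += 1
    return counts


def summarize(counts):
    """Return estimates of sensitivity, specificity, and P(correct decision)."""
    total = sum(counts.values())
    sensitivity = counts["TP"] / (counts["TP"] + counts["FN"])   # estimates P(D1|H1)
    specificity = counts["TN"] / (counts["TN"] + counts["FP"])   # estimates P(D0|H0)
    p_correct = (counts["TP"] + counts["TN"]) / total            # P(D1,H1) + P(D0,H0)
    return sensitivity, specificity, p_correct


counts = simulate()
sensitivity, specificity, p_correct = summarize(counts)
print(counts)
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} P(correct)={p_correct:.3f}")
```

Moving the threshold trades sensitivity against specificity, which is the trade-off an ROC curve displays.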
We could try to determine the error rate by simply counting the number of ele-
ments in the training set that are misclassified: We call this the apparent error rate.
Unfortunately, this leads to an optimistic result: It underestimates the error rate of
the system when tested on data not in the training set. This occurs because the ATR
has been designed to minimize the number of misclassifications of the training set,
and unless the training set perfectly represents the true distribution of the data, the
classifier will reflect characteristics of the training set which may not be true of the
entire sample population.
We must distinguish the apparent error rate from the true error rate. Although we
have no way to determine the true error rate, we may get a better estimate using the
two different approaches now described.
This is straightforward. We simply divide our original training set into two parts
(randomly of course) and build the classifier using half of the data. We then test
the system on the other half. This approach works reasonably well if we have very
large training sets (thousands of examples), or better said, something like 10d where
d is the dimensionality of the problem. (What does dimensionality mean here?)
The dimensionality of the problem is the number of measurements made (e.g.,
max. brightness, area, contrast).
Unfortunately, such large training sets are unlikely in most problems.
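The split-in-half procedure just described (commonly called the hold-out method, a name the text does not use) can be sketched as follows. The nearest-class-mean classifier and the synthetic two-class data are assumptions chosen only to keep the example short; any classifier could be substituted.

```python
import random


def train_nearest_mean(samples):
    """samples: list of (feature_vector, label). Returns {label: mean_vector}."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def classify(means, x):
    """Assign x to the class whose mean is closest in squared Euclidean distance."""
    return min(means, key=lambda y: sum((xi - mi) ** 2 for xi, mi in zip(x, means[y])))


def error_rate(means, samples):
    """Fraction of samples misclassified by the classifier described by `means`."""
    wrong = sum(1 for x, y in samples if classify(means, x) != y)
    return wrong / len(samples)


def holdout_error(samples, seed=0):
    """Randomly split the data in half, train on one half, test on the other."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    return error_rate(train_nearest_mean(train), test)


def make_data(n, seed=1):
    """Synthetic 2D data: two overlapping Gaussian classes (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 1.5
        data.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return data


data = make_data(200)
apparent = error_rate(train_nearest_mean(data), data)   # tested on the training set itself
held_out = holdout_error(data)                           # tested on unseen half
print(f"apparent error = {apparent:.3f}, hold-out error = {held_out:.3f}")
```

With only 200 samples the two numbers fluctuate from split to split, which illustrates why the method wants large training sets.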
Assume there are n points in the training set. Remove point 1 from the set, and
design the classifier using the other n − 1 points. Then test the resulting machine
on point 1. Repeat for all points. The resulting error rate can be shown to be an
almost unbiased estimate of the expected true error rate for the classifier designed
using all n points. Of course, this requires that we design n machines, which could
be prohibitive. However, with such a result, we have numbers which we can put into
the ROC curves.
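The remove-one-point procedure above (commonly called leave-one-out) is sketched below for a one-dimensional nearest-class-mean classifier. The classifier and the eight sample points are assumptions made for illustration; the data were chosen so that the apparent error rate is visibly more optimistic than the leave-one-out estimate.

```python
def train_means(samples):
    """samples: list of (x, label) with scalar x. Returns {label: mean of x}."""
    totals = {}
    for x, y in samples:
        s, c = totals.get(y, (0.0, 0))
        totals[y] = (s + x, c + 1)
    return {y: s / c for y, (s, c) in totals.items()}


def predict(means, x):
    """Assign x to the class with the nearest mean."""
    return min(means, key=lambda y: abs(x - means[y]))


def leave_one_out_error(samples):
    """Design n classifiers, each on n - 1 points, and test each on the point left out."""
    errors = 0
    for i, (x, y) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if predict(train_means(rest), x) != y:
            errors += 1
    return errors / len(samples)


# Illustrative data: (measurement, class label).
samples = [(0.9, 0), (1.1, 0), (1.4, 0), (2.6, 0),
           (2.3, 1), (3.0, 1), (3.3, 1), (3.6, 1)]

all_means = train_means(samples)
apparent = sum(predict(all_means, x) != y for x, y in samples) / len(samples)
loo = leave_one_out_error(samples)
print(f"apparent error = {apparent}, leave-one-out error = {loo}")
# prints: apparent error = 0.125, leave-one-out error = 0.25
```

The cost is visible in the loop: the classifier is redesigned once per training point.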
localizes potential target areas. Hence, probability of detection and false alarm may
be used to evaluate this step. The segmentation operation extracts the target after it has
been detected. We may therefore use measures like misclassified pixels, correlation
coefficient between true and extracted target, etc.
The problems and issues of interest to the ATR community can be summarized as
target signature variability, false alarm rate, segmentation, feature selection, per-
formance degradation due to incomplete information, and performance evaluation
[18.9, 18.71].
All images record signals received as a sum of emitted and reflected radiation.
However, the amount emitted may be dramatically smaller than the amount reflected
(in the case of visible light) or significant (in the case of longwave infrared). The
emissivity is the ratio of emitted to total radiation.
$$\varepsilon(x, y) = \frac{\text{emitted radiation at } (x, y)}{f(x, y)}. \tag{18.4}$$
The emitted radiation is positively related to the temperature
= A exp(−T 4 ) (18.5)
Fe
tire
Fig. 18.5. Objects may exhibit contrast reversals over a 24 hour period. (From US Army Night
Vision and Electro-optics Research Center, used with permission.)
Occlusions
Unlike industrial machine vision problems, where the setup of the manufacturing
facility is specifically designed to minimize occlusion, not only do occlusions occur
in ATR scenes, but targets are usually at least partially occluded. In fact, the opponent
will be actively trying to have his equipment as occluded as possible [4.13]! An image
of a truck occluded by a tree is shown in Fig. 18.7, later.
All these variabilities raise the question of “How well trained should the ATR be?”
It is entirely too easy to over-train a system, and produce a system which performs
very well on the data on which it has been trained, but poorly on data it has not seen,
even though that data may (in the eyes of a human) be very similar. The problem is not
to get the probability of detection high, but to do so while simultaneously keeping the
false alarm rate low. The Neyman–Pearson test [18.53] provides a means to do so: it
maximizes the probability of detection subject to a bound on the probability of false alarm.
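For the scalar signal-in-noise model used earlier in this chapter, the Neyman–Pearson idea reduces to choosing a threshold from the allowed probability of false alarm. The sketch below assumes Gaussian noise of known standard deviation and b > a, so that the test becomes "decide target present when Z exceeds the threshold"; these assumptions and all names are ours, not the text's.

```python
from statistics import NormalDist


def neyman_pearson_threshold(a, sigma, p_fa):
    """Threshold t with P(Z > t | H0) = p_fa when Z ~ N(a, sigma^2) under H0."""
    return NormalDist(mu=a, sigma=sigma).inv_cdf(1.0 - p_fa)


def probability_of_detection(b, sigma, threshold):
    """P(Z > threshold | H1) when Z ~ N(b, sigma^2) under H1."""
    return 1.0 - NormalDist(mu=b, sigma=sigma).cdf(threshold)


# Illustrative numbers: background level 0, target level 2, unit noise,
# and at most a 5% probability of false alarm.
threshold = neyman_pearson_threshold(a=0.0, sigma=1.0, p_fa=0.05)
p_d = probability_of_detection(b=2.0, sigma=1.0, threshold=threshold)
print(f"threshold = {threshold:.3f}, probability of detection = {p_d:.3f}")
```

Tightening the false alarm bound raises the threshold and lowers the probability of detection, which is the tension described in the paragraph above.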
18.4.2 Tracking
In ATR, many if not most applications require target tracking, and furthermore the
tracking problems are less constrained and more challenging than tracking in the
civilian domain. Centroid tracking is the simplest type of tracking algorithm, although
there are ways to improve its sophistication [18.39]. The centroid tracker
(usually) assumes there is just one target in the field of view, and that the target is a
bright spot much brighter than the background. If those assumptions are true, the centroid of
the target is (approximately) the brightness-weighted centroid of the field of view. More
sophisticated tracking of moving objects is most often done using optimal filters like the
Kalman–Bucy filter.
Haddad and Simanca [18.28] discuss the limitations of the Kalman filtering approach
and propose a nonlinear tracking filter based on wavelets and the Zakai
equation. Amoozegar et al. [18.3] provide a survey of fuzzy and neural techniques
in tracking.
The process of tracking can be combined with the process of classifying vehicles
[13.12, 18.22] as well.
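A centroid tracker of the kind described in this section can be sketched in a few lines. The version below assumes each frame is a small 2D list of non-negative brightness values with a single bright target on a dark background; the function names and the three test frames are illustrative.

```python
def brightness_centroid(frame):
    """Brightness-weighted centroid (row, col) of a 2D list of non-negative values."""
    total = row_moment = col_moment = 0.0
    for r, row in enumerate(frame):
        for c, value in enumerate(row):
            total += value
            row_moment += r * value
            col_moment += c * value
    if total == 0:
        return None  # nothing bright in the field of view
    return row_moment / total, col_moment / total


def track(frames):
    """Return the centroid estimate for each frame in a sequence."""
    return [brightness_centroid(frame) for frame in frames]


# Three frames with one bright pixel moving across a dark background.
frames = [
    [[0, 0, 0, 0], [0, 9, 0, 0], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [0, 0, 9, 0], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 9]],
]
path = track(frames)
print(path)  # prints: [(1.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
```

If the background is not dark, or a second bright object enters the field of view, the estimate is pulled away from the target, which is why the assumptions stated above matter.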
18.4.3 Segmentation
In most ATR scenarios, the problem of separating clutter from target is fundamental.
Clutter varies from one scene class to another and requires adaptive representations
[18.38]. However, there currently is not even a uniformly accepted definition for
“signal-to-clutter” [18.61, 18.68, 18.78].
Once a potential target is localized, it is extracted from the background as ac-
curately as possible. However, every segmenter makes prior assumptions about the
target and its neighboring pixels. These assumptions may not be valid for all viewing
conditions. As we learned in Chapter 8, two common approaches to segmentation are
edge or boundary formation and region growing [18.68]. Edge detection approaches
are based upon recognizing dissimilarities in images whereas region growing utilizes
similarity properties. Because edge detection techniques are quite sensitive to noise,
successful edge detection usually depends on higher level semantic knowledge. Re-
gion growing techniques offer better immunity to noise and therefore do not have as
much reliance on semantic knowledge. Qi et al. [18.63] propose an efficient segmen-
tation approach to segment man-made targets from unmanned aerial vehicle (UAV)
imagery using curvature information derived from an image histogram smoothed by
Bezier splines. Experimental results show that by enhancing the histogram instead
of the original image, similar segmentation results can be obtained in a more efficient
way. In [18.87], a segmentation strategy based on the image pyramid data structure
is developed, working its way from the top of the pyramid to the bottom, processing
image detail hierarchically.
As we learned in Chapter 6, diffusion and diffusion-like processes [18.41, 18.42]
provide excellent noise-removal steps as components of a segmentation process.
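Of the two segmentation approaches named in this section, region growing is the easier to sketch. The version below grows a 4-connected region from a seed pixel, accepting neighbors whose brightness is within a tolerance of the seed brightness. The similarity criterion, the connectivity, and the tiny test image are assumptions for illustration; practical systems use more elaborate criteria.

```python
from collections import deque


def grow_region(image, seed, tolerance):
    """Return the set of 4-connected pixels whose value is within
    `tolerance` of the value at the seed pixel."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in region:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    region.add((nr, nc))
                    frontier.append((nr, nc))
    return region


# A bright "target" (values near 50) on a dark background (values near 10).
image = [
    [10, 11, 10, 50],
    [12, 10, 52, 51],
    [11, 49, 50, 53],
]
target = grow_region(image, seed=(1, 3), tolerance=5)
print(sorted(target))
```

Because acceptance depends on similarity to the seed rather than on local differences, isolated noisy pixels tend to be left out rather than creating spurious boundaries.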
In the following, we relate a number of the topics covered earlier in this book to the
specific application to ATR. This is done primarily through reference to the ATR
literature. We will not attempt to survey all the relevant literature in ATR, since a
great deal of that literature makes use of the methods of statistical pattern recognition,
which deserves a textbook in itself. However, we cite a number of publications which
will mention those methods.
The ATR application exhibits characteristics which are different from most other
machine vision applications, such as industrial inspection. Key differences include
the following.
(1) ATR systems must of necessity deal with unstructured environments – it simply
is not possible to control the illumination, the viewing angle, the atmosphere,
etc.
(2) Occluded targets are not just possible, they are likely.
(3) Only a few pixels are likely to be on target. The probability of correct classification
strongly depends on the number of pixels on target. This is illustrated in
Fig. 18.6 for a neural network classifier, but similar results, including especially
the dramatic change around 50 pixels-on-target, occur for any system, including
a human.
Interestingly, the first studies relating pixels on target to Pd were done prior to
digital images, and considered scan lines rather than pixels. In infrared images
of military targets, Johnson [18.36] found that target recognition by humans
required at least four pixels across the critical (smallest) target dimension. Of
course, such results vary with target complexity and recognition task details, but
the range of 3–5 pixels across, corresponding (for a square target) to roughly
20 pixels-on-target seems to hold fairly well. See also [18.37].
Clearly, the ATR system observer needs to remain as far as possible from the
target, and therefore the number of pixels-on-target is always too small! In fact
the observer would really prefer to not be there at all, which should strongly
motivate the development of robotic forward observers [18.35].
(4) 3D information must be considered. By 3D here, we mean that the target may
be observed in a number of possible aspects. (See Chapters 9 and 12.)
(5) Any ATR system must consider clutter and confusers, which are other objects,
accidentally or deliberately introduced into the scene and which resemble targets.
The most obvious confuser is friendly equipment of the same type as the target.
The requirements set out above dramatically affect the design of ATR systems.
For example, one might consider a system which identifies targets by using a variety
of different sized templates, one for each of many possible aspects. Or, a system
might simply extract a variety of different sized windows, and do dimensionality
reduction (using the K–L transform) and pass those results on to a classifier [18.14].
ATR philosophies of approach can be categorized using different taxonomies. In
one way, the philosophies are divided into two groups [18.7]. First is the classical
pattern recognition approach mentioned earlier, which uses statistical techniques.
This is very popular due to ease in implementation and speed, provided it is pos-
sible to appropriately and effectively extract useful features. The other approach is
AI-based and requires an additional step of symbolic manipulation.
Depending upon how the ATR systems function, other specialists in this area
classify them as geometry-based (mostly single sensor) and spectral-based systems
[18.71]. The basis for the geometry-based systems is the observation that, by pre-
serving as much as possible the information that is received from an object and by
using intelligent reasoning, a better interpretation of the scene containing the objects
of interest can be made [18.70]. Thus, by better understanding the forward image
formation process, the inverse problem – the task of recognizing objects from their
received signatures – becomes more feasible. However, the integration of these mod-
els into a cohesive end-to-end model by considering the effects of their interactions
is still very preliminary.
The geometry-based approach may be subdivided in various ways. For example,
one could divide geometry-based systems into those that use a large collection of
templates (an aspect graph) and those that start with a 3D model [18.77].
Spectral-based systems recognize targets based on spectral analysis and spectral
matching [18.33]. Intuitively, the more spectral bands used, the more details about
the scene can be revealed. The finer details provided in the spectral signature can
substantially increase the detectability of targets, especially small targets, at pixel or
even subpixel level [18.16, 18.47].
405 18.5 ATR algorithms
Another type of ATR system is based on the premise that the more sensory data
that is available from the target of interest, the better the system performance. This
is intuitively obvious for sensors that have complementary properties. Due to nu-
merous limitations of single-sensor ATR systems, there has been a move toward
multisensor targeting systems and, hence, the problem of correlating and combining
data generated from multiple sensors. This is sometimes referred to as multisensor
fusion; however, the information sources may be different sensors (sensor
fusion) or different algorithms (algorithm fusion) [18.32].
Finally, some researchers break the set of ATR algorithms into model-based meth-
ods, statistics-based methods, and template-based methods. These three categoriza-
tions are discussed in more detail below.
Multispectral matching
One technique for multispectral analysis uses the concept of histogram intersection
introduced by Swain and Ballard [18.82] for the realm of color images. The idea
is simply to compare the histograms of two images and determine any overlap
factor (how many pixels in the data base histogram match those in the histogram
of the new image). Specifically, given a pair of histograms, I (from new image)
and M (from database), each containing n bins, the intersection is defined to be
$\sum_{j=1}^{n} \min(I_j, M_j)$. The result is the number of pixels that have the same color in the
two images. This number may be normalized to obtain an overlap factor. The color of an
object is subject to significant change depending upon varying lighting conditions;
and in this situation, this simple algorithm will clearly not give us a good match. To
overcome this problem, Funt and Finlayson [18.24] combined the idea of histogram
intersection with a concept referred to as “color constancy” [18.23], which removes
effects of varying illumination conditions and in effect, normalizes the image to
a standard illuminant. Now that we have a data base also in standard illuminant
conditions, we can “compare apples with apples” and use the histogram intersection
method described above. We have not put any restriction to the dimensionality of the
histogram, and hence can extend this concept to higher dimensions (more sensors),
obtaining a more robust system.
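Histogram intersection as defined above is short enough to write out directly. In the sketch below, histograms are plain lists of bin counts, and the overlap factor is obtained by dividing by the number of pixels in the database histogram; that choice of normalization, like the function names and the example counts, is an assumption for illustration.

```python
def histogram_intersection(image_hist, model_hist):
    """Sum over bins of min(I_j, M_j) for two histograms with the same bins."""
    if len(image_hist) != len(model_hist):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(i, m) for i, m in zip(image_hist, model_hist))


def overlap_factor(image_hist, model_hist):
    """Intersection normalized by the pixel count of the model (database) histogram."""
    total = sum(model_hist)
    if total == 0:
        raise ValueError("model histogram is empty")
    return histogram_intersection(image_hist, model_hist) / total


# Four-bin histograms of a new image (I) and a database entry (M).
new_image = [4, 0, 6, 10]
database = [5, 5, 5, 5]
print(histogram_intersection(new_image, database))  # prints: 14
print(overlap_factor(new_image, database))          # prints: 0.7
```

For higher-dimensional histograms (more spectral bands), the same functions apply once the bins are flattened into a single list in a consistent order.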
Another measure of spectral match between a known target signature and an ob-
served signature is obtained by treating the signatures as vectors and finding the inner
product between the two vectors [18.93]. The better the match, the closer the angular
separation to zero. In other words, if we have two d-dimensional measurements in
the spectral signatures, X and Y, then the distance between these two measurements
is the angle θ between them, given by
$$\cos\theta = \frac{X^{\mathsf{T}} Y}{|X|\,|Y|}.$$
A small θ indicates that the two spectra are quantitatively similar; similarly, a large θ
indicates dissimilar spectra. In [18.93], Weisberg et al. use this measure to perform a
clustering procedure, segmenting the image into multiple regions of interest. There
are many other measures of similarity that are in use [18.34, 18.62].
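The angular measure of spectral match can be computed as follows. This is a minimal sketch: signatures are plain lists of band values, the function name is ours, and the cosine is clamped to [-1, 1] only to guard against floating-point round-off.

```python
import math


def spectral_angle(x, y):
    """Angle in radians between two spectral signatures treated as vectors."""
    if len(x) != len(y):
        raise ValueError("signatures must have the same number of bands")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("a zero signature has no direction")
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))  # guard against round-off
    return math.acos(cosine)


# The same spectral shape at twice the brightness gives an angle near zero,
# while signatures concentrated in different bands are 90 degrees apart.
print(spectral_angle([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(math.degrees(spectral_angle([1.0, 0.0], [0.0, 1.0])))
```

Because the angle ignores vector length, the measure is insensitive to an overall scaling of brightness between the two signatures.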
In [18.86], Trivedi introduces the use of relative spectral information rather than
absolute information about the target for remote-sensing purposes. This introduces
some robustness into the system. For example, a certain object may be brighter than
the background in a particular channel and darker in some other channel.
18.6 The Hough transform in ATR
Since many man-made objects have straight-line properties, the Hough transform
pops up often in ATR applications. For example, it may be used [18.19] to identify
the track of a (subpixel) missile as it is observed from space. By taking differences
in time, the target track becomes visible, along with a great deal of noise. The track
is, however, more-or-less a straight line, and comes out very clearly in the Hough
transform. Cowart et al. [18.19] also consider the use of a parametric transform
which allows for maneuvering targets.
Viewed from above, a ship is a long, straight (unless the ship has had an unfortunate
encounter) object. When viewed with SAR,2 the image of a ship consists of spots and
dropouts. One may estimate [18.25, 18.50] the orientation of a ship by simply taking
the Hough transform. Surprisingly, it appears [18.25] that the Hough transform is
less sensitive to noise than using principal axes. If the ship is moving, its wake is
even longer, straighter, and may be more easily found [18.18].
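The orientation estimate described above can be sketched with a small accumulator over the normal parameterization $\rho = x\cos\theta + y\sin\theta$. This is an illustrative sketch rather than the method of the cited papers: the function names, the bin sizes, and the synthetic "spots" are all assumptions.

```python
import math
from collections import Counter


def hough_peak(points, n_theta=180, rho_step=1.0):
    """Return (theta_deg, rho, votes) for the strongest line through the points."""
    votes = Counter()
    for x, y in points:
        for k in range(n_theta):
            theta = math.pi * k / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(k, round(rho / rho_step))] += 1
    (k, rho_bin), count = votes.most_common(1)[0]
    return 180.0 * k / n_theta, rho_bin * rho_step, count


def line_orientation_deg(theta_deg):
    """Direction of the line itself, which is perpendicular to its normal."""
    return (theta_deg + 90.0) % 180.0


# Synthetic data: 50 spots along the diagonal y = x plus a few stray spots.
spots = [(t, t) for t in range(50)]
spots += [(3, 40), (45, 7), (20, 33), (10, 2), (38, 15)]
theta_deg, rho, count = hough_peak(spots)
print(theta_deg, rho, count, line_orientation_deg(theta_deg))
```

Each stray spot adds only one vote to any accumulator cell, while the collinear spots pile up in a single cell, which is one way to see why the peak tolerates noise.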
A satellite, when viewed from a terrestrial telescope, has straight edges (see Fig.
18.8). The Hough transform can also be used to identify and characterize those edges
[18.20].
The special needs of ATR affect the types of morphological techniques reported in
the literature. For example, in many ATR applications, the targets are observed from
above and have few pixels on target. Therefore, it is important that noise removal
operations such as morphological opening be invariant to rotation. For example
[18.60], the traditional opening
$$(f \circ B)(n, m) = \max_{i,j \in B} \Bigl\{ \min_{i,j \in B} \{ f(n + i, m + j) \} \Bigr\} \qquad (18.6)$$
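Read as an erosion (the inner minimum) followed by a dilation (the outer maximum) with a flat structuring element B, Eq. (18.6) can be sketched as follows. Several choices here are assumptions: B is taken to be a square window of a given radius, the outer and inner offsets are treated as independent, and the window is clipped at the image border. All function names are hypothetical.

```python
def _window_reduce(image, radius, reduce_fn):
    """Apply reduce_fn (min or max) over a square window clipped to the image."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for n in range(rows):
        for m in range(cols):
            window = [
                image[n + i][m + j]
                for i in range(-radius, radius + 1)
                for j in range(-radius, radius + 1)
                if 0 <= n + i < rows and 0 <= m + j < cols
            ]
            out[n][m] = reduce_fn(window)
    return out


def erode(image, radius=1):
    return _window_reduce(image, radius, min)


def dilate(image, radius=1):
    return _window_reduce(image, radius, max)


def opening(image, radius=1):
    """Flat grayscale opening: erosion followed by dilation."""
    return dilate(erode(image, radius), radius)


# 7x7 test image: a 3x3 bright block plus one isolated bright (noise) pixel.
image = [[0] * 7 for _ in range(7)]
for r in range(1, 4):
    for c in range(1, 4):
        image[r][c] = 5
image[5][5] = 9
opened = opening(image)
for row in opened:
    print(row)
```

With a 3x3 window the isolated pixel is removed and the 3x3 block survives. A square window is symmetric only under rotations by multiples of 90 degrees, so the result for a small target can depend on its orientation.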
2 SAR is an abbreviation for Synthetic Aperture Radar.
18.8 Chain codes in ATR
Template matching and feature-based approaches have their own shortcomings. Most
such algorithms depend on some prior knowledge about the object. Features are
needed which are invariant with respect to the size and orientation of the object in
the field-of-view of the sensor. Similarly, approaches based on template matching
require extensive databases, with long search times. However, silhouettes may be
matched [18.72] by computing the chain code of the segmented object and then processing
the histogram of that chain code (Fig. 18.10). This strategy has two very useful
properties:
(1) Scale variation in the image domain is equivalent to a vertical shift in the chain
code histogram domain.
(2) Changes in orientation are equivalent to horizontal cyclic shifts in the histogram.
[Fig. 18.10: block diagram of chain-code matching; recoverable block labels: histogram of chain code, histogram matching, trellis algorithm, decision.]
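Property (2) can be checked on a small example. The sketch below assumes an 8-direction Freeman chain code (code k means a step at k times 45 degrees) and an idealized rectangular silhouette. Neither assumption comes from the text, and the helper names are hypothetical.

```python
def chain_code_histogram(codes, n_dirs=8):
    """Count how often each direction code occurs along the boundary."""
    hist = [0] * n_dirs
    for c in codes:
        hist[c % n_dirs] += 1
    return hist


def rotate_codes(codes, steps, n_dirs=8):
    """Rotate the shape by steps * (360 / n_dirs) degrees."""
    return [(c + steps) % n_dirs for c in codes]


def cyclic_shift(hist, steps):
    """Shift histogram bins to the right by steps, wrapping around."""
    steps %= len(hist)
    return hist[-steps:] + hist[:-steps] if steps else list(hist)


# Boundary of a 4-by-2 rectangle: right 4, up 2, left 4, down 2.
codes = [0] * 4 + [2] * 2 + [4] * 4 + [6] * 2
hist = chain_code_histogram(codes)
rotated_hist = chain_code_histogram(rotate_codes(codes, 2))  # 90 degree turn
print(hist)
print(rotated_hist)
print(rotated_hist == cyclic_shift(hist, 2))
```

For this idealized shape, doubling the size repeats every code twice, so every count doubles. On a logarithmic count axis that is a uniform vertical shift, which is one way to read property (1).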
It is possible for two different objects to have the same chain code histogram. To
resolve this ambiguity, a trellis algorithm may be used to distinguish between such
objects [18.83]. A
trellis structure of a distorted pattern is created using each row vector as a “fractional
pattern” (a node in the trellis). A large collection of observed data is used to provide
a statistical basis for this trellis, to which the Viterbi algorithm (see section 2.4.2) is
applied. Although its use has been demonstrated only for handwritten characters,
this method may be applied to any category of distorted patterns.
18.9 Conclusion
Many papers in the public literature present comparative
studies of different techniques. For example, Li et al. [18.43] compared a number
of neural, statistical and model-based approaches and concluded that: “At least for
FLIR images, the neural network-based approaches gave better results than the
PCA (principal components analysis) and the LDA (linear discriminant analysis)-
based approaches.” They found that methods based on the Hausdorff distance also
performed “well.”
Often, papers in the literature present conclusions that make the reader think the
ATR problem is nearly solved when, in reality, either these systems have not been
tested on real military data or their scope is limited to one specific application
and set of conditions, so that their “actual” performance may be doubted. We need
to understand that the problem being dealt with here is not only nontrivial; it becomes
substantially more difficult with every new invention made by the enemy. What is
“state-of-the-art” today may not be tomorrow.
Nonetheless, from an engineering point of view (which is almost always practical!),
the authors believe that the following are needed to produce ATR systems that
may be benchmarked:
• a large set of standard real-world images, with ground truth information available
  to the research community,
• a tool to optimally evaluate and incorporate the best of a large number of ATR
  techniques into a system with “optimal” performance.
Several questions arise when we take these issues into consideration. How do
we come up with a standard? What objective are we trying to reach: a powerful
system that has the best ROC curves, or a system that is portable and uses
minimal hardware? Clearly these cannot all be met at the same time, and someone
has to strike a balance.
General trends are seen, however, in the development of ATR systems. The use
of more than one sensory system, portability, truly automated operation with
minimal human intervention, and the increased use of mathematical tools to
provide a sound basis for the system are where research seems
to be headed. In this chapter the authors hope to have communicated the vastness of
this problem while also presenting the achievements of science in approaching it.
Bibliography
[18.63] H. Qi, W. E. Snyder, and D. Marchette, “An Efficient Approach to Segmenting Man-
made Targets from Unmanned Aerial Vehicle Imagery,” Optical Engineering, 39(5),
pp. 1267–1274, 2000.
[18.64] H. Ranganath and R. Sims, “Self Partitioning Neural Networks for Target Recogni-
tion,” Automatic Target Recognition IV, Proceedings SPIE, 2234, April, 1994.
[18.65] K. Rao, “Combinatorics Reduction for Target Recognition in ATR Applications,”
Automatic Object Recognition II, Proceedings SPIE, 1700, 1992.
[18.66] J. Ratches, C. Walters, R. Buser, and B. Guenther, “Aided and Automatic Target
Recognition Based upon Sensory Inputs from Image Forming Systems,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 19(9), Sept. 1997.
[18.67] A. Reno, D. Gillies, and D. Booth, “Deformable Models for Object Recognition in
Aerial Images,” Automatic Target Recognition VIII, Proceedings SPIE, 3371, 1998.
[18.68] E.M. Riseman and M.A. Arbib, “Computational Techniques in Visual Segmentation
of Static Scenes,” CCGIP, 7, Target Recognition VIII, Proceedings SPIE, 3371,
April, 1998.
[18.69] R. Robmann and H. Bunke, “Towards Robust Edge Extraction – a Fusion Based
Approach using Greylevel and Range Images,” Automatic Target Recognition V,
Proceedings SPIE, 2485, April, 1995.
[18.70] F.A. Sadjadi, “A Model-Based Technique for Recognizing Targets by Using Mil-
limeter Wave Radar Signatures,” International Journal of Infrared and Millimeter
Waves, 10(3), 337–342, 1989.
[18.71] F.A. Sadjadi, “Automatic Object Recognition: Critical Issues and Current Ap-
proaches,” Proceedings SPIE, 1471, 1991.
[18.72] F.A. Sadjadi, “Automatic Object Recognition: Critical Issues and Current Ap-
proaches,” 1991. Selected SPIE Papers on CD-ROM, Volume 6. Automatic Target
Recognition SPIE: 1 PO Box 10, Bellingham, WA 98227-0010, USA.
[18.73] F.A. Sadjadi, “Automatic Object Recognition: Critical Issues and Current Ap-
proaches,” 1991. Selected SPIE Papers on CD-ROM, Volume 6. Automatic Target
Recognition SPIE: 1 PO Box 10, Bellingham, WA 98227-0010, USA.
[18.74] F.A. Sadjadi, Special Section on ATR, Optical Engineering, 31(12), 1992.
[18.75] F.A. Sadjadi, “Application of Genetic Algorithm for Automatic Recognition of
Partially Occluded Objects,” Automatic Object Recognition IV, Proceedings SPIE,
2234, 1994.
[18.76] R. Samy and J. Bonnet, “Robust and Incremental Active Contour Models for Objects
Tracking,” Automatic Target Recognition V, Proceedings SPIE, 2485, April, 1995.
[18.77] R. Sharma and N. Subotic, “Construction of Hybrid Templates from Collected and
Simulated Data for SAR ATR Algorithms,” Automatic Target Recognition VIII, Pro-
ceedings SPIE, 3371, April, 1998.
[18.78] R. Sims, “Signal to Clutter Measurement and ATR Performance,” Automatic Target
Recognition VIII, Proceedings SPIE, 3371, April, 1998.
[18.79] R. Sims and B. Dasarathy, “Automatic Target Recognition using a Passive Multisen-
sor Suite,” Special Section on ATR, Optical Engineering, 31(12), 1992.
[18.80] A. Srivastava, B. Thomasson, and R. Sims, “A Regression Model for Prediction of
IR Images,” Proceedings of SPIE Aerosense ATR XI, Orlando, 2001.
[18.81] L.B. Stotts, E.M. Winter, L.E. Hoff, and I.S. Reed, “Clutter Rejection using Multi-
spectral Processing,” Proceedings SPIE Signal and Data Processing of Small Targets,
1305, pp. 2–10, 1990.
[18.82] M. Swain and D. Ballard, “Color Indexing,” International Journal of Computer
Vision, 7(1), 1991.
[18.83] H. Tanaka, Y. Hirakawa, and S. Kaneku, “Recognition of Distorted Patterns Us-
ing the Viterbi Algorithm,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 4(1), 1982.
[18.84] W. Thoet, T. Rainey, D. Brettle, L. Stutz, and F. Weingard, “ANVIL Neural Network
Program for Three-dimensional Automatic Target Recognition,” Optical Engineer-
ing, 31(12), December, 1992.
[18.85] M.M. Trivedi, “Detection of Objects in High Resolution Multispectral Aerial
Images,” SPIE Applications of Artificial Intelligence II, 1985.
[18.86] M. Trivedi, “Object Detection Using Their Multispectral Characteristics,” Proceed-
ings SPIE, 754, 1987.
[18.87] M. M. Trivedi and J. C. Bezdek, “Low-Level Segmentation of Aerial Images with
Fuzzy Clustering,” IEEE Transactions on Systems, Man, and Cybernetics, 16(4),
1986.
[18.88] A. Ueltschi and H. Bunke, “Model-based Recognition of Three-dimensional Objects
from Incomplete Range Data,” Automatic Target Recognition V, Proceedings SPIE,
2485, April, 1995.
[18.89] J. Wald, D. Krig, and T. DePersia, “ATR: Problems and Possibilities for the IU
Community,” Proceedings of the DARPA Image Understanding Workshop, IUW,
255–264, San Diego, January, 1992.
[18.90] B. Wallet, D. Marchette, and J. Solka, “A Matrix Representation for Genetic Algo-
rithms,” Automatic Target Recognition VI, Proceedings SPIE, 2756, April, 1996.
[18.91] B. Wallet, D. Marchette, and J. Solka, “Using Genetic Algorithms to Search for
Optimal Projections,” Automatic Target Recognition VII, Proceedings SPIE, 3069,
April, 1997.
[18.92] S. Wang, G. Chen, D. Sapounas, H. Shi, and R. Peer, “Development of Gazing
Algorithms for Tracking Oriented Recognition,” Automatic Target Recognition VII,
Proceedings SPIE, 3069, April, 1997.
[18.93] A. Weisberg, M. Najarian, B. Borowski, J. Lisowski, and B. Miller, “Spectral Angle
Automatic Cluster Routine (SAALT): An Unsupervised Multispectral Clustering
Algorithm,” Proceedings of IEEE Aerospace Conference, 307–317, 1999.
[18.94] D. Xue, Y. Zhu, and G. Zhu, “Recognition of Low-contrast FLIR Tank Object Based
on Multiscale Fractal Character Vector,” Automatic Target Recognition VI, Proceed-
ings SPIE, 2756, April, 1996.
[18.95] C. Zhou, G. Zhang, and J. Peng, “Performance Modeling Based Adaptive Target
Tracking in Multiscenario Environment,” Automatic Target Recognition VI, Pro-
ceedings SPIE, 2756, April, 1996.