

MACHINE VISION

This book is an accessible and comprehensive introduction to machine vision. It provides all
the necessary theoretical tools and shows how they are applied in actual image processing
and machine vision systems. A key feature is the inclusion of many programming exercises
that give insights into the development of practical image processing algorithms.
The authors begin with a review of mathematical principles and go on to discuss key
issues in image processing such as the description and characterization of images, edge
detection, feature extraction, segmentation, texture, and shape. They also discuss image
matching, statistical pattern recognition, syntactic pattern recognition, clustering, diffusion,
adaptive contours, parametric transforms, and consistent labeling. Important applications
are described, including automatic target recognition. Two recurrent themes in the book are
consistency (a principal philosophical construct for solving machine vision problems) and
optimization (the mathematical tool used to implement those methods).
Software and data used in the book can be found at www.cambridge.org/9780521830461.
The book is aimed at graduate students in electrical engineering, computer science,
and mathematics. It will also be a useful reference for practitioners.
Wesley E. Snyder received his Ph.D. from the University of Illinois, and is currently
Professor of Electrical and Computer Engineering at North Carolina State University. He has
written over 100 scientific papers and is the author of the book Industrial Robots. He was
a founder of both the IEEE Robotics and Automation Society and the IEEE Neural
Networks Council. He has served as an advisor to the National Science Foundation,
NASA, Sandia Laboratories, and the US Army Research Office.
Hairong Qi received her Ph.D. from North Carolina State University and is currently an
Assistant Professor of Electrical and Computer Engineering at the University of
Tennessee, Knoxville.
MACHINE VISION

Wesley E. Snyder
North Carolina State University, Raleigh

Hairong Qi
University of Tennessee, Knoxville
Cambridge University Press
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore,
São Paulo, Delhi, Dubai, Tokyo, Mexico City

Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK

Published in the United States of America by Cambridge University Press, New York

www.cambridge.org
Information on this title: www.cambridge.org/9780521169813

© Cambridge University Press 2004

This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.

First published 2004


First paperback edition 2010

A catalogue record for this publication is available from the British Library

Library of Congress Cataloging in Publication Data


Snyder, Wesley E.
Machine vision / Wesley E. Snyder and Hairong Qi.
p. cm.
ISBN 0 521 83046 X
1. Computer vision. I. Qi, Hairong, 1970– . II. Title.
TA1634.S69 2004
006.37 – dc21 2003053083

ISBN 978-0-521-83046-1 Hardback
ISBN 978-0-521-16981-3 Paperback

Cambridge University Press has no responsibility for the persistence or
accuracy of URLs for external or third-party internet websites referred to in
this publication, and does not guarantee that any content on such websites is,
or will remain, accurate or appropriate.
To Graham and Robert
WES

To my parents and Feiyi


HQ
Contents

To the instructor page xv


Acknowledgements xviii

1 Introduction 1

1.1 Concerning this book 1


1.2 Concerning prerequisites 2
1.3 Some terminology 3
1.4 Organization of a machine vision system 5
1.5 The nature of images 6
1.6 Images: Operations and analysis 6
Reference 7

2 Review of mathematical principles 8

2.1 A brief review of probability 8


2.2 A review of linear algebra 10
2.3 Introduction to function minimization 15
2.4 Markov models 20
References 28

3 Writing programs to process images 29

3.1 Image File System (IFS) software 29


3.2 Basic programming structure for image processing 31
3.3 Good programming styles 32
3.4 Example programs 33
3.5 Makefiles 34

4 Images: Formation and representation 38

4.1 Image representations 38


4.2 The digital image 42

4.3 Describing image formation 49


4.4 The image as a surface 51
4.5 Neighborhood relations 53
4.6 Conclusion 56
4.7 Vocabulary 56

Topic 4A Image representations 57


4A.1 A variation on sampling: Hexagonal pixels 57
4A.2 Other types of iconic representations 60
References 62

5 Linear operators and kernels 65

5.1 What is a linear operator? 65


5.2 Application of kernel operators in digital images 66
5.3 Derivative estimation by function fitting 69
5.4 Vector representations of images 73
5.5 Basis vectors for images 75
5.6 Edge detection 76
5.7 A kernel as a sampled differentiable function 78
5.8 Computing convolutions 83
5.9 Scale space 85
5.10 Quantifying the accuracy of an edge detector 88
5.11 So how do people do it? 90
5.12 Conclusion 92
5.13 Vocabulary 92
Topic 5A Edge detectors 97
5A.1 The Canny edge detector 97
5A.2 Improvements to edge detection 98
5A.3 Inferring line segments from edge points 99
5A.4 Space/frequency representations 99
5A.5 Vocabulary 101
References 104

6 Image relaxation: Restoration and feature extraction 107

6.1 Relaxation 107


6.2 Restoration 108
6.3 The MAP approach 111
6.4 Mean field annealing 115

6.5 Conclusion 126


6.6 Vocabulary 127
Topic 6A Alternative and equivalent algorithms 129
6A.1 GNC: An alternative algorithm for noise removal 129
6A.2 Variable conductance diffusion 131
6A.3 Edge-oriented anisotropic diffusion 133
6A.4 A common description of image relaxation operators 133
6A.5 Relationship to neural networks 137
6A.6 Conclusion 137
Bibliography 138

7 Mathematical morphology 144

7.1 Binary morphology 144


7.2 Gray-scale morphology 152
7.3 The distance transform 153
7.4 Conclusion 156
7.5 Vocabulary 156

Topic 7A Morphology 158


7A.1 Computing erosion and dilation efficiently 158
7A.2 Morphological sampling theorem 161
7A.3 Choosing a structuring element 164
7A.4 Closing gaps in edges and surfaces 164
7A.5 Vocabulary 177
Bibliography 178

8 Segmentation 181

8.1 Segmentation: Partitioning an image 181


8.2 Segmentation by thresholding 182
8.3 Connected component analysis 185
8.4 Segmentation of curves 196
8.5 Active contours (snakes) 197
8.6 Segmentation of surfaces 201
8.7 Evaluating the quality of a segmentation 204
8.8 Conclusion 205
8.9 Vocabulary 206

Topic 8A Segmentation 207


8A.1 Texture segmentation 207
8A.2 Segmentation of images using edges 210
8A.3 Motion segmentation 210
8A.4 Color segmentation 210
8A.5 Segmentation using MAP methods 210
8A.6 Human segmentation 211
Bibliography 211

9 Shape 216

9.1 Linear transformations 216


9.2 Transformation methods based on the covariance matrix 219
9.3 Simple features 225
9.4 Moments 229
9.5 Chain codes 230
9.6 Fourier descriptors 231
9.7 The medial axis 232
9.8 Deformable templates 233
9.9 Quadric surfaces 234
9.10 Surface harmonic representations 236
9.11 Superquadrics and hyperquadrics 236
9.12 Generalized cylinders (GCs) 238
9.13 Conclusion 238
9.14 Vocabulary 239
Topic 9A Shape description 240
9A.1 Finding the diameter of nonconvex regions 240
9A.2 Inferring 3D shape from images 243
9A.3 Motion analysis and tracking 250
9A.4 Vocabulary 253
Bibliography 256

10 Consistent labeling 263

10.1 Consistency 263


10.2 Relaxation labeling 266
10.3 Conclusion 270
10.4 Vocabulary 270

Topic 10A 3D Interpretation of 2D line drawings 271


References 273

11 Parametric transforms 275

11.1 The Hough transform 275


11.2 Reducing computational complexity 279
11.3 Finding circles 280
11.4 The generalized Hough transform 282
11.5 Conclusion 283
11.6 Vocabulary 283
Topic 11A Parametric transforms 283
11A.1 Finding parabolae 283
11A.2 Finding the peak 285
11A.3 The Gauss map 286
11A.4 Parametric consistency in stereopsis 286
11A.5 Conclusion 287
11A.6 Vocabulary 287
References 288

12 Graphs and graph-theoretic concepts 290

12.1 Graphs 290


12.2 Properties of graphs 291
12.3 Implementing graph structures 291
12.4 The region adjacency graph 292
12.5 Using graph-matching: The subgraph isomorphism problem 294
12.6 Aspect graphs 295
12.7 Conclusion 296
12.8 Vocabulary 297
References 297

13 Image matching 298

13.1 Matching iconic representations 298


13.2 Matching simple features 304
13.3 Graph matching 305
13.4 Conclusion 309
13.5 Vocabulary 309

Topic 13A Matching 312


13A.1 Springs and templates revisited 312
13A.2 Neural networks for object recognition 314
13A.3 Image indexing 318
13A.4 Matching geometric invariants 318
13A.5 Conclusion 321
13A.6 Vocabulary 322
Bibliography 322

14 Statistical pattern recognition 326

14.1 Design of a classifier 326


14.2 Bayes’ rule and the maximum likelihood classifier 329
14.3 Decision regions and the probability of error 336
14.4 Conditional risk 337
14.5 The quadratic classifier 340
14.6 The minimax rule 342
14.7 Nearest neighbor methods 343
14.8 Conclusion 345
14.9 Vocabulary 345

Topic 14A Statistical pattern recognition 347


14A.1 Matching feature vectors using statistical methods 347
14A.2 Support vector machines (SVMs) 349
14A.3 Conclusion 354
14A.4 Vocabulary 354
References 354

15 Clustering 356

15.1 Distances between clusters 357


15.2 Clustering algorithms 359
15.3 Optimization methods in clustering 363
15.4 Conclusion 366
15.5 Vocabulary 366
References 368

16 Syntactic pattern recognition 369

16.1 Terminology 369


16.2 Types of grammars 371

16.3 Shape recognition using grammatical structure 373


16.4 Conclusion 380
16.5 Vocabulary 380
References 381

17 Applications 382

17.1 Multispectral image analysis 382


17.2 Optical character recognition (OCR) 382
17.3 Automated/assisted diagnosis 383
17.4 Inspection/quality control 383
17.5 Security and intruder identification 384
17.6 Robot vision 385
Bibliography 386

18 Automatic target recognition 392

18.1 The hierarchy of levels of ATR 392


18.2 ATR system components 394
18.3 Evaluating performance of ATR algorithms 395
18.4 Machine vision issues unique to ATR 400
18.5 ATR algorithms 403
18.6 The Hough transform in ATR 407
18.7 Morphological techniques in ATR 408
18.8 Chain codes in ATR 409
18.9 Conclusion 410
Bibliography 411

Author index 417


Index 426
To the instructor

This textbook covers both fundamentals and advanced topics in computer-based
recognition of objects in scenes. It is intended to be both a text and a reference.
Almost every chapter has a "Fundamentals" section which is pedagogically structured
as a textbook, and a "Topics" section which includes extensive references to the
current literature and can be used as a reference. The text is directed toward
graduate students and advanced undergraduates in electrical and computer engineering,
computer science, or mathematics.
Chapters 4 through 17 cover topics including edge detection, shape characterization,
diffusion, adaptive contours, parametric transforms, matching, and consistent
labeling. Syntactic and statistical pattern recognition and clustering are introduced.
Two recurrent themes are used throughout these chapters: Consistency (a principal
philosophical construct for solving machine vision problems) and optimization (the
mathematical tool used to implement those methods). These two topics are so
pervasive that we conclude each chapter by discussing how they have been reflected
in the text. Chapter 18 uses one application area, automatic target recognition, to
show how all the topics presented in the previous chapters can be integrated to solve
real-world problems.
This text assumes a solid graduate or advanced-undergraduate background including
linear algebra and advanced calculus. The student who successfully completes this
course can design a wide variety of industrial, medical, and military machine vision
systems. Software and data used in the book can be found at www.cambridge.org/
9780521830461. The software will run on PCs running Windows or Linux, Macintosh
computers running OS-X, and SUN computers running SOLARIS. The software includes
the ability to process images whose pixels are of any data type on any computer
and to convert to and from "standard" image formats such as JPEG.
Although it can be used in a variety of ways, we designed the book primarily as
a graduate textbook in machine vision, and as a reference in machine vision. If
used as a text, the students would be expected to read the basic topics section of
each chapter used in the course (there is more material in this book than can be
covered in a single semester). For use in a first course at the graduate level, we
present a sample syllabus in the following table.

Sample syllabus. (Each entry gives the lecture topics, the assignment with its suggested duration in weeks in parentheses, and the reading assignment.)

Lecture 1. Introduction, terminology, operations on images, pattern classification and computer vision, image formation, resolution, dynamic range, pixels. Assignment: 2.2–2.5 and 2.9 (1). Reading: Read Chapter 2; convince yourself that you have the background for this course.
Lecture 2. The image as a function. Image degradation. Point spread function. Restoration. Assignment: 3.1 (1). Reading: Chapters 1 and 3.
Lecture 3. Properties of an image, isophotes, ridges, connectivity. Assignment: 3.2, 4.1 (2). Reading: Sections 4.1–4.5.
Lecture 4. Kernel operators: Application of kernels to estimate edge locations. Assignment: 4.A1, 4.A2 (1). Reading: Sections 5.1 and 5.2.
Lecture 5. Fitting a function (a biquadratic) to an image. Taking derivatives of vectors to minimize a function. Assignment: 5.1, 5.2 (1). Reading: Sections 5.3–5.4 (skip hexagonal pixels).
Lecture 6. Vector representations of images, image basis functions. Edge detection, Gaussian blur, second and higher derivatives. Assignment: 5.4, 5.5 (2) and 5.7, 5.8, 5.9 (1). Reading: Sections 5.5 and 5.6 (skip section 5.7).
Lecture 7. Introduction to scale space. Discussion of homeworks. Assignment: 5.10, 5.11 (1). Reading: Section 5.8 (skip section 5.9).
Lecture 8. Relaxation and annealing. Assignment: 6.1, 6.3 (1). Reading: Sections 6.1–6.3.
Lecture 9. Diffusion. Assignment: 6.2 (2). Reading: Section 6A.2.
Lecture 10. Equivalence of MFA and diffusion. Assignment: 6.7 and 6.8 (1). Reading: Section 6A.4.
Lecture 11. Image morphology. Assignment: 7.5–7.7 (1). Reading: Section 7.1.
Lecture 12. Morphology, continued. Gray-scale morphology. Distance transform. Assignment: 7.10 (2). Reading: Sections 7.2, 7.3.
Lecture 13. Closing gaps in edges, connectivity. Assignment: 7.4 (1). Reading: Section 7A.4.
Lecture 14. Segmentation by optimal thresholding. Reading: Sections 8.1, 8.2.
Lecture 15. Connected component labeling. Assignment: 8.2 (1). Reading: Section 8.3.
Lecture 16. 2D geometry, transformations. Assignment: 9.3 (1). Reading: Sections 9.1, 9.2.
Lecture 17. 2D shape features, invariant moments, Fourier descriptors, medial axis. Assignment: 9.2, 9.4, 9.10 (1). Reading: Sections 9.3–9.7.
Lecture 18. Segmentation using snakes and balloons. Reading: Sections 8.5, 8.5.1.
Lecture 19. PDE representations and level sets. Reading: Section 8.5.2.
Lecture 20. Shape-from-X and structured illumination. Assignment: 9.10 (1). Reading: Sections 9A.2.2, 9A.2.3.
Lecture 21. Graph-theoretic image representations: Graphs, region adjacency graphs. Subgraph isomorphism. Reading: Chapter 12.
Lecture 22. Consistent and relaxation labeling. Assignment: 10.1 (1). Reading: Chapter 10.
Lecture 23. Hough transform, parametric transforms. Assignment: 11.1 (2). Reading: Sections 11.1, 11.2, 11.3.3.
Lecture 24. Generalized Hough transform, Gauss map, application to finding holes in circuit boards. Reading: Section 11A.3.
Lecture 25. Iconic matching, springs and templates, association graphs. Assignment: 13.2 and 13.3 (1). Reading: Sections 13.1–13.3.
Lecture 26. The role of statistical pattern recognition.


The assignments are projects which must include a formal report. Since there is
usually programming involved, we allow more time to accomplish these assignments –
the suggested times, in weeks, are given in parentheses after each assignment. It is
also possible, by careful selection of the students and the topics, to use this book
in an advanced undergraduate course.
For advanced students, the "Topics" sections of this book should serve as a
collection of pointers to the literature. Be sure to emphasize to your students (as we
do in the text) that no textbook can provide the details available in the literature,
and any "real" (that is, for a paying customer) machine vision project will require
the development engineer to go to the published journal and conference literature.
As stated above, the two recurrent themes throughout this book are consistency
and optimization. The concept of consistency occurs throughout the discipline as a
principal philosophical construct for solving machine vision problems. When
confronted with a machine vision application, the engineer should seek to find ways to
determine sources of information which are consistent. Optimization is the principal
mathematical tool for solving machine vision problems, including determining
consistency. At the end of each chapter which introduces techniques, we remind the
student where consistency fits into the problems of that chapter, as well as where
and which optimization methods are used.
Acknowledgements

My graduate students at North Carolina State University, especially Rajeev
Ramanath, deserve a lot of credit for helping us make this happen. Bilgé Karacali
also helped quite a bit with his proofreading, and contributed significantly to the
section on support vector machines.
Of course, none of this would have mattered if it were not for my wife, Rosalyn,
who provided the encouragement necessary to make it happen. She also edited the
entire book (more than once), and converted it from Engineerish to English.

WES

I'd like to express my sincere thanks to Dr. Wesley Snyder for inviting me to
coauthor this book. I have greatly enjoyed this collaboration and have gained valuable
experience.
The final delivery of the book was scheduled around Christmas when my parents
were visiting me from China. Instead of touring around the city and enjoying the
holidays, they simply stayed with me and supported me through the final submission
of the book. I owe my deepest gratitude to them. And to Feiyi, my forever technical
support and emergency reliever.
HQ

1 Introduction

The proof is straightforward, and thus omitted
Ja-Chen Lin and Wen-Hsiang Tsai¹

1.1 Concerning this book

(Margin note: This is an important observation: This book does NOT have enough information to tell you how to implement significant large systems. It teaches general principles. You MUST make use of the literature when you get down to the gnitty gritty.)

We have written this book at two levels, the principal level being introductory.
"Introductory" does not mean "easy" or "simple" or "doesn't require math." Rather,
the introductory topics are those which need to be mastered before the advanced
topics can be understood.
In addition, the book is intended to be useful as a reference. When you have to
study a topic in more detail than is covered here, in order, for example, to implement a
practical system, we have tried to provide adequate citations to the relevant literature
to get you off to a good start.
We have tried to write in a style aimed directly toward the student and in a
conversational tone.
We have also tried to make the text readable and entertaining. Words which are
deluberately missppelled for humorous affects should be ubvious. Some of the humor
runs to exaggeration and to puns; we hope you forgive us.
We did not attempt to cover every topic in the machine vision area. In particular,
nearly all papers in the general areas of optical character recognition and face
recognition have been omitted; not to slight these very important and very successful
application areas, but rather because the papers tend to be rather specialized; in
addition, we simply cannot cover everything.
There are two themes which run through this book: consistency and optimization.
Consistency is a conceptual tool, implemented as a variety of algorithms, which helps
machines to recognize images – they fuse information from local measurements to
make global conclusions about the image. Optimization is the mathematical
mechanism used in virtually every chapter to accomplish the objectives of that chapter,
be they pattern classification or image matching.
¹ Ja-Chen Lin and Wen-Hsiang Tsai, "Feature-preserving Clustering of 2-D Data for Two-class Problems Using Analytical Formulas: An Automatic and Fast Approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 1994.


These two topics, consistency and optimization, are so important and so pervasive,
that we point out to the student, in the conclusion of each chapter, exactly where those
concepts turned up in that chapter. So read the chapter conclusions. Who knows, it
might be on a test.

1.2 Concerning prerequisites

(Margin note: To find out if you meet this criterion, answer the following question: What do the following words mean? "transpose," "inverse," "determinant," "eigenvalue." If you do not have any idea, do not take this course!)

(Margin note: You will have to write programs in C (yes, C or C++, not Matlab) to complete this course.)

The target audience for this book is graduate students or advanced undergraduates
in electrical engineering, computer engineering, computer science, math, statistics,
or physics. To do the work in this book, you must have had a graduate-level course
in advanced calculus, and in statistics and/or probability. You need either a formal
course or experience in linear algebra.
Many of the homeworks will be projects of sorts, and will be computer-based.
To complete these assignments, you will need a hardware and software environment
capable of

(1) declaring large arrays (256 × 256) in C
(2) displaying an image
(3) printing an image.

Software and data used in the book can be found at www.cambridge.org/9780521830461.
We are going to insist that you write programs, and that you write them at a
relatively low level. Some of the functionality that you will be coding is available
in software packages like Matlab. However, while you learn something by simply
calling a function, you learn more by writing and debugging the code yourself.
Exceptions to this occur, of course, when the coding is so extensive that the
programming gets in the way of the image analysis. For that reason, we provide the
student with a library of subroutines which allow the student to ignore details like
data type, byteswapping, file access, and platform dependencies, and instead focus
on the logic of making image analysis algorithms work.
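To give a feel for what these programs look like, here is a minimal sketch in plain C (it deliberately avoids any library, including the IFS routines described in Chapter 3, so everything in it is generic): declare a 256 × 256 image, visit every pixel, and compute a simple statistic.

    #include <stdio.h>

    #define ROWS 256
    #define COLS 256

    int main(void)
    {
        static unsigned char img[ROWS][COLS]; /* 8-bit gray-scale image */
        long sum = 0;
        int r, c;

        /* fill the image with a synthetic ramp so the program is self-contained */
        for (r = 0; r < ROWS; r++)
            for (c = 0; c < COLS; c++)
                img[r][c] = (unsigned char)((r + c) / 2);

        /* the basic structure of most assignments: a loop over every pixel */
        for (r = 0; r < ROWS; r++)
            for (c = 0; c < COLS; c++)
                sum += img[r][c];

        printf("mean brightness = %f\n", (double)sum / (ROWS * COLS));
        return 0;
    }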
You should have an instructor, and if you do, we strongly recommend that you
GO to class, even though all the information you really need is in this book. Read
the assigned material in the text, then go to class, then read the text material again.
Remember:

A hacker hermit named Dave


Tapped in to this course in his cave.
He had to admit
He learned not a bit.
But look at the money he saved.

And now, on to the technical stuff.



1.3 Some terminology

Students usually confuse machine vision with image processing. In this section, we
define some terminology that will clarify the differences between the contents and
objectives of these two topics.

1.3.1 Image processing


Many people consider the content of this course as part of the discipline of image
processing. However, a better use of the term is to distinguish between image
processing and machine vision by the intent. "Image processing" strives to make images
look better, and the output of an image processing system is an image. The output
of a "machine vision" system is information about the content of the image. The
functions of an image processing system may include enhancement, coding,
compression, restoration, and reconstruction.

Enhancement

Enhancement systems perform operations which make the image look better, as
perceived by a human observer. Typical operations include contrast stretching
(including functions like histogram equalization), brightness scaling, edge
sharpening, etc.
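For concreteness, here is a minimal sketch (ours, not from the text) of one such operation, a linear contrast stretch, which remaps the observed brightness range of an 8-bit image onto the full range 0–255:

    /* Linear contrast stretch: remap the observed range [lo, hi] to [0, 255].
       img points to an 8-bit gray-scale image containing n pixels. */
    void contrast_stretch(unsigned char *img, long n)
    {
        long i;
        unsigned char lo = 255, hi = 0;

        for (i = 0; i < n; i++) {        /* find the darkest and brightest pixels */
            if (img[i] < lo) lo = img[i];
            if (img[i] > hi) hi = img[i];
        }
        if (hi == lo) return;            /* constant image; nothing to stretch */
        for (i = 0; i < n; i++)
            img[i] = (unsigned char)(255.0 * (img[i] - lo) / (hi - lo) + 0.5);
    }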

Coding

Coding is the process of finding efficient and effective ways to represent the
information in an image. These include quantization methods and redundancy removal.
Coding may also include methods for making the representation robust to bit-errors
which occur when the image is transmitted or stored.

Compression

Compression includes many of the same techniques as coding, but with the specific
objective of reducing the number of bits required to store and/or transmit the image.

Restoration

Restoration concerns itself with fixing what is wrong with the image. It is unlike
enhancement, which is just concerned with making images look better. In order
to “correct” an image, there must be some model of the image degradation. It is
common in restoration applications to assume a deterministic blur operator, followed
by additive random noise.

Reconstruction

Reconstruction usually refers to the process of constructing an image from several
partial images. For example, in computed tomography (CT),² we make a large
number, say 360, of x-ray projections through the subject. From this set of one-dimensional
signals, we can compute the actual x-ray absorption at each point in the
two-dimensional image. Similar methods are used in positron emission tomography
(PET), magnetic resonance imagery (MRI), and in several shape-from-X algorithms
which we will discuss later in this course.

² Sometimes, CT is referred to as "CAT scanning." In that case, CAT stands for "computed axial tomography." There are other types of tomography as well.

1.3.2 Machine vision


Machine vision is the process whereby a machine, usually a digital computer,
automatically processes an image and reports "what is in the image." That is, it recognizes
the content of the image. Often the content may be a machined part, and the objective
is not only to locate the part, but to inspect it as well. We will in this book discuss
several applications of machine vision in detail, such as automatic target recognition
(ATR) and industrial inspection. There are a wide variety of other applications, such
as determining the flow equations from observations of fluid flow [1.1], which time
and space do not allow us to cover.
The terms "computer vision" and "image understanding" are often also used to
denote machine vision.
Machine vision includes two components – measurement of features and pattern
classification based on those features.

Measurement of features

The measurement of features is the principal focus of this book. Except for
Chapters 14 and 15, in this book, we focus on processing the elements of images
(pixels) and from those pixels and collections of pixels, extract sets of measurements
which characterize either the entire image or some component thereof.

Pattern classification

Pattern classification may be defined as the process of making a decision about a
measurement. That is, we are given a measurement or set of measurements made
on an unknown object. From that set of measurements, with knowledge about the
possible classes to which that unknown might belong, we make a decision. For
example, the set of possible classes might be men and women, and one measurement
which we could make to distinguish men from women would be height (clearly,
height is not a very good measurement to use to distinguish men from women, for
if our decision is that anyone over five foot six is male we will surely be wrong in
many instances).

Pattern recognition

Pattern recognition may be defined as the process of assigning unknowns to classes,
just as in the definition of pattern classification. However, the definition is extended
to include the process of making the measurements.

1.4 Organization of a machine vision system

Fig. 1.1 shows schematically, at the most basic level, the organization of a machine
vision system. The unknown is first measured and the values of a number of features
are determined. In an industrial application, such features might include the length,
width, and area of the image of the part being measured. Once the features are
measured, their numerical values are passed to a process which implements a decision
rule. This decision rule is typically implemented by a subroutine which performs
calculations to determine to which class the unknown is most likely to belong based
on the measurements made.
As Fig. 1.1 illustrates, a machine vision system is really a fairly simple
architectural structure. The details of each module may be quite complex, however, and many
different options exist for designing the classifier and the feature measuring system.
In this book, we mention the process of classifier design. However, the process of
determining and measuring features is the principal topic of this book.
The “feature measurement” box can be further broken down into more detailed
operations as illustrated in Fig. 1.2. At that level, the organization chart becomes
more complex because the specific operations to be performed vary with the type
of image and the objective of the tasks. Not every operation is performed in every
application.

[Figure: Raw data → Feature measurement → Feature vector → Pattern classifier → Class identity.]
Fig. 1.1. Organization of a machine vision system.



[Figure: Raw data feeds noise removal and segmentation, followed by shape analysis, features, matching, and consistency analysis.]
Fig. 1.2. Some components of a feature characterization system. Many machine vision
applications do not use every block, and information often flows in other ways. For
example, it is possible to perform matching directly on the image data.

1.5 The nature of images

We will pay much more attention to the nature of images in Chapter 4. We will
observe that there are several different types of images as well as several different
ways to represent images. The types of images include what we call “pictures,” that
is, two-dimensional images. In addition, however, we will discuss three-dimensional
images and range images. We will also consider different representations for images,
including iconic, functional, linear, and relational representations.

1.6 Images: Operations and analysis

(Margin note: Some equivalent words.)

We will learn many different operations to perform on images. The emphasis in this
course is "image analysis," or "computer vision," or "machine vision," or "image
understanding." All these phrases mean the same thing. We are interested in making
measurements on images with the objective of providing our machine (usually, but
not always, a computer) with the ability to recognize what is in the image. This
process includes several steps:

• denoising – all images are noisy, most are blurred, many have other distortions
as well. These distortions need to be removed or reduced before any further
operations can be carried out. We discuss two general approaches for denoising
in Chapters 6 and 7.
• segmentation – we must segment the image into meaningful regions.
Segmentation is covered in Chapter 8.
• feature extraction – making measurements, geometric or otherwise, on those
regions is discussed in Chapter 9.
• consistency – interpreting the entire image from local measurements is covered
in Chapters 10 and 11.
• classification and matching – recognizing the object is covered in Chapter 12
through Chapter 16.
So turn to the next chapter. (Did you notice? No homework assignments in this
chapter? Don’t worry. We’ll fix that in future chapters.)

Reference

[1.1] C. Shu and R. Jain, “Vector Field Analysis for Oriented Patterns,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(9), 1994.
2 Review of mathematical principles

Everything, once understood, is trivial
W. Snyder

2.1 A brief review of probability

Let us imagine a statistical experiment: rolling two dice. It is possible to roll any
number between two and twelve (inclusive), but as we know, some numbers are
more likely than others. To see this, consider the possible ways to roll a five.
We see from Fig. 2.1 that there are four possible ways to roll a five with two dice.
Each event is independent. That is, the chance of rolling a two with the second die
(1 in 6) does not depend at all on what is rolled with die number 1.
Independence of events has an important implication. It means that the joint
probability of the two events is equal to the product of their individual probabilities
and the conditional probabilities:

Pr(a|b)Pr(b) = Pr(a)Pr(b) = Pr(b|a)Pr(a) = Pr(a, b).    (2.1)

In Eq. (2.1), the symbols a and b represent events, e.g., the rolling of a six. Pr(b) is the
probability of such an event occurring, and Pr(a|b) is the conditional probability
of event a occurring, given that event b has occurred.
In Fig. 2.1, we tabulate all the possible ways of rolling two dice, and show the
resulting number of different ways that the numbers from 2 to 12 can occur. We
note that 6 different events can lead to a 7 being rolled. Since each of these events
is equally probable (1 in 36), then a 7 is the most likely roll of two dice. In Fig. 2.2
the information from Fig. 2.1 is presented in graphical form.
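The counts in Fig. 2.1 are easy to verify by brute force; this little sketch enumerates all 36 equally likely outcomes of the two dice and tallies how many produce each sum:

    #include <stdio.h>

    int main(void)
    {
        int ways[13] = {0};  /* ways[s] = number of ways to roll the sum s */
        int d1, d2, s;

        for (d1 = 1; d1 <= 6; d1++)      /* enumerate every outcome of die 1 */
            for (d2 = 1; d2 <= 6; d2++)  /* ... and of die 2 */
                ways[d1 + d2]++;

        for (s = 2; s <= 12; s++)
            printf("sum %2d: %d ways, probability %6.4f\n",
                   s, ways[s], ways[s] / 36.0);
        return 0;
    }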
In pattern classification, we are most often interested in the probability of a
particular measurement occurring. We have a problem, however, when we try to plot a
graph such as Fig. 2.2 for a continuously-valued function. For example, how do we
ask the question: "What is the probability that a man is six feet tall?" Clearly, the
answer is zero, for an infinite number of possibilities could occur (we might equally
well ask, "What is the probability that a man is (exactly) 6.314 159 267 feet tall?").
Still, we know intuitively that the likelihood of a man being six feet tall is higher
than the likelihood of his being ten feet tall. We need some way of quantifying this
intuitive notion of likelihood.

Sum    Ways to roll it                      Number of ways
0      –                                    0
1      –                                    0
2      1–1                                  1
3      2–1, 1–2                             2
4      1–3, 3–1, 2–2                        3
5      2–3, 3–2, 4–1, 1–4                   4
6      1–5, 5–1, 2–4, 4–2, 3–3              5
7      3–4, 4–3, 2–5, 5–2, 1–6, 6–1         6
8      2–6, 6–2, 3–5, 5–3, 4–4              5
9      3–6, 6–3, 4–5, 5–4                   4
10     4–6, 6–4, 5–5                        3
11     6–5, 5–6                             2
12     6–6                                  1

Fig. 2.1. The possible ways to roll two dice.


[Bar graph: number of ways (vertical axis, 1–6) versus sum (horizontal axis, 0–12).]
Fig. 2.2. The information of Fig. 2.1, in graphical form.

One question that does make sense is, “What is the probability that a man is
less than six feet tall?” Such a function is referred to as a probability distribution
function

P(x) = Pr (z < x) (2.2)

for some measurement, z.


Fig. 2.3 illustrates the probability distribution function for the result of rolling
two dice.
When we asked "what is the probability that a man is less than x feet tall?" we
obtained the probability distribution function. Another well-formed question would
be "what is the probability that a man's height is between x and x + Δx?" Such a
question is easily answered in terms of the density function:

Pr(x ≤ h < x + Δx) = Pr(h < x + Δx) − Pr(h < x) = P(x + Δx) − P(x).

Dividing by Δx and taking the limit as Δx → 0, we see that we may define the
probability density function as the derivative of the distribution function:

p(x) = dP(x)/dx.    (2.3)

[Step plot: the cumulative distribution rises from 0 to 1 as x goes from 2 to 12.]
Fig. 2.3. The probability distribution of Fig. 2.2, showing the probability of rolling two dice to get
a number LESS than x. Note that the curve is steeper at the more likely numbers.

p(x) has all the properties that we desire. It is well defined for continuously-valued
measurements and it has a maximum value for those values of the measurement
which are intuitively most likely.
Furthermore:

∫_{−∞}^{∞} p(x) dx = 1,    (2.4)

which we must require, since some value will certainly occur.

2.2 A review of linear algebra

(Margin note: This section will serve more as a reference than a teaching aid, since you should know this material already.)

In this section, we very briefly review vector and matrix operations. Generally, we
denote vectors in boldface, scalars in lowercase Roman, and matrices in uppercase
Roman.
Vectors are always considered to be column vectors. If we need to write one
horizontally for the purpose of saving space in a document, we use transpose notation.
For example, we denote a vector which consists of three scalar elements as:

v = [x_1 x_2 x_3]^T.

The inner product of two vectors is a scalar, v = a^T b. Its value is the sum of products
of the corresponding elements of the two vectors:

a^T b = Σ_i a_i b_i.

You will also sometimes see the notation ⟨x, y⟩ used for inner product. We do not like
this because it looks like an expected value of a random variable. One sometimes
also sees the "dot product" notation x · y for inner product.
The magnitude of a vector is |x| = √(x^T x). If |x| = 1, x is said to be a "unit vector."
If x^T y = 0, then x and y are "orthogonal." If x and y are orthogonal unit vectors,
they are "orthonormal."
The concept of orthogonality can easily be extended to continuous functions by
simply thinking of a function as an infinite-dimensional vector. Just list all the values
of f (x) as x varies between, say, a and b. If x is continuous, then there are an infinite
number of possible values of x between a and b. But that should not stop us – we
cannot enumerate them, but we can still think of a vector containing all the values
of f (x). Now, the concept of summation which we defined for finite-dimensional
vectors turns into integration, and an inner product may be written

⟨f(x), g(x)⟩ = ∫_a^b f(x) g(x) dx.    (2.5)

The concepts of orthogonality and orthonormality hold for this definition of the
inner product as well. If the integral is equal to zero, we say the two functions are
orthogonal. So the transition from orthogonal vectors to orthogonal functions is not
that difficult. With an infinite number of dimensions, it is impossible to visualize
orthogonal as “perpendicular,” of course, so you need to give up on thinking about
things being perpendicular. Just recall the definition and use it.
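These definitions translate directly into code. The sketch below (an illustration, not part of the book's software) computes inner products and uses them to test vectors for orthogonality and unit length:

    #include <math.h>
    #include <stdio.h>

    /* Inner product of two d-dimensional vectors. */
    double inner(const double *a, const double *b, int d)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < d; i++)
            s += a[i] * b[i];
        return s;
    }

    int main(void)
    {
        double a[2] = { 3.0, 4.0 };
        double b[2] = { -4.0, 3.0 };

        printf("|a|   = %f\n", sqrt(inner(a, a, 2)));  /* 5: not a unit vector */
        printf("a^T b = %f\n", inner(a, b, 2));        /* 0: a and b are orthogonal */
        return 0;
    }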
Suppose we have n vectors x_1, x_2, ..., x_n; if we can write v = a_1 x_1 + a_2 x_2 +
··· + a_n x_n, then v is said to be a "linear combination" of x_1, x_2, ..., x_n.
A set of vectors x_1, x_2, ..., x_n is said to be "linearly independent" if it is impossible
to write any of the vectors as a linear combination of the others.
Given d linearly independent vectors, of d dimensions, x_1, x_2, ..., x_d, defined on
ℝ^d, then any vector y in the space may be written y = a_1 x_1 + a_2 x_2 + ··· + a_d x_d.
Since any d-dimensional real-valued vector y may be written as a linear combination
of x_1, ..., x_d, then the set {x_i} is called a "basis" set and the vectors are said
to "span the space" ℝ^d. Any linearly independent set of vectors can be used as a
basis (necessary and sufficient). It is often particularly convenient to choose basis
sets which are orthonormal.
For example, the following two vectors form a basis for ℝ²:

x_1 = [0 1]^T and x_2 = [1 0]^T.



[Figure: vector y and orthonormal basis vectors x_1 and x_2; the projection of y onto x_1 has length a_1.]
Fig. 2.4. x_1 and x_2 are orthonormal bases. The projection of y onto x_1 has length a_1.

This is the familiar Cartesian coordinate system. Here's another basis set for ℝ²:

(Margin note: Is this set orthonormal?)

x_1 = [1 1]^T    x_2 = [−1 1]^T.

If x_1, x_2, ..., x_d span ℝ^d, and y = a_1 x_1 + a_2 x_2 + ··· + a_d x_d, then the "components"
of y may be found by

a_i = y^T x_i    (2.6)

and a_i is said to be the "projection" of y onto x_i. In a simple Cartesian geometric
interpretation, the inner product of Eq. (2.6) is literally a projection, as illustrated in
Fig. 2.4. However, whenever Eq. (2.6) is used, the term "projection" may be used as
well, even in a more general sense (e.g. the coefficients of a Fourier series).
The only vector spaces which concern us here are those in which the vectors are
real-valued.

2.2.1 Linear transformations


(Margin note: What does this say about m and d?)

A "linear transformation," A, is simply a matrix. Suppose A is m × d. If applied to a
vector x ∈ ℝ^d, y = Ax, then y ∈ ℝ^m. So A took a vector from one vector space, ℝ^d,
and produced a vector in ℝ^m. If that vector y could have been produced by applying
A to one and only one vector in ℝ^d, then A is said to be "one-to-one." Now suppose
that there are no vectors in ℝ^m that cannot be produced by applying A to some
vector in ℝ^d. In that case, A is said to be "onto." If A is one-to-one and onto, then
A⁻¹ exists. Two matrices A and B are "conformable" if the matrix multiplication
C = AB makes sense.
(Margin note: We assume you know the meanings of transpose, inverse, determinant, and trace. If you do not, look them up.)

Some important (and often forgotten) properties: If A and B are conformable, then

(AB)^T = B^T A^T    (2.7)

and

(AB)^{−1} = B^{−1} A^{−1}    (2.8)


if A and B are invertible at all.


A couple of other useful properties are

det(AB) = det(BA) and tr(AB) = tr(BA),

which only is true, of course, if A and B are square. If a matrix A satisfies

A A^T = A^T A = I    (2.9)

then obviously, the transpose of the matrix is the inverse as well, and A is said to
be an "orthonormal transformation" (OT), which will correspond geometrically to
a rotation. If A is a d × d orthonormal transformation, then the columns of A are
orthonormal, linearly independent, and form a basis spanning the space ℝ^d. For
ℝ³, three convenient OTs are the rotations about the Cartesian axes:

(Margin note: Some example orthonormal transformations.)

         | 1     0       0    |        | cos θ   0  −sin θ |        | cos θ  −sin θ   0 |
    Rx = | 0   cos θ  −sin θ  |   Ry = |   0     1     0   |   Rz = | sin θ   cos θ   0 |
         | 0   sin θ   cos θ  |        | sin θ   0   cos θ |        |   0       0     1 |

Suppose R is an OT, and y = Rx, then

|y| = |x|. (2.10)
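A quick numerical check of Eq. (2.10) — build Rz for an arbitrary θ, apply it, and confirm that the magnitude is preserved (a sketch for illustration only):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double th = 0.7;  /* an arbitrary rotation angle, in radians */
        double Rz[3][3] = {
            { cos(th), -sin(th), 0.0 },
            { sin(th),  cos(th), 0.0 },
            { 0.0,      0.0,     1.0 }
        };
        double x[3] = { 1.0, 2.0, 3.0 }, y[3] = { 0.0, 0.0, 0.0 };
        int i, j;

        for (i = 0; i < 3; i++)           /* y = Rz x */
            for (j = 0; j < 3; j++)
                y[i] += Rz[i][j] * x[j];

        printf("|x| = %f, |y| = %f\n",    /* the two magnitudes are equal */
               sqrt(x[0]*x[0] + x[1]*x[1] + x[2]*x[2]),
               sqrt(y[0]*y[0] + y[1]*y[1] + y[2]*y[2]));
        return 0;
    }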

A matrix A is "positive definite" if

x^T A x > 0  ∀ x ∈ ℝ^d, x ≠ 0.

x^T A x is called a quadratic form.
The derivative of a quadratic form is particularly useful:

(Margin note: What happens here if A is symmetric?)

d/dx (x^T A x) = (A + A^T) x.

Since we mentioned derivatives, we might as well mention a couple of other vector
calculus things:
Suppose f is a scalar function of x, x ∈ ℝ^d; then

df/dx = [∂f/∂x_1  ∂f/∂x_2  ···  ∂f/∂x_d]^T,    (2.11)

and is called the "gradient." This will be often used when we talk about edges in
images, and f(x) will be the brightness as a function of the two spatial directions.

If f is vector-valued, then the derivative is a matrix,

              | ∂f_1/∂x_1  ∂f_2/∂x_1  ···  ∂f_m/∂x_1 |
    df^T/dx = |    ···        ···     ···     ···    |    (2.12)
              | ∂f_1/∂x_d  ∂f_2/∂x_d  ···  ∂f_m/∂x_d |

and is called the "Jacobian."
One more: If f is scalar-valued, the matrix of second derivatives

    | ∂²f/∂x_1²        ∂²f/(∂x_1 ∂x_2)  ···  ∂²f/(∂x_1 ∂x_d) |
    |     ···               ···         ···        ···       |    (2.13)
    | ∂²f/(∂x_d ∂x_1)  ∂²f/(∂x_d ∂x_2)  ···  ∂²f/∂x_d²       |

is called the "Hessian."

2.2.2 Derivative operators


Here, we introduce a new notation, a vector containing only derivative operators,

∇ = [∂/∂x_1  ∂/∂x_2  ···  ∂/∂x_d]^T.    (2.14)

It is important to note that this is an OPERATOR, not a vector. We will do linear
algebra sorts of things with it, but by itself, it has no value, not even really any
meaning – it must be applied to something to have any meaning. For most of this
book, we will deal with two-dimensional images, and with the two-dimensional form
of this operator,

∇ = [∂/∂x  ∂/∂y]^T.    (2.15)

Apply this operator to a scalar, f, and we get a vector which does have meaning,
the gradient of f:

∇f = [∂f/∂x  ∂f/∂y]^T.    (2.16)
Similarly, we may define the divergence using the inner (dot) product (in all the
following definitions, only the two-dimensional form of the del operator defined in
Eq. (2.15) is used; however, remember that the same concepts apply to operators of
arbitrary dimension):

div f = ∇ · f = [∂/∂x  ∂/∂y] [f_1  f_2]^T = ∂f_1/∂x + ∂f_2/∂y.    (2.17)

We will also have opportunity to use the outer product of the del operator with a
matrix:

    ∇ × f = | ∂/∂x | [f_1  f_2] = | ∂f_1/∂x  ∂f_2/∂x |    (2.18)
            | ∂/∂y |              | ∂f_1/∂y  ∂f_2/∂y |

2.2.3 Eigenvalues and eigenvectors


If matrix A and vector x are conformable, then one may write the "characteristic
equation"

Ax = λx,  λ ∈ ℝ.    (2.19)

Since Ax is a linear operation, A may be considered as mapping x onto itself with
only a change in length. There may be more than one "eigenvalue"¹ λ which satisfies
Eq. (2.19). For x ∈ ℝ^d, A will have exactly d eigenvalues (which are not, however,
necessarily distinct). These may be found by solving det(A − λI) = 0. (But for
d > 2, we do not recommend this method. Use a numerical package instead.)

(Margin note: For any given matrix, there are only a few eigenvalue/eigenvector pairs.)

Given some eigenvalue λ which satisfies Eq. (2.19), the corresponding x is called
the corresponding "eigenvector."

¹ "Eigen-" is the German prefix meaning "principal" or "most important." These are NOT named for Mr Eigen.

2.3 Introduction to function minimization

(Margin note: In this book, essentially EVERY machine vision topic will be discussed in terms of some sort of minimization, so get used to it!)

Minimization of functions is a pervasive element of engineering: One is always trying
to find the set of parameters which minimizes some function of those parameters.
Notationally, we state the problem as: Find the vector x which produces a minimum
of some function H(x):

H = min_x H(x)    (2.20)

where x is some d-dimensional parameter vector, and H is a scalar function of x,
often referred to as an "objective function." We denote the x which results in the
minimal H as x̂:

x̂ = arg min_x H(x).    (2.21)

(Margin note: The authors get VERY annoyed at improper use of the word "optimal." If you didn't solve a formal optimization problem to get your result, you didn't come up with the "optimal" anything.)

The most straightforward way to minimize a function is to set its derivative to zero:

∇H(x) = 0,    (2.22)

where ∇ is the gradient operator – the set of partial derivatives. Eq. (2.22) results in
a set of equations, one for each element of x, which must be solved simultaneously:

∂H(x)/∂x_1 = 0
∂H(x)/∂x_2 = 0
···
∂H(x)/∂x_d = 0.    (2.23)

Such an approach is practical only if the system of Eq. (2.23) is solvable. This may
be true if d = 1, or if H is at most quadratic in x.

EXERCISE
Find the vector x = [x_1, x_2, x_3]^T which minimizes

H = a x_1² + b x_1 + c x_2² + d x_3²

where a, b, c, and d are known constants.

Solution

∂H/∂x_1 = 2a x_1 + b
∂H/∂x_2 = 2c x_2
∂H/∂x_3 = 2d x_3

Setting these to zero, H is minimized by x_3 = x_2 = 0, x_1 = −b/(2a).
If H is some function of order higher than two, or is transcendental, the technique
of setting the derivative equal to zero will not work (at least, not in general) and we
must resort to numerical techniques. The first of these is gradient descent.
In one dimension, the utility of the gradient is easy to see. At a point x^(k)
(Fig. 2.5), the derivative points AWAY FROM the minimum. That is, in one
dimension, its sign will be positive on an "uphill" slope.

[Figure: a curve with the point x^(k) on its rising slope.]
Fig. 2.5. The sign of the derivative is always away from the minimum.

Thus, to find a new point, x^(k+1), we let

x^(k+1) = x^(k) − α (∂H/∂x)|_{x^(k)}    (2.24)

where α is some "small" constant.
In a problem with d variables, we write

x^(k+1) = x^(k) − α ∇H(x)|_{x^(k)}.    (2.25)

2.3.1 Newton–Raphson
It is not immediately obvious in Eq. (2.25) how to choose the variable α. If α is too
small, the iteration of Eq. (2.25) will take too long to converge. If α is too large, the
algorithm may become unstable and never find the minimum.
We can find an estimate for α by considering the well-known Newton–Raphson
method for finding roots: (In one dimension), we expand the function H(x) in a
Taylor series about the point x^(k) and truncate, assuming all higher order terms are
zero,

H(x^(k+1)) = H(x^(k)) + (x^(k+1) − x^(k)) H′(x^(k)).

Since we want x^(k+1) to be a zero of H, we set

H(x^(k)) + (x^(k+1) − x^(k)) H′(x^(k)) = 0,    (2.26)

and find that to estimate a root, we should use

x^(k+1) = x^(k) − H(x^(k)) / H′(x^(k)).    (2.27)

In optimization however, we are not finding roots, but rather, we are minimizing a
function, so how does knowing how to find roots help us? The minima of the function
are the roots of its derivative, and our algorithm becomes

Algorithm: Gradient descent

x^(k+1) = x^(k) − H′(x^(k)) / H″(x^(k)).    (2.28)

In higher dimensions, Eq. (2.28) becomes

x^(k+1) = x^(k) − H⁻¹ ∇H,    (2.29)

where H is the Hessian matrix of second derivatives, which we mentioned earlier in
this chapter:

H = [∂²H(x) / (∂x_i ∂x_j)].    (2.30)
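As a concrete (and entirely illustrative) instance of Eq. (2.28), the sketch below minimizes the one-dimensional objective H(x) = (x − 3)² + 1, whose derivative is H′(x) = 2(x − 3) and whose second derivative is H″(x) = 2:

    #include <math.h>
    #include <stdio.h>

    static double Hp(double x)  { return 2.0 * (x - 3.0); }   /* H'(x)  */
    static double Hpp(double x) { (void)x; return 2.0; }      /* H''(x) */

    int main(void)
    {
        double x = 0.0;  /* an arbitrary starting point */
        int k;

        for (k = 0; k < 100 && fabs(Hp(x)) > 1e-10; k++)
            x = x - Hp(x) / Hpp(x);   /* the iteration of Eq. (2.28) */

        printf("minimum near x = %f after %d iteration(s)\n", x, k);
        return 0;
    }

Because this H is quadratic, the iteration lands on the minimum in a single step; for higher-order objectives it must be repeated until the derivative is acceptably small.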

EXAMPLE
Given a set of x, y data pairs {(x_i, y_i)} and a function of the form

y = a e^{bx},    (2.31)

find the parameters a and b which minimize

H(a, b) = Σ_i (y_i − a e^{b x_i})².    (2.32)

Solution
We can solve this problem with the linear approach by observing that
ln y = ln a + bx and re-defining variables g = ln y and r = ln a.
With these substitutions, Eq. (2.32) becomes

H(r, b) = Σ_i (g_i − r − b x_i)²    (2.33)

∂H/∂b = 2 Σ_i (g_i − r − b x_i)(−x_i)    (2.34)

∂H/∂r = 2 Σ_i (g_i − r − b x_i)(−1).    (2.35)

Setting Eq. (2.34) to zero, we have

Σ_i g_i x_i − r Σ_i x_i − b Σ_i x_i² = 0    (2.36)

or

r Σ_i x_i + b Σ_i x_i² = Σ_i g_i x_i    (2.37)

and from Eq. (2.35)

Σ_i g_i − r Σ_i 1 − b Σ_i x_i = 0    (2.38)

or

N r + b Σ_i x_i = Σ_i g_i    (2.39)

where N is the number of data points. Eqs. (2.37) and (2.39) are two simultaneous
linear equations in two unknowns which are readily solved. (See [2.2, 2.3, 2.4] for
more sophisticated descent techniques such as the conjugate gradient method.)
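Here is the whole procedure as a C sketch (with synthetic data generated from y = 2e^{0.5x}): take logs, accumulate the sums appearing in Eqs. (2.37) and (2.39), solve the 2 × 2 system by Cramer's rule, and recover a = e^r.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double xs[5] = { 0.0, 1.0, 2.0, 3.0, 4.0 };
        double Sx = 0, Sxx = 0, Sg = 0, Sgx = 0, det, r, b;
        int i, N = 5;

        for (i = 0; i < N; i++) {
            double y = 2.0 * exp(0.5 * xs[i]);  /* synthetic data point */
            double g = log(y);                  /* g = ln y */
            Sx  += xs[i];
            Sxx += xs[i] * xs[i];
            Sg  += g;
            Sgx += g * xs[i];
        }

        /* Solve   r*Sx + b*Sxx = Sgx   (Eq. 2.37)
                   r*N  + b*Sx  = Sg    (Eq. 2.39)   by Cramer's rule. */
        det = Sx * Sx - (double)N * Sxx;
        r   = (Sgx * Sx - Sxx * Sg) / det;
        b   = (Sx * Sg - (double)N * Sgx) / det;

        printf("a = %f, b = %f\n", exp(r), b);  /* recovers a = 2, b = 0.5 */
        return 0;
    }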

2.3.2 Local vs. global minima


Gradient descent suffers from a serious problem: Its solution is strongly dependent
on the starting point. If started in a “valley,” it will find the bottom of that valley. We
have no assurance that this particular minimum is the lowest, or “global,” minimum.
Before continuing, we will find it useful to distinguish two kinds of nonlinear
optimization problems.
• Combinatorial optimization. In this case, the variables have discrete values,
typically 0 and 1. With x consisting of d binary-valued variables, 2^d possible
values exist for x. Minimization of H(x) then (in principle) consists of simply
generating each possible value for x and consequently of H(x), and choosing
the minimum. Such "exhaustive search" is in general not practical due to
the exponential explosion of possible values. We will find that simulated
annealing provides an excellent approach to solving combinatorial optimization
problems.
• Image optimization. Images have a particular property: Each pixel is influenced
only by its neighborhood (this will be explained in more detail later); however,
the pixel values are continuously-valued, and there are typically many thousand
such variables. We will find that mean field annealing is most appropriate for the
solution of these problems.

2.3.3 Simulated annealing


We will base much of the following discussion of minimization techniques on an
algorithm known as “simulated annealing” (SA) which proceeds as follows. (See
the book by Aarts and Van Laarhoven for more detail [2.1].)

Algorithm: Simulated annealing

Choose (at random) an initial value of x, and an initial value of T > 0.
While T > T_min, do
(1) Generate a point y which is a neighbor of x. (The exact definition of neighbor
will be discussed soon.)
(2) If H(y) < H(x) then replace x with y.
(3) Else compute P_y = exp(−(H(y) − H(x))/T). If P_y ≥ R, then replace x with y,
where R is a random number uniformly distributed between 0 and 1.
(4) Decrease T slightly and go to step 1.

How simulated annealing works

Simulated annealing is most easily understood in the context of combinatorial
optimization. In this case, the "neighbor" of a vector x is another vector x_2, such that
only one of the elements of x is changed (discretely) to create x_2.² Thus, if x is
binary and of dimension d, one may choose a neighboring y = x ⊕ z, where z is a
binary vector in which exactly one element is nonzero, and that element is chosen
at random, and ⊕ represents exclusive OR.

² Thus the set of neighbors of x consists of all x's of Hamming distance = 1.
In step 2 of the algorithm, we perform a descent. Thus we “always fall down hill.”
In step 3, we provide a mechanism for sometimes making uphill moves. Initially,
we ignore the parameter T and note that if y represents an uphill move, the probability
of accepting y is proportional to e−(H (y)−H (x)) . Thus, uphill moves can occur, but
are exponentially less likely to occur as the size of the uphill move becomes larger.
The likelihood of an uphill move is, however, strongly influenced by T . Consider the
case that T is very large. Then H (y)−H
T
( x)
 1 and Py ≈ 1. Thus, all moves will be
accepted. As T is gradually reduced, uphill moves become gradually less likely until
for low values of T (T  (H (y) − H (x))), such moves are essentially impossible.
One may consider an analogy to physical processes in which the state of each
variable (one or zero) is analogous to the spin of a particle (up or down). At high tem-
peratures, particles randomly change state, and if temperature is gradually reduced,
minimum energy states are achieved. The parameter T in step 4 is thus analogous to
(and often referred to as) temperature, and this minimization technique is therefore
called “simulated annealing.”
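The following is a minimal sketch of the algorithm on a toy combinatorial problem; the objective function (the number of 1 bits, minimized by the all-zero vector) and the geometric cooling schedule are arbitrary choices made purely for illustration.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define D 16

    /* A toy objective: the number of 1 bits in x. */
    static double H(const int x[D])
    {
        int i, n = 0;
        for (i = 0; i < D; i++) n += x[i];
        return (double)n;
    }

    int main(void)
    {
        int x[D], i, k;
        double T = 10.0, Tmin = 1e-3;

        srand(1);
        for (i = 0; i < D; i++) x[i] = rand() % 2;  /* random initial x */

        while (T > Tmin) {
            double Hx = H(x), dH;
            k = rand() % D;      /* step 1: a neighbor differs in one bit */
            x[k] ^= 1;
            dH = H(x) - Hx;
            /* steps 2-3: keep downhill moves; keep an uphill move only
               with probability Py = exp(-dH/T) */
            if (dH > 0.0 && exp(-dH / T) < (double)rand() / RAND_MAX)
                x[k] ^= 1;       /* rejected: undo the move */
            T *= 0.999;          /* step 4: decrease T slightly */
        }
        printf("final H = %f\n", H(x));
        return 0;
    }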

2.4 Markov models

A Markov process is most easily described in terms of time, although in machine


vision we are primarily interested in interactions over spatial distances. The concept
is that the probability of something happening is dependent on a thing that just
recently happened. We will use Markov processes primarily in the noise removal
and segmentation issues in Chapter 6, however, they find applications in a wide
variety of problems, including character recognition [16.1].
We start by introducing the simplest kind of Markov model, the Markov chain.
This type of model is appropriate whenever a sequence of things can be identified,




for example, when a string of symbols is received over a computer network, or a
sequence of words is input to a natural language processor. Let the symbols which are
received be denoted by y(t), where the argument t denotes the (discrete) time instant
at which that symbol is received. Thus, y(1) is received before y(2). Let w denote
the class to which a symbol belongs, w ∈ {w_1, w_2, ..., w_c}, if there are c possible
classes. As an example, the signals could represent 0 or 1, as in a communications
system.
We are interested in the probability that y(t) belongs to some particular class. For
example, what is the probability that the kth symbol will be a one? We are interested
in that probability as a function of history. For example, let w_1 represent the class 1.
(Margin note: DON'T PANIC! It's just notation, and it is explained right after the equation.)

We formalize our previous discussion by asking what is the probability that y(t)
is a 1, given the last N symbols received:

P(y(t) ∈ w_1 | y(t−1) ∈ w_{t−1}, y(t−2) ∈ w_{t−2}, ..., y(t−N) ∈ w_{t−N}).    (2.40)

This is an awkward bit of notation, but here is what it means. When you see the
term "y(t)," think "the symbol received at time t." When you see the "∈," think "is."
When you see "w_k," think "1" or think "0" or think "whatever might be received at
time k." For example, we might ask "what is the probability that you receive at time
k the symbol 1, when the last four symbols received previously were 0110?" Which,
in our notation, is to ask what is

P(y(k) ∈ w_1 | y(k−1) ∈ w_0, y(k−2) ∈ w_1, y(k−3) ∈ w_1, y(k−4) ∈ w_0).

It is possible that in order to compute this probability, we must know all of the
history, or it is possible that we need only know the class of the last few symbols.
One particularly interesting case is when we need only know the class of the last
symbol received. In that case, we could say that the probability of class assignments
for symbol y(t), given all of the history, is precisely the same as the probability
knowing only the last symbol:

P(y(k)|y(k − 1)) = P(y(k)|(y(k − 1), y(k − 2), . . .)) (2.41)

where we have simplified the notation slightly by omitting the set element symbols.
That is, y(k) does not denote the fact that the kth symbol was received, but rather
that the kth symbol belongs to some particular class. If this is the case – that the
probability conditioned on all of history is identical to the probability conditioned
on the last symbol received – we refer to this as a Markov process.3
This relationship implies that

P(y(N) ∈ w_N, ..., y(1) ∈ w_1) = [ ∏_{t=2}^{N} P(y(t) ∈ w_t | y(t−1) ∈ w_{t−1}) ] P(y(1) ∈ w_1).

3 To be perfectly correct, this is a first-order Markov process, but we will not be dealing with any other types in
this chapter.

Suppose there are only two classes possible, say 0 and 1. Then we need to know only
four possible "transition probabilities," which we define using subscripts as follows:

P(y(t) = 0 | y(t−1) = 0) ≡ P00
P(y(t) = 0 | y(t−1) = 1) ≡ P01
P(y(t) = 1 | y(t−1) = 0) ≡ P10
P(y(t) = 1 | y(t−1) = 1) ≡ P11.

In general, there could be more than two classes, so we denote the transition
probabilities by P_{ij}, and can therefore describe a Markov chain by a c × c matrix P
whose elements are P_{ij}.
We will take another look at Markov processes when we think about Markov
random fields in Chapter 6.

Assignment 2.1
Is the matrix P symmetric? Why or why not? Does P have
any interesting properties? Do its rows (or columns)
add up to anything interesting?

2.4.1 Hidden Markov models


Hidden Markov models (HMMs) occur in many applications, including in partic-
ular recent research in speech recognition. In a hidden Markov model, we assume
that there may be more than one transition matrix, and that there is an unmeasur-
able (hidden) process which switches between transition matrices. Furthermore, that
switching process itself may be statistical in nature, and we normally assume it is
a Markov process. This is illustrated in Fig. 2.6, where the position of the switch
determines whether the output y(t) is connected to the output of Markov Process 1
or Markov Process 2. The switch may be thought of as being controlled by a finite


Fig. 2.6. A hidden Markov model may be viewed as a process which switches randomly
between two signals.


Fig. 2.7. The finite state machine of switches.

state machine (FSM), which at each time instant may stay in the same state or may
switch, as shown in Fig. 2.7.
Here is our problem: We observe a sequence of symbols

Y = [y(t = 1), y(t = 2), . . .] = [y(1), y(2), . . .].

What can we infer? The transition probabilities? The state sequence? The structure
of the FSM? The rules governing the FSM? Let’s begin by estimating the state
sequence.

Estimating the state sequence

Let s(t), t = 1, ..., N, denote the state associated with measurement y(t), and denote
the sequence of states S = [s(1), s(2), ..., s(N)], where each s(t) ∈ {s_1, s_2, ..., s_m}.
We seek a sequence of states, S, which maximizes the conditional probability that
the sequence is correct, given the measurements: P(S|Y).
Using Bayes' rule,

P(S|Y) = p(Y|S) P(S) / p(Y).     (2.42)
We assume the states form a Markov chain, so

P(S) = [ ∏_{t=2}^{N} P_{s(t),s(t−1)} ] P_{s(0)}.     (2.43)

Now, let's make a temporarily unbelievable assumption: that the probability density
of the output depends only on the state. Denote that relationship by p(y(t)|s(t)).
Then the posterior conditional probability of the sequence can be written:

p(Y|S) P(S) = [ ∏_{t=1}^{N} p(y(t)|s(t)) ] [ ∏_{t=2}^{N} P_{s(t),s(t−1)} ] P_{s(0)}.     (2.44)

Define P_{s(1),s(0)} ≡ P_{s(0)}, and Eq. (2.44) simplifies to

p(Y|S) P(S) = ∏_{t=1}^{N} p(y(t)|s(t)) P_{s(t),s(t−1)}.     (2.45)

Now look back at Eq. (2.42). The choice of S does not affect the denominator, so
all we need to do is find the sequence S which maximizes

E = ∏_{t=1}^{N} p(y(t)|s(t)) P_{s(t),s(t−1)}.     (2.46)

Surprisingly, there is an algorithm which will solve this maximization problem. It
is called the Viterbi algorithm. It has many, many applications. We explain it in the
next section, since this seems like a good place to motivate it.

2.4.2 The Viterbi algorithm


This is used for a particular kind of optimization problem, one where each state s(t)
has only two neighboring states, s(t + 1) and s(t − 1). Imposing this “neighboring
state” requirement allows us to use an efficient algorithm.
First, it is (almost) always easier to work with a sum than with a product, so let's
define a new objective function by taking logs:

L ≡ ln E = ∑_{t=1}^{N} (Λ(t) + λ_{i,j})     (2.47)

where Λ(t) = ln p(y(t)|s(t)) and λ_{i,j} = ln P_{i,j}.


Pictorially, we illustrate the set of all possible sequences as a graph, as shown
in Fig. 2.8.
A particular sequence of states describes a path through this graph. For example,
for a graph like this with N = 4, m = 3, the path [s_1, s_2, s_3, s_1] is illustrated in
Fig. 2.9.
A path like this implies a set of values for the functions. For each node in the graph,
we associate a value of Λ. Suppose we measured y(1) = 2, y(2) = 1, y(3) = 2.2,


Fig. 2.8. Every possible sequence of states can be thought of as a path through such a graph.


Fig. 2.9. A path through a problem with four states and four time values.

y(4) = 1. The function Λ has value

Λ = ln p(y(1) = 2 | s(1) = s_1) + ln p(y(2) = 1 | s(2) = s_2)
  + ln p(y(3) = 2.2 | s(3) = s_3) + ln p(y(4) = 1 | s(4) = s_1).

There is a value associated with each edge in the graph as well: the function λ,
determined by the associated transition probability. So every possible path through
the graph has a corresponding value of the objective function L. We describe the
algorithm to find the best path as an induction: Suppose at time t we have already
found the best path to each node, with cost denoted LB_i(t), i = 1, ..., m. Then we
can compute the cost of going from each node at time t to each node at time t + 1
(m² calculations) by

L_{ij}(t + 1) = LB_i(t) + Λ(y(t + 1)|s_j(t + 1)) + λ_{i,j}.     (2.48)

The best path to node j at time t + 1 is the maximum of these. When we finally
reach time step N, the node which terminates the best path is the final node.
The computational complexity of this algorithm is thus Nm², which is far less
than m^N, the complexity of a simple exhaustive search of all possible paths.

2.4.3 Markov outputs


In the description above, we assumed that the probability of a particular output
depended only on the state. We do not have to be that restrictive; we could allow the
outputs themselves to be Markov processes.
Assume that when the state changes, the first output depends only on the state, as
before; but afterward, as long as the state remains the same, the outputs obey a Markov
chain. We can formulate this problem in the same way, and solve it with the Viterbi
algorithm.

2.4.4 Estimating model parameters


One final detail still eludes us: Given an observation Y, how do we estimate the
conditional output probabilities p(y(k)|s(k)), and the transition probabilities for
the states, P_{i,j}?

To do this, first, we need to go from a continuous to a discrete representation
for the values of y(t), in order to be able to use probabilities rather than densities.
Assume y(t) ∈ {y_1, y_2, ..., y_r}. Then, we may define an output probability matrix
Π = [π_{k,l}], k = 1, ..., m; l = 1, ..., r, which represents the probability of observing
output y_l if in state s_k.
Define

P_{i,j|Y}(t) = Pr((s(t−1) = i, s(t) = j) | Y).     (2.49)

That is, given the observation sequence, what is the probability that we went from
state i to state j at time t? We can compute that quantity using the methods of
section 2.4.2 if we know the transition probabilities P_{i,j} and the output probabilities
π_{k,l}. Suppose we do know those. Then, we estimate the transition probability by
averaging the probabilities over all the inputs:

P_{i,j} = [ ∑_{t=2}^{N} P_{i,j|Y}(t) ] / [ ∑_{t=2}^{N} P_{j|Y}(t) ]     (2.50)

where, since in order to go into state j the system had to go there from somewhere,

P_{j|Y}(t) = ∑_{i=1}^{m} P_{i,j|Y}(t).     (2.51)

Then we estimate the probability of the observation by again averaging over all the
observations:

π_{k,l} = [ ∑_{t=1, y(t)=y_l}^{N} P_{k|Y}(t) ] / [ ∑_{t=1}^{N} P_{k|Y}(t) ].     (2.52)

At each iteration, we use Eqs. (2.50) and (2.52) to update the parameters. We then
use Eqs. (2.49) and (2.51) to update the conditional probabilities. The process then
repeats until it converges.

2.4.5 Applications of HMMs


Hidden Markov models have found many applications in speech recognition and
document content recognition [17.29].

Assignment 2.2
(Trivia question) In what novel did a character named
Markov Chaney occur?

Assignment 2.3
Find the OT corresponding to a rotation of 30° about
the z axis. Prove the columns of the resulting matrix
are a basis for ℝ³.

Assignment 2.4
Prove Eq. (2.10). Hint: Use Eq. (2.7); Eq. (2.9) might
be useful as well.

Assignment 2.5
A positive definite matrix has positive eigenvalues.
Prove this. (For that matter, is it even true?)

Assignment 2.6
Does the function y = x e^(−x) have a unique value x which
minimizes y? If so, can you find it by taking a
derivative and setting it equal to zero? Suppose
this problem requires gradient descent to solve.
Write the algorithm you would use to find the x which
minimizes y.

Assignment 2.7
We need to solve a minimization problem using gradient
descent. The function we are minimizing is sin x + ln y.
Which of the following is the expression for the
gradient which you need in order to do gradient
descent?

(a) cos x + 1/y    (b) y = −1/cos x    (c) −∞

(d) [cos x   1/y]^T    (e) (∂/∂y)(∂/∂x)(sin x + ln y)

Assignment 2.8
(a) Write the algorithm which uses gradient descent to
find the vector [x, y]^T which minimizes the function
z = x exp(−(x² + y)). (b) Write a computer program which
finds the minimizing x, y pair.

Assignment 2.9
Determine whether the functions sin x and sin 2x might
be orthonormal or orthogonal functions.

3 Writing programs to process images

Computer Science is not about computers any more than astronomy is about telescopes
E. W. Dijkstra

One may take two approaches to writing software for image analysis, depending on
what one is required to optimize. One may write in a style which optimizes/minimizes
programmer time, or one may write to minimize computer time. In this course,
computer time will not be a concern (at least not usually), but your time will be far
more valuable. For that reason, we want to follow a programming philosophy which
produces correct, operational code in a minimal amount of programmer time.
The programming assignments in this book are specified to be written in C or
C++, rather than in MATLAB or JAVA. This is a conscious and deliberate decision.
MATLAB in particular hides many of the details of data structures and data
manipulation from the user. In the course of teaching variations of this course for
many years, the authors have found that many of those details are precisely the details
that students need to grasp in order to effectively understand what image processing
(particularly at the pixel level) is all about.

3.1 Image File System (IFS) software

The objective of quickly writing good software is accomplished by using the image
access subroutines in IFS. IFS is a collection of subroutines and applications based
on those subroutines which support the development of image processing software.
Advantages of IFS include the following.
• IFS supports any data type, including char, unsigned char, short, unsigned short,
  int, unsigned int, float, double, complex float, complex double, complex short,
  and structure.
• IFS supports any image size, and any number of dimensions. One may do signal
  processing by simply considering a signal as a one-dimensional image.
• IFS is available on most current computer systems, including Windows on the PC,
  Linux on the PC, Unix on the SUN, and OS-X on the Macintosh.¹ Files written on


  one platform may be read on any of the other platforms. Conversion to the format
  native to the platform is done by the read routine, without user intervention.
• A large collection of functions is available, including two-dimensional Fourier
  transforms, filters, segmenters, etc.

¹ Regrettably, IFS does not support Macintosh operating systems prior to OS-X.

3.1.1 The IFS header structure


All IFS images include a header which contains various items of information about
the image, such as the number of points in the image, the number of dimensions
for the image, the data format, the units and scan direction of each dimension, and
so on. Also associated with the image is the actual data for the image. The image
header includes a pointer to the image data.
The user manipulates an image by calling some function in the IFS library; one of
the arguments to the function will be the address of the header. From the information
in the header, the IFS library functions automatically determine where the data
is and how to access it. In addition to accessing data in images, the IFS routines
automatically take care of allocating space in memory to store data and headers.
Everything is totally dynamic in operation; there are no fixed-dimension arrays.
This relieves the user of the difficulties involved with accessing data in arrays, when
the arrays are not of some fixed size.
The header structure for an image is defined in the file ifs.h, and is known by the
name IFSHDR. To manipulate an image, the user merely needs to declare a pointer
to an image header structure (as IFSHDR *your_image; or IFSIMG your_image;).
Then, the user simply calls some IFS function to create a new image, and sets the
pointer to the value returned from that function.

3.1.2 Some useful IFS functions


Now, you should look at the IFS manual. Read especially carefully through the
first two example programs in the front of the manual.

You may read images off disk using ifspin. This subroutine will calculate the size
of the image, how much memory is required, and determine what sort of computer
wrote the image. It will do all necessary data conversions (byte swapping, floating
point format conversions, etc.) and read it into your computer in the format native
to the machine you are using. You do not need to know how to do any of those data
conversion operations. Similarly, you can write images to disk using ifspot. This
subroutine will write both the IFS header and the associated actual image data from
memory to the disk.
You may access the image using ifsigp or ifsfgp and ifsipp or ifsfpp. These sub-
routine names stand for IFS Integer Get Pixel, IFS Floating Get Pixel, IFS Integer
Put Pixel, and IFS Floating Put Pixel. The words Integer and Floating refer to the
data type returned or being written. If you say, for example,

v = ifsfgp(img,x,y)

then a floating point number will be returned independent of the data type of the
image. That is, the subroutine will do data conversions for you. Similarly, ifsigp will
return an integer, no matter what the internal data type is. This can, of course, get you
in trouble. Suppose the internal data type is float, and you have an image consisting
of numbers less than one. Then, the process of the conversion from float to int will
truncate all your values to zero.
For some projects you will have three-dimensional data. That means you must
access the images using a set of different subroutines, ifsigp3d, ifsfgp3d, ifsipp3d,
and ifsfpp3d. For example,

y = ifsigp3d(img,frame,row,col)

3.1.3 Common problems


Two common problems usually occur when students first use IFS software.

(1) ifsipp(img,x,y,exp(-t*t)) will give you trouble because ifsipp expects a fourth
argument which is an integer, and exp will return a double. You should use
ifsfpp.
(2) ifsigp(img,x,y,z) is improperly formed. ifsigp expects three arguments, and does
not check the dimension of the input image to determine number of arguments
(it could, however; sounds like a good student project . . . ). To access a three-
dimensional image, either use pointers, or use ifsigp3d(img,x,y,z), where the
second argument is the frame number.

3.2 Basic programming structure for image processing

Images may be thought of as two-dimensional arrays. They are usually processed
pixel-by-pixel in a raster scan. In order to manipulate an image, a two-nested for-loop
is the most commonly used programming structure, as shown in Fig. 3.1.

......
int row, col;
......
for ( row = 0; row < 128; row ++)
{
for( col = 0; col < 128; col++)
{
/* pixel processing */
......
}
......
}

Fig. 3.1. Basic programming structure: two-nested for-loops.



......
int row, col, frame;
......
for (frame = 0; frame < 224; frame++)
{
for ( row = 0; row < 128; row++)
{
for( col = 0; col < 128; col++)
{
/* pixel processing */
......
}
......
}
}

Fig. 3.2. Basic programming structure: three-nested for-loops.

In this example, we use two integers (row and col) as the indices to the row and
column of the image. By increasing row and col in steps of one, we are actually
scanning the image pixel-wise from left to right, top to bottom.
If the image has more than two dimensions (e.g. hyperspectral images), a third
integer is then used as the index to the dimensions, and correspondingly, a three-
nested for-loop structure is needed, as shown in Fig. 3.2.

3.3 Good programming styles

You should follow one important programming construct: all programs should be
written so that they will work for any size image. You are not required to write
your programs so that they will work for any number of dimensions (although that
is possible too), or any data type; only size. One implication of this requirement
is that you cannot declare a static array, and copy all your data into that array
(which novice programmers like to do). Instead, you must use the image access
subroutines.
Another important programming guideline is that except in rare instances, global
variables are forbidden. A global variable is a variable which is declared outside of a
subroutine. Use of such globals is poor programming practice, and causes more bugs
than anything else in all of programming. Good structured programming practice
requires that everything a subroutine needs to know is included in its argument
list.
Following these simple programming guidelines will allow you to write general-
purpose code easily and efficiently, and with few bugs. As you become more skilled
you can take advantage of the pointer manipulation capabilities to increase the run
speed of your programs, if this is needed later.

(a) the K&R style:
for (<cond>) {
    <body>
}

(b) the Allman style:
for (<cond>)
{
    <body>
}

(c) the Whitesmiths style:
for (<cond>)
    {
    <body>
    }

(d) the GNU style:
for (<cond>)
  {
    <body>
  }

Fig. 3.3. The four commonly used indenting styles.

You will lose major points if your code does not follow these guidelines.

Besides the aforementioned programming guidelines, it is very important for
students to also follow the indenting, spacing, and commenting rules in order to
make your code “readable.”
There are four commonly used indenting styles, the K&R style (or kernel style),
Allman style (or BSD style), Whitesmiths style, and the GNU style, as shown in
Fig. 3.3. In this book, we use the Allman indenting style. Adding some space (e.g.
a blank line) between different segments of your code also improves readability.
We emphasize the importance of the comments. However, do not add too many
comments since that will break the flow of your code. In general, you should add a
block comment at the top of each function implementation, including descriptions of
what the function does, who wrote this function, how to call this function, and what
the function returns. You should also add a description for each variable declaration.

3.4 Example programs

Take another look at the IFS manual. It will help you to follow these example
programs. Also pay attention to the programming style we use. The comments
might be too detailed, but they are included for pedagogical purpose.
A typical program is given in Fig. 3.4, which is probably as simple an example
program as one could write. Fig. 3.5 lists another example which implements the
same function as Fig. 3.4 does but is written in a more flexible way such that it can
handle images of different size.
Both these examples make use of the subroutine calls ifsigp, ifsipp, ifsfgp, and
ifsfpp to access the images utilizing either integer-valued or floating point data. The
advantage of these subroutines is convenience: No matter what data type the image
is stored in, ifsigp will return an integer, and ifsfgp will return a float. Internally,
these subroutines determine precisely where the data is stored, access the data, and
convert it. All these operations, however, take computer time. For class projects,
the authors strongly recommend using these subroutines. However, for production
operations, IFS supports methods to access data directly using pointers, trading
additional programmer time for shorter run times.

/* Example1.c
This program thresholds an image. It uses a fixed image size.
Written by Harry Putter, October, 2006
*/
#include <stdio.h>
#include <ifs.h>
main( )
{
IFSIMG img1, img2; /* Declare pointers to headers */
int len[3]; /* len is an array of dimensions, used by ifscreate */
int threshold; /* threshold is an int here */
int row,col; /* counters */
int v;
/* read in image */
img1 = ifspin("infile.ifs"); /* read in file by this name */

/* create a new image to save the result */


len[0] = 2; /* image to be created is two dimensional */
len[1] = 128; /* image has 128 columns */
len[2] = 128; /* image has 128 rows */
img2 = ifscreate("u8bit",len,IFS_CR_ALL,0); /* image is unsigned 8 bit */
threshold = 55; /* set some value to threshold */
/* image processing part - thresholding */
for (row = 0; row < 128; row++)
for (col = 0; col < 128; col++)
{
v = ifsigp(img1,row,col); /* read a pixel as an int */
if (v > threshold)
ifsipp(img2,row,col,255);
else
ifsipp(img2,row,col,0);
}
/* write the processed image to a file */
ifspot(img2, "img2.ifs"); /* write image 2 to disk */
}

Fig. 3.4. Example IFS program to threshold an image using specified values of dimensions and
predetermined data type.

3.5 Makefiles

You really should use makefiles. They are far superior to just typing commands.
If you are doing your software development using Microsoft C++, Lcc, or some
other compiler, then the makefiles are sort of hidden from you, but it is helpful to
know how they operate. Basically, a makefile specifies how to build your project, as
illustrated by the example makefile in Fig. 3.6.
The example in Fig. 3.6 is just about as simple a makefile as one can write. It
states that the executable named myprogram depends on only one thing, the object
module myprogram.o. It then shows how to make myprogram from myprogram.o
and the IFS library.
Similarly, myprogram.o is made by compiling (but not linking) the source file,
myprogram.c, utilizing header files found in an “include” directory on the CDROM,
named hdr. Note: To specify a library, as in the link step, one must specify the library

/* Example2.c
Thresholds an image using information about its data type and the dimensionality.
Written by Sherlock Holmes, May 16, 1885
*/
#include <stdio.h>
#include <ifs.h>
main( )
{
IFSIMG img1, img2; /* Declare pointers to headers */
int *len; /* len is an array of dimensions, used by ifscreate */
int frame, row, col; /* counters */
float threshold, v; /* threshold is a float here */

img1 = ifspin("infile.ifs"); /*read in file by this name*/


len = ifssiz(img1); /* ifssiz returns a pointer to an array of image dimensions*/
img2 = ifscreate(img1->ifsdt,len,IFS_CR_ALL,0);
/* output image is to be the same type as the input */
threshold = 55; /* set some value to threshold */

/* check for one, two or three dimensions */


switch (len[0]) {
case 1: /* 1d signal */
for (col = 0; col < len[1]; col++)
{
v = ifsfgp(img1,0,col); /* read a pixel as a float */
if (v > threshold)
ifsfpp(img2,0,col,255.0); /* write a float */
else /* if img2 not float, will be converted*/
ifsfpp(img2,0,col,0.0);
}
break;
case 2: /* 2d picture */
for (row = 0; row < len[2]; row++)
for (col = 0; col < len[1]; col++)
{
v = ifsfgp(img1,row,col); /* read a pixel as a float */
if (v > threshold)
ifsfpp(img2,row,col,255.0); /* store a float */
else
ifsfpp(img2,row,col,0.0);
}
break;
case 3: /* 3d volume */
for (frame = 0; frame < len[3];frame++)
for (row = 0; row < len[2]; row++)
for (col = 0; col < len[1]; col++)
{
v = ifsfgp3d(img1,frame,row,col); /* read a pixel as a float */
if (v > threshold)
ifsfpp3d(img2,frame,row,col,255.0);
else
ifsfpp3d(img2,frame,row,col,0.0);
}
break;
default:
printf("Sorry I cannot do 4 or more dimensions\n");
} /* end of switch */

ifspot(img2, "img2.ifs"); /* write image 2 to disk */


}

Fig. 3.5. An example IFS program to threshold an image using number of dimensions, size of
dimensions, and data type determined by the input image.

myprogram: myprogram.o
cc -o myprogram myprogram.o /CDROM/Solaris/ifslib/libifs.a
myprogram.o: myprogram.c
cc -c myprogram.c -I/CDROM/Solaris/hdr

Fig. 3.6. An example makefile which compiles a program and links it with the IFS library.

CFLAGS=-Ic:\lcc\include -g2 -ansic
CC=c:\lcc\bin\lcc.exe
LINKER=c:\lcc\bin\lcclnk.exe
DIST=c:\ece763\myprog\lcc\
OBJS=c:\ece763\myprog\objs\
LIBS= ifs.lib -lm
# Build myprog.c
myprog:
$(CC) -c $(CFLAGS) c:\ece763\myprog\mysubroutine1.c
$(CC) -c $(CFLAGS) c:\ece763\myprog\mysubroutine2.c
$(CC) -c $(CFLAGS) c:\ece763\myprog\myprog.c
$(LINKER) -subsystem console -o myprog.exe myprog.obj mysubroutine1.obj
mysubroutine2.obj $(LIBS)

Fig. 3.7. An example WIN32 makefile.

name (e.g. libifs.a), but to specify an include file (e.g. ifs.h), one specifies only the
directory in which that file is located, since the file name was given in the #include
preprocessor directive.
In WIN32 the makefiles look like the example shown in Fig. 3.7. Here, many of
the symbolic definition capabilities of the make program are demonstrated, and the
location of the compiler is specified explicitly.
The programs generated by IFS are (with the exception of ifsview) console-based.
That is, you need to run them inside an MSDOS window on the PC, inside a terminal
window under Linux, Solaris, or on the Mac, using OS-X.

Assignment 3.1 Learning how to use the tools


Windows users:
If you are using a WIN32 computer, open an MSDOS
session and cd to the directory where you installed
all the IFS files. What program would you use to
add Gaussian random noise to an image? (Hint: Look
in INDEX). Apply that program to the image named
angio128.ifs. Display the result using ifsview.
Figure out how to print the result. Hand in the
answers to the questions and the image.

Assignment 3.2 Learn how to use IFS


The intent of this assignment is to get you to use the
computer and to begin to write programs.
Use ifsview to view the following images:
images/echo1
images/echo2
images/echo3

Convert each of these to TIFF, using any2any. (For


usage, type “any2any -h”). For example, you might use

any2any echo1.ifs echo1.tif tiff

Important observation: echo1 is three-dimensional. TIFF


is a two-dimensional image format. When you use any2any
as above, you only get the zeroth frame!
You will see that the three echo images are similar.
Written assignment, to be handed in: Describe in one
paragraph what the images look like. We recognize
that you do not know the anatomy. Just describe what
you see; include an image (figure out how to print a
file -- you know this already, we hope.) If it happens
that you do know the anatomy, your instructor will be
impressed.
4 Images: Formation and representation

Computers are useless. They can only give us answers


Pablo Picasso

In this chapter, we describe how images are formed and how they are represented.
Representations include both mathematical representations for the information
contained in an image and the ways in which images are stored and manipulated in a
digital machine. We also introduce a way of thinking about images – as surfaces with
varying height – which we will find to be a powerful way to describe both the
properties of images and operations on those images.

4.1 Image representations

In this section, we discuss several ways to represent the information in an image.


These representations include: iconic, functional, linear, probabilistic, spatial fre-
quency, and relational representations.

4.1.1 Iconic representations (an image)


An iconic representation of the information in an image is an image. “Yeah, right;
and a rose is a rose is a rose.” When you see what we mean by functional, linear,
and relational representations, you will realize we need a word1 for a representa-
tion which is itself a picture. Some examples of iconic representations include the
following.
• 2D brightness images, also called luminance images. The things you are used to
  calling "images." These might be color or gray-scale. (Be careful with the words
  "black and white," as that might be interpreted as "binary.") We usually denote
  the brightness at a point x, y as f(x, y). Note: x and y could be integers (in
  this case, we are referring to discrete points in a sampled image; these points are
  called "pixels," short for "picture elements"), or real numbers (in this case, we
  are thinking of the image as a function).

1 “Iconic” comes from the Greek word meaning picture.


• 3D (density) images. These sorts of data structures usually occur in medical
  images, like CT (computed tomography), MRI (magnetic resonance imaging),
  ultrasound, etc. In the typical 3D image, each pixel represents the density at
  a point (technically, we cannot measure the density AT a point, we can only
  measure the density in the vicinity of a point – but the meaning should be clear).
  We usually denote the density at a point x, y, z as f(x, y, z).
• 2½D (range) images. In a range image, the value at each point on a surface
  represents a distance, usually the distance from the camera or the distance along
  a normal to a plane containing the camera. Thus, in a range image, we denote the
  position of the surface point as z(x, y). Suppose, for example, that the sensor is a
  laser, measuring time-of-flight. Then the "brightness" is actually proportional to
  the time required for a pulse of light to travel from the laser to a surface, bounce
  off, and return to a detector located at the source. In truth, such a sensor is
  measuring range as a function of two deflection angles, r(θ, φ), and this produces
  a range image. However, to transform from these coordinates to z(x, y) is
  straightforward, and normally, any data you get (in this course anyway) will
  already have this correction performed to produce an altitude image.

You have a homework involving files named (e.g.) site1.c.ifs. The c indicates that
correction has been done.

4.1.2 Functional representations (an equation)


We can fit a function to any set of data points. When we measure an image, it always
consists of a discrete and finite set of measurements, f(i, j), where i and j are the
discrete pixel coordinates (integers) in the x and y directions. By performing some
operation such as minimum-least-squares, we can find a continuous function which
best fits this set of data. We could thus represent an image, at least over a small area,
by an equation such as a biquadratic:

z = ax² + by² + cxy + dx + ey + f     (4.1)

or a quadric:

ax² + by² + cz² + dxy + exz + fyz + gx + hy + iz + j = 0.     (4.2)
Implicit and explicit representations

The form given in Eq. (4.1), in which one variable is defined in terms of the others,
is often referred to as an explicit representation, whereas the form of Eq. (4.2) is an
implicit representation [4.23], which may be equivalently represented in terms of
the zero set, {(x, y, z) : f(x, y, z) = 0}. Implicit polynomials have some convenient
properties. For example, consider a point (x0, y0) which is not in the zero set of
f(x, y), that is, the set of points x, y which satisfy

f(x, y) ≡ x² + y² − R² = 0.     (4.3)

If we substitute x0 and y0 into the equation for f (x, y), we know we get a nonzero
result (since we said this point is not in the zero set); if that value is negative, we

know that (x0 , y0 ) is inside the curve, otherwise, outside [4.3]. This inside/outside
property holds for all closed curves (and surfaces) representable by polynomials.

4.1.3 Linear representations (a vector)


Throughout this book, we use the transpose whenever we write a vector on a row,
because we think of vectors as column vectors.

We will discuss this in more detail later, but the basic idea is to unwind the image
into a vector, and then talk about doing vector–matrix operations on such a
representation. For example, the 2 × 2 image with brightness values

    5 10
    6  4

could be written as a vector f = [5 10 6 4]^T.

4.1.4 Probabilistic representations


In Chapter 6 we will represent an image as the output of a random process which
generates images. We can in that way make use of a powerful set of mathematical
tools for estimating the best version of a particular image, given a measurement of
a corrupted, noisy image.

4.1.5 The spatial frequency representation


Let your eye scan across the two images in Fig. 4.1, say from left to right, and decide in
which of them the brightness changes more rapidly. You define what “more rapidly”
means. We think you will agree that the right-hand image has more rapid changes in
brightness. We can quantify that concept of brightness change by characterizing it in
cycles per inch of paper, where one cycle is a transition from maximum brightness
to minimum and back. Of course, if you scan the image at a fixed rate, the brightness
variation seen by your eye, measured in cycles per second (hertz) is proportional to
the spatial frequency content of the image you are scanning.

Fig. 4.1. (a) An image with lower horizontal frequency content. (b) An image with higher
horizontal frequency content.

Fig. 4.2. (L) An image. (R) A low-frequency iconic representation of that image. The right-hand
image is blurred in the horizontal direction only, a blur which results when the camera
is panned while taking the picture. Notice that horizontal edges are sharp.

The spatial frequency content of an image can be modified by filters which block
specific frequency ranges. For example, Fig. 4.2 illustrates an original image and an
iconic representation of that image which has been passed through a low-pass filter.
That is, a filter which permits low frequencies to pass from input to output, but which
blocks higher frequencies. As you can see, the frequency response is one way of
characterizing sharpness. Images with lots of high-frequency content are perceived
as sharp.
Although we will make little use of frequency domain representations for
images in this course, you should be aware of a few aspects of frequency domain
representations.
First, as you should have already observed, spatial frequencies differ with direc-
tion. Fig. 4.1 illustrates much more rapid variation, higher spatial frequencies in the
vertical direction than in the horizontal. Furthermore, in general, an image contains
many spatial frequencies. We can extract the spatial frequency content of an image
using the two-dimensional Fourier transform, given by

F(u, v) = (1/K) ∑_x ∑_y f(x, y) exp(−i2π(ux + vy))     (4.4)

where K is a suitable normalization constant. The Fourier transform is properly
defined using integrals, and the form in Eq. (4.4) is defined only for discrete two-
dimensional grids, such as a two-dimensional sampled image. It is important to note
that the Fourier transform converts a function of x and y into another function, whose
arguments are u and v. We think of u and v as the spatial frequency content in the
x and y directions respectively. If F(u 0 , v 0 ) is a large number, for some particular
values u and v, then there is a lot of energy at that particular spatial frequency. You
may interpret the term energy as “lots of pixels,” or “very large variations at that
frequency.”

The second observation made here is that spatial frequencies vary over an image.
That is, if one were to take subimages of an image, one would find significant
variation in the Fourier transforms of those subimages.
Third, take a look at the computational complexity implied in Eq. (4.4). The
Fourier transform of an image is a function of spatial frequencies, u and v, and may
thus be considered as an image (its values are complex, but that should not worry
you). If our image is N × N, we must sum over x and y to get a SINGLE u, v value –
a complexity of N². If the frequency domain space is also sampled at N × N, we
have a total complexity of N⁴ to compute the Fourier transform. BUT, there exists
an algorithm called the fast Fourier transform which very cleverly computes a single
u, v value in N log₂ N rather than N², resulting in a significant saving. Thus it is
sometimes faster to compute things in the frequency domain.
Finally, there is an equivalence between convolution, which we will discuss in
the next chapter, and multiplication in the frequency domain. More on that in
section 5.8.

4.1.6 Relational representations (a graph)


Finally, graphs are a very general way to think about the information in an image.
We will talk a lot more about graphs in Chapter 12.

4.2 The digital image

Now suppose the image is sampled; that is, x and y take on only discrete, integer
values, and also suppose f is quantized ( f takes on only a set of integer values).
Such an image could be stored in a computer memory and called a “digital image.”

4.2.1 The formation of a digital image


The imaging literature is filled with a variety of imaging devices, including dissec-
tors, flying spot scanners, vidicons, orthicons, plumbicons, CCDs (charge-coupled
devices), and others [4.6, 4.8]. Currently, CCDs dominate the market. Imaging de-
vices differ both in the ways in which they form images and in the properties of the
images so formed. However, all these devices convert light energy to voltage in sim-
ilar ways. Since our intent in this chapter is to introduce the fundamental concepts of
image analysis, we will choose one device, the CCD, and discuss the way in which
it acquires a digital image. More precise details are available in the literature, for
example, for CCDs [4.19].

Image formation with a silicon device

A lens is used to form an image on the surface of the CCD. When a photon of the
appropriate wavelength strikes the special material of the device, a quantum of charge
is created (an electron–hole pair). Since the conductivity of the material is quite low,
these charges tend to remain in the same general area where they were created. Thus,
to a good approximation, the charge, q, in a local area of the CCD follows

q = ∫_0^{t_f} i dt

where i is the incident light intensity, measured in photons per second. If the incident
light is a constant over the integration time, then q = i t_f, where t_f is called the frame
time.
In vidicon-like devices, the accumulated (positive) charge is cancelled by a scan-
ning electron beam. The cancellation process produces a current which is amplified
and becomes the video signal. In a CCD, charge is shifted from one cell to the next
synchronously with a digital clock. The mechanism for reading out the charge, be it
electron beam, or charge coupling, is always designed so that as much of the charge
is set to zero as possible. We start the integration process with zero accumulated
charge, build up the charge at a rate proportional to local light intensity, and then
read it out. Thus, the signal measured at a point will be proportional to both the light
intensity at that point and to the amount of time between read operations.
Since we are interested only in the intensities and not in the integration time, we
remove the effect of integration time by making it the same everywhere in the picture.
This process, called scanning, requires that each point on the device be interrogated
and its charge accumulation zeroed, repetitively and cyclically. Probably the most
straightforward, and certainly the most common way in which to accomplish this is
in a top-to-bottom, left-to-right scanning process called raster scanning (Fig. 4.3).

Fig. 4.3. Raster scanning: Active video is indicated by a solid line, blanking (retrace) by a dashed
line. In an electron beam device, the beam is turned off as it is repositioned. Blanking
has no physical meaning in a CCD, but is imposed for compatibility. This simplified
figure represents noninterlaced scanning.


Fig. 4.4. Composite and noncomposite outputs of a television camera, voltage as a function
of time.

To be consistent with standards put in place when scanning was done by electron
beam devices (and there needed to be a time when the beam was shut off), the
television signal has a pause at the end of each scan line called blanking. While
charge is being shifted out from the bottom of the detector, charge is once again built
up at the top. Since charge continues to accumulate over the entire surface of the
detector at all times, it is necessary for the read/shift process to return immediately
to the top of the detector and begin shifting again. This scanning process is repeated
many times each second. In American television, the entire faceplate is scanned once
every 33.33 ms (in Europe, the frame time is 40 ms).
To compute exactly how fast the electron beam is moving, we compute

(1 s / 30 frames) ÷ (525 lines/frame) = 63.5 μs/line.     (4.5)

Using the European standard of 625 lines and 25 frames per second, we arrive at
almost exactly the same answer, 64 μs per line. This 63.5 μs includes not only
the active video signal but also the blanking period, approximately 18 percent of
the line time. Subtracting this dead time, we arrive at the active video time, 52 μs
per line.
Fig. 4.4 shows the output of a television camera as it scans three successive lines.
One immediately observes that the raster scanning process effectively converts a
picture from a two-dimensional signal to a one-dimensional signal, where voltage is
a function of time. Fig. 4.4 shows both composite and noncomposite video signals,

Fig. 4.5. Video signal, showing sampling times.

Fig. 4.6. Sampled video signal.

that is, whether the signal does or does not include the sync and blanking timing
pulses.
The sync signal, while critical to operation of conventional television, is not
particularly relevant to our understanding of digital image processing at this time.
The blanking signal, however, is the single most important timing signal in a raster
scan system. Blanking refers to the time that there is no video. There are two distinct
blanking events: horizontal blanking, which occurs at the end of each line, and
vertical blanking, which occurs at the bottom of the picture. In a digital system, both
blanking events may be represented by pulses on separate digital wires. Composite
video is constructed by shifting these special timing pulses negative and adding them
to the video signal.
Now that we recognize that horizontal blanking signifies the beginning of a new
line of video data, we can concentrate on that line and learn how a computer might
acquire the brightness information encoded in that voltage.

The sampling process

The charge on the detector is converted to a voltage by passing through a resistor
and then amplified. That signal is converted to a digital representation using an
analog-to-digital converter. An analog-to-digital converter performs two functions
simultaneously – sampling and quantization.
The sampling process approximates the video signal by measuring it at specific
times. At each discrete time, the video signal is measured and that value is remem-
bered until the next sampling time. Fig. 4.5 shows an analog voltage represented as
a function of time, and Fig. 4.6 shows the same voltage after sampling.

Fig. 4.7. Sampled quantized video signal.

Resolution
The number of samples on a single line defines the horizontal resolution of a video
system. Similarly, the number of lines in a single image defines the vertical reso-
lution. It is interesting to note that European television, with 625 lines per picture,
has a greater vertical resolution than American. This is why viewers observe that
European TV has a “better picture” than American.
The term resolution may also refer to the physical size of the smallest thing the
imaging system can clearly image. For example, the resolution of mammographic
x-ray film is around 50 microns, meaning that a dot on the film as small as that may
be discerned.
For computer monitors, there are many resolution standards, and we will not list
them all here. However, the approach to calculating clock rates is the same.

Dynamic range
The sampled analog signal is converted to digital form by the quantization process,
as shown in Fig. 4.7. The digital representation of any signal can have only a finite
number of possible values, which are defined by the number of bits in the output
word. Video signals are often quantized to 8 bits of accuracy, thus allowing a signal
to be represented as one of 256 possible values.
One definition of the dynamic range of the imaging system is the number of
bits of the digital representation. An alternative definition specifies the dynamic
range as the range of input signal over which a camera successfully operates. Both
meanings are accepted and are in common use, but they differ according to the
context.
Since a digital image is raster scanned and sampled, there is a one-to-one relation-
ship between time and space. That is, if we refer to the sampling time, we must speak
of it relative to the top-of-picture signal (vertical blanking). That timing relationship
identifies a unique position on the screen.

The sampling theorem


An interesting question arises if we wish to sample and store an analog signal and
we wish to reconstruct that signal exactly from the sampled version. It can be shown

Fig. 4.8. Image of a face represented using 16 shades of gray (4 bits) on left, and with eight
shades (3 bits) on right.

that exact reconstruction requires a sampling rate of at least twice the highest fre-
quency in the signal.
In machine vision, we are usually not concerned with exact reconstruction of the
most subtle details of the image, but wish to extract just the information we need to
accomplish the task at hand.
Quantization error is the term used to refer to the fact that information is lost
whenever the continuously valued analog signal is partitioned into discrete ranges.
Quantization error is often observed as contouring, as illustrated in Fig. 4.8.

4.2.2 Formation of range images


There are two dominant ways to acquire a range image – using stereopsis and
structured lighting.

Stereopsis

Most animals have two eyes, and we know from experience that with two views, we
can extract three-dimensional information. It is not hard to work that out geometri-
cally (Fig. 4.9).
Fig. 4.9. From the images extracted from two cameras, it is possible to compute the
location in 3-space of any point, provided we can solve the correspondence problem.

If we know the distance between the cameras, the angle of observation of each
camera (in most stereopsis systems, the cameras are set up so their midlines are
parallel), and we can measure where particular points in the scene appear in both
images, then we can calculate distance to the object, which will be referred to as
"range." If we can do this in the general case, that is, if we can always identify which
point in the left image corresponds to which point in the right image, we have solved
the correspondence problem.

The correspondence problem is one of the fundamental problems of machine
vision! Some people would say it is THE problem in machine vision.

An often-used simplifying assumption is that the two cameras are set up so that
they are exactly parallel, and if a point occurs on a particular line in the left image
then it will appear on the same "epipolar line" in the right image. In other words,
the epipolar line connects a point and its correspondence. This assumption makes it
possible to reduce the complexity of the correspondence problem dramatically.
The literature is filled with papers on approaches to the correspondence problem.
Most of them focus on point matching. That is, find a point on the epipolar line in the
second image which in some way resembles a point in the first image. For example,
Bokil and Khotanzad [4.5] extended the work of Marr and Poggio [4.27], which
makes use of the epipolar assumption. They accomplish point matching by establishing
a gray level compatibility matrix (GLCM). The pixel values of the left and right
images are labels at the bottom and left of the matrix. The i, j element of the matrix
is determined by computing the absolute value of the difference between brightness
values in the ith row of the left-hand image and the jth column in the right-hand
image. The GLCM values are then normalized. Row-to-row correlations are then
established and a best match is selected.
The correspondence problem may be made easier by hierarchical matching [4.26]
(two low-level features like epipolar edges correspond only if they belong to regions
which correspond).
There are methods for finding curves in 3-space which do not explicitly require a
solution to the correspondence problem. For example, Cohen and Wang [4.10, 4.11]
solve for the best matching curve rather than for the individual points.
Camera calibration [4.28, 4.37] is important for stereo [4.31], and at the same
time, stereopsis can be used for camera calibration, since it establishes a relationship
between the two-dimensional image and the three-dimensional world. A great deal
of effort has been expended to determine the minimum set of correspondences [4.1,
4.32] or other relationships [4.33, 4.34] required to calibrate the cameras.
A set of correspondences implies a transformation determining the pose of an
object (the position and orientation of the object in 3-space is known as the “pose”).
In a given scene, there may be multiple sets of correspondences [4.20, 4.38].
There are many special case applications [4.13], including how to obtain stereo
information from panoramic cameras [4.30].
We will revisit stereopsis in Chapter 11 after we have learned how the concepts of
parametric transformations will help in the solution of the correspondence problem.

Structured illumination

One variation eliminates the correspondence problem – replace one camera with
a light source (e.g. a laser beam passing through a cylindrical mirror). However,
this is really no longer stereopsis. Instead, it is a method referred to as structured
illumination. To see how this works, look back at Fig. 4.9, and think of one of


Fig. 4.10. Structured illumination.

those cameras as being replaced by a projector which shines a very narrow, very
bright slit on the scene, as illustrated in Fig. 4.10. Now, one angle, θ, is known
from the projector; the other angle, φ, is measured by finding the bright spot in the
camera image, counting over pixels, and knowing the relationship between pixels
and angle. Finally, knowledge of the distance, d, between the camera and the
projector makes the triangle solvable.
One observation seems relevant to this point in describing images. An interesting
problem occurs when one uses structured illumination to look at specular reflectors
such as metal surfaces. With specular reflectors, either not enough or too much light
may be reflected (polarization filters help [4.29]).
We will see more about using structured illumination when we get to the “shape-
from-X” sections of this book.

4.3 Describing image formation

Define brightness as a function of two spatial variables: f(x, y) is the brightness at
point x, y; x, y ∈ ℝ, and f is assumed real-valued, f ∈ ℝ.

    f(x, y)  →  [ Measurement system (D) ]  →  g(x, y)

A measurement system corrupts the input image to produce the measured image:

g(x, y) = D( f (x, y)), (4.6)

where D is some distortion function which typically includes some random noise
process.

A distortion of particular interest to us is one where the noise is additive and
independent of the signal, and g can be written as

g(x, y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(α, β) h(x − α, y − β) dα dβ + n(x, y).     (4.7)

Suppose f is a delta function; then g will be h. That is, the image of a point is h:
the "point spread function" (PSF).

This is the convolution integral. We can show that for any distortion operator D, if
convolution in the form of Eq. (4.7) which produces distortion identical to D. Here is
how it happens: The following derivation of the convolution integral is done in one
dimension, only for convenience. The extension to two dimensions is trivial. First,
observe that any function f, evaluated at a point x, can be written as
f(x) = ∫_{−∞}^{∞} f(x′) δ(x − x′) dx′     (4.8)

where δ(r) represents the delta function, which is equal to zero when its argument
is nonzero, equal to infinity when its argument is zero, and has an integral of one.
Equation (4.8) is not profound – it simply defines the way the delta samples a
function. However, now let us suppose our function f is corrupted by some operator
which changes f at every point x. Then,

First assumption: D is linear.

D(f(x)) = D( ∫_{−∞}^{∞} f(x′) δ(x − x′) dx′ ).     (4.9)

Now if D is a linear operator, then we can interchange the operator D and the integral
to obtain

D(f(x)) = ∫_{−∞}^{∞} D( f(x′) δ(x − x′) ) dx′.     (4.10)

D is an operator which operates on things that are functions of x, not x′. As far as D
is concerned, anything which depends only on x′ is a constant. So we can factor the
first term outside of the operator to obtain

D(f(x)) = ∫_{−∞}^{∞} f(x′) D(δ(x − x′)) dx′.     (4.11)

Now observe that D may depend on x, or it may depend on the difference between x
and x′, but in any case, it is the distortion operator applied to just the delta function.
So any LINEAR distortion of f can be written as an integral of the product of f with a
function which is the distortion applied to the delta. Since the delta function is really
just a very bright spot, with infinite height and zero width, in one dimension we call
it an impulse, and call D(δ(x − x′)) the impulse response. The two-dimensional

delta function is a point of light, so we call the result of applying the distortion to it
the point spread function. The impulse response and the point spread function are
precisely the same thing, the only difference is in usage.
Since the impulse response might depend on both x and x′, we introduce a new
notation which we call h:

h(x, x′) = D(δ(x − x′)).     (4.12)

Look out! Another assumption: space-invariant.

If we make another assumption, we can get a simpler expression: Let's assume that
D depends not on x, but only on the difference between x and x′. In that case, we
can write h(x, x′) = h(x − x′), and Eq. (4.11) simplifies to

g(x) = D(f(x)) = ∫_{−∞}^{∞} f(x′) h(x − x′) dx′     (4.13)

where we have introduced g, the output of the system. This, you will come to
recognize as the convolution integral. This integral is very important for a variety
of reasons, including the fact that it can be computed rapidly using the fast Fourier
transform (FFT). Even more significant is the observation that ANY distortion of
an image (as long as it is linear and space-invariant) can be computed by an integral
like this.

4.4 The image as a surface

In this section, we consider the problem of interpreting the image as a surface in


a 3-space. Thinking of an image in this way will allow us to conceive of image
properties as heights.

4.4.1 Isophotes
Consider the value of f (x, y) as a surface in space, described by z = f (x, y). Then
the ordered triple [x, y, f (x, y)]T describes this surface. For every point, (x, y), there
is a corresponding value in the third dimension. It is important to observe that there
is just ONE such z value for any x, y pair ( f (x, y) is a function). Therefore, z is a
surface.
Consider the set of all points satisfying f(x, y) = C for some constant C. If f
represents brightness, then this is a set of points which all have the same brightness.
We therefore refer to this set as an "isophote."

Theorem
At any image point, (x, y), the isophote passing through that point is perpendicular
to the gradient.2 (Proof of this theorem will be a test question.)

2 The gradient vector is defined in Eq. 2.11 and elaborated upon in Eq. 5.22.
Fig. 4.11. Contour lines on an elevation map are equivalent to isophotes. The gradient vector at
a point is perpendicular to the isophote at that point.

4.4.2 Ridges
Now let's think about z(x, y), a surface in space, as a mountain (see Fig. 4.11). If
we draw a geological contour map of this mountain, the lines on the map are lines
of equal elevation. However, if we think of “elevation” denoting brightness, then
the contour lines are isophotes. Stand at a point on this “mountain,” and look in the
direction of the gradient. The direction you are looking is the way you would go to
undertake the steepest ascent.
Look to your right or left and you are looking along the isophote. Note that the
direction of the gradient is the steepest direction at that particular point. It does not
necessarily point at the peak.
Let’s climb this mountain by taking small steps in the direction of the local
gradient. What happens at the ridge line? How would you know you were on a
ridge? How can you describe this process mathematically?
Think about taking steps in the direction of the gradient. Your steps are generally
in the same direction, until you reach the ridge, then, the direction radically shifts.
So, one useful definition of a ridge is the locus of points which are local maxima of
the rate of change of gradient direction. (A local maximum is any point which does
not have a larger neighbor.) That is, we need to find the points where \partial\theta/\partial v is
maximized. Here, v represents a derivative taken in the direction of the gradient.
In Cartesian coordinates,

\frac{\partial \theta}{\partial v} = \frac{2 f_x f_y f_{xy} - f_y^2 f_{xx} - f_x^2 f_{yy}}{\left( f_x^2 + f_y^2 \right)^{3/2}}.   (4.14)

Maintz et al. [4.24] point out that it is essentially equivalent to a slightly simpler
formulation based simply on the second derivative of brightness in the v direction,
which leads to maximizing

\frac{f_y^2 f_{xx} - 2 f_x f_y f_{xy} + f_x^2 f_{yy}}{f_x^2 + f_y^2},   (4.15)
where the subscript denotes “partial derivative with respect to.” In three-dimensional
data, the concepts of ridges are the same, just harder to visualize. In that case, the
gradient is a 3-vector, pointing in the direction of increasing density. Isophotes are
surfaces instead of curves. In that same paper [4.24], Maintz et al. also consider the
concept of ridges in three-dimensional data; check it out if you have to implement
such things.

4.4.3 Binary images and the medial axis


The medial axis is used to characterize a region inside an image. It is especially
effective when dealing with a binary image, where the gradient vector is not useful,
since it is zero almost everywhere. Inside the region, find a place where you can
draw a circle that touches the boundary at exactly two points. The set of centers of
all such circles is the medial axis.

(Compared to the number of pixels in the image, the number where the gradient is
strong, the boundaries of regions, is a very small percentage. In fact, as the resolution
goes up, the percentage gets smaller and smaller. This is called a "set of measure
zero.")

The medial axis is easy to define, but tough to compute. We will deal with that in
Chapter 9 after morphology and distance transforms have been introduced.
It is possible to relate ridges and the medial axis by using a scale-space represen-
tation of the image. This will be covered in section 9.7.1.

4.5 Neighborhood relations

We may define neighborhoods in a variety of ways, but the most common and most
intuitive way is to say that two pixels are neighbors if they share a side (4-connected)
or they share either a side or a vertex (8-connected). The neighborhood of a pixel
is the set of pixels which are neighbors (surprise!). The 4-neighbors of the center
point are illustrated in Fig. 4.12.

Fig. 4.12. The 4-neighbors of the center pixel are shaded.

Denote the neighborhood of a point s by \aleph_s. Later we will discuss operations on
sets of points, neighborhoods of points, and on sets of neighborhoods. For example,
let A and B be sets of points in the image, and let s be a point in the set A. We may
define the aura [4.15] of a set A with respect to a set B, for a neighborhood structure
\aleph_s, by

O_B(A) = \bigcup_{s \in A} (\aleph_s \cap B).

That is, the aura of a set of pixels A relative to a set B is the collection of all points
in B that are neighbors of pixels in A, where the concept of "neighbor" is given
by a problem-specific definition. (The neighbors of a pixel are usually adjacent to
that pixel, but there is no fundamental requirement that they be. We will see this
again when we talk about "cliques.") Fig. 4.13 illustrates an image containing (a) a
set A (defined to be the set of shaded pixels), a set B (defined to be the set of blank
pixels), a neighborhood relation given by (b), and the aura of set B with respect to
set A in (c) [4.15]. We will see more about relationships like this when we discuss
morphology.
Fig. 4.13. (a) Set A (shaded pixels) and set B (blank pixels); (b) a neighborhood relation.
The shaded pixels are neighbors by definition of the center pixel. (c) The aura of
the set of white pixels in (a) relative to the set of shaded pixels is given by the dark
pixels. Note that, although this example uses the standard 4-connected definition
of neighbor, there is no requirement that neighbors even be spatially adjacent.
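As a sketch of how one might compute an aura (ours, not from the text; the function
name and the use of NumPy are our own choices), the following marks every pixel
of B that is a neighbor of some pixel of A under the 4-connected neighborhood:

    import numpy as np

    def aura(A_mask, B_mask, offsets=((-1, 0), (1, 0), (0, -1), (0, 1))):
        """Boolean mask of the aura of A with respect to B."""
        out = np.zeros_like(A_mask, dtype=bool)
        rows, cols = np.nonzero(A_mask)
        for dr, dc in offsets:            # each neighborhood offset
            r, c = rows + dr, cols + dc
            ok = (r >= 0) & (r < A_mask.shape[0]) & \
                 (c >= 0) & (c < A_mask.shape[1])
            out[r[ok], c[ok]] = True      # neighbors of pixels in A ...
        return out & B_mask               # ... intersected with B

    A = np.zeros((5, 5), dtype=bool); A[2, 2] = True
    B = ~A
    print(aura(A, B).astype(int))  # marks the four 4-neighbors of the center

Changing the offsets changes the neighborhood structure, so the same code handles
the problem-specific definitions of "neighbor" mentioned above.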

Digression: The connectivity paradox

Here is an image (in the margin figure: a ring of foreground pixels, with two pixels
on it labeled A and B). Foreground pixels are in black, background in white. We
have learned about 4-connectedness and 8-connectedness. Let's apply what we have
learned. The foreground in this image is a ring, perhaps a very low-resolution image
of a washer. Is this ring closed? That is, can we start at one point, walk around the
ring, without retracing our steps, and passing only from one pixel to a connected
neighbor? (Incidentally, we just defined a property of connectivity called a path.) If
we can walk all the way around this region like that, we say the region is closed. Use
4-connectedness. Is it closed? If you said no, good! In a 4-connected system, we
cannot get from the pixel A to pixel B. If it is not closed, it must be open. Right? If
it is open, the inside and outside must be connected, right? But, using a 4-connected
definition, the inside and outside are separate regions.
Well, guess we have to give up on 4-connectedness, since it leads to a paradox.
Let’s go to 8-connected. Now, the foreground is closed, right? (say right). But now!
The inside and outside are also connected! How can that be? If the region is closed,
the inside and outside must (logically) be separate. So 8-connected does not work
either.
For this particular paradox, there is a fix, at least sort of a fix, which is to use one
definition (8- or 4-connected) for the foreground and the other for the background.
This fix works for binary images, but when we get to images with more than two levels
of brightness, these discretization problems occur again. We present this example to
illustrate that in digital images, intuition is not always right. There are lots of other
examples of similar problems involving, for example, the measure of the perimeter
of regions. Just be aware that weird things happen and intuition does not always
hold.
The hexagonal representation discussed in section 4A.1 solves the connectivity


paradox. It may solve other problems as well.

Assignment 4.1 Processing and viewing range images.


The purpose of this exercise is to get you thoroughly
used to using the computer and the software packages we
will need for this course.

(1) Look on your CDROM in the “images” directory (Unix


users, look on your system disk in the appropri-
ate place) and in the “leadhole” subdirectory,
you will notice a number of images with the name
“site . . . .” View these with whatever tool you
wish and copy your favorite to your directory. These
are tiny images, and will not take up much space.
(2) Run ifs viewpoint on each of the images you just
copied (to get the arguments, type “ifs viewpoint
-h”). This program accepts range images as input,
and allows you to view those images from a viewpoint
other than straight up and down. When viewed from
another angle, you can see the three-dimensionality
of the range image. To select your movement, respond
to the “enter transform” with a command, something
like
        movex -10 movey 10 roll 30 pitch 0 yaw 5 movex 3 movez 10 end

    (Having trouble with this? Does it seem to hang? It is
    probably waiting for you to type "end".)
(3) You can enter these on successive lines if you wish.
It continues to read motion command until it gets
the “end”. Roll, pitch, and yaw are rotations, and
their arguments are in degrees. Play with this until
you can answer the following questions.
(a) Does movex move the viewpoint or the image, and
in which direction (rows or columns)?
(b) Roll rotates about the z axis. About what axis
does pitch rotate?
(4) Now, take this image and create six new images,
    each resulting from a different roll (0, 30, 60,
    90, 120, and 150 degrees). Did the roll move any
    of them out of the image? Come up with a transform
    that fixes this problem (Hint: Consider moving the
    origin to the center of the image using a move
    command before you do the roll). Use the program
    ifs stack to convert these six two-dimensional
    images into a single three-dimensional image. If
    you are using Unix, view that image using imp, and
    demonstrate how to use imp to display the rotation
    as a "movie" (Hint: Use the "volume" button). If
    you are using a PC, you may convert the three-
    dimensional image produced by stack into an AVI
    image. For this, use ifs2avi. The AVI image can be
    viewed by any of a large collection of PC programs.
    A double click on the icon of the .avi image should
    do the job. (When using ifs2avi, you can choose the
    name of the output file. To make your PC work
    correctly, name the file with a .avi extension.)
(5) Now, learn how to use the program ifs spin (to
    learn how to use an IFS program, either type
    "program name -h" or look it up in the manual).
    Demonstrate that you can use it to generate a
    sophisticated movie. NOTE: ifs spin actually runs
    viewpoint.
The Unix version will generate quite a large set
of temporary files, which it deletes when done. Be
aware of a need for temporary disk space.

Write up your results, and show your instructor a demo.

Hint: On your CDROM, in the "leadhole" directory, you
will find an image named spinout.avi. The image you pro-
duced should vaguely resemble the output.

4.6 Conclusion

In this chapter, you have been introduced to a variety of ways to represent images,
and the information in images. In subsequent chapters, we will build on these repre-
sentations, developing algorithms which extract and categorize that information.

4.7 Vocabulary

You should know the meanings of the following terms.

Correspondence problem
Curvature

Dynamic range
Functional representation
Graph
Iconic representation
Isophote
Linear system
Medial axis
Probabilistic representation
Quantization
Range image
Raster scan
Ridge
Resolution
Sampling
Spatial frequency
Stereo
Structured illumination

Topic 4A Image representations

4A.1 A variation on sampling: Hexagonal pixels

In a number of papers [4.36], imaging sensors have been described which use hexagonally-
organized pixel arrays. Hexagons are the minimum-energy solution when a collection of
tangent circles with flexible boundaries is subjected to pressure. The beehive is the best known
such naturally occurring organization, but many others occur too, including the organization
of cones in the human retina.

Fig. 4.14. A coordinate system which is natural to hexagonal tessellation of the plane. The
u and v directions are not orthogonal. Unit vectors u and v describe this coordinate
system.
Traditionally, electronic imaging sensors have been arranged in rectangular arrays mainly
because an electron beam needed to be swept in a raster-scan way, and more recently because
it is slightly more convenient to arrange charge-coupled devices in a rectangular organization.
Rectangular arrays, however, introduce an ambiguity in attempts to define neighborhoods.
On the other hand, we see no connectivity paradoxes in hexagonal connectivity analysis:
Every pixel has exactly six neighbors, foreground, background, or other colors.

Notation
We denote a point in R^2 by p = u u + v v, where the unbolded character denotes the magnitude
in the direction of the unit vector denoted by the bold character. In the case that we discuss two
or more points, we will denote different vectors by using subscripts, with the same subscripts
on the components, e.g., p_i = u_i u + v_i v.

We will also use column vector notation for such points:

P_i = [u_i, v_i]^T.   (4.16)

In some cases, we will be interested in the location of points in the familiar Cartesian
representation, [x, y]^T. In this case, we will denote points by subscripts as well, e.g.,
P_i = [u_i, v_i]^T = [x_i, y_i]^T, with corresponding values for u, v, x, and y.

Lemma 1
Any ordered pair [u, v] corresponds to exactly one pair [x, y].

Proof
Using simple trigonometry, and noting that the cosine of 60 degrees is 1/2, it is straightforward
to derive that

x = u + \frac{v}{2}   and   y = \frac{\sqrt{3}\, v}{2}.   (4.17)

Lemma 2
Any ordered pair of Cartesian coordinates [x, y] corresponds to exactly one pair [u, v].

Proof
By solving Eq. (4.17) for u and v, we find

u = x - \frac{y}{\sqrt{3}}   and   v = \frac{2y}{\sqrt{3}}.   (4.18)
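A minimal sketch of Eqs. (4.17) and (4.18) (ours, not from the text; the function
names are our own):

    import math

    def hex_to_cartesian(u, v):
        """Eq. (4.17): x = u + v/2, y = sqrt(3)*v/2."""
        return u + v / 2.0, math.sqrt(3.0) * v / 2.0

    def cartesian_to_hex(x, y):
        """Eq. (4.18): u = x - y/sqrt(3), v = 2*y/sqrt(3)."""
        return x - y / math.sqrt(3.0), 2.0 * y / math.sqrt(3.0)

    # Round trip: each coordinate pair determines the other uniquely.
    x, y = hex_to_cartesian(3, 2)
    print(cartesian_to_hex(x, y))  # (3.0, 2.0)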

A set of vectors b_1, b_2, ..., b_d is said to be a basis for the vector space \Re^d if any vector
in \Re^d can be written as a linear combination of b_1, b_2, ..., b_d. If the b_i's are orthonormal
(b_i^T b_j = 0 if i \neq j, and b_i^T b_i = 1), this is sufficient to claim that these vectors constitute a
basis. However, being orthonormal is not a necessary condition. In the case of u and v, which
are normalized but not orthogonal, they still form a basis. The nonorthogonality is shown
in Eq. (4.19), where the inner product of u and v in Cartesian coordinates does not equal
zero.
u = x,   v = \frac{1}{2}x + \frac{\sqrt{3}}{2}y,   so   u^T v = [1, 0] \begin{bmatrix} 1/2 \\ \sqrt{3}/2 \end{bmatrix} = \frac{1}{2}.   (4.19)

Theorem
The vectors u and v form a basis for \Re^2.

Proof
Since x and y obviously are a basis for \Re^2, we can write any point p in \Re^2 as an ordered
pair p = [x, y]^T = x x + y y. But from Eq. (4.19), we have

p = x u + y \frac{(2v - u)}{\sqrt{3}} = \left( x - \frac{y}{\sqrt{3}} \right) u + \frac{2y}{\sqrt{3}} v.

Thus, any point in R^2 may be written as a weighted sum of u and v. QED.

4A.1.1 Identifying the neighbors of a pixel

Given a pixel with coordinates u, v (assumed integer), the coordinates of the neighbors are
illustrated in Fig. 4.15.
We can use the following loop to efficiently access all six neighbors of pixel u, v. We
observe that no “if” statements are required to determine if the center pixel is on an even- or
odd-numbered row.

ou = {-1, -1, 0, 1, 1, 0}
ov = { 0,  1, 1, 0, -1, -1}
for i = 1 to 6
    nu = u + ou[i]
    nv = v + ov[i]
    value = image[nu][nv]

We note that this method of accessing the neighbors of a pixel is also useful in rectangular
grids for 8-neighbors and is more efficient than the doubly indexed loop

for i = -1 to 1
    for j = -1 to 1
        if ((i != 0) or (j != 0))
            value = image[u + i][v + j]

It is also interesting to note that ou and ov are circular shifts of each other.


    (u-1, v+1)    (u, v+1)

    (u-1, v)      (u, v)      (u+1, v)

    (u, v-1)      (u+1, v-1)

Fig. 4.15. The neighborhood of a pixel is absolutely symmetric, with no ambiguity of 4- or
8-neighborhoods.

4A.2 Other types of iconic representations

So far, we have only spoken of images as if they were always brightness. Well actually, we
did mention 2½D images, which give range as a function of x and y. However, there are other
things which we could compute to represent image properties. Curvature is one.

4A.2.1 Curvature

The computation of local curvature could be performed at every point in an image. For 2½D
images (surfaces), the curvature cannot be described adequately by a single scalar, but rather
takes the form of a matrix. (See doCarmo's book [4.12] or other texts on differential geometry
for details.)

K = \begin{bmatrix} E & F \\ F & G \end{bmatrix}^{-1} \begin{bmatrix} e & f \\ f & g \end{bmatrix}   (4.20)

where

E = 1 + \left(\frac{\partial z}{\partial x}\right)^2,   F = \frac{\partial z}{\partial x}\frac{\partial z}{\partial y},   G = 1 + \left(\frac{\partial z}{\partial y}\right)^2,

e = \frac{\partial^2 z}{\partial x^2} H,   f = \frac{\partial^2 z}{\partial x \partial y} H,   g = \frac{\partial^2 z}{\partial y^2} H

and finally,

H = \left(\frac{\partial z}{\partial x}\right)^2 + \left(\frac{\partial z}{\partial y}\right)^2 + 1.

The principal curvatures K_1 and K_2 are defined as the two eigenvalues of the matrix K, and
the corresponding eigenvectors determine the directions of the curvature.
For many of our purposes, we will need scalar measurements of curvature which are
invariant to viewpoint. Two such scalars are easily defined: the mean curvature

K_m = \frac{1}{2}(K_1 + K_2) = \frac{1}{2} \mathrm{Tr}(K)   (4.21)

and the Gauss curvature

K_G = K_1 K_2 = \det(K).   (4.22)

Since it is a product, the Gauss curvature is zero whenever either of the two principal curvatures
is zero, a condition which routinely occurs with industrial parts. For this reason, we seldom
use the Gauss curvature.
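A sketch of Eqs. (4.20)-(4.22) (ours, not from the text): estimating mean and Gauss
curvature of a range image z(x, y) with finite differences, assuming NumPy and unit
grid spacing.

    import numpy as np

    def curvatures(z):
        zy, zx = np.gradient(z)        # first partials (axis 0 ~ y, axis 1 ~ x)
        zxy, zxx = np.gradient(zx)     # second partials
        zyy, _ = np.gradient(zy)
        E, F, G = 1 + zx**2, zx * zy, 1 + zy**2
        H = zx**2 + zy**2 + 1          # as defined following Eq. (4.20)
        e, f, g = zxx * H, zxy * H, zyy * H
        # K = [[E, F], [F, G]]^{-1} [[e, f], [f, g]], evaluated pixelwise
        det = E * G - F * F
        k11 = (G * e - F * f) / det
        k12 = (G * f - F * g) / det
        k21 = (E * f - F * e) / det
        k22 = (E * g - F * f) / det
        Km = 0.5 * (k11 + k22)         # mean curvature, Eq. (4.21): half the trace
        KG = k11 * k22 - k12 * k21     # Gauss curvature, Eq. (4.22): determinant
        return Km, KG

    # A paraboloid z = x^2 + y^2 has positive Gauss curvature everywhere.
    y, x = np.mgrid[-1:1:64j, -1:1:64j]
    Km, KG = curvatures(x**2 + y**2)
    print(KG[32, 32] > 0)              # True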

4A.2.2 Texture

Texture is one of those words that everybody seems to know, but knows without a defini-
tion. There are at least two different definitions of texture – “natural” textures, which are
best characterized by random process descriptions, and “regular” textures, which are best
characterized by frequency–domain representations.
Haralick and Shapiro [4.18] describe textures as “having one or more of the properties of
fineness, coarseness, smoothness, granulation, randomness, lineation, or as being mottled,
irregular, or hummocky.” In detecting that clusters of pixels are different, many features may
be used including moments of the power spectrum [4.4, 4.14], fractal dimension [8.12], and
the cepstrum [4.35]. Texture segmentation involves representing [4.17, 4.21] an image in a
way which incorporates both spatial and spatial–frequency information, and then using that
information to identify regions with similar characteristics [4.7, 4.14, 4.16].
The fact that textures can be effectively represented by self-similar (fractal) processes is
addressed in a number of papers, the first of which was presented in the classic work by
Mandelbrot and Van Ness [4.25]. Kaplan and Kuo [4.22] point out that true textures do not
necessarily keep the same exact textures over scale, and the concept of self-similarity should
be modified.

Fig. 4.16. (a) Examples of wool textures [4.2]. (b) Examples of tree bark textures [4.2].
(c) Examples comparing natural and regular textures [4.4]. Used with permission.

Assignment 4.A1
Suppose the image f(x,y) is describable by f(x,y) = x^4/4 - x^3 + y^2.
At the point x = 1, y = 2, which of the following is a unit
vector which points along the isophote passing through that
point?

(a) [2/\sqrt{5}, 1/\sqrt{5}]^T    (c) [-1/\sqrt{5}, 2/\sqrt{5}]^T    (e) [2, 1]^T

(b) [1/\sqrt{5}, 2/\sqrt{5}]^T    (d) [-2, 4]^T    (f) [-2/\sqrt{5}, 1/\sqrt{5}]^T

Assignment 4.A2
Imagine you are standing on a surface. You cannot see the en-
tire surface, but you can see a fairly large portion. If you
measure the curvature at all the points you can see, you find
that one of the two principal curvatures is zero. The other
principal curvature varies monotonically in one direction.
You cannot measure it precisely, but you suspect that vari-
ation of curvature is linear in that one direction. On what
type of surface are you standing?

References

[4.1] T. Alter, “3-D Pose from 3 Points Using Weak-perspective,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 16(8), 1994.
[4.2] D. Badler, J. JáJá, and R. Chellappa, “Scalable Data Parallel Algorithms for
Texture Synthesis and Compression using Gibbs Random Fields,” IEEE Transactions
on Image Processing, 4(10), 1995.
[4.3] R. Bajcsy and F. Solina, “Three Dimensional Object Representation Revisited,”
International Conference on Computer Vision, London, May, 1987.
[4.4] J. Bigün and J. du Buf, “N-folded Symmetries by Complex Moments in Gabor Space
and Their Application to Unsupervised Texture Segmentation,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[4.5] A. Bokil and A. Khotanzad, “A Constraint Learning Feedback Dynamic Model for
Stereopsis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(11),
1995.
[4.6] K. Castleman, Digital Image Processing, Englewood Cliffs, NJ, Prentice-Hall, 1996.
[4.7] J. Chen and A. Kundu, “Rotation and Gray Scale Transformation Invariant Tex-
ture Identification using Wavelet Decomposition and Hidden Markov Models,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(2), 1994.
[4.8] R. Chien and W. Snyder, “Hardware for Visual Image Processing,” IEEE Transactions
on Circuits and Systems, 22(6), 1975.
[4.9] D. Clausi, “Texture Segmentation Example,” Web publication,
http://www.eng.uwaterloo.ca/~dclausi/texture.html, Spring 2001.
[4.10] F. Cohen and J. Wang, “Part I: Modeling Image Curves Using Invariant 3-D Object
Curve Models – A Path to 3-D Reconstruction and Shape Estimation from Image
Contours Using B-Splines, Shape Invariant Matching and Neural Network,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[4.11] F. Cohen and J. Wang, “Part II: 3-D Object Recognition and Shape Estimation from
Image Contours,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(1), 1994.
[4.12] M. doCarmo, Differential Geometry of Curves and Surfaces, Englewood Cliffs, NJ,
Prentice-Hall, 1976.
[4.13] U. Dhond, and J. Aggarwal, “Stereo Matching in the Presence of Narrow Occluding
Objects using Dynamic Disparity Search,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(7), 1995.
[4.14] D. Dunn, W. Higgins, and J. Wakeley, “Texture Segmentation using 2-D Gabor
Elementary Functions,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 16(2), 1994.
[4.15] I. Elfadel and R. Picard, “Gibbs Random Fields, Co-occurrences, and Texture Model-
ing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[4.16] H. Greenspan, R. Goodman, R. Chellappa, and C. Anderson, “Learning Texture Dis-
crimination Rules in a Multiresolution System,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 16(9), 1994.
[4.17] M. Gürelli and L. Onural, “On a Parameter Estimation Method for Gibbs–Markov
Random Fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(4), 1994.
[4.18] R. Haralick and L. Shapiro, Computer and Robot Vision, Volume I, Reading, MA,
Addison-Wesley, 1992.
[4.19] G. Healey and R. Kondepudy, “Radiometric CCD Camera Calibration and Noise
Estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(3),
1994.
[4.20] Y. Hel-Or and M. Werman, “Pose Estimation by Fusing Noisy Data of Different
Dimensions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(2),
1995.
[4.21] A. Jain and K. Karu, “Learning Texture Discrimination Masks,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 18(2), 1996.
[4.22] L. Kaplan and C. Kuo, “Texture Roughness Analysis and Synthesis via Extended
Self-similar (ESS) Model,” IEEE Transactions on Pattern Analysis and Machine In-
telligence, 17(11), 1995.
[4.23] D. Keren, D. Cooper, and J. Subrahmonia, “Describing Complicated Objects by Im-
plicit Polynomials,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(1), 1994.
[4.24] J. Maintz, P. van den Elsen, and M. Viergever, “Evaluation of Ridge Seeking Oper-
ations for Multimodality Medical Image Matching,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 18(4), 1996.
[4.25] B. Mandelbrot and J. Van Ness, “Fractional Brownian Motions, Fractional Noises,
and Applications,” SIAM Review, 10, October, 1968.
[4.26] S. Marapan and M. Trivedi, “Multi-primitive Hierarchical (MPH) Stereo Analysis,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(3), 1994.
[4.27] D. Marr and T. Poggio, “Cooperative Computation of Stereo Disparity,” Science, 194,
pp. 283–287, October, 1976.
[4.28] P. McLauchlan and D. Murray, “Active Camera Calibration for a Head-eye Platform
Using Variable State-dimension Filter,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(1), 1996.
[4.29] N. Page, W. Snyder, and S. Rajala, “Turbine Blade Image Processing System,”
In Advanced Software for Robotics, ed. A Danthine, Amsterdam, North-Holland,
1984.
[4.30] S. Peleg, M. Ben-Ezra, and Y. Pritch, “Omnistereo: Panoramic Stereo Imaging,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 23(3), 2001.
[4.31] L. Quan, “Invariants of Six Points and Projective Reconstruction from Three Uncal-
ibrated Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(1), 1995
[4.32] A. Shashua, “Projective Structure from Uncalibrated Images: Structure from Motion
and Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(8), 1994.
[4.33] A. Shashua, “Algebraic Functions for Recognition,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 17(8), 1995.
[4.34] A. Shashua and N. Navab, “Relative Affine Structure: Canonical Model for 3D From
2D Geometry and Applications,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(9), 1996.
[4.35] P. Smith and N. Nandhakumar, “An Improved Power Cepstrum Based Stereo Corre-
spondence Method for Textured Scenes,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(3), 1996.
[4.36] W. Snyder, H. Qi, and W. Sander, “A Hexagonal Coordinate System,” SPIE Medical
Imaging: Image Processing, Pt. 1–2, pp. 716–727, February, 1999.
[4.37] G. Wei and S. Ma, “Implicit and Explicit Camera Calibration: Theory and Exper-
iments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5),
1994.
[4.38] X. Zhuang and Y. Huang, “Robust 3-D – 3-D Pose Estimation,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(8), 1994.
5 Linear operators and kernels

Now I see through a glass darkly, but then, face to face


Paul of Tarsus

In this chapter,1 we investigate linear operations on images. We first consider the


derivative, probably the most common linear operator. That discussion is extended
into edge detection, and we consider a variety of methods for accomplishing this
objective.

5.1 What is a linear operator?

Suppose D is an operator which takes an image f and produces an image g. If D
satisfies

D(\alpha f_1 + \beta f_2) = \alpha D(f_1) + \beta D(f_2),   (5.1)

where f_1 and f_2 are images and \alpha and \beta are scalar multipliers, then we say that D is a
"linear operator."

A gedankenexperiment

Consider the image operator D

g = D(f) = a f + b,   a, b \in \Re.

Is D a linear operator?
We suggest you work this out for yourself before reading the solution. It certainly
LOOKS linear. Multiplication by a constant followed by addition of a constant. If
f were a scalar variable, then D describes the equation of a line, which SURELY is
linear (isn’t it?)! OK. Let’s prove it. Using Eq. (5.1), we evaluate

D(\alpha f_1 + \beta f_2) = a(\alpha f_1 + \beta f_2) + b = a\alpha f_1 + a\beta f_2 + b

1 The authors are grateful to Bilge Karacali, Rajeev Ramanath, and Lena Soderberg for their assistance in
producing the images used in this chapter.


and check to see if this is the same as

\alpha D(f_1) + \beta D(f_2) = \alpha(a f_1 + b) + \beta(a f_2 + b)
                     = a\alpha f_1 + \alpha b + a\beta f_2 + \beta b
                     = a\alpha f_1 + a\beta f_2 + b(\alpha + \beta)

so unless \alpha + \beta = 1, D is NOT a linear operator! This seems counter-intuitive,


doesn’t it? We’ll have another look at this later and see if we can figure out why. In
the remainder of this chapter, we will look at image operators which are linear.
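A quick numerical check (ours, not from the text; it assumes NumPy) that g = af + b
fails the test of Eq. (5.1) unless \alpha + \beta = 1:

    import numpy as np

    a, b = 2.0, 5.0
    D = lambda f: a * f + b

    f1, f2 = np.array([1.0, 3.0]), np.array([4.0, 0.0])
    alpha, beta = 2.0, 3.0             # alpha + beta != 1

    lhs = D(alpha * f1 + beta * f2)
    rhs = alpha * D(f1) + beta * D(f2)
    print(lhs - rhs)                   # constant -b*(alpha + beta - 1) = -20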

5.2 Application of kernel operators in digital images

Since f is now digital, many authors choose to write f as a matrix, f_{ij}, instead of the
functional notation f(x, y). However, we prefer the x, y notation, for reasons which
shall become apparent later. We will find it more convenient to use a single subscript,
f_i, at several points later, but for now, let's stick with f(x, y) and remember that x
and y take on only a small range of integer values, e.g., 0 < x < 511.
Think about a one-dimensional image named f with five pixels and another one-
dimensional image, which we will call a kernel named h with three pixels, as illus-
trated in Fig. 5.1.
Place the kernel down so its center, pixel h_0, is over some pixel of f, say f_2; we get
g_2 = f_1 h_{-1} + f_2 h_0 + f_3 h_1, which is a sum of products of elements of the kernel
and elements of the image. With that understanding, let us consider the most common
example of Eq. (5.2), the case we will come to call application of a 3 × 3 kernel:

g(x, y) = \sum_{\alpha} \sum_{\beta} f(x + \alpha, y + \beta)\, h(\alpha, \beta).   (5.2)

This case occurs when both \alpha and \beta take on only the values -1, 0, and 1. In this
case, Eq. (5.2) expands to

g(x, y) = f(x - 1, y - 1) h(-1, -1) + f(x, y - 1) h(0, -1)
        + f(x + 1, y - 1) h(1, -1) + f(x - 1, y) h(-1, 0)
        + f(x, y) h(0, 0) + f(x + 1, y) h(1, 0) + f(x - 1, y + 1) h(-1, 1)   (5.3)
        + f(x, y + 1) h(0, 1) + f(x + 1, y + 1) h(1, 1).

(Remember, y = 0 is at the TOP of the image, and y goes up as you go down in the
image.)

    f_1   f_2   f_3   f_4   f_5

          h_{-1}   h_0   h_1

Fig. 5.1. A one-dimensional image with five pixels and a one-dimensional kernel with three
pixels. The subscript is the x-coordinate of the pixel.
Table 5.1. Values of the elements of a kernel.

h(-1, -1)   h(0, -1)   h(1, -1)
h(-1, 0)    h(0, 0)    h(1, 0)
h(-1, 1)    h(0, 1)    h(1, 1)

(Note the order of the arguments here: x (column) and y (row). Sometimes the
reverse convention is followed.)

To better capture the essence of Eq. (5.3), let us write h as a 3 × 3 grid of numbers
(yes, we used the word “grid” rather than “array” intentionally), as in Table 5.1.
Now we imagine that we place this grid down on top of the image so that the
center of the grid is directly over pixel f (x, y); then each h value in the grid is
multiplied by the corresponding point in the image. We will refer to the grid of h
values henceforth as a “kernel.”
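A minimal sketch (ours, not from the text; it assumes NumPy) of this "place the
grid and multiply" operation, with border pixels simply left at zero:

    import numpy as np

    def apply_kernel(f, h):
        """f: 2-D image, h: 3x3 kernel; returns the sum-of-products image."""
        g = np.zeros_like(f, dtype=float)
        for row in range(1, f.shape[0] - 1):
            for col in range(1, f.shape[1] - 1):
                # center the kernel over (row, col) and sum the products
                g[row, col] = np.sum(f[row - 1:row + 2, col - 1:col + 2] * h)
        return g

    f = np.arange(25, dtype=float).reshape(5, 5)
    h = np.ones((3, 3)) / 9.0          # a simple averaging kernel
    print(apply_kernel(f, h))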

5.2.1 On the direction of the arguments: Convolution and correlation


Let's restate two important equations: first the equation for a kernel operator,
recopied from Eq. (5.2), and then the equation for two-dimensional, discrete
convolution.

g(x, y) = \sum_{\alpha} \sum_{\beta} f(x + \alpha, y + \beta)\, h(\alpha, \beta)   (5.4)

g(x, y) = \sum_{\alpha} \sum_{\beta} f(x - \alpha, y - \beta)\, h(\alpha, \beta).   (5.5)

The observant student will have noticed a discrepancy in order between Eqs. (5.4)
and (5.5). In formal convolution, as given by Eq. (5.5), the arguments reverse: the
right-most pixel of the kernel (h_1) is multiplied by the left-most pixel in the corre-
sponding region of the image (f_2). However, in Eq. (5.4), we think of "placing" the
kernel down over the image and multiplying corresponding pixels. If we multiply
corresponding pixels, left-left and right-right, we have correlation. (Mathematically,
convolution and correlation differ in the left-right order of coordinates.) There is,
unfortunately, a misnomer in much of the literature – both may be called "convolution."
We advise the student to watch for this. In many publications, the authors use the
term "convolution" when they really mean "sum of products." In order to avoid con-
fusion, in this book, we will avoid the use of the word "convolve" unless we really
do mean the application of Eq. (5.5), and instead use the term "kernel operator"
when we mean Eq. (5.4).
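The distinction is easy to see numerically. In this sketch (ours, not from the text;
it assumes SciPy, whose ndimage module provides exactly this pair of operations),
convolving an impulse reproduces the kernel, while correlating an impulse produces
the kernel reversed:

    import numpy as np
    from scipy.ndimage import convolve, correlate

    f = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # a 1-D impulse
    h = np.array([1.0, 2.0, 3.0])             # an asymmetric kernel

    print(convolve(f, h))    # [0. 1. 2. 3. 0.]: Eq. (5.5), arguments reversed
    print(correlate(f, h))   # [0. 3. 2. 1. 0.]: Eq. (5.4), h "placed" down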

5.2.2 Using kernels to estimate derivatives


Let us examine this concept via an example – approximating the spatial derivatives
of the image, ∂ f /∂ x and ∂ f /∂ y.
We recall from some dimly remembered calculus class,

\frac{\partial f}{\partial x} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x},

which would suggest the following kernel could be used (for \Delta x = 1):

    [-1   1]
      ^
      (derivative estimated at this point)

But this kernel is aesthetically unpleasing – the estimate at x depends on the value
at x and at x + 1, but not at x − 1; why? We actually like a symmetric definition
better, such as

\left. \frac{\partial f}{\partial x} \right|_{x_0} = \lim_{\Delta x \to 0} \frac{f(x_0 + \Delta x) - f(x_0 - \Delta x)}{2\,\Delta x}.

We cannot get \Delta x smaller than 1, and we end up with this kernel

    [-1/2   0   1/2]

which, for notational simplicity, we write as

    (1/2) [-1   0   1].

A major problem with derivatives is noise sensitivity. We compensate for this by


taking the difference horizontally, and then averaging vertically – which produces
the following kernel:
\left. \frac{\partial f}{\partial x} \right|_{x_0} = \frac{1}{6} \left( \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix} \otimes f \right)   (5.6)
where we have introduced a new symbol, ⊗ which will denote the sum-of-products
implementation described above. The literature abounds with kernels like this one.
All of them combine the concept of estimating the derivative by differences, and
then averaging the result in some way to compensate for noise. Probably the best
known of these ad hoc kernels is the Sobel:
\left. \frac{\partial f}{\partial x} \right|_{x_0} = \frac{1}{8} \left( \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \otimes f \right).   (5.7)

The Sobel operator has the benefit of being center-weighted.



5.3 Derivative estimation by function fitting

This approach presents yet another way to make use of the continuous representa-
tion of an image f (x, y). Think of the brightness as a function of the two spatial
coordinates, and consider a plane which is tangent to that brightness surface at a
point, as illustrated in Fig. 5.2.
In this case, we may write the continuous image representation using the equation
of a plane

f (x, y) = ax + by + c. (5.8)

Then, we may consider the edge strength using the two numbers ∂ f /∂ x = a,
∂ f /∂ y = b, and the rate of change of brightness at the point (x, y) is represented by
the gradient vector


\nabla f = \left[ \frac{\partial f}{\partial x} \quad \frac{\partial f}{\partial y} \right]^T = [a\ b]^T.   (5.9)

The approach followed here is to find a, b, and c given some noisy, blurred measure-
ment of f, and the assumption of Eq. (5.8).
To find those parameters, first observe that Eq. (5.8) may be written as f(x, y) =
A^T X, where the vectors A and X are A^T = [a b c] and X^T = [x y 1].

Suppose we have measured brightness values g(x, y) at a collection of points


ℵ ⊂ Z × Z (Z is the set of integers) in the image. Over that set of points, we
wish to find the plane which best fits the data. To accomplish this objective, write
the error as a function of the measurement and the (currently unknown) function
f(x, y), as a sum-squared error objective function:

E = \sum_{\aleph} (f(x, y) - g(x, y))^2 = \sum_{\aleph} (A^T X - g(x, y))^2.

Expanding the square and eliminating the functional notation for simplicity, we find

E = \sum_{\aleph} \left( (A^T X)(A^T X) - 2 A^T X g + g^2 \right).

Fig. 5.2. The brightness in an image can be thought of as a surface, a function of two variables.
The slopes of the tangent plane are the two spatial partial derivatives.

Remembering that for vectors A and X, A^T X = X^T A, and taking the summation
through, we have

E = \sum_{\aleph} A^T X X^T A - 2 \sum_{\aleph} A^T X g + \sum_{\aleph} g^2
  = A^T \left( \sum_{\aleph} X X^T \right) A - 2 A^T \sum_{\aleph} X g + \sum_{\aleph} g^2.
ℵ ℵ ℵ

Now we wish to find the A (the parameters of the plane) which minimizes E; so we
may take derivatives and set the result to zero.
\frac{dE}{dA} = 2 \left( \sum_{\aleph} X X^T \right) A - 2 \sum_{\aleph} X g = 0.   (5.10)


Let's call \sum_{\aleph} X X^T \equiv S (it is the "scatter matrix") and see what Eq. (5.10) means:
consider a neighborhood \aleph which is symmetric about the origin. In that neighbor-
hood, suppose x and y only take on values of -1, 0, and 1; then
S = \sum_{\aleph} X X^T = \sum_{\aleph} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} [x\ y\ 1] = \begin{bmatrix} \sum x^2 & \sum xy & \sum x \\ \sum xy & \sum y^2 & \sum y \\ \sum x & \sum y & \sum 1 \end{bmatrix}

which, for the neighborhood described, is

\begin{bmatrix} 6 & 0 & 0 \\ 0 & 6 & 0 \\ 0 & 0 & 9 \end{bmatrix}.

You do not see where those values came from? OK, here's how you get them.
(Do not forget the positive direction for y is down.) Look at the top left point, at
coordinates x = -1, y = -1. At that point, x^2 = (-1)^2 = 1. Now look at the top
middle point, at coordinates x = 0, y = -1. At that point, x^2 = 0. Do this for all
9 points in the neighborhood, and you obtain \sum x^2 = 6. Got it?

Useful observation: If you make the neighborhood symmetric about the origin,
all the terms in the scatter matrix which contain x or y to an odd power will be zero.
(Convince yourself that this is true.)

Also: A common miss steak is to put a 1 in the lower right corner rather than a
9 – be careful! (Be sure you carefully proofread whatever you write!)
So now we have the matrix equation

2 \begin{bmatrix} 6 & 0 & 0 \\ 0 & 6 & 0 \\ 0 & 0 & 9 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = 2 \begin{bmatrix} \sum g(x, y)\, x \\ \sum g(x, y)\, y \\ \sum g(x, y) \end{bmatrix}.
We can easily solve for a:

a = \frac{1}{6} \sum g(x, y)\, x \approx \frac{\partial f}{\partial x}.
So, to compute the derivative from a fit to a neighborhood: At each of the nine points in
the neighborhood, take the measured value at that point, multiply by its x coordinate,
and add the products up. Let's write down the x coordinates in tabular form:

−1 0 1
−1 0 1
−1 0 1

That is precisely the kernel of Eq. (5.6), which we derived intuitively. Now, we have
it derived formally. Doesn’t it give you a warm fuzzy feeling when theory agrees with
intuition?!? (Whoops, we forgot to multiply each term by 1/6, but we can simply
factor the 1/6 out, and when we get the answer, we will just divide by 6.)
We accomplished this by using an optimization method, in this case, minimizing
the squared error, to find the coefficients of a function f (x) in an equation of the
form y = f (x), where f is polynomial. Recall from section 4.1.2 that this form is
referred to as an explicit functional representation.
One more terminology issue: In future material, we will use the expression radius
of a kernel. The radius is the number of pixels from the center to the nearest edge.
For example, a 3 × 3 kernel has a radius of one. A 5 × 5 kernel has a radius of 2,
etc. It is possible to design kernels which are circular, but most of the time, we use
squares.
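The derivation is easy to verify numerically. In this sketch (ours, not from the text;
it assumes NumPy), we build the 9 x 3 matrix whose rows are X^T = [x y 1] for the
nine neighborhood points; the least-squares solution then reproduces the kernel of
Eq. (5.6):

    import numpy as np

    pts = [(x, y) for y in (-1, 0, 1) for x in (-1, 0, 1)]
    X = np.array([[x, y, 1.0] for x, y in pts])     # 9 x 3

    S = X.T @ X                   # scatter matrix: diag(6, 6, 9)
    # Row 0 of inv(S) X^T gives the weights producing a; reshaped to
    # 3 x 3, it is exactly (1/6) times the kernel of Eq. (5.6).
    kernel_a = (np.linalg.inv(S) @ X.T)[0].reshape(3, 3)
    print(S)
    print(kernel_a * 6)           # [[-1. 0. 1.] [-1. 0. 1.] [-1. 0. 1.]]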

Finding image gradients in hexagonal arrays of pixels

In this section, we find image gradients again, in exactly the same way, but this
time with hexagonally arranged pixels. Refer to section 4A.1 for a discussion of the
coordinate system. It is a different presentation of the same material, and if you read
both presentations carefully, you will understand the concepts more clearly.
To find the gradient of intensity in an image, we will fit a plane to the data
in a small neighborhood. This plane will be represented in the form of Eq. (5.8).
We then take partial derivatives with respect to u and v to find the gradient of
intensity in those corresponding directions. We choose a neighborhood of six points,
surrounding a central point, and fit the plane to them. Define the set of data points as
z_i (i = 1, ..., 6). Then the following expression represents the error in fitting these
six points to a plane parameterized by a, b, and c:

E = \sum_{i=1}^{6} (z_i - (a u_i + b v_i + c))^2.   (5.11)
In order to represent E in a form that will be easy to differentiate, we will reformulate


the argument of the summation using matrix notation. Define vectors A = [a b c]^T
and Z = [u v 1]^T. Then E may be written using

E = \sum_{i=1}^{6} (z_i - A^T Z_i)^2   (5.12)

  = \sum_{i=1}^{6} \left( z_i^2 - 2 z_i A^T Z_i + A^T Z_i A^T Z_i \right).   (5.13)

First, we observe that A^T Z = Z^T A, and rewrite Eq. (5.13), temporarily dropping
the limits on the summation to make the typing easier:

E = \sum z_i^2 - 2 A^T \sum z_i Z_i + A^T \left( \sum Z_i Z_i^T \right) A.   (5.14)
The term in parentheses in Eq. (5.14) is the scatter matrix, the collection of locations
of points at which data exists. We denote this matrix by the symbol S. In order to
find the value of the vector A which minimizes E, we take the partial derivative with
respect to A:

\frac{\partial E}{\partial A} = -2 \sum z_i Z_i + 2 S A.   (5.15)
Evaluating S, we find

S = \begin{bmatrix} \sum u_i^2 & \sum u_i v_i & \sum u_i \\ \sum u_i v_i & \sum v_i^2 & \sum v_i \\ \sum u_i & \sum v_i & \sum 1 \end{bmatrix} = \begin{bmatrix} 4 & -2 & 0 \\ -2 & 4 & 0 \\ 0 & 0 & 6 \end{bmatrix}.
(One finds the numerical values by summing the u and v coordinates over each of
the pixels in the neighborhood, as illustrated in Fig. 4.15, assuming the center pixel
is at location 0, 0.)
Define

\sum z_i u_i \equiv \Upsilon_u
\sum z_i v_i \equiv \Upsilon_v,

and set the partial derivative of Eq. (5.15) equal to zero to produce a pair of
simultaneous equations,

4a - 2b = \Upsilon_u
-4a + 8b = 2\Upsilon_v   (5.16)

with solution

b = \frac{1}{6} (2\Upsilon_v + \Upsilon_u).   (5.17)

Similarly,

a = \frac{1}{6} (\Upsilon_v + 2\Upsilon_u).   (5.18)
Substituting actual values of u and v at each pixel in the six-pixel neighborhood,


we determine the gradient vector, since the gradient in the u direction is a and the
gradient in the v direction is b.
We rewrite the equation for a, substituting in the definitions of the \Upsilon's:

a = \frac{1}{6} \left( \sum z_i v_i + 2 \sum z_i u_i \right) = \frac{1}{6} \sum z_i (v_i + 2u_i).
Now, look at the pixel directly to the right of center; its u, v coordinates are
1, 0. So, we should multiply the image brightness, z_i, at that point by (v_i + 2u_i), or
(0 + 2 × 1), which is 2. That is the value that goes into the kernel. We need to remem-
ber, in using this method, that the answer we get is not quite right . . . It is actually
six times the best estimate, and we need to divide by six if the actual value of the
derivative is important, rather than a result proportional to the estimated derivative.
Repeating this process at each point, we obtain the kernels shown in Fig. 5.3.
The concepts of fitting described in this section are ubiquitous. They pop up all
over the discipline of machine vision. You find a need to fit gray values, surfaces,
lines (and we will do that in Chapter 9), curves, etc. There are even techniques for
fitting when data has known statistical variations [5.7, 5.40].

5.4 Vector representations of images

Suppose we list every pixel in an image in raster scan order, as one long vector. For
example, for the 4 × 4 image

f(x, y) =
    1  2  4  1
    7  3  2  8
    9  2  1  4
    4  1  2  3

F = [1 2 4 1 7 3 2 8 9 2 1 4 4 1 2 3]^T.

    u direction:            v direction:

      -1    1                  1    2
    -2    0    2            -1    0    1
      -1    1                 -2   -1

Fig. 5.3. The kernels used to estimate the gradient of brightness in the u direction, and in the v
direction. (The center weight is zero, since the plane was fit to the six surrounding
pixels only.)
This is called the "lexicographic" representation. (Pixel 0, 0 is the UPPER left
corner.) If we write the image in this way, each pixel may be identified by a single
index, e.g., F_0 = 1, F_4 = 7, F_15 = 3, where the indexing starts with zero.
Now suppose we want to apply the following kernel to this image:
h = \begin{bmatrix} -1 & 0 & 2 \\ -2 & 0 & 4 \\ 3 & 9 & 1 \end{bmatrix}

at point x = 1, y = 1, again, starting the indexing at zero.


Point (1, 1) corresponds to pixel F_5 in the image. We could accomplish this
application of the kernel by taking the dot product of the vector F with the vector
(a dot product computes application of a kernel):

H5 = [−1 0 2 0 −2 0 4 0 3 9 1 0 0 0 0 0]T .

Now you try it. Determine what vector to use to apply this kernel at (2, 2). Did you
get this?

H10 = [0 0 0 0 0 −1 0 2 0 −2 0 4 0 3 9 1]T .

Now try (2, 1) (x = 2, y = 1):

H6 = [0 −1 0 2 0 −2 0 4 0 3 9 1 0 0 0 0]T .

Compare H5 at (1, 1) and H6 at (2, 1). They are the same except for a rotation.
We could convolve the entire image by constructing a matrix in which each column
is one such H. Doing so would result in a matrix such as the one illustrated. By
producing the product G = H T F, G will be the (vector form of) convolution of
image F with kernel H.
Some observations about this process:

• The resulting matrix is a "circulant" matrix. Each column is simply a rotation of
  the adjacent column. (The margin figure shows two such columns, H5 and H6,
  side by side.)
• The fact that we can apply kernels in this way is yet another demonstration of the
  fact that kernel operators are linear operators.
• This form suggests one approach for dealing with the nasty problem of bound-
  ary conditions (you did think about that, didn't you?). Specifically, how do you
  multiply by the data value above when you are on the top line? One answer is to
  rotate around the image and pull from the bottom.
• The matrix H is VERY large. If f is a typical image of 256 × 256 pixels, then H is
  (256 × 256) × (256 × 256), which is a large number (although still smaller than
  the US national debt). Do not worry about the monstrous size of H. Nobody (well,
  almost nobody) ever computes H and uses it in this way. This form is useful for
  thinking about images and for proving theorems about image operators – it is a
  conceptual, not a computational tool.
Finally, multiplication by a circulant matrix can be accomplished considerably


faster by using the fast Fourier transform; but more about that later.
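A small sketch (ours, not from the text; it assumes NumPy) confirming that the dot
product with H5 is the same as applying the kernel at (1, 1):

    import numpy as np

    f = np.array([[1, 2, 4, 1],
                  [7, 3, 2, 8],
                  [9, 2, 1, 4],
                  [4, 1, 2, 3]], dtype=float)
    h = np.array([[-1, 0, 2],
                  [-2, 0, 4],
                  [ 3, 9, 1]], dtype=float)

    F = f.ravel()                  # lexicographic (raster scan) vector
    H5 = np.zeros(16)
    for r in range(3):             # embed h so its center lies over pixel F5
        H5[4 * r : 4 * r + 3] = h[r]

    print(F @ H5)                  # dot product of vectors ...
    print(np.sum(f[0:3, 0:3] * h)) # ... equals the sum of products at (1, 1)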

5.5 Basis vectors for images

In the previous section, we saw that we could think of an image as a vector. If we


can do that for an image, surely we can do the same thing for a small subimage.
Consider the nine-pixel neighborhood of a single point. We can easily construct the
9-vector which is the lexicographic representation of that neighborhood.
In Chapter 2, we learned that any vector could be represented as a weighted sum
of basis vectors, and we will apply that same concept here. Let’s rewrite Eq. (2.6)
in the form:

V = \sum_{i=1}^{9} a_i u_i

where now V is the 9-vector representation of this nine-pixel neighborhood, the a_i
are scalar weights, and the u_i are some set of orthonormal basis vectors.
But what basis vectors should we use? More to the point, what basis set would be
useful? The set we normally use, the Cartesian basis, is
u_1 = [1 0 0 0 0 0 0 0 0]^T
u_2 = [0 1 0 0 0 0 0 0 0]^T
...
u_9 = [0 0 0 0 0 0 0 0 1]^T
which, while convenient and simple, does not help us at all here. Could another basis
be more useful? (answer yes). Before we figure out what, do you remember how
many possible basis vectors there are for this real-valued 9-space? The answer is
“a zillion”.2 With so many choices, we should be able to pick some good ones. To
accomplish that, recall the role of the coefficients a_i. If some particular a_i is much
larger than all the other a's, it means that V is "very similar" to u_i. Computing
the a's then allows us a means to find which of a set of prototype neighborhoods a
particular image most resembles.
Fig. 5.4 illustrates a set of prototype neighborhoods developed by Frei and Chen
[5.12]. Notice that neighborhood u_1 is negative below and positive above the
horizontal center line, and therefore is indicative of a horizontal edge, or a point
where \partial f/\partial y is large.
Now recall how to compute the projection a_i. The scalar-valued projection of a
vector V onto a basis vector u_i is the inner product a_i = V^T u_i.

2 Actually, the correct answer is infinity – a zillion is just an engineering approximation.



u_1:   1    √2    1      u_2:   1    0   -1      u_3:   0   -1    √2
       0    0     0            √2    0   -√2            1    0   -1
      -1   -√2   -1             1    0   -1           -√2    1    0

u_4:  √2   -1     0      u_5:   0    1    0      u_6:  -1    0    1
      -1    0     1            -1    0   -1             0    0    0
       0    1   -√2             0    1    0             1    0   -1

u_7:   1   -2     1      u_8:  -2    1   -2      u_9:   1    1    1
      -2    4    -2             1    4    1             1    1    1
       1   -2     1            -2    1   -2             1    1    1

Fig. 5.4. Frei-Chen basis vectors.

One way to determine how similar a neighborhood about some point is to a vertical
edge is to compute the inner product of the neighborhood vector with the vertical
edge basis vector. One final question: What is the difference between calculating
this projection and convolving the image at that point with a kernel which estimates
\partial f/\partial x? The answer is left as an exercise to the student. (Don't you wish they were
all this easy?)

Do you think you could develop a similar basis set for the seven pixels in a
hexagonal neighborhood? Hint: Seven pixels define a seven-dimensional vector
space. You will need to find seven such vectors.
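A sketch of the projection computation (ours, not from the text; it assumes NumPy),
using the horizontal-edge prototype u_1 of Fig. 5.4, normalized to unit length:

    import numpy as np

    u1 = np.array([[ 1.0,  np.sqrt(2),  1.0],
                   [ 0.0,  0.0,         0.0],
                   [-1.0, -np.sqrt(2), -1.0]])
    u1 = u1 / np.linalg.norm(u1)       # make it a unit basis vector

    nbhd = np.array([[9.0, 9.0, 9.0],  # bright above, dark below:
                     [5.0, 5.0, 5.0],  # a strong horizontal edge
                     [1.0, 1.0, 1.0]])

    V = nbhd.ravel()                   # lexicographic 9-vector
    a1 = V @ u1.ravel()                # the projection a_1 = V^T u_1
    print(a1)                          # large: this neighborhood "is" a u_1 edge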
So now you know all there is to know (almost) about linear operators and kernel
operators. Let’s move on to an application to which we have already alluded – finding
edges.

5.6 Edge detection

Edges are areas in the image where the brightness changes suddenly; where the
derivative (or more correctly, some derivative) has a large magnitude. We can cate-
gorize edges as step, roof, or ramp [5.20], as illustrated in Fig. 5.5.

Fig. 5.5. Types of commonly occurring edges: positive and negative step edges, positive and
negative roof edges, and positive and negative ramp edges. Note that the term positive
or negative generally refers to the sign of the first instance of the first derivative.
We have already seen (twice) how application of a kernel such as

h_x = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}   (5.19)

approximates the partial derivative with respect to x. Similarly,

h_y = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}   or   \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix}   (5.20)

(Which is correct? It depends on which direction you have chosen as positive y.)

estimates \partial f/\partial y.
Some other forms have appeared in the literature that you should know about for
historical purposes. Remember the Sobel operator?

h_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}   or   \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}   (5.21)

Important
(This will give you trouble for the entire semester, so you may as well start now.)
In software implementations, the positive y direction is DOWN! This results from
the fact that scanning is top-to-bottom, left-to-right. So pixel (0, 0) is the upper left
corner of the image. Furthermore, numbering starts at zero, not one. We find the best
way to avoid confusion is to never use the words “x” and “y” in writing programs,
but instead use “row” and “column” remembering that now 0 is on top.
However, in these notes, we will use conventional Cartesian coordinates in order
to get the math right, and to further confuse the student (which is, after all, what
Professors are there for. Right?).

Having cleared up that muddle, let us proceed. Given the gradient vector

\nabla f = \left[ \frac{\partial f}{\partial x} \quad \frac{\partial f}{\partial y} \right]^T \equiv [G_x\ G_y]^T   (5.22)

we are interested in its magnitude

|\nabla f| = \sqrt{G_x^2 + G_y^2}   (5.23)
(which we will call the "edge strength") and its direction

\angle \nabla f = \mathrm{atan}\left( \frac{G_y}{G_x} \right).   (5.24)
One way to find an edge in an image is to compute the "gradient magnitude" image,
and to threshold that. So go try it: Work homework Assignment 5.5. (While you are
at it, you might want to work Assignment 5.6 too.)

What did you learn from those experiments? Clearly, several problems arise when
we try to find edges by simple kernel operations. During much of the remainder of
the course, we will address these issues. First, let us improve our kernel-based edge
detection.
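A minimal sketch (ours, not from the text; it assumes NumPy and SciPy) of this
thresholded gradient-magnitude edge detector, using the Sobel kernels and an
arbitrarily chosen threshold:

    import numpy as np
    from scipy.ndimage import correlate

    sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]) / 8.0
    sobel_y = sobel_x.T                # transpose estimates the y derivative

    img = np.zeros((32, 32))
    img[:, 16:] = 100.0                # a vertical step edge
    Gx = correlate(img, sobel_x)
    Gy = correlate(img, sobel_y)
    mag = np.sqrt(Gx**2 + Gy**2)       # edge strength, Eq. (5.23)
    edges = mag > 10.0                 # the threshold choice is ad hoc
    print(edges[16, 14:19])            # True at the columns straddling the step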

5.7 A kernel as a sampled differentiable function

We hope you have realized by now that all the edge detector operators you have used
so far are doing two things simultaneously; smoothing (read “low-pass filtering,”
“noise removal,” “averaging,” or “blurring”) and differentiation (read “high-pass
filtering” or “sharpening”). The kernel of Eq. (5.6) actually takes the vertical average
of three derivative estimates. It is actually counter-intuitive, however, since it weights
the center pixel the same as the lines above and below.
Consider the result you got on Assignment 5.6. If you did it correctly, the kernel
values increase as they are farther from the center. That is even worse, right? Why
should data points farther away from the point where we are estimating the derivative
contribute more heavily to the estimate? Wrong! Wrong! Wrong! It is an artifact of
the assumption we made that all the pixels fit the same plane. They obviously don’t.
So here’s a better way – weight the center pixel more heavily. You already saw
this – the Sobel operator, Eq. (5.7) does it. But now, let’s get a bit more rigorous.
Let’s blur the image by applying a kernel which is bigger in the middle and then
differentiate. We have lots of choices for a kernel like that, e.g., a triangle or a
Gaussian, but thorough research [5.28] has shown that a Gaussian works best for
this sort of thing. We can write this process as

d = \frac{\partial}{\partial x}(g \otimes h)
where now g is the measured image, h is a Gaussian, and d will be our new derivative
estimate image. Now, a crucial point from linear systems theory (recall what you
learned about linear systems): For linear operators D and \otimes,

D(g \otimes h) = D(h) \otimes g.   (5.25)

Equation (5.25) means we do not have to do blurring in one step and differentiation
in the next; instead, we can pre-compute the derivative of the blur kernel and simply
apply the resultant kernel.
Let’s see if we can remember how to take a derivative of a 2D Gaussian (did you
forget it is a 2D function?).
A d-dimensional multivariate Gaussian has the general form

\frac{1}{(2\pi)^{d/2} |K|^{1/2}} \exp\left( -\frac{[x - \mu]^T K^{-1} [x - \mu]}{2} \right)   (5.26)

where K is the covariance matrix and \mu is the mean vector. Since we want a Gaussian
centered at the origin (which will be the center pixel), \mu = 0, and since we have no
reason to prefer one direction over another, we choose K to be diagonal (isotropic):

K = \begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{bmatrix} = \sigma^2 I.   (5.27)

For two dimensions, Eq. (5.26) simplifies to

h(x, y) = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{[x\ y][x\ y]^T}{2\sigma^2} \right) = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{(x^2 + y^2)}{2\sigma^2} \right)   (5.28)

and

\frac{\partial}{\partial x} h(x, y) = \frac{-x}{2\pi\sigma^4} \exp\left( -\frac{(x^2 + y^2)}{2\sigma^2} \right).   (5.29)

(Learn this Gaussian blur stuff, and how to use it to compute a first- or
second-derivative kernel. It is guaranteed to be on the exam.)
If our objective is edge detection, we are done. However, if our objective is precise
estimation of derivatives, particularly higher order derivatives, use of a Gaussian
kernel, since it blurs the image, clearly introduces errors which can only be partially
compensated for [5.39]. Nevertheless, this is one of the most simple ways to develop
effective derivative kernels.
For future reference, here are a few of the derivatives of the one-dimensional
Gaussian. Even though there is no particular need for the normalizing \sqrt{2\pi} for most
of our needs (it just ensures that the Gaussian integrates to one), we have included it.
That way these formulae are in agreement with the literature. The subscript notation
is used here to denote derivatives. That is,
G_{xx}(\sigma, x) = \frac{\partial^2}{\partial x^2} G(\sigma, x),

where G(\sigma, x) is a Gaussian function of x with mean of zero and standard devia-
tion \sigma.
 
(Don't scoff at third derivatives. You never know when you might need one, like
finding the maxima of the second derivative.)

G(\sigma, x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right)

G_x(\sigma, x) = \frac{-x}{\sqrt{2\pi}\,\sigma^3} \exp\left(-\frac{x^2}{2\sigma^2}\right)

G_{xx}(\sigma, x) = \left( \frac{x^2}{\sqrt{2\pi}\,\sigma^5} - \frac{1}{\sqrt{2\pi}\,\sigma^3} \right) \exp\left(-\frac{x^2}{2\sigma^2}\right)   (5.30)

G_{xxx}(\sigma, x) = \frac{x}{\sqrt{2\pi}\,\sigma^5} \left( 3 - \frac{x^2}{\sigma^2} \right) \exp\left(-\frac{x^2}{2\sigma^2}\right).
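A sketch (ours, not from the text; it assumes NumPy) of the sampling procedure
described next: evaluating G_x of Eq. (5.30) at integer x to build a first-derivative
kernel.

    import numpy as np

    def gaussian_dx_kernel(sigma, radius):
        x = np.arange(-radius, radius + 1, dtype=float)
        return (-x / (np.sqrt(2 * np.pi) * sigma**3)
                * np.exp(-x**2 / (2 * sigma**2)))

    # By symmetry, a first-derivative kernel sums to zero automatically.
    print(gaussian_dx_kernel(1.0, 2))
    # [ 0.108  0.242  0.    -0.242 -0.108]  (cf. Table 5.2, 1st deriv, 5 x 1)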

Fig. 5.6. (a) The second derivative of a one-dimensional Gaussian centered at 0.


(b) A three-point approximation.

Let’s look in a bit more detail about how to make use of these formulae and their
two-dimensional equivalents to derive kernels.
The simplest way to get the kernel values for the derivatives of a Gaussian is
to simply substitute x = 0, 1, 2, etc. along with their negative values, which yields
numbers for the kernel. The first problem to arise is “what should ␴ be?” To ad-
dress these questions, we will derive the elements of the kernel used for the second
derivative of a one-dimensional Gaussian. The other derivatives can be developed
using the same philosophy. Take a look at Fig. 5.6(a) and ask, "is there a value of
\sigma such that the maximum of the second derivative occurs at x = -1 and x = 1?"
Clearly there is, and its value is \sigma = 1/\sqrt{3}. (Proving this is the correct value of
\sigma is a homework problem.) Given this value of \sigma, we can compute the values of
the second derivative of a Gaussian at the integer points x = \{-1, 0, 1\}. At x = 0,
we find G_{xx}(1/\sqrt{3}, 0) = -2.07, and at x = 1, G_{xx}(1/\sqrt{3}, 1) = 0.9251. So are
we finished? That wasn't so hard, was it? Unfortunately, we are not done. It is very
important that the elements of the kernel sum to zero. (The elements of any kernel
which approximates a derivative must sum to zero.) If they don't, then iterative
algorithms like those described in Chapter 6 will not maintain the proper brightness
levels over many iterations. The kernel also needs to be symmetric. That essentially
defines the second derivative of a Gaussian. The most reasonable set of values close
to those given, which satisfy symmetry and summation to zero, are \{1, -2, 1\}.
However, this does not teach us very much. Let's look at a 5 × 1 kernel and see
if we can learn a bit more. We require the following.

- The elements of the kernel should approximate the values of the appropriate derivative of a Gaussian as closely as possible.
- The elements must sum to zero.
- The kernel should be symmetric about its center, unless you want to do special processing.
We can calculate the five elements of a one-dimensional second-derivative-of-Gaussian
kernel, and if we do so, assuming σ = 1/√3, for x = {−2, −1, 0, 1, 2}, we get [0.0565,
0.9251, −2.0730, 0.9251, 0.0565]. Unfortunately, those numbers do not sum to zero. It is
very important that the kernel values sum to zero, and not quite so important that the
actual values be precise. So what do we do in a case like this? We use constrained
optimization. One strategy is to set up a problem to find a set of values as close as
possible to the sampled second derivative of a Gaussian, but which sums to zero.
For more complex problems, the authors use Interopt [5.3] to solve numerical
optimization problems, but you can solve this problem without using numerical
methods. This is accomplished as follows. First, understand the problem (presented
for the case of five points given above): We wish to find five numbers as close as
possible to [0.0565, 0.9251, −2.0730, 0.9251, 0.0565] which satisfy the constraint
that the five sum to zero. By symmetry, we actually have only three numbers, which
we will denote [a, b, c]. For notational convenience, introduce three constants α =
0.0565, β = 0.9251, γ = −2.073. Thus, to find a, b, and c which resemble these
numbers, we write the mean squared error (MSE) form
H₀(a, b, c) = 2(a − α)² + 2(b − β)² + (c − γ)².   (5.31)
[Margin note: H is the constrained version of H₀.]

Using the concept of Lagrange multipliers, we can find the best choice of a, b, and
c by minimizing a different objective function

H(a, b, c) = 2(a − α)² + 2(b − β)² + (c − γ)² + λ(2a + 2b + c).   (5.32)
A few words of explanation are in order for those students who are not familiar with
constrained optimization using Lagrange multipliers. The term with the λ in front
(λ is the Lagrange multiplier) is the constraint. It is formulated such that it is exactly
equal to zero if we should find the proper a, b, and c. By minimizing H, we will find
the parameters which minimize H₀ while simultaneously satisfying the constraint.
To minimize H, take the partials and set them equal to zero:

∂H/∂a = 4a − 4α + 2λ
∂H/∂b = 4b − 4β + 2λ   (5.33)
∂H/∂c = 2c − 2γ + λ.
Setting the partial derivatives equal to zero, simplifying, and adding the constraint,
we find the following set of linear equations:

a = α − λ/2
b = β − λ/2   (5.34)
c = γ − λ/2
2a + 2b + c = 0

which we solve to find the sets given in Table 5.2.

[Margin note: In the case of the first derivative, symmetry ensures the values always sum to zero, so no "tweaking" is necessary. Therefore the integer values are just as good as the floating point ones.]

Table 5.2. Derivatives of a one-dimensional Gaussian.

1st deriv, 3 × 1: [0.2420, 0.0, −0.2420] or [1, 0, −1]
1st deriv, 5 × 1: [0.1080, 0.2420, 0, −0.2420, −0.1080]
2nd deriv, 3 × 1: [1, −2, 1]
2nd deriv, 5 × 1: [0.07846, 0.94706, −2.05104, 0.94706, 0.07846]

0.0261   0   −0.0261
0.1080   0   −0.1080
0.0261   0   −0.0261

Fig. 5.7. The 3 × 3 first derivative kernel.
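Since the solution of Eq. (5.34) is in closed form, it is easy to check numerically. Here is a minimal sketch (ours, in Python/NumPy; not from the text) for the 5 × 1 second-derivative kernel:

    import numpy as np

    # Sampled values of Gxx (sigma = 1/sqrt(3)) at x = 0, 1, 2:
    alpha, beta, gamma = 0.0565, 0.9251, -2.0730

    # Substituting Eq. (5.34) into the constraint 2a + 2b + c = 0 gives
    # lam = (2/5)(2*alpha + 2*beta + gamma).
    lam = 0.4 * (2 * alpha + 2 * beta + gamma)
    a, b, c = alpha - lam / 2, beta - lam / 2, gamma - lam / 2

    kernel = np.array([a, b, c, b, a])
    print(kernel)        # approx [0.07846, 0.94706, -2.05104, 0.94706, 0.07846]
    print(kernel.sum())  # essentially zero, as required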

We can proceed in the same way to compute the kernels to estimate the partial
derivatives using Gaussians in two dimensions.
One implementation of the first derivative with respect to x, assuming an isotropic
Gaussian, is presented in Fig. 5.7. You will have the opportunity to derive others as
homework.
In this chapter, we have explored the idea of edge operators based on kernel operators.
We discovered that no matter what, noisy images result in edges which are:

- too thick in places
- missing in places
- extraneous in places.

That is just life – we cannot do any better with simple kernels. In Chapter 6, we
will explore some approaches to these problems.
As we hope you have guessed, there are other ways of finding edges in images
besides simply thresholding a derivative. In later sections, we will mention a few of
them.

5.7.1 Higher order derivatives
We have just seen how second or third derivatives may be computed using derivatives
of Gaussians. Since the topic has come up, and you will need to know the terminology
later, we define here two scalar operators which depend on the second derivatives:
the Laplacian and the quadratic variation.
The Laplacian of brightness at a point x, y is
∂²f/∂x² + ∂²f/∂y²,
whereas the quadratic variation of brightness is

(∂²f/∂x²)² + (∂²f/∂y²)² + 2(∂²f/∂x∂y)².
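For concreteness (this sketch is ours, not in the text), both quantities can be computed from an image with simple central-difference estimates of the second derivatives:

    import numpy as np

    def second_derivatives(f):
        # Central-difference estimates of fxx, fyy, fxy (interior pixels only).
        fxx = f[1:-1, 2:] - 2 * f[1:-1, 1:-1] + f[1:-1, :-2]
        fyy = f[2:, 1:-1] - 2 * f[1:-1, 1:-1] + f[:-2, 1:-1]
        fxy = (f[2:, 2:] - f[2:, :-2] - f[:-2, 2:] + f[:-2, :-2]) / 4.0
        return fxx, fyy, fxy

    f = np.random.rand(64, 64)
    fxx, fyy, fxy = second_derivatives(f)
    laplacian = fxx + fyy                       # a scalar image
    quad_var = fxx**2 + fyy**2 + 2 * fxy**2     # likewise scalar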
The Laplacian may be approximated by several kernels, including

−1   2  −1
 2  −4   2
−1   2  −1

5.8 Computing convolutions
Remembering that the only difference between convolution and a kernel operator is
the direction of x and y (section 5.2.1), for any kernel operator there is an equivalent
convolution kernel. Therefore efficient ways to calculate convolution are also
efficient ways to apply kernel operators. The convolution operation may be computed
directly as discussed above. It is simply a sum of products, calculated in the neigh-
borhood of each pixel. However, it may also be computed by the Fourier transform.
The Fourier transform of a convolution is the product of the Fourier transforms of the
two arguments. That is (denoting convolution by the operator ⊗), we are concerned
with computing

g(x, y) = f(x, y) ⊗ h(x, y).
Let the Fourier transforms of the two images and the convolution kernel be defined
by

G(ω_x, ω_y) = F(g(x, y))
F(ω_x, ω_y) = F(f(x, y))
H(ω_x, ω_y) = F(h(x, y))
where the symbol F denotes the process of taking the Fourier transform. Remember
from section 4.1.5, the Fourier transform of an image (a function of two variables)
is itself a function of two variables. We refer to those variables, ω_x and ω_y, as the
spatial frequencies in the x and y directions, respectively. Then G is the product of
F and H:

G(ω_x, ω_y) = F(ω_x, ω_y) · H(ω_x, ω_y).   (5.35)
The "product" of two transforms means, for each spatial frequency value (each
combination of ω_x and ω_y), multiply the values of the two functions. (Just in case
you do not remember the details, in general, these values are complex numbers.)
The ability to perform point-by-point multiplication is significant because of the
computational complexity. Consider the complexity of convolving an N × N image
with an L × L kernel. Doing it spatially, we have L × L multiplications for each
pixel, totalling N²L². Doing it with the Fourier transform (this assumes you are using
a particular algorithm called the fast Fourier transform) works out in the following
way. (The details are outside the scope of this section, but we will assume that the
Fourier transform of an N × N image is itself a two-dimensional array of the same
size.)

- Transform f: N² log N.
- Transform h: L² log L.
- Perform appropriate operations, such as padding, to get H and F the same size.
- Multiply H by F: N².
- Inverse transform the result: N² log N.
If the sum of these four terms is smaller than N²L², it is computationally more
effective to go through the (considerable) inconvenience of using the transform
domain. But for a particular image size and kernel size, what should we do? Fortunately,
the relative efficiency of Fourier and spatial methods for computing convolutions
of varying sizes has been analyzed, and the results are illustrated in Fig. 5.8.
In that figure, we see that for kernels larger than about 15 × 15, we should use
Fourier methods; for kernels smaller than 7 × 7, we should use spatial methods.
The peculiar variations in the region boundary occur because the FFT requires that
the image size be a power of two, and images of other sizes introduce additional
complications.
Fig. 5.8. Efficiency of computing a convolution with an L × L kernel on an N × N image. Combinations above the curve shown are more efficiently computed using Fourier methods; below the curve, by spatial methods (redrawn from Pratt [5.33]).
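The tradeoff in Fig. 5.8 is easy to experiment with. The sketch below (ours; Python/NumPy, and it sidesteps the power-of-two complication by letting NumPy handle arbitrary transform sizes) computes the same convolution both spatially and through the Fourier domain:

    import numpy as np

    def conv2_direct(f, h):
        # Direct spatial convolution, 'same' size, zero boundary; h is L x L, L odd.
        N, M = f.shape
        L = h.shape[0]
        r = L // 2
        fp = np.pad(f, r)
        hf = h[::-1, ::-1]                 # flip: convolution, not correlation
        g = np.zeros((N, M))
        for i in range(N):
            for j in range(M):
                g[i, j] = np.sum(fp[i:i+L, j:j+L] * hf)
        return g

    def conv2_fft(f, h):
        # The same result computed as a product of Fourier transforms, Eq. (5.35).
        N, M = f.shape
        shape = (N + h.shape[0] - 1, M + h.shape[1] - 1)
        G = np.fft.fft2(f, s=shape) * np.fft.fft2(h, s=shape)
        g_full = np.real(np.fft.ifft2(G))
        r = h.shape[0] // 2
        return g_full[r:r+N, r:r+M]

    f = np.random.rand(128, 128)
    h = np.random.rand(7, 7)
    print(np.allclose(conv2_direct(f, h), conv2_fft(f, h)))   # True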
5.9 Scale space
“Scale space” is a recent addition to the well-known concept of image pyramids, first
used in picture processing by Kelly [5.19] and later extended in a number of ways
(see [5.5, 5.8, 5.30, 5.32], and many others). In a pyramid, a series of representations
of the same image are generated, each created by a 2 : 1 subsampling (or averaging)
of the image at the next higher level (Fig. 5.9).
In Fig. 5.10, a Gaussian pyramid is illustrated. It is generated by blurring each
level with a Gaussian prior to 2 : 1 subsampling. An interesting question should arise
as you look at this figure. Could you, from all the data in this pyramid, reconstruct
the original image? The answer is "no, because at each level, you are throwing away
high-frequency information."

[Margin note: Usually, when we say "scale-space", we do not mean a pyramid, we mean a varying blur.]

Although the Gaussian pyramid alone does not contain sufficient information
to reconstruct the original image, we could construct a pyramid that does contain
sufficient information. To do that, we use a "Laplacian" pyramid, constructed by
computing a similar representation of the image; this preserves the high-frequency
information (Fig. 5.11). Combining the two pyramid representations allows reconstruction
of the original image.
In a modern scale space representation we preserve the concept that each level
is a blurring of the previous level, but do not subsample – each level is the same
size as the previous level, but more blurred. Normally, each level is generated by
convolving the original image with a Gaussian of variance σ², and σ varies from
one level to the next. This variance then becomes the "scale parameter." Clearly, at
high levels of scale (σ large), only the largest features are visible. We will see more
about scale space later in this chapter, when we talk about wavelets.
Fig. 5.9. A pyramid is a data structure which is a series of images, in which each pixel is the average of four pixels at the next lower level.

Fig. 5.10. A Gaussian pyramid, constructed by blurring each level with a Gaussian and then 2 : 1 subsampling.
Fig. 5.11. This Laplacian pyramid is actually computed by a difference of Gaussians.

5.9.1 Quad trees
A quad tree [5.21] is a data structure in which images are recursively broken into
four blocks, corresponding to nodes in a tree. The four blocks are designated NW
(north–west), NE, SW, and SE. The correspondence between the nodes in the tree
and the image are best illustrated by an example (see Fig. 5.12).
In encoding binary images, it is straightforward to come up with a scheme for
generating the quad tree for an image: If the quadrant is homogeneous (either solid
black or solid white), then make it a leaf, otherwise divide it into four quadrants and
add another layer to the tree. Repeat recursively until the blocks either reach pixel
size or are homogeneous.
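That recursion is only a few lines of code. Here is a sketch (ours, not from the text; it assumes a square binary image whose side is a power of two):

    import numpy as np

    def quadtree(img):
        # Encode a binary image as either a leaf value or a 4-tuple of
        # subtrees in the order (NW, NE, SW, SE).
        if img.min() == img.max():               # homogeneous block: a leaf
            return int(img[0, 0])
        n = img.shape[0] // 2
        return (quadtree(img[:n, :n]), quadtree(img[:n, n:]),    # NW, NE
                quadtree(img[n:, :n]), quadtree(img[n:, n:]))    # SW, SE

    img = np.zeros((8, 8), dtype=int)
    img[:4, 4:] = 1                              # one solid quadrant
    print(quadtree(img))                         # (0, 1, 0, 0)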
It is easy to make a quad tree representation into a pyramid. It is only necessary to
keep, at each node, the average of the values of its children. Then, all the information
in a pyramid is stored in the quad tree.
If an image has large homogeneous regions, a quad tree would seem to be an
efficient way to store and transmit an image. However, experiments with a variety
of images, even images which were the difference between two frames in a video

0 1 3
0 1 2

310
30
312
2

33 32
30 32 33

310 311 312 313

Fig. 5.12. An image is divided into four blocks. Each inhomogeneous block is further divided.
This partitioning may be represented by a tree.
87 5.9 Scale space

sequence, have shown that this is not true. Since the difference image is only nonzero
where things are moving, it seems obvious that this, mostly zero, image would be
efficiently stored in a quad tree. Not so. Even in that case, the overhead of managing
the tree overwhelms the storage gains. So, surprisingly, the quad tree is not an
efficient image compression technique. When used as a means for representing a
pyramid, it does, however, have advantages as a way of representing scale space.
Another disadvantage of using quad trees is that a slight movement of an object
can result in radically different tree representations, that is, the tree representation
is not rotation or translation invariant. In fact, it is not even robust. Here, “robust”
means a small translation of an object results in a correspondingly small change in the
representation. One can get around this problem, to some extent, by not representing
the entire image, but instead, representing each object subimage with a quad tree.
The generalization of the quad tree to three dimensions is called an “octree.” The
same principles apply.

5.9.2 Gaussian scale structures
[Margin note: A good way to remember large and small scale is that at large scale, only large objects may be distinguished.]

We know how to blur an image. Here is a gedankenexperiment for you: Take an
image, blur it with a Gaussian kernel of standard deviation 1. You get a new image.
Call that image 1. Now, blur the original image with a Gaussian kernel of standard
deviation 2. Call that image 2. Continue until you have a set of images which you can
think of as stacked, and the "top" image is almost blurred away. We say the top image
is a representation of the image at "large scale." This stack of images is referred to
as a "scale space" representation. Clearly, we are not required to use integer values
of the standard deviation, so we can create scale space representations with as much
resolution in scale as desired. The essential premise of scale space representations
is that certain features can be tracked over scale, and how those features vary with
scale tells something about the image. A scale space has been formally defined [5.25,
5.26] as having the following properties.
- All signals should be defined on the same domain (no pyramids).
- Increasing values of the scale parameter should produce coarser representations.
- A signal at a coarser level should contain less structure than a signal at a finer level. If one considers the number of local extrema as a measure of smoothness, then the number of extrema should not increase as we go to coarser scale. This property is called "scale space causality."
- All representations should be generated by applications of a convolution kernel to the original image.
The last property is certainly debatable, since convolution formally requires a lin-
ear, space-invariant operator. One interesting approach to scale space which violates
this requirement is to produce a scale space by using gray scale morphological
88 Linear operators and kernels

smoothing (we will discuss this later) with larger and larger structuring elements
[5.16].
You could use scale space concepts to represent texture [4.16] or even a prob-
ability density function (in which case, your scale space representation becomes a
clustering algorithm [5.24]) as well as brightness. We will see applications of scale
representations as we proceed through the course.
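For concreteness, here is one way (ours, not from the text) to build the stack of the gedankenexperiment above, using separable 1D Gaussian convolutions:

    import numpy as np

    def gauss_kernel(sigma):
        r = int(3 * sigma + 0.5)                 # truncate at about 3 sigma
        x = np.arange(-r, r + 1)
        k = np.exp(-x**2 / (2 * sigma**2))
        return k / k.sum()

    def gauss_blur(img, sigma):
        # Separable Gaussian blur: convolve the rows, then the columns.
        k = gauss_kernel(sigma)
        tmp = np.apply_along_axis(np.convolve, 1, img, k, mode='same')
        return np.apply_along_axis(np.convolve, 0, tmp, k, mode='same')

    img = np.random.rand(128, 128)
    stack = [gauss_blur(img, s) for s in (1, 2, 4, 8, 16)]
    # stack[-1] is the representation at "large scale"; every level is the
    # same size as the original, as the scale space definition requires.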
One of the most interesting aspects of scale space representations is the behavior
of our old friend, the Gaussian. The second derivative of the Gaussian (in two
dimensions, the Laplacian of Gaussian: LOG) has been shown [5.27] to have some
very nice properties when used as a kernel. In particular, the zero crossings of the
LOG are good indicators of the location of an edge. One might be inclined to ask,
“Is the Gaussian the best smoothing operator to use to develop a kernel like this?”
Said another way: We want a kernel whose second derivative never generates a new
zero crossing as we move to larger scale. In fact, we could state this desire in the
following more general form.
Let our concept of a “feature” be a point where some operator has an extreme,
either maximum or minimum. The concept of scale space causality says that as
scale increases, as images become more blurred, new features are never created. The
Gaussian is the ONLY kernel (linear operator) with this property [5.1, 5.2]. Studies
of nonlinear operators have been done to see under what conditions these operators
are scale space causal [5.22].
This idea of scale space causality is illustrated in the following example. Fig. 5.13
illustrates the brightness profile along a single line from an image, and the scale space
created by blurring that single line with one-dimensional Gaussians of increasing
variance. In Fig. 5.14, we see the Laplacian of the Gaussian, and the points where
the Laplacian changes sign. The features in this example, the zero crossings (which
are good candidates for edges), are indicated in the right image. Observe that as
scale increases, feature points (in this case, zero crossings) are never created. As
we go from top (low scale) to bottom (high scale), some features disappear, but no
new ones are created.

One obvious application of this idea is to identify the important edges in the image
first. We can do that by going up in scale, finding those few edges, and then tracking
them down to lower scale.

5.10 Quantifying the accuracy of an edge detector
Since there are many options in the design of an edge detection algorithm, we need
some objective ways to say that one edge detector works better than another. Pratt
[5.33] has suggested a simple formula to address this question. The formula is just
Fig. 5.13. (a) Brightness profile of a scanline through an image. (b) Scale space representation of that scanline. Scale increases toward the bottom, so no new features should be created as one goes from top to bottom.
Fig. 5.14. Laplacian of the scale space representation, and the zero crossings of the Laplacian.
Since this is one-dimensional data, there is no difference between the Laplacian and
the second derivative.

a summation over the edge points:

R = (1/I_N) Σ_{i=1}^{I_A} 1/(1 + αd_i²)   (5.36)

where I_N = max(I_I, I_A); I_I denotes the number of edge points detected, and I_A is
the number of edge points actually in the image. α is a constant scale factor, and d_i
is a measure of the distance from the ith detected edge point to the nearest actual edge.
This formula requires, of course, some knowledge of where the edge points should
be. Consequently, it is of primary usefulness when applied to synthetic data, since
only in such data can we be absolutely sure of the actual position of edge points (see
also [5.4]).
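A direct transcription of Eq. (5.36) might look like the following (our sketch; both arguments are binary edge maps, and d_i is taken as Euclidean distance to the nearest true edge pixel):

    import numpy as np

    def pratt_fom(detected, actual, alpha=1.0 / 9.0):
        # Pratt's figure of merit, Eq. (5.36); alpha = 1/9 is a customary choice.
        det = np.argwhere(detected)
        act = np.argwhere(actual)
        if len(det) == 0 or len(act) == 0:
            return 0.0
        I_N = max(len(det), len(act))
        total = 0.0
        for p in det:   # sum over detected points; their distances d define R
            d2 = np.min(np.sum((act - p) ** 2, axis=1))   # squared distance
            total += 1.0 / (1.0 + alpha * d2)
        return total / I_N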
5.11 So how do people do it?
Two neurophysiologists, David Hubel and Torsten Wiesel [5.13, 5.14], stuck some
electrodes in the brains – specifically the visual cortex – first of cats³ and later
of monkeys. While recording the firing of neurons, they provided the animal with
visual stimuli of various types. They observed some fascinating results: First, there
are cells which fire only when specific types of patterns are observed. For example,
a particular cell might only fire if it observed an edge, bright to dark, at a particular
angle. There was evidence that each of the cells they measured received input from
a neighborhood of cells called a “receptive field.” There were a variety of types
of receptive fields, possibly all connected to the same light detectors, which were
organized in such a way as to accomplish edge detection and other processing. Jones
and Palmer [5.17] mapped receptive field functions carefully and confirmed [5.9,
5.10] that the function of receptive fields could be accurately represented by Gabor
functions, which have the form of Eq. (5.37):

G(x, y) = (1/(2πσβ)) exp(−π[x²/σ² + y²/β²]) exp(i[ξx + νy]).   (5.37)
The first exponential is a two-dimensional Gaussian whose isophotes form ellipses
with major and minor axes aligned with the x and y axes. (If you happen to be
dealing with a receptive field which is tilted with respect to x and y, you need to
rotate your coordinate system to make this equation still hold.) The second (complex)
exponential represents a plane wave. Eq. (5.37) assumes the origin is at the center
of the Gaussian. Fig. 5.15 illustrates a Gabor filter.
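A sketch (ours) that samples Eq. (5.37) on a grid to produce a complex-valued Gabor kernel; the parameter values in the example are arbitrary:

    import numpy as np

    def gabor_kernel(sigma, beta, xi, nu, radius):
        # Sample Eq. (5.37) on a (2*radius+1) square grid centered at the origin.
        y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(float)
        envelope = np.exp(-np.pi * (x**2 / sigma**2 + y**2 / beta**2))
        wave = np.exp(1j * (xi * x + nu * y))
        return envelope * wave / (2 * np.pi * sigma * beta)

    g = gabor_kernel(sigma=2.0, beta=4.0, xi=1.5, nu=0.0, radius=8)
    # np.real(g) and np.imag(g) give the even- and odd-symmetric filters.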
The following interesting observations have been made [5.23] regarding the values
of the parameters in Eq. (5.37), when those parameters are actually measured in living
organisms:

- The aspect ratio, β/σ, of the ellipse is 2 : 1.
- The plane wave tends to propagate along the short axis of the ellipse.
- The half-amplitude bandwidth of the frequency response is about 1 to 1.5 octaves along the optimal orientation.

So do we have Gabor filters in our brains? Or Gabor-like wavelet generators? Well,
we don't know about YOUR brain, but Young [5.41] has looked at the stimulus–
response characteristics of mammalian retina and observed that those same receptive
fields that are so nicely modeled with Gabor filters or wavelets may equally well
be described in terms of kernels called “difference of offset Gaussians,” which is
essentially the LOG with an additive Gaussian offset. Fig. 5.16 illustrates a section

³ We were going to put a dead cat joke here, something like "One of the cats died during the procedure, but its behavior was unchanged," but the publisher told us people would be offended, so we had to remove it.
Fig. 5.15. Gabor filter. Note that the positive/negative response is very similar to those that we derived earlier in this chapter.
Fig. 5.16. Comparison of a section through a Gabor filter and a similar section through a fourth
derivative of a Gaussian.

through a Gabor and a fourth derivative of a Gaussian. You can see noticeable
differences, the principal one being that the Gabor goes on forever, whereas the fourth
derivative has only three extrema. However, to the precision available to neurology
experiments, they are the same. The problem is simply that the measurement of
the data is not sufficiently accurate, and it is possible to fit a variety of curves
to it.
Bottom line: We have barely a clue how the brain works, and we don’t really
know very much about the retina. There are two or three mathematical models
which adequately model the behavior of receptive fields.

5.12 Conclusion
[Margin note: Consistency in edge detection.]

Explicit use of consistency has not been made in this chapter. However, in Assignment
10.1, you will see an application of consistency to edge detection. In that
problem, you will be asked to develop an algorithm which makes use of the fact
that adjacent edge pixels have parallel gradients. That is, if pixel A is a neighbor
of pixel B, and the gradient at pixel A is parallel (or nearly parallel) to the gradient
at pixel B, this increases the confidence that both pixels are members of the same
edge.
In this chapter, we have looked at several ways to derive kernel operators which,
when applied to images, result in strong responses for types of edges.

- We applied the definition of the derivative.
- We fit an analytic function to a surface, by minimizing the sum squared error. [Margin note: Minimize the sum-squared error.]
- We converted subimages into vectors, and projected those vectors onto special basis vectors which described edge-like characteristics.
- We made use of the linearity of kernel operators to interchange the role of blur and differentiation, to construct kernels which are the derivatives of special blurring kernels. We used constrained optimization and Lagrange multipliers to solve this problem. [Margin note: Constrained optimization and Lagrange multipliers.]

5.13 Vocabulary
You should know the meanings of the following terms.

Basis vector
Convolution
Correlation
Gabor filter
Image gradient
Inner product
Kernel operator
Lagrange multiplier
Lexicographic
Linear operator
LOG
Projection
Pyramid
Quad tree
Scale space
Sum-squared error

Assignment 5.1
The previous section showed how to estimate the first
derivative by fitting a plane. Clearly that will
not work for the second derivative, since the second
derivative of a plane is zero everywhere. Use the same
approach, but use a biquadratic

f(x, y) = ax² + by² + cx + dy + e.

Then [a b c d e]^T = A and [x² y² x y 1]^T = X.

Find the 3 × 3 kernel which estimates ∂²f/∂x².

Assignment 5.2
Oh no! Another part of the assignment! (Hey, this one
is much easier than the one above -- a real piece of
cake.) Using the same methods, find the 5 × 5 kernel
which estimates ∂f/∂x at the center point, using the
equation of a plane.

Assignment 5.3
Determine whether u₁ and u₂ in Fig. 5.4 are in fact
orthonormal. If not, recommend a modification or other
approach which will allow all the u's to be used as
basis functions.
Assignment 5.4
(1) Write a program to generate an image which is
64 × 64, as illustrated below.

50 100
32

25 10 16

32

The images should contain areas of uniform brightness,
with dimensions and brightness as shown. Save
this in a file and call it "SYNTH1."
(2) Write a program which reads in SYNTH1, and applies
the blurring kernel

(1/10) ×
1 1 1
1 2 1
1 1 1

and write the answer to a file named "BLUR1."

(3) Add Gaussian random noise of variance σ² = 9 to
BLUR1, and write the output to a file "BLUR1.V1."

Assignment 5.5
Write a program to apply the following two kernels
(referred to in the literature as the "Sobel operators")
to images SYNTH1, BLUR1, and BLUR1.V1.

hx =
−1 0 1
−2 0 2
−1 0 1

hy =
−1 −2 −1
 0  0  0
 1  2  1

To accomplish this, perform the following.
(1) Apply hx to the input, save the result as a temporary
image in memory (remember that the numbers CAN
be negative).
(2) Apply hy to the input, save the result as another
array in memory.
(3) Compute a third array in which each point in the
array is the sum of the squares of the corresponding
points in the two arrays you just saved. Finally,
take the square root of each point. Save the result.
(4) Examine the values you get. Presumably, high values
are indicative of edges. Choose a threshold value
and compute a new image, which is one whenever the
edge strength exceeds your threshold and is zero
otherwise.
(5) Apply steps (1)--(4) to the blurred and noisy images
as well.
(6) Write a report. Include a printout of all three
binary output images. Are any edge points lost? Are
any points artificially created? Are any edges too
thick? Discuss sensitivity of the result to noise,
blur, and choice of threshold.
Be thorough; this is a research course which
requires creativity and exploring new ideas, as
well as correctly doing the minimum required by the
assignment.

Assignment 5.6
In Assignment 5.2, you derived a 5 × 5 kernel. Repeat
Assignment 5.5 using that kernel, for ∂/∂x and the
appropriate version for ∂/∂y.

Assignment 5.7
(1) Verify the mathematics we did in Eq. (5.30). Find
a 3 × 3 kernel which implements the derivative-of-
Gaussian vertical edge operator of Eq. (5.30). Use
σ = 1 and σ = 2 and determine two kernels. Repeat for
a 5 × 5 kernel. Discuss the impact of the choice of
σ and its relation to kernel size. Assume the kernel
may contain real (floating point) numbers.
(2) Suppose the kernel can only contain integers.
Develop kernels which produce approximately the same
result.

Assignment 5.8
In section 5.7, parameters useful for developing discrete
Gaussian kernels were discussed. Prove that the
value of σ such that the maxima of the second derivative
occur at x = −1 and x = 1 is σ = 1/√3.

Assignment 5.9
Use the method of fitting a polynomial to estimate
∂²f/∂y². Which of the following polynomials would be
most appropriate to choose?

(a) f = ax² + by + cxy
(b) f = ax² + by² + cxy + d
(c) f = ax³ + by³ + cxy
(d) f = ax + by + c

Assignment 5.10
Fit the following expression to pixel data in a 3 × 3
neighborhood: f(x, y) = ax² + bx + cy + d. From this fit,
determine a kernel which will estimate the second
derivative with respect to x.

Assignment 5.11
Use the function f = ax² + by² + cxy to find a 3 × 3 kernel
which estimates ∂²f/∂y². Which of the following is the
kernel which results? (Note: The following answers do
not include the scale factor. Thus, the best choice
below will be the one proportional to the correct
answer.)

(a)
6 4 0
4 6 0
0 0 4

(b)
1 −2 1
3  0 3
1 −2 1

(c)
 2 6  2
−4 0 −4
 2 6  2

(d)
 1 0 −1
−2 0  2
 1 0 −1

(e)
6 4 0
4 6 0
0 0 1

(f)
 1  2  1
−2  0 −2
−1 −2 −1
Assignment 5.12
Suppose the kernel that estimates ∂²f/∂x∂y is

 2 −1  0
−1  0  1
 0  1 −2

(It may not be true that this kernel estimates that
derivative, but it does not affect the answer. Assume
it is true.) Is it true then that

4 1 0
1 0 1
0 1 4

estimates (∂²f/∂x∂y)²? Explain your answer.

Topic 5A Edge detectors
The process of edge detection includes more than simply thresholding the gradient. We want
to know the location of the edge more precisely than simple gradient thresholds reveal. Two
methods which seem to have acquired a substantial reputation in this area are the so-called
"Canny edge detector" [5.6] and the "facet model" [4.18]. Here, we describe only the Canny
edge detector.

5A.1 The Canny edge detector
The edge detection algorithm begins with finding estimates of the gradient magnitude at
each point. Canny uses 2 × 2 rather than 3 × 3 kernels as we have, but this does not affect the
philosophy of the approach. Once we have estimates of the two partial derivatives, we use
Eqs. (5.22) through (5.24) to calculate the magnitude and direction of the gradient, producing
two images, M(x, y) and Θ(x, y). We now have a result which easily identifies the
pixels where the magnitude of the gradient is large. That is not sufficient, however, as we
now need to thin the magnitude array, leaving only points which are maxima, creating a new
image N(x, y). This process is called nonmaximum⁴ suppression (NMS).
NMS may be accomplished in a number of ways. The essential idea, however, is as follows:
First, initialize N (x, y) to M(x, y). Then, at each point, (x, y), look one pixel in the direction

⁴ This is sometimes written using the plural, nonmaxima suppression. The expression is ambiguous. It could be a compression of suppress every point which is not a maximum, or suppress all points which are not maxima. We choose to use the singular.
of the gradient, and one pixel in the reverse direction. If M(x, y) (the point in question) is
not the maximum of these three, set its value in N (x, y) to zero. Otherwise, the value of N is
unchanged.
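A coarse sketch of that idea (ours, not Canny's own code; it quantizes the gradient direction to the nearest of the eight neighbors and leaves border pixels alone):

    import numpy as np

    def nonmax_suppress(M, theta):
        # Keep M(x,y) only where it is the maximum of the three samples
        # taken along the (quantized) gradient direction; theta in radians.
        N = M.copy()
        dx = np.rint(np.cos(theta)).astype(int)
        dy = np.rint(np.sin(theta)).astype(int)
        rows, cols = M.shape
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                ahead = M[i + dy[i, j], j + dx[i, j]]
                behind = M[i - dy[i, j], j - dx[i, j]]
                if M[i, j] < ahead or M[i, j] < behind:
                    N[i, j] = 0.0
        return N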
After NMS, we have edges which are properly located and are only one pixel wide. These
new edges, however, still suffer from the problems we identified earlier – extra edge points
due to noise (false hits) and missing edge points due either to blur or to noise (false misses).
Some improvement can be gained by using a dual-threshold approach. Two thresholds are
used, τ₁ and τ₂, where τ₂ is significantly larger than τ₁. Application of these two different
thresholds to N(x, y) produces two binary edge images, denoted T₁ and T₂ respectively. Since
T₁ was created using a lower threshold, it will contain more false hits than T₂. Points in T₂
are therefore considered to be parts of true edges. Connected points in T₂ are copied to the
output edge image. When the end of an edge is found, points are sought in T₁ which could
be continuations of the edge. The continuation proceeds until it connects with another T₂
edge point, or no connected T₁ points are found.
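One common way to implement that linking (our sketch) is a flood fill: points above τ₂ seed a search that absorbs any connected points above τ₁:

    import numpy as np

    def hysteresis(N, tau1, tau2):
        # Dual-threshold edge linking: strong points (N > tau2) are edges;
        # weak points (N > tau1) survive only if 8-connected to a strong point.
        strong = N > tau2
        weak = N > tau1
        edges = strong.copy()
        stack = [tuple(p) for p in np.argwhere(strong)]
        while stack:
            i, j = stack.pop()
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    u, v = i + di, j + dj
                    if (0 <= u < N.shape[0] and 0 <= v < N.shape[1]
                            and weak[u, v] and not edges[u, v]):
                        edges[u, v] = True
                        stack.append((u, v))
        return edges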
In [5.6] Canny also illustrates some clever approximations which provide significant
speedups.

5A.2 Improvements to edge detection
[Margin note: The derivative of g(x) is the second derivative of the image, so look for zero crossings.]

Tagare and deFigueiredo [5.34] (see also [5.1]) describe the process of edge detection as
follows.

(1) The input is convolved with a filter which smoothly differentiates the input and produces high values at and near the location of the edge. The output g(x) is the sum of the differentiated step edge and filtered noise.
(2) A decision mechanism isolates regions where the output of the filter is significantly higher than that due to noise.
(3) A mechanism identifies the zero crossing in the derivative of g(x) in the isolated region and declares it to be the location of the edge.

A Gaussian low-pass filter followed by finding a zero in the (second) derivative accurately
(to subpixel resolution) finds the exact location of the edge, but only if the edge is straight
[5.38]. If the edge is curved, errors are introduced. For example, the second derivative in
the gradient direction (SDGD) and the Laplacian both make errors in estimating the location
of the edge, but interestingly, in opposite directions, which prompts Verbeek and van Vliet
[5.38] to suggest an operator which is the sum of the two.
All the methods cited or described in this section perform signal processing in a direction
normal to the edge [5.18] in order to better locate the actual edge. Taratorin and Sideman
[5.35] present a way to make use of the fact that the images are known to have properties such
as positivity and finite support to improve the accuracy with which derivatives may be esti-
mated. Iverson and Zucker [5.15] improve the results of the Canny detector by adding logical/Boolean
reasoning. This improves edge detection over simple thresholding of the derivative, but does
not provide results as good as active contours (see Chapter 9) or optimization (see Chapter 6).
There are a wide variety of papers on signal processing techniques applied to edge detection
[5.36].
Examination of the literature in biological imaging systems, all the way back to the pioneering
work of Hubel and Wiesel in the 1960s [5.13], suggests that biological systems analyze
images by making local measurements which quantify orientation, scale, and motion. Keeping
this in mind, suppose we wish to ask a question like "is there an edge at orientation θ
at this point?" How might we construct a kernel that is specifically sensitive to edges at that
orientation? A straightforward approach [5.37] is to construct a weighted sum of the two
Gaussian first derivative kernels, G_x and G_y, using a weighting something like

G_θ = G_x cos θ + G_y sin θ.   (5.38)

[Margin note: Could you calculate the orientation selectivity? What is the smallest angular difference you could detect with a 3 × 3 kernel determined in this way?]

Unfortunately, unless quite large kernels are used, the kernels obtained in this way have rather
poor orientation selectivity. In the event that we wish to differentiate across scale, the problem
is even worse, since a scale space representation is normally computed rather coarsely, to
minimize computation time. Perona [5.31] provides an approach to solving these problems.
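In code, Eq. (5.38) is nothing more than a weighted sum of two fixed kernels. A sketch (ours; we reuse the first-derivative kernel of Fig. 5.7 for G_x):

    import numpy as np

    Gx = np.array([[0.0261, 0.0, -0.0261],
                   [0.1080, 0.0, -0.1080],
                   [0.0261, 0.0, -0.0261]])
    Gy = Gx.T    # the corresponding y-derivative kernel

    def oriented_kernel(theta):
        # Eq. (5.38): a kernel tuned to edges at orientation theta (radians).
        return Gx * np.cos(theta) + Gy * np.sin(theta)

    k30 = oriented_kernel(np.deg2rad(30.0))
    # Applying oriented_kernel at several nearby angles to the same edge
    # shows how weak the orientation selectivity of a 3 x 3 kernel is.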

5A.3 Inferring line segments from edge points
After we have chosen the very best operators to estimate derivatives, have chosen the best
thresholds, and selected the best estimates of edge position, we still have nothing but a set
of pixels, some of which have been marked as probably part of an edge. If those points are
adjacent, one could “walk” from one pixel to the next, eventually circumnavigating a region,
and there are representations such as the chain code which make this process easy. However,
the points are unlikely to be connected the way we would like them to be. Some points may
be missing due to blur, noise, or partial occlusion. There are many ways to approach this
problem, including relaxation labeling, and parametric transforms, both of which will be
discussed in detail later in this book. In addition, there are combination methods, such as the
work of Deng and Iyengar [5.11] which combines relaxation and Bayes’ methods as well as
other methods [5.29] which we do not have space to discuss.

5A.4 Space/frequency representations
Wavelets are very important, but a thorough examination of this area is beyond the scope of
this book. Therefore we present only a rather superficial description here, and provide some
pointers to literature. For example, Castleman [4.6] has a readable chapter on wavelets.

5A.4.1 Why wavelets?
Consider the image illustrated in Fig. 5.17. Clearly, the spatial frequencies appear to be differ-
ent, depending on where one looks in the image. The Fourier transform has no mechanism for
capturing this intuitive need to represent both frequency and location. The Fourier transform
of this image will be a two-dimensional array of numbers, representing the amount of energy
in the entire image at each spatial frequency. Clearly, since the Fourier transform is invertible,
it captures all this spatial and frequency information, but there is no obvious way to answer
the question: at each position, what are the local spatial frequencies?
Fig. 5.17. An image in which spatial frequencies vary dramatically.

The wavelet approach is to add a degree of freedom to the representation. Since the
Fourier transform is complete and invertible, all we really need to characterize the image is a
single two-dimensional array. Instead, however, following the space/frequency philosophy as
described in section 5.8, we use a three- (or higher) dimensional data structure. In this sense,
the space/frequency representation is redundant (or overcomplete), and requires significantly
more storage than the Fourier transform.

5A.4.2 The basic wavelet and wavelet transform
We define a basic⁵ wavelet ψ(x, y) as any function of the two spatial variables x and y which
meets a certain criterion that we need not concern ourselves with here. Basically, we desire a
function which is symmetric about the origin and has almost finite support. By "almost finite
support," we mean that the magnitude of this function drops off to zero rapidly (in a particular
way defined by the admissibility criterion) as it goes away from its center. A one-dimensional
example basic wavelet is

ψ(x) = (2/(√3 π^(1/4))) (1 − x²) exp(−x²/2)   (5.39)
graphed in Fig. 5.18. From one such wavelet, one may then generate a (potentially infinite) set
of similar functions by translation and scaling of the original. That is, we define a translated,
scaled version of ψ by (again, in one dimension)

ψ_{a,b}(x) = (1/√a) ψ((x − b)/a).   (5.40)
The wavelet transform of a function f is computed by the inner products of f with the wavelet
at each of the possible values of a and b:

W_f(a, b) = ∫_{−∞}^{∞} f(x) ψ_{a,b}(x) dx.   (5.41)
Fig. 5.18. A basic wavelet, and a wavelet generated by scale change.

⁵ The basic wavelet is often referred to as the mother wavelet.
Fig. 5.19. Original and the result of the inner product of the original and three different two-
dimensional wavelets (three slices through the wavelet transform).

Observe that the transform is a function of the scale and the shift. The same idea holds in two
dimensions, where the equation for the inner product becomes

W_f(a, b_x, b_y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) ψ_{a,b_x,b_y}(x, y) dx dy.   (5.42)

Fig. 5.19 illustrates a cross section of W for different values of a. Clearly, this process pro-
duces a scale space representation.
Lee [5.23] takes the neurophysiological evidence and derives a mother wavelet of the
following form:

ψ(x, y) = (1/√(2π)) exp(−(4x² + y²)/8) [exp(iκx) − exp(−κ²/2)],   (5.43)

where κ is a constant whose value depends on assumptions about the bandwidth, but is
approximately equal to 3.5. Scaling and translating this "mother wavelet" produces a collection
of filters which (in the same way that the Frei–Chen basis set did) represents how much of
an image neighborhood resembles the feature, and from which the image can be completely
reconstructed.

5A.5 Vocabulary
You should know the meaning of the following terms.

Canny edge detector
Nonmaximum suppression
Wavelet

Assignment 5.A1
At high levels of scale, only ________ objects are
visible. (Fill in the blank.)

Assignment 5.A2
What does the following expression estimate? (f ⊗ h₁)² +
(f ⊗ h₂)² = E, where the kernel h₁ is defined by

 0.05    0.08    0.05
−0.136  −0.225  −0.136
 0.05    0.08    0.05

and h₂ is created by the transpose.

Choose the best answer:

(a) the first derivative with respect to x (where x is the horizontal direction)
(b) the second derivative with respect to x
(c) the Laplacian
(d) the second derivative with respect to y
(e) the quadratic variation.

Assignment 5.A3
You are to use the idea of differentiating a Gaussian to
derive a kernel. What variance does a one-dimensional (zero
mean) Gaussian need to have so that the extrema of its first
derivative occur at x = ±1?

Assignment 5.A4
What is the advantage of the quadratic variation as compared
to the Laplacian?

Assignment 5.A5
Let E = (f − Hg)^T (f − Hg). Using the equivalence between the
kernel form of a linear operator and the matrix form, write
an expression for E using kernel notation.

Assignment 5.A6
The purpose of this assignment is to walk you through the
construction of an image pyramid, so that you will more fully
understand the potential utility of this data structure, and
can use it in a coding and transmission application. Along
the way, you will pick up some additional understanding of
the general area of image coding. Coding is not a princi-
pal learning objective of this course, but in your research
career, you are sure to encounter people who deal with coding
all the time. It will prove to be useful to have a grasp of
some of the more basic concepts.

(1) Locate the image "asterix512.ifs," and verify that it is
indeed 512 × 512 (use ifs info). If it is not 512 × 512,
then first write a program which will pad it out to that
size.
(2) Write a program which will take a two-dimensional n × n
image and produce an image, of the same data type, which
is n/2 × n/2. The program name should be ShrinkByTwo and
called by
ShrinkByTwo inimg outimg
The program should NOT simply take every other pixel.
Instead, each pixel of the output image should be con-
structed by averaging the corresponding four pixels in
the input image. Note the requirement that the output
image be of the same type as the input. That is easily
accomplished by using ifscreate with the first argu-
ment being the character string defining the type of the
input. For example ifscreate (in->ifsdt, len,
IFS CR ALL,0). Use your program to create asterix256,
asterix128, asterix64, and asterix32. Don’t bother going
below a 32 × 32 image. Turn in your program and a printout
of your images.
(3) Write a subroutine to zoom an image. It should have the
following calling convention:
ZoomByTwo(inimg,outimg)
IFSIMG inimg, outimg;
The calling program is responsible for creating, reading,
etc. of the images. The subroutine simply fills outimg
with a zoomed version of inimg. Use any algorithm you
wish to fill in the missing pixels. (We recommend that
the missing pixels be some average of the input pixels.)
Before we can proceed, we need to consider a “pyramid
coder.” When you ran ShrinkByTwo, the set of images you
produced is the pyramid representation of asterix512. In
a pyramid coder, the objective is to use the pyramid rep-
resentation to transmit as little information as possible
over the channel. Here is the idea: First, transmit all
of asterix32. Then, both the transmitter and receiver run
ZoomByTwo to create a zoomed version of asterix32, some-
thing like
ZoomByTwo(a32, a64prime)
When we created asterix32 from asterix64, we threw
away some information, and we cannot easily get
it back, so a64prime will not be identical to
asterix64. If, however, ZoomByTwo (which in the
image coding literature is called the predictor) is
pretty good, the difference between a64prime and
asterix64 will be small (small in value that is; it is
still 64 × 64). Therefore, compute diff64, the difference
between a64prime and asterix64. If the predictor were
perfect, the resultant difference would be a 64 × 64 im-
age of all zeros, which could be coded in some clever
way (by run-length encoding for example), and transmitted
with very few bits. Let us now transmit diff64 to the re-
ceiver. By simply adding diff64 to the version of a64prime
generated at the receiver, we can correct the errors made
by the predictor, and we now have a correct version of
asterix64 at the receiver, but all we transmitted was
diff64. Clever, aren’t we? Now, from asterix64, we play
the same game and get asterix128 created by transmitting
diff128, et cetera, ad nauseam. Now, the assignment:
(4) Create the images described above, diff64, diff128,
diff256, diff512. Measure the approximate number of bits
required to transmit each of them. To make that measure-
ment: Compute the standard deviation of, say, diff64.
Take the log base 2 of that standard deviation, and that
is the average number of bits per pixel required to code
that image. Assume you were to transmit asterix512 di-
rectly. That would require 512 × 512 × 8 bits (assuming the
image is 8 bit -- you better verify). Now, compare with
the performance of your pyramid coder by adding up all
the bits/pixel you found for each of the diff images you
transmitted. Did your coder work well? Discuss this in
your report.

References

[5.1] V. Anh, J. Shi, and H. Tsai, "Scaling Theorems for Zero Crossings of Bandlimited Signals," IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(3), 1996.
[5.2] J. Babaud, A. Witkin, M. Baudin, and R. Duda, “Uniqueness of the Gaussian Ker-
nel for Scale-space Filtering,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 8(1), 1986.
[5.3] G. Bilbro and W. Snyder, “Optimization of Functions with Many Minima,” IEEE
Transactions on Systems, Man, and Cybernetics, 21(4), July/August, 1991.
[5.4] K. Boyer and S. Sarkar, “On the Localization Performance Measure and Optimal Edge
Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1),
1994.
[5.5] P. Burt and E. Adelson, “The Laplacian Pyramid as a Compact Image Code,” Computer
Vision, Graphics, and Image Processing, 16, pp. 20–51, 1981.
[5.6] J. Canny, “A Computational Approach to Edge Detection,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 8(6), 1986.
[5.7] W. Chojnacki, M. Brooks, A. van der Hengel, and D. Gawley, “On the Fitting of Sur-
faces to Data with Covariances,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(11), 2000.
[5.8] J. Crowley, A Representation for Visual Information, Ph.D. Thesis, Carnegie-Mellon
University, 1981.
[5.9] J. Daugman, “Two-dimensional Spectral Analysis of Cortical Receptive Fields,”
Vision Research, 20, pp. 847–856, 1980.
[5.10] J. Daugman, "Uncertainty Relation for Resolution in Space, Spatial Frequency, and Orientation Optimized by Two-dimensional Visual Cortical Filters," Journal of the Optical Society of America, 2(7), 1985.
[5.11] W. Deng and S. Iyengar, “A New Probabilistic Scheme and Its Application to Edge
Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4),
1996.
[5.12] W. Frei and C. Chen, “Fast Boundary Detection: A Generalization and a New Algo-
rithm,” IEEE Transactions on Computers, 26(2), 1977.
[5.13] D. Hubel and T. Wiesel, “Receptive Fields, Binocular Interaction, and Functional
Architecture in the Cat’s Visual Cortex,” Journal of Physiology (London), 160,
pp. 106–154, 1962.
[5.14] D. Hubel and T. Wiesel, "Functional Architecture of Macaque Monkey Visual Cortex," Proceedings of the Royal Society of London, B, 198, pp. 1–59, 1977.
[5.15] L. Iverson and S. Zucker, “Logical/Linear Operators for Image Curves,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 17(10), 1995.
[5.16] P. Jackway and M. Deriche, “Scale-space Properties of the Multiscale Morphological
Dilation–Erosion,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
18(1), 1996.
[5.17] J. Jones and L. Palmer, “An Evaluation of the Two-dimensional Gabor Filter Model
of Simple Receptive Fields in the Cat Striate Cortex,” Journal of Neurophysiology,
58, pp. 1233–1258, 1987.
[5.18] E. Joseph and T. Pavlidis, “Bar Code Waveform Recognition using Peak Locations,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6), 1994.
[5.19] M. Kelly, in Machine Intelligence, volume 6, University of Edinburgh Press,
1971.
[5.20] M. Kisworo, S. Venkatesh, and G. West, “Modeling Edges at Subpixel Accuracy using
the Local Energy Approach,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(4), 1994.
[5.21] A. Klinger, “Pattern and Search Statistics,” in Optimizing Methods in Statistics,
New York, Academic Press, 1971.
[5.22] P. Kube and P. Perona, “Scale-space Properties of Quadratic Feature Detectors,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(10), 1996.
[5.23] T. Lee, “Image Representation Using 2-D Gabor Wavelets,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 18(10), 1996.
[5.24] Y. Leung, J. Zhang, and Z. Xu, “Clustering by Scale-space Filtering,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 22(12), 2000.
[5.25] T. Lindeberg, “Scale-space for Discrete Signals,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 12(3), 1990.
[5.26] T. Lindeberg, “Scale-space Theory, A Basic Tool for Analysing Structures at Different
Scales,” Journal of Applied Statistics, 21(2), 1994.
[5.27] D. Marr and E. Hildreth, “Theory of Edge Detection,” Proceedings of the Royal Society
of London, B, 207, pp. 187–217, 1980.
[5.28] D. Marr and T. Poggio, “A Computational Theory of Human Stereo Vision,” Proceed-
ings of the Royal Society of London, B, 204, pp. 301–328, 1979.
[5.29] R. Nelson, “Finding Line Segments by Stick Growing,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(5), 1994.
[5.30] E. Pauwels, L. Van Gool, P. Fiddelaers, and T. Moons, “An Extended Class of Scale-
invariant and Recursive Scale Space Filters,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(7), 1995.
[5.31] P. Perona, “Deformable Kernels for Early Vision,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 17(5), 1995.
[5.32] P. Perona and J. Malik, “Scale-space and Edge Detection using Anisotropic Diffusion”,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7), pp. 629–639,
1990.
[5.33] W. Pratt, Digital Image Processing, Chichester, John Wiley and Sons, 1978.
[5.34] H. Tagare and R. deFigueiredo, “Reply to ‘On the Localization Performance Measure
and Optimal Edge Detection’,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(1), 1994.
[5.35] A. Taratorin and S. Sideman, “Constrained Regularized Differentiation,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[5.36] F. van der Heijden, “Edge and Line Feature Extraction Based on Covariance Models,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1), 1995.
[5.37] M. Van Horn, W. Snyder, and D. Herrington, “A Radial Filtering Scheme Applied to
Intracoronary Ultrasound Images,” Computers in Cardiology, September, 1993.
[5.38] P. Verbeek and L. van Vliet, “On the Location Error of Curved Edges in Low-pass
Filtered 2-D and 3-D Images,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(7), 1994.
[5.39] I. Weiss, “High-order Differentiation Filters that Work,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(7), 1994.
[5.40] M. Werman and Z. Geyzel, “Fitting a Second Degree Curve in the Presence of Error,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(2), 1995.
[5.41] R. Young, “The Gaussian Derivative Model for Spatial Vision: I. Retinal Mechanisms,”
Spatial Vision, 2, pp. 273–293, 1987.
6 Image relaxation: Restoration and feature extraction
To change, and to change for the better are two different things
German proverb

In this chapter, we move toward developing techniques which remove noise and
degradations so that features can be derived more cleanly for segmentation. The
techniques of a posteriori image restoration and iterative image feature extraction
are described and compared. While image restoration methods remove degradations
from an image [6.3], image feature extraction methods extract features such as edges
from noisy images. Both are shown to perform the same basic operation: image re-
laxation. In the advanced topics section, image feature extraction methods, known
as graduated nonconvexity (GNC) and variable conductance diffusion (VCD), are
compared with a restoration/feature extraction method known as mean field anneal-
ing (MFA). This equivalence shows the relationship between energy minimization
methods and spatial analysis methods and between their respective parameters of
temperature and scale. The chapter concludes by discussing the general philosophy
of extracting features from images.

6.1 Relaxation
The term “relaxation” was originally used to describe a collection of iterative numeri-
cal techniques for solving simultaneous nonlinear equations (see [6.18] for a review).
The term was extended to a set of iterative classification methods by Rosenfeld and
Kak [6.64] because of their similarity. Here, we provide a general definition of the
term which will encompass these methods as well as those more recent techniques
which are the emphasis of this discussion.

Definition

[Margin note: A relaxation process is an iteration.]

A relaxation process is a multistep algorithm with the property that (1) the output of
a single step is of the same form as the input, so that the algorithm may be applied
iteratively, and (2) it converges to a bounded result. Some researchers also require
that the operation on any element (any pixel, in our application) be dependent only
on the state of the pixels in some well-defined, finite "neighborhood" of that element.
We will see that all the algorithms discussed here are relaxation processes, according
to these criteria.

6.2 Restoration
In an image restoration problem, we assume that an ideal image, f, has been corrupted
to create the measured image, g. The usual model for the corruption is a distortion
operation, denoted by D, followed by the addition of random noise

g = D( f ) + n, (6.1)

where g = [g₁, . . . , g_N]^T and g_i denotes the ith pixel in a column vector representation
of the image g; f and n are similarly defined. The restoration problem,
then, is the problem of finding a best estimate of f given the measurement, g, some
knowledge of the distortion (often called “blur”), and the statistics of the noise.
Restoration is often referred to as an inverse problem. That is, we have a process
(in this case blur) which takes an input and produces an output. We can only measure
the output, and we wish to infer the input.

6.2.1 Inverse problems and ill-posedness
A problem g = D(f) is said [6.32] to be well-posed if

- for each f, a solution, g, exists
- the solution is unique
- the solution g continuously depends on the data f.

[Margin note: Definition of an ill-posed problem.]

If these three conditions do not all hold, the problem is said to be "ill-posed." Ill-posedness
is normally caused by the ill-conditioning of the problem. Conditioning of
a mathematical problem is measured by the sensitivity of output to changes in input.
For a well-conditioned problem, a small change of input does not affect the output
much; while for an ill-conditioned problem, a small change of input can change the
output a great deal.
Condition number is the measurement of the conditioning of a problem. Generally,
it is defined as Eq. (6.2). The larger the condition number, the more ill-conditioned
the problem is:
condition number ≈ (change in output) / (change in input). (6.2)
The conditioning of a linear system Ax = b is determined by the condition number
of matrix A. The relative condition number K is defined as Eq. (6.3),

K = ‖A‖ ‖A⁻¹‖ (6.3)

where ‖·‖ usually indicates the 2-norm. K is in the range [1, ∞). When K ≫ 1,
the linear system is ill-conditioned.

Table 6.1. The system of resulting linear equations.

ga = 0.5a + 0.5d
gb = 0.33a + 0.33c + 0.33e
gc = 0.5b + 0.5f
gd = 0.33a + 0.33e + 0.33g
ge = 0.25b + 0.25d + 0.25f + 0.25h
gf = 0.33c + 0.33e + 0.33i
gg = 0.5d + 0.5h
gh = 0.33e + 0.33g + 0.33i
gi = 0.5f + 0.5h
Denote G = [ga gb gc gd ge gf gg gh gi]T , F = [a b c d e f g h i]T .

[Margin note: If the blur process turns out to be linear and space-invariant, we call the process of inverting it deconvolution.]
In Eq. (6.1), suppose we know how the blur process works. Then it would seem
that we should be able to undo it. We will see why that might not be the case.
As an example, look at a very simple image, just about the simplest image we can
come up with, a 3 × 3 image, and give each pixel a name, using the letters a, . . . , i.
Now, suppose this image is blurred by a linear blur process in which each pixel is
replaced by the average of its neighbors (suppose the 4-neighbor definition is used).
In the case of edge or corner pixels, fewer pixels are in the neighborhood. This new,
blurred image has values ga, . . . , gi, and these are related to the original values by
the system of linear equations shown in Table 6.1.
a b c
d e f
g h i

     ⎡ 0    0.5  0    0.5  0    0    0    0    0    ⎤
     ⎢ 0.33 0    0.33 0    0.33 0    0    0    0    ⎥
     ⎢ 0    0.5  0    0    0    0.5  0    0    0    ⎥
     ⎢ 0.33 0    0    0    0.33 0    0.33 0    0    ⎥
H =  ⎢ 0    0.25 0    0.25 0    0.25 0    0.25 0    ⎥
     ⎢ 0    0    0.33 0    0.33 0    0    0    0.33 ⎥
     ⎢ 0    0    0    0.5  0    0    0    0.5  0    ⎥
     ⎢ 0    0    0    0    0.33 0    0.33 0    0.33 ⎥
     ⎣ 0    0    0    0    0    0.5  0    0.5  0    ⎦

Then, we may represent the process of blur by G = H F and solve for the values
before the blur using F = H −1 G. It appears that this should be simple. If we have a
model for the distortion process (the model in this case is the matrix H), all we need
do is invert it, and multiply. Let us look to see why that might not work too well.
First, calculate the inverse of H numerically. Whoops! Our matrix inverter program
110 Image relaxation: Restoration and feature extraction

tells us that the matrix is singular. Guess that will not work. Is it a bad choice of blur
operations?
It is hard, it turns out, to contrive a numerical example which is not singular. It is
possible, of course, but difficult. Here is the real key: Even if the distortion matrix
turns out to be nonsingular, the problem is still likely to be “ill-conditioned.”
That is, (let’s review) we measure ga, . . . , gi, and use matrix multiplication to
determine a, . . . , i. If H is not singular, then in a perfect world it should work. As
engineers know, however, in fact there is always noise, so instead of measuring ga,
we actually measure ga + ε, where ε is some perturbation due to noise. If the system,
that is, the distortion matrix, is ill-conditioned (and it is), then this small change in
ga may produce large differences in the estimates of a, . . . , i. For this reason, simple
matrix inversion does not work, even if the system is linear.
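
The point is easy to verify numerically. Below is a minimal sketch (Python with NumPy; the text itself does not assume any particular software) that builds the 9 × 9 blur matrix H of this example and asks for its rank and condition number. The row-major pixel ordering a, . . . , i and all variable names are our own.

import numpy as np

H = np.zeros((9, 9))
for row in range(3):
    for col in range(3):
        i = 3 * row + col                       # pixels ordered a..i, row-major
        nbrs = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
        nbrs = [(r, c) for r, c in nbrs if 0 <= r < 3 and 0 <= c < 3]
        for r, c in nbrs:
            H[i, 3 * r + c] = 1.0 / len(nbrs)   # average of the 4-neighbors

print(np.linalg.matrix_rank(H))                 # < 9: H really is singular
print(np.linalg.cond(H))                        # effectively infinite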
Another, perhaps simpler example of ill-conditioning [6.36] is as follows: Con-
sider the linear system described by a blur A, and unknown image f, and a measure-
ment g, where

g = Af

A = ⎡ 1  1    ⎤ ,   f = ⎡ f1 ⎤ ,   g = ⎡ 1 ⎤ . (6.4)
    ⎣ 1  1.01 ⎦         ⎣ f2 ⎦         ⎣ 1 ⎦

[Margin note: You decide whether this change is “trivially small.”]
The condition number for matrix A is 402.0075, which is much larger than 1. This
system has solution f1 = 1, f2 = 0. Now, suppose the measurement, g, is corrupted
by noise, producing g = [1 1.01]T . Then, the solution is f1 = 0, f2 = 1. A trivially
small change in the measured data caused a dramatic change in the solution.
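
The same experiment takes only a few lines (again a sketch assuming Python/NumPy; the variable names are ours):

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.01]])
print(np.linalg.cond(A))                  # about 402, as quoted above
print(np.linalg.solve(A, [1.0, 1.0]))     # [1. 0.]
print(np.linalg.solve(A, [1.0, 1.01]))    # [0. 1.]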
There are many ways to approach these ill-posed restoration problems. They
all share a common structure: regularization theory. Generally speaking, any
regularization method tries to analyze a related well-posed problem whose solution
approximates the original ill-posed problem [6.57].
The first approach one might think of is to produce an image estimate which has
the minimum expected mean square error. That is, find the unknown image f which
minimizes

E = Σi (gi − ( f ⊗ h)i )2 , (6.5)

[Margin note: Not only is this version also ill-posed, it is the same problem! (Recall the way to write application of kernel operators by using a matrix?)]
where the sum is over all the pixels in the image, and the distortion is represented by
the application of a kernel operator corresponding to a blur h. Simply minimizing E does
not work, as the problem is still ill-conditioned. Making some assumptions about the
noise can give us a bit better performance. If the distortion is linear, space-invariant,
and the noise is stationary, the Wiener filter gives the optimal solution according to
this criterion. (See [6.28] for a tutorial presentation.)

6.3 The MAP approach

In this section, we introduce a bit of mathematics we need before we can move any
further.

6.3.1 Bayes’ rule


Bayes’ rule concerns three probability functions: the a priori probability density,
p( f ), the conditional probability density, p(g| f ), and the a posteriori conditional
probability density, p( f |g).
We define p( f ) to represent the a priori probability density that some particular
picture f occurs. (If we consider the brightness values as continuous, then we need
to use a probability density rather than a probability. Which we use has no impact
on the resulting form of the derivations given below.) That is, it is the probability
of picture f occurring before any measurements are made. As a discrete example of
the a priori probability, suppose we have a factory which manufactures flanges and
gaskets, but makes ten times as many flanges as gaskets. Flanges and gaskets may
come down the conveyor at random times. But because of our a priori knowledge
that the plant manufactures ten times as many flanges as gaskets we know that we are
much more likely to see a flange than a gasket if we choose to look at the conveyor
at some random time. Thus the a priori probability that our camera sees a flange is
0.9 and the a priori probability of gaskets is 0.1.
We define p(g| f ) to represent the conditional probability density that image g
is measured, given that the measured image is known to be some corruption of
image f. The probability density function may be characterized in several possible
ways. One is by simply tabulating the number of times a particular value occurs
for each possible value of the variable, in this case, length. Such a tabulation is
referred to as a histogram of the variables. Properly normalized, a histogram can
be a useful representation of a probability density function. Unfortunately, it is
difficult to represent the density function of images as a histogram. One may also
describe a density function in a parametric way using some analytic function (e.g.,
the Gaussian).
We define p( f |g) to represent the a posteriori conditional probability (density)
that the observed image g is really a corruption of image f . p( f |g) is what we are
looking for. We will use it as our decision rule or, more correctly, as our discrimi-
nation function. Our decision rule will then be as follows.
[Margin note: A MAP algorithm seeks the image which maximizes the probability AFTER the measurement.]
For a measurement g made on an unknown image, compute p( fi |g) for each
possible fi . Then decide that the unknown is the image fi for which p( fi |g) is
greater than p( f j |g) for all j ≠ i. When we make a classification decision based on
p( fi |g), we are using a maximum a posteriori (MAP) image processing algorithm.

We can relate the three probability functions just defined by using Bayes’ rule:

p( f |g) = p(g| f ) p( f ) / something (6.6)

something = p(g). (6.7)

In Eq. (6.6) we used “something” to represent the denominator for the conditional
probability density. We used the word “something” to call attention to the fact that this
number represents the probability density of that value of g occurring, independent
of the original, uncorrupted image. Since this number is independent of f, and is the
same for all possible fs, it therefore does not provide us any help in distinguishing
which class is most likely. Instead, it is a normalization constant which we use to
ensure that the number p( f |g) has the desirable properties of a probability; that is,
it lies between 0 and 1 and sums to 1 when summed over all possible images (that is,
the observed object belongs to at least one of the classes which we are considering).
We wish to maximize the a posteriori probability density p( f |g) of the unknown
correct image f given measured image g. We will be relating the probabilities of the
entire image to the probabilities of individual pixels. Using Bayes’ rule, we have the
proportionality
[Margin note: Watch for the notation change here! Now, the subscript on f refers to a pixel.]

p( fi |gi ) ∝ p(gi | fi ) p( fi ) (6.8)

where i is the pixel index, f = [ f1 , f2 , . . . , f N ]T .


For the purposes of this discussion, one can ignore the normalizing constant of
proportionality and maximize the right-hand side of Eq. (6.8).
For now we assume that there is no distortion to the image except noise, and we
assume that the noise is statistically independent from one pixel to the next. Then
we can write

p(g| f ) = ∏i p(gi | fi ). (6.9)

Since the only difference between the measurement gi and the true but unknown
image pixel f i is the noise, and if we assume a Gaussian noise model, we can
replace the conditional density of the individual pixels with the density of the noise,
producing
p(g| f ) = ∏i (1/√(2πσ )) exp(−ni2 /(2σ 2 )). (6.10)
Moving the product inside the exponential allows us to write [6.8, 6.27, 6.37, 6.38,
6.80]:
p(g| f ) = (1/√(2πσ ))N exp(−Σi ( fi − gi )2 /(2σ 2 )). (6.11)

A number of researchers have noted that an image is appropriately represented by


a two- or three-dimensional Markov field [6.5, 6.33, 6.47, 6.56] which allows [6.6,
6.27, 6.45] the representation of the a priori probability of a given pixel value by a
Gibbs distribution.
p( fi ) ∝ exp(−Σ j∈ℵi Vi j / T ). (6.12)

The sum is taken over ℵi , the neighborhood of pixel i. Recall from Chapter 4 that
the aura of a set of pixels A relative to a set B is the collection of all points in B that
are neighbors of pixels in A, where the concept of “neighbor” is given by a problem-
specific definition. Here, the concept is similar, except that we consider only the
neighborhood of a single pixel rather than the neighborhood of a set. Like the defi-
nition of aura, the definition of neighborhood is allowed to be problem-dependent,
and in this sense two pixels do not have to be adjacent or even necessarily “close”
in the image to be considered neighbors. In essentially all practical applications,
however, the neighbors of a particular pixel are those pixels which are adjacent. T
is an adjustable width parameter, and the Vs are potential functions which are, in
general, functions of the pixels in the neighborhood.
The way we formulate the prior probability for the entire image is to once again
write a product

p( f ) = ∏i p( fi ). (6.13)
[Margin note: Check this out! The answer which maximized the conditional probability is the same as the answer which minimized the squared error! This happens because the noise is additive Gaussian.]
Forming the product of Eqs. (6.11) and (6.12) as indicated in Eq. (6.8) and eliminating
the constant¹ term, we then take natural logarithms and change the sign, thereby
converting the problem from maximizing a probability to minimizing an objective
function:

H ( f, g) = Σi ( fi − gi )2 /(2σ 2 ) + Σi Σ j∈ℵi Vi j . (6.14)

We will refer to the first, conditional, term of Eq. (6.14) as the “noise” term [6.27]
and to the second as the “prior.” This gives the following form:

H ( f, g) = Hn ( f, g) + Hp ( f ). (6.15)

Examination of the noise term reveals that it is minimized by f = g, which implies


that the restoration, f, should resemble the measurement g.
We may choose the prior (regularization) term to emphasize whatever property we
think the image should have, i.e., having only a few known brightness values [6.70];

¹ These terms do not affect the location of the minimum. The σ which remains in Eq. (6.14) allows us to weight
the relative importance of the noise and prior terms.

having a brightness which varies smoothly, except at boundaries [6.9, 6.16]; or the
most common: having brightness which is constant in local areas and discontinuous
at boundaries [6.16, 6.38].
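
As an illustration of Eq. (6.14), here is a small sketch (Python/NumPy assumed) that evaluates the objective for a one-dimensional signal, using the simple clique potential Vi j = ( fi − f j )2 over adjacent pairs. This particular V is our illustrative choice, not the only one the Gibbs model allows.

import numpy as np

def H(f, g, sigma):
    noise = np.sum((f - g) ** 2) / (2 * sigma ** 2)   # first term of Eq. (6.14)
    prior = np.sum((f[1:] - f[:-1]) ** 2)             # V summed over adjacent pairs
    return noise + prior

g = np.array([0.0, 0.1, -0.05, 1.0, 1.05])            # a noisy "measurement"
print(H(g, g, 0.1))                                   # f = g zeroes the noise term only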

6.3.2 Digression: The problem with inverse problems


The problem we have described to this point is an inverse problem. That is, we know
[Margin note: We are performing an operation that some (but not we) refer to as “denoising.”]
[Margin note: Remember what we said about ill-posed problems.]
the distortion process (it is just noise so far, but we could add blur), and we have
the result of that process operating on an unknown image. Estimating that unknown
image is the inverse problem. Such problems are most often approached by choosing
a function (like the function of Eq. (6.14)) to be minimized, which is the sum of two
terms, one dependent on the measured data and one on the unknown image. The
image may be represented, as we do here, as a collection of point values which must
be solved for, or as a function (cf. [6.82]). In the case that f is, conceptually at least, a
function, then f has values between the sample points. In such a case, we may think
of the process of solving for f as interpolation.

6.3.3 An objective function for edge-preserving smoothing


In section 6.3.1, we briefly discussed the prior term which is chosen to emphasize
the property of the unknown image. The most common property used is that the
pixel brightness should be constant in local areas and discontinuous at boundaries
[6.16, 6.38]. In order to minimize the objective function of Eq. (6.14), the potential
function V is formulated as a penalty function.
Let ∇i represent some scalar measure of the brightness variation about pixel i. We
are about to develop the penalty function that will follow a recommendation of Besag
[6.6] “to encourage smooth variation.” It says that Vi “should be strictly increasing”
in the absolute value of ∇i and if “occasional abrupt changes” are expected, it
should “quickly reach a maximum.” Let’s see if we can understand why he makes
this recommendation.
The quadratic form Vi = b∇i2 with b > 0 is attractive because it is simple. It also
penalizes noise in a nonlinear way. That is, ∇i is some measure of the strength of the
[Margin note: An image which minimizes the quadratic form will be blurry.]
gradient at pixel i, so that the bigger the gradient, the bigger Vi . The minimization
process chooses solutions – images – which result in SMALL values of Vi . Thus,
we are most likely to find a solution which does not have abrupt changes; no abrupt
changes means no sharp edges. This solution results in blurred edges that are un-
acceptable in many applications. It is appropriate only for “smooth variation in the
true scene” [6.6]. The key to preserving sharp edges is simple: Penalize brightness
variations due to noise – the bigger the variation, the bigger the penalty – but at some
point, do not penalize any more (so edges get no additional penalty), as illustrated
in Fig. 6.1. Unfortunately (see section 6A.1), this function is not differentiable, and
we therefore cannot do gradient descent with it.

Fig. 6.1. The penalty should be stronger for higher noise. Assuming local variations in brightness
are due to noise, a larger variation is penalized more. However, local brightness
variations can also be due to edges, which should not be penalized (otherwise, they
will be blurred). Therefore we want our penalty function to have an upper limit.

Instead, we will choose an inverted Gaussian and write:

Hp ( f ) = −(b/√(2πτ )) Σi exp(−∇i2 /(2τ 2 )). (6.16)

[Margin note: The 2π does not mean anything at all!]
In Eq. (6.16), the constants are irrelevant. We include them only so that it looks like a
Gaussian. τ is a soft threshold: It represents a priori knowledge of the roughness of
the surface. The form for Hp makes this knowledge explicit. It is hoped that the spatial
derivatives ∇i will become small almost everywhere as the algorithm proceeds. This
concept will be explored in more detail in the next section.
Combining Eqs. (6.15) and (6.16), we have an objective function which, if mini-
mized, will result in a restored image which resembles the data (in the least squares
sense) while at the same time consisting of regions of uniform brightness sepa-
rated by abrupt edges. We could use mean field annealing (MFA) to minimize this
objective function.
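
A quick numerical comparison of the two penalties makes the saturation visible (a sketch in Python/NumPy; the parameter values are arbitrary):

import numpy as np

def quadratic_penalty(grad, b=1.0):
    return b * grad ** 2                       # grows without bound: blurs edges

def inverted_gaussian_penalty(grad, b=1.0, tau=1.0):
    return -(b / (np.sqrt(2 * np.pi) * tau)) * np.exp(-grad ** 2 / (2 * tau ** 2))

grad = np.linspace(0.0, 5.0, 6)
print(quadratic_penalty(grad))                 # keeps increasing with variation
print(inverted_gaussian_penalty(grad))         # saturates: edges get no extra penalty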

6.4 Mean field annealing

Mean field annealing (MFA) is a technique for finding a good minimum of complex
functions which typically have many minima. The mean field approximation in
statistical mechanics allows a continuous representation of the energy states of a
collection of particles. In the same sense, MFA approximates the stochastic algorithm
called “simulated annealing” (SA). SA has been shown to converge in probability
to the global minimum, even for nonconvex problems [6.27]. Because SA also takes
an unacceptably long time to converge, a number of techniques have been derived
to accomplish speedups [6.43]; MFA is one of those techniques.
Since its introduction in 1989 [6.8, 6.11], this technique has found applications in
restoration of locally homogeneous [6.37, 6.38] and locally smooth [6.8, 6.9] images;
in image segmentation [6.69, 6.70]; in motion analysis [6.1]; and in sensor fusion

[6.7]. MFA is based on a combination of the concepts of simulated annealing [6.48]


and the mean field approximation of statistical mechanics [6.17]. The earlier work
in MFA followed the lead of researchers in simulated annealing, and relied on an
analogy to statistical mechanics to justify the technique. However, it was later [6.13]
demonstrated that such an analogy, although insightful and correct, is not necessary,
and that MFA may be derived purely from information-theoretic considerations.
Use of the term “MFA” can be confusing, since the acronym represents first
and foremost, a technique for deriving algorithms. Secondarily, however, the term
has come to represent a class of image restoration algorithms. These algorithms
are derived by posing an image restoration problem as a minimization problem,
and then following a particular methodology (the mean field approximation) to
derive the minimization method. The resulting method will combine gradient descent
and “annealing,” that is, monotonic variation of a control parameter (known as
“temperature”) through the duration of the algorithm.
Another popular technique for image noise removal, known as “graduated non-
convexity” (GNC) [6.16], bears many similarities to MFA in that GNC uses a descent
method and reduces a control parameter. The similarities are so strong that GNC
may in fact be derived using MFA as a basis [6.12].
Much of the work done to date with MFA has used the MFA methodology to derive
restoration-like algorithms. However, MFA may be considered an image feature-
extraction method and is equivalent to another relaxation method known as “image
diffusion.”
The student is referred to the above references for formal derivations of MFA.
Here, we present what we have found to be a more pedagogically attractive expla-
nation of the same concept, showing MFA to be a particular type of continuation
method [6.2].
A continuation method is an algorithm which implements a homotopy, that is,
a continuous deformation of one (hyper) surface into another. In the problem de-
scribed by Eq. (6.14), we use MFA to distort a convex N -surface into one which
has (typically) many local minima. For purposes of illustration, let us consider the
simplest possible version of Eq. (6.15) which still captures the essence of pixel
[Margin note: Did you observe that the 2 is missing from the denominator? We said the constants don’t really matter, but stay on your toes anyway!]
interactions:

H ( f1 , f2 ) = ( f1 − g1 )2 + ( f2 − g2 )2 − (1/τ ) exp(−( f1 − f2 )2 /τ 2 ). (6.17)

Here we have a (not very interesting) image consisting of only two pixels, f 1 and f 2 ,
which have been corrupted by noise to result in measured pixel values g1 and g2 . The
prior term chosen in this case encourages solutions in which f 1 = f 2 . The principal
result of the MFA derivations, when applied to a function of the type described
by Eq. (6.17), is the replacement of τ by τ + T , where T is a parameter (called
“temperature” in the literature) which is initialized to a “large” value, and gradually

reduced to approach zero. Performing these substitutions in Eq. (6.17) we have a
new objective function
[Margin note: Beware of notation! Here, we are doing “gradient descent” to minimize a function of f, so the pixel values are what we are differentiating WRT. We also saw different “gradients” in the last chapter. Those were derivatives of f with respect to x and y.]

HT ( f1 , f2 ) = ( f1 − g1 )2 + ( f2 − g2 )2 − (1/(τ + T )) exp(−( f1 − f2 )2 /(τ + T )2 ). (6.18)

We denote the result of the MFA derivation (and make the dependence on T explicit)
by referring to this as HT . Differentiating the HT of Eq. (6.18) with respect to the
two elements of the vector f results in

∂ HT /∂f = ⎡ 2( f1 − g1 ) + (1/(τ + T )) exp(−( f1 − f2 )2 /(τ + T )2 ) · 2( f1 − f2 )/(τ + T )2 ⎤
           ⎣ 2( f2 − g2 ) + (1/(τ + T )) exp(−( f1 − f2 )2 /(τ + T )2 ) · 2( f2 − f1 )/(τ + T )2 ⎦ .
(6.19)

The MFA approach is as follows.

(1) Set T = Tinitial (a problem-dependent parameter).


(2) Using Eq. (6.19), perform gradient descent or some other minimization tech-
nique, and find the f which minimizes HT .
(3) Reduce T.
(4) If T > Tfinal , go to (2).

The simplest version of gradient descent applicable here is the iteration


⎡ f1 ⎤ k+1     ⎡ f1 ⎤ k
⎣ f2 ⎦     ⇐  ⎣ f2 ⎦   − α (∂/∂f) HT ( f1k , f2k ) (6.20)

for some small scalar constant α. To see how this works, consider the case of large
T: Making T large results in

min f HT ( f1 , f2 ) = min f (( f1 − g1 )2 + ( f2 − g2 )2 ) as (T → ∞) (6.21)

which is a convex function (a paraboloid) minimized by f i = gi . Now, we have an


initial starting point for the gradient descent. We may now make T slightly smaller,
repeat the descent step, find a new solution, and iterate again, using smaller and
smaller T until T becomes insignificant in comparison to τ .
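
The whole procedure fits in a few lines for the two-pixel problem. The following sketch (Python/NumPy; the schedule, step size, and iteration counts are illustrative choices, not prescriptions from the text) implements Eqs. (6.18)–(6.20):

import numpy as np

def grad_HT(f, g, tau, T):
    # Eq. (6.19), written with s = tau + T
    s = tau + T
    e = np.exp(-(f[0] - f[1]) ** 2 / s ** 2) / s
    pull = 2 * (f[0] - f[1]) / s ** 2
    return np.array([2 * (f[0] - g[0]) + e * pull,
                     2 * (f[1] - g[1]) - e * pull])

g = np.array([0.3, 0.7])          # the two measured pixels
f = g.copy()                      # start the descent at the measurement
tau, T, alpha = 0.1, 10.0, 0.05
while T > 0.01 * tau:             # anneal until T is insignificant next to tau
    for _ in range(50):           # descend at this temperature
        f -= alpha * grad_HT(f, g, tau, T)
    T *= 0.95
print(f)                          # the prior has pulled f1 and f2 together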
Even though we have simplified the problem to only two pixels, the two-
dimensional space of f 1 , f 2 described by Eq. (6.18) is still difficult to visualize,
and we simplify still further to a one-dimensional problem to capture the essence of
the homotopy. In Fig. 6.2, we illustrate a one-dimensional function:
 
( f − k)2 − (1/(τ + T )) exp(−( f − l)2 /(τ + T )2 )

(Figure: the function plotted for τ + T = 10, 6, 5.5, 5, and 4.5.)

Fig. 6.2. Continuous deformation of a function which is initially convex to find the global
minimum of a nonconvex function.

with scalar constants k and l. At high values of T (T + τ = 10), the curve is com-
pletely convex, and as T is reduced, it assumes its true form. At each iteration, the
minimum is tracked, as indicated by the arrows, concluding in the global minimum.

6.4.1 Selecting a prior term


[Margin note: A good exam question would be “why is the noise term called the noise term?” or “what assumptions were involved in deriving the noise term from conditional probabilities of noise?”]
Let us review the meaning of the terms in the MFA energy function. First, consider
the noise term, and write the noise term in a more general way:

Σi ((D( f ))i − gi )2 , (6.22)

where (D( f ))i denotes some distortion of image f in the vicinity of pixel i. Finding
the image f which minimizes this term produces the image which, when distorted,
most closely (in the sum-squared error sense) resembles the measurement g. Now,
let’s look at the prior term.
[Margin note: The term “fidelity” refers to how well the reconstruction resembles the measurement; minimizing these terms preserves fidelity to the measurement.]
Writing the prior in a slightly more general way,

−(1/τ ) Σi exp(−(R( f ))i2 /τ 2 ) (6.23)

where the term (R( f ))i denotes some function of the (unknown) image f at pixel i,
and τ incorporates the function of τ + T in Eq. (6.18). What kind of image minimizes
this term? Let’s look at what it means.
[Margin note: Most errors are made in implementing this algorithm by dropping minus signs!]
First, observe the minus sign in front of the prior term. With that sign there, to
minimize the function, we should find the image which causes the exponential to
be maximized. What kind of image maximizes the exponential? Now look at the
argument of the exponential. Observe the minus sign, and observe that both terms
are squared, and therefore always positive. Thus, the argument of the exponential

is always negative. What negative argument causes an exponential to be maximal?


Answer, zero (well, minus zero actually). Thus, to maximize the exponential, choose
any image f which causes R(f) to be zero.
What do we conclude from this? For any function R(f), the f which causes R(f)
to be zero is the f which the prior term will seek. This observation gives us a lot of
freedom of design. We can choose the function R(f) to produce the type of solution
we seek. Let’s now look at some examples.

EXAMPLE
Piecewise-constant images
Consider this form for the prior

R2 ( f ) = (∂ f /∂ x)2 + (∂ f /∂ y)2 . (6.24)
In order for this term to be zero, both partial derivatives must be zero. The only type of
surface which satisfies this condition is one which does not vary in either direction –
flat, but not flat everywhere. To see why the solution is piecewise-constant rather than
completely constant, you need to recognize that the total function being minimized
is the sum of both the prior and the noise terms. The prior term seeks a constant
solution, but the noise term seeks a solution which is faithful to the measurement.
The optimal solution to this problem is a solution which is flat in segments, as illus-
trated in one dimension in Fig. 6.3. The function R(f) is nonzero only for the points
where f undergoes an abrupt edge. To see more clearly what this produces, consider
the extension to continuous functions. If x is continuous, then the summation in
Eq. (6.23) becomes an integral. The argument of the integral is nonzero at only
a small, finite number of points (referred to as a set of measure zero), which is
insignificant compared to the rest of the integral.

Fig. 6.3. A piecewise-constant solution to the fitting of a surface. The solution has a derivative
equal to zero at almost every point. Nonzero derivatives exist only at the points where
steps exist.

EXAMPLE
Piecewise-planar images
[Margin note: This is the quadratic variation. The Laplacian also involves second derivatives. Ask yourself how it differs from the quadratic variation.]
Let’s try another example and see what that does. Consider

R2 ( f ) = (∂ 2 f /∂ x 2 )2 + (∂ 2 f /∂ y 2 )2 + 2(∂ 2 f /∂ x∂ y)2 . (6.25)
What does this do? What kind of function has a zero for all its second derivatives?
Answer, a plane. Thus, using this form for R(f) will produce an image which is
planar, but still maintains fidelity to the data – a piecewise-planar image. Another
alternative operator which is also based on the second derivative is the Laplacian,
∂ 2 f /∂ x 2 + ∂ 2 f /∂ y 2 .
We saw both of these back in Chapter 5.
You might ask your instructor, “Breaking a brightness image into piecewise-linear
segments is the same as assuming the actual surfaces are planar, right?” To which you
would probably get “Yes, except for variations in lighting, reflectance, and albedo.”
Ignoring that, you charge on, saying “But real surfaces aren’t all planar.” The answer
is twofold: First is the trivial and useless observation that all surfaces are planar, you
just have to examine a sufficiently small region. More seriously, whether breaking
an image into planar patches makes sense depends on the application. For example
[6.14, 6.74], one could do a piecewise-constant segmentation to remove noise and
then treat each patch as a plane and get improved estimates of optic flow [6.41] or
stereo [6.72] that way. Some of the underlying theory of planar approximations to
images may be found in [6.62].
Interestingly, Yi and Chelberg [6.83] observe that second-order priors like this
one require a great deal more computation than first-order priors, and that first-order
priors can be made approximately invariant. However, in our own experiments, we
have not found that second-order priors impose such a severe computational penalty,
and they do provide more flexibility in reconstruction.
A one-dimensional example of such a solution is shown in Fig. 6.4. The idea
of modeling an image as piecewise-planar has recently received additional support
from some work by Elder and co-workers [6.22, 6.23, 6.24] which suggests that
“the edge representation of an image is, to a good approximation, invertible.” They
accomplish this remarkable result by assuming that except at edges, an image satisfies
the Laplace equation ∇ 2 f (x, y) = 0.
In conclusion, you can choose any function for the argument of the exponential
that you wish, as long as the image you want is produced by setting the argument to
zero. Some more general properties of prior models have been stated by Li [6.51]
as a function of the local image gradient ∇: A prior h should (1) be continuous in its
first derivative; (2) be even (h(∇) = h(−∇)); (3) be positive, h(∇) > 0; and (4) have
a negative derivative for positive argument, h′(∇) < 0 for ∇ > 0, going to a constant
in the limit: lim∇→∞ |h(∇)| = C.

Fig. 6.4. A piecewise-linear solution to the data of Fig. 6.3. Clearly, the piecewise-linear solution
preserves fidelity to the data more accurately than the piecewise-constant.

6.4.2 Annealing: Avoiding local minima


MFA gives us one more capability that distinguishes it from other MAP methods,
the ability to avoid most local minima.
The magic is all in the use of the parameter τ in Eq. (6.23). Begin with “large” τ .
Do gradient descent. As the descent iterations proceed, “slowly” reduce τ . When τ
reaches an “appropriate” minimum, stop. Now, let’s see what the words in quotes in
the previous sentences mean.
[Margin note: This is the definition of the Kronecker delta. The Dirac delta is infinity when its argument goes to zero and zero otherwise, and integrates to one.]
We wish to develop a function which is minimized when the number of pixels
which are equal to their neighbors is maximal. One way to count such pixels is to
use the delta function, which is one if the difference between two pixels is zero:

δ( fi − f j ) = { 1 if fi = f j ; 0 otherwise } (6.26)

so the number of times a pixel is equal to its neighbor is



Σi Σ j∈ℵi δ( fi − f j ). (6.27)

We want a function which is minimal when this double summation is maximal, so


we simply introduce a minus sign

Hp = −Σi Σ j∈ℵi δ( fi − f j ). (6.28)

The delta function is not convenient, since it is not differentiable, and we will want
to use gradient descent to solve this problem. But there is another problem with
this formulation: If the image is continuously valued (or even if it is represented
in floating point), what does it mean for f i to equal to f j ? How close should they

be before they are considered equal? How about | fi − f j | < 0.01? Is this small
enough? How about 0.001? OK? Do you accept that? So we insist that two points
which differ by more than 0.001 contribute to the error. What about two points that
differ by 0.000 999? That pair does not contribute at all. Does that make sense?
What we have generated is very similar to the problem of describing the proba-
bility that a measure has a particular value. The probability, for example, of being
exactly 6.000 000 (for an arbitrary number of zeros) is zero, and we therefore resort
to a different representation for the concept of likelihood, and we invent the prob-
ability density. In a similar way, in this problem, we pursue the same philosophy,
instead of using the Kronecker delta, we replace the delta function with a continuous,
differentiable function, which represents the same intuition,
Hp = −Σi Σ j∈ℵi (1/√(2πτ )) exp(−( fi − f j )2 /(2τ 2 )). (6.29)

This form also allows the concept of annealing: Start with large τ and reduce τ until
it approaches zero. The square root of a constant is not particularly meaningful, but
it does serve to ensure that the function remains bounded in the appropriate way.
The details of why this process of annealing avoids local minima are described
elsewhere [6.8, 6.11, 6.12], but result from the analogy to simulated annealing [6.27].

Initial value of τ

We will choose an initial value of f to be g. Consider what happens if τinitial is large.
A large τ will cause the argument of the exponential to be close to zero, and the value
of the exponential itself to be approximately one. But that value is, itself, divided by
τ , so that when τ is large, the value of the prior is on the order of 1/τ ; and if τ is a
big number, then the prior is insignificant relative to the noise term. We can ensure
that τ is initially “large” by choosing τ to be larger than, say, twice the average value
of the numerator,

τinitial = 2R( f ). (6.30)

Decreasing τ

MFA is based upon the mathematics of simulated annealing. One can show that in
simulated annealing, a global minimum can be achieved by following a logarithmic
annealing schedule like
τ K = 1/ ln K (6.31)

where K is the iteration number. This schedule decreases τ extremely slowly; so
slowly as to be impractical. Instead, one could choose a schedule like

τ K = 0.99 τ K −1 (6.32)

which has been shown to work satisfactorily in many applications, and reduces τ
much faster than the logarithmic schedule.
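
A two-line experiment (plain Python; the starting value τ0 and the iteration counts are arbitrary) makes the difference between the two schedules obvious:

import math

tau0 = 10.0
for K in (2, 10, 100, 1000):
    log_sched = tau0 / math.log(K)     # Eq. (6.31), scaled by tau0: barely decreases
    geo_sched = tau0 * 0.99 ** K       # Eq. (6.32): decays geometrically
    print(K, round(log_sched, 3), round(geo_sched, 5))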

6.4.3 How to differentiate a function containing a kernel operator


In previous discussions, we excluded the effect of blur and assumed noise is the
only source of distortion. In this section, we extend the results we derived above
and include blur in the formation of the measured image g, such that the noise term
can be written as Eq. (6.33). More important, we illustrate how to differentiate a
function containing a kernel operator (or a blur operator)

Σi (( f ⊗ h)i − gi )2 . (6.33)

Taking a one-dimensional example, assume f i is a pixel from the original (un-


known) image, gi is a pixel from the measured image, and h is the horizontal blur
kernel with a finite kernel size 5, as shown in Fig. 6.5. We now explain the derivative
of the noise term, Eq. (6.33), in gradient descent in detail. First write out all the terms
involving a pixel ( f 4 ) at which a measurement (g4 ) was made in the noise term Hn
above
[Margin note: These are the only terms in the sum of Eq. (6.33) involving f4 .]

E 4 = (( f ⊗ h)2 − g2 )2 + (( f ⊗ h)3 − g3 )2 + (( f ⊗ h)4 − g4 )2
    + (( f ⊗ h)5 − g5 )2 + (( f ⊗ h)6 − g6 )2
  = ( f0 h −2 + f1 h −1 + f2 h 0 + f3 h 1 + f4 h 2 − g2 )2
    + ( f1 h −2 + f2 h −1 + f3 h 0 + f4 h 1 + f5 h 2 − g3 )2 (6.34)
    + ( f2 h −2 + f3 h −1 + f4 h 0 + f5 h 1 + f6 h 2 − g4 )2
    + ( f3 h −2 + f4 h −1 + f5 h 0 + f6 h 1 + f7 h 2 − g5 )2
    + ( f4 h −2 + f5 h −1 + f6 h 0 + f7 h 1 + f8 h 2 − g6 )2

where ( f ⊗ h)i denotes the application of kernel h to image f with the origin of h (in
this case, the center) located at pixel f i . The derivative of Hn with respect to pixel
f 4 can then be derived as Eq. (6.35), and further generalized as Eq. (6.36),
∂ Hn
= 2(( f ⊗ h)2 − g2 )h 2 + 2(( f ⊗ h)3 − g3 )h 1 + 2(( f ⊗ h)4 − g4 )h 0
∂ f4
+2(( f ⊗ h)5 − g5 )h −1 + 2(( f ⊗ h)6 − g6 )h −2 (6.35)
∂ Hn
= ((( f ⊗ h) − g) ⊗ h rev )4 (6.36)
∂ f4
where h rev = h 2 , h 1 , h 0 , h −1 , h −2 , and ( f ⊗ h − g) is computed at all points. The

… f2 f3 f4 f5 f6 …

h–2 h–1 h0 h1 h2

Fig. 6.5. A one-dimensional image and a one-dimensional kernel with five pixels.

… n2 n3 n4 n5 n6 …

h2 h1 h0 h–1 h–2
h2 h1 h0 h–1 h–2
h2 h1 h0 h–1 h–2
h2 h1 h0 h–1 h–2
h2 h1 h0 h–1 h–2

Fig. 6.6. The reverse kernel in the derivation of the noise term.

application of h rev is illustrated more clearly in Fig. 6.6, where we assume ni = (( f ⊗ h) − g)i .
The application of the reverse kernel with a two-dimensional image follows the
same rule as that in the one-dimensional case. Equation (6.37) shows a 3 × 3 kernel
function (h) and the corresponding reverse kernel h rev .
    ⎡ h −1,−1  h −1,0  h −1,1 ⎤          ⎡ h 1,1   h 1,0   h 1,−1  ⎤
h = ⎢ h 0,−1   h 0,0   h 0,1  ⎥ , h rev = ⎢ h 0,1   h 0,0   h 0,−1  ⎥ . (6.37)
    ⎣ h 1,−1   h 1,0   h 1,1  ⎦          ⎣ h −1,1  h −1,0  h −1,−1 ⎦
Before we discuss the prior term, let’s conclude the general form of differentiation
when a kernel function is involved. Let R( f ⊗ h) be some differentiable function.
The derivative with respect to f is

(∂/∂ f ) R( f ⊗ h) = (R ′ ( f ⊗ h)) ⊗ h rev (6.38)

where R ′ (t) = (∂/∂t) R(t).
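
Equation (6.36) is easy to check numerically. The sketch below (Python with NumPy/SciPy; the kernel values are arbitrary, and we use circular "wrap" boundaries so the identity holds exactly at every pixel, since other boundary treatments add extra terms at the border) compares the kernel form of the gradient against a brute-force finite difference:

import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(0)
f = rng.normal(size=20)
g = rng.normal(size=20)
h = np.array([0.2, 0.5, 0.3])              # deliberately asymmetric kernel

def Hn(f):
    return np.sum((convolve(f, h, mode='wrap') - g) ** 2)

# Eq. (6.36): gradient = 2 ((f conv h) - g) conv h_rev, at all pixels
resid = convolve(f, h, mode='wrap') - g
analytic = 2 * convolve(resid, h[::-1], mode='wrap')

numeric, eps = np.zeros_like(f), 1e-6
for i in range(f.size):
    fp, fm = f.copy(), f.copy()
    fp[i] += eps
    fm[i] -= eps
    numeric[i] = (Hn(fp) - Hn(fm)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))  # agrees to finite-difference accuracy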
In [6.63], differentiation of a function containing a kernel operator is used in a
particularly interesting case, where we only have measurements at every other pixel.
Thus, this problem may be viewed as not only image estimation, but interpolation as
well [6.78]. The experimental results and performance evaluation of MFA missing
data estimation algorithm are demonstrated in detail in [6.63].
Besides the noise term, the prior term can also contain a kernel operator. Recall
that the prior energy function models the neighborhood operation and thus can be
represented as a kernel operation with the kernel selected dependent on the property
of the image. We illustrate the gradient using the following prior:

Hp = −Σi exp(−(( f ⊗ r )i )2 ) (6.39)

∂ Hp /∂ f = [2( f ⊗ r ) exp(−( f ⊗ r )2 )] ⊗ rrev . (6.40)
Recall that the derivative of the prior ∂ H p /∂ f is itself an image, and that image
is derived by applying r to f, multiplying (pixel-by-pixel) with the exponential to
produce another image, and applying the reverse of r to that image.

6.4.4 Practical considerations: Edge-preserving smoothing


Collecting all we have learned in this chapter so far into two equations, we choose
an objective function which preserves fidelity (the restoration resembles the mea-
surement) but which is piecewise-linear.
H ( f ) = Σi ( fi − gi )2 /(2σ 2 ) − (β/τ ) Σi exp(−(R( f ))i2 /τ 2 ) (6.41)

where (R( f ))i will be the quadratic variation of Eq. (6.25) at pixel i. Of course, in
specific applications, different priors may be indicated. To perform gradient descent,
we must find the derivative with respect to f . This problem becomes complicated
when we recognize that the numerator of the argument of the exponential varies in
both x and y. We have two choices.
r We could recognize that R is a sum of three terms, that the exponential of a sum is
the product of the exponentials, and use the product rule of derivatives to construct
a rather complicated expression for the derivative.
r We could say “instead of putting the summation in the argument of the exponential,
let’s just add the exponentials.”

Of course, these are not equivalent expressions. However, minimizing either gets
at the same idea – a piecewise-linear image. Since the second is simpler to implement,
being engineers, we choose the second option. We know how to do the derivatives,
so we end up with the following algorithm.

The derivative of the noise term is trivial: On each iteration, simply change pixel i
by dnoisei = ( fi − gi )/σ 2 .

To determine the derivative of the prior requires a bit more work: according to
Eq. (6.40), the derivative of the prior term is
   
(β/τ ) [ (( f ⊗ Λ)/τ 2 ) exp(−( f ⊗ Λ)2 /(2τ 2 )) ] ⊗ Λrev .
Define three kernel operators which we will use to estimate the three partial second
derivatives in the quadratic variation:

Λx x = (1/√6) ⎡ 0  0  0 ⎤        Λ yy = (1/√6) ⎡ 0  1  0 ⎤
              ⎢ 1 −2  1 ⎥ ,                    ⎢ 0 −2  0 ⎥ ,   and
              ⎣ 0  0  0 ⎦                      ⎣ 0  1  0 ⎦

Λx y = 2 ⎡ −0.25  0   0.25 ⎤
         ⎢  0     0   0    ⎥ .
         ⎣  0.25  0  −0.25 ⎦

Notice that these kernels are symmetric (unchanged by 180° rotation). Therefore Λ = Λrev .



Compute three images, where the ith pixels of those images are

ri x x = (Λx x ⊗ f )i ,  ri yy = (Λ yy ⊗ f )i ,  and  ri x y = (Λx y ⊗ f )i .

Create the image sx x whose elements are


(1/τ 3 ) ri x x exp(−(ri x x )2 /(2τ 2 )),

and similarly create s yy and sx y .
and similarly create s yy and sx y .
For the purposes of gradient descent, the change in pixel i from the prior term is
then

dpriori = β((Λx x ⊗ sx x )i + (Λ yy ⊗ s yy )i + (Λx y ⊗ sx y )i ).

The gradient descent rule says to change each element of f using fi ⇐ fi − αdi ,
where di = dnoisei + dpriori .
The learning coefficient α should be α = γ σ √τ / RMS(di ), where γ is a small
dimensionless number, like 0.04; RMS(d) is the root mean square norm of the gra-
dient d; σ can be determined as the variance of the noise in the image (note that
this is NOT a good estimate in synthetic images). We observe that in this form,
α changes every iteration.
The coefficient β is on the order of σ, and choosing β = σ is usually adequate.
Implementing this algorithm and annealing over a couple of orders of magnitude
of τ should give reductions of noise similar to those illustrated in Figs. 6.10–6.13.
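
Collecting the steps above into code, here is a compact sketch of the algorithm (Python with NumPy/SciPy; the function name, the iteration count per temperature, and the 0.95 annealing factor are our own choices, while the kernels, step rule, and learning coefficient follow the recipe in the text):

import numpy as np
from scipy.ndimage import convolve

# Kernels estimating the three second partials of the quadratic variation.
Lxx = np.array([[0, 0, 0], [1, -2, 1], [0, 0, 0]]) / np.sqrt(6.0)
Lyy = Lxx.T
Lxy = 2 * np.array([[-0.25, 0, 0.25], [0, 0, 0], [0.25, 0, -0.25]])

def mfa_smooth(g, sigma, tau_init, tau_final, gamma=0.04, iters_per_tau=10):
    f, tau, beta = g.astype(float).copy(), float(tau_init), sigma
    while tau > tau_final:
        for _ in range(iters_per_tau):
            d_noise = (f - g) / sigma ** 2
            d_prior = np.zeros_like(f)
            for L in (Lxx, Lyy, Lxy):            # symmetric, so L == L_rev
                r = convolve(f, L)
                s = (r / tau ** 3) * np.exp(-r ** 2 / (2 * tau ** 2))
                d_prior += convolve(s, L)
            d = d_noise + beta * d_prior
            rms = np.sqrt(np.mean(d ** 2)) + 1e-12
            alpha = gamma * sigma * np.sqrt(tau) / rms   # learning coefficient
            f -= alpha * d
        tau *= 0.95                                       # anneal tau downward
    return f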

6.5 Conclusion

Consistency is easy to see in some of the applications of image optimization. For


example, in a paper by one of the authors [6.7], use is made of consistency in fusing
a (noisy) range image with a (noisy) brightness image of the same object. Since both
are images of the same object, both the range image, which depends on the geometry,
and the brightness image, which depends on the reflectivity (and therefore on the
geometry) must be consistent.
Optimization methods are so pervasive in this chapter that the chapter title could
almost be “image optimization.” We set up an objective function, a function of
the measured image and the (unknown) true image and find the (unknown) image
[Margin note: Gradient descent with annealing.]
which depends on the measurement – and a prior term – which depends only on the
true image. We then find the “true” image by finding the image which minimizes
true image. We then find the “true” image by finding the image which minimizes
the objective function. A variety of minimization techniques could be used. In this

chapter, we use gradient descent with annealing, but other, more sophisticated and
faster techniques, such as conjugate gradient, could be used.

6.6 Vocabulary

You should know the meanings of the following terms.

Anisotropic diffusion (see 6A.2)


Annealing
Bayes’ rule
GNC (see 6A.1)
Inverse problem
MAP algorithm
Relaxation
Restoration

Assignment 6.1
Equation (6.34) illustrates the partial derivative of
an expression involving a kernel, by expanding the ker-
nel into a sum. Use this approach to prove that Eq.
(6.40) can be derived from Eq. (6.39). Do your proof
using a one-dimensional problem, and use a kernel which
is 3 × 1 (denote the elements of the kernel as h −1 ,h0 ,
and h1 ).

Assignment 6.2
Implement Eq. (6.65) on the image angio.ifs, or some
other image which your instructor selects. Experiment
with various run times and parameter settings.

Assignment 6.3
In Eq. (6.25), the quadratic variation is presented as
a prior term. A very similar prior would be the Lapla-
cian. What is the difference? That is, are there image
features which would minimize the Laplacian and not
minimize the quadratic variation? Vice versa?

Assignment 6.4
Which of the following expressions represents a Lapla-
cian?
(a) ∂ 2 f /∂ x 2 + ∂ 2 f /∂ y 2
(b) ∂ f /∂ x + ∂ f /∂ y
(c) (∂ 2 f /∂ x 2 )2 + (∂ 2 f /∂ y 2 )2
(d) (∂ f /∂ x)2 + (∂ f /∂ y)2
(e) (∂ 2 f /∂ x 2 + ∂ 2 f /∂ y 2 )2
(f) ((∂ f /∂ x) + (∂ f /∂ y))2

Assignment 6.5
One form of the diffusion equation is written
df/dt = hx ⊗ (c(hx ⊗ f)) + hy ⊗ (c(hy ⊗ f)) where hx and hy
estimate the first derivatives in the x and y
directions, respectively. This suggests that four
kernels must be applied to compute this result. Simple
algebra, however, suggests that this could be rewritten
as df/dt= c(hxx ⊗ f + hyy ⊗ f), which requires application
of only two kernels. Is this simplification of the
algorithm correct? If not, explain why not, or under
what conditions it would be true.

Assignment 6.6
Consider the following image Hamiltonian
H (f) = Σi (fi − gi )2 /σ 2 − (1/τ ) Σi exp(−((h ⊗ f)i )2 /τ 2 ) ≡ Hn (f) + Hp (f)
where ⊗ denotes application of a kernel operator, the
pixels in the image are lexicographically indexed by i,
and the kernel h is
⎡ −1  2 −1 ⎤
⎢ −2  4 −2 ⎥ .
⎣ −1  2 −1 ⎦
Let G p (fk) denote the partial derivative of H p with
respect to pixel k. G p (fk) = ∂/∂ fk H p (f). Write an expres-
sion for G p (fk). Use kernel notation.

Assignment 6.7
Continuing problem Assignment 6.6, you are to consider
ONLY the prior term. Write an equation which describes

the change in image brightness at pixel k as one itera-


tion of a simple gradient descent algorithm. Denote the
gradient by G p (fk) and use that in this answer.

Assignment 6.8
Continuing problem Assignment 6.7, expand this differ-
ential equation (assume the brightness varies only in
the x direction) by substituting the form for G p (fk)
which you derived. Is this a type of diffusion equa-
tion? Discuss. (Hint: Replace the application of ker-
nels with appropriate derivatives.)

Assignment 6.9
In a diffusion problem, you are asked to diffuse a VEC-
TOR quantity, instead of the brightness which you did
in your project. Replace the terms in the diffusion
equation with the appropriate vector quantities, and
write the new differential equation. (Hint: You may
find the algebra easier if you denote the vector as
[a,b]T .)

Assignment 6.10
The time that the diffusion runs is somehow related to
blur. This is why some people refer to diffusions of
this type as “scale space.” Discuss this use of termi-
nology.

Topic 6A Alternative and equivalent algorithms

6A.1 GNC: An alternative algorithm for noise removal

Just as the MFA approach described in the previous section minimizes an objective function
in order to find an image with sharp edges, graduated nonconvexity (GNC) does the same,
but uses an objective function which treats the presence of edges explicitly.
We consider the case in which our a priori knowledge states that the image is uniform
in brightness, except for step discontinuities. Blake and Zisserman [6.16] refer to this case
as the “weak membrane,” and a similar MFA instance is referred to [6.12] as “piecewise-
uniform.” The similarities can be seen both in the objective function (compare Fig. 6.7 and
Fig. 6.9) and in the restorations (Figs. 6.10–6.13). There exist other formulations [6.12] which

(Figure: HpMFA plotted against ∇i ( f ).)
Fig. 6.7. Prior energy for MFA, for various values of T. Smaller T results in sharper peaks.

pose the MFA problem in a manner even more similar to GNC, a similarity first noted by
Geiger and Girosi [6.25]. In the “weak membrane” application of GNC, the minimization
problem is

min f,l HGNC (6.42)

where
 
HGNC = Hn + S + P,   S = λ2 Σi |∇i ( f )|2 (1 − li ),   and   P = α Σi li (6.43)

and the notation ∇i ( f ) is interpreted to mean “the gradient of the image at point i.” Here,
li ∈ {0, 1} denotes a discontinuity in f at the ith pixel. That is, if li = 1, the pixel at point i
is interpreted as an edge point. Similarly, fi will denote the brightness of the ith pixel. It has
been shown [6.16] that minimizing HGNC can be reduced to the following problem, which
involves only continuous variables:

min f { Hn + Σi v(∇i ( f )) }. (6.44)

Fig. 6.8. Prior energy of the GNC algorithm (plotted against ∇i ( f )).

In Eqs. (6.43) and (6.44), the |∇(.)| represents any operator which returns a scalar measure
of the local “edginess” of the image, such as (∂ f /∂ x)2 + (∂ f /∂ y)2 , and the v function of
Eq. (6.44) is the “clipped parabola” illustrated in Fig. 6.8.
[Margin note: This “clipped parabola” has the same general shape as the inverted Gaussian we mentioned in Eq. (6.16), for the same reasons: We want to penalize noise (so the bigger the noise, the bigger the penalty) but we don’t want to penalize edges (so at some point, we stop making the penalty any larger).]
The minimization problem posed by Eq. (6.44) is unsolvable by techniques such as gra-
dient descent, since the function defined by Eq. (6.44) is in general not convex. That is,
it may possess many minima. Instead, GNC approximates v with the piecewise-smooth
function

v ∗ (t) = λ2 t 2                  if (|t| < q)
         α − c(|t| − r )2 /2     if (q ≤ |t| < r ) (6.45)
         α                       if (|t| ≥ r )

Fig. 6.9. Smoothed approximations to the energy of Fig. 6.8. Smaller values of p result in
approximations which are closer to the desired prior. The t used here (in GNC) is
equivalent to the edge strength ∇ of MFA.

where c∗ is a scalar constant, c = c∗ / p,

 
r 2 = α (2/c + 1/λ2 ),   and   q = α/(λ2 r ). (6.46)

Equations (6.45) and (6.46) then define the algorithm. Reducing the parameter p from 1 to
0 steadily changes v ∗ until it becomes precisely equal to v. This produces a family of prior
energies, illustrated in Fig. 6.9.
The process of gradually reducing p begins by minimizing a function which is convex, and
therefore has a unique minimum. Then, from that minimum, the local minimum is tracked
continuously as p is reduced from 1 to 0.
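
For concreteness, here is the approximation v∗ as a function (a sketch in Python/NumPy; the value of c∗ is an illustrative choice, since the text leaves it only as “a scalar constant”):

import numpy as np

def v_star(t, alpha, lam, p, c_star=0.25):
    c = c_star / p                        # c grows as p shrinks toward 0
    r = np.sqrt(alpha * (2.0 / c + 1.0 / lam ** 2))   # Eq. (6.46)
    q = alpha / (lam ** 2 * r)
    t = np.abs(t)
    return np.where(t < q, lam ** 2 * t ** 2,
           np.where(t < r, alpha - c * (t - r) ** 2 / 2.0, alpha))

print(v_star(np.linspace(0.0, 3.0, 7), alpha=1.0, lam=1.0, p=1.0))

As p is reduced toward 0, the quadratic middle section steepens and v∗ approaches the clipped parabola v, exactly the family of curves sketched in Fig. 6.9.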

6A.2 Variable conductance diffusion

Variable conductance diffusion (VCD) [6.31, 6.59, 6.62] is a powerful method of image
feature extraction in which blurring is allowed to occur except at edges. The term “edge”
may be liberally interpreted to mean any image feature which is of interest. For example,
Whitaker [6.79] operates on the gradient of the image rather than the image itself and allows
smoothing to occur except where ridges (sharp changes of gradient direction) occur. Such
an operation is not a restoration by any means, since most of the information in the original
image is lost. It is, however, a very robust way to extract a central axis of an object in a gray
scale image.
[Margin note: Relation of spatial derivatives (RHS) to temporal derivatives (LHS).]
VCD operates by simulating the diffusion equation

∂ fi /∂t = ∇ · (ci · ∇i f ) (6.47)

where t is time, and ∇i f denotes the spatial gradient of f at pixel i. The diffusion equation
models the flow of some quantity (the most commonly used example is heat) through a
material with conductivity (e.g., thermal conductivity) c.
If ci is constant, independent of the pixel number i, (ci = c) then the partial differential
Eq. (6.47) has a solution which is the same as convolution with a Gaussian, in which the
variance of that Gaussian depends on c and on the time over which the diffusion is run.
Specifically, let f, a function of space and time, be described by a specific partial differential
equation (PDE). If it is possible to write f in the following form:

∞
f (x, t) = G(x, x  , t) f (x  , 0) d x  , (6.48)
−∞

then we say that G(x, x  , t) is the Green’s function of the PDE. The special case of isotropic
diffusion may be stated formally in one dimension:

Theorem
The Gaussian is the Green’s function of the PDE
∂ f /∂t = c ∂ 2 f /∂ x 2 . (6.49)

Proof
This is accomplished by writing the Gaussian as
 
G(x, x ′ , t) = (1/σ ) exp(−(x − x ′ )2 /(2σ 2 )),

where σ will turn out to be a function of time (we have omitted the 1/√(2π) since it occurs
on both sides of the PDE and cancels out).
Substitute Eq. (6.48) into Eq. (6.49), producing on the left-hand side (LHS) the partial
with respect to t of an integral in which σ is a function of t. Taking this partial, make the LHS
equal to

( (x − x ′ )2 /σ 4 − 1/σ 2 ) ∂σ/∂t. (6.50)
Similarly, we can take the second partial derivative with respect to x to create the right-hand
side (RHS):


(c/σ ) ( (x − x ′ )2 /σ 4 − 1/σ 2 ). (6.51)
Equating Eqs. (6.50) and (6.51) produces the equation
∂σ/∂t = c/σ (6.52)

whose solution is

σ 2 = 2ct. QED. (6.53)
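
The theorem invites a numerical check: run a discrete isotropic diffusion for time t and compare against Gaussian smoothing with σ = √(2ct). A sketch (Python with NumPy/SciPy; the step sizes are our choices, and the explicit scheme requires c·Δt ≤ 0.5 for stability):

import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(1)
f = rng.normal(size=256)

c, dt, steps = 1.0, 0.1, 500                  # total time t = dt * steps = 50
u = f.copy()
for _ in range(steps):                        # explicit scheme; stable since c*dt <= 0.5
    u += c * dt * (np.roll(u, 1) - 2 * u + np.roll(u, -1))

sigma = np.sqrt(2 * c * dt * steps)           # Eq. (6.53): sigma^2 = 2ct
v = gaussian_filter1d(f, sigma, mode='wrap')
print(np.max(np.abs(u - v)))                  # small: diffusion ~ Gaussian convolution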

In the case of VCD, the conductance becomes a function of the spatial coordinates, in
this instance parameterized by i. In particular it becomes a property of the local image
intensities themselves. The conductance ci is usefully seen as a factor by which space is
locally compressed.
To smooth, except at edges, we let ci be small if i is an edge pixel, i.e., if a selected image
property is locally nonuniform. If ci is small (in the heat transfer analogy), little heat flows
(space is stretched), and in the image, little smoothing occurs. If, on the other hand, ci is
large, then much smoothing is allowed in the vicinity of pixel i (space is compressed). VCD
then, just as the forms of MFA and GNC discussed, implements an operation which after
repetition produces a nearly piecewise uniform result.
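
A minimal sketch of one VCD step follows (Python/NumPy; the exponential conductance and the centered-difference discretization with periodic boundaries are common choices, not mandated by the text):

import numpy as np

def vcd_step(f, k=0.1, dt=0.2):
    # centered-difference gradient (periodic boundaries for brevity)
    fx = (np.roll(f, -1, axis=1) - np.roll(f, 1, axis=1)) / 2.0
    fy = (np.roll(f, -1, axis=0) - np.roll(f, 1, axis=0)) / 2.0
    c = np.exp(-(fx ** 2 + fy ** 2) / k ** 2)      # conductance: small at edges
    jx, jy = c * fx, c * fy                        # the flux c * grad(f)
    div = (np.roll(jx, -1, axis=1) - np.roll(jx, 1, axis=1)) / 2.0 \
        + (np.roll(jy, -1, axis=0) - np.roll(jy, 1, axis=0)) / 2.0
    return f + dt * div                            # Eq. (6.47), explicit Euler step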

6A.3 Edge-oriented anisotropic diffusion

As we observed in Eq. (6.48), the Gaussian is the Green’s function of the diffusion equation.
That is, a diffusion process running on an image produces the same result as convolution with
a Gaussian, where the variance of the Gaussian depends on how long the diffusion has been
running. The constant-conductance diffusion equation is

f t = c( f x x + f yy ). (6.54)

If there is an edge in the image, we want to remove noise on both sides of the edge, but not
blur the edge. Therefore it makes sense to diffuse in a direction tangent to the edge. The
normal and tangent vectors to an edge in a two-dimensional image are given by

N = [ f x f y ]T / √( f x2 + f y2 )   and   T = [− f y f x ]T / √( f x2 + f y2 ).

Consider now, the second partial derivatives of f taken in the N and T directions: f N N and
fT T .
Since the Laplacian is rotation-invariant, we can write the diffusion PDE (Eq. (6.54)) in
the new coordinates by f t = c( f N N + f T T ).
One may derive the following relationships between the partials
   
f N N = ( f x2 f x x + 2 f x f y f x y + f y2 f yy ) / ( f x2 + f y2 )
(6.55)
f T T = ( f y2 f x x − 2 f x f y f x y + f x2 f yy ) / ( f x2 + f y2 ).

Substituting this form into Eq. (6.54) and subtracting the normal flow, we end up with a PDE
which smooths along edges without smoothing across edges:
   
f t = ( f y2 f x x − 2 f x f y f x y + f x2 f yy ) / ( f x2 + f y2 ), (6.56)

which we denote as anisotropic diffusion.
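
One iteration of Eq. (6.56) can be written directly from finite-difference estimates of the derivatives (a sketch in Python/NumPy; eps is our guard against a vanishing gradient, and periodic boundaries are used for brevity):

import numpy as np

def tangent_diffusion_step(f, dt=0.1, eps=1e-8):
    fx  = (np.roll(f, -1, 1) - np.roll(f, 1, 1)) / 2.0
    fy  = (np.roll(f, -1, 0) - np.roll(f, 1, 0)) / 2.0
    fxx = np.roll(f, -1, 1) - 2 * f + np.roll(f, 1, 1)
    fyy = np.roll(f, -1, 0) - 2 * f + np.roll(f, 1, 0)
    fxy = (np.roll(np.roll(f, -1, 0), -1, 1) - np.roll(np.roll(f, -1, 0), 1, 1)
         - np.roll(np.roll(f, 1, 0), -1, 1) + np.roll(np.roll(f, 1, 0), 1, 1)) / 4.0
    # Eq. (6.56): smooth along the edge tangent, never across the edge
    ft = (fy ** 2 * fxx - 2 * fx * fy * fxy + fx ** 2 * fyy) / (fx ** 2 + fy ** 2 + eps)
    return f + dt * ft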

6A.4 A common description of image relaxation operators

6A.4.1 MFA and GNC

In both cases, we have an energy function which increasingly penalizes the presence of gra-
dients in the image. In the GNC case the prior retains its original shape, and the “annealing”
process (that is, the lowering of p) results in successively closer fits to the predetermined
shape of the prior. In the MFA case, the shape of the prior itself changes, retaining a con-
stant area, but becoming narrower as T is reduced. With this observation, it should not be
surprising that piecewise-constant MFA and the weak membrane of GNC achieve the same
result.

Fig. 6.10. Original image. Fig. 6.11. Corrupted image. Fig. 6.12. MFA restoration. Fig. 6.13. GNC restoration.

The equivalence of the techniques is demonstrated in the next section and proven formally
in [6.12], to which we refer the reader for in-depth analysis. The following experiments are
also described in that paper and are reprinted here to assist the reader in understanding the
action of the algorithms.
The two approaches were each used to restore the same image with various signal-to-
noise ratios (SNRs). On each application of MFA and GNC to a noisy image, the respective
parameters were varied to achieve the best possible image restoration. Several hundred runs
with distinct parameter values were completed for each algorithm. We found that for each
noisy image there existed some parameter set for each algorithm such that the restored images
were of comparable quality.
The resulting image quality achieved is depicted above with the original image (Fig. 6.10),
the image corrupted with SNR = 2 (Fig. 6.11), the MFA restored image (Fig. 6.12), and the
GNC restored image (Fig. 6.13).
Coding of the GNC algorithm can be found in [6.16], which performs descent using
successive over-relaxation (SOR). By using an implementation of MFA which also uses
SOR, we found MFA to run roughly ten times faster than GNC for high-noise
cases (SNR < 3). For cleaner images (SNR ≥ 4), GNC ran faster.

6A.4.2 MFA and VCD – equivalent algorithms

Before we can complete this comparison, we need to elaborate on the nature of a spatial
derivative. A derivative of brightness with respect to distance in an image could be written
∂f/∂x = lim_{Δx→0} [f(x + Δx) − f(x)] / Δx.    (6.57)
In a sampled image (as all digital images are), however, the limiting process makes no sense,
for as ter Haar Romeny et al. point out [6.73], one cannot take differences at scales smaller
than a pixel. Instead, one must estimate the derivative at a point by operations performed on
some neighborhood of that point. The topic of estimating derivatives is an old one, and we
will not examine it further except to point out that most analyses have concluded that such
estimates (for higher order derivatives as well) are generally computed by a kernel operator
in one dimension and by the Euclidean norm of an array of n operators in n dimensions.
For the purposes of this derivation, we consider only derivatives in the x direction. We will

generalize this in a few paragraphs. Thus, we may rewrite the prior term as
 
H_p(f) = (−b / (√(2π) T)) Σ_i exp(−(f ⊗ r)_i² / (2T²)).    (6.58)

By the notation ( f ⊗ r )i we mean the result of applying a kernel r to the image f at point i. The
kernel may be chosen to emphasize problem-dependent image characteristics. This general
form has been used to remove noise from piecewise-constant [6.38] and piecewise-linear
[6.8, 6.9] images.
In the following derivation, we consider only the prior term.
To perform gradient descent we will need the derivative, so we calculate

  
∂H_p/∂f_i = (b / (√(2π) T³)) [(f ⊗ r) exp(−(f ⊗ r)² / (2T²)) ⊗ r_rev]_i    (6.59)

where r_rev denotes the mirror image of the kernel r.


[Margin note: Students: Pay attention! The magic is here.]
Now, let us assume that the variation (f ⊗ r)_i, to be made small almost everywhere, is the
magnitude of the gradient of the image. It is well known [6.53] that using the derivative of
a Gaussian as a kernel is an excellent way to estimate image derivatives in the presence of
noise. Denoting by G_x the derivative with respect to x of a Gaussian, we could substitute
[G_x, G_y] = ∇G for r in Eq. (6.59); writing ∇f for f ⊗ ∇G results in

∂H_p/∂f = −κ(∇((∇f) exp(−(∇f)²))).    (6.60)

[Margin note: And here.]
In the equation above, we have lumped the constants together into κ and set the annealing
control parameter, T, to 1 for clarity. We have then made use of the observation that for
centered first-derivative kernels, f ⊗ h = −(f ⊗ h_rev). We will discuss the impact of T in
the next section.
Finally, we consider the use of ∂H(f)/∂f in a gradient descent algorithm. In the simplest
implementations of gradient descent, f is updated by (compare with Eq. (6.20))

f_i^{k+1} = f_i^k − α ∂H/∂f_i    (6.61)

where f^k denotes the value of f at iteration k, and where α is some small constant (or, in more
sophisticated algorithms, a function of the Hessian of H). Rewriting Eq. (6.61),

∂H/∂f_i = (f_i^{k+1} − f_i^k) / α    (6.62)

we note that the right-hand side of Eq. (6.62) represents a change in f between iterations k and
k + 1, and in fact bears a strong resemblance to the form of the derivative of f. We make
this similarity explicit by defining that iteration k is calculated at time t and iteration k + 1
is calculated at time t + Δt. (In similar contexts, Δt is sometimes known as the “evolution
parameter.”) Since Δt is an artificially introduced parameter, without physically meaningful

units, we may scale it by any convenient proportionality constant, and we have


∂H/∂f_i = (f_i(t + Δt) − f_i(t)) / Δt ≡ ∂f_i/∂t,    (6.63)

where we have allowed the constant α to be renamed Δt so the expression looks like a
derivative with respect to time. Substituting this (re)definition into Eq. (6.60), we have our
final result: By simply changing notation and making explicit the time between iterations,
the derivative of the MFA prior term may be rewritten as
∂f_i/∂t = −κ(∇((∇f) exp(−(∇_T f)²))),    (6.64)

where ∇_T f denotes the scaling of the gradient magnitude by T.
Writing the diffusion equation
∂f_i/∂t = −κ(∇(c ∇f))    (6.65)
in which the conductivity, c, is replaced by the exponential, we observe that Eq. (6.64) is
precisely the form of the diffusion equation used in variable-conductance diffusion (VCD)
[6.31, 6.59, 6.62]. The constant κ simply incorporates into one place all the constants that are involved.
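The correspondence is easy to exercise numerically. The sketch below is our own illustration, not code from the text: it takes one step of ∂f/∂t = ∇·(c∇f) with c = exp(−(∇f)²/T²), using the conventional positive sign for the divergence term, with kappa playing the role of the lumped constant κ.

    import numpy as np

    def vcd_step(f, kappa=0.2, T=1.0):
        # One explicit variable-conductance diffusion step:
        #   f <- f + kappa * div(c * grad f),  c = exp(-(|grad f| / T)^2).
        # Forward differences estimate the gradient; backward differences
        # complete the divergence.  The appends/prepends give zero-flux borders.
        gx = np.diff(f, axis=1, append=f[:, -1:])
        gy = np.diff(f, axis=0, append=f[-1:, :])
        c = np.exp(-(gx**2 + gy**2) / T**2)
        div = (np.diff(c * gx, axis=1, prepend=np.zeros((f.shape[0], 1)))
               + np.diff(c * gy, axis=0, prepend=np.zeros((1, f.shape[1]))))
        return f + kappa * div

Where the gradient is large, c collapses toward zero and the flow stops; this is exactly the edge-preserving behavior the exponential in Eq. (6.64) dictates.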
By Eq. (6.65), we have shown the equivalence of MFA and VCD, provided that MFA is
performed without using the noise term, Eq. (6.11). This equivalence provides for the union of
two schools of thought concerning image analysis: The first (optimization) school considers
the properties that an image ought to have. It then sets up an optimization problem whose
solution is the desired image. One might also call this the restoration school. The second
(process) school is more concerned with determining the appropriate spatial analysis to apply.
Adaptive filtering, diffusion, template matching, etc. are more concerned with the process
itself than with what that process does to some hypothetical “energy function” of the image.
The result of this section demonstrates that, at least for this particular form of edge-preserving
smoothing, these two schools are not just philosophically equivalent; they are precisely
equivalent.
The above equivalence considered only the prior term of the MFA objective function.
Addition of the noise term converts an image feature extraction algorithm into a constrained
restoration algorithm.
Nordström [6.59] also observed a similarity between diffusion techniques and regulariza-
tion (optimization) methods. He notes that “the anisotropic diffusion method (which Whitaker
[6.79] calls VCD) does not seek an optimal solution of any kind.” This is not quite true: “The
developer of the technique did not intend for it to be used as a minimization technique” is
possibly a better way to say it. Nordström then proceeds to “unify from the original outlook
quite different methods of regularization and anisotropic diffusion.” He proceeds quite ele-
gantly and precisely to define a cost function whose behavior is an anisotropic diffusion, in
a similar manner to the derivation presented here. Nordström also argues for the necessity of
adding a “stabilizing cost” to “restrict the space of possible estimated image functions.” At
this point the reader should not be surprised to learn that the form of the stabilized cost is

Σ_i (f_i − g_i)²    (6.66)

[Figure: a two-layer network: units computing f′ and f′², a Σ and an exp(−x/T) nonlinearity, then a second Σ and an integrator.]
Fig. 6.14. It is straightforward to configure a locally connected, recurrent network which
performs the image relaxation operations described here.

which we showed in Eq. (6.11) is a measure of the effect of Gaussian noise in a blur-free
imaging system. Thus we see that biased anisotropic diffusion (BAD) [6.59] may be thought of
as a maximum a posteriori restoration of an image. This observation now permits researchers
in VCD/BAD to consider use of different stabilizing costs, if they have additional information
concerning the noise generation process.

6A.5 Relationship to neural networks

The relationship between a Hopfield neural network and an optimization problem is well
known. Given an objective of the “Ising” type, it is straightforward to find a recurrent neural
network whose stable states are minima of the objective function (see [6.39] as well as
[6.40]).
The most straightforward forms of this type of recurrent net deal with binary-valued
variables, with one neuron representing each variable. In this usage, a “neuron” is a sum-
of-products operator which produces a weighted sum of its inputs. That sum is then passed
through a monotonic nonlinearity – usually a limiter of some sort, e.g., a sigmoid. Following
this definition, we may describe the operation represented by Eq. (6.59) as a two-layer
network, in which each layer is only locally connected, as shown in Fig. 6.14 (see also
[6.13]).
The local connectivity is important from an implementation point of view, since that aspect
makes construction of parallel, real-time hardware very feasible.

6A.6 Conclusion

[Margin note: MFA combines annealing with gradient descent.]
An equivalence exists between the problems of image optimization and diffusion. Others
have made similar observations. In addition to Nordström [6.59], Geiger and Yuille came to a
similar conclusion [6.26] for an energy function requiring explicit line processes. Since it has
been shown [6.38] that the line processes are not required, their result may now be interpreted
more generally. In an image optimization problem (in particular, restoration), one defines a
criterion function which is to be minimized, and applies some minimization scheme to find a
global (or at least a good) minimum. Thus, an image restoration problem may be considered
as a combination of goals: (1) to preserve the information in the original image, and thereby
produce a resulting image which resembles that original (or the result of an operation on it)
in some way; and (2) to produce an image which possesses certain properties, such as local
smoothness except at boundaries. If one abandons the first goal, then restoration problems

become iconic feature extraction problems. Wu and Doerschuk [6.81] have developed an
attractive extension to this work.
Finally, recall that in [6.13] we demonstrated that MFA operations could be calculated
by a two-layer, locally connected, recurrent neural network. From this paper one may then
conclude that GNC and VCD are likewise implementable by straightforward neural networks.
These results lead us to conjecture the following guiding principles for the design of feature
extraction algorithms.
• Relaxation is a central concept. A relaxation algorithm possesses the following properties:
(1) It must be iterative. That is, the output of one pass of the algorithm is in the same format
as the input, so that the algorithm may be applied to its own output. (2) It must converge.²
• The relaxation algorithm needs to be local in nature. That is, at any time, changes to any
pixel depend only on the local neighborhood of that pixel. Compliance with this guideline
allows global interactions to occur smoothly over time (iteration number) and space, and
provides a theoretical basis, via the Gibbs/Markov field equivalence, for analyzing the
operations.
• The equivalence between diffusion and optimization is helpful in understanding the
performance of both forms of algorithms. For those designing diffusion algorithms, seeing
them as an optimizing relaxation is helpful: Since all relaxation algorithms minimize
something, it is useful to look at the integral of the diffusion step (although this process
is often intractable) to see what properties are actually being minimized in the develop-
ment/application of an ad hoc technique. For those designing optimization methods, seeing
the result as a spatial deformation according to the local degree of nonuniformity followed
by an averaging can help to understand the spatiotemporal effects of the optimization.
• In designing feature extraction algorithms, it is helpful to consider the algorithm as a
restoration, even if no residual is explicitly minimized, for then the precise effect of the
algorithm on the image may be better comprehended.
• Scale variation in spatial analysis algorithms and temperature control in annealing
algorithms are closely related; the power that has been found in both of them is related.
• Finally, the nonlinear operations (exponentials) indicated in all the algorithms mentioned
are absolutely essential to success. The Kolmogorov theorem [6.49] demonstrates
sufficiency by stating that a linear operation followed by the appropriate nonlinearity allows
computation of an arbitrary mapping. Here, we claim that such nonlinearity is not only
sufficient but necessary. This fact probably contributes most significantly to the recent
success of a number of neural network applications.

Bibliography

[6.1] I. Abdelqader, S. Rajala, W. Snyder, and G. Bilbro, “Energy Minimization Approach
to Motion Estimation using Mean Field Annealing,” Signal Processing, July 1992.
[6.2] E. Allgower and K. Georg, Numerical Continuation Methods, Berlin, Springer-Verlag,
1990.
² Some authors also add a third requirement, “localness,” to the definition of a relaxation. We prefer to
include this as a separate bullet point.

[6.3] H. Andrews and B. Hunt, Digital Image Restoration, Englewood Cliffs, NJ, Prentice-
Hall, 1977.
[6.4] D. Baker and J. Aggarwal, “Geometry Guided Segmentation of Outdoor Scenes,”
SPIE Applications of Artificial Intelligence, VI, pp. 576–583, 1988.
[6.5] J. Besag, “Spatial Interaction and the Statistical Analysis of Lattice Systems,” Journal
of the Royal Statistical Society, B, 36, pp. 192–236, 1974.
[6.6] J. Besag, “On the Statistical Analysis of Dirty Pictures,” Journal of the Royal Statistical
Society, B, 48 (3), 1986.
[6.7] G. Bilbro and W. Snyder, “Fusion of Range and Luminance Data,” IEEE Symposium
on Intelligent Control, Arlington, August, 1988.
[6.8] G. Bilbro and W. Snyder, “Range Image Restoration using Mean Field Annealing,”
In Advances in Neural Network Information Processing Systems, San Mateo, CA,
Morgan-Kaufmann, 1989.
[6.9] G. Bilbro and W. Snyder, “Mean Field Annealing, an Application to Image Noise
Removal,” Journal of Neural Network Computing, Fall, 1990.
[6.10] G. Bilbro and W. Snyder, “Optimization of Functions with Many Minima,” IEEE
Transactions on Systems, Man, and Cybernetics, 21(4), July/August, 1991.
[6.11] G. Bilbro, R. Mann, T. Miller, W. Snyder, D. Van den Bout and M. White, “Opti-
mization by Mean Field Annealing,” In Advances in Neural Information Processing
Systems, San Mateo, CA, Morgan-Kauffman, 1989.
[6.12] G. Bilbro, W. Snyder, S. Garnier, and J. Gault, “Mean Field Annealing: a Formalism
for Constructing GNC-like Algorithms,” IEEE Transactions on Neural Networks, 3(1)
pp. 131–138, 1992.
[6.13] G. Bilbro, W. Snyder, and R. Mann, “Mean Field Approximation Minimizes Relative
Entropy,” Journal of the Optical Society of America, A, 8(2), February 1991.
[6.14] M. Black and A. Jepson, “Estimating Optical Flow in Segmented Images Using
Variable-order Parametric Models with Local Deformations,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 18(10), 1996.
[6.15] A. Blake, “Comparison of the Efficiency of Deterministic and Stochastic Algorithms
for Visual Reconstruction,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 1(1), 1989.
[6.16] A. Blake and A. Zisserman, Visual Reconstruction, Cambridge, MA, MIT Press, 1987.
[6.17] E. Brezin, D. LeGuillon, and J. Zinn-Justin, “Field Theoretical Approaches to Critical
Phenomena,” Phase Transitions and Critical Phenomena, volume 6, eds. C. Domb
and M. Green, New York, Academic Press, 1976.
[6.18] R. Burden, J. Faires, and A. Reynolds, Numerical Analysis, Boston, Prindle, 1981.
[6.19] H. Chang and M. Fitzpatrick, “Geometrical Image Transformation to Compensate
for MRI Distortions,” SPIE Medical Imaging IV, 1233, pp. 116–127, February,
1990.
[6.20] H. Derin and H. Elliot, “Modeling and Segmentation of Noisy and Textured Images
using Gibbs Random Fields,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 9, pp. 39–55, 1987.
[6.21] H. Ehricke, “Problems and Approaches for Tissue Segmentation in 3D-MR Imaging,”
SPIE Medical Imaging IV: Image Processing, 1233, pp. 128–137, February, 1990.

[6.22] J. Elder, “Are Edges Incomplete?” International Journal of Computer Vision, 34(2),
1999.
[6.23] J. Elder and R. Goldberg, “Image Editing in the Contour Domain,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 23(3), 2001.
[6.24] J. Elder and S. Zucker, “Scale Space Localization, Blur, and Contour-based
Image Coding,” IEEE Conference on Computer Vision and Pattern Recognition, San
Francisco, CA, 1996.
[6.25] D. Geiger and F. Girosi, “Parallel and Deterministic Algorithms for MRFS: Sur-
face Reconstruction and Integration,” AI Memo, No 1114, Cambridge, MA, MIT,
1989.
[6.26] D. Geiger and A. Yuille, “A Common Framework for Image Segmentation by Energy
Functions and Nonlinear Diffusion,” MIT AI Lab Report, Cambridge, MA, 1989.
[6.27] D. Geman and S. Geman, “Stochastic Relaxation, Gibbs Distributions, and the
Bayesian Restoration of Images,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 6(6), November, 1984.
[6.28] R. Gonzalez and P. Wintz, Digital Image Processing, 2nd edn, Reading, MA, Addison-
Wesley, 1987.
[6.29] A. Gray, J. Kay, and D. Titterington, “An Empirical Study of the Simulation of Var-
ious Models used for Images,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(5), 1994.
[6.30] B. Groshong, G. Bilbro, and W. Snyder, “Restoration of Eddy Current Images by
Constrained Gradient Descent,” Journal of Nondestructive Evaluation, December,
1991.
[6.31] S. Grossberg, “Neural Dynamics of Brightness Perception: Features, Boundaries,
Diffusion, and Resonance,” Perception and Psychophysics, 36(5), pp. 428–456,
1984.
[6.32] J. Hadamard, Lectures on the Cauchy Problem in Linear Partial Differential Equations,
New Haven, CT, Yale University Press, 1923.
[6.33] J. Hammersley and P. Clifford, “Markov Field on Finite Graphs and Lattices,”
unpublished.
[6.34] F. Hansen and H. Elliot, “Image Segmentation using Simple Markov Field Models,”
Computer Graphics and Image Processing, 20, pp. 101–132, 1982.
[6.35] R. Haralick and G. Shapiro, “Image Segmentation Techniques,” Computer Vision,
Graphics, and Image Processing, 29, pp. 100–132, 1985.
[6.36] E. Hensel, Inverse Theory and Applications for Engineers, Englewood Cliffs, NJ,
Prentice-Hall, 1991.
[6.37] H. Hiriyannaiah, Signal Reconstruction using Mean Field Annealing. Ph.D. Thesis,
North Carolina State University, Raleigh, NC, 1990.
[6.38] H. Hiriyannaiah, G. Bilbro, W. Snyder, and R. Mann, “Restoration of Locally Ho-
mogeneous Images using Mean Field Annealing,” Journal of the Optical Society of
America A, 6, pp. 1901–1912, December, 1989.
[6.39] J. Hopfield, “Neurons with Graded Response Have Collective Computational Prop-
erties Like Those of Two-state Neurons,” Proceedings of the National Academy of
Sciences USA, 81, pp. 3088–3092, 1984.

[6.40] J. Hopfield and D. Tank, “Neural Computations of Decisions in Optimization Prob-
lems,” Biological Cybernetics, 52, pp. 141–152, 1985.
[6.41] M. Irani, B. Rousso, and S. Peleg, “Recovery of Ego-motion Using Region Align-
ment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(3), 1997.
[6.42] A. Kak and M. Slaney, Principles of Computerized Tomographic Imaging, New York,
IEEE Press, 1988.
[6.43] S. Kapoor, P. Mundkur, and U. Desai, “Depth and Image Recovery using a MRF
Model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(11),
1994.
[6.44] I. Kapouleas and C. Kulikowski, “A Model-based System for Interpretation of MR
Human Brain Scans,” Proceedings of the SPIE, Medical Imaging II, vol. 914, February,
1988.
[6.45] R. Kashyap and R. Chellappa, “Estimation and Choice of Neighbors in Spatial-
interaction Model of Images,” IEEE Transactions on Information Theory, 29, pp.
60–72, January, 1983.
[6.46] M. Kelly, In Machine Intelligence, vol 6, Edinburgh, University of Edinburgh Press,
1971.
[6.47] R. Kindermann and J. Snell, Markov Random Fields and Their Applications, Provi-
dence, RI, American Mathematical Society, 1980.
[6.48] S. Kirkpatrick, C. Gelatt, and M. Vecchi, “Optimization by Simulated Annealing,”
Science, 220, pp. 671–680, 1983.
[6.49] A. Kolmogorov, “On the Representation of Continuous Functions of One Variable by
Superposition of Continuous Functions of One Variable and Addition,” AMS Trans-
lation, 2, pp. 55–59, 1957.
[6.50] R. Lee, and R. Leahy, “Multi-spectral Tissue Classification of MR Images Using
Sensor Fusion Approaches,” SPIE Medical Imaging IV: Image Processing, 1233,
pp. 149–157, February, 1990.
[6.51] S. Li, “On Discontinuity-adaptive Smoothness Priors in Computer Vision,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(6), 1995.
[6.52] R. Malik and T. Whangbo, “Angle Densities and Recognition of 3D Objects,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1997.
[6.53] D. Marr, Vision, San Francisco, CA, Freeman, 1982.
[6.54] J. Marroquin, Probabilistic Solution to Inverse Problems, Doctoral Dissertation, MIT,
1985.
[6.55] P. Morris, in Nuclear Magnetic Resonance Imaging in Medicine and Biology, Oxford,
Clarendon Press, 1986.
[6.56] J. Moussouris, “Gibbs and Markov Systems with Constraints,” Journal of Statistical
Physics, (10), pp. 11–33, 1974.
[6.57] M. Nashed, “Aspects of Generalized Inverses in Analysis and Regularization,” in
Generalized Inverses and Applications, ed. by M. Nashed, New York, Academic
Press, 1976.
[6.58] T. Nelson, “Propagation Characteristics of a Fractal Network: Applications to the His-
Purkinje Conduction System,” SPIE Medical Imaging IV: Image Processing, 1233,
pp. 23–32, February, 1990.

[6.59] N. Nordström, “Biased Anisotropic Diffusion – A Unified Regularization and Diffu-
sion Approach to Edge Detection,” Image and Vision Computing, 8(4), pp. 318–327,
1990.
1990.
[6.60] T. Pavlidis, Structural Pattern Recognition, Berlin, Springer-Verlag, 1977.
[6.61] A. Pentland, “Interpolation using Wavelet Bases,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 16(4), 1994.
[6.62] P. Perona and J. Malik, “Scale-space and Edge Detection using Anisotropic Diffusion,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, pp. 629–639,
July, 1990.
[6.63] H. Qi, A High-resolution, Large-area Digital Imaging System, Ph.D. Thesis, North
Carolina State University, 1999.
[6.64] A. Rosenfeld and A. Kak, Digital Picture Processing, 2nd edn, Vol. 2, New York,
Academic Press, 1982.
[6.65] H. Samet, The Design and Analysis of Spatial Data Structures, Reading, MA, Addison-
Wesley, 1989.
[6.66] P. Santago, K. Link, W. Snyder, J. Worley, S. Rajala, and Y. Han, “Restoration of Car-
diac Magnetic Resonance Images,” Symposium on Computer Based Medical Systems,
Chapel Hill, NC, June 3–6, 1990.
[6.67] S. Shemlon and S. Dunn, “Rule-based Interpretation with Models of Expected Struc-
ture,” SPIE Medical Imaging IV, 1233, pp. 33–44, February, 1990.
[6.68] W. Snyder, G. Bilbro, A. Logenthiran, and S. Rajala, “Optimal Thresholding – A New
Approach,” Pattern Recognition Letters, 11(12), December, 1990.
[6.69] W. Snyder, P. Santago, A. Logenthiran, K. Link, G. Bilbro, and S. Rajala, “Segmenta-
tion of Magnetic Resonance Images using Mean Field Annealing,” XII International
Conference on Information Processing in Medical Imaging, Kent, England, July 7–11,
1991.
[6.70] W. Snyder, A. Logenthiran, P. Santago, K. Link, G. Bilbro, and S. Rajala, “Segmenta-
tion of Magnetic Resonance Images using Mean Field Annealing,” Image and Vision
Computing, 10(6), pp. 361–368, 1992.
[6.71] C. Soukoulis, K. Levin, and G. Grest, “Irreversibility and Metastability in Spin-glasses.
I. Ising Model,” Physical Review B, 28(3), pp. 1495–1509, 1983.
[6.72] B. Super and W. Klarquist, “Patch-based Stereo in a General Binocular Viewing
Geometry,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(3),
1997.
[6.73] B. ter Haar Romeny, L. Florack, J. Koenderink, and M. Viergever, “Scale Space:
Its Natural Operators and Differential Invariants,” XII International Conference on
Information Processing in Medical Imaging, Kent, England, July 7–11, 1991.
[6.74] P. Torr, R. Szeliski, and P. Anandan, “An Integrated Bayesian Approach to Layer Ex-
traction from Image Sequences,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 23(3), 2001.
[6.75] J. van Laarhoven and E. Aarts, Simulated Annealing: Theory and Applications,
Norwell, MA, Reidel, 1988.
[6.76] M. Vannier, et al., “Multispectral Analysis of Magnetic Resonance Images,” Radiol-
ogy, 154, pp. 221–224, January, 1985.

[6.77] D. Van den Bout and T. Miller, “Graph Partitioning using Annealed Neural Networks,”
IEEE Transactions on Neural Networks, 1(2), pp. 192–203, 1990.
[6.78] C. Wang, W. Snyder, and G. Bilbro, “Optimal Interpolation of Images,” Neural Net-
works for Computing Conference, Snowbird, UT, April, 1995.
[6.79] R. Whitaker, “Geometry-limited Diffusion in the Characterization of Geometric
Patches in Images,” TR91-039, Dept. of Computer Science, UNC, Chapel Hill, NC,
1991.
[6.80] G. Wolberg and T. Pavlidis, “Restoration of Binary Images Using Stochastic Relax-
ation With Annealing,” Pattern Recognition Letters, 3(6), pp. 375–388, December,
1985.
[6.81] C. Wu and P. Doerschuk, “Cluster Expansions for the Deterministic Computation
of Bayesian Estimators Based on Markov Random Fields,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(3), 1995.
[6.82] M. Yaou and W. Chang, “Fast Surface Interpolation Using Multiresolution Wavelet
Transform,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(7),
1994.
[6.83] J. Yi and D. Chelberg, “Discontinuity-preserving and Viewpoint Invariant Reconstruc-
tion of Visible Surfaces Using a First-order Regularization,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(6), 1995.
7 Mathematical morphology

A man’s discourse is like to a rich Persian carpet, the beautiful figures and patterns of which
can be shown only by spreading and extending it out; when it is contracted and folded up,
they are obscured and lost
Plutarch

The suffix “-ology” means “study of-,” so obviously, “morphology” is the study of
morphs; answering critical questions like: “How come they only come out at night,
and then fly toward the light?” and “Why is it that bug zappers only toast the harmless
critters, leaving the ’skeeters alone?” and – HOLD IT! That’s MORPH-ology, the
study of SHAPE, not moths! Try again . . .

7.1 Binary morphology

We begin by considering ONLY BINARY images. That’s important, remember it!


Only binary! We will discuss a couple of operators first. Then, once you understand
how they work, we’ll explain how they are used in binary images. As an extension
to binary morphology, we also describe gray scale morphology operations and the
corresponding operators.

7.1.1 Dilation
First, the intuitive definition: The dilation of a (BINARY) image is that same image
with all the foreground regions made just a little bit bigger.
Now, formally: We consider two images, f A and f B , and let A and B be sets of
ordered pairs, consisting of the coordinates of each foreground pixel in f A and f B ,
respectively.
Consider one pixel in f B , and its corresponding element (ordered pair) of B, call
that element b ∈ B. Create a new set by adding the ordered pair b to EVERY ordered
pair in A. Let’s look at a tiny example.
For this image, A = {(2, 8), (3, 6), (4, 4), (5, 6), (6, 4), (7, 6), (8, 8)}. Adding the
pair (−1, 1) results in the set A(−1,1) = {(1, 9), (2, 7), (3, 5), (4, 7), (5, 5), (6, 7),
(7, 9)}. The corresponding image is also shown in Fig. 7.1, and we hope you

[Figure: two 10 × 10 grids, (a) the foreground pixels of f_A and (b) the foreground pixels of f_A translated by (−1, 1).]

Fig. 7.1. Example of dilation. (a) The original binary image. (b) The binary image dilated by
B = {(−1, 1)}.

observed that A(−1,1) is nothing more than a translation of A. With this concept
firmly understood, think about what would happen if you constructed a SET of
translations of A, one for each pair in B. We denote that as {Ab , b ∈ B}, that is, b is
one of the ordered pairs in B.
Formally, we define the DILATION of A by B as A ⊕ B = {a + b|(a ∈ A,
b ∈ B)}, which is the same as the union of all those translations of A,

A ⊕ B = ⋃_{b∈B} A_b    (7.1)

and we use the same symbol to denote the dilation of images: f A ⊕ f B . Here is
another example.

[Figure: grids of f_A and f_B.]

A = {(2, 8), (3, 6), (4, 4), (5, 6), (6, 4), (7, 6), (8, 8)}        B = {(0, 0), (0, 1)}

Based on the definition,

A ⊕ B = A_{(0,0)} ∪ A_{(0,1)}    (7.2)
      = {(2, 8), (2, 9), (3, 6), (3, 7), (4, 4), (4, 5), (5, 6), (5, 7), (6, 4),
         (6, 5), (7, 6), (7, 7), (8, 8), (8, 9)}    (7.3)

and
[Figure: 10 × 10 grid showing the foreground pixels of f_A ⊕ f_B.]

[Margin note: It's time to start getting used to set notation.]
Denote by #A the number of elements in set A. In this example, #A = 7 and #(A ⊕ B) = 14.
This happened to be true only coincidentally, because there was no overlap
between A_{(0,0)} and A_{(0,1)}; said another way, A_{(0,0)} ∩ A_{(0,1)} = Ø.
For a more general problem, this will not be the case. In general,

#(A ⊕ B) ≤ #A · #B. (7.4)
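As a quick check on the set formalism, here is a minimal Python sketch of Eq. (7.1), applied to the sets A and B of the example above; the coordinate convention is the same ordered-pair one used in the text.

    # Dilation as a union of translations, Eq. (7.1).
    A = {(2, 8), (3, 6), (4, 4), (5, 6), (6, 4), (7, 6), (8, 8)}
    B = {(0, 0), (0, 1)}

    def dilate(A, B):
        # A dilate B = { a + b : a in A, b in B }
        return {(ax + bx, ay + by) for (ax, ay) in A for (bx, by) in B}

    print(sorted(dilate(A, B)))                  # the 14 pairs of Eq. (7.3)
    print(len(dilate(A, B)) <= len(A) * len(B))  # Eq. (7.4): always True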

To go further, we need to define some notation: If x is an ordered pair, then (1) the
translation¹ of a set A by x is A_x, (2) the reflection of A is Ã = {(−x, −y) | (x, y) ∈ A},
and (3) the complement of set A is A^c. An example of reflection is

[Figure: grids of f_A and its reflection f_Ã.]

In this example, A = {(0, 0), (1, 0), (1, 1)} and the reflection of A is {(0, 0), (−1, 0),
(−1, −1)}.

7.1.2 Erosion
Now, we define the (sort of) inverse of dilation, erosion,

A ⊖ B = {a | (a + b) ∈ A for every b ∈ B}    (7.5)

which can be written in terms of translations by

A ⊖ B = ⋂_{b∈B̃} A_b.    (7.6)

¹ We do not have to be limited to a 2-space. As long as x and A are drawn from the same space, it works. More
generally, if A and B are sets in a space ε, and x ∈ ε, the translation of A by x is denoted A_x = {y | y = a + x
for some a ∈ A}.

Notice two things: The second set, B, is reflected, and the intersection symbol is
used. Let’s do an example.

[Figure: 10 × 10 grid showing the 17 foreground pixels of f_A, and the grid of the s.e. f_B.]

B = {(0, 0), (1, 0)}        B̃ = {(0, 0), (−1, 0)}

Rather than the tedious job of listing all the 17 elements of A, just draw

[Figure: grids of f_{A(0,0)}, f_{A(−1,0)}, and their intersection f_{A(0,0)} ∩ f_{A(−1,0)}.]

So now we have dilation and erosion defined. You will observe that usually (for all
practical purposes) one of the images is “small” compared to the other; that is, in
the example above

#A ≫ #B.    (7.7)

When this is the case, we refer to the smaller image, f B , as the “structuring element”
(s.e.).
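Erosion admits an equally short sketch. The first function below follows Eq. (7.5) directly (candidates are drawn from A itself); the second realizes the translation form of Eq. (7.6), and the two agree on any pair of sets.

    def erode(A, B):
        # Eq. (7.5): keep each a in A whose translate by every b in B
        # stays inside A.
        return {a for a in A
                if all((a[0] + bx, a[1] + by) in A for (bx, by) in B)}

    def erode_by_translations(A, B):
        # Eq. (7.6): intersect the translations of A by the points of the
        # reflected s.e.  (We also intersect with A itself, matching the
        # candidate restriction of Eq. (7.5).)
        result = set(A)
        for (bx, by) in {(-x, -y) for (x, y) in B}:
            result &= {(ax + bx, ay + by) for (ax, ay) in A}
        return result

With B = {(0, 0), (1, 0)} as in the example, both functions compute the intersection f_{A(0,0)} ∩ f_{A(−1,0)} sketched above.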

7.1.3 Properties of dilation and erosion


• Commutative property. Dilation is commutative, which means we can dilate
images in any order without changing the result

A ⊕ B = B ⊕ A. (7.8)

• Associative property. Dilation is also associative, which means we can group


images in a dilation in any way we want without changing the result

(A ⊕ B) ⊕ C = A ⊕ (B ⊕ C). (7.9)

• Distributive property. This property says that in an expression where we need to


dilate an image with the union of two images, we can dilate first and then take
the union. In other words, the dilation can be distributed over all the terms inside
the parentheses

A ⊕ (B ∪ C) = (A ⊕ B) ∪ (A ⊕ C). (7.10)

• Increasing property. If A ⊆ B, then

A ⊕ K ⊆ B ⊕ K, (7.11)

for any s.e. K. When this property holds, we say the operator is “increasing.”
An example proof: Dilation is increasing
Let the set A consist of elements Ai . A = {A1 , A2 , . . . , An }, and let B be similarly
denoted. Furthermore, suppose B ⊆ A. Now, suppose both A and B are dilated by
the same s.e., K. Take a single element of K , say K 1 , and dilate each element of
A by K 1 , A ⊕ K 1 = {A1 + K 1 , A2 + K 1 , . . . , An + K 1 }, and similarly dilate B.
Since every element of B was also an element of A, every element of B ⊕ K i is in
A ⊕ K i . Since this observation is true for an arbitrary element of K , it is true for
all elements of K . Now consider the union of the results of applying two elements
of the s.e. K to A: A12 = (A ⊕ K 1 ) ∪ (A ⊕ K 2 ). Since B ⊕ K 1 ⊆ A ⊕ K 1 and
B ⊕ K 2 ⊆ A ⊕ K 2 , we know from set theory that if R1 ⊆ S1 and R2 ⊆ S2 then
R1 ∪ R2 ⊆ S1 ∪ S2 and we are done.
• Extensive and anti-extensive properties. If we say an operator is “extensive,” we
mean that applying this operator to a set, A, produces an answer which contains
A. If the s.e. contains the origin (that is, element (0, 0)), dilation is extensive:

A ⊕ K ⊇ A. (7.12)

As you might guess, erosion has some extensive properties as well: That is, erosion

is “anti-extensive”: if o ∈ B, then A ⊖ B ⊆ A, where o indicates the element at


the origin.
• Duality. Duality is similar to De Morgan’s laws: It relates set complement, dilation,
and erosion.

(A ⊖ B)^c = A^c ⊕ B̃
(A ⊕ B)^c = A^c ⊖ B̃    (7.13)

where the superscript c denotes set complement.


• Other properties of erosion.

A ⊖ (B ⊕ C) = (A ⊖ B) ⊖ C    (7.14)
(A ∪ B) ⊖ C ⊇ (A ⊖ C) ∪ (B ⊖ C)    (7.15)
A ⊖ (B ∩ C) ⊇ (A ⊖ B) ∪ (A ⊖ C)    (7.16)
A ⊖ (B ∪ C) = (A ⊖ B) ∩ (A ⊖ C).    (7.17)
[Margin note: Erosion and dilation are not inverses of each other.]
A cautionary note: You cannot do cancellation with morphological operators as
you might suspect. For example, if A = B ⊖ C, we could dilate both sides by C
to get A ⊕ C = (B ⊖ C) ⊕ C, and if dilation and erosion were true inverses, the
RHS would be just B. However, the RHS is in fact the opening of B by C, and not
simply B.

7.1.4 Opening and closing


The OPENING of f A by an s.e. f B is written as

f_A ∘ f_B = (f_A ⊖ f_B) ⊕ f_B    (7.18)

and, as you might guess, the CLOSING of f A by an s.e. f B is written as

f_A • f_B = (f_A ⊕ f_B) ⊖ f_B.    (7.19)

An application

So what is the purpose of all this? Let’s do an example: Inspection of printed circuit
(PC) boards. Here’s a picture of a PC board with two traces on it shorted together by
a hair which was stuck to the board when it went through the wave solder machine.
We will use opening to identify the short.
[Figure: the PC board image, with the hair bridging the two traces.]

First, erode the image using a small s.e. We choose an s.e. which is smaller than the
features of interest (the traces), but larger than the defect. The erosion then looks
like this:
[Figure: the eroded image.]

Now, we dilate back using the same s.e.
[Figure: the result of dilating the eroded image (the opening).]

and surprise, surprise, the defect is gone. For inspection purposes, one could now
subtract the original from the opened, and the difference image would contain only
the defect. Furthermore these operations can be done in hardware, blindingly fast.

Another way to think of opening

The opening of f A by f K selects precisely those points in f A that “match” f K


in the following sense: Take the structuring element f K and place it so that every
foreground pixel in f K covers a foreground pixel in f A . If you can find a position
where the s.e. can be located to make this true, then every pixel of f A which is
covered is in the opening. Now, move f K around (by translations only), and find
every such position – every place the s.e. can be placed where it is totally inside
the image. The set of covered pixels determines the opening of f A by f K . Here’s an
example of opening in 1D, viewed from this perspective (from Haralick and Shapiro
[7.12]). Let A be the union of open intervals

A = (3.1, 7.4) ∪ (11.5, 11.6) ∪ (18.9, 19.8)    (7.20)

and open it with the s.e. K = (−1, 1). Then

A ∘ K = (3.1, 7.4).    (7.21)

This example illustrates first of all that morphological concepts can be extended to a

continuous domain. (For the time being, remember, however, that this is continuous
in resolution, not in brightness value; still binary. We will fix that soon.) Second, it
illustrates the fact that opening preserves exactly the geometry of objects which are
“big enough,” and totally erases smaller objects. In this sense, opening resembles
the functioning of the median filter, where each pixel is replaced by the median of
its neighbors.

7.1.5 Properties of opening and closing


Some properties of opening and closing are listed below (some you should be pre-
pared to figure out).

• Duality: (A ∘ K)^c = A^c • K̃.
  Proof of duality. Notice how this proof is done. We expect students to do proofs
  this carefully.
  1. (A ∘ K)^c = [(A ⊖ K) ⊕ K]^c    definition of opening
  2.           = (A ⊖ K)^c ⊖ K̃     complement of dilation
  3.           = (A^c ⊕ K̃) ⊖ K̃    complement of erosion
  4.           = A^c • K̃           definition of closing.
[Margin note: Can you prove this? Is it even true?]
• Idempotency: Opening and closing are idempotent. That is, repetitions of the
same operation have no further effect:

A ∘ K = (A ∘ K) ∘ K
A • K = (A • K) • K.

• Closing is extensive: A • K ⊇ A.
• Opening is anti-extensive: A ∘ K ⊆ A.
• Images dilated by f_K remain invariant under opening by f_K. That is, f_A ⊕ f_K =
(f_A ⊕ f_K) ∘ f_K.
Proof
1. A • K ⊇ A    since closing is extensive
2. (A • K) ⊕ K ⊇ A ⊕ K    dilation is increasing
3. ((A ⊕ K) ⊖ K) ⊕ K ⊇ A ⊕ K    definition of closing
4. (A ⊕ K) ∘ K ⊇ A ⊕ K    definition of opening
5. for any B, B ∘ K ⊆ B    opening is anti-extensive
6. therefore, (A ⊕ K) ∘ K ⊆ A ⊕ K    substitution of A ⊕ K for B
7. (A ⊕ K) ∘ K = A ⊕ K    since A ⊕ K is both greater than or equal to and less
than or equal to (A ⊕ K) ∘ K, only equality can be true.

Computing opening and closing quickly

Any translation-invariant increasing operation Ψ(·), such as opening or closing, can
be computed by a union of erosions of the form

Ψ(A) = ⋃_i (A ⊖ K_i)    (7.22)
for some set of structuring elements K = {K 1 , K 2 , . . .}, where the set K is said to be
a basis set for this operation. The erosions could all be done in parallel and the union
performed very quickly using lookup table methods. More details are available in
[7.17].

7.2 Gray-scale morphology

Up to now, we have assumed that the images with which we are dealing are binary.
Now, we relax that requirement and allow f A to take on continuous values in a finite
range f_min ≤ f_A ≤ f_max. With this extension, we no longer have a simple
definition for such operations as dilation, since the union operator is not defined. To
define gray-scale morphological operations, we need first to define a new concept,
the “umbra.”
The umbra, U ( f A ) of a two-dimensional gray-scale image f A is the set of all
ordered triples, (x, y, U ), which satisfy 0 < U ≤ f A (x, y). If we think of f A as
being continuously valued, then U ( f A ) is an infinite set. To make morphological
operations feasible, we assume f_A is quantized to M values, so that

#U(f_A) ≤ M · #A.    (7.23)
To illustrate the concept of umbra, let f A be one dimensional. (We illustrate a one-
dimensional function here as a two-dimensional function in which one dimension is
always zero. That way, you have an example that is easily extended to two dimen-
sions):
A = {(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0)}
and the pixel value at the corresponding coordinate is
f A (x, 0) = [1, 2, 3, 1, 2, 3, 3].
Notice the new notation: Since f A takes on various values, depending on which
element of A is being considered, we use functional notation. Drawing f A , we have
Fig. 7.2.
In Fig. 7.2, the heavy black line represents f A , and the umbra is the shaded area
under f A . Following this figure, the heavy black line is the TOP of the umbra. We
could write the umbra as a set of ordered triples:

U ( f A ) = {(0, 0, 1), (1, 0, 1), (1, 0, 2), (2, 0, 1), (2, 0, 2), (2, 0, 3), (3, 0, 1),
(4, 0, 1), (4, 0, 2), (5, 0, 1), (5, 0, 2), (5, 0, 3), (6, 0, 1), (6, 0, 2), (6, 0, 3)}.

[Figure: plot of f_A(x, 0), drawn as a heavy line, with the umbra shaded beneath it.]

Fig. 7.2. An illustration of umbra.

Here’s the trick: Although the gray-level image is no longer binary (and therefore
not representable by set membership) the umbra does have those properties.
We may therefore define the dilation of a gray-scale image f A by a gray-scale
s.e., f B as

f_A(x, y) ⊕ f_B(x, y) = TOP(U(f_A) ⊕ U(f_B))    (7.24)

and erosion is similarly defined. Furthermore gray-scale opening and closing can be
defined in terms of gray-scale dilation and erosion.
Generalizing this concept to two-dimensional images, the umbra becomes three
dimensional, a set of triples

U(f(x, y)) = {(x, y, z) | z ≤ f(x, y)}.    (7.25)

Then, gray-scale dilation and erosion may be written compactly as

g(x, y) ⊕ h(x, y) = {(x, y, z) | z ≤ max_{(x1,y1)} (g(x − x1, y − y1) + h(x1, y1))}
g(x, y) ⊖ h(x, y) = {(x, y, z) | z ≤ min_{(x1,y1)} (g(x − x1, y − y1) − h(x1, y1))}

for (x1, y1) ∈ Ω ⊂ Z × Z, where Ω denotes the set of possible pixel locations,
assumed here to be positive and integer.
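One direct NumPy rendering of the gray-scale dilation formula is sketched below; the s.e. h is assumed (our convention here) to be a small odd-sized array whose origin is its center pixel. scipy.ndimage.grey_dilation(g, structure=h) computes the same quantity.

    import numpy as np

    def gray_dilate(g, h):
        # (g dilate h)(x, y) = max over (x1, y1) of g(x - x1, y - y1) + h(x1, y1)
        g = np.asarray(g, dtype=float)
        H, W = g.shape
        cy, cx = h.shape[0] // 2, h.shape[1] // 2
        gp = np.pad(g, ((cy, cy), (cx, cx)), constant_values=-np.inf)
        hr = h[::-1, ::-1]   # reflect h so each shift realizes g(x - x1, y - y1)
        out = np.full((H, W), -np.inf)
        for i in range(h.shape[0]):
            for j in range(h.shape[1]):
                out = np.maximum(out, gp[i:i + H, j:j + W] + hr[i, j])
        return out

Gray-scale erosion is the mirror image: pad with +inf, subtract h instead of adding it, and accumulate with np.minimum.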

7.3 The distance transform

A very important application of morphological operations is to derive the distance


transform (DT). The distance transform may be defined in several ways. We will give
one definition in section 7A.4, but we will present another, simpler one here. This
is an iconic representation of an image, in which each pixel in the DT contains the
distance of the corresponding pixel in the input image from some feature. Most often,
the feature is an edge. In this section, we consider the set of points ∂R, denoting the
boundary of region R. Fig. 7.3 illustrates the DT of the interior of a region.

[Figure: a small region whose interior pixels are labeled 1, 2, or 3, their city-block distances from the boundary.]
Fig. 7.3. Pixels interior to the region are shaded. The boundary is assumed to lie just outside
this region. The DT of the region, computed using a 4-connected definition, is shown.

Formally, the DT is described by

DT(x) = min_{y∈∂R} ‖x − y‖    (7.26)

where x and y are 2-vectors of coordinates. This transform is a solution to the


differential equation

‖∇DT(x)‖ = 1    (7.27)

when the initial conditions are DT (x) = 0 for x ∈ ∂ R.


One may compute the DT using conventional morphological operations: Let us
suppose we want the distance from the outside edge of an object. Repetitively erode
the image, using some “appropriate” structuring element. Each time a pixel disap-
pears, record the iteration at which that pixel vanished. Store the iteration number
in the corresponding pixel as the DT. Pretty simple, isn’t it? That definition does not
quite give the Euclidian distance. Huang and Mitchell [7.15] show how to obtain
the Euclidian distance transform using gray-scale morphology, and Breu et al. [7.6]
show how to compute the Euclidian DT in linear time. One interesting question in
this kind of iterative erosion is “where do the points go?” It turns out for a structuring
element which is strictly convex, the points go in the direction of the normal to the
boundary of the region being eroded [7.43].
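The iterative-erosion recipe translates almost word for word into code. This sketch, our own and not from the text, uses a 4-connected s.e., so it produces the city-block DT of Fig. 7.3 rather than the Euclidian one.

    import numpy as np
    from scipy import ndimage

    def dt_by_erosion(region):
        # region: boolean image; the boundary is assumed to lie just outside.
        se = ndimage.generate_binary_structure(2, 1)  # 4-connected cross
        dt = np.zeros(region.shape, dtype=int)
        current = region.copy()
        it = 0
        while current.any():
            it += 1
            eroded = ndimage.binary_erosion(current, structure=se)
            dt[current & ~eroded] = it     # the pixel vanished on this pass
            current = eroded
        return dt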

7.3.1 Using a mask to compute the DT


Computing the distance transform may be accomplished by iterative application of
a mask, such as the one illustrated in Fig. 7.4. At iteration m, the distance transform
is updated using the equation:

D^m(x, y) = min_{(k,l)∈T} (D^{m−1}(x + k, y + l) + T(k, l)).    (7.28)

Fig. 7.4. Mask used to calculate the 4-connected DT: a 1 in the cell above and in the cell to the
left of the origin, which is denoted by a plus sign and contains 0.

A more detailed explanation follows. First, the distance transform, D(x, y), is
initialized to a special symbol denoting “infinity” at every nonedge point, D^0(x, y) =
∞, ∀(x, y) ∉ ∂R, and to zero at every edge point, D^0(x, y) = 0, ∀(x, y) ∈ ∂R. Then,
application of the mask starts at the upper left corner of the image, placing the origin
of the mask over the (1, 1) pixel of the image, and applying Eq. (7.28) to calculate
a new value for the DT at (1, 1). In the example of Fig. 7.5, the DT is illustrated
with infinities denoted by blank squares, and edges denoted by zeros. The mask
of Fig. 7.4 is applied in the shaded area. The application of Eq. (7.28) produces
min(1 + 0, 1 + ∞) for the distance transform value in the pixel indicated by the
arrow.

Fig. 7.5. Calculation of the distance transform value at one particular point. The result is 1.

After one pass, top-to-bottom, left-to-right, the mask is reversed (in both
directions), and applied again, bottom-to-top, right-to-left.

4 3 4

3 +0

Fig. 7.6. A mask whose application produces the chamfer map.

This process is repeated at each pixel until all pixels in the image have been
processed, and then iterated until all pixels in the DT are marked with a finite value.
Masks other than that of Fig. 7.4 produce other variations of the DT. In particular,
Fig. 7.6 produces the chamfer map. If divided by three, the chamfer map produces
a DT which is not a bad approximation to the Euclidian distance.
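A direct, if unoptimized, rendering of the two-pass computation with the chamfer mask (weights 3 and 4, as in Fig. 7.6) follows; the explicit loops mirror the description above rather than aiming for speed.

    import numpy as np

    def chamfer_dt(edges):
        # edges: boolean image of feature points.  Divide the result by 3
        # to approximate Euclidian distance.
        INF = 10**9
        d = np.where(edges, 0, INF).astype(np.int64)
        H, W = d.shape
        fwd = ((0, -1, 3), (-1, -1, 4), (-1, 0, 3), (-1, 1, 4))
        bwd = ((0, 1, 3), (1, 1, 4), (1, 0, 3), (1, -1, 4))
        for y in range(H):                       # top-to-bottom, left-to-right
            for x in range(W):
                for dy, dx, w in fwd:
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < H and 0 <= xx < W:
                        d[y, x] = min(d[y, x], d[yy, xx] + w)
        for y in range(H - 1, -1, -1):           # reversed mask, reversed scan
            for x in range(W - 1, -1, -1):
                for dy, dx, w in bwd:
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < H and 0 <= xx < W:
                        d[y, x] = min(d[y, x], d[yy, xx] + w)
        return d

Assignment 7.2 asks how far this approximation can stray from the true Euclidian distance.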

7.3.2 The Voronoi diagram


In later sections of this book, we will be concerned with the junctions between
regions, and the relationships of adjacent regions. We will occasionally need to
consider relationships between regions which do not actually touch. For this, the
concept of the Voronoi diagram will be useful. We introduce it here because of the
similarities it shares with the distance transform.
Consider the image illustrated in Fig. 7.7. In that image, several regions are
indicated in gray, inside the white circle. For any region, i, the Voronoi domain of
that region is the set of points such that those points are closer to points in that region
than to points in any other region:

V_i = {x | d(x, P_i) < d(x, P_j), ∀ j ≠ i}    (7.29)

where Pi denotes any point in region i.


The set of points equidistant from two of those regions is referred to as the Voronoi
diagram, and is illustrated by the dark lines in the figure. This is equivalent to saying
that points in the Voronoi diagram do not belong to the Voronoi domain of any
region.

Fig. 7.7. Several regions and the resulting Voronoi diagram.
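In code, the Voronoi domains can be obtained in one call. The sketch below is one of several reasonable implementations: labels is a hypothetical integer image in which 0 marks background and 1..n mark the n regions.

    import numpy as np
    from scipy import ndimage

    def voronoi_domains(labels):
        # For every pixel, find the coordinates of the nearest region pixel
        # (the nearest zero of the mask labels == 0), then copy that pixel's
        # label.  Ties along the Voronoi diagram fall arbitrarily to one side.
        ind = ndimage.distance_transform_edt(labels == 0,
                                             return_distances=False,
                                             return_indices=True)
        return labels[ind[0], ind[1]]

The borders between differently labeled pixels in the result trace out a discrete version of the Voronoi diagram.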

7.4 Conclusion

In this chapter, we have looked at a particular approach to processing the shape of


regions. Morphological operators are particularly useful in binary images, but may
be applied to gray-scale images as well. Unlike most chapters in this book, we did
not make explicit use of either optimization methods or consistency.

7.5 Vocabulary

You should know the meanings of the following terms.

Closing
Dilation
Distance transform
Erosion
Extensive
Increasing
Opening
Umbra
Voronoi diagram

Assignment 7.1
In section 7.1.3, we stated that dilation is commuta-
tive because addition is commutative. Erosion also in-
volves addition, but one of the two images is reversed.
Is erosion commutative? Prove or disprove it.

Assignment 7.2
In section 7.3, a mask is given and it is stated that
application of this mask produces a distance transform
which is “not a bad approximation” to the Euclidian
distance from the interior point to the nearest edge
point. How bad is it? Contrive an example where the
value produced by the application of the mask is dif-
ferent from the Euclidian distance to the nearest edge
point.

Assignment 7.3
Consider a region with an area of 500 pixels with 120
pixels on the boundary. You need to find the distance
transform from each pixel in the interior to the bound-
ary, using the Euclidian distance. (Note: Pixels ON the
boundary are not considered IN the region -- at least
in this problem.) What is the computational complex-
ity? Note: You may come up with some algorithm more
clever than that used to produce any of the answers
below, so if your algorithm does not produce one of
these answers, explain it.

(a) 60 000
(b) 120 000
(c) 45 600
(d) 91 200

Assignment 7.4
Trick question: For Assignment 7.3, how many
square roots must you calculate to determine this
distance transform? Remember, this is the Euclidian
distance.

Assignment 7.5
Prove (or disprove) that binary images eroded by a
kernel, K, remain invariant under closing by K. That
is, prove that A ⊖ K = (A ⊖ K) • K.

Assignment 7.6
Show that dilation is not necessarily extensive if the
origin is not in the s.e.

Assignment 7.7
Prove that dilation is increasing.

Assignment 7.8
Let C be the class of binary images that have only one
dark pixel. For a particular image, let that pixel be
located at (i0 , j0 ).
Using erosion and dilation by kernels that have
{(0, 0)} as an element, devise an operator, that is,

a set of erosions and dilations, and structuring


elements (you may need only one s.e. or you may need
more than one), which, when applied to an element of
C, would output an image with the dark pixel shifted to
(i0 + 2, j0 + 1) (disregard boundaries).

Assignment 7.9
Which of the following statements is correct? (You
should be able to reason through this, without doing
a proof.)

(a) (A ⊖ B) ⊖ C = A ⊖ (B ⊖ C)    (7.30)
(b) (A ⊖ B) ⊖ C = A ⊖ (B ⊕ C)    (7.31)

Assignment 7.10
Use the thresholded images you created in Assignment
5.5 and Assignment 5.6. Choose a suitable structuring
element, and apply opening to remove the noise.

Topic 7A Morphology

7A.1 Computing erosion and dilation efficiently

Equation (7.14) (or Eq. (7.31)), as we hope you determined from Assignment 7.9, is correct.
It says that erosion by a large s.e., say K, can be broken down into two sequential erosions,
first by B and then by C, provided we are able to find B and C such that K = B ⊕ C. This is
sometimes referred to as the “chain rule” for erosion. It has substantial impact on hardware
implementations.
Suppose we have custom hardware that can compute erosion by a 3 × 3 s.e. at video rates,
but in some application we need to erode by a particular 4 × 4. The chain rule says that if we
can (somehow) find two 3 × 3 s.e.s such that their dilation is the 4 × 4 we want, we can get the
same result by passing the input image through our special hardware twice. But how to find B
and C? To give an indication of how this is done, let us consider a very simple decomposition,
into a set of s.e.s each of which contains only two elements, the origin and one other point.
That is, we wish to find H_1, H_2, . . . , H_N such that A ⊕ H = (. . . [(A ⊕ H_1) ⊕ H_2] . . .), and
H_i = {0, p_i}. The p's are found [7.12] using the following approach: Search among pairs of
points in H for a pair p_1 and p_2 such that H is invariant under opening by the set containing
the origin and the difference of the two points: H = H ∘ {0, p_1 − p_2}. If two such points
can be found, then H_i = {0, p_1 − p_2}, and we reduce H by H′ = H ⊖ H_i. The process is
repeated recursively. If no such pair of points can be found, then a search is made of
quadruples of points p_1, p_2, p_3, p_4, to see if H = H ∘ {p_1 − p_2, p_3 − p_4}, and so on.
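Before worrying about how to find a decomposition, it is worth convincing yourself that the chain rule holds at all. A short numerical check, with a random test image and square s.e.s of our choosing:

    import numpy as np
    from scipy import ndimage

    rng = np.random.default_rng(0)
    A = rng.random((64, 64)) > 0.3      # a random binary image
    B = np.ones((3, 3), dtype=bool)
    C = np.ones((3, 3), dtype=bool)

    # Build B dilate C explicitly; pad B so the dilation has room to grow.
    BC = ndimage.binary_dilation(np.pad(B, 1), structure=C)  # the 5 x 5 square

    lhs = ndimage.binary_erosion(A, structure=BC)
    rhs = ndimage.binary_erosion(ndimage.binary_erosion(A, structure=B),
                                 structure=C)
    print(np.array_equal(lhs, rhs))     # True: two erosions equal one by B dilate C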

Matheron [7.21] proved that any of a large class of morphological operations can be
computed as a union of erosions, or by using duality, as an intersection of dilations. Choosing
the set of “basis sets” so that a given operation by a given structuring element may be
calculated in this way has been the subject of considerable research [7.18, 7.20].

7A.1.1 Decomposition into 3 × 3 structuring elements

In the following, we illustrate the technique of Park and Chin [7.23], using their examples.
The algorithm is presented here. The reader is referred to the original papers for the theory.
The example given in this section is from [7.23].

In order to understand this method for decomposing structuring elements, we need
knowledge of the chain code from section 9.5. In a chain code, we represent a counterclockwise
walk around a region by a sequence of numbers, all between 0 and 7, designating the direction
of each step (Fig. 7.8). In this work, the concept of a chain code is slightly generalized to
include not only single pixel steps in one of the eight cardinal directions, but also
specific concave boundary segments.

[Figure: the eight chain code directions, with 0 pointing east and the numbers increasing counterclockwise through 7.]
Fig. 7.8. The eight directions in which one might step in going from one pixel to another around
a boundary.
We now define a rather general class of structuring elements which are simply connected
(no holes) and whose boundaries have a form given by a regular expression involving the
concave segments and the chain code directions. For example, the regular expression L_0 2²4²
denotes the curve created by starting with the L_0 segment, followed by two steps in the “2”
direction (the superscript denotes repetition), followed by two steps in the “4” direction. This
is illustrated in Fig. 7.10.

[Figure: the 28 concave segments, named U_0, U_2, U_4, U_6; J_0 through J_7; L_0, L_2, L_4, L_6; V_1, V_3, V_5, V_7; and R_0 through R_7.]
Fig. 7.9. Possible concave boundaries in 3 × 3 images. The number denotes the direction of the
first chain code.

The class of structuring elements which can be decomposed using this algorithm is the set
of all simply connected structuring elements whose boundary can be written in the form (the
order of the subscripts is important to the definition)

U_0^{S_{U0}} J_0^{S_{J0}} L_0^{S_{L0}} R_0^{S_{R0}} 0^{S_0} J_1^{S_{J1}} V_1^{S_{V1}} R_1^{S_{R1}} 1^{S_1} . . . J_7^{S_{J7}} V_7^{S_{V7}} R_7^{S_{R7}} 7^{S_7}    (7.32)

where any of the superscripts may be zero. For example, V_1 1² 2 R_4² 4 6³ R_7 is a member of this
set, but J_7 1 J_2 is not.

Definition
An image A is a factor of an image S if and only if it is possible to write S as the dilation of
A, that is, S = A ⊕ B. A factor A is a prime factor of S if and only if A cannot be factored
into anything other than itself and single-pixel images. In Table 7.1 are listed all the prime
factors which start with R_0. The prime factors are not required to be in the form of Eq. (7.32).
In Table 7.2, we present other prime factors, listing only their chain code representation.

Now we present the approach to decomposition of the structuring element using an example.
We will decompose the structuring element illustrated in Fig. 7.11, whose chain code is
S = L_0 0³ 1 2⁴ R_4 4³ R_6 6². The concave portions of this boundary are denoted by v_1 = L_0,
v_2 = R_4, v_3 = R_6; the convex portions by d_1 = 0, d_2 = 1, d_3 = 2, d_4 = 4, d_5 = 6.

4 Table 7.1. Prime factors starting with R0 .


4
2
2 R 02245 R 022V56 R022R 5 R 0235 R 02R 46 R 0242 R 0V3426 R 0V345

Fig. 7.10. Illustration of


how the sequence
L0 22 42 describes a
figure. The L0 is drawn R 0V3V56 R0V3R 5 R 0R 35 R 0R 346 R 0326 R 034
in bold.

Table 7.2. Other prime factors, starting with U_0, J_0, L_0, V_1, R_0, and R_1. To obtain
V_i, J_i, etc. for i = 2, 4, 6, increase each chain code by i. To obtain i = 3, 5, 7,
increase each chain code by (i − 1).

U_0:  U_0 0²2²4²   U_0 0²234   U_0 0124²   U_0 0134
J_0:  J_0 02²4²   J_0 0234   J_0 124²   J_0 134
J_1:  J_1 2²4²6   J_1 2²45   J_1 2346   J_1 235
L_0:  L_0 2²4²   L_0 234
V_1:  V_1 2²4²V_7   V_1 2²4R_6   V_1 2²V_5 6²   V_1 2²V_5 V_7   V_1 2²R_5 6
      V_1 234V_7   V_1 23R_6   V_1 2R_4 6²   V_1 2R_4 V_7   V_1 24²6
      V_1 V_3 4²6²   V_1 V_3 4²V_7   V_1 V_3 456   V_1 V_3 4R_6   V_1 V_3 V_5 6²
      V_1 V_3 R_5 6   V_1 V_3 5²   V_1 R_3 46²   V_1 R_3 4V_7   V_1 R_3 56
      V_1 2²5²   V_1 245   V_1 V_3 V_5 V_7   V_1 R_3 R_6
      V_1 3²6²   V_1 3²V_7   V_1 346   V_1 35
R_0:  R_0 2²45   R_0 2²V_5 6   R_0 2²R_5   R_0 235   R_0 2R_4 6
      R_0 24²   R_0 V_3 4²6   R_0 V_3 45   R_0 V_3 V_5 6   R_0 V_3 R_5
      R_0 R_3 5   R_0 R_3 46   R_0 3²6   R_0 34
R_1:  R_1 24²V_7   R_1 24R_6   R_1 2V_5 6²   R_1 2V_5 V_7   R_1 2R_5 6
      R_1 25²   R_1 346²   R_1 34V_7   R_1 356   R_1 3R_6
      R_1 R_4 6²   R_1 R_4 V_7   R_1 4²6   R_1 45

First, we identify all the prime factors involving L_0, R_4, and R_6 which are compatible
with this image. These are illustrated in Fig. 7.12. To understand more clearly how these are
compatible with the image, consider the segment R_4 6²01. Observe that this matches the R_4
segment which is at the upper right of Fig. 7.11.

Fig. 7.11. A structuring element to be decomposed into 3 × 3 elements.

The next step in the process is to construct a matrix Λ, where Λ_ij represents the number
of times v_i occurs in A_j. In this example,

    ⎡1 0 0 0 0⎤
Λ = ⎢0 1 1 0 0⎥ ,
    ⎣0 0 0 1 1⎦

A_1 = L_0 2²4²    A_2 = R_4 6²01 (R_0 2²45)    A_3 = R_4 60² (R_0 24²)    A_4 = R_6 02² (R_0 24²)    A_5 = R_6 12 (R_0 34)

Fig. 7.12. The prime factors which match segments of the boundary of Fig. 7.11. The chain codes
in parentheses indicate what the boundary would be if rotated to R_0 equivalents.

where we can see that R_4 occurs once in A_2 and A_3, but not at all in A_1, A_4, or A_5. Next, we
construct a matrix Δ, where Δ_ij counts the number of times d_i occurs in A_j. Here,

    ⎡0 1 2 1 0⎤
    ⎢0 1 0 0 1⎥
Δ = ⎢2 0 0 2 1⎥ .
    ⎢2 0 0 0 0⎥
    ⎣0 2 1 0 0⎦

Two vectors are defined: Y, representing the number of times vi occurs in the original boundary, and Z, the number of times di occurs in the original boundary. In this example, Y = [1 1 1]T and Z = [3 1 4 3 2]T. A vector X which satisfies

    ΓX = Y,    ΔX ≤ Z    (7.33)

is a solution for the decomposition. In this example, X = [1 0 1 1 0]T satisfies these two conditions. Note that ΔX = [3 0 4 2 1]T, which is less than or equal to Z in the sense that each element of ΔX is less than or equal to the corresponding element of Z. Thus, we can decompose the boundary by dilating by A1 once, by A2 zero times, A3 once, A4 once, and A5 zero times, or
S = A1 ⊕ A3 ⊕ A4 ⊕ B.

All that remains is to determine a kernel B. This is accomplished by looking at the difference between ΔX and Z: Z − ΔX = [0 1 0 1 1]T. Thus, we need a kernel whose boundary is described by the sequence d2, d4, d5, each repeated once. This sequence is 146, as in Fig. 7.13. Thus, we have a sequence of structuring elements, each 3 × 3, whose sequential application produces the same result as the kernel in Fig. 7.11.

Fig. 7.13. The s.e. described by the sequence 146.
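
Since Eq. (7.33) involves only small integer matrices, a candidate X can be found by exhaustive search. The following sketch (ours, not from the text) hard-codes the Γ, Δ, Y, and Z of this example and prints every X with entries 0 through 3 that satisfies both conditions; the X = [1 0 1 1 0]T of the example is among the solutions it finds.

#include <stdio.h>

int main(void)
{
    int Gamma[3][5] = {{1,0,0,0,0},{0,1,1,0,0},{0,0,0,1,1}};
    int Delta[5][5] = {{0,1,2,1,0},{0,1,0,0,1},{2,0,0,2,1},
                       {2,0,0,0,0},{0,2,1,0,0}};
    int Y[3] = {1,1,1}, Z[5] = {3,1,4,3,2};
    int X[5];

    /* enumerate candidate usage counts 0..3 for each prime factor */
    for (X[0]=0; X[0]<4; X[0]++) for (X[1]=0; X[1]<4; X[1]++)
    for (X[2]=0; X[2]<4; X[2]++) for (X[3]=0; X[3]<4; X[3]++)
    for (X[4]=0; X[4]<4; X[4]++) {
        int i, j, ok = 1;
        for (i = 0; i < 3 && ok; i++) {     /* check Gamma X = Y */
            int s = 0;
            for (j = 0; j < 5; j++) s += Gamma[i][j]*X[j];
            if (s != Y[i]) ok = 0;
        }
        for (i = 0; i < 5 && ok; i++) {     /* check Delta X <= Z */
            int s = 0;
            for (j = 0; j < 5; j++) s += Delta[i][j]*X[j];
            if (s > Z[i]) ok = 0;
        }
        if (ok)
            printf("X = [%d %d %d %d %d]\n", X[0], X[1], X[2], X[3], X[4]);
    }
    return 0;
}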

7A.2 Morphological sampling theorem

Here, we are going to be talking about sampling, much as Shannon did in his famous sampling
theorem, except that we will start with an image which has been sampled already, and
represented on a digital grid, and consider (sub)sampling to a smaller grid.
First, define a sampling grid. A sampling grid, as we use the term here, is just an image,
with a foreground point at every point at which we will want to sample our image. Any old
grid will do, except that it must satisfy

S⊕S=S (7.34)

and

S = S̃. (7.35)

Equation (7.34) represents an example of the property that this particular S is “closed under
dilation.” A convenient grid to use is every third point, as illustrated in Fig. 7.14.
So now, sampling means to read and remember the image values at all the pixels where
the grid is black. Now suppose K is some s.e. We propose a rather simple reconstruction
algorithm. Here is the idea: Let's sample our image using the sampling grid specified. Then, we ask the question: Under what conditions will the dilation of the sampled image by the s.e. K be the original image? Florencio and Schafer [7.7] have shown that the following properties are required: First, the sampling grid itself, when dilated by the s.e., must be the entire space,

    S ⊕ K = E,    (7.36)

where E denotes the entire image space; second, the translations of K by all the points in S form a partition of the space,

    ∀((x, y) ∈ S, x ≠ y),  Kx ∩ Ky = Ø,    (7.37)

and K must contain the origin.

Fig. 7.14. An image which represents sampling at every third point. (You might want to verify that this grid satisfies the properties mentioned above.)

For the sampling grid shown above, here's an s.e. which satisfies these three conditions: a 3 × 3 square of foreground pixels.

Actually, this is not very interesting – just the center pixel (the origin) and its eight neighbors.
But it does satisfy the three conditions.
If these properties hold for S and K, then the theorem is as follows. Let F be some image, let P be the set of images Fi satisfying Fi = A ⊕ K for some A ⊆ S, and let Q be the set of images Fj satisfying Fj = Fj ∘ K = Fj • K (images unchanged by either opening or closing with K). Then:

Part A The samples of F ∈ P at the points in S are necessary and sufficient to perfectly
reconstruct F.
Part B The samples of F ∈ Q at the points in S are necessary to reconstruct F with bounded
error and sufficient to reconstruct F with an error at most r (K) where r (K) is the
radius of the smallest circle containing K.

There is a lot to talk about regarding this theorem: First, what does part A mean? Can you
figure it out from the set notation? You should be able to, but you may say to yourself “if this
means what I think it means, it’s trivial.” Well, what it means is this: If F can be generated
by taking some collection of the points in S, and dilating them with kernel K, then all you
have to remember is which points in S were needed. Yes, it is kind of trivial, isn’t it? But
there are some profound implications, which we will get to when we talk about the Nyquist
rate.

Fig. 7.15. (a) Original image. (b) Result of sampling the original image with S of Fig. 7.14.

Now to understand part B: The Hausdorff distance is a measure of the difference between two (sub)images. Given two sets of points S = {s1, s2, . . . , sn} and T = {t1, t2, . . . , tm}, the Hausdorff distance is defined by

    dH(S, T) = max(h(S, T), h(T, S))    (7.38)

where h(S, T) = max_{s∈S} min_{t∈T} ‖s − t‖. That is, the Hausdorff distance is the largest of the minimum distances from the elements of one set to the elements of the other set.
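
A direct transcription of Eq. (7.38) for 2D point sets follows. It is a sketch of ours, assuming Euclidean point-to-point distance; the Point type and function names are not from the text.

#include <math.h>

typedef struct { double x, y; } Point;

/* directed distance h(S, T) = max over s in S of min over t in T of ||s - t|| */
static double h(const Point *S, int n, const Point *T, int m)
{
    double hmax = 0.0;
    for (int i = 0; i < n; i++) {
        double dmin = INFINITY;
        for (int j = 0; j < m; j++) {
            double dx = S[i].x - T[j].x, dy = S[i].y - T[j].y;
            double d = sqrt(dx*dx + dy*dy);
            if (d < dmin) dmin = d;
        }
        if (dmin > hmax) hmax = dmin;
    }
    return hmax;
}

/* Eq. (7.38): the larger of the two directed distances */
double hausdorff(const Point *S, int n, const Point *T, int m)
{
    double hst = h(S, n, T, m), hts = h(T, m, S, n);
    return (hst > hts) ? hst : hts;
}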
Part B says that if F is closed under opening by K (that's quite an expression, don't you think? "Closed under opening"), that is, if F does not change when it's opened by K, and closed under closing by K, then the samples of F at the points of S are sufficient to represent
F almost exactly. By “almost exactly” we mean the set of sample points, when dilated by K,
is very nearly equal to the original F.
Now let's do that example: Fig. 7.15 shows an original image, F, and what we get when we sample it with S.
Since we know that we could only get data on rows 0, 3, 6, or 9 and similarly for columns,
we could toss out those rows and columns in our reduced resolution version, and get a
smaller image. So there, on the left of Fig. 7.16, we have the subsampled, smaller version
of the original image. Now let’s zoom it back, using dilation: Place K down at each of the
sampling points and we get the reconstruction of Fig. 7.16. Hmmm, doesn't look much like the original, unsampled image, does it? (Whatever happened to "exact reconstruction"?) And
here’s a tough question. This theorem claims to say that one can reconstruct a signal by
sampling every third point. Doesn’t this contradict Shannon? Doesn’t Shannon say we have

Fig. 7.16. Subsampled image and reconstruction by dilation. Students: Note the original image in Fig. 7.15(a) – does that image belong to P or Q?

to sample every other point anyway? (Actually, Shannon’s theorem was defined for analog
signals. Students: How DOES the Shannon theorem apply to this case? What IS the Nyquist
rate?) The sampling theorem does not say we can reconstruct any signal; it says something
about frequencies, doesn’t it? (Say yes.) In fact, the sampling theorem basically says that a
sampling grid with a particular frequency (which is the same as the grid spacing) cannot be
used to store information which changes more than half the grid spacing. In morphological
sampling, the theorem is complicated by the fact that we not only can choose a grid, we can
also choose a kernel. The restrictions are given by part B of the theorem. Unless the image is
one which has already been “prefiltered” by K, K cannot restore it. Haralick [7.12, p. 252] says
it this way: “The morphological sampling theorem cannot produce a reconstruction whose
positional accuracy is better than the radius of the circumscribing disk of the reconstruction
structuring element K .”
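
To make the sample-then-reconstruct procedure concrete, here is a minimal sketch of ours: it samples a binary image on the every-third-point grid of Fig. 7.14 and reconstructs by dilating each retained sample with the 3 × 3 kernel K described above. The image size and array layout are assumptions, not part of the theorem.

#define W 16
#define H 16

void sample_and_reconstruct(const unsigned char F[H][W],
                            unsigned char R[H][W])
{
    int x, y, i, j;
    for (y = 0; y < H; y++)              /* start from an empty image */
        for (x = 0; x < W; x++)
            R[y][x] = 0;
    for (y = 0; y < H; y += 3)           /* the sampling grid: every third point */
        for (x = 0; x < W; x += 3)
            if (F[y][x])                 /* keep this sample ... */
                for (i = -1; i <= 1; i++)        /* ... and dilate it by K */
                    for (j = -1; j <= 1; j++) {
                        int yy = y + i, xx = x + j;
                        if (yy >= 0 && yy < H && xx >= 0 && xx < W)
                            R[yy][xx] = 1;
                    }
    /* R equals F exactly only when F belongs to the class P; for F in Q the
       error is bounded by r(K), as part B of the theorem states */
}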

7A.3 Choosing a structuring element

Generally, we choose structuring elements which are small and symmetric, with the origin
in the center. Schonfeld [7.34] provides some suggestions, but generally, these are the only
guidelines.

7A.4 Closing gaps in edges and surfaces²

The algorithm in this section is presented in a bit more detail than usual in this
book. A graduate student in the machine vision discipline needs to learn how to write a
good journal paper, and this chapter is written to illustrate the components of such a paper.
It includes an introduction with references to relevant literature, a body that describes and
explains an algorithm, and a results section that illustrates performance of the algorithm in
comparison to other published results. The student should not only read this chapter for its
technical content, but should pay attention to style and organization.

7A.4.1 Introduction

In this section, we apply the concepts of the distance transform, connected components (see
Chapter 8), and binary morphology-like operations to solve the problem of closing gaps in
edges (see [7.16] for alternative strategies). In two dimensions, an edge is a curve, and in
three dimensions, a surface. As we have seen, any edge operator is certain to occasionally
fail, resulting in both extraneous edges and gaps in edges. If an edge has gaps in it, then
connected component labeling routines will fail, labeling interior and exterior points the
same. Thus, we must develop techniques which correct such edge detection errors. We will
relate these techniques to morphological operators. In [7.10], we developed a technique
called distance-transform-based closing, and have implemented this technique in both two
and three dimensions. This new technique is compared with binary morphology (mask erosion techniques [7.1]), iterative parallel thinning [7.44], a 3D parallel thinning technique [7.42],

² Several of the figures in this chapter are from a paper written by one of the authors [7.36].

and with 3D morphological techniques. In every case, the technique reported here produces
superior performance by better preserving the shape of sharp corners.

An aside: Connected component labeling


The intended use of this book is for students to read the introductory portions of each chapter,
and then come back and read the advanced topics sections (like this one). In the introduc-
tory portion of Chapter 8, an operation called “connected component labeling (CCL)” is
explained. You need to use the concept of connected components right now. If you haven’t
read that section yet, please read and understand section 8.3 first, and come back to this
section.

Problem definition
We are given an edge image. Due to noise, blur, or other error, the edge/surface resulting from edge detection in a real image will occasionally have "gaps": areas in which the edge detector response is not sufficiently strong to make a positive determination. (See Pratt [7.25, Section 17] for an excellent discussion of edge detector errors.) Such gaps may be closed by various types of morphological operators. The algorithm described here is denoted "DT-driven closing." (The word "closing" here refers to closing gaps in edges, not the morphological closing operator.)

7A.4.2 The distance transform

We have already seen that the distance transform, D(x, y) gives some measure of the distance
from point x, y to the nearest edge point [7.28, 7.29, 7.31, 7.32]. We use a particular variant
of the distance transform called the “chamfer map,” denoted C(x, y) and refer to the values
of this map as the “DT distance.”
Similarly, we may extend the concept of the distance transform to three-dimensional
space. The transform D(x, y, z) is parameterized identically to g(x, y, z) and contains the
DT distance from point x, y, z to the closest edge.
The distance transform may be used to assist in the measurement of the properties of
regions: At every point, one knows the maximum size kernel which may be applied without
crossing an edge [7.35].
The k-neighbors of a point (x, y) are defined in either two or three dimensions as

    ℵk(x, y) = {(xk1, yk1), (xk2, yk2), . . . , (xkm, ykm)}
             = {(xkλ, ykλ) : DT distance((xkλ, ykλ), (x, y)) = k}.

Here m = 8k, k ≥ 1 in 2-space and m = (2k + 1)³ − (2k − 1)³ in 3-space.


We referred to the distance transform resulting from application of any masks using
Eq. (7.28) as “chamfer [7.2] maps” in our earlier work [7.10], but in other literature
[7.4, 7.5, 7.31], the term “chamfer” is restricted to the use of the second of the masks in
Fig. 7.17. Either definition will suffice for the edge closing method described in the next
subsection.

1 1 1        4 3 4
1 0 1        3 0 3
1 1 1        4 3 4

Fig. 7.17. Two distance definitions which may be used to compute distance transforms. The one on the right produces the chamfer map, closely approximating the Euclidian distance.

Fig. 7.18. Distance transform near a gap in an edge.

To construct a complete DT, Eq. (7.28) is iterated until no more changes occur between
iterations. This is the approach followed by Bister et al. [7.3]. For the application discussed
here, however, we assume some a priori knowledge of maximum gap size. This allows us to
define a value K max reflecting this knowledge. K max will represent a distance so great that any
pixel this far away could not possibly be part of the edge. Normally, we use values K max ≤ 4,
since this allows gaps of six pixels to be bridged. Fig. 7.18 illustrates a distance transform
near a gap in an edge.
Generation of the three-dimensional distance transform proceeds in an analogous manner.
Observe that the computational complexity of distance transform generation is proportional
to the number of edge points in the image and to the size of K max . Larger values of K max
increase the size of the kernel ℵk substantially in three dimensions.
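
The text builds the DT by iterating the local update of Eq. (7.28); an equivalent and common implementation is the two-pass chamfer computation sketched below (our code, with an assumed image size), using the 3-4 mask of Fig. 7.17. A forward raster pass propagates distances from the upper neighbors, and a backward pass from the lower neighbors.

#define W 256
#define H 256
#define BIG 9999   /* effectively infinite; the text instead caps values near Kmax */

void chamfer34(const unsigned char E[H][W], int D[H][W])
{
    int x, y;
    for (y = 0; y < H; y++)          /* initialize: 0 on edges, BIG elsewhere */
        for (x = 0; x < W; x++)
            D[y][x] = E[y][x] ? 0 : BIG;

    for (y = 0; y < H; y++)          /* forward pass */
        for (x = 0; x < W; x++) {
            int d = D[y][x];
            if (y > 0            && D[y-1][x]   + 3 < d) d = D[y-1][x]   + 3;
            if (x > 0            && D[y][x-1]   + 3 < d) d = D[y][x-1]   + 3;
            if (y > 0 && x > 0   && D[y-1][x-1] + 4 < d) d = D[y-1][x-1] + 4;
            if (y > 0 && x < W-1 && D[y-1][x+1] + 4 < d) d = D[y-1][x+1] + 4;
            D[y][x] = d;
        }
    for (y = H-1; y >= 0; y--)       /* backward pass */
        for (x = W-1; x >= 0; x--) {
            int d = D[y][x];
            if (y < H-1            && D[y+1][x]   + 3 < d) d = D[y+1][x]   + 3;
            if (x < W-1            && D[y][x+1]   + 3 < d) d = D[y][x+1]   + 3;
            if (y < H-1 && x < W-1 && D[y+1][x+1] + 4 < d) d = D[y+1][x+1] + 4;
            if (y < H-1 && x > 0   && D[y+1][x-1] + 4 < d) d = D[y+1][x-1] + 4;
            D[y][x] = d;
        }
}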

7A.4.3 Segmentation using connected components

Pixels whose DT distance from an edge is less than K max are labeled as “TBA” (to be as-
signed), meaning, they could possibly belong to an edge. The remaining pixels are segmented
into disjoint regions. The straightforward way to accomplish this segmentation is to use con-
nected component analysis on the pixels which are not labeled TBA. However, one could use
more sophisticated cooperative processes such as those described in Chapter 8 and in [7.14].
Somehow we produce a label image like the label images we will discuss in Chapter 8, in
which L(x, y) = j is interpreted as “the pixel at x, y belongs to region j.” Pixels in L close
to an edge are labeled as potentially part of the edge:

{DT(x, y) < K max } → {L(x, y) = TBA}.



Bister et al. [7.3] identify local maxima in the DT, and each maximum results in a potential
region. They note, however, “Since the distance transform is sensitive to noise producing
irregularities in the borders of the regions, one cavity (region) can contain many local maxima
very close to one another. In order to eliminate these spurious maxima, a filter merges the
maxima for which the sum of the heights is much larger than the geometrical distance
between them.” We conjecture that the performance of this filter is equivalent to our choice
of K max . Instead of searching for a maximum, we use connected components (Chapter 8).
Both techniques identify an area within a region which robustly characterizes that region.

7A.4.4 Relabeling unassigned points

The last step in the algorithm relabels the TBA points, using the distance transform DT(x, y) and the label image L(x, y) to create a new label image L′(x, y). Working inward from those points where DT(x, y) ≥ Kmax (the set of points Kmax or farther from an edge point), each point is relabeled by assigning it the label of the most appropriate neighbor. This "erosion" of the questionable pixels is performed iteratively. At iteration i, only the TBA pixels in the label image L(x, y) that correspond to pixels such that DT(x, y) = Kmax − i are relabeled.
TBA pixels are relabeled only when there is strong evidence (defined in the next section)
indicating to which region the pixel should be assigned. If the reassignment is uncertain, then
the pixel is left TBA and the decision is postponed until the next pass. When all of the TBA
pixels in L(x, y) corresponding to the k-valued pixels of the distance transform have been
reassigned to valid image regions or when no further change takes place in an iteration, k is
decremented and the next set of TBA points is considered for relabeling. The relabeling is
complete when all of the edge pixels (k = 0) have been assigned to valid image regions. When
single-pixel resolution between regions is required, more sophisticated region normalization
techniques may be used [7.8].
The key to closing gaps lies in the strategy used to select the most appropriate region to which to reassign the currently considered pixel L(x, y). To accomplish this in two dimensions, we examine the pixel's eight surrounding neighbors. In three dimensions, we examine the voxel's 26 neighbors. (A voxel is a three-dimensional pixel.) We refer to this as finding the pixel's or voxel's "best neighbor."
The surrounding pixels may come from one of three possible classes, depending on how
they are labeled at the current iteration:

(1) A label corresponding to an object or a region of an object in the label image L(x, y).
(2) A label corresponding to the image background.
(3) The TBA label.

We relabel the current pixel to one of the first two classes by simply counting how many
neighboring pixels come from that class and selecting the largest as the “best” class. This
strategy will only fail if all the neighbors are TBA, or constraints (see next paragraph) we
have placed on the relabeling make it undesirable to reassign the pixel to the apparent best
neighbor.

One constraint on the relabeling algorithm is the selection of the background region as the
best neighbor. Neighbors belonging to the background are considered in a more restricted
fashion than those belonging to regions. For the background value to be selected as the best
neighbor, the background pixel must be face-connected to the pixel under consideration.
This avoids the undesirable result of occasionally finding isolated background pixels inside
a closed boundary. (See [7.30] for more discussion on this connectivity paradox.)
When k = 0, the TBA pixels we seek to relabel are either true edges or noise pixels.
An edge in an image, by definition, represents an interface between an object region and
some other region in that image. The other region may be either another object region or the
background. We choose to relabel an edge pixel as part of an image region regardless of the
outcome of the enumeration. Only in the case that the edge pixel is completely surrounded by background do we choose background as the best neighbor.

Relabeling algorithm – three-dimensional data


/* In this function example, we have omitted several details, such as the
   declarations of the arrays (DT, L, Lprime, Ltemp) and of the loop
   variables. Still, this example captures the essence of the algorithm. */
void relabel()
{
    /* initial classification of every voxel, from its DT value */
    for(frame = 0; frame < numberframes; frame++)
        for(row = 0; row < numberrows; row++)
            for(col = 0; col < numbercols; col++) {
                if(DT[frame][row][col] >= Kmax)    /* far from any edge: keep label */
                    Lprime[frame][row][col] = L[frame][row][col];
                else if(DT[frame][row][col] == 0)  /* an edge point itself */
                    Lprime[frame][row][col] = EDGE;
                else                               /* near an edge: to be assigned */
                    Lprime[frame][row][col] = TBA;
            }
    copyarray(L, Lprime);   /* start the iteration from this classification */
    /* work inward: k = Kmax-1 down to 0, so edge pixels are assigned last */
    for(k = Kmax-1; k >= 0; k--) {
        changing = TRUE;
        while(changing == TRUE) {
            changing = FALSE;
            for(frame = 0; frame < numberframes; frame++)
                for(row = 0; row < numberrows; row++)
                    for(col = 0; col < numbercols; col++) {
                        if(((L[frame][row][col] == TBA)
                             || (L[frame][row][col] == EDGE))
                           && (DT[frame][row][col] >= k)) {
                            newlabel = Best26Neighbor(frame, row, col);
                            Lprime[frame][row][col] = newlabel;
                            /* only a real change keeps the pass going */
                            if(newlabel != L[frame][row][col])
                                changing = TRUE;
                        }
                        else
                            Lprime[frame][row][col] =
                                L[frame][row][col];
                    } /* end for loop on frame, row, col */
            /* swap L and Lprime so the next pass sees the new labels */
            copyarray(Ltemp, Lprime);  /* copy entire array Lprime into Ltemp */
            copyarray(Lprime, L);
            copyarray(L, Ltemp);
        } /* end while */
    } /* end for k */
} /* end relabel */
/*==============================================================*/
/* In this function, p and n are data structures containing the frame, row, and
   col coordinates of a voxel; neighbor(p) is assumed to enumerate the 26
   neighbors of p, returning NULL when they are exhausted. */
int Best26Neighbor(p)
{
    clearcounts(Card);   /* reset the vote counts (a helper assumed here) */
    while ((n = neighbor(p)) != NULL)
    {
        if (L(n) != EDGE)    /* edge-labeled neighbors never vote */
        {
            if(L(n) != BACKGROUND) Card[L(n)]++;
            /* background votes only if it is face-connected to p */
            else if(faceconnected(n,p)) Card[BACKGROUND]++;
        }
    }
    /* an edge pixel (note the test on p, the voxel being relabeled) is assigned
       to a region whenever possible; background wins only as the sole choice */
    if ((maximum(Card) == BACKGROUND) && (L(p) == EDGE))
        return NextMax(Card);
    else return maximum(Card);
} /* end Best26Neighbor */

In this algorithm, “Card” is simply an array which keeps a count (cardinality) of the number
of times a particular label is adjacent to the voxel of interest.

7A.4.5 Examples

In this section, we compare the technique described above with competing morphological strategies.

Closely spaced regions


The output of an edge detection process is illustrated in Fig. 7.19. Note that the typical gaps in the edge are larger than the spacing between the regions. A distance transform is applied to the edge image. The distance transform and the result of connected component labeling are shown in Fig. 7.20, and Fig. 7.21 shows the segmentation resulting from the DT-based closing method.

Other two-dimensional images


A binary image was processed containing several cutlery objects (knives, spoons, and forks). Figs. 7.22–7.24 show the results obtained for this input, which are presented in the order: (a) the original image; (b) the DT itself; (c) the "thickened edge image," which consists of all pixels such that C(x, y) < Kmax, initially relabeled as TBA; (d) the labeled image produced from the thickened edge image; (e) the relabeled image based on the DT and the labeled image. (Note the similarity between the thickened edge image and the morphological dilation of the edge.)

Fig. 7.19. Two regions with large gaps in their boundaries. From [7.36]. Used with permission.

Fig. 7.20. Distance transform and result of connected component labeling of Fig. 7.19. From
[7.36]. Used with permission.

Fig. 7.21. Segmentation resulting from applying DT-driven closing. Note that the regions are
correctly partitioned, with a precision accurate to the individual pixel. From [7.36]. Used
with permission.

7A.4.6 Comparison with published two-dimensional methods

To evaluate the performance of DT-driven edge closing in two-dimensional images, we compared it with two other types of binary thinning techniques: iterative erosion and iterative thinning. As input to these algorithms we used the bit-mapped "cutlery" image shown in Fig. 7.22 after it was dilated by a 5 × 5 kernel to close all gaps. The basic idea is this: Use dilation to close the gaps in the boundary. Then use thinning to reduce the "fat" boundary to a thin one. The results of these comparisons are described in the following sections. (Thinning differs from skeletonization in that thinning preserves connectivity: A boundary which is intact when wide will still be intact after thinning.)
Note that although thinning and skeletonization perform similar operations, thinning preserves the connectivity of the edge, while skeletonization does not. A skeleton may be defined in terms of morphological operations as follows.

The kth order homothetic of an s.e. A, denoted kA, is the result of applying the operator of interest (in this case dilation) to A k times: dilating {0} by A, and then dilating the result by A again, until k such dilations have been done.
A conventional morphological skeleton of an image X is formed by decomposing an image into a number of skeleton subsets, Si, where

    Si = (X ⊖ iA) \ [(X ⊖ iA) ∘ A]    (7.39)

Fig. 7.22. (a) Original cutlery image, containing gaps in edges. (b) Distance transform of original cutlery image. From [7.36]. Used with permission.

Fig. 7.23. (c) Thickened edge image. (d) Label image based on dilated edges. From [7.36]. Used with permission.

Fig. 7.24. (e) Relabeled cutlery image. From [7.36]. Used with permission.

where X \ Y denotes all the elements of X which are not in Y. The skeleton is then constructed by

    Skeleton = ⋃_{i=0}^{N} Si ⊕ iA.    (7.40)

The subsets contain information about size, orientation, and connectivity. A minimal skeleton
has the property of being able to exactly reconstruct the original image, but it does not
necessarily preserve path or surface connectivity [7.20]. An alternative to the morphological
skeleton, not considered in this book, is morphological shape decomposition (MSD) [7.24].
MSD and the morphological skeleton are compared by Reinhardt and Higgins [7.27] who
conclude that MSD performs slightly better. While morphological skeletonization can be
used for many applications (such as image coding) and has been widely studied, it is not
directly comparable to 2D or 3D thinning, in which connectivity is preserved. Since, in this
application of edge/surface gap closing, we insist on preservation of connectivity, we consider
below only techniques which possess this property.
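
For concreteness, the skeleton subsets of Eq. (7.39) can be extracted with a few lines of code. The sketch below is ours: it assumes the 3 × 3 cross as the s.e. A and a fixed image size, and it accumulates the union of the subsets Si, from which Eq. (7.40) then assembles the skeleton.

#define W 64
#define H 64
typedef unsigned char Img[H][W];

static int get(Img X, int y, int x)      /* reads 0 outside the image */
{ return (y >= 0 && y < H && x >= 0 && x < W) ? X[y][x] : 0; }

static void erodeA(Img X, Img R)         /* erosion by the 3 x 3 cross A */
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            R[y][x] = get(X,y,x) && get(X,y-1,x) && get(X,y+1,x)
                   && get(X,y,x-1) && get(X,y,x+1);
}

static void dilateA(Img X, Img R)        /* dilation by the same cross */
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            R[y][x] = get(X,y,x) || get(X,y-1,x) || get(X,y+1,x)
                   || get(X,y,x-1) || get(X,y,x+1);
}

/* Accumulate in S the union of the subsets Si of Eq. (7.39):
   Si = (X eroded i times by A) minus the opening of that erosion by A. */
void skeleton_subsets(Img X, Img S)
{
    Img Ei, Er, Op;
    int x, y, nonempty = 1;
    for (y = 0; y < H; y++)
        for (x = 0; x < W; x++) { S[y][x] = 0; Ei[y][x] = X[y][x]; }
    while (nonempty) {
        erodeA(Ei, Er);                  /* the next erosion of X */
        dilateA(Er, Op);                 /* opening of Ei: erode then dilate */
        nonempty = 0;
        for (y = 0; y < H; y++)
            for (x = 0; x < W; x++) {
                if (Ei[y][x] && !Op[y][x]) S[y][x] = 1;  /* add this Si */
                Ei[y][x] = Er[y][x];
                if (Er[y][x]) nonempty = 1;
            }
    }
}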

Morphological 2D thinning (Arcelli, Cordella, and Levialdi)


Arcelli et al. [7.1] use a sequence of masks to implement thinning. The original image is
eroded by each of eight 3 × 3 masks in sequence and the resulting image is used as input

Fig. 7.25. 3 × 3 masks (left) and result of erosion thinning on the cutlery image (right). From [7.36]. Used with permission.

Fig. 7.26. Results of iterative thinning. From [7.36]. Used with permission.

to the next iteration until all possible pixels have been eroded. In each mask shown in Fig.
7.25 [7.1, 7.26], the positions marked in black denote edge pixels, white denotes background,
and hashed are pixels which do not become involved in the computation. A mask is said to
“match” an image at coordinates (k, l) if, when the center of the mask is registered to pixel
(k, l), then all image pixels covered by the mask are edge or background as denoted by the
corresponding mask pixel. If a mask matches an image at a particular edge point, then the
edge at that point in the image may be reset to background. The masks are applied in the
following order: A1, B1, A2, B2, A3, B3, A4, B4. The result of thinning Fig. 7.23 using this
algorithm is shown in Fig. 7.25. Note how this technique distorts the shape of sharp vertices
like the fork tines.

Iterative 2D thinning (Zhang and Suen)


Zhang and Suen’s algorithm for thinning binary images [7.44] iteratively passes over the
image, deciding if contour points can be deleted. A contour point is an edge pixel with at
least one 8-neighbor that is a background pixel. Each iteration contains two passes and at
each iteration, the decision to remove a pixel is based on the number of edge neighbors of a

pixel, the number of 0–1 transitions in a sequence around the pixel, and on two sets of back-
ground neighbor configurations. The result of iteratively thinning the dilated cutlery image
(Fig. 7.23) is shown in Fig. 7.26. It is interesting to note the similarity with the technique of
Arcelli et al.

7A.4.7 Three-dimensional images

DT-driven closing was applied to a three-dimensional image of a box with broken interior
partitions; Figs. 7.27–7.29 show the results. The order of the images is the same as shown for
the 2D results. Note that even though the gaps are large in x, y, they are successfully bridged,
and the edges are still sharp in the final relabeled image.
The three-dimensional algorithm was also tested on a synthetic ellipsoid (Fig. 7.30) which
was deliberately undersampled to produce large gaps. The results of DT-driven closing with
a K max value of 3 are shown in Fig. 7.30.

Fig. 7.27. (a) Original broken box image, frame 20. (b) Distance transform of broken box.

Fig. 7.28. (c) Thickened edge image. (d) Label image based on dilated edges.

Fig. 7.29. (e) Relabeled broken box image.

Fig. 7.30. Ellipsoid with gaps (a) and relabeled ellipsoid (b).

7A.4.8 Comparison with published three-dimensional method

Other research in three-dimensional thinning and skeletonization [7.11, 7.19, 7.41, 7.42]
is primarily based on computing the Euler [7.13] connectivity number, N, for each three-
dimensional solid where N = V − E + F and V, E, and F denote the numbers of vertices,
edges and faces of the object, respectively. For a solid which does not contain tunnels or
cavities, N = 2. Each tunnel or hole penetrating an object decreases N by two; each cavity
in the solid increases N by two. The connectivity number must be preserved during thinning
to maintain the topology of the original object [7.26, 7.33].

3D thinning (Tsao and Fu)


DT-driven thinning was compared to a topological three-dimensional thinning algorithm
described by Tsao and Fu [7.42] which preserves both path and surface connectivity. If
surface connectivity is maintained, then path connectivity or number of components is also
maintained. This algorithm examines the center voxel in a 3 × 3 × 3 cube and classifies it as
a border point in each of six subiterations, N(orth), S(outh), E(ast), W(est), U(p), or D(own),
if the corresponding N, S, E, W, U or D neighbor is zero. If the deletion of a voxel does not
change the connectivity of the remaining voxels in the 3 × 3 × 3 window and if its removal
does not change the connectivity in two “checking planes,” the voxel can be deleted. The
result of thinning the thickened edge image in Fig. 7.28(c) by Tsao and Fu’s method is shown
in Fig. 7.31.
Fig. 7.31. Tsao and Fu thinning of dilated box.

The ellipsoid in Fig. 7.30 was also dilated using a 5 × 5 × 5 kernel to close the gaps (Fig. 7.32(a)) and thinned using Tsao and Fu's algorithm (Fig. 7.32(b)). The extra lines at the top result from quantization effects in adjacent planes.

7A.4.9 Preserving geometry

Probably the most significant aspect of the performance of DT-driven closing is its ability
to preserve the geometry of surfaces, particularly near vertices. Its two-dimensional per-
formance is particularly well demonstrated in Fig. 7.24. To demonstrate how it processes
vertex geometry in three dimensions, we synthesized a cone and extracted the surface with a
three-dimensional edge detector.

Fig. 7.32. (a) Dilated ellipsoid and (b) Tsao and Fu thinned ellipsoid. From [7.36]. Used with
permission.

Fig. 7.33. A cross section through the apex of a cone which has been processed by DT-based
closing (a) and by Tsao–Fu thinning (b). From [7.36]. Used with permission.

Gaps in the surface were then closed with DT-driven closing. The same cone surface
was dilated using conventional dilation to close gaps, and then thinned using the Tsao–Fu
algorithm. The results are shown in Fig. 7.33. Since the process of dilation replaces the edge
information, the Tsao–Fu thinning algorithm has no memory of the original surface when
it thins the dilated image. Of course, neither technique processes the geometry perfectly.
However, since DT-driven closing retains, via the DT, a “memory” of the original, undilated,
geometry, this technique is better able to restore that geometry after gaps are closed. See
[7.38] for more details and computation speed.

7A.4.10 Why this section was included

We included this rather long section on DT-driven edge closing for several reasons: First, it
is important for the student to see that the implementation of any real technique requires that
you know a bit about more than just one technique. You had to use knowledge of connected
components, edge detection, morphological dilation, etc. Second, since this book is directed
primarily at graduate students, we wanted you to see how to write a journal paper. This chapter
is similar in many ways to (and includes figures from) [7.38]. Note the format: Introduce the
problem, describe the algorithm, cite the literature, and most important, compare the new
technique with existing techniques. Follow these simple rules, and you too can publish!

7A.5 Vocabulary
You should know the meanings of the following terms.
Chamfer map
Prime factor
Sampling

Assignment 7.A1
In section 7A.1.1 an example decomposition of a structuring
element was given. Prove that this decomposition gives the
same result as the original when used to dilate the image
{(0,0)}.

Assignment 7.A2
What is the major difference between the output of a thinning
algorithm and the maxima of the distance transform? Choose
the best answer from the following.

(a) A thinning algorithm preserves connectivity. The maxima of the DT are not necessarily connected.
(b) The maxima of the DT are unique; thinning algorithms do not produce unique results.
(c) The DT preserves connectivity. Thinning algorithms do not.
(d) Thinning algorithms produce the intensity axis of symme-
try. The DT does not.

Assignment 7.A3
Use the distance transform to compute the medial axis of an
image. The name of the image will be given in class.

Bibliography

[7.1] C. Arcelli, L. Cordella, and S. Levialdi, “Parallel Thinning of Binary Pictures,” Elec-
tronics Letters, 11, pp. 148–149, 1975.
[7.2] H.G. Barrow, “Parametric Correspondence and Chamfer Matching,” Proceedings
of the 5th International Joint Conference on Artificial Intelligence, August, 1977,
pp. 659–663.
[7.3] M. Bister, Y. Taeymans, and J. Cornelis, “Automated Segmentation of Cardiac MR
Images,” Computers in Cardiology, Washington, DC, IEEE Computer Society Press,
pp. 215–218, 1989.
[7.4] G. Borgefors, “Distance Transformations in Arbitrary Dimensions,” Computer Graph-
ics, Vision, and Image Processing, 27, pp. 321–345, 1984.
[7.5] G. Borgefors, “Distance Transformations in Digital Images,” Computer Graphics,
Vision, and Image Processing, 34, pp. 344–371, 1986.
[7.6] H. Breu, J. Gil, D. Kirkpatrick, and M. Werman, “Linear Time Euclidian Distance
Transform Algorithms,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 17(5), 1995.
[7.7] D. Florencio and R. Schafer, “Homotopy and Critical Morphological Sampling,”
Proceedings of the SPIE, 2308, June, 1994.
[7.8] J.D. Foley, A. vanDam, S.K. Feiner, and J.F. Hughes, Computer Graphics: Principles
and Practice, Reading, MA, Addison-Wesley, pp. 91–99, 1990.
[7.9] W. Gong and G. Bertrand, “A Simple Parallel 3D Thinning Algorithm,” 10th Inter-
national Conference on Pattern Recognition, June, 1990.

[7.10] B. R. Groshong, and W. E. Snyder, “Using Chamfer Maps to Segment Images,” Tech-
nical Report CCSP-WP-86/11, Center for Communications and Signal Processing,
North Carolina State University, Raleigh, NC, USA, December, 1986.
[7.11] K. Hafford and K. Preston Jr., “Three-dimensional Skeletonization of Elongated
Solids,” Computer Vision, Graphics, and Image Processing, 27, pp. 78–91, 1984.
[7.12] R. Haralick and L. Shapiro, Computer and Robot Vision, Volume 1, New York,
Addison-Wesley, 1992.
[7.13] D. Hilbert and S. Cohn-Vossen, Geometry and the Imagination, New York, Chelsea,
1952.
[7.14] H. Hiriyannaiah, G. Bilbro, and W. Snyder, “Restoration of Locally Homogeneous
Images using Mean Field Annealing,” Journal of the Optical Society of America, A,
6(12), pp. 1901–1912, 1989.
[7.15] C. Huang and O. Mitchell, “A Euclidian Distance Transform using Grayscale
Morphology Decomposition,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(4), 1994.
[7.16] X. Jiang, “An Adaptive Contour Closure Algorithm and its Experimental Evaluation,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11), 2000.
[7.17] R. Jones and I. Svalbe, “Morphological Filtering as Template Matching,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 16(4), 1994.
[7.18] R. Jones and I. Svalbe, “Algorithms for the Decomposition of Gray-scale Morpholog-
ical Operations,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(6), 1994.
[7.19] S. Lobregt, P. Verbeek, and F. Groen, “Three-Dimensional Skeletonization: Principle
and Algorithm,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
2(1), pp. 75–77, 1980.
[7.20] P. Maragos and R. Schafer, “Morphological Skeleton Representation and Coding of
Binary Images,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 34,
pp. 1228–1244, 1986.
[7.21] G. Matheron, Random Sets and Integral Geometry, New York, Wiley, 1975.
[7.22] H. Park and R. Chin, “Optimal Decomposition of Convex Morphological Structuring
Elements for 4-connected Parallel Array Processors,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(3), 1994.
[7.23] H. Park and R. Chin, “Decomposition of Arbitrarily Shaped Morphological Structuring
Elements,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1),
1995.
[7.24] I. Pitas and A. Venetsanopoulos, “Morphological Shape Decomposition,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 12(1), 1990.
[7.25] W. Pratt, Digital Image Processing, New York, Wiley, 1978.
[7.26] K. Preston, and M. Duff, Modern Cellular Automata Theory and Applications, New
York, Plenum Press, 1984.
[7.27] J. Reinhardt and W. Higgins, “Comparison Between the Morphological Skeleton and
Morphological Shape Decomposition,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(9), 1996.

[7.28] A. Rosenfeld, “Distance Functions on Digital Images,” Pattern Recognition, 1, pp. 33–61, 1968.
[7.29] A. Rosenfeld, Picture Processing by Computer, New York, Academic Press, 1969.
[7.30] A. Rosenfeld, “Connectivity in Digital Pictures,” Journal of the Association for Com-
puting Machinery, 17(1), pp. 146–160, 1970.
[7.31] A. Rosenfeld and A. Kak, Digital Picture Processing, Volume 2, New York, Academic
Press, 1982.
[7.32] A. Rosenfeld and J. Pfaltz, “Sequential Operations in Digital Picture Processing,”
Journal of the Association for Computing Machinery, 13, pp. 471–494, 1968.
[7.33] P. Saha and B. Chaudhuri, “Detection of 3-D Simple Points for Topology Preserv-
ing Transformations with Application to Thinning,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(10), 1994.
[7.34] D. Schonfeld, “Optimal Structuring Elements for the Morphological Pattern Restora-
tion of Binary Images,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 16(6), 1994.
[7.35] W. Snyder and G. Bilbro, “Segmentation of Three Dimensional Range Images,”
CCSP-TR-84/7, North Carolina State University, November, 1984.
[7.36] W. Snyder and A. Cowart, “An Iterative Approach to Region Growing Using Asso-
ciative Memories,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
5(6), pp. 638–651, 1985.
[7.37] W. Snyder and C. Savage, “Content-Addressable Read-Write Memories for Image
Analysis,” IEEE Transactions on Computers, 31(10), pp. 963–968, 1982.
[7.38] W. Snyder, M. Hsiao, K. Boone, T. Hudacko, and B. Groshong, “Closing Gaps in
Edges and Surfaces,” Image and Vision Computing, October, 1992.
[7.39] J. Srihari, K. Uhurpa, and M. Yau, “Understanding the Bin of Parts,” IEEE Conference
on Decision Control, pp. 44–49, December, 1979.
[7.40] H. Tagare, F. Vos, C. Jaffe, and J. Duncan, “Arrangement: A Spatial Relation Be-
tween Parts for Evaluating Similarity of Tomographic Section,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 17(9), 1995.
[7.41] J. Toriwaki and Y. Tatsuhiro, “Topological Properties and Topology-preserving Transformation of a Three-dimensional Binary Picture,” Proceedings of the Sixth International Pattern Recognition Conference, pp. 414–419, Munich, 1982.
[7.42] Y. Tsao and K. Fu, “A Parallel Thinning Algorithm for 3-D Pictures,” Computer
Graphics and Image Processing, 17, pp. 315–331, 1981.
[7.43] R. van den Boomgaard and A. Smeulders, “The Morphological Structure of Images:
The Differential Equations of Morphological Scale-space,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 16(11), 1994.
[7.44] T. Zhang and C. Suen, “A Fast Parallel Algorithm for Thinning Digital Patterns,”
Communications of the Association for Computing Machinery, 27(3), pp. 236–239,
1984.
8 Segmentation

Gallia est omnis divisa in partes tres.
Julius Caesar

Segmentation is the process of separating objects from background. It is the building block for subsequent processes such as shape analysis and object recognition. In this chapter, we first discuss several popular segmentation algorithms, including threshold-based, region-based (or connected component analysis), edge-based, and surface-based. We also describe some recently developed segmentation algorithms in the topics section.

8.1 Segmentation: Partitioning an image

In many machine vision applications, the set of possible objects in the scene is quite
limited. For example, if the camera is viewing a conveyer, there may be only one
type of part which appears, and the vision task could be to determine the position
and orientation of the part. In other applications, the part being viewed may be one
of a small set of possible parts, and the objective is to both locate and identify each part. Finally, the camera may be used to inspect parts for quality control.

Fig. 8.1. An image with two foreground regions.

In this section, we will assume that the parts are fairly simple and can be characterized by their two-dimensional projections, as provided by a single camera view. Furthermore, we will assume that the shape is adequate to characterize the objects. That is, color or variation in brightness is not required. We will first consider dividing the picture into connected regions.

11112221111111111111111111111
11122221111113333333333333111
11112221111113333333333333111
11112211111111111111333311111
11112211111111111111333331111
11112211111111111111333333111
11222222111111111113333333111
11224422211111111133333331111
11222222111111111111333111111
11112211111111111111111111111
11111111111111111111111111111

Fig. 8.2. A segmentation and labeling of the image in Fig. 8.1.

A segmentation of a picture is a partitioning into connected regions, where each region is homogeneous in some sense and is identified by a unique label. For example, in Fig. 8.2 (a "label image"), region 1 is identified as the background. Although region 4 is really background as well, it is labeled as a separate region since it is not connected to region 1.

The term "homogeneous" deserves some discussion. It could mean all the pixels are the same brightness, but that criterion is too strong for most practical applications. It could mean that all pixels are close to some representative (mean) brightness.


Stated more formally [8.80], a region is homogeneous if the brightness values are
consistent with having been generated by a particular probability distribution (see
also the analysis by Ng and Lee [8.44]). In the case of range imagery [8.35], where
we (might) have an equation which describes the surface, we could say a region
is homogeneous if it can be described by the combination of that equation and
some probabilistic deformation. For example, if all the points in a region of a range
image lay in the same plane except for deviations whose distance from the plane
may be described by a particular Gaussian distribution, we might say this region is
homogeneous.
There are several ways to perform segmentation. Threshold-based techniques are
guaranteed to form closed regions, for they simply assign all pixels above (or below,
depending on the problem) a specified threshold to be in the same region. Edge-
based techniques assume that regions are separated by neighborhoods where the
edge strength is high. Region-based methods start with elemental (e.g., homoge-
neous) regions and split or merge them. Then, there are a variety of hybrid methods,
including watershed [8.5] techniques. A watershed method generally operates on the
gradient of the image; segmentation consists of flooding the image with (by anal-
ogy) water, in which region boundaries (areas of high edge strength) are erected to
prevent water from different seed points from mixing. Traditional “region growing”
methods are really variations on watershed methods [8.1].
Before we can go much further in our discussion of issues and techniques in
segmentation, you need to understand some of the interesting and unexpected things
that happen to geometry and topology when you deal with digital images. Remember
the connectivity paradox from section 4.5? We discovered that an object could have
a closed boundary but still have an inside and outside which were connected? As
another example, consider the problem of finding the perimeter of a region, or even
the length of a line from a sampled version of that line. That problem has immediate
applications in segmentation, and yet it is not obvious how to estimate it [8.31].
Keep in mind that these sorts of things can happen, because we are going to talk a
lot about connectivity in this chapter.

8.2 Segmentation by thresholding

In applications where specific gray values of regions are not important, we can
segment a picture into “objects” and “background” by simply choosing a threshold
in brightness. We define any region whose brightness is above the threshold as object
and all below the threshold as background.
There are several different ways to choose thresholds, ranging from the trivially
simple to the very sophisticated. As the sophistication of the technique increases,
performance improves but at the cost of increased computational complexity.

Fig. 8.3. Two detectors forming an image of a rectilinear grid. The light source is uniform; however, the images exhibit both radiometric (brightness) distortion (the left image is brighter in the right center and the right image is brighter in the middle) and geometric distortion (straight lines are distorted in a characteristic pincushion form).

Probably the most important factor to note is the local nature of thresholding.
That is, a single threshold is almost never appropriate for an entire scene. It is nearly
always the local contrast between object and background that contains the relevant
information. Since camera sensitivity drops off from the center of the picture to the
edges due to parabolic distortion and/or vignetting as shown in Fig. 8.3, it is often
useless to attempt to establish a global threshold. A dramatic example of this effect
can be seen in an image of a rectilinear grid, in which the “uniform” white varies
significantly over the surface.
Effects such as parabolic distortion and vignetting are quite predictable and easy
to correct. In fact, off-the-shelf hardware is available for just such applications.
It is more difficult, however, to predict and correct effects of nonuniform ambi-
ent illumination, such as sunlight through a window, which changes radically over
the day.
Since a single threshold cannot provide sufficient performance, we must choose
local thresholds. The most common approach is called block thresholding, in which
the picture is partitioned into rectangular blocks and different thresholds are used
on each block. Typical block sizes are 32 × 32 or 64 × 64 for 512 × 512 images.
The block is first analyzed and a threshold is chosen; then that block of the image
is thresholded using the results of the analysis. In more sophisticated (but slower)
versions of block thresholding, the block is analyzed and the threshold computed.
Then that threshold is applied only to the single pixel at the center of the block. The
block is then moved over one pixel, and the process is repeated.

Fig. 8.4. A histogram of a bimodal image, with many bright pixels (intensity around 169) and many dark pixels (intensity around 11).

8.2.1 Choosing a threshold


The simplest strategy for choosing a threshold is to average the intensity over the block and choose iavg + Δi as the threshold, where Δi is some small increment, such as 5 out of 256 gray levels. Such a simple thresholding scheme can have surprisingly good results.
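
A sketch of this block thresholding rule follows; the image size, block size, and the value of Δi are assumptions of ours, chosen to match the typical numbers quoted above.

#define W 512
#define H 512
#define BLK 32     /* block size */
#define DELTA 5    /* the small increment added to the block average */

void block_threshold(const unsigned char f[H][W], unsigned char out[H][W])
{
    for (int by = 0; by < H; by += BLK)
        for (int bx = 0; bx < W; bx += BLK) {
            long sum = 0;
            for (int y = by; y < by + BLK; y++)       /* block average */
                for (int x = bx; x < bx + BLK; x++)
                    sum += f[y][x];
            int t = (int)(sum / (BLK * BLK)) + DELTA; /* i_avg + delta i */
            for (int y = by; y < by + BLK; y++)       /* apply to the block */
                for (int x = bx; x < bx + BLK; x++)
                    out[y][x] = (f[y][x] > t) ? 255 : 0;
        }
}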
However, when the simpler schemes fail, one is forced to move to more sophis-
ticated techniques, such as thresholding based on histogram analysis. Before we
describe this technique, we will first define a histogram.
The histogram h(i) of an image f (x, y) is a function of the permissible intensity
values. In a typical imaging system, intensity takes on values between 0X00 (black)
and 0XFF (white). A graph that shows, for each gray level, the number of times that
level occurs in the image is called the histogram of an image. Illustrated in Fig. 8.4
is a histogram of black parts on a white conveyor.
In Fig. 8.4 we note two distinct peaks, one at gray level 11, almost pure black, and
one at gray level 169, bright white. With the exception of noise pixels, every point
in the image belongs to one of these regions. A good threshold, then, is anywhere
between the two peaks.
Histograms are seldom as “nice” as the one in Fig. 8.4 and some additional pro-
cessing is generally needed ([8.14, 8.51, 8.70] explain and experimentally compare
several such methods). In the following section, we describe a more sophisticated
technique for finding the optimal threshold.

Fitting a sum of Gaussians

In [8.7], a technique was developed which finds the global minimum of a function
of several variables, even for functions which have more than one minimum. That
technique, known as tree annealing (TA) may be applied to the problem of histo-
gram analysis and thresholding in the following way [8.62].

Fig. 8.5. MSE fit of the sum of three Gaussians to a histogram.

Given a bimodal histogram, h(f), of an image, where f represents a brightness value, a standard technique [8.14] to find the best threshold of that image is to fit the histogram with the sum of two Gaussians:

    h(x) = [A1/(√(2π)σ1)] exp(−(x − μ1)²/(2σ1²)) + [A2/(√(2π)σ2)] exp(−(x − μ2)²/(2σ2²)).    (8.1)

If h(f) is properly normalized, one may adjust the usual normalization of the two
component Gaussians so that each sums to unity on the 256 discrete gray levels
(rather than integrating to unity on the continuous interval), and thereby admit the
additional constraint that A1 + A2 = 1. Use of this constraint reduces the number
of parameters to be estimated from six to five; however, we have determined ex-
perimentally that TA actually solves the problem more accurately by using the six
variables, without readjusting the normalization on each iteration. Conventional de-
scent often terminates at a suboptimal local minimum for this two-Gaussian problem
and is even less reliable for a three-Gaussian problem. TA deals easily with either.
The result of fitting a sum of three Gaussians to the histogram of an image is shown
in Fig. 8.5.
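
Once the six parameters of Eq. (8.1) have been estimated, one simple way (ours, not the only one) to extract a threshold is to scan the gray levels between the two means for the point where the dominant fitted component switches:

#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* one component of Eq. (8.1) */
static double g(double x, double A, double mu, double s)
{
    return A / (sqrt(2.0 * M_PI) * s) * exp(-(x - mu)*(x - mu) / (2.0*s*s));
}

/* assumes mu1 < mu2; returns the last gray level at which class 1 dominates */
int gaussian_threshold(double A1, double mu1, double s1,
                       double A2, double mu2, double s2)
{
    int x, t = (int)mu1;
    for (x = (int)mu1; x <= (int)mu2; x++)
        if (g(x, A1, mu1, s1) >= g(x, A2, mu2, s2))
            t = x;
    return t;
}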
Whatever algorithm we use, the philosophy of histogram-based thresholding is
the same: Find peaks in the histogram and choose the threshold to be between them.
In many industrial environments, the lighting may be extremely well controlled.
With such control, the best thresholds will be constant over time and may be chosen
interactively during system set up. However, in general, different thresholds are used
in different areas of the picture.

8.3 Connected component analysis

Let us assume, for now, that a good threshold has been chosen and that our picture
has been partitioned into regions of pure black and pure white, as shown in Fig. 8.1.

Fig. 8.6. A graph with two connected components.

The production of a segmented picture such as Fig. 8.2 requires an analysis of connectedness. That is, a pixel is in region i if it is above threshold and is adjacent to a pixel in region i. Since regions may curve and fork, the analysis cannot be as simple as starting at the top and marking connected pixels going down. Instead, a more sophisticated technique is needed.
The term “connected component” comes from graph theory. Consider the graph
in Fig. 8.6.
This graph has eight vertices, eight edges, and two connected components. That is, there is a path along edges from A to D, or B to E, or F to H, etc. But there is no path from A to F. In image analysis, we think of the foreground pixels as vertices in a graph, and the "adjacent to" property as determining the edges in the graph. With this definition, we see that the image in Fig. 8.7 contains two connected components. Graphs will be revisited in more detail in Chapter 12.

Fig. 8.7. An image containing two connected components.
In order to make use of the concept of connected component labeling (CCL), we must also define a "label image," which is an iconic representation, isomorphic to the original image, in which each pixel contains the number of the component to which it belongs.

11112221111111111111111111111
11122221111113333333333333111
11112221111113333333333333111
11112211111111111111333311111
11112211111111111111333331111
11112211111111111111333333111
11222222111111111113333333111
11224422211111111133333331111
11222222111111111111333111111
11112211111111111111111111111

Fig. 8.8. The label image corresponding to Fig. 8.7.

One algorithm which produces a label image is known as "region growing." It utilizes a label memory corresponding to the frame buffer just as Fig. 8.8 corresponds to Fig. 8.7. In this description, we will refer to "black" pixels as object and "white" as background.

Initially, each cell in the label memory L is set to zero. We will refer to the picture memory as f. Thus the labeling operation can be written as L(x, y) ← N for some label number N.

8.3.1 Recursive region growing algorithm


This algorithm implements region growing by using a pushdown stack on which to
temporarily keep the coordinates of pixels in the region.

(1) Find an unlabeled black pixel; that is, L(x, y) = 0. Choose a new label number
for this region, call it N. If all pixels have been labeled, stop.
(2) L(x, y) ← N .
(3) If f (x − 1, y) is black and L(x − 1, y) = 0, push the coordinate pair (x − 1, y)
onto the stack.
If f (x + 1, y) is black and L(x + 1, y) = 0, push (x + 1, y) onto the stack.
If f (x, y − 1) is black and L(x, y − 1) = 0, push (x, y − 1) onto the stack.
If f (x, y + 1) is black and L(x, y + 1) = 0, push (x, y + 1) onto the stack.
(4) Choose a new (x, y) by popping the stack.
(5) If the stack is empty, go to 1, else go to 2.

This labeling operation results in a set of connected regions, each assigned a unique
label number. To find the region to which any given pixel belongs, the computer has
only to interrogate the corresponding location in the L memory and read the region
number.
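
A direct C sketch of these five steps follows. The array sizes, the stack layout, and the constant BLACK are assumptions of ours. A pixel may be pushed more than once before it is popped and labeled; this is harmless, since it receives the same label each time, but it is why the stack is sized at four entries per pixel.

#define W 512
#define H 512
#define BLACK 1

static int stk[4*W*H][2], sp;    /* pushdown stack of (x, y) pairs */

static void try_push(const unsigned char f[H][W], int L[H][W],
                     int x, int y)
{
    /* step 3: push only unlabeled black pixels */
    if (x >= 0 && x < W && y >= 0 && y < H
        && f[y][x] == BLACK && L[y][x] == 0) {
        stk[sp][0] = x; stk[sp][1] = y; sp++;
    }
}

void grow_regions(const unsigned char f[H][W], int L[H][W])
{
    int N = 0;
    for (int y0 = 0; y0 < H; y0++)
        for (int x0 = 0; x0 < W; x0++) {
            if (f[y0][x0] != BLACK || L[y0][x0] != 0)
                continue;              /* step 1: find an unlabeled black pixel */
            N++;                       /* a new label for this region */
            sp = 0;
            try_push(f, L, x0, y0);
            while (sp > 0) {           /* steps 4 and 5 */
                sp--;
                int x = stk[sp][0], y = stk[sp][1];
                L[y][x] = N;           /* step 2: label it */
                try_push(f, L, x - 1, y);
                try_push(f, L, x + 1, y);
                try_push(f, L, x, y - 1);
                try_push(f, L, x, y + 1);
            }
        }
}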

EXAMPLE
Applying region growing
Fig. 8.9 shows a 4 × 7 array of pixels. Assume the initial value of x, y is 2, 4.
Apply algorithm “grow” and show the contents of the stack and L each time step (3)
is executed. Let the initial value of N be 1.

Solution
Pass 1. Immediately after execution of step (3). The algorithm has examined
pixel 2, 4, examined its 4-neighbors, and detected only one 4-neighbor in the
foreground, the pixel at 3, 4. Thus, the coordinates of that pixel are placed on

the stack.

Fig. 8.9. An example image after thresholding.

Stack:                L =   7   0 0 0 0
3, 4 ← top                  6   0 0 0 0
                            5   0 0 0 0
                            4   0 1 0 0
                            3   0 0 0 0
                            2   0 0 0 0
                            1   0 0 0 0
                                1 2 3 4

Pass 2. The pixel at 3, 4 was removed from the top of the stack and marked with
a 1 in the L image, its neighbors were examined and two 4-neighbors were found,
the pixels at 3, 3 and at 3, 5; both were put on the stack.

Stack:                L =   7   0 0 0 0
3, 5 ← top                  6   0 0 0 0
3, 3                        5   0 0 0 0
                            4   0 1 1 0
                            3   0 0 0 0
                            2   0 0 0 0
                            1   0 0 0 0
                                1 2 3 4

Pass 3. The top of the stack contained 3, 5. That pixel was removed from the
stack and marked with a 1 in the L image. All its neighbors were examined and one
4-neighbor was found, the pixel at 3, 6. Thus the coordinates of this pixel are put
on the stack.
Stack:                L =   7   0 0 0 0
3, 6 ← top                  6   0 0 0 0
3, 3                        5   0 0 1 0
                            4   0 1 1 0
                            3   0 0 0 0
                            2   0 0 0 0
                            1   0 0 0 0
                                1 2 3 4

Pass 4. The stack was "popped" again, this time removing the pixel at 3, 6, which was marked with a 1 in the L image. That pixel was examined and determined to have no 4-neighbors which had not already been labeled.

Stack:                L =   7   0 0 0 0
3, 3 ← top                  6   0 0 1 0
                            5   0 0 1 0
                            4   0 1 1 0
                            3   0 0 0 0
                            2   0 0 0 0
                            1   0 0 0 0
                                1 2 3 4

Pass 5. The stack was popped again, removing the pixel at 3, 3, which was marked with a 1 in the L image. That pixel was examined and determined to have no 4-neighbors which had not already been labeled.

Stack:                L =   7   0 0 0 0
( ) ← top                   6   0 0 1 0
                            5   0 0 1 0
                            4   0 1 1 0
                            3   0 0 1 0
                            2   0 0 0 0
                            1   0 0 0 0
                                1 2 3 4

Pass 6. The stack was popped again, producing a return value of “stack empty” and
the algorithm is complete since all black pixels had been labeled.

This region growing algorithm is just one of several strategies for performing
connected component analysis. Other strategies exist which are faster than the one
described, including some that run at raster-scan rates [8.6]. We will now consider
one such technique.

8.3.2 An iterative approach to connected component analysis


Since the region growing technique always results in closed regions, this technique
is often preferable to other techniques which are based on edge detection or line
fitting. Numerous variations and applications of the basic region growing technique
have been proposed in the past [8.10, 8.16]. Although region growing has proved
to be an integral part of scene analysis, its use can quickly become computationally
prohibitive, particularly for high-resolution images. This has prompted the consid-
eration of alternative, faster, more hardware-specific methods of region partitioning.
In this section we present an alternative algorithm to recursive region growing.
This algorithm is functionally identical to recursive region growing in that it returns
a set of labeled pixels, meeting the adjacency and similarity criteria.

While this algorithm is functionally identical to traditional region growing, it is
fundamentally different in concept and potential implementation.
The objective in developing this algorithm was to find a way to achieve the results
of region growing, but to do so “on the fly” with a single pass over the data. This result
was achieved by using the concept of a content-addressable memory. This memory
may be a physical piece of hardware or a lookup-table-driven access method in
simulation software.
This algorithm is based on the concept of equivalence relationships between the
pixels of an image. Equivalency is defined as follows. Two pixels a and b are defined
to be equivalent (denoted R(a, b)) if they belong to the same region of an image. This
relationship can be shown to be reflexive (R(a, a)), symmetric (R(a, b) ⇒ R(b, a)),
and transitive (R(a, b) AND R(b, c) ⇒ R(a, c)), which makes it an "equivalence
relation."
The transitive property enables all pixels in a region to be determined by con-
sidering only local adjacency properties. In this algorithm, each pixel will be com-
pared to each adjacent pixel in a left-to-right, top-to-bottom raster-scan fashion.
The assignment of a region label to a pixel results from this comparison operation.
Pixels in a simple binary image are labeled in raster-scan order. The region labeling
proceeds in a straightforward manner: First look to the left, consider that your
neighbor, and use its label. Then look up to the pixel above, and consider that pixel
your neighbor. If both are labeled and the labels are different, you have a problem.
Fig. 8.10 demonstrates the situation that can arise as a result of this comparison when
the equivalence relation R(1, 2) is discovered at the pixel designated by the question
mark.

    1 1 1   2 2
    1 1 1   2 2
    1 1 1   2 2
    1 1 1   2 2
    1 1 1 1 1 ?

Fig. 8.10. Ambiguity in label assignment.
The system proposed in [8.60] employs hardware to assign region labels to pix-
els and to maintain a table of equivalence relationships. Fig. 8.12 shows that this
hardware resides between the image memory and a host computer. Functionally, this
hardware is transparent to the host computer. For the example of Fig. 8.10, all pixels
will be perceived by the host computer as belonging to region 1 (the lower numbered
region label takes precedence in the equivalence relationship here).
The operation of the hardware is described in the flowchart of Fig. 8.11. In order
to understand this flowchart, the following notation is introduced.

• f(x, y) is the gray-scale value of the (x, y) pixel in the image memory.
• (x, y)_i is the ith adjacent neighbor of the (x, y) pixel.
• f_i(x, y) is the gray-scale value of the ith adjacent neighbor of the (x, y) pixel.
• L(x, y) is the region label corresponding to the (x, y) pixel in the image memory.
• L_i(x, y) is the region label number corresponding to the ith adjacent neighbor of
  the (x, y) pixel.
• K(i) is the contents of the ith element in the equivalence memory. This memory
  is a content-addressable memory.


• K*(i) = K(j) implies the sequence:
  – read K(j);
  – search K using i as a search key (i.e., determine all l such that K(l) = i);
  – write K(j) to all positive responders of the search.
• T is a threshold.
• p is the highest numbered region label (initially 1).
• N is the number of pixels in a neighborhood (four or eight connectivity).

Fig. 8.11. Flowchart of the algorithm (from [8.60]).

The algorithm is illustrated in Fig. 8.13 for an arbitrary region adapted from
Milgram et al. [8.41]. The reader should note that the relation |f(x, y) − f_i(x, y)| <
T tests if two pixels are similar. There are other similarity measures which could
be used, including local first- and/or second-order statistics. If two pixels meet this
criterion and they are adjacent, then they are in the same region. By definition, if
two pixels are in the same region, then R(a, b) holds. That is,

\{ADJACENT(x, y, x', y') \wedge |f(x, y) - f_i(x, y)| < T\} \leftrightarrow R(x, y, x', y').

The transitive property of R cannot be used to infer that R(x, y, x', y') →
|f(x, y) − f_i(x, y)| < T without also considering the adjacency property.
As the region partitioning proceeds in real time (i.e., synchronously with the
raster scan), two activities must be performed. First, the L memory must be loaded
with the region label number of each pixel under consideration, and second, the K
memory must be updated with all equivalence relationships discovered. For example,
if region 4 is actually identical to region 2, then both K(2) and K(4) will contain 2
(the lower numbered region label takes precedence). Hence, when the host computer
interrogates pixel (x, y) of the L memory, the interface/processor interprets L(x, y)
in terms of the K memory and returns K(L(x, y)) to the computer. For example, pixel
(11, 11) in Fig. 8.13 would be returned as belonging to region 1 since K(L(11, 11)) =
K(4) = 1.
The difficulty generally encountered in this type of procedure is the problem of
chaining. That is, if R (2, 4) and R (3, 4) have been determined, then R(2, 3) must also
be deduced. However, to require that the machine search out all such possibilities
after the image has been processed defeats the original objective of performing region
partitioning during scan time. Chaining is avoided in this algorithm by ensuring that
K (2) = K (3) = K (4) = 2. However, this merely transfers the chaining problem to
the scanning and labeling process.
Block A of the algorithm flowchart shown in Fig. 8.11 resolves the chaining
problem. Whenever an equivalence relationship is detected, all locations in the K
memory containing the larger region label number are loaded with the smaller region
label number. While the execution of this step in real time may not be feasible for
conventional random access memories, it is within the capability of the content-
addressable memories discussed in the next section.
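Nothing prevents simulating this scheme in software before committing to hardware. The Python sketch below follows the flow described above (and flowcharted in Fig. 8.11) for the two previously scanned 4-neighbors, with the K memory modeled as an ordinary array and the content-addressable write of block A reduced to a linear search; the names are ours, and this is an illustration of the idea rather than the program of [8.60]:

import numpy as np

def raster_scan_label(f, T):
    """One-pass region labeling with an equivalence (K) memory."""
    rows, cols = f.shape
    L = np.zeros((rows, cols), dtype=int)    # region label memory
    K = [0]                                  # equivalence memory, K[i]
    p = 0                                    # highest region label so far
    for x in range(rows):
        for y in range(cols):
            for nx, ny in ((x - 1, y), (x, y - 1)):   # scanned 4-neighbors
                if nx < 0 or ny < 0:
                    continue
                if abs(int(f[x, y]) - int(f[nx, ny])) >= T:
                    continue                 # neighbor is not similar
                if L[x, y] == 0:
                    L[x, y] = K[L[nx, ny]]   # inherit the neighbor's class
                else:
                    z = max(K[L[x, y]], K[L[nx, ny]])
                    w = min(K[L[x, y]], K[L[nx, ny]])
                    if z != w:               # block A: K*(z) = K(w)
                        val = K[w]           # read K(w)
                        for l in range(len(K)):
                            if K[l] == z:    # responders to search key z
                                K[l] = val   # the "parallel" write
            if L[x, y] == 0:                 # no labeled, similar neighbor
                p += 1
                K.append(p)                  # K(p) = p
                L[x, y] = p
    return np.array(K)[L]                    # host view: K(L(x, y))

Note that every pixel receives a label; background areas form regions too, exactly as in the equivalence-relation formulation above.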

An architecture for implementation

The architecture proposed to implement the algorithm of this section is shown in
Fig. 8.12. This hardware is intended as a special-purpose processor for an existing
computer-based image processing system. The architecture contains four major com-
ponents: Image memory (f), Region label memory (L), Equivalence memory (K),
and an Interface/processor. The gray scale values of the image reside in the image
memory. Typically, the f memory would contain 512 × 512 bytes. The region labels
assigned to individual pixels are contained in the region label memory. However,
the contents of the L memory also include all intermediate region labels for which
equivalence labels were determined. The contents of the L memory are interpreted
in terms of the contents of the equivalence memory by straightforward table lookup
hardware. The size of the L memory is directly related to the bit length required
to represent the region label (including intermediate region labels) associated with
each pixel.

Fig. 8.12. Architecture of a region labeling system: the host computer communicates
through the interface/processor with the image memory (f), the region label
memory (L), and the equivalence memory (K).

Fig. 8.13. An example of a difficult labeling problem (from [8.60]): an 11 × 17
image whose pixels carry intermediate region labels 1–4.

Fig. 8.13 illustrates a difficult labeling problem, and Table 8.1 illustrates the
labeling process for this example.

Table 8.1. The contents of the equivalence memory as the labeling algorithm
progresses.

    Pixel        K memory address
   x     y      1    2    3    4
   1     1      1
  16     1      1    2
   5     4      1    2    3
   5    10      1    2    1
   9    10      1    2    1    4
  11    10      1    2    1    1
  16    11      1    1    1    1

The f memory and the L memory are both conventional random access memories.
However, the equivalence memory has two modes of operation. It may be used as
a conventional RAM, in which the address input is a region label and the data
output is the equivalent label. In associative memory mode it is used to update
that table. In this mode, two activities occur in synchronism with a two-phase
clock.
• Phase 1: All memory cells whose contents match the contents of the data bus set
  their corresponding enable flip-flops (see Fig. 8.14).
• Phase 2: All memory cells whose enable flip-flops are set read the contents of the
  data bus.

This operation effectively updates the equivalence table in parallel during the
scan.

Fig. 8.14. Organization of the K memory: an address decoder and select logic
connect the cells to the address and data buses, clocked by the two phases φ1
and φ2.



It is also possible to update the table by a search algorithm at the end of the scan.
However, doing the updating in parallel during the scan tremendously reduces the
number of equivalences (since the lowest number is always used), thus reducing
the number of bits needed for the K memory. A content-addressable memory has
the property that memory cells can be accessed or loaded by their contents [8.45,
8.69, 8.77].
The parameter most crucial to the development of a satisfactory memory, and
hence a system capable of operating in real time, is the memory size. A near real-time
system will result if the memory size necessitates a compromise in the access speed.
Issues of memory size are discussed further in [8.60] where simulation involving
real images is presented.
The final component of the architecture is the interface/processor. The primary
purpose of this unit is to execute the algorithm described in this section and
flowcharted in Fig. 8.11. Additionally, it must be capable of (1) processing the video
signal input into gray-scale values for storage in the f memory, and (2) interpreting
the L memory in terms of the K memory.

Simulation

The algorithm was applied to a 512 × 512 image of text data that was thresholded
prior to segmentation. Two parameters were of interest: (1) the number of elemental
regions (those whose labels are stored in L), since this affects the word width of L
and K and the length of K; and (2) the number of regions perceived by the algorithm
since this determines the amount of further processing which the host computer must
perform before useful information can be gleaned from the image.
The results of one simulation are summarized below.
• Four-neighbor connectivity: 912 elemental regions, 138 perceived regions.
• Eight-neighbor connectivity: 883 elemental regions, 109 perceived regions.

These results indicate that a 512 × 512 × 10 bit L memory and a 1024 × 10 bit
K memory would be required for this image.
In this section we have addressed the issue of performing image analysis opera-
tions in real time on television-scanned data. We have shown that it is possible to
design hardware which can perform the operation of region growing in this way.
The concept of using equivalence relations to partition an input set is fundamental
to the algorithm. Furthermore, the use of content-addressable read/write memories
facilitates the implementation of such equivalence relation processing in real time.
These concepts were developed by considering potential hardware structures;
however, nothing about the algorithm prevents its implementation on a digital com-
puter. The program described here was written to simulate the effectiveness of this
approach. Since then, we have used it to label regions in an image segmentation. Its
speed of operation, even in simulation, is superior to that of our earlier region
grower, for identical performance.

8.3.3 An alternative to the label image


In this chapter, we primarily use the label image as the means for identifying which
pixels belong to which regions. Other authors [8.73] have used a tree-based approach.
That is, a graph is established, and all the children of a given parent are assumed to
lie in the same region. Thus, levels of the graph correspond to levels of scale. That is,
a “parent node” is at a higher (more blurred) level of scale than the children of that
node. A child node is defined as being related to a parent if (1) it is geometrically
close to the parent and (2) it is similar in intensity. More details are available in
[8.73].

8.4 Segmentation of curves

Sometimes, we already have the boundaries of regions, and are interested in describ-
ing the boundary in some way which is appropriate for us to be able to characterize
either the entire boundary or individual segments. While there are many ways to
approach this problem [8.17, 8.42, 8.52], almost all of them involve identifying dis-
tinguished points [8.15, 8.26, 8.55, 8.79] along the boundary, and then characterizing
the curves between these “salient” points. Obviously, the definition of saliency [8.18,
8.23, 8.48] is critical to the performance of the algorithm.
First, recognize that a curve is a one-dimensional function, which is simply bent
in 2-space. That is, a curve can be parameterized using a single parameter, which
is usually arc length. However, the arc length parameterization is not invariant to
affine transforms [9.1]. For this reason, Rivlin and Weiss [8.49] develop invariants
of curves without using a parameter. The x–y coordinates of a point on the curve can
then be written as one-dimensional functions of the arc length, ψ(s) = [x(s), y(s)];
that is, of how far along the curve we have traveled. One can construct any smooth
curve up to a rigid body motion if the curvature as a function of arc length is known
[8.76]. Of course, we never deal with a smooth, arc-length parameterized curve; we
only have sampled versions of such curves, and the relationship between true arc
length and the number of pixels through which the curve passes is not as simple as
one might think [8.31]. In fact, even for noise-free curves, digitization introduces
errors [8.76]. Interestingly, if one adds noise, curve positions (at least for straight
lines) can be estimated with increased accuracy [8.37].
The speed of a curve at a point s is

\dot{\psi}(s) = \sqrt{\left(\frac{\partial x}{\partial s}\right)^2 + \left(\frac{\partial y}{\partial s}\right)^2}.
∂s ∂s

The outward normal direction to a curve at point s is

n_\psi(s) = \frac{\left[\frac{\partial}{\partial s} y(s),\; -\frac{\partial}{\partial s} x(s)\right]}{\dot{\psi}(s)}.

Suppose the curve is closed; then the concepts of INSIDE and OUTSIDE make sense.
Given a point in the plane x = [x_i, y_i] which is not on the curve, let ψ_x represent the
closest point on the curve to x (at this point, the arc length is defined to be s_x). Then
we say x is INSIDE the curve if [x − ψ_x] · n_ψ(s_x) ≤ 0 and OUTSIDE otherwise.
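These formulas translate directly to a sampled closed curve. In the Python sketch below, central differences stand in for the derivatives, the curve is assumed to be traversed counterclockwise (so that the rotated tangent points outward), and the function names are ours:

import numpy as np

def curve_normals(pts):
    """Outward normals of a closed curve sampled as an (n, 2) array of (x, y)."""
    t = (np.roll(pts, -1, axis=0) - np.roll(pts, 1, axis=0)) / 2.0  # tangent
    speed = np.hypot(t[:, 0], t[:, 1])               # the speed, |psi-dot(s)|
    # rotate the tangent by -90 degrees: outward for a counterclockwise curve
    return np.column_stack((t[:, 1], -t[:, 0])) / speed[:, None]

def is_inside(x, pts):
    """INSIDE test: sign of [x - psi_x] . n_psi(s_x) at the closest curve point."""
    d = pts - np.asarray(x, dtype=float)
    k = np.argmin((d ** 2).sum(axis=1))              # index of psi_x
    return float((np.asarray(x) - pts[k]) @ curve_normals(pts)[k]) <= 0.0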
There is a way [8.54] to perform curve evolution (see section 9.8) such that the
enclosed area remains constant.
You do not necessarily have to find salient points. Chen et al. [8.13] simply
apply an orientation-selective filter at all possible orientations. If two segments have
sufficiently different orientations, the filter response will have multiple peaks, and
the position of the peaks can be used to identify the segments. This approach seems
to work well for images consisting of straight lines with X or T intersections (see
Chapter 10 for a discussion of the types of intersections).
Rosen and West [8.50] propose a slightly different strategy for finding salient
points. They fit the sequence of data points with whatever function seems appropriate
(ellipses or straight lines). The data point which fits most poorly becomes the salient
point. The curve is then divided into two segments, and the fit is repeated recursively
on each segment.

8.5 Active contours (snakes)

The concept of active contours was originally developed to address the fact that
any edge detection algorithm will fail on some images, because in certain areas
of the image, the edge simply does not exist. For example, Fig. 8.15 illustrates a
human heart imaged using nuclear medicine. A radioactive drug is introduced into
the circulation, and an image is made which reflects the radiation at each point. The
brightness at a point is a measure of the integral, in a direction perpendicular to the
imaging plane, of the amount of blood in the area subtended by that pixel (Fig. 8.15:
the left ventricle imaged using nuclear medicine). The volume of blood within the
ventricle can thus be calculated by summing the brightness over the area of the
ventricle. Of course, this requires an accurate segmentation of the
ventricle boundary, a problem made difficult by the fact that there is essentially no
contrast in the upper left corner of the ventricle. This occurs because radiation from
other sources behind the ventricle (superior and inferior vena cava, etc.) contributes
to blur out the contrast. Thus, a technique is required which can bridge these rather
large gaps – gaps which are really too large to be bridged using the closing techniques
of Chapter 7.

The snake is a closed boundary which moves, retaining its connectivity as it moves.

Following the active contour philosophy, a contour is first initialized, either by
a user or automatically. The boundary then moves, and moves until many/most of
the contour points align with image edge points. An animation of the contour as it
searches for boundary points reminds the viewer of the movements of a snake, hence
these boundaries are often referred to as "snakes."
Two philosophies are followed in deriving snake algorithms: Energy minimization
and partial differential equations (PDEs).

8.5.1 The energy minimization philosophy


The boundary moves to reduce an energy E = E_I + E_E, where the internal energy
(E_I) characterizes the curve itself, and the external energy (E_E) characterizes the
image at the points where the snake is currently located.
Internal energy measures the degree of bending along the curve, how big the curve
is, etc.
The term with the β coefficient should remind you of something.

The exact form of the internal energy is application-dependent, but the following
form is commonly used:

E_I = \sum_i \alpha \|X_i - X_j\| + \beta \|X_{i-1} - 2X_i + X_{i+1}\|,

where X_i = [x_i y_i]^T is the ith snake point (and X_j a neighboring snake point).
Minimizing the first term produces a curve where the snake points are close together.
The curve which minimizes the second term will have little bending. A negative aspect
of the first term is that it is minimized by a snake which shrinks to a single point.
Because of this, many applications also introduce an "expansion" term which causes
the entire curve to grow larger.
The external energy measures the edginess of the region through which the boundary
passes. Again, there are many functions which may be used for this. Our favorite is

E_E = \sum_i \exp(-|\nabla f(X_i)|).

For two dimensions, the minimization problem is solvable using dynamic program-
ming [9.19]. However, rather than using local edginess at the boundary, one could
use the difference in average contrast between outside and inside [9.55], which is
only meaningful if the outside is relatively homogeneous.
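To make these energies concrete, the sketch below (Python) evaluates E = E_I + E_E for a snake whose points carry integer pixel coordinates, and takes one step of a simple greedy minimizer; this is a stand-in for the dynamic-programming solution cited above, and the names, the parameters, and the 8-neighborhood search are our own choices:

import numpy as np

def snake_energy(X, grad_mag, alpha, beta):
    """E_I + E_E for a closed snake X, an (n, 2) integer array of (row, col)."""
    prev, nxt = np.roll(X, 1, axis=0), np.roll(X, -1, axis=0)
    e_int = (alpha * np.linalg.norm(X - prev, axis=1)
             + beta * np.linalg.norm(prev - 2 * X + nxt, axis=1)).sum()
    g = grad_mag[X[:, 0], X[:, 1]]    # gradient magnitude at the snake points
    return e_int + np.exp(-g).sum()   # external term is small on strong edges

def greedy_step(X, grad_mag, alpha, beta):
    """Move each point to its lowest-energy position in an 8-neighborhood.
    Assumes the snake stays inside the image."""
    for i in range(len(X)):
        best_e, best_p = None, X[i].copy()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                trial = X.copy()
                trial[i] = X[i] + (dx, dy)
                e = snake_energy(trial, grad_mag, alpha, beta)
                if best_e is None or e < best_e:
                    best_e, best_p = e, trial[i]
        X[i] = best_p
    return X

Iterating greedy_step until no point moves yields a local minimum, which is exactly the limitation discussed in the observation below.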
Observation: This problem fits perfectly into the MAP philosophy, and simulated
annealing (SA) can be used [9.78]. However, the search neighborhood is problematic.
That is, as we have discussed, SA is guaranteed to find the global minimum over
a set of states. However, the set of states must be sampled in order for
SA to work. In [9.78], an existing contour is used as a starting point, and at each
199 8.5 Active contours (snakes)

iteration, the only states sampled are contours within one pixel of the current contour,
and from that set a minimum is chosen. The resultant contour is the best contour of
the set sampled, but not necessarily from the entire region of interest.
It is also important that the forms chosen for the energy functions be invariant
to scale, translation, and rotation. One way to accomplish this is to use two snakes,
suitably weighted, one outside the hypothesized boundary and contracting, and one
initiated inside and expanding [9.25].

8.5.2 The PDE philosophy


The movement of a contour C can be described by dC_i/dt = s(X_i)N(X_i), where
N(X_i) is the normal vector at the ith contour point, C_i is a point on the contour, and s
the speed. The notation on the LHS simply describes the motion of the ith boundary
point. The RHS says the boundary moves in the normal direction. "Movement" of
point C_i actually consists of changing the x, y coordinates of that point.
The speed, s, is the product of two functions,

s(x, y) = s_I(x, y) s_E(x, y),   (8.2)

where s_I(x, y) = ±1 − εκ(x, y) and s_E(x, y) = 1/(1 + g(x, y)); g(x, y) is a mea-
sure of the "edginess" in the image at point x, y, and κ(x, y) measures the curvature
of the contour at x, y.
Manhaeghe et al. obtain a snake-like result using Kohonen maps [9.47]; see
also [9.88]. One advantage of this approach is that the computations are local. One
can simply look at a point on the present boundary, and consider where that point
could be moved. Choose one candidate location and determine if moving to that
point increases or decreases the energy (if you are using the energy minimization
method).
Considering only the movement of boundary points, however, introduces some
problems. The first is the difficulty in accurately computing the curvature from the
boundary points. As you know, any derivative-based operator is super-sensitive to
noise. Since the curvature involves a second derivative, it is even worse. Another
problem is that there is no really effective way to allow for the possibility that the
boundary might divide into separate components. The following level-set approach
resolves those difficulties.
Remember the distance transform? From Chapter 7, the distance transform re-
sulted in a function DT (x, y) which was equal to zero on boundary pixels and got
larger as one went away from the boundary. Now consider a new version of the
distance transform, which is exactly the same OUTSIDE the contour of interest.
(Remember, the contour is closed, and so the concepts of INSIDE and OUTSIDE
make sense.) Inside the contour, this new function (which we will refer to as the

metric function) is the negative of the distance transform:

\psi(x, y) = \begin{cases} DT(x, y) & \text{if } (x, y) \text{ is outside the contour} \\ -DT(x, y) & \text{if } (x, y) \text{ is inside the contour.} \end{cases}   (8.3)

This is the key point to understanding level sets: You are modifying EVERY point,
outside and inside the contour.

It is important to note that for points on the contour, the metric function is equal to
zero. The set of points where ψ(x, y) = C is called the C-level set of ψ, and we are
particularly interested in the zero-level set.
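A minimal sketch of this initialization in Python, assuming SciPy is available and the (closed) contour is represented by the boolean image of its interior; the helper name is ours:

import numpy as np
from scipy import ndimage

def signed_metric(inside):
    """psi of Eq. (8.3): +DT outside the contour, -DT inside.
    `inside` is a boolean image, True for pixels inside the contour."""
    d_out = ndimage.distance_transform_edt(~inside)  # distance to the interior
    d_in = ndimage.distance_transform_edt(inside)    # distance to the exterior
    return np.where(inside, -d_in, d_out)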
Now we will modify the metric function ψ. For every point (x, y) we compute a
new value of ψ(x, y). There are several ways we could modify those points, and we
will mention some of them below; but remember, the contour of interest is still the set
of points where the (modified) metric function takes on a value of zero. We initialized
it to be the distance transform, but from here on, you should no longer think of it as
a distance transform (although it will retain some of those characteristics). Instead,
just think of it as another "brightness," a function of x and y.
What is the gradient of the metric function? You knew how to compute the gradient
when it was brightness. It is no different. Thinking about a level set as an isophote,
you knew that the gradient is normal to the isophote, so, given the gradient vector,
how did you get the normal? Do you remember how to calculate the gradient?

G(x, y) = \left[\frac{\partial}{\partial x}\psi(x, y),\; \frac{\partial}{\partial y}\psi(x, y)\right]^T,   (8.4)

then the normal is just the normalized (naturally) version of the gradient:

n(x, y) = \frac{G(x, y)}{|G(x, y)|}.   (8.5)
So we can relate the normal to the contour to the gradient of the metric function.
We can also relate [9.46] the movement of the contour in the normal direction to
a function (which is called the speed function), describing how rapidly the metric
changes, producing a differential form for the change in ψ:

\frac{\partial}{\partial t}\psi(x, y) = s(\kappa)\,|\nabla\psi(x, y)|,   (8.6)

where s is a problem-dependent speed function, with curvature and image edginess
as parameters, similar to that defined in Eq. (8.2).
We could modify the metric function in a variety of ways. For example, we could
use a form which looks like a gradient descent:

\psi^{n+1}(x, y) = \psi^n(x, y) - \alpha s(x, y)|\nabla\psi(x, y)|.   (8.7)

Or a form which looks like a differential equation:

\frac{\psi^{n+1}(x, y) - \psi^n(x, y)}{\Delta t} = s(x, y)|\nabla\psi(x, y)|,   (8.8)

where s involves something about the brightness variations in the image and also
involves the curvature (in 2D) of the isophote at x, y. (Now precisely what is the
difference between equations (8.7) and (8.8)?) Of course, if you insist on using
the 2D curvature of the zero level set, you need to relate that to the function ψ,
which fortunately is not too hard. Since the normal vector has already been worked
out, and the curvature can be related to changes in normal direction, it is possible to
show that

\kappa = \frac{\psi_{xx}\psi_y^2 - 2\psi_x\psi_y\psi_{xy} + \psi_{yy}\psi_x^2}{\left(\psi_x^2 + \psi_y^2\right)^{3/2}},   (8.9)

where the functional notation has been dropped. (To be consistent with the literature,
and so it will fit on one line, we are using the subscript notation for partial derivatives
here.)
Over the course of the algorithm, the metric function evolves following a rule like
Eq. (8.7). As it evolves, it will have zero values at different points, and those points
will define the evolution of the contour.
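The Python sketch below (central differences; helper names ours) computes κ from ψ as in Eq. (8.9) and takes one update of the form of Eq. (8.7), with the speed assembled as in Eq. (8.2) using the +1 branch of s_I and the edginess image g:

import numpy as np

def curvature(psi):
    """Level-set curvature of psi (Eq. (8.9)) via central differences."""
    py, px = np.gradient(psi)        # partials along rows (y) and columns (x)
    pyy, _ = np.gradient(py)
    pxy, pxx = np.gradient(px)
    num = pxx * py**2 - 2.0 * px * py * pxy + pyy * px**2
    den = (px**2 + py**2) ** 1.5 + 1e-12    # guard against division by zero
    return num / den

def level_set_step(psi, g, alpha=0.2, eps=0.25):
    """One gradient-descent update of Eq. (8.7); g is the edginess image."""
    s = (1.0 - eps * curvature(psi)) / (1.0 + g)   # s_I * s_E, as in Eq. (8.2)
    gy, gx = np.gradient(psi)
    return psi - alpha * s * np.hypot(gx, gy)      # |grad psi| steps the front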
One interesting detail which one must consider when implementing an algorithm
like this is the possibility that the contour might cross itself. For example, consider
the contour segment illustrated in Fig. 8.16. The current contour contains a sharp
concavity. Some typical normal vectors are illustrated. A unit step in the direction of
the normal at the lowest point would place the new contour point inside the contour
(Fig. 8.16: a contour with a very sharp crease in it; a unit step in the normal direction
near the crease moves the new contour inside the old). One approach to dealing with
this is a simple heuristic which states that points which were labeled inside can never
again be considered as outside.
The idea of using level sets for adaptive contours was first proposed by Sethian
[9.66, 9.67]. Malladi et al. [9.46] extended this by observing that there are advantages
to considering only a set of points near the current contour. Taubin and Ronfard [9.84]
to considering only a set of points near the current contour. Taubin and Ronfard [9.84]
use the concept of a level set implicitly in fitting piecewise-linear curves. Kimmel
et al. [9.40] demonstrate that level sets may be used for other things, such as finding
the shortest path on a surface.
Not all algorithms which use a deformable contour philosophy follow the strate-
gies described in section 8.5. For example, Lai and Chin [8.33] describe a variation
which treats the contour points as a sequence of random variables, which may there-
fore be described by a Markov process, and optimized using MAP strategies. Space
does not permit a discussion of those algorithms here, but the reader may find ade-
quate direction in the sources cited in the bibliography at the end of this chapter.

8.6 Segmentation of surfaces

In range images, we have (typically) numerous surfaces. There are generally two
philosophies which can be followed. First, one may simply seek surfaces which
do not bend too quickly. Following that philosophy produces algorithms which
seek smooth solutions, and segment regions along lines of high surface curvature.
One example of this philosophy was discussed in Chapter 6 where we describe an

algorithm which removes noise while seeking the best piecewise-linear fit to the data
points. Such a fit is equivalent to fitting a surface with a set of planes. Points where
planes meet produce either “roof” edges or “step” edges, depending on viewpoint.
If an annealing algorithm such as MFA is used, good segmentations of more general
surfaces can be produced simply by not running the algorithm all the way to a truly
planar solution [8.6]. A second philosophical approach to range image segmentation
is to assume some equation for the surface, e.g., a quadric (a general second-order
surface, defined in section 8.6.1). Then, all points which satisfy that equation and
which are adjacent belong to the same surface. This philosophy mixes the problems
of segmentation and fitting, for we do not know which points to use to estimate the
parameters of the surface until we have some sort of segmentation [8.7, 8.59]. In the
next section, we look at these two philosophies in a bit more detail.

8.6.1 Describing surfaces


Implicit and explicit forms of equations were defined in Chapter 4.

In general, we must fit a surface to the data. Taubin et al. [8.68] have looked carefully
at the question of fitting surfaces to data and observe first that polynomial surfaces
are very attractive, but such a polynomial should be of even order. The implicit
form is clearly much more attractive, but is much more difficult to fit. Consider for
example the second-order forms mentioned in Chapter 4. An explicit representation
might be

z = ax^2 + by^2 + cxy + dx + ey + f   (8.10)
and an implicit form is

ax^2 + by^2 + cz^2 + dxy + exz + fyz + gx + hy + iz + j = 0.   (8.11)
The expression of Eq. (8.11) is called a quadric, and it is a general form which de-
scribes all second-order surfaces (cones, spheres, planes, ellipsoids, etc.). In Chapter
5 you learned how to fit an explicit function to data, by minimizing the squared error.
Unfortunately, the explicit form does not allow the possibility of higher order terms
in z. You could solve Eq. (8.11) for z using the quadratic formula, and then have an
explicit form, but now you have a square root on the right-hand side, and lose the
ability to use linear methods for solving for the vector of coefficients.
We can use the implicit form by first defining f(x, y, z) ≡ ax^2 + by^2 + cz^2 +
dxy + exz + fyz + gx + hy + iz + j and making the following observation: If the
point [x_i, y_i, z_i]^T is on the surface described by the parameter vector [a, b, c, d, e,
f, g, h, i, j]^T, then f(x_i, y_i, z_i) should be exactly zero. We define a level set of a
function as the collection of points [x_i, y_i, z_i]^T such that f(x_i, y_i, z_i) = L for some
scalar constant L. Thus, we can find the coefficients by minimizing
E = \sum_i (f(x_i, y_i, z_i))^2, where the summand is also known as the algebraic
distance from the point (x_i, y_i, z_i) to the surface. In some

cases, this gives good results, but it is not really what we want; we really should be
minimizing \sum_i d([x_i, y_i, z_i], f(x, y, z)), where d is some distance metric, for
example the Euclidean distance from the point to the surface (this is known as the
geometric distance [8.66] to the surface). Again, this turns out to be algebraically
intractable. (To implement this see [8.67] and [17.37] for important details.) Although
methods based on the algebraic distance work relatively well most of the time, they
can certainly fail. Whatever distance measure we use, it should have the following
properties [17.37]: (1) The measure should be zero whenever the true (Euclidean,
geometric) distance is zero (the algebraic distance does this); and (2) at the sample
points, the derivatives with respect to the parameters should be the same for the true
distance and the measure.
Of course, whatever representation you choose to use (polynomials are popular),
there is always a desire for a representation that is invariant to affine transforms
[8.27].
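To make the algebraic-distance fit concrete: minimizing E over the ten quadric coefficients becomes an eigenvector problem once the coefficient vector is normalized. The unit-norm constraint below, which rules out the trivial all-zero solution, is a normalization choice of ours rather than something prescribed above; the Python sketch is an illustration only:

import numpy as np

def fit_quadric(pts):
    """Fit Eq. (8.11) to an (n, 3) array of points by minimizing the sum of
    squared algebraic distances, subject to unit norm of the coefficients."""
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    # one row of monomials per point, so that f(x, y, z) = D @ c
    D = np.column_stack([x*x, y*y, z*z, x*y, x*z, y*z,
                         x, y, z, np.ones_like(x)])
    # min ||D c||^2 with ||c|| = 1 is the eigenvector of D^T D belonging
    # to the smallest eigenvalue
    _, V = np.linalg.eigh(D.T @ D)
    return V[:, 0]                     # [a, b, c, d, e, f, g, h, i, j]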

8.6.2 Fitting ellipses and ellipsoids


Although this section discusses fitting surfaces, we have introduced the concept of
algebraic distance, and this seems to be an appropriate place to talk about the simpler
case of fitting ellipses to curve data, as well as the three-dimensional extension to
fitting ellipsoids. An ellipse is described by the general equation for a conic section:

ax^2 + bxy + cy^2 + dx + ey + f = 0.   (8.12)

This implicit form describes not only ellipses, but lines, hyperbolae, parabolae, and
circles. To guarantee that the resulting curve is an ellipse we must also ensure that
it satisfies

b^2 − 4ac < 0.   (8.13)

Satisfying this constraint produces an optimization problem which is nonlinear. If,
instead, we find the coefficients a–f which minimize

\sum_i \left(ax_i^2 + bx_i y_i + cy_i^2 + dx_i + ey_i + f\right)^2,   (8.14)

we get a solution which tends to fit areas of low curvature with hyperbolic arcs
rather than with ellipses. Similar difficulties occur when attempting to fit ellipsoids
to range data. See Wang et al. [8.74], Rosen and West [8.50], and Fitzgibbon et al.
[8.19] for more details.
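The unconstrained minimization of Eq. (8.14) takes the same eigenvector form. As the text warns, nothing below enforces Eq. (8.13), so the returned conic may well be a hyperbola; the constrained, ellipse-specific solution is the subject of Fitzgibbon et al. [8.19]. Names and the unit-norm normalization are again our own:

import numpy as np

def fit_conic(x, y):
    """Fit Eq. (8.12) to point arrays x, y by minimizing Eq. (8.14), ||c|| = 1."""
    D = np.column_stack([x*x, x*y, y*y, x, y, np.ones_like(x)])
    _, V = np.linalg.eigh(D.T @ D)
    a, b, c, d, e, f = V[:, 0]
    ellipse = (b*b - 4.0*a*c) < 0      # check Eq. (8.13) after the fact
    return (a, b, c, d, e, f), ellipse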
In performing such fits, it is important to know when a point is simply an “outlier,”
that is, it has been corrupted by substantial noise, but actually belongs to the surface
under consideration, or when it really belongs to another, possibly occluding, surface.
Darrell and Pentland [13.9] examined this question in some detail and demonstrated

that “M-estimates” lead to excellent segmentations. Cabrera and Meer [8.11] remove
the bias from fits of an ellipse using an iterative algorithm called “bootstrapping.”
How you fit a function to data also depends on the nature of the noise or corruption
to the data. If the noise is additive, zero-mean Gaussian (which is what we almost
always assume) then the minimum vertical distance (MMSE) or the minimal normal
distance (which we called eigenvector line fitting) methods work well. If the noise
is not Gaussian, then other methods are more appropriate. For example, nuclear
medicine images are corrupted primarily by counting (Poisson) noise. Such a noise
differs from Gaussian in two important ways – it is never negative, and it is signal
dependent. Well away from zero, Poisson noise may be reasonably modeled by
additive Gaussian with a variance equal to the signal. Other sensors produce other
types of noise. Stewart [8.64] considers the case of inliers and outliers, but assumes
that the bad data are randomly distributed over the dynamic range of the sensor. That
is, the noise is not additive.
Given one segmentation, should you merge two adjacent regions? If they are
adjacent and satisfy the same equation to within some noise measure, they should
be merged [8.8, 8.29, 8.34, 8.56]. Other relevant papers on fitting surfaces include
[8.4, 8.75, 8.78].
One is also faced with the issue of what surface measurements to use as a basis
for segmentation. Curvature would appear to be particularly attractive since the
measurement curvature is invariant to viewpoint. However, “curvature estimates are
very sensitive to quantization noise” [8.71].

8.7 Evaluating the quality of a segmentation

As you have concluded by now, there are many algorithms and variations on algo-
rithms for segmentation. But which is the best? Who knows? We need an algorithm
to evaluate the quality of segmentations. But which is the best such evaluation algo-
rithm? We need an algorithm which evaluates the quality of evaluation algorithms
which ... (help!).
There are several approaches to evaluating segmentation quality. Since one
result of a segmentation is edges, you could indirectly infer segmentation quality
by measuring edge positions. Pratt [5.33] provides one such algorithm.
Bilbro and Snyder [8.6] first remove noise, then fit the resulting surface. They only
consider the quality of the noise removal, which can be tested very simply: Subtract
the segmented, cleaned image from the original. What you SHOULD see is just the
noise. If your noise-removal algorithm produces an image containing features, then
it removed something other than noise.
It is really difficult to evaluate the quality of segmentation of a brightness image,
since different human observers will come to different conclusions as to what the

correct answer is. With range images, however, it is somewhat easier to determine
“truth” since surfaces can physically be measured.
Hoover et al. [8.22] propose the following formalism for comparing the quality of a
machine-segmented (MS) image using a human-segmented ground-truth (GT) image
as the “gold standard.” Let M and G denote the MS and GT images respectively; let
M_i (i = 1, ..., m) denote a region in M; and let G_j (j = 1, ..., n) denote a region
in G. |R| will denote the number of pixels in region R. Let O_ij be the number of
pixels which belong to both region i in the MS image and region j in the GT image.
Finally, let T be a threshold, 0.5 < T ≤ 1.0.
There are five different segmentation results, defined as follows.

(1) A correct classification occurs when O_ij ≥ T|M_i| and O_ij ≥ T|G_j|.


(2) An oversegmentation occurs when a region in the GT image is broken into several
    regions in the MS image. Formally, given one region in GT, i, and several regions
    in MS, j_1, j_2, ..., j_n, an oversegmentation occurs if
    (a) at least 100T percent of the pixels in each region of MS actually belong to
        region i (O_{i,j_l} ≥ T|j_l|, ∀l), and
    (b) at least 100T percent of the pixels which actually belong to region i are
        marked as belonging to the union of regions j_1, j_2, ..., j_n
        (\sum_{l=1}^{n} O_{i,j_l} ≥ T|i|).
(3) Undersegmentation occurs when pixels in distinct regions in the GT image are
identified as belonging to the same region in the MS image. The definition is
isomorphic to the definition of oversegmentation with the two images reversed.
(4) A missed classification occurs when a region in the GT image is neither correctly
segmented, nor is part of an oversegmentation or an undersegmentation.
(5) A noise classification is identical to a missed classification except that the region
belongs to the MS image.

Two correct segmentations can be further compared in the case of range imagery
by computing the normal vectors to the GT region and the corresponding MS region,
and finding the absolute value of the angle between these two vectors.
With these definitions, we can evaluate the quality of a segmentation by counting
correct or erroneous segmentations, and measuring total angle error. By plotting these
measures vs. T and comparing these plots, a measure of segmenter performance may
be determined.
Hoover et al. [8.22] use this approach to do a thorough evaluation of the quality
of four different range image segmentation algorithms.
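A partial Python sketch of these overlap tests, covering only the count of correct classifications (over- and undersegmentation follow the same pattern); label images use 0 for unlabeled pixels, and the names are ours:

import numpy as np

def count_correct(ms, gt, T=0.8):
    """Hoover-style correct classifications between a machine-segmented (ms)
    and a ground-truth (gt) label image, for a threshold 0.5 < T <= 1.0."""
    correct = 0
    for j in np.unique(gt[gt > 0]):           # each ground-truth region
        G = gt == j
        for i in np.unique(ms[G]):            # MS regions overlapping it
            if i == 0:
                continue
            M = ms == i
            O = np.logical_and(M, G).sum()    # O_ij, the pixel overlap
            if O >= T * M.sum() and O >= T * G.sum():
                correct += 1
    return correct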

8.8 Conclusion

In this chapter, we used the concepts of consistency to identify the components of a
region. In the first example we studied, if all pixels were the same brightness, they
were defined to be in the same region. In the example of section 8.6.1, all the points
satisfying the same surface equation were defined to fall in the same region.
In section 8.2.1 we used an optimization method, minimum squared error, to find
the best threshold. In section 8.5, we obtained a closed boundary using the philos-
ophy of active contours, by specifying a problem-specific objective function and
finding the boundary which minimizes that function. Any appropriate minimization
technique could be used. In section 8A.5, we will once again see function minimiza-
tion (based on gradient descent with annealing) used in a maximum a posteriori
method which will find the picture which minimizes a particular objective function,
in this case resulting in a segmentation.

8.9 Vocabulary

You should know the meanings of the following terms.

Active contour
Algebraic distance
Connected component
Explicit form
Geometric distance
Histogram
Homogeneous
Implicit form
Label image
Normal direction
Oversegmentation
Quadric
Region growing
Salient point
Segmentation
Snake
Speed of a curve
Thresholding
Undersegmentation

Assignment 8.1
Section 2.3.5 of the text by Haralick and Shapiro
[4.18] describes a labeling algorithm similar in
some ways to that presented in this section. Write a
report which compares and contrasts these two tech-
niques. Consider: (1) ease of implementation on a
uniprocessor (simplicity of code); (2) speed on a
uniprocessor; and (3) potential for parallel imple-
mentation.

Assignment 8.2
In the process of a connected component labeling
scheme, we find at one point that the lookup table ap-
pears as shown in Table 8.2.

Table 8.2. Lookup table.

Location   0   1   2   3   4   5   6   7   8   9
Contents   0   0   2   2   4   4   2   4   2   2
           _   _   _   _   _   _   _   _   _   _

Now, we discover the following equivalence: 9 = 7. On
the blank row in Table 8.2, show the contents of the
lookup table after resolving the equivalence.

Topic 8A Segmentation

In this chapter, we have up to this point considered partitioning of images into areas which
differ in some way in their brightness or range. The algorithms we have presented are ap-
plicable, however, to any feature which characterizes a pixel or area around a pixel. For this
reason, we mention here some other measures which may be used, including texture, color,
and motion. In this section, we also mention other approaches to segmentation, including
segmentation based on edges.

8A.1 Texture segmentation

In section 4A.2.2, texture is discussed. If one is able to quantify the concept of texture – to
assign two different numbers to two different textures in order to distinguish them – then
texture can be used as a feature in a segmentation algorithm. Instead of defining ADJACENT
as similar brightnesses, one simply defines them as having similar textures. One could also
combine color and texture [8.46, 8.65].
Several researchers have observed that there are really two fundamentally different types
of textures, those that are in some sense “deterministic,” and those that are in some sense
“random,” and which can in principle be modeled by Markov random fields [8.2]. Liu and
Picard [8.38] and others [8.20, 8.21, 8.58] observe that the peaks in the Fourier transform
give hints to how separate these texture characteristics are.

Table 8.3. Halvings which occur to compute the fractal dimension.

n      0     1      2      ...    n
ε      1     1/2    1/4    ...    1/2^n
M_ε    1     2      4      ...    2^n

Table 8.4. Fractal dimension.

n      0     1      2      ...    n
ε      1     1/2    1/4    ...    1/2^n
M_ε    1     4      16     ...    2^{2n}

8A.1.1 Fractal dimension

The fractal dimension measures the self-similarity of a shape to itself when measured on
a different scale. In that way it provides a measure of the spatial distribution of the points
within the foreground (which presumably are the object of interest). The utility of the fractal
dimension may be seen by considering a set S of points in a 2-space and defining the fractal
dimension of S as
\dim(S) = \lim_{\varepsilon \to 0} \frac{\log M_\varepsilon}{\log(1/\varepsilon)}   (8.15)

where Mε is the minimum number of ε × ε boxes required to cover S. Let’s look at the fractal
dimension of some example objects. We begin with a single point. Obviously, a point may
be covered by precisely one square box, independent of the size of the box. Therefore, in Eq.
(8.15), Mε is always equal to one, and the limit of the denominator is infinity. Therefore we
find the fractal dimension of a point is simply zero. Now let us consider a straight line of unit
length. Clearly, such a line may be covered by a 1 × 1 box. However, it may also be covered
by two boxes, each 1/2 × 1/2, or four boxes, each 1/4 × 1/4. Each time the size of the box
is halved, the number of boxes required doubles. We could tabulate this process in terms of
a parameter n, equal to the number of halvings that have been done (Table 8.3).
From Table 8.3, we may evaluate dim(S) from

\dim(S) = \lim_{n \to \infty} \frac{\log 2^n}{\log\left((1/(1/2))^n\right)} = 1.   (8.16)

Finally, let S be a square area, and without loss of generality, let its sides be of length one.
We could cover this with a single 1 × 1 box, or by four 1/2 × 1/2 boxes, or by 16 1/4 × 1/4
boxes, etc. An argument similar to the one for the straight line results in Table 8.4.
From Table 8.4, the fractal dimension is

\dim(S) = \lim_{n \to \infty} \frac{\log 2^{2n}}{\log\left((1/(1/2))^n\right)} = \lim_{n \to \infty} \frac{\log 2^{2n}}{\log 2^n} = \lim_{n \to \infty} \frac{2n}{n} = 2.   (8.17)

Fig. 8.17. The figure on the left has a fractal dimension of 2.0, the figure on the right has a fractal
dimension of 1.58. In the right-hand figure, the covering 2×2 squares are shown with
dark lines.

So, we end up with an intuitively appealing result: A point is a zero-dimensional thing, a line
is one dimensional, and a square is two dimensional. At least, the result agrees with intuition
for these very simple shapes.
Given an image, we must give some thought to how to apply Eq. (8.15) to the discrete
pixel domain, since we obviously cannot take a limiting box size less than ε = 1. Solving
Eq. (8.15) for M_ε, we find for small values of ε

\log M_\varepsilon = \log\left(\left(\frac{1}{\varepsilon}\right)^{\dim(S)}\right),   (8.18)

which simplifies to

\log M_\varepsilon = -(\log \varepsilon)\dim(S).   (8.19)

Thus, for small ε, the log of Mε is linearly related to the log of ε, with the slope being
the dimension of S. We may therefore estimate the dimension of S by considering the two
smallest values we have for ε, 1 and 2, and computing that slope, using

\dim(S) \approx -\frac{\log M_1 - \log M_2}{\log 1 - \log 2} = \frac{\log M_1 - \log M_2}{\log 2},   (8.20)

producing a simple algorithm: Find M1 , the number of 1 × 1 boxes required to cover the
object; this is simply the area of the object measured in pixels. Find M2 , the number of 2 × 2
boxes required to cover the same object, and take the difference of the logs.
Consider the following example. The image on the left of Fig. 8.17 has a foreground
region which has an area of 36, and which can be covered by nine 2 × 2 squares. Thus,
its fractal dimension is (log 36 − log 9)/log 2 = 2. The figure on the right has the same
area, but requires twelve 2 × 2 squares to cover it, and therefore has a fractal dimension of
(log 36 − log 12)/log 2 = 1.58.
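The two-scale estimate of Eq. (8.20) is only a few lines of Python. Note that we align the 2 × 2 boxes to even pixel coordinates, which approximates, rather than exactly minimizes, M2; the function name is ours:

import numpy as np

def fractal_dimension(img):
    """Estimate dim(S) of a binary image from Eq. (8.20)."""
    M1 = int(img.sum())                 # 1 x 1 boxes = foreground area
    r, c = img.shape
    p = np.zeros((r + r % 2, c + c % 2), dtype=bool)
    p[:r, :c] = img.astype(bool)        # pad to even size
    blocks = p.reshape(p.shape[0] // 2, 2, p.shape[1] // 2, 2)
    M2 = int(blocks.any(axis=(1, 3)).sum())   # aligned 2 x 2 covering boxes
    return (np.log(M1) - np.log(M2)) / np.log(2)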
The concept of fractal dimension can be extended to gray-scale images, and gray-scale
image features extracted using this measure [8.12].

8A.2 Segmentation of images using edges

One way to segment is to use a connected components algorithm, but to assume there are
certain points which cannot be connected to anything. These might be, for example, edge
points [8.75].
The usual problem with using edges for segmentation is that edge detectors, as you well
know, are both prone to producing extra edges and to losing portions of edges, producing
gaps. Jacobs [8.24] has an interesting approach to this problem. He first defines an “acceptable
region” in terms of its edges by requiring that the region be convex and that the set of edges
bounding the region be mostly real (i.e., measured) rather than inferred. That is, between any set
of points, we could hypothesize that edges exist. However, if a particular set of points is
really connected, with only a few gaps, and those gaps are small in totality compared to the
perimeter of the region, we might believe such a region is real, that is, a salient group. Jacobs
comes up with some very clever heuristics to reduce this combinatorial search of possible
regions into something fairly manageable.
The Hough transform, described in Chapter 11, will provide another means for identifying
segments of edges, even if some parts are missing.

8A.3 Motion segmentation

If one could identify all the connected pixels which have the same motion characteristics, one
could apply the connected component methods we have already discussed. Although some
papers emphasize motion segmentation – Patras et al. [8.47] use watershed methods and a
multistage segmentation method – it is critical to any motion segmentation algorithm that it
be based on an effective method for detecting and representing motion.
The problem of characterizing image motion has been the object of a great deal of research,
and is discussed in much more detail in section 9A.3.

8A.4 Color segmentation

Just as texture variation can produce segmentations, so can color variations. Variations on
color segmentation include posing an optimization problem and minimizing an objective
function. The image which produces the minimum of the objective function is then the
segmentation. Liu and Yang [8.36] use simulated annealing to find a good color-based seg-
mentation. Clustering, which is described briefly in Chapter 15, also gives an approach
[8.72].

8A.5 Segmentation using MAP methods

In 1991, Snyder et al. [8.63] made the observation that one could convert a restoration problem
into a classification problem in a very straightforward way if one had knowledge of the mean
and, optionally, the variance, of the brightness of each class [8.39, 8.63]. In 2000, the same
conclusion was reached using variational methods [8.53]. In order to “classify” an image
surface into different classes, that is, different segments, we can use the same MAP methods
as we did in solving the image restoration problems.

In order to make use of the MAP methods, all one need do is modify the prior term to
include the expected brightness. For example, suppose we have prior knowledge that an image
should be smooth except for step discontinuities, and that, in addition, each pixel is allowed
to have only one of three brightnesses (e.g., CSF, white matter, and gray matter). The following
prior term is maximal when the brightness at pixel i is identical to that of its neighbors or when
it has the value k_1, the value k_2, or the value k_3:

H(f) = \sum_i \sum_{j \in \aleph_i} \exp\left(-(f_i - f_j)^2 (f_i - k_1)^2 (f_i - k_2)^2 (f_i - k_3)^2\right),   (8.21)

where ℵ_i denotes the neighborhood of pixel i.
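This prior is straightforward to evaluate. The Python sketch below uses 4-neighborhoods and, for brevity, lets np.roll wrap around the image borders; the function name and the placeholder class means are ours:

import numpy as np

def prior(f, k1=1.0, k2=2.0, k3=3.0):
    """H(f) of Eq. (8.21): large where f is locally smooth or where a pixel
    sits at one of the allowed brightnesses k1, k2, k3."""
    h = 0.0
    for shift, axis in ((1, 0), (-1, 0), (1, 1), (-1, 1)):   # 4-neighbors
        fj = np.roll(f, shift, axis=axis)   # wraps at borders (simplification)
        h += np.exp(-(f - fj)**2 * (f - k1)**2
                    * (f - k2)**2 * (f - k3)**2).sum()
    return h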

For further reading, Zhu and Yuille [8.80] show that several similar techniques may be
combined. Further, they illustrate relationships in a variety of image representations.

8A.6 Human segmentation

One of the aspects of segmentation which, regrettably, we do not have time to address in this
book is the issue of how humans do segmentation. That is, what is the “correct” segmentation?
For example, Koenderink and van Doorn [8.28] observed that humans tend to perceive the
projection of three-dimensional objects as collections of ellipses. See [8.57] for an overview.

Bibliography

[8.1] R. Adams and L. Bischof, “Seeded Region Growing,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(6), 1994.
[8.2] P. Andrey and P. Tarroux, “Unsupervised Segmentation of Markov Random Field
Modeled Textured Images using Selectionist Relaxation,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 20(3), 1998.
[8.3] K.E. Batcher, “STARAN Parallel Processor System Hardware,” Proceedings of AFIPS
National Computer Conference, vol 43, pp. 405–410, 1974.
[8.4] J. Berkmann and T. Caelli, “Computation of Surface Geometry and Segmentation
using Covariance Techniques,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(11), 1994.
[8.5] S. Beucher, “Watersheds of Functions and Picture Segmentation,” IEEE International
Conference on Acoustics, Speech and Signal Processing, Paris, May, 1982.
[8.6] G. Bilbro and W. Snyder, “Range Image Restoration using Mean Field Annealing,”
In Advances in Neural Network Information Processing Systems, San Mateo, CA,
Morgan-Kaufmann, 1989.
[8.7] G. Bilbro and W. Snyder, “Optimization of Functions with Many Minima,” IEEE
Transactions on Systems, Man, and Cybernetics, 21(4), July/August, 1991.
[8.8] G. Blais and M. Levine, “Registering Multiview Range Data to Create 3D Computer
Objects,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8),
1995.
[8.9] K. Boyer, M. Mirza, and G. Ganguly, “The Robust Sequential Estimator: A Gen-
eral Approach and its Application to Surface Organization in Range Data,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(10), 1994.

[8.10] C.R. Brice and C.L. Fennema, “Scene Analysis using Regions,” Artificial Intelligence,
1, pp. 205–226, Fall, 1970.
[8.11] J. Cabrera and P. Meer, “Unbiased Estimation of Ellipses by Bootstrapping,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(7), 1996.
[8.12] B. Chaudhuri and N. Sarkar, “Texture Segmentation using Fractal Dimension,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(1), 1995.
[8.13] J. Chen, Y. Sato, and S. Tamura, “Orientation Space Filtering for Multiple Orientation
Line Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 22(5), 2000.
[8.14] C. Chow and T. Kaneko, "Automatic Detection of the Left Ventricle from Cine-
angiograms,” Computers and Biomedical Research, 5, pp. 388–410, 1972.
[8.15] T. Davis, “Fast Decomposition of Digital Curves into Polygons using the Haar
Transform,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8),
August, 1999.
[8.16] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, New York, Wiley,
1973.
[8.17] M. Fischler and R. Bolles, "Perceptual Organization and Curve Partitioning," IEEE
Transactions on Pattern Analysis and Machine Intelligence, 8(1), 1986.
[8.18] M. Fischler and H. Wolf, "Locating Perceptually Salient Points on Planar Curves,"
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2), 1994.
[8.19] A. Fitzgibbon, M. Pilu, and R. Fisher, “Direct Least Square Fitting of Ellipses,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5), May,
1999.
[8.20] J. Francos, “Orthogonal Decomposition of 2-D Random Fields and Their Applications
in 2-D Spectral Estimation,” Signal Processing and its Applications, ed. N.K. Bose
and C.R. Rao, Amsterdam, North Holland, 1993.
[8.21] J. Francos, Z. Meiri, and B. Porat, “A Unified Texture Model Based on a 2-D Wold
Like Decomposition,” IEEE Transactions on Signal Processing, 41, pp. 2665–2678,
August, 1993.
[8.22] A. Hoover, G. Jean-Baptiste, X. Jiang, P. Flynn, H. Bunke, D. Goldgof, K. Bowyer,
D. Eggbert, A. Fitzgibbon, and R. Fisher, “An Experimental Comparison of Range
Image Segmentation Algorithms,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(7), 1996.
[8.23] L. Itti, C. Koch, and E. Niebur, “A Model of Saliency-based Visual Attention for Rapid
Scene Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
20(11), 1998.
[8.24] D. Jacobs, “Robust and Efficient Detection of Salient Convex Groups,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 18(1), 1996.
[8.25] K. Kanatani, “Statistical Bias of Conic Fitting and Renormalization”, IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 16(3), 1994.
[8.26] N. Katzir, M. Lindenbaum, and M. Porat, “Curve Segmentation under Partial Oc-
clusion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5),
1994.
[8.27] D. Keren, “Using Symbolic Computation to Find Algebraic Invariants,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 16(11), 1994.

[8.28] J. Koenderink and A. Van Doorn, “The Shape of Smooth Objects and the Way Contours
End,” Perception, 11, pp. 129–137, 1982.
[8.29] K. Köster and M. Spann, “MIR: An Approach to Robust Clustering – Application to
Range Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(5), 2000.
[8.30] L. Krakauer, “Computer Analysis of Visual Properties of Curved Objects,” Project
MAC TR-82, 1971.
[8.31] S. Kulkarni, S. Mitter, T. Richardson, and J. Tsitsiklis, “Local vs. Global Computation
of Length of Digitized Curves,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(7), 1994.
[8.32] S. Kumar, S. Han, D. Goldgof, and K. Bowyer, “On Recovering Hyperquadrics
from Range Data,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(11), 1995.
[8.33] K. Lai and R. Chin, “Deformable Contours: Modeling and Extraction,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 17(11), pp. 1084–1090, 1995.
[8.34] S. LaValle and S. Hutchinson, “A Bayesian Segmentation Methodology for Parametric
Image Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(2), 1995.
[8.35] K. Lee, P. Meer, and R. Park, “Robust Adaptive Segmentation of Range Images,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(2), 1998.
[8.36] J. Liu and Y. Yang, “Multiresolution Color Image Segmentation,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(7), 1994.
[8.37] X. Liu and R. Ehrich, “Subpixel Edge Location in Binary Images using Dithering,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(6), 1995.
[8.38] F. Liu and R. Picard, “Periodicity, Directionality, and Randomness: Wold Features for
Image Modeling and Retrieval,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(7), 1996.
[8.39] A. Logenthiran, W. Snyder, and P. Santago, “MAP Segmentation of Magnetic Reso-
nance Images using Mean Field Annealing,” SPIE Symposium on Electronic Imaging,
Science and Technology, February, 1991.
[8.40] A. Matheny and D. Goldgof, “The Use of Three- and Four-dimensional Surface Har-
monics for Rigid and Nonrigid Shape Recovery and Representation,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 17(10), 1995.
[8.41] D.L. Milgram, A. Rosenfeld, T. Willett, and G. Tisdale, “Algorithms and Hardware
Technology for Image Recognition,” Final Report to U.S. Army Night Vision Lab,
March 31, 1978.
[8.42] F. Mokhtarian and A. Mackworth, “Scale-based Description and Recognition of Planar
Curves and Two-dimensional Shapes,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 8(1), 1986.
[8.43] K. Mori, M. Kidode, H. Shinoda, and H. Asada, "Design of Local Parallel Pattern
Processor for Image Processing," Proceedings AFIPS National Computer Conference,
vol. 47, pp. 1025–1031, June, 1978.
[8.44] W. Ng and C. Lee, “Comment on Using the Uniformity Measure for Performance Mea-
sure in Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(9), 1996.
[8.45] B. Parhami, "Associative Memories and Processors: An Overview and Selected Bib-
liography," Proceedings of the IEEE, 61, pp. 722–730, June, 1973.
[8.46] D. Panjwani and G. Healey, “Markov Random Field Models for Unsupervised Seg-
mentation of Textured Color Images,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 17(10), 1995.
[8.47] I. Patras, E. Hendriks, and R. Lagendijk, “Video Segmentation by MAP Labeling of
Watershed Segments,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 23(3), 2001.
[8.48] A. Pikaz and I. Dinstein, “Using Simple Decomposition for Smoothing and Feature
Point Detection of Noisy Digital Curves,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(8), 1994.
[8.49] E. Rivlin and I. Weiss, “Local Invariants for Recognition,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(3), 1995.
[8.50] P. Rosen and G. West, “Nonparametric Segmentation of Curves into Various Repre-
sentations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(12),
1995.
[8.51] A. Rosenfeld and A. Kak, Digital Picture Processing, 2nd edition, New York,
Academic Press, 1982.
[8.52] G. Roth and M. Levine, “Geometric Primitive Extraction using a Genetic Algorithm,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9), 1994.
[8.53] C. Samson, L. Blanc-Féraud, G. Aubert, and J. Zerubia, "A Variational Model for
Image Classification and Restoration," IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(5), 2000.
[8.54] G. Sapiro and A. Tannenbaum, “Area and Length Preserving Geometric Invariant
Scale Spaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(1), 1995.
[8.55] H. Sheu and W. Hu, “Multiprimitive Segmentation of Planar Curves – a Two-level
Breakpoint Classification and Tuning Approach,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 21(8), 1999.
[8.56] H. Shum, K. Ikeuchi, and R. Reddy, "Principal Component Analysis with Missing
Data and Its Application to Polyhedral Object Modeling," IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(9), 1995.
[8.57] K. Siddiqi and B. Kimia, “Parts of Visual Form: Computational Aspects,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 17(3), 1995.
[8.58] R. Sriram, J. Francos, and W. Pearlman, “Texture Coding Using a Wold Decomposition
Model,” Proc. International Conference on Pattern Recognition, Jerusalem, October,
1994.
[8.59] W. Snyder and G. Bilbro, “Segmentation of Range Images,” International Conference
on Robotics and Automation, St. Louis, March, 1985.
[8.60] W. Snyder and A. Cowart, “An Iterative Approach to Region Growing,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 5(3), 1983.
[8.61] W.E. Snyder and C.D. Savage, “Content-Addressable Read-Write Memories for Image
Analysis,” IEEE Transactions on Computers, 31(10), pp. 963–967, 1982.
[8.62] W. Snyder, G. Bilbro, A. Logenthiran, and S. Rajala, “Optimal Thresholding, A New
Approach,” Pattern Recognition Letters, 11(11), 1990.
[8.63] W. Snyder, P. Santago, A. Logenthiran, K. Link, G. Bilbro, and S. Rajala, "Segmentation of Magnetic Resonance Images using Mean Field Annealing," XII International
Conference on Information Processing in Medical Imaging, Kent, England, July 7–11,
1991.
[8.64] C. Stewart, "MINPRAN: A New Robust Estimator for Computer Vision," IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(10), 1995.
[8.65] P. Suen and G. Healey, “The Analysis and Recognition of Real-world Textures in Three
Dimensions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(5),
2000.
[8.66] S. Sullivan, L. Sandford, and J. Ponce, “Using Geometric Distance Fits for 3-D Object
Modeling and Recognition,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(12), 1994.
[8.67] G. Taubin, “Nonplanar Curve and Surface Estimation in 3-space,” IEEE Robotics and
Automation Conference, Philadelphia, May, 1988.
[8.68] G. Taubin, F. Cukierman, S. Sullivan, J. Ponce, and D. Kriegman, “Parameterized
Families of Polynomials for Bounded Algebraic Curve and Surface Fitting,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(3), 1994.
[8.69] K.J. Thurber, Large-Scale Computer Architecture: Parallel and Associative Proces-
sors, Rochelle Park, NJ, Hayden, 1976.
[8.70] O. Trier and A. Jain, “Goal-directed Evaluation of Binarization Methods,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 17(12), 1995.
[8.71] E. Trucco and R. Fisher, “Experiments in Curvature-based Segmentation of Range
Data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(2), 1995.
[8.72] T. Uchiyama and M. Arbib, “Color Image Segmentation Using Competitive Learning,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(12), 1994.
[8.73] K. Vincken, A. Koster, and M. Viergever, “Probabilistic Multiscale Image Segmenta-
tion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2), 1997.
[8.74] R. Wang, A. Hanson, and E. Riseman, “Fast Extraction of Ellipses,” Ninth Interna-
tional Conference on Pattern Recognition, Rome, 1988.
[8.75] M. Wani and B. Batchelor, “Edge-region-based Segmentation of Range Images,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(3), 1994.
[8.76] M. Worring and A. Smeulders, “Digitized Circular Arcs: Characterization and Param-
eter Estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(6), 1995.
[8.77] S. Yau and H. Fung, "Associative Processor Architecture – A Survey," ACM Computing
Surveys, 9, pp. 3–26, March, 1977.
[8.78] X. Yu, T. Bui, and A. Kryzak, “Robust Estimation for Range Image Segmentation and
Reconstruction,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(5), 1994.
[8.79] P. Zhu and P. Chirlian, “On Critical Point Detection of Digital Shapes,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 17(8), 1995.
[8.80] S. Zhu and A. Yuille, “Region Competition: Unifying Snakes, Region Growing, and
Bayes/MDL for Multiband Image Segmentation,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 18(9), 1996.
9 Shape

Space tells matter how to move, and matter tells space to get bent
Douglas Adams

In this chapter, we assume a successful segmentation, and explore the question of


characterization of the resulting regions. We begin by considering two-dimensional
regions which are denoted by each pixel in the region having value 1 and all back-
ground pixels having value 0. We assume only one region is processed at a time,
since in studying connected component labeling, we learned how to realize these
assumptions.
In the process of generating the segmented version of a picture, the computer
performs a region growing operation that acts on each pixel in the region. In so
doing, the computer can easily keep track of the area. Area is one of many features
that can help us to distinguish one type of object from another. For example, the
image of a connecting rod typically occupies more area (more black pixels) than
does the image of a valve. Thus, by measuring the area of a region, we may discern
the type of object.

9.1 Linear transformations

One topic we will want to consider in this chapter is invariance to various types of
linear transformations on regions. That is, consider all the pixels in a region, and
write the x, y coordinates of each pixel as a 2-vector, and operate on that set of
vectors. The first transformations of interest are the orthogonal transformations such
as

$$R_z = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix},$$

which operate on the coordinates of the pixels in a region to produce new coordinate
pairs. For example

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = R_z \begin{bmatrix} x \\ y \end{bmatrix},$$

where Rz is as defined above, represents a rotation about the z axis. Given a re-
gion s, we can easily construct a matrix, which we denote by S, in which each
column contains the x, y coordinates of a pixel in the region. For example, sup-
pose the region s = {(1, 2), (3, 4), (1, 3), (2, 3)}, then the corresponding coordinate
matrix S is

$$S = \begin{bmatrix} 1 & 3 & 1 & 2 \\ 2 & 4 & 3 & 3 \end{bmatrix}.$$

We can apply an orthogonal transformation such as a rotation to the entire region by


matrix multiplication
$$S' = R_z S = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} 1 & 3 & 1 & 2 \\ 2 & 4 & 3 & 3 \end{bmatrix}.$$

That works wonderfully well for rotations, but how can we include translation in this
formalism? To accomplish this, we augment the rotation matrices by adding
a row and a column, all zeros except for a 1 in the lower right corner. With this new
definition,
$$R_z = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$
We also augment the definition of a point by adding a 1 in the third position, so the
coordinate pair (x, y) in this new notation becomes
$$\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}.$$
Now, we can combine translation and rotation into a single matrix representation
(called a homogeneous transformation matrix). We accomplish this by changing the
third column to include the translation. For example, to rotate a point about the origin
by an amount ␪, and then translate it by an amount dx in the x direction and dy in
the y direction, we perform the matrix multiplication
$$x' = \begin{bmatrix} \cos\theta & -\sin\theta & d_x \\ \sin\theta & \cos\theta & d_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}. \qquad (9.1)$$
Thus, it is possible to represent rotation in the viewing plane (about the z axis) and
translation in that plane by a single matrix multiplication. All the transformations
mentioned above are elements of a class of transformations called “similarity trans-
formations.” Similarity transformations are characterized by the fact that they may
move an object around, but they do not change its shape.
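To make Eq. (9.1) concrete, here is a minimal Python/numpy sketch that builds such a homogeneous matrix and applies it to the coordinate matrix S of the example region above. The helper name homogeneous_rt is ours, not the book's.

```python
import numpy as np

def homogeneous_rt(theta, dx, dy):
    """Homogeneous transformation matrix of Eq. (9.1): rotate by theta
    about the origin, then translate by (dx, dy)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, dx],
                     [s,  c, dy],
                     [0.0, 0.0, 1.0]])

# The region s = {(1,2), (3,4), (1,3), (2,3)} as a coordinate matrix,
# augmented with a row of ones so that each column is (x, y, 1)^T.
S = np.array([[1, 3, 1, 2],
              [2, 4, 3, 3],
              [1, 1, 1, 1]], dtype=float)

T = homogeneous_rt(np.pi / 2, dx=5, dy=0)
S_new = T @ S            # one matrix multiplication moves the whole region
print(S_new[:2])         # drop the homogeneous row to recover (x', y')
```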
Fig. 9.1. Affine transformations can scale the coordinate axes. If the axes are scaled differently, the image undergoes a shear distortion. (Panels: reference; affine; not affine.)

Fig. 9.2. Aircraft images which are affine transformations of each other (from [9.1]).

But what can we do, if anything, to represent rotations out of the camera plane?
To answer this, we need to define an affine transformation. An affine transformation
of the 2-vector $x = [x, y]^T$ produces the 2-vector $x' = [x', y']^T$, where

$$x' = Ax + b, \qquad (9.2)$$

and b is likewise a 2-vector. This looks just like the similarity transformations mentioned above, except that we do not require that the matrix A be orthonormal, only that it be nonsingular. An affine transformation may distort the shape of a region. For example,
shear may result from an affine transformation, as illustrated in Fig. 9.1.
As you probably recognize, rotation out of the field of view of a planar object is
equivalent to an affine transformation of that object. This gives us a (very limited)
way to think about rotation out of the field of view. If an object is nearly planar, and
the rotation out of the field of view is small, that is, nothing gets occluded, we can
represent this 3D motion as a 2D affine transformation. For example, Fig. 9.2 shows
some images of an airplane which are all affine transformations of each other.
The matrix which implements an affine transform (after correction for translation)
can be decomposed into its constituents [9.62] – rotation, zoom, and shear:

$$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}\begin{bmatrix} \alpha & 0 \\ 0 & \delta \end{bmatrix}\begin{bmatrix} 1 & \beta \\ 0 & 1 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix}. \qquad (9.3)$$

Now, what does one do with these concepts of transformations? One can correct
transformations and align objects together through inverse transformations to assist
in shape analysis. For example, one can correct for translation by shifting so the
centroid is at the origin. Correcting for rotation involves rotating the image until the
principal axes of the object are in alignment with the coordinate axes.

Finding the principal axes is accomplished by a linear transformation which trans-


forms the covariance matrix of the object (or its boundary) into the unit matrix. This
transformation is related to the whitening transformation and the K–L transform. Un-
fortunately, once such a transformation has been done, Euclidian distances between
points in the region have been changed [9.92].
In the previous paragraph, the word “distance” occurs. Although one usually thinks
that “distance” means “Euclidian distance,” in this book we will use this word many
times and in many forms. For that reason, we should be just a bit more rigorous
about exactly what the concept means. The Euclidian distance (for example) is a
type of measure called a “metric.” Any metric d(a, b) has the following properties:

• $d(a, a) = 0 \;\; \forall a$
• $d(a, b) = d(b, a) \;\; \forall (a, b)$
• $d(a, b) + d(b, c) \ge d(a, c) \;\; \forall (a, b, c)$

We will have several opportunities to examine metrics in later chapters.

9.2 Transformation methods based on the covariance matrix

Consider the distribution of points illustrated in Fig. 9.3. Each point may be exactly
characterized by its location, the ordered pair, (x1 , x2 ), but neither x1 nor x2 , by itself,
is adequate to describe the point.
Now consider Fig. 9.4, which shows two new axes, y1 and y2 . Again, the ordered
pair y1 , y2  is adequate to exactly describe the point, but y2 is (compared to y1 ) very
nearly zero, most of the time. Thus, we would lose very little if we simply discarded
y2 and used the scalar, y1 to describe each point. Our objective in this section will
be to learn how to determine y1 and y2 in an optimal manner.

Fig. 9.3. A region which is approximately an ellipsoid.



Fig. 9.4. A new set of coordinates, derived by a rotation of the original coordinates, in which one coordinate well represents the data.

9.2.1 Derivation of the K–L expansion


Let x be a d-dimensional random vector. We will describe x in terms of a set of basis
vectors. That is, represent x by

$$x = \sum_{i=1}^{d} y_i b_i. \qquad (9.4)$$

Here, the vectors bi are deterministic (and may in general, be specified in advance).
Since any random vector x may be expressed in terms of the same d vectors, bi (i =
1, . . . , d), we say the vectors bi span the space containing x, and refer to them as a
basis set for x. To make further use of the basis set, we will require:1

(1) The bi vectors are linearly independent;


(2) The bi vectors are orthonormal, i.e.,

$$b_i^T b_j = \begin{cases} 1 & (i = j) \\ 0 & (i \ne j) \end{cases}. \qquad (9.5)$$

Under these conditions, the yi may be found by projections, where the projection
operation is defined by

$$y_i = b_i^T x \quad (i = 1, \ldots, d) \qquad (9.6)$$

and we let

$$y = [y_1, \ldots, y_d]^T. \qquad (9.7)$$

Here, we say the number yi results from projecting x onto the basis vector bi .
Suppose we wish to ignore all but m(m < d) components of y (which we will
call the principal components) and yet we wish to still represent x, although with
some error. We will thus calculate (by projection onto the basis vectors) the first m

1 To be a basis, they do not have to be orthonormal, just not parallel, but here, we will require orthonormality.
221 9.2 Transformation methods based on the covariance matrix

elements of y, and replace the others with constants, forming the estimate

$$\hat{x} = \sum_{i=1}^{m} y_i b_i + \sum_{i=m+1}^{d} \alpha_i b_i. \qquad (9.8)$$

The error which we have introduced by using some arbitrary constants, the alphas
of Eq. (9.8), rather than the elements of y, is given by

$$\Delta x = x - \hat{x} = x - \left( \sum_{i=1}^{m} y_i b_i + \sum_{i=m+1}^{d} \alpha_i b_i \right) = \sum_{i=m+1}^{d} (y_i - \alpha_i)\, b_i. \qquad (9.9)$$

If we think of x and therefore $\Delta x$ as random vectors, we can use the expected magnitude of $\Delta x$ to quantify how well our representation works.
$$\varepsilon^2(m) = E\left\{ \left( \sum_{i=m+1}^{d} (y_i - \alpha_i) b_i \right)^{\!T} \left( \sum_{j=m+1}^{d} (y_j - \alpha_j) b_j \right) \right\} = E\left\{ \sum_{i=m+1}^{d} \sum_{j=m+1}^{d} (y_i - \alpha_i)(y_j - \alpha_j)\, b_i^T b_j \right\} \qquad (9.10)$$

noting that yi is a scalar, and recalling Eq. (9.5), we have



$$\varepsilon^2(m) = \sum_{i=m+1}^{d} E\left\{ (y_i - \alpha_i)^2 \right\}. \qquad (9.11)$$

To find the optimal choice for $\alpha_i$, we minimize $\varepsilon^2$ with respect to $\alpha_i$:


$$\frac{\partial \varepsilon^2}{\partial \alpha_i} = \frac{\partial}{\partial \alpha_i} E\left\{ (y_i - \alpha_i)^2 \right\} = -2\left( E\{y_i\} - \alpha_i \right) = 0, \qquad (9.12)$$
resulting in

$$\alpha_i = E\{y_i\} = b_i^T E\{x\}. \qquad (9.13)$$

So, we should replace those elements of y which we do not measure by their expected
values – mathematically and intuitively appealing.
Substituting Eq. (9.13) into Eq. (9.11), we have

$$\varepsilon^2(m) = \sum_{i=m+1}^{d} E\left\{ (y_i - E\{y_i\})^2 \right\}. \qquad (9.14)$$

Substituting Eq. (9.6) into Eq. (9.14):



$$\varepsilon^2(m) = \sum_{i=m+1}^{d} E\left\{ \left( b_i^T x - E\{ b_i^T x \} \right)^2 \right\} = \sum_i E\left\{ \left( b_i^T x - E\{ b_i^T x \} \right)\left( x^T b_i - E\{ x^T b_i \} \right) \right\} = \sum_i E\left\{ b_i^T (x - E\{x\})(x^T - E\{x^T\})\, b_i \right\}. \qquad (9.15)$$

Equation (9.15) can be reformatted into



$$\varepsilon^2(m) = \sum_i b_i^T E[(x - E\{x\})(x - E\{x\})^T]\, b_i \qquad (9.16)$$

and we now recognize the term between the bs in Eq. (9.16) as the covariance of x:

$$\varepsilon^2(m) = \sum_{i=m+1}^{d} b_i^T K_x b_i. \qquad (9.17)$$

It can be shown (and we will show it below, for the special case of fitting a straight
line) that the choice of vector bi which minimizes Eq. (9.17) also satisfies

$$K_x b_i = \lambda_i b_i. \qquad (9.18)$$

That is, the optimal basis vectors are the eigenvectors of K x .


The covariance of y can be easily related to $K_x$ (the proof is a homework assignment):
$$K_y = E[(y - E\{y\})(y - E\{y\})^T] \qquad (9.19)$$
$$= B^T K_x B \qquad (9.20)$$

where the matrix B has columns made from the basis vectors, b1 , b2 , . . . , bd .
Furthermore, in the case that the columns of B are the eigenvectors of Kx , then B
will be the transformation which diagonalizes Kx , resulting in
$$K_y = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_d \end{bmatrix}. \qquad (9.21)$$
Substituting Eq. (9.21) into Eq. (9.17), we find

$$\varepsilon^2(m) = \sum_{i=m+1}^{d} b_i^T \lambda_i b_i. \qquad (9.22)$$

Since $\lambda_i$ is a scalar,
$$\varepsilon^2(m) = \sum_{i=m+1}^{d} \lambda_i b_i^T b_i \qquad (9.23)$$

and, again remembering the orthonormal conditions on bi ,



$$\varepsilon^2(m) = \sum_{i=m+1}^{d} \lambda_i. \qquad (9.24)$$

Thus, we represent a d-dimensional measurement, x, by an m-dimensional vector,


y, where m < d,

$$y_i = b_i^T x, \qquad (9.25)$$

and the $b_i$'s are the eigenvectors of the covariance of x.


This expansion of a random vector in terms of the eigenvectors of the covariance
matrix is referred to as the Karhunen–Loève expansion, or the “K–L expansion.”
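A hedged numpy sketch of this expansion follows; the function name kl_expansion is our own, and we center the data at its mean, which is consistent with replacing discarded components by their expected values (Eq. (9.13)).

```python
import numpy as np

def kl_expansion(X, m):
    """K-L expansion sketch: X is n x d (one sample per row); returns the
    m principal components of each sample, the basis, and the eigenvalues."""
    mu = X.mean(axis=0)
    Kx = np.cov(X, rowvar=False)          # covariance of x
    lam, B = np.linalg.eigh(Kx)           # K_x b_i = lambda_i b_i, Eq. (9.18)
    order = np.argsort(lam)[::-1]         # sort eigenvalues as in Eq. (9.26)
    lam, B = lam[order], B[:, order]
    Y = (X - mu) @ B[:, :m]               # projections y_i = b_i^T x, Eq. (9.25)
    return Y, B[:, :m], lam

# An elongated cloud: one K-L coordinate carries nearly all the variance,
# and the discarded eigenvalues bound the error as in Eq. (9.24).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * [5.0, 0.5]
Y, B, lam = kl_expansion(X, m=1)
print(lam)
```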

9.2.2 Properties of the K–L expansion


Without loss of generality, we will sort the eigenvectors bi in terms of their eigen-
values. That is, assign subscripts to the eigenvalues such that

$$\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_d. \qquad (9.26)$$

Then, we refer to $b_1$, corresponding to $\lambda_1$, as the "major eigenvector."

Interpretation as a hyperellipsoid

If we think of the distribution of points, x, as represented by a hyperellipsoid, the major axis of that ellipse will pass through the center of gravity of the data, and will be in the direction of the eigenvector corresponding to the largest eigenvalue of $K_x$. This is illustrated in Fig. 9.5. Thus, the K–L transform fits an ellipse to two-dimensional data, an ellipsoid to three-dimensional data, and a hyperellipsoid to higher dimensional data. (Representing data as an ellipsoid will be useful when we discuss aspect ratio.)

Use in straight line fitting

Consider the set of instances of the random vector x.

$$\{x_i\} \quad (i = 1, \ldots, n). \qquad (9.27)$$

Fig. 9.5. A covariance matrix may be thought of as representing a hyperellipsoid, oriented in the directions of the eigenvectors ($b_1$, $b_2$), and with extent in those directions equal to the square root of the corresponding eigenvalues.

We wish to find the straight line which best fits this set of data. Move the origin
to the center of gravity of the set. Then, characterize the (currently unknown) best
fitting line by its unit normal vector, n. Then, for each point xi , the perpendicular
distance from xi to the best fitting line will be equal to the projection of that point
onto n. Denote this distance by di (n):

$$d_i^2(n) = (n^T x_i)^2. \qquad (9.28)$$

To find the best fitting straight line, we minimize the sum of the squared perpendicular
distances
$$\varepsilon^2 = \sum_{i=1}^{n} d_i^2(n) = \sum_{i=1}^{n} (n^T x_i)^2 = \sum_{i=1}^{n} n^T x_i x_i^T n = n^T \left( \sum_{i=1}^{n} x_i x_i^T \right) n \qquad (9.29)$$

which we wish to minimize, subject to the constraint that n is a unit vector,

$$n^T n = 1. \qquad (9.30)$$

We perform the constrained minimization using a Lagrange multiplier and


minimizing
$$n^T \left( \sum_i x_i x_i^T \right) n - \lambda (n^T n - 1). \qquad (9.31)$$

Defining $S = \sum_i x_i x_i^T$, we take
$$\frac{\partial}{\partial n} \left( n^T S n - \lambda (n^T n - 1) \right). \qquad (9.32)$$
Differentiating the quadratic form $n^T S n$, we get $2Sn$, and setting the derivative to
zero results in

$$2Sn - 2\lambda n = 0, \qquad (9.33)$$

which is the same eigenvalue problem mentioned earlier. Thus we may state:

The best fitting straight line passes through the mean of the set of data points, and
will lie in the direction corresponding to the major eigenvector of the covariance of
the set.

We have now seen two different ways to find the straight line which best fits data:
The method of least squares described in section 5.3, if applied to fitting a line rather
than a plane, minimizes the vertical distance from the data points to the line. The
method described in this section minimizes the perpendicular distance described by
Eq. (9.29). Other methods also exist. For example, [9.53] finds piecewise represen-
tations which preserve moments up to an arbitrarily specified order.
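A short sketch of the eigenvector line fit, assuming the data are the rows of an n × 2 array; fit_line_eigen is a name of our choosing.

```python
import numpy as np

def fit_line_eigen(points):
    """Best-fit line in the perpendicular-distance sense of Eq. (9.29).
    Returns a point on the line (the mean) and a unit direction vector."""
    mu = points.mean(axis=0)
    centered = points - mu                 # the line passes through the mean
    S = centered.T @ centered              # scatter matrix of Eq. (9.31)
    lam, vecs = np.linalg.eigh(S)
    return mu, vecs[:, np.argmax(lam)]     # major eigenvector direction
```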

Fitting functions to data occurs in many contexts. For example, O’Gorman [9.54]
has looked at fitting not only straight edges, but points, straight lines, and regions
with straight edges. By so doing, subpixel precision can be obtained.
In the following, we will consider a few of the many simple features which may be
used to characterize the shape of regions (for additional features, see also [9.2]).

9.3 Simple features

In this section, we describe several simple features which can be used to describe
the shape of a patch – the output of the segmentation process. Many of these features
can be computed as part of the segmentation process itself. For example, since the
connected component labeling program must touch every pixel in the region, it can
easily keep track of the area.
The following is a list of simple features which are likewise simple to calculate.
• Average gray value. In the case of black and white "silhouette" pictures, this is simple to compute.
• Maximum gray value. This is straightforward to compute.
• Minimum gray value. This is straightforward to compute.
• Area (A). A count of all pixels in the region.
• Perimeter (P). Several different definitions exist. Probably the simplest is a count of all pixels in the region that are adjacent to a pixel not in the region.
• Diameter (D). The diameter describes the maximum chord – the distance between those two points on the boundary of the region whose mutual distance is maximum [9.68, 9.71]. We will discuss computation of this parameter in the next section.
• Thinness (also called compactness)² (T). Two definitions for compactness exist: $T_a = (P^2/A) - 4\pi$ measures the ratio of the squared perimeter to the area; $T_b = D/A$ measures the ratio of the diameter to the area. Fig. 9.6 compares these two measurements on example regions.
• Center of gravity (CG). The x and y coordinates of the center of gravity may be written as
$$m_x = \frac{1}{N}\sum x, \qquad m_y = \frac{1}{N}\sum y$$
for all N points in a region. However, we prefer the vector form
$$m = \frac{1}{N}\sum_{i=1}^{N} \begin{bmatrix} x_i \\ y_i \end{bmatrix}. \qquad (9.34)$$

2 Some authors [9.69] prefer not to confuse the mathematical definition of compactness with this definition, and
thus refer to this measure as the isoperimetric measure.

Fig. 9.6. Results of applying two different compactness measures to various regions ($T_a$ small, $T_b$ large; $T_a$ large, $T_b$ large; $T_a$ large, $T_b$ small). Since a circle has the minimum perimeter for a given area, it minimizes $T_a$. A starfish, on the other hand, has a large perimeter for the same area.

• X–Y aspect ratio. See Fig. 9.7. The aspect ratio is the length/width ratio of the bounding rectangle of the region. This is simple to compute.
• Minimum aspect ratio. See Fig. 9.8. Again a length/width ratio, but much more computation is required to find the minimum such rectangle. The minimum aspect ratio can be a difficult calculation, since it requires a search for extremal points. A very good approximation can be obtained if we think of a region as represented by an ellipse-shaped distribution of points. In this case, as we discussed in Fig. 9.5, the eigenvalues of the covariance of points are measures of the distribution of points along orthogonal axes – major and minor. The ratio of those eigenvalues is quite a good approximation of the minimum aspect ratio. (The methods of section 9.2.2 give a simple way to estimate the aspect ratio.)

Fig. 9.7. y/x is the aspect ratio using one definition, with horizontal and vertical sides to the bounding rectangle.
Fig. 9.8. y/x is the minimum aspect ratio.

• Number of holes. One feature that is very descriptive and reasonably easy to compute is the number of holes in a region.
• Triangle similarity. Consider three points on the boundary of a region, $P_1$, $P_2$, $P_3$, let $d(P_i, P_j)$ denote the Euclidian distance between two of those points, and let $S = d(P_1, P_2) + d(P_2, P_3) + d(P_3, P_1)$ be the perimeter of that triangle. The 2-vector
$$\left[ \frac{d(P_1, P_2)}{S}, \; \frac{d(P_2, P_3)}{S} \right], \qquad (9.35)$$
simply the ratio of side lengths to perimeter, is invariant to rotation, translation, and zoom [9.6].
• Symmetry. In two dimensions, a region is said to be mirror-symmetric if it is invariant under a reflection about a line. That line is referred to as the axis of symmetry. A region is said to have rotational symmetry of order n if it is invariant under a rotation of $2\pi/n$ about some point, usually the center of gravity of the region. There are two challenges in determining the symmetry of regions. The first is simply determining the axis. The other is to answer the question: "How symmetric is it?" Most papers prior to 1995 which analyzed symmetry of regions in machine vision applications treated symmetry as a predicate: either the region is symmetric or it is not. Zabrodsky et al. [9.97] present a measure called the symmetry distance which quantifies how symmetric a region is.
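The fragment below is a hedged sketch computing a few of these features for a binary image with numpy; the 4-neighbor perimeter rule and the function name simple_features are our choices, one plausible reading of the definitions above.

```python
import numpy as np

def simple_features(region):
    """Area, perimeter, center of gravity, thinness T_a, and X-Y aspect
    ratio for a binary region (nonzero inside, 0 outside)."""
    region = region.astype(bool)
    ys, xs = np.nonzero(region)
    area = xs.size                               # Area (A)
    cg = np.array([xs.mean(), ys.mean()])        # center of gravity, Eq. (9.34)

    # Perimeter: region pixels with at least one 4-neighbor outside.
    p = np.pad(region, 1)
    interior = p[1:-1, :-2] & p[1:-1, 2:] & p[:-2, 1:-1] & p[2:, 1:-1]
    perimeter = int(np.count_nonzero(region & ~interior))

    thinness_a = perimeter**2 / area - 4.0 * np.pi   # T_a
    aspect_xy = (ys.ptp() + 1) / (xs.ptp() + 1)      # bounding-rectangle ratio
    return area, perimeter, cg, thinness_a, aspect_xy
```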

9.3.1 Computing the diameter


The diameter, mentioned previously, is a robust measure of shape, when taken with
other suitable measures, and when calculated appropriately. For this reason, we
devote a section to the process of actually performing that calculation.
Consider for a moment, the following scenario: “I” (who happen to be simulating
a digital computer) am looking for a region in an image. I am only willing to accept
regions which have a particular appearance – namely, such regions should be long,
thin, and oriented in a particular direction. How can I quantify such vague terms as
“long,” “thin,” and what precisely is the orientation of a region? After all, a region
is simply a list of points which was the output from a region growing routine. There
are, of course, a large number of methods described in the literature for describing
shape of regions. One approach to solving such problems makes use of the location
of the “extremes” or diameter of the region; that is, those two points, A and B, whose
mutual distance d(A, B) is greater than or equal to the distance between any other
two points in the region.
This calculation can find applications when it is necessary to quantify the “shape”
of a region in an image, as might occur in classification of parts on an assembly line,
optical character recognition, etc., or in a robotic application which requires that the
orientation of a region be determined in order to pick up an object.
Suppose we have a region in a two-dimensional image plane described by a set
of points, R. The problem then is to find two points, A, B ∈ R, such that ∀ p1 , p2 ∈
R, d(A, B) ≥ d( p1 , p2 ). If the set R is small (say, 10–20 points) a simple comparison
of every point with every other point is the most straightforward approach. However,
when R becomes larger, the number of comparisons needed grows as $n(n-1)/2$, or $O(n^2)$.
In this section, we mention a technique for solving this problem. It uses eigenvalue
analysis to find the “best” (in a generalized least squares sense) estimate of the major
axis of the region, and defines the extrema as being the two points furthest out in the
direction of that axis. A second approach, which is applicable to nonconvex regions,
is described in section 9A.1. This technique is heuristic and relatively complicated
to implement, but is fast and guaranteed to converge.

The principal axis approach

In this method we first find the major axis of the region and those two points on
the boundary of the region which are closest to that axis. In this approach, minor

deviations, such as the spur shown in Fig. 9.9, will be ignored, even though they
may actually contain one of the extreme points.
The first step in the process is calculation of the major axis. This is performed by a minimization-of-squared-error technique. It is important that minimization of error be independent of the coordinate axes; therefore, we use the eigenvector line fitting technique described earlier in this section rather than the conventional MSE technique. (Assignment: Is this algorithm new, or is it repeated somewhere else in the text?) We define a line to be the best representation of the major axis if it minimizes the sum of squares of the perpendicular distances from the points in the region to the line.

Fig. 9.9. A region with a spur.

Let us assume the region R is described by a set of points $R = \{(x_i, y_i) \mid i = 1, \ldots, n\}$. Let the point $(x_i, y_i)$ be denoted by the vector $v_i$, and $d_i$ be the perpendicular distance from $v_i$ to the major axis. Thus, the major axis of the region is the line which minimizes:


$$d^2 = \sum_{i=1}^{n} d_i^2. \qquad (9.36)$$

It is easily shown that the major axis must pass through the center of gravity of the
region, thus it is necessary only to find the slope of the axis. Since the line passes
through the center of gravity, let us take that point as the origin of our coordinate
system. Then the problem becomes: Given n points with zero mean, find the line
through the origin which minimizes d 2 .
This turns out to be the same eigenvector problem we described earlier. Thus one
can find the major axis by:

(1) relocating the origin by subtracting the center of gravity from each point;
(2) finding the principal eigenvector of the scatter matrix of this modified set of
points; and
(3) solving for the major axis as the line through the center of gravity parallel to the
principal eigenvector.

Having found the major axis, one then treats each point on the boundary as a
vector and projects each onto the major axis. The extrema are then those two points
on opposite sides of the center of gravity whose projections onto the major axis are
maximum in length. (The extrema are not necessarily unique.) This approach yields
a solution which is an accurate representation of the “shape” of the region in the
least-mean-squared-error sense. It may or may not actually find the two points whose
mutual distance is maximum. In many applications, an approximation like this is
exactly what is needed. However, occasionally, one encounters regions (Fig. 9.9)
with spurs on them, where the spur may be relevant. In this case, it is necessary to
use an algorithm which will find the actual extrema (see section 9A.1).
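A compact sketch of the principal axis approach, assuming the boundary is given as an n × 2 array of points; the projections pick the two points farthest out along the major axis, and the function name is ours.

```python
import numpy as np

def extremes_principal_axis(points):
    """Steps (1)-(3) above, then projection onto the major axis."""
    mu = points.mean(axis=0)
    centered = points - mu                 # (1) origin at center of gravity
    S = centered.T @ centered              # scatter matrix
    lam, vecs = np.linalg.eigh(S)
    axis = vecs[:, np.argmax(lam)]         # (2) principal eigenvector
    proj = centered @ axis                 # (3) project boundary points
    A = points[np.argmin(proj)]            # farthest on one side of the CG
    B = points[np.argmax(proj)]            # farthest on the other side
    return A, B, np.linalg.norm(A - B)
```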

The convex hull

If one were to stretch a rubber band around a given region, the region that would result would be the convex hull (Fig. 9.10). The difference in area between a region and its convex hull is the convex discrepancy. (Note that the convex discrepancy is one of those features you are supposed to be figuring out for Assignment 9.1.) See Shamos [9.68] for fast algorithms for computing the convex hull, and [9.30] for such algorithms for parallel machines.
We can find the convex hull in $O(n \log n)$ time. Furthermore, finding the convex hull provides us with another simple feature, the convex discrepancy, as illustrated in Fig. 9.10.

Fig. 9.10. Convex hull of a region. The shaded area is the convex discrepancy.
9.4 Moments

The moments of a shape are easily calculated, and as we shall see, can be robust to
similarity transforms.
A moment of order p + q may be defined on a region as

$$m_{pq} = \sum x^p y^q f(x, y). \qquad (9.37)$$

If we assume that the region is uniform in gray value and that gray value is arbitrarily
set to 1 inside the region and 0 outside, the area of the region is then $m_{00}$, and we find that the center of gravity is
$$m_x = \frac{m_{10}}{m_{00}}, \qquad m_y = \frac{m_{01}}{m_{00}}. \qquad (9.38)$$
We may derive a set of moment-like measurements (the central moments) which are
invariant to translation by moving the origin:

$$\mu_{pq} = \sum (x - m_x)^p (y - m_y)^q f(x, y). \qquad (9.39)$$

By taking into account rotation and zoom, we can continue this sort of derivation and
can now define as many features as we wish by choosing higher orders of moments
or combinations thereof. From the central moments, we may define the normalized
central moments by
$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}, \quad \text{where} \quad \gamma = \frac{p+q}{2} + 1.$$
Finally, the invariant moments [9.21] have the characteristics that they are invariant to
translation, rotation, and scale change, which means that we get the same moment,
even though the image may be moved, rotated, or zoomed.3 They are listed in
Table 9.1.

3 Gonzalez and Wintz [9.21] refer to zooming as “scale change.” Since we use the word “scale” in a slightly
different way, we refer to this as “zoom.”

Table 9.1. Invariant moments.

$\varphi_1 = \eta_{20} + \eta_{02}$
$\varphi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$
$\varphi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2$
$\varphi_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{03} + \eta_{21})^2$
$\varphi_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{03} + \eta_{21})^2] + (3\eta_{21} - \eta_{03})(\eta_{03} + \eta_{21})[3(\eta_{30} + \eta_{12})^2 - (\eta_{03} + \eta_{21})^2]$
$\varphi_6 = (\eta_{20} - \eta_{02})[(\eta_{30} + \eta_{12})^2 - (\eta_{03} + \eta_{21})^2] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03})$
$\varphi_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{03} + \eta_{21})^2] + (3\eta_{12} - \eta_{30})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{03} + \eta_{21})^2]$
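As a hedged illustration of Eqs. (9.37)–(9.39) and the first two entries of Table 9.1, a small numpy function (our own naming; a sketch, not the book's code):

```python
import numpy as np

def hu_phi1_phi2(f):
    """phi_1 and phi_2 of Table 9.1 for an image f (float, 0 outside the
    region); moments per Eq. (9.37), central moments per Eq. (9.39)."""
    ys, xs = np.mgrid[:f.shape[0], :f.shape[1]]
    m = lambda p, q: np.sum(xs**p * ys**q * f)                 # Eq. (9.37)
    mx, my = m(1, 0) / m(0, 0), m(0, 1) / m(0, 0)              # Eq. (9.38)
    mu = lambda p, q: np.sum((xs - mx)**p * (ys - my)**q * f)  # Eq. (9.39)
    eta = lambda p, q: mu(p, q) / mu(0, 0)**((p + q) / 2 + 1)  # normalized
    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2))**2 + 4.0 * eta(1, 1)**2
    return phi1, phi2
```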

Since their original development by Hu [9.33], the concept has been extended to
moments which are invariant to affine transforms by Rothe et al. [9.62].
Despite their attractiveness, strategies based on moments do have problems, not
the least of which is sensitivity to quantization and sampling [9.45]. (See Assignment
9.9.)
The use of moments is actually a special case of a much more general approach to
image matching [9.89] referred to as the method of normalization. In this philosophy,
we first transform the region into a “canonical frame” by performing a (typically
linear) transform on all the points. The simplest such transformation is to subtract
the coordinates of the center of gravity (CG) from all the pixels, thus moving the
coordinate origin to the CG of the region. In the more general case, such a trans-
formation might be a general affine transform, including translation, rotation, and
shear. We then do matching in the transform domain, where all objects of the same
class (e.g., triangles) look the same.
Some refinements are also required if moments are to be calculated with gray-
scale images, that is, when the f of Eq. (9.37) is not the result of a thresholding
operation. All the theory of invariance still holds, but as Gruber and Hsu [9.24] point
out, noise corrupts moment features in a data-dependent way.
Once a program has extracted a set of features, some use must be made of this
set, either to match two observations or to match an observation to a model. The use
of simple features in matching is described in section 13.2.

9.5 Chain codes

A chain code is a feature describing the boundary of a region. In a chain code,


we represent a counter-clockwise walk around a region by a sequence of numbers,

Fig. 9.11. The eight directions (for 8-neighbor) and four directions (for 4-neighbor) in which one might step in going from one pixel to another around a boundary; 0 points along +x and the numbering proceeds counter-clockwise (8-neighbor: 0–7; 4-neighbor: 0–3).

all between 0 and 7 (if using eight directions) or between 0 and 3 (if using four directions), designating the direction of each step. The eight and four cardinal directions are defined as illustrated in Fig. 9.11. The boundary of a region may then be represented by a string of single digits. A more compact representation utilizes superscripts whenever a direction is repeated. For example, 0012112776660 could be written $0^2 1 2 1^2 2 7^2 6^3 0$, and illustrates the boundary shown in Fig. 9.12. The ability to describe boundaries with a sequence of symbols plays a significant role in the discipline known as "syntactic pattern recognition," and appears frequently in the machine vision literature, including other places in this book.

Fig. 9.12. The boundary segment represented by the chain code $0^2 1 2 1^2 2 7^2 6^3 0$. (Can you determine what, if anything, is wrong with the caption on this figure?)
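A hedged sketch of chain coding and the superscript compression: the direction table assumes Fig. 9.11's numbering (0 along +x, proceeding counter-clockwise, with y increasing upward), and the boundary is assumed to be a closed list of 8-neighbor pixel coordinates.

```python
# Direction number -> (dx, dy), numbered as in Fig. 9.11 (y increases upward).
STEPS = {0: (1, 0), 1: (1, 1), 2: (0, 1), 3: (-1, 1),
         4: (-1, 0), 5: (-1, -1), 6: (0, -1), 7: (1, -1)}
CODES = {v: k for k, v in STEPS.items()}

def chain_code(boundary):
    """Chain-code a closed walk given as a list of (x, y) coordinates;
    consecutive points must be 8-neighbors."""
    pairs = zip(boundary, boundary[1:] + boundary[:1])
    return [CODES[(x1 - x0, y1 - y0)] for (x0, y0), (x1, y1) in pairs]

def compress(codes):
    """Superscript run-length form, e.g. [0,0,1,2,1,1] -> '0^2 1 2 1^2'."""
    out, i = [], 0
    while i < len(codes):
        j = i
        while j < len(codes) and codes[j] == codes[i]:
            j += 1
        out.append(f"{codes[i]}^{j - i}" if j - i > 1 else str(codes[i]))
        i = j
    return " ".join(out)
```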

9.6 Fourier descriptors

A Fourier descriptor is another feature describing the boundary of a region. Given


a boundary of a region, we imagine4 that the region is drawn on the complex plane,
and the x coordinate of each point then represents a distance along the real axis,
and the y coordinate a distance along the imaginary axis. Each boundary point is
therefore viewed as a single complex number. Traversing the boundary then results
in a (cyclic) sequence of complex numbers. Taking the Fourier transform of this
sequence provides another sequence of complex numbers, one which can be shown
to possess a type of invariance as illustrated in Table 9.2.
The following example illustrates these ideas in a somewhat oversimplified way:
Suppose we have the Fourier descriptors for two boundaries:

$$f_1 = 0.7, 0.505, 0.304, 0.211, \ldots$$
$$f_2 = 0.87, 0.505, 0.304, 0.211, \ldots$$

We see that these two sequences differ only in the first (DC) term. They therefore
represent two encodings of the same boundary which differ only by a translation.

4 Pun alert! (bet you missed that one) – note the kinds of numbers discussed in the paragraph.

Table 9.2. Equivalence between motions in the image and transform domains.

In the image                          | In the transform
A change in size                      | Multiplication by a constant
A rotation of $\phi$ about the origin | Phase shift
A translation                         | A change in the DC term

This example is oversimplified, because in reality the sequence will be a set of


complex numbers, not real numbers as illustrated, but the concepts are identical.
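The sketch below is one common recipe, consistent with Table 9.2 but not necessarily the book's exact procedure: drop the DC term (translation), divide by the first harmonic's magnitude (size), and keep only magnitudes (rotation and starting point).

```python
import numpy as np

def fourier_descriptors(boundary, keep=16):
    """Invariant Fourier descriptors for a closed boundary given as an
    n x 2 array of (x, y) points traversed in order."""
    z = boundary[:, 0] + 1j * boundary[:, 1]   # each point -> complex number
    F = np.fft.fft(z)
    F[0] = 0.0                      # translation lives only in the DC term
    F = F / np.abs(F[1])            # normalize size (assumes F[1] != 0)
    return np.abs(F[1:keep + 1])    # magnitudes discard rotation/phase
```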

Practical considerations in using Fourier descriptors

How we represent the movement from one boundary point to another is critical.
Simply using a 4-neighbor chain code produces poor results. Use of an 8-neighbor
code reduces the error by 40 to 80%, but is still not as good as using a subpixel inter-
polation. There are other complications as well, including the observation that the
usual parameterization of the boundary (arc length) is not invariant to affine trans-
formations [9.1, 9.96]. Experiments [9.39] have compared affine-invariant Fourier
descriptors and autoregressive methods. For more detail on Fourier descriptors, see
[9.1].

9.7 The medial axis

In two dimensions, the medial axis of a region is defined as the locus of the centers
of “maximal circles.” A maximal circle is the largest circle that can be located at a
given point in the region. Let’s say that a bit more carefully (we all need to learn
how to use mathematics to make our wording precise). At a point (x, y) inside the region, draw a circle of radius R about that point. Make R as large as possible, but such
that (1) no points in the circle are outside the region and (2) the circle touches the
boundary of the region at no fewer than two points. Any point on the medial axis
can be shown (see Assignment 7.12) to be a local maximum of a distance transform
(DT). A point with DT values of k is a local maximum if none of its neighbors
has a greater value. Fig. 9.13(a) repeats Fig. 7.3. The local maxima of this distance
transform are illustrated in Fig. 9.13(b).
Another way to think about the medial axis is as the minimum of an electrostatic
potential field. This approach is relatively easy to develop if the boundary happens
to be straight lines or, in three dimensions, planes [9.12]. See [9.16] for additional
algorithms to efficiently compute the medial axis.
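A hedged sketch of the local maximum characterization: a standard two-pass city-block (4-neighbor) distance transform, then the pixels not exceeded by any 4-neighbor, as in Fig. 9.13. This is one straightforward implementation, not necessarily the book's.

```python
import numpy as np

def medial_axis_4(region):
    """4-neighbor DT by two raster passes; medial axis = local maxima."""
    region = region.astype(bool)
    H, W = region.shape
    dt = np.where(region, H + W, 0)          # big value inside, 0 outside
    for y in range(H):                       # forward pass
        for x in range(W):
            if region[y, x]:
                up = dt[y - 1, x] if y > 0 else 0
                left = dt[y, x - 1] if x > 0 else 0
                dt[y, x] = min(dt[y, x], up + 1, left + 1)
    for y in range(H - 1, -1, -1):           # backward pass
        for x in range(W - 1, -1, -1):
            if region[y, x]:
                down = dt[y + 1, x] if y < H - 1 else 0
                right = dt[y, x + 1] if x < W - 1 else 0
                dt[y, x] = min(dt[y, x], down + 1, right + 1)
    p = np.pad(dt, 1)                        # local maxima of the DT
    axis = region & (dt >= p[:-2, 1:-1]) & (dt >= p[2:, 1:-1]) \
                  & (dt >= p[1:-1, :-2]) & (dt >= p[1:-1, 2:])
    return dt, axis
```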

Fig. 9.13. An example region whose DT, computed using 4-neighbors, is shown in (a). The morphological skeleton is illustrated in (b), and simply consists of the local maxima of the DT.

9.7.1 Relating the medial axis and ridges


The medial axis is defined only for binary images, or at least only for images in which
each region may be represented by its boundary. The ridge, however, is defined for
gray-scale images. We can relate these two characterizations using the following
strategy: Use a scale space representation of the image. That is, blur the binary
image. This process turns the binary image into a gray-scale image. Now find the
ridge. Start with the ridge at high scale (lots of blur); you can call that your initial
estimate of the medial axis. Now, reduce the scale slightly and see what new ridges
appear. Add them to your estimate. Continue this process over varying scales. See
Pizer et al. [9.59] for more details on this philosophy.

9.8 Deformable templates

Recall active contours (snakes) from section 8.5? If we think of the region surrounded
by a snake, rather than the snake itself, we realize we have a region whose shape
can be deformed, and thus have invented “deformable templates.” Objects can be
tracked utilizing a philosophy which allows for deformation of the template [9.102].
In addition, the concepts of deformable templates may be useful in image data base
access. For example, Bimbo and Pala [13.5] do indexing by comparing shapes in the
image with a user-drawn sketch, an “iconic index,” which is actually a deformable
template. The best matching template is written as

$$\phi(s) = \tau(s) + \theta(s) \qquad (9.40)$$

where s is (normalized) arc length, $\tau(s)$ is the template as stored in the data base, and $\theta(s)$ is the deformation required to make this particular template match a sequence of boundary points in the image being accessed. We emphasize that $\theta(s)$ is the difference
best matches the template is the image which minimizes “the difference between the
234 Shape

original and the deformed template.” This can be achieved by minimizing


1  2  2
!
d␪x d␪y d 2 ␪x d 2 ␪y
E= ␣ + +␤ + − I E (␸(s)) ds (9.41)
ds ds ds 2 ds 2
0

which represents (in the first term) how much the template had to be strained to fit the
object while the second term represents the energy spent to bend the template. This
is thus a deformable templates problem. The optimization problem may be solved
numerically [13.5].
A variation on deformable templates is the idea of “geometric flow” to change
a given initial curve into a form which is better suited for identification by, say,
template matching. The term "geometric flow" means that the flow is completely
determined by the geometry of the curve. Pauwels et al. [9.58] cast this discipline
as the answer to: “Is it possible to use optimization of functionals to crystallize the
geometric content of a curve by reducing the noise while at the same time enhancing
the salient features?”

9.9 Quadric surfaces

A surface defined by an algebraic equation of degree two is called a quadric (see


section 8.6.1). The general equation for a quadric is

$$ax^2 + by^2 + cz^2 + 2fyz + 2gzx + 2hxy + px + qy + rz + d = 0. \qquad (9.42)$$

This one equation describes all the second-order surfaces, some of which are illus-
trated in Fig. 9.14.

Fig. 9.14. The quadric equation describes a wide variety of surfaces: the ellipsoid, the hyperboloids of one and two sheets, the elliptic paraboloid, and the hyperbolic paraboloid [9.103] (CRC Press. Used with permission).

If the quadric is centered at the origin, and its principal axes happen to be aligned
with the coordinate axes, the quadric will take on a particular form. For example, an
ellipsoid has the special form

$$\frac{x^2}{a^2} + \frac{y^2}{b^2} + \frac{z^2}{c^2} = 1. \qquad (9.43)$$

However, when the axes of the quadric do not align with the coordinate axes, only
the general form of Eq. (9.42) occurs.
From range or other surface data, the quadric coefficients may be determined by
methods such as those in section 8.6.1. Given the coefficients, the type of quadric
may be determined by the following method.5
If there is a constant term, d, divide by the constant, redefining the other coefficients
(e.g. a ⇐ a/d), resulting in a form of the quadric equation in which the constant
term is unity:

$$ax^2 + by^2 + cz^2 + 2fyz + 2gzx + 2hxy + px + qy + rz + 1 = 0. \qquad (9.44)$$

This equation can then be written in the form:


$$\begin{bmatrix} x & y & z & 1 \end{bmatrix}\begin{bmatrix} a & h & g & p \\ h & b & f & q \\ g & f & c & r \\ 0 & 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = 0. \qquad (9.45)$$

Consider the upper left 3 × 3 submatrix,


$$E = \begin{bmatrix} a & h & g \\ h & b & f \\ g & f & c \end{bmatrix}.$$

Obtain the three eigenvalues, $\lambda_1$, $\lambda_2$, and $\lambda_3$, and find the reciprocal of each which is nonzero, $r_1 = 1/\lambda_1$, $r_2 = 1/\lambda_2$, and $r_3 = 1/\lambda_3$. At least one reciprocal must be positive to have a real surface: If exactly one is positive then the surface is a hyperboloid of two sheets; if exactly two are positive, then it is a hyperboloid of one sheet; if all three are positive, it is an ellipsoid, and the square roots of $r_1$, $r_2$, $r_3$ are the major axes of the ellipsoid. Otherwise the distance between the foci of hyperboloids is determined by the magnitudes of the $r$s.

5 The authors are grateful to Dr G. L. Bilbro for his formulation of this method.
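A short sketch of this classification recipe (our own function name; it follows the method above literally and assumes the coefficients have already been normalized so the constant term is 1):

```python
import numpy as np

def classify_quadric(a, b, c, f, g, h):
    """Type of the quadric of Eq. (9.44) from the eigenvalues of E."""
    E = np.array([[a, h, g],
                  [h, b, f],
                  [g, f, c]], dtype=float)
    lam = np.linalg.eigvalsh(E)
    r = [1.0 / l for l in lam if abs(l) > 1e-12]     # nonzero reciprocals
    n_pos = sum(ri > 0 for ri in r)
    if n_pos == 0:
        return "no real surface"
    if n_pos == 1:
        return "hyperboloid of two sheets"
    if n_pos == 2:
        return "hyperboloid of one sheet"
    return "ellipsoid, semiaxes " + str(np.sqrt(r))  # all three positive
```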

9.10 Surface harmonic representations

This approach represents a surface as a linear combination of fixed harmonic basis


functions. Harmonic functions are solutions to Laplace’s equation

$$\nabla^2 \psi(x, y, z) = 0 \qquad (9.46)$$

which is

$$\frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2} + \frac{\partial^2 \psi}{\partial z^2} = 0 \qquad (9.47)$$
in Cartesian coordinates. Most of the work in harmonic representations has not,
however, been in Cartesian coordinates, but rather in spherical coordinates. (See
Matheny and Goldgof [8.40] for a discussion of other forms.) Any continuous function which can be written as $r = r(\theta, \varphi)$ can be expressed as a linear combination of spherical harmonics. In spherical coordinates, Laplace's equation is
$$\frac{\partial}{\partial r}\!\left( r^2 \frac{\partial \psi}{\partial r} \right) + \frac{1}{\sin\theta} \frac{\partial}{\partial \theta}\!\left( \sin\theta \frac{\partial \psi}{\partial \theta} \right) + \frac{1}{\sin^2\theta} \frac{\partial^2 \psi}{\partial \varphi^2} = 0. \qquad (9.48)$$
We seek solutions which are separable in the sense that they may be written as the
product of functions of a single variable; that is, they have the form

$$\psi(r, \theta, \varphi) = R(r)\,\Theta(\theta)\,\Phi(\varphi). \qquad (9.49)$$

With this restriction, the partial differential equation may be separated into three
ordinary differential equations and the solution can be shown to be

$$P_l^m(\cos\theta)\,\sin m\varphi \qquad (9.50)$$

where the parameter l is referred to as the “degree,” m is an integer less than l, and
P is a Legendre polynomial.
Thus, any function may be represented as
$$r(\theta, \varphi) = \sum_{l=0}^{L}\left[ U_l^0 P_l(\cos\theta) + \sum_{m=0}^{l}\left( U_l^m P_l^m(\cos\theta)\cos m\varphi + V_l^m P_l^m(\cos\theta)\sin m\varphi \right)\right] \qquad (9.51)$$

where the coefficients are found by the process of fitting this form to the data.

9.11 Superquadrics and hyperquadrics

We discussed in section 8.6 the problem of segmenting a range image using the
surface function and how to fit functions to data. In this section, we describe how to
fit superquadrics and hyperquadrics to range data.

A superquadric is a surface with the equation
$$\left| \frac{x}{a} \right|^{\gamma_1} + \left| \frac{y}{b} \right|^{\gamma_2} + \left| \frac{z}{c} \right|^{\gamma_3} = 1 \qquad (9.52)$$
and a hyperquadric has the equation


$$\sum_{i=1}^{N} \left| A_i x + B_i y + C_i z + D_i \right|^{\gamma_i} = 1 \qquad (9.53)$$

which may be rewritten as F(x, y, z) = 1.


Kumar et al. [8.32] propose an approach to fit hyperquadrics to range data. It
operates as follows: First, recognize that the parameters which minimize


$$\mathrm{EOF} = \sum_{i=1}^{N} \left( 1 - F(x_i, y_i, z_i) \right)^2 \qquad (9.54)$$

are presumed to be a good fit. (Since, at every point on the surface F, the value of F
(x, y, z) is supposed to be 1.0.) Observe that this minimizes the algebraic distance
(see section 8.6.1) from the point to the surface! This distance will be zero if the
point lies on the surface, but otherwise, there is not a simple relationship between the
Euclidean distance from the point to the surface and the algebraic distance. Kumar
et al. observe: “This function is biased, especially for oblong objects.” Similar
complaints exist for essentially all applications of the algebraic distance.
To get a somewhat better fit, in the case of hyperquadrics, the following approach
can be used. Suppose we have an initial estimate of the surface, an estimate which
is not too bad, but might not be the best estimate. Let that estimate be defined by a
set of parameters, A, B, C, and D, defining a function F(x, y, z). For a particular
point $(x_i, y_i, z_i)$, substitute these values into F to determine $w_i = F(x_i, y_i, z_i)$. If the point is actually on the surface, then $w_i$ will be equal to 1. Now, consider the surface defined by $F(x, y, z) = w_i$. The distance normal to this surface can be approximated by the distance $d_i$ in the direction of the gradient,

$$F(x, y, z) = F(x_i, y_i, z_i) + d_i \,\|\nabla F(x_i, y_i, z_i)\| \qquad (9.55)$$

so
$$d_i = \frac{1 - F(x_i, y_i, z_i)}{\|\nabla F(x_i, y_i, z_i)\|}. \qquad (9.56)$$

This di is really what we want to minimize. It is the distance from a surface through
the point (xi , yi , z i ) which is in some sense parallel to the surface we wish to de-
termine. These are estimates only; it remains to iteratively refine these estimates
until they become the real solutions. To accomplish that, rewrite the squared error in
238 Shape

terms of d,

$$E = \sum_{i=1}^{N} \frac{\left( 1 - F(x_i, y_i, z_i) \right)^2}{\|\nabla F(x_i, y_i, z_i)\|^2} \qquad (9.57)$$
and minimize this objective function in the following way.

(1) First, determine an initial estimate, probably by numerically finding the value which minimizes the EOF of Eq. (9.54). This will not be a bad fit, but better fits are possible because of the bias.
(2) Compute, for each data point, $w_i = 1 / \|\nabla F(x_i, y_i, z_i)\|^2$.
(3) Minimize $\sum_{i=1}^{N} \left( w_i (1 - F(x_i, y_i, z_i)) \right)^2$.
(4) If the solution is good enough, stop; otherwise go to (2).
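In outline, the loop looks like the following sketch. Everything here is a placeholder: fit_weighted and grad_F stand for whatever weighted least-squares minimizer and gradient evaluation are available for the chosen hyperquadric; neither name comes from the book.

```python
import numpy as np

def reweighted_fit(fit_weighted, grad_F, pts, n_iter=5):
    """Iteratively reweighted fitting, steps (1)-(4) above.
    fit_weighted(pts, w): returns parameters minimizing the weighted EOF.
    grad_F(params, pts):  returns the N x 3 spatial gradients of F."""
    params = fit_weighted(pts, np.ones(len(pts)))            # step (1)
    for _ in range(n_iter):
        g = grad_F(params, pts)
        w = 1.0 / np.maximum(np.sum(g * g, axis=1), 1e-12)   # step (2)
        params = fit_weighted(pts, w)                        # step (3)
    return params                       # step (4): fixed iteration count here
```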

Observe that the general idea presented here for fitting an implicit function is applicable to more than just hyperquadrics. The idea of minimizing a function normal
to the gradient of a curve through the data points fits many (probably most) problems
in fitting of explicit functions.
Dickinson et al. [13.11] combine superquadric representations with the concept
of aspect graphs.

9.12 Generalized cylinders (GCs)

A cylinder may be described as a circle which translates along a straight line in space,
with the plane of the circle perpendicular to the line. Now, suppose that the line is
allowed to bend in space, in fact to become an arbitrary space curve, parameterized
by arc length, s. Then, the line becomes a vector function x(s), y(s), z(s) of s. Next,
allow the radius of the circle to vary with the point along the curve, R = R(s), and
you have some idea of what a generalized cylinder is [9.4, 9.21, 9.23, 9.74, 9.75,
9.98]. However, the concepts of GCs are more general even than described above.
The object translated does not have to be a circle, it can be any 2D shape.
If we can fit GCs to regions, we can then use the vector function of the line and the
radius function as features to describe the shape of the region. However, there are
significant challenges to fitting GCs to images. We will not pursue the GC idea any
further in this book. The reader can find many interesting papers in the literature, a
few of which are listed in the previous paragraph.

9.13 Conclusion

In this chapter, several features were defined that could be used to quantify the shape
of a region. Some, like the moments, are easy measurements to make. Others, such as

the diameter or the convex discrepancy, require development of fairly sophisticated


algorithms to avoid extremely long computation time. There have also been some
efforts in automatic learning of shape descriptors [9.11, 9.91].
How DO people do it? Studies of signal processing in the visual cortex [9.65] have suggested that images
are represented in the cortex after being subjected to a log-polar transformation, and
that invariance may be provided in this way [9.63, 9.90]. Other transformations
[9.87] may provide equivalent or superior representations in computer applications.
In this chapter, again, we have seen the philosophy of consistency invoked by
simply saying that a set of pixels is consistent if each point satisfies the same equation,
be it quadric or generalized cylinder.
Optimization methods came up quite a few times in this chapter:

• Constrained minimization. In section 9.2.2 a derivation is presented which finds the straight line which best fits a set of points, in the sense that the sum of perpendicular distances is minimized. To accomplish that, we were required to use constrained minimization with Lagrange multipliers.
• Integral squared error. In section 9.8 we find a deformation to a template by performing a minimization of an integral squared error.
• The pseudo-inverse minimizes the squared error. In section 9A.2, we will encounter a problem requiring "inversion" of a nonsquare matrix. Of course, one cannot formally invert such a matrix, and instead we derive the "pseudo-inverse." We also show that the pseudo-inverse is really a minimum-squared-error algorithm.

9.14 Vocabulary

You should know the meanings of the following terms.

Affine transform
Aspect ratio
Basis vector
Center of gravity
Chain code
Compactness
Convex hull
Convex discrepancy
Deformable template
Diameter
Fourier descriptor
Generalized cylinder
Homogeneous transformation matrix
Invariant moment
240 Shape

K–L transform
Linear transformation
Medial axis
Metric
Moment
Orthogonal transformation
Principal component
Similarity transform
Thinness

Topic 9A Shape description

9A.1 Finding the diameter of nonconvex regions

One fast algorithm for an estimate of the diameter is as follows:

(1) Choose a point on the boundary, call it P0 .


(2) Find the point on the boundary at a maximum distance from P0 . Call it P1 .
(3) Compute the midpoint, M, of P0 , P1 .
(4) Find the point P2 on the boundary at a maximum distance from M.
(5) Find the boundary point P3 at a maximum distance from P2 .
(6) If d(P2 , P3 ) ≤ d(P0 , P1 ), stop. P0 and P1 are the extrema; else, assign P0 ← P2 ,
P1 ← P3 , go to (3).

This technique is easy to program and converges rapidly. We have had good luck with it.
Unfortunately, it is not guaranteed to converge to the global extrema, although there is a high
likelihood that it will do so.
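For concreteness, a direct transcription of steps (1)–(6) in Python/numpy (our naming; the farthest-point search is the linear scan the text describes):

```python
import numpy as np

def fast_diameter(points, max_iter=100):
    """Fast diameter estimate; points is an n x 2 array of boundary points."""
    def farthest(p):
        return points[np.argmax(np.sum((points - p)**2, axis=1))]

    P0 = points[0]                        # (1) a point on the boundary
    P1 = farthest(P0)                     # (2) farthest from P0
    for _ in range(max_iter):
        M = (P0 + P1) / 2.0               # (3) midpoint
        P2 = farthest(M)                  # (4)
        P3 = farthest(P2)                 # (5)
        if np.linalg.norm(P2 - P3) <= np.linalg.norm(P0 - P1):
            return P0, P1                 # (6) P0, P1 are the extrema
        P0, P1 = P2, P3
    return P0, P1
```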
An extension of the previous algorithm provides a strategy which is guaranteed to converge.
In addition, this algorithm provides a mechanism for rapidly narrowing the search space.
First, define a linear search function $M(P_i, R)$ which returns the point $P_{i+1} = M(P_i, R)$ such that $\forall x \in R,\; d(P_i, x) \le d(P_{i+1}, P_i)$.
Choose an arbitrary point $P_1 \in R$, and find $P_2 = M(P_1, R - \{P_1\})$. Next, find $P_3 = M(P_2, R - \{P_1, P_2\})$. The relationship between $d(P_1, P_2)$ and $d(P_2, P_3)$ can be one of the following three cases, which are illustrated in Figs. 9.15–9.17:

Case 1: $d(P_1, P_2) > d(P_2, P_3)$
Case 2: $d(P_1, P_2) = d(P_2, P_3)$
Case 3: $d(P_1, P_2) < d(P_2, P_3)$.

In Case 3, we compute a new $P_3$ called $P_3'$ so that $d(P_2, P_3') = d(P_2, P_3)$ and $P_1$, $P_2$, and $P_3'$ are collinear. We therefore have a symmetrical lens shape which encloses all the points in R using at most two linear searches.

Fig. 9.15. Case 1. Fig. 9.16. Case 2. Fig. 9.17. Case 3.

Fig. 9.18. In this figure, $B_2$ is one of the extrema. The other extreme is an element of $R_{11}$ (redrawn from [9.71]).

Our heuristic states that the "best" direction in which to search is perpendicular to $\overline{P_1 P_2}$ (or $\overline{P_2 P_3}$ in Case 3).
Compute the apexes $\alpha_1$ and $\alpha_2$ as shown in Fig. 9.18. Then find $B_1 = M(\alpha_1, R - \{P_1, P_2\})$. If $d(\alpha_1, B_1) \le d(P_1, P_2)$, stop; else partition $R - \{P_1, P_2\}$ into two mutually exclusive regions $R_1$ and $R_2$, where $R_1 = \{x \in R - \{P_1, P_2\} \mid d(x, P_1) > d(P_1, P_2)\}$. Find $B_2 = M(\alpha_2, R - R_1)$, and find $R_2 = \{x \in R - \{P_1, P_2\} - R_1 \mid d(x, \alpha_2) > d(P_1, P_2)\}$.
Note that: (1) In general, $R_1 \cup R_2 \subset R$; however, $R - \{R_1 \cup R_2\}$ contains no points of interest. (2) If either $R_1 = \emptyset$ or $R_2 = \emptyset$, we can stop, and $\overline{P_1 P_2}$ is the diameter.
If both $R_1$ and $R_2$ are nonempty, we can define them as "antipodal regions" in the following sense: If a diameter exists greater than $d(P_1, P_2)$, one endpoint must lie in $R_1$ and the other in $R_2$, which effectively cuts the search space in half.
Two new points, $\beta_{11}$ and $\beta_{12}$, are computed by swinging an arc with center at $\alpha_1$ and radius $d(\alpha_1, B_1)$ to intersect the lens. Similarly, $\beta_{21}$ and $\beta_{22}$ are computed.

Note that d(β21, β12) = d(β11, β22) and this is an upper bound on the diameter. In a digital picture, having an upper limit on the diameter may allow earlier stopping of the algorithm: If we know a diameter candidate and we know the upper bound, and these two values differ by less than 1.414 (the diagonal of a pixel), there is no need to search further.
Compute r = MAX(d(B1, B2), d(P1, P2)). Use this as a radius with which to draw arcs. Using β21 as center and r as radius, partition R1 into two regions, R11 and R1 − R11. Similarly, use β22 as center and find R12 ⊂ R1. Note that R1 − {R11 ∪ R12} contains no points of interest. Similarly, find R21 and R22 as shown. If R21 = ∅ and R11 ∩ R12 = ∅, then R21 and R12 is an antipodal pair, as is R11 and R22. In any case, R21 ∪ R22 is antipodal with R11 ∪ R12. These antipodal regions will be our search space in the next phase.
The strategy to this point has either identified the extrema or it has provided other useful
results, specifically:

(1) An upper bound on the diameter has been derived.
(2) All points within a very large area have been eliminated as candidates for extrema.
(3) The remaining points have been partitioned into antipodal regions.

At this point, if fewer than K points remain (which is most often the case), the most appropriate
technique is to compute convex hulls. (The optimal choice of K is a function of region
topology. It has been our experience that K = 50 seems to work well.) This computation is
aided by the observations that:

• If either of an antipodal pair of regions is empty, the other may be eliminated.
• The extrema must lie on the opposite sides of the convex hull of the antipodal regions, eliminating a great many more points, since we are able to compute convex hulls of smaller regions.

If, on the other hand, many more points remain, the algorithm can be invoked recursively
using the antipodal pairs of regions as subject areas, and choosing new starting points as
those points closest to β21 and β12, or β11 and β22.
For R with n points, the blind exhaustive search takes O(n²) distance calculations and comparisons. Our search algorithm is exhaustive (hence guarantees convergence), but intelligent, in that it takes the global shape of the region R into consideration. The initial search space R is sequentially divided into smaller, mutually exclusive subspaces by eliminating those points that cannot be the end points of the extrema.
Although the number of subspaces is increased by two for each recursive call of the procedure, many more points are eliminated from the search space after each call. Therefore, the search space decreases rapidly; how rapidly depends on the shape of R.
This method is derived from geometric considerations and consequently its rate of convergence is strongly dependent on the geometry of the region. It is, therefore, difficult to accurately assess the computational complexity of the method. If used as a pre-processing technique, it operates in O(4n) time, plus the time required to perform the convex hull calculation on the remaining points, or O(k log k) where k represents the points remaining after pre-processing.

Certainly in the worst case, almost no points are eliminated and the method converges as the convex hull does, in O(n log n). It is in fact worse than the convex hull in this case, since the program is more complex.
However, due to the large number of branch points, the algorithm exits early for virtually all regions, and it converges very rapidly.

9A.2 Inferring 3D shape from images

Many papers have been written which extract three-dimensional shape from various sources:
From silhouettes [9.42, 9.43, 9.44, 9.49, 9.101]; from images of specular reflectors [9.64];
from three orthogonal projections (x-ray projections) [9.81]; making use of the assumption
that objects tend to have orthogonality [9.22] or symmetry [9.18]. Ultimately, in all these
algorithms one must address the question of visibility [9.83].
Range images, one might think, already contain a complete description of three-dimen-
sional shape, but, of course, you cannot see the entire surface in one image [9.31]. A tough
problem is how to integrate several range images to form one description of a three-dimen-
sional object [9.76, 9.93].
Even though you may have segmented surfaces with some success, those segmentations are almost never perfectly correct. One might think that the equations describing the intersections of edges would be straightforward to find. After all, you have the equations of the surfaces whose intersections determine the edges. Just calculate the intersection! Ah, but it is never quite that easy. The problems occur when you have vertices, the intersections of edges – the trihedral or multihedral intersection points. Those equations you just derived never intersect at a point (they do not intersect because they are lines in 3D space). Hoover et al. [9.31] address this problem, and extend the solution to nonvisible surfaces.
Another important issue in extracting three-dimensional shape is which approach you choose: shape from perspective, shape from shading, shape from texture, etc.

9A.2.1 Shape from perspective

We formalize the shape-from-perspective problem as follows: A point is observed with row–column coordinates r_i and c_i, which is actually located in 3-space at [x_i, y_i, z_i]. The observation coordinates and true coordinates are related by

\begin{bmatrix} r_i \\ c_i \\ 1 \end{bmatrix} = \begin{bmatrix} k_u f & 0 & u_0 & 0 \\ 0 & k_v f & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \end{bmatrix}   (9.58)
where k_u, k_v, u_0, v_0 and f are projective properties of the camera, R is a 3 × 3 rotation matrix and T is a 3 × 1 translation vector. Assuming we know for each point the actual 3D coordinates and the corresponding 2D observations, we should be able to infer the transformations and camera properties. Variations on this problem include [9.95] the Perspective n Point problem (PnP), when n point correspondences are known; the Perspective n Line problem (PnL), when
(PnP), when n point correspondences are known; the Perspective n Line problem (PnL), when
n pairs of corresponding lines are known; and the Perspective n Angle problem (PnA), when
244 Shape

n pairs of corresponding angles are observed. Analytic solutions have been determined for
P3P [9.17], P3L [9.14], and PnA [9.95]. A clear explanation of the linear approaches to using
uncalibrated cameras (but assuming correct solutions to the correspondence problem) may
be found in [9.26].
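As a concrete illustration of Eq. (9.58), here is a short Python/NumPy sketch of the forward projection; the function name and defaults are ours, and the division by the third homogeneous coordinate is the usual perspective normalization:

    import numpy as np

    def project(points, R, T, f=1.0, ku=1.0, kv=1.0, u0=0.0, v0=0.0):
        # Project (N, 3) world points to (row, column) image coordinates
        # following the camera model of Eq. (9.58).
        K = np.array([[ku * f, 0.0,    u0,  0.0],
                      [0.0,    kv * f, v0,  0.0],
                      [0.0,    0.0,    1.0, 0.0]])
        Rt = np.eye(4)
        Rt[:3, :3] = R                       # rotation
        Rt[:3, 3] = T                        # translation
        homog = np.hstack([points, np.ones((len(points), 1))])
        proj = (K @ Rt @ homog.T).T          # homogeneous image coordinates
        return proj[:, :2] / proj[:, 2:3]    # perspective division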
Some work [9.34] has been done on the harder version of the correspondence problem, the
case that the cameras are not only uncalibrated, but are oriented in such a way that the epipolar
assumption is not necessarily valid. In this case the stereo matching problem becomes one of
search for best matching pairs, using both radiometric and geometric information to narrow
the search.

9A.2.2 Shape from shading

Shape from shading was first introduced by Horn, who argued that some knowledge of how
light is generated, reflected, and observed could substantially improve the performance of
machine vision systems. Consider Fig. 9.19, and assume you know:
• the angle of the light source
• the angle of the observer
• the measured brightness of the pixel
• a law governing how light is scattered
• the albedo of the surface.

Can you find the surface normal? (If you have the normal at every point, how would you
determine the surface?)
The solution may be found by solving differential equations. We start by writing the surface normal vector as n = r/|r|, where the direction vector

r = [∂z/∂x, ∂z/∂y, 1]^T.

In most shape-from-shading literature, the partial derivatives are abbreviated p ≡ ∂z/∂x, q ≡ ∂z/∂y.
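In code, forming the unit normal from (p, q) is a one-liner; a sketch (our own helper, for illustration only):

    import numpy as np

    def surface_normal(p, q):
        # Unit surface normal n = r / |r| with r = [dz/dx, dz/dy, 1]^T.
        r = np.array([p, q, 1.0])
        return r / np.linalg.norm(r)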
Despite the fact that we use the term “brightness” constantly, it actually does not have
a rigorous physical definition. Following Horn [9.32], we define irradiance as the power
per unit area falling on a surface, measured in watts per square meter. Then, we can define
radiance as the power per unit foreshortened area per unit solid angle. This dependence on

Fig. 9.19. Light strikes a surface at an incident angle θI, relative to the surface normal, n, and is reflected/scattered in another direction (the angle of observation is θo).

foreshortened area makes it clear that the angle of observation plays an important role in
scene “brightness.”
Often, the “reflectivity model” of a surface is known, or may be measured. For example, the brightness observed might be independent of the angle of observation and depend only on the angle of incidence: the reflected brightness might be related to the incident brightness by a relationship such as

R(x, y) = a I(x, y) cos(θI).   (9.59)

Thus, if we know the incident brightness, I, the albedo (how the surface is painted), a, and the reflected brightness, R, we should be able to solve for θI, and from that, infer the surface normal, and from that the surface. The reflectivity function of Eq. (9.59) is known as a Lambertian model. Note that the angle of observation does not enter the Lambertian model.
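If the Lambertian model of Eq. (9.59) holds and a and I are known, the incidence angle follows by inverting the cosine; a minimal sketch (the clipping guards against noise pushing the ratio outside [−1, 1]):

    import numpy as np

    def lambertian_incidence_angle(R, I, albedo):
        # Invert R = a * I * cos(theta_I) for theta_I (radians).
        ratio = np.clip(R / (albedo * I), -1.0, 1.0)
        return np.arccos(ratio)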
Another familiar reflectivity function is the specular model

R(x, y) = a I(x, y) δ(θI − θo),   (9.60)

which describes mirrors – you only get a reflection if the angle of observation equals the angle of incidence. Of course most surfaces, even “shiny” ones, are not perfect specular reflectors, and a more realistic model for a mixed surface might be

R(x, y) = a I(x, y) cos⁴(θI − θo).   (9.61)

Although the use of reflectivity functions requires radiometric calibration of cameras [9.28], that requirement in itself is not the major difficulty. To see the complexity of the problem [9.51], let us expand Eq. (9.59) in terms of the elements of the observation vector and the normal vector:

R(x, y) = a I(x, y) cos(I · N) = a I(x, y) cos( I_x ∂z/∂x + I_y ∂z/∂y + I_z N_z ).   (9.62)

Assuming we know the angle of observation (which we will actually know only approximately at best), the albedo, and the incident brightness, we still have a partial differential equation which we must solve to determine the surface function z.
Many papers and Horn’s classic text [9.32] address approaches for solving various special
cases of the shape-from-shading problem. A recent paper by Zhang et al. [9.100] surveys
the field up to 1999. (Remember, Equation (9.62) is itself a special case – it assumes the
brightness does not depend on the observation angle.) In the following, we discuss another
special case, photometric stereo.

Photometric stereo
In many cases, it is reasonable to model the reflectivity of a surface as proportional to the cosine of the angle between the surface normal vector and the illumination vector:

I(x, y) = r_0 (N_i · n)   (9.63)

where N_i is a unit vector in the direction of light source i. If we are fortunate enough to have an object, a Lambertian reflector, which satisfies this equation, and which possesses the same albedo (r_0), independent of the illumination, we can make use of multiple pictures from

multiple angles to determine the surface normal [9.35, 9.94]. Let us illuminate a particular
pixel with three different light sources (one at a time) and measure the brightness of that pixel
each time. At that pixel, we construct a vector from the three observations

I = [I_1, I_2, I_3]^T.   (9.64)

We know the direction of each light source. Let those directions be denoted by unit vectors
from the surface point toward the light source, N 1 , N 2 , and N 3 . Write those three direction
vectors in a single matrix by making each vector a row of the matrix.
N = \begin{bmatrix} N_1 \\ N_2 \\ N_3 \end{bmatrix} = \begin{bmatrix} n_{11} & n_{12} & n_{13} \\ n_{21} & n_{22} & n_{23} \\ n_{31} & n_{32} & n_{33} \end{bmatrix},   (9.65)
and now we have a matrix version of Eq. (9.63):

I = r_0 N n.   (9.66)

Since N is known, and n is a unit vector, we find

r_0 = |N^{-1} I|,   (9.67)

and once we know r_0, we can solve for n using

n = (1/r_0) N^{-1} I.   (9.68)
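Eqs. (9.64)–(9.68) translate almost directly into code; a sketch, assuming N is nonsingular (i.e., the three light directions are not coplanar):

    import numpy as np

    def photometric_stereo_3(I, N):
        # I: length-3 brightness vector (Eq. 9.64);
        # N: 3x3 matrix whose rows are unit light directions (Eq. 9.65).
        x = np.linalg.solve(N, I)   # computes N^{-1} I without an explicit inverse
        r0 = np.linalg.norm(x)      # Eq. (9.67)
        n = x / r0                  # Eq. (9.68): unit surface normal
        return r0, n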
Note that this derivation of photometric stereo assumes the albedo (sometimes called the
surface reflectance) is the same for every angle of illumination. In the next subsection, we
illustrate an application which combines shape from shading with photometric stereo, and
does not make this assumption. For example, specular reflectors provide a special condi-
tion: the angle of observation is exactly equal to the angle of incidence. This allows special
techniques to be used [9.56].
But what if we used more than three light sources? This gives us a wonderful context in
which to discuss an important topic, overdetermined systems and the pseudo-inverse.
If we in fact used more than three light sources, we would hope to cancel out some effects of
noise and/or measurement error. Suppose we have k light sources. Then Eq. (9.66) is rewritten
as I_{k×1} = N_{k×3} n_{3×1}, where subscripts are used to emphasize the matrix dimensions, and r_0 is removed for clarity of explanation.
Now, we cannot simply multiply by the inverse of N because N is not square. Instead, as we
have done so many times before, let us set up a minimization problem: We will find the surface
normal vector n which minimizes the squared sum of differences between measurements I
and products of N and n. Of course, if Eq. (9.63) were strictly true everywhere, we would not need to do a minimization at all; indeed, there would be no point in taking more than three measurements. We believe that measurements are not perfect and there is some advantage to taking additional ones. Define an objective function E which incorporates the desire to find an optimal solution:

E = \sum_{i=1}^{k} (I_i − N_i^T n)^2 = (I − N n)^T (I − N n).   (9.69)

Expand the product:

E = I^T I − 2 n^T N^T I + n^T N^T N n.   (9.70)

We wish to find the surface normal vector n which minimizes this sum-squared difference E, so we differentiate E with respect to n,

∇_n E = −2 N^T I + 2 N^T N n,   (9.71)

and setting the gradient to zero we have

N^T N n = N^T I   (9.72)

or

n = (N^T N)^{-1} N^T I.   (9.73)

(In case you have not recognized it yet, the pseudo-inverse just appeared.) That was a lot of
work. Let’s see if there is an easier way:
Go back to Eq. (9.66), again omitting the r_0 for clarity, and multiply both sides by N^T:

N^T I = N^T N n.   (9.74)

Multiply both sides by (N^T N)^{-1} and we find the same result as Eq. (9.73).
So why did we go to so much trouble – all the work of Eqs. (9.69) through (9.72) seems
like a waste. Ah, but there is madness in our method! We have now demonstrated to you
that multiplying by the pseudo-inverse (N T N )−1 N T produces the minimum squared error
estimate of an overdetermined linear system. That is a significant result; important in the case
of photometric stereo, and in many other applications as well.
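In practice one would not form (N^T N)^{-1} N^T explicitly; a least-squares solver computes the same minimum-squared-error estimate. A sketch for k ≥ 3 light sources (function name ours):

    import numpy as np

    def photometric_stereo_k(I, N):
        # I: length-k brightness vector; N: k x 3 matrix of light directions.
        # Solves the normal equations N^T N n = N^T I of Eq. (9.72),
        # i.e. the pseudo-inverse solution of Eq. (9.73).
        v, *_ = np.linalg.lstsq(N, I, rcond=None)
        r0 = np.linalg.norm(v)      # magnitude recovers the albedo term
        return r0, v / r0           # unit surface normal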

Using shape from shading with two light sources


Here we illustrate how to use shape from shading when two light sources are available. We
will not go into the physics here, but scanning electron microscopes provide an excellent
example application. In such a microscope two images are produced, one from secondary
emission (SE) electrons and one from backscattered (BSE) electrons. Reflectivity functions
may readily be measured.
Rather than attempting to model the geometry of the microscope exactly, we may simply
measure the reflectivity functions RSE and RBSE by placing a sphere in the microscope and
imaging it. Assuming the albedo is known (that is an important assumption), the measured
brightness is a function of p and q in both the BSE and SE images. For example, a particular
brightness value might be represented by the locus of points shown in Fig. 9.20. Although in
each image, there are an infinite number of possible p, q values, there are only two possible
(p, q) pairs which can explain both the measured brightnesses. We thus solve for z(x, y)
by defining an objective function and finding the surface which minimizes that function. To
accomplish this, first let ρ((p1, q1), (p2, q2)) define a function which represents the difference in the surface normals defined by the two (p, q) ordered pairs (it could, for example, be something as simple as the cosine of the angle between them, ρ((p1, q1), (p2, q2)) = p1 p2 + q1 q2). Then, we consider the difference between the surface normal defined by the ith (p, q)

Fig. 9.20. A given measured brightness pair can be created by a locus of p, q values (the “SE loci” and “BSE loci”) in both SE and BSE images. (The authors are grateful to B. Karacali for this figure.)

pair and the vector which is normal to the surface z(x, y), and write that difference as

d_i(x, y) = ρ( (∂z/∂x, ∂z/∂y), (p_i, q_i) ).

Finally, assuming that the two curves of Fig. 9.20 intersect at m points (we make m a function of x, y to remind the reader that all this is being done for a single x, y point), we define an objective function as

E = \sum_{x,y} \Big( \sum_{i}^{m(x,y)} (d_i(x, y))^{-1} \Big)^{-1} + \lambda R,   (9.75)

allowing R to be a regularization term such as piecewise-linearity.


A variation on the simple model given above occurs when the surface is not totally reflec-
tive. For example, in infrared wavelengths (and others, to a lesser extent), the power measured
coming from a surface is a combination of energy which is reflected and energy which is
emitted (so called “black body radiation”) as a result of the temperature of the surface [9.48].
The concepts of generalized cylinders (GCs) were discussed in section 9.12. If you combine
GCs with shape from shading, you can improve the functionality of the GC model [9.23].

9A.2.3 Structured illumination

In section 4.2.2, the basic concepts of structured illumination were introduced. The key point
is that by controlling the lighting, one or more of the unknowns in the stereopsis problem
may be eliminated. Let’s look at an example in more detail to see how this might work.
The problem to be solved is an application in robot vision: A robot is to pick up shiny,
metallic turbine blades from a rack and place them into a machine for further processing. In
order to locate the blades, a horizontal slit of light is projected onto the scene, by passing a
laser beam through a cylindrical lens. The geometry of the resulting images is illustrated in
249 Topic 9A Shape description

Fig. 9.21. The presence of a blade translates the location of the light stripe vertically in the image. (Labels in the figure: Blade, Cart, the angles θ and φ, the baseline h, and the depth z.)

Fig. 9.21 [9.57]. If there were no blade in the image, the light stripe from the laser would form a horizontal line in the image as it is reflected off the cart. The presence of the blade causes a vertical translation of the light stripe. The number of lines of vertical displacement is directly proportional to an angular difference which produces the angle φ. Knowing the two angles and the distance h between the camera and the projector allows a simple calculation of the distance z:

z = h tan θ / (tan θ + tan φ).   (9.76)

Although this relationship is relatively simple, it turns out to be both simpler and more accurate to simply keep a lookup table of z vs. row displacement.
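In code, Eq. (9.76) is one line; a minimal sketch (angles in radians, h in the same units as the desired z; the function name is ours):

    import numpy as np

    def stripe_depth(theta, phi, h):
        # Triangulated depth from the light-stripe geometry of Eq. (9.76):
        # z = h tan(theta) / (tan(theta) + tan(phi)).
        return h * np.tan(theta) / (np.tan(theta) + np.tan(phi))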
One practical problem which arises in this application is the specular nature of the reflections from the turbine blades; the bright spots may be orders of magnitude brighter than the rest of the image. This is dealt with by passing the beam through a polarizing filter. By placing another such filter on the camera lens, the specular spots are considerably reduced in magnitude.
In this example of the use of structured illumination, only one light stripe was projected at a time, so there was no ambiguity about which bright spot resulted from which projector; however, in the more general case, one may have multiple light sources, and then some method for disambiguation is required [9.7, 9.50].

Fig. 9.22. Variations in texture can provide information about shape (from [9.13]). © 2003 Artists Rights Society (ARS), New York / ADAGP, Paris.

9A.2.4 Shape from texture

Texture, or rather spatial variations in texture, can be used to characterize 3D shape, as illustrated in Fig. 9.22. For a very nice collection of pictures showing how texture suggests
3D shape, see [9.82]. Of course, in order to perform shape from texture, one must be able to
robustly extract texture primitives, as we discussed in section 4A.2.2.
The ideas of shape from texture have also found application in work to recover 3D motion
[9.80]; see also [9.77].

9A.2.5 Shape from focus

That one can obtain range from focus is obvious. However, the extension to robust shape
from focus is difficult. The principal problem is determining precisely when each pixel is in
focus [9.52, 9.79].

9A.3 Motion analysis and tracking

Motion analysis may be viewed as two different problems. The first is the case when the
camera is moving and the objects in the world are stationary. In this case, the extraction of
camera motion is the challenge. In the second case, the camera is stationary, and objects in the
world move. Finally, there is the combination of the two – where both the camera and some
objects in the world are moving. Motion analysis has many of the same issues as stereo. As
in stereo, correspondence is a major problem, but in this case, the correspondence is between
scenes which differ in time rather than in the spatial location of the camera.
One approach to the motion analysis problem is referred to as “optic flow.” Consider two images of the same object. Assume the image in frame 2 is the same as the image in frame 1, but displaced:

f1(x + δ) = f2(x).   (9.77)

Expand the LHS in a Taylor series:

f1(x) + δ f1′(x) + · · · = f2(x).   (9.78)

If we truncate this series to two terms and solve for δ, we find

δ = (f2(x) − f1(x)) / f1′(x).   (9.79)

There is a serious problem here. What happens if the gradient is zero? The gradient being
zero is really saying that there is no information at this point in the image. Imagine you are
looking through a telescope at a semi-tractor–trailer truck which is passing by, and the field
of view of your telescope allows you to see only a small area, a few square inches, of the
truck. When the bumper passes, you have information, and you know motion has occurred.
However, as the top of the trailer is passing, you see no changes for a relatively long time.
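A one-dimensional sketch of Eq. (9.79) makes the zero-gradient problem explicit (our own helper; f1 and f2 are NumPy arrays sampling the signal at two times):

    import numpy as np

    def displacement_1d(f1, f2, eps=1e-6):
        # First-order displacement estimate: delta = (f2 - f1) / f1' (Eq. 9.79).
        grad = np.gradient(f1.astype(float))
        with np.errstate(divide="ignore", invalid="ignore"):
            delta = (f2 - f1) / grad
        delta[np.abs(grad) < eps] = np.nan  # no information where the gradient vanishes
        return delta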
Another troublesome issue that arises in computing and applying optic flow is that in 2D,
Eq. (9.79) becomes a differential equation (see [9.10]). To deal with these problems, re-
searchers in optic flow have tried various ways to combine information from local measure-
ments to infer global knowledge, such as using clustering to identify sets of points which are
moving together [9.41]. As discussed in Chapter 4, one can match either points or boundaries.
For example, Quan [9.61] matches conics; Taylor and Kriegman [9.85] match line segments,
as does Zhang [9.99]; Smith and Nandhakumar [4.35] match textures. Motion segmentation
is discussed in [4.35, 9.3].
In two dimensions, the δ of Eq. (9.77) becomes a vector. Optic flow algorithms generate
a disparity field, a vector field which associates a vector with each pixel [9.20]. Efficient
implementations for optic flow have been investigated [10.19, 17.13, 17.14], including objects
which deform [17.60].
The optic flow at a point in the image plane of a moving camera is

u_i = (1/z_i) A_i T + B_i Ω   (9.80)

where the matrices

A_i = \begin{bmatrix} -f & 0 & x_i \\ 0 & -f & y_i \end{bmatrix}, \qquad B_i = \begin{bmatrix} x_i y_i & -(1 + x_i^2) & y_i \\ 1 + y_i^2 & -x_i y_i & -x_i \end{bmatrix}   (9.81)

are completely determined by the camera. T is the translational velocity of the camera and Ω is the rotational velocity of the camera. z_i is the depth to the image point at camera coordinates (x_i, y_i). The problem of determining T and Ω has been addressed by a number of researchers [9.36, 9.37]. Earnshaw and Blostein [9.15] compare these methods and introduce a new variation.
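A sketch implementing Eqs. (9.80) and (9.81) as printed (T and Omega are 3-vectors; the helper names are ours):

    import numpy as np

    def flow_matrices(x, y, f):
        # The camera-determined matrices A_i and B_i of Eq. (9.81).
        A = np.array([[-f, 0.0, x],
                      [0.0, -f, y]])
        B = np.array([[x * y, -(1.0 + x * x), y],
                      [1.0 + y * y, -x * y, -x]])
        return A, B

    def optic_flow(x, y, f, z, T, Omega):
        # Predicted flow u_i = (1/z_i) A_i T + B_i Omega, Eq. (9.80).
        A, B = flow_matrices(x, y, f)
        return A @ T / z + B @ Omega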
Motion can be estimated from smear. Chen et al. [9.9] make the observation that:
“Psychophysical investigation has shown that the human visual system (HVS) integrates reti-
nal images for 120 ms. Due to the integration, motion smear is inevitable. It has been reported
that when the HVS is presented with an image blurred due to motion, the amount of smear per-
ceived by human observers increases with observing duration at short durations (up to 20 ms).
At longer durations, however, the perceived image becomes sharper. It is our conjecture that
the HVS is performing a deblurring, or sharpening, function on the image.” This observation
leads to a “motion from smear” algorithm [9.9].
Recently, a gradual shift in focus has been observed [9.5] in the motion analysis literature
from analyzing the image or camera motion to labeling the action taking place – from “How
are things moving?” to “What is happening?”

Shape from motion


The discipline known as shape from motion usually assumes a solution to the correspon-
dence problem: That certain points are identified and corresponded in each frame. Here, we
describe the approach to shape from motion presented by Kanade and co-workers [9.60,
9.86]. This description is based on the unrealistic assumption that the points in the image are
projected orthographically (perpendicular to) the image plane. We use this derivation because
it is simpler to understand. However, if you need to implement this, refer to [9.60] for the
modifications required for perspective projection. See also Soatto and Perona [9.72, 9.73] for
a recent general study.
In the following derivation, the object is stationary; only the camera is moving. A point at spatial coordinates s_p is projected onto the image plane at frame f at coordinates (u_fp, v_fp). The camera moves, and at each frame, the camera is located at t_f and has orientation described by the three vectors (i_f, j_f, k_f). If we knew s_p, we would know the camera coordinates of the image of p, from

u_fp = i_f^T (s_p − t_f),   v_fp = j_f^T (s_p − t_f).   (9.82)

Defining

x_f = −t_f^T i_f,   y_f = −t_f^T j_f,   (9.83)

we can rewrite Eqs. (9.82) as

u_fp = i_f^T s_p + x_f,   v_fp = j_f^T s_p + y_f.   (9.84)

Now we have an inverse problem: We know the image coordinates, and from these we must
determine the spatial position of both the camera and the object points.
Accumulate all the observations of the image points in a matrix:
W = \begin{bmatrix} u_{11} & \cdots & u_{1P} \\ \vdots & & \vdots \\ u_{F1} & \cdots & u_{FP} \\ v_{11} & \cdots & v_{1P} \\ \vdots & & \vdots \\ v_{F1} & \cdots & v_{FP} \end{bmatrix},   (9.85)

given that there are F frames and P points. We observe that each column in this matrix is
a listing of the image coordinates of a particular point as it is viewed in different frames,
whereas each row is a listing of the locations of all the points in a particular frame. Next, we
define a matrix M which is a 2F × 3 matrix whose rows are the i f and j f vectors, and a
matrix S which is a 3 × P “shape matrix” whose columns are the s p vectors. Finally, define
T to be a 2F-dimensional translation vector with elements x_f and y_f. With these definitions, we can rewrite Eq. (9.84) for all the points as

W = M S + T 1_P,   (9.86)

where 1_P is a vector of length P, composed of all ones. (Observe that the product of T with the ROW vector of all ones, an outer product, constructs a matrix of dimension 2F × P.) Is this parenthetical remark correct?
If we move the origin to the center of gravity of the object (why not? location of the origin is arbitrary), we obtain

C = \frac{1}{P} \sum_{p=1}^{P} s_p = 0.   (9.87)

This location of the origin allows us to solve for T immediately, by observing that since the sum of any row of S is zero, the sum of any row of W is simply PT, and each row of T can be determined as the corresponding row of W divided by P. Now, subtract T from W, producing a new matrix W′ which satisfies

W′ = M S.   (9.88)

Using singular value decomposition, we can find a suitable decomposition of W′, which we denote by W′ = M̂ Ŝ. Unfortunately, these are not necessarily the M and S we want, since we could put any product A A^{−1} between them without changing the value of the product. We thus search for a matrix A such that

M = M̂ A,   S = A^{−1} Ŝ.   (9.89)

To find A, we can make use of the fact that the rows of M are the direction vectors of the camera and are hence orthonormal. With these additional constraints, A is determined, and we know the positions in 3-space of all the P points as well as knowing the camera angles at each frame time.
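The registration and rank-3 factorization steps are a few lines with an SVD; a sketch of Eqs. (9.85)–(9.88) (the metric upgrade by the matrix A of Eq. (9.89) is not performed here):

    import numpy as np

    def factor_shape_motion(W):
        # W: 2F x P measurement matrix of Eq. (9.85).
        T = W.mean(axis=1, keepdims=True)    # translation: each row of W divided by P
        W_reg = W - T                        # registered measurement matrix W'
        U, s, Vt = np.linalg.svd(W_reg)
        M_hat = U[:, :3] * np.sqrt(s[:3])    # rank-3 factors: W' ~ M_hat @ S_hat
        S_hat = np.sqrt(s[:3])[:, None] * Vt[:3]
        return M_hat, S_hat, T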

9A.4 Vocabulary

You should know the meanings of the following terms.

Irradiance
Optic flow
Perspective
Photometric stereo
Pseudo-inverse
Reflectivity
Shape from shading
Structured illumination

Assignment 9.1
For each feature described in section 9.3, determine if that
feature is invariant to: (1) rotation in the viewing plane;
(2) translation in that plane; (3) rotation out of that plane
(an affine transform if the object is planar); and (4) zoom.

Assignment 9.2
Let the Euclidean distance between two points be denoted by
the operator d(P1 ,P2 ) (you may want to use this in the next
problem). Design a monotonic metric R(P1 ,P2 ) which maps all
distances to be between 0 and 1. That is, if d(P1 ,P2 ) = ∞,
then R(P1 ,P2 ) = 1 and if d(P1 ,P2 ) = 0 then R(P1 ,P2 ) = 0.
For the metric you developed, show how you would prove your
measure is a formal metric. Just set up the problem. Extra
credit if you actually do the proof.

Assignment 9.3
Following are five points on the boundary of a region. (1,1),
(2,1), (2,2), (2,4), (3,2). Use eigenvector methods to fit a
straight line to this set of points, thus finding the princi-
pal axes of the region. Having found the principal axes, then
estimate the aspect ratio of the region.

Assignment 9.4
Write the chain code for the figure below.

Assignment 9.5
Discuss the following postulate: Let P1 and P2 be the two ex-
trema of a region which determine the diameter. Then P1 and
P2 are on the boundary of the region.

Assignment 9.6
Starting with Eq. (9.19), prove Eq. (9.20).

Assignment 9.7
What is the difference between the intensity axis of symmetry
and the medial axis?

Assignment 9.8
In Table 9.1, prove that the invariant moment φ1 is invariant to zoom.

Assignment 9.9
Your instructor will specify an image containing a single
region with unity brightness and zero background.

(1) Compute the seven invariant moments of the foreground region.
(2) Rotate the foreground region about its center of gravity through ten, twenty and forty degrees, and compute the invariant moments of the resulting image. What do you conclude?

Assignment 9.10
Prove that Eq. (9.35) is invariant to: (1) translation;
(2) rotation; and (3) zoom.

Assignment 9.11
Is the caption on Fig. 9.12 correct?

Assignment 9.12
Two silhouettes, A and B, are measured and their boundaries
encoded. Then, Fourier descriptors are computed. The descrip-
tors are as in Table 9.3.
It is possible that these two objects represent similar-
ity transforms of one another. (A similarity transform is
equivalent to a rigid body motion, translation or rotation
only.) Could they be affine transformations of one another?
(An affine transformation is a linear transformation which
includes not only rigid body motion, but also the possibility

Table 9.3. Complex values of the Fourier descriptor.

Object A        Object B
5.00 + i0.00    5.83 + i1.80
4.2 + i1.87     3.69 + i2.57
3.86 + i1.00    3.48 + i2.00
2.95 + i2.05    2.30 + i2.77
3.19 + i1.47    2.70 + i2.24

for scaling of the coordinate axes. If both axes (in 2D) are
scaled by the same amount, you get zoom. If they are scaled
by different amounts, you get shear.)
If you decide that these two sets of descriptors represent
the same shape, possibly transformed, describe and justify
what type of operations convert A into B. If they are not the
same shape, explain why.

Assignment 9.13
A cylinder with unit radius and height of ten is oriented
vertically about the origin, and is known to have a sur-
face which is a Lambertian reflector. That is, the reflected
brightness is independent of the angle of observation, and
depends on the angle of incidence following the relationship f = a I cos θ_i, where a is the albedo and I is the brightness of
the source. On this cylinder, the albedo is constant.
The camera is located at x = 0, y = −2, z = 2, and the optical
axis of the camera is pointed at the origin.
(Sketch: the y axis and the angle θ, with “Max brightness observed somewhere along here” indicated.)

The light source is known to be 4 units from the origin, and is known to be a point source which radiates equally in all directions. The brightest spot in the image of the cylinder is at an angle of 29 degrees. Where is the light source? A sketch of the coordinate system is illustrated for your convenience. (Note: Drawing is NOT to scale, and is not guaranteed to even be correct.)

Bibliography

[9.1] K. Arbter, W. Snyder, H. Burkhardt, and G. Hirzinger, “Application of Affine-invariant Fourier Descriptors to Recognition of 3-D Objects,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7), pp. 640–647, 1990.
[9.2] D. Ballard and C. Brown, Computer Vision, Englewood Cliffs, NJ, Prentice-Hall,
1982.
[9.3] M. Bichsel, “Segmenting Simply Connected Moving Objects in a Static Scene,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(11), pp. 1138–1142,
1994.
[9.4] T. Binford, “Visual Perception by Computer,” IEEE Conference on Systems and
Control, Miami, December, 1971.
[9.5] A. Bobick and J. Davis, “The Recognition of Human Movement Using Temporal
Templates,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3),
pp. 257–267, 2001.
[9.6] A. Califano and R. Mohan, “Multidimensional Indexing for Recognizing Visual
Shapes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(4),
pp. 373–392, 1994.
[9.7] D. Caspi, N. Kiryati, and J. Shamir, “Range Imaging with Adaptive Color Structured
Light,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5),
pp. 470–480, 1998.
[9.8] C. Chen, T. Huang, and M. Arrott, “Modeling, Analysis, and Visualization of Left
Ventricle Shape and Motion by Hierarchical Decomposition,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 16(4), pp. 342–356, 1994.
[9.9] W. Chen, N. Nandhakumar, and W. Martin, “Image Motion Estimation from Motion
Smear – A New Computational Model,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(4), pp. 412–425, 1996.
[9.10] A. Chhabra and T. Grogan, “On Poisson Solvers and Semi-direct Methods for Com-
puting Area Based Optic Flow,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(11), pp. 1133–1138, 1994.
[9.11] K. Cho and S. Dunn, “Learning Shape Classes,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(9), pp. 882–888, 1994.
[9.12] J. Chuang, C. Tsai, and M. Ko, “Skeletonization of Three-dimensional Object using
Generalized Potential Field,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(11), pp. 1241–1251, 2000.
[9.13] M. Clerc, “Texture Gradient,” http://www.masterworksfineart.com/inventory/vas originals.htm.

[9.14] M. Dhome, M. Richetin, J. Lapreste, and G. Rives, “Determination of the Attitude of 3-D Objects from a Single Perspective View,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(12), pp. 1265–1278, 1989.
[9.15] A. Earnshaw and S. Blostein, “The Performance of Camera Translation Direction
Estimators from Optical Flow: Analysis, Comparison, and Theoretical Limits,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(9), pp. 927–932,
1996.
[9.16] A. Ferreira and S. Ubeda, “Computing the Medial Axis Transform in Parallel with
Eight Scan Operations,” IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 21(3), pp. 277–282, 1999.
[9.17] M. Fischler and R. Bolles, “Random Sample Consensus: A Paradigm for Model
Fitting with Application to Image Analysis and Automated Cartography,” Commu-
nications of the Association for Computing Machinery, 24(6), 1981.
[9.18] P. Flynn, “3-D Object Recognition with Symmetric Models: Symmetry Extraction
and Encoding,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(8), pp. 814–818, 1994.
[9.19] D. Geiger, A. Gupta, L. Costa, and J. Vlontzos, “Dynamic Programming for Detect-
ing, Tracking, and Matching Deformable Contours,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 17(3), pp. 294–302, 1995.
[9.20] S. Ghosal and P. Vanek, “A Fast Scalable Algorithm for Discontinuous Optical Flow
Estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(2),
pp. 181–194, 1996.
[9.21] R. Gonzalez and P. Wintz, Digital Image Processing, Reading, MA, Addison-Wesley,
1977.
[9.22] A. Gross, “Toward Object-based Heuristics,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 16(8), pp. 794–802, 1994.
[9.23] A. Gross and T. Boult, “Recovery of SHGCs from a Single Intensity View,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(2), pp. 161–180,
1996.
[9.24] M. Gruber and K. Hsu, “Moment-based Image Normalization with High Noise-
Tolerance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2),
pp. 136–139, 1997.
[9.25] S. Gunn and M. Nixon, “A Robust Snake Implementation: A Dual Active Contour,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), pp. 63–68,
1997.
[9.26] R. Hartley, “Projective Reconstruction and Invariants from Multiple Images,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(10), pp. 1036–1041,
1994.
[9.27] P. Havaldar and G. Medioni, “Full Volumetric Descriptions from Three Intensity
Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5),
pp. 540–545, 1998.
[9.28] G. Healey and R. Kondepudy, “Radiometric CCD Camera Calibration and Noise
Estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(3),
pp. 267–276, 1994.

[9.29] D. Heeger and A. Jepson, “Subspace Methods for Recovering Rigid Motion I:
Algorithm and Implementation,” International Journal of Computer Vision, 7(2),
pp. 95–117, 1992.
[9.30] D. Helman and J. JáJá, “Efficient Image Processing Algorithms on the Scan Line
Array Processor,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(1), pp. 47–56, 1995.
[9.31] A. Hoover, D. Goldgof, and K. Bowyer, “Extracting a Valid Boundary Representa-
tion from a Segmented Range Image,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 17(9), pp. 920–924, 1995.
[9.32] B.K.P. Horn, Robot Vision, Cambridge, MA, MIT Press, 1986.
[9.33] M. Hu, “Visual Pattern Recognition by Moment Invariants,” IRE Transactions on
Information Theory, 8, pp. 179–187, 1962.
[9.34] X. Hu and N. Ahuja, “Matching Point Features with Ordered Geometric, Rigidity,
and Disparity Constraints,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(10), pp. 1041–1049, 1994.
[9.35] Y. Iwahori, R. Woodham, and A. Bagheri, “Principal Components Analysis and
Neural Network Implementation of Photometric Stereo,” Proceedings IEEE Con-
ference on Physics-Based Modeling in Computer Vision, June, 1995, pp. 117–125,
1995.
[9.36] A. Jepson and D. Heeger, “Linear Subspace Methods for Recovering Translational
Direction,” In Spatial Vision in Humans and Robots, ed. L. Harris and M. Jenkin,
Cambridge, Cambridge University Press, 1993.
[9.37] K. Kanatani, “Unbiased Estimation and Statistical Analysis of 3-D Rigid Motion
from Two Views,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
15(1), pp. 37–50, 1993.
[9.38] K. Kanatani, “Comments on ‘Symmetry as a Continuous Feature’,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 19(3), pp. 246–247, 1997.
[9.39] H. Kauppinen, T. Seppänen, and M. Pietikäinen, “An Experimental Comparison of
Autoregressive and Fourier-based Descriptors in 2D Shape Classification,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(2), pp. 201–207,
February, 1995.
[9.40] R. Kimmel, A. Amir, and A. Bruckstein, “Finding Shortest Paths on Surfaces Us-
ing Level Sets Propagation,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 17(6), pp. 635–640, 1995.
[9.41] D. Kottke and Y. Sun, “Motion Estimation via Cluster Matching,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(11), pp. 1128–1132, 1994.
[9.42] A. Laurentini, “The Visual Hull Concept for Silhouette-based Image Understanding,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2), pp. 150–
162, 1994.
[9.43] A. Laurentini, “How Far 3D Shapes can be Understood from 2D Silhouettes,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(2), pp. 188–195,
1995.
[9.44] S. Lavallée and R. Szeliski, “Recovering the Position and Orientation for Free-form
Objects from Image Contours Using 3D Distance Maps,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(4), pp. 378–390, 1995.

[9.45] S. Liao and M. Pawlak, “On Image Analysis by Moments,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 18(3), pp. 254–266, 1996.
[9.46] R. Malladi, J. Sethian, and B. Vemuri, “Shape Modeling with Front Propagation: A
Level Set Approach,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 17(2), pp. 158–175, 1995.
[9.47] C. Manhaeghe, I. Lemahieu, D. Vogelaers, and F. Colardyn, “Automatic Initial
Estimation of the Left Ventricular Myocardial Midwall in Emission Tomograms
using Kohonen Maps,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 16(3), pp. 259–266, 1994.
[9.48] J. Michel, N. Nandhakumar, and V. Velten, “Thermophysical Algebraic Invariants
from Infrared Imagery for Object Recognition,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 19(1), pp. 41–51, 1997.
[9.49] F. Mokhtarian, “Silhouette-based Isolated Object Recognition through Curvature
Scale Space,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(5), pp. 539–544, 1995.
[9.50] R. Morano, C. Ozturk, R. Conn, S. Dubin, S. Zietz, and J. Nissano, “Structured Light
using Pseudorandom Codes,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 20(3), pp. 322–327, 1998.
[9.51] S. Nayar and R. Bolle, “Reflectance Based Object Recognition,” International Jour-
nal of Computer Vision, 17(3), pp. 219–240, 1996.
[9.52] S. Nayar and Y. Nakagawa, “Shape from Focus,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(8), pp. 824–831, 1994.
[9.53] T. Nguyen and B. Oommen, “Moment-preserving Piecewise Linear Approximations
of Signals and Images,” IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 19(1), pp. 84–91, 1997.
[9.54] L. O’Gorman, “Subpixel Precision of Straight-edged Shapes for Registration and
Measurement,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
18(7), 1996.
[9.55] F. O’Sullivan and M. Qian, “A Regularized Contrast Statistic for Object Bound-
ary Estimation – Implementation and Statistical Evaluation,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 16(6), pp. 561–570, 1994.
[9.56] M. Oren and S. Nayar, “A Theory of Specular Surface Geometry,” International
Journal of Computer Vision, 24(2), pp. 105–124, 1996.
[9.57] N. Page, W. Snyder, and S. Rajala, “Turbine Blade Image Processing System.”
In Advanced Software in Robotics, ed. A. Danthine, Amsterdam, North-Holland,
1984.
[9.58] E. Pauwels, P. Fiddelaers, and L. Van Gool, “Enhancement of Planar Shape Through
Optimization of Functionals for Curves,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(12), 1995.
[9.59] S. Pizer, C. Burbeck, J. Coggins, D. Fritsch, and B. Morse, “Object Shape Before
Boundary Shape: Scale Space Medial Axis,” Journal of Mathematical Imaging and
Vision, 4, pp. 303–313, 1994.
[9.60] C. Poelman and T. Kanade, “A Paraperspective Factorization Method for Shape and
Motion Recovery,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
19(3), pp. 206–218, 1997.

[9.61] L. Quan, “Conic Reconstruction and Correspondence from Two Views,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(2), 1996.
[9.62] I. Rothe, H. Süsse, and K. Voss, “The Method of Normalization to Determine
Invariants,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4),
1996.
[9.63] G. Sandini and V. Tagliasco, “An Anthropomorphic Retina-like Structure for Scene
Analysis,” Computer Graphics and Image Processing, 14, pp. 365–372, 1980.
[9.64] H. Schultz, “Retrieving Shape Information from Multiple Image of a Specular
Surface,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2),
pp. 195–201, 1994.
[9.65] E. Schwartz, “Computational Anatomy and Functional Architecture of Striate Cortex,
Spatial Mapping Approach to Perceptual Coding,” Vision Research, 20, pp. 645–669,
1980.
[9.66] J. Sethian, “Curvature and Evolution of Fronts,” Communications in Mathematical
Physics, 101, pp. 487–499, 1985.
[9.67] J. Sethian, “Numerical Algorithms for Propagating Interfaces: Hamilton–Jacobi
Equations and Conservation Laws,” Journal of Differential Geometry, 31, pp. 131–
161, 1990.
[9.68] M. Shamos, “Geometric Complexity,” 7th Annual ACM Symposium on Theory of
Computing, May, 1975, Albuquerque, NM, pp. 224–233, 1975.
[9.69] D. Sinclair and A. Blake, “Isoperimetric Normalization of Planar Curves,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(8), pp. 769–777,
1994.
[9.70] S. Smith and J. Brady, “ASSET-2: Real-time Motion Segmentation and Shape Track-
ing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8), 1995.
[9.71] W. Snyder and I. Tang, “Finding the Extrema of a Region,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 2, pp. 266–269, 1980.
[9.72] S. Soatto and P. Perona, “Reducing ‘Structure from Motion’: A General Frame-
work for Dynamic Vision. 1. Modeling,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 20(9), pp. 933–942, 1998.
[9.73] S. Soatto and P. Perona, “Reducing ‘Structure from Motion’: A General Framework
for Dynamic Vision. 2. Implementation and Experimental Assessment,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 20(9), pp. 943–960, 1998.
[9.74] B. Soroka, “Generalized Cylinders from Parallel Slices,” Proceedings of the Con-
ference on Pattern Recognition and Image Processing, 1979.
[9.75] B. Soroka and R. Bajcsy, “Generalized Cylinders from Serial Sections,” 3rd In-
ternational Joint Conference on Pattern Recognition, November, Coronado, CA,
1976.
[9.76] M. Soucy and D. Laurendeau, “A General Surface Approach to the Integration of a Set
of Range Views,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(4), pp. 344–358, 1995.
[9.77] J. Stone and S. Isard, “Adaptive Scale Filtering: A General Method for Obtaining
Shape from Texture,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 17(7), pp. 713–718, 1995.

[9.78] G. Storvik, “A Bayesian Approach to Dynamic Contours through Stochastic Sampling and Simulated Annealing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(10), pp. 976–986, 1994.
[9.79] M. Subbarao and T. Choi, “Accurate Recovery of Three-dimensional Shape from
Image Focus,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(3), pp. 266–274, 1995.
[9.80] S. Sull and N. Ahuja, “Integrated 3-D Analysis and Analysis-Guided Synthesis
of Flight Image Sequences,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(4), pp. 357–372, 1994.
[9.81] Y. Sun, I. Liu, and J. Grady, “Reconstruction of 3-D Tree-like Structures from
Three Mutually Orthogonal Projections,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(3), pp. 241–248, 1994.
[9.82] B. Super, and A. Bovik, “Shape from Texture Using Local Spectral Moments,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(4), pp. 333–343,
1995.
[9.83] K. Tarabanis, R. Tsai, and A. Kaul, “Computing Occlusion-free Viewpoints,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(3), pp. 279–292,
1996.
[9.84] G. Taubin and R. Ronfard, “Implicit Simplicial Models for Adaptive Curve Recon-
struction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(3),
pp. 321–325, 1996.
[9.85] C. Taylor and D. Kriegman, “Structure and Motion from Line Segments in Multiple
Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(11),
pp. 1021–1032, 1995.
[9.86] C. Tomasi and T. Kanade, “Shape and Motion from Image Streams under Ortho-
graphy: A Factorization Method,” International Journal of Computer Vision, 9(2),
pp. 137–154, 1992.
[9.87] F. Tong and Z. Li, “Reciprocal-wedge Transform for Space-variant Sensing,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(5), pp. 500–511,
1995.
[9.88] B. Vemuri and Y. Guo, “Snake Pedals: Compact and Versatile Geometric Models
with Physics-based Control,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(5), pp. 445–459, 2000.
[9.89] K. Voss and H. Suesse, “Invariant Fitting of Planar Objects by Primitives,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 19(1), pp. 80–84,
1997.
[9.90] C. Weiman and G. Chaikin, “Logarithmic Spiral Grids for Image Processing and
Display,” Computer Graphics and Image Processing, 11, pp. 197–226, 1979.
[9.91] D. Weinshall and C. Tomasi, “Linear and Incremental Acquisition of Invariant
Shape Models from Image Sequences,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 17(5), pp. 512–517, 1995.
[9.92] M. Werman and D. Weinshall, “Similarity and Affine Invariant Distances Between
2D Point Sets,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(8), pp. 810–814, 1995.

[9.93] P. Whaite and F. Ferrie, “Autonomous Exploration: Driven by Uncertainty,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(3), pp. 193–205, 1997.
[9.94] R. Woodham, “Photometric Method for Determining Surface Orientation from
Multiple Images,” Optical Engineering, 19, pp. 139–144, 1980.
[9.95] Y. Wu, S. Iyengar, R. Jain, and S. Bose, “A New Generalized Framework for Finding
Object Orientation using Perspective Trihedral Angle Constraint,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 16(10), pp. 961–975, 1994.
[9.96] R. Yip, P. Tam, and D. Leung, “Application of Elliptic Fourier Descriptors to Sym-
metry Detection under Parallel Projection,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 16(3), pp. 277–286, 1994.
[9.97] H. Zabrodsky, S. Peleg, and D. Avnir, “Symmetry as a Continuous Feature,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(12), pp. 1154–1166,
1995.
[9.98] M. Zerroug and R. Nevatia, “Three Dimensional Descriptions based on the Analysis
of the Invariant and Quasi-invariant Properties of Some Curved-axis Generalized
Cylinders,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(3),
pp. 237–253, 1996.
[9.99] Z. Zhang, “Estimating Motion and Structure from Correspondences of Line Seg-
ments between Two Perspective Images,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(12), pp. 1129–1139, 1995.
[9.100] R. Zhang, P. Tsai, J. Cryer, and M. Shah, “Shape-from-shading: A Survey,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 21(8), pp. 690–706,
1999.
[9.101] J. Zheng, “Acquiring 3-D Models from Sequences of Contours,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(2), pp. 163–178, 1994.
[9.102] Y. Zhong, A. Jain, and M. Dubuisson-Jolly, “Object Tracking using Deformable
Templates,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(5),
pp. 544–549, 2000.
[9.103] D. Zwillinger, CRC Standard Mathematical Tables and Formulae, 31st edn., CRC
Press, 2003.
10 Consistent labeling

On axis, as the planets run,
Yet make at once their circle round the sun;
So two consistent motions act the soul;
And one regards itself and one the whole.
Alexander Pope

The single most challenging problem in all of computer vision is the “local/global
inference problem.” As in the fable of the blind men and the elephant, the computer
must, from a set of local measurements, infer the global properties of what is being
observed. In other words, the next level of the machine vision problem is to interpret
the global scene (which is composed of individual objects) using local information
about each object obtained from segmentation and shape analysis as we have dis-
cussed in Chapters 8 and 9. One way to approach the local/global inference problem
is to introduce the concept of consistency.

10.1 Consistency

Let’s begin with some notation: Define a set of objects {x1, x2, . . . , xn}, and a set of labels for those objects {λ1, λ2, . . . , λk}, which we assume for now are mutually exclusive (each object may have only one label) and collectively exhaustive (each object has a label). Denote a labeling as the ordered pair (xi, λj). By this notation, we mean that object i has been assigned label j.
As an example of consistent labeling, we will consider the problem of labeling
objects in a line drawing. Researchers have been interested in analysis of line draw-
ings since the beginnings of work in machine vision for three reasons: First, humans
can obviously look at line drawings and make interpretations with ease. Second, psy-
chological experiments [10.1, 10.6, 10.10] have convincingly demonstrated that it is
the points where brightness changes rapidly that convey the most information, and it
is relatively easy to convert edges to lines. Third, a line drawing is a dramatic reduc-
tion in the amount of data (which is not the same as information) in an image, and
perhaps, just perhaps, learning how to process line drawings would make analysis al-
gorithms run faster. Most of the groundwork in line drawing analysis was done in the
late 1960s and 1970s [10.5, 10.8]; however, progress continues to be made [10.17].


Fig. 10.1. A line drawing illustrating convex, concave, and occluding lines, and a labeling of that drawing.

We will allow each line in such a drawing to be labeled as either convex (an edge
in three dimensions which points toward the observer, such as the corner of a desk),
concave (an edge in three dimensions which points away from the observer, such as
the joint between the wall and the floor of a room), or occluding (the edge occurs
because one surface is partially hidden by another surface). For example, consider the
drawing in Fig. 10.1. In that drawing, lines resulting from convex edges are labeled
with a plus sign, concave edges are labeled with a minus sign, and occluding edges
are labeled with an arrow. The arrow points in the direction such that the occluded
surface is on the left if one moves in the direction of the arrow. You may note that
not all the lines in this drawing have been labeled. That is deliberate. Thus we have
one type of object, lines, and three types of labels for those lines. Our mission is to
learn how to get a computer to do automatically what human beings did so easily
when we interpreted the lines in that figure.
Before we can do that, we need to address one of those unfortunate ambiguities
that natural language introduces into discussions like this. The term labeling may
have more than one meaning. It may refer to the label assigned to a single object, to a pair of objects, or to an entire scene. We will try to make sure that the meaning is clear from context in this discussion. In order to accomplish the objective of labeling a scene, we must consider something we call consistency. The simultaneous labeling of two objects is represented by some function which we will call compatibility and denote by r(i, λ, j, λ′). (The compatibility function r(·) will have differing interpretations through this chapter, depending on the specific labeling algorithm.) This function is defined to have a value of 1 if the two labelings can exist together (mutually compatible) and −1 if they cannot. For example, r(i, +, i, −) = −1, since the object i cannot be both concave and convex (consider the drawing in Fig. 10.2). A labeling of an image is said to be complete and consistent when

\sum_i \sum_{j \neq i} r(i, λ, j, λ′) = n(n − 1),

where any label values at all are allowed. (Check it out: Did we do the math right? n objects, with each object labeling consistent with the labeling of all other objects.)

Fig. 10.2. A line which is both concave and convex can exist only in an impossible object.

Fig. 10.3. All the physically possible Y vertices.

Fig. 10.4. All the physically possible ELL vertices.

Fig. 10.5. All the physically possible arrow vertices.

In this chapter, we will utilize several different realizations of this compatibility
function.
Let’s look at the line labeling in more detail to see how this works. Although we
are interested in labeling lines, it will turn out to be useful to think about vertices
as well. A vertex is where lines meet. If each line can have four labels (concave,
convex, arrow in, arrow out), there should be 4³ = 64 ways to label a vertex where
three lines meet. It turns out, however, that not all of those combinations are physically
possible. In Figs. 10.3 to 10.5, we illustrate all the “Y”, “ELL”, and “arrow” vertices
which are physically possible. There are a variety of ways we could make use of this
information. One way is to use depth-first search. The algorithm is as follows:
(1) Choose a starting vertex (call it vertex 1) and label all the lines coming in to it
    in a physically possible way.
(2) Choose an adjacent vertex (call it vertex 2) and label all lines coming in to it
    in a physically possible way, such that the labeling is consistent with the previous
    labeling. That is, the line from vertex 1 to 2 can only have one label.
(3) If no consistent labeling is possible, back up.

Fig. 10.6. An object which we wish to label consistently.

In Fig. 10.7, we illustrate the labeling process of the 3D object shown in Fig. 10.6,
beginning with a choice of one possible labeling for the lines of vertex 1. Given the

Fig. 10.7. Possible labeling for the object in Fig. 10.6.

labeling of vertex 1, we can choose any of a set of labelings for the lines of vertex 2,
but all of them must have a + sign on the line between 1 and 2, as illustrated on the
second line of the figure. Now, choose one of those labelings of vertex 2 (let’s pick
the one on the left), and label vertex 3 in such a way that it is consistent with the
labeling of both 1 and 2. Now, we need to assume a “correct” interpretation of vertex
3 in order to label vertex 4. Again, choose the one on the left. In order to label the
lines coming into vertex 4, we must choose a labeling which is consistent with the
(assumed correct) labeling of vertex 3 and vertex 1. Since two of the lines coming in
to vertex 4 are now determined, there is only one way to label vertex 4 consistently,
and that is with an arrow coming in on the third line.
Suppose we had reached the labeling of vertex 4, and found there were no physi-
cally possible labelings. Then clearly one of the earlier assumptions was incorrect.
We now “back up,” and choose a new labeling for vertex 3. If we run through
all the possible labelings of the lines of vertex 3 and from none of them can we
find a consistent labeling of vertex 4, then we back up again and choose a new labeling of
vertex 2. We follow this approach until we either find a consistent labeling of the
entire object, or we fail to find one. If we fail, then the object cannot be consistently
labeled [10.7].
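The back-up strategy just described is naturally written as a recursive depth-first
search with backtracking. The following is a minimal sketch (not the authors’
implementation); the label set, the array layout, and the function consistent(),
which is assumed to test a candidate label against the physically possible vertex
labelings of Figs. 10.3–10.5, are all assumptions made for illustration.

/* Depth-first consistent labeling with backtracking (sketch). */
#define UNLABELED -1
enum { CONVEX, CONCAVE, ARROW_IN, ARROW_OUT, NLABELS };

/* Assumed test: is the tentative label of line k consistent with the
   physically possible vertex labelings (Figs. 10.3-10.5)? */
extern int consistent(int labels[], int nlines, int k);

int label_lines(int labels[], int nlines, int k)
{
    if (k == nlines)                  /* every line labeled: success */
        return 1;
    for (int lab = 0; lab < NLABELS; lab++) {
        labels[k] = lab;              /* try a label for line k */
        if (consistent(labels, nlines, k) &&
            label_lines(labels, nlines, k + 1))
            return 1;                 /* a consistent completion exists */
    }
    labels[k] = UNLABELED;            /* nothing worked: back up */
    return 0;
}

If label_lines returns 0 from the top level, the drawing cannot be consistently
labeled, exactly as in the discussion above.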
Since line labeling was originally developed, many researchers have added en-
hancements. For example, Parodi and Piccioli [10.13] demonstrate that if vanishing
points can be determined, and the 3D coordinates of a single point are known, the
3D coordinates of all the other labeled points may be found.

10.2 Relaxation labeling

Now that we have seen an application of consistent labeling, let’s generalize the
idea a bit by allowing a particular object to have more than one label at a time.
To accomplish this, we define the variable p_i(λ_j) to represent our confidence that
object i has label λ_j. Now some terminology needs to be discussed before we go
any further. You may hear p_i(λ_j) referred to as the probability that object i has label
λ_j. While sometimes used, that terminology is incorrect. Yes, p has the range of a
probability, 0 ≤ p_i(λ) ≤ 1; and yes, it sums to one, Σ_j p_i(λ_j) = 1, but there
is no random process here. Rather, p_i(λ_j) is our confidence that, knowing what we
do right now, object i has label λ_j. As we study the problem more and make more
use of consistency, hopefully these ps will change from assuming continuous values
between zero and one to becoming only zeros or ones.

10.2.1 Using consistency to update labelings


Linear relaxation

The first way one might approach the idea of using consistency is to set up a linear
system that takes into account the initial probabilities and the consistencies. We
define the compatibility of object i having label λ with object j having label λ′
as r_L(i, λ, j, λ′) (the subscript L denotes that this is used in the linear relaxation
algorithm) and require that 0 ≤ r_L ≤ 1 and

Σ_λ r_L(i, λ, j, λ′) = 1  for all i, j, λ′.    (10.1)

(Be careful! The definition of r will change in the next section.)
The linear relaxation process iteratively updates the label weights following

p_i(λ) = Σ_j Σ_λ′ r_L(i, λ, j, λ′) p_j(λ′).    (10.2)

Equation (10.2) is readily seen to be a multiplication of the matrix of consistencies


with the vector of weights. Repeated application of this equation can be shown
[10.15] to converge to a weight vector that is the eigenvector corresponding to the
eigenvalue unity. Under some circumstances, it is possible to learn the consistency
matrix [10.14]. Unfortunately, this does not really give us the kind of information
we want. How can the result be independent of the initial conditions?

Nonlinear relaxation

In nonlinear relaxation, we again make use of what we know about consistency to
update labelings. We will also develop an algorithm which drives the labelings to
zero or one, and which does depend on the measurements. First, we define a new
update rule. At iteration k + 1 of our algorithm, use

p_i^{k+1}(λ) = p_i^k(λ)[1 + q_i^k(λ)] / Σ_j p_i^k(λ_j)[1 + q_i^k(λ_j)].    (10.3)

Don’t worry about the denominator; it is just there to make sure the values of p_i stay
within the range of 0–1. The term q_i(λ) represents a measure of how compatible
p_i(λ) is with the labeling of all the other objects. While p is strictly positive, q can
be either positive or negative. If negative, q_i(λ) suggests that the current labeling of
object i by λ is incompatible with most other labelings,

q_i^k(λ) = Σ_j C_ij [ Σ_λ′ r_N(i, λ, j, λ′) p_j^k(λ′) ],    (10.4)

where this is a similar compatibility r(.) to the one we saw earlier in this chapter, but
with a subscript N to denote that we will use it in nonlinear relaxation. (Details of a
specific form for r are problem-dependent.) The only difference is that now we allow
r to take on values of not only −1 and 1, but all values in between. If two labelings
are completely consistent, their compatibility is said to be 1. If the labelings are
totally inconsistent, the compatibility is −1. If the labeling of one simply does not
affect the other, the compatibility is 0. Let’s examine what Eq. (10.4) means.
The CHANGE in our confidence that object i has label λ is simply the sum of how
compatible that label for object i is with all the other current labelings. Notice that
the compatibility is multiplied by the confidence that the other labeling is correct.
That is, if we have little confidence in a labeling of object j, we do not really care
how compatible it is with a labeling of object i. Finally, C_ij is there for
convenience. It simply weights the influence that object j has on object i, without
regard to labels. It might be zero, for example, if objects i and j have no influence on
each other, and we know that ahead of time. C_ij is optional; one could just as well
incorporate it into the compatibility function.
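To make the update concrete, here is a sketch of one iteration of Eqs. (10.3) and
(10.4) in C. The array sizes and the functions rN() and Cij() are assumptions for
illustration; in a real problem they encode the problem-dependent compatibilities
and influence weights.

#define N_OBJ 4     /* example sizes, assumed */
#define N_LAB 3

extern double rN(int i, int l, int j, int lp); /* compatibility in [-1,1], assumed */
extern double Cij(int i, int j);               /* influence weights; may be 0 for i == j */

void relax_once(double p[N_OBJ][N_LAB], double pnew[N_OBJ][N_LAB])
{
    for (int i = 0; i < N_OBJ; i++) {
        double q[N_LAB], denom = 0.0;
        for (int l = 0; l < N_LAB; l++) {      /* q_i(l), Eq. (10.4) */
            q[l] = 0.0;
            for (int j = 0; j < N_OBJ; j++) {
                double s = 0.0;
                for (int lp = 0; lp < N_LAB; lp++)
                    s += rN(i, l, j, lp) * p[j][lp];
                q[l] += Cij(i, j) * s;
            }
        }
        for (int l = 0; l < N_LAB; l++)        /* normalizer of Eq. (10.3) */
            denom += p[i][l] * (1.0 + q[l]);
        for (int l = 0; l < N_LAB; l++)
            pnew[i][l] = p[i][l] * (1.0 + q[l]) / denom;
    }
}

Iterating relax_once until the ps stop changing ideally drives the confidences
toward zeros and ones, as described above.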

10.2.2 Example labeling problems


In this section, we illustrate the choice of the compatibility function r for some
example problems.

A model matching problem

Assume you have done a segmentation of a range image into planes. Now, you wish
to find which of a set of models best matches the object being observed. Assume
the segmentation produces patches which are planar, and since the image is a range
image, we can compute the orientation of those planes in 3-space. The problem now
is to find the set of planar surfaces in the model (or collection of models) which best
matches the set of planar surfaces in the image. Patches in the image are the objects,
regions in the model(s) are the labels. One way to define compatibility of labeling
is as follows. Consider the compatibility of labeling image patch A as model region

Fig. 10.8. Four possibilities in the definition of compatibility of labeling between two patches
(A, B) and the corresponding two model regions (1, 2):
(1) The two patches in the image do not border each other, and neither do the two regions in
    the model; r = 0.
(2) The two patches in the image are adjacent; the two regions in the model are not; r = −1.
(3) The two patches in the image are not adjacent; the two regions in the model are; r = −1.
(4) The two patches in the image are adjacent, and so are the two regions in the model; r is a
    function of the angle.

1 with B as 2. Let us now consider: Does patch A border patch B? And does region
1 border region 2? There are four possibilities, illustrated in Fig. 10.8. If the two
regions in the image are adjacent, and the two regions in the model are also adjacent,
then we define the compatibility of the two labelings as

r_N(A, λ1, B, λ2) = cos(θ_AB − θ_{1,2})    (10.5)

where θ_AB denotes the angle between patches/regions A and B (remember, this is a
range image; these angles can be measured).
The point to emphasize here is that the definition of the compatibility function
is totally problem-dependent. The rest of relaxation labeling is a simple iteration
defined by Eqs. (10.3) and (10.4).
Fig. 10.9. Objects in frame 1 are represented by open circles; objects in frame 2 by filled
circles. The problem is to find the most consistent labeling of open circles with filled circles.

Another example: tracking of moving objects

Let us assume that we are tracking four-wheeled vehicles as they move. Our camera
can only record the position of the tires (it is a pretty weird camera). Our objective
then is to determine which tires in one image correspond to which tires in the next
image. This is illustrated in Fig. 10.9, in which the locations of the tires in frame n
are denoted by open circles and the locations in frame n + 1 by closed circles. In
this application, we have another labeling task. The objects (wheels) in frame n are
to be labeled with the labels (also wheels) in frame n + 1. So what shall we use as a
compatibility function? To figure that out, let’s think about some incorrect labelings.
For example, Fig. 10.10 illustrates a labeling with an arrow from object to label.
See if you think this makes sense.

Fig. 10.10. A labeling which probably is not correct.

If Fig. 10.10 is correct, then the front left tire went to the front right, and the
rear tires did the same. We can only interpret that as the car having flipped over, which,
while possible, we certainly hope did not actually happen. A more reasonable

interpretation is in Fig. 10.11, which shows that the arrows are almost parallel! Ah, that’s
it! We can use the cosine of the angle between the arrows,

r_N(i, m, j, p) = cos(θ(i, m) − θ(j, p))    (10.6)

where i and j are wheels in frame n and m and p are wheels in frame n + 1. Equation
(10.6) measures the consistency of assuming that wheel i is wheel m and wheel j is
wheel p.

Fig. 10.11. A labeling which makes more sense.

Although included in a notes version of this book several years earlier, this
concept was published in 1995 by Wu [10.19].
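As a sketch (our illustration, with assumed argument conventions), Eq. (10.6) can be
computed directly from the wheel coordinates using atan2:

#include <math.h>

/* Compatibility of the pairing (i -> m, j -> p), Eq. (10.6).
   (xi,yi) and (xj,yj) are wheel positions in frame n;
   (xm,ym) and (xp,yp) are the candidate positions in frame n+1. */
double track_compat(double xi, double yi, double xm, double ym,
                    double xj, double yj, double xp, double yp)
{
    double theta_im = atan2(ym - yi, xm - xi);  /* motion direction of wheel i */
    double theta_jp = atan2(yp - yj, xp - xj);  /* motion direction of wheel j */
    return cos(theta_im - theta_jp);            /* 1 if parallel, -1 if opposed */
}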

10.3 Conclusion

This chapter is all about consistency. We hope to have convinced the student that
the best way to fuse information from diverse sources is to seek labelings which are
consistent.
Optimization is formally used only in the next section, where an optimization is set
up and solved using a numerical optimization technique called conjugate gradient.
The conjugate gradient technique is not explained in this chapter – the reader is
referred to standard texts in numerical methods. However, the technique is in many
ways similar to gradient descent, but runs much faster.
Researchers continue to work on improving the concepts of consistent labeling,
including relaxation labeling [10.4]. Relaxation or similar algorithms have been
used in such diverse applications as optical character recognition [10.12] and edge
detection [10.15]. See [10.9, 10.16] for underlying theory.

10.4 Vocabulary

You should know the meanings of the following terms.

Compatibility
Concave edge
Consistent
Convex edge
Labeling
Linear relaxation
Local/global inference
Nonlinear relaxation
Occluded
Relaxation labeling

Assignment 10.1
OK. You have seen two examples of compatibility func-
tions. Now you have the opportunity to make up your
own. Here is the problem. You have applied an edge
detector to an image. At every pixel in the image, a
gradient has been taken and you know the magnitude and
direction of that gradient. Recognizing that some of
these measurements may be corrupted by noise and blur,
develop an application of relaxation labeling which
helps determine “real” edge pixels. Hint: A “real” edge
pixel would have its gradient vector point in the same
direction as neighboring edge pixels. Develop a com-
patibility function using this concept. Describe how to
use it. You may use pseudo-code or words or flowcharts,
or all three. Do not write actual software.

Topic 10A 3D Interpretation of 2D line drawings

As we have seen, interpretation of line drawings is a difficult problem. A single line drawing
shows only one view of a 3D object, and is therefore ambiguous. The ambiguity may be
resolved through the use of a set of stored 3D models. This approach requires prior knowledge
of what objects are likely to appear in a drawing, and may not give good interpretations of
novel drawings which do not correspond to any of the stored models.
Marill [10.11] proposes another approach for interpreting line drawings which requires no
models. He uses only one heuristic to generate a 3D wire-frame object from a 2D line drawing,
that is, a given 3D interpretation is considered less likely to be correct if some angles between
the wires are much larger than others. Specifically, of all the 3D models consistent with a
given 2D drawing, the preferred interpretation is the one with the least standard deviation of
angles (SDA) as defined in Eq. (10.7), where the angle θ is illustrated in Fig. 10.12.

SDA = √( n Σ_θ θ² − (Σ_θ θ)² ).    (10.7)

Fig. 10.12. Line and angle: θ is the angle at B(x_b, y_b, z_b) between the vector v_1
(toward A(x_a, y_a, z_a)) and the vector v_2 (toward C(x_c, y_c, z_c)).



Fig. 10.13. Left column: 2D line drawings. Right: 3D interpretation using the emulation method.

This algorithm can interpret a wide range of line drawings and seems to consistently generate
the same interpretations that a human does, even without any explicit models.
To simplify the problem, we square the objective function SDA and call it S. We would
like to find the third coordinates (z_i) of the points in the 2D picture which will minimize the
objective function S:

S = n Σ_θ θ² − (Σ_θ θ)².    (10.8)

The first derivative of S with respect to z_i is derived as Eq. (10.9),

∂S/∂z_i = 2n Σ_θ θ (∂θ/∂z_i) − 2 (Σ_θ θ)(Σ_θ ∂θ/∂z_i),    (10.9)

where the angle θ, formed by two vectors v_1 and v_2, can be computed by Eq. (10.10):

θ = cos⁻¹( v_1 · v_2 / (‖v_1‖ ‖v_2‖) )
  = cos⁻¹[ ((x_a − x_b)(x_c − x_b) + (y_a − y_b)(y_c − y_b) + (z_a − z_b)(z_c − z_b)) /
           (√((x_a − x_b)² + (y_a − y_b)² + (z_a − z_b)²) · √((x_c − x_b)² + (y_c − y_b)² + (z_c − z_b)²)) ].    (10.10)

Some 3D interpretation results of 2D line drawings are shown in Fig. 10.13. The optimization
problem is solved using conjugate gradient.
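As a concrete sketch (an illustration under assumed data structures, not Marill’s
code), the angle of Eq. (10.10) and the objective of Eq. (10.8) might be computed
as follows:

#include <math.h>

typedef struct { double x, y, z; } Point3;     /* assumed representation */

/* Angle at B formed by the vectors from B to A and from B to C, Eq. (10.10). */
double angle_at(Point3 A, Point3 B, Point3 C)
{
    double ux = A.x - B.x, uy = A.y - B.y, uz = A.z - B.z;
    double vx = C.x - B.x, vy = C.y - B.y, vz = C.z - B.z;
    double dot = ux * vx + uy * vy + uz * vz;
    double nu  = sqrt(ux * ux + uy * uy + uz * uz);
    double nv  = sqrt(vx * vx + vy * vy + vz * vz);
    return acos(dot / (nu * nv));
}

/* S = n * sum(theta^2) - (sum(theta))^2, Eq. (10.8), over a list of angles. */
double objective_S(const double *theta, int n)
{
    double s1 = 0.0, s2 = 0.0;
    for (int k = 0; k < n; k++) {
        s1 += theta[k];
        s2 += theta[k] * theta[k];
    }
    return n * s2 - s1 * s1;
}

An optimizer (conjugate gradient, or gradient descent) then adjusts the z coordinates
to minimize objective_S.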
Wang [10.18] has made several improvements on Marill’s original algorithm, most of
which are less computationally intensive, including the usage of the standard deviation of
segment magnitudes (DSM) as the objective function [10.3] and the application of gradient
descent to solve the minimization problem [10.2].

References

[10.1] F. Attneave, “Some Informational Aspects of Visual Perception,” Psychological
Review, 61(3), 1954.
[10.2] L. Baird and P. S. Wang, “3D Object Perception Using Gradient Descent,” Interna-
tional Journal of Mathematical Imaging and Vision, 5, pp. 111–117, 1995.
[10.3] E. W. Brown and P. S. Wang, “Why We See Three-Dimensional Objects: Another
Approach,” http://www.ccs.neu.edu/home/feneric/msdsm.html.
[10.4] W. Christmas, J. Kittler, and M. Petrou, “Structural Matching in Computer Vision
using Probabilistic Relaxation,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 17(8), 1995.
[10.5] M. Clowes, “On Seeing Things,” Artificial Intelligence, 2(1), 1971.
[10.6] J. Elder and S. Zucker, “Evidence for Boundary-specific Grouping,” Vision Research,
38(1), 1998.
[10.7] D. Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid, New York, Basic
Books, Inc., 1979.
[10.8] D. Huffman, “Impossible Objects as Nonsense Sentences,” in Machine Intelligence,
vol. 6, ed. B. Meltzer and D. Michie, Edinburgh University Press, 1971.
[10.9] R. Hummel and S. Zucker, “On the Foundations of Relaxation Labeling Processes,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(3), 1983.
[10.10] J. Koenderink, “What Does the Occluding Contour Tell Us About Solid Shape?”
Perception, 13, pp. 321–330, 1984.
[10.11] T. Marill, “Emulating the Human Interpretation of Line-drawings as Three-
dimensional Objects,” International Journal of Computer Vision, 6(2), pp. 147–161,
1991.

[10.12] J. Ohya, A. Shio, and S. Akamatsu, “Recognizing Characters in Scene Images,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2), 1994.
[10.13] P. Parodi and G. Piccioli, “3D Shape Reconstruction by Using Vanishing Points,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(2), 1996.
[10.14] M. Pelillo and M. Refice, “Learning Compatibility Coefficients for Relaxation
Labeling Processes,” IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 16(9), 1994.
[10.15] A. Rosenfeld, R. Hummel, and S. Zucker, “Scene Labeling Based on the Relaxation
Principle,” IEEE Transactions on Systems, Man, and Cybernetics, 6, pp. 420–433,
1976.
[10.16] P. Sastry and M. Thathachar, “Analysis of Stochastic Automata Algorithm for
Relaxation Labeling,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 16(5), 1994.
[10.17] M. Shpitalni and H. Lipson, “Identification of Faces in a 2D Line Drawing Projec-
tion of a Wireframe Object,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(10), 1996.
[10.18] P. S. Wang, “3D Line Drawing Image Analysis and Recognition,” Progress in Image
Analysis and Processing, 111, pp. 14–21, 1994.
[10.19] Q. Wu, “A Correlation-relaxation-labeling Framework for Computing Optical Flow –
Template Matching from a New Perspective,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(9), 1995.
11 Parametric transforms

Supposing I was on the other side of the glass, wouldn’t the orange still be in my right hand?
Lewis Carroll

This chapter discusses another approach to the solution of the local/global inference
problem, the use of parametric transformations. In this approach, we assume that the
object for which we are searching in the image may be described by a mathematical
expression, which in turn is represented by a set of parameters. For example, a
straight line may be written in slope–intercept form:

y = ax + b, (11.1)

where a and b are the parameters describing the line. Our approach is as follows:
Given a set of points (or other features), all of which satisfy the same equation, we
will find the parameters of that equation. In a sense, this is the same as fitting a
curve to a set of points, but as we will discover, the parametric transform approach
allows us to find multiple curves, without knowing a priori which point belongs
to which curve. We begin this process by considering the special case of finding
straight lines.

11.1 The Hough transform

Suppose you are tasked with the problem of finding the straight lines in the image
shown in Fig. 11.1. If only one straight line were present in the image, we could
use straight line fitting to determine the parameters of the curve. But we have two
line segments here. If we could segment this first, then we could fit each segment
separately – yes, this is a segmentation problem, but we are segmenting a boundary
into boundary segments rather than segmenting an image into regions. In this section,
we will learn how to do this.
Fig. 11.1. An image which is the output from an edge detector. The human can
immediately discern that this boundary consists of two straight segments.

First, let us prove an illustrative theorem.

Definition
Given a point in a d-space, and a parameterized expression defining a curve in that
space, the parametric transform of that point is the curve which results from treating
the point as a constant and the parameters as variables. For example, Eq. (11.1)

produces the parametric transform

b = y − xa (11.2)

which is itself a straight line in the 2-space a, b. Given the point x = 3, y = 5,
then the parametric transform is b = 5 − 3a.

Theorem
If n points in a 2-space are colinear, all the parametric transforms corresponding to
those points, using the form b = y − xa intersect at a common point in the space
a, b.
Proof
Suppose n points {(x1 , y1 ), (x2 , y2 ), . . . (xn , yn )} all satisfy the same equation

y = a 0 x + b0 . (11.3)

Consider two of those points, (x_i, y_i) and (x_j, y_j). The parametric transforms of the
points are the curves (which happen to be straight lines)

y_i = x_i a + b
y_j = x_j a + b    (11.4)

which we rewrite to make clear the fact that a and b are independent variables:

b = y_i − x_i a
b = y_j − x_j a.    (11.5)

Fig. 11.2. The intersection of two curves in the parameter space.

The intersection of those two curves in a, b is illustrated in Fig. 11.2.
Solving the two equations of Eq. (11.5) simultaneously results in

y_j − y_i = (x_j − x_i)a    (11.6)

and therefore a = (y_j − y_i)/(x_j − x_i).
We substitute a into Eq. (11.5) to find b,

b = y_i − x_i (y_j − y_i)/(x_j − x_i)    (11.7)

and we have the a and b values where the two curves intersect. However, we also
know from Eq. (11.3) that all the xs and ys satisfy the same curve. By performing
that substitution into Eq. (11.7), we obtain

b = (a_0 x_i + b_0) − x_i ((a_0 x_j + b_0) − (a_0 x_i + b_0))/(x_j − x_i)    (11.8)

which simplifies to

b = (a_0 x_i + b_0) − x_i a_0 = b_0.    (11.9)

Similarly,

a = (y_j − y_i)/(x_j − x_i) = ((a_0 x_j + b_0) − (a_0 x_i + b_0))/(x_j − x_i) = a_0.    (11.10)

Thus, for any two points along the straight line parameterized by a_0 and b_0, their
parametric transforms intersect at the point a = a_0, b = b_0. Since the transforms
of any two points intersect at that one point, all such transforms intersect at that
common point. QED.

Review of concept: Each POINT in the image produces a CURVE (possibly straight)
in the parameter space. If the points all lie on a straight line in the image, the
corresponding curves will intersect at a common point in parameter space.
Got that? Now, on to the next problem.

11.1.1 The problem with vertical lines


What about vertical lines? Whoops! The parameter a goes to infinity. This is not
good. Perhaps we need a new form for the equation of a straight line. Here is a good
one:

ρ = x cos θ + y sin θ.    (11.11)

Pick a value of ρ and θ. Hold those values constant. Then the set of points which satisfy
Eq. (11.11) can be shown to be a straight line. There is a geometric interpretation
of this equation which is illustrated in Fig. 11.3.
This representation of a straight line has a number of advantages. Unlike the use
of the slope, both of these parameters are bounded; ρ can be no larger than the largest
diagonal of the image, and θ need be no larger than 2π. A line at any angle may be
represented without singularity.
The use of this parameterization of a straight line solves one of the problems which
confronts us, the possibility of infinite slopes. The other problem is the calculation
of intersections. (There is a Whoops here. It is true that the maximum value of ρ is
the diagonal. However, when you start calculating ρ while varying θ, you will find
points with negative ρ. Rather than testing negative values and reflecting, it is easier
to just let ρ be negative, which makes the accumulator twice as big.)

Fig. 11.3. In the ρ, θ representation of a line, ρ is the perpendicular distance of the line from the
origin, and θ is the angle made with the x axis.

11.1.2 How to find intersections – accumulator arrays


It is not feasible to find all intersections of all curves, and then to determine which of
those are close together. Instead, we make use of the concept of an accumulator array.
To create an accumulator array, we make an image, say 360 columns by 512 rows.
We initialize each of the pixels to zero. From now on, we will refer to the pixels of
this special image as accumulators. Fig. 11.4 illustrates plotting three straight lines
through a very small accumulator array using the following algorithm. For each
point x_i, y_i in the edge image:

(1) for all values of θ compute ρ;
(2) at the point ρ, θ in the accumulator array, increase the value at that point by 1.

Fig. 11.4. Three lines plotted through an accumulator array. The lines themselves are
illustrated. Each accumulator is incremented whenever a line passes through that accumulator.

This algorithm results in multiple increments of those accumulators corresponding
to intersections. Thus the peaks in the accumulator array correspond to multiple
intersections, and hence to proper parameter choices.
Figure 11.5 illustrates an image with two straight edges and the corresponding
Hough transform, where the brightness of each pixel is the value of an accumulator.


Fig. 11.5. (a) An image with two line segments which have distinctly different slopes and
intercepts, but whose actual positions are corrupted significantly by noise. (b) The
corresponding Hough transform.
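A minimal implementation of the voting loop might look like the following sketch;
the image size, the accumulator layout, and the constants are assumptions for the
example. Note that ρ is allowed to be negative, so the accumulator index is offset
by RMAX.

#include <math.h>

#define ROWS   512
#define COLS   512
#define NTHETA 360                 /* one-degree angular resolution */
#define RMAX   725                 /* > 512*sqrt(2): largest possible |rho| */
#define PI     3.14159265358979

/* Vote into acc[rho + RMAX][theta] for every edge pixel in edge[][]. */
void hough_lines(const unsigned char edge[ROWS][COLS],
                 int acc[2 * RMAX + 1][NTHETA])
{
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            if (!edge[r][c]) continue;           /* only edge pixels vote */
            for (int t = 0; t < NTHETA; t++) {
                double th  = t * PI / 180.0;
                int    rho = (int)lround(c * cos(th) + r * sin(th));
                acc[rho + RMAX][t]++;            /* rho may be negative */
            }
        }
}

The peaks of acc then give the (ρ, θ) parameters of the strongest lines.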

11.2 Reducing computational complexity

The computational complexity of parametric transforms can be quite high. The
Hough is sometimes referred to as the “coffee transform,” because you can command
your computer to perform a Hough transform, and drink an entire cup of coffee before
it finishes. For example, suppose we are performing a Hough transform in which the
size of the image is 512 × 512, and we wish an angular resolution of one degree. Our
accumulator array is thus (512 × 2√2) × 360. (Be sure you understand where this
accumulator size comes from. Remember, ρ can be negative.) Our iteration consists
of calculating ρ for each value of θ, so we calculate and increase 360 accumulators
for each edge pixel in the image.

11.2.1 Using gradient information


One way to reduce the computation is to observe that the edge points in the image
are often the output of some gradient operator, and if we have both the magnitude
and direction of that gradient, we can make use of that information to reduce the
computational complexity. To see this, refer to Fig. 11.3. If we know the gradient
at a point, we know the direction of the edge, and we therefore know θ. Thus, we need
only compute one value of ρ, instead of 360 of them, and we need increase only one
accumulator – a 360 : 1 speedup. Not bad!
Of course, (you should be used to caveats by now) there are some problems with
this method. The first is that the gradient direction returned by most gradient operators
is not particularly accurate. Therefore, the one cell which you are incrementing may
not be precisely located. There is an easy solution to that: The trick is to not increase
a point, but to increase a neighborhood. For example, you might increase the point
calculated by two, and increment the neighborhood of that point by one. (Yes, that
is a Gaussian of sorts.)
You can also use the magnitude of the gradient in the process of incrementing
the accumulator array. In the description presented earlier, we suggested that you
threshold the gradient image, and increment the accumulator array by 1 at the points
corresponding to the edge points in the image. Another approach is to increment by
the magnitude of the gradient, or a value proportional to the magnitude. Of course,
this requires that you use a floating point representation for the accumulator array,
but that should be no problem.
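Combining the two heuristics above gives something like the following sketch (our
illustration, reusing the constants of the previous sketch; the neighborhood weights
are an assumption):

#include <math.h>

#define NTHETA 360
#define RMAX   725
#define PI     3.14159265358979

/* One vote per edge pixel: theta is taken from the gradient direction, so
   only one rho need be computed.  A small neighborhood of the accumulator
   cell is incremented, weighted by the gradient magnitude mag. */
void hough_vote_gradient(double acc[2 * RMAX + 1][NTHETA],
                         int x, int y, double grad_dir, double mag)
{
    int t = (int)lround(grad_dir * 180.0 / PI);
    t = ((t % NTHETA) + NTHETA) % NTHETA;        /* wrap into [0, NTHETA) */
    int rho = (int)lround(x * cos(grad_dir) + y * sin(grad_dir)) + RMAX;

    for (int dr = -1; dr <= 1; dr++)
        for (int dt = -1; dt <= 1; dt++) {
            int rr = rho + dr, tt = (t + dt + NTHETA) % NTHETA;
            if (rr < 0 || rr > 2 * RMAX) continue;
            /* the central cell gets full weight, its neighbors half */
            acc[rr][tt] += (dr == 0 && dt == 0) ? mag : 0.5 * mag;
        }
}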
In summary, you can trade off computation in image space (computation of mag-
nitude and direction of the gradient) for computation in parameter space, and this
trade off can result in substantial speedups. You will see this again later in this
chapter. Other heuristics [11.13] can also result in speedups.

11.3 Finding circles

A generalization of the Hough transform can be used to encompass detection of


segments of circles. The practicality of the technique is ensured by use of low-
dimensionality accumulator arrays.

11.3.1 Derivation of the location of a circle represented by any three noncolinear pixels
Let the array T contain the brightness values for the candidate edge pixels. The
integer Trc represents the magnitude of the edge pixel at row r and column c. Given
any three noncolinear points in T, this section will review how the center and radius
of a circle passing through those three points can be found. In section 11.3.2 this
result will be integrated into a parametric transform which will identify the most
likely circular edge segment.
Suppose the perpendicular bisectors, B01 and B12, of any two chords intersect
at some finite point C (see Fig. 11.6). It is easy to show that C is the center of the
circle containing P0, P1, and P2. Let L0C be the line segment from P0 to C, L1C be
the line segment from P1 to C, and L2C be the line segment from P2 to C. Since C
is on the bisector B01, P0 and P1 are equidistant from C; since C is also on B12,
P1 and P2 are equidistant from C. Therefore R, the length of L0C, equals the length
of L1C, which in turn equals the length of L2C. The three points are equidistant
from the same point, C, and hence satisfy the definition of a circle with center C
and radius R.

Fig. 11.6. Radii of the circle containing P0, P1, and P2.

In the case where the bisectors B01 and B12 are parallel, they will not have a finite
intersection. In that case lines L01 and L12 have equal slope and the points P0, P1, and
P2 are colinear. Any three noncolinear points, however, will lie on a circle, with the
center found by intersection of the perpendicular bisectors of line segments joining
the points and the radius equal to the distance from the center to one of the points.
The circle containing P0, P1, and P2 is defined by the following equation in the x–y
plane:

(x − h)² + (y − k)² = R²,    (11.12)

where C = (h, k) is the center of this circle with radius R.


But how do we use the techniques of parametric transformations to find the
parameters, given that we know something (or nothing, for that matter) about the circle?

11.3.2 Finding circles when the origin is unknown but the radius is known
Equation (11.12) describes an equation in which x and y are assumed to be variables,
and h, k, and R are assumed to be constants. As before, let us rewrite this equation,
interchanging the roles of parameters and variables:

(h − x_i)² + (k − y_i)² = R².    (11.13)

In the space (h, k), what geometric shape does this describe? You guessed it, a circle.
Each point in image space (x_i, y_i) produces a curve in parameter space, and if all those
points in image space belong to the same circle, where do the curves in parameter
space intersect? You should be able to figure that one out by now.
Now, what if R is also unknown? It is the same problem; however, instead of
allowing h to range over all possible values and computing k, we must now allow
both h and k to range over all values and compute R. (However, that might not be
the most efficient way to do it. Think about it . . .) We now have a three-dimensional
parameter space. Allowing two variables to vary and computing the third defines
a surface in this 3-space. What type of surface is this (an ellipse, a hyperboloid, a
cone, a paraboloid, a plane)?

11.3.3 Using gradient information to reduce computation in finding circles


Suppose we know we have only one circle, or portion of a circle, in the image. How can
we find the center with a minimum of computational complexity? Here is the trick:
At every edge point, compute the gradient vector. The accumulator is isomorphic
to the image. At every edge point, walk out the gradient vector, incrementing the
accumulator as you go. Just as before, the maximum of the accumulator will indicate
the center of the circle. Fig. 11.7 illustrates this.

Fig. 11.7. All the extensions of the gradient vectors tend to intersect at a point.

What if you know the radius? If you know the direction (from the gradient), and
you know the distance (from knowing the radius), then you know the location of the
center of the circle (at least to within an inside/outside ambiguity). Let’s say you
have a point at (x_i, y_i) which you believe might be on a circle of radius R, and assume
the gradient at that point has a magnitude of M and a direction of θ, as illustrated in
Fig. 11.8. Then, the location of the center of the circle is at

x_0 = x_i − R cos θ
y_0 = y_i − R sin θ.    (11.14)

Again, the accumulator array should be incremented over a neighborhood rather than
just a single accumulator, and it will prove efficacious to increment by an amount
proportional to M.

Fig. 11.8. The point x_i, y_i is proposed to lie on a circle of radius R. The gradient points in the
direction of the arrow. If the circle is known to be dark relative to the background, the
center can be found by taking a step of length R in the direction opposite to the
gradient. If the center/surround contrast is opposite, then the step should be taken in
the direction of the gradient. If the contrast is not known a priori, steps can be taken
(and accumulators incremented) in both directions.
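In code, Eq. (11.14), with the inside/outside ambiguity handled as in the caption of
Fig. 11.8, might be sketched as follows (image size and accumulator layout assumed):

#include <math.h>

#define ROWS 512
#define COLS 512

/* Vote for circle centers of known radius R.  (x, y) is an edge point and
   theta its gradient direction.  If the contrast is unknown, step both with
   and against the gradient, per Eq. (11.14). */
void vote_circle_center(int acc[ROWS][COLS], double x, double y,
                        double theta, double R)
{
    for (int sign = -1; sign <= 1; sign += 2) {
        int x0 = (int)lround(x + sign * R * cos(theta));
        int y0 = (int)lround(y + sign * R * sin(theta));
        if (x0 >= 0 && x0 < COLS && y0 >= 0 && y0 < ROWS)
            acc[y0][x0]++;          /* accumulator isomorphic to the image */
    }
}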

11.4 The generalized Hough transform

So far, we have assumed that the shape we are seeking can be represented by an
analytic function, representable by a set of parameters. The concepts we have been
using, of allowing data components which agree to “vote,” can be extended to
generalized shapes. Initially we suppose we have an arbitrarily shaped region, and
suppose we know the orientation, shape, and zoom. Our first problem is to figure
out how to represent this object in a manner which is suitable for use by Hough-like
methods. The following is one such approach [11.2]:
First, define some reference point. The choice of a reference point is arbitrary,
but the center of gravity is convenient. Call that point O. For each point P_i on the
boundary, calculate both the gradient vector at that point and the vector OP_i from
the reference to the boundary point. Quantize the gradient direction into, say, n
values, and create a table with n rows. Each time a point P_j on the boundary has a
gradient direction with value G_i (i = 1, . . . , n), a new column is filled in on row i,
containing OP_j. Thus, the fact that multiple points on the boundary may have
identical gradient directions is accommodated by placing a separate column in the
table for each entry. In Fig. 11.9, a shape is shown, and three entries in the R-table
are illustrated in Table 11.1.

Fig. 11.9. Boundary points P1 and P2 have gradient vectors which have identical
directions, and therefore correspond to entries on the same row of the R-table. P3
is an entry on the second row of the R-table.

To utilize such a shape representation to perform shape matching and location,
we use the following algorithm.

(1) Form an accumulator array, which will be used to hold candidate locations of
    the reference point. Initialize the accumulator to zero.
(2) For each edge point, P_i, do the following.
    (2.1) Compute the gradient direction, and determine which row of the R-table
          corresponds to that direction.
    (2.2) For each entry, j, on that row:
          (a) compute the location of the candidate center by adding the stored
              vector to the boundary point location: A = T[i, j] + P_i;
          (b) increment the accumulator determined by A.

Table 11.1. The R-table corresponding to the three boundary points of Figure 11.9.

Gradient direction    Vector from boundary point    Vector from boundary point
(in degrees)          to reference point            to reference point
0                     (−1, −0.1)                    (−1, 0.5)
300                   (−0.6, 1.1)                   empty
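A sketch of step (2) of the algorithm, with the R-table stored as an array of vector
lists, is shown below; all data structures here are assumptions made for illustration.

#include <math.h>

#define ROWS 512
#define COLS 512
#define PI   3.14159265358979

typedef struct { double dx, dy; } Vec2;

typedef struct {              /* one row of the R-table */
    int   nentries;
    Vec2 *entry;              /* vectors from boundary point to reference */
} RTableRow;

/* One edge pixel (px, py) with gradient direction grad_dir casts one vote
   per entry on its R-table row: A = T[i, j] + Pi. */
void ghough_vote(int acc[ROWS][COLS], const RTableRow *rtable, int nrows,
                 int px, int py, double grad_dir)
{
    int row = (int)(grad_dir / (2.0 * PI) * nrows);   /* quantize direction */
    row = ((row % nrows) + nrows) % nrows;
    for (int j = 0; j < rtable[row].nentries; j++) {
        int ax = px + (int)lround(rtable[row].entry[j].dx);
        int ay = py + (int)lround(rtable[row].entry[j].dy);
        if (ax >= 0 && ax < COLS && ay >= 0 && ay < ROWS)
            acc[ay][ax]++;        /* vote for the reference point location */
    }
}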

11.5 Conclusion

In this chapter, another approach to analysis of consistency was introduced in the
form of the Hough transform and its descendants, all of which use accumulator arrays.
(Accumulator arrays enforce consistency.) An accumulator array allows for easy
detection of consistency, since “things” which are consistent all add to the same
accumulator, or at least to nearby accumulators. By constructing the accumulator
array by adding hypotheses, we also gain a measure of noise immunity, since
inconsistent solutions tend not to contribute to the globally consistent solution.

11.6 Vocabulary

You should know the meanings of the following terms.

Accumulator array
Generalized Hough transform
Hough transform
Parametric transform

Topic 11A Parametric transforms

11A.1 Finding parabolae

Wechsler and Sklansky [11.12] have developed one approach to the problem of finding
parabolic curves in images, as described below.
A parabola is the locus of points each of whose distance from a fixed point, the focus, is
equal to its distance from a fixed straight line, the directrix, as illustrated in Fig. 11.10.

x² = (x − 2a)² + y²    (11.15)

Fig. 11.10. A parabola.



or

y² = 4a(x − a).    (11.16)

Differentiate Eq. (11.16) with respect to x:

β = tan λ = dy/dx = √(a/(x − a))    (11.17)

and solve for a:

a = xβ²/(1 + β²).    (11.18)
In Figure 11.10, use the right-angled triangle PAF to determine

tan θ = −tan(π − θ)    (11.19)

and

tan θ = y/(x − 2a) = √(4a(x − a))/(x − 2a).    (11.20)

Substituting for a,

tan θ = 2 tan λ/(1 − tan²λ) = tan 2λ.    (11.21)

The solution to Eq. (11.21) is

θ = 2λ.    (11.22)

Referring again to Figure 11.10, obtain

Δx = d cos(π − θ) = −d cos θ    (11.23)

and

Δy = −d sin(π − θ) = −d sin θ    (11.24)

x_F = x_p + Δx    (11.25)
y_F = y_p + Δy.    (11.26)

x_F and y_F can be used as parameters for an accumulator array.


The derivation given in Eqs. (11.17)–(11.26) provides a mechanism for using accumulation
arrays to detect parabolae. It is, however, very restrictive. The assumptions are as follows:
r there is only one parabola in the field of view
r the parabola is symmetric about a horizontal line.

Equation (11.16) assumes the focus lies at x = a. This technique will work if only one
parabola lies in the field of view, for then the position of the origin is arbitrary. In the more
general case, however, we must make the location of the origin explicit.
The derivation of Wechsler and Sklansky [11.12] can be generalized as follows to overcome
these difficulties.
Initially, we continue to assume the parabola is symmetric about a horizontal line,
but having an origin at an arbitrary point (x0, y0). The parabolic equation becomes

Fig. 11.11. Parabola with arbitrary reference.

(see Fig. 11.11)

(y − y0)² = 4a(x − x0).    (11.27)

Differentiating, as in Eq. (11.17), yields

β = tan λ = dy/dx = √(a/(x − x0))    (11.28)

and

a = β²(x − x0).    (11.29)

Substituting Eq. (11.29) into Eq. (11.27),

(y − y0)² = 4β²(x − x0)².    (11.30)

Taking the square root of both sides,

(y − y0) = 2β(x − x0).    (11.31)

Equation (11.31) describes a straight line in the x0 , y0 parameter space. Thus, by making use
of the local derivative, the 3-parameter problem of Eq. (11.27) has been reduced to a much
more tractable two-dimensional problem.
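Eq. (11.31) can be used exactly like Eq. (11.2): each edge point, together with its
measured slope β, votes along a straight line in the (x0, y0) accumulator. A sketch
(accumulator layout assumed):

#include <math.h>

#define ROWS 512
#define COLS 512

/* Each edge point (x, y) with measured slope beta = tan(lambda) votes along
   the straight line of Eq. (11.31) in the (x0, y0) accumulator. */
void parabola_vote(int acc[ROWS][COLS], double x, double y, double beta)
{
    for (int x0 = 0; x0 < COLS; x0++) {
        int y0 = (int)lround(y - 2.0 * beta * (x - x0));
        if (y0 >= 0 && y0 < ROWS)
            acc[y0][x0]++;
    }
}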
It should be obvious that shapes other than circles and parabolae can be found using this
approach [11.3].

11A.2 Finding the peak

The peak of the accumulator array is, hopefully, at or near the true parameter value, but in
general may be displaced by noise in the original data. There are several ways to approach
this problem, including some sophisticated methods [11.4, 11.11] and some simple ones.
In general, however you look at it, this is a clustering problem [11.4], so any algorithm
which does clustering well is also good at peak finding. Simple techniques such as k-means,
described in section 15.2.2, perform well.

Clustering is the process of finding natural groupings in data. In this chapter the obvious
application is to find the best estimate of the peak(s) of the accumulator; however, these ideas
are much broader in applicability than just finding the mode of a distribution. For example,
McLean and Kotturi [11.9] use clustering [11.5] to locate the vanishing points in an image.
Clustering is discussed in detail in Chapter 15.

11A.3 The Gauss map

The so-called “Gauss map” provides an effective way to represent operations in range
images. It is philosophically a type of parametric transform. The concept is quite simple:
First, tessellate the surface of a sphere into whatever size patches you wish. This tessellation
will determine the resolution of the map. Associate a counter (accumulator) with each cell of
the tessellation. Now consider a range image. In the range image, calculate the surface normal
at each pixel, and increment an accumulator of the Gauss map which has the same surface
normal. This provides essentially a histogram of normal vectors, which in turn can be used
to recognize the orientation of the object in the range image.
Since curvature represents rates of change of normal direction, the Gauss map can be
related back to curvature. The invertibility of the Gauss map and invariance under rotation
and translation are discussed in more detail in [11.7]. The map also finds application in
identifying the vanishing points in an image [11.8].

11A.4 Parametric consistency in stereopsis

As we discussed in section 4.2.2, the correspondence problem prohibits simple use of
stereopsis to provide three-dimensional information about the world. In this section, we take a
closer look at this problem and provide a partial solution, based on accumulator arrays.
Consider the two-camera stereo problem. Recall from Fig. 4.9 that the disparity is the
distance in pixels between two corresponding pixels, z is the distance to the point corresponding
to those two pixels, and B is the baseline (the distance between the two cameras). Then

d = BF/z    (11.32)

where F is the focal distance of either of the cameras (which are assumed to have the same
focal length). The hard problem, of course, is to determine which pixels correspond to the
same point in 3-space. Suppose we extract a small window from the leftmost image and
use the sum of squared differences (SSD) to template match that small window along a
horizontal line in the other image. We could graph that objective function vs. the disparity,
or the inverse distance (1/z). We find it convenient to use the inverse distance, and we will
typically find that the match function has multiple minima, as illustrated in Fig. 11.12.

Fig. 11.12. The result of matching a template extracted from one image in a stereo pair
with a second image, as a function of the inverse distance to the matching point.

If we take an image from a third or fourth camera, with a different baseline from camera 1,
we find similar nonconvex curves. However, all the curves will have minima at the same
point, the correct disparity. Immediately, we have a consistency! We form a new function,
the sum of such curves taken from multiple baseline pairs, and find that this new function
(called the SSD-in-inverse-distance) has a sharp minimum at the correct answer. Okutomi and

Kanade [11.10] have proven that this function always exhibits a clear minimum at the correct
matching position, and that the uncertainty of the measurement decreases as the number of
baseline pairs increases.
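The idea can be sketched as follows; ssd_at(), which returns the SSD of the template
match implied by inverse distance zeta for baseline pair k, is an assumed helper, and the
search is a simple scan rather than anything sophisticated.

/* Sum the SSD curves over all baseline pairs and return the minimizing
   inverse distance -- the SSD-in-inverse-distance of Okutomi and Kanade. */
extern double ssd_at(int pair, double zeta);   /* assumed: SSD at 1/z = zeta */

double best_inverse_distance(int npairs, double zmin, double zmax, int nsteps)
{
    double best = zmin, bestval = 1e300;
    for (int s = 0; s <= nsteps; s++) {
        double zeta = zmin + s * (zmax - zmin) / nsteps;
        double sum  = 0.0;
        for (int k = 0; k < npairs; k++)
            sum += ssd_at(k, zeta);            /* the minima align at the truth */
        if (sum < bestval) { bestval = sum; best = zeta; }
    }
    return best;
}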

11A.5 Conclusion

The general concept of the parametric transform seeks consistency! That is the key here. Many
points which are in some sense consistent contribute to the same cell in the accumulator –
they vote, in a sense. Hopefully, noise averages out in this voting process, and we can arrive
at consistent solutions.
In computed tomography (CT), the signal measured is a line integral along the ray from
the x-ray source to the x-ray detector. The line along which the integral is performed may be
represented using the ρ–θ parameterization of a straight line:

R(ρ, θ) = ∫ f(x(s), y(s)) δ(ρ − (x(s) cos θ + y(s) sin θ)) ds.    (11.33)

Examination of Eq. (11.33) leads one to the immediate conclusion that the Hough transform
may be formally represented by the Radon transform, and that, except for applications, they
are the same transform.
In addition to the use of these transforms for identifying specific shapes, Leavers [11.6] has
shown that if one considers the shape of the parameter space, rather than simply the location
of the peaks, one may determine the convex hull and several shape parameters of regions.
A fascinating alternative to the Hough transform was proposed by Aghajan and Kailath
[11.1], using wavefront propagation. The idea is to think of each pixel as a radio transmitter
emitting signals which are detected by receivers at the end of each row. Using the mathematics
of direction of arrival signal processing, they show it is possible to detect straight lines with
a computational complexity significantly less than the conventional Hough transform. This
idea of wavefront propagation makes sense as a paradigm for how the brain detects straight
lines.

11A.6 Vocabulary

You should know the meanings of the following terms.

Gauss map
Parabola
Radon transform
SSD

Assignment 11.1
In the directory named “leadhole” are a set of images of
wires coming through circuit board holes. The holes are
roughly circular and black. Use parametric transform meth-
ods to find the centers of the holes. This is a project and

will require a formal write up. Process as many images as
possible. If your method fails in some cases, discuss why.

Assignment 11.2
You are to use the generalized Hough transform approach to
both represent an object and to search for that object in
an image. It turns out that the object is a perfect square,
centered at the origin, with sides two units long, but you do
not know that ahead of time. You only have five points, those
at (0,1), (1,0), (1, 0.5), (-1,0), and (0, -1). Fill out
the “R-table” which will be used in the generalized Hough
transform of this object. (Table 11.2 contains four rows;
that is just a coincidence. You are not required to fill them
all in, and if you need more rows, you can add them.)

Table 11.2. The R-table.

?        p1(x, y)        p2(x, y)

Assignment 11.3
Let P1 = [x1 , y1 ] = [3, 0] and P2 = [x2 , y2 ] = [2.39, 1.42] be two
points, both of which lie approximately on the same disk.
We do not know a priori whether the disk is dark inside or
bright. The image gradients at P1 and P2 are 5∠0 and 4.5∠π/4
(using polar notation).
Use Hough methods to estimate the location of the center of
the disk, and radius of the disk, and determine whether the
disk is darker or brighter than the background.

References

[11.1] H. Aghajan and T. Kailath, “SLIDE: Subspace-based Line Detection,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 16(11), 1994.
[11.2] D. Ballard, “Generalizing the Hough Transform to Detect Arbitrary Shapes,” Pattern
Recognition, 13(2), 1981.
[11.3] N. Bennett, R. Burridge, and N. Saito, “A Method to Detect and Characterize Ellipses
Using the Hough Transform,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 21(7), 1999.

[11.4] Y. Cheng, “Mean Shift, Mode Seeking, and Clustering,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(8), 1995.
[11.5] T. Hofmann and J. Buhmann, “Pairwise Data Clustering by Deterministic Anneal-
ing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1997.
[11.6] V. Leavers, “Use of the Two-dimensional Radon Transform to Generate a Taxonomy
of Shape for the Characterization of Abrasive Powder Particles,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 22(12), 2000.
[11.7] P. Liang and C. Taubes, “Orientation-based Differential Geometric Representations
for Computer Vision Applications,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(3), 1994.
[11.8] E. Lutton, H. Maître, and J. Lopez-Krahe, “Contribution to the Determination of
Vanishing points using the Hough Transform,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(4), 1994.
[11.9] G. McLean and D. Kotturi, “Vanishing Point Detection by Line Clustering,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(11), 1995.
[11.10] M. Okutomi and T. Kanade, “A Multiple-Baseline Stereo”, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 15(4), 1993.
[11.11] J. Princen, J. Illingworth, and J. Kittler, “Hypothesis Testing: A Framework for
Analyzing and Optimizing Hough Transform Performance,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 16(4), 1994.
[11.12] H. Wechsler and J. Sklansky, “Finding the Rib Cage in Chest Radiographs,” Pattern
Recognition, 9, pp. 21–30, 1977.
[11.13] A. Ylä-Jääski and N. Kiryati, “Adaptive Termination of Voting in the Probabilistic
Circular Hough Transform,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(9), 1994.
12 Graphs and graph-theoretic concepts

Functions are born of functions, and in turn, give birth or death to others. Forms emerge from
forms and others arise or descend from these.
L. Sullivan

You have already seen the use of graph-theoretic terminology in connected compo-
nent labeling in Chapter 8. The way we used the term “connected components” in
the past was to consider each pixel as a vertex in a graph, and think of each vertex as
having four, six, or eight edges to other vertices (that is, four-connected neighbors,
six neighbors if hexagonal pixels are used, and eight-connected neighbors). However,
we did not build elaborate set-theoretic or other data structures there. We will do
so in this chapter. The graph-matching techniques discussed in this chapter will be
used a great deal in Chapter 13.

12.1 Graphs

A graph is a relational data structure. It consists of data elements, referred to as
vertices or nodes, and relationships between vertices, referred to as edges.
Graphs may be completely described by sets. The set of vertices is a simple
set, and the edges form a set of ordered pairs. For example, let G = ⟨V, E⟩
represent a graph, where the vertex set is V = {a, b, c, d, e, f} and the edge set is
E = {(a, b), (b, c), (a, c), (b, e), (d, f)}. Graphs may also be represented pictorially.
The pictorial representation of this example is drawn in Fig. 12.1. There are
two connected components in this graph.
Fig. 12.1. A graph with six nodes, five edges, and two connected components.

This example allows us to elaborate on the concept of an edge just a bit more. In
general, graphs may be directed. That is, the relations which are represented by edges
may have a direction. Consider for example the relations “above” and “adjacent to.”
Clearly ABOVE(a, b) ≠ ABOVE(b, a), and so ABOVE would be represented by a
directed graph, whereas the relation ADJACENT TO has no preferred direction. The
pictorial version of a directed graph may be thought of as having arrowheads on the
ends of the edges. Let’s redo the graph example given above, and include ordered
pairs with both directions in them: G = ⟨V, E⟩ where V = {a, b, c, d, e, f} and
E = {(a, b), (b, a), (b, c), (c, b), (a, c), (c, a), (b, e), (e, b), (d, f), (f, d)}. This is
a properly done, undirected graph. We define a graph as undirected if
∀a, b ∈ V: (a, b) ∈ E ⇔ (b, a) ∈ E. Otherwise the graph is directed (or, in some
special cases, partially directed – a seldom-used term).

12.2 Properties of graphs

In this section, we define the vocabulary we will need to further our discussion of
graphs.
• The degree of a node is the number of edges coming into that node.
• A path between nodes v_0 and v_l is a sequence of nodes v_0, v_1, . . . , v_l such that
  there exists an edge between v_i and v_{i+1} for all i.
• A graph is connected if there exists a path between any two nodes.
• A clique (remember this word, you will see it again) is a subgraph in which there
  exists an edge between any two nodes.
• A tree is a graph which contains no loops. Tree representations have found
  application in speeding up Markov random field applications [12.8].

12.3 Implementing graph structures

The first data structure used to implement graph structures in computers was the
linked list, as shown in Fig. 12.2. Such a data structure contains two types of data:
• nodes, which consist of two pointers (addresses), and
• atoms, which are data.

Atoms are distinguished from nodes by a single bit, usually the most significant
bit of the computer word. Certain nodes contain a zero in their right half, indicating
the end of the list. Such nodes are indicated in Fig. 12.2 by the cross in the right-hand
half. The linked list can also be used to store computer instructions, thus providing
a powerful mechanism for automatic programs. This was the foundation of the
programming language LISP.

Fig. 12.2. In a linked list, each node contains two pointers. A pointer with the value of zero
indicates the end of a list.

Fig. 12.3. A more general data structure may contain both data and pointers.

The concept of the linked list was incorporated into more modern programming
languages and extended to allow more generality. For example, a structure such as
that illustrated in Fig. 12.3 can contain data and pointers to other data structures of
the same or different types. The following C definition describes the data structure
of Fig. 12.3.
struct patch
{
    int area;                  /* data fields */
    int perimeter;
    struct patch *left;        /* pointers to other patch structures */
    struct patch *right;
};

The C programmer will recognize struct patch * as denoting a pointer to another
structure of type patch.
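For instance (a purely hypothetical fragment, building on the structure just defined),
two such records could be allocated and linked like this:

#include <stdlib.h>

/* Hypothetical usage: allocate and link two patch records. */
struct patch *make_pair(void)
{
    struct patch *a = malloc(sizeof *a);
    struct patch *b = malloc(sizeof *b);
    a->area = 120; a->perimeter = 44;    /* made-up measurements */
    b->area = 75;  b->perimeter = 36;
    a->left  = b;                        /* a refers to b */
    a->right = NULL;
    b->left  = b->right = NULL;          /* b points nowhere */
    return a;
}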

12.4 The region adjacency graph

In model matching, we will make use of the region adjacency graph (RAG) as
a means for identifying how the regions in a segmented image match (or do not
match) faces in a three-dimensional model. A model of a polyhedron with six faces
is illustrated in Fig. 12.4.
The RAG for the object of Fig. 12.4 is illustrated in Fig. 12.5; Fig. 12.6 shows
another example.

(Recall that a polyhedron has all flat faces.)

Fig. 12.4. A polyhedron with six faces.



(What a mess! Do you think there is a planar way to draw this graph? That is, can
you draw it without any lines crossing?)

Fig. 12.5. The RAG of the three-dimensional object in Fig. 12.4.

Fig. 12.6. The image and its RAG.

Now here is the problem: Given an observation and the RAG derived from that observation, and given a collection of models and their corresponding graphs, which model best matches the observation? We will address this matching problem later.
Other graph representations are possible and often useful, for example, the con-
structive solid geometry (CSG) community uses a collection of primitives subjected
to transformations to represent input to automatic parts manufacturing systems. The
primitives are objects like spheres and cylinders. Methods have been developed
[13.8] to match scenes to models constructed from CSG representations as well as
to RAG representations.

12.4.1 The scene graph


First, let’s introduce a term to help with vocabulary. When talking about models,
we use the term “region” to represent an identifiable surface/face, etc. When we are
talking about the output of the segmenter, that is, what one might consider a “region”
in the observation, we generally use the term “patch.”
The RAG does not contain a lot of information, only that there is a region and it
is adjacent to certain other regions. We can produce a graph containing much more
information by constructing an augmented RAG, which we call the scene graph. In
a scene graph, nodes have properties, such as the area of the corresponding patch,
the color, the albedo, etc. An example is shown in Fig. 12.7.
Furthermore, the scene graph may be multiply traversed. In a simple RAG,
the only edges between nodes are edges that correspond to the predicate
ADJACENT TO; however, we can also have edges corresponding to the property
JUST LARGER THAN, and thus have a mechanism for traversing the graph in
order of patch size, or whatever other property seems convenient for a particular
application.

[Fig. 12.7 content: three patch nodes (Patch #1, #2, #3), each carrying the fields Avail, B. Window, B. Volume, Direction, Normal, Near, and Next.]
Fig. 12.7. Scene graph for an image with one object segmented into three patches.
Generally,
• the patch list is sorted by cardinality (number of pixels in a patch), and
• each node has a pointer to a list of adjacent patches.

Notice the use of "cardinality" rather than area. Why do you suppose this is? Think about a range image viewed obliquely.

12.5 Using graph-matching: The subgraph isomorphism problem

There are several kinds of problems in computer science, defined by how long they
take to run as a function of the size of the data set, represented by the variable n.

• Polynomial time problems. The time required may be written as a polynomial in n, e.g. t ∝ n^3.
• Exponential time problems. The time required may be written as an exponential, e.g. t ∝ e^n. Observe that as n becomes large, n^k is tiny compared to k^n.
• NP-hard problems. Here, NP stands for "nondeterministic polynomial." For an NP-hard problem, no polynomial-time algorithm is known, and it is widely believed (though not proven) that none exists; in practice, all known algorithms for such problems require exponential time.
• NP-complete problems. These are the NP-hard problems which themselves belong to NP. No algorithm is known which computes the answer to an NP-complete problem in polynomial time, but there is also no proof that no such algorithm exists. Any algorithm (or machine) which could solve one NP-complete problem in polynomial time would be a dramatic breakthrough, since all NP-complete problems can be transformed into one another in polynomial time. That is, any algorithm which could solve one NP-complete problem could be applied to any other such problem.

Subgraph isomorphism is one approach to matching a scene graph to a RAG. The idea is simple. We know the scene graph, assuming the segmenter worked correctly (a terrible assumption in realistic images). Furthermore, we have a collection of RAGs, one for each model in our data base. To accomplish matching, we simply look for where the scene graph is isomorphic to some subgraph of the model graph. Unfortunately, constructing all subgraphs of a model graph is an NP-complete problem, so all known algorithms require exponential time. Still, the use of heuristics can provide dramatic improvements in performance [12.4].
We will discuss matching in more detail in Chapter 13.

12.6 Aspect graphs

Consider a polyhedron with its center of gravity located at the origin, and think about the image formed by a camera pointed at the origin and located at the 3D spatial location parameterized by spherical coordinates [ρ, θ, φ], where θ denotes rotation about the x axis and φ denotes rotation about the y axis, as illustrated in Fig. 12.8. If ρ is a constant, then the locus of possible camera positions is a sphere about the origin. For now, we will only consider this case.
of as moving on a sphere. As can be seen from Fig. 12.9, two different viewpoints
may produce very different views. But consider only a very small camera motion.
Except in rare conditions, small camera motions cause only slight changes in the
image.


Fig. 12.8. The coordinate system is defined so that the object to be characterized has its center
of gravity located at the origin.

Fig. 12.9. Two different aspects of the same object produce very different two-dimensional
images.


Fig. 12.10. Each partition of the VSP is identified by a list of the surfaces visible from that set of
viewpoints (redrawn from [12.5]).

In those rare camera movements, however, radical things happen to the image –
surfaces disappear or appear. That is, the topological structure of the image changes.
Two viewpoints V1 and V2 are defined to be aspect equivalent, denoted V1 ∼ V2 ,
if and only if there exists a sequence of infinitesimal camera motions, a path, from
V1 to V2 such that the topology of the viewed image does not change. The aspect
equivalent property is obviously symmetric, reflexive, and trivially transitive. It is
therefore an equivalence relation and thus imposes a partition, denoted the viewpoint
space partition (VSP), on the set of points on the sphere. Each element of this partition
is referred to as a viewing region. Fig. 12.10, from [12.5], illustrates the VSP for a
tetrahedron with faces A, B, C, and D. The aspect graph (referred to originally as
the visual potential) is the dual of the VSP.
Computing the aspect graph is accomplished [12.2] by first constructing the
labeled image structure graph (LISG), which is an augmented graph in which
each node in the graph corresponds to a vertex in the line drawing, and each arc
corresponds to the line segment between those nodes. The arcs are augmented
with the labels corresponding to the properties +, −, and →, as we used them in
Chapter 10 to mean convex, concave, or occluding. The algorithm in [12.2] partitions
the viewing sphere such that all points in a partition have isomorphic LISGs.
For arbitrary (potentially nonconvex) polyhedra, using orthographic projection, the algorithm is of high (but still polynomial) computational complexity. For a polyhedron with n faces, the total worst case time complexity is O(n^8).
Aspect graphs were originally developed by Koenderink and Van Doorn [12.3],
and extended by a variety of authors; Bowyer and Dyer [12.1] provide a good survey
of work prior to 1990. More recent work has focussed on such nasty problems as
the fact that we are dealing with sampled data [12.7].

12.7 Conclusion

The concepts of graphs occur throughout the machine vision literature. See [12.6] for more on scene structure graphs and Bayesian networks which use them. As we now understand, the search algorithm used in section 10.1 to label line drawings actually searched a tree of interpretations.

12.8 Vocabulary

You should know the meanings of the following terms.

Aspect graph
Clique
Connected
Degree
Edge
Isomorphic
Node
NP-complete
Path
RAG
Scene graph
Tree
Vertex

References

[12.1] K. Bowyer and C. Dyer, “Aspect Graphs: An Introduction and Survey of Recent
Results,” International Journal of Imaging Systems and Technology, 2, pp. 315–328,
1990.
[12.2] Z. Gigus and J. Malik, “Computing the Aspect Graph for Line Drawings of Polyhedral
Objects,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(2),
1990.
[12.3] J. Koenderink and A. Van Doorn, “The Internal Representation of Solid Shape with
Respect to Vision,” Biological Cybernetics, 32, pp. 211–216, 1979.
[12.4] B. Messmer and H. Bunke, “A New Algorithm for Error-tolerant Subgraph Isomor-
phism Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
20(5), 1998.
[12.5] H. Plantinga and C. Dyer, “Visibility, Occlusion, and the Aspect Graph,” International
Journal of Computer Vision, 5(2), pp. 137–160, 1990.
[12.6] S. Sarkar and P. Soundararajan, “Supervised Learning of Large Perceptual Organiza-
tion: Graph of Spectral Partitioning and Learning Automata,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 22(5), 2000.
[12.7] I. Shimshoni and J. Ponce, “Finite-resolution Aspect Graphs of Polyhedral Objects,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 1997.
[12.8] C. Wu and P. Doerschuk, “Tree Approximations to Markov Random Fields,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(4), 1995.
13 Image matching

One of these things is not like the other


Sesame Street

In this chapter we will consider issues associated with matching – matching observed
images with models as well as matching images with each other. We will consider
matching iconic representations as well as matching graph-theoretic representations.
Matching establishes an interpretation. That is, it puts two representations into
correspondence.
• Both representations may be of the same form. For example, correlation matches an observed image with a template. Similarly, subgraph isomorphism matches a region adjacency graph to a subgraph of a model graph.
• The representations might be of different forms. For example, an image might be matched to a paragraph of text describing it. In most such applications, we find ourselves matching an equation to some data, and in this case, "fitting" might be a better word.
In the remainder of this chapter we address all of these matching problems except
fitting, which was discussed earlier in this book.

13.1 Matching iconic representations

13.1.1 Template matching


A template is a representation for an image (or subimage) which is itself a picture. A
template is typically moved around the target image until a location is found which
maximizes some match function. The most obvious function is the squared error


SE(x, y) = Σ_{α=1}^N Σ_{β=1}^N ( f(x − α, y − β) − T(α, β) )²        (13.1)

(assuming the template is N × N), which provides a measure of how well the template
(T) matches the image (f) at point x, y. If we expand the square and carry the

summation through, we find

SE(x, y) = Σ_{α=1}^N Σ_{β=1}^N f²(x − α, y − β) − 2 Σ_{α=1}^N Σ_{β=1}^N f(x − α, y − β) T(α, β)
           + Σ_{α=1}^N Σ_{β=1}^N T²(α, β).        (13.2)

The relationship between correlation and squared error.

Let’s look at these terms: The first term is the squared sum of the image brightness
values at the point of application. It says nothing about how well the image matches
the template (although it IS dependent on the image). The third term is simply the
sum of the squared elements of the template, and is a constant, no matter where the
template is applied. The second term obviously is the key to matching, and that term
is the correlation.
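A direct implementation of Eq. (13.1) is a quadruple loop. The sketch below is our own illustration, not code from the original; the row-major array layout and all names are assumptions, and the indexing convention (f(x + α, y + β) rather than f(x − α, y − β)) differs only trivially from Eq. (13.1).

#include <float.h>

/* Exhaustive template matching: minimize the squared error of Eq. (13.1).
   image is W x H, template is N x N, both stored row-major. */
void best_match(const float *image, int W, int H,
                const float *templ, int N,
                int *best_x, int *best_y)
{
    double best = DBL_MAX;
    for (int y = 0; y + N <= H; y++)
        for (int x = 0; x + N <= W; x++) {
            double se = 0.0;
            for (int b = 0; b < N; b++)
                for (int a = 0; a < N; a++) {
                    double d = image[(y + b) * W + (x + a)]
                             - templ[b * N + a];
                    se += d * d;     /* accumulate squared error */
                }
            if (se < best) { best = se; *best_x = x; *best_y = y; }
        }
}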
In matching using an optimization criterion, the assumption is made that the quality of match can be described by a set of parameters a = {a_1, a_2, . . . , a_n}, which could be the pixels themselves. We define a merit function M(a, f(x)) which quantifies the quality of the match between the template and the local image. Matching consists of determining a so that M is maximized. Typically, a is the x–y coordinates specifying where the template is placed.

If M is monotonic in a, we can maximize M by solving

M_{a_j} = ∂M/∂a_j = 0    for j = 1, . . . , n.        (13.3)

If M is not monotonic, the process of finding points where the partial derivatives are zero can terminate in a local maximum. Furthermore, as we have discussed earlier, it is probably not possible to find an analytic solution to Eq. (13.3). In that case, we could use hill-climbing:

a_j^k = a_j^{k−1} + c M_{a_j}.        (13.4)

The equation for hill-climbing should be compared with that for gradient descent.

This strategy is generally a difficult approach to template matching, since M often has multiple local maxima and broad peaks, which makes ∂M/∂a difficult to use. However, if we happen to be close to the matching point, it is possible that hill-climbing will lead to an optimal match. The width of the peaks can be reduced somewhat by using heuristics such as phase-only matching [13.7].
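As a concrete illustration of the hill-climbing rule of Eq. (13.4) on a discrete search space, the sketch below (ours; the merit function M is assumed to be supplied by the caller and to return a very poor value outside the image) repeatedly steps to the best of the four neighbors until no neighbor improves the merit, terminating, as the text warns, at a local maximum.

/* Discrete hill-climbing on a match merit M(x, y). */
void hill_climb(double (*M)(int x, int y), int *x, int *y)
{
    for (;;) {
        int bx = *x, by = *y;
        double best = M(*x, *y);
        const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
        for (int k = 0; k < 4; k++) {
            double m = M(*x + dx[k], *y + dy[k]);
            if (m > best) { best = m; bx = *x + dx[k]; by = *y + dy[k]; }
        }
        if (bx == *x && by == *y)
            return;                  /* local maximum reached */
        *x = bx; *y = by;
    }
}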
Assuming a correct segmentation, one could match 2D binary shapes which have
been transformed by first finding the affine (or similarity) transform that best matches
the two shapes [17.61], and then simply doing template matching on the transformed
regions. Generally, the complexity of template matching can be reduced somewhat

by masking out impossible matches [13.37]. As we have seen in section 11A.4, iconic matching is a critically important part of any stereopsis system [13.31].
One of the principal problems in any template matching system is choosing the
optimal size of the template. If the template chosen is too small, it will not cover
enough variation in the image to make an accurate estimate. If it is too large, the
window covers too large an area to match at all. Some adaptive algorithms have been
reported [13.27]. In addition, one must consider what objective is being minimized
when matching. The squared error and the signal-to-noise ratio are not always the
best [13.40].

13.1.2 Point matching


Consider an image as just a set of points with known distances between each other
(one could consider this an instance of the springs and templates problem discussed
in the next section, with trivial templates). One example where this type of data
occurs is in the recognition of targets in Synthetic Aperture Radar (SAR) images. In
this situation one may approach the problem by assuming a 3D model of the object,
and finding the transformation from 3D to 2D which best describes the observation
[13.53]; see also [13.3].
Matching in the stereo environment can make use of the epipolar constraint as
well as characteristics of edges and vertices, and can be treated probabilistically
[13.47].

13.1.3 Segment matching


The problem of finding a match between short arcs and pieces of a long arc has been
investigated in the literature [13.3, 13.39, 13.42]. The corresponding 3D problem
has not received much attention largely because it is difficult to extract 3D curves;
[13.22] provides one approach.

13.1.4 Eigenimages
The eigenimage approach has been an effective solution to problems like object
identification and recognition [13.49, 13.50], where the image of an unknown object
is compared to images of known objects in a data base (or a training set) and the
unknown object can be identified or recognized when a close match is found. We
can surely do a pixel-by-pixel comparison. However, this is very time-consuming,
especially when the size of the image is large and the number of images included in
the data base is large as well.

The eigenimage approach has its origin in principal component analysis (PCA),
which is a popular technique for dimensionality reduction. In section 9.2.1, one type
of PCA, the K–L transform, is described in detail. PCA constructs a representation
of the data with a set of orthogonal basis vectors that are the eigenvectors of the
covariance matrix generated from the data. By projecting the data onto the dominant
eigenvectors (corresponding to the larger eigenvalues), the dimension of the original
data set can be reduced with minimal loss of information. Similarly, in the eigen-
image approach, each image is represented as a linear combination of a set of dom-
inant principal components (the eigenimages). Matching is then conducted based
on the coefficients of the linear combination (or the weight of projections onto the
eigenimages) which greatly speeds up the process. The projection preserves most of
the energy, and thus captures the highest amount of variation in the data base. Here,
we discuss the calculation of the eigenimages in detail.
Let f 1 , f 2 , . . . , f p represent a set of images of known objects in the data base.
Without loss of generality, assume these images are of the same dimension m × n.
The following steps lead us to the eigenimages.

Step 1. Construct a set of vectors {I_1, . . . , I_p} where each I_i (i = 1, . . . , p) is the lexicographic representation of the corresponding image f_i minus the average image: I_i = f_i − A, where A = (1/p) Σ_{i=1}^p f_i. Note that each vector is mn × 1.

Step 2. Calculate the covariance matrix of this set by

C = (1/(p − 1)) Σ_{i=1}^p I_i I_i^T,        (13.5)

where C is an mn × mn matrix.
Step 3. Use eigenvalue decomposition techniques to obtain the eigenvectors and eigenvalues of C:

C = E Λ E^T

where E is an mn × mn matrix with each column vector an eigenvector of C (or an eigenimage of C), and Λ is a diagonal matrix,

Λ = diag(λ_1, λ_2, . . . , λ_mn),

with the eigenvalues of C on the diagonal, and λ_1 > λ_2 > · · · > λ_mn.
Step 4. Suppose among the mn eigenvalues, the first k values are much larger than the other values; that is, η = Σ_{j=1}^k λ_j / Σ_{j=1}^{mn} λ_j is very close to 1. Then we can use the first k eigenvalues to reconstruct the original image without losing too much information. Hopefully, k ≪ mn.

Step 5. Calculate the projection coefficients of image f_i onto the selected eigenimages for comparison purposes by

W_i = I_i^T × [E_1 . . . E_k]        (13.6)

where E_j is the jth eigenimage, and W_i is a 1 × k vector containing the projection coefficients of the original image onto each eigenimage (stacking the W_i for all p images yields a p × k matrix). The comparison process can easily be carried out as follows. Given an image f_test of an unknown object in the test set, calculate its projection coefficients by

W_test = I_test^T × [E_1 . . . E_k].

Compare the distance between W_test and all the W_i in the data base (a Euclidean distance might be the simplest approach); the W_i with the closest distance is selected as a match.
We now show an interesting example of applying the eigenimage approach to face recognition [13.51]. Assume we have three images in the data base (Lena, Einstein, and the Clock); the unknown image is Monalisa. Following Steps 1 through 3 described above, we can derive eigenimages of size 64 × 64. We use only two of the dominant ones, since the ratio between the sum of the first two eigenvalues and the sum of all the eigenvalues is close to 1, as stated in Step 4. Fig. 13.1 shows all the original images and the two eigenimages derived. Following Step 5, we compute the projection coefficients of all four original images on the two eigenimages; these are listed in Fig. 13.1 as well. Based on a simple Euclidean distance calculation, it turns out that the closest match to Monalisa is Einstein. Is that a surprise? Not really. In the "eyes" of the computer, these two images are indeed more alike than Monalisa and Lena.

Reducing computational complexity

Even though the eigenimage approach has great potential for image matching, from
the procedure described above, we see that the most time-consuming step is the
derivation of the eigensystem. When the size of images is large, the calculation of
the covariance matrix (which is mn × mn) can take up a lot of computation resources
or be completely infeasible. For more efficient calculations, readers are referred to
[13.34, 13.35].
We illustrate one approach to reducing computation through an example. Assume
that each image has only three pixels and that there are only two such images in the
set. Let them be

f_1 = [1 2 3]^T
f_2 = [5 8 9]^T.

[Fig. 13.1 content: the training images I_1, I_2, I_3 and the test image I_test; the first and second eigenimages E_1 and E_2; and the values below.]

The projection coefficients:
W_1 = [−1.422  −0.8161],  W_2 = [−1.7015  0.7688],  W_3 = [3.1235  0.0473],
W_test = [−1.4399  0.4710].

The distances:
d_{1,test} = 1.2872,  d_{2,test} = 0.3963,  d_{3,test} = 3.5831.

Fig. 13.1. Demonstration of the eigenimage approach on a data base of three images (Lena,
Einstein, Clock) and a test image (Monalisa). The images have been rescaled for
display purpose.

Then the mean A = [3 5 6]^T, and

I_1 = [−2 −3 −3]^T
I_2 = [2 3 3]^T.

Construct the matrix I = [I_1 . . . I_p], in which the ith column of I is one of the images I_i, and consider the product S = I I^T. In this example (rows separated by semicolons),

I = [ −2 2 ; −3 3 ; −3 3 ]   and   S = [ 8 12 12 ; 12 18 18 ; 12 18 18 ].

Observe that if p < mn, then S is the scatter matrix, identical to the covariance except for the multiplicative scale factor. S is huge – if the image is 256 × 256, then S is 256^2 × 256^2. However, if there are only, say, five images in the set, I is 256^2 × 5,

and I^T I is 5 × 5. Suppose μ_i is one of the eigenvectors of I^T I. That is,

I^T I μ_i = λ_i μ_i.        (13.7)

Here's the mathemagic: Multiply on both sides of Eq. (13.7) by I to obtain I I^T I μ_i = I λ_i μ_i = λ_i I μ_i, and we notice that I μ_i is an eigenvector of I I^T. Since the size of I^T I is dramatically smaller than the size of I I^T, we obtain a similarly dramatic decrease in complexity in the process of determining the eigenvectors. Thus, if the e_i are the eigenvectors of S, we can obtain them using

e_i = I μ_i = Σ_{k=1}^p μ_{ik} I_k.        (13.8)
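The small-matrix trick is easy to exercise numerically. The sketch below is our own illustration: Plain C has no eigensolver, so it uses power iteration (which recovers only the dominant eigenvector) on the 2 × 2 matrix I^T I built from the two three-pixel images above, then maps the result back to an eigenvector e = Iμ of S via Eq. (13.8).

#include <math.h>
#include <stdio.h>

#define MN 3    /* pixels per image */
#define P  2    /* number of images */

int main(void)
{
    double I[MN][P] = { {-2, 2}, {-3, 3}, {-3, 3} };  /* columns are I_1, I_2 */
    double small[P][P] = {{0}};
    double mu[P] = {1, 0};  /* starting guess; (1,1) would be degenerate here */

    /* Form the p x p matrix I^T I (2 x 2 instead of 3 x 3). */
    for (int i = 0; i < P; i++)
        for (int j = 0; j < P; j++)
            for (int k = 0; k < MN; k++)
                small[i][j] += I[k][i] * I[k][j];

    /* Power iteration: mu converges to the dominant eigenvector of I^T I. */
    for (int it = 0; it < 100; it++) {
        double t[P] = {0}, norm = 0;
        for (int i = 0; i < P; i++)
            for (int j = 0; j < P; j++)
                t[i] += small[i][j] * mu[j];
        for (int i = 0; i < P; i++) norm += t[i] * t[i];
        norm = sqrt(norm);
        for (int i = 0; i < P; i++) mu[i] = t[i] / norm;
    }

    /* Map back: e = I mu is an eigenvector of S = I I^T, per Eq. (13.8). */
    for (int k = 0; k < MN; k++) {
        double e = 0;
        for (int i = 0; i < P; i++) e += I[k][i] * mu[i];
        printf("e[%d] = %f\n", k, e);   /* proportional to (2, 3, 3), up to sign */
    }
    return 0;
}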

13.2 Matching simple features

The most straightforward way to use the simple features we described in Chapter 9 is
to use them in a pattern classifier. To do this, we will extract a statistical representation
for the model and the object and match those representations. The strategy is as
follows.
• Decide which measurements you wish to use to describe the shape. For example, one might build a system which measures seven invariant moments and the aspect ratio, for a total of eight "features." The best collection of features is application-dependent, and methods for optimally choosing feature sets are beyond the scope of this book (see [14.4, 14.11, 18.30], which are just a few of many texts in statistical methods). Organize these eight features into a vector, x = [x_1, x_2, . . . , x_8]^T.
• Describe a "model" object using a collection of example images (called a "training set"), from which feature vectors have been extracted. Continuing the example with eight features, we could collect a set of n images of axes, measure the feature vector of each axe, and characterize the model axe by its average over this set, μ_axe = (1/n) Σ_{x∈axe} x. Hatchets might be similarly characterized by an average over a set of sample hatchets. (The unknown is distinguished from the examples by lack of a subscript.)
• Now, given an unknown region, characterized by its feature vector x, shape matching consists of finding the model which is "closest" in some sense to the observed region. Probably the simplest definition of "close" uses the Euclidean distance

d(model_axe, observation) = √( (x − μ_axe)^T (x − μ_axe) )
d(model_hatchet, observation) = √( (x − μ_hatchet)^T (x − μ_hatchet) ).

We make the decision based on which of these distances is smaller; a sketch of such a classifier follows.
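A minimal sketch of such a nearest-mean classifier (ours; the feature dimension and all names are placeholders):

#define NFEATURES 8

/* Squared Euclidean distance between a feature vector and a class mean.
   The square root is omitted since it does not change the ordering. */
static double dist2(const double x[NFEATURES], const double mu[NFEATURES])
{
    double d = 0.0;
    for (int i = 0; i < NFEATURES; i++)
        d += (x[i] - mu[i]) * (x[i] - mu[i]);
    return d;
}

/* Return the index of the class (e.g. 0 = axe, 1 = hatchet) whose
   mean feature vector is closest to the observation x. */
int classify(const double x[NFEATURES],
             const double mu[][NFEATURES], int nclasses)
{
    int best = 0;
    for (int c = 1; c < nclasses; c++)
        if (dist2(x, mu[c]) < dist2(x, mu[best]))
            best = c;
    return best;
}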


This example is rather naive for several reasons. There are better models for a
set of examples than just the mean, and there are better matching metrics than the

Euclidean distance. In Chapter 14, the concepts in this discipline of statistical pattern
classification are covered in more detail.

13.3 Graph matching

In this section, we consider the problem of matching image representations which are fundamentally graph-based. However, we allow the data stored at a node in the graph to include images or templates.

Recall the formal definition of a metric. The discussion below refers to subgraph isomorphism as a metric; is that correct?
Recall that a clique of size N is a totally connected subgraph of size N . We
also will want to use matching metrics, such as the mean squared error or correla-
tions mentioned in section 13.1.1. A matching metric measures the “goodness” of a
match.
In a totally graph-based representation, one matching metric might be sub-
graph isomorphism. But, subgraph isomorphism does not really allow for close
but not perfect matches. Most machine vision specialists would say that it is too
inflexible.
The graph-matching problem can be approached using annealed neural networks
[13.8].
As we saw earlier, relaxation labeling can also provide a mechanism for a type of
graph matching. In the example we saw in section 10.2.2, we can match a subgraph
of the scene graph (i.e. the two surfaces in the example given) with a subset of
the model graph. Other variations are possible, for example, Gold and Rangarajan
[13.14] describe a variation on graph matching which utilizes a nonlinear optimiza-
tion method which they say runs much faster and more accurately than relaxation
labeling.
Two other approaches are described here – association graphs and spring-loaded
templates.
These are methods which will produce matchings on hybrid representations,
that is, those which are fundamentally graph-based, but which include image
information.

13.3.1 Association graphs


Association graphs embody a methodology which is less restrictive than isomor-
phism, and which may converge more rapidly. It will converge to a solution which
is consistent, but not necessarily optimal (depending, of course, on the criteria for
optimality used in any particular application).
The method matches a set of nodes from the model to a set of nodes (extracted)
from the image.

Definition
Here, a graph is denoted G = ⟨V, P, R⟩, where V represents a set of nodes, P represents a set of unary predicates on nodes, and R represents binary relations between nodes.
A predicate is a statement which takes on only the values TRUE or FALSE.
For example let x denote a region in a range image. Then CYLINDRICAL(x) is a
predicate which is true or false depending on whether all the pixels in x lie on a
cylindrical surface.
A binary relation describes a property possessed by a pair of nodes. It may be
considered as a set of ordered pairs R = {(a1 , b1 ), (a2 , b2 ), . . . (an , bn )}. In most
applications, order is important. It is possible to think of a relation as a predicate,
since for any given pair, say (ak , bk ), either it is an element of the set R or it is not.
However, it seems more descriptive to use the word relation in this context.

Given two graphs, G_1 = ⟨V_1, P, R⟩ and G_2 = ⟨V_2, P, R⟩, we construct the association graph G by:
• for each v_1 ∈ V_1 and v_2 ∈ V_2, if v_1 and v_2 have the same properties, construct a node of G labeled (v_1, v_2);
• if r ∈ R and r(v_1, v_1′) ⇔ r(v_2, v_2′), connect (v_1, v_2) to (v_1′, v_2′).

The best match of G_1 to G_2 is the largest clique of G.
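For association graphs as small as the examples in this chapter, the largest clique can be found by brute force. The sketch below (ours; exponential in the number of nodes, consistent with the NP-completeness discussed next) encodes each candidate subset of association-graph nodes as a bitmask and keeps the largest subset that is totally connected.

/* Brute-force largest clique for an association graph of n nodes
   (n <= 31 here), given its adjacency matrix adj. The subset loop
   makes the cost O(2^n), tolerable only for tiny graphs. */
int largest_clique(int n, const int adj[][32])
{
    int best_mask = 0, best_size = 0;
    for (long mask = 1; mask < (1L << n); mask++) {
        int is_clique = 1, size = 0;
        for (int i = 0; i < n && is_clique; i++) {
            if (!(mask & (1L << i))) continue;
            size++;
            for (int j = i + 1; j < n; j++)
                if ((mask & (1L << j)) && !adj[i][j]) {
                    is_clique = 0;   /* a missing edge disqualifies it */
                    break;
                }
        }
        if (is_clique && size > best_size) {
            best_size = size;
            best_mask = (int)mask;
        }
    }
    return best_mask;   /* bitmask of the nodes in one largest clique */
}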


Like every other technique in machine vision, we need to ask, “How good is this
method?” Some problems arise when attempting to answer that question:

Problem 1. Is the largest clique the best match? The largest clique is the largest set
of consistent matches. Is this really the best match?

Problem 2. Computational complexity. Like the subgraph isomorphism problem,


the problem of finding the largest clique is NP-complete. That is, no algorithm is
known which can solve this problem in less than exponential time.

An example of using association graphs to match a scene to a model

In Fig. 13.2 we illustrate an observation in which a segmentation error, oversegmentation, has occurred. That is, regions B and C are actually part of the same region, but
due to some measurement or algorithmic error, have been labeled as two separate
regions. In this example, the unary predicates are labels spherical, cylindrical, and
planar. Regions A and 1 are spherical, while B, C, D, 2, and 3 are cylindrical. The
only candidates for matches are those with the same predicate. So only A can match 1.
We now construct a graph in which all candidate matches are the nodes. We then
have the nodes of the association graph, as illustrated in Fig. 13.3.


Fig. 13.2. A range camera has observed a scene and segmented it into segments which satisfy the same equation; however, an error has occurred.

Fig. 13.3. Candidate matches: 1A, 2B, 2C, 2D, 3B, 3C, 3D.

Here is the challenge: Identify what it means to be consistent – determine r_A(i, λ, j, λ′), or in this example, determine r_A(1, A, 2, B), where the compatibility function r has the same meaning as in Chapter 10 and the subscript A simply denotes that an association graph is being used. It is often easier to do this by determining what is NOT consistent, and that is a problem-dependent decision. Here, we define
any two labelings as consistent if they do not involve the same region. Some example
consistencies for this example are

r_A(1, A, 2, B) = 1
r_A(2, B, 2, C) = −1
r_A(2, B, 3, B) = −1.

The second line says that patch B in the image could not be region 2 in the model while, simultaneously, patch C in the image is the same region. In both examples, inconsistencies are really based on the assumption that the segmenter is working correctly. However, one could allow the segmenter to fail. In that case, new edges are added, because new relationships are now consistent. For example, r_A(3, C, 3, D) = 1, since we believe that two patches could be part of the same region (the segmenter can fail by oversegmentation); however, r_A(2, D, 3, D) = −1 still holds, because we still believe the segmenter will not merge patches (fail by undersegmentation). Allowing for oversegmentation produces the association graph of Fig. 13.4.

Fig. 13.4. Solid lines denote the edges which are present if we assume the segmenter does not fail. The dotted lines are added if we believe the segmenter can fail by oversegmentation. The heavy lines indicate a maximal clique.

Note one other type of inconsistency which prevents an edge from being constructed: 3D and 3B are not connected, since B and D do not border. That is, we believe that if the segmentation fails by oversegmentation, the segmenter will not introduce an entire new patch between the two. We must emphasize that how you develop these rules is totally problem-dependent!

Once you have the allowable consistencies, the matching is straightforward. Simply find all maximal cliques. The maximal clique is not unique, since there may be several cliques of the same size. In this case, there are at least two maximal cliques, two of which are: {(1, A), (2, B), (2, C), (3, D)} and {(1, A), (3, B), (2, C), (2, D)}.


Fig. 13.5. A springs and templates model of a face.

13.3.2 Springs and templates


Another approach to the local–global problem is provided by the “springs and tem-
plates philosophy” [13.13]. This is a hybrid model-matching paradigm which in-
volves both graph-structured matching and template matching. The model is a set of
rigid “templates” connected by “springs” which characterize how much deformation
must be applied to the model to make it match the image. Fig. 13.5 illustrates this
concept using a simple model of a human face.
Specific features such as eyes are matched by iconic methods such as template
matching. However, in order to match an entire face, distances between the best
matching locations of templates are also recorded. A match is then based on mini-
mizing a total cost, as follows.

Cost = Σ_{d ∈ templates} TemplateCost(d, F(d))
     + Σ_{(d,e) ∈ ref × ref} SpringCost(F(d), F(e))        (13.9)
     + Σ_{c ∈ (R_missing × R_missing)} MissingCost(c).

In Eq. (13.9), d is a template and F(d) is the point in the image where that template
is applied. TemplateCost is therefore a function indicating how well a particular
template matches the image when applied at its best matching point. SpringCost is
a measure of how much the model must be distorted (the springs stretched) to apply
those particular templates at those particular locations. Finally, it may be that not
every template can be located – in some images, the left eye may not be visible – and a
cost may be imposed for things missing. All these costs are empirically determined.
However, once they are determined, it becomes relatively easy to determine how
well any given image matches any given model.
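In code, Eq. (13.9) is just three accumulations. The sketch below is our own illustration: The callbacks stand in for the empirically determined template and spring costs described above, and a placement of −1 flags a template that could not be located.

/* Total cost of a placement, in the spirit of Eq. (13.9). placement[d]
   is the image location where template d best matched, or -1 if the
   template could not be found. Spring costs are accumulated over
   ordered pairs, as in the (d, e) sum of Eq. (13.9). */
double total_cost(int ntemplates, const int placement[],
                  double (*template_cost)(int d, int where),
                  double (*spring_cost)(int at_d, int at_e),
                  double missing_cost)
{
    double cost = 0.0;
    for (int d = 0; d < ntemplates; d++) {
        if (placement[d] < 0) {      /* e.g. the left eye is not visible */
            cost += missing_cost;
            continue;
        }
        cost += template_cost(d, placement[d]);
        for (int e = 0; e < ntemplates; e++)
            if (e != d && placement[e] >= 0)
                cost += spring_cost(placement[d], placement[e]);
    }
    return cost;
}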

There is one significant (among several others) problem with spring matching:
The number of elements matched affects the magnitude of the cost. The costs are
summed, so a poor match of only a few things may be less than a good match of
many (and therefore better, in a minimal cost algorithm).
This is a problem which is not unique to springs and templates. The usual solution is to normalize the calculations, using a technique such as

Cost = SpringCost(unary and binary) / (Total number of springs)
     + constant / (Total number of references matched).

13.4 Conclusion

Association graphs use the concepts and formalisms of consistent labeling directly; they are a kind of consistent labeling. The advantage of using a graph structure is that the search for the largest clique is aided by a body of available software for performing such searches as quickly as computational complexity allows. Similarly, the springs and templates ideas measure both consistency and deviation from consistency. The springs and templates concepts also illustrate both how one might construct an appropriate objective function, and a problem that can easily arise if one does not pay attention to interpretation of the objective function – if we are summing match quality, a good match of many things (adding up lots of small numbers) may be more than (and therefore worse than) a poor match of only a few things (adding up just a few rather large numbers). Objective functions need suitable normalization.

We began this chapter by pointing out that formal optimization methods, either descent or "hill-climbing," are hard to apply to image matching because the search space is littered with local minima. However, if we initialize the algorithm sufficiently close to the solution, such techniques will work. We used the sum of squared differences (SSD), also sometimes called the sum-squared error, as the objective function.
Eigenimages are lower dimensionality representations of the original images. The
projections are chosen by minimizing the error between the original data and the
projected data.

13.5 Vocabulary

You should know the meanings of the following terms.

Association graph
Clique

Correspondence
Deformable template
Eigenimage
Hill-climbing
Matching metric
PCA
Template

Assignment 13.1
In this chapter, we stated that the problem of find-
ing the largest clique is NP-complete. What does that
really mean? Suppose you have an association graph with
ten nodes, interconnected with 20 edges. How many tests
must you perform to find all cliques (which you must do
in order to identify which of these are maximal)? You
ARE permitted (encouraged!) to look up clique-finding
in a graph theory text.

Assignment 13.2
In section 13.3.1, an example problem is presented
which involves an association graph which allows for
segmentation errors. The result of that graph is two
maximal cliques, which (presumably) mean two different
interpretations of the scene. Describe in words these
two interpretations.

Assignment 13.3
In the bibliography for this chapter, there is an
incomplete citation to Olson [13.36]. First, locate
a copy of that paper. You may use a search engine, the
Web, the library, or any other resource you wish. In
that paper, the author does template matching in a dif-
ferent way: Using a binary (edge) image and a similar
template, he does not ask “Does the template match the
image at this point?” Instead he asks, “At this point,
how far is it to the nearest edge point?”
How does he perform this operation, apparently a
search, efficiently?
Once he knows the distance to the nearest edge point,
how does he make use of that information to compute a
quality of match measure?

Assignment 13.4
In an image-matching problem, we have two types of ob-
jects, lions and antelope (which occupy only one pixel
each).

• A scene may contain only lions and antelope.
• Lions hunt in packs, so if you see one lion, you will see at least one other lion, usually about 5 pixels away.
• Antelope stay as close to one another as possible.
• Except for certain rare and (for the antelope) unpleasant events, lions and antelope are VERY far apart.

We wish to use relaxation labeling to solve this assignment problem. All the formulae are in the book except the formula for the consistency r(a, λ_1, b, λ_2), where a and b are points of interest in the image and the lambdas are labels for either "antelope" or "lion." Invent an r function for this problem. That is, tell how to compute values for:
(1) r(a, lion, b, antelope)
(2) r(a, lion, b, lion)
(3) r(a, antelope, b, lion)
(4) r(a, antelope, b, antelope).

Assignment 13.5
Do you think the concepts of springs and templates
would be applicable to Assignment 13.4? Discuss.

Assignment 13.6
Still thinking about lions and antelope, you observe the scene sketched opposite. The sketch is not to scale; however, for your convenience, we have tabulated the distance between each pair of animals (Table 13.1). Lions are yellow (which is denoted by a dotted interior) or brown (denoted by a black interior -- that's right; there aren't any). Antelope are white (denoted by a white interior) or yellow. You wish to use association graph methods to solve this problem; and since this technique is not as powerful as nonlinear

Table 13.1. Distances between pairs of animals.

Pair     Distance (arbitrary units)
1, 2     5.5
1, 3     2
1, 4     3
1, 5     2
2, 3     2
2, 4     3
2, 5     4
3, 4     2
3, 5     3.8
4, 5     3.4

relaxation, you talked to a botanist¹ who gave you some improved information: Lions NEVER get closer to each other than 3 pixels, and antelope never get farther from another antelope than 3 pixels.
Draw an association graph for this problem. (Denote the nodes in the association graph by pairs, e.g., 1L means "interpret node 1 as a lion.") Indicate the maximal clique by circling the nodes in that clique.

Topic 13A Matching

13A.1 Springs and templates revisited

Recall the correspondence problem, described in section 4.2.2, which may be restated as
“given a set of features in two images, identify which feature in image 1 corresponds to
which feature in image 2.” There is little difference between the model-matching problem
and the correspondence problem, except for the fact that in the correspondence problem, both
of the images may be corrupted by noise.
To solve the correspondence problem, we use a consistency-based philosophy. In this section, we describe this philosophy through an example, which we will use again in section 13A.2. The first step is to identify feature points which are relatively distinctive. Then the algorithm will make use of relationships between these points. Points on the boundary of a region where the curvature of the boundary changes sign meet this requirement, as illustrated in Fig. 13.6.

Fig. 13.6. Boundary of an object. Points where the sign of the curvature changes are marked.

¹ Yes, a botanist! He was out of his field.



This example derivation was first described by Shapiro and Brady [13.46] who use eigen-
vector methods as follows.
As in the original springs and templates formulation, we are finding which of a collection
of feature point sets best matches one particular set. Let d_ij be the Euclidean distance between feature points x_i and x_j, and construct a matrix of weights

H = [H_ij],   where   H_ij = exp( −d_ij^2 / (2σ^2) ).        (13.10)

The matrix H is diagonalized using standard methods into the product of three matrices

H = E Λ E^T        (13.11)

where E is a matrix with the eigenvectors of H as its columns, and Λ is a diagonal matrix with the eigenvalues on the diagonal. Let us assume the rows and columns of E and Λ are sorted so that the eigenvalues are sorted along the diagonal in decreasing size. We think of each row of E as a feature vector, denoted F_i. Thus

E = [F_1; . . . ; F_m]   (one feature vector per row).

Suppose we have two images, f_1 and f_2, and suppose f_1 has m feature points while f_2 has n feature points, with m < n. Then by treating each set of feature points independently, we have H_1 = E_1 Λ_1 E_1^T for image f_1, and H_2 = E_2 Λ_2 E_2^T for image f_2. Since the images have different numbers of points, the matrices H_1 and H_2 have different numbers of eigenvalues.
We therefore choose to use only the most significant k features for comparison purposes.
It is important that the directions of the eigenvectors to be matched be consistent, but changing the sign does not affect the orthonormality. We choose E_1 as a reference and then orient the axes of E_2 by choosing the direction that best aligns the two sets of feature vectors; see [13.46] for details. After aligning the axes, a matrix Z characterizing the match between image 1 and image 2 is defined by

Z_ij = (F_i^1 − F_j^2)^T (F_i^1 − F_j^2).        (13.12)

The best matches are indicated by the elements of Z which are the smallest in their row
and column. We will revisit this example in the next section.
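Building H of Eq. (13.10) and the match matrix Z of Eq. (13.12) is mechanical once the feature points and (from an eigensolver, not shown) the sign-aligned feature-vector rows F_i are in hand. A sketch, with all names and the fixed array bound our own assumptions:

#include <math.h>

/* Gaussian-weighted distance matrix of Eq. (13.10), for up to 64
   feature points x[i] = (x, y). */
void build_H(int m, const double x[][2], double sigma, double H[][64])
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++) {
            double dx = x[i][0] - x[j][0], dy = x[i][1] - x[j][1];
            H[i][j] = exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma));
        }
}

/* Match matrix of Eq. (13.12) from the first k components of the
   (already sign-aligned) feature vectors F1 (m rows) and F2 (n rows). */
void build_Z(int m, int n, int k,
             const double F1[][64], const double F2[][64], double Z[][64])
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            double z = 0.0;
            for (int c = 0; c < k; c++) {
                double d = F1[i][c] - F2[j][c];
                z += d * d;
            }
            Z[i][j] = z;   /* small Z[i][j] means point i matches point j */
        }
}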
Sclaroff and Pentland [13.44] present a further alternative to the springs and templates
formulation: First, compute a description of the entire shape which is robust to sampling
and parameterization error. Then, using this description of the entire shape, find a coordinate
system which effectively describes the shape. Doing this on the image and the model makes
it straightforward to determine cardinal directions.
Wu [10.19] takes the problem of computing optic flow and uses relaxation labeling to find
consistent template matches.
The concept of deformable templates can be combined with graph representations to
produce an approach [13.1] to matching of objects which are similar, but not identical in
shape (e.g. x-rays of hands). The idea of deformable templates can be viewed as an extension

of MAP methods. See [13.26] for a well-written concise description. Methods like this also
find applications in target tracking and automatic target recognition (ATR) [13.12].

13A.2 Neural networks for object recognition

We have already discussed the fact that pattern recognition techniques provide for us a means
of making “what is it?” decisions, when we have been presented with a set of measure-
ments (features) which describe the item being observed. There are many ways to develop
classifiers, and methods which follow the neural networks paradigm have been among the
most successful. Neural networks accept features as inputs and produce decisions as outputs.
They are based on mathematical abstractions of what we know about how individual neurons
compute.
There are two types of neural networks which can perform matching, feedforward and
recurrent.

13A.2.1 Feedforward neural networks

In a feedforward neural network, each computational element (which we will henceforth refer to as a "neuron") has a large number of inputs, and a single output. Although many variations have been explored in the literature, the most common version of the computation performed by a single neuron is y = S( Σ_i w_i x_i ), where the function S is a sigmoid, the x_i are the inputs, and the w_i are weights which modify the significance of the various inputs to the neuron. Fig. 13.7 illustrates the architecture and naming conventions for a single neuron.
Almost arbitrary functions can be computed by using layers of neurons, as illustrated
in Fig. 13.8. The principal problem in neural network design is determining the weights in
order to make the input produce an appropriate output. In the feedforward case, information
about the model(s) is embedded in the weights, and when presented with image information,
perhaps represented by shape features, the network produces a “yes” answer if the feature
vector matches the model. The principal method for determining weights involves finding
the weights which solve a gradient descent problem, perhaps minimizing the mean squared
difference between what the output is and what it should be. Such an algorithm implements


Fig. 13.7. Computation performed by a single neuron. Each input (xi ) is multiplied by a weight
(wi ) and the results are added, producing a signal u, which is passed through a
sigmoid-like nonlinearity function (S) producing the neuron output y.
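The computation of Fig. 13.7 takes only a few lines; a sketch (ours), with the logistic function standing in for the sigmoid S:

#include <math.h>

/* The logistic function, one common choice for the sigmoid S. */
static double S(double u)
{
    return 1.0 / (1.0 + exp(-u));
}

/* Output of a single neuron: y = S(sum_i w_i * x_i), as in Fig. 13.7. */
double neuron(int n, const double w[], const double x[])
{
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += w[i] * x[i];
    return S(u);
}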

[Fig. 13.8 content: a single-layer perceptron produces a decision boundary which is a hyperplane; a two-layer network produces convex decision regions, which can be open or closed; a multilayer network (usually three layers) produces arbitrary decision regions.]

Fig. 13.8. Types of feedforward neural networks, and the decision regions which they can
implement.

Fig. 13.9. A feedforward neural network with three inputs and two outputs. Each circle denotes
a neuron. Weights are not explicitly shown, but exist on the connections.

the familiar gradient descent rule

w_ij(t + Δt) = w_ij(t) − c ∂(MSE)/∂w_ij.

Using the three-level neural network model illustrated in Fig. 13.9, the gradient descent rule
may be readily implemented by making use of the chain rule for derivatives. Hussain and
Kabuka [13.24] demonstrate use of a neural network for character recognition.

13A.2.2 Recurrent neural networks

A recurrent neural network (NN) is one which feeds the output back to the input at run time, as illustrated in Fig. 13.10. Following the same notation used earlier, in the steady state, the output of neuron i satisfies

v_i = S(y_i) = S( Σ_{j=1}^d w_ij v_j − I_i ).        (13.13)

This model of the behavior of a neuron is true only in the steady state. That is, since the output is dependent on the input, which is the output, which is dependent . . . (to iterate is human; to recurse, divine²). But such a description is woefully inadequate when things are changing. In that case, we need some model of the dynamics of the system. Many different models can be used, and the reader is referred to the literature [13.15, 13.20, 13.23] for a closer examination. Here, we consider a single, rather simple model, one in which the rate of change of output from the summer is dependent on the input, and can be represented by a first-order differential equation:

d y_i(t)/dt = −β y_i(t) + Σ_{j=1}^n w_ij S(u_j) − I_i        (13.14)

where the y_i are the neuron outputs, the w_ij are the weights, as before, and the I_i are inputs to each neuron from the external world (not shown in the figure). Thus the change is proportional to the current state, the inputs from all the other neurons, and the external input.

Fig. 13.10. A recurrent neural network with 3 neurons. The weights are not shown, but each input to each neuron has an associated weight.
In operation, a recurrent NN is presented with a particular input and then allowed to run; it should converge to a particular state.
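Running such a network amounts to integrating Eq. (13.14). The sketch below is our own illustration: It takes simple Euler steps and stops when the state stops changing, with the summer outputs y_j used as the arguments of S.

#include <math.h>

static double S(double u) { return 1.0 / (1.0 + exp(-u)); }  /* sigmoid */

/* Euler integration of Eq. (13.14) for an n-neuron recurrent network
   (n <= 64 assumed): dy_i/dt = -beta*y_i + sum_j w_ij S(y_j) - I_i.
   Iterates until the largest single-step change falls below eps. */
void run_network(int n, const double w[][64], const double I[],
                 double beta, double dt, double eps, double y[])
{
    double change;
    do {
        change = 0.0;
        for (int i = 0; i < n; i++) {
            double dy = -beta * y[i] - I[i];
            for (int j = 0; j < n; j++)
                dy += w[i][j] * S(y[j]);
            y[i] += dt * dy;
            if (fabs(dt * dy) > change)
                change = fabs(dt * dy);
        }
    } while (change > eps);
}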
This model was described by Hopfield [13.23], among others. In Hopfield’s model, the
rate constant, β, resulted from a lumped-constant model of capacitance and resistance in an
operational-amplifier implementation of such a recurrent network.
Now, forget about Eq. (13.14) for a moment, and consider the objective function below, which we wish to minimize:

E = −(1/2) Σ_i Σ_j w_ij v_i v_j + β Σ_i ∫_0^{v_i} S^{−1}(v) dv + Σ_i I_i v_i.        (13.15)

If we are to find the v_i which minimize this, we need to differentiate E with respect to those v_i. Doing that, we find

∂E/∂v_i = − Σ_j w_ij v_j + β S^{−1}(v_i) + I_i.        (13.16)

Now, we observe that the derivative of E with respect to the variables v has the same form as the dynamics of a Hopfield neural network, or

∂E/∂v_i = − du_i/dt.        (13.17)
Think about the steady state of the system described by Eq. (13.17). When the network has
finished changing (all the derivatives with respect to time are zero), all the partials of the

² L. P. Deutsch.


Fig. 13.11. The angle between neighboring feature points is a measurement local to feature
point i.

energy function are also zero. So we are at an extremum. It is relatively easy to show that one may ignore that annoying integral in Eq. (13.15), and therefore a Hopfield neural network finds the set of variables v_i which minimize an objective function of the form described by Eq. (13.15) (without the integral). We illustrate the use of such a network for matching through an example.
Using the same set of features as the previous section, the zero crossings of the boundary
curvature, we assign a local measure to each feature point, in this case, the angle between the
vectors to neighboring points, as illustrated in Fig. 13.11. We will use this and a more global
feature, the distance between feature points, to solve the correspondence problem.
Assume image 1 (which you can think of as a model, if you wish) has n feature points,
and image 2 has m feature points. We define a matrix of neurons which has n columns and
m rows. The neuron at row i, column j should have a value between zero and one depending
on the degree to which feature point i in the first image matches feature point j in the second
image.
The matching process is posed as minimizing the expression

E = −(A/2) Σ_i Σ_j Σ_k Σ_l C_ijkl V_ik V_jl + (q/2) [ Σ_i Σ_k Σ_{l≠k} V_ik V_il + Σ_k Σ_i Σ_{j≠i} V_ik V_jk ].        (13.18)

The first term quantifies the compatibility of matches ik and jl. The last two terms are in-
cluded to encourage uniqueness of matches. This form is chosen to allow for occlusions. The
compatibility coefficient is the sum of three terms

C_ijkl = ω_1 Γ(θ_i, θ_k) + ω_2 Γ(θ_j, θ_l) + ω_3 Γ(r_ij, r_kl)        (13.19)

where

Γ(a, b) = 1 if |a − b| < T, and −1 otherwise,

for threshold T; θ_i is a measurement local to feature point i, as illustrated in Fig. 13.11; and r_ij is a measure of similarity of relational measures between feature points. For example, if the distance between points i and j is the same as the distance between points k and l, then labelings ik and jl are consistent.
Proper manipulation of Eq. (13.18) allows it to be put in the form of Eq. (13.15), enabling minimization by a neural network. More details are available in [13.28, 13.54]. In Fig. 13.12 an outline is shown of a pistol partially occluded by a hammer. A neural network using these principles is able to identify both objects from this image.

Fig. 13.12. Silhouette of a gun partially occluded by a hammer (redrawn from [13.28]).

13A.3 Image indexing

Up to this point, we have considered the process of image matching as searching a data
base of models for the model which best matches the observation. We have not addressed
the process of “search” itself. One could, of course, simply try all models, but that could
be prohibitively time-consuming, particularly in instances which involve large data bases
of models. In applications like automatic target recognition, where matching requires both
high speed and large data bases [13.45], better methods are required. The alternate paradigm,
indexing (sometimes called image hashing), is analyzed in [9.6]. In an indexing scheme, a set of parameters is extracted from the image. Obviously, such parameters need to be invariant
to as many image transformations as possible and also need to be robust [13.1]. This resulting
parameter vector is then used as indices into a lookup table containing references to models.
The lookup process returns a list of candidate models consistent with this particular parameter
vector. To see how this works, consider the following algorithm.
Begin by looking at local areas around the boundary and attempting to match each local
area with a data base of feature descriptors such as lines, circular arcs, and minima and maxima
of curvature. Assuming a successful segmentation of an unoccluded object, we start with an
edge image, where the edges are not required to be connected. About some point [x0 , y0 ]
on the edge, we sample the edge in that neighborhood using a sampling scheme³ which is
invariant to zoom. Form all possible combinations of that point with two other nearby points
and generate an invariant parameter vector similar to that described in [9.37]. That parameter
vector is then used to index a data base of local shapes. For each entry selected, a feature
instance is extracted and after all the triples have been considered, the feature instance with
the highest number of votes is selected.
Now, the boundary is represented by a sequence of feature instances, and the indexing
method may be repeated, using a look-up table of object models which are indexed by
geometry and occurrence of feature instances.
Numerous other approaches to indexing exist [13.5, 13.32]; an excellent review is included
in [13.48]. The space requirements for some indexing schemes are analyzed in [13.25].
As data bases get larger, one must consider the image indexing problem in the context of
the entire digital library. The reader is directed to an entire special issue of IEEE Transactions
on Pattern Analysis and Machine Intelligence (August, 1996) which addresses this.

13A.4 Matching geometric invariants

We start with simply finding a set of numbers which are invariant. The approach will be to
find five points in the 3D model and calculate from them some properties which uniquely
characterize them in an invariant way. Then, we will find five points in the image and determine
which model they best match.
Choose a set of five feature points {X 1 , X 2 , X 3 , X 4 , X 5 } from the 3D model, at least four
of which are noncoplanar. Since five points cannot be linearly independent, we can write one
of them as a linear combination of the others. We choose to represent point X 5 in this way,

³ To avoid cluttering this description of the indexing paradigm with lots of details, we ask the reader to tolerate the omission of some details. They are in the cited paper.

using homogeneous coordinates (section 9.1):

X_5 = a X_1 + b X_2 + c X_3 + d X_4.        (13.20)

We make use of the observation that the determinant of a matrix of points is invariant to rigid body motions,⁴ and write the determinant which is constructed from any four of the five points, using as a subscript the index of the point we omitted. For example,

M_1 = |X_2 X_3 X_4 X_5|.        (13.21)

From the linear dependence of X_5 in Eq. (13.20), we substitute for X_5 in each case, deriving

M_1 = a|X_2 X_3 X_4 X_1| + b|X_2 X_3 X_4 X_2| + c|X_2 X_3 X_4 X_3| + d|X_2 X_3 X_4 X_4|.        (13.22)

This can be simplified by observing that the determinant of a matrix which has two identical columns is zero:

M_1 = a|X_2 X_3 X_4 X_1|.        (13.23)

But this can be simplified even more by observing that if you interchange two columns, you flip the sign of the determinant:

M_1 = (−a)|X_1 X_3 X_4 X_2| = a|X_1 X_3 X_2 X_4| = (−a)|X_1 X_2 X_3 X_4|.        (13.24)

So

M_1 = −a M_5.        (13.25)

Similarly,

M_2 = b M_5,
M_3 = −c M_5,        (13.26)
M_4 = d M_5.

From this, we can write an expression for the coefficients:

a = −M_1/M_5,   b = M_2/M_5,   c = −M_3/M_5,   d = M_4/M_5.        (13.27)
In 2D, the same five points project to a set of 3-vectors (again, using homogeneous coordi-
nates), and

x_5 = a x_1 + b x_2 + c x_3 + d x_4.        (13.28)

We construct 3 × 3 matrices by leaving out two indices, and denoting by subscript the indices left out:

m_12 = |x_3 x_4 x_5|.        (13.29)

⁴ In fact, absolute invariants of linear forms are always ratios of powers of determinants [13.19].

At this point, we simplify the notation: Get rid of the x's and just keep track of the subscripts, rewriting the definition of m_12:

m_12 = |3 4 5|.        (13.30)

As above, we can do algebra to relate the determinants and the coefficients, for example,

m_12 = a|3 4 1| + b|3 4 2| = a|1 3 4| + b|2 3 4| = a m_25 + b m_15        (13.31)

and

m_13 = a m_35 − c m_15,
m_14 = a m_45 + d m_15.        (13.32)
We have determined forms for the coefficients in terms of the M_i, and adding those relations into the equations we just derived produces

M_5 m_12 + M_1 m_25 − M_2 m_15 = 0
M_5 m_13 + M_1 m_35 − M_3 m_15 = 0        (13.33)
M_5 m_14 + M_1 m_45 − M_4 m_15 = 0.

These relations are invariant to both 3D and 2D motions except for a multiplicative scale
which affects all the M_i's the same. We can eliminate this dependence by using ratios and
define 3D invariants
I_1 = \frac{M_1}{M_5} \qquad I_2 = \frac{M_2}{M_5} \qquad I_3 = \frac{M_3}{M_5}   (13.34)
and 2D invariants
i_{12} = \frac{m_{12}}{m_{15}} \qquad i_{13} = \frac{m_{13}}{m_{15}} \qquad i_{25} = \frac{m_{25}}{m_{15}} \qquad i_{35} = \frac{m_{35}}{m_{15}}.   (13.35)
The denominators are not zero, since they are the determinants of matrices which we know
are nonsingular. Look at Eq. (13.33) and divide the top line by M5 :
\frac{M_5}{M_5} m_{12} + \frac{M_1}{M_5} m_{25} - \frac{M_2}{M_5} m_{15} = 0,   (13.36)
which simplifies to

m_{12} + I_1 m_{25} - I_2 m_{15} = 0.   (13.37)

Similarly divide by m 15 to produce two independent equations:

i_{12} + I_1 i_{25} - I_2 = 0
                                    (13.38)
i_{13} + I_1 i_{35} - I_3 = 0.
So if we have 2D invariants, we have two equations for the 3D invariants. The two equations of
Eq. (13.38) do not, unfortunately, determine the three 3D invariants. Still, those two equations
determine a line in the 3-space of the I's.

How do we use an idea like this? Given a 3D model of an object, and any five points,
four of which are not coplanar, we can find I_1, I_2, and I_3, a point in a 3D space. To perform
recognition, we first extract from the 2D image (generally several) 5-tuples of feature points
and from them construct the 2D invariants. Each 5-tuple gives rise to two equations in I_1,
I_2, I_3 space, that is, a straight line in the 3D invariant space. If a 5-tuple in the 2D image
is a projection of some 5-tuple in 3D, then the line so obtained will pass through the single
point representing the model. If we have a different projection of those five points, we get a
different straight line, but it still passes through the model point.
Implementing this for realistic scenes is slightly more complicated than this description
because one must actually make use of projective geometry rather than assuming orthogonal
projections. Other complications arise in determining a suitable way to choose 5-tuples, and
a means for dealing with the fact that the line may “almost” pass through the point. Weiss
and Ray [13.52] address these issues.
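As an illustration, the invariants of Eqs. (13.34) and (13.35) are only a few determinants.
The following sketch is our own construction (the function names and the simple drop-the-z
projection used in the check are assumptions, not from the cited paper), and it ignores the
projective complications just mentioned:

    import numpy as np

    def invariants_3d(X):
        """I1, I2, I3 of Eq. (13.34) from five homogeneous 4-vectors (rows of X).
        M(i) is the determinant of Eq. (13.21) with point i omitted (0-based)."""
        M = lambda i: np.linalg.det(np.array([X[j] for j in range(5) if j != i]).T)
        return M(0) / M(4), M(1) / M(4), M(2) / M(4)

    def invariants_2d(x):
        """i12, i13, i25, i35 of Eq. (13.35) from five homogeneous 3-vectors."""
        m = lambda j, k: np.linalg.det(
            np.array([x[i] for i in range(5) if i + 1 not in (j, k)]).T)
        return (m(1, 2) / m(1, 5), m(1, 3) / m(1, 5),
                m(2, 5) / m(1, 5), m(3, 5) / m(1, 5))

    # Quick check under a simple orthographic projection (drop the z coordinate):
    P3 = np.random.rand(5, 3)                     # five 3D points in generic position
    X = np.hstack([P3, np.ones((5, 1))])          # homogeneous 4-vectors
    x = np.hstack([P3[:, :2], np.ones((5, 1))])   # their homogeneous 2D projections
    I1, I2, I3 = invariants_3d(X)
    i12, i13, i25, i35 = invariants_2d(x)
    print(i12 + I1 * i25 - I2, i13 + I1 * i35 - I3)   # both near zero, Eq. (13.38)

In practice one would use a full projective camera model and test whether the line comes
sufficiently near the model point, rather than demanding an exact zero.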

13A.5 Conclusion

13A.5.1 Which model to use?

So far, we have described quite a collection of representations for objects, but certainly not
all that are in the literature. Other methods include variations on deformable models [13.10,
13.11], especially for range images [13.21].
Consider Fig. 13.13: Should you match it to a circle or a six-sided polygon? Clearly, there
is no simple answer to this question. If you have prior, problem-specific knowledge that you
are always dealing with circular objects, you might choose to use the circular model, which
is certainly less complex than that of a polygon. The idea of minimum description length
(MDL) provides some help along these lines. The MDL paradigm states that the optimal
representation for a given image may be determined by minimizing the combined length of
the encoding of the representation and the residual error. Interestingly, a MAP representation
can be shown [13.9, 13.30] to be equivalent to the MDL representation where the prior truly
represents the signal.

Fig. 13.13. A set of points produced by an edge detector which might come from a circle or a
polygon.
Schweitzer [13.43] uses the MDL philosophy to develop algorithms for computing the
optic flow, and Lanterman [13.29] uses it to characterize infrared scenes in ATR applications –
“if there are several descriptions compatible with the observed data, we select the most
parsimonious” [13.29].
Rissanen [13.41] suggests that the quality of an object/model match could be represented
by

L(x, \theta) = -\log_2 P(x|\theta) + L(\theta)   (13.39)

where x is the observed object, θ is the model, represented as a vector of parameters, P(x|θ) is
the conditional probability of making this particular measurement given the model, and L(θ)
denotes the number of bits required to represent the model. The logarithm of the conditional
probability is then a measure of how well the data fit the model. We thus may trade off a
more precise fit of a more complex model against a less accurate fit of a simpler model [13.6].
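To make Eq. (13.39) concrete, here is a minimal sketch of an MDL score; it is our own
construction, assuming Gaussian residuals and an assumed fixed encoding cost per model
parameter:

    import numpy as np

    def mdl_score(residuals, n_params, sigma=1.0, bits_per_param=16):
        """Eq. (13.39): -log2 P(x|theta) under an assumed Gaussian residual
        model, plus L(theta) as an assumed fixed cost per parameter."""
        nll_nats = np.sum(0.5 * (residuals / sigma) ** 2
                          + 0.5 * np.log(2.0 * np.pi * sigma ** 2))
        return nll_nats / np.log(2.0) + n_params * bits_per_param

Under this scoring, a circle (three parameters) beats, say, a hexagon (twelve vertex
coordinates) unless the hexagon reduces the residuals by enough bits to pay for its extra
parameters.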

Ultimately, machine vision is not going to be solved by one program, one algorithm, or
one set of mathematical concepts. Ultimately, its solution will depend on the ability to build
systems which integrate a collection of specialists. The jury is still out on how to accomplish
this. Regrettably, only a few papers have undertaken this formidable task. For example, Grosso
and Tistarelli [13.18] combine stereopsis and motion. Bilbro and Snyder [13.4] fuse luminance
and range to improve the quality of the range imagery, and Pankanti and Jain [13.38] fuse
stereo, shading, and relaxation labeling. Zhu and Yuille [8.80] incorporate the MDL approach,
including active contours and region growing, into a unified look at segmentation. Gong and
Kulikowski [13.16] use a planning strategy, primarily in the medical application area.

13A.5.2 Consistency and optimization in matching

In sections 13A.1 and 13A.2, the first step is to identify feature points which are relatively
distinctive. Then the algorithm makes use of relationships between these points, relying on
consistency to find the best match.
[Margin note: A recurrent neural network is an optimization engine!]

In the discussion of recurrent neural networks, we showed that such a network achieves a
stable state which is in fact the minimum of the objective function of Eq. (13.15).

13A.6 Vocabulary

You should know the meanings of the following terms.

Eigenvector
Feedforward neural net
Geometric invariant
Image indexing
Recurrent neural net

Bibliography

[13.1] Y. Amit and A. Kong, “Graphical Templates for Model Registration,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 18(3), 1996.
[13.2] K. Astrom, “Fundamental Limitations on Projective Invariants of Planar Curves,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1), 1995.
[13.3] B. Bhanu and O. Faugeras, “Shape Matching of Two Dimensional Objects,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 6(2), 1984.
[13.4] G. Bilbro and W. Snyder, “Fusion of Range and Luminance Data,” IEEE Symposium
on Intelligent Control, Arlington, August, 1988.
[13.5] A. Bimbo and P. Pala, “Visual Image Retrieval by Elastic Matching of User Sketches,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2), 1997.
[13.6] J. Canning, “A Minimum Description Length Model for Recognizing Objects with
Variable Appearances (The VAPOR Model),” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 16(10), 1994.
[13.7] Q. Chen, M. Defrise, and F. Deconinck, “Symmetric Phase-only Matched Filtering
of Fourier–Mellin Transforms for Image Reconstruction and Recognition,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(12), 1994.

[13.8] T. Chen and W. Lin, “A Neural Network Approach to CSG-based 3-D Object Recog-
nition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(7),
1994.
[13.9] T. Darrell and A. Pentland, “Cooperative Robust Estimation Using Layers of
Support,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(5),
1995.
[13.10] D. DeCarlo and D. Metaxas, “Blended Deformable Models,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 18(4), 1996.
[13.11] S. Dickinson, D. Metaxas, and A. Pentland, “The Role of Model-based Segmentation
in the Recovery of Volumetric Parts From Range Data,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 19(3), 1997.
[13.12] M. Dubuisson Jolly, S. Lakshmanan, and A. Jain, “Vehicle Segmentation and
Classification using Deformable Templates,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 18(3), 1996.
[13.13] M. Fischler and R. Elschlager, “The Representation and Matching of Pictorial Struc-
tures,” IEEE Transactions on Computers, 22(1), 1973.
[13.14] S. Gold and A. Rangarajan, “A Graduated Assignment Algorithm for Graph Match-
ing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4), 1996.
[13.15] R. Golden, Mathematical Methods for Neural Network Analysis and Design,
Cambridge, MA, MIT Press, 1996.
[13.16] L. Gong and C. Kulikowski, “Composition of Image Analysis Processes Through
Object-centered Hierarchical Planning,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 17(10), 1995.
[13.17] F. Goudail, E. Lange, T. Iwamoto, K. Kyuma, and N. Otsu, “Face Recognition
System Using Local Autocorrelation and Multiscale Integration,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 18(10), 1996.
[13.18] E. Grosso and M. Tistarelli, “Active/Dynamic Stereo Vision,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(9), 1995.
[13.19] G. Gurevich, Foundations of the Theory of Algebraic Invariants, Transl. Raddock
and Spencer, Groningen, The Netherlands, Nordcliff Ltd, 1964.
[13.20] S. Haykin, Neural Networks, A Comprehensive Foundation, Englewood Cliff, NJ,
Prentice-Hall, 1999.
[13.21] M. Hebert, K. Ikeuchi, and H. Delingette, “Spherical Representation for Recogni-
tion of Free-form Surfaces,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 17(7), 1995.
[13.22] D. Heisterkamp and P. Bhattachaya, “Matching of 3D Polygonal Arcs,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 19(1), 1997.
[13.23] J. Hopfield, “Neural Networks and Physical System with Emergent Collective
Computational Abilities,” Proceedings of the National Academy of Science, 79,
pp. 2554–2558, 1982.
[13.24] B. Hussain and M. Kabuka, “A Novel Feature Recognition Neural Network and its
Application to Character Recognition,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(1), 1994.
[13.25] D. Jacobs, “The Space Requirements of Indexing Under Perspective Projections,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(3), 1996.

[13.26] A. Jain, Y. Zhong, and S. Lakshmanan, “Object Matching Using Deformable
Templates,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(3),
1996.
[13.27] T. Kanade and M. Okutomi, “A Stereo Matching Algorithm with an Adaptive
Window: Theory and Experiment,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(9), 1994.
[13.28] J. Kim, S. Yoon, and K. Sohn, “A Robust Boundary-based Object Recognition in
Occlusion Environment by Hybrid Hopfield Neural Networks,” Pattern Recognition,
29(12), 1996.
[13.29] A. Lanterman, “Minimum Description Length Understanding of Infrared Scenes,”
SPIE Automatic Target Recognition VIII, 3371, April 1998.
[13.30] Y. Leclerc, “Constructing Simple Stable Descriptions for Image Partitioning,”
International Journal of Computer Vision, 3, pp. 73–102, 1989.
[13.31] M. Lew, T. Huang, and K. Wong, “Learning and Feature Selection in Stereo Match-
ing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9), 1994.
[13.32] S. Li, K. Chan, and C. Wang, “Performance Evaluation of the Nearest Feature
Line Method in Image Classification and Retrieval,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 22(11), 2000.
[13.33] T. Mitchell, Machine Learning, New York, McGraw-Hill, 1997.
[13.34] H. Murakami and B. Kumar, “Efficient Calculation of Primary Images from a Set
of Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 4(5),
1982.
[13.35] H. Murase and M. Lindenbaum, “Partial Eigenvalue Decomposition of Large Im-
ages Using the Spatial Temporal Adaptive Method,” IEEE Transactions on Image
Processing, 4(5), 1995.
[13.36] C. Olson, “Maximum Likelihood Template Matching,” Completing this citation is a
homework assignment (Assignment 13.3).
[13.37] D. Paglieroni, G. Ford, and E. Tsujimoto, “The Position-orientation Masking
Approach to Parametric Search for Template Matching,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 16(7), 1994.
[13.38] S. Pankanti and A. Jain, “Integrating Vision Modules: Stereo, Shading, Grouping, and
Line Labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(9), 1995.
[13.39] B. Parsi, A. Margalit, and A. Rosenfeld, “Matching General Polygonal Arcs,”
Computer Vision, Graphics, and Image Processing. Image Understanding, 53(2),
pp. 227–234, March, 1991.
[13.40] K. Rao and J. Ben-Arie, “Optimal Edge Detection Using Expansion Matching and
Restoration,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(12), 1994.
[13.41] J. Rissanen, “A Universal Prior for Integers and Estimation by Minimum Description
Length,” Annals of Statistics, 11(2), pp. 416–431, 1983.
[13.42] J. Schwartz and M. Sharir, “Identification of Partially Obscured Objects in Two and
Three Dimensions by Matching Noisy Characteristic Curves,” International Journal
of Robotics Research, 6(2), 1987.

[13.43] H. Schweitzer, “Occam Algorithms for Computing Visual Motion,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 17(11), 1995.
[13.44] S. Sclaroff and A. Pentland, “Model Matching for Correspondence and Recognition,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(6), 1995.
[13.45] K. Sengupta and K. Boyer, “Organizing Large Structural Modelbases,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 17(4), 1995.
[13.46] L. Shapiro and J. M. Brady, “Feature-based Correspondence: an Eigenvector
Approach,” Image and Vision Computing, 10(5), 1992.
[13.47] X. Shen and P. Palmer, “Uncertainty Propagation and Matching of Junctions as Fea-
ture Groupings,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
22(12), 2000.
[13.48] A. Smeulders, M. Worring, S. Santini, G. Gupta, and R. Jain, “Content-based Image
Retrieval at the End of the Early Years,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(12), 2000.
[13.49] D. Swets and J. Weng, “Using Discriminant Eigenfeatures for Image Retrieval,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 1996.
[13.50] M. Turk and A. Pentland, “Eigenfaces for Recognition,” Journal of Cognitive
Neuroscience, 3(1), pp. 71–86, 1991.
[13.51] X. Wang and H. Qi, “Face Recognition Using Optimal Non-orthogonal Wavelet
Basis Evaluated by Information Complexity,” International Conference on Pattern
Recognition, vol. 1, pp. 164–167, Quebec, Canada, August, 2002.
[13.52] I. Weiss and M. Ray, “Model-based Recognition of 3D Objects from Single Images,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), 2001.
[13.53] M. Yang and J. Lee, “Object Identification from Multiple Images Based on Point
Matching under a General Transformation,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 16(7), 1994.
[13.54] S. Yoon, A New Multiresolution Approximation Approach to Object Recognition,
Ph.D. Thesis, North Carolina State University, 1995.
14 Statistical pattern recognition

Statistics are used much like a drunk uses a lamppost: for support, not illumination
Vin Scully

The discipline of statistical pattern recognition by itself can fill textbooks (and in fact,
it does). For that reason, no effort is made to cover the topic in detail in this single
chapter. However, the student in machine vision needs to know at least something
about statistical pattern recognition in order to read the literature and to properly
put the other machine vision topics in context. For that reason, a brief overview
of the field of statistical methods is included here. To do serious research in ma-
chine vision, however, this chapter is not sufficient, and the student must take a
full course in statistical pattern recognition. For texts, we recommend several: The
original version of the text by Duda and Hart [14.3] included both statistical pattern
classification and machine vision; however, the new version [14.4] is pretty much
limited to classification, and we recommend it for completeness. The much older
text by Fukunaga [14.6] still retains a lot of useful information, and we recommend
[14.11] for readability.

14.1 Design of a classifier

Recall the example described in section 13.2. In that example, we are given models
for axes and hatchets which were derived statistically by computing averages of
samples known to be either axes or hatchets. We called these collections “training
sets.” In section 13.2, we compared an unknown, represented by a feature vector,
with both models and decided to assign the unknown to the class of the model it
most closely resembled, where “closely resembled” was simple Euclidean distance
between the unknown feature vector and the two models. In this chapter, we demon-
strate that this “closest mean” decision rule is actually a simplification of maximum
likelihood assuming a Gaussian probability density for the classes. Further, we show
that decision rules other than closest mean may be used, and may perform better
and/or be more effectively computed.
We begin by discussing some of the options in decision rules.

[Figure 14.1: a scatter graph of two classes, marked x and o, in the length–area feature
plane, with a straight decision boundary drawn between them.]

Fig. 14.1. A linear decision boundary.

14.1.1 Choice of a decision rule


Remember from Figure 1.2 that the image is measured and features are the output of
the measurement process. The features are a set of numbers that characterize some
property of the image. We use those features in a classifier.
Designing a classifier essentially consists of developing a method to implement a
decision rule. Several options exist for determining both the form of the decision rule
and the parameters which characterize that rule. We describe those options briefly
here.

Linear decision rules

Figure 14.1 illustrates the result of a large number of measurements of two different
industrial parts. The features, area and length, have both been measured, and each
measurement is indicated in the figure by a single mark. (A chart like this is called a
“scatter graph.”) The x points represent one class, flanges, and the o points represent
a second class, gaskets. A linear decision boundary has been drawn on the figure. A
linear decision rule would be as follows: “Decide the unknown object is a flange if the
result of the measurements lies on the left of the decision boundary; otherwise decide
it is a gasket.” Linear decision rules are particularly attractive because they can be
implemented by linear machines, which have a great deal of potential parallelism
and therefore high speed. As can be seen from the figure, these two classes are
not linearly separable. That is, there does not exist any one straight line which
completely partitions the two classes. The choice of the best such straight line is the
result of the linear classifier design process. A variation on linear machines is given
in section 14A.2, where we introduce support vector machines.
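To make the idea concrete, a linear decision rule is just a sign test on a weighted sum
of the features. The sketch below is ours; the weights and offset are invented for
illustration, not fitted to any training data:

    import numpy as np

    w = np.array([1.5, -0.7])   # weights for (area, length); illustrative only
    b = -2.0                    # offset defining the boundary w.x + b = 0

    def decide(x):
        """Linear decision rule: which side of the boundary does x fall on?"""
        return "flange" if w @ x + b > 0 else "gasket"

    print(decide(np.array([3.0, 2.0])))

Designing the classifier amounts to choosing w and b; the rest of this chapter develops
principled ways of making such choices.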

14.1.2 Maximum likelihood classifier


We have already seen the term maximum likelihood in section 6.2, in the context of
image restoration. Here, we describe the use of this particular set of mathematics in
classifier design.

[Margin note: Conditional probabilities play a critical role in maximum likelihood
functions.]

In the maximum likelihood approach to classifier design, statistical representations
are used to describe the probability that an object having a certain set of measurements
belongs to a particular class; that is, we estimate P(w_i|x), the conditional probability
that the object being measured belongs to class w_i given a particular measurement
(vector) x. We compute such a probability for each w_i and the decision rule is then
“assign the object to the class i, if P(w_i|x) > P(w_j|x) for all j.”
It should be noted that the term “maximum likelihood” does not necessarily mean
best performance. Maximum likelihood algorithms are typically based on assump-
tions concerning the form of probability density functions: Assumptions which may
be invalid for many of the data points within a class. Designing a maximum like-
lihood classifier entails both the process of choosing the form for the probability
density functions for each class (e.g., Gaussian density functions) and the process
of choosing the parameters which describe those density functions (e.g., mean and
variance).

14.1.3 Determining descriptions for classifiers


A training set is a set of samples which are drawn from each class and which may
be used to statistically characterize that class. The method which we will use to
determine the parameters of our pattern classifier will depend upon whether we have
training sets available.

Supervised learning

If we are given one training set for each class and from those training sets we can
develop the statistical representations of the classes, then this process is known as
“supervised learning.” The word supervised refers to the fact that each data point is
independently labeled according to the class to which it belongs. Each class may then
be characterized either statistically by its mean, variance or other statistical measures,
or by some other parametric representation. Fig. 14.1 illustrates the data distribution
resulting from supervised sampling, i.e., the x points are identified as belonging to
one class and the o points to another. The example we have used previously in section
13.2, of distinguishing axes from hatchets, is a supervised learning problem, since
it was assumed that we had training sets of both classes.

Unsupervised learning

Fig. 14.2 represents a possible result of an unsupervised learning process. In that


figure the samples are not identified by their class. However, two distinct clusters
may readily be seen. An example might be a multispectral satellite image of a field
of wheat. If the field contains two distinct types of wheat, two clusters could occur

[Figure 14.2: a single cloud of unlabeled samples, all marked o, falling into two
distinct clusters.]

Fig. 14.2. In unsupervised learning, only one set of measurements is taken; however, such data
may fall into natural clusters.

in the measurements. Clustering seems, from this example, to be a very simple


problem – the two clusters are obvious, just looking at them. However, remember,
not all measurement vectors are two dimensional, and humans have a much harder
time dealing with, for example, seven-dimensional data. An unsupervised learning
or clustering algorithm will automatically identify those two (or more) clusters and
associate samples with the nearest (in some sense) cluster.
As we proceed through this material, you will learn how to determine the mea-
surements needed in order to implement these classification algorithms. Clustering
turns out to be useful in other machine vision algorithms as well, such as locating
the peaks in the Hough transform accumulator array. It is covered in more detail in
Chapter 15.

14.2 Bayes’ rule and the maximum likelihood classifier

In this section we will design a classifier based upon assumptions concerning the
statistical nature of the training sets. If these assumptions should be valid for the
training sets in question, then the classifier which results will give the best perfor-
mance. We will also investigate the performance of such a classifier, including error
rates.
The attentive student will note similarities between the descriptions of statistical
concepts presented here and those presented in Chapter 6. The similarities are cor-
rect and deliberate. In that chapter, we sought an image which minimized certain
properties. In this chapter we seek a decision: To which class an object belongs.

14.2.1 Bayes’ rule


We define P(w_i) to represent the a priori probability that class w_i occurs, that is, the
probability of class w_i occurring before any measurements are made. For example,
suppose we have a factory which manufactures flanges and gaskets, but makes nine
times as many flanges as gaskets. Flanges and gaskets may come down the conveyor
at random times. But because of our a priori knowledge that the plant manufactures
nine times as many flanges as gaskets, we know that we are much more likely to see
a flange than a gasket if we choose to look at the conveyor at some random time. Thus
the a priori probability of flanges is 0.9 and the a priori probability of gaskets is 0.1.

Fig. 14.3. The same measurement on two objects may have different average values, but due to
noise in the measurement, or actual variation, these values may overlap.
We define p(x|w_i) to represent the conditional probability density of a measure-
ment x occurring given that the sample is known to come from class w_i. For a
particular w_i, p(x|w_i) should be thought of as a function of x. Suppose we have a
factory which manufactures axes and hatchets. Then we might find the probability
densities for the lengths of axes and hatchets to be represented by Fig. 14.3. In that
figure we see that an axe is most likely to be 30 inches long and a hatchet most likely
to be 12 inches, but that some variation in length can occur.
The probability density function may be characterized in several possible ways.
One is by simply tabulating the number of times a particular value occurs for each
possible value of the variable, in this case, length. Such a tabulation is referred to
as a histogram of the variables. Properly normalized, a histogram can be a useful
representation of a probability density function, but of course requires that only a
finite number of possible values may exist. One may also describe a density function
in a parametric way using some analytic function (e.g., the Gaussian) to represent
the density.
[Margin note: The term likelihood is used rather than probability because we will
actually maximize some function of the probability.]

Finally we define P(w_i|x), the posterior conditional probability, to represent the
conditional probability that the object being observed belongs to a class w_i given a
measurement x. P(w_i|x) is what we are looking for. We will use it as our decision
rule or, more correctly, as our discriminant function. Our decision rule will then be
as follows: For a measurement x made on an unknown object, compute P(w_i|x)
for each class, that is, for each possible value of i. Then decide that the unknown
belongs to the class i for which P(w_i|x) is greater than P(w_j|x) for all j ≠ i.

When we make a classification decision based on P(w_i|x), we are using a maximum
likelihood classifier.
We can relate the three functions just defined by using Bayes’ rule:
P(w_i|x) = \frac{p(x|w_i) P(w_i)}{Something}        (14.1)

Something = p(x) = \sum_{j=1}^{c} p(x|w_j) P(w_j).  (14.2)

In Eq. (14.1) we used “something” to represent the denominator for the conditional
probability density. We used the word “something” to call attention to the fact that this
number represents the probability density of that value of x occurring, independent
of the class of the observation. Since this number is independent of the class, and is
the same for all classes, it therefore does not provide us any help in distinguishing
which class is most likely. Instead, it is a normalization constant which we use to
ensure that the number P(w_i|x) has the desirable properties of a probability; that
is, it lies between 0 and 1 and when summed over all the classes, it sums to 1 (the
observed object belongs to at least one of the classes which we are considering).
In a sense, Equation (14.1) solves the pattern recognition problem. It tells us how
to make a decision, assuming we know each of the components of the RHS. In the
next section we consider how one goes about determining those components.
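Numerically, Eq. (14.1) is one multiply and one normalization. The following sketch
is our own; the likelihood values are invented for illustration:

    import numpy as np

    def posterior(likelihoods, priors):
        """Bayes' rule, Eq. (14.1): P(w_i|x) from p(x|w_i) and P(w_i).
        likelihoods[i] holds p(x|w_i) evaluated at the measurement x."""
        joint = np.asarray(likelihoods, dtype=float) * np.asarray(priors, dtype=float)
        return joint / joint.sum()   # divide by "Something" = p(x), Eq. (14.2)

    # Two classes with priors 0.9 and 0.1, as in the flange/gasket example:
    print(posterior([0.2, 0.5], [0.9, 0.1]))   # -> [0.783..., 0.217...]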

14.2.2 Parametric pattern classifiers


In the previous section we demonstrated that a maximum likelihood classifier could
make a decision based upon knowledge of two things: The a priori probability of
the class being considered and the conditional probability densities for each class.
We have said nothing up to this point about the form which the probability density
function may take. In Fig. 14.3 the density functions take the familiar “bell curve”
shape. This shape could be described by a table of values or by an analytic function.
For scalar-valued x the analytic function of the “bell curve” has the form


p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right].   (14.3)
Here p(x) is known as the univariate Gaussian (normal) density. The word “uni-
variate” refers to the fact that x is a scalar, a single variable. Here, the mean μ and
standard deviation σ comprise the elements of a “parameter vector”:

\theta_i = \begin{bmatrix} \mu_i \\ \sigma_i \end{bmatrix}.   (14.4)

These two numbers serve to completely describe the conditional probability density
of the variable x for class i assuming that the density has a Gaussian form.

If, instead of a single measurement, we have taken a set of measurements, then x
becomes a vector quantity and p(x|w_i) becomes the “multivariate Gaussian density”:

p(x|w_i) = \frac{1}{(2\pi)^{d/2} |K_i|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu_i)^T K_i^{-1} (x - \mu_i) \right].   (14.5)

In Eq. (14.5), d is the dimensionality of the vector x, \mu_i is the mean vector which
represents the average (vector-valued) value of the random vector of measurements,
and K_i is a d × d covariance matrix.
Thus, in the univariate Gaussian case, we can represent the class conditional
density of the measurements by two numbers, and in the multivariate case by a
d-dimensional vector and a d by d matrix. Given these parameters, we can easily
substitute them into Eq. (14.5) and then into Eq. (14.1) and compute the most likely
class for an unknown object. Unfortunately, in most applications we are not given
the mean and covariance but rather must estimate them from training sets.
Take the log of the RHS of Eq. (14.5). That will eliminate the exponential, and,
since the logarithm is monotonic, will result in an expression involving the mea-
surement vector x and the statistics μ_i and K_i which characterize class i. This
expression is maximized in the same way as the original probability. Classification
is now straightforward: Just substitute x into each equation you can generate like
this using all the different means and covariances. This gives you c (assuming you
have c classes) different functions, called discriminant functions, all of which have
the same form but which have different parameters. Assign x to the class for which
the corresponding discriminant function is the largest.
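A minimal sketch of this procedure follows (our own code, not from the text); it
evaluates the log of p(x|w_i)P(w_i), dropping the (2π)^{d/2} term common to all
classes:

    import numpy as np

    def log_discriminant(x, mu, K, prior):
        """Log of Eq. (14.5) times the prior, with the (2*pi)^(d/2)
        factor dropped since it is the same for every class."""
        d = x - mu
        return (-0.5 * d @ np.linalg.solve(K, d)
                - 0.5 * np.log(np.linalg.det(K))
                + np.log(prior))

    def classify(x, classes):
        """classes: list of (mu, K, prior) tuples, one per class.
        Returns the index of the largest discriminant function."""
        return int(np.argmax([log_discriminant(x, m, K, p) for m, K, p in classes]))

With equal priors and equal covariances, this reduces to the nearest mean rule derived
later in this chapter.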

14.2.3 Density estimation


Since the Gaussian (normal) density occurs so often in actual distributions of random
variables, and is so convenient to work with, we will use it often, and treat it as a
special case.
To design a pattern classifier using supervised methods, we need to estimate the
parameters of the density from a training set of samples. We will denote the parameter
set by the vector ␪.

The univariate Gaussian case

Let X i represent a set of n i elements where each element represents a measurement


made on an object known to belong to class i. Because we are considering here
only the univariate case, the elements of X i will be scalar-valued. We will consider
vector-valued elements and multivariate densities in the next subsection.
We assume that the samples in X i give us no information about the parameters in
any class other than class i. This permits us to work with each class separately.

Assuming that the samples are drawn independently, the probability of drawing
the entire set X_i is determined by

p(X_i) = \prod_{k=1}^{n_i} p(x_{ik}).   (14.6)

Since this probability is intrinsically dependent upon the parameters θ_i, we rewrite
Eq. (14.6) to make this dependence clear:

p(X_i|\theta_i) = \prod_{k=1}^{n_i} p(x_{ik}|\theta_i).   (14.7)

[Margin note: A second use of the term “maximum likelihood.”]

The maximum likelihood estimate of θ_i is then defined as that value of θ_i which max-
imizes p(X_i|θ_i). Eq. (14.7) describes the likelihood of any particular training set
occurring, given that the probability distribution is described by the parameter vec-
tor θ_i. Since we are dealing with the Gaussian density, we rewrite Eq. (14.7) as

p(X_i|\theta_i) = \prod_{k=1}^{n_i} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[ -\frac{(x_{ik} - \mu_i)^2}{2\sigma_i^2} \right].   (14.8)
Now an important observation: The value of θ_i which maximizes p(X_i|θ_i) also
maximizes ln[p(X_i|θ_i)]. This is true because the natural logarithm is a monotonically
increasing function. Thus, we have our choice of finding the parameter vector θ_i
which maximizes either the density or its logarithm. The logarithm will be much
easier to use. Taking the log of the RHS we find

\ln(p(X_i|\theta_i)) = \sum_{k=1}^{n_i} \ln\left( \frac{1}{\sqrt{2\pi}\,\sigma_i} \right) - \sum_{k=1}^{n_i} \frac{1}{2} \left( \frac{x_{ik} - \mu_i}{\sigma_i} \right)^2.   (14.9)

14.2.4 Estimating the mean


We denote our estimate of μ by μ̂. To find the value of μ which maximizes
Eq. (14.9) we differentiate with respect to μ and set the result to 0:

\sum_{k=1}^{n_i} \frac{x_{ik} - \hat{\mu}_i}{\sigma_i} = 0   (14.10)

which simplifies to

\sum_{k=1}^{n_i} x_{ik} - \sum_{k=1}^{n_i} \hat{\mu}_i = 0   (14.11)

\hat{\mu}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} x_{ik}.   (14.12)

μ̂_i is known as the sample mean, and the fact that it is equal to the average value is
certainly intuitively satisfying.

14.2.5 Estimating the variance


We perform the derivation as before, but consider both mean and variance as un-
knowns. In the following, we drop the subscript i denoting the class, for notational
simplicity. Initially, consider the univariate case; then the parameter vector θ is a
2-vector:

\theta = \begin{bmatrix} \mu \\ \sigma^2 \end{bmatrix}.   (14.13)

Rewrite Eq. (14.9) to make it a bit simpler, and the log of the probability becomes

L = \sum_{k=1}^{n} \left[ -\ln\sqrt{2\pi} - \ln\sigma - \frac{1}{2} \left( \frac{x_k - \mu}{\sigma} \right)^2 \right]   (14.14)
and we find (from ∂L/∂μ = 0)

\frac{1}{\hat{\sigma}^2} \sum_{k=1}^{n} (x_k - \hat{\mu}) = 0   (14.15)

and from ∂L/∂σ,

-\sum_{k=1}^{n} \frac{1}{\hat{\sigma}} + \sum_{k=1}^{n} \frac{(x_k - \hat{\mu})^2}{\hat{\sigma}^3} = 0.   (14.16)

We simplify Eq. (14.15) by multiplying by \hat{\sigma}^2 and find

\sum_{k=1}^{n} x_k = \hat{\mu} \sum_{k=1}^{n} 1   (14.17)

and

\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k   (14.18)

as before.
Equation (14.16) simplifies similarly to yield

\frac{n}{\hat{\sigma}} = \frac{1}{\hat{\sigma}^3} \sum_{k=1}^{n} (x_k - \hat{\mu})^2,   (14.19)

and therefore

\hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2.   (14.20)

Thus we see that the best estimates for the parameters of a normal density are the
familiar sample mean and sample variance.

The multivariate Gaussian density

We may treat the (vector-valued) parameters of the multivariate normal density


in a manner similar to the previous subsection, and find the maximum likelihood
estimates of those parameters:

[Margin note: Look at how K is defined. Is it a matrix? A scalar? A vector?]

\hat{\mu}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} x_{ik}   (14.21)

K_i = \frac{1}{n_i} \sum_{k=1}^{n_i} (x_{ik} - \hat{\mu}_i)(x_{ik} - \hat{\mu}_i)^T.   (14.22)

Thus we have essentially the same results for the multivariate case as for the univariate
case: That the best estimate (in the maximum likelihood sense) of the mean and
variance of a Gaussian are the sample mean and sample (co)variance.
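In code, Eqs. (14.21) and (14.22) are one line each. Note that the maximum likelihood
covariance divides by n_i, not the n_i − 1 of the familiar unbiased estimator (a sketch,
with names of our own choosing):

    import numpy as np

    def ml_estimates(X):
        """Eqs. (14.21)-(14.22): sample mean and ML sample covariance.
        X is an n_i x d array of training samples from one class."""
        mu_hat = X.mean(axis=0)
        D = X - mu_hat
        K_hat = (D.T @ D) / X.shape[0]   # divide by n_i: the ML estimate
        return mu_hat, K_hat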
Now what is there to remember from this chapter? To perform the maximum
likelihood estimate of a set of parameters, given a training set, assume independence
(if you can) and write the probability of the entire training set occurring as a product.
Take logs, differentiate, and set to zero to produce a set of simultaneous equations
which, when solved, will be the best estimates of the parameters. This approach
works for any distribution, not just the Gaussian. However, for some cases, the
process of solving the system of simultaneous equations may be intractable.
Finally, there are other ways to find parameters, other than maximum likelihood;
techniques which space and time do not permit us to cover here.

14.2.6 The likelihood ratio


We wish to make a decision which maximizes the likelihood that we are correct. To
accomplish this, we choose the class which maximizes the a posteriori probability –
the probability of our decision being correct, given the measurement.
That is, choose i to maximize P(w_i|x). To accomplish this, recall Bayes' rule, for
class 1:

P(w_1|x) = \frac{p(x|w_1) P(w_1)}{p(x)}.   (14.23)

As mentioned above, remember that p(x) is the same regardless of whether x belongs
to class 1 or 2. Since this denominator is unaffected by the classification decision,
we can ignore it in making that decision.
In the two-class case, we choose class 1 if P(w_1|x) > P(w_2|x), or, substituting
Bayes' rule, we choose class 1 if

p(x|w_1) P(w_1) > p(x|w_2) P(w_2),   (14.24)



that is,

\frac{p(x|w_1)}{p(x|w_2)} > \frac{P(w_2)}{P(w_1)}.   (14.25)

The expression on the left is known as the likelihood ratio. The relationship in
Eq. (14.25) provides a true–false relationship between the likelihood ratio and the
prior information. If it is false, we choose class two. Observe that this form was
derived by making the decision which maximizes the probability of making a correct
decision, using knowledge of the measurement and the prior probability of the
classes. We could use other criteria as well. For example, instead of maximizing
the probability, we could choose to minimize the conditional risk.

14.3 Decision regions and the probability of error

The effect of any decision rule is to partition the feature space into c decision
regions Ω_1, ..., Ω_c. Suppose we define a set of discriminant functions g_i(x). Then,
if g_i(x) > g_j(x) for all j ≠ i, then x ∈ Ω_i and we decide w_i. The equation of a
decision boundary is g_i(x) = g_j(x) where Ω_i borders Ω_j.
In the two-class case, we may compute the probability of error in terms of these
decision regions by

P(\text{error}) = P(x \in \Omega_2, w_1) + P(x \in \Omega_1, w_2).   (14.26)

That is, an error occurs if x falls in Ω_1 but it is really w_2, or if x falls in Ω_2 but it is
really w_1. Since these events cannot both be true,

P(\text{error}) = P(x \in \Omega_2|w_1) P(w_1) + P(x \in \Omega_1|w_2) P(w_2)
             = \int_{\Omega_2} p(x|w_1) P(w_1)\,dx + \int_{\Omega_1} p(x|w_2) P(w_2)\,dx.   (14.27)

We also use the notation P(error|w_2) to mean the probability that we make an
incorrect decision when w_2 is the true state. Fig. 14.4 illustrates the a posteriori
probability density of two classes, the process of deriving the decision boundary,
and the probability of error.
In general, if p(x|w_1)P(w_1) > p(x|w_2)P(w_2), we should decide that x is in Ω_1,
so that the smaller term contributes to the error integral. That is exactly what Bayes'
decision rule does.
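Eq. (14.27) is easy to check numerically. The sketch below (the densities, priors, and
grid are illustrative assumptions of ours) builds the Bayes' regions by comparing the
two weighted densities pointwise and then sums the two error integrals:

    import numpy as np

    xs = np.linspace(-10.0, 10.0, 20001)
    dx = xs[1] - xs[0]
    gauss = lambda x, mu, s: np.exp(-0.5 * ((x - mu) / s) ** 2) / (np.sqrt(2 * np.pi) * s)

    q1 = gauss(xs, 0.0, 1.0) * 0.5      # p(x|w_1) P(w_1), illustrative parameters
    q2 = gauss(xs, 3.0, 2.0) * 0.5      # p(x|w_2) P(w_2)
    in_omega1 = q1 > q2                 # Bayes' rule: decide w_1 where q1 dominates

    # Eq. (14.27): integrate the "losing" density over each region
    P_error = np.sum(q2[in_omega1]) * dx + np.sum(q1[~in_omega1]) * dx
    print(P_error)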

[Figure 14.4: the weighted densities p(x|w_1)P(w_1) and p(x|w_2)P(w_2) plotted over
the decision regions Ω_1 and Ω_2, with the error integrals \int_{\Omega_1} p(x|w_2)P(w_2)\,dx
and \int_{\Omega_2} p(x|w_1)P(w_1)\,dx shaded; part of the error is eliminated by moving
the boundary to the left.]

Fig. 14.4. The a posteriori probability density function of two classes as Gaussian, the decision
boundary, and the probability of error.

In the multiclass case, since there are more ways to be wrong than right, it is
simpler to compute the probability of being correct:

P(\text{correct}) = \sum_{i=1}^{c} P(x \in \Omega_i, w_i)
              = \sum_{i=1}^{c} P(x \in \Omega_i|w_i) P(w_i) = \sum_{i=1}^{c} \int_{\Omega_i} p(x|w_i) P(w_i)\,dx.   (14.28)

This result will be valid no matter how the feature space is partitioned. The Bayes’
classifier maximizes this probability by choosing regions which maximize the
integrals.

14.4 Conditional risk

Let W = {w_1, w_2, ..., w_c} be a finite set of possible classes, and let A =
{α_1, α_2, ..., α_a} be the set of possible activities – action α_i will be taken if the
object lies in class w_i. We will, in general, incur some loss if we take an incorrect
action. Let us assume that we have some way to quantify these losses. That is, let
C_ij be the loss incurred for taking action α_i when the object really is in class w_j.
In order to continue with this section, you need to recall the definition of the
expected value of a random variable. You should know what this is, but just to
refresh your memory, let’s suppose a random variable, x, can take on four different
values, 0, 1, 2, and 3. Now, suppose that we know the probabilities of each of those
values, and suppose they are

p(x = 0) = 0, p(x = 1) = 0, p(x = 2) = 1.0, p(x = 3) = 0.

What value of x would you expect to observe? Pretty obvious, isn’t it? Of course
you will always observe x = 2. Now let’s repeat this for a slightly more complicated

case,

p(x = 0) = 0, p(x = 1) = 0, p(x = 2) = 0.5, p(x = 3) = 0.5.

Would you agree that the expected value of x is 2.5? That is, half the time one would
expect to see x equal to 2, and half the time, x will be 3.
Now, generalize that concept to lots of possible values with a probability associated
with each value. If the number of possible values is finite, the expected value is a
sum

\bar{x} = \sum_x x\,P(x).   (14.29)

[Margin note: Watch out for this replacement of summation by integral to occur. We
are not going to warn you . . .]

In the more general case, when x is continuous, we will replace the probability with
a density, and replace the summation with an integral.
Suppose we observe x and contemplate taking action α_i. If the true state of nature
is w_j, we incur loss C_ij. Since P(w_j|x) is the probability that w_j is true, the expected
loss associated with α_i is

r_i = \sum_{j=1}^{c} C_{ij} P(w_j|x).   (14.30)

The expected loss is called the risk. We write r_i as r(α_i|x) to make it clear that this
is the conditional risk.
We wish to minimize the total risk, which we denote as r, by choosing the best α_i.
A decision rule is a function α(x) that tells us what action to take to minimize r.
For every x the decision function α(x) assumes a value α_1, ..., α_a. The overall
risk is associated, then, with the decision rule.
Since r_i or r(α_i|x) is the conditional risk, the overall risk is

r = \int r(\alpha(x)|x)\, p(x)\,dx.   (14.31)

If we choose α(x) so that r(α(x)|x) is as small as possible for every value of x, we
minimize the overall risk. Thus, to minimize r, compute r_i and select the action α_i
for which r_i is minimal. The resulting overall risk is called the Bayes' risk.
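The minimum-risk rule of Eq. (14.30) is a single matrix-vector product over the
posteriors. Here is a sketch (the cost matrix in the example is invented):

    import numpy as np

    def min_risk_action(posteriors, C):
        """Eq. (14.30): r_i = sum_j C[i,j] P(w_j|x); take the action with
        the smallest conditional risk. C[i,j] is the cost of action i
        when the true class is j."""
        risks = np.asarray(C, dtype=float) @ np.asarray(posteriors, dtype=float)
        return int(np.argmin(risks)), risks

    # Posteriors [0.7, 0.3] with an asymmetric (illustrative) cost matrix:
    print(min_risk_action([0.7, 0.3], [[0.0, 10.0], [1.0, 0.0]]))

In this example the risks are [3.0, 0.7]: even though w_1 is the more probable class,
the large cost of mistaking a true w_2 for w_1 makes the second action the lower-risk
choice.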

14.4.1 Derivation of the likelihood ratio when risk is a consideration


The two-class derivation is presented here in detail. The generalization to multiple
classes, if one understands this, is straightforward. Of course, students should remind
themselves that the likelihood RATIO is only going to work for the two-class case.
Let’s look at the previous derivation of the minimum risk criterion in more detail,
for the two-class case.

If there are only two classes, we have four possibilities, two that we made the
correct decision, and two that we were wrong. The total risk therefore is:

[Margin note: As you read these terms, consider the concepts of “false negative,”
“false positive,” “true positive,” and “true negative.” Which of these terms represents
the total amount of each of these?]

[Margin note: Reminder: C_ij is the cost of deciding i when reality is j.]

r = \int_{\Omega_1} C_{11} P(w_1|x)\,dx
  + \int_{\Omega_1} C_{12} P(w_2|x)\,dx
  + \int_{\Omega_2} C_{21} P(w_1|x)\,dx        (14.32)
  + \int_{\Omega_2} C_{22} P(w_2|x)\,dx.

That is, the probability that we guessed it was in class 1, integrated over the region
of x in which our decision rule says we SHOULD guess class 1, plus . . .
Rewriting all that, we get

r = \int_{\Omega_1} [C_{11} P(w_1|x) + C_{12} P(w_2|x)]\,dx
  + \int_{\Omega_2} [C_{21} P(w_1|x) + C_{22} P(w_2|x)]\,dx.   (14.33)

Now, we observe that the integral over Ω_1 of any density function, say q, is simply
one minus the integral over Ω_2 of the same function. Thus we can write both integrals
in terms of region Ω_1, and obtain

r = \int_{\Omega_1} C_{11} P(w_1|x)\,dx + \int_{\Omega_1} C_{12} P(w_2|x)\,dx
  + \left[ 1 - \int_{\Omega_1} C_{21} P(w_1|x)\,dx \right] + \left[ 1 - \int_{\Omega_1} C_{22} P(w_2|x)\,dx \right]   (14.34)

which reorganizes to

r = 2 + \int_{\Omega_1} \left[ (C_{11} - C_{21}) P(w_1|x) + (C_{12} - C_{22}) P(w_2|x) \right] dx.   (14.35)

Our objective is to minimize this quantity (remember, it is the risk incurred in making
all four possible decisions). That is, the decision rule is really the determination of
the decision region(s). In this case, since there are only two decision regions, and
everywhere that we do not decide class one, we decide class two, all we need to
do is to determine the region Ω_1. To accomplish that, first, we need to make an

assumption. We assume that the cost of making a correct decision is always less than
the cost of an error. So (C_11 − C_21) < 0, etc.

[Margin note: For some reason, students get upset when we say “make the integrand
negative everywhere.” Remember, we are optimizing by finding the limits on the
integral!]

How do we choose the limits of an integral (and remember, the limits of the integral
are in fact the boundaries of the region where we decide class 1) such that the integral
is maximally small? Simply choose the decision region such that the integrand is
negative everywhere. Doing so produces the condition required for region 1 to be
chosen: Choose Ω_1 such that

(C_{11} - C_{21}) P(w_1|x) + (C_{12} - C_{22}) P(w_2|x) < 0.   (14.36)

Replace the posterior probabilities with the product of the conditional densities and
prior probabilities to get

(C_{11} - C_{21})\, p(x|w_1) P(w_1) < (C_{22} - C_{12})\, p(x|w_2) P(w_2),   (14.37)

which after appropriate algebraic manipulation becomes the decision rule: Choose
class 1 if

\frac{p(x|w_1)}{p(x|w_2)} > \frac{(C_{12} - C_{22}) P(w_2)}{(C_{21} - C_{11}) P(w_1)},   (14.38)

else choose class 2.
[Margin note: Try this: Substitute the symmetrical cost function into Eq. (14.38) and
see how the likelihood ratio test simplifies.]

This expression is called the “likelihood ratio test.”
Consider the symmetrical loss function:

C_{ij} = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases}   (14.39)

so that all errors are equally costly and there is no cost for a correct decision. We
may now rewrite the conditional risk, the cost of making decision i:

r_i = \sum_{j=1}^{c} C_{ij} P(w_j|x) = \sum_{j \neq i} P(w_j|x) = 1 - P(w_i|x).   (14.40)

Thus, to minimize the average probability of error, we select i as the i which max-
imizes the a posteriori probability P(w_i|x). That is, for minimum cost, we decide
w_i if P(w_i|x) > P(w_j|x) for all j ≠ i, which we have already seen is the simple
maximum likelihood classifier. Thus, we see that the maximum likelihood classifier
minimizes the Bayes' risk associated with a symmetric cost function.
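Eq. (14.38) turns directly into a decision function; a sketch (the argument names are
ours):

    def likelihood_ratio_decide(px1, px2, P1, P2, C11, C12, C21, C22):
        """Eq. (14.38): decide class 1 if p(x|w1)/p(x|w2) exceeds the
        cost-weighted prior ratio; otherwise decide class 2."""
        threshold = ((C12 - C22) * P2) / ((C21 - C11) * P1)
        return 1 if px1 / px2 > threshold else 2

With the symmetric costs of Eq. (14.39) (C_11 = C_22 = 0, C_12 = C_21 = 1), the
threshold reduces to P_2/P_1 and the rule collapses to Eq. (14.25).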

14.5 The quadratic classifier

Consider the general multivariate Gaussian classifier, with two classes. As in As-
signment 14.1, if we take logs, we can work out a decision rule based on a likelihood

ratio: Decide class 1 if

\ln\frac{|K_1|}{|K_2|} + (x - \mu_1)^T K_1^{-1} (x - \mu_1) - (x - \mu_2)^T K_2^{-1} (x - \mu_2) < Threshold;   (14.41)

else decide class 2, where

Threshold = -2 \ln\left[ \frac{P(w_2)(C_{12} - C_{22})}{P(w_1)(C_{21} - C_{11})} \right].
If we define

A = K_1^{-1} - K_2^{-1}, \quad b = 2(K_2^{-1}\mu_2 - K_1^{-1}\mu_1), \quad \text{and}
c = \mu_1^T K_1^{-1} \mu_1 - \mu_2^T K_2^{-1} \mu_2 + \ln\frac{|K_1|}{|K_2|},   (14.42)

we can rewrite Eq. (14.41) using

g(x) \equiv x^T A x + b^T x + c.   (14.43)

And the decision rule becomes: Decide class 1 if g(x) < T . In this formulation, we
see clearly why the Gaussian parametric classifier is known as a quadratic classifier.
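The following sketch (our own construction) computes A, b, and c of Eq. (14.42) and
applies the rule:

    import numpy as np

    def quadratic_params(mu1, K1, mu2, K2):
        """A, b, c of Eq. (14.42)."""
        K1i, K2i = np.linalg.inv(K1), np.linalg.inv(K2)
        A = K1i - K2i
        b = 2.0 * (K2i @ mu2 - K1i @ mu1)
        c = (mu1 @ K1i @ mu1 - mu2 @ K2i @ mu2
             + np.log(np.linalg.det(K1) / np.linalg.det(K2)))
        return A, b, c

    def decide(x, A, b, c, T=0.0):
        g = x @ A @ x + b @ x + c      # Eq. (14.43)
        return 1 if g < T else 2       # decide class 1 if g(x) < Threshold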
[Margin note: The Mahalanobis distance has the properties of a metric. Can you
prove that? (Do you recall the definition of a metric?)]

Let's examine the implications of this rule. Consider the quantity
(x − μ_1)^T K_1^{-1} (x − μ_1). This is some sort of measure involving a measurement, x, and a
class parameterized by a mean vector and a covariance matrix. This quantity is known
as the Mahalanobis distance.
First, let's look at the case that the covariance is the identity. Then, the Maha-
lanobis distance simplifies to (x − μ_1)^T (x − μ_1). That is, take the difference between
the measurement and the mean. That is a vector. Then take the inner product of that
vector with itself, which is, of course, the squared magnitude of that vector. What is
this quantity? Of course! It is just the (squared) Euclidean distance between the mea-
surement and the mean of the class. If the prior probabilities are the same and we use
symmetric costs, Threshold works out to be zero, and the decision rule simplifies to:
Decide class 1 if

(x - \mu_1)^T (x - \mu_1) - (x - \mu_2)^T (x - \mu_2) < 0   (14.44)

else decide class 2. If the measurement is closer to the mean of class 1 than class 2,
this quantity is less than zero. Therefore, we refer to this (very simplified) classifier
as a nearest mean classifier, or nearest mean decision rule.
Now, let's complicate the rule a bit. We no longer assume the covariances are
equal to the identity, but do assume they are equal to each other (K_1 = K_2 ≡ K). In
this case, look at Eq. (14.42) and notice that the A matrix becomes zero. Now, the
operations are not quadratic any more. We have a linear classifier.
We could choose to ignore the ratio of the determinants of the covariance matrices,
or, more appropriately, to include that number in the threshold T. Then we have a
minimum distance decision rule, but now the distance used is not the Euclidean
distance. We refer to this as a minimum Mahalanobis distance classifier.

Here is another special case: What if the covariance is not only equal, but diagonal?
Now, the Mahalanobis distance takes on a special form. We illustrate this by using
a three-dimensional measurement vector, and letting the mean be zero:

[x_1\; x_2\; x_3] \begin{bmatrix} \frac{1}{\sigma_{11}} & 0 & 0 \\ 0 & \frac{1}{\sigma_{22}} & 0 \\ 0 & 0 & \frac{1}{\sigma_{33}} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}

which we expand to

\frac{x_1^2}{\sigma_{11}} + \frac{x_2^2}{\sigma_{22}} + \frac{x_3^2}{\sigma_{33}}.
[Margin note: Do you recall seeing this ellipse discussion somewhere else in this
book?]

This is the equation of an ellipsoid, centered at the origin (or, in the case that the
mean is not zero, centered at the mean) with axes located along the coordinate axes.
In the more general case, with covariance which is not diagonal, the only thing
that happens is that this ellipsoid may rotate. So, the equation which represents the
Mahalanobis distance from a point to a class produces an ellipsoid.
Here is one more interesting case: Suppose the covariances are the same, diagonal,
and proportional to the identity, K_i = σ²I. Now, the discriminant function for class
i takes on the form

g_i(x) = \frac{2\mu_i^T x}{\sigma^2} - \frac{|\mu_i|^2}{\sigma^2} + 2 \ln P(w_i).   (14.45)
[Margin note: Remember! An inner product computes a projection.]

Assume further that the magnitudes of all the means are the same. That is, all the
means are located on a hypersphere centered at the origin. Then, we do not need to
consider the second term in Eq. (14.45), and the discriminant function simplifies to

g_i(x) = \mu_i^T x = \sum_{k=1}^{d} \mu_{ik} x_k,   (14.46)

which we refer to as the inner product classifier.

14.6 The minimax rule

Sometimes the a priori probabilities are unknown. In this case, a fixed decision rule
will not yield the minimum risk, so we use the minimax rule and attempt to minimize
the maximum possible risk. Suppose we have c classes; then the overall expected
risk is

r = \sum_i \sum_j C_{ij} P(w_j) \int_{\Omega_i} p(x|w_j)\,dx,   (14.47)

that is, the probability of a particular state of nature times all the decisions we could
make if that were the state of nature, and the cost associated with those decisions.
To see what this means clearly, think about the two-class case and let i = 1. We
observe that r is linear in P(w_i). Thus, r is maximized at one extreme of P(w_1) or
the other, e.g., P(w_1) = 0 or P(w_1) = 1. If we let C_11 = C_22 = 0 then the maximum
of r becomes either

\int_{\Omega_1} C_{12}\, p(x|w_2)\,dx   (14.48)

or

\int_{\Omega_2} C_{21}\, p(x|w_1)\,dx.   (14.49)

Since Ω_1 ∪ Ω_2 is the complete space, then

\max\left( \int_{\Omega_1} C_{12}\, p(x|w_2)\,dx,\; \int_{\Omega_2} C_{21}\, p(x|w_1)\,dx \right)   (14.50)

takes its minimum value when

\int_{\Omega_1} C_{12}\, p(x|w_2)\,dx = \int_{\Omega_2} C_{21}\, p(x|w_1)\,dx.   (14.51)

So, if C_21 = C_12, the minimax rule says to choose Ω_1 and Ω_2 so that the probabilities
of the two types of errors are the same. That is, we have chosen a condition which,
in the absence of prior information about the classes, will minimize our maximum
risk.

14.7 Nearest neighbor methods

In previous sections, we have assumed we have a model for the density, usually a
Gaussian. However, if we simply have a training set, that data may or may not fit a
Gaussian (or any other parametric model for that matter). A simple heuristic which
we might use is called the “nearest neighbor rule” – assign the unknown to the same
class as the class of the nearest neighbor in the training set.
In this section, we extend the nearest neighbor rule, and show that it is equivalent to
a maximum likelihood classifier with the density estimated by the extended nearest
neighbor rule.

This method utilizes a volume1 V around the unknown. We simply count the
number of points from the various classes which occur within it. Then the class-
conditional density is estimated by

p(x|\omega_m) = \frac{k_m}{n_m V},   (14.52)

where k_m is the number of samples in class m inside the volume V centered at x, and
n_m is the total number of samples in class m in the training set.
Use of a constant volume is a problem, because in regions which are densely
populated (many training set points nearby) the volume will contain many points,
resulting in too much smoothing, whereas in more sparsely populated areas, the same
volume results in estimates which are not sufficiently representative. The simple
solution is to let the volume depend on the data. For example, to estimate p(x) from
n samples, one can center a cell about x and let it grow until it contains k_n samples,
where k_n is some (yet to be specified) function of n. If the density of samples near x
is high, then the volume will be small, resulting in good resolution. If the density is
small, then the region will grow, providing smoothing. Duda et al. [14.4] point out
that k_n = √n provides one form for k_n which behaves in a reasonable way.
The k-nearest-neighbor (k-NN) rule can be extended slightly to allow us to use the
strategy directly for classification. Given c training sets, we combine all the sample
points from all the training sets into one data set of n points, where now

n = \sum_{i=1}^{c} n_i   (14.53)

where n_i is the number of samples in training set i.


Now, given a point x at which we wish to determine the statistics, we find the
hypersphere of volume V which just encloses k points from the combined set. If,
within that volume, k_m of those points belong to class ω_m, then we estimate the
density as before for class ω_m by

p(x|\omega_m) = \frac{k_m}{n_m V}   (14.54)

and

P(\omega_m) = \frac{n_m}{n}   (14.55)

p(x) = \frac{k}{n V}.   (14.56)

If we apply Bayes' rule to Eqs. (14.54)–(14.56), we find

P(\omega_m|x) = \frac{k_m}{k}.   (14.57)

1 Of course, in more than three dimensions, this is a hypervolume. For simplicity, we will continue to use the
word “volume” with the understanding that no limit on dimensionality exists.

This rule tells us to look in a neighborhood of the unknown feature vector for k
samples. If, within that neighborhood, more samples lie in class i than any other
class, we assign the unknown as belonging to class i. We thus have the k-nearest-
neighbor classification rule.
The student should note that in the k-NN strategy, we have never defined precisely
how nearest should be computed. The Euclidean metric is generally assumed to be
the most reasonable measure for distance, but others may certainly be used.
In the authors’ own experience in classifying large data sets of industrial data, we
have found nearest neighbor algorithms to work surprisingly well.
A major practical disadvantage of the k-NN strategy for classification is the fact
that all the data must be stored. This can be a massive storage burden, especially
when compared with parametric methods which require only a few points. The com-
putational burden associated with the k-NN techniques can likewise be significant,
since, in order to find the k nearest neighbors, the distance from the unknown to all
the neighbors must be determined. Heuristics have been published which speed up
this process significantly, and the student is referred to the literature for suggestions.
See, for example, the condensed-nearest-neighbor rule described in Hand [15.7].
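A direct, brute-force sketch of the k-NN rule, using the Euclidean metric just
mentioned (our code, names of our own choosing):

    import numpy as np

    def knn_classify(x, X_train, y_train, k=5):
        """Eq. (14.57): among the k nearest training samples, the estimated
        posterior of class m is k_m / k, so pick the most common label."""
        dists = np.linalg.norm(X_train - x, axis=1)    # Euclidean distances
        nearest = y_train[np.argsort(dists)[:k]]       # labels of k nearest
        labels, counts = np.unique(nearest, return_counts=True)
        return labels[np.argmax(counts)]

Every call scans the whole training set, which is exactly the storage and computational
burden described above.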

14.8 Conclusion

In this brief introduction to statistical pattern recognition, you have seen how statisti-
cal methods can assist in the process of making decisions. You have also noticed how
pervasive the optimization approach to problem solving is.

[Margin note: Maximum likelihood and minimum squared error.]

Probability densities are estimated using maximum likelihood methods, where the
likelihood is a product of probabilities. For Gaussian forms, maximum likelihood
simplifies to sum-squared error.

[Margin note: Minimize the risk.]

You learned how to find decision regions which minimize the total risk, even
when different decisions have different costs, by finding limits of integration which
make the argument of the integral negative.

[Margin note: Minimize the maximum risk.]

Even in the case that the risk cannot be computed, we can develop a scheme which
minimizes the maximum risk.

[Margin note: Minimum distance.]

Classification is often considered a “minimum distance” process. That is, we
make the decision which minimizes some sort of distance, and you have seen several
examples in this chapter.

14.9 Vocabulary

You should know the meaning of the following terms.

Bayes’ rule
Cluster
Conditional density

Decision boundary
Decision rule
Discriminant function
Feature vector
Likelihood ratio
Linear machine
Linearly separable
Maximum likelihood
Minimax
Multivariate
Prior probability
Quadratic classifier
Risk
Supervised learning
Training set
Univariate
Unsupervised learning

Assignment 14.1
Assume class 1 and 2 are well represented by Gaussian
densities with the following parameters: Class 1: mean = 0,
variance = 1. Class 2: mean = 3, variance = 4.
Substitute the forms for the Gaussian into Eq. (14.25)
and derive an equation which gives the range of x in
which class 1 is chosen. You will need to make a
reasonable assumption about prior probabilities (equal
probabilities are often chosen).
Hint: After doing the substitution, take natural
logarithms of both sides.

Assignment 14.2
In a one-dimensional problem, the conditional density
for class 1 is Gaussian with mean 0 and variance 2; for
class 2, the conditional density is also Gaussian with
mean 3 and variance 1. That is:

    p(x|w_1) = \frac{1}{\sqrt{2}\,\sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x}{\sqrt{2}} \right)^2 \right]

    p(x|w_2) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2} (x - 3)^2 \right]
(1) Sketch the two densities on the same axis.
(2) What is the likelihood ratio?

(3) Assuming that P(w_1) = P(w_2) = 0.5, C_11 = C_22 = 0, C_12 = 1,
    and C_21 = 3, use the integral form for the probability
    of error assuming a Bayes' decision rule.

Assignment 14.3
In a one-dimensional problem the class-conditional densities
for a feature x are

    p(x|w_1) = \begin{cases} \exp(-(x - r)) & x \ge r \\ 0 & \text{otherwise} \end{cases}

and

    p(x|w_2) = \begin{cases} \exp(x - 3) & x < 3 \\ 0 & \text{otherwise} \end{cases}

where P(w 1 ) = P(w 2 ) = 0.5.

(1) Assume that r < 3, and sketch the densities. Determine
    the decision rule that minimizes the probability of
    error, and indicate what that decision rule means by
    marking a point on the x axis.
(2) Find the value of r which minimizes P(error|w 2 ).

Topic 14A Statistical pattern recognition

14A.1 Matching feature vectors using statistical methods

Statistical pattern recognition, as mentioned above, is a process worthy of an entire book, and
in fact many books have been written on the topic. Here, through a simple example [13.17],
we will present just a glimpse of what the discipline entails. Our problem is to recognize
faces. Let’s first collect images which contain only faces (and thereby avoid the segmentation
problem) by requiring the subjects all wear black clothing and stand against a black wall. We
acquire relatively low-resolution images, 180 × 120 pixels. We then scan over the image with
a collection of feature extractors, shown in Fig. 14.5. Each feature extractor operates on the
neighborhood of each pixel, in much the same way that a kernel operator does, but instead of
a sum of products, this operator returns the product of the image pixels corresponding to the
black pixels in the kernel. First, we observe that each kernel, used in this way, is returning a
very local autocorrelation of the image, in a particular direction. Denote the result of applying
kernel i to the neighborhood of pixel j by Φ_ij. Then, the sum

    x_i = \sum_{j=1}^{n} \Phi_{ij}    (14.58)

is computed, producing a 25-element vector which in some sense describes the image.

Fig. 14.5. The collection of 25 kernels used to extract a 25-element feature vector
from an image.

So for every image, we have a vector consisting of 25 numbers. Using that 25-vector, the
challenge is to properly make a decision. The first step is to reduce the dimensionality to
something more manageable than 25.
We look for a method for reducing the dimensionality from, in general, d dimensions, to
c − 1 dimensions, where we are hoping to classify the data into c classes. (Somehow, we
must know c, which in this example is the number of individual faces.) The following strategy
is an extension of a method known in the literature as “Fisher’s linear discriminant.”
Assume we have c different classes, and a training set X_i of examples from each class.
Thus, this is a supervised learning problem. Define the within-class scatter matrix to be

    S_W = \sum_{i=1}^{c} S_i    (14.59)

where

    S_i = \sum_{x \in X_i} (x - \mu_i)(x - \mu_i)^T,    (14.60)

and μ_i is the mean of class i,

    \mu_i = \frac{1}{n_i} \sum_{x \in X_i} x.    (14.61)

Thus, S_i is a measure of how much each class varies from its average.
    We define the between-class scatter matrix as

    S_B = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^T,    (14.62)

where μ is the mean of all the points in all the training sets and n_i is the number of
samples in class i. To see what this means, consider Fig. 14.6. The between-class
scatter is a measure of the sum of the distances between each of the class means and
the overall sample mean. Maximization of some measure of S_B will push the class means
apart, away from the overall mean.

Fig. 14.6. The between-class scatter is a measure of the total distance between the
class means and the overall mean.
The idea is to find some projection of each data vector x onto a vector y,

    y = W^T x    (14.63)

such that first, y is of lower dimension than x, and second, the classes are better
separated after they are projected.
    The projection from d-dimensional space to (c - 1)-dimensional space is
accomplished by c - 1 linear discriminant functions

    y_i = w_i^T x.    (14.64)

If we view the y_i as components of a vector, and the vectors w_i as columns of a
matrix W, we can describe all the discriminant functions by a single matrix equation

    y = W^T x.    (14.65)

We now define a criterion function which is a function of W and measures the ratio of
between-class scatter to within-class scatter. That is, we want to maximize S_B
relative to S_W, or rather, to maximize some measure of S_W^{-1} S_B. The trace of
S_W^{-1} S_B is the sum of the spreads of S_W^{-1} S_B in the direction of the
principal components of S_W^{-1} S_B. We can see clearly what this means in the
two-class case:

    J = tr(S_W^{-1} S_B)
      = \frac{n_1 n_2}{n_1 + n_2} tr\left( S_W^{-1} (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \right)
      = \frac{n_1 n_2}{n_1 + n_2} D^2    (14.66)

where D^2 is the squared Mahalanobis distance between the two classes.


Rather than maximize the trace of the matrix, we might redefine J using the determinant:

    J(W) = \frac{\left| W^T S_B W \right|}{\left| W^T S_W W \right|}.    (14.67)

Since the determinant is the product of the eigenvalues, it is therefore the product of
the "spread" in the principal directions. As in the case of Fisher's linear
discriminant, the solution to this equation can be found by eigenvector analysis. The
columns of the optimal W are the eigenvectors which correspond to the largest
eigenvalues in

    S_B w_i = \lambda_i S_W w_i.    (14.68)

One can find the eigenvalues as the roots of the characteristic equation

    \left| S_B - \lambda_i S_W \right| = 0    (14.69)

and then solve

    (S_B - \lambda_i S_W) w_i = 0    (14.70)

for the eigenvectors w_i.


This generalization of Fisher’s linear discriminant is the method most widely implemented
in computer packages. It is often used for selection of a subset of the original variables rather
than for dimensionality reduction via transformation. This is done by examining the resulting
eigenvalues and noting the relative sizes (positive or negative) corresponding to the variables.
A large eigenvalue means that the corresponding variable makes an important contribution
to the between-group separability, as we learned when we studied the K–L transform.
We have reduced the dimensionality of the 25-element vector to something smaller (yes,
this assumes the number of classes is less than 25). In this lower dimensionality space, we
will design a classifier. How? It is easy at this point. Use Eq. (14.5), or rather the logarithm
of the RHS of Eq. (14.5) (no, YOU do the math!) to derive a discriminant function which is
maximal when the example is assigned to the correct class.
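To make the procedure concrete, here is a sketch in Python of the scatter computation
and the generalized eigenproblem of Eq. (14.68). It assumes SciPy is available and that
S_W is nonsingular; the function name and the data layout (one array of samples per
class) are our own choices, not the text's.

    import numpy as np
    from scipy.linalg import eigh

    def fisher_projection(classes, out_dim):
        # classes: list of (n_i x d) arrays, one array per class.
        mu = np.vstack(classes).mean(axis=0)          # overall sample mean
        d = mu.size
        Sw = np.zeros((d, d))                         # within-class scatter, Eq. (14.59)
        Sb = np.zeros((d, d))                         # between-class scatter, Eq. (14.62)
        for X in classes:
            mi = X.mean(axis=0)                       # class mean, Eq. (14.61)
            Sw += (X - mi).T @ (X - mi)               # Eq. (14.60)
            Sb += len(X) * np.outer(mi - mu, mi - mu)
        # Generalized eigenproblem Sb w = lambda Sw w, Eq. (14.68).
        # eigh returns eigenvalues in ascending order, so reverse.
        vals, vecs = eigh(Sb, Sw)
        W = vecs[:, ::-1][:, :out_dim]                # columns = top eigenvectors
        return W                                      # project with y = W.T @ x, Eq. (14.65)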

14A.2 Support vector machines (SVMs)

Pattern classifiers based on the concepts of support vectors are relatively new. They were
first introduced by Vapnik [14.12] based on the concept of minimizing structural risk. We
present them here because they appear to provide performance superior to most other pattern
classification methods, and to illustrate the approach – to set up an optimization problem which
maximizes the margin – a derivation of the simplest support vector approach is provided.

Fig. 14.7. A poor choice of the dividing hyperplane (a line in this 2D example)
produces a margin which is small.

Fig. 14.8. A good choice of the dividing hyperplane results in a large margin.

Fig. 14.9. Two classes denoted by 1 and 2 are both projected onto the same line, w.
The line dividing the sets is orthogonal to w, and we seek the points in the training
sets which are closest to that line.

14A.2.1 Derivation of SVMs assuming separability

In this section, we assume the training sets are separable by a linear surface. In the next
subsection, we discuss this assumption and demonstrate how to ensure it.
As before, we seek to divide the feature space by a hyperplane into two segments, in which
the examples of the training sets are linearly separable. Define the distance from the closest
point in class 1 to the hyperplane as d1 . Similarly let d2 be the distance from the closest point
in class 2 to the hyperplane. Define the margin as d1 + d2 . We will seek the hyperplane which
maximizes the margin (see Figs. 14.7 and 14.8).
We don't know φ yet. Finding it is the challenge.

Given an example x, we project this sample onto some unit vector φ, and make a
decision using the rule: decide class 1 if φ^T x − q > 0, where q is a constant.

I_1 is the set of points in the training set representing class 1.

Let x_1 be a point in class 1, and similarly, let x_2 be a point in class 2. Since we
are seeking the points closest to the decision line, we wish to choose x_1 and x_2 so
that their projections onto φ are as close together as possible. As illustrated in
Fig. 14.9, we seek points x_1 and x_2 such that, with minimal positive ρ,

    \varphi^T x_1 = q + \rho  \qquad  \varphi^T x_2 = q - \rho.    (14.71)

Define I_i to be the set of training examples from class i. For any point in I_1,
φ^T x − q ≥ ρ, and for any point in I_2, φ^T x − q ≤ −ρ. We need to find two things.
(1) A pair of points, one2 in each class, which are as close together as possible. We
will call these points support vectors. (2) A vector onto which to project the support
vectors so that their projections are maximally far apart. We solve this problem as
follows.

2 It is possible to have more than one support vector in each class, since two points
might both be precisely the same distance from the hyperplane.

Any other vector in the same direction, actually.

Recall that φ was a unit vector. It is thus equal to some other vector in the same
direction divided by its magnitude, φ = w/||w||. We will look for one such vector,
with certain properties which will be introduced in a moment. For now, let x_1 denote
any point in I_1, not necessarily a support vector, and similarly for x_2. Then

    \left( \frac{w}{\|w\|} \right)^T x_1 - q \ge \rho
    \qquad
    \left( \frac{w}{\|w\|} \right)^T x_2 - q \le -\rho    (14.72)

which leads to

    w^T x_1 - q\|w\| \ge \rho \|w\|  \qquad  w^T x_2 - q\|w\| \le -\rho \|w\|.    (14.73)

Define b = −q||w||, and we then add a constraint to w to require that its magnitude
have a particular property:

    \|w\| = 1/\rho.    (14.74)

Now we have two equations which describe behavior for any points in class 1 or class 2:

    w^T x_1 + b \ge 1  \qquad  w^T x_2 + b \le -1.    (14.75)

From this point on in this derivation, the subscript on the x no longer denotes the
class to which x belongs, but rather just its index, as an element of the training set.

Since we wish to find the line which maximizes the margin ρ, from Eq. (14.74) we see
that this is the same as finding the projection vector w whose magnitude is minimal;
thus we seek a minimizer w = \arg\min(\frac{1}{2} w^T w). Unfortunately, the null
vector would minimize this, so we need to add some constraints to avoid this trivial
solution.
    Let y_i be the label for point x_i, and define the labels as

    y_i = \begin{cases} 1 & \text{if } x_i \in I_1 \\ -1 & \text{if } x_i \in I_2 \end{cases}    (14.76)

and consider the expression y_i(w^T x_i + b). This will always be greater than or equal
to 1, regardless of the class of x_i. We thus have a constraint, and our minimization
problem becomes: Find the w which minimizes w^T w such that y_i(w^T x_i + b) ≥ 1.
    This can be accomplished by setting up the following constrained optimization
problem:

    L(w, b, \lambda) = \frac{1}{2} w^T w - \sum_{i=1}^{l} \lambda_i \left( y_i(w^T x_i + b) - 1 \right)    (14.77)

where l is the number of samples in the training set.


The Lagrange multipliers will all be positive. Take the partial derivative with respect
to w,

    \frac{\partial L}{\partial w} = w - \sum_i \lambda_i x_i y_i    (14.78)

and setting this to zero we find

    w = \sum_i \lambda_i x_i y_i.    (14.79)

Similarly, take the derivative with respect to b,

    \frac{\partial L}{\partial b} = -\sum_i \lambda_i y_i = 0.    (14.80)

With these two observations, the equation for L is simplified to

    L = \frac{1}{2} \sum_i \sum_j \lambda_i y_i \lambda_j y_j x_i^T x_j
        - \sum_i \sum_j \lambda_i y_i \lambda_j y_j x_i^T x_j
        - b \sum_i \lambda_i y_i + \sum_i \lambda_i    (14.81)

where the first and second terms are the same except for the 1/2, and the third term
is zero.

Defining the matrix A by

    A = \left[ y_i y_j x_i^T x_j \right]    (14.82)

allows us to write L in a matrix form

    L = -\frac{1}{2} \Lambda^T A \Lambda + \mathbf{1}^T \Lambda,    (14.83)

where Λ = [λ_1, ..., λ_l]^T and 1 denotes a vector of all ones.
    Finding the vector of Lagrange multipliers Λ which maximizes L is a quadratic
optimization problem. There are several numerical packages available which perform
such operations. Once we have the set of Lagrange multipliers, the optimal projection
vector is found by Eq. (14.79), which we observe requires a summation over all the
elements of the training set. To solve for b, we need to make use of the Kuhn-Tucker
[14.5] conditions:

    \lambda_i \left( y_i(w^T x_i + b) - 1 \right) = 0 \quad \forall i.    (14.84)

In principle, Eq. (14.84) may be solved for b using any i for which λ_i ≠ 0, but it is
numerically better to use an average.
    Similarly, we note the dimension of A is the same as the number of samples in the
training set. Thus, unless some "filtering" is done on the training set prior to
building the SVM, the computational complexity can be substantial.
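As one example of the "numerical packages" alluded to above, the following Python
sketch hands the dual problem of Eq. (14.83), with the constraints λ_i ≥ 0 and
Σ λ_i y_i = 0, to a general-purpose solver. This is an illustration under our own
naming, assuming SciPy and a linearly separable training set; production SVM packages
use far more efficient special-purpose QP solvers.

    import numpy as np
    from scipy.optimize import minimize

    def train_linear_svm(X, y):
        # X: (l x d) training samples; y: labels in {+1.0, -1.0}.
        l = len(y)
        G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j x_i . x_j, Eq. (14.82)
        # Maximizing Eq. (14.83) is minimizing 0.5 L^T A L - 1^T L.
        obj = lambda lam: 0.5 * lam @ G @ lam - lam.sum()
        jac = lambda lam: G @ lam - 1.0
        res = minimize(obj, np.zeros(l), jac=jac,
                       bounds=[(0.0, None)] * l,                 # lambda_i >= 0
                       constraints={'type': 'eq',
                                    'fun': lambda lam: lam @ y}) # Eq. (14.80)
        lam = res.x
        w = ((lam * y)[:, None] * X).sum(axis=0)    # Eq. (14.79)
        sv = lam > 1e-6                             # support vectors: lambda_i != 0
        b = np.mean(y[sv] - X[sv] @ w)              # Eq. (14.84), averaged as suggested
        return w, b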

14A.2.2 Nonlinear support vector machines

Instead of dealing with the actual samples, we apply a nonlinear transformation which
produces a vector of higher dimension, y_i = ϑ(x_i). For example, if x = [x_1, x_2]^T
is of dimension 2, y might be

    y_i = \left[ x_1^2 \;\; x_2^2 \;\; x_1 x_2 \;\; x_1 \;\; x_2 \;\; 1 \right]^T,    (14.85)

which is of dimension 6. Surprisingly, this increase in dimension does not seem to destroy the
capability of the classifier. How can an expansion to more “degrees of freedom” improve both
accuracy on the training set and generalizability? Concerning this question, Burges [14.2]
says:

Usually, mapping your data to a “feature space” with an enormous number of dimensions
would bode ill for the generalization performance of the resulting machine. After all, the set of
all hyperplanes {w, b} are parameterized by dim(H ) + 1 numbers. Most pattern recognition
systems with billions, or even an infinite number, of parameters would not make it past the
start gate. How come SVMs do so well? One might argue that, given the form of solution,
there are at most l + 1 adjustable parameters (where l is the number of training samples), but
this seems to be begging the question. It must have something to do with our requirement of
maximum margin hyperplanes that is saving the day. . . . a strong case can be made for this
claim.

An expansion of the form described in Eq. (14.85) increases the dimensionality of the space,
and increases the likelihood that the classes will be linearly separable in the higher dimensional

space (for reasons beyond the scope of this brief explanation). It also provides a simple
mechanism for incorporating nonlinear mixtures of the information from the measurements.
The polynomial form of Eq. (14.85) is but one way of expanding the dimensionality of the
measurement vector. A more interesting collection of ways comes to mind when one looks
at Eq. (14.83), and observes that to compute the optimal separating hyperplane, one does not
need to know the vectors themselves, but only the scalars which result from computing all
possible inner products. Thus, we do not need to map each vector to a high-dimensionality
space and then take the inner product of those vectors, not if we can figure out ahead of time
what those inner products should be.

Kernels and inner products


We seek the best separating hyperplane in the higher dimensional space defined by
y_i = ϑ(x_i), ϑ : \mathbb{R}^d \to \mathbb{R}^m, where m > d. Then the equations for
the elements of A become functions of the inner products of these new vectors,
A = [y_i y_j ϑ^T(x_i) ϑ(x_j)]. For notational convenience (and to lead to a really
clever result), define a kernel operator, K(x_i, x_j), which takes into account both
the nonlinear transformation ϑ and the inner product. Instead of asking "What
nonlinear operator should I use?" let's ask a different question: "Given a particular
kernel, is there any chance it represents the combination of a nonlinear operator and
an inner product?" The amazing answer is yes; under certain conditions, that is true.
These conditions are known as Mercer's conditions: Given a kernel function K(a, b) of
two vector-valued arguments, if for every g(x) which has finite energy (that is,
\int (g(x))^2 \, dx is finite) we have \int\!\!\int K(a, b) g(a) g(b) \, da \, db \ge 0,
then there exists a mapping ϑ and a decomposition of K of the form

    K(a, b) = \sum_i \vartheta_i(a) \vartheta_i(b).    (14.86)

In Eq. (14.86), the subscript i denotes the ith element of the vector-valued function
ϑ. Thus, that expression represents an inner product. Notice that Mercer's conditions
simply state that if K satisfies these conditions, then K may be decomposed into an
inner product of two instances of a function ϑ. It does not say what ϑ is, nor does it
say what the dimensionality of ϑ is. But that's OK. We do not have to know. In fact,
the vector ϑ may have infinite dimensionality. That's still OK.
One kernel which is known to satisfy Mercer’s condition, and which is very popular in the
SVM literature, is the radial basis function

    K(a, b) = \exp\left( -\frac{(a - b)^T (a - b)}{2\sigma^2} \right).    (14.87)
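For instance, the Gram matrix of pairwise kernel values, which plays the role of A in
Eq. (14.83) with x_i^T x_j replaced by K(x_i, x_j), might be computed as follows (a
Python sketch; the names are ours):

    import numpy as np

    def rbf_gram(X, sigma=1.0):
        # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), Eq. (14.87).
        sq = np.sum(X ** 2, axis=1)
        d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
        return np.exp(-d2 / (2.0 * sigma ** 2))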
In the literature, SVMs have been applied to various problems such as face recognition [14.10]
and breast cancer detection [14.1]. In previous studies [14.7] and in comparative analysis in
the literature, they have empirically been shown to outperform classical classification tools
such as neural networks and nearest neighbor rules [14.9, 14.13, 14.14]. Interestingly, in
a comparison with a classifier based on hyperspectral data, a SVM-based classifier using
multispectral data (derived from the original hyperspectral data by filtering) performed better
than classifiers based on the original data [14.8].

14A.3 Conclusion

Statistical methods provide tools for making decisions, based on measurements. If the mea-
surements are sufficiently discriminating that simple thresholds may be used, sophisticated
statistical methods may not be required. On the other hand, most collections of measurements
are not sufficient to make such decisions based on trivial feature comparisons.
We have seen in section 14A.1 an example of what is in the discipline of statistical pattern
recognition, and just one application to machine vision. There is NOT enough information
in this book to teach you all you need to know about statistical methods. You really need to
take a full course. We hope this chapter has given you enough motivation to do that.
In section 14A.1 we derived an objective function which, if maximized, would result in
projected data with classes maximally separated. It turned out that this maximization problem
turns into an eigenvalue problem.
Maximize the margin.

A SVM finds the decision boundary which maximizes the margin, where the margin is the
distance between the closest points and the decision boundary, and the derivation of
the machine requires use of constrained optimization with Lagrange multipliers.

14A.4 Vocabulary

You should know the meaning of the following terms.

Between-class scatter
Fisher’s linear discriminant
Margin
Mercer’s conditions
Support vector
Within-class scatter

References

[14.1] P. S. Bradley, U. M. Fayyad, and O. L. Mangasarian, “Mathematical Programming
for Data Mining: Formulations and Challenges,” INFORMS Journal on Computing,
11(3), pp. 217–238, 1999.
[14.2] C. J. C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,”
Data Mining and Knowledge Discovery, Vol. 2, No. 2, Dordrecht, Kluwer, 1998.
[14.3] R. Duda and P. Hart, Pattern Recognition and Scene Analysis, New York, Wiley,
1973.
[14.4] R. Duda, P. Hart, and D. Stork, Pattern Classification, Second Edition, New York,
Wiley, 2001.
[14.5] R. Fletcher, Practical Methods of Optimization, New York, Wiley, 1987.
[14.6] K. Fukunaga, Introduction to Statistical Pattern Recognition, New York, Academic
Press, 1972.
[14.7] B. Karacali and H. Krim, “Fast Minimization of Structural Risk Using the Nearest
Neighbor Rule,” IEEE Transactions on Neural Networks, 14(1), Jan. 2003.

[14.8] B. Karacali and W. Snyder, “On-the-fly Multispectral Automatic Target Recogni-
tion,” Combat Identification Systems Conference, Colorado Springs, June, 2002.
[14.9] D. Li, S. M. R. Azimi, and D. J. Dobeck, “Comparison of Different Neural Network
Classification Paradigms for Underwater Target Discrimination,” Proceedings of
SPIE, Detection and Remediation Technologies for Mines and Minelike Targets V,
4038, pp. 334–345, 2000.
[14.10] E. Osuna, R. Freund, and F. Girosi, “Training Support Vector Machines: An Appli-
cation to Face Detection,” Proceedings of Computer Vision and Pattern Recognition,
Puerto Rico, June, 1997.
[14.11] C. Therrien, Decision, Estimation, and Classification, New York, Wiley, 1989.
[14.12] V. Vapnik, The Nature of Statistical Learning Theory, Berlin, Springer, 1995.
[14.13] M. H. Yang and B. Moghaddam, “Gender Classification Using Support Vector
Machines,” Proceedings of IEEE International Conference on Image Processing,
Vancouver, BC, Canada, September, 2000, vol. 2, pp. 471–474, 2000.
[14.14] Y. Yang and X. Liu, “Re-examination of Text Categorization Methods,” Proceed-
ings of the 1999 22nd International Conference on Research and Development in
Information Retrieval (SIGIR’99), Berkeley, CA, pp. 42–49, 1999.
15 Clustering

Woes cluster; rare are solitary woes


Edward Young

In this chapter, we approach the problem alluded to in Chapter 14 where the training
set simply contains points, and those points are not marked in any way to indicate
from which class they may have come. As in the previous chapter, we present only
a brief overview of the field, and refer the reader to other texts [14.4, 15.7] for
more thorough coverage. One very important area which we omit here is the use
of biologically inspired models for clustering [15.4, 15.5, 15.6], and the reader is
strongly encouraged to look into these.
We will discuss the issues of clustering in a rather general sense, but note one
particular application, which is identification of peaks in the Hough transform
array.
Consider this example from satellite pattern classification: We imagine a
downward-looking satellite orbiting the earth, which, at each observed point, makes
a number of measurements of the light emitted/reflected from that point on the
earth’s surface. Typically, as many as seven different measurements might be taken
from a given point, each measurement in a different spectral band. Each “pixel”
in the resulting image would then be a 7-vector where the elements of this vector
might represent the intensity in the far-infrared, the near-infrared, blue, green, etc.
Now suppose we have labeled training sets indicating examples of pixels containing
wheat, corn, grass, and trees. With these training sets, it would seem that we should
be able to build a pattern classifier; and indeed we can. Furthermore, the problem as
stated so far is a supervised learning problem. Let’s consider for a moment however,
the class which we call “trees.” This class consists of evergreen and deciduous trees
and, depending upon the time of year, these two subclasses will give radically dif-
ferent spectral signatures. We thus have a pattern classification problem which is not
easily approached with parametric classifiers. While we could use a non-parametric
approach, parametric methods are very attractive. An alternative to nonparametric
classifiers is to consider methods for determining the existence of the subclasses,
assigning points in the training set to the correct subclass and then representing that
subclass parametrically. Fig. 15.1 illustrates a two-dimensional problem in which
the existence of two classes within the same training set is readily apparent. We will
refer to these subclasses as "clusters" for the duration of this discussion. Each
cluster could be represented fairly accurately by a 2D Gaussian. The entire measurement
space, however, is obviously bimodal.
    Such clustering is easy (for us humans) to visualize in a 2-space, and essentially
impossible to visualize in problems with more than three dimensions.

Fig. 15.1. A training set with two clusters.

15.1 Distances between clusters

It is relatively easy to define, and to visualize, measures of distance between points,
including the Euclidean distance d(a, b) = |a − b| and the city block distance
\sum_i |a_i - b_i|. These ideas can be readily extended to measures for distance
between a point and a cluster. One such measure is the Euclidean (or city block)
distance between the point x and the mean of the cluster μ. Another such measure,
which takes into account the distribution of the cluster, is the familiar Mahalanobis
distance (see section 14.5) between a point x and a cluster A,

    d_{mah}(x, A) = (x - \mu_A)^T K_A^{-1} (x - \mu_A),    (15.1)

where K_A^{-1} is the inverse of the covariance which best represents cluster A.
We have not yet however considered measures of distance between two clusters
and for our discussion of clustering algorithms we will need such a measure. It is
not quite obvious (as we will see from the following discussion) how one should
define the distance between two clusters since each cluster contains potentially many
points. The simplest such definition would be just the distance between the sample
means of the two clusters (also called the centroid distance)

    d_{mean}(A, B) = |\mu_A - \mu_B|.    (15.2)

One might also consider some generalization of the Mahalanobis distance from
(point, cluster) to (cluster, cluster), defining

    d_{fisher}(A, B) = \frac{|\mu_A - \mu_B|}{\sqrt{\sigma_A^2 + \sigma_B^2}},    (15.3)

where the sigmas are computed by first projecting the two clusters onto the line
between the two means. The sample mean and sample variances of the projected
data are the parameters of Eq. (15.3).
    We can provide a more formal statement about how to define a distance between
two clusters by first stating the desirable properties of such a distance. We require

    d(A, B) \ge 0
    d(A, B) = 0 \text{ if } A = B    (15.4)
    d(A, B) = d(B, A).

There are many measures which satisfy these conditions; for example, we could
integrate the densities over all the sample space to obtain the divergence,

    d_{div}(A, B) = \int \left[ p(x|A) - p(x|B) \right] \ln \frac{p(x|A)}{p(x|B)} \, dx.    (15.5)

In the case of multivariate Gaussians, this becomes

    \left( (\mu_1 - \mu_2)^T \left[ K_1^{-1} + K_2^{-1} \right] (\mu_1 - \mu_2)
      + tr\left[ K_1^{-1} K_2 + K_2^{-1} K_1 - 2I \right] \right) / 2,    (15.6)

which simplifies to Eq. (15.7) if the two covariance matrices are equal to each other:

    D^2 \equiv (\mu_1 - \mu_2)^T K^{-1} (\mu_1 - \mu_2).    (15.7)

The Chernoff distance is

    d_{ch}(A, B) = -\ln \int (p(x|A))^{1-s} (p(x|B))^s \, dx    (15.8)

which, for multivariate Gaussians, becomes

    \frac{1}{2} s(1-s) (\mu_A - \mu_B)^T \left[ (1-s)K_A + sK_B \right]^{-1} (\mu_A - \mu_B)
      + \frac{1}{2} \ln \left( \frac{|(1-s)K_A + sK_B|}{|K_A|^{1-s} |K_B|^s} \right).    (15.9)

In the case that s = 1/2, the Chernoff distance turns into the Bhattacharyya distance:

    \frac{1}{8} (\mu_A - \mu_B)^T \left[ \frac{K_A + K_B}{2} \right]^{-1} (\mu_A - \mu_B)
      + \frac{1}{2} \ln \left( \frac{\left| \frac{1}{2}(K_A + K_B) \right|}{|K_A|^{1/2} |K_B|^{1/2}} \right).    (15.10)
The most commonly occurring cluster distance metric in the literature is the nearest
neighbor measure

    d_{min}(A, B) = \min_{a \in A,\, b \in B} d(a, b).    (15.11)

That is, over all pairs of points, one from cluster A and one from cluster B, choose
those two points which are closest together and define that to be the distance between
the two clusters.
Similarly, one may define the furthest neighbor distance

    d_{max}(A, B) = \max_{a \in A,\, b \in B} d(a, b).    (15.12)

Each of the definitions given above simply gives us a scalar measure for representing,
in some sense, how far apart two clusters are.
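These measures are simple to state in code. The following Python sketch (our naming;
clusters are NumPy arrays of row vectors) implements Eqs. (15.1), (15.2), (15.11),
and (15.12):

    import numpy as np

    def d_mean(A, B):                # centroid distance, Eq. (15.2)
        return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

    def d_min(A, B):                 # nearest neighbor distance, Eq. (15.11)
        return min(np.linalg.norm(a - b) for a in A for b in B)

    def d_max(A, B):                 # furthest neighbor distance, Eq. (15.12)
        return max(np.linalg.norm(a - b) for a in A for b in B)

    def d_mah(x, A):                 # point-to-cluster Mahalanobis, Eq. (15.1)
        mu = A.mean(axis=0)          # A must hold enough samples for a covariance
        K = np.cov(A, rowvar=False)
        diff = x - mu
        return diff @ np.linalg.inv(K) @ diff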
Any time you use a distance measure on vector-valued quantities, you should
pay attention to the possibility that scaling of the coordinate axes might change the
results. For example, consider the set of points shown in Fig. 15.2. Another example
that fairly well shows the impact of scaling on clustering is a classification problem
involving the vector [a, b]^T, where a represents population and b represents the
number of universities. Now, consider the distance between a vector x and a vector y,

    d(x, y) = \sqrt{(x_a - y_a)^2 + (x_b - y_b)^2} \approx x_a - y_a    (15.13)

and in this case, the second feature is essentially not considered because it is so
small in magnitude compared with the first feature. A common normalization which
is often used to deal with such problems is to divide each feature by its standard
deviation.

Fig. 15.2. Simple scaling of the coordinate axes can change the apparent clusters.

15.2 Clustering algorithms

We will consider only two clustering algorithms here, agglomerative clustering and
k-means, and will also formulate the clustering problem as an optimization problem.
There are a number of other algorithms in the literature, but these are less often used
in applied systems.

15.2.1 Agglomerative clustering


In agglomerative clustering, we begin by assigning each data point in the training
set to a separate cluster. If there are N data points, then at the beginning, we have N
clusters. Next, we perform iterations of: Merge the two “closest” clusters.
By “merge” we mean: (1) find the two closest clusters (each cluster may be
viewed as a set); (2) create a new set, consisting of the union of the two; and
(3) remove the original two. Fig. 15.3 shows an example.
This process continues until there are only c clusters. Presumably c is known
beforehand.
When we begin, every cluster consists of a single point, and the distance between
clusters is the same as the measure used for the distance between points. After we
have begun the process, however, we will be forced to make use of measures of
distances between clusters as described earlier.

Fig. 15.3. Before the clustering iteration (left), the data points have already been assigned to
three clusters. Clusters B and C are determined (somehow) to be closer than A, B or
A, C, so B and C are merged and renamed to become a new cluster B.

Properties of agglomerative clustering

The clusters which result from this algorithm are highly dependent on the measure
of cluster distance which is used. If, for example, one uses d_min, and illustrates
the distances used by drawing a line in Fig. 15.4, one gets the minimum spanning
tree (MST) of the graph which represents the data points by simply continuing the
algorithm until there is only one cluster. If we want three clusters, then we need only
cut the longest two edges in this graph. One then realizes that if formal
graph-theoretic operations result from clustering, then the converse is likewise true:
Whatever we know about graphs may help us in designing clustering algorithms. In
particular, the following algorithm constructs the MST of a graph very quickly.
    Define the operation y = FIND(x) as returning the name of the set containing x.
Similarly, UNION(A, B, C) creates a new set C = A ∪ B, and then deletes the sets
A and B.

Fig. 15.4. Illustrating the distances chosen at each iteration by a line, we arrive at
the minimum spanning tree.

Algorithm: Finding the Minimum Spanning Tree

(1) Compute all edges in the graph (in clustering, this means finding all distances
    between pairs of points, but not all possible clusters; just the points).
(2) Sort the edges by length (if there are N points, we have N^2 edges).
(3) Beginning with the shortest edge, for each edge, between nodes u and v, perform
    the following operations:
    (3.1) A = FIND(u)
    (3.2) B = FIND(v)
    (3.3) IF (A ≠ B) THEN C = UNION(A, B, C), and erase sets A and B.

Generally, in step 3.3, we index each set by an integer index, and rather than discard-
ing indices when A and B are erased, we use the index for A or for B (whichever is
smaller) as the index for the new set C.
As has been discussed in the literature [15.1, 15.8], there exist parallel algorithms
for performing the union–find operations in constant time. The parallel algorithm
assumes the existence of a lookup table which performs the FIND operation. Then
a UNION in parallel hardware is implemented as follows: to do a UNION–FIND
operation on points u and v:

(1) Register 1 <= lookup (u)
(2) Register 2 <= lookup (v)
(3) If XOR(register1, register2), rewrite (minimum (u, v), maximum (u, v)), where
    rewrite is a parallel operation.

rewrite (x, y) will set all locations of the lookup table which contain y to x.

Fig. 15.5. Using dmax and illustrating the distances chosen at each iteration by a
line, we derive a clustering very similar to that using dmin.

Now you know how agglomerative clustering works. You know also that agglomerative
clustering with the dmin distance function will result in the minimum spanning tree
of the set of points, and furthermore that this tree may be computed using
graph-theoretic and/or parallel algorithms.
shown in Fig. 15.5. However, as illustrated by Figs. 15.6 to 15.8, this is not always the

(a) (a) (a)

(b) (b)
(b)

(c) (c) (c)

Fig. 15.6. Three examples of Fig. 15.7. The result of using Fig. 15.8. The result of using the
two-dimensional clustering minimum distance metric on the maximum distance metric dmax , to
problems (from [14.3]). Used with example of Fig. 15.6 (from [14.3]). cluster the data of Fig. 15.6 (from
permission. Used with permission. [14.3]). Used with permission.
362 Clustering

case. In particular, dmin will tend to choose clusters which are long and thin, and
dmax will choose clusters which are basically round. Students often get confused
right here, concerning the maximum and minimum criteria, so let us spend just a
few extra words to reiterate what we are doing: In this algorithm (the agglomerative
clustering algorithm) we are ALWAYS merging the two CLOSEST clusters. We get
into the maximum distance thing when we define the distance between the clusters.
dmax says to use as a measure of distance between the clusters, the distance between
those points, in those clusters, whose mutual distance is maximum.

15.2.2 k-means clustering


There is, of course, another way to do clustering. In fact there are several ways. The
k-means algorithm is probably the most popular, and is described as follows:

Algorithm: k-means clustering

Step 1. In an arbitrary way, assign samples to clusters. Or, if you don’t like being that
arbitrary, choose an arbitrary set of cluster centers and then assign all the samples
to the nearest cluster. How you pick the cluster centers is problem-dependent. For
example, if you were clustering points in a color space, where the dimensions are
red, green, and blue, you might scatter all your cluster centers uniformly over this
3-space, or you might put all of them along the line from 0, 0, 0 to maxred, maxgreen,
maxblue.

Step 2. Compute the mean of each cluster.

Step 3. Reassign each sample as belonging to the cluster with the nearest mean.
Step 4. If nothing changed this iteration, exit, otherwise go to step 2.
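The algorithm fits comfortably in a few lines. The following Python sketch (our
naming) chooses initial centers as random samples, which is one of the arbitrary
choices permitted by step 1:

    import numpy as np

    def kmeans(X, k, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]     # step 1
        while True:
            # Step 3: assign each sample to the cluster with the nearest mean.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Step 2: recompute each cluster mean (empty clusters keep their center).
            new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                            else centers[i] for i in range(k)])
            if np.allclose(new, centers):                          # step 4: done
                return centers, labels
            centers = new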

In Fig. 15.9 we illustrate the use of k-means to identify the peaks in a Hough
accumulator array. We use the same accumulator array illustrated in Fig. 11.5. Each
point in the accumulator array is treated as if it contains a number of points equal to
the value of the number stored at that point. The initial cluster centers were chosen
well separated and far from the actual centers. The simplest implementation of
k-means does not work in this application because there are very many points in
which the accumulator array contained only one point. All those ones add up to
place the mean at a point not near the peak we seek. The solution is to simply ignore
points with low values. In this example, any point not containing at least three points
was ignored. Other heuristics are possible (see Assignment 15.4).
The ISODATA algorithm [15.2] extends k-means. ISODATA allows the algorithm
to pick its own version of the number of clusters, and provides for more flexibility
in specifying maximum and minimum cluster sizes.

Fig. 15.9. Path followed by two cluster centers initialized far from their final positions in a Hough
accumulator array. The length of the lines indicates the move from one estimate of the
center to the estimate calculated in the next iteration.

15.3 Optimization methods in clustering

Let’s see if we can figure out how to do clustering in a rigorous, formal way, by
posing and solving an optimization problem.
We want to find the best clusterING. That is, that assignment of points to clusters
which is in some sense the best assignment. In order to begin to approach this
problem, we must first invent a scalar measure of the quality of an assignment. As
a reasonable start on defining such a measure, consider the within-class scatter

    S_w = \sum_{i=1}^{c} \sum_{x \in X_i} (x - \mu_i)(x - \mu_i)^T    (15.14)

where μ_i is the mean (centroid) of cluster X_i. S_w provides a good measure of
average deviation of training set points from their means. However, in order to do
minimization, we need a scalar measure of S_w, such as the trace or determinant.
The trace of S_w,

    Tr(S_w) = \sum_{i=1}^{c} \sum_{x \in X_i} Tr\left( (x - \mu_i)(x - \mu_i)^T \right)
            = \sum_{i=1}^{c} \sum_{x \in X_i} (x - \mu_i)^T (x - \mu_i),    (15.15)

is simply the sum of the squared deviations of the points from their means. The
principal disadvantage of the trace criterion is that when used in a clustering
algorithm, it can yield different results when the variables are scaled.
The determinant of Sw is invariant to axis scaling, but use of the determinant
imposes the assumption that all clusters have roughly the same shape.

Once we have chosen an optimization criterion, whether it be Tr(Sw ), det(Sw ), or


other criterion, we need to search the space of all possible clusterings of the points, in
order to find the best clustering. Thus, we need to consider optimization methods. In
this section, we consider only the method known as “branch and bound,” but another,
possibly more important method is given in section 2.3.3, simulated annealing.

15.3.1 Branch and bound


Suppose we have chosen a scalar-valued criterion function to minimize, let that
criterion be called J . (Why are these things always called either J or H ? You don’t
know? Go look up Hamiltonians and Jacobians.) We will present this optimization
method by an example. Let’s suppose we have four things, I, J, K, and L to assign
into two clusters, cluster 1 and cluster 2. Define some notation.
1112 means we have assigned I, J, and K to cluster 1 and L to cluster 2; and
112X means we have assigned I and J to cluster 1, K to cluster 2, and have not yet
decided to which cluster to assign L.

Assumption
(required to make this technique work). Adding a point to a clustering always in-
creases the cost, independent of what that decision is. That is, suppose we have
evaluated the cost of 112X; that is, assume J(112X) = a. When we make the deci-
sion about L, we know that the cost increases. J(1121) > a and J(1122) > a.

With that assumption, we can define the branch and bound search algorithm. Since
we have no better way to search, let’s just start evaluating possibilities in order –
let’s try 1XXX, 11XX, 111X, 1111, 1112, 112X, etc. Suppose, along the way, we
have determined that J(12XX) is greater than J(1112). Then, there is no point in
evaluating any of the children of 12XX, since, if adding a decision to a cluster can
only INCREASE the criterion function, we can only get a higher answer than we
already have. That is the essence of the branch and bound algorithm. It’s a search of
all possibilities, enumerated in sequential order, but remembering the lowest value
encountered and culling the search tree based on the assumption given above.
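A recursive sketch of the search, in Python, follows. Here the criterion J is taken to
be Tr(S_w) of Eq. (15.15), the sum of squared deviations from the cluster means, which
satisfies the assumption: adding a point to a cluster can never decrease it. The names
and the 0-based cluster labels are ours.

    import numpy as np

    def J(X, assign, c):
        # Tr(Sw) for a (possibly partial) assignment of points to clusters.
        total = 0.0
        for i in range(c):
            members = X[[j for j, a in enumerate(assign) if a == i]]
            if len(members):
                total += np.sum((members - members.mean(axis=0)) ** 2)
        return total

    def branch_and_bound(X, c):
        n = len(X)
        best = {'J': np.inf, 'assign': None}

        def search(assign):                  # e.g. [0, 0, 1] plays the role of 112X
            if J(X, assign, c) >= best['J']:
                return                       # cull: children can only cost more
            if len(assign) == n:             # a complete assignment, e.g. 1121
                best['J'], best['assign'] = J(X, assign, c), list(assign)
                return
            for cluster in range(c):         # extend the partial assignment
                search(assign + [cluster])

        search([0])                          # by symmetry, point 0 goes to cluster 0
        return best['assign'], best['J']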

15.3.2 Vector quantization


In the research area known as “vector quantization,” the computer is presented with
a set of n vectors, and is to find natural groupings among the vectors. Said another
way, the computer is to find a set of c “reference vectors” which represent the set of
n in some optimal way. If you think this sounds like clustering you are correct. It is
in fact precisely clustering; so let’s call it that.

15.3.3 Winner-take-all approaches


The winner-take-all (also called competitive learning) approach originally resulted
from researchers who were interested in modeling the cognitive process we know as
“generalization,” and therefore each reference vector/cluster center was represented
by a mathematical construct which modeled a neuron. It will not be necessary for us
to go into the physiological model of neurons here. Instead, we use the term “cluster
center” for what some other material might refer to as a “neuron.”
Each cluster center ω_i has an associated weight vector which we will refer to
by the same name, ω_i = [ω_ij]^T, j = 1, ..., d. Note that the vector ω describes a
location in a d-space. To do clustering, we present an input vector
v = [v_1, v_2, ..., v_d]^T. Define the winner as the cluster center closest (in
whatever sense is appropriate) to the input vector. That is,

    d(\omega_i, v) \le d(\omega_k, v) \quad \forall (k \ne i).    (15.16)

Compare this with the step size in gradient descent. Can you speculate on a
relationship between the two algorithms?

Suppose cluster center α wins. Then, adjust the weights of α using

    \omega_{\alpha j} \Leftarrow \omega_{\alpha j} + \varepsilon (v_j - \omega_{\alpha j})    (15.17)
where the scalar ε is known as a “learning parameter.” Typical values of this
parameter are small, on the order of 0.01. The input data is presented to the al-
gorithm repeatedly, in randomized order. Each presentation of the entire data set is
called an epoch. After several epochs, the cluster centers will move to accurately
represent the clusters in the data.
Often, a single cluster is always chosen. Since this is clearly not desirable, to
allow other cluster centers to sometimes be chosen, we include a parameter known
as loneliness. On each epoch, any cluster center that did not win any points has its
loneliness increased slightly. The choice of cluster centers is then made with the
decision strongly biased by the distance between the cluster center and slightly by
the loneliness.
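A sketch of one training run, in Python, appears below. The loneliness bookkeeping
constants (0.1 and 0.01) are illustrative assumptions of ours; the text does not
prescribe particular values.

    import numpy as np

    def competitive_learning(X, c, epochs=50, eps=0.01, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=c, replace=False)].astype(float)
        lonely = np.zeros(c)                      # loneliness of each center
        for _ in range(epochs):
            for v in rng.permutation(X):          # one epoch, randomized order
                # Winner: mostly by distance, slightly biased by loneliness
                # (the 0.1 weighting is our assumption).
                score = np.linalg.norm(centers - v, axis=1) - 0.1 * lonely
                win = score.argmin()
                centers[win] += eps * (v - centers[win])   # Eq. (15.17)
                lonely += 0.01                    # non-winners grow lonelier
                lonely[win] = 0.0                 # the winner is consoled
        return centers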

Kohonen feature maps

A Kohonen map is a clustering algorithm which is an extension of the winner-take-all


algorithm. In this extension, a problem-dependent topological distance is assumed
to exist between each pair of the cluster centers. Then, when the winning cluster
center is updated, so are its neighbors in the sense of this topological distance. The
primary cluster, ω_i, (the winner) is updated as in Equation (15.17). Other clusters
are updated by

    \omega_j^{k+1} = \omega_j^k + F(\eta, d_{ij}) \left( v - \omega_j^k \right),    (15.18)

where v is the data presented to the algorithm at this iteration; F is some
nonincreasing scalar function of d_{ij}, a measure of the distance between clusters i
and j; and η is a maximum on that distance. This algorithm is easily programmed and
converges to excellent clusterings.
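For example, assuming the simplest topology, a one-dimensional chain in which cluster
centers i and j are at topological distance |i − j|, one update of Eq. (15.18) might
look like the Python sketch below. The particular F is our assumption; any
nonincreasing function will do.

    import numpy as np

    def kohonen_update(centers, v, winner, eps=0.01, eta=2.0):
        for j in range(len(centers)):
            d_topo = abs(j - winner)                   # topological distance d_ij
            if d_topo <= eta:                          # F vanishes beyond eta
                F = eps * (1.0 - d_topo / (eta + 1.0)) # nonincreasing in d_ij
                centers[j] += F * (v - centers[j])     # Eq. (15.18)
        return centers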

15.4 Conclusion

As we have noted, the form of the clustering algorithm significantly affects the results
of the clustering. Some attempts to reduce the dependency on algorithm form have
been made [15.3], but this remains a fertile area for new ideas.
We view clustering as a collection of methods for determining consistency (as
in determining the peaks of the Hough transform) but not for using consistency to
solve other problems.
Minimize the trace of the scatter matrix.

Clustering algorithms are totally dependent on optimization methods. In section
15.3, we minimized the trace of the scatter matrix to find a good clustering measure.
We used branch and bound to speed up the combinatorial problem which results
when points must simply be switched between clusters.

Gradient descent.

Although couched in the terminology of neural networks, the winner-take-all
methods of section 15.3.3 use Eq. (15.17) (which is quite reminiscent of gradient
descent) to find the best cluster center.

15.5 Vocabulary

You should know the meaning of the following terms.


Agglomerative clustering
Bhattacharyya distance
Branch and bound
Chernoff distance
Cluster
Competitive learning
Distance
Euclidean distance
Furthest neighbor distance
k-means
Kohonen map
Mahalanobis distance
Minimum spanning tree
Nearest neighbor distance
Union--find

Assignment 15.1
Prove that for the equal-covariance case, the
Bhattacharyya distance becomes the same as the measure
given in Eq. (15.7).

Assignment 15.2
The following points are to be partitioned into two
clusters:
[0,0],[0,1],[0,2],[0,3],[0,4],[0,5],[0,7],[0,8].

Sketch the points and indicate the minimum spanning tree.
Using dmin , find and identify the two clusters. You
may do this graphically if you wish.
Using dmax also find and identify the two clusters.
Can you suggest another distance measure to use in
this case? Discuss.

Assignment 15.3
In your images directory are three images, called

facered.ifs
faceblue.ifs

and (can you guess?)


facegreen.ifs

These are the red, blue, and green components of a full


color image. Each pixel can be represented by 8 bits
of red, 8 of green and 8 of blue. Therefore, there
are potentially 2^24 colors in this image. Unfortu-
nately, your workstation (probably) only has 8 bits
of color, for a total of 256 possible colors. Your mis-
sion (should you choose to accept it) is to figure out
a way to display this picture in full color on your
workstation.
Approach: Use some kind of clustering algorithm. Find
128 clusters which best represent the color space, and
assign all points to one of those colors. Then, make a
file with the following data in it:

brightness value    red    green    blue

Example:
       1            214      9        3

This means that if a pixel has brightness 1, it should


be displayed on the screen as 214 of red, 9 of green
and 3 of blue. Such a pixel will show up as almost
pure red. Thus, each cluster center is represented
by a color. In the example above, cluster center 1 is
the almost pure red point. Now, make an image in which
every pixel has the brightness equal to the cluster
number of the nearest cluster.

Assignment 15.4
In Fig. 15.9 a Hough accumulator is illustrated. That
same accumulator array is available on the CDROM as
hough.ifs. Invent a new way to find the peaks using a
clustering algorithm. (Do NOT simply find the brightest
point.) Suggestions might include weighting each point
by the square of the accumulator value, doing something
with the exponential of the square of the value, etc.

References

[15.1] R. Anderson and H. Woll, “Wait-free Parallel Algorithms for the Union-Find Prob-
lem,” Proc. 22nd ACM Symposium on Theory of Computing, pp. 370–380, New York,
ACM Press, 1991.
[15.2] G. Ball, “Data Analysis in the Social Sciences: What about the Details?” Proc. AFIPS
Fall Joint Computer Conference, Washington, DC, Spartan Books, 1965.
[15.3] G. Beni and X. Liu, “A Least Biased Fuzzy Clustering Method,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(9), 1994.
[15.4] G. Carpenter and S. Grossberg, “A Massively Parallel Architecture for a Self-
organizing Neural Pattern Recognition Machine,” Computer Vision, Graphics, and
Image Processing, 37, pp. 54–115, 1987.
[15.5] G. Carpenter and S. Grossberg, “ART-2: Stable Self-organization of Stable Category
Recognition Codes for Analog Input Patterns,” Applied Optics, 26(23), p. 4919, 1987.
[15.6] G. Carpenter and S. Grossberg, “ART-3: Hierarchical Search using Chemical Trans-
mitters in Self-Organizing Pattern Recognition Architectures,” Neural Networks, 3,
pp. 129–152, 1990.
[15.7] D. Hand, Discrimination and Classification, New York, Wiley, 1989.
[15.8] W. Snyder and C. Savage, “Content-Addressable Read–Write Memories for Image
Analysis,” IEEE Transactions on Computers, 31(10), pp. 963–967, 1982.
16 Syntactic pattern recognition

Ours is the age of substitution:


Instead of language, we have jargon;
Instead of principles, slogans;
and instead of genuine ideas, bright suggestions.
Eric Bentley

In this chapter, we discuss a completely different approach to pattern recognition,


a methodology based on an analogy to language understanding. These methods
have not often been used recently, primarily because they are very sensitive to
noise and distortion. However, for certain applications, they may be appropriate,
and the student is advised to learn enough about this topic to recognize potential
applications.
Consider a boundary segment represented by a chain code. Each step in that
chain code is a symbol, an integer between 0 and 7, so that the boundary segment
is represented by a string of symbols: What makes syntactic methods work is the
analogy between this string of symbols and the string of symbols which show up in
the description of a formal language.

16.1 Terminology

To make more progress in this area, we need to define some terminology. The
definitions are in reference to analysis of strings of symbols, such as occur in language
analysis.

Terminal symbol. A terminal symbol: A word, like “horse,” “aardvark,” “professor,” “runs,” “grades.”
Terminal symbols may also be line segments, parts of a picture, or other features.
Generally, we denote terminal symbols using lower case. Most often, terminal sym-
bols are denoted by a single symbol, like “a” or “0” but in the example of words
from English, the terminal symbols are words, not letters.
Nonterminal symbol. A nonterminal symbol: A symbol that describes a grammatical construct, like “noun,”
“verb,” “verb phrase,” “adverb phrase,” etc. Generally, we denote nonterminal symbols
with an upper case character such as “A” or a string of upper case characters like
“VP” abbreviating “Verb Phrase.”

Table 16.1. Productions in a


simple grammar. The start
symbol is S.

S > NP VP
VP > VP ADV
VP > V
NP > ADJ N
NP > ADJ NP
NP > ART N
N > horse
N > professor
V > runs
V > sleeps
ADV > quickly
ADJ > green
ART > the

Grammar. A grammar: A set of terminal symbols, a set of nonterminal symbols, and a set of
rules that generates a set of strings. The critical technical component of syntactic
pattern recognition is the fact that for every grammar there exists a machine (e.g. finite
state machine, pushdown automaton, Turing machine) that recognizes the language
generated by that grammar.

Production. A rewriting rule, or a production, is an allowable substitution, including an arrow. The


string of symbols to the left of the arrow may be replaced by the string on the right.

For a more thorough understanding of syntactic pattern recognition, see [16.5].


We will discuss “traditional” formal languages (as defined by Chomsky [16.2])
in this section. However, current research continues, including interest in stochastic
grammars [16.3], in which the rewriting rules have probabilities associated with
them. For a more thorough description of the theory of formal languages, see [16.7].
An example of a grammar is given below (see Table 16.1) for a limited set of
English sentences.

Terminals: {horse, boy, professor, runs, green, the, quickly, sleeps}.

Start symbol: S.

Nonterminals: {S, VP, NP, N, V, ADV, ART, ADJ}.

Derivation. In the application of a grammar, any production may be applied any number of
times, in any order. A derivation is one instance of an application of the productions

to the start symbol. For example,

S > NP VP > ART N VP > ART N V ADV > The professor sleeps quickly.

is a valid derivation in this grammar.


Language. Several observations are in order about this grammar. The first is that it generates
an infinite number of strings. This is demonstrated most clearly by the productions
with the same nonterminal on both the left and right, such as NP > ADJ NP. The set
of all strings which could possibly be generated by a grammar is called the language
generated by the grammar.
If you know the grammar, you know how to build a recognizer.

For every grammar, there is a machine which will recognize the language generated
by that grammar. Furthermore, the rules for building those machines are very
straightforward. By recognize, we mean: If the machine is provided an input which
is a string in the language, then the machine will stop and indicate “yes, this string
could have been generated by this particular grammar.” Here is the key point. If
one can devise a grammar which generates a particular set of strings, then one can
easily build a recognizer, and it is usually easier for us humans to devise generators
than recognizers.
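Devising a generator really is easy: given the productions, a few lines of code can
derive random strings. The Python sketch below encodes Table 16.1 (the data-structure
choice is ours, not the text's) and applies randomly chosen productions until only
terminal symbols remain.

    import random

    PRODUCTIONS = {
        'S':   [['NP', 'VP']],
        'VP':  [['VP', 'ADV'], ['V']],
        'NP':  [['ADJ', 'N'], ['ADJ', 'NP'], ['ART', 'N']],
        'N':   [['horse'], ['professor']],
        'V':   [['runs'], ['sleeps']],
        'ADV': [['quickly']],
        'ADJ': [['green']],
        'ART': [['the']],
    }

    def derive(symbol='S'):
        if symbol not in PRODUCTIONS:              # a terminal symbol: emit it
            return [symbol]
        rhs = random.choice(PRODUCTIONS[symbol])   # apply one production
        return [t for s in rhs for t in derive(s)]

    print(' '.join(derive()))   # e.g. "the professor sleeps quickly"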

16.2 Types of grammars

The set of possible grammars can be divided into four categories, depending on
restrictions on the types of allowable productions.

16.2.1 Type 0 grammars


In a type 0 grammar, any rewrite rule is allowable. The left-hand side of the production
may contain any mix of terminal and nonterminal symbols. For example
abAaBc > abAaCCc
is allowable. Let’s reiterate what this means: If, in the course of a derivation, the
string abAaBc occurs, it may be replaced by abAaCCc, e.g. aardvabAaBcark could
be replaced by aardvabAaCCcark.

No, we are not going to explain what a Turing machine is. Just assume that it is a
machine which could be simulated by a computer program.

For any type 0 grammar, there is a Turing machine which recognizes the language
generated by that grammar.

16.2.2 Type 1 grammars1


Any rewrite rule is allowable, except that strings are not permitted to get shorter. The
left-hand side of the production may contain any mix of terminal and nonterminal

1 Type 1 grammars are sometimes called “context-sensitive,” but we will not use that term because some students
find it confusing.

symbols. For example,

abAaBc > aabCC

is not allowable, because the resulting string is shorter, however


bAaBc > aabbCC

is allowable.

No, we are not going to explain what an LR-automaton is. Just assume that it is a
machine which could be simulated by a computer program.

For any type 1 grammar, there is an LR automaton which recognizes the language
generated by that grammar.

16.2.3 Type 2 grammars2


The left-hand side of the production may contain only a single nonterminal symbol.
For example

Bc > aabCC

is not allowable, because the LHS contains more than one symbol, however

B > aabbCC

is allowable.

Actually, we ARE going to explain what a pushdown automaton is. However, it is a
machine which could be simulated by a computer program.

For any type 2 grammar, there is a pushdown automaton which recognizes the
language generated by that grammar. We present one example of a type 2 grammar
which you might find interesting.

EXAMPLE: TYPE 2 GRAMMAR.


Terminals: {0,1}
Nonterminals: {S}
Start symbol: S

Table 16.2. Productions in an


example type 2 grammar.

S > 0S1
S > 01

Is it ALL such strings? In particular, is the string 01 generated by this grammar?

What is the language generated by this grammar? Is it obvious to you? It is the set of
strings of zeros followed by exactly the same number of ones, denoted 0^n 1^n. In this
chapter, only type 2 and type 3 grammars are discussed in detail.

2 Also sometimes called “context-free” grammars.
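A recognizer for this language is also small. Since the pushdown automaton's stack
would hold only one kind of symbol here, a counter suffices; the following Python
sketch (ours) accepts exactly the strings 0^n 1^n with n ≥ 1.

    def recognize_0n1n(s):
        count = 0
        i = 0
        while i < len(s) and s[i] == '0':   # "push" for each leading 0
            count += 1
            i += 1
        while i < len(s) and s[i] == '1':   # "pop" for each following 1
            count -= 1
            i += 1
        # Accept iff all input was consumed, the "stack" is empty,
        # and the string was nonempty (the grammar generates 01 at minimum).
        return i == len(s) and count == 0 and len(s) > 0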



16.2.4 Type 3 grammars


The left-hand side of the production may contain only a single nonterminal symbol.
The right-hand side may contain only strings of the form “a” or “aA,” that is, a
single terminal symbol, or a single terminal followed by a single nonterminal.
For example,
B > aCCb
is not allowable, because the RHS is not of an allowable form, however
B > aC
is allowable.

16.3 Shape recognition using grammatical structure

In this section, we illustrate a few examples of the use of the syntactic approach to
image shape recognition.

16.3.1 Type 3 grammars


For any type 3 grammar, there is a finite state machine (FSM) which recognizes the
language generated by that grammar. An FSM is a system which may exist in a finite
number of states3, denoted by upper case letters, and has a finite number of input
symbols, denoted here by lower case letters or numbers. Its operation is governed by
a set of transition rules of the form δ(A, a) = B; that is, when the machine is in
state A and receives input a, it will change to state B. The output of the machine
depends only on its state. The machine produces an output when it is in an
"accepting" state.

The reader familiar with FSMs will recognize this as defining a Mealy machine (or is
it a Moore machine? You look it up).

Nondeterministic machines.

To derive the machine from the grammar requires two steps: First, we construct a
strange artifice called a nondeterministic finite state machine. Then, we convert that
machine into one which we can actually build. Let's do that for a grammar which
describes regular ECGs.4 First, a bit about how your heart beats.
A normal heart beat is sketched in Fig. 16.1. It illustrates several waves. The
P wave is generated by the electrical signal, “depolarization,” which initiates the
contraction of the two atria, the smaller chambers of the heart. After the atrial beat,
the ECG returns to the isoelectric line, denoted i, for a short period of time, to
allow the larger chambers, the ventricles, to fill. Then the ventricles depolarize,
creating the QRS (denoted by just R) signal. In a healthy heart, the signal again
returns to the isoelectric line until the ventricles repolarize, creating the T wave.
Another isoelectric period occurs until the next P wave. So a healthy heart produces
a sequence of symbols like piritiipiritiipiritii (repeat forever, or for 90 or so years
anyway). A few of the things that could go wrong include the conditions illustrated
in Figs. 16.2 to 16.5.

3 “A finite state machine is a system which may exist in a finite number of states.” We can’t believe we actually
said that!
4 An ECG is an electrocardiogram. The familiar expression “EKG” comes from the German spelling of the word.

Fig. 16.1. A normal heart beat.

Fig. 16.2. Sinus Tachycardia: Insufficient time between T and P, piritip.

Fig. 16.3. Atrial flutter: Uncontrolled P waves, pipipirpipipir.

Fig. 16.4. Atrioventricular block: Delay between the P and the R, piiritipiiiritip.
Of course, this is oversimplified, but it gives us a useful example to work on. Table
16.3 illustrates a grammar that generates normal heart beats.
To build the machine, which we will name M, we follow these steps: Each
nonterminal symbol in the grammar becomes a state in the machine. In addition, one
extra state is needed, denoted Q. Then, for each production of the form A > aB,
construct a state change of the form δ(A, a) = B. For each production of the form

Fig. 16.5. Myocardial infarction: Signal does not return to isoelectric between R and T, pirtii.

Table 16.3. A type 3 grammar which generates normal ECGs.

S > pA
A > iC
C > rD
D > iE
E > tF
F > iG
G > i
G > iH
H > i
H > iS

Table 16.4. A nondeterministic FSM which recognizes ECGs.

δ(S, p) = A         δ(A, i) = C         δ(C, r) = D
δ(D, i) = E         δ(E, t) = F         δ(F, i) = G
δ(G, i) = {H, Q}    δ(H, i) = {S, Q}    δ(Q, i) = φ

A > a, construct a state change of the form δ(A, a) = Q. Finally, if a is any input
symbol, δ(Q, a) = φ, where φ denotes the “empty” symbol.
The state change description of the machine which recognizes the language gen-
erated by this grammar is shown in Table 16.4.
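The construction is mechanical, as the following minimal sketch (our own; the
Python encoding of the productions is ours) illustrates for the grammar of Table 16.3:

    # Each production A > aB is stored as (A, a, B); each production
    # A > a is stored as (A, a, None).
    productions = [("S", "p", "A"), ("A", "i", "C"), ("C", "r", "D"),
                   ("D", "i", "E"), ("E", "t", "F"), ("F", "i", "G"),
                   ("G", "i", None), ("G", "i", "H"),
                   ("H", "i", None), ("H", "i", "S")]

    delta = {}
    for lhs, a, rhs in productions:
        # A > aB gives delta(A, a) = B; A > a gives delta(A, a) = Q.
        delta.setdefault((lhs, a), set()).add(rhs if rhs else "Q")

    for (state, symbol), nxt in sorted(delta.items()):
        print(f"delta({state}, {symbol}) = {sorted(nxt)}")
    # Reproduces Table 16.4; for example, delta(G, i) = ['H', 'Q'].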
Do you see why this is called “nondeterministic?” This machine goes from H to
Q under input i or H to S under the same input i. We do not mean sometimes it goes
to Q and sometimes to S. We mean it really does both, which of course is impossible
in a physically realizable machine.
To convert this into something that could be built, we construct a new, deterministic
machine as follows.
The states of the new machine are all possible subsets of states of the original
machine, including φ (but not all will necessarily be used). In this example, there

Table 16.5. A deterministic FSM.

δ([S], p) = [A]          δ([A], i) = [C]          δ([C], r) = [D]
δ([D], i) = [E]          δ([E], t) = [F]          δ([F], i) = [G]
δ([G], i) = [H, Q]       δ([H], i) = [S, Q]       δ([Q], i) = φ
δ([H, Q], i) = [S, Q]    δ([S, Q], p) = [A]

Fig. 16.6. Simple deterministic FSM which recognizes normal ECGs. Accepting states are
circled.

are 2^9 = 512 such states. These states will be denoted by square brackets and a list of
the original state names. If a transition involves such a set-valued state on the left,
the new state will be the union of the states to which the original machine went.
(Whew! Was that awkward enough?) The accepting states of the new machine are
any states involving Q, that is, any states containing accepting states of the original
machine. This process produces the physically implementable machine illustrated
in Table 16.5 and Fig. 16.6. In this example, although there are 2^9 states in the new
machine, only a few of them are used.
Thus, we have a machine which recognizes heart rhythms. We hope you have
observed that we have only listed what to do if the “normal” or expected input
occurs. Now, just for the fun of it, we could modify the state diagram (Fig. 16.6)
to include pathological conditions. For example, we could add δ(D, t) = Y, which
would cause a transition to an alarm state indicating the patient might be having a
myocardial infarction (or aortic dissection or other bad stuff).
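To make this concrete, the following minimal sketch (our own; the string encodings
of the set-valued states are labels we chose) simulates the deterministic machine of
Table 16.5 as a transition table:

    # The deterministic FSM of Table 16.5 as a Python dict. Set-valued
    # states such as [H, Q] are written "HQ"; a missing entry is an error.
    DELTA = {("S", "p"): "A",  ("A", "i"): "C",   ("C", "r"): "D",
             ("D", "i"): "E",  ("E", "t"): "F",   ("F", "i"): "G",
             ("G", "i"): "HQ", ("HQ", "i"): "SQ", ("SQ", "p"): "A"}
    ACCEPTING = {"HQ", "SQ"}            # states containing Q are accepting

    def recognize(symbols):
        """Return True if the symbol string is a normal ECG rhythm."""
        state = "S"
        for s in symbols:
            state = DELTA.get((state, s))
            if state is None:           # unexpected input: reject
                return False
        return state in ACCEPTING

    print(recognize("piritii"))            # True: one normal beat
    print(recognize("piritiiipiritii"))    # True: a repeating rhythm
    print(recognize("pirtii"))             # False: cf. Fig. 16.5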
Next, we will give one more example of a type 3 grammar, this one using a chain
code. But first, another way to describe regular languages: Regular expressions.

Given a set of terminal symbols, T, a regular expression is a string constructed by
concatenating elements of T and the symbol * (denoting repetition), with parentheses
to delineate order of operation, and comma to denote the logical OR operation.
In this section, we will use the terminals {0, 1, 2, 3, 4, 5, 6, 7} (the elements of a
chain code).
One element of the language generated by (0, 7)(0, 7)*(7, 6)(7, 6)(61, 72)(1, 2)
(1, 2)0(0, 1)* is illustrated in Fig. 16.7. The FSM which recognizes the set of all
strings generated by this regular expression is given in Fig. 16.8.

Fig. 16.7. Boundary segment corresponding to the chain code 0776612100.

Fig. 16.8. State diagram for the nondeterministic FSM that recognizes strings generated by the
regular expression above. Two numbers separated by a comma denote that either
input will cause this transition. Any other input will cause a transition to an error state
which is not shown.

In this example, we illustrate how a boundary segment may be represented by a


chain code, and how a possibly infinite number of similar boundary segments may
be recognized by the same machine.
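In modern regular expression notation the comma above corresponds to alternation,
written |, so such patterns can be checked directly; the following minimal sketch (our
own, using Python’s re module) tests the chain code of Fig. 16.7:

    import re

    # The regular expression of the text, with | replacing the comma.
    # Note that (61, 72) matches the two-symbol substrings "61" or "72".
    pattern = re.compile(r"(0|7)(0|7)*(7|6)(7|6)(61|72)(1|2)(1|2)0(0|1)*")

    print(bool(pattern.fullmatch("0776612100")))   # True: the code of Fig. 16.7
    print(bool(pattern.fullmatch("0336612100")))   # False: 3 cannot appear here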
There are many ways to represent images as strings other than chain codes,
including for example the curve code [16.4, 16.8].

16.3.2 Type 2 grammars


Although type 3 grammars provide simple implementations as finite state
machines, which can be built using only flip-flops and combinational logic, they do
not have sufficient generality to solve many problems.
types of grammars may be more appropriate. In this subsection we present two
examples of shape recognition using type 2 grammars.

Recognition of chromosomes

The following example is abstracted from the text by Gonzalez and Thomason [16.6],
based originally upon the work of Ledley et al. [16.9], and illustrates the use of a
context-free grammar to recognize types of chromosomes.
The terminals in this grammar are boundary segments, denoted by a, b, c, d, and e,
and illustrated in Fig. 16.9. In the recognition setting, these might be called boundary
primitives. A chromosome will be described by a sequence of symbols a–e. Note that

Fig. 16.9. Primitive boundary segments used for syntactic pattern recognition. Note that
segment size and direction are important.

Fig. 16.10. (a) A submedian chromosome. (b) A telocentric chromosome. (Redrawn from [16.6].)

Table 16.6. Productions to generate images of chromosomes.

Submedian     Telocentric
S > S1        S > S2
S1 > AA       S2 > BA
A > CA        A > AC
A > DE        A > FD
B > bB        B > Bb
C > Cb        C > bC
D > Db        D > bD
E > cD        F > Dc
B > e         C > b
C > d         D > a

except for the symbol d, which can appear either way, the symbols have associated
directions.
There are two types of chromosomes recognized by this grammar, telocentric and
submedian, as illustrated in Fig. 16.10. Each may be described by a sequence of
boundary segments. The following grammar (Table 16.6) will generate either type
of chromosome.
These productions were not invented without some thought. The first two pro-
ductions, those involving the start symbol, S, control which type of chromosome
image is being generated: S1 denotes a submedian chromosome, and S2 designates
a telocentric chromosome. In addition, the other symbols connote components of
the chromosome boundary. That is, A will result in generation of armpair, B will
result in generation of bottom, C will result in generation of side, D will result in
generation of arm, E will result in generation of rightside, F will result in generation
of leftside.

Fig. 16.11. Two of the productions used to generate a hexagonal texture. (Redrawn from [9.2].)

Shape grammars

Finally one last example from [9.2] makes use of shape grammars [17.57] to generate
and recognize textures. In a shape grammar, both the set of terminal symbols, VT ,
and the set of nonterminal symbols, VN , are sets of shapes, with the restriction that
VT ∩ VN = Ø.
In this example, the set of terminals contains only one element, a hexagon:

The set of nonterminals likewise contains a single element, a dot: •


The productions describe how to expand nonterminals, which are in the context
of terminals, into additional shapes. For example, a single dot in the context of a
hexagon might be against one specific side of the hexagon, and the production could
allow replication of the hexagon at the point where the dot occurs. Many such rules
make up the grammar, as illustrated by only two rules in Fig. 16.11.
Olstad and Torp [16.10] make use of syntactic methods to extend the capabilities
of active contour methods – one of very few recent papers which make use of these
methods in imaging – to distinguish occlusions.

Recognizers for type 2 languages

Type 2 languages are recognized by pushdown automata. A pushdown automaton
is a finite state machine which has been augmented with an unbounded memory
in the form of a pushdown stack. Such a memory is a last-in-first-out memory. The
operation of storing information in such a memory is called PUSH, and the operation
of reading whatever is on the top of the stack is called POP. Note that POP does two
things: It reads whatever is stored there, and it changes the memory, so the next POP
operation will return the next-from-top value.
To implement a pushdown automaton, we add the symbol on the top of the stack
to the criteria for the state change. That is: δ(A, i, j) = (C, q) now means “if the
state is A, the input is i, and the symbol on the top of the stack is j, change to state

C and push the symbol q onto the stack.” Using such a machine, it is now possible
to recognize languages such as the example in section 16.2.3, where the number
of ones must equal the number of zeros. The idea is simple: Every time we see a
zero, push a zero onto the stack and stay in the same state. When we first see a one,
change state and pop the stack. Subsequently, every time we see a one, pop the stack.
If we ever see another zero, go to an error state. Else, when the stack goes empty,
the number of ones is equal to the number of zeros, and go to an accepting state.
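That strategy translates directly into code; here is a minimal sketch (our own; the
state names are labels we chose) of such a recognizer for the 0^n 1^n language of
section 16.2.3:

    def recognize_0n1n(s):
        """Push on each 0, pop on each 1; accept when the stack empties
        exactly at the end of the string."""
        stack, state = [], "COUNTING"
        for c in s:
            if state == "COUNTING" and c == "0":
                stack.append("0")       # see a zero: push and stay
            elif c == "1" and stack:
                state = "POPPING"       # see a one: change state, pop
                stack.pop()
            else:
                return False            # a zero after a one, or too many ones
        return state == "POPPING" and not stack

    print(recognize_0n1n("000111"))     # True
    print(recognize_0n1n("01"))         # True
    print(recognize_0n1n("0101"))       # False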
One of the principal concerns of practitioners of syntactic pattern recognition is
illustrated well by both examples of chromosome recognition and ECG recognition.
Both systems assume that a recognizer exists which can identify the primitives such
as T waves. The underlying assumption is that such a primitive preprocessor would
be simple, perhaps a template matcher, and robust to noise. In practice, this may
be difficult to accomplish, and may require that the grammar itself be designed for
some degree of noise tolerance. The reader is referred to textbooks [16.5, 16.6] on
syntactic methods for details.

16.4 Conclusion

Syntactic methods do not rely on optimization methods or consistency except in


how the terminal symbols are classified. A lower level algorithm is required to
extract these features from images. This is in fact the principal weakness of syntactic
methods. Because they must rely on other algorithms to provide input, they can fail
in two different ways. First, the symbol recognizers may fail due to noise, blur, occlusion,
or other unexpected variation. Second, the object may simply not “look like” what
the grammar was designed for. It may look similar, but there is no simple way to
incorporate “similarity” into a grammar.

16.5 Vocabulary

You should know the meaning of the following terms.

Derivation
Finite state machine
Grammar
Nonterminal symbol
Primitive
Production
Pushdown automaton
Regular expression

Regular grammar
Shape grammar
Terminal symbol

Assignment 16.1
Show that the string representing the submedian chromo-
some in Fig. 16.10(a) can be generated by the grammar
of Table 16.6.

Assignment 16.2
The statement was made earlier that the grammar of
Table 16.6 is a type 2 grammar. Prove or disprove
this statement.

References

[16.1] M. Chen, A. Kundu, and J. Zhou, “Off-line Handwritten Word Recognition using
a Hidden Markov Model Type Stochastic Network,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(5), 1994.
[16.2] N. Chomsky, “Three Models for the Description of Language,” IRE Transactions on
Information Theory, 2(3), 1956.
[16.3] A. Corazza, R. De Mori, R. Gretter, and G. Satta, “Optimal Probabilistic Evalua-
tion Functions for Search Controlled by Stochastic Context-free Grammars,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(10), 1994.
[16.4] C. Fermuller and W. Kropatsch, “A Syntactic Approach to Scale-space-based
Corner Description,” IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 16(7), 1994.
[16.5] K.S. Fu, Syntactic Pattern Recognition and Applications, Englewood Cliffs, NJ,
Prentice-Hall, 1982.
[16.6] R. Gonzalez and M. Thomason, Syntactic Pattern Recognition, Reading, MA,
Addison-Wesley, 1978.
[16.7] J. Hopcroft and J. Ullman, Introduction to Automata Theory, Languages, and
Computation, Reading, MA, Addison-Wesley, 1979.
[16.8] W. Kropatsch, “Curve Representation in Multiple Resolution,” Pattern Recognition
Letters, 6(3), 1987.
[16.9] R. Ledley, L. Rotolo, R. Kirsch, M. Ginsberg, and J. Wilson, “FIDAC: Film Input
to Digital Automatic Computer and Associated Syntax-directed Pattern-recognition
Programming System,” In Optical and Electro-optical Information Processing, ed.
J. Tippet, D. Beckowitz, L. Clapp, C. Koester, and A. Vanderburgh, Cambridge, MA,
MIT Press, 1965.
[16.10] B. Olstad and A. Torp, “Encoding of a priori Information in Active Contour Models,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9), 1996.
17 Applications

Example isn’t another way to teach, it is the only way to teach


Albert Einstein

Machine vision has found a wide set of applications from astronomy [17.44] to
industrial inspection, to automatic target recognition. It would be impossible to cover
them all in the detail which they deserve. In this chapter, we choose to provide the
reader with more of an annotated bibliography than a pedagogical text. We mention
a few applications very briefly, and provide a few references. In the next chapter,
we choose one application discipline, automatic target recognition, to cover in a bit
more detail.

17.1 Multispectral image analysis

The strategy of multispectral image analysis combines spatial and spectral informa-
tion in a representation in which each pixel is a vector, an ordered set of
measurements.
example, and there is a great deal of work in the literature in color processing. Most
of the reported work has been intended for image quality enhancement. Only some
recent papers elaborate on the use of color for recognition [17.14, 17.18, 17.53,
17.58].
The methods we have studied for univariate images, for example using Markov
random field methods to remove noise, are applicable to multispectral images [17.3].
Often, all that is necessary is to use a vector description instead of scalar pixels.

17.2 Optical character recognition (OCR)

Despite our love for this topic and the huge number of papers devoted to it (e.g., in this
paragraph, we cite only a few references [16.1, 17.32, 17.64]), we cannot take the
space to cover it in the kind of detail it deserves. One of the first problems in OCR is
automatic zoning [17.28]: that is, locating the text on the page [17.37]. Many OCR


papers are special-purpose applications of techniques we have already covered, for


example, thinning or skeletonizing.

17.3 Automated/assisted diagnosis

There is a growing application of (and need for) machine vision techniques in


medicine. Tagari et al. [7.40] use Voronoi diagrams to develop a graph-based repre-
sentation for understanding the topology of the components of the heart, relating 2D
and 3D views. Measuring ejection fraction, the ratio of fluid ejected during a single
beat to total ventricular capacity (particularly for the left ventricle) has attracted con-
siderable attention with both methods which use nuclear medicine images [17.11]
and those which use multiple conventional x-ray images [17.52].
Some representations are convenient for representing specific body parts and
motions. For example, harmonic representations have found considerable application
in representing the heart [8.40]. Gong and Kulikowski [13.16] use a planning strategy
to identify features in MRI images.
Thermal imaging was really hot (we couldn’t resist the pun!) in the early 1970s, as it seemed to provide
potential for identifying and possibly diagnosing conditions such as breast cancer.
However, interest soon cooled as it became clear that the accuracy was not sufficiently
high and the false positive rate was too high for the method to be useful. Therefore,
despite its deployment in many areas of industry and military, the usage of thermal
imaging in medicine declined [17.27]. Recently, several papers and studies have been
published to reappraise the use of thermal infrared (IR) imaging in medicine [17.31]
for the following three reasons. (1) We have much improved infrared technology.
Largely due to developments in night vision systems for the Army, new generations
of IR cameras have much enhanced accuracy. (2) We have much better capabilities in
image processing. Advanced techniques including image enhancement, restoration
and segmentation have been effectively used in processing IR images. (3) We have
a deeper understanding of how IR images represent underlying pathophysiology of
human systems [17.12, 17.42].

17.4 Inspection/quality control

One would suspect that the area of manufacturing inspection would be intensely
researched, as industry strives to utilize the latest technology to gain competitive
advantage. However, this seems not to be the case. For example, a conference was
sponsored by the US National Science Foundation in the spring of 2000 to discuss
how industry and universities might collaborate more effectively. Over two hundred
CEOs of machine vision companies (most of which are involved in manufacturing
inspection) were invited, and fewer than 30 showed up. Why was there such a poor
turnout when the topic is seemingly so important?
One possible answer is that most machine vision companies are small. But one
might ask, “Why are so many machine vision companies so small?” We speculate
that the uniqueness of this field comes in the answer to that question. The capital
investment required to set up a machine vision company is actually quite small. One
can get into the business with a computer, some inexpensive hardware, and some
good ideas. Unless you go into custom hardware or sophisticated specializations,
you might be able to get your company running without venture capital. You do not
need to be big to be in the machine vision business. Because the companies are so
small, they are intensely market-driven, and do not see that basic academic research
can help them in the short term. Sometimes they are right.
Still, some basic research in industrial inspection is getting done, such as regis-
tration using fiducial marks [17.61], automatic extraction of features [17.62], recog-
nition of overlapping parts [17.21], as well as applications in assembly [17.43].
If a company has been in the business of manufacturing a particular line of prod-
ucts for many years, it is likely that many of those product designs were not entered
into CAD data bases. Reverse engineering is the process of converting legacy de-
signs into modern data bases. It may require reading of blueprints [17.13], and
it may also require generation of CAD models and data bases [17.59] from ac-
tual objects, which in turn may require extraction of geometric primitives such as
spheres, cylinders, cones, etc. from range data [17.40], or other coordinate measuring
machines.
Microscopy is another application area in which machine vision plays an important
role. For example, Pap smears are often screened using automated systems, and white
blood cell counts may be done by computers as well. Tracking tubular molecules
in epi-fluorescence microscopy [17.46] has been recently reported in the research
literature.
Many industrial parts are specular reflectors, and their shape and roughness can
be extracted using multiple light sources [17.17, 17.56].
However it is done, and whatever the application, machine vision requires system
building and work on sensor modeling, together with hypothesis generation [17.73].

17.5 Security and intruder identification

Intruder detection may involve identifying an individual using a combination of


cues [17.6] as well as facial appearance. There has been a lot of research on face
recognition which we do not have space to discuss [17.26].

Fig. 17.1. Three possible views of a penny: Image H (Lincoln’s head), Image E (on edge), and
Image T (the Memorial). Although possible, the view of a penny standing on edge is
so unlikely as to be discarded.

17.6 Robot vision

Robot vision combines everything we have studied. Camera calibration [4.37] is an


important first component, but the robot must not only identify, it must navigate,
and it must therefore build maps [17.4, 17.60]. Sogo et al. [17.54] recently focused
on qualitative maps, which embed such information as INFRONTOF, but without
calibrated position data.
In industrial automation, particularly robotic parts handling, a great deal of use is
made of the concept of “stable states,” which are best illustrated by an example. Three
images of a penny which we might observe are shown in Fig. 17.1. Clearly, if we see
image H or T, we are likely to recognize it as a penny, but image E, although possible,
is so unlikely a view of a penny that any “reasonable” machine vision system (includ-
ing a human) would reject it as an interpretation (unless other knowledge were pro-
vided). Thus, a penny has two “stable states.” Rather than treating all possible views
of a penny, a machine vision system could have two models in its database,
PENNYH and PENNYT, and treat them as separate objects. This concept of stable
states works, of course, only for objects, like coins, which may exist in only a few
possible configurations. The term “aspect” is often used for such configurations
(see section 12.6, where the aspect graph, a data structure representing all possible
aspects of an object, is discussed in more detail).
This idea of stability may be generalized [17.71] to take into account the likelihood
and stability of viewpoints, including what position is optimal in order to take the
next look [9.93], which the authors refer to as “autonomous exploration,” a type of
active vision [17.39]. Another approach to tracking using active vision makes use
of the assumption that all motion is in the plane [17.5].

17.6.1 Robot surgery


Robot-assisted surgery is becoming more important in applications where precise
placement requirements are in excess of what a human can achieve. A common
application is in brain surgery, where the head may be rigidly and precisely held
[17.23, 17.34, 17.38]. In robot-assisted surgery it is necessary to match 3D medical
images (MRI or CT) with 2D x-ray projections. This can be formulated as the

estimation of the spatial pose of a 3D object from 2D images [17.36]. Recent robotic
surgical successes include coronary artery bypass graft on the beating heart [17.9],
stomach surgery [17.7], and gall bladder surgery [17.24].

17.6.2 Robot driving


Robot navigation includes identifying and avoiding obstacles [17.74] as well as
navigating on roads and off. Finding road edges from a moving vehicle has been
approached using a variety of imaging modes [17.20, 17.65, 17.72] as well as ground
level millimeter-wave radar [17.35].
We reiterate: Robot vision is really a systems science. It draws components from
all of the techniques described elsewhere in this book. For example, optic flow may
be used to analyze camera motion and keep the camera trained on the target [17.49];
Grosso and Tistarelli [13.18] combine stereopsis and motion; and Zhang et al. [17.74]
make use of the assumption that the robot is moving in the ground plane.
In fact, if one were to design a single project which would teach students the most
about engineering, robotics would probably be the optimal topic.
The reader interested in robot vision should peruse the proceedings of the IEEE
International Conference on Robotics and Automation.

Bibliography

[17.1] M. Anbar, Quantitative Dynamic Telethermometry in Medical Diagnosis and Man-


agement, Boca Raton, FL, CRC Press, 1994.
[17.2] M. Barzohar and D. Cooper, “Automatic Finding of Main Roads in Aerial Images by
Using Geometric-stochastic Models and Estimation,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 18(7), 1996.
[17.3] M. Berman, “Automated Smoothing of Image and Other Regularly Spaced
Data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5),
1994.
[17.4] Ö. Bozma and R. Kuc, “A Physical Model-based Analysis of Heterogeneous Envi-
ronments using Sonar – ENDURA method,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 16(5), 1994.
[17.5] K. Bradshaw, I. Reid, and D. Murray, “The Active Recovery of 3D Motion
Trajectories and Their Use in Prediction,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 19(3), 1997.
[17.6] R. Brunelli and D. Falavigna, “Person Identification Using Multiple Cues,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(10), 1995.
[17.7] G. Cadiere, J. Himpens, M. Vertruyen, J. Bruyns, O. Germay, G. Leman, and
R. Izizaw, “Evaluation of Telesurgical (Robotic) NISSEN Fundoplication.” Surg.
Endosc., 15(9), pp. 918–923, 2001.
[17.8] V. Caglioti, “Uncertainty Minimization in the Localization of Polyhedral Objects,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 1995.

[17.9] I. W. Chitwood and L. Nifong, “Minimally Invasive Videoscopic Mitral Valve


Surgery: The Current Role of Surgical Robotics,” J. Card. Surg., 15(1), pp. 61–75,
2000.
[17.10] I. Cox and S. Hingorani, “An Efficient Implementation of Reid’s Multiple Hypothesis
Tracking Algorithm and Its Evaluation,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(2), 1996.
[17.11] X. Dai, W. Snyder, G. Bilbro, R. Williams, and R. Cowan, “ Left-Ventricle Boundary
Detection from Nuclear Medicine Images,” Journal of Digital Imaging, February,
1998.
[17.12] N. Diakides, ed., Special Issue on Infrared Imaging, IEEE EMBS Magazine, June,
2000.
[17.13] D. Dori, “Vector-based Arc Segmentation in the Machine Drawing Understanding
System Environment,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 17(11), 1995.
[17.14] F. Ennesser and G. Medioni, “Finding Waldo, or Focus of Attention Using Local
Color Information,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 17(8), 1995.
[17.15] C. Fan, N. Namazi, and P. Penafiel, “A New Image Motion Estimation Algorithm
Based on the EM Technique,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(3), 1996.
[17.16] D. Fleet and K. Langley, “Recursive Filters for Optical Flow,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 17(1), 1995.
[17.17] J. Franke and W. Snyder, “Determination of Part Pose with Unconstrained Moving
Lighting,” International Conference on Robotics, Atlanta, 1984.
[17.18] B. Funt and G. Finlayson, “Color Constant Color Indexing,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(5), 1995.
[17.19] D. Geman and B. Jedynak, “An Active Testing Model for Tracking Roads in Satellite
Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1),
1996.
[17.20] F. Gibbs and B. Thomas, “The Fusion of Multiple Image Analysis Algorithms for
Robot Road Following,” in Image Processing and its Applications, IEE, Edinburgh,
1995.
[17.21] W. Grimson and T. Lozano-Perez, “Localizing Overlapping Parts by Searching the
Interpretive Tree,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
9(4), 1987.
[17.22] Z. Haddad and S. Simanca, “Filtering Image Records Using Wavelets and the Zakai
Equation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(11),
1995.
[17.23] S. Hayati and M. Mirmirani, “Improving the Absolute Positioning Accuracy of Robot
Manipulators,” Journal of Robotic Systems, 2, 1985.
[17.24] J. Himpens, G. Leman, and G. Cardiere. “Telesurgical Laparoscopic Cholecystec-
tomy,” Surg. Endosc., 12, p. 1091, 1998.
[17.25] Y. Huang and C. Suen, “A Method of Combining Multiple Experts for the Recognition
of Unconstrained Handwritten Numerals,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(1), 1995.

[17.26] X. Jia and M. Nixon, “Extending the Feature Vector for Automatic Face Recog-
nition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(12),
1995.
[17.27] B. F. Jones, “A Reappraisal of the Use of Infrared Thermal Image Analysis in
Medicine,” IEEE Transactions on Medical Imaging, 17(6), pp. 1019–1027, 1998.
[17.28] J. Kanai, S. Rice, T. Nartker, and G. Nagy, “Automated Evaluation of OCR Zoning,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1), 1995.
[17.29] T. Kanungo, R. Haralick, H. Baird, W. Stuezle, and D. Madigan, “A Statistical,
Nonparametric Methodology for Document Degradation Model Validation,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 22(11), 2000.
[17.30] A. Katz and P. Thrift, “Generating Image Filters for Target Recognition by Genetic
Learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9),
1994.
[17.31] J. R. Keyserlingk, P. D. Ahlgren, E. Yu, N. Belliveau, and M. Yassa, “Functional In-
frared Imaging of the Breast,” IEEE Engineering in Medicine and Biology, May/June,
pp. 30–41, 2000.
[17.32] G. Kopec and P. Chou, “Document Image Decoding using Markov Source
Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6),
1994.
[17.33] A. Kumar, Y. Bar-Shalom, and E. Oron, “Precision Tracking Based on Segmentation
with Optimal Layering for Imaging Sensors,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(2), 1995.
[17.34] Y. Kwoh, J. Hou, E. Jonckheere, and S. Hayati, “A Robot with Improved Absolute
Positioning Accuracy for CT Guided Stereotactic Brain Surgery,” IEEE Transactions
on Biomedical Engineering, 35(2), 1988.
[17.35] S. Lakshmanan and D. Grimmer, “A Deformable Template Approach to Detect-
ing Straight Edges in Radar Images,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(4), 1996.
[17.36] S. Lavallée and R. Szeliski, “Recovering the Position and Orientation of Free-form
Objects from Image Contours Using 3D Distance Maps,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(4), 1995.
[17.37] K. Lee, Y. Choy, and S. Cho, “Geometric Structure Analysis of Document Images: A
Knowledge-based Approach,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(11), 2000.
[17.38] P. Le Roux, H. Das, S. Esquenzai, and P. Kelly, “Robot-assisted Microsurgery; Fea-
sibility in a Rat Microsurgical Model,” Neurosurgery, 48, 2001.
[17.39] E. Marchand and F. Chaurmette, “Active Vision for Complete Scene Reconstruction
and Exploration,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
21(1), 1999.
[17.40] D. Marshall, G. Lukacs, and R. Martin, “Robust Segmentation of Primitives from
Range Data in the Presence of Geometric Degeneracy,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 23(3), 2001.
[17.41] N. Merlet and J. Zerubia, “New Prospects in Line Detection by Dynamic Program-
ming,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4),
1996.

[17.42] J. Michel, N. Nandhakumar, and V. Velten, “Thermophysical Algebraic Invariants


from Infrared Imagery for Object Recognition,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 19(1), 1997.
[17.43] J. Miura and K. Ikeuchi, “Task-oriented Generation of Visual Sensing Strategies in
Assembly Tasks,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
20(2), 1998.
[17.44] R. Molina, “On the Hierarchical Bayesian Approach to Image Restoration: Applica-
tions to Astronomical Images,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(11), 1994.
[17.45] H. Nogawa, Y. Nakajima, Y. Sato, and S. Tamura, “Acquisition of Symbolic Descrip-
tion from Flow Fields: A New Approach Based on Fluid Model,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 19(1), 1997.
[17.46] B. Parvin, C. Peng, W. Johnson, and F. Maestre, “Tracking of Tubular Molecules
for Scientific Applications,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 17(8), 1995.
[17.47] H. Qi, P. T. Kuruganti, and Z. Liu, “Early Detection of Breast Cancer Using Thermal
Texture Maps,” IEEE International Symposium on Biomedical Imaging: Macro to
Nano, Washington, DC, July, 2002.
[17.48] G. Ravichandran and D. Casasent, “Advanced In-plane Rotation-invariant Corre-
lation Filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(4), 1994.
[17.49] S. Reddi and G. Loizou, “Analysis of Camera Behavior During Tracking,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(8), 1995.
[17.50] D. Ringach and Y. Baram, “A Diffusion Mechanism for Obstacle Detection from
Size-Change Information,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(1), 1994.
[17.51] J. Rocha and T. Pavlidis, “A Shape Analysis Model with Applications to a Char-
acter Recognition System,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(4), 1994.
[17.52] Y. Sato, M. Moriyama, M. Hanayama, H. Naito, and S. Tamura, “Acquiring 3D
Models of Non-rigid Moving Objects from Time- and Viewpoint-invariant Images:
A Step Toward Left Ventricle Recovery,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 19(3), 1997.
[17.53] D. Slater and G. Healey, “The Illumination-invariant Recognition of 3D Objects
Using Local Color Invariants,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(2), 1996.
[17.54] T. Sogo, H. Ishiguro, and T. Ishida, “Acquisition and Propagation of Spatial Con-
straints Based on Qualitative Information,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 23(3), 2001.
[17.55] W. Sohn and N. Kehtarnavaz, “Analysis of Camera Movement Errors in Vision-based
Vehicle Tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(1), 1995.
[17.56] F. Solomon and K. Ikeuchi, “Extracting the Shape and Roughness of Specular Lobe
Objects Using Four Light Photometric Stereo,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 18(4), 1996.

[17.57] G. Stiny and J. Gips, Algorithmic Aesthetics: Computer Models for Criticism and
Design in the Arts, University of California Press, 1972.
[17.58] M. Swain and D. Ballard, “Color Indexing,” International Journal of Computer
Vision, 7(1), 1991.
[17.59] T. Syeda-Mahmood, “Indexing of Technical Line Drawing Databases,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 21(8), 1999.
[17.60] H. Takeda, C. Facchinetti, and J. Latombe, “Planning the Motions of a Mobile Robot
in a Sensory Uncertainty Field,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(10), 1994.
[17.61] M. Tichem and M. Cohen, “Submicron Registration of Fiducial Marks using Machine
Vision,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(8),
1994.
[17.62] S. Trika and R. Kashyap, “Geometric Reasoning for Extraction of Manufacturing
Features in Iso-oriented Polyhedrons,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(11), 1994.
[17.63] L. Tsap, D. Goldgof, and S. Sarkar, “Nonrigid Motion Analysis Based on Dynamic
Refinement of Finite Element Models,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(5), 2000.
[17.64] T. Wakahara, “Shape Matching using LAT and its Application to Handwritten
Numeral Recognition,” IEEE Transactions on Pattern Analysis and Machine In-
telligence, 16(6), 1994.
[17.65] R. Wallace, A. Stentz, C. Thorpe, H. Moravec, W. Whittaker, and T. Kanade, “First
Results in Robot Road-Following,” Proceedings of the International Joint Confer-
ence on Artificial Intelligence, 1985.
[17.66] C. Wang, “Collision Detection of a Moving Polygon in the Presence of Polygo-
nal Obstacles in the Plane,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(6), 1994.
[17.67] C. Wang and W. Snyder, “MAP Transmission Image Reconstruction via Mean Field
Annealing for Segmented Attenuation Correction of PET Imaging,” 17th Interna-
tional Conference of the IEEE Engineering in Medicine and Biology Society, Mon-
treal, September, 1995.
[17.68] C. Wang and W. Snyder, “Frequency Characteristic Study Of Filtered-Backprojection
Reconstruction And Maximum Reconstruction For PET Images,” 17th International
Conference of the IEEE Engineering in Medicine and Biology Society, Montreal,
September, 1995.
[17.69] C. Wang, W. Snyder, and G. Bilbro, “Performance Evaluation of Filtered Back-
projection Reconstruction and Iterative Reconstruction Methods for PET Images,”
Computers in Medicine and Biology, 9(3), 1998.
[17.70] C. Wang, W. Snyder, G. Bilbro, and P. Santago, “A Performance Evaluation of FBP
and ML Algorithms for PET Imaging,” SPIE Medical Imaging, 1996.
[17.71] D. Weinshall and W. Werman, “On View Likelihood and Stability,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 19(2).
[17.72] J. Weng, and S. Chen, “Vision-guided Navigation using SHOSLIF,” Neural
Networks, 1998.

[17.73] M. Wheeler and K. Ikeuchi, “Sensor Modeling, Probabilistic Hypothesis Generation,


and Robust Localization for Object Recognition,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 17(3), 1995.
[17.74] Z. Zhang, R. Weiss, and A. Hanson, “Obstacle Detection Based on Qualitative and
Quantitative 3D Reconstruction,” IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 19(1), 1997.
18 Automatic target recognition

Luke, you’ve switched off your targeting computer. What’s wrong?


George Lucas

This is the principal application chapter of this book.1 We have selected one appli-
cation area: Automatic target recognition (ATR), and illustrate how the mathematics
and algorithms previously covered are used in this application. The point to be made
is that almost all applications similarly benefit from not one, but fusions of most of
the techniques previously described. As in previous chapters, we provide the reader
with both an explanation of concepts and pointers to more advanced literature. How-
ever, since this chapter emphasizes the application, we do not include a “Topics”
section in this chapter.
Automatic target/object recognition (ATR) is the term given to the field of en-
gineering sciences that deals with the study of systems and techniques designed to
identify, to locate, and to characterize specific physical objects (referred to as targets)
[18.7, 18.9, 18.69], usually in a military environment. (There are lots of transducers
which can provide information about targets. In this book, we only consider imaging
sensors.) Limited surveys of the field
are available [18.3, 18.8, 18.21, 18.66, 18.74, 18.79, 18.89]. In this chapter, the only
ATR systems considered are those that make use of images. Therefore, our use of
terminology (e.g., clutter) will be restricted to terms that make sense in an imaging
scenario.

18.1 The hierarchy of levels of ATR

In this section, we define a few popularly used terms and acronyms in the ATR
[18.57] world, starting with the five levels in the ATR hierarchy.

Detection. Identifying the presence or absence of a target in a given scene.

Classification. This term, at least in Army parlance, originally meant distinguish-


ing between vehicles with tracks and those with wheels. Since this definition was
generated, however, most ATR developments have bypassed classification with
respect to performance requirements. Furthermore, the US Army is moving slowly
away from tracked vehicles, and this definition will certainly be obsolete when that
happens.

1 The authors are indebted to Rajeev Ramanath, who assisted significantly in the generation of this chapter, and
in fact wrote some sections, and to Richard Sims and John Irvine, who provided careful reviews and extremely
helpful feedforward.

Recognition. Distinguish targets from similar kinds of objects. For example, distinguish
tanks from front-end loaders, jeeps from automobiles, rocket launchers from school
buses, etc.

Identification. Identify the specific type of target, such as the type of tank (whether it
is a T90 or an M1, etc.).

Characterization. Describe the identified target in more detail. In Army parlance,


this level of process characterizes the target based on how many and what types of
weapons are on board, e.g., a T90 tank with an extra 55 gallon oil drum attached to
the back.

Each level of the ATR hierarchy is a refinement of the target description. Target
characterization reveals the most detail about the target.

18.1.1 ATR terminology


There are a few other terms that are often used in the ATR literature. We give the
definitions as follows.

Chip. A small image usually containing the image of a single target, extracted from a
large image of a scene. Target cueing algorithms, which identify the likely presence
of a target, often produce chips as output.
Detection rate. Fraction of targets correctly detected by the system.

Classification rate. Fraction of targets classified correctly, or more generally, the


conditional probability of correct recognition given the target was detected.
Clutter. Objects that are imaged but are not targets. Clutter typically may be trees,
houses, and other vehicles – anything that is in the picture but is not a target.

Cultural clutter. Refers to man-made objects like buildings, as opposed to natural


objects.

False alarm rate. Generally, the fraction of the number of detections that do not
correspond to actual targets. However, this definition may be modified if the task is
classification rather than detection. We observe that false alarm rate is not the same
as probability of false alarm. The false alarm rate is usually given in false alarms per
square kilometer. See section 18.3.

FLIR (Forward-looking Infrared). This refers to images formed in the midwave
(3–5 μm) and longwave (8–14 μm) spectral bands. The term “forward looking” is
no longer really meaningful, but the acronym persists.

IFF. Identify friend or foe.

18.2 ATR system components

The algorithmic components of an ATR system can be decomposed into preprocess-


ing, detection, segmentation and classification (Fig. 18.1), although it is certainly
possible that specific system implementations might not have one or more of these
components. For example, a “blob” tracker can simply track the center of gravity of
the infrared image, if there is just one hot region in the field of view, and no explicit
segmentation is required.
ATR systems “see” images as their inputs. There is a variety of imaging modalities,
each with inherent advantages, since each “sees” different properties of the
targets under consideration. For example, a visually well-camouflaged tank in a field
may be hidden in the visible band, but clearly visible in the infrared bands, simply
because its engine is running! Fig. 18.2 illustrates images taken from two different
imaging modalities. Fig. 18.2(a) shows what is captured by a regular video camera.
Fig. 18.2(b) shows the image captured by a FLIR camera. Note how the engines of
the tank and Hummer can be seen as “hot.”
Table 18.1 lists some of the commonly used spectral bands. The boundaries in
wavelength of some of these ranges may vary from one user to another [18.9].

Fig. 18.1. Classical ATR system components: Image → Preprocessing → Detection →
Segmentation → Classification → Recognition.

Fig. 18.2. Different imaging modalities. (a) Visible spectrum image. (b) Thermal IR image (notice
hot tank engine on far left). (c) Ground truth (from [18.5]). Used with permission.

Table 18.1. Popularly used spectral bands.

Name assigned                 Wavelength range    Source
Visible (V)                   0.4–0.7 μm          Solar illumination
Near-infrared (NIR)           0.7–1.1 μm          Solar illumination
Shortwave infrared (SWIR)     1.1–2.5 μm          Solar illumination
Midwave infrared (MWIR)       3–5 μm              Solar illumination, Thermal
Thermal infrared (TIR) or
  Longwave infrared (LWIR)    8–12 μm             Thermal
Microwave, RADAR              1 mm–1 m            Thermal, Artificial

Given an image of the scene (which consists of a possible target and back-
ground), we need to detect the target. Target detection methods can be viewed as
having two steps [18.86]. In the first step, appropriate measurements are extracted
from an image using low-level image processing techniques. These measurements
are then utilized to derive a primary segmentation of the image into regions. In
the second step, higher-level descriptors of the segmented regions are used to de-
termine the presence or absence of the target (detect) and possibly classify that
target.

18.3 Evaluating performance of ATR algorithms

In this section, we consider some of the issues with evaluation of the performance of
an ATR system. In this description, we use the term “classifier” assuming the job of
the system is to correctly classify the targets in the scene. The same terminology –
“false alarms,” etc. – would be used if the objective of the system were any of the
other levels of the ATR hierarchy.
For purposes of vocabulary, we will define these terms in the context of a pattern
recognition problem which is to classify a result as “there” or “not there.” Some
applications of such classifiers are in automatic target detection (enemy target or
not target), medical diagnosis (tumor or not tumor), and the digital communication
channel (send a “1” or “0” at one end and receive it at the other). There will be
several references back to Chapter 14, so you might want to glance at that first.
To illustrate a few key ideas in detection theory, we will begin with a simple
example. Consider a digital communication system, sending a symbol a for “0” and
b for “1”.
We say we have hypothesis 0 (H0 ) when a “0” is sent and hypothesis 1 (H1 ) when
“1” is sent. (In an ATR problem, we would have H0 for a target absent and H1 for a
target present in the scene.) As the laws of nature state, there WILL be noise in the

Fig. 18.3. Binary decision system: a source (H0 or H1) plus a noise source produce the point Z
in the measurement space; a decision rule then yields the decision (D0 or D1).

system. So, when we send a “0,” or in other words a, we will receive a + n, where
n is a noise sample. We have

H0: Z = a + n
H1: Z = b + n,                                              (18.1)

where Z is the received signal; this pair of alternatives is called a binary hypothesis
testing problem.


Graphically we may summarize this decision-theoretic problem by Fig. 18.3.
Thus, D0 and D1 denote decisions, possibly incorrect decisions, made by the
classifier, whereas H0 and H1 denote reality. We use P(H0) and P(H1) to represent
prior probabilities or, in a controlled experiment, ground truth.
There are four terms typically used in the evaluation of ATR algorithms:

True Positive. The object is there (e.g., the patient has a tumor; a “1” was transmitted)
and our classifier says it is there (we decide the patient has a tumor; we received a
“1”).
True Negative. The object is not there (e.g., the patient does not have a tumor; a
“0” was transmitted) and our classifier concludes no object (there is no tumor; we
received a “0”).

False Negative. The object is there (e.g., the patient has a tumor, a “1” was trans-
mitted) and our classifier says otherwise (the patient has no tumor; we received a
“0”).
False Positive. The object is not there (e.g., the patient is healthy; a “0” was trans-
mitted) and our classifier says there is an object (the patient has a tumor; we received
a “1”).

Clearly, neither “false” condition is favorable. A false negative might be worse
since we might miss a really dangerous target or overlook a malignancy. However,
both types of error have associated with them different costs (refer to Chapter 14).
The terms above are sometimes referred to by other names, such as “false alarms”
(false positives) and “false misses” (false negatives). Based on these four values,
two probabilities can be derived:

Sensitivity. The conditional probability of a true positive, P(D1|H1); that is, the ratio
of true positives to the sum of true positives and false negatives. In the
specific application of target detection (as distinct from classification, recognition,
and identification), the sensitivity is referred to as the probability of detection and
denoted Pd.

Specificity. The conditional probability of a true negative, P(D0|H0); that is, the ratio
of true negatives to the sum of true negatives and false positives.

Observe that P(Di|Hi)P(Hi) = P(Di, Hi). Now, the probability of a correct
decision is

P(C) = P(D0, H0) + P(D1, H1) = P(D0|H0)P(H0) + P(D1|H1)P(H1).

18.3.1 ATR performance representation


In this section, we discuss how the performance of an ATR system may be charac-
terized and estimated.
Once we have designed an ATR system, we invariably have some parameter we can
adjust. One parameter might be a threshold on the brightness or pixels on target; in the
quadratic classifier, it is the decision boundary; in the k-NN classifier (recall section
14.7), it is the number k, etc. How does performance vary with such parameters?
Observe that it is not sufficient to design a system to have 100% sensitivity without
considering specificity. A doctor who tells every patient seen “you have a tumor”
would have 100% sensitivity. Instead, we wish to design systems which have as close
to 100% sensitivity as possible while at the same time having as close to perfect
specificity as possible. If we adjust the parameter(s), we can normally increase one
performance measure at the expense of the other, as illustrated in Fig. 18.4.
In this figure, the true positive fraction (yet another term for P(D1|H1)) is graphed
on the vertical axis vs. the false positive fraction on the horizontal axis. Each curve
represents the performance of a particular classifier (or under a particular system
setting), as some parameter is varied over its range. Of the three systems shown, the
one with the sharpest bend, which passes closest to the upper left corner, is clearly
the best. We can quantify this idea by calculating the area above the curve, and
choosing the system with the smallest area.
Such curves are called “ROC” curves, a term which is an abbreviation for Receiver
Operating Characteristic, a term from communication theory.
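To make the parameter sweep concrete, here is a minimal sketch (our own; the
threshold decision rule and the Gaussian noise model are assumptions, not from the
text) which traces one ROC curve for the binary channel of equation (18.1):

    import random

    random.seed(0)
    a, b, sigma, n = 0.0, 1.0, 0.5, 10000
    z0 = [a + random.gauss(0, sigma) for _ in range(n)]   # samples under H0
    z1 = [b + random.gauss(0, sigma) for _ in range(n)]   # samples under H1

    # Decision rule: decide D1 whenever Z > threshold. Sweeping the
    # threshold over its range traces out one curve in the ROC plane.
    for threshold in (-0.5, 0.0, 0.5, 1.0, 1.5):
        tpf = sum(z > threshold for z in z1) / n   # P(D1|H1), sensitivity
        fpf = sum(z > threshold for z in z0) / n   # P(D1|H0), 1 - specificity
        print(f"threshold {threshold:5.2f}:  TPF {tpf:.3f}  FPF {fpf:.3f}")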

18.3.2 Generating ROC curves from training data


ROC curves can be generated by applying the classifiers to the training data. First
of all, let’s review the definition of probability of error.

Fig. 18.4. Receiver Operating Characteristic (ROC) curves: the true positive fraction P(D1|H1)
is plotted against the false positive fraction P(D1|H0) as a system parameter is varied.

In terms of decision regions, we can write the average probability of error as

ε = P(H1) ∫Ω0 p(z|H1) dz + P(H0) ∫Ω1 p(z|H0) dz,            (18.2)

which comes from Bayes’ rule (take a look back at section 14.2). Here Ωi is the region
of measurement space over which we decide class i, P(Hi) is the prior probability of
seeing an example of class i, and p(z|Hi) is the conditional probability density of
whatever we are measuring (e.g., brightness) given that the example is from class i.
Of course, we do not know the TRUE probability densities; we only know our
estimates of those densities, determined from the training sets. Furthermore, we
derived the decision regions Ωi from those (estimated) densities.
We could try to determine the error rate by simply counting the number of ele-
ments in the training set that are misclassified: We call this the apparent error rate.
Unfortunately, this leads to an optimistic result: It underestimates the error rate of
the system when tested on data not in the training set. This occurs because the ATR
has been designed to minimize the number of misclassifications of the training set,
and unless the training set perfectly represents the true distribution of the data, the
classifier will reflect characteristics of the training set which may not be true of the
entire sample population.
We must distinguish the apparent error rate from the true error rate. Although we
have no way to determine the true error rate, we may get a better estimate using the
two different approaches now described.

Separating the training set and the test set

This is straightforward. We simply divide our original training set into two parts
(randomly, of course) and build the classifier using half of the data. We then test
the system on the other half. This approach works reasonably well if we have very
large training sets (thousands of examples), or better said, something like 10^d, where
d is the dimensionality of the problem. (What does dimensionality mean here? The
dimensionality of the problem is the number of measurements made, e.g., maximum
brightness, area, and contrast.) Unfortunately, such large training sets are unlikely in
most problems.

The leave-one-out approach

Assume there are n points in the training set. Remove point 1 from the set, and
design the classifier using the other n − 1 points. Then test the resulting machine
on point 1. Repeat for all points. The resulting error rate can be shown to be an
almost unbiased estimate of the expected true error rate for the classifier designed
using all n points. Of course, this requires that we design n machines, which could
be prohibitive. However, with such a result, we have numbers which we can put into
the ROC curves.
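As an illustration, here is a minimal sketch (our own; the toy one-dimensional
feature data and the nearest-neighbor classifier were chosen for brevity) of the
leave-one-out computation:

    # Leave-one-out error estimation with a 1-nearest-neighbor classifier
    # on a single made-up feature; each tuple is (feature value, class).
    data = [(0.9, 0), (1.1, 0), (1.4, 0), (2.0, 1), (2.3, 1), (2.6, 1), (1.6, 1)]

    errors = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]                  # design on n - 1 points
        nearest = min(rest, key=lambda p: abs(p[0] - x))
        if nearest[1] != label:                         # test the held-out point
            errors += 1

    print(f"leave-one-out error estimate: {errors}/{len(data)}")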

18.3.3 ATR performance and system objectives


ATR algorithms can only be evaluated in context of the system objective. It would
be silly to evaluate the performance of a missile (which only tracks hot spots) in its
ability to distinguish tanks from trucks. Thus one is tempted to only evaluate each
“block” of the ATR algorithm. But that introduces a problem as well. It is conceivable
that each component of a system might function well in its own independent context
but the entire system might still not meet overall objectives. Therefore one tries to evaluate
each component with respect to its own performance but also against the overall
system objective [18.7].
Performance evaluation is a hard problem. It would be highly desirable to have an
information-theoretic measure of the difficulty of a particular ATR scenario. Just as
can be done with capacity of transmission channels, we would like to know: “What
is the best performance we could reliably expect from ANY ATR algorithm operating
on this class of scene?” Unfortunately we do not have any such measures available
at this writing, although progress is being made [18.10, 18.61].
The preprocessing operations "correct" the input data, improving image "quality" by performing denoising, deblurring, and other image corrections. For example, a denoising system may be evaluated by edge detection methods, comparing the number of edges detected, edge thinness, edge continuity, etc. Preprocessing steps generally increase the separability between the target and the background; hence, distance metrics in feature space may also be used. Since the detection operation localizes potential target areas, probability of detection and probability of false alarm may be used to evaluate that step. The segmentation operation extracts the target after it has been detected. We may therefore use measures like misclassified pixels, correlation coefficient between true and extracted target, etc.

18.4 Machine vision issues unique to ATR

The problems and issues of interest to the ATR community can be summarized as
target signature variability, false alarm rate, segmentation, feature selection, per-
formance degradation due to incomplete information, and performance evaluation
[18.9, 18.71].

18.4.1 Target signature variability and false alarm rate


The “signature” of a target could mean a geometric signature and/or a spectral sig-
nature. Both are highly susceptible to change, given that we have a “smart enemy.”
Among all the contributing factors, variability in surface reflectance and natural illumination, and occlusion of targets, are the most difficult to deal with.

Variability in surface reflectance and natural illumination

All images record signals received as a sum of emitted and reflected radiation:

f(x, y) = \xi(x, y) + \rho(x, y). \qquad (18.3)

However, the amount emitted may be dramatically smaller than the amount reflected (in the case of visible light) or significant (in the case of longwave infrared). The emissivity is the ratio of emitted to total radiation:

\varepsilon(x, y) = \frac{\xi(x, y)}{f(x, y)}. \qquad (18.4)

The emitted radiation is positively related to the temperature by the Stefan–Boltzmann law,

\xi = \sigma A T^4, \qquad (18.5)

where A is the surface area, measured in m², σ is Stefan's constant, 5.67 × 10⁻⁸ W m⁻² K⁻⁴, and T is the absolute temperature in kelvin.
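A short numeric check of Equation (18.5), with an assumed surface area of 1 m²:

    SIGMA = 5.67e-8   # Stefan's constant, W m^-2 K^-4

    def emitted_radiation(T, A=1.0):
        # Equation (18.5): power radiated by a surface of area A at temperature T.
        return SIGMA * A * T ** 4

    print(emitted_radiation(300.0), emitted_radiation(280.0))  # approx. 459 W vs. 349 W

Even a modest drop in temperature produces a large change in emitted radiation, which is at the root of the contrast reversals discussed next.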
Since the amount of radiation received from the sun and reflected off the target
varies dramatically over a 24 hour period, and as parts of the object heat and cool
at different rates, the same object may undergo contrast reversals as parts of it cool
faster than others. This effect is illustrated in Fig. 18.5.

[Fig. 18.5 panels: thermal images of aluminum (Al), iron (Fe), and tire targets at two-hour intervals from 12:00 through 10:00.]
Fig. 18.5. Objects may exhibit contrast reversals over a 24 hour period. (From US Army Night Vision and Electro-optics Research Center, used with permission.)

Occlusions

Unlike industrial machine vision problems, where the setup of the manufacturing
facility is specifically designed to minimize occlusion, not only do occlusions occur
in ATR scenes, but targets are usually at least partially occluded. In fact, the opponent
will be actively trying to have his equipment as occluded as possible [4.13]! An image
of a truck occluded by a tree is shown in Fig. 18.7, later.
All these variabilities raise the question of “How well trained should the ATR be?”
It is entirely too easy to over-train a system, and produce a system which performs
very well on the data on which it has been trained, but poorly on data it has not seen,
even though that data may (in the eyes of a human) be very similar. The problem is not merely to get the probability of detection high, but to do so while simultaneously keeping the false alarm rate low. The Neyman–Pearson test [18.53] provides a means to perform such an optimization, subject to a bound on the probability of false alarm.

18.4.2 Tracking
In ATR, many if not most applications require target tracking, and furthermore the
tracking problems are less constrained and more challenging than tracking in the
civilian domain. Centroid tracking is the simplest type of tracking algorithm, although there are ways to improve its sophistication [18.39]. The centroid tracker (usually) assumes there is just one target in the field of view and that this bright spot is much brighter than the background. If those assumptions are true, the centroid of the target is simply the brightness-weighted centroid of the field of view. More sophisticated tracking of moving objects is most often done using optimal filters like the Kalman–Bucy filter.
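Under those assumptions the tracker amounts to a brightness-weighted centroid computation; a minimal sketch over a grayscale frame (our own formulation):

    import numpy as np

    def centroid(frame):
        # Brightness-weighted centroid of the field of view; under the
        # one-bright-target assumption this coincides with the target centroid.
        rows, cols = np.indices(frame.shape)
        total = float(frame.sum())
        return (rows * frame).sum() / total, (cols * frame).sum() / total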

Haddad and Simanca [18.28] discuss the limitations of the Kalman filtering ap-
proach and propose a nonlinear tracking filter based on wavelets and the Zakai
equation. Amoozegar et al. [18.3] provide a survey of fuzzy and neural techniques
in tracking.
The process of tracking can be combined with the process of classifying vehicles
[13.12, 18.22] as well.

18.4.3 Segmentation
In most ATR scenarios, the problem of separating clutter from target is fundamental.
Clutter varies from one scene class to another and requires adaptive representations
[18.38]. However, there currently is not even a uniformly accepted definition for
“signal-to-clutter” [18.61, 18.68, 18.78].
Once a potential target is localized, it is extracted from the background as ac-
curately as possible. However, every segmenter makes prior assumptions about the
target and its neighboring pixels. These assumptions may not be valid for all viewing
conditions. As we learned in Chapter 8, two common approaches to segmentation are
edge or boundary formation and region growing [18.68]. Edge detection approaches
are based upon recognizing dissimilarities in images whereas region growing utilizes
similarity properties. Because edge detection techniques are quite sensitive to noise,
successful edge detection usually depends on higher level semantic knowledge. Re-
gion growing techniques offer better immunity to noise and therefore do not have as
much reliance on semantic knowledge. Qi et al. [18.63] propose an efficient segmen-
tation approach to segment man-made targets from unmanned aerial vehicle (UAV)
imagery using curvature information derived from an image histogram smoothed by
Bezier splines. Experimental results show that by enhancing the histogram instead
of the original image, similar segmentation results can be obtained in a more efficient
way. In [18.87], a segmentation strategy based on the image pyramid data structure
is developed, working its way from the top of the pyramid to the bottom, processing
image detail hierarchically.
As we learned in Chapter 6, diffusion and diffusion-like processes [18.41, 18.42]
provide excellent noise-removal steps as components of a segmentation process.

18.4.4 Feature selection


Most of the features used by researchers are geometric, topological and/or spectral
[18.7]. The primary goal of feature selection is to obtain features which maximize
the similarity of objects in the same class while minimizing the similarity of objects
in different classes. The mathematics of feature selection are covered nicely in a
textbook by Hand [18.30].

18.5 ATR algorithms

In the following, we relate a number of the topics covered earlier in this book to the
specific application to ATR. This is done primarily through reference to the ATR
literature. We will not attempt to survey all the relevant literature in ATR, since a
great deal of that literature makes use of the methods of statistical pattern recognition,
which deserves a textbook in itself. However, we cite a number of publications which
will mention those methods.
The ATR application exhibits characteristics which are different from most other
machine vision applications, such as industrial inspection. Key differences include
the following.

(1) ATR systems must of necessity deal with unstructured environments – it simply
is not possible to control the illumination, the viewing angle, the atmosphere,
etc.
(2) Occluded targets are not just possible, they are likely.
(3) Only a few pixels are likely to be on target. The probability of correct classifi-
cation strongly depends on the number of pixels on target. This is illustrated in
Fig. 18.6 for a neural network classifier, but similar results, including especially the dramatic change around 50 pixels-on-target, occur for any system, including a human.
Human performance is dependent on pixels-on-target, too.

Interestingly, the first studies relating pixels-on-target to P_d were done prior to digital images, and considered scan lines rather than pixels. In infrared images of military targets, Johnson [18.36] found that target recognition by humans
required at least four pixels across the critical (smallest) target dimension. Of
course, such results vary with target complexity and recognition task details, but


Fig. 18.6. Probability of detection (circles), probability of classification given detection


(triangles), and probability of identification given detection (squares) as a function of
pixels-on-target (redrawn from [18.84]).

the range of 3–5 pixels across, corresponding (for a square target) to roughly 20 pixels-on-target, seems to hold fairly well. See also [18.37].
Clearly, the ATR system observer needs to remain as far as possible from the
target, and therefore the number of pixels-on-target is always too small! In fact
the observer would really prefer to not be there at all, which should strongly
motivate the development of robotic forward observers [18.35].
(4) 3D information must be considered. By 3D here, we mean that the target may
be observed in a number of possible aspects. (See Chapters 9 and 12.)
(5) Any ATR system must consider clutter and confusers, which are other objects, accidentally or deliberately introduced into the scene, which resemble targets.
The most obvious confuser is friendly equipment of the same type as the target.

The requirements set out above dramatically affect the design of ATR systems.
For example, one might consider a system which identifies targets by using a variety
of different sized templates, one for each of many possible aspects. Or, a system
might simply extract a variety of different sized windows, and do dimensionality
reduction (using the K–L transform) and pass those results on to a classifier [18.14].
ATR philosophies of approach can be categorized using different taxonomies. In
one way, the philosophies are divided into two groups [18.7]. First is the classical
pattern recognition approach mentioned earlier, which uses statistical techniques.
This is very popular due to ease in implementation and speed, provided it is pos-
sible to appropriately and effectively extract useful features. The other approach is
AI-based and requires an additional step of symbolic manipulation.
Depending upon how the ATR systems function, other specialists in this area
classify them as geometry-based (mostly single sensor) and spectral-based systems
[18.71]. The basis for the geometry-based systems is the observation that, by pre-
serving as much as possible the information that is received from an object and by
using intelligent reasoning, a better interpretation of the scene containing the objects
of interest can be made [18.70]. Thus, by better understanding the forward image
formation process, the inverse problem – the task of recognizing objects from their
received signatures – becomes more feasible. However, the integration of these mod-
els into a cohesive end-to-end model by considering the effects of their interactions
is still very preliminary.
The geometry-based approach may be subdivided in various ways. For example,
one could divide geometry-based systems into those that use a large collection of
templates (an aspect graph) and those that start with a 3D model [18.77].
Spectral-based systems recognize targets based on spectral analysis and spectral
matching [18.33]. Intuitively, the more spectral bands used, the more details about
the scene can be revealed. The finer details provided in the spectral signature can
substantially increase the detectability of targets, especially small targets, at pixel or
even subpixel level [18.16, 18.47].

Another type of ATR system is based on the premise that the more sensory data
that is available from the target of interest, the better the system performance. This
is intuitively obvious for sensors that have complementary properties. Due to nu-
merous limitations of single-sensor ATR systems, there has been a move toward
multisensor targeting systems and, hence, the problem of correlating and combining
data generated from multiple sensors. This is also sometimes referred to as multisensor fusion; however, the information sources may be different sensors (sensor fusion) or different algorithms (algorithm fusion) [18.32].
Finally, some researchers break the set of ATR algorithms into model-based meth-
ods, statistics-based methods, and template-based methods. These three categoriza-
tions are discussed in more detail below.

18.5.1 Model-based techniques


Most model-based techniques are geometry-based. They pose the question: “Given
a certain viewing angle, what should the target look like?” [18.1, 18.6, 18.12, 18.13].
This is potentially a powerful philosophy, as it provides information on portions of
the target that may have been occluded due to its position – e.g., from a certain
viewpoint the barrel of a tank may be missing. However if we have a 3D model
of the target, we could generate all possible views and perform a combinatorial
search [18.65] to get a match. Model-based techniques are readily combined with
different data types, especially range (laser radar) images [18.87]. However, like
almost everywhere in machine vision, some optimization problem must be solved,
using neural networks [18.29], genetic algorithms [18.10], or other optimization
methods.
Usually, descriptions that correspond to scene structure and geometry alone are
obtained as opposed to scene physics (heat, light, material properties, etc.). Matching
is then the process of hypothesizing and verifying matches between model and
image points. This process produces a 3D to 2D transformation which brings the
3D model points into correspondence with the 2D image points. The best match
is then the transformation which best explains the scene. Solution of the 3D to 2D
correspondence is basically solution of the perspective equation. Errors between
projected model points and the corresponding image points help verify how good a
match is.
These methods are powerful, but require a lot of processing and a large database.
They perform poorly when targets are occluded [18.73] as this results in a case of
“incomplete information.” To this end, a lot of work is being done in obtaining the
“actual geometric shape” of the target from occluded views [18.75].
Fig. 18.7. Image of a truck occluded by a tree [18.65]. SPIE, used with permission.

Occlusions, in general, fall into two distinct categories – contiguous (buildings or whole trees as shown in Fig. 18.7) and distributed (tree branches). The first type is easier to deal with as it is possible to have enough information on the nonoccluded portion to solve the problem.

A histogram may be constructed of the chain code directions.

In [18.73], Sadjadi presents an approach to detection


under occlusion. The boundary of the segmented region of the image is converted to a
chain code. The chain code is then matched using histogram intersection (described
in the next section) with a set of models. The match is used as a measure of the
confidence of the system. If, however, the degree of confidence is still low, the
system assumes that the object is partially occluded and now the object is matched
with occlusion models in a database.

18.5.2 Statistics-based techniques


The ideas behind statistics-based techniques are the same as those described in Chap-
ter 14: (1) Obtain features; (2) develop statistical measurements which characterize
different classes; (3) make decisions which optimize some measure such as mini-
mum cost, or maximum probability of correct decision. In this section, we consider
only multispectral measurements.

Multispectral matching

One technique for multispectral analysis uses the concept of histogram intersection
introduced by Swain and Ballard [18.82] for the realm of color images. The idea
is simply to compare the histograms of two images and determine an overlap factor (how many pixels in the database histogram match those in the histogram of the new image). Specifically, given a pair of histograms, I (from the new image) and M (from the database), each containing n bins, the intersection is defined to be \sum_{j=1}^{n} \min(I_j, M_j). The result is the number of pixels that have the same color in the two images. This number may be normalized to obtain an overlap factor. The color of an
object is subject to significant change depending upon varying lighting conditions;
and in this situation, this simple algorithm will clearly not give us a good match. To
overcome this problem, Funt and Finlayson [18.24] combined the idea of histogram
intersection with a concept referred to as “color constancy” [18.23], which removes
effects of varying illumination conditions and in effect, normalizes the image to
a standard illuminant. Now that we have a data base also in standard illuminant
conditions, we can “compare apples with apples” and use the histogram intersection
method described above. We have placed no restriction on the dimensionality of the histogram, and hence can extend this concept to higher dimensions (more sensors), obtaining a more robust system.
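A minimal sketch of the intersection itself, for histograms stored as NumPy arrays (normalizing by the model count is one common choice, not the only one):

    import numpy as np

    def histogram_intersection(I, M):
        # Sum over the n bins of min(I_j, M_j); dividing by the model's pixel
        # count gives a normalized overlap factor in [0, 1].
        return np.minimum(I, M).sum() / M.sum()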
Another measure of spectral match between a known target signature and an ob-
served signature is obtained by treating the signatures as vectors and finding the inner
product between the two vectors [18.93]. The better the match, the closer the angular
separation to zero. In other words, if we have two d-dimensional measurements in
the spectral signatures, X and Y , then the distance between these two measurements
can be represented by an angle between these two spectra:

\cos\theta = \frac{X^T Y}{|X|\,|Y|}.

A small θ indicates that the two spectra are quantitatively similar; similarly, a large θ indicates dissimilar spectra. In [18.93], Weisberg et al. use this measure to perform a
clustering procedure, segmenting the image into multiple regions of interest. There
are many other measures of similarity that are in use [18.34, 18.62].
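The spectral angle is equally short to compute; a sketch for two signatures stored as vectors (the clip guards against round-off outside [−1, 1]):

    import numpy as np

    def spectral_angle(X, Y):
        # Angle between two d-dimensional spectral signatures; small angle = similar.
        c = np.dot(X, Y) / (np.linalg.norm(X) * np.linalg.norm(Y))
        return np.arccos(np.clip(c, -1.0, 1.0))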
In [18.86], Trivedi introduces the use of relative spectral information rather than
absolute information about the target for remote-sensing purposes. This introduces
some robustness into the system. For example, a certain object may be brighter than
the background in a particular channel and darker in some other channel.

18.5.3 Template matching in ATR


Template matching is the simplest and most often applied algorithm in the ATR
application domain. Template matching is most effective when the aspect can easily
be established. For example [18.51], when the target is a ship, it is easy to find the
center of gravity, and therefore to eliminate the translational degree of freedom;
in addition, ships are long and thin, and their orientation can be determined by
computing the major axis. Having eliminated rigid body motions, only a few template
comparisons are required to get precise pose and identification.
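The pose normalization just described uses only first and second moments; a sketch, assuming a segmented binary mask of the ship as input:

    import numpy as np

    def ship_pose(mask):
        # The centroid removes translation; the major-axis angle (from second
        # central moments) removes rotation, leaving few templates to compare.
        r, c = np.nonzero(mask)
        rbar, cbar = r.mean(), c.mean()
        mu20 = ((r - rbar) ** 2).mean()
        mu02 = ((c - cbar) ** 2).mean()
        mu11 = ((r - rbar) * (c - cbar)).mean()
        theta = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
        return (rbar, cbar), theta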
In those applications where the pose of the target is not easily estimated, sets of
focused filters can be used, with the strongest response indicating the target pose
[18.2, 18.40].
Deformable templates of Chapter 9 are well suited to detecting and classifying
approximately located objects [18.44, 18.76] like the silhouettes of aircraft [18.65],
where the optimization problem is posed as a MAP estimation.
The problem of identifying and tracking roads in satellite images is discussed
in [18.26, 18.48]. Barzohar and Cooper [18.4] also pose this as a MAP estimation
problem.

18.6 The Hough transform in ATR

Since many man-made objects have straight-line properties, the Hough transform
pops up often in ATR applications. For example, it may be used [18.19] to identify
the track of a (subpixel) missile as it is observed from space. By taking differences
in time, the target track becomes visible, along with a great deal of noise. The track
is, however, more-or-less a straight line, and comes out very clearly in the Hough

Fig. 18.8. Hubble space telescope. Photo: NASA.

transform. Cowart et al. [18.19] also consider the use of a parametric transform
which allows for maneuvering targets.
Viewed from above, a ship is a long, straight (unless the ship has had an unfortunate
encounter) object. When viewed with SAR,2 the image of a ship consists of spots and
dropouts. One may estimate [18.25, 18.50] the orientation of a ship by simply taking
the Hough transform. Surprisingly, it appears [18.25] that the Hough transform is
less sensitive to noise than using principal axes. If the ship is moving, its wake is
even longer, straighter, and may be more easily found [18.18].
A satellite, when viewed from a terrestrial telescope, has straight edges (see Fig.
18.8). The Hough transform can also be used to identify and characterize those edges
[18.20].
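A bare-bones (ρ, θ) accumulator suffices for orientation estimates of this kind; this sketch (ours) votes over the nonzero pixels of a binary image and returns the strongest line:

    import numpy as np

    def hough_orientation(binary, n_theta=180):
        # Vote in (rho, theta) space; the strongest cell gives the dominant
        # straight feature (a ship's axis, a wake, or a faint track).
        ys, xs = np.nonzero(binary)
        thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
        diag = int(np.hypot(*binary.shape)) + 1
        acc = np.zeros((2 * diag + 1, n_theta), dtype=int)
        for t, theta in enumerate(thetas):
            rho = np.round(xs * np.cos(theta) + ys * np.sin(theta)).astype(int)
            np.add.at(acc[:, t], rho + diag, 1)
        r, t = np.unravel_index(acc.argmax(), acc.shape)
        return thetas[t], r - diag   # dominant orientation and its offset

Because every on-pixel votes, the estimate degrades gracefully with the spots and dropouts typical of SAR imagery, consistent with the noise behavior noted above.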

18.7 Morphological techniques in ATR

The special needs of ATR affect the types of morphological techniques reported in
the literature. For example, in many ATR applications, the targets are observed from
above and have few pixels on target. Therefore, it is important that noise removal
operations such as morphological opening be invariant to rotation. For example
[18.60], the traditional opening
(f \circ B)(n, m) = \max_{i,j \in B}\{\min_{i,j \in B}\{f(n + i, m + j)\}\} \qquad (18.6)

² SAR is an abbreviation for Synthetic Aperture Radar.

Table 18.2. Result of modified opening with ring-shaped structuring element.

    L < R          Object erased
    R < L < 2R     Some parts removed, some preserved
    2R < L         Object preserved

can be replaced with

\gamma(f_B)(n, m) = \min\{\max_{i,j \in B}\{f(n + i, m + j)\},\; f(n, m)\} \qquad (18.7)

Fig. 18.9. A ring-shaped structuring element and a rectangular object (redrawn from [18.60]).

where B is the ring-shaped structuring element of Fig. 18.9. The behavior of this structuring element is different according to the relative size of the object, L, and the radius of the structuring element, R. Such methods [18.58] lead to morphology-based algorithms which are more specific to target recognition [18.10] or target tracking [18.92]. The result is shown in Table 18.2.
Shape recognition using morphology has also been addressed by Pham et al.
[18.58].
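Equation (18.7) transcribes almost directly into code; the sketch below uses SciPy's grey dilation for the max-over-B term, with the ring built to an assumed radius:

    import numpy as np
    from scipy.ndimage import grey_dilation

    def ring_opening(f, radius):
        # Ring-shaped structuring element B (cf. Fig. 18.9): a one-pixel-wide circle.
        span = np.arange(-radius, radius + 1)
        yy, xx = np.meshgrid(span, span, indexing="ij")
        ring = np.abs(np.hypot(yy, xx) - radius) < 0.5
        # Equation (18.7): the pointwise min of the dilation (max over B) and f itself.
        return np.minimum(grey_dilation(f, footprint=ring), f)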

18.8 Chain codes in ATR

Template matching and feature-based approaches have their own shortcomings. Most
such algorithms depend on some prior knowledge about the object. Features are
needed which are invariant with respect to the size and orientation of the object in the field-of-view of the sensor. Similarly, approaches based on template matching
require extensive data bases, with large search times. However, silhouettes may be
matched [18.72] using the chain code of the segmented object, and then processing
the histogram of that chain code (Fig. 18.10). This strategy has two very useful properties (a short sketch follows the list below):

[Fig. 18.10 block diagram: Image → Segmented region → Chain code → Histogram of chain code → Histogram matching → Trellis algorithm → Decision.]
Fig. 18.10. Schematic of Sadjadi's proposed method [18.71–18.74] (redrawn).



(1) Scale variation in the image domain is equivalent to a vertical shift in the chain
code histogram domain.
(2) Changes in orientation are equivalent to horizontal cyclic shifts in the histogram.
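Property (2) suggests matching by trying every cyclic shift; a minimal sketch over chain-code histograms (the L1 distance is an assumption made for illustration):

    import numpy as np

    def best_cyclic_match(h_obs, h_model):
        # Try each horizontal cyclic shift of the observed histogram (property (2))
        # and report the best rotation offset and its L1 distance.
        scores = [np.abs(np.roll(h_obs, s) - h_model).sum() for s in range(len(h_obs))]
        return int(np.argmin(scores)), float(min(scores))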

It is possible that two different objects have the same chain code histogram. To this
end a trellis algorithm may be used to distinguish between such objects [18.83]. A
trellis structure of a distorted pattern is created using each row vector as a “fractional
pattern” (a node in the trellis). A large collection of observed data is used to provide
a statistical basis for this trellis, to which the Viterbi algorithm (see section 2.4.2) is
applied. Although the use has been demonstrated only for handwritten characters,
this method may be universally applied to any category of distorted patterns.

18.9 Conclusion

A wide variety of papers are available in the public literature that have comparative
studies on different techniques. For example, Li et al. [18.43] compared a number
of neural, statistical and model-based approaches and concluded that: “At least for
FLIR images, the neural network-based approaches gave better results than the
PCA (principal components analysis) and the LDA (linear discriminant analysis)-
based approaches.” They found that methods based on the Hausdorff distance also
performed “well.”
Often, papers in the literature present conclusions that make the reader think the
ATR problem is nearly solved when, in reality, either these systems have not been tested on real military data or their scope is so limited (to one specific application and set of conditions) that their "actual" performance may be doubted. We need
to understand that the problem being dealt with here is not only nontrivial, it gets
substantially more difficult with every new invention made by the enemy. What is
“state-of-the-art” today may not be tomorrow.
Nonetheless, from an engineering point of view (which is almost always practi-
cal!), the authors believe that the following are needed to produce ATR systems that
may be benchmarked:

• a large set of standard real-world images, with ground truth information available to the research community,
• a tool to optimally evaluate and incorporate the best of a large number of ATR techniques into a system with "optimal" performance.

Several questions arise when we take these issues into consideration. How do
we come up with a standard? What objectives are we trying to reach – a powerful
system that has the best ROC curves, or a system that is portable, using minimal hardware? Clearly all these objectives cannot be met at the same time, and someone
has to strike a balance.
General trends are seen, however, in the development of ATR systems. The use
of more than one sensory system, portability of the system, its ability to be truly an
automated system with minimal human interference, and the increased use of mathematical tools to provide a sound basis for the system are clearly where research seems
to be headed. In this chapter the authors hope to have communicated the vastness of
this problem while also presenting the achievements of science in approaching it.

Bibliography

[18.1] J. Albus, “Applications of an Efficient Algorithm for Locating 3D Models in


2D Images,” Automatic Target Recognition VIII, Proceedings SPIE, 3371, April,
1998.
[18.2] K. Al-Ghoneim and B. Kumar, “Combining Focused MACE Filters for Target De-
tection,” Automatic Target Recognition VIII, Proceedings SPIE, 3371, April, 1998.
[18.3] F. Amoozegar, A. Notash, and H. Pang, “Survey of Fuzzy Logic and Neural Net-
work Technology for Multi-target Tracking,” Automatic Target Recognition VIII,
Proceedings SPIE, 3371, April, 1998.
[18.4] M. Barzohar and D. Cooper, “Automatic Finding of Main Roads in Aerial Images by
Using Geometric-stochastic Models and Estimation,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 18(7), 1996.
[18.5] J. Beveridge, D. Panda, and T. Yachik, “November 1993 Fort Carson RSTA Data
Collection Final Report,” Colorado State University Technical Report, CS-94-118,
August, 1994.
[18.6] J. Bevington and K. Siejko, “Ladar Sensor Modeling and Image Synthesis for
ATR Algorithm Development,” Automatic Target Recognition VI, Proceedings SPIE,
2756, April 1996.
[18.7] B. Bhanu, “Automatic Target Recognition: State of the Art Survey,” IEEE Transac-
tions on Aerospace and Electronic Systems, 22(4), 1986.
[18.8] B. Bhanu and T. Jones, “Image Understanding Research for Automatic Target Recog-
nition,” Proceedings of the DARPA Image Understanding Workshop, 1992.
[18.9] B. Bhanu and T. Jones, “Image Understanding Research for Automatic Target Recog-
nition,” IEEE Aerospace and Electronics Systems Magazine, 8(10), 1993.
[18.10] M. Boshra and B. Bhanu, “Predicting Performance of Object Recognition,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 22(9), 2000.
[18.11] M. Bullock, D. Wang, S. Fairchild, and T. Patterson, “Automated Training of 3-D
Morphology Algorithm for Object Recognition,” Automatic Target Recognition IV,
Proceedings SPIE, 2234, April, 1994.
[18.12] J. Burrill, S. Wang, A. Barrow, M. Friedman, and M. Soffen, “Model-based Match-
ing using Elliptical Features,” Automatic Target Recognition VI, Proceedings SPIE,
2756, April, 1996.

[18.13] L. Carin, L. Felsen, and C. Tran, “Model-based Object Recognition by Wave-oriented


Data Processing,” Automatic Target Recognition V, Proceedings SPIE, 2485, April,
1995.
[18.14] L. Chan, N. Nasrabadi, and D. Torrieri, “Discriminative Eigen Targets for Automatic
Target Recognition,” Automatic Target Recognition VIII, Proceedings SPIE, 3371,
April, 1998.
[18.15] B. Chen, S. Thomopoulos, and C. Lin, “Feature Estimation and Objects Extraction
using Markov Random Field Modeling,” Automatic Target Recognition V, Proceed-
ings SPIE, 2485, April, 1995.
[18.16] S. S. Chiang and C. I. Chang, “Subpixel Detection for Hyperspectral Images Using
Project Pursuit,” EUROPTO Conference on Image and Signal Processing for Remote
Sensing V, Proceedings SPIE, 3871, pp. 107–115, September, 1999.
[18.17] C. Chun, D. Fleming, and E. Torok, “Polarization-sensitive Thermal Imaging,” Au-
tomatic Target Recognition IV, Proceedings SPIE, 2234, April, 1994.
[18.18] A. Copeland, G. Ravichandran and M. Trivedi, “Localized Radon Transform for Ship
Wake Detection in SAR Imagery,” Automatic Target Recognition IV, Proceedings
SPIE, 2234, April, 1994.
[18.19] A. Cowart, W. Snyder, and H. Ruedger, “The Detection of Unresolved Targets Using
the Hough Transform,” Computer Graphics and Image Processing, December, 1982.
[18.20] X. Du, S. Ahalt, and B. Stribling, “Estimating and Refining Orientation Vectors for
Sub-components of Space Object Imagery,” Automatic Target Recognition VIII,
Proceedings SPIE, 3371, April, 1998.
[18.21] D. E. Dudgeon and R. T. Lacoss, "An Overview of Automatic Target Recognition,"
MIT Lincoln Laboratory Journal, 6(1), pp. 2–10, Spring, 1993.
[18.22] D. Ernst, H. Gross, D. Stricker, and U. Thönnesen, “Improvement of Object Classi-
fication in Image Sequences,” Automatic Target Recognition VII, Proceedings SPIE,
3069, April, 1997.
[18.23] D. Forsyth, “A Novel Color Constancy Algorithm,” International Journal of Com-
puter Vision, 5, 1990.
[18.24] B. Funt and G. Finlayson, “Color Constant Color Indexing,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(5), 1995.
[18.25] L. Gagnon and R. Klepko, “Hierarchical Classifier Design for Airborne SAR Images
of Ships,” Automatic Target Recognition VIII, Proceedings SPIE, 3371, April, 1998.
[18.26] D. Geman and B. Jedynak, “An Active Testing Model for Tracking Roads in Satellite
Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1),
1996.
[18.27] R. Ghamasaee, F. Amoozegar, H. Pang, Y. Chin, and S. Blackman, “Survey of Neu-
ral Networks as Applied to Target Tracking,” Automatic Target Recognition VII,
Proceedings SPIE, 3069, April, 1997.
[18.28] Z. Haddad and S. Simanca, “Filtering Image Records Using Wavelets and the Zakai
Equation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(11),
1995.
[18.29] M. Hamilton and T. Kipp, “Model-based Multi-sensor Automatic Target Identifi-
cation for FLIR Fused with MMW,” Automatic Target Recognition V, Proceedings
SPIE, 2485, April, 1995.

[18.30] D. J. Hand, Discrimination and Classification, New York, Wiley, 1981.


[18.31] H. Hayes, C. Priebe, G. Rogers, D. Marchette, J. Solka, and R. Lorey, “Improved
Texture Discrimination and Image Segmentation with Boundary Incorporation,” Au-
tomatic Target Recognition V, Proceedings SPIE, 2485, April, 1995.
[18.32] Y. Huang and C. Suen, “A Method of Combining Multiple Experts for the Recognition
of Unconstrained Handwritten Numerals,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(1), 1995.
[18.33] R. L. Huguenin and J. L. Jones, “Intelligent Information Extraction from Reflectance
Spectra: Absorption Band Positions,” Journal of Geophysical Research, 91(B8),
pp. 9585–9598, 1986.
[18.34] D. Huttenlocher, D. Klanderman and A. Rucklige, “Comparing Images Using the
Hausdorff Distance,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 15(9), 1993.
[18.35] X. Jiang and H. Bunke, “Vision Planner for an Intelligent Multisensory Vision
System,” Automatic Target Recognition IV, Proceedings SPIE, 2234, April, 1994.
[18.36] J. Johnson, “Analysis of Image Forming Systems,” Proceedings of the Image Inten-
sifier Symposium, October 6–7, 1958.
[18.37] J. Johnson and W. Lawson, “Performance Modeling Methods and Problems,” Pro-
ceedings of the IRIS Specialty Group on Imaging, January, 1974.
[18.38] A. Katz and P. Thrift, “Generating Image Filters for Target Recognition by Genetic
Learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9),
1994.
[18.39] A. Kumar, Y. Bar-Shalom, and E. Oron, “Precision Tracking Based on Segmentation
with Optimal Layering for Imaging Sensors,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(2), 1995.
[18.40] B. Kumar, A. Mahalanobis, and A. Takessian, “Optimal Tradeoff Correlation Filters
with Controlled In-plane Rotation Response for Target Recognition,” Automatic
Target Recognition VIII, Proceedings SPIE, 3371, April, 1998.
[18.41] A. Lanterman, M. Miller and D. Snyder, “Implementation of Jump-diffusion Algo-
rithms for Understanding FLIR Scenes,” Automatic Target Recognition V, Proceed-
ings SPIE, 2485, April, 1995.
[18.42] A. Lanterman, M. Miller, and D. Snyder, “Representations of Thermodynamic
Variability in the Automated Understanding of FLIR Scenes,” Automatic Target
Recognition VI, Proceedings SPIE, 2756, April, 1996.
[18.43] B. Li, Q. Zheng, S. Der, R. Chellappa, N. Nasrabadi, L. Chan, and L. Wang,
“Experimental Evaluation of Neural, Statistical and Model-based Approaches to
FLIR ATR," Automatic Target Recognition VIII, Proceedings SPIE, 3371, April,
1998.
[18.44] J. Li and C. Kuo, “Automatic Target Shape Recognition via Deformable Wavelet
Templates,” Automatic Target Recognition VI, Proceedings SPIE, 2756, April, 1996.
[18.45] Lincoln Labs, “Special Issue on Automatic Target Recognition,” Lincoln Laboratory
Journal, 6(1), Spring, 1993.
[18.46] A. Mahalanobis and D. Kelly, “High Value Target Recognition using Correla-
tion Filters,” Automatic Target Recognition VI, Proceedings SPIE, 2756, April,
1996.

[18.47] D. Manolakis, G. Shaw, and N. Keshava, “Comparative Analysis of Hyperspectral


Adaptive Matched Filter Detectors,” Algorithms for Multispectral, Hyperspectral,
and Ultraspectral Imagery VI, Proceedings SPIE, 4049, pp. 2–17, 2000.
[18.48] N. Merlet and J. Zerubia, “New Prospects in Line Detection by Dynamic Program-
ming,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4),
1996.
[18.49] V. Mirelli, D. Nguyen, and N. Nasrabadi, “Target Recognition For FLIR Imagery
Using Learning Vector Quantization And Multi-layer Perception,” Automatic Target
Recognition V, Proceedings SPIE, 2485, April, 1995.
[18.50] S. Musman, D. Kerr, and C. Bachmann, “Automatic Recognition of ISAR Ship
Images,” IEEE Transactions on Aerospace Electronic Systems, 32(4), pp. 1392–
1404, 1996.
[18.51] Y. Nakano, Y. Hara, J. Saito, and Y. Inasawa, “Radar Target Recognition System
using 3-D Mathematical Model,” Automatic Target Recognition VIII, Proceedings
SPIE, 3371, April, 1998.
[18.52] E. Natonek and C. Baur, “Model based 3-D Object Recognition using Intensity and
Range Images,” Automatic Target Recognition IV, Proceedings SPIE, 2234, April,
1994.
[18.53] J. Neyman and E. S. Pearson, “On the Problem of the Most Efficient Tests of Statistical
Hypotheses,” Philosophical Transactions of the Royal Society London Series A, 168,
pp. 268–282, 1933.
[18.54] C. Paiva, “Theater Targets Plume Edge Extraction and Hardbody Aimpoint Selec-
tion,” Automatic Target Recognition VII, Proceedings SPIE, 3069, April, 1997.
[18.55] T. Patterson, “Radiometrically Correct Sharpening of Multispectral Images using a
Panchromatic Image,” Automatic Target Recognition IV, Proceedings SPIE, 2234,
April, 1994.
[18.56] A. Pears and E. Pissaloux, “Using Hardware Assisted Geometric Hashing for High
Speed Target Acquisition and Guidance,” Automatic Target Recognition VII, Pro-
ceedings SPIE, 3069, April, 1997.
[18.57] J. Perez-Jacome, “Automatic Target Recognition Systems with Emphasis on Model-
Based Approaches,” CSIP TR-97-01, ECE-Georgia Tech., 1996.
[18.58] Q. Pham and M. Smith, “A Morphological Multistage Algorithm for Recognition of
Targets in FLIR Data,” Automatic Target Recognition VI, Proceedings SPIE, 2756,
April, 1996.
[18.59] Q. Pham, T. Brosnan, and M. Smith, “Sequential Digital Filters for Fast Detection of
Targets in FLIR Image Data,” Automatic Target Recognition VII, Proceedings SPIE,
3069, April, 1997.
[18.60] Q. Pham, T. Brosnan, M. Smith, and R. Mersereau, “A Morphological Technique
for Clutter Suppression in ATR,” Automatic Target Recognition VIII, Proceedings
SPIE, 3371, April, 1998.
[18.61] M. Phillips and R. Sims, “A Signal to Clutter Measure for ATR Performance Com-
parison,” Automatic Target Recognition VII, Proceedings SPIE, 3069, April, 1997.
[18.62] K. Plataniotis and A. Venetsanopoulos, Color Image Processing and Applications,
Berlin, Springer, 2000.

[18.63] H. Qi, W. E. Snyder, and D. Marchette, “An Efficient Approach to Segmenting Man-
made Targets from Unmanned Aerial Vehicle Imagery,” Optical Engineering, 39(5),
pp. 1267–1274, 2000.
[18.64] H. Ranganath and R. Sims, “Self Partitioning Neural Networks for Target Recogni-
tion,” Automatic Target Recognition IV, Proceedings SPIE, 2234, April, 1994.
[18.65] K. Rao, “Combinatorics Reduction for Target Recognition in ATR Applications,”
Automatic Object Recognition II, Proceedings SPIE, 1700, 1992.
[18.66] J. Ratches, C. Walters, R. Buser, and B. Guenther, “Aided and Automatic Target
Recognition Based upon Sensory Inputs from Image Forming Systems," IEEE
Transactions on Pattern Analysis and Machine Intelligence, 19(9), Sept. 1997.
[18.67] A. Reno, D. Gillies, and D. Booth, “Deformable Models for Object Recognition in
Aerial Images,” Automatic Target Recognition VIII, Proceedings SPIE, 3371, 1998.
[18.68] E.M. Riseman and M.A. Arbib, “Computational Techniques in Visual Segmentation
of Static Scenes," Computer Graphics and Image Processing, 7.
[18.69] R. Robmann and H. Bunke, “Towards Robust Edge Extraction – a Fusion Based
Approach using Greylevel and Range Images,” Automatic Target Recognition V,
Proceedings SPIE, 2485, April, 1995.
[18.70] F.A. Sadjadi, “A Model-Based Technique for Recognizing Targets by Using Mil-
limeter Wave Radar Signatures,” International Journal of Infrared and Millimeter
Waves, 10(3), 337–342, 1989.
[18.71] F.A. Sadjadi, “Automatic Object Recognition: Critical Issues and Current Ap-
proaches,” Proceedings SPIE, 1471, 1991.
[18.72] F. A. Sadjadi, "Automatic Object Recognition: Critical Issues and Current Approaches," Selected SPIE Papers on CD-ROM, Volume 6: Automatic Target Recognition, SPIE, Bellingham, WA, 1991.
[18.73] F. A. Sadjadi, "Automatic Object Recognition: Critical Issues and Current Approaches," Selected SPIE Papers on CD-ROM, Volume 6: Automatic Target Recognition, SPIE, Bellingham, WA, 1991.
[18.74] F.A. Sadjadi, Special Section on ATR, Optical Engineering, 31(12), 1992.
[18.75] F.A. Sadjadi, “Application of Genetic Algorithm for Automatic Recognition of
Partially Occluded Objects,” Automatic Object Recognition IV, Proceedings SPIE,
2234, 1994.
[18.76] R. Samy and J. Bonnet, “Robust and Incremental Active Contour Models for Objects
Tracking,” Automatic Target Recognition V, Proceedings SPIE, 2485, April, 1995.
[18.77] R. Sharma and N. Subotic, “Construction of Hybrid Templates from Collected and
Simulated Data for SAR ATR Algorithms,” Automatic Target Recognition VIII, Pro-
ceedings SPIE, 3371, April, 1998.
[18.78] R. Sims, “Signal to Clutter Measurement and ATR Performance,” Automatic Target
Recognition VIII, Proceedings SPIE, 3371, April, 1998.
[18.79] R. Sims and B. Dasarathy, “Automatic Target Recognition using a Passive Multisen-
sor Suite,” Special Section on ATR, Optical Engineering, 31(12), 1992.
[18.80] A. Srivastava, B. Thomasson, and R. Sims, “A Regression Model for Prediction of
IR Images,” Proceedings of SPIE Aerosense ATR XI, Orlando, 2001.

[18.81] L.B. Stotts, E.M. Winter, L.E. Hoff, and I.S. Reed, “Clutter Rejection using Multi-
spectral Processing,” Proceedings SPIE Signal and Data Processing of Small Targets,
1305, pp. 2–10, 1990.
[18.82] M. Swain and D. Ballard, “Color Indexing,” International Journal of Computer
Vision, 7(1), 1991.
[18.83] H. Tanaka, Y. Hirakawa, and S. Kaneku, “Recognition of Distorted Patterns Us-
ing the Viterbi Algorithm,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 4(1), 1982.
[18.84] W. Thoet, T. Rainey, D. Brettle, L. Stutz, and F. Weingard, “ANVIL Neural Network
Program for Three-dimensional Automatic Target Recognition,” Optical Engineer-
ing, 31(12), December, 1992.
[18.85] M.M. Trivedi, “Detection of Objects in High Resolution Multispectral Aerial
Images,” SPIE Applications of Artificial Intelligence II, 1985.
[18.86] M. Trivedi, “Object Detection Using Their Multispectral Characteristics,” Proceed-
ings SPIE, 754, 1987.
[18.87] M. M. Trivedi and J. C. Bezdek, “Low-Level Segmentation of Aerial Images with
Fuzzy Clustering,” IEEE Transactions on Systems, Man, and Cybernetics, 16(4),
1986.
[18.88] A. Ueltschi and H. Bunke, “Model-based Recognition of Three-dimensional Objects
from Incomplete Range Data,” Automatic Target Recognition V, Proceedings SPIE,
2485, April, 1995.
[18.89] J. Wald, D. Krig, and T. DePersia, “ATR: Problems and Possibilities for the IU
Community,” Proceedings of the DARPA Image Understanding Workshop, IUW,
255–264, San Diego, January, 1992.
[18.90] B. Wallet, D. Marchette, and J. Solka, “A Matrix Representation for Genetic Algo-
rithms,” Automatic Target Recognition VI, Proceedings SPIE, 2756, April, 1996.
[18.91] B. Wallet, D. Marchette, and J. Solka, “Using Genetic Algorithms to Search for
Optimal Projections,” Automatic Target Recognition VII, Proceedings SPIE, 3069,
April, 1997.
[18.92] S. Wang, G. Chen, D. Sapounas, H. Shi, and R. Peer, “Development of Gazing
Algorithms for Tracking Oriented Recognition,” Automatic Target Recognition VII,
Proceedings SPIE, 3069, April, 1997.
[18.93] A. Weisberg, M. Najarian, B. Borowski, J. Lisowski, and B. Miller, “Spectral Angle
Automatic Cluster Routine (SAALT): An Unsupervised Multispectral Clustering
Algorithm,” Proceedings of IEEE Aerospace Conference, 307–317, 1999.
[18.94] D. Xue, Y. Zhu, and G. Zhu, “Recognition of Low-contrast FLIR Tank Object Based
on Multiscale Fractal Character Vector,” Automatic Target Recognition VI, Proceed-
ings SPIE, 2756, April, 1996.
[18.95] C. Zhou, G. Zhang, and J. Peng, “Performance Modeling Based Adaptive Target
Tracking in Multiscenario Environment,” Automatic Target Recognition VI, Pro-
ceedings SPIE, 2756, April, 1996.
Author index

Aarts, E. 142 Besag, J. 114


Adams, R. 211 Beucher, S. 211
Aggarwal, J. 48, 63, 401 Bezdek, J. 416
Aghajan, H. 287, 288 Bhanu, B. 322
Ahlgren, P. 388 Bhattachaya, P. 323
Ahuja, N. 258, 261 Bichsel, M. 256
Alter, T. 62 Bigün, J. 62
Amir, A. 201, 258 Bilbro, G. 140, 142, 143, 204, 211, 214,
Amit, Y. 322 322
Amoozegar, F. 402, 412 Bilbro, S. 215
Anandan, P. 142 Bimbo, A. 322
Anbar, M. 386 Bischof, L. 211
Anderson, C. 63 Bister, M. 166
Andrey, P. 211 Black, M. 120, 139
Anh, V. 104 Blackman, S. 412
Arbib, M. 215, 415 Blais, G. 204, 211
Arbter, K. 256 Blake, A. 129, 260
Arcelli, C. 173 Blanc-Fèraud, L. 214
Arrott, M. 256 Blostein, S. 251, 257
Astrom, K. 322 Bobick, A. 256
Attneave, F. 273 Bokil, A. 62
Aubert, G. 214 Bolle, R. 259
Avnir, D. 227, 262 Bonnet, J. 415
Booth, D. 415
Babaud, J. 104 Borowski, B. 416
Bachmann, C. 414 Bose, S. 262
Badler, D. 62 Boult, T. 257
Bagheri, A. 258 Bovik, A. 261
Baird, H. 388 Bowyer, K. 212, 213, 237, 297
Bajcsy, R. 62 Boyer, K. 105, 211, 325
Ball, G. 368 Bozma, O. 386
Ballard, D. 256, 288, 390, 406, Bradshaw, K. 386
416 Brady, J. 260
Baram, Y. 389 Brady, M. 313, 325
Bar-Shalom, Y. 388, 413 Breu, H. 154, 178
Barzohar, M. 386, 407, 411 Brice, C. 212
Batchelor, B. 215 Brooks, M. 105
Baudin, M. 104 Brosnan, T. 414
Baur, C. 414 Bruckstein, A. 201, 258
Belliveau, N. 388 Brunelli, R. 386
Ben-Arie, J. 324 Buhmann, J. 289
Ben-Ezra, M. 64 Bui, T. 215
Beni, G. 368 Bullock, M. 411
Bennett, N. 288 Bunke, H. 212, 296, 415, 416
Berkmann, J. 211 Burbeck, C. 259
Berman, M. 386 Burkhardt, H. 256
Bertrand, G. 178 Burridge, R. 288


Cabrera, J. 204, 212 Cryer, J. 262


Caelli, T. 211 Cukierman, F. 215
Caglioti, V. 386
Califano, A. 256 Dai, X. 387
Canning, J. 322 Darrell, T. 323
Canny, J. 97, 105 Das, H. 388
Carpenter, G. 368 Dasarathy, B. 415
Casasent, D. 389 Daugman, J. 105
Caspi, D. 256 Davis, J. 256
Castleman, K. 42, 99 Davis, T. 212
Chaikin, G. 261 De Mori, R. 381
Chan, K. 324 DeCarlo, D. 323
Chang, W. 143 Deconinck, F. 322
Chaudhuri, B. 180, 212 deFigueiredo, R. 98
Chaurmette, F. 388 Defrise, M. 322
Chelberg, D. 120, 143 Delingette, H. 323
Chellappa, R. 62, 63, 141 Deng, W. 99, 105
Chen, B. 412 DePersia, T. 416
Chen, C. 75, 105, 256 Deriche, M. 105
Chen, G. 416 Derin, H. 139
Chen, J. 197, 212 Desai, U. 141
Chen, M. 62, 381 Dhome, M. 257
Chen, Q. 322 Dhond, U. 48, 63, 401
Chen, R. 159 Diakides, N. 387
Chen, S. 390 Dickinson, S. 238, 323
Chen, T. 323 Dinstein, I. 214
Chen, W. 251, 256 Doerschuk, P. 138, 143, 297
Cheng, Y. 285, 289 Dori, D. 387
Chhabra, A. 256 du Buf, J. 62
Chin, R. 179, 201, 213 Dubin, S. 259
Chin, Y. 412 Dubuisson-Jolly, M. 262, 323
Chirlian, P. 196, 215 Duda, R. 104, 326
Cho, K. 256 Duncan, J. 155, 180
Cho, S. 388 Dunn, D. 63
Choi, T. 261 Dunn, S. 142, 256
Chojnacki, W. 105 Dyer, C. 297
Chou, P. 388
Chow, C. 184 Earnshaw, A. 251, 257
Choy, Y. 388 Eggbert, D. 212
Christmas, W. 270, 273 Ehrich, E. 213
Chuang, J. 256 Ehricke, H. 139
Clausi, D. 62 Elder, J. 120, 140, 273
Clerc, M. 256 Elliot, H. 139, 140
Clifford, P. 140 Elschlager, R. 323
Clowes, M. 273 Ennesser, F. 387
Coggins, J. 259 Esquenzai, S. 388
Cohen, F. 62
Cohen, M. 390 Facchinetti, C. 390
Colardyn, F. 259 Fairchild, S. 411
Conn, R. 259 Falavigna, D. 386
Cooper, D. 63, 386, 407, 411 Fan, C. 250, 387
Corazza, A. 381 Faugeras, P. 322
Cordella, L. 173 Fennema, C. 212
Costa, L. 257 Fermuller, C. 381
Cowan, R. 387 Ferreira, A. 257
Cowart, A. 408 Ferrie, F. 262
Cox, I. 387 Fiddelaers, P. 85, 106, 259

Finlayson, G. 387, 406 Gupta, A. 257


Fischler, M. 257, 323 Gurelli, M. 63
Fisher, R. 212, 215 Gurevich, G. 322
Fitzgibbon, A. 203, 212
Fleet, D. 387 H. Delingette 321
Florack, L. 142 Hadamard, J. 140
Florencio, D. 162 Haddad, Z. 387, 402, 412
Flynn, P. 212, 257 Hammersley, J. 140
Focus of attention 387 Han, S. 213, 237
Ford, G. 324 Han, Y. 142
Francos, J. 212, 214 Hanayama, M. 389
Franke, J. 387 Hand, D. 402
Frei, W. 75, 105 Hand, D. 368
Fritsch, D. 259 Hansen, F. 140
Fu, K. 176, 180 Hanson, A. 215, 391
Fukanaga, K. 326 Hara, Y. 414
Funt, B. 387, 406 Haralick, R. 63, 97, 388
Hart, P. 326
Ganguly, G. 211 Hartley, R. 257
Gawley, D. 105 Havaldar, P. 257
Geiger, D. 130, 137, 140, 257 Hayati, S. 387, 388
Gelatt, C. 141 Hayes, H. 413
Geman, D. 140, 387, 412 Haykin, S. 323
Geman, S. 140 Healey, G. 63, 214, 215, 257, 389
Geyzel, Z. 106 Hebert, M. 321, 323
Ghamasaee, R. 412 Heeger, D. 258
Ghosal, S. 257 Heisterkamp, D. 323
Gibbs, F. 387 Helman, D. 258
Gigus, Z. 297 Hel-Or, Y. 63
Gil, J. 178 Hendriks, E. 214
Gillies, D. 415 Hensel, E. 140
Ginsberg, M. 381 Herrington, D. 106
Gips, J. 390 Higgins, W. 63, 173, 179
Girosi, F. 130, 140 Hildreth, E. 106
Gold, S. 305, 323 Hingorani, S. 387
Golden, R. 323 Hirakawa, Y. 416
Goldgof, B. 236 Hiriyannaiah, H. 140
Goldgof, D. 212, 213, 237, 258, 383, 390 Hirzinger, G. 256
Gong, L. 322, 323, 383 Hoff, L. 416
Gong, W. 178 Hofmann, T. 289
Gonzalez, R. 140, 257, 377, 381 Hoover, A. 205, 212, 243, 258
Goodman, R. 63 Hopfield, J. 140, 316, 323
Goudail, F. 323 Horn, B. 258
Grady, J. 261 Horn, BKP 244
Gray, A. 140 Hou, J. 388
Greenspan, H. 63 Hsu, K. 230, 257
Grest, G. 142 Hu, M. 230, 258
Gretter, R. 381 Hu, W. 214
Grimmer, D. 388 Hu, X. 258
Grimson, W. 387 Huang, C. 154, 179
Grogan, T. 256 Huang, T. 256, 324
Groshong, B. 140 Huang, Y. 64, 387, 413
Gross, A. 257 Hubel, D. 99, 105
Grossberg, S. 140, 368 Huffman, D. 273
Grosso, E. 322, 323, 386 Hummel, R. 274
Gruber, M. 230, 257 Hussain, B. 323
Guo, Y. 261 Hutchinson, S. 213

Ikeuchi, K. 214, 321, 323, 389, 391 Keshava, N. 414


Illingworth, J. 288 Keyserlingk, J. 388
Inasawa, Y. 414 Khotanzad, A. 62
Irani, M. 120, 141 Kim, J. 324
Isard, S. 249, 260 Kimia, B. 214
Ishida, T. 389 Kimmel, R. 201, 258
Ishiguro, H. 389 Kindermann, R. 141
Itti, L. 212 Kirkpatrick, D. 178
Iverson, L. 98, 105 Kirkpatrick, S. 141
Iwahori, Y. 258 Kirsch, R. 381
Iwamoto, T. 323 Kiryati, N. 256, 289
Iyengar, S. 99, 105, 262 Kisworo, M. 105
Kittler, J. 270, 273, 288
Jääski, Y. 289 Klarquist, W. 142
Jackway, P. 105 Ko, M. 256
Jacobs, D. 210, 212, 323 Koch, C. 212
Jaffe, C. 155, 180 Koenderink, J. 142, 213, 273, 296, 297
Jain, A. 63, 184, 215, 262, 322, 323, 324 Kolmogorov, A. 138, 141
Jain, R. 262 Kondepudy, R. 63, 257
JáJá, J. 62, 258 Kong, A. 322
Jean-Baptiste, G. 212 Kopec, G. 388
Jedynak, B. 387, 412 Koster, A. 215
Jepson, A. 120, 139, 258 Köster, K. 213
Jia, X. 388 Kottke, D. 258
Jiang, X. 179, 212 Kotturi, D. 286, 289
Johnson, J. 403, 413 Kriegman, D. 215, 250, 261
Johnson, W. 389 Krig, D. 416
Jonckheere, E. 388 Kropatsch, W. 381
Jones, B. 388 Kryzak, A. 215
Jones, J. 90, 105 Kube, P. 106
Jones, R. 179 Kuc, R. 386
Joseph, E. 105 Kulikowski, C. 322, 323, 383
Jung, H. 215 Kulkarni, S. 213
Kumar, A. 388, 413
Kabuka, M. 323 Kumar, B. 324
Kailath, T. 287, 288 Kumar, S. 213, 237
Kak, A. 141, 142 Kundu, A. 62, 381
Kanade, T. 251, 259, 261, 287, 324, 390 Kuo, C. 61, 63
Kanai, J. 388 Kuruganti, P. 389
Kanatani, K. 212, 258 Kwoh, Y. 388
Kaneko, T. 184 Kyuma, K. 323
Kaneku, S. 416
Kanungo, T. 388 Lagendijk, R. 214
Kaplan, L. 61, 63 Lai, K. 201, 213
Kapoor, S. 141 Lakshmanan, S. 323, 324, 388
Karu, K. 63 Lange, E. 323
Kashyap, R. 141, 390 Langley, K. 387
Katz, A. 388, 413 Lapreste, J. 257
Katzir, N. 212 Latombe, J. 390
Kaul, A. 261 Laurendeau, D. 260
Kauppinen, H. 258 Laurentini, A. 258
Kay, J. 140 LaValle, S. 213
Kehtarnavaz, N. 389 Lavallée, S. 258, 388
Kelly, D. 413 Lawson, W. 413
Kelly, M. 85, 141 Le Roux, P. 388
Kelly, P. 388 Leahy, R. 141
Keren, D. 63, 212 Leavers, V. 289

Leclerc, Y. 324 Martin, R. 388


Ledley, R. 377, 381 Martin, W. 251, 256
Lee, C. 182, 213 Matheny, A. 213, 236, 383
Lee, J. 325 Matheron, G. 159
Lee, K. 213, 388 McLauchlan, P. 64
Lee, R. 141 McLean, G. 286, 289
Lee, T. 106 Medioni, G. 257, 387
Lemahieu, I. 259 Meer, P. 204, 212, 213
Leung, D. 262 Meiri, Z. 212
Leung, Y. 106 Merlet, N. 388, 407, 414
Levialdi, S. 173 Mersereau, R. 414
Levin, K. 142 Messmer, B. 296
Levine, M. 204, 211, 214 Metaxas, D. 238, 323
Lew, M. 324 Michel, J. 259, 389
Li, S. 120, 141, 324 Milgram, D. 191
Li, Z. 261 Miller, B. 416
Liang, P. 288 Miller, T. 143
Liao, S. 259 Mirelli, V. 414
Lin, C. 412 Mirmirani, M. 387
Lin, W. 323 Mirza, M. 211
Lindeberg, T. 106 Mitchell, O. 154, 179
Lindenbaum, M. 212, 324 Mitchell, T. 324
Link, K. 142, 215 Mitter, S. 213
Lipson, H. 274 Miura, J. 389
Lisowski, J. 416 Mohan, R. 256
Liu, F. 207, 213 Mokhtarian, F. 259
Liu, I. 261 Molina, R. 389
Liu, J. 210, 213 Moons, T. 85, 106
Liu, X. 196, 213, 368 Morano, R. 259
Liu, Z. 389 Moravec, H. 390
Logenthiran, A. 142, 213, 214 Moriyama, M. 389
Loizou, G. 389 Morris, P. 141
Lopez-Krahe, J. 288 Morse, B. 259
Lorey, R. 413 Moussouris, J. 141
Lozano-Perez, T. 387 Mundkur, P. 141
Lukacs, G. 388 Murakami, H. 324
Lutton, E. 288 Murase, H. 324
Murray, D. 64, 386
Ma, S. 64
Madigan, D. 388 Nagy, G. 388
Maestre, F. 389 Naito, H. 389
Mahalanobis, A. 413 Najarian, M. 416
Maintz, J. 52, 63 Nakagawa, Y. 259
Maitre, N. 288 Nakajima, Y. 389
Malik, J. 106, 142, 297 Nakano, Y. 414
Malik, R. 141 Namazi, N. 250, 387
Malladi, R. 259 Nandhakumar, N. 64, 251, 256, 259, 387
Mandelbrot, B. 61, 63 Nartker, T. 388
Manhaeghe, C. 259 Nasrabadi, N. 414
Mann, R. 140 Natonek, E. 414
Marapan, S. 63 Navab, N. 64
Marchand, E. 388 Nayar, S. 259
Marchette, D. 413, 416 Nelson, R. 106
Margalit, A. 324 Nelson, T. 141
Marr, D. 64, 106, 141 Nevatia, R. 262
Marroquin, J. 141 Ng, W. 182, 213
Marshall, D. 388 Nguyen, D. 414

Nguyen, T. 224, 259 Poggio, T. 64


Niebur, E. 212 Ponce, J. 215, 297
Nissano, J. 259 Porat, B. 212
Nixon, M. 388 Porat, M. 212
Nogawa, H. 389 Pratt, W. 106, 165, 204
Nordström, N. 137, 142 Priebe, C. 413
Princen, J. 288
O’Gorman, L. 224, 259 Pritch, Y. 64
O’Sullivan, F. 259
Okutomi, M. 286, 324 Qi, H. 142, 389
Olson, C. 324 Qian, M. 259
Olstad, B. 379, 381 Quan, L. 64, 260
Oren, M. 259
Oron, E. 388, 413 Rainey, T. 416
Otsu, N. 323 Rajala, S. 64, 142, 214
Ozturk, C. 259 Ranganath, H. 415
Rangarajan, A. 305, 323
Paglieroni, D. 324 Rao, K. 324
Paiva, C. 414 Ravichandran, G. 389
Pala, P. 322 Ray, M. 321, 325
Palmer, L. 90, 105 Reddi, S. 389
Palmer, P. 325 Reddy, R. 214
Pang, H. 412 Reed, I. 416
Panjwani, D. 214 Refice, M. 274
Pankanti, S. 322, 324 Reid, I. 386
Park, H. 159, 179 Reinhardt, J. 173, 179
Park, R. 213 Reno, A. 415
Parodi, P. 266, 274 Rice, S. 388
Parsi, B. 324 Richardson, T. 213
Parvin, B. 389 Richetin, M. 257
Patras, I. 210, 214 Ringach, D. 389
Patterson, T. 411, 414 Riseman, E. 215, 415
Pauwels, E. 85, 106, 234, 259 Rissanen, J. 324
Pavlidis, T. 105, 142, 143, 389 Rives, G. 257
Pawlak, M. 259 Rivlin, E. 196, 214
Pearlman, W. 214 Robmann, R. 415
Pears, A. 414 Rocha, J. 389
Peer, R. 416 Rogers, G. 413
Peleg, S. 64, 141, 227, 262 Ronfard, R. 261
Pelillo, M. 274 Rosen, P. 214
Penafiel, P. 250, 387 Rosenfeld, A. 107, 142, 274, 324
Peng, C. 389 Rosin, P. 197, 203
Peng, J. 416 Roth, G. 214
Pentland, A. 142, 238, 313, 323, 325 Rothe, I. 230, 260
Perona, P. 99, 106, 142, 251, 260 Rotolo, L. 381
Petrou, M. 270, 273 Rousso, B. 141
Pham, Q. 414
Picard, R. 207, 213 Sadjadi, F. 415
Piccioli, G. 266, 274 Saha, P. 180
Pietikäinen, M. 258 Saito, J. 414
Pikaz, A. 214 Saito, N. 288
Pilu, M. 212 Samson, C. 214
Pissaloux, E. 414 Samy, R. 415
Pitas, I. 179 Sandford, L. 215
Pizer, S. 259 Sandini, G. 260
Plantinga, H. 297 Santago, P. 142, 213, 215, 390
Poelman, C. 259 Sapiro, G. 214

Sapounas, D. 416 Soundararajan, P. 297


Sarkar, N. 212 Spann, M. 213
Sarkar, S. 105, 297, 390 Sriram, R. 214
Sato, Y. 212, 389 Srivastava, A. 415
Satta, G. 381 Stentz, A. 390
Savage, C. 368 Stewart, C. 204, 215
Schafer, R. 162 Stiny, G. 390
Schonfeld, D. 180 Stone, J. 249, 260
Schultz, H. 260 Storvik, G. 261
Schwartz, E. 260 Stotts, L. 416
Schwartz, J. 324 Stuezle, W. 388
Schweitzer, H. 321, 325 Stutz, L. 416
Sclaroff, S. 312, 325 Subbarao, M. 261
Sengupta, K. 325 Subotic, N. 415
Seppänen, T. 258 Subrahmonia, J. 63
Sethian, J. 259, 260 Suen, C. 174, 180, 387, 413
Shah, M. 262 Suen, P. 215
Shamir, J. 256 Suesse, H. 261
Shamos, M. 229, 260 Sull, S. 261
Shapiro, L. 63, 313, 325 Sullivan, S. 215
Sharir, M. 324 Sun, Y. 258, 261
Sharma, R. 415 Super, B. 142, 261
Shashua, A. 48, 64 Süsse, H. 260
Shaw, G. 414 Svalbe, I. 179
Shemlon, S. 142 Swain, M. 390, 406, 416
Shen, X. 325 Swets, D. 325
Sheu, H. 214 Syeda-Mahmood, T. 390
Shi, H. 416 Szeliski, R. 142, 258, 388
Shi, J. 104
Shimshoni, I. 297 Tagare, H. 98, 155, 180
Shpitalni, M. 274 Tagari, H. 383
Shum, H. 214 Tagliasco, V. 260
Siddiqi, K. 214 Takeda, H. 390
Sideman, S. 98 Tam, P. 262
Simanca, S. 387, 402, 412 Tamura, S. 212, 389
Sims, R. 415 Tanaka, H. 416
Sinclair, D. 260 Tang, I. 260
Sklansky, J. 283, 284, 289 Tank, D. 141
Slaney, M. 141 Tannenbaum, A. 214
Slater, D. 389 Tarabanis, K. 261
Smeulders, A. 180, 215 Taratorin, A. 98
Smith, M. 414 Tarroux, P. 211
Smith, P. 61, 64 Tatsuhiro, Y. 180
Smith, S. 260 Taubes, C. 288
Snell, J. 141 Taubin, G. 215, 261
Snyder, W. 64, 106, 140, 142, 143, 204, 211, 213, 214, 256, 260, 322, 368, 387, 408 Taylor, C. 250, 261
ter Haar Romeny, B. 134, 142
Soatto, S. 251, 260 Thoet, W. 416
Sogo, T. 385, 389 Thomas, B. 387
Sohn, K. 324 Thomason, M. 377, 381
Sohn, W. 389 Thomasson, B. 415
Solina, F. 62 Thomopoulos, S. 412
Solka, J. 413, 416 Thorpe, C. 390
Solomon, F. 389 Thrift, P. 388, 413
Soroka, B. 260 Thurber, K. 215
Soucy, M. 260 Tichem, M. 390
Soukoulis, C. 142 Tistarelli, M. 322, 323, 386
Titterington, D. 140 Wang, C. 143, 324, 390
Tomasi, C. 261 Wang, D. 411
Tong, F. 261 Wang, J. 62
Toriwaki, J. 180 Wang, R. 203, 215
Torp, A. 379, 381 Wang, S. 416
Torr, P. 142 Wani, M. 215
Trier, O. 184, 215 Wechsler, H. 283, 284, 289
Trika, S. 390 Wei, G. 64
Trivedi, M. 63, 407, 416 Weiman, C. 261
Trucco, E. 215 Weingard, F. 416
Tsai, C. 256 Weinshall, D. 261, 390
Tsai, H. 104 Weisberg, A. 407, 416
Tsai, P. 262 Weiss, I. 106, 196, 214, 321, 325
Tsai, R. 261 Weiss, R. 391
Tsao, Y. 176, 180 Weng, J. 325, 390
Tsap, L. 390 Werman, D. 178
Tsitsiklis, J. 213 Werman, M. 63, 106, 261
Tsujimoto, E. 324 Werman, W. 390
Turk, M. 325 West, G. 105, 197, 203, 214
Whaite, P. 262
Ubeda, S. 257 Whangbo, T. 141
Uchiyama, T. 215 Wheeler, M. 391
Ueltschi, A. 416 Whitaker, R. 131, 136, 143
Uhurpa, K. 180 Whittaker, W. 390
Wiesel, T. 99, 105
van den Boomgaard, R. 180 Williams, R. 387
Van den Bout, D. 143 Wilson, J. 381
van den Elsen, P. 52, 63 Winter, E. 416
Wintz, P. 140, 257
van der Heijden, F. 106 Witkin, A. 104
van der Hengel, A. 105 Wolberg, G. 143
Van Doorn, A. 212, 296, 297 Wong, K. 324
Van Gool, L. 85, 106, 259 Woodham, R. 258, 262
Van Horn, M. 106 Worley, J. 142
van Laarhoven, J. 142 Worring, M. 215
Van Ness, J. 63 Wu, C. 138, 143, 297
van Vliet, L. 98, 106 Wu, Q. 250, 270, 274, 313
Vanek, P. 257 Wu, Y. 262
Vannier, M. 142
Xu, Z. 106
Vecchi, M. 141 Xue, D. 416
Velten, V. 259, 389
Vemuri, B. 259, 261 Yang, M. 325
Venetsanopoulos, A. 179 Yang, Y. 210, 213
Venkatesh, S. 105 Yaou, M. 143
Verbeek, P. 98, 106 Yassa, M. 388
Viergever, M. 52, 63, 142, 215 Yau, M. 180
Vincken, K. 215 Yau, S. 215
Vlontzos, J. 257 Yi, J. 120, 143
Vogelaers, D. 259 Yip, R. 262
Vos, F. 155, 180 Yoon, S. 324, 325
Voss, K. 260, 261 Young, R. 90, 106
Yu, E. 388
Wakahara, T. 390 Yu, X. 215
Wakeley, J. 63 Yuille, A. 137, 140, 215, 322
Wald, J. 416
Wallace, R. 390 Zabrodsky, H. 227, 262
Wallet, B. 416 Zerroug, M. 262
Zerubia, J. 214, 388, 407, 414 Zhou, J. 381
Zhu, G. 416
Zhang, G. 416 Zhu, P. 196, 215
Zhang, J. 106 Zhu, S. 215, 322
Zhang, R. 245, 262 Zhu, Y. 416
Zhang, T. 174, 180 Zhuang, X. 64
Zhang, Z. 250, 262, 391 Zietz, S. 259
Zheng, J. 262 Zisserman, A. 129, 139
Zhong, Y. 262, 324 Zucker, S. 98, 140, 273
Zhou, C. 416 Zucker, W. 105
Index
3D analysis 261 Bayes’ classifier 337


4-connected 53 Bayes’ risk 338
8-connected 53 Bayes’ rule 23, 329
between-class scatter 348
accepting state 373 Bhattacharyya distance 358, 366
accumulator arrays 278 biased anisotropic diffusion 137
active contours 98, 322 binary 38
active testing 387, 412 binary hypothesis 396
active vision 385 binary morphology 144
affine 196 biquadratic 39, 93
affine transformation 218 black body radiation 248
aircraft 407 blanking 45
albedo 244 block thresholding 183
algebraic distance 202, 203, 237 blueprint, reading of 384
algebraic invariants 389 blur 109
algorithm fusion 405 bootstrapping 204
algorithms, clustering 359 boundary conditions 74
algorithms, performance of 395 brightness 38
angle density 141
anisotropic diffusion 133 camera calibration 48, 64
annealing 121, 127, 133, 206 camera motion 386
annealing schedule 122 Canny edge detector 97
annealing, tree 184 cardinality 294
anti-extensive property 149, 151 Cartesian 58
apparent error rate 398 causality 87
arc length 196 CCD 42, 43
array processor 258 CCD camera calibration 63, 257
aspect 385 center of gravity 225
aspect equivalent 296 central axis 131
aspect graphs 295 centroid tracking 401
aspect ratio 226, 304 cepstrum 61
assembly 384 chain code 159, 230, 369, 376, 377, 409
atoms 291 chain code histogram 410
ATR 367, 392 chaining 192
ATR, performance of 395 chamfer map 165
attention, focus of 387 chamfer-based closing 165
aura 53 character recognition 322, 382, 389
autonomous exploration 385 characteristic equation 15
axes, principal 235 characterization 393
axis of symmetry 226 charge-coupled device 42
Chernoff distance 358, 366
BAD 137 chip 393
bar code 105 circles, finding 280
basis 11, 59, 152 circulant matrix 74
basis functions 75 circular arcs 215
basis set 220 city block distance 357
basis vectors 75, 220 classification 392
classification rate 393 curve code 377
classifier, inner product 342 curve evolution 197
classifier, linear 341 curve segmentation 212
classifier, maximum likelihood 327, 331 curves 196
classifier, quadratic 340 cylinders, generalized 238
clique 291
closing 149 decision boundary 327
cluster 328 decision rule 326, 330, 338
clustering 210, 288, 329, 359 decision rules, linear 327
clustering, agglomerative 359, 366 decomposition 179
clutter 393, 404 deconvolution 109
clutter, cultural 393 deformable contour 201
coffee transform 279 deformable models 321
color constancy 406 deformable template 233, 312, 388, 405
color indexing 382 degeneracy, geometric 388
color information 387 degree 291
color invariants 382 delta function 121
commutative property 148 density 39
compactness 225 density, conditional probability 331
competitive learning 215 depolarization 373
composite 44 derivation 370
computational complexity 279 derivative estimation 69
conditional probability 8 derivatives 67, 68
conditional risk 336, 338 detection 392
conductivity 131 detection rate 393
cone 202 diagrams, Voronoi 383
conformable 12 diameter 225, 227
confuser 404 differentiation 123
conic fitting 212 differentiation filters 106
conic section 203 diffusion 402
conjugate gradient 127, 270, 273 diffusion equation 131
connected 291 digital library 318
connected component 167, 186, 290 dilation 144
connected component labeling 216 dimensionality 399
connected regions 181 Dirac delta 121
connectivity paradox 54 directrix 283
consistency 92, 267, 286 discriminant function 329, 336
consistent labeling 266 disparity 64, 286
constructive solid geometry 293 distance transform 153, 165, 179, 199, 232
context-free 372, 381 distance transform, Euclidian 154
context-sensitive 371 distance, affine-invariant 261
continuation method 116 distance, Mahalanobis 341
contouring 47 distances, Euclidian 219
contrast reversal 400 distortion 49, 108
convex discrepancy 229 dithering 213
convex hull 229 divergence 358
convolution 50, 74 duality 149
convolution and correlation 67 dynamic programming 198
correlation filters 413 dynamic range 46
correspondence 250
correspondence problem 48, 286, 312, 321 ECG 373
covariance 79, 106, 219, 224, 226 edge detection 76, 97
covariance techniques 211 edge extension using relaxation 99
critical point 215 edge strength 69
curvature 60 edges 290
curvature estimates 204 edges in range images, determining 243
curvature-based segmentation 204 edges of roads 386
ego-motion 141 frame time 43
eigenimage 300 frequency response 41
eigenvalues 15, 349 FSM 23
eigenvector 15, 222 function fitting 69
ejection fraction 383 functional 39
electrocardiogram 373 furthest neighbor distance 358
electron–hole 43 fusion 405
ellipse, direct fitting 203
ellipse, finding 288 Gabor function 90
ellipses, fitting 203 Gauss curvature 61
ellipses, least square fitting 212 Gauss map 286
ellipsoid 202, 235 Gaussian 78, 88, 279
ellipsoids, fitting 203 Gaussian, mean of 333
ellipse, finding 285 generalized Hough transform 282
emissivity 400 generalized Hough transform 282
epipolar line 48 genetic algorithm 214, 403
equivalence memory 190 genetic learning 388, 413
equivalence relation 190 geometric flow 234
erosion 146 geometry, curve 234
error rate 398 geometry-based 404
Euclidian distance 219, 304, 356 GNC 129
Euclidian distance transform 154 gradient 13, 97, 279
events 8 gradient descent 16, 18, 127, 206, 270
explicit 202 gradient magnitude 78
explicit representation 39, 71 graduated nonconvexity 116, 129
extensive property 148 grammar 370
external energy 198 graph, directed 290
graphs, aspect 238
face recognition 346, 384 gray-scale 38
facet model 97 gray-scale morphology 152, 179
factor 159 Green’s function 132, 133
false alarm 396 grid 67
false alarm rate 393 ground plane 386
false negative 396 ground truth 396
false positive 396
false positive fraction 397 Haar transform 212
fast Fourier transform 75, 84 handwritten word recognition 380
feature 304 harmonic 236
feature selection 324 Hausdorff distance 410
feature vector 304 heart 383
FIDAC 381 heat equation 132
filters 41 Hessian 14, 135
finite element models 390 hexagonal 57
finite state machine 22, 370, 373 hexagonal derivative 71
Fisher’s linear discriminant 348 hidden Markov model 22, 62, 381
fitting 69, 73 histogram 184, 330
fitting surfaces 202 histogram intersection 406
fitting, straight lines 223 homogeneity predicate 181
fitting, subpixel precision of 225 homogeneous 181
FLIR 394 homogeneous transformation 217
focus, of a parabola 284 homotopy 116
focused filters 407 Hopfield neural network 137
formal language 369 horizontal blanking 45
Fourier descriptors 231, 262 horizontal resolution 46
Fourier transform 41, 83, 207 Hough transform 275, 288, 356, 407
fractal dimension 207, 212 Hough transform, generalized 282
fractal dimension for gray-scale 209 hull, visual 258
human movement 256 label image 181
hyperboloid 235 label memory 186
hyperellipsoid 223 labeling 264
hyperquadric 237 labels 263
Lagrange multiplier 224
iconic 38 language 371
idempotent property 151 Laplace’s equation 120, 236
identification 393 Laplacian 82
IFF 394 Laplacian of Gaussian 88
IFS 29 learning 324
ill-posed 108 learning shape 239
image formation 43, 404 leave-one-out approach 399
image hashing 318 left ventricle 256
image representations 38 Legendre polynomial 236
imaging, infrared 383 level set 200, 202
implicit 202 lexicographic 74
implicit representation 39 library, digital 318
increasing property 148, 151 light sources, multiple 249
independent 8 likelihood ratio 335
indexing 256 likelihood ratio test 340
infrared imagery 389 likelihood, of views 385
inliers 204 line drawing 263
inner product classifier 342 line segments, motion from 250
inside a curve 197 linear 40
inside/outside 40 linear classifier 341
inspection of PC boards 150 linear machines 327
inspection/quality control 383 linear operator 65, 74, 78
integrating several range images 243 linear relaxation 267
intensity gradient 71 linear transformation 12
internal energy 198 linearly independent 11
interpolation 114 linearly separable 327
interpretive tree 387 linked list 291
intersections of edges 243 LISP 291
invariance 239 local/global 275
invariant 203 local/global inference problem 263
invariant moments 304 LOG 88
invariant shape models 261 logarithmic spiral grids 261
invariants, geometric 318 log-polar transformation 239
invariants, thermophysical 389 low-pass 78, 98, 106
inverse problem 108, 114, 404 luminance 38
Ising type 137 LWIR 395
ISODATA 362
isoelectric 373 magnetic resonance imaging 39
isophote 51, 201 Mahalanobis distance 341, 349, 357
major axis 223
Jacobian 14 MAP 121, 198, 201
joint probability 8 MAP estimation 407
junction matching 325 maps, Kohonen 259
margin 349
Kalman Bucy filter 401 Markov 201
kernel 66 Markov chain 20
K–L transform 220, 404 Markov model 20
k-means 362 Markov model, hidden 22
k-neighbors 165 Markov random field 64, 291, 381
Kronecker delta 121, 122 maximum likelihood 327
DT distance 165 mean curvature 61
Kuhn–Tucker conditions 352 mean field approximation 116
measurement system 49 nonterminal symbol 369
medial axis 53, 232 normal vector to an edge 133
median filter 151 normalization 309
Mercer’s conditions 353 normalized central moments 229
metric 219, 358 NP-complete 294
metric function 200 NP-hard 294
microscopy, epi-fluorescence 384 nuclear medicine 204, 383
minimax rule 342
minimization 15 objective function 15, 115, 126
minimum description length 321 objects 263
minimum spanning tree 360 occlusions 401
minimum-squared-error 239 opening 149
MMSE 204 optic flow 120, 250, 256, 313, 321, 386, 387
model 304 optimal structuring elements 180
model-based 405 optimization 405
modeling edges 105 orthogonal 11
molecules, tubular 384 orthogonal functions 11
moments 61, 224, 230 orthogonality 243
moments, central 229 orthonormal 93
morphological filtering 179 orthonormal transformation 13
morphological sampling 162 outliers 204
morphological shape decomposition 173 outside a curve 197
morphological skeleton 171 outward normal direction 197
morphology 144, 408, 409 oversegmentation 205, 306
motion 322, 386
motion analysis 115, 250 P wave 373
motion segmentation 210, 250 panoramic camera 48
MRI 39, 383 panoramic stereo 48
MSE estimate 69 parabola, finding 283
multiple light sources 384 parameters, model 25
multisensor fusion 405 parametric transformation 275
multispectral 382 partial differential equation 131
multispectral ATR 406 parts manufacturing 293
multispectral matching 406 parts, overlapping 384
multivariate Gaussian 79 path 24, 291
multivariate pdf 332 pattern classifier 304
multiview range data 211 pattern recognition 304
MWIR 395 PC board 149
PCA 301
nearest feature line 324 PDE 132
nearest neighbor measure 358 peak, finding 285
neighborhood 53 perimeter 225
neighboring state 24 perspective 62
neural network 137, 403 photometric stereo 245
neuron 137 piecewise-constant 119, 120, 135
Newton–Raphson 17 piecewise-linear 135
Neyman–Pearson 401 piecewise-linear fitting 224
nodes 290, 291, 305 piecewise-planar 120
noise 49 pixels 38
noise estimation 63, 257 planar approximations 120
noise sensitivity 68 plane, fitting 70
noise, counting 204 planning 322, 383
noise, Gaussian 204 polarization 49
noise, non-Gaussian 204 polyhedron 292, 295
noise, Poisson 204 polynomials 39, 202
nonlinear relaxation 267 pose 48, 62
nonmaximum suppression 97 positive definite 13
power spectrum 61 relational representations 42
predicate 306 relaxation 107
prime factor 159 relaxation labeling 266, 305, 313, 321
principal axis 219, 227 restoration 108
principal component 214, 220 reverse engineering 384
principal component analysis 301 ridge seeking 63
principal curvatures 60 ridges 52
prior probabilities 396 roads, finding in images 386
probabilistic 40 ROC 397
probability distribution 9 roof edge 76
probability of error 340 rotation 217, 218
probability, prior 329 rotation, correction for 218
probability, transition 22
production 370 salient 196
projection 12, 75, 76, 220 salient group 210
projective invariants 322 sample mean 333, 334
pseudo-inverse 239 sample variance 334
puns, bad 383 sampled 42
pushdown stack 186 sampling 45
pyramid 85, 402 sampling grid 161
sampling theorem 46, 161
quad tree 86 sampling time 46
quadratic 202 SAR 408
quadratic classifier 340 satellite images 407
quadratic form 13 scale space 85
quadratic variation 83 scale space causality 87
quadric 39, 202, 234 scaling 358
quality, segmentation 205 scanning 43, 192
quantization 46 scatter matrix 70
quantization error 47 scene graph 293
quantized 42 secondary electrons 247
segmentation 115, 120, 181, 322, 402
radar 386 segmentation of surfaces 201
radial basis function 353 segmentation, based on texture 207
radiometric 63, 257 segmentation, psychology of 211
radius of a convolution kernel 71 segmentation, quality of 204
ramp edge 76 segmentation, range 388
range 39, 47 self-organizing 368
range image segmentation 211, 215 sensitivity 397
range images 201, 267 sensor fusion 115, 405
range images, merging of 204 separability 350
range images, registration of 204 set of measure zero 119
raster scanning 43 shape from 243
receiver operating characteristic 397 shape from focus 249
recognition 393 shape from shading 244, 322
recognizer 371 shape from texture 249
recurrent network 137 sharpness 41
recursive region growing 186 shear 218
reflection 226 sigmoid 137
reflection of a set 146 signal-to-clutter 402
reflectivity 245 signature 400
region adjacency graph 292 silhouettes 243, 259, 407
region alignment 141 similarity transformations 217
region growing 182, 322, 402 simple points 180
registration 204 simplicial models 261
regular expression 376 simply connected 159
relabeling 167 simulated annealing 19, 115, 122, 198
singular value decomposition 252 target detection 395
skeleton 171 template matching 179, 299, 313, 405
slope–intercept 275 templates 404
smoothing 386 temporal templates 256
snakes 197 terminal symbol 369
Sobel 78, 94 tessellation 57
Sobel operator 68 test set 399
sonar 386 texture 61, 88, 261, 379
space-invariant 50 texture discrimination 63
spatial 49 texture segmentation 207, 212
spatial derivatives 67 thermal imaging 383
spatial frequency 40, 61 thermophysical invariants 389
spatial vision 106 thinness 225
specificity 397 thinning 97, 171
spectral bands 394 threshold 182
spectral-based 404 thresholding 184
specular 49, 243, 260 tomography 180
specular reflectors 49, 384 tracking 250, 269, 401
speech recognition 26 tracking of molecules 384
speed of a curve 196 tracking roads 407
sphere 202 tracking using deformable templates 233
spherical harmonics 236 training set 304, 326, 328, 348, 399
springs and templates 308 transform, K–L 219
squared residual
stability, of views 385 transformations, orthogonal 216
stable state 385 transition matrix 22
state sequence 23 transition probability 22
statistical mechanics 115 translation 145
statistical pattern classification 305 translation of a set 146
step 76 tree 291
stereo 64, 120, 322 tree annealing 184
stereo matching 324 triangle similarity 226
stereo, patch-based 142 true error rate 398
stereopsis 47, 286, 300, 322, 386 true negative 396
stick growing 106 true positive 396
stochastic grammars 370 true positive fraction 397
stochastic network 381 tubular molecules 384
straight line, fitting of 223 Turing machine 370
striate cortex 105, 260
structure from motion 64, 260 umbra 152
structured illumination 48, 248 uncertainty propagation 325
structured lighting 48 undersegmentation 205
sum of squared differences 286 uniformity 213
superquadric 237, 238 univariate pdf 331
supervised learning 328 unsupervised learning 328
support vector machine 349
surface geometry 211 vanishing points 266, 286
surface harmonic 236 variable conductance diffusion 131
surface normal 244, 247 variance, estimation of 334
surfaces, quadric 234 VCD 131, 136
SWIR 395 vector representations 73
symmetry 226, 243 vector space 12
symmetry distance 227 vector, infinite-dimensional 11
symmetry, mirror 226 vehicle tracking 389
symmetry, rotational 226 ventricle 373
sync 45 ventricle, left 383
syntactic pattern recognition 369 vertex, multihedral 243
vertex, trihedral 243 Waldo, finding 387
vertical blanking 45 watershed 182, 210, 211
vertical resolution 46 wavefront propagation 287
vertices 290 wavelet 101, 142, 143, 400
vidicons 42 weak membrane 129
view likelihood 385 whitening 219
view stability 385 within-class scatter 348
viewing region 296 Wold features 213
viewpoint space partition 296
visual potential 296 Zakai equation 402
Viterbi algorithm 24, 408 zero crossings 88, 98
Voronoi diagram 155, 382 zero set 39
VSP 296 zoom 218