Matching in 2D
This chapter explores how to make and use correspondences between images and maps, images and models, and images and other images. All work is done in two dimensions; methods are extended in Chapter 14 to 3D-2D and 3D-3D matching. There are many immediate applications of 2D matching which do not require the more general 3D analysis.

Consider the problem of taking inventory of land use in a township for the purpose of planning development or collecting taxes. A plane is dispatched on a clear day to take aerial images of all the land in the township. These pictures are then compared to the most recent map of the area to create an updated map. Moreover, other databases are updated to record the presence of buildings, roads, oil wells, etc., or perhaps the kind of crops growing in the various fields. This work can all be done by hand, but is now commonly done using a computer. A second example is from the medical area. The blood flow in a patient's heart and lungs is to be examined. An X-ray image is taken under normal conditions, followed by one taken after the injection into the bloodstream of a special dye. The second image should reveal the blood flow, except that there is a lot of noise due to other body structures such as bone. Subtracting the first image from the second will reduce noise and artifact and emphasize only the changes due to the dye. Before this can be done, however, the first image must be geometrically transformed or warped to compensate for small motions of the body due to body positioning, heart motion, breathing, etc.
Figure 11.1: A mapping between 2D spaces M and I, relating model points [x, y] to image points [r, c]. M may be a model and I an image, but in general any 2D spaces are possible.
1 Definition The mapping from one 2D coordinate space to another as defined in Equation 11.1 is sometimes called a spatial transformation, geometric transformation, or warp (although to some the term warp is reserved for only nonlinear transformations).
The functions g and h create a correspondence between model points [x, y] and image points [r, c] so that a point feature in the model can be located in the image: we assume that the mapping is invertible so that we can go in the other direction using their inverses. Having such mapping functions in the tax record problem allows one to transform property boundaries from a map into an aerial image. The region of the image representing a particular property can then be analyzed for new buildings or crop type, etc. (Currently, the analysis is likely to be done by a human using an interactive graphics workstation.) Having such functions in the medical problem allows the radiologist to analyze the difference image I2[r2, c2] − I1[g(r2, c2), h(r2, c2)]: here the mapping functions register like points in the two images.
2 Definition Image registration is the process by which points of two images from similar
viewpoints of essentially the same scene are geometrically transformed so that corresponding
feature points of the two images have the same coordinates after transformation.
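As a concrete illustration of registration followed by differencing, the sketch below (Python with NumPy; the function names and the choice of nearest-neighbor resampling are our assumptions, not from the text) resamples the first image at the mapped coordinates (g(r, c), h(r, c)) and subtracts it from the second image.

import numpy as np

def difference_after_registration(img1, img2, g, h):
    """Compute img2[r, c] - img1[g(r, c), h(r, c)] with nearest-neighbor sampling.

    g and h map (r, c) coordinates of img2 into coordinates of img1.
    Out-of-bounds samples are treated as 0 (no evidence).
    """
    rows, cols = img2.shape
    rr, cc = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    r1 = np.rint(g(rr, cc)).astype(int)          # row coordinate in img1
    c1 = np.rint(h(rr, cc)).astype(int)          # column coordinate in img1
    inside = (r1 >= 0) & (r1 < img1.shape[0]) & (c1 >= 0) & (c1 < img1.shape[1])
    warped = np.zeros_like(img2, dtype=float)
    warped[inside] = img1[r1[inside], c1[inside]]
    return img2.astype(float) - warped

# Example: register out a known 2-row, 3-column shift before differencing.
if __name__ == "__main__":
    img1 = np.random.rand(100, 100)
    img2 = np.roll(img1, shift=(2, 3), axis=(0, 1))     # img2 is a shifted copy of img1
    diff = difference_after_registration(img1, img2,
                                         g=lambda r, c: r - 2,
                                         h=lambda r, c: c - 3)

With the correct g and h, the difference image is near zero everywhere except where genuine change (in the medical example, the dye) is present.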
Another common and important application, although not actually a matching operation, is creation of a new image by collecting sample pixels from another image. For example, as shown in Figure 11.2, we might want to cut out a single face I2 from an image I1 of several people. Although the content of the new image I2 is a subset of the original image I1, it is possible that I2 can have the same number of pixels as I1 (as is the case in the figure), or even more.
There are several issues of this theory which have practical importance. What is the form of the functions g and h: are they linear, continuous, etc.? Are straight lines in one space mapped into straight or curved lines in the other space? Are the distances between point pairs the same in both spaces? More important, how do we use the properties of different functions to achieve the mappings needed? Is the 2D space of the model or image continuous or discrete? If at least one of the spaces is a digital image then quantization effects will impact both accuracy and visual quality. (The quantization effects have deliberately been kept in the right image of Figure 11.2 to demonstrate this point.)
Figure 11.2: New image of a face (right) cut out of original image (left) using a sampling transformation. The face at the right is the rightmost face of the five at the left.
Exercise 1
Describe how to enhance the right image of Figure 11.2 to lessen the quantization or aliasing effect.
Sometimes we will have need to label a point according to the type of feature from which it
was determined. For example, a point could be the center of a hole, a vertex of a polygon,
or the computed location where two extended line segments intersect. Point type will be
used to advantage in the automatic matching algorithms discussed later in the chapter.
Reference Frames
The coordinates of a point are always relative to some coordinate frame. Often, there
are several coordinate frames needed to analyze an environment, as discussed at the end of
Chapter 2. When multiple coordinate frames are in use, we may use a special superscript
to denote which frame is in use for the coordinates being given for the point.
3 Definition If Pj is some feature point and C is some reference frame, then we denote the coordinates of the point relative to that coordinate system as CPj, with the frame written as a leading superscript.
Homogeneous Coordinates
As will soon become clear, it is often convenient notationally and for computer processing to use homogeneous coordinates for points, especially when affine transformations are used.
4 Definition The homogeneous coordinates of a 2D point P = [x, y]^t are [sx, sy, s]^t, where s is a scale factor, commonly 1.0.
Finally, we need to note the conventions of coordinate systems and programs that display pictures. The coordinate systems in the drawn figures of this chapter are typically plotted as they are in mathematics books, with the first coordinate (x or u or even r) increasing to the right from the origin and the second coordinate (y or v or even c) increasing upward from the origin. However, our image display programs display an image of n rows and m columns with the first row (row r = 0) at the top and the last row (row r = n − 1) at the bottom. Thus r increases from the top downward and c increases from left to right. This presents no problem to our algebra, but may give our intuition trouble at times, since the displayed image needs to be mentally rotated counterclockwise 90 degrees to agree with the conventional orientation in a math book.
(Figure: scaling by a factor of 2 maps the point [1, 2] to [2, 4] and the point [4, 0] to [8, 0].)
Figure 11.4: Rotation of any 2D point in terms of rotation of the basis vectors: point P = [x, y] is rotated by angle θ, while the basis vectors [1, 0] and [0, 1] rotate to [cos θ, sin θ] and [−sin θ, cos θ].
Exercise 3
(a) Sketch the three points [0,0], [2,2], and [0,2] using an XY coordinate system. (b) Scale these points by 0.5 using Equation 11.2 and plot the results. (c) Using a new plot, plot the result of rotating the three points by 90 degrees about the origin using Equation 11.4. (d) Let the scaling matrix be S and the rotation matrix be R. Let SR be the matrix resulting from multiplying matrix S on the left of matrix R. Is there any difference if we transform the set of three points using SR and RS?
* Orthogonal and Orthonormal Transforms
5 Definition A set of vectors is said to be orthogonal if all pairs of vectors in the set are
perpendicular; or equivalently, have scalar product of zero.
6 Definition A set of vectors is said to be orthonormal if it is an orthogonal set and if all
the vectors have unit length.
A rotation preserves both the length of the basis vectors and their orthogonality. This can
be seen both intuitively and algebraically. As a direct result, the distance between any two
transformed points is the same as the distance between the points before transformation.
A rigid transformation has this same property: a rigid transformation is a composition of
rotation and translation. Rigid transformations are commonly used for rigid objects or for
change of coordinate system. A uniform scaling that is not 1.0 does not preserve length;
however, it does preserve the angle between vectors. These issues are important when we
seek properties of objects that are invariant to how they are placed in the scene or how a
camera views them.
Translation
Often, point coordinates need to be shifted by some constant amount, which is equivalent
to changing the origin of the coordinate system. For example, row-column coordinates of
a pixel image might need to be shifted to transform to latitude-longitude coordinates of a
map. Since translation does not map the origin [0, 0] to itself, we cannot model it using
a simple 2x2 matrix as has been done for scaling and rotation: in other words, it is not a
linear operation. We can extend the dimension of our matrix to 3x3 to handle translation as
well as some other operations: accordingly, another coordinate is added to our point vector
to obtain homogeneous coordinates. Typically, the appended coordinate is 1.0, but other
values may sometimes be convenient.
P = [x, y] ≅ [wx, wy, w] = [x, y, 1] for w = 1

The matrix multiplication shown in Equation 11.5 can now be used to model the translation D of point [x, y] so that [x′, y′] = D([x, y]) = [x + x0, y + y0].

\[
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} =
\begin{bmatrix} 1 & 0 & x_0 \\ 0 & 1 & y_0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} =
\begin{bmatrix} x + x_0 \\ y + y_0 \\ 1 \end{bmatrix}
\qquad (11.5)
\]
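A minimal numerical check of Equation 11.5 in Python with NumPy (the function and variable names are ours):

import numpy as np

def translation_matrix(x0, y0):
    """3x3 homogeneous matrix that translates [x, y, 1] by (x0, y0)."""
    return np.array([[1.0, 0.0, x0],
                     [0.0, 1.0, y0],
                     [0.0, 0.0, 1.0]])

D = translation_matrix(x0=5.0, y0=-2.0)
p = np.array([3.0, 4.0, 1.0])        # the point [x, y] in homogeneous form
p_translated = D @ p                 # -> [8.0, 2.0, 1.0]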
Figure 11.5: Image from a square-pixel camera looking vertically down on a workbench: feature points in image coordinates need to be rotated, scaled, and translated to obtain workbench coordinates.
Figure 11.6: (Left) 128x128 digital image of grid; (center) 128x128 image extracted by an affine warp defined by three points in the left image; (right) 128x128 rectified version of part of the center image.
Figure 11.7: Distorted face of Andrew Jackson extracted from a $20 bill by defining an affine mapping with shear; the sampling parallelogram is defined by the user-selected points [x0, y0], [x1, y1], and [x2, y2].
The first form below gives the sampled point [x, y] in terms of the side vectors of the parallelogram; the second is its equivalent in standard form.

\[
\begin{bmatrix} x \\ y \end{bmatrix} =
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix} +
\frac{r}{n} \begin{bmatrix} x_1 - x_0 \\ y_1 - y_0 \end{bmatrix} +
\frac{c}{m} \begin{bmatrix} x_2 - x_0 \\ y_2 - y_0 \end{bmatrix}
\]

\[
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} =
\begin{bmatrix} (x_1 - x_0)/n & (x_2 - x_0)/m & x_0 \\ (y_1 - y_0)/n & (y_2 - y_0)/m & y_0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} r \\ c \\ 1 \end{bmatrix}
\qquad (11.9)
\]
Conceptually, the point [x, y] is defined in terms of the new unit vectors along the new axes defined by the user-selected points. The computed coordinates [x, y] must be rounded to get integer pixel coordinates to access digital image I1. If either x or y is out of bounds, then the output point is set to black, in this case I2[r, c] = 0; otherwise I2[r, c] = I1[x, y]. One can see a black triangle at the upper right of Jackson's head because the sampling parallelogram protrudes above the input image of the $20 bill.
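A sketch of this sampling procedure, implementing Equation 11.9 with nearest-neighbor rounding (Python with NumPy; the function name and the assumption that image1 is indexed as image1[x, y] are ours):

import numpy as np

def extract_parallelogram(image1, p0, p1, p2, n, m):
    """Sample an n x m output image from image1 using Equation 11.9.

    p0, p1, p2 are the user-selected (x, y) points: p0 is the output origin,
    p1 defines the direction sampled along output rows, p2 along output columns.
    Out-of-bounds samples are set to 0 (black).
    """
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    A = np.array([[(x1 - x0) / n, (x2 - x0) / m, x0],
                  [(y1 - y0) / n, (y2 - y0) / m, y0],
                  [0.0,            0.0,          1.0]])
    image2 = np.zeros((n, m), dtype=image1.dtype)
    for r in range(n):
        for c in range(m):
            x, y = (A @ np.array([r, c, 1.0]))[:2]
            xr, yr = int(round(x)), int(round(y))
            if 0 <= xr < image1.shape[0] and 0 <= yr < image1.shape[1]:
                image2[r, c] = image1[xr, yr]    # otherwise it stays black
    return image2

When the parallelogram protrudes outside the input image, as in Figure 11.7, the out-of-bounds samples produce the black region described above.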
Object Recognition/Location Example

Consider the example of computing the transformation matching the model of an object shown at the left in Figure 11.8 to the object in the image shown at the right of the figure. Assume that automatic feature extraction produced only three of the holes in the object. The spatial transformation will map points [x, y] of the model to points [u, v] of the image. Assume that we have a controlled imaging environment and the known scale factor has already been applied to the image coordinates to produce the u-v coordinates shown. Only two image points are needed in order to derive the rotation and translation that will align all model points with corresponding image points. Point locations in the model and image and interpoint distances are shown in Tables 11.1 and 11.2. We will use the hypothesized correspondences (A, H2) and (B, H3) to deduce the transformation. Note that these correspondences are consistent with the known interpoint distances. We will discuss algorithms for making such hypotheses later on in this chapter.
Table 11.1: Model Point Locations and Interpoint Distances (coordinates are for the centers of holes).

point   coordinates   to A   to B   to C   to D   to E
A       (8, 17)         0     12     15     37     21
B       (16, 26)       12      0     12     30     26
C       (23, 16)       15     12      0     22     15
D       (45, 20)       37     30     22      0     30
E       (22, 1)        21     26     15     30      0
Figure 11.8: (Left) Model object and (right) three holes detected in an image.
Table 11.2: Image Point Locations and Interpoint Distances (coordinates are for the centers of holes).

point   coordinates   to H1   to H2   to H3
H1      (31, 9)          0      21      26
H2      (10, 12)        21       0      12
H3      (10, 24)        26      12       0
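The rotation and translation implied by the two hypothesized correspondences can be computed directly. A sketch (Python; the function and variable names are ours) under the stated assumption that scale is known and has already been removed:

import math

def rt_from_two_matches(model_a, model_b, image_a, image_b):
    """Rotation angle and translation aligning two model points with two image points.

    Assumes the scale factor is already 1, so only a rigid
    rotation-plus-translation is recovered.
    """
    ax, ay = model_a
    bx, by = model_b
    ua, va = image_a
    ub, vb = image_b
    # Rotation: difference of the directions of the model and image segments.
    theta = math.atan2(vb - va, ub - ua) - math.atan2(by - ay, bx - ax)
    c, s = math.cos(theta), math.sin(theta)
    # Translation chosen so that the first model point maps exactly onto its image point.
    u0 = ua - (c * ax - s * ay)
    v0 = va - (s * ax + c * ay)
    return theta, (u0, v0)

# Hypothesized correspondences (A, H2) and (B, H3) from Tables 11.1 and 11.2.
theta, (u0, v0) = rt_from_two_matches((8, 17), (16, 26), (10, 12), (10, 24))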
Exercise 8
Construct the matrix for a reflection about the line y = 3 by composing a translation with y0 = 3 followed by a reflection about the x-axis. Verify that the matrix is correct by transforming the three points [1, 1], [1, 0], and [2, 1] and plotting the input and output points.
Exercise 9
Verify that the product of the matrices D_{x0, y0} and D_{−x0, −y0} is the 3x3 identity matrix. Explain why this should be the case.
Exercise 10
Solve Equation 11.17 using the following three pairs of matching control points: ([0,0],[0,0]),
([1,0],[0,2]), ([0,1],[0,2]). Do your computations give the same answer as reasoning about
the transformation using basis vectors?
Exercise 11
Solve Equation 11.17 using the following three pairs of matching control points: ([0,0],[1,2]),
([1,0],[3,2]), ([0,1],[1,4]). Do your computations give the same answer as reasoning about
the transformation using basis vectors?
It is common to use many control points to put an image and map or two images into correspondence. Figure 11.10 shows two images of approximately the same scene. Eleven pairs of matching control points are given at the bottom of the figure. Control points are corners of objects that are uniquely identifiable in both images (or map). In this case, the control points were selected using a display program and mouse. The list of residuals shows that, using the derived transformation matrix, no u or v coordinate in the right image will be off by as much as 2 pixels from the transformed value. Most residuals are less than one pixel. Better results can be obtained by using automatic feature detection, which locates feature points with subpixel accuracy: control point coordinates are often off by one pixel when chosen using a computer mouse and the human eye. Using the derived affine transformation, the right image can be searched for objects known to be in the left image. Thus we have reached the point of understanding how the tax-collector's map can be put into correspondence with an aerial image for the purpose of updating its inventory of objects.
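With many control points, the six affine parameters are estimated by least squares. The sketch below (Python with NumPy; the function names are ours, and this is the usual least-squares formulation rather than the exact Equation 11.17, which is not reproduced in this excerpt) fits the parameters and reports the per-point residuals discussed above.

import numpy as np

def fit_affine(xy, uv):
    """Least-squares affine map [u, v] ~ A @ [x, y, 1] from matched control points.

    xy, uv: arrays of shape (k, 2) with k >= 3 matching points.
    Returns the 2x3 parameter matrix minimizing the sum of squared residuals.
    """
    xy = np.asarray(xy, dtype=float)
    uv = np.asarray(uv, dtype=float)
    X = np.hstack([xy, np.ones((len(xy), 1))])       # k x 3 design matrix
    params, *_ = np.linalg.lstsq(X, uv, rcond=None)  # 3 x 2 solution
    return params.T                                  # 2 x 3 parameter matrix

def residuals(A, xy, uv):
    """Per-point residuals (u, v) after applying the fitted affine map."""
    X = np.hstack([np.asarray(xy, float), np.ones((len(xy), 1))])
    return np.asarray(uv, float) - X @ A.T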
Exercise 12
Take three pairs of matching control points from Figure 11.10 (for example, ([288, 210, 1], [31, 160, 1])) and verify that the affine transformation matrix maps the first into the second.
0.18 -0.68 -1.22 0.47 -0.77 0.06 0.34 -0.51 1.09 0.04 0.96
1.51 -1.04 -0.81 0.05 0.27 0.13 -1.12 0.39 -1.04 -0.12 1.81
Figure 11.10: Images of same scene and best affine mapping from the left image into the right image using 11 control points. [x, y] coordinates for the left image with x increasing downward and y increasing to the right; [u, v] coordinates for the right image with u increasing downward and v toward the right. The 11 clusters of coordinates directly below the images are the matching control points x, y, u, v.
hypothesizes that this object is in the image and uses a verification technique to decide if the hypothesis is correct.

The verification procedure must determine whether there is enough evidence in the image that the hypothesized object is in the scene. For polyhedral objects, the boundary of the object is often used as suitable evidence. The set of feature correspondences is used to determine a possible affine transformation from the model points to the image points. This transformation is then used to transform each line segment of the boundary into the image space. The transformed line segments should approximately line up with image line segments wherever the object is unoccluded. Due to noise in the image and errors in feature extraction and matching, it is unlikely that the transformed line segments will exactly align with image line segments, but a rectangular area about each transformed line segment can be searched for evidence of possibly matching image segments. If sufficient evidence is found, that model segment is marked as verified. If enough model segments are verified, the object is declared to be in the image at the location specified by the computed transformation.
The local-feature-focus algorithm for matching a given model F to an image is given below. The model has a set {F1, F2, ..., FM} of focus features. For each focus feature Fm, there is a set S(Fm) of nearby features that can help to verify this focus feature. The image has a set {G1, G2, ..., GI} of detected image features. For each image feature Gi, there is a set of nearby image features S(Gi).
(Figure: a model with focus features F1-F4 and nearby features E1-E4, and an image with detected features G1-G8.)
(Figure: five junction types used as feature labels: L, Y, T, Arrow, and X junctions.)
Of the 20 possible pairings, only 10 of them have matching types for both points. The transformations computed from each of those are given in Table 11.3. The 10 transformations computed have scattered and inconsistent parameter sets, except for 3 of them, indicated by '*' in the last column of the table. These 3 parameter sets form a cluster whose average parameters are θ = 0.68, s = 2.01, u0 = 233, v0 = −41. While one would like less variance than this for correct matches, this variance is typical due to slight errors in point feature location and small nonlinear distortions in the imaging process. If the parameters of the RST mapping are inaccurate, they can still be used to verify matching points, which can then be used as control points to find a nonlinear mapping or affine mapping (with more parameters) that is more accurate in matching control points.
Pose clustering can work using low-level features; however, both accuracy and efficiency are improved when features can be filtered by type. Clustering can be performed by a simple O(n²) algorithm: for each parameter set, count the number of other parameter sets that are close to it using some permissible distance. This requires n − 1 distance computations for each of the n parameter sets in cluster space. A faster, but less flexible, alternative is to use binning. Binning has been the traditional approach reported in the literature and was discussed in Chapter 10 with respect to the Hough transform. Each parameter set produced is contributed to a bin in parameter space, after which all bins are examined for significant counts. A cluster can be lost when a set of similar parameter sets spreads over neighboring bins.
The clustering approach has been used to detect the presence of a particular model of
airplane from an aerial image, as shown in Figure 11.15. Edge and curvature features are
extracted from the image using the methods of Chapters 5 and 10. Various overlapping
Figure 11.14: Example pose detection problem with 5 model feature point pairs and 4 image feature point pairs.
Table 11.3: Cluster space formed from 10 pose computations from Figure 11.14

Model Pair                 Image Pair                   θ       s      u0     v0
L(170,220), X(100,200)     L(545,400), X(200,120)     0.403   6.10    118   -1240
L(170,220), X(100,200)     L(420,370), X(360,500)     5.14    2.05    -97    514
T(100,100), Y( 40,150)     T(260,240), Y(100,245)     0.663   2.05    225    -48   *
T(100,100), Y( 40,150)     T(140,380), Y(300,380)     3.87    2.05    166    669
L(200,100), X(220,170)     L(545,400), X(200,120)     2.53    6.10   1895    200
L(200,100), X(220,170)     L(420,370), X(360,500)     0.711   1.97    250    -36   *
L(260, 70), X( 40, 70)     L(545,400), X(200,120)     0.682   2.02    226    -41   *
L(260, 70), X( 40, 70)     L(420,370), X(360,500)     5.14    0.651   308    505
T(150,125), Y(150, 50)     T(260,240), Y(100,245)     4.68    2.13      3    568
T(150,125), Y(150, 50)     T(140,380), Y(300,380)     1.57    2.13    407     60
Figure 11.15: Pose-clustering applied to detection of a particular airplane. (a) Aerial image of an airfield; (b) object model in terms of real edges and abstract edges subtending one corner and one curve tip point; (c) image window containing detections that match many model parts via the same transformation.
windows of these features are matched against the model shown in part (b) of the figure. Part (c) of the figure shows the edges detected in one of these windows where many of the features aligned with the model features using the same transformation parameters.
Geometric Hashing

Both the local-feature-focus method and the pose clustering algorithm were designed to match a single model to an image. If several different object models were possible, then these methods would try each model, one at a time. This makes them less suited for problems in which a large number of different objects can appear. Geometric hashing was designed to work with a large database of models. It trades a large amount of offline preprocessing and a large amount of space for a potentially fast online object recognition and pose determination.

Suppose we are given

1. a large database of models

2. an unknown object whose features are extracted from an image and which is known to be an affine transformation of one of the models.

and we wish to determine which model it is and what transformation was applied.
Consider a model M to be an ordered set of feature points. Any subset of three noncollinear points E = {e00, e01, e10} of M can be used to form an affine basis set, which defines a coordinate system on M, as shown in Figure 11.16(a). Once the coordinate system is chosen, any point x ∈ M can be represented in affine coordinates (α, β) where

x = α(e10 − e00) + β(e01 − e00) + e00

Furthermore, if we apply an affine transform T to point x, we get

Tx = α(Te10 − Te00) + β(Te01 − Te00) + Te00

Thus Tx has the same affine coordinates (α, β) with respect to (Te00, Te01, Te10) as x has with respect to (e00, e01, e10). This is illustrated in Figure 11.16(b).
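Computing the affine coordinates of a point with respect to a basis triple amounts to solving a 2x2 linear system. A sketch in Python with NumPy (the function and variable names are ours):

import numpy as np

def affine_coordinates(x, e00, e01, e10):
    """Solve x = alpha*(e10 - e00) + beta*(e01 - e00) + e00 for (alpha, beta)."""
    B = np.column_stack([np.subtract(e10, e00), np.subtract(e01, e00)])  # 2x2 basis matrix
    alpha, beta = np.linalg.solve(B, np.subtract(x, e00))
    return alpha, beta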
Offline Preprocessing: The offline preprocessing step creates a hash table containing all of the models in the database. The hash table is set up so that the pair of affine coordinates (α, β) indexes a bin of the hash table that stores a list of model-basis pairs (M, E) where some point x of model M has affine coordinates (α, β) with respect to basis E. The offline preprocessing algorithm is given below.
Online Recognition: The hash table created in the preprocessing step is used in the online recognition step. The recognition step also uses an accumulator array A indexed by model-basis pairs. The bin for each (M, E) is initialized to zero and used to vote for the hypothesis that there is a transformation T that places (M, E) in the image. Computation of the actual transformations is done only for those model-basis pairs that achieve a high number of votes and is part of the verification step that follows the voting. The online recognition procedure is given below.
Figure 11.16: The affine transformation of a point with respect to an affine basis set: x has the same affine coordinates with respect to (e00, e01, e10) as Tx has with respect to (Te00, Te01, Te10).
procedure GH_Preprocessing(D, H);
{
  for each model M
  {
    Extract the feature point set FM of M;
    for each noncollinear triple E of points from FM
      for each other point x of FM
      {
        Calculate (α, β) for x with respect to E;
        Store (M, E) in hash table H at index (α, β);
      };
  };
}

Algorithm 3: Geometric Hashing Offline Preprocessing
11.6 2D Object Recognition via Relational Matching

We have previously described three useful methods for matching observed image points to model points: these were local-feature-focus, pose clustering, and geometric hashing. In this section, we examine three simple general paradigms for object recognition within the context given in this chapter. All three paradigms view recognition as a mapping from model structures to image structures: a consistent labeling of image features is sought in terms of model features, and recognition is equivalent to mapping a sufficient number of features from a single model to the observed image features. The three paradigms differ in how the mapping is developed.
Four concepts important to the matching paradigms are parts, labels, assignments, and relations.

A part is an object or structure in the scene such as a region segment, edge segment, hole, corner, or blob.

A label is a symbol assigned to a part to identify/recognize it at some level.

An assignment is a mapping from parts to labels. If P1 is a region segment and L1 is the lake symbol and L2 the field symbol, an assignment may include the pair (P1, L2) or perhaps (P1, {L1, L2}) to indicate remaining ambiguity. A pairing (P1, NIL) indicates that P1 has no interpretation in the current label set. An interpretation of the scene is just the set of all pairs making up an assignment.

A relation is the formal mathematical notion. Relations will be discovered and computed among scene objects and will be stored for model objects. For example, R4(P1, P2) might indicate that region P1 is adjacent to region P2.
procedure GH_Recognition(H, A, I);
{
  Initialize accumulator array A to all zeroes;
  Extract feature points from image I;
  for each basis triple F
  {
    for each other point v
    {
      Calculate (α, β) for v with respect to F;
      Retrieve the list L of model-basis pairs from the
        hash table H at index (α, β);
      for each pair (M, E) of L
        A[M, E] = A[M, E] + 1;
    };
    Find the peaks in accumulator array A;
    for each peak (M, E)
    {
      Calculate T such that F = TE;
      if enough of the transformed model points of M find
        evidence on the image then return(T);
    };
  };
}
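The two procedures above translate naturally into a Python dictionary keyed by quantized (α, β). The sketch below is only an illustration of the technique: the function names, the quantization step, and the representation of models as point lists are our assumptions.

from collections import defaultdict
from itertools import permutations
import numpy as np

def affine_coords(x, e00, e01, e10):
    """(alpha, beta) of point x with respect to basis (e00, e01, e10)."""
    B = np.column_stack([e10 - e00, e01 - e00])
    return np.linalg.solve(B, x - e00)

def quantize(ab, step=0.1):
    """Discrete hash-table index for real affine coordinates."""
    return (round(ab[0] / step), round(ab[1] / step))

def build_table(models, step=0.1):
    """Offline: map quantized (alpha, beta) to (model, basis) pairs."""
    table = defaultdict(list)
    for name, pts in models.items():                       # pts: list of (x, y)
        for basis in permutations(range(len(pts)), 3):
            e00, e01, e10 = (np.array(pts[i], float) for i in basis)
            B = np.column_stack([e10 - e00, e01 - e00])
            if abs(np.linalg.det(B)) < 1e-9:
                continue                                   # skip (nearly) collinear triples
            for k, x in enumerate(pts):
                if k not in basis:
                    ab = affine_coords(np.array(x, float), e00, e01, e10)
                    table[quantize(ab, step)].append((name, basis))
    return table

def vote(table, image_pts, basis_idx, step=0.1):
    """Online: vote for (model, basis) pairs using one noncollinear image basis triple."""
    e00, e01, e10 = (np.array(image_pts[i], float) for i in basis_idx)
    votes = defaultdict(int)
    for k, x in enumerate(image_pts):
        if k in basis_idx:
            continue
        ab = affine_coords(np.array(x, float), e00, e01, e10)
        for model_basis in table.get(quantize(ab, step), []):
            votes[model_basis] += 1
    return votes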
Figure 11.17: The geometric hashing algorithm can hallucinate that a given model is present in an image. In this example, 60% of the feature points (left) led to the verified hypothesis of an object (right) that was not actually present in the image.
For another example, we return to the recognition of the kleep object shown in Figure 11.8 and defined in the associated tables. Our matching paradigms will use the distance
(Figure: two line-drawing images; Image 1 contains segments S1-S11 and Image 2 contains segments Sa-Sm.)
P = {S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, S11}.
L = {Sa, Sb, Sc, Sd, Se, Sf, Sg, Sh, Si, Sj, Sk, Sl, Sm}.
RP = { (S1,S2), (S1,S5), (S1,S6), (S2,S3), (S2,S4), (S3,S4), (S3,S9), (S4,S5), (S4,S7), (S4,S11), (S5,S6), (S5,S7), (S5,S11), (S6,S8), (S6,S11), (S7,S9), (S7,S10), (S7,S11), (S8,S10), (S8,S11), (S9,S10) }.
RL = { (Sa,Sb), (Sa,Sj), (Sa,Sn), (Sb,Sc), (Sb,Sd), (Sb,Sn), (Sc,Sd), (Sd,Se), (Sd,Sf), (Sd,Sg), (Se,Sf), (Se,Sg), (Sf,Sg), (Sf,Sl), (Sf,Sm), (Sg,Sh), (Sg,Si), (Sg,Sn), (Sh,Si), (Sh,Sk), (Sh,Sl), (Sh,Sn), (Si,Sj), (Si,Sk), (Si,Sn), (Sj,Sk), (Sk,Sl), (Sl,Sm) }.
"relation" defined for any two points: each pair of holes is related by the distance between them. Distance is invariant to rotations and translations but not scale change. Abusing notation, we write 12(A, B) and 12(B, C) to indicate that points A and B are distance 12 apart in the model, and similarly for points B and C. 12(C, D) does NOT hold, however, as we see from the distance tables. To allow for some distortion or detection error, we might allow that 12(C, D) is true even when the distance between C and D is actually 12 ± ε for some small amount ε.
The Interpretation Tree
11 Definition An interpretation tree (IT) is a tree that represents all possible assignments
of labels to parts. Every path in the tree is terminated either because it represents a complete
consistent assignment, or because the partial assignment it represents fails some relation.
A partial interpretation tree for the image data of Figure 11.8 is shown in Figure 11.19. The tree has three levels, each to assign a label to one of the three holes H1, H2, H3
Figure 11.19: Partial interpretation tree search for a consistent labeling of the kleep parts
in Figure 11.8 (right).
observed in the image. No inconsistencies occur at the first level since there are no distance constraints to check. However, most label possibilities at level 2 can be immediately terminated using one distance check. For example, the partial assignment {(H1, A), (H2, A)} is inconsistent because the relation 21(H1, H2) is violated by 0(A, A). Many paths are not shown due to lack of space. The path of labels denoted by the boxes yields a complete and consistent assignment. The path of labels denoted by ellipses is also consistent; however, it contains one NIL label and thus has fewer constraints to check. This assignment has the first two pairs of the complete (boxed) assignment reversed in labels and the single distance check is consistent. Multiple paths of an IT can succeed due to symmetry. Although the IT potentially contains an exponential number of paths, it has been shown that most paths will terminate by level 3 due to the relational constraints. Use of the label NIL allows for detection of artifacts or the presence of features from another object in the scene.
The IT can easily be developed using a recursive backtracking process that develops paths in a depth-first fashion. At any instantiation of the procedure, the parameter f, which is initially NIL, contains the consistent partial assignment. Whenever a new labeling of a part is consistent with the partial assignment, the algorithm goes deeper in the tree by hypothesizing another label for an unlabeled part; if an inconsistency is detected, then the algorithm backs up to make an alternate choice. As coded, the algorithm returns the first completed path, which may include NIL labels if that label is explicitly included in L.
An improvement would be to return the completed path with the most non-NIL pairs, or perhaps all completed paths.

The recursive interpretation tree search algorithm is given below. It is defined to be general and to handle arbitrary N-ary relations, RP and RL, rather than just binary relations. RP and RL can be single relations, such as the connection relation in our first example, or they can be unions of a number of different relations, such as connection, parallel, and distance.
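The general pseudocode itself is not reproduced in this excerpt. The following sketch (Python; restricted to binary distance relations with a tolerance eps, and with all names ours) conveys the depth-first backtracking structure described above.

def it_search(parts, labels, part_dist, label_dist, eps=1.0, assignment=None):
    """Depth-first interpretation tree search with binary distance constraints.

    parts, labels : lists of part and label names (labels may include None for NIL,
                    listed last so real labels are tried first).
    part_dist, label_dist : dicts mapping frozenset pairs to distances.
    Returns the first complete consistent assignment as a list of (part, label) pairs.
    """
    if assignment is None:
        assignment = []
    if len(assignment) == len(parts):
        return assignment                       # every part has been labeled
    part = parts[len(assignment)]
    for label in labels:
        if consistent(part, label, assignment, part_dist, label_dist, eps):
            result = it_search(parts, labels, part_dist, label_dist, eps,
                               assignment + [(part, label)])
            if result is not None:
                return result
    return None                                 # dead end: backtrack

def consistent(part, label, assignment, part_dist, label_dist, eps):
    """A NIL label is always allowed; otherwise every pairwise distance must agree."""
    if label is None:
        return True
    for prev_part, prev_label in assignment:
        if prev_label is None:
            continue
        if label == prev_label:
            return False                        # design choice: one part per real label
        d_parts = part_dist[frozenset((part, prev_part))]
        d_labels = label_dist[frozenset((label, prev_label))]
        if abs(d_parts - d_labels) > eps:
            return False
    return True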
Exercise 14
Give detailed justification for each of the labels being in or out of each of the label sets after pass 1 as shown in Table 11.5.
Pass 2 deletes label C from L(H2) because the relation 21(H1, H2) can no longer be explained by using D as a label for H1. After pass 3, additional passes cannot change any label set so the process has converged. In this case, the label sets are all singletons
representing a single assignment and interpretation. A high-level sketch of the algorithm is given below. Although a simple and potentially fast procedure, relaxation labeling sometimes leaves more ambiguity in the interpretation than does IT search because constraints are only applied pairwise. Relaxation labeling can be applied as preprocessing for IT search: it can substantially reduce the branching factor of the tree search.
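The high-level procedure itself does not appear in this excerpt; a minimal sketch of discrete relaxation over pairwise distance constraints (Python, with our own names and tolerance, and without NIL handling) is:

def discrete_relaxation(parts, labels, part_dist, label_dist, eps=1.0):
    """Iteratively prune each part's label set until no more labels can be removed.

    A label l survives for part p only if, for every other part q, at least one
    label still allowed for q explains the observed distance between p and q.
    """
    label_sets = {p: set(labels) for p in parts}
    changed = True
    while changed:
        changed = False
        for p in parts:
            for l in list(label_sets[p]):
                if not supported(p, l, parts, label_sets, part_dist, label_dist, eps):
                    label_sets[p].discard(l)
                    changed = True
    return label_sets

def supported(p, l, parts, label_sets, part_dist, label_dist, eps):
    """Check whether label l for part p is compatible with every other part's label set."""
    for q in parts:
        if q == p:
            continue
        d_pq = part_dist[frozenset((p, q))]
        if not any(abs(d_pq - label_dist[frozenset((l, m))]) <= eps
                   for m in label_sets[q] if m != l):
            return False
    return True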
Continuous Relaxation
In exact consistent labeling procedures, such as tree search and discrete relaxation, a label l for a part p is either possible or impossible at any stage of the process. As soon as a part-label pair (p, l) is found to be incompatible with some already instantiated pair, the label l is marked as illegal for part p. This property of calling a label either possible or impossible in the preceding algorithms makes them discrete algorithms. In contrast, we can associate with each part-label pair (p, l) a real number representing the probability or
(Figure: parts p1-p4 and labels l1-l7 relating a model and an image.)
The numerator of the expression allows us to add to the current probability pr_i^k(l) a term that is the product pr_i^k(l) q_i^k(l) of the current probability and the opinions of other related parts, based on the current probabilities of their own possible labels. The denominator normalizes the expression by summing over all possible labels for part i.
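The update equation being described is not reproduced in this excerpt; in the usual Rosenfeld-Hummel-Zucker form, which is our reconstruction of the update from the description above, it reads

\[
pr_i^{k+1}(l) \;=\; \frac{pr_i^k(l)\,\bigl[\,1 + q_i^k(l)\,\bigr]}{\sum_{l'} pr_i^k(l')\,\bigl[\,1 + q_i^k(l')\,\bigr]}
\]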
That is, the relational distance is the minimal total error obtained for any one-one, onto mapping f from A to B. We call a mapping f that minimizes total error a best mapping from DA to DB. If there is more than one best mapping, additional information that is outside the pure relational paradigm can be used to select the preferred mapping. More than one
(Figure 11.21: two digraphs, R with nodes 1-4 and S with nodes a-d.)
best mapping will occur when the relational descriptions involve certain kinds of symmetries.

We illustrate the relational distance with several examples. Figure 11.21 shows two digraphs, each having four nodes. A best mapping from A = {1, 2, 3, 4} to B = {a, b, c, d} is {f(1) = a, f(2) = b, f(3) = c, f(4) = d}. For this mapping we have
\[
\begin{aligned}
|R \circ f - S| &= |\{(1,2),(2,3),(3,4),(4,2)\} \circ f \;-\; \{(a,b),(b,c),(c,b),(d,b)\}| \\
&= |\{(a,b),(b,c),(c,d),(d,b)\} - \{(a,b),(b,c),(c,b),(d,b)\}| \\
&= |\{(c,d)\}| = 1
\end{aligned}
\]

\[
\begin{aligned}
|S \circ f^{-1} - R| &= |\{(a,b),(b,c),(c,b),(d,b)\} \circ f^{-1} \;-\; \{(1,2),(2,3),(3,4),(4,2)\}| \\
&= |\{(1,2),(2,3),(3,2),(4,2)\} - \{(1,2),(2,3),(3,4),(4,2)\}| \\
&= |\{(3,2)\}| = 1
\end{aligned}
\]

\[
E(f) = |R \circ f - S| + |S \circ f^{-1} - R| = 1 + 1 = 2
\]
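A sketch of this computation, with relations as Python sets of tuples and mappings as dictionaries (names ours; the relational distance itself is the minimum of this error over all one-one, onto mappings):

def compose(relation, mapping):
    """Apply mapping f to every tuple of the relation: R o f."""
    return {tuple(mapping[a] for a in t) for t in relation}

def relational_error(R, S, f):
    """E(f) = |R o f - S| + |S o f^-1 - R| for a one-one, onto mapping f."""
    f_inv = {v: k for k, v in f.items()}
    return len(compose(R, f) - S) + len(compose(S, f_inv) - R)

# Digraphs of Figure 11.21 and the best mapping given above.
R = {(1, 2), (2, 3), (3, 4), (4, 2)}
S = {('a', 'b'), ('b', 'c'), ('c', 'b'), ('d', 'b')}
f = {1: 'a', 2: 'b', 3: 'c', 4: 'd'}
print(relational_error(R, S, f))   # expected: 2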
Figure 11.22: Four object models built from line segments with labeled Connection and Parallel relations. The relational distance of model M1 to M2 and M1 to M3 is 1. The relational distance of model M3 to M4 is 6.
now isomorphic, but there is one more connection in M3 than in M2. Again the relational distance is exactly 1.

Finally consider models M3 and M4. The best mapping maps 1″ to 1*, 2″ to 2*, 3″ to 3*, 4″ to 4*, 5d to 5*, and 6d to 6*. (5d and 6d are dummy primitives.) For this mapping we have
Relational Indexing

Sometimes a tree search even with relaxation filtering is too slow, especially when an image is to be compared to a large database of models. For structural descriptions in terms of labeled relations, it is possible to approximate the relational distance with a simpler voting scheme. Intuitively, suppose we observe two concentric circles and two 90 degree corners connected by an edge. We would like to quickly find all models that have these structures and match them in more detail. To achieve this, we can build an index that allows us to look up the models given the partial graph structure. Given two concentric circles, we look up all models containing these related features and give each of those models one vote. Then, we look up all models having connected 90 degree corners: any models repeating from the first test will now have two votes. These lookups can be done rapidly provided that an index is built offline before recognition by extracting significant binary relations from each model and recording each in a lookup table.
Let DB = {M1, M2, ..., Mt} be a database of t object models. Each object model Mt consists of a set of attributed parts PT plus a labeled relation RT. For simplicity of explanation, we will assume that each part has a single label, rather than a vector of attributes, and that the relation is a binary relation, also with a single label attached to each tuple. In this case, a model is represented by a set of two-graphs, each of which is a graph with two nodes and two directed edges. Each node represents a part, and each edge represents a directed binary relationship. The value in the node is the label of the part, rather than a unique identifier. Similarly, the value in an edge is the label of the relationship. For example, one node could represent an ellipse and another could represent a pair of parallel lines. The edge from the parallel lines node to the ellipse node could represent the relationship "encloses", while the edge in the opposite direction represents the relationship "is enclosed by".
Relational indexing requires a preprocessing step in which a large hash table is set up.
The hash table is indexed by a string representation of a two-graph. When it is completed,
one can look up any two-graph in the table and quickly retrieve a list of all models containing
that particular two-graph. In our example, all models containing an ellipse between two
parallel line segments can be retrieved. During recognition of an object from an image, the
Figure 11.23: (Left) Regular grid of lines; (right) grid warped by wrapping it around a cylinder.

Figure 11.24: (Left) Image of center of US$20 bill; (center) image of Andrew Jackson wrapped around a cylinder of circumference 640 pixels; (right) same as center except circumference is 400 pixels.
features are extracted and all the two-graphs representing the image are computed. A set of accumulators, one for each model in the database, are all set to zero. Then each two-graph in the image is used to index the hash table, retrieve the list of associated models, and vote for each one. The discrete version of the algorithm adds one to the vote; a probabilistic algorithm would add a probability value instead. After all two-graphs have voted, the models with the largest numbers of votes are candidates for verification.
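A sketch of the indexing and voting steps (Python dictionaries; the string encoding of a two-graph, the function names, and the model representation are our assumptions):

from collections import defaultdict

def two_graph_key(label1, label2, rel12, rel21):
    """String key for a two-graph: two part labels and the two directed edge labels."""
    return f"{label1}|{label2}|{rel12}|{rel21}"

def build_index(models):
    """Offline: map each two-graph key to the list of models containing it."""
    index = defaultdict(list)
    for name, two_graphs in models.items():      # two_graphs: list of 4-tuples
        for tg in two_graphs:
            index[two_graph_key(*tg)].append(name)
    return index

def vote_for_models(index, image_two_graphs):
    """Online: one vote per model for every image two-graph found in the index."""
    votes = defaultdict(int)
    for tg in image_two_graphs:
        for model in index.get(two_graph_key(*tg), []):
            votes[model] += 1
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)

# Example two-graph: an ellipse enclosed by a pair of parallel lines.
example = ("parallel_lines", "ellipse", "encloses", "is_enclosed_by")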
Figure 11.25: The output image at the right is created by "wrapping" the input image at the left around a cylinder (center): distance d in the input image becomes distance d′ on output.
Exercise 20
(a) Determine the transformation that maps a circular region of an image onto a hemisphere and then projects the hemisphere onto an image. The circular region of the original image is defined by a center (xc, yc) and radius r0. (b) Develop a computer program to carry out the mapping.
Figure 11.26: Two types of radial distortion, barrel (left) and pincushion (center), which can be removed by warping to produce a rectified image (right).
\[
\begin{aligned}
R &= \sqrt{(x - x_c)^2 + (y - y_c)^2} \\
D_r &= (c_2 R^2 + c_4 R^4) \\
x' &= x_c + (x - x_c)\,D_r \\
y' &= y_c + (y - y_c)\,D_r
\end{aligned}
\qquad (11.21)
\]
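A sketch applying Equation 11.21, exactly as reproduced above, to a set of point coordinates (Python with NumPy; the function name and the coefficient values are placeholders, since c2 and c4 must be calibrated for a particular lens):

import numpy as np

def radial_mapping(points, center, c2, c4):
    """Displace each point radially about the image center per Equation 11.21.

    points : array of shape (k, 2) holding (x, y) coordinates.
    center : (xc, yc) center of the radial distortion.
    """
    pts = np.asarray(points, dtype=float)
    offset = pts - np.asarray(center, dtype=float)
    R = np.hypot(offset[:, 0], offset[:, 1])
    Dr = c2 * R**2 + c4 * R**4
    return np.asarray(center, dtype=float) + offset * Dr[:, None]

# Placeholder coefficients for illustration only.
mapped = radial_mapping([[100.0, 80.0], [250.0, 240.0]],
                        center=(128.0, 128.0), c2=1e-5, c4=0.0)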
Polynomial Mappings

Many small global distortion factors can be rectified using polynomial mappings of maximum degree two in two variables as defined in Equation 11.22. Twelve different coefficients must be estimated in order to adapt to the different geometric factors. To estimate these coefficients, we need the coordinates of at least six control points before and after mapping; however, many more points are used in practice. (Each such control point yields two equations.) Note that if only the first three terms are used in Equation 11.22 the mapping is an affine mapping.

\[
\begin{aligned}
u &= a_{00} + a_{10}x + a_{01}y + a_{11}xy + a_{20}x^2 + a_{02}y^2 \\
v &= b_{00} + b_{10}x + b_{01}y + b_{11}xy + b_{20}x^2 + b_{02}y^2
\end{aligned}
\qquad (11.22)
\]
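A least-squares estimate of the twelve coefficients of Equation 11.22 from six or more control-point pairs (Python with NumPy; the function names are ours):

import numpy as np

def fit_polynomial_warp(xy, uv):
    """Fit u and v as degree-2 polynomials in x and y (Equation 11.22).

    xy, uv : arrays of shape (k, 2) of matched control points, k >= 6.
    Returns (a, b): the six u-coefficients and six v-coefficients, each ordered
    to multiply [1, x, y, xy, x^2, y^2].
    """
    xy = np.asarray(xy, dtype=float)
    uv = np.asarray(uv, dtype=float)
    x, y = xy[:, 0], xy[:, 1]
    X = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])
    coeffs, *_ = np.linalg.lstsq(X, uv, rcond=None)   # shape (6, 2)
    return coeffs[:, 0], coeffs[:, 1]

def apply_polynomial_warp(a, b, xy):
    """Map points through the fitted polynomial warp."""
    xy = np.asarray(xy, dtype=float)
    x, y = xy[:, 0], xy[:, 1]
    X = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])
    return np.column_stack([X @ a, X @ b])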
Exercise 21
Show that radial distortion in Equation 11.21 with c4 = 0 can be modeled exactly by a
polynomial mapping of the form shown in Equation 11.22.
11.8 Summary

Multiple concepts have been discussed in this chapter under the theme of 2D matching. One major theme was 2D mapping using transformations. These could be used as simple image processing operations which could extract or sample a region from an image, register two images together in the same coordinate system, or remove or creatively add distortion to 2D images. Algebraic machinery was developed for these transformations and various methods and applications were discussed. This development is continued in Chapter 13 as it relates to mapping points in 3D scenes and 3D models. The second major theme of this chapter was the interpretation of 2D images through correspondences with 2D models. A general paradigm is recognition-by-alignment: the image is interpreted by discovering a model and an RST transformation such that the transformation maps known model structures onto image structures. Several different algorithmic approaches were presented, including pose clustering, interpretation tree search, and the feature focus method. Discrete relaxation and relational matching were also presented: these two methods can be applied in very general contexts even though they were introduced here within the context of constrained geometric relationships. Relational matching is potentially more robust than rigid alignment when the relations themselves are more robust than those depending on metric properties. Image distortions caused by lens distortions, slanted viewing axes, quantification effects, etc., can cause metric relations to fail; however, topological relationships such as cotermination, connectivity, adjacency, and insideness are usually invariant to such distortions. A successful match using topological relationships on image/model parts might then be used to find a large number of matching points, which can then be used to find a mapping function with many parameters that can adapt to the metric distortions. The methods of this chapter are directly applicable to many real-world applications. Subsequent chapters extend the methods to 3D.
11.9 References
The paper by Van Wie and Stein (1977) discusses a system to automatically bring satellite images into registration with a map. An approximate mapping is known using the time at which the image is taken. This mapping is used to do a refined search for control points using templates: the control points are then used to obtain a refined mapping. The book by Wolberg gives a complete account of image warping, including careful treatment of sampling the input image and lessening aliasing by smoothing. The treatment of 2D matching via pose clustering was drawn from Stockman et al (1982), which contained the airplane detection example provided. A more general treatment handling the 3D case is given in Stockman (1987). The paper by Grimson and Lozano-Perez demonstrates how distance constraints can be used in matching model points to observed data points. Least squares fitting is treated very well in other references and is a topic of much depth. Least squares techniques are in common use for the purpose of estimating transformation parameters in the presence of noise by using many more than the minimal number of control points. The book by Wolberg (1990) treats several least squares methods within the context of warping, while the book by Daniel and Wood (1971) treats the general fitting problem. In some problems, it is not possible to find a good geometric transformation that can be applied globally to the entire image. In such cases, the image can be partitioned into a number of regions, each with its own control points, and separate warps can be applied to each region. The warps from neighboring regions must smoothly agree along the boundary between them. The paper by Goshtasby (1988) presents a flexible way of doing this.
Theory and algorithms for the consistent labeling problems can be found in the papers by Haralick and Shapiro (1979, 1980). Both discrete and continuous relaxation are defined in the paper by Rosenfeld, Hummel, and Zucker (1976), and continuous relaxation is further analyzed in the paper by Hummel and Zucker (1983). Methods for matching using structural descriptions have been derived from Shapiro (1981); relational indexing using 2-graphs can be found in Costa and Shapiro (1995). Use of invariant attributes of structural parts for indexing into models can be found in Chen and Stockman (1996).
1. J. L. Chen and G. Stockman, Indexing to 3D Model Aspects using 2D Contour Features, Proc. Int. Conf. Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, June 18-20, 1996. Expanded paper to appear in the journal CVIU.

2. M. Clowes (1971) On seeing things, Artificial Intelligence, Vol. 2, pp. 79-116.

3. M. S. Costa and L. G. Shapiro (1995) Scene Analysis Using Appearance-Based Models and Relational Indexing, IEEE Symposium on Computer Vision, November, 1995, pp. 103-108.

4. C. Daniel and F. Wood (1971) Fitting Equations to Data, John Wiley and Sons, Inc., New York.

5. A. Goshtasby (1988) Image Registration by Local Approximation Methods, Image and Vision Computing, Vol. 6, No. 4 (Nov 1988), pp. 255-261.

6. W. Grimson and T. Lozano-Perez (1984) Model-Based Recognition and Localization From Sparse Range or Tactile Data, Int. Journal of Robotics Research, Vol. 3, No. 3, pp. 3-35.

7. R. Haralick and L. Shapiro (1979) The consistent labeling problem I, IEEE Trans. PAMI-1, pp. 173-184.

8. R. Haralick and L. Shapiro (1980) The consistent labeling problem II, IEEE Trans. PAMI-2, pp. 193-203.

9. R. Hummel and S. Zucker (1983) On the Foundations of Relaxation Labeling Processes, IEEE Trans. PAMI-5, pp. 267-287.

10. Y. Lamdan and H. Wolfson (1988) Geometric Hashing: A General and Efficient Model-Based Recognition Scheme, Proc. 2nd Int. Conf. on Computer Vision, Tarpon Springs, FL (Nov 1988), pp. 238-249.

11. D. Rogers and J. Adams (**) Mathematical Elements for Computer Graphics, 2nd Ed., McGraw-Hill.

12. A. Rosenfeld, R. Hummel, and S. Zucker (1976) Scene Labeling by Relaxation Operators, IEEE Trans. Systems, Man, and Cybernetics, SMC-6, pp. 420-453.

13. G. Stockman, S. Kopstein and S. Benett (1982) Matching Images to Models for Registration and Object Detection via Clustering, IEEE Trans. on PAMI, Vol. PAMI-4, No. 3 (May 1982), pp. 229-241.

14. G. Stockman (1987) Object Recognition and Localization via Pose Clustering, Computer Vision, Graphics and Image Processing, Vol. 40 (1987), pp. 361-387.

15. P. Van Wie and M. Stein (1977) A LANDSAT digital image rectification system, IEEE Trans. Geosci. Electron., Vol. GE-15, July, 1977.

16. P. Winston (1977) Artificial Intelligence, Addison-Wesley.

17. G. Wolberg (1990) Digital Image Warping, IEEE Computer Society Press, Los Alamitos, CA.