A Seminar On Scale Invariant Feature Transform
For image matching and recognition, SIFT features are first extracted from a set of reference images and stored in a database. A new image is matched by individually comparing each feature from the new image to this previous database and finding candidate matching features based on the Euclidean distance of their feature vectors. This paper will discuss fast nearest-neighbour algorithms that can perform this computation rapidly against large databases.

The keypoint descriptors are highly distinctive, which allows a single feature to find its correct match with good probability in a large database of features. However, in a cluttered image, many features from the background will not have any correct match in the database, giving rise to many false matches in addition to the correct ones.

It has been shown by Koenderink (1984) and Lindeberg (1994) that under a variety of reasonable assumptions the only possible scale-space kernel is the Gaussian function. Therefore, the scale space of an image is defined as a function, L(x, y, σ), that is produced from the convolution of a variable-scale Gaussian, G(x, y, σ), with an input image, I(x, y):

L(x, y, σ) = G(x, y, σ) ∗ I(x, y) …………(1)

where ∗ is the convolution operation in x and y, and

G(x, y, σ) = (1/(2πσ²)) e^(−(x² + y²)/(2σ²))
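Equation (1) can be illustrated with standard tools (a minimal sketch, assuming a greyscale image held as a NumPy array; `scipy.ndimage.gaussian_filter` stands in for an explicit convolution with G):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space(image, sigmas):
    """L(x, y, sigma) = G(x, y, sigma) * I(x, y) for each requested sigma."""
    image = image.astype(np.float64)
    return [gaussian_filter(image, sigma) for sigma in sigmas]

# Example: three increasingly blurred versions of a test image
I = np.random.rand(64, 64)
levels = scale_space(I, [1.6, 2.26, 3.2])
```

Larger σ removes finer image structure, which is what makes the family L(x, y, σ) a scale space.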
Figure 1: For each octave of scale space, the initial
image is repeatedly convolved with Gaussians to
produce the set of scale space images shown on the
left. Adjacent Gaussian images are subtracted to
produce the difference-of-Gaussian images on the
right. After each octave, the Gaussian image is down-
sampled by a factor of 2, and the process repeated.
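The octave scheme of Figure 1 might be sketched as follows (an illustrative construction; the number of intervals per octave and the base σ = 1.6 are assumed values, not stated in the caption):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_pyramid(image, n_octaves=3, s=2, sigma=1.6):
    """Difference-of-Gaussian images, octave by octave, as in Figure 1."""
    k = 2 ** (1.0 / s)  # multiplicative scale step between adjacent images
    octaves = []
    base = image.astype(np.float64)
    for _ in range(n_octaves):
        gaussians = [gaussian_filter(base, sigma * k ** i) for i in range(s + 3)]
        # Adjacent Gaussian images are subtracted to give the DoG stack
        octaves.append([g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])])
        # Down-sample by a factor of 2 for the next octave (simplified here:
        # the input image is resampled rather than a blurred Gaussian image)
        base = base[::2, ::2]
    return octaves

pyr = dog_pyramid(np.random.rand(128, 128))
```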
of the maximum, and his experiments showed that this provides a substantial improvement to matching and stability. His approach uses the Taylor expansion (up to the quadratic terms) of the scale-space function, D(x, y, σ), shifted so that the origin is at the sample point:

D(x) = D + (∂Dᵀ/∂x) x + (1/2) xᵀ (∂²D/∂x²) x …………(7)

where D and its derivatives are evaluated at the sample point and x = (x, y, σ)ᵀ is the offset from this point. The location of the extremum, x̂, is determined by taking the derivative of this function with respect to x and setting it to zero, giving

x̂ = −(∂²D/∂x²)⁻¹ (∂D/∂x) …………(8)

The resulting 3x3 linear system can be solved with minimal cost. If the offset x̂ is larger than 0.5 in any dimension, then it means that the extremum lies closer to a different sample point. In this case, the sample point is changed and the interpolation performed instead about that point. The final offset x̂ is added to the location of its sample point to get the interpolated estimate for the location of the extremum.

The function value at the extremum, D(x̂), is useful for rejecting unstable extrema with low contrast. This can be obtained by substituting equation (8) into (7), giving

D(x̂) = D + (1/2) (∂Dᵀ/∂x) x̂ …………(9)

For the experiments in this paper, all extrema with a value of |D(x̂)| less than 0.03 were discarded (as before, we assume image pixel values in the range [0,1]).
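The refinement in equations (7)–(9) amounts to a single 3x3 linear solve per candidate keypoint. A minimal sketch (illustrative only; in practice the gradient g and Hessian H of D are estimated with finite differences over the DoG stack):

```python
import numpy as np

def refine_extremum(D_value, g, H, contrast_threshold=0.03):
    """Solve eq. (8) for the offset x_hat and apply the eq. (9) contrast test.

    D_value: D at the sample point; g: gradient dD/dx (3-vector);
    H: Hessian d2D/dx2 (3x3 matrix), all in (x, y, sigma) coordinates.
    Returns (x_hat, D_hat, keep).
    """
    x_hat = -np.linalg.solve(H, g)          # eq. (8)
    D_hat = D_value + 0.5 * g.dot(x_hat)    # eq. (9)
    # If the offset exceeds 0.5 in any dimension, the extremum lies closer
    # to a neighbouring sample point, and the fit should be redone there.
    needs_shift = np.any(np.abs(x_hat) > 0.5)
    keep = (not needs_shift) and abs(D_hat) >= contrast_threshold
    return x_hat, D_hat, keep

x_hat, D_hat, keep = refine_extremum(0.1, np.array([0.02, -0.01, 0.0]),
                                     -0.5 * np.eye(3))
```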
Figure 3 shows the effects of keypoint
selection on a natural image. In order to avoid too
much clutter, a low-resolution 233 by 189 pixel
image is used and keypoints are shown as vectors
giving the location, scale, and orientation of each
keypoint (orientation assignment is described below).
Figure 3 (a) shows the original image, which is
shown at reduced contrast behind the subsequent
figures. Figure 3 (b) shows the 832 keypoints at all
detected maxima and minima of the difference-of-
Gaussian function, while (c) shows the 729 keypoints
that remain following removal of those with a value
of |D(x̂)| less than 0.03. Part (d) will be explained in
the following section.
4. Orientation assignment
so that all computations are performed in a scale-invariant manner. For each image sample, L(x, y), at this scale, the gradient magnitude, m(x, y), and orientation, θ(x, y), are precomputed using pixel differences:

m(x, y) = √( (L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))² )
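These pixel differences translate directly into array operations (an illustrative NumPy sketch; the arctan-based orientation is our assumption of the standard SIFT formulation, since only the magnitude formula appears above):

```python
import numpy as np

def gradient_mag_ori(L):
    """Gradient magnitude m(x, y) and orientation theta(x, y) from pixel
    differences of the Gaussian-smoothed image L (interior pixels only)."""
    dx = L[1:-1, 2:] - L[1:-1, :-2]   # L(x+1, y) - L(x-1, y)
    dy = L[2:, 1:-1] - L[:-2, 1:-1]   # L(x, y+1) - L(x, y-1)
    m = np.sqrt(dx ** 2 + dy ** 2)
    # Orientation via atan2 of the same differences (assumed formulation)
    theta = np.arctan2(dy, dx)
    return m, theta

m, theta = gradient_mag_ori(np.random.rand(32, 32))
```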
animal shapes which show that matching gradients while allowing for shifts in their position results in much better classification under 3D rotation. For example, recognition accuracy for 3D objects rotated in depth by 20 degrees increased from 35% for correlation of gradients to 94% using the complex cell model. Our implementation described below was inspired by this idea, but allows for positional shift using a different computational mechanism.

Figure 5: A keypoint descriptor is created by first computing the gradient magnitude and orientation at each image sample point in a region around the keypoint location, as shown on the left. These are weighted by a Gaussian window, indicated by the overlaid circle. These samples are then accumulated into orientation histograms summarizing the contents over 4x4 subregions, as shown on the right, with the length of each arrow corresponding to the sum of the gradient magnitudes near that direction within the region. This figure shows a 2x2 descriptor array computed from an 8x8 set of samples, whereas the experiments in this paper use 4x4 descriptors computed from a 16x16 sample array.

5.1 Descriptor representation

Figure 5 illustrates the computation of the keypoint descriptor. First the image gradient magnitudes and orientations are sampled around the keypoint location, using the scale of the keypoint to select the level of Gaussian blur for the image. In order to achieve orientation invariance, the coordinates of the descriptor and the gradient orientations are rotated relative to the keypoint orientation. For efficiency, the gradients are precomputed for all levels of the pyramid as described in Section 5. These are illustrated with small arrows at each sample location on the left side of Figure 5.

A Gaussian weighting function with σ equal to one half the width of the descriptor window is used to assign a weight to the magnitude of each sample point. This is illustrated with a circular window on the left side of Figure 5, although, of course, the weight falls off smoothly. The purpose of this Gaussian window is to avoid sudden changes in the descriptor with small changes in the position of the window, and to give less emphasis to gradients that are far from the centre of the descriptor, as these are most affected by misregistration errors.

The keypoint descriptor is shown on the right side of Figure 5. It allows for significant shift in gradient positions by creating orientation histograms over 4x4 sample regions. The figure shows eight directions for each orientation histogram, with the length of each arrow corresponding to the magnitude of that histogram entry. A gradient sample on the left can shift up to 4 sample positions while still contributing to the same histogram on the right, thereby achieving the objective of allowing for larger local positional shifts.

It is important to avoid all boundary effects in which the descriptor abruptly changes as a sample shifts smoothly from being within one histogram to another or from one orientation to another. Therefore, trilinear interpolation is used to distribute the value of each gradient sample into adjacent histogram bins. In other words, each entry into a bin is multiplied by a weight of 1−d for each dimension, where d is the distance of the sample from the central value of the bin as measured in units of the histogram bin spacing.

The descriptor is formed from a vector containing the values of all the orientation histogram entries, corresponding to the lengths of the arrows on the right side of Figure 5. The figure shows a 2x2 array of orientation histograms, whereas our experiments below show that the best results are achieved with a 4x4 array of histograms with 8 orientation bins in each. Therefore, the experiments in this paper use a 4x4x8 = 128 element feature vector for each keypoint.

Finally, the feature vector is modified to reduce the effects of illumination change. First, the vector is normalized to unit length. A change in image contrast in which each pixel value is multiplied by a constant will multiply gradients by the same constant, so this contrast change will be cancelled by vector normalization. A brightness change in which a constant is added to each image pixel will not affect the gradient values, as they are computed from pixel differences. Therefore, the descriptor is invariant to affine changes in illumination.

However, non-linear illumination changes can also occur due to camera saturation or due to illumination changes that affect 3D surfaces with differing orientations by different amounts. These effects can cause a large change in relative magnitudes for some gradients, but are less likely to affect the gradient
orientations. Therefore, we reduce the influence of
large gradient magnitudes by thresholding the values
in the unit feature vector to each be no larger than
0.2, and then renormalizing to unit length. This
means that matching the magnitudes for large
gradients is no longer as important, and that the
distribution of orientations has greater emphasis. The
value of 0.2 was determined experimentally using
images containing differing illuminations for the
same 3D objects.
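The 1−d bin weighting and the normalize, threshold, renormalize sequence described above can be sketched as follows (illustrative helper functions of our own naming; the sample-to-bin geometry is simplified):

```python
import numpy as np

def trilinear_weight(d_row, d_col, d_ori):
    """Weight of a gradient sample for one histogram bin: the product of
    (1 - d) over the three dimensions, where each d is the distance from
    the bin centre in units of the bin spacing (Section 5.1)."""
    return (1.0 - d_row) * (1.0 - d_col) * (1.0 - d_ori)

def finalize_descriptor(hist, clip=0.2):
    """Normalize to unit length, clip entries at 0.2 to reduce the
    influence of large gradient magnitudes, then renormalize."""
    v = hist.ravel().astype(np.float64)
    v /= np.linalg.norm(v)
    v = np.minimum(v, clip)   # emphasize orientation distribution instead
    v /= np.linalg.norm(v)
    return v

# A 4x4 array of 8-bin histograms gives the 128-element feature vector
desc = finalize_descriptor(np.random.rand(4, 4, 8))
```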
Other potential applications include view matching
for 3D reconstruction, motion tracking and
segmentation, robot localization, image panorama
assembly, epipolar calibration, and any others that
require identification of matching locations between
images.
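All of these applications reduce to the nearest-neighbour search on Euclidean distance between 128-element descriptors described in the introduction. A brute-force sketch (illustrative only; real systems would use the fast approximate nearest-neighbour algorithms discussed earlier):

```python
import numpy as np

def match_features(query, database):
    """For each query descriptor, return the index of its Euclidean
    nearest neighbour in the database (brute force, O(n * m))."""
    # Pairwise distances between (n, 128) queries and (m, 128) database rows
    d = np.linalg.norm(query[:, None, :] - database[None, :, :], axis=2)
    return d.argmin(axis=1)

db = np.random.rand(100, 128)
q = db[[3, 42]] + 0.001 * np.random.rand(2, 128)  # near-copies of two entries
matches = match_features(q, db)
```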
There are many directions for further
research in deriving invariant and distinctive image
features. Systematic testing is needed on data sets
with full 3D viewpoint and illumination changes. The
features described in this paper use only a
monochrome intensity image, so further
distinctiveness could be derived from including
illumination-invariant colour descriptors (Funt and
Finlayson, 1995; Brown and Lowe, 2002). Similarly,
local texture measures appear to play an important
role in human vision and could be incorporated into
feature descriptors in a more general form than the
single spatial frequency used by the current
descriptors. An attractive aspect of the invariant local
feature approach to matching is that there is no need
to select just one feature type, and the best results are
likely to be obtained by using many different
features, all of which can contribute useful matches
and improve overall robustness.
8. References