SIFT: Scale Invariant Feature Transform
ACKNOWLEDGEMENT
The success of any task depends largely on the encouragement and guidance of
many others. I take this opportunity to express my gratitude to the people who have
been instrumental in the successful completion of this technical seminar.
I thank Dr. J Surya Prasad, Director/Principal of PESIT BSC, not only for
providing excellent facilities but also for offering his unending encouragement,
which has made this technical seminar a success.
ABSTRACT
Here we present an algorithm for extracting distinctive invariant features from
images, which can be used to perform reliable matching between different views of
an object or scene. The features, or key points, are invariant to image scale and
rotation, and provide robust matching across a substantial range of affine
distortion, change in 3D viewpoint, addition of noise, and change in illumination.
The key points are highly distinctive, in the sense that a single feature can be
correctly matched with high probability against a large database of key points from
many images. We also describe an approach to using these features for
object recognition.
TABLE OF CONTENTS
1 Introduction
2 Related Research
3 Scale-Space Extrema Detection
4 Keypoint Localization
5 Orientation Assignment
6 Keypoint Descriptor
7 Conclusion
CHAPTER1
INTRODUCTION
Here we describe image key points that have many properties that make them
suitable for matching different images of an object or scene. The key points are
invariant to image scaling and rotation, and partially invariant to change in
illumination and 3D camera viewpoint. An important characteristic of these
features is that the relative positions between them in the original scene should not
change from one image to another.
They are well localized in both the spatial and frequency domains, reducing the
probability of disruption by noise. Large numbers of features can be extracted from
images with efficient algorithms. In addition, the key points are distinctive, which
allows a single feature to be correctly matched against a large database of features,
providing a basis for object recognition.
Scale Space Detection: Here we search for key points over all scales and image
locations. It is implemented efficiently by using a difference-of-Gaussian function to
identify candidate key points that are invariant to scale and orientation.
Keypoint localization: At each candidate location, a detailed model is fit to the
nearby data to determine location and scale. Key points are selected based on
measures of their stability.
Orientation assignment: One or more orientations are assigned to each key point
location based on local image gradient directions. All future operations on the
image are performed on image data that has been transformed relative to the
assigned orientation, scale, and location of each feature, thereby providing
invariance to these transformations.
Keypoint descriptor: The local image gradients are measured at the selected
scale in the region around each key point. These are transformed into a
representation that allows for significant levels of local shape distortion and change
in illumination.
This approach has been named the Scale Invariant Feature Transform (SIFT),
as it transforms image data into scale-invariant coordinates relative to local features.
CHAPTER2
RELATED RESEARCH
The development of image matching by using a set of local interest points can be
traced back to the work of Moravec (1981) on stereo matching using a corner detector.
The detector was enhanced by Harris and Stephens (1988) to make it more precise
under small image variations. Harris also achieved efficient motion tracking and 3D
structure from motion recovery (Harris, 1992), and the Harris corner detector has
since been widely used for image matching. These detectors are usually called
corner detectors, but they select not just corners, rather any image location that
has large gradients in all directions at a predetermined scale.
The initial applications were restricted to stereo and motion tracking, but later the
approach was extended further. Zhang et al. (1995) showed that it is possible to
match Harris corners over a large image range by using a correlation window around
each corner to select likely matches. Outliers were then removed by solving for a
fundamental matrix describing the geometric constraints between the two views of
a rigid scene and removing matches that did not agree with the majority solution.
There has been impressive work on extending local features to be invariant to full
affine transformations (Baumberg, 2000; Tuytelaars and Van Gool, 2000;
Mikolajczyk and Schmid, 2002; Schaffalitzky and Zisserman, 2002; Brown and
Lowe, 2002). This allows matching of invariant features on a planar surface under changes in
orthographic 3D projection, in most cases by resampling the image. However, none
of these approaches are yet fully affine invariant, as they start with initial feature
scales and locations selected in a non-affine-invariant manner due to the prohibitive
cost of exploring the full affine space.
CHAPTER3
SCALE-SPACE EXTREMA DETECTION
Under a variety of reasonable assumptions the only possible scale-space kernel is the
Gaussian function. Therefore, the scale space of an image is defined as a function,
L(x, y, σ), that is produced from the convolution of a variable-scale
Gaussian, G(x, y, σ), with an input image, I(x, y):
G(x, y, σ) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))    (1)
There are several reasons for choosing this function to compute the
difference-of-Gaussian (DoG) function, D(x, y, σ), which is formed from the
difference of two nearby scales separated by a constant multiplicative factor k:

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y) = L(x, y, kσ) − L(x, y, σ)

First, it is a particularly efficient function to compute, as the smoothed images,
L, need to be computed in any case for scale-space feature description, and D can
therefore be calculated using simple image subtraction.
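As a minimal sketch of this subtraction, assuming NumPy and SciPy are available
(the function name is illustrative; the default k = √2 follows the value discussed
later in this chapter):

import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussian(image, sigma, k=np.sqrt(2)):
    """Compute D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma).

    Each smoothed image L is the input convolved with a Gaussian,
    so D reduces to a simple image subtraction."""
    img = image.astype(np.float64)
    L_low = gaussian_filter(img, sigma)       # L(x, y, sigma)
    L_high = gaussian_filter(img, k * sigma)  # L(x, y, k*sigma)
    return L_high - L_low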
The relationship between the difference-of-Gaussian and the scale-normalized
Laplacian of Gaussian (LoG), σ²∇²G, can be understood from the heat diffusion
equation:

∂G/∂σ = σ∇²G

From this, σ∇²G can be computed from the finite difference approximation to
∂G/∂σ, using the difference of nearby scales at kσ and σ:

σ∇²G = ∂G/∂σ ≈ (G(x, y, kσ) − G(x, y, σ)) / (kσ − σ)

therefore,

G(x, y, kσ) − G(x, y, σ) ≈ (k − 1) σ²∇²G

This shows that when the DoG function has scales differing by a constant factor,
it already incorporates the σ² scale normalization required for scale invariance.
The factor (k − 1) is constant over all scales and therefore does not influence
extrema location. The approximation error goes to zero as k goes to 1, but in
practice we have found that the approximation has almost no effect on the
stability of extrema detection or localization, even for significant differences in
scale such as k = √2.
Once a complete octave has been processed, the input image is resampled to half
its resolution by taking the Gaussian image that has twice the initial value of σ
and keeping every second pixel in each row and column. The accuracy of sampling
relative to σ is no different from that of the previous octave, while the
computation is greatly reduced.
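The octave construction can be sketched as follows. This is illustrative only: the
base smoothing sigma = 1.6 is an assumption (it is not specified in the text),
while the 3 scales per octave and the halving of resolution follow the description
above.

import numpy as np
from scipy.ndimage import gaussian_filter

def build_dog_pyramid(image, n_octaves=4, s=3, sigma=1.6):
    """Build difference-of-Gaussian octaves. Within an octave the blur
    grows by k = 2^(1/s) per level; s+3 Gaussian images are kept so
    that extrema can be detected at s scales per octave."""
    k = 2.0 ** (1.0 / s)
    base = gaussian_filter(image.astype(np.float64), sigma)
    dog_octaves = []
    for _ in range(n_octaves):
        levels = [base]
        for i in range(1, s + 3):
            # Incremental blur so that level i has total blur sigma * k**i.
            extra = sigma * np.sqrt(k ** (2 * i) - k ** (2 * (i - 1)))
            levels.append(gaussian_filter(levels[-1], extra))
        dog_octaves.append([b - a for a, b in zip(levels[:-1], levels[1:])])
        # Resample the image with twice the initial blur by taking
        # every second pixel in each row and column.
        base = levels[s][::2, ::2]
    return dog_octaves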
To detect the local maxima and minima of D(x, y, σ), each sample point is
compared to its eight neighbors in the current image and its nine neighbors in the
scales above and below, as shown in fig 3.2. The sample point is selected only if it
is larger than all of these neighbors or smaller than all of them. This method keeps
the cost reasonably low, since most sample points are eliminated within the first
few checks.
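A direct, unoptimized sketch of this comparison, assuming dog is a list of
adjacent DoG images of equal size (a real implementation would stop as soon as
one comparison fails, which is what keeps the cost low in practice):

import numpy as np

def is_extremum(dog, s, x, y):
    """Return True if dog[s][x, y] is larger than all 26 neighbours
    (8 in the current image, 9 each in the scales above and below)
    or smaller than all of them."""
    value = dog[s][x, y]
    # Gather the 3x3x3 neighbourhood across three adjacent scales.
    cube = np.stack([dog[s + ds][x - 1:x + 2, y - 1:y + 2]
                     for ds in (-1, 0, 1)]).ravel()
    neighbours = np.delete(cube, 13)  # drop the centre sample itself
    return value > neighbours.max() or value < neighbours.min()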
An important issue is to figure out at what frequency we should sample the image
and scale domains to reliably detect the extrema. Unfortunately, there is no
minimum spacing of samples that will detect all extrema, as the extrema can be
arbitrarily close together. This can be seen by considering a white circle on a black
background, which will have a single scale-space maximum where the circular
positive central region of the difference-of-Gaussian function matches the size and
location of the circle.
The stability of extrema under sampling can be seen in figs 3.3 and 3.4. These
figures are based on a matching task using a collection of 32 images of various
kinds: outdoor scenes, human faces, aerial photographs, and industrial images of
differing contrast. Each image was then subjected to a range of transformations,
including image rotation, scaling, affine transform, change in brightness and
contrast, and addition of noise. Because the changes were synthetic, it was
possible to predict where each feature in the original image should appear in the
transformed image, allowing for measurement of correct repeatability and
positional accuracy for each feature.
Fig 3.3 shows the simulation results used to examine the effect of varying the
number of scales per octave at which the image function is sampled prior to
extrema detection. In this case, each image was resampled following rotation by a
random angle and random scaling between 0.2 and 0.9 times the original image
size. Key points from the reduced resolution image were matched against those
from the original image.
The top line in fig 3.3 shows the percentage of key points that were detected at a
matching location and scale in the transformed image. Here, we define a matching
scale as being within a factor of √2 of the correct scale, and a matching location as
being within σ pixels, where σ is the scale of the key point. The lower line on this
graph shows the number of key points that are correctly matched to a database of
40,000 key points using nearest-neighbor matching (this shows that once a key
point is repeatably located, it is likely to be useful for recognition and matching
tasks).
It is surprising that the repeatability does not continue to improve as we increase
the number of scales per octave. This is due to the fact that as we increase the
number of scales, many more local extrema are detected that are unstable and
have low contrast, as shown in fig 3.4. This figure shows the average number of
key points correctly detected in each image.
The number of keypoints increases with increased sampling of scales and the total
number of correct matches also rises. Since the success of object recognition often
depends more on the quantity of correctly matched keypoints than on their
percentage of correct matches, for many applications it will be optimal to use a
larger number of scale samples. However, the cost of computation also rises with
the number of scales, so in practice we choose to use just 3 scale samples per
octave.
CHAPTER4
KEYPOINT LOCALIZATION
After a keypoint candidate has been found by comparing a pixel to its neighbors,
the next step is to perform a detailed fit to the nearby data for location, scale, and
ratio of principal curvatures. This allows points to be rejected that have low
contrast (and are therefore sensitive to noise) or are poorly localized along an edge.
Initially, keypoints were simply located at the location and scale of the central
sample point. However, Brown developed a method for fitting a 3D quadratic
function to the local sample points to determine the interpolated location of the
maximum or minimum, and this turns out to provide improved matching and
stability. This approach uses the Taylor expansion (up to the quadratic terms) of
the scale-space function, D(x, y, σ), shifted so that the origin is at the sample
point:
D(x) = D + (∂D/∂x)ᵀ x + ½ xᵀ (∂²D/∂x²) x    (1)

where D and its derivatives are evaluated at the sample point and x = (x, y, σ)ᵀ is
the offset from this point. The location of the extremum, x̂, is calculated by taking
the derivative of this function with respect to x and setting it to zero, giving

x̂ = −(∂²D/∂x²)⁻¹ (∂D/∂x)    (2)
If the offset x̂ is larger than 0.5 in any dimension, it means that the extremum lies
closer to a different sample point. In this case, the sample point is changed and the
interpolation is performed about that point instead. The final offset x̂ is added to
the location of its sample point to obtain the interpolated estimate for the location
of the extremum.
The function value at the extremum, D(x̂), is useful for rejecting unstable
keypoints with low contrast. It can be obtained by substituting equation (2) into
(1), giving

D(x̂) = D + ½ (∂D/∂x)ᵀ x̂
To remove low-contrast keypoints, all extrema with a value of |D(x̂)| less than
0.03 were discarded (assuming image pixel values in the range [0, 1]). Figs 4.1-4.4
show the effects of keypoint selection on a natural image. In order to avoid too
much clutter, a low-resolution 233 by 189 pixel image is used, and keypoints are
shown as vectors giving the location, scale, and orientation of each keypoint
(orientation assignment is described below).
Fig 4.1 shows the original image, which is shown at reduced contrast behind the
subsequent figures. Fig 4.2 shows the 832 keypoints at all detected maxima and
minima of the difference-of-Gaussian function, while Fig 4.3 shows the 729
keypoints that remain following removal of those with a value of |D(x̂)| less than
0.03. Fig 4.4 will be explained in the next section.
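A sketch of this localization step, assuming dog is a 3-D NumPy array indexed as
dog[s, x, y] over (scale, row, column) with pixel values normalized to [0, 1]; for
brevity it omits the re-centring iteration used when an offset exceeds 0.5:

import numpy as np

def interpolate_and_filter(dog, s, x, y, contrast_threshold=0.03):
    """Fit the 3D quadratic of equations (1) and (2) at a candidate
    keypoint; return (offset, value) or None if the contrast is low."""
    D = dog.astype(np.float64)
    # First derivatives of D by central finite differences.
    g = 0.5 * np.array([D[s, x + 1, y] - D[s, x - 1, y],
                        D[s, x, y + 1] - D[s, x, y - 1],
                        D[s + 1, x, y] - D[s - 1, x, y]])
    # Second derivatives: the 3x3 Hessian of D in (x, y, sigma).
    dxx = D[s, x + 1, y] - 2 * D[s, x, y] + D[s, x - 1, y]
    dyy = D[s, x, y + 1] - 2 * D[s, x, y] + D[s, x, y - 1]
    dss = D[s + 1, x, y] - 2 * D[s, x, y] + D[s - 1, x, y]
    dxy = 0.25 * (D[s, x + 1, y + 1] - D[s, x + 1, y - 1]
                  - D[s, x - 1, y + 1] + D[s, x - 1, y - 1])
    dxs = 0.25 * (D[s + 1, x + 1, y] - D[s + 1, x - 1, y]
                  - D[s - 1, x + 1, y] + D[s - 1, x - 1, y])
    dys = 0.25 * (D[s + 1, x, y + 1] - D[s + 1, x, y - 1]
                  - D[s - 1, x, y + 1] + D[s - 1, x, y - 1])
    H = np.array([[dxx, dxy, dxs],
                  [dxy, dyy, dys],
                  [dxs, dys, dss]])
    x_hat = -np.linalg.solve(H, g)              # equation (2)
    D_hat = D[s, x, y] + 0.5 * g.dot(x_hat)     # value at the extremum
    if abs(D_hat) < contrast_threshold:
        return None                             # low contrast: reject
    return x_hat, D_hat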
In the previous step we rejected low-contrast extrema, but this is not sufficient for
stability. The difference-of-Gaussian function has a strong response along edges,
even when the location along the edge is poorly determined, and such keypoints
are therefore unstable to small amounts of noise. A poorly defined peak in the
difference-of-Gaussian function will have a large principal curvature across the edge
but a small one in the perpendicular direction. The principal curvatures can be
computed from a 2x2 Hessian matrix, H, computed at the location and scale of the
keypoint:
H = | Dxx  Dxy |
    | Dxy  Dyy |
The eigenvalues of H are proportional to the principal curvatures of D. If the
determinant is negative, the curvatures have different signs, so the point is
discarded as not being an extremum. Let r be the ratio between the
largest-magnitude eigenvalue, α, and the smaller one, β, so that α = rβ. Then

Tr(H) = Dxx + Dyy = α + β
Det(H) = Dxx Dyy − (Dxy)² = αβ

and therefore

Tr(H)² / Det(H) = (α + β)² / (αβ) = (rβ + β)² / (rβ²) = (r + 1)² / r

This ratio depends only on the ratio of the eigenvalues rather than their individual
values. The quantity (r + 1)²/r is at a minimum when the two eigenvalues are
equal, and it increases with r. Therefore, to check that the ratio of principal
curvatures is below some threshold, r, we only need to check whether

Tr(H)² / Det(H) < (r + 1)² / r
This check is very efficient to compute, with fewer than 20 floating point
operations required to test each keypoint. We eliminate keypoints that have a
ratio between the principal curvatures greater than 10. The transition from Fig 4.3
to Fig 4.4 shows the effect of this operation.
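A sketch of this test on a single DoG image D, using the same finite differences as
above and the suggested threshold r = 10 (passing r as a parameter is our choice):

import numpy as np

def passes_edge_test(D, x, y, r=10.0):
    """Check Tr(H)^2 / Det(H) < (r + 1)^2 / r at pixel (x, y)."""
    dxx = D[x + 1, y] - 2.0 * D[x, y] + D[x - 1, y]
    dyy = D[x, y + 1] - 2.0 * D[x, y] + D[x, y - 1]
    dxy = 0.25 * (D[x + 1, y + 1] - D[x + 1, y - 1]
                  - D[x - 1, y + 1] + D[x - 1, y - 1])
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Curvatures have different signs: not an extremum.
        return False
    return tr * tr / det < (r + 1.0) ** 2 / r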
CHAPTER5
ORIENTATION ASSIGNMENT
For each Gaussian-smoothed image L, the gradient magnitude, m(x, y), is
precomputed using pixel differences:

m(x, y) = √( (L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))² )
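A sketch of this precomputation over a whole smoothed image L. The orientation
map θ(x, y) = atan2(L(x, y+1) − L(x, y−1), L(x+1, y) − L(x−1, y)) computed
alongside is the standard companion formula, stated here as an assumption since
only the magnitude equation survives in this chapter:

import numpy as np

def gradient_maps(L):
    """Precompute gradient magnitude and orientation for one
    Gaussian-smoothed image L using pixel differences."""
    dx = np.zeros_like(L, dtype=np.float64)
    dy = np.zeros_like(L, dtype=np.float64)
    dx[1:-1, :] = L[2:, :] - L[:-2, :]   # L(x+1, y) - L(x-1, y)
    dy[:, 1:-1] = L[:, 2:] - L[:, :-2]   # L(x, y+1) - L(x, y-1)
    magnitude = np.sqrt(dx ** 2 + dy ** 2)
    orientation = np.arctan2(dy, dx)     # assumed companion formula
    return magnitude, orientation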
CHAPTER6
KEYPOINT DESCRIPTOR
One approach would be to sample the local image intensities around the
keypoint at the appropriate scale, and to match these using a normalized correlation
measure. However, simple correlation of image patches is highly sensitive to changes
that cause misregistration of samples, such as affine or 3D viewpoint change or
non-rigid deformations.
Fig 6.1 illustrates the computation of the key point descriptor. First the image
gradient magnitudes and orientations are sampled around the key point location,
using the scale of the key point to select the level of Gaussian blur. In order to
achieve orientation invariance, the coordinates of the descriptor and the gradient
orientations are rotated relative to the key point orientation. For efficiency, the
gradients are precomputed for all levels of the pyramid, as described in chapter 5.
These are illustrated with small arrows at each sample location on the left side of
Fig 6.1.
A Gaussian weighting function with σ equal to one half the width of the descriptor
window is used to assign a weight to the magnitude of each sample point. This is
illustrated with a circular window on the left side of Fig 6.1, although the weight
falls off smoothly. The purpose of this Gaussian window is to avoid sudden changes
in the descriptor with small changes in the position of the window, and to give less
emphasis to gradients that are far from the center of the descriptor, as these are
most affected by errors.
The key point descriptor is shown on the right side of Figure 6.1. It allows for
significant shift in gradient positions by creating orientation histograms over 4x4
sample regions. The figure shows eight directions for each orientation histogram,
with the length of each arrow corresponding to the magnitude of that histogram
entry. A gradient sample on the left can shift up to 4 sample positions while still
contributing to the same histogram on the right, thereby achieving the objective of
allowing for larger local positional shifts.
The descriptor is formed from a vector containing the values of all the orientation
histogram entries, corresponding to the lengths of the arrows on the right side of
Fig 6.1. The figure shows a 2x2 array of orientation histograms, whereas our
experiments show that the best results are achieved with a 4x4 array of
histograms with 8 orientation bins in each. Therefore, the experiments here use a
4x4x8 = 128 element feature vector for each key point.
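A simplified sketch of assembling the 128-element vector from a 16x16 patch of
precomputed gradient magnitudes and orientations centred on the key point. The
patch size, the hard bin assignment (the full method interpolates between bins),
and the omission of coordinate rotation are our simplifications:

import numpy as np

def descriptor_from_patch(mag, ori, keypoint_ori):
    """Build a 4x4 array of 8-bin orientation histograms (128 values)
    from a 16x16 patch of gradient magnitudes and orientations."""
    assert mag.shape == (16, 16) and ori.shape == (16, 16)
    # Gaussian weighting with sigma of half the descriptor window width.
    xs, ys = np.mgrid[0:16, 0:16] - 7.5
    weight = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * 8.0 ** 2))
    w_mag = mag * weight
    # Rotate orientations relative to the key point orientation.
    rel_ori = (ori - keypoint_ori) % (2.0 * np.pi)
    bins = (rel_ori / (2.0 * np.pi) * 8).astype(int) % 8
    desc = np.zeros((4, 4, 8))
    for i in range(16):
        for j in range(16):
            # Each 4x4 block of samples feeds one histogram.
            desc[i // 4, j // 4, bins[i, j]] += w_mag[i, j]
    return desc.ravel()  # the 4x4x8 = 128 element feature vector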
Finally, the feature vector is normalized to unit length to reduce the effects of
linear illumination change. However, non-linear illumination changes can also occur
due to camera saturation or due to illumination changes that affect 3D surfaces
with differing orientations by different amounts. These effects can cause a large
change in the relative magnitudes of some gradients, but are less likely to affect the
gradient orientations. Therefore, we reduce the influence of large gradient
magnitudes by thresholding the values in the unit feature vector to each be no
larger than 0.2, and then renormalizing to unit length. This means that matching
the magnitudes of large gradients is no longer as important, and the distribution of
orientations carries greater emphasis.
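A sketch of this normalization step (the 0.2 threshold is from the text; guarding
against division by zero is our addition):

import numpy as np

def normalize_descriptor(desc, threshold=0.2):
    """Normalize to unit length, clamp each value at the threshold,
    then renormalize to unit length."""
    desc = desc / max(np.linalg.norm(desc), 1e-12)
    desc = np.minimum(desc, threshold)   # damp large gradient magnitudes
    return desc / max(np.linalg.norm(desc), 1e-12)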
Figure 6.2 shows experimental results in which the number of orientations and the
size of the descriptor were varied. The graph was generated for a viewpoint
transformation in which a planar surface is tilted by 50 degrees away from the
viewer, with 4% image noise added.
CHAPTER7
CONCLUSION
The SIFT keypoints are particularly useful due to their distinctiveness, which
enables a single keypoint to find its correct match in a large database of other
keypoints. This distinctiveness is achieved by representing the image gradients
within a local region of the image as a high-dimensional vector. The keypoints have
been shown to be invariant to image rotation and scale, and robust to affine
distortion, addition of noise, and change in illumination. Large numbers of
keypoints can be extracted from images, which leads to robustness in extracting
small objects among clutter. The fact that keypoints are detected over a complete
range of scales means that small local features are available for matching small and
highly occluded objects, while large keypoints perform well for images subject to
noise, blur, and clutter. The computation is efficient, as several thousand keypoints
can be extracted from a typical image with near real-time performance on standard
PC hardware.
REFERENCES
Baumberg, A. 2000. Reliable feature matching across widely separated views. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 774-781.
Brown, M. and Lowe, D.G. 2002. Invariant features from interest point groups. In Proc. British Machine Vision Conference, pp. 656-665.
Harris, C. and Stephens, M. 1988. A combined corner and edge detector. In Proc. Fourth Alvey Vision Conference, pp. 147-151.
Harris, C. 1992. Geometry from visual motion. In Active Vision, MIT Press, pp. 263-284.
Lowe, D.G. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110.
Mikolajczyk, K. and Schmid, C. 2002. An affine invariant interest point detector. In Proc. European Conference on Computer Vision, pp. 128-142.
Moravec, H. 1981. Rover visual obstacle avoidance. In Proc. International Joint Conference on Artificial Intelligence, pp. 785-790.
Schaffalitzky, F. and Zisserman, A. 2002. Multi-view matching for unordered image sets. In Proc. European Conference on Computer Vision, pp. 414-431.
Tuytelaars, T. and Van Gool, L. 2000. Wide baseline stereo based on local, affinely invariant regions. In Proc. British Machine Vision Conference, pp. 412-422.
Zhang, Z., Deriche, R., Faugeras, O., and Luong, Q.T. 1995. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence, 78:87-119.