Abstract
This article presents a detailed description and implementation of the Scale Invariant Feature
Transform (SIFT), a popular image matching algorithm. This work contributes to a detailed
dissection of SIFT’s complex chain of transformations and to a careful presentation of each of
its design parameters. A companion online demonstration allows the reader to use SIFT and
individually set each parameter to analyze its impact on the algorithm results.
Source Code
The source code (ANSI C), its documentation, and the online demo are accessible at the IPOL web page of this article (https://doi.org/10.5201/ipol.2014.82).
Keywords: sift; feature detection; image comparison
1 General Description
The scale invariant feature transform, SIFT [17], extracts a set of descriptors from an image. The
extracted descriptors are invariant to image translation, rotation and scaling (zoom-out). SIFT
descriptors have also proved to be robust to a wide family of image transformations, such as slight
changes of viewpoint, noise, blur, contrast changes, scene deformation, while remaining discriminative
enough for matching purposes.
The seminal paper introducing SIFT in 1999 [16] has sparked an explosion of competitors.
SURF [2], Harris- and Hessian-based detectors [19], MOPS [3], ASIFT [28] and SFOP [8], along with methods using binary descriptors such as BRISK [13] and ORB [24], are just a few of the successful variants. These add to the numerous non-multi-scale detectors such as the Harris-Stephens detector [10], SUSAN [25], the Förstner detector [7], the morphological corner detector [1] and the machine-learning based FAST [23] and AGAST [18].
The SIFT algorithm consists of two successive and independent operations: the detection of
interesting points (i.e. keypoints) and the extraction of a descriptor associated to each of them.
Since these descriptors are robust, they are usually used for matching pairs of images. Object
recognition and video stabilization are other popular applications. Although the comparison of descriptors is not, strictly speaking, a step of the SIFT method, we have included it in our description for the sake of completeness.
The algorithm principle. SIFT detects a series of keypoints from a multiscale image represen-
tation. This multiscale representation consists of a family of increasingly blurred images. Each
keypoint is a blob-like structure whose center position (x, y) and characteristic scale σ are accurately
located. SIFT computes the dominant orientation θ over a region surrounding each one of these
keypoints. For each keypoint, the quadruple (x, y, σ, θ) defines the center, size and orientation of a
normalized patch where the SIFT descriptor is computed. As a result of this normalization, SIFT
keypoint descriptors are in theory invariant to any translation, rotation and scale change. The de-
scriptor encodes the spatial gradient distribution around a keypoint by a 128-dimensional vector.
This feature vector is generally used to match keypoints extracted from different images.
The algorithmic chain. In order to attain scale invariance, SIFT is built on the Gaussian scale-
space, a multiscale image representation simulating the family of all possible zoom-outs through
increasingly blurred versions of the input image (see [27] for a gentle introduction to the subject). In
this popular multiscale framework, the Gaussian convolution acts as an approximation of the optical
blur, and the Gaussian kernel approximates the camera’s point spread function. Thus, the Gaussian
scale-space can be interpreted as a family of images, each of them corresponding to a different zoom
factor. The Gaussian scale-space representation is presented in Section 2.
To attain translation, rotation and scale invariance, the extracted keypoints must be related to
structures that are unambiguously located, both in scale and position. This excludes image corners
and edges since they cannot be precisely localized both in scale and space. Image blobs, or more complex local structures characterized by their position and size, are therefore the most suitable structures for SIFT.
Detecting and locating keypoints consists in computing the 3d extrema of a differential operator
applied to the scale-space. The differential operator used in the SIFT algorithm is the difference
of Gaussians (DoG), presented in Section 3.1. The extraction of 3d continuous extrema consists
of two steps: first, the DoG representation is scanned for 3d discrete extrema. This gives a first
coarse location of the extrema, which are then refined to subpixel precision using a local quadratic
model. The extraction of 3d extrema is detailed in Section 3.2. Since there are many phenomena
that can lead to the detection of unstable keypoints, SIFT incorporates a cascade of tests to discard
the less reliable ones. Only those that are precisely located and sufficiently contrasted are retained.
Section 3.3 discusses two different discarding steps: the rejection of 3d extrema with small DoG value and the rejection of keypoint candidates lying on edges.
SIFT invariance to rotation is obtained by assigning to each keypoint a reference orientation.
This reference is computed from the gradient orientation over a keypoint neighborhood. This step
is detailed in Section 4.1. Finally the spatial distribution of the gradient inside an oriented patch
is encoded to produce the SIFT keypoint descriptor. The design of the SIFT keypoint descriptor is
described in Section 4.2. This ends the algorithmic chain defining the SIFT algorithm. Additionally,
Section 5 illustrates how SIFT descriptors can be used to find local matches between pairs of images.
The method presented here is the matching procedure described in the original paper by D. Lowe [16].
This complex chain of transformations is governed by a large number of design parameters.
Section 6 summarizes them and provides an analysis of their respective influence. Table 1 presents
the details of the adopted notation while the consecutive steps of the SIFT algorithm are summarized
in Table 2.
(i.e. when the zoom-out factor increases). The scale-space, therefore, provides SIFT with scale
invariance as it can be interpreted as the simulation of a set of snapshots of a given scene taken at
different distances. In what follows we detail the construction of the SIFT scale-space.
The Gaussian scale-space of an image u associates to each blur level σ the smoothed image v(σ, ·) = G_σ u, the convolution of u with the Gaussian kernel G_σ, where the kernel $G_\sigma(x) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{|x|^2}{2\sigma^2}}$ is parameterized by its standard deviation σ ∈ R⁺.
The Gaussian smoothing operator satisfies a semi-group relation: applying successively two Gaussian smoothings of parameters σ₁ and σ₂ is equivalent to applying a single Gaussian smoothing of parameter $\sqrt{\sigma_1^2 + \sigma_2^2}$, namely $G_{\sigma_2} G_{\sigma_1} u = G_{\sqrt{\sigma_1^2 + \sigma_2^2}}\, u$.
In the case of digital images there is some ambiguity on how to define a discrete counterpart to the
continuous Gaussian smoothing operator [9, 22]. In the present work as in Lowe’s original work, the
digital Gaussian smoothing is implemented as a discrete convolution with samples of a truncated
Gaussian kernel.
Table 2: The successive steps of the SIFT algorithm.

Stage 1. Compute the Gaussian scale-space.
         in: u, image.                                    out: v, scale-space.
Stage 2. Compute the difference of Gaussians (DoG).
         in: v, scale-space.                              out: w, DoG.
Stage 3. Find candidate keypoints (3d discrete extrema of the DoG).
         in: w, DoG.                                      out: {(x_d, y_d, σ_d)}, list of discrete extrema (position and scale).
Stage 4. Refine the candidate keypoint locations with sub-pixel precision.
         in: w, DoG, and {(x_d, y_d, σ_d)}.               out: {(x, y, σ)}, list of interpolated extrema.
Stage 5. Filter unstable keypoints due to noise.
         in: w, DoG, and {(x, y, σ)}.                     out: {(x, y, σ)}, list of filtered keypoints.
Stage 6. Filter unstable keypoints lying on edges.
         in: w, DoG, and {(x, y, σ)}.                     out: {(x, y, σ)}, list of filtered keypoints.
Stage 7. Assign a reference orientation to each keypoint.
         in: (∂_m v, ∂_n v), scale-space gradient, and {(x, y, σ)}.     out: {(x, y, σ, θ)}, list of oriented keypoints.
Stage 8. Build the keypoint descriptors.
         in: (∂_m v, ∂_n v), scale-space gradient, and {(x, y, σ, θ)}.  out: {(x, y, σ, θ, f)}, list of described keypoints.
Digital Gaussian smoothing. Let gσ be the one-dimensional digital kernel obtained by sampling
a truncated Gaussian function of standard deviation σ,
$$g_\sigma(k) = K e^{-\frac{k^2}{2\sigma^2}}, \qquad -\lceil 4\sigma \rceil \le k \le \lceil 4\sigma \rceil, \ k \in \mathbb{Z}, \tag{3}$$

where ⌈·⌉ denotes the ceiling function and K is set so that $\sum_k g_\sigma(k) = 1$. Let G_σ denote the digital Gaussian convolution of parameter σ and u be a digital image of size M × N. Its digital Gaussian smoothing, denoted by G_σ u, is computed via a separable two-dimensional (2d) discrete convolution

$$G_\sigma u(k, l) := \sum_{k'=-\lceil 4\sigma \rceil}^{\lceil 4\sigma \rceil} \ \sum_{l'=-\lceil 4\sigma \rceil}^{\lceil 4\sigma \rceil} g_\sigma(k')\, g_\sigma(l')\, \bar{u}(k - k', l - l'), \tag{4}$$
where ū denotes the extension of u to Z2 via symmetrization with respect to −1/2, namely,
ū(k, l) = u(sM (k), sN (l)) with sM (k) = min(k mod 2M, 2M − 1 − k mod 2M ).
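For concreteness, the following C sketch (with illustrative names; it is not the reference code distributed with this article) implements the digital Gaussian smoothing of Equations (3) and (4) as a separable convolution with the symmetrized extension ū.

```c
#include <math.h>
#include <stdlib.h>

/* Symmetrization with respect to -1/2: s_M(k) = min(k mod 2M, 2M - 1 - k mod 2M). */
static int sym(int k, int M)
{
    int r = k % (2 * M);
    if (r < 0)
        r += 2 * M;
    return r < M ? r : 2 * M - 1 - r;
}

/* Digital Gaussian smoothing G_sigma u of an M x N image stored row-major, Eqs. (3)-(4). */
void gaussian_smoothing(const double *u, double *out, int M, int N, double sigma)
{
    int radius = (int)ceil(4.0 * sigma);
    double *g = malloc((2 * radius + 1) * sizeof *g);
    double *tmp = malloc((size_t)M * N * sizeof *tmp);
    double sum = 0.0;

    /* Truncated Gaussian kernel g_sigma, normalized to sum to one, Eq. (3). */
    for (int k = -radius; k <= radius; k++) {
        g[k + radius] = exp(-0.5 * k * k / (sigma * sigma));
        sum += g[k + radius];
    }
    for (int k = 0; k <= 2 * radius; k++)
        g[k] /= sum;

    /* Separable convolution with symmetric boundary handling, Eq. (4). */
    for (int i = 0; i < M; i++)        /* first pass: along the first index */
        for (int j = 0; j < N; j++) {
            double acc = 0.0;
            for (int k = -radius; k <= radius; k++)
                acc += g[k + radius] * u[sym(i - k, M) * N + j];
            tmp[i * N + j] = acc;
        }
    for (int i = 0; i < M; i++)        /* second pass: along the second index */
        for (int j = 0; j < N; j++) {
            double acc = 0.0;
            for (int k = -radius; k <= radius; k++)
                acc += g[k + radius] * tmp[i * N + sym(j - k, N)];
            out[i * N + j] = acc;
        }
    free(g);
    free(tmp);
}
```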
For the range of values of σ considered in the described algorithm (i.e. σ ≥ 0.7), the digital Gaussian smoothing operator approximately satisfies a semi-group relation, with an error below 10⁻⁴ for pixel intensity values ranging from 0 to 1 [22]. Applying successively two digital Gaussian smoothings of parameters σ₁ and σ₂ is approximately equal to applying one digital Gaussian smoothing of parameter $\sqrt{\sigma_1^2 + \sigma_2^2}$,
$$G_{\sigma_2} G_{\sigma_1} u \approx G_{\sqrt{\sigma_1^2 + \sigma_2^2}}\, u. \tag{5}$$
Figure 1: Convention adopted for the sampling grid of the digital scale-space v. The blur level is
considered with respect to the sampling grid of the input image. The parameters are set to their
default value, namely σmin = 0.8, nspo = 3, noct = 8, σin = 0.5.
The digital scale-space also includes three additional images per octave, v_0^o, v_{n_spo+1}^o and v_{n_spo+2}^o. The rationale for this will become clear later.
The construction of the digital scale-space begins with the computation of a seed image denoted by v_0^1. This image will have a blur level σ_0^1 = σ_min, which is the minimum blur level considered, and is computed from the input image u_in (of assumed blur level σ_in) as
$$v_0^1 = G_{\frac{1}{\delta_{min}}\sqrt{\sigma_{min}^2 - \sigma_{in}^2}}\; I_{\delta_{min}} u_{in}, \tag{6}$$
where Iδmin is the digital bilinear interpolator by a factor 1/δmin (see Algorithm 1) and Gσ is the
digital Gaussian convolution already defined. The entire digital scale-space is derived from this seed
image. The default value δmin = 0.5 implies an initial 2× interpolation. The blur level of the seed
image, relative to the input image sampling grid, is set as default to σmin = 0.8.
The second and posterior scale-space images s = 1, . . . , nspo + 2 at each octave o are computed
recursively according to
$$v_s^o = G_{\rho[(s-1)\to s]}\, v_{s-1}^o, \tag{7}$$
where
$$\rho[(s-1)\to s] = \frac{\sigma_{min}}{\delta_{min}} \sqrt{2^{2s/n_{spo}} - 2^{2(s-1)/n_{spo}}}.$$
The first image (i.e. s = 0) of each octave o = 2, . . . , n_oct is computed as
$$v_0^o = S_2\, v_{n_{spo}}^{o-1}, \tag{8}$$
where S₂ denotes the subsampling operator by a factor of 2, (S₂u)(m, n) = u(2m, 2n). This procedure produces a family of images (v_s^o), o = 1, . . . , n_oct and s = 0, . . . , n_spo + 2, having inter-pixel distance
$$\delta_o = \delta_{min}\, 2^{o-1} \tag{9}$$
and blur level
$$\sigma_s^o = \frac{\delta_o}{\delta_{min}}\, \sigma_{min}\, 2^{s/n_{spo}}. \tag{10}$$
The diagram in Figure 2 depicts the digital scale-space architecture in terms of the sampling rates and blur levels. Each point symbolizes a scale-space image v_s^o having inter-pixel distance δ_o and blur level σ_s^o. The featured configuration is produced from the default parameter values of the Lowe SIFT algorithm: σ_min = 0.8, δ_min = 0.5, n_spo = 3, and σ_in = 0.5. The number of octaves n_oct is bounded above by the number of possible subsamplings. Figure 3 shows an example of the digital scale-space
images generated with the given configuration.
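The bookkeeping of blurs and sampling rates can be made explicit with a short program. The sketch below (illustrative, not part of the reference implementation) prints, for the default parameters, the inter-pixel distance δ_o, the blur levels σ_s^o of Equation (10) and the incremental blurs ρ of Equation (7); the printed σ values reproduce those featured in Figure 2.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double sigma_min = 0.8, delta_min = 0.5, sigma_in = 0.5;
    const int n_oct = 5, n_spo = 3;

    /* Blur applied to the interpolated input image to reach the seed image, Eq. (6). */
    double seed_blur = sqrt(sigma_min * sigma_min - sigma_in * sigma_in) / delta_min;
    printf("seed: interpolate by 1/delta_min = %g, then blur by %.3f\n",
           1.0 / delta_min, seed_blur);

    for (int o = 1; o <= n_oct; o++) {
        double delta_o = delta_min * pow(2.0, o - 1);   /* inter-pixel distance, Eq. (9) */
        for (int s = 0; s <= n_spo + 2; s++) {
            double sigma_so = (delta_o / delta_min) * sigma_min * pow(2.0, (double)s / n_spo);
            /* s = 0 images are obtained by subsampling (Eq. (8)); for s >= 1 the image is
             * computed from the previous scale with the incremental blur rho of Eq. (7). */
            double rho = (s == 0) ? 0.0
                       : (sigma_min / delta_min) * sqrt(pow(2.0, 2.0 * s / n_spo)
                                                      - pow(2.0, 2.0 * (s - 1) / n_spo));
            printf("o=%d s=%d  delta=%.2f  sigma=%.3f  rho=%.3f\n",
                   o, s, delta_o, sigma_so, rho);
        }
    }
    return 0;
}
```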
3 Keypoint Definition
Precisely detecting interesting image features is a challenging problem. The keypoint features are
defined in SIFT as the extrema of the normalized Laplacian scale-space σ 2 ∆v [14]. A Laplacian
extremum is unequivocally characterized by its scale-space coordinates (σ, x) where x refers to its
Figure 2: (a) The succession of subsamplings and Gaussian convolutions that results in the SIFT scale-space. The first image at each octave v_0^o is obtained via subsampling, with the exception of the first image of the first octave v_0^1, which is generated by a bilinear interpolation followed by a Gaussian convolution. (b) An illustration of the digital scale-space in its default configuration. The digital scale-space v is composed of images v_s^o for o = 1, . . . , n_oct and s = 0, . . . , n_spo + 2. All images are computed directly or indirectly from u_in (in blue). Each image is characterized by its blur level and its inter-pixel distance, respectively denoted by σ and δ. The scale-space is split into octaves: sets of images sharing a common sampling rate. Each octave is composed of n_spo scales (in red) and three other auxiliary scales (in gray). The depicted configuration features n_oct = 5 octaves and corresponds to the following parameter settings: n_spo = 3, σ_min = 0.8. The assumed blur level of the input image is σ_in = 0.5.
center spatial position and σ relates to its size (scale). As will be presented in Section 4, the covariance
of the extremum (σ, x) induces the invariance to translation and scale of its associated descriptor.
Instead of computing the Laplacian of the image scale space, SIFT uses a difference of Gaussians
operator (DoG), first introduced by Burt and Adelson [4] and Crowley and Stern [6]. Let v be
a Gaussian scale-space and κ > 1. The difference of Gaussians (DoG) of ratio κ is defined by w : (σ, x) ↦ v(κσ, x) − v(σ, x). The DoG operator takes advantage of the link between the Gaussian
kernel and the heat equation to approximately compute the normalized Laplacian σ 2 ∆v. Indeed,
from a set of simulated blurs following a geometric progression of ratio κ, the heat equation is
Figure 3: Crops of a subset of images extracted from the Gaussian scale-space of an example image.
The scale-space parameters are set to nspo = 3, σmin = 0.8, and the assumed input image blur level
σin = 0.5. Image pixels are represented by a square of side δo for better visualization.
approximated by
$$\sigma \Delta v = \frac{\partial v}{\partial \sigma} \approx \frac{v(\kappa\sigma, x) - v(\sigma, x)}{\kappa\sigma - \sigma} = \frac{w(\sigma, x)}{(\kappa - 1)\sigma}. \tag{11}$$
Thus, we have w(σ, x) ≈ (κ − 1)σ²∆v(σ, x).
The SIFT keypoints of an image are defined as the 3d extrema of the difference of Gaussians
(DoG). Since we deal with digital images, the continuous 3d extrema of the DoG cannot be directly
computed. Thus, the discrete extrema of the digital DoG are first detected and then their positions
are refined. The detected points are finally validated to discard possible unstable and false detections
due to noise. Hence, the detection of SIFT keypoints involves the following steps: the detection of the 3d discrete extrema of the DoG, the refinement of their position and scale by local quadratic interpolation, the rejection of poorly contrasted extrema, and the rejection of keypoint candidates lying on edges.
The DoG image w_s^o is obtained by subtracting two consecutive images of the Gaussian scale-space, w_s^o(m, n) = v_{s+1}^o(m, n) − v_s^o(m, n), with m = 0, . . . , M_o − 1, n = 0, . . . , N_o − 1. The image w_s^o will be attributed the blur level σ_s^o. This computation is illustrated in Figure 4 and summarized in Algorithm 3. Note how, in the digital scale-space, the computation of the auxiliary image v_{n_spo+2}^o is required for computing the DoG approximation w_{n_spo+1}^o. Figure 5 illustrates the DoG scale-space w relative to the previously introduced Gaussian scale-space v. Figure 6 shows images of an example DoG scale-space.
Figure 4: The difference of Gaussians operator is computed by subtracting pairs of contiguous scale-
space images. The procedure is not centered: the difference between the images at scales κσ and σ
is attributed a blur level σ.
Figure 5: The DoG scale-space. The difference of Gaussians acts as an approximation of the normalized Laplacian σ²∆. The difference w_s^o = v_{s+1}^o − v_s^o is relative to the blur level σ_s^o. Each octave contains n_spo images plus two auxiliary images (in black).
Detection of DoG 3d discrete extrema. Each sample w^o_{s,m,n} of the DoG scale-space, with s = 1, . . . , n_spo, o = 1, . . . , n_oct, m = 1, . . . , M_o − 2, n = 1, . . . , N_o − 2 (which excludes the image borders and the auxiliary images), is compared to its neighbors to detect the 3d discrete maxima and minima (the number of neighbors is 26 = 3 × 3 × 3 − 1). Algorithm 4 summarizes the extraction of 3d extrema from the digital DoG. These comparisons are possible thanks to the auxiliary images w_0^o and w_{n_spo+1}^o calculated for each octave o. This scanning process is nevertheless a rudimentary way to detect candidate points. It is sensitive to noise, produces unstable detections, and the information it provides regarding location and scale may be flawed since it is constrained to the sampling grid. To amend these shortcomings, this preliminary step is followed by an interpolation that refines the localization of the extrema and by a cascade of filters that discard unreliable detections.
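As an illustration, the following sketch (not a transcription of Algorithm 4; the data layout is an assumption, and ties are counted as disqualifying, a convention the text above does not specify) tests whether one sample of an octave of the DoG scale-space is a 3d discrete extremum among its 26 neighbors.

```c
/* Returns 1 if sample (s, m, n) is a 3d discrete maximum or minimum of the DoG.
 * w is an array of n_spo + 3 images of size M x N (row-major); the function must be
 * called with 1 <= s <= n_spo, 1 <= m <= M - 2 and 1 <= n <= N - 2, as in the text. */
static int is_3d_extremum(double **w, int s, int m, int n, int N)
{
    double c = w[s][m * N + n];
    int is_max = 1, is_min = 1;

    for (int ds = -1; ds <= 1; ds++)
        for (int dm = -1; dm <= 1; dm++)
            for (int dn = -1; dn <= 1; dn++) {
                if (ds == 0 && dm == 0 && dn == 0)
                    continue;
                double v = w[s + ds][(m + dm) * N + (n + dn)];
                if (v >= c) is_max = 0;   /* a neighbor at least as large: not a maximum */
                if (v <= c) is_min = 0;   /* a neighbor at least as small: not a minimum */
            }
    return is_max || is_min;
}
```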
Keypoint position refinement. The location of the discrete extrema is constrained to the sam-
pling grid (defined by the octave o). This coarse localization hinders a rigorous covariance property
Figure 6: Crops of a subset of images extracted from the DoG scale-space of an example image. The
DoG operator is an approximation of the normalized Laplacian operator σ 2 ∆v. The DoG scale-space
parameters used in this example are the default: nspo = 3, σmin = 0.8, σin = 0.5. Image pixels are
represented by a square of side δo for better visualization.
of the set of keypoints and subsequently is an obstacle to the full scale and translation invariance of
the corresponding descriptor. SIFT refines the position and scale of each candidate keypoint using
a local interpolation model.
We denote by ω^o_{s,m,n}(α) the quadratic function at sample point (s, m, n) in the octave o, given by
$$\omega_{s,m,n}^o(\alpha) = w_{s,m,n}^o + \alpha^T \bar{g}_{s,m,n}^o + \frac{1}{2}\,\alpha^T \bar{H}_{s,m,n}^o\, \alpha, \tag{12}$$
where α = (α₁, α₂, α₃)ᵀ ∈ [−1/2, 1/2]³; $\bar{g}_{s,m,n}^o$ and $\bar{H}_{s,m,n}^o$ denote respectively the 3d gradient and Hessian at (s, m, n) in the octave o, computed with a finite difference scheme as follows
$$\bar{g}_{s,m,n}^o = \begin{pmatrix} (w_{s+1,m,n}^o - w_{s-1,m,n}^o)/2 \\ (w_{s,m+1,n}^o - w_{s,m-1,n}^o)/2 \\ (w_{s,m,n+1}^o - w_{s,m,n-1}^o)/2 \end{pmatrix}, \qquad \bar{H}_{s,m,n}^o = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{12} & h_{22} & h_{23} \\ h_{13} & h_{23} & h_{33} \end{pmatrix}, \tag{13}$$
with
$$\begin{aligned}
h_{11} &= w_{s+1,m,n}^o + w_{s-1,m,n}^o - 2w_{s,m,n}^o, & h_{12} &= (w_{s+1,m+1,n}^o - w_{s+1,m-1,n}^o - w_{s-1,m+1,n}^o + w_{s-1,m-1,n}^o)/4,\\
h_{22} &= w_{s,m+1,n}^o + w_{s,m-1,n}^o - 2w_{s,m,n}^o, & h_{13} &= (w_{s+1,m,n+1}^o - w_{s+1,m,n-1}^o - w_{s-1,m,n+1}^o + w_{s-1,m,n-1}^o)/4,\\
h_{33} &= w_{s,m,n+1}^o + w_{s,m,n-1}^o - 2w_{s,m,n}^o, & h_{23} &= (w_{s,m+1,n+1}^o - w_{s,m+1,n-1}^o - w_{s,m-1,n+1}^o + w_{s,m-1,n-1}^o)/4.
\end{aligned}$$
This quadratic function is an approximation of the second order Taylor development of the underlying
continuous function (where its derivatives are approximated by finite difference schemes).
In order to refine the position of a discrete extremum (se , me , ne ) at octave oe SIFT proceeds as
follows.
1. Initialize (s, m, n) by the discrete coordinates of the extremum (se , me , ne ).
2. Compute the continuous extremum of ω^o_{s,m,n} by solving ∇ω^o_{s,m,n}(α) = 0 (see Algorithm 7). This yields
$$\alpha^* = -\left(\bar{H}_{s,m,n}^o\right)^{-1} \bar{g}_{s,m,n}^o. \tag{14}$$
3. If max(|α₁*|, |α₂*|, |α₃*|) ≤ 0.5 (i.e. the extremum of the quadratic function lies in its domain of validity) the extremum is accepted. According to the scale-space architecture (see (10) and (9)), the corresponding keypoint coordinates are
$$(\sigma, x, y) = \left( \frac{\delta_{o_e}}{\delta_{min}}\, \sigma_{min}\, 2^{(\alpha_1^* + s)/n_{spo}},\ \ \delta_{o_e}(\alpha_2^* + m),\ \ \delta_{o_e}(\alpha_3^* + n) \right). \tag{15}$$
4. If α∗ falls outside the domain of validity, the interpolation is rejected and another one is carried
out. Update (s, m, n) to the closest discrete value to (s, m, n) + α∗ and repeat from (2).
This process is repeated up to five times or until the interpolation is validated. If after five iterations
the result is still not validated, the candidate keypoint is discarded. In practice, the validity domain
is defined by max(|α1∗ |, |α2∗ |, |α3∗ |) < 0.6 to avoid possible numerical instabilities due to the fact that
the piecewise interpolation model is not continuous. See Algorithm 6 for details.
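The linear system of Equation (14) is only 3 × 3, so it can be solved directly. The following sketch (illustrative; the article's Algorithm 7 may proceed differently) computes α* by Cramer's rule from the gradient and Hessian of Equation (13).

```c
#include <math.h>

/* Solve H alpha = -g for the symmetric 3x3 Hessian of Eq. (13), i.e. compute the
 * offset alpha* of Eq. (14).  Returns 0 when the system is (nearly) singular. */
static int quadratic_offset(const double g[3], const double H[3][3], double alpha[3])
{
    double det = H[0][0] * (H[1][1] * H[2][2] - H[1][2] * H[2][1])
               - H[0][1] * (H[1][0] * H[2][2] - H[1][2] * H[2][0])
               + H[0][2] * (H[1][0] * H[2][1] - H[1][1] * H[2][0]);
    if (fabs(det) < 1e-12)
        return 0;

    double b[3] = { -g[0], -g[1], -g[2] };       /* right-hand side */

    /* Cramer's rule: replace one column of H at a time by b. */
    alpha[0] = (b[0] * (H[1][1] * H[2][2] - H[1][2] * H[2][1])
              - H[0][1] * (b[1] * H[2][2] - H[1][2] * b[2])
              + H[0][2] * (b[1] * H[2][1] - H[1][1] * b[2])) / det;
    alpha[1] = (H[0][0] * (b[1] * H[2][2] - H[1][2] * b[2])
              - b[0] * (H[1][0] * H[2][2] - H[1][2] * H[2][0])
              + H[0][2] * (H[1][0] * b[2] - b[1] * H[2][0])) / det;
    alpha[2] = (H[0][0] * (H[1][1] * b[2] - b[1] * H[2][1])
              - H[0][1] * (H[1][0] * b[2] - b[1] * H[2][0])
              + b[0] * (H[1][0] * H[2][1] - H[1][1] * H[2][0])) / det;
    return 1;
}
```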
According to the local interpolation model (12), the value of the DoG at the interpolated extremum is
$$\omega := \omega_{s,m,n}^o(\alpha^*) = w_{s,m,n}^o + (\alpha^*)^T \bar{g}_{s,m,n}^o + \frac{1}{2}\, (\alpha^*)^T \bar{H}_{s,m,n}^o\, \alpha^* = w_{s,m,n}^o + \frac{1}{2}\, (\alpha^*)^T \bar{g}_{s,m,n}^o. \tag{16}$$
This value will be used to discard poorly contrasted keypoints.
Since the DoG function approximates (κ − 1)σ²∆v, where κ is a function of the number of scales per octave n_spo, the value of the threshold C_DoG should depend on n_spo (default value C_DoG = 0.015 for n_spo = 3). The threshold applied in the provided source code is
$$\widetilde{C}_{DoG} = \frac{2^{1/n_{spo}} - 1}{2^{1/3} - 1}\, C_{DoG},$$
with C_DoG relative to n_spo = 3. This guarantees that the applied threshold is independent of the sampling configuration. Before the refinement of the extrema, and to avoid unnecessary computations, a less conservative threshold at 80% of C_DoG is applied to the discrete 3d extrema (see Algorithm 5):
if |w^o_{s,m,n}| < 0.8 × C_DoG, then discard the discrete 3d extremum.
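In code, adapting the threshold to the number of scales per octave is a one-line formula. The helper below is a minimal sketch under the naming assumptions shown.

```c
#include <math.h>

/* Adapted DoG threshold: C~_DoG = (2^(1/n_spo) - 1) / (2^(1/3) - 1) * C_DoG,
 * where c_dog is the threshold value given for n_spo = 3 (default 0.015). */
static double adapted_dog_threshold(double c_dog, int n_spo)
{
    return (pow(2.0, 1.0 / n_spo) - 1.0) / (pow(2.0, 1.0 / 3.0) - 1.0) * c_dog;
}
```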
The SIFT algorithm discards keypoint candidates whose eigenvalue ratio r := λ_max/λ_min is larger than a certain threshold C_edge (the default value is C_edge = 10). Since only this ratio is relevant, the explicit computation of the eigenvalues can be avoided. The ratio of the squared trace of the 2d Hessian matrix to its determinant is related to r by
$$\mathrm{edgeness}(H_{s,m,n}^o) = \frac{\operatorname{tr}(H_{s,m,n}^o)^2}{\det(H_{s,m,n}^o)} = \frac{(\lambda_{max} + \lambda_{min})^2}{\lambda_{max}\lambda_{min}} = \frac{(r+1)^2}{r}. \tag{18}$$
Thus, the filtering of keypoint candidates on edges consists in the following test:
if edgeness(H^o_{s,m,n}) > (C_edge + 1)²/C_edge, then discard the candidate keypoint.
Note that H^o_{s,m,n} is the bottom-right 2 × 2 sub-matrix of the matrix $\bar{H}_{s,m,n}^o$ of (13), previously computed for the keypoint interpolation. Algorithm 9 summarizes how keypoints on edges are discarded.
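A minimal sketch of this test is given below. The names are illustrative, and rejecting candidates whose 2 × 2 Hessian has a non-positive determinant (a saddle-like configuration) is an implementation choice assumed here, not something stated in the excerpt above.

```c
/* Edge test of Eq. (18): h11, h12, h22 are the entries of the 2x2 spatial Hessian,
 * i.e. the bottom-right block of Eq. (13).  Returns 1 if the candidate is discarded. */
static int on_edge(double h11, double h12, double h22, double c_edge)
{
    double tr  = h11 + h22;
    double det = h11 * h22 - h12 * h12;
    if (det <= 0.0)                       /* curvatures of opposite signs: reject */
        return 1;
    double edgeness = tr * tr / det;      /* Eq. (18) */
    return edgeness > (c_edge + 1.0) * (c_edge + 1.0) / c_edge;
}
```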
3.4 Pseudocodes
4 Keypoint Description
In the literature, rotation invariant descriptors fall into two categories. On the one side, those based
on properties of the image that are already rotation-invariant and on the other side, descriptors
based on a normalization with respect to a reference orientation. The SIFT descriptor is based on
a normalization. The local dominant gradient angle is computed and used as a reference orienta-
tion. Then, the local gradient distribution is normalized with respect to this reference direction (see
Figure 7).
The SIFT descriptor is built from the normalized image gradient orientation in the form of
quantized histograms. In what follows, we describe how the reference orientation specific to each
keypoint is defined and computed.
Figure 7: The description of a keypoint detected at scale σ (the radius of the blue circle) involves the
analysis of the image gradient distribution around the keypoint in two radial Gaussian neighborhoods
with different sizes. The first local analysis aims at attributing a reference orientation to the
keypoint (the blue arrow). It is performed over a Gaussian window of standard deviation λori σ (the
radius of the green circle). The width of the contributing samples patch P ori (green square) is 6λori σ.
The figure shows the default case λori = 1.5. The second analysis aims at building the descriptor.
It is performed over a Gaussian window of standard deviation λdescr σ (the radius of the red circle)
within a square patch P descr (the red square) of approximate width 2λdescr σ. The figure features the
default settings: λdescr = 6, with a Gaussian window of standard deviation 6σ and a patch P descr of
width 12σ.
A. Orientation histogram accumulation. Given a keypoint (x, y, σ), the patch to be analyzed
is extracted from the image of the scale-space vso , whose scale σso is nearest to σ. This normalized
patch, denoted by P ori , is the set of pixels (m, n) of vso satisfying
max(|δo m − x|, |δo n − y|) ≤ 3λori σ. (19)
Keypoints whose distance to the image borders is less than 3λori σ are discarded since the patch P ori is
not totally included in the image. The orientation histogram h from which the dominant orientation
is found covers the range [0, 2π]. It is composed of nbins bins with centers θk = 2πk/nbins . Each pixel
(m, n) in P^ori will contribute to the histogram with a total weight of c^ori_{m,n}, which is the product of the gradient norm and a Gaussian weight of standard deviation λ_ori σ (default value λ_ori = 1.5) reducing the contribution of distant pixels,
$$c_{m,n}^{ori} = e^{-\frac{\|(m\delta_o,\, n\delta_o) - (x, y)\|^2}{2(\lambda_{ori}\sigma)^2}}\, \left\|\left(\partial_m v_{s,m,n}^o,\ \partial_n v_{s,m,n}^o\right)\right\|. \tag{20}$$
This contribution is assigned to the nearest bin, namely the bin of index
$$b_{m,n}^{ori} = \left[\frac{n_{bins}}{2\pi}\, \arctan\!2\left(\partial_m v_{s,m,n}^o,\ \partial_n v_{s,m,n}^o\right)\right], \tag{21}$$
where [·] denotes the round function and arctan2 is the two-argument inverse tangent function² with range in [0, 2π]. The gradient components of the scale-space image v_s^o are computed through a finite difference scheme
$$\partial_m v_{s,m,n}^o = \frac{1}{2}\left(v_{s,m+1,n}^o - v_{s,m-1,n}^o\right), \qquad \partial_n v_{s,m,n}^o = \frac{1}{2}\left(v_{s,m,n+1}^o - v_{s,m,n-1}^o\right), \tag{22}$$
for m = 1, . . . , Mo − 2 and n = 1, . . . , No − 2.
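The per-sample computation of Equations (20) and (21) can be summarized as follows. This is a sketch with illustrative names; the value n_bins = 36 used by default is an assumption, since the excerpt above does not state it.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Contribution of one sample (m, n) of the scale-space image (gradient components
 * dvm, dvn) to the orientation histogram of a keypoint (x, y, sigma): Gaussian-weighted
 * gradient norm (Eq. (20)) and nearest-bin index (Eq. (21)).  delta_o is the octave's
 * inter-pixel distance; lambda_ori = 1.5 by default, n_bins = 36 assumed. */
static void orientation_contribution(int m, int n, double dvm, double dvn,
                                     double x, double y, double sigma, double delta_o,
                                     double lambda_ori, int n_bins,
                                     double *weight, int *bin)
{
    double dx = m * delta_o - x, dy = n * delta_o - y;
    double s  = lambda_ori * sigma;

    *weight = exp(-(dx * dx + dy * dy) / (2.0 * s * s)) * sqrt(dvm * dvm + dvn * dvn);

    double theta = atan2(dvm, dvn);          /* two-argument inverse tangent, as in the text */
    if (theta < 0.0)
        theta += 2.0 * M_PI;                 /* map to [0, 2*pi) */
    *bin = (int)floor(n_bins * theta / (2.0 * M_PI) + 0.5) % n_bins;   /* nearest bin */
}
```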
B. Smoothing the histogram. After being accumulated, the orientation histogram is smoothed
by applying six times a circular convolution with the three-tap box filter [1, 1, 1]/3.
C. Extraction of reference orientation(s). The keypoint reference orientations are taken among
the local maxima positions of the smoothed histogram. More precisely, the reference orientations are
the positions of local maxima larger than t times the global maximum (default value t = 0.8). Let
k ∈ {1, . . . , n_bins} be the index of a bin such that h_k > h_{k−}, h_k > h_{k+}, where k− = (k − 1) mod n_bins and k+ = (k + 1) mod n_bins, and such that h_k ≥ t max(h). This bin is centered at orientation θ_k = 2π(k − 1)/n_bins. The corresponding keypoint reference orientation θ_ref is computed from the maximum position of the quadratic function that interpolates the values h_{k−}, h_k, h_{k+},
$$\theta_{ref} = \theta_k + \frac{\pi}{n_{bins}}\, \frac{h_{k^-} - h_{k^+}}{h_{k^-} - 2h_k + h_{k^+}}. \tag{23}$$
Each one of the extracted reference orientations leads to the computation of one local descriptor
of a keypoint neighborhood. The number of descriptors may consequently exceed the number of
keypoints. Figure 8 illustrates how a reference orientation is attributed to a keypoint.
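Steps B and C amount to a short routine. The sketch below is illustrative; it uses 0-based bin indices, so the bin center is θ_k = 2πk/n_bins, and it assumes n_bins ≤ 64.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Smooth the orientation histogram h (six passes of the circular filter [1,1,1]/3),
 * then store in theta[] every reference orientation: local maxima not smaller than
 * t * max(h), refined by the quadratic interpolation of Eq. (23).  Returns their number. */
static int reference_orientations(double *h, int n_bins, double t, double *theta)
{
    double tmp[64];                              /* assumes n_bins <= 64 */
    for (int pass = 0; pass < 6; pass++) {
        for (int k = 0; k < n_bins; k++)
            tmp[k] = (h[(k - 1 + n_bins) % n_bins] + h[k] + h[(k + 1) % n_bins]) / 3.0;
        for (int k = 0; k < n_bins; k++)
            h[k] = tmp[k];
    }

    double hmax = 0.0;
    for (int k = 0; k < n_bins; k++)
        if (h[k] > hmax)
            hmax = h[k];

    int count = 0;
    for (int k = 0; k < n_bins; k++) {
        double hm = h[(k - 1 + n_bins) % n_bins];        /* h_{k-} */
        double hp = h[(k + 1) % n_bins];                 /* h_{k+} */
        if (h[k] > hm && h[k] > hp && h[k] >= t * hmax) {
            double theta_k = 2.0 * M_PI * k / n_bins;    /* bin center */
            theta[count++] = theta_k
                + (M_PI / n_bins) * (hm - hp) / (hm - 2.0 * h[k] + hp);
        }
    }
    return count;
}
```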
Figure 8: Reference orientation attribution. The width of the normalized patch P^ori (normalized with respect to scale and translation) is 6λ_ori σ_key. The gradient magnitude is weighted by a Gaussian window of standard deviation λ_ori σ_key. The gradient orientations are accumulated into an orientation histogram h, which is subsequently smoothed.
² The two-argument inverse tangent, unlike the single-argument one, determines the appropriate quadrant of the computed angle thanks to the extra information about the signs of the inputs: arctan2(x, y) = arctan(x/y) + (π/2) sign(y)(1 − sign(x)).
The normalized patch. For each keypoint (xkey , ykey , σkey , θkey ), a normalized patch is isolated
from the Gaussian scale-space image relative to the nearest discrete scale (o, s) to scale σkey , namely
vso . A sample (m, n) in vso , of coordinates (xm,n , ym,n ) = (mδ o , nδ o ) with respect to the sam-
pling grid of the input image, has normalized coordinates (x̂m,n , ŷm,n ) with respect to the keypoint
(xkey , ykey , σkey , θkey ),
x̂m,n = ((mδo − xkey ) cos θkey + (nδo − ykey ) sin θkey ) /σkey ,
ŷm,n = (−(mδo − xkey ) sin θkey + (nδo − ykey ) cos θkey ) /σkey . (24)
The normalized patch denoted by P descr is the set of samples (m, n) of vso with normalized coordinates
(x̂m,n , ŷm,n ) satisfying
max(|x̂m,n |, |ŷm,n |) ≤ λdescr . (25)
Keypoints whose distance to the image borders is less than √2 λ_descr σ are discarded to guarantee that the patch P^descr is included in the image. Note that no image re-sampling is performed. Each
sample (m, n) is characterized by the gradient orientation normalized with respect to the keypoint
orientation θkey ,
$$\hat{\theta}_{m,n} = \arctan\!2\left(\partial_m v_{s,m,n}^o,\ \partial_n v_{s,m,n}^o\right) - \theta_{key} \ \ (\mathrm{mod}\ 2\pi), \tag{26}$$
and its total contribution c^descr_{m,n}. The total contribution is the product of the gradient norm and a Gaussian weight (with standard deviation λ_descr σ_key) reducing the contribution of distant pixels,
$$c_{m,n}^{descr} = e^{-\frac{\|(m\delta_o,\, n\delta_o) - (x_{key},\, y_{key})\|^2}{2(\lambda_{descr}\sigma_{key})^2}}\, \left\|\left(\partial_m v_{s,m,n}^o,\ \partial_n v_{s,m,n}^o\right)\right\|. \tag{27}$$
The array of orientation histograms. The gradient orientation of each pixel in the normalized
patch P^descr is accumulated into an array of n_hist × n_hist orientation histograms (default value n_hist = 4). Each of these histograms, denoted by h^{i,j} for (i, j) ∈ {1, . . . , n_hist}², has an associated position with respect to the keypoint (x_key, y_key, σ_key, θ_key), given by
$$\hat{x}^i = \left(i - \frac{1 + n_{hist}}{2}\right)\frac{2\lambda_{descr}}{n_{hist}}, \qquad \hat{y}^j = \left(j - \frac{1 + n_{hist}}{2}\right)\frac{2\lambda_{descr}}{n_{hist}}.$$
Each sample (m, n) of the patch contributes to the histograms h^{i,j} and orientation bins k whose centers are close enough to its normalized position and orientation, namely those satisfying
$$|\hat{x}^i - \hat{x}_{m,n}| \le \frac{2\lambda_{descr}}{n_{hist}}, \qquad |\hat{y}^j - \hat{y}_{m,n}| \le \frac{2\lambda_{descr}}{n_{hist}} \qquad \text{and} \qquad |\hat{\theta}^k - \hat{\theta}_{m,n} \bmod 2\pi| \le \frac{2\pi}{n_{ori}}.$$
The contribution c^descr_{m,n} is split between these histograms and bins with weights that decrease linearly with the distances above (Equation (28)).
Figure 9: SIFT descriptor construction. No explicit re-sampling of the described normalized patch is performed. The normalized patch P^descr is partitioned into a set of n_hist × n_hist sub-patches (default value n_hist = 4). Each sample (m, n) inside P^descr, located at (mδ_o, nδ_o), contributes by an amount that is a function of its normalized coordinates (x̂_{m,n}, ŷ_{m,n}) (see (24)). Each sub-patch P^descr_{(i,j)} is centered at (x̂^i, ŷ^j).
The SIFT feature vector. The accumulated array of histograms is encoded into a feature vector f of length n_hist × n_hist × n_ori as follows:
$$f_{(i-1)\, n_{hist} n_{ori} + (j-1)\, n_{ori} + k} = h_k^{i,j},$$
where i = 1, . . . , n_hist, j = 1, . . . , n_hist and k = 1, . . . , n_ori. The components of the feature vector f are saturated to a maximum value of 20% of its Euclidean norm, i.e. f_k ← min(f_k, 0.2‖f‖). The saturation of the feature vector components seeks to reduce the impact of non-linear illumination changes, such as saturated regions.
The vector is finally renormalized so as to have ‖f‖₂ = 512 and quantized to 8-bit integers as follows: f_k ← min(⌊f_k⌋, 255). The quantization aims at accelerating the computation of distances between different feature vectors³. Figure 9 and Figure 11 illustrate how a SIFT feature vector is attributed to an oriented keypoint.
³ The executable provided by D. Lowe (http://www.cs.ubc.ca/~lowe/keypoints/, retrieved on September 11th, 2014) uses a different coordinate system (see the source code's README.txt for details).
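The final encoding of the feature vector can be sketched as follows. The routine is illustrative and assumes that the renormalization to ‖f‖ = 512 is applied after the saturation step, which the text above implies but does not spell out.

```c
#include <math.h>

/* Saturate each component to 20% of the Euclidean norm, renormalize so that
 * ||f|| = 512, and quantize to 8-bit integers: f_k <- min(floor(f_k), 255). */
static void encode_feature(const double *f_in, unsigned char *f_out, int len)
{
    double tmp[256];                         /* assumes len <= 256 (default 128) */
    double norm = 0.0, norm_sat = 0.0;

    for (int k = 0; k < len; k++)
        norm += f_in[k] * f_in[k];
    norm = sqrt(norm);

    for (int k = 0; k < len; k++) {          /* saturation: f_k <- min(f_k, 0.2 ||f||) */
        tmp[k] = f_in[k] < 0.2 * norm ? f_in[k] : 0.2 * norm;
        norm_sat += tmp[k] * tmp[k];
    }
    norm_sat = sqrt(norm_sat);

    for (int k = 0; k < len; k++) {          /* renormalize to 512 and quantize */
        double q = floor(512.0 * tmp[k] / norm_sat);
        f_out[k] = (unsigned char)(q < 255.0 ? q : 255.0);
    }
}
```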
Figure 10: Illustration of the spatial contribution of a sample inside the patch P^descr. The sample (m, n) contributes to the weighted histograms (2, 2) (green), (2, 3) (orange), (3, 2) (blue) and (3, 3) (pink). The contribution c^descr_{m,n} is split over four pairs of bins according to (28).
Figure 11: The image on top shows the n_hist × n_hist array of sub-patches relative to a keypoint; the corresponding n_ori-bin histograms are rearranged into a 1D vector (bottom). This vector is subsequently thresholded and normalized so that the Euclidean norm of each descriptor equals 512. The dimension of the feature vector in this example is 128, corresponding to the parameters n_hist = 4, n_ori = 8 (default values).
4.3 Pseudocodes
// Compute the corresponding bin index
b^ori_{m,n} = [ (n_bins / 2π) · ( arctan2( ∂_m v^{o_key}_{s_key,m,n} , ∂_n v^{o_key}_{s_key,m,n} ) mod 2π ) ]
// Smooth h
Apply six times a circular convolution with the filter [1, 1, 1]/3 to h.
// Extract the reference orientations
for 1 ≤ k ≤ n_bins do
    if h_k > h_{k−}, h_k > h_{k+} and h_k ≥ t max(h) then
        // Compute the reference orientation θ_key
        θ_key = θ_k + (π / n_bins) · (h_{k−} − h_{k+}) / (h_{k−} − 2 h_k + h_{k+})
note: [·] denotes the round function and arctan2 denotes the two-argument inverse tangent.
Temporary: h^{i,j}_k, array of orientation weighted histograms, (i, j) ∈ {1, . . . , n_hist}² and k ∈ {1, . . . , n_ori}
// Check if the keypoint is distant enough from the image borders
if √2 λ_descr σ_key ≤ x_key ≤ h − √2 λ_descr σ_key and √2 λ_descr σ_key ≤ y_key ≤ w − √2 λ_descr σ_key then
    // Update the nearest histograms and the nearest bins (Equation (28))
    for (i, j) ∈ {1, . . . , n_hist}² such that |x̂^i − x̂_{m,n}| ≤ 2λ_descr/n_hist and |ŷ^j − ŷ_{m,n}| ≤ 2λ_descr/n_hist do
        for k ∈ {1, . . . , n_ori} such that |θ̂^k − θ̂_{m,n} mod 2π| < 2π/n_ori do
            h^{i,j}_k ← h^{i,j}_k + (1 − (n_hist/(2λ_descr)) |x̂_{m,n} − x̂^i|) · (1 − (n_hist/(2λ_descr)) |ŷ_{m,n} − ŷ^j|) · (1 − (n_ori/(2π)) |θ̂_{m,n} − θ̂^k mod 2π|) · c^descr_{m,n}
Add (x, y, σ, θ, f) to L_E
5 Matching
The classical purpose of detecting and describing keypoints is to find matches (pairs of keypoints)
between images. In the absence of extra knowledge on the problem, for instance in the form of
geometric constraints, a matching procedure generally consists of two steps: the pairing of similar
keypoints from respective images and the selection of those that are reliable. In what follows, we
present the matching method described in the original article by D. Lowe [17]. Let LA and LB be
the set of descriptors associated to the keypoints detected in images uA and uB . The matching is
done by considering every descriptor of the list L_A and finding one possible match in the list L_B. Each descriptor f^a ∈ L_A is paired to the descriptor f^b ∈ L_B that minimizes the Euclidean distance between descriptors,
$$f^b = \arg\min_{f \in L_B} \|f - f^a\|_2.$$
Pairing a keypoint with descriptor f^a then requires computing the distances to all descriptors in L_B. A pair is considered reliable only if its absolute distance is below a certain threshold C^{absolute}_{match}. Otherwise it is discarded.
To avoid dependence on an absolute distance, the SIFT method uses the second nearest neighbor to define what constitutes a reliable match. SIFT applies an adaptive threshold $\|f^a - f^b\| < C^{relative}_{match}\, \|f^a - f^{b'}\|$, where f^{b'} is the second nearest neighbor,
$$f^{b'} = \arg\min_{f \in L_B \setminus \{f^b\}} \|f - f^a\|_2.$$
This is detailed in Algorithm 13. The major drawback of using a relative threshold is that it omits
detections for keypoints associated to a repeated structure in the image. Indeed, in that case, the
distance to the nearest and the second nearest descriptor would be comparable. More sophisticated
techniques have been developed to allow robust matching of images with repeated structures [20].
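A direct implementation of the matching rule above is sketched below; the names are illustrative and c_rel stands for the relative threshold C^relative_match.

```c
#include <float.h>
#include <math.h>

static double sq_dist(const double *a, const double *b, int len)
{
    double d = 0.0;
    for (int k = 0; k < len; k++) {
        double t = a[k] - b[k];
        d += t * t;
    }
    return d;
}

/* For each descriptor of A (nA rows of length len), find its nearest and second
 * nearest neighbors in B and keep the pair only if d1 < c_rel * d2 (Euclidean
 * distances).  matches[i] receives the index in B, or -1 if the pair is rejected. */
static void match_descriptors(const double *A, int nA, const double *B, int nB,
                              int len, double c_rel, int *matches)
{
    for (int i = 0; i < nA; i++) {
        double d1 = DBL_MAX, d2 = DBL_MAX;
        int best = -1;
        for (int j = 0; j < nB; j++) {
            double d = sq_dist(A + i * len, B + j * len, len);
            if (d < d1) { d2 = d1; d1 = d; best = j; }
            else if (d < d2) { d2 = d; }
        }
        matches[i] = (best >= 0 && sqrt(d1) < c_rel * sqrt(d2)) ? best : -1;
    }
}
```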
This matching algorithm runs in time c·NA ·NB , where NA and NB are the number of keypoints in
images uA and uB respectively, and c is a constant proportional to the time that takes to compare two
SIFT features. This is prohibitively slow for images of moderate size, although keypoint matching is
highly parallelizable. A better solution is to use more compact descriptors [26] that reduce the cost
of distance computation (and thus reduce the value of c). Among the proposed solutions we can find
more compact SIFT-like descriptors [2, 12] or binary descriptors [5, 24, 21] which take advantage of
the fast computation of the Hamming distance between two binary vectors.
6 Summary of Parameters
The online demo provided with this publication examines in detail the behavior of each stage of the
SIFT algorithm. In what follows, we summarize all the parameters that can be adjusted in the demo.
The structure of the digital scale-space is unequivocally characterized by four structural param-
eters: noct , nspo , σmin , δmin and by the blur level in the input image σin . The associated online demo
allows the user to change these values. They can be tuned to satisfy specific requirements. For
example, increasing the number of scales per octave nspo and the initial interpolation factor δmin
increases the number of detections. On the other hand, reducing them results in a faster algorithm.
The image structures that are potentially detected by SIFT have a scale ranging from σ_min to σ_min 2^{n_oct}. Therefore, it may seem natural to choose the lowest possible value of σ_min (i.e. σ_min = σ_in).
However, depending on the input image sharpness, low scale detections may be the result of aliasing
artifacts and should be avoided. Thus, a sound setting of parameter σmin should take into account
the image blur level σin and the possible presence of image aliasing.
The DoG thresholding, controlled by C_DoG, was conceived to filter out detections due to noise. With that aim in view, C_DoG should depend on the input image signal-to-noise ratio. It is however beyond the scope of this publication to analyze the soundness of such an approach. We will only point out that reducing C_DoG increases the number of detected keypoints. Recall that, since the DoG approximates (2^{1/n_spo} − 1)σ²∆v, its values depend on the number of scales per octave n_spo. The threshold applied in the provided source code is
$$\widetilde{C}_{DoG} = \frac{2^{1/n_{spo}} - 1}{2^{1/3} - 1}\, C_{DoG},$$
with C_DoG relative to n_spo = 3.
The threshold C_edge, applied to discard keypoints lying on edges, has in practice a negligible impact on the algorithm performance. Indeed, many candidate keypoints lying on edges were previously discarded during the extrema refinement.
many local maxima. Another algorithm design parameter, not included in Table 4 because of its
insignificant impact, is the level of smoothing applied to the histogram (Nconv = 6).
The size of the normalized patch used for computing the SIFT descriptor is governed by λdescr . A
larger patch will produce a more discriminative descriptor but will be less robust to scene deformation.
The number of histograms nhist × nhist and the number of bins nori can be set to make the feature
vector more compact. These architectural parameters govern the trade-off between robustness and discrimination.
Acknowledgements
This work was partially supported by the Centre National d’Etudes Spatiales (CNES, MISS Project),
the European Research Council (Advanced Grant Twelve Labours), the Office of Naval Research
(Grant N00014-97-1-0839), Direction Générale de l’Armement (DGA), Fondation Mathématique
Jacques Hadamard and Agence Nationale de la Recherche (Stereo project).
Image Credits
Crops of stereoscopic cards by T. Enami (1859-1929) were used in figures 3, 6, 8, 9 and 11.
References
[1] L. Alvarez and F. Morales, Affine morphological multiscale analysis of corners and multiple junctions, International Journal of Computer Vision, 25 (1997), pp. 95–107. http://dx.doi.org/10.1023/A:1007959616598.
[2] H. Bay, T. Tuytelaars, and L. van Gool, SURF: Speeded Up Robust Features, in Euro-
pean Conference on Computer Vision, 2006.
[3] M. Brown, R. Szeliski, and S. Winder, Multi-image matching using multi-scale oriented
patches, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2005.
http://dx.doi.org/10.1109/CVPR.2005.235.
[4] P.J. Burt and E.H. Adelson, The Laplacian pyramid as a compact image code, IEEE Transactions on Communications, 31 (1983), pp. 532–540. http://dx.doi.org/10.1109/TCOM.1983.1095851.
[5] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, BRIEF: Binary Robust Independent Elementary Features, in European Conference on Computer Vision, 2010, pp. 778–792. http://dx.doi.org/10.1007/978-3-642-15561-1_56.
[6] J.L. Crowley and R.M. Stern, Fast computation of the difference of low-pass transform, IEEE Transactions on Pattern Analysis and Machine Intelligence, (1984), pp. 212–222. http://dx.doi.org/10.1109/TPAMI.1984.4767504.
[7] Wolfgang Förstner, A framework for low level feature extraction, in European Conference
on Computer Vision, 1994, pp. 383–394. http://dx.doi.org/10.1007/BFb0028370.
[8] W. Förstner, T. Dickscheid, and F. Schindler, Detecting interpretable and accurate
scale-invariant keypoints, in Proceedings of IEEE International Conference on Computer Vision,
2009. http://dx.doi.org/10.1109/ICCV.2009.5459458.
[9] P. Getreuer, A survey of Gaussian convolution algorithms, Image Processing On Line, 3 (2013), pp. 286–310. http://dx.doi.org/10.5201/ipol.2013.87.
[10] C. Harris and M. Stephens, A combined corner and edge detector, in Alvey Vision Confer-
ence, vol. 15, 1988, p. 50. http://dx.doi.org/10.5244/C.2.23.
[11] T. Hassner, V. Mayzels, and L. Zelnik-Manor, On SIFTs and their scales, in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1522–1528. http://dx.doi.org/10.1109/CVPR.2012.6247842.
[12] Y. Ke and R. Sukthankar, PCA-SIFT: A more distinctive representation for local image
descriptors, in IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2004.
http://dx.doi.org/10.1109/CVPR.2004.1315206.
[13] S. Leutenegger, M. Chli, and R.Y. Siegwart, BRISK: Binary Robust Invariant Scalable Keypoints, in Proceedings of IEEE International Conference on Computer Vision, 2011. http://dx.doi.org/10.1109/ICCV.2011.6126542.