
Published in Image Processing On Line on 2014–12–22.
Submitted on 2013–03–27, accepted on 2014–12–02.
ISSN 2105–1232 © 2014 IPOL & the authors CC–BY–NC–SA
This article is available online with supplementary materials, software, datasets and online demo at https://doi.org/10.5201/ipol.2014.82
Citation: Ives Rey-Otero, Mauricio Delbracio, Anatomy of the SIFT Method, Image Processing On Line, 4 (2014), pp. 370–396.

Anatomy of the SIFT Method

Ives Rey-Otero¹, Mauricio Delbracio²

¹ CMLA, ENS Cachan, France ([email protected])
² CMLA, ENS Cachan, France and ECE, Duke University ([email protected])

Communicated by Gabriele Facciolo. Demo edited by Ives Rey-Otero.

Abstract

This article presents a detailed description and implementation of the Scale Invariant Feature Transform (SIFT), a popular image matching algorithm. This work contributes a detailed dissection of SIFT's complex chain of transformations and a careful presentation of each of its design parameters. A companion online demonstration allows the reader to use SIFT and individually set each parameter to analyze its impact on the algorithm results.

Source Code

The source code (ANSI C), its documentation, and the online demo are accessible at the IPOL web page of this article¹.

Keywords: sift; feature detection; image comparison

1 General Description
The scale invariant feature transform, SIFT [17], extracts a set of descriptors from an image. The
extracted descriptors are invariant to image translation, rotation and scaling (zoom-out). SIFT
descriptors have also proved to be robust to a wide family of image transformations, such as slight changes of viewpoint, noise, blur, contrast changes and scene deformation, while remaining discriminative enough for matching purposes.
The seminal paper introducing SIFT in 1999 [16] has sparked an explosion of competitors.
SURF [2], Harris and Hessian based detectors [19], MOPS [3], ASIFT [28] and SFOP [8], with
methods using binary descriptors such as BRISK [13] and ORB [24], are just a few of the successful
variants. These add to the numerous non multi-scale detectors such as the Harris-Stephens detec-
tor [10], SUSAN [25], the Förstner detector [7], the morphological corner detector [1] and the machine
learning based FAST [23] and AGAST [18].
The SIFT algorithm consists of two successive and independent operations: the detection of
interesting points (i.e. keypoints) and the extraction of a descriptor associated to each of them.
Since these descriptors are robust, they are usually used for matching pairs of images. Object
recognition and video stabilization are other popular application examples. Although the comparison of descriptors is not, strictly speaking, a step of the SIFT method, we have included it in our description for the sake of completeness.

¹ https://doi.org/10.5201/ipol.2014.82


The algorithm principle. SIFT detects a series of keypoints from a multiscale image represen-
tation. This multiscale representation consists of a family of increasingly blurred images. Each
keypoint is a blob-like structure whose center position (x, y) and characteristic scale σ are accurately
located. SIFT computes the dominant orientation θ over a region surrounding each one of these
keypoints. For each keypoint, the quadruple (x, y, σ, θ) defines the center, size and orientation of a
normalized patch where the SIFT descriptor is computed. As a result of this normalization, SIFT
keypoint descriptors are in theory invariant to any translation, rotation and scale change. The de-
scriptor encodes the spatial gradient distribution around a keypoint by a 128-dimensional vector.
This feature vector is generally used to match keypoints extracted from different images.

The algorithmic chain. In order to attain scale invariance, SIFT is built on the Gaussian scale-
space, a multiscale image representation simulating the family of all possible zoom-outs through
increasingly blurred versions of the input image (see [27] for a gentle introduction to the subject). In
this popular multiscale framework, the Gaussian convolution acts as an approximation of the optical
blur, and the Gaussian kernel approximates the camera’s point spread function. Thus, the Gaussian
scale-space can be interpreted as a family of images, each of them corresponding to a different zoom
factor. The Gaussian scale-space representation is presented in Section 2.
To attain translation, rotation and scale invariance, the extracted keypoints must be related to
structures that are unambiguously located, both in scale and position. This excludes image corners
and edges, since they cannot be precisely localized in both scale and space. Image blobs, or more complex local structures characterized by their position and size, are therefore the most suitable structures for SIFT.
Detecting and locating keypoints consists in computing the 3d extrema of a differential operator
applied to the scale-space. The differential operator used in the SIFT algorithm is the difference
of Gaussians (DoG), presented in Section 3.1. The extraction of 3d continuous extrema consists
of two steps: first, the DoG representation is scanned for 3d discrete extrema. This gives a first
coarse location of the extrema, which are then refined to subpixel precision using a local quadratic
model. The extraction of 3d extrema is detailed in Section 3.2. Since there are many phenomena
that can lead to the detection of unstable keypoints, SIFT incorporates a cascade of tests to discard
the less reliable ones. Only those that are precisely located and sufficiently contrasted are retained.
Section 3.3 discusses two different discarding steps: the rejection of 3d extrema with a small DoG value and the rejection of keypoint candidates lying on edges.
SIFT invariance to rotation is obtained by assigning to each keypoint a reference orientation.
This reference is computed from the gradient orientation over a keypoint neighborhood. This step
is detailed in Section 4.1. Finally the spatial distribution of the gradient inside an oriented patch
is encoded to produce the SIFT keypoint descriptor. The design of the SIFT keypoint descriptor is
described in Section 4.2. This ends the algorithmic chain defining the SIFT algorithm. Additionally,
Section 5 illustrates how SIFT descriptors can be used to find local matches between pairs of images.
The method presented here is the matching procedure described in the original paper by D. Lowe [16].
This complex chain of transformations is governed by a large number of design parameters.
Section 6 summarizes them and provides an analysis of their respective influence. Table 1 presents
the details of the adopted notation while the consecutive steps of the SIFT algorithm are summarized
in Table 2.

2 The Gaussian Scale-Space


The Gaussian scale-space representation is a family of increasingly blurred images. This blurring process simulates the loss of detail produced when a scene is photographed from farther and farther away (i.e. when the zoom-out factor increases).

$u$: images, defined on the continuous domain $(x, y) = \mathbf{x} \in \mathbb{R}^2$.
$u$: digital images, defined on a rectangular grid $(m, n) \in \{0, \ldots, M-1\} \times \{0, \ldots, N-1\}$.
$v$: Gaussian scale-space, defined on the continuous domain $(\sigma, \mathbf{x}) \in \mathbb{R}^+ \times \mathbb{R}^2$.
$v$: digital Gaussian scale-space, list of octaves $v = (v^o)$, $o = 1, \ldots, n_{oct}$. Each octave $v^o$ is defined on a discrete grid $(s, m, n) \in \{0, \ldots, n_{spo}+2\} \times \{0, \ldots, M_o-1\} \times \{0, \ldots, N_o-1\}$.
$w$: difference of Gaussians (DoG), defined on the continuous domain $(\sigma, \mathbf{x}) \in \mathbb{R}^+ \times \mathbb{R}^2$.
$w$: digital difference of Gaussians (DoG), list of octaves $w = (w^o)$, $o = 1, \ldots, n_{oct}$. Each octave $w^o$ is defined on a discrete grid $(s, m, n) \in \{0, \ldots, n_{spo}+1\} \times \{0, \ldots, M_o-1\} \times \{0, \ldots, N_o-1\}$.
$\omega$: DoG value after 3d extremum subpixel refinement.
$\partial_x v$: scale-space gradient along $x$ ($\partial_y v$ along $y$), defined on the continuous domain $(\sigma, \mathbf{x}) \in \mathbb{R}^+ \times \mathbb{R}^2$.
$\partial_m v$: digital scale-space gradient along $x$ ($\partial_n v$ along $y$), list of octaves $\partial_m v = (\partial_m v^o)$, $o = 1, \ldots, n_{oct}$. Each octave $\partial_m v^o$ is defined on a discrete grid $(s, m, n) \in \{2, \ldots, n_{spo}\} \times \{1, \ldots, M_o-2\} \times \{1, \ldots, N_o-2\}$.
$G_\rho$: continuous Gaussian convolution of standard deviation $\rho$.
$G_\rho$: digital Gaussian convolution of standard deviation $\rho$ (see (4)).
$S_2$: subsampling operator by a factor 2, $(S_2 u)(m, n) = u(2m, 2n)$.
$I_\delta$: digital bilinear interpolator by a factor $1/\delta$ (see Algorithm 2).

Table 1: Summary of the notation used in the article.

The scale-space, therefore, provides SIFT with scale invariance, as it can be interpreted as the simulation of a set of snapshots of a given scene taken at different distances. In what follows we detail the construction of the SIFT scale-space.

2.1 Gaussian Blurring


Consider a continuous image $u(\mathbf{x})$ defined for every $\mathbf{x} = (x, y) \in \mathbb{R}^2$. The continuous Gaussian smoothing is defined as the convolution
$$G_\sigma u(\mathbf{x}) := \int G_\sigma(\mathbf{x}')\, u(\mathbf{x} - \mathbf{x}')\, d\mathbf{x}',$$
where $G_\sigma(\mathbf{x}) = \frac{1}{2\pi\sigma^2} e^{-\frac{|\mathbf{x}|^2}{2\sigma^2}}$ is the Gaussian kernel parameterized by its standard deviation $\sigma \in \mathbb{R}^+$. The Gaussian smoothing operator satisfies a semi-group relation,
$$G_{\sigma_2}(G_{\sigma_1} u)(\mathbf{x}) = G_{\sqrt{\sigma_1^2 + \sigma_2^2}}\, u(\mathbf{x}). \tag{1}$$
We call Gaussian scale-space of $u$ the three-dimensional (3d) function
$$v : (\sigma, \mathbf{x}) \mapsto G_\sigma u(\mathbf{x}). \tag{2}$$

In the case of digital images there is some ambiguity on how to define a discrete counterpart to the
continuous Gaussian smoothing operator [9, 22]. In the present work as in Lowe’s original work, the
digital Gaussian smoothing is implemented as a discrete convolution with samples of a truncated
Gaussian kernel.


1. Compute the Gaussian scale-space. in: image $u$; out: scale-space $v$.
2. Compute the difference of Gaussians (DoG). in: scale-space $v$; out: DoG $w$.
3. Find candidate keypoints (3d discrete extrema of the DoG). in: DoG $w$; out: $\{(x_d, y_d, \sigma_d)\}$, list of discrete extrema (position and scale).
4. Refine the candidate keypoint locations with sub-pixel precision. in: DoG $w$ and $\{(x_d, y_d, \sigma_d)\}$, list of discrete extrema; out: $\{(x, y, \sigma)\}$, list of interpolated extrema.
5. Filter unstable keypoints due to noise. in: DoG $w$ and $\{(x, y, \sigma)\}$; out: $\{(x, y, \sigma)\}$, list of filtered keypoints.
6. Filter unstable keypoints lying on edges. in: DoG $w$ and $\{(x, y, \sigma)\}$; out: $\{(x, y, \sigma)\}$, list of filtered keypoints.
7. Assign a reference orientation to each keypoint. in: scale-space gradient $(\partial_m v, \partial_n v)$ and $\{(x, y, \sigma)\}$, list of keypoints; out: $\{(x, y, \sigma, \theta)\}$, list of oriented keypoints.
8. Build the keypoint descriptors. in: scale-space gradient $(\partial_m v, \partial_n v)$ and $\{(x, y, \sigma, \theta)\}$, list of oriented keypoints; out: $\{(x, y, \sigma, \theta, f)\}$, list of described keypoints.

Table 2: Summary of the SIFT algorithm.

Digital Gaussian smoothing. Let $g_\sigma$ be the one-dimensional digital kernel obtained by sampling a truncated Gaussian function of standard deviation $\sigma$,
$$g_\sigma(k) = K e^{-\frac{k^2}{2\sigma^2}}, \quad -\lceil 4\sigma \rceil \le k \le \lceil 4\sigma \rceil, \; k \in \mathbb{Z}, \tag{3}$$
where $\lceil \cdot \rceil$ denotes the ceil function and $K$ is set so that $\sum_k g_\sigma(k) = 1$. Let $G_\sigma$ denote the digital Gaussian convolution of parameter $\sigma$ and $u$ be a digital image of size $M \times N$. Its digital Gaussian smoothing, denoted by $G_\sigma u$, is computed via a separable two-dimensional (2d) discrete convolution
$$G_\sigma u(k, l) := \sum_{k' = -\lceil 4\sigma \rceil}^{\lceil 4\sigma \rceil} \sum_{l' = -\lceil 4\sigma \rceil}^{\lceil 4\sigma \rceil} g_\sigma(k')\, g_\sigma(l')\, \bar{u}(k - k', l - l'), \tag{4}$$
where $\bar{u}$ denotes the extension of $u$ to $\mathbb{Z}^2$ via symmetrization with respect to $-1/2$, namely,
$$\bar{u}(k, l) = u(s_M(k), s_N(l)) \quad \text{with} \quad s_M(k) = \min(k \bmod 2M,\, 2M - 1 - k \bmod 2M).$$

For the range of values of $\sigma$ considered in the described algorithm (i.e. $\sigma \ge 0.7$), the digital Gaussian smoothing operator approximately satisfies a semi-group relation, with an error below $10^{-4}$ for pixel intensity values ranging from 0 to 1 [22]. Applying successively two digital Gaussian smoothings of parameters $\sigma_1$ and $\sigma_2$ is approximately equal to applying one digital Gaussian smoothing of parameter $\sqrt{\sigma_1^2 + \sigma_2^2}$,
$$G_{\sigma_2}(G_{\sigma_1} u) = G_{\sqrt{\sigma_1^2 + \sigma_2^2}}\, u. \tag{5}$$
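To make the preceding definitions concrete, the digital Gaussian smoothing of equations (3) and (4) can be sketched in a few lines of Python. This is an illustrative reimplementation under our own naming, not the reference ANSI C code:

```python
import numpy as np

def gaussian_kernel(sigma):
    """Sampled truncated Gaussian of equation (3), supported on
    {-ceil(4*sigma), ..., ceil(4*sigma)} and normalized to sum to 1."""
    r = int(np.ceil(4 * sigma))
    k = np.arange(-r, r + 1)
    g = np.exp(-k**2 / (2 * sigma**2))
    return g / g.sum()

def gaussian_blur(u, sigma):
    """Separable digital Gaussian convolution of equation (4). The image is
    extended by symmetrization with respect to -1/2 (the border sample is
    mirrored), which is what numpy's 'symmetric' padding computes."""
    g = gaussian_kernel(sigma)
    r = len(g) // 2
    up = np.pad(u, r, mode='symmetric')
    rows = np.apply_along_axis(lambda x: np.convolve(x, g, mode='valid'), 1, up)
    return np.apply_along_axis(lambda x: np.convolve(x, g, mode='valid'), 0, rows)
```

Because the kernel is normalized, constant images are preserved; and by the approximate semi-group relation (5), `gaussian_blur(gaussian_blur(u, s1), s2)` is close to `gaussian_blur(u, sqrt(s1**2 + s2**2))`.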


2.2 Digital Gaussian Scale-Space


As previously introduced, the Gaussian scale-space $v : (\mathbf{x}, \sigma) \mapsto G_\sigma u(\mathbf{x})$ is a family of increasingly blurred images, where the scale-space position $(\mathbf{x}, \sigma)$ refers to the pixel $\mathbf{x}$ in the image generated
with blur σ. In what follows, we detail how to compute the digital scale-space, a discrete counterpart
of the continuous Gaussian scale-space.
We will call digital scale-space a family of digital images relative to a discrete set of blur levels
and different sampling rates, all of them derived from an input image uin with assumed blur level σin .
This family is split into subfamilies of images sharing a common sampling rate. Since in the original
SIFT algorithm the sampling rate is iteratively decreased by a factor of two, these subfamilies are
called octaves.
Let noct be the total number of octaves in the digital scale-space, o ∈ {1, . . . , noct } be the index of
each octave, and δo its inter-pixel distance. We will adopt as a convention that the input image uin
inter-pixel distance is δin = 1. Thus, an inter-pixel distance δ = 0.5 corresponds to a 2× upsampling
of this image while a 2× subsampling results in an inter-pixel distance δ = 2. Let nspo be the
number of scales per octave (the default value is nspo = 3). Each octave o contains the images vso for
s = 1, . . . , nspo , each of them with a different blur level σso . The blur level in the digital scale-space
is measured taking as unit length the inter-sample distance in the sampling grid of the input image
uin (i.e. δin = 1). The adopted configuration is illustrated in Figure 1.

Figure 1: Convention adopted for the sampling grid of the digital scale-space v. The blur level is
considered with respect to the sampling grid of the input image. The parameters are set to their
default value, namely σmin = 0.8, nspo = 3, noct = 8, σin = 0.5.

The digital scale-space also includes three additional images per octave, $v_0^o$, $v_{n_{spo}+1}^o$ and $v_{n_{spo}+2}^o$. The rationale for this will become clear later.

The construction of the digital scale-space begins with the computation of a seed image denoted by $v_0^1$. This image has a blur level $\sigma_0^1 = \sigma_{min}$, which is the minimum blur level considered, and inter-pixel distance $\delta_0 = \delta_{min}$. It is computed from $u_{in}$ by
$$v_0^1 = G_{\frac{1}{\delta_{min}}\sqrt{\sigma_{min}^2 - \sigma_{in}^2}}\, I_{\delta_{min}} u_{in}, \tag{6}$$
where $I_{\delta_{min}}$ is the digital bilinear interpolator by a factor $1/\delta_{min}$ (see Algorithm 2) and $G_\sigma$ is the digital Gaussian convolution already defined. The entire digital scale-space is derived from this seed image. The default value $\delta_{min} = 0.5$ implies an initial 2× interpolation. The blur level of the seed image, relative to the input image sampling grid, is set by default to $\sigma_{min} = 0.8$.
The second and subsequent scale-space images $s = 1, \ldots, n_{spo}+2$ of each octave $o$ are computed recursively according to
$$v_s^o = G_{\rho_{(s-1)\to s}}\, v_{s-1}^o, \tag{7}$$
where
$$\rho_{(s-1)\to s} = \frac{\sigma_{min}}{\delta_{min}} \sqrt{2^{2s/n_{spo}} - 2^{2(s-1)/n_{spo}}}.$$
The first image (i.e. $s = 0$) of each octave $o = 2, \ldots, n_{oct}$ is computed as
$$v_0^o = S_2\, v_{n_{spo}}^{o-1}, \tag{8}$$
where $S_2$ denotes the subsampling operator by a factor of 2, $(S_2 u)(m, n) = u(2m, 2n)$. This procedure produces a family of images $(v_s^o)$, $o = 1, \ldots, n_{oct}$ and $s = 0, \ldots, n_{spo}+2$, having inter-pixel distance
$$\delta_o = \delta_{min}\, 2^{o-1} \tag{9}$$
and blur level
$$\sigma_s^o = \frac{\delta_o}{\delta_{min}}\, \sigma_{min}\, 2^{s/n_{spo}}. \tag{10}$$
Consequently, the simulated blurs follow a geometric progression. The scale-space construction process is summarized in Algorithm 1. The digital scale-space construction is thus defined by five parameters:

- the number of octaves $n_{oct}$,
- the number of scales per octave $n_{spo}$,
- the sampling distance $\delta_{min}$ of the first image of the scale-space $v_0^1$,
- the blur level $\sigma_{min}$ of the first image of the scale-space $v_0^1$, and
- the assumed blur level $\sigma_{in}$ of the input image $u_{in}$.

The diagram in Figure 2 depicts the digital scale-space architecture in terms of the sampling rates and blur levels. Each point symbolizes a scale-space image $v_s^o$ having inter-pixel distance $\delta_o$ and blur level $\sigma_s^o$. The featured configuration is produced from the default parameter values of the Lowe SIFT algorithm: $\sigma_{min} = 0.8$, $\delta_{min} = 0.5$, $n_{spo} = 3$, and $\sigma_{in} = 0.5$. The number of octaves $n_{oct}$ is bounded above by the number of possible subsamplings. Figure 3 shows an example of the digital scale-space images generated with this configuration.
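The octave geometry defined by equations (7), (9) and (10) is compact enough to be spelled out in code. The following Python sketch (ours, with hypothetical names) lists, for each image of the scale-space, its inter-pixel distance, its blur level, and the incremental blur applied to produce it:

```python
import math

def scale_space_geometry(n_oct=8, n_spo=3, delta_min=0.5, sigma_min=0.8):
    """Yield (o, s, delta_o, sigma_so, rho) for every scale-space image."""
    for o in range(1, n_oct + 1):
        delta_o = delta_min * 2 ** (o - 1)                # equation (9)
        for s in range(n_spo + 3):                        # s = 0, ..., n_spo + 2
            sigma_so = (delta_o / delta_min) * sigma_min * 2 ** (s / n_spo)  # (10)
            if s == 0:
                rho = None  # obtained by subsampling (o > 1) or as the seed (o = 1)
            else:
                rho = (sigma_min / delta_min) * math.sqrt(
                    2 ** (2 * s / n_spo) - 2 ** (2 * (s - 1) / n_spo))  # equation (7)
            yield o, s, delta_o, sigma_so, rho
```

With the default parameters, the generated blur levels of the first octave are 0.8, 1.01, 1.27, 1.6, 2.02 and 2.54, matching the configuration of Figure 2.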

3 Keypoint Definition
Precisely detecting interesting image features is a challenging problem. The keypoint features are
defined in SIFT as the extrema of the normalized Laplacian scale-space $\sigma^2 \Delta v$ [14]. A Laplacian extremum is unequivocally characterized by its scale-space coordinates $(\sigma, \mathbf{x})$, where $\mathbf{x}$ refers to its center spatial position and $\sigma$ relates to its size (scale).

[Figure 2 comprises two panels: (a) the scale-space construction diagram and (b) the default scale-space configuration, whose successive blur levels are 0.5, 0.8, 1.01, 1.27, 1.6, 2.02, 2.54, 3.2, 4.03, 5.08, 6.4, 8.06, 10.16, 12.8, 16.13, 20.32, 25.6, 32.25, 40.64.]
Figure 2: (a) The succession of subsamplings and Gaussian convolutions that results in the SIFT scale-space. The first image of each octave $v_0^o$ is obtained via subsampling, with the exception of the first image of the first octave $v_0^1$, which is generated by a bilinear interpolation followed by a Gaussian convolution. (b) An illustration of the digital scale-space in its default configuration. The digital scale-space $v$ is composed of images $v_s^o$ for $o = 1, \ldots, n_{oct}$ and $s = 0, \ldots, n_{spo}+2$. All images are computed directly or indirectly from $u_{in}$ (in blue). Each image is characterized by its blur level and its inter-pixel distance, denoted $\sigma$ and $\delta$ respectively. The scale-space is split into octaves: sets of images sharing a common sampling rate. Each octave is composed of $n_{spo}$ scales (in red) and three other auxiliary scales (in gray). The depicted configuration features $n_{oct} = 5$ octaves and corresponds to the following parameter settings: $n_{spo} = 3$, $\sigma_{min} = 0.8$. The assumed blur level of the input image is $\sigma_{in} = 0.5$.

As will be presented in Section 4, the covariance of the extremum $(\sigma, \mathbf{x})$ induces the invariance to translation and scale of its associated descriptor.

Instead of computing the Laplacian of the image scale-space, SIFT uses a difference of Gaussians operator (DoG), first introduced by Burt and Adelson [4] and Crowley and Stern [6]. Let $v$ be a Gaussian scale-space and $\kappa > 1$. The difference of Gaussians (DoG) of ratio $\kappa$ is defined by $w : (\sigma, \mathbf{x}) \mapsto v(\kappa\sigma, \mathbf{x}) - v(\sigma, \mathbf{x})$. The DoG operator takes advantage of the link between the Gaussian kernel and the heat equation to approximately compute the normalized Laplacian $\sigma^2 \Delta v$. Indeed, for a set of simulated blurs following a geometric progression of ratio $\kappa$, the heat equation is

[Figure 3 shows six crops: $v_1^1$ ($\delta_1 = 0.5$, $\sigma_1^1 = 1.0$), $v_2^2$ ($\delta_2 = 1.0$, $\sigma_2^2 = 2.5$), $v_2^3$ ($\delta_3 = 2.0$, $\sigma_2^3 = 5.1$), $v_2^4$ ($\delta_4 = 4.0$, $\sigma_2^4 = 10.2$), $v_3^5$ ($\delta_5 = 8.0$, $\sigma_3^5 = 25.6$), $v_5^5$ ($\delta_5 = 8.0$, $\sigma_5^5 = 40.6$).]

Figure 3: Crops of a subset of images extracted from the Gaussian scale-space of an example image.
The scale-space parameters are set to nspo = 3, σmin = 0.8, and the assumed input image blur level
σin = 0.5. Image pixels are represented by a square of side δo for better visualization.

approximated by
$$\sigma \Delta v = \frac{\partial v}{\partial \sigma} \approx \frac{v(\kappa\sigma, \mathbf{x}) - v(\sigma, \mathbf{x})}{\kappa\sigma - \sigma} = \frac{w(\sigma, \mathbf{x})}{(\kappa - 1)\sigma}. \tag{11}$$
Thus, we have $w(\sigma, \mathbf{x}) \approx (\kappa - 1)\, \sigma^2 \Delta v(\sigma, \mathbf{x})$.
The SIFT keypoints of an image are defined as the 3d extrema of the difference of Gaussians
(DoG). Since we deal with digital images, the continuous 3d extrema of the DoG cannot be directly
computed. Thus, the discrete extrema of the digital DoG are first detected and then their positions
are refined. The detected points are finally validated to discard possible unstable and false detections
due to noise. Hence, the detection of SIFT keypoints involves the following steps:

1. compute the digital DoG;


2. scan the digital DoG for 3d discrete extrema;
3. refine the position and scale of these candidates via a quadratic interpolation;
4. discard unstable candidates, such as low contrasted candidates or candidates lying on edges.

We detail each one of these steps in what follows.

3.1 Scale-Space Analysis: Difference of Gaussians


The digital DoG $w$ is built from the digital scale-space $v$. For each octave $o = 1, \ldots, n_{oct}$ and for each image $w_s^o$ with $s = 0, \ldots, n_{spo}+1$,
$$w_s^o(m, n) = v_{s+1}^o(m, n) - v_s^o(m, n)$$


Algorithm 1: Computation of the digital Gaussian scale-space

Input: $u_{in}$, input digital image of $M \times N$ pixels.
Output: $(v_s^o)$, digital scale-space, $o = 1, \ldots, n_{oct}$ and $s = 0, \ldots, n_{spo}+2$.
$v_s^o$ is a digital image of size $M_o \times N_o$, blur level $\sigma_s^o$ (Equation (10)) and inter-pixel distance $\delta_o = \delta_{min} 2^{o-1}$, with $M_o = \lfloor \delta_{min} M / \delta_o \rfloor$ and $N_o = \lfloor \delta_{min} N / \delta_o \rfloor$. The samples of $v_s^o$ are denoted by $v_s^o(m, n)$.
Parameters: - $n_{oct}$, number of octaves.
- $n_{spo}$, number of scales per octave.
- $\sigma_{min}$, blur level in the seed image.
- $\delta_{min}$, inter-sample distance in the seed image.
- $\sigma_{in}$, assumed blur level in the input image.

// Compute the first octave
// Compute the seed image $v_0^1$
// 1. Interpolate the original image (bilinear interpolation, see Algorithm 2)
$u' \leftarrow \text{bilinear\_interpolation}(u_{in}, \delta_{min})$
// 2. Blur the interpolated image (Gaussian blur, see Equation (4))
$v_0^1 = G_{\frac{1}{\delta_{min}}\sqrt{\sigma_{min}^2 - \sigma_{in}^2}}\, u'$
// Compute the other images in the first octave
for $s = 1, \ldots, n_{spo}+2$ do
    $v_s^1 = G_{\rho_{(s-1)\to s}}\, v_{s-1}^1$
// Compute the subsequent octaves
for $o = 2, \ldots, n_{oct}$ do
    // Compute the first image of the octave by subsampling
    for $m = 0, \ldots, M_o - 1$ and $n = 0, \ldots, N_o - 1$ do
        $v_0^o(m, n) \leftarrow v_{n_{spo}}^{o-1}(2m, 2n)$
    // Compute the other images in octave $o$
    for $s = 1, \ldots, n_{spo}+2$ do
        $v_s^o = G_{\rho_{(s-1)\to s}}\, v_{s-1}^o$
Algorithm 2: Bilinear interpolation of an image

Input: $u$, digital image of $M \times N$ pixels. The samples are denoted by $u(m, n)$.
Output: $u'$, digital image of $M' \times N'$ pixels, with $M' = \lfloor M/\delta' \rfloor$ and $N' = \lfloor N/\delta' \rfloor$.
Parameter: $\delta' < 1$, inter-pixel distance of the output image.
for $m' = 0, \ldots, M'-1$ and $n' = 0, \ldots, N'-1$ do
    $x \leftarrow \delta' m'$, $y \leftarrow \delta' n'$
    $u'(m', n') \leftarrow (x - \lfloor x \rfloor)(y - \lfloor y \rfloor)\, \bar{u}(\lfloor x \rfloor + 1, \lfloor y \rfloor + 1) + (1 + \lfloor x \rfloor - x)(y - \lfloor y \rfloor)\, \bar{u}(\lfloor x \rfloor, \lfloor y \rfloor + 1) + (x - \lfloor x \rfloor)(1 + \lfloor y \rfloor - y)\, \bar{u}(\lfloor x \rfloor + 1, \lfloor y \rfloor) + (1 + \lfloor x \rfloor - x)(1 + \lfloor y \rfloor - y)\, \bar{u}(\lfloor x \rfloor, \lfloor y \rfloor)$

$\bar{u}$ denotes the extension of $u$ to $\mathbb{Z}^2$ via symmetrization with respect to $-1/2$: $\bar{u}(k, l) = u(s_M(k), s_N(l))$ with $s_M(k) = \min(k \bmod 2M,\, 2M - 1 - k \bmod 2M)$ (and $s_N$ defined analogously with $N$).
note: $\lfloor \cdot \rfloor$ denotes the floor function.
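A direct Python transcription of Algorithm 2 may help fix the indexing conventions; this is our sketch, not the reference implementation:

```python
import numpy as np

def bilinear_interpolation(u, delta):
    """Bilinear upsampling of Algorithm 2, for an inter-pixel distance
    delta < 1. The image is extended by symmetrization w.r.t. -1/2."""
    M, N = u.shape
    Mp, Np = int(M / delta), int(N / delta)

    def ubar(k, l):
        # symmetrized extension of u to arbitrary integer indices
        sk = np.minimum(k % (2 * M), 2 * M - 1 - k % (2 * M))
        sl = np.minimum(l % (2 * N), 2 * N - 1 - l % (2 * N))
        return u[sk, sl]

    mp, npp = np.meshgrid(np.arange(Mp), np.arange(Np), indexing='ij')
    x, y = delta * mp, delta * npp
    xf, yf = np.floor(x).astype(int), np.floor(y).astype(int)
    ax, ay = x - xf, y - yf  # fractional parts
    return (ax * ay * ubar(xf + 1, yf + 1) + (1 - ax) * ay * ubar(xf, yf + 1)
            + ax * (1 - ay) * ubar(xf + 1, yf) + (1 - ax) * (1 - ay) * ubar(xf, yf))
```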

with $m = 0, \ldots, M_o - 1$, $n = 0, \ldots, N_o - 1$. The image $w_s^o$ is attributed the blur level $\sigma_s^o$. This computation is illustrated in Figure 4 and summarized in Algorithm 3. Note how, in the digital scale-space, the auxiliary image $v_{n_{spo}+2}^o$ is required for computing the DoG approximation $w_{n_{spo}+1}^o$. Figure 5 illustrates the DoG scale-space $w$ relative to the previously introduced Gaussian scale-space $v$. Figure 6 shows images of an example DoG scale-space.


Figure 4: The difference of Gaussians operator is computed by subtracting pairs of contiguous scale-
space images. The procedure is not centered: the difference between the images at scales κσ and σ
is attributed a blur level σ.

Figure 5: The DoG scale-space. The difference of Gaussians acts as an approximation of the normalized Laplacian $\sigma^2 \Delta$. The difference $w_s^o = v_{s+1}^o - v_s^o$ is relative to the blur level $\sigma_s^o$. Each octave contains $n_{spo}$ images plus two auxiliary images (in black).

3.2 Extraction of Candidate Keypoints


Continuous 3d extrema of the digital DoG are calculated in two successive steps. The 3d discrete extrema are first extracted from $(w_s^o)$ with pixel precision; then their locations are refined through interpolation of the digital DoG using a quadratic model. In what follows, the samples $v_s^o(m, n)$ and $w_s^o(m, n)$ are denoted respectively $v_{s,m,n}^o$ and $w_{s,m,n}^o$ for better readability.

Detection of DoG 3d discrete extrema. Each sample $w_{s,m,n}^o$ of the DoG scale-space, with $s = 1, \ldots, n_{spo}$, $o = 1, \ldots, n_{oct}$, $m = 1, \ldots, M_o - 2$, $n = 1, \ldots, N_o - 2$ (which excludes the image borders and the auxiliary images), is compared to its neighbors to detect the 3d discrete maxima and minima (the number of neighbors is $26 = 3 \times 3 \times 3 - 1$). Algorithm 4 summarizes the extraction of 3d extrema from the digital DoG. These comparisons are possible thanks to the auxiliary images $w_0^o$ and $w_{n_{spo}+1}^o$ calculated for each octave $o$. This scanning process is nevertheless a rudimentary way to detect candidate points. It is sensitive to noise, produces unstable detections, and the information it provides regarding location and scale may be flawed since it is constrained to the sampling grid. To amend these shortcomings, this preliminary step is followed by an interpolation that refines the localization of the extrema and by a cascade of filters that discard unreliable detections.

Keypoint position refinement. The location of the discrete extrema is constrained to the sampling grid (defined by the octave $o$). This coarse localization hinders a rigorous covariance property of the set of keypoints, and is subsequently an obstacle to the full scale and translation invariance of the corresponding descriptor. SIFT refines the position and scale of each candidate keypoint using a local interpolation model.

[Figure 6 shows six crops: $w_1^1$ ($\delta_1 = 0.5$, $\sigma_1^1 = 1.0$), $w_2^2$ ($\delta_2 = 1.0$, $\sigma_2^2 = 2.5$), $w_2^3$ ($\delta_3 = 2.0$, $\sigma_2^3 = 5.1$), $w_2^4$ ($\delta_4 = 4.0$, $\sigma_2^4 = 10.2$), $w_1^5$ ($\delta_5 = 8.0$, $\sigma_1^5 = 16.1$), $w_3^5$ ($\delta_5 = 8.0$, $\sigma_3^5 = 25.6$).]

Figure 6: Crops of a subset of images extracted from the DoG scale-space of an example image. The
DoG operator is an approximation of the normalized Laplacian operator σ 2 ∆v. The DoG scale-space
parameters used in this example are the default: nspo = 3, σmin = 0.8, σin = 0.5. Image pixels are
represented by a square of side δo for better visualization.

We denote by $\omega_{s,m,n}^o(\alpha)$ the quadratic function at the sample point $(s, m, n)$ of the octave $o$, given by
$$\omega_{s,m,n}^o(\alpha) = w_{s,m,n}^o + \alpha^T \bar{g}_{s,m,n}^o + \frac{1}{2}\, \alpha^T \bar{H}_{s,m,n}^o\, \alpha, \tag{12}$$
where $\alpha = (\alpha_1, \alpha_2, \alpha_3)^T \in [-1/2, 1/2]^3$; $\bar{g}_{s,m,n}^o$ and $\bar{H}_{s,m,n}^o$ denote respectively the 3d gradient and Hessian at $(s, m, n)$ in the octave $o$, computed with a finite difference scheme as follows
$$\bar{g}_{s,m,n}^o = \begin{pmatrix} (w_{s+1,m,n}^o - w_{s-1,m,n}^o)/2 \\ (w_{s,m+1,n}^o - w_{s,m-1,n}^o)/2 \\ (w_{s,m,n+1}^o - w_{s,m,n-1}^o)/2 \end{pmatrix}, \qquad \bar{H}_{s,m,n}^o = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{12} & h_{22} & h_{23} \\ h_{13} & h_{23} & h_{33} \end{pmatrix}, \tag{13}$$
with
$$h_{11} = w_{s+1,m,n}^o + w_{s-1,m,n}^o - 2 w_{s,m,n}^o, \qquad h_{12} = (w_{s+1,m+1,n}^o - w_{s+1,m-1,n}^o - w_{s-1,m+1,n}^o + w_{s-1,m-1,n}^o)/4,$$
$$h_{22} = w_{s,m+1,n}^o + w_{s,m-1,n}^o - 2 w_{s,m,n}^o, \qquad h_{13} = (w_{s+1,m,n+1}^o - w_{s+1,m,n-1}^o - w_{s-1,m,n+1}^o + w_{s-1,m,n-1}^o)/4,$$
$$h_{33} = w_{s,m,n+1}^o + w_{s,m,n-1}^o - 2 w_{s,m,n}^o, \qquad h_{23} = (w_{s,m+1,n+1}^o - w_{s,m+1,n-1}^o - w_{s,m-1,n+1}^o + w_{s,m-1,n-1}^o)/4.$$


This quadratic function is an approximation of the second-order Taylor expansion of the underlying continuous function, where the derivatives are approximated by finite difference schemes.

In order to refine the position of a discrete extremum $(s_e, m_e, n_e)$ at octave $o_e$, SIFT proceeds as follows.

1. Initialize $(s, m, n)$ to the discrete coordinates of the extremum $(s_e, m_e, n_e)$.

2. Compute the continuous extremum of $\omega_{s,m,n}^o$ by solving $\nabla \omega_{s,m,n}^o(\alpha) = 0$ (see Algorithm 7). This yields
$$\alpha^* = -\left( \bar{H}_{s,m,n}^o \right)^{-1} \bar{g}_{s,m,n}^o. \tag{14}$$

3. If $\max(|\alpha_1^*|, |\alpha_2^*|, |\alpha_3^*|) \le 0.5$ (i.e. the extremum of the quadratic function lies inside its domain of validity), the extremum is accepted. According to the scale-space architecture (see (10) and (9)), the corresponding keypoint coordinates are
$$(\sigma, x, y) = \left( \frac{\delta_{o_e}}{\delta_{min}}\, \sigma_{min}\, 2^{(\alpha_1^* + s)/n_{spo}},\; \delta_{o_e}(\alpha_2^* + m),\; \delta_{o_e}(\alpha_3^* + n) \right). \tag{15}$$

4. If $\alpha^*$ falls outside the domain of validity, the interpolation is rejected and another one is carried out. Update $(s, m, n)$ to the discrete point closest to $(s, m, n) + \alpha^*$ and repeat from step 2.

This process is repeated up to five times or until the interpolation is validated. If after five iterations the result is still not validated, the candidate keypoint is discarded. In practice, the validity domain is defined by $\max(|\alpha_1^*|, |\alpha_2^*|, |\alpha_3^*|) < 0.6$ to avoid possible numerical instabilities due to the fact that the piecewise interpolation model is not continuous. See Algorithm 6 for details.
According to the local interpolation model (12), the value of the DoG interpolated extremum is
$$\omega := \omega_{s,m,n}^o(\alpha^*) = w_{s,m,n}^o + (\alpha^*)^T \bar{g}_{s,m,n}^o + \frac{1}{2} (\alpha^*)^T \bar{H}_{s,m,n}^o\, \alpha^* = w_{s,m,n}^o + \frac{1}{2} (\alpha^*)^T \bar{g}_{s,m,n}^o. \tag{16}$$
This value will be used to discard uncontrasted keypoints.
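The refinement step translates almost literally into code. The following Python sketch (our naming) computes the gradient and Hessian of equation (13) and the quantities of equations (14) and (16) on one DoG octave stored as a 3d array `w[s, m, n]`:

```python
import numpy as np

def quadratic_interpolation(w, s, m, n):
    """Offset alpha* (equation (14)) and interpolated value omega
    (equation (16)) at the discrete extremum (s, m, n)."""
    g = 0.5 * np.array([w[s+1, m, n] - w[s-1, m, n],
                        w[s, m+1, n] - w[s, m-1, n],
                        w[s, m, n+1] - w[s, m, n-1]])          # 3d gradient, eq. (13)
    h11 = w[s+1, m, n] + w[s-1, m, n] - 2 * w[s, m, n]
    h22 = w[s, m+1, n] + w[s, m-1, n] - 2 * w[s, m, n]
    h33 = w[s, m, n+1] + w[s, m, n-1] - 2 * w[s, m, n]
    h12 = 0.25 * (w[s+1, m+1, n] - w[s+1, m-1, n] - w[s-1, m+1, n] + w[s-1, m-1, n])
    h13 = 0.25 * (w[s+1, m, n+1] - w[s+1, m, n-1] - w[s-1, m, n+1] + w[s-1, m, n-1])
    h23 = 0.25 * (w[s, m+1, n+1] - w[s, m+1, n-1] - w[s, m-1, n+1] + w[s, m-1, n-1])
    H = np.array([[h11, h12, h13],
                  [h12, h22, h23],
                  [h13, h23, h33]])                            # 3d Hessian, eq. (13)
    alpha = -np.linalg.solve(H, g)                             # equation (14)
    omega = w[s, m, n] + 0.5 * g @ alpha                       # equation (16)
    return alpha, omega
```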

3.3 Filtering Unstable Keypoints


Discarding Low Contrasted Extrema
Image noise will typically produce several spurious DoG extrema. Such extrema are unstable and are
not linked to any particular structure in the image. SIFT attempts to eliminate these false detections
by discarding candidate keypoints with a DoG value ω below a threshold CDoG (see Algorithm 8),

if |ω| < CDoG then discard the candidate keypoint.

Since the DoG function approximates $(\kappa - 1)\sigma^2 \Delta v$, where $\kappa$ is a function of the number of scales per octave $n_{spo}$, the value of the threshold $C_{DoG}$ should depend on $n_{spo}$ (default value $C_{DoG} = 0.015$ for $n_{spo} = 3$). The threshold applied in the provided source code is
$$\tilde{C}_{DoG} = \frac{2^{1/n_{spo}} - 1}{2^{1/3} - 1}\, C_{DoG},$$
with $C_{DoG}$ relative to $n_{spo} = 3$. This guarantees that the applied threshold is independent of the sampling configuration. Before the refinement of the extrema, and to avoid unnecessary computations, a less conservative threshold at 80% of $C_{DoG}$ is applied to the discrete 3d extrema (see Algorithm 5):

if $|w_{s,m,n}^o| < 0.8 \times C_{DoG}$, then discard the discrete 3d extremum.
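As a small worked example of the rescaling above, the adapted threshold fits in a one-line function (our sketch):

```python
def adapted_dog_threshold(n_spo, C_dog=0.015):
    """Rescale C_DoG (given relative to n_spo = 3) so that the contrast test
    does not depend on the number of scales per octave."""
    return (2 ** (1 / n_spo) - 1) / (2 ** (1 / 3) - 1) * C_dog
```

For $n_{spo} = 3$ it returns $C_{DoG}$ unchanged; for a finer sampling, say $n_{spo} = 6$, the DoG responses are weaker and the threshold is lowered accordingly.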


Discarding candidate keypoints on edges


Candidate keypoints lying on edges are difficult to precisely locate. This is a direct consequence of
the fact that an edge is invariant to translations along its principal axis. Such detections do not
help define covariant keypoints and should be discarded. The 2d Hessian of the DoG provides a
characterization of those undesirable keypoint candidates. Edges present a large principal curvature
orthogonal to the edge and a small one along the edge. In terms of the eigenvalues of the Hessian
matrix, the presence of an edge amounts to a big ratio between the largest eigenvalue λmax and the
smallest one λmin .
The Hessian matrix of the DoG is computed at the nearest grid sample using a finite difference scheme:
$$H_{s,m,n}^o = \begin{pmatrix} h_{11} & h_{12} \\ h_{12} & h_{22} \end{pmatrix}, \tag{17}$$
where
$$h_{11} = w_{s,m+1,n}^o + w_{s,m-1,n}^o - 2 w_{s,m,n}^o, \qquad h_{22} = w_{s,m,n+1}^o + w_{s,m,n-1}^o - 2 w_{s,m,n}^o,$$
$$h_{12} = h_{21} = (w_{s,m+1,n+1}^o - w_{s,m+1,n-1}^o - w_{s,m-1,n+1}^o + w_{s,m-1,n-1}^o)/4.$$

The SIFT algorithm discards keypoint candidates whose eigenvalue ratio $r := \lambda_{max}/\lambda_{min}$ exceeds a certain threshold $C_{edge}$ (the default value is $C_{edge} = 10$). Since only this ratio is relevant, the computation of the eigenvalues can be avoided. The ratio of the squared trace of the Hessian matrix to its determinant is related to $r$ by
$$\mathrm{edgeness}(H_{s,m,n}^o) = \frac{\mathrm{tr}(H_{s,m,n}^o)^2}{\det(H_{s,m,n}^o)} = \frac{(\lambda_{max} + \lambda_{min})^2}{\lambda_{max}\lambda_{min}} = \frac{(r + 1)^2}{r}. \tag{18}$$
Thus, the filtering of keypoint candidates on edges consists in the following test:

if $\mathrm{edgeness}(H_{s,m,n}^o) > \frac{(C_{edge} + 1)^2}{C_{edge}}$, then discard the candidate keypoint.

Note that $H_{s,m,n}^o$ is the bottom-right $2 \times 2$ sub-matrix of $\bar{H}_{s,m,n}^o$ (13), previously computed for the keypoint interpolation. Algorithm 9 summarizes how keypoints on edges are discarded.
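In code, both filters reduce to a few comparisons. The sketch below is ours; the handling of a non-positive determinant (principal curvatures of opposite signs) is a common convention rather than something stated above:

```python
def is_on_edge(w, s, m, n, C_edge=10.0):
    """Edgeness test of equations (17) and (18) on one DoG octave w[s, m, n]."""
    h11 = w[s, m+1, n] + w[s, m-1, n] - 2 * w[s, m, n]
    h22 = w[s, m, n+1] + w[s, m, n-1] - 2 * w[s, m, n]
    h12 = 0.25 * (w[s, m+1, n+1] - w[s, m+1, n-1] - w[s, m-1, n+1] + w[s, m-1, n-1])
    det = h11 * h22 - h12 * h12
    if det <= 0:
        return True  # curvatures of opposite signs: treated as an edge response
    return (h11 + h22) ** 2 / det > (C_edge + 1) ** 2 / C_edge

def is_contrasted(omega, C_dog=0.015):
    """Contrast test of Algorithm 8 on the interpolated DoG value omega."""
    return abs(omega) >= C_dog
```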

3.4 Pseudocodes

Algorithm 3: Computation of the difference of Gaussians scale-space (DoG)

Input: $(v_s^o)$, digital Gaussian scale-space, $o = 1, \ldots, n_{oct}$ and $s = 0, \ldots, n_{spo}+2$.
Output: $(w_s^o)$, digital DoG, $o = 1, \ldots, n_{oct}$ and $s = 0, \ldots, n_{spo}+1$.
for $o = 1, \ldots, n_{oct}$ and $s = 0, \ldots, n_{spo}+1$ do
    for $m = 0, \ldots, M_o - 1$ and $n = 0, \ldots, N_o - 1$ do
        $w_s^o(m, n) = v_{s+1}^o(m, n) - v_s^o(m, n)$


Algorithm 4: Scanning for 3d discrete extrema of the DoG scale-space

Input: $(w_s^o)$, digital DoG, $o = 1, \ldots, n_{oct}$ and $s = 0, \ldots, n_{spo}+1$. The samples of the digital image $w_s^o$ are denoted by $w_{s,m,n}^o$.
Output: $L_A = \{(o, s, m, n)\}$, list of the DoG 3d discrete extrema.
for $o = 1, \ldots, n_{oct}$ do
    for $s = 1, \ldots, n_{spo}$, $m = 1, \ldots, M_o - 2$ and $n = 1, \ldots, N_o - 2$ do
        if the sample $w_{s,m,n}^o$ is larger or smaller than all of its $3^3 - 1 = 26$ neighbors then
            Add the discrete extremum $(o, s, m, n)$ to $L_A$
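Algorithms 3 and 4 can be sketched with numpy as follows, assuming each octave is stored as a 3d array indexed by (s, m, n); this is our illustrative code:

```python
import numpy as np

def dog_octave(v):
    """Algorithm 3 on one octave: difference of consecutive scales."""
    return v[1:] - v[:-1]

def discrete_extrema(w):
    """Algorithm 4 on one DoG octave: samples strictly larger or strictly
    smaller than their 26 neighbors, excluding the auxiliary scales and
    the image borders."""
    extrema = []
    S, M, N = w.shape
    for s in range(1, S - 1):
        for m in range(1, M - 1):
            for n in range(1, N - 1):
                cube = w[s-1:s+2, m-1:m+2, n-1:n+2].ravel()
                neighbors = np.delete(cube, 13)  # drop the center sample
                c = w[s, m, n]
                if c > neighbors.max() or c < neighbors.min():
                    extrema.append((s, m, n))
    return extrema
```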

Algorithm 5: Discarding low contrasted candidate keypoints (conservative test)

Inputs: - $(w_s^o)$, digital DoG, $o = 1, \ldots, n_{oct}$ and $s = 0, \ldots, n_{spo}+1$.
- $L_A = \{(o, s, m, n)\}$, list of DoG 3d discrete extrema.
Output: $L_{A'} = \{(o, s, m, n)\}$, filtered list of DoG 3d discrete extrema.
Parameter: $C_{DoG}$, threshold.
for each DoG 3d discrete extremum $(o, s, m, n)$ in $L_A$ do
    if $|w_{s,m,n}^o| \ge 0.8 \times C_{DoG}$ then
        Add the discrete extremum $(o, s, m, n)$ to $L_{A'}$

Algorithm 6: Keypoint interpolation

Inputs: - $L_{A'} = \{(o, s, m, n)\}$, list of DoG 3d discrete extrema.
- $(w_s^o)$, digital DoG scale-space, $o = 1, \ldots, n_{oct}$ and $s = 0, \ldots, n_{spo}+1$.
Output: $L_B = \{(o, s, m, n, x, y, \sigma, \omega)\}$, list of candidate keypoints.
for each DoG 3d discrete extremum $(o_e, s_e, m_e, n_e)$ in $L_{A'}$ do
    $(s, m, n) \leftarrow (s_e, m_e, n_e)$ // initialize the interpolation location
    repeat
        // Compute the location and value of the extremum of the local quadratic function (see Algorithm 7)
        $(\alpha^*, \omega) \leftarrow \text{quadratic\_interpolation}(o_e, s, m, n)$
        // Compute the corresponding absolute coordinates
        $(\sigma, x, y) = \left( \frac{\delta_{o_e}}{\delta_{min}}\, \sigma_{min}\, 2^{(\alpha_1^* + s)/n_{spo}},\; \delta_{o_e}(\alpha_2^* + m),\; \delta_{o_e}(\alpha_3^* + n) \right)$
        // Update the interpolating position
        $(s, m, n) \leftarrow ([s + \alpha_1^*], [m + \alpha_2^*], [n + \alpha_3^*])$
    until $\max(|\alpha_1^*|, |\alpha_2^*|, |\alpha_3^*|) < 0.6$ or after 5 unsuccessful tries
    if $\max(|\alpha_1^*|, |\alpha_2^*|, |\alpha_3^*|) < 0.6$ then
        Add the candidate keypoint $(o_e, s, m, n, \sigma, x, y, \omega)$ to $L_B$
note: $[\cdot]$ denotes the round function.


Algorithm 7: Quadratic interpolation on a discrete DoG sample

Inputs: - $(w_s^o)$, digital DoG scale-space, $o = 1, \ldots, n_{oct}$ and $s = 0, \ldots, n_{spo}+1$.
- $(o, s, m, n)$, coordinates of the DoG 3d discrete extremum.
Outputs: - $\alpha^*$, offset from the center of the interpolated 3d extremum.
- $\omega$, value of the interpolated 3d extremum.
Compute $\bar{g}_{s,m,n}^o$ and $\bar{H}_{s,m,n}^o$ // DoG 3d gradient and Hessian, Equation (13)
Compute $\alpha^* = -\left( \bar{H}_{s,m,n}^o \right)^{-1} \bar{g}_{s,m,n}^o$
Compute $\omega = w_{s,m,n}^o - \frac{1}{2} (\bar{g}_{s,m,n}^o)^T \left( \bar{H}_{s,m,n}^o \right)^{-1} \bar{g}_{s,m,n}^o$

Algorithm 8: Discarding low contrasted candidate keypoints

Input: $L_B = \{(o, s, m, n, \sigma, x, y, \omega)\}$, list of candidate keypoints.
Output: $L_{B'} = \{(o, s, m, n, \sigma, x, y, \omega)\}$, reduced list of candidate keypoints.
Parameter: $C_{DoG}$, threshold.
for each candidate keypoint $(o, s, m, n, \sigma, x, y, \omega)$ in $L_B$ do
    if $|\omega| \ge C_{DoG}$ then
        Add the candidate keypoint to $L_{B'}$

Algorithm 9: Discarding candidate keypoints on edges

Inputs: - $(w_s^o)$, DoG scale-space.
- $L_{B'} = \{(o, s, m, n, \sigma, x, y, \omega)\}$, list of candidate keypoints.
Output: $L_C = \{(o, s, m, n, \sigma, x, y, \omega)\}$, list of the SIFT keypoints.
Parameter: $C_{edge}$, threshold over the ratio between the first and second Hessian eigenvalues.
for each candidate keypoint $(o, s, m, n, \sigma, x, y, \omega)$ in $L_{B'}$ do
    Compute $H_{s,m,n}^o$ by (17) // 2d Hessian
    Compute $\mathrm{edgeness}(H_{s,m,n}^o) = \mathrm{tr}(H_{s,m,n}^o)^2 / \det(H_{s,m,n}^o)$
    if $\mathrm{edgeness}(H_{s,m,n}^o) < (C_{edge} + 1)^2 / C_{edge}$ then
        Add the candidate keypoint $(o, s, m, n, \sigma, x, y, \omega)$ to $L_C$

4 Keypoint Description
In the literature, rotation invariant descriptors fall into two categories: on one side, those based on properties of the image that are already rotation-invariant; on the other side, descriptors based on a normalization with respect to a reference orientation. The SIFT descriptor is based on such a normalization. The local dominant gradient angle is computed and used as a reference orientation. Then, the local gradient distribution is normalized with respect to this reference direction (see Figure 7).
The SIFT descriptor is built from the normalized image gradient orientation in the form of
quantized histograms. In what follows, we describe how the reference orientation specific to each
keypoint is defined and computed.


Figure 7: The description of a keypoint detected at scale $\sigma$ (the radius of the blue circle) involves the analysis of the image gradient distribution around the keypoint in two radial Gaussian neighborhoods of different sizes. The first local analysis aims at attributing a reference orientation to the keypoint (the blue arrow). It is performed over a Gaussian window of standard deviation $\lambda_{ori}\sigma$ (the radius of the green circle). The width of the patch of contributing samples $P^{ori}$ (green square) is $6\lambda_{ori}\sigma$. The figure shows the default case $\lambda_{ori} = 1.5$. The second analysis aims at building the descriptor. It is performed over a Gaussian window of standard deviation $\lambda_{descr}\sigma$ (the radius of the red circle) within a square patch $P^{descr}$ (the red square) of approximate width $2\lambda_{descr}\sigma$. The figure features the default settings: $\lambda_{descr} = 6$, with a Gaussian window of standard deviation $6\sigma$ and a patch $P^{descr}$ of width $12\sigma$.

4.1 Keypoint Reference Orientation


A dominant gradient orientation over a keypoint neighborhood is used as its reference orientation.
This allows for orientation normalization and hence rotation-invariance of the resulting descriptor
(see Figure 7). Computing this reference orientation involves three steps:
A. accumulation of the local distribution of the gradient angle within a normalized patch in an
orientation histogram;
B. smoothing of the orientation histogram;
C. extraction of one or more reference orientations from the smoothed histogram.

A. Orientation histogram accumulation. Given a keypoint $(x, y, \sigma)$, the patch to be analyzed is extracted from the image of the scale-space $v_s^o$ whose scale $\sigma_s^o$ is nearest to $\sigma$. This normalized patch, denoted by $P^{ori}$, is the set of pixels $(m, n)$ of $v_s^o$ satisfying
$$\max(|\delta_o m - x|, |\delta_o n - y|) \le 3\lambda_{ori}\sigma. \tag{19}$$
Keypoints whose distance to the image borders is less than $3\lambda_{ori}\sigma$ are discarded, since the patch $P^{ori}$ would not be totally included in the image. The orientation histogram $h$ from which the dominant orientation is found covers the range $[0, 2\pi]$. It is composed of $n_{bins}$ bins with centers $\theta_k = 2\pi k / n_{bins}$. Each pixel $(m, n)$ in $P^{ori}$ contributes to the histogram with a total weight $c_{m,n}^{ori}$, which is the product of the gradient norm and a Gaussian weight of standard deviation $\lambda_{ori}\sigma$ (default value $\lambda_{ori} = 1.5$) reducing the contribution of distant pixels,
$$c_{m,n}^{ori} = e^{-\frac{\|(m\delta_o, n\delta_o) - (x, y)\|^2}{2(\lambda_{ori}\sigma)^2}} \left\| \left( \partial_m v_{s,m,n}^o,\, \partial_n v_{s,m,n}^o \right) \right\|. \tag{20}$$
This contribution is assigned to the nearest bin, namely the bin of index
$$b_{m,n}^{ori} = \left[ \frac{n_{bins}}{2\pi} \arctan2\left( \partial_m v_{s,m,n}^o,\, \partial_n v_{s,m,n}^o \right) \right], \tag{21}$$


where $[\cdot]$ denotes the round function and $\arctan2$ is the two-argument inverse tangent function² with range in $[0, 2\pi]$. The gradient components of the scale-space image $v_s^o$ are computed through a finite difference scheme
$$\partial_m v_{s,m,n}^o = \frac{1}{2}\left( v_{s,m+1,n}^o - v_{s,m-1,n}^o \right), \qquad \partial_n v_{s,m,n}^o = \frac{1}{2}\left( v_{s,m,n+1}^o - v_{s,m,n-1}^o \right), \tag{22}$$
for $m = 1, \ldots, M_o - 2$ and $n = 1, \ldots, N_o - 2$.

B. Smoothing the histogram. After being accumulated, the orientation histogram is smoothed
by applying six times a circular convolution with the three-tap box filter [1, 1, 1]/3.

C. Extraction of reference orientation(s). The keypoint reference orientations are taken among the positions of the local maxima of the smoothed histogram. More precisely, the reference orientations are the positions of local maxima larger than $t$ times the global maximum (default value $t = 0.8$). Let $k \in \{1, \ldots, n_{bins}\}$ be the index of a bin such that $h_k > h_{k^-}$, $h_k > h_{k^+}$, where $k^- = (k - 1) \bmod n_{bins}$ and $k^+ = (k + 1) \bmod n_{bins}$, and such that $h_k \ge t \max(h)$. This bin is centered at the orientation $\theta_k = 2\pi(k-1)/n_{bins}$. The corresponding keypoint reference orientation $\theta_{ref}$ is computed from the position of the maximum of the quadratic function that interpolates the values $h_{k^-}$, $h_k$, $h_{k^+}$,
$$\theta_{ref} = \theta_k + \frac{\pi}{n_{bins}} \left( \frac{h_{k^-} - h_{k^+}}{h_{k^-} - 2h_k + h_{k^+}} \right). \tag{23}$$
Each one of the extracted reference orientations leads to the computation of one local descriptor of a keypoint neighborhood. The number of descriptors may consequently exceed the number of keypoints. Figure 8 illustrates how a reference orientation is attributed to a keypoint.
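Steps A, B and C combine into a short routine. The Python sketch below is ours: `grad_m` and `grad_n` are assumed to hold the gradient images of Algorithm 10 for the scale nearest to the keypoint, the keypoint is assumed to have passed the border check, and `n_bins = 36` is an assumed default:

```python
import numpy as np

def reference_orientations(grad_m, grad_n, x, y, sigma, delta_o,
                           lambda_ori=1.5, n_bins=36, t=0.8):
    """Accumulate (A), smooth (B) and peak-pick (C) the orientation histogram."""
    h = np.zeros(n_bins)
    r = 3 * lambda_ori * sigma
    # A. accumulation, equations (19)-(21)
    for m in range(int(round((x - r) / delta_o)), int(round((x + r) / delta_o)) + 1):
        for n in range(int(round((y - r) / delta_o)), int(round((y + r) / delta_o)) + 1):
            dm, dn = grad_m[m, n], grad_n[m, n]
            d2 = (m * delta_o - x) ** 2 + (n * delta_o - y) ** 2
            c = np.exp(-d2 / (2 * (lambda_ori * sigma) ** 2)) * np.hypot(dm, dn)
            angle = np.arctan2(dm, dn) % (2 * np.pi)
            b = int(round(n_bins * angle / (2 * np.pi))) % n_bins
            h[b] += c
    # B. six circular convolutions with [1, 1, 1] / 3
    for _ in range(6):
        h = (np.roll(h, 1) + h + np.roll(h, -1)) / 3
    # C. local maxima above t * max(h), refined by equation (23)
    thetas = []
    for k in range(n_bins):
        hm, hp = h[k - 1], h[(k + 1) % n_bins]
        if h[k] > hm and h[k] > hp and h[k] >= t * h.max():
            theta_k = 2 * np.pi * k / n_bins
            thetas.append(theta_k + (np.pi / n_bins) * (hm - hp) / (hm - 2 * h[k] + hp))
    return thetas
```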

Figure 8: Reference orientation attribution. The width of the normalized patch $P^{ori}$ (normalized to scale and translation) is $6\lambda_{ori}\sigma_{key}$. The gradient magnitude is weighted by a Gaussian window of standard deviation $\lambda_{ori}\sigma_{key}$. The gradient orientations are accumulated into an orientation histogram $h$, which is subsequently smoothed.

² The two-argument inverse tangent, unlike the single-argument one, determines the appropriate quadrant of the computed angle thanks to the extra information given by the signs of the inputs: $\arctan2(x, y) = \arctan(x/y) + \frac{\pi}{2}\, \mathrm{sign}(y)(1 - \mathrm{sign}(x))$.


4.2 Keypoint Normalized Descriptor


The SIFT descriptor encodes the local spatial distribution of the gradient orientation over a particular neighborhood. SIFT descriptors can be computed anywhere, even densely over the entire image or its scale-space [15, 11]. In the original SIFT method, however, a descriptor is computed for each detected keypoint over its normalized neighborhood, making it invariant to translations, rotations and zoom-outs. Given a detected keypoint, the normalized neighborhood consists of a square patch centered at the keypoint and aligned with the reference orientation.
The descriptor consists of a set of weighted histograms of the gradient orientation computed on
different regions of the normalized square patch.

The normalized patch. For each keypoint $(x_{key}, y_{key}, \sigma_{key}, \theta_{key})$, a normalized patch is isolated from the Gaussian scale-space image relative to the discrete scale $(o, s)$ nearest to $\sigma_{key}$, namely $v_s^o$. A sample $(m, n)$ in $v_s^o$, of coordinates $(x_{m,n}, y_{m,n}) = (m\delta_o, n\delta_o)$ with respect to the sampling grid of the input image, has normalized coordinates $(\hat{x}_{m,n}, \hat{y}_{m,n})$ with respect to the keypoint $(x_{key}, y_{key}, \sigma_{key}, \theta_{key})$,
$$\hat{x}_{m,n} = \left( (m\delta_o - x_{key}) \cos\theta_{key} + (n\delta_o - y_{key}) \sin\theta_{key} \right) / \sigma_{key},$$
$$\hat{y}_{m,n} = \left( -(m\delta_o - x_{key}) \sin\theta_{key} + (n\delta_o - y_{key}) \cos\theta_{key} \right) / \sigma_{key}. \tag{24}$$
The normalized patch, denoted by $P^{descr}$, is the set of samples $(m, n)$ of $v_s^o$ with normalized coordinates $(\hat{x}_{m,n}, \hat{y}_{m,n})$ satisfying
$$\max(|\hat{x}_{m,n}|, |\hat{y}_{m,n}|) \le \lambda_{descr}. \tag{25}$$
Keypoints whose distance to the image borders is less than $\sqrt{2}\lambda_{descr}\sigma$ are discarded to guarantee that the patch $P^{descr}$ is included in the image. Note that no image re-sampling is performed. Each sample $(m, n)$ is characterized by its gradient orientation normalized with respect to the keypoint orientation $\theta_{key}$,
$$\hat{\theta}_{m,n} = \arctan2\left( \partial_m v_{s,m,n}^o,\, \partial_n v_{s,m,n}^o \right) - \theta_{key} \mod 2\pi, \tag{26}$$
and by its total contribution $c_{m,n}^{descr}$. The total contribution is the product of the gradient norm and a Gaussian weight (of standard deviation $\lambda_{descr}\sigma_{key}$) reducing the contribution of distant pixels,
$$c_{m,n}^{descr} = e^{-\frac{\|(m\delta_o, n\delta_o) - (x_{key}, y_{key})\|^2}{2(\lambda_{descr}\sigma_{key})^2}} \left\| \left( \partial_m v_{s,m,n}^o,\, \partial_n v_{s,m,n}^o \right) \right\|. \tag{27}$$

The array of orientation histograms. The gradient orientation of each pixel in the normalized patch $P^{descr}$ is accumulated into an array of $n_{hist} \times n_{hist}$ orientation histograms (default value $n_{hist} = 4$). Each of these histograms, denoted by $h^{i,j}$ for $(i, j) \in \{1, \ldots, n_{hist}\}^2$, has an associated position with respect to the keypoint $(x_{key}, y_{key}, \sigma_{key}, \theta_{key})$, given by
$$\hat{x}^i = \left( i - \frac{1 + n_{hist}}{2} \right) \frac{2\lambda_{descr}}{n_{hist}}, \qquad \hat{y}^j = \left( j - \frac{1 + n_{hist}}{2} \right) \frac{2\lambda_{descr}}{n_{hist}}.$$
Each histogram $h^{i,j}$ consists of $n_{ori}$ bins $h_k^{i,j}$ with $k \in \{1, \ldots, n_{ori}\}$, centered at $\hat{\theta}^k = 2\pi(k-1)/n_{ori}$ (default value $n_{ori} = 8$). Each sample $(m, n)$ in the normalized patch $P^{descr}$ contributes to the nearest histograms (up to four of them). Its total contribution $c_{m,n}^{descr}$ is split bilinearly over these nearest histograms depending on the distances to each of them (see Figure 10). In the same way, the contribution within each histogram is subsequently split linearly between the two nearest bins. This results, for the sample $(m, n)$, in the following updates.


For every $(i, j, k) \in \{1, \ldots, n_{hist}\}^2 \times \{1, \ldots, n_{ori}\}$ such that
$$|\hat{x}^i - \hat{x}_{m,n}| \le \frac{2\lambda_{descr}}{n_{hist}}, \qquad |\hat{y}^j - \hat{y}_{m,n}| \le \frac{2\lambda_{descr}}{n_{hist}} \qquad \text{and} \qquad \left| \hat{\theta}^k - \hat{\theta}_{m,n} \bmod 2\pi \right| \le \frac{2\pi}{n_{ori}},$$
the histogram bin $h_k^{i,j}$ is updated by
$$h_k^{i,j} \leftarrow h_k^{i,j} + \left( 1 - \frac{n_{hist}}{2\lambda_{descr}} \left| \hat{x}^i - \hat{x}_{m,n} \right| \right) \left( 1 - \frac{n_{hist}}{2\lambda_{descr}} \left| \hat{y}^j - \hat{y}_{m,n} \right| \right) \left( 1 - \frac{n_{ori}}{2\pi} \left| \hat{\theta}^k - \hat{\theta}_{m,n} \bmod 2\pi \right| \right) c_{m,n}^{descr}. \tag{28}$$

Figure 9: SIFT descriptor construction. No explicit re-sampling of the described normalized patch is performed. The normalized patch $P^{descr}$ is partitioned into a set of $n_{hist} \times n_{hist}$ subpatches (default value $n_{hist} = 4$). Each sample $(m, n)$ inside $P^{descr}$, located at $(m\delta_o, n\delta_o)$, contributes by an amount that is a function of its normalized coordinates $(\hat{x}_{m,n}, \hat{y}_{m,n})$ (see (24)). Each sub-patch $P_{(i,j)}^{descr}$ is centered at $(\hat{x}^i, \hat{y}^j)$.

The SIFT feature vector. The accumulated array of histograms is encoded into a feature vector $f$ of length $n_{hist} \times n_{hist} \times n_{ori}$, as follows
$$f_{(i-1) n_{hist} n_{ori} + (j-1) n_{ori} + k} = h_k^{i,j},$$
where $i = 1, \ldots, n_{hist}$, $j = 1, \ldots, n_{hist}$ and $k = 1, \ldots, n_{ori}$. The components of the feature vector $f$ are saturated to a maximum value of 20% of its Euclidean norm, i.e. $f_k \leftarrow \min(f_k, 0.2\|f\|)$. The saturation of the feature vector components seeks to reduce the impact of non-linear illumination changes, such as saturated regions. The vector is finally renormalized so as to have $\|f\|_2 = 512$ and quantized to 8-bit integers as follows: $f_k \leftarrow \min(\lfloor f_k \rfloor, 255)$. The quantization aims at accelerating the computation of distances between different feature vectors³. Figure 9 and Figure 11 illustrate how a SIFT feature vector is attributed to an oriented keypoint.

³ The executable provided by D. Lowe (http://www.cs.ubc.ca/~lowe/keypoints/, retrieved on September 11th, 2014) uses a different coordinate system (see the source code's README.txt for details).
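The saturation, renormalization and quantization of the feature vector fit in a few lines (our sketch; the flattening order is assumed to match the indexing above):

```python
import numpy as np

def finalize_descriptor(h):
    """h: array of shape (n_hist, n_hist, n_ori). Saturate each component to
    20% of the Euclidean norm, renormalize to norm 512, quantize to 8 bits."""
    f = h.reshape(-1).astype(float)
    f = np.minimum(f, 0.2 * np.linalg.norm(f))             # saturation
    f = 512 * f / np.linalg.norm(f)                        # renormalization
    return np.minimum(np.floor(f), 255).astype(np.uint8)   # quantization
```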


Figure 10: Illustration of the spatial contribution of a sample inside the patch $P^{descr}$. The sample $(m, n)$ contributes to the weighted histograms $(2, 2)$ (green), $(2, 3)$ (orange), $(3, 2)$ (blue) and $(3, 3)$ (pink). The contribution $c_{m,n}^{descr}$ is split over four pairs of bins according to (28).

Figure 11: The image on top shows the $n_{hist} \times n_{hist}$ array of sub-patches relative to a keypoint; the corresponding $n_{ori}$-bin histograms are rearranged into a 1d vector (bottom). This vector is subsequently thresholded and normalized so that the Euclidean norm of each descriptor equals 512. The dimension of the feature vector in this example is 128, relative to the default parameters $n_{hist} = 4$ and $n_{ori} = 8$.

4.3 Pseudocodes

Algorithm 10: Computation of the 2d gradient at each image of the scale-space

Input: $(v_s^o)$, digital Gaussian scale-space, $o = 1, \ldots, n_{oct}$ and $s = 0, \ldots, n_{spo}+2$.
Outputs: - $(\partial_m v_{s,m,n}^o)$, scale-space gradient along $x$, $o = 1, \ldots, n_{oct}$ and $s = 1, \ldots, n_{spo}$.
- $(\partial_n v_{s,m,n}^o)$, scale-space gradient along $y$, $o = 1, \ldots, n_{oct}$ and $s = 1, \ldots, n_{spo}$.
for $o = 1, \ldots, n_{oct}$ and $s = 1, \ldots, n_{spo}$ do
    for $m = 1, \ldots, M_o - 2$ and $n = 1, \ldots, N_o - 2$ do
        $\partial_m v_{s,m,n}^o = (v_{s,m+1,n}^o - v_{s,m-1,n}^o)/2$
        $\partial_n v_{s,m,n}^o = (v_{s,m,n+1}^o - v_{s,m,n-1}^o)/2$


Algorithm 11: Computing the keypoint reference orientation

Inputs: - $L_C = \{(o_{key}, s_{key}, x_{key}, y_{key}, \sigma_{key}, \omega)\}$, list of keypoints.
- $(\partial_m v_{s,m,n}^o)$, scale-space gradient along $x$, $o = 1, \ldots, n_{oct}$ and $s = 1, \ldots, n_{spo}$.
- $(\partial_n v_{s,m,n}^o)$, scale-space gradient along $y$, $o = 1, \ldots, n_{oct}$ and $s = 1, \ldots, n_{spo}$.
Parameters: - $\lambda_{ori}$. The patch $P^{ori}$ is $6\lambda_{ori}\sigma_{key}$ wide for a keypoint of scale $\sigma_{key}$. The Gaussian window has a standard deviation of $\lambda_{ori}\sigma_{key}$.
- $n_{bins}$, number of bins in the orientation histogram $h$.
- $t$, threshold for secondary reference orientations.
Output: $L_D = \{(o, s, m, n, x, y, \sigma, \omega, \theta)\}$, list of oriented keypoints.
Temporary: $h_k$, orientation histogram, $k = 1, \ldots, n_{bins}$, with bin $h_k$ covering $\left[ \frac{2\pi(k - 3/2)}{n_{bins}}, \frac{2\pi(k - 1/2)}{n_{bins}} \right]$.

for each keypoint $(o_{key}, s_{key}, x_{key}, y_{key}, \sigma_{key}, \omega)$ in $L_C$ do
    // Check that the keypoint is distant enough from the image borders
    if $3\lambda_{ori}\sigma_{key} \le x_{key} \le h - 3\lambda_{ori}\sigma_{key}$ and $3\lambda_{ori}\sigma_{key} \le y_{key} \le w - 3\lambda_{ori}\sigma_{key}$ then
        // Initialize the orientation histogram $h$
        for $1 \le k \le n_{bins}$ do $h_k \leftarrow 0$
        // Accumulate the samples of the normalized patch $P^{ori}$ (Equation (19))
        for $m = \left[ (x_{key} - 3\lambda_{ori}\sigma_{key})/\delta_{o_{key}} \right], \ldots, \left[ (x_{key} + 3\lambda_{ori}\sigma_{key})/\delta_{o_{key}} \right]$ do
            for $n = \left[ (y_{key} - 3\lambda_{ori}\sigma_{key})/\delta_{o_{key}} \right], \ldots, \left[ (y_{key} + 3\lambda_{ori}\sigma_{key})/\delta_{o_{key}} \right]$ do
                // Compute the sample contribution (Equation (20))
                $c_{m,n}^{ori} = e^{-\frac{\|(m\delta_{o_{key}}, n\delta_{o_{key}}) - (x_{key}, y_{key})\|^2}{2(\lambda_{ori}\sigma_{key})^2}} \left\| \left( \partial_m v_{s_{key},m,n}^{o_{key}},\, \partial_n v_{s_{key},m,n}^{o_{key}} \right) \right\|$
                // Compute the corresponding bin index (Equation (21))
                $b_{m,n}^{ori} = \left[ \frac{n_{bins}}{2\pi} \left( \arctan2\left( \partial_m v_{s_{key},m,n}^{o_{key}},\, \partial_n v_{s_{key},m,n}^{o_{key}} \right) \bmod 2\pi \right) \right]$
                // Update the histogram
                $h_{b_{m,n}^{ori}} \leftarrow h_{b_{m,n}^{ori}} + c_{m,n}^{ori}$
        // Smooth $h$
        Apply six times a circular convolution with the filter $[1, 1, 1]/3$ to $h$.
        // Extract the reference orientation(s)
        for $1 \le k \le n_{bins}$ do
            if $h_k > h_{k^-}$, $h_k > h_{k^+}$ and $h_k \ge t \max(h)$ then
                // Compute the reference orientation $\theta_{key}$ (Equation (23))
                $\theta_{key} = \theta_k + \frac{\pi}{n_{bins}} \frac{h_{k^-} - h_{k^+}}{h_{k^-} - 2h_k + h_{k^+}}$
                Add the oriented keypoint $(o_{key}, s_{key}, x_{key}, y_{key}, \sigma_{key}, \omega, \theta_{key})$ to $L_D$
note: $[\cdot]$ denotes the round function and $\arctan2$ denotes the two-argument inverse tangent.


Algorithm 12: Construction of the keypoint descriptor

Inputs: - $L_D = \{(o_{key}, s_{key}, x_{key}, y_{key}, \sigma_{key}, \theta_{key})\}$, list of oriented keypoints.
- $(\partial_m v_{s,m,n}^o)$, scale-space gradient along $x$.
- $(\partial_n v_{s,m,n}^o)$, scale-space gradient along $y$ (see Algorithm 10).
Output: $L_E = \{(o_{key}, s_{key}, x_{key}, y_{key}, \sigma_{key}, \theta_{key}, f)\}$, list of keypoints with feature vector $f$.
Parameters: - $n_{hist}$. The descriptor is an array of $n_{hist} \times n_{hist}$ orientation histograms.
- $n_{ori}$, number of bins in the orientation histograms. Feature vectors $f$ have a length of $n_{hist} \times n_{hist} \times n_{ori}$.
- $\lambda_{descr}$. The Gaussian window has a standard deviation of $\lambda_{descr}\sigma_{key}$. The patch $P^{descr}$ is $2\lambda_{descr}\frac{n_{hist}+1}{n_{hist}}\sigma_{key}$ wide.
Temporary: $h_k^{i,j}$, array of weighted orientation histograms, $(i, j) \in \{1, \ldots, n_{hist}\}^2$ and $k \in \{1, \ldots, n_{ori}\}$.

for each keypoint $(o_{key}, s_{key}, x_{key}, y_{key}, \sigma_{key}, \theta_{key})$ in $L_D$ do
    // Check that the keypoint is distant enough from the image borders
    if $\sqrt{2}\lambda_{descr}\sigma_{key} \le x_{key} \le h - \sqrt{2}\lambda_{descr}\sigma_{key}$ and $\sqrt{2}\lambda_{descr}\sigma_{key} \le y_{key} \le w - \sqrt{2}\lambda_{descr}\sigma_{key}$ then
        // Initialize the array of weighted histograms
        for $1 \le i \le n_{hist}$, $1 \le j \le n_{hist}$ and $1 \le k \le n_{ori}$ do $h_k^{i,j} \leftarrow 0$
        // Accumulate the samples of the normalized patch $P^{descr}$ in the array of histograms (Equation (25))
        for $m = \left[ \left( x_{key} - \sqrt{2}\lambda_{descr}\sigma_{key}\frac{n_{hist}+1}{n_{hist}} \right)/\delta_{o_{key}} \right], \ldots, \left[ \left( x_{key} + \sqrt{2}\lambda_{descr}\sigma_{key}\frac{n_{hist}+1}{n_{hist}} \right)/\delta_{o_{key}} \right]$ do
            for $n = \left[ \left( y_{key} - \sqrt{2}\lambda_{descr}\sigma_{key}\frac{n_{hist}+1}{n_{hist}} \right)/\delta_{o_{key}} \right], \ldots, \left[ \left( y_{key} + \sqrt{2}\lambda_{descr}\sigma_{key}\frac{n_{hist}+1}{n_{hist}} \right)/\delta_{o_{key}} \right]$ do
                // Compute the normalized coordinates (Equation (24))
                $\hat{x}_{m,n} = \left( (m\delta_{o_{key}} - x_{key}) \cos\theta_{key} + (n\delta_{o_{key}} - y_{key}) \sin\theta_{key} \right)/\sigma_{key}$
                $\hat{y}_{m,n} = \left( -(m\delta_{o_{key}} - x_{key}) \sin\theta_{key} + (n\delta_{o_{key}} - y_{key}) \cos\theta_{key} \right)/\sigma_{key}$
                // Verify that the sample $(m, n)$ is inside the normalized patch $P^{descr}$
                if $\max(|\hat{x}_{m,n}|, |\hat{y}_{m,n}|) < \lambda_{descr}\frac{n_{hist}+1}{n_{hist}}$ then
                    // Compute the normalized gradient orientation (Equation (26))
                    $\hat{\theta}_{m,n} = \arctan2\left( \partial_m v_{s_{key},m,n}^{o_{key}},\, \partial_n v_{s_{key},m,n}^{o_{key}} \right) - \theta_{key} \bmod 2\pi$
                    // Compute the total contribution of the sample $(m, n)$ (Equation (27))
                    $c_{m,n}^{descr} = e^{-\frac{\|(m\delta_{o_{key}}, n\delta_{o_{key}}) - (x_{key}, y_{key})\|^2}{2(\lambda_{descr}\sigma_{key})^2}} \left\| \left( \partial_m v_{s_{key},m,n}^{o_{key}},\, \partial_n v_{s_{key},m,n}^{o_{key}} \right) \right\|$
                    // Update the nearest histograms and the nearest bins (Equation (28))
                    for $(i, j) \in \{1, \ldots, n_{hist}\}^2$ such that $|\hat{x}^i - \hat{x}_{m,n}| \le \frac{2\lambda_{descr}}{n_{hist}}$ and $|\hat{y}^j - \hat{y}_{m,n}| \le \frac{2\lambda_{descr}}{n_{hist}}$ do
                        for $k \in \{1, \ldots, n_{ori}\}$ such that $\left| \hat{\theta}^k - \hat{\theta}_{m,n} \bmod 2\pi \right| < \frac{2\pi}{n_{ori}}$ do
                            $h_k^{i,j} \leftarrow h_k^{i,j} + \left( 1 - \frac{n_{hist}}{2\lambda_{descr}} |\hat{x}_{m,n} - \hat{x}^i| \right) \left( 1 - \frac{n_{hist}}{2\lambda_{descr}} |\hat{y}_{m,n} - \hat{y}^j| \right) \left( 1 - \frac{n_{ori}}{2\pi} |\hat{\theta}_{m,n} - \hat{\theta}^k \bmod 2\pi| \right) c_{m,n}^{descr}$
        // Build the feature vector $f$ from the array of weighted histograms
        for $1 \le i \le n_{hist}$, $1 \le j \le n_{hist}$ and $1 \le k \le n_{ori}$ do
            $f_{(i-1) n_{hist} n_{ori} + (j-1) n_{ori} + k} = h_k^{i,j}$
        // Threshold, normalize and quantize $f$
        for $1 \le l \le n_{hist} \times n_{hist} \times n_{ori}$ do
            $f_l \leftarrow \min(f_l, 0.2\|f\|)$ /* threshold */
        for $1 \le l \le n_{hist} \times n_{hist} \times n_{ori}$ do
            $f_l \leftarrow \min(\lfloor 512 f_l / \|f\| \rfloor, 255)$ /* normalize and quantize to 8-bit integers */
        Add $(x_{key}, y_{key}, \sigma_{key}, \theta_{key}, f)$ to $L_E$


5 Matching
The classical purpose of detecting and describing keypoints is to find matches (pairs of keypoints)
between images. In the absence of extra knowledge on the problem, for instance in the form of
geometric constraints, a matching procedure generally consists of two steps: the pairing of similar
keypoints from respective images and the selection of those that are reliable. In what follows, we
present the matching method described in the original article by D. Lowe [17]. Let LA and LB be
the sets of descriptors associated to the keypoints detected in images $u_A$ and $u_B$. The matching is done by considering every descriptor from the list $L_A$ and finding one possible match in the list $L_B$. Each descriptor $f^a \in L_A$ is paired to the descriptor $f^b \in L_B$ that minimizes the Euclidean distance between descriptors,
$$f^b = \arg\min_{f \in L_B} \|f - f^a\|_2.$$
Pairing a keypoint with descriptor $f^a$ then requires computing the distances to all descriptors in $L_B$. A pair is considered reliable only if its absolute distance is below a certain threshold $C_{match}^{absolute}$; otherwise it is discarded. To avoid dependence on an absolute distance, the SIFT method uses the second nearest neighbor to define what constitutes a reliable match. SIFT applies an adaptive threshold $\|f^a - f^b\| < C_{match}^{relative}\, \|f^a - f^{b'}\|$, where $f^{b'}$ is the second nearest neighbor,
$$f^{b'} = \arg\min_{f \in L_B \setminus \{f^b\}} \|f - f^a\|_2.$$

This is detailed in Algorithm 13. The major drawback of using a relative threshold is that it omits
detections for keypoints associated to a repeated structure in the image. Indeed, in that case, the
distance to the nearest and the second nearest descriptor would be comparable. More sophisticated
techniques have been developed to allow robust matching of images with repeated structures [20].
This matching algorithm runs in time $c \cdot N_A \cdot N_B$, where $N_A$ and $N_B$ are the numbers of keypoints in images $u_A$ and $u_B$ respectively, and $c$ is a constant proportional to the time it takes to compare two SIFT features. This is prohibitively slow for images of moderate size, although keypoint matching is highly parallelizable.
highly parallelizable. A better solution is to use more compact descriptors [26] that reduce the cost
of distance computation (and thus reduce the value of c). Among the proposed solutions we can find
more compact SIFT-like descriptors [2, 12] or binary descriptors [5, 24, 21] which take advantage of
the fast computation of the Hamming distance between two binary vectors.

Algorithm 13: Matching keypoints

Inputs: - L_A = {(x_a, y_a, σ_a, θ_a, f_a)} keypoints and descriptors relative to image u_A.
        - L_B = {(x_b, y_b, σ_b, θ_b, f_b)} keypoints and descriptors relative to image u_B.
Output: M = {((x_a, y_a, σ_a, θ_a, f_a), (x_b, y_b, σ_b, θ_b, f_b))} list of matches with positions.
Parameter: C^match_relative, the relative threshold.

for each descriptor f_a in L_A do
    Find f_b and f_b′, the nearest and second nearest neighbors of f_a:
    for each descriptor f in L_B do
        Compute the distance d(f_a, f)
    // Keep the pair if it satisfies the relative threshold.
    if d(f_a, f_b) < C^match_relative · d(f_a, f_b′) then
        Add the pair (f_a, f_b) to M
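A minimal C sketch of this brute-force procedure is given below, assuming descriptors stored as arrays of 128 floats; all names and the data layout are ours, not those of the reference implementation.

    #include <float.h>
    #include <math.h>

    #define DESC_LEN 128  /* nhist * nhist * nori with default parameters */

    /* Squared Euclidean distance between two descriptors. */
    static double dist2(const float *fa, const float *fb)
    {
        double s = 0.0;
        for (int i = 0; i < DESC_LEN; i++) {
            double d = (double)fa[i] - (double)fb[i];
            s += d * d;
        }
        return s;
    }

    /* Return the index in descsB of the match for descA, or -1 if the
     * relative test d(fa, fb) < C_rel * d(fa, fb') fails. */
    int match_one(const float *descA, const float (*descsB)[DESC_LEN],
                  int nB, double C_rel)
    {
        double d1 = DBL_MAX, d2 = DBL_MAX;  /* nearest, second nearest */
        int best = -1;
        for (int j = 0; j < nB; j++) {
            double d = dist2(descA, descsB[j]);
            if (d < d1)      { d2 = d1; d1 = d; best = j; }
            else if (d < d2) { d2 = d; }
        }
        if (best >= 0 && sqrt(d1) < C_rel * sqrt(d2))
            return best;
        return -1;
    }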


6 Summary of Parameters
The online demo provided with this publication examines in detail the behavior of each stage of the
SIFT algorithm. In what follows, we summarize all the parameters that can be adjusted in the demo.

Digital scale-space configuration and keypoint detection

Parameter   Default value   Description

σmin        0.8       blur level of v_0^1 (the seed image)
δmin        0.5       sampling distance in v_0^1 (corresponds to a 2× interpolation)
σin         0.5       assumed blur level of u_in (the input image)
noct        8         number of octaves (limited by the image size to ⌊log2(min(w, h)/δmin/12)⌋ + 1)
nspo        3         number of scales per octave
CDoG        0.015     threshold on the DoG response (value relative to nspo = 3)
Cedge       10        threshold on the ratio of principal curvatures (edgeness)

Table 3: Parameters of the scale-space discretization and the detection of SIFT keypoints.

The structure of the digital scale-space is unequivocally characterized by four structural parameters, noct, nspo, σmin, and δmin, and by the blur level σin assumed in the input image. The associated online demo allows the user to change these values; they can be tuned to satisfy specific requirements⁴. For example, increasing the number of scales per octave nspo or the initial interpolation factor 1/δmin increases the number of detections, while reducing them results in a faster algorithm.

⁴The number of computed octaves is capped to ensure that images in the last octave are at least 12 × 12 pixels.
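For instance, the cap on the number of octaves mentioned in the footnote can be computed as in the following sketch; the function name and signature are ours, not the reference code's.

    #include <math.h>

    /* Largest number of octaves such that the coarsest octave still
     * measures at least 12 x 12 pixels, following
     * n_oct <= floor(log2(min(w, h) / delta_min / 12)) + 1. */
    int max_octaves(int w, int h, double delta_min)
    {
        int m = (w < h) ? w : h;
        return (int)floor(log2((double)m / delta_min / 12.0)) + 1;
    }

With an 800 × 600 image and δmin = 0.5, this yields ⌊log2(600/0.5/12)⌋ + 1 = ⌊log2(100)⌋ + 1 = 7 octaves.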
The image structures potentially detected by SIFT have scales ranging from σmin to σmin·2^noct. It may therefore seem natural to choose the lowest possible value of σmin (i.e., σmin = σin). However, depending on the sharpness of the input image, low-scale detections may be the result of aliasing artifacts and should be avoided. Thus, a sound setting of the parameter σmin should take into account the image blur level σin and the possible presence of aliasing.
The DoG thresholding, controlled by CDoG, was conceived to filter out detections due to noise. With that aim in view, CDoG should depend on the signal-to-noise ratio of the input image; it is however beyond the scope of this publication to analyze the soundness of such an approach. We will only point out that reducing CDoG increases the number of detected keypoints. Recall that the DoG operator approximates (2^{1/nspo} − 1)σ²Δv, so its values depend on the number of scales per octave nspo. The threshold applied in the provided source code is therefore

    C̃_DoG = ((2^{1/nspo} − 1)/(2^{1/3} − 1)) · CDoG,

with CDoG expressed relative to nspo = 3.
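Under this convention, the adapted threshold can be computed as in the following sketch (the function name is ours; the reference code may organize this differently):

    #include <math.h>

    /* Adapt a DoG threshold given relative to nspo = 3 to the actual
     * number of scales per octave, following the formula above. */
    double adapted_dog_threshold(double C_DoG, int nspo)
    {
        return (pow(2.0, 1.0 / nspo) - 1.0)
             / (pow(2.0, 1.0 / 3.0) - 1.0) * C_DoG;
    }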
The threshold Cedge, applied to discard keypoints lying on edges, has in practice a negligible impact on the algorithm's performance. Indeed, many candidate keypoints lying on edges have already been discarded during the refinement of extrema.

Computation of the SIFT descriptor


The provided demo shows the computation of the keypoint reference orientation as well as the construction of the feature vector for any detected keypoint.
The parameter λori controls how local the computation of the reference orientation is. Localizing the gradient analysis generally results in an increased number of reference orientations. Indeed, the orientation histogram generated from an isotropic structure is almost flat and therefore has many local maxima. Another algorithm design parameter, not included in Table 4 because of its insignificant impact, is the level of smoothing applied to the orientation histogram (Nconv = 6).

Parameter   Default value   Description

nbins       36     number of bins in the gradient orientation histogram
λori        1.5    sets how local the analysis of the gradient distribution is:
                   - Gaussian window of standard deviation λori·σ
                   - patch width 6λori·σ
t           0.80   threshold for considering local maxima in the gradient orientation histogram
nhist       4      number of histograms in the normalized patch (nhist × nhist)
nori        8      number of bins in each descriptor histogram;
                   the feature vector dimension is nhist × nhist × nori
λdescr      6      sets how local the descriptor is:
                   - Gaussian window of standard deviation λdescr·σ
                   - descriptor patch width 2λdescr·σ

Table 4: Parameters of the computation of the SIFT feature vectors.
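As an illustration, the histogram smoothing could be implemented as below. This is a sketch under the assumption of Nconv circular passes of a (1/3)[1 1 1] box filter, and the function name is ours; the reference implementation may differ in details.

    /* Smooth a circular orientation histogram by nconv passes of the
     * box filter (1/3)[1 1 1], with wrap-around at the boundaries. */
    void smooth_histogram(double *hist, int nbins, int nconv)
    {
        double tmp[64];                  /* assumes nbins <= 64 */
        for (int p = 0; p < nconv; p++) {
            for (int i = 0; i < nbins; i++) {
                int prev = (i + nbins - 1) % nbins;
                int next = (i + 1) % nbins;
                tmp[i] = (hist[prev] + hist[i] + hist[next]) / 3.0;
            }
            for (int i = 0; i < nbins; i++)
                hist[i] = tmp[i];
        }
    }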
The size of the normalized patch used to compute the SIFT descriptor is governed by λdescr. A larger patch produces a more discriminative descriptor but is less robust to scene deformation. The number of histograms nhist × nhist and the number of bins nori can be reduced to make the feature vector more compact; with the default values nhist = 4 and nori = 8, the descriptor has 4 × 4 × 8 = 128 components. These architectural parameters govern the trade-off between robustness and discriminative power.

Matching of SIFT feature vectors


The SIFT algorithm consists of the detection of image keypoints and their description. The demo additionally provides two naive algorithms to match SIFT features. The first applies an absolute threshold on the distance to the nearest keypoint feature to decide whether a match is reliable. The second applies a relative threshold that depends on the distance to the second nearest keypoint feature.
Increasing the absolute threshold C^match_absolute evidently increases the number of matches. Likewise, in the relative threshold scenario, increasing the threshold C^match_relative results in an increased number of matches. In particular, pairs corresponding to repeated structures in the image will be less likely to be omitted. However, this may also lead to an increased number of false matches.

Parameter          Default value   Description

C^match_absolute   250 to 300      threshold on the distance to the nearest neighbor
C^match_relative   0.6             relative threshold between the nearest and second nearest neighbors

Table 5: Parameters of the SIFT matching algorithm.

Acknowledgements
This work was partially supported by the Centre National d’Etudes Spatiales (CNES, MISS Project),
the European Research Council (Advanced Grant Twelve Labours), the Office of Naval Research
(Grant N00014-97-1-0839), Direction Générale de l’Armement (DGA), Fondation Mathématique
Jacques Hadamard and Agence Nationale de la Recherche (Stereo project).


Image Credits
Crops of stereoscopic cards by T. Enami (1859-1929) were used in figures 3, 6, 8, 9 and 11.

References

[1] L. Alvarez and F. Morales, Affine morphological multiscale analysis of corners and multiple junctions, International Journal of Computer Vision, 25 (1997), pp. 95–107. http://dx.doi.org/10.1023/A:1007959616598.

[2] H. Bay, T. Tuytelaars, and L. Van Gool, SURF: Speeded Up Robust Features, in European Conference on Computer Vision, 2006.

[3] M. Brown, R. Szeliski, and S. Winder, Multi-image matching using multi-scale oriented patches, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2005. http://dx.doi.org/10.1109/CVPR.2005.235.

[4] P.J. Burt and E.H. Adelson, The Laplacian pyramid as a compact image code, IEEE Transactions on Communications, 31 (1983), pp. 532–540. http://dx.doi.org/10.1109/TCOM.1983.1095851.

[5] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, BRIEF: Binary Robust Independent Elementary Features, in European Conference on Computer Vision, 2010, pp. 778–792. http://dx.doi.org/10.1007/978-3-642-15561-1_56.

[6] J.L. Crowley and R.M. Stern, Fast computation of the difference of low-pass transform, IEEE Transactions on Pattern Analysis and Machine Intelligence, (1984), pp. 212–222. http://dx.doi.org/10.1109/TPAMI.1984.4767504.

[7] W. Förstner, A framework for low level feature extraction, in European Conference on Computer Vision, 1994, pp. 383–394. http://dx.doi.org/10.1007/BFb0028370.

[8] W. Förstner, T. Dickscheid, and F. Schindler, Detecting interpretable and accurate scale-invariant keypoints, in Proceedings of IEEE International Conference on Computer Vision, 2009. http://dx.doi.org/10.1109/ICCV.2009.5459458.

[9] P. Getreuer, A survey of Gaussian convolution algorithms, Image Processing On Line, 3 (2013), pp. 286–310. http://dx.doi.org/10.5201/ipol.2013.87.

[10] C. Harris and M. Stephens, A combined corner and edge detector, in Alvey Vision Conference, vol. 15, 1988, p. 50. http://dx.doi.org/10.5244/C.2.23.

[11] T. Hassner, V. Mayzels, and L. Zelnik-Manor, On SIFTs and their scales, in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1522–1528. http://dx.doi.org/10.1109/CVPR.2012.6247842.

[12] Y. Ke and R. Sukthankar, PCA-SIFT: A more distinctive representation for local image descriptors, in IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2004. http://dx.doi.org/10.1109/CVPR.2004.1315206.

[13] S. Leutenegger, M. Chli, and R.Y. Siegwart, BRISK: Binary Robust Invariant Scalable Keypoints, in Proceedings of IEEE International Conference on Computer Vision, 2011. http://dx.doi.org/10.1109/ICCV.2011.6126542.


[14] T. Lindeberg, Scale-space theory in computer vision, Springer, 1993. http://dx.doi.org/10.1007/978-1-4757-6465-9.

[15] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W.T. Freeman, SIFT Flow: Dense correspondence across different scenes, in European Conference on Computer Vision, 2008. http://dx.doi.org/10.1007/978-3-540-88690-7_3.

[16] D. Lowe, Object recognition from local scale-invariant features, in IEEE International Conference on Computer Vision, 1999. http://dx.doi.org/10.1109/ICCV.1999.790410.

[17] D. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision, 60 (2004), pp. 91–110. http://dx.doi.org/10.1023/B:VISI.0000029664.99615.94.

[18] E. Mair, G.D. Hager, D. Burschka, M. Suppa, and G. Hirzinger, Adaptive and generic corner detection based on the accelerated segment test, in European Conference on Computer Vision, Springer, 2010, pp. 183–196. http://dx.doi.org/10.1007/978-3-642-15552-9_14.

[19] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool, A comparison of affine region detectors, International Journal of Computer Vision, 65 (2005), pp. 43–72. http://dx.doi.org/10.1007/s11263-005-3848-x.

[20] J. Rabin, J. Delon, and Y. Gousseau, A statistical approach to the matching of local features, SIAM Journal on Imaging Sciences, 2 (2009), pp. 931–958. http://dx.doi.org/10.1137/090751359.

[21] M. Raginsky and S. Lazebnik, Locality-sensitive binary codes from shift-invariant kernels, in Advances in Neural Information Processing Systems, vol. 22, 2009, pp. 1509–1517.

[22] I. Rey-Otero and M. Delbracio, Computing an exact Gaussian scale-space, Preprint, http://www.ipol.im/pub/pre/117, 2014.

[23] E. Rosten and T. Drummond, Machine learning for high-speed corner detection, in European Conference on Computer Vision, 2006. http://dx.doi.org/10.1007/11744023_34.

[24] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, ORB: An efficient alternative to SIFT or SURF, in Proceedings of IEEE International Conference on Computer Vision, 2011, pp. 2564–2571. http://dx.doi.org/10.1109/ICCV.2011.6126544.

[25] S.M. Smith and J.M. Brady, SUSAN. A new approach to low level image processing, International Journal of Computer Vision, 23 (1997), pp. 45–78. http://dx.doi.org/10.1023/A:1007963824710.

[26] T. Tuytelaars and K. Mikolajczyk, Local invariant feature detectors: A survey, Foundations and Trends in Computer Graphics and Vision, 3 (2008), pp. 177–280. http://dx.doi.org/10.1561/0600000017.

[27] J. Weickert, S. Ishikawa, and A. Imiya, Linear scale-space has first been proposed in Japan, Journal of Mathematical Imaging and Vision, 10 (1999), pp. 237–252. http://dx.doi.org/10.1023/A:1008344623873.

[28] G. Yu and J-M. Morel, ASIFT: An Algorithm for Fully Affine Invariant Comparison, Image Processing On Line, 1 (2011). http://dx.doi.org/10.5201/ipol.2011.my-asift.
