
Local features - Scale Invariant Feature Transform

Clara Gonçalves
April 2024

1 Feature Extraction
Feature extraction is the process of transforming raw input data, such as images or videos, into a more
compact and representative format that captures relevant information. This typically involves identifying
and extracting distinctive patterns, structures, or attributes from the input data, which can include edges,
corners, textures, shapes, or colors. These extracted features serve as meaningful representations of the
original data and are used for tasks such as object recognition, classification, detection, and tracking.

Scale Space
One challenge in computer vision is dealing with objects of unknown scales in images. A solution to this
problem is the scale-space approach, which detects and selects edges at various scales. Exploring features at
different scales is beneficial for recognizing diverse objects.
• Blob detector: the Laplacian of Gaussian (LoG) is applied with different σ values, where σ serves as the scale parameter.
• The Scale-Invariant Feature Transform (SIFT) algorithm utilizes the Difference of Gaussians, an ap-
proximation of LoG, for feature extraction.

Scale-Space Blob Detector


A blob is a region in an image characterized by a cluster of pixels with similar intensity values that stand
out against the background.
Detecting blobs is crucial for identifying objects or features that don’t necessarily have well-defined edges,
such as spots, textures, or rounded shapes. The scale-space blob detector is a method used to identify these
blobs at various scales, allowing for robust detection of objects regardless of their size or appearance in the
image.
One popular technique for blob detection is the Laplacian of Gaussian (LoG), which is applied to the
image at different scales. The LoG is calculated by convolving the image with a Gaussian kernel to smooth
it, followed by applying the Laplacian operator to highlight regions of rapid intensity change. This process
effectively enhances blob-like structures in the image.
Moreover, the scale-space blob detector is not only capable of identifying blobs, but also extracting
keypoints across different scales. These keypoints represent points of interest in the image, such as corners,
edges, or other distinctive features. By extracting keypoints at multiple scales, the detector ensures that
important features are captured regardless of their size or orientation in the image.
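
As an illustration, the following Python sketch (assuming NumPy and SciPy are available; the function name, σ values, and threshold are arbitrary choices rather than a standard implementation) computes scale-normalized LoG responses at several σ and keeps local extrema as blob candidates:

    # Minimal LoG blob-detection sketch (illustrative; not a full detector).
    import numpy as np
    from scipy.ndimage import gaussian_laplace, maximum_filter

    def log_blob_candidates(image, sigmas=(2, 4, 8, 16), threshold=0.1):
        """Return (y, x, sigma) candidates where the scale-normalized LoG peaks."""
        image = image.astype(np.float32)
        # Scale-normalized LoG response for each sigma (sigma**2 keeps responses comparable).
        stack = np.stack([sigma**2 * gaussian_laplace(image, sigma) for sigma in sigmas])
        # A candidate is a local maximum of |response| within its 3x3x3 (scale, y, x) neighborhood.
        mag = np.abs(stack)
        local_max = (mag == maximum_filter(mag, size=3)) & (mag > threshold)
        return [(y, x, sigmas[s]) for s, y, x in zip(*np.nonzero(local_max))]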

SIFT — Scale Invariant Feature Transform


The Scale-Invariant Feature Transform (SIFT) is a widely used method in computer vision for detecting and
describing local features in images. It operates as both a detector and a descriptor, providing robustness to
changes in scale, rotation, and illumination. The SIFT algorithm involves several key steps:
1. Multi-scale extrema detection: SIFT searches for keypoints by identifying significant local minima
and maxima across multiple scales of an image. This is achieved by convolving the image with Gaussian
kernels of varying standard deviations to create a scale space representation.

2. Keypoint Localization: After detecting potential keypoints, SIFT performs precise localization by
refining their positions based on the local extrema in scale space. This step ensures that keypoints are
accurately localized and associated with a specific scale.
3. Orientation Assignment: To achieve rotational invariance, each keypoint is assigned a dominant
orientation based on local image gradients. This allows SIFT descriptors to be computed relative to
the keypoint’s orientation, enhancing their robustness to image rotations.
4. Keypoint Descriptor: Finally, SIFT constructs descriptors for each keypoint by sampling image
gradients in the local neighborhood around the keypoint location and orientation. These descrip-
tors capture distinctive patterns and textures, providing a compact representation of the keypoint’s
appearance.

The underlying principle of SIFT is akin to the Laplacian of Gaussian (LoG) approach, where the image is
analyzed at multiple levels of scale to capture features at different resolutions. By incorporating scale-space
analysis and orientation assignment, SIFT offers remarkable robustness and accuracy in feature detection
and description, making it a valuable tool for various computer vision tasks such as object recognition, image
matching, and 3D reconstruction.
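
In practice the whole pipeline is available off the shelf. A minimal usage sketch with OpenCV, assuming opencv-python 4.4 or later (where SIFT is in the main module) and a placeholder image path:

    # Detecting SIFT keypoints and descriptors with OpenCV.
    import cv2

    img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder input image
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)

    # Each keypoint carries position, scale and orientation; each descriptor is a 128-D vector.
    print(len(keypoints), descriptors.shape)
    vis = cv2.drawKeypoints(img, keypoints, None,
                            flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
    cv2.imwrite("sift_keypoints.jpg", vis)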

Approximation of Laplacian of Gaussian (LoG)


The Laplacian of Gaussian (LoG) and the Difference of Gaussians (DoG) are two commonly used functions
for detecting scale-space features in computer vision. These functions are pivotal for identifying blobs or
regions of interest in an image. The scale space representation of an image involves convolving the image
with a series of Gaussian kernels at different scales σ, which effectively smoothes the image. This operation
helps in detecting features at various levels of detail.
The scale-normalized Laplacian, denoted by L, is defined as the σ²-weighted sum of the second-order derivatives of the Gaussian function with respect to the spatial coordinates x and y:

L = σ² (Gxx(x, y, σ) + Gyy(x, y, σ))

where Gxx and Gyy are the second-order derivatives of the Gaussian function.
On the other hand, the Difference of Gaussians (DoG) is expressed as:

DoG = G(x, y, kσ) − G(x, y, σ)

Here, G(x, y, kσ) and G(x, y, σ) represent Gaussian functions with different scales kσ and σ, respectively.
The DoG operator essentially computes the difference between these two Gaussian-smoothed images.
Both the LoG and DoG operators exhibit similar characteristics, making them suitable for blob detection.
Blobs, or regions of uniform intensity surrounded by significantly different intensities, can be effectively
detected using these operators. Moreover, both functions are equivariant, meaning they maintain the same
form under coordinate transformations, ensuring consistent feature detection across different orientations
and scales.
In summary, the approximation of the Laplacian of Gaussian and the Difference of Gaussians are fun-
damental tools in scale-space analysis and blob detection, enabling robust feature extraction in computer
vision applications.
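
A minimal sketch of the approximation, assuming OpenCV is available (the σ and k values are illustrative defaults, not prescribed by the text):

    # The DoG approximation in a few lines: blur at sigma and k*sigma, then subtract.
    import cv2
    import numpy as np

    def difference_of_gaussians(image, sigma=1.6, k=np.sqrt(2)):
        """Approximate the scale-normalized LoG response at scale sigma."""
        image = image.astype(np.float32)
        g1 = cv2.GaussianBlur(image, (0, 0), sigmaX=sigma)      # G(x, y, sigma) * I
        g2 = cv2.GaussianBlur(image, (0, 0), sigmaX=k * sigma)  # G(x, y, k*sigma) * I
        return g2 - g1                                          # DoG = G(k*sigma) - G(sigma)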

SIFT Detector
The Scale-Invariant Feature Transform (SIFT) detector is a widely used method for identifying key points
in images that are invariant to scale, rotation, and illumination changes. It operates by approximating the
Laplacian of Gaussian (LoG) operator with a series of Difference of Gaussians (DoG) applied over an image
pyramid.
The image pyramid consists of multiple octaves, each containing several levels. Within an octave, the levels are produced by progressively blurring the image with Gaussian kernels of increasing σ; the base image of the next octave is obtained by downsampling by a factor of 2. Within each octave, the DoG is computed by taking the difference between consecutive Gaussian-smoothed images, and this is repeated across all levels of the octave. The key insight of SIFT is that by computing the DoG at different scales, it becomes possible to detect key points that are robust to variations in scale. The only parameter that changes across octaves, and across levels within an octave, is the σ of the Gaussian kernel, which controls the amount of smoothing applied to the image.
SIFT key points are then identified at local extrema in the DoG pyramid, and their precise locations are
interpolated to sub-pixel accuracy.
Finally, key points are characterized by their local image gradients and orientations, forming a descriptor
that is invariant to various transformations. SIFT has proven to be highly effective in a wide range of
computer vision applications, including object recognition, image stitching, and 3D reconstruction. However,
it is computationally intensive and may be less suitable for real-time applications on resource-constrained
devices.
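
The pyramid structure described above can be sketched in a few lines. The following construction is illustrative only; the number of octaves, levels per octave, base σ, and the nearest-neighbor downsampling are assumptions, not Lowe's exact settings:

    # Sketch of a DoG pyramid (parameter choices are illustrative).
    import cv2
    import numpy as np

    def build_dog_pyramid(image, num_octaves=4, levels_per_octave=5, sigma0=1.6):
        image = image.astype(np.float32)
        k = 2 ** (1.0 / (levels_per_octave - 2))   # scale step between adjacent levels
        dog_pyramid = []
        base = image
        for _ in range(num_octaves):
            # Gaussian-blurred images of this octave with increasing sigma.
            gaussians = [cv2.GaussianBlur(base, (0, 0), sigmaX=sigma0 * (k ** i))
                         for i in range(levels_per_octave)]
            # DoG levels are differences of consecutive Gaussian levels.
            dog_pyramid.append([g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])])
            # Next octave: downsample by a factor of 2.
            base = cv2.resize(base, (base.shape[1] // 2, base.shape[0] // 2),
                              interpolation=cv2.INTER_NEAREST)
        return dog_pyramid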

Scale Space Extrema Detection


Scale space extrema detection is a fundamental step in feature detection algorithms like the Difference of
Gaussians (DoG) used in the Scale-Invariant Feature Transform (SIFT). This process aims to locate points
in an image where the intensity is either a local maximum or minimum compared to neighboring points
across multiple scales. Here’s a detailed explanation of the procedure:
To find extrema in the scale space:
• At each pixel of a DoG level, compare its value to its 26 neighbors: the 8 neighbors at the same level and the 9 neighbors in each of the adjacent levels above and below.
• Repeat this process for every level of the DoG pyramid.
If a point is found to be a local extremum (either a maximum or a minimum) in its local neighborhood
across scales, it is considered a potential keypoint. This means that the keypoint is well represented at that
particular scale. Each keypoint is typically represented by its location (x, y) in the image and the scale (σ)
at which it was detected, ensuring scale invariance.
The scale space extrema detection process ensures that keypoints are detected at appropriate scales,
allowing for robust feature matching across images with variations in scale, rotation, and illumination. This
approach forms the basis of many feature detection algorithms in computer vision, enabling applications
such as object recognition, image stitching, and augmented reality to operate effectively across diverse image
datasets.
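
A small helper illustrating the 26-neighbor test for a single pixel of a DoG level (the function and argument names are hypothetical; border handling and sub-pixel refinement are omitted):

    # Checking the 26-neighbor extremum condition for one interior pixel.
    import numpy as np

    def is_scale_space_extremum(dog_below, dog_current, dog_above, y, x):
        """True if dog_current[y, x] is a strict max or min over its 26 neighbors.
        (y, x) is assumed to be an interior pixel of the level."""
        cube = np.stack([lvl[y - 1:y + 2, x - 1:x + 2]
                         for lvl in (dog_below, dog_current, dog_above)])  # 3x3x3 block
        value = dog_current[y, x]
        others = np.delete(cube.ravel(), 13)   # drop the center element (index 13 of 27)
        return bool(np.all(value > others) or np.all(value < others))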

SIFT Keypoint Stability - Illumination


SIFT (Scale-Invariant Feature Transform) keypoints are critical for various computer vision tasks due to their
robustness against scale, rotation, and illumination changes. Ensuring the stability of keypoints, especially
under varying illumination conditions, is crucial for the accuracy of feature extraction. To address this,
several procedures are employed:

• Low Contrast Feature Removal: Keypoints associated with regions of low contrast are often unreliable and can introduce noise into the feature extraction process. One way to mitigate this is to remove keypoints whose Difference of Gaussians (DoG) response at the keypoint location has a magnitude below a certain threshold. This thresholding discards keypoints that lack sufficient contrast, enhancing the stability of feature detection; low-contrast regions are generally less reliable than high-contrast ones for feature extraction.
• Gradient-based Filtering: Another way to improve the stability of SIFT keypoints is gradient-based filtering. Keypoints in regions with weak gradients, or keypoints lying along edges (which are well localized in only one direction), are prone to noise and inaccuracies. By discarding such keypoints, the algorithm retains only well-localized keypoints with strong gradient structure. This filtering helps select robust keypoints that are less sensitive to variations in illumination and noise.

Ensuring the stability of SIFT keypoints under varying illumination conditions involves removing low con-
trast features and employing gradient-based filtering to retain robust keypoints. These techniques contribute
to the reliability and accuracy of feature extraction in challenging lighting environments.
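
In OpenCV's SIFT implementation these two filters are exposed as the contrastThreshold and edgeThreshold parameters. A brief sketch, assuming opencv-python 4.4 or later and a placeholder image path:

    # Tuning the stability filters exposed by OpenCV's SIFT.
    import cv2

    # contrastThreshold discards low-contrast keypoints (higher value -> fewer keypoints);
    # edgeThreshold discards edge-like keypoints (lower value -> stricter edge rejection).
    sift = cv2.SIFT_create(contrastThreshold=0.04, edgeThreshold=10)
    img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder input image
    keypoints = sift.detect(img, None)
    print(f"{len(keypoints)} keypoints survive the contrast/edge filters")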

Keypoint Orientation
Assigning orientation to keypoints is crucial for achieving rotation invariance in feature detection. Here’s
how it’s typically done:
First, at each level of scale, the gradient magnitude and orientation are computed for every pixel in the
image. The size of the orientation collection region around each keypoint depends on its scale.
Next, a histogram of orientations is constructed by summing the weighted magnitudes of gradients within
each orientation bin. Typically, the 360 degrees of orientation are divided into bins, often 36 bins (each
representing 10 degrees).
Peaks in the histogram are identified, and the orientations corresponding to these peaks are assigned to
the keypoints. Additionally, the sum of magnitudes within each peak’s bin is assigned to the keypoint.
This process results in keypoints that have the same location and scale as the original but are oriented in
different directions. This orientation assignment effectively splits one keypoint into multiple keypoints, each
representing the same feature but oriented differently. This technique enhances the robustness of feature
matching across different orientations, enabling more accurate object recognition and matching in images
with varying viewpoints.
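
A simplified sketch of this orientation assignment, assuming NumPy; the Gaussian weighting width and the 80% peak rule are illustrative simplifications, and full SIFT additionally keeps only local histogram peaks and interpolates their positions:

    # Dominant-orientation estimation with a 36-bin histogram (simplified).
    import numpy as np

    def dominant_orientations(patch, num_bins=36, peak_ratio=0.8):
        """patch: square grayscale region around the keypoint; its size depends on the scale."""
        patch = patch.astype(np.float32)
        gy, gx = np.gradient(patch)
        magnitude = np.hypot(gx, gy)
        orientation = np.degrees(np.arctan2(gy, gx)) % 360.0
        # Weight gradient magnitudes with a Gaussian centered on the keypoint.
        h, w = patch.shape
        yy, xx = np.mgrid[0:h, 0:w]
        weight = np.exp(-((yy - h / 2) ** 2 + (xx - w / 2) ** 2) / (2 * (0.5 * h) ** 2))
        hist, _ = np.histogram(orientation, bins=num_bins, range=(0, 360),
                               weights=magnitude * weight)
        # Keep every bin within peak_ratio of the highest peak (each yields one orientation).
        peaks = np.flatnonzero(hist >= peak_ratio * hist.max())
        return [(b + 0.5) * (360.0 / num_bins) for b in peaks]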

SIFT Descriptor
When evaluating keypoints in a region, it’s essential to describe the local image features in a way that is
invariant to various transformations. This is where SIFT (Scale-Invariant Feature Transform) descriptors
come into play.

Local Image Description


For each detected keypoint, several attributes are assigned, including its location, scale (analogous to the
level it was detected), and orientation (assigned in a previous step). To describe the local image region
invariantly, we employ SIFT descriptors, which undergo the following steps:

1. Utilize the Gaussian blurred image associated with the keypoint’s scale.
2. Extract image gradients over a 16x16 square window surrounding the detected feature, obtaining
gradients for each pixel.
3. Rotate the gradient directions and locations relative to the keypoint orientation (provided by the
dominant orientation).
4. Divide the 16x16 window into a 4x4 grid of cells.
5. Compute an orientation histogram with 8 orientation bins for each cell by summing the weighted gradient magnitudes, so that each cell summarizes the distribution of gradient orientations within its sub-patch. Concatenating the 16 cell histograms yields the keypoint descriptor, a vector of length 4 × 4 × 8 = 128 (a simplified sketch of this construction follows below).
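
The sketch below, assuming NumPy, illustrates the 4 × 4 × 8 construction on a 16 × 16 window; it omits the Gaussian weighting, trilinear interpolation, and value clipping used by full SIFT, so it is a didactic approximation rather than the actual descriptor:

    # Simplified 128-D SIFT-like descriptor from a 16x16 gradient window.
    import numpy as np

    def sift_like_descriptor(window16, num_bins=8):
        """window16: 16x16 grayscale patch, already rotated to the keypoint orientation."""
        gy, gx = np.gradient(window16.astype(np.float32))
        magnitude = np.hypot(gx, gy)
        orientation = np.degrees(np.arctan2(gy, gx)) % 360.0
        descriptor = []
        for cy in range(4):                       # 4x4 grid of 4x4-pixel cells
            for cx in range(4):
                sl = (slice(4 * cy, 4 * cy + 4), slice(4 * cx, 4 * cx + 4))
                hist, _ = np.histogram(orientation[sl], bins=num_bins, range=(0, 360),
                                       weights=magnitude[sl])
                descriptor.extend(hist)
        descriptor = np.array(descriptor, dtype=np.float32)      # length 4*4*8 = 128
        return descriptor / (np.linalg.norm(descriptor) + 1e-7)  # normalize for illumination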

SIFT Descriptor Properties


The SIFT descriptor exhibits several desirable properties:

• Rotation Invariance: Achieved by rotating the gradients, assuming that a rotated image generates
a keypoint at the same location as the original image.
• Scale Invariance: Maintained by working with the scale image from the Difference of Gaussians
(DoG).
• Robustness to Illumination Variation: Since only gradient orientations are considered, the SIFT
descriptor is insensitive to changes in illumination.
• Slight Robustness to Affine Transformations and Noise: Empirically observed, though not as
pronounced as other invariances.

Affine Invariant Detection: SIFT exhibits robustness to similarity transforms (rotation and uniform scaling) and, to a lesser degree, to affine transforms (rotation combined with non-uniform scaling). However, a downside of SIFT is its computational complexity, which may hinder real-time applications.

SURF - Speeded Up Robust Features


SURF is a technique derived from SIFT: it is a fast approximation of the SIFT idea. It achieves efficient computation through 2D box filters and integral images, running roughly six times faster than SIFT with equivalent quality for object identification. SURF is a feature detection framework in which the interest points are invariant to in-plane rotation, robust to noise, and, overall, extremely fast to compute. The procedure can be divided into three steps:
1. Interest point detection
2. Interest point description
3. Interest point matching

Integral Images
Integral images provide an efficient method for calculating the sum of pixel values within rectangular regions
of an image. The integral image IΣ (x) at a location x = (x, y)T is defined as the sum of all pixel values
in the input image I within the rectangular region formed by the origin and x. This transform is achieved
through the following equation:

IΣ(x) = Σ_{x′ ≤ x, y′ ≤ y} I(x′, y′)

Integral images underpin the main computational difference between SURF and the Scale-Invariant Feature Transform (SIFT): whereas SIFT approximates the Laplacian of Gaussian (LoG) with a Difference of Gaussians, SURF replaces the Gaussian filters with 2D box filters whose responses can be evaluated in constant time from the integral image. The sum of pixel values within a rectangular region is computed from the integral image as:

Σ = A − B − C + D
where A, B, C, and D represent the pixel sums of the integral image at the corners of the rectangular
region of interest. This equation essentially calculates the sum of pixel values within the rectangle by
subtracting the areas outside the rectangle from the total sum.
Integral images offer remarkable efficiency, enabling the characterization of image regions with just four
memory accesses and three arithmetic operations. This efficiency makes integral images particularly suitable
for tasks such as blob detection, where the computation of pixel sums within rectangular regions is preva-
lent. The integral image computation significantly reduces the computational cost associated with feature
extraction, making it an invaluable tool in computer vision applications.
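
A small NumPy sketch of the integral image and the A − B − C + D box sum (OpenCV's cv2.integral performs the same computation; the zero-padding convention used here is one possible choice):

    # Integral image and constant-time box sums with NumPy.
    import numpy as np

    def integral_image(image):
        """Pad with a zero row/column so box_sum also works at the image border."""
        ii = image.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
        return np.pad(ii, ((1, 0), (1, 0)))

    def box_sum(ii, top, left, bottom, right):
        """Sum of image[top:bottom, left:right] from four integral-image lookups."""
        return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

    img = np.arange(25).reshape(5, 5)
    ii = integral_image(img)
    assert box_sum(ii, 1, 1, 4, 4) == img[1:4, 1:4].sum()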

2 Detection: Hessian-based Interest Points


In Hessian-based interest point detection, blob-like structures are identified at locations where the determinant of the Hessian matrix is maximized. The Hessian matrix, denoted as H(x, σ), is defined as:

H(x, σ) = [ Lxx(x, σ)   Lxy(x, σ)
            Lxy(x, σ)   Lyy(x, σ) ]
Here, Lxx (x, σ) represents the convolution of the Gaussian second-order derivative with the image I at
point x. This approach is particularly adept at selecting corners rather than edges.
However, direct computation of the Hessian matrix is computationally expensive and slow. An alternative approach is to approximate the Hessian using box filters: the Gaussian second-order derivatives are replaced by box filters, resulting in an approximate Hessian matrix denoted as Happrox. The determinant of this approximate Hessian is computed as:

det(Happrox) = Dxx Dyy − (w Dxy)²


Here, Dxx denotes the box-filter approximation of the Gaussian second-order partial derivative in the x-direction (and analogously for Dyy and Dxy), and w ≈ 0.9 is a weight that compensates for the box-filter approximation. This approach reduces computational complexity while still effectively
identifying interest points in the image. This method is widely used in applications such as feature matching,
object recognition, and image stitching due to its robustness and efficiency.
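
As a sketch of the determinant-of-Hessian response, the code below uses SciPy's Gaussian second derivatives instead of SURF's integral-image box filters, keeping only the w ≈ 0.9 weighting from the text; it is an approximation for illustration, not the SURF implementation:

    # Determinant-of-Hessian blob response (sketch with Gaussian derivatives).
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def hessian_det_response(image, sigma=2.0, w=0.9):
        image = image.astype(np.float32)
        Lxx = gaussian_filter(image, sigma, order=(0, 2))   # d^2/dx^2 (x = column axis)
        Lyy = gaussian_filter(image, sigma, order=(2, 0))   # d^2/dy^2
        Lxy = gaussian_filter(image, sigma, order=(1, 1))   # mixed derivative
        return Lxx * Lyy - (w * Lxy) ** 2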

Scale-Space Representation
To match interest points across different scales, a pyramidal scale space is built. Rather than repeatedly downsampling the image, SURF keeps the image at its original resolution and applies box filters of increasing size, so the responses at different scales can be computed in parallel. Each scale is defined as the response of the image convolved with a box filter of a certain size, and the scales are grouped into octaves (sets of filter responses).

Descriptor: Orientation Assignment


In the context of feature detection and description, orientation assignment is a crucial step, particularly
in algorithms like Speeded Up Robust Features (SURF). Unlike Scale-Invariant Feature Transform (SIFT),
which uses gradient histograms, SURF employs Haar wavelet responses to assign orientations to interest
points. This process involves calculating the responses in the x- and y-directions using Haar wavelets within
a circular neighborhood around the interest point, with a radius of 6s, where s represents the scale. These
responses are then weighted with a Gaussian function centered at the interest point, with a standard deviation
of 2s. The weighted responses are then treated as points in a 2D space of horizontal and vertical response strength, and a sliding orientation window is swept around the origin. Within each window position, a local orientation vector is computed as the sum of the x and y responses, and the dominant orientation is the largest such vector across all window positions. Notably, this approach offers a distinct methodology compared to SIFT, enhancing the versatility
and adaptability of feature detection across various image contexts.

Descriptor: Feature Vector


To describe key features in a straightforward manner, feature extraction involves the calculation of a feature
vector based on the Haar wavelet responses within a defined window centered around an interest point.
Specifically, a square window of size 20s, oriented along the keypoint's dominant orientation, is subdivided into a 4 × 4 grid. Within each subdivision, the horizontal and vertical Haar wavelet responses (dx and dy) are computed at 5 × 5 regularly spaced sample points, and four metrics are derived from these responses: the sum and the absolute sum of the horizontal and vertical responses. This yields a local feature vector per cell, and concatenating the 16 local vectors forms a 64-element feature vector that characterizes the interest point and its surrounding neighborhood:

v = ( Σ dx, Σ dy, Σ |dx|, Σ |dy| )

Here, dx denotes the horizontal Haar wavelet response and dy the vertical Haar wavelet response.

Matching: Laplacian Indexing


In the context of fast indexing during the matching phase, the sign of the Laplacian (Tr(H)) associated with
the underlying interest point is incorporated into the discrimination cascade. This sign of the Laplacian
is pivotal as it distinguishes between bright blobs on dark backgrounds and their opposite counterparts,
thereby serving as a crucial metric for partitioning the entire set of interest points. By leveraging this
information, the indexing process becomes more efficient and effective, enhancing the overall performance of

the matching algorithm. This approach contributes significantly to the accuracy and robustness of computer
vision systems, particularly in scenarios where discerning between different types of features is critical.

Recent Advances in Interest Points


Recent advancements in interest point detection and description have led to significant improvements in
computational efficiency and robustness in computer vision tasks. One notable area of progress is in the
development of binary feature descriptors, which offer advantages in terms of memory usage and processing
speed. Among these descriptors are:

• BRIEF: Binary Robust Independent Elementary Features, which efficiently encodes key points into
binary strings, making them computationally lightweight.
• ORB: Oriented FAST and Rotated BRIEF, which combines the speed of the FAST keypoint detector
with the robustness of BRIEF descriptors, allowing for rotation-invariant matching.
• BRISK: Binary Robust Invariant Scalable Keypoints, known for its efficiency in scale-space analysis
and robustness to variations in illumination and viewpoint.
• FREAK: Fast Retina Keypoint, a descriptor designed for high-speed performance and robustness to
geometric and photometric transformations.

• LIFT: Learned Invariant Feature Transform, an early deep-learning-based approach that learns keypoint detection, orientation estimation, and description in an end-to-end manner, offering improved generalization capabilities.

These binary feature descriptors play a crucial role in various computer vision applications, including ob-
ject recognition, image matching, and 3D reconstruction, by providing efficient and effective representations
of visual data while mitigating computational overhead.

3 Binary Descriptors
While complex feature descriptors like SIFT are highly effective and often considered the gold standard in computer vision tasks, they come with drawbacks such as computational expense and patenting issues.
Binary descriptors offer an alternative approach by generating compact binary strings that are com-
putationally efficient to compute and compare.
Key Idea of Binary Descriptors: The fundamental concept behind binary descriptors is a simple strategy:
1. Select a patch around a keypoint.
2. Choose a set of pixel pairs within that patch.
3. For each pair (s1, s2), compare their intensities according to the equation:

b = 1 if I(s1) < I(s2), and b = 0 otherwise.     (1)

4. Concatenate all resulting binary values b into a single bit string.
This concise representation facilitates fast computation and comparison, making binary descriptors suitable for various computer vision applications.
Key advantages of Binary descriptors
• Compact descriptor: Binary descriptors encode features using a fixed-length sequence of bits, where
the length corresponds to the number of feature pairs. This compact representation is efficient for
storage and processing.
• Fast to compute: Binary descriptors are computed rapidly, typically through simple intensity value
comparisons between neighboring pixels or image patches. This computational simplicity enables real-
time performance in various computer vision applications.

• Trivial and fast to compare: Binary descriptors facilitate rapid feature matching through the use of
the Hamming distance metric. The Hamming distance measures the number of differing bits between
two binary strings, and it is computed as follows:
HammingDistance(A, B) = Σ_{i=1}^{n} (Ai ⊕ Bi)

where A and B are binary descriptors of length n, and ⊕ denotes the bitwise XOR operation.
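
A compact sketch of both ideas, assuming NumPy; the patch size and the 256 random pairs are arbitrary illustrations (note that the same pair set is reused for every descriptor, as the remark below requires):

    # Binary descriptor test (eq. 1) and Hamming-distance comparison.
    import numpy as np

    rng = np.random.default_rng(0)
    PATCH = 32                                           # assumed patch size
    PAIRS = rng.integers(0, PATCH, size=(256, 2, 2))     # 256 random (s1, s2) pixel pairs

    def binary_descriptor(patch, pairs=PAIRS):
        """b_i = 1 if I(s1) < I(s2), else 0; concatenated into a 256-element bit vector."""
        s1 = patch[pairs[:, 0, 0], pairs[:, 0, 1]]
        s2 = patch[pairs[:, 1, 0], pairs[:, 1, 1]]
        return (s1 < s2).astype(np.uint8)

    def hamming_distance(a, b):
        """Number of differing bits (XOR followed by a popcount)."""
        return int(np.count_nonzero(a ^ b))

    p1 = rng.integers(0, 256, (PATCH, PATCH))
    p2 = rng.integers(0, 256, (PATCH, PATCH))
    print(hamming_distance(binary_descriptor(p1), binary_descriptor(p2)))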
Important Remark - Pairs
To ensure a fair comparison of descriptors across images, it is crucial to adhere to the following guidelines
regarding pairs selection:
• Consistent Pair Usage: Use the same pairs for all descriptors under consideration. This ensures that each descriptor is evaluated on the same data, eliminating potential biases introduced by varying pairs.
• Order Preservation: Maintain the identical order in which pairs are tested across all descriptors. This preserves the integrity of the comparison process, allowing for an accurate assessment of descriptor performance.
Different descriptors may necessitate distinct methodologies for pair selection; hence, it is
imperative to establish a standardized approach to pair selection to facilitate fair and mean-
ingful comparisons among descriptors.

BRIEF: Binary Robust Independent Elementary Features


BRIEF is a feature descriptor used in computer vision, particularly for tasks like image matching and object
recognition. Unlike traditional feature descriptors, which produce real-valued feature vectors, BRIEF generates fixed-length binary descriptors, typically of 256 bits. These descriptors are efficient in terms of storage and
computation. BRIEF operates on a smoothed version of the image to mitigate the effects of noise. It employs
five different sampling strategies to select pairs of points in the image for feature extraction.

BRIEF Sampling Pairs


BRIEF employs various sampling strategies to select pairs of points from the patch for feature extraction:
1. Uniform Random Sampling: Points are randomly selected across the patch.
2. Gaussian Sampling: Points are sampled from a Gaussian distribution centered on the patch, favoring locations near the keypoint.
3. Dual Gaussian Sampling: Two Gaussian distributions are used, with the second centered on the sample drawn from the first, allowing for more localized sampling.
4. Discrete Locations from a Coarse Polar Grid: Points are selected from a coarse polar grid, enabling efficient coverage of the patch.
5. Fixed Point with Coarse Polar Grid: One point of each pair is fixed at the patch center (0, 0), while the other is selected from a coarse polar grid. This strategy ensures robustness against image transformations.
These sampling strategies enable BRIEF to capture distinctive local features from images effectively,
contributing to its robustness and efficiency in various computer vision applications.

3.1 ORB: Oriented FAST and Rotated BRIEF


An extension to BRIEF, ORB incorporates rotation compensation and optimal sampling pair selection,
enhancing its robustness and discriminative power.
Rotation Compensation: ORB estimates the center of mass (CoM) and the main orientation of the patch to compensate for rotation. The image moments Mpq of a patch I are calculated as:

Mpq = Σ_{x,y} x^p y^q I(x, y)

The center of mass C is then computed as:

C = (1/M00) (M10, M01)

Additionally, the orientation θ is determined using:

θ = (1/2) arctan( 2 M11 / (M20 − M02) )
Given CoM and orientation (C, θ), coordinates s are rotated by θ around C using the transformation:

s′ = T (C, θ)s
Transformed pixel coordinates are then used for testing, ensuring invariance to rotation in the plane.
Learning Sampling Pairs: ORB selects sampling pairs optimized for being uncorrelated and having
high variance, thus adding new information to the descriptor and enhancing discriminative power. ORB
defines a strategy for selecting 256 pairs that optimize both properties using a training database.

ORB vs SIFT:
• Speed: ORB is 100 times faster than SIFT.
• Descriptor Size: ORB uses 256-bit descriptors, whereas SIFT employs 4096-bit descriptors (128 floats).
• Scale Invariance: ORB is not inherently scale invariant, but this can be achieved by applying image pyramids.
• Rotation Invariance: ORB is primarily invariant to in-plane rotation.
• Matching Performance: ORB demonstrates matching performance similar to SIFT when scale changes are not considered.
• Modern Usage: Several modern online systems, such as Simultaneous Localization and Mapping (SLAM), use binary features, including ORB, because of their computational efficiency and robustness.

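A typical usage sketch with OpenCV (image paths are placeholders; parameter values are illustrative):

    # ORB detection and Hamming-distance matching with OpenCV.
    import cv2

    img1 = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("target.jpg", cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Binary descriptors are compared with the Hamming distance;
    # crossCheck keeps only mutual best matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    vis = cv2.drawMatches(img1, kp1, img2, kp2, matches[:50], None)
    cv2.imwrite("orb_matches.jpg", vis)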

4 Visual Feature Representation


In computer vision, visual features play a crucial role in various tasks such as object detection, recognition,
and tracking. These features are typically represented by a combination of keypoints and descriptors.

4.1 Keypoints
Keypoints denote specific locations in an image that are distinctive or informative. They serve as anchor
points for further analysis. For instance, in a scene, keypoints could represent corners, edges, or other salient
features.

4.2 Descriptors
Descriptors, on the other hand, provide a detailed description of the appearance of the region surrounding
a keypoint. They encode information about the intensity gradients, colors, textures, or other relevant
attributes. This description facilitates matching and comparison between different keypoints.

4.3 Types of Descriptors


Various types of descriptors are employed in computer vision tasks. Some popular ones include:

• SIFT (Scale-Invariant Feature Transform): SIFT descriptors operate on gradient histograms, capturing information about the local image structure at different scales.

• SURF (Speeded-Up Robust Features): Similar to SIFT, SURF descriptors also operate on gra-
dient information but are designed to be computationally faster, making them suitable for real-time
applications.
• BRIEF (Binary Robust Independent Elementary Features): To enhance efficiency, BRIEF
generates binary descriptors directly from image intensities, reducing memory and computational re-
quirements.
• ORB (Oriented FAST and Rotated BRIEF): Combining the efficiency of BRIEF with the rota-
tional invariance of keypoints, ORB descriptors provide robustness against image transformations.

By utilizing combinations of key points and descriptors, computer vision algorithms can effectively extract
and represent visual information, enabling tasks such as object recognition, tracking, and scene understand-
ing.

5 Feature Matching
Feature matching involves the process of associating keypoints between different images, enabling tasks such
as image alignment, object recognition, and scene reconstruction.

5.1 Matching Procedure


Given a set of keypoints extracted from two images, denoted as I1 and I2 , the objective is to find correspon-
dences between keypoints in I1 and their counterparts in I2 .

1. Descriptor Distance Calculation: To measure the similarity between descriptors corresponding to keypoints, a distance function is defined. Common distance metrics include:
• L2 Norm (Mean Square Error): This metric calculates the Euclidean distance between two
descriptors, representing the root mean square difference between corresponding elements.
• L1 Norm (Mean Absolute Error): L1 norm computes the sum of absolute differences between
corresponding elements of two descriptors, providing robustness against outliers.
These metrics quantify the dissimilarity between descriptors, aiding in feature matching.

2. Matching Strategies: Once the distance function is established, the matching process proceeds in
one of the following ways:
• Single Best Match: This approach involves testing all features in I2 and finding the one with
the minimum distance to the descriptor from I1 . It aims to find the most likely correspondence
for each keypoint in I1 .
• Top-k Matches: In contrast, this strategy evaluates all features in I2 against the descriptor from
I1 and selects the k features with the lowest distances. This approach considers multiple potential
matches for each keypoint, offering robustness against outliers and ambiguities.
The choice between these strategies depends on the specific application requirements and the charac-
teristics of the images being matched.

By employing suitable distance metrics and matching strategies, feature matching enables the establish-
ment of correspondences between keypoints across different images, forming the basis for various computer
vision tasks.
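
The two strategies can be sketched directly with NumPy distance computations (the array shapes and random data below are purely illustrative):

    # Descriptor matching with explicit L2 distances: single best match and top-k.
    import numpy as np

    def match_descriptors(desc1, desc2, k=2):
        """desc1: (N1, D), desc2: (N2, D). Returns best index and k nearest per row."""
        # Pairwise L2 distances, shape (N1, N2).
        dists = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
        best = dists.argmin(axis=1)               # single best match in I2 per keypoint of I1
        topk = np.argsort(dists, axis=1)[:, :k]   # k lowest-distance candidates
        return best, topk, dists

    d1 = np.random.rand(5, 128).astype(np.float32)   # e.g. 5 descriptors from I1
    d2 = np.random.rand(8, 128).astype(np.float32)   # e.g. 8 descriptors from I2
    best, topk, _ = match_descriptors(d1, d2)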

5.2 Image Matching


Image matching involves the process of finding correspondences between keypoints detected in a target image
and those present in a reference image.

5.2.1 Keypoint Representation
Each keypoint identified in the target image is characterized by several attributes:
• 2D Location: The coordinates (x, y) pinpoint the position of the keypoint within the image plane.

• Scale (s) and Orientation (θ): These parameters specify the size and orientation of the local region
around the keypoint, aiding in scale and rotational invariance.
• Descriptor Vector (d): The descriptor vector encapsulates the appearance information of the region
surrounding the keypoint. It represents the local image structure, capturing gradients, textures, or
other relevant features.

Thus, each keypoint in the target image is described by the tuple (x, y, s, θ, d), providing comprehensive
information for matching.

5.2.2 Matching Procedure


For every keypoint in the target image, a search is conducted to find similar descriptor vectors within the
reference image. Due to variations in viewpoint, lighting conditions, and occlusions, a single descriptor vector
from the target image may correspond to multiple descriptors in the reference image.

• Similarity Search: The descriptor vector associated with each keypoint in the target image is com-
pared with those in the reference image using a distance metric, such as the L2 or L1 norm. Keypoints
in the reference image whose descriptors exhibit minimal distance to the target descriptor are consid-
ered potential matches.
• Potential Matches: It’s important to note that a single descriptor vector from the target image
may find correspondence with multiple descriptor vectors in the reference image. This occurs due to
similarities between local image structures or the presence of repetitive patterns.

By performing this matching process for all keypoints in the target image, a set of correspondences
between the two images is established, facilitating tasks such as image registration, object recognition, and
scene understanding.

5.3 Feature Distance


When comparing two features, denoted as f1 and f2 , it’s crucial to define a suitable distance metric that
quantifies their dissimilarity. However, a straightforward approach, such as the L2 distance ||f1 − f2 ||, may
not always yield optimal results, especially in scenarios where ambiguous or incorrect matches occur.

5.3.1 Analyzing Features


Before selecting a distance metric, it’s essential to analyze the characteristics of the features being compared.
Features typically consist of descriptors that encode information about the local image structure surround-
ing keypoints. These descriptors may encompass gradient histograms, color histograms, or other relevant
attributes.

5.3.2 Challenges with L2 Distance


While the L2 distance is widely used due to its simplicity and effectiveness in many cases, it can yield small
distances even for incorrect matches or ambiguous scenarios. This occurs because the L2 distance measures
the Euclidean distance between corresponding elements of the feature vectors f1 and f2 . Consequently,
even if two keypoints are not identical, their descriptors may exhibit similar patterns, leading to misleading
matches.

5.3.3 Improved Approaches
To mitigate the limitations of the L2 distance and address the challenges posed by ambiguous matches,
alternative distance metrics and matching strategies can be employed. These may include:

• Local Feature Matching: Instead of relying solely on global descriptors, consider matching based on
local regions surrounding keypoints. This approach can enhance robustness against viewpoint changes
and occlusions.

• Feature Space Transformation: Transforming the feature space or applying dimensionality reduc-
tion techniques can help emphasize discriminative features and suppress noise or irrelevant information.
• Advanced Distance Metrics: Explore distance metrics tailored to specific feature types or ap-
plication domains. For instance, metrics based on information theory, such as the Jensen-Shannon
divergence, may offer better discrimination between feature vectors.

By adopting these improved approaches, the feature matching process can achieve greater accuracy and
reliability, even in challenging conditions where traditional metrics may falter.

5.3.4 Ratio Test


A more effective approach to address the limitations of direct distance comparison, such as the L2 distance,
is the ratio test. This method aims to improve the robustness of feature matching by considering the relative
distances between potential matches.

• Distance Ratio Calculation: The distance ratio is computed as the ratio of the L2 distances between
the best and second-best matches of a feature f1 , denoted as f2 and f2′ , respectively. Mathematically,
it can be expressed as:
Distance Ratio = ||f1 − f2|| / ||f1 − f2′||
This ratio provides a measure of confidence in the match, with lower values indicating more reliable
matches.

• Sorting by Confidence: Sorting matches based on the distance ratio arranges them in order of
confidence, with more reliable matches appearing first in the list.
• Threshold Definition: To filter out ambiguous matches, a distance ratio threshold p is defined, typically set around 0.5. With this threshold, the distance to the best match (f2) must be at most half the distance to the second-best match (f2′) for a match to be considered reliable.

5.3.5 Ratio Test Procedure


The ratio test procedure is as follows:

• Top Two Matches: Retain the top two matches, f2 and f2′ , based on the L2 distances.

• Ambiguity Check: If the distance to the best match (f2 ) is significantly closer than the distance to
the second-best match (f2′ ), i.e., if d(f1 , f2 ) < p × d(f1 , f2′ ), consider the match between f1 and f2 as
a ’strong’ match and retain it.
• Ambiguous Match Rejection: Discard matches where the distance to the best match is not sig-
nificantly closer than the distance threshold (p) multiplied by the distance to the second-best match.
These matches are deemed ambiguous and are rejected.

By incorporating the ratio test into the feature matching process, ambiguous matches can be effectively
filtered out, ensuring the selection of more reliable correspondences between features.
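
A common way to implement this is OpenCV's knnMatch with k = 2; the sketch below uses p = 0.5 as in the text above, while Section 5.4 notes that thresholds around 0.75 to 0.80 are also used in practice:

    # Ratio test on the top-2 matches using OpenCV's knnMatch.
    import cv2

    def ratio_test_matches(des1, des2, p=0.5):
        matcher = cv2.BFMatcher(cv2.NORM_L2)   # L2 is appropriate for SIFT descriptors
        strong = []
        for pair in matcher.knnMatch(des1, des2, k=2):
            # Keep only unambiguous ('strong') matches.
            if len(pair) == 2 and pair[0].distance < p * pair[1].distance:
                strong.append(pair[0])
        return strong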

5.4 Matching SIFT Features
Matching Scale-Invariant Feature Transform (SIFT) features involves comparing the descriptor vectors as-
sociated with keypoints extracted from different images. A commonly used criterion for accepting a match
is based on the normalized sum of squared differences (SSD) between the descriptors.

5.4.1 Matching Criterion


The matching criterion can be expressed as follows:

SSD(f1, f2) / SSD(f1, f2′) < t

Here, f1 represents the descriptor vector from the target image, while f2 and f2′ represent the descriptor
vectors from the reference image. The SSD function computes the sum of squared differences between
corresponding elements of the descriptor vectors.

5.4.2 Threshold Selection


The threshold t plays a crucial role in determining the acceptance criteria for matches. Empirical studies
have shown that threshold values around 0.75 or 0.80 often yield good results in object recognition tasks.

5.4.3 Effectiveness
By applying this matching criterion with an appropriate threshold, a significant reduction in false matches
can be achieved. Studies have indicated that this approach can eliminate up to 90% of false matches while
discarding less than 5% of correct matches. This demonstrates the effectiveness of the matching strategy in
selecting reliable correspondences between SIFT features across different images.
In summary, matching SIFT features based on the normalized SSD criterion with a carefully chosen
threshold offers a robust solution for various computer vision tasks, including object recognition and image
alignment.

6 Evaluating the Results


Measuring the performance of a feature matcher is essential to assess its accuracy and reliability in various
computer vision applications.

6.1 True/False Positives


One common approach to evaluating the results of a feature matcher is to count true and false positives based on a chosen distance threshold.

6.1.1 Distance Threshold


The distance threshold serves as a criterion to determine whether a match is considered ’good’ or not. It
directly influences the performance of the feature matcher.

6.1.2 Performance Metrics


• True Positives (TP): These are the number of detected matches that survive the distance threshold
and are correctly identified as valid correspondences between features in the reference and target
images.
• False Positives (FP): These are the number of detected matches that survive the distance threshold
but are incorrect, erroneously identified as valid correspondences.

By quantifying the number of true positives and false positives, we can assess the effectiveness of the
feature matcher in distinguishing between correct and incorrect matches.
Adjusting the distance threshold allows for fine-tuning the performance of the feature matcher, balancing
between the number of true positives and false positives to achieve the desired level of accuracy and reliability
in feature matching.

6.2 Measuring Performance


One powerful method for evaluating the performance of a feature matcher is by using Receiver Operating
Characteristic (ROC) curves.

6.2.1 ROC Curves


ROC curves illustrate the trade-off between true positive rates and false positive rates by thresholding the
match distance at different thresholds.

• By sweeping through a range of threshold values, we can generate sets of matches with varying true/false
positive rates.
• The ROC curve is then plotted by computing these rates across the full range of possible thresholds.

6.2.2 Area Under the ROC Curve (AUC)


The Area Under the ROC Curve (AUC) summarizes the performance of a feature matching pipeline. A
higher AUC indicates better performance in distinguishing between true and false matches.
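
A minimal sketch, assuming scikit-learn is available and using synthetic match distances (a lower distance should mean a more likely correct match, so the negated distance is used as the score):

    # ROC curve and AUC over match distances (synthetic data for illustration).
    import numpy as np
    from sklearn.metrics import roc_curve, auc

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=500)              # 1 = correct match, 0 = incorrect
    distances = np.where(labels == 1,                  # correct matches tend to be closer
                         rng.normal(0.3, 0.10, 500),
                         rng.normal(0.7, 0.15, 500))

    # roc_curve expects "higher score = more positive", so use the negative distance.
    fpr, tpr, thresholds = roc_curve(labels, -distances)
    print("AUC =", auc(fpr, tpr))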

6.2.3 Application
ROC curves and AUC are versatile evaluation tools that can be applied across various domains, including:

• Binary classification
• Face verification
• Object detection
• Image retrieval

• And many more...

These metrics provide valuable insights into the effectiveness and reliability of feature matching algo-
rithms, aiding in their optimization and deployment in real-world applications.
We will revisit these concepts in detail when discussing specific tasks such as binary classification, face
verification, object detection, and image retrieval.
Be cautious with accuracy as a measure of performance, as it may not always provide a comprehensive
evaluation, especially in scenarios where the classes are imbalanced.

The accuracy is defined as the fraction of correctly classified items among the total number of cases
examined. Mathematically, it is represented as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Here:
• TP (True Positives): the number of correctly classified positive cases.
• TN (True Negatives): the number of correctly classified negative cases.
• FP (False Positives): the number of negative cases incorrectly classified as positive.
• FN (False Negatives): the number of positive cases incorrectly classified as negative.
The accuracy metric is intuitive and straightforward, indicating the proportion of correct predictions
relative to the total number of predictions made.
However, accuracy can be misleading, particularly when the classes are imbalanced. For instance, if only 5 out of 100 items are truly positive, a classifier that always predicts "negative" achieves 95% accuracy while detecting nothing, simply by capitalizing on the majority class. This dominance of the larger class can lead to inflated accuracy values and obscure the true performance of the classifier.
Therefore, while accuracy provides a basic measure of performance, it is essential to complement it
with other evaluation metrics, especially in scenarios with class imbalance, to gain a more comprehensive
understanding of the classifier’s effectiveness.

6.3 Precision and Recall


Precision and recall are fundamental metrics for evaluating the performance of classifiers, providing insights
into their accuracy and ability to detect relevant instances.

6.3.1 Precision
Precision measures the accuracy of positive predictions made by a classifier. It answers the question: ”Of
all the instances predicted as positive, how many were truly positive?”
Precision is calculated using the formula:
Precision = True Positives / (True Positives + False Positives)
A high precision indicates that the classifier has a low rate of false positives, meaning that when it
predicts positive, it is likely to be correct.

6.3.2 Recall
Recall measures the ability of a classifier to capture all the relevant positive instances. It answers the
question: ”Of all the true positive instances, how many were successfully predicted?”
Recall is calculated using the formula:
Recall = True Positives / (True Positives + False Negatives)
A high recall indicates that the classifier is good at identifying positive instances, even if it means having
a higher rate of false positives.
These metrics provide complementary insights into the performance of classifiers, allowing for a balanced
assessment of their precision and ability to detect relevant instances.

6.4 F1 Score
The F1 score is a single metric that combines both precision and recall, providing a balanced assessment of
a classifier’s performance.

6.4.1 Precision and Recall Comparison
Precision emphasizes the accuracy of positive predictions, focusing on minimizing false positives, while recall
emphasizes the ability of the model to capture all relevant positive instances, aiming to minimize false
negatives.

6.4.2 F1 Score Calculation


The F1 score is calculated as the harmonic mean of precision and recall, expressed by the formula:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

6.4.3 Interpretation
The F1 score provides a balanced measure of a classifier’s performance, taking into account both precision
and recall. It reaches its best value at 1 and worst at 0, with higher values indicating better performance.
By incorporating both precision and recall into a single metric, the F1 score offers a comprehensive
evaluation of a classifier’s ability to make accurate positive predictions while capturing relevant positive
instances.
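
The four quantities can be computed together from raw counts, as in the small sketch below; the final call illustrates the class-imbalance caveat discussed earlier with an "always negative" classifier on 5 positives and 95 negatives:

    # Accuracy, precision, recall and F1 from raw counts (formulas as given above).
    def classification_metrics(tp, tn, fp, fn):
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
        return accuracy, precision, recall, f1

    # Imbalanced example: 95% accuracy but zero precision, recall and F1.
    print(classification_metrics(tp=0, tn=95, fp=0, fn=5))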

7 Local Features and Deep Learning


Local feature extraction and matching, a crucial task in computer vision, has witnessed advancements with
the advent of deep learning techniques. Here’s how this procedure can be accomplished using deep learning
methodologies:

7.1 Local Feature Extraction


Deep learning models such as SuperPoint have been developed to efficiently extract keypoints and describe
local features from images. These models leverage convolutional neural networks (CNNs) to detect distinctive
keypoints and generate corresponding descriptors that encode information about the surrounding image
regions.

7.2 Training on Synthetic Data


To train these deep learning models effectively, it’s common to start with synthetic data. Synthetic datasets
provide ground truth annotations, allowing for precise supervision during model training. By synthesizing
images with known ground truth keypoints and descriptors, the model can learn to accurately extract and
describe local features.

7.3 Self-Supervised Learning on Real Data


Following training on synthetic data, the model can be fine-tuned using self-supervised learning techniques
on real-world data. Self-supervised learning tasks, such as image reconstruction or pretext tasks, provide
additional supervision signals for model training without requiring manually annotated labels. This fine-
tuning process enhances the model’s ability to generalize to diverse real-world scenarios.

7.4 Feature Matching with SuperGlue


Once the local features are extracted from images using models like SuperPoint, feature matching can be
performed using algorithms such as SuperGlue. SuperGlue utilizes deep learning techniques to establish
correspondences between keypoint descriptors from different images efficiently. By leveraging neural networks
for feature matching, SuperGlue offers improved robustness and accuracy compared to traditional methods.

In summary, deep learning enables end-to-end solutions for local feature extraction and matching, leverag-
ing neural network architectures and large-scale datasets to achieve superior performance in various computer
vision tasks.

