Local Features: Scale-Invariant Feature Transform
Clara Gonçalves
April 2024
1 Feature Extraction
Feature extraction is the process of transforming raw input data, such as images or videos, into a more
compact and representative format that captures relevant information. This typically involves identifying
and extracting distinctive patterns, structures, or attributes from the input data, which can include edges,
corners, textures, shapes, or colors. These extracted features serve as meaningful representations of the
original data and are used for tasks such as object recognition, classification, detection, and tracking.
Scale Space
One challenge in computer vision is dealing with objects of unknown scales in images. A solution to this
problem is the scale-space approach, which detects and selects edges at various scales. Exploring features at
different scales is beneficial for recognizing diverse objects.
• Blob detector: The Laplacian of Gaussian (LoG) with different σ values serves as a scaling parameter.
• The Scale-Invariant Feature Transform (SIFT) algorithm utilizes the Difference of Gaussians, an ap-
proximation of LoG, for feature extraction.
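As a concrete illustration of the two bullets above, the following is a minimal sketch of multi-scale blob detection with a DoG response, written with NumPy and SciPy; the function names (dog_response, multi_scale_blobs), the σ values, and the threshold are illustrative assumptions, not part of any standard API.

```python
import numpy as np
from scipy import ndimage

def dog_response(image, sigma, k=1.6):
    """Difference-of-Gaussians response at scale sigma: G(k*sigma)*I - G(sigma)*I (approximates LoG)."""
    img = image.astype(np.float32)
    return ndimage.gaussian_filter(img, k * sigma) - ndimage.gaussian_filter(img, sigma)

def multi_scale_blobs(image, sigmas=(1.6, 3.2, 6.4), threshold=0.03):
    """Return (x, y, sigma) triples where |DoG| is a strong local maximum at that scale.
    The threshold assumes intensities scaled to [0, 1]."""
    blobs = []
    for sigma in sigmas:
        resp = np.abs(dog_response(image, sigma))
        # keep pixels that are the maximum of a scale-dependent neighbourhood and above threshold
        local_max = ndimage.maximum_filter(resp, size=int(3 * sigma) | 1)
        ys, xs = np.where((resp == local_max) & (resp > threshold))
        blobs.extend((int(x), int(y), sigma) for x, y in zip(xs, ys))
    return blobs
```

Each detected blob carries the σ at which it responded most strongly, which is exactly the scale-selection idea exploited in the SIFT pipeline described next.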
The SIFT pipeline proceeds in four main steps:
1. Scale-Space Extrema Detection: Candidate keypoints are found as local extrema of the Difference of Gaussians computed across adjacent scales, so that features of different sizes can be detected.
2. Keypoint Localization: After detecting potential keypoints, SIFT performs precise localization by
refining their positions based on the local extrema in scale space. This step ensures that keypoints are
accurately localized and associated with a specific scale.
3. Orientation Assignment: To achieve rotational invariance, each keypoint is assigned a dominant
orientation based on local image gradients. This allows SIFT descriptors to be computed relative to
the keypoint’s orientation, enhancing their robustness to image rotations.
4. Keypoint Descriptor: Finally, SIFT constructs descriptors for each keypoint by sampling image
gradients in the local neighborhood around the keypoint location and orientation. These descrip-
tors capture distinctive patterns and textures, providing a compact representation of the keypoint’s
appearance.
The underlying principle of SIFT is akin to the Laplacian of Gaussian (LoG) approach, where the image is
analyzed at multiple levels of scale to capture features at different resolutions. By incorporating scale-space
analysis and orientation assignment, SIFT offers remarkable robustness and accuracy in feature detection
and description, making it a valuable tool for various computer vision tasks such as object recognition, image
matching, and 3D reconstruction.
The Laplacian of Gaussian (LoG) can be written as
\nabla^2 G = G_{xx} + G_{yy}
where G_{xx} and G_{yy} are the second-order derivatives of the Gaussian function.
On the other hand, the Difference of Gaussians (DoG) is expressed as:
\mathrm{DoG}(x, y, \sigma) = G(x, y, k\sigma) - G(x, y, \sigma)
Here, G(x, y, kσ) and G(x, y, σ) represent Gaussian functions with different scales kσ and σ, respectively. The DoG operator essentially computes the difference between these two Gaussian-smoothed images.
Both the LoG and DoG operators exhibit similar characteristics, making them suitable for blob detection.
Blobs, or regions of uniform intensity surrounded by significantly different intensities, can be effectively
detected using these operators. Moreover, both functions are equivariant, meaning they maintain the same
form under coordinate transformations, ensuring consistent feature detection across different orientations
and scales.
In summary, the approximation of the Laplacian of Gaussian and the Difference of Gaussians are fun-
damental tools in scale-space analysis and blob detection, enabling robust feature extraction in computer
vision applications.
SIFT Detector
The Scale-Invariant Feature Transform (SIFT) detector is a widely used method for identifying key points
in images that are invariant to scale, rotation, and illumination changes. It operates by approximating the
Laplacian of Gaussian (LoG) operator with a series of Difference of Gaussians (DoG) applied over an image
pyramid.
The image pyramid consists of multiple octaves, each containing several levels. Within an octave, the image is progressively blurred with Gaussian smoothing at increasing σ; the most blurred image is then downsampled by a factor of 2 to form the base of the next octave. Within each octave, the DoG is computed by taking the difference between consecutive Gaussian-smoothed images, and this is repeated across all levels of the octave. The key insight of SIFT is that by
computing the DoG at different scales, it becomes possible to detect key points that are robust to variations
in scale. Additionally, the only parameter that changes across different octaves and levels within an octave
is the sigma value of the Gaussian kernel, which controls the amount of smoothing applied to the image.
SIFT key points are then identified at local extrema in the DoG pyramid, and their precise locations are
interpolated to sub-pixel accuracy.
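The pyramid construction just described can be sketched in a few lines. The constants below (σ₀ = 1.6, four octaves, five levels) follow commonly cited SIFT defaults and are assumptions rather than values given above; the incremental-blur bookkeeping and sub-pixel refinement of a full implementation are omitted.

```python
import numpy as np
from scipy import ndimage

def dog_pyramid(image, n_octaves=4, levels=5, sigma0=1.6):
    """Gaussian/DoG pyramid: within an octave the blur grows by a factor k per level;
    the last Gaussian image is downsampled by 2 to seed the next octave."""
    k = 2.0 ** (1.0 / (levels - 2))          # scale step between consecutive levels
    base = image.astype(np.float32)
    pyramid = []                              # one (gaussians, dogs) pair per octave
    for _ in range(n_octaves):
        gaussians = [ndimage.gaussian_filter(base, sigma0 * k ** i) for i in range(levels)]
        dogs = [b - a for a, b in zip(gaussians[:-1], gaussians[1:])]
        pyramid.append((gaussians, dogs))
        base = gaussians[-1][::2, ::2]        # downsample by 2 for the next octave
    return pyramid
```

Candidate keypoints are then the samples that are larger (or smaller) than all 26 neighbours in the 3×3×3 DoG neighbourhood spanning adjacent levels.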
Finally, key points are characterized by their local image gradients and orientations, forming a descriptor
that is invariant to various transformations. SIFT has proven to be highly effective in a wide range of
computer vision applications, including object recognition, image stitching, and 3D reconstruction. However,
it is computationally intensive and may be less suitable for real-time applications on resource-constrained
devices.
• Low Contrast Feature Removal: Keypoints associated with regions of low contrast are often unreliable and can introduce noise into the feature extraction process. One way to mitigate this is to discard keypoints whose Difference of Gaussians (DoG) response magnitude at the keypoint location is below a chosen threshold. This thresholding step removes keypoints that lack sufficient contrast and thereby improves the stability of feature detection; low-contrast regions are generally less reliable than high-contrast ones for feature extraction.
• Gradient-based Filtering: Another approach to enhancing the stability of SIFT keypoints is to examine the local gradient structure around each candidate. Keypoints in flat, low-gradient regions are dominated by noise, while keypoints lying along edges respond strongly in only one direction and are therefore poorly localized. Discarding both kinds and retaining only keypoints with strong gradients in more than one direction (corner-like structures) yields features that are less sensitive to variations in illumination and noise (a sketch of both pruning tests follows the summary below).
Ensuring the stability of SIFT keypoints under varying illumination conditions involves removing low con-
trast features and employing gradient-based filtering to retain robust keypoints. These techniques contribute
to the reliability and accuracy of feature extraction in challenging lighting environments.
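The sketch below implements the two pruning tests, using the contrast threshold (≈ 0.03 for images scaled to [0, 1]) and edge ratio r = 10 that are commonly quoted for SIFT; these constants are assumptions, not values stated above.

```python
import numpy as np

def keep_keypoint(dog, x, y, contrast_thresh=0.03, edge_ratio=10.0):
    """Apply the two pruning tests to a candidate at (x, y) in one DoG level.
    Assumes (x, y) is not on the image border; thresholds are commonly used defaults."""
    # 1) low-contrast rejection: a weak DoG response means an unstable keypoint
    if abs(dog[y, x]) < contrast_thresh:
        return False
    # 2) edge rejection: 2x2 Hessian of the DoG estimated with finite differences
    dxx = dog[y, x + 1] - 2 * dog[y, x] + dog[y, x - 1]
    dyy = dog[y + 1, x] - 2 * dog[y, x] + dog[y - 1, x]
    dxy = (dog[y + 1, x + 1] - dog[y + 1, x - 1]
           - dog[y - 1, x + 1] + dog[y - 1, x - 1]) / 4.0
    trace, det = dxx + dyy, dxx * dyy - dxy * dxy
    if det <= 0:                                   # saddle point: discard
        return False
    # edge-like points have one large and one small curvature -> large trace^2 / det
    return trace ** 2 / det < (edge_ratio + 1) ** 2 / edge_ratio
```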
Keypoint Orientation
Assigning orientation to keypoints is crucial for achieving rotation invariance in feature detection. Here’s
how it’s typically done:
First, at each level of scale, the gradient magnitude and orientation are computed for every pixel in the
image. The size of the orientation collection region around each keypoint depends on its scale.
Next, a histogram of orientations is constructed by summing the weighted magnitudes of gradients within
each orientation bin. Typically, the 360 degrees of orientation are divided into bins, often 36 bins (each
representing 10 degrees).
Peaks in the histogram are identified, and the orientations corresponding to these peaks are assigned to
the keypoints. Additionally, the sum of magnitudes within each peak’s bin is assigned to the keypoint.
This process results in keypoints that have the same location and scale as the original but are oriented in
different directions. This orientation assignment effectively splits one keypoint into multiple keypoints, each
representing the same feature but oriented differently. This technique enhances the robustness of feature
matching across different orientations, enabling more accurate object recognition and matching in images
with varying viewpoints.
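A minimal version of this orientation-assignment step could look as follows; Gaussian weighting of the magnitudes and interpolation of the peak position, which a full implementation would include, are left out, and the 0.8 peak ratio is an assumed, commonly used value.

```python
import numpy as np

def dominant_orientations(blurred, x, y, radius, n_bins=36, peak_ratio=0.8):
    """36-bin orientation histogram around (x, y); returns one angle (degrees) per strong peak.
    Assumes the window lies inside the image."""
    patch = blurred[y - radius:y + radius + 1, x - radius:x + radius + 1].astype(np.float32)
    gy, gx = np.gradient(patch)                       # gradients along rows and columns
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360.0
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 360), weights=mag)
    # every peak within 80% of the global maximum spawns its own oriented keypoint
    return [(i + 0.5) * (360.0 / n_bins)
            for i, v in enumerate(hist) if v >= peak_ratio * hist.max()]
```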
SIFT Descriptor
When evaluating keypoints in a region, it’s essential to describe the local image features in a way that is
invariant to various transformations. This is where SIFT (Scale-Invariant Feature Transform) descriptors
come into play.
1. Utilize the Gaussian blurred image associated with the keypoint’s scale.
2. Extract image gradients over a 16x16 square window surrounding the detected feature, obtaining
gradients for each pixel.
3. Rotate the gradient directions and locations relative to the keypoint orientation (provided by the
dominant orientation).
4. Divide the 16x16 window into a 4x4 grid of cells.
5. Compute an orientation histogram with 8 orientation bins for each cell, summing the weighted gradient magnitudes that fall into each bin. Each cell therefore summarizes the distribution of gradient orientations within its sub-patch. Concatenating the 16 histograms yields the keypoint descriptor, a vector of length 4 × 4 × 8 = 128 (a sketch of this construction follows the list).
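The following sketch builds such a 128-dimensional descriptor. It keeps the sampling window axis-aligned and rotates only the gradient orientations, and it omits the Gaussian weighting and trilinear binning of a full SIFT implementation, so it should be read as an illustration of the 4×4×8 structure rather than a faithful reimplementation.

```python
import numpy as np

def sift_like_descriptor(blurred, x, y, angle_deg):
    """4x4 cells x 8 orientation bins = 128-D descriptor from a 16x16 window.
    Assumes the window lies inside the image."""
    patch = blurred[y - 8:y + 8, x - 8:x + 8].astype(np.float32)    # 16x16 window
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = (np.degrees(np.arctan2(gy, gx)) - angle_deg) % 360.0      # rotate relative to keypoint
    desc = []
    for cy in range(4):                                             # 4x4 grid of 4x4-pixel cells
        for cx in range(4):
            m = mag[cy * 4:(cy + 1) * 4, cx * 4:(cx + 1) * 4]
            a = ang[cy * 4:(cy + 1) * 4, cx * 4:(cx + 1) * 4]
            hist, _ = np.histogram(a, bins=8, range=(0, 360), weights=m)
            desc.extend(hist)
    desc = np.array(desc, dtype=np.float32)
    return desc / (np.linalg.norm(desc) + 1e-7)                     # normalize; length 128
```

The final normalization is part of what gives the descriptor its robustness to global changes in illumination.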
• Rotation Invariance: Achieved by rotating the gradients, assuming that a rotated image generates
a keypoint at the same location as the original image.
• Scale Invariance: Maintained by working with the scale image from the Difference of Gaussians
(DoG).
• Robustness to Illumination Variation: Because the descriptor is built from image gradients rather than raw intensities, and is normalized, it is largely insensitive to additive and multiplicative illumination changes.
• Slight Robustness to Affine Transformations and Noise: Empirically observed, though not as
pronounced as other invariances.
Affine Invariant Detection: SIFT is robust to similarity transforms (rotation and uniform scale) and, to a lesser degree, to affine transforms (rotation and non-uniform scale). A downside of SIFT, however, is its computational complexity, which may hinder real-time applications.
Integral Images
Integral images provide an efficient method for calculating the sum of pixel values within rectangular regions
of an image. The integral image IΣ (x) at a location x = (x, y)T is defined as the sum of all pixel values
in the input image I within the rectangular region formed by the origin and x. This transform is achieved
through the following equation:
I_{\Sigma}(\mathbf{x}) = \sum_{x' \le x,\; y' \le y} I(x', y')
Integral images differ from other methods like Scale-Invariant Feature Transform (SIFT) notably in their
underlying computation. While SIFT utilizes the Laplacian of Gaussian (LoG) for scale-space analysis, in-
tegral images replace LoG with a 2D box filter. The equation for computing the sum within a rectangular
region using integral images is given by:
\sum_{\text{rect}} I(x', y') = A - B - C + D
where A, B, C, and D represent the pixel sums of the integral image at the corners of the rectangular
region of interest. This equation essentially calculates the sum of pixel values within the rectangle by
subtracting the areas outside the rectangle from the total sum.
Integral images offer remarkable efficiency, enabling the characterization of image regions with just four
memory accesses and three arithmetic operations. This efficiency makes integral images particularly suitable
for tasks such as blob detection, where the computation of pixel sums within rectangular regions is preva-
lent. The integral image computation significantly reduces the computational cost associated with feature
extraction, making it an invaluable tool in computer vision applications.
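The definition above translates directly into code; a minimal sketch with NumPy (the function names are illustrative) is:

```python
import numpy as np

def integral_image(img):
    """I_sigma(x, y): sum of all pixels I(x', y') with x' <= x and y' <= y."""
    return img.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, x0, y0, x1, y1):
    """Sum of img[y0:y1+1, x0:x1+1] from four integral-image lookups (the A - B - C + D pattern)."""
    a = ii[y1, x1]                                         # bottom-right corner
    b = ii[y0 - 1, x1] if y0 > 0 else 0.0                  # strip above the rectangle
    c = ii[y1, x0 - 1] if x0 > 0 else 0.0                  # strip left of the rectangle
    d = ii[y0 - 1, x0 - 1] if (x0 > 0 and y0 > 0) else 0.0 # doubly-subtracted corner, added back
    return a - b - c + d
```

Regardless of how large the rectangle is, the sum costs four memory accesses and three additions/subtractions, which is what makes box-filter responses so cheap to evaluate.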
In the SURF detector, the second-order Gaussian derivatives of the Hessian are approximated by box filters evaluated on the integral image, resulting in an approximate Hessian matrix denoted H_approx. The determinant of this approximate Hessian is computed as:
\det(H_{\mathrm{approx}}) = D_{xx} D_{yy} - (w\, D_{xy})^2
where D_{xx}, D_{yy} and D_{xy} are the box-filter responses and the weight w (≈ 0.9) compensates for the approximation of the Gaussian derivatives.
Scale-Space Representation
To match interest points across different scales, a pyramidal scale space is built. Rather than repeatedly downsampling the image, the image size is kept fixed and the box filters are progressively upscaled, so the responses at different scales can be computed independently (and in parallel). Each scale is defined as the response of the image convolved with a box filter of a certain size, and the scale space is further divided into octaves (sets of filter responses obtained with increasing filter sizes).
For the descriptor, Haar wavelet responses are summed within each sub-region of the keypoint neighborhood, giving a vector of the form
\left( \sum d_x,\ \sum d_y,\ \sum |d_x|,\ \sum |d_y| \right)
Here, d_x represents the horizontal Haar wavelet response, and d_y denotes the vertical Haar wavelet response.
For matching, the sign of the Laplacian (the trace of the Hessian) is stored with each feature, so that only features with the same contrast type (bright blobs on dark backgrounds versus dark blobs on bright backgrounds) are compared by the matching algorithm. This approach contributes significantly to the accuracy and robustness of computer vision systems, particularly in scenarios where discerning between different types of features is critical. Beyond SIFT and SURF, a family of binary and learned descriptors has been developed with efficiency as the primary goal:
• BRIEF: Binary Robust Independent Elementary Features, which efficiently encodes key points into
binary strings, making them computationally lightweight.
• ORB: Oriented FAST and Rotated BRIEF, which combines the speed of the FAST keypoint detector
with the robustness of BRIEF descriptors, allowing for rotation-invariant matching.
• BRISK: Binary Robust Invariant Scalable Keypoints, known for its efficiency in scale-space analysis
and robustness to variations in illumination and viewpoint.
• FREAK: Fast Retina Keypoint, a descriptor designed for high-speed performance and robustness to
geometric and photometric transformations.
• LIFT: Learned Invariant Feature Transform, a deep-learning-based pipeline that learns detection, orientation estimation, and description in an end-to-end manner, offering improved generalization capabilities (unlike the descriptors above, its output is real-valued rather than binary).
These binary feature descriptors play a crucial role in various computer vision applications, including ob-
ject recognition, image matching, and 3D reconstruction, by providing efficient and effective representations
of visual data while mitigating computational overhead.
3 Binary Descriptors
While complex feature descriptors like SIFT are highly effective and often considered the gold standard in computer vision tasks, they come with drawbacks such as computational expense and (historically) patent restrictions. Binary descriptors offer an alternative approach by generating compact binary strings that are computationally efficient to compute and compare.
Key Idea of Binary Descriptors: The fundamental concept behind binary descriptors involves a simple strategy:
1. Select a patch around a keypoint.
2. Choose a set of pixel pairs within that patch.
3. For each pair (s_1, s_2), compare their intensities according to the equation:
b = \begin{cases} 1, & \text{if } I(s_1) < I(s_2) \\ 0, & \text{otherwise} \end{cases} \qquad (1)
4. Concatenate all resulting binary values b into a single bit string.
This concise representation facilitates fast computation and comparison, making binary descriptors suitable for various computer vision applications (a small sketch follows below).
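A BRIEF-like sketch of this strategy follows; the patch size (31×31), the 256 pairs, and the helper name binary_descriptor are illustrative assumptions, and the same fixed pairs must be reused for every descriptor (see the remark on pairs below).

```python
import numpy as np

rng = np.random.default_rng(0)
# one fixed set of 256 pixel-pair offsets inside a 31x31 patch, shared by ALL descriptors
PAIRS = rng.integers(-15, 16, size=(256, 2, 2))

def binary_descriptor(image, x, y, pairs=PAIRS):
    """BRIEF-style descriptor: b_i = 1 if I(s1) < I(s2), packed into 32 bytes.
    A sketch; real implementations smooth the patch first and handle image borders."""
    bits = np.empty(len(pairs), dtype=np.uint8)
    for i, ((dx1, dy1), (dx2, dy2)) in enumerate(pairs):
        bits[i] = 1 if image[y + dy1, x + dx1] < image[y + dy2, x + dx2] else 0
    return np.packbits(bits)          # 256 bits -> 32 bytes
```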
Key Advantages of Binary Descriptors
• Compact descriptor: Binary descriptors encode features using a fixed-length sequence of bits, where
the length corresponds to the number of feature pairs. This compact representation is efficient for
storage and processing.
• Fast to compute: Binary descriptors are computed rapidly, typically through simple intensity value
comparisons between neighboring pixels or image patches. This computational simplicity enables real-
time performance in various computer vision applications.
• Trivial and fast to compare: Binary descriptors facilitate rapid feature matching through the use of
the Hamming distance metric. The Hamming distance measures the number of differing bits between
two binary strings, and it is computed as follows:
\mathrm{HammingDistance}(A, B) = \sum_{i=1}^{n} (A_i \oplus B_i)
where A and B are binary descriptors of length n, and ⊕ denotes the bitwise XOR operation.
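With packed uint8 descriptors (as produced by np.packbits above), the Hamming distance is a XOR followed by a bit count; a minimal sketch:

```python
import numpy as np

def hamming_distance(a, b):
    """Number of differing bits between two packed binary descriptors (uint8 arrays):
    XOR the bytes, then count the set bits (equivalent to summing A_i XOR B_i)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())
```

On modern CPUs the same operation maps to a handful of XOR and popcount instructions, which is why comparing binary descriptors is so much cheaper than comparing floating-point vectors.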
Important Remark - Pairs
To ensure a fair comparison of descriptors across images, it is crucial to adhere to the following guidelines
regarding pairs selection:
• Consistent Pair Usage: Utilize the same pairs for all descriptors under consideration. This ensures that each descriptor is evaluated on the same set of data, eliminating potential biases introduced by varying pairs.
• Order Preservation: Maintain the identical order in which pairs are tested across all descriptors. This preserves the integrity of the comparison process, allowing for accurate assessment of descriptor performance.
Different descriptors may necessitate distinct methodologies for pair selection; hence, it is
imperative to establish a standardized approach to pair selection to facilitate fair and mean-
ingful comparisons among descriptors.
Additionally, the orientation (θ) is determined using:
\theta = \frac{1}{2} \arctan\!\left( \frac{2 M_{11}}{M_{20} - M_{02}} \right)
Given CoM and orientation (C, θ), coordinates s are rotated by θ around C using the transformation:
s′ = T (C, θ)s
Transformed pixel coordinates are then used for testing, ensuring invariance to rotation in the plane.
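A sketch of this steered-pair idea, rotating the sampling coordinates by the estimated orientation before the intensity tests are run (the function name and centering convention are assumptions):

```python
import numpy as np

def rotate_pairs(pairs, theta, center=(0.0, 0.0)):
    """Apply s' = T(C, theta) s: rotate sampling coordinates by theta (radians) around `center`."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    pts = pairs.reshape(-1, 2).astype(np.float64) - np.asarray(center)
    return (pts @ rot.T + np.asarray(center)).reshape(pairs.shape)
```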
Learning Sampling Pairs: ORB selects sampling pairs optimized for being uncorrelated and having
high variance, thus adding new information to the descriptor and enhancing discriminative power. ORB
defines a strategy for selecting 256 pairs that optimize both properties using a training database.
ORB vs SIFT:
• Speed: ORB is roughly 100 times faster than SIFT.
• Descriptor Size: ORB uses a 256-bit binary descriptor, whereas SIFT's 128-dimensional float descriptor occupies 4096 bits.
• Scale Invariance: ORB is not inherently scale invariant, but this can be achieved by applying image pyramids.
• Rotation Invariance: ORB is primarily invariant to in-plane rotation.
• Matching Performance: ORB demonstrates matching performance similar to SIFT when scale changes are not considered.
• Modern Usage: Several modern online systems, such as Simultaneous Localization and Mapping (SLAM), use binary features, including ORB, due to their computational efficiency and robustness.
In short, ORB trades some invariance for a large gain in speed and descriptor compactness, which is why it is widely used in modern real-time computer vision systems.
4 Keypoints and Descriptors
4.1 Keypoints
Keypoints denote specific locations in an image that are distinctive or informative. They serve as anchor
points for further analysis. For instance, in a scene, keypoints could represent corners, edges, or other salient
features.
4.2 Descriptors
Descriptors, on the other hand, provide a detailed description of the appearance of the region surrounding
a keypoint. They encode information about the intensity gradients, colors, textures, or other relevant
attributes. This description facilitates matching and comparison between different keypoints.
• SURF (Speeded-Up Robust Features): Similar to SIFT, SURF descriptors also operate on gra-
dient information but are designed to be computationally faster, making them suitable for real-time
applications.
• BRIEF (Binary Robust Independent Elementary Features): To enhance efficiency, BRIEF
generates binary descriptors directly from image intensities, reducing memory and computational re-
quirements.
• ORB (Oriented FAST and Rotated BRIEF): Combining the efficiency of BRIEF with the rota-
tional invariance of keypoints, ORB descriptors provide robustness against image transformations.
By utilizing combinations of key points and descriptors, computer vision algorithms can effectively extract
and represent visual information, enabling tasks such as object recognition, tracking, and scene understand-
ing.
5 Feature Matching
Feature matching involves the process of associating keypoints between different images, enabling tasks such
as image alignment, object recognition, and scene reconstruction.
1. Distance Function: A distance function between descriptor vectors, such as the L2 norm or the sum of squared differences (SSD), is defined to quantify how similar two descriptors are.
2. Matching Strategies: Once the distance function is established, the matching process proceeds in one of the following ways (a brute-force matching sketch follows below):
• Single Best Match: This approach involves testing all features in I2 and finding the one with
the minimum distance to the descriptor from I1 . It aims to find the most likely correspondence
for each keypoint in I1 .
• Top-k Matches: In contrast, this strategy evaluates all features in I2 against the descriptor from
I1 and selects the k features with the lowest distances. This approach considers multiple potential
matches for each keypoint, offering robustness against outliers and ambiguities.
The choice between these strategies depends on the specific application requirements and the charac-
teristics of the images being matched.
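Both strategies reduce to sorting descriptor distances; a brute-force sketch with the L2 distance (illustrative, not optimized) is shown below.

```python
import numpy as np

def match_descriptors(desc1, desc2, k=1):
    """For each row of desc1 (N1 x D), return indices of its k nearest rows in desc2 (N2 x D)
    under the L2 distance. k=1 gives the single best match; larger k gives the top-k matches."""
    dists = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)   # (N1, N2)
    return np.argsort(dists, axis=1)[:, :k]
```

For large descriptor sets the full distance matrix is usually replaced by an approximate nearest-neighbour index, but the matching logic stays the same.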
By employing suitable distance metrics and matching strategies, feature matching enables the establish-
ment of correspondences between keypoints across different images, forming the basis for various computer
vision tasks.
5.2.1 Keypoint Representation
Each keypoint identified in the target image is characterized by several attributes:
• 2D Location: The coordinates (x, y) pinpoint the position of the keypoint within the image plane.
• Scale (s) and Orientation (θ): These parameters specify the size and orientation of the local region
around the keypoint, aiding in scale and rotational invariance.
• Descriptor Vector (d): The descriptor vector encapsulates the appearance information of the region
surrounding the keypoint. It represents the local image structure, capturing gradients, textures, or
other relevant features.
Thus, each keypoint in the target image is described by the tuple (x, y, s, θ, d), providing comprehensive
information for matching.
• Similarity Search: The descriptor vector associated with each keypoint in the target image is com-
pared with those in the reference image using a distance metric, such as the L2 or L1 norm. Keypoints
in the reference image whose descriptors exhibit minimal distance to the target descriptor are consid-
ered potential matches.
• Potential Matches: It’s important to note that a single descriptor vector from the target image
may find correspondence with multiple descriptor vectors in the reference image. This occurs due to
similarities between local image structures or the presence of repetitive patterns.
By performing this matching process for all keypoints in the target image, a set of correspondences
between the two images is established, facilitating tasks such as image registration, object recognition, and
scene understanding.
5.3.3 Improved Approaches
To mitigate the limitations of the L2 distance and address the challenges posed by ambiguous matches,
alternative distance metrics and matching strategies can be employed. These may include:
• Local Feature Matching: Instead of relying solely on global descriptors, consider matching based on
local regions surrounding keypoints. This approach can enhance robustness against viewpoint changes
and occlusions.
• Feature Space Transformation: Transforming the feature space or applying dimensionality reduc-
tion techniques can help emphasize discriminative features and suppress noise or irrelevant information.
• Advanced Distance Metrics: Explore distance metrics tailored to specific feature types or ap-
plication domains. For instance, metrics based on information theory, such as the Jensen-Shannon
divergence, may offer better discrimination between feature vectors.
By adopting these improved approaches, the feature matching process can achieve greater accuracy and
reliability, even in challenging conditions where traditional metrics may falter.
• Distance Ratio Calculation: The distance ratio is computed as the ratio of the L2 distances between
the best and second-best matches of a feature f1 , denoted as f2 and f2′ , respectively. Mathematically,
it can be expressed as:
\text{Distance Ratio} = \frac{\| f_1 - f_2 \|}{\| f_1 - f_2' \|}
This ratio provides a measure of confidence in the match, with lower values indicating more reliable
matches.
• Sorting by Confidence: Sorting matches based on the distance ratio arranges them in order of
confidence, with more reliable matches appearing first in the list.
• Threshold Definition: To filter out ambiguous matches, a distance ratio threshold p is defined,
typically set around 0.5. This threshold signifies that the distance to the best match (f2 ) should be at
least twice as close as the distance to the second-best match (f2′ ) for a match to be considered reliable.
• Top Two Matches: Retain the top two matches, f2 and f2′ , based on the L2 distances.
• Ambiguity Check: If the distance to the best match (f2 ) is significantly closer than the distance to
the second-best match (f2′ ), i.e., if d(f1 , f2 ) < p × d(f1 , f2′ ), consider the match between f1 and f2 as
a ’strong’ match and retain it.
• Ambiguous Match Rejection: Discard matches where the distance to the best match is not sig-
nificantly closer than the distance threshold (p) multiplied by the distance to the second-best match.
These matches are deemed ambiguous and are rejected.
By incorporating the ratio test into the feature matching process, ambiguous matches can be effectively
filtered out, ensuring the selection of more reliable correspondences between features.
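A compact sketch of the ratio test described above, using the threshold p = 0.5 mentioned in the text:

```python
import numpy as np

def ratio_test_matches(desc1, desc2, p=0.5):
    """Keep only matches whose best/second-best L2 distance ratio is below p.
    Returns (index_in_desc1, index_in_desc2) pairs for the retained 'strong' matches."""
    dists = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    matches = []
    for i, (best, second) in enumerate(order[:, :2]):
        if dists[i, best] < p * dists[i, second]:   # unambiguous: best is much closer
            matches.append((i, best))
    return matches
```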
5.4 Matching SIFT Features
Matching Scale-Invariant Feature Transform (SIFT) features involves comparing the descriptor vectors as-
sociated with keypoints extracted from different images. A commonly used criterion for accepting a match
is based on the normalized sum of squared differences (SSD) between the descriptors.
\frac{\mathrm{SSD}(f_1, f_2)}{\mathrm{SSD}(f_1, f_2')} < t
Here, f1 represents the descriptor vector from the target image, while f2 and f2′ represent the descriptor
vectors from the reference image. The SSD function computes the sum of squared differences between
corresponding elements of the descriptor vectors.
5.4.3 Effectiveness
By applying this matching criterion with an appropriate threshold, a significant reduction in false matches
can be achieved. Studies have indicated that this approach can eliminate up to 90% of false matches while
discarding less than 5% of correct matches. This demonstrates the effectiveness of the matching strategy in
selecting reliable correspondences between SIFT features across different images.
In summary, matching SIFT features based on the normalized SSD criterion with a carefully chosen
threshold offers a robust solution for various computer vision tasks, including object recognition and image
alignment.
By quantifying the number of true positives and false positives, we can assess the effectiveness of the
feature matcher in distinguishing between correct and incorrect matches.
Adjusting the distance threshold allows for fine-tuning the performance of the feature matcher, balancing
between the number of true positives and false positives to achieve the desired level of accuracy and reliability
in feature matching.
• By sweeping through a range of threshold values, we can generate sets of matches with varying true/false
positive rates.
• The ROC curve is then plotted by computing these rates across the full range of possible thresholds.
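A simple threshold sweep that produces the points of an ROC curve might look like this; the input arrays are illustrative and would in practice come from a labelled set of candidate matches.

```python
import numpy as np

def roc_points(distances, labels):
    """Sweep a distance threshold over candidate matches and return (FPR, TPR) pairs.
    `labels` are 1 for ground-truth-correct matches and 0 otherwise."""
    distances, labels = np.asarray(distances), np.asarray(labels)
    thresholds = np.sort(np.unique(distances))
    positives, negatives = labels.sum(), (1 - labels).sum()
    points = []
    for t in thresholds:
        accepted = distances <= t                          # matches kept at this threshold
        tpr = (accepted & (labels == 1)).sum() / max(positives, 1)
        fpr = (accepted & (labels == 0)).sum() / max(negatives, 1)
        points.append((fpr, tpr))
    return points
```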
6.2.3 Application
ROC curves and AUC are versatile evaluation tools that can be applied across various domains, including:
• Binary classification
• Face verification
• Object detection
• Image retrieval
These metrics provide valuable insights into the effectiveness and reliability of feature matching algo-
rithms, aiding in their optimization and deployment in real-world applications.
We will revisit these concepts in detail when discussing specific tasks such as binary classification, face
verification, object detection, and image retrieval.
Be cautious with accuracy as a measure of performance, as it may not always provide a comprehensive
evaluation, especially in scenarios where the classes are imbalanced.
The accuracy is defined as the fraction of correctly classified items among the total number of cases
examined. Mathematically, it is represented as:
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
Here:
• TP (True Positives) is the number of correctly classified positive cases.
• TN (True Negatives) is the number of correctly classified negative cases.
• FP (False Positives) is the number of negative cases incorrectly classified as positive.
• FN (False Negatives) is the number of positive cases incorrectly classified as negative.
The accuracy metric is intuitive and straightforward, indicating the proportion of correct predictions
relative to the total number of predictions made.
However, accuracy can be misleading, particularly in situations where the classes are imbalanced. For
instance, in a scenario where only a small percentage of items are truly positive, a classifier that always
predicts ”negative” could achieve a high accuracy simply by capitalizing on the majority class. This dom-
inance by the larger set of positives or negatives can lead to inflated accuracy values and obscure the true
performance of the classifier.
Therefore, while accuracy provides a basic measure of performance, it is essential to complement it
with other evaluation metrics, especially in scenarios with class imbalance, to gain a more comprehensive
understanding of the classifier’s effectiveness.
6.3.1 Precision
Precision measures the accuracy of positive predictions made by a classifier. It answers the question: ”Of
all the instances predicted as positive, how many were truly positive?”
Precision is calculated using the formula:
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
A high precision indicates that the classifier has a low rate of false positives, meaning that when it
predicts positive, it is likely to be correct.
6.3.2 Recall
Recall measures the ability of a classifier to capture all the relevant positive instances. It answers the
question: ”Of all the true positive instances, how many were successfully predicted?”
Recall is calculated using the formula:
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
A high recall indicates that the classifier captures most of the positive instances, even if achieving this comes at the cost of a higher rate of false positives.
These metrics provide complementary insights into the performance of classifiers, allowing for a balanced
assessment of their precision and ability to detect relevant instances.
6.4 F1 Score
The F1 score is a single metric that combines both precision and recall, providing a balanced assessment of
a classifier’s performance.
6.4.1 Precision and Recall Comparison
Precision emphasizes the accuracy of positive predictions, focusing on minimizing false positives, while recall
emphasizes the ability of the model to capture all relevant positive instances, aiming to minimize false
negatives.
6.4.2 Definition
The F1 score is the harmonic mean of precision and recall:
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
6.4.3 Interpretation
The F1 score provides a balanced measure of a classifier’s performance, taking into account both precision
and recall. It reaches its best value at 1 and worst at 0, with higher values indicating better performance.
By incorporating both precision and recall into a single metric, the F1 score offers a comprehensive
evaluation of a classifier’s ability to make accurate positive predictions while capturing relevant positive
instances.
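The following sketch computes all four metrics from confusion-matrix counts and reproduces the class-imbalance caveat discussed earlier (the 95/5 split is an illustrative example, not data from the text):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts,
    following the formulas above (zero denominators return 0.0)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Imbalance example: a classifier that always predicts "negative" on 95 negatives / 5 positives
# reaches 95% accuracy but 0 recall and 0 F1, which is why accuracy alone can be misleading.
print(classification_metrics(tp=0, tn=95, fp=0, fn=5))   # (0.95, 0.0, 0.0, 0.0)
```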
In summary, deep learning enables end-to-end solutions for local feature extraction and matching, leverag-
ing neural network architectures and large-scale datasets to achieve superior performance in various computer
vision tasks.