Feature Detection and Matching

The document discusses feature detection in images, highlighting the importance of identifying keypoints, edges, and localized features for computational tasks. It categorizes features into global and local representations, explaining various detectors like Moravec's, Harris, FAST, and Hessian, each with unique methodologies and applications. Additionally, it covers the significance of robustness, repeatability, and efficiency in feature detection algorithms, along with the need for multiscale detectors to handle variations in image scale.


Feature detectors

Image Feature
• A feature is a piece of information which is relevant for solving the
computational task related to a certain application.
• Features may be specific structures in the image such as points, edges or
objects. Features may also be the result of a general neighborhood
operation or feature detection applied to the image. The features can be classified into two main categories:
• The features that are in specific locations of the images, such as mountain
peaks, building corners, doorways, or interestingly shaped patches of snow.
These kinds of localized features are often called keypoint features (or even
corners) and are often described by the appearance of patches of pixels
surrounding the point location.
• The features that can be matched based on their orientation and local
appearance (edge profiles) are called edges and they can also be good
indicators of object boundaries and occlusion events in the image sequence.
Global and Local Features
• In the global feature representation, the image is represented by one
multidimensional feature vector, describing the information in the whole image.
In other words, the global representation method produces a single vector with
values that measure various aspects of the image such as color, texture or shape.
Practically, a single vector from each image is extracted and then two images can
be compared by comparing their feature vectors. For example, when one wants
to distinguish images of a sea (blue) and a forest (green), a global descriptor of
color would produce quite different vectors for each category. In this context,
global features can be interpreted as a particular property of image involving all
pixels. This property can be color histograms, texture, edges or even a specific descriptor extracted from some filters applied to the image. On the other hand,
the main goal of local feature representation is to distinctively represent the
image based on some salient regions while remaining invariant to viewpoint and
illumination changes. Thus, the image is represented based on its local structures
by a set of local feature descriptors extracted from a set of image regions called
interest regions (i.e., keypoints)
Interest or Feature Point
• An interest point (or feature point) is a point that is expressive in texture: a point at which the direction of the object boundary changes abruptly, or an intersection point between two or more edge segments.
Properties Of Interest(Feature)
Point
• It has a well-defined (well-localized) position in image space.
• It is stable under local and global perturbations in the image domain, such as illumination/brightness variations, so that the interest points can be reliably computed with a high degree of repeatability.
• It should allow efficient detection.
Main Component Of Feature Detection, Description and
Matching
• Detection: identify the interest points.
• Description: the local appearance around each feature point is described in some way that is (ideally) invariant under changes in illumination, translation, scale, and in-plane rotation. We typically end up with a descriptor vector for each feature point.
• Matching: descriptors are compared across the images to identify similar features. For two images we may get a set of pairs (Xi, Yi) ↔ (Xi′, Yi′), where (Xi, Yi) is a feature in one image and (Xi′, Yi′) its matching feature in the other image.
Feature Detection

• Feature detection involves identifying specific points, regions, or structures in an image that are significant and can be used as references for further analysis. These features are usually characterized by their uniqueness, repeatability, and robustness to variations such as lighting changes, rotations, and scale transformations. Common types of features detected include corners, edges, blobs, and key points.
• Feature detectors can be broadly classified into three categories: single-
scale detectors, multi-scale detectors, and affine invariant detectors.
• The single-scale detectors are invariant to image transformations such as rotation, translation, changes in illumination and the addition of noise.
Characteristics of Feature Detectors
• Robustness, the feature detection algorithm should be able to detect the same
feature locations independent of scaling, rotation, shifting, photometric
deformations, compression artifacts, and noise.
• Repeatability, the feature detection algorithm should be able to detect the same features of the same scene or object repeatedly under a variety of viewing conditions.
• Accuracy, the feature detection algorithm should accurately localize the image
features (same pixel locations), especially for image matching tasks, where precise
correspondences are needed to estimate the epipolar geometry.
• Generality, the feature detection algorithm should be able to detect features that can be used in different applications.
• Efficiency, the feature detection algorithm should be able to detect features in
new images quickly to support real-time applications.
• Quantity, the feature detection algorithm should be able to detect all or most of the features in the image. Moreover, the density of detected features should reflect the information content of the image, so as to provide a compact image representation.
Single-Scale Detectors
Moravec’s Detector

• Moravec’s detector aims at finding distinct regions in the image that could be used to register consecutive image frames. It has been used as a corner detection algorithm, in which a corner is a point with low self-similarity. The detector tests each pixel in a given image to see if a corner is present. It considers a local image patch centered on the pixel and then determines the similarity between the patch and the nearby overlapping patches. The similarity is measured by taking the sum of squared differences (SSD) between the centered patch and the other image patches. Based on the value of the SSD, three cases need to be considered, as follows.
Moravec’s Detector
• If the pixel is in a region of uniform intensity, then the nearby patches will look similar and only a small change occurs.
• If the pixel is on an edge, then the nearby patches in a parallel direction
to the edge will result in a small change and the patches in a direction
perpendicular to the edge will result in a large change.
• If the pixel is on a location with large change in all directions, then none
of the nearby patches will look similar and the corner can be detected
when the change produced by any of the shifts is large.
The smallest SSD between the patch and its neighbors (horizontal, vertical
and on the two diagonals) is used as a measure for cornerness. A corner or
an interest point is detected when the SSD reaches a local maxima. The
following steps can be applied for implementing Moravec’s detector:
Moravec’s Detector
• Input: grayscale image, window size, threshold T.
• For each pixel (x, y) in the image, compute the intensity variation V from a shift (u, v) as

V(u, v; x, y) = Σ_{(a, b) ∈ window} [I(x + u + a, y + v + b) − I(x + a, y + b)]²

• Construct the cornerness map by calculating the cornerness measure C(x, y) for each pixel (x, y) as the smallest variation over the tested shifts:

C(x, y) = min_{(u, v)} V(u, v; x, y)

• Threshold the cornerness map by setting all C(x, y) below the given threshold value T to zero.
• Perform non-maximum suppression to find local maxima. All non-zero points remaining in the cornerness map are corners.
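The steps above can be turned directly into a simple (if slow) implementation. Below is a minimal NumPy sketch of Moravec's detector; the window size, the shift set and the threshold value are illustrative assumptions, not values prescribed by the slides.

```python
import numpy as np

def moravec(img, window=3, threshold=500.0):
    """Return corner coordinates (x, y) found by a basic Moravec detector."""
    h, w = img.shape
    r = window // 2
    shifts = [(1, 0), (0, 1), (1, 1), (1, -1)]      # horizontal, vertical, two diagonals
    corner_map = np.zeros((h, w), dtype=np.float64)

    for y in range(r + 1, h - r - 1):
        for x in range(r + 1, w - r - 1):
            patch = img[y - r:y + r + 1, x - r:x + r + 1]
            ssds = []
            for dy, dx in shifts:
                # SSD between the centred patch and the shifted patch
                shifted = img[y - r + dy:y + r + 1 + dy, x - r + dx:x + r + 1 + dx]
                ssds.append(np.sum((patch - shifted) ** 2))
            corner_map[y, x] = min(ssds)             # cornerness = smallest SSD

    corner_map[corner_map < threshold] = 0           # threshold the cornerness map
    corners = []
    for y in range(1, h - 1):                        # 3x3 non-maximum suppression
        for x in range(1, w - 1):
            v = corner_map[y, x]
            if v > 0 and v == corner_map[y - 1:y + 2, x - 1:x + 2].max():
                corners.append((x, y))
    return corners
```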
Moravec’s Detector
For performing non-maximum suppression, an image is scanned along its gradient direction, which should be perpendicular to an edge. Any pixel that is not a local maximum is suppressed and set to zero. As illustrated in the figure, given that p and r are the two neighbors along the gradient direction of q, if the pixel value of q is not larger than the pixel values of both p and r, it is suppressed. One advantage of Moravec’s detector is that it can detect the majority of corners. However, it is not isotropic: the intensity variation is calculated only at a discrete set of shifts (i.e., the eight principal directions), and any edge that is not aligned with one of the eight neighbors’ directions is assigned a relatively large cornerness measure. Thus, the detector is not invariant to rotation, resulting in a poor repeatability rate.

(Figure: performing the non-maximum suppression)
Harris Detector
• Harris and Stephens developed a combined corner and edge detector to address the limitations of Moravec’s detector. By considering the variation of the auto-correlation (i.e., intensity variation) over all orientations, a detector with better detection and repeatability rates is obtained. The resulting detector, based on the auto-correlation matrix, is the most widely used technique. The 2 × 2 symmetric auto-correlation matrix used for detecting image features and describing their local structures can be represented as

M(x, y) = Σ_{u,v} w(u, v) [ Ix²(x, y)     Ix Iy(x, y)
                            Ix Iy(x, y)   Iy²(x, y) ]
Harris Detector

where w(u, v) is a weighting (e.g., Gaussian) window over the neighborhood and Ix, Iy are the first-order image derivatives. The cornerness measure suggested by Harris is

R = det(M) − K · trace²(M) = λ1 λ2 − K (λ1 + λ2)²

The K is an adjusting parameter and λ1, λ2 are the eigenvalues of the auto-correlation matrix. The exact computation of the eigenvalues is computationally expensive, since it requires the computation of a square root. Therefore, Harris suggested this cornerness measure, which combines the two eigenvalues in a single measure. Non-maximum suppression should then be performed to find local maxima, and all non-zero points remaining in the cornerness map are the detected corners.
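As an illustration of the structure-tensor computation and cornerness measure above, here is a minimal NumPy/SciPy sketch (a hedged example, not the authors' code); the smoothing scale sigma and the constant K = 0.04 are typical but assumed values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(img, sigma=1.0, k=0.04):
    """Harris cornerness map R = det(M) - k * trace(M)^2 for a grayscale image."""
    Iy, Ix = np.gradient(img.astype(np.float64))        # first-order derivatives
    # elements of the auto-correlation matrix, averaged with a Gaussian window
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    det_m = Sxx * Syy - Sxy ** 2
    trace_m = Sxx + Syy
    return det_m - k * trace_m ** 2                      # = lam1*lam2 - k*(lam1+lam2)^2

# corners: local maxima of the response above a chosen threshold (non-maximum suppression)
```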
FAST Detector
• FAST (Features from Accelerated Segment Test) is a corner detector originally developed by Rosten and Drummond. In this detection scheme, candidate points are detected by applying a segment test to every image pixel, considering a circle of 16 pixels around the corner candidate pixel as the base of computation. If a set of n contiguous pixels in the Bresenham circle with a radius r are all brighter than the intensity of the candidate pixel (denoted by Ip) plus a threshold value t, i.e., Ip + t, or all darker than the intensity of the candidate pixel minus the threshold value, i.e., Ip − t, then p is classified as a corner. A high-speed test can be used to exclude a very large number of non-corner points; the test examines only the four pixels 1, 5, 9 and 13.

(Figure: feature detection in an image patch using the FAST detector)
FAST Detector
• A corner can only exist if at least three of these test pixels are brighter than Ip + t or darker than Ip − t; the rest of the pixels are then examined for a final conclusion. The figure illustrates the process, where the highlighted squares are the pixels used in the corner detection. The pixel at p is the center of a candidate corner. The arc indicated by the dashed line passes through 12 contiguous pixels which are brighter than p by the threshold. The best results are achieved using a circle with r = 3 and n = 9.
• Although the high-speed test yields high performance, it suffers from several limitations and weaknesses. An improvement addressing these limitations is achieved using a machine learning approach. The ordering of questions used to classify a pixel is learned by using the well-known decision tree algorithm (Iterative Dichotomiser 3, ID3), which speeds this step up significantly. As the first test produces many adjacent responses around the interest point, an additional criterion is applied to perform non-maximum suppression. This allows for precise feature localization.
FAST Detector

• The cornerness measure used at this step is

V = max( Σ_{j ∈ S_bright} |Ip→j − Ip| − t,  Σ_{j ∈ S_dark} |Ip − Ip→j| − t )

where Ip→j denotes the pixels lying on the Bresenham circle, and S_bright and S_dark are the subsets of circle pixels that are brighter or darker than the candidate, respectively. In this way, the processing time remains short because the second test is performed only on the fraction of image points that passed the first test.
In other words, the process operates in two stages. First, corner detection with a segment test of a given n and a convenient threshold is performed on a set of images (preferably from the target application domain). Each pixel of the 16 locations on the circle is classified as darker, similar, or brighter. Second, the ID3 algorithm is employed on the 16 locations to select the one that yields the maximum information gain. Non-maximum suppression is applied on the sum of the absolute differences between the pixels in the contiguous arc and the center pixel. Notice that the corners detected using the ID3 algorithm may be slightly different from the results obtained with the segment test detector, due to the fact that the decision tree model depends on the training data, which may not cover all possible corners. Compared to many existing detectors, the FAST corner detector is very suitable for real-time video processing applications because of its high-speed performance. However, it is not invariant to scale changes, it is not robust to noise, and it depends on a threshold, where selecting an adequate threshold is not a trivial task.
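For practical use, OpenCV ships a FAST implementation with the segment-test threshold and non-maximum suppression discussed above; the short example below assumes OpenCV is installed and uses a hypothetical input file name.

```python
import cv2

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)          # hypothetical input file
fast = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)
keypoints = fast.detect(img, None)                            # segment test on every pixel
print(f"{len(keypoints)} FAST corners detected")
vis = cv2.drawKeypoints(img, keypoints, None, color=(0, 255, 0))
```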
Hessian Detector
• The Hessian blob detector is based on a 2 × 2 matrix of second-order derivatives of the image intensity I(x, y), called the Hessian matrix. This matrix can be used to analyze local image structures and is expressed in the form

H(x, y, σ) = [ Ixx(x, y, σ)   Ixy(x, y, σ)
               Ixy(x, y, σ)   Iyy(x, y, σ) ]

where Ixx, Ixy and Iyy are second-order image derivatives computed using a Gaussian function of standard deviation σ. In order to detect interest features, the detector searches for a subset of points where the derivative responses are high in two orthogonal directions, that is, points where the determinant of the Hessian matrix reaches a local maximum.
Hessian Detector
By choosing points that maximize the determinant of the Hessian, this measure penalizes long structures that have small second derivatives (i.e., small signal changes) in a single direction. Non-maximum suppression is then applied using a window of size 3 × 3 over the entire image, keeping only pixels whose value is larger than the values of all eight immediate neighbors inside the window. Then, the detector returns all the remaining locations whose value is above a pre-defined threshold T. The resulting detector responses are mainly located on corners and on strongly textured image areas. While the Hessian matrix is used for describing the local structure in a neighborhood around a point, its determinant is used to detect image structures that exhibit signal changes in two directions. Compared to other operators such as the Laplacian, the determinant of the Hessian responds only if the local image pattern contains significant variations along two orthogonal directions. However, using second order derivatives in the detector makes it sensitive to noise. In addition, the local maxima are often found near contours or straight edges, where the signal changes in only one direction; these local maxima are less stable, as their localization is affected by noise or small changes in the neighboring pattern.
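A minimal sketch of the determinant-of-Hessian response using Gaussian second derivatives is given below; the scale sigma and any threshold applied afterwards are assumed, illustrative values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def det_of_hessian(img, sigma=2.0):
    """Determinant-of-Hessian response map for a grayscale image."""
    img = img.astype(np.float64)
    # second-order Gaussian derivatives; order is given per axis (rows, columns)
    Ixx = gaussian_filter(img, sigma, order=(0, 2))   # d2/dx2 (along columns)
    Iyy = gaussian_filter(img, sigma, order=(2, 0))   # d2/dy2 (along rows)
    Ixy = gaussian_filter(img, sigma, order=(1, 1))   # mixed derivative
    # high response where the signal changes in two orthogonal directions
    return Ixx * Iyy - Ixy ** 2

# interest points: 3x3 local maxima of the response above a pre-defined threshold T
```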
Multiscale detectors
• The single-scale detectors are invariant to image transformations such
as rotation, translation, changes in illuminations and addition of noise.
However, they are incapable of dealing with the scaling problem. Given
two images of the same scene related by a scale change, we want to
determine whether same interest points can be detected or not.
Therefore, it is necessary to build multiscale detectors capable of
extracting distinctive features reliably under scale changes
Laplacian of Gaussian (LoG)
• Laplacian-of-Gaussian (LoG), a linear combination of second derivatives, is a common blob detector. Given an input image I(x, y), the scale space representation of the image, defined by L(x, y, σ), is obtained by convolving the image with a variable-scale Gaussian kernel G(x, y, σ), where

L(x, y, σ) = G(x, y, σ) * I(x, y)

and

G(x, y, σ) = (1 / 2πσ²) exp(−(x² + y²) / 2σ²)

For computing the Laplacian operator, the following formula is used:

∇²L(x, y, σ) = Lxx(x, y, σ) + Lyy(x, y, σ)

This results in strong positive responses for dark blobs and strong negative responses for bright blobs of size √2·σ. However, the operator response is strongly dependent on the relationship between the size of the blob structures in the image domain and the size of the smoothing Gaussian kernel. The standard deviation of the Gaussian is used to control the scale by changing the amount of blurring. In order to automatically capture blobs of different size in the image domain, a multi-scale approach with automatic scale selection has been proposed, searching for scale space extrema of the scale-normalized Laplacian operator.
Laplacian of Gaussian (LoG)

• The scale-normalized Laplacian, ∇²norm L(x, y, σ) = σ² (Lxx + Lyy), can be used to detect points that are simultaneously local maxima/minima with respect to both space and scale. The LoG operator is circularly symmetric; it is therefore naturally invariant to rotation. The LoG is well adapted to blob detection due to this circular symmetry property, but it also provides a good estimation of the characteristic scale for other local structures such as corners, edges, ridges and multi-junctions. In this context, the LoG can be applied for finding the characteristic scale for a given image location, or for directly detecting scale-invariant regions by searching for 3D (location + scale) extrema of the LoG function, as illustrated in the figure.
Laplacian of Gaussian (LoG)

Searching for 3D scale space extrema of the LoG function


Difference of Gaussian (DoG)
• The computation of LoG operators is time consuming. To accelerate the computation, Lowe proposed an efficient algorithm based on local 3D extrema in the scale-space pyramid built with Difference-of-Gaussian (DoG) filters. This approach is used in the scale-invariant feature transform (SIFT) algorithm. In this context, the DoG gives a close approximation to the Laplacian-of-Gaussian (LoG) and is used to efficiently detect stable features from scale-space extrema. The DoG function D(x, y, σ) can be computed without convolution by subtracting adjacent scale levels of a Gaussian pyramid separated by a factor k:

D(x, y, σ) = L(x, y, kσ) − L(x, y, σ)
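The subtraction of adjacent Gaussian levels can be sketched in a few lines of NumPy/SciPy; the base scale, the factor k and the number of levels below are illustrative choices rather than the exact SIFT settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_stack(img, sigma0=1.6, k=2 ** 0.5, levels=5):
    """Build DoG levels D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma)."""
    img = img.astype(np.float64)
    gaussians = [gaussian_filter(img, sigma0 * (k ** i)) for i in range(levels)]
    return [gaussians[i + 1] - gaussians[i] for i in range(levels - 1)]

# keypoint candidates: pixels that are extrema w.r.t. their 26 neighbours in (x, y, scale)
```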
Difference of Gaussian (DoG)
Feature types extracted by DoG can be classified in the same way as for the LoG operator. Also, the DoG region detector searches for 3D scale space extrema of the DoG function, as shown in the figure. The common drawback of both the LoG and DoG representations is that local maxima can also be detected in the neighborhood of contours or straight edges, where the signal changes in only one direction, which makes them less stable and more sensitive to noise or small changes.

Searching for 3D scale space extrema in the DoG function


Harris-Laplace
Harris-Laplace is a scale invariant corner detector proposed by Mikolajczyk and Schmid. It relies on a combination of the Harris corner detector and a Gaussian scale space representation. Although Harris corner points have been shown to be invariant to rotation and illumination changes, they are not invariant to scale. Therefore, the second-moment matrix utilized in that detector is modified to make it independent of the image resolution. The scale-adapted second-moment matrix used in the Harris-Laplace detector is represented as

M(x, y) = σD² g(σI) * [ Ix²(x, y, σD)     Ix Iy(x, y, σD)
                        Ix Iy(x, y, σD)   Iy²(x, y, σD) ]

where Ix and Iy are the image derivatives calculated in their respective directions using a Gaussian kernel of scale σD, and g(σI) is a Gaussian smoothing window of scale σI. The parameter σI determines the current scale at which the Harris corner points are detected in the Gaussian scale-space. In other words, the derivative scale σD decides the size of the Gaussian kernels used to compute derivatives, while the integration scale σI is used to perform a weighted average of derivatives in a neighborhood. The multi-scale Harris cornerness measure is computed using the determinant and the trace of the adapted second-moment matrix as

cornerness = det(M(x, y)) − α · trace²(M(x, y))
Harris-Laplace
The value of the constant α is between 0.04 and 0.06. At each level of the representation, the interest points are extracted by detecting the local maxima in the 8-neighborhood of a point (x, y). Then, a threshold is used to reject maxima of small cornerness, as they are less stable under arbitrary viewing conditions.

In addition, the Laplacian-of-Gaussian is used to find the maxima over scale, where only the points for which the Laplacian attains a maximum, or whose response is above a threshold, are accepted.
Gabor-Wavelet detector
Yussof and Hitam proposed a multi-scale interest points detector based on the principle of
Gabor wavelets. The Gabor wavelets are biologically motivated convolution kernels in the shape
of plane waves restricted by a Gaussian envelope function, whose kernels are similar to the
response of the two-dimensional receptive field profiles of the mammalian simple cortical cell.
The Gabor wavelets take the form of a complex plane wave modulated by a Gaussian envelope
function.

where K_{u,v} = K_v e^{iφu}, z = (x, y), u and v define the orientation and scale of the Gabor wavelets, K_v = K_max / f^v and φu = πu/8, K_max is the maximum frequency, and f = √2 is the spacing factor between kernels in the frequency domain. The response of an image I to a wavelet ψ is calculated as the convolution.
Gabor-Wavelet detector
The coefficients of the convolution represent the information in a local image region, which should be more effective than isolated pixels. The advantage of Gabor wavelets is that they provide simultaneous optimal resolution in both the space and spatial frequency domains. Additionally, Gabor wavelets have the capability of enhancing low-level features such as peaks, valleys and ridges. Thus, they are used to extract points from the image at different scales by combining multiple orientations of image responses. The interest points are extracted at multiple scales with a combination of uniformly spaced orientations. The authors showed that the points extracted using the Gabor-wavelet detector have high accuracy and adaptability to various geometric transformations.
Affine Invariant Detectors
• The feature detectors discussed so far exhibit invariance to translations, rotations and uniform scaling, assuming that the localization and scale are not affected by an affine transformation of the local image structures. Thus, they only partially handle the challenging problem of affine invariance, since under an affine transformation the scale can be different in each direction rather than uniform, which in turn makes scale invariant detectors fail in the case of significant affine transformations. Therefore, building a detector robust to perspective transformations necessitates invariance to affine transformations. An affine invariant detector can be seen as a generalized version of a scale invariant detector. Recently, some feature detectors have been extended to extract features invariant to affine transformations. For instance, Schaffalitzky and Zisserman extended the Harris-Laplace detector by affine normalization. Mikolajczyk and Schmid introduced an approach for scale and affine invariant interest point detection. Their algorithm simultaneously adapts the location, scale and shape of a point neighborhood to obtain affine invariant points: the Harris detector is adapted to affine transformations, and the affine shape of a point neighborhood is estimated based on the second moment matrix.
Harris-Affine Detector
• The Harris affine detector can identify similar regions between images that
are related through affine transformations and have different illuminations.
These affine-invariant detectors should be capable of identifying similar
regions in images taken from different viewpoints that are related by a simple
geometric transformation: scaling, rotation and shearing

• These detected regions have been called both invariant and covariant: the regions are detected invariantly to the image transformation, but the regions themselves change covariantly with the image transformation.
Harris-Affine detector
This is achieved by following the iterative estimation scheme proposed by Lindeberg and Garding, as follows:
• 1. Identify initial region points using the scale-invariant Harris-Laplace detector.
• 2. For each initial point, normalize the region to be affine invariant using affine shape adaptation.
• 3. Iteratively estimate the affine region: select the proper integration scale and differentiation scale, and spatially localize the interest points.
• 4. Update the affine region using these scales and spatial localizations.
• 5. Repeat from step 3 if the stopping criterion is not met.
Hessian-affine
• Similar to Harris-affine, the same idea can also be applied to the Hessian-based detector, leading to
an affine invariant detector termed as Hessian-affine. For a single image, the Hessian-affine
detector typically identifies more reliable regions than the Harris-affine detector.
• The performance changes depending on the type of scene being analyzed. Further, the Hessian-
affine detector responds well to textured scenes in which there are a lot of corner-like parts.
However, for some structured scenes, like buildings, the Hessian-affine detector performs very well.
• The Harris affine detector relies on interest points detected at multiple scales using the Harris
corner measure on the second-moment matrix.
• The Hessian affine also uses a multiple scale iterative algorithm to spatially localize and select scale
and affine invariant points
• At each individual scale, the Hessian affine detector chooses interest points based on the Hessian
matrix at that point
Hessian-affine
Each entry of the Hessian matrix indexed by directions a and b is the mixed partial second derivative of the image in the a and b directions. It is important to note that the derivatives are computed at the current iteration scale and thus are derivatives of an image smoothed by a Gaussian kernel:

The derivatives must be scaled appropriately by a factor related to the Gaussian kernel:

At each scale, interest points are those points that simultaneously are local extrema of both
the determinant and trace of the Hessian matrix. The trace of Hessian matrix is identical to
the Laplacian of Gaussians (LoG)

Interest points based on the Hessian matrix are also spatially localized using an iterative search
based on the Laplacian of Gaussians. Predictably, these interest points are called Hessian–
Laplace interest points. Using these initially detected points, the Hessian affine detector uses an
iterative shape adaptation algorithm to compute the local affine transformation for each interest
point.
Image Feature Descriptor
Once a set of interest points has been detected from an image at a location p(x, y), scale s, and orientation θ, their content or image structure in a neighborhood of p needs to be encoded in a suitable descriptor that is discriminative for matching and insensitive to local image deformations. The descriptor should be aligned with θ and proportional to the scale s. There are a large number of image feature descriptors in the literature; some widely used descriptors are discussed in the following sections.
Image Feature Descriptor
Scale Invariant Feature Transform (SIFT)
• Lowe presented the scale-invariant feature transform (SIFT) algorithm, where a
number of interest points are detected in the image using the Difference-of-
Gaussian (DoG) operator. The points are selected as local extrema of the DoG
function. At each interest point, a feature vector is extracted. Over a number of
scales and over a neighborhood around the point of interest, the local
orientation of the image is estimated using the local image properties to provide
invariance against rotation. Next, a descriptor is computed for each detected
point, based on local image information at the characteristic scale. The SIFT
descriptor builds a histogram of gradient orientations of sample points in a
region around the key point, finds the highest orientation value and any other
values that are within 80% of the highest, and uses these orientations as the
dominant orientation of the keypoint.
Image Feature Descriptor
Scale Invariant Feature Transform (SIFT)
The description stage of the SIFT algorithm starts by sampling the image gradient
magnitudes and orientations in a 16 × 16 region around each keypoint using its scale to
select the level of Gaussian blur for the image. Then, a set of orientation histograms is created, where each histogram contains samples from a 4 × 4 subregion of the original neighborhood region, with eight orientation bins in each.

A schematic representation of the SIFT descriptor for a 16 × 16 pixel patch and a 4 × 4 descriptor array
Image Feature Descriptor
Scale Invariant Feature Transform (SIFT)
• A Gaussian weighting function with σ equal to half the region size is used to assign
weight to the magnitude of each sample point and gives higher weights to
gradients closer to the center of the region, which are less affected by positional
changes. The descriptor is then formed from a vector containing the values of all
the orientation histograms entries. Since there are 4 × 4 histograms each with 8
bins, the feature vector has 4 × 4 × 8 = 128 elements for each keypoint. Finally,
the feature vector is normalized to unit length to gain invariance to affine
changes in illumination. However, non-linear illumination changes can occur due
to camera saturation or similar effects causing a large change in the magnitudes
of some gradients. These changes can be reduced by thresholding the values in
the feature vector to a maximum value of 0.2, and the vector is again normalized.
The figure illustrates the schematic representation of the SIFT algorithm, where the gradient orientations and magnitudes are computed at each pixel and then weighted by a Gaussian falloff (indicated by the overlaid circle). A weighted gradient
orientation histogram is then computed for each subregion.
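In practice the full detector-plus-descriptor pipeline described above is available in OpenCV (SIFT is in the main module in recent versions); the short example below uses a hypothetical image file name and produces the 128-dimensional descriptors.

```python
import cv2

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input file
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(descriptors.shape)                               # (number of keypoints, 128)
```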
Image Feature Descriptor
Scale Invariant Feature Transform
(SIFT)
The standard SIFT descriptor representation is noteworthy in several respects: the representation is carefully designed to avoid problems due to boundary effects, so that smooth changes in location, orientation and scale do not cause radical changes in the feature vector; it is fairly compact, expressing the patch of pixels using a 128-element vector; and, while not explicitly invariant to affine transformations, the representation is surprisingly resilient to deformations such as those caused by perspective effects. These characteristics are evidenced in excellent matching performance against competing algorithms under different scales, rotations and lighting. On the other hand, the construction of the standard SIFT feature vector is complicated, and the choices behind its specific design are not clear. The common problem of SIFT is its high dimensionality, which significantly increases the computational time for computing and comparing the descriptor. As an extension to SIFT, PCA-SIFT was proposed to reduce the high dimensionality of the original SIFT descriptor by applying the standard Principal Components Analysis (PCA) technique to the normalized gradient image patches extracted around keypoints.
Image Feature Descriptor
Scale Invariant Feature Transform
(SIFT)
PCA-SIFT extracts a 41 × 41 patch at the given scale, computes its image gradients in the vertical and horizontal directions, and creates a feature vector by concatenating the gradients in both directions. Therefore, the feature vector is of length 2 × 39 × 39 = 3042 dimensions. The gradient image vector is projected into a pre-computed feature space, resulting in a feature vector of length 36 elements. The vector is then normalized to unit magnitude to reduce the effects of illumination changes. It has also been shown that SIFT is fully invariant with respect to only four of the six parameters of the affine transform, namely zoom, rotation and translation. Therefore, affine-SIFT (ASIFT) was introduced, which simulates all image views obtainable by varying the two camera axis orientation parameters left over by the SIFT descriptor, namely the latitude and the longitude angles.
Gradient Location-Orientation Histogram (GLOH)
• Gradient location-orientation histogram (GLOH)
developed by Mikolajczyk and Schmid is also an
extension of the SIFT descriptor. GLOH is very similar to
the SIFT descriptor, where it only replaces the Cartesian
location grid used by the SIFT with a log-polar one, and
applies PCA to reduce the size of the descriptor. GLOH
uses a log-polar location grid with 3 bins in the radial direction (radius set to 6, 11, and 15) and 8 bins in the angular direction, resulting in 17 location bins, as illustrated in the figure.

(Figure: a schematic representation of the GLOH algorithm using log-polar location bins)
GLOH descriptor builds a set of histograms using the gradient orientations in 16 bins, resulting in a
feature vector of 17 × 16 = 272 elements for each interest point. The 272-dimensional descriptor is reduced to a 128-dimensional one by computing the covariance matrix for PCA and selecting the 128 eigenvectors with the largest eigenvalues for description. Based on experimental evaluations, GLOH has been reported to outperform the original SIFT descriptor and to give the best performance, especially under illumination changes. Furthermore, it has been shown to be more distinctive, but also more expensive to compute, than its counterpart SIFT.
Speeded-Up Robust Features Descriptor (SURF)
The Speeded-Up Robust Features (SURF) detector-descriptor scheme developed by Bay et al. is designed as an efficient alternative to SIFT: it is much faster, and claimed to be more robust than SIFT. For the detection stage of interest points, instead of relying on ideal Gaussian derivatives, the computation is based on simple 2D box filters; it uses a scale invariant blob detector based on the determinant of the Hessian matrix for both scale selection and location. Its basic idea is to approximate the second order Gaussian derivatives in an efficient way with the help of integral images, using a set of box filters. The 9 × 9 box filters depicted in the figure are approximations of a Gaussian with σ = 1.2 and represent the lowest scale for computing the blob response maps. These approximations are denoted by Dxx, Dyy, and Dxy. Thus, the approximated determinant of the Hessian can be expressed as

det(H_approx) = Dxx Dyy − (w Dxy)²

where w is a relative weight for the filter response, used to balance the expression for the Hessian's determinant. The approximated determinant of the Hessian represents the blob response in the image. These responses are stored in a blob response map, and local maxima are detected and refined using quadratic interpolation, as with DoG. Finally, non-maximum suppression in a 3 × 3 × 3 neighborhood is applied to obtain stable interest points and their scales.
Speeded-Up Robust Features Descriptor (SURF)

• (Figure, left to right: Gaussian second order derivatives in the y- (Dyy) and xy-directions (Dxy) and their box-filter approximations in the same directions, respectively.)
The SURF descriptor starts by constructing a square region centered around the detected interest point and oriented along its main orientation. The size of this window is 20s, where s is the scale at which the interest point is detected. Then, the interest region is further divided into smaller 4 × 4 sub-regions, and for each sub-region the Haar wavelet responses in the vertical and horizontal directions (denoted dx and dy, respectively) are computed at 5 × 5 regularly spaced sample points, as shown in the figure on the next slide. These responses are weighted with a Gaussian window centered at the interest point to increase the robustness against geometric deformations and localization errors. The wavelet responses dx and dy are summed up for each sub-region and entered in a feature vector v,
SURF Descriptors
where, for each sub-region,

v = ( Σ dx, Σ dy, Σ |dx|, Σ |dy| )

Computing this for all the 4 × 4 sub-regions results in a feature descriptor of length 4 × 4 × 4 = 64 dimensions. Finally, the feature descriptor is normalized to a unit vector in order to reduce illumination effects. The main advantage of the SURF descriptor compared to SIFT is processing speed, as it uses a 64-dimensional feature vector to describe the local feature, while SIFT uses 128 dimensions. However, the SIFT descriptor is more suitable for describing images affected by translation, rotation, scaling, and illumination deformations. Though SURF shows its potential in a wide range of computer vision applications, it also has some shortcomings: when 2D or 3D objects are compared, it does not work well if the rotation is large or the viewing angle is too different.
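For completeness, a short usage example with OpenCV's SURF implementation is shown below; note that SURF lives in the opencv-contrib package and requires a build with the non-free algorithms enabled, and the Hessian threshold used here is an assumed, typical value.

```python
import cv2

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)             # hypothetical input file
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)        # requires contrib + non-free build
keypoints, descriptors = surf.detectAndCompute(img, None)
print(descriptors.shape)                                         # (number of keypoints, 64) by default
```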
Local Binary Pattern (LBP)
• Local Binary Patterns (LBP) characterize the spatial structure of a texture based on the local gray levels. The operator encodes the ordering relationship by comparing neighboring pixels with the center pixel; that is, it creates an order-based feature for each pixel by comparing that pixel's intensity value with those of its neighboring pixels. Specifically, the neighbors whose responses exceed the central one are labeled as '1' while the others are labeled as '0'. The co-occurrence of the comparison results is subsequently recorded as a string of binary bits. Afterwards, weights coming from a geometric sequence with a common ratio of 2 are assigned to the bits according to their indices in the string. The binary string with its weighted bits is consequently transformed into a decimal-valued index (i.e., the LBP feature response). That is, the descriptor describes the result over the neighborhood as a binary number (binary pattern). In its standard version, for a pixel c with intensity gc, each neighbor is thresholded as

S(gp − gc) = 1 if gp ≥ gc, and 0 otherwise

where the pixels p belong to a 3 × 3 neighborhood with gray levels gp (p = 0, 1, ..., 7). Then, the LBP pattern of the pixel neighborhood is computed by summing the corresponding thresholded values S(gp − gc) weighted by a binomial factor of 2^p, as

LBP(xc, yc) = Σ_{p=0}^{7} S(gp − gc) · 2^p
Local Binary Pattern (LBP)

After computing the label for each pixel of the image, a 256-bin histogram of the resulting labels is used as a feature descriptor for the texture. An illustrative example of computing the LBP of a pixel in a 3 × 3 neighborhood, and an orientation descriptor of a basic region in an image, is shown in the figure. The LBP descriptor can also be calculated in its general form as follows:
Local Binary Pattern (LBP)
LBP_{N,R}(xc, yc) = Σ_{i=0}^{N−1} S(ni − nc) · 2^i

where nc corresponds to the gray level of the center pixel of a local neighborhood and ni are the gray levels of N equally spaced pixels on a circle of radius R. Since the correlation between pixels decreases with distance, a lot of the texture information can be obtained from local neighborhoods; thus, the radius R is usually kept small. In practice, the signs of the differences in a neighborhood are interpreted as an N-bit binary number, resulting in 2^N distinct values for the binary pattern.
Several variations of LBP have been proposed, including the center-symmetric local binary pattern (CS-LBP), the local ternary pattern (LTP), the center-symmetric local ternary pattern (CS-LTP) based on the CS-LBP, and the orthogonal symmetric local ternary pattern (OS-LTP). Unlike the LBP, the CS-LBP descriptor compares gray-level differences of center-symmetric pairs of pixels. The LBP has the advantages of tolerance to illumination changes and computational simplicity, and the LBP and its variants achieve great success in texture description. Unfortunately, the LBP feature is an index of discrete patterns rather than a numerical feature, so it is difficult to combine LBP features with other discriminative ones in a compact descriptor. Moreover, it produces higher dimensional features and is sensitive to Gaussian noise on flat regions.
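The basic 3 × 3 LBP operator and its 256-bin histogram can be sketched directly in NumPy, as below; the fixed neighbor ordering is an illustrative choice (any consistent ordering of the eight neighbors works).

```python
import numpy as np

def lbp_3x3(img):
    """Basic LBP codes for all interior pixels of a grayscale integer image."""
    img = img.astype(np.int32)
    c = img[1:-1, 1:-1]                                   # centre pixels
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]          # 8 neighbours, fixed order
    codes = np.zeros_like(c)
    h, w = img.shape
    for k, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += (neighbour >= c).astype(np.int32) << k   # S(g_p - g_c) * 2^k
    return codes

def lbp_histogram(img):
    """256-bin histogram of the LBP codes, used as the texture descriptor."""
    hist, _ = np.histogram(lbp_3x3(img), bins=256, range=(0, 256))
    return hist
```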
Binary Robust Independent Elementary
Features (BRIEF)
Binary robust independent elementary features (BRIEF), a low-bitrate descriptor, was introduced for image matching with random forest and random ferns classifiers. It belongs to the family of binary descriptors, such as LBP and BRISK, which only perform simple binary comparison tests and use the Hamming distance instead of the Euclidean or Mahalanobis distance. Briefly, to build a binary descriptor, it is only necessary to compare the intensity between pairs of pixel positions located around the detected interest points. This makes it possible to obtain a representative description at very low computational cost. Besides, matching the binary descriptors requires only the computation of Hamming distances, which can be executed very fast through XOR primitives on modern architectures.
Binary Robust Independent
Elementary Features (BRIEF)
The BRIEF algorithm relies on a relatively small number of intensity difference tests to represent an image patch as a binary string. More specifically, a binary descriptor for a patch of pixels of size S × S is built by concatenating the results of the following test:

τ(p1, p2) = 1 if I(p1) < I(p2), and 0 otherwise

where I(pi) denotes the (smoothed) pixel intensity value at pi, and the selection of the locations of all the pi uniquely defines a set of binary tests. The sampling points are drawn from a zero-mean isotropic Gaussian distribution with variance equal to S²/25. To increase the robustness of the descriptor, the patch of pixels is pre-smoothed with a Gaussian kernel with variance equal to 2 and size equal to 9 × 9 pixels. The BRIEF descriptor has two setting parameters: the number of binary pixel pairs and the binary threshold.
Binary Robust Independent
Elementary Features (BRIEF)
• The experiments conducted by the authors showed that only 256 bits, or even 128 bits, often suffice to obtain very good matching results. Thus, BRIEF is considered to be very efficient both to compute and to store in memory. Unfortunately, the BRIEF descriptor is not robust to rotations larger than approximately 35°; hence, it does not provide rotation invariance.
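A minimal NumPy sketch of BRIEF-style binary tests is shown below; the patch size, the number of tests and the Gaussian sampling of test locations follow the description above, but the exact sampling pattern and smoothing parameters here are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
S, n_tests = 31, 256
# each test compares two offsets (dx, dy) drawn from N(0, S^2/25), clipped to the patch
offsets = np.clip(rng.normal(0.0, S / 5.0, size=(n_tests, 2, 2)),
                  -(S // 2), S // 2).astype(int)

def brief_descriptor(img, x, y):
    """256-bit BRIEF-style descriptor for a keypoint at least S//2 pixels from the border."""
    smoothed = gaussian_filter(img.astype(np.float64), sigma=2.0)   # pre-smoothing
    bits = np.empty(n_tests, dtype=np.uint8)
    for i, ((dx1, dy1), (dx2, dy2)) in enumerate(offsets):
        # binary test: 1 if the first sample is darker than the second
        bits[i] = 1 if smoothed[y + dy1, x + dx1] < smoothed[y + dy2, x + dx2] else 0
    return np.packbits(bits)                                        # 256 bits -> 32 bytes

def hamming(d1, d2):
    """Hamming distance: XOR the packed descriptors and count the differing bits."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())
```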
Features Matching
Features matching or generally image matching, a part of many computer
vision applications such as image registration, camera calibration and object
recognition, is the task of establishing correspondences between two images
of the same scene/object. A common approach to image matching consists of
detecting a set of interest points each associated with image descriptors from
image data. Once the features and their descriptors have been extracted from
two or more images, the next step is to establish some preliminary feature
matches between these images.

(Figure: matching image regions based on their local feature descriptors)
Features Matching
• Without loss of generality, the problem of image matching can be formulated as follows. Suppose that p is a point detected by a detector in an image and associated with a set of descriptors

Φ(p) = {φ1(p), φ2(p), ..., φK(p)}

where, for each of the K descriptors, φk(p) is the feature vector provided by the k-th descriptor. The aim is to find the best correspondence q in another image from the set of N interest points Q = {q1, q2, ..., qN} by comparing the feature vector φk(p) with those of the points in the set Q. To this end, a distance measure between the two interest point descriptors φk(p) and φk(q) can be defined as

dk(p, q) = ||φk(p) − φk(q)||
Features Matching
Based on the distance dk, the points of Q are sorted in ascending order independently for each descriptor, creating a ranked set of candidate matches for each descriptor. A match between the pair of interest points (p, q) is accepted only if
(i) p is the best match for q in relation to all the other points in the first image, and
(ii) q is the best match for p in relation to all the other points in the second image.
In this context, it is very important to devise an efficient algorithm to perform this matching process as quickly as possible. Nearest-neighbor matching in the feature space of the image descriptors under the Euclidean norm can be used for matching vector-based features. However, in practice, the optimal nearest neighbor algorithm and its parameters depend on the data set characteristics. Furthermore, to suppress matching candidates for which the correspondence may be regarded as ambiguous, the ratio between the distances to the nearest and the next nearest image descriptor is required to be less than some threshold. As a special case, for matching high dimensional features, two algorithms have been found to be the most efficient: the randomized k-d forest and the fast library for approximate nearest neighbors (FLANN).
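A short OpenCV example of this pipeline, using FLANN (randomized k-d trees) and the distance-ratio test on SIFT descriptors, is sketched below; the image file names and the 0.7 ratio are assumed, illustrative choices.

```python
import cv2

img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)    # hypothetical input files
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

index_params = dict(algorithm=1, trees=5)                # FLANN_INDEX_KDTREE
matcher = cv2.FlannBasedMatcher(index_params, dict(checks=50))
knn = matcher.knnMatch(des1, des2, k=2)                  # two nearest neighbours per descriptor

good = []
for pair in knn:
    # ratio test: keep the match only if it is clearly better than the second-best
    if len(pair) == 2 and pair[0].distance < 0.7 * pair[1].distance:
        good.append(pair[0])
print(f"{len(good)} putative correspondences")
```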
Features Matching(Binary features)
Binary features are compared using the Hamming distance, calculated by performing a bitwise XOR operation followed by a bit count on the result. This involves only bit manipulation operations that can be performed quickly. The typical solution when matching large datasets is to replace the linear search with an approximate matching algorithm that can offer speedups of several orders of magnitude over the linear search, at the cost that some of the nearest neighbors returned are approximate, though usually close in distance to the exact neighbors. For matching binary features, approximate methods such as locality-sensitive hashing or hierarchical clustering trees are commonly used.
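The sketch below matches ORB binary descriptors with a brute-force Hamming matcher in OpenCV; cross-checking enforces the mutual best-match condition described earlier, and the file names and feature count are assumptions.

```python
import cv2

img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)    # hypothetical input files
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance for binary descriptors; crossCheck keeps only mutual best matches
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} mutual Hamming matches")
```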
Features Matching
Generally, the performance of matching methods based on interest points depends on
both the properties of the underlying interest points and the choice of associated image descriptors. Thus, detectors and descriptors appropriate for the image content should be used
in applications. For instance, if an image contains bacteria cells, the blob detector should
be used rather than the corner detector. But, if the image is an aerial view of a city, the
corner detector is suitable to find man-made structures. Furthermore, selecting a detector
and a descriptor that addresses the image degradation is very important. For example, if
there is no scale change present, a corner detector that does not handle scale is highly
desirable; while, if image contains a higher level of distortion, such as scale and rotation,
the more computationally intensive SURF feature detector and descriptor is a adequate
choice in that case. For greater accuracy, it is recommended to use several detectors and
descriptors at the same time. In the area of feature matching, it must be noticed that the
binary descriptors are generally faster and typically used for finding point
correspondences between images, but they are less accurate than vector-based
descriptors . Statistically robust methods like RANSAC can be used to filter outliers in
matched feature sets while estimating the geometric transformation or fundamental
matrix, which is useful in feature matching for image registration and object recognition
applications.
Feature Tracking
• Feature tracking, the estimation of pixel-level correspondences and pixel-level
changes among adjacent video frames, is key to providing critical temporal
and geometric information for object motion/velocity estimation, camera self-
calibration and visual odometry.
• Accurate and stable feature tracks translate to accurate time-to-collision
estimates in obstacle perception, robust calculations of camera sensors’
extrinsic calibration (pitch/yaw/roll) values, and a key visual input for
generating a three dimensional world representation in visual odometry. Since
feature tracking is based on pixel-level computations, a high performance
computing platform is foundational to a practical implementation.
Feature Tracking
As the vehicle drives, pixel-level information can become distorted due to illumination changes, viewpoint changes and complexities associated with the motion of non-rigid objects in the scene. In computer vision, there exist a few common algorithmic approaches:
1) feature tracking with dense optical flow;
2) feature tracking with sparse optical flow; and
3) deep learning-based methods.
This approach comprises three major steps: 1) image preprocessing; 2) feature detection; and 3) feature tracking across frames.
(Figure: feature tracking algorithm running on a six-camera surround perception setup, with feature tracks shown in blue.)
The image preprocessing step extracts gradient information from the image. The feature detection step then uses this information to identify the salient feature points in the image, which can be robustly tracked across frames. Finally, the optical flow-based feature tracking step tracks the detected features and estimates their motion across adjacent frames in the video sequence.
Challenges in Feature Tracking
• Figure out which features can be tracked.
• Efficiently track across frames.
• Some points may change appearance over time (e.g., due to rotation,
moving into shadows, etc.).
• Drift: small errors can accumulate as appearance model is updated.
• Points may appear or disappear: need to be able to add/delete tracked
points.

Given two subsequent frames, estimate the point translation


Key assumptions of Lucas-Kanade Tracker
• Brightness constancy: projection of the same point looks the same
in every frame
• Small motion: points do not move very far
• Spatial coherence: points move like their neighbors
The brightness constancy constraint in Feature Tracking
• Brightness Constancy Equation:

I(x, y, t) = I(x + u, y + v, t + 1)

Take the Taylor expansion of I(x + u, y + v, t + 1) around (x, y, t):

I(x + u, y + v, t + 1) ≈ I(x, y, t) + Ix · u + Iy · v + It

So:

Ix · u + Iy · v + It ≈ 0,  i.e.,  ∇I · [u v]ᵀ + It = 0

The brightness constancy constraint
• We use this equation to recover the image motion (u, v) at each pixel.

This is one equation (a scalar equation!) with two unknowns (u, v).

The component of the motion perpendicular to the gradient (i.e., parallel to the edge) cannot be measured.

If (u, v) satisfies the equation, so does (u + u', v + v') if ∇I · [u' v']ᵀ = 0.


The aperture problem
The barber pole illusion
• The barber pole illusion is a visual illusion that exploits the way our
brains process and interpret visual information, particularly motion.
Solving the ambiguity
• How to get more equations for a pixel?
• Spatial coherence constraint .
• Assume the pixel’s neighbors have the same (u,v)
• If we use a 5x5 window, that gives us 25 equations per pixel
Solving the ambiguity…
Matching patches across images
• Over-constrained linear system: A d = b, where each row of A is [Ix  Iy] at one pixel of the window, d = [u  v]ᵀ, and the corresponding entry of b is −It.

Least squares solution for d given by the normal equations (AᵀA) d = Aᵀb:

[ Σ Ix Ix   Σ Ix Iy ] [u]     [ Σ Ix It ]
[ Σ Ix Iy   Σ Iy Iy ] [v] = − [ Σ Iy It ]

The summations are over all pixels in the K x K window.
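A minimal NumPy sketch of solving these normal equations for a single K × K window is given below; it assumes two consecutive grayscale frames I0 and I1 as float arrays and a window fully inside the image.

```python
import numpy as np

def lk_flow_at(I0, I1, x, y, K=15):
    """Lucas-Kanade displacement (u, v) for the K x K window centred at (x, y)."""
    r = K // 2
    w0 = I0[y - r:y + r + 1, x - r:x + r + 1].astype(np.float64)
    w1 = I1[y - r:y + r + 1, x - r:x + r + 1].astype(np.float64)
    Iy, Ix = np.gradient(w0)                          # spatial gradients
    It = w1 - w0                                      # temporal derivative
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)    # one row [Ix Iy] per pixel
    b = -It.ravel()
    d, *_ = np.linalg.lstsq(A, b, rcond=None)         # solves (A^T A) d = A^T b
    return d                                          # (u, v)
```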


Conditions for solvability
• Optimal (u, v) satisfies the Lucas-Kanade equation (AᵀA) d = Aᵀb

When is this solvable? I.e., what are good points to track?

– AᵀA should be invertible
– AᵀA should not be too small due to noise: the eigenvalues λ1 and λ2 of AᵀA should not be too small
– AᵀA should be well-conditioned: λ1/λ2 should not be too large (λ1 is the larger eigenvalue)

Low-texture region
– gradients have small magnitude – small λ1, small λ2

Edge
– gradients very large or very small – large λ1, small λ2

High-texture region
– gradients are different, with large magnitudes – large λ1, large λ2


The aperture problem resolved
Dealing with larger movements: Iterative
refinement
Dealing with larger movements: coarse-to-fine registration
Image Pyramid
Shi-Tomasi feature tracker
• Find good features using eigenvalues of second-moment matrix (e.g.,
Harris detector or threshold on the smallest eigenvalue)
– Key idea: “good” features to track are the ones whose motion can be estimated
reliably.
• Track from frame to frame with Lucas-Kanade
– This amounts to assuming a translation model for frame-to-frame feature
movement.
• Check consistency of tracks by affine registration to the first observed
instance of the feature.
– Affine model is more accurate for larger displacements – Comparing to the first
frame helps to minimize drift
Tracking example
Summary of KLT tracking
• Find a good point to track (Harris corner).
• Use intensity second moment matrix and difference across frames to
find displacement.
• Iterate and use coarse-to-fine search to deal with larger movements.
• When creating long tracks, check appearance of registered patch
against appearance of initial patch to find points that have drifted
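The whole KLT pipeline summarized above is available in OpenCV: Shi-Tomasi corners plus pyramidal Lucas-Kanade. The short example below uses hypothetical frame file names and typical parameter values.

```python
import cv2

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)   # hypothetical frames
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Shi-Tomasi: keep points whose smaller eigenvalue of the second-moment matrix is large
pts0 = cv2.goodFeaturesToTrack(prev, maxCorners=500, qualityLevel=0.01, minDistance=7)

# pyramidal Lucas-Kanade: coarse-to-fine tracking to handle larger motions
pts1, status, err = cv2.calcOpticalFlowPyrLK(
    prev, curr, pts0, None, winSize=(21, 21), maxLevel=3)

good_new = pts1[status.ravel() == 1]
good_old = pts0[status.ravel() == 1]
print(f"tracked {len(good_new)} of {len(pts0)} points")
```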
Implementation issues
• Window size
– Small window more sensitive to noise and may miss larger motions (without
pyramid)
– Large window more likely to cross an occlusion boundary (and it’s slower)
– 15x15 to 31x31 seems typical
• Weighting the window
– Common to apply weights so that center matters more (e.g., with Gaussian)
