CV - Unit 2
UNIT-2
Feature detection and matching
A motivating task: align two images so that they can be seamlessly stitched into a composite mosaic.
• To do this, we notice specific locations in the images, such as mountain peaks, building
corners, doorways, or interestingly shaped patches of snow.
• These kinds of localized features are often called keypoint features or interest points (or
even corners) and are often described by the appearance of pixel patches surrounding
the point location.
• Another class of important features are edges, e.g., the profile of mountains against the
sky.
• These kinds of features can be matched based on their orientation and local appearance
(edge profiles) and can also be good indicators of object boundaries and occlusion events
in image sequences.
• Edges can be grouped into longer curves and straight-line segments, which can be
directly matched or analyzed to find vanishing points and hence internal and external
camera parameters
2.1 Points and patches
Point features can be used to find a sparse set of corresponding locations in different images,
often as a precursor to computing camera pose, which is a prerequisite for computing a
denser set of correspondences using stereo matching.
• Such correspondences can also be used to align different images, e.g., when stitching
image mosaics or high dynamic range images, or performing video stabilization.
• There are two main approaches to finding feature points and their correspondences.
• The first is to find features in one image that can be accurately tracked using a local
search technique, such as correlation or least squares. This approach is more
suitable when images are taken from nearby viewpoints or in rapid succession (e.g.,
video sequences).
• The second is to independently detect features in all the images under consideration
and then match features based on their local appearance. This approach is more
suitable when a large amount of motion or appearance change is expected, e.g., in
stitching together panoramas, establishing correspondences in wide baseline stereo,
or performing object recognition
• The keypoint detection and matching pipeline is split into four separate stages.
• During the feature detection (extraction) stage, each image is searched for locations
that are likely to match well in other images.
• In the feature description stage, each region around detected keypoint locations is
converted into a more compact and stable (invariant) descriptor that can be
matched against other descriptors.
• The feature matching stage efficiently searches for likely matching candidates in other
images.
• The feature tracking stage is an alternative to the third stage that only searches a
small neighborhood around each detected feature and is therefore more suitable
for video processing.
2.1.1.Feature detectors
Figure: Aperture problems for different image patches: (a) stable ("corner-like") flow;
(b) classic aperture problem (barber-pole illusion); (c) textureless region. In each case the
two images I0 (yellow) and I1 (red) are overlaid.
Patches with gradients in at least two (significantly) different orientations are the easiest to
localize, as shown schematically in Figure (a).
These intuitions can be formalized by looking at the simplest possible matching criterion for
comparing two image patches, i.e., their (weighted) summed square difference,

E_{WSSD}(u) = \sum_i w(x_i) [I_1(x_i + u) - I_0(x_i)]^2,

where I_0 and I_1 are the two images being compared, u = (u, v) is the displacement vector,
w(x) is a spatially varying weighting (or window) function, and the summation i is over all
the pixels in the patch.
When performing feature detection, we do not know which other image locations the feature
will end up being matched against. Therefore, we can only compute how stable this metric is
with respect to small variations in position \Delta u by comparing an image patch against
itself, which is known as an auto-correlation function or surface,

E_{AC}(\Delta u) = \sum_i w(x_i) [I_0(x_i + \Delta u) - I_0(x_i)]^2.
• Gradients can be computed using a variety of techniques.
• The classic "Harris" detector uses a [-2 -1 0 1 2] filter, but more modern variants
convolve the image with horizontal and vertical derivatives of a Gaussian
(typically with \sigma = 1).
• The auto-correlation matrix A can be written as

A = w * \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix},

where we have replaced the weighted summations with discrete convolutions with the
weighting kernel w.
• This matrix can be interpreted as a tensor (multiband) image, where the outer products of
the gradients \nabla I are convolved with a weighting function w to provide a per-pixel
estimate of the local (quadratic) shape of the auto-correlation function.
Triggs suggests using the quantity \lambda_0 - \alpha \lambda_1 (say, with \alpha = 0.05),
which also reduces the response at 1D edges, where aliasing errors sometimes inflate the
smaller eigenvalue. He also shows how the basic 2 × 2 Hessian can be extended to parametric
motions to detect points that are also accurately localizable in scale and rotation.
• Using a Taylor series expansion of the image function,
I_0(x_i + \Delta u) \approx I_0(x_i) + \nabla I_0(x_i) \cdot \Delta u,
we can approximate the auto-correlation surface as

E_{AC}(\Delta u) \approx \sum_i w(x_i) [\nabla I_0(x_i) \cdot \Delta u]^2 = \Delta u^T A \Delta u,

where \nabla I_0(x_i) = (\partial I_0/\partial x, \partial I_0/\partial y)(x_i) is the image
gradient at x_i.
• The inverse of the matrix A provides a lower bound on the uncertainty in the location of a
matching patch.
• It is therefore a useful indicator of which patches can be reliably matched.
• The easiest way to visualize and reason about this uncertainty is to perform an
eigenvalue analysis of the auto-correlation matrix A, which produces two eigenvalues
(\lambda_0, \lambda_1) and two eigenvector directions.
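To make this concrete, here is a minimal sketch of a Harris-style detector built from the auto-correlation matrix, using NumPy and SciPy; the \sigma values, \alpha, and threshold are illustrative assumptions rather than values prescribed above.

import numpy as np
from scipy import ndimage

def harris_response(img, sigma=1.0, alpha=0.05):
    # Horizontal and vertical derivatives of a Gaussian
    Ix = ndimage.gaussian_filter(img, sigma, order=(0, 1))
    Iy = ndimage.gaussian_filter(img, sigma, order=(1, 0))
    # Outer products of the gradients, convolved with a weighting kernel w
    Axx = ndimage.gaussian_filter(Ix * Ix, 2 * sigma)
    Axy = ndimage.gaussian_filter(Ix * Iy, 2 * sigma)
    Ayy = ndimage.gaussian_filter(Iy * Iy, 2 * sigma)
    # Per-pixel eigenvalues of the 2x2 matrix A
    half_trace = 0.5 * (Axx + Ayy)
    root = np.sqrt(np.maximum(0.25 * (Axx - Ayy) ** 2 + Axy ** 2, 0))
    lam0, lam1 = half_trace - root, half_trace + root  # lam0 <= lam1
    return lam0 - alpha * lam1  # large only where both eigenvalues are large

Keypoints would then be taken at local maxima of this response above a threshold.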
Scale invariance
In many situations, detecting features at the finest stable scale possible may not be
appropriate. For example, when matching images with little high frequency detail (e.g., clouds),
fine-scale features may not exist.
One solution to the problem is to extract features at a variety of scales, e.g., by performing
the same operations at multiple resolutions in a pyramid and then matching features at the
same level.
This kind of approach is suitable when the images being matched do not undergo large
scale changes, e.g., when matching successive aerial images taken from an airplane or
stitching panoramas taken with a fixed-focal-length camera.
While Lowe’s Scale Invariant Feature Transform (SIFT) performs well in practice, it is not
based on the same theoretical foundation of maximum spatial stability as the auto-correlation
based detectors.
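As a hedged usage sketch, recent OpenCV releases expose a SIFT implementation; the image path below is a placeholder.

import cv2
img = cv2.imread("scene.jpg", 0)    # placeholder path, read as grayscale
sift = cv2.SIFT_create()
kps, descs = sift.detectAndCompute(img, None)
# Each keypoint carries its (x, y) location, scale, and dominant orientation
print(len(kps), descs.shape)        # N keypoints, each with a 128-D descriptor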
Rotational invariance and orientation estimation
A better method is to estimate a dominant orientation at each detected keypoint. Once the
local orientation and scale of a keypoint have been estimated, a scaled and oriented patch
around the detected point can be extracted and used to form a feature descriptor.
A dominant orientation estimate can be computed by creating a histogram of all the gradient
orientations (weighted by their magnitudes or after thresholding out small gradients) and
then finding the significant peaks in this distribution.
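A minimal sketch of such a dominant-orientation histogram over a grayscale patch (NumPy); the 36-bin resolution is an illustrative choice.

import numpy as np

def dominant_orientation(patch, n_bins=36):
    gy, gx = np.gradient(patch.astype(float))   # image gradients
    mag = np.hypot(gx, gy)                      # gradient magnitudes
    ang = np.arctan2(gy, gx)                    # orientations in [-pi, pi]
    # Orientation histogram weighted by gradient magnitude
    hist, edges = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi), weights=mag)
    k = np.argmax(hist)                         # most significant peak
    return 0.5 * (edges[k] + edges[k + 1])      # center of the peak bin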
Affine invariance
Affine-invariant detectors not only respond at consistent locations after scale and orientation
changes, they also respond consistently across affine deformations such as (local)
perspective foreshortening.
In fact, for a small enough patch, any continuous image warping can be well approximated
by an affine deformation.
Another important affine invariant region detector is the maximally stable extremal region
(MSER) detector. To detect MSERs, binary regions are computed by thresholding the image at
all possible gray levels (the technique therefore only works for grayscale images).
This operation can be performed efficiently by first sorting all pixels by gray value and then
incrementally adding pixels to each connected component as the threshold is changed.
As the threshold is changed, the area of each component (region) is monitored; regions
whose rate of change of area with respect to the threshold is minimal are defined as
maximally stable.
Such regions are therefore invariant to both affine geometric and photometric (linear
bias-gain or smooth monotonic) transformations.
If desired, an affine coordinate frame can be fit to each detected region using its moment
matrix.
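OpenCV includes an MSER implementation; a hedged usage sketch on a grayscale image (placeholder path):

import cv2
img = cv2.imread("scene.jpg", 0)            # grayscale, as MSER requires
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(img)   # pixel lists and bounding boxes
print(len(regions), "maximally stable extremal regions")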
2.1.2.Feature descriptors
After detecting keypoint features, we must match them, i.e., we must determine which
features come from corresponding locations in different images.
In this case, simple error metrics, such as the sum of squared differences or normalized
cross-correlation, can be used to directly compare the intensities in small patches around
each feature point.
In most cases, however, the local appearance of features will change in orientation and
scale, and sometimes even undergo affine deformations.
Extracting a local scale, orientation, or affine frame estimate and then using this to resample
the patch before forming the feature descriptor is thus usually preferable
Bias and gain normalization (MOPS).
MOPS descriptors are formed using an 8x8 sampling of bias and gain normalized intensity
values, with a sample spacing of five pixels relative to the detection scale. This low
frequency sampling gives the features some robustness to interest point location error and
is achieved by sampling at a higher pyramid level than the detection scale.
For tasks that do not exhibit large amounts of foreshortening, such as image stitching, simple
normalized intensity patches perform reasonably well and are simple to implement.
In order to compensate for slight inaccuracies in the feature point detector (location,
orientation, and scale), these multi-scale-oriented patches (MOPS) are sampled at a spacing
of five pixels relative to the detection scale, using a coarser level of the image
pyramid to avoid aliasing.
To compensate for affine photometric variations (linear bias and gain), patch intensities are
re-scaled so that their mean is zero and their variance is one.
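A sketch of this bias and gain normalization (NumPy); the epsilon guard against constant patches is an added assumption:

import numpy as np

def normalize_patch(patch, eps=1e-8):
    p = patch.astype(float)
    # Subtracting the mean removes bias; dividing by the std removes gain
    return (p - p.mean()) / (p.std() + eps)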
The gradient magnitudes are downweighted by a Gaussian fall-off function in order to
reduce the influence of gradients far from the center, as these are more affected by small
misregistrations.
In each 4 × 4 quadrant, a gradient orientation histogram is formed by (conceptually) adding
the weighted gradient value to one of eight orientation histogram bins.
To reduce the effects of location and dominant orientation misestimation, each of the
original 256 weighted gradient magnitudes is softly added to 2 × 2 × 2 histogram bins
using trilinear interpolation.
Softly distributing values to adjacent histogram bins is generally a good idea in any
application where histograms are being computed, e.g., for Hough transforms
The resulting 128 non-negative values form a raw version of the SIFT descriptor vector.
To reduce the effects of contrast or gain (additive variations are already removed by the
gradient), the 128-D vector is normalized to unit length.
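In Lowe's scheme, the unit-length vector is additionally clipped at 0.2 and renormalized so that very large gradient magnitudes do not dominate; a minimal sketch:

import numpy as np

def normalize_sift(v, clip=0.2):
    v = v / (np.linalg.norm(v) + 1e-12)      # scale to unit length
    v = np.minimum(v, clip)                  # clip large components
    return v / (np.linalg.norm(v) + 1e-12)   # renormalize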
PCA-SIFT
PCA-SIFT computes the x and y (gradient) derivatives over a 39 × 39 patch and then reduces
the resulting 3042-dimensional vector to 36 using principal component analysis (PCA).
Another popular variant of SIFT is SURF, which uses box filters to approximate the
derivatives and integrals used in SIFT.
The gradient location-orientation histogram (GLOH) descriptor is a variant on SIFT that
uses a log-polar binning structure instead of the four quadrants. The spatial bins are of
radii 6, 11, and 15, with eight angular bins (except for the central region), for a total of
17 spatial bins and 16 orientation bins.
The 272-dimensional histogram is then projected onto a 128-dimensional descriptor using
PCA trained on a large database.
Steerable filters. Steerable filters are combinations of derivative of Gaussian filters that
permit the rapid computation of even and odd (symmetric and anti-symmetric) edge-like
and corner-like features at all possible orientations.
Because they use reasonably broad Gaussians, they too are somewhat insensitive to
localization and orientation errors.
Performance of local descriptors. Among the local descriptors compared, GLOH performed
best, followed closely by SIFT.
The BRIEF descriptor compares 128 different pixel values scattered around the keypoint
location to obtain a 128-bit vector. ORB adds an orientation component to the FAST
detector before computing oriented BRIEF descriptors.
2.1.3.Feature matching
• Once we have extracted features and their descriptors from two or more images,
the next step is to establish some preliminary feature matches between these
images.
• Different strategies may be preferable for matching images that are known to overlap (e.g.,
in image stitching) vs. images that may have no correspondence whatsoever (e.g., when
trying to recognize objects from a database).
• There are two separate components to feature matching.
• The first is to select a matching strategy, which determines which correspondences are
passed on to the next stage for further processing.
• The second is to devise efficient data structures and algorithms to perform this
matching as quickly as possible.
Matching strategy and error rates
Determining which feature matches are reasonable to process further depends on the context
in which the matching is being performed.
Suppose we are given two images that overlap by a fair amount.
To begin with, we assume that the feature descriptors have been designed so that Euclidean
(vector magnitude) distances in feature space can be directly used for ranking potential
matches.
If it turns out that certain parameters (axes) in a descriptor are more reliable than others, it
is usually preferable to re-scale these axes ahead of time, e.g., by determining how much
they vary when compared against other known good matches
A more general process, which involves transforming feature vectors into a new scaled basis,
is called whitening; it is discussed in more detail in the context of eigenface-based face
recognition.
ROC curve and its related rates: (a) The ROC curve plots the true positive rate against the
false positive rate for a particular combination of feature extraction and matching algorithms.
Ideally, the true positive rate should be close to 1, while the false positive rate is close to 0.
The area under the ROC curve (AUC) is often used as a single (scalar) measure of algorithm
performance. Alternatively, the equal error rate is sometimes used.
(b) The distribution of positives (matches) and negatives (non-matches) as a function of inter-
feature distance d. As the threshold θ is increased, the number of true positives (TP) and
false positives (FP) increases.
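As a hedged sketch, scikit-learn can compute the ROC curve and AUC from ground-truth labels and match distances (distances are negated so that smaller distance means a better score); the arrays below are toy data.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

labels = np.array([1, 1, 0, 1, 0, 0])               # 1 = true match, 0 = non-match
dists = np.array([0.2, 0.3, 0.9, 0.4, 0.7, 0.8])    # inter-feature distances d
fpr, tpr, thresholds = roc_curve(labels, -dists)    # sweep the threshold theta
print("AUC:", roc_auc_score(labels, -dists))        # area under the ROC curve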
• A useful matching heuristic is to compare the nearest neighbor distance to that of the
second nearest neighbor, preferably taken from an image that is known not to match the
target (e.g., a different object in the database).
• We can define this nearest neighbor distance ratio as

NNDR = d_1 / d_2 = \|D_A - D_B\| / \|D_A - D_C\|,

where d_1 and d_2 are the nearest and second nearest neighbor distances, D_A is the
target descriptor, and D_B and D_C are its closest two neighbors.
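A minimal sketch of this ratio test using OpenCV's brute-force matcher, assuming SIFT descriptor arrays des1 and des2 from two images; 0.8 is Lowe's commonly quoted threshold.

import cv2
bf = cv2.BFMatcher(cv2.NORM_L2)
matches = bf.knnMatch(des1, des2, k=2)   # two nearest neighbors per descriptor
good = [m for m, n in matches if m.distance < 0.8 * n.distance]  # NNDR test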
Efficient matching
Once we have decided on a matching strategy, we still need to search efficiently for
potential candidates. The simplest way to find all corresponding feature points is to
compare all features against all other features in each pair of potentially matching images.
Unfortunately, this is quadratic in the number of extracted features, which makes it
impractical for most applications.
A better approach is to devise an indexing structure, such as a multi-dimensional search
tree or a hash table, to rapidly search for features near a given feature.
Such indexing structures can either be built for each image independently (which is useful
if we want to only consider certain potential matches, e.g., searching for a particular object) or
globally for all the images in a given database, which can potentially be faster, since it
removes the need to iterate over each image.
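A hedged sketch of such an indexing structure using SciPy's k-d tree (descriptor arrays des1 and des2 assumed, as above); note that exact k-d trees degrade in high dimensions, which is why approximate schemes are popular in practice.

import numpy as np
from scipy.spatial import cKDTree

tree = cKDTree(des2)                 # index one image's descriptors
dist, idx = tree.query(des1, k=2)    # two nearest neighbors per query
nndr = dist[:, 0] / dist[:, 1]       # nearest neighbor distance ratio
good = np.flatnonzero(nndr < 0.8)    # queries passing the NNDR test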
A simple example of hashing is one based on Haar wavelets.
During matching-structure construction, each 8×8 scaled, oriented, and normalized
MOPS patch is converted into a three-element index by performing sums over different
quadrants of the patch.
The resulting three values are normalized by their expected standard deviations and then
mapped to the two (of b = 10) nearest 1D bins.
The three-dimensional indices formed by concatenating the three quantized values are used
to index the 2^3 = 8 bins where the feature is stored (added).
At query time, only the primary (closest) indices are used, so only a single three-dimensional
bin needs to be examined. The coefficients in the bin can then be used to select k approximate
nearest neighbors for further processing (such as computing the NNDR).
Figure: The three Haar wavelet coefficients used for hashing the MOPS descriptor are
computed by summing each 8×8 normalized patch over the light and dark gray regions and
taking their difference.
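A sketch of three such Haar-like sums for an 8×8 normalized patch; exactly which half- and quadrant-splits are used is an illustrative assumption.

import numpy as np

def haar_index(patch):
    # patch: 8x8 bias-and-gain normalized MOPS patch
    horiz = patch[:, :4].sum() - patch[:, 4:].sum()   # left minus right
    vert = patch[:4, :].sum() - patch[4:, :].sum()    # top minus bottom
    diag = (patch[:4, :4].sum() + patch[4:, 4:].sum()
            - patch[:4, 4:].sum() - patch[4:, :4].sum())
    return np.array([horiz, vert, diag])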
For example, if we expect the whole image to be translated or rotated in the matching view,
we can fit a global geometric transform and keep only those feature matches that are
sufficiently close to this estimated transformation.
The process of selecting a small set of seed matches and then verifying a larger set is often
called random sample consensus, or RANSAC.
Once an initial set of correspondences has been established, some systems look for additional
matches, e.g., by looking for additional correspondences along epipolar lines or in the vicinity
of estimated locations based on the global transform.
For example, ORB keypoints and descriptors can be computed with OpenCV (the image path is a placeholder):

import cv2
orb = cv2.ORB_create()
img2match = cv2.imread("image2.jpg", 0)  # placeholder path, read as grayscale
kp2 = orb.detect(img2match, None)
# compute the descriptors with ORB
kp2, des2 = orb.compute(img2match, kp2)
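Building on these descriptors, a hedged sketch of RANSAC-based geometric verification using OpenCV's homography estimator; kp1 and des1 from a first image are assumed to have been computed the same way.

import numpy as np
bf = cv2.BFMatcher(cv2.NORM_HAMMING)   # Hamming distance for binary ORB descriptors
matches = bf.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.8 * n.distance]  # NNDR test
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
# RANSAC keeps only the matches consistent with a global homography
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)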
2.1.4.Feature tracking
An alternative to independently finding features in all candidate images and then matching
them is to find a set of likely feature locations in a first image and to then search for their
corresponding locations in subsequent images.
This kind of detect-then-track approach is more widely used for video tracking applications,
where the expected amount of motion and appearance deformation between adjacent
frames is expected to be small. The process of selecting good features to track is closely
related to selecting good features for more general recognition applications.
In practice, regions containing high gradients in both directions, i.e., which have high
eigenvalues in the auto-correlation matrix, provide stable locations at which to find
correspondences
A preferable solution is to compare the original patch to later image locations using an
affine motion model. In Shi and Tomasi's system, features are only detected infrequently,
i.e., only in regions where tracking has failed.
In the usual case, an area around the current predicted location of the feature is searched
with an incremental registration algorithm.
The resulting tracker is often called the Kanade–Lucas–Tomasi (KLT) tracker.
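A hedged sketch of KLT-style tracking with OpenCV, assuming two consecutive grayscale video frames prev and curr as NumPy arrays:

import cv2
# Select corners with two large auto-correlation eigenvalues (Shi-Tomasi)
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=7)
# Pyramidal Lucas-Kanade searches a small neighborhood around each feature
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None)
tracked = p1[status.ravel() == 1]   # keep successfully tracked points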
2.2.1.Edge detection
Qualitatively, edges occur at boundaries between regions of different color,
intensity, or texture. Unfortunately, segmenting an image into coherent regions
is a difficult task.
Often, it is preferable to detect edges using only purely local information.
Under such conditions, a reasonable approach is to define an edge as a location of rapid
intensity variation. Think of an image as a height field. On such a surface, edges occur at
locations of steep slopes, or equivalently, in regions of closely packed contour lines (on a
topographic map).
A mathematical way to define the slope and direction of a surface is through its gradient,

J(x) = \nabla I(x) = (\partial I/\partial x, \partial I/\partial y)(x).

The local gradient vector J points in the direction of steepest ascent in the intensity function.
Its magnitude is an indication of the slope or strength of the variation, while its orientation
points in a direction perpendicular to the local contour.
Unfortunately, taking image derivatives accentuates high frequencies and hence amplifies
noise.
Because we would like the response of our edge detector to be independent of orientation, a
circularly symmetric smoothing filter is desirable.
The Gaussian is the only separable circularly symmetric filter and so it is used in most edge
detection algorithms.
Because differentiation is a linear operation, it commutes with other linear filtering
operations. The gradient of the smoothed image can therefore be written as

J_\sigma(x) = \nabla [G_\sigma(x) * I(x)] = [\nabla G_\sigma](x) * I(x),

i.e., we can convolve the image with the horizontal and vertical derivatives of the Gaussian
kernel function

G_\sigma(x, y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right),

where the parameter \sigma indicates the width of the Gaussian.
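A sketch of this gradient-of-smoothed-image computation with SciPy, where the order argument selects a Gaussian derivative along each axis:

import numpy as np
from scipy import ndimage

def gaussian_gradient(img, sigma=1.0):
    Jx = ndimage.gaussian_filter(img, sigma, order=(0, 1))  # d/dx of smoothed image
    Jy = ndimage.gaussian_filter(img, sigma, order=(1, 0))  # d/dy of smoothed image
    magnitude = np.hypot(Jx, Jy)       # edge strength
    orientation = np.arctan2(Jy, Jx)   # perpendicular to the local contour
    return magnitude, orientation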
Finding this maximum corresponds to taking a directional derivative of the strength field in
the direction of the gradient and then looking for zero crossings. The desired directional
derivative is equivalent to the dot product between a second gradient operator and the
results of the first,

S_\sigma(x) = \nabla \cdot J_\sigma(x) = [\nabla^2 G_\sigma](x) * I(x).

The gradient operator dot product with the gradient is called the Laplacian,

\nabla^2 = \partial^2/\partial x^2 + \partial^2/\partial y^2.

The convolution kernel \nabla^2 G_\sigma is therefore called the Laplacian of Gaussian (LoG)
kernel. This kernel can be split into two separable parts, which allows for a much more
efficient implementation using separable filtering.
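A hedged sketch using SciPy's Laplacian of Gaussian filter; edges correspond to zero crossings of the response (the wrap-around at image borders from np.roll is a simplification):

import numpy as np
from scipy import ndimage

def log_zero_crossings(img, sigma=2.0):
    log = ndimage.gaussian_laplace(img.astype(float), sigma)
    # A pixel is a zero crossing if its sign differs from a neighbor's
    sign = log > 0
    zc = (sign != np.roll(sign, 1, axis=0)) | (sign != np.roll(sign, 1, axis=1))
    return zc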
Scale selection and blur estimation
Gradient, Laplacian, and difference of Gaussian filters all require the selection of a spatial
scale parameter \sigma.
If we are only interested in detecting sharp edges, the width of the filter can be determined
from image noise characteristics
However, if we want to detect edges that occur at different resolutions, a scale-space
approach that detects and then selects edges at different scales may be necessary.
Given a known image noise level, their technique computes, for every pixel, the minimum
scale at which an edge can be reliably detected
Their approach first computes gradients densely over an image by selecting among
gradient estimates computed at different scales, based on their gradient
magnitudes.
It then performs a similar estimate of minimum scale for directed second derivatives and
uses zero crossings of this latter quantity to robustly select edges.
As an optional final step, the blur width of each edge can be computed from the distance
between extrema in the second derivative response minus the width of the Gaussian filter.
First, separate oriented half-disc detectors are constructed and trained for measuring
significant differences in brightness (luminance), color (a* and b* channels, summed
responses), and texture.
Some of the responses are then sharpened using a soft non-maximal suppression technique.
Finally, the outputs of the three detectors are combined using a variety of machine-learning
techniques, from which logistic regression is found to have the best tradeoff between speed,
space, and accuracy.
2.2.2.Contour detection
While isolated edges can be useful for a variety of applications, such as line detection and
sparse stereo matching, they become even more useful when linked into continuous
contours.
If the edges have been detected using zero crossings of some function, linking them up
is straightforward, since adjacent edgels share common endpoints.
Linking the edgels into chains involves picking up an unlinked edgel and following its
neighbors in both directions.
Either a sorted list of edgels (sorted first by x coordinates and then by y coordinates, for
example) or a 2D array can be used to accelerate the neighbor finding.
If edges were not detected using zero crossings, finding the continuation of an
edgel can be tricky. In this case, comparing the orientation (and, optionally, phase)
of adjacent edgels can be used for disambiguation.
Ideas from connected component computation can also sometimes be used to make the edge
linking process even faster
Once the edgels have been linked into chains, we can apply an optional thresholding with
hysteresis to remove low-strength contour segments.
The basic idea of hysteresis is to set two different thresholds and allow a curve being tracked
above the higher threshold to dip in strength down to the lower threshold.
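Hysteresis thresholding is the mechanism used by the Canny detector; a hedged OpenCV sketch with illustrative low and high thresholds:

import cv2
img = cv2.imread("scene.jpg", 0)     # placeholder path, grayscale
edges = cv2.Canny(img, 100, 200)     # low / high hysteresis thresholds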
Linked edgel lists can be encoded more compactly using a variety of alternative
representations.
Figure: Arc-length parameterization of a contour: (a) discrete points along the contour are
first transcribed as (b) (x, y) pairs along the arc length s. This curve can then be regularly
re-sampled or converted into alternative (e.g., Fourier) representations.
Figure: Successive approximation of a contour: (a) original curve and a polyline
approximation shown in red; (b) successive approximation by recursively finding the points
furthest away from the current approximation; (c) smooth interpolating spline, shown in
dark blue, fit to the polyline vertices.
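This recursive furthest-point scheme is the classic Ramer-Douglas-Peucker algorithm, which OpenCV exposes as approxPolyDP; a sketch with a toy contour:

import cv2
import numpy as np

pts = np.array([[0, 0], [1, 2], [2, 1], [4, 5], [6, 0]], np.float32).reshape(-1, 1, 2)
eps = 0.5                                       # maximum allowed deviation
polyline = cv2.approxPolyDP(pts, eps, closed=False)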
Hough transforms
While curve approximation with polylines can often lead to successful line extraction, lines
in the real world are sometimes broken up into disconnected components or made up of
many collinear line segments.
In many cases, it is desirable to group such collinear segments into extended lines.
At a further processing stage, we can then group such lines into collections with common
vanishing points.
The Hough transform, named after its original inventor (Hough 1962), is a well-known
technique for having edges “vote” for plausible line locations.
In its original formulation, each edge point votes for all possible lines passing through it,
and lines corresponding to high accumulator or bin values are examined for potential line fits
Unless the points on a line are truly punctate, a better approach (in my experience) is to
use the local orientation information at each edgel to vote for a single accumulator cell
A hybrid strategy, where each edgel votes for a number of possible orientation or location
pairs centered around the estimated orientation, may be desirable in some cases.
Since lines are made up of edge segments, we adopt the convention that the line normal
\hat{n} points in the same direction (i.e., has the same sign) as the image gradient
J(x) = \nabla I(x).
The range of possible (\theta, d) values is [-180°, 180°] \times [-\sqrt{2}, \sqrt{2}],
assuming that we are using normalized pixel coordinates that lie in [-1, 1].
The number of bins to use along each axis depends on the
accuracy of the position and orientation estimate available at
each edgel and the expected line density, and is best set
experimentally with some test runs on sample imagery.
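A hedged sketch of probabilistic Hough line detection on a Canny edge map (OpenCV); all parameter values are illustrative and would be tuned as described above.

import cv2
import numpy as np

edges = cv2.Canny(cv2.imread("scene.jpg", 0), 100, 200)   # placeholder path
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                        minLineLength=30, maxLineGap=5)   # (x1, y1, x2, y2) rows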
Vanishing points
In many scenes, structurally important lines have the same vanishing point because they
are parallel in 3D. Examples of such lines are horizontal and vertical building edges, zebra
crossings, railway tracks, the edges of furniture such as tables and dressers, and of course,
the ubiquitous calibration pattern.
Finding the vanishing points common to such line sets can help refine their position in the
image and, in certain cases, help determine the intrinsic and extrinsic orientation of the
camera.
The first stage in my vanishing point detection algorithm uses a Hough transform to
accumulate votes for likely vanishing point candidates.
As with line fitting, one possible approach is to have each line vote for all possible vanishing
point directions, either using a cube map or a Gaussian sphere, optionally using knowledge
about the uncertainty in the vanishing point location to perform a weighted vote
The preferred approach is to use pairs of detected line segments to form candidate vanishing
point locations.
Let \hat{m}_i and \hat{m}_j be the (unit norm) line equations for a pair of line segments and
l_i and l_j be their corresponding segment lengths.
The location of the corresponding vanishing point hypothesis can be computed as

v_{ij} = \hat{m}_i \times \hat{m}_j,

and the corresponding weight set to

w_{ij} = \|v_{ij}\| \, l_i l_j.
This has the desirable effect of down weighting (near-)collinear line segments and short line
segments.
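A sketch of this pairwise construction in homogeneous coordinates (NumPy); the line equations and segment lengths are assumed given.

import numpy as np

def vp_hypothesis(m_i, m_j, l_i, l_j):
    # m_i, m_j: unit-norm homogeneous line equations (3-vectors)
    v = np.cross(m_i, m_j)              # candidate vanishing point (homogeneous)
    w = np.linalg.norm(v) * l_i * l_j   # small for near-collinear or short segments
    return v, w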
Rectangle detection
Once sets of mutually orthogonal vanishing points have been detected, it now becomes
possible to search for 3D rectangular structures in the image.
Over the last decade, a variety of techniques have been developed to find such rectangles,
primarily focused on architectural scenes
After detecting orthogonal vanishing directions, such systems refine the fitted line equations,
search for corners near line intersections, and then verify rectangle hypotheses by rectifying
the corresponding patches and looking for a preponderance of horizontal and vertical edges.
2.3 Segmentation
Image segmentation is the task of finding groups of pixels that "go together". In statistics,
this problem is known as cluster analysis and is a widely studied area with hundreds of
different algorithms.
Active contours
Active contour is a segmentation method that uses energy forces and constraints to separate
the pixels of interest from the rest of the image. Active contour is defined as an active model
for the segmentation process. Contours are the boundaries that define the region of interest
in an image. A contour is a collection of points that have been interpolated; the interpolation
procedure might be linear, spline-based, or polynomial, depending on how the curve is
described.
The primary use of active contours in image processing is to define smooth shapes in images
and to construct closed contours for regions. They are mainly used to identify uneven shapes
in images.
Active contours are used in a variety of medical image segmentation applications. Various
forms of active contour models are employed, particularly for separating desired regions
from a variety of medical images; a slice of a brain CT scan, for example, can be examined
to isolate the region of interest.
Application: Contour tracking and rotoscoping Active contours can be used in a wide variety of object-
tracking applications (Blake and Isard 1998; Yilmaz, Javed, and Shah 2006).
For example, they can be used to track facial features for performance-driven animation.
They can also be used to track heads and people, as shown in Figure 5.8, as well as moving
vehicles.
Additional applications include medical image segmentation, where contours can be tracked from slice to
slice in computerized tomography (3D medical imagery) or over time, as in ultrasound scans. An interesting
application that is closer to computer animation and visual effects is rotoscoping, which uses the tracked
contours to deform a set of hand-drawn animations (or to modify or replace the original
video frames). The authors of that work also provide an excellent review of previous
rotoscoping and image-based contour-tracking systems.
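A hedged sketch using scikit-image's active contour (snake) implementation; the sample image and circular initialization are illustrative assumptions.

import numpy as np
from skimage import data, filters
from skimage.segmentation import active_contour

img = data.astronaut()[..., 0].astype(float)   # sample grayscale image
t = np.linspace(0, 2 * np.pi, 200)
init = np.column_stack([100 + 100 * np.sin(t), 220 + 100 * np.cos(t)])  # (row, col)
snake = active_contour(filters.gaussian(img, 3), init,
                       alpha=0.015, beta=10, gamma=0.001)  # internal/external weights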
Normalized cuts
While bottom-up merging techniques aggregate regions into coherent wholes and mean-shift techniques
try to find clusters of similar pixels using mode finding, the normalized cuts technique introduced by Shi
and Malik (2000) examines the affinities (similarities) between nearby pixels and tries to separate groups
that are connected by weak affinities. Consider the simple graph shown in Figure 5.19a. The pixels in group
A are all strongly connected with high affinities, shown as thick red lines, as are the pixels in group B. The
connections between these two groups, shown as thinner blue lines, are much weaker. A normalized cut
between the two groups, shown as a dashed line, separates them into two clusters. The cut
between two groups A and B is defined as the sum of all the weights being cut,

cut(A, B) = \sum_{i \in A, j \in B} w_{ij},

where the weights between two pixels (or regions) i and j measure their similarity. Using a
minimum cut as a segmentation criterion, however, does not result in reasonable clusters,
since the smallest cuts usually involve isolating a single pixel. A better measure of
segmentation is the normalized cut, which is defined as

Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V),

where assoc(A, A) = \sum_{i \in A, j \in A} w_{ij} is the association (sum of all the weights)
within a cluster and assoc(A, V) = assoc(A, A) + cut(A, B) is the sum of all the weights
associated
with nodes in A. Figure 5.19b shows how the cuts and associations can be thought of as area sums in the
weight matrix W = [wij ], where the entries of the matrix have been arranged so that the nodes in A come
first and the nodes in B come second.
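Normalized cuts are usually solved via a spectral relaxation; a hedged sketch using scikit-learn's spectral clustering on a pixel affinity graph (toy image, illustrative parameters).

import numpy as np
from sklearn.feature_extraction.image import img_to_graph
from sklearn.cluster import spectral_clustering

img = np.random.rand(32, 32)                         # placeholder image
graph = img_to_graph(img)                            # sparse neighbor graph
graph.data = np.exp(-graph.data / graph.data.std())  # turn gradients into affinities
labels = spectral_clustering(graph, n_clusters=2, eigen_solver="arpack")
segments = labels.reshape(img.shape)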
Graph-based segmentation is a powerful technique used in image processing and computer
vision to divide an image into meaningful regions or segments. It works by representing the
image as a graph, where:
• Nodes represent individual pixels (or small groups of pixels), and
• Edges connect neighboring nodes, with weights that measure the similarity between them.
The segmentation process then involves finding a good partition of the graph that groups pixels into regions
based on their similarity and separates dissimilar regions. This is achieved by:
• Graph partitioning algorithms: These algorithms aim to minimize the total weight of edges that
need to be cut to separate different regions, ensuring elements within a segment are similar and
elements in different segments are dissimilar.
• Graph cuts: In this approach, the goal is to find a cut in the graph that divides it into two subgraphs
such that the sum of edge weights within each subgraph is high, while the sum of edge weights
between subgraphs is low.
Advantages:
• Flexibility: It can be adapted to various image types and segmentation tasks by defining
appropriate similarity measures and graph partitioning algorithms.
• Incorporates spatial information: The use of edges explicitly captures the spatial relationships
between pixels, making it suitable for tasks where spatial context is important.
• Handles complex structures: It can handle images with complex shapes and boundaries, which may
be challenging for other segmentation methods.
Limitations:
• Computational cost: Graph partitioning algorithms can be computationally expensive,
especially for large images.
• Sensitivity to noise: The quality of segmentation can be affected by noise in the image, as it can
distort the similarity measures between pixels.
• Parameter tuning: Choosing the right similarity measures and graph partitioning algorithms often
requires careful tuning depending on the specific image and task.
Overall, graph-based segmentation is a valuable tool for image segmentation, offering flexibility, spatial
awareness, and the ability to handle complex structures. However, understanding its limitations and
carefully considering parameter choices are crucial for achieving optimal results.
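A hedged sketch of a classic graph-based method, Felzenszwalb-Huttenlocher segmentation, via scikit-image; parameters are illustrative.

from skimage import data
from skimage.segmentation import felzenszwalb

img = data.astronaut()                                   # sample RGB image
segments = felzenszwalb(img, scale=100, sigma=0.5, min_size=50)
print(segments.max() + 1, "segments found")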
Both graph cuts and energy-based methods are closely related approaches for image segmentation,
although they have some key differences:
Graph cuts:
• Approach: Formulate the segmentation problem as a min-cut problem on a graph. Each pixel is a
node, and edges connect neighboring pixels with weights representing their similarity. The goal is to
find a cut in the graph that minimizes the total weight of edges between the two resulting segments.
• Advantages: Efficiently solvable using algorithms like max-flow/min-cut, globally optimal
solution, handles complex boundaries well.
• Disadvantages: Limited to binary segmentation (two segments), may be sensitive to noise in edge
weights.
Energy-based methods:
• Approach: Define an energy function that measures the quality of a segmentation. This function
typically includes terms for similarity between pixels within regions, dissimilarity between pixels in
different regions, and potentially other constraints. The goal is to find the segmentation that
minimizes the energy function.
• Advantages: More flexible than graph cuts, can handle multiple segments, can incorporate prior
knowledge or specific constraints.
• Disadvantages: Finding the minimum energy can be computationally expensive, may not always
guarantee a globally optimal solution.
Relationship between the two approaches:
• Graph cuts can be seen as a special case of energy-based methods, where the energy
function is defined based on edge weights in a graph.
• Both approaches aim to find a segmentation that optimizes a certain objective function, but they
differ in how they define and solve the optimization problem.
• Other energy-based methods: Several energy-based methods exist beyond graph cuts, such as
Markov Random Fields (MRFs) and active contours.
• Hybrid approaches: Combining graph cuts with other energy-based methods can leverage the
strengths of both approaches.
• Application specific: The best approach for a specific segmentation task depends on the image
characteristics, desired properties of the segmentation, and computational constraints.
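As a hedged sketch of binary graph-cut segmentation, using the PyMaxflow library (an assumed dependency; any max-flow/min-cut solver works), with intensity-based terminal weights and a constant smoothness term:

import numpy as np
import maxflow

img = np.random.rand(64, 64)      # placeholder grayscale image in [0, 1]
g = maxflow.Graph[float]()
nodes = g.add_grid_nodes(img.shape)
g.add_grid_edges(nodes, 0.5)      # pairwise smoothness between N4 neighbors
# Terminal edges: bright pixels prefer foreground, dark pixels background
g.add_grid_tedges(nodes, img, 1 - img)
g.maxflow()                       # globally optimal binary labeling
segmentation = g.get_grid_segments(nodes)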
Mean shift
In computer vision, mean shift is a powerful technique used for various tasks, including:
1. Object tracking: This is its most popular application. Mean shift iteratively shifts a window based on the
local density of features within the image, effectively "locking onto" an object and tracking its movement
despite changes in position, scale, or rotation.
2. Image segmentation: Mean shift can be used to segment images by grouping pixels based on their
similarity in features like color, intensity, or texture. It identifies clusters of similar pixels and converges to
their centers, forming the segments.
3. Feature detection: Mean shift can help locate distinctive features in an image by finding local maxima in
the feature space. This can be useful for tasks like object recognition or image analysis.
4. Video analysis: Mean shift can be applied to analyze video sequences by tracking objects, identifying
events, and understanding motion patterns.
Key properties:
• Non-parametric: It doesn't require any prior assumptions about the data distribution,
making it flexible for various image types.
• Unsupervised: It doesn't require labeled data, making it suitable for tasks where labeled data is
scarce.
• Iterative: It gradually refines its estimates, converging towards the desired result.
• Kernel-based: It uses a kernel function to measure similarity between data points, allowing for
flexible feature representations.
Limitations:
• Sensitive to noise: Noisy features can affect the convergence and accuracy of mean shift.
• Computationally expensive: The iterative nature can be computationally demanding for large
images or complex tasks.
• Parameter dependent: Performance depends on choosing appropriate parameters like kernel
bandwidth and stopping criteria.
Overall, mean shift is a valuable tool in computer vision with diverse applications. Understanding its
strengths and limitations is crucial for effectively utilizing it in various tasks.
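A hedged sketch of mean-shift filtering for color image segmentation with OpenCV; the spatial and color bandwidths sp and sr are illustrative.

import cv2
img = cv2.imread("scene.jpg")                            # placeholder path, color
filtered = cv2.pyrMeanShiftFiltering(img, sp=21, sr=30)  # spatial / color radii
# Pixel colors converge toward local modes, flattening each region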
A common theme in image segmentation algorithms is the desire to group pixels that have similar
appearance (statistics) and to have the boundaries between pixels in different regions be of short length
and across visible discontinuities. If we restrict the boundary measurements to be between immediate
neighbors and compute region membership statistics by summing over pixels, we can formulate this as a
classic pixel-based energy function using either a variational formulation or as a binary Markov random
field.
The energy corresponding to a segmentation problem can be written as a sum of region and
boundary terms,

E(f) = \sum_{i,j} [E_R(i, j) + E_B(i, j)],

where the region term E_R(i, j) is the negative log likelihood that pixel intensity (or color)
I(i, j) is consistent with the statistics of region R(f(i, j)), and the boundary term E_B(i, j)
measures the inconsistency between N_4 neighbors, modulated by local horizontal and
vertical smoothness terms s_x(i, j) and s_y(i, j). Region statistics can be something as simple
as the mean gray level or color.
This allows such a system, e.g., GrabCut (Rother, Kolmogorov, and Blake 2004), to operate
given minimal user input, such as a single bounding box (Figure 5.24a); the background
color model is initialized from a strip of pixels around the box outline. (The foreground
color model is initialized from the interior pixels, but quickly converges to a better estimate of
the object.) The user can also place additional strokes to refine the segmentation as the solution
progresses.
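A hedged sketch of this bounding-box workflow with OpenCV's GrabCut implementation; the image path and rectangle are placeholders.

import cv2
import numpy as np

img = cv2.imread("scene.jpg")              # placeholder path, color
mask = np.zeros(img.shape[:2], np.uint8)
bgdModel = np.zeros((1, 65), np.float64)   # background color model state
fgdModel = np.zeros((1, 65), np.float64)   # foreground color model state
rect = (50, 50, 300, 400)                  # user-supplied bounding box
cv2.grabCut(img, mask, rect, bgdModel, fgdModel, 5, cv2.GC_INIT_WITH_RECT)
fg = ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)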