

UNIT II FEATURE DETECTION, MATCHING AND SEGMENTATION


Points and patches - Edges - Lines - Segmentation - Active contours - Split and merge - Meanshift and mode finding - Normalized cuts - Graph cuts and energy-based methods.

UNIT-2
Feature detection and matching
A common motivating task is to align two images so that they can be seamlessly stitched into a composite mosaic.
• Notice specific locations in the images, such as mountain peaks, building
corners, doorways, or interestingly shaped patches of snow.
• These kinds of localized features are often called keypoint features or interest points (or even corners) and are often described by the appearance of pixel patches surrounding the point location.
• Another class of important features are edges, e.g., the profile of mountains against the
sky.
• These kinds of features can be matched based on their orientation and local appearance
(edge profiles) and can also be good indicators of object boundaries and occlusion events
in image sequences.
• Edges can be grouped into longer curves and straight-line segments, which can be directly matched or analyzed to find vanishing points and hence internal and external camera parameters.
2.1 Points and patches
Point features can be used to find a sparse set of corresponding locations in different images, often as a precursor to computing camera pose, which is a prerequisite for computing a denser set of correspondences using stereo matching.
• Such correspondences can also be used to align different images, e.g., when stitching
image mosaics or high dynamic range images, or performing video stabilization.
• There are two main approaches to finding feature points and their correspondences.
• The first is to find features in one image that can be accurately tracked using a local
search technique, such as correlation or least squares. This approach is more
suitable when images are taken from nearby viewpoints or in rapid succession (e.g.,
video sequences).
• The second is to independently detect features in all the images under consideration
and then match features based on their local appearance. This approach is more
suitable when a large amount of motion or appearance change is expected, e.g., in
stitching together panoramas, establishing correspondences in wide baseline stereo,
or performing object recognition
• The keypoint detection and matching pipeline can be split into four separate stages.
• During the feature detection (extraction) stage, each image is searched for locations
that are likely to match well in other images.
• In the feature description stage, each region around detected keypoint locations is
converted into a more compact and stable (invariant) descriptor that can be
matched against other descriptors.
• The feature matching stage efficiently searches for likely matching candidates in other
images.
• The feature tracking stage is an alternative to the third stage that only searches a
small neighborhood around each detected feature and is therefore more suitable
for video processing.

2.1.1.Feature detectors
Aperture problems for different image patches: (a) stable ("corner-like") flow; (b) classic aperture problem (barber-pole illusion); (c) textureless region.
In the accompanying figure, the two images I0 (yellow) and I1 (red) are overlaid. The red vector u indicates the displacement between the patch centers, and the w(xi) weighting function (patch window) is shown as a dark circle.

Patches with gradients in at least two (significantly) different orientations are the easiest to localize, as shown schematically in Figure (a).
These intuitions can be formalized by looking at the simplest possible matching criterion for comparing two image patches, i.e., their (weighted) summed square difference,
E_WSSD(u) = Σ_i w(x_i) [I1(x_i + u) − I0(x_i)]²,
where I0 and I1 are the two images being compared, u = (u, v) is the displacement vector, w(x) is a spatially varying weighting (or window) function, and the summation i is over all the pixels in the patch.
When performing feature detection, we do not know which other image locations the feature will end up being matched against. Therefore, we can only compute how stable this metric is with respect to small variations in position ∆u by comparing an image patch against itself, which is known as an auto-correlation function or surface:
E_AC(∆u) = Σ_i w(x_i) [I0(x_i + ∆u) − I0(x_i)]².
• Gradients can be computed using a variety of techniques.
• The classic "Harris" detector uses a [-2 -1 0 1 2] filter, but more modern variants convolve the image with horizontal and vertical derivatives of a Gaussian (typically with σ = 1).
• The auto-correlation matrix A can be written as
A = w ∗ [ Ix²  IxIy ; IxIy  Iy² ],
• where we have replaced the weighted summations with discrete convolutions with the weighting kernel w.
• This matrix can be interpreted as a tensor (multiband) image, where the outer products of the gradients ∇I are convolved with a weighting function w to provide a per-pixel estimate of the local (quadratic) shape of the auto-correlation function.

Uncertainty ellipse corresponding to an eigenvalue analysis of the auto-correlation matrix A (Förstner–Harris feature detection).


The minimum eigenvalue λ0 is not the only quantity that can be used to find keypoints. A simpler quantity is
det(A) − α trace(A)² = λ0 λ1 − α (λ0 + λ1)²
with α = 0.06. Unlike eigenvalue analysis, this quantity does not require the use of square roots and yet is still rotationally invariant and also down-weights edge-like features where λ1 ≫ λ0.
Triggs (2004) suggests using the quantity
λ0 − α λ1
(say, with α = 0.05), which also reduces the response at 1D edges, where aliasing errors sometimes inflate the smaller eigenvalue. He also shows how the basic 2 × 2 Hessian can be extended to parametric motions to detect points that are also accurately localizable in scale and rotation.

Brown, Szeliski, and Winder (2005), on the other hand, use the harmonic mean
det(A) / tr(A) = λ0 λ1 / (λ0 + λ1),
which is a smoother function in the region where λ0 ≈ λ1.

• Using a Taylor series expansion of the image function, I0(xi + ∆u) ≈ I0(xi) + ∇I0(xi) · ∆u, where ∇I0(xi) = (∂I0/∂x, ∂I0/∂y)(xi) is the image gradient at xi.
• We can then approximate the auto-correlation surface as E_AC(∆u) ≈ ∆uᵀ A ∆u.
• The inverse of the matrix A provides a lower bound on the uncertainty in the location of a matching patch.
• It is therefore a useful indicator of which patches can be reliably matched.
• The easiest way to visualize and reason about this uncertainty is to perform an eigenvalue analysis of the auto-correlation matrix A, which produces two eigenvalues (λ0, λ1) and two eigenvector directions.

Adaptive non-maximal suppression (ANMS).


While most feature detectors simply look for local maxima in the interest function, this can
lead to an uneven distribution of feature points across the image, e.g., points will be denser
in regions of higher contrast.
To mitigate this problem, only detect features that are both local maxima and whose response value is significantly (10%) greater than that of all of their neighbors within a radius r.
Brown, Szeliski, and Winder (2005) devise an efficient way to associate suppression radii with all local maxima by first sorting them by their response strength and then creating a second list sorted by decreasing suppression radius.
Measuring repeatability.
• The repeatability of a feature detector is defined as the frequency with which keypoints detected in one image are found within ε (say, ε = 1.5) pixels of the corresponding location in a transformed image.
• The planar images are transformed by applying rotations, scale changes, illumination
changes, viewpoint changes, and adding noise.
• The information content available at each detected feature point can be measured as the entropy of a set of rotationally invariant local grayscale descriptors.

Scale invariance
In many situations, detecting features at the finest stable scale possible may not be
appropriate. For example, when matching images with little high frequency detail (e.g., clouds),
fine-scale features may not exist.
One solution to the problem is to extract features at a variety of scales, e.g., by performing
the same operations at multiple resolutions in a pyramid and then matching features at the
same level.
This kind of approach is suitable when the images being matched do not undergo large
scale changes, e.g., when matching successive aerial images taken from an airplane or
stitching panoramas taken with a fixed- focal-length camera.

Scale-space feature detection using a sub-octave Difference of Gaussian pyramid (Lowe 2004) © 2004 Springer:
(a) adjacent levels of a sub-octave Gaussian pyramid are subtracted to produce Difference of Gaussian images;
(b) extrema (maxima and minima) in the resulting 3D volume are detected by comparing a pixel to its 26 neighbors.

As with the Harris operator, pixels where there is strong asymmetry in the local curvature of the indicator function (in this case, the Difference of Gaussian) are rejected. This is implemented by first computing the local Hessian of the difference image D and then rejecting keypoints for which the ratio of its eigenvalues (measured via Tr(H)²/Det(H)) exceeds a threshold.

While Lowe’s Scale Invariant Feature Transform (SIFT) performs well in practice, it is not
based on the same theoretical foundation of maximum spatial stability as the auto-correlation
based detectors.
Rotational invariance and orientation estimation
A better method is to estimate a dominant
orientation at each detected keypoint.
Once the local orientation and scale of a keypoint
have been estimated, a scaled and oriented patch
around the detected point can be extracted and
used to form a feature descriptor
A dominant orientation estimate can be computed
by creating a histogram of all the gradient
orientations (weighted by their magnitudes or after
thresholding out small gradients) and then finding
the significant peaks in this distribution

Affine invariance
Affine-invariant detectors not only respond at consistent locations after scale and orientation
changes, they also respond consistently across affine deformations such as (local)
perspective foreshortening.
In fact, for a small enough patch, any continuous image warping can be well approximated
by an affine deformation.
Another important affine invariant region detector is the maximally stable extremal region
(MSER) detector. To detect MSERs, binary regions are computed by thresholding the image at
all possible gray levels (the technique therefore only works for grayscale images).
This operation can be performed efficiently by first sorting all pixels by gray value and then
incrementally adding pixels to each connected component as the threshold is changed.
As the threshold is changed, the area of each component (region) is monitored; regions
whose rate of change of area with respect to the threshold is minimal are defined as
maximally stable.
Such regions are therefore invariant to both affine geometric and photometric (linear
bias-gain or smooth monotonic) transformations.
If desired, an affine coordinate frame can be fit to each detected region using its moment
matrix.
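
As an illustration, OpenCV ships an MSER implementation; the following minimal sketch (the image file name and default thresholding parameters are placeholders) extracts maximally stable regions from a grayscale image and draws their convex hulls.

import cv2

img = cv2.imread('noidea.jpg', cv2.IMREAD_GRAYSCALE)   # placeholder image file

# create the MSER detector (default thresholding parameters)
mser = cv2.MSER_create()

# detectRegions returns the pixel lists of the maximally stable regions and their bounding boxes
regions, bboxes = mser.detectRegions(img)

# draw the convex hull of each region for visualisation
vis = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
hulls = [cv2.convexHull(r.reshape(-1, 1, 2)) for r in regions]
cv2.polylines(vis, hulls, True, (0, 255, 0))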

Feature detectors - Harris corner detection


lOMoAR cPSD| 36984744

import cv2                              # opencv itself
import numpy as np                      # matrix manipulations
from matplotlib import pyplot as plt    # this lets you draw inline pictures in the notebooks
import pylab                            # this allows you to control figure size
%matplotlib inline
# (the original notebook also imported a local 'common' helper module, not needed for this snippet)

pylab.rcParams['figure.figsize'] = (10.0, 8.0)   # this controls figure size in the notebook

input_image = cv2.imread('noidea.jpg')
harris_test = input_image.copy()

# greyscale it
gray = cv2.cvtColor(harris_test, cv2.COLOR_BGR2GRAY)
gray = np.float32(gray)

blocksize = 4      # neighbourhood size considered for corner detection
kernel_size = 3    # sobel kernel: must be odd and fairly small

# run the harris corner detector
dst = cv2.cornerHarris(gray, blocksize, kernel_size, 0.05)

# result is dilated for marking the corners; this is visualisation related and just makes them bigger
dst = cv2.dilate(dst, None)

# we then plot these on the input image for visualisation purposes, using bright red
harris_test[dst > 0.01 * dst.max()] = [0, 0, 255]
plt.imshow(cv2.cvtColor(harris_test, cv2.COLOR_BGR2RGB))

2.1.2.Feature descriptors
After detecting keypoint features, we must match them, i.e., we must determine which
features come from corresponding locations in different images.
In this case, simple error metrics, such as the sum of squared differences or normalized cross-correlation, can be used to directly compare the intensities in small patches around each feature point.
In most cases, however, the local appearance of features will change in orientation and
scale, and sometimes even undergo affine deformations.
Extracting a local scale, orientation, or affine frame estimate and then using this to resample
the patch before forming the feature descriptor is thus usually preferable
Bias and gain normalization (MOPS).
MOPS descriptors are formed using an 8x8 sampling of bias and gain normalized intensity
values, with a sample spacing of five pixels relative to the detection scale. This low
frequency sampling gives the features some robustness to interest point location error and
is achieved by sampling at a higher pyramid level than the detection scale.

For tasks that do not exhibit large amounts of foreshortening, such as image stitching, simple
normalized intensity patches perform reasonably well and are simple to implement.
In order to compensate for slight inaccuracies in the feature point detector (location,
orientation, and scale), these multi-scale-oriented patches (MOPS) are sampled at a spacing
of five pixels relative to the detection scale, using a coarser level of the image
pyramid to avoid aliasing.
To compensate for affine photometric variations, patch intensities are re-scaled so that their mean is zero and their variance is one.
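
A minimal sketch of this bias and gain normalization step (the patch is assumed to have already been extracted at the appropriate scale and orientation; the function name is illustrative):

import numpy as np

def normalize_patch(patch, eps=1e-8):
    """Bias/gain normalize a descriptor patch: zero mean, unit variance."""
    patch = patch.astype(np.float32)
    return (patch - patch.mean()) / (patch.std() + eps)   # eps guards against flat (textureless) patches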

A schematic representation of Lowe's (2004) scale invariant feature transform (SIFT):
(a) Gradient orientations and magnitudes are computed at each pixel and weighted by a Gaussian fall-off function (blue circle).
(b) A weighted gradient orientation histogram is then computed in each subregion, using trilinear interpolation.
While this figure shows an 8 × 8 pixel patch and a 2 × 2 descriptor array, Lowe's actual implementation uses 16 × 16 patches and a 4 × 4 array of eight-bin histograms.
Scale invariant feature transform (SIFT):
SIFT features are formed by computing the gradient at each pixel in a 16×16 window around
the detected keypoint, using the appropriate level of the Gaussian pyramid at which the
keypoint was detected.

The gradient magnitudes are down-weighted by a Gaussian fall-off function in order to reduce the influence of gradients far from the center, as these are more affected by small misregistrations.
In each 4 × 4 quadrant, a gradient orientation histogram is formed by (conceptually) adding
the weighted gradient value to one of eight orientation histogram bins.
To reduce the effects of location and dominant orientation misestimation, each of the
original 256 weighted gradient magnitudes is softly added to 2 × 2 × 2 histogram bins
using trilinear interpolation.
Softly distributing values to adjacent histogram bins is generally a good idea in any
application where histograms are being computed, e.g., for Hough transforms
The resulting 128 non-negative values form a raw version of the SIFT descriptor vector.
To reduce the effects of contrast or gain (additive variations are already removed by the
gradient), the 128-D vector is normalized to unit length.
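For reference, a minimal sketch of detecting keypoints and computing SIFT descriptors with OpenCV (the image file name is a placeholder; cv2.SIFT_create is available in recent OpenCV releases):

import cv2

img = cv2.imread('noidea.jpg', cv2.IMREAD_GRAYSCALE)   # placeholder image

sift = cv2.SIFT_create()                      # create the SIFT detector/descriptor
kp, des = sift.detectAndCompute(img, None)    # des has shape (number of keypoints, 128)

# each row of des is the 128-D, unit-normalized orientation histogram described above
print(len(kp), des.shape)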
PCA-SIFT
PCA-SIFT computes the x and y (gradient) derivatives over a 39 × 39 patch and then reduces the resulting 3042-dimensional vector to 36 using principal component analysis (PCA).
Another popular variant of SIFT is SURF,
which uses box filters to approximate the
derivatives and integrals used in SIFT.
The gradient location-orientation histogram (GLOH) descriptor is a variant on SIFT that uses a log-polar binning structure instead of the four quadrants to compute its orientation histograms.
The spatial bins have radii 6, 11, and 15, with eight angular bins (except for the central region), for a total of 17 spatial bins and 16 orientation bins.
The 272-dimensional histogram is then projected onto a 128-dimensional descriptor using
PCA trained on a large database.
Steerable filters. Steerable filters are combinations of derivative of Gaussian filters that
permit the rapid computation of even and odd (symmetric and anti-symmetric) edge-like
and corner-like features at all possible orientations.
Because they use reasonably broad Gaussians, they too are somewhat insensitive to
localization and orientation errors.
Performance of local descriptors. Among the local descriptors compared, GLOH was found to perform best, followed closely by SIFT.
The BRIEF descriptor compares 128 different pixel values scattered around the keypoint
location to obtain a 128-bit vector. ORB adds an orientation component to the FAST
detector before computing oriented BRIEF descriptors.
Feature matching
• Once we have extracted features and their descriptors from two or more images,
the next step is to establish some preliminary feature matches between these
images.
• Different strategies may be preferable for matching images that are known to overlap (e.g.,
in image stitching) vs. images that may have no correspondence whatsoever (e.g., when
trying to recognize objects from a database).
• Two separate components of feature matching.
• The first is to select a matching strategy, which determines which correspondences are
passed on to the next stage for further processing.
• The second is to devise efficient data structures and algorithms to perform this
matching as quickly as possible.
Matching strategy and error rates
Determining which feature matches are reasonable to process further depends on the context
in which the matching is being performed.
Say we are given two images that overlap to a fair amount.
To begin with, we assume that the feature descriptors have been designed so that Euclidean
(vector magnitude) distances in feature space can be directly used for ranking potential
matches.
If it turns out that certain parameters (axes) in a descriptor are more reliable than others, it
is usually preferable to re-scale these axes ahead of time, e.g., by determining how much
they vary when compared against other known good matches
A more general process, which involves transforming feature vectors into a new scaled basis,
is called
whitening and is discussed in more detail in the context of eigenface-based face recognition.

False positives and negatives: The black digits 1 and 2 are


features being matched against a database of features in
other images. At the current threshold setting (the
solid circles), the green 1 is a true positive (good match),
the blue 1 is a false negative (failure to match), and the red
3 is a false positive (incorrect match). If we set the
threshold higher (the dashed circles), the blue 1 becomes a
true positive but the brown 4 becomes an additional false
positive.
We can quantify the performance of a matching algorithm
at a particular threshold by first counting the number of
true and false matches and match failures, using the
following definitions
• TP: true positives, i.e., number of correct matches
• FN: false negatives, matches that were not correctly detected
• FP: false positives, proposed matches that are incorrect
• TN: true negatives, non-matches that were correctly rejected.

The number of matches correctly and incorrectly


estimated by a feature matching algorithm,
showing the number of true positives (TP), false
positives (FP), false negatives (FN) and true
negatives (TN).
The columns sum up to the actual number of
positives (P) and negatives (N), while the rows
sum up to the predicted number of positives (P’)
and negatives (N’).
The true positive rate (TPR), false positive rate (FPR), positive predictive value (PPV), and accuracy (ACC) are defined as
TPR = TP/(TP + FN) = TP/P,  FPR = FP/(FP + TN) = FP/N,
PPV = TP/(TP + FP) = TP/P',  ACC = (TP + TN)/(P + N).
Any particular matching strategy (at a
particular threshold or parameter setting) can
be rated by the TPR and FPR numbers; ideally,
the true positive rate will be close to 1 and the
false positive rate close to 0.
As we vary the matching threshold, we obtain a
family of such points, which are collectively
known as the receiver
operating characteristic (ROC curve). The closer this curve lies to the upper left corner, i.e., the
larger the area under the curve (AUC), the better its performance.
-
The ROC curve can also be used to
calculate the mean average precision,
which is the average precision (PPV)
as you vary the threshold to select
the best results, then the two top
results, etc.
ROC curve and its related rates:
(a) The ROC curve plots the true
positive rate against the false positive
rate for a particular combination of
feature extraction and matching
algorithms.
Ideally, the true positive rate should be close to 1, while the false positive rate is close to 0.
The area under the ROC curve (AUC) is often used as a single (scalar) measure of algorithm
performance. Alternatively, the equal error rate is sometimes used.
(b) The distribution of positives (matches) and negatives (non-matches) as a function of inter-
feature distance d. As the threshold θ is increased, the number of true positives (TP) and
false positives (FP) increases.
Matching strategy and error rates
• To compare the nearest neighbor distance to that of the
second nearest neighbor, preferably taken from an image
that is known not to match the target (e.g., a different object
in the database).
• We can define this nearest neighbor distance ratio as
NNDR = d1/d2 = ‖DA − DB‖ / ‖DA − DC‖,
• where d1 and d2 are the nearest and second nearest neighbor distances, DA is the target descriptor, and DB and DC are its closest two neighbors.

Fixed threshold, nearest neighbor, and nearest neighbor distance ratio matching. At a fixed distance threshold (dashed circles), descriptor DA fails to match DB and DD incorrectly matches DC and DE. If we pick the nearest neighbor, DA correctly matches DB but DD incorrectly matches DC. Using nearest neighbor distance ratio (NNDR) matching, the small NNDR d1/d2 correctly matches DA with DB, and the large NNDR d1'/d2' correctly rejects matches for DD.
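
A minimal sketch of NNDR (ratio test) matching with OpenCV's brute-force matcher; the image file names are placeholders and the 0.8 threshold is a commonly used illustrative value:

import cv2

sift = cv2.SIFT_create()
img1 = cv2.imread('noidea.jpg', cv2.IMREAD_GRAYSCALE)           # placeholder images
img2 = cv2.imread('noidea_rotated.jpg', cv2.IMREAD_GRAYSCALE)
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

bf = cv2.BFMatcher(cv2.NORM_L2)        # L2 distance suits float SIFT descriptors
knn = bf.knnMatch(des1, des2, k=2)     # two nearest neighbors per query descriptor

good = []
for m, n in knn:
    # keep a match only when the nearest neighbor is much closer than the second nearest
    if m.distance / n.distance < 0.8:  # NNDR threshold (illustrative)
        good.append(m)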

Efficient matching
Once we have decided on a matching strategy, we still need to search efficiently for
potential candidates. The simplest way to find all corresponding feature points is to
compare all features against all other features in each pair of potentially matching images.
Unfortunately, this is quadratic in the number of extracted features, which makes it
impractical for most applications.
A better approach is to devise an indexing structure, such as a multi-dimensional search
tree or a hash table, to rapidly search for features near a given feature.
Such indexing structures can either be built for each image independently (which is useful
if we want to only consider certain potential matches, e.g., searching for a particular object) or
globally for all the images in a given database, which can potentially be faster, since it
removes the need to iterate over each image.
A simple example of hashing is the Haar wavelets.
During the matching structure construction, each 8×8 scaled, oriented, and normalized MOPS patch is converted into a three-element index by performing sums over different quadrants of the patch.

The resulting three values are normalized by their expected standard deviations and then
mapped to the two (of b = 10) nearest 1D bins.
The three-dimensional indices formed by concatenating the three quantized values are used
to index the 23 = 8 bins where the feature is stored (added).
At query time, only the primary (closest) indices are used, so only a single three-dimensional
bin needs to be examined. The coefficients in the bin can then be used to select k approximate
nearest neighbors for further processing (such as computing the NNDR).
The three Haar wavelet coefficients used for hashing the MOPS descriptor are computed by summing each 8×8 normalized patch over the light and dark gray regions and taking their difference.

Another widely used class of indexing


structures are multi-dimensional search
trees.
The best known of these are k-d trees,
also often written as kd-trees, which
divide the multidimensional feature
space along alternating axis-aligned
hyperplanes, choosing the threshold
along each axis so as to maximize some
criterion, such as the search tree
balance.

Here, eight different data points A–H are


shown as small diamonds arranged on a
two- dimensional plane. The k-d tree
recursively splits this plane along axis-
aligned (horizontal or vertical) cutting
planes.
At query time, a classic k-d tree search first locates the query point (+) in its appropriate bin
(D), and then searches nearby leaves in the tree (C, B, : : :) until it can guarantee that the
nearest neighbor has been found. The best bin first (BBF) search searches bins in order of
their spatial proximity to the query point and is therefore usually more efficient.
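
As an illustration, SciPy's cKDTree can index one image's float descriptors (e.g., SIFT) and answer nearest-neighbor queries for the other; the random descriptors below are stand-ins, and real systems often use approximate schemes such as BBF or FLANN instead:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
des1 = rng.random((500, 128)).astype(np.float32)   # stand-ins for two images' descriptors
des2 = rng.random((600, 128)).astype(np.float32)

tree = cKDTree(des2)                  # build the k-d tree over the second image's descriptors
dist, idx = tree.query(des1, k=2)     # two nearest neighbors for every query descriptor

# NNDR test: keep queries whose nearest neighbor is much closer than the second nearest
keep = dist[:, 0] / dist[:, 1] < 0.8
matches = [(i, idx[i, 0]) for i in np.flatnonzero(keep)]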

Feature match verification and densification


Once we have some hypothetical (putative) matches, we can often use geometric alignment to
verify which matches are inliers and which ones are outliers.

For example, if we expect the whole image to be translated or rotated in the matching view,
we can fit a global geometric transform and keep only those feature matches that are
sufficiently close to this estimated transformation.
The process of selecting a small set of seed matches and then verifying a larger set is often
called random sampling or RANSAC.
Once an initial set of correspondences has been established, some systems look for additional
matches, e.g., by looking for additional correspondences along epipolar lines or in the vicinity
of estimated locations based on the global transform.

import cv2
from matplotlib import pyplot as plt

# set-up assumed by this snippet; the image file names are placeholders
img = cv2.imread('noidea.jpg')
img2match = cv2.imread('noidea_rotated.jpg')

orb = cv2.ORB_create()                      # create the ORB detector/descriptor
kp = orb.detect(img, None)                  # keypoints in the first image
kp, des = orb.compute(img, kp)              # descriptors for the first image
orbimg = cv2.drawKeypoints(img, kp, None, color=(0, 255, 0))

kp2 = orb.detect(img2match, None)
# compute the descriptors with ORB
kp2, des2 = orb.compute(img2match, kp2)

# create BFMatcher object: this is a Brute Force matching object
# (Hamming distance, since ORB descriptors are binary strings)
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

# Match descriptors.
matches = bf.match(des, des2)

# Sort them by distance between matches in feature space - so the best matches are first.
matches = sorted(matches, key=lambda x: x.distance)

# Draw first 50 matches.
oimg = cv2.drawMatches(orbimg, kp, img2match, kp2, matches[:50], orbimg)
plt.imshow(cv2.cvtColor(oimg, cv2.COLOR_BGR2RGB))

Feature tracking
An alternative to independently finding features in all candidate images and then matching
them is to find a set of likely feature locations in a first image and to then search for their
corresponding locations in subsequent images.
This kind of detect then track approach is more widely used for video tracking applications,
where the expected amount of motion and appearance deformation between adjacent
frames is expected to be small. The process of selecting good features to track is closely
related to selecting good features for more general recognition applications.
In practice, regions containing high gradients in both directions, i.e., which have high
eigenvalues in the auto-correlation matrix, provide stable locations at which to find
correspondences

A preferable solution is to compare the original patch to later image locations using an affine motion model. In such systems, features are only detected infrequently, i.e., only in regions where tracking has failed.
In the usual case, an area around the current predicted location of the feature is searched
with an incremental registration algorithm.
The resulting tracker is often called the Kanade–Lucas–Tomasi (KLT) tracker.
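
A minimal sketch of KLT-style tracking with OpenCV (frame file names and parameter values are placeholders): good features are detected in the first frame and then tracked into the next frame with pyramidal Lucas–Kanade.

import cv2

frame0 = cv2.imread('frame0.jpg', cv2.IMREAD_GRAYSCALE)   # placeholder video frames
frame1 = cv2.imread('frame1.jpg', cv2.IMREAD_GRAYSCALE)

# select corners with a high minimum eigenvalue of the auto-correlation matrix
p0 = cv2.goodFeaturesToTrack(frame0, maxCorners=200, qualityLevel=0.01, minDistance=7)

# track them into the next frame with pyramidal Lucas-Kanade (KLT)
p1, status, err = cv2.calcOpticalFlowPyrLK(frame0, frame1, p0, None,
                                           winSize=(21, 21), maxLevel=3)

tracked = p1[status.flatten() == 1]   # keep only the successfully tracked points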

2.2.Edges and contours


While interest points are useful for finding image locations that can be accurately matched in
2D, edge points are far more plentiful and often carry important semantic associations.
For example, the boundaries of objects, which also correspond to occlusion events in 3D,
are usually delineated by visible contours. Other kinds of edges correspond to shadow
boundaries or crease edges, where surface orientation changes rapidly.

2.2.1.Edge detection
Qualitatively, edges occur at boundaries between regions of different color,
intensity, or texture. Unfortunately, segmenting an image into coherent regions
is a difficult task.
Often, it is preferable to detect edges using only purely local information.
Under such conditions, a reasonable approach is to define an edge as a location of rapid
intensity variation. Think of an image as a height field. On such a surface, edges occur at
locations of steep slopes, or equivalently, in regions of closely packed contour lines (on a
topographic map).
A mathematical way to define the slope and direction of a surface is through its gradient,
J(x) = ∇I(x) = (∂I/∂x, ∂I/∂y)(x).
The local gradient vector J points in the direction of steepest ascent in the intensity function. Its magnitude is an indication of the slope or strength of the variation, while its orientation points in a direction perpendicular to the local contour.
Unfortunately, taking image derivatives accentuates high frequencies and hence amplifies noise, since the proportion of noise to signal is larger at high frequencies.
It is therefore prudent to smooth the image with a low-pass filter prior to computing the gradient.

Because we would like the response of our edge detector to be independent of orientation, a
circularly symmetric smoothing filter is desirable.
The Gaussian is the only separable circularly symmetric filter and so it is used in most edge
detection algorithms.
Because differentiation is a linear operation, it commutes with other linear filtering operations. The gradient of the smoothed image can therefore be written as
J_σ(x) = ∇[G_σ(x) ∗ I(x)] = [∇G_σ](x) ∗ I(x),
i.e., we can convolve the image with the horizontal and vertical derivatives of the Gaussian kernel function
G_σ(x) = (1 / 2πσ²) exp(−(x² + y²) / 2σ²),
where the parameter σ indicates the width of the Gaussian.
To produce thin, well-localized edges, we look for maxima of the edge strength along the gradient direction. Finding this maximum corresponds to taking a directional derivative of the strength field in the direction of the gradient and then looking for zero crossings.
The desired directional derivative is equivalent to the dot product between a second gradient operator and the results of the first,
S_σ(x) = ∇ · J_σ(x) = [∇²G_σ](x) ∗ I(x).
The gradient operator dot product with the gradient is called the Laplacian, and the convolution kernel
∇²G_σ(x) = ((x² + y²)/σ⁴ − 2/σ²) G_σ(x)
is therefore called the Laplacian of Gaussian (LoG) kernel.
This kernel can be split into two separable parts, which allows for a much more efficient implementation using separable filtering.
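
A minimal sketch of this smoothed-gradient computation in OpenCV (σ and the image name are placeholder values): the image is blurred with a Gaussian and then differentiated with Sobel filters, which approximates convolution with the derivatives of a Gaussian.

import cv2
import numpy as np

img = cv2.imread('noidea.jpg', cv2.IMREAD_GRAYSCALE).astype(np.float32)

sigma = 2.0                                     # width of the Gaussian (placeholder value)
smoothed = cv2.GaussianBlur(img, (0, 0), sigma)

# horizontal and vertical derivatives of the smoothed image
Jx = cv2.Sobel(smoothed, cv2.CV_32F, 1, 0, ksize=3)
Jy = cv2.Sobel(smoothed, cv2.CV_32F, 0, 1, ksize=3)

magnitude = np.sqrt(Jx**2 + Jy**2)              # edge strength
orientation = np.arctan2(Jy, Jx)                # perpendicular to the local contour

# Laplacian of the smoothed image; its zero crossings localize the edges
laplacian = cv2.Laplacian(smoothed, cv2.CV_32F, ksize=3)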
Scale selection and blur estimation
The derivative, Laplacian, and Difference of Gaussian filters all require the selection of a spatial scale parameter σ.
If we are only interested in detecting sharp edges, the width of the filter can be determined from image noise characteristics.
However, if we want to detect edges that occur at different resolutions, a scale-space approach that detects and then selects edges at different scales may be necessary.
Given a known image noise level, such a technique computes, for every pixel, the minimum scale at which an edge can be reliably detected.
The approach first computes gradients densely over an image by selecting among gradient estimates computed at different scales, based on their gradient magnitudes.
It then performs a similar estimate of minimum scale for directed second derivatives and uses zero crossings of this latter quantity to robustly select edges.
As an optional final step, the blur width of each edge can be computed from the distance between extrema in the second derivative response minus the width of the Gaussian filter.

Color edge detection


While most edge detection techniques have been developed for grayscale images, color images
can provide additional information. For example, noticeable edges between iso-luminant colors
(colors that have the same luminance) are useful cues but fail to be detected by grayscale edge
operators.
One simple approach is to combine the outputs of grayscale detectors run on each color
band separately. For example, if we simply sum up the gradients in each of the color
bands, the signed gradients may actually cancel each other!
(Consider, for example a pure red-to-green edge.) We could also detect edges independently in
each band and then take the union of these, but this might lead to thickened or doubled edges
that are hard to link.
A better approach is to compute the oriented energy in each band, e.g., using a second-
order steerable filter, and then sum up the orientation-weighted energies and find their joint
best orientation.
Unfortunately, the directional derivative of this energy may not have a closed form solution
(as in the case of signed first-order steerable filters), so a simple zero crossing-based
strategy cannot be used.
An alternative approach is to estimate local color statistics in regions around each pixel.
This has the advantage that more sophisticated techniques (e.g., 3D color histograms) can
be used to compare regional statistics and that additional measures, such as texture, can
also be considered.

Combining edge feature cues


If the goal of edge detection is to match human boundary detection performance as opposed
to simply finding stable features for matching, even better detectors can be constructed by
combining multiple low-level cues such as brightness, color, and texture.

First, construct and train separate oriented half-disc detectors for measuring significant
differences in brightness (luminance), color (a* and b* channels, summed responses), and
texture.
Some of the responses are then sharpened using a soft non-maximal suppression technique.
Finally, the outputs of the three detectors are combined using a variety of machine-learning techniques, from which logistic regression is found to have the best tradeoff between speed, space, and accuracy.

2.2.2.Contour detection

While isolated edges can be useful for a variety of applications, such as line detection and
sparse stereo matching, they become even more useful when linked into continuous
contours.
If the edges have been detected using zero crossings of some function, linking them up
is straightforward, since adjacent edgels share common endpoints.
Linking the edgels into chains involves picking up an unlinked edgel and following its
neighbors in both directions.
Either a sorted list of edgels (sorted first by x coordinates and then by y coordinates, for
example) or a 2D array can be used to accelerate the neighbor finding.
If edges were not detected using zero crossings, finding the continuation of an
edgel can be tricky. In this case, comparing the orientation (and, optionally, phase)
of adjacent edgels can be used for disambiguation.
Ideas from connected component computation can also sometimes be used to make the edge
linking process even faster
Once the edgels have been linked into chains, we can apply an optional thresholding with
hysteresis to remove low-strength contour segments.

The basic idea of hysteresis is to set two different thresholds and allow a curve being tracked
above the higher threshold to dip in strength down to the lower threshold.
Linked edgel lists can be encoded more compactly using a variety of alternative
representations.

Chain code representation of a grid-aligned linked edge chain.


The code is represented as a series of direction codes,
e.g, 0 1 0 7 6 5, which can further be compressed using
predictive and run-length coding.

A more useful representation is the arc length


parameterization of a contour, x(s), where s denotes the arc
length along a curve.
The advantage of the arc-length parameterization is that
it makes matching and processing (e.g., smoothing)
operations much easier. Arc-length parameterization can
also be used to smooth curves in order to remove
digitization noise.
However, if we just apply a regular smoothing filter,
the curve tends to shrink on itself.

Arc-length parameterization of a contour: (a) discrete points along the contour are first transcribed as (b) (x, y) pairs along the arc length s. This curve can then be regularly re-sampled or converted into alternative (e.g., Fourier) representations.

The advantage of the arc-length parameterization is that it makes matching and processing (e.g., smoothing) operations much easier. Consider two curves describing similar shapes.
To compare the curves, we first subtract the average value x0 = (1/S) ∫ x(s) ds from each descriptor. Next, we rescale each descriptor so that s goes from 0 to 1 instead of 0 to S, i.e., we divide x(s) by S. Finally, we take the Fourier transform of each normalized descriptor, treating each x = (x, y) value as a complex number.
If the original curves are the same (up to an unknown scale and rotation), the resulting Fourier transforms should differ only by a scale change in magnitude plus a constant complex phase shift, due to rotation, and a linear phase shift in the frequency domain, due to different starting points for s.
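
A minimal NumPy sketch of this normalization, assuming the contour has already been re-sampled uniformly along its length; the ellipse test shape is invented for illustration:

import numpy as np

def fourier_descriptor(pts):
    """Normalized Fourier descriptor of a closed contour sampled uniformly along its length."""
    z = pts[:, 0] + 1j * pts[:, 1]      # treat each (x, y) sample as a complex number
    z = z - z.mean()                    # subtract the average value (translation invariance)
    Z = np.fft.fft(z)
    return Z / (np.abs(Z[1]) + 1e-12)   # divide by the first harmonic (scale invariance)

# rotation and starting point only change coefficient phases, so comparing magnitudes
# gives a rotation- and start-point-insensitive similarity between the two shapes
t = np.linspace(0, 2 * np.pi, 128, endpoint=False)
shape1 = np.column_stack([3 * np.cos(t), np.sin(t)])
R = np.array([[np.cos(0.7), -np.sin(0.7)], [np.sin(0.7), np.cos(0.7)]])
shape2 = shape1 @ R                     # the same shape, rotated
diff = np.abs(np.abs(fourier_descriptor(shape1)) - np.abs(fourier_descriptor(shape2)))
print(diff.max())                       # close to zero: the descriptors match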

Matching two contours using their arc-length parameterization. If both curves are normalized to unit length, s ∈ [0, 1], and centered around their centroid x0, they will have the same descriptor up to an overall "temporal" shift (due to different starting points for s = 0) and a phase (x-y) shift (due to rotation).

The evolution of curves as they are smoothed and


simplified is related to “grassfire” (distance)
transforms and region
skeletons, and can be used to recognize objects based on their contour shape.
More local descriptors of curve shape such as shape contexts can also be used for
recognition and are potentially more robust to missing parts due to occlusions.
The field of contour detection and linking continues to evolve rapidly and now includes
techniques for global contour grouping, boundary completion, and junction detection, as well
as grouping contours into likely regions and wide-baseline correspondence

Lines and vanishing points


While edges and general curves are suitable for describing the contours of natural objects,
the man-made world is full of straight lines.
Detecting and matching these lines can be useful in a variety of applications, including
architectural modeling, pose estimation in urban environments, and the analysis of printed
document layouts.
Successive approximation
Describing a curve as a series of 2D locations xi = x(si) provides a general representation
suitable for matching and further processing.
In many applications, however, it is preferable to approximate such a curve with a simpler
representation, e.g., as a piecewise-linear polyline or as a B-spline curve
Many techniques have been developed over the years to perform this approximation,
which is also known as line simplification.
One of the oldest, and simplest, is to recursively subdivide the curve at the point furthest away from the line joining the two endpoints (or the current coarse polyline approximation); this is the Ramer–Douglas–Peucker algorithm. More efficient implementations also exist in the literature. Once the line simplification has been computed, it can be used to approximate the original curve.
If a smoother representation or visualization is desired, either approximating or
interpolating splines or curves can be used.

Successive approximation
(a) Original curve and a polyline
approximation shown in red;
(b) successive approximation by recursively
finding points furthest away from the
current approximation;
(c) smooth interpolating spline, shown in dark
blue, fit to the polyline vertices.
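
As an illustration, OpenCV's approxPolyDP implements this kind of recursive subdivision; the synthetic blob and the 1% tolerance below are placeholders.

import cv2
import numpy as np

# synthetic binary image with a blob whose outline we will simplify
img = np.zeros((200, 200), np.uint8)
cv2.circle(img, (100, 100), 60, 255, -1)

contours, _ = cv2.findContours(img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
curve = contours[0]

# tolerance: points further than this from the current approximation trigger a new subdivision
epsilon = 0.01 * cv2.arcLength(curve, True)      # 1% of the contour length (illustrative)
polyline = cv2.approxPolyDP(curve, epsilon, True)
print(len(curve), "contour points simplified to", len(polyline))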

Hough transforms
While curve approximation with polylines can often lead to successful line extraction, lines
in the real world are sometimes broken up into disconnected components or made up of
many collinear line segments.
In many cases, it is desirable to group such collinear segments into extended lines.

At a further processing stage, we can then group such lines into collections with common
vanishing points.

The Hough transform, named after its original inventor (Hough 1962), is a well-known
technique for having edges “vote” for plausible line locations.
In its original formulation, each edge point votes for all possible lines passing through it,
and lines corresponding to high accumulator or bin values are examined for potential line fits

Original Hough transform:
(a) each point votes for a complete family of potential lines r_i(θ) = x_i cos θ + y_i sin θ;
(b) each pencil of lines sweeps out a sinusoid in (r, θ); their intersection provides the desired line equation.

Unless the points on a line are truly punctate, a better approach is to use the local orientation information at each edgel to vote for a single accumulator cell. A hybrid strategy, where each edgel votes for a number of possible orientation or location pairs centered around the estimated orientation, may be desirable in some cases.
Since lines are made up of edge segments, we adopt the convention that the line normal n̂ points in the same direction (i.e., has the same sign) as the image gradient J(x) = ∇I(x).
The range of possible (θ, d) values is [−180°, 180°] × [−√2, √2], assuming that we are using normalized pixel coordinates that lie in [−1, 1].
The number of bins to use along each axis depends on the
accuracy of the position and orientation estimate available at
each edgel and the expected line density, and is best set
experimentally with some test runs on sample imagery.

2D line equation expressed in terms of the normal n̂ and distance to the origin d.
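
A minimal sketch of line detection with OpenCV's Hough transform (the edge detector thresholds, accumulator resolution, and vote threshold are placeholder values):

import cv2
import numpy as np

img = cv2.imread('noidea.jpg', cv2.IMREAD_GRAYSCALE)   # placeholder image
edges = cv2.Canny(img, 50, 150)                        # edge map whose pixels cast the votes

# accumulator resolution: 1 pixel in r, 1 degree in theta; lines need at least 100 votes
lines = cv2.HoughLines(edges, 1, np.pi / 180, 100)

vis = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
if lines is not None:
    for r, theta in lines[:, 0]:
        # convert (r, theta) back to two distant points on the line for drawing
        a, b = np.cos(theta), np.sin(theta)
        x0, y0 = a * r, b * r
        p1 = (int(x0 - 1000 * b), int(y0 + 1000 * a))
        p2 = (int(x0 + 1000 * b), int(y0 - 1000 * a))
        cv2.line(vis, p1, p2, (0, 0, 255), 1)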

An alternative representation can be obtained by using a


cube map, i.e., projecting m onto the face of a unit cube.
To compute the cube map coordinate of a 3D vector m, first find the largest (absolute value) component of m, i.e., m = max(|n_x|, |n_y|, |d|), and use this to select one of the six cube faces. Divide the remaining two coordinates by m and use these as indices into the cube face.
While this avoids the use of trigonometry, it does require some decision logic.

Cube map representation for line equations and vanishing points:
(a) a cube map surrounding the unit sphere;
(b) projecting the half-cube onto three subspaces.

RANSAC-based line detection.


Another alternative to the Hough transform is the RANdom SAmple Consensus (RANSAC)
algorithm.
In brief, RANSAC randomly chooses pairs of edgels to form a line hypothesis and then tests
how many other edgels fall onto this line.
(If the edge orientations are accurate enough, a single edgel can produce this hypothesis.)
Lines with sufficiently large numbers of inliers (matching edgels) are then selected as the
desired line segments.
An advantage of RANSAC is that no accumulator array is needed and so the algorithm can be
more space efficient and potentially less prone to the choice of bin size.
The disadvantage is that many more hypotheses may need to be generated and tested than
those obtained by finding peaks in the accumulator array.
In general, there is no clear consensus on which line estimation technique performs best.
It is therefore a good idea to think carefully about the problem at hand and to implement
several approaches (successive approximation, Hough, and RANSAC) to determine the one
that works best for your application.
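
A minimal NumPy sketch of RANSAC line fitting over a set of edgels; the synthetic points, iteration count, and inlier tolerance are all illustrative.

import numpy as np

def ransac_line(points, n_iters=500, tol=2.0, rng=np.random.default_rng(0)):
    """Fit a 2D line to points by random sampling; returns ((n_hat, d), inlier_mask) with n_hat . x = d."""
    best_inliers = np.zeros(len(points), dtype=bool)
    best_line = None
    for _ in range(n_iters):
        i, j = rng.choice(len(points), size=2, replace=False)    # line hypothesis from two edgels
        direction = points[j] - points[i]
        norm = np.linalg.norm(direction)
        if norm < 1e-9:
            continue
        n_hat = np.array([-direction[1], direction[0]]) / norm   # unit normal to the hypothesized line
        d = n_hat @ points[i]
        residuals = np.abs(points @ n_hat - d)                   # point-to-line distances
        inliers = residuals < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_line = inliers, (n_hat, d)
    return best_line, best_inliers

# synthetic edgels near the line y = 0.5 x + 20, plus uniformly scattered outliers
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 200)
points = np.column_stack([x, 0.5 * x + 20 + rng.normal(0, 1.0, 200)])
points = np.vstack([points, rng.uniform(0, 100, (50, 2))])
line, inliers = ransac_line(points)
print("inliers:", inliers.sum())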

Vanishing points
In many scenes, structurally important lines have the same vanishing point because they
are parallel in 3D. Examples of such lines are horizontal and vertical building edges, zebra
crossings, railway tracks, the edges of furniture such as tables and dressers, and of course,
the ubiquitous calibration pattern.
Finding the vanishing points common to such line sets can help refine their position in the
image and, in certain cases, help determine the intrinsic and extrinsic orientation of the
camera.

The first stage in a typical vanishing point detection algorithm uses a Hough transform to accumulate votes for likely vanishing point candidates.
As with line fitting, one possible approach is to have each line vote for all possible vanishing
point directions, either using a cube map or a Gaussian sphere, optionally using knowledge
about the uncertainty in the vanishing point location to perform a weighted vote
The preferred approach is to use pairs of detected line segments to form candidate vanishing
point locations.

Let m̂_i and m̂_j be the (unit norm) line equations for a pair of line segments and l_i and l_j be their corresponding segment lengths. The location of the corresponding vanishing point hypothesis can be computed as
v_ij = m̂_i × m̂_j
and the corresponding weight set to
w_ij = ‖v_ij‖ l_i l_j.
This has the desirable effect of down-weighting (near-)collinear line segments and short line segments.
Rectangle detection
Once sets of mutually orthogonal vanishing points have been detected, it now becomes
possible to search for 3D rectangular structures in the image.

Over the last decade, a variety of techniques have been developed to find such rectangles,
primarily focused on architectural scenes
After detecting orthogonal vanishing directions, refine the fitted line equations, search for
corners near line intersections, and then verify rectangle hypotheses by rectifying the
corresponding patches and looking for a preponderance of horizontal and vertical edges.

2.3. Segmentation
Image segmentation is the task of finding groups of pixels that "go together". In statistics, this problem is known as cluster analysis and is a widely studied area with hundreds of different algorithms.

What are Active Contours?

Active contour is a segmentation method that uses energy forces and constraints to separate the pixels

of interest from a picture for further processing and analysis.



Active contour is defined as an active model for the segmentation process. Contours are the boundaries

that define the region of interest in an image. A contour is a collection of points that have been

interpolated. The interpolation procedure might be linear, splines, or polynomial, depending on how

the curve in the image is described.

Why Active Contours is needed?

The primary use of active contours in image processing is to define smooth shapes in images and to

construct closed contours for regions. It is mainly used to identify uneven shapes in images.

Active contours are used in a variety of medical image segmentation applications. Various forms of

active contour models are employed in a variety of medical applications, particularly for the separation

of desired regions from a variety of medical images. A slice of a brain CT scan, for example, is examined

for segmentation using active contour models.
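
A minimal sketch using scikit-image's active_contour (snake) implementation, following the pattern of the scikit-image example; the initial circle and the energy weights are illustrative values.

import numpy as np
from skimage import data, color, filters
from skimage.segmentation import active_contour

img = color.rgb2gray(data.astronaut())     # sample image shipped with scikit-image
img = filters.gaussian(img, sigma=3)       # smooth to widen the attraction range of edges

# initial contour: a circle placed roughly around the region of interest (illustrative values)
s = np.linspace(0, 2 * np.pi, 400)
init = np.array([100 + 100 * np.sin(s),    # row coordinates
                 220 + 100 * np.cos(s)]).T # column coordinates

# alpha penalises stretching (elasticity), beta penalises bending (rigidity),
# gamma is the time step of the iterative energy minimisation
snake = active_contour(img, init, alpha=0.015, beta=10.0, gamma=0.001)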



Application: Contour tracking and rotoscoping
Active contours can be used in a wide variety of object-tracking applications (Blake and Isard 1998; Yilmaz, Javed, and Shah 2006).

For example, they can be used to track facial features for performance-driven animation .They can also be

used to track heads and people, as shown in Figure 5.8, as well as moving vehicles

Additional applications include medical image segmentation, where contours can be tracked from slice to

slice in computerized tomography (3D medical imagery) or over time, as in ultrasound scans. An interesting

application that is closer to computer animation and visual effects is rotoscoping, which uses the tracked

contours to deform a set of hand-drawn animations (or to modify or replace the original video frames). They

also provide an excellent review of previous rotoscoping and image-based, contour-tracking systems

Normalized cuts

While bottom-up merging techniques aggregate regions into coherent wholes and mean-shift techniques
try to find clusters of similar pixels using mode finding, the normalized cuts technique introduced by Shi
and Malik (2000) examines the affinities (similarities) between nearby pixels and tries to separate groups
that are connected by weak affinities. Consider the simple graph shown in Figure 5.19a. The pixels in group
A are all strongly connected with high affinities, shown as thick red lines, as are the pixels in group B. The
connections between these two groups, shown as thinner blue lines, are much weaker. A normalized cut
between the two groups, shown as a dashed line, separates them into two clusters. The cut between two
groups A and B is defined as the sum of all the weights being cut,

cut(A, B) = Σ_{i∈A, j∈B} w_ij,

where the weights between two pixels (or regions) i and j measure their similarity. Using a minimum cut as a segmentation criterion, however, does not result in reasonable clusters, since the smallest cuts usually involve isolating a single pixel. A better measure of segmentation is the normalized cut, which is defined as

Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V),

where assoc(A, A) = Σ_{i∈A, j∈A} w_ij is the association (sum of all the weights) within a cluster and assoc(A, V) = assoc(A, A) + cut(A, B) is the sum of all the weights associated with nodes in A. Figure 5.19b shows how the cuts and associations can be thought of as area sums in the weight matrix W = [w_ij], where the entries of the matrix have been arranged so that the nodes in A come first and the nodes in B come second.
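
As an illustration of the idea, scikit-learn's spectral_clustering performs a closely related spectral partition of a pixel affinity graph (spectral clustering in the normalized-cuts spirit, not Shi and Malik's exact algorithm); the synthetic image, beta, and cluster count below are placeholders.

import numpy as np
from sklearn.feature_extraction import image
from sklearn.cluster import spectral_clustering

# small synthetic test image: bright square on a dark, noisy background
rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0
img += 0.2 * rng.standard_normal(img.shape)

# build the pixel affinity graph from gradients between neighboring pixels
graph = image.img_to_graph(img)
beta = 10.0                                           # affinity decay (placeholder value)
graph.data = np.exp(-beta * graph.data / graph.data.std()) + 1e-6

# spectral partition of the affinity graph into two groups
labels = spectral_clustering(graph, n_clusters=2, assign_labels='discretize', random_state=0)
segments = labels.reshape(img.shape)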

Graph-based segmentation is a powerful technique used in image processing and computer vision to divide
an image into meaningful regions or segments. It works by representing the image as a graph, where:

• Nodes: Each node represents a pixel in the image.


• Edges: Edges connect neighboring pixels and are weighted based on the similarity between the pixels
they connect. This similarity can be measured using various features like intensity, color, texture, or
other domain-specific criteria.

The segmentation process then involves finding a good partition of the graph that groups pixels into regions
based on their similarity and separates dissimilar regions. This is achieved by:

• Graph partitioning algorithms: These algorithms aim to minimize the total weight of edges that
need to be cut to separate different regions, ensuring elements within a segment are similar and
elements in different segments are dissimilar.
• Graph cuts: In this approach, the goal is to find a cut in the graph that divides it into two subgraphs
such that the sum of edge weights within each subgraph is high, while the sum of edge weights
between subgraphs is low.

Here are some key advantages of graph-based segmentation:

• Flexibility: It can be adapted to various image types and segmentation tasks by defining appropriate
similarity measures and graph partitioning algorithms.
• Incorporates spatial information: The use of edges explicitly captures the spatial relationships
between pixels, making it suitable for tasks where spatial context is important.
• Handles complex structures: It can handle images with complex shapes and boundaries, which may
be challenging for other segmentation methods.

However, there are also some limitations:

• Computational cost: Graph partitioning algorithms can be computationally expensive, especially for
large images.

• Sensitivity to noise: The quality of segmentation can be affected by noise in the image, as it can
distort the similarity measures between pixels.
• Parameter tuning: Choosing the right similarity measures and graph partitioning algorithms often
requires careful tuning depending on the specific image and task.

Overall, graph-based segmentation is a valuable tool for image segmentation, offering flexibility, spatial
awareness, and the ability to handle complex structures. However, understanding its limitations and
carefully considering parameter choices are crucial for achieving optimal results.
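
As an illustration, scikit-image ships Felzenszwalb and Huttenlocher's graph-based segmentation, which greedily merges pixels over the affinity graph; the parameter values below are placeholders.

from skimage import data
from skimage.segmentation import felzenszwalb

img = data.astronaut()     # sample RGB image shipped with scikit-image

# scale biases the result towards larger segments, sigma pre-smooths the image,
# and min_size merges away components smaller than this many pixels
segments = felzenszwalb(img, scale=100, sigma=0.8, min_size=50)
print(segments.max() + 1, "segments")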

Graph Cuts and Energy-Based Methods in Segmentation

Both graph cuts and energy-based methods are closely related approaches for image segmentation,
although they have some key differences:

Graph cuts:

• Approach: Formulate the segmentation problem as a min-cut problem on a graph. Each pixel is a
node, and edges connect neighboring pixels with weights representing their similarity. The goal is to
find a cut in the graph that minimizes the total weight of edges between the two resulting segments.
• Advantages: Efficiently solvable using algorithms like max-flow/min-cut, globally optimal
solution, handles complex boundaries well.
• Disadvantages: Limited to binary segmentation (two segments), may be sensitive to noise in edge
weights.

Energy-based methods:

• Approach: Define an energy function that measures the quality of a segmentation. This function
typically includes terms for similarity between pixels within regions, dissimilarity between pixels in
different regions, and potentially other constraints. The goal is to find the segmentation that
minimizes the energy function.
• Advantages: More flexible than graph cuts, can handle multiple segments, can incorporate prior
knowledge or specific constraints.
• Disadvantages: Finding the minimum energy can be computationally expensive, may not always
guarantee a globally optimal solution.

Relationship and Connection:

• Graph cuts can be seen as a special case of energy-based methods, where the energy function is
defined based on edge weights in a graph.
• Both approaches aim to find a segmentation that optimizes a certain objective function, but they
differ in how they define and solve the optimization problem.

Here are some additional points to consider:

• Other energy-based methods: Several energy-based methods exist beyond graph cuts, such as
Markov Random Fields (MRFs) and active contours.
• Hybrid approaches: Combining graph cuts with other energy-based methods can leverage the
strengths of both approaches.
• Application specific: The best approach for a specific segmentation task depends on the image
characteristics, desired properties of the segmentation, and computational constraints.
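
A toy sketch of the min-cut formulation using NetworkX: pixels become nodes, similar neighbors get high-capacity edges, and the minimum s-t cut separates foreground from background. The tiny four-pixel graph and its capacities are invented purely for illustration.

import networkx as nx

G = nx.DiGraph()

def add_nlink(u, v, w):
    # neighbor links are symmetric: similar-looking pixels get high capacity
    G.add_edge(u, v, capacity=w)
    G.add_edge(v, u, capacity=w)

# two "foreground" pixels (a, b) and two "background" pixels (c, d)
add_nlink('a', 'b', 10.0)
add_nlink('c', 'd', 10.0)
add_nlink('b', 'c', 1.0)        # weak affinity between the two groups

# terminal links encode how well each pixel fits the foreground/background models
for p in ('a', 'b'):
    G.add_edge('s', p, capacity=8.0)   # source = foreground terminal
for p in ('c', 'd'):
    G.add_edge(p, 't', capacity=8.0)   # sink = background terminal

cut_value, (fg, bg) = nx.minimum_cut(G, 's', 't')
print(cut_value, fg - {'s'}, bg - {'t'})   # the weak b-c link is the one that gets cut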

Mean shift

In computer vision, mean shift is a powerful technique used for various tasks, including:

1. Object tracking: This is its most popular application. Mean shift iteratively shifts a window based on the
local density of features within the image, effectively "locking onto" an object and tracking its movement
despite changes in position, scale, or rotation.

2. Image segmentation: Mean shift can be used to segment images by grouping pixels based on their
similarity in features like color, intensity, or texture. It identifies clusters of similar pixels and converges to
their centers, forming the segments.

3. Feature detection: Mean shift can help locate distinctive features in an image by finding local maxima in
the feature space. This can be useful for tasks like object recognition or image analysis.

4. Video analysis: Mean shift can be applied to analyze video sequences by tracking objects, identifying
events, and understanding motion patterns.

Here are some key aspects of mean shift:

• Non-parametric: It doesn't require any prior assumptions about the data distribution, making it
flexible for various image types.
• Unsupervised: It doesn't require labeled data, making it suitable for tasks where labeled data is
scarce.
• Iterative: It gradually refines its estimates, converging towards the desired result.
• Kernel-based: It uses a kernel function to measure similarity between data points, allowing for
flexible feature representations.

However, there are also some limitations:

• Sensitive to noise: Noisy features can affect the convergence and accuracy of mean shift.
• Computationally expensive: The iterative nature can be computationally demanding for large
images or complex tasks.
• Parameter dependent: Performance depends on choosing appropriate parameters like kernel
bandwidth and stopping criteria.

Overall, mean shift is a valuable tool in computer vision with diverse applications. Understanding its
strengths and limitations is crucial for effectively utilizing it in various tasks.
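
A minimal sketch of mean-shift-based segmentation with OpenCV's pyramid mean shift filtering (the image file name and the spatial/color window radii are placeholder values):

import cv2

img = cv2.imread('noidea.jpg')             # placeholder BGR image

# sp is the spatial window radius, sr the color window radius;
# each pixel converges to the mode of its local color distribution, flattening the regions
shifted = cv2.pyrMeanShiftFiltering(img, sp=21, sr=40)

# connected areas of near-constant color in 'shifted' can then be labeled as segments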


Graph cuts and energy-based methods

A common theme in image segmentation algorithms is the desire to group pixels that have similar
appearance (statistics) and to have the boundaries between pixels in different regions be of short length
and across visible discontinuities. If we restrict the boundary measurements to be between immediate
neighbors and compute region membership statistics by summing over pixels, we can formulate this as a
classic pixel-based energy function using either a variational formulation or as a binary Markov random
field.
The energy corresponding to a segmentation problem can be written as

E(f) = Σ_{i,j} E_R(i, j) + E_B(i, j),

where the region term

E_R(i, j) = C(I(i, j); R(f(i, j)))

is the negative log likelihood that pixel intensity (or color) I(i, j) is consistent with the statistics of region R(f(i, j)), and the boundary term

E_B(i, j) = s_x(i, j) δ(f(i, j) ≠ f(i + 1, j)) + s_y(i, j) δ(f(i, j) ≠ f(i, j + 1))

measures the inconsistency between N4 neighbors modulated by local horizontal and vertical smoothness terms s_x(i, j) and s_y(i, j). Region statistics can be something as simple as the mean gray level or color.
This allows their system (the GrabCut algorithm of Rother, Kolmogorov, and Blake 2004) to operate given minimal user input, such as a single bounding box (Figure 5.24a); the background color model is initialized from a strip of pixels around the box outline. (The foreground color model is initialized from the interior pixels, but quickly converges to a better estimate of the object.) The user can also place additional strokes to refine the segmentation as the solution progresses.
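
A minimal sketch of this bounding-box-initialized segmentation using OpenCV's GrabCut implementation (the image file name and the rectangle coordinates are placeholders):

import cv2
import numpy as np

img = cv2.imread('noidea.jpg')                    # placeholder BGR image
mask = np.zeros(img.shape[:2], np.uint8)

# user input: a single bounding box around the object (x, y, width, height)
rect = (50, 50, 200, 200)

# temporary arrays used internally for the background/foreground color (GMM) models
bgdModel = np.zeros((1, 65), np.float64)
fgdModel = np.zeros((1, 65), np.float64)

# alternate between re-estimating the color models and solving a binary graph cut
cv2.grabCut(img, mask, rect, bgdModel, fgdModel, 5, cv2.GC_INIT_WITH_RECT)

# pixels labeled (probable) foreground form the final segmentation
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype('uint8')
segmented = img * fg[:, :, np.newaxis]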

Application: Medical image segmentation


One of the most promising applications of image segmentation is in the medical imaging domain,
where it can be used to segment anatomical tissues for later quantitative analysis. Figure 5.25 shows
a binary graph cut with directed edges being used to segment the liver tissue (light gray) from its
surrounding bone (white) and muscle (dark gray) tissue. Figure 5.26 shows the segmentation of
bones in a 256 × 256 × 119 computed X-ray tomography (CT) volume. Without the powerful
optimization techniques available in today’s image segmentation algorithms, such processing used to
require much more laborious manual tracing of individual X-ray slices.
The fields of medical image segmentation (McInerney and Terzopoulos 1996) and medical image
registration (Kybic and Unser 2003) (Section 8.3.1) are rich research fields with their own specialized
conferences, such as Medical Image Computing and Computer Assisted Intervention (MICCAI),
and journals, such as Medical Image Analysis and IEEE Transactions on Medical Imaging. These can be
great sources of references and ideas for research in this area.
