Lec 13
These lecture summaries are designed to be a review of the lecture. Though I do my best to include all main topics from the
lecture, the lectures themselves contain more elaborate explanations than these notes.
• Recognize object
• Determine pose of detected/recognized object
• Inspect object
Motivation for these approaches: In machine vision problems, we often manipulate objects in the world, and we want to
know what and where these objects are in the world. In the case of these specific problems, we assume prior knowledge of the
precise edge points of these objects (which, as we discussed in the two previous lectures, we know how to compute!)
• Perimeter
• Centroid (moment order 1)
• Euler number - in this case, the number of blobs minus the number of holes
A few notes about computing these different quantities of interest:
• We have seen from previous lectures that we can compute some of these quantities, such as perimeter, using Green's
Theorem. We can also accomplish this with counting - simply counting pixels based on whether each pixel is a 0 or a 1.
• However, the issue with these approaches is that they require thresholding - i.e. removing points from any further
consideration early in the process, possibly without all information available; essentially, potentially viable points are
discarded too early.
• Shape: As introduced above, shape is often characterized by computing moments of various orders (a small numerical sketch follows this list). Recall the definition of the moments of a 2D shape with brightness E(x, y) over a domain D:

1. 0-order: \iint_D E(x, y)\,dx\,dy \rightarrow Area
2. 1-order: \iint_D E(x, y)\,x\,dx\,dy, \; \iint_D E(x, y)\,y\,dx\,dy \rightarrow Centroid
3. 2-order: \iint_D E(x, y)\,x^2\,dx\,dy, \; \iint_D E(x, y)\,xy\,dx\,dy, \; \iint_D E(x, y)\,y^2\,dx\,dy \rightarrow Dispersion
   ⋮
4. k-order: \iint_D E(x, y)\,x^i y^j\,dx\,dy \text{ with } i + j = k
• Note that these methods are oftentimes applied to processed, not raw images.
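As a concrete illustration of these moment computations, here is a minimal numpy sketch (the function name and the use of pixel sums in place of the double integrals over D are my own; the image E is assumed to be a binary or grayscale mask):

```python
import numpy as np

def shape_moments(E):
    """Area, centroid, and second (central) moments of an image E, with pixel
    sums standing in for the double integrals over the domain D."""
    ys, xs = np.mgrid[0:E.shape[0], 0:E.shape[1]]    # pixel coordinates (row = y, col = x)
    area = E.sum()                                    # 0th-order moment
    x_bar = (E * xs).sum() / area                     # 1st-order moments -> centroid
    y_bar = (E * ys).sum() / area
    # 2nd-order moments taken about the centroid (dispersion)
    a = (E * (xs - x_bar) ** 2).sum()
    b = (E * (xs - x_bar) * (ys - y_bar)).sum()
    c = (E * (ys - y_bar) ** 2).sum()
    return area, (x_bar, y_bar), (a, b, c)

# Tiny example: a 3x4 rectangular blob
E = np.zeros((8, 8))
E[2:5, 3:7] = 1.0
print(shape_moments(E))   # area 12.0, centroid (4.5, 3.0)
```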
Idea: Try all possible positions/configurations of the pose space to create a match between the template and the runtime image
of the object. If we are interested in the squared distance between the displaced template and the image of the object (for
computational and analytic simplicity, let us only consider translation for now), then we have the following optimization
problem:
\min_{\delta_x, \delta_y} \iint_D \left( E_1(x - \delta_x, y - \delta_y) - E_2(x, y) \right)^2 dx\, dy
In addition to framing this optimization mathematically as minimizing the squared distance between the two images, we can
also conceptualize this as maximizing the correlation between the displaced image and the other image:
\max_{\delta_x, \delta_y} \iint_D E_1(x - \delta_x, y - \delta_y)\, E_2(x, y)\, dx\, dy
We can prove mathematically that the two are equivalent. Writing out the first objective as J(δx , δy ) and expanding it:
J(\delta_x, \delta_y) = \iint_D \left( E_1(x - \delta_x, y - \delta_y) - E_2(x, y) \right)^2 dx\, dy

= \iint_D E_1^2(x - \delta_x, y - \delta_y)\, dx\, dy - 2 \iint_D E_1(x - \delta_x, y - \delta_y)\, E_2(x, y)\, dx\, dy + \iint_D E_2^2(x, y)\, dx\, dy

\implies \arg\min_{\delta_x, \delta_y} J(\delta_x, \delta_y) = \arg\max_{\delta_x, \delta_y} \iint_D E_1(x - \delta_x, y - \delta_y)\, E_2(x, y)\, dx\, dy
Since the first and third terms are constant (they do not depend on the match), minimizing J amounts to minimizing the negative
of a scaled correlation, which is equivalent to maximizing the correlation - i.e. the second objective above.
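To make the equivalence concrete, here is a brute-force numpy sketch (the function and array names are illustrative, and discrete sums stand in for the integrals): it evaluates both the squared-difference objective and the correlation objective at every translation of a template.

```python
import numpy as np

def match_template(E2, T):
    """Slide template T (playing the role of E1) over image E2, recording the
    sum of squared differences and the raw correlation at every offset."""
    H, W = E2.shape
    h, w = T.shape
    ssd = np.zeros((H - h + 1, W - w + 1))
    corr = np.zeros_like(ssd)
    for dy in range(H - h + 1):
        for dx in range(W - w + 1):
            patch = E2[dy:dy + h, dx:dx + w]
            ssd[dy, dx] = ((T - patch) ** 2).sum()
            corr[dy, dx] = (T * patch).sum()
    return ssd, corr

rng = np.random.default_rng(0)
E2 = rng.random((32, 32))
T = E2[10:18, 5:13].copy()                          # template cut out of the image itself
ssd, corr = match_template(E2, T)
print(np.unravel_index(ssd.argmin(), ssd.shape))    # (10, 5): zero squared difference
print(np.unravel_index(corr.argmax(), corr.shape))  # typically the same offset; the two agree
                                                    # exactly when the patch energy term is constant
```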
We can also relate this to some of the other gradient-based optimization methods we have seen, using a Taylor series. Suppose
δ_x, δ_y are small. Then the Taylor series expansion of the first objective gives:
\iint_D \left( E_1(x - \delta_x, y - \delta_y) - E_2(x, y) \right)^2 dx\, dy = \iint_D \left( E_1(x, y) - \delta_x \frac{\partial E_1}{\partial x} - \delta_y \frac{\partial E_1}{\partial y} + \cdots - E_2(x, y) \right)^2 dx\, dy
If we now consider that we are looking between consecutive frames with time period δ_t, then the optimization problem becomes
(after simplifying out E_1(x, y) - E_2(x, y) = -\delta_t \, \partial E / \partial t):

\min_{\delta_x, \delta_y} \iint_D \left( -\delta_x E_x - \delta_y E_y - \delta_t E_t \right)^2 dx\, dy
A few notes about the methods here and the ones above as well:
• Note that the term under the square directly above looks similar to our BCCE constraint from optical flow!
• Gradient-based methods are cheaper to compute but only function well for small deviations δx , δy .
• Correlation methods are advantageous over least-squares methods when we have scaling between the images (e.g. due to
optical setting differences): E1 (x, y) = kE2 (x, y) for some k ∈ R .
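A minimal sketch of the gradient-based route for two consecutive frames, assuming a single global displacement (the finite-difference derivative estimates and the function name are assumptions, not part of the lecture):

```python
import numpy as np

def global_shift(E1, E2, dt=1.0):
    """Least-squares estimate of one global (dx, dy) from
    min sum (dx*Ex + dy*Ey + dt*Et)^2, the linearized objective above."""
    Ex = np.gradient(E1, axis=1)        # brightness gradient in x
    Ey = np.gradient(E1, axis=0)        # brightness gradient in y
    Et = (E2 - E1) / dt                 # temporal derivative between the frames
    # Normal equations of the two-parameter least-squares problem
    A = np.array([[(Ex * Ex).sum(), (Ex * Ey).sum()],
                  [(Ex * Ey).sum(), (Ey * Ey).sum()]])
    b = -dt * np.array([(Ex * Et).sum(), (Ey * Et).sum()])
    dx, dy = np.linalg.solve(A, b)      # fails if the image has no gradient structure
    return dx, dy
```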
Another question that comes up from this: How can we match at different contrast levels? We can do so with normalized
correlation. Below, we discuss each of the elements we account for and the associated mathematical transformations:
1. Offset: We account for this by subtracting the mean from each brightness function:
E_1'(x, y) = E_1(x, y) - \bar{E}_1, \qquad \bar{E}_1 = \frac{\iint_D E_1(x, y)\, dx\, dy}{\iint_D dx\, dy}

E_2'(x, y) = E_2(x, y) - \bar{E}_2, \qquad \bar{E}_2 = \frac{\iint_D E_2(x, y)\, dx\, dy}{\iint_D dx\, dy}
This removes offset from images that could be caused by changes to optical setup.
2. Contrast: We account for this by computing normalized correlation, which in this case is the Pearson correlation coefficient:
\frac{\iint_D E_1'(x - \delta_x, y - \delta_y)\, E_2'(x, y)\, dx\, dy}{\sqrt{\iint_D E_1'^2(x - \delta_x, y - \delta_y)\, dx\, dy}\; \sqrt{\iint_D E_2'^2(x, y)\, dx\, dy}} \in [-1, 1]
where a correlation coefficient of 1 denotes a perfect match, and a correlation coefficient of -1 denotes a perfect inverse
(anti-correlated) match (see the short sketch below).
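As a small sketch of the normalized score for one candidate offset (the function name and patch handling are my own; in practice this is evaluated across the whole search space):

```python
import numpy as np

def normalized_correlation(T, patch):
    """Pearson correlation between a template T and an equally sized image patch.
    Subtracting the means removes offset; dividing by the norms removes contrast/scale."""
    Tp = T - T.mean()
    Pp = patch - patch.mean()
    denom = np.sqrt((Tp ** 2).sum()) * np.sqrt((Pp ** 2).sum())
    return (Tp * Pp).sum() / denom       # lies in [-1, 1]

T = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(normalized_correlation(T, 3.0 * T + 5.0))   # 1.0: offset and contrast changes are ignored
```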
Are there any issues with this approach? If parts of the object (or the whole object) are obscured, this will greatly affect the
correlation computed at those points, even with proper normalization and offsetting.
With these preliminaries set up, we are now ready to move into a case study: a patent for object detection and pose estimation
using probe points and template images.
1.2 Patent 7,016,539: Method for Fast, Robust, Multidimensional Pattern Recognition
This patent aims to extend beyond our current framework: the described methodology can account for more than just
translation, e.g.:
• Rotation
• Scaling
• Shearing
A diagram of the system can be found below.

Figure 1: System diagram.

A few notes about the diagram/aggregate system:
• A match score is computed for each configuration, and later compared with a threshold downstream. This process leads
to the construction of a high-dimensional surface of match scores.
• We can also see in the detailed block diagram from this patent document that we greatly leverage gradient estimation
techniques from the previous patent on fast and accurate edge detection.
• Divide chains of edge points into segments of low curvature separated by corners of high curvature (one of the steps
described in the patent).
• When comparing gradients between runtime and training images, we project probe points onto the other image - we do not
have to look at all points in the image; rather, we compare gradients (direction and magnitude - note that magnitude is
often less viable/robust to use than gradient direction) between the training and runtime images only at the probe points.
• We can also weight our probe points, either automatically or manually. Essentially, this states that some probes are
more important than others when scoring functions are called on object configurations.
• This patent can also be leveraged for machine part inspection, which necessitates high degrees of consistent accuracy.
• An analog of probes in other machine and computer vision tasks is the notion of keypoints, which are used in descriptor-based
feature matching algorithms such as SIFT (Scale-Invariant Feature Transform), SURF (Speeded-Up Robust Features), FAST
(Features from Accelerated Segment Test), and ORB (Oriented FAST and Rotated BRIEF). Many of these approaches rely on
computing gradients at specific "interesting" points, as is done here, and construct features for feature matching using a
Histogram of Oriented Gradients (HoG) [1].
• For running this framework at multiple scales/resolutions, we want to use different probes at different scales.
• For multiscale operation, there is a need for fast low-pass filtering. This can be done with rapid convolutions, which we will
discuss in a later patent in this course.
• Probes “contribute evidence” individually, and are not restricted to being on the pixel grid.
• The accuracy of this approach, similar to the other framework we looked at, is limited by the degree of quantization in the
search space.
Next, let us look into addressing "noise", which can cause random matches to occur. The area under the S(θ) curve captures the
probability of random matches, and we can compensate by estimating this error and subtracting it out of the results. However,
even with this compensation, we are still faced with additional noise in the result.
Instead, we can try to assign scoring weights by taking the dot product between gradient vectors: v̂1 · v̂2 = cos θ. But one
disadvantage of this approach is that we end up quantizing pose space.
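A rough sketch of this style of probe scoring, under several assumptions of mine (translation-only pose, nearest-pixel lookup of runtime gradient directions, and a simple weighted average - none of which is the patent's exact recipe):

```python
import numpy as np

def probe_score(probes, runtime_dir, pose):
    """probes: list of (x, y, trained_direction, weight); runtime_dir: array of
    gradient directions (radians) in the runtime image; pose: (dx, dy) translation.
    Each probe contributes cos(direction difference), i.e. v1_hat . v2_hat."""
    dx, dy = pose
    total, wsum = 0.0, 0.0
    for x, y, theta_train, w in probes:
        xi, yi = int(round(x + dx)), int(round(y + dy))   # map probe into runtime image
        if 0 <= yi < runtime_dir.shape[0] and 0 <= xi < runtime_dir.shape[1]:
            total += w * np.cos(runtime_dir[yi, xi] - theta_train)
            wsum += w
    return total / wsum if wsum > 0 else 0.0
```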
Finally, let us look at how we score the matches between template and runtime image configurations: scoring functions.
Our options are:
• Normalized correlation (above)
• Removal of random matches (this was our “N” factor introduced above)
1.3 References
1. Histogram of Oriented Gradients, https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms