CV Notes PDF
George Kudrayvtsev
(Draft)
This is still a work-in-progress; once this sentence is gone, you can consider it
to be v1.0. There may be errors, typos, or entirely incorrect or misleading information,
though I’ve done my best to ensure there aren’t. I will be further expanding it with topics
from Georgia Tech’s course on computational photography in the coming school year.
0 Preface 9
1 Introduction 11
3 Edge Detection 25
3.1 The Importance of Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Gradient Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Finite Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 The Discrete Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Sobel Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.3 Handling Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
We Have to Go Deeper. . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Dimension Extension Detection . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 From Gradients to Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.1 Canny Edge Operator . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Non-Maximal Suppression . . . . . . . . . . . . . . . . . . . . . . . . 30
Canny Threshold Hysteresis . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.2 2nd Order Gaussian in 2D . . . . . . . . . . . . . . . . . . . . . . . . 30
4 Hough Transform 31
4.1 Line Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.1 Voting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.2 Hough Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Hough Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.3 Polar Representation of Lines . . . . . . . . . . . . . . . . . . . . . . 34
4.1.4 Hough Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.5 Handling Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.6 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Finding Circles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.1 Hough Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 Frequency Analysis 43
5.1 Basis Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.1 Limitations and Discretization . . . . . . . . . . . . . . . . . . . . . . 47
5.2.2 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Aliasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3.1 Antialiasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Resizing Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Image Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.4.3 Total Rigid Transformation . . . . . . . . . . . . . . . . . . . . . . . 68
6.4.4 The Duality of Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.5 Intrinsic Camera Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.5.1 Real Intrinsic Parameters . . . . . . . . . . . . . . . . . . . . . . . . 70
6.6 Total Camera Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.7 Calibrating Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.7.1 Method 1: Singular Value Decomposition . . . . . . . . . . . . . . . . 73
6.7.2 Method 2: Inhomogeneous Solution . . . . . . . . . . . . . . . . . . . 74
6.7.3 Advantages and Disadvantages . . . . . . . . . . . . . . . . . . . . . 75
6.7.4 Geometric Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.8 Using the Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.8.1 Where’s Waldo the Camera? . . . . . . . . . . . . . . . . . . . . . . . 77
6.9 Calibrating Cameras: Redux . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7 Multiple Views 79
7.1 Image-to-Image Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.2 The Power of Homographies . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.2.1 Creating Panoramas . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.2.2 Homographies and 3D Planes . . . . . . . . . . . . . . . . . . . . . . 83
7.2.3 Image Rectification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Forward Warping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Inverse Warping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.3 Projective Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.3.1 Alternative Interpretations of Lines . . . . . . . . . . . . . . . . . . . 86
7.3.2 Interpreting 2D Lines as 3D Points . . . . . . . . . . . . . . . . . . . 87
7.3.3 Interpreting 2D Points as 3D Lines . . . . . . . . . . . . . . . . . . . 88
7.3.4 Ideal Points and Lines . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.3.5 Duality in 3D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.4 Applying Projective Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.4.1 Essential Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.4.2 Fundamental Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Properties of the Fundamental Matrix . . . . . . . . . . . . . . . . . 94
Computing the Fundamental Matrix From Correspondences . . . . . 95
Fundamental Matrix Applications . . . . . . . . . . . . . . . . . . . . 96
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8 Feature Recognition 98
8.1 Finding Interest Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.1.1 Harris Corners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Properties of the 2nd Moment Matrix . . . . . . . . . . . . . . . . . . 103
8.1.2 Harris Detector Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 104
8.1.3 Improving the Harris Detector . . . . . . . . . . . . . . . . . . . . . . 104
SIFT Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Harris-Laplace Detector . . . . . . . . . . . . . . . . . . . . . . . . . 106
Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.2 Matching Interest Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.2.1 SIFT Descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Orientation Assignment . . . . . . . . . . . . . . . . . . . . . . . . . 108
Keypoint Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Evaluating the Results . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.2.2 Matching Feature Points . . . . . . . . . . . . . . . . . . . . . . . . . 109
Nearest Neighbor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Wavelet-Based Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Locality-Sensitive Hashing . . . . . . . . . . . . . . . . . . . . . . . . 109
8.2.3 Feature Points for Object Recognition . . . . . . . . . . . . . . . . . 110
8.3 Coming Full Circle: Feature-Based Alignment . . . . . . . . . . . . . . . . . 110
8.3.1 Outlier Rejection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Nearest Neighbor Error . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8.3.2 Error Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
8.3.3 RANSAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Benefits and Downsides . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
9 Photometry 118
9.1 BRDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
9.1.1 Diffuse Reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.1.2 Specular Reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.1.3 Phong Reflection Model . . . . . . . . . . . . . . . . . . . . . . . . . 121
9.2 Recovering Light . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9.2.1 Retinex Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
N -dimensional Kalman Filter . . . . . . . . . . . . . . . . . . . . . . 143
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
10.3.3 Particle Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Bayes Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 147
10.3.4 Real Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Tracking Contours . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Other Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
A Very Simple Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
10.3.5 Mean-Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Similarity Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Kernel Choices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
10.3.6 Tracking Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
10.3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
11 Recognition 158
11.1 Generative Supervised Classification . . . . . . . . . . . . . . . . . . . . . . . 162
11.2 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 164
11.2.1 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . 168
11.2.2 Face Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
11.2.3 Eigenfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
11.2.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
11.3 Incremental Visual Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
11.3.1 Forming Our Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Dynamics Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Observation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Incremental Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
11.3.2 All Together Now . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Handling Occlusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
11.4 Discriminative Supervised Classification . . . . . . . . . . . . . . . . . . . . 178
11.4.1 Discriminative Classifier Architecture . . . . . . . . . . . . . . . . . . 179
Building a Representation . . . . . . . . . . . . . . . . . . . . . . . . 179
Train a Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Generating and Scoring Candidates . . . . . . . . . . . . . . . . . . . 180
11.4.2 Nearest Neighbor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
11.4.3 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Viola-Jones Face Detector . . . . . . . . . . . . . . . . . . . . . . . . 181
Advantages and Disadvantages . . . . . . . . . . . . . . . . . . . . . 184
11.5 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
11.5.1 Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
11.5.2 Support Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
11.5.3 Extending SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Mapping to Higher-Dimensional Space . . . . . . . . . . . . . . . . . 189
Multi-category Classification . . . . . . . . . . . . . . . . . . . . . . . 192
11.5.4 SVMs for Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Using SVMs for Gender Classification . . . . . . . . . . . . . . . . . . 193
11.6 Visual Bags of Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
List of Algorithms
Preface
I read that Teddy Roosevelt once said, “Do what you can with what you have
where you are.” Of course, I doubt he was in the tub when he said that.
— Bill Watterson, Calvin and Hobbes
Notation
Before we begin to dive into all things computer vision, here are a few things I do in this
notebook to elaborate on concepts:
• An item that is highlighted like this is a “term;” this is some vocabulary or identifying
word/phrase that will be used and repeated regularly in subsequent sections. I try to
cross-reference these any time they come up again to link back to their first defined usage;
most mentions are available in the Index.
• An item that is highlighted like this is a “mathematical property;” such properties
are often used in subsequent sections and their understanding is assumed there.
• The presence of a TODO means that I still need to expand that section or possibly to
mark something that should link to a future (unwritten) section or chapter.
• An item in a maroon box, like this one, dives deeper into a concept that's mentioned in lecture. This could be proofs, examples, or just a more thorough explanation of something that might've been "assumed knowledge" in the text.
Introduction
Every picture tells a story. But sometimes it’s hard to know what story is
actually being told.
— Anastasia Hollings, Beautiful World
The goal of computer vision is to create programs that can interpret and analyse images, providing the program with the meaning behind the image. This may involve concepts such as object recognition as well as action recognition for images in motion (colloquially, "videos").
Computer vision is a hard problem because it involves much more complex analysis relative
to image processing. For example, observe the following set of images:
The two checkered squares A and B have the same color intensity, but our brain interprets
them differently without the connecting band due to the shadow. Shadows are actually quite
important to human vision. Our brains rely on shadows to create depth information and
track motion based on shadow movements to resolve ambiguities.
A computer can easily figure out that the intensities of the squares are equal, but it’s much
harder for it to “see” the illusion like we do. Computer vision involves viewing the image
as a whole and gaining a semantic understanding of its content, rather than just processing
things pixel-by-pixel.
Basic Image Manipulation
An image can be treated as a function mapping positions to intensities:

$$I : \mathbb{R}^2 \mapsto \mathbb{R}$$

In practice, an image is only defined over a rectangle with a finite range of intensities, I : [a, b] × [c, d] ↦ [min, max], where min would be some "blackest black" and max would be some "whitest white," and (a, b, c, d) are the ranges for the two spatial dimensions of the image, though when actually performing mathematical operations, such interpretations of values become irrelevant.
We can easily expand this to colored images, with a vector-valued function mapping each
color component:
$$I(x, y) = \begin{bmatrix} r(x, y) \\ g(x, y) \\ b(x, y) \end{bmatrix}$$
Addition Adding two images will result in a blend between the two. As we discussed,
though, intensities have a range [min, max]; thus, adding is often performed as an average
of the two images instead, to not lose intensities when their sum exceeds the maximum:
$$I_{\text{added}} = \frac{I_a}{2} + \frac{I_b}{2}$$
Subtraction In contrast, subtracting two images will give the difference between the two.
A larger intensity indicates more similarity between the two source images at that pixel.
Note that order of operations matters, though the results are inverses of each other:
Ia − Ib = −(Ib − Ia )
Often, we simply care about the absolute difference between the images. Because we are
often operating in a discrete space that will truncate negative values (for example, when
operating on images represented as uint8), we can use a special formulation to get this
difference:
$$I_{\text{diff}} = (I_a - I_b) + (I_b - I_a)$$
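As a concrete illustration (not from the lecture), here is a minimal NumPy sketch of both operations for 8-bit images; the absolute difference is built from the two clipped one-sided subtractions, exactly as in the formula above.

```python
import numpy as np

def blend(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Average two uint8 images so the sum never exceeds the maximum intensity."""
    return (a // 2 + b // 2).astype(np.uint8)

def abs_diff(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """|a - b| as the sum of two one-sided differences: each subtraction clips
    negative values to 0, so adding them recovers the absolute difference."""
    d1 = np.clip(a.astype(np.int16) - b, 0, 255)  # keeps pixels where a > b
    d2 = np.clip(b.astype(np.int16) - a, 0, 255)  # keeps pixels where b > a
    return (d1 + d2).astype(np.uint8)
```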
Noise
A common function that is added to a single image is a
noise function. One of these is called the Gaussian
noise function: it adds a variation in intensity drawn
from a Gaussian normal distribution. We basically add a
random intensity value to every pixel on an image. See
Figure 2.2 for an example of Gaussian noise added to a
classic example image1 used in computer vision.
Figure 2.1: A Gaussian (normal) distribution.
Tweaking Sigma On a normal distribution, the mean
is 0. If we interpret 0 as an intensity, it would have to
be between black (the low end) and white (the high end); thus, the average pixel intensity
added to the image should be gray. When we tweak σ – the standard deviation – this will
¹ . . . of a model named Lena, actually, who is posing for the centerfold of an issue of Playboy. . .
affect the amount of noise: a higher σ means a noisier image. Of course, when working with
image manipulation libraries, the choice of σ varies depending on the range of intensities in
the image.
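A small sketch of adding Gaussian noise to an image, assuming 8-bit intensities; the default σ here is an arbitrary placeholder and, as noted above, should be chosen relative to the image's intensity range.

```python
import numpy as np

def add_gaussian_noise(img: np.ndarray, sigma: float = 10.0) -> np.ndarray:
    """Add zero-mean (gray on average) Gaussian noise; larger sigma = noisier image."""
    noise = np.random.normal(loc=0.0, scale=sigma, size=img.shape)
    noisy = img.astype(np.float64) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)  # assuming intensities in [0, 255]
```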
Averages in 2D
Extending a moving average to 2 dimensions is relatively straightforward. You take the
average of a range of values in both directions. For example, in a 100×100 image, you may
want to take an average over a moving 5×5 square. So, disregarding edge values, the value
of the pixel at (2, 2) would be:

$$\frac{1}{25} \sum_{u=-2}^{2} \sum_{v=-2}^{2} F[2+u, 2+v]$$
In other words, with our square extending k = 2 pixels in both directions, we can derive the
formula for correlation filtering with uniform weights:
$$G[i, j] = \frac{1}{(2k+1)^2} \sum_{u=-k}^{k} \sum_{v=-k}^{k} F[i+u, j+v] \tag{2.1}$$
Of course, we decided that non-uniform weights were preferred. This results in a slightly
different equation for correlation filtering with non-uniform weights, where H[i, j] is
the weight function.
$$G[i, j] = \sum_{u=-k}^{k} \sum_{v=-k}^{k} H[u, v]\, F[i+u, j+v] \tag{2.2}$$
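A direct, unoptimized NumPy translation of (2.2) might look like the sketch below; the zero-padding at the borders is my own assumption, since the text simply disregards edge values.

```python
import numpy as np

def correlate(F: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Correlation filtering per (2.2): G[i,j] = sum_{u,v} H[u,v] * F[i+u, j+v],
    where H is a (2k+1) x (2k+1) kernel. Borders are zero-padded."""
    k = H.shape[0] // 2
    Fp = np.pad(F.astype(np.float64), k)  # zero-pad so the window fits everywhere
    G = np.zeros(F.shape, dtype=np.float64)
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            window = Fp[i:i + 2 * k + 1, j:j + 2 * k + 1]  # neighborhood around (i, j)
            G[i, j] = np.sum(H * window)
    return G

# The uniform-weight case (2.1) is just a "box filter":
box = np.ones((5, 5)) / 25.0
```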
Results Such a filter, weighted or not, actually performs terribly if our goal is to remove
some added Gaussian noise. It does remove noise, but it smooths away everything, not just the noise that was added to an image that was originally sharp. In other words, we
lose fidelity. The result is a (poorly) blurred image. Well, despite not achieving our original
goal, we’ve stumbled upon something else: blurring images.
So what went wrong when trying to smooth out the image? Well, a “box filter” like the ones
in (2.1) and (2.2) is not smooth (in the mathematical, not social, sense). We'll define and
explore this term later, but for now, suffice to say that a proper blurring (or “smoothing”)
function should be, well, smooth.
To get a sense of what’s wrong, suppose you’re looking at a single point of light very far
away, and then you made your camera out of focus. What would such an image look like?
Probably something like Figure 2.3.
Now, what kind of filtering function, applied on the singular point, should we apply to
get such a blurry spot? Well, a function that looked like that blurry spot would probably
work best: higher values in the middle that fall off (or attenuate) to the edges. This is a
Gaussian filter, which is an application of the Gaussian function:
$$h(u, v) = \underbrace{\frac{1}{2\pi\sigma^2}}_{\substack{\text{normalization}\\\text{coefficient}}} e^{-\frac{u^2+v^2}{2\sigma^2}} \tag{2.3}$$
In such a filter, the nearest neighboring pixels have the most influence. This is much like
the weighted moving average presented in (2.2), but with weights that better represent
“nearness.” Such weights are “circularly symmetric,” which mathematically are said to be
isotropic; thus, this is the isotropic Gaussian filter. Note the normalization coefficient: this
value affects the brightness of the blur, not the blurring itself.
Gaussian Parameters
The Gaussian filter is a mathematical operation that does not care about pixels. Its only
parameter is the variance, which represents the “amount of smoothing” that the filter per-
forms. Of course, when dealing with images, we need to apply the filter to a particular range
of pixels; this is called the kernel.
Now, it’s critical to note that modifying the size of the kernel is not the same thing as
modifying the variance. They are related, though. The kernel has to be “big enough” to
fairly represent the variance and let it perform a smoother blurring.
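As a sketch of how σ and the kernel interact, here is one way to sample (2.3) onto a (2k+1)×(2k+1) grid; the k ≈ 3σ rule of thumb is an assumption of mine, not something prescribed by the text.

```python
import numpy as np

def gaussian_kernel(sigma: float, k=None) -> np.ndarray:
    """Isotropic Gaussian kernel sampled from (2.3). The kernel must be 'big
    enough' relative to sigma; k ~ 3*sigma is a common heuristic."""
    if k is None:
        k = int(np.ceil(3 * sigma))
    u, v = np.meshgrid(np.arange(-k, k + 1), np.arange(-k, k + 1))
    h = np.exp(-(u ** 2 + v ** 2) / (2 * sigma ** 2))
    return h / h.sum()  # normalize so filtering doesn't change overall brightness
```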
Linearity An operator H is linear if the following properties hold (where f1 and f2 are
some functions, and a is a constant):
• Additivity: the operator preserves summation, H(f1 + f2 ) = H(f1 ) + H(f2 )
• Multiplicative scaling, or homogeneity of degree 1: H(a · f1 ) = a · H(f1 )
With regards to computer vision, linearity allows us to build up an image one piece at a
time. We have guarantees that the operator operates identically per-pixel (or per-chunk, or
per-frame) as it would on the entire image. In other words, the total is exactly the sum of
its parts, and vice-versa.
Shift Invariance The property of shift invariance states that an operator behaves the
same everywhere. In other words, the output depends on the pattern of the image neigh-
borhood, rather than the position of the neighborhood. An operator must give the same
result on a pixel regardless of where that pixel (and its neighbors) is located to maintain
shift invariance.
2.3.1 Impulses
An impulse function in the discrete world is a very easy function (or signal) to understand:
its value = 1 at a single location. In the continuous world, an impulse is an idealized function
which is very narrow, very tall, and has a unit area (i.e. an area of 1). In the limit, it has
zero width and infinite height; its integral is 1.
2.3.2 Convolution
Let’s revisit the cross-correlation equation from Computing Averages:
$$G[i, j] = \sum_{u=-k}^{k} \sum_{v=-k}^{k} H[u, v]\, F[i+u, j+v] \tag{2.2}$$
and see what happens when we treat it as a system H and apply impulses. We begin with
an impulse signal F (an image), and an arbitrary kernel H:
$$F(x, y) = \begin{bmatrix} 0&0&0&0&0 \\ 0&0&0&0&0 \\ 0&0&1&0&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \end{bmatrix} \qquad H(u, v) = \begin{bmatrix} a&b&c \\ d&e&f \\ g&h&i \end{bmatrix}$$
What is the result of filtering the impulse signal with the kernel? In other words, what is
G(x, y) = F (x, y) ⊗ H(u, v)? As we can see in Figure 2.4, the resulting image is a flipped
version (in both directions) of the filter H.
Figure 2.4: (a) The result of applying the filter H (in red) on F at (1, 1) (in purple). (b) The result of subsequently applying the filter H on F at (2, 1) (in purple); the kernel covers a new area in red. Carrying this process across the whole image yields:

$$G(x, y) = \begin{bmatrix} 0&0&0&0&0 \\ 0&i&h&g&0 \\ 0&f&e&d&0 \\ 0&c&b&a&0 \\ 0&0&0&0&0 \end{bmatrix}$$
We introduce the concept of a convolution operator to account for this “flipping.” The
cross-convolution filter, or G = H ~ F , is defined as such:
$$G[i, j] = \sum_{u=-k}^{k} \sum_{v=-k}^{k} H[u, v]\, F[i-u, j-v]$$
This filter flips both dimensions. Convolution filters must be shift invariant.
What is the difference between applying the Gaussian filter as a convolution vs. a
correlation?
Answer: Nothing! Because the Gaussian is an isotropic filter, its symmetry en-
sures that the order of application is irrelevant. Thus, the distinction only matters
for an asymmetric filter.
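Both claims are easy to check numerically; the sketch below (assuming SciPy is available) filters an impulse with an asymmetric kernel to show the flip, and then with a symmetric kernel to show that the distinction disappears.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d  # assuming SciPy is available

H = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])               # an asymmetric kernel
F = np.zeros((5, 5)); F[2, 2] = 1.0        # a unit impulse "image"

print(correlate2d(F, H, mode="same"))      # correlating the impulse yields H flipped
print(convolve2d(F, H, mode="same"))       # convolution flips it back: we recover H

G = np.outer([1, 2, 1], [1, 2, 1]) / 16.0  # a symmetric (Gaussian-like) kernel
print(np.allclose(correlate2d(F, G, mode="same"),
                  convolve2d(F, G, mode="same")))  # True: the flip is irrelevant
```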
Properties
Because convolution (and correlation) is both linear- and shift-invariant, it maintains some
useful properties:
• Commutative: F ~ G = G ~ F
• Associative: (F ~ G) ~ H = F ~ (G ~ H)
• Identity: Given the “unit impulse” e = [. . . , 0, 1, 0, . . .], then f ~ e = f
• Differentiation: ∂/∂x (f ~ g) = (∂f/∂x) ~ g. This property will be useful later, in Handling Noise, when we find gradients for edge detection.
Computational Complexity
If an image is N × N and a filter is W × W , how many multiplications are necessary to
compute their convolution (N ~ W )?
Well, a single application of the filter requires W² multiplications, and the filter must be applied for every pixel, so N² times. Thus, it requires N²W² multiplications, which can
grow to be fairly large.
Separability There is room for optimization here for certain filters. If the filter is sepa-
rable, meaning you can get the kernel H by convolving a single column vector by a single
row vector, as in the example:
$$H = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} \circledast \begin{bmatrix} 1 & 2 & 1 \end{bmatrix}$$
Then we can use the associative property to remove a lot of multiplications. The result, G,
can be simplified:
G=H ~F
= (C ~ R) ~ F
= C ~ (R ~ F )
So we perform two convolutions, but on smaller matrices: each one is WN². This is useful if W is large enough such that 2WN² ≪ W²N². This optimization used to be very valuable, but still can provide a significant benefit: if W = 31, for example, it is faster by a factor of ~15! That's still an order of magnitude.
where the 90 is clearly an instance of “salt” noise sprinkled into the image. Finding
the median:
10 15 20 23 27 30 31 33 90
results in replacing the center point with intensity = 27, which is much better than the
result of a weighted box filter (as in (2.2)), which could have resulted in an intensity of 61.²
An interesting benefit of this filter is that any new pixel value was already present
locally, which means new pixels never have any “weird” values.
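The text doesn't give an implementation, but a brute-force median filter sketch is short; note how every output intensity is drawn from the input neighborhood, which is exactly the "no weird values" property.

```python
import numpy as np

def median_filter(img: np.ndarray, k: int = 1) -> np.ndarray:
    """(2k+1) x (2k+1) median filter: robust to salt-and-pepper noise because
    an extreme outlier (like the 90 above) never wins -- the median does."""
    out = np.empty_like(img)
    pad = np.pad(img, k, mode="edge")  # repeat border pixels at the edges
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.median(pad[i:i + 2 * k + 1, j:j + 2 * k + 1])
    return out
```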
In Adobe Photoshop and other editing software, the "unsharp mask" tool would
actually sharpen the image. Why?
In the days of actual film, when photos had negatives and were developed in dark
rooms, someone came up with a clever technique. If light were shone on a negative
that was covered by wax paper, the result was a negative of the negative that was
blurrier than its original. If you then developed this negative of a negative layered
on top of the original negative, you would get a sharper version of the resulting
image!
This is a chemical replication of the exact same filtering mechanism as the one we
described in the sharpening filter, above! We had our original (the negative) and
were subtracting (because it was the negative of the negative) a blurrier version of
the negative. Hence, again, the result was a sharper developed image.
² . . . if choosing 1/9 for non-center pixels and 4/9 for the center.
This blurrier double-negative was called the “unsharp mask,” hence the historical
name for the editing tool.
Filter Normalization Recall how the filters we are working with are linear operators.
Well, if we correlated an image (again, see (2.2)), and we multiplied that correlation filter
by some constant, then the resulting image would be scaled by that same constant. This
makes it tricky to compare filters: if we were to compare image 1 and filter 1 against image
2 and filter 2, to see how much the images “respond” to the filters, we would need to make
sure that both filters operate on a similar scale. Otherwise, outputs may differ greatly, but
not reflect an accurate comparison.
This topic is called normalized correlation, and we’ll discuss it further later. To summa-
rize things for now, suffice to say that “normalization” means that the standard deviation
of our filters will be consistent. For example, we may say that all of our filters will ensure
σ = 1. Not only that, but we also need to normalize the image as we move the kernel across
it. Consider two images with the same filter applied, but one image is just a scaled version of
the other. We should get the same result, but only if we ensure that the standard deviation
within the kernel is also consistent (or σ = 1).
Again, we’ll discuss implementing this later, but for now assume that all of our future
correlations are normalized correlations.
Figure 2.7: A signal and a filter which is part of the signal (specifically, from x ∈ [5, 10]).
Where’s Waldo?
Suppose we have an image from a Where’s Waldo? book, and we’ve extracted an image of
Waldo himself, like so:
If we perform a correlation between these two images, using the template of Waldo as our
filter, we will get a correlation map whose maximum tells us where Waldo is!
Figure 2.9: The correlation map between the image and the template
filter with brightness corresponding to similarity to the template.
See that tiny bright spot around the center of the top half of the correlation map in Fig-
ure 2.9? That’s Waldo’s location in the original image (see Figure 2.8).
Applications
What’s template matching useful for? Can we apply it to other problems? What about using
it to match shapes, lines, or faces? We must keep in mind the limitations of this rudimentary
matching technique. Template matching relies on having a near-perfect representation of the
target to use as a filter. Using it to match lines – which vary in size, scale, direction, etc. – is
unreliable. Similarly, faces may be rotated, scaled, or have varying features. There are much
better options for this kind of matching that we’ll discuss later. Something specific, like icons
on a computer or words in a specific font, is a viable application of template matching.
Edge Detection
“How you can sit there, calmly eating muffins when we are in this horrible
trouble, I can’t make out. You seem to me to be perfectly heartless.”
“Well, I can’t eat muffins in an agitated manner. The butter would probably
get on my cuffs. One should always eat muffins quite calmly. It is the only
way to eat them.”
“I say it’s perfectly heartless your eating muffins at all, under the circum-
stances.”
— Oscar Wilde, The Importance of Being Earnest
In this chapter, we will continue to explore the concept of using filters to match certain features we're looking for, like we discussed in Template Matching. Now, though, we're going to be "analysing" generic images that we know nothing about in advance. What are some good features that we could try to find that give us a lot to work with in our interpretation of the image?
Basic Idea: If we can find a neighborhood of an image with strong signs of change, that’s
an indication that an edge may exist there.
This is the right derivative, but it’s not necessarily the right derivative.2 If we look at the
finite difference stepping in the x direction, we will heavily accentuate vertical edges (since
we move across them), whereas if we look at finite differences in the y direction, we will
heavily accentuate horizontal edges.
¹ Steep cliffs? That sounds like a job for slope, doesn't it? Well, derivatives are exactly what's up next.
² Get it, because we take a step to the right for the partial derivative?
What’s up with the 21 s? This is the average between the “left derivative” (which would be
−1 +1 0 and the “right derivative” (which would be 0 −1 +1 ).
Sobel Operator
The Sobel operator is a discrete gradient that preserves the “neighborliness” of an image
that we discussed earlier when talking about the Gaussian blur filter (2.3) in Blurring Images.
It looks like this:3
$$S_x = \frac{1}{8}\begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} \qquad S_y = \frac{1}{8}\begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} \tag{3.2}$$
We say that the application of the Sobel operator is gx for Sx and gy for Sy :
$$\nabla I = \begin{bmatrix} g_x & g_y \end{bmatrix}^T$$
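A small sketch of applying the Sobel operator to get the gradient magnitude and direction per pixel; the use of SciPy's correlate2d and the symmetric border handling are assumptions, not part of the text.

```python
import numpy as np
from scipy.signal import correlate2d  # assuming SciPy is available

Sx = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]]) / 8.0
Sy = np.array([[ 1,  2,  1],
               [ 0,  0,  0],
               [-1, -2, -1]]) / 8.0   # note: array rows grow downward, so gy's sign is
                                      # flipped relative to the bottom-left convention

def sobel_gradient(img: np.ndarray):
    """Apply (3.2) and return the gradient magnitude and direction per pixel."""
    gx = correlate2d(img.astype(float), Sx, mode="same", boundary="symm")
    gy = correlate2d(img.astype(float), Sy, mode="same", boundary="symm")
    return np.hypot(gx, gy), np.arctan2(gy, gx)
```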
The Sobel operator results in an edge image that's not great but also not too bad. There
are other operators that use different constants, such as the Prewitt operator:
$$S_x = \begin{bmatrix} -1 & 0 & +1 \\ -1 & 0 & +1 \\ -1 & 0 & +1 \end{bmatrix} \qquad S_y = \begin{bmatrix} +1 & +1 & +1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix} \tag{3.3}$$
but we won’t go into detail on these. Instead, let’s explore how we can improve our edge
detection results.
³ Note: In S_y, positive y is upward; the origin is assumed to be at the bottom-left.
Of course, we know how to reduce noise: apply a smoothing filter! Then, edges are peaks,
as seen in Figure 3.2. We can also perform a minor optimization here; recall that, thanks to
the differentiation property (discussed in Properties of Convolution):
∂/∂x (h ~ f) = (∂h/∂x) ~ f    (3.5)
Meaning we can skip Step (d) in Figure 3.2 and convolve the derivative of the smoothing
filter to directly get the result.
We Have to Go Deeper. . .
As we saw in Figure 3.2, edges are represented by peaks in the resulting signal. How do we detect peaks? More derivatives! So, consider the 2nd derivative Gaussian operator:

$$\frac{\partial^2}{\partial x^2} h$$

Figure 3.3: Using the 2nd order Gaussian to detect edge peaks.

which we convolve with the signal to detect peaks.
And we do see in Figure 3.3 that we’ve absolutely detected an edge at x = 1000 in the
original signal from Figure 3.2.
When working in 2 dimensions, we need to specify the direction in which we are taking
the derivative. Consider, then, the 2D extension of the derivative of the Gaussian filter we
started with in (3.5):
(I ⊗ g) ⊗ hx = I ⊗ (g ⊗ hx )
Here, g is our Gaussian filter (2.3) and hx is the x-version of our gradient operator, which
could be the Sobel operator (3.2) or one of the others. We prefer the version on the right
because it creates a reusable function that we can apply to any image, and it operates on a
smaller kernel, saving computational power.
Tweaking σ, Revisited Much like in the previous version of Tweaking Sigma, there is an
effect of changing σ in our Gaussian smoothing filter g. Here, smaller values of σ detect finer
features and edges, because less noise is removed and hence the gradient is more volatile.
Similarly, larger values of σ will only leave larger edges detected.
Non-Maximal Suppression
This is a fancy term for what amounts to choosing the brightest (maximal) pixel of an edge
and discarding the rest (suppression). It works by looking in the gradient direction and
keeping the maximal pixel.
$$\nabla^2 h = \frac{\partial^2 h}{\partial x^2} + \frac{\partial^2 h}{\partial y^2} \tag{3.6}$$
Once you apply the Laplacian, the zero-crossings are the edges of the image. This operator
is an alternative to the Canny Edge Operator and is better under certain circumstances.
Hough Transform
We can finally begin to discuss concepts that enable "real vision," as opposed to the previous chapters which belonged more in the category of image processing. We gave a function an input image and similarly received an output image. Now, we'll discuss learning "good stuff" about an image, stuff that tells us things about what an image means. This includes the presence of lines, circles, particular objects or shapes, and more!
4.1.1 Voting
Since it's computationally infeasible to simply try every possible line given a set of edge
pixels, instead, we need to let the data tell us something. We approach this by introducing
voting, which is a general technique where we let the features (in this case, edge pixels)
vote for all of the models with which they are compatible.
Voting is very straightforward: we cycle through the features, and each casts votes on par-
ticular model parameters; then, we look at model parameters with a high number of votes.
The idea behind the validity of voting is that all of the outliers – model parameters that are
only valid for one or two pixels – are varied: we can rely on valid parameters being elected
by the majority. Metaphorically, the “silly votes” for candidates like Donald Duck are evenly
distributed over an assortment of irrelevant choices and can thus be uniformly disregarded.
Hough Space
To achieve this voting mechanism, we need to introduce Hough space – also called pa-
rameter space – which enables a different representation of our desired shapes.
The key is that a line in image space represents a point in Hough space because we use the
parameters of the line (m, b). Similarly, a point in image space represents a line in Hough
space through some simple algebraic manipulation. This is shown in Figure 4.1. Given a
point (x0 , y0 ), we know that all of the lines going through it fit the equation y0 = mx0 + b.
Thus, we can rearrange this to be b = −x0 m + y0 , which is a line in Hough space.
What if we have two points? We can easily determine the line passing through both of those
points in Cartesian space by applying the point-slope formula:
y − y1 = m(x − x1 )
where, as we all know:
$$m = \frac{y_2 - y_1}{x_2 - x_1}$$
That’s simple enough for two points, as there’s a distinct line that passes through them by
definition. But what if we introduce more points, and what if those points can’t be modeled
Figure 4.1: Transforming a line from the Cartesian plane (which we'll call "image space") to a point in Hough space: (a) a line y = m₀x + b₀ on the Cartesian plane, represented in the traditional slope-intercept form; (b) a parameterized representation of the same line as the point (m₀, b₀) in Hough space.
by a perfect line that passes through them? We need a line of best fit, and we can use Hough
space for that.
A point on an image is a line in Hough space, and two points are two lines. The point at
which these lines intersect is the exact (m, b) that we would’ve found above, had we converted
to slope-intercept form. Similarly, a bunch of points produces a bunch of lines in Hough, all
of which intersect in (or near) one place. The closer the points are to a particular line of
best fit, the closer their points of intersection in Hough space. This is shown in Figure 4.2.
Figure 4.2: Finding the line of best fit for a series of points by determining an approximate point of intersection in a discretized Hough space: (a) a series of points in image space; (b) the lines representing the possible parameters for each of those points in Hough space.
If we discretize Hough space into “bins” for voting, we end up with the Hough algorithm.
In the Hough algorithm, each point in image space votes for every bin along its line in Hough
space: the bin with the most votes becomes the line of best fit among the points.
Unfortunately, the (m, b) representation of lines comes with some problems. For example, a
vertical line has m = ∞, which is difficult to represent and correlate using this algorithm.
To avoid this, we’ll introduce Polar Representation of Lines.
(Figure: a line in image space parameterized by its perpendicular distance d from the origin and the angle θ of its normal.)
To get a relationship between the Cartesian and polar spaces, we can derive, via some simple
vector algebra, that:
x cos θ + y sin θ = d (4.1)
This avoids all of the previous problems we had with certain lines being ill-defined. In
this case, a vertical line is represented by θ = 0, which is a perfectly valid input to the
trig functions. Unfortunately, though, now our transformation into Hough space is more
complicated. If we know x and y, then what we have left in terms of d and θ is a sinusoid
like in Figure 4.4.
Note There are multiple ways to represent all possible lines under polar coordinates. Either
d > 0, and so θ ∈ [0, 2π), or d can be both positive or negative, and then θ ∈ [0, π).
Furthermore, when working with images, we consider the origin as being in one of the
corners, which restricts us to a single quadrant, so θ ∈ [0, π/2). Of course, these are just
choices that we make in our representation and don’t really have mathematical trade-offs.
Algorithm 4.1: The basic Hough algorithm for line detection.
Input: An image, I
Result: A detected line.
Initialize an empty voting array: H[d, θ] = 0
foreach edge point, E(x, y) ∈ I do
foreach θ ∈ [0, 180, step) do
d = x cos θ + y sin θ
H[d, θ] += 1
end
end
Find the value(s) of (d, θ) where H[d, θ] is a maximum.
The result is given by d = x cos θ + y sin θ.
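A direct translation of the algorithm above into a NumPy sketch; the bin sizes (1 pixel in d, 1° in θ) are arbitrary choices, and `edges` is assumed to be a boolean edge map (e.g. the output of an edge detector).

```python
import numpy as np

def hough_lines(edges: np.ndarray, d_step: float = 1.0, theta_step_deg: float = 1.0):
    """Vote in the discretized (d, theta) Hough space and return the accumulator
    along with the parameters of the strongest line."""
    rows, cols = edges.shape
    thetas = np.deg2rad(np.arange(0.0, 180.0, theta_step_deg))
    d_max = np.hypot(rows, cols)
    ds = np.arange(-d_max, d_max, d_step)
    acc = np.zeros((len(ds), len(thetas)), dtype=np.int64)

    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):                       # each edge point votes...
        for t_idx, theta in enumerate(thetas):     # ...along its sinusoid in Hough space
            d = x * np.cos(theta) + y * np.sin(theta)
            acc[int((d + d_max) / d_step), t_idx] += 1

    d_idx, t_idx = np.unravel_index(np.argmax(acc), acc.shape)
    return acc, ds[d_idx], thetas[t_idx]           # line: x cos(theta) + y sin(theta) = d
```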
Complexity
The space complexity of the Hough Algorithm is simply kⁿ: we are working in n dimensions
(2 for lines) and each one gets k bins. Naturally, this means working with more complex
objects (like circles as we’ll see later) that increase dimension count in Hough space can get
expensive fast.
The time complexity is linearly proportional to the number of edge points, whereas the
voting itself is constant.
Well, what if we smoothed the image in Hough space? It would blend similar peaks together,
but would reduce the area in which we need to search. Then, we can run the Hough transform
again on that smaller area and use a finer grid to find the best peaks.
What about a lot of noise? As in, what if the input image is just a random assortment of
pixels. We run through the Hough transform expecting to find peaks (and thus lines) when
there are none! So, it’s useful to have some sort of prior information about our expectations.
4.1.6 Extensions
The most common extension to the Hough transform leverages the gradient (recall, if need
be, the Gradient Operator). The following algorithm is nearly identical to the basic version
presented in algorithm 4.1, but we eliminate the loop over all possible θs.
Algorithm 4.2: The gradient variant of the Hough algorithm for line detection.
Input: An image, I
Result: A detected line.
Initialize an empty voting array: H[d, θ] = 0
foreach edge point, E(x, y) ∈ I do
θ = ∇I (x,y) /* or some range influenced by ∇I */
d = x cos θ + y sin θ
H[d, θ] += 1
end
Find the value(s) of (d, θ) where H[d, θ] is a maximum.
The result is given by d = x cos θ + y sin θ.
The gradient hints at the direction of the line: it can be used as a starting point to reduce
the range of θ from its old range of [0, 180).
Another extension gives stronger edges more votes. Recall that the Canny Edge Operator
used two thresholds to determine which edges were stronger. We can leverage this “metadata”
information to let the strongest edges influence the results of the line detection. Of course, this is far less democratic, but gives more reliable results.
Yet another extension leverages the sampling resolution of the voting array. Changing the
discretization of (d, θ) refines the lines that are detected. This could be problematic if there
are two similar lines that fall into the same bin given too coarse of a grid. The extension
does grid redefinition hierarchically: after determining ranges in which peaks exist with a
coarse grid, we can go back to just those regions with a finer grid to pick out lines.
Finally, and most importantly, we can modify the basic Hough algorithm to work on more
complex shapes such as circles, squares, or actually any shape that can be defined by a
template. In fact, that’s the subject of the next sections.
We begin by extending the parametric model that enabled the Hough algorithm to a slightly
more complex shape: circles. A circle can be uniquely defined by its center, (a, b), and a
radius, r. Formally, its equation can be stated as:

$$(x - a)^2 + (y - b)^2 = r^2 \tag{4.2}$$

For simplicity, we will begin by assuming the radius is known. How does voting work, then?
Much like a point on a line in image space was a line in Hough space, a point in image space
on a circle is a circle in Hough space, as in Figure 4.5:
Figure 4.5: (a) An arbitrary circle in image space defined by a series of points; (b) a parameterized representation of the same circle, centered at (a₀, b₀), in Hough space, assuming a known radius.
Thus, each point on the not-yet-defined circle votes for a set of points surrounding that same
location in Hough space at the known radius, as in Figure 4.6:
Figure 4.6: The voting process for a set of points defining a circle in image space: (a) the same circle from Figure 4.5a; (b) the set of points voted on by each corresponding point in Hough space. The overlap of the voting areas in Hough space defines the (a, b) for the circle.
Of course, having a known radius is not a realistic expectation for most real-world scenarios.
You might have a range of viable values, or you may know nothing at all. With an unknown
radius, each point on the circle in image space votes on a set of values in Hough space resem-
bling a cone.1 Again, that’s each point: growing the dimensionality of our parameterization
leads to unsustainable growth of the voting process.
We can overcome this growth problem by taking advantage of the gradient, much like we did in the Extensions for the Hough algorithm to reduce the range of θ. We can visualize the gradient as being a tangent line of the circle: if we knew the radius, each edge point's gradient direction would point us straight at the center, so a single vote per point suffices, keeping down the cost of the voting process despite increasing the dimension.

¹ If we imagine the Hough space as being an a-b-r space with r going upward, and we take a known point (a₀, b₀), we can imagine that if we did know the radius, say r = 7, we'd draw a circle there. But if it was r = 4, we'd draw a circle a little lower (and a little smaller). If we extrapolate over all possible values of r, we get a cone.

This leads us to a basic Hough algorithm for circles:
Input: An image, I
Result: A detected circle.
Initialize an empty voting array: H[a, b, r] = 0
foreach edge point, E(x, y) ∈ I do
foreach possible radius do
foreach possible gradient direction θ do // or an estimate via (3.1)
a = x − r cos θ
b = y + r sin θ
H[a, b, r] += 1
end
end
end
Find the value(s) of (a, b, r) where H[a, b, r] is a maximum.
The result is given by applying the equation of a circle (4.2).
In practice, we want to apply the same tips that we outlined in Extensions and a few others:
use edges with significant gradients, choose a good grid discretization, track which points
make which votes, and consider smoothing your votes by voting for neighboring bins (perhaps
with a different weight).
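A sketch of the gradient-assisted circle variant; since the gradient at an edge pixel may point either toward or away from the center, this version votes on both sides, which is an implementation choice rather than something the text specifies.

```python
import numpy as np

def hough_circles(edges: np.ndarray, grad_dir: np.ndarray, radii):
    """Vote for circle centers (a, b) over candidate radii, walking along the
    gradient direction from each edge pixel instead of trying every angle."""
    rows, cols = edges.shape
    acc = np.zeros((rows, cols, len(radii)), dtype=np.int64)
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        theta = grad_dir[y, x]                     # gradient points through the center
        for r_idx, r in enumerate(radii):
            for sign in (+1, -1):                  # center could be on either side
                a = int(round(x + sign * r * np.cos(theta)))
                b = int(round(y + sign * r * np.sin(theta)))
                if 0 <= a < cols and 0 <= b < rows:
                    acc[b, a, r_idx] += 1
    b, a, r_idx = np.unravel_index(np.argmax(acc), acc.shape)
    return a, b, radii[r_idx]
```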
4.3 Generalization
The advent of the generalized Hough transform and its ability to determine the existence
of any well-defined shape has caused a resurgence in its application in computer vision.
Rather than working with analytic models that have fixed parameters (like circles
with a radius, r), we’ll be working with visual code-words that describe the object’s
features rather than using its basic edge pixels. Previously, we knew how to vote given a
particular pixel because we had solved the equation for the shape. For an arbitrary shape,
we instead determine how to vote by building a Hough table.
Figure 4.9: (a) The training phase for the horizontal line "feature" in the shape, in which the entry for θ = 90° stores all of these displacement vectors. (b) The voting phase, in which each of the three points (so far) has voted for the center points after applying the entire set of displacement vectors.
After all of the boundary points vote for their “line of possible centers,” the strongest point
of intersection among the lines will be the initial reference point, c, as seen in Figure 4.10.
Figure 4.10: A subset of votes cast during the second set of feature
points, after completing the voting for the first set that was started in
Figure 4.9.
This leads to the following generalized Hough transform algorithm, which (critically!) as-
sumes the orientation is known:
Input: I, an image.
Input: T , a Hough table trained on the arbitrary object.
Result: The center point of the recognized object.
Initialize an empty voting array: H[x, y] = 0
foreach edge point, E(x, y) ∈ I do
Compute the gradient direction, θ
foreach v ∈ T [θ] do
H[x + vx , y + vy ] += 1
end
end
The peak in the Hough space is the reference point with the most supported edges.
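A sketch of both phases of this process, under a few assumptions of mine: gradient directions are quantized into 36 bins, and the Hough table is a dictionary of displacement lists.

```python
import numpy as np
from collections import defaultdict

def build_hough_table(boundary_pts, grad_dirs, reference, n_bins=36):
    """Training: for each boundary point, store its displacement to the reference
    point c, indexed by (quantized) gradient direction."""
    table = defaultdict(list)
    for (x, y), theta in zip(boundary_pts, grad_dirs):
        b = int(theta / (2 * np.pi) * n_bins) % n_bins
        table[b].append((reference[0] - x, reference[1] - y))
    return table

def vote_for_reference(edge_pts, grad_dirs, table, shape, n_bins=36):
    """Voting: each edge point casts votes at (point + stored displacement);
    the peak of the accumulator is the recovered reference point."""
    acc = np.zeros(shape, dtype=np.int64)
    for (x, y), theta in zip(edge_pts, grad_dirs):
        b = int(theta / (2 * np.pi) * n_bins) % n_bins
        for dx, dy in table[b]:
            cx, cy = x + dx, y + dy
            if 0 <= cx < shape[1] and 0 <= cy < shape[0]:
                acc[cy, cx] += 1
    return acc
```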
Of course, variations in orientation are just an additional variable. All we have to do is try
all of the possible orientations. Naturally, this is much more expensive since we are adding
another dimension to our Hough space. We can do the same thing for the scale of our
arbitrary shape, and the algorithm is nearly identical to algorithm 4.5, except we vote with scaled versions of the stored displacement vectors.
Input: I, an image.
Input: T , a Hough table trained on the arbitrary object.
Result: The center point of the recognized object.
Initialize an empty voting array: H[x, y, t] = 0
foreach edge point, E(x, y) ∈ I do
foreach possible θ∗ do // (the “master” orientation)
Compute the gradient direction, θ
θ0 = θ − θ∗
foreach v ∈ T [θ0 ] do
H[x + vx , y + vy , θ∗ ] += 1
end
end
end
The peak in the Hough space (which is now (x, y, θ∗ )) is the reference point with the
most supported edges.
Frequency Analysis
The goal of this chapter is to return to image processing and gain an intuition for images from a signal processing perspective. We'll introduce the Fourier transform, then touch on and study a phenomenon called aliasing, in which seemingly-straight lines in an image appear jagged. Furthermore, we'll extend this idea into understanding why, to shrink an image in half, we shouldn't just throw out every other pixel.
Things are boutta get real mathematical up in here. Do some review of fundamental
linear algebra concepts before going further. You’ve been warned.
Consider the two standard unit vectors of the plane: any vector in the xy-plane can be represented as a linear combination of these two vectors.
Formally: Suppose we have some B = {v1 , . . . , vn }, which is a finite subset of a vector
space V over a field F (such as the real or complex number fields, R and C). Then, B is a
basis if it satisfies the following conditions:
• Linear independence, which is the property that
$$\forall a_1, \ldots, a_n \in F: \quad \text{if } a_1 v_1 + \ldots + a_n v_n = 0, \text{ then } a_1 = \ldots = a_n = 0$$
In other words, for any set of constants such that the linear combination of those
constants and the basis vectors is the zero vector, it must be that those constants are all zero.
We can formulate a simple basis set for this vector space by toggling each pixel as on or off:
$$\begin{bmatrix} 0 & 0 & 0 & \cdots & 0 & 1 & 0 & 0 & \cdots & 0 & 0 & 0 \end{bmatrix}^T$$
$$\begin{bmatrix} 0 & 0 & 0 & \cdots & 0 & 0 & 1 & 0 & \cdots & 0 & 0 & 0 \end{bmatrix}^T$$
$$\vdots$$
This is obviously independent and can create any image, but isn’t very useful. . . Instead, we
can view an image as a variation in frequency in the horizontal and vertical directions, and
the basis set would be vectors that tease away fast vs. slow changes in the image:
The basic building block is the sinusoid A sin(ωx + ϕ). Fourier's conclusion was that if we add enough of these together, we can get
any signal f (x). Our sinusoid allows three degrees of freedom:
• A is the amplitude, which is just a scalar for the sinusoid.
• ω is the frequency. This parameter controls the coarseness vs. fine-ness of a signal;
as you increase it, the signal “wiggles” more frequently.
• ϕ is the phase. We won’t be discussing this much because our goal in computer vision
isn’t often to reconstruct images, but rather simply to learn something about them.
Consider, for example, the signal:

$$g(t) = \sin(2\pi f t) + \frac{1}{3}\sin\big(2\pi (3f)\, t\big)$$
We can break it down into its “component sinusoids” like so:
(Figure: g(t) and its two component sinusoids plotted separately.)
If we were to analyse the frequency spectrum of this signal, we would see that there is
some “influence” (which we’ll call power) at the frequency f , and 1/3rd of that influence at
the frequency 3f .
Notice that the signal seems to be approximating a square wave? Well, a square wave can
be written as an infinite sum of odd frequencies:
$$\sum_{\substack{k = 1 \\ k\ \text{odd}}}^{\infty} A\, \frac{1}{k} \sin(2\pi k t)$$
Now that we’ve described this notion of the power of a frequency on a signal, we want to
transform our signals from being functions of time to functions of frequency. This algorithm
is called the Fourier transform.
We want to understand the frequency, ω, of our signal f (x). Let’s reparameterize the signal
by ω instead of x to get some F (ω). For every ω ∈ (−∞, ∞), our F (ω) will both hold the
We will see that R(ω) corresponds to the even part (that is, cosine) and I(ω) will be the
odd part (that is, sine).1 Furthermore, computing the Fourier transform is just computing
a basis set. We’ll see why shortly.
First off, the infinite integral of the product of two sinusoids of differing frequencies is zero,
and the infinite integral of the product of two sinusoids of the same frequency is infinite
(unless they’re perfectly in phase, since sine and cosine cancel out):
Z ∞ 0,
if a 6= b
sin (ax + φ) sin (bx + ϕ) dx = ±∞, if a = b, φ + π2 6= ϕ
−∞
if a = b, φ + π2 = ϕ
0,
With that in mind, suppose f (x) is a simple cosine wave of frequency ω: f (x) = cos (2πωx).
That means that we can craft a function C(u):
$$C(u) = \int_{-\infty}^{\infty} f(x)\cos(2\pi u x)\, dx$$
that will infinitely spike (or, create an impulse. . . remember Impulses?) wherever u = ±ω
and be zero everywhere else.
Don’t we also need to do this for all phases? No! Any phase can be represented as a weighted
sum of cosine and sine, so we just need one of each piece. So, we’ve just created a function
that gives us the frequency spectrum of an input signal f (x). . . or, in other words, the
Fourier transform.
To formalize this, we represent the signal as an infinite weighted sum (or linear combina-
tion) of an infinite number of sinusoids:2
$$F(u) = \int_{-\infty}^{\infty} f(x)\, e^{-i 2\pi u x}\, dx \tag{5.4}$$
We also have the inverse Fourier transform, which turns a spectrum of frequencies into
the original signal in the spatial spectrum:
$$f(x) = \int_{-\infty}^{\infty} F(u)\, e^{i 2\pi u x}\, du \tag{5.5}$$
¹ An even function is symmetric with respect to the y-axis: cos(x) = cos(−x). Similarly, an odd function is symmetric with respect to the origin: −sin(x) = sin(−x).
² Recall that e^{ik} = cos k + i sin k, where i = √−1.
With that in mind, if there is some range in which f is integrable (but not necessarily
(−∞, ∞)), we can just do the Fourier transform in just that range. More formally, if there
is some bound of width T outside of which f is zero, then obviously we could integrate
from [−T/2, T/2]. This notion of a "partial Fourier transform" leads us to the discrete Fourier
transform, which is the only way we can do this kind of stuff in computers. The discrete
Fourier transform looks like this:
$$F(k) = \frac{1}{N} \sum_{x=0}^{N-1} f(x)\, e^{-i\frac{2\pi k x}{N}} \tag{5.6}$$
Imagine applying this to an N -pixel image. We have x as our discrete “pixel iterator,” and
k represents the number of “cycles per period of the signal” (or “cycles per image”) which is
a measurement of how quickly we “wiggle” (changes in intensity) throughout the image.
It's necessarily true that k ∈ [−N/2, N/2], because the highest possible frequency of an image would be a change from 0 to 255 for every pixel. In other words, every other pixel is black, which is a period of 2 and N/2 total cycles in the image.
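A direct (slow) evaluation of (5.6) for a 1D signal, as a sanity check; np.fft.fft computes the same transform much faster, though without the 1/N factor used here.

```python
import numpy as np

def dft_1d(f: np.ndarray) -> np.ndarray:
    """Direct evaluation of (5.6): F(k) = (1/N) sum_x f(x) e^{-i 2 pi k x / N}."""
    N = len(f)
    x = np.arange(N)
    k = x.reshape(-1, 1)
    return (f * np.exp(-1j * 2 * np.pi * k * x / N)).sum(axis=1) / N

signal = np.sin(2 * np.pi * 5 * np.arange(64) / 64)  # 5 "cycles per image"
power = np.abs(dft_1d(signal)) ** 2                  # power spikes at k = 5 (and k = 64 - 5)
```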
We can extend the discrete Fourier transform into 2 dimensions fairly simply. The 2D
Fourier transform is:
$$F(u, v) = \frac{1}{2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\, e^{-i2\pi(ux+vy)}\, dx\, dy \tag{5.7}$$
and the discrete variant is:3
$$F(k_x, k_y) = \frac{1}{N}\sum_{x=0}^{N-1}\sum_{y=0}^{N-1} f(x, y)\, e^{-i\frac{2\pi(k_x x + k_y y)}{N}} \tag{5.8}$$
Now typically when discussing the “presence” of a frequency, we’ll be referring to its power
as in (5.2) rather than the odd or even parts individually.
5.2.2 Convolution
Now we will discuss the properties of the Fourier transform. The first involves convolution: taking the Fourier transform of a convolution of two functions is the same thing as multiplying their individual Fourier transforms. In other words, let g = f ~ h. Then:

$$G(u) = \int_{-\infty}^{\infty} g(x)\, e^{-i 2\pi u x}\, dx = F(u)\, H(u)$$
³ As a tip, the transform works best when the origin of k is in the middle of the image.
This relationship with convolution is just one of the properties of the Fourier transform. Some
of the other properties are noted in Table 5.1. An interesting one is the scaling property:
in the spatial domain, a > 1 will shrink the function, whereas in the frequency domain the transform is stretched by the inverse factor. This is most apparent in the Gaussian filter: a tighter
Gaussian (in other words, a smaller σ) in the spatial domain results in a larger Gaussian in
the frequency domain (in other words, a 1/σ).
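The convolution property is easy to check numerically; the sketch below compares a circular convolution computed directly from its definition against one computed by multiplying DFTs.

```python
import numpy as np

f = np.random.rand(64)
h = np.random.rand(64)

# Convolve in the frequency domain: multiply the transforms, then invert.
g_fft = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(h)))

# Circular convolution computed directly from the definition.
g_direct = np.array([sum(f[m] * h[(n - m) % 64] for m in range(64))
                     for n in range(64)])

print(np.allclose(g_fft, g_direct))  # True: the DFT of a convolution is a product
```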
5.3 Aliasing
With the mathematical understanding of the Fourier transform and its properties under our
belt, we can apply that knowledge to the problem of aliasing.
First, let’s talk about the comb function, also called an impulse train. Mathematically,
it’s formed like so:
$$\sum_{n=-\infty}^{\infty} \delta(x - n x_0)$$
Where δ(x) is the magical “unit impulse function,” formally known as the Kronecker delta
function. The train looks like this:
The Fourier transform of an impulse train is a wider impulse train, behaving much like the
expansion due to the scaling property.
We use impulse trains to sample continuous signals, discretizing them into something under-
standable by a computer. Given some signal like sin(t), we can multiply it with an impulse
train and get some discrete samples that approximate the signal in a discrete way.
Obviously, some information is lost during sampling: our reconstruction might be imperfect
if we don’t have enough information. This (im)precise notion is exactly what the aliasing
phenomenon is, demonstrated by Figure 5.3. The blue signal is the original signal, and the
red dots are the samples we took. When trying to reconstruct the signal, the best we can
do is the dashed signal, which has a lower frequency. Aliasing is the notion that a signal
travels “in disguise” as one with a different frequency.
In images, this same thing occurs often. The aliasing problem can be summarized by the
idea that there are not enough pixels (samples) to accurately render an intended effect. This
begs the question: how can we prevent aliasing?
• An obvious solution comes to mind: take more samples! This is in line with the "megapixel craze" in the photo industry; camera sensors have ever-increasing fidelity and can capture more and more pixels. Unfortunately, this can't go on forever, and there
will always be uncaptured detail simply due to the nature of going from a continuous
to discrete domain.
• Another option would be to get rid of the problematic high-frequency information.
Turns out, even though this gets rid of some of the information in the image, it’s
better than aliasing. These are called low-pass filters.
5.3.1 Antialiasing
We can introduce low-pass filters to get rid of “unsafe” high-frequency information that
we know our sampling algorithm can’t capture, while keeping safe, low frequencies. We
perform this filtering prior to sampling, and then again after reconstruction. Since we know
50
Notes on Computer Vision George Kudrayvtsev
that certain frequencies simply did not exist, any high frequencies that appear in the reconstruction are incorrect and can be safely clipped off.
Let’s formalize this idea. First, we define a comb function that can easily represent an
impulse train as follows (where M is an integer):
comb_M[x] = \sum_{k=-\infty}^{\infty} \delta[x - kM]
This is an impulse train in which every Mth value of x carries a unit impulse. Remember that, due to the scaling property, the Fourier transform of the comb function is \frac{1}{M}\,\text{comb}_{1/M}(u). We can extend this to 2D and define a bed of nails, which is just a comb in two directions; its Fourier transform likewise tightens as the comb spreads out in space:
comb_{M,N}(x, y) = \sum_{k=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} \delta(x - kM,\, y - lN) \;\Longleftrightarrow\; \frac{1}{MN} \sum_{k=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} \delta\!\left(u - \frac{k}{M},\, v - \frac{l}{N}\right)
With this construct in mind, we can multiply a signal by a comb function to get discrete
samples of the signal; the M parameter varies the fidelity of the samples. Now, if we consider
the Fourier transform of the signal after we do our sampling in the spatial domain (which is a convolution with the comb's transform in the frequency domain), we can essentially imagine the transform repeating every 1/M steps. If there is no overlap between these repeats, we don't get any distortion in the frequency domain.
Specifically, if W < \frac{1}{2M}, where W is the highest frequency in the signal, we can recover the original signal from the samples. This is why CDs sample at 44.1 kHz: we can recover everything up to about 22 kHz, which is roughly the upper limit of human hearing. If there is overlap
(which would be the presence of high-frequency content in the signal), it causes aliasing
when recovering the signal from the samples: the high-frequency content is masquerading as
low-frequency content.
We know how to get rid of high frequencies: use a Gaussian filter! By applying a Gaussian,
which now acts as an anti-aliasing filter, we get rid of any overlap. Thus, given a signal
f (x), we do something like:
(f(x) ~ h(x)) \cdot \text{comb}_M(x)
Resizing Images
This anti-aliasing principle is very useful when resizing images. What do you do when you
want to make an image half (or a quarter, or an eighth) of its original size? You could just
throw out every other pixel, which is called image subsampling. This doesn't give very good results: the high-frequency content of the original image aliases because we sample too infrequently.
Instead, we need to use an antialiasing filter as we've discussed. In other words, first filter the image, then do subsampling. The stark difference in quality is shown in Figure 5.4.
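A minimal sketch of the filter-then-subsample idea, assuming SciPy's gaussian_filter as the low-pass filter (the function and parameter names are just for illustration):

import numpy as np
from scipy.ndimage import gaussian_filter

def downsample_by_2(image, antialias=True, sigma=1.0):
    """Halve an image's resolution, optionally low-pass filtering first.

    Without the Gaussian pre-filter, high frequencies alias into the result;
    with it, they are removed before we throw away every other pixel.
    """
    if antialias:
        image = gaussian_filter(image.astype(float), sigma=sigma)
    return image[::2, ::2]

# Example: a period-2 checkerboard aliases badly without the filter.
checker = np.indices((64, 64)).sum(axis=0) % 2 * 255.0
naive = downsample_by_2(checker, antialias=False)   # the pattern aliases away entirely
smooth = downsample_by_2(checker, antialias=True)   # roughly uniform gray, as expected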
Figure 5.4: The result of down-sizing the original image of Van Gogh
(left) using subsampling (right) and filtering followed by subsampling
(center). The down-sized image was then blown back up to its original
size (by zooming) for comparison.
Image Compression
The discoveries of John Robson and Frederick Campbell of the variation in sensitivity to
certain contrasts and frequencies in the human eye4 can be applied to image compression.
The idea is that certain frequencies in images can be represented more coarsely than others.
The JPEG image format uses the discrete cosine transform to form a basis set. This basis is then applied to 8×8 blocks of the image, and each block's frequency components are run through a quantization table that uses a different number of bits to approximate each frequency with lower fidelity.
4
See this page for details; where do you stop differentiating the frequencies?
Cameras and Images
6.1 Cameras
This section will discuss how cameras work, covering pinhole cameras; the math behind
lenses; properties like focus, aperture, and field-of-view; and the math behind all of the
various properties of lenses.
TODO: This
Covering these topics requires a lot of photographs and diagrams (as in, it’s basically
the only way to explain most of these topics), and so I’m putting off doing that
until later.
Figure 6.3: A point (x, y, z) in the world is projected through the center of projection onto the projection plane PP, a distance d away, landing at (x′, y′, −d).
To convert from some homogeneous coordinate (x, y, w), we use (x/w, y/w), and similarly for 3D, we use (x/w, y/w, z/w). It's interesting to note that homogeneous coordinates are invariant under scaling: if you scale the homogeneous coordinate by some a, the coordinate in pixel space will be unaffected because of the division by aw.
Perspective Projection
With the power of homogeneous coordinates, the projection of a point in 3D to a 2D per-
spective projection plane is a simple matrix multiplication in homogeneous coordinates:
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1/f & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} x \\ y \\ z/f \end{bmatrix} \;\Rightarrow\; \left( f\frac{x}{z},\; f\frac{y}{z} \right) \Longrightarrow (u, v) \tag{6.1}
This multiplication is a linear transformation! We can do all of our math under homogeneous
coordinates until we actually need to treat them as a pixel in an image, which is when we
perform our conversion. Also, here f is the focal length, which is the distance from the
center of projection to the projection plane (d in Figure 6.3).
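A tiny NumPy sketch of (6.1), with an arbitrary point and focal length:

import numpy as np

f = 2.0                                    # focal length
P = np.array([4.0, 6.0, 10.0, 1.0])        # homogeneous world point (x, y, z, 1)

proj = np.array([[1, 0, 0,     0],
                 [0, 1, 0,     0],
                 [0, 0, 1 / f, 0]])

h = proj @ P                               # homogeneous image point (x, y, z/f)
u, v = h[0] / h[2], h[1] / h[2]            # divide by "w" to get image coordinates
print(u, v)                                # (f*x/z, f*y/z) = (0.8, 1.2)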
How does scaling the projection matrix change the transformation? It doesn’t! Recall the
invariance property briefly noted previously:
\begin{bmatrix} a & 0 & 0 & 0 \\ 0 & a & 0 & 0 \\ 0 & 0 & a/f & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} ax \\ ay \\ az/f \end{bmatrix} \;\Rightarrow\; \left( f\frac{x}{z},\; f\frac{y}{z} \right)
Notice that the “start of the line,” (x0 , y0 , z0 ) has disappeared entirely! This means that no
matter where a line starts, it will always converge at the vanishing point (f a/c, f b/c). The
restriction that c ≠ 0 means that this property of parallel lines applies to all parallel lines
except those that exist in the xy-plane, that is, parallel to our projection plane!
All parallel lines in the same plane converge at colinear vanishing points, which we know
as the horizon.
Human vision is strongly affected by the notion of parallel lines. See the lecture snippet.
TODO: Add the illusion and explain why it happens.
Orthographic Projection
This variant of projection is often used in computer graphics and in video games that are 2-dimensional. The model essentially "smashes" the real world against the projection plane. In 2D games, the assets (textures, sprites, and other images) are already flat and 2-dimensional, and so don't need any perspective applied to them.
Orthographic projection – also called parallel projection – can actually be thought of as
a special case of perspective projection, with the center of projection (or camera) infinitely
far away. Mathematically, f → ∞, and the effect can be seen in Figure 6.4.
Transforming points onto an orthographic projection just involves dropping the z coordinate.
The projection matrix is simple:
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \;\Rightarrow\; (x, y)
Weak Perspective
This perspective model provides a form of 3D perspective, but the scaling happens between
objects rather than across every point. Weak perspective hinges on an important assump-
tion: the change in z (or depth) within a single object is not significant relative to its distance
from the camera. For example, a person might be about a foot thick, but they are standing
a mile from the camera.
All of the points in an object that is z0 distance away from the projection plane will be
mapped as such:
(x, y, z) \rightarrow \left( \frac{f x}{z_0},\; \frac{f y}{z_0} \right)
This z0 changes from object to object, so each object has its own scale factor. Its projection
matrix is:
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1/s \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} x \\ y \\ 1/s \end{bmatrix} \;\Rightarrow\; (sx, sy)
Figure 6.7: The similar triangles (in red) that we use to relate the points projected onto both cameras.

From those similar triangles, we get:

\frac{B - x_l + x_r}{Z - f} = \frac{B}{Z}
A smaller distance meant a more similar feature patch. It worked pretty well, and we
generated some awesome depth maps that we then used to create stereograms. Since we’re
big smort graduate students now, things are going to get a little more complicated. First,
we need to take a foray into epipolar geometry.
1
Link: CS61c, Project 1.
Figure 6.8: The line of potential points (in blue) in the right image that could map to the projected point in the left image.
We will focus primarily on the similarity soft constraint. For that, we need to expand our
set of assumptions a bit to include that:
• Most scene points are visible from both views, though they may differ slightly based
on things like reflection angles.
• Image regions for two matching pixels are similar in appearance.
Uniqueness Constraint
Let's briefly touch on some of the other soft constraints we described above. The uniqueness
constraint states that there is no more than one match in the right image for every point in
the left image.
No more than one? Yep. It can’t be exactly one because of occlusion: the same scene from
different angles will have certain items occluded from view because of closer objects. Certain
pixels will only be visible from one side at occlusion boundaries.
2
The Euclidean distance applied to a matrix, i.e. the square root of the sum of squared differences, is also called the Frobenius norm.
Ordering Constraint
The ordering constraint specifies that pixels have to be in the same order across both of our
stereo images. This is generally true when looking at solid surfaces, but doesn’t always hold!
This happens in two cases. When looking at (semi-)transparent objects, different viewing
angles give different orderings because we can “see through” the surface. This is a rare case,
but another one is much more common: narrow occluding surfaces. For example, consider
a 3D scene with a skinny tree in the middle. In your left eye, you may see “left grass, tree,
right grass,” but in your right eye, you may instead see “tree, more grass” (as in, both the
“left” and “right” grass are on the right side of the tree).
Instead of imagining such a contrived scene, you can easily do this experiment yourself. Hold
your fingers in front of you, one behind the other. In one eye, the front finger will be on the
left, whereas the back finger will be on the left in your other eye.
Unfortunately, the “state-of-the-art” algorithms these days aren’t very good at managing
violations of the ordering constraint such as the scenarios described here.
Scanlines
For the scanline method, we essentially have a 1D signal from both images and we want
the disparity that results in the best overall correspondence. This is implemented with a
dynamic programming formulation. I won’t go into detail on the general approach here;
instead, I’ll defer to this lecture which describes it much more eloquently. This approach
results in far better depth maps compared to the naïve window matching method, but
still results in streaking artifacts that are the result of the limitation of scanlines. Though
scanlines improved upon treating each pixel independently, they are still limited in that they
treat every scanline independently without taking into account the vertical direction. Enter
2D grid-based matching.
Grid Matching
We can define a “good” stereo correspondence as being one with a high match quality, mean-
ing each pixel finds a good match in the other image, and a good smoothness meaning
adjacent pixels should (usually) move the same amount. The latter property is similar to
We want to minimize this term, and this is basically all we were looking at when doing Dense
Correspondence Search. Now, we also have a smoothness term:
E_{smooth} = \sum_{\text{neighbors } i,j} \rho\left( D(i) - D(j) \right) \tag{6.3}
Notice that we are only looking at the disparity image in the smoothness term. We are
looking at the neighbors of every pixel and determining the size of the “jump” we described
above. We define ρ as a robust norm: it’s small for small amounts of change and it gets
expensive for larger changes, but it shouldn’t get more expensive for even larger changes,
which would likely indicate a valid occlusion.
Both of these terms form the total energy, and for energy minimization, we want the D that results in the smallest possible E:

E = \alpha E_{data}(I_L, I_R, D) + \beta E_{smooth}(D)

where α and β are some weights for each energy term. This can be approximated via graph cuts,3 and gives phenomenal results compared to the "ground truth" relative to all previous methods.4
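A toy sketch of these two energy terms in NumPy; the data term here is plain SSD and the robust norm ρ is a truncated quadratic, both illustrative choices rather than the course's exact formulation:

import numpy as np

def stereo_energy(left, right, D, alpha=1.0, beta=0.1, truncation=4.0):
    """E = alpha * E_data + beta * E_smooth for a candidate disparity map D."""
    h, w = left.shape
    cols = np.arange(w)

    # Data term: how well each left pixel matches the right pixel it maps to.
    matched_cols = np.clip(cols[None, :] - D, 0, w - 1)
    rows = np.repeat(np.arange(h)[:, None], w, axis=1)
    e_data = np.sum((left - right[rows, matched_cols]) ** 2)

    # Smoothness term: truncated quadratic over horizontal/vertical neighbors,
    # so very large disparity jumps (likely occlusions) stop getting more expensive.
    def rho(d):
        return np.minimum(d.astype(float) ** 2, truncation)

    e_smooth = np.sum(rho(D[:, 1:] - D[:, :-1])) + np.sum(rho(D[1:, :] - D[:-1, :]))
    return alpha * e_data + beta * e_smooth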
6.3.5 Conclusion
Though the scanline and 2D grid algorithms for stereo correspondence have come a long
way, there are still many challenges to overcome. Some of these include:
• occlusions
• low-contrast or textureless image regions
3
The graph cut algorithm is a well-known technique in computer science for dividing a graph into two parts so as to minimize a particular value. Using it for the energy minimization problem we've described was pioneered in this paper.
4
You can check out the current state-of-the-art at Middlebury.
Notation Before we dive in, we need to define some notation to indicate which "coordinate frame" our vectors are in. We will say that {}^{A}P is the coordinates of P in frame A. Recall, as
well, that a vector can be expressed as the scaled sum of the unit vectors (i, j, k). In other
words, given some origin {}^{A}O,

{}^{A}P = \begin{bmatrix} {}^{A}x \\ {}^{A}y \\ {}^{A}z \end{bmatrix} \;\Longleftrightarrow\; \overrightarrow{OP} = {}^{A}x \cdot i_A + {}^{A}y \cdot j_A + {}^{A}z \cdot k_A
6.4.1 Translation
With that notation in mind, what if we want to find the location of some vector P, whose position we know in coordinate frame A, within some other coordinate frame B? This is described in Figure 6.9.
This can be expressed as a straightforward translation: it's the sum of P in our frame and the origin of the other coordinate frame, O_B, expressed in our coordinate frame:

{}^{B}P = {}^{A}P + {}^{A}(O_B)
The origin of frame B in within frame A is just a simple offset vector. With that in mind,
we can model this translation as a matrix multiplication using homogeneous coordinates:
{}^{B}P = {}^{A}P + {}^{A}O_B

\begin{bmatrix} {}^{B}P_x \\ {}^{B}P_y \\ {}^{B}P_z \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & {}^{A}O_{x,B} \\ 0 & 1 & 0 & {}^{A}O_{y,B} \\ 0 & 0 & 1 & {}^{A}O_{z,B} \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} {}^{A}P_x \\ {}^{A}P_y \\ {}^{A}P_z \\ 1 \end{bmatrix}
We can greatly simplify this by substituting in vectors for column elements. Specifically, we
can use the 3×3 identity matrix, I3 , and the 3-element zero vector as a row, 0T :
\begin{bmatrix} {}^{B}P \\ 1 \end{bmatrix} = \begin{bmatrix} I_3 & {}^{A}O_B \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} {}^{A}P \\ 1 \end{bmatrix}
6.4.2 Rotation
Now things get even uglier. Suppose we have two coordinate frames that share an origin,
but they are differentiated by a rotation. We can express this succinctly, from frame A to
B, as:
{}^{B}P = {}^{B}_{A}R\; {}^{A}P
where {}^{B}_{A}R expresses points in frame A in the coordinate system of frame B. Now, what does R look like? The complicated version is an expression of the basis vectors of frame A in frame B:

{}^{B}_{A}R = \begin{bmatrix} i_A \cdot i_B & j_A \cdot i_B & k_A \cdot i_B \\ i_A \cdot j_B & j_A \cdot j_B & k_A \cdot j_B \\ i_A \cdot k_B & j_A \cdot k_B & k_A \cdot k_B \end{bmatrix} \tag{6.4}

= \begin{bmatrix} {}^{B}i_A & {}^{B}j_A & {}^{B}k_A \end{bmatrix} \tag{6.5}

= \begin{bmatrix} {}^{A}i_B^T \\ {}^{A}j_B^T \\ {}^{A}k_B^T \end{bmatrix} \tag{6.6}
Each of the components of the point in frame A can be expressed somehow in frame B using
all of B’s basis vectors. We can also imagine that it’s each basis vector in frame B expressed
in frame A. (6.6) is an orthogonal matrix: all of the columns are unit vectors that are
perpendicular.
Of course, there are many ways to combine this rotation in a plane to reach some arbitrary
rotation. The most common one is using Euler angles, which say: first rotate about the
“world” z, then about the new x, then about the new z (this is also called heading, pitch,
and roll). There are other ways to express arbitrary rotations – such as the yaw, pitch, and
roll we described earlier – but regardless of which one you use, you have to be careful. The
order in which you apply the rotations matters and negative angles matter; these can cause
all sorts of complications and incorrect results. The rotation matrices about the three axes
are (we exclude z since it’s above):
R_y(\kappa) = \begin{bmatrix} \cos\kappa & 0 & -\sin\kappa \\ 0 & 1 & 0 \\ \sin\kappa & 0 & \cos\kappa \end{bmatrix} \tag{6.8}

R_x(\varphi) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\varphi & -\sin\varphi \\ 0 & \sin\varphi & \cos\varphi \end{bmatrix} \tag{6.9}
6.4.3 Total Rigid Transformation
First we rotate into the B frame, then we add the origin offset. Using homogeneous coordinates, though, we can do this all in one step:5
\begin{bmatrix} {}^{B}P \\ 1 \end{bmatrix} = \begin{bmatrix} I & {}^{B}O_A \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} {}^{B}_{A}R & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} {}^{A}P \\ 1 \end{bmatrix} = \begin{bmatrix} {}^{B}_{A}R & {}^{B}O_A \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} {}^{A}P \\ 1 \end{bmatrix}
5
Notice that the order of the matrix multiplications is reversed relative to how we described the operations: reading right to left, we rotate first and then translate.
Even more simply, we can just say:

\begin{bmatrix} {}^{B}P \\ 1 \end{bmatrix} = \begin{bmatrix} {}^{B}_{A}R & {}^{B}O_A \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} {}^{A}P \\ 1 \end{bmatrix} = {}^{B}_{A}T \begin{bmatrix} {}^{A}P \\ 1 \end{bmatrix}
Then, to instead get from frame B to A, we just invert the transformation matrix!
\begin{bmatrix} {}^{A}P \\ 1 \end{bmatrix} = {}^{A}_{B}T \begin{bmatrix} {}^{B}P \\ 1 \end{bmatrix} = \left({}^{B}_{A}T\right)^{-1} \begin{bmatrix} {}^{B}P \\ 1 \end{bmatrix}
To put this back into perspective, our homogeneous transformation matrix being invertible means that we can use the same matrix for going from "world space" to "camera space" or the other way around: it just needs to be inverted.
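A quick NumPy sketch of that invertibility (the rotation and offset are made up for illustration):

import numpy as np

def rigid_transform(R, origin_of_A_in_B):
    """Build the homogeneous 4x4 transform that maps frame-A points into frame B."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = origin_of_A_in_B
    return T

# Example: frame A is rotated 90 degrees about z and offset from frame B.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])
T_B_from_A = rigid_transform(R, origin_of_A_in_B=[1.0, 2.0, 3.0])

P_A = np.array([1.0, 0.0, 0.0, 1.0])        # a point known in frame A (homogeneous)
P_B = T_B_from_A @ P_A                      # the same point expressed in frame B

# Going the other way around is just the inverse of the same matrix.
P_A_again = np.linalg.inv(T_B_from_A) @ P_B
print(np.allclose(P_A, P_A_again))          # True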
The 3×4 matrix that makes up the top three rows we've shown here (as in, everything aside from the bottom [0 0 0 1] row) is called the extrinsic parameter matrix. The bottom row is what makes the full matrix invertible, so it won't always be included unless we need that property; sometimes, in projection, we'll just use the 3×4 version.
6.4.5 Conclusion
We’ve derived a way to transform points in the world to points in the camera’s coordinate
space. That’s only half of the battle, though. We also need to convert a point in camera
space into a point on an image. We touched on this when we talked about Perspective
Imaging, but we’re going to need a little more detail than that.
Once we understand these two concepts, we can combine them (as we’ll see, “combine”
means another beautifully-simple matrix multiplication) and be able to see exactly how any
arbitrary point in the world maps onto an image.
6.5 Intrinsic Camera Parameters

Combining all of these potential parameters, we get 5 degrees of freedom in this mess:

u = \alpha\frac{x}{z} - \alpha\cot(\theta)\frac{y}{z} + u_0

v = \frac{\beta}{\sin(\theta)}\frac{y}{z} + v_0
. . . gags .
Can we make this any nicer? Notice the division by z all over the place? We can, again,
leverage homogeneous coordinates. Let’s shove the whole thing into a matrix:
\begin{bmatrix} zu \\ zv \\ z \end{bmatrix} = \begin{bmatrix} \alpha & -\alpha\cot(\theta) & u_0 & 0 \\ 0 & \frac{\beta}{\sin(\theta)} & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
More succinctly, we can isolate the intrinsic parameter matrix that transforms from a
point in camera space to a homogeneous coordinate:
p' = K\, {}^{c}p
We can get rid of the last column of zeroes, as well as use friendlier parameters. We can say
that f is the focal length, as before; s is the skew; a is the aspect ratio; and (cx , cy ) is the
camera offset. This gives us:
K = \begin{bmatrix} f & s & c_x \\ 0 & af & c_y \\ 0 & 0 & 1 \end{bmatrix}
Of course, if we assume a perfect universe, f becomes our only degree of freedom and gives
us our perspective projection equation from (6.1).
p' = K \begin{bmatrix} {}^{C}_{W}R & {}^{C}_{W}t \end{bmatrix} {}^{W}p

p' = M\, {}^{W}p
We can also say that the true pixel coordinates are projectively similar to their homoge-
neous counterparts (here, s is our scaling value):
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \simeq \begin{bmatrix} su \\ sv \\ s \end{bmatrix}
M_{3\times 4} = \underbrace{\begin{bmatrix} f & s & x_c' \\ 0 & af & y_c' \\ 0 & 0 & 1 \end{bmatrix}}_{\text{intrinsics}} \underbrace{\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}}_{\text{projection}} \underbrace{\underbrace{\begin{bmatrix} R_{3\times 3} & 0_{3\times 1} \\ 0_{1\times 3} & 1 \end{bmatrix}}_{\text{rotation}} \underbrace{\begin{bmatrix} I_3 & T_{3\times 1} \\ 0_{1\times 3} & 1 \end{bmatrix}}_{\text{translation}}}_{\text{extrinsics}} \tag{6.11}
This equation gives us 11 degrees of freedom! So. . . what was the point of all of this again?
Well, now we’ll look into how to recover M given some world coordinates and some pixel in
an image. This allows us to reconstruct the entire set of parameters that created that image
in the first place!
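Putting the pieces together, a hedged NumPy sketch that builds M = K [R | t] and projects a world point (all parameter values here are arbitrary):

import numpy as np

# Intrinsics (focal length f, skew s, aspect ratio a, offset), per the K matrix above.
f, s, a = 800.0, 0.0, 1.0
cx, cy = 320.0, 240.0
K = np.array([[f, s,     cx],
              [0, a * f, cy],
              [0, 0,      1]])

# Extrinsics: rotation and translation taking world points into camera space.
R = np.eye(3)                      # assume camera axes aligned with the world
t = np.array([0.0, 0.0, 5.0])      # world origin sits 5 units in front of the camera
extrinsics = np.hstack([R, t[:, None]])   # 3x4 [R | t]

M = K @ extrinsics                 # the full 3x4 camera matrix

Pw = np.array([0.5, -0.25, 5.0, 1.0])      # homogeneous world point
p = M @ Pw
u, v = p[0] / p[2], p[1] / p[2]
print(u, v)                        # pixel coordinates of the projected point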
One of the ways this can be accomplished is through placing a known object into the scene;
this object will have a set of well-describable points on it. If we know the correspondence
between those points in “world space,” we can compute a mapping from those points to
“image space” and hence derive the camera calibration that created that transformation.
Another method that is mathematically similar is called resectioning. Instead of a par-
ticular object, we can also determine a series of known 3D points in the entire scene. This
results in a scene akin to what’s described in Figure 6.11.
After measuring each point in the real world from some arbitrary origin, we can then correlate
those points to an image of the same scene taken with a camera. What does this look like
mathematically? Well, for a single known point i, we end up with a pair of equations.
\begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} \simeq \begin{bmatrix} w u_i \\ w v_i \\ w \end{bmatrix} = \begin{bmatrix} m_{00} & m_{01} & m_{02} & m_{03} \\ m_{10} & m_{11} & m_{12} & m_{13} \\ m_{20} & m_{21} & m_{22} & m_{23} \end{bmatrix} \begin{bmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{bmatrix} \tag{6.12}
u_i = \frac{m_{00} X_i + m_{01} Y_i + m_{02} Z_i + m_{03}}{m_{20} X_i + m_{21} Y_i + m_{22} Z_i + m_{23}} \tag{6.13}

v_i = \frac{m_{10} X_i + m_{11} Y_i + m_{12} Z_i + m_{13}}{m_{20} X_i + m_{21} Y_i + m_{22} Z_i + m_{23}} \tag{6.14}
We can rearrange this, then convert that back to a(n ugly) matrix multiplication of the form Am = 0 (equation (6.17)), where we look for the m that minimizes ‖Am‖ subject to ‖m‖ = 1. To do that, we take the singular value decomposition7 of A:

A = UDV^T
6
Refer to Appendix A for an explanation of this method.
7
I won’t discuss SVD in detail here, though I may add a section to Linear Algebra Primer later; for now,
please refer to your favorite textbook on Linear Algebra for details.
Therefore, we're aiming to minimize ‖UDV^T m‖. Well, there's a trick here! U is a matrix made up of orthonormal unit vectors; multiplying by it doesn't change a vector's magnitude. Thus,

‖UDV^T m‖ = ‖DV^T m‖ and ‖V^T m‖ = ‖m‖ = 1

The latter equivalence is also due to the orthogonality of the matrix. We can conceptually think of V as a rotation matrix, and rotating a vector doesn't change its magnitude.
Now we're trying to minimize ‖DV^T m‖ subject to ‖V^T m‖ = 1, which is a stricter set of constraints than before. This cool transformation lets us do a substitution: let ŷ = V^T m. Thus, we're minimizing ‖Dŷ‖ (subject to ‖ŷ‖ = 1).
Now, remember that D is a diagonal matrix, and, by convention, we can sort the diagonal such that it has decreasing values. Thus, ‖Dŷ‖ is a minimum when ŷ = [0 0 . . . 1]^T; ŷ is putting "all of its weight" in its last element, resulting in the smallest value in D.
Since ŷ = V^T m, and V is orthogonal,8 we know that m = Vŷ. And since ŷ's only meaningful element is a 1 at the end, then m is the last column of V!
Thus, if m̂ = Vŷ, then m̂ selects the eigenvector of AT A with the smallest eigenvalue (we
say m̂ now because eigenvectors are unit vectors). That leap in logic is explained by the
properties of V, which are described in the aside below for those interested.
A^T A = (VD^T U^T)(UDV^T)
= VD^T D V^T          (the transpose of an orthogonal matrix is its inverse, so U^T U = I)

so the columns of V are the eigenvectors of A^T A, with eigenvalues given by the squared singular values on the diagonal of D^T D.
8
A property of an orthogonal matrix is that its transpose is its inverse: VT = V−1 .
Method 2: Inhomogeneous Solution Another approach is to fix the scale directly by setting m_{23} = 1 and moving it to the right-hand side. Contrast this with (6.12) and its subsequent expansions: we would actually have a term that doesn't contain an m_{ij} factor, resulting in an inhomogeneous system of equations unlike before. Then, we can use least squares to approximate the solution.
Why isn't this method as good? Consider if m_{23} ≈ 0, or some value close to zero. Setting that value to 1 is dangerous for numerical stability.9
Advantages
• The approach is very simple to formulate and solve. We create a matrix from a set of
points and use singular value decomposition, which is a 1-line function in most math
libraries.
• We minimize “algebraic error.” This means that we are algebraically solving the system
of equations. We relied on a specific set of tricks (as we saw), such as the constraint
of m̂ as a unit vector to get to a clean, algebraic solution.
Disadvantages
• Because of the algebraic approach, and our restriction of m̂, it doesn’t tell us the camera
parameters directly. Obviously, it’s unlikely that all of the parameters in (6.11) result
in a unit vector.
• It’s an approximate method that only models exactly what can be represented in the
camera calibration matrix (again, see 6.11). If we wanted to include, say, radial distor-
tion – a property we could model, just not within the projective transform equation –
this method wouldn’t be able to pick that up.
• It makes constraints hard to enforce because it assumes everything in m̂ is unknown.
Suppose we definitively knew the focal length and wanted solutions where that stayed
constrained. . . things get much more complicated.
9
As per Wikipedia, a lack of numerical stability may significantly magnify approximation errors, giving us
highly erroneous results.
• The approach minimizes kAm̂k, which isn’t the right error function. It’s algebraically
convenient to work with this “cute trick,” but isn’t precisely what we’re trying to solve
for; after all, we’re working in a geometric world.
Since we control M, what we're really trying to find is the M that minimizes that difference:

\min_M \sum_{i \in X} d(x_i', M x_i)
Figure 6.12: The red points are the "true" projection of the real-world points in X onto the image plane, whereas the blue points are their projection based on the estimated camera parameters M.
This leads us to the “gold standard” algorithm that aims to determine the maximum like-
lihood estimation10 of M described in algorithm 6.1. The minimization process involves
10
We say it's an estimation because we assume our measured correspondences are distorted by some level of Gaussian noise; thus, we maximize the likelihood of our choice by taking the M with the least error.
solving a non-linear equation, so use whatever solver flavor you wish for that.
Input: n known mappings from 3D to 2D, {X_i ↔ x'_i}, i ∈ [1, n], n > 6.
Result: The best possible camera calibration, M.
if normalizing then    // (optional)
    X̃ = UX    // normalization between image and world space
    x̃' = Tx'
end
M₀ ← result of the direct linear transformation minimization
Minimize the geometric error starting from M₀:
    \min_M \sum_{i} d(\tilde{x}'_i, \tilde{M}\tilde{X}_i)
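A rough sketch of that refinement step using SciPy's least-squares solver; M0 is assumed to come from the SVD/DLT method above, and this is illustrative rather than the exact "gold standard" implementation (no normalization, no robust loss):

import numpy as np
from scipy.optimize import least_squares

def project(M, X):
    """Project Nx3 world points through a 3x4 camera matrix, returning Nx2 pixels."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    p = Xh @ M.T
    return p[:, :2] / p[:, 2:3]

def refine_camera(M0, X, x_observed):
    """Minimize geometric (reprojection) error starting from the DLT estimate M0."""
    def residuals(m):
        return (project(m.reshape(3, 4), X) - x_observed).ravel()
    result = least_squares(residuals, M0.ravel())
    return result.x.reshape(3, 4)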
Proof. Suppose we have two points, p and c, and a ray passing through them also passes
through a plane. We can define x as being a point anywhere along that ray:
x = λp + (1 − λ)c
Then, the resulting projection after applying the camera parameters is:

Mx = \lambda Mp + (1 - \lambda) Mc

Now we must imagine the plane this ray passes through. All of the points along \vec{pc} are projected onto the plane at the exact same point; in other words, regardless of λ, every point on the ray will be projected onto the same point.
Thus, Mc = 0, and the camera center must be in the null space.
Simpler. We can actually achieve the same goal by applying a formula. If M is split into two parts as we described above (M = [Q | b], where Q is the left 3×3 block and b is the last column), then the camera center is found simply:11

c = \begin{bmatrix} -Q^{-1} b \\ 1 \end{bmatrix}
Of course, this is just one property that went into creating the original M, but we can hope
that other properties can be derived just as simply.
11
The proof is left as an exercise to the reader.
Multiple Views
There are things known and there are things unknown, and in between are
the doors of perception.
— Aldous Huxley, The Doors of Perception
In this chapter we'll be discussing the idea of n-views: what can we learn when given multiple different images of the same scene? Mostly, we'll discuss n = 2 and create mappings from one image to another. We discussed this topic briefly in chapter 6 when talking about Stereo Geometry; we'll return to that later in this chapter as well, but we'll also be discussing other scenarios in which we create relationships from one image to another.
This transformation preserves all of the original properties: lengths, angles, and orien-
tations all remain unchanged. Importantly, as well, lines remain lines under a trans-
lation transformation.
Euclidean Also called a rigid body transformation, this is the 2D version of the rotation and translation we saw for coordinate frames.
This transformation preserves the same properties as translation aside from orientation:
lengths, angles, and lines are all unchanged.
Similarity Under a similarity transformation, we add scaling to our rigid body transformation:

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} a\cos\theta & -a\sin\theta & t_x \\ a\sin\theta & a\cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
Affine An affine transformation is one we’ve yet to examine but is important in com-
puter vision (and graphics). It allows 6 degrees of freedom (adding shearing to the
aforementioned translation, rotation, and scaling), and it enables us to map any 3
points to any other 3 points while the others follow the transformation.
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} a & b & c \\ d & e & f \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
It preserves a very different set of properties relative to the others we’ve seen: parallel
lines, ratios of areas, and lines (as in, lines stay lines) are preserved.
Combining all of these transformations gives us a general projective transformation,
also called a homography. It allows 8 degrees of freedom:1
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} \simeq \begin{bmatrix} wx' \\ wy' \\ w \end{bmatrix} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
It’s important to be aware of how many points we need to determine the existence of all of
these transformations. Determining a translation only requires a single point correspondence
between two images: there are two unknowns (tx , ty ), and one correspondence gives us this
relationship. Similarly, for a homography, we need (at least) 4 correspondence points to
determine the transformation.
such a projection. The point on the image plane at (x, y, 1) is just the intersection of the
ray with the image plane (in blue); that means any point on the ray projects onto the image
plane at that point.
The ray is just the scaled intersection point, (sx, sy, s), and that aligns with our understanding of how homogeneous coordinates represent this model.
[Figure: a ray from the origin through (x, y, 1) on the projection plane z = 1; any point (sx, sy, s) along the ray projects to the same image point.]
So how can we determine how a particular point in one projective plane maps onto another
projective plane? This is demonstrated in Figure 7.2: we cast a ray through each pixel in
the first projective plane and draw where that ray intersects the other projective plane.
Figure 7.2: We can project a ray from the eye through each pixel in one projection plane and see the corresponding pixel in the other projection plane.
Notice that where this ray hits the world is actually irrelevant, now! We are no longer
working with a 3D reprojection onto a 2D plane; instead, we can think about this as a 2D
image warp (or just a transformation) from one plane to another. This basic principle
is how we create image mosaics (colloquially, panoramas). To reiterate, homographies
allow us to map projected points from one plane to another.
[Figure: two camera views reprojected onto a shared mosaic projection plane.]
That’s all well and good, but what about the math? That’s what we’re really interested in,
right? Well, the transformation of a pixel from one projection plane to another (along a
2
Because we’re reprojecting onto a plane, we’re limited by a 180° field of view; if we want a panorama of
our entire surroundings, we’d need to map our mosaic of images on a different surface (say, a cylinder).
The 180° limitation can be thought of as having a single camera with a really wide lens; obviously, it still
can’t see what’s “behind” it.
How do we solve it? Well, there’s the boring way and the cool way.
The Boring Way: Inhomogeneous Solution We can set this up as a system of linear
equations with a vector h of 8 unknowns: Ah = b. Given at least 4 corresponding points
between the images (the more the better),3 we can solve for h using the least squares method
as described in Appendix A: min ‖Ah − b‖².
The Cool Way: Déjà Vu We’ve actually already seen this sort of problem before: re-
call the trick we used in Method 1: Singular Value Decomposition for solving the camera
calibration matrix! We can apply the same principles here, finding the eigenvector with the
smallest eigenvalue in the singular value decomposition.
More specifically, we create a similar system of equations as in Equation 6.12, except in 2D.
Then, we can rearrange that into an ugly matrix multiplication like in Equation 6.17 and
pull the smallest eigenvector from the SVD.
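If you'd rather let a library do the work, OpenCV wraps both the solve and the warp; a hedged sketch (the correspondence points below are made up):

import numpy as np
import cv2

# Four (or more) corresponding points between the two images; these are placeholders.
src_pts = np.float32([[10, 10], [200, 15], [210, 180], [5, 190]])
dst_pts = np.float32([[0, 0], [256, 0], [256, 256], [0, 256]])

# Solve for the 3x3 homography (internally a least-squares/SVD-style solve).
H, _ = cv2.findHomography(src_pts, dst_pts)

# Apply it: warp the source image onto the destination plane.
src_img = np.zeros((256, 256, 3), dtype=np.uint8)   # stand-in for a real image
warped = cv2.warpPerspective(src_img, H, (256, 256))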
Suppose, though, that all of the points in the world were lying on a plane. A plane is represented by a normal vector and a point on the plane, combining into the equation: d = ax + by + cz. We can, of course, rearrange things to solve for z: z = (d − ax − by)/c, and then plug that into our transformation!
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \simeq \begin{bmatrix} m_{00} & m_{01} & m_{02} & m_{03} \\ m_{10} & m_{11} & m_{12} & m_{13} \\ m_{20} & m_{21} & m_{22} & m_{23} \end{bmatrix} \begin{bmatrix} x \\ y \\ (d - ax - by)/c \\ 1 \end{bmatrix}
3
For now, we assume an existing set of correctly-mapped pixels between the images; these could be mapped
by a human under forced labor, like a graduate student. Later, in the chapter on Feature Recognition,
we’ll see ways of identifying correspondence points automatically.
Of course, this affects the overall camera matrix. The effect of the 3rd column (which is always multiplied by x, y, and a constant) can be spread to the other columns. This gives us:

\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \simeq \begin{bmatrix} m'_{00} & m'_{01} & 0 & m'_{03} \\ m'_{10} & m'_{11} & 0 & m'_{13} \\ m'_{20} & m'_{21} & 0 & m'_{23} \end{bmatrix} \begin{bmatrix} x \\ y \\ (d - ax - by)/c \\ 1 \end{bmatrix}
But now the camera matrix is a 3×3 homography! This demonstrates how homographies
allow us to transform (or warp) between arbitrary planes.
Since we can transform between arbitrary planes using homographies, we can actually apply
that to images to see what they would look like from a different perspective!
Figure 7.4: Mapping a plane (in this case, containing an image) onto
two other planes (which, in this case, are the projection planes from two
different cameras).
This process is called image rectification and enables some really cool applications. We
can do things like rectify slanted views by restoring parallel lines or measure grids that are
warped by perspective like in Figure 7.5.
How do we do this image warping and unwarping process? Suppose we’re given a source
image, f (x, y) and a transformation function T (x, y) that relates each point in the source
image plane to another plane. How do we build the transformed image, g(x0 , y 0 )? There are
two approaches, one of which is incorrect.
Forward Warping
The naïve approach is to just pump every pixel through the transformation function and copy that intensity into the resulting (x', y'). That is, for each pixel (x, y) ∈ f, (x', y') = T(x, y).
Unfortunately, though our images are discretized into pixels, our transformation function
may not be. This means that a particular (x0 , y 0 ) = T (x, y) may not correspond to an
individual pixel! Thus, we’d need to distribute the color from the original pixel around the
pixels near the transformed coordinate. This is known as splatting, and, as it sounds,
doesn’t result in very clean transformed images.
Inverse Warping
This doesn't give great results. A better approach, known as bilinear interpolation, weighs the neighboring intensities based on the distance of the location (x, y) to each of its neighbors. Given a location, we can find its distances from a discrete pixel at (i, j): a = x − i, b = y − j. Then we calculate the final intensity in the destination image at (x', y'):

g(x', y') = (1-a)(1-b)\, f(i, j) + a(1-b)\, f(i+1, j) + (1-a)b\, f(i, j+1) + ab\, f(i+1, j+1)
This calculation is demonstrated visually in Figure 7.6. There are more clever ways of
interpolating, such as bicubic interpolation which uses cubic splines, but bilinear inter-
polation is mathematically simpler and still gives great results.
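A small Python sketch of inverse warping with bilinear interpolation (the function names and the T_inv callback are just for illustration):

import numpy as np

def bilinear_sample(img, x, y):
    """Sample a grayscale image at a continuous location (x, y); x is the column, y the row."""
    i, j = int(np.floor(x)), int(np.floor(y))
    a, b = x - i, y - j
    i1 = min(i + 1, img.shape[1] - 1)
    j1 = min(j + 1, img.shape[0] - 1)
    return ((1 - a) * (1 - b) * img[j, i] + a * (1 - b) * img[j, i1]
            + (1 - a) * b * img[j1, i] + a * b * img[j1, i1])

def inverse_warp(src, T_inv, out_shape):
    """Build the output image by asking, for every destination pixel, where it
    came from in the source (T_inv maps destination coords back to source coords)."""
    out = np.zeros(out_shape)
    h, w = src.shape
    for yp in range(out_shape[0]):
        for xp in range(out_shape[1]):
            x, y = T_inv(xp, yp)
            if 0 <= x < w - 1 and 0 <= y < h - 1:
                out[yp, xp] = bilinear_sample(src, x, y)
    return out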
ax + by + c = 0

Here, a, b, and c are all integers, and a > 0. From this, we can imagine a "compact" representation of the line that only uses its constants: [2  3  −12]. If we want the original equation back, we can dot this vector with [x  y  1]:

\begin{bmatrix} a & b & c \end{bmatrix} \cdot \begin{bmatrix} x & y & 1 \end{bmatrix} = 0 \tag{7.3}
Now recall also an alternative definition of the dot product: the angle between the two
vectors, θ, scaled by their magnitudes:
Look familiar. . . ? That’s m, the constants from the original line! This means that we can
represent any line on our image plane by a point in the world using its constants.
The point defines a normal vector for a plane passing through the origin that creates the
line where it intersects the image plane.
Similarly, from this understanding that a line in 2-space can be represented by a vector in
3-space, we can relate points in 2-space to lines in 3-space. We’ve already shown this when
we said that a point on the projective plane lies on a ray (or line) in world space that passes
through the origin and that point.
Figure 7.9: Two views of the same line from Figure 7.8 plotted on a
projection plane (in blue) at z = 1. The green rays from the origin pass
through the same points in the line, forming the green plane. As you
can see, the intersection of the green plane with the blue plane create
the line in question. Finally, the orange vector is the normal vector
perpendicular to the green plane.
5
Finding this intersection is trivial via subtraction: 4y − 16 = 0 −→ y = 4, x = 0.
Of course, we need to put this vector along the ray into our plane at z = 1, which would be where v_z = 1:

\hat{v} = \frac{v}{\|v\|} = \frac{\begin{bmatrix} 0 & -32 & -8 \end{bmatrix}}{\sqrt{(-32)^2 + (-8)^2}} = \begin{bmatrix} 0 & -4/\sqrt{17} & -1/\sqrt{17} \end{bmatrix}

-\sqrt{17}\,\hat{v} = \begin{bmatrix} 0 & 4 & 1 \end{bmatrix}
Our claim was that this is the intersection of the lines. Let's prove that by solving the system of equations directly. We start with the system in matrix form:

\begin{bmatrix} a & b \\ d & e \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} -c \\ -f \end{bmatrix}

This is the general form Mx = y, and we solve by multiplying both sides by M^{-1}. Thankfully, we can find the inverse of a 2×2 matrix fairly easily:^a

M^{-1} = \frac{1}{ae - bd} \begin{bmatrix} e & -b \\ -d & a \end{bmatrix}

M^{-1}Mx = M^{-1}y

\begin{bmatrix} x \\ y \end{bmatrix} = \frac{1}{ae - bd} \begin{bmatrix} e & -b \\ -d & a \end{bmatrix} \begin{bmatrix} -c \\ -f \end{bmatrix} = \frac{1}{ae - bd} \begin{bmatrix} bf - ec \\ cd - af \end{bmatrix}

(x, y) = \left( \frac{bf - ec}{ae - bd},\; \frac{cd - af}{ae - bd} \right)
7.3.5 Duality in 3D
We can extend this notion of point-line duality into 3D as well. Recall the equation of a
plane:
ax + by + cz + d = 0
where (a, b, c) is the normal of the plane, and d = ax0 + by0 + cz0 for some point on the plane
(x0 , y0 , z0 ). Well, projective points in 3D have 4 coordinates, as we’ve already seen when
we discussed rotation and translation in Extrinsic Camera Parameters: wx wy wz w .
x2 = Rx1 + t
Let’s cross both sides of this equation by t, which would give us a vector normal to the pixel
ray and translation vector (a.k.a. the baseline):
t × x2 = t × Rx1 + t × t
= t × Rx1
Now we dot both sides with x2 , noticing that the left side is 0 because the angle between a
vector and a vector intentionally made perpendicular to that vector will always be zero!
x_2 \cdot (t \times x_2) = x_2 \cdot (t \times Rx_1)
0 = x_2 \cdot (t \times Rx_1)

Writing the cross product as a matrix multiplication,6 t × Rx_1 = [t_\times] R x_1, and letting E = [t_\times] R, this becomes:

x_2^T E x_1 = 0

which is a really compact expression relating the two points on the two image planes. E is called the essential matrix.
Notice that Ex_1 evaluates to some vector, and that vector multiplied by x_2^T (or dotted with x_2) is zero. We established earlier that this relationship is part of the point-line duality in projective geometry, so Ex_1 defines a line in the other camera's image plane: the line on which x_2 must lie. Sound familiar? That's the epipolar line.
We've converted the epipolar constraint into an algebraic expression! Well, what if our cameras are not calibrated? Theoretically, we should be able to determine the epipolar lines if we're given enough points correlated between the two images. This will lead us to the fundamental matrix.
For convenience's sake, we'll assume that there is no skew, s. This makes the intrinsic parameter matrix K invertible:

K = \begin{bmatrix} -f/s_x & 0 & o_x \\ 0 & -f/s_y & o_y \\ 0 & 0 & 1 \end{bmatrix}
6
This introduces a different notation for the cross product, expressing it as a matrix multiplication. This
is explained in further detail in Linear Algebra Primer, in Cross Product as Matrix Multiplication.
Recall, though, that the extrinsic parameter matrix is what maps points from world space
to points in the camera’s coordinate frame (see The Duality of Space), meaning:
pim = Kint pc
Since we said that the intrinsic matrix is invertible, that also means that:
p_c = K_{int}^{-1}\, p_{im}
Which tells us that we can find a ray through the camera and the world (since it’s a homo-
geneous point in 2-space, and recall point-line duality) corresponding to this point. Further-
more, for two cameras, we can say:
p_{c,left} = K_{int,left}^{-1}\, p_{im,left}
p_{c,right} = K_{int,right}^{-1}\, p_{im,right}
Now note that we don’t know the values of Kint for either camera since we’re working in
the uncalibrated case, but we do know that there are some parameters that would calibrate
them. Furthermore, there is a well-defined relationship between the left and right points in
the calibrated case that we defined previously using the essential matrix. Namely,
p_{c,right}^T\, E\, p_{c,left} = 0
and then use the properties of matrix multiplication7 to rearrange this and get the funda-
mental matrix, F:
p_{im,right}^T \underbrace{\left(K_{int,right}^{-1}\right)^T E\, K_{int,left}^{-1}}_{F}\, p_{im,left} = 0
This gives us a beautiful, simple expression relating the image points on the planes from
both cameras:
p_{im,right}^T\, F\, p_{im,left} = 0

Or, even more simply: p^T F p' = 0. This is the fundamental matrix constraint, and given
enough correspondences between p → p0 , we will be able to solve for F. This matrix is very
powerful in describing how the epipolar geometry works.
Recall from Projective Geometry that when we have an l such that pT l = 0, that l describes
a line in the image plane. Well, the epipolar line in the p image associated with p0 is defined
by: l = Fp0 ; similarly, the epipolar line in the prime image is defined by: l0 = FT p. This
means that the fundamental matrix gives us the epipolar constraint between two images
with some correspondence.
What if p0 was on the epipolar line in the prime image for every point p in the original
image? That occurs at the epipole! We can solve for that by setting l = 0, meaning we can
find the two epipoles via:
Fp' = 0
F^T p = 0
Finally, the fundamental matrix is a 3×3 singular matrix. It’s singular because it maps
from homogeneous 2D points to a 1D family (which are points or lines under point-line
duality), meaning it has a rank of 2. We will prove this in this aside and use it shortly.
The power of the fundamental matrix is that it relates the pixel coordinates between two
views. We no longer need to know the intrinsic parameter matrix. With enough correspon-
dence points, we can reconstruct the epipolar geometry with an estimation of the fundamen-
tal matrix, without knowing anything about the true intrinsic or extrinsic parameter matrix.
In Figure 7.10, we can see this in action. The green lines are the estimated epipolar lines in
both images derived from the green correspondence points, and we can see that points along
a line in one image are also exactly along the corresponding epipolar line in the other image.
Multiplying this out, and generalizing it to n correspondences, gives us this massive system:

\begin{bmatrix}
u_1 u_1' & u_1 v_1' & u_1 & v_1 u_1' & v_1 v_1' & v_1 & u_1' & v_1' & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
u_n u_n' & u_n v_n' & u_n & v_n u_n' & v_n v_n' & v_n & u_n' & v_n' & 1
\end{bmatrix}
\begin{bmatrix} f_{11} \\ f_{12} \\ f_{13} \\ f_{21} \\ f_{22} \\ f_{23} \\ f_{31} \\ f_{32} \\ f_{33} \end{bmatrix} = 0
We can solve this via the same two methods we’ve seen before: using the trick with singular
value decomposition, or scaling to make f33 = 1 and using the least squares approximation.
We explained these in full in Method 1: Singular Value Decomposition and Method 2:
Inhomogeneous Solution, respectively.
Unfortunately, due to the fact that our point correspondences are estimations, this actually
doesn’t give amazing results. Why? Because we didn’t pull rank on our matrix F! It’s a
rank-2 matrix, as we demonstrate in this aside, but we didn’t enforce that when solving/ap-
proximating our 3×3 matrix with correspondence points and assumed it was full rank.9
How can we enforce that? Well, first we solve for F as before via one of the two methods we
described. Then, we take the SVD of that result, giving us: F = UDVT .
The diagonal matrix is the singular values of F, and we can enforce having only rank 2 by
setting the last value (which is the smallest value, since we sort the diagonal in decreasing
order by convention) to zero:
D = \begin{bmatrix} r & 0 & 0 \\ 0 & s & 0 \\ 0 & 0 & t \end{bmatrix} \;\Longrightarrow\; \hat{D} = \begin{bmatrix} r & 0 & 0 \\ 0 & s & 0 \\ 0 & 0 & 0 \end{bmatrix}
Then we recreate a better F̂ = UD̂VT , which gives much more accurate results for our
epipolar geometry.
8
You may notice that this looks awfully similar to previous “solve given correspondences” problems we’ve
done, such as in (6.17) when Calibrating Cameras.
9
Get it? Because pulling rank means taking advantage of seniority to enforce some rule or get a task
done, and we didn’t enforce the rank-2 constraint on F? Heh, nice.
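Putting the whole recipe together, a hedged NumPy sketch of estimating F from correspondences and then enforcing rank 2 (point normalization, which a real implementation should do first, is omitted):

import numpy as np

def estimate_fundamental(pts, pts_prime):
    """Estimate F from n >= 8 correspondences p^T F p' = 0, then enforce rank 2.

    pts and pts_prime are (n, 2) arrays of pixel coordinates.
    """
    u, v = pts[:, 0], pts[:, 1]
    up, vp = pts_prime[:, 0], pts_prime[:, 1]

    # Each correspondence contributes one row of the big system A f = 0.
    A = np.column_stack([u * up, u * vp, u, v * up, v * vp, v, up, vp,
                         np.ones(len(pts))])

    # SVD trick: f is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)

    # Enforce rank 2 by zeroing the smallest singular value of F.
    U, D, Vt = np.linalg.svd(F)
    D[-1] = 0.0
    return U @ np.diag(D) @ Vt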
We begin with two planes (the left plane and the right plane) that have some points p and p', respectively, mapping some world point x_π. Suppose we know the exact homography that can make this mapping: p' = H_π p.
Then, let l' be the epipolar line in the right plane corresponding to p. It passes through the epipole, e', as well as the correspondence point, p'.
We know, thanks to point-line duality in projective geometry, that this line l' is the cross product of p' and e', meaning:^a

l' = e' \times p' = e' \times H_\pi p = [e'_\times] H_\pi p

But since l' is the epipolar line for p, we saw that this can be represented by the fundamental matrix: l' = Fp. Meaning:

[e'_\times] H_\pi = F

That means the rank of F is the same as the rank of [e'_\times], which is 2! (just trust me on that part. . . )
a
As we showed in Interpreting 2D Points as 3D Lines, a line passing through two points
can be defined by the cross product of those two points.
7.5 Summary
For 2 views into a scene (sorry, I guess we never did get into n-views), there is a geometric relationship between rays in one view and rays in the other view. This is Epipolar Geometry.
These relationships can be captured algebraically (and hence computed), with the essential
matrix for calibrated cameras and the fundamental matrix for uncalibrated cameras. We
can find these relationships with enough point correspondences.
This is proper computer vision! We’re no longer just processing images. Instead, we’re
getting “good stuff” about the scene: determining the actual 3D geometry from our 2D
images.
Feature Recognition
Since we live in a world of appearances, people are judged by what they seem
to be. If the mind can’t read the predictable features, it reacts with alarm or
aversion. Faces which don’t fit in the picture are socially banned. An ugly
countenance, a hideous outlook can be considered as a crime and criminals
must be inexorably discarded from society.
— Erik Pevernagie, “Ugly mug offense”
Until this chapter, we've been creating relationships between images and the world with the convenient assumption that we have sets of well-defined correspondence points. As noted in footnote 3, these could have been the product of manual labor and painstaking matching, but we can do better. Now, we will discover ways to automatically create correspondence relationships, which will lead the way to a more complete, automated approach to computer vision.
Our goal, to put it generally, is to find points in an image that can be precisely and reliably
found in other images.
We will detect features (specifically, feature points) in both images and match them to-
gether to find the corresponding pairs. Of course, this poses a number of problems:
• How can we detect the same points independently in both images? We need
whatever algorithm we use to be consistent across images, returning the same feature
points. In other words, we need a repeatable detector.
• How do we correlate feature points? We need to quantify the same interesting
feature points a similar way. In other words, we need a reliable and distinctive de-
scriptor. For a rough analogy, our descriptor could “name” a feature point in one
image Eddard, and name the same one Ned in the other, but it definitely shouldn’t
name it Robert in the other.
Feature points are used in many applications. To name a few, they are useful in 3D recon-
struction, motion tracking, object recognition, robot navigation, and much much more. . .
So what makes a good feature?
• Repeatability / Precision: The same feature can be found in several images precisely despite geometric and photometric transformations. We can find the same point with the same precision and metric regardless of where it is in the scene.
• Saliency / Matchability: Each feature should have a distinctive description: when
we find a point in one image, there shouldn’t be a lot of candidate matches in the other
image.
• Compactness / Efficiency: There should be far fewer features than there are pixels
in the image, but there should be “enough.”
• Locality: A feature occupies a relatively small area of the image, and it’s robust to
clutter and occlusion. A neighborhood that is too large may change drastically from a
different view due to occlusion boundaries.
Figure 8.1: A simple image. What sections would make good features?
What areas of the image would make good features? Would the center of the black square be
good? Probably not! There are a lot of areas that would match an all-black area: everywhere
else on the square! What about some region of the left (or any) edge? No again! We can
move along the edge and things look identical.
What about the corners? The top-right corner is incredibly distinct for the image; no other
region is quite like it.
That begs the question: how do we detect corners? We can describe a corner as being an
area with significant change in a variety of directions. In other words, the gradients have
more than one direction.
The window function can be a simple piecewise function that’s 1 within the window and 0
elsewhere, or a Gaussian filter that will weigh pixels near the center of the window appro-
priately.
We can view the error function visually as an image, as well. The error function with no shift
(i.e. E(0, 0)) will be 0, since there is no change in pixels; it would be a black pixel. As we
shifted, the error would increase towards white. Now suppose we had an image that was the
same intensity everywhere (like, for example, the center region of a black square, maybe?).
Its error function would always be zero regardless of shift. This gives us an intuition for the
use of the error function: we want regions that have error in all shift directions.
We are working with small values of (u, v): a large error for a small shift might indicate a
corner-like region. How do we model functions for small changes? We use Taylor expan-
sions. A second-order Taylor expansion of E(u, v) about (0, 0) gives us a local quadratic
approximation for small values of u and v.
Recall, from calculus oh-so-long-ago, the Taylor expansion. We approximate a function
F (x) for some small δ value:
F(\delta x) \approx F(0) + \delta x \cdot \frac{dF(0)}{dx} + \frac{1}{2}\delta x^2 \cdot \frac{d^2 F(0)}{dx^2}
Things get a little uglier in two dimensions; we need to use matrices. To approximate our
error function E(u, v) for small values of u and v, we say:
E(u, v) \approx E(0,0) + \begin{bmatrix} u & v \end{bmatrix} \begin{bmatrix} E_u(0,0) \\ E_v(0,0) \end{bmatrix} + \frac{1}{2} \begin{bmatrix} u & v \end{bmatrix} \begin{bmatrix} E_{uu}(0,0) & E_{uv}(0,0) \\ E_{uv}(0,0) & E_{vv}(0,0) \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix}
where En indicates the gradient of E in the n direction, and similarly Enm indicates the
2nd -order gradient in the n then m direction.
When we actually find, expand, and simplify all of these derivatives (which we do in this aside), we get the following:

E(u, v) \approx \sum_{x,y} w(x, y)\left[ I_x^2 u^2 + 2 I_x I_y u v + I_y^2 v^2 \right]

We can simplify this expression further with a substitution. Let M be the second moment matrix computed from the image derivatives:

M = \sum_{x,y} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}
Let’s examine the properties of this magical moment matrix, M, and see if we can get some
insights about how a “corner-like” area would look.
So what do the derivatives of the error function look like? We’ll get through them
one step at a time. Our error function is:
E(u, v) = \sum_{x,y} w(x, y)\left[ I(x+u, y+v) - I(x, y) \right]^2
We'll start with the first order derivatives E_u and E_v. For this, we just need the "power rule": \frac{d}{dx}(f(x))^n = n f(x)^{n-1} \cdot \frac{d}{dx} f(x). Remember that u is the shift in the x direction, and thus we need the image gradient in the x direction, and similarly for v and y.

E_u(u, v) = \sum_{x,y} 2 w(x, y)\left[ I(x+u, y+v) - I(x, y) \right] I_x(x+u, y+v)

E_v(u, v) = \sum_{x,y} 2 w(x, y)\left[ I(x+u, y+v) - I(x, y) \right] I_y(x+u, y+v)
Now we will take the 2nd derivative with respect to u, giving us Euu (and likewise
for Evv ). Recall, briefly, the “product rule”:
\frac{d}{dx}\left[ f(x)g(x) \right] = f(x)\frac{d}{dx}g(x) + g(x)\frac{d}{dx}f(x)
Now the only thing that remains is the cross-derivative, Euv , which now requires
gradients in both x and y of the image function as well as the cross-derivative in
x-then-y of the image.
E_{uv}(u, v) = \sum_{x,y} 2 w(x, y) I_x(x+u, y+v) I_y(x+u, y+v) + \sum_{x,y} 2 w(x, y)\left[ I(x+u, y+v) - I(x, y) \right] I_{xy}(x+u, y+v)
These are all absolutely disgusting, but, thankfully, we’re about to make a bunch
of the terms disappear entirely since we are evaluating them at (0, 0).
Onward, brave reader.
Plugging in (u = 0, v = 0) into each of these expressions gives us the following set
of newer, simpler expressions:
E_u(0, 0) = E_v(0, 0) = 0

E_{uu}(0, 0) = \sum_{x,y} 2 w(x, y) I_x(x, y)^2

E_{vv}(0, 0) = \sum_{x,y} 2 w(x, y) I_y(x, y)^2

E_{uv}(0, 0) = \sum_{x,y} 2 w(x, y) I_x(x, y) I_y(x, y)
Notice that all we need now is the first-order gradient of the image in each direction,
x and y. What does this mean with regards to the Taylor expansion? It expanded
to:
E(u, v) \approx E(0,0) + \begin{bmatrix} u & v \end{bmatrix} \begin{bmatrix} E_u(0,0) \\ E_v(0,0) \end{bmatrix} + \frac{1}{2} \begin{bmatrix} u & v \end{bmatrix} \begin{bmatrix} E_{uu}(0,0) & E_{uv}(0,0) \\ E_{uv}(0,0) & E_{vv}(0,0) \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix}
But the error at (0, 0) of the image is just 0, since there is no shift! And we’ve already
seen that its first-order derivatives are 0 as well. Meaning the above expansion
simplifies greatly:
E(u, v) \approx \frac{1}{2} \begin{bmatrix} u & v \end{bmatrix} \begin{bmatrix} E_{uu}(0,0) & E_{uv}(0,0) \\ E_{uv}(0,0) & E_{vv}(0,0) \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix}
and we can expand each term fully to get the final Taylor expansion approximation
of E near (0, 0):a
You may (or may not. . . I sure didn't) notice that this looks like the equation of an ellipse in uv-space:

au^2 + buv + cv^2 = d

Thus our approximation of E is a series of these ellipses stacked on top of each other, with varying values of that constant.
Consider a case for M in which the gradients are horizontal xor vertical, so Ix and Iy are
never non-zero at the same time. That would mean M looks like:
M = \sum_{x,y} w(x, y) \begin{bmatrix} I_x^2 & 0 \\ 0 & I_y^2 \end{bmatrix} = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}
This wouldn’t be a very good corner, though, if λ2 = 0, since it means all of the change
was happening in the horizontal direction, and similarly for λ1 = 0. These would be edges
more-so than corners. This might trigger a lightbulb of intuition:
If either λ is close to 0, then this is not a good corner, so look for areas in which
both λs are large!
With some magical voodoo involving linear algebra, we can get the diagonalization of M
that hints at this information:
M = R^{-1} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} R
Where now the λs are the eigenvalues. Looking back at our interpretation of each slice as an
ellipse, R gives us the orientation of the ellipse and the λs give us the lengths of its major
and minor axes.
A slice's "cornerness" is given by having large and proportional λs. More specifically:

λ_1 ≫ λ_2 or λ_2 ≫ λ_1:  edge
λ_1, λ_2 ≈ 0:  flat region
λ_1, λ_2 ≫ 0 and λ_1 ∼ λ_2:  corner
Rather than finding these eigenvalues directly (which was expensive on '80s computers because of the need for sqrt), we can calculate the "cornerness" of a slice indirectly by using the determinant and the trace of M. This is the Harris response function:

R = \det(M) - \alpha\,\operatorname{trace}(M)^2 = \lambda_1\lambda_2 - \alpha(\lambda_1 + \lambda_2)^2
Algorithm: The Harris Detector
  Input: an image, I.
  Result: a set of "interesting" corner-like locations in the image.
  1. Compute Gaussian derivatives at each pixel.
  2. Compute M in a Gaussian window around each pixel.
  3. Compute the Harris response, R.
  4. Threshold R.
  5. Find the local maxima of R via non-maximal suppression.
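To make this concrete, here is a minimal sketch of the detector in Python (NumPy/SciPy). The window σ, the response constant k, and the threshold are typical values I chose, not values prescribed by the notes:

```python
import numpy as np
from scipy import ndimage

def harris_corners(img, sigma=1.0, k=0.05, thresh_rel=0.01):
    """Sketch of the Harris detector: gradient products, Gaussian window,
    cornerness response, threshold, and non-maximal suppression."""
    img = img.astype(np.float64)

    # Gaussian derivatives at each pixel.
    Ix = ndimage.gaussian_filter(img, sigma, order=(0, 1))
    Iy = ndimage.gaussian_filter(img, sigma, order=(1, 0))

    # Entries of the second-moment matrix M, summed in a Gaussian window w(x, y).
    Sxx = ndimage.gaussian_filter(Ix * Ix, sigma)
    Syy = ndimage.gaussian_filter(Iy * Iy, sigma)
    Sxy = ndimage.gaussian_filter(Ix * Iy, sigma)

    # Harris response: det(M) - k * trace(M)^2, avoiding explicit eigenvalues.
    R = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2

    # Threshold, then keep only local maxima (non-maximal suppression).
    R[R < thresh_rel * R.max()] = 0
    maxima = (R == ndimage.maximum_filter(R, size=5)) & (R > 0)
    return np.argwhere(maxima)          # (row, col) corner locations
```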
Harris corners are invariant to rotation and largely to intensity changes (we can
handle thresholding issues), but they are not invariant under scaling. Empirically, we've
seen that scaling by a factor of 2 reduces repeatability (i.e. the rate at which we find the
same features) by ~80% (see the Harris line in Figure 8.4).
Intuitively, we can imagine "zooming in" on a corner and applying the detector to parts of
it: each region would be treated like an edge! Thus, we want a scale-invariant detector.
If we scaled our region by the same amount that the image was scaled, we wouldn’t have
any problems. Of course, we don’t know how much the image was scaled (or even if it was
scaled). How can we choose corresponding regions independently in each image? We need
to design a region function that is scale invariant.
A scale invariant function is one that is consistent across images given changes in region size.
A naïve example of a scale invariant function is average intensity. Given two images, one of
which is a scaled version of the other, there is some region size in each in which the average
intensity is the same for both areas. In other words, the average intensity “peaks” in some
place(s) and those peaks are correlated based on their independent scales.
A good scale invariant function has just one stable peak. For most images, a good function
is one that responds well to contrast (i.e. a sharp local intensity change. . . remember The
Importance of Edges?) We can apply the Laplacian (see Equation 3.6) of the Gaussian
filter (the “Mexican hat operator,” as Prof. Bobick puts it). But to avoid the 2nd derivative
nonsense, we can use something called the difference of Gaussians (DoG), which just subtracts two Gaussians of nearby scales:
$$\text{DoG} = G_{k\sigma} - G_{\sigma}$$
This gives an incredibly similar result. Both of these kernel operators are entirely invariant
to both rotation and scale.
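As a quick sketch (the value of σ and the scale ratio k here are arbitrary choices), the DoG really is just two blurs and a subtraction:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussians(img, sigma=1.6, k=np.sqrt(2)):
    """Approximate the Laplacian of Gaussian by subtracting two
    Gaussian-blurred copies of the image at nearby scales."""
    img = img.astype(np.float64)
    return gaussian_filter(img, k * sigma) - gaussian_filter(img, sigma)
```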
SIFT Detector
This leads to a technique called SIFT: the scale-invariant feature transform.
The general idea is that we want to find robust extrema in both space and scale. We can
imagine “scale space” as being different scaled versions of the image, obtained via interpola-
tion of the original.
For each point in the original, we can say it has neighbors to the left and right (our traditional
understanding) as well as neighbors “up and down” in scale space. We can use that scale space
to create a difference of Gaussian pyramid, then eliminate the maxima corresponding
to edges, which just leaves the corners. Note that this is a completely different method of
corner detection; we aren’t hanging out with Harris anymore!
A "difference of Gaussians pyramid" can be thought of as "stacks" of images; each item in a
stack is a set of DoG images at various scales, as in Figure 8.3. Each point in an image is
compared to its 8 local neighbors as well as its 9 neighbors in the images "above" and "below"
it; if you're a maximum (or minimum) of all of those points, you're an extremum! Once we've found these
extrema, we threshold the contrast and remove extrema on edges; that results in detected
feature points that are robust to scaling.
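A rough sketch of that extremum search, assuming we build the DoG stack by repeatedly blurring a single image (a simplification of the octave-based pyramid SIFT actually uses) and skipping the contrast/edge filtering steps:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(img, sigma=1.6, k=2 ** (1 / 3), n_scales=5):
    """Build a small stack of DoG images and mark points that are extrema with
    respect to their 26 neighbors (8 in-scale + 9 above + 9 below)."""
    img = img.astype(np.float64)
    blurred = [gaussian_filter(img, sigma * k ** i) for i in range(n_scales + 1)]
    dog = np.stack([blurred[i + 1] - blurred[i] for i in range(n_scales)])

    # A point is an extremum if it is the max (or min) of the 3x3x3 block around it.
    local_max = dog == maximum_filter(dog, size=(3, 3, 3))
    local_min = dog == minimum_filter(dog, size=(3, 3, 3))
    extrema = (local_max | local_min)[1:-1]   # drop the boundary scales
    return np.argwhere(extrema)               # (scale index, row, col) candidates
```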
Harris-Laplace Detector
There is a scale-invariant detector that uses both of the detectors we’ve described so far. The
Harris-Laplace detector combines the difference of Gaussians with the Harris detector:
we find the local maximum of the Harris corner in space and the Laplacian in scale.
Comparison
The robustness of all three of these methods under variance in scale is compared in Figure 8.4.
As you can see, the Harris-Laplace detector works the “best,” but take these metrics with a
grain of salt: they were published by the inventors of Harris-Laplace.
Figure 8.4: Repeatability rate vs. scale for the Harris, SIFT, and Harris-Laplace detectors.
Simple Solution? Is there something simple we can do here? We have the feature points
from our Harris Detector Algorithm, can’t we just use correlation on each feature point win-
dow on the other image and choose the peak (i.e. something much like Template Matching)?
Unfortunately, correlation is not rotation-invariant, and it’s fairly sensitive to photometric
changes. Even normalized correlation is sensitive to non-linear photometric changes and
slight geometric distortions as well.
Furthermore, it’s slow: comparing one feature to all other feature points is not an ideal
solution from an algorithmic perspective (O(n2 )).
We’ll instead introduce the SIFT descriptor to solve these problems.
Orientation Assignment
We want to compute the “best” orientation for a feature. Handily enough, the base orienta-
tion is just the dominant direction of the gradient.
To localize orientation to a feature, we create a histogram of local gradient directions at a
selected scale – 36 bins. Then, we assign the canonical orientation based on the peak of the
smoothed histogram. Thus, each feature point has some properties: its (x, y) coordinates,
and an invariant scale and orientation.
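A small sketch of that orientation step for a single keypoint patch; the smoothing kernel below is an arbitrary choice, and real SIFT additionally weights the histogram with a Gaussian centered on the keypoint and interpolates the peak, which this skips:

```python
import numpy as np

def dominant_orientation(patch, n_bins=36):
    """Histogram of gradient directions in a patch (weighted by gradient
    magnitude); the canonical orientation is the peak of the smoothed histogram."""
    gy, gx = np.gradient(patch.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)           # directions in [0, 2π)

    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi), weights=mag)
    hist = np.convolve(hist, [0.25, 0.5, 0.25], mode="same")   # simple smoothing
    peak = np.argmax(hist)
    return (peak + 0.5) * (2 * np.pi / n_bins)            # bin center, in radians
```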
Keypoint Description
We want a descriptor that, again, is highly distinctive and as invariant as possible to pho-
tometric changes. First, we normalize: rotate a keypoint’s window based on the standard
orientation, then scale the window size based on the keypoint's scale.
Now, we create a feature vector based upon:
• a histogram of gradients, which we determined previously when finding the orientation
• weighted by a centered Gaussian filter, to appropriately value the center gradients more
We take these values and create a 4×4 grid of bins for each. In other words, the first bin
would contain the weighted histogram for the top-left corner of the window, the second for
the next sector to the right, and so on.
Minor Details There are a lot of minor tweaks here and there that are necessary to make
SIFT work. One of these is to ensure smoothness across the entire grid: pixels can affect
multiple bins if their gradients are large enough. This prevents abrupt changes across bin
boundaries. Furthermore, to lower the impact of highly-illumined areas (whose gradients
would dominate a bin), they encourage clamping the gradient to be ≤ 0.2 after the rotation
normalization. Finally, we normalize the entire feature vector (which is 16× the window
size) to be magnitude 1.
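Putting those pieces together, here is a rough sketch of the descriptor. I assume 8 orientation bins per grid cell (the usual SIFT choice, not stated explicitly above), and I apply the 0.2 clamp between two unit-length normalizations, which is the common recipe:

```python
import numpy as np

def sift_like_descriptor(patch, grid=4, n_bins=8, clamp=0.2):
    """Build a 4x4 grid of orientation histograms over a (rotated, scaled)
    keypoint window, then clamp and renormalize: a rough 128-D SIFT-like vector."""
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)

    # Weigh gradients by a centered Gaussian so the middle of the window matters more.
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    weight = np.exp(-((yy - h / 2) ** 2 + (xx - w / 2) ** 2) / (2 * (0.5 * w) ** 2))
    mag = mag * weight

    desc = []
    for rows in np.array_split(np.arange(h), grid):
        for cols in np.array_split(np.arange(w), grid):
            cell_ang = ang[np.ix_(rows, cols)].ravel()
            cell_mag = mag[np.ix_(rows, cols)].ravel()
            hist, _ = np.histogram(cell_ang, bins=n_bins,
                                   range=(0, 2 * np.pi), weights=cell_mag)
            desc.extend(hist)

    desc = np.array(desc)
    desc /= (np.linalg.norm(desc) + 1e-12)   # unit length
    desc = np.minimum(desc, clamp)           # dampen highly-illuminated gradients
    desc /= (np.linalg.norm(desc) + 1e-12)   # renormalize
    return desc                               # 4 * 4 * 8 = 128 values
```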
Nearest Neighbor
We could, of course, use a naïve nearest-neighbor algorithm and compare a feature in one
image to all of the features in the other. Even better, we can use the kd-tree algorithm1 to
find the approximate nearest neighbor. SIFT modifies the algorithm slightly: it implements
the best-bin-first modification by using a heap to order bins by their distance from the query
point. This gives a 100-1000x speedup and gives the correct result 95% of the time.2
Wavelet-Based Hashing
An alternative technique computes a short 3-vector descriptor from the neighborhood using a
Haar wavelet.3 Then, you quantize each value into 10 overlapping bins, giving 10³ possible
entries. This greatly reduces the number of features we need to search through.
Locality-Sensitive Hashing
The idea behind this technique, and locality-sensitive hashing in general, is to construct a
hash function that has similar outputs for similar inputs. More rigorously, we say we craft
a hash function g : Rd → U such that for any two points p and q (where D is some distance
1
Unlike in lecture, understanding of this algorithm isn’t assumed . The kd-tree algorithm partitions a
k-dimensional space into a tree by splitting each dimension down its median. For example, given a set
of (x, y) coordinates, we would first find the median of the x coordinates (or the ys. . . the dimension is
supposed to be chosen at random), and split the set into two piles. Then, we’d find the median of the y
coordinates in each pile and split them down further. When searching for the nearest neighbor, we can use
each median to quickly divide the search space in half. It’s an approximate method, though: sometimes,
the nearest neighbor will lie in a pile across the divide. For more, check out this video which explains
things succinctly, or the Wikipedia article for a more rigorous explanation.
2
For reference, here is a link to the paper: Indexing w/o Invariants in 3D Object Recognition. See Figure 6
for a diagram “explaining” their modified kd-tree algorithm.
3
Per Wikipedia, Haar wavelets are a series of rescaled square-wave-like functions that form a basis set.
A wavelet (again, per Wikipedia) is an oscillation that begins at zero, increases, then decreases to zero
(think of sin(θ) ∈ [0, π]). We'll briefly allude to these descriptors again when we discuss features in the
Viola-Jones Face Detector much later in chapter 11.
function):
$$D(p, q) \le r \;\Rightarrow\; \Pr\big[g(p) = g(q)\big] \gg 0$$
$$D(p, q) > cr \;\Rightarrow\; \Pr\big[g(p) = g(q)\big] \approx 0$$
In English, we say that if the distance between p and q is high, the probability of their hashes
being the same is “small”; if the distance between them is low, the probability of their hashes
being the same is “not so small.”
If we can construct such a hash function, we can jump to a particular bin and find feature
points that are similar to a given input feature point and, again, reduce our search space
down significantly.
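One concrete (and common) way to build such a g is the random-hyperplane family: project a descriptor onto a few random directions and keep only the sign pattern. This toy sketch illustrates the idea; it is not necessarily the scheme used in lecture:

```python
import numpy as np
from collections import defaultdict

class RandomProjectionLSH:
    """Toy locality-sensitive hash: nearby descriptors tend to get the same
    sign pattern, so a query only has to search its own bucket."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))   # random hyperplanes
        self.buckets = defaultdict(list)

    def _key(self, x):
        return tuple((self.planes @ x > 0).astype(int))

    def add(self, idx, x):
        self.buckets[self._key(x)].append(idx)

    def query(self, x):
        return self.buckets.get(self._key(x), [])
```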
Given a set of corresponding feature points between two images, we want to find the transformation that fits them the best. We've seen this before in chapter 4
when discussing line and circle fitting by voting for parameters in Hough Space.
At its heart, model fitting is a procedure that takes a few concrete steps:
• We have a function that returns a predicted data set. This is our model.
• We have a function that computes the difference between our data set and the model’s
predicted data set. This is our error function.
• We tune the model parameters until we minimize this difference.
We already have our model: a transformation applied to a group of feature points. We
already have our data set: the “best matches” computed by our descriptor (we call these
putative matches). Thus, all that remains is a robust error function.
Of course, we’ll be throwing out the baby with the bathwater a bit, since a lot of correct
matches will also get discarded, but we still have enough correct matches to work with.
Unfortunately, we still do have lots of outliers. How can we minimize their impact on our
model?
Where $h = \begin{bmatrix} a & b \end{bmatrix}^\top$ and U is a matrix of the differences from the averages for each point (i.e., each row is $x_i - \bar{x}$).
Amazingly, we can solve this with the SVD trick we’ve seen time and time again.
This seems like a strong error function, but it has a fundamental assumption. We assume
that the noise to our line is corrupted by a Gaussian noise function perpendicular to the line.
To make our assumption mathematical, we are saying that our noisy point (x, y) is the
true point on the line perturbed along the normal by some noise sampled from a zero-mean
Gaussian with some σ:
$$\begin{bmatrix} x \\ y \end{bmatrix} = \underbrace{\begin{bmatrix} u \\ v \end{bmatrix}}_{\text{true point}} + \underbrace{\varepsilon}_{\text{noise}}\,\underbrace{\begin{bmatrix} a \\ b \end{bmatrix}}_{\hat{n}}$$
A robust estimator instead minimizes $\sum_i \rho\big(r_i(x_i, \theta); \sigma\big)$ over the model parameters. Here, $r_i$ is the "residual" of the i-th point with respect to the model parameters, θ. This
is our "distance" or error function that measures the difference between the model's output
and our test set. ρ is a robust function with a scale parameter, σ; the idea of ρ is to not let
7
i.e. a model that generates the noise
outliers have a large impact on the overall error. An example of a robust error function is:
$$\rho(u; \sigma) = \frac{u^2}{\sigma^2 + u^2}$$
For a small residual u (i.e. small error) relative to σ, this behaves much like the squared
distance: we get $\approx u^2/\sigma^2$. As u grows, though, the effect "flattens out": we get ≈ 1.
Of course, again, this begs the question of how to choose this parameter. What makes a
good scale, σ? Turns out, it’s fairly scenario-specific. The error function we defined above
is very sensitive to scale.
It seems like we keep shifting the goalposts of what parameter we need to tune, but in the
next section we’ll define an error function that still has a scale, but is much less sensitive to
it and enables a much more robust, general-purpose solution.
8.3.3 RANSAC
The RANSAC error function, or random sample consensus error function, relies on the
(obvious) notion that it would be easy to fit a model if we knew which points belonged to
it and which ones didn’t. We’ll return to the concept of “consistent” matches: which sets of
matches are consistent with a particular model?
Like in the Hough Transform, we can count on the fact that wrong matches will be relatively
random, whereas correct matches will be consistent with each other. The basic main idea
behind RANSAC is to loop over a randomly-proposed model, find which points belong to it
(inliers) and which don’t (outliers), eventually choosing the model with the most inliers.
For any model, there is a minimal set of s points that can define it. We saw this in Image-
to-Image Projections: translations had 2, homographies had 4, the fundamental matrix has
8, etc. The general RANSAC algorithm is defined in algorithm 8.2 below.
How do we choose our distance threshold t, which defines whether or not a point counts
as an inlier or outlier of the instantiated model? That will depend on the way we believe
the noise behaves (our generative model). If we assume a Gaussian noise function like we
did previously, then the distance d to the noise is modeled by a Chi distribution8 with k
degrees of freedom (where k is the dimension of the Gaussian). This is defined by:
$$f(d) = \frac{\sqrt{2}\, e^{-\frac{d^2}{2\sigma^2}}}{\sqrt{\pi}\,\sigma}, \qquad d \ge 0$$
We can then define our threshold based on what percentage of inliers we want. For example,
choosing t2 = 3.84σ 2 means there’s a 95% probability that when d < t, the point is an inlier.
Now, how many iterations N should we perform? We want to choose N such that, with
some probability p, at least one of the random sample sets (i.e. one of the Ci s) is completely
8
The Chi distribution is the distribution of the "square root of the sum of squares" of a set of independent random variables,
each following a standard normal distribution (i.e. a Gaussian) (per Wikipedia). "That" is what we were
looking to minimize previously with the perpendicular least squares error, so it makes sense!
free from outliers. We base this off of an “outlier ratio” e, which defines how many of our
feature points we expect to be bad.
Let’s solve for N :
• s – the number of points to compute a solution to the model
• p – the probability of success
• e – the proportion of outliers, so the % of inliers is (1 − e).
• Pr [sample set with all inliers] = (1 − e)s
• Pr [sample set with at least one outlier] = (1 − (1 − e)s )
• Pr [all N samples have outliers] = (1 − (1 − e)s )N
• But we want the chance of all N having outliers to be really small, i.e. < (1 − p). Thus,
we want (1 − (1 − e)s )N < (1 − p).
Solving for N gives us. . . drumroll . . .
$$N > \frac{\log(1 - p)}{\log\big(1 - (1 - e)^s\big)}$$
The beauty of this probability relationship for N is that it scales incredibly well with e. For
example, for a 99% chance of finding a set of s = 2 points with no outliers, with an e = 0.5
outlier ratio (meaning half of the matches are outliers!), we only need 17 iterations.
Some more RANSAC values are in Table 8.1. As you can see, N scales relatively well as e
increases, but less so as we increase s, the number of points we need to instantiate a model.

                  Proportion of outliers, e
      s     5%    10%    20%    25%    30%    40%    50%
      2      2     3      5      6      7     11     17
      3      3     4      7      9     11     19     35
      4      3     5      9     13     17     34     72
      5      4     6     12     17     26     57    146
      6      4     7     16     24     37     97    293
      7      4     8     20     33     54    163    588
      8      5     9     26     44     78    272   1177

Table 8.1: Values for N in the RANSAC algorithm, given the number
of model parameters, s, and the proportion of outliers, e.
Finally, when we’ve found our best-fitting model, we can recompute the model instance using
all of the inliers (as opposed to just the 4 we started with) to average out the overall noise
and get a better estimation.
Adapting Sample Count The other beautiful thing about RANSAC is that the number
of features we found is completely irrelevant! All we care about is the number of model
parameters and our expected outlier ratio. Of course, we don't know that ratio a priori,9
but we can adapt it as we loop. We can assume a worst case (e.g. e = 0.5), then adjust it
based on the actual number of inliers that we find in our loop(s). For example, finding 80%
inliers then means e = 0.2 for the next iteration. More formally, we get algorithm 8.3.
Just to make things a little more concrete, let’s look at what estimating a homogra-
phy might look like. Homographies need 4 points, and so the RANSAC loop might
look something like this:
1. Select 4 feature points at random.
2. Compute their exact homography, H.
3. Compute the inliers in the entire set of features, i.e. those where SSD(p′i , Hpi ) falls below some threshold ε.
4. Keep the largest set of inliers.
5. Recompute the least squares H estimate on all of the inliers.
9
fancy Latin for “from before,” or, in this case, “in advance”
Algorithm 8.3 (adaptive RANSAC), roughly:
    N ← ∞, c ← 0
    while N > c do
        sample s points, fit a model, and keep the largest consensus (inlier) set Ci found so far
        e ← 1 − |Ci | / (total points);   N ← log(1 − p)/ log(1 − (1 − e)^s )
        c ← c + 1
    end
    return Ci , Model(Ci )
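A sketch of the homography loop from the example above, leaning on OpenCV for the exact 4-point fit and the final least-squares refit. The iteration count and pixel threshold here are arbitrary placeholders (in practice N would come from the adaptive bound above):

```python
import cv2
import numpy as np

def ransac_homography(pts_src, pts_dst, n_iters=500, eps=3.0):
    """Sample 4 correspondences, fit an exact H, count inliers by reprojection
    error, keep the largest inlier set, then refit H on all of its inliers."""
    pts_src = pts_src.astype(np.float32)
    pts_dst = pts_dst.astype(np.float32)
    best_inliers = np.zeros(len(pts_src), dtype=bool)

    for _ in range(n_iters):
        idx = np.random.choice(len(pts_src), 4, replace=False)
        H = cv2.getPerspectiveTransform(pts_src[idx], pts_dst[idx])

        # Reprojection error of every putative match under this H.
        proj = cv2.perspectiveTransform(pts_src.reshape(-1, 1, 2), H).reshape(-1, 2)
        inliers = np.linalg.norm(proj - pts_dst, axis=1) < eps

        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers

    # Least-squares refit on all inliers of the best model.
    H_final, _ = cv2.findHomography(pts_src[best_inliers], pts_dst[best_inliers], 0)
    return H_final, best_inliers
```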
8.4 Conclusion
We’ve made some incredible strides in automating the concepts we described in the chapter
on Multiple Views. Given some images, we can make good guesses at corresponding points
then fit those to a particular model (such as a fundamental matrix) with RANSAC and
determine the geometric relationship between images. These techniques are pervasive in
computer vision and have plenty of real-world applications like robotics.
Photometry
“Drawing makes you look at the world more closely. It helps you to see what
you’re looking at more clearly. Did you know that?”
I said nothing.
“What colour’s a blackbird?” she said.
“Black.”
“Typical!”
— David Almond, Skellig
9.1 BRDF
The appearance of a surface is largely dependent on three factors: the viewing angle, the
surface’s material properties, and its illumination. Before we get into these in detail, we
need to discuss some terms from radiometry.
Radiance The energy carried by a ray of light is radiance, L. It’s a measure of power
per unit area, perpendicular to the direction of travel, per unit solid angle. That’s super
confusing, so let’s dissect. We consider the power of the light falling into an area, then
consider its angle, and finally consider the size of the "cone" of light that hits us. The units of radiance work out to be W · m−2 · sr−1 .
Irradiance On the other hand, we have the energy arriving at a surface, E. In this
case, we just have the power in a given direction per unit area (units: W m−2 ). Intuitively,
light coming at a steep angle is less powerful than light straight above us, and so for an
area receiving radiance L(θ, ϕ) coming in from some light source size dω, the corresponding
irradiance is E(θ, ϕ) = L(θ, ϕ) cos θdω.
Figure 9.1: The radiance model: light arrives within a solid angle dω, at an angle θ from the surface normal n̂, onto an area dA (foreshortened to dA cos θ).
If you don't understand this stuff super well, that's fine. I don't, hence the explanation is
lacking. We'll be culminating this information into the BRDF – the bidirectional reflectance
distribution function – and not worrying about radiance anymore. The BRDF is the ratio
between the radiance leaving the surface in the viewing direction and the irradiance arriving
at the surface from the incident direction, as seen below in Figure 9.2. In other words, it's
the percentage of the incoming light that gets reflected:
$$\text{BRDF:}\quad f(\theta_i, \phi_i; \theta_r, \phi_r) = \frac{L_{\text{surface}}(\theta_r, \phi_r)}{E_{\text{surface}}(\theta_i, \phi_i)} \tag{9.1}$$
Figure 9.2: The model captured by the BRDF: a source of intensity I along the incident direction s = (θi , ϕi ), a sensor along the viewing direction v = (θr , ϕr ), and a surface element with normal n̂.
BRDFs can be incredibly complicated for real-world objects, but we’ll be discussing two
reflection models that can represent most objects to a reasonably accurate degree: diffuse
reflection and specular reflection. With diffuse reflection, the light penetrates the surface,
scattering around the inside before finally radiating out. Surfaces with a lot of diffuse
reflection, like clay and paper, have a soft, matte appearance. The alternative is specular
reflection, in which most of the light bounces off the surface. Things like metals have a high
specular reflection, appearing glossy and having a lot of highlights.
With these two components, we can get a reasonable estimation of light intensity off of most
objects. We can say that image intensity = body reflection + surface reflection. Let’s discuss
each of these individually, then combine them into a unified model.
For diffuse (Lambertian) reflection, the BRDF is just a constant, the albedo:
$$f(\theta_i, \phi_i; \theta_r, \phi_r) = \rho_d$$
For a perfect mirror, on the other hand, the BRDF is non-zero only along the mirror direction:
$$f(\theta_i, \phi_i; \theta_r, \phi_r) = \rho_s\,\delta(\theta_i - \theta_v)\,\delta(\phi_i + \pi - \phi_v)$$
where δ(x) is a simple toggle: 1 if x = 0 and 0 otherwise. Then, the surface radiance simply
adds the intensity:
$$L = I\rho_s\,\delta(\theta_i - \theta_v)\,\delta(\phi_i + \pi - \phi_v)$$
We can simplify this with vector notation, as well. We can say that m is the "mirror
direction," which is the perfect reflection vector for an incoming s, and that h is the
"half-angle," i.e. the vector halfway between s and v. (The perfect-mirror condition is then just m̂ = v̂, or equivalently ĥ = n̂.)
Most things aren’t perfect mirrors, but they can be “shiny.” We can think of a shiny, glossy
object as being a blurred mirror: the light from a point now spreads over a particular area,
as shown in Figure 9.3. We simulate this by raising the angle between the mirror and the
viewing directions to an exponent, which blurs the source intensity more with a larger angle:
$$L = I\rho_s\,(\hat{m} \cdot \hat{v})^k$$
A larger k makes the specular component fall off faster, getting more and more like a mirror.
Figure 9.3: Mirror versus glossy reflection, with incident direction i, reflected direction r, surface normal n̂, and viewing direction v.
For the diffuse (body) component, meanwhile, the radiance follows Lambert's cosine law: L = Iρ cos θ.
If we combine the incoming light intensity and the angle at which it hits the surface into an
“energy function”, and represent the albedo as a more complicated “reflectance function,” we
get:
L(x, y) = R(x, y) · E(x, y)
Of course, if we want to recover R from L (which is what “we see”), we can’t without knowing
the lighting configuration. The question is, then, how can we do this?
The astute reader may notice some similarities between this and our discussion of noise
removal from chapter 2. In Image Filtering, we realized that we couldn’t remove noise that
was added to an image without knowing the exact noise function that generated it, especially
if values were clipped by the pixel intensity range. Instead, we resorted to various filtering
methods that tried to make a best-effort guess at the “true” intensity of an image based on
some assumptions.
Likewise here, we will have to make some assumptions about our scene in order to make our
“best effort” attempt at extracting the true reflectance function of an object:
1. Light is slowly varying.
This is reasonable for our normal, planar world. Shadows are often much softer than
the contrast between surfaces, and as we move through our world, the effect of the
lighting around us does not change drastically within the same scene.
2. Within an object, reflectance is constant.
This is a huge simplification of the real world, but is reasonable given a small enough
“patch” that we treat as an object. For example, your shirt’s texture (or possibly your
skin’s, if you’re a comfortable at-home reader) is relatively the same in most places.
This leads directly into the next assumption, which is that. . .
3. Between objects, reflectance varies suddenly.
We’ve already taken advantage of this before when discussing edges: the biggest jump
in variations of color and texture come between objects.
To simplify things even further, we’ll be working exclusively with intensity; color is out of the
picture for now. The model of the world we’ve created that is composed of these assumptions
is often called the Mondrian world, after the Dutch painter, despite the fact that it doesn’t
do his paintings justice.
The recovery procedure, then, looks something like this:
• Take the logarithm of the observed intensity, so that the product L = R · E becomes a sum.1
• Run the result through a high-pass filter, keeping high-frequency content, perhaps with
the derivative.
• Threshold the result to remove the small (slowly-varying, illumination-like) components.
• Finally, invert that to get back the original result (integrate, exponentiate).
1
Recall, of course, that the logarithm of a product is the sum of their logarithms.
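A rough sketch of this recovery in NumPy/SciPy. For brevity I replace the derivative/integration pair with an equivalent high-pass step (subtracting a heavily blurred copy in the log domain); σ and the threshold are arbitrary choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_reflectance(img, sigma=2.0, thresh=0.05):
    """Work in the log domain (product -> sum), suppress the slowly-varying
    illumination with a high-pass filter, drop small responses, then
    exponentiate back to get an estimate proportional to reflectance."""
    log_img = np.log(img.astype(np.float64) + 1e-6)

    # High-pass: original minus a blurred (low-frequency, illumination-like) copy.
    high = log_img - gaussian_filter(log_img, sigma)

    # Threshold away the small (illumination-like) variations.
    high[np.abs(high) < thresh] = 0

    # "Invert" the log.
    return np.exp(high)
```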
Motion & Tracking
Motion will add another dimension to our images: time. Thus, we'll be working with
sequences of images: I(x, y, t). Unlike the real world, in which we perceive continuous
motion,1 digital video is merely a sequence of images with changes between them. By
changing the images rapidly, we can imitate fluid motion. In fact, studies have been done to
determine what amount of change humans will still classify as “motion;” if an object moves
10 feet between images, is it really moving?
There are many applications for motion; let’s touch on a few of them to introduce our
motivation for the math we’ll be slugging through in the remainder of this chapter.
Background Subtraction Given a video with a relatively static scene and some moving
objects, we can extract just the moving objects (or just the background!). We may
want to then overlay the “mover” onto a different scene, model its dynamics, or apply
other processing to the extracted object.
Shot Detection Given a video, we can detect when it cuts to a different scene, or shot.
Motion Segmentation Suppose we have many moving objects in a video. We can segment
1
Arguably continuous, as the chapter quote points out. How can we prove that motion is continuous if our
senses operate on some discrete intervals? At the very least, we have discretization by the rate at which a
neuron can fire. Or, perhaps, there’s never a moment in which a neuron isn’t firing, so we can claim that’s
effectively continuous sampling? But if Planck time is the smallest possible unit of time, does that mean
we don’t live in a continuous world in the first place, so the argument is moot? Philosophers argue on. . .
out each of those objects individually and do some sort of independent analyses. Even
in scenarios in which the objects are hard to tell apart spatially, motion gives us another
level of insight for separating things out.
There are plenty of other applications of motion: video stabilization, learning dynamic mod-
els, recognizing events, estimating 3D structure, and more!
Our brain does a lot of “filling in the gaps” and other interpretations based on even the
most impoverished motion; obviously this is hard to demonstrate in written form, but even
a handful of dots arranged in a particular way, moving along some simple paths, can look
like an actual walking person to our human perception. The challenge of computer vision is
interpreting and creating these same associations.
Figure 10.1: The optic flow diagram (right) from a rotated Rubik's cube.
Our goal is to recover the arrows in Figure 10.2. How do we estimate that motion? In a way,
we are solving a pixel correspondence problem much like we did in Stereo Correspondence,
though the solution is much different. Given some pixel in I(x, y, t), look for nearby pixels
of the same color in I(x, y, t + 1). This is the optic flow problem.
Like we’ve seen time and time again, we need to establish some assumptions (of varying
validity) to simplify the problem:
• Color constancy: assume a corresponding point in one image looks the same in the
next image. For grayscale images (which we’ll be working with for simplicity), this is
called brightness constancy. We can formulate this into a constraint by saying that
there must be some (u, v) change in location for the corresponding pixel, then:
$$I(x, y, t) = I(x + u, y + v, t + 1)$$
• Small motion: assume points do not move very far from one image to the next. We
can formulate this into a constraint, as well. Given that same (u, v), we assume that
they are very small, say, 1 pixel or less. The points are changing smoothly.
Yep, you know what that means! Another Taylor expansion. I at some location
(x + u, y + v) can be expressed exactly as:
$$I(x + u, y + v) = I(x, y) + \frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v + \ldots \text{ [higher order terms]}$$
We can disregard the higher order terms and make this an approximation that holds
for small values of u and v.
We can combine these two constraints into the following equation (the full derivation comes
in this aside), called the brightness constancy constraint equation:
Ix u + Iy v + It = 0 (10.1)
We have two unknowns describing the direction of motion, u and v, but only have one
equation! This is the aperture problem: we can only see changes that are perpendicular
to the edge. We can determine the component of (u, v) in the gradient's direction, but
not in the perpendicular direction (which would be along an edge). This is reminiscent of
the edge problem in Finding Interest Points: patches along an edge all looked the same.
Visually, the aperture problem is explained in Figure 10.3 below.
Figure 10.3: The aperture problem. Clearly, motion from the black
line to the blue line moves it down and right, but through the view of
the aperture, it appears to have moved up and right.
Let’s combine the two constraint functions for optic flow. Don’t worry, this will be
much shorter than the last aside involving Taylor expansions. We begin with our
two constraints, rearranged and simplified for convenience:
$$0 = I(x + u, y + v, t + 1) - I(x, y, t)$$
$$I(x + u, y + v) \approx I(x, y) + \frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v$$
Then, we can perform a substitution of the second into the first and simplify:
$$\begin{aligned}
0 &\approx I(x, y, t + 1) + I_x u + I_y v - I(x, y, t) \\
  &\approx \big[I(x, y, t + 1) - I(x, y, t)\big] + I_x u + I_y v \\
  &\approx I_t + I_x u + I_y v \\
  &\approx I_t + \nabla I \cdot \begin{bmatrix} u & v \end{bmatrix}^\top
\end{aligned}$$
In the limit, as u and v approach zero (meaning the ∆t between our images gets
smaller and smaller), this equation becomes exact.
Notice the weird simplification of the image gradient: is Ix the gradient at t or t+1?
Turns out, it doesn’t matter! We assume that the image moves so slowly that the
derivative actually doesn’t change. This is dicey, but works out “in the limit,” as
Prof. Bobick says.
Furthermore, notice that It is the temporal derivative: it’s the change in the
image over time.
To solve the aperture problem, suppose we formulate Equation 10.1 as an error function:
$$e_c = \iint_{\text{image}} (I_x u + I_y v + I_t)^2 \, dx\, dy$$
Of course, we still need another constraint. Remember when we did Better Stereo Correspon-
dence? We split our error into two parts: the raw data error (6.2) and then the smoothness
error (6.3), which punished solutions that didn’t behave smoothly. We apply the same logic
here, introducing a smoothness constraint:
$$e_s = \iint_{\text{image}} \big(u_x^2 + u_y^2 + v_x^2 + v_y^2\big)\, dx\, dy$$
This punishes large changes to u or v over the image. Now given both of these constraints,
we want to find the (u, v) at each image point that minimizes:
e = es + λec
Where λ is a weighting factor we can modify based on how much we "believe" in our data
(noise, lighting, artifacts, etc.) to change the effect of the brightness constancy constraint.
This is a global constraint on the motion flow field; it comes with the disadvantages of such
global assumptions. Though it allows you to bias your solution based on prior knowledge,
local constraints perform much better. Conveniently enough, that’s what we’ll be discussing
next.
This time, we have more equations than unknowns. Thus, we can use the standard least
squares method on an over-constrained system (as covered in Appendix A) to find the best
approximate solution: $(A^\top A)\,d = A^\top b$. Or, in matrix form:
$$\begin{bmatrix} \sum I_x I_x & \sum I_x I_y \\ \sum I_x I_y & \sum I_y I_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -\begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix}$$
Each summation occurs over some k × k window. This equation is solvable when the pseudoinverse
does, in fact, exist; that is, when AᵀA is invertible. For that to be the case, it
must be well-conditioned: the ratio of its eigenvalues, λ1/λ2 , should not be "too large." 2
Wait. . . haven’t we already done some sort of eigenvalue analysis on a matrix with image
gradients? Yep! Recall our discussion of the Properties of the 2nd Moment Matrix when
finding Harris Corners? Our AᵀA is a moment matrix, and we've already considered what
it means to have a good eigenvalue ratio: it’s a corner!
We can see a tight relationship between the two ideas: use corners as good points to compute
the motion flow field. This is explored briefly when we discuss Sparse Flow later; for now,
though, we continue with dense flow and approximate the moment matrix for every pixel.
Improving Lucas-Kanade
Our small motion assumption for optic flow rarely holds in the real world. Often, motion
from one image to another greatly exceeds 1 pixel. That means our first-order Taylor ap-
proximation doesn’t hold: we don’t necessarily have linear motion.
To deal with this, we can introduce iterative refinement. We find a flow field between two
images using the standard Lucas-Kanade approach, then perform image warping using that
flow field. This is I′t+1 , our estimated next time step. Then we can find the flow field
between that estimation and the truth, getting a new flow field. We can repeat this until
the estimation converges on the truth! This is iterative Lucas-Kanade, formalized in
algorithm 10.1.
Hierarchical Lucas-Kanade
Iterative Lucas-Kanade improves our resulting motion flow field in scenarios that have a little
more change between images, but we can do even better. The big idea behind hierarchical
Lucas-Kanade is that we can make large changes seem smaller by just making the image
smaller. To do that, we will be reintroducing the idea of Gaussian pyramids that we used
when Improving the Harris Detector.
The idea is fairly straightforward. We first introduce two important operations: reduction
and expansion. These appear relatively complicated on the surface, but can be explained
simply; they form the building block for our Gaussian pyramid.
• Reduce: Remember when we had that terrible portrait of Van Gogh after directly
using image subsampling? If not, refer to Figure 5.4. We threw out every other pixel
to halve the image size. It looked like garbage; we lost a lot of the fidelity and details
because we were arbitrarily throwing out pixels. The better solution was first blurring
the image, then subsampling. The results looked much better.
2
From numerical analysis, the condition number relates to how “sensitive” a function is: given a small
change in input, what’s the change in output? Per Wikipedia, a high condition number means our least
squares solution may be extremely inaccurate, even if it’s the “best” approximation. In some ways, this is
related to our discussion of Error Functions for feature recognition: recall that vertical least squares gave
poor results for steep lines, so we used perpendicular least squares instead.
The reduce operation follows the same principle, but accomplishes it a little differently.
It uses what’s called a “5-tap filter” to make each subsampled pixel a weighted blend
of its 5 “source” pixels.
• Expand: Expansion is a different beast, since we must interpret the colors of pixels
that lie “between” ones whose values we know. For even pixels in the upscaled image,
the values we know are directly on the right and left of the pixel in the smaller image.
Thus, we blend them equally. For odd pixels, we have a directly corresponding value,
but for a smoother upscale, we also give some weight to its neighbors.3
A picture is worth a thousand words: these filters are visually demonstrated in Figure 10.4.
In practice, the filters for the expansion can be combined into a single filter, since the
corresponding pixels will have no value (i.e. multiplication by zero) in the correct places for
odd and even pixels. For example, applying the combined filter (1/8, 1/2, 3/4, 1/2, 1/8) on an even
pixel would multiply the odd filter indices by nothing.
We can now create a pyramid of Gaussians – each level 1/2 the size of the last – for each of
our images in time, then build up our motion field from the lowest level. Specifically, we can
find the motion flow from the highest (smallest) level using standard (or iterative) Lucas-
Kanade. To move up in the pyramid, we upscale the motion field (effectively interpolating
the motion field for inbetween pixels) and double it (since the change is twice as big now).
Then, we can use that upscaled flow to warp the next-highest level; this is our “estimation”
3
Whether or not an upscaled pixel has a corresponding value in the parent image depends on how you do
the upscaling. If you say that column 1 in the new image is column 1 in the old image, the rules apply (so
column 3 comes from column 2, etc.). If you say column 2 in the new image is column 1 in the old image
(so column 4 comes from column 2, etc.) then the “odd” and “even” rules are reversed.
of that level. We can compare that estimate against the “true” image at that level and get
a new (cumulative) motion field. We then repeat this process all the way up the pyramid.
More formally, this coarse-to-fine procedure is the hierarchical Lucas-Kanade algorithm.
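A coarse-to-fine sketch of that procedure. It assumes some single-level dense solver is available (e.g. the windowed least-squares sketch from earlier, applied at every pixel), and uses OpenCV only for the pyramid, resizing, and warping:

```python
import cv2
import numpy as np

def hierarchical_flow(img_t, img_t1, single_level_flow, levels=3):
    """Estimate flow on the smallest pyramid level, then repeatedly upsample &
    double it, warp the next level of img_t with it, and add the residual flow.
    `single_level_flow(a, b)` is assumed to return a dense (H, W, 2) field."""
    pyr_t, pyr_t1 = [img_t.astype(np.float32)], [img_t1.astype(np.float32)]
    for _ in range(levels - 1):
        pyr_t.append(cv2.pyrDown(pyr_t[-1]))
        pyr_t1.append(cv2.pyrDown(pyr_t1[-1]))

    flow = np.zeros((*pyr_t[-1].shape[:2], 2), np.float32)
    for a, b in zip(reversed(pyr_t), reversed(pyr_t1)):
        # Upsample the coarser flow and double its magnitude.
        if flow.shape[:2] != a.shape[:2]:
            flow = 2 * cv2.resize(flow, (a.shape[1], a.shape[0]))

        # Warp img_t by the current estimate, then add the residual flow.
        ys, xs = np.mgrid[0:a.shape[0], 0:a.shape[1]].astype(np.float32)
        warped = cv2.remap(a, (xs + flow[..., 0]).astype(np.float32),
                              (ys + flow[..., 1]).astype(np.float32),
                           cv2.INTER_LINEAR)
        flow = (flow + single_level_flow(warped, b)).astype(np.float32)
    return flow
```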
Sparse Flow
We realized earlier that the ideal points for motion detection would be corner-like, to ensure
the accuracy of our gradients, but our discussion ended there. We still went on to estimate
the motion flow field for every pixel in the image. Sparse Lucas-Kanade is a variant of
Hierarchical Lucas-Kanade that is only applied to interest points.
Suppose hierarchical Lucas-Kanade told us that the point moved (+3.5, +2) from t0 to
t1 (specifically, from (1, 1) to (4.5, 3)). We can then easily estimate that the point must
have existed at (2.75, 2) at t0.5 by just halving the flow vector and warping, as before. As
shorthand, let's say F t is the warping of frame t with the flow field F , so then t0.5 = (F/2) t0 .
Of course, hierarchical Lucas-Kanade rarely gives such accurate results for the flow vectors.
Each interpolated frame would continually propagate the error between the "truth" at t1
and our “expected truth” based on the flow field. Much like when we discussed iterative
Lucas-Kanade, we can improve our results by iteratively applying HLK as we interpolate!
Suppose we wanted t0.25 and t0.75 from our contrived example above, and we had the flow
field F . We start as before, with t0.25 = (F/4) t0 . Instead of making t0.75 use (3F/4) t0 , though, we
recalculate the flow field from our estimated frame, giving us F ′ = t0.25 → t1 .
Then, we apply t0.75 = (2/3)F ′ t0.25 , since t0.75 is 2/3rds of the way along F ′ .
Suppose we knew that our motion followed a certain pattern, or that certain objects in the
scene moved in a certain way. This would allow us to further constrain our motion beyond
the simple constraints that we imposed earlier (smoothness, brightness constancy, etc.). For
example, we know that objects closer to the camera move more on the camera image plane
than objects further away from the camera, for the same amount of motion. Thus, if we
know (or determine/approximate) the depth of a region of points, we can enforce another
constraint on their motion and get better flow fields.
To discuss this further, we need to return to some concepts from basic Newtonian physics.
Namely, the fact that a point rotating about some origin with a rotational velocity ω that
also has a translational velocity t has a total velocity of: v = ω × r + t.
In order to figure out how the point is moving in the image, we need to convert from these
values in world space (X, Y, Z) to image space (x, y). We’ve seen this before many times.
The perspective projection of these values in world space equates to (f X/Z , f Y /Z) in the image,
where f is the focal length. How does this relate to the velocity, though? Well the velocity
is just the derivative of the position. Thus, we can take the derivative of the image space
coordinates to see, for a world-space velocity V :4
$$u = v_x = f\,\frac{Z V_x - X V_z}{Z^2} = f\frac{V_x}{Z} - f\frac{X}{Z}\cdot\frac{V_z}{Z} = \frac{f V_x - x V_z}{Z}$$
$$v = v_y = f\,\frac{Z V_y - Y V_z}{Z^2} = f\frac{V_y}{Z} - f\frac{Y}{Z}\cdot\frac{V_z}{Z} = \frac{f V_y - y V_z}{Z}$$
This is still kind of ugly, but we can “matrixify” it into a much cleaner equation that isolates
terms into things we know and things we don’t:
$$\begin{bmatrix} u(x, y) \\ v(x, y) \end{bmatrix} = \frac{1}{Z(x, y)}\,A(x, y)\,\mathbf{t} + B(x, y)\,\boldsymbol{\omega}$$
Where t is the (unknown) translation vector and ω is the unknown rotation, and A and B
are defined as such:
$$A(x, y) = \begin{bmatrix} -f & 0 & x \\ 0 & -f & y \end{bmatrix}
\qquad
B(x, y) = \begin{bmatrix} (xy)/f & -(f + x^2)/f & y \\ (f + y^2)/f & -(xy)/f & -x \end{bmatrix}$$
The beauty of this arrangement is that A and B are functions of things we know, and they
relate our world-space vectors t and ω to image space. This is the general motion model.
4
Recall the quotient rule for derivatives: $\frac{d}{dx}\frac{f(x)}{g(x)} = \frac{f'(x)g(x) - g'(x)f(x)}{g(x)^2}$.
We can see that the depth in world space, Z(x, y), only impacts the translational term. This
corresponds to our understanding of parallax motion, in that the further away from a
camera an object is, the less we perceive it to move for the same amount of motion.
Perspective For points lying on a plane viewed under perspective projection, the general model reduces to an 8-parameter approximation:
u(x, y) = a1 + a2 x + a3 y + a7 x² + a8 xy
v(x, y) = a4 + a5 x + a6 y + a7 xy + a8 y²
Orthographic On the other hand, if our plane lies at a sufficient distance from the camera,
the distance between points on the plane is minuscule relative to that distance. As we
learned when discussing orthographic projection (a model in which the z coordinate is
dropped entirely with no transformation), we can effectively disregard depth in this case.
The simplified pair of equations, needing 3 correspondence points to solve, is then:
u(x, y) = a1 + a2 x + a3 y (10.2)
v(x, y) = a4 + a5 x + a6 y (10.3)
We can then substitute this motion model back into the brightness constancy constraint, Ix u + Iy v + It = 0.
This is actually a relaxation on our constraint, even though the math gets more complicated.
Now we can do the same least squares minimization we did before for the Lucas-Kanade
method but allow an affine deformation for our points:
$$\operatorname{Err}(\mathbf{a}) = \sum \big[I_x(a_1 + a_2 x + a_3 y) + I_y(a_4 + a_5 x + a_6 y) + I_t\big]^2$$
Much like before, we get a nasty system of equations and minimize its result:
$$\begin{bmatrix}
I_x & I_x x_1 & I_x y_1 & I_y & I_y x_1 & I_y y_1 \\
I_x & I_x x_2 & I_x y_2 & I_y & I_y x_2 & I_y y_2 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
I_x & I_x x_n & I_x y_n & I_y & I_y x_n & I_y y_n
\end{bmatrix}
\cdot
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_6 \end{bmatrix}
= -\begin{bmatrix} I_{t_1} \\ I_{t_2} \\ \vdots \\ I_{t_n} \end{bmatrix}$$
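Setting up and solving that system for one patch is a few lines of NumPy (the derivatives are assumed precomputed, as before):

```python
import numpy as np

def affine_flow_params(Ix, Iy, It):
    """Least-squares fit of the 6 affine flow parameters a1..a6 over a patch,
    using the over-constrained system above."""
    h, w = Ix.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ix, iy, it = Ix.ravel(), Iy.ravel(), It.ravel()
    x, y = xs.ravel(), ys.ravel()

    # One row per pixel: [Ix, Ix*x, Ix*y, Iy, Iy*x, Iy*y] . a = -It
    A = np.stack([ix, ix * x, ix * y, iy, iy * x, iy * y], axis=1)
    a, *_ = np.linalg.lstsq(A, -it, rcond=None)
    return a    # u(x,y) = a1 + a2 x + a3 y,  v(x,y) = a4 + a5 x + a6 y
```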
How does the math work? First, get an approximation of the “local flow” via Lucas-Kanade
or other methods. Then, segment that estimation by identifying large leaps in the flow field.
Finally, use the motion model to fit each segment and identify coherent segments. Visually,
this is demonstrated below:
(a) The true motion flow, u(x, y). (b) The local flow estimation. (c) Segmenting the local flow estimation.
The implementation difficulties lie in identifying segments and clustering them appropriately
to fit the intended motion models.
10.3 Tracking
What’s the point of dedicating a separate section to tracking? We’ve already discussed
finding the flow between two images; can’t we replicate this process for an entire sequence
and track objects as they move? Well, as we saw, there are a lot of limitations of Lucas-
Kanade and other image-to-image motion approximations.
• It’s not always possible to compute optic flow. The Lucas-Kanade method needs a lot
of stars to align to properly determine the motion field.
• There could be large displacements of objects across images if they are moving rapidly.
Lucas-Kanade falls apart given a sufficiently large displacement (even when we apply
improvements like hierarchical LK). Hence, we probably need to take dynamics into
account; this is where we’ll spend a lot of our focus initially.
• Errors are compounded over time. When we discussed frame interpolation, we tried to
minimize the error drift by recalculating the motion field at each step, but that isn’t
always possible. If we only rely on the optic flow, eventually the compounded errors
would track things that no longer exist.
• Objects getting occluded cause optic flow models to freak out. Similarly, when those
objects appear again (called disocclusions), it’s hard to reconcile and reassociate
things. Is this an object we lost previously, or a new object to track?
We somewhat incorporated dynamics when we discussed using Motion Models with Lucas-
Kanade: we expected points to move along an affine transformation. We could improve this
further by combining this with Feature Recognition, identifying good features to track and
likewise fitting motion models to them.6 This is good, but only gets us so far.
Instead, we will focus on tracking with dynamics, which approaches the problem differ-
ently: given a model of expected motion, we should be able to predict the next frame without
actually seeing it. We can then use that frame to adjust the dynamics model accordingly.
This integration of dynamics is the differentiator between feature detection and tracking: in
detection, we detect objects independently in each frame, whereas with tracking we predict
where objects will be in the next frame using an estimated dynamic model.
The benefit of this approach is that the trajectory model restricts the necessary search space
for the object, and it also improves estimates due to reduced measurement noise due to the
smoothness of the expected trajectory model.
As usual, we need to make some fundamental assumptions to simplify our model and con-
struct a mathematical framework for continuous motion. In essence, we’ll be expecting small,
gradual change in pose between the camera and the scene. Specifically:
• Unlike small children, who have no concept of object permanence, we assume that
objects do not spontaneously appear or disappear in different places in the scene.
• Similarly, we assume that the camera does not move instantaneously to a new view-
point, which would cause a massive perceived shift in scene dynamics.
Feature tracking is a multidisciplinary problem that isn’t exclusive to computer vision. There
are elements of engineering, physics, and robotics at play. Thus, we need to take a detour
into state dynamics and estimation in order to model the dynamics of an image sequence.
Tracking as Inference
Let’s begin our detour by establishing the terms in our inference system. We have our
hidden state, X, which is made up of the true parameters that we care about. We have our
measurement, Z, which is a noisy observation of the underlying state. At each time step t,
the real state changes from Xt−1 → Xt , resulting in a new noisy observation Zt .
6
Refer to the original paper, “Good Features to Track,” for more (link).
Our prediction is an estimate of the new state given only the old observations, Pr [Xt | Z0 = z0 , . . . , Zt−1 = zt−1 ]. Our correction, then, is an updated estimate of the state after introducing a new observation
Zt = zt :
Pr [Xt | Z0 = z0 , Z1 = z1 , . . . , Zt−1 = zt−1 , Zt = zt ]
We can say that tracking is the process of propagating the posterior distribution of state
given measurements across time. We will again make some assumptions to simplify the
probability distributions:
• We will assume that we live in a Markovian world in that only the immediate past
matters with regards to the actual hidden state:
$$\Pr\big[X_t \mid X_0, \ldots, X_{t-1}\big] = \Pr\big[X_t \mid X_{t-1}\big]$$
This is called the dynamics model.
• Similarly, we assume that the current measurement depends only on the current state:
$$\Pr\big[Z_t \mid X_0, \ldots, X_t, Z_0, \ldots, Z_{t-1}\big] = \Pr\big[Z_t \mid X_t\big]$$
This is called the observation model, and much like the small motion constraint in
Lucas-Kanade, this is the most suspect assumption. Thankfully, we won't be exploring
relaxations to this assumption, but one example of such a model is conditional random
fields, if you'd like to explore further.
These assumptions are represented graphically in Figure 10.7. Readers with experience in
statistical modeling or machine learning will notice that this is a hidden Markov model.
Figure 10.7: The hidden Markov model behind our assumptions: a chain of hidden states X1 → X2 → · · · → Xn , each emitting its own measurement Zi .
Tracking as Induction
Another way to view tracking is as an inductive process: if we know Xt , we can apply
induction to get Xt+1 .
As with any induction, we begin with our base case: this is our initial prior knowledge that
predicts a state in the absence of any evidence: Pr [X0 ]. At the very first frame, we correct
this given Z0 = z0 . After that, we can just keep iterating: given a corrected estimate for
frame t, predict then correct frame t + 1.
Making Predictions
Alright, we can finally get into the math.
Given: Pr [Xt−1 | z0 , . . . , zt−1 ]
Guess: Pr [Xt | z0 , . . . , zt−1 ]
To solve that, we can apply the law of total probability and marginalization if we
imagine we’re working with the joint set Xt ∩ Xt−1 .7 Then:
$$\begin{aligned}
\Pr\big[X_t \mid z_0, \ldots, z_{t-1}\big]
&= \int \Pr\big[X_t, X_{t-1} \mid z_0, \ldots, z_{t-1}\big]\, dX_{t-1} \\
&= \int \Pr\big[X_t \mid X_{t-1}, z_0, \ldots, z_{t-1}\big]\,\Pr\big[X_{t-1} \mid z_0, \ldots, z_{t-1}\big]\, dX_{t-1} \\
&= \int \Pr\big[X_t \mid X_{t-1}\big]\,\Pr\big[X_{t-1} \mid z_0, \ldots, z_{t-1}\big]\, dX_{t-1}
&& \text{(independence assumption from the dynamics model)}
\end{aligned}$$
To explain this equation in English, what we’re saying is that the likelihood of being at
a particular spot (this is Xt ) depends on the probability of being at that spot given that
we were at some previous spot weighed by the probability of that previous spot actually
happening (our corrected estimate for Xt−1 ). Summing over all of the possible “previous
spots” (that is, the integral over Xt−1 ) gives us the marginalized distribution of Xt .
7
Specifically, the law of total probability states that if we have a joint set A ∩ B and we know all
of the probabilities in B, we can get Pr [A] if we sum over all of the probabilities in B. Formally,
$\Pr[A] = \sum_n \Pr[A, B_n] = \sum_n \Pr[A \mid B_n]\,\Pr[B_n]$. For the latter equivalence, recall that Pr [U, V ] =
Pr [U | V ] Pr [V ]; this is the conditioning property.
In our working example, Xt is part of the same probability space as Xt−1 (and all of the Xi s that came
before it), so we can apply the law, using the integral instead of the sum.
Making Corrections
Now, given a predicted value Pr [Xt | z0 , . . . , zt−1 ] and the current observation zt , we want to
compute Pr [Xt | z0 , . . . , zt−1 , zt ], essentially folding in the new measurement:8
$$\begin{aligned}
\Pr\big[X_t \mid z_0, \ldots, z_{t-1}, z_t\big]
&= \frac{\Pr\big[z_t \mid X_t, z_0, \ldots, z_{t-1}\big]\cdot\Pr\big[X_t \mid z_0, \ldots, z_{t-1}\big]}{\Pr\big[z_t \mid z_0, \ldots, z_{t-1}\big]} && (10.4) \\
&= \frac{\Pr\big[z_t \mid X_t\big]\cdot\Pr\big[X_t \mid z_0, \ldots, z_{t-1}\big]}{\Pr\big[z_t \mid z_0, \ldots, z_{t-1}\big]} && \text{(independence assumption from the observation model)} \\
&= \frac{\Pr\big[z_t \mid X_t\big]\cdot\Pr\big[X_t \mid z_0, \ldots, z_{t-1}\big]}{\int \Pr\big[z_t \mid X_t\big]\,\Pr\big[X_t \mid z_0, \ldots, z_{t-1}\big]\, dX_t} && \text{(conditioning on } X_t\text{)}
\end{aligned}$$
As we’ll see, the scary-looking denominator is just a normalization factor that ensures the
probabilities sum to 1, so we’ll never really need to worry about it explicitly.
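To see the prediction and correction equations in action without any of the Gaussian machinery that follows, here is a tiny discrete (histogram) version of one time step; the transition and likelihood numbers are made up purely for illustration:

```python
import numpy as np

def bayes_filter_step(belief, transition, likelihood):
    """One predict/correct cycle over a discrete state space.
    belief[i]        = Pr[X_{t-1} = i | z_0..z_{t-1}]   (previous corrected estimate)
    transition[i, j] = Pr[X_t = j | X_{t-1} = i]        (dynamics model)
    likelihood[j]    = Pr[z_t | X_t = j]                (observation model)"""
    predicted = belief @ transition       # marginalize over X_{t-1}
    corrected = likelihood * predicted    # fold in the new measurement
    return corrected / corrected.sum()    # the denominator: just normalization

# Tiny example: 3 positions, the object tends to move right, and the sensor
# likelihood peaks at the middle position.
belief = np.array([1.0, 0.0, 0.0])
transition = np.array([[0.2, 0.8, 0.0],
                       [0.0, 0.2, 0.8],
                       [0.0, 0.0, 1.0]])
likelihood = np.array([0.1, 0.8, 0.1])
print(bayes_filter_step(belief, transition, likelihood))  # most mass on the middle state
```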
Summary
We’ve developed the probabilistic model for predicting and subsequently correcting our state
based on some observations. Now we can dive into actual analytical models that apply these
mathematics and do tracking.
The Kalman filter assumes a linear dynamics model: the next state is a linear transformation Dt of the previous state, plus some zero-mean Gaussian process noise with covariance Σdt , so that xt ∼ N (Dt xt−1 , Σdt ). Notice the subscript on Dt and dt : with this, we indicate that the transformation itself may
change over time. Perhaps, for example, the object is moving with some velocity, and then
starts rotating. In our examples, though, these terms will stay constant.
We also have a linear measurement model, which describes our observations. Specifically
(and unsurprisingly), the model says that the measurement zt is a linear transformation Mt of the state,
plus some level of Gaussian measurement noise:
$$z_t \sim N\big(M_t x_t,\, \Sigma_{m_t}\big)$$
8
We are applying Bayes' rule here, which states that $\Pr[A \mid B] = \frac{\Pr[B \mid A]\cdot\Pr[A]}{\Pr[B]}$.
9
It’s easy to get lost in the math. Remember, “Gaussian noise” is just a standard bell curve.
10
This is “capital sigma,” not a summation. It symbolizes the variance (std. dev squared): Σ = σ 2 .
M is also sometimes called the extraction matrix because its purpose is to extract the
measurable data from a state (or a “state-like” matrix).
pt = pt−1 + (∆t)vt−1 + ε
vt = vt−1 + ξ
As we love doing, we can express these in matrix form to get our linear dynamics
model:
$$x_t = D_t x_{t-1} + \text{noise} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}\begin{bmatrix} p_{t-1} \\ v_{t-1} \end{bmatrix} + \text{noise}$$
What about measurement? Well, suppose we can only measure position. Then our
linear measurement model is:
$$z_t = m_t x_t + \text{noise} = \begin{bmatrix} 1 & 0 \end{bmatrix}\begin{bmatrix} p_t \\ v_t \end{bmatrix} + \text{noise}$$
Simple stuff! Notice that we mostly defined these model parameters ourselves.
We could have added acceleration to our dynamics model if we expected it. The
“observer” defines the dynamics model to fit their scenario.
Before we continue, we need to introduce some standard notation to track things. In our
predicted state, Pr [Xt | z0 , . . . , zt−1 ], we say that the mean and standard deviation of the
resulting Gaussian distribution are µt− and σt− . For our corrected state, Pr [Xt | z0 , . . . , zt−1 , zt ],
we similarly say that the mean and standard deviation are µt+ and σt+ .
Prediction
Our simple linear dynamics model defines a state as a constant times its previous
state, with some noise added in to indicate uncertainty:
$$X_t \sim N\big(d\,X_{t-1},\, \sigma_d^2\big)$$
The distribution for the next predicted state, then, is also a Gaussian, so we can
simply update the mean and variance accordingly. Given:
$$\Pr\big[X_t \mid z_0, \ldots, z_{t-1}\big] = N\big(\mu_t^-, (\sigma_t^-)^2\big)$$
Update the mean: $\mu_t^- = d\,\mu_{t-1}^+$
Update the variance: $(\sigma_t^-)^2 = \sigma_d^2 + \big(d\,\sigma_{t-1}^+\big)^2$
The mean of a Gaussian distribution that has been multiplied by a constant is just
likewise multiplied by that constant. The variance, though, is both multiplied by
that constant squared and we need to introduce some additional noise to account
for uncertainty in our prediction.
Correction
Similarly, our mapping of states to measurements relies on a constant, m:
$$z_t \sim N\big(m\,X_t,\, \sigma_m^2\big)$$
Under linear, Gaussian dynamics and measurements, the Kalman filter defines the
corrected distribution (the simplified Equation 10.4) as a new Gaussian:
$$\Pr\big[X_t \mid z_0, \ldots, z_{t-1}, z_t\big] \equiv N\big(\mu_t^+, (\sigma_t^+)^2\big)$$
Intuition Let's get an intuitive understanding of what this new mean, µt+ , really
is. First, we divide the entire thing by m² to "unsimplify." We get this mess:
$$\mu_t^+ = \frac{\mu_t^-\,\frac{\sigma_m^2}{m^2} + \frac{z_t}{m}\,(\sigma_t^-)^2}{\frac{\sigma_m^2}{m^2} + (\sigma_t^-)^2} \tag{10.5}$$
The previous example gave us an important insight that applies to the Kalman filter regard-
less of the dimensionality we’re working with. Specifically, that our corrected distribution
for Xt is a weighted average of the prediction (i.e. based on all prior measurements except
zt ) and the measurement guess (i.e. the former with zt incorporated).
Let’s take the equation from (10.5) and substitute a for the measurement variance and b for
the prediction variance. We get:
$$\mu_t^+ = \frac{a\,\mu_t^- + b\,\frac{z_t}{m}}{a + b}$$
We can do some manipulation (add $b\mu_t^- - b\mu_t^-$ to the top and factor) to get:
$$\begin{aligned}
\mu_t^+ &= \frac{(a + b)\,\mu_t^- + b\big(\frac{z_t}{m} - \mu_t^-\big)}{a + b} \\
&= \mu_t^- + \frac{b}{a + b}\Big(\frac{z_t}{m} - \mu_t^-\Big) \\
&= \mu_t^- + k\Big(\frac{z_t}{m} - \mu_t^-\Big)
\end{aligned}$$
Where k is known as the Kalman gain. What does this expression tell us? Well, the new
mean µ+t is the old predicted mean plus a weighted “residual”: the difference between the
measurement and the prediction (in other words, how wrong the prediction was).
Predict:
$$x_t^- = D_t x_{t-1}^+ \qquad\qquad \Sigma_t^- = D_t \Sigma_{t-1}^+ D_t^\top + \Sigma_{d_t}$$
Correct:
$$K_t = \Sigma_t^- M_t^\top \big(M_t \Sigma_t^- M_t^\top + \Sigma_{m_t}\big)^{-1}$$
$$x_t^+ = x_t^- + K_t\big(z_t - M_t x_t^-\big) \qquad\qquad \Sigma_t^+ = \big(I - K_t M_t\big)\,\Sigma_t^-$$
We now have a Kalman gain matrix, Kt . As our estimate covariance approaches zero (i.e. con-
fidence in our prediction grows), the residual gets less weight from the gain matrix. Similarly,
if our measurement covariance approaches zero (i.e. confidence in our measurement grows),
the residual gets more weight.
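Those two boxes translate almost line-for-line into NumPy. The constant-velocity model, noise covariances, and measurements below are arbitrary illustration values:

```python
import numpy as np

def kalman_step(x, P, z, D, M, Q, R):
    """One predict/correct cycle of the Kalman filter equations above.
    x, P : previous corrected state mean and covariance (x_{t-1}^+, Sigma_{t-1}^+)
    z    : new measurement
    D, M : dynamics and measurement (extraction) matrices
    Q, R : process noise (Sigma_d) and measurement noise (Sigma_m) covariances"""
    # Predict.
    x_pred = D @ x
    P_pred = D @ P @ D.T + Q

    # Correct.
    K = P_pred @ M.T @ np.linalg.inv(M @ P_pred @ M.T + R)   # Kalman gain
    x_new = x_pred + K @ (z - M @ x_pred)                    # weighted residual
    P_new = (np.eye(len(x)) - K @ M) @ P_pred
    return x_new, P_new

# 1-D constant-velocity model: state = [position, velocity], we only measure position.
dt = 1.0
D = np.array([[1.0, dt], [0.0, 1.0]])
M = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.5]])

x, P = np.zeros(2), np.eye(2)
for z in [1.1, 2.0, 2.9, 4.2]:                  # noisy position readings
    x, P = kalman_step(x, P, np.array([z]), D, M, Q, R)
print(x)                                         # estimated [position, velocity]
```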
Summary
The Kalman filter is an effective tracking method due to its simplicity, efficiency, and com-
pactness. Of course, it does impose some fairly strict requirements and has significant pitfalls
for that same reason. The fact that the tracking state is always represented by a Gaussian
creates some huge limitations: such a unimodal distribution means we only really have one
true hypothesis for where the object is. If the object does not strictly adhere to our linear
model, things fall apart rather quickly.
We know that a fundamental concept in probability is that as we get more information, cer-
tainty increases. This is why the Kalman filter works: with each new measured observation,
we can derive a more confident estimate for the new state. Unfortunately, though, “always
being more certain” doesn’t hold the same way in the real world as it does in the Kalman
filter. We’ve seen that the variance decreases with each correction step, narrowing the Gaus-
sian. Does that always hold, intuitively? We may be more sure about the distribution, but
not necessarily the variance within that distribution. Consider the following extreme case
that demonstrates the pitfalls of the Kalman filter.
In Figure 10.8, we have our prior distribution and our measurement. Intuitively, where
should the corrected distribution go? When the measurement and prediction are far apart,
we would think that we can’t trust either of them very much. We can count on the truth
being between them, sure, and that it’s probably closer to the measurement. Beyond that,
though, we can’t be sure. We wouldn’t have a very high peak and our variance may not
change much. In contrast, as we see in Figure 10.8, Kalman is very confident about its
corrected prediction.
This is one of its pitfalls.
Figure 10.8: One of the flaws of the Kalman model is that it is always
more confident in its distribution, resulting in a tighter Gaussian. In this
figure, the red Gaussian is what the Kalman filter calculates, whereas
the blue-green Gaussian may be a more accurate representation of our
intuitive confidence about the truth. As you can see, Kalman is way
more sure than we are.
Another downside of the Kalman filter is this restriction to linear models for dynamics.
There are extensions that alleviate this problem called extended Kalman filters (EKFs), but
it’s still a limitation worth noting.
More importantly, though, is this Gaussian model of noise. If the real world doesn’t match
with a Gaussian noise model, Kalman struggles. What can we do to alleviate this? Perhaps
we can actually determine (or at least approximate) the noise distribution as we track?
We’ll also be introducing the notion of perturbation into our dynamics model. Previously,
we had a linear dynamics model that only consisted of our predictions based on previous
state. Perturbation – also called control – allows us to modify the dynamics by some known
model. By convention, perturbation is an input to our model using the parameter u.
Bayes Filters
The framework for our first particle filtering approach relies on some given quantities:
• As before, we need somewhere to start. This is our prior distribution, Pr [X0 ]. We
may be very unsure about it, but it must exist.
• Since we’ve added perturbation to our dynamics model, we now refer to it as an action
model. We need that, too:
Pr [xt | ut , xt−1 ]
Note: We (consistently, unlike lecture) use ut for inputs occurring between the states xt−1 and xt .
• We additionally need the sensor model. This gives us the likelihood of our mea-
surements given some object location: Pr [z | X]. In other words, how likely are our
measurements given that we’re at a location X. It is not a distribution of possible
object locations based on a sensor reading.
• Finally, we need our stream of observations, z, and our known action data, u:
data = {u1 , z2 , . . . , ut , zt }
Given these quantities, what we want is the estimate of X at time t, just like before; this is
the posterior of the state, or belief :
Bel(xt ) = Pr [xt | u1 , z2 , . . . , ut , zt ]
The assumptions in our probabilistic model are represented graphically in Figure 10.9, and
result in the following simplifications:11
$$\Pr[z_t \mid x_{0:t}, z_{1:t-1}, u_{1:t}] = \Pr[z_t \mid x_t]$$
$$\Pr[x_t \mid x_{0:t-1}, z_{1:t-1}, u_{1:t}] = \Pr[x_t \mid x_{t-1}, u_t]$$
In English, the probability of the current measurement, given all of the past states, mea-
surements, and inputs only actually depends on the current state. This is sometimes called
sensor independence. Second: the probability of the current state – again given all of the
goodies from the past – actually only depends on the previous state and the current input.
This Markovian assumption is akin to the independence assumption in the dynamics model
from before.
Figure 10.9: The graphical model relating the inputs ut , states xt , and measurements zt over time.
As a reminder, Bayes' Rule (described more in footnote 8) can also be viewed as a proportionality (η is the normalization factor that ensures the probabilities sum to one): $\Pr[a \mid b] = \eta \Pr[b \mid a]\,\Pr[a]$.
With that, we can apply our given values and manipulate our belief function to get something more useful. Graphically, what we're doing is shown in Figure 10.10 (and again, more
visually, in Figure 10.11), but mathematically:
11 The notation $n_{a:b}$ represents a range; it's shorthand for $n_a, n_{a+1}, \ldots, n_b$.
Bel(xt ) = Pr [xt | u1 , z2 , . . . , ut , zt ]
= ηPr [zt | xt , u1 , z2 , . . . , ut ] Pr [xt | u1 , z2 , . . . , ut ] Bayes’ Rule
This results in our final, beautiful recursive relationship between the previous belief and the
next belief based on the sensor likelihood:
$$\text{Bel}(x_t) = \eta\,\Pr[z_t \mid x_t] \int \Pr[x_t \mid x_{t-1}, u_t] \cdot \text{Bel}(x_{t-1})\, dx_{t-1} \tag{10.8}$$
We can see that there is an inductive relationship between beliefs. The two pieces of (10.8)
correspond to the calculations we did with the Kalman filter: the integral over the previous
belief first gives us the prediction distribution before our latest measurement, and the leading
factor then folds in the actual measurement, which is described by the sensor likelihood model from before.
With the mathematics out of the way, we can focus on the basic particle filtering algorithm.
It’s formalized in algorithm 10.3 and demonstrated graphically in Figure 10.11, but let’s
walk through the process informally.
We want to generate a certain number of samples (n new particles) from an existing distribu-
tion, given an additional input and measurement (these are St−1 , ut , and zt respectively). To
do that, we need to first choose a particle from our old distribution, which has some position
and weight. Thus, we can say $p_j = \langle x_{t-1,j}, w_{t-1,j}\rangle$. From that particle, we can incorporate
the control and create a new distribution using our action model: $\Pr[x_t \mid u_t, x_{t-1,j}]$. Then,
we sample from that distribution, getting our new particle state $x_i$. We need to calculate
the significance of this sampled particle, so we run it through our sensor model to reweigh
it: $w_i = \Pr[z_t \mid x_i]$. Finally, we update our normalization factor to keep our probabilities
consistent and add it to the set of new particles, St .
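Here's a rough numpy sketch of that sampling loop, assuming you supply your own action_model and sensor_model callables (their names and signatures are placeholders of mine, not anything defined in these notes):

```python
import numpy as np

def particle_filter_step(particles, weights, u, z, action_model, sensor_model, rng):
    """One update of a basic particle filter (choose, predict, reweigh, normalize).

    particles    : (n, d) array of states from the previous timestep
    weights      : (n,) normalized particle weights
    action_model : function (state, u, rng) -> sampled next state
    sensor_model : function (z, state) -> likelihood Pr[z | state]
    """
    n = len(particles)
    # Choose old particles in proportion to their weights.
    idx = rng.choice(n, size=n, p=weights)

    new_particles = np.empty_like(particles)
    new_weights = np.empty(n)
    for j, i in enumerate(idx):
        # Incorporate the control and sample from the action model.
        new_particles[j] = action_model(particles[i], u, rng)
        # Reweigh the sample by how well it explains the measurement.
        new_weights[j] = sensor_model(z, new_particles[j])

    # Keep the weights a valid probability distribution.
    new_weights /= new_weights.sum()
    return new_particles, new_weights
```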
Practical Considerations
Unfortunately, math is one thing but reality is another. We need to take some careful
considerations when applying particle filtering algorithms to real-world problems.
Sampling Method We need a lot of particles to sample the underlying distribution with
relative accuracy. Every timestep, we need to generate a completely new set of samples after
working all of our new information into our estimated distribution. As such, the efficiency
or algorithmic complexity of our sampling method is very important.
We can view the most straightforward sampling method as a direction on a roulette wheel,
as in Figure 10.12a. Our list of weights covers a particular range, and we choose a value in
that range. To figure out which weight that value refers to, we’d need to perform a binary
search. This gives a total O(n log n) runtime. Ideally, though, sampling runtime should grow
linearly with the number of samples!
As a clever optimization, we can use the systematic resampling algorithm (also called
stochastic universal sampling), described formally in algorithm 10.4. Instead of viewing
the weights as a roulette wheel, we view it as a wagon wheel. We plop down our “spokes” at a
random orientation, as in Figure 10.12b. The spokes are 1/n distance apart and determining
their weights is just a matter of traversing the distance between each spoke, achieving O(n)
linear time for sampling!
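A minimal sketch of that wagon-wheel idea, assuming the weights are already normalized to sum to one:

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic (stochastic universal) resampling in O(n).

    Drops n evenly-spaced "spokes" (1/n apart) at a random offset and walks the
    cumulative weights once, returning the chosen particle indices.
    """
    n = len(weights)
    spokes = (rng.random() + np.arange(n)) / n   # n points, 1/n apart
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0                         # guard against round-off
    indices = np.empty(n, dtype=int)

    i = j = 0
    while i < n:
        if spokes[i] < cumulative[j]:
            indices[i] = j       # this spoke lands in particle j's slice
            i += 1
        else:
            j += 1               # advance along the wheel
    return indices

# e.g. idx = systematic_resample(weights, np.random.default_rng(0))
```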
Sampling Frequency We can add another optimization to lower the frequency of sam-
pling. Intuitively, when would we even want to resample? Probably when the estimated
distribution has changed significantly from our initial set of samples. Specifically, we only
need to resample when there is significant variance in the particle weights; otherwise, we can
just reuse the samples.
Highly peaked observations What happens to our particle distribution if we have in-
credibly high confidence in our observation? It’ll nullify a large number of particles by giving
them zero weight. That’s not very good, since it wipes out all possibility of other predictions.
To avoid this, we want to intentionally add noise to both our action and sensor models.
In fact, we can even smooth out the distribution of our samples by applying a Kalman filter
to them individually: we can imagine each sample as being a tiny little Gaussian rather
than a discrete point in the state space. In general, overestimating noise reduces the number
of required samples and avoids overconfidence; let the measurements focus on increasing
certainty.
Recovery from failure Remember our assumption regarding object permanence, in which
we stated that objects wouldn’t spontaneously (dis)appear? If that were to happen, our
distributions have no way of handling that case because there are unlikely to be any particles
corresponding to the new object. To correct for this, we can apply some randomly distributed
particles every step in order to catch any outliers.
Tracking Contours
Suppose we wanted to track a hand, which is a fairly complex object. The hand is moving
(2 degrees of freedom: x and y) and rotating (1 degree of freedom: θ), but it also can change
shape. Using principal component analysis, a topic covered soon in chapter 11, we can
encode its shape and get a total of 12 degrees of freedom in our state space; that requires a
looot of particles.
Figure 10.14: A contour and its normals. High-contrast features (i.e. edges) are sought out along these normals.
. . . which looks an awful lot like a Gaussian; it's proportional to the distance to the nearest
strong edge. We can then use this Gaussian as our sensor model and track hands reliably.
Figure 10.15: Using edge detection and contours to track hand move-
ment.
Other Models
In general, you can use any model as long as you can compose all of the aforementioned
requirements for particle filters: we need an object state, a way to make predictions, and a
sensor model.
10.3.5 Mean-Shift
The mean-shift algorithm tries to find the modes of a probability distribution; this dis-
tribution is often represented discretely by a number of samples as we’ve seen. Visually, the
algorithm looks something like Figure 10.17 below.
This visual example hand-waves away a few things, such as what shape defines the region
of interest (here it’s a circle) and how big it should be, but it gets the point across. At
each step (from blue to red to finally cyan), we calculate the mean, or center of mass of
the region of interest. This results in a mean-shift vector from the region’s center to the
center of mass, which we follow to draw a new region of interest, repeating this process until
the mean-shift vector gets arbitrarily small.
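As a sketch of that mode-seeking loop (with a Gaussian kernel standing in for the circular region of interest; the bandwidth is an assumed parameter):

```python
import numpy as np

def mean_shift_mode(samples, start, bandwidth=1.0, tol=1e-3, max_iters=100):
    """Follow the mean-shift vector to a mode of a sampled distribution.

    samples : (n, d) points drawn from the underlying distribution
    start   : (d,) initial center of the region of interest
    """
    center = np.asarray(start, dtype=float)
    for _ in range(max_iters):
        # Weigh every sample by a Gaussian of its distance to the current center.
        d2 = np.sum((samples - center) ** 2, axis=1)
        w = np.exp(-0.5 * d2 / bandwidth**2)
        # The new center is the weighted mean (center of mass) of the samples.
        new_center = (w[:, None] * samples).sum(axis=0) / w.sum()
        # Stop once the mean-shift vector is arbitrarily small.
        if np.linalg.norm(new_center - center) < tol:
            break
        center = new_center
    return center
```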
So how does this relate to tracking?
Well, our methodology is pretty similar to before. We start with a pre-defined model in the
first frame. As before, this can be expressed in a variety of ways, but it may be easiest to
imagine it as an image patch and a location. In the following frame, we search for a region
that most closely matches that model within some neighborhood based on some similarity
function. Then, the new maximum becomes the starting point for the next frame.
What truly makes this mean-shift tracking is the model and similarity functions that we use.
In mean-shift, we use a feature space which is the quantized color space. This means
we create a histogram of the RGB values based on some discretization of each channel (for
example, quantizing each channel to 4 levels, i.e. 2 bits, results in a 64-bin histogram). Our model is then this
histogram interpreted as a probability distribution function; this is the region we are going
to track.
Let’s work through the math. We start with a target model with some histogram centered
at 0. It’s represented by q and contains m bins; since we are interpreting it as a probability
distribution, it also needs to be normalized (sum to 1):
$$q = \{q_u\}_{u \in [1..m]} \qquad \sum_{u=1}^{m} q_u = 1$$
We also have some target candidate centered at the point y with its own color distribution, $p(y) = \{p_u(y)\}_{u \in [1..m]}$.
We need a similarity function f (y) to compute the difference between these two distributions
now; maximizing this function will render the “best” candidate location: f (y) = f [q, p(y)].
Similarity Functions
There are a large variety of similarity functions such as min-value or Chi squared, but the
one used in mean-shift tracking is called the Bhattacharyya coefficient. First, we change
the distributions by taking their element-wise square roots:
$$q' = \left(\sqrt{q_1}, \sqrt{q_2}, \ldots, \sqrt{q_m}\right)$$
$$p'(y) = \left(\sqrt{p_1(y)}, \sqrt{p_2(y)}, \ldots, \sqrt{p_m(y)}\right)$$
Then, the Bhattacharyya relationship is defined as the sum of the products of these new
distributions:
$$f(y) = \sum_{u=1}^{m} p'_u(y)\, q'_u \tag{10.9}$$
Well isn't the sum of element-wise products the definition of the vector dot product? We
can thus also express this as: $f(y) = p'(y)^T q'$.
But since by design these vectors are magnitude 1 (remember, we are treating them as
probability distributions), the Bhattacharyya coefficient essentially uses the cos θy between
these two vectors as a similarity comparison value.
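A small sketch of building the quantized color model and comparing two models with the Bhattacharyya coefficient; the 4-levels-per-channel quantization matches the 64-bin example above but is otherwise an arbitrary choice:

```python
import numpy as np

def color_histogram(patch, bins_per_channel=4):
    """Quantized RGB histogram of an (h, w, 3) uint8 patch, normalized to sum to 1."""
    quantized = (patch // (256 // bins_per_channel)).reshape(-1, 3)
    # Joint bin index: one bin per (r, g, b) cell of the quantized color space.
    idx = (quantized[:, 0] * bins_per_channel + quantized[:, 1]) * bins_per_channel + quantized[:, 2]
    hist = np.bincount(idx, minlength=bins_per_channel**3).astype(float)
    return hist / hist.sum()

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized histograms (Equation 10.9)."""
    return np.sum(np.sqrt(p) * np.sqrt(q))
```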
Kernel Choices
Now that we have a suitable similarity function, we also need to define what region we’ll be
using to calculate it. Recall that in Figure 10.17 we used a circular region for simplicity.
This was a fixed region with a hard drop-off at the edge; mathematically, this was a uniform
kernel:
$$K_U(x) = \begin{cases} c & \|x\| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
Ideally, we’d use something with some better mathematical properties. Let’s use something
that’s differentiable, isotropic, and monotonically decreasing. Does that sound like anyone
we’ve gotten to know really well over the last 154 pages?
That’s right, it’s the Gaussian. Here, it’s expressed with a constant falloff, but we can, as
we know, also have a “scale factor” σ to control that:
$$K_N(x) = c \cdot \exp\left(-\frac{1}{2}\|x\|^2\right)$$
The most important property of a Gaussian kernel over a uniform kernel is that it’s differ-
entiable. The spread of the Gaussian means that new points introduced to the kernel as we
slide it along the image have a very small weight that slowly increases; similarly, points in
the center of the kernel have a constant weight that slowly decreases. We would see the most
weight change along the slope of the bell curve.
We can leverage the Gaussian’s differentiability and use its gradient to see how the overall
similarity function changes as we move. With the gradient, we can actually optimally “hill
climb” the similarity function and find its local maximum rather than blindly searching the
neighborhood.
This is the big idea in mean-shift tracking: the similarity function helps us determine the
new frame’s center of mass, and the search space is reduced by following the kernel’s gradient
along the similarity function.
Disadvantages
Much like the Kalman Filter from before, the biggest downside of using mean-shift as an
exclusive tracking mechanism is that it operates on a single hypothesis of the “best next
point.”
A convenient way to get around this problem while still leveraging the power of mean-shift
tracking is to use it as the sensor model in a particle filter; we treat the mean-shift tracking
algorithm as a measurement likelihood (from before, Pr [z | X]).
errors as well as false positives. Deformations and particularly occlusions cause them
to struggle.
Sensors and Dynamics How do we determine our dynamics and sensor models?
If we learned the dynamics model from real data (a difficult task), we wouldn’t even
need the model in the first place! We’d know in advance how things move. We could
instead learn it from “clean data,” which would be more empirical and an approximation
of the real data. We could also use domain knowledge to specify a dynamics model.
For example, a security camera pointed at a street can reasonably expect pedestrians
to move up and down the sidewalks.
The sensor model is much more finicky. We do need some sense of absolute truth to
rely on (even if it’s noisy). This could be the reliability of a sonar sensor for distance,
a preconfigured camera distance and depth, or other reliable truths.
Prediction vs. Correction Remember the fundamental trade-off in our Kalman Filter:
we needed to decide on the relative level of noise in the measurement (correction)
vs. the noise in the process (prediction). If one is too strong, we will ignore the other.
Getting this balance right unfortunately just requires a bit of magic and guesswork
based on any existing data.
Data Association We often aren’t tracking just one thing in a scene, and it’s often not
a simple scene. How do we know, then, which measurements are associated with
which objects? And how do we know which measurements are the result of visual
clutter? The camouflage techniques we see in nature (and warfare) are designed to
intentionally introduce this kind of clutter so that even our vision systems have trouble
with detection and tracking. Thus, we need to reliably associate relevant data with
the state.
The simple strategy is to only pay attention to measurements that are closest to the
prediction. Recall when tracking hand contours (see Figure A.3a) we relied on the
“nearest high-contrast features,” as if we knew those were truly the ones we were
looking for.
There is a more sophisticated approach, though, which relies on keeping multiple hy-
potheses. We can even use particle filtering for this: each particle becomes a hypothe-
sis about the state. Over time, it becomes clear which particles correspond to clutter,
which correspond to our interesting object of choice, and we can even determine when
new objects have emerged and begin to track those independently.
Drift As errors in each component accumulate and compound over time, we run the risk of
drift in our tracking.
One method to alleviate this problem is to update our models over time. For example,
we could introduce an α factor that incorporates a blending of our “best match” over
time with a simple linear interpolation:
Model(t) = αBest(t) + (1 − α)Model(t − 1) (10.10)
There are still risks with this adaptive tracking method: if we blend in too much noise
into our sensor model, we’ll eventually be tracking something completely unrelated to
the original template.
10.3.7 Conclusion
That ends our discussion of tracking. The notion we introduced of tracking state over time
comes up in computer vision a lot. This isn’t image processing: things change often!
We introduced probabilistic models to solve this problem. Kalman Filters and Mean-Shift
were methods that rendered a single hypothesis for the next best state, while Particle Filters
maintained multiple hypotheses and converged on a state over time.
Recognition
Recognizing isn’t at all like seeing; the two often don’t even agree.
— Sten Nadolny, The Discovery of Slowness
Until this chapter, we've been working in a “semantic-free” vision environment. Though
we have detected “interesting” features, developed motion models, and even tracked
objects, we have had no understanding of what those objects actually are. Yes, we
had Template Matching early on in chapter 2, but that was an inflexible, hard-coded, and
purely mathematical approach of determining the existence of a template in an image. In
this chapter, we’ll instead be focusing on a more human-like view of recognition. Our aim will
be to dissect a scene and label (in English) various objects within it or describe it somehow.
There are a few main forms of recognition: verification of the class of an object (“Is this a
lamp?”), detection of a class of objects (“Are there any lamps?”), and identification of a
specific instance of a class (“Is that the Eiffel Tower?”). More generally, we can also consider
object categorization and label specific general areas in a scene (“There are trees here,
and some buildings here.”) without necessarily identifying individual instances. Even more
generally, we may want to describe an image as a whole (“This is outdoors.”)
We’ll primarily be focusing on generic object categorization (“Find the cars,” rather than
“Find Aaron’s car”). This task can be presented as such:
Given a (small) number of training images as examples of a category, recognize
“a-priori” (previously) unknown instances of that category and assign the correct
category label.
This falls under a large class of problems that machine learning attempts to solve, and many
of the methods we will discuss in this chapter are applications of general machine learning
algorithms. The “small” aspect of this task has not gotten much focus in the machine learning
community at large; the massive datasets used to train recognition models present a stark
contrast to the handful of examples a human may need to reliably identify and recognize an
object class.
Figure 11.1: Hot dogs or legs? You might be able to differentiate them
instantly, but how can we teach a computer to do that? (Image Source)
Categorization
Immediately, we have a problem. Any object can be classified in dozens of ways, often
following some hierarchy. Which category should we take? Should a hot dog be labeled as
food, as a sandwich, as protein, as an inanimate object, or even just as something red-ish?
This is a manifestation of prototype theory, tackled by psychologists and cognitive scientists.
There are many writings on the topic, the most impactful of which was Rosch and Lloyd’s
Cognition and categorization, namely the Principles of Categorization chapter, which drew
a few conclusions from the standpoint of human cognition. The basic level category for an
object can be based on. . .
• the highest level at which category members have a similar perceived shape. For
example, a German shepherd would likely be labeled that, rather than as a dog or a
mammal, which have more shape variation.
• the highest level at which a single mental image can be formed that reflects the entire
category.
• the level at which humans identify category members fastest.
• the first level named and understood by children.
• the highest level at which a person uses similar motor actions for interaction with
category members. For example, the set of actions done with a dog (petting, brushing)
is very different than the set of actions for all animals.
reaction time has robustly shown us that humans respond faster to whether or not
something is a dog than whether or not something is an animal.
This might make sense intuitively; the “search space” of animals is far greater than
that of dogs, so it should take longer. Having scientific evidence to back that intuition
up is invaluable, though, and showing that such categories must somehow exist—
and even have measurable differences—is a significant insight into human cognition.
Even if we use some of these basic level categories, what scope are we dealing with? Psychologists
have given a range of 10,000 to 30,000 categories for human cognition. This gives
us an idea of scale when dealing with label quantities for recognition.
Challenges
Why is this a hard problem? There are many multivariate factors that influence how a
particular object is perceived, but we often don’t have trouble working around these factors
and still labeling it successfully.
Factors like illumination, pose, occlusions, and visual clutter all affect the way an object
appears. Furthermore, objects don’t exist in isolation. They are part of a larger scene full of
other objects that may conflict, occlude, or otherwise interact with one another. Finally, the
computational complexity (millions of pixels, thousands of categories, and dozens of degrees
of freedom) of recognition is daunting.
State-of-the-Art
Things that seemed impossible years ago are commonplace now: character and handwriting
recognition on checks, envelopes, and license plates; fingerprint scans; face detection; and
even recognition of flat, textured objects (via the SIFT Detector) are all relatively solved
problems.
The current cutting edge is far more advanced. As demonstrated in Figure 11.2, individual
components of an image are being identified and labeled independently. It is unlikely that
thousands of photos of dogs wearing Mexican hats exist and could be used for training, so
dissecting this composition is an impressive result.
As we’ve mentioned, these modern techniques of strong label recognition are really machine
learning algorithms applied to patterns of pixel intensities. Instead of diving into these
directly (which is a topic more suited to courses on machine learning and deep learning),
we’ll discuss the general principles of what are called generative vs. discriminative methods
We can say that there is some feature vector x that describes or measures the character.
At the best possible decision boundary (i.e. right in the middle of Figure 11.3), either label
choice will yield the same expected loss.
If we picked the class “four” at that boundary, the expected loss is the probability that it's
actually a 9 times the cost of labeling a 9 as a 4, plus the probability that it's actually a 4 times
the cost of labeling a 4 as a 4 (which we assumed earlier to be zero: L (4 → 4) = 0). In other
words:
= Pr [class is 9 | x] · L (9 → 4) + Pr [class is 4 | x] · L (4 → 4)
= Pr [class is 9 | x] · L (9 → 4)
Similarly, the expected loss of picking “nine” at the boundary is based on the probability
that it was actually a 4:
= Pr [class is 4 | x] · L (4 → 9)
At the boundary, these two must be equal:
$$\Pr[\text{class is } 9 \mid x] \cdot L(9 \to 4) = \Pr[\text{class is } 4 \mid x] \cdot L(4 \to 9)$$
With this in mind, classifying a new point becomes easy. Given its feature vector k, we
choose the class with the lowest expected loss. We would choose “four” if:
Pr [4 | k] · L (4 → 9) > Pr [9 | k] · L (9 → 4)
How can we apply this in practice? Well, our training set encodes the loss; perhaps it’s
different for certain incorrect classifications, but it could also be uniform. The difficulty
$$\Pr[x = 6 \mid \text{skin}] = 0.5 \qquad \Pr[x = 6 \mid \neg\text{skin}] = 0.1$$
Is this enough information to determine whether or not p is a skin pixel? In general, can we
say that p is a skin pixel if Pr [x | skin] > Pr [x | ¬skin]?
No!
Remember, this is the likelihood that the hue is some value already given whether or not
it’s skin. To extend the intuition behind this, suppose the likelihoods were equal. Does that
suggest we can’t tell if it’s a skin pixel? Again, absolutely not. Why? Because any given
pixel is much more likely to not be skin in the first place! Thus, if their hue likelihoods are
equal, it’s most likely not skin because “not-skin pixels” are way more common.
What we really want is the probability of a pixel being skin given a hue: Pr [skin | x], which
isn’t described in our model. As we did in Particle Filters, we can apply Bayes’ Rule, which
expresses exactly what we just described: the probability of a pixel being skin given a hue is
proportional to the probability of that hue representing skin AND the probability of a pixel
being skin in the first place (see Equation 10.6).
Unfortunately, we don’t know the prior, Pr [skin], . . . but we can assume it’s some constant,
Ω. Given enough training data marked as being skin,1 we can measure that prior Ω (hopefully
at the same time as when we measured the histogram likelihoods from before in Figure 11.4).
Then, the binary decision can be made based on the measured Pr [skin | x].
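As a sketch of that decision, assuming we've already measured the two likelihood histograms and the prior from labeled training data (the prior value 0.1 in the comment below is made up purely for illustration):

```python
import numpy as np

def classify_skin(hue, skin_hist, not_skin_hist, prior_skin):
    """Decide skin vs. not-skin for a pixel's hue using Bayes' Rule.

    skin_hist, not_skin_hist : measured likelihoods Pr[x | skin], Pr[x | not skin]
    prior_skin               : measured prior Pr[skin] (the constant from training data)
    """
    # The posterior is proportional to likelihood times prior; the normalizer
    # cancels when we just compare the two hypotheses.
    p_skin = skin_hist[hue] * prior_skin
    p_not = not_skin_hist[hue] * (1.0 - prior_skin)
    return p_skin > p_not

# With the example numbers above and an assumed prior of Pr[skin] = 0.1:
# 0.5 * 0.1 = 0.05, while 0.1 * 0.9 = 0.09, so the pixel is labeled not-skin
# even though its hue is more likely under the skin model.
```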
Generalizing the Generative Model We’ve been working with a binary decision. How
do we generalize this to an arbitrary number of classes? For a given measured feature vector
x and a set of classes ci , choose the best c∗ by maximizing the posterior probability:
$$c^* = \arg\max_c \Pr[c \mid x] = \arg\max_c \Pr[c]\,\Pr[x \mid c]$$
Using Gaussians to create a parameterized version of the training data allows us to express
the probability density much more compactly.
Ackshually. . .
Principal components point in the direction of maximum variance from the mean of
the points. Mathematically, though, we express this as being a direction from the
origin. This means that in the nitty-gritty, we translate the points so that their mean
is at the origin, do our PCA math, then translate everything back. Conceptually (and
graphically), though, we will express the principal component relative to the mean.
We’ll define this more rigorously soon, but let’s explain the “direction of maximum variance”
a little more intuitively. We know that the variance describes how “spread out” the
data is; more specifically, it measures the average of the squared differences from the mean.
In 2 dimensions, the difference from the mean (the target in Figure 11.6) for a point is
expressed by its distance. The direction of maximum variance, then, is the line that most
accurately describes the direction of spread. The second line, then, describes the direction
of the remaining spread relative to the first line.
Figure 11.6: The first principal component and its subsequent orthog-
onal principal component.
Okay, that may not make a lot of sense yet, but perhaps diving a little deeper into the math
will explain further. You may have noticed that the first principal component in Figure 11.6
resembles the line of best fit. We’ve seen this sort of scenario before. . . Remember when we
discussed Error Functions, and how we came to the conclusion that minimizing perpendicular
distance was more stable than measuring vertical distance when trying to determine whether
or not a descriptor is a true match to a feature?
We leveraged the expression of a line as a normal vector and a distance from the origin (in
case you forgot how we expressed that, it’s replicated again in Figure 11.7). Then, our error
function was:
$$E(a, b, d) = \sum_i (ax_i + by_i - d)^2$$
We did some derivations on this perpendicular least-squares fitting—which were omitted for
Figure 11.7: A line expressed by its unit normal $\hat{n} = (a, b)^T$ and its distance from the origin.
Figure 11.8: The projection of the point P onto the vector x̂.
Through a basic understanding of the Pythagorean theorem, we can see that minimizing the
distance from P to the line described by x̂ is the same thing as maximizing the projection
x̂T P (where x̂ is expressed as a column vector). If we extend this to all of our data points, we
want to maximize the sum of the squares of the projections of those points onto the line
described by the principal component. We can express this in matrix form through some
Notice that we expect, under the correct solution, for 1 − xT x = 0. We take the partial
derivative of our new error function and set it equal to 0:
$$\frac{\partial E'}{\partial x} = 2Mx + 2\lambda x$$
$$0 = 2Mx + 2\lambda x$$
$$Mx = \lambda x \qquad \text{($\lambda$ is any constant, so it absorbs the negative)}$$
Figure 11.9: Collapsing a set of points to their principal component,
Êλ1 . The points can now be represented by a coefficient—a scalar of the
principal component unit vector.
Collapsing our example set of points from two dimensions to one doesn’t seem like that big of
a deal, but this idea can be extended to however many dimensions we want. Unless the data
is uniformly random, there will be directions of maximum variance; collapsing things along
them (for however many principal components we feel are necessary to accurately describe
the data) still results in massive dimensionality savings.
Suppose each data point is now n-dimensional; that is, x is an n-element column vector.
Then, to acquire the maximum direction of projection, v̂, we follow the same pattern as
before, taking the sum of the magnitudes of the projections:
$$\mathrm{var}(\hat{v}) = \sum_x \left\| \hat{v}^T (x - \bar{x}) \right\|$$
$$\mathrm{var}(v) = v^T A v$$
4 Remember, x is a column vector. Multiplying a column vector n by its transpose, $nn^T$, is the definition of the outer product and results in a matrix. This is in contrast to multiplying a row vector by its transpose, which is the definition of the dot product and results in a scalar.
As before, the eigenvector with the largest eigenvalue λ captures the most variation among
the training vectors x.
With this background in mind, we can now explore applying principal component analysis
to images.
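Before we do, here's a tiny numpy sketch of PCA on a plain point cloud, just to tie the math above to something concrete: center the data, build the covariance matrix, and keep the eigenvectors with the largest eigenvalues.

```python
import numpy as np

def pca(points, k=1):
    """Return the mean and the top-k principal components of an (n, d) point cloud."""
    mean = points.mean(axis=0)
    centered = points - mean                    # translate the mean to the origin
    cov = centered.T @ centered / len(points)   # covariance matrix of the data
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]       # keep the k largest
    return mean, eigvecs[:, order]

# Collapsing 2D points onto their first principal component:
# mean, components = pca(points, k=1)
# coefficients = (points - mean) @ components   # one scalar per point
```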
The eigenvectors found by PCA would've been the blue vectors; the red ones aren't
even perpendicular! Intuitively, we can imagine the points above the blue line
“cancelling out” the ones below it, effectively making their average in-between them.
Principal component analysis works best when training data comes from a single
class. There are other ways to identify classes that are not orthogonal such as ICA,
or independent component analysis.
Let’s treat an image as a one-dimensional vector. A 100×100 pixel image has 10,000 elements;
it’s a 10,000-dimensional space! But how many of those vectors correspond to valid face
images? Probably much less. . . we want to effectively model the subspace of face images
from the general vector space of 100×100 images.
Specifically, we want to construct a low-dimensional (PCA anyone?) linear (so we can do
dot products and such) subspace that best explains the variation in the set of face images;
we can call this the face space.
Figure 11.10: Cyan points indicate faces, while orange points are “non-
faces.” We want to capture the black principal component.
The projection of a data point $x_i$ onto a candidate direction $u$ (relative to the mean $\mu$) is $u(x_i) = u^T(x_i - \mu)$.
So what is the direction vector û that captures most of the variance? We’ll use the same
expressions as before, maximizing the variance of the projected data:
$$\mathrm{var}(u) = \frac{1}{M}\sum_{i=1}^{M} \underbrace{u^T(x_i - \mu)}_{\text{projection of } x_i}\,\left(u^T(x_i - \mu)\right)^T$$
$$= u^T \underbrace{\left[\frac{1}{M}\sum_{i=1}^{M} (x_i - \mu)(x_i - \mu)^T\right]}_{\text{covariance matrix of the data}} u \qquad \text{($u$ doesn't depend on $i$)}$$
$$= \hat{u}^T \Sigma \hat{u} \qquad \text{(don't forget $\hat{u}$ is a unit vector)}$$
Thus, the variance of the projected data is var(û) = ûT Σû, and the direction of maximum
variance is then determined by the eigenvector of Σ with the largest eigenvalue. Naturally,
then, the top k orthogonal directions that capture the residual variance (i.e. the principal
components) correspond to the top k eigenvalues.
Let Φi be a known face image I with the mean image subtracted. Remember, this is a
d-length column vector for a massive d.
Define C to be the average squared magnitude of the face images:
$$C = \frac{1}{M}\sum_{i \in M} \Phi_i \Phi_i^T = AA^T$$
Instead, consider the eigenvectors $v_i$ of the much smaller $M \times M$ matrix $A^T A$:
$$A^T A v_i = \lambda v_i$$
Left-multiplying both sides by $A$ gives $AA^T(Av_i) = \lambda(Av_i)$.
We see that Avi are the eigenvectors of C = AAT , which we previously couldn’t feasibly
compute directly! This is the dimensionality trick that makes PCA possible: we created
the eigenvectors of C without needing to compute them directly.
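A minimal numpy sketch of that trick (the function name and the use of numpy.linalg.eigh are my choices, not from the paper):

```python
import numpy as np

def eigenfaces(faces, k):
    """Compute the top-k eigenfaces using the A^T A trick.

    faces : (M, d) matrix of flattened face images (M images, d pixels, M << d)
    Rather than eigendecomposing the d x d matrix A A^T, we decompose the much
    smaller M x M matrix A^T A and map its eigenvectors back through A.
    """
    mean = faces.mean(axis=0)
    A = (faces - mean).T                    # d x M matrix of mean-subtracted faces
    small = A.T @ A                         # M x M -- cheap to eigendecompose
    eigvals, v = np.linalg.eigh(small)
    order = np.argsort(eigvals)[::-1][:k]   # keep the k largest
    U = A @ v[:, order]                     # A v_i are eigenvectors of A A^T
    U /= np.linalg.norm(U, axis=0)          # normalize each eigenface
    return mean, U                          # mean face and d x k eigenface basis
```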
How Many Eigenvectors Are There? Intuition suggests M , but recall that we always
subtract out the mean which centers our component at the origin. This means (heh) that
if we have a point along it on one side, we know it has a buddy on the other side by just
flipping the coordinates. This removes a degree of freedom, so there are M − 1 eigenvectors
in general.
Yeah, I know this is super hand-wavey.
Just go with it. . .
11.2.3 Eigenfaces
The idea of eigenfaces and a face space was pioneered in this 1991 paper. The assumption
is that most faces lie in a low-dimensional subspace of the general image space, determined
by some k ≪ d directions of maximum variance.
Using PCA, we can determine the eigenvectors that track the most variance in the “face
space:” u1 , u2 , . . . , uk . Then any face image can be represented (well, approximated) by a
linear combination of these “eigenfaces;” the coefficients of the linear combination can be
determined easily by a dot product.
We can see in Figure 11.12, as we’d expect, that the “average face” from the training set has
an ambiguous gender, and that eigenfaces roughly represent variation in faces. Notice that
Figure 11.11: Data from Turk & Pentland's paper, “Eigenfaces for Recognition.” (a) A subset of the training images—isolated faces with variations in lighting, expression, and angle. (b) The resulting top 64 principal components.
the detail in each eigenface increases as we go down the list of variance. We can even notice
the kind of variation some of the eigenfaces account for. For example, the 2nd through 4th
eigenfaces are faces lit from different angles (right, top, and bottom).
If we take the average face and add one of the eigenfaces to
it, we can see the impact the eigenface has in Figure 11.13.
Adding the second component causes the resulting face to be
lit from the right, whereas subtracting it causes it to be lit
from the left.
Figure 11.12: The mean face from the training data.
We can easily convert a face image to a series of eigenvector coefficients by taking its dot
product with each eigenface after subtracting the mean face:
$$w_i = u_i^T(x - \mu) \tag{11.1}$$
This vector of weights is now the entire representation of the face. We’ve reduced the face
from some n × n image to a k-length vector. To reconstruct the face, then, we say that the
reconstructed face is the mean face plus the linear combination of the eigenfaces and the
weights:
x̂ = µ + w1 u1 + w2 u2 + . . .
Naturally, the more eigenfaces we keep, the closer the reconstruction x̂ is to the original
face x. We can actually leverage this as an error function: if x − x̂ is low, the thing we are
detecting is probably actually a face. Otherwise, we may have treated some random patch
as a face. This may come in handy for detection and tracking soon. . .
Optionally, we can use our above error function to determine whether or not the novel
image is a face. Then, we classify the face as whatever our closest training face was in our
k-dimensional subspace.
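Putting the pieces together, a rough sketch of projection, reconstruction, and nearest-neighbor recognition in face space; the face_threshold is an assumed tuning parameter:

```python
import numpy as np

def project(face, mean, U):
    """Eigenface coefficients of a flattened face: dot products with each eigenface."""
    return U.T @ (face - mean)

def reconstruct(weights, mean, U):
    """Mean face plus the linear combination of eigenfaces and weights."""
    return mean + U @ weights

def recognize(face, mean, U, train_weights, train_labels, face_threshold):
    """Reject non-faces by reconstruction error, then label by the nearest training face."""
    w = project(face, mean, U)
    error = np.linalg.norm(face - reconstruct(w, mean, U))
    if error > face_threshold:
        return None                          # probably not a face at all
    nearest = np.argmin(np.linalg.norm(train_weights - w, axis=1))
    return train_labels[nearest]
```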
11.2.4 Limitations
Principal component analysis is obviously not a perfect solution; Figure 11.14 visualizes some
of its problems.
With regards to faces, the use of the dot product means our projection requires precise align-
ment between our input images and our eigenfaces. If the eyes are off-center, for example,
the element-wise comparison would result in a poor mapping and an even poorer reconstruc-
tion. Furthermore, if the training data or novel image is not tightly cropped, background
variation impacts the weight vector.
In general, though, the direction of maximum variance is not always good for classification;
in Figure 11.14b, we see that the red points and the blue points lie along the same principal
component. If we performed PCA first, we would collapse both sets of points into our single
component, effectively treating them as the same class.
Finally, non-linear divisions between classes naturally can’t be captured by this basic model.
Our dynamics model, Pr [Lt | Lt−1 ], uses a simple motion model with no velocity.5 Our ob-
servation model, Pr [Ft | Lt ], is an approximation on an eigenbasis.6 If we can reconstruct
a patch well from our basis set, that renders a high probability.
Dynamics Model
The dynamics model is actually quite simple: each parameter’s probability is independently
distributed along its previous value perturbed by a Gaussian noise function. Mathematically,
this means:7
Given some state Lt , we are saying that the next state could vary in any of these parameters,
but the likelihood of some variation is inversely proportional to the amount of variation. In
other words, it’s way more likely to move a little than a lot. This introduces a massive amount
of possibilities for our state (remember, particle filters don’t scale well with dimension).
5 The lectures refer to this as a “Brownian motion model,” which can be described as each particle simply vibrating around. Check out the Wikipedia page or this more formal introduction for more if you're really that interested in what that means; it won't come up again.
6 An eigenbasis is a matrix made up of eigenvectors that form a basis. Our “face space” from before was an eigenbasis; this isn't a new concept, just a new term.
7 The notation N (a; b, c) states that the distribution for a is a Gaussian around b with the variance c.
Figure 11.15: A handful of the possible next-states (in red) from the
current state Lt (in cyan). The farther Lt+1 is from Lt , the less likely
it is.
Observation Model
We’ll be using probabilistic principal component analysis to model our image observation
process.
Given some location Lt , assume the observed frame was generated from the eigenbasis. How
well, then, could we have generated that frame from our eigenbasis? The probability of
observing some z given the eigenbasis B and mean µ is
$$\Pr[z \mid B] = \mathcal{N}\left(z;\ \mu,\ BB^T + \varepsilon I\right)$$
This is our “distance from face space.” Let’s explore the math a little further. We see that
our Gaussian is distributed around the mean µ, since that’s the most likely face. Its “spread”
is determined by the covariance matrix of our eigenbasis, BBT , (i.e. variance within face
space) and a bit of additive Gaussian noise, εI that allows us to cover some “face-like space”
as well.
Notice that as $\varepsilon \to 0$, we are left with a likelihood of being purely in face space. Taking this
limit and expanding the Gaussian function N gives us:
$$\Pr[z \mid B] \propto \exp\Big(-\big\| \underbrace{(z - \mu)}_{\text{map to face space}} - \underbrace{BB^T(z - \mu)}_{\text{reconstruction}} \big\|^2\Big)$$
where the quantity inside the norm is the reconstruction error.
The likelihood is proportional to the magnitude of error between the mapping of z into face
space and its subsequent reconstruction. Why is that the reconstruction, exactly? Well
BT (z − µ) results in a vector of the dot product of each eigenvector with our observation z
pulled into face space by µ. As we saw in Equation 11.1, this is just the coefficient vector
(call it γ) of z.
Thus, Bγ is the subsequent reconstruction, and our measurement model gives the likelihood
of an observation z fitting our eigenbasis.
Incremental Learning
This has all been more or less “review” thus far; we’ve just defined dynamics and observation
models with our newfound knowledge of PCA. The cool part comes now, where we allow
incremental updates to our object model.
The blending introduced in Equation 10.10 for updating particle filter models had two ex-
tremes. If α = 0, we eventually can’t track the target because our model has deviated too
far from reality. Similarly, if α = 1, we eventually can’t track the target because our model
has incorporated too much environmental noise.
Instead, what we’ll do is compute a new eigenbasis Bt+1 from our previous eigenbasis Bt and
the new observation wt . We still track the general class, but have allowed some flexibility
based on our new observed instance of that class. This is called an incremental subspace
update and is doable in real time.8
Handling Occlusions
Our observation images have read the Declaration of Independence: all pixels are created
equal. Unfortunately, though, this means occlusions may introduce massive (incorrect!)
changes to our eigenbasis.
We can ask ourselves a question during our normal, un-occluded tracking: which pixels are
we reconstructing well? We should probably weigh these accordingly. . . we have a lot more
confidence in their validity. Given an observation It , and assuming there is no occlusion with
8 Matrix Computations, 3rd ed. outlines an algorithm using recursive singular value decomposition that allows us to do these eigenbasis updates quickly. [link]
$$D_i = W_i \otimes (I_t - BB^T I_t)$$
$$W_{i+1} = \exp\left(-\frac{D_i^2}{\sigma^2}\right)$$
This allows us to weigh each pixel individually based on how accurate its reconstruction has
been over time.
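A direct transcription of those two updates as a numpy sketch (treating the observed patch as a flattened vector; σ here is an assumed noise scale):

```python
import numpy as np

def update_pixel_weights(I_t, B, W, sigma):
    """Down-weight pixels that the eigenbasis reconstructs poorly (e.g. occlusions).

    I_t : (d,) flattened observed patch,  B : (d, k) eigenbasis,  W : (d,) current weights
    """
    residual = I_t - B @ (B.T @ I_t)     # per-pixel reconstruction error
    D = W * residual                     # element-wise product with current confidence
    return np.exp(-(D ** 2) / sigma ** 2)   # new per-pixel weights
```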
• All mistakes have the same cost. The only thing we are concerned with is getting the
label right; getting it wrong has the same cost, regardless of “how wrong” it is.
• We need to construct a representation or description of our class instance, but we don’t
know a priori (in advance) which of our features are representative of the class label.
Our model will need to glean the diagnostic features.
Building a Representation
Let’s tackle the first part: representation. Suppose we have two classes: koalas and pandas.
One of the simplest ways to describe their classes may be to generate a color (or grayscale)
histogram of their image content. Naturally, our biggest assumption is that our images are
made up of mostly the class we’re describing.
Figure 11.16: Histograms for some training images for two classes:
koalas and pandas.
There is clear overlap across the histograms for koalas, and likewise for pandas. Is that
sufficient in discriminating between the two? Unfortunately, no! Color-based descriptions
are extremely sensitive to both illumination and intra-class appearance variations.
How about feature points? Recall our discussion of Harris Corners and how we found that
regions with high gradients and high gradient direction variance (like corners) were robust to
changes in illumination. What if we considered these edge, contour, and intensity gradients?
Even still, small shifts and rotations make our edge images look completely different (from an
overlay perspective). We can divide the image into pieces and describe the local distributions
of gradients with a histogram.
This has the benefit of being locally order-less, offering us invariance in our aforementioned
small shifts and rotations. If, for example, the top-left ear moves in one of the pandas, the
local gradient distribution will still stay the same. We can also perform contrast normal-
ization to try to correct for illumination differences.
This is just one form of determining some sort of feature representation of the training data.
In fact, this is probably the most important form; analytically determining an approximate
way to differentiate between classes is the crux of a discriminative classifier. Without a good
model, our classifier will have a hard time finding good decision boundaries.
Train a Classifier
We have an idea of how to build our representation, now how can we use our feature vectors
(which are just flattened out 1 × n vectors of our descriptions) to do classification? Given
our feature vectors describing pandas and describing koalas, we need to learn to differentiate
between them.
To keep things simple, we’ll stick to a binary classifier: we’ll have koalas and non-koalas,
cars and non-cars, etc. There are a massive number of discriminative classification techniques
recently developed in machine learning and applied to computer vision: nearest-neighbor,
neural networks, support vector machines (SVMs), boosting, and more. We’ll discuss each
of these in turn shortly.
training example and label it accordingly. Each point corresponds to a Voronoi partition
which discretizes the space based on the distance from that point.
Nearest-neighbor is incredibly easy to write,
but is not ideal. It’s very data intensive: we
need to remember all of our training examples.
It’s computationally expensive: even with a kd-
tree (we’ve touched on these before; see 1 for
more), searching for the nearest neighbor takes
a bit of time. Most importantly, though, it sim-
ply does not work that well.
We can use the k-nearest neighbor variant
to make it so that a single training example
doesn't dominate its region too much. The process is just as simple: the k nearest neighbors
“vote” to classify the new point, and majority rules. Surprisingly, this small modification
works really well. We still have the problem of this process being data-intensive, but this
notion of a loose consensus is very powerful and effective.
Figure 11.19: Classification of the negative (in black) and positive (in red) classes. The partitioning is a Voronoi diagram.
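A brute-force sketch of that k-nearest-neighbor voting (a kd-tree or similar structure would replace the linear scan in practice):

```python
import numpy as np

def knn_classify(x, train_X, train_y, k=5):
    """Label a feature vector by a majority vote of its k nearest training examples."""
    distances = np.linalg.norm(train_X - x, axis=1)   # distance to every training example
    nearest = np.argsort(distances)[:k]
    votes = train_y[nearest]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]                  # majority rules
```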
Let's look at some more sophisticated discriminative methods.
11.4.3 Boosting
This section introduces an iterative learning method: boosting. The basic idea is to look
at a weighted training error on each iteration.
Initially, we weigh all of our training examples equally. Then, in each “boosting round,” we
find the weak learner (more in a moment) that achieves the lowest weighted training error.
Following that, we raise the weights of the training examples that were misclassified by the
current weak learner. In essence, we say, “learn these better next time.” Finally, we combine
the weak learners from each step in a simple linear fashion to end up with our final classifier.
A weak learner, simply put, is a function that partitions our space. It doesn’t necessarily
have to give us the “right answer,” but it does give us some information. Figure 11.20 tries
to visually develop an intuition for weak learners.
The final classifier is a linear combination of the weak learners, with each learner's weight directly
proportional to its accuracy. The exact formulas depend on the “boosting scheme;” one
of them is called AdaBoost; we won’t dive into it in detail, but a simplified version of the
algorithm is described in algorithm 11.1.
Figure 11.20: (a) Our set of training examples and the first boundary guess. (b) Reweighing the error examples and trying another boundary. (c) The final classifier is a combination of the weak learners.
There were a few big ideas that made this detector so accurate and efficient:
• Brightness patterns were represented with efficiently-computable “rectangular” features
within a window of interest. These features were essentially large-scale gradient filters
or Haar wavelets.
Figure 11.21: (a) These are some “rectangular filters;” the filter takes the sum of the pixels in the light area and subtracts the sum of the pixels in the dark area. (b) These filters were applied to different regions of the image; “features” were the resulting differences.
The reason why this method was so effective was because it was incredibly efficient
to compute if performed many times to the same image. It leveraged the integral
image, which, at some (x, y) pixel location, stores the sum of all of the pixels spatially
before it. For example, the pixel at (40, 30) would contain the sum of the pixels in the
quadrant from (0, 0) to (40, 30):
(Diagram: a rectangle sits inside the quadrant ending at (40, 30); the integral-image values at its corners are labeled D at the top-left, B at the top-right, C at the bottom-left, and A at the bottom-right.)
Why is this useful? Well with the rectangular filters, we wanted to find the sum of
the pixels within an arbitrary rectangular region: (A, B, C, D). What is its sum with
respect to the integral image? It’s simply A − B − C + D, as above.
Once we have the integral image, this is only 3 additions to compute the sum of any
size rectangle. This gives us really efficient handling of scaling as a bonus, as well:
instead of scaling the images to find faces, we can just scale the features, recomputing
them efficiently; see the sketch after this list.
• We use a boosted combination of these filter results as features. With such efficiency, we
can look at an absurd amount of features quickly. The paper used 180,000 features
associated with a 24×24 window. These features were then run through the boosting
process in order to find the best linear combination of features for discriminating faces.
The top two weak learners were actually the filters shown in Figure 11.21b. If you look
carefully, the first appears useful in differentiating eyes (a darker patch above/below a
brighter one) and dividing the face in half.
• We’ll formulate a cascade of these classifiers to reject clear negatives rapidly. Even if
our filters are blazing fast to compute, there are still a lot of possible windows to search
if we have just a 24×24 sliding window. How can we make detection more efficient?
Well. . . almost everywhere in an image is not a face. Ideally, then, we could reduce
detection time significantly if we found all of the areas in the image that were definitely
not a face.
Each stage has no false negatives: you can be completely sure that any rejected sub-
window was not a face. All of the positives (even the false ones) go into the training
process for the next classifier. We only apply the sliding window detection if a sub-
window made it through the entire cascade.
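Here's the integral-image sketch referenced in the first bullet above; the (top, left, bottom, right) convention and inclusive indexing are my own choices for illustration:

```python
import numpy as np

def integral_image(img):
    """Each entry holds the sum of all pixels above and to the left of it (inclusive)."""
    return np.asarray(img, dtype=np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in the inclusive rectangle [top..bottom] x [left..right].

    Only a handful of additions/subtractions, regardless of the rectangle's size.
    """
    A = ii[bottom, right]
    B = ii[top - 1, right] if top > 0 else 0
    C = ii[bottom, left - 1] if left > 0 else 0
    D = ii[top - 1, left - 1] if (top > 0 and left > 0) else 0
    return A - B - C + D
```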
To summarize this incredibly impactful technique, the Viola-Jones detector used rectangular
features optimized by the integral image, AdaBoost for feature selection, and a cascading set
of classifiers to discard true negatives quickly. Due to the massive feature space (180,000+
filters), training is very slow, but detection can be done in real-time. Its results are phenom-
enal and variants of it are in use today commercially.
Algorithm 11.1: A simplified boosting (AdaBoost-style) training process.

Input: X, Y: M positive and negative training samples with their corresponding labels.
Input: H: a weak classifier type.
Result: A boosted classifier, H*.

/* Initialize a uniform weight distribution. */
ŵ = [1/M, 1/M, ...]
t = some small threshold value close to 0
foreach training stage j ∈ [1..n] do
    ŵ = w / ‖w‖
    /* Instantiate and train a weak learner for the current weights. */
    h_j = H(X, Y, ŵ)
    /* The error is the sum of the weights of the incorrect training predictions. */
    ε_j = Σ_i w_i   for all w_i ∈ ŵ where h_j(X_i) ≠ Y_i
    α_j = (1/2) ln((1 − ε_j) / ε_j)
    /* Update the weights only if the error is large enough. */
    if ε_j > t then
        w_i = w_i · exp(−Y_i α_j h_j(X_i))   for all w_i ∈ w
    else
        break
    end
end
/* The final boosted classifier is the sum of each h_j weighed by its corresponding α_j.
   Prediction on an observation x is then simply: */
H*(x) = sign( Σ_j α_j h_j(x) )
return H*
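For the curious, here's a rough Python translation of algorithm 11.1, using a depth-1 scikit-learn decision tree as the weak learner H; that choice, the ±1 label convention, and the exact stopping rule are assumptions of mine, not prescribed by the notes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, Y, n_stages=50):
    """Simplified AdaBoost. X: (M, d) features; Y: (M,) labels in {-1, +1}."""
    M = len(Y)
    w = np.full(M, 1.0 / M)                    # uniform initial weight distribution
    learners, alphas = [], []
    for _ in range(n_stages):
        w = w / w.sum()                        # keep the weights normalized
        # Train a weak learner (a depth-1 "decision stump") on the weighted data.
        h = DecisionTreeClassifier(max_depth=1).fit(X, Y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != Y].sum()               # weighted training error
        if eps < 1e-10:                        # error is essentially zero: stop boosting
            learners.append(h)
            alphas.append(1.0)
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        learners.append(h)
        alphas.append(alpha)
        w = w * np.exp(-Y * alpha * pred)      # raise weights of misclassified examples
    return learners, np.array(alphas)

def adaboost_predict(x, learners, alphas):
    """Sign of the alpha-weighted sum of the weak learners' votes."""
    votes = np.array([h.predict(np.atleast_2d(x))[0] for h in learners])
    return np.sign(alphas @ votes)
```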
Figure 11.24: Finding the boundary line between two classes of points
We've talked a lot about lines, but let's revisit them in $\mathbb{R}^2$ once more. Lines can be expressed10
as a normal vector and a distance from the origin. Specifically, given $px + qy + b = 0$, if
we let $w = \begin{pmatrix}p\\q\end{pmatrix}$ and $x$ be some arbitrary point $\begin{pmatrix}x\\y\end{pmatrix}$, then this is equivalent to $w \cdot x + b = 0$,
and $w$ is normal to the line. We can scale both $w$ and $b$ arbitrarily and still have the same
result (this will be important shortly). Then, for an arbitrary point $(x_0, y_0)$ not on the line,
we can find its (perpendicular) distance to the line relatively easily:
$$D = \frac{px_0 + qy_0 + b}{\sqrt{p^2 + q^2}} = \frac{x_0 \cdot w + b}{\|w\|}$$
10 We switch notation to p, q, b instead of the traditional a, b, c to stay in line with the SVM literature.
This extends easily to higher dimensions, but that makes for tougher visualizations.
The margin lines between two classes will always have such points; they are called the sup-
port vectors. This is a huge reduction in computation relative to the generative methods
we covered earlier, which analyzed the entire data set; here, we only care about the examples
near the boundaries. So. How can we use these vectors to maximize the margin?
Remember that we can arbitrarily scale w and b and still represent the same line, so let’s say
that the separating line = 0, whereas the lines touching the positive and negative examples
are = 1 and = −1, respectively.
w·x+b=1
w·x+b=0
w · x + b = −1
The aforementioned support vectors lie on these lines, so xi · w + b = ±1 for them, meaning
we can compute the margin as a function of these values. The distance from a support vector
xi to the separating line is then:
$$\frac{|x_i \cdot w + b|}{\|w\|} = \frac{\pm 1}{\|w\|}$$
That means that M , the distance between the dashed green lines above, is defined by:
$$M = \left|\frac{1}{\|w\|} - \frac{-1}{\|w\|}\right| = \frac{2}{\|w\|}$$
Thus, we want to find the w that maximizes the margin, M . We can’t just pick any w,
though: it has to correctly classify all of our training data points. Thus, we have an additional
set of constraints: $x_i \cdot w + b \ge 1$ for positive examples and $x_i \cdot w + b \le -1$ for negative examples.
Let’s define an auxiliary variable yi to represent the label on each example. When xi is a
negative example, yi = −1; similarly, when xi is a positive example, yi = 1. This gives us a
standard quadratic optimization problem:
Minimize: $\frac{1}{2}w^T w$
Subject to: $y_i(x_i \cdot w + b) \ge 1$
The solution to this optimization problem (whose derivation we won’t get into) is just a
linear combination of the support vectors:
$$w = \sum_i \alpha_i y_i x_i \tag{11.3}$$
The αi s are “learned weights” that are non-zero at the support vectors. For any support
vector, we can substitute in yi = w · xi + b, so:
$$y_i = \sum_j \alpha_j y_j\, x_j \cdot x_i + b$$
Since it equals $y_i$, it's always ±1. We can use this to build our classification function:
$$f(x) = \operatorname{sign}\left(\sum_i \alpha_i y_i\, x_i \cdot x + b\right) \tag{11.4}$$
The dot product $x_i \cdot x$ is the crucial component: the entirety of the classification depends only
on this dot product between some “new point” x and our support vectors $x_i$'s.
No such luck this time. But what if we mapped them to a higher-dimensional space? For
example, if we map these to y = x2 , a wild linear separator appears!
This seems promising. . . how can we find such a mapping (like the arbitrary x 7→ x2 above)
for other feature spaces? Let’s generalize this idea. We can call our mapping function Φ; it
maps xs in our feature space to another higher-dimensional space ϕ(x), so Φ : x 7→ ϕ(x).
We can use this to find the “kernel trick.”
Kernel Trick Just a moment ago in (11.4), we showed that the linear classifier relies on the dot product between vectors: $\mathbf{x}_i \cdot \mathbf{x}$. Let's define a kernel function $K$:
\[ K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j = \mathbf{x}_i^T \mathbf{x}_j \]
This kernel function is a "similarity" function that corresponds to an inner (dot) product in some expanded feature space; it returns larger values the more similar its two arguments are. This is significant: as long as there is some higher-dimensional space in which $K$ is a dot product, we can use $K$ in a linear classifier.¹¹
The kernel trick is that instead of explicitly computing the lifting transformation $\varphi(\mathbf{x})$ (so called because it lifts us into a higher dimension), we instead define the kernel function in a way that we know maps to a dot product in a higher-dimensional space, like the polynomial kernel in the example below. Then, we get the non-linear decision boundary in the original feature space:
\[ \sum_i \alpha_i y_i (\mathbf{x}_i^T \mathbf{x}) + b \quad\longrightarrow\quad \sum_i \alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b \]
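To make the substitution concrete, here is a minimal sketch of the kernelized classifier; the support vectors, weights, bias, and kernel below are placeholders rather than values from the text:

import numpy as np

def polynomial_kernel(xi, x, degree=2):
    # K(xi, x) = (1 + xi·x)^degree
    return (1.0 + xi @ x) ** degree

def classify(x, support_vectors, alphas, labels, b, kernel):
    # Non-linear decision: sign( sum_i alpha_i y_i K(x_i, x) + b )
    score = sum(a * yi * kernel(xi, x)
                for a, yi, xi in zip(alphas, labels, support_vectors))
    return np.sign(score + b)

# Hypothetical "learned" quantities, just to show the call shape:
support_vectors = [np.array([0.0, 1.0]), np.array([2.0, 0.5])]
alphas, labels, b = [0.7, 0.7], [+1, -1], 0.1
print(classify(np.array([1.0, 1.0]), support_vectors, alphas, labels, b,
               polynomial_kernel))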
Let’s work through a proof that a particular kernel function is a dot product in
some higher-dimensional space. Remember, we don’t actually care about what
that space is when it comes to applying the kernel; that’s the beauty of the kernel
trick. We’re working through this to demonstrate how you would show that some
kernel function does have a higher-dimensional mapping.
We have 2D vectors, so $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$.
¹¹ There are mathematical restrictions on kernel functions that we won't get into, but I'll mention them for those interested in reading further. Mercer's Theorem guides us towards the conclusion that a kernel matrix $\kappa$ (i.e. the matrix resulting from applying the kernel function to its inputs, $\kappa_{i,j} = K(\mathbf{x}_i, \mathbf{x}_j)$) must be positive semi-definite. For more reading, try this article.
\begin{align*}
K(\mathbf{x}_i, \mathbf{x}_j) &= (1 + \mathbf{x}_i^T \mathbf{x}_j)^2 \\
&= \left(1 + \begin{bmatrix} x_{i1} & x_{i2} \end{bmatrix}\begin{bmatrix} x_{j1} \\ x_{j2} \end{bmatrix}\right)\left(1 + \begin{bmatrix} x_{i1} & x_{i2} \end{bmatrix}\begin{bmatrix} x_{j1} \\ x_{j2} \end{bmatrix}\right) && \text{expand} \\
&= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2} && \text{multiply it all out} \\
&= \begin{bmatrix} 1 & x_{i1}^2 & \sqrt{2}\, x_{i1} x_{i2} & x_{i2}^2 & \sqrt{2}\, x_{i1} & \sqrt{2}\, x_{i2} \end{bmatrix}
   \begin{bmatrix} 1 \\ x_{j1}^2 \\ \sqrt{2}\, x_{j1} x_{j2} \\ x_{j2}^2 \\ \sqrt{2}\, x_{j1} \\ \sqrt{2}\, x_{j2} \end{bmatrix} && \text{rewrite it as a vector product}
\end{align*}
At this point, we can see something magical and crucially important: each of the vectors only relies on terms from either $\mathbf{x}_i$ or $\mathbf{x}_j$! That means it's a... wait for it... dot product! We can define $\varphi$ as a mapping into this new 6-dimensional space:
\[ \varphi(\mathbf{x}) = \begin{bmatrix} 1 & x_1^2 & \sqrt{2}\, x_1 x_2 & x_2^2 & \sqrt{2}\, x_1 & \sqrt{2}\, x_2 \end{bmatrix}^T \]
What if the data isn't separable even in a higher dimension, or there is some decision boundary that is actually "better"¹² than a perfect separation, one that perhaps ignores a few whacky edge cases? This is an advanced topic in SVMs that introduces the concept of slack variables, which allow for this error, but that's more of a machine learning topic. For a not-so-gentle introduction, you can try this page.
Example Kernel Functions This entire discussion raises the question: what are some good kernel functions?
Well, the simplest example is just our linear function: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$. It's useful when $\mathbf{x}$ is already just a massive vector in a high-dimensional space.
Another common kernel is a Gaussian, which is a specific case of the radial basis function:
\[ K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2} \right) \]
As $\mathbf{x}_j$ moves away from the support vector $\mathbf{x}_i$, the kernel value just falls off like a Gaussian. But how is this a valid kernel function, you might ask? Turns out, it maps us to an infinite-dimensional space. You may (or may not) recall from calculus that the exponential function $e^x$ can be expanded into an infinite sum: $e^x = \sum_{j=0}^{\infty} \frac{x^j}{j!}$.

¹² For some definition of "better."
Applying this to our kernel renders this whackyboi (with $\sigma = 1$ for simplification):
\[ \exp\left(-\frac{1}{2}\|\mathbf{x} - \mathbf{x}'\|_2^2\right) = \sum_{j=0}^{\infty} \frac{(\mathbf{x}^T\mathbf{x}')^j}{j!} \cdot \exp\left(-\frac{1}{2}\|\mathbf{x}\|_2^2\right) \cdot \exp\left(-\frac{1}{2}\|\mathbf{x}'\|_2^2\right) \]
As you can see, it can be expressed as an infinite sum over powers of the dot product $\mathbf{x} \cdot \mathbf{x}'$. Of course, thanks to the kernel trick, we don't actually have to care how it works; we can just use it. A Gaussian kernel function is really useful in computer vision because we often use histograms to describe classes, and histograms can be represented as mixtures of Gaussians; we saw this before when describing Continuous Generative Models.
Another useful kernel function (from this paper) is one that describes histogram intersection, where the histogram is normalized to represent a probability distribution:
\[ K(\mathbf{x}_i, \mathbf{x}_j) = \sum_k \min\bigl( x_i(k), x_j(k) \bigr) \]
We’ll leverage this kernel function briefly when discussing Visual Bags of Words and activity
classification in video.
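For reference, here are minimal sketches of these last two kernels; nothing here is tied to a particular SVM library, and the example histograms are assumed to already be normalized:

import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    # Gaussian / radial basis function kernel.
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def histogram_intersection(hi, hj):
    # Sum of bin-wise minima of two normalized histograms.
    return np.minimum(hi, hj).sum()

h1 = np.array([0.2, 0.5, 0.3])
h2 = np.array([0.1, 0.6, 0.3])
print(rbf_kernel(h1, h2), histogram_intersection(h1, h2))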
Multi-category Classification
How do we extend a binary classifier like the SVM to more than two classes? Unfortunately, neither of the solutions to this question is particularly "satisfying," but fortunately both of them work. There are two main approaches:
One vs. All In this approach, we adopt a “thing or ¬thing” mindset. More specifically, we
learn an SVM for each class versus all of the other classes. In the testing phase, then,
we apply each SVM to the test example and assign to it the class that resulted in the
highest decision value (that is, the distance that occurs within (11.4)).
One vs. One In this approach, we actually learn an SVM for each possible pair of classes. As you might expect, this blows up relatively fast. Even with just 4 classes, $\{A, B, C, D\}$, we need $\binom{4}{2} = 6$ pairs of SVMs: $\{(A, B), (A, C), (A, D), (B, C), (B, D), (C, D)\}$. During the testing phase, each learned SVM "votes" for a class to be assigned to the test example. Like when we discussed voting in Hough Space, we trust that the most sensible option results in the most votes, since the votes for the "invalid" pairs (i.e. the $(A, B)$ SVM for an example in class $D$) would result in a random value.
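Here is a minimal sketch of the one-vs-one voting logic; the per-pair classifiers are stand-ins (each just votes for a fixed class) where real code would use trained binary SVMs:

from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, classifiers):
    # `classifiers` maps each class pair to a function that, given x,
    # returns which of its two classes it votes for.
    votes = Counter(clf(x) for clf in classifiers.values())
    return votes.most_common(1)[0][0]

classes = ["A", "B", "C", "D"]
# Hypothetical pairwise classifiers: each blindly votes for its first class.
classifiers = {pair: (lambda x, p=pair: p[0])
               for pair in combinations(classes, 2)}
print(len(classifiers))                       # 6 pairs, as computed above
print(one_vs_one_predict(None, classifiers))  # "A" wins the vote here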
The lowest error rate found by the SVM with an RBF kernel was around 3.4% on tiny
21 × 12 images; this performed better than humans even when high resolution images were
used. Figure 11.28 shows the top five gender misclassifications by humans.
The obvious problem with this idea is sheer scalability: we may need to search millions of possible features for a single image. Thankfully, this isn't a unique problem, and we can
draw parallels to other disciplines in computer science to find a better search method. To
summarize our problem, let’s restate it as such:
With potentially thousands of features per image, and hundreds of millions of
images to search, how can we efficiently find those that are relevant to a new
image?
If we draw a parallel to the real world of text documents, we can see a similar problem.
How, from a book with thousands of words on each page, can we find all of the pages that
contain a particular “interesting” word? Reference the Index, of course! Suppose we’re given
a new page, then, with some selection of interesting words in it. How can we find the "most
similar” page in our book? The likeliest page is the page with the most references to those
words in the index!
Index
ant: 1, 4       gong: 2, 4
bread: 2        hammer: 2, 3
bus: 4          leaf: 1, 2, 3
chair: 2, 4     map: 2, 3
chisel: 3, 4    net: 2
desk: 1         pepper: 1, 2
fly: 1, 3       shoe: 1, 3
Now we’re given a page with a particular set of words from the index: gong, fly,
pepper, ant, and shoe. Which page is this “most like”?
Finding the solution is fairly straightforward: choose the page with the most references to the new words. Tallying them up, page 1 is referenced by ant, fly, pepper, and shoe (4 references), while pages 2, 3, and 4 each get only 2.
Clearly, then, the new page is most like the 1st page. Naturally, we'd need some
tricks for larger, more realistic examples, but the concept remains the same: iterate
through the index, tracking occurrences for each page.
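A minimal sketch of that lookup, using the toy index above:

from collections import Counter

# The toy index from above: word -> pages that reference it.
index = {
    "ant": [1, 4], "bread": [2], "bus": [4], "chair": [2, 4],
    "chisel": [3, 4], "desk": [1], "fly": [1, 3], "gong": [2, 4],
    "hammer": [2, 3], "leaf": [1, 2, 3], "map": [2, 3], "net": [2],
    "pepper": [1, 2], "shoe": [1, 3],
}

def most_similar_page(words):
    # One vote per (word, page) reference; the page with the most votes wins.
    votes = Counter(page for w in words for page in index.get(w, []))
    return votes.most_common(1)[0][0]

print(most_similar_page(["gong", "fly", "pepper", "ant", "shoe"]))  # -> 1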
Similarly, we want to find all of the images that contain a particular interesting feature. This
leads us to the concept of mapping our features to “visual words.”
We’ll disregard some implementation details regarding this method, such as which features
to choose or what size they should be. These are obviously crucial to the method, but we’re
more concerned with using it rather than implementing it. The academic literature will give
more insights (such as the paper linked in Figure 11.30) if you’re interested. Instead, we’ll
be discussing using these “bags of visual words” to do recognition.
To return to the document analogy, if you had a document and wanted to find similar
documents from your library, you likely would pull documents with a similar histogram
distribution of words, much like we did in the previous example.
We can use our visual vocabulary to compute a histogram of occurrences of each “word” in
each object. Notice in Figure 11.33 that elements that don't belong to the object do sometimes "occur": the bottom of the violin might look something like a bicycle tire, for example.
By-and-large, though, the peaks are related to the relevant object from Figure 11.32.
Comparing images to our database is easy. By normalizing the histograms, we can treat them as unit vectors. For some database image $\mathbf{d}_j$ and query image $\mathbf{q}$, similarity can then be easily represented by the dot product:
\[ \operatorname{sim}(\mathbf{d}_j, \mathbf{q}) = \frac{\mathbf{d}_j \cdot \mathbf{q}}{\|\mathbf{d}_j\|\, \|\mathbf{q}\|} \]
¹⁴ Such visual bags of words already exist; see the Caltech 101 dataset.
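As a minimal sketch (the word-count histograms below are made up), finding the best match in the database is just a matter of maximizing this normalized dot product:

import numpy as np

def sim(dj, q):
    # Cosine similarity between a database histogram and a query histogram.
    return (dj @ q) / (np.linalg.norm(dj) * np.linalg.norm(q))

database = [np.array([5.0, 0.0, 2.0, 1.0]),   # hypothetical visual-word counts
            np.array([0.0, 3.0, 0.0, 4.0])]
query = np.array([4.0, 0.0, 1.0, 0.0])

best = max(range(len(database)), key=lambda j: sim(database[j], query))
print(best)  # index of the most similar database image (0 here)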
Video Analysis
To be continued. . .
Linear Algebra Primer
This appendix covers a few topics in linear algebra needed throughout the rest of the guide. Computer vision depends heavily on linear algebra, so reading an external resource is highly recommended. The primer from Stanford's machine learning course is an excellent starting point and likely explains things much better than this appendix.
Notation I try to stick to a particular notation for vectors and matrices throughout the
guide, but it may be inconsistent at times. Vectors are lowercase bold: a; matrices are
uppercase bold: A. Unit vectors wear hats: â.
In this case, though, we don’t have a solution. How do we know this? Perhaps A isn’t
invertible, or we know in advance that we have an over-constrained system (more rows than
columns in A). Instead, we can try to find the “best” solution. We can define “best” as being
the solution vector with the shortest distance to y. The Euclidean distance between two
n-length vectors a and b is just the magnitude of their difference:
\[ d(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\| = \sqrt{(a_1 - b_1)^2 + \ldots + (a_n - b_n)^2} \]
Thus, we can define $\mathbf{x}^*$ as the best possible solution, the one that minimizes this distance:
\[ \mathbf{x}^* = \arg\min_{\mathbf{x}} \|\mathbf{y} - A\mathbf{x}\| \]
A Visual Parallel. We can present a visual demonstration of how this works and show
that the least-squares solution is the same as the projection.
Suppose we have a set of points that don't exactly fit a line: $\{(1, 1), (2, 1), (3, 2)\}$, plotted in Figure A.1. We want to find the best possible line $y = Dx + C$ that minimizes the total error. This corresponds to solving the following system of equations, forming $\mathbf{y} = A\mathbf{x}$:
\[ \begin{aligned} 1 &= C + D \cdot 1 \\ 1 &= C + D \cdot 2 \\ 2 &= C + D \cdot 3 \end{aligned} \qquad \text{or} \qquad \begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} C \\ D \end{bmatrix} \]
Figure A.1: A set of points with "no solution": no line passes through all of them. The set of errors is plotted in red: $(e_0, e_1, e_2)$.
The lack of a solution to this system means that the vector of $y$-values isn't in the column space of $A$, or: $\mathbf{y} \notin C(A)$. The vector can't be represented by a linear combination of the column vectors in $A$.
We can imagine the column space as a plane in $xyz$-space, with $\mathbf{y}$ existing outside of it; then, the vector that would be within the column space is the projection of $\mathbf{y}$ onto the column space plane: $\mathbf{p} = \operatorname{proj}_{C(A)} \mathbf{y}$. This is the closest possible vector in the column space to $\mathbf{y}$, and its distance from $\mathbf{y}$ is exactly what we were trying to minimize! Thus, $\mathbf{e} = \mathbf{y} - \mathbf{p}$.
Figure A.2: The vector $\mathbf{y}$ relative to the column space of $A$, and its projection $\mathbf{p}$ onto the column space.
The projection isn't super convenient to calculate or determine, though. Through algebraic manipulation, calculus, and other magic, we learn that the way to find the least squares approximation of the solution is:
\[ A^T A \mathbf{x}^* = A^T \mathbf{y} \]
\[ \mathbf{x}^* = \underbrace{(A^T A)^{-1} A^T}_{\text{pseudoinverse}}\, \mathbf{y} \]
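A small sketch that solves the three-point example from Figure A.1 with exactly this formula (np.linalg.lstsq would do the same thing more robustly):

import numpy as np

# y = C + D*x for the points (1, 1), (2, 1), (3, 2).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 1.0, 2.0])

# x* = (A^T A)^{-1} A^T y   (the pseudoinverse applied to y)
x_star = np.linalg.inv(A.T @ A) @ A.T @ y
C, D = x_star
print(C, D)            # best-fit intercept and slope: 1/3 and 1/2
print(y - A @ x_star)  # the error vector e = y - p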
More Resources. This section basically summarizes and synthesizes this Khan Academy
video, this lecture from our course (which goes through the full derivation), this section of
Introduction to Linear Algebra, and this explanation from NYU. These links are provided in
order of clarity.
Meaning $\mathbf{a} \times \mathbf{b} = [\mathbf{a}_\times]\, \mathbf{b}$. We'll use this notation throughout certain parts of the guide.
Note: The rank of this matrix is 2. Just trust me (and Prof. Aaron Bobick) on that.
\[ \nabla f(x^*, y^*) \propto \nabla g(x^*, y^*) \]
\[ \nabla f(x^*, y^*) = \lambda \nabla g(x^*, y^*) \]
Figure: (a) Some of the contours of $f$ and the constraint are shown; the purple contour is tangent to the constraint. (b) Zooming in on the tangent between the best contour of $f$ and the constraint, we can see that their normal vectors (in other words, their gradients) are proportional.
λ is the Lagrange multiplier, named after its discoverer, Joseph-Louis λagrange. We can
now use this relationship between our two functions to actually find the tangent point—and
thus the maximum value:
\[ \nabla f = \begin{bmatrix} \frac{\partial}{\partial x}(xy) \\[2pt] \frac{\partial}{\partial y}(xy) \end{bmatrix} = \begin{bmatrix} y \\ x \end{bmatrix} \qquad\qquad \nabla g = \begin{bmatrix} \frac{\partial}{\partial x}(x + y) \\[2pt] \frac{\partial}{\partial y}(x + y) \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \]
We can arrange this new relationship into a system of equations and include our original
constraint x + y = 4:
\[ \begin{aligned} y &= \lambda \\ x &= \lambda \\ x + y &= 4 \end{aligned} \]
These pieces are packaged into the Lagrangian, $\mathcal{L}(x, y, \lambda) = f(x, y) - \lambda\bigl(g(x, y) - k\bigr)$, where $f$ is the function to maximize, $g$ is the constraint function, and $k$ is the constraint constant. Maximizing $f$ is then a matter of solving for $\nabla\mathcal{L} = 0$.
This should look more familiar if you came here from the first reference to the Lagrange
multiplier.
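As a minimal sketch (assuming SymPy is available), we can solve the running example, maximizing $f = xy$ subject to $x + y = 4$, straight from the gradient condition:

import sympy as sp

x, y, lam = sp.symbols("x y lambda")
f = x * y    # function to maximize
g = x + y    # constraint function, with k = 4

# Solve grad(f) = lambda * grad(g) together with the constraint g = 4.
equations = [sp.Eq(sp.diff(f, x), lam * sp.diff(g, x)),
             sp.Eq(sp.diff(f, y), lam * sp.diff(g, y)),
             sp.Eq(g, 4)]
print(sp.solve(equations, [x, y, lam]))  # -> [(2, 2, 2)], i.e. x = y = 2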
More Resources. This section is based on the Khan Academy series on solving constrained optimization problems, which discusses the Lagrange multiplier, as well as its relationship to contours and the gradient, in much greater detail.
Index of Terms
A
affine transformation, 80, 110, 134, 137, 175
albedo, 120, 122
Aliasing, 49
aliasing, 43, 49
anti-aliasing filter, 51
aperture problem, 126
appearance-based tracking, 174
attenuate, 16

B
baseline, 60, 91
bed of nails, 51
Bhattacharyya coefficient, 154
binary classifier, 180
binocular fusion, 58
boosting, 180, 181, 197
BRDF, 119
brightness constancy constraint, 126, 134

C
calibration matrix, 71
Canny edge detector, 29, 31, 36
center of projection, 54, 57, 59, 65, 70
classifier, 161
comb function, 49, 51
computer vision, 11, 97, 98, 118, 125, 151, 157, 179, 192
convolution, 18, 47
correlation, 107
correlation filter, non-uniform weights, 15
correlation filter, uniform weights, 15
cost, 162, 179

D
data term, 64
dense correspondence search, 62, 96, 152
descriptor, 107
difference of Gaussian, 105
difference of Gaussian, pyramid, 106, 129
diffuse reflection, 119
direct linear calibration transformation, 75
discrete cosine transform, 52
discriminative, 161
disocclusion, 137
disparity, 59, 111
dynamic programming, 63
dynamics model, 138, 175

E
edge-preserving filter, 20
eigenfaces, 171, 174, 175
energy minimization, 64
epipolar constraint, 60, 91, 94
epipolar geometry, 60, 92, 93, 97, 110
epipolar line, 60, 65, 92, 94, 96
epipolar plane, 60, 91
epipole, 60, 94, 96
essential matrix, 92, 93, 97
Euler angles, 68
extrinsic parameter matrix, 69, 93, 94
extrinsic parameters, 65

F
feature points, 98, 179, 194
feature vector, 108, 109
finite difference, 26
O
observation model, 138, 175
occlusion, 62, 64, 110, 137, 177
occlusion boundaries, 62, 99
optic flow, 125, 136
outliers, 110, 113, 114

P
panoramas, 82, 194
parallax motion, 134
parameter space, 32
particle filtering, 145, 175
perspective, weak, 57
perturbation, 145
Phong reflection model, 121
pitch, 65
point-line duality, 89, 92–94, 96
power, 45
Prewitt operator, 27
principal component analysis, 151, 164, 176
principle point, 90
projection, 53, 79, 133
projection plane, 54, 82, 87, 89
projection, orthographic, 57, 134
projection, perspective, 55, 134
projective geometry, 61, 86, 94, 96
putative matches, 111, 194

R
radial basis function, 191
radiance, 118
RANSAC, 114, 194
resectioning, 72
retinex, 123
right derivative, 26
rigid body transformation, 79
risk, 162
Roberts operator, 27
robust estimator, 113
roll, 65

S
second moment matrix, 101, 129
Sharpening filter, 20
shearing, 80, 110
SIFT, 105, 160
SIFT descriptor, 107
similarity transformation, 80
singular value decomposition, 73, 83, 95, 110, 113, 177
slack variable, 191
smoothness term, 64
Sobel operator, 27, 29
specular reflection, 119
splatting, 85
stereo correspondence, 61, 91, 111, 126
stochastic universal sampling, 148
support vector machine, 180, 185, 197

T
Taylor expansion, 100, 126
template matching, 22
total rigid transformation, 68

V
vanishing point, 56
variance, 16
Viola-Jones detector, 181
visual code-words, 39
voting, 32, 36, 112, 192

W
weak calibration, 92
weak learner, 181
weighted moving average, 14

Y
yaw, 65