
An Introduction to Computer Vision

or: An Unofficial Companion Guide to the Georgia Institute of Technology’s CS6476: Computer Vision

George Kudrayvtsev

Last Updated: April 9, 2019

(Draft)

This work is a culmination of hundreds of hours of effort to create a lasting reference to read along with Georgia Tech’s graduate course on computer vision. Credit for a majority of the content belongs to Prof. Aaron Bobick, the lecturer for the course, as well as numerous external academic sources that are linked where appropriate. All of the explanations are in my own words, and many of the diagrams were hand-crafted.

This is still a work-in-progress; once this sentence is gone, you can consider it
to be v1.0. There may be errors, typos, or entirely incorrect or misleading information,
though I’ve done my best to ensure there aren’t. I will be further expanding it with topics
from Georgia Tech’s course on computational photography in the coming school year.

Many lattés were consumed throughout the making of this companion guide. If you found it useful and are in a generous mood, feel free to buy me another! Shoot me a donation on Venmo @george_k_btw with whatever this guide was worth to you.
Happy studying!

0 Preface 9

1 Introduction 11

2 Basic Image Manipulation 12


2.1 Images as Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.1 Operations on Image Functions . . . . . . . . . . . . . . . . . . . . . 13
Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Image Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Computing Averages . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Averages in 2D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Blurring Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Gaussian Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Linearity and Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Impulses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Boundary Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 More Filter Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6 Filters as Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6.1 Template Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Where’s Waldo? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Non-Identical Template Matching . . . . . . . . . . . . . . . . . . . . 24
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3 Edge Detection 25
3.1 The Importance of Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Gradient Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Finite Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 The Discrete Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Sobel Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.3 Handling Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
We Have to Go Deeper. . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Dimension Extension Detection . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 From Gradients to Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.1 Canny Edge Operator . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Non-Maximal Suppression . . . . . . . . . . . . . . . . . . . . . . . . 30
Canny Threshold Hysteresis . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.2 2nd Order Gaussian in 2D . . . . . . . . . . . . . . . . . . . . . . . . 30

4 Hough Transform 31
4.1 Line Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.1 Voting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.2 Hough Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

Hough Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.3 Polar Representation of Lines . . . . . . . . . . . . . . . . . . . . . . 34
4.1.4 Hough Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.5 Handling Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.6 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Finding Circles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.1 Hough Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5 Frequency Analysis 43
5.1 Basis Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.1 Limitations and Discretization . . . . . . . . . . . . . . . . . . . . . . 47
5.2.2 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Aliasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3.1 Antialiasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Resizing Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Image Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6 Cameras and Images 53


6.1 Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Perspective Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2.1 Homogeneous Coordinates . . . . . . . . . . . . . . . . . . . . . . . . 54
Perspective Projection . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2.2 Geometry in Perspective . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.2.3 Other Projection Models . . . . . . . . . . . . . . . . . . . . . . . . . 56
Orthographic Projection . . . . . . . . . . . . . . . . . . . . . . . . . 57
Weak Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.3 Stereo Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.3.1 Finding Disparity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3.2 Epipolar Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3.3 Stereo Correspondence . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Dense Correspondence Search . . . . . . . . . . . . . . . . . . . . . . 62
Uniqueness Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Ordering Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3.4 Better Stereo Correspondence . . . . . . . . . . . . . . . . . . . . . . 63
Scanlines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Grid Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.4 Extrinsic Camera Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.4.1 Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.4.2 Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Example: Rotation about a single axis. . . . . . . . . . . . . . . . . . 67
Rotation with Homogeneous Coordinates . . . . . . . . . . . . . . . . 68

6.4.3 Total Rigid Transformation . . . . . . . . . . . . . . . . . . . . . . . 68
6.4.4 The Duality of Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.5 Intrinsic Camera Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.5.1 Real Intrinsic Parameters . . . . . . . . . . . . . . . . . . . . . . . . 70
6.6 Total Camera Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.7 Calibrating Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.7.1 Method 1: Singular Value Decomposition . . . . . . . . . . . . . . . . 73
6.7.2 Method 2: Inhomogeneous Solution . . . . . . . . . . . . . . . . . . . 74
6.7.3 Advantages and Disadvantages . . . . . . . . . . . . . . . . . . . . . 75
6.7.4 Geometric Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.8 Using the Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.8.1 Where’s Waldo the Camera? . . . . . . . . . . . . . . . . . . . . . . . 77
6.9 Calibrating Cameras: Redux . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

7 Multiple Views 79
7.1 Image-to-Image Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.2 The Power of Homographies . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.2.1 Creating Panoramas . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.2.2 Homographies and 3D Planes . . . . . . . . . . . . . . . . . . . . . . 83
7.2.3 Image Rectification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Forward Warping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Inverse Warping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.3 Projective Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.3.1 Alternative Interpretations of Lines . . . . . . . . . . . . . . . . . . . 86
7.3.2 Interpreting 2D Lines as 3D Points . . . . . . . . . . . . . . . . . . . 87
7.3.3 Interpreting 2D Points as 3D Lines . . . . . . . . . . . . . . . . . . . 88
7.3.4 Ideal Points and Lines . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.3.5 Duality in 3D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.4 Applying Projective Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.4.1 Essential Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.4.2 Fundamental Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Properties of the Fundamental Matrix . . . . . . . . . . . . . . . . . 94
Computing the Fundamental Matrix From Correspondences . . . . . 95
Fundamental Matrix Applications . . . . . . . . . . . . . . . . . . . . 96
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

8 Feature Recognition 98
8.1 Finding Interest Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.1.1 Harris Corners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Properties of the 2nd Moment Matrix . . . . . . . . . . . . . . . . . . 103
8.1.2 Harris Detector Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 104
8.1.3 Improving the Harris Detector . . . . . . . . . . . . . . . . . . . . . . 104
SIFT Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Harris-Laplace Detector . . . . . . . . . . . . . . . . . . . . . . . . . 106

Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.2 Matching Interest Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.2.1 SIFT Descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Orientation Assignment . . . . . . . . . . . . . . . . . . . . . . . . . 108
Keypoint Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Evaluating the Results . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.2.2 Matching Feature Points . . . . . . . . . . . . . . . . . . . . . . . . . 109
Nearest Neighbor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Wavelet-Based Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Locality-Sensitive Hashing . . . . . . . . . . . . . . . . . . . . . . . . 109
8.2.3 Feature Points for Object Recognition . . . . . . . . . . . . . . . . . 110
8.3 Coming Full Circle: Feature-Based Alignment . . . . . . . . . . . . . . . . . 110
8.3.1 Outlier Rejection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Nearest Neighbor Error . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8.3.2 Error Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
8.3.3 RANSAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Benefits and Downsides . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

9 Photometry 118
9.1 BRDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
9.1.1 Diffuse Reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.1.2 Specular Reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.1.3 Phong Reflection Model . . . . . . . . . . . . . . . . . . . . . . . . . 121
9.2 Recovering Light . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9.2.1 Retinex Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

10 Motion & Tracking 124


10.1 Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
10.1.1 Lucas-Kanade Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Improving Lucas-Kanade . . . . . . . . . . . . . . . . . . . . . . . . . 129
Sparse Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
10.1.2 Applying Lucas-Kanade: Frame Interpolation . . . . . . . . . . . . . 132
10.2 Motion Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
10.2.1 Known Motion Geometry . . . . . . . . . . . . . . . . . . . . . . . . 134
10.2.2 Geometric Motion Constraints . . . . . . . . . . . . . . . . . . . . . . 134
10.2.3 Layered Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
10.3 Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
10.3.1 Modeling Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Tracking as Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Tracking as Induction . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Making Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Making Corrections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
10.3.2 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

N -dimensional Kalman Filter . . . . . . . . . . . . . . . . . . . . . . 143
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
10.3.3 Particle Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Bayes Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 147
10.3.4 Real Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Tracking Contours . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Other Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
A Very Simple Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
10.3.5 Mean-Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Similarity Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Kernel Choices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
10.3.6 Tracking Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
10.3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

11 Recognition 158
11.1 Generative Supervised Classification . . . . . . . . . . . . . . . . . . . . . . . 162
11.2 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 164
11.2.1 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . 168
11.2.2 Face Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
11.2.3 Eigenfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
11.2.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
11.3 Incremental Visual Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
11.3.1 Forming Our Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Dynamics Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Observation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Incremental Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
11.3.2 All Together Now . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Handling Occlusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
11.4 Discriminative Supervised Classification . . . . . . . . . . . . . . . . . . . . 178
11.4.1 Discriminative Classifier Architecture . . . . . . . . . . . . . . . . . . 179
Building a Representation . . . . . . . . . . . . . . . . . . . . . . . . 179
Train a Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Generating and Scoring Candidates . . . . . . . . . . . . . . . . . . . 180
11.4.2 Nearest Neighbor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
11.4.3 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Viola-Jones Face Detector . . . . . . . . . . . . . . . . . . . . . . . . 181
Advantages and Disadvantages . . . . . . . . . . . . . . . . . . . . . 184
11.5 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
11.5.1 Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
11.5.2 Support Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
11.5.3 Extending SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Mapping to Higher-Dimensional Space . . . . . . . . . . . . . . . . . 189
Multi-category Classification . . . . . . . . . . . . . . . . . . . . . . . 192

11.5.4 SVMs for Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Using SVMs for Gender Classification . . . . . . . . . . . . . . . . . . 193
11.6 Visual Bags of Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

12 Video Analysis 198

A Linear Algebra Primer 199


A.1 Solving a System of Equations via Least-Squares . . . . . . . . . . . . . . . . 199
A.2 Cross Product as Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . 201
A.3 Lagrange Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
A.3.1 The Lagrangian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

Index of Terms 204

List of Algorithms

4.1 The basic Hough algorithm for line detection. . . . . . . . . . . . . . . . . . . 35


4.2 The gradient variant of the Hough algorithm for line detection. . . . . . . . . 36
4.3 The Hough algorithm for circles. . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 The generalized Hough transform, for known orientations. . . . . . . . . . . . 41
4.5 The generalized Hough transform, for unknown orientations. . . . . . . . . . . 42

6.1 Finding camera calibration by minimizing geometric error. . . . . . . . . . . . 77

8.1 The basic Harris detector algorithm. . . . . . . . . . . . . . . . . . . . . . . . 104


8.2 General RANSAC algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.3 Adaptive RANSAC algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . 117

10.1 Iterative Lucas-Kanade algorithm. . . . . . . . . . . . . . . . . . . . . . . . . 130


10.2 The hierarchical Lucas-Kanade algorithm. . . . . . . . . . . . . . . . . . . . . 131
10.3 Basic particle filtering algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . 149
10.4 The stochastic universal sampling algorithm. . . . . . . . . . . . . . . . . . . 150

11.1 The simplified AdaBoost algorithm. . . . . . . . . . . . . . . . . . . . . . . . 185

Preface

I read that Teddy Roosevelt once said, “Do what you can with what you have
where you are.” Of course, I doubt he was in the tub when he said that.
— Bill Watterson, Calvin and Hobbes

Notation
Before we begin to dive into all things computer vision, here are a few things I do in this
notebook to elaborate on concepts:
• An item that is highlighted like this is a “term;” this is some vocabulary or identifying
word/phrase that will be used and repeated regularly in subsequent sections. I try to
cross-reference these any time they come up again to link back to its first defined usage;
most mentions are available in the Index.
• An item that is highlighted like this is a “mathematical property;” such properties
are often used in subsequent sections and their understanding is assumed there.
• The presence of a TODO means that I still need to expand that section or possibly to
mark something that should link to a future (unwritten) section or chapter.
• An item in a maroon box, like. . .

Boxes: A Rigorous Approach


. . . this example, often represents fun and interesting asides or examples that
pertain to the material being discussed. They are largely optional, but should
be interesting to read and have value, even if it’s not immediately rewarding.

• An item in a blue box, like. . .

Quick Maffs: Proving That the Box Exists


. . . this example, is a mathematical aside; I only write these if I need to dive deeper into a concept that’s mentioned in lecture. This could be proofs, examples, or just a more thorough explanation of something that might’ve been “assumed knowledge” in the text.

• An item in a green box, like. . .

Example 0.1: Examples


. . . this is an example that explores a theoretical topic with specifics. It’s
sometimes from lecture, but can also be just something I added in to under-
stand it better myself.

Introduction

Every picture tells a story. But sometimes it’s hard to know what story is
actually being told.
— Anastasia Hollings, Beautiful World

The goal of computer vision is to create programs that can interpret and analyse images, providing the program with the meaning behind the image. This may involve
concepts such as object recognition as well as action recognition for images in motion
(colloquially, “videos”).
Computer vision is a hard problem because it involves much more complex analysis relative
to image processing. For example, observe the following set of images:

Figure 1.1: The checker shadow illusion.

The two checkered squares A and B have the same color intensity, but our brain interprets
them differently without the connecting band due to the shadow. Shadows are actually quite
important to human vision. Our brains rely on shadows to create depth information and
track motion based on shadow movements to resolve ambiguities.
A computer can easily figure out that the intensities of the squares are equal, but it’s much
harder for it to “see” the illusion like we do. Computer vision involves viewing the image
as a whole and gaining a semantic understanding of its content, rather than just processing
things pixel-by-pixel.

Basic Image Manipulation

Beauty is in the eye of the beholder.


— Margaret Wolfe Hungerford, Molly Bawn

2.1 Images as Functions


A straightforward way to turn the beauty of an image into a representation that we can
manipulate into a mathematical mess is by interpreting it as a function mapping a location
(of a pixel) to an intensity. We begin with black-and-white images in order to keep our
representation simple: the intensity represents a range from some minimum (black, often 0)
and some maximum (white, often 1.0).
Image processing is then a task of applying operations on this function to transform it into
another function: images in, images out. Since it’s just a mathematical structure, we can
perform traditional mathematical operations on it. For example, when we smooth out peaks
and valleys (which we would “visually” perceive as sharp contrasts in image intensity) in the
function, the result is a blurred version of that image! We’ll explore this in Blurring Images.
More formally, an “image function” is a mapping from R2 (that is, two real numbers repre-
senting a position (x, y) in the image) to R (that is, some intensity or value). In the real
world, images have a finite size or dimension, so we have:

$$I : \mathbb{R}^2 \mapsto \mathbb{R}$$

or, more specifically,

$$I : x \times y \mapsto \mathbb{R}$$
where
$$x \in [a, b], \quad y \in [c, d]$$
$$\mathbb{R} \in [\text{min}, \text{max}]$$

where min would be some “blackest black” and max would be some “whitest white,” and
(a, b, c, d) are ranges for the different dimensions of the images, though when actually per-
forming mathematical operations, such interpretations of values become irrelevant.


We can easily expand this to colored images, with a vector-valued function mapping each color component:
$$I(x, y) = \begin{bmatrix} r(x, y) \\ g(x, y) \\ b(x, y) \end{bmatrix}$$
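To make this concrete, here’s a minimal sketch of the “image as a function” view using NumPy arrays loaded via OpenCV; the file name is a placeholder, and note that OpenCV stores color channels in BGR order:

```python
import cv2

# Grayscale: I(x, y) -> a single intensity, mapped here into [0, 1].
I = cv2.imread("photo.png", cv2.IMREAD_GRAYSCALE) / 255.0
print(I[50, 100])    # the intensity at row 50, column 100

# Color: a vector-valued function returning one value per channel.
rgb = cv2.imread("photo.png")   # an H x W x 3 array (BGR order in OpenCV)
print(rgb[50, 100])  # the vector [b(x, y), g(x, y), r(x, y)]
```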

2.1.1 Operations on Image Functions


Because images are just functions, we can perform any mathematical operations on them
that we can on, well, functions.

Addition  Adding two images will result in a blend between the two. As we discussed, though, intensities have a range [min, max]; thus, adding is often performed as an average of the two images instead, so as not to lose intensities when their sum exceeds the maximum:
$$I_\text{added} = \frac{I_a}{2} + \frac{I_b}{2}$$

Subtraction  In contrast, subtracting two images will give the difference between the two. A larger intensity indicates a greater difference between the two source images at that pixel. Note that the order of operations matters, though the results are inverses of each other:
$$I_a - I_b = -(I_b - I_a)$$

Often, we simply care about the absolute difference between the images. Because we are often operating in a discrete space that will truncate negative values (for example, when operating on images represented as uint8), we can use a special formulation to get this difference:
$$I_\text{diff} = (I_a - I_b) + (I_b - I_a)$$
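Here’s a small sketch of these operations using OpenCV’s saturating uint8 arithmetic (the file names are placeholders and the images are assumed to be the same size):

```python
import cv2

a = cv2.imread("image_a.png", cv2.IMREAD_GRAYSCALE)
b = cv2.imread("image_b.png", cv2.IMREAD_GRAYSCALE)

# Addition as an average, so the blend never exceeds the maximum intensity.
blend = cv2.addWeighted(a, 0.5, b, 0.5, 0)

# (I_a - I_b) + (I_b - I_a): each cv2.subtract() truncates negatives to 0,
# so summing the two one-sided differences yields the absolute difference.
diff = cv2.add(cv2.subtract(a, b), cv2.subtract(b, a))
assert (diff == cv2.absdiff(a, b)).all()  # matches the built-in absolute difference
```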

Noise
A common function that is added to a single image is a noise function. One of these is called the Gaussian noise function: it adds a variation in intensity drawn from a Gaussian (normal) distribution. We basically add a random intensity value to every pixel of an image. See Figure 2.2 for an example of Gaussian noise added to a classic example image¹ used in computer vision.

Figure 2.1: A Gaussian (normal) distribution.

Figure 2.2: Gaussian noise on an image.

Tweaking Sigma  On a normal distribution, the mean is 0. If we interpret 0 as an intensity, it would have to be between black (the low end) and white (the high end); thus, the average pixel intensity added to the image should be gray. When we tweak σ – the standard deviation – this will affect the amount of noise: a higher σ means a noisier image. Of course, when working with image manipulation libraries, the choice of σ varies depending on the range of intensities in the image.

¹ . . . of a model named Lena, actually, who is posing for the centerfold of an issue of PlayBoy. . .
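As a quick sketch, Gaussian noise is easy to add with NumPy; this assumes a float image with intensities in [0, 1], so σ is relative to that range:

```python
import numpy as np

def add_gaussian_noise(image, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to every pixel, then clip back to [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=0.0, scale=sigma, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)
```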

2.2 Image Filtering


Now that we have discussed adding noise to an image, how would we approach removing noise
from an image? If noise is just a function added to an image, couldn’t we just remove that
noise by subtracting the noise function again? Unfortunately, this is not possible without
knowing the original noise function! This information is not stored independently within our
image, so we need to somehow extract or derive that information from the merged images
instead. Furthermore, because of the common range limitation on pixels, some information
may be lost as a result of an “overflowing” (or underflowing) noise addition that results in
pixel values outside of that range!
To clarify, consider a single pixel in the range [0, 255] and the intensity 200. If the noise
function added 60 intensity, how would you derive the original value of 200 from the new
value of 255 even if you did know that 60 was added? The original value could be anywhere
in the range [195, 255]!

2.2.1 Computing Averages


A common “intuitive” suggestion to remove noise is to replace the value of each pixel with the
average value of the pixels around it. This is known as a moving average. This approach
hinges on some key assumptions:
• The “true” value of a pixel is probably similar to the “true” values of the nearby pixels.
• The noise in each pixel is added independently. Thus, the average of the noise around
a pixel will be 0.
Consider the first assumption further: shouldn’t closer pixels be more similar than further
pixels, if we consider all pixels that are within some radius? This leads us to trying a
weighted moving average; such an assumption would result in a smoother representation.


Averages in 2D
Extending a moving average to 2 dimensions is relatively straightforward. You take the
average of a range of values in both directions. For example, in a 100×100 image, you may
want to take an average over a moving 5×5 square. So, disregarding edge values, the value
of the pixel at (2, 2) would be:

$$P_{(2,2)} = \frac{1}{25}\big[P_{(0,0)} + P_{(0,1)} + \dots + P_{(0,4)} + P_{(1,0)} + \dots + P_{(1,4)} + \dots + P_{(4,0)} + \dots + P_{(4,4)}\big]$$

In other words, with our square extending k = 2 pixels in both directions, we can derive the formula for correlation filtering with uniform weights:
$$G[i, j] = \frac{1}{(2k+1)^2} \sum_{u=-k}^{k} \sum_{v=-k}^{k} F[i + u, j + v] \tag{2.1}$$

Of course, we decided that non-uniform weights were preferred. This results in a slightly
different equation for correlation filtering with non-uniform weights, where H[i, j] is
the weight function.
$$G[i, j] = \sum_{u=-k}^{k} \sum_{v=-k}^{k} H[u, v]\, F[i + u, j + v] \tag{2.2}$$

This is also known as cross-correlation, denoted G = H ⊗ F .
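To make (2.1) and (2.2) concrete, here’s a direct (and deliberately unoptimized) NumPy sketch of cross-correlation; it zero-pads the borders, a choice we revisit in Boundary Issues:

```python
import numpy as np

def cross_correlate(F, H):
    """Apply (2.2): G[i, j] = sum over (u, v) of H[u, v] * F[i + u, j + v]."""
    k = H.shape[0] // 2  # assumes a square (2k+1) x (2k+1) kernel
    Fp = np.pad(F.astype(float), k, mode="constant")  # zero-pad the borders
    G = np.zeros(F.shape, dtype=float)
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            G[i, j] = np.sum(H * Fp[i:i + 2 * k + 1, j:j + 2 * k + 1])
    return G

# Uniform weights as in (2.1): a 5x5 moving average is H = np.ones((5, 5)) / 25.
```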

Results Such a filter, weighted or not, actually performs terribly if our goal is to remove some added Gaussian noise. It does suppress the noise, but it also smooths away the fine detail of an image that was originally sharp, rather than removing only the noise that was added to it. In other words, we lose fidelity. The result is a (poorly) blurred image. Well, despite not achieving our original
goal, we’ve stumbled upon something else: blurring images.

2.2.2 Blurring Images


OK now he was close, tried to domesticate you
But you’re an animal, baby, it’s in your nature
Just let me liberate you
— Robin Thicke, Blurred Lines

So what went wrong when trying to smooth out the image? Well, a “box filter” like the ones in (2.1) and (2.2) is not smooth (in the mathematical, not social, sense). We’ll define and
explore this term later, but for now, suffice to say that a proper blurring (or “smoothing”)
function should be, well, smooth.


To get a sense of what’s wrong, suppose you’re looking at a single point of light very far
away, and then you made your camera out of focus. What would such an image look like?
Probably something like Figure 2.3.

Figure 2.3: A distant point of light, viewed out of focus.

Now, what kind of filtering function, applied on the singular point, should we apply to
get such a blurry spot? Well, a function that looked like that blurry spot would probably
work best: higher values in the middle that fall off (or attenuate) to the edges. This is a
Gaussian filter, which is an application of the Gaussian function:
$$h(u, v) = \underbrace{\frac{1}{2\pi\sigma^2}}_{\substack{\text{normalization}\\\text{coefficient}}} e^{-\frac{u^2 + v^2}{2\sigma^2}} \tag{2.3}$$

In such a filter, the nearest neighboring pixels have the most influence. This is much like
the weighted moving average presented in (2.2), but with weights that better represent
“nearness.” Such weights are “circularly symmetric,” which mathematically are said to be
isotropic; thus, this is the isotropic Gaussian filter. Note the normalization coefficient: this
value affects the brightness of the blur, not the blurring itself.

Gaussian Parameters
The Gaussian filter is a mathematical operation that does not care about pixels. Its only
parameter is the variance, which represents the “amount of smoothing” that the filter per-
forms. Of course, when dealing with images, we need to apply the filter to a particular range
of pixels; this is called the kernel.
Now, it’s critical to note that modifying the size of the kernel is not the same thing as
modifying the variance. They are related, though. The kernel has to be “big enough” to
fairly represent the variance and let it perform a smoother blurring.
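Here’s a sketch of building a kernel directly from (2.3) with NumPy, which keeps the two knobs (the kernel size and σ) explicit and separate; the kernel size is assumed to be odd:

```python
import numpy as np

def gaussian_kernel(ksize, sigma):
    k = ksize // 2
    u, v = np.mgrid[-k:k + 1, -k:k + 1]
    h = np.exp(-(u**2 + v**2) / (2 * sigma**2))
    # Normalize so the weights sum to 1, keeping overall brightness unchanged.
    return h / h.sum()

# Same sigma, different kernel sizes: a kernel that is too small truncates the
# Gaussian's tails, so it cannot fairly represent the chosen variance.
small, big = gaussian_kernel(3, 2.0), gaussian_kernel(15, 2.0)
```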

2.3 Linearity and Convolution


We are going to continue working with this concept of filtering: applying a filtering function
to an image. Naturally, we need to start with some mathematical definitions.


Linearity An operator H is linear if the following properties hold (where f1 and f2 are
some functions, and a is a constant):
• Additivity: the operator preserves summation, H(f1 + f2 ) = H(f1 ) + H(f2 )
• Multiplicative scaling, or homogeneity of degree 1: H(a · f1 ) = a · H(f1 )
With regards to computer vision, linearity allows us to build up an image one piece at a
time. We have guarantees that the operator operates identically per-pixel (or per-chunk, or
per-frame) as it would on the entire image. In other words, the total is exactly the sum of
its parts, and vice-versa.

Shift Invariance The property of shift invariance states that an operator behaves the
same everywhere. In other words, the output depends on the pattern of the image neigh-
borhood, rather than the position of the neighborhood. An operator must give the same
result on a pixel regardless of where that pixel (and its neighbors) is located to maintain
shift invariance.

2.3.1 Impulses
An impulse function in the discrete world is a very easy function (or signal) to understand:
its value = 1 at a single location. In the continuous world, an impulse is an idealized function
which is very narrow, very tall, and has a unit area (i.e. an area of 1). In the limit, it has
zero width and infinite height; its integral is 1.

Impulse Responses If we have an unknown system and send an impulse as an input, we


get an output (duh?). This output is the impulse response that we call h(t). If this “black
box” system – which we’ll call the unknown operator H – is linear, then H can be described
by h(x).
Why is that the case? Well, since any input to the system is simply a scaled version of the original impulse (for which we know the response), we can describe the output to any input as following that same addition or scaling.

2.3.2 Convolution
Let’s revisit the cross-correlation equation from Computing Averages:

$$G[i, j] = \sum_{u=-k}^{k} \sum_{v=-k}^{k} H[u, v]\, F[i + u, j + v] \tag{2.2}$$

and see what happens when we treat it as a system H and apply impulses. We begin with an impulse signal F (an image), and an arbitrary kernel H:

$$F(x, y) = \begin{bmatrix} 0&0&0&0&0 \\ 0&0&0&0&0 \\ 0&0&1&0&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \end{bmatrix} \qquad H(u, v) = \begin{bmatrix} a&b&c \\ d&e&f \\ g&h&i \end{bmatrix}$$

What is the result of filtering the impulse signal with the kernel? In other words, what is G(x, y) = F(x, y) ⊗ H(u, v)? As we can see in Figure 2.4, the resulting image is a flipped version (in both directions) of the filter H:

$$G(x, y) = \begin{bmatrix} 0&0&0&0&0 \\ 0&i&h&g&0 \\ 0&f&e&d&0 \\ 0&c&b&a&0 \\ 0&0&0&0&0 \end{bmatrix}$$

Figure 2.4: Performing F ⊗ H. (a) shows the result of applying the filter H on F at (1, 1); (b) shows the result of subsequently applying the filter H on F at (2, 1), where the kernel covers a new area; (c), reproduced above, is the resulting image (impulse response) after applying the filter H to the entire image.

We introduce the concept of a convolution operator to account for this “flipping.” The
cross-convolution filter, or G = H ~ F , is defined as such:
$$G[i, j] = \sum_{u=-k}^{k} \sum_{v=-k}^{k} H[u, v]\, F[i - u, j - v]$$

This filter flips both dimensions. Convolution filters must be shift invariant.
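A quick numerical check of this flipping behavior, assuming SciPy: correlating an impulse with a kernel reproduces the kernel flipped in both dimensions, while convolving reproduces it as-is.

```python
import numpy as np
from scipy.ndimage import convolve, correlate

F = np.zeros((5, 5))
F[2, 2] = 1                             # the impulse
H = np.arange(1.0, 10.0).reshape(3, 3)  # stand-in for [[a..i]]

corr = correlate(F, H)   # G = H ⊗ F: the kernel comes out flipped
conv = convolve(F, H)    # G = H ~ F: the kernel comes out unchanged

assert np.allclose(corr[1:4, 1:4], H[::-1, ::-1])
assert np.allclose(conv[1:4, 1:4], H)
```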

Pop Quiz: Convolution of the Gaussian Filter

What is the difference between applying the Gaussian filter as a convolution vs. a
correlation?


Answer: Nothing! Because the Gaussian is an isotropic filter, its symmetry en-
sures that the order of application is irrelevant. Thus, the distinction only matters
for an asymmetric filter.

Properties
Because convolution (and correlation) is both linear- and shift-invariant, it maintains some
useful properties:
• Commutative: F ~ G = G ~ F
• Associative: (F ~ G) ~ H = F ~ (G ~ H)
• Identity: Given the “unit impulse” e = [. . . , 0, 1, 0, . . .], then f ~ e = f
• Differentiation: $\frac{\partial}{\partial x}(f ~ g) = \frac{\partial f}{\partial x} ~ g$. This property will be useful later, in Handling Noise, when we find gradients for edge detection.

Computational Complexity
If an image is N × N and a filter is W × W, how many multiplications are necessary to compute their convolution (N ~ W)?
Well, a single application of the filter requires W² multiplications, and the filter must be applied for every pixel, so N² times. Thus, it requires N²W² multiplications, which can grow to be fairly large.

Separability There is room for optimization here for certain filters. If the filter is sepa-
rable, meaning you can get the kernel H by convolving a single column vector by a single
row vector, as in the example:
$$H = \begin{bmatrix} 1&2&1 \\ 2&4&2 \\ 1&2&1 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} ~ \begin{bmatrix} 1&2&1 \end{bmatrix}$$

Then we can use the associative property to remove a lot of multiplications. The result, G,
can be simplified:

G=H ~F
= (C ~ R) ~ F
= C ~ (R ~ F )

So we perform two convolutions, but on smaller matrices: each one requires WN² multiplications. This is useful if W is large enough such that 2WN² ≪ W²N². This optimization used to be very valuable, but can still provide a significant benefit: if W = 31, for example, it is faster by a factor of about 15! That’s still an order of magnitude.
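Here’s a short SciPy sketch verifying the separability trick: filtering with the row and column vectors in sequence matches filtering with the full 2D kernel (zero padding is used so the two paths agree exactly at the borders):

```python
import numpy as np
from scipy.ndimage import convolve

col = np.array([[1.0], [2.0], [1.0]])   # C: a 3x1 column vector
row = np.array([[1.0, 2.0, 1.0]])       # R: a 1x3 row vector
H = col @ row                           # the full 3x3 kernel from above

image = np.random.rand(100, 100)
full = convolve(image, H, mode="constant")
separated = convolve(convolve(image, row, mode="constant"), col, mode="constant")

assert np.allclose(full, separated)  # C ~ (R ~ F) gives the same result as H ~ F
```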


2.4 Boundary Issues


We have avoided discussing what happens when applying a filter to the edges of an image.
We want our resulting image to be the same size as the input, but what do we use as the
values to the input when the filter demands pixels beyond its size? There are a few choices (compared in the short sketch that follows this list):
• Clipping: This method simply treats the non-existent pixels as black. Images filtered with this method result in a black border bleeding into their edges. Such an effect is very noticeable, but may be desirable. It’s similar to the artistic “vignette” effect.
• Wrapping: This method uses the opposite edge of the image to continue the edge. It was intended for periodic functions, and is useful for seamless images, but looks noticeably bad for non-seamless images. The colors from the opposite edge unnaturally bleed into the border.
• Extending: This method copies a chunk of the edge to fit the filter kernel. It has
good results that don’t have noticeable artifacts like the previous two.
• Reflecting: This method copies a chunk of the edge like the previous method, but
it mirrors the edge like a reflection. It often results in slightly more natural-looking
results, but the differences are largely imperceptible.
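The four strategies map directly onto NumPy’s padding modes; here’s a minimal sketch, assuming a pad width of k = 2 to match a 5×5 kernel:

```python
import numpy as np

image = np.arange(16, dtype=float).reshape(4, 4)
k = 2  # a 5x5 kernel reaches 2 pixels past each edge

clipped   = np.pad(image, k, mode="constant", constant_values=0)  # clipping
wrapped   = np.pad(image, k, mode="wrap")                         # wrapping
extended  = np.pad(image, k, mode="edge")                         # extending
reflected = np.pad(image, k, mode="reflect")                      # reflecting
```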

2.5 More Filter Examples


As we all know from Instagram and Snapchat, there are a lot of techniques out there to change
the way an image looks. The following list is obviously not exhaustive, and Figure 2.5 and
Figure 2.6 showcase the techniques.
• Sharpening filter: This filter accentuates the “differences with the local average,” by
comparing a more intense version of an image and its box blur.

Figure 2.5: A sharpening filter applied to an image of Einstein.

An example sharpening filter could be: $\begin{bmatrix} 0&0&0 \\ 0&2&0 \\ 0&0&0 \end{bmatrix} - \frac{1}{9}\begin{bmatrix} 1&1&1 \\ 1&1&1 \\ 1&1&1 \end{bmatrix}$
• Median filter: Also called an edge-preserving filter, this is actually a non-linear filter that is useful for other types of noise in an image. For example, a salt-and-pepper noise function would randomly add very-white and very-black dots throughout an image. Such dots would significantly throw off an average filter, since they are often outliers in the kernel, but can easily be managed by a median filter (a short sketch follows this example).


Figure 2.6: A median filter applied to a salt-and-peppered image of peppers.

For example, consider a kernel with intensities as follows:
$$\begin{bmatrix} 10 & 15 & 20 \\ 23 & 90 & 27 \\ 33 & 31 & 30 \end{bmatrix}$$
where the 90 is clearly an instance of “salt” noise sprinkled into the image. Finding the median:
10  15  20  23  27  30  31  33  90
results in replacing the center point with intensity = 27, which is much better than the result of a weighted box filter (as in (2.2)), which could have resulted in an intensity of 61.²
An interesting benefit of this filter is that any new pixel value was already present locally, which means new pixels never have any “weird” values.
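Here’s a tiny demonstration on the 3×3 kernel above, assuming SciPy; the “salt” value of 90 is replaced by the local median, 27:

```python
import numpy as np
from scipy.ndimage import median_filter

patch = np.array([[10, 15, 20],
                  [23, 90, 27],
                  [33, 31, 30]])

filtered = median_filter(patch, size=3)
print(filtered[1, 1])  # 27: the outlier never drags the output toward it
```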

Theory in Action: The Unsharp Mask

In Adobe PhotoShop and other editing software, the “unsharp mask” tool would
actually sharpen the image. Why?
In the days of actual film, when photos had negatives and were developed in dark
rooms, someone came up with a clever technique. If light were shone on a negative
that was covered by wax paper, the result was a negative of the negative that was
blurrier than its original. If you then developed this negative of a negative layered
on top of the original negative, you would get a sharper version of the resulting
image!
This is a chemical replication of the exact same filtering mechanism as the one we
described in the sharpening filter, above! We had our original (the negative) and
were subtracting (because it was the negative of the negative) a blurrier version of
the negative. Hence, again, the result was a sharper developed image.

² . . . if choosing 1/9 for non-center pixels and 4/9 for the center.


This blurrier double-negative was called the “unsharp mask,” hence the historical
name for the editing tool.

2.6 Filters as Templates


Now we will consider filters that aren’t simply a representation of intensities in the image,
but actually represent some property of the pixels, giving us a sort of semantic meaning
that we can apply to other images.

Filter Normalization Recall how the filters we are working with are linear operators.
Well, if we correlated an image (again, see (2.2)), and we multiplied that correlation filter
by some constant, then the resulting image would be scaled by that same constant. This
makes it tricky to compare filters: if we were to compare image 1 and filter 1 against image
2 and filter 2, to see how much the images “respond” to the filters, we would need to make
sure that both filters operate on a similar scale. Otherwise, outputs may differ greatly, but
not reflect an accurate comparison.
This topic is called normalized correlation, and we’ll discuss it further later. To summa-
rize things for now, suffice to say that “normalization” means that the standard deviation
of our filters will be consistent. For example, we may say that all of our filters will ensure
σ = 1. Not only that, but we also need to normalize the image as we move the kernel across
it. Consider two images with the same filter applied, but one image is just a scaled version of
the other. We should get the same result, but only if we ensure that the standard deviation
within the kernel is also consistent (or σ = 1).
Again, we’ll discuss implementing this later, but for now assume that all of our future
correlations are normalized correlations.

2.6.1 Template Matching


Suppose we make our correlation filter a chunk of our original signal. Then the (normalized)
correlation between the signal and the filter would result in an output whose maximum is
where the chunk was from!
Let’s discuss this to develop an intuition. A correlation is (see (2.2)) a bunch of multipli-
cations and sums. Consider the signal and a filter as in Figure 2.7. Our signal is centered
around 0: when would the filter and signal result in the largest possible values? We’d nat-
urally expect this to happen when the filter and signal values are the “most similar,” or, in
other words, equal!
Thus, the maximum of the correlated output represents the location at which the filter
matches the original signal! This powerful property, called template matching, has many
useful applications in image processing when we extend the concept to 2 dimensions.


Figure 2.7: A signal and a filter which is part of the signal (specifically, from x ∈ [5, 10]).

Where’s Waldo?

Suppose we have an image from a Where’s Waldo? book, and we’ve extracted an image of
Waldo himself, like so:

Figure 2.8: An excerpt from a Where’s Waldo? children’s book and a template of Waldo himself.

If we perform a correlation between these two images, using the template of Waldo as our
filter, we will get a correlation map whose maximum tells us where Waldo is!


Figure 2.9: The correlation map between the image and the template
filter with brightness corresponding to similarity to the template.

See that tiny bright spot around the center of the top half of the correlation map in Fig-
ure 2.9? That’s Waldo’s location in the original image (see Figure 2.8).
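Here’s a sketch of this kind of template matching using OpenCV’s normalized correlation; the image paths are placeholders:

```python
import cv2
import numpy as np

scene = cv2.imread("wheres_waldo.png", cv2.IMREAD_GRAYSCALE)
waldo = cv2.imread("waldo_template.png", cv2.IMREAD_GRAYSCALE)

# Each value in the map measures how well the template matches the image
# patch anchored at that location.
corr_map = cv2.matchTemplate(scene, waldo, cv2.TM_CCORR_NORMED)

# The brightest spot in the map is the best match.
y, x = np.unravel_index(np.argmax(corr_map), corr_map.shape)
print("top-left corner of the best match:", (x, y))
```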

Non-Identical Template Matching


What if we don’t have a perfect template to start with? For example, what if we want to
detect most cars, but only have a single template image of a car? As it turns out, if the stars
align – as in, we have similarity in scale, orientation, color, etc. – a template matching filter
can still detect similar objects because they result in the “best” match to the template.
How would this relate to finding Waldo? Well, in our example, we had the exact template
from the source image. The pictures are naturally designed to make it difficult to find Waldo:
there are others in red-striped shirts, or in glasses, or surrounded by red-striped textures.
Thus, an inexact match may give us an assortment of places where to start looking, but is
unlikely to pinpoint Waldo perfectly for other images.

Applications
What’s template matching useful for? Can we apply it to other problems? What about using
it to match shapes, lines, or faces? We must keep in mind the limitations of this rudimentary
matching technique. Template matching relies on having a near-perfect representation of the
target to use as a filter. Using it to match lines – which vary in size, scale, direction, etc. – is
unreliable. Similarly, faces may be rotated, scaled, or have varying features. There are much
better options for this kind of matching that we’ll discuss later. Something specific, like icons
on a computer or words in a specific font, are a viable application of template matching.

Edge Detection

“How you can sit there, calmly eating muffins when we are in this horrible
trouble, I can’t make out. You seem to me to be perfectly heartless.”
“Well, I can’t eat muffins in an agitated manner. The butter would probably
get on my cuffs. One should always eat muffins quite calmly. It is the only
way to eat them.”
“I say it’s perfectly heartless your eating muffins at all, under the circum-
stances.”
— Oscar Wilde, The Importance of Being Earnest

In this chapter, we will continue to explore the concept of using filters to match certain features we’re looking for, like we discussed in Template Matching. Now, though, we’re
going to be “analysing” generic images that we know nothing about in advance. What
are some good features that we could try to find that give us a lot to work with in our
interpretation of the image?

3.1 The Importance of Edges

Consider a sketch of an elephant, as in Figure 3.1. It’s simple and straightforward. There are no colors: it’s drawn with a dark pencil. Yet, you can absolutely make out the fact that it’s an elephant, not an albino giraffe or an amorphous blob.

Figure 3.1: An elephant. Can you imagine an archeologist many years from now finding the skeletal remains of an elephant? The ears and tusk would be lost entirely in any rendition, and such a sketch would be a work of fantasy.

All it takes are a few well-placed edges to understand an image. Thus this section is dedicated to edge detection algorithms.

How can we detect edges? We must return to the idea of Images as Functions. We visually interpret edges as being sharp contrasts between two areas; mathematically, we can say that edges look like steep cliffs in an image function.¹

Basic Idea: If we can find a neighborhood of an image with strong signs of change, that’s
an indication that an edge may exist there.

3.2 Gradient Operator


Contrast, or change, in a function is modeled by its derivative. Thus, we will be using a
differential operator on our images to get a sense of the change that’s occurring in it. We’ll
model these operators as kernels that compute the image gradient function over the entire
image, then threshold this gradient function to try to select edge pixels.
The gradient operator is the vector of partial derivatives for a particular function. For images, our functions are parameterized by x and y; thus, the gradient operator would be defined as
$$\nabla := \left(\frac{\partial}{\partial x}, \frac{\partial}{\partial y}\right)$$
Direction The gradient points in the direction of most rapid increase in intensity. For an
image f, the gradient’s direction is given by:
$$\theta = \tan^{-1}\!\left(\frac{\partial f}{\partial y} \Big/ \frac{\partial f}{\partial x}\right) \tag{3.1}$$

Magnitude The “amount of change” in the gradient is given by its magnitude:
$$\|\nabla f\| = \sqrt{\left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2}$$

3.2.1 Finite Differences


Well the gradient is all well and good, but we can’t apply partial derivatives to images the
same way we would in math class. Images are discrete, and so we need to approximate the
partial derivative. We could, for instance, consider the following approximation that relies
on a finite difference between two points:
$$\frac{\partial f(x, y)}{\partial x} \approx f(x + 1, y) - f(x, y)$$

This is the right derivative, but it’s not necessarily the right derivative.2 If we look at the
finite difference stepping in the x direction, we will heavily accentuate vertical edges (since
we move across them), whereas if we look at finite differences in the y direction, we will
heavily accentuate horizontal edges.
¹ Steep cliffs? That sounds like a job for slope, doesn’t it? Well, derivatives are exactly what’s up next.
² Get it, because we take a step to the right for the partial derivative?


3.2.2 The Discrete Gradient


We want an operator that we can apply to a kernel and use as a correlation (or convolution)
filter. Consider the following operator:
$$H := \begin{bmatrix} 0 & 0 & 0 \\ -\frac{1}{2} & 0 & +\frac{1}{2} \\ 0 & 0 & 0 \end{bmatrix}$$

What’s up with the 21 s? This is the average between the “left derivative” (which would be
−1 +1 0 and the “right derivative” (which would be 0 −1 +1 ).
 

Sobel Operator
The Sobel operator is a discrete gradient that preserves the “neighborliness” of an image
that we discussed earlier when talking about the Gaussian blur filter (2.3) in Blurring Images.
It looks like this:³
$$S_x = \frac{1}{8} \cdot \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} \qquad S_y = \frac{1}{8} \cdot \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} \tag{3.2}$$

We say that the application of the Sobel operator is $g_x$ for $S_x$ and $g_y$ for $S_y$:
$$\nabla I = \begin{bmatrix} g_x & g_y \end{bmatrix}^T$$

The Sobel operator results in an edge image that’s not great but also not too bad. There are other operators that use different constants, such as the Prewitt operator:
$$S_x = \begin{bmatrix} -1 & 0 & +1 \\ -1 & 0 & +1 \\ -1 & 0 & +1 \end{bmatrix} \qquad S_y = \begin{bmatrix} +1 & +1 & +1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix} \tag{3.3}$$

and the Roberts operator:
$$S_x = \begin{bmatrix} 0 & +1 \\ -1 & 0 \end{bmatrix} \qquad S_y = \begin{bmatrix} +1 & 0 \\ 0 & -1 \end{bmatrix} \tag{3.4}$$

but we won’t go into detail on these. Instead, let’s explore how we can improve our edge
detection results.

³ Note: In $S_y$, positive y is upward; the origin is assumed to be at the bottom-left.
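Here’s a sketch of computing the discrete gradient, its magnitude, and its direction with SciPy’s Sobel filters (which apply the unnormalized kernels, i.e. without the 1/8 factor in (3.2)):

```python
import numpy as np
from scipy.ndimage import sobel

def image_gradient(image):
    gx = sobel(image.astype(float), axis=1)  # derivative along x (columns)
    gy = sobel(image.astype(float), axis=0)  # derivative along y (rows)
    magnitude = np.hypot(gx, gy)             # the gradient magnitude
    direction = np.arctan2(gy, gx)           # θ, as in (3.1)
    return magnitude, direction
```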


3.2.3 Handling Noise


Our gradient operators perform reasonably well in idealized scenarios, but images are often very noisy. As we increase the noise in an image, the gradient – and thus its representation of the edges – gets much messier.

Figure 3.2: Using a smoothing filter (like the Gaussian) to combat image noise affecting the gradient. From top to bottom, we have (a) the original noisy signal, f; (b) the smoothing filter, h; (c) applying the smoothing, h ~ f; and (d) the gradient of the result, $\frac{\partial}{\partial x}(h ~ f)$.

Of course, we know how to reduce noise: apply a smoothing filter! Then, edges are peaks,
as seen in Figure 3.2. We can also perform a minor optimization here; recall that, thanks to
the differentiation property (discussed in Properties of Convolution):
$$\frac{\partial}{\partial x}(h ~ f) = \left(\frac{\partial}{\partial x} h\right) ~ f \tag{3.5}$$

Meaning we can skip Step (d) in Figure 3.2 and convolve the derivative of the smoothing
filter to directly get the result.

We Have to Go Deeper. . .
As we saw in Figure 3.2, edges are represented by peaks in the resulting signal. How do we detect peaks? More derivatives! So, consider the 2nd derivative Gaussian operator:
$$\frac{\partial^2}{\partial x^2} h$$
which we convolve with the signal to detect peaks.

Figure 3.3: Using the 2nd order Gaussian to detect edge peaks.

And we do see in Figure 3.3 that we’ve absolutely detected an edge at x = 1000 in the original signal from Figure 3.2.


3.3 Dimension Extension Detection


We’ve been discussing examples of gradients in one direction. It’s time to extend that to
two, and thus, to proper images.

When working in 2 dimensions, we need to specify the direction in which we are taking
the derivative. Consider, then, the 2D extension of the derivative of the Gaussian filter we
started with in (3.5):
(I ⊗ g) ⊗ hx = I ⊗ (g ⊗ hx )

Here, g is our Gaussian filter (2.3) and hx is the x-version of our gradient operator, which
could be the Sobel operator (3.2) or one of the others. We prefer the version on the right
because it creates a reusable function that we can apply to any image, and it operates on a
smaller kernel, saving computational power.

Tweaking σ, Revisited Much like in the previous version of Tweaking Sigma, there is an
effect of changing σ in our Gaussian smoothing filter g. Here, smaller values of σ detect finer
features and edges, because less noise is removed and hence the gradient is more volatile.
Similarly, larger values of σ will only leave larger edges detected.

3.4 From Gradients to Edges


So how do we get from these gradient images to actual edges? In general, it’s a multi-step
process:
Smoothing First, we suppress noise by performing smoothing on the image.
Gradient Then, we compute the gradient to find the areas of the image that have significant
localized change (i.e. the “steep cliffs” we described at the start). Recall, though, that
this step can be combined with the previous step because of associativity.
Threshold Aside from tweaking σ, we can also clip the gradient to some range of values to
limit our edges to the “significant” parts of the gradient we’re interested in.
Thinning & Connecting Finally, we perform thinning to get the localized edge pixels that
we can work with, and connect these edge pixels to create a “complete” representation
of the edge if desired.
There are different algorithms that take different approaches at each of these steps. We will
discuss two, but there are many more.

3.4.1 Canny Edge Operator


Designed by John Canny for his Master’s thesis, the Canny edge detector is an algorithm
for creating a proper representation of an image’s edges:


1. Filter the image with the derivative of the Gaussian.


2. Find the magnitude and orientation of the gradient.
3. Perform Non-Maximal Suppression, which thins the multi-pixel ridges of the gradient
into a single-pixel width.
4. Threshold and link (hysteresis) the edges. This is done by choosing two thresholds,
low and high, and using the high threshold to start edge curves and the low one to
continue them.
Let’s discuss the latter two steps in more detail.

Non-Maximal Suppression
This is a fancy term for what amounts to choosing the brightest (maximal) pixel of an edge
and discarding the rest (suppression). It works by looking in the gradient direction and
keeping the maximal pixel.

Canny Threshold Hysteresis


After non-maximal suppression, some valid edge pixels didn’t survive thresholding! We can’t
just lower the threshold, though, since that would introduce non-edge noise.
The cleverness of the detector comes from using two thresholds. First, a high threshold
determines “strong edges,” which are then linked together. Then, a low threshold is used to
determine weak edges that are plausible. Finally, the original strong edges are extended if
they can follow weak edges.
The assumption is that all of the edges that we care about have strong edges, but might
have some portions that fade out. The weak threshold links strong edges together without
adding excess noise from the weaker gradients.
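As a quick, hedged illustration (not from the notes): OpenCV's cv2.Canny exposes exactly this pair of thresholds; the file name and threshold values below are placeholders that would need tuning per image.

import cv2

img = cv2.imread('input.png', cv2.IMREAD_GRAYSCALE)       # hypothetical input image
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)               # suppress noise before differentiating
# threshold2 starts the "strong" edge curves; threshold1 lets "weak" edges continue them.
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)
cv2.imwrite('edges.png', edges)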

3.4.2 2nd Order Gaussian in 2D


Recall when we performed the 2nd derivative on the Gaussian filter to create “zero-crossings”
at edges in our output signal in Figure 3.3. How do we do this in 2 dimensions? Recall that
the Gaussian filter (2.3) is parameterized by (u, v), and so has two possible directions to take
the partial derivative. The 2nd derivative has even more choices. . . how do we know which
one to use?
The answer? None of them. Instead, we apply the Laplacian operator, defined as ∇2 :

∇²h = ∂²h/∂x² + ∂²h/∂y²        (3.6)

Once you apply the Laplacian, the zero-crossings are the edges of the image. This operator
is an alternative to the Canny Edge Operator and is better under certain circumstances.
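A rough sketch of this idea (mine, with an assumed σ) using SciPy's Laplacian-of-Gaussian filter and a simple sign-change test for the zero-crossings:

import numpy as np
from scipy import ndimage

def log_zero_crossings(image, sigma=2.0):
    """Mark pixels where the Laplacian of Gaussian changes sign (candidate edge locations)."""
    log = ndimage.gaussian_laplace(image.astype(float), sigma=sigma)
    signs = np.sign(log)
    crossings = np.zeros(image.shape, dtype=bool)
    crossings[:-1, :] |= signs[:-1, :] != signs[1:, :]   # sign flip with the pixel below
    crossings[:, :-1] |= signs[:, :-1] != signs[:, 1:]   # sign flip with the pixel to the right
    return crossings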

Hough Transform

“Little pig, little pig won’t you let me come in?”


“No, no, no, by the hair on my chinny, chin, chin.”
“Then I’ll huff and I’ll puff and I’ll blow your house in.”
“No, no, no, Mr. Wolf I will not let you in.”
— The Three Little Pigs, a folktale

We can finally begin to discuss concepts that enable "real vision," as opposed to the previous chapters, which belonged more in the category of image processing: we gave a function an input image and similarly received an output image. Now, we'll discuss learning "good stuff" about an image, stuff that tells us things about what an image means. This includes the presence of lines, circles, particular objects or shapes, and more!

4.1 Line Fitting


We begin with a simple desire: finding the lines in an image. This is a lot more difficult
than it sounds, and our models for Edge Detection only get us part of the way. Why isn’t
edge detection sufficient? Unfortunately, edge detection in the forms we’ve seen isn’t perfect;
there are many flaws:
Clutter There are a lot of resulting points and edges that clutter the image and don’t
necessarily represent lines. Furthermore, having multiple models that we want to find
increases the computational complexity of the search.
Partial Matches Even with the clever hysteresis that extends and connects edges in the
Canny Edge Operator, the edges that we find won’t always represent the entire line
present in the original image.
Noise Even without the above imperfections, the resulting edges are noisy. Even if the
original image has a perfectly straight line, the detected edge may not be straight; it
may deviate a few pixels in some direction, or be in the wrong orientation, or have
some extra noise from other extraneous image details like texture.


4.1.1 Voting
Since it's computationally infeasible to simply try every possible line given a set of edge pixels, we instead need to let the data tell us something. We approach this by introducing voting, which is a general technique where we let the features (in this case, edge pixels) vote for all of the models with which they are compatible.
Voting is very straightforward: we cycle through the features, and each casts votes on par-
ticular model parameters; then, we look at model parameters with a high number of votes.
The idea behind the validity of voting is that all of the outliers – model parameters that are only valid for one or two pixels – are varied: we can rely on valid parameters being elected by the majority. Metaphorically, the "silly votes" for candidates like Donald Duck are evenly distributed over an assortment of irrelevant choices and can thus be uniformly disregarded.

4.1.2 Hough Transform


Pronounced "huff," the Hough transform is a voting technique that can be used to answer all of our questions about whether or not a line exists:
• Given points that belong to a line, what is that line?
• How many lines are there?
• Which points belong to which lines?
The main idea behind the Hough transform is that each edge point (i.e. each pixel found
after applying an edge-detecting operator) votes on its compatible line(s). Hopefully, the
lines that get many votes are valid.

Hough Space
To achieve this voting mechanism, we need to introduce Hough space – also called pa-
rameter space – which enables a different representation of our desired shapes.
The key is that a line in image space represents a point in Hough space because we use the
parameters of the line (m, b). Similarly, a point in image space represents a line in Hough
space through some simple algebraic manipulation. This is shown in Figure 4.1. Given a
point (x0 , y0 ), we know that all of the lines going through it fit the equation y0 = mx0 + b.
Thus, we can rearrange this to be b = −x0 m + y0 , which is a line in Hough space.
What if we have two points? We can easily determine the line passing through both of those
points in Cartesian space by applying the point-slope formula:
y − y1 = m(x − x1 )
where, as we all know:
m = (y2 − y1) / (x2 − x1)
That’s simple enough for two points, as there’s a distinct line that passes through them by
definition. But what if we introduce more points, and what if those points can’t be modeled


Figure 4.1: Transforming a line from the Cartesian plane (which we'll call "image space") to a point in Hough space. (a) A line on the Cartesian plane, y = m0·x + b0, represented in the traditional slope-intercept form. (b) A parameterized representation of the same line, the point (m0, b0), in Hough space.

by a perfect line that passes through them? We need a line of best fit, and we can use Hough
space for that.
A point on an image is a line in Hough space, and two points are two lines. The point at
which these lines intersect is the exact (m, b) that we would’ve found above, had we converted
to slope-intercept form. Similarly, a bunch of points produces a bunch of lines in Hough, all
of which intersect in (or near) one place. The closer the points are to a particular line of
best fit, the closer their points of intersection in Hough space. This is shown in Figure 4.2.
Figure 4.2: Finding the line of best fit for a series of points by determining an approximate point of intersection in a discretized Hough space. (a) A series of points in image space. (b) The lines representing the possible parameters for each of those points in Hough space.

If we discretize Hough space into “bins” for voting, we end up with the Hough algorithm.
In the Hough algorithm, each point in image space votes for every bin along its line in Hough
space: the bin with the most votes becomes the line of best fit among the points.


Unfortunately, the (m, b) representation of lines comes with some problems. For example, a
vertical line has m = ∞, which is difficult to represent and correlate using this algorithm.
To avoid this, we’ll introduce Polar Representation of Lines.

4.1.3 Polar Representation of Lines


In the polar representation of a line, we can uniquely identify a line by its perpendicular
distance from the origin, d, and the angle of that perpendicular with the x-axis, θ. We can
see this visually in Figure 4.3.

Figure 4.3: Representing a line in polar coordinates, using the perpendicular distance d and the angle θ.

To get a relationship between the Cartesian and polar spaces, we can derive, via some simple
vector algebra, that:
x cos θ + y sin θ = d (4.1)

This avoids all of the previous problems we had with certain lines being ill-defined. In
this case, a vertical line is represented by θ = 0, which is a perfectly valid input to the
trig functions. Unfortunately, though, now our transformation into Hough space is more
complicated. If we know x and y, then what we have left in terms of d and θ is a sinusoid
like in Figure 4.4.

Figure 4.4: A simple sinusoidal function, which is the manifestation of a point in image space when transformed into Hough space.


Note There are multiple ways to represent all possible lines under polar coordinates. Either d > 0, and so θ ∈ [0, 2π), or d can be both positive or negative, and then θ ∈ [0, π). Furthermore, when working with images, we consider the origin as being in one of the corners, which restricts us to a single quadrant, so θ ∈ [0, π/2). Of course, these are just choices that we make in our representation and don't really have mathematical trade-offs.

4.1.4 Hough Algorithm


Let's synthesize all of these concepts and formalize the Hough transformation algorithm. We begin with a polar representation for lines, as discussed in, well, Polar Representation of Lines, as well as a Hough accumulator array, which is a discretization of the coordinate plane used to track votes for particular values of (θ, d). With that, the algorithm is as follows:

Algorithm 4.1: The basic Hough algorithm for line detection.

Input: An image, I
Result: A detected line.
Initialize an empty voting array: H[d, θ] = 0
foreach edge point, E(x, y) ∈ I do
foreach θ ∈ [0, 180, step) do
d = x cos θ + y sin θ
H[d, θ] += 1
end
end
Find the value(s) of (d, θ) where H[d, θ] is a maximum.
The result is given by d = x cos θ + y sin θ.
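As a sanity check of algorithm 4.1, here's a minimal NumPy translation (my own sketch; the 1° angular step and 1-pixel distance bins are arbitrary discretization choices):

import numpy as np

def hough_lines(edges, theta_step_deg=1.0):
    """Vote for (d, theta) pairs given a binary edge image; return the accumulator and its peak."""
    h, w = edges.shape
    thetas = np.deg2rad(np.arange(0.0, 180.0, theta_step_deg))
    d_max = int(np.ceil(np.hypot(h, w)))
    H = np.zeros((2 * d_max, thetas.size), dtype=np.int64)   # d may be negative, so offset rows by d_max
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        ds = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        H[ds + d_max, np.arange(thetas.size)] += 1            # one vote per theta bin
    d_idx, t_idx = np.unravel_index(np.argmax(H), H.shape)
    return H, d_idx - d_max, thetas[t_idx]

In practice you would look for several local maxima rather than the single global peak, and OpenCV's cv2.HoughLines implements this same accumulator far more efficiently.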

Complexity
The space complexity of the Hough Algorithm is simply k^n: we are working in n dimensions
(2 for lines) and each one gets k bins. Naturally, this means working with more complex
objects (like circles as we’ll see later) that increase dimension count in Hough space can get
expensive fast.
The time complexity is linearly proportional to the number of edge points, whereas the
voting itself is constant.

4.1.5 Handling Noise


Small amounts of noise end up creating many similar peaks that could be perceived as
separate lines. Furthermore, with too-fine of a bin size for the voting array, you could miss
the peaks altogether!


Well, what if we smoothed the image in Hough space? It would blend similar peaks together,
but would reduce the area in which we need to search. Then, we can run the Hough transform
again on that smaller area and use a finer grid to find the best peaks.
What about a lot of noise? As in, what if the input image is just a random assortment of
pixels. We run through the Hough transform expecting to find peaks (and thus lines) when
there are none! So, it’s useful to have some sort of prior information about our expectations.

4.1.6 Extensions
The most common extension to the Hough transform leverages the gradient (recall, if need
be, the Gradient Operator). The following algorithm is nearly identical to the basic version
presented in algorithm 4.1, but we eliminate the loop over all possible θs.

Algorithm 4.2: The gradient variant of the Hough algorithm for line detection.

Input: An image, I
Result: A detected line.
Initialize an empty voting array: H[d, θ] = 0
foreach edge point, E(x, y) ∈ I do
θ = ∇I (x,y) /* or some range influenced by ∇I */
d = x cos θ + y sin θ
H[d, θ] += 1
end
Find the value(s) of (d, θ) where H[d, θ] is a maximum.
The result is given by d = x cos θ + y sin θ.

The gradient hints at the direction of the line: it can be used as a starting point to reduce
the range of θ from its old range of [0, 180).
Another extension gives stronger edges more votes. Recall that the Canny Edge Operator used two thresholds to determine which edges were stronger. We can leverage this "metadata" information to let the strongest edges influence the results of the line detection. Of course, this is far less democratic, but gives more reliable results.
Yet another extension leverages the sampling resolution of the voting array. Changing the
discretization of (d, θ) refines the lines that are detected. This could be problematic if there
are two similar lines that fall into the same bin given too coarse of a grid. The extension
does grid redefinition hierarchically: after determining ranges in which peaks exist with a
coarse grid, we can go back to just those regions with a finer grid to pick out lines.
Finally, and most importantly, we can modify the basic Hough algorithm to work on more
complex shapes such as circles, squares, or actually any shape that can be defined by a
template. In fact, that’s the subject of the next sections.


4.2 Finding Circles

We begin by extending the parametric model that enabled the Hough algorithm to a slightly
more complex shape: circles. A circle can be uniquely defined by its center, (a, b), and a
radius, r. Formally, its equation can be stated as:

(xi − a)² + (yi − b)² = r²        (4.2)

For simplicity, we will begin by assuming the radius is known. How does voting work, then?
Much like a point on a line in image space was a line in Hough space, a point in image space
on a circle is a circle in Hough space, as in Figure 4.5:

(a0 , b0 )
y

x a
(a) An arbitrary circle in image space defined by a (b) A parameterized representation of the same cir-
series of points. cle in Hough space, assuming a known radius.

Figure 4.5: The transformation from a circle, defined by a handful of


points, into a single point in Hough space.

Thus, each point on the not-yet-defined circle votes for a set of points surrounding that same
location in Hough space at the known radius, as in Figure 4.6:


Figure 4.6: The voting process for a set of points defining a circle in image space. (a) The same circle from Figure 4.5a. (b) The set of points voted on by each corresponding point in Hough space. The overlap of the voting areas in Hough space defines the (a, b) for the circle.

Of course, having a known radius is not a realistic expectation for most real-world scenarios.
You might have a range of viable values, or you may know nothing at all. With an unknown
radius, each point on the circle in image space votes on a set of values in Hough space resem-
bling a cone.1 Again, that’s each point: growing the dimensionality of our parameterization
leads to unsustainable growth of the voting process.
We can overcome this growth problem by taking advantage of the gradient, much like we did in the Extensions for the Hough algorithm to reduce the range of θ. We can visualize the gradient as being a tangent line of the circle: if we knew the radius, a single point and its tangent line would be enough to define the circle. Since we don't, there's a line of possible values for the center. As we see in Figure 4.7, where the blue circle is the circle defined for a specific radius given the green point and its gradient direction, the red line defines the range of possible locations for the center of the circle if we don't know the radius.

Figure 4.7: Finding the line of possible values for the center of the circle given a point and its gradient. The blue circle represents an example circle if there were a known radius.

With each point in image space now defining a line of votes in Hough space, we've drastically improved the computational complexity

¹ If we imagine the Hough space as being an (a, b, r) space with r going upward, and we take a known point (a0, b0), we can imagine that if we did know the radius, say r = 7, we'd draw a circle there. But if it was r = 4, we'd draw a circle a little lower (and a little smaller). If we extrapolate for all possible values of r, we get a cone.


of the voting process despite increasing the dimension. This leads us to a basic Hough
algorithm for circles:

Algorithm 4.3: The Hough algorithm for circles.

Input: An image, I
Result: A detected circle.
Initialize an empty voting array: H[a, b, r] = 0
foreach edge point, E(x, y) ∈ I do
foreach possible radius do
foreach possible gradient direction θ do // or an estimate via (3.1)
a = x − r cos θ
b = y + r sin θ
H[a, b, r] += 1
end
end
end
Find the value(s) of (a, b, r) where H[a, b, r] is a maximum.
The result is given by applying the equation of a circle (4.2).

In practice, we want to apply the same tips that we outlined in Extensions and a few others:
use edges with significant gradients, choose a good grid discretization, track which points
make which votes, and consider smoothing your votes by voting for neighboring bins (perhaps
with a different weight).
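For reference, OpenCV packages this circle-voting scheme (including the gradient trick described above) as cv2.HoughCircles; a hedged usage sketch with made-up file name and parameter values:

import cv2
import numpy as np

img = cv2.imread('coins.png', cv2.IMREAD_GRAYSCALE)    # hypothetical input image
img = cv2.medianBlur(img, 5)                            # knock out speckle noise first
# HOUGH_GRADIENT uses edge gradients to vote; param1 is the Canny high threshold,
# param2 is the accumulator (vote) threshold, and the radius range bounds the search.
circles = cv2.HoughCircles(img, cv2.HOUGH_GRADIENT, dp=1, minDist=20,
                           param1=100, param2=30, minRadius=5, maxRadius=60)
if circles is not None:
    for a, b, r in np.round(circles[0]).astype(int):
        print(f'circle at ({a}, {b}) with radius {r}')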

4.3 Generalization
The advent of the generalized Hough transform and its ability to determine the existence
of any well-defined shape has caused a resurgence in its application in computer vision.
Rather than working with analytic models that have fixed parameters (like circles with a radius, r), we'll be working with non-analytic models described by visual code-words that capture the object's features rather than using its basic edge pixels. Previously, we knew how to vote given a
particular pixel because we had solved the equation for the shape. For an arbitrary shape,
we instead determine how to vote by building a Hough table.


4.3.1 Hough Tables


A Hough table stores displacement vectors for a par-
ticular gradient angle θ. To “train” a Hough table
on a particular shape, we can follow these steps, as-
sisted by the arbitrary shape in Figure 4.8:
1. At each boundary point (defined by the
shape), compute a displacement vector r =
c − pi , where c can be the center of the object,
or just some arbitrary reference point.
2. Measure (or, rather, approximate, since our best guess uses differences with neighboring pixels) the gradient angle θ at the boundary point.
3. Store the displacement vector r in a table indexed by θ.

Figure 4.8: An arbitrary shape that outlines building a few indices of its Hough table.
Then, at “recognition” time, we essentially go backwards:
1. At each boundary point, measure the gradient angle θ.
2. Look up all displacements for that θ in the Hough table.
3. Vote for a center at each displacement.
Figure 4.9 demonstrates the voting pattern after accumulating the votes for a feature in the
object – in this case, it’s the bottom horizontal line.

Figure 4.9: Voting and recognition of a single feature (i.e. a horizontal line) in an arbitrary shape. (a) The training phase for the horizontal line "feature" in the shape, in which the entry for θ = 90° stores all of these displacement vectors. (b) The voting phase, in which each of the three points (so far) has voted for the center points after applying the entire set of displacement vectors. Notice that if we were to extrapolate voting for the entire feature, we'd get strong votes for a horizontal line of possible centers, with the amount of votes growing with more overlap.

After all of the boundary points vote for their “line of possible centers,” the strongest point
of intersection among the lines will be the initial reference point, c, as seen in Figure 4.10.


Figure 4.10: A subset of votes cast during the second set of feature
points, after completing the voting for the first set that was started in
Figure 4.9.

This leads to the following generalized Hough transform algorithm, which (critically!) as-
sumes the orientation is known:

Algorithm 4.4: The generalized Hough transform, for known orientations.

Input: I, an image.
Input: T , a Hough table trained on the arbitrary object.
Result: The center point of the recognized object.
Initialize an empty voting array: H[x, y] = 0
foreach edge point, E(x, y) ∈ I do
Compute the gradient direction, θ
foreach v ∈ T [θ] do
H[vx , vy ] += 1
end
end
The peak in the Hough space is the reference point with the most supported edges.
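Here is a rough Python sketch of algorithm 4.4 (known orientation), under the assumption that we already have binary edge maps and per-pixel gradient directions for both the template and the query image; the 10° angle bins are an arbitrary quantization choice.

import numpy as np
from collections import defaultdict

def quantize(theta, n_bins=36):
    """Map a gradient angle in radians to one of n_bins coarse bins."""
    return int(((theta % (2 * np.pi)) / (2 * np.pi)) * n_bins) % n_bins

def build_hough_table(template_edges, template_grad, center):
    """Training: store displacement vectors r = c - p, indexed by quantized gradient angle."""
    table = defaultdict(list)
    for y, x in zip(*np.nonzero(template_edges)):
        table[quantize(template_grad[y, x])].append((center[0] - x, center[1] - y))
    return table

def vote_for_center(image_edges, image_grad, table):
    """Recognition: every edge point votes for candidate centers via its table entries."""
    H = np.zeros(image_edges.shape, dtype=np.int64)
    for y, x in zip(*np.nonzero(image_edges)):
        for dx, dy in table[quantize(image_grad[y, x])]:
            cx, cy = x + dx, y + dy
            if 0 <= cy < H.shape[0] and 0 <= cx < H.shape[1]:
                H[cy, cx] += 1
    return np.unravel_index(np.argmax(H), H.shape)   # (row, col) of the best-supported center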

Of course, variations in orientation are just an additional variable. All we have to do is try
all of the possible orientations. Naturally, this is much more expensive since we are adding
another dimension to our Hough space. We can do the same thing for the scale of our
arbitrary shape, and the algorithm is nearly identical to algorithm 4.5, except we vote with


a “master scale” instead of a “master orientation.”

Algorithm 4.5: The generalized Hough transform, for unknown orientations.

Input: I, an image.
Input: T , a Hough table trained on the arbitrary object.
Result: The center point of the recognized object.
Initialize an empty voting array: H[x, y, t] = 0
foreach edge point, E(x, y) ∈ I do
foreach possible θ∗ do // (the “master” orientation)
Compute the gradient direction, θ
θ0 = θ − θ∗
foreach v ∈ T [θ0 ] do
H[vx , vy , θ∗ ] += 1
end
end
end
The peak in the Hough space (which is now (x, y, θ∗ )) is the reference point with the
most supported edges.

Frequency Analysis

The peoples of civilization see their wretchedness increase in direct proportion to the advance of industry.
— Charles Fourier

The goal of this chapter is to return to image processing and gain an intuition for images from a signal processing perspective. We'll introduce the Fourier transform, then touch on and study a phenomenon called aliasing, in which seemingly-straight lines in an image appear jagged. Furthermore, we'll extend this idea into understanding why, to shrink an image in half, we shouldn't just throw out every other pixel.

Warning: Here Be Dragons

Things are boutta get real mathematical up in here. Do some review of fundamental
linear algebra concepts before going further. You’ve been warned.

5.1 Basis Sets


A basis set B is defined as a linearly independent subset of a vector space V that spans V. For example, the vectors [0 1] and [1 0] make up a basis set for the xy-plane: every vector in the xy-plane can be represented as a linear combination of these two vectors.
Formally: Suppose we have some B = {v1 , . . . , vn }, which is a finite subset of a vector
space V over a field F (such as the real or complex number fields, R and C). Then, B is a
basis if it satisfies the following conditions:
• Linear independence, which is the property that

∀a1 , . . . an ∈ F
if a1 v1 + . . . + an vn = 0
then a1 = . . . = an = 0

In other words, for any set of constants such that the linear combination of those
constants and the basis vectors is the zero vector, then it must be that those constants


are all zero.


• Spanning property, which states that

∀x ∈ V, ∃a1, . . . , an ∈ F such that x = a1·v1 + . . . + an·vn

In English, for any vector in that vector space, we can find some constants such that their linear combination with the basis set equals that vector. In other words, it means we can make any vector in the vector space with some combination of the (scaled) basis vectors.
With that in mind, we can consider images as being a single point in an N × N vector space:
[x00 x10 x20 . . . x(n−1)0 x01 . . . x(n−1)(n−1)]^T

We can formulate a simple basis set for this vector space by toggling each pixel as on or off:
[0 0 0 . . . 0 1 0 0 . . . 0 0 0]^T
[0 0 0 . . . 0 0 1 0 . . . 0 0 0]^T
. . .

This is obviously independent and can create any image, but isn’t very useful. . . Instead, we
can view an image as a variation in frequency in the horizontal and vertical directions, and
the basis set would be vectors that tease away fast vs. slow changes in the image:

This is called the Fourier basis set.

5.2 Fourier Transform


Jean Baptiste Joseph Fourier (not the guy in the chapter quote) realised that any peri-
odic (repeating) function can be rewritten as a weighted sum of sines and cosines of dif-
ferent frequencies, and we call this the Fourier series. Our building block is a sinusoid:


A sin (ωx + ϕ). Fourier’s conclusion was that if we add enough of these together, we can get
any signal f (x). Our sinusoid allows three degrees of freedom:
• A is the amplitude, which is just a scalar for the sinusoid.
• ω is the frequency. This parameter controls the coarseness vs. fine-ness of a signal;
as you increase it, the signal “wiggles” more frequently.
• ϕ is the phase. We won’t be discussing this much because our goal in computer vision
isn’t often to reconstruct images, but rather simply to learn something about them.

Time & Frequency Suppose we have some sample signal:

g(t) = sin(2πf t) + (1/3) sin(2π(3f)t)
We can break it down into its “component sinusoids” like so:

(Plots of the combined signal g(t) and its two component sinusoids, a(t) and b(t), over t.)

If we were to analyse the frequency spectrum of this signal, we would see that there is
some “influence” (which we’ll call power) at the frequency f , and 1/3rd of that influence at
the frequency 3f .
Notice that the signal seems to be approximating a square wave? Well, a square wave can
be written as an infinite sum of odd frequencies:

A Σ_{k=1,3,5,...} (1/k) sin(2πkt)
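A quick NumPy sketch (mine) of this convergence — summing the first twenty-odd harmonics already looks quite square:

import numpy as np

t = np.linspace(0.0, 1.0, 1000)
square_approx = np.zeros_like(t)
for k in range(1, 40, 2):                        # odd harmonics only: k = 1, 3, 5, ...
    square_approx += np.sin(2 * np.pi * k * t) / k
# Each additional odd harmonic sharpens the corners of the approximated square wave.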

Now that we’ve described this notion of the power of a frequency on a signal, we want to
transform our signals from being functions of time to functions of frequency. This algorithm
is called the Fourier transform.
We want to understand the frequency, ω, of our signal f (x). Let’s reparameterize the signal
by ω instead of x to get some F (ω). For every ω ∈ (−∞, ∞), our F (ω) will both hold the


corresponding amplitude A and phase ϕ. How? By using complex numbers:

F(ω) = R(ω) + iI(ω)        (5.1)
A = ±√(R(ω)² + I(ω)²)        (5.2)
ϕ = tan⁻¹(I(ω)/R(ω))        (5.3)

We will see that R(ω) corresponds to the even part (that is, cosine) and I(ω) will be the
odd part (that is, sine).1 Furthermore, computing the Fourier transform is just computing
a basis set. We’ll see why shortly.
First off, the infinite integral of the product of two sinusoids of differing frequencies is zero, and the infinite integral of the product of two sinusoids of the same frequency is infinite (unless they're exactly 90° out of phase, since sine and cosine cancel out):

∫_{−∞}^{∞} sin(ax + φ) sin(bx + ϕ) dx =
    0,     if a ≠ b
    ±∞,   if a = b and φ + π/2 ≠ ϕ
    0,     if a = b and φ + π/2 = ϕ

With that in mind, suppose f (x) is a simple cosine wave of frequency ω: f (x) = cos (2πωx).
That means that we can craft a function C(u):
C(u) = ∫_{−∞}^{∞} f(x) cos(2πux) dx

that will infinitely spike (or, create an impulse. . . remember Impulses?) wherever u = ±ω
and be zero everywhere else.
Don’t we also need to do this for all phases? No! Any phase can be represented as a weighted
sum of cosine and sine, so we just need one of each piece. So, we’ve just created a function
that gives us the frequency spectrum of an input signal f (x). . . or, in other words, the
Fourier transform.
To formalize this, we represent the signal as an infinite weighted sum (or linear combina-
tion) of an infinite number of sinusoids:2
F(u) = ∫_{−∞}^{∞} f(x) e^{−i2πux} dx        (5.4)

We also have the inverse Fourier transform, which turns a spectrum of frequencies back into the original signal in the spatial domain:

f(x) = ∫_{−∞}^{∞} F(u) e^{i2πux} du        (5.5)
¹ An even function is symmetric with respect to the y-axis: cos(x) = cos(−x). Similarly, an odd function is symmetric with respect to the origin: −sin(x) = sin(−x).
² Recall that e^{ik} = cos k + i sin k, where i = √−1.


5.2.1 Limitations and Discretization


The Fourier transform only exists if the input function f is integrable. That is,
∫_{−∞}^{∞} |f(x)| dx < ∞

With that in mind, if there is some range in which f is integrable (but not necessarily (−∞, ∞)), we can just do the Fourier transform in just that range. More formally, if there is some bound of width T outside of which f is zero, then obviously we could integrate from [−T/2, T/2]. This notion of a "partial Fourier transform" leads us to the discrete Fourier transform, which is the only way we can do this kind of stuff in computers. The discrete Fourier transform looks like this:

F(k) = (1/N) Σ_{x=0}^{N−1} f(x) e^{−i 2πkx/N}        (5.6)

Imagine applying this to an N -pixel image. We have x as our discrete “pixel iterator,” and
k represents the number of “cycles per period of the signal” (or “cycles per image”) which is
a measurement of how quickly we “wiggle” (changes in intensity) throughout the image.
It's necessarily true that k ∈ [−N/2, N/2], because the highest possible frequency of an image would be a change from 0 to 255 for every pixel. In other words, every other pixel is black, which is a period of 2 and N/2 total cycles in the image.
We can extend the discrete Fourier transform into 2 dimensions fairly simply. The 2D
Fourier transform is:
F(u, v) = (1/2) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) e^{−i2π(ux+vy)} dx dy        (5.7)

and the discrete variant is:³

F(kx, ky) = (1/N) Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} f(x, y) e^{−i 2π(kx·x + ky·y)/N}        (5.8)

Now typically when discussing the “presence” of a frequency, we’ll be referring to its power
as in (5.2) rather than the odd or even parts individually.
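In practice nobody codes the sums in (5.6)–(5.8) by hand; a minimal NumPy sketch (not from the notes) for inspecting an image's power spectrum might look like this:

import numpy as np

image = np.random.rand(256, 256)        # stand-in for a grayscale image
F = np.fft.fft2(image)                  # 2D discrete Fourier transform
F = np.fft.fftshift(F)                  # center the k = 0 origin, which is easier to interpret
power = np.abs(F) ** 2                  # power = R(u, v)^2 + I(u, v)^2
log_power = np.log1p(power)             # log scale makes the spectrum easier to visualize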

5.2.2 Convolution
Now we will discuss the properties of the Fourier transform. When talking about convolution, convolving two functions in the spatial domain is the same thing as multiplying their Fourier transforms. In other words, let g = f ~ h. Then, G(u) is

G(u) = ∫_{−∞}^{∞} g(x) e^{−i2πux} dx
³ As a tip, the transform works best when the origin of k is in the middle of the image.


Of course, we said that g(x) is a convolution of f and h, meaning:

G(u) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(τ) h(x − τ) e^{−i2πux} dτ dx

We can rearrange this and perform a change of variables, where x′ = x − τ:

G(u) = [∫_{−∞}^{∞} f(τ) e^{−i2πuτ} dτ] · [∫_{−∞}^{∞} h(x′) e^{−i2πux′} dx′]

Notice anything? This is a product of Fourier transforms!

G(u) = F (u) · H(u)

This leads us to the following property:


convolution in the spatial domain ⇐⇒ multiplication in the frequency domain
This can definitely be useful for performance, but that’s less relevant nowadays. Even still,
it has applications. Suppose we want to smooth the function f (x). We can convolve it
with a Gaussian kernel as in (2.3), or we can multiply F (u) by the Fourier transform of the
Gaussian kernel, which is actually still a Gaussian! This is demonstrated in Figure 5.1.

Figure 5.1: Applying a Gaussian filter to an image in both the spatial (top) and frequency (bottom) domains.
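A small sketch (mine) verifying that the two routes agree, using wrap-around boundaries so the spatial convolution exactly matches circular convolution in the frequency domain:

import numpy as np
from scipy import ndimage

f = np.random.rand(256, 256)                     # stand-in "image"
sigma = 3.0

# Route 1: smooth in the spatial domain.
spatial = ndimage.gaussian_filter(f, sigma, mode='wrap')

# Route 2: multiply FFTs. Build the kernel's FFT from its impulse response.
impulse = np.zeros_like(f)
impulse[0, 0] = 1.0
kernel_fft = np.fft.fft2(ndimage.gaussian_filter(impulse, sigma, mode='wrap'))
freq = np.real(np.fft.ifft2(np.fft.fft2(f) * kernel_fft))

print(np.allclose(spatial, freq))                # True, up to floating-point error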

This relationship with convolution is just one of the properties of the Fourier transform. Some of the other properties are noted in Table 5.1. An interesting one is the scaling property: in the spatial domain, a > 1 will shrink the function, whereas in the frequency domain the corresponding transform is stretched. This is most apparent in the Gaussian filter: a tighter Gaussian (in other words, a smaller σ) in the spatial domain results in a larger Gaussian in the frequency domain (with a spread proportional to 1/σ).


                   Spatial Domain (x)        Frequency Domain (u)
 Linearity          c1·f(x) + c2·g(x)         c1·F(u) + c2·G(u)
 Convolution        f(x) ~ g(x)               F(u)·G(u)
 Scaling            f(ax)                     (1/|a|)·F(u/a)
 Differentiation    d^n f(x)/dx^n             (i2πu)^n·F(u)

Table 5.1: Properties of the Fourier Transform between domains.

5.3 Aliasing
With the mathematical understanding of the Fourier transform and its properties under our
belt, we can apply that knowledge to the problem of aliasing.
First, let’s talk about the comb function, also called an impulse train. Mathematically,
it’s formed like so:
Σ_{n=−∞}^{∞} δ(x − n·x0)

Where δ(x) is the magical "unit impulse function," formally known as the Dirac delta function (its discrete analogue is the Kronecker delta). The train looks like this:

Figure 5.2: An impulse train, with impulses at multiples of x0.

The Fourier transform of an impulse train is another impulse train whose spacing is inversely proportional to the original spacing, behaving much like the expansion due to the scaling property.
We use impulse trains to sample continuous signals, discretizing them into something under-
standable by a computer. Given some signal like sin(t), we can multiply it with an impulse
train and get some discrete samples that approximate the signal in a discrete way.
Obviously, some information is lost during sampling: our reconstruction might be imperfect
if we don’t have enough information. This (im)precise notion is exactly what the aliasing
phenomenon is, demonstrated by Figure 5.3. The blue signal is the original signal, and the
red dots are the samples we took. When trying to reconstruct the signal, the best we can
do is the dashed signal, which has a lower frequency. Aliasing is the notion that a signal
travels “in disguise” as one with a different frequency.


Figure 5.3: The aliasing phenomenon. The high frequency signal g(t) in blue is sampled too infrequently (samples in red); during reconstruction, a lower frequency signal (that is incorrect) can be obtained.

Theory in Action: The Wheels on the Bus Go Round and. . .


Backwards?
We’ve all experienced aliasing in the temporal domain.
In car commercials, we’ve seen wheels that appear to be spinning backwards while
the car moves forwards. This is an example of aliasing, and it occurs because the
rotation of the wheel is too fast for a video camera to pick up accurately: it takes
a picture every x frames, but the wheel’s motion is much faster and the difference
from image to image looks more like a small backwards rotation rather than a large
forward rotation.

In images, this same thing occurs often. The aliasing problem can be summarized by the
idea that there are not enough pixels (samples) to accurately render an intended effect. This
begs the question: how can we prevent aliasing?
• An obvious solution comes to mind: take more samples! This is in line with the
“megapixel craze” in the photo industry; camera lenses have ever-increasing fidelity and
can capture more and more pixels. Unfortunately, this can’t go on forever, and there
will always be uncaptured detail simply due to the nature of going from a continuous
to discrete domain.
• Another option would be to get rid of the problematic high-frequency information.
Turns out, even though this gets rid of some of the information in the image, it’s
better than aliasing. These are called low-pass filters.

5.3.1 Antialiasing
We can introduce low-pass filters to get rid of “unsafe” high-frequency information that
we know our sampling algorithm can’t capture, while keeping safe, low frequencies. We
perform this filtering prior to sampling, and then again after reconstruction. Since we know


that certain frequencies simply did not exist, any reconstruction that results in these high frequencies is incorrect, and they can be safely clipped off.
Let's formalize this idea. First, we define a comb function that can easily represent an impulse train as follows (where M is an integer):

comb_M[x] = Σ_{k=−∞}^{∞} δ[x − kM]

This is an impulse train in which every M-th x is a unit impulse. Remember that, due to the scaling property, the Fourier transform of the comb function is (1/M)·comb_{1/M}(u). We can extend this to 2D and define a bed of nails, which is just a comb in two directions, and which likewise tightens in the Fourier domain as it spreads in the spatial domain:

comb_{M,N}(x, y) = Σ_{k=−∞}^{∞} Σ_{l=−∞}^{∞} δ(x − kM, y − lN)  ⇐⇒  (1/(MN)) Σ_{k=−∞}^{∞} Σ_{l=−∞}^{∞} δ(u − k/M, v − l/N)

With this construct in mind, we can multiply a signal by a comb function to get discrete
samples of the signal; the M parameter varies the fidelity of the samples. Now, if we consider
the Fourier transform of a signal and its resulting convolution with the comb function after
we do our sampling in the spatial spectrum, we can essentially imagine a repeating FT every
1/M steps. If there is no overlap within this repeat, we don't get any distortion in the frequency spectrum.
Specifically, if W < 1/(2M), where W is the highest frequency in the signal, we can recover the
original signal from the samples. This is why CDs sample at 44 kHz, so that we can recover
everything up to 22 kHz, which is the maximal extent of human hearing. If there is overlap
(which would be the presence of high-frequency content in the signal), it causes aliasing
when recovering the signal from the samples: the high-frequency content is masquerading as
low-frequency content.
We know how to get rid of high frequencies: use a Gaussian filter! By applying a Gaussian,
which now acts as an anti-aliasing filter, we get rid of any overlap. Thus, given a signal
f (x), we do something like:
(f (x) ~ h(x)) · combM (x)

Resizing Images
This anti-aliasing principle is very useful when resizing images. What do you do when you
want to make an image half (or a quarter, or an eighth) of its original size? You could just
throw out every other pixel, which is called image subsampling. This doesn’t give very
good results: it loses the high-frequency content of the original image because we sample too
infrequently.
Instead, we need to use an antialiasing filter as we've discussed. In other words, first filter the image, then do subsampling. The stark difference in quality is shown in Figure 5.4.
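A minimal sketch of filter-then-subsample (my own; σ = 1 is a common rule of thumb for halving, not a value from the notes):

import numpy as np
from scipy import ndimage

def downsample_by_two(image, antialias=True):
    """Halve an image, optionally low-pass filtering first so high frequencies can't alias."""
    if antialias:
        image = ndimage.gaussian_filter(image, sigma=1.0)   # remove the "unsafe" frequencies
    return image[::2, ::2]                                  # then keep every other pixel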


Figure 5.4: The result of down-sizing the original image of Van Gogh
(left) using subsampling (right) and filtering followed by subsampling
(center). The down-sized image was then blown back up to its original
size (by zooming) for comparison.

Image Compression
The discoveries of John Robson and Frederick Campbell of the variation in sensitivity to
certain contrasts and frequencies in the human eye4 can be applied to image compression.
The idea is that certain frequencies in images can be represented more coarsely than others.
The JPEG image format uses the discrete cosine transform to form a basis set. This basis is then applied to 8×8 blocks of the image, and each block's frequencies correspond to entries in a quantization table that uses a different number of bits to approximate each frequency with a lower fidelity.

4
See this page for details; where do you stop differentiating the frequencies?

Cameras and Images

“What’s in a name? That which we call a rose


By any other name would smell as sweet.”
— Shakespeare, Romeo and Juliet [Act II, Scene ii]

We've been looking at images as 2-dimensional arrays of intensities, and dedicated an entire chapter to treating Images as Functions. But what is an image more realistically? What is a photograph that you take (and edit, and filter, and caption) and post on Instagram?
Close one eye, make a square with your fingers, and hold it up. What you see through the square is what your camera captures, and what your screen displays. As in Figure 6.1, this is called a projection. An image, or photograph, is a 2-dimensional projection of a set of 3-dimensional points.
When we go from 3D to 2D, we lose all of the information from the 3rd dimension. We've all seen this in the real world. Fantastic artists create optical illusions that fool us into thinking they have real depth, like in Figure 6.2. Our goal in this chapter is going to be extrapolating this information back.

Figure 6.1: A diagram outlining a perspective projection of a 3-dimensional object onto a 2-dimensional surface.

Figure 6.2: An example of anamorphic art and Pavement Art Illusion that, when viewed from the right angle, appears to have depth.


6.1 Cameras
This section will discuss how cameras work, covering pinhole cameras; the math behind
lenses; properties like focus, aperture, and field-of-view; and the math behind all of the
various properties of lenses.

TODO: This

Covering these topics requires a lot of photographs and diagrams (as in, it’s basically
the only way to explain most of these topics), and so I’m putting off doing that
until later.

6.2 Perspective Imaging


To get an understanding of what we’ll be doing, refer to Figure 6.1. We start with a point
in the 3D “real world” at some coordinate, (x, y, z). This gets projected to the center of
projection, passing through a projection plane of some size (intersecting at some location
(x, y, −d)) that is d units away from the center of projection, centered on the z-axis.
We make the center of projection – which we’ll refer to simply as the camera – the origin in
our coordinate system, and the projection plane (or image plane) in front of the camera so
that we can treat the image plane as a simple xy-plane. This means the camera looks down
the negative z-axis, and the image plane has (0, 0) in the center, not the top-left corner like
we’re used to.
We can model our projection scheme by using the similar triangles in Figure 6.3. For some point (x, y, z):

(x, y, z) → (−d·x/z, −d·y/z, −d)

This models the point of intersection on the projection plane with an arbitrary ray of light (like the blue ray in Figure 6.3). This means that (x′, y′) = (−d·X/Z, −d·Y/Z).

6.2.1 Homogeneous Coordinates


Unfortunately, the operation we use to convert from 3D space to our 2D projection – division
by an ever-changing z to get our new (x0 , y 0 ) – is not a linear transformation. Fortunately,
we can use a trick: add one more coordinate (sometimes called the scaling coordinate) to
make it a linear operation.
Our 2D homogeneous image coordinates create a simple mapping from "pixel space" to image space, and a similar one for 3D:

(x, y) ⇒ [x, y, 1]^T        (x, y, z) ⇒ [x, y, z, 1]^T


Figure 6.3: A projection coordinate system. PP is the projection (or "image") plane (which is what we see), a distance d along the z-axis, and the center of projection is the location of the "camera."

 
To convert from some homogeneous coordinate [x, y, w]^T back to pixel space, we use (x/w, y/w), and similarly for 3D, we use (x/w, y/w, z/w). It's interesting to note that homogeneous coordinates are invariant under scaling: if you scale the homogeneous coordinate by some a, the coordinate in pixel space will be unaffected because of the division by aw.

Perspective Projection
With the power of homogeneous coordinates, the projection of a point in 3D to a 2D per-
spective projection plane is a simple matrix multiplication in homogeneous coordinates:
 
[1 0  0  0]   [x]   [ x ]
[0 1  0  0] · [y] = [ y ]   ⇒  (f·x/z, f·y/z)  =⇒  (u, v)        (6.1)
[0 0 1/f 0]   [z]   [z/f]
              [1]

This multiplication is a linear transformation! We can do all of our math under homogeneous
coordinates until we actually need to treat them as a pixel in an image, which is when we


perform our conversion. Also, here f is the focal length, which is the distance from the
center of projection to the projection plane (d in Figure 6.3).
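A tiny NumPy sketch (mine, with an arbitrary focal length and point) of equation (6.1) in action:

import numpy as np

f = 2.0                                        # focal length, arbitrary
P = np.array([[1, 0,     0, 0],
              [0, 1,     0, 0],
              [0, 0, 1 / f, 0]], dtype=float)  # the perspective projection matrix from (6.1)

X = np.array([3.0, 1.5, 10.0, 1.0])            # homogeneous 3D point (x, y, z, 1)
p = P @ X                                      # homogeneous image point (x, y, z/f)
u, v = p[0] / p[2], p[1] / p[2]                # divide by the scaling coordinate
# (u, v) == (f*x/z, f*y/z) == (0.6, 0.3) for this point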
How does scaling the projection matrix change the transformation? It doesn’t! Recall the
invariance property briefly noted previously:
 
[a 0  0  0]   [x]   [ ax ]
[0 a  0  0] · [y] = [ ay ]   ⇒  (f·x/z, f·y/z)
[0 0 a/f 0]   [z]   [az/f]
              [1]

6.2.2 Geometry in Perspective


As we’d hope, points, lines, and polygons in 3-dimensional space correspond to points, lines,
and polygons on our projection plane. They don’t preserve the same properties, though. For
example, parallel lines intersect at what’s called the vanishing point. We can see this in
the mathematics.
A line in 3D can be characterized as a parametric vector of equations in t:

r(t) = ( x(t), y(t), z(t) ) = ( x0 + at, y0 + bt, z0 + ct )

Let's apply perspective projection to the line:

x′(t) = f·x/z = f(x0 + at)/(z0 + ct)        y′(t) = f·y/z = f(y0 + bt)/(z0 + ct)

In the limit, as t → ±∞, we see (for c ≠ 0):

x′(t) → f·a/c        y′(t) → f·b/c

Notice that the “start of the line,” (x0 , y0 , z0 ) has disappeared entirely! This means that no
matter where a line starts, it will always converge at the vanishing point (f a/c, f b/c). The
restriction that c 6= 0 means that this property of parallel lines applies to all parallel lines
except those that exist in the xy-plane, that is, parallel to our projection plane!
All parallel lines in the same plane converge at colinear vanishing points, which we know
as the horizon.
Human vision is strongly affected by the notion of parallel lines. See the lecture snippet.
TODO: Add the illusion and explain why it happens.

6.2.3 Other Projection Models


There are models aside from the perspective projection model that are useful and we’ll be
studying.


Orthographic Projection
This variant of projection is often used in computer graphics and in video games that are 2-dimensional. The model essentially "smashes" the real world against the projection plane. In 2D games, the assets (textures, sprites, and other images) are already flat and 2-dimensional, and so don't need any perspective applied to them.
Orthographic projection – also called parallel projection – can actually be thought of as
a special case of perspective projection, with the center of projection (or camera) infinitely
far away. Mathematically, f → ∞, and the effect can be seen in Figure 6.4.

Figure 6.4: An orthographic projection model, in which the z coordinate is dropped entirely with no transformation.

Transforming points onto an orthographic projection just involves dropping the z coordinate.
The projection matrix is simple:

[1 0 0 0]   [x]   [x]
[0 1 0 0] · [y] = [y]   ⇒  (x, y)
[0 0 0 1]   [z]   [1]
            [1]

Weak Perspective
This perspective model provides a form of 3D perspective, but the scaling happens between
objects rather than across every point. Weak perspective hinges on an important assump-
tion: the change in z (or depth) within a single object is not significant relative to its distance
from the camera. For example, a person might be about a foot thick, but they are standing
a mile from the camera.
All of the points in an object that is z0 distance away from the projection plane will be mapped as such:

(x, y, z) → (f·x/z0, f·y/z0)

This z0 changes from object to object, so each object has its own scale factor. Its projection


matrix is:

[1 0 0  0 ]   [x]   [ x ]
[0 1 0  0 ] · [y] = [ y ]   ⇒  (sx, sy)
[0 0 0 1/s]   [z]   [1/s]
              [1]

6.3 Stereo Geometry


In general, the topic of this section will be the geometric relationships between the camera
(2D) and the scene (3D). “Stereo” in the context of imaging is just having two views of the
same scene; this is much like the human vision system, which uses your two eyeballs to get
two slightly-differing views of the world around you. This construct enables us to perceive
depth, and we’ll be working with that shortly as well.
Structure and depth are inherently ambiguous from single views. There are all kinds of optical illusions like the one in Figure 6.5 that take advantage of this ambiguity. Mathematically, why is this the case? If we refer back to Figure 6.3, we can see that it's because any point along the blue ray (i.e. at any depth) will hit the same point on the projection plane.

Figure 6.5: The largest pumpkin ever grown.

Let's return to the human eye. We see the same scene from two slightly different angles, but this can also be interpreted as seeing the same scene that has moved slightly. In other words, we can get a similar understanding of depth from motion (which gives us two different views after some step) rather than seeing two angles of the same scene.
In fact, this concept was used to create stereoscopic glasses that, when viewing a special
type of image that was two photographs of a scene at different angles, created the feeling of
depth. A similar principle is applied in the old-school red blue 3D glasses.
Research has shown us that human stereo processing, or binocular fusion, is based on a
low-level differentiation between changes across our two “cameras,” rather than a large-scale
recognition of objects and a correlation of those objects across views.
Our simple stereo camera system is shown in Figure 6.6. The cameras are separated by
a baseline, B, and their focal length is f . We also have some point P at a distance Z in
the camera coordinate system. We can also measure the distances xl and xr , which are the
points of intersection with the left and right planes, respectively. Since we’re working in a
coordinate system in which (0, 0) is the center of each camera, xl ≥ 0 and xr ≤ 0.
What is the expression for Z? To find out, we need to do some simple geometry with similar
triangles. We have the first triangle (pl , P, pr ) and the second triangle (CL , P, CR ), both


Figure 6.6: A top-down view of a simple stereo camera system, with distances marked for the projection of an arbitrary point P onto both cameras.

highlighted in red in Figure 6.7. This leads to the relationship:

(B − xl + xr) / (Z − f) = B / Z

Rearranging that gives us:

Z = f · B / (xl − xr)
which is an incredibly useful relationship. We are computing the distance to something in
the scene based on the disparity, or difference between the two projections. In other words,
disparity is inversely proportional to depth.
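To make the relationship concrete (with invented numbers): if f = 500 pixels, B = 0.1 m, and the measured disparity is xl − xr = 20 pixels, then Z = 500 · 0.1/20 = 2.5 m; halve the disparity to 10 pixels and the depth doubles to 5 m.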
What if disparity = 0? That means a point doesn't change at all between the left and right images. That would be more and more true for objects further away, which is exactly what we saw with the orthographic projection model (Figure 6.4) as our center of projection got infinitely far away.

Figure 6.7: The similar triangles (in red) that we use to relate the points projected onto both cameras.


6.3.1 Finding Disparity


Knowing that depth is correlated with disparity, how can we find the disparity between two images? Given a point in an image, we know that a similar point in the other image has to be somewhere within some constraints.
In a project that I did in my undergrad,1 we used the squared Euclidean distance to find the
pixel in one image that was most similar to the pixel in another using a “feature patch”:
d²_E = Σ_{x=0}^{W} Σ_{y=0}^{H} (A(x, y) − B(x, y))²

A smaller distance meant a more similar feature patch. It worked pretty well, and we
generated some awesome depth maps that we then used to create stereograms. Since we’re
big smort graduate students now, things are going to get a little more complicated. First,
we need to take a foray into epipolar geometry.

6.3.2 Epipolar Geometry


Let’s consider the general case in which we do not have parallel optical axes as in Figure 6.6.
Given a point p in the left image, where can the corresponding point p0 be in the right image?
Do you remember the initial problem we had with projection and depth that made optical illusions like Figure 6.5 possible? It was that any point along a ray between the camera and its projection plane maps to the same point in the plane. This means that, similarly, mapping
that ray to the other camera will create a line of possible points. This notion is one of the
principles derived in epipolar geometry. Hopefully, Figure 6.8 illustrates this better than
my explanation does.
Thus, we have a line of possible values in the right projection plane called the epipolar line.
There is a plane that contains this line as well as the camera centers of both images, and
that plane forms a corresponding epipolar line on the original (left) image. Any point that
is on the epipolar line of one image must be on its corresponding epipolar line on the other
image. This is called the epipolar constraint.
Let’s define some terms used in epipolar geometry before we move forward:
• baseline: the line joining the camera centers.
• epipolar plane: the plane containing the baseline and some “world point.”
• epipolar line: to reiterate, this is the intersection of the epipolar plane of a given
point with the image planes.
• epipole: the point of intersection of the baseline with the image plane. Note that
because every epipolar plane also contains the baseline and intersects the image plane,
every epipolar line will contain the epipole.

1
Link: CS61c, Project 1.


Figure 6.8: The line of potential points (in blue) in the right image that could map to the projected point p in the left image.

The epipolar constraint reduces the "correspondence problem" to a 1D search across an epipolar line.

6.3.3 Stereo Correspondence


With some epipolar geometry under our belt, we can make progress on finding the disparity
(or correspondence) between two stereo images. We’re going to start with a lot of assump-
tions to simplify the scenario, but this allows us to immediately dive into solving the stereo
correspondence problem. We assume:
• the image planes are parallel (or co-planar),
• the same focal length for each camera,
• the epipolar lines are horizontal, and
• the epipolar lines are at the same y location in the image
We'll discuss minimizing these assumptions later in chapter 7 when we introduce Projective Geometry. For now, though, even with the epipolar constraint we don't have enough information to solve the correspondence problem. We need some additional "soft constraints" that will help us identify corresponding points:
• Similarity: The corresponding pixel should have a similar intensity.
• Uniqueness: There is no more than one matching pixel.
• Ordering: Corresponding pixels must maintain their order: pixels ABC in the left
image must likewise be matched to A0 B 0 C 0 in the right image.
• Limited Disparity Gradient: The depth of the image shouldn’t change “too quickly.”


We will focus primarily on the similarity soft constraint. For that, we need to expand our
set of assumptions a bit to include that:
• Most scene points are visible from both views, though they may differ slightly based
on things like reflection angles.
• Image regions for two matching pixels are similar in appearance.

Dense Correspondence Search


Let’s describe a simple algorithm for the dense correspondence search (dense here indi-
cates that we will try to find matches for every pixel in the image). The process is simple;
for each pixel (or window) in the left image:
• Compare with each pixel (or window) in the right image along the epipolar line.
• Choose the position with the minimum “match cost.” This can be determined by a
number of algorithms such as the sum of square differences (SSD) as we described
above2 or the normalized correlation we discussed in earlier chapters.
Using normalized correlation (as described in Filter Normalization), we can also account for
global differences in the images such as one being brighter than the other. Using the SSD
may give incorrect results because of such changes affecting the difference calculation.
We may run into issues in which our windows are too small to include a sufficient difference to find a correlation. Imagine a plain white hallway: our "match window" won't vary very much as we move across the epipolar line, making it difficult to find the correct corresponding window. We can increase the window size, of course, or, perhaps, we can simply do matching on high-texture windows only!
All in all, we’ve ended up with the same algorithm as the one I used in my undergrad
(described in Finding Disparity), but now we understand why it works and have considered
several modifications to increase robustness.
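For completeness, here's a brute-force version of that window-matching idea (my own sketch; it assumes a rectified pair so the epipolar lines are horizontal, and it is far too slow for anything but small images):

import numpy as np

def disparity_ssd(left, right, window=5, max_disp=64):
    """SSD block matching along horizontal epipolar lines of a rectified stereo pair."""
    half = window // 2
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1].astype(np.float32)
            best_cost, best_d = np.inf, 0
            for d in range(0, min(max_disp, x - half) + 1):       # search leftward in the right image
                cand = right[y - half:y + half + 1, x - d - half:x - d + half + 1]
                cost = np.sum((patch - cand) ** 2)                # the SSD "match cost"
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp

Swapping the SSD for normalized correlation makes the cost robust to the global brightness differences mentioned above.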

Uniqueness Constraint
Let’s briefly touch some of the other soft constraints we described above. The uniqueness
constraint states that there is no more than one match in the right image for every point in
the left image.
No more than one? Yep. It can’t be exactly one because of occlusion: the same scene from
different angles will have certain items occluded from view because of closer objects. Certain
pixels will only be visible from one side at occlusion boundaries.

2
The Euclidean distance is the same thing as the sum of squared differences when applied to a matrix (also
called the Frobenius norm).


Ordering Constraint
The ordering constraint specifies that pixels have to be in the same order across both of our
stereo images. This is generally true when looking at solid surfaces, but doesn’t always hold!
This happens in two cases. When looking at (semi-)transparent objects, different viewing
angles give different orderings because we can “see through” the surface. This is a rare case,
but another one is much more common: narrow occluding surfaces. For example, consider
a 3D scene with a skinny tree in the middle. In your left eye, you may see “left grass, tree,
right grass,” but in your right eye, you may instead see “tree, more grass” (as in, both the
“left” and “right” grass are on the right side of the tree).
Instead of imagining such a contrived scene, you can easily do this experiment yourself. Hold
your fingers in front of you, one behind the other. In one eye, the front finger will be on the
left, whereas the back finger will be on the left in your other eye.
Unfortunately, the “state-of-the-art” algorithms these days aren’t very good at managing
violations of the ordering constraint such as the scenarios described here.

6.3.4 Better Stereo Correspondence


The window-matching approach is not very good as a general solution. Here we’ll describe
a few solutions that are closer to the state-of-the-art in depth detection.
A very basic improvement we can make is treat the image like, well, an image. Instead of
assigning the pixels individually to their corresponding pixel, we can treat them as parts
of a whole (which they are). In other words, we can optimize correspondence assignments
jointly. This can be done at different granularities, such as finding disparity a scanline as a
time, or even for an entire 2D grid.

Scanlines
For the scanline method, we essentially have a 1D signal from both images and we want
the disparity that results in the best overall correspondence. This is implemented with a
dynamic programming formulation. I won’t go into detail on the general approach here;
instead, I’ll defer to this lecture which describes it much more eloquently. This approach
results in far better depth maps compared to the naïve window matching method, but
still results in streaking artifacts that are the result of the limitation of scanlines. Though
scanlines improved upon treating each pixel independently, they are still limited in that they
treat every scanline independently without taking into account the vertical direction. Enter
2D grid-based matching.

Grid Matching
We can define a “good” stereo correspondence as being one with a high match quality, mean-
ing each pixel finds a good match in the other image, and a good smoothness meaning
adjacent pixels should (usually) move the same amount. The latter property is similar to

63
Chapter 6: Cameras and Images

the “neighborliness” assumption we described in Computing Averages. Thus, in our search


for a good correspondence, we might want to penalize solutions that have these big jumps.
This leads to modeling stereo correspondence as an energy minimization problem. Given
two images, I1 and I2 , their matching windows W1 (i) and W2 (i + D(i)) (where we say that
the second window is the first window plus some disparity D(i)), and the resulting disparity
image D, we can create this energy minimization model.
We have the data term, Edata , which is the sum of the difference of squares between our
two images given some estimated disparity:
X
Edata = (W1 (i) − W2 (i + D(i)))2 (6.2)
i

We want to minimize this term, and this is basically all we were looking at when doing Dense
Correspondence Search. Now, we also have a smoothness term:
X
Esmooth = ρ (D(i) − D(j)) (6.3)
neighbors i,j

Notice that we are only looking at the disparity image in the smoothness term. We are
looking at the neighbors of every pixel and determining the size of the “jump” we described
above. We define ρ as a robust norm: it’s small for small amounts of change and it gets
expensive for larger changes, but it shouldn’t get more expensive for even larger changes,
which would likely indicate a valid occlusion.
Both of these terms form the total energy, and for energy minimization, we want the D that
results in the smallest possible E:

E = αEdata (I1 , I2 , D) + βEsmooth (D)

where α and β are some weights for each energy term. This can be approximated via graph
cuts,3 and gives phenomenal results compared to the “ground truth” relative to all previous
methods.4

6.3.5 Conclusion
Though the scanline and 2D grid algorithms for stereo correspondence have come a long
way, there are still many challenges to overcome. Some of these include:
• occlusions
• low-contrast or textureless image regions

3
The graph cut algorithm is a well-known algorithm in computer science. It is an algorithm for divid-
ing graphs into two parts in order to minimize a particular value. Using the algorithm for the energy
minimization problem we’ve described was pioneered in this paper.
4
You can check out the current state-of-the-art at Middlebury.

64
Notes on Computer Vision George Kudrayvtsev

• violations of brightness constancy, such as specular reflections


• really large baselines, B (distance between the cameras)
• camera calibration errors (which result in incorrect epipolar lines)

6.4 Extrinsic Camera Parameters


We began by discussing the properties of a camera and how it maps a 3D scene onto a 2D
image via lenses. We extrapolated on this to discuss projection, which described a variety
of different ways we can perform this mapping. Then, we discussed stereo geometry, which
used the geometric properties of a camera to correlate two images and get an understanding
of their depth. Now, we’ll return to the properties of a camera and discuss them further.
This time, though, we’ll be talking about extrinisic camera parameters. These are parame-
ters external to the camera; they can be manipulated by the person holding (or controlling)
it. In contrast with intrinsic parameters such as focal length, these include things such as
rotating the camera, or simply changing its location with respect to the world.
Recall, first, the perspective projection model we described in Figure 6.3. We introduced
Homogeneous Coordinates to handle the non-linearity problem of dividing by Z and turn
perspective projection into a simple matrix multiplication. An important assumption of our
model was the coordinate system: the center of projection was at the origin, and the camera
looked down the z-axis. This model means that the entire world literally revolves around
our camera; everything is relative to the center of projection.
This is a limiting model to work with. It’d be nicer if we could relate the coordinate system of
the world to the coordinate system of the camera; in other words, the camera just becomes
another object in the world. This is known as geometric camera calibration and is
composed of two transformations:
• First, we transform from some (arbitrary) world coordinate system to the camera’s 3D
coordinate space. These are the extrinsic parameters, or the camera pose.
• Then, from the 3D coordinates in the camera’s frame to the 2D image plane via pro-
jection. This transformation depends on the camera’s intrinsic parameters, such as
the focal length.
We can model the transformation from “world space” to “camera space” as Tcw . How many
degrees of freedom do we have in this transformation? Consider for a moment a cube in
space: how many ways can you move it? Well, obviously we can move it along any of the
axes in the xyz-plane, but we can also rotate it along any of these axes: we can turn it left
to right (yaw), we can lean it forward and back (pitch), and we can twist it back and forth
(roll). This gives six total degrees of freedom.

Notation Before we dive in, we need to define some notation to indicate which “coordinate
frame” our vectors are in. We will say that A P is the coordinates of P in frame A. Recall, as

65
Chapter 6: Cameras and Images

well, that a vector can be expressed as the scaled sum of the unit vectors (i, j, k). In other
words, given some origin A O,
A 
x
A −→
y ⇐⇒ OP = A x · iA A y · jA A z · kA
A 
  
P= 
A
z

6.4.1 Translation
With that notation in mind, what if we want to find the location of some vector P, whose
position we know in coordinate frame A, within some other coordinate frame B. This
described in Figure 6.9.

Figure 6.9: Finding P in the coordinate frame B, given A P.

This can be expressed as a straightforward translation: it’s the sum of P in our frame and
the origin of the other coordinate frame, OB , expressed in our coordinate frame:
B
P = A P + A (OB )

The origin of frame B in within frame A is just a simple offset vector. With that in mind,
we can model this translation as a matrix multiplication using homogeneous coordinates:
B
P = A P + A OB
B
1 0 0 A Ox,B
   A 
Px Px
B Py  0 1 0 A Oy,B  A Py 
B  =   
 Pz  0 0 1 A Oz,B  A Pz 
1 0 0 0 1 1

We can greatly simplify this by substituting in vectors for column elements. Specifically, we
can use the 3×3 identity matrix, I3 , and the 3-element zero vector as a row, 0T :
B  
I3 A OB A P
 
P
= T
1 0 1 1

66
Notes on Computer Vision George Kudrayvtsev

6.4.2 Rotation
Now things get even uglier. Suppose we have two coordinate frames that share an origin,
but they are differentiated by a rotation. We can express this succinctly, from frame A to
B, as:
B
P=B
AR A
P

Where BA R expresses points in frame A in the coordinate system of frame B. Now, what
does R look like? The complicated version is an expression of the basis vectors of frame A
in frame B:
 
iA · iB jA · iB kA · iB
B
AR =
 iA · jB jA · jB kA · jB  (6.4)
iA · kB jA · kB kA · kB
= iA B jA B kA (6.5)
B 
A T
iB
A
=  jBT  (6.6)
A T
kB

Each of the components of the point in frame A can be expressed somehow in frame B using
all of B’s basis vectors. We can also imagine that it’s each basis vector in frame B expressed
in frame A. (6.6) is an orthogonal matrix: all of the columns are unit vectors that are
perpendicular.

Example: Rotation about a single axis.


Suppose we just have the xy-plane, and are performing a rotation about the z axis. This
scenario, shown in Figure 6.10, is reminiscent of middle school algebra.

jB jA

iB
θ iA

Figure 6.10: A rotation about the z axis from frame A to B by some


angle θ.

67
Chapter 6: Cameras and Images

The rotation matrix for this transformation is:


 
cos θ − sin θ 0
Rz (θ) =  sin θ cos θ 0 (6.7)
0 0 1

Of course, there are many ways to combine this rotation in a plane to reach some arbitrary
rotation. The most common one is using Euler angles, which say: first rotate about the
“world” z, then about the new x, then about the new z (this is also called heading, pitch,
and roll). There are other ways to express arbitrary rotations – such as the yaw, pitch, and
roll we described earlier – but regardless of which one you use, you have to be careful. The
order in which you apply the rotations matters and negative angles matter; these can cause
all sorts of complications and incorrect results. The rotation matrices about the three axes
are (we exclude z since it’s above):
 
cos κ 0 − sin κ
Ry (κ) =  0 1 0  (6.8)
sin κ 0 cos κ
 
1 0 0
Rx (ϕ) = 0 cos ϕ − sin ϕ (6.9)
0 sin ϕ cos ϕ

Rotation with Homogeneous Coordinates


Thankfully, things are much simpler in homogeneous coordinates. It can be expressed as a
matrix multiplication, but is not commutative, unlike translation. We can say:
B  B  A 
P R 0 P
= AT (6.10)
1 0 1 1

6.4.3 Total Rigid Transformation


With translation and rotation, we can formulate a total rigid transformation between
two frames A and B:
B
P=B B B
A R P + OA

First we rotate into the B frame, then we add the origin offset. Using homogeneous coordi-
nates, though, we can do this all in one step:5
B  
1 B OA B
  A 
P AR 0 P
= T
1 0 1 0T 1 1
B B
 A 
R O A P
= AT
0 1 1
5
Notice that we reversed the order of the matrix multiplications. From right to left, we translate then
rotate.

68
Notes on Computer Vision George Kudrayvtsev

B  B B
 A  A 
P R OA P P
Even more simply, we can just say: = AT B
= AT
1 0 1 1 1
Then, to instead get from frame B to A, we just invert the transformation matrix!
A  B 
−1 B P
 
P A P B
= BT = AT
1 1 1

To put this back into perspective, our homogeneous transformation matrix being invertible
means that we can use the same matrix for going from “world space” to “camera space” to
the other way around: it just needs to be inverted.

6.4.4 The Duality of Space


We now have an understanding of how to transform between frames. We can use this for
our model of the camera and the world around it. Given a camera frame and world frame,
we can use our homogeneous transformation matrix to transform some point p:
C  C C
 W 
p WR Wt p
=
1 0T 1 1

The 4×4 matrix that makes up the first “row” we’ve shown here (as in, everything aside from
the bottom 0 0 0 1 row) is called the extrinsic parameter matrix. The bottom row
 

is what makes the matrix invertible, so it won’t always be used unless we need that property;
sometimes, in projection, we’ll just use the 3×4.

6.4.5 Conclusion
We’ve derived a way to transform points in the world to points in the camera’s coordinate
space. That’s only half of the battle, though. We also need to convert a point in camera
space into a point on an image. We touched on this when we talked about Perspective
Imaging, but we’re going to need a little more detail than that.
Once we understand these two concepts, we can combine them (as we’ll see, “combine”
means another beautifully-simple matrix multiplication) and be able to see exactly how any
arbitrary point in the world maps onto an image.

6.5 Intrinsic Camera Parameters


Hold the phone, didn’t we already do intrinsic camera parameters? Well, yes. Our pro-
jection matrix from (6.1) encoded the intrinsic properties of our camera, but unfortunately
it represented an ideal projection. It’s used in video games, sure, but everything there is
perfectly configured by the developer(s). If we want to apply this to the real world, things
aren’t so straightforward.

69
Chapter 6: Cameras and Images

6.5.1 Real Intrinsic Parameters


Our notion of “pixels” is convenient and necessary when working with discrete images, but
they don’t translate so well to the real world. Our camera has a focal length in some real-
world units, say, 10mm, that we need to somehow scale to pixels.
In other words, our perfect projection already has another set of factors (2, since pixels
might not be square!). Also, we assumed the center of the image aligned with the center
of projection (along the z-axis); of course, our image may have been cropped or otherwise
modified. Thus, we need to introduce some offset (u0 , v0 ). Furthermore, what if there was
some skew between the camera axes and they didn’t sample perfectly perpendicularly? We’d
need to account for this, as well:
v 0 sin(θ) = v
u0 = u − cos(θ)v 0
= u − cot(θ)v

Combining all of these potential parameters, we get 5 degrees of freedom in this mess:
x y
u = α − α cot(θ) + u0
z z
β y
v= + v0
sin(θ) z

. . . gags .
Can we make this any nicer? Notice the division by z all over the place? We can, again,
leverage homogeneous coordinates. Let’s shove the whole thing into a matrix:
 
    x
zu α α cot(θ) u0 0  
zv  =  0 β y
sin(θ)
v0 0 z 
z 0 0 1 0
1

More succinctly, we can isolate the intrinsic parameter matrix that transforms from a
point in camera space to a homogeneous coordinate:
p0 = K c p

We can get rid of the last column of zeroes, as well as use friendlier parameters. We can say
that f is the focal length, as before; s is the skew; a is the aspect ratio; and (cx , cy ) is the
camera offset. This gives us:  
f s cx
K =  0 af cy 
0 0 1

Of course, if we assume a perfect universe, f becomes our only degree of freedom and gives
us our perspective projection equation from (6.1).

70
Notes on Computer Vision George Kudrayvtsev

6.6 Total Camera Calibration


We can combine our extrinsic and intrinsic camera parameters to get a direct transformation
from an arbitrary 3D coordinate in world space to a 2D pixel on an image. Let’s reiterate our
two equations; we begin with our coordinate in world space, W p, and get out a homogeneous
pixel coordinate, p0 .
C  C C
 W 
p WR Wt p
The intrinsic camera parameters are represented by: =
1 0T
1 1
The extrinsic camera parameters are represented by: p0 = K C p
Combining these gives us our calibration matrix, M:

p0 = K C C W

W RW t p
p0 = M p W

We can also say that the true pixel coordinates are projectively similar to their homoge-
neous counterparts (here, s is our scaling value):
   
u su
v  ' sv 
1 s

In summary, we define our full camera calibration equation as being:

s x0c
  
f 1 0 0 0   
R 0 I T
M3×4 =  0 af yc 0
0 1 0 0 3×3 3×1 3 3×1
(6.11)
0 1 01×3 1
0 0 1 0 0 1 0 | 1×3{z }| {z }
| {z }| {z } rotation translation
intrinsics projection
| {z }
extrinsics

This equation gives us 11 degrees of freedom! So. . . what was the point of all of this again?
Well, now we’ll look into how to recover M given some world coordinates and some pixel in
an image. This allows us to reconstruct the entire set of parameters that created that image
in the first place!

6.7 Calibrating Cameras


At the heart of what we’ll be accomplishing in this section is finding the M described in
(6.11). In general, we have, for some point x in homogeneous coordinates, a transformation

71
Chapter 6: Cameras and Images

from world to image using M:


 
  x
∗ ∗ ∗ ∗  
y
x0 = ∗ ∗ ∗ ∗ 
z  = Mx
∗ ∗ ∗ ∗
1

One of the ways this can be accomplished is through placing a known object into the scene;
this object will have a set of well-describable points on it. If we know the correspondence
between those points in “world space,” we can compute a mapping from those points to
“image space” and hence derive the camera calibration that created that transformation.
Another method that is mathematically similar is called resectioning. Instead of a par-
ticular object, we can also determine a series of known 3D points in the entire scene. This
results in a scene akin to what’s described in Figure 6.11.

Figure 6.11: A lab outfitted with special markers indicating “known


points” for resectioning.

After measuring each point in the real world from some arbitrary origin, we can then correlate
those points to an image of the same scene taken with a camera. What does this look like
mathematically? Well, for a single known point i, we end up with a pair of equations.

 
      Xi
ui wui m00 m01 m02 m03  
 vi  '  wvi  = m10 m11 m12 m13   Yi  (6.12)
 Zi 
1 w m20 m21 m22 m23
1
m00 Xi + m01 Yi + m02 Zi + m03
ui = (6.13)
m20 Xi + m21 Yi + m22 Zi + m23
m10 Xi + m11 Yi + m12 Zi + m13
vi = (6.14)
m20 Xi + m21 Yi + m22 Zi + m23

72
Notes on Computer Vision George Kudrayvtsev

We can rearrange this, then convert that back to a(n ugly) matrix multiplication:

ui (m20 Xi + m21 Yi + m22 Zi + m23 ) = m00 Xi + m01 Yi + m02 Zi + m03 (6.15)


vi (m20 Xi + m21 Yi + m22 Zi + m23 ) = m10 Xi + m11 Yi + m12 Zi + m13 (6.16)
 
m00
m01 
 
m02 
 
m03 
 
m10 
    
Xi Yi Zi 1 0 0 0 0 −ui Xi −ui Yi −ui Zi −ui  m11  = 0 (6.17)

0 0 0 0 Xi Yi Zi 1 −vi Xi −vi Yi −vi Zi −vi m12  
 0
m13 
 
m20 
 
m21 
 
m22 
m23

This is a set of homogeneous equations because they’re equal to 0. An obvious solution


would be m = 0, but we don’t want that: this is a constraint on our solution.
You may recall from linear algebra that we can try to find the least squares solution to
find an approximation to a system of equations in this form: Ax = b.6 This amounts to
minimizing kAxk, which is an approximate solution if there isn’t one.
We know that m is only valid up to a particular scale. Consider again our original equation in
(6.12): if we scale all of M by some scale factor, it will go away after division by w. Remem-
ber, homogeneous coordinates are invariant under scaling (see Homogeneous Coordinates).
Instead, we’ll solve for m as a unit vector, m̂.
Then, the question becomes: which possible unit vector m̂ will minimize kAm̂k? The
solution, which we’ll show next, is the eigenvector of AT A with the smallest eigenvalue.
We can solve this with 6 or more known points, which correlates to our available degrees of
freedom in our calibration matrix M (see 6.11).

6.7.1 Method 1: Singular Value Decomposition


To reiterate, we want to find the m such that it minimizes kAmk, constrained by kmk = 1
(in other words, m̂ is a unit vector).
To derive and describe the process of determining this m̂, we’ll be looking at singular value
decomposition. Any matrix can be decomposed (or “factored”) into three matrices, where
D is a diagonal matrix and U, V are orthogonal:7

A = UDVT
6
Refer to Appendix A for an explanation of this method.
7
I won’t discuss SVD in detail here, though I may add a section to Linear Algebra Primer later; for now,
please refer to your favorite textbook on Linear Algebra for details.

73
Chapter 6: Cameras and Images

Therefore, we’re aiming to minimize kUDVT mk. Well, there’s a trick here! U is a matrix
made up of unit vectors; multiplying by their magnitude doesn’t change the result. Thus,

kUDVT mk = kDVT mk and,


kmk = kVT mk

The latter equivalence is also due to the orthogonality of the matrix. We can conceptually
think of V as a rotation matrix, and rotating a vector doesn’t change its magnitude.
Now we’re trying to minimize kDVT mk subject to kVT mk = 1, which is a stricter set of
constraints than before. This cool transformation lets us do a substitution: let ŷ = VT m.
Thus, we’re minimizing kDŷk (subject to kŷk = 1).
Now, remember that D is a diagonal matrix, and, by convention, we can sort the diagonal
such that it has decreasing values. Thus, kDŷk is a minimum when ŷ = 0 0 . . . 1 ; ŷ
T

is putting “all of its weight” in its last element, resulting in the smallest value in D.
Since ŷ = VT m, and V is orthogonal,8 we know that m = Vŷ. And since ŷ’s only meaningful
element is a 1 at the end, then m is the last column of V!
Thus, if m̂ = Vŷ, then m̂ selects the eigenvector of AT A with the smallest eigenvalue (we
say m̂ now because eigenvectors are unit vectors). That leap in logic is explained by the
properties of V, which are described in the aside below for those interested.

SVDs and Eigendecomposition


In an SVD, the singular values of A are the square roots of the eigenvalues of
AT A. Furthermore, the columns of V are its eigenvectors.
To quickly demonstrate this:

AT A = (VDT UT ) (UDVT )
= VDT IDV the transpose of an orthogonal matrix is its inverse

= VT D2 V the transpose of a diagonal matrix is itself

The last equation is actually the eigendecomposition of AT A, where V is com-


posed of the eigenvectors and D contains the eigenvalues.

6.7.2 Method 2: Inhomogeneous Solution


This method is more straightforward that the previous one, but its results are not as good.
We rely again on the fact that homogeneous coordinates are invariant under scaling. Suppose,

8
A property of an orthogonal matrix is that its transpose is its inverse: VT = V−1 .

74
Notes on Computer Vision George Kudrayvtsev

then, we “scale” by 1/m23 . . . this would result in the following equation:


 
    Xi
ui m00 m01 m02 m03  
 vi  ' m10 m11 m12 m13   Yi 
 Zi 
1 m20 m21 m22 1
1

Contrast this with (6.12) and its subsequent expansions. We would actually have a term
that doesn’t contain an mij factor, resulting in an inhomogeneous system of equations
unlike before. Then, we can use least squares to approximate the solution.
Why isn’t this method as good? Consider if m23 ≈ 0, or some value close to zero. Setting
that value to 1 is dangerous for numerical stability.9

6.7.3 Advantages and Disadvantages


The technique described in the first method is called the direct linear calibration trans-
formation. It has some advantages and disadvantages.

Advantages
• The approach is very simple to formulate and solve. We create a matrix from a set of
points and use singular value decomposition, which is a 1-line function in most math
libraries.
• We minimize “algebraic error.” This means that we are algebraically solving the system
of equations. We relied on a specific set of tricks (as we saw), such as the constraint
of m̂ as a unit vector to get to a clean, algebraic solution.

Disadvantages
• Because of the algebraic approach, and our restriction of m̂, it doesn’t tell us the camera
parameters directly. Obviously, it’s unlikely that all of the parameters in (6.11) result
in a unit vector.
• It’s an approximate method that only models exactly what can be represented in the
camera calibration matrix (again, see 6.11). If we wanted to include, say, radial distor-
tion – a property we could model, just not within the projective transform equation –
this method wouldn’t be able to pick that up.
• It makes constraints hard to enforce because it assumes everything in m̂ is unknown.
Suppose we definitively knew the focal length and wanted solutions where that stayed
constrained. . . things get much more complicated.

9
As per Wikipedia, a lack of numerical stability may significantly magnify approximation errors, giving us
highly erroneous results.

75
Chapter 6: Cameras and Images

• The approach minimizes kAm̂k, which isn’t the right error function. It’s algebraically
convenient to work with this “cute trick,” but isn’t precisely what we’re trying to solve
for; after all, we’re working in a geometric world.

6.7.4 Geometric Error


Suppose we have our known set of points in the world, X. What we’re really trying to do
is minimize the difference between the actual points in the image (x0 ) and our projection of
X given some estimated camera parameters for M (which results in our set of “predictions,”
x). Visually, this is shown in Figure 6.12; mathematically, this is:
X
min E = d(x0i , x0i )
i

Since we control M, what we’re really trying to find is the M that minimizes that difference:
X
min = d(x0i , Mxi )
M
i∈X

X
x0

Figure 6.12: The red points are the “true” projection of the real-world
points in X onto the image plane, whereas the blue points are their
projection based on the estimated camera parameters M.

This leads us to the “gold standard” algorithm that aims to determine the maximum like-
lihood estimation10 of M described in algorithm 6.1. The minimization process involves
10
We say it’s an estimation because we assume that our guesses for M are distorted by some level of Gaussian
noise; thus, we want to maximize the likelihood of our choice by taking the M with the least error.

76
Notes on Computer Vision George Kudrayvtsev

solving a non-linear equation, so use whatever solver flavor you wish for that.

Algorithm 6.1: Finding camera calibration by minimizing geometric error.

Input: P : n known mappings from 3D to 2D, {Xi ←→ x0i }, i ∈ [1, n > 6].
Result: The best possible camera calibration, M.
if normalizing then // (optional)
X̃ = UX // normalization between image and world space
x̃0 = Tx0
end
M0 ← result of the direct linear transformation minimization
Minimize geometric error starting from M0 .
minM = i∈X̃ d(x̃0i , M̃X̃0i )
P

if normalizing then // (optional)


M = T−1 M̃U // denormalize the resulting parameter matrix M̃
end
return M

6.8 Using the Calibration


Henceforth we will assume that we’ve found M. Let’s not lose sight of what all of this was
for. . . now, we can use M to find interesting camera parameters, such as the original focal
length or the camera center.
Extracting each parameter is a relatively unique process; we’ll start with finding the camera
center in world coordinate space.

6.8.1 Where’s Waldo the Camera?


Before we begin, let’s introduce a slight change in notation for M: we will divide it into a
3×3 matrix Q and its last column, b. In other words, M = [Q | b].
The camera’s location in world space can be found by finding the null space of the camera
calibration matrix. In other words, if we find some C such that: MC = 0, then C is the
camera center.

Proof. Suppose we have two points, p and c, and a ray passing through them also passes
through a plane. We can define x as being a point anywhere along that ray:

x = λp + (1 − λ)c

77
Chapter 6: Cameras and Images

Then, the resulting projection after applying the camera parameters is:

xproj = Mx = λMp + (1 − λ)Mc

Now we must imagine the plane this ray passes through. All of the points along → −
pc are
projected onto the plane at the exact same point; in other words, regardless of λ, every point
on the ray will be projected onto the same point.
Thus, Mc = 0, and the camera center must be in the null space. 

Simpler. We can actually achieve the same goal by applying a formula. If M is split into
two parts as we described above, then the camera center is found simply:11

−Q−1 b
 
c=
1

Of course, this is just one property that went into creating the original M, but we can hope
that other properties can be derived just as simply.

6.9 Calibrating Cameras: Redux


The modern method for calibrating cameras is much simpler. Instead of a series of known
points in a 3D scene, we can use a checkerboard on a firm surface. The checkerboard is
then moved around the scene in various locations and orientations. The analysis of the
checkerboards gives us the intrinsics, and the extrinsics can be obtained if we noted the
locations of our checkerboards in world space.
The method’s simplicity and availability makes it the dominant method of camera calibration
today; it’s supported directly by OpenCV and code is freely available.

Theory in Action: Calibration in Robotics


This method is useful in robotics in calibration a camera with respect to a robot’s
local coordinate system.
Suppose we mount such a checkerboard to a robot’s arm. The robot knows the pre-
cise position of its arm (with respect to itself). Thus, every time the camera takes a
picture, we can correlate the checkerboard in the image to the robot’s measurements
and calibrate the camera with respect to the robot’s coordinate system!

11
The proof is left as an exercise to the reader.

78
Multiple Views

There are things known and there are things unknown, and in between are
the doors of perception.
— Aldous Huxley, The Doors of Perception

n this chapter we’ll be discussing the idea of n-views: what can we learn when given
I multiple different images of the same scene? Mostly, we’ll discuss n = 2 and create
mappings from one image to another. We discussed this topic briefly in chapter 6 when
talking about Stereo Geometry; we’ll return to that later in this chapter as well, but we’ll
also be discussing other scenarios in which we create relationships from one image to another.

7.1 Image-to-Image Projections


There are many types of transformations; we open by discussing their mathematical differ-
ences.
Recall our discussion of perspective projection in 3D (see Equation 6.1): we had a 4×4
matrix and could model all of our various transformations (translation, rotation, etc.) under
this framework by using Homogeneous Coordinates. This framework is convenient for a lot
of reasons, and it enables us to chain transformations by continually multiplying. We can
similarly adapt this model here with a 3×3 projective transformation matrix.
Translation This is our simplest transformation. We’ve seen this before: x0 = x + t. In
our projective transformation model,
 0   
x 1 0 tx x
y 0  = 0 1 ty  y 
1 0 0 1 1

This transformation preserves all of the original properties: lengths, angles, and orien-
tations all remain unchanged. Importantly, as well, lines remain lines under a trans-
lation transformation.
Euclidean Also called a rigid body transformation, this is the 2D version of the rotation

79
Chapter 7: Multiple Views

matrix we saw in (6.7) combined with a translation:


 0   
x cos θ − sin θ tx x
y 0  =  sin θ cos θ ty  y 
1 0 0 1 1

This transformation preserves the same properties as translation aside from orientation:
lengths, angles, and lines are all unchanged.
Similarity Under a similarity transformation, we add scaling to our rigid body trans-
formation:  0   
x a cos θ −a sin θ tx x
y 0  =  a sin θ a cos θ ty  y 
1 0 0 1 1

Affine An affine transformation is one we’ve yet to examine but is important in com-
puter vision (and graphics). It allows 6 degrees of freedom (adding shearing to the
aforementioned translation, rotation, and scaling), and it enables us to map any 3
points to any other 3 points while the others follow the transformation.
 0   
x a b c x
 y 0  = d e f   y 
1 0 0 1 1

It preserves a very different set of properties relative to the others we’ve seen: parallel
lines, ratios of areas, and lines (as in, lines stay lines) are preserved.
Combining all of these transformations gives us a general projective transformation,
also called a homography. It allows 8 degrees of freedom:1
 0  0   
x wx a b c x
y 0  ' wy 0  = d e f  y 
1 w g h 1 1

It’s important to be aware of how many points we need to determine the existence of all of
these transformations. Determining a translation only requires a single point correspondence
between two images: there are two unknowns (tx , ty ), and one correspondence gives us this
relationship. Similarly, for a homography, we need (at least) 4 correspondence points to
determine the transformation.

7.2 The Power of Homographies


Why do homogeneous coordinates make sense? Let’s return to our understanding of the 3D
to 2D projection through an image plane to some camera center. In Figure 7.1, we can see
1
The last element is a 1 for the same reason as it was in Method 2: Inhomogeneous Solution: because the
homogeneous projective matrix is invariant under scale, we can scale it all by this last element and it will
have no effect. This would give us a different w than before, but that doesn’t matter once we convert back
to our new (x0 , y 0 ).

80
Notes on Computer Vision George Kudrayvtsev

such a projection. The point on the image plane at (x, y, 1) is just the intersection of the
ray with the image plane (in blue); that means any point on the ray projects onto the image
plane at that point.
The ray is just the scaled intersection point, (sx, sy, s), and that aligns with our understand-
ing of how homogeneous represent this model.

(sx, sy, s)
−y

(x, y, 1)
x
−z

Figure 7.1: Each point on the image plane at (x, y, 1) is represented


by a ray in space, (sx, sy, s). All of the points on the ray are equivalent:
(x, y, 1) ' (sx, sy, s).

So how can we determine how a particular point in one projective plane maps onto another
projective plane? This is demonstrated in Figure 7.2: we cast a ray through each pixel in
the first projective plane and draw where that ray intersects the other projective plane.

eye

Figure 7.2: We can project a ray from the eye through each pixel in one
projection plane and see the corresponding pixel in the other projection
plane.

81
Chapter 7: Multiple Views

Notice that where this ray hits the world is actually irrelevant, now! We are no longer
working with a 3D reprojection onto a 2D plane; instead, we can think about this as a 2D
image warp (or just a transformation) from one plane to another. This basic principle
is how we create image mosaics (colloquially, panoramas). To reiterate, homographies
allow us to map projected points from one plane to another.

7.2.1 Creating Panoramas


Panoramas work because the camera center doesn’t change as we rotate the camera; this
means we don’t need any knowledge about the 3D scene we’re recording. A panorama can
be stitched from multiple images via the following steps:
• Take a sequence of images from the same position, rotating the camera about its optical
center.
• Compute a transformation between the first image and the second. This gives us a
mapping between the images, showing how a particular pixel moved as we rotated, and
which new pixels were introduced.
• Transform the second image to overlap with the first according to that transformation.
• Blend the images together, stitching them into one.
We can repeat this process as needed for every image in the sequence and get our panorama.
We can interpret our mosaic naturally in 3D: each image is reprojection onto a common
plane, and the mosaic is formed on that plane. This is demonstrated in Figure 7.3.2

mosaic
projection
plane

Figure 7.3: Reprojection of a series of images from the same camera


center onto a “mosaic plane.”

That’s all well and good, but what about the math? That’s what we’re really interested in,
right? Well, the transformation of a pixel from one projection plane to another (along a

2
Because we’re reprojecting onto a plane, we’re limited by a 180° field of view; if we want a panorama of
our entire surroundings, we’d need to map our mosaic of images on a different surface (say, a cylinder).
The 180° limitation can be thought of as having a single camera with a really wide lens; obviously, it still
can’t see what’s “behind” it.

82
Notes on Computer Vision George Kudrayvtsev

ray, as we already showed in Figure 7.2) is just a homography:


 0   
wx a b c x
wy 0  = d e f   y (7.1)
w g h 1 1
p0 = Hp (7.2)

How do we solve it? Well, there’s the boring way and the cool way.

The Boring Way: Inhomogeneous Solution We can set this up as a system of linear
equations with a vector h of 8 unknowns: Ah = b. Given at least 4 corresponding points
between the images (the more the better),3 we can solve for h using the least squares method
as described in Appendix A: min kAh − bk2 .

The Cool Way: Déjà Vu We’ve actually already seen this sort of problem before: re-
call the trick we used in Method 1: Singular Value Decomposition for solving the camera
calibration matrix! We can apply the same principles here, finding the eigenvector with the
smallest eigenvalue in the singular value decomposition.
More specifically, we create a similar system of equations as in Equation 6.12, except in 2D.
Then, we can rearrange that into an ugly matrix multiplication like in Equation 6.17 and
pull the smallest eigenvector from the SVD.

7.2.2 Homographies and 3D Planes


In this section we’ll demonstrate how the 8 degrees of freedom in a homography allow us to
create mappings between planes. Recall, first, the camera calibration matrix and its effect
on world-space coordinates:
 
    x
u m00 m01 m02 m03  
v  ' m10 m11 m12 m13  y 
z 
1 m20 m21 m22 m23
1

Suppose, though, that all of the points in the world were lying on a plane. A plane is
represented by a normal vector and a point on the plane, combining into the equation:
d = ax + by + cz. We can, of course, rearrange things to solve for z: z = d+ax+by
−c
, and then
plug that into our transformation!
 
    x
u m00 m01 m02 m03 
v  ' m10 m11 m12 m13   y 

d−ax−by/c
1 m20 m21 m22 m23
1
3
For now, we assume an existing set of correctly-mapped pixels between the images; these could be mapped
by a human under forced labor, like a graduate student. Later, in the chapter on Feature Recognition,
we’ll see ways of identifying correspondence points automatically.

83
Chapter 7: Multiple Views

Of course, this effects the overall camera matrix. The effect of the 3rd column (which is
always multiplied by x, y, and a constant) can be spread to the other columns. This gives
us:

 
   0 0 0
 x
u m00 m01 0 m03 
v  ' m010 m011 0 m013   y 

0 0 0
d−ax−by/c
1 m20 m21 0 m23
1

But now the camera matrix is a 3×3 homography! This demonstrates how homographies
allow us to transform (or warp) between arbitrary planes.

7.2.3 Image Rectification

Since we can transform between arbitrary planes using homographies, we can actually apply
that to images to see what they would look like from a different perspective!

Figure 7.4: Mapping a plane (in this case, containing an image) onto
two other planes (which, in this case, are the projection planes from two
different cameras).

This process is called image rectification and enables some really cool applications. We
can do things like rectify slanted views by restoring parallel lines or measure grids that are
warped by perspective like in Figure 7.5.

84
Notes on Computer Vision George Kudrayvtsev

(a) The original image. (b) After unwarping.

Figure 7.5: Applying unwarping to an image in order to measure the


floor length.

How do we do this image warping and unwarping process? Suppose we’re given a source
image, f (x, y) and a transformation function T (x, y) that relates each point in the source
image plane to another plane. How do we build the transformed image, g(x0 , y 0 )? There are
two approaches, one of which is incorrect.

Forward Warping
The naïve approach is to just pump every pixel through the transformation function and copy
that intensity into the resulting (x0 , y 0 ). That is, for each pixel (x, y) ∈ f ), (x0 , y 0 ) = T (x, y).
Unfortunately, though our images are discretized into pixels, our transformation function
may not be. This means that a particular (x0 , y 0 ) = T (x, y) may not correspond to an
individual pixel! Thus, we’d need to distribute the color from the original pixel around the
pixels near the transformed coordinate. This is known as splatting, and, as it sounds,
doesn’t result in very clean transformed images.

Inverse Warping

Instead of mapping from the source to the destination, we


instead look at every pixel in the destination image and (i, j + 1) (i + 1, j + 1)
figure out where it came from in the source. Thus, we start
with (x0 , y 0 ), then find (x, y) = T −1 (x0 , y 0 ). a (x, y)
b
We still have the problem of non-discrete locations, but can
come up with a better solution. Now we know the neigh- (i, j) (i + 1, j)
boring pixels of our source location, and can perform inter-
polation on their intensities to get a better approximation Figure 7.6: Finding the val-
of the intensity that belongs at the destination location. ues for bilinear interpolation of
(x, y).
The simplest approach would be nearest neighbor inter-
polation, which just takes the intensity of the closest pixel.

85
Chapter 7: Multiple Views

This doesn’t give great results. A better approach that is known as bilinear interpolation,
which weighs the neighboring intensities based on the distance of the location (x, y) to each
of its neighbors. Given a location, we can find its distances from a discrete pixel at (i, j):
a = x − i, b = y − j. Then we calculate the final intensity in the destination image at (x0 , y 0 ):

g(x0 , y 0 ) = (1 − a)(1 − b) f [i, j]


+ a(1 − b) f [i + 1, j]
+ ab f [i + 1, j + 1]
+ (1 − a)b f [i, j + 1]

This calculation is demonstrated visually in Figure 7.6. There are more clever ways of
interpolating, such as bicubic interpolation which uses cubic splines, but bilinear inter-
polation is mathematically simpler and still gives great results.

7.3 Projective Geometry


In this section we’ll be getting much deeper into the mathematics behind projective geom-
etry; specifically, we’ll explore the duality of points and lines and the power of homogeneous
coordinates (again!). We’ll use this to build a mathematical basis (no pun intended) for ap-
proaching n-views in the subsequent section.
Recall, first, the geometric significance of homogeneous coordinates. In Figure 7.1 we saw
that a point on the image is a ray in space, and every point along the ray is projected onto
that same point in the image; we defined this as being projectively similar.

7.3.1 Alternative Interpretations of Lines


Homogeneous coordinates are also useful as representations of lines. Recall the standard
form of the equation of a line in 2D:

ax + by + c = 0

Here, a, b, and c are all integers, and a > 0. From this, we can
 imagine a “compact”
representation of the line that only uses its constants:
 2 3 −12 . If we want the original
equation back, we can dot this vector with x y 1 :

(7.3)
   
a b c · x y 1 =0

Now recall also an alternative definition of the dot product: the angle between the two
vectors, θ, scaled by their magnitudes:

a · b = kak kbk cos θ

86
Notes on Computer Vision George Kudrayvtsev

This indicates that the vector a b c is perpendicular to every


 

possible point (x, y) in the z = 1 plane. . . or the ray passing


through the origin and the point (x, y, 1). n̂
There’s another interpretation, as well. Suppose we find a vec- d

tor perpendicular to the line, as in Figure 7.7. The slope of the


normal line is the reciprocal of the initial slope: m0 = −1/m, and
it passes through the origin (0, 0).
In addition to this normal, we can define the line by also in-
cluding a minimum distance from the origin, d. A line can be
Figure 7.7: An alterna-
uniquely identified by this
 distance and
 normal vector; we can tive interpretation of a line.

represent it as nx , ny , d , where n̂ = nx , ny , and d = c/ a2 +b2 .4
 

7.3.2 Interpreting 2D Lines as 3D Points


First, we will demonstrate the relationship between lines in a 2D plane and points in 3D
space under projective geometry.
Suppose we plot some line in 2D, for example 2x + 3y −
12 = 0 as in Figure 7.8. We can imagine a “compact”
4 (0, 4)
representation of the line by just using its constants:
m = 2 3 −12 , as we described above.

2 (3, 2)

Now what if we drew this same line on a projection


plane in 3D, like on the image plane at z = 1 in Fig-
−3 3 6 ure 7.1? Then, suppose we visualize the rays that begin
−2 at the origin and pass through the two points on the
line, (0, 4, 1) and (3, 2, 1) (notice the additional coordi-
2x−4
+ 3y − 12 = 0 nate now that we’re in 3-space).
The area between these rays is an infinite triangle into
Figure 7.8: A line plotted on the
standard xy-plane. Two points onspace, right? Overall, it’s part of an entire plane! We
have 3 points in space: the 2 points on the line and the
the line and the segment connecting
them are highlighted. origin, and three points define a plane. The intersection
of that plane and the image plane is the line 2x + 3y −
12 = 0, except now it’s at z = 1. What we get is something that looks like Figure 7.9.
The most important and exciting part of this visualization is the orange normal vector. We
can determine the normal by calculating the cross product between our two rays:

î ĵ k̂
   
0 4 1 × 3 2 1 = 0 4 1
3 2 1
= (4 · 1 − 2 · 1)î − (0 · 1 − 3 · 1)ĵ + (0 · 2 − 3 · 4)k̂
 
= 2 3 −12
4
Wikipedia defines the minimum distance between a point and a line: see this link.

87
Chapter 7: Multiple Views

Look familiar. . . ? That’s m, the constants from the original line! This means that we can
represent any line on our image plane by a point in the world using its constants.
The point defines a normal vector for a plane passing through the origin that creates the
line where it intersects the image plane.
Similarly, from this understanding that a line in 2-space can be represented by a vector in
3-space, we can relate points in 2-space to lines in 3-space. We’ve already shown this when
we said that a point on the projective plane lies on a ray (or line) in world space that passes
through the origin and that point.

(0, 4, 1)
(0, 4, 1) y

y (3, 2, 1)

z=1 (3, 2, 1)

x z=1

−z x
−z
Figure 7.9: Two views of the same line from Figure 7.8 plotted on a
projection plane (in blue) at z = 1. The green rays from the origin pass
through the same points in the line, forming the green plane. As you
can see, the intersection of the green plane with the blue plane create
the line in question. Finally, the orange vector is the normal vector
perpendicular to the green plane.

7.3.3 Interpreting 2D Points as 3D Lines


Suppose we want to find the intersection between two lines in 2-space. Turns out, it’s simply
their cross product! It (possibly) makes sense intuitively. If we have two lines in 2D (that
are defined by a normal vector in 3D), their intersection is a point in the 2D plane, which is
a ray in 3D.
Let’s consider an example. We’ll use 2x+3y −12 = 0 as before, and another line, 2x−y +4 =
0. Their intersection lies at (0, 4, 1); again, we’re in the z = 1 plane.5 What’s the cross

5
Finding this intersection is trivial via subtraction: 4y − 16 = 0 −→ y = 4, x = 0.

88
Notes on Computer Vision George Kudrayvtsev

product between 2 3 −12 and 2 −1 4 (let’s call it v)?


   

î ĵ k̂
   
v = 2 3 −12 × 2 −1 4 = 2 3 −12
2 −1 4
= (3 · 4 − (−1 · −12))î − (2 · 4 − (2 · −12))ĵ + (2 · −1 − (2 · 3))k̂
 
= 0 −32 −8

Of course, we need to put this vector along the ray into our plane at z = 1, which would be
where vz = 1:
 
v 0 −32 −8
v̂ = =p
kvk (−32)2 + (−8)2
 √ √ 
= 0 −4/ 17 −1/ 17
√  
− 17v̂ = 0 4 1

Convenient, isn’t it? It’s our point of intersection!


This leads us to point-line duality: given any formula, we can switch the meanings of
points and lines to get another formula.
What if we wanted to know if, for example, a point p existed on a line l? Well, the line
is a normal vector in 3D defining a plane, and if the point in question existed on the line,
it would be on that plane, and thus its ray would be perpendicular to the normal. So if
p · l = pT l = 0, then it’s on the line!

Quick Maffs: Proving the Cross Product and Intersection Equiva-


lence
I wasn’t satisfied with my “proof by example” in Interpreting 2D Points as 3D Lines,
so I ventured to make it more mathematically rigorous by keeping things generic.
We begin with two lines in 2-space and their vectors:
 
ax + by + c = 0 m= a b c
 
dx + ey + f = 0 n= d e f

We can find their cross product as before:



î ĵ k̂
   
v = a b c × d e f = a b c
d e f

= (bf − ec)î − (af − dc)ĵ + (ae − db)k̂

This is a line in 3-space, but we need it to be at z = 1 on the projection plane, so

89
Chapter 7: Multiple Views

we divide by the z term, giving us:


 
bf − ec dc + af
, ,1
ae − db ae − db

Our claim was that this is the intersection of the lines. Let’s prove that by solving
the system of equations directly. We start with the system in matrix form:
    
a b x −c
=
d e y −f

This is the general form Mx = y, and we solve by multiplying both sides by M−1 .
Thankfully we can find the inverse of a 2×2 matrix fairly easily:a
 
−1 1 e −b
M =
ad − bc −d a

What does this expand to in our system? Well,

M−1 Mx = M−1 y
    
x 1 e −b −c
=
y ad − bc −d a −f
 
1 −ec + bf
=
ad − bc cd − af
 
bf − ec cd − af
(x, y) = ,
ad − bc ad − bc

Look familiar? Boom, done. 


a
Refer to this page.

7.3.4 Ideal Points and Lines


Our understanding of a point on the image plane as being the projection of every point on a
ray in space (again, as shown in Figure 7.1 and other figures) relies on the fact that the ray
does in fact intersect the image plane. When isn’t this true? For rays parallel to the image
plane! This would be any ray confined to an xy-plane outside of z = 1.
We call the points that form these rays ideal points. If you squint hard enough, you may
be able to convince yourself that these rays intersection with the image plane at infinity, so
we represent them as: p ' (x, y, 0); when we go back to non-homogeneous coordinates, we
divide by zero, which makes x and y blow up infinitely large.
We can extend this notion to ideal lines as well: consider a normal vector l ' (a, b, 0). This
is actually mathematically coherent: it defines a plane perpendicular to the image plane, and
the resulting line goes through the origin of the image plane, which we call the principle
point.

90
Notes on Computer Vision George Kudrayvtsev

7.3.5 Duality in 3D
We can extend this notion of point-line duality into 3D as well. Recall the equation of a
plane:
ax + by + cz + d = 0
where (a, b, c) is the normal of the plane, and d = ax0 + by0 + cz0 for some point on the plane
(x0 , y0 , z0 ). Well, projective points in 3D have 4 coordinates, as we’ve already seen when
we discussed rotation and translation in Extrinsic Camera Parameters: wx wy wz w .


We use this  to define planes as homogeneous coordinates in 3D! A plane N is defined by a


4-vector a b c d , and so N · p = 0.

7.4 Applying Projective Geometry


With the vocabulary and understanding under our belt, we can now aim to apply this
geometric understanding to multiple views of a scene. We’ll be returning back to stereo
correspondence, so before we begin, re-familiarize yourself with the terms and concepts we
established previously when discussing Epipolar Geometry.
To motivate us again, this is our scenario:
Given two views of a scene, from two cameras that don’t necessarily have
parallel optical axes (à la Figure 6.6), what is the relationship between the
location of a scene point in one image and its location in the other?
We established the epipolar constraint, which stated that a point in one image lies on the
epipolar line in the other image, and vice-versa. These two points, along with the baseline
vector between the cameras, created an epipolar plane.
We need to turn these geometric concepts into algebra so that we can calculate them with
our computers. To do that, we need to define our terms. We have a world point, X, and
two camera origins, c1 and c2 . The rays from the world point to the cameras are p1 and p2 ,
respectively, intersecting with their image planes at x1 and x2 .

7.4.1 Essential Matrix


We will assume that we have calibrated cameras, so we know the transformation from one
origin to the other: some translation t and rotation R. This means that:

x2 = Rx1 + t

Let’s cross both sides of this equation by t, which would give us a vector normal to the pixel
ray and translation vector (a.k.a. the baseline):

t × x2 = t × Rx1 + t × t
= t × Rx1

91
Chapter 7: Multiple Views

Now we dot both sides with x2 , noticing that the left side is 0 because the angle between a
vector and a vector intentionally made perpendicular to that vector will always be zero!

x2 · (t × x2 ) = x2 · (t × Rx1 )
0 = x2 · (t × Rx1 )

Continuing, suppose we say that E = [t× ]R.6 This means:

xT2 Ex1 = 0

which is a really compact expression relating the two points on the two image planes. E is
called the essential matrix.
Notice that Ex1 evaluates to some vector, and that vector multiplied by xT2 (or dotted with
x2 ) is zero. We established earlier that this relationship is part of the point-line duality in
projective geometry, so this single point, Ex2 , defines a line in the other c1 camera frame!
Sound familiar? That’s the epipolar line.
We’ve converted the epipolar constraint into an algebraic expression! Well what if our
cameras are not calibrated? Theoretically, we should be able to determine the epipolar lines
if we’re given enough points correlated between the two images. This will lead us to the. . .

7.4.2 Fundamental Matrix


Let’s talk about weak calibration. This is the idea that we know we have two cameras
that properly behave as projective cameras should, but we don’t know any details about
their calibration parameters (such as focal length). What we’re going to do is estimate the
epipolar geometry from a (redundant) set of point correspondences across the uncalibrated
cameras.
Recall the full calibration matrix from (6.11), compactly describing something like this:
 
  xw
wxim  yw 
 wyim  = Kint Φext  
 zw 
w
1

For convenience sake, we’ll assume that there is no skew, s. This makes the intrinsic param-
eter matrix K invertible:  
−f/s 0 ox
K= 0 −f/sy oy 
0 0 1

6
This introduces a different notation for the cross product, expressing it as a matrix multiplication. This
is explained in further detail in Linear Algebra Primer, in Cross Product as Matrix Multiplication.

92
Notes on Computer Vision George Kudrayvtsev

Recall, though, that the extrinsic parameter matrix is what maps points from world space
to points in the camera’s coordinate frame (see The Duality of Space), meaning:

pim = Kint Φext pw


| {z }
pc

pim = Kint pc

Since we said that the intrinsic matrix is invertible, that also means that:

pc = K−1
int pim

Which tells us that we can find a ray through the camera and the world (since it’s a homo-
geneous point in 2-space, and recall point-line duality) corresponding to this point. Further-
more, for two cameras, we can say:

pc,left = K−1
int,left pim,left
pc,right = K−1
int,right pim,right

Now note that we don’t know the values of Kint for either camera since we’re working in
the uncalibrated case, but we do know that there are some parameters that would calibrate
them. Furthermore, there is a well-defined relationship between the left and right points in
the calibrated case that we defined previously using the essential matrix. Namely,

pTc,right Epc,left = 0

Thus, via substitution, we can say that


T T
K−1
int,right pim,right E K−1
int,left pim,left =0

and then use the properties of matrix multiplication7 to rearrange this and get the funda-
mental matrix, F:  
T
pTim,right K−1
int,right EK −1
int,left pim,left = 0
| {z }
F

This gives us a beautiful, simple expression relating the image points on the planes from
both cameras:
pTim,right Fpim,left = 0

Or, even more simply: pT Fp0 = 0. This is the fundamental matrix constraint, and given
enough correspondences between p → p0 , we will be able to solve for F. This matrix is very
powerful in describing how the epipolar geometry works.

Specifically, we use the fact that (AB) = BT AT , as shown here.


7 T

93
Chapter 7: Multiple Views

Properties of the Fundamental Matrix

Recall from Projective Geometry that when we have an l such that pT l = 0, that l describes
a line in the image plane. Well, the epipolar line in the p image associated with p0 is defined
by: l = Fp0 ; similarly, the epipolar line in the prime image is defined by: l0 = FT p. This
means that the fundamental matrix gives us the epipolar constraint between two images
with some correspondence.
What if p0 was on the epipolar line in the prime image for every point p in the original
image? That occurs at the epipole! We can solve for that by setting l = 0, meaning we can
find the two epipoles via:

Fp0 = 0
FT p = 0

Finally, the fundamental matrix is a 3×3 singular matrix. It’s singular because it maps
from homogeneous 2D points to a 1D family (which are points or lines under point-line
duality), meaning it has a rank of 2. We will prove this in this aside and use it shortly.
The power of the fundamental matrix is that it relates the pixel coordinates between two
views. We no longer need to know the intrinsic parameter matrix. With enough correspon-
dence points, we can reconstruct the epipolar geometry with an estimation of the fundamen-
tal matrix, without knowing anything about the true intrinsic or extrinsic parameter matrix.
In Figure 7.10, we can see this in action. The green lines are the estimated epipolar lines in
both images derived from the green correspondence points, and we can see that points along
a line in one image are also exactly along the corresponding epipolar line in the other image.

Figure 7.10: Estimation of epipolar lines given a set of correspondence


points (in green).

94
Notes on Computer Vision George Kudrayvtsev

Computing the Fundamental Matrix From Correspondences


Each point correspondence generates one constraint on F. Specifically, we can say that:8
pT F p0 = 0
   
f 11 f 12 f 13 u
u0 v 0
 
1 f21 f22 f23  v  = 0
f31 f32 f33 1

Multiplying this out, and generalizing it to n correspondences, gives us this massive system:
 
f11
f12 
 
f13 
 
0 0 0 0 0 0
u1 u1 u1 v1 u1 v1 u1 v1 v1 v1 u1 v1 1 f21 
 .. .. .. .. .. .. .. .. ..   
 . . . . . . . . .  
 f 22 =0
un un un vn un vn un vn vn vn un vn n f23 
0 0 0 0 0 0 

f31 
 
f32 
f33

We can solve this via the same two methods we’ve seen before: using the trick with singular
value decomposition, or scaling to make f33 = 1 and using the least squares approximation.
We explained these in full in Method 1: Singular Value Decomposition and Method 2:
Inhomogeneous Solution, respectively.
Unfortunately, due to the fact that our point correspondences are estimations, this actually
doesn’t give amazing results. Why? Because we didn’t pull rank on our matrix F! It’s a
rank-2 matrix, as we demonstrate in this aside, but we didn’t enforce that when solving/ap-
proximating our 3×3 matrix with correspondence points and assumed it was full rank.9
How can we enforce that? Well, first we solve for F as before via one of the two methods we
described. Then, we take the SVD of that result, giving us: F = UDVT .
The diagonal matrix is the singular values of F, and we can enforce having only rank 2 by
setting the last value (which is the smallest value, since we sort the diagonal in decreasing
order by convention) to zero:
   
r 0 0 r 0 0
D = 0 s 0 =⇒ D̂ = 0
   s 0
0 0 t 0 0 0

Then we recreate a better F̂ = UD̂VT , which gives much more accurate results for our
epipolar geometry.
8
You may notice that this looks awfully similar to previous “solve given correspondences” problems we’ve
done, such as in (6.17) when Calibrating Cameras.
9
Get it? Because pulling rank means taking advantage of seniority to enforcing some rule or get a task
done, and we didn’t enforce the rank-2 constraint on F? Heh, nice.

95
Chapter 7: Multiple Views

Proof: F is a Singular Matrix of Rank 2

We begin with two planes (the left plane and the right plane) that have some points
p and p0 , respectively, mapping some world point xπ . Suppose we know the exact
homography that can make this translation: p0 = Hπ p.
Then, let l0 be the epipolar line in the right plane corresponding to p. It passes
through the epipole, e0 , as well as the correspondence point, p0 .
We know, thanks to point-line duality in projective geometry, that this line l0 is
cross product of p0 and e0 , meaning:a

l0 = e0 × p0
= e0 × Hπ p
= e0 × Hπ p
 

But since l0 is the epipolar line for p, we saw that this can be represented by the
fundamental matrix: l0 = Fp. Meaning:
 0 
e × Hπ = F

That means the rank of F is the same as the rank of [e0 × ], which is 2! (just trust
me on that part. . . )
a
As we showed in Interpreting 2D Points as 3D Lines, a line passing through two points
can be defined by the cross product of those two points.

Fundamental Matrix Applications

There are many useful applications of the fundamental matrix:


• Stereo Image Rectification: Given two radically different views of the same object,
we can use the fundamental matrix to rectify the views into a single plane! This makes
epipolar lines the same across both images and reduces the dense correspondence search
to a 1D scanline.
• Photo Synth: From a series of pictures from different views of a particular scene,
we can recreate not only the various objects in the scene, but also map the locations
from which those views were taken in a composite image, as demonstrated below in
Figure 7.11.
• 3D Reconstruction: Extending on the PhotoSynth concept, given enough different
angles of a scene or structure, we can go so far as to recreate a 3D model of that
structure.

96
Notes on Computer Vision George Kudrayvtsev

Figure 7.11: The Photosynth project (link). An assortment of photos


of the Notre Dame (left) can be mapped to a bunch of locations relative
to the object (right).

7.5 Summary
For 2 views into a scene (sorry, I guess we never did get into n-views ), there is a geometric
relationship that defines the relationship between rays in one view and another view. This
is Epipolar Geometry.
These relationships can be captured algebraically (and hence computed), with the essential
matrix for calibrated cameras and the fundamental matrix for uncalibrated cameras. We
can find these relationships with enough point correspondences.
This is proper computer vision! We’re no longer just processing images. Instead, we’re
getting “good stuff” about the scene: determining the actual 3D geometry from our 2D
images.

97
Feature Recognition

Since we live in a world of appearances, people are judged by what they seem
to be. If the mind can’t read the predictable features, it reacts with alarm or
aversion. Faces which don’t fit in the picture are socially banned. An ugly
countenance, a hideous outlook can be considered as a crime and criminals
must be inexorably discarded from society.
— Erik Pevernagie, “Ugly mug offense”

ntil this chapter, we’ve been creating relationships between images and the world with
U the convenient assumption that we have sets of well-defined correspondence points. As
noted in footnote 3, these could have been the product of manual labor and painstaking
matching, but we can do better. Now, we will discover ways to automatically creating cor-
respondence relationships, which will lead the way to a more complete, automated approach
to computer vision.
Our goal, to put it generally, is to find points in an image that can be precisely and reliably
found in other images.
We will detect features (specifically, feature points) in both images and match them to-
gether to find the corresponding pairs. Of course, this poses a number of problems:
• How can we detect the same points independently in both images? We need
whatever algorithm we use to be consistent across images, returning the same feature
points. In other words, we need a repeatable detector.
• How do we correlate feature points? We need to quantify the same interesting
feature points a similar way. In other words, we need a reliable and distinctive de-
scriptor. For a rough analogy, our descriptor could “name” a feature point in one
image Eddard, and name the same one Ned in the other, but it definitely shouldn’t
name it Robert in the other.
Feature points are used in many applications. To name a few, they are useful in 3D recon-
struction, motion tracking, object recognition, robot navigation, and much much more. . .
So what makes a good feature?
• Repeatability / Precision: The same feature can be found in several images pre-

98
Notes on Computer Vision George Kudrayvtsev

cisely despite geometric and photometric transformations. We can find the same point
with the same precision and metric regardless of where it is in the scene.
• Saliency / Matchability: Each feature should have a distinctive description: when
we find a point in one image, there shouldn’t be a lot of candidate matches in the other
image.
• Compactness / Efficiency: There should be far fewer features than there are pixels
in the image, but there should be “enough.”
• Locality: A feature occupies a relatively small area of the image, and it’s robust to
clutter and occlusion. A neighborhood that is too large may change drastically from a
different view due to occlusion boundaries.

8.1 Finding Interest Points


The title of this section may give the answer away, but what’s something that fits all of the
criteria we mentioned in our introduction? Consider, for a moment, a black square on a
white background, as such:

Figure 8.1: A simple image. What sections would make good features?

What areas of the image would make good features? Would the center of the black square be
good? Probably not! There are a lot of areas that would match an all-black area: everywhere
else on the square! What about some region of the left (or any) edge? No again! We can
move along the edge and things look identical.
What about the corners? The top-right corner is incredibly distinct for the image; no other
region is quite like it.

Figure 8.2: A corner feature, highlighted from Figure 8.1.

99
Chapter 8: Feature Recognition

That begs the question: how do we detect corners? We can describe a corner as being an
area with significant change in a variety of directions. In other words, the gradients have
more than one direction.

8.1.1 Harris Corners


Developed in 1988 by Harris & Stephens (though Harris gets all the recognition), the tech-
nique we’ll describe in this section finds these aforementioned “feature” areas in which the
gradient varies in more than one direction.
Harris corners are based on an approximation model and an error model. We start with an
image, I. We can say that the change in appearance after shifting that image some (small)
amount (u, v), over some window function w, is represented by an error function that’s the
sum of the changes:
X
E(u, v) = w(x, y) [I(x + u, y + v) − I(x, y)]2
x,y

The window function can be a simple piecewise function that’s 1 within the window and 0
elsewhere, or a Gaussian filter that will weigh pixels near the center of the window appro-
priately.
We can view the error function visually as an image, as well. The error function with no shift
(i.e. E(0, 0)) will be 0, since there is no change in pixels; it would be a black pixel. As we
shifted, the error would increase towards white. Now suppose we had an image that was the
same intensity everywhere (like, for example, the center region of a black square, maybe?).
Its error function would always be zero regardless of shift. This gives us an intuition for the
use of the error function: we want regions that have error in all shift directions.
We are working with small values of (u, v): a large error for a small shift might indicate a
corner-like region. How do we model functions for small changes? We use Taylor expan-
sions. A second-order Taylor expansion of E(u, v) about (0, 0) gives us a local quadratic
approximation for small values of u and v.
Recall, from calculus oh-so-long-ago, the Taylor expansion. We approximate a function
F (x) for some small δ value:
dF (0) 1 2 d2 F (0)
F (δx) ≈ F (0) + δx · + δx ·
dx 2 dx2

Things get a little uglier in two dimensions; we need to use matrices. To approximate our
error function E(u, v) for small values of u and v, we say:
    
  Eu (0, 0) 1  Euu (0, 0) Euv (0, 0) u
E(u, v) ≈ E(0, 0) + u v + u v
Ev (0, 0) 2 Euv (0, 0) Evv (0, 0) v

where En indicates the gradient of E in the n direction, and similarly Enm indicates the
2nd -order gradient in the n then m direction.

100
Notes on Computer Vision George Kudrayvtsev

When we actually find, expand, and simplify all of these derivatives (which we do in this
aside), we get the following:

2w(x, y)Ix (x, y)2


 P P  
1 2w(x, y)Ix (x, y)Iy (x, y) u
(8.1)
 x,y x,y
E(u, v) ≈ u v P P 2
2 x,y 2w(x, y)Ix (x, y)Iy (x, y) x,y 2w(x, y)Iy (x, y) v

We can simplify this expression further with a substitution. Let M be the second moment
matrix computed from the image derivatives:
 2 
X Ix Ix Iy
M= w(x, y)
Ix Iy Iy2
xy

Then, simply (for values of u and v near 0):


 
u
(8.2)
 
E(u, v) ≈ u v M
v

Let’s examine the properties of this magical moment matrix, M, and see if we can get some
insights about how a “corner-like” area would look.

(Not So) Quick Maffs: Slogging Through the Taylor Expansion

So what do the derivatives of the error function look like? We’ll get through them
one step at a time. Our error function is:
X
E(u, v) = w(x, y) [I(x + u, y + v) − I(x, y)]2
x,y

We’ll start with the first order derivatives Eu and Ev . For this, we just need the
“power rule”: dx
d d
(f (x))n = nf (x)n−1 · dx f (x). Remember that u is the shift in the
x direction, and thus we need the image gradient in the x direction, and similarly
for v and y.
X
Eu (u, v) = 2w(x, y) [I(x + u, y + v) − I(x, y)] Ix (x + u, y + v)
x,y
X
Ev (u, v) = 2w(x, y) [I(x + u, y + v) − I(x, y)] Iy (x + u, y + v)
x,y

Now we will take the 2nd derivative with respect to u, giving us Euu (and likewise
for Evv ). Recall, briefly, the “product rule”:

d d d
f (x)g(x) = f (x) g(x) + g(x) f (x)
dx dx dx

101
Chapter 8: Feature Recognition

This gives us:


X
Euu (u, v) = 2w(x, y)Ix (x + u, y + v)Ix (x + u, y + v)
x,y
X
+ 2w(x, y) [I(x + u, y + v) − I(x, y)] Ixx (x + u, y + v)
x,y
X
Evv (u, v) = 2w(x, y)Iy (x + u, y + v)Iy (x + u, y + v)
x,y
X
+ 2w(x, y) [I(x + u, y + v) − I(x, y)] Iyy (x + u, y + v)
x,y

Now the only thing that remains is the cross-derivative, Euv , which now requires
gradients in both x and y of the image function as well as the cross-derivative in
x-then-y of the image.
X
Euv (u, v) = 2w(x, y)Ix (x + u, y + v)Iy (x + u, y + v)
x,y
X
+ 2w(x, y) [I(x + u, y + v) − I(x, y)] Ixy (x + u, y + v)
x,y

These are all absolutely disgusting, but, thankfully, we’re about to make a bunch
of the terms disappear entirely since we are evaluating them at (0, 0).
Onward, brave reader.
Plugging in (u = 0, v = 0) into each of these expressions gives us the following set
of newer, simpler expressions:

Eu (0, 0) = Ev (0, 0) = 0
X
Euu (0, 0) = 2w(x, y)Ix (x, y)2
x,y
X
Evv (0, 0) = 2w(x, y)Iy (x, y)2
x,y
X
Euv (0, 0) = 2w(x, y)Ix (x, y)Iy (x, y)
x,y

Notice that all we need now is the first-order gradient of the image in each direction,
x and y. What does this mean with regards to the Taylor expansion? It expanded
to:
    
  Eu (0, 0) 1  Euu (0, 0) Euv (0, 0) u
E(u, v) ≈ E(0, 0) + u v + u v
Ev (0, 0) 2 Euv (0, 0) Evv (0, 0) v

But the error at (0, 0) of the image is just 0, since there is no shift! And we’ve already
seen that its first-order derivatives are 0 as well. Meaning the above expansion

102
Notes on Computer Vision George Kudrayvtsev

simplifies greatly:
  
1  Euu (0, 0) Euv (0, 0) u
E(u, v) ≈ u v
2 Euv (0, 0) Evv (0, 0) v

and we can expand each term fully to get the final Taylor expansion approximation
of E near (0, 0):a

w0 Ix (x, y)2 w0 Ix (x, y)Iy (x, y) u


 P P  
1  x,y x,y
E(u, v) ≈ u v P 0
P 0 2
2 x,y w Ix (x, y)Iy (x, y) x,y w Iy (x, y) v
a
I don’t have enough horizontal space for the entire expression, actually, so I substitute
w0 = 2w(x, y) for compactness-sake.

Properties of the 2nd Moment Matrix


With our Taylor expansion, the function E(u, v) is being locally approximated by a quadratic.
The function E(u, v) is a surface, and one of its “slices” would look like this:
 
u
= some constant
 
u v M
v
X X X
Ix2 u2 + 2 Ix Iy uv + Iy2 v 2 = k

You may (or may not. . . I sure didn’t) notice that this looks like the equation of an ellipse
in uv-space:
au2 + buv + cv 2 = d
Thus our approximation of E is a series of these ellipses stack on top of each other, with
varying values of k.
Consider a case for M in which the gradients are horizontal xor vertical, so Ix and Iy are
never non-zero at the same time. That would mean M looks like:
 2   
X Ix 0 λ1 0
M= w(x, y) =
0 Iy2 0 λ2
xy

This wouldn’t be a very good corner, though, if λ2 = 0, since it means all of the change
was happening in the horizontal direction, and similarly for λ1 = 0. These would be edges
more-so than corners. This might trigger a lightbulb of intuition:
If either λ is close to 0, then this is not a good corner, so look for areas in which
both λs are large!
With some magical voodoo involving linear algebra, we can get the diagonalization of M
that hints at this information:  
−1 λ1 0
M=R R
0 λ2

103
Chapter 8: Feature Recognition

Where now the λs are the eigenvalues. Looking back at our interpretation of each slice as an
ellipse, R gives us the orientation of the ellipse and the λs give us the lengths of its major
and minor axes.
A slice’s “cornerness” is given by a having large and proportional λs. More specifically,

λ1  λ2 or λ2  λ1 ,
 edge
λ1 , λ2 ≈ 0, flat region
λ1 , λ2  0 and λ1 ∼ λ2 , corner

Rather than finding these eigenvalues directly (which is expensive on 80s computers because
of the need for sqrt), we can calculate the “cornerness” of a slice indirectly by using the
determinant and the trace of M. This is the Harris response function:

R = det M − α trace(M)2 = λ1 λ2 − α(λ1 + λ2 )2

Empirically, α ∈ [0.04, 0.06]. The classification breakdown is now:



|R| ≈ 0, flat region

R  0, edge
R  0, corner

8.1.2 Harris Detector Algorithm


The culmination of the understanding of the 2nd moment matrix in our approximation of the
error function at (0, 0), specifically the response function R, leads to the Harris detector
algorithm:

Algorithm 8.1: The basic Harris detector algorithm.

Input: An image, I
Result: A set of “interesting” corner-like locations in the image.
Compute Gaussian derivatives at each pixel.
Compute M in a Gaussian window around each pixel.
Compute R.
Threshold R.
Find the local maxima of R via Non-Maximal Suppression.

8.1.3 Improving the Harris Detector


Studies have shown us that the Harris detector is consistent (invariant) under changes in-
volving rotation and intensity (as long as they’re additive or multiplicative changes and you

104
Notes on Computer Vision George Kudrayvtsev

handle thresholding issues), but they are not invariant under scaling. Empirically, we’ve
seen that scaling by a factor of 2 reduces repeatability (i.e. the measurement of finding the
same features) by ~80% (see the Harris line in Figure 8.4).
Intuitively, we can imaging “zooming in” on a corner and applying the detector to parts of
it: each region would be treated like an edge! Thus, we want a scale invariant detector.
If we scaled our region by the same amount that the image was scaled, we wouldn’t have
any problems. Of course, we don’t know how much the image was scaled (or even if it was
scaled). How can we choose corresponding regions independently in each image? We need
to design a region function that is scale invariant.
A scale invariant function is one that is consistent across images given changes in region size.
A naïve example of a scale invariant function is average intensity. Given two images, one of
which is a scaled version of the other, there is some region size in each in which the average
intensity is the same for both areas. In other words, the average intensity “peaks” in some
place(s) and those peaks are correlated based on their independent scales.
A good scale invariant function has just one stable peak. For most images, a good function
is one that responds well to contrast (i.e. a sharp local intensity change. . . remember The
Importance of Edges?) We can apply the Laplacian (see Equation 3.6) of the Gaussian
filter (the “Mexican hat operator,” as Prof. Bobick puts it). But to avoid the 2nd derivative
nonsense, we can use something called the difference of Gaussians (DoG):

DoG = G(x, y, kσ) − G(x, y, σ)

which gives an incredibly similar result. Both of these kernel operators are entirely invariant
to both rotation and scale.

SIFT Detector
This leads to a technique called SIFT: scale invariant f eature transform.

Figure 8.3: Creating a difference of Gaussian pyramid. The images in


each octave are 2x blurrier than their previous one in the set, and each
octave is 1/2 the scale of the previous octave.

105
Chapter 8: Feature Recognition

The general idea is that we want to find robust extrema in both space and scale. We can
imagine “scale space” as being different scaled versions of the image, obtained via interpola-
tion of the original.
For each point in the original, we can say it has neighbors to the left and right (our traditional
understanding) as well as neighbors “up and down” in scale space. We can use that scale space
to create a difference of Gaussian pyramid, then eliminate the maxima corresponding
to edges, which just leaves the corners. Note that this is a completely different method of
corner detection; we aren’t hanging out with Harris anymore!
A “different of Gaussian pyramid” can be thought of as “stacks” of images; each item in a
stack is a set of DoG images at various scales, as in Figure 8.3. Each point in an image is
compared to its 8 local neighbors as well as its 9 neighbors in the images “above” and “below”
it; if you’re a maxima of all of those points, you’re an extremum! Once we’ve found these
extrema, we threshold the contrast and remove extrema on edges; that results in detected
feature points that are robust to scaling.

Harris-Laplace Detector

There is a scale-invariant detector that uses both of the detectors we’ve described so far. The
Harris-Laplace detector combines the difference of Gaussians with the Harris detector:
we find the local maximum of the Harris corner in space and the Laplacian in scale.

Comparison

The robustness of all three of these methods under variance in scale is compared in Figure 8.4.
As you can see, the Harris-Laplace detector works the “best,” but take these metrics with a
grain of salt: they were published by the inventors of Harris-Laplace.

1
0.9 Harris-Laplace
0.8 SIFT
repeatability rate

0.7 Harris
0.6
0.5
0.4
0.3
0.2
0.1
0
1 1.5 2 2.5 3 3.5 4 4.5
scale

Figure 8.4: A comparison between different feature-point detectors


under variations in scale.

106
Notes on Computer Vision George Kudrayvtsev

8.2 Matching Interest Points


We now have a variety of tools to detect interest points, but how do we match them? To do
that, we need a descriptor: we need to “describe” the point in one image (perhaps based
on its neighboring points) and find a matching description in the right image.
We described our ideal descriptor in the introduction (you can jump back if you want): they
should be almost the same in both image, i.e. invariant, as well as distinctive.
Being distinctive is challenging: since a feature point’s neighborhood will change from one
point to the next, we need some level of flexibility in how we match them up, but we don’t
want to be too flexible and match other feature points that are also similar but lie elsewhere.
In a sense, we have a tradeoff here between how invariant we want to be and how distinctive
we can be.

Simple Solution? Is there something simple we can do here? We have the feature points
from our Harris Detector Algorithm, can’t we just use correlation on each feature point win-
dow on the other image and choose the peak (i.e. something much like Template Matching)?
Unfortunately, correlation is not rotation-invariant, and it’s fairly sensitive to photometric
changes. Even normalized correlation is sensitive to non-linear photometric changes and
slight geometric distortions as well.
Furthermore, it’s slow: comparing one feature to all other feature points is not an ideal
solution from an algorithmic perspective (O(n2 )).
We’ll instead introduce the SIFT descriptor to solve these problems.

8.2.1 SIFT Descriptor


We’ve already discussed the SIFT Detector, but another – and more popular – part of Lowe’s
work in SIFT was the SIFT descriptor. The idea behind the descriptor was to represent the
image content as a “constellation” of local features that are invariant to translation, scale,
rotation, and other imaging parameters. For example, in Figure 8.5 we can see the same
features being found in both images, and the SIFT features themselves are different than
each of the individual features; they’ve been normalized to match.
We can imagine that if we found enough of these matching features, we’d be able to get some
insight into the object they’re a part of as a whole. We’d be able to say something like, “Oh,
these 4 features are part of some object that’s been rotated [this much] in the other image.”
The basic approach we’ll run through is, given our set of keypoints (or features or interest
points or . . . ), we assign an orientation to each one (or some group). Then, we can build a
description for the point based on the assigned orientation.

107
Chapter 8: Feature Recognition

Figure 8.5: An example of the SIFT descriptor finding transformation-


invariant features in two images.

Orientation Assignment

We want to compute the “best” orientation for a feature. Handily enough, the base orienta-
tion is just the dominant direction of the gradient.
To localize orientation to a feature, we create a histogram of local gradient directions at a
selected scale – 36 bins. Then, we assign the canonical orientation based on the peak of the
smoothed histogram. Thus, each feature point has some properties: its (x, y) coordinates,
and an invariant scale and orientation.

Keypoint Description

We want a descriptor that, again, is highly distinctive and as invariant as possible to pho-
tometric changes. First, we normalize: rotate a keypoint’s window based on the standard
orientation, then scale the window size based on the the keypoint’s scale.
Now, we create a feature vector based upon:
• a histogram of gradients, which we determined previous when finding the orientation
• weighed by a centered Gaussian filter, to appropriately value the center gradients more
We take these values and create a 4×4 grid of bins for each. In other words, the first bin
would contain the weighted histogram for the top-left corner of the window, the second for
the next sector to the right, and so on.

Minor Details There are a lot of minor tweaks here and there that are necessary to make
SIFT work. One of these is to ensure smoothness across the entire grid: pixels can affect
multiple bins if their gradients are large enough. This prevents abrupt changes across bin
boundaries. Furthermore, to lower the impact of highly-illumined areas (whose gradients
would dominate a bin), they encourage clamping the gradient to be ≤ 0.2 after the rotation
normalization. Finally, we normalize the entire feature vector (which is 16× the window
size) to be magnitude 1.

108
Notes on Computer Vision George Kudrayvtsev

Evaluating the Results


I recommend watching this lecture and the following one to see the graphs and tables that
evaluate how well the SIFT descriptor works under a variety of transformations to an image.
TODO: Replicate the graphs, I guess.

8.2.2 Matching Feature Points


We still haven’t address the problem of actually matching feature points together, but we’re
making progress. We now have our feature vector from the SIFT Descriptor for our images,
and can finally tackle correlating features together.

Nearest Neighbor
We could, of course, use a naïve nearest-neighbor algorithm and compare a feature in one
image to all of the features in the other. Even better, we can use the kd-tree algorithm1 to
find the approximate nearest neighbor. SIFT modifies the algorithm slightly: it implements
the best-bin-first modification by using a heap to order bins by their distance from the query
point. This gives a 100-1000x speedup and gives the correct result 95% of the time.2

Wavelet-Based Hashing
An alternative technique computes a short 3-vector descriptor from the neighborhood using a
Haar wavelet.3 Then, you quantize each value into 10 overlapping bins, giving 103 possible
entries. This greatly reduces the amount of features we need to search through.

Locality-Sensitive Hashing
The idea behind this technique, and locality-sensitive hashing in general, is to construct a
hash function that has similar outputs for similar inputs. More rigorously, we say we craft
a hash function g : Rd → U such that for any two points p and q (where D is some distance

1
Unlike in lecture, understanding of this algorithm isn’t assumed . The kd-tree algorithm partitions a
k-dimensional space into a tree by splitting each dimension down its median. For example, given a set
of (x, y) coordinates, we would first find the median of the x coordinates (or the ys. . . the dimension is
supposed to be chosen at random), and split the set into two piles. Then, we’d find the median of the y
coordinates in each pile and split them down further. When searching for the nearest neighbor, we can use
each median to quickly divide the search space in half. It’s an approximate method, though: sometimes,
the nearest neighbor will lie in a pile across the divide. For more, check out this video which explains
things succinctly, or the Wikipedia article for a more rigorous explanation.
2
For reference, here is a link to the paper: Indexing w/o Invariants in 3D Object Recognition. See Figure 6
for a diagram “explaining” their modified kd-tree algorithm.
3
Per Wikipedia, Haar wavelets are a series of rescaled square-wave-like functions that form a basis set.
A wavelet (again, per Wikipedia) is an oscillation that begins at zero, increases, then decreases to zero
(think of sin(θ) ∈ [0, π]). We’ll briefly allude to these again descriptors when we discuss features in the
Viola-Jones Face Detector much later in chapter 11.

109
Chapter 8: Feature Recognition

function):
D(p, q) ≤ r ⇒ Pr [g(p) = g(q)]  0
D(p, q) > cr ⇒ Pr [g(p) = g(q)] ∼ 0

In English, we say that if the distance between p and q is high, the probability of their hashes
being the same is “small”; if the distance between them is low, the probability of their hashes
being the same is “not so small.”
If we can construct such a hash function, we can jump to a particular bin and find feature
points that are similar to a given input feature point and, again, reduce our search space
down significantly.

8.2.3 Feature Points for Object Recognition


We can use our feature descriptors to find objects in images. First, we have the training
phase: given a training image,4 we extract its outline, then compute its keypoints.
During recognition, we search for these keypoints to identify possible matches. Out of
these matches, we identify consistent solutions, such as those that hold under an affine
transformation. To identify an affine transformation, we need 3 corresponding points.5
Because we only need these 3 points, we only need 3 features to match across images! This
makes the recognition resilient to occlusion of the objects.

8.3 Coming Full Circle: Feature-Based Alignment


Let’s have a cursory review of the things we’ve been discussing since chapter 7.
We discussed associating images together given sets of known points, essentially finding
the transformations between them. We did this using The Power of Homographies and
aligned images together when Creating Panoramas and doing Image Rectification; we also
used the Fundamental Matrix for simplifying our epipolar geometry and doing things like
Fundamental Matrix Applications.
This (often) amounted to solving a system of equations in the form Ax = y via the methods
like the SVD trick.
In this chapter, we’ve developed some techniques to identify “useful” points (features) and
approximately match them together. These now become our “known points,” but the problem
is that they aren’t truly known. They’re estimations and have both minor inaccuracies and
completely incorrect outliers. This makes our techniques from before much less reliable.
Instead, we’re now approaching finding transformations between images as a form of model
fitting. Essentially, what we have is a series of approximately corresponding points, and we
4
A good training image would be the object, alone, on a black background.
5
Recall that an affine transformation has 6 unknowns: 2 for translation, 2 for rotations and scaling, and 2
for shearing. This is covered more in Image-to-Image Projections.

110
Notes on Computer Vision George Kudrayvtsev

want to find the transformation that fits them the best. We’ve seen this before in chapter 4
when discussing line and circle fitting by voting for parameters in Hough Space.
At its heart, model fitting is a procedure that takes a few concrete steps:
• We have a function that returns a predicted data set. This is our model.
• We have a function that computes the difference between our data set and the model’s
predicted data set. This is our error function.
• We tune the model parameters until we minimize this difference.
We already have our model: a transformation applied to a group of feature points. We
already have our data set: the “best matches” computed by our descriptor (we call these
putative matches). Thus, all that remains is a robust error function.

8.3.1 Outlier Rejection


Our descriptor gives the qualitatively “best” match for a particular feature, but how can we
tell if it’s actually a good match?
Remember the basic principle we started with when Finding Disparity for stereo correspon-
dence? We used the sum of square differences between a patch in one image and all of the
patches in the other image, choosing the one with the smallest “distance.” We could do the
same thing here, comparing the patch around two matching interest points and thresholding
them, discarding matches with a large difference.
This begs the question: how do we choose a threshold? Thankfully, someone has done
a thorough empirical investigation to determine the probability of a match being correct
relative to its “corresponding patch” error.

Nearest Neighbor Error


Just checking the best matching feature patch has a low error for correct matches, but it’s
also low for a decent number of incorrect matches. Instead, we can do something a little
more clever.
Intuitively, we can imagine that in the incorrect case, the matching patch is essentially a
random patch in the image that happened to have the best matching value. What about the
next best match for that feature? Probably, again, a different random patch in the image!
Their errors probably don’t differ by much. In the correct case, though, the best match is
probably really strong, and its next best match is probably a random patch in the image.
Thus, their errors would differ by a lot.
This is backed up by empirical evidence: comparing the ratio of error between the first-
nearest-neighbor and the second-nearest-neighbor (i.e. eN N1/eN N2 ) across correct and incorrect
matches shows a stark difference between the cases. Specifically, setting a threshold of 0.4
on the error ratio (meaning there’s a 60% difference between the best and next-best match)
would eliminate nearly all of the incorrect matches.

111
Chapter 8: Feature Recognition

Of course, we’ll be throwing out the baby with the bathwater a bit, since a lot of correct
matches will also get discarded, but we still have enough correct matches to work with.
Unfortunately, we still do have lots of outliers. How can we minimize their impact on our
model?

8.3.2 Error Functions


Recall, again, the Hough voting process for line fitting. We can imagine the voting process
as a way of minimizing the distance between a set of points and the line that fits them. In
essence, what we were doing was very similar to the least squares approximation to a linear
system; the figure below shows how we were minimizing the vertical error between points
and our chosen line.
Our error function in this case is the vertical dis-
tance between point (xi , yi ) and the line, for all 3
points. mxi − b is the “real y” at that x value via
the true equation of the line:
2 e2
n
X y
E= (yi − mxi − b)2 e1
1 e0
i=1

Solving this minimization again leads us to the 0


0 1 2 3
standard least squares approximation for a linear x
system, as discussed in detail in Appendix A.
Figure 8.6: Fitting a line to a set of points,
The problem with vertical least-squares error ap- with the errors in red.
proximation is that it causes huge discrepancies
as lines get more vertical (and fails entirely for completely vertical lines). Consider the error
in Figure 8.7: the point is very close to the line, but the error is very high.
What would be a better error metric is the perpendicular distance
y between the point and the line. In Projective Geometry, we talked
about how a line in the form ax + by = d has a normal vector n̂ =
a b and it d away from the origin at its closest. With that, we can


instead define a different error function:


n
X
e0 x E= (axi + byi − d)2
i=1

Figure 8.7: Mea-


suring vertical error To minimize this, which we do via vector calculus and algebraic ma-
over-emphasizes er- nipulation, we come up with the homogeneous equation:6
rors in steep lines.
dE
= 2(UT U)h = 0
dh
6
I’m trying not to bore you with the derivation, but you can refer to the original lecture for the details.

112
Notes on Computer Vision George Kudrayvtsev

Where h = a b and U is a matrix of the differences of averages for each point (i.e. ∀i xi −x).
 

Amazingly, we can solve this with the SVD trick we’ve seen time and time again.
This seems like a strong error function, but it has a fundamental assumption. We assume
that the noise to our line is corrupted by a Gaussian noise function perpendicular to the line.

Scrrrrr. Hold up, what?


We’re just assuming that the errors in our data set (i.e. the
noise) happens to follow a standard bell curve, with most of
the noisy data being close to the “true” model and few correct
points being far away.
Hm, well when you put it that way, it
sounds more reasonable.
Onward!

To make our assumption mathematical, we are saying that our noisy point (x, y) is the
true point on the line perturbed along the normal by some noise sampled from a zero-mean
Gaussian with some σ:      
x u a
= + |{z}

y v b
|{z} noise |{z}
true point n̂

We say that this is our underlying generative model:7 a Gaussian perturbation.


Of course, this comes with a natural pitfall: the model is extremely non-robust to non-
Gaussian noise. Outliers to the generative model have a huge impact on the final model
because squared error heavily penalizes outliers.
How does this tie back to our original discussion? Well, even though we tried to minimize
outliers in the previous sections, there still will be some. Those outliers will have a massive
impact on our approximation of the transformation between images if we used the sum of
squared errors.
Thus, we need something better. We need a robust estimator. The general approach of a
robust estimator is to minimize: X
ρ(ri (xi , θ); σ)
i

Here, r finds the “residual” of the ith point with respect to the model parameters, θ. This
is our “distance” or error function that measures the difference between the model’s output
and our test set. ρ is a robust function with a scale parameter, σ; the idea of ρ is to not let

7
i.e. a model that generates the noise

113
Chapter 8: Feature Recognition

outliers have a large impact on the overall error. An example of a robust error function is:
2
ρ(u; σ) = σ2u+u2
For a small residual u (i.e. small error) relative to σ, this behaves much like the squared
distance: we get ≈ u2/σ2 . As u grows, though, the effect “flattens out”: we get ≈ 1.
Of course, again, this begs the question of how to choose this parameter. What makes a
good scale, σ? Turns out, it’s fairly scenario-specific. The error function we defined above
is very sensitive to scale.
It seems like we keep shifting the goalposts of what parameter we need to tune, but in the
next section we’ll define an error function that still has a scale, but is much less sensitive to
it and enables a much more robust, general-purpose solution.

8.3.3 RANSAC
The RANSAC error function, or random sample consensus error function, relies on the
(obvious) notion that it would be easy to fit a model if we knew which points belonged to
it and which ones didn’t. We’ll return to the concept of “consistent” matches: which sets of
matches are consistent with a particular model?
Like in the Hough Transform, we can count on the fact that wrong matches will be relatively
random, whereas correct matches will be consistent with each other. The basic main idea
behind RANSAC is to loop over a randomly-proposed model, find which points belong to it
(inliers) and which don’t (outliers), eventually choosing the model with the most inliers.
For any model, there is a minimal set of s points that can define it. We saw this in Image-
to-Image Projections: translations had 2, homographies had 4, the fundamental matrix has
8, etc. The general RANSAC algorithm is defined in algorithm 8.2 below.
How do we choose our distance threshold t, which defines whether or not a point counts
as an inlier or outlier of the instantiated model? That will depend on the way we believe
the noise behaves (our generative model). If we assume a Gaussian noise function like we
did previously, then the distance d to the noise is modeled by a Chi distribution8 with k
degrees of freedom (where k is the dimension of the Gaussian). This is defined by:
√ − d2
2e 2σ2
f (d) = √ , d≥0
πσ

We can then define our threshold based on what percentage of inliers we want. For example,
choosing t2 = 3.84σ 2 means there’s a 95% probability that when d < t, the point is an inlier.
Now, how many iterations N should we perform? We want to choose N such that, with
some probability p, at least one of the random sample sets (i.e. one of the Ci s) is completely

8
The Chi distribution is of the “square roots of the sum of squares” of a set of independent random variables,
each following a standard normal distribution (i.e. a Gaussian) (per Wikipedia). “That” is what we were
looking to minimize previously with the perpendicular least squares error, so it makes sense!

114
Notes on Computer Vision George Kudrayvtsev

Algorithm 8.2: General RANSAC algorithm.

Input: F , the set of matching feature points.


Input: T , a “big enough” consensus set, or N , the number of trials.
Result: The best-fitting model instance.
Ci ← {}
/* Two versions of the algorithm: threshold or iteration count */
while |Ci | < T or N > 0 do
/* Sample s random points from the set of feature points. */
p = Sample(F ; s)
/* Instantiate the model from the sample. */
M = Model(p)
Ci = {pi ∈ F | M (pi ) > t}
N =N −1 // if doing the iteration version...
end
return M

free from outliers. We base this off of an “outlier ratio” e, which defines how many of our
feature points we expect to be bad.
Let’s solve for N :
• s – the number of points to compute a solution to the model
• p – the probability of success
• e – the proportion of outliers, so the % of inliers is (1 − e).
• Pr [sample set with all inliers] = (1 − e)s
• Pr [sample set with at least one outlier] = (1 − (1 − e)s )
• Pr [all N samples have outliers] = (1 − (1 − e)s )N
• But we want the chance of all N having outliers to be really small, i.e. < (1 − p). Thus,
we want (1 − (1 − e)s )N < (1 − p).
Solving for N gives us. . . drumroll . . .
log(1 − p)
N>
log (1 − (1 − e)s )

The beauty of this probability relationship for N is that it scales incredibly well with e. For
example, for a 99% chance of finding a set of s = 2 points with no outliers, with an e = 0.5
outlier ratio (meaning half of the matches are outliers!), we only need 17 iterations.

115
Chapter 8: Feature Recognition

Some more RANSAC values are in Table 8.1. As you can see, N scales relatively well as e
increases, but less-so as we increase s, the number of points we need to instantiate a model.

Proportion of outliers, e
s 5% 10% 20% 25% 30% 40% 50%
2 2 3 5 6 7 11 17
3 3 4 7 9 11 19 35
4 3 5 9 13 17 34 72
5 4 6 12 17 26 57 146
6 4 7 16 24 37 97 293
7 4 8 20 33 54 163 588
8 5 9 26 44 78 272 1177
Table 8.1: Values for N in the RANSAC algorithm, given the number
of model parameters, s, and the proportion of outliers.

Finally, when we’ve found our best-fitting model, we can recompute the model instance using
all of the inliers (as opposed to just the 4 we started with) to average out the overall noise
and get a better estimation.

Adapting Sample Count The other beautiful thing about RANSAC is that the number
of features we found is completely irrelevant! All we care about is the number of model
parameters and our expected outlier ratio. Of course, we don’t know that ratio qa priori 9 ,
but we can adapt it as we loop. We can assume a worst case (e.g. e = 0.5), then adjust it
based on the actual amount of inliers that we find in our loop(s). For example, finding 80%
inliers then means e = 0.2 for the next iteration. More formally, we get algorithm 8.3.

Example 8.1: Estimating a Homography

Just to make things a little more concrete, let’s look at what estimating a homogra-
phy might look like. Homographies need 4 points, and so the RANSAC loop might
look something like this:
1. Select 4 feature points at random.
2. Compute their exact homography, H.
3. Compute the inliers in the entire set of features where SSD(p0i , Hpi ) < .
4. Keep the largest set of inliers.
5. Recompute the least squares H estimate on all of the inliers.

9
fancy Latin for “from before,” or, in this case, “in advance”

116
Notes on Computer Vision George Kudrayvtsev

Algorithm 8.3: Adaptive RANSAC algorithm.

Input: F , the set of matching feature points.


Input: (s, p, t), the # of model parameters, probability of success, and inlier threshold.
Result: The best-fitting model instance.
N = ∞, c = 0, e = 1.0 // c is the number of samples
while N > c do
p = Sample(F ; s)
M = Model(p)
Ci = {∀xi ∈ F | M (xi ) > t} // find the inliers
e0 = 1 − kC ik
kF k
if e0 < e then
e = e0
log(1−p)
N = log(1−(1−e) s)

end
c=c+1
end
return Ci , Model(Ci )

Benefits and Downsides


RANSAC is used a lot in the wild because of its many benefits. It’s a simple and general
solution that can be applied to many different models; we can have a much larger number
of parameters than, for example, the generalized Hough transform, and they’re easier to
choose. Finally, and probably most-importantly, it’s robust to a large number of outliers;
this takes the load off of feature point detection.
On the other hand, it does have some pitfalls. As we saw, computational time increases
rapidly as the number of parameters increases; as we approach s > 10, things get a little
hairy. Furthermore, it’s not ideal at fitting multiple models simultaneously; thresholding
becomes much more finicky. Finally, its biggest downside is that it’s really not good for
approximate models. If, for example, you have something that’s “kind of a plane,” RANSAC
struggles; it needs a precise, exact model to work with.

8.4 Conclusion
We’ve made some incredible strides in automating the concepts we described in the chapter
on Multiple Views. Given some images, we can make good guesses at corresponding points
then fit those to a particular model (such as a fundamental matrix) with RANSAC and
determine the geometric relationship between images. These techniques are pervasive in
computer vision and have plenty of real-world applications like robotics.

117
Photometry

“Drawing makes you look at the world more closely. It helps you to see what
you’re looking at more clearly. Did you know that?”
I said nothing.
“What colour’s a blackbird?” she said.
“Black.”
“Typical!”
— David Almond, Skellig

hotometry is concerned with measuring light in terms of its perceived brightness


P by the human eye. This is in contrast with radiometry, which is more concerned
with the physical, absolute properties of light, such as its wavelength. We’ll begin by
discussing these properties and give a brief overview of how light interacts with objects to
get an understanding of the physics, but computer vision is more concerned with extracting
this model from images, and gaining an understanding of the scene based on how humans
perceive it.
Some of the visual phenomena we’ll be talking about are shadows, reflections, refractions,
scattering, and more! I highly recommend watching this lecture that visually demonstrates
all of these, as I’m too lazy to replicate every image he uses.

9.1 BRDF
The appearance of a surface is largely dependent on three factors: the viewing angle, the
surface’s material properties, and its illumination. Before we get into these in detail, we
need to discuss some terms from radiometry.

Radiance The energy carried by a ray of light is radiance, L. It’s a measure of power
per unit area, perpendicular to the direction of travel, per unit solid angle. That’s super
confusing, so let’s dissect. We consider the power of the light falling into an area, then
consider its angle, and finally consider the size of the “cone” of light that hits us. The units

118
Notes on Computer Vision George Kudrayvtsev

are watts per square meter per steradian: W m−2 sr−1 .

Irradiance On the other hand, we have the energy arriving at a surface, E. In this
case, we just have the power in a given direction per unit area (units: W m−2 ). Intuitively,
light coming at a steep angle is less powerful than light straight above us, and so for an
area receiving radiance L(θ, ϕ) coming in from some light source size dω, the corresponding
irradiance is E(θ, ϕ) = L(θ, ϕ) cos θdω.



θ

dA
dA cos θ
Figure 9.1: The radiance model.

If you don’t understand this stuff super well, that’s fine. I don’t, hence the explanation is
lacking. We’ll be culminating this information into the BRDF – the bidirectional reflectance
distribution f unction – and not worrying about radiance anymore. The BRDF is the ratio
between the irradiance at the surface from the incident direction and the radiance from
the surface in the viewing direction, as seen below in Figure 9.2. In other words, it’s the
percentage of the light reflected from the light coming in:
Lsurface (θr , ϕr )
BRDF: f (θi , ϕi ; θr , ϕr ) = (9.1)
Esurface (θi , ϕi )

z sensor

θ source viewing
intensity, I direction, v
incident n̂ (θr , ϕr )
direction, s θr
x y (θi , ϕi ) θi

ϕ
surface element
Figure 9.2: The model captured by the BRDF.

BRDFs can be incredibly complicated for real-world objects, but we’ll be discussing two
reflection models that can represent most objects to a reasonably accurate degree: diffuse
reflection and specular reflection. With diffuse reflection, the light penetrates the surface,

119
Chapter 9: Photometry

scattering around the inside before finally radiates out. Surfaces with a lot of diffuse re-
flection, like clay and paper, have a soft, matte appearance. The alternative is specular
reflection, in which most of the light bounces off the surface. Things like metals have a high
specular reflection, appearing glossy and have a lot of highlights.
With these two components, we can get a reasonable estimation of light intensity off of most
objects. We can say that image intensity = body reflection + surface reflection. Let’s discuss
each of these individually, then combine them into a unified model.

9.1.1 Diffuse Reflection


First, we’ll discuss a specific mathematical model for body (diffuse) reflection: the Lamber-
tian BRDF. This guy Lambert realized a clever simplification: a patch on a surface with
high diffusion (as before, things like clay and paper) looks equally bright from every direc-
tion. That means that the incident angle is the only one that matters; this is Lambert’s
law.
As expected, the more perpendicular the light source, the brighter the patch looks, but
equally bright regardless of where you look at it from. Mathematically, we are independent
of the v in Figure 9.2. Thus, the Lambertian BRDF is just a constant, known as the albedo:

f (θi , ϕi ; θr , ϕr ) = ρd

Then, the surface radiance is:

L = I cos θi = ρd I(n̂ · ŝ)

9.1.2 Specular Reflection


Consider a perfect mirror. The only time you see the reflection of a particular object is
when your viewing angle matches the incident angle of that object with the mirror. In other
words, it’s visible when θi = θv and when ϕi = π + ϕv , meaning the viewing direction is in
the same plane as the incident direction. In other words, it’s not sufficient to match up your
“tilt,” you also need to make sure you’re in line with the point as well.
Thus, the BRDF for mirror reflections is a “double delta function:”

f (θi , ϕi ; θr , ϕr ) = ρs δ(θi − θv )δ(ϕi + π − ϕv )

where δ(x) is a simple toggle: 1 if x = 0 and 0 otherwise. Then, the surface radiance simply
adds the intensity:
L = Iρs δ(θi − θv )δ(ϕi + π − ϕv )

We can simplify this with vector notation, as well. We can say that m is the “mirror
direction,” which is that perfect reflection vector for an incoming s, and that vech is the
“half-angle,” i.e. the vector halfway between s and v. Then:

L = Iρs δ(m − v) = Iρs δ(n − h)

120
Notes on Computer Vision George Kudrayvtsev

Most things aren’t perfect mirrors, but they can be “shiny.” We can think of a shiny, glossy
object as being a blurred mirror: the light from a point now spreads over a particular area,
as shown in Figure 9.3. We simulate this by raising the angle between the mirror and the
viewing directions to an exponent, which blurs the source intensity more with a larger angle:
L = Iρs (m̂ · v̂)k

A larger k makes the specular component fall off faster, getting more and more like a mirror.

r i r i

n̂ n̂
v v

more glossy / blurry less glossy, more specular

Figure 9.3: The spread of light as the specularity of a material changes.

9.1.3 Phong Reflection Model


We can approximate the BRDF of many surfaces by combining the Lambertian and specular
BRDFs. This is called the Phong reflection model; it combines the “matte-ness” and
“shininess” of an object.

Figure 9.4: Combining the diffuse and specular components of an


object using the Phong model.

121
Chapter 9: Photometry

9.2 Recovering Light


Remember the illusion in the Introduction, with the squares being the same intensity yet
appearing different due to the shadow? In this section, we’ll be fighting the battle of ex-
tracting lighting information and understanding its effect on a scene and differentiating it
from the surface itself.
We will assume a Lambertian surface; recall, such a surface’s “lightness” can be modeled as:

L = Iρ cos θ

If we combine the incoming light intensity and the angle at which it hits the surface into an
“energy function”, and represent the albedo as a more complicated “reflectance function,” we
get:
L(x, y) = R(x, y) · E(x, y)

Of course, if we want to recover R from L (which is what “we see”), we can’t without knowing
the lighting configuration. The question is, then, how can we do this?
The astute reader may notice some similarities between this and our discussion of noise
removal from chapter 2. In Image Filtering, we realized that we couldn’t remove noise that
was added to an image without knowing the exact noise function that generated it, especially
if values were clipped by the pixel intensity range. Instead, we resorted to various filtering
methods that tried to make a best-effort guess at the “true” intensity of an image based on
some assumptions.
Likewise here, we will have to make some assumptions about our scene in order to make our
“best effort” attempt at extracting the true reflectance function of an object:
1. Light is slowly varying.
This is reasonable for our normal, planar world. Shadows are often much softer than
the contrast between surfaces, and as we move through our world, the affect of the
lighting around us does not change drastically within the same scene.
2. Within an object, reflectance is constant.
This is a huge simplification of the real world, but is reasonable given a small enough
“patch” that we treat as an object. For example, your shirt’s texture (or possibly your
skin’s, if you’re a comfortable at-home reader) is relatively the same in most places.
This leads directly into the next assumption, which is that. . .
3. Between objects, reflectance varies suddenly.
We’ve already taken advantage of this before when discussing edges: the biggest jump
in variations of color and texture come between objects.
To simplify things even further, we’ll be working exclusively with intensity; color is out of the
picture for now. The model of the world we’ve created that is composed of these assumptions


is often called the Mondrian world, after the Dutch painter, despite the fact that it doesn’t
do his paintings justice.

9.2.1 Retinex Theory


Dreamt up by Edwin Land (founder of Polaroid), retinex is a model for removing slow
variations from an image. These slow variations would be the result of lighting under our
Mondrian world; thus, removing them would give an approximation of the pure reflectance
function.
There are many approaches to this. Consider the following:
• Take the logarithm of the entire thing:1

log L(x, y) = log R(x, y) + log E(x, y)

• Run the result through a high-pass filter, keeping high-frequency content, perhaps with
the derivative.
• Threshold the result to remove small low frequencies.
• Finally, invert that to get back the original result (integrate, exponentiate).
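As a toy illustration of that pipeline, here's a 1D sketch in NumPy; using the discrete derivative as the high-pass filter, the particular threshold value, and cumulative summation as the inversion step are all assumptions made for this example, not the exact algorithm from the lecture:

import numpy as np

def retinex_1d(L, thresh=0.1):
    """Toy 1D retinex: recover reflectance (up to a constant scale) from lightness L.

    Assumes the Mondrian world: illumination varies slowly, while reflectance
    changes in sudden jumps between "objects."
    """
    log_L = np.log(L)
    d = np.diff(log_L)                            # high-pass filter (derivative)
    d[np.abs(d) < thresh] = 0                     # throw away small (illumination) changes
    log_R = np.concatenate([[0], np.cumsum(d)])   # integrate back up
    return np.exp(log_R)                          # exponentiate to undo the log

# Piecewise-constant reflectance under a slowly-varying light:
x = np.linspace(0, 1, 200)
R = np.where(x < 0.5, 0.2, 0.8)
E = 1.0 + 0.5 * x
# First vs. last value: the 1 : 4 reflectance ratio survives, the lighting gradient doesn't.
print(np.round(retinex_1d(R * E)[[0, -1]], 2))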

TODO: Finish this set of lectures.

¹ Recall, of course, that the logarithm of a product is the sum of their logarithms.

Motion & Tracking

Mathematics is unable to specify whether motion is continuous, for it deals


merely with hypothetical relations and can make its variable continuous or
discontinuous at will. The paradoxes of Zeno are consequences of the failure
to appreciate this fact and of the resulting lack of a precise specification of
the problem. The former is a matter of scientific description a posteriori,
whereas the latter is a matter solely of mathematical definition a priori.
The former may consequently suggest that motion be defined mathematically
in terms of continuous variable, but cannot, because of the limitations of
sensory perception, prove that it must be so defined.
— Carl B. Boyer, The History of the Calculus and Its Conceptual
Development

Motion will add another dimension to our images: time. Thus, we'll be working with
sequences of images: I(x, y, t). Unlike the real world, in which we perceive continuous
motion,1 digital video is merely a sequence of images with changes between them. By
changing the images rapidly, we can imitate fluid motion. In fact, studies have been done to
determine what amount of change humans will still classify as “motion;” if an object moves
10 feet between images, is it really moving?
There are many applications for motion; let’s touch on a few of them to introduce our
motivation for the math we’ll be slugging through in the remainder of this chapter.
Background Subtraction Given a video with a relatively static scene and some moving
objects, we can extract just the moving objects (or just the background!). We may
want to then overlay the “mover” onto a different scene, model its dynamics, or apply
other processing to the extracted object.
Shot Detection Given a video, we can detect when it cuts to a different scene, or shot.
Motion Segmentation Suppose we have many moving objects in a video. We can segment
¹ Arguably continuous, as the chapter quote points out. How can we prove that motion is continuous if our
senses operate on some discrete intervals? At the very least, we have discretization by the rate at which a
neuron can fire. Or, perhaps, there’s never a moment in which a neuron isn’t firing, so we can claim that’s
effectively continuous sampling? But if Planck time is the smallest possible unit of time, does that mean
we don’t live in a continuous world in the first place, so the argument is moot? Philosophers argue on. . .


out each of those objects individually and do some sort of independent analyses. Even
in scenarios in which the objects are hard to tell apart spatially, motion gives us another
level of insight for separating things out.
There are plenty of other applications of motion: video stabilization, learning dynamic mod-
els, recognizing events, estimating 3D structure, and more!
Our brain does a lot of “filling in the gaps” and other interpretations based on even the
most impoverished motion; obviously this is hard to demonstrate in written form, but even
a handful of dots arranged in a particular way, moving along some simple paths, can look
like an actual walking person to our human perception. The challenge of computer vision is
interpreting and creating these same associations.

10.1 Motion Estimation


The motion estimation techniques we’ll be covering fall into two categories. The first is
feature-based methods, in which we extract visual features and track them over multiple
frames. We’ve already seen much of what goes on in the former method when we discussed
Feature Recognition. These result in sparse motion fields, but achieve more robust tracking;
this works well when there is significant motion. The second is dense methods, in which
we directly recover the motion at each pixel, based on spatio-temporal image brightness
variations. We get denser motion fields, but are highly sensitive to appearance changes.
This method is well-suited for video, since there are many samples and motion between any
two frames is small (aside from shot cuts).
We’ll be focusing more on the dense flow methods, though we will see how some of the more
cutting edge motion estimation models rely heavily on a combination of both approahces.
We begin by defining, and subsequently recovering, optic flow. Optic flow is the apparent
motion of surfaces or objects. For example, in Figure 10.1, the Rubik's cube has rotated
counter-clockwise a bit, resulting in the optic flow diagram on the right.

Figure 10.1: The optic flow diagram (right) from a rotated Rubik's cube.

Our goal is to recover the arrows in Figure 10.2. How do we estimate that motion? In a way,


we are solving a pixel correspondence problem much like we did in Stereo Correspondence,
though the solution is much different. Given some pixel in I(x, y, t), look for nearby pixels
of the same color in I(x, y, t + 1). This is the optic flow problem.

Figure 10.2: The problem definition of optical flow; given I(x, y, t) (left) and I(x, y, t + 1) (right), how do we recover the motion vectors in the left image?

Like we’ve seen time and time again, we need to establish some assumptions (of varying
validity) to simplify the problem:
• Color constancy: assume a corresponding point in one image looks the same in the
next image. For grayscale images (which we’ll be working with for simplicity), this is
called brightness constancy. We can formulate this into a constraint by saying that
there must be some (u, v) change in location for the corresponding pixel, then:

I(x, y, t) = I(x + u0 , y + v 0 , t + 1)

• Small motion: assume points do not move very far from one image to the next. We
can formulate this into a constraint, as well. Given that same (u, v), we assume that
they are very small, say, 1 pixel or less. The points are changing smoothly.
Yep, you know what that means! Another Taylor expansion. I at some location
(x + u, y + v) can be expressed exactly as:

I(x + u, y + v) = I(x, y) + (∂I/∂x)u + (∂I/∂y)v + … [higher order terms]

We can disregard the higher order terms and make this an approximation that holds
for small values of u and v.
We can combine these two constraints into the following equation (the full derivation comes
in this aside), called the brightness constancy constraint equation:

Ix u + Iy v + It = 0 (10.1)

We have two unknowns describing the direction of motion, u and v, but only have one
equation! This is the aperture problem: we can only see changes that are perpendicular
to the edge. We can determine the component of (u, v) in the gradient's direction, but
not in the perpendicular direction (which would be along an edge). This is reminiscent of
the edge problem in Finding Interest Points: patches along an edge all looked the same.
Visually, the aperture problem is explained in Figure 10.3 below.


Figure 10.3: The aperture problem. Clearly, motion from the black
line to the blue line moves it down and right, but through the view of
the aperture, it appears to have moved up and right.

Quick Maffs: Combining Constraints

Let’s combine the two constraint functions for optic flow. Don’t worry, this will be
much shorter than the last aside involving Taylor expansions. We begin with our
two constraints, rearranged and simplified for convenience:

0 = I(x + u0 , y + v 0 , t + 1) − I(x, y, t)
I(x + u, y + v) ≈ I(x, y) + (∂I/∂x)u + (∂I/∂y)v

Then, we can perform a substitution of the second into the first and simplify:

0 ≈ I(x, y, t + 1) + Ix u + Iy v − I(x, y, t)
≈ [I(x, y, t + 1) − I(x, y, t)] + Ix u + Iy v
≈ It + Ix u + Iy v
 
≈ It + ∇I · [u, v]ᵀ

In the limit, as u and v approach zero (meaning the ∆t between our images gets
smaller and smaller), this equation becomes exact.
Notice the weird simplification of the image gradient: is Ix the gradient at t or t+1?
Turns out, it doesn’t matter! We assume that the image moves so slowly that the
derivative actually doesn’t change. This is dicey, but works out “in the limit,” as
Prof. Bobick says.
Furthermore, notice that It is the temporal derivative: it’s the change in the
image over time.


To solve the aperture problem, suppose we formulate Equation 10.1 as an error function:
ec = ∬image (Ix u + Iy v + It)² dx dy

Of course, we still need another constraint. Remember when we did Better Stereo Correspon-
dence? We split our error into two parts: the raw data error (6.2) and then the smoothness
error (6.3), which punished solutions that didn’t behave smoothly. We apply the same logic
here, introducing a smoothness constraint:
es = ∬image (ux² + uy² + vx² + vy²) dx dy

This punishes large changes to u or v over the image. Now given both of these constraints,
we want to find the (u, v) at each image point that minimizes:

e = es + λec

Where λ is a weighing factor we can modify based on how much we “believe” in our data
(noise, lighting, artifacts, etc.) to change the effect of the brightness constancy constraint.
This is a global constraint on the motion flow field; it comes with the disadvantages of such
global assumptions. Though it allows you to bias your solution based on prior knowledge,
local constraints perform much better. Conveniently enough, that’s what we’ll be discussing
next.

10.1.1 Lucas-Kanade Flow


As we mentioned, a better approach to solving the aperture problem is imposing local con-
straints on the pixel we’re analysing. One method, known as the Lucas-Kanade method,
imposes the same (u, v) constraint on an entire patch of pixels.
For example, if we took the 5×5 block around some pixel, we’d get an entire system of
equations:

    [ Ix(p1)    Iy(p1)  ]           [ It(p1)  ]
    [ Ix(p2)    Iy(p2)  ]   [u]     [ It(p2)  ]
    [   ⋮         ⋮     ] · [v] = − [   ⋮     ]
    [ Ix(p25)   Iy(p25) ]           [ It(p25) ]
         A (25×2)        d (2×1)      b (25×1)

This time, we have more equations than unknowns. Thus, we can use the standard least
squares method on an over-constrained system (as covered in Appendix A) to find the best
approximate solution: (AᵀA) d = Aᵀb. Or, in matrix form:

    [ Σ Ix·Ix   Σ Ix·Iy ] [u]     [ Σ Ix·It ]
    [ Σ Ix·Iy   Σ Iy·Iy ] [v] = − [ Σ Iy·It ]


Each summation occurs over some k × k window. This equation is solvable when the pseu-
doinverse does, in fact, exist; that is, when AT A is invertible. For that to be the case, it
must be well-conditioned: the ratio of its eigenvalues, λ1/λ2 , should not be “too large.” 2
Wait. . . haven’t we already done some sort of eigenvalue analysis on a matrix with image
gradients? Yep! Recall our discussion of the Properties of the 2nd Moment Matrix when
finding Harris Corners? Our AT A is a moment matrix, and we’ve already considered what
it means to have a good eigenvalue ratio: it’s a corner!
We can see a tight relationship between the two ideas: use corners as good points to compute
the motion flow field. This is explored briefly when we discuss Sparse Flow later; for now,
though, we continue with dense flow and approximate the moment matrix for every pixel.
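As a concrete sketch of the per-window solve (an assumed implementation, not the course's reference code), we can build the moment matrix AᵀA and the right-hand side directly from image gradients:

import numpy as np

def lucas_kanade_window(I_t, I_t1, x, y, k=5):
    """Estimate (u, v) for the k x k window centered at (x, y).

    I_t, I_t1: consecutive grayscale frames as float arrays.
    Solves (A^T A) d = A^T b, i.e. the 2x2 system built from sums of
    gradient products over the window.
    """
    Iy, Ix = np.gradient(I_t)            # spatial gradients (np.gradient is row-major)
    It = I_t1 - I_t                      # temporal derivative
    r = k // 2
    win = np.s_[y - r:y + r + 1, x - r:x + r + 1]
    ix, iy, it = Ix[win].ravel(), Iy[win].ravel(), It[win].ravel()

    AtA = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                    [np.sum(ix * iy), np.sum(iy * iy)]])
    Atb = -np.array([np.sum(ix * it), np.sum(iy * it)])

    # Only trust the solution when A^T A is well-conditioned (corner-like);
    # the threshold here is an arbitrary illustrative choice.
    if np.linalg.cond(AtA) > 1e3:
        return np.zeros(2)
    return np.linalg.solve(AtA, Atb)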

Improving Lucas-Kanade
Our small motion assumption for optic flow rarely holds in the real world. Often, motion
from one image to another greatly exceeds 1 pixel. That means our first-order Taylor ap-
proximation doesn’t hold: we don’t necessarily have linear motion.
To deal with this, we can introduce iterative refinement. We find a flow field between two
images using the standard Lucas-Kanade approach, then perform image warping using that
flow field. This is I′t+1, our estimated next time step. Then we can find the flow field
between that estimation and the truth, getting a new flow field. We can repeat this until
the estimation converges on the truth! This is iterative Lucas-Kanade, formalized in
algorithm 10.1.

Hierarchical Lucas-Kanade
Iterative Lucas-Kanade improves our resulting motion flow field in scenarios that have a little
more change between images, but we can do even better. The big idea behind hierarchical
Lucas-Kanade is that we can make large changes seem smaller by just making the image
smaller. To do that, we will be reintroducing the idea of Gaussian pyramids that we used
when Improving the Harris Detector.
The idea is fairly straightforward. We first introduce two important operations: reduction
and expansion. These appear relatively complicated on the surface, but can be explained
simply; they form the building block for our Gaussian pyramid.
• Reduce: Remember when we had that terrible portrait of Van Gogh after directly
using image subsampling? If not, refer to Figure 5.4. We threw out every other pixel
to halve the image size. It looked like garbage; we lost a lot of the fidelity and details
because we were arbitrarily throwing out pixels. The better solution was first blurring
the image, then subsampling. The results looked much better.
² From numerical analysis, the condition number relates to how "sensitive" a function is: given a small
change in input, what’s the change in output? Per Wikipedia, a high condition number means our least
squares solution may be extremely inaccurate, even if it’s the “best” approximation. In some ways, this is
related to our discussion of Error Functions for feature recognition: recall that vertical least squares gave
poor results for steep lines, so we used perpendicular least squares instead.


Algorithm 10.1: Iterative Lucas-Kanade algorithm.

Input: A video (or sequence of images), I(x, y, t).
Input: Some window for Lucas-Kanade, W.
Result: A motion flow field.
V ← 0
I′t+1 ← It
while I′t+1 ≠ It+1 do
    // Find the Lucas-Kanade flow field.
    dV = LK(I′t+1, It+1, W)
    // Warp our estimate towards the real image, then try again.
    I′t+1 = Warp(I′t+1, It+1, dV)
    V += dV
end
return V

The reduce operation follows the same principle, but accomplishes it a little differently.
It uses what’s called a “5-tap filter” to make each subsampled pixel a weighted blend
of its 5 “source” pixels.
• Expand: Expansion is a different beast, since we must interpret the colors of pixels
that lie “between” ones whose values we know. For even pixels in the upscaled image,
the values we know are directly on the right and left of the pixel in the smaller image.
Thus, we blend them equally. For odd pixels, we have a directly corresponding value,
but for a smoother upscale, we also give some weight to its neighbors.3
A picture is worth a thousand words: these filters are visually demonstrated in Figure 10.4.
In practice, the filters for the expansion can be combined into a single filter, since the
corresponding pixels will have no value (i.e. multiplication by zero) in the correct places for
odd and even pixels. For example, applying the combined filter (1/8, 1/2, 3/4, 1/2, 1/8) on an even
pixel would multiply the odd filter indices by nothing.
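Here's a sketch of reduce and expand using those filters; treating the kernels as separable 1D passes and reusing the doubled 5-tap kernel for expansion are implementation assumptions, but they reproduce the weights described above:

import numpy as np
from scipy.ndimage import convolve1d

FIVE_TAP = np.array([1, 4, 6, 4, 1]) / 16.0   # the (1/16, 1/4, 3/8, 1/4, 1/16) kernel

def reduce_(img):
    """Blur with the 5-tap kernel (applied separably), then keep every other pixel."""
    blurred = convolve1d(convolve1d(img, FIVE_TAP, axis=0), FIVE_TAP, axis=1)
    return blurred[::2, ::2]

def expand(img):
    """Double the image size, blending the known pixels into the new ones."""
    up = np.zeros((2 * img.shape[0], 2 * img.shape[1]))
    up[::2, ::2] = img                            # drop the known values in place
    # Blurring the zero-interleaved image with twice the kernel is equivalent to the
    # combined (1/8, 1/2, 3/4, 1/2, 1/8) expansion filter described above.
    return convolve1d(convolve1d(up, 2 * FIVE_TAP, axis=0), 2 * FIVE_TAP, axis=1)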
We can now create a pyramid of Gaussians – each level 1/2 the size of the last – for each of
our images in time, then build up our motion field from the lowest level. Specifically, we can
find the motion flow from the highest (smallest) level using standard (or iterative) Lucas-
Kanade. To move up in the pyramid, we upscale the motion field (effectively interpolating
the motion field for inbetween pixels) and double it (since the change is twice as big now).
Then, we can use that upscaled flow to warp the next-highest level; this is our “estimation”

³ Whether or not an upscaled pixel has a corresponding value in the parent image depends on how you do
the upscaling. If you say that column 1 in the new image is column 1 in the old image, the rules apply (so
column 3 comes from column 2, etc.). If you say column 2 in the new image is column 1 in the old image
(so column 4 comes from column 2, etc.) then the “odd” and “even” rules are reversed.


Figure 10.4: Hierarchical Lucas-Kanade's 5-tap reduction filter (1/16, 1/4, 3/8, 1/4, 1/16) and its two expansion filters: (1/2, 1/2) for even pixels and (1/8, 3/4, 1/8) for odd pixels.

of that level. We can compare that estimate against the “true” image at that level and get
a new (cumulative) motion field. We then repeat this process all the way up the pyramid.
More formally,

Algorithm 10.2: The hierarchical Lucas-Kanade algorithm.

Input: Two images from a sequence, It and It+1.
Input: The number of Gaussians to create in each pyramid, L.
Result: An estimated motion flow field, V.
Pt ← GaussianPyramid(It, L)
Pt+1 ← GaussianPyramid(It+1, L)
I′ ← Pt[L]
V ← 0
while L > 0 do
    dV = Iterative-LK(I′, Pt+1[L])
    V = 2 · Expand(V + dV)
    I′ = Warp(Pt[L − 1], V)
    L −= 1
end
// We need to do this one more time for the largest level, without
// expansion or warping since we're out of "next" levels.
return V + Iterative-LK(I′, Pt+1[0])

Sparse Flow
We realized earlier that the ideal points for motion detection would be corner-like, to ensure
the accuracy of our gradients, but our discussion ended there. We still went on to estimate


the motion flow field for every pixel in the image. Sparse Lucas-Kanade is a variant of
Hierarchical Lucas-Kanade that is only applied to interest points.

10.1.2 Applying Lucas-Kanade: Frame Interpolation


A useful application of hierarchical Lucas-Kanade is frame interpolation. Given two frames
with a large delta, we can approximate the frames that occurred in-between by interpolating
the motion flow fields.
Consider the following contrived example for a single point:


Suppose hierarchical Lucas-Kanade told us that the point moved (+3.5, +2) from t0 to
t1 (specifically, from (1, 1) to (4.5, 3)). We can then easily estimate that the point must
have existed at (2.75, 2) at t0.5 by just halving the flow vector and warping, as before. As
shorthand, let's say F t is the warping of t with the flow field F, so then t0.5 = (F/2) t0.


Of course, hierarchical Lucas-Kanade rarely gives such accurate results for the flow vectors.
Each interpolated frame would continually propagate the error between the "truth" at t1
and our "expected truth" based on the flow field. Much like when we discussed iterative
Lucas-Kanade, we can improve our results by iteratively applying HLK as we interpolate!
Suppose we wanted t0.25 and t0.75 from our contrived example above, and we had the flow
field F. We start as before, with t0.25 = (F/4) t0. Instead of making t0.75 use (3F/4) t0, though,
we recalculate the flow field from our estimated frame, giving us F′ = t0.25 → t1. Then, we
apply t0.75 = (2F′/3) t0.25, since t0.75 is 2/3rds of the way along F′.

10.2 Motion Models


Until now, we’ve been treating our corresponding points relatively independently. We don’t
really have a concept of objects (though they are an emergent property of groups of similar
flow vectors). Of course, that isn’t generally how motion works in the real world.


Suppose we knew that our motion followed a certain pattern, or that certain objects in the
scene moved in a certain way. This would allow us to further constrain our motion beyond
the simple constraints that we imposed earlier (smoothness, brightness constancy, etc.). For
example, we know that objects closer to the camera move more on the camera image plane
than objects further away from the camera, for the same amount of motion. Thus, if we
know (or determine/approximate) the depth of a region of points, we can enforce another
constraint on their motion and get better flow fields.
To discuss this further, we need to return to some concepts from basic Newtonian physics.
Namely, the fact that a point rotating about some origin with a rotational velocity ω that
also has a translational velocity t has a total velocity of: v = ω × r + t.
In order to figure out how the point is moving in the image, we need to convert from these
values in world space (X, Y, Z) to image space (x, y). We've seen this before many times.
The perspective projection of these values in world space equates to (f X/Z, f Y/Z) in the image,
where f is the focal length. How does this relate to the velocity, though? Well the velocity
is just the derivative of the position. Thus, we can take the derivative of the image space
coordinates to see, for a world-space velocity V :4
 
    u = vx = f · (Vx·Z − X·Vz)/Z² = f·(Vx/Z) − f·(X/Z)·(Vz/Z) = f·Vx/Z − x·Vz/Z
    v = vy = f · (Vy·Z − Y·Vz)/Z² = f·(Vy/Z) − f·(Y/Z)·(Vz/Z) = f·Vy/Z − y·Vz/Z

where the last step uses the image coordinates x = f X/Z and y = f Y/Z.

This is still kind of ugly, but we can “matrixify” it into a much cleaner equation that isolates
terms into things we know and things we don’t:
 
    [u(x, y); v(x, y)] = (1/Z(x, y)) · A(x, y) t + B(x, y) ω

Where t is the (unknown) translation vector and ω is the unknown rotation, and A and B
are defined as such:
 
    A(x, y) = [ −f    0    x ]
              [  0   −f    y ]

    B(x, y) = [  (xy)/f      −(f + x²)/f     y ]
              [ (f + y²)/f     −(xy)/f      −x ]

The beauty of this arrangement is that A and B are functions of things we know, and they
relate our world-space vectors t and ω to image space. This is the general motion model.
⁴ Recall the quotient rule for derivatives: d/dx [f(x)/g(x)] = (f′(x)g(x) − g′(x)f(x)) / g(x)².


We can see that the depth in world space, Z(x, y), only impacts the translational term. This
corresponds to our understanding of parallax motion, in that the further away from a
camera an object is, the less we perceive it to move for the same amount of motion.

10.2.1 Known Motion Geometry


Suppose we know that there’s a plane in our image, and so the motion model has to follow
that of a plane. If we can form equations to represent the points on the plane in space, we
can use them to enforce constraints on our motion model and get a better approximation of
flow since we are more confident in how things move.

Perspective Under world-space perspective projection, we have the standard equation of


a plane: aX + bY + cZ + d = 0.
We saw some time ago, when discussing homographies, that we can find the image-space
coordinates from any plane with our homogeneous system in (7.1). We didn’t expand it at
the time, but it would look like this:

u(x, y) = a1 + a2 x + a3 y + a7 x² + a8 xy
v(x, y) = a4 + a5 x + a6 y + a7 xy + a8 y²

Recall that 4 correspondence points are needed to solve this equation.

Orthographic On the other hand, if our plane lies at a sufficient distance from the camera,
the distance between points on the plane is minuscule relative to that distance. As we
learned when discussing orthographic projection (a model in which the z coordinate is
dropped entirely with no transformation), we can effectively disregard depth in this case.
The simplified pair of equations, needing 3 correspondence points to solve, is then:

u(x, y) = a1 + a2 x + a3 y (10.2)
v(x, y) = a4 + a5 x + a6 y (10.3)

This is an affine transformation!

10.2.2 Geometric Motion Constraints


Suppose we now use our aforementioned pair of equations for an affine transformation as a
constraint on the motion model. Previously, we could only enforce the brightness constancy
constraint equation over a small window; if the window was too large, there’s no way we
could guarantee that the colors would stay constant.
Since our constraint now incorporates a transformation model rather than an assumption,
we can reliably work with much larger windows. Recall the constraint from before, (10.1):

Ix u + Iy v + It = 0


By substituting in the affine transformation equations from (10.2), we get:


Ix (a1 + a2 x + a3 y) + Iy (a4 + a5 x + a6 y) + It ≈ 0

This is actually a relaxation on our constraint, even though the math gets more complicated.
Now we can do the same least squares minimization we did before for the Lucas-Kanade
method but allow an affine deformation for our points:
Err(a) = Σ [Ix (a1 + a2 x + a3 y) + Iy (a4 + a5 x + a6 y) + It]²

Much like before, we get a nasty system of equations and minimize its result:
     
    [ Ix   Ix·x1   Ix·y1   Iy   Iy·x1   Iy·y1 ]   [a1]       [It1]
    [ Ix   Ix·x2   Ix·y2   Iy   Iy·x2   Iy·y2 ]   [a2]       [It2]
    [  ⋮      ⋮       ⋮     ⋮      ⋮       ⋮   ] · [ ⋮] = −   [ ⋮ ]
    [ Ix   Ix·xn   Ix·yn   Iy   Iy·xn   Iy·yn ]   [a6]       [Itn]
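A sketch of fitting those six affine parameters by least squares over a set of pixels (an assumed implementation using np.linalg.lstsq rather than explicitly forming the normal equations):

import numpy as np

def fit_affine_flow(Ix, Iy, It, xs, ys):
    """Fit a = (a1..a6) minimizing sum [Ix(a1+a2x+a3y) + Iy(a4+a5x+a6y) + It]^2.

    Ix, Iy, It: gradient values at the chosen pixels; xs, ys: their coordinates.
    """
    A = np.column_stack([Ix, Ix * xs, Ix * ys, Iy, Iy * xs, Iy * ys])
    b = -It
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a   # u(x,y) = a1 + a2*x + a3*y,  v(x,y) = a4 + a5*x + a6*y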

10.2.3 Layered Motion


Obviously, all of the motion in an image is unlikely to fit into a single motion model, whether
that be an affine transformation or homography or some other model. If you have multiple
motion models in the same image, you want to identify those individually. The basic idea
behind layered motion is to break the image sequence into “layers” which all follow some
coherent motion model.5
Figure 10.5 demonstrates this visually. Under layered motion, each layer is defined by an
alpha mask (which identifies the relevant image region) and an affine motion model like the
one we just discussed.

Figure 10.5: Breaking an image sequence into coherent layers of motion.
⁵ First studied in 1993 in the paper "Layered Representation for Motion Analysis" (link).


How does the math work? First, get an approximation of the “local flow” via Lucas-Kanade
or other methods. Then, segment that estimation by identifying large leaps in the flow field.
Finally, use the motion model to fit each segment and identify coherent segments. Visually,
this is demonstrated below:

(a) The true motion flow, u(x, y). (b) The local flow estimation. (c) Segmenting the local flow estimation.
(d) Fitting the motion model to the segmentations, identifying the foreground and background "lines." The blue line points to an identified occlusion in the scene.

Figure 10.6: Applying an affine motion model to an image sequence to segment it into the foreground and background layers.

The implementation difficulties lie in identifying segments and clustering them appropriately
to fit the intended motion models.

10.3 Tracking
What’s the point of dedicating a separate section to tracking? We’ve already discussed
finding the flow between two images; can’t we replicate this process for an entire sequence
and track objects as they move? Well, as we saw, there are a lot of limitations of Lucas-
Kanade and other image-to-image motion approximations.
• It’s not always possible to compute optic flow. The Lucas-Kanade method needs a lot
of stars to align to properly determine the motion field.
• There could be large displacements of objects across images if they are moving rapidly.
Lucas-Kanade falls apart given a sufficiently large displacement (even when we apply
improvements like hierarchical LK). Hence, we probably need to take dynamics into
account; this is where we’ll spend a lot of our focus initially.
• Errors are compounded over time. When we discussed frame interpolation, we tried to
minimize the error drift by recalculating the motion field at each step, but that isn’t
always possible. If we only rely on the optic flow, eventually the compounded errors
would track things that no longer exist.


• Objects getting occluded cause optic flow models to freak out. Similarly, when those
objects appear again (called disocclusions), it’s hard to reconcile and reassociate
things. Is this an object we lost previously, or a new object to track?
We somewhat incorporated dynamics when we discussed using Motion Models with Lucas-
Kanade: we expected points to move along an affine transformation. We could improve this
further by combining this with Feature Recognition, identifying good features to track and
likewise fitting motion models to them.6 This is good, but only gets us so far.
Instead, we will focus on tracking with dynamics, which approaches the problem differ-
ently: given a model of expected motion, we should be able to predict the next frame without
actually seeing it. We can then use that frame to adjust the dynamics model accordingly.
This integration of dynamics is the differentiator between feature detection and tracking: in
detection, we detect objects independently in each frame, whereas with tracking we predict
where objects will be in the next frame using an estimated dynamic model.
The benefit of this approach is that the trajectory model restricts the necessary search space
for the object, and it also improves estimates due to reduced measurement noise due to the
smoothness of the expected trajectory model.
As usual, we need to make some fundamental assumptions to simplify our model and con-
struct a mathematical framework for continuous motion. In essence, we’ll be expecting small,
gradual change in pose between the camera and the scene. Specifically:
• Unlike small children, who have no concept of object permanence, we assume that
objects do not spontaneously appear or disappear in different places in the scene.
• Similarly, we assume that the camera does not move instantaneously to a new view-
point, which would cause a massive perceived shift in scene dynamics.
Feature tracking is a multidisciplinary problem that isn’t exclusive to computer vision. There
are elements of engineering, physics, and robotics at play. Thus, we need to take a detour
into state dynamics and estimation in order to model the dynamics of an image sequence.

10.3.1 Modeling Dynamics


We can view dynamics from two perspectives: inference or induction, but both result in the
same statistical model.

Tracking as Inference
Let’s begin our detour by establishing the terms in our inference system. We have our
hidden state, X, which is made up of the true parameters that we care about. We have our
measurement, Z, which is a noisy observation of the underlying state. At each time step t,
the real state changes from Xt−1 → Xt , resulting in a new noisy observation Zt .

⁶ Refer to the original paper, "Good Features to Track," for more (link).


Our mission, should we choose to accept it, is to recover the most likely estimated
distribution of the state Xt given all of the observations we've seen thus far and our
knowledge about the dynamics of the state transitions.

(Aside: if you are following along in the lectures, you may notice a discrepancy in notation:
Zt for measurements instead of Yt. This is to align with the later discussion on Particle
Filters, in which the lectures also switch from Yt to Zt to stay consistent with the literature
these concepts come from (and to steal PowerPoint slides). This guide uses Z immediately,
instead, to maintain consistency throughout.)

More formally, we can view tracking as an adjustment of a probability distribution. We
have some distribution representing our prediction or current belief for t,
and the actual resulting measurement at t. We need to adjust our prediction based on the
observation to form a new prediction for t + 1.
Our prediction can be expressed as the likelihood of a state given all of the previous obser-
vations:
Pr [Xt | Z0 = z0 , Z1 = z1 , . . . , Zt−1 = zt−1 ]

Our correction, then, is an updated estimate of the state after introducing a new observation
Zt = zt :
Pr [Xt | Z0 = z0 , Z1 = z1 , . . . , Zt−1 = zt−1 , Zt = zt ]

We can say that tracking is the process of propagating the posterior distribution of state
given measurements across time. We will again make some assumptions to simplify the
probability distributions:
• We will assume that we live in a Markovian world in that only the immediate past
matters with regards to the actual hidden state:

Pr [Xt | X0 , X1 , . . . , Xt−1 ] = Pr [Xt | Xt−1 ]

This latter probability, Pr [Xt | Xt−1 ], is the dynamics model.


• We will also assume that our noisy measurements (more specifically, the probability
distribution of the possible measurements) only depends on the current state, rather
than everything we’ve observed thus far:

Pr [Zt | X0 , Z0 , . . . , Xt−1 , Zt−1 , Xt ] = Pr [Zt | Xt ]

This is called the observation model, and much like the small motion constraint in
Lucas-Kanade, this is the most suspect assumption. Thankfully, we won't be exploring
relaxations to this assumption, but one example of such a model is conditional random
fields, if you’d like to explore further.
These assumptions are represented graphically in Figure 10.7. Readers with experience in
statistical modeling or machine learning will notice that this is a hidden Markov model.


X1 X2 ... Xn

Z1 Z2 Zn

Figure 10.7: A graphical model for our assumptions.

Tracking as Induction
Another way to view tracking is as an inductive process: if we know Xt , we can apply
induction to get Xt+1 .
As with any induction, we begin with our base case: this is our initial prior knowledge that
predicts a state in the absence of any evidence: Pr [X0 ]. At the very first frame, we correct
this given Z0 = z0 . After that, we can just keep iterating: given a corrected estimate for
frame t, predict then correct frame t + 1.

Making Predictions
Alright, we can finally get into the math.
Given: Pr [Xt−1 | z0 , . . . , zt−1 ]
Guess: Pr [Xt | z0 , . . . , zt−1 ]
To solve that, we can apply the law of total probability and marginalization if we
imagine we’re working with the joint set Xt ∩ Xt−1 .7 Then:
    Pr[Xt | z0, …, zt−1] = ∫ Pr[Xt, Xt−1 | z0, …, zt−1] dXt−1
                         = ∫ Pr[Xt | Xt−1, z0, …, zt−1] · Pr[Xt−1 | z0, …, zt−1] dXt−1
                         = ∫ Pr[Xt | Xt−1] · Pr[Xt−1 | z0, …, zt−1] dXt−1        (independence assumption from the dynamics model)

To explain this equation in English, what we’re saying is that the likelihood of being at
a particular spot (this is Xt ) depends on the probability of being at that spot given that
we were at some previous spot weighed by the probability of that previous spot actually
happening (our corrected estimate for Xt−1 ). Summing over all of the possible “previous
spots” (that is, the integral over Xt−1 ) gives us the marginalized distribution of Xt .

⁷ Specifically, the law of total probability states that if we have a joint set A ∩ B and we know all
of the probabilities in B, we can get Pr[A] if we sum over all of the probabilities in B. Formally,
Pr[A] = Σn (Pr[A, Bn]) = Σn (Pr[A | Bn] Pr[Bn]). For the latter equivalence, recall that
Pr[U, V] = Pr[U | V] Pr[V]; this is the conditioning property.
In our working example, Xt is part of the same probability space as Xt−1 (and all of the Xi s that came
before it), so we can apply the law, using the integral instead of the sum.


Making Corrections
Now, given a predicted value Pr [Xt | z0 , . . . , zt−1 ] and the current observation zt , we want to
compute Pr [Xt | z0 , . . . , zt−1 , zt ], essentially folding in the new measurement:8
    Pr[Xt | z0, …, zt−1, zt] = Pr[zt | Xt, z0, …, zt−1] · Pr[Xt | z0, …, zt−1] / Pr[zt | z0, …, zt−1]        (10.4)
                             = Pr[zt | Xt] · Pr[Xt | z0, …, zt−1] / Pr[zt | z0, …, zt−1]        (independence assumption from the observation model)
                             = Pr[zt | Xt] · Pr[Xt | z0, …, zt−1] / ∫ Pr[zt | Xt] Pr[Xt | z0, …, zt−1] dXt        (conditioning on Xt)

As we’ll see, the scary-looking denominator is just a normalization factor that ensures the
probabilities sum to 1, so we’ll never really need to worry about it explicitly.

Summary
We’ve developed the probabilistic model for predicting and subsequently correcting our state
based on some observations. Now we can dive into actual analytical models that apply these
mathematics and do tracking.

10.3.2 Kalman Filter


Let’s introduce an analytic model that leverages the prediction and correction techniques
we mathematically defined previously: the Kalman filter. We’ll be working with a specific
case of linear dynamics, where the predictions are based on a linear function applied to
our previous beliefs, perturbed by some Gaussian noise function.9
A linear dynamics model says that the state at some time, xt , depends on the previous
state xt−1 undergoing some linear transformation Dt , plus some level of Gaussian process
noise Σdt .10 Mathematically, we have our normal distribution function N , and so
    xt ∼ N(Dt xt−1, Σdt)        (mean: Dt xt−1; variance: Σdt)

Notice the subscript on Dt and dt : with this, we indicate that the transformation itself may
change over time. Perhaps, for example, the object is moving with some velocity, and then
starts rotating. In our examples, though, these terms will stay constant.
We also have a linear measurement model, which describes our observations. Specifically
(and unsurprisingly), the model says that the measurement zt is linearly transformed by Mt ,
plus some level of Gaussian measurement noise:
    zt ∼ N(Mt xt, Σmt)

⁸ We are applying Bayes' rule here, which states that Pr[A | B] = Pr[B | A] · Pr[A] / Pr[B].
⁹ It's easy to get lost in the math. Remember, "Gaussian noise" is just a standard bell curve.
¹⁰ This is "capital sigma," not a summation. It symbolizes the variance (std. dev squared): Σ = σ².


M is also sometimes called the extraction matrix because its purpose is to extract the
measurable data from a state (or a “state-like” matrix).

Example 10.1: 1D Linear Motion


Let’s work through a simple example in one dimension to get a better feel for things.
 
Suppose our state vector defines the position and velocity of an object, xt = [pt, vt]ᵀ,
where each of those is defined as they were in our basic physics courses, except with
some additional error terms thrown in to account for noise:

pt = pt−1 + (∆t)vt−1 + ε
vt = vt−1 + ξ

As we love doing, we can express these in matrix form to get our linear dynamics
model:

    xt = Dt·xt−1 + noise = [ 1  Δt ] [ pt−1 ]  + noise
                           [ 0   1 ] [ vt−1 ]

What about measurement? Well, suppose we can only measure position. Then our
linear measurement model is:
 
    zt = mt·xt + noise = [1  0] · [pt, vt]ᵀ + noise

Simple stuff! Notice that we mostly defined these model parameters ourselves.
We could have added acceleration to our dynamics model if we expected it. The
“observer” defines the dynamics model to fit their scenario.

Before we continue, we need to introduce some standard notation to track things. In our
predicted state, Pr[Xt | z0, …, zt−1], we say that the mean and standard deviation of the
resulting Gaussian distribution are µ−t and σ−t. For our corrected state, Pr[Xt | z0, …, zt−1, zt],
we similarly say that the mean and standard deviation are µ+t and σ+t.

Example 10.2: 1D Prediction & Correction


Let’s continue with our previous example and work through a single tracking step
(prediction followed by correction), given our new notation.

Prediction
Our simple linear dynamics model defines a state as a constant times its previous
state, with some noise added in to indicate uncertainty:
    Xt ∼ N(d·Xt−1, σd²)


The distribution for the next predicted state, then, is also a Gaussian, so we can
simply update the mean and variance accordingly. Given:
 
    Pr[Xt | z0, …, zt−1] = N(µ−t, (σ−t)²)

Update the mean:     µ−t = d·µ+t−1
Update the variance: (σ−t)² = σd² + (d·σ+t−1)²
The mean of a Gaussian distribution that has been multiplied by a constant is just
likewise multiplied by that constant. The variance, though, is both multiplied by
that constant squared and we need to introduce some additional noise to account
for uncertainty in our prediction.

Correction
Similarly, our mapping of states to measurements relies on a constant, m:
    zt ∼ N(m·Xt, σm²)

Under linear, Gaussian dynamics and measurements, the Kalman filter defines the
corrected distribution (the simplified Equation 10.4) as a new Gaussian:
 
    Pr[Xt | z0, …, zt−1, zt] ≡ N(µ+t, (σ+t)²)

Update the mean:
    µ+t = (µ−t·σm² + m·zt·(σ−t)²) / (σm² + m²·(σ−t)²)

Update the variance:
    (σ+t)² = ((σ−t)²·σm²) / (σm² + m²·(σ−t)²)

Intuition Let’s get an intuitive understanding of what this new mean, µ+t , really
is. First, we divide the entire thing by m² to "unsimplify." We get this mess:

    µ+t = [ µ−t·(σm²/m²) + (zt/m)·(σ−t)² ] / [ σm²/m² + (σ−t)² ]        (10.5)
In the numerator, the first term is our prediction of Xt (µ−t), weighted by the variance
attributed to the measurement (σm²/m²); the second term is our measurement's guess of Xt
(zt/m), weighted by the variance of the prediction ((σ−t)²).
Notice that all of this is divided by the sum of those two weights: this
is just a weighted average of our prediction and our measurement guess
based on variances!

The previous example gave us an important insight that applies to the Kalman filter regard-
less of the dimensionality we’re working with. Specifically, that our corrected distribution


for Xt is a weighted average of the prediction (i.e. based on all prior measurements except
zt ) and the measurement guess (i.e. the former with zt incorporated).
Let’s take the equation from (10.5) and substitute a for the measurement variance and b for
the prediction variance. We get:
    µ+t = (a·µ−t + b·(zt/m)) / (a + b)

We can do some manipulation (add b·µ−t − b·µ−t to the top and factor) to get:

    µ+t = [ (a + b)·µ−t + b·(zt/m − µ−t) ] / (a + b)
        = µ−t + (b / (a + b)) · (zt/m − µ−t)
        = µ−t + k·(zt/m − µ−t)

Where k is known as the Kalman gain. What does this expression tell us? Well, the new
mean µ+t is the old predicted mean plus a weighted “residual”: the difference between the
measurement and the prediction (in other words, how wrong the prediction was).

N -dimensional Kalman Filter


Even in Example 10.1, we didn’t have a single dimension: our state vector had both position
and velocity. Under an N -dimensional model, we have some wonderful matrices:

Predict:
    x−t = Dt x+t−1
    Σ−t = Dt Σ+t−1 Dtᵀ + Σdt

Correct:
    Kt = Σ−t Mtᵀ (Mt Σ−t Mtᵀ + Σmt)⁻¹
    x+t = x−t + Kt (zt − Mt x−t)
    Σ+t = (I − Kt Mt) Σ−t

We now have a Kalman gain matrix, Kt . As our estimate covariance approaches zero (i.e. con-
fidence in our prediction grows), the residual gets less weight from the gain matrix. Similarly,
if our measurement covariance approaches zero (i.e. confidence in our measurement grows),
the residual gets more weight.

Summary
The Kalman filter is an effective tracking method due to its simplicity, efficiency, and com-
pactness. Of course, it does impose some fairly strict requirements and has significant pitfalls
for that same reason. The fact that the tracking state is always represented by a Gaussian


creates some huge limitations: such a unimodal distribution means we only really have one
true hypothesis for where the object is. If the object does not strictly adhere to our linear
model, things fall apart rather quickly.
We know that a fundamental concept in probability is that as we get more information, cer-
tainty increases. This is why the Kalman filter works: with each new measured observation,
we can derive a more confident estimate for the new state. Unfortunately, though, “always
being more certain” doesn’t hold the same way in the real world as it does in the Kalman
filter. We’ve seen that the variance decreases with each correction step, narrowing the Gaus-
sian. Does that always hold, intuitively? We may be more sure about the distribution, but
not necessarily the variance within that distribution. Consider the following extreme case
that demonstrates the pitfalls of the Kalman filter.
In Figure 10.8, we have our prior distribution and our measurement. Intuitively, where
should the corrected distribution go? When the measurement and prediction are far apart,
we would think that we can’t trust either of them very much. We can count on the truth
being between them, sure, and that it’s probably closer to the measurement. Beyond that,
though, we can’t be sure. We wouldn’t have a very high peak and our variance may not
change much. In contrast, as we see in Figure 10.8, Kalman is very confident about its
corrected prediction.
This is one of its pitfalls.

(the three curves along x: evidence, posterior, and prior)

Figure 10.8: One of the flaws of the Kalman model is that it is always
more confident in its distribution, resulting in a tighter Gaussian. In this
figure, the red Gaussian is what the Kalman filter calculates, whereas
the blue-green Gaussian may be a more accurate representation of our
intuitive confidence about the truth. As you can see, Kalman is way
more sure than we are.

Another downside of the Kalman filter is this restriction to linear models for dynamics.
There are extensions that alleviate this problem called extended Kalman filters (EKFs), but
it’s still a limitation worth noting.
More importantly, though, is this Gaussian model of noise. If the real world doesn’t match
with a Gaussian noise model, Kalman struggles. What can we do to alleviate this? Perhaps
we can actually determine (or at least approximate) the noise distribution as we track?


10.3.3 Particle Filters


The basic idea behind particle filtering and other sampling-based methods is that we can
approximate the probability distribution with a set of n weighted particles, xt . Then, the
density is represented by both where the particles are and their weight; we can think of
weight as being multiple particles in the same location.
Now we view Pr [x = x0 ] as being the probability of drawing an x with a value (really close
to) x0 . Our goal, then, is to make drawing a particle a very close approximation (with
equality as n → ∞) of the underlying distribution. Specifically, that
Pr [xt ∈ xt ] ≈ Pr [xt | z1...t ]

We’ll also be introducing the notion of perturbation into our dynamics model. Previously,
we had a linear dynamics model that only consisted of our predictions based on previous
state. Perturbation – also called control – allows us to modify the dynamics by some known
model. By convention, perturbation is an input to our model using the parameter u.

Example 10.3: Perturbation


Consider, for example, a security camera tracking a person. We don’t know how the
person will move but can approximate their trajectory based on previous predictions
and measurements: this is what we did before.
The camera can also move, and this is a known part of our dynamic model. We
can add the camera’s panning or tilting to the model as an input, adjusting the
predictions accordingly.

Bayes Filters
The framework for our first particle filtering approach relies on some given quantities:
• As before, we need somewhere to start. This is our prior distribution, Pr [X0 ]. We
may be very unsure about it, but must exist.
• Since we’ve added perturbation to our dynamics model, we now refer to it as an action
model. We need that, too:
Pr [xt | ut , xt−1 ]

Note: We (consistently, unlike lecture) use ut for inputs occurring between the state xt−1 and xt .

• We additionally need the sensor model. This gives us the likelihood of our mea-
surements given some object location: Pr [z | X]. In other words, how likely are our
measurements given that we’re at a location X. It is not a distribution of possible
object locations based on a sensor reading.
• Finally, we need our stream of observations, z, and our known action data, u:
data = {u1 , z2 , . . . , ut , zt }


Given these quantities, what we want is the estimate of X at time t, just like before; this is
the posterior of the state, or belief :

Bel(xt ) = Pr [xt | u1 , z2 , . . . , ut , zt ]

The assumptions in our probabilistic model are represented graphically in Figure 10.9, and
result in the following simplifications:11

Pr [zt | X0:t , z1:t−1 , u1:t ] = Pr [zt | xt ]


Pr [xt | X1:t−1 , z1:t−1 , u1:t ] = Pr [xt | xt−1 , ut ]

In English, the probability of the current measurement, given all of the past states, mea-
surements, and inputs only actually depends on the current state. This is sometimes called
sensor independence. Second: the probability of the current state – again given all of the
goodies from the past – actually only depends on the previous state and the current input.
This Markovian assumption is akin to the independence assumption in the dynamics model
from before.

ut−1 ut ut+1

... xt−1 Xt Xt+1 ...

zt−1 zt zt+1

Figure 10.9: A graphical model for Bayes filters.

As a reminder, Bayes’ Rule (described more in footnote 8) can also be viewed as a propor-
tionality (η is the normalization factor that ensures the probabilities sum to one):

Pr [x | z] = ηPr [z | x] Pr [x] (10.6)


∝ Pr [z | x] Pr [x] (10.7)

With that, we can apply our given values and manipulate our belief function to get some-
thing more useful. Graphically, what we're doing is shown in Figure 10.10 (and again, more
visually, in Figure 10.11), but mathematically:

¹¹ The notation na:b represents a range; it's shorthand for na, na+1, …, nb.


Bel(xt) = Pr[xt | u1, z2, …, ut, zt]
        = η·Pr[zt | xt, u1, z2, …, ut] · Pr[xt | u1, z2, …, ut]        (Bayes' Rule)
        = η·Pr[zt | xt] · Pr[xt | u1, z2, …, ut]        (sensor independence)
        = η·Pr[zt | xt] ∫ Pr[xt | xt−1, u1, z2, …, ut] · Pr[xt−1 | u1, z2, …, ut] dxt−1        (total probability)
        = η·Pr[zt | xt] ∫ Pr[xt | xt−1, ut] · Pr[xt−1 | u1, z2, …, ut] dxt−1        (Markovian assumption)
        = η·Pr[zt | xt] ∫ Pr[xt | xt−1, ut] · Bel(xt−1) dxt−1        (substitution)

where the integral is our prediction before the measurement.

This results in our final, beautiful recursive relationship between the previous belief and the
next belief based on the sensor likelihood:
    Bel(xt) = η·Pr[zt | xt] ∫ Pr[xt | xt−1, ut] · Bel(xt−1) dxt−1        (10.8)

We can see that there is an inductive relationship between beliefs. The integral term
corresponds to the calculations we did with the Kalman filter: we need to first
find the prediction distribution before our latest measurement. Then, we fold in the actual
measurement, which is described by the sensor likelihood model from before.
With the mathematics out of the way, we can focus on the basic particle filtering algorithm.
It’s formalized in algorithm 10.3 and demonstrated graphically in Figure 10.11, but let’s
walk through the process informally.
We want to generate a certain number of samples (n new particles) from an existing distribu-
tion, given an additional input and measurement (these are St−1 , ut , and zt respectively). To
do that, we need to first choose a particle from our old distribution, which has some position
and weight. Thus, we can say pj = ⟨xt−1,j, wt−1,j⟩. From that particle, we can incorporate
the control and create a new distribution using our action model: Pr[xt | ut, xt−1,j]. Then,
we sample from that distribution, getting our new particle state xi. We need to calculate
the significance of this sampled particle, so we run it through our sensor model to reweigh
it: wi = Pr[zt | xi]. Finally, we update our normalization factor to keep our probabilities
consistent and add it to the set of new particles, St .
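To make the walk-through concrete, here's a minimal NumPy sketch of one such step, assuming (purely for illustration) a 1D state, a Gaussian action model, and a Gaussian sensor model:

import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(states, weights, u, z, action_noise=1.0, sensor_noise=2.0):
    """One prediction + correction step for n weighted 1D particles.

    u: known control input, z: the new (noisy) measurement.
    The Gaussian action and sensor models here are illustrative assumptions.
    """
    n = len(states)
    # Choose old particles in proportion to their weights.
    idx = rng.choice(n, size=n, p=weights)
    # Action model: propagate each chosen particle, adding process noise.
    new_states = states[idx] + u + rng.normal(0, action_noise, size=n)
    # Sensor model: reweigh by how well each particle explains the measurement.
    new_weights = np.exp(-(z - new_states) ** 2 / (2 * sensor_noise ** 2))
    return new_states, new_weights / new_weights.sum()   # normalize (the eta step)

states = rng.uniform(-10, 10, size=500)        # prior: we have no idea where we are
weights = np.full(500, 1 / 500)
for u, z in [(1.0, 1.2), (1.0, 2.3), (1.0, 2.9)]:
    states, weights = particle_filter_step(states, weights, u, z)
print(round(np.sum(states * weights), 1))      # weighted mean ends up near the measurements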

Practical Considerations
Unfortunately, math is one thing but reality is another. We need to take some careful
considerations when applying particle filtering algorithms to real-world problems.

Sampling Method We need a lot of particles to sample the underlying distribution with
relative accuracy. Every timestep, we need to generate a completely new set of samples after


Figure 10.10: Propagation of a probability density in three phases: drift due to the deterministic object dynamics, diffusion due to noise, and reactive reinforcement due to observations.

working all of our new information into our estimated distribution. As such, the efficiency
or algorithmic complexity of our sampling method is very important.
We can view the most straightforward sampling method as a direction on a roulette wheel,
as in Figure 10.12a. Our list of weights covers a particular range, and we choose a value in
that range. To figure out which weight that value refers to, we’d need to perform a binary
search. This gives a total O(n log n) runtime. Ideally, though, sampling runtime should grow
linearly with the number of samples!
As a clever optimization, we can use the systematic resampling algorithm (also called
stochastic universal sampling), described formally in algorithm 10.4. Instead of viewing

Figure 10.11: A graphic representation of particle filtering.


Algorithm 10.3: Basic particle filtering algorithm.

Input: A set of weighted particles,
    St−1 = {⟨x1,t−1, w1,t−1⟩, ⟨x2,t−1, w2,t−1⟩, …, ⟨xn,t−1, wn,t−1⟩}.
Input: The current input and measurement: ut and zt.
Result: A sampled distribution.
St = ∅
η = 0
for i ∈ [1, n] do                              /* resample n samples */
    xj,t−1 = Sample(St−1)                      /* sample a particle state */
    xi,t = Sample(Pr[xt | ut, xj,t−1])         /* sample after incorporating input */
    wi,t = Pr[zt | xi,t]                       /* reweigh particle */
    η += wi,t                                  /* adjust normalization */
    St = St ∪ ⟨xi,t, wi,t⟩                     /* add to current particle set */
end
wt = wt / η                                    /* normalize weights */
return St

the weights as a roulette wheel, we view it as a wagon wheel. We plop down our “spokes” at a
random orientation, as in Figure 10.12b. The spokes are 1/n distance apart and determining
their weights is just a matter of traversing the distance between each spoke, achieving O(n)
linear time for sampling!

(a) as a roulette wheel (b) as a wagon wheel

Figure 10.12: Two methods of sampling from our set of weighted particles, with 10.12b being the more efficient method.

Sampling Frequency We can add another optimization to lower the frequency of sam-
pling. Intuitively, when would we even want to resample? Probably when the estimated
distribution has changed significantly from our initial set of samples. Specifically, we can
only resample when there is a significant variance in the particle weights; otherwise, we can
just reuse the samples.


Highly peaked observations What happens to our particle distribution if we have in-
credibly high confidence in our observation? It’ll nullify a large number of particles by giving
them zero weight. That’s not very good, since it wipes out all possibility of other predictions.
To avoid this, we want to intentionally add noise to both our action and sensor models.
In fact, we can even smooth out the distribution of our samples by applying a Kalman filter
to them individually: we can imagine each sample as being a tiny little Gaussian rather
than a discrete point in the state space. In general, overestimating noise reduces the number
of required samples and avoids overconfidence; let the measurements focus on increasing
certainty.

Recovery from failure Remember our assumption regarding object permanence, in which
we stated that objects wouldn't spontaneously (dis)appear? If that were to happen, our
distributions have no way of handling that case because there are unlikely to be any particles
corresponding to the new object. To correct for this, we can apply some randomly distributed
particles every step in order to catch any outliers.

Algorithm 10.4: The stochastic universal sampling algorithm.

Input: A particle distribution, S, and the expected number of output samples, n.
Result: A set of n samples.
S′ = ∅
c1 = w1
/* We use S[i]x and S[i]w for the state or weight of the ith particle. */
for i ∈ [2, n] do
    ci = ci−1 + S[i]w              /* generate the CDF (outer ring) from the weights */
end
u1 = U[0, 1/n]                     /* initialize the first CDF bin */
i = 1
for j ∈ [1, n] do
    while uj > ci do
        i += 1                     /* skip until the next CDF ring boundary */
    end
    S′ = S′ ∪ {⟨S[i]x, 1/n⟩}       /* insert sample from CDF ring */
    uj+1 = uj + 1/n
end
return S′
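The same wagon-wheel idea can be sketched in NumPy; vectorizing the inner while loop with np.searchsorted is an implementation convenience, not part of the algorithm as stated:

import numpy as np

def systematic_resample(states, weights, rng=np.random.default_rng()):
    """Stochastic universal sampling: n evenly-spaced 'spokes' over the weight CDF."""
    n = len(weights)
    cdf = np.cumsum(weights)                              # assumes weights sum to 1
    spokes = rng.uniform(0, 1 / n) + np.arange(n) / n     # one random offset, then 1/n apart
    idx = np.searchsorted(cdf, spokes)                    # which CDF bin each spoke lands in
    return states[idx], np.full(n, 1 / n)                 # resampled particles, uniform weights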


10.3.4 Real Tracking


We’ve been defining things very loosely in our mathematics. When regarding an object’s
state, we just called it X. But what is X? What representation of features describing
the object do we use to reliably track it? How do those features interact with our
measurements? Similarly, what are our measurements, Z? What do they measure, and does
measuring those things really make us more certain about X?
A lot of the content in this section comes from this paper (in addition to the lectures,
obviously), which was the paper that brought these probabilistic tracking models to computer
vision. Feel free to give it a read for historical context and deeper technical explanations.

Tracking Contours
Suppose we wanted to track a hand, which is a fairly complex object. The hand is moving
(2 degrees of freedom: x and y) and rotating (1 degree of freedom: θ), but it also can change
shape. Using principal component analysis, a topic covered soon in chapter 11, we can
encode its shape and get a total of 12 degrees of freedom in our state space; that requires a
looot of particles.

Figure 10.13: Tracking the movement of a hand using an edge detector.

What about measurement? Well, we’d expect there


to be a relationship between edges in the picture
and our state. Suppose X is a contour of the hand.
We can measure the edges and say Z is the sum
of the distances from the nearest high-contrast fea-
tures (i.e. edges) to the contour, as in Figure A.3a.
Specifically,
Pr [z | X] ∝ exp(−(d to edge)² / 2σ²)

. . . which looks an awful lot like a Gaussian; it's proportional to the distance to the
nearest strong edge. We can then use this Gaussian as our sensor model and track hands
reliably.

Figure 10.14: A contour and its normals. High-contrast features (i.e. edges) are sought
out along these normals.


Figure 10.15: Using edge detection and contours to track hand move-
ment.

Other Models

In general, you can use any model as long as you can compose all of the aforementioned
requirements for particle filters: we need an object state, a way to make predictions, and a
sensor model.

Figure 10.16: Using colored blobs to track a head and hands.

As another example, whose effectiveness is demonstrated visually in Figure 10.16, we could


track hands and head movement using color models and optical flow. Our state is the location
of a colored blob (just a simple (x, y)), our prediction is based upon the calculated optic flow,
and our sensor model describes how well the predicted models match the color.

A Very Simple Model

Let’s make this as simple as we possibly can.


Suppose our state is the location of an arbitrary image patch, (x, y). Then, suppose we don't
know anything about, well, anything, so we model the dynamics as literally just being random
noise. The sensor model, like we had with our barebones Dense Correspondence Search for
stereo, will just be the mean square error of pixel intensities.
Welcome to the model for the next Problem Set in CS6476.
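As a rough sketch (not the official problem-set code), that sensor model could be as small
as this; the σ value is an arbitrary tuning constant I've assumed.

import numpy as np

def mse_likelihood(template, candidate, sigma_mse=10.0):
    # Sensor model Pr[z | x]: how well the patch cut out at the particle's (x, y)
    # location matches the template, via mean squared error of pixel intensities.
    diff = template.astype(np.float64) - candidate.astype(np.float64)
    mse = np.mean(diff ** 2)
    return np.exp(-mse / (2.0 * sigma_mse ** 2))   # smaller error => larger weight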


10.3.5 Mean-Shift
The mean-shift algorithm tries to find the modes of a probability distribution; this dis-
tribution is often represented discretely by a number of samples as we’ve seen. Visually, the
algorithm looks something like Figure 10.17 below.

Figure 10.17: Performing mean-shift 2 times to find the area in the distribution with the
most density (an approximation of its mode).

This visual example hand-waves away a few things, such as what shape defines the region
of interest (here it’s a circle) and how big it should be, but it gets the point across. At
each step (from blue to red to finally cyan), we calculate the mean, or center of mass of
the region of interest. This results in a mean-shift vector from the region’s center to the
center of mass, which we follow to draw a new region of interest, repeating this process until
the mean-shift vector gets arbitrarily small.
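Here's a minimal sketch of that iteration over a set of 2-D samples, assuming a circular
window; the window radius and convergence threshold are arbitrary choices of mine.

import numpy as np

def mean_shift_mode(points, start, radius=2.0, tol=1e-3, max_iter=100):
    # Follow the mean-shift vector until it becomes arbitrarily small.
    center = np.asarray(start, dtype=np.float64)
    for _ in range(max_iter):
        inside = points[np.linalg.norm(points - center, axis=1) <= radius]
        if len(inside) == 0:
            break
        new_center = inside.mean(axis=0)               # center of mass of the window
        if np.linalg.norm(new_center - center) < tol:  # mean-shift vector ~ 0
            return new_center
        center = new_center
    return center

# Samples drawn around a mode at (2, 3); starting nearby, we climb toward it.
rng = np.random.default_rng(0)
samples = rng.normal(loc=[2.0, 3.0], scale=0.5, size=(500, 2))
print(mean_shift_mode(samples, start=[1.0, 2.0]))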
So how does this relate to tracking?
Well, our methodology is pretty similar to before. We start with a pre-defined model in the
first frame. As before, this can be expressed in a variety of ways, but it may be easiest to
imagine it as an image patch and a location. In the following frame, we search for a region
that most closely matches that model within some neighborhood based on some similarity
function. Then, the new maximum becomes the starting point for the next frame.
What truly makes this mean-shift tracking is the model and similarity functions that we use.
In mean-shift, we use a feature space which is the quantized color space. This means
we create a histogram of the RGB values based on some discretization of each channel (for
example, quantizing each channel into 4 bins results in a 4³ = 64-bin histogram). Our model is then this
histogram interpreted as a probability distribution function; this is the region we are going
to track.
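For concreteness, here's a sketch of such a quantized color histogram (4 bins per channel,
hence 64 bins in total); the function name and bin layout are my own.

import numpy as np

def color_histogram(patch, bins_per_channel=4):
    # Quantize an (H, W, 3) uint8 RGB patch into a normalized 64-bin histogram.
    q = (patch.astype(np.int64) * bins_per_channel) // 256          # per-channel bin, 0..3
    idx = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins_per_channel ** 3).astype(np.float64)
    return hist / hist.sum()        # normalize so it acts as a probability distribution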
Let’s work through the math. We start with a target model with some histogram centered
at 0. It’s represented by q and contains m bins; since we are interpreting it as a probability
distribution, it also needs to be normalized (sum to 1):
q = {qu}u∈[1..m] ,   Σu qu = 1

We also have some target candidate centered at the point y with its own color distribution,
and pu is now a function of y:

p(y) = {pu(y)}u∈[1..m] ,   Σu pu(y) = 1

We need a similarity function f (y) to compute the difference between these two distributions
now; maximizing this function will render the “best” candidate location: f (y) = f [q, p(y)].

Similarity Functions
There are a large variety of similarity functions such as min-value or Chi squared, but the
one used in mean-shift tracking is called the Bhattacharyya coefficient. First, we change
the distributions by taking their element-wise square roots:
q′ = (√q1, √q2, . . . , √qm)
p′(y) = (√p1(y), √p2(y), . . . , √pm(y))

Then, the Bhattacharyya relationship is defined as the sum of the products of these new
distributions:
f (y) = Σu p′u(y) q′u        (10.9)

Well isn’t the sum of element-wise products the definition of the vector dot product? We
can thus also express this as:

f (y) = p′(y) · q′ = ‖p′(y)‖ ‖q′‖ cos θy

But since by design these vectors are magnitude 1 (remember, we are treating them as
probability distributions), the Bhattacharyya coefficient essentially uses the cos θy between
these two vectors as a similarity comparison value.
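Computed from two normalized histograms (like the sketch above), the coefficient is a
one-liner:

import numpy as np

def bhattacharyya(p, q):
    # Equation 10.9: dot product of the element-wise square roots of two
    # normalized histograms; close to 1 for similar distributions.
    return float(np.sum(np.sqrt(p) * np.sqrt(q)))

print(bhattacharyya(np.array([0.2, 0.3, 0.5]), np.array([0.25, 0.25, 0.5])))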

Kernel Choices
Now that we have a suitable similarity function, we also need to define what region we’ll be
using to calculate it. Recall that in Figure 10.17 we used a circular region for simplicity.
This was a fixed region with a hard drop-off at the edge; mathematically, this was a uniform
kernel:

KU(x) = c if ‖x‖ ≤ 1, and 0 otherwise

Ideally, we’d use something with some better mathematical properties. Let’s use something
that’s differentiable, isotropic, and monotonically decreasing. Does that sound like anyone
we’ve gotten to know really well over the last 154 pages?


That’s right, it’s the Gaussian. Here, it’s expressed with a constant falloff, but we can, as
we know, also have a “scale factor” σ to control that:
 
K(x) = c · exp(−½ ‖x‖²)

The most important property of a Gaussian kernel over a uniform kernel is that it’s differ-
entiable. The spread of the Gaussian means that new points introduced to the kernel as we
slide it along the image have a very small weight that slowly increases; similarly, points in
the center of the kernel have a constant weight that slowly decreases. We would see the most
weight change along the slope of the bell curve.
We can leverage the Gaussian’s differentiability and use its gradient to see how the overall
similarity function changes as we move. With the gradient, we can actually optimally “hill
climb” the similarity function and find its local maximum rather than blindly searching the
neighborhood.
This is the big idea in mean-shift tracking: the similarity function helps us determine the
new frame’s center of mass, and the search space is reduced by following the kernel’s gradient
along the similarity function.

Disadvantages
Much like the Kalman Filter from before, the biggest downside of using mean-shift as an
exclusive tracking mechanism is that it operates on a single hypothesis of the “best next
point.”
A convenient way to get around this problem while still leveraging the power of mean-shift
tracking is to use it as the sensor model in a particle filter; we treat the mean-shift tracking
algorithm as a measurement likelihood (from before, Pr [z | X]).

10.3.6 Tracking Issues


These are universal problems in the world of tracking.
Initialization How do we determine our initial state (positions, templates, scales, etc.)?
In the examples we’ve been working with, the assumption has always been that it’s
determined manually: we were given a template, or a starting location, or a . . .
Another option is background subtraction: given a sufficiently-static scene, we can
isolate moving objects and use those to initialize state, continuing to track them once
the background is restored or we enter a new scene.
Finally, we could rely on a specialized detector that finds our interesting objects.
Once they are detected, we can use the detection parameters and the local image area
to initialize our state and continue tracking them. You might ask, “Well then why don’t
we just use the detector on every frame?” Well, as we’ve discussed, detectors rely on a
clear view of the object, and generalizing them too much could introduce compounding


errors as well as false positives. Deformations and particularly occlusions cause them
to struggle.
Sensors and Dynamics How do we determine our dynamics and sensor models?
If we learned the dynamics model from real data (a difficult task), we wouldn’t even
need the model in the first place! We’d know in advance how things move. We could
instead learn it from “clean data,” which would be more empirical and an approximation
of the real data. We could also use domain knowledge to specify a dynamics model.
For example, a security camera pointed at a street can reasonably expect pedestrians
to move up and down the sidewalks.
The sensor model is much more finicky. We do need some sense of absolute truth to
rely on (even if it’s noisy). This could be the reliability of a sonar sensor for distance,
a preconfigured camera distance and depth, or other reliable truths.
Prediction vs. Correction Remember the fundamental trade-off in our Kalman Filter:
we needed to decide on the relative level of noise in the measurement (correction)
vs. the noise in the process (prediction). If one is too strong, we will ignore the other.
Getting this balance right unfortunately just requires a bit of magic and guesswork
based on any existing data.
Data Association We often aren’t tracking just one thing in a scene, and it’s often not
a simple scene. How do we know, then, which measurements are associated with
which objects? And how do we know which measurements are the result of visual
clutter? The camouflage techniques we see in nature (and warfare) are designed to
intentionally introduce this kind of clutter so that even our vision systems have trouble
with detection and tracking. Thus, we need to reliably associate relevant data with
the state.
The simple strategy is to only pay attention to measurements that are closest to the
prediction. Recall when tracking hand contours (see Figure A.3a) we relied on the
“nearest high-contrast features,” as if we knew those were truly the ones we were
looking for.
There is a more sophisticated approach, though, which relies on keeping multiple hy-
potheses. We can even use particle filtering for this: each particle becomes a hypothe-
sis about the state. Over time, it becomes clear which particle corresponds to clutter,
which correspond to our interesting object of choice, and we can even determine when
new objects have emerged and begin to track those independently.
Drift As errors in each component accumulate and compound over time, we run the risk of
drift in our tracking.
One method to alleviate this problem is to update our models over time. For example,
we could introduce an α factor that incorporates a blending of our “best match” over
time with a simple linear interpolation:
Model(t) = αBest(t) + (1 − α)Model(t − 1) (10.10)
There are still risks with this adaptive tracking method: if we blend too much noise
into our sensor model, we'll eventually be tracking something completely unrelated to
the original template. A minimal sketch of this blended update follows below.
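A minimal sketch of the update in Equation 10.10, assuming the model and the best match
are image patches of the same size; α is the blending factor discussed above.

import numpy as np

def update_model(model, best_match, alpha=0.1):
    # Equation 10.10: blend the current best match into the tracked template.
    # alpha = 0 never adapts (stale model); alpha = 1 adapts instantly (risks drift).
    return alpha * best_match.astype(np.float64) + (1.0 - alpha) * model.astype(np.float64)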

10.3.7 Conclusion
That ends our discussion of tracking. The notion we introduced of tracking state over time
comes up in computer vision a lot. This isn’t image processing: things change often!
We introduced probabilistic models to solve this problem. Kalman Filters and Mean-Shift
were methods that rendered a single hypothesis for the next best state, while Particle Filters
maintained multiple hypotheses and converged on a state over time.

Recognition

Recognizing isn’t at all like seeing; the two often don’t even agree.
— Sten Nadolny, The Discovery of Slowness

ntil this chapter, we’ve been working in a “semantic-free” vision environment. Though
U we have detected “interesting” features, developed motion models, and even tracked
objects, we have had no understanding of what those objects actually are. Yes, we
had Template Matching early on in chapter 2, but that was an inflexible, hard-coded, and
purely mathematical approach of determining the existence of a template in an image. In
this chapter, we’ll instead be focusing on a more human-like view of recognition. Our aim will
be to dissect a scene and label (in English) various objects within it or describe it somehow.
There are a few main forms of recognition: verification of the class of an object (“Is this a
lamp?”), detection of a class of objects (“Are there any lamps?”), and identification of a
specific instance of a class (“Is that the Eiffel Tower?”). More generally, we can also consider
object categorization and label specific general areas in a scene (“There are trees here,
and some buildings here.”) without necessarily identifying individual instances. Even more
generally, we may want to describe an image as a whole (“This is outdoors.”)
We’ll primarily be focusing on generic object categorization (“Find the cars,” rather than
“Find Aaron’s car”). This task can be presented as such:
Given a (small) number of training images as examples of a category, recognize
“a-priori” (previously) unknown instances of that category and assign the correct
category label.
This falls under a large class of problems that machine learning attempts to solve, and many
of the methods we will discuss in this chapter are applications of general machine learning
algorithms. The “small” aspect of this task has not gotten much focus in the machine learning
community at large; the massive datasets used to train recognition models present a stark
contrast to the handful of examples a human may need to reliably identify and recognize an
object class.


Figure 11.1: Hot dogs or legs? You might be able to differentiate them
instantly, but how can we teach a computer to do that? (Image Source)

Categorization
Immediately, we have a problem. Any object can be classified in dozens of ways, often
following some hierarchy. Which category should we take? Should a hot dog be labeled as
food, as a sandwich, as protein, as an inanimate object, or even just as something red-ish?
This is a manifestation of prototype theory, tackled by psychologists and cognitive scientists.
There are many writings on the topic, the most impactful of which was Rosch and Lloyd’s
Cognition and categorization, namely the Principles of Categorization chapter, which drew
a few conclusions from the standpoint of human cognition. The basic level category for an
object can be based on. . .
• the highest level at which category members have a similar perceived shape. For
example, a German shepherd would likely be labeled that, rather than as a dog or a
mammal, which have more shape variation.
• the highest level at which a single mental image can be formed that reflects the entire
category.
• the level at which humans identify category members fastest.
• the first level named and understood by children.
• the highest level at which a person uses similar motor actions for interaction with
category members. For example, the set of actions done with a dog (petting, brushing)
is very different than the set of actions for all animals.

Categorization Reaction Time


Experiments have been done that give us insight to the categorization layout in the
human brain.
Given a series of images, participants were asked to answer a question about the
image: “Is it an animal?” and “Is it a dog?” Measuring the subsequent response


reaction time has robustly shown us that humans respond faster to whether or not
something is a dog than whether or not something is an animal.
This might make sense intuitively; the “search space” of animals is far greater than
that of dogs, it should take longer. Having scientific evidence to back that intuition
up is invaluable, though, and showing that such categories must somehow exist—
and even have measurable differences—is a significant insight into human cognition.

Even if we use some of these basic level categories, what scope are we dealing with? Psy-
chologists have given a range of 10,000 to 30,000 categories for human cognition. This gives
us an idea of scale when dealing with label quantities for recognition.

Recognition is important in computer vision because it allows us to create relationships


between specific objects. Things are no longer just an "interesting feature," but something far
more descriptive that can then influence subsequent logic and behavior in a larger system.

Challenges
Why is this a hard problem? There are many multivariate factors that influence how a
particular object is perceived, but we often don’t have trouble working around these factors
and still labeling it successfully.
Factors like illumination, pose, occlusions, and visual clutter all affect the way an object
appears. Furthermore, objects don’t exist in isolation. They are part of a larger scene full of
other objects that may conflict, occlude, or otherwise interact with one another. Finally, the
computational complexity (millions of pixels, thousands of categories, and dozens of degrees
of freedom) of recognition is daunting.

State-of-the-Art
Things that seemed impossible years ago are commonplace now: character and handwriting
recognition on checks, envelopes, and license plates; fingerprint scans; face detection; and
even recognition of flat, textured objects (via the SIFT Detector) are all relatively solved
problems.
The current cutting edge is far more advanced. As demonstrated in Figure 11.2, individual
components of an image are being identified and labeled independently. It is unlikely that
thousands of photos of dogs wearing Mexican hats exist and can be used for training, so
dissecting this composition is an impressive result.
As we’ve mentioned, these modern techniques of strong label recognition are really machine
learning algorithms applied to patterns of pixel intensities. Instead of diving into these
directly (which is a topic more suited to courses on machine learning and deep learning),
we’ll discuss the general principles of what are called generative vs. discriminative methods

160
Notes on Computer Vision George Kudrayvtsev

Figure 11.2: An example of GoogleNet labeling different parts of an


image, differentiating between a relatively unique pair of objects without
any additional context.

of recognition and the image representations they use.

Supervised Classification Methods


Given a collection of labeled examples, we want to come up with a function that will predict
the labels of new examples. The existence of this set of labeled examples (a “training set”)
is what makes this supervised learning.
The function we come up with is called a classifier. Whether or not the classifier can be
considered “good” depends on a few things, including the mistakes that it makes and the
cost associated with these mistakes. Since we know what the desired outputs are from our
training set, our goal is to minimize the expected misclassification. To achieve this, there
are two primary strategies:
• generative: use the training data to build a representative probability model for
various object classes. We model conditional densities and priors separately for each
class, meaning that the “recognized” class of an object is just the class that fits it best.
• discriminative: directly construct a good decision boundary between a class and
everything not in that class, modeling the posterior.

Word to the Wise


There is a lot of probability in this chapter. I try to work through these concepts
slowly and intuitively both for my own understanding and for others who have a weak
(or non-existent) foundation in statistics. If you have a knack for probability or have
been exposed to these ideas before, such explanations may become tedious; for that, I
apologize in advance.


11.1 Generative Supervised Classification


Before we get started, here is some trivial notation: we’ll say that 4 → 9 means that the
object is a 4, but we called it a 9. With that, we assume that the cost of X → X is 0.
Consider, then, a binary decision problem of classifying an image as being of a 4 or a 9.
L(4 → 9) is the loss of classifying a 4 as a 9, and similarly L(9 → 4) is the loss of classifying
a 9 as a 4. We define the risk of a classifier strategy S as being the expected loss. For our
binary decision, then:

R(S) = Pr [4 → 9 | using S] · L (4 → 9) + Pr [9 → 4 | using S] · L (9 → 4)

Figure 11.3: An arrangement of 4s to 9s.

We can say that there is some feature vector x that describes or measures the character.
At the best possible decision boundary (i.e. right in the middle of Figure 11.3), either label
choice will yield the same expected loss.
If we picked the class “four” at that boundary, the expected loss is the probability that it’s
actually a 9 times the cost of calling it a 9 plus the probability that it’s actually a 4 times
the cost of calling it a 4 (which we assumed earlier to be zero: L (4 → 4) = 0). In other
words:

= Pr [class is 9 | x] · L (9 → 4) + Pr [class is 4 | x] · L (4 → 4)
= Pr [class is 9 | x] · L (9 → 4)

Similarly, the expected loss of picking “nine” at the boundary is based on the probability
that it was actually a 4:
= Pr [class is 4 | x] · L (4 → 9)

Thus, the best decision boundary is the x such that:

Pr [class is 9 | x] · L (9 → 4) = Pr [class is 4 | x] · L (4 → 9)

With this in mind, classifying a new point becomes easy. Given its feature vector k, we
choose the class with the lowest expected loss. We would choose “four” if:

Pr [4 | k] · L (4 → 9) > Pr [9 | k] · L (9 → 4)

How can we apply this in practice? Well, our training set encodes the loss; perhaps it’s
different for certain incorrect classifications, but it could also be uniform. The difficulty


comes in determining our conditional probability, Pr [4 | k]. We need to somehow know or


learn that. As with all machine learning, we trust the data to teach us this.
Suppose we have a training dataset in which each pixel is labeled as being a skin pixel and a
non-skin pixel, and we arrange that data into the pair of histograms in Figure 11.4. A new
pixel p comes along and it falls into a bin in the histograms that overlap (somewhere in the
middle). Suppose p = 6, and its corresponding probabilities are:

Pr [x = 6 | skin] = 0.5
Pr [x = 6 | ¬skin] = 0.1

Figure 11.4: Histograms from a training set determining the likelihood of a particular hue
value, x, occurring given that a pixel is (left) or is not (right) skin; the vertical axes are
Pr [x | skin] and Pr [x | ¬skin], and the horizontal axes are the hue value, x.

Is this enough information to determine whether or not p is a skin pixel? In general, can we
say that p is a skin pixel if Pr [x | skin] > Pr [x | ¬skin]?
No!
Remember, this is the likelihood that the hue is some value already given whether or not
it’s skin. To extend the intuition behind this, suppose the likelihoods were equal. Does that
suggest we can’t tell if it’s a skin pixel? Again, absolutely not. Why? Because any given
pixel is much more likely to not be skin in the first place! Thus, if their hue likelihoods are
equal, it’s most likely not skin because “not-skin pixels” are way more common.
What we really want is the probability of a pixel being skin given a hue: Pr [skin | x], which
isn’t described in our model. As we did in Particle Filters, we can apply Bayes’ Rule, which
expresses exactly what we just described: the probability of a pixel being skin given a hue is
proportional to the probability of that hue representing skin AND the probability of a pixel
being skin in the first place (see Equation 10.6).

Pr [skin | x]  ∝  Pr [x | skin]  ·  Pr [skin]
 (posterior)       (likelihood)     (prior)

Thus, the comparison we really want to make is:


Pr [x | skin] · Pr [skin]  >  Pr [x | ¬skin] · Pr [¬skin]  ?


Unfortunately, we don’t know the prior, Pr [skin], . . . but we can assume it’s some constant,
Ω. Given enough training data marked as being skin,1 we can measure that prior Ω (hopefully
at the same time as when we measured the histogram likelihoods from before in Figure 11.4).
Then, the binary decision can be made based on the measured Pr [skin | x].
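As a concrete sketch, suppose the two likelihood histograms of Figure 11.4 and the prior
have already been measured from labeled data; the numbers below are made-up placeholders,
not real training statistics.

import numpy as np

# Made-up training statistics over 8 hue bins: Pr[x | skin], Pr[x | not skin], Pr[skin].
pr_x_given_skin    = np.array([0.01, 0.02, 0.10, 0.25, 0.30, 0.20, 0.10, 0.02])
pr_x_given_notskin = np.array([0.20, 0.20, 0.15, 0.10, 0.10, 0.10, 0.10, 0.05])
pr_skin = 0.05                     # skin pixels are rare in the first place

def is_skin(hue_bin):
    # Compare Pr[x|skin]·Pr[skin] against Pr[x|¬skin]·Pr[¬skin] (Bayes' rule,
    # up to the common normalizing constant Pr[x]).
    return (pr_x_given_skin[hue_bin] * pr_skin
            > pr_x_given_notskin[hue_bin] * (1.0 - pr_skin))

# Even though Pr[x|skin] > Pr[x|¬skin] for bin 4, the small prior flips the decision.
print(is_skin(4))    # False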

Generalizing the Generative Model We’ve been working with a binary decision. How
do we generalize this to an arbitrary number of classes? For a given measured feature vector
x and a set of classes ci , choose the best c∗ by maximizing the posterior probability:
c∗ = arg maxc Pr [c | x] = arg maxc Pr [c] · Pr [x | c]

Continuous Generative Models If our feature vector x is continuous, we can’t rely on


a discrete histogram. Instead, we can return to our old friend the Gaussian (or a mixture of
Gaussians) to create a smooth, continuous approximation of the likelihood density model of
Pr [x | c].

Figure 11.5: A mixture of Gaussians approximating the histogram for Pr [x | skin] from
Figure 11.4.

Using Gaussians to create a parameterized version of the training data allows us to express
the probability density much more compactly.

11.2 Principal Component Analysis


One thing that may have slipped by in our discussion of generative models is that they work
best in low-dimensional spaces. Since we are building probabilistic descriptions (previously,
we used histograms to build density approximations) from data, building these robustly for
high-dimensional classification requires a lot of data. A method for reducing the necessary
dimensionality is called principal component analysis. Despite being used elsewhere,
this is a very significant idea in the computer vision space.
A component is essentially a direction in a feature space. A principal component, then, is
the direction along which points in that feature space have the greatest variance.
The first principal component is the direction of maximum variance. Subsequent principal
components are orthogonal to the previous principal components and describes the “next”
direction of maximum residual variance.
1
from graduate students or mechanical Turks, for instance. . .


Ackshually. . .
Principal components point in the direction of maximum variance from the mean of
the points. Mathematically, though, we express this as being a direction from the
origin. This means that in the nitty-gritty, we translate the points so that their mean
is at the origin, do our PCA math, then translate everything back. Conceptually (and
graphically), though, we will express the principal component relative to the mean.

We’ll define this more rigorously soon, but let’s explain the “direction of maximum variance”
a little more more intuitively. We know that the variance describes how “spread out” the
data is; more specifically, it measures the average of the squared differences from the mean.
In 2 dimensions, the difference from the mean (the target in Figure 11.6) for a point is
expressed by its distance. The direction of maximum variance, then, is the line that most
accurately describes the direction of spread. The second line, then, describes the direction
of the remaining spread relative to the first line.

Figure 11.6: The first principal component and its subsequent orthogonal principal
component.

Okay, that may not make a lot of sense yet, but perhaps diving a little deeper into the math
will explain further. You may have noticed that the first principal component in Figure 11.6
resembles the line of best fit. We’ve seen this sort of scenario before. . . Remember when we
discussed Error Functions, and how we came to the conclusion that minimizing perpendicular
distance was more stable than measuring vertical distance when trying to determine whether
or not a descriptor is a true match to a feature?
We leveraged the expression of a line as a normal vector and a distance from the origin (in
case you forgot how we expressed that, it’s replicated again in Figure 11.7). Then, our error
function was:
E(a, b, d) = Σi (axi + byi − d)²

We did some derivations on this perpendicular least-squares fitting—which were omitted for


 
n̂ = (a, b)ᵀ

Figure 11.7: Representing a line, ax + by = d, as a normal vector and distance from the
origin.

brevity—and it gave us an expression in matrix form, ‖Bn̂‖², where

    B = [ x1 − x̄   y1 − ȳ ]
        [ x2 − x̄   y2 − ȳ ]
        [    ⋮         ⋮   ]
        [ xn − x̄   yn − ȳ ]

We wanted to minimize ‖Bn̂‖² subject to ‖n̂‖ = 1, since n̂ is expressed as a unit vector.


What we didn’t mention was that geometrically (or physically?) this was the axis of least
inertia. We could imagine spinning the xy-plane around that axis, and it would be the
axis that resulted in the least movement of all of those points. Perhaps you’ve heard this
referenced as the moment of inertia? I sense eigenvectors in our near future. . .
How about another algebraic interpretation? In Appendix A, we used the projection to
explain how the standard least squares method for solving a linear system aims to minimize
the distance from the projection to the solution vector. Again, that visual is reiterated in
Figure 11.8.
P

x̂, some unit vector


through the origin
origin
x̂T P

Figure 11.8: The projection of the point P onto the vector x̂.

Through a basic understanding of the Pythagorean theorem, we can see that minimizing the
distance from P to the line described by x̂ is the same thing as maximizing the projection
x̂ᵀP (where x̂ is expressed as a column vector). If we extend this to all of our data points, we
want to maximize the sum of the squares of the projections of those points onto the line
described by the principal component. We can express this in matrix form through some


clever manipulation. If B is a matrix of the points, then:


xᵀBᵀBx = [a  b] · [ x1 x2 ⋯ xn ] · [ x1 y1 ] · [ a ]  =  Σi (xᵀPi)²
                  [ y1 y2 ⋯ yn ]   [ x2 y2 ]   [ b ]
                                   [  ⋮  ⋮  ]
                                   [ xn yn ]

More succinctly, our goal is to maximize xT BT Bx subject to xT x = 1. We say M = BT B.


Then, this becomes a constrained optimization problem to which we can apply the Lagrange
multiplier technique:2
E = xᵀMx
E′ = xᵀMx + λ(1 − xᵀx)

Notice that we expect, under the correct solution, for 1 − xᵀx = 0. We take the partial
derivative of our new error function and set it equal to 0:

∂E′/∂x = 2Mx + 2λx
     0 = 2Mx + 2λx
    Mx = λx            (λ is any constant, so it absorbs the negative)

That’s a very special x. . . it’s the definition of an eigenvector!3 x is an eigenvector of BT B,


our matrix of points.
How about yet another (last one, we promise) algebraic interpretation? Our aforementioned
matrix product, BT B, if we were to expand it and express it with the mean at the origin,
would look like:

    BᵀB = [ Σi xi²     Σi xi yi ]
          [ Σi xi yi   Σi yi²   ]
. . . that’s the covariance matrix of our set of points, describing the variance in our two
dimensions. What we’re looking for, then, are the eigenvectors of the covariance matrix of
the data points.
Generally, for an n × n matrix, there are n distinct eigenvectors (except in degenerate cases
with duplicate eigenvalues, but we’ll ignore those for now). Thus, we can describe principal
components as being the eigenvectors of the points.
2
I discuss this method in further detail (far beyond the scope to which it’s described in lecture) in the
Lagrange Multipliers section of Appendix A.
3
Briefly recall that an eigenvector of a matrix is a vector that, when multiplied by said matrix, is just a
scalar multiple of itself.


11.2.1 Dimensionality Reduction


With our relationship between principal components and eigenvectors out of way, how can
we use this to solve the low-dimensionality limitations of generative models we described
earlier?
The idea is that we can collapse a set of points to their largest eigenvector (i.e. their primary
principal component). For example, the set of points from Figure 11.6 will be collapsed to
the blue points in Figure 11.9; we ignore the second principal component and only describe
where on the line the points are, instead.

Figure 11.9: Collapsing a set of points to their principal component, Êλ1. The points can
now be represented by a coefficient—a scalar of the principal component unit vector.

Collapsing our example set of points from two dimensions to one doesn’t seem like that big of
a deal, but this idea can be extended to however many dimensions we want. Unless the data
is uniformly random, there will be directions of maximum variance; collapsing things along
them (for however many principal components we feel are necessary to accurately describe
the data) still results in massive dimensionality savings.
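Here's a minimal NumPy sketch of this collapse, assuming an (N, d) array of points: compute
the covariance eigenvectors, keep the top k, and represent each point by its coefficients
along them. (np.linalg.eigh is used since the covariance matrix is symmetric.)

import numpy as np

def pca_project(points, k=1):
    # Collapse (N, d) points onto their top-k principal components.
    mean = points.mean(axis=0)
    centered = points - mean                        # move the mean to the origin
    cov = centered.T @ centered / len(points)       # covariance matrix of the data
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues, column vectors
    components = eigvecs[:, ::-1][:, :k].T          # top-k directions, largest variance first
    coeffs = centered @ components.T                # coefficient(s) for each point
    return mean, components, coeffs

# Points scattered roughly along a line: one coefficient per point captures most of the spread.
rng = np.random.default_rng(1)
t = rng.uniform(0, 30, size=100)
points = np.stack([t, 0.8 * t + rng.normal(0, 1, size=100)], axis=1)
mean, components, coeffs = pca_project(points, k=1)
print(components, coeffs[:3])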
Suppose each data point is now n-dimensional; that is, x is an n-element column vector.
Then, to acquire the maximum direction of projection, v̂, we follow the same pattern as
before, taking the sum of the magnitudes of the projections:
X
var(v̂) = kv̂T (x − x)k
x

We can isolate our terms


P by leveraging the covariance matrix from before; given A as the
4
outer product: A = x (x − x)(x − x) , we can say
T

var(v) = vT Av

4
Remember, x is a column vector. Multiplying a column vector n by its transpose: nnT is the definition of
the outer product and results in a matrix. This is in contrast to multiplying a row vector by its transpose,
which is the definition of the dot product and results in a scalar.

168
Notes on Computer Vision George Kudrayvtsev

As before, the eigenvector with the largest eigenvalue λ captures the most variation among
the training vectors x.
With this background in mind, we can now explore applying principal component analysis
to images.

Example 11.1: Understanding Check

What eigenvectors make up the principal components in the following dataset?


Remember, we’re looking for the directions of maximum variance.

Is it the ones on the right or on the left?

The eigenvectors found by PCA would be the blue vectors; the red ones aren't
even perpendicular! Intuitively, we can imagine the points above the blue line
“cancelling out” the ones below it, effectively making their average in-between them.
Principal component analysis works best when training data comes from a single
class. There are other ways to identify classes that are not orthogonal such as ICA,
or independent component analysis.

11.2.2 Face Space

Let’s treat an image as a one-dimensional vector. A 100×100 pixel image has 10,000 elements;
it’s a 10,000-dimensional space! But how many of those vectors correspond to valid face


images? Probably much less. . . we want to effectively model the subspace of face images
from the general vector space of 100×100 images.
Specifically, we want to construct a low-dimensional (PCA anyone?) linear (so we can do
dot products and such) subspace that best explains the variation in the set of face images;
we can call this the face space.

Figure 11.10: Cyan points indicate faces, while orange points are “non-
faces.” We want to capture the black principal component.

We will apply principal component analysis to determine the principal component(s) in


our 10,000-dimension feature space (an incredibly oversimplified view of which is shown in
Figure 11.10).
We are given M data points, {x1 , . . . , xM } ∈ Rd , where each data point is a column vector
and d is very large. We want some directions in Rd that capture most of the variation of the
xi . If u is one such direction, the coefficients representing locations along that directions
would be (where µ is the mean of the data points):

u(xi) = uᵀ(xi − µ)

So what is the direction vector û that captures most of the variance? We'll use the same
expressions as before, maximizing the variance of the projected data:

var(u) = (1/M) Σi uᵀ(xi − µ) (uᵀ(xi − µ))ᵀ          (the projection of xi)
       = uᵀ [ (1/M) Σi (xi − µ)(xi − µ)ᵀ ] u        (u doesn't depend on i)
       = ûᵀΣû                                       (the bracketed term is the covariance matrix
                                                     of the data; don't forget û is a unit vector)

Thus, the variance of the projected data is var(û) = ûT Σû, and the direction of maximum
variance is then determined by the eigenvector of Σ with the largest eigenvalue. Naturally,
then, the top k orthogonal directions that capture the residual variance (i.e. the principal
components) correspond to the top k eigenvalues.


Let Φi be a known face image I with the mean image subtracted. Remember, this is a
d-length column vector for a massive d.
Define C to be the average squared magnitude of the face images:
C = (1/M) Σi∈M Φi Φiᵀ = AAᵀ

where A is the matrix of our M faces as column vectors, A = [Φ1 Φ2 . . . ΦM]; this is of
dimension d × M. That means C is a massive d × d matrix.


Consider, instead, AᵀA. . . that's only an M × M matrix, for which finding the eigenvalues
is much easier and computationally feasible. Suppose vi is one of these eigenvectors:

AᵀAvi = λvi

Now pre-multiply that boi by A and notice the wonderful relationship:

AAᵀAvi = λAvi

We see that Avi are the eigenvectors of C = AAᵀ, which we previously couldn't feasibly
compute directly! This is the dimensionality trick that makes PCA possible: we created
the eigenvectors of C without needing to compute them directly.
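A sketch of this trick in NumPy; the array shapes and function name are my own
assumptions. We diagonalize the small M × M matrix and map its eigenvectors back up
through A.

import numpy as np

def eigenfaces(A, k):
    # A is (d, M): each column is a mean-subtracted face flattened to d pixels.
    # Returns the top-k eigenvectors of C = A·Aᵀ without forming the d x d matrix.
    small = A.T @ A                               # only M x M
    eigvals, v = np.linalg.eigh(small)            # eigenpairs of AᵀA (ascending order)
    order = np.argsort(eigvals)[::-1][:k]         # indices of the top-k eigenvalues
    U = A @ v[:, order]                           # A·vi are eigenvectors of A·Aᵀ
    return U / np.linalg.norm(U, axis=0)          # normalize each eigenface column

# 20 random "faces" of 64x64 = 4096 pixels; the 4096 x 4096 matrix is never built.
rng = np.random.default_rng(0)
faces = rng.random((4096, 20))
A = faces - faces.mean(axis=1, keepdims=True)     # subtract the mean face
print(eigenfaces(A, k=5).shape)                   # (4096, 5)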

How Many Eigenvectors Are There? Intuition suggests M , but recall that we always
subtract out the mean which centers our component at the origin. This means (heh) that
if we have a point along it on one side, we know it has a buddy on the other side by just
flipping the coordinates. This removes a degree of freedom, so there are M − 1 eigenvectors
in general.
Yeah, I know this is super hand-wavey.
Just go with it. . .

11.2.3 Eigenfaces
The idea of eigenfaces and a face space was pioneered in this 1991 paper. The assumption
is that most faces lie in a low-dimensional subspace of the general image space, determined
by some k ≪ d directions of maximum variance.
Using PCA, we can determine the eigenvectors that track the most variance in the “face
space:” u1 , u2 , . . . , uk . Then any face image can be represented (well, approximated) by a
linear combination of these “eigenfaces;” the coefficients of the linear combination can be
determined easily by a dot product.
We can see in Figure 11.12, as we’d expect, that the “average face” from the training set has
an ambiguous gender, and that eigenfaces roughly represent variation in faces. Notice that


Figure 11.11: Data from Turk & Pentland's paper, "Eigenfaces for Recognition." (a) A subset
of the training images—isolated faces with variations in lighting, expression, and angle.
(b) The resulting top 64 principal components.

the detail in each eigenface increases as we go down the list of variance. We can even notice
the kind of variation some of the eigenfaces account for. For example, the 2nd through 4th
eigenfaces are faces lit from different angles (right, top, and bottom).
Figure 11.12: The mean face from the training data.

If we take the average face and add one of the eigenfaces to it, we can see the impact the
eigenface has in Figure 11.13. Adding the second component causes the resulting face to be
lit from the right, whereas subtracting it causes it to be lit from the left.

We can easily convert a face image to a series of eigenvector coefficients by taking its dot
product with each eigenface after

x → [u1 · (x − µ), u2 · (x − µ), . . . , uk · (x − µ)]        (11.1)
  = [w1, w2, . . . , wk]                                      (11.2)

This vector of weights is now the entire representation of the face. We’ve reduced the face
from some n × n image to a k-length vector. To reconstruct the face, then, we say that the
reconstructed face is the mean face plus the linear combination of the eigenfaces and the
weights:
x̂ = µ + w1 u1 + w2 u2 + . . .

Naturally, the more eigenfaces we keep, the closer the reconstruction x̂ is to the original
face x. We can actually leverage this as an error function: if x − x̂ is low, the thing we are


detecting is probably actually a face. Otherwise, we may have treated some random patch
as a face. This may come in handy for detection and tracking soon. . .
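A sketch of that projection, reconstruction, and "is this even a face?" check; U and µ are
assumed to come from an eigenface routine like the one above, and the error threshold is
arbitrary.

import numpy as np

def project_to_face_space(x, U, mu):
    # Equation 11.1: the weight vector [w1, ..., wk] is the dot product of the
    # mean-subtracted image with each eigenface (the columns of U).
    return U.T @ (x - mu)

def reconstruct(weights, U, mu):
    # The mean face plus a linear combination of the eigenfaces.
    return mu + U @ weights

def looks_like_a_face(x, U, mu, threshold=1e3):
    # Low reconstruction error suggests x really lies near the face subspace.
    w = project_to_face_space(x, U, mu)
    return np.linalg.norm(x - reconstruct(w, U, mu)) < threshold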

Figure 11.13: Adding or subtracting a principal component from the mean face. σk is the
standard deviation of the coefficient; we won't worry about it for now.

But we aren’t trying to do reconstruction, we’re trying to do recognition. Given a novel


image, z, we project it onto the subspace and determine its weights, as before:
[w1 , w2 , . . . , wk ] = [u1 · (z − µ), u2 · (z − µ), . . . , uk · (z − µ)]

Optionally, we can use our above error function to determine whether or not the novel
image is a face. Then, we classify the face as whatever our closest training face was in our
k-dimensional subspace.

11.2.4 Limitations
Principal component analysis is obviously not a perfect solution; Figure 11.14 visualizes some
of its problems.

Figure 11.14: Some common limitations of PCA. (a) Traditional PCA has significant
limitations if the face image is not aligned with the eigenfaces or has background variation.
(b) PCA falls apart when the principal component itself is the dividing line between two
classes, rather than the direction of variation of a single class.

With regards to faces, the use of the dot product means our projection requires precise align-
ment between our input images and our eigenfaces. If the eyes are off-center, for example,


the element-wise comparison would result in a poor mapping and an even poorer reconstruc-
tion. Furthermore, if the training data or novel image is not tightly cropped, background
variation impacts the weight vector.
In general, though, the direction of maximum variance is not always good for classification;
in Figure 11.14b, we see that the red points and the blue points lie along the same principal
component. If we performed PCA first, we would collapse both sets of points into our single
component, effectively treating them as the same class.
Finally, non-linear divisions between classes naturally can’t be captured by this basic model.

11.3 Incremental Visual Learning


We want to extend our facial recognition method to a generalized tracking method. Though
this method probably belongs in chapter 10, we didn’t have the background understanding
of principal component analysis there to be able to discuss it. Thus, we introduce it now and
will revisit a lot of the principles behind particle filtering (quickly jump to Particle Filters if
you need a refresh on those ideas).
What we’ll be doing is called appearance-based tracking. We’ll learn what a target “looks
like” through some robust description, and then track the target by finding the image area
that results in the most accurate reconstruction. Naturally, though, the appearance of our
target (and its surrounding environment) will change from frame to frame, so we need to be
robust to a variety of deformations.
Some of our inspiration comes from eigentracking, described in this paper, as well as from
incremental learning techniques described in this paper. The big idea behind eigentracking
was to decouple an object’s geometry from its appearance. We saw that one of the limitations
of eigenfaces was that they weren’t robust to imperfect alignments; we can get around this
by adding deformations to the subspace model so that eigenvectors also describe rotations,
scale changes, and other transformations.
In general, our goal is to have a tracker that. . .
• . . . is not based on a single image. This is the big leagues, not Template Matching.
• . . . constantly updates the model, learning the representation while tracking. We
touched on this when we introduced α to account for drift in particle filters (see Equa-
tion 10.10), but we need something far more robust now.
• . . . runs quickly, so we need to be careful of our complexity.
• . . . works when the camera itself moves as well.
As before, we’ll need a description of our target in advance. Tracking should be robust to
pose variations, scale, occlusions, and other deformations such as changes in illumination
and viewing angles. We can be robust to these changes by incrementally modifying our
description. We don’t want to modify it too much, though; that would eventually cause our


tracking to drift from the real target.


To achieve this, we’ll be using particle filtering. Our particles will represent a distribution
of deformations, capturing how the geometry of our target changes from frame to frame.
We’ll also use the subspace-based reconstruction error to learn how to track our target: our
eigenfaces can only accurately reconstruct faces. To adapt to changes, we’ll introduce incre-
mental updates to our subspace; this will dynamically handle deformations by introducing
them into the variation described by our subspace automatically.
We say that the location at time t is Lt (though this may include more than just the (x, y)
position), and the current observation is Ft (for Faces). Our goal, then, is to predict the
target location Lt based on Lt−1 . This should look very familiar:

Pr [Lt | Ft , Lt−1 ] ∝ Pr [Ft | Lt ] · Pr [Lt | Lt−1 ]

Our dynamics model, Pr [Lt | Lt−1 ], uses a simple motion model with no velocity.5 Our ob-
servation model, Pr [Ft | Lt ], is an approximation on an eigenbasis.6 If we can reconstruct
a patch well from our basis set, that renders a high probability.

11.3.1 Forming Our Model


Let’s define the parameters of our particle filtering model a little more thoroughly.
We have a handful of choices when it comes to choosing our state, Lt . We could track a
similarity transform: position (xt , yt ), rotation θt , and scaling st , or an affine transformation
with 6 parameters to additionally allow shearing. For simplicity and brevity, we’ll assume
the similarity model in this section.

Dynamics Model
The dynamics model is actually quite simple: each parameter’s probability is independently
distributed along its previous value perturbed by a Gaussian noise function. Mathematically,
this means:7

Pr [L1 | L0] = N(x1; x0, σx²) · N(y1; y0, σy²) · N(θ1; θ0, σθ²) · N(s1; s0, σs²)

Given some state Lt , we are saying that the next state could vary in any of these parameters,
but the likelihood of some variation is inversely proportional to the amount of variation. In
other words, it’s way more likely to move a little than a lot. This introduces a massive amount
of possibilities for our state (remember, particle filters don’t scale well with dimension).

5
The lectures refer to this as a “Brownian motion model,” which can be described as each particle simply
vibrating around. Check out the Wikipedia page or this more formal introduction for more if you’re really
that interested in what that means; it won’t come up again.
6
An eigenbasis is a matrix made up of eigenvectors that form a basis. Our “face space” from before was an
eigenbasis; this isn’t a new concept, just a new term.
7
The notation N (a; b, c) states that the distribution for a is a Gaussian around b with the variance c.


Figure 11.15: A handful of the possible next-states Lt+1 (in red) from the current state Lt
(in cyan). The farther Lt+1 is from Lt, the less likely it is.
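Here's a sketch of drawing candidate next-states from this dynamics model; the
per-parameter standard deviations are arbitrary tuning values I've assumed.

import numpy as np

def sample_next_states(state, n_particles, sigmas=(4.0, 4.0, 0.05, 0.02),
                       rng=np.random.default_rng()):
    # Perturb the similarity-transform state L = (x, y, theta, scale) with
    # independent Gaussian noise, one perturbed copy per particle.
    state = np.asarray(state, dtype=np.float64)
    noise = rng.normal(0.0, sigmas, size=(n_particles, 4))
    return state + noise        # small perturbations are far more likely than large ones

print(sample_next_states([120.0, 80.0, 0.0, 1.0], n_particles=5))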

Observation Model
We’ll be using probabilistic principal component analysis to model our image observation
process.
Given some location Lt , assume the observed frame was generated from the eigenbasis. How
well, then, could we have generated that frame from our eigenbasis? The probability of
observing some z given the eigenbasis B and mean µ is

Pr [z | B] = N (z; µ, BBT + εI)

This is our “distance from face space.” Let’s explore the math a little further. We see that
our Gaussian is distributed around the mean µ, since that’s the most likely face. Its “spread”
is determined by the covariance matrix of our eigenbasis, BBT , (i.e. variance within face
space) and a bit of additive Gaussian noise, εI that allows us to cover some “face-like space”
as well.
Notice that as ε → 0, we are left with a likelihood of being purely in face space. Taking this
limit and expanding the Gaussian function N gives us:

Pr [z | B] ∝ exp(−‖(z − µ) − BBᵀ(z − µ)‖²)

where Bᵀ(z − µ) maps z into face space, BBᵀ(z − µ) is the subsequent reconstruction, and
the term inside the norm is therefore the reconstruction error.

The likelihood is proportional to the magnitude of error between the mapping of z into face
space and its subsequent reconstruction. Why is that the reconstruction, exactly? Well
BT (z − µ) results in a vector of the dot product of each eigenvector with our observation z
pulled into face space by µ. As we saw in Equation 11.1, this is just the coefficient vector
(call it γ) of z.
Thus, Bγ is the subsequent reconstruction, and our measurement model gives the likelihood
of an observation z fitting our eigenbasis.
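A sketch of that measurement model: reconstruct the candidate patch from the eigenbasis
and turn the reconstruction error into a likelihood. B and µ are the current eigenbasis and
mean; the σ scale is an assumed tuning constant (the expression above has no explicit σ).

import numpy as np

def observation_likelihood(z, B, mu, sigma=25.0):
    # Pr[z | B] ∝ exp(-||(z - mu) - B·Bᵀ·(z - mu)||² / sigma²), where z is a
    # flattened candidate patch and B is (d, k) with orthonormal columns.
    centered = z - mu
    gamma = B.T @ centered              # coefficients of z in face space
    reconstruction = B @ gamma          # back-projection into image space
    error = np.linalg.norm(centered - reconstruction)
    return np.exp(-(error ** 2) / sigma ** 2)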


Incremental Learning
This has all been more or less “review” thus far; we’ve just defined dynamics and observation
models with our newfound knowledge of PCA. The cool part comes now, where we allow
incremental updates to our object model.
The blending introduced in Equation 10.10 for updating particle filter models had two ex-
tremes. If α = 0, we eventually can’t track the target because our model has deviated too
far from reality. Similarly, if α = 1, we eventually can’t track the target because our model
has incorporated too much environmental noise.
Instead, what we’ll do is compute a new eigenbasis Bt+1 from our previous eigenbasis Bt and
the new observation wt . We still track the general class, but have allowed some flexibility
based on our new observed instance of that class. This is called an incremental subspace
update and is doable in real time.8

11.3.2 All Together Now


We can finally incorporate all of these parts into our tracker:
1. Optionally construct an initial eigenbasis for initial detection.
2. Choose an initial location, L0 .
3. Generate all possible locations: Pr [Lt | Lt−1 ].
4. Evaluate all possible locations: Pr [Ft | Lt ].
5. Select the most likely location: Pr [Lt | Ft , Lt−1 ].
6. Update the eigenbasis using the aforementioned R-SVD algorithm.
7. Go to Step 3.

Handling Occlusions
Our observation images have read the Declaration of Independence: all pixels are created
equal. Unfortunately, though, this means occlusions may introduce massive (incorrect!)
changes to our eigenbasis.
We can ask ourselves a question during our normal, un-occluded tracking: which pixels are
we reconstructing well ? We should probably weigh these accordingly. . . we have a lot more
confidence in their validity. Given an observation It , and assuming there is no occlusion with

8
Matrix Computations, 3rd ed. outlines an algorithm using recursive singular value decomposition that
allows us to do these eigenbasis updates quickly. [link]


our initial weight mask W0 , then:9

Di = Wi ⊗ (It − BBᵀIt)
Wi+1 = exp(−Di² / σ²)

This allows us to weigh each pixel individually based on how accurate its reconstruction has
been over time.
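A sketch of that per-pixel weighting with flattened float images and an orthonormal
eigenbasis B; σ is again a tuning constant of my choosing.

import numpy as np

def update_weight_mask(I_t, B, W_i, sigma=0.1):
    # Down-weight pixels whose reconstruction error has been large; the ⊗ above is
    # just element-wise multiplication.
    D_i = W_i * (I_t - B @ (B.T @ I_t))        # weighted per-pixel residual
    return np.exp(-(D_i ** 2) / sigma ** 2)    # large residual => small weight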

11.4 Discriminative Supervised Classification


Generative models were some of the first popular methods for pattern recognition because
they could be modeled analytically and could handle low-dimensional space fairly well. As
we discussed, though, it came with some liabilities that prevented it from succeeding in the
modern world of Big Data™:
• Many signals are extremely high-dimensional—hundreds or tens of thousands of di-
mensions. Representing the density of these classes is “data-hard:” the amount of data
necessary is exponential.
• With a generative model, we were trying to model the entire class. Differentiating
between skin and “not skin” is too broad of a goal; what we’d rather do is draw a
fine, definitive line between skin and “things that are very very similar to skin.” In
other words, we only care about making the right decisions in the hard cases near the
“decision boundary”, rather than creating general models that describe a class entirely.
• We typically don’t know which features in our feature space are useful in discriminating
between classes. As a result, we need to have good feature selection: which features
are informative and useful in differentiation? A generative model has no way of doing
that.
All of these principles lead us to discriminative methods of supervised classification. As
with all of our forays into modeling the world around us coherently, we begin with a series
of assumptions.
• There are a fixed number of known classes.
We can count on the fact the amount of classes we’re modeling will not change over
time; from the get-go, we will have class A, B, and C, for example. Of course, one of
these classes may be a catch-all “none of the above” class.
• We are provided an ample number of training examples of each class.
We need sufficient training examples to get really fine boundaries. Realistically, this is
hard to guarantee, but it’s an important assumption and explains the Big Data™craze
of our modern world.
9
We use ⊗ to represent element-wise multiplication.


• All mistakes have the same cost. The only thing we are concerned with is getting the
label right; getting it wrong has the same cost, regardless of “how wrong” it is.
• We need to construct a representation or description of our class instance, but we don’t
know a priori (in advance) which of our features are representative of the class label.
Our model will need to glean the diagnostic features.

11.4.1 Discriminative Classifier Architecture


A discriminative system has two main parts: a representation of the object model that
describes the training instances and a classifier. Since these are notes on computer vision,
we will be working with images. Given those, we then move on to testing our framework:
we generate candidate images and score our classifier based on how accurately it classified
them.

Building a Representation
Let’s tackle the first part: representation. Suppose we have two classes: koalas and pandas.
One of the simplest ways to describe their classes may be to generate a color (or grayscale)
histogram of their image content. Naturally, our biggest assumption is that our images are
made up of mostly the class we’re describing.

Figure 11.16: Histograms for some training images for two classes:
koalas and pandas.

There is clear overlap across the histograms for koalas, and likewise for pandas. Is that
sufficient in discriminating between the two? Unfortunately, no! Color-based descriptions
are extremely sensitive to both illumination and intra-class appearance variations.
How about feature points? Recall our discussion of Harris Corners and how we found that
regions with high gradients and high gradient direction variance (like corners) were robust to
changes in illumination. What if we considered these edge, contour, and intensity gradients?

Figure 11.17: Edge images of koala and pandas.


Even still, small shifts and rotations make our edge images look completely different (from an
overlay perspective). We can divide the image into pieces and describe the local distributions
of gradients with a histogram.

Figure 11.18: Subdivision of each picture into a local distribution of gradients.

This has the benefit of being locally order-less, offering us invariance in our aforementioned
small shifts and rotations. If, for example, the top-left ear moves in one of the pandas, the
local gradient distribution will still stay the same. We can also perform contrast
normalization to try to correct for illumination differences.
This is just one form of determining some sort of feature representation of the training data.
In fact, this is probably the most important form; analytically determining an approximate
way to differentiate between classes is the crux of a discriminative classifier. Without a good
model, our classifier will have a hard time finding good decision boundaries.
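
To make this concrete, here is a minimal numpy sketch of such a locally order-less description: per-cell histograms of gradient orientation, weighted by gradient magnitude and contrast-normalized per cell. The cell size, bin count, and normalization constant are arbitrary illustrative choices, not the parameters of any particular published descriptor.

```python
import numpy as np

def gradient_histogram_descriptor(image, cell_size=8, n_bins=9):
    """Describe an image by per-cell, contrast-normalized histograms of
    gradient orientation, weighted by gradient magnitude."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % np.pi       # unsigned orientation in [0, pi)

    h, w = image.shape
    cells = []
    for r in range(0, h - cell_size + 1, cell_size):
        for c in range(0, w - cell_size + 1, cell_size):
            mag = magnitude[r:r + cell_size, c:c + cell_size]
            ori = orientation[r:r + cell_size, c:c + cell_size]
            hist, _ = np.histogram(ori, bins=n_bins, range=(0, np.pi), weights=mag)
            cells.append(hist / (np.linalg.norm(hist) + 1e-6))   # contrast normalization
    return np.concatenate(cells)                   # the flattened 1 x n feature vector

# Example: describe a random patch standing in for a koala/panda image.
descriptor = gradient_histogram_descriptor(np.random.rand(64, 64))
```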

Train a Classifier
We have an idea of how to build our representation, now how can we use our feature vectors
(which are just flattened out 1 × n vectors of our descriptions) to do classification? Given
our feature vectors describing pandas and describing koalas, we need to learn to differentiate
between them.
To keep things simple, we’ll stick to a binary classifier: we’ll have koalas and non-koalas,
cars and non-cars, etc. There are a massive number of discriminative classification techniques
recently developed in machine learning and applied to computer vision: nearest-neighbor,
neural networks, support vector machines (SVMs), boosting, and more. We’ll discuss each
of these in turn shortly.

Generating and Scoring Candidates


Window-based models use a "sliding window" across the image and apply the classifier to
every patch. There’s nothing clever about it, but it works well.
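
A minimal sketch of such a window-based model, assuming a hypothetical classify_patch function that returns a score for a single patch (higher meaning "more object-like"):

```python
import numpy as np

def sliding_window_scores(image, classify_patch, window=(24, 24), step=4):
    """Apply a patch classifier to every window position in the image and
    collect (score, row, col) tuples."""
    win_h, win_w = window
    detections = []
    for r in range(0, image.shape[0] - win_h + 1, step):
        for c in range(0, image.shape[1] - win_w + 1, step):
            score = classify_patch(image[r:r + win_h, c:c + win_w])
            detections.append((score, r, c))
    return detections
```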

11.4.2 Nearest Neighbor


The nearest-neighbor classification strategy is very simple: just choose the label of the
nearest training data point. When we are given a novel test example, we find the closest


training example and label it accordingly. Each point corresponds to a Voronoi partition
which discretizes the space based on the distance from that point.
Nearest-neighbor is incredibly easy to write, but is not ideal. It's very data intensive: we
need to remember all of our training examples. It's computationally expensive: even with a
kd-tree (we've touched on these before; see 1 for more), searching for the nearest neighbor
takes a bit of time. Most importantly, though, it simply does not work that well.

Figure 11.19: Classification of the negative (in black) and positive (in red) classes. The
partitioning is a Voronoi diagram.

We can use the k-nearest neighbor variant to make it so that a single training example
doesn't dominate its region too much. The process is just as simple: the k nearest neighbors
"vote" to classify the new point, and majority rules. Surprisingly, this small modification
works really well. We still have the problem of this process being data-intensive, but this
notion of a loose consensus is very powerful and effective.
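
A minimal numpy sketch of the k-nearest-neighbor vote described above, where train_X holds one feature vector per row and train_y holds the corresponding labels:

```python
import numpy as np

def knn_classify(x, train_X, train_y, k=5):
    """Label a novel example by a majority vote of its k nearest training
    examples (Euclidean distance in feature space)."""
    distances = np.linalg.norm(train_X - x, axis=1)
    nearest_labels = train_y[np.argsort(distances)[:k]]
    labels, counts = np.unique(nearest_labels, return_counts=True)
    return labels[np.argmax(counts)]
```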
Let's look at some more sophisticated discriminative methods.

11.4.3 Boosting
This section introduces an iterative learning method: boosting. The basic idea is to look
at a weighted training error on each iteration.
Initially, we weigh all of our training examples equally. Then, in each “boosting round,” we
find the weak learner (more in a moment) that achieves the lowest weighted training error.
Following that, we raise the weights of the training examples that were misclassified by the
current weak learner. In essence, we say, “learn these better next time.” Finally, we combine
the weak learners from each step in a simple linear fashion to end up with our final classifier.
A weak learner, simply put, is a function that partitions our space. It doesn’t necessarily
have to give us the “right answer,” but it does give us some information. Figure 11.20 tries
to visually develop an intuition for weak learners.
The final classifier is a linear combination of the weak learners, with each learner's weight
directly proportional to its accuracy. The exact formulas depend on the "boosting scheme;" one
of them is called AdaBoost; we won’t dive into it in detail, but a simplified version of the
algorithm is described in algorithm 11.1.

Viola-Jones Face Detector


There was an application of boosting to object detection that took the computer vision field
by storm: a real-time face detector called the Viola-Jones detector (from this 2001 paper).


Figure 11.20: Iteratively applying weak learners to differentiate between the red and blue
classes. (a) Our set of training examples and the first boundary guess. (b) Reweighing the
error examples and trying another boundary. (c) The final classifier is a combination of the
weak learners.

There were a few big ideas that made this detector so accurate and efficient:
• Brightness patterns were represented with efficiently-computable “rectangular” features
within a window of interest. These features were essentially large-scale gradient filters
or Haar wavelets.

Figure 11.21: Feature detection in the Viola-Jones detector. (a) These are some "rectangular
filters;" the filter takes the sum of the pixels in the light area and subtracts the sum of the
pixels in the dark area. (b) These filters were applied to different regions of the image;
"features" were the resulting differences.

The reason this method was so effective is that it is incredibly efficient
to compute when applied many times to the same image. It leveraged the integral
image, which, at some (x, y) pixel location, stores the sum of all of the pixels spatially
before it. For example, the pixel at (40, 30) would contain the sum of the pixels in the
quadrant from (0, 0) to (40, 30):


[Diagram: the quadrant from (0, 0) to (40, 30), drawn inside the image with its corners
labeled D and B along the top and C and A along the bottom.]

Why is this useful? Well, with the rectangular filters, we wanted to find the sum of
the pixels within an arbitrary rectangular region: (A, B, C, D). What is its sum with
respect to the integral image? It's simply A − B − C + D, as above.
Once we have the integral image, this is only 3 additions to compute the sum of any
size rectangle. This gives us really efficient handling of scaling as a bonus, as well:
instead of scaling the images to find faces, we can just scale the features, recomputing
them efficiently. (A small sketch of this bookkeeping appears after this list of big ideas.)
• We use a boosted combination of these filter results as features. With such efficiency, we
can look at an absurd number of features quickly. The paper used 180,000 features
associated with a 24×24 window. These features were then run through the boosting
process in order to find the best linear combination of features for discriminating faces.
The top two weak learners were actually the filters shown in Figure 11.21b. If you look
carefully, the first appears useful in differentiating eyes (a darker patch above/below a
brighter one) and dividing the face in half.
• We’ll formulate a cascade of these classifiers to reject clear negatives rapidly. Even if
our filters are blazing fast to compute, there are still a lot of possible windows to search
if we have just a 24×24 sliding window. How can we make detection more efficient?
Well. . . almost everywhere in an image is not a face. Ideally, then, we could reduce
detection time significantly if we found all of the areas in the image that were definitely
not a face.

Figure 11.23: The cascading classifiers in the Viola-Jones face detector.


Each stage has no false negatives: you can be completely sure that any rejected sub-
window was not a face. All of the positives (even the false ones) go into the training
process for the next classifier. We only apply the sliding window detection if a sub-
window made it through the entire cascade.
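
Here is a minimal numpy sketch of the integral image and the three-addition rectangle sum from the first bullet above. The coordinates are treated as inclusive, and the corner bookkeeping follows the usual integral-image convention rather than the exact labels in the diagram:

```python
import numpy as np

def integral_image(image):
    """ii[y, x] = sum of all pixels in the quadrant from (0, 0) to (x, y)."""
    return image.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows top..bottom and columns left..right, using only
    a few additions/subtractions on the integral image."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.random.rand(100, 100)
ii = integral_image(img)
assert np.isclose(rect_sum(ii, 10, 20, 40, 50), img[10:41, 20:51].sum())
```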
To summarize this incredibly impactful technique, the Viola-Jones detector used rectangular
features optimized by the integral image, AdaBoost for feature selection, and a cascading set
of classifiers to discard true negatives quickly. Due to the massive feature space (180,000+
filters), training is very slow, but detection can be done in real-time. Its results are phenom-
enal and variants of it are in use today commercially.

Advantages and Disadvantages

Boosting integrates classification with feature selection in a flexible, easy-to-implement
fashion. Its flexibility enables a "plug-and-play" nature of weak learners; as long as they support
some form of iterative improvement, boosting can select the best ones and optimize a lin-
ear combination. We mentioned detection efficiency for the Viola-Jones detector specifically
(testing is fast) but in general, boosting has a training complexity linear in the number of
training examples; that’s a nice property.
Some of the downsides of boosting include needing a large number of training examples. This
was one of our earlier assumptions, though. More significantly, boosting has been found to
not work as well as Support Vector Machines and random forests, which are newer approaches


to discriminative learning, especially on many-class problems.

Algorithm 11.1: The simplified AdaBoost algorithm.

Input: X, Y: M positive and negative training samples with their corresponding labels.
Input: H: a weak classifier type.
Result: A boosted classifier, H*.

    /* Initialize a uniform weight distribution. */
    ŵ = [1/M  1/M  . . .]
    t = some small threshold value close to 0
    foreach training stage j ∈ [1..n] do
        ŵ = w / ‖w‖
        /* Instantiate and train a weak learner for the current weights. */
        hⱼ = H(X, Y, ŵ)
        /* The error is the sum of the incorrect training predictions. */
        εⱼ = Σᵢ wᵢ   over all wᵢ ∈ ŵ where hⱼ(Xᵢ) ≠ Yᵢ
        αⱼ = ½ ln((1 − εⱼ) / εⱼ)
        /* Update the weights only if the error is large enough. */
        if ε > t then
            wᵢ = wᵢ · exp(−Yᵢ αⱼ hⱼ(Xᵢ))   ∀wᵢ ∈ w
        else
            break
        end
    end

    /* The final boosted classifier is the sum of each hⱼ weighed by its
       corresponding αⱼ. Prediction on an observation x is then simply: */
    H*(x) = sign(Σⱼ αⱼ hⱼ(x))
    return H*
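
To make the loop above concrete, here is a minimal numpy sketch of the same idea using single-feature decision stumps as the weak learners. Labels are assumed to be ±1, and the exhaustive stump search, the clipping of ε, and the fixed number of rounds are simplifications for illustration:

```python
import numpy as np

def train_stump(X, y, w):
    """Weak learner: the (feature, threshold, polarity) decision stump with
    the lowest weighted training error."""
    best_err, best_stump = np.inf, None
    for f in range(X.shape[1]):
        for thresh in np.unique(X[:, f]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, f] - thresh) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best_stump = err, (f, thresh, polarity)
    return best_err, best_stump

def stump_predict(stump, X):
    f, thresh, polarity = stump
    return np.where(polarity * (X[:, f] - thresh) >= 0, 1, -1)

def adaboost(X, y, n_rounds=10):
    w = np.full(len(y), 1.0 / len(y))           # uniform initial weights
    ensemble = []
    for _ in range(n_rounds):
        w = w / w.sum()
        err, stump = train_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-y * alpha * stump_predict(stump, X))   # boost the mistakes
        ensemble.append((alpha, stump))
    return ensemble

def boosted_predict(ensemble, X):
    return np.sign(sum(a * stump_predict(s, X) for a, s in ensemble))
```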

11.5 Support Vector Machines


In contrast with boosting, support vector machines (SVMs for short) are tougher to
implement. Despite that, SVMs are a recent development in machine learning that make
for incredibly reliable discriminative classifiers. Before we begin discussing them in further
detail, we need to talk about linear classifiers.

11.5.1 Linear Classifiers


Given a set of points, we want to automatically find the line that divides them the best. For
example, given the blue and pink points in Figure 11.24, we can easily (as humans) see that
the line defines their boundary. We want to adapt this idea to higher-dimensional space.


Figure 11.24: Finding the boundary line between two classes of points

We've talked a lot about lines, but let's revisit them in R² once more. Lines can be expressed
as a normal vector and a distance from the origin.¹⁰ Specifically, given px + qy + b = 0, if we
let w = (p, q)ᵀ and x be some arbitrary point (x, y)ᵀ, then this is equivalent to w · x + b = 0,
and w is normal to the line. We can scale both w and b arbitrarily and still have the same
result (this will be important shortly). Then, for an arbitrary point (x₀, y₀) not on the line,
we can find its (perpendicular) distance to the line relatively easily:

    D = (px₀ + qy₀ + b) / √(p² + q²) = (x₀ · w + b) / ‖w‖

None of this should be new information.


If we want to find a linear classifier, we want the line that separates the positive and negative
examples. For the points in Figure 11.24, if xi is a positive example, we want xi · w + b ≥ 0,
and likewise if xi is negative, we want xi · w + b < 0.
Naturally, though, there are quite a few lines that separate the blue and pink dots in Fig-
ure 11.24. Which one is “best”? Support vector machines are a discriminative classifier based
on this “optimal separating line.”
In the 2D case, an SVM wants to maximize the margin between the positive and negative
training examples (the distance between the dotted lines):

¹⁰ We switch notation to p, q, b instead of the traditional a, b, c to stay in line with the SVM literature.


This extends easily to higher dimensions, but that makes for tougher visualizations.

11.5.2 Support Vectors


There are some special points in the above visualization: points where the dashed lines
intersect a training example.

The margin lines between two classes will always have such points; they are called the sup-
port vectors. This is a huge reduction in computation relative to the generative methods
we covered earlier, which analyzed the entire data set; here, we only care about the examples
near the boundaries. So. How can we use these vectors to maximize the margin?
Remember that we can arbitrarily scale w and b and still represent the same line, so let’s say
that the separating line = 0, whereas the lines touching the positive and negative examples
are = 1 and = −1, respectively.

    w · x + b = 1
    w · x + b = 0
    w · x + b = −1

The aforementioned support vectors lie on these lines, so xi · w + b = ±1 for them, meaning
we can compute the margin as a function of these values. The distance from a support vector
xi to the separating line is then:

    |xᵢ · w + b| / ‖w‖ = ±1 / ‖w‖


That means that M , the distance between the dashed green lines above, is defined by:

    M = 1/‖w‖ − (−1/‖w‖) = 2/‖w‖

Thus, we want to find the w that maximizes the margin, M . We can’t just pick any w,
though: it has to correctly classify all of our training data points. Thus, we have an additional
set of constraints that:

    ∀xᵢ ∈ positive examples: xᵢ · w + b ≥ 1
    ∀xᵢ ∈ negative examples: xᵢ · w + b ≤ −1

Let’s define an auxiliary variable yi to represent the label on each example. When xi is a
negative example, yi = −1; similarly, when xi is a positive example, yi = 1. This gives us a
standard quadratic optimization problem:

    Minimize:    ½ wᵀw
    Subject to:  yᵢ(xᵢ · w + b) ≥ 1

The solution to this optimization problem (whose derivation we won’t get into) is just a
linear combination of the support vectors:
    w = Σᵢ αᵢ yᵢ xᵢ        (11.3)

The αᵢ's are "learned weights" that are non-zero only at the support vectors. For any support
vector xᵢ, we can substitute in yᵢ = w · xᵢ + b, so:

    yᵢ = Σⱼ αⱼ yⱼ (xⱼ · xᵢ) + b

Since this quantity equals the label yᵢ, it's always ±1. We can use this to build our classification function:

    f(x) = sign(w · x + b)                          (11.4)
         = sign(Σᵢ αᵢ yᵢ (xᵢ · x) + b)              (11.5)

The dot product xᵢ · x is the crucial component: the entirety of the classification depends only
on this dot product between some "new point" x and our support vectors xᵢ.
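
As a minimal numpy sketch of Equation 11.5, assuming the support vectors, their labels, the weights αᵢ, and the bias b have already been produced by some solver:

```python
import numpy as np

def svm_classify(x, support_vectors, labels, alphas, b):
    """f(x) = sign(sum_i alpha_i * y_i * (x_i . x) + b); the sum only runs
    over the support vectors since every other alpha is zero."""
    return np.sign(np.sum(alphas * labels * (support_vectors @ x)) + b)
```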


11.5.3 Extending SVMs


Some questions arise with this derivation, and we’ll knock each of them down in turn:
1. We’ve been working with nice, neat 2D plots and drawing a line to separate the data.
This begs the question: what if the features are not in 2 dimensions?
2. Similarly, what if the data isn’t linearly separable?
3. Classifying everything into two binary classes seems like a pipe dream: what if we
have more than just two categories?
The first one is easy to knock down: nothing in the math of Equation 11.4 requires 2
dimensions: x could easily be an arbitrary n-dimensional vector. As long as w is a normal,
it defines a structure in that dimensional space: instead of a line, in 3D we’d create a plane;
this generalizes to creating an (n − 1)-dimensional hyperplane.

Mapping to Higher-Dimensional Space


The second question is a bit harder to tackle. Obviously, we can find the optimal separator
between the following group of points:

[figure: a handful of points along a 1-D x-axis that a single threshold can separate]

But what about these?

[figure: a handful of points along a 1-D x-axis that no single threshold can separate]

No such luck this time. But what if we mapped them to a higher-dimensional space? For
example, if we map these to y = x², a wild linear separator appears!

Figure 11.25: Finding a linear separator by mapping to a higher-dimensional space.


This seems promising. . . how can we find such a mapping (like the arbitrary x ↦ x² above)
for other feature spaces? Let's generalize this idea. We can call our mapping function Φ; it
maps xs in our feature space to another higher-dimensional space ϕ(x), so Φ : x ↦ ϕ(x).
We can use this to find the "kernel trick."

Kernel Trick Just a moment ago in (11.4), we showed that the linear classifier relies on
the dot product between vectors: xi · x. Let’s define a kernel function K:

    K(xᵢ, xⱼ) = xᵢ · xⱼ = xᵢᵀxⱼ

If we apply Φ to our points, we get a new kernel function:

    K(xᵢ, xⱼ) ⇒ ϕ(xᵢ)ᵀϕ(xⱼ)

This kernel function is a “similarity” function that corresponds to an inner (dot) product in
some expanded feature space. A similarity function grows if its parameters are similar. This
is significant: as long as there is some higher dimensional space in which K is a dot product,
we can use K in a linear classifier.¹¹
The kernel trick is that instead of explicitly computing the lifting transformation ϕ(x)
(since it lifts us into a higher dimension), we instead define the kernel function in a way that
we know maps to a dot product in a higher dimensional space, like the polynomial kernel in
the example below. Then, we get a non-linear decision boundary in the original feature space:

    Σᵢ αᵢ yᵢ (xᵢᵀx) + b   ⟶   Σᵢ αᵢ yᵢ K(xᵢ, x) + b

Example 11.2: A Simple Polynomial Kernel Function

Let’s work through a proof that a particular kernel function is a dot product in
some higher-dimensional space. Remember, we don’t actually care about what
that space is when it comes to applying the kernel; that’s the beauty of the kernel
trick. We’re working through this to demonstrate how you would show that some
kernel function does have a higher-dimensional mapping.
 
We have 2D vectors, so x = (x₁, x₂)ᵀ.

Let's define the following kernel function: K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)²


This is a simple polynomial kernel; it’s called that because we are creating a
polynomial from the dot product. To prove that this is a valid kernel function, we
need to show that K(xᵢ, xⱼ) = ϕ(xᵢ)ᵀϕ(xⱼ) for some ϕ.

¹¹ There are mathematical restrictions on kernel functions that we won't get into but I'll mention for those
interested in reading further. Mercer’s Theorem guides us towards the conclusion that a kernel matrix κ
(i.e. the matrix resulting from applying the kernel function to its inputs, κi,j = K(xi , xj )) must be positive
semi-definite. For more reading, try this article.


K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)²
           = (1 + xᵢ₁xⱼ₁ + xᵢ₂xⱼ₂)²                                       (expand the dot product)
           = 1 + xᵢ₁²xⱼ₁² + 2xᵢ₁xⱼ₁xᵢ₂xⱼ₂ + xᵢ₂²xⱼ₂² + 2xᵢ₁xⱼ₁ + 2xᵢ₂xⱼ₂     (multiply it all out)
           = (1, xᵢ₁², √2·xᵢ₁xᵢ₂, xᵢ₂², √2·xᵢ₁, √2·xᵢ₂)
             · (1, xⱼ₁², √2·xⱼ₁xⱼ₂, xⱼ₂², √2·xⱼ₁, √2·xⱼ₂)                   (rewrite it as a vector product)

At this point, we can see something magical and crucially important: each of the
vectors only relies on terms from either xᵢ or xⱼ! That means it's a. . . wait for it. . .
dot product! We can define ϕ as a mapping into this new 6-dimensional space:

    ϕ(x) = (1, x₁², √2·x₁x₂, x₂², √2·x₁, √2·x₂)ᵀ

Which means now we can express K in terms of dot products in ϕ, exactly as we wanted:

    K(xᵢ, xⱼ) = ϕ(xᵢ)ᵀϕ(xⱼ)  ∎
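
A quick numerical check of this identity with numpy (the test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([0.3, -1.2]), np.array([2.0, 0.7])
assert np.isclose((1 + xi @ xj) ** 2,   # the kernel in the original 2D space...
                  phi(xi) @ phi(xj))    # ...equals a dot product in 6D
```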

What if the data isn't separable even in a higher dimension, or there is some decision
boundary that is actually "better"¹² than a perfect separation, one that perhaps ignores a
few whacky edge cases? This is an advanced topic in SVMs that introduces the concept of
slack variables which allow for this error, but that's more of a machine learning topic. For
a not-so-gentle introduction, you can try this page.

Example Kernel Functions This entire discussion begs the question: what are some
good kernel functions?
Well, the simplest example is just our linear function: K(xᵢ, xⱼ) = xᵢᵀxⱼ. It's useful when x
is already just a massive vector in a high-dimensional space.
Another common kernel is a Gaussian noise function, which is a specific case of the radial
basis function:

    K(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖² / 2σ²)
As xⱼ moves away from the support vector xᵢ, it just falls off like a Gaussian. But how is
this a valid kernel function, you might ask? Turns out, it maps us to an infinite-dimensional
space. You may (or may not) recall from calculus that the exponential function eˣ can

¹² For some definition of better.


be expressed as an infinite series:

    eˣ = 1 + x + x²/2! + x³/3! + . . . = Σ_{k=0}^∞ xᵏ/k!

Applying this to our kernel renders this whackyboi (with σ = 1 for simplification):¹³

    exp(−½‖x − x′‖₂²) = Σ_{j=0}^∞ [(xᵀx′)ʲ / j!] · exp(−½‖x‖₂²) · exp(−½‖x′‖₂²)

As you can see, it can be expressed as an infinite sum of dot products of x and x′. Of course,
thanks to the kernel trick, we don’t actually have to care how it works; we can just use it. A
Gaussian kernel function is really useful in computer vision because we often use histograms
to describe classes and histograms can be represented as mixtures of Gaussians; we saw this
before when describing Continuous Generative Models.
Another useful kernel function (from this paper) is one that describes histogram intersection,
where the histogram is normalized to represent a probability distribution:
    K(xᵢ, xⱼ) = Σₖ min(xᵢ(k), xⱼ(k))

We’ll leverage this kernel function briefly when discussing Visual Bags of Words and activity
classification in video.
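
Minimal numpy versions of these two kernels (choosing σ and normalizing the histograms are left to the caller):

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

def histogram_intersection_kernel(hi, hj):
    """hi, hj: histograms normalized to represent probability distributions."""
    return np.minimum(hi, hj).sum()
```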

Multi-category Classification
Unfortunately, neither of the solutions to this question are particularly “satisfying,” but
fortunately both of them work. There are two main approaches:
One vs. All In this approach, we adopt a “thing or ¬thing” mindset. More specifically, we
learn an SVM for each class versus all of the other classes. In the testing phase, then,
we apply each SVM to the test example and assign to it the class that resulted in the
highest decision value (that is, the distance that occurs within (11.4)).
One vs. One In this approach, we actually learn an SVM for each possible pair of classes.
As you might expect, this blows up relatively fast. Even with just 4 classes, {A, B, C, D},
we need (4 choose 2) = 6 pairs of SVMs: {(A, B), (A, C), (A, D), (B, C), (B, D), (C, D)}. Dur-
ing the testing phase, each learned SVM “votes” for a class to be assigned to the test
example. Like when we discussed voting in Hough Space, we trust that the most sen-
sible option results in the most votes, since the votes for the “invalid” pairs (i.e. the
(A, B) SVM for an example in class D) would result in a random value.

11.5.4 SVMs for Recognition


¹³ The subscript notation on the norm, ‖x‖₂, expresses the p-norm. For p = 2, this is our familiar Euclidean
distance: ‖x‖₂ = √(Σ_{xᵢ ∈ x} xᵢ²). [from StackExchange]


Recognition and classification are largely interchangeable ideas, so this serves more as a
summary of SVMs. To use them, we need to:

1. Define a representation. We process our data set, decide what constitutes a feature, etc.

2. Select a kernel function. This part involves a bit of magic and guesswork (as is tradition
   in machine learning).

3. Compute pairwise kernel values between labeled examples. This is naturally the slowest
   part of the training process; it has O(n²) complexity in the number of training samples.

4. Use the "kernel matrix" that results from training to solve for the support vectors and
   their weights.

Figure 11.26: Training an SVM to do facial recognition.
The final linear combination of support vectors and weights renders the classification label
for any new input vector x via Equation 11.4.
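
As a rough sketch of how these four steps look in practice, assuming scikit-learn is available (the random arrays below are only stand-ins for real image feature vectors):

```python
import numpy as np
from sklearn.svm import SVC

# 1. Representation: one flattened feature vector per labeled training image.
X_train = np.random.rand(200, 128)
y_train = np.random.randint(0, 2, size=200)

# 2-4. The kernel choice is ours; the pairwise kernel values, support vectors,
#      and their weights are handled inside the solver.
classifier = SVC(kernel="rbf", gamma="scale")
classifier.fit(X_train, y_train)

predicted_label = classifier.predict(np.random.rand(1, 128))
```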

Using SVMs for Gender Classification


An interesting application of SVMs was gender classification of faces, done in this paper in
2002. They used a Gaussian RBF as their kernel function to find the “support faces” that
defined the boundary between the genders, a few of which are shown in Figure 11.27.

Figure 11.27: The "support faces"—support vectors of faces—determined by the SVM classifier.

The lowest error rate found by the SVM with an RBF kernel was around 3.4% on tiny
21 × 12 images; this performed better than humans even when high resolution images were
used. Figure 11.28 shows the top five gender misclassifications by humans.


Figure 11.28: The top five gender misclassifications by humans. How would you label them?
Answer: FMMFM

11.6 Visual Bags of Words


Briefly recall what we did when creating panoramas: we found putative matches among fea-
ture points using a descriptor (like the SIFT Descriptor), then used something like RANSAC
to find the resulting alignment by finding consensus for a particular transformation among
our points.
We can do something that is somewhat related for recognition, as well. We’ll define and find
interesting points, but we’ll then talk about describing the patches around those points. We
can then use the collection of described patches to create general descriptions or categoriza-
tions of images as a whole by searching for those patches.
We make an important assumption for these features: if we see points close in feature space
(e.g. the patches have similar descriptors), those indicate similar local content. Given a new
“query image,” then, we can describe its features by finding similar patches from a pre-defined
database of images.

Figure 11.29: Each region in an image has a descriptor, which is a point in some
high-dimensional feature space.

The obvious problem with this idea is sheer scalability. We may need to search for millions
of possible features for a single image. Thankfully, this isn’t a unique problem, and we can
draw parallels to other disciplines in computer science to find a better search method. To
summarize our problem, let’s restate it as such:
With potentially thousands of features per image, and hundreds of millions of
images to search, how can we efficiently find those that are relevant to a new
image?


If we draw a parallel to the real world of text documents, we can see a similar problem.
How, from a book with thousands of words on each page, can we find all of the pages that
contain a particular “interesting” word? Reference the Index, of course! Suppose we’re given
a new page, then, with some selection of interesting words in it. How can we find the "most
similar” page in our book? The likeliest page is the page with the most references to those
words in the index!

Example 11.3: Find Likely Page

Suppose we have four pages of text with an index as follows:

Index
ant 1,4 gong 2,4
bread 2 hammer 2,3
bus 4 leaf 1,2,3
chair 2,4 map 2,3
chisel 3,4 net 2
desk 1 pepper 1,2
fly 1,3 shoe 1,3

Now we’re given a page with a particular set of words from the index: gong, fly,
pepper, ant, and shoe. Which page is this “most like”?
Finding the solution is fairly straightforward: choose the page with the most refer-
ences to the new words:

• Page 1 contains {ant, pepper, fly, shoe} (4).


• Page 2 contains {gong, pepper} (2).
• Page 3 contains {fly, shoe} (2).
• Page 4 contains {ant, gong} (2).

Clearly, then, the new page is most like the 1st page. Naturally, we’d need some
tricks for larger, more realistic examples, but the concept remains the same: iterate
through the index, tracking occurrences for each page.
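
A tiny Python version of this lookup, treating the index above as a word → pages mapping (an inverted index) and voting for the page with the most matches:

```python
index = {
    "ant": {1, 4}, "bread": {2}, "bus": {4}, "chair": {2, 4},
    "chisel": {3, 4}, "desk": {1}, "fly": {1, 3}, "gong": {2, 4},
    "hammer": {2, 3}, "leaf": {1, 2, 3}, "map": {2, 3}, "net": {2},
    "pepper": {1, 2}, "shoe": {1, 3},
}

def most_similar_page(words, index):
    votes = {}
    for word in words:
        for page in index.get(word, ()):
            votes[page] = votes.get(page, 0) + 1
    return max(votes, key=votes.get)

print(most_similar_page({"gong", "fly", "pepper", "ant", "shoe"}, index))  # -> 1
```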

Similarly, we want to find all of the images that contain a particular interesting feature. This
leads us to the concept of mapping our features to “visual words.”


Figure 11.30: An excerpt from the "visual vocabulary" formulated in this paper. Each
quadrant represents a selection describing one of the "visual words" in the feature space.

We’ll disregard some implementation details regarding this method, such as which features
to choose or what size they should be. These are obviously crucial to the method, but we’re
more concerned with using it rather than implementing it. The academic literature will give
more insights (such as the paper linked in Figure 11.30) if you’re interested. Instead, we’ll
be discussing using these “bags of visual words” to do recognition.

Figure 11.31: We dissect an image into its composite features, which we then use to create
our "vocabulary" to describe novel images.

To return to the document analogy, if you had a document and wanted to find similar
documents from your library, you likely would pull documents with a similar histogram
distribution of words, much like we did in the previous example.


Figure 11.32: Creating a visual vocabulary from a set of objects.

We can use our visual vocabulary to compute a histogram of occurrences of each “word” in
each object. Notice in Figure 11.33 that elements that don’t belong to the object do some-
times “occur”: the bottom of the violin might look something like a bicycle tire, for example.
By-and-large, though, the peaks are related to the relevant object from Figure 11.32.

Figure 11.33: Capturing the distribution of each object in the context of our overall
vocabulary.

Comparing images to our database is easy. By normalizing the histograms, we can treat
them as unit vectors. For some database image dⱼ and query image q, similarity can then be
easily represented by the dot product:

    sim(dⱼ, q) = (dⱼ · q) / (‖dⱼ‖ ‖q‖)
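
A minimal numpy sketch of this comparison, assuming each image is already summarized as a histogram over the same visual vocabulary:

```python
import numpy as np

def bow_similarity(d_j, q):
    """Cosine similarity between a database histogram d_j and a query q."""
    return (d_j @ q) / (np.linalg.norm(d_j) * np.linalg.norm(q))

def best_match(database, q):
    """Nearest-neighbor lookup: the index of the most similar database image."""
    return int(np.argmax([bow_similarity(d, q) for d in database]))
```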

This essentially amounts to a nearest-neighbor lookup. In our initial discussion of nearest-


neighbor, we saw that k-nearest-neighbor was fairly effective, but it was slow, inefficient,
and data-hungry. Working around this problem was our motivation behind boosting and
ultimately support vector machines.
We can apply the same principles here: given our “visual vocabulary” for a set of object
classes,14 we can train an SVM to differentiate between them for novel inputs. This concept
was pioneered in this paper.

¹⁴ Such visual bags of words already exist, see the Caltech 101 dataset.

Video Analysis

To be continued. . .

Linear Algebra Primer

This appendix covers a few topics in linear algebra needed throughout the rest of the
guide. Computer vision depends heavily on linear algebra, so reading an external
resource is highly recommended. The primer from Stanford's machine learning course
is an excellent starting point and likely explains things much better than this appendix.

Notation I try to stick to a particular notation for vectors and matrices throughout the
guide, but it may be inconsistent at times. Vectors are lowercase bold: a; matrices are
uppercase bold: A. Unit vectors wear hats: â.

A.1 Solving a System of Equations via Least-Squares


We start with a system of equations: y = Ax.
First of all, it’s worth noting that the least squares solution is an approximate solution to a
system of equations. It’s used when a solution doesn’t exist, which happens often when the
system has too many equations (more rows than columns in A); the idea instead is to find
the set of values for x that minimizes the error for the system.
Now normally, if there was a solution, we’d solve the system by simply multiplying both
sides by the inverse of A:
    A⁻¹y = A⁻¹Ax

In this case, though, we don’t have a solution. How do we know this? Perhaps A isn’t
invertible, or we know in advance that we have an over-constrained system (more rows than
columns in A). Instead, we can try to find the “best” solution. We can define “best” as being
the solution vector with the shortest distance to y. The Euclidean distance between two
n-length vectors a and b is just the magnitude of their difference:
    d(a, b) = ‖a − b‖ = √((a₁ − b₁)² + . . . + (aₙ − bₙ)²)

Thus, we can define x∗ as the best possible solution that minimizes this distance:

    x* = arg minₓ ‖y − Ax‖


Then, we can use x* to create an invertible system by multiplying by Aᵀ:

    Aᵀy = AᵀAx*

A Visual Parallel. We can present a visual demonstration of how this works and show
that the least-squares solution is the same as the projection.

Suppose we have a set of points that don't exactly fit a line: {(1, 1), (2, 1), (3, 2)}, plotted
in Figure A.1. We want to find the best possible line y = Dx + C that minimizes the total
error. This corresponds to solving the following system of equations, forming y = Ax:

    1 = C + D · 1
    1 = C + D · 2      or, in matrix form:
    2 = C + D · 3

    [1]   [1  1]
    [1] = [1  2] [C]
    [2]   [1  3] [D]

Figure A.1: A set of points with "no solution": no line passes through all of them. The set
of errors is plotted in red: (e₀, e₁, e₂).

The lack of a solution to this system means that the vector of y-values isn't in the column
space of A, or: y ∉ C(A). The vector can't be represented by a linear combination of
column vectors in A.
We can imagine the column space as a plane in xyz-space, and y existing outside of it;
then, the vector that'd be within the column space is the projection of y onto the column
space plane: p = proj_C(A) y. This is the closest possible vector in the column space to y,
which is exactly the distance we were trying to minimize! Thus, e = y − p.

Figure A.2: The vector y relative to the column space of A, and its projection p onto the
column space.

The projection isn't super convenient to calculate or determine, though. Through algebraic
manipulation, calculus, and other magic, we learn that the way to find the least squares
approximation of the solution is:
    AᵀAx* = Aᵀy
    x* = (AᵀA)⁻¹Aᵀ y,    where (AᵀA)⁻¹Aᵀ is called the pseudoinverse
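
A quick numpy sanity check of this recipe on the points from Figure A.1; the pseudoinverse route and numpy's dedicated least-squares solver should agree:

```python
import numpy as np

# Fit y = C + Dx to (1, 1), (2, 1), (3, 2): an over-constrained system.
A = np.array([[1, 1],
              [1, 2],
              [1, 3]], dtype=float)     # columns: [1, x]
y = np.array([1, 1, 2], dtype=float)

x_star = np.linalg.inv(A.T @ A) @ A.T @ y          # (A^T A)^-1 A^T y
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)    # numpy's solver
assert np.allclose(x_star, x_lstsq)

C, D = x_star    # C = 1/3, D = 1/2: the best-fit line is y = 1/3 + x/2
```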

More Resources. This section basically summarizes and synthesizes this Khan Academy
video, this lecture from our course (which goes through the full derivation), this section of
Introduction to Linear Algebra, and this explanation from NYU. These links are provided in
order of clarity.


A.2 Cross Product as Matrix Multiplication


We can express the cross product between two vectors as a matrix multiplication:
 
           [  0   −a₃   a₂ ]
    [a×] = [  a₃    0   −a₁ ]
           [ −a₂   a₁    0  ]

Meaning a × b = [a× ] b. We’ll use this notation throughout certain parts of the guide.
Note: The rank of this matrix is 2. Just trust me (and Prof. Aaron Bobick) on that.
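
A small numpy check of both claims: that [a×]b reproduces the cross product, and that the matrix has rank 2.

```python
import numpy as np

def cross_matrix(a):
    """[a_x]: the skew-symmetric matrix such that cross_matrix(a) @ b == a x b."""
    return np.array([[    0, -a[2],  a[1]],
                     [ a[2],     0, -a[0]],
                     [-a[1],  a[0],     0]])

a, b = np.array([1.0, 2.0, 3.0]), np.array([-4.0, 0.5, 2.0])
assert np.allclose(cross_matrix(a) @ b, np.cross(a, b))
assert np.linalg.matrix_rank(cross_matrix(a)) == 2
```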

A.3 Lagrange Multipliers


The Lagrange multiplier is a useful technique when solving for optimization problems with
constraints. It expresses the proportionality of the gradients of the to-be-optimized function
and its constraint.
Visually, we can express the solution to a constrained optimization problem as where the
optimization function and the constraint are tangent. For example, consider the simple
multivariate function f (x, y) = xy constrained upon the condition x + y = 4. We can
express the contours of f as being places where f (x, y) = k, and the latter function is
obviously just a line in space. A handful of the contours and the constraint are plotted in
Figure A.3a.
Choosing the inner-most plotted contour, f (x, y) = 2, would not be the optimal version of
f . As you can see, the constraint intersects it in two places: one of these has a higher y and
the other has a higher x. We want to maximize both simultaneously, though. So we want
to choose a large enough contour such that it juuust barely touches our constraint so that
it's satisfied. Mathematically, this is where the to-be-optimized function is tangent to the
constraint.
How do we choose this contour? This requires an understanding of the gradient. Remember,
the gradient (and derivatives in general) follows the function’s growth; it points in the
direction of the greatest rate of increase in the function. More interestingly, as seen in
Figure A.3b, the functions’ respective gradient vectors will point in the same direction at
their tangent point.
That discovery seems mathematically useful. Though the gradients may have different mag-
nitudes, they are proportional to one another. Let’s generalize our constraint function into
g(x, y) = x + y and call the (unknown) maximum value (x∗ , y ∗ ). Then, we can express this
proportionality as such:

∇f (x∗ , y ∗ ) ∝ ∇g(x∗ , y ∗ )
∇f (x∗ , y ∗ ) = λ∇g(x∗ , y ∗ )


Figure A.3: Maximizing f(x, y) = xy subject to the constraint x + y = 4, visually restricted
to the first quadrant for simplicity. (a) Some of the contours of f and the constraint are
shown. The purple contour is tangent to the constraint. (b) Zooming in on the tangent
between the best contour of f and the constraint, we can see that their normal vectors (in
other words, their gradients) are proportional.

λ is the Lagrange multiplier, named after its discoverer, Joseph-Louis λagrange. We can
now use this relationship between our two functions to actually find the tangent point—and
thus the maximum value:

    ∇f = ( ∂/∂x (xy),  ∂/∂y (xy) )ᵀ = (y, x)ᵀ
    ∇g = ( ∂/∂x (x + y),  ∂/∂y (x + y) )ᵀ = (1, 1)ᵀ

We can arrange this new relationship into a system of equations and include our original
constraint x + y = 4:

    y = λ
    x = λ
    x + y = 4

Trivially, then, λ = x = y = 2, and we see that (x*, y*) = (2, 2).
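
A quick numerical sanity check of this answer, sweeping along the constraint line y = 4 − x instead of using any dedicated optimizer:

```python
import numpy as np

x = np.linspace(0, 4, 100001)       # points on the constraint x + y = 4
f = x * (4 - x)                     # f(x, y) = xy restricted to the constraint
x_star = x[np.argmax(f)]
print(x_star, 4 - x_star)           # both are 2, i.e. (x*, y*) = (2, 2)
```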

Example A.1: Chickens & Eggs

This example works through a less-contrived optimization problem to give a little


more insight (and a little more algebra) to the process.
TODO.


A.3.1 The Lagrangian


There’s a special function called the Lagrangian that combines the previous 3-element
system of equations into a single step. The Lagrangian is defined by:

L(x, y, λ) = f (x, y) − λ (g(x, y) − k)

where f is the function to maximize, g is the constraint function, and k is the constraint
constant. Maximizing f is then a matter of solving for ∇L = 0.
This should look more familiar if you came here from the first reference to the Lagrange
multiplier.

More Resources. This section is based on the Khan Academy series on solving con-
strained optimization problems, which discuss the Lagrange multiplier, as well as its rela-
tionship to contours and the gradient, in much greater detail.

Index of Terms

Symbols cross-convolution filter . . . . . . . . . . . . . . . . . . 18


n-views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 cross-correlation . . . . . . . . . . . . . . . . . . . . . . . . . 15
2nd derivative Gaussian operator . . . . . . . . 28 cubic splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

A D
affine transformation . 80, 110, 134, 137, 175 data term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
albedo . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120, 122 dense correspondence search . . . . 62, 96, 152
Aliasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
aliasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43, 49 difference of Gaussian . . . . . . . . . . . . . . . . . . 105
anti-aliasing filter . . . . . . . . . . . . . . . . . . . . . . . 51 difference of Gaussian, pyramid . . . 106, 129
aperture problem . . . . . . . . . . . . . . . . . . . . . . 126 diffuse reflection . . . . . . . . . . . . . . . . . . . . . . . 119
appearance-based tracking . . . . . . . . . . . . . 174 direct linear calibration transformation . . 75
attenuate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 discrete cosine transform . . . . . . . . . . . . . . . . 52
discriminative . . . . . . . . . . . . . . . . . . . . . . . . . 161
B disocclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60, 91 disparity . . . . . . . . . . . . . . . . . . . . . . . . . . . 59, 111
bed of nails . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 dynamic programming . . . . . . . . . . . . . . . . . . 63
Bhattacharyya coefficient . . . . . . . . . . . . . . 154 dynamics model . . . . . . . . . . . . . . . . . . . 138, 175
binary classifier . . . . . . . . . . . . . . . . . . . . . . . . 180
binocular fusion . . . . . . . . . . . . . . . . . . . . . . . . . 58 E
boosting . . . . . . . . . . . . . . . . . . . . . 180, 181, 197 edge-preserving filter . . . . . . . . . . . . . . . . . . . . 20
BRDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 eigenfaces . . . . . . . . . . . . . . . . . . . . 171, 174, 175
brightness constancy constraint . . . 126, 134 energy minimization . . . . . . . . . . . . . . . . . . . . 64
epipolar constraint . . . . . . . . . . . . . . . 60, 91, 94
C epipolar geometry . . . . . . . 60, 92, 93, 97, 110
calibration matrix . . . . . . . . . . . . . . . . . . . . . . . 71 epipolar line . . . . . . . . . . 60, 60, 65, 92, 94, 96
Canny edge detector . . . . . . . . . . . . . 29, 31, 36 epipolar plane . . . . . . . . . . . . . . . . . . . . . . . 60, 91
center of projection . . . . . . . 54, 57, 59, 65, 70 epipole . . . . . . . . . . . . . . . . . . . . . . . . . . 60, 94, 96
classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 essential matrix . . . . . . . . . . . . . . . . . . 92, 93, 97
comb function . . . . . . . . . . . . . . . . . . . . . . . 49, 51 Euler angles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
computer vision . . . 11, 97, 98, 118, 125, 151, extrinsic parameter matrix . . . . . . . 69, 93, 94
157, 179, 192 extrinsic parameters . . . . . . . . . . . . . . . . . . . . 65
convolution . . . . . . . . . . . . . . . . . . . . . . . . . . 18, 47
correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 F
correlation filter, non-uniform weights . . . 15 feature points . . . . . . . . . . . . . . . . . 98, 179, 194
correlation filter, uniform weights . . . . . . . 15 feature vector . . . . . . . . . . . . . . . . . . . . . 108, 109
cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162, 179 finite difference . . . . . . . . . . . . . . . . . . . . . . . . . 26


focal length . . . . . . . . . . . . . . . . . 56, 75, 92, 133 image warping . . . . . . . . . . . . . . . . . . . . . . . . . 129


Fourier basis set . . . . . . . . . . . . . . . . . . . . . . . . 44 impulse function . . . . . . . . . . . . . . . . . . . . . 17, 46
Fourier series . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 impulse response . . . . . . . . . . . . . . . . . . . . . . . . 17
Fourier transform . . . . . . . . . . . . . . . . . . . . 43, 45 impulse train . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Fourier transform, 2D . . . . . . . . . . . . . . . . . . . 47 independent component analysis . . . . . . . 169
Fourier transform, discrete . . . . . . . . . . . . . . 47 inliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Fourier transform, inverse . . . . . . . . . . . . . . . 46 integral image . . . . . . . . . . . . . . . . . . . . . . . . . 182
frequency spectrum . . . . . . . . . . . . . . . . . . . . . 45 intensity . 11, 12, 13, 21, 26, 47, 61, 85, 105,
fundamental matrix . . . . . . . . 93, 97, 114, 117 122
interpolation . . . . . . . . . 85, 106, 132, 136, 156
G interpolation, bicubic . . . . . . . . . . . . . . . . . . . 86
Gaussian filter 16, 30, 48, 51, 100, 105, 108, interpolation, bilinear . . . . . . . . . . . . . . . . . . . 86
155, 164 interpolation, nearest neighbor . . . . . . . . . . 85
Gaussian noise function . . 13, 113, 114, 140, intrinsic parameter matrix . . . . . . . 70, 92, 94
151, 175, 191 intrinsic parameters . . . . . . . . . . . . . . . . . . . . . 65
general motion model . . . . . . . . . . . . . . . . . . 133
generalized Hough transform . . . . . . . 39, 117 K
generative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 Kalman filter . . . . . . . . . . . . . . . . . . . . . 140, 150
generative model . . . . . . . . . . . . . . . . . . 113, 114 Kalman gain . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
geometric camera calibration . . . . . . . . . . . . 65 kernel . . . . . . . . . . . . . . . . . . . 16, 26, 29, 48, 105
gradient . . 26, 36, 38–40, 100, 101, 103, 108, kernel function . . . . . . . . . . . . . . . . . . . . . . . . . 190
201, 203 kernel trick . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Kronecker delta function . . . . . . . . . . . . . . . . 49
H
Haar wavelet . . . . . . . . . . . . . . . . . . . . . . 109, 182
L
Lagrange multiplier . . . . . . . . . . . . . . . . . . . . 201
Harris corners . . . . . . . . . . . . . . . . . . . . . 100, 179
Lambert’s law . . . . . . . . . . . . . . . . . . . . . . . . . 120
Harris detector algorithm . . . . . . . . . . . . . . 104
Lambertian BRDF . . . . . . . . . . . . . . . . . . . . 120
Harris response function . . . . . . . . . . . . . . . 104
Laplacian . . . . . . . . . . . . . . . . . . . . . . . . . . 30, 105
Harris-Laplace detector . . . . . . . . . . . . . . . . 106
least squares . . 73, 75, 83, 95, 112, 116, 128,
hidden Markov model . . . . . . . . . . . . . . . . . . 138
135, 166, 199, 200
homogeneous image coordinates . . . . . . . . . 54
low-pass filters . . . . . . . . . . . . . . . . . . . . . 50, 123
homography . . . 80, 83, 84, 96, 114, 116, 134
Lucas-Kanade method . . . . . . . 128, 135, 136
horizon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Lucas-Kanade method, hierarchical . . . . 129,
Hough accumulator array . . . . . . . . . . . . . . . 35
132, 136
Hough algorithm . . . . . . . . . . . . . . . . . . . . . . . . 33
Lucas-Kanade method, iterative . . . . . . . 129
Hough space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Lucas-Kanade method, sparse . . . . . . . . . . 132
Hough table . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Hough transform . . . . . . . . . . . . . . . . . . . . . . . . 32 M
Hough transformation algorithm . . . . . . . . 35 mean-shift algorithm . . . . . . . . . . . . . . . . . . . 153
hysteresis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Median filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
model fitting . . . . . . . . . . . . . . . . . . . . . . . . . . 110
I moving average . . . . . . . . . . . . . . . . . . . . . . . . . 14
ideal lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
ideal points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 N
image mosaics . . . . . . . . . . . . . . . . . . . . . . . . . . 82 nearest-neighbor . . . . . . . . . . . . . 180, 180, 197
image rectification . . . . . . . . . . . . . . . . . . . . . . 84 noise function . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
image subsampling . . . . . . . . . . . . . . . . . 51, 129 non-analytic models . . . . . . . . . . . . . . . . . . . . . 39
image warp . . . . . . . . . . . . . . . . . . . . . . . . . . 82, 85 normalized correlation . . . . . . . . . . 22, 62, 107


O roll . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
observation model . . . . . . . . . . . . . . . . 138, 175
occlusion . . . . . . . . . . . . . . 62, 64, 110, 137, 177 S
occlusion boundaries . . . . . . . . . . . . . . . . 62, 99 second moment matrix . . . . . . . . . . . . 101, 129
optic flow . . . . . . . . . . . . . . . . . . . . . . . . . 125, 136 Sharpening filter . . . . . . . . . . . . . . . . . . . . . . . . 20
outliers . . . . . . . . . . . . . . . . . . . . . . 110, 113, 114 shearing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80, 110
SIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105, 160
P SIFT descriptor . . . . . . . . . . . . . . . . . . . . . . . 107
panoramas . . . . . . . . . . . . . . . . . . . . . . . . . 82, 194 similarity transformation . . . . . . . . . . . . . . . . 80
parallax motion . . . . . . . . . . . . . . . . . . . . . . . . 134 singular value decomposition 73, 83, 95, 110,
parameter space . . . . . . . . . . . . . . . . . . . . . . . . 32 113, 177
particle filtering . . . . . . . . . . . . . . . . . . . 145, 175 slack variable . . . . . . . . . . . . . . . . . . . . . . . . . . 191
perspective, weak . . . . . . . . . . . . . . . . . . . . . . . 57 smoothness term . . . . . . . . . . . . . . . . . . . . . . . . 64
perturbation . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 Sobel operator . . . . . . . . . . . . . . . . . . . . . . 27, 29
Phong reflection model . . . . . . . . . . . . . . . . 121 specular reflection . . . . . . . . . . . . . . . . . . . . . . 119
pitch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 splatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
point-line duality . . . . . . . . . . . . . 89, 92–94, 96 stereo correspondence . . . . . . 61, 91, 111, 126
power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 stochastic universal sampling . . . . . . . . . . 148
Prewitt operator . . . . . . . . . . . . . . . . . . . . . . . . 27 support vector machine . . . . . . 180, 185, 197
principal component analysis . 151, 164, 176
principle point . . . . . . . . . . . . . . . . . . . . . . . . . . 90 T
projection . . . . . . . . . . . . . . . . . . . . . . 53, 79, 133 Taylor expansion . . . . . . . . . . . . . . . . . . 100, 126
projection plane . . . . . . . . . . . . . . 54, 82, 87, 89 template matching . . . . . . . . . . . . . . . . . . . . . . 22
projection, orthographic . . . . . . . . . . . . 57, 134 total rigid transformation . . . . . . . . . . . . . . . 68
projection, perspective . . . . . . . . . . . . . 55, 134
projective geometry . . . . . . . . . . 61, 86, 94, 96 V
putative matches . . . . . . . . . . . . . . . . . . 111, 194 vanishing point . . . . . . . . . . . . . . . . . . . . . . . . . 56
variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
R Viola-Jones detector . . . . . . . . . . . . . . . . . . . 181
radial basis function . . . . . . . . . . . . . . . . . . . 191 visual code-words . . . . . . . . . . . . . . . . . . . . . . . 39
radiance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 voting . . . . . . . . . . . . . . . . . . . . . 32, 36, 112, 192
RANSAC . . . . . . . . . . . . . . . . . . . . . . . . . 114, 194
resectioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 W
retinex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 weak calibration . . . . . . . . . . . . . . . . . . . . . . . . 92
right derivative . . . . . . . . . . . . . . . . . . . . . . . . . 26 weak learner . . . . . . . . . . . . . . . . . . . . . . 181, 181
rigid body transformation . . . . . . . . . . . . . . . 79 weighted moving average . . . . . . . . . . . . . . . . 14
risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Roberts operator . . . . . . . . . . . . . . . . . . . . . . . . 27 Y
robust estimator . . . . . . . . . . . . . . . . . . . . . . . 113 yaw . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
