Computer Vision Panikzettel

Hans Wurst
September 3, 2021

panikzettel.philworld.de

Contents

1 Image Processing
  1.1 Image Formation
  1.2 Linear Filters
  1.3 Nonlinear Filters
  1.4 Multi-Scale Representations
  1.5 Edge Detection
    1.5.1 Filters as Templates
    1.5.2 Image Gradients
    1.5.3 2D Edge Detection Filters
    1.5.4 Canny Edge Detector
  1.6 Fitting Techniques
    1.6.1 Hough Transform
    1.6.2 RANSAC (RANdom SAmple Consensus)
2 Segmentation
  2.1 Segmentation as Clustering
    2.1.1 k-Means
    2.1.2 Probabilistic Clustering
    2.1.3 Model-free Clustering
  2.2 Graph-Theoretic Segmentation
    2.2.1 Segmentation as Energy Minimization
    2.2.2 Graph Cuts for Image Segmentation
3 Object Recognition and Categorization
  3.1 Sliding-Window Object Detection
  3.2 Gradient-based Representation
  3.3 Classifier Construction
    3.3.1 Classification with SVMs (Support Vector Machines)
    3.3.2 Classification with Boosting
4 Local Features and Matching
  4.1 Local Features - Detection and Description
    4.1.1 Local Invariant Features
    4.1.2 Keypoint Localization
    4.1.3 Scale Invariant Region Selection
    4.1.4 Local Descriptors
  4.2 Recognition with Local Features
    4.2.1 Finding Consistent Configurations
    4.2.2 Affine Estimation
    4.2.3 Homography Estimation
5 Deep Learning
  5.1 Neural Networks
    5.1.1 Background: Deep Learning
  5.2 Convolutional Neural Networks (CNNs)
  5.3 CNN Architectures
    5.3.1 LeNet
    5.3.2 AlexNet
    5.3.3 VGGNet
    5.3.4 GoogLeNet
    5.3.5 ResNet
    5.3.6 Transfer Learning with CNNs
  5.4 Practical Advice on CNN Training
    5.4.1 Data Augmentation
    5.4.2 Initialization
    5.4.3 Batch Normalization
    5.4.4 Dropout
    5.4.5 Learning Rate Schedules
  5.5 CNNs for Object Detection
    5.5.1 R-CNN
    5.5.2 Fast R-CNN
    5.5.3 Faster R-CNN
    5.5.4 Mask R-CNN
    5.5.5 YOLO/SSD
  5.6 CNNs for Segmentation
    5.6.1 Fully Convolutional Networks (FCN)
    5.6.2 Encoder-Decoder Architecture
    5.6.3 Transpose Convolutions
    5.6.4 Skip Connections
    5.6.5 Extensions
    5.6.6 Examples
  5.7 CNNs for Human Body Pose Estimation
  5.8 CNNs for Matching
    5.8.1 Siamese Networks
    5.8.2 Triplet Loss
  5.9 Recurrent Networks
6 3D Reconstruction
  6.1 Epipolar Geometry and Stereo Basics
    6.1.1 Calibrated Case: Essential Matrix
  6.2 Stereopsis and 3D Reconstruction
  6.3 Stereo Image Rectification
  6.4 Disparity
    6.4.1 Dense Correspondence Search
    6.4.2 Sparse Correspondence Search
  6.5 Camera Calibration
    6.5.1 Camera Models/Parameters
    6.5.2 Calibration Procedure
  6.6 Uncalibrated Reconstruction
    6.6.1 Triangulation
    6.6.2 Uncalibrated Case: Fundamental Matrix
    6.6.3 Stereo Pipeline with Weak Calibration
    6.6.4 Extension: Epipolar Transfer
  6.7 Structure-from-Motion (SfM)
1 Image Processing

1.1 Image Formation

Lenses: Increasing the pinhole size of a pinhole camera to increase the amount of light causes blur. Lenses keep the image in sharp focus while gathering light from a larger area.

Thin Lens Model: valid if the lens thickness is small compared to the radius of curvature. Thin lens equation:

    1/z′ − 1/z = 1/f

In a thin lens, scene points at distinct depths come into focus at different image planes.

Depth of Field: Distance between image planes where blur is tolerable; a smaller aperture (Blende) increases the range in which the object is approximately in focus.

Field of view depends on the focal length f:
• f ↓: image becomes more wide angle, more world points project into the finite image plane
• f ↑: image becomes more telescopic, a smaller part of the world projects onto the finite image plane

1.2 Linear Filters

Types of noise (i.i.d. = "independent, identically distributed"):
• Salt and pepper noise: random occurrences of black and white pixels
• Impulse noise: random occurrences of white pixels
• Gaussian noise: variations in intensity drawn from a Gaussian distribution

    f(x, y) = f̄(x, y) + η(x, y),   η(x, y) ∼ N(µ, σ)

with f̄ the ideal image and η the noise process.

Correlation Filtering: Replace each pixel by a weighted combination of its neighbors.

    G[i, j] = ∑_{u=−k}^{k} ∑_{v=−k}^{k} H[u, v] F[i + u, j + v] = H ⊗ F

Convolution Filtering: Flip the filter in both dimensions, then apply correlation.

    G[i, j] = ∑_{u=−k}^{k} ∑_{v=−k}^{k} H[u, v] F[i − u, j − v] = H ⋆ F

with averaging window size (2k + 1) × (2k + 1), input image F, output image G, and kernel/mask with non-uniform weights H. Convolution is separable into row and column 1D filters (linear).
(If H[u, v] = H[−u, −v], then correlation = convolution.)

Gaussian Smoothing: Weigh nearby pixels more than distant ones (→ "fuzzy blob") using a Gaussian kernel with variance σ²:

    G_σ(x, y) = 1/(2πσ²) · e^{−(x² + y²)/(2σ²)}

Box Filter: For every pixel, average every neighbor over the number of neighbors.
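To make the filtering formulas concrete, here is a minimal NumPy sketch of correlation filtering with a Gaussian kernel. It is only an illustration of the definitions above (zero padding at the border is an assumption, not something this summary specifies); convolution would simply use the flipped kernel, which for the symmetric Gaussian gives the same result.

```python
import numpy as np

def gaussian_kernel(sigma, k):
    """(2k+1) x (2k+1) Gaussian kernel G_sigma, normalized to sum to 1."""
    u, v = np.mgrid[-k:k + 1, -k:k + 1]
    g = np.exp(-(u**2 + v**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    return g / g.sum()

def correlate2d(F, H):
    """G[i, j] = sum_{u,v} H[u, v] * F[i+u, j+v], with zero padding at the border."""
    k = H.shape[0] // 2
    Fp = np.pad(F.astype(float), k)              # pad so every window fits
    G = np.zeros(F.shape, dtype=float)
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            G[i, j] = np.sum(H * Fp[i:i + 2 * k + 1, j:j + 2 * k + 1])
    return G

# usage sketch: smoothed = correlate2d(img, gaussian_kernel(sigma=1.0, k=2))
```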
1.3 Nonlinear Filters

Median Filter

Replace each pixel by the median of its neighbors. Properties:
• doesn't introduce new pixel values
• removes spikes: good for impulse and salt-and-pepper noise
• non-linear
• edge preserving
• better results than the Gaussian, even with a small kernel

1.4 Multi-Scale Representations

Fourier
• Map a function onto a frequency spectrum.
• The better a function is localized in one domain, the worse it is localized in the other ("compression" ↔ "stretching", and vice versa).
• Convolving in the image domain corresponds to a product in the frequency domain:

    f ⋆ g  ↔  F · G

Noise introduces high frequencies; the Gaussian is a convenient choice for a low-pass filter.

Nyquist theorem: In order to recover a certain frequency f, we need to sample with at least 2f. This corresponds to the point at which the transformed frequency spectra start to overlap (Nyquist limit).

Image pyramids provide an efficient representation for scale-invariant processing.

Gaussian Pyramid
• Create each level from the previous one with the "smooth and sample" principle.
• Smooth with Gaussians, in part because G(σ₁) ∗ G(σ₂) = G(√(σ₁² + σ₂²)).
• Gaussians are low-pass filters, so the representation is redundant once smoothing is performed: no need to store smoothed images at the full original resolution.

Laplacian Pyramid
• Laplacian ∼ Difference of Gaussians (DoG): cheap approximation

1.5 Edge Detection

1.5.1 Filters as Templates

Think of filters as a dot product of the filter vector with the image region. The angle (similarity) between the two vectors can be measured by normalizing the length of each vector to 1 and taking the dot product: cos θ = (a · b) / (|a| |b|). Filters look like the effects they are intended to find, and they find the effects that look like them.

1.5.2 Image Gradients

For partial derivative filters we get Hx = (1, 0, −1) = Hyᵀ. The images of the partial derivatives are slightly shifted; therefore, they shouldn't be used on their own.

• ∇f = [∂f/∂x, ∂f/∂y]: gradient, points in the direction of the most rapid intensity change
• θ = tan⁻¹(∂f/∂y / ∂f/∂x): gradient direction (orientation of the edge normal)
• ‖∇f‖ = √((∂f/∂x)² + (∂f/∂y)²): gradient magnitude (edge strength)
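A minimal sketch of these gradient quantities in NumPy, using the (1, 0, −1) derivative filter from above; the Gaussian pre-smoothing and σ = 1 are assumptions chosen for illustration.

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def image_gradients(img, sigma=1.0):
    """Gradient magnitude and orientation using the simple (1, 0, -1) derivative filter."""
    smoothed = gaussian_filter(img.astype(float), sigma)   # suppress noise first
    dx = np.array([[1.0, 0.0, -1.0]])                      # Hx = (1, 0, -1)
    dy = dx.T                                               # Hy = Hx^T
    Ix = convolve(smoothed, dx)
    Iy = convolve(smoothed, dy)
    magnitude = np.hypot(Ix, Iy)                            # ||grad f||
    orientation = np.arctan2(Iy, Ix)                        # gradient direction
    return magnitude, orientation
```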
1.5.3 2D Edge Detection Filters

The gradient filter amplifies noise, so smooth with a Gaussian first.

Derivative of Gaussian (DoG): ∂/∂x h_σ(u, v)

    ∂/∂x (h ⋆ f) = (∂/∂x h) ⋆ f = ((1, 0, −1) ⋆ h) ⋆ f

Laplacian of Gaussian (LoG): ∇² h_σ(u, v), with ∇²f = ∂²f/∂x² + ∂²f/∂y²

    ∂²/∂x² (h ⋆ f) = (∂²/∂x² h) ⋆ f

Image Derivatives

G is a 1D Gaussian filter, D is a 1D derivative-of-Gaussian filter, and I is the image:

    Ix  = Gᵀ ⋆ (D ⋆ I)
    Iy  = Dᵀ ⋆ (G ⋆ I)
    Ixx = Gᵀ ⋆ (D ⋆ Ix)
    Iyy = Dᵀ ⋆ (G ⋆ Iy)
    Ixy = Dᵀ ⋆ (G ⋆ Ix)

1.5.4 Canny Edge Detector

An "optimal" edge detector should have good detection, good localization, and a single response.

Primary edge detection steps:
1. Smoothing: suppress noise.
2. Edge enhancement: filter for contrast.
3. Edge localization:
   • Determine which local maxima from the filter output are actually edges vs. noise.
   • Thresholding, thinning.

Scale: Use the σ of the Gaussian to set the scale at which edges will later be extracted.

Sensitivity: threshold the filter output with a threshold t:

    F_T[i, j] = 1, if F[i, j] ≥ t (on);  0, otherwise (off)

where t is the threshold and F[i, j] the pixel value.

Canny Edge Detector steps:
1. Filter the image with a derivative of Gaussian.
2. Find the magnitude and orientation of the gradient.
3. Non-maximum suppression: thin multi-pixel wide "ridges" down to single-pixel width. Check whether a pixel is a local maximum along the gradient direction, i.e. select a single maximum across the width of the edge.
   a) Compute the interpolated pixels p and r.
   b) Keep q iff Mag(q) > Mag(p) and Mag(q) > Mag(r).
4. Linking and (hysteresis) thresholding:
   • Define two thresholds: k_low and k_high (k_high / k_low = 2).
   • Use the high threshold to start edge curves and the low threshold to continue them, until no pixel along the edge is above the low threshold.

1.6 Fitting Techniques

1.6.1 Hough Transform

Many objects are characterized by the presence of straight lines. The Hough Transform is a voting technique that answers the three main questions of line fitting:
• Given points that belong to a line, what is the line?
• How many lines are there?
• Which points belong to which lines?

Idea:
1. Vote for all possible lines on which each edge point could lie.
2. Look for line candidates that get many votes.
3. Noise features will cast votes too, but their votes should be inconsistent.
Hough Space
• set of points (x, y) ↦ (m, b) such that y = mx + b
  (a line in the image corresponds to a point in Hough space)
• point (x₀, y₀) ↦ (m, −x₀·m + y₀) = (m, b)
  (a point in the image corresponds to a line in Hough space)
• two points (x₀, y₀), (x₁, y₁) correspond to b = −x₀·m + y₀ = −x₁·m + y₁
  (two points in the image correspond to the intersection of the two lines, which is a point in Hough space)

Polar Representation for lines (the (m, b) representation is undefined for vertical lines: infinite values):

    x · cos θ + y · sin θ = d

with d the perpendicular distance from the line to the origin and θ the angle d makes with the x-axis. A point in image space corresponds to a sinusoid segment in Hough space.

Hough Transform Algorithm

Let each edge point in image space vote for a set of possible parameters in Hough space. The Hough transform subdivides the Hough space into a discrete set of bins; increase the vote count in each bin that the line passes through. Find peaks as local maxima of the Hough space (→ non-maximum suppression filter: is the value in the center larger than the values of its 8 neighbors?).

1. Init: H[d, θ] = 0
2. For each edge point (x, y) in the image:
     For θ = 0 to 180:
       d = x · cos θ + y · sin θ
       H[d, θ] += 1
3. Find the value(s) of (d̂, θ̂) where H[d̂, θ̂] is maximal.
4. The detected line in the image is given by d̂ = x · cos θ̂ + y · sin θ̂.

Noise makes the maximum point in Hough space spread over a larger area.

Extensions

1) Use the image gradient instead of iterating over all possible directions θ:

    θ = gradient direction at (x, y) = tan⁻¹(∂f/∂y / ∂f/∂x)

   → reduces the degrees of freedom of the voting space.
2) Give more votes to stronger edges (use the magnitude of the gradient).
3) Change the sampling of (d, θ) to give more/less resolution.
4) The same procedure can be used with circles, squares, or any other shape.

Extension to Circles

Circle equation with center (a, b) and radius r: (xᵢ − a)² + (yᵢ − b)² = r²

Algorithm:

    For every edge pixel (x, y):
      For each possible radius value r:
        For each possible gradient direction θ:   // or use the estimated gradient
          a = x − r cos θ
          b = y + r sin θ
          H[a, b, r] += 1
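Here is a minimal NumPy sketch of the line-voting algorithm above. It is only illustrative: θ is sampled in 1° steps and d is rounded to integer bins and shifted so negative distances fit into the accumulator; peak selection is reduced to a single argmax instead of full non-maximum suppression.

```python
import numpy as np

def hough_lines(edge_mask, n_theta=180):
    """Vote in (d, theta) space for every edge pixel of a boolean H x W mask."""
    H_img, W_img = edge_mask.shape
    thetas = np.deg2rad(np.arange(n_theta))            # theta = 0 .. 179 degrees
    d_max = int(np.ceil(np.hypot(H_img, W_img)))       # largest possible |d|
    acc = np.zeros((2 * d_max + 1, n_theta), dtype=int)
    ys, xs = np.nonzero(edge_mask)
    for x, y in zip(xs, ys):
        ds = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[ds + d_max, np.arange(n_theta)] += 1       # one vote per (d, theta) bin
    d_idx, t_idx = np.unravel_index(acc.argmax(), acc.shape)
    return d_idx - d_max, np.rad2deg(thetas[t_idx]), acc   # strongest (d, theta)
```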
Generalized Hough Transform

– For each boundary point, store its displacement vector to a chosen reference point as r := (r_k, α_k), where r_k is the vector's length and α_k the vector's orientation.
– Use the retrieved r vectors to vote for the position of the reference point. The accumulator array has the two coordinates of the unknown object center a as axes:

    x_c = x_i ± r_k cos α_k
    y_c = y_i ± r_k sin α_k
    A(x_c, y_c) += 1

• The peak in this Hough space is the reference point with the most supporting edges.

1.6.2 RANSAC (RANdom SAmple Consensus)

Alternative strategy for line fitting: sample points and fit a line to them using least-squares regression. RANSAC only returns a "good" result with a certain probability, but this probability increases with the number of iterations.

RANSAC loop:
1. Randomly select a seed group of points on which to base the transformation estimate (e.g., a group of matches).
2. Compute the transformation from the seed group.
3. Find inliers to this transformation (inlier = point within a certain distance to the transformation/line).
4. If the number of inliers is sufficiently large, re-compute the least-squares estimate of the transformation on all of the inliers.
5. Repeat until some termination criterion is met (e.g. #iterations).
→ Keep the transformation with the largest number of inliers!

Improve this initial estimate with an estimation over all inliers (e.g. with standard least-squares minimization). But this may change the inliers, so alternate fitting with reclassification as inlier/outlier.

RANSAC is not limited to fitting lines; it can also be applied to arbitrary transformation models. Then the inliers are those points whose transformation error (distance of the transformed point to its corresponding point in the other image) is below a certain threshold.

In many practical situations the percentage of outliers is very high (≥ 90%), but RANSAC is only applicable with < 50% outliers. In this case, use the Generalized Hough Transform instead.

2 Segmentation

Gestalt factors (make intuitive sense, but are very difficult to translate into algorithms):
• proximity, similarity, common fate, common region
• parallelism, symmetry, continuity, closure

2.1 Segmentation as Clustering

The best cluster centers are those that minimize the SSD (Sum of Squared Distances) between all points and their nearest cluster center c_i:

    ∑_{clusters i} ∑_{points p in cluster i} ‖p − c_i‖²

2.1.1 k-Means

1. Randomly initialize the cluster centers.
2. Determine the points in each cluster: for each point p, find the closest c_i and put p into cluster i.
3. Set c_i to be the mean of the points in cluster i.
4. If any c_i has changed, repeat from Step 2.

k-Means++ (prevents arbitrarily bad local minima):
1. Randomly choose the first center.
2. Pick a new center with probability proportional to ‖p − c_i‖² (distance to the already chosen centers c_i, i ∈ [1, k − 1]).
3. Repeat until there are k centers.
→ expected error = O(log k) · optimal

Feature Space

The feature space determines what the pixels are grouped by:
• intensity similarity (1D intensity value as feature space)
• color similarity (3D color value as feature space)
• texture similarity (24D filter bank responses as feature space)
• intensity + position similarity (simple way to encode both similarity and proximity)

k-Means for Clustering:
1. Collect feature vectors for all pixels in an image.
2. Apply k-Means with a predefined number k of segments/clusters on those vectors.
3. Assign one segment per cluster.

Pros and Cons:
+ simple, fast to compute
+ converges to a local minimum of the within-cluster squared error (always finds some local minimum)
- setting k
- sensitive to initial centers and outliers
- detects spherical clusters only
- assumes means can be computed
- NP-hard, even with k = 2
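A minimal NumPy sketch of k-Means used for color-based segmentation, following the steps above. The random initialization and the fixed iteration cap are illustrative choices, not part of this summary.

```python
import numpy as np

def kmeans(features, k, n_iter=100, rng=np.random.default_rng(0)):
    """Plain k-Means on an (N, d) feature matrix; returns labels and centers."""
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each feature vector to the closest center
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 3: recompute each center as the mean of its cluster (keep old if empty)
        new_centers = np.array([features[labels == i].mean(axis=0)
                                if np.any(labels == i) else centers[i]
                                for i in range(k)])
        if np.allclose(new_centers, centers):   # step 4: stop when nothing changed
            break
        centers = new_centers
    return labels, centers

# usage sketch for color similarity (3D feature space), img being an H x W x 3 array:
# labels, _ = kmeans(img.reshape(-1, 3).astype(float), k=3)
# segmentation = labels.reshape(img.shape[:2])
```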
2.1.2 Probabilistic Clustering

Instead of treating the data as a bunch of points, assume that they are all generated by sampling a continuous function. This function is called a generative model; it is defined by a vector of parameters θ, which we want to compute.

Mixture of Gaussians (MoG)

K Gaussian blobs (ellipses) with means µ_j, covariance matrices Σ_j, dimensionality D:

    p(x | θ_j) = 1 / ((2π)^{D/2} |Σ_j|^{1/2}) · exp{ −(1/2)(x − µ_j)ᵀ Σ_j⁻¹ (x − µ_j) }

    p(x | θ) = ∑_{j=1}^{K} π_j p(x | θ_j),   θ = (π_1, µ_1, Σ_1, ..., π_K, µ_K, Σ_K)

Expectation Maximization (EM)

Goal: Find the blob parameters θ that maximize the likelihood function

    p(data | θ) = ∏_{n=1}^{N} p(x_n | θ)

Idea: Given the Gaussian shapes, assign points to clusters; given the assigned points, approximate the Gaussian shapes.

Approach:
1. Randomly initialize the shapes of the Gaussian blobs.
2. E-step: Given the current guess of the blobs, compute the ownership of each point.
3. M-step: Given the ownership probabilities, update the blobs to maximize the likelihood function.
4. Repeat until convergence.

MoG Color Models for Segmentation:
1. The user marks two regions for foreground and background.
2. Learn a MoG model for the color values in each region.
3. Use those models to classify all other pixels.

Pros and Cons:
+ probabilistic interpretation
+ soft assignments between data points and clusters
+ generative model, can predict novel data points
+ relatively compact storage
- local minima
- initialization (often a good idea to start with some k-means iterations)
- need to know the number of components (solution: model selection)
- numerical problems are often a nuisance (Ärger)

2.1.3 Model-free Clustering

Mean-Shift Algorithm
1. Initialize a random seed and window W.
2. Calculate the center of gravity (the "mean") of W: ∑_{x∈W} x H(x) (often with a Gaussian profile). Here H is the height of the corresponding histogram bin.
3. Shift the search window to the mean.
4. Repeat from Step 2 until convergence.

To use mean-shift for clustering:
• Cluster: all data points in the attraction basin of a mode (= local maximum of the density of a given distribution).
• Attraction basin: the region for which all trajectories (Flugbahnen) lead to the same mode.

1. Find features (color, gradients, texture, etc.).
2. Initialize windows at individual pixel locations.
3. Perform mean shift for each window until convergence.
4. Merge windows that end up near the same "peak" or mode (for plateaus).

Speed-ups to mitigate the computational complexity:
• Assign all points within radius r of the end point to the mode.
• Assign all points within radius r/c of the search path to the mode.

Pros and Cons:
+ model free: does not assume any prior shape of the data clusters
+ single parameter h (window size)
+ variable number of modes
+ robust to outliers of the feature space
- output depends on the window size
- window size selection is not trivial
- computationally expensive
- does not scale well with dimension
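A minimal sketch of mean-shift mode seeking on a set of feature vectors, using a flat window of radius h instead of the Gaussian profile mentioned above (that simplification, as well as the tolerance and iteration cap, are assumptions for illustration).

```python
import numpy as np

def mean_shift_mode(points, seed, h=1.0, n_iter=100, tol=1e-5):
    """Shift a window of radius h from `seed` until it converges to a density mode."""
    mean = seed.astype(float)
    for _ in range(n_iter):
        in_window = np.linalg.norm(points - mean, axis=1) < h
        if not np.any(in_window):
            break
        new_mean = points[in_window].mean(axis=0)   # center of gravity of the window
        if np.linalg.norm(new_mean - mean) < tol:   # converged to a mode
            break
        mean = new_mean
    return mean

# clustering sketch: run from every point and merge modes that end up closer than h/2
# modes = np.array([mean_shift_mode(feats, p, h=0.5) for p in feats])
```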
2.2 Graph-Theoretic Segmentation

2.2.1 Segmentation as Energy Minimization

    E(x) = ∑_i φ(x_i) + ∑_{i,j} ψ(x_i, x_j)

with
• E: energy function
• φ: single-node/unary potentials
  – Encode information about a given pixel/patch: how likely is a pixel/patch to belong to a certain class?
• ψ: pairwise potentials
  – Encode neighborhood information: how different is a pixel/patch's label from that of its neighbor?

2.2.2 Graph Cuts for Image Segmentation

Graph-Cuts Energy Minimization

Solve an equivalent graph cut problem:
1. Introduce extra nodes: source s and sink t.
2. Weigh the connections to source/sink (t-links) by φ(x_i = s) and φ(x_i = t), respectively.
3. Weigh the connections between nodes (n-links) by ψ(x_i, x_j).
4. Find the minimum cost cut that separates source from sink.
⇒ The solution is equivalent to the minimum of the energy.
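To make the energy concrete, here is a tiny sketch that evaluates E(x) with unary costs and a Potts-style pairwise penalty on a 4-connected grid, and minimizes it by brute-force enumeration. This is only to illustrate what the s-t min-cut computes; the Potts form of ψ, the 4-connectivity, and the exhaustive search (feasible only for a handful of pixels) are assumptions, not the method described above.

```python
import numpy as np
from itertools import product

def mrf_energy(labels, unary, w):
    """E(x) = sum_i phi(x_i) + sum_(i,j) psi(x_i, x_j); unary is H x W x 2, labels H x W."""
    H, W = labels.shape
    E = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    E += w * np.sum(labels[1:, :] != labels[:-1, :])   # vertical n-links
    E += w * np.sum(labels[:, 1:] != labels[:, :-1])   # horizontal n-links
    return E

def minimize_by_enumeration(unary, w):
    """Try every binary labeling of a tiny image and keep the lowest-energy one."""
    H, W = unary.shape[:2]
    best, best_E = None, np.inf
    for bits in product([0, 1], repeat=H * W):
        labels = np.array(bits).reshape(H, W)
        E = mrf_energy(labels, unary, w)
        if E < best_E:
            best, best_E = labels, E
    return best, best_E
```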
s-t-Mincut Algorithm

Solve the dual maximum flow problem: the maximum flow equals the cost of the s-t-mincut that we want to minimize (Min-cut/Max-flow Theorem). The cost of an s-t-cut is the sum of the costs of all edges going from set S to set T.

When applicable? Assuming non-negative capacities, we get the algorithm:
1. Find a path from source to sink with positive capacity.
2. Push the maximum possible flow through this path.
3. Adjust the capacities of the used edges and record "residual flows" (backwards flow).
4. Repeat until no path can be found.

α-Expansion

Algorithms for non-binary cases are no longer guaranteed to return the globally optimal result, but an approximation; the problem is NP-hard with 3 or more labels.
Idea: Break the multi-way cut computation into a sequence of binary s-t cuts.
1. Start with any initial solution.
2. For each label "α" in any order:
   a) Compute the optimal α-expansion move (s-t graph cuts).
   b) Decline the move if there is no energy decrease.
3. Stop when no expansion move would decrease the energy.

Pros and Cons:
+ powerful technique, based on a probabilistic model (MRF)
+ applicable for a wide range of problems
+ very efficient algorithms available
+ becoming a de-facto standard
- graph cuts can only solve a limited class of models: submodular energy functions; they capture only part of the expressiveness of MRFs
- only approximate algorithms available for multi-label cases

3 Object Recognition and Categorization

Idea:
• Represent each object (view) by a global descriptor (= feature vector).
• For recognizing objects, just match the descriptors.
• Some modes of variation are built into the descriptor; the others have to be incorporated into the training data.

Identification: Find a particular object. Categorization: Recognize any object of a class.

3.1 Sliding-Window Object Detection

Idea: If the object may be in a cluttered scene, slide a window around looking for it. Search over space and scale with the help of a binary classifier.

Therefore, we need to:
1. Obtain training data.
2. Define features.
3. Define a classifier.

Pros and Cons:
+ simple detection protocol to implement
+ good feature choices critical
+ past successes for certain classes
- high computational complexity
- with so many windows, the false positive rate better be low
- non-rigid, deformable objects are not captured well with representations assuming a fixed 2D structure
- objects with less-regular textures are not captured well with holistic appearance-based descriptions
- if considering windows in isolation, context is lost
- not all objects are "box" shaped

3.2 Gradient-based Representation

Idea: Consider edges, contours, and (oriented) intensity gradients. Summarize the local distribution of gradients with histograms.

Histograms of Oriented Gradients (HoG)
Divide the image window into small spatial regions ("cells") and map each grid cell in the input window to a histogram counting the matching gradients per orientation.

3.3 Classifier Construction

Learn a decision rule (classifier) assigning image features to different classes.
Line equation for a linear classifier, which separates positive and negative examples:

    w₁x₁ + w₂x₂ + b = 0   ⟺   wᵀx + b = 0

with w being the normal of the line. Then x_n is classified as positive if wᵀx_n + b ≥ 0, and as negative if wᵀx_n + b < 0.

3.3.1 Classification with SVMs (Support Vector Machines)

Idea: The original input space can be mapped to some higher-dimensional feature space where the training set is separable (x ↦ φ(x)) — for datasets which cannot be separated by a linear hyperplane in 2D.

Kernel Trick: Instead of explicitly computing the lifting transformation φ(x), define a kernel function K(x_i, x_j) = φ(x_i)ᵀ · φ(x_j). This gives a nonlinear decision boundary in the original feature space:

    ∑_n a_n t_n K(x_n, x) + b

Since the optimization formulation uses the data points only in the form of inner products φ(x_n)ᵀφ(x_m), we never need to actually compute the lifting transformation. That's because we choose an already known kernel function, as seen below:

    linear:                   K(x_i, x_j) = x_iᵀ x_j
    polynomial of power p:    K(x_i, x_j) = (1 + x_iᵀ x_j)^p
    Gaussian (radial basis function):   K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²))

SVMs for Recognition:
1. Define your representation for each example.
2. Select a kernel function.
3. Compute pairwise kernel values between the labeled examples.
4. Pass this "kernel matrix" to SVM optimization software to identify the support vectors and weights.
5. To classify a new example: compute the kernel values between the new input and the support vectors, apply the weights, and check the sign of the output.

Assume a linear SVM classification function y(x) = wᵀx + b. Then x is the HOG feature map and w the template obtained by the SVM.

After a multi-scale dense scan, we want to suppress non-maximum detections. First, we clip the detection scores (the negative ones). We map each detection to the 3D [x, y, scale] space. Subsequently, we apply a robust mode detection, e.g. mean shift.
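A minimal sketch of the kernel decision function ∑_n a_n t_n K(x_n, x) + b with the Gaussian kernel from above. The support vectors, weights, and bias in the usage comment are hypothetical placeholders (in practice they come from the SVM optimizer, step 4 above).

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian (radial basis function) kernel K(xi, xj)."""
    return np.exp(-np.linalg.norm(xi - xj, axis=-1) ** 2 / (2.0 * sigma**2))

def svm_decision(x, support_vectors, alphas, targets, b, kernel=rbf_kernel):
    """Evaluate sum_n a_n t_n K(x_n, x) + b and classify by its sign."""
    score = np.sum(alphas * targets * kernel(support_vectors, x)) + b
    return np.sign(score), score

# hypothetical values, only to show the call:
# sv = np.array([[0.0, 1.0], [1.0, 0.0]]); a = np.array([0.7, 0.7]); t = np.array([1, -1])
# label, _ = svm_decision(np.array([0.2, 0.9]), sv, a, t, b=0.0)
```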
Basic steps: Detailed training algorithm:
1. In each iteration, AdaBoost trains
a new weak classifier hm (x) based
on the current weighting coefficients
W(m) .
2. Adapt the weighting coefficients for
each point: Increase wn if xn was mis-
classified by hm (x), else decrease.
3. Make predictions using the final com-
bined model
!
M
H (x) = sign ∑ αm hm (x)
m =1
Deformation cost can be directly computed
on top of the part filter responses through Recognition:
Generalized Distance Transform. Results • Evaluate all selected weak classifiers on test data: h1 (x), · · · , hm (x)
into transformed responses. • Final classifier is weighted combination of selected weak classifiers, as seen in
step 3 basic steps.
Viola-Jones Face Detection
Introduce a set of “rectangular” Haar filters: Subtract the
3.3.2 Classification with Boosting pixels in the white region from the pixels in the black region.
Efficiently computable with integral image, which is com-
AdaBoost
puted once per image: In every pixel ( x, y) in the integral
image we store the sum of pixels over the rectangle spanned
by the pixel and the top left image corner.
Idea: Build a strong classifier H by com-
bining a number of ”weak classifiers” Using AdaBoost for informative feature
h1 , · · · , h M , which need only be better than and classifier selection, we want to select
chance. At each iteration, add a weak clas- the single rectangle feature and thresh-
sifier (sequential learning process). old that best separates positive (faces) and
negative (non-faces) training examples, in
terms of weighted error.
Given: training set X = {x1 , · · · , x N } with target values T = {t1 , · · · , t N }, tn ∈ {−1, 1},
associated weights W = {w1 , · · · , w N } for each training point
Even if the filters are fast to compute, each
new image has a lot of possible windows
to search. For efficiency, in a cascade fashion
apply less accurate but faster classifiers
first to immediately discard windows that
clearly appear to be negative.
21 22
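A minimal sketch of the integral image and a two-rectangle Haar feature as described above; the specific feature geometry (two side-by-side blocks of size h × w) is just an example.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of all pixels in the rectangle spanned by (0, 0) and (y, x)."""
    return img.astype(float).cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1+1, x0:x1+1] from four integral-image lookups."""
    total = ii[y1, x1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

def haar_two_rect(ii, y, x, h, w):
    """Two-rectangle Haar feature: white (left) block minus black (right) block."""
    white = box_sum(ii, y, x,     y + h - 1, x + w - 1)
    black = box_sum(ii, y, x + w, y + h - 1, x + 2 * w - 1)
    return white - black
```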
Extension: Integral Channel Features

Generalization of the Haar integral image idea from Viola-Jones: instead of only considering intensities, also take other feature channels into account (gradient orientations, color, texture).

Also generalize the block computation:
• 1st-order features: sum of pixels in a rectangular region (integral over a certain region)
• 2nd-order features: Haar-like difference of sums-over-blocks (difference between the integrals over two regions)
• Generalized Haar: more complex combinations of weighted rectangles (higher-order designs with multiple such blocks)
• Histograms: computed by evaluating local sums on quantized images (multiple orientation channels, build a histogram over the corresponding regions, similar to HOG)

Classifier construction: e.g. the VeryFast detector.

4 Local Features and Matching

4.1 Local Features - Detection and Description

4.1.1 Local Invariant Features

Requirements:
• Region extraction needs to be repeatable and accurate:
  – invariant to translation, rotation, scale changes
  – robust or covariant to out-of-plane (≈ affine) transformations
  – robust to lighting variations, noise, blur, quantisation
• Locality: features are local, therefore robust to occlusion and clutter.
• Quantity: we need a sufficient number of regions to cover the object.
• Distinctiveness: the regions should contain "interesting" structure.
• Efficiency: close to real-time performance.

4.1.2 Keypoint Localization

Look for two-dimensional signal changes: in the region around a corner, the image gradient has two or more dominant directions.

Harris Detector

Properties:
• R is invariant to image rotation (the ellipse keeps its shape, therefore the eigenvalues stay the same)
• not invariant to image scale (due to the fixed window size)

Algorithm:
1. Compute the second moment matrix (autocorrelation matrix):

    M(σ_I, σ_D) = g(σ_I) ⋆ ( I_x²(σ_D)      I_x I_y(σ_D)
                             I_x I_y(σ_D)   I_y²(σ_D) )
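A minimal NumPy/SciPy sketch of the Harris corner response built from the second moment matrix M above. The response formula R = det M − k·(trace M)² and the constant k ≈ 0.05 are the standard choices and are assumptions here, since they are not spelled out in this summary; thresholding R and taking local maxima then gives the corners.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(img, sigma_d=1.0, sigma_i=2.0, k=0.05):
    """Harris corner response from the entries of M(sigma_I, sigma_D)."""
    img = img.astype(float)
    Ix = sobel(gaussian_filter(img, sigma_d), axis=1)   # derivatives at scale sigma_D
    Iy = sobel(gaussian_filter(img, sigma_d), axis=0)
    # entries of M, averaged with the Gaussian window g(sigma_I)
    Sxx = gaussian_filter(Ix * Ix, sigma_i)
    Syy = gaussian_filter(Iy * Iy, sigma_i)
    Sxy = gaussian_filter(Ix * Iy, sigma_i)
    det_M = Sxx * Syy - Sxy ** 2
    trace_M = Sxx + Syy
    return det_M - k * trace_M ** 2
```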
4.1.3 Scale Invariant Region Selection

Extract regions which are "scale invariant" and rescale each region to a fixed size.
Signature function: Laplacian-of-Gaussian detector (LoG).

4.1.4 Local Descriptors

SIFT Descriptor Computation:
• Divide the patch into 4x4 sub-patches: 16 cells.
• Compute a histogram of gradient orientations (8 reference angles) for all pixels inside each sub-patch.
• Resulting descriptor: 4x4x8 = 128 dimensions.

SIFT aims to achieve robustness to lighting variations and small positional shifts. For rotation-invariant descriptors, find the local orientation (dominant direction of the gradient for the image patch) and rotate the patch according to this angle; this puts the patches into a canonical orientation.

Computation:
1. Compute the orientation histogram.
2. Select the dominant orientation.
3. Normalize: rotate to a fixed orientation.

SIFT is an extraordinarily robust matching technique:
+ can handle changes in viewpoint up to 60° out-of-plane rotation
+ can handle significant changes in illumination
+ fast and efficient, can run in real time

• SURF is a fast approximation of the SIFT idea.

4.2 Recognition with Local Features

Image content is transformed into local features that are invariant to translation, rotation, and scale.
Goal: Verify whether they belong to a consistent configuration.
• Warping: Given a source image and the transformation, what does the transformed output look like?
• Alignment: Given two images with corresponding features, what is the transformation between them?

4.2.1 Finding Consistent Configurations

3D rotation around the coordinate axes:

    R_x(α) = ( 1  0       0
               0  cos α  −sin α
               0  sin α   cos α )

    R_y(β) = ( cos β  0  sin β
               0      1  0
              −sin β  0  cos β )

    R_z(γ) = ( cos γ  −sin γ  0
               sin γ   cos γ  0
               0       0      1 )

Affine transformations are combinations of linear transformations and translations.

4.2.2 Affine Estimation

In alignment, we fit the parameters of some transformation according to a set of matching feature pairs ("correspondences"). An affine model approximates the perspective projection of planar objects. Assuming we know the correspondences, how do we get the transformation?

Least Squares Estimation

Given a set of data points (X_i, X_i′), find a linear function to predict the X′'s from the X's, i.e. Xa + b = X′:

    ( X_1  1          ( X_1′
      X_2  1   (a       X_2′
      ...  ... )b)  =   ...  )   ⟺   Ax = B

With least squares estimation we get, for the 2D affine case x′ = Mx + t,

    ( x_i  y_i  0    0    1  0 )   (m_1 m_2 m_3 m_4 t_1 t_2)ᵀ  =  ( x_i′
      0    0    x_i  y_i  0  1 )                                    y_i′ )

for every correspondence i. This is an overconstrained problem:

    min ‖Ax − B‖²   →   least-squares minimization

4.2.3 Homography Estimation

A projective transform is a mapping between any two perspective projections with the same center of projection. Under a projective transform, a rectangle maps to an arbitrary quadrilateral; parallel lines are not preserved, but straight lines remain straight.

The simplest way to estimate a homography H from feature correspondences is the Direct Linear Transformation (DLT) method:
The solution is the null-space vector of A; it corresponds to the smallest singular vector. If v₉₉ may be zero, normalize the vector length instead:

    h = [v₁₉, ..., v₉₉] / |[v₁₉, ..., v₉₉]|

5 Deep Learning

5.1 Neural Networks

5.1.1 Background: Deep Learning

Generalized Linear Discriminants: y_k(x) > 0 if the input belongs to target class k. The φ(x_i) can be seen as features.

Multi-Layer Perceptrons (MLP): (In the usual network diagram, the black node introduces an "offset", the so-called bias term.) An MLP with 1 hidden layer can implement any function (universal approximator). However, if the function is deep, a very large hidden layer may be required.

If we leave out the non-linearity g(·), the layers collapse into a single linear function. Therefore, the non-linearities are what make the multi-layer representation more powerful.

Nonlinearities
Training requires computing the gradients for each weight.
Idea: Compute the gradient layer by layer; each layer below builds upon the results of the layer above.
⇒ Backpropagation algorithm

    w_kj^(τ+1) = w_kj^(τ) − η · ∂E(w)/∂w_kj |_{w^(τ)}

with learning rate/step size η and time index τ. This is applied on minibatches of data.

Vanishing gradients problem

In multi-layer nets, gradients need to be propagated through many layers. The magnitudes of the gradients are often very different for the different layers, especially if the initial weights are small. Gradients can therefore get very small in the early layers of deep nets.

When designing deep networks, we need to make sure gradients can be propagated throughout the network:
• restricting the network depth
• very careful implementation
• choosing suitable nonlinearities
• performing proper initialization

Weights Initialization

Best practice is to use a zero-mean distribution for sampling the initial weights, e.g. a Gaussian or uniform distribution. Compute the variance according to Glorot or He, and plug it into your chosen distribution.

5.2 Convolutional Neural Networks (CNNs)

Convolutional Layers

Feed-forward feature extraction, with a classification layer at the end:
1. Convolve the input with learned filters
2. Non-linearity
3. Spatial pooling
4. (Normalization)

The convolutional filters are learned through supervised training by back-propagating the classification error.

But why do we want to use CNNs?
• To avoid huge amounts of parameters, use convolutions with learned kernels: the neurons of one layer share the same parameters (the kernels) across different locations.
• All neural net activations are arranged in 3 dimensions. Multiple neurons all look at the same input region, stacked in depth. (Naming convention: depth →, width ↗, height ↑)
• Convolutional layers can be stacked. The filters of the next layer then operate on the full activation volume. Filters are local in (x, y), but densely connected in depth.
• Each activation map is a depth slice through the output volume.
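A minimal sketch of a convolutional layer's forward pass, to make the weight sharing and the depth slices concrete. The "valid" (no-padding) convolution and the ReLU at the end are illustrative choices.

```python
import numpy as np

def conv_layer_forward(x, w, b):
    """x: H x W x C_in input, w: K x K x C_in x C_out kernels, b: C_out biases.
    Every output location reuses the same kernels (weight sharing); each output
    channel is one activation map, i.e. one depth slice of the output volume."""
    H, W, _ = x.shape
    K, _, _, C_out = w.shape
    H_out, W_out = H - K + 1, W - K + 1          # "valid" convolution, no padding
    out = np.zeros((H_out, W_out, C_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = x[i:i + K, j:j + K, :]       # local receptive field
            for c in range(C_out):
                out[i, j, c] = np.sum(patch * w[..., c]) + b[c]
    return np.maximum(out, 0.0)                  # ReLU non-linearity
```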
Pooling Layers

By pooling filter responses at different spatial locations we gain robustness to the exact spatial location of features (robustness to translations). It makes the representation smaller without losing too much information. Pooling happens independently across each slice, preserving the number of slices.

5.3 CNN Architectures

5.3.1 LeNet
• 2x conv-conv-pool blocks for feature representation
• FC layers for classification at the end
⇒ successfully used for handwritten digit recognition
• For other tasks, the two layers of feature computation were not sufficient; it was also computationally expensive.

5.3.2 AlexNet
Similar to LeNet, but
• bigger model (7 hidden layers, 650k units, 60M parameters), more training data (10⁶ images instead of 10³)
• receptive field in the first layer: 11 × 11, stride 4
⇒ AlexNet almost halved the error rate of previous approaches.

5.3.3 VGGNet
• deeper network, more convolutional layers with smaller filters (+ nonlinearity), with a similar amount of FC layers and a softmax layer at the end
• receptive field in the first layer: 3 × 3, stride 1; stacking such layers yields effective receptive fields of 5 × 5, 7 × 7, etc.
• 138M parameters, but most of them in the FC layers
⇒ same receptive field as AlexNet, but much fewer parameters: 3 · 3² = 27 vs. 7² = 49

5.3.4 GoogLeNet
• Main ideas: "inception" module as a modular component; learns filters at several scales within each module; 1 × 1 convolutions ("bottleneck layers") for dimensionality reduction
• several inception modules in the net, with auxiliary classification outputs for training the lower layers (deprecated); 22 layers, no FC layers, only 5M parameters
• VGGNet and GoogLeNet perform at a similar level

5.3.5 ResNet
Core component:
• skip connections bypassing each layer
• better propagation of gradients to the deeper layers

5.3.6 Transfer Learning with CNNs
Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task.
1. Train the net on ImageNet.
2. If you have a small dataset: fix all weights (treat the CNN as a fixed feature extractor), retrain only the classifier.
3. If you have a medium-sized dataset, "finetune" instead: use the old weights as initialization, train the full network or only some of the higher layers with a smaller learning rate.

5.4 Practical Advice on CNN Training

5.4.1 Data Augmentation

Augment (cropping, zooming, flipping, color PCA) the original data with synthetic variations to reduce overfitting. This results in a much larger training set and in robustness against expected variations.
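A minimal sketch of two of the augmentations listed above (random crop and horizontal flip); the crop size and probabilities are arbitrary example values.

```python
import numpy as np

def augment(img, crop=28, rng=np.random.default_rng(0)):
    """Random crop plus random horizontal flip of an H x W x C image."""
    H, W, _ = img.shape
    y = rng.integers(0, H - crop + 1)
    x = rng.integers(0, W - crop + 1)
    out = img[y:y + crop, x:x + crop]
    if rng.random() < 0.5:
        out = out[:, ::-1]          # horizontal flip
    return out
```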
Overfitting happens when the model fits too well to the training set: the model starts to recognize specific images in the training set instead of general patterns.

5.4.2 Initialization

When initializing the weights:
• Draw them randomly from a zero-mean distribution.
• Common choices in practice: Gaussian or uniform.
• Common trick: add a small positive bias (+ε) to avoid units with ReLU nonlinearities getting stuck at zero.

When sampling weights from a uniform distribution [a, b]:
• The variance is σ² = (1/12)(b − a)².
• Glorot initialization with a uniform distribution:

    W ∼ U[ −√6 / √(n_in + n_out),  +√6 / √(n_in + n_out) ]

5.4.3 Batch Normalization

Optimization works best if all inputs of a layer are normalized.
Introduce an intermediate layer that centers the activations of the previous layer per minibatch, resulting in mean 0 and variance 1. I.e., perform transformations on all activations and undo those transformations when backpropagating gradients.
Centering and normalization also need to be done at test time, but minibatches are no longer available at that point. Therefore, learn the normalization parameters to compensate for the expected bias of the previous layer (usually with a simple moving average).

5.4.4 Dropout

Reduce the reliance on individual units by randomly switching off units during training. This changes the net for each data point, effectively training many different variants of the network.
To compensate for the much larger output responses at test time, multiply the activations with the probability that the unit was kept active during training, which reduces the magnitude of the activations. This results in improved performance.

5.4.5 Learning Rate Schedules

Final improvement step after convergence is reached:
1. Reduce the learning rate η by a factor of 10.
2. Continue training for a few epochs.
3. Do this 1-3 times, then stop training.

Turning down the learning rate reduces the random fluctuations in the error due to different gradients on different minibatches. When turning down the learning rate too soon, further progress will be much slower or impossible after that.

5.5 CNNs for Object Detection

Region Proposal Based Detectors

Avoid a dense sliding window by using region proposals.
• R-CNN: Selective Search + CNN classification / regression
• Fast R-CNN: swap the order of convolution and region extraction
• Faster R-CNN: compute region proposals within the network
• Mask R-CNN: detection + instance segmentation + pose estimation

Anchor Box Based Detectors

Perform detection in a single step using a grid of anchor boxes.
• YOLO, SSD

5.5.1 R-CNN

Cut off the feature extractor and replace it with a trained CNN.
Problems:
- ad hoc training objectives:
  – fine-tuned net with softmax classifier (log loss)
  – post-hoc trained linear SVMs (hinge loss)
  – post-hoc trained bounding-box regressors (squared loss)
- training (2 days) and testing (47 s/image) are slow
- takes a lot of disk space: all precomputed CNN features for training the classifier need to be stored

5.5.2 Fast R-CNN

Instead of running a ConvNet on every region proposal, apply the ConvNet once to the entire image and use RoI pooling.

Pipeline:
1. Forward the whole image through the ConvNet.
2. Extract RoIs from a proposal method.
3. "RoI Pooling" layer, which warps each RoI to a fixed size.
4. Feed the pooled features into FC layers.
5. Linear + softmax classifier, bounding-box regressors.
6. Feed the linear + softmax and linear outputs into a multi-task loss (log loss + smooth L1 loss).

5.5.3 Faster R-CNN

Remove the dependence on an external region proposal algorithm; instead, infer the region proposals from the same CNN. This results in feature sharing and makes object detection in a single pass possible.
Faster R-CNN = Fast R-CNN + RPN (Region Proposal Network).
One network, four losses (joint training).

5.5.4 Mask R-CNN

For detection + instance segmentation and detection + pose estimation.

5.5.5 YOLO/SSD

Go directly from the image to detection scores, which allows a very lightweight backbone. Subdivide the image into a grid. Within each grid cell:
1. Start from a set of anchor boxes.
2. Regress from each of the B anchor boxes to a final box.
3. Predict scores for each of C classes (including background).

5.6 CNNs for Segmentation

For semantic segmentation, label each pixel in the image with a category label. For instance segmentation, additionally give an instance label per pixel.

5.6.1 Fully Convolutional Networks (FCN)

Design the network as a sequence of convolutional layers to make predictions (for segmentation) for all pixels at once.
In FCNs, all operations are formulated as convolutions; fully-connected layers become 1x1 convolutions. The advantage of using convolutions is that FCNs can process arbitrarily sized images.
Think of FCNs as performing a sliding-window classification. The computation is more efficient, since computations are reused between windows. On the other hand, convolutions at the original image resolution are very expensive.

5.6.2 Encoder-Decoder Architecture

Design the net as a sequence of convolutional layers, with downsampling and upsampling inside the network.
• Downsampling: pooling, strided (stride > 1) convolution.
• Upsampling:
  – Nearest-Neighbor: spread the value of a pixel to a larger patch; results in a blocky output structure.
  – "Bed of Nails": keep one pixel value as the original and pad the other ones in the patch with zeros.
  – Max Unpooling: remember which element was the maximum after max-pooling, and use this position for "Bed of Nails" upsampling.
  – Strided Transpose Convolution

5.6.5 Extensions

Dilated Convolutions (Atrous Convolutions)

Sample the input at every r-th pixel for the convolution. This increases the receptive field without increasing the computation. With dilation factor l:

    y[i] = ∑_{k=1}^{K} x[i + r·k] w[k],      (F ⋆_l k)(p) = ∑_{s + l·t = p} F(s) k(t)
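A minimal 1D sketch of the dilated convolution formula above (the kernel index runs from 0 in the code, a small shift from the k = 1..K convention in the formula).

```python
import numpy as np

def dilated_conv1d(x, w, r):
    """y[i] = sum_k x[i + r*k] * w[k] -- sample the input at every r-th position."""
    K = len(w)
    n_out = len(x) - r * (K - 1)                 # keep only fully valid positions
    return np.array([np.dot(x[i:i + r * K:r], w) for i in range(n_out)])

# usage sketch: dilated_conv1d(np.arange(10.0), np.array([1.0, 1.0, 1.0]), r=2)
```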
5.7 CNNs for Human Body Pose Estimation

Setup:
1. Annotate images with keypoints for the skeleton joints.
2. Define a target disk around each keypoint with radius r.
3. Set the ground-truth label to 1 within each such disk.
4. Infer heatmaps for the joints as in semantic segmentation.

5.8 CNNs for Matching

Types of models used for matching tasks: Siamese networks and triplet-loss networks.

5.8.1 Siamese Networks

Present the two stimuli to two identical copies of a network (with shared parameters). Train them to output similar values iff the inputs are (semantically) similar.

5.8.2 Triplet Loss

Offline Hard Triplet Mining

Mining hard triplets becomes crucial for learning: a triplet is prone to confusion when the anchor and the negative example are close to each other. A popular solution is offline hard triplet mining:
1. Process the dataset to find hard triplets, in the form of mini-batches.
2. Use those for learning.
3. Iterate.

This needs considerable effort, but it is a very wasteful design, because a minibatch of the mined triplets potentially contains many more hard triplets than we actually mined.

Online Hard Triplet Mining

Use online hard triplet mining instead:
• Each member of another triplet becomes an additional negative candidate. But we need both hard negatives and hard positives.
• An even better design is to sample K images from P classes for each minibatch. For one anchor, the previous positive and negative examples become positives, and every other sample in the minibatch (whether a previous anchor, positive, or negative) becomes a negative example. The triplets are then constructed only within the minibatch.

5.9 Recurrent Networks
The training is more challenging, however, since unrolled nets are deep.
We want to train a language model with p(next word | previous words), where p should be high for the observed word sequences. (In the usual unrolled diagram, pink marks the recurrent connections between hidden layers.)
After training the RNN, we get a language model which predicts the next word by sampling the output (posterior) of the previous one: from each output y_i, sample the next input word x_{i+1}, until the end of the sequence.

Applications:
• Image tagging: simple combination of CNN and RNN. Use the CNN to define the initial state h₀ of an RNN; use the RNN to produce a text description of the image. Trained on a corpus of images with textual descriptions.
• Video-to-text description

6 3D Reconstruction

To reconstruct a 3D structure we need multi-view geometry, because the structure from one image is inherently ambiguous.
Given several images of the same object or scene, compute a representation of its 3D shape. In particular, given a calibrated binocular stereo pair, fuse it to produce a depth image.

6.1 Epipolar Geometry and Stereo Basics

The epipolar geometry is the intrinsic projective geometry between two views. It is independent of the scene structure and only depends on the cameras' internal parameters and relative pose.

Principle: Triangulation, which gives the reconstruction as the intersection of two rays. This requires camera calibration and point correspondences.

Parameters for camera calibration:
• Extrinsic: rotation matrix and translation vector (camera frame ↔ reference frame)
• Intrinsic: focal length, pixel sizes, image center point, radial distortion parameters (image coordinates relative to the camera ↔ pixel coordinates)

Parallel Optical Axes

Assume these parameters are given and fixed.
Task of depth estimation: estimate a disparity map D(x, y) from a set of images. From the similar triangles (p_l, P, p_r) and (O_l, P, O_r) we get the depth associated with point p:

    (T − (x_r − x_l)) / (Z − f) = T / Z   ⟺   Z = f · T / (x_r − x_l)

where x_r − x_l is the disparity, the horizontal motion. We can then get the corresponding point in the other image through the disparity map D(x, y) := f·T/Z:

    (x′, y′) = (x + D(x, y), y)

Stereo Correspondence Constraints

In general, two cameras do not need to have parallel optical axes. The geometry of the two views allows us to constrain where the corresponding pixel for some image point in the first view must occur in the second view. Here the epipolar constraint is useful, because it reduces the correspondence problem to a 1D search along conjugate epipolar lines.

The epipolar geometry between two views is essentially the geometry of the intersection of the image planes with the epipolar planes, which have the baseline as (fixed) axis.
• Baseline: line joining the camera centers.
• Epipole e: point of intersection of the baseline with the image plane. All epipolar lines intersect at the epipole.
• Epipolar line l: intersection of an epipolar plane with the image plane.
• Epipolar plane Π: plane containing the baseline and a world point. An epipolar plane intersects the left and right image planes in epipolar lines.

Potential matches for p have to lie on the corresponding epipolar line l.
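Going back to the parallel-axes case above, here is a minimal sketch of turning a disparity map into depth via Z = f·T/(x_r − x_l); treating zero/negative disparities as invalid is an assumption for illustration.

```python
import numpy as np

def depth_from_disparity(disparity, f, T):
    """Z = f * T / (x_r - x_l); f: focal length in pixels, T: baseline length."""
    Z = np.full(disparity.shape, np.inf, dtype=float)
    valid = disparity > 0                      # skip pixels without a valid match
    Z[valid] = f * T / disparity[valid]
    return Z
```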
Some useful stuff for computation:
• normalize a homogeneous point: (x₁, x₂, x₃)ᵀ → (x₁/x₃, x₂/x₃, 1)ᵀ
• normalize a homogeneous line: (a, b, c)ᵀ → (a/√(a²+b²), b/√(a²+b²), c/√(a²+b²))ᵀ
• perpendicular distance from a point to a (normalized) line: d = l · p (dot product)

6.2 Stereopsis and 3D Reconstruction

Main steps:
1. Calibrate the cameras.
2. Rectify the images.
3. Compute the disparity.
4. Estimate the depth.

Dense Correspondence Search

We want a window large enough to have sufficient intensity variation, yet small enough to contain only pixels with about the same disparity.

Sparse Correspondence Search

Restrict the search to a sparse set of detected features. Rather than pixel values (or lists of pixel values), use a feature descriptor and an associated feature distance. Still narrow the search further by epipolar geometry.
Pros and Cons:
+ efficiency
+ can have more reliable feature matches, less sensitive to illumination than raw pixels
- have to know enough to pick good features
- sparse information

Possible Sources of Error
• low-contrast/textureless image regions
• occlusions
• camera calibration errors
• violations of brightness constancy (e.g. specular reflections)
• large motions

6.5 Camera Calibration

Recap: To solve Ax = 0, apply the SVD

    A = U D Vᵀ

with D the diagonal matrix of singular values d₁₁, ..., d_NN and V = (v₁, ..., v_N) the singular vectors.
• The singular values of A are the square roots of the eigenvalues of AᵀA.
• The solution of Ax = 0 is the null-space vector of A.
• This corresponds to the smallest singular vector of A.

6.5.1 Camera Models/Parameters

In the following we look at camera models, which are matrices with particular properties that represent the camera mapping.

Pinhole Model with Origin not in the Principal Point

The origin of the image coordinate system may not lie in the principal point (e.g. it is in the corner), so we define a more general mapping with a calibration matrix K. For pixel size 1/m_x × 1/m_y, where m_x, m_y are the pixels per meter in horizontal and vertical direction respectively, we get the calibration matrix

    K = ( m_x f  0      p_x       ( α_x  0    x_0
          0      m_y f  p_y    =    0    α_y  y_0
          0      0      1   )       0    0    1   )

Camera Rotation and Translation

In general, the camera coordinate frame will be related to the world coordinate frame by a rotation and a translation.
• X̃_cam: coordinates of a point in the camera frame
• X̃: coordinates of a point in the world frame (non-homogeneous)
• X: coordinates of a point in the world frame (homogeneous)
• C̃: coordinates of the camera center in the world frame
• Camera projection matrix P = K[R | t] (R and t both relative to the world coordinate system)

Summary (degrees of freedom):
⇒ general pinhole camera: 9 DoF
⇒ CCD camera with non-square pixels: 10 DoF
⇒ general camera: 11 DoF

6.5.2 Calibration Procedure

Compute the intrinsic and extrinsic parameters using observed camera data.
Given n points with known 3D coordinates X_i and known image projections x_i, estimate the camera parameters such that x_i = P X_i, where
• x_i: point in the image
• X_i: point in world coordinates
• λ: unknown scaling factor depending on the unknown depth

Main idea:
1. Place a calibration object with known geometry in the scene.
2. Get correspondences.
3. Solve for the mapping from scene to image: estimate P = P_int P_ext.

DLT Algorithm (Direct Linear Transform)

Idea: Get rid of the unknown scaling factor λ.

Notes:
• 5½ correspondences are needed for a minimal solution.
• For coplanar points that satisfy ΠᵀX = 0 for a plane Π, we will get degenerate solutions (Π, 0, 0), (0, Π, 0), or (0, 0, Π). We need calibration points in more than one plane!

To figure out the intrinsic and extrinsic parameters after recovering the numerical form of the camera matrix, use matrix decomposition.

For best results, the calibration points need to be measured with subpixel accuracy (depending on the exact pattern).

Algorithm for a checkerboard pattern:
1. Perform Canny edge detection.
2. Fit straight lines to the detected linked edges.
3. Intersect the lines to obtain corners. If sufficient care is taken, the points can then be obtained with a localization accuracy < 1/10 pixel.

Rule of thumb: The number of constraints should exceed the number of unknowns by a factor of 5; thus, at least 28 points are necessary.

6.6 Uncalibrated Reconstruction

3 main questions for two-view geometry:
• Scene geometry (structure): Where is the pre-image of the points in 3D? (→ triangulation)
• Correspondence (stereo matching): How does a point in one image constrain the position of the corresponding point x′ in another image? (→ epipolar constraint)
• Camera geometry (motion): What are the cameras for the two views? (→ SfM)

The fundamental matrix and the essential matrix are the algebraic representations of epipolar geometry; they satisfy an epipolar constraint for any pair of corresponding points x ↔ x′ in the two images.

6.6.1 Triangulation

1) Geometric Approach
Find the shortest segment connecting the two viewing rays and let X be the midpoint of that segment.
2) Linear Algebraic Approach

3) Nonlinear Approach
Most accurate, but unlike the other two methods it does not have a closed-form solution (i.e. it can't be expressed with a finite number of standard operations).

6.6.2 Uncalibrated Case: Fundamental Matrix

Using x̂ and x̂′ in matrix space, we apply a normalized coordinate system to get the pixel coordinates x and x′.

To estimate the fundamental matrix F, use the eight-point algorithm.

Eight-Point Algorithm

8 points are sufficient, because F has rank 2. The eight-point algorithm has poor numerical conditioning, which can be fixed by rescaling the data. On noisy data, the solution will usually not fulfill the constraint that F has rank 2; this means that there will be no epipoles through which all epipolar lines pass.

Normalized Eight-Point Algorithm
1. Center the image data at the origin, and scale it so the mean squared distance between the origin and the data points is 2 pixels.
2. Use the eight-point algorithm to compute F from the normalized points.
3. Enforce the rank-2 constraint using SVD. (Geometrically, F represents a mapping from the 2D projective plane of the first image to the pencil of epipolar lines through the epipole e. Thus, it represents a mapping from a 2D onto a 1D projective space, and hence must have rank 2.)
4. Transform the fundamental matrix back to the original units: if T and T′ are the normalizing transformations in the two images, then the fundamental matrix in the original coordinates is TᵀFT′.

Nonlinear Least-Squares (Refinement Approach)

    ∑_{i=1}^{N} (x_iᵀ F x_i′)²   ⇒   ∑_{i=1}^{N} ( d²(x_i, F x_i′) + d²(x_i′, F x_i) )

(replace the algebraic error by the symmetric geometric distances d to the epipolar lines)
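A minimal NumPy sketch of the normalized eight-point algorithm described above. The convention x₂ᵀ F x₁ = 0 and the exact normalization (mean distance √2 to the origin, i.e. mean squared distance ≈ 2) are implementation choices of this sketch.

```python
import numpy as np

def normalize_points(pts):
    """Translate to zero mean and scale so the mean distance to the origin is sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def fundamental_matrix(x1, x2):
    """Normalized eight-point algorithm for x2^T F x1 = 0, given >= 8 correspondences."""
    p1, T1 = normalize_points(x1)
    p2, T2 = normalize_points(x2)
    # each correspondence gives one row of the linear system A f = 0
    A = np.column_stack([p2[:, 0] * p1[:, 0], p2[:, 0] * p1[:, 1], p2[:, 0],
                         p2[:, 1] * p1[:, 0], p2[:, 1] * p1[:, 1], p2[:, 1],
                         p1[:, 0], p1[:, 1], np.ones(len(p1))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)                   # null-space vector of A
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt    # enforce the rank-2 constraint
    return T2.T @ F @ T1                       # undo the normalization
```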
6.6.3 Stereo Pipeline with Weak Calibration

We want to estimate the world geometry without requiring calibrated cameras.
Main Idea: Estimate the epipolar geometry from a (redundant) set of point correspondences between two uncalibrated cameras.

Procedure to find F and the correspondences:
1. Find interest points in both images (e.g., Harris corners).
2. Compute correspondences using only proximity.
3. Compute the epipolar geometry.
4. Refine.

RANSAC for robust estimation of F:
1. Select a random sample of correspondences.
2. Compute F using them. This determines the epipolar constraint.
3. Evaluate the amount of support: the number of inliers within a threshold distance of the epipolar line.
4. Iterate until a solution with sufficient support has been found (or the maximum number of iterations is reached).
5. Choose the F with the most support.

6.6.4 Extension: Epipolar Transfer

Given points x₁ and x₂ in the two images with known epipolar geometries F₁₃, F₂₃, the point x₃ in the third image is the intersection of l₃₁ and l₃₂:

    l₃₁ = F₁₃ᵀ x₁,    l₃₂ = F₂₃ᵀ x₂

This can't be applied if the motion is strictly in the same plane.

6.7 Structure-from-Motion (SfM)

Given: m images of n fixed 3D points x_ij = P_i X_j, i = 1, ..., m, j = 1, ..., n.

Structure-from-Motion Ambiguity

If we transform the scene using a transformation Q (similarity, affine, projective) and apply the inverse transformation to the camera matrices, then the images don't change:

    x = PX = (PQ⁻¹)(QX)

Idea: With no constraints on the camera calibration we get a projective reconstruction. Add information to upgrade the reconstruction to affine, similarity, or Euclidean.

Hierarchy of 3D Transformations

From the most unconstrained at the top to the most constrained at the bottom: projective → affine → similarity → Euclidean.

Projective Structure from Motion

Given are m images of n fixed 3D points; in the two-camera case with depths z and z′:

    z_ij x_ij = P_i X_j,   i = 1, ..., m,  j = 1, ..., n

We want to estimate the m projection matrices P_i and the n 3D points X_j from the mn correspondences x_ij. With no calibration information, the cameras and points can only be recovered up to a 4 × 4 projective transformation Q:

    X → QX,   P → PQ⁻¹

We can solve for structure and motion when 2mn ≥ 11m + 3n − 15. For two cameras, at least 7 points are needed.

Decomposing the fundamental matrix in the two-camera case means: if we can compute the fundamental matrix between two cameras, we can directly estimate the two projection matrices from F. Once we have the projection matrices, we can compute the 3D position of any point X by triangulation.

To obtain both kinds of information at the same time, use projective factorization.
Projective Factorization
Bundle Adjustment
Non-linear method for refining structure and motion. Minimize the mean-square reprojection error

    E(P, X) = ∑_{i=1}^{m} ∑_{j=1}^{n} D(x_ij, P_i X_j)²
Idea: Seek the Maximum Likelihood (ML) solution assuming the measurement noise is
Gaussian. It involves adjusting the bundle of rays between each camera center and the
set of 3D points.
Considerably improves the results and allows assignment of individual covariances
to each measurement. However, it needs a good initialization, and can become an
extremely large minimization problem.
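A minimal sketch of the reprojection error E(P, X) that bundle adjustment minimizes; D is taken as the Euclidean image distance, and minimizing over P and X (e.g. with scipy.optimize.least_squares) would be the actual bundle adjustment step starting from a good initialization such as the triangulated points.

```python
import numpy as np

def reprojection_error(P_list, X, x_obs):
    """E(P, X) = sum_i sum_j ||x_ij - proj(P_i X_j)||^2.
    P_list: m camera matrices (3x4), X: n x 4 homogeneous 3D points,
    x_obs: m x n x 2 observed image points."""
    E = 0.0
    for i, P in enumerate(P_list):
        proj = (P @ X.T).T                     # n x 3 homogeneous image points
        proj = proj[:, :2] / proj[:, 2:3]      # divide by the third coordinate
        E += np.sum((x_obs[i] - proj) ** 2)    # squared reprojection distances
    return E
```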