CVIII 2024 WS
Hurile Borjigin
Technical University of Munich
2024 WS
Contents

1 Introduction and Object Detection
  1.1 What this course is
  1.2 Understanding an image
  1.3 Understanding a video
  1.4 Some architectures and concepts

2 Object detection
  2.1 One-stage detectors
    2.1.1 Object detection with sliding window
    2.1.2 Feature-based detection
  2.2 Two-stage detectors
    2.2.1 Non-Maximum Suppression (NMS)
  2.3 Object detection with deep networks
    2.3.1 Overfeat [1]
    2.3.2 Object proposals (pre-filtering) and pooling for two-stage detection
    2.3.3 R-CNN
    2.3.4 Fast R-CNN
    2.3.5 Faster R-CNN
  2.4 Feature Pyramid Networks
  2.5 Single-stage object detection
    2.5.1 YOLO
    2.5.2 SSD
    2.5.3 RetinaNet
    2.5.4 Problems with one-stage detectors
    2.5.5 Focal loss
    2.5.6 RetinaNet
  2.6 Spatial Transformers
  2.7 Detection evaluation

4 Multi-object tracking
  4.1 Approach 1
  4.2 Typical models of dynamics
  4.3 Bipartite matching
  4.4 Approach 2: Tracktor
  4.5 Metric learning: For re-identification (Re-ID)
    4.5.1 Metric spaces
    4.5.2 How do we train a network to learn a feature representation
    4.5.3 Metric learning for tracking
    4.5.4 Summary of metric learning
  4.6 Online vs. Offline tracking
  4.7 Graph-based MOT
    4.7.1 Tracking with network flow
    4.7.2 Tracking with Message Passing Network
    4.7.3 Evaluation of MOT

5 Segmentation
  5.1 K-means (clustering)
  5.2 Spectral clustering
  5.3 Normalized cut (Ncut)
  5.4 Energy-based model: Conditional random fields (CRFs)
  5.5 Fully convolutional neural networks
  5.6 U-Net
  5.7 SegNet
  5.8 Mask R-CNN
    5.8.1 PointRend
  5.9 Panoptic FPN
  5.10 Panoptic FCN
1 Introduction and Object Detection
1.1 What this course is
High-level (semantic) computer vision:
• Object detection
• Image segmentation
• Object tracking
• etc...
From another perspective, this course sits at the intersection of computer vision, deep
learning, and real-world applications, as shown in Fig. 1.
• Classification
• Object detection
• Semantic segmentation
• Instance segmentation
• Object tracking
Figure 2: CVIII in two Dimensions
• Semantic segmentation (pixel-level)
But there are challenges:
• A lot of redundancy.

1.4 Some architectures and concepts
• Deformable/atrous convolutions
• Contrastive learning
2 Object detection
2.1 One-stage detectors
One-stage detectors treat object detection as a single task, directly predicting the object
class and bounding box coordinates from the input images.
For every position, measure the distance (or correlation) between the template and the
image region (see Fig. 3):

• Sum of squared differences (SSD):

d(I_{(x_0,y_0)}, T) = \frac{1}{n} \sum_{x,y} \left( I_{(x_0,y_0)}(x, y) - T(x, y) \right)^2   (2)

where d is the distance metric, I_{(x_0,y_0)} is the image region at position (x_0, y_0), and T is the template.

• Normalized cross-correlation (NCC):

d(I_{(x_0,y_0)}, T) = \frac{1}{n} \sum_{x,y} \frac{1}{\sigma_I \sigma_T} \, I_{(x_0,y_0)}(x, y) \, T(x, y)   (3)

• Zero-normalized cross-correlation (ZNCC):

d(I_{(x_0,y_0)}, T) = \frac{1}{n} \sum_{x,y} \frac{1}{\sigma_I \sigma_T} \left( I_{(x_0,y_0)}(x, y) - \mu_I \right) \left( T(x, y) - \mu_T \right)   (4)
Disadvantages
• Changes in appearance
Figure 4: One-stage detector(new)
Boosting (Viola–Jones [2]) – core training loop:
2. Find a weak classifier with the lowest error across the dataset.
3. Save the weak classifier and update the priority of the data samples.
HOG + SVM (Dalal–Triggs [3]) – training:
1. Choose your training set of images that contain the object you want to detect.
4. Train an SVM classifier on the two sets to detect whether a feature vector represents
the object of interest or not (0/1 classification).
Note: The amount of work for each RoI may grow significantly.
1. Region Proposal: The network first identifies potential regions in the image that
may contain objects.
2. Refinement: These regions are further analyzed to classify the objects and refine
their bounding boxes.
1. Selective search[5]
Using class-agnostic segmentation at multiple scales.
2. Edge boxes[6]
Bounding boxes that wholly enclose detected contours.
Region overlap
We measure region overlap with the Intersection over Union (IoU), or Jaccard index:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}
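A small sketch of IoU for axis-aligned boxes in (x1, y1, x2, y2) format (the function name and box format are my own choices):

\begin{verbatim}
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
\end{verbatim}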
The threshold is the decision boundary used to determine whether a detected object or
prediction should be considered valid. For example, in object detection:
• Classification
• Localization
• Detection
Figure 8: Overfeat
Apply Non-Max Suppression to combine the predictions and windows, see Fig 9.
Figure 9: NMS
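As a sketch, greedy NMS can be implemented as follows (score-sorted, IoU-thresholded; the threshold value and box format are illustrative):

\begin{verbatim}
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; boxes are (N, 4) as (x1, y1, x2, y2)."""
    order = np.argsort(scores)[::-1]   # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU of the top-scoring box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]   # drop boxes that overlap too much
    return keep
\end{verbatim}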
The complexity of a sliding window is O(ND), where N is the number of all the windows,
and D represents the amount of work needed to perform detection and classification for one
window. The complexity of the "pre-filtered" method is O(Nd + nD), where d is the amount
of work needed for filtering each window, and n is the number of windows left after being
filtered.
The "pre-filtering" method pays off when:

\frac{n}{N} + \frac{d}{D} < 1

where n/N is the Region-of-Interest (RoI) ratio, and d/D is the efficiency of the RoI generator. In
practice, there is a delicate balance between n and d: reducing d (a cheaper filter) typically
increases n (more windows survive the filter).
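A worked example with made-up numbers: take N = 10^5 windows, n = 2000 surviving proposals, and a filter that is 100× cheaper than the detector (d/D = 0.01). Then

\frac{n}{N} + \frac{d}{D} = 0.02 + 0.01 = 0.03 < 1,

so the pre-filtered pipeline costs roughly 3% of exhaustive sliding-window detection.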
1. Selective search
Using class-agnostic segmentation at multiple scales.
2. Edge boxes
Bounding boxes that wholly enclose detected contours.
• (Fast/Faster) R-CNN
2.3.3 R-CNN
Training scheme:
1. Pre-train the CNN on a large image-classification dataset (e.g., ImageNet)
2. Fine-tune the CNN on the number of classes the detector is aiming to classify
3. Train a linear SVM classifier to classify image regions - one linear SVM per class
Pros:
• New: CNN features; the overall pipeline with proposals is heavily engineered – good
accuracy.
• Leverages transfer learning: the CNN can be pre-trained for image classification with
C classes; one needs only to change the FC layers to deal with the Z classes of the detection task.
Cons:
• Slow: 47 s/image with a VGG-16 backbone. One considers around 2000 proposals per
image; each needs to be warped and forwarded through the CNN.
• Feature extraction and SVM classifier are trained separately - features are not learned
”end-to-end”.
Problems:
2. TBD
Figure 10: SPP-Net
• Feed the entire image into a convolutional neural network(e.g., VGG, ResNet).
This means the network only processes the image once, which is much more efficient.
2. Mapping region proposals to the feature map
You have a set of region proposals (bounding boxes) in the original image coordinates;
these come from an external region proposal method (e.g., selective search) or from a region
proposal network (in Faster R-CNN).
Each region proposal is then mapped onto the feature map coordinates:
- Because the CNN reduces spatial resolution(due to pooling layers or strided convolu-
tions), the coordinates of each region of the original image need to be scaled to align with
the coordinate system of the smaller feature map.
3. Dividing each RoI into Sub-regions
To handle different RoI sizes but produce a fixed-size output (for example, a 7×7 output
grid in Fast R-CNN):
• Divide the mapped RoI on the feature map into a predefined grid (e.g., 7×7).
• Each sub-region in this grid corresponds to a smaller set of feature map cells.
4. Pooling operation
Within each of these sub-regions(bins):
• Perform a pooling operation(usually max pooling, though average pooling can also be
used).
• This operation collapses the variable-sized sub-region into a single value(e.g., the max-
imum activation within that bin).
By doing this for each sub-region in the grid, you transform the entire RoI into a fixed-size
feature map (e.g., 7×7), regardless of the original size of the bounding box.
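A minimal NumPy sketch of RoI max pooling (assuming the RoI is already mapped to feature-map coordinates and lies inside the map; the function name and box format are mine):

\begin{verbatim}
import numpy as np

def roi_max_pool(feature_map, roi, out_size=7):
    """Pool one RoI on a (C, H, W) feature map to a fixed (C, out_size, out_size) grid."""
    C, H, W = feature_map.shape
    x1, y1, x2, y2 = roi  # RoI in feature-map coordinates
    out = np.zeros((C, out_size, out_size), dtype=feature_map.dtype)
    # Bin edges that split the RoI into out_size x out_size sub-regions
    xs = np.linspace(x1, x2, out_size + 1).round().astype(int)
    ys = np.linspace(y1, y2, out_size + 1).round().astype(int)
    for i in range(out_size):
        for j in range(out_size):
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)  # at least one cell
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            # Collapse the variable-sized bin to a single value per channel
            out[:, i, j] = feature_map[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out
\end{verbatim}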
5. Feeding pooled features to classifier layers
Now that you have a fixed spatial dimension (e.g., 7×7), you can:
Figure 12: Fast R-CNN results
Classification ground truth: We compute p∗ which indicates how much an anchor over-
laps with the ground truth bounding boxes:
• 1 indicates that the anchor represents an object(foreground), and 0 indicates a back-
ground object. The rest do not contribute to the training.
• Those anchors that contain an object are used to compute the regression loss.
t_x = \frac{x - x_a}{w_a} \quad (7) \qquad t_y = \frac{y - y_a}{h_a} \quad (8)

t_w = \log \frac{w}{w_a} \quad (9) \qquad t_h = \log \frac{h}{h_a} \quad (10)
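These targets could be computed per anchor/ground-truth pair as in the following sketch (center-size box format (x, y, w, h); naming is mine):

\begin{verbatim}
import math

def regression_targets(gt, anchor):
    """Targets (t_x, t_y, t_w, t_h) from Eqs. (7)-(10); boxes are (x, y, w, h)."""
    x, y, w, h = gt
    xa, ya, wa, ha = anchor
    tx = (x - xa) / wa        # center offsets, normalized by anchor size
    ty = (y - ya) / ha
    tw = math.log(w / wa)     # log-scale size ratios
    th = math.log(h / ha)
    return tx, ty, tw, th
\end{verbatim}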
1. RPN classification(object/non-object)
Figure 18: Faster R-CNN Performance
Straightforward implementation:
• Upsampling(nearest neighbours)
• Element-wise addition
Pros:
Figure 20: Pyramidal feature hierarchy
- A confidence value(object/no object).
- And a class distribution over C classes.
Inference time:
• Coarse grid resolution and few anchors per cell → issues with small objects
2.5.2 SSD
Pros:
Cons:
2.5.3 RetinaNet
Two-stage detectors:
• Hard negative mining: subsample the negatives with the largest error:
- works, but can be unstable.
Class imbalance:
- Idea: balance the positives/negatives in the loss function.
- Recall cross-entropy loss:
Figure 26: Focal loss
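The down-weighting idea can be written as the standard focal loss from the RetinaNet paper, FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t); a minimal NumPy sketch of the binary form (function name and defaults are illustrative):

\begin{verbatim}
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss; p = predicted foreground probability, y = label in {0, 1}.
    The (1 - p_t)^gamma factor down-weights easy, well-classified examples."""
    p_t = np.where(y == 1, p, 1 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t + 1e-12)
\end{verbatim}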
2.5.6 RetinaNet
Figure 28: Spatial Transformers
• Fully differentiable.
Cons:
- Difficulty of training for generalizing to more challenging scenes.
\text{Precision} = \frac{TP}{TP + FP}   (12)
Recall: How many of the actual zebras in the image/dataset could you find?

\text{Recall} = \frac{TP}{TP + FN}   (13)
What is a true positive?
Resolving conflicts
Figure 29: AP
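As a sketch of how average precision could be computed from confidence-ranked detections (all-points interpolation; benchmarks differ in the details):

\begin{verbatim}
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP from detections; is_tp[i] = True if detection i matched a GT box."""
    order = np.argsort(scores)[::-1]            # rank by confidence
    tp = np.cumsum(np.asarray(is_tp)[order])
    fp = np.cumsum(~np.asarray(is_tp)[order])
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # Make precision monotonically decreasing, then integrate over recall
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    return np.sum(np.diff(np.concatenate(([0.0], recall))) * precision)
\end{verbatim}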
• Fast motion.
• Changing appearance.
• Dynamic background.
• Occlusions.
• ...
Figure 30: Problem state
Goal: Estimate car position at each time instant (say, the white car).
Observation: Image sequence and known background.
Notations:
- Observations: image
- System state: car position (x, y)
Figure 32: Bayesian probabilities
Goal:
Estimate the posterior probability p(x_k | Z_k).
How? Recursion, with two Markov assumptions:
1. The current observation depends only on the current state:

p(z_k | x_k, Z_{k-1}) = p(z_k | x_k)   (15)

2. The current state depends only on the previous state:

p(x_k | x_{k-1}, Z_{k-1}) = p(x_k | x_{k-1})   (16)
Recursive Estimation
Key concepts:

• Bayes' rule: p(a | b) = \frac{p(b | a) \, p(a)}{p(b)}

Bayesian formulation:

p(x_k | Z_k) = k \cdot p(z_k | x_k) \cdot \int p(x_k | x_{k-1}) \, p(x_{k-1} | Z_{k-1}) \, dx_{k-1}   (17)

• k → normalizing term
Estimators:
Assume the posterior probability p(x_k | Z_k) is known:
• posterior mean: E[x_k | Z_k] = \int x_k \, p(x_k | Z_k) \, dx_k
• maximum a posteriori (MAP): \arg\max_{x_k} p(x_k | Z_k)
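A minimal sketch of this recursion on a discretized 1-D state space (a histogram filter; all names are mine):

\begin{verbatim}
import numpy as np

def bayes_filter_step(belief, transition, likelihood):
    """One recursion of Eq. (17) on a discretized state space.
    belief[i]        = p(x_{k-1} = i | Z_{k-1})
    transition[i, j] = p(x_k = j | x_{k-1} = i)
    likelihood[j]    = p(z_k | x_k = j)
    """
    predicted = belief @ transition          # the integral over x_{k-1}
    posterior = likelihood * predicted       # multiply by the observation model
    return posterior / posterior.sum()       # k: normalizing term
\end{verbatim}

Each call performs the prediction (the integral over x_{k-1}) and the update (multiplication by p(z_k | x_k), then normalization by k).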
Deep networks:
• It is easy to see what the networks have to output given the input,
- but it is harder (yet more useful) to understand what a network models in terms of
our Bayesian formulation.
Offline tracking:
- ”Given all observations, estimate any state.”
An online tracking model can be used for offline tracking too. Our recursive Bayesian
model will still work.
3.3 GOTURN
• Input: Search region + template region(what to track).
Temporal prior:
Advantages:
• efficient(real-time).
Disadvantages:
• the temporal prior is too simple: fails if there is fast motion or occlusion.
3.5 MDNet
Tracking with online adaptation
Figure 35: MDNet
4 Multi-object tracking
Challenges:
• Heavy occlusions.
4.1 Approach 1
1. Track initialization(e.g., using a detector).
4.2 Typical models of dynamics
• Constant position:
- i.e. no real dynamics, but if the velocity of the object is sufficiently small, this can
work.
3. The bipartite matching solution corresponds to the minimum total cost.
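A sketch of the frame-to-frame assignment step using the Hungarian algorithm from scipy (the cost matrix is hypothetical; in practice it could come from, e.g., negative IoU or appearance distances):

\begin{verbatim}
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical cost matrix: cost[i, j] = cost of assigning track i to detection j
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9]])

rows, cols = linear_sum_assignment(cost)  # minimum total-cost bipartite matching
for track, det in zip(rows, cols):
    print(f"track {track} -> detection {det} (cost {cost[track, det]})")
\end{verbatim}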
Figure 38: Tracktor
Advantages:
Disadvantages:
1. No motion model:
- problems due to large motions(camera, objects) or low frame rate.
Definition:
A set X (e.g., containing images) is said to be a metric space if with any two points p
and q of X there is associated a real number d(p, q), called the distance from p to q, such
that
• d(p, q) > 0 if p ≠ q; d(p, p) = 0;
• d(p, q) = d(q, p);
• d(p, q) ≤ d(p, r) + d(r, q) for any r ∈ X.
\theta^* := \arg\min_\theta \; \mathbb{E}_{A,B \in S^+}[d_\theta(A, B)] - \mathbb{E}_{B,C \in S^-}[d_\theta(B, C)]   (23)
Figure 40: Metric learning
1. Hinge loss:

L(A, B) = y^* \, \|f(A) - f(B)\|^2 + (1 - y^*) \max\left(0, \; m^2 - \|f(A) - f(B)\|^2\right)

- where y^* is 1 if (A, B) is a positive pair, and 0 otherwise;
- hinge loss for negative pairs with margin m.
2. Triplet loss:

L(A, B, C) = \max\left(0, \; \|f(A) - f(B)\|^2 - \|f(A) - f(C)\|^2 + m\right)
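A NumPy sketch of both losses on embedding vectors (assuming the embedding f has already been applied; function names and the default margin are mine):

\begin{verbatim}
import numpy as np

def hinge_pair_loss(fa, fb, y, m=1.0):
    """Contrastive/hinge loss for an embedding pair; y = 1 for a positive pair."""
    d2 = np.sum((fa - fb) ** 2)
    return y * d2 + (1 - y) * max(0.0, m ** 2 - d2)

def triplet_loss(fa, fb, fc, m=1.0):
    """Anchor A, positive B, negative C: pull A-B together, push A-C apart by m."""
    d_pos = np.sum((fa - fb) ** 2)
    d_neg = np.sum((fa - fc) ** 2)
    return max(0.0, d_pos - d_neg + m)
\end{verbatim}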
• Annotation needed:
- e.g., same identity in different images.
• In practice: careful tuning of the positive pair term vs. the negative term.
- hard-negative mining and a bounded function for the negative pairs help.
Offline tracking:
- ”Given all observations, estimate any state.”
Minimizing the cost:

f^* = \arg\min_f \sum_{i,j} C(i, j) \, f(i, j)   (24)

where f is the disjoint set of trajectories, C(i, j) is the cost, and f(i, j) ∈ {0, 1} is an
indicator.
To incorporate detection confidence, we split each detection node in two, where C_det
indicates the detection confidence and C_t the transition cost.
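As an illustration only, here is a toy min-cost-flow tracking graph in networkx (costs are made up, and the split of detection nodes into C_det edges is omitted for brevity):

\begin{verbatim}
import networkx as nx

# Toy graph: two detections in frame 1 (a1, a2), two in frame 2 (b1, b2).
G = nx.DiGraph()
G.add_node("S", demand=-2)   # source: inject 2 units of flow (2 trajectories)
G.add_node("T", demand=2)    # sink: absorb them
for det in ["a1", "a2", "b1", "b2"]:
    G.add_node(det, demand=0)
# Entry/exit edges (C_in / C_out) and transition edges (C_t)
G.add_edge("S", "a1", weight=1, capacity=1)
G.add_edge("S", "a2", weight=1, capacity=1)
G.add_edge("a1", "b1", weight=0, capacity=1)   # low cost: likely same object
G.add_edge("a1", "b2", weight=5, capacity=1)
G.add_edge("a2", "b1", weight=5, capacity=1)
G.add_edge("a2", "b2", weight=0, capacity=1)
G.add_edge("b1", "T", weight=1, capacity=1)
G.add_edge("b2", "T", weight=1, capacity=1)

flow = nx.min_cost_flow(G)   # dict of dicts: flow[u][v] = units on edge (u, v)
print(flow["a1"])            # {'b1': 1, 'b2': 0} -> a1 links to b1
\end{verbatim}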
Problem with occlusions, such as:
MAP formulation
- Our solution is a set of trajectories τ ∗
\tau^* = \arg\max_T P(T | X)
      = \arg\max_T \prod_i P(x_i | T) \, P(T)   (27)
      = \arg\max_T \prod_i P(x_i | T) \prod_{T_i \in T} P(T_i)   (28)

- (27): Bayes rule, with Assumption 1: the detections x_i are conditionally independent given T.
- (28): Assumption 2: independence of trajectories, P(T) = \prod_{T_i \in T} P(T_i).

MAP formulation:

\arg\max_T \prod_i P(x_i | T) \prod_{T_i \in T} P(T_i)   (29)

\arg\min_T \; -\sum_i \log P(x_i | T) - \sum_{T_i \in T} \log P(T_i)   (30)   → log-space for optimization

Prior

\sum_{T_i \in T} \log P(T_i)   (31)

P(T_i) = P_{in}(x_0) \prod_{j=1}^{n} P_t(x_j | x_{j-1}) \, P_{out}(x_n)   (33)

-\log P(T_i) = -\log P_{in}(x_0) - \sum_{j=1}^{n} \log P_t(x_j | x_{j-1}) - \log P_{out}(x_n)   (34)

Entry cost: -\log P_{in}(x_0) = f_{in}(x_0) \, C_{in}(x_0)
Transition cost: -\log P_t(x_j | x_{j-1}) = f_t(x_j, x_{j-1}) \, C_t(x_j, x_{j-1})
Exit cost: -\log P_{out}(x_n) = f_{out}(x_n) \, C_{out}(x_n)

Likelihood:

-\sum_i \log P(x_i | T)   (35)

= f(x_i) \log \frac{1 - \gamma_i}{\gamma_i} - \log(1 - \gamma_i)   (38)

C_{det}(x_i) = \log \frac{1 - \gamma_i}{\gamma_i}
Optimization
(C_det, C_in, C_out, C_t) are estimated from data. Then:
• Binary/Fibonacci search over the number of trajectories: O(log n)
– Find the min-cost flow by the algorithm of: O(n^2 m \log n)
Summary
Open questions:
End-to-end learning?
• Can we learn features for multi-object tracking(encoding costs) to encode the solution
directly on the graph?
• Goal: Generalize the graph structure we have used and perform end-to-end learning.
Setup
Figure 46: Message Passing Network
Message passing
1. Initial graph
• Graph: G = (V, E)
• Initial embeddings:
- Node embeddings: h_i^{(0)}, i ∈ V
- Edge embeddings: h_{(i,j)}^{(0)}, (i, j) ∈ E
• Embeddings after l steps: h_i^{(l)}, i ∈ V, and h_{(i,j)}^{(l)}, (i, j) ∈ E
2. ”Node-to-edge” update
3. ”Edge-to-node” update
(b) After a round of edge updates, each edge embedding contains information about
its pair of incident nodes.
(c) By analogy: h_i^{(l)} = N_v([h_i^{(l-1)}, h_{(i,j)}^{(l)}])
(d) In general, we may have an arbitrary number of neighbors ("degree", or "valency").
(e) Define a permutation-invariant aggregation function \Phi^{(l)}:

h_i^{(l)} = N_v(h_i^{(l-1)}, \Phi^{(l)}(i))   (40)
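A compact sketch of one message-passing round with sum aggregation; the one-layer MLPs stand in for the learned networks N_e and N_v, and all shapes and names are illustrative:

\begin{verbatim}
import numpy as np

def mlp(x, W, b):
    # One-layer ReLU stand-in for the learned update networks N_e / N_v
    return np.maximum(0.0, x @ W + b)

def message_passing_step(h_node, h_edge, edges, We, be, Wv, bv):
    """h_node: {node: vec}, h_edge: {edge_id: vec}, edges: {edge_id: (i, j)}."""
    # "Node-to-edge": each edge sees its two incident nodes and itself
    new_edge = {}
    for e, (i, j) in edges.items():
        new_edge[e] = mlp(np.concatenate([h_node[i], h_node[j], h_edge[e]]), We, be)
    # "Edge-to-node": permutation-invariant aggregation (here: sum), Eq. (40)
    new_node = {}
    for n, h in h_node.items():
        msgs = [new_edge[e] for e, (i, j) in edges.items() if n in (i, j)]
        agg = np.sum(msgs, axis=0) if msgs else np.zeros_like(be)
        new_node[n] = mlp(np.concatenate([h, agg]), Wv, bv)
    return new_node, new_edge
\end{verbatim}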
Remarks:
• Main goal: gather content information into node and edge embeddings.
• All vertices/edges are treated equally, i.e., the parameters are shared.
1. Input
4. Edge classification
- Learn to directly predict solutions of the Min-cost flow problem by classifying edge
embeddings.
5. Output
Feature encoding
Appearance and geometry encodings:
3. Time Difference
tj − ti (43)
*Shared weights of CNN and MLP for all nodes and edges
Contrast:
• now:
Temporal causality
Flow conservation at a node
Classifying edges
• After several iterations of message passing, each edge embedding contains content
information about detections.
• After classifying edges, we get a prediction between 0 and 1 for each edge in the graph.
• Lightweight post-processing (rounding or linear programming).
• The overall method is fast (~12 fps) and achieves SOTA on the MOT challenge by a
significant margin.
Summary
• FP = false positive
• FN = false negative
1. An ID switch is counted because the ground truth track is assigned first to red,
then to blue.
2. Count both an ID switch (red and blue both assigned to the same ground truth)
and a fragmentation (Frag.), because the ground-truth coverage was cut.
3. Identity is preserved. If two trajectories overlap with a ground-truth trajectory
(within a threshold), the one that forces the fewest ID switches is chosen (the
red one).
Metrics:
• F1 score:

\mathrm{IDF1} = \frac{2 \sum_t TP_t}{\sum_t \left( 2\,TP_t + FP_t + FN_t \right)}   (45)
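A direct transcription of Eq. (45), given per-frame counts (the inputs are hypothetical):

\begin{verbatim}
def idf1(tp, fp, fn):
    """IDF1 from per-frame lists of identity-level TP/FP/FN counts, Eq. (45)."""
    return 2 * sum(tp) / sum(2 * t + p + n for t, p, n in zip(tp, fp, fn))

# Example with three frames:
print(idf1(tp=[5, 4, 6], fp=[1, 0, 2], fn=[0, 1, 1]))  # ~0.857
\end{verbatim}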
5 Segmentation
Flavours of image segmentation
5.1 K-means (clustering)
5.2 Spectral clustering
5.3 Normalized cut (Ncut)
5.4 Energy-based model: Conditional random fields (CRFs)
5.5 Fully convolutional neural networks
• 1 × 1 convolution facts
• Feature resolution
• Dilated convolutions
- Atrous spatial pyramid pooling
• Upsampling (transposed convolution)
• Upsampling (interpolation)
5.6 U-Net
5.7 SegNet
5.8 Mask R-CNN
5.8.1 PointRend
References
[1] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat:
Integrated recognition, localization and detection using convolutional networks,” 2014.
[Online]. Available: https://arxiv.org/abs/1312.6229
[2] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple fea-
tures,” in Proceedings of the 2001 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition. CVPR 2001, vol. 1, 2001, pp. I–I.
[3] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in
2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’05), vol. 1, 2005, pp. 886–893 vol. 1.
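[5] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, “Selective
search for object recognition,” International Journal of Computer Vision, vol. 104, no. 2,
pp. 154–171, 2013.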
[6] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial In-
telligence and Lecture Notes in Bioinformatics), vol. 8693. Springer, 2014, pp. 391–405.