CV III 2024 WS
Hurile Borjigin
Technical University of Munich
2024/2025 WS
Contents
1 Introduction and Object Detection 5
1.1 What this course is . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Understanding an image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Understanding a video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Some architectures and concepts . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Object detection 7
2.1 One-stage detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Object detection with sliding window . . . . . . . . . . . . . . . . . . 7
2.1.2 Feature-based detection . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Two-stage detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Non-Maximum Suppression(NMS) . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Object detection with deep networks . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Overfeat[1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Object proposals(Pre-filtering) and pooling for two-stage detection . . 11
2.3.3 R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.4 Fast R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.5 Faster R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Feature Pyramid Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Single-stage object detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.1 YOLO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.2 SSD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.3 RetinaNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.4 Problems with one-stage detectors . . . . . . . . . . . . . . . . . . . . 19
2.5.5 Focal loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.6 RetinaNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6 Spatial Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Detection evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4 Multi-object tracking 28
4.1 Approach 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Typical models of dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Bipartite matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Approach 2: Tracktor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Metric learning: For re-identification(Re-ID) . . . . . . . . . . . . . . . . . . . 32
4.5.1 Metric spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.5.2 How do we train a network to learn a feature representation . . . . . . 33
4.5.3 Metric learning for tracking . . . . . . . . . . . . . . . . . . . . . . . . 34
4.5.4 Summary of metric learning . . . . . . . . . . . . . . . . . . . . . . . . 34
4.6 Online vs. Offline tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.7 Graph based MOT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.7.1 Tracking with network flow . . . . . . . . . . . . . . . . . . . . . . . . 35
4.7.2 Tracking with Message Passing Network . . . . . . . . . . . . . . . . . 39
4.7.3 Evaluation of MOT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5 Segmentation 45
5.1 K-means(clustering) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Spectral clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3 Normalized cut(Ncut) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4 Energy-based model: Conditional random fields(CRFs) . . . . . . . . . . . . 49
5.4.1 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . 49
5.5 Fully convolutional neural networks . . . . . . . . . . . . . . . . . . . . . . . . 50
5.5.1 1 × 1 convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.6 U-Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.7 SegNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.8 Best practices for segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.9 Instance segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.9.1 Proposal based method: Mask R-CNN . . . . . . . . . . . . . . . . . . 57
5.9.2 Mask R-CNN with PointRend . . . . . . . . . . . . . . . . . . . . . . . 58
5.9.3 Proposal-free method . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.9.4 SOLOv2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.9.5 Instance segmentation: Summary . . . . . . . . . . . . . . . . . . . . . 60
5.10 Panoptic segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.10.1 Panoptic FPN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.10.2 Panoptic FPN: Summary . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.10.3 Panoptic FCN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.11 Panoptic evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7 Transformers 70
7.1 Self-attention: A hash table analogy . . . . . . . . . . . . . . . . . . . . . . . 71
7.2 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.3 Vision Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.4 Swin Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.5 DETR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.5.1 The loss function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.6 MaskFormer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.7 Mask2Former . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8 Unsupervised 81
8.1 Evaluating Self-supervised learning(SSL) models . . . . . . . . . . . . . . . . 82
8.2 Pretext tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.2.1 Pretext task: Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.2.2 Pretext task: Jigsaw puzzle . . . . . . . . . . . . . . . . . . . . . . . . 84
8.2.3 Pretext task: Colorization . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.3 Contrastive learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.3.1 Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.3.2 Deep frameworks for SSL . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.4 Non-contrastive learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.4.1 DINO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.4.2 DINOv2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.4.3 Masked Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.5 Downstream applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
9 Semi-supervised learning 91
9.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.1.1 Smoothness assumption . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.1.2 Low-density assumption . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.1.3 Manifold assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.2 Two taxonomies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.2.1 Unsupervised pre-processing . . . . . . . . . . . . . . . . . . . . . . . . 92
9.2.2 Wrapper methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
9.2.3 Intrinsically semi-supervised . . . . . . . . . . . . . . . . . . . . . . . . 94
9.2.4 Learning from synthetic data . . . . . . . . . . . . . . . . . . . . . . . 95
9.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
1 Introduction and Object Detection
1.1 What this course is
High-level(semantic) computer vision:
• Object detection
• image segmentation
• object tracking
• etc...
From another perspective, this course sits at the intersection of computer vision, deep learning and
real-world applications, as shown in Fig 1.
• Classification
• Object detection
• Semantic segmentation
• Instance segmentation
• Object tracking
Figure 2: CV III in two Dimensions
• Semantic segmentation(pixel-level)
But challenges:
• A lot of redundancy.
• Deformable/atrous convolutions
• Contrastive learning
2 Object detection
2.1 One-stage detectors
One-stage detectors treat object detection as a single task, directly predicting the object
class and bounding box coordinates from the input images.
For every position, measure the distance(or correlation) between the template and the image
region(see Fig 3), where d is the distance metric, I_{(x_0,y_0)} is the image region, and T is the template:

• Sum of squared differences(SSD):

d(I_{(x_0,y_0)}, T) = \frac{1}{n}\sum_{x,y}\left(I_{(x_0,y_0)}(x, y) - T(x, y)\right)^2 \qquad (2)

• Normalized cross-correlation(NCC):

d(I_{(x_0,y_0)}, T) = \frac{1}{n}\sum_{x,y}\frac{1}{\sigma_I \sigma_T}\, I_{(x_0,y_0)}(x, y)\, T(x, y) \qquad (3)

• Zero-normalized cross-correlation(ZNCC):

d(I_{(x_0,y_0)}, T) = \frac{1}{n}\sum_{x,y}\frac{1}{\sigma_I \sigma_T}\left(I_{(x_0,y_0)}(x, y) - \mu_I\right)\left(T(x, y) - \mu_T\right) \qquad (4)
Disadvantages
• (Self-) Occlusions
- e.g. due to pose changes
• Changes in appearance
2.1.2 Feature-based detection
2. Find a weak classifier with the lowest error across the dataset
3. Save the weak classifier and update the priority of the data samples
1. Choose your training set of images that contain the object you want to detect.
4. Train an SVM classifier on the two sets to detect whether a feature vector represents
the object of interest or not(0/1 classification)
Note: The amount of work for each RoI may grow significantly.
1. Region Proposal: The network first identifies potential regions in the image that
may contain objects.
2. Refinement: These regions are further analyzed to classify the objects and refine
their bounding boxes.
1. Selective search[5]
- Using class-agnostic segmentation at multiple scales.
2. Edge boxes[6]
- Bounding boxes that wholly enclose detected contours.
Region overlap
We measure region overlap with the Intersection over Union(IoU) or Jaccard Index:
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
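A small helper computing the Jaccard index for two axis-aligned boxes; the (x1, y1, x2, y2) corner format is an assumption made for this sketch:

```python
def iou(box_a, box_b):
    # Intersection over Union for boxes given as (x1, y1, x2, y2)
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)
```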
The threshold is the decision boundary used to determine whether a detected object or
prediction should be considered valid. For example, in object detection:
• Classification
• Localization
• Detection
Figure 8: Overfeat
Figure 9: NMS
Improved detection accuracy is largely due to feature representation learned with deep
nets.
But cons:
The complexity of a sliding window is O(N D), where N is the number of all the windows,
and D represents the amount of work needed to perform detection and classification for
one window. The complexity of the ”pre-filtered” method is O(N d + nD), where d is the
amount of work needed for filtering each window, and n is the number of windows left after
being filtered.
The ”pre-filtering” method pays off when:

\frac{n}{N} + \frac{d}{D} < 1 \qquad (6)

where n/N is the Region-of-Interest(RoI) ratio, and d/D is the efficiency of the RoI generator. In
practice, there is a delicate balance: making the filtering cheaper(smaller d) typically leaves more
windows after filtering(larger n).
1. Selective search
- Using class-agnostic segmentation at multiple scales.
2. Edge boxes
- Bounding boxes that wholly enclose detected contours
• (Fast/Faster) R-CNN
2.3.3 R-CNN
Training scheme:
2. Finetune the CNN on the number of classes the detector is aiming to classify
3. Train a linear SVM classifier to classify image regions - one linear SVM per class
Pros:
• New: CNN features; the overall pipeline with proposals is heavily engineered – Good
accuracy.
• Leverage transfer learning: The CNN can be pre-trained for image classification with
c classes. One needs only to change the FC layers to deal with Z classes.
Cons:
• Slow: 47 s/image with a VGG-16 backbone. Around 2000 proposals per image need to be
warped and forwarded through the CNN.
• Feature extraction and SVM classifier are trained separately - features are not learned
”end-to-end”.
Problems:
2. TBD
Figure 10: SPP-Net
• Feed the entire image into a convolutional neural network(e.g., VGG, ResNet).
This means the network only processes the image once, which is much more efficient.
2. Mapping region proposals to the feature map
You have a set of region proposals(bounding boxes) in the original image coordinates-these
come from an external region proposal method(e.g., selective search) or from a region pro-
posal network(in Faster R-CNN).
Each region proposal is then mapped onto the feature map coordinates:
- Because the CNN reduces spatial resolution(due to pooling layers or strided convolutions),
the coordinates of each region of the original image need to be scaled to align with the
coordinate system of the smaller feature map.
3. Dividing each RoI into Sub-regions
To handle different RoI sizes but produce a fixed-size output(for example 7X7 output grid
in Fast R-CNN):
• Divide the mapped RoI on the feature map into a predefined grid(e.g., 7X7).
• Each sub-region in this grid corresponds to a smaller set of feature map cells.
4. Pooling operation
Within each of these sub-regions(bins):
• Perform a pooling operation(usually max pooling, though average pooling can also be
used).
• This operation collapses the variable-sized sub-region into a single value(e.g., the max-
imum activation within that bin).
By doing this for each subregion in the grid, you transform the entire RoI into a fixed-size
feature map(e.g., 7X7), regardless of the original size of the bounding box.
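A minimal PyTorch sketch of steps 2-4 (mapping the RoI to the feature map, gridding, and max pooling). The stride-16 scale and the rounding-based quantization are simplifying assumptions:

```python
import torch.nn.functional as F

def roi_pool(feature_map, roi, output_size=7, spatial_scale=1.0 / 16):
    # feature_map: (C, H, W) tensor; roi: (x1, y1, x2, y2) in image coordinates.
    # spatial_scale maps image coordinates to feature-map coordinates
    # (1/16 assumes a backbone with total stride 16).
    x1, y1, x2, y2 = [int(round(c * spatial_scale)) for c in roi]
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)      # avoid empty regions
    crop = feature_map[:, y1:y2, x1:x2]            # variable-sized RoI on the feature map
    # adaptive max pooling collapses each bin of the output grid into a single value
    return F.adaptive_max_pool2d(crop, output_size)
```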
5. Feeding pooled features to classifier layers
Now that you have a fixed spatial dimension(e.g., 7×7), you can:
Figure 12: Fast R-CNN results
Classification ground truth: We compute p∗ which indicates how much an anchor overlaps
with the ground truth bounding boxes:
• Those anchors that contain an object are used to compute the regression loss.
1. RPN classification(object/non-object)
Figure 13: Faster R-CNN
t_x = \frac{x - x_a}{w_a} \quad (10) \qquad t_y = \frac{y - y_a}{h_a} \quad (11)

t_w = \log\frac{w}{w_a} \quad (12) \qquad t_h = \log\frac{h}{h_a} \quad (13)
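A sketch of the anchor parameterization in Eqs. (10)-(13) and its inverse; boxes are assumed to be in (center x, center y, width, height) format:

```python
import numpy as np

def encode_box(box, anchor):
    # Encode a box relative to an anchor as (t_x, t_y, t_w, t_h)
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def decode_box(t, anchor):
    # Invert the encoding to recover a box from predicted offsets
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([xa + tx * wa, ya + ty * ha, wa * np.exp(tw), ha * np.exp(th)])
```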
Figure 18: Faster R-CNN Performance
Straightforward implementation(sketched below):
• Convolution with 1 × 1 kernel
• Upsampling(nearest neighbours)
• Element-wise addition
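A hedged PyTorch sketch of one such top-down merge step (the channel counts are placeholders, not values from the lecture):

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    # One top-down step of a Feature Pyramid Network:
    # 1x1 lateral convolution + nearest-neighbour upsampling + element-wise addition.
    def __init__(self, c_bottom_up, c_pyramid=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_bottom_up, c_pyramid, kernel_size=1)

    def forward(self, bottom_up, top_down):
        # bottom_up: backbone feature at this level; top_down: pyramid feature from the coarser level
        up = F.interpolate(top_down, size=bottom_up.shape[-2:], mode="nearest")
        return self.lateral(bottom_up) + up
```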
Integration with RPN for object detection:
• Define RPN on each level of the pyramid.
Figure 20: Pyramidal feature hierarchy
Inference time:
It is more efficient than Faster R-CNN, but less accurate.
Figure 23: YOLO Inference
• Coarse grid resolution, few anchors per cell = issues with small objects
2.5.2 SSD
Pros:
Cons:
2.5.3 RetinaNet
Two-stage detectors:
• Class balance between foreground and background objects is manageable.
• Hard negative mining: subsample the negatives with the largest error:
- works, but can be unstable.
Class imbalance:
- Idea: balance the positives/negatives in the loss function.
- Recall cross-entropy loss:
2.5.6 RetinaNet
Figure 26: Focal loss
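A sketch of the α-balanced binary focal loss; the default hyperparameters follow the commonly used choice γ = 2, α = 0.25, but are assumptions here:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t): down-weights easy examples
    # so the many easy background anchors do not dominate the loss.
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```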
• Learn to localize RoI without dense supervision(only class label).
• Fully differentiable.
Cons:
- Difficulty of training for generalizing to more challenging scenes.
\text{Precision} = \frac{TP}{TP + FP} \qquad (15)
Recall: How many of the actual zebras in the image/dataset could you find?

\text{Recall} = \frac{TP}{TP + FN} \qquad (16)
What is a true positive?
• Use the Intersection over Union(IoU).
Figure 29: AP
Challenges:
• Fast motion.
• Changing appearance.
• Dynamic background.
• Occlusions.
• ...
Goal: Estimate car position at each time instant (say, the white car).
Observation: Image sequence and known background.
Observations: Image
System state: Car position(x, y)
Notations:
Goal:
Estimate posterior probability p(xk |Zk ).
How? Recursion:
Figure 32: Bayesian probabilities
1. p(z_k \mid x_k, Z_{k-1}) = p(z_k \mid x_k) \qquad (18)

2. p(x_k \mid x_{k-1}, Z_{k-1}) = p(x_k \mid x_{k-1}) \qquad (19)
Recursive Estimation
Key Concepts:
• Bayes’ Rule: p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}
Bayesian formulation:
p(x_k \mid Z_k) = k \cdot p(z_k \mid x_k) \cdot \int p(x_k \mid x_{k-1})\, p(x_{k-1} \mid Z_{k-1})\, dx_{k-1} \qquad (20)
• k → normalizing term
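To make the recursion concrete, a sketch of one filtering step on a discretized state space (the grid discretization is our assumption, not part of the lecture):

```python
import numpy as np

def bayes_filter_step(prior, transition, likelihood):
    # prior:      p(x_{k-1} | Z_{k-1}), shape (N,)
    # transition: p(x_k | x_{k-1}),     shape (N, N), rows indexed by x_k
    # likelihood: p(z_k | x_k),         shape (N,)
    predicted = transition @ prior          # the integral over x_{k-1}
    posterior = likelihood * predicted      # multiply by the likelihood
    return posterior / posterior.sum()      # k: normalizing term
```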
Estimators:
Assume the posterior probability p(xk |Zk ) is known:
• posterior mean:
• maximum a posterior(MAP):
Deep networks:
• It is easy to see what the networks have to output given the input
- but it is harder(yet more useful) to understand what a network models in terms of
our Bayesian formulation.
3.2 Online vs. Offline tracking
Online tracking:
- ”Given observations so far, estimate the current state.”
Offline tracking:
- ”Given all observations, estimate any state.”
An online tracking model can be used for offline tracking too. Our recursive Bayesian
model will still work.
3.3 GOTURN
Temporal prior:
Figure 34: Temporal prior
• efficient(real-time).
Disadvantages:
• the temporal prior is too simple: fails if there is fast motion or occlusion.
3.5 MDNet
Tracking with online adaptation
4 Multi-object tracking
Challenges:
• Heavy occlusions.
Figure 35: MDNet
4.1 Approach 1
1. Track initialization(e.g., using a detector).
• Constant acceleration(possibly unknown):
- Also captures the acceleration of the object.
- The state then includes not only the velocity but also the directional acceleration.
• Solution: Pseudo node.
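A sketch of bipartite matching between tracks and detections with the Hungarian algorithm, where a pseudo (dummy) column per track allows a track to stay unmatched; the cost definition and the unmatched-cost value are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tracks_to_detections(cost, unmatched_cost=0.8):
    # cost: (num_tracks, num_dets), e.g. 1 - IoU between track and detection boxes
    num_tracks, num_dets = cost.shape
    dummy = np.full((num_tracks, num_tracks), unmatched_cost)  # pseudo nodes
    rows, cols = linear_sum_assignment(np.concatenate([cost, dummy], axis=1))
    # columns >= num_dets correspond to pseudo nodes, i.e. the track stays unmatched
    return [(t, d) for t, d in zip(rows, cols) if d < num_dets]
```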
Advantages:
Disadvantages:
Figure 38: Tracktor
1. No motion model:
- problems due to large motions(camera, objects) or low frame rate.
Definition:
A set X(e.g., containing images) is said to be a metric space if with any two points p
and q of X there is associated a real number d(p, q), called the distance from p to q, such
that
• d(p, q) > 0 if p ≠ q; d(p, p) = 0;
• d(p, q) = d(q, p);
• d(p, q) ≤ d(p, r) + d(r, q) for any r ∈ X.
d(p, q) = d_\omega(f_\theta(p), f_\theta(q)) \qquad (25)

\theta^* := \arg\min_\theta \; \mathbb{E}_{A,B \in S^+}[d_\theta(A, B)] - \mathbb{E}_{B,C \in S^-}[d_\theta(B, C)] \qquad (26)
S + and S − are sets of positive and negative image pairs.
1. Hinge loss:
L(A, B) = y^* \|f(A) - f(B)\|^2 + (1 - y^*)\max\left(0,\, m^2 - \|f(A) - f(B)\|^2\right)
- where y ∗ is 1 if (A, B) is a positive pair, and 0 otherwise.
- hinge loss for negative pairs with margin m
2. Triplet loss:
L(A, B, C) = \max\left(0,\, \|f(A) - f(B)\|^2 - \|f(A) - f(C)\|^2 + m\right)
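A PyTorch sketch of the triplet loss above (A is the anchor, B the positive, C the negative); the margin value is a placeholder:

```python
import torch.nn.functional as F

def triplet_loss(f_a, f_b, f_c, margin=0.2):
    # Pull the positive pair (A, B) together and push the negative pair (A, C)
    # apart by at least the margin. Inputs: embeddings of shape (batch, dim).
    d_pos = (f_a - f_b).pow(2).sum(dim=1)   # ||f(A) - f(B)||^2
    d_neg = (f_a - f_c).pow(2).sum(dim=1)   # ||f(A) - f(C)||^2
    return F.relu(d_pos - d_neg + margin).mean()
```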
• Annotation needed:
- e.g., same identity in different images.
• In practice: careful tuning of the positive pair term vs. the negative term.
- hard-negative mining and a bounded function for the negative pairs help.
• For real-time applications.
Offline tracking:
- ”Given all observations, estimate any state.”
where f is the disjoint set of trajectories, C(i, j) indicates the cost, and f(i, j) is the 0/1
indicator.
To incorporate detection confidence, we split the node in two.
where C_det indicates the detection confidence, and C_t indicates the transition cost.
Problem with occlusions, such as:
Figure 43: Network flow
MAP formulation
- Our solution is a set of trajectories τ^*

\tau^* = \arg\max_T \prod_i P(x_i \mid T)\, P(T) \qquad (30)

\;\;\;\; = \arg\max_T \prod_i P(x_i \mid T) \prod_{T_i \in T} P(T_i) \qquad (31)
Bayes rule
Assumption 1:
Assumption 2:
Independence of trajectories
MAP formulation:

\arg\max_T \prod_i P(x_i \mid T) \prod_{T_i \in T} P(T_i) \qquad (32)

= \arg\min_T \; -\sum_i \log P(x_i \mid T) - \sum_{T_i \in T} \log P(T_i) \;\;\rightarrow\; \text{log-space for optimization} \qquad (33)
Prior

\sum_{T_i \in T} \log P(T_i) \qquad (34)

P(T_i) = P_{in}(x_0) \prod_{j=1,\dots,n} P_t(x_j \mid x_{j-1}) \; P_{out}(x_n) \qquad (36)

-\log P(T_i) = -\log P_{in}(x_0) - \sum_{j=1,\dots,n} \log P_t(x_j \mid x_{j-1}) - \log P_{out}(x_n) \qquad (37)
Transition Cost: log Pt (xj | xj−1 ) = ft (xj , xj−1 )Ct (xj , xj−1 )
Likelihood:
-\sum_i \log P(x_i \mid T) \qquad (38)
\gamma_i denotes the prediction confidence (e.g., provided by the detector)

= f(x_i) \log\frac{1 - \gamma_i}{\gamma_i} - \log(1 - \gamma_i) \qquad (41)

C_{det}(x_i) = \log\frac{1 - \gamma_i}{\gamma_i}
Optimization
(C_{det}, C_{in}, C_{out}, C_t) are estimated from data. Then:
• Binary/Fibonacci search: O(log n)
• Overall complexity: O(n^2 m \log n)
Summary
• Min-cost max-flow formulation:
the maximum number of trajectories with minimum costs.
Open questions:
End-to-end learning?
• Can we learn features for multi-object tracking(encoding costs) to encode the solution
directly on the graph?
• Goal: Generalize the graph structure we have used and perform end-to-end learning.
Setup
Message passing
1. Initial graph
• Graph: G = (V, E)
• Initial embeddings:
- Node embedding: h_i^{(0)}, i ∈ V
- Edge embedding: h_{(i,j)}^{(0)}, (i, j) ∈ E
• Embeddings after l steps: h_i^{(l)}, i ∈ V and h_{(i,j)}^{(l)}, (i, j) ∈ E
2. ”Node-to-edge” update
Figure 46: Message Passing Network
3. ”Edge-to-node” update
h_i^{(l)} = \mathcal{N}_v\left(h_i^{(l-1)}, \Phi^{(l)}(i)\right) \qquad (43)
Remarks:
• Main goal: gather content information into node and edge embeddings.
• All vertices/edges are treated equally, i.e. the parameters are shared.
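A compact sketch of one message-passing round; the MLP architectures and the use of summation as the aggregation Φ are assumptions for illustration:

```python
import torch
import torch.nn as nn

class MessagePassingStep(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())  # node-to-edge update
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())  # edge-to-node update

    def forward(self, h_nodes, h_edges):
        # h_nodes: {i: (dim,)}, h_edges: {(i, j): (dim,)}
        # node-to-edge: update each edge embedding from its two endpoint nodes
        new_edges = {(i, j): self.edge_mlp(torch.cat([h_nodes[i], h_nodes[j], h_ij]))
                     for (i, j), h_ij in h_edges.items()}
        # edge-to-node: aggregate incident edge embeddings (Phi) and update the node
        new_nodes = {}
        for i in h_nodes:
            incident = [h for e, h in new_edges.items() if i in e]
            agg = torch.stack(incident).sum(dim=0) if incident else torch.zeros_like(h_nodes[i])
            new_nodes[i] = self.node_mlp(torch.cat([h_nodes[i], agg]))
        return new_nodes, new_edges
```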
1. Input
4. Edge classification
- Learn to directly predict solutions of the Min-cost flow problem by classifying edge
embeddings.
5. Output
Figure 49: Geometry encodings
Feature encoding
Appearance and geometry encodings:
3. Time Difference
tj − ti (46)
*Shared weights of CNN and MLP for all nodes and edges
Contrast:
• now:
Temporal causality
Flow conservation at a node
Figure 50: Time-aware message passing
• After several iterations of message passing, each edge embedding contains content
information about detections.
• After classifying edges, we get a prediction between 0 and 1 for each edge in the graph.
• The overall method is fast(≈12 fps) and achieves SOTA in the MOT challenge by a
significant margin.
Summary
4.7.3 Evaluation of MOT
• FP = false positive
• FN = false negative
1. An ID switch is counted because the ground truth track is assigned first to red,
then to blue.
2. Count both an ID switch(red and blue both assigned to the same ground truth),
but also fragmentation(Frag.) because the ground truth coverage was cut.
3. Identity is preserved. If two trajectories overlap with a ground truth trajec-
tory(within a threshold), the one that forces the least ID switches is chosen(the
red one).
Metrics:
• F1 score:
\mathrm{IDF1} = \frac{2\sum_t TP_t}{\sum_t \left(2\,TP_t + FP_t + FN_t\right)} \qquad (48)
5 Segmentation
5.1 K-means(clustering)
1. Initialize(randomly) K cluster centers
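The remaining steps (assign each point to the nearest center, recompute the centers, repeat) follow the standard algorithm; a NumPy sketch:

```python
import numpy as np

def kmeans(X, K, iters=100):
    # X: (num_points, dim). Randomly pick K centers, then alternate between
    # assigning points to the nearest center and recomputing each center.
    centers = X[np.random.choice(len(X), K, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)                                   # assignment step
        centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(K)])         # update step
    return labels, centers
```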
Problem: The Euclidean distance may be insufficient. Instead, we want to cluster based
on the notion of ”connectedness”.
• Edge weights encode proximity between nodes(Similarity Matrix).
A := (a_{i,j}) \in \mathbb{R}^{n \times n}

L := D - A, \quad \text{where } D_{ii} = \sum_j A_{ij}
Optimization problem:
The Laplacian quadratic form measures how much variation exists between connected
nodes.
x^T L x = \frac{1}{2}\sum_{i,j} A_{i,j}\,(x_i - x_j)^2 \qquad (50)
where
• If two points are highly connected, (Aij is large), then their values xi and xj should
be similar, the term (xi − xj )2 is small.
• The goal is to minimize this sum, ensuring connected nodes have similar values.
Intuitively, if
x^T L x = \frac{1}{2}\sum_{i,j} A_{i,j}\,(x_i - x_j)^2 \approx 0 \qquad (52)
then,
• if Aij ≥ 0(i and j are similar), then xi ≈ xj (same cluster)
The solution x^* = \arg\min_x x^T L x (excluding the trivial constant vector) is the eigenvector
corresponding to the second smallest eigenvalue of L.
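A minimal sketch of the resulting two-way spectral clustering recipe (build the Laplacian, take the eigenvector of the second smallest eigenvalue, threshold it); thresholding at zero is a simplification:

```python
import numpy as np

def spectral_bipartition(A):
    # A: (n, n) symmetric similarity matrix
    D = np.diag(A.sum(axis=1))
    L = D - A                                # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                  # eigenvector of the 2nd smallest eigenvalue
    return (fiedler > 0).astype(int)         # split into two clusters
```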
Special case:
1. Zero eigenvalue
x^T L x = \frac{1}{2}\sum_{i,j} A_{i,j}\,(x_i - x_j)^2 = 0 \iff x_i = x_j \text{ whenever } A_{ij} > 0 \qquad (53)
2. Connected components
- Proposition: the multiplicity k of the eigenvalue 0 equals the number of connected
components; the corresponding eigenspace is spanned by the components' indicator vectors.
- Example with two components(n × 2):
5.3 Normalized cut(Ncut)
Spectral clustering as a min cut as Fig 55
Balanced cut
where
Ncut
• V is the set of nodes representing pixels.
• E defines similarities of two nodes.
• d(i) = \sum_j w_{i,j}
• (D − W) is the Laplacian matrix
• equivalently, D^{-1/2}(D - W)D^{-1/2} x = \lambda x
3. Use the eigenvector with the 2nd smallest eigenvalue to cut the graph.
where
• ϕ(x_i, y_i) is the unary term
Unary potential:
Variables
• xi : Binary variable
– foreground/background
• yi : Annotation
– foreground/background/empty
Unary term
• φ(xi , yi ) = K[xi ̸= yi ]
Pairwise term
Figure 57: Max-flow min-cut theorem: The maximum value of an s-t flow is equal to the
minimum capacity over all s-t cuts.
1. Remove GAP:
Figure 58: Fully connected CRFs
• No translation invariance
• Few parameters
• Variable input size
• Translation invariance
5.5.1 1 × 1 convolution
• 1×1 convolution is equivalent to applying a shared fully connected layer to every pixel
feature.
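A quick PyTorch check of this equivalence (the channel counts and spatial size are chosen arbitrarily for the demonstration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 32, 32)                 # feature map: 256 channels, 32x32 positions
conv1x1 = nn.Conv2d(256, 10, kernel_size=1)
fc = nn.Linear(256, 10)
fc.weight.data = conv1x1.weight.data.view(10, 256)   # share the same parameters
fc.bias.data = conv1x1.bias.data

out_conv = conv1x1(x)                                          # (1, 10, 32, 32)
out_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)         # same FC applied to every pixel
print(torch.allclose(out_conv, out_fc, atol=1e-6))             # True
```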
Figure 60: Extended networks for segmentation
– Removing an operation with stride 2 reduces the area of the receptive field by a
factor of ~4.
– Limited access to the context information.
D(K − 1) + 1 (59)
Figure 62: Dilated convolution
X' = W^T X \;\rightarrow\; \text{Broadcasting} \qquad (60)
• X : [K × n]
• W : [K × 1]
• X' : [1 × n]
• For CNNs, also apply the inverse of im2col to obtain the 2D grid representation - sums
up overlapping values.
Transposed convolution:
• Pad each pixel in the input(e.g., zeros).
Upsampling: Interpolation
• A better alternative:
- Interpolation(e.g., bilinear) followed by standard convolution.
Resize-convolution:
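A sketch of such a resize-convolution block (channel counts and kernel size are placeholders):

```python
import torch.nn as nn

class ResizeConv(nn.Module):
    # Upsampling by interpolation followed by a standard convolution,
    # a common alternative to transposed convolution.
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(self.up(x))
```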
5.6 U-Net
To mitigate information loss, features from the first layers of the encoder are reused in the decoder via skip connections.
Figure 66: U-Net
• Skip Connections: One of U-Net’s defining features is the use of skip connections.
These connections directly link corresponding layers in the encoder and decoder, al-
lowing the model to retain fine-grained spatial information during the upsampling
process.
• Asymmetric Depth: The depth of the network is asymmetric, with the encoder
portion consisting of downsampling layers and the decoder portion consisting of up-
sampling layers. This structure enables the model to effectively capture both global
context and precise local features.
• Heavy Use of Convolutions: U-Net heavily relies on convolutions for feature ex-
traction and localization. It often uses small convolutional kernels, typically of size
3x3, to capture intricate details and reduce computational overhead.
• Loss Function: U-Net commonly uses a pixel-wise softmax loss function for multi-
class segmentation or binary cross-entropy for binary segmentation tasks. The archi-
tecture can be easily adapted for different types of loss functions, depending on the
problem.
• Symmetry and Output Size: Due to the symmetric structure of the architecture,
the size of the output is the same as the input, which is an important feature for
segmentation tasks where the output needs to align with the input image pixel-wise.
5.7 SegNet
SegNet is a deep learning architecture primarily used for image segmentation tasks. It
is similar to U-Net in that it uses an encoder-decoder structure, but it has some unique
features that distinguish it.
• Max Pooling Indices for Upsampling: A key difference between SegNet and
other architectures, like U-Net, is that SegNet uses max pooling indices from the
encoder during the upsampling in the decoder. This technique allows for more accurate
localization and better reconstruction of fine details during the decoding process.
• Efficient Memory Usage: By storing only the max pooling indices instead of the
pooled feature maps, SegNet reduces memory consumption during the upsampling
process, making it more efficient than many other models.
• Loss Function: SegNet typically uses cross-entropy loss for training, which is com-
monly used for segmentation tasks.
• Applications: SegNet has been successfully applied to a variety of tasks, such as road
scene understanding, medical image analysis, and other segmentation applications
requiring precise delineation of object boundaries.
• Use dilated convolution to keep the receptive field size high.
- context information is important.
2. Does not differentiate between the pixels from objects(instances) of the same class.
3. Does not label pixels from uncountable objects(”stuff”), e.g., ”sky”, ”grass”, ”road”.
4. Differentiates between the pixels coming from instances of the same class.
• Mask R-CNN
• New: RoIAlign
Figure 69: Mask R-CNN
RoIAlign:
• No quantization.
• Instead, we parameterize the mask as a continuous function that maps the signal
domain(e.g., (x, y) coordinates):
Figure 71: RoIAlign
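For reference, torchvision provides an RoIAlign operator; a minimal usage sketch (the feature size, box, and stride-16 scale are illustrative):

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)                   # backbone feature map
boxes = torch.tensor([[0.0, 10.0, 10.0, 200.0, 150.0]])  # (batch_idx, x1, y1, x2, y2) in image coords
pooled = roi_align(features, boxes, output_size=(7, 7),
                   spatial_scale=1.0 / 16,               # image -> feature-map coordinates
                   sampling_ratio=2, aligned=True)
print(pooled.shape)                                      # torch.Size([1, 256, 7, 7])
```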
• Idea:
Remarks:
Figure 73: Qualitative result
• We can concatenate features sampled this way from multiple feature maps.
We obtain a semantic map using Fully convolution networks for semantic segmentation.
5.9.4 SOLOv2
Y = KX (61)
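A sketch of this dynamic convolution, assuming 1×1 instance-specific kernels K predicted by the kernel branch and a shared mask feature map X:

```python
import torch
import torch.nn.functional as F

X = torch.randn(1, 256, 100, 100)      # shared mask features
K = torch.randn(5, 256, 1, 1)          # 5 instances, one predicted kernel each
Y = F.conv2d(X, K)                     # (1, 5, 100, 100): one mask logit map per instance
```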
Figure 74: SOLOv2
• It differentiates between pixels coming from different instances of the same class(countable
objects) called ”things”(cars, pedestrians, etc.).
Challenges:
5.10.1 Panoptic FPN
1. NMS on instances.
2. Resolve stuff-things conflicts in favour of things.(WHY)
3. Remove any stuff regions labelled ”other” or with a small area.
• Loss function:
L = Lc + Lb + Lm + λs Ls (62)
Remark:
Training with multiple loss terms(”multi-task learning”) can be challenging, as different
loss terms may ”compete” for desired feature representation.
Even simpler?
Figure 77: Panoptic FCN
• Occlusions.
• Scale change.
• Illumination.
• Shape.
• ...
What we need:
1. Appearance model:
• Input: 1 frame.
• Output: segmentation mask.
• Offline tracking: considering the whole video - may provide a better clue(e.g., based
on object permanence).
• Assumptions:
1. Brightness constancy:
Image measurements(brightness) in a small region remain the same.
Figure 80: Brightness difference
I(x, y, t) + \Delta_x \frac{\partial}{\partial x} I(x, y, t) + \Delta_y \frac{\partial}{\partial y} I(x, y, t) + \Delta_t \frac{\partial}{\partial t} I(x, y, t) + \epsilon(\Delta_x^2 + \Delta_y^2 + \Delta_t^2) \qquad (66)
\frac{\partial E_{SSD}}{\partial v} \approx 2 \sum_R (u \cdot I_x + v \cdot I_y + I_t)\, I_y = 0 \qquad (68)
Lucas-Kanade:

\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \sum_R I_x^2 & \sum_R I_x I_y \\ \sum_R I_x I_y & \sum_R I_y^2 \end{bmatrix}^{-1} \begin{bmatrix} -\sum_R I_x I_t \\ -\sum_R I_y I_t \end{bmatrix}
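A NumPy sketch solving this 2×2 system for one region R from precomputed image gradients:

```python
import numpy as np

def lucas_kanade(Ix, Iy, It):
    # Ix, Iy: spatial gradients; It: temporal difference, all over the region R
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = np.array([-np.sum(Ix * It), -np.sum(Iy * It)])
    u, v = np.linalg.solve(A, b)   # assumes A is invertible (textured region)
    return u, v
```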
• Joint formulation:
iteratively improving segmentation and motion estimation.
• Slow to optimize:
runtime: up to 20s(excluding OF).
• Initialization matters:
we need(somewhat) accurate initial optical flow.
• DL to the rescue?
Correlation layer:
Figure 83: Siamese architecture
Figure 85: OSVOS
• The fine-tuned parameters are discarded before fine-tuning for the next video.
Drifting problem:
• The object appearance changes due to the changes in the object and camera pose
• Adapt model to appearance changes in every frame, not just the first frame.
Figure 87: Online adaptation
MaskTrack
• Conceptually simple.
Disadvantages:
• No temporal consistency.
• Idea: Learn a pixel-level embedding space where proximity between two feature vectors
is semantically meaningful.
• The user input can be in any form, first-frame ground-truth mask, scribble....
• Training: Use the triplet loss to bring foreground pixels together and separate them
from background pixels.
• Test: embed pixels from both annotated and test frame, and perform a nearest neigh-
bor search for the test pixels.
• Advantages:
– We do not need to retrain the model for each sequence, nor fine-tune.
– We can use unsupervised training to learn a useful feature representation(e.g.,
contrastive learning).
7 Transformers
CNNs: A few relevant facts
• We may benefit from more explicit(i.e. in the same layer) non-local feature interac-
tions.
Universality: Developing components that work across all possible settings:
• Modern deep learning is only partially universal, even in the contexts we have seen so
far(detection and segmentation).
• Transformers can be applied to language and images with excellent results in both
domains.
(Diagram: a hash table - a function f maps a query to one of the keys key 1-4, each of which stores a value val 1-4.)
• A hash table is not a differentiable function(of values w.r.t. keys); it’s not even
continuous.
- We can access any value if we know the corresponding key exactly.
• We can use standard operators in our DL arsenal: dot product and softmax
Figure 90: Self attention
• A value matrix V ∈ R^{T×m}

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{n}}\right)V \qquad (71)

The scaling factor \sqrt{n} eases optimization, especially for large n.
Linear projection
• Compute Q, K, V :
K := XW^K, \quad Q := XW^Q, \quad V := XW^V \qquad (72)
• Output: Y := \mathrm{Attention}(Q, K, V)\, W^O
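A single-head sketch of Eqs. (71)-(72) in PyTorch (the projection matrices are passed explicitly for clarity):

```python
import torch
import torch.nn.functional as F

def self_attention(X, Wq, Wk, Wv, Wo):
    # X: (T, d) token features; Wq, Wk, Wv: (d, n) projections; Wo: (n, d) output projection
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    n = Q.shape[-1]
    A = F.softmax(Q @ K.T / n ** 0.5, dim=-1)   # (T, T) attention weights
    return (A @ V) @ Wo
```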
Complexity
\mathrm{softmax}\left(\frac{QK^T}{\sqrt{n}}\right)V \qquad (73)
Multi-head attention
• We can extend the same operation to multiple look-ups without increasing the com-
plexity.
• This is called multi-head attention.
- For example, split it in two(e.g., split 128d vectors into two 64d vectors).
• Now we have two (Q, K, V ) triplets, where feature dimension is reduced by a factor
of 2.
• Intuition: Given a single query, we can fetch multiple values corresponding to different
keys.
Normalization:
Further improvements:
• Residual connection:
add the original token feature before the normalization.
• Layer normalization:
Normalize each token feature w.r.t. its own mean and standard deviation.
Recap:
• Self-attention is versatile:
– the query index does not matter; we will always fetch the same value for it.
– Is it always useful?
7.2 Transformers
Transformer-based encoder:
– It can be a learnable model parameter or fixed, for example:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad (75)

PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad (76)
– It introduces the notion of spatial affinity(e.g., distance between two words in
sentence; two patches in an image, etc.)
– It breaks permutation invariance.
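A NumPy sketch of the fixed sinusoidal encoding of Eqs. (75)-(76) (assumes an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    # Returns an array of shape (num_positions, d_model)
    pos = np.arange(num_positions)[:, None]              # (P, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe
```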
• Main idea: Train on sequences of image patches; only minor changes to the origi-
nal(NLP) Transformer model.
Steps:
6. Predict the class with an MLP using the output [class] token.
• ViT performs well only when pre-trained on large JFT dataset(300M images).(QUIZ:
Why?)
– Locality(Self-attention is global.)
– 2D neighborhood structure(positional embedding is inferred from data.)
– Translation invariance.
• Nevertheless, now we use the same computational framework for two distinct modali-
ties: Language and vision!
• The number of tokens remains the same in all layers(QUIZ: How many/ what does it
depend on?)
Figure 96: Mature Swin
Results:
• More efficient and accurate than ViT and some of CNNs(despite using lower resolution
input).
• Demonstrates that Transformers are strong vision models across a range of classic
downstream applications.
Transformer for detection?
• Would it make sense to adapt the Transformer to object detection?
7.5 DETR
• During training, we uniquely assign predictions to ground truth boxes with Hungarian
matching.
A closer look:
• The model flattens it and supplements it with a positional encoding before passing it
to the Transformer.
– TF encoder: ”self-attention”
– TF decoder: ”cross-attention”
7.5.1 The loss function
\mathcal{L}_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \emptyset\}} \, \mathcal{L}_{box}\!\left(b_i, \hat{b}_{\hat{\sigma}(i)}\right) \right] \qquad (77)
where
• LHungarian (y, ŷ): The total Hungarian loss used in DETR, which consists of classifica-
tion and localization terms.
• σ̂(i): The optimal assignment of ground truth objects to predicted objects obtained
using the Hungarian algorithm.
• p̂σ̂(i) (ci ): The predicted probability for the class ci of the ground truth object assigned
to prediction σ̂(i).
• − log p̂σ̂(i) (ci ): The classification loss term, which penalizes incorrect class predictions.
• 1{ci ̸=∅} : An indicator function that is 1 if the object ci is not the ”no object” (∅) class.
• Lbox (bi , b̂σ̂(i) ): The bounding box regression loss between the ground truth bounding
box bi and the predicted bounding box b̂σ̂(i) .
• λiou : A weight coefficient for the IoU (Intersection over Union) loss.(→hyperparameter)
• Liou (bi , b̂σ(i) ): The generalized IoU loss function, which penalizes misalignment be-
tween the predicted and ground truth bounding boxes.
• b̂σ(i) : The predicted bounding box assigned to the ground truth object i using the
Hungarian algorithm.
• ||bi − b̂σ(i) ||1 : The L1 loss, which measures the absolute difference between the ground
truth and predicted bounding box coordinates.
DETR: Summary
• Accurate panoptic segmentation with a minor extension.
• Issues:
7.6 MaskFormer
MaskFormer’s idea: Compute the kernels using learnable queries with a Transformer.
Works well, but has troubles with small objects/segments.
7.7 Mask2Former
Idea: Attend with self-attention at multiple scales of the feature hierarchy:
Masked attention:
• Idea: constrain attention only to the foreground area corresponding to the query.
Figure 102: Mask2Former
• Standard self-attention:
• Masked attention:
Mask2Former: Summary
7.8 Conclusions
• Transformers have revolutionized the field of NLP, achieving incredible results.
8 Unsupervised
Learning without labels
Compact, yet descriptive representation?
Consider the loss:
L = I(X; Z) − βI(Z; Y ) (82)
where
I(X; Y) = \sum_{x \in X} \sum_{y \in Y} P_{(X,Y)}(x, y) \log \frac{P_{(X,Y)}(x, y)}{P_X(x)\, P_Y(y)} \qquad (83)
After obtaining compact and descriptive representation, we can apply it for different tasks,
such as detection, classification, segmentation, etc. by adding task-specific shallow mod-
els(e.g., linear projection).
• Linear probing:
– Only learn the new linear projection. The model parameters remain fixed.
– Pros: We can re-use the model for other tasks by training multiple linear projections.
– Cons: Typically worse task accuracy than fine-tuning.
• k-NN classification:
– Project labeled data in the embedding space.
– Classify datapoints according to the class of its k nearest neighbors.
– Pros: Same as linear probing(versatility), but no learning is necessary.
– Cons: Prediction can be a bit costly.
- Linear search complexity due to high feature dimensionality.
– Goal: Define training objectives with some relation to our target objective.
– By training the model on these objectives, we hope to learn something for the
target objective.
• Categories of self-supervision
– Pretext tasks
– Contrastive learning
– Non-contrastive learning
• The deep net will always try to cheat, i.e. find ”shortcut” solutions.
• Otherwise, there’s no canonical pose, hence the rotation angles are meaningless.
- A thought experiment: add all rotated images to the original dataset.
Figure 104: Pretext task: Rotation
• Solving this task requires the model to learn spatial relation of image patches.
Nuances:
• This is not the same as predicting RGB values from a greyscale image(due to multi-
modality of colorization).
L(A, B, C) = \max\left(0,\, \|f(A) - f(B)\|^2 - \|f(A) - f(C)\|^2 + m\right) \qquad (84)
Intuitive idea:
• Use data augmentation(e.g., cropping) to create a positive pair of the same image.
Example:
– Temperature τ : a hyperparameter(usually between 0.01 and 1.0).
– What is the range?
– What does it mean when it reaches maximum/minimum?
– We clearly want to maximize this value!(many implementations)
– Example loss: − log s(x).
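A sketch of a temperature-scaled contrastive (InfoNCE-style) loss for one anchor; the exact form differs between frameworks such as SimCLR and MoCo:

```python
import torch
import torch.nn.functional as F

def info_nce(z_anchor, z_positive, z_negatives, tau=0.1):
    # z_anchor, z_positive: (d,); z_negatives: (K, d); tau: temperature
    z_a = F.normalize(z_anchor, dim=0)
    z_p = F.normalize(z_positive, dim=0)
    z_n = F.normalize(z_negatives, dim=1)
    logits = torch.cat([(z_a * z_p).sum().view(1), z_n @ z_a]) / tau
    # cross-entropy with the positive at index 0, i.e. -log softmax(logits)[0]
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```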
8.3.1 Intuition
• SimCLR
- A simple framework for contrastive learning.
• BYOL
• MoCo
- Reduced the GPU memory footprint by applying momentum encoder:
• DINO
• Masked Autoencoders
• Self-attention maps in the last layer of a ViT with respect to [CLS] token.
Figure 112: DINO
8.4.2 DINOv2
• Compute the loss only on the patches masked out in the input:
- This is different from denoising autoencoders.
Figure 114: Masked autoencoders
Figure 117: Extracted features
Segmentation from motion cues. Idea: If the object mask is correct, we cannot reconstruct the
object-related optical flow. We train two networks:
• Network G: Given an image and optical flow, predict object mask (foreground/background).
• Network I: Given a masked optical flow and the image (not masked), reconstruct the
original optical flow.
• Label each patch in the image and propagate them through a video.
- We compute affinity(using cosine similarity) between patches of subsequent frames.
• Cycle consistency loss: Each label should arrive at its original location.
8.6 Conclusion
• Research on unsupervised learning is very active.
Figure 119: Extracted features
9 Semi-supervised learning
Training on labeled and unlabeled data.
General remarks:
• Using both labeled and unlabeled data is a very practical scenario.
• If the goal is to get the best accuracy, semi-supervised learning is the way to go.
- Current state-of-the-art frameworks take this approach(rather than full supervision).
Small print:
1. Improvement is not always guaranteed.
- It depends on the model, the technique used and the unlabeled data.
Figure 121: Semi-loss
9.1 Assumptions
Assumptions about semi-supervised learning:
If two input points are close by, their labels should be the same.
Transitivity
The decision boundary should pass through a region with low density p(x).
• Data points sharing the same manifold, share the same label.
Remark:
Which assumptions to make depends on what we know about how our data distribution
p(x) interacts with the class posterior p(y|x).
9.2.2 Wrapper methods
Self-training
A single classifier trained jointly on labeled and self-labeled data from the unlabeled dataset.
OnAVOS:
Online adaptation: Adapt model to appearance changes in every frame, not just the first
frame.
Drawback: Can be slow.
Figure 124: OnAVOS with unlabeled dataset
A general outline:
Entropy minimization
• Example:
Entropy minimization for semantic segmentation(”self-training”):
L(\{(x_i, y_i)\}_i, \{\hat{x}_i\}_i) = \sum_i L_{supervised}(x_i, y_i) + \lambda \sum_i L_{unsupervised}(\hat{x}_i) \qquad (88)
• Objective:
Minimize the entropy of class distribution for each pixel:
L_{unsupervised}(\hat{x}) = -\sum_j p(f(\hat{x})_j \mid \hat{x}) \log p(f(\hat{x})_j \mid \hat{x}) \qquad (89)
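A PyTorch sketch of the unsupervised term in Eq. (89), averaged over pixels and batch:

```python
import torch.nn.functional as F

def entropy_minimization_loss(logits):
    # logits: (B, C, H, W) class scores for unlabeled images;
    # encourages confident (low-entropy) per-pixel class distributions.
    p = F.softmax(logits, dim=1)
    log_p = F.log_softmax(logits, dim=1)
    return -(p * log_p).sum(dim=1).mean()
```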
where
r_{adv} := \arg\max_{r: \|r\| \leq \epsilon} D\left[q(y \mid x^*),\, p(y \mid x^* + r, \theta)\right] \qquad (91)
• Semi-supervised case:
Replace q(y|x∗ ) above with our current estimate, p(y|x∗ , θ̂)
Domain alignment
This translates into disjoint feature distribution of a model trained only on the labeled data:
Consistency regularization
Consistent prediction across image transformations.
Semantic meaning does not change, though not guaranteed in a deep net → Use it as a
consistency loss.
Framework:
Momentum net
Test-time augmentation is applied online at training time:
Limited supervision
Figure 130: Classification network
9.3 Summary
• Entropy minimization
- Improves accuracy, but leads to miscalibration.
• Consistency regularization
- can be effective, but sensitive to initial pseudo-label quality.
• Self-training
- simple and effective; but is limited by available augmentation techniques.
• Unsupervised pre-training
- simple and effective; this should be your first baseline.
• Domain alignment
- typically less fine-tuning is required, but can be still challenging to train(GAN).
References
[1] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat:
Integrated recognition, localization and detection using convolutional networks,” 2014.
[Online]. Available: https://arxiv.org/abs/1312.6229
[2] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple fea-
tures,” in Proceedings of the 2001 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition. CVPR 2001, vol. 1, 2001, pp. I–I.
[3] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in
2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’05), vol. 1, 2005, pp. 886–893 vol. 1.
[4] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, mul-
tiscale, deformable part model,” in 2008 IEEE Conference on Computer Vision and
Pattern Recognition, 2008, pp. 1–8.
[6] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial In-
telligence and Lecture Notes in Bioinformatics), vol. 8693. Springer, 2014, pp. 391–405.