
COMPUTER VISION III LECTURE NOTES

Lecturer: Nikita Araslanov

Hurile Borjigin
Technical University of Munich
2024/2025 WS

Contents
1 Introduction and Object Detection 5
1.1 What this course is . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Understanding an image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Understanding a video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Some architectures and concepts . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Object detection 7
2.1 One-stage detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Object detection with sliding window . . . . . . . . . . . . . . . . . . 7
2.1.2 Feature-based detection . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Two-stage detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Non-Maximum Suppression (NMS) . . . . . . . . . . . . . . . . . . . . 10
2.3 Object detection with deep networks . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Overfeat[1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Object proposals(Pre-filtering) and pooling for two-stage detection . . 11
2.3.3 R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.4 Fast R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.5 Faster R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Feature Pyramid Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Single-stage object detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.1 YOLO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.2 SSD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.3 RetinaNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.4 Problems with one-stage detectors . . . . . . . . . . . . . . . . . . . . 19
2.5.5 Focal loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.6 RetinaNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6 Spatial Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Detection evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3 Single object tracking 23


3.1 Bayesian tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.1 Hidden Markov model . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Online vs. Offline tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 GOTURN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Online adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5 MDNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4 Multi-object tracking 28
4.1 Approach 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Typical models of dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Bipartite matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Approach 2: Tracktor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Metric learning: For re-identification(Re-ID) . . . . . . . . . . . . . . . . . . . 32
4.5.1 Metric spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.5.2 How do we train a network to learn a feature representation . . . . . . 33
4.5.3 Metric learning for tracking . . . . . . . . . . . . . . . . . . . . . . . . 34
4.5.4 Summary of metric learning . . . . . . . . . . . . . . . . . . . . . . . . 34
4.6 Online vs. Offline tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.7 Graph based MOT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.7.1 Tracking with network flow . . . . . . . . . . . . . . . . . . . . . . . . 35
4.7.2 Tracking with Message Passing Network . . . . . . . . . . . . . . . . . 39
4.7.3 Evaluation of MOT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5 Segmentation 45
5.1 K-means(clustering) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Spectral clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3 Normalized cut(Ncut) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4 Energy-based model: Conditional random fields(CRFs) . . . . . . . . . . . . 49
5.4.1 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . 49
5.5 Fully convolutional neural networks . . . . . . . . . . . . . . . . . . . . . . . . 50
5.5.1 1 × 1 convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.6 U-Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.7 SegNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.8 Best practices for segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.9 Instance segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.9.1 Proposal based method: Mask R-CNN . . . . . . . . . . . . . . . . . . 57
5.9.2 Mask R-CNN with PointRend . . . . . . . . . . . . . . . . . . . . . 58
5.9.3 Proposal-free method . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.9.4 SOLOv2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.9.5 Instance segmentation: Summary . . . . . . . . . . . . . . . . . . . . . 60
5.10 Panoptic segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.10.1 Panoptic FPN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.10.2 Panoptic FPN: Summary . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.10.3 Panoptic FCN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.11 Panoptic evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

6 Video object segmentation 63


6.1 Motion-based VOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.1.1 Optical flow: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.1.2 FlowNet: Architecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.1.3 FlowNet: Architecture 2 → Siamese architecture . . . . . . . . . . . . 66
6.1.4 Motion-based VOS: Summary . . . . . . . . . . . . . . . . . . . . . . . 67
6.2 Appearance-only VOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.2.1 Appearance-only VOS: Summary . . . . . . . . . . . . . . . . . . . . . 69
6.3 Metric-based approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

7 Transformers 70
7.1 Self-attention: A hash table analogy . . . . . . . . . . . . . . . . . . . . . . . 71
7.2 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.3 Vision Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

7.4 Swin Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.5 DETR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.5.1 The loss function . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.6 MaskFormer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.7 Mask2Former . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

8 Unsupervised 81
8.1 Evaluating Self-supervised learning(SSL) models . . . . . . . . . . . . . . . . 82
8.2 Pretext tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.2.1 Pretext task: Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.2.2 Pretext task: Jigsaw puzzle . . . . . . . . . . . . . . . . . . . . . . . . 84
8.2.3 Pretext task: Colorization . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.3 Contrastive learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.3.1 Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.3.2 Deep frameworks for SSL . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.4 Non-contrastive learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.4.1 DINO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.4.2 DINOv2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.4.3 Masked Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.5 Downstream applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

9 Semi-supervised learning 91
9.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.1.1 Smoothness assumption . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.1.2 Low-density assumption . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.1.3 Manifold assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.2 Two taxonomies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.2.1 Unsupervised pre-processing . . . . . . . . . . . . . . . . . . . . . . . . 92
9.2.2 Wrapper methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
9.2.3 Intrinsically semi-supervised . . . . . . . . . . . . . . . . . . . . . . . . 94
9.2.4 Learning from synthetic data . . . . . . . . . . . . . . . . . . . . . . . 95
9.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

1 Introduction and Object Detection
1.1 What this course is
High-level (semantic) computer vision:

• Object detection

• image segmentation

• object tracking

• etc...

From another perspective this is the intersection of Computer vision, deep learning and
real-world applications, as shown in Fig 1.

Figure 1: Another perspective

What is computer vision:

• First defined in the 1960s (The Summer Vision Project, 1966).

• "Mimic the human visual system."

• A central building block of artificial intelligence.

Semantic scene understanding:

• Classification

• Object detection

• Semantic segmentation

• Instance segmentation

• Object tracking

This course along its two dimensions is shown in Fig 2.

Figure 2: CV III in two Dimensions

1.2 Understanding an image


Different representations depending on the granularity

• Detection (bounding box-coarse description)

• Semantic segmentation(pixel-level)

• Instance segmentation(e.g. ”person 1”, ”person 2”)

1.3 Understanding a video


Why use the temporal domain

• Motion analysis, multi-view reasoning.

• A smoothness assumption: no abrupt changes between frames.

But challenges:

• High computational demand.

• A lot of redundancy.

• Occlusions, multiple objects moving and interacting.

1.4 Some architectures and concepts


• R-CNN, Fast R-CNN and Faster R-CNN (2-stage object detection)

• YOLO, SSD, RetinaNet (1-stage object detection)

• Siamese networks (online tracking)

• Message Passing Networks (offline tracking)

• Mask R-CNN, UPSNet (panoptic segmentation)

• Deformable/atrous convolutions

• Graph neural networks(GNNs)

• Vision Transformers(ViT), DETR(object detection), SAM

• Contrastive learning

2 Object detection
2.1 One-stage detectors
One-stage detectors treat object detection as a single task, directly predicting the object
class and bounding box coordinates from the input images.

2.1.1 Object detection with sliding window

For every position, measure the distance(or correlation) between the template and the image
region(See Fig 3):

Figure 3: One-stage detector(old)

L(x_0, y_0) = d(I_{(x_0,y_0)}, T)    (1)

where d is the distance metric, I(x0 ,y0 ) is the image region, and T is the template.

Template matching distances

• Sum of squared distances (SSD), or mean squared error (MSE):

  d(I_{(x_0,y_0)}, T) = (1/n) Σ_{x,y} (I_{(x_0,y_0)}(x, y) − T(x, y))^2    (2)

• Normalized cross-correlation (NCC):

  d(I_{(x_0,y_0)}, T) = (1/n) Σ_{x,y} (1/(σ_I σ_T)) I_{(x_0,y_0)}(x, y) T(x, y)    (3)

• Zero-normalized cross-correlation (ZNCC):

  d(I_{(x_0,y_0)}, T) = (1/n) Σ_{x,y} (1/(σ_I σ_T)) (I_{(x_0,y_0)}(x, y) − µ_I)(T(x, y) − µ_T)    (4)
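To make the sliding-window search concrete, here is a minimal NumPy sketch of template matching with the ZNCC score of Eq. (4); the brute-force double loop and the function names are only illustrative.

import numpy as np

def zncc(patch, template):
    # Zero-normalized cross-correlation between an image patch and the template.
    p = patch - patch.mean()
    t = template - template.mean()
    return float((p * t).sum() / (patch.std() * template.std() * patch.size + 1e-8))

def sliding_window_match(image, template):
    # Brute-force search: evaluate the score at every valid position (x0, y0).
    H, W = image.shape
    h, w = template.shape
    scores = np.full((H - h + 1, W - w + 1), -np.inf)
    for y0 in range(H - h + 1):
        for x0 in range(W - w + 1):
            scores[y0, x0] = zncc(image[y0:y0 + h, x0:x0 + w], template)
    return np.unravel_index(scores.argmax(), scores.shape)  # best (y0, x0)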

Disadvantages

• (Self-) Occlusions
- e.g. due to pose changes

• Changes in appearance

• Unknown position, scale, and aspect ratio


- brute-force search(inefficient)

2.1.2 Feature-based detection

Idea: Learn feature-based classifiers invariant to natural object changes(See Fig 4)

Figure 4: One-stage detector(new)

• Features are not always linearly separable

• Learning multiple weak learners to build a strong classifier(Fig 5)

Figure 5: Multiple weak learners

Viola-Jones detector [2]


Given data (xi , yi )

1. Define a set of Haar-like features(Fig 6)


Haar-features are sensitive to directionality of patterns

Figure 6: Haar features

2. Find a weak classifier with the lowest error across the dataset

3. Save the weak classifier and update the priority of the data samples

4. Repeat Steps 2-3 N times

Final classifier is the linear combination of all weak learners.

Histogram of Oriented Gradients [3]


Average gradient image over training samples → gradients provide shape information.
HOG descriptor → Histogram of oriented gradients.
Compute gradients in dense grids, compute gradients and create a histogram based on
gradient direction.

1. Choose your training set of images that contain the object you want to detect.

2. Choose a set of images that do NOT contain that object.

3. Extract HOG features from both sets.

4. Train an SVM classifier on the two sets to detect whether a feature vector represents
the object of interest or not(0/1 classification)

Deformable Part Model[4]


Many objects are not rigid, so we could use a Bottom-up approach:

1. detect body parts

2. detect ”person” if the body parts are in correct arrangement

Note: The amount of work for each RoI may grow significantly.

2.2 Two-stage detectors


Two-stage object detectors split the object detection task to two parts(Fig 7):

1. Region Proposal: The network first identifies potential regions in the image that
may contain objects.

2. Refinement: These regions are further analyzed to classify the objects and refine
their bounding boxes.

Figure 7: 2-stage detection

A generic, class-agnostic objectness measure: object proposals or regions of interest (RoI).
Object proposal methods:
Object proposal methods:

1. Selective search[5]
- Using class-agnostic segmentation at multiple scales.

2. Edge boxes[6]
- Bounding boxes that wholly enclose detected contours.

2.2.1 Non-Maximum Suppression(NSM)

Method to keep only the best proposals(See algorithm 1).

Algorithm 1 Non-Max Suppression


1: procedure NMS(B, c)
2: Bnms ← ∅
3: for bi ∈ B do ▷ Start with anchor box i
4: discard ← False
5: for bj ∈ B do ▷ For another box j
6: if same(bi , bj ) > λnms then ▷ If they overlap
7: if score(c, bj ) > score(c, bi ) then
8: discard ← True ▷ Discard box i if the score is lower than the score of j
9: end if
10: end if
11: end for
12: if ¬discard then
13: Bnms ← Bnms ∪ bi
14: end if
15: end for
16: return Bnms
17: end procedure

Region overlap
We measure region overlap with the Intersection over Union(IoU) or Jaccard Index:

J(A, B) = |A ∩ B| / |A ∪ B|
The threshold is the decision boundary used to determine whether a detected object or
prediction should be considered valid. For example, in object detection:

• Choosing a high threshold – more false negatives, lower recall.

• Choosing a low threshold – more false positives, lower precision.
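A minimal NumPy sketch of IoU and of the NMS procedure from Algorithm 1 (boxes in (x1, y1, x2, y2) format; the 0.5 overlap threshold is only an example):

import numpy as np

def iou(a, b):
    # Intersection over Union (Jaccard index) of two boxes (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def nms(boxes, scores, iou_thresh=0.5):
    # Discard a box whenever it overlaps a higher-scoring box by more than iou_thresh.
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(i)
        order = np.array([j for j in order[1:] if iou(boxes[i], boxes[j]) <= iou_thresh], dtype=int)
    return keep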

2.3 Object detection with deep networks


2.3.1 Overfeat[1]

Explores three well-known vision tasks using a single framework.

• Classification

• Localization

• Detection

Figure 8: Overfeat

Training the convolutional layers on all tasks boosts accuracy (see Fig 8).


Apply Non-Max Suppression to combine the predictions and windows, see Fig 9.

Figure 9: NMS

Improved detection accuracy is largely due to feature representation learned with deep
nets.
But cons:

• Expensive to try all possible positions, scales and aspect ratios.

• The network operates on a fixed resolution.

The complexity of a sliding window is O(N D), where N is the number of all the windows,
and D represents the amount of work needed to perform detection and classification for
one window. The complexity of the ”pre-filtered” method is O(N d + nD), where d is the
amount of work needed for filtering each window, and n is the number of windows left after
being filtered.
”Pre-filtering” method pays off when:

O(N D) > O(N d + nD) (5)

Assume the constant factors are comparable/negligible:

n/N + d/D < 1    (6)

where n/N is the region-of-interest (RoI) ratio and d/D reflects the efficiency of the RoI generator. In
practice, there is a delicate balance between n and d: reducing d tends to increase n.

2.3.2 Object proposals(Pre-filtering) and pooling for two-stage detection

Heuristic-based object proposal methods:

1. Selective search
- Using class-agnostic segmentation at multiple scales.

2. Edge boxes
- Bounding boxes that wholly enclose detected contours.

R-CNN family(Regions with CNN features)

• Spatial Pyramid Pooling(SPP)

• (Fast/Faster) R-CNN

• Region Proposal Network(RPN)

2.3.3 R-CNN

Training scheme:

1. Pre-train the CNN for image classification(ImageNet)

2. Finetune the CNN on the number of classes the detector is aiming to classify

3. Train a linear SVM classifier to classify image regions - one linear SVM per class

4. Train the bounding box regressor

Pros:

• New: CNN features; the overall pipeline with proposals is heavily engineered – Good
accuracy.

• The CNN summarizes each proposal into a 4096-dimensional vector.

• Leverage transfer learning: The CNN can be pre-trained for image classification with
C classes; one only needs to change the FC layers to deal with the detector's classes.

Cons:

• Slow: 47 s/image with a VGG-16 backbone. Around 2000 proposals are considered per
image, and each needs to be warped and forwarded through the CNN.

• Training is also slow and complex.

• The object proposal algorithm is fixed.

• Feature extraction and SVM classifier are trained separately - features are not learned
”end-to-end”.

Problems:

1. Input image has a fixed size


The FC layers prevent us from dealing with arbitrary image sizes.

2. TBD

Figure 10: SPP-Net

Figure 11: Fast R-CNN

2.3.4 Fast R-CNN

Fast R-CNN = R-CNN + RoI Pooling(Single layer SPP-Net)


RoI Pooling RoI Pooling is a technique introduced in the Fast R-CNN framework to
efficiently extract fixed-size feature representations for arbitrary regions (bounding boxes) in
a feature map. It addresses the problem of feeding region proposals of various sizes into a
neural network in a way that:

• Preserves spatial information within each region of interest (RoI).

• Produces a fixed-dimensional feature vector needed by fully connected layers(classifiers).

1. Feature extraction for the entire image


Instead of cropping individual region proposals from the original image and passing them
each through a ConvNet (as done in older approaches like R-CNN), we:

• Feed the entire image into a convolutional neural network(e.g., VGG, ResNet).

• Obtain the resulting feature map, typically a 2D grid of activations.

This means the network only processes the image once, which is much more efficient.
2. Mapping region proposals to the feature map

You have a set of region proposals (bounding boxes) in the original image coordinates; these
come from an external region proposal method (e.g., selective search) or from a region
proposal network (in Faster R-CNN).
Each region proposal is then mapped onto the feature map coordinates:
- Because the CNN reduces spatial resolution(due to pooling layers or strided convolutions),
the coordinates of each region of the original image need to be scaled to align with the
coordinate system of the smaller feature map.
3. Dividing each RoI into sub-regions
To handle different RoI sizes but produce a fixed-size output (for example, a 7 × 7 output grid
in Fast R-CNN):

• Divide the mapped RoI on the feature map into a predefined grid (e.g., 7 × 7).

• Each sub-region in this grid corresponds to a smaller set of feature map cells.

4. Pooling operation
Within each of these sub-regions(bins):

• Perform a pooling operation(usually max pooling, though average pooling can also be
used).

• This operation collapses the variable-sized sub-region into a single value (e.g., the
maximum activation within that bin).

By doing this for each sub-region in the grid, you transform the entire RoI into a fixed-size
feature map (e.g., 7 × 7), regardless of the original size of the bounding box.
5. Feeding pooled features to classifier layers
Now that you have a fixed spatial dimension (e.g., 7 × 7), you can:

• Flatten or otherwise reshape this pooled feature map.

• Pass it into fully connected layers(and ultimately classification/regression heads) to


predict:
- The object class for that region.
- The bounding box refinement offsets, etc.
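To make steps 1-5 concrete, a minimal NumPy sketch of RoI max pooling on a single feature map; the 7 × 7 output grid and the 1/16 spatial scale are typical values assumed here (libraries such as torchvision ship an optimized version).

import numpy as np

def roi_max_pool(feature_map, roi, output_size=7, spatial_scale=1.0 / 16):
    # feature_map: (C, H, W) array; roi: (x1, y1, x2, y2) in image coordinates.
    C, H, W = feature_map.shape
    # 2. Map the RoI from image coordinates onto the (smaller) feature map.
    x1 = min(max(0, int(np.floor(roi[0] * spatial_scale))), W - 1)
    y1 = min(max(0, int(np.floor(roi[1] * spatial_scale))), H - 1)
    x2 = min(W, max(int(np.ceil(roi[2] * spatial_scale)), x1 + 1))
    y2 = min(H, max(int(np.ceil(roi[3] * spatial_scale)), y1 + 1))
    # 3. Divide the mapped RoI into an output_size x output_size grid of bins.
    xs = np.linspace(x1, x2, output_size + 1)
    ys = np.linspace(y1, y2, output_size + 1)
    out = np.zeros((C, output_size, output_size), dtype=feature_map.dtype)
    for i in range(output_size):
        for j in range(output_size):
            y_lo = int(np.floor(ys[i])); y_hi = max(int(np.ceil(ys[i + 1])), y_lo + 1)
            x_lo = int(np.floor(xs[j])); x_hi = max(int(np.ceil(xs[j + 1])), x_lo + 1)
            # 4. Max-pool each bin into a single value per channel.
            out[:, i, j] = feature_map[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out  # 5. flatten and feed into the FC classification/regression heads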

Result of Fast R-CNN

2.3.5 Faster R-CNN

Faster R-CNN = Fast R-CNN + Region proposal network


Region Proposal Network: We fix the number of proposals by using a set of n = 9
anchors per location.

9 anchors = 3 scales × 3 aspect ratios (7)

A 256-d descriptor characterizes every location.


For feature map location, we get a set of anchor corrections and classification into object/non-
object.
RPN: Training

Figure 12: Fast R-CNN results

Classification ground truth: We compute p*, which indicates how much an anchor overlaps
with the ground-truth bounding boxes:

p* = 1 if IoU > 0.7    (8)

p* = 0 if IoU < 0.3    (9)

• 1 indicates that the anchor represents an object (foreground), and 0 indicates
background. The remaining anchors do not contribute to the training.

• For an image, randomly sample a few(e.g., 256) anchors to form a mini-batch(balanced


objects vs. non-objects)

• We learn anchor activations with the binary cross-entropy loss.

• Those anchors that contain an object are used to compute the regression loss.

• Each anchor is described by the center position, width and height(xa , ya , wa , ha ).

• What the network actually predicts are the normalized offsets (t_x, t_y, t_w, t_h); see Eqs. (10)–(13).

• Smooth L1 loss on regression targets

Faster R-CNN training:


- First implementation, training of RPN separate from the rest.
- Now we can train jointly!
- Four losses:

1. RPN classification(object/non-object)

2. RPN regression(anchor – proposal)

3. Fast R-CNN classification(type of object)

4. Fast R-CNN regression(proposal – box)

Figure 13: Faster R-CNN

t_x = (x − x_a) / w_a    (10)   (normalized horizontal shift)

t_y = (y − y_a) / h_a    (11)   (normalized vertical shift)

t_w = log(w / w_a)    (12)   (normalized width)

t_h = log(h / h_a)    (13)   (normalized height)
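A small NumPy sketch (the helper names are my own) of encoding a ground-truth box against an anchor with Eqs. (10)-(13), and of decoding predicted offsets back into a box:

import numpy as np

def encode(anchor, gt):
    # anchor, gt: (x_center, y_center, w, h) -> regression targets (tx, ty, tw, th).
    xa, ya, wa, ha = anchor
    x, y, w, h = gt
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def decode(anchor, t):
    # Invert the encoding: apply predicted offsets t to the anchor.
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return np.array([xa + tx * wa, ya + ty * ha, wa * np.exp(tw), ha * np.exp(th)])

# Round trip: decode(anchor, encode(anchor, gt)) recovers gt.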

2.4 Feature Pyramid Networks


CNNs are not scale-invariant.

• Idea A: Featurised image hierarchy


- Pros: Typically boosts accuracy(esp. at test time).
- Cons: Computationally inefficient.

• Idea B: Pyramidal feature hierarchy

Figure 18: Faster R-CNN Performance

Figure 19: Featurised image hierarchy

- More efficient than Idea A, but


- Limited accuracy(inhibits the learning of deep representations)

• Feature pyramid network(FPN)


- Higher scales benefit from deeper representation from lower scales.
- Efficient and high accuracy.

Straightforward implementation (see the sketch below):
• Convolution with a 1 × 1 kernel

• Upsampling (nearest neighbours)

• Element-wise addition
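A minimal PyTorch sketch of the top-down pathway for three backbone stages; the channel sizes and the 3 × 3 smoothing convolutions are assumptions, not the exact published configuration.

import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    # c3, c4, c5 are backbone feature maps at increasing depth (decreasing resolution).
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c3, c4, c5):
        # 1 x 1 lateral convolutions project every stage to the same channel count.
        p5 = self.lateral[2](c5)
        # Nearest-neighbour upsampling + element-wise addition, coarse to fine.
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        # 3 x 3 convolutions smooth the merged maps.
        return self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)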
Integration with RPN for object detection:
• Define RPN on each level of the pyramid.

• Only a single-scale anchor per level(scale varies across levels).

• At test time, merge the predictions from all levels.


Pros:
• Improves recall across all scales(esp. small objects).

• More accurate(in terms of AP).

• Broadly applicable, also for one-stage detectors.

• Still in wide use today.


Cons: Increased model complexity.

Figure 20: Pyramidal feature hierarchy

Figure 21: Feature pyramid network

2.5 Single-stage object detection


2.5.1 YOLO

Figure 22: YOLO

• Define a coarse grid(S × S).

• Associate B anchors to each cell.

• Each anchor is defined by:


- Localization(x, y, w, h).
- A confidence value(object/no object).
- And a class distribution over C classes.

Inference time:
It is more efficient than Faster R-CNN, but less accurate.

Figure 23: YOLO Inference

• Coarse grid resolution, few anchors per cell = issues with small objects

• Less robust to scale variation.

2.5.2 SSD

Pros:

• More accurate than YOLO.

• Works well even with lower resolution → improved inference speed.

Cons:

• Still lags behind two-stage detectors.

• Data augmentation is still crucial (esp. random sampling).

• A bit more complex(due to multi-scale features).

Figure 24: Single shot detection

2.5.3 RetinaNet

2.5.4 Problems with one-stage detectors

Two-stage detectors:

• Classification only works on ”interesting” foreground regions(proposals, 1-2k). Most


background examples are already filtered out.

• Class balance between foreground and background objects is manageable.

• Classifier can concentrate on analyzing proposals with rich information.

Problems with one-stage detectors:

• Many locations need to be analyzed (100k), densely covering the image → foreground-background imbalance.
- Many negative examples(every cell on the feature grid).
- Few positive examples(actual number of objects).

• Hard negative mining: subsample the negatives with the largest error:
- works, but can be unstable.

Class imbalance:
- Idea: balance the positives/negatives in the loss function.
- Recall cross-entropy loss:

Figure 25: CE loss

2.5.5 Focal loss

Replace CE with focal loss(FL):

• When γ = 0, it is equivalent to the cross-entropy loss.

• As γ increases, the easy examples are increasingly down-weighted.

• Example: γ = 2, if pt = 0.9, FL is ×100 lower than CE.
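A minimal sketch of the binary focal loss FL(p_t) = −(1 − p_t)^γ log(p_t) (the α-balancing term used in RetinaNet is omitted for brevity):

import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-8):
    # p: predicted foreground probability, y: label in {0, 1}.
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t + eps)

# Easy example (p_t = 0.9): the (1 - 0.9)^2 = 0.01 factor makes FL about 100x smaller than CE.
print(focal_loss(0.9, 1), -np.log(0.9))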

2.5.6 RetinaNet

• One-stage(like YOLO and SSD) with Focal Loss.

• Feature extraction with ResNet.

• Multi-scale prediction - now with FPN.

Figure 26: Focal loss

Figure 27: Retina

2.6 Spatial Transformers


Features of Spatial pooling:

• Helps detect the presence of features.

• No information about precise location of features within a RoI.


- No effect on the RoI localization.

Solution: Spatial Transformers.


Equivariance is needed:
f (A(x)) = A(f (x)) (14)

• Learn to predict the parameters of the mapping θ:

Figure 28: Spatial Transformers

• Learn to predict a chain of transformations.

• Learn to localize RoI without dense supervision(only class label).

• Training multiple STs: Focus on different object parts.

• Learning to localize without any supervision.

• Makes the network equivariant to certain transformations (e.g., affine).

• Fully differentiable.

Cons:
- Difficulty of training for generalizing to more challenging scenes.

2.7 Detection evaluation


Precision: How many zebras you found are actually zebras?

TP
Precision = (15)
TP + FP
Recall: How many of the actual zebras in the image/dataset could you find?

TP
Recall = (16)
TP + FN
What is a true positive?
• Use the Intersection over Union(IoU).

• e.g., if IoU > 0.5 – positive match

• The criterion is defined by the benchmarks(MS-COCO, Pascal VOC)


Resolving conflicts

• Each prediction can match at most 1 ground truth box.

• Each ground truth box can match at most 1 prediction.

All-in-one metric: Average precision(AP)


There is often a trade-off between Precision and Recall.
AP = the area under the Precision-Recall curve(AUC)
Computing average precision
1. Sort the predicted boxes by confidence score

2. For each prediction find the associated ground truth


- The ground truth should be unassigned and pass IoU threshold

3. Compute cumulative TP and FP:


- TP: [1, 0, 1] FP: [0, 1, 0] → cTP = [1, 1, 2], cFP = [0, 1, 1]

4. Compute precision and recall (#GT = 3 → TP + FN = 3)


- Precision: [1/1, 1/2, 2/3]; Recall(#GT=3): [1/3, 1/3, 2/3]

5. Plot the precision-recall curve and the area beneath(num. integration).


mAP is the average over object categories
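A small NumPy sketch reproducing steps 1-5 on the toy example above (3 ground-truth boxes); the simple rectangle-rule integration is one of several conventions, and benchmarks such as Pascal VOC and MS-COCO use slightly different interpolations.

import numpy as np

def average_precision(tp, fp, num_gt):
    # tp, fp: per-prediction 0/1 indicators, already sorted by descending confidence.
    ctp, cfp = np.cumsum(tp), np.cumsum(fp)
    precision = ctp / (ctp + cfp)
    recall = ctp / num_gt
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):       # area under the precision-recall curve
        ap += p * (r - prev_r)
        prev_r = r
    return ap

print(average_precision(np.array([1, 0, 1]), np.array([0, 1, 0]), num_gt=3))  # ~0.56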

Figure 29: AP

3 Single object tracking


Problem statement: Given observations at time t, find their correspondence at time t + n (n ≠ 0).

Figure 30: Problem state

Challenges:

• Fast motion.

• Changing appearance.

• Changing object pose.

• Dynamic background.

• Occlusions.

• Poor image quality.

• ...

A simple solution: Tracking by detection.


- Detect the object in every frame.
Problem: data association.

• Many objects may be present & the detector may misfire.

• We could use the motion prior.

3.1 Bayesian tracking

Figure 31: Bayesian example

Goal: Estimate car position at each time instant (say, the white car).
Observation: Image sequence and known background.

• Perform background subtraction.

• Obtain binary map of possible cars.

• But which one is the one we want to track?

Observations: Image
System state: Car position(x, y)
Notations:

• xk ∈ Rn : internal state at k-th frame


- Hidden random variable, e.g., position of the object in the image.
- Xk = [x1 , x2 , . . . , xk ]T history up to time step k.

• zk ∈ Rm : Measurements at k-th frame


- Observable random variable, e.g., the given image.
- Zk = [z1 , z2 , . . . , zk ]T history up to time step k.

Goal:
Estimate posterior probability p(xk |Zk ).
How? Recursion:

p(xk−1 |Zk−1 ) → p(xk |Zk ) (17)

Figure 32: Bayesian probabilities

3.1.1 Hidden Markov model

Two assumptions of HMM:

1.
p(zk |xk , Zk−1 ) = p(zk |xk ) (18)

2.
p(xk |xk−1 , Zk−1 ) = p(xk |xk−1 ) (19)

Recursive Estimation

p(x_k | Z_k) = p(x_k | z_k, Z_{k-1})                                        (applying Bayes' theorem)
            ∝ p(z_k | x_k, Z_{k-1}) · p(x_k | Z_{k-1})
            ∝ p(z_k | x_k) · p(x_k | Z_{k-1})                               (Assumption: p(z_k | x_k, Z_{k-1}) = p(z_k | x_k))
            ∝ p(z_k | x_k) · ∫ p(x_k, x_{k-1} | Z_{k-1}) dx_{k-1}           (marginalization)
            ∝ p(z_k | x_k) · ∫ p(x_k | x_{k-1}, Z_{k-1}) · p(x_{k-1} | Z_{k-1}) dx_{k-1}
            ∝ p(z_k | x_k) · ∫ p(x_k | x_{k-1}) · p(x_{k-1} | Z_{k-1}) dx_{k-1}   (Assumption: p(x_k | x_{k-1}, Z_{k-1}) = p(x_k | x_{k-1}))

Key Concepts:

• Bayes' Rule: p(a | b) = p(b | a) p(a) / p(b)

• Assumption: p(z_k | x_k, Z_{k-1}) = p(z_k | x_k)

• Marginalization: p(a) = ∫ p(a, b) db

• Factorization from the graphical model: p(x_k | x_{k-1}, Z_{k-1}) = p(x_k | x_{k-1})

Bayesian formulation:
p(x_k | Z_k) = k · p(z_k | x_k) · ∫ p(x_k | x_{k-1}) · p(x_{k-1} | Z_{k-1}) dx_{k-1}    (20)

• p(xk |Zk ) → posterior probability at current time step.

• p(zk |xk ) → likelihood

• p(xk |xk−1 ) → temporal prior

• p(xk−1 |Zk−1 ) → posterior probability at previous time step

• k → normalizing term
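To make the recursion concrete, a minimal histogram (grid) filter over a discretized 1-D state; the transition matrix and likelihood below are made-up toy values, and the function implements Eq. (20) up to normalization.

import numpy as np

def bayes_filter_step(posterior_prev, transition, likelihood):
    # posterior_prev: p(x_{k-1} | Z_{k-1}), shape (N,)
    # transition[i, j] = p(x_k = i | x_{k-1} = j), shape (N, N)
    # likelihood: p(z_k | x_k), shape (N,)
    prior = transition @ posterior_prev        # marginalize over x_{k-1} (temporal prior)
    posterior = likelihood * prior             # multiply by the likelihood
    return posterior / posterior.sum()         # normalizing term k

N = 5
post = np.ones(N) / N                            # uniform initial belief
trans = 0.8 * np.eye(N, k=-1) + 0.2 * np.eye(N)  # move one cell right w.p. 0.8, stay w.p. 0.2
trans[-1, -1] += 0.8                             # boundary: stay in the last cell
lik = np.array([0.05, 0.1, 0.6, 0.2, 0.05])      # observation favours cell 2
print(bayes_filter_step(post, trans, lik))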

Estimators:
Assume the posterior probability p(xk |Zk ) is known:

• posterior mean:

x̂k = E(xk |Zk ) (21)

• maximum a posteriori (MAP):

x̂k = argmaxxk p(xk |Zk ) (22)

Deep networks:

• It is easy to see what the networks have to output given the input
- but it is harder(yet more useful) to understand what a network models in terms of
our Bayesian formulation.

• Typically, the networks are tasked to produce MAP directly, e.g.,

x̂k = argmaxxk p(xk |Zk ) ≈ fθ (Zk , xk−1 ) (23)

without modeling the actual state distribution.

3.2 Online vs. Offline tracking
Online tracking:
- ”Given observations so far, estimate the current state.”

• Process two frames at a time.

• For real-time applications.

• Prone to drifting → hard to recover from errors or occlusions.

Offline tracking:
- ”Given all observations, estimate any state.”

• Process a batch of frames.

• Good to recover from occlusions(short ones as we will see)

• Not suitable for real-time application.

• Suitable for video analysis.

An online tracking model can be used for offline tracking too. Our recursive Bayesian
model will still work.

3.3 GOTURN

Figure 33: GOTURN

• Input: Search region + template region(what to track).

• Output: Bounding box coordinates in the search region.

Temporal prior:

p(xk |xk−1 ) = δ(xk − xk−1 ) (24)

where δ is Dirac delta function.


Advantages:

Figure 34: Temporal prior

• Simple: close to template matching.

• efficient(real-time).

• end-to-end(we can make use of large annotated data).

Disadvantages:

• may be sensitive to the template choice.

• the temporal prior is too simple: fails if there is fast motion or occlusion.

• tracking one object only.

3.4 Online adaptation


Problem: The initial object appearance may change drastically,
- e.g., due to occlusion, pose change, etc.
Idea: adapt the appearance model on the (unlabeled) test sequence.

3.5 MDNet
Tracking with online adaptation

4 Multi-object tracking
Challenges:

• Multiple objects of the same type.

• Heavy occlusions.

• Appearances of individual people are often very similar.

DL’s role in MOT:

Figure 35: MDNet

1. Tracking initialization(e.g., using a detector.)


- Deep learning → more accurate detectors.

2. Prediction of the next position(motion model).


- We can learn a motion model from data.

3. Matching predictions with detections:


- Learning robust metrics(robust to pose changes, partial occlusions, etc.).
- Matching can be embedded into the model.

4.1 Approach 1
1. Track initialization(e.g., using a detector).

2. Prediction of the next position(motion model).


- Classic: Kalman filter(e.g., state transition model.)
- DL(data driven): LSTM/GRU.

3. Matching predictions with detections(appearance model).


- In general: reduction to the assignment problem.
- Bipartite matching.

4.2 Typical models of dynamics


• Constant position:
- i.e. no real dynamics, but if the velocity of the object is sufficiently small, this can
work.

• Constant velocity(possibly unknown):


- We assume that the velocity does not change over time.
- As long as the object does not quickly change velocity or direction, this is a quite
reasonable model.

• Constant acceleration(possibly unknown):
- Also captures the acceleration of the object.
- This may include both the velocity, but also the directional acceleration.

4.3 Bipartite matching

Figure 36: Bipartite matching

1. Define distances between boxes (e.g., 1 − IoU).

- We obtain an N × N cost matrix.

2. Solve the assignment (see the sketch below).

- Using the Hungarian algorithm, O(N^3).

3. The bipartite matching solution corresponds to the minimum total cost.
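The assignment step can be solved directly with SciPy's implementation of the Hungarian algorithm; a minimal sketch with a toy 3 × 3 cost matrix of (1 − IoU) values (the 0.7 gating threshold is an arbitrary example):

import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] = 1 - IoU(prediction_i, detection_j)
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.9, 0.3]])

rows, cols = linear_sum_assignment(cost)                             # Hungarian algorithm, O(N^3)
matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 0.7]   # gate very poor matches
print(matches)   # [(0, 0), (1, 1), (2, 2)]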

What happens if one detection is missing?

1. Add a pseudo detection with a fixed cost(e.g., 0).

2. Run the Hungarian method as before.

3. Discard the assignment to the pseudo node.

What happens if no prediction is suitable?

• e.g., the object leaves the frame.

• Solution: Pseudo node.

• Its value will define a threshold.

• We may need to balance it out.

Figure 37: Pseudo Node

4.4 Approach 2: Tracktor


1. Detect objects in frame k − 1(e.g., using an object detector)

2. Initialize in the same location in frame k.

3. Refine predictions with regression.

4. Run object detection again to find new tracks.

Advantages:

1. We can reuse well-engineered object detectors:


- The same architecture of regression and classification heads.

2. Work well even if trained on still images:


- The regression head is agnostic to the object ID and category.

Disadvantages:

Figure 38: Tracktor

1. No motion model:
- problems due to large motions(camera, objects) or low frame rate.

2. Confusion in crowded spaces:


- Since there is no notion of ”identity” in the model.

3. Temporary occlusions(the track is ”killed”):


- Generally applies to all online trackers.
- Partial remedy: Long-term memory of tracks(using object embeddings).

Problem of using IoU as metric:


The implicit assumption of small motion.
We need a more robust metric.

4.5 Metric learning: For re-identification(Re-ID)


4.5.1 Metric spaces

Definition:
A set X(e.g., containing images) is said to be a metric space if with any two points p
and q of X there is associated a real number d(p, q), called the distance from p to q, such
that


d(p, q) > 0 if p ̸= q; d(p, p) = 0;


d(p, q) = d(q, p)


d(p, q) ≤ d(p, r) + d(r, q) for any r ∈ X

Any function with these properties is called a distance function, or a metric.


Let’s reformulate:

d(p, q) = dω (fθ (p), fθ (q)) (25)

• We can decouple representation from the distance function:


- We can use simple metrics(e.g., L1, L2, etc) or parametric(Mahalanobis distance).

• The problem reduces to learning a feature representation fθ (·)

4.5.2 How do we train a network to learn a feature representation

Figure 39: Feature Representation

• Choose a distance function, e.g., L2:


- d(A, B; θ) := ||fθ (A) − fθ (B)||2

• Minimize the distance between image pairs of the same person:


- θ∗ := arg minθ EA,B [d(A, B; θ)]

Metric learning method: add negative pairs.

Figure 40: Metric learning

- Minimize the distance between positive pairs; maximize it otherwise.


Our goal: d(A, B; θ) < d(A, C; θ).
The loss:

θ* := arg min_θ  E_{A,B ∈ S+}[d_θ(A, B)] − E_{B,C ∈ S−}[d_θ(B, C)]    (26)

S+ and S− are sets of positive and negative image pairs.

1. Hinge loss:
L(A, B) = y* ||f(A) − f(B)||^2 + (1 − y*) max(0, m^2 − ||f(A) − f(B)||^2)
- where y* is 1 if (A, B) is a positive pair, and 0 otherwise.
- hinge loss for negative pairs with margin m.

2. Triplet loss:
L(A, B, C) = max(0, ||f(A) − f(B)||^2 − ||f(A) − f(C)||^2 + m)

Figure 41: Triplet loss
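A minimal PyTorch sketch of the triplet loss, where f_a, f_b, f_c are embeddings of the anchor A, positive B and negative C (the margin of 0.2 and the embedding size are arbitrary):

import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_b, f_c, margin=0.2):
    d_pos = (f_a - f_b).pow(2).sum(dim=1)      # ||f(A) - f(B)||^2
    d_neg = (f_a - f_c).pow(2).sum(dim=1)      # ||f(A) - f(C)||^2
    return F.relu(d_pos - d_neg + margin).mean()

f_a, f_b, f_c = (torch.randn(4, 128) for _ in range(3))   # toy batch of 4 embeddings
print(triplet_loss(f_a, f_b, f_c))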

4.5.3 Metric learning for tracking

1. Train an embedding network on triplets of data:


- positive pairs: same person at different timesteps;
- negative pairs: different people.

2. Use the network to compute the similarity score for matching.

4.5.4 Summary of metric learning

• Many problems can be reduced to metric learning.


- Including MOT(both online and offline).

• Annotation needed:
- e.g., same identity in different images.

• In practice: careful tuning of the positive pair term vs. the negative term.
- hard-negative mining and a bounded function for the negative pairs help.

• Extension to unsupervised setting - contrastive learning.

4.6 Online vs. Offline tracking


Online tracking:
- ”Given observations so far, estimate the current state.”

• Process two frames at a time.

• For real-time applications.

• Prone to drifting → hard to recover from errors or occlusions.

Offline tracking:
- ”Given all observations, estimate any state.”

• Process a batch of frames.

• Good to recover from occlusions(short ones as we will see)

• Not suitable for real-time application.

• Suitable for video analysis.

4.7 Graph based MOT


4.7.1 Tracking with network flow

Minimum-cost maximum-flow problem:


Determine the maximum flow with a minimum cost.

Figure 42: MOT

• Node = object detection

• Edge = temporal ID correspondence

• Goal: disjoint set of trajectories

Minimizing the cost:


f* = arg min_f Σ_{i,j} C(i, j) f(i, j)    (27)

where f is the disjoint set of trajectories, C(i, j) is the cost, and f(i, j) ∈ {0, 1} is the flow
indicator.
To incorporate detection confidence, we split each node in two,
where C_det indicates the detection cost (confidence) and C_t indicates the transition cost.
Problem with occlusions, such as:

• occlusion in the last frame.

Figure 43: Network flow

Figure 44: Solution to occlusion

• the object appears only in the second frame.

Solution: Connect all nodes (detections) to entrance/exit nodes.

The graph is subject to flow conservation:

Figure 45: Flow conservation

MAP formulation
- Our solution is a set of trajectories T*:

T* = arg max_T P(T | X)                                   (28)
   = arg max_T P(X | T) P(T)                              (29)   (Bayes rule)
   = arg max_T ∏_i P(x_i | T) P(T)                        (30)   (Assumption 1: conditional independence of observations)
   = arg max_T ∏_i P(x_i | T) ∏_{T_i ∈ T} P(T_i)          (31)   (Assumption 2: independence of trajectories)

MAP formulation:

arg max_T ∏_i P(x_i | T) ∏_{T_i ∈ T} P(T_i)    (32)

arg min_T − Σ_i log P(x_i | T) − Σ_{T_i ∈ T} log P(T_i)    (33)   (log-space for optimization)

Prior

Σ_{T_i ∈ T} log P(T_i)    (34)

Trajectory model: count the entrance, exit and transition costs

T_i := (x_0, x_1, ..., x_n)    (35)

P(T_i) = P_in(x_0) ∏_{j=1..n} P_t(x_j | x_{j-1}) P_out(x_n)    (36)

− log P(T_i) = − log P_in(x_0) − Σ_{j=1..n} log P_t(x_j | x_{j-1}) − log P_out(x_n)    (37)

Entrance cost: − log P_in(x_0) = f_in(x_0) C_in(x_0)

Transition cost: − log P_t(x_j | x_{j-1}) = f_t(x_j, x_{j-1}) C_t(x_j, x_{j-1})

Exit cost: − log P_out(x_n) = f_out(x_n) C_out(x_n)

Likelihood:

− Σ_i log P(x_i | T)    (38)

We can use a Bernoulli distribution:

P(x_i | T) := γ_i        if ∃ T_j ∈ T with x_i ∈ T_j,
              1 − γ_i    otherwise    (39)

γ_i denotes the prediction confidence (e.g., provided by the detector).

− log P(x_i | T) = −f(x_i) log γ_i − (1 − f(x_i)) log(1 − γ_i)    (40)

                 = f(x_i) log((1 − γ_i) / γ_i) − log(1 − γ_i)    (41)

C_det(x_i) = log((1 − γ_i) / γ_i)

log(1 − γ_i) can be ignored in the optimization.

Optimization
(Cdet , Cin , Cout , Ct ) are estimated from data. Then:

• Construct the graph G(V, E, C, f ) from observation set X .

– V → Nodes (represent observations or potential assignments).


– E → Edges (represent possible transitions between observations).
– C → Costs assigned to edges.
– f → Flow through the graph, which represents assignments.

• Start with empty flow.

• WHILE (f(G) can be augmented):

– Augment f(G) by one (binary/Fibonacci search over the flow value, O(log n)).

– Find the min-cost flow for the current flow value (min-cost flow / network simplex algorithm, O(n^2 m log n)).

– IF (current min cost < global optimal cost):

∗ Store the current min-cost assignment as the global optimum.

• Return the global optimal flow as the best association hypothesis.
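A rough sketch of this procedure with networkx; the tiny two-frame example, the node names and the integer-scaled costs are all made up for illustration (the network-simplex solver prefers integer weights), so this is not the exact formulation of any particular paper.

import networkx as nx

C_in, C_out = 10, 10                                       # entrance / exit costs
C_det = {"a1": -30, "a2": -20, "b1": -30, "b2": -20}       # detection costs ~ log((1-g)/g), scaled
C_t = {("a1", "b1"): 5, ("a1", "b2"): 40,
       ("a2", "b1"): 40, ("a2", "b2"): 5}                  # transition costs

def solve(num_tracks):
    G = nx.DiGraph()
    G.add_node("S", demand=-num_tracks)                    # source supplies num_tracks units of flow
    G.add_node("T", demand=num_tracks)                     # sink absorbs them
    for d, c in C_det.items():
        # Split every detection into an in/out pair carrying the detection cost.
        G.add_edge(d + "_in", d + "_out", capacity=1, weight=c)
        G.add_edge("S", d + "_in", capacity=1, weight=C_in)
        G.add_edge(d + "_out", "T", capacity=1, weight=C_out)
    for (u, v), c in C_t.items():
        G.add_edge(u + "_out", v + "_in", capacity=1, weight=c)
    flow = nx.min_cost_flow(G)                             # min-cost flow for this flow value
    return nx.cost_of_flow(G, flow), flow

# Outer loop over the number of trajectories; keep the globally cheapest solution.
best_cost, best_flow = min((solve(k) for k in (1, 2)), key=lambda x: x[0])
print(best_cost)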

Summary

• Min-cost max-flow formulation:
the maximum number of trajectories with minimum costs.

• Optimization maximizes MAP:


global solution and efficient(polynomial time).

Open questions:

• How to handle occlusions

• How to learn costs:


costs may be specific to the graph formulation and optimization.

4.7.2 Tracking with Message Passing Network

End-to-end learning?

• Can we learn features for multi-object tracking(encoding costs) to encode the solution
directly on the graph?

• Goal: Generalize the graph structure we have used and perform end-to-end learning.

Setup

• Input: task-encoding graph

– nodes: detections encoded as feature vectors


– edges: node interaction(e.g., inter-frame)

• Output: graph partitioning into(disjoint) trajectories


- e.g., encoded by edge label(0,1)

Deep learning on graphs


Key challenges:

• Graph can be of arbitrary size


number of nodes and edges

• Need invariance to node permutations.

Message passing

1. Initial graph

• Graph: G = (V, E)
• Initial embeddings:
- Node embeddings: h_i^(0), i ∈ V
- Edge embeddings: h_(i,j)^(0), (i, j) ∈ E
• Embeddings after l steps: h_i^(l), i ∈ V and h_(i,j)^(l), (i, j) ∈ E

2. ”Node-to-edge” update

Figure 46: Message Passing Network

Figure 47: Node-to-edge update

3. ”Edge-to-node” update

(a) Use the updated edge embeddings to update nodes.

(b) After a round of edge updates, each edge embedding contains information about
its pair of incident nodes.

(c) By analogy: h_i^(l) = N_v([h_i^(l−1), h_(i,j)^(l)])

(d) In general, we may have an arbitrary number of neighbors ("degree" or "valency").

(e) Define a permutation-invariant aggregation function:

Φ^(l)(i) := Φ({h_(i,j)^(l)}_{j ∈ N_i})    (42)

where the input is the set of embeddings from incident edges.

(f) Re-define the edge-to-node updates for a general graph:

h_i^(l) = N_v(h_i^(l−1), Φ^(l)(i))    (43)

Remarks:

• Main goal: gather context information into node and edge embeddings.

• Is one iteration of node-to-edge/edge-to-node updates enough?

• One iteration increases the receptive field of a node/edge by 1


- In practice, iterate message passing multiple times(hyperparameter).

• All operations used are differentiable.

• All vertices/edges are treated equally, i.e., the parameters are shared (see the sketch below).
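A compact PyTorch sketch of one node-to-edge / edge-to-node round with a sum aggregation; the two-layer MLPs, the embedding sizes and the toy graph are arbitrary choices for illustration.

import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, node_dim=32, edge_dim=16):
        super().__init__()
        # N_e: update an edge from its two incident nodes and its previous embedding.
        self.edge_mlp = nn.Sequential(nn.Linear(2 * node_dim + edge_dim, edge_dim), nn.ReLU())
        # N_v: update a node from its previous embedding and the aggregated edge messages.
        self.node_mlp = nn.Sequential(nn.Linear(node_dim + edge_dim, node_dim), nn.ReLU())

    def forward(self, h_node, h_edge, edges):
        # h_node: (V, node_dim); h_edge: (E, edge_dim); edges: (E, 2) node-index pairs.
        src, dst = edges[:, 0], edges[:, 1]
        # "Node-to-edge" update: each edge sees its pair of incident nodes.
        h_edge = self.edge_mlp(torch.cat([h_node[src], h_node[dst], h_edge], dim=1))
        # Permutation-invariant aggregation Phi: sum the incident edge embeddings per node.
        agg = torch.zeros(h_node.size(0), h_edge.size(1))
        agg.index_add_(0, dst, h_edge)
        agg.index_add_(0, src, h_edge)
        # "Edge-to-node" update.
        h_node = self.node_mlp(torch.cat([h_node, agg], dim=1))
        return h_node, h_edge

h_n, h_e = torch.randn(4, 32), torch.randn(3, 16)          # 4 detections, 3 candidate edges
edges = torch.tensor([[0, 2], [0, 3], [1, 2]])
h_n, h_e = MessagePassingLayer()(h_n, h_e, edges)          # iterate several times in practice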

MOT with MPN

Figure 48: MOT with MPN

1. Input

2. Graph construction + feature encoding


- Encode appearance and scene geometry cues into node and edge embeddings.

3. Neural message passing


- Propagate cues across the entire graph with neural message passing.

4. Edge classification
- Learn to directly predict solutions of the Min-cost flow problem by classifying edge
embeddings.

5. Output

Figure 49: Geometry encodings

Feature encoding
Appearance and geometry encodings:

1. Relative box position:

( 2(y_j − y_i) / (h_i + h_j), 2(x_j − x_i) / (w_i + w_j) )    (44)

• This encodes the normalized relative position of the two bounding boxes.
• The difference in the y-coordinate is divided by the sum of their heights.
• The difference in the x-coordinate is divided by the sum of their widths.
• The factor 2 ensures the values are in a normalized range.

2. Relative box size:

( log(h_i / h_j), log(w_i / w_j) )    (45)

• These terms encode the relative scale change between the two bounding boxes.
• Using a logarithm makes it more robust to size differences.

3. Time difference:

t_j − t_i    (46)

• This represents the time gap between the two detections.
• It helps in determining whether two bounding boxes belong to the same object
across frames.

*Shared weights of CNN and MLP for all nodes and edges
Contrast:

• earlier: define pairwise and unary cost

• now:

– feature vectors associated to nodes and edges.


– use message passing to aggregate context information into the features.

Temporal causality
Flow conservation at a node

• At most 1 connection to the past.

Figure 50: Time-aware message passing

• At most 1 connection to the future.


Time-aware message passing:
Classifying edges

• After several iterations of message passing, each edge embedding contains context
information about the detections.

• We feed the embeddings to an MLP that predicts whether an edge is active/inactive

Obtaining final solutions

• After classifying edges, we get a prediction between 0 and 1 for each edge in the graph.

• Directly thresholding the solutions does not guarantee the flow conservation constraints.

• In practice, around 98% of constraints are automatically satisfied.

• Lightweight post-processing(rounding or linear programming).

• The overall method is fast (about 12 fps) and achieves SOTA in the MOT challenge by a
significant margin.

Summary

• No strong assumptions on the graph structure


- handling occlusions.

• Costs can be learned from data.

• Accurate and fast(for an offline tracker).

• (Almost) End-to-end learning approach


- some post-processing is required.

4.7.3 Evaluation of MOT

Compute a set of measures per frame:

• Perform matching between predictions and ground truth.


- Hungarian algorithm

• FP = false positive

• FN = false negative

• IDSW = Identity switches


An identity switch (IDSW) happens in multi-object tracking (MOT) when the same
object is assigned different IDs across frames. This means that the tracking system
mistakenly reassigns a new ID to the same object, breaking continuity.

Figure 51: IDSW

1. An ID switch is counted because the ground truth track is assigned first to red,
then to blue.
2. Count both an ID switch(red and blue both assigned to the same ground truth),
but also fragmentation(Frag.) because the ground truth coverage was cut.
3. Identity is preserved. If two trajectories overlap with a ground truth trajectory
(within a threshold), the one that forces the fewest ID switches is chosen (the
red one).

Metrics:

• Multi-object tracking accuracy (MOTA):

MOTA = 1 − Σ_t (FN_t + FP_t + IDSW_t) / Σ_t GT_t    (47)

• F1 score (IDF1):

IDF1 = 2 Σ_t TP_t / Σ_t (2 TP_t + FP_t + FN_t)    (48)

• Multi-object tracking precision (MOTP):

MOTP = Σ_{t,i} TP_t / Σ_t GT_t    (49)
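A tiny sketch of Eq. (47) from per-frame counts (the numbers are made up):

def mota(fn, fp, idsw, gt):
    # fn, fp, idsw, gt: per-frame counts of misses, false positives, ID switches and GT boxes.
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / sum(gt)

print(mota(fn=[1, 0, 1], fp=[0, 2, 0], idsw=[0, 1, 0], gt=[5, 5, 5]))   # 1 - 5/15 = 0.667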

5 Segmentation

Figure 52: Segmentation

Flavors of image segmentation

• Semantic segmentation: Label every pixel with a semantic category.

• Instance segmentation: group object pixels as a separate category

– Disregard background(”stuff”) classes, e.g., road, sky, building etc.


– Can be class-agnostic or
– with object classification: ”semantic instance segmentation”

• Panoptic segmentation: semantic + instance segmentation

• Higher granularity, e.g., discriminating between object parts


- ”part segmentation”

5.1 K-means(clustering)
1. Initialize(randomly) K cluster centers

2. Assign points to clusters


- using a distance metric

3. Update centers by averaging the points in the cluster

4. Repeat 2 and 3 until convergence.
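A minimal NumPy sketch of steps 1-4 with the Euclidean distance (a fixed number of iterations replaces the convergence check, and empty clusters are not handled):

import numpy as np

def kmeans(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]        # 1. random initialization
    for _ in range(iters):
        # 2. assign every point to its closest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. update every center as the mean of its assigned points
        centers = np.stack([points[labels == c].mean(axis=0) for c in range(k)])
    return labels, centers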

Problem: The Euclidean distance may be insufficient. Instead, we want to cluster based
on the notion of ”connectedness”.

5.2 Spectral clustering


Construct a Similarity Graph:

• Represent all data points as a fully-connected (undirected) graph.

• Edge weights encode proximity between nodes(Similarity Matrix).

A := (ai,j ) ∈ Rn×n

• ai,j ≥ 0 measures the similarity of nodes i and j.

• We can use a distance metric: a_{i,j} := exp(−γ · d(i, j))

Compute the Graph Laplacian:

• Define graph Laplacian:

L := D − A

where D is a diagonal matrix (the degree matrix) with entries

D_ii = Σ_j A_ij

Optimization problem:
The Laplacian quadratic form measures how much variation exists between connected
nodes.

x^T L x = (1/2) Σ_{i,j} A_{i,j} (x_i − x_j)^2    (50)

where

• L is symmetric and positive semi-definite, i.e., all its eigenvalues are non-negative (λ_i ≥ 0).

• If two points are highly connected, (Aij is large), then their values xi and xj should
be similar, the term (xi − xj )2 is small.

• The goal is to minimize this sum, ensuring connected nodes have similar values.

f_2 (the eigenvector of the second smallest eigenvalue) = x* = arg min_x x^T L x    (51)

• The smallest eigenvectors of L correspond to smooth functions on the graph(meaning


nodes in the same cluster should have similar values).

• They reflect connected components or clusters in the graph.

Intuitively, if

x^T L x = (1/2) Σ_{i,j} A_{i,j} (x_i − x_j)^2 ≈ 0    (52)

then,

• if xi ̸= xj (if i and j are in different clusters), then Aij ≈ 0

• if A_ij is large (i and j are similar), then x_i ≈ x_j (same cluster)

The solutions x* = arg min_x x^T L x are eigenvectors corresponding to the second smallest
eigenvalue of L.
Special case:

1. Zero eigenvalue

x^T L x = (1/2) Σ_{i,j} A_{i,j} (x_i − x_j)^2 = 0, with A_{i,j} ≥ 0 for all i, j    (53)

What can we say about the corresponding eigenvector?

2. Connected components
- Proposition: the multiplicity k of eigenvalue 0 equals to the number of connected
components, spanned by indicator vectors.
- Example with two components(n × 2):

Figure 53: Example

Figure 54: Spectral clustering in practice

• Spectral clustering can handle complex distributions.

• The complexity is O(n3 ) due to eigendecomposition.

• There are efficient variants(e.g., using sparse affinity matrices).
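A minimal NumPy sketch of a spectral bipartition: build the affinity matrix with an RBF kernel, form L = D − A, and split the data with the sign of the eigenvector of the second smallest eigenvalue (the Fiedler vector); γ is a free parameter.

import numpy as np

def spectral_bipartition(X, gamma=1.0):
    # X: (n, d) data points; returns a binary cluster assignment.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-gamma * sq_dists)             # affinity a_ij = exp(-gamma * d(i, j))
    D = np.diag(A.sum(axis=1))                # degree matrix
    L = D - A                                 # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)      # L is symmetric PSD; eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                   # eigenvector of the 2nd smallest eigenvalue
    return (fiedler > 0).astype(int)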

5.3 Normalized cut(Ncut)
Spectral clustering can be viewed as a minimum cut, as shown in Fig 55.

Figure 55: Spectral clustering as a min cut

Problem with min cuts: Favour unbalanced isolated clusters:

Figure 56: Poor min cut

Balanced cut

Ncut(A, B) = cut(A, B) (assoc(A, V)^{-1} + assoc(B, V)^{-1})    (54)

where

• The cost of cutting sets A and B:

cut(A, B) = Σ_{i ∈ A, j ∈ B} w_{i,j}    (55)

• Total edge cost in set A (equivalently B):

assoc(A, V) = Σ_{i ∈ A, j ∈ V} w_{i,j}    (56)

• Intuitively: minimize the similarity between groups A and B, while maximizing
the similarity within each group.

Ncut

1. Define a graph G := (V, E)

• V is the set of nodes representing pixels.
• E defines similarities of two nodes.

2. Solve a generalized eigenvalue problem: (D − W)x = λDx


where

• d(i) = Σ_j w_{i,j}
• (D − W) is the Laplacian matrix
• equivalently, D^{-1/2} (D − W) D^{-1/2} x = λx

3. Use the eigenvector with the 2nd smallest eigenvalue to cut the graph.

4. Recurse if needed(i.e. subdivide the two node groups).

5.4 Energy-based model: Conditional random fields(CRFs)


Energy function:

E(x, y) = Σ_i ϕ(x_i, y_i) + Σ_{i,j} ψ(x_i, x_j)    (57)

where

• ϕ(x_i, y_i) is the unary term.
Unary potential:

– encodes local information about a pixel.


– how likely is it to belong to a certain class(e.g. foreground/background)

• ψ(xi , xj ) is the pairwise term.


Pairwise potential:

– encode neighborhood information


– how different is this pixel/patch to its neighbors(e.g. based on color/texture/learned
feature)?

5.4.1 Conditional Random Fields

Boykov and Jolly (2001)


E(x, y) = Σ_i φ(x_i, y_i) + Σ_{i,j} ψ(x_i, x_j)

Variables

• xi : Binary variable

– foreground/background

• yi : Annotation

– foreground/background/empty

Unary term

• φ(xi , yi ) = K[xi ̸= yi ]

• Pay a penalty for disregarding the annotation

Pairwise term

• ψ(xi , xj ) = [xi ̸= xj ]wij

• Encourage smooth annotations

• wij affinity between pixels i and j

Optimization with graph cuts:

Figure 57: Max-flow min-cut theorem: The maximum value of an s-t flow is equal to the
minimum capacity over all s-t cuts.

Grid structured random fields

• Efficient solution using Maxflow/Mincut

• Optimal solution for binary labelling

Fully connected models:

• Efficient solution using mean-field:


Variational methods.

Using fully connected CRFs:

5.5 Fully convolutional neural networks


Deep networks for classification:
To extend this to segmentation:

1. Remove GAP:

• Increase number of parameters


• Fixed input size only

Figure 58: Fully connected CRFs

Figure 59: Classification networks

• No translation invariance

2. Replace fully connected layer with convolution(1 × 1 convolution)

• Few parameters
• Variable input size
• Translation invariance

3. Upsample the last layer output to the original resolution.

4. Can be trained with pixel-wise cross-entropy with SGD.

5.5.1 1 × 1 convolution

• Every (feature) pixel is a multi-dimensional feature vector.

• 1×1 convolution is equivalent to applying a shared fully connected layer to every pixel
feature.

• 1 × 1 convolution is a pixel-wise linear projection:

X ′ (D′ , pixels) := W (D′ , D)X(D, pixels) (58)

• each pixel is treated the same(shared parameters)


contrast this to 3 × 3 convolution

Figure 60: Extended networks for segmentation

Figure 61: 1 × 1 conv

• the output is treated as in fully connected layers:


followed by normalization, non-linearity(except for the output layer).
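A quick PyTorch check that a 1 × 1 convolution is a shared per-pixel linear projection in the sense of Eq. (58): applying the same weight matrix to every pixel gives identical outputs (the tensor sizes are arbitrary).

import torch
import torch.nn as nn

D, D_out, H, W = 64, 21, 8, 8
x = torch.randn(1, D, H, W)

conv1x1 = nn.Conv2d(D, D_out, kernel_size=1, bias=False)
fc = nn.Linear(D, D_out, bias=False)
fc.weight.data = conv1x1.weight.data.view(D_out, D)          # share the same weights

y_conv = conv1x1(x)                                          # (1, D_out, H, W)
y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)         # FC layer applied per pixel
print(torch.allclose(y_conv, y_fc, atol=1e-5))               # True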

We may try to maintain the original image resolution in the encoder:


Output stride: input resolution/output resolution.
Decreasing the output stride improves segmentation accuracy.
The problems of keeping feature resolution high in all layers:

• Receptive field size (replace stride-2 ops with dilation-2 ops):

– Removing an operation with stride 2 reduces the area of the receptive field by a
factor of ~4.
– Limited access to context information.

• Computational footprint(both memory and FLOPs):


- e.g., the feature tensor size increases 4 times for each removed stride-2 operation.

Dilated convolutions: To maintain the receptive field size.

• Consider convolution as a special case with ”dilation” = 1:

• For dilation N, the kernel ”skips” N − 1 pixels in-between:

• The number of parameters remains the same.

• The receptive field size of kernel size K and dilation ”D”:

D(K − 1) + 1 (59)

• Dilation improves scale invariance:


- we can use multiple dilations in the same layer.
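A short check of the receptive-field formula D(K − 1) + 1 and of the fact that dilation leaves the parameter count unchanged (PyTorch; the channel sizes are arbitrary):

import torch.nn as nn

def receptive_field(kernel_size, dilation):
    return dilation * (kernel_size - 1) + 1

conv_d1 = nn.Conv2d(64, 64, kernel_size=3, dilation=1, padding=1)
conv_d2 = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)    # same output size

print(receptive_field(3, 1), receptive_field(3, 2))                  # 3, 5
print(sum(p.numel() for p in conv_d1.parameters()) ==
      sum(p.numel() for p in conv_d2.parameters()))                  # True: same parameters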

Figure 62: Dilated convolution

Figure 63: ASPP

Problem 2: Computational footprint(both memory and FLOPs):


Solution: Upsampling
Recall convolution(as matrix multiplication):

Figure 64: Conv as matrix multiplication

We can obtain the opposite effect (increase the output size) by:

X = W^T X′ → broadcasting    (60)

• X: [K × n]

• W^T: [K × 1]

• X′: [1 × n]

• For CNNs, also apply the inverse of im2col to obtain the 2D grid representation - sums
up overlapping values.

Transposed convolution:

• Also called ”up-convolution” or ”deconvolution”(incorrect).

• Pad each pixel in the input (e.g., with zeros).

• Convolve with a kernel (e.g., 3 × 3).

• The amount of padding and stride determines the output resolution.

• Equivalent implementation without padding:

Figure 65: Equivalent impl.

• Issue: Checkerboard Artifacts:


- Reason: Uneven overlap when kernel size is not divisible by stride.

Upsampling: Interpolation

• Transposed convolution needs careful parameterization


- the boundaries can still be an issue.

• A better alternative:
- Interpolation(e.g., bilinear) followed by standard convolution.

Resize-convolution:

• Transposed convolution produces checkerboard artifacts


- can be resolved by a careful choice of the kernel/stride.

• ”Out-of-box” solution: interpolation, then convolve.

• Issue: We still lose some information due to downsampling.

• In general, there are multiple plausible results of upsampling.
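A minimal PyTorch sketch contrasting a transposed convolution with the resize-convolution alternative (bilinear interpolation followed by a standard convolution); the layer sizes are arbitrary.

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 256, 16, 16)

# Transposed convolution: learnable upsampling, but prone to checkerboard artifacts
# when the kernel size is not divisible by the stride.
up_transposed = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1)(x)

# Resize-convolution: interpolate first, then apply a standard convolution.
x_up = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
up_resize = nn.Conv2d(256, 128, kernel_size=3, padding=1)(x_up)

print(up_transposed.shape, up_resize.shape)    # both torch.Size([1, 128, 32, 32])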

5.6 U-Net
To mitigate information loss, we feed features from the early encoder layers to the decoder via skip connections.

U-Net’s key unique features:

• Encoder-Decoder Structure: U-Net follows a symmetric encoder-decoder archi-


tecture. The encoder captures context through a series of convolutional layers, while
the decoder uses this context to localize by progressively upsampling the feature maps.

Figure 66: U-Net

• Skip Connections: One of U-Net’s defining features is the use of skip connections.
These connections directly link corresponding layers in the encoder and decoder, al-
lowing the model to retain fine-grained spatial information during the upsampling
process.

• Asymmetric Depth: The depth of the network is asymmetric, with the encoder
portion consisting of downsampling layers and the decoder portion consisting of up-
sampling layers. This structure enables the model to effectively capture both global
context and precise local features.

• Heavy Use of Convolutions: U-Net heavily relies on convolutions for feature ex-
traction and localization. It often uses small convolutional kernels, typically of size
3x3, to capture intricate details and reduce computational overhead.

• Efficient and Data Augmentation-Friendly: U-Net was designed with limited


training data in mind. It can achieve high performance even with relatively small
datasets. Additionally, U-Net networks are often augmented with random transfor-
mations (e.g., rotations, scaling, etc.) to further improve robustness and performance.

• Pixel-wise Classification: The output of U-Net is a pixel-wise classification, where


each pixel in the input image is assigned a class label, making it ideal for segmentation
tasks.

• Loss Function: U-Net commonly uses a pixel-wise softmax loss function for multi-
class segmentation or binary cross-entropy for binary segmentation tasks. The archi-
tecture can be easily adapted for different types of loss functions, depending on the
problem.

• Symmetry and Output Size: Due to the symmetric structure of the architecture,
the size of the output is the same as the input, which is an important feature for

segmentation tasks where the output needs to align with the input image pixel-wise.
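
A minimal encoder-decoder sketch (PyTorch assumed; the layer sizes are illustrative, not the original U-Net configuration) showing how a skip connection concatenates encoder features into the decoder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.enc1 = nn.Conv2d(in_ch, 16, 3, padding=1)
        self.enc2 = nn.Conv2d(16, 32, 3, padding=1)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # the decoder sees upsampled features concatenated with the skip connection
        self.dec = nn.Conv2d(16 + 16, 16, 3, padding=1)
        self.head = nn.Conv2d(16, num_classes, 1)    # pixel-wise classification

    def forward(self, x):
        s1 = F.relu(self.enc1(x))                    # full resolution (kept for the skip)
        x = F.max_pool2d(s1, 2)                      # downsample
        x = F.relu(self.enc2(x))
        x = self.up(x)                               # upsample back to full resolution
        x = F.relu(self.dec(torch.cat([x, s1], dim=1)))
        return self.head(x)                          # logits: [B, C, H, W]

logits = TinyUNet()(torch.randn(1, 3, 64, 64))
print(logits.shape)                                  # [1, 2, 64, 64]
```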

5.7 SegNet

Figure 67: SegNet

SegNet is a deep learning architecture primarily used for image segmentation tasks. It
is similar to U-Net in that it uses an encoder-decoder structure, but it has some unique
features that distinguish it.

• Encoder-Decoder Architecture: SegNet also follows an encoder-decoder design.


The encoder consists of a series of convolutional layers that extract features from the
input image, while the decoder upsamples the feature maps to create the segmented
output.

• Max Pooling Indices for Upsampling: A key difference between SegNet and
other architectures, like U-Net, is that SegNet uses max pooling indices from the
encoder during the upsampling in the decoder. This technique allows for more accurate
localization and better reconstruction of fine details during the decoding process.

• Efficient Memory Usage: By storing only the max pooling indices instead of the
pooled feature maps, SegNet reduces memory consumption during the upsampling
process, making it more efficient than many other models.

• Pixel-wise Classification: Like U-Net, SegNet outputs pixel-wise classification,


making it ideal for tasks such as semantic segmentation where each pixel needs to
be classified into a category.

• Loss Function: SegNet typically uses cross-entropy loss for training, which is com-
monly used for segmentation tasks.

• Applications: SegNet has been successfully applied to a variety of tasks, such as road
scene understanding, medical image analysis, and other segmentation applications
requiring precise delineation of object boundaries.

5.8 Best practices for segmentation


• Keep feature resolution high
- output stride 8 is typical.

• Use dilated convolution to keep the receptive field size high.
- context information is important.

• Use skip connections between the encoder and decoder.


- improves upsampling quality.

• Use image-adaptive post-processing such as CRFs.


- improves segmentation boundaries

5.9 Instance segmentation


Semantic segmentation:

1. Labels every pixel, including the background ("stuff": sky, grass, road).

2. Does not differentiate between pixels from objects (instances) of the same class.

Instance segmentation:

3. Does not label pixels from uncountable objects ("stuff"), e.g., "sky", "grass", "road".

4. Differentiates between pixels coming from instances of the same class.

Figure 68: Instance segmentation methods

5.9.1 Proposal based method: Mask R-CNN

Start with Faster R-CNN

• add another head → ”mask-head”

• Mask R-CNN

• Faster R-CNN + mask head for segmentation.

• mask loss: cross-entropy per pixel.

• New: RoIAlign

RoIPool vs. RoIAlign

• We need accurate localization for mask prediction.

Figure 69: Mask R-CNN

Figure 70: Mask R-CNN

• RoIPool is inaccurate due to two quantization steps.

• Better alternative: RoIAlign.

RoIAlign:

• No quantization.

• Define 4 regularly placed sample points within each bin.

• Compute feature values with bilinear interpolation.

• Aggregate each bin as before.
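
A small usage sketch of RoIAlign via torchvision.ops.roi_align (assumed available); the box coordinates and the stride below are made up for illustration:

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)         # backbone feature map
# One box in image coordinates: (batch_index, x1, y1, x2, y2); image assumed 800x800
boxes = torch.tensor([[0, 103.7, 225.1, 420.3, 590.8]])

# spatial_scale maps image coords to feature coords (stride 16: 800 -> 50);
# sampling_ratio=2 places 2x2 = 4 sample points per bin, aggregated by averaging.
pooled = roi_align(features, boxes, output_size=(7, 7),
                   spatial_scale=1.0 / 16, sampling_ratio=2, aligned=True)
print(pooled.shape)                            # [1, 256, 7, 7]
```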

Problem of Mask R-CNN: low mask resolution.

5.9.2 Mask R-CNN with PointRend

• Mask head is an example of mapping a discrete signal representation(e.g., a pixel) to


desired value(e.g., binary mask).

• Instead, we parameterize the mask as a continuous function that maps the signal domain (e.g., an (x, y) coordinate):

– fθ (x, y) is an example of a coordinate-based network.

Figure 71: RoIAlign

– Why is it useful here?


– We can query fractional coordinates.

• Idea:

– Train coordinate-based mask representation by focusing on the boundaries.

Figure 72: PointRend

– Test time: Use the learned coordinate mapping to refine boundaries.

• Adaptive subdivision step(test time):

Remarks:

• In practice, compute a point-wise feature for each coordinate.

Figure 73: Qualitative result

• We compute a point-wise feature with bilinear interpolation.

• We can concatenate features sampled this way from multiple feature maps.

• The point head hθ is trained on these features, not the coordinates.

• Quiz: What’s the benefit?

5.9.3 Proposal-free method

We obtain a semantic map using fully convolutional networks for semantic segmentation.

5.9.4 SOLOv2

• Recall semantic segmentation.

• The last layer is a 1 × 1 convolution - a linear classifier:

Y = KX (61)

– Y : [C × HW ] → Pixel-wise class scores.


– K: [C × D] → Layer parameters(1 × 1 conv).
– X: [D × HW ] → Features.

• Why not apply the same strategy to instance segmentation?

• Problem: The number of kernels cannot be fixed.

Idea: Predict the kernels:

• Convolution: Mi,j = Gi,j ∗ F

• G: S × S × D → the dimensionality D depends on the kernel size (1 × 1 works well, so
D = E); a minimal sketch follows below.
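
A hedged sketch of the dynamic-kernel idea: a kernel branch predicts one 1 × 1 kernel per cell of the S × S grid, which is then convolved with the mask features F (tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

S, E, H, W = 4, 64, 96, 96                     # S x S grid, E-dim mask features
feat = torch.randn(1, E, H, W)                 # mask feature map F
kernels = torch.randn(S * S, E)                # predicted kernels G, one per grid cell

# Each predicted kernel acts as a 1x1 conv: reshape to [out_ch, in_ch, 1, 1]
masks = F.conv2d(feat, kernels.view(S * S, E, 1, 1))
print(masks.shape)                             # [1, S*S, H, W]: one mask per grid cell
```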

5.9.5 Instance segmentation: Summary

• Proposal-free and proposal-based instance segmentation methods offer accuracy vs.


efficiency trade-off.

• Similar to our conclusions about object detectors;

– Proposal-based methods are more accurate(robust to scale variation), but less


efficient;
– Proposal-free methods are faster and have competitive accuracy, including accurate segmentation of large-scale objects.

Figure 74: SOLOv2

5.10 Panoptic segmentation


• It gives labels to uncountable objects called ”stuff”(sky, road, etc), similar to FCN-like
networks.

• It differentiates between pixels coming from different instances of the same class(countable
objects) called ”things”(cars, pedestrians, etc.).

Figure 75: Typical architecture

Challenges:

• Can we harmonize architectures for predicting ”Stuff” and ”Things”?


- Semantic and instance segmentation pipelines are still very different.

• Can we improve computational efficiency via parameter sharing?

Two broad categories:

• Top-down: typically two-stage proposal-based.

• Bottom-up: learn suitable feature representation for grouping pixels.

5.10.1 Panoptic FPN

Figure 76: Panoptic FPN

• Feature pyramid backbone.

• Mask R-CNN for instance segmentation.

• Semantic segmentation decoder.

• Replace things classes with 1 class ”other”.

• Merge things and stuff:

1. NMS on instances.
2. Resolve stuff-things conflicts in favour of things.(WHY)
3. Remove any stuff regions labelled ”other” or with a small area.

• Loss function:
L = Lc + Lb + Lm + λs Ls (62)

– Lc + Lb + Lm : Instance segmentation branch loss.


– λs Ls : Semantic segmentation branch loss.
– λs : Trade-off hyperparameter.

Remark:
Training with multiple loss terms(”multi-task learning”) can be challenging, as different
loss terms may ”compete” for desired feature representation.

5.10.2 Panoptic FPN: Summary

• Simple heuristic for merging things and stuff.

• The instance and semantic segmentation branches are treated independently:


i.e., semantic segmentation branch receives no gradient from instance supervision and
vice-versa.

5.10.3 Panoptic FCN

Even simpler?

Figure 77: Panoptic FCN

Figure 78: Panoptic FCN

5.11 Panoptic evaluation

6 Video object segmentation


Goal: Generate accurate and temporally consistent pixel masks for objects in a video sequence.
Challenges:

• Strong viewpoint/appearance changes.

• Occlusions.

• Scale change.

• Illumination.

• Shape.

• ...

What we need:

1. Appearance model:

• Assumption: constant appearance.

• Input: 1 frame.
• Output: segmentation mask.

2. Motion model(maybe optional):

• Assumption: smooth displacement; brightness constancy.


• Input: 2 frames.
• Output: motion(optical flow).

Semi-supervised(one-shot) VOS task formulation:

• Given: Segmentation mask of target object(s) in the first frame.

• Goal: Pixel-accurate segmentation of the entire video.

• Currently, the main testbed for dense(i.e. pixel-level) tracking.

• Choosing the objects to track can be subjective(esp. online).

• Offline tracking: considering the whole video - may provide a better clue(e.g., based
on object permanence).

6.1 Motion-based VOS


6.1.1 Optical flow:

• A pattern of apparent motion.

Figure 79: Optical flow

• Assumptions:

1. Brightness constancy:
Image measurements(brightness) in a small region remain the same.

I(x + u(x, y), y + v(x, y), t + 1) = I(x, y, t) \quad (63)

where u(x, y) is the horizontal and v(x, y) the vertical motion.
2. Spatial coherence:
Neighbouring points in the scene typically belong to the same surface and hence
typically have similar 3D motions.
3. Temporal persistence:
The image motion of a surface patch changes gradually over time.

Figure 80: Brightness difference

Minimize the brightness difference:

• Goal: find (u, v) minimizing

E_{SSD}(u, v) = \sum_{(x,y) \in R} \left( I(x + u, y + v, t + 1) - I(x, y, t) \right)^2 \quad (64)

• First-order Taylor approximation of I(x + \Delta_x, y + \Delta_y, t + \Delta_t):

I(x, y, t) + \Delta_x \frac{\partial}{\partial x} I(x, y, t) + \Delta_y \frac{\partial}{\partial y} I(x, y, t) + \Delta_t \frac{\partial}{\partial t} I(x, y, t) + \epsilon(\Delta_x^2 + \Delta_y^2 + \Delta_t^2) \quad (66)

• Differentiate w.r.t. u and v and set to zero:

\frac{\partial E_{SSD}}{\partial u} \approx 2 \sum_R (u \cdot I_x + v \cdot I_y + I_t) I_x = 0 \quad (67)

\frac{\partial E_{SSD}}{\partial v} \approx 2 \sum_R (u \cdot I_x + v \cdot I_y + I_t) I_y = 0 \quad (68)

• By rearranging the terms, we get a system of 2 equations with 2 unknowns:

\begin{bmatrix} \sum_R I_x^2 & \sum_R I_x I_y \\ \sum_R I_x I_y & \sum_R I_y^2 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} -\sum_R I_x I_t \\ -\sum_R I_y I_t \end{bmatrix}

• The structure tensor is positive definite, hence invertible, so

\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \sum_R I_x^2 & \sum_R I_x I_y \\ \sum_R I_x I_y & \sum_R I_y^2 \end{bmatrix}^{-1} \begin{bmatrix} -\sum_R I_x I_t \\ -\sum_R I_y I_t \end{bmatrix}

This is a classical flow technique - Lucas-Kanade method.
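
A minimal NumPy sketch of this step: build the structure tensor from image gradients within a window and solve the 2 × 2 system for (u, v) (a toy example, not a full pyramidal implementation):

```python
import numpy as np

def lucas_kanade_patch(I0, I1):
    """Estimate a single (u, v) for a small window, assuming brightness constancy."""
    Ix = np.gradient(I0, axis=1)               # spatial derivatives
    Iy = np.gradient(I0, axis=0)
    It = I1 - I0                               # temporal derivative

    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    return np.linalg.solve(A, b)               # (u, v)

# Toy example: a smooth pattern shifted by one pixel to the right
ys, xs = np.mgrid[0:21, 0:21]
I0 = np.sin(xs / 3.0) + np.cos(ys / 4.0)
I1 = np.sin((xs - 1) / 3.0) + np.cos(ys / 4.0)
print(lucas_kanade_patch(I0, I1))              # approximately (1, 0)
```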


Joint formulation of segmentation and optical flow estimation:

Figure 81: VOS with optical flow

• Joint formulation:
iteratively improving segmentation and motion estimation.

• Slow to optimize:
runtime: up to 20s(excluding OF).

• Initialization matters:
we need(somewhat) accurate initial optical flow.

• DL to the rescue?

6.1.2 FlowNet: Architecture 1

• Stack both(2) images → input is now 2 × RGB = 6 channels.

Figure 82: FlowNet

• Training with L2 loss from synthetic data.

6.1.3 FlowNet: Architecture 2 → Siamese architecture

Correlation layer:

• The dot product measures similarity of two features.

Figure 83: Siamese architecture

Figure 84: Correlation layer

• Correlation layer is useful for finding image correspondences.

• SegFlow: Joint estimation of optical flow and object segmentation.

– Joint feature representation at multiple scales.


– Supervision need not be synchronized.

6.1.4 Motion-based VOS: Summary

• We can obtain accurate estimates of optical flow with low latency.

• Naively applying optical flow to dense tracking has limited benefits:

– Due to severe (self-)occlusions, illumination changes, etc.


– Still an active area of research in semi-supervised VOS(dense tracking).

• Motion-based segmentation in a completely unsupervised setting:

6.2 Appearance-only VOS


Main idea:

• Train a segmentation model from available annotation(including the first frame).

• Apply the model to each frame independently.

One-shot VOS(OSVOS): Separate the training steps

• Pre-training for "objectness".

• First-frame adaptation to specific object-of-interest using fine-tuning.

Figure 85: OSVOS

• One-shot: learning to segment sequence from one example(the first frame).

• This happens in the fine-tuning step:


the model learns the appearance of the foreground object.

• After fine-tuning, each frame is processed independently → no temporal information.

• The fine-tuned parameters are discarded before fine-tuning for the next video.

Drifting problem:

• The object appearance changes due to the changes in the object and camera pose

• One idea: adapt the model to the video using pseudo-labels.

OnAVOS: Online Adaptation

Figure 86: OnAVOS

• Adapt model to appearance changes in every frame, not just the first frame.

• Drawback: can be slow.

• OnAVOS is more accurate than OSVOS.

• Instead of fine-tuning on a single sample, we fine-tune on a dynamic set of pseudo-


labels.

Figure 87: Online adaptation

• The pseudo-labels may be inaccurate, so their benefit is diminished over time.

• Next: A reverse approach: fine-tune the model with a correct signal.

MaskTrack

Figure 88: MaskTrack

Training inputs can be simulated:

• similar to the displacements simulated to train the box regressor of Faster R-CNN;

• very similar in spirit to Tracktor (Lecture 4).

6.2.1 Appearance-only VOS: Summary

Advantages of appearance-based models:

• Can be trained on static images.

• Can recover well from occlusions.

• Conceptually simple.

Disadvantages:

• No temporal consistency.

• Can be slow at test-time(need to adapt).

6.3 Metric-based approaches


Pixel-wise retrieval:

• Idea: Learn a pixel-level embedding space where proximity between two feature vectors
is semantically meaningful.

• The user input can be in any form: a first-frame ground-truth mask, a scribble, etc.

Figure 89: Pixel-wise retrieval

• Training: Use the triplet loss to bring foreground pixels together and separate them
from background pixels.

• Test: embed pixels from both annotated and test frame, and perform a nearest neigh-
bor search for the test pixels.

• Advantages:

– We do not need to retrain the model for each sequence, nor fine-tune.
– We can use unsupervised training to learn a useful feature representation(e.g.,
contrastive learning).

7 Transformers
CNNs: A few relevant facts

• CNNs encode local context information.

• Receptive field increases monotonically with network depth

– This enables long-range context aggregation.


– Yet increased depth makes training more challenging.

• We may benefit from more explicit (i.e., in the same layer) non-local feature interactions.

Universality: Developing components that work across all possible settings:

• Consider other modalities(language, speech, etc.)

• Modern deep learning is only partially universal, even in the contexts we have seen so
far(detection and segmentation).

• Transformers can be applied to language and images with excellent results in both
domains.

7.1 Self-attention: A hash table analogy


Consider a hash table:

(Diagram: a function f maps a query to one of the stored (key, value) pairs: key 1 → val 1, key 2 → val 2, key 3 → val 3, key 4 → val 4.)

• The query is n-dimensional.

• Each key is n-dimensional vector.

• Each value is m-dimensional vector.

• A hash table is not a differentiable function(of values w.r.t. keys); it’s not even
continuous.
- We can access any value if we know the corresponding key exactly.

• How can we make it differentiable?


- Differentiable soft look-up.

• We can obtain an(approximate) value corresponding to the query:


v'(q) = \sum_i w(q, k_i) \, v_i \quad (69)

• w(q, ki ) encodes normalized similarity of the query with each key:


\sum_i w(q, k_i) = 1, \qquad w(q, k_i) \ge 0 \quad (70)

• We can use standard operators in our DL arsenal: dot product and softmax

Putting it all together:

• A query vector Q ∈ RS×n : we have S queries.

• A key matrix K ∈ RT ×n : our hash table has T (key, value) pairs.

Figure 90: Self attention

• A value matrix V ∈ RT ×m

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{n}}\right) V \quad (71)

The scaling factor \sqrt{n} eases optimization, especially for large n:

• Large n increases the magnitude of the sum(we have n terms).

• Values with large magnitude inside softmax lead to vanishing gradient.

Linear projection

• Input: X ∈ R^{T × d}: T tokens of size d

• Define 4 linear mappings: W Q , W K ∈ Rd×n , W V ∈ Rd×m , W O ∈ Rm×d

• Compute Q, K, V :
K := XW K , Q := XW Q , V := XW V (72)

• Output: Y := Attention(Q, K, V )W O

• We started with [T × d] and produced [T × d] output

• Q, K and V all come from X. This is called self-attention.
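
A compact sketch (PyTorch assumed) of the single-head self-attention just described: project X into Q, K, V, perform the soft look-up, and project back with W^O:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, d, n, m = 10, 64, 32, 32                     # T tokens of size d; key dim n, value dim m
X = torch.randn(T, d)

W_Q = nn.Linear(d, n, bias=False)
W_K = nn.Linear(d, n, bias=False)
W_V = nn.Linear(d, m, bias=False)
W_O = nn.Linear(m, d, bias=False)

Q, K, V = W_Q(X), W_K(X), W_V(X)                # all derived from X: self-attention
weights = F.softmax(Q @ K.T / n ** 0.5, dim=-1) # [T, T] normalized similarities
Y = W_O(weights @ V)                            # soft look-up, then output projection
print(Y.shape)                                  # [T, d]
```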

Complexity

• Memory complexity(T >> n, m): O(T 2 )

\mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{n}}\right) V \quad (73)

• Runtime complexity: O(T 2 n)

Multi-head attention

• One softmax - one look-up operation.

• We can extend the same operation to multiple look-ups without increasing the com-
plexity.

• This is called multi-head attention.

• Main idea: Chunk Q, K, V vectors

Figure 91: Multi-head

- For example, split it in two(e.g., split 128d vectors into two 64d vectors).

• Now we have two (Q, K, V ) triplets, where feature dimension is reduced by a factor
of 2.

• Repeat the same process:

Figure 92: Multi-head

• Concatenate the per-head value outputs to recover the original feature dimensionality.

• Intuition: Given a single query, we can fetch multiple values corresponding to different
keys.

• We can design self-attention with arbitrary number of heads.


- condition: the feature dimensionality of Q, K and V should be divisible by it.

• With Z heads, a single query can fetch up to Z different values.


- we can model more complex interactions between the tokens.

• No increase in runtime complexity - actually faster in wall time.

Normalization:
Further improvements:

• Residual connection:
add the original token feature before the normalization.

• Layer normalization:
Normalize each token feature w.r.t. its own mean and standard deviation.

Y := \mathrm{LayerNorm}(X + \mathrm{MHA}(X)) \quad (74)

Recap:

• Self-attention is versatile:

– arbitrary number of tokens.


– design choices: the number of heads, feature dimensionality.

• Self-attention computation scales quadratically w.r.t. tokens


since we need to compute the pairwise similarity matrix.

• Self-attention is permutation-invariant(w.r.t. the tokens):

– the query index does not matter; we will always fetch the same value for it.
– Is it always useful?

7.2 Transformers
Transformer-based encoder:

Figure 93: Encoder

• Positional encoding breaks permutation invariance.

• Positional encoding is a matrix with distinct values corresponding to each row (token position).

• Positional encoding stores a unique value(vector) corresponding to a token index.

– It can be a learnable model parameter or fixed, for example:
PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \quad (75)

PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \quad (76)
– It introduces the notion of spatial affinity(e.g., distance between two words in
sentence; two patches in an image, etc.)
– It breaks permutation invariance.
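
A small sketch of the fixed sinusoidal encoding defined above (an even model dimension is assumed):

```python
import torch

def sinusoidal_pe(num_tokens, d_model):
    pos = torch.arange(num_tokens, dtype=torch.float32).unsqueeze(1)   # [T, 1]
    i = torch.arange(0, d_model, 2, dtype=torch.float32)               # even feature indices
    freq = 10000.0 ** (i / d_model)
    pe = torch.zeros(num_tokens, d_model)
    pe[:, 0::2] = torch.sin(pos / freq)
    pe[:, 1::2] = torch.cos(pos / freq)
    return pe                                                          # [T, d_model]

tokens = torch.randn(16, 128)
tokens = tokens + sinusoidal_pe(16, 128)       # added to the token features before the encoder
print(tokens.shape)
```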

7.3 Vision Transformers


An all-round transformer architecture:

• Competitive with CNNs(classification accuracy)*

• Main idea: Train on sequences of image patches; only minor changes to the origi-
nal(NLP) Transformer model.

• Can be complemented with a CNN model

Steps:

1. Split an image into fixed-sized patches.

2. Linearly embed each patch(a fully connected layer).

3. Attach an extra ”[class]” embedding(learnable).

4. Add positional embeddings.

5. Feed the sequence to the standard Transformer encoder.

6. Predict the class with an MLP using the output [class] token.
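
A hedged sketch of steps 1-4 (PyTorch assumed; the patch size and dimensions follow common ViT-Base choices but are only illustrative here):

```python
import torch
import torch.nn as nn

B, C, H, W, P, d = 2, 3, 224, 224, 16, 768
img = torch.randn(B, C, H, W)

# 1. Split into P x P patches and flatten each patch: [B, N, C*P*P]
patches = img.unfold(2, P, P).unfold(3, P, P)             # [B, C, H/P, W/P, P, P]
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)

# 2.-4. Linear embedding, learnable [class] token, positional embeddings
embed = nn.Linear(C * P * P, d)
cls_token = nn.Parameter(torch.zeros(1, 1, d))
pos_emb = nn.Parameter(torch.zeros(1, patches.shape[1] + 1, d))

x = embed(patches)                                        # [B, N, d]
x = torch.cat([cls_token.expand(B, -1, -1), x], dim=1)    # prepend the [class] token
x = x + pos_emb                                           # input sequence for the encoder
print(x.shape)                                            # [2, 197, 768]
```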

Figure 94: Vision Transformer

Experiment with ViT

• ViT performs well only when pre-trained on large JFT dataset(300M images).(QUIZ:
Why?)

• ViT does not have the inductive biases of CNNs:

– Locality(Self-attention is global.)
– 2D neighborhood structure(positional embedding is inferred from data.)
– Translation invariance.

• Nevertheless, now we use the same computational framework for two distinct modali-
ties: Language and vision!

7.4 Swin Transformers


Recall ViT
• We process patch embeddings with a series of self-attention layers:
- Quadratic computational complexity.

• The number of tokens remains the same in all layers(QUIZ: How many/ what does it
depend on?)

• In CNNs, we gradually decrease the resolution, which has computation benefits(and


increases the receptive field size).

• Does ViT benefit from the same strategy?


Swin Transformer:
• Construct a hierarchy of image patches.

• Attention windows have a fixed size.

• Linear computational complexity.

• Output representation is compatible with standard backbones(e.g., ResNet).


- We can test many downstream tasks(e.g., object detection).
A naive implementation will hurt context reasoning:
At any given level(except the last one), our context is not global anymore:

Figure 95: Naive Swin

Solution: Alternate the layout of local windows:


Successively decrease the number of tokens using patch merging:

Figure 96: Mature Swin

• Concatenate 2 × 2 C-dim patches into a feature vector (4 × C-dim);

• Linearly transform to a 2 × C dimensional vector.

Figure 97: Swin Transformer

Results:
• More efficient and accurate than ViT and some CNNs (despite using lower-resolution input).

• Note: No pre-training on large datasets.

• Improved scalability on large datasets(compare ImageNet-1K and ImageNet-22K)


Summary:
• Reconciles the inductive biases from CNNs with Transformers;

• State-of-the-art on image classification, object detection and semantic segmentation.

• Linear computational complexity.

• Demonstrates that Transformers are strong vision models across a range of classic
downstream applications.
Transformer for detection?
• Would it make sense to adapt the Transformer to object detection?

• Recall that object detection is a set prediction problem:


- i.e., we do not care about the ordering of the bounding boxes.

• Transformers are well-suited for processing sets.

• Directly formulate the task as a set prediction!

7.5 DETR

Figure 98: DETR

• The CNN predicts local feature embeddings.

• The transformer predicts the bounding boxes in parallel.

• During training, we uniquely assign predictions to ground truth boxes with Hungarian
matching.

• No need for non-maximum suppression.

A closer look:

• DETR uses a conventional CNN backbone to learn a 2D representation of an input


image.

• The model flattens it and supplements it with a positional encoding before passing it
to the Transformer.

• The Transformer encodes the input sequence.

– TF encoder: ”self-attention”
– TF decoder: ”cross-attention”

• We supply object queries (learnable positional encodings) to the decoder, which jointly


attends to the encoder output.

• Each output embedding of the decoder is passed to a shared feed-forward network(FFN)


that predicts either a detection(class and bounding box) or a ”no object” class.

Figure 99: DETR

7.5.1 The loss function

L_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}} \, L_{box}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right] \quad (77)

where

• LHungarian (y, ŷ): The total Hungarian loss used in DETR, which consists of classifica-
tion and localization terms.

• N : The number of ground truth objects in the image.

• ŷ: The predicted set of objects from the model.

• y: The ground truth set of objects.

• σ̂(i): The optimal assignment of ground truth objects to predicted objects obtained
using the Hungarian algorithm.

• p̂σ̂(i) (ci ): The predicted probability for the class ci of the ground truth object assigned
to prediction σ̂(i).

• − log p̂σ̂(i) (ci ): The classification loss term, which penalizes incorrect class predictions.

• 1{ci ̸=∅} : An indicator function that is 1 if the object ci is not the ”no object” (∅) class.

• Lbox (bi , b̂σ̂(i) ): The bounding box regression loss between the ground truth bounding
box bi and the predicted bounding box b̂σ̂(i) .

The bounding box loss:

λiou Liou (bi , b̂σ(i) ) + λL1 ||bi − b̂σ(i) ||1 (78)

• λiou : A weight coefficient for the IoU (Intersection over Union) loss.(→hyperparameter)

• Liou (bi , b̂σ(i) ): The generalized IoU loss function, which penalizes misalignment be-
tween the predicted and ground truth bounding boxes.

• bi : The ground truth bounding box for object i.

• b̂σ(i) : The predicted bounding box assigned to the ground truth object i using the
Hungarian algorithm.

• λL1 : A weight coefficient for the L1 loss term.(→hyperparameter)

• ||bi − b̂σ(i) ||1 : The L1 loss, which measures the absolute difference between the ground
truth and predicted bounding box coordinates.
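
A hedged sketch of the bipartite matching step: build a cost matrix between predictions and ground truth and solve it with scipy's linear_sum_assignment (the cost below uses only the class and L1 box terms; the GIoU term and the weights are simplified for illustration):

```python
import torch
from scipy.optimize import linear_sum_assignment

num_queries, num_classes, num_gt = 100, 92, 3
pred_logits = torch.randn(num_queries, num_classes).softmax(-1)   # class probabilities
pred_boxes = torch.rand(num_queries, 4)                           # (cx, cy, w, h) in [0, 1]
gt_classes = torch.tensor([3, 17, 3])
gt_boxes = torch.rand(num_gt, 4)

# Matching cost: -p(class) plus a weighted L1 box distance (GIoU term omitted here)
cost_class = -pred_logits[:, gt_classes]                  # [Q, G]
cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)         # [Q, G]
cost = cost_class + 5.0 * cost_box

rows, cols = linear_sum_assignment(cost.numpy())
print(list(zip(rows, cols)))    # unique query <-> ground-truth pairs (sigma-hat)
```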

DETR: Summary

• Accurate detection with a (relatively) simple architecture.

• No need for non-maximum suppression.


- We can simply disregard bounding boxes with ”empty” class, or low confidence.

• Accurate panoptic segmentation with a minor extension.

• Issues:

– High computational and memory requirements(especially the memory).


– Slow convergence / long training.

7.6 MaskFormer

Figure 100: MaskFormer

A unified model for semantic and panoptic segmentation.


Recall: Panoptic FCN

Figure 101: Panoptic FCN

MaskFormer’s idea: Compute the kernels using learnable queries with a Transformer.
Works well, but has trouble with small objects/segments.

7.7 Mask2Former
Idea: Attend with self-attention at multiple scales of the feature hierarchy:
Masked attention:

• Idea: constrain attention only to the foreground area corresponding to the query.

Figure 102: Mask2Former

• Standard self-attention:

X_l = \mathrm{softmax}(Q_l K_l^T) V_l + X_{l-1} \quad (79)

• Masked attention:

X_l = \mathrm{softmax}(M_{l-1} + Q_l K_l^T) V_l + X_{l-1} \quad (80)



M_{l-1}(x, y) = \begin{cases} 0 & \text{if the binarized mask prediction } m_{l-1}(x, y) = 1 \\ -\infty & \text{otherwise} \end{cases} \quad (81)

(m_{l-1} denotes the binarized mask prediction of the previous decoder layer.)
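
A tiny sketch of the masking: positions outside the predicted foreground get −∞ added to the attention logits, so the softmax ignores them (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

Q = torch.randn(8, 32)                               # 8 queries
K, V = torch.randn(100, 32), torch.randn(100, 64)    # 100 feature locations
fg = torch.rand(8, 100) > 0.5                        # binarized mask prediction per query

attn_mask = torch.zeros(8, 100).masked_fill(~fg, float("-inf"))   # M_{l-1}
attn = F.softmax(Q @ K.T + attn_mask, dim=-1)        # only foreground locations contribute
out = attn @ V
print(out.shape)                                     # [8, 64]
```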

Mask2Former: Summary

• Note that we do not talk about ”stuff” and ”things” anymore.

• Queries abstract those notions away.

• Achieves state-of-the-art accuracy across segmentation tasks.

7.8 Conclusions
• Transformers have revolutionized the field of NLP, achieving incredible results.

• We observe massive impact on computer vision(DETR, ViT).

• Complementing CNNs, Transformers have reached state-of-the-art in object classifi-


cation, detection, tracking, and image generation.

• A grain of salt: This often comes at an increased computational budget(larger GPUs,


longer training).

8 Unsupervised
Learning without labels

Compact, yet descriptive representation?
Consider the loss:
L = I(X; Z) − βI(Z; Y ) (82)

Figure 103: Unsupervised loss

where

• I(X; Z): Minimize MI between X and Z(compression).

• I(Z; Y ): Preserve relevant information in Z about Y .

Recap: Mutual Information(MI)

I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_{(X,Y)}(x, y) \log \frac{P_{(X,Y)}(x, y)}{P_X(x) P_Y(y)} \quad (83)

After obtaining compact and descriptive representation, we can apply it for different tasks,
such as detection, classification, segmentation, etc. by adding task-specific shallow mod-
els(e.g., linear projection).

8.1 Evaluating Self-supervised learning(SSL) models


• Fine-tuning on the downstream tasks:

– Either all or only few last layers.


– Pros: Typically leads to best task performance(e.g., accuracy, IoU).
– Cons: Our model becomes task-specific; cannot be re-used for other tasks.

• Linear probing:

– Only learn the new linear projection. The model parameters remain fixed.
– Pros: We can re-use the model for other tasks by training multiple linear projec-
tions.
– Cons: Typically worse task accuracy than fine-tuning.

• k-NN classification:

– Project labeled data in the embedding space.
– Classify datapoints according to the class of its k nearest neighbors.
– Pros: Same as linear probing(versatility), but no learning is necessary.
– Cons: Prediction can be a bit costly.
- Linear search complexity due to high feature dimensionality.

• How do we train deep models without labels?

– Goal: Define training objectives with some relation to our target objective.
– By training the model on these objectives, we hope to learn something for the
target objective.

• Categories of self-supervision

– Pretext tasks
– Contrastive learning
– Non-contrastive learning

8.2 Pretext tasks


Idea: Solve a different task with available(generated) supervision.
Caveats:

• The task should have some relation to our goal task.

• Finding an effective pretext task is challenging.

• The deep net will always try to cheat, i.e. find ”shortcut” solutions.

8.2.1 Pretext task: Rotation

Task: predicting image rotation.


Training process outline

1. Quantize rotation angles(e.g., 0, 90, 180, 270 - 4 classes).

2. Sample an image from the training dataset.

3. Rotate the image by one of the pre-defined angles.


- This defines our ”ground-truth” label.

4. Train the network to predict the correct rotation class.

• This leverages the photographic bias in typical image datasets


- i.e. photographed objects have prevalent orientation.

• Otherwise, there’s no canonical pose, hence the rotation angles are meaningless.
- A thought experiment: add all rotated images to the original dataset.

Figure 104: Pretext task: Rotation

Figure 105: Jigsaw puzzle

8.2.2 Pretext task: Jigsaw puzzle

• Solving this task requires the model to learn spatial relation of image patches.

• Cast this task as a classification problem:


- Every permutation defines a class.

8.2.3 Pretext task: Colorization

Predicting the original color of the images(in CIELAB color space):


Intuition: Proper colorization requires semantic understanding of the image.

Figure 106: Colorization

Nuances:

• This is not the same as predicting RGB values from a greyscale image(due to multi-
modality of colorization).

• Instead, we operate in Lab color space. Recall:

– L stands for perceptual lightness;


– a and b express four unique colors(red, green, blue, yellow).

• Euclidean distances in Lab are more perceptually meaningful.

• We cast colorization as a multinomial classification problem.

• Example: Colorization of black-and-white photos.

8.3 Contrastive learning


Recall metric learning

L(A, B, C) = \max\left(0,\; \|f(A) - f(B)\|_2 - \|f(A) - f(C)\|_2 + \alpha\right) \quad (84), where \alpha is the margin.

Intuitive idea:

Figure 107: Metric learning

We need labels to define positive and negative pairs.


Contrastive learning is an extension of metric learning to unsupervised scenarios.
Idea:

• Use data augmentation(e.g., cropping) to create a positive pair of the same image.

• Use other images to create(many) negative pairs.

Example:

• Represent each image as a single feature vector(e.g., last layer in a CNN).

• Consider cosine similarity of two such vectors:


d(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} \quad\rightarrow\quad \text{Note: } d(x, y) \in [-1, 1] \quad (85)

• For a given set \{x, y^+, \{y_i^-\}_{i=1,\dots,n}\}, compute the contrastive score:


s(x) = \frac{e^{d(x, y^+)/\tau}}{e^{d(x, y^+)/\tau} + \sum_{i=1}^{n} e^{d(x, y_i^-)/\tau}} \quad (86)

– Temperature τ : a hyperparameter(usually between 0.01 and 1.0).
– What is the range?
– What does it mean when it reaches maximum/minimum?
– We clearly want to maximize this value!(many implementations)
– Example loss: − log s(x).
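
A minimal sketch of the score above as a loss: cosine similarities with temperature, a softmax over one positive and n negatives, then −log s(x):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x, y_pos, y_negs, tau=0.1):
    x = F.normalize(x, dim=-1)                      # unit vectors -> dot product = cosine
    y_pos = F.normalize(y_pos, dim=-1)
    y_negs = F.normalize(y_negs, dim=-1)
    sims = torch.cat([(x * y_pos).sum(-1, keepdim=True),    # d(x, y+)
                      y_negs @ x]) / tau                     # d(x, y_i-)
    return -F.log_softmax(sims, dim=0)[0]                    # -log s(x)

x = torch.randn(128)                   # anchor embedding (e.g., one augmented crop)
y_pos = x + 0.1 * torch.randn(128)     # positive: another augmentation of the same image
y_negs = torch.randn(64, 128)          # negatives: embeddings of other images
print(contrastive_loss(x, y_pos, y_negs).item())
```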

8.3.1 Intuition

• Note that we normalize the feature embeddings:


d(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} \quad (87)

• Every unit vector corresponds to a point on a unit sphere.

• The goal of contrastive learning is to cluster the representation on the sphere.

• The points on the sphere will be linearly separable.

Figure 108: Intuition

8.3.2 Deep frameworks for SSL

• SimCLR
- A simple framework for contrastive learning.

Figure 109: SimCLR

– Conceptually very simple.


– Best results require substantial computational resources
- batch size 8192(16382 negative pairs)
– Why do we need a large batch size?
- A subject of ongoing research. - Intuition: More negative samples reduce the
gradient variance.

• BYOL

• MoCo
- Reduced the GPU memory footprint by applying momentum encoder:

Figure 110: Momentum encoder

Figure 111: MoCo

• DINO

• Masked Autoencoders

• and many more. . .

8.4 Non-contrastive learning


8.4.1 DINO

• Simple to implement and to train.

• Broad application spectrum.

• Self-distillation with no labels.

• Self-attention maps in the last layer of a ViT with respect to [CLS] token.

• Different heads also focus on semantically distinguishable features.

• The attention maps are temporally stable.

Figure 112: DINO

8.4.2 DINOv2

Figure 113: DINOv2

• 2× faster and 3× less memory than DINO.

• A combination of existing techniques(e.g. noisy student, adaptive resolution).

• Training one big model, distilling to the smaller networks.

• Careful data curation.

8.4.3 Masked Autoencoders

Unsupervised learning with Transformers and a reconstruction loss:


Reconstructing HoG features, instead of pixel values:
Remarks:

• High masking ratio is necessary(75% or more).

• Compute the loss only on the patches masked out in the input:
- This is different from denoising autoencoders.

• Reconstruction w.r.t. normalized values:


- compute the mean and deviation of pixels within each patch.

• Why does this work:


- This behavior occurs by way of a rich hidden representation inside the MAE.
- Intuition: The goal is not much different from contrastive learning!

Figure 114: Masked autoencoders

Figure 115: Reconstructing HoG features

8.5 Downstream applications


What do DINO features encode?
DINO(and other SSL methods) provides semantic correspondence(almost) out-of-the-box.
We can also cluster ”background” (stuff) areas, hence obtain semantic segmentation.
High level idea: Learn a lower dimensional embedding, S(f (i), f (j)), such that clustering
in this space yields semantic masks.
Self-supervision in videos
Two groups of problems:

• Given a video, segment objects in the video(using motion cues).

Figure 116: Multi-view assumption

Figure 117: Extracted features

Figure 118: Extracted features

• Given a video dataset, learn to track objects.

Segmentation from motion cues
Idea: If the object mask is correct, we cannot reconstruct the object-related optical flow. We train two networks:

• Network G: Given an image and optical flow, predict the object mask (foreground/background).

• Network I: Given a masked optical flow and the image (not masked), reconstruct the original optical flow.

Contrastive random walk


Forward-backward cycle consistency:

• Given a video, construct a palindrome(i.e. t1 , . . . , tN −1 , tN , tN −1 , . . . , t1 )

• Label each patch in the image and propagate them through a video.
- We compute affinity(using cosine similarity) between patches of subsequent frames.

• Cycle consistency loss: Each label should arrive at its original location.

8.6 Conclusion
• Research on unsupervised learning is very active.

• We can train more accurate models with less supervision.

• Requires large computational resources(dozens of high-end GPUs).

Figure 119: Extracted features

• Yet they do not scale (too) well with the amount of data (saturation).

• Many open questions:

– What is a good proxy task?


– How to make computational requirements manageable?
– How (and/or why) does it work?

9 Semi-supervised learning
Training on labeled and unlabeled data.

Figure 120: Semi-supervised

General remarks:
• Using both labeled and unlabeled data is a very practical scenario.

• If the goal is to get the best accuracy, semi-supervised learning is the way to go.
- Current state-of-the-art frameworks take this approach(rather than full supervision).
Small print:
1. Improvement is not always guaranteed.
- It depends on the model, the technique used and the unlabeled data.

2. Semi-supervised techniques are often complementary.


- A combination of multiple techniques yields the best results (though it makes the framework more complex).
A practical perspective:

Figure 121: Semi-loss

9.1 Assumptions
Assumptions about semi-supervised learning:

9.1.1 Smoothness assumption

If two input points are close by, their labels should be the same.
Transitivity

• We have a labeled x1 ∈ XL and two unlabeled x2 , x3 ∈ XU inputs.

• Suppose x1 is close to x2 , and x2 is close to x3 , but x1 is not close to x3 .

• Then we can still expect x3 to have the same label as x1 .

9.1.2 Low-density assumption

The decision boundary should pass through a region with low density p(x).

9.1.3 Manifold assumption

• Data comes from multiple low-dimensional manifolds.

• Data points sharing the same manifold, share the same label.

Remark:
Which assumptions to make depends on what we know about how our data distribution
p(x) interacts with the class posterior p(y|x).

9.2 Two taxonomies


How unlabeled data is used:

9.2.1 Unsupervised pre-processing

- pre-training, clustering, etc.;


- Two stages:

• Unsupervised: Feature extraction/learning(e.g. DINO, MAE).

• Supervised: Fine-tuning, linear probing, or k-NN classification.

9.2.2 Wrapper methods

Self-training
A single classifier trained jointly on labeled and self-labeled data from the unlabeled dataset.
OnAVOS:

Figure 122: OnAVOS

Online adaptation: Adapt model to appearance changes in every frame, not just the first
frame.
Drawback: Can be slow.

Figure 123: Online Adaptation

Segment Anything (2)


Self training with pseudo labels

• How to select pseudo-labels(the confidence threshold)?

– High vs. low threshold trade-off(QUIZ)


– High: No learning signal(the gradient will be close to zero);
– Low: Noisy labels → low accuracy

• Tedious to train(Multiple training rounds).

Figure 124: OnAVOS with unlabeled dataset

• Sensitive to the initial model:


- Fails if the initial predictions are largely inaccurate.

A general outline:

1. Train a strong baseline on the labeled set;


- e.g. with heavy data augmentation(crops, photometric noise).

2. Predict pseudo-labels for the unlabeled set.

3. Continue training the network on both labeled and pseudo-labeled samples.

4. Repeat steps 2-3.

9.2.3 Intrinsically semi-supervised

Entropy minimization

• Example:
Entropy minimization for semantic segmentation(”self-training”):
L(\{(x_i, y_i)\}_i, \{\hat{x}_i\}_i) = \sum_i L_{supervised}(x_i, y_i) + \lambda \sum_i L_{unsupervised}(\hat{x}_i) \quad (88)

• Objective:
Minimize the entropy of class distribution for each pixel:
L_{unsupervised}(\hat{x}) = -\sum_j p(f(\hat{x})_j \mid \hat{x}) \log p(f(\hat{x})_j \mid \hat{x}) \quad (89)
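
A short sketch of this unsupervised term for dense prediction: the per-pixel entropy of the softmax output is the loss on unlabeled images:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 21, 64, 64)                # unlabeled batch: [B, classes, H, W]
p = F.softmax(logits, dim=1)
entropy = -(p * torch.log(p + 1e-8)).sum(dim=1)    # [B, H, W]: per-pixel entropy
loss_unsup = entropy.mean()                        # minimizing this encourages confident predictions
print(loss_unsup.item())
```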

Virtual adversarial network


Idea: Perturbations of the input should not change the label.

• Supervised case: we want to minimize

D[q(y|x^*),\, p(y|x^* + r_{adv}, \theta)] \quad (90)

where

r_{adv} := \arg\max_{r: \|r\| \le \epsilon} D[q(y|x^*),\, p(y|x^* + r, \theta)] \quad (91)

• Semi-supervised case:
Replace q(y|x∗ ) above with our current estimate, p(y|x∗ , θ̂)

9.2.4 Learning from synthetic data

Labeled and unlabeled data may come from different distributions.


e.g. due to differences in the synthetic and real appearances.

• Labels are easier(hence cheaper) to obtain on a large scale.

• Consider it a special case of a semi-supervised learning problem.

Domain alignment
This translates into disjoint feature distributions for a model trained only on the labeled data:

Figure 125: Domain alignment

Solution: ”Align” the two distributions.


Domain alignment means making two feature distributions indistinguishable.
We can use a GAN:

Figure 126: GAN

• The discriminator learns to classify the origin of the provided feature.

• The model learns:

– to classify the labeled images.


– a feature representation that reduces the discriminator accuracy.

Consistency regularization
Consistent prediction across image transformations.

The semantic meaning does not change under such transformations, though this is not guaranteed for a deep net → use it as a consistency loss.
Framework:

Figure 127: Araslanov and Roth

Momentum net
Test-time augmentation is applied online at training time:

Figure 128: Momentum net

Limited supervision

Figure 129: Limited supervision

Weakly supervised learning: training with coarse labels:


Segmentation with image-level labels
Idea: we can reuse the classifier weights to classify every pixel.

• Consider a classification network:

Figure 130: Classification network

• We can replace GAP and instead use 1 × 1 convolution.

• Insight: such classification turns out to be meaningful.

9.3 Summary
• Entropy minimization
- Improves accuracy, but leads to miscalibration.

• Virtual adversarial network


- generic treatment of supervised and unsupervised data.

• Consistency regularization
- can be effective, but sensitive to initial pseudo-label quality.

• Self-training
- simple and effective; but is limited by available augmentation techniques.

• Unsupervised pre-training
- simple and effective; this should be your first baseline.

• Domain alignment
- typically less fine-tuning is required, but can be still challenging to train(GAN).

• Coarse labels(weak supervision)


- a cost-effective compromise between fully labeled and unlabeled data.

