
COMPUTER VISION III LECTURE NOTES

Lecturer: Nikita Araslanov

Hurile Borjigin
Technical University of Munich
2024 WS

Contents
1 Introduction and Object Detection 4
1.1 What this course is . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Understanding an image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Understanding a video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Some architectures and concepts . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Object detection 6
2.1 One-stage detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Object detection with sliding window . . . . . . . . . . . . . . . . . . 6
2.1.2 Feature-based detection . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Two-stage detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Non-Maximum Suppression (NMS) . . . . . . . . . . . . . . . . . . . . . 9
2.3 Object detection with deep networks . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 Overfeat[1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 Object proposals(Pre-filtering) and pooling for two-stage detection . . 10
2.3.3 R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.4 Fast R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.5 Faster R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Feature Pyramid Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Single-stage object detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.1 YOLO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.2 SSD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.3 RetinaNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.4 Problems with one-stage detectors . . . . . . . . . . . . . . . . . . . . 19
2.5.5 Focal loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.6 RetinaNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6 Spatial Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Detection evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3 Single object tracking 22


3.1 Bayesian tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.1 Hidden Markov model . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Online vs. Offline tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 GOTURN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Online adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5 MDNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4 Multi-object tracking 28
4.1 Approach 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Typical models of dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Bipartite matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4 Approach 2: Tracktor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5 Metric learning: For re-identification(Re-ID) . . . . . . . . . . . . . . . . . . . 31
4.5.1 Metric spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.5.2 How do we train a network to learn a feature representation . . . . . . 32
4.5.3 Metric learning for tracking . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5.4 Summary of metric learning . . . . . . . . . . . . . . . . . . . . . . . . 33
4.6 Online vs. Offline tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.7 Graph based MOT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.7.1 Tracking with network flow . . . . . . . . . . . . . . . . . . . . . . . . 34
4.7.2 Tracking with Message Passing Network . . . . . . . . . . . . . . . . . 38
4.7.3 Evaluation of MOT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5 Segmentation 44
5.1 K-means(clustering) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 Spectral clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3 Normalized cut(Ncut) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.4 Energy-based model: Conditional random fields(CRFs) . . . . . . . . . . . . 44
5.5 Fully convolutional neural networks . . . . . . . . . . . . . . . . . . . . . . . . 44
5.6 U-Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.7 SegNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.8 Mask R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.8.1 PointRend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.9 Panoptic FPN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.10 Panoptic FCN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

1 Introduction and Object Detection
1.1 What this course is
High-level (semantic) computer vision:

• Object detection

• image segmentation

• object tracking

• etc...

From another perspective this is the intersection of Computer vision, deep learning and
real-world applications, as shown in Fig 1.

Figure 1: Another perspective

What is computer vision:

• First defined in the 60s (The Summer Vision Project, 1966).

• "Mimic the human visual system."

• A central block of artificial intelligence.

Semantic scene understanding:

• Classification

• Object detection

• Semantic segmentation

• Instance segmentation

• Object tracking

This course, viewed along two dimensions, is shown in Fig. 2.

Figure 2: CVIII in two Dimensions

1.2 Understanding an image


Different representations depending on the granularity:

• Detection (bounding box, i.e., a coarse description)

• Semantic segmentation (pixel-level)

• Instance segmentation (e.g., "person 1", "person 2")

1.3 Understanding a video


Why use the temporal domain?

• Motion analysis, multi-view reasoning.

• A smoothness assumption: no abrupt changes between frames.

But challenges:

• High computational demand.

• A lot of redundancy.

• Occlusions, multiple objects moving and interacting.

1.4 Some architectures and concepts


• R-CNN, Fast R-CNN and Faster R-CNN (2-stage object detection)

• YOLO, SSD, RetinaNet (1-stage object detection)

• Siamese networks (online tracking)

• Message Passing Networks (offline tracking)

• Mask R-CNN, UPSNet (panoptic segmentation)

• Deformable/atrous convolutions

• Graph neural networks(GNNs)

• Vision Transformers(ViT), DETR(object detection), SAM

• Contrastive learning

2 Object detection
2.1 One-stage detectors
One-stage detectors treat object detection as a single task, directly predicting the object
class and bounding box coordinates from the input images.

2.1.1 Object detection with sliding window

For every position, measure the distance (or correlation) between the template and the image
region (see Fig. 3):

Figure 3: One-stage detector (old)

L(x_0, y_0) = d(I_{(x_0,y_0)}, T)    (1)

where d is the distance metric, I_{(x_0,y_0)} is the image region at position (x_0, y_0), and T is the template.

Template matching distances

• Sum of squared distances (SSD), or mean squared error (MSE):

  d(I_{(x_0,y_0)}, T) = \frac{1}{n} \sum_{x,y} \left( I_{(x_0,y_0)}(x, y) - T(x, y) \right)^2    (2)

• Normalized cross-correlation (NCC):

  d(I_{(x_0,y_0)}, T) = \frac{1}{n} \sum_{x,y} \frac{1}{\sigma_I \sigma_T} \, I_{(x_0,y_0)}(x, y) \, T(x, y)    (3)

• Zero-normalized cross-correlation (ZNCC):

  d(I_{(x_0,y_0)}, T) = \frac{1}{n} \sum_{x,y} \frac{1}{\sigma_I \sigma_T} \left( I_{(x_0,y_0)}(x, y) - \mu_I \right)\left( T(x, y) - \mu_T \right)    (4)
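As a small illustration of the distances above, here is a minimal NumPy sketch of ZNCC with a brute-force sliding-window search; the function names and the deliberately exhaustive double loop are for illustration only:

```python
import numpy as np

def zncc(patch, template):
    """Zero-normalized cross-correlation (Eq. 4): higher means a better match."""
    p = patch.astype(np.float64).ravel()
    t = template.astype(np.float64).ravel()
    p = (p - p.mean()) / (p.std() + 1e-8)   # subtract mu_I, divide by sigma_I
    t = (t - t.mean()) / (t.std() + 1e-8)   # subtract mu_T, divide by sigma_T
    return float(np.mean(p * t))

def sliding_window_match(image, template):
    """Brute-force search over every position (x0, y0); returns the best top-left corner."""
    H, W = image.shape
    h, w = template.shape
    scores = np.full((H - h + 1, W - w + 1), -np.inf)
    for y0 in range(H - h + 1):
        for x0 in range(W - w + 1):
            scores[y0, x0] = zncc(image[y0:y0 + h, x0:x0 + w], template)
    return np.unravel_index(np.argmax(scores), scores.shape)
```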

Disadvantages

• (Self-) Occlusions e.g. due to pose changes

• Changes in appearance

• Unknown position, scale, and aspect ratio → brute-force search (inefficient)

Figure 4: One-stage detector(new)

2.1.2 Feature-based detection

Idea: Learn feature-based classifiers invariant to natural object changes(See Fig 4)

• Features are not always linearly separable

• Learning multiple weak learners to build a strong classifier(Fig 5)

Figure 5: Multiple weak learners

Viola-Jones detector [2]


Given data (xi , yi )

1. Define a set of Haar-like features(Fig 6)


Haar-features are sensitive to directionality of patterns

Figure 6: Haar features

2. Find a weak classifier with the lowest error across the dataset

3. Save the weak classifier and update the priority of the data samples

4. Repeat Steps 2-3 N times

Final classifier is the linear combination of all weak learners.

Histogram of Oriented Gradients [3]


Average gradient image over training samples → gradients provide shape information.
HOG descriptor → Histogram of Oriented Gradients.
Compute gradients on a dense grid of cells and create a histogram per cell based on the
gradient directions.

1. Choose your training set of images that contain the object you want to detect.

2. Choose a set of images that do NOT contain that object.

3. Extract HOG features from both sets.

4. Train an SVM classifier on the two sets to detect whether a feature vector represents
the object of interest or not(0/1 classification)
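A compact sketch of these four steps, assuming scikit-image and scikit-learn are available; the 64×128 window size and the random stand-in crops are placeholders for real training data:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_hog(images):
    """Compute one HOG descriptor per same-sized grayscale image (step 3)."""
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

# Stand-in data: in practice these would be crops with / without the object (steps 1-2).
pos_images = [np.random.rand(64, 128) for _ in range(20)]
neg_images = [np.random.rand(64, 128) for _ in range(20)]

X = np.concatenate([extract_hog(pos_images), extract_hog(neg_images)])
y = np.concatenate([np.ones(len(pos_images)), np.zeros(len(neg_images))])

clf = LinearSVC(C=1.0).fit(X, y)        # step 4: 0/1 SVM on HOG features
scores = clf.decision_function(X[:5])   # at test time: score sliding-window descriptors
```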

Deformable Part Model[4]


Many objects are not rigid, so we could use a Bottom-up approach:

1. detect body parts

2. detect ”person” if the body parts are in correct arrangement

Note: the amount of work for each RoI may grow significantly.

2.2 Two-stage detectors


Two-stage object detectors split the object detection task to two parts(Fig 7):

1. Region Proposal: The network first identifies potential regions in the image that
may contain objects.

2. Refinement: These regions are further analyzed to classify the objects and refine
their bounding boxes.

Figure 7: 2-stage detection

A generic, class-agnostic objectness measure: object proposals or regions of interest (RoI).
Object proposal methods:

1. Selective search[5]
Using class-agnostic segmentation at multiple scales.

2. Edge boxes[6]
Bounding boxes that wholly enclose detected contours.

2.2.1 Non-Maximum Suppression (NMS)

Method to keep only the best proposals(See algorithm 1).

Algorithm 1 Non-Max Suppression


1: procedure NMS(B, c)
2: Bnms ← ∅
3: for bi ∈ B do ▷ Start with anchor box i
4: discard ← False
5: for bj ∈ B do ▷ For another box j
6: if same(bi , bj ) > λnms then ▷ If they overlap
7: if score(c, bj ) > score(c, bi ) then
8: discard ← True ▷ Discard box i if the score is lower than the score of j
9: end if
10: end if
11: end for
12: if ¬discard then
13: Bnms ← Bnms ∪ bi
14: end if
15: end for
16: return Bnms
17: end procedure
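A minimal Python version of Algorithm 1, using IoU as the overlap function same(·,·); variable names follow the pseudocode, and boxes are assumed to be (x1, y1, x2, y2) tuples:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, lambda_nms=0.5):
    """Keep box i unless it overlaps another box j that has a higher score (Algorithm 1)."""
    keep = []
    for i, b_i in enumerate(boxes):
        discard = False
        for j, b_j in enumerate(boxes):
            if i != j and iou(b_i, b_j) > lambda_nms and scores[j] > scores[i]:
                discard = True
                break
        if not discard:
            keep.append(i)
    return keep  # indices of the retained boxes
```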

Region overlap
We measure region overlap with the Intersection over Union (IoU), or Jaccard index:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}
The threshold is the decision boundary used to determine whether a detected object or
prediction should be considered valid. For example, in object detection:

• Choosing a high threshold – more false negatives, lower recall

• Choosing a low threshold – more false positives, lower precision

2.3 Object detection with deep networks


2.3.1 Overfeat[1]

Explores three well-known vision tasks using a single framework.

• Classification

• Localization

• Detection

Training convolution on all tasks boosts up the accuracy, see Fig 8.

Figure 8: Overfeat

Apply Non-Max Suppression to combine the predictions and windows, see Fig 9.

Figure 9: NMS

Improved detection accuracy is largely due to feature representation learned with


deep nets.
But cons:

• Expensive to try all possible positions, scales and aspect ratios.

• The network operates on a fixed resolution.

The complexity of a sliding window is O(N D), where N is the number of all the windows,
and D represents the amount of work needed to perform detection and classification for one
window. The complexity of the ”pre-filtered” method is O(N d+nD), where d is the amount
of work needed for filtering each window, and n is the number of windows left after being
filtered.
”Pre-filtering” method pays off when:

O(N D) > O(N d + nD)

Assume the constant factors are comparable/negligible:

\frac{n}{N} + \frac{d}{D} < 1

where n/N is the Region-of-Interest (RoI) ratio and d/D is the relative cost of the RoI generator. In
practice, there is a delicate balance between n and d: a cheaper filter (smaller d) typically leaves
more windows (larger n) to be processed.
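For intuition, plug in some hypothetical numbers: with N = 100,000 candidate windows, D = 1 unit of work per full detector evaluation, d = 0.01 units per filtered window, and n = 2,000 surviving proposals, the sliding-window cost is N·D = 100,000 units, while the pre-filtered cost is N·d + n·D = 1,000 + 2,000 = 3,000 units, roughly 30 times cheaper; equivalently, n/N + d/D = 0.02 + 0.01 = 0.03 < 1.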

2.3.2 Object proposals(Pre-filtering) and pooling for two-stage detection

Heuristic-based object proposal methods:

1. Selective search
Using class-agnostic segmentation at multiple scales.

2. Edge boxes
Bounding boxes that wholly enclose detected contours.

R-CNN family(Regions with CNN features)

• Spatial Pyramid Pooling(SPP)

• (Fast/Faster) R-CNN

• Region Proposal Network(RPN)

2.3.3 R-CNN

Training scheme:

1. Pre-train the CNN for image classification(ImageNet)

2. Finetune the CNN on the number of classes the detector is aiming to classify

3. Train a linear SVM classifier to classify image regions - one linear SVM per class

4. Train the bounding box regressor

Pros:

• New: CNN features; the overall pipeline with proposals is heavily engineered – Good
accuracy.

• The CNN summarizes each proposal into a 4096-dimensional vector.

• Leverage transfer learning: The CNN can be pre-trained for image classification with
c classes. One needs only to change the FC layers to deal with Z classes.

Cons:

• Slow: 47s/image with VGG 16 backbone. One considers around 2000 proposals per
image, they need to be warped and forwarded through the CNN.

• Training is also slow and complex.

• The object proposal algorithm is fixed.

• Feature extraction and SVM classifier are trained separately - features are not learned
”end-to-end”.

Problems:

1. Our input image must have a fixed size.

   The FC layers prevent us from dealing with arbitrary image sizes.

2. TBD

Figure 10: SPP-Net

Figure 11: Fast R-CNN

2.3.4 Fast R-CNN

Fast R-CNN = R-CNN + RoI Pooling(Single layer SPP-Net)


RoI Pooling RoI Pooling is a technique introduced in the Fast R-CNN framework to
efficiently extract fixed-size feature representations for arbitrary regions(bounding boxes) in
a feature map. It addresses the problem of feeding region proposals of various sizes into a
neural network in a way that:

• Preserves spatial information with each region of interest(RoI).

• Produces a fixed-dimensional feature vector needed by fully connected layers(classifiers).

1. Feature extraction for the entire image


Instead of cropping individual region proposals from the original image and passing them
each through a ConvNet (as done in older approaches like R-CNN), we:

• Feed the entire image into a convolutional neural network(e.g., VGG, ResNet).

• Obtain the resulting feature map, typically a 2D grid of activations.

This means the network only processes the image once, which is much more efficient.
2. Mapping region proposals to the feature map

You have a set of region proposals(bounding boxes) in the original image coordinates-
these come from an external region proposal method(e.g., selective search) or from a region
proposal network(in Faster R-CNN).
Each region proposal is then mapped onto the feature map coordinates:
- Because the CNN reduces spatial resolution(due to pooling layers or strided convolu-
tions), the coordinates of each region of the original image need to be scaled to align with
the coordinate system of the smaller feature map.
3. Dividing each RoI into Sub-regions
To handle different RoI sizes but produce a fixed-size output (for example a 7×7 output
grid in Fast R-CNN):

• Divide the mapped RoI on the feature map into a predefined grid (e.g., 7×7).

• Each sub-region in this grid corresponds to a smaller set of feature map cells.

4. Pooling operation
Within each of these sub-regions(bins):

• Perform a pooling operation(usually max pooling, though average pooling can also be
used).

• This operation collapses the variable-sized sub-region into a single value(e.g., the max-
imum activation within that bin).

By doing this for each sub-region in the grid, you transform the entire RoI into a fixed-size
feature map (e.g., 7×7), regardless of the original size of the bounding box.
5. Feeding pooled features to classifier layers
Now that you have a fixed spatial dimension (e.g., 7×7), you can:

• Flatten or otherwise reshape this pooled feature map.

• Pass it into fully connected layers(and ultimately classification/regression heads) to


predict:
- The object class for that region.
- The bounding box refinement offsets, etc.
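A minimal sketch of steps 1-5 using torchvision's built-in RoI pooling; the VGG backbone, the spatial_scale of 1/16 (a typical stride-16 feature map), and the example box are illustrative assumptions, and the weights=None argument assumes a recent torchvision version:

```python
import torch
import torchvision

# Step 1: run the whole image through a CNN backbone once to get a feature map.
backbone = torchvision.models.vgg16(weights=None).features[:-1]  # drop last max-pool -> stride 16
image = torch.randn(1, 3, 480, 640)                               # one RGB image
feature_map = backbone(image)                                     # (1, 512, 30, 40)

# Step 2: region proposals in original image coordinates, rows are (batch_idx, x1, y1, x2, y2).
rois = torch.tensor([[0, 100.0, 80.0, 300.0, 260.0]])

# Steps 3-4: scale each RoI to feature-map coordinates (spatial_scale = 1/16),
# divide it into a 7x7 grid and max-pool each bin.
pooled = torchvision.ops.roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)

# Step 5: flatten and feed into the FC classification / regression heads.
flat = pooled.flatten(start_dim=1)                                # (num_rois, 512*7*7)
```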

Results of Fast R-CNN are shown in Fig. 12.

2.3.5 Faster R-CNN

Faster R-CNN = Fast R-CNN + Region proposal network


Region Proposal Network: We fix the number of proposals by using a set of n = 9
anchors per location.

9 anchors = 3 scales × 3 aspect ratios

A 256-d descriptor characterizes every location.


For each feature map location, we get a set of anchor corrections and a classification into
object/non-object.
RPN: Training

Figure 12: Fast R-CNN results

Figure 13: Faster R-CNN

Classification ground truth: We compute p^*, which indicates how much an anchor overlaps
with the ground-truth bounding boxes:

p^* = 1 if IoU > 0.7    (5)

p^* = 0 if IoU < 0.3    (6)

• 1 indicates that the anchor represents an object(foreground), and 0 indicates a back-
ground object. The rest do not contribute to the training.

• For an image, randomly sample a few(e.g., 256) anchors to form a mini-batch(balanced


objects vs. non-objects)

• We learn anchor activations with the binary cross-entropy loss.

• Those anchors that contain an object are used to compute the regression loss.

• Each anchor is described by the center position, width and height(xa , ya , wa , ha ).

• What the network actually predicts are the normalized anchor corrections (a small
  encode/decode sketch follows this list):

  t_x = \frac{x - x_a}{w_a}    (7)        t_y = \frac{y - y_a}{h_a}    (8)

  t_w = \log\frac{w}{w_a}    (9)        t_h = \log\frac{h}{h_a}    (10)

  (Eq. 7: normalized horizontal shift, Eq. 8: normalized vertical shift, Eq. 9: normalized width, Eq. 10: normalized height.)

• Smooth L1 loss on regression targets
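A small NumPy sketch of these regression targets (Eqs. 7-10) and their inverse, assuming boxes are given as centers and sizes (x, y, w, h); the numbers are arbitrary:

```python
import numpy as np

def encode(gt, anchor):
    """Ground-truth box -> regression targets (t_x, t_y, t_w, t_h), Eqs. 7-10."""
    x, y, w, h = gt
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    """Regression targets + anchor -> predicted box (x, y, w, h)."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([xa + tx * wa, ya + ty * ha, wa * np.exp(tw), ha * np.exp(th)])

anchor = np.array([50.0, 60.0, 32.0, 64.0])     # (x_a, y_a, w_a, h_a)
gt = np.array([58.0, 52.0, 40.0, 80.0])
t = encode(gt, anchor)
assert np.allclose(decode(t, anchor), gt)        # encode/decode round-trips
```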

Faster R-CNN training:


- First implementation, training of RPN separate from the rest.
- Now we can train jointly!
- Four losses:

1. RPN classification(object/non-object)

2. RPN regression(anchor – proposal)

3. Fast R-CNN classification(type of object)

4. Fast R-CNN regression(proposal – box)

2.4 Feature Pyramid Networks


CNNs are not scale-invariant.

• Idea A: Featurised image hierarchy


- Pros: Typically boosts accuracy(esp. at test time).
- Cons: Computationally inefficient.

Figure 18: Faster R-CNN Performance

Figure 19: Featurised image hierarchy

• Idea B: Pyramidal feature hierarchy


- More efficient than Idea A, but
- Limited accuracy(inhibits the learning of deep representations)

• Feature pyramid network(FPN)


- Higher scales benefit from deeper representation from lower scales.
- Efficient and high accuracy.

Straightforward implementation (a minimal sketch follows this list):

• Convolution with 1 × 1 kernel

• Upsampling (nearest neighbours)

• Element-wise addition
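A minimal PyTorch sketch of one top-down FPN step built from exactly these three operations; the 256-channel pyramid width and the input sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDownStep(nn.Module):
    """Merge a coarser pyramid level into a finer backbone feature map."""
    def __init__(self, c_in, c_pyramid=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_in, c_pyramid, kernel_size=1)   # 1x1 lateral convolution

    def forward(self, finer_backbone_feat, coarser_pyramid_feat):
        lateral = self.lateral(finer_backbone_feat)
        upsampled = F.interpolate(coarser_pyramid_feat, size=lateral.shape[-2:], mode="nearest")
        return lateral + upsampled                                  # element-wise addition

step = FPNTopDownStep(c_in=512)
c4 = torch.randn(1, 512, 28, 28)    # finer backbone level
p5 = torch.randn(1, 256, 14, 14)    # coarser pyramid level
p4 = step(c4, p5)                   # (1, 256, 28, 28)
```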

Integration with RPN for object detection:

• Define RPN on each level of the pyramid.

• Only a single-scale anchor per level(scale varies across levels).

• At test time, merge the predictions from all levels.

Pros:

• Improves recall across all scales(esp. small objects).

• More accurate(in terms of AP).

Figure 20: Pyramidal feature hierarchy

Figure 21: Feature pyramid network

• Broadly applicable, also for one-stage detectors.

• Still in wide use today.

Cons: Increased model complexity.

2.5 Single-stage object detection


2.5.1 YOLO

Figure 22: YOLO

• Define a coarse grid(S × S).

• Associate B anchors to each cell.

• Each anchor is defined by:


- Localization(x, y, w, h).

- A confidence value(object/no object).
- And a class distribution over C classes.

Inference time:

Figure 23: YOLO Inference

It is more efficient than Faster R-CNN, but less accurate.

• Coarse grid resolution and few anchors per cell → issues with small objects

• Less robust to scale variation.

2.5.2 SSD

Pros:

• More accurate than YOLO.

• Works well even with lower resolution → improved inference speed.

Cons:

• Still lags behind two-stage detectors.


• Data augmentation is still crucial (esp. random sampling).

• A bit more complex(due to multi-scale features).

Figure 24: Single shot detection

2.5.3 RetinaNet

2.5.4 Problems with one-stage detectors

Two-stage detectors:

• Classification only works on ”interesting” foreground regions(proposals, 1-2k). Most


background examples are already filtered out.

• Class balance between foreground and background objects is manageable.

• Classifier can concentrate on analyzing proposals with rich information.

Problems with one-stage detectors:

• Many locations need to be analyzed (100k), densely covering the image → foreground-background imbalance.
- Many negative examples(every cell on the feature grid).
- Few positive examples(actual number of objects).

• Hard negative mining: subsample the negatives with the largest error:
- works, but can be unstable.

Class imbalance:
- Idea: balance the positives/negatives in the loss function.
- Recall cross-entropy loss:

Figure 25: CE loss

2.5.5 Focal loss

Replace CE with focal loss(FL):

• When γ = 0, it is equivalent to the cross-entropy loss.

• As γ increases, the easy examples are increasingly down-weighted.

• Example: γ = 2, if pt = 0.9, FL is ×100 lower than CE.
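For reference, with p_t denoting the predicted probability of the ground-truth class, the two losses compared above are (in the form used by the RetinaNet paper):

\mathrm{CE}(p_t) = -\log(p_t), \qquad \mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\,\log(p_t)

For p_t = 0.9 and γ = 2 the modulating factor is (1 - 0.9)^2 = 0.01, which is the ×100 reduction mentioned above; RetinaNet additionally multiplies by a class-balancing weight α_t.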

Figure 26: Focal loss

2.5.6 RetinaNet

• One-stage(like YOLO and SSD) with Focal Loss.

• Feature extraction with ResNet.

• Multi-scale prediction - now with FPN.

Figure 27: Retina

2.6 Spatial Transformers


Features of Spatial pooling:

• Helps detect the presence of features.

• No information about precise location of features within a RoI.


- No effect on the RoI localization.

Solution: Spatial Transformers.


Equivariance is needed:
f (A(x)) = A(f (x)) (11)

• Learn to predict the parameters of the mapping θ:

• Learn to predict a chain of transformations.

• Learn to localize RoI without dense supervision(only class label).

Figure 28: Spatial Transformers

• Training multiple STs: Focus on different object parts.

• Learning to localize without any supervision.

• Makes the network equivariant to certain transformations (e.g., affine).

• Fully differentiable.

Cons:
- Difficulty of training for generalizing to more challenging scenes.

2.7 Detection evaluation


Precision: How many zebras you found are actually zebras?

Precision = \frac{TP}{TP + FP}    (12)

Recall: How many actual zebras in the image/dataset you could find?

Recall = \frac{TP}{TP + FN}    (13)
What is a true positive?

• Use the Intersection over Union(IoU).

• e.g., if IoU > 0.5 – positive match

• The criterion is defined by the benchmarks(MS-COCO, Pascal VOC)

Resolving conflicts

• Each prediction can match at most 1 ground truth box.

• Each ground truth box can match at most 1 prediction.

All-in-one metric: Average precision(AP)


There is often a trade-off between Precision and Recall.
AP = the area under the Precision-Recall curve(AUC)
Computing average precision

1. Sort the predicted boxes by confidence score

Figure 29: AP

2. For each prediction find the associated ground truth


- The ground truth should be unassigned and pass IoU threshold

3. Compute cumulative TP and FP:


- TP: [1, 0, 1] FP: [0, 1, 0] → cTP = [1, 1, 2], cFP = [0, 1, 1]

4. Compute precision and recall (#GT = 3 → TP + FN = 3)


- Precision: [1/1, 1/2, 2/3]; Recall(#GT=3): [1/3, 1/3, 2/3]

5. Plot the precision-recall curve and the area beneath(num. integration).

mAP is the average over object categories
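The toy example above, carried out in a few lines of NumPy; the rectangle-rule integration over recall is one possible numerical choice (benchmarks such as Pascal VOC and COCO use slightly different interpolation rules):

```python
import numpy as np

# Predictions already sorted by confidence; 1 = matched a ground-truth box, 0 = not.
tp = np.array([1, 0, 1])
fp = 1 - tp
num_gt = 3

ctp, cfp = np.cumsum(tp), np.cumsum(fp)       # cTP = [1, 1, 2], cFP = [0, 1, 1]
precision = ctp / (ctp + cfp)                 # [1.0, 0.5, 0.667]
recall = ctp / num_gt                         # [0.333, 0.333, 0.667]

# Numerical integration of the precision-recall curve: sum of precision * (recall increment).
delta_recall = np.diff(np.concatenate([[0.0], recall]))
ap = np.sum(precision * delta_recall)
print(precision, recall, ap)
```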

3 Single object tracking


Problem statement: Given observations at time t, find their correspondence at time t + 1.
Challenges:

• Fast motion.

• Changing appearance.

• Changing object pose.

• Dynamic background.

• Occlusions.

• Poor image quality.

• ...

Figure 30: Problem state

A simple solution: Tracking by detection.


- Detect the object in every frame.
Problem: data association.

• Many objects may be present & the detector may misfire.

• We could use the motion prior.

3.1 Bayesian tracking

Figure 31: Bayesian example

Goal: Estimate car position at each time instant (say, the white car).
Observation: Image sequence and known background.

• Perform background subtraction.

• Obtain binary map of possible cars.

• But which one is the one we want to track?

Observations: Image
System state: Car position(x, y)
Notations:

Figure 32: Bayesian probabilities

• xk ∈ Rn : internal state at k-th frame


- Hidden random variable, e.g., position of the object in the image.
- Xk = [x1 , x2 , . . . , xk ]T history up to time step k.

• zk ∈ Rm : Measurements at k-th frame


- Observable random variable, e.g., the given image.
- Zk = [z1 , z2 , . . . , zk ]T history up to time step k.

Goal:
Estimate posterior probability p(xk |Zk ).
How? Recursion:

p(xk−1 |Zk−1 ) → p(xk |Zk ) (14)

3.1.1 Hidden Markov model

Two assumptions of HMM:

1.
p(zk |xk , Zk−1 ) = p(zk |xk ) (15)

2.
p(xk |xk−1 , Zk−1 ) = p(xk |xk−1 ) (16)

Recursive Estimation

p(x_k \mid Z_k) = p(x_k \mid z_k, Z_{k-1})

∝ p(z_k \mid x_k, Z_{k-1}) \cdot p(x_k \mid Z_{k-1})    (Applying Bayes' theorem)

∝ p(z_k \mid x_k) \cdot p(x_k \mid Z_{k-1})    (Assumption: p(z_k \mid x_k, Z_{k-1}) = p(z_k \mid x_k))

∝ p(z_k \mid x_k) \cdot \int p(x_k, x_{k-1} \mid Z_{k-1}) \, dx_{k-1}    (Marginalization)

∝ p(z_k \mid x_k) \cdot \int p(x_k \mid x_{k-1}, Z_{k-1}) \cdot p(x_{k-1} \mid Z_{k-1}) \, dx_{k-1}

∝ p(z_k \mid x_k) \cdot \int p(x_k \mid x_{k-1}) \cdot p(x_{k-1} \mid Z_{k-1}) \, dx_{k-1}    (Assumption: p(x_k \mid x_{k-1}, Z_{k-1}) = p(x_k \mid x_{k-1}))

Key Concepts:

• Bayes' Rule: p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}

• Assumption: p(z_k \mid x_k, Z_{k-1}) = p(z_k \mid x_k)

• Marginalization: p(a) = \int p(a, b) \, db

• Factorization from the graphical model: p(x_k \mid x_{k-1}, Z_{k-1}) = p(x_k \mid x_{k-1})

Bayesian formulation:

p(x_k \mid Z_k) = k \cdot p(z_k \mid x_k) \int p(x_k \mid x_{k-1}) \, p(x_{k-1} \mid Z_{k-1}) \, dx_{k-1}    (17)

• p(xk |Zk ) → posterior probability at current time step.

• p(zk |xk ) → likelihood

• p(xk |xk−1 ) → temporal prior

• p(xk−1 |Zk−1 ) → posterior probability at previous time step

• k → normalizing term
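A tiny sketch of this recursion for a discrete 1-D state (a histogram filter); the motion model, the fake likelihood, and the grid size are all illustrative assumptions:

```python
import numpy as np

n = 50                                   # discrete 1-D positions the object can occupy
posterior = np.full(n, 1.0 / n)          # p(x_0): uniform prior

def temporal_prior(n, shift=1, noise=0.05):
    """p(x_k | x_{k-1}): move right by ~1 cell with a little uncertainty."""
    P = np.full((n, n), noise / n)
    for i in range(n):
        P[min(i + shift, n - 1), i] += 1.0 - noise
    return P / P.sum(axis=0, keepdims=True)

P = temporal_prior(n)

def update(posterior, likelihood, P):
    """One step of Eq. 17: predict with the temporal prior, weight by the likelihood, normalize."""
    predicted = P @ posterior                  # integral of p(x_k|x_{k-1}) p(x_{k-1}|Z_{k-1})
    unnormalized = likelihood * predicted      # multiply by the likelihood p(z_k | x_k)
    return unnormalized / unnormalized.sum()   # k = normalizing term

likelihood = np.exp(-0.5 * (np.arange(n) - 12) ** 2)   # fake measurement centered at cell 12
posterior = update(posterior, likelihood, P)
print(posterior.argmax())                               # MAP estimate of the current state
```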

Estimators:
Assume the posterior probability p(xk |Zk ) is known:

• posterior mean:

x̂k = E(xk |Zk ) (18)

• maximum a posteriori (MAP):

x̂k = argmaxxk p(xk |Zk ) (19)

Deep networks:

• It is easy to see what the networks have to output given the input
- but it is harder(yet more useful) to understand what a network models in terms of
our Bayesian formulation.

• Typically, the networks are tasked to produce MAP directly, e.g.,

x̂k = argmaxxk p(xk |Zk ) ≈ fθ (Zk , xk−1 ) (20)

without modeling the actual state distribution.

3.2 Online vs. Offline tracking


Online tracking:
- ”Given observations so far, estimate the current state.”

• Process two frames at a time.

• For real-time applications.

• Prone to drifting → hard to recover from errors or occlusions.

Offline tracking:
- ”Given all observations, estimate any state.”

• Process a batch of frames.

• Good to recover from occlusions(short ones as we will see)

• Not suitable for real-time application.

• Suitable for video analysis.

An online tracking model can be used for offline tracking too. Our recursive Bayesian
model will still work.

3.3 GOTURN

Figure 33: GOTURN

• Input: Search region + template region(what to track).

• Output: Bounding box coordinates in the search region.

Temporal prior:

p(xk |xk−1 ) = δ(xk − xk−1 ) (21)

where δ is Dirac delta function.

Figure 34: Temporal prior

Advantages:

• Simple: close to template matching.

• efficient(real-time).

• end-to-end(we can make use of large annotated data).

Disadvantages:

• may be sensitive to the template choice.

• the temporal prior is too simple: fails if there is fast motion or occlusion.

• tracking one object only.

3.4 Online adaptation


Problem: The initial object appearance may change drastically,
- e.g., due to occlusion, pose change, etc.
Idea: adapt the appearance model on the (unlabeled) test sequence.

3.5 MDNet
Tracking with online adaptation

Figure 35: MDNet

4 Multi-object tracking
Challenges:

• Multiple objects of the same type.

• Heavy occlusions.

• Appearances of individual people are often very similar.

DL’s role in MOT:

1. Tracking initialization(e.g., using a detector.)


- Deep learning → more accurate detectors.

2. Prediction of the next position(motion model).


- We can learn a motion model from data.

3. Matching predictions with detections:


- Learning robust metrics(robust to pose changes, partial occlusions, etc.).
- Matching can be embedded into the model.

4.1 Approach 1
1. Track initialization(e.g., using a detector).

2. Prediction of the next position(motion model).


- Classic: Kalman filter(e.g., state transition model.)
- DL(data driven): LSTM/GRU.

3. Matching predictions with detections(appearance model).


- In general: reduction to the assignment problem.
- Bipartite matching.

4.2 Typical models of dynamics
• Constant position:
- i.e. no real dynamics, but if the velocity of the object is sufficiently small, this can
work.

• Constant velocity(possibly unknown):


- We assume that the velocity does not change over time.
- As long as the object does not quickly change velocity or direction, this is quite a
reasonable model.

• Constant acceleration(possibly unknown):


- Also captures the acceleration of the object.
- This may include both the velocity, but also the directional acceleration.
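A minimal sketch of the constant-velocity dynamics: the state holds position and velocity, and F is the state-transition matrix a Kalman filter would use in its prediction step (numbers are arbitrary):

```python
import numpy as np

dt = 1.0                                     # one frame
F = np.array([[1, 0, dt, 0],                 # x  <- x + vx*dt
              [0, 1, 0, dt],                 # y  <- y + vy*dt
              [0, 0, 1,  0],                 # vx <- vx   (constant velocity)
              [0, 0, 0,  1]])                # vy <- vy

state = np.array([100.0, 50.0, 3.0, -1.0])   # (x, y, vx, vy) at frame k-1
predicted = F @ state                        # predicted (x, y, vx, vy) at frame k
print(predicted)                             # [103.  49.   3.  -1.]
```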

4.3 Bipartite matching

Figure 36: Bipartite matching

1. Define distances between boxes(e.g., 1 - IoU).


- We obtain N × N matrix.

2. Solve the assignment.


- Using Hungarian algorithm O(N 3 )

3. The bipartite matching solution corresponds to the minimum total cost.
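A sketch of these three steps with SciPy's Hungarian-style solver; the IoU helper and the example boxes are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

predictions = [(10, 10, 50, 80), (200, 40, 240, 110)]   # predicted boxes at frame k
detections  = [(205, 45, 244, 112), (12, 12, 52, 83)]   # detections at frame k

# Step 1: N x N cost matrix with cost = 1 - IoU.
cost = np.array([[1.0 - iou(p, d) for d in detections] for p in predictions])

# Step 2: solve the assignment (Hungarian algorithm).
rows, cols = linear_sum_assignment(cost)

# Step 3: the returned pairs minimize the total cost.
for r, c in zip(rows, cols):
    print(f"prediction {r} -> detection {c}, cost {cost[r, c]:.2f}")
```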

What happens if one detection is missing?

1. Add a pseudo detection with a fixed cost(e.g., 0).

2. Run the Hungarian method as before.

3. Discard the assignment to the pseudo node.

What happens if no prediction is suitable?

• e.g., the object leaves the frame.

• Solution: Pseudo node.

• Its value will define a threshold.

• We may need to balance it out.

Figure 37: Pseudo Node

4.4 Approach 2: Tracktor


1. Detect objects in frame k − 1(e.g., using an object detector)

2. Initialize the tracks at the same locations in frame k.

3. Refine predictions with regression.

4. Run object detection again to find new tracks.

Figure 38: Tracktor

Advantages:

1. We can reuse well-engineered object detectors:


- The same architecture of regression and classification heads.

2. Work well even if trained on still images:


- The regression head is agnostic to the object ID and category.

Disadvantages:

1. No motion model:
- problems due to large motions(camera, objects) or low frame rate.

2. Confusion in crowded spaces:


- Since there is no notion of ”identity” in the model.

3. Temporary occlusions(the track is ”killed”):


- Generally applies to all online trackers.
- Partial remedy: Long-term memory of tracks(using object embeddings).

Problem of using IoU as metric:


The implicit assumption of small motion.
We need a more robust metric.

4.5 Metric learning: For re-identification(Re-ID)


4.5.1 Metric spaces

Definition:
A set X(e.g., containing images) is said to be a metric space if with any two points p
and q of X there is associated a real number d(p, q), called the distance from p to q, such
that


d(p, q) > 0 if p ̸= q; d(p, p) = 0;


d(p, q) = d(q, p)


d(p, q) ≤ d(p, r) + d(r, q) for any r ∈ X

Any function with these properties is called a distance function, or a metric.


Let’s reformulate:

d(p, q) = dω (fθ (p), fθ (q)) (22)

• We can decouple representation from the distance function:


- We can use simple metrics(e.g., L1, L2, etc) or parametric(Mahalanobis distance).

• The problem reduces to learning a feature representation fθ (·)

4.5.2 How do we train a network to learn a feature representation

Figure 39: Feature Representation

• Choose a distance function, e.g., L2:


- d(A, B; θ) := ||fθ (A) − fθ (B)||2

• Minimize the distance between image pairs of the same person:


- θ∗ := arg minθ EA,B [d(A, B; θ)]

Metric learning method: add negative pairs.


- Minimize the distance between positive pairs; maximize it otherwise.
Our goal: d(A, B; θ) < d(A, C; θ).
The loss:

\theta^* := \arg\min_{\theta} \; \mathbb{E}_{A,B \in S^+}[d_\theta(A, B)] - \mathbb{E}_{B,C \in S^-}[d_\theta(B, C)]    (23)

S + and S − are sets of positive and negative image pairs.

Figure 40: Metric learning

1. Hinge loss:
   L(A, B) = y^* \, \|f(A) - f(B)\|^2 + (1 - y^*) \max(0, m^2 - \|f(A) - f(B)\|^2)
   - where y^* is 1 if (A, B) is a positive pair, and 0 otherwise.
   - The second term is a hinge loss for negative pairs with margin m.

2. Triplet loss:
   L(A, B, C) = \max(0, \|f(A) - f(B)\|^2 - \|f(A) - f(C)\|^2 + m)

Figure 41: Triplet loss
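A small PyTorch sketch of the triplet loss above, where f is any embedding network (here a placeholder MLP) and m is the margin; shapes and sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

f = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))  # placeholder embedding net

def triplet_loss(anchor, positive, negative, m=0.2):
    """L(A, B, C) = max(0, ||f(A)-f(B)||^2 - ||f(A)-f(C)||^2 + m), averaged over the batch."""
    d_pos = (f(anchor) - f(positive)).pow(2).sum(dim=1)   # squared distance to the same identity
    d_neg = (f(anchor) - f(negative)).pow(2).sum(dim=1)   # squared distance to another identity
    return F.relu(d_pos - d_neg + m).mean()

A, B, C = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
loss = triplet_loss(A, B, C)
loss.backward()   # gradients flow into the embedding network f
```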

4.5.3 Metric learning for tracking

1. Train an embedding network on triplets of data:


- positive pairs: same person at different timesteps;
- negative pairs: different people.

2. Use the network to compute the similarity score for matching.

4.5.4 Summary of metric learning

• Many problems can be reduced to metric learning.


- Including MOT(both online and offline).

• Annotation needed:
- e.g., same identity in different images.

• In practice: careful tuning of the positive pair term vs. the negative term.
- hard-negative mining and a bounded function for the negative pairs help.

• Extension to unsupervised setting - contrastive learning.

4.6 Online vs. Offline tracking


Online tracking:
- ”Given observations so far, estimate the current state.”

• Process two frames at a time.

• For real-time applications.

• Prone to drifting → hard to recover from errors or occlusions.

Offline tracking:
- ”Given all observations, estimate any state.”

• Process a batch of frames.

• Good to recover from occlusions(short ones as we will see)

• Not suitable for real-time application.

• Suitable for video analysis.

4.7 Graph based MOT


4.7.1 Tracking with network flow

Minimum-cost maximum-flow problem:


Determine the maximum flow with a minimum cost.

Figure 42: MOT

• Node = object detection

• Edge = temporal ID correspondence

• Goal: disjoint set of trajectories

Minimizing the cost:

f^* = \arg\min_{f} \sum_{i,j} C(i, j) \, f(i, j)    (24)

where f is the flow defining the disjoint set of trajectories, C(i, j) is the cost of edge (i, j), and
f(i, j) ∈ {0, 1} is the corresponding indicator variable.
To incorporate detection confidence, we split the node in two.

Figure 43: Network flow

where C_det indicates the detection confidence, and C_t indicates the transition cost.
Problem with occlusions, such as:

• occlusion in the last frame.

• the object appears only in the second frame.

Solution: Connect all nodes(detections) to entrance/exit nodes:

Figure 44: Solution to occlusion

And the graph subjects to flow conservation:

Figure 45: Flow conservation

MAP formulation
- Our solution is a set of trajectories T^*:

T^* = \arg\max_{T} P(T \mid X)    (25)

    = \arg\max_{T} P(X \mid T) \, P(T)    (26)    (Bayes rule)

    = \arg\max_{T} \prod_i P(x_i \mid T) \, P(T)    (27)    (Assumption 1: conditional independence of observations)

    = \arg\max_{T} \prod_i P(x_i \mid T) \prod_{T_i \in T} P(T_i)    (28)    (Assumption 2: independence of trajectories)

MAP formulation, in log-space for optimization:

\arg\max_{T} \prod_i P(x_i \mid T) \prod_{T_i \in T} P(T_i)    (29)

\arg\min_{T} \; -\sum_i \log P(x_i \mid T) - \sum_{T_i \in T} \log P(T_i)    (30)

Prior:

\sum_{T_i \in T} \log P(T_i)    (31)

Trajectory model: count the entrance, exit and transition costs.

T_i := (x_0, x_1, \ldots, x_n)    (32)

P(T_i) = P_{in}(x_0) \prod_{j=1}^{n} P_t(x_j \mid x_{j-1}) \, P_{out}(x_n)    (33)

-\log P(T_i) = -\log P_{in}(x_0) - \sum_{j=1}^{n} \log P_t(x_j \mid x_{j-1}) - \log P_{out}(x_n)    (34)

Entrance cost: -\log P_{in}(x_0) = f_{in}(x_0) \, C_{in}(x_0)

Transition cost: -\log P_t(x_j \mid x_{j-1}) = f_t(x_j, x_{j-1}) \, C_t(x_j, x_{j-1})

Exit cost: -\log P_{out}(x_n) = f_{out}(x_n) \, C_{out}(x_n)

Likelihood:

-\sum_i \log P(x_i \mid T)    (35)

We can use a Bernoulli distribution:

P(x_i \mid T) := \begin{cases} \gamma_i, & \text{if } \exists\, T_j \in T \text{ such that } x_i \in T_j \\ 1 - \gamma_i, & \text{otherwise} \end{cases}    (36)

\gamma_i denotes the prediction confidence (e.g., provided by the detector).

-\log P(x_i \mid T) = -f(x_i) \log \gamma_i - (1 - f(x_i)) \log(1 - \gamma_i)    (37)

                    = f(x_i) \log \frac{1 - \gamma_i}{\gamma_i} - \log(1 - \gamma_i)    (38)

Detection cost: C_{det}(x_i) = \log \frac{1 - \gamma_i}{\gamma_i}; the term \log(1 - \gamma_i) can be ignored in the optimization.

Optimization
(Cdet , Cin , Cout , Ct ) are estimated from data. Then:

• Construct the graph G(V, E, C, f ) from observation set X .

– V → Nodes (represent observations or potential assignments).


– E → Edges (represent possible transitions between observations).
– C → Costs assigned to edges.
– f → Flow through the graph, which represents assignments.

• Start with an empty flow.

• WHILE f(G) can be augmented:

  – Augment f(G) by one (search over the number of trajectories with binary/Fibonacci search, O(\log n)).

  – Find the min-cost flow for the current amount of flow (network simplex algorithm, O(n^2 m \log n)).

  – IF the current min cost < global optimal cost:

    ∗ Store the current min-cost assignment as the global optimum.

• Return the global optimal flow as the best association hypothesis.
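A toy sketch of this formulation with networkx, for a fixed amount of flow (two trajectories); in the full method the outer loop above searches over this number. The integer edge weights stand in for the learned costs, and the node split carries a negative detection cost C_det:

```python
import networkx as nx

G = nx.DiGraph()
G.add_node("S", demand=-2)   # source: two units of flow = two trajectories
G.add_node("T", demand=2)    # sink

detections = ["a1", "a2", "b1", "b2"]            # a* in frame 1, b* in frame 2
for v in detections:
    # Split each detection into v_in -> v_out; a negative weight rewards using it (C_det < 0).
    G.add_edge(v + "_in", v + "_out", weight=-5, capacity=1)
    G.add_edge("S", v + "_in", weight=10, capacity=1)    # entrance cost C_in
    G.add_edge(v + "_out", "T", weight=10, capacity=1)   # exit cost C_out

# Transition costs C_t between consecutive frames (lower = more likely the same identity).
for u, v, c in [("a1", "b1", 1), ("a1", "b2", 8), ("a2", "b1", 8), ("a2", "b2", 1)]:
    G.add_edge(u + "_out", v + "_in", weight=c, capacity=1)

flow = nx.min_cost_flow(G)                       # network-simplex style solver
active = [(u, v) for u in flow for v, f in flow[u].items() if f > 0]
print(active)   # flow-carrying edges encode the trajectories S->a1->b1->T and S->a2->b2->T
```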

Summary

• Min-cost max-flow formulation:


the maximum number of trajectories with minimum costs.

• Optimization maximizes MAP:


global solution and efficient(polynomial time).

Open questions:

• How to handle occlusions

• How to learn costs:


costs may be specific to the graph formulation and optimization.

4.7.2 Tracking with Message Passing Network

End-to-end learning?

• Can we learn features for multi-object tracking(encoding costs) to encode the solution
directly on the graph?

• Goal: Generalize the graph structure we have used and perform end-to-end learning.

Setup

• Input: task-encoding graph

– nodes: detections encoded as feature vectors


– edges: node interaction(e.g., inter-frame)

• Output: graph partitioning into(disjoint) trajectories


- e.g., encoded by edge label(0,1)

Deep learning on graphs


Key challenges:

• Graph can be of arbitrary size


number of nodes and edges

Figure 46: Message Passing Network

• Need invariance to node permutations.

Message passing

1. Initial graph

• Graph: G = (V, E)
• Initial embeddings:
  - Node embeddings: h_i^{(0)}, i ∈ V
  - Edge embeddings: h_{(i,j)}^{(0)}, (i, j) ∈ E
• Embeddings after l message-passing steps: h_i^{(l)}, i ∈ V, and h_{(i,j)}^{(l)}, (i, j) ∈ E

2. ”Node-to-edge” update

Figure 47: Node-to-edge update

3. "Edge-to-node" update

(a) Use the updated edge embeddings to update nodes.
(b) After a round of edge updates, each edge embedding contains information about
its pair of incident nodes.
(c) By analogy: h_i^{(l)} = N_v([h_i^{(l-1)}, h_{(i,j)}^{(l)}])
(d) In general, a node may have an arbitrary number of neighbours ("degree", or "valency").
(e) Define a permutation-invariant aggregation function:

\Phi^{(l)}(i) := \Phi(\{h_{(i,j)}^{(l)}\}_{j \in N_i})    (39)

where the input is the set of embeddings of the edges incident to node i.

(f) Re-define the edge-to-node updates for a general graph:

h_i^{(l)} = N_v(h_i^{(l-1)}, \Phi^{(l)}(i))    (40)

Remarks:

• Main goal: gather content information into node and edge embeddings.

• Is one iteration of node-to-edge/edge-to-node updates enough?

• One iteration increases the receptive field of a node/edge by 1


- In practice, iterate message passing multiple times(hyperparameter).

• All operations used are differentiable.

• All vertices/edges are treated equally, i.e., the parameters are shared.
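A minimal PyTorch sketch of one node-to-edge / edge-to-node round; the MLP sizes and the choice of a sum as the aggregation Φ are illustrative assumptions, and the same two MLPs are shared by all nodes and edges:

```python
import torch
import torch.nn as nn

D = 16
node_mlp = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU())   # N_v: update a node from aggregated edges
edge_mlp = nn.Sequential(nn.Linear(3 * D, D), nn.ReLU())   # N_e: update an edge from its two nodes

def message_passing_step(h_nodes, h_edges, edges):
    """One round: node-to-edge update, then permutation-invariant edge-to-node update."""
    # Node-to-edge: each edge sees its own embedding and its two incident nodes.
    new_edges = {
        (i, j): edge_mlp(torch.cat([h_edges[(i, j)], h_nodes[i], h_nodes[j]]))
        for (i, j) in edges
    }
    # Edge-to-node: aggregate the incident edges with a sum (Phi), then update the node.
    new_nodes = {}
    for v in h_nodes:
        incident = [new_edges[e] for e in edges if v in e]
        phi = torch.stack(incident).sum(dim=0) if incident else torch.zeros(D)
        new_nodes[v] = node_mlp(torch.cat([h_nodes[v], phi]))
    return new_nodes, new_edges

h_nodes = {v: torch.randn(D) for v in [0, 1, 2]}
edges = [(0, 1), (1, 2)]
h_edges = {e: torch.randn(D) for e in edges}
h_nodes, h_edges = message_passing_step(h_nodes, h_edges, edges)  # iterate several times in practice
```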

MOT with MPN

Figure 48: MOT with MPN

1. Input

2. Graph construction + feature encoding


- Encode appearance and scene geometry cues into node and edge embeddings.

3. Neural message passing


- Propagate cues across the entire graph with neural message passing.

4. Edge classification
- Learn to directly predict solutions of the Min-cost flow problem by classifying edge
embeddings.

5. Output

Feature encoding
Appearance and geometry encodings:

Figure 49: Geometry encodings

1. Relative Box Position

\left( \frac{2(y_j - y_i)}{h_i + h_j}, \; \frac{2(x_j - x_i)}{w_i + w_j} \right)    (41)

• This encodes the normalized relative position of two bounding boxes.
• The difference in the y-coordinate is divided by the sum of their heights.
• The difference in the x-coordinate is divided by the sum of their widths.
• The factor 2 ensures the values are in a normalized range.

2. Relative Box Size

\left( \log \frac{h_i}{h_j}, \; \log \frac{w_i}{w_j} \right)    (42)
• These terms encode the relative scale change between the two bounding boxes.
• Using a logarithm makes it more robust to size differences.

3. Time Difference
tj − ti (43)

• This represents the time gap between the two detections.


• It helps in determining whether two bounding boxes belong to the same object
across frames.

*Shared weights of CNN and MLP for all nodes and edges
Contrast:

• earlier: define pairwise and unary cost

• now:

– feature vectors associated to nodes and edges.


– use message passing to aggregate context information into the features.

Temporal causality
Flow conservation at a node

• At most 1 connection to the past.

• At most 1 connection to the future.

Time-aware message passing:

Figure 50: Time-aware message passing

Classifying edges

• After several iterations of message passing, each edge embedding contains content
information about detections.

• We feed the embeddings to an MLP that predicts whether an edge is active/inactive

Obtaining final solutions

• After classifying edges, we get a prediction between 0 and 1 for each edge in the graph.

• Directly thresholding the predictions does not guarantee that the flow conservation constraints are satisfied.

• In practice, around 98% of constraints are automatically satisfied.

• Lightweight post-processing(rounding or linear programming).

• The overall method is fast (about 12 fps) and achieves SOTA in the MOT challenge by a
significant margin.

Summary

• No strong assumptions on the graph structure


- handling occlusions.

• Costs can be learned from data.

• Accurate and fast(for an offline tracker).

• (Almost) End-to-end learning approach


- some post-processing is required.

4.7.3 Evaluation of MOT

Compute a set of measures per frame:

• Perform matching between predictions and ground truth.


- Hungarian algorithm

• FP = false positive

• FN = false negative

• IDSW = Identity switches


An identity switch (IDSW) happens in multi-object tracking (MOT) when the same
object is assigned different IDs across frames. This means that the tracking system
mistakenly reassigns a new ID to the same object, breaking continuity.

Figure 51: IDSW

1. An ID switch is counted because the ground truth track is assigned first to red,
then to blue.
2. Count both an ID switch(red and blue both assigned to the same ground truth),
but also fragmentation(Frag.) because the ground truth coverage was cut.
3. Identity is preserved. If two trajectories overlap with a ground truth trajec-
tory(within a threshold), the one that forces the least ID switches is chosen(the
red one).

Metrics:

• Multi-object tracking accuracy (MOTA); a small computation example follows this list:

\text{MOTA} = 1 - \frac{\sum_t (FN_t + FP_t + IDSW_t)}{\sum_t GT_t}    (44)

• IDF1 score:

\text{IDF1} = \frac{2 \sum_t TP_t}{\sum_t (2\, TP_t + FP_t + FN_t)}    (45)

• Multi-object tracking precision (MOTP):

\text{MOTP} = \frac{\sum_{t,i} TP_{t,i}}{\sum_t GT_t}    (46)
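A small sketch of the MOTA computation from per-frame counts; the numbers are made up:

```python
# Per-frame counts after matching predictions to ground truth with the Hungarian algorithm.
FN   = [2, 1, 0, 3]     # missed ground-truth objects per frame
FP   = [1, 0, 2, 1]     # spurious detections per frame
IDSW = [0, 1, 0, 0]     # identity switches per frame
GT   = [10, 10, 9, 11]  # ground-truth objects per frame

mota = 1.0 - (sum(FN) + sum(FP) + sum(IDSW)) / sum(GT)
print(f"MOTA = {mota:.3f}")   # 1 - 11/40 = 0.725
```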

5 Segmentation
Flavours of image segmentation

5.1 K-means(clustering)
5.2 Spectral clustering
5.3 Normalized cut(Ncut)
5.4 Energy-based model: Conditional random fields(CRFs)
5.5 Fully convolutional neural networks
• 1 × 1 convolution facts

• Feature resolution

• Dilated convolutions
- Atrous spatial pyramid pooling

• Upsampling(transposed convolution)

• Upsampling(interpolation)

5.6 U-Net
5.7 SegNet
5.8 Mask R-CNN
5.8.1 PointRend

5.9 Panoptic FPN


5.10 Panoptic FCN

References
[1] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat:
Integrated recognition, localization and detection using convolutional networks,” 2014.
[Online]. Available: https://arxiv.org/abs/1312.6229

[2] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple fea-
tures,” in Proceedings of the 2001 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition. CVPR 2001, vol. 1, 2001, pp. I–I.

[3] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in
2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’05), vol. 1, 2005, pp. 886–893 vol. 1.

[4] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, mul-


tiscale, deformable part model,” in 2008 IEEE Conference on Computer Vision and
Pattern Recognition, 2008, pp. 1–8.

[5] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, and A. W. M. Smeulders, “Seg-


mentation as selective search for object recognition,” in 2011 International Conference
on Computer Vision, 2011, pp. 1879–1886.

[6] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial In-
telligence and Lecture Notes in Bioinformatics), vol. 8693. Springer, 2014, pp. 391–405.
