CNN For Object Tracking

The document discusses the use of Convolutional Neural Networks (CNNs) for object tracking, focusing on various approaches including Siamese networks and their evolution. It outlines the challenges in object tracking, such as appearance variations and the need for effective classifiers, and presents methods like fully-convolutional Siamese networks and region proposal networks for improved tracking performance. The document concludes with advancements in Siamese networks that enhance localization and efficiency in tracking tasks.
CNNs for Object Tracking

Seunghoon Hong
Course logistics (1)
● Assignment 2 is out
○ Deadline: Midnight June 7th
Course logistics (2)
● The instructions for the paper presentation will be released today
● Please read the instructions VERY CAREFULLY before you start
● Important deadlines (applied strictly; no late submission)
○ May 29: Paper bidding
○ June 7: Prepare presentation video and quiz
○ June 12: Watch presentations and solve quizzes
Recap: approaches in single object tracking
● Probabilistic tracking
○ Formulate the localization task as a sequential probabilistic inference problem
○ Given a probability of the initial target location, propagate it over the remaining frames

● Discriminative tracking
○ Classify the object from the distractors at every frame
○ Can be considered as sequential binary object detection (class = target, background)
Recap: Probabilistic tracking
● Sequential Bayesian filtering via Markov Chain Monte Carlo sampling
Recap: Discriminative tracking
Recap: Correlation filtering for discriminative tracking
● Solving a ridge regression via circulant matrices and the discrete Fourier transform
○ The ridge regression has a closed-form solution
○ If X is a circulant matrix, computation is extremely efficient due to element-wise multiplication
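The equations on this slide did not survive extraction, so the following is only a minimal sketch of the idea, not the slide's own derivation. It assumes the standard ridge solution w = (XᵀX + λI)⁻¹Xᵀy and a circulant X whose row i is the base sample x cyclically shifted by i; under that assumption the same w can be obtained with element-wise operations on DFTs. The function names, the 1-D signal, and the shift convention are illustrative choices (where the complex conjugate appears in the Fourier-domain formula depends on that convention).

```python
import numpy as np

def ridge_direct(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def ridge_circulant(x, y, lam):
    """Same solution when row i of X is np.roll(x, i), using only element-wise ops on DFTs."""
    x_hat = np.fft.fft(x)
    y_hat = np.fft.fft(y)
    # X^T X has eigenvalues |x_hat|^2 and X^T y corresponds to x_hat * y_hat
    # under this particular shift convention.
    w_hat = x_hat * y_hat / (np.abs(x_hat) ** 2 + lam)
    return np.real(np.fft.ifft(w_hat))

rng = np.random.default_rng(0)
n = 16
x = rng.standard_normal(n)   # base sample (assumed 1-D for simplicity)
y = rng.standard_normal(n)   # regression targets, one per cyclic shift
lam = 0.1

# Build the circulant data matrix explicitly: row i is x shifted by i.
X = np.stack([np.roll(x, i) for i in range(n)])

w_direct = ridge_direct(X, y, lam)    # O(n^3) linear solve
w_fft = ridge_circulant(x, y, lam)    # O(n log n) via FFT
print(np.max(np.abs(w_direct - w_fft)))
```

The direct solve needs the full n×n matrix, while the Fourier version never forms X at all, which is where the efficiency comes from.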


Today’s agenda
● CNNs for (single) object tracking
● Approaches based on Siamese networks
○ Fully-Convolutional Siamese Networks for Object Tracking
○ High Performance Visual Tracking with Siamese Region Proposal Network
○ SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks
○ Fast Online Object Tracking and Segmentation: A Unifying Approach
Revisiting discriminative tracking
● Objective in inference
○ Given the current model of the target, identify the target at the current frame

Revisiting challenges in object tracking
● Modeling severe appearance variations of the target
○ Illumination change, occlusion, deformation, rotation, …
● Learning to recognize the target
○ The ground-truth for the target is only given at the initial frame → one-shot classification!
○ Different targets in every video!


Modeling target appearance
● Pre-trained CNN as a general feature extractor
○ Employing pre-trained features alone gives us a decent improvement!

Hong et al., Online Tracking By Learning Discriminative Saliency Map With Convolutional Neural Network
Modeling a discriminator for target
● How can we learn weights for target classification?
● Online learning
○ Train a classifier on-the-fly using the ground-truth and tracking results (self-supervised)
○ Problems
■ The model can easily overfit
■ Online update of the classifier is prone to drift (in case of temporary misclassification)
■ Using the pre-trained feature may not be appropriate for tracking
(e.g., inaccurate localization due to translation invariance, never trained for modeling temporal variations, etc.)
● Can we pre-train the classifier for tracking?

Hong et al., Online Tracking By Learning Discriminative Saliency Map With Convolutional Neural Network


Pre-training the classifier for tracking?
● Actually, offline training and online deployment is a standard concept in CNNs
○ E.g., image classification
● However, in tracking, the classifier (w) cannot be transferred across videos
○ The definitions of foreground/background differ in every video!
(i.e., tracking targets are different in all videos)
● Can we design a classifier transferable across different targets?


Exemplar-based classifier
● Reformulating the classifier parameterization
○ z: target (exemplar), x: candidates
● Now, we can pre-train the classifier parameters ψ across different videos
○ The classifier weights w are determined adaptively depending on the target z
○ If we train this model with various videos, it learns to encode an arbitrary target
such that the similarity with the ground-truth candidate (x*) is higher than with the rest (x)
○ It is transferable across different videos and targets!
● If ψ = φ, we call this a Siamese network
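The slide's formula was lost in extraction. One way to write the reformulation that is consistent with the bullets is given below as a sketch; the exact notation (inner product versus correlation, the score symbol f) is an assumption, not the slide's original equation.

```latex
% Conventional classifier: weights w are learned per video and cannot be transferred.
f(x; w) = w^{\top} \varphi(x)

% Exemplar-based classifier: weights are produced from the target exemplar z.
w = \psi(z)
\quad\Longrightarrow\quad
f(x, z) = \psi(z)^{\top} \varphi(x)

% Training goal across videos: the ground-truth candidate x^{*} scores highest.
f(x^{*}, z) > f(x, z) \quad \text{for all other candidates } x

% Siamese case: both branches share one embedding.
\psi = \varphi
\quad\Longrightarrow\quad
f(x, z) = \varphi(z)^{\top} \varphi(x)
```

Only ψ and φ carry learned parameters here, so nothing in the model is tied to a particular video.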
Fully-convolutional Siamese network
● Inputs: the target extracted from the initial frame, and the frame at #t
● The fully-convolutional Siamese network is used to extract features of both the target and the frame
● Considering the feature of the target as a filter, run convolution over the entire feature map of the frame
● The output is the score map of the target, densely computed at every location

Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking
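The filtering step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the feature maps are random stand-ins for the output of the shared embedding, the sizes (16 channels, 3×3 template, 10×10 search region) are made up, and the helper name `cross_correlation` is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def cross_correlation(z_feat, x_feat):
    """Slide the target feature over the frame feature.

    z_feat: (C, h, w) target feature, used as the filter.
    x_feat: (C, H, W) feature map of the search frame.
    Returns a (H - h + 1, W - w + 1) score map.
    """
    _, h, w = z_feat.shape
    # windows has shape (C, H - h + 1, W - w + 1, h, w)
    windows = sliding_window_view(x_feat, (h, w), axis=(1, 2))
    # Inner product between the filter and every window, summed over channels.
    return np.einsum("cijhw,chw->ij", windows, z_feat)

rng = np.random.default_rng(0)
z_feat = rng.standard_normal((16, 3, 3))          # stand-in for the target embedding
x_feat = 0.1 * rng.standard_normal((16, 10, 10))  # stand-in for the frame embedding
x_feat[:, 4:7, 2:5] = z_feat                      # plant the target at row 4, col 2

score = cross_correlation(z_feat, x_feat)
peak = np.unravel_index(np.argmax(score), score.shape)
print(score.shape, peak)
```

A single pass produces scores for every candidate location at once, which is why no per-candidate forward pass is needed.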


Training
● Learning with videos (finally!)
○ Training dataset: ImageNet Video dataset
○ For each video, sample two frames with a sufficient time interval (T)
○ Use one frame to extract the target (z), and the other as a candidate frame (x)
○ Considering the ground-truth location of the target as c, build a soft ground-truth
(Figure: frame #t and frame #(t+T))
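The label definition on this slide was not recoverable. The sketch below assumes the labeling described in the Siamese-FC paper as commonly summarized: score-map positions within a radius of the target center c are positive (+1), the rest negative (−1), and training minimizes the mean logistic loss over the map. The radius, the 17×17 map size, and the function names are illustrative assumptions.

```python
import numpy as np

def make_labels(shape, center, radius):
    """Label map: +1 within `radius` of `center` (in score-map cells), -1 elsewhere."""
    rows, cols = np.indices(shape)
    dist = np.hypot(rows - center[0], cols - center[1])
    return np.where(dist <= radius, 1.0, -1.0)

def logistic_loss(scores, labels):
    """Mean of log(1 + exp(-y * v)) over all score-map positions."""
    return float(np.mean(np.logaddexp(0.0, -labels * scores)))

labels = make_labels((17, 17), center=(8, 8), radius=2.0)

# A score map that agrees with the labels gives a low loss,
# one that disagrees everywhere gives a high loss.
loss_good = logistic_loss(5.0 * labels, labels)
loss_bad = logistic_loss(-5.0 * labels, labels)
print(loss_good, loss_bad)
```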
Inference
● Use the initial frame to extract the target (z), and keep it fixed for the remaining frames
○ Online update of the target φ(z) is straightforward, but it did not improve performance
● Handling scale variation
○ Construct an image pyramid of x at multiple scales 1.025^{−2,−1,0,1,2}
○ Search for the best scale, i.e. the one with the maximum score
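A minimal sketch of the scale search follows. It assumes one score map has already been computed per rescaled search image; the synthetic score maps and the helper name `pick_best_scale` are hypothetical.

```python
import numpy as np

# Five scale factors: 1.025 raised to the powers -2 .. 2.
scales = 1.025 ** np.array([-2, -1, 0, 1, 2])

def pick_best_scale(score_maps, scales):
    """Return the scale whose score map has the highest peak, and that peak value."""
    peaks = np.array([m.max() for m in score_maps])
    best = int(np.argmax(peaks))
    return float(scales[best]), float(peaks[best])

# Synthetic score maps, one per scale (in practice: one network pass per pyramid level).
score_maps = [np.zeros((17, 17)) for _ in scales]
score_maps[3][8, 8] = 3.0   # pretend the target matched best at scale 1.025

best_scale, best_score = pick_best_scale(score_maps, scales)
print(best_scale, best_score)
```

Each extra scale costs another pass over the search image, which is why this approach does not extend well to many scales and aspect ratios.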
Result
● State-of-the-art performance despite the simplicity
● Super-fast! (real-time speed)

Summary: fully-convolutional Siamese network
● Discriminative tracking via exemplar classifier
○ Use the target at initial frame as a convolution filter = adaptable classifier
○ The entire model is pre-trained end-to-end and transferable across videos
○ Can be deployed to videos with arbitrary targets at test time
● Fully-convolutional network enables the Siamese architecture
○ Both the target classifier and frame-level feature extractor share the same parameters
○ Produces a score map via filtering, which allows super-efficient examination of samples
● Fast, and reasonably accurate
○ Real-time performance (60~80 fps)
Later innovations on top of Siamese-FC
● Accurate localization through region-proposal network
● Efficient parameterization with deep network
● Mask prediction for more accurate localization
Efficient and accurate modeling of box configuration
● In Siamese-FC, only the scale variation is modeled via image pyramid
● If we want to model variations over more scales and aspect ratios,
exhaustive search based on an image pyramid is not efficient
Siamese network with region proposal
● Efficient search over scale + aspect ratio through a region-proposal network
○ k: # of proposals (anchors)
○ The target (template) generates k filters for different bounding boxes
○ The classification branch produces a binary score for each proposal
○ The regression branch generates (dx, dy, dw, dh) for each proposal

Li et al., High Performance Visual Tracking with Siamese Region Proposal Network
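How the regression outputs turn anchors into boxes can be sketched as follows. This assumes the usual region-proposal parameterization (center offsets scaled by the anchor size, log-space width and height); the anchor ratios, base size, scores, and function names are illustrative and are not taken from the slide.

```python
import numpy as np

def make_anchors(cx, cy, base_size, ratios):
    """k anchors (cx, cy, w, h) at one location, same area, different aspect ratios h/w."""
    ratios = np.asarray(ratios, dtype=float)
    w = base_size / np.sqrt(ratios)
    h = base_size * np.sqrt(ratios)
    return np.stack([np.full_like(w, cx), np.full_like(h, cy), w, h], axis=1)

def decode_boxes(anchors, deltas):
    """Apply (dx, dy, dw, dh) to anchors given as (cx, cy, w, h); both arrays are (k, 4)."""
    cx = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]
    cy = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(deltas[:, 2])
    h = anchors[:, 3] * np.exp(deltas[:, 3])
    return np.stack([cx, cy, w, h], axis=1)

anchors = make_anchors(cx=10.0, cy=10.0, base_size=4.0, ratios=[0.5, 1.0, 2.0])  # k = 3
deltas = np.zeros((3, 4))
deltas[1] = [0.5, 0.0, np.log(2.0), 0.0]   # shift the square anchor right, double its width
scores = np.array([0.1, 0.9, 0.3])         # stand-in target scores from the classification branch

boxes = decode_boxes(anchors, deltas)
best_box = boxes[int(np.argmax(scores))]   # keep the highest-scoring proposal
print(best_box)
```

One forward pass scores and refines all k anchors at every location, replacing the repeated passes of an image pyramid.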
Result

Li et al., High Performance Visual Tracking with Siamese Region Proposal Network
Improving Siamese-RPN
● Efficient parameterization via depth-wise convolution
● Exploiting very deep network with skip connections

Li et al., SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks
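The depth-wise variant can be sketched as a per-channel correlation: each channel of the template filters only the matching channel of the search feature, so the output keeps C channels instead of collapsing to one score. The shapes and the helper name `depthwise_xcorr` are assumptions for illustration, and the heads that SiamRPN++ applies afterwards are omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def depthwise_xcorr(z_feat, x_feat):
    """Channel-by-channel correlation.

    z_feat: (C, h, w) template feature, x_feat: (C, H, W) search feature.
    Returns (C, H - h + 1, W - w + 1): one response map per channel.
    """
    _, h, w = z_feat.shape
    windows = sliding_window_view(x_feat, (h, w), axis=(1, 2))  # (C, H-h+1, W-w+1, h, w)
    # No sum over c: channels stay separate.
    return np.einsum("cijhw,chw->cij", windows, z_feat)

rng = np.random.default_rng(0)
z_feat = rng.standard_normal((8, 3, 3))
x_feat = rng.standard_normal((8, 10, 10))

dw_response = depthwise_xcorr(z_feat, x_feat)

# Summing the per-channel maps recovers the single-channel Siamese-FC score map.
windows = sliding_window_view(x_feat, (3, 3), axis=(1, 2))
full_response = np.einsum("cijhw,chw->ij", windows, z_feat)
print(dw_response.shape, np.allclose(dw_response.sum(axis=0), full_response))
```

Because the template is used directly as C single-channel filters, no extra convolution is needed to expand it into k sets of filters, which keeps the parameter count low.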
Mask prediction for better localization
● Just an additional mask branch on top of Siamese-RPN!
● Every pixel predicts a binary mask

Wang et al., Fast Online Object Tracking and Segmentation: A Unifying Approach
Result
● Accuracy in terms of bounding box

Wang et al., Fast Online Object Tracking and Segmentation: A Unifying Approach
