Study On MOT
STUDY PROJECT
ABSTRACT
Multiple object tracking consists of detecting and identifying objects in video. In
some applications, such as robotics and surveillance, the tracking must be
performed in real-time, which requires the algorithm to run as fast as the
frame-rate of the video. Today's top-performing tracking methods run at only a few
frames per second and thus cannot be used in real-time. Further, when the speed of
a tracker is reported, the time it takes to detect objects is commonly not included.
One way of running such a method in real-time is to skip frames so that the
effective frame-rate of the video matches that of the tracking method, although I
expect this to degrade performance. A major challenge in the popular
tracking-by-detection framework is how to associate unreliable detection results
with existing tracks. I propose to handle unreliable detections by collecting
candidates from the outputs of both detection and tracking. In this project, I
studied a multiple object tracker, following the tracking-by-detection paradigm, as
an extension of an existing method, and integrated appearance information to improve
its performance. Due to this extension, objects can be tracked through longer
periods of occlusion, effectively reducing the number of identity switches. I use
the power of deep learning for data association in tracking by jointly modeling
object appearances and their affinities between different frames in an end-to-end
fashion. Experimental evaluation shows that the extensions reduce the number of
identity switches by 45%, achieving overall competitive performance at high frame
rates.
Content
1. Introduction
2. Related Work
3. Traditional Methods
3.1. Object Detection
3.1.1. Background Subtraction using GMM
3.2. Object Tracking
3.2.1. Centroid Tracking
3.2.2. Mean Shift
3.2.3. Camshift
4. Proposed Method
4.1. Framework Overview
4.2. Real Time Object Classification
4.3. Tracklet Confidence
4.4. SORT Deep Association Metric
4.4.1. Track Handling and State Estimation
4.4.2. Deep Appearance Descriptor
4.5. Deep Affinity Network
4.5.1. Data Preparation
4.5.2. Feature Extractor
4.5.3. Affinity Estimator
4.6. DAN Deployment
4.7. Hierarchical Data Association
5. Experiment
6. Conclusion
7. References
1. Introduction
When a video contains multiple moving objects that we wish to track, we refer to this
as multiple object tracking. Object detection is still an unsolved problem, and the
most powerful methods are limited by their speed. Adding tracking capabilities on top
of the detector usually slows down the algorithm further. Because of this, multiple
object tracking is difficult to do in real-time, since the best algorithms can only
analyse a few frames per second at best, even on powerful hardware. For such
algorithms to run in real-time, it would be necessary to skip multiple frames in order
to prevent an ever-increasing delay. Object tracking is an area within computer
vision which has many practical applications such as video surveillance,
human-computer interaction, and robot navigation. Surveillance is the monitoring of
the behavior, activities or other changing information usually of people and often in a
surreptitious manner. Video surveillance is commonly used for event detection and
human identification, but detecting such events and tracking the objects involved is
not as easy as it may seem.
It is a well-studied problem, and in many cases a complex problem to solve.
The problem of object tracking in video can be summarized as the task of
finding the position of an object in every frame. The ability to track an object in a
video depends on multiple factors, such as prior knowledge about the target object,
the type of parameters being tracked, and the type of video showing the object.
Video-based multiple vehicle tracking is essential for many vision-based intelligent
transportation systems applications. Although many tracking methods have been
studied, challenges remain. For example, vehicles may be missed when occlusion
occurs through vehicles overlapping or touching each other. Partial occlusion
commonly changes the apparent features of a vehicle, such as its size, texture, and
color, from the camera's viewpoint. A vehicle can be occluded by three kinds of
objects: other moving objects, background scene objects, and other vehicles. There
are several important steps towards effective object tracking, including the choice
of model to represent the object and an object tracking method suitable for the task.
Simple Online and Realtime Tracking (SORT) is a simple framework that performs
Kalman filtering in image space and frame-by-frame data association using the
Hungarian method, with an association metric that measures bounding box overlap.
SORT has a deficiency in tracking through occlusions as they typically appear in
frontal-view camera scenes. I overcome this issue by replacing the association
metric with a more informed metric that combines motion and appearance information.
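To illustrate the association step itself, below is a minimal sketch (not SORT's actual implementation) of IoU-based matching with the Hungarian method via SciPy; the box format and the threshold value are illustrative assumptions.

```python
# A minimal sketch of IoU-based track/detection association with the
# Hungarian method; box format (x1, y1, x2, y2) and threshold are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_threshold=0.3):
    """Match predicted track boxes to detections by maximizing total IoU."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)          # Hungarian method
    return [(r, c) for r, c in zip(rows, cols)
            if cost[r, c] <= 1.0 - iou_threshold]     # reject weak overlaps
```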
2. Related Work
This work builds on the on-line challenges for multiple object tracking, which
enable fair benchmarking of tracking approaches. The proposed framework itself is
compatible with other existing multiple object detectors as well.
3. Traditional Methods
There are three basic steps in video analysis: object detection, object tracking,
and recognition of object activities by analysing their tracks.
3.1. Object Detection
3.1.1. Background Subtraction using GMM
A Gaussian Mixture Model (GMM) is a parametric probability density function,
expressed as a weighted sum of Gaussian component densities. It is used as a
parametric model that estimates the probability distribution function of various
object features. If the background is static, a single dominant distribution exists
per pixel and the other distributions are not important at all. If pixel values
change continuously, however, a constant number of Gaussians is not always
sufficient to estimate the background model, and an appropriate number of Gaussians
must be determined. In computer vision, it is difficult to detect multiple moving
objects under severe occlusion and dynamic scene changes. To apply background
subtraction for identifying moving objects in video frames, background modeling is
always the first step, and it can be done with a Gaussian Mixture Model. Each
incoming frame of the video sequence is subtracted from a reference background model
frame, and the difference is compared with a threshold value to segment the image
into foreground and background. Finally, random pixels falsely detected as
foreground can be eliminated to improve the foreground mask.
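As a concrete illustration, the following minimal sketch uses OpenCV's MOG2 background subtractor, a GMM-based implementation; the video path and parameter values are placeholders, not values from this study.

```python
# A minimal sketch of GMM-based background subtraction with OpenCV's MOG2;
# "traffic.mp4" and the parameter values are placeholders.
import cv2

cap = cv2.VideoCapture("traffic.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500, varThreshold=16, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)             # per-pixel foreground mask
    # Remove isolated foreground pixels (noise) with a morphological opening.
    mask = cv2.morphologyEx(
        mask, cv2.MORPH_OPEN,
        cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3)))
    cv2.imshow("foreground", mask)
    if cv2.waitKey(30) & 0xFF == 27:           # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```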
3.2. Object Tracking
The goal of object tracking is to generate the trajectory of an object over time by
discovering its exact position in every frame of the video sequence. I have studied
several object tracking algorithms (Meanshift, Camshift). The tracking pipeline is
composed of three modules: an object selection module for the first frame of the
sequence, the Meanshift module, and the Camshift module. The selection module
selects the position of the object in the first frame: it extracts the
initialization parameters, namely the position, size, width, and length of the
object's search window in the first frame of the sequence.
3.2.1. Centroid Tracking
When we detect an object, we enclose it with a bounding box using the previous
methods. Once we have the bounding box coordinates, we compute the centroid, or more
simply, the center (x, y)-coordinates of the bounding box. For every subsequent
frame in the video stream we compute object centroids; we then need to determine
whether the new object centroids can be associated with the old ones. To accomplish
this, we compute the Euclidean distance between each pair of existing object
centroids (i.e. the object centroids at frame t-1) and input object centroids. If
the distance between two centroids is less than a threshold, we assume they belong
to the same object, as sketched below. This method fails, however, when two objects
overlap.
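The sketch below implements this nearest-centroid association; the distance threshold and the greedy matching order are illustrative assumptions, and appearing/disappearing objects are not handled.

```python
# A toy sketch of the centroid-association step described above.
import numpy as np
from scipy.spatial.distance import cdist

def match_centroids(prev_centroids, new_centroids, max_dist=50.0):
    """Pair each old centroid with its nearest unused new centroid."""
    D = cdist(np.asarray(prev_centroids), np.asarray(new_centroids))
    matches, used = {}, set()
    for row in D.min(axis=1).argsort():        # most confident rows first
        for col in D[row].argsort():           # nearest candidates first
            if col not in used and D[row, col] <= max_dist:
                matches[row] = col             # same object across frames
                used.add(col)
                break
    return matches

prev = [(100.0, 120.0), (300.0, 220.0)]
new = [(305.0, 218.0), (104.0, 125.0)]
print(match_centroids(prev, new))              # {0: 1, 1: 0}
```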
3.2.2. Mean Shift
The intuition behind Meanshift is simple: consider a set of points and a small
search window (for example, a circle); the task is to move that window to the area
of maximum pixel density (or maximum number of points). Applied iteratively to the
probability image of each frame, this shifts the window towards the mode of the
distribution, which corresponds to the tracked object.
3.2.3. Camshift
The CamShift (Continuously Adaptive Meanshift) algorithm is based on the principles
of Meanshift. In contrast to Meanshift, which is conceived for static distributions,
Camshift is conceived for dynamically evolving distributions: it handles a dynamic
distribution by adjusting the size of the search window for the next frame based on
the zeroth moment of the current distribution of the image. It adjusts the size of
the search window using invariant moments, which allows the algorithm to anticipate
the movement of objects and quickly track the object in the next frame. Even during
fast movements of an object, Camshift is still capable of tracking well. This
matters when objects in a video sequence move such that the size and location of
their probability distribution change over time. The initial search window is
determined by a detection algorithm or by software dedicated to video processing.
CamShift calls upon MeanShift to calculate the target centre in the probability
distribution image, but also computes the orientation of the principal axis and the
dimensions of the probability distribution. These parameters are obtained from the
zeroth and first moments, defined by the following equations:
M₀₀ = ∑ₓ ∑ᵧ I(x, y)
M₀₁ = ∑ₓ ∑ᵧ x·I(x, y)
M₁₀ = ∑ₓ ∑ᵧ y·I(x, y)
xc = M₀₁ / M₀₀
yc = M₁₀ / M₀₀
w = r₁·√M₀₀
l = r₂·√M₀₀
where M₀₀ is the zeroth moment, M₀₁ the first moment for x, M₁₀ the first moment
for y, (xc, yc) the centre of the search window, and w and l its width and length
(r₁ and r₂ are scaling constants).
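The following minimal sketch shows CamShift tracking with OpenCV, following the standard hue-histogram back-projection recipe; the video path and the initial window coordinates are placeholders, not values from this study.

```python
# A minimal sketch of CamShift tracking with OpenCV; the input video and
# initial window are placeholders.
import cv2
import numpy as np

cap = cv2.VideoCapture("input.mp4")
ok, frame = cap.read()
x, y, w, h = 300, 200, 100, 50                 # initial search window (placeholder)
track_window = (x, y, w, h)

# Model the target with a hue histogram of the initial region.
roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
roi_hist = cv2.calcHist([roi], [0], None, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    prob = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    # CamShift runs MeanShift internally, then re-estimates the window size
    # and orientation from the moments of the probability image.
    rot_rect, track_window = cv2.CamShift(prob, track_window, criteria)
    pts = cv2.boxPoints(rot_rect).astype(np.int32)
    cv2.polylines(frame, [pts], True, (0, 255, 0), 2)
    cv2.imshow("camshift", frame)
    if cv2.waitKey(30) & 0xFF == 27:
        break
```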
4. Proposed Method
4.2. Real Time Object Classification
Combining the outputs of both detection and tracks results in an excessive number
of candidates, which must be scored and filtered. The classifier shares most
computations over the entire image by using a region-based fully convolutional
neural network (R-FCN). It is therefore much more efficient than classifying image
patches cropped from heavily overlapping candidate regions.
Given an image frame, score maps of the entire image are predicted using a fully
convolutional neural network with an encoder-decoder architecture. The encoder is a
light-weight convolutional backbone for real-time performance, and the decoder uses
up-sampling to increase the spatial resolution of the output score maps for later
classification.
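The sketch below illustrates the general shape of such an encoder-decoder score-map network in PyTorch; the layer sizes, class count, and module names are illustrative assumptions, not the architecture used in the referenced work.

```python
# A schematic sketch of a light-weight encoder-decoder that predicts dense
# class score maps over a whole frame; all sizes are assumptions.
import torch
import torch.nn as nn

class ScoreMapNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # Encoder: strided convolutions shrink spatial resolution.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: up-sampling restores resolution for dense scores.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, frame):                      # frame: (B, 3, H, W)
        return self.decoder(self.encoder(frame))   # scores: (B, C, H, W)

scores = ScoreMapNet()(torch.randn(1, 3, 224, 224))
print(scores.shape)                                # torch.Size([1, 2, 224, 224])
```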
4.3. Tracklet Confidence
Given a new frame, the new location of each existing track is estimated using a
Kalman filter. These predictions are adopted to handle detection failures caused by
varying visual properties of objects and by occlusion in crowded scenes, but they
are not suitable for long-term tracking: the accuracy of the Kalman filter decreases
if it is not updated by detections over a long time. Tracklet confidence is designed
to measure the accuracy of the filter using temporal information. A tracklet is
generated through temporal association of candidates from consecutive frames. A
track can be split into a set of tracklets, since a track can be interrupted and
retrieved several times during its lifetime. Every time a track is retrieved from
the lost state, the Kalman filter is reinitialized, so only the information of the
last tracklet is used to formulate the confidence of the track.
4.4. SORT Deep Association Metric
4.4.1. Track Handling and State Estimation
I assume a very general tracking scenario where the camera is uncalibrated and no
ego-motion information is available. While these circumstances pose a challenge to
the filtering framework, this is the most common setup considered in recent multiple
object tracking benchmarks. The tracking scenario is therefore defined on the
eight-dimensional state space (u, v, γ, h, u’, v’, γ’, h’) that contains the
bounding box center position (u, v), aspect ratio γ, height h, and their respective
velocities in image coordinates.
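A minimal constant-velocity Kalman filter over this eight-dimensional state can be sketched as follows; the noise magnitudes and the initial values are illustrative assumptions.

```python
# A minimal constant-velocity Kalman sketch over the 8-D state
# x = (u, v, gamma, h, plus their velocities); noise values are assumed.
import numpy as np

dt = 1.0                                   # one frame per step
F = np.eye(8)                              # state transition: position += dt * velocity
F[:4, 4:] = dt * np.eye(4)
H = np.eye(4, 8)                           # we only measure (u, v, gamma, h)
Q = 1e-2 * np.eye(8)                       # process noise (assumed)
R = 1e-1 * np.eye(4)                       # measurement noise (assumed)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(8) - K @ H) @ P
    return x, P

x = np.array([320., 240., 0.5, 100., 0., 0., 0., 0.])   # initial box state
P = np.eye(8)
x, P = predict(x, P)                                     # every frame
x, P = update(x, P, z=np.array([322., 243., 0.5, 101.]))  # on matched detection
```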
For each track k, the number of frames since the last successful measurement
association is counted. This counter is incremented during Kalman filter prediction
and reset to 0 when the track has been associated with a measurement. Tracks that
exceed a predefined maximum age Amax are considered to have left the scene and are
deleted from the track set. New track hypotheses are initiated for each detection
that cannot be associated with an existing track. These new tracks are classified as
tentative during their first three frames, during which a successful measurement
association is expected at each time step. Tracks that are not successfully
associated with a measurement within their first three frames are deleted, as
sketched below.
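The following compact sketch captures this life-cycle bookkeeping; the class and field names, and the default values of n_init and max_age, are assumptions rather than the reference implementation.

```python
# A sketch of track life-cycle management; names and defaults are assumed.
TENTATIVE, CONFIRMED, DELETED = range(3)

class Track:
    def __init__(self, track_id, n_init=3, max_age=30):
        self.track_id = track_id
        self.state = TENTATIVE
        self.hits = 0                  # successful associations so far
        self.time_since_update = 0     # frames since last matched detection
        self.n_init, self.max_age = n_init, max_age

    def predict(self):
        self.time_since_update += 1    # incremented on every KF prediction

    def mark_matched(self):
        self.time_since_update = 0     # reset on successful association
        self.hits += 1
        if self.state == TENTATIVE and self.hits >= self.n_init:
            self.state = CONFIRMED     # survived its first three frames

    def mark_missed(self):
        if self.state == TENTATIVE:    # tentative tracks die on first miss
            self.state = DELETED
        elif self.time_since_update > self.max_age:
            self.state = DELETED       # assumed to have left the scene
```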
4.4.2. Deep Appearance Descriptor
Using appearance in the association metric requires a well-discriminating feature
embedding to be trained offline, before the actual online tracking application. To
this end, a CNN is employed that has been trained on a large-scale person
re-identification dataset containing over 1,100,000 images of 1,261 pedestrians,
making it well suited for deep metric learning in a people tracking context. The
network is a wide residual network with two convolutional layers followed by six
residual blocks. A global feature map of dimensionality 128 is computed in dense
layer 10. A final batch and ℓ2 normalization projects the features onto the unit
hypersphere, making them compatible with the cosine appearance metric. In total, the
network has 2,800,864 parameters, and one forward pass of 32 bounding boxes takes
approximately 30 ms on an Nvidia GeForce GTX 1050 GPU.
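The cosine appearance metric on these unit-norm embeddings can be sketched as follows; only the 128-dimensional embedding size comes from the text, the rest is illustrative.

```python
# A sketch of the cosine appearance distance between L2-normalized embeddings.
import numpy as np

def cosine_distance_matrix(track_feats, det_feats):
    """track_feats: (N, 128), det_feats: (M, 128). After normalization the
    rows lie on the unit hypersphere, so a dot product is the cosine
    similarity; we return 1 - similarity as a distance."""
    a = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    b = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    return 1.0 - a @ b.T               # smaller means more similar

tracks = np.random.randn(5, 128)
dets = np.random.randn(7, 128)
print(cosine_distance_matrix(tracks, dets).shape)   # (5, 7)
```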
4.5. Deep Affinity Network
4.5.1. Data Preparation
Multiple object tracking datasets often fail to fully capture camera photometric
distortions, background scene variations, and other practical factors to which
tracking approaches should remain robust. For model-based approaches, it is
important that the training data contains sufficient variation of such
tracking-irrelevant factors to induce robustness in the learned models. Hence, I
perform preprocessing steps such as photometric distortion, frame expansion, and
cropping over the available data.
These preprocessing steps are applied to the frame pairs sequentially, each with a
probability of 0.3. The frames are then resized to fixed dimensions H × W × 3 and
horizontally flipped with a probability of 0.5. The resulting processed frames are
used as inputs to the DAN, along with the associated object centers computed by a
detector. The purpose of this data preparation is to increase the robustness of the
model by adding some noise.
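A sketch of this preprocessing pipeline is given below, using the 0.3 and 0.5 probabilities from the text; the distortion parameters and the omission of expansion/cropping are simplifying assumptions.

```python
# A sketch of the frame-pair preprocessing described above; distortion
# details are assumptions, expansion and cropping are omitted for brevity.
import random
import cv2
import numpy as np

def photometric_distort(img):
    """Randomly jitter brightness and contrast (illustrative ranges)."""
    alpha = random.uniform(0.8, 1.2)           # contrast
    beta = random.uniform(-20, 20)             # brightness
    out = alpha * img.astype(np.float32) + beta
    return np.clip(out, 0, 255).astype(np.uint8)

def prepare_pair(frame_a, frame_b, size=(900, 900)):
    for step in (photometric_distort,):        # expansion/cropping omitted
        if random.random() < 0.3:              # each step fires with p = 0.3
            frame_a, frame_b = step(frame_a), step(frame_b)
    frame_a = cv2.resize(frame_a, size)        # fixed H x W x 3 input
    frame_b = cv2.resize(frame_b, size)
    if random.random() < 0.5:                  # horizontal flip with p = 0.5
        frame_a, frame_b = frame_a[:, ::-1], frame_b[:, ::-1]
    return frame_a, frame_b
```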
4.5.2. Feature Extractor
The feature extractor is the first major component of the DAN. This sub-network
models comprehensive yet compact features of the detected objects in video frames.
Feature extraction is performed by passing pairs of video frames and object centers
through two streams of convolution layers. These streams share model parameters,
and their architecture is inspired by the VGG16 network, used after converting its
fully-connected and softmax layers to convolution layers. This modification is made
because the spatial features of objects, which are of most interest in this task,
are better encoded by convolution layers. Compared to the original VGG16, the input
frame size of the network is much larger (i.e. 3×900×900) due to the nature of the
task at hand and the available tracking datasets. Consequently, 56 × 56 feature maps
can still be computed after the last layer of the modified VGG network.
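To illustrate how per-object features can be read off such feature maps, the sketch below samples one feature column per detected object center; the 56 × 56 map size and the 900-pixel frame follow the text, while the channel count and the single-level sampling scheme are assumptions (the DAN extractor combines maps from multiple levels of abstraction).

```python
# A sketch of sampling per-object feature columns from a backbone feature map
# at detected object centers; channel count and sampling are assumptions.
import torch

def object_features(feature_map, centers, frame_size=900):
    """feature_map: (C, 56, 56); centers: (N, 2) pixel coords in the frame.
    Returns an (N, C) matrix with one feature column per detected object."""
    C, H, W = feature_map.shape
    cols = (centers[:, 0] * W / frame_size).long().clamp(0, W - 1)
    rows = (centers[:, 1] * H / frame_size).long().clamp(0, H - 1)
    return feature_map[:, rows, cols].T        # gather, then (N, C)

fmap = torch.randn(520, 56, 56)                # hypothetical channel count
centers = torch.tensor([[450.0, 300.0], [120.0, 700.0]])
print(object_features(fmap, centers).shape)    # torch.Size([2, 520])
```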
4.5.3. Affinity Estimator
The objective of this component of the DAN is to encode affinities between the
objects using their extracted features. To that end, the network arranges the
columns of Ft and Ft−n such that the columns of the two feature matrices are
concatenated along the depth dimension. The architecture of the compression network
is inspired by the physical significance of its input and output signals: the
network maps a tensor encoding combinations of object features to a matrix that
encodes similarities between the features (and hence the objects). It therefore
performs a gradual dimension reduction along the depth of the input tensor, using
convolutional kernels that do not allow neighboring elements of the feature maps to
influence each other.
4.6. DAN Deployment
In deployment, the trained DAN computes the affinity matrix by a simple forward pass
through the network and a concatenation operation, as described above. Each frame is
thus passed through the object detector and feature extractor only once, but its
features are used multiple times for computing affinities with multiple other frames
in pairs.
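The deployment step can be sketched as follows: pairwise feature combinations yield an affinity matrix, from which a hard assignment is obtained. Here a simple dot product stands in for DAN's learned compression network, and the Hungarian method stands in for the full association logic; both substitutions are simplifying assumptions.

```python
# A sketch of affinity estimation and assignment between two frames'
# per-object features; the dot-product scorer is a stand-in, not DAN.
import numpy as np
from scipy.optimize import linear_sum_assignment

def affinity_matrix(feats_t, feats_tn):
    """feats_t: (N, C) for frame t, feats_tn: (M, C) for frame t-n.
    DAN scores every feature pair; a dot product approximates that here."""
    return feats_t @ feats_tn.T                # (N, M) similarity scores

def associate(feats_t, feats_tn):
    A = affinity_matrix(feats_t, feats_tn)
    rows, cols = linear_sum_assignment(-A)     # maximize total affinity
    return list(zip(rows, cols))

f1, f2 = np.random.randn(4, 128), np.random.randn(5, 128)
print(associate(f1, f2))                       # one match per row object
```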
5. Experiment
I assess the performance of the tracker on the MOT16 benchmark. This benchmark
evaluates tracking performance on seven challenging test sequences, including
frontal-view scenes with a moving camera as well as top-down surveillance setups. As
input to the tracker I rely on the detections provided with the benchmark, produced
by a Faster R-CNN trained on a collection of public and private datasets to provide
excellent performance. For a fair comparison, I have re-run SORT on the same
detections. Detections have been thresholded at a confidence score of 0.3. The
remaining parameters of the method have been found on separate training sequences
provided by the benchmark. Evaluation is carried out according to the benchmark's
standard metrics, including multi-object tracking accuracy (MOTA) and the number of
identity switches.
6. Conclusion
I have studied the traditional methods of multiple object tracking in real time and
then explored deep learning and computer vision techniques to track objects more
robustly. I have presented an extension to SORT that incorporates appearance
information through a pre-trained association metric; due to this extension, objects
can be tracked through longer periods of occlusion. I have also studied an online
multiple people tracking framework that takes full advantage of recent deep neural
networks. It tackles unreliable detection by selecting candidates from the outputs
of both detection and tracks. The scoring function for candidate selection is
formulated by an efficient R-FCN, which shares computations over the entire image.
Moreover, the identification ability when coping with intra-category occlusion is
improved by introducing ReID features for data association. The tracker derives its
strength from the proposed convolutional neural network architecture referred to as
the Deep Affinity Network (DAN). The DAN models features of pre-detected objects in
the video frames at multiple levels of abstraction, and infers object affinities
across different frames by analyzing exhaustive permutations of the extracted
features. The cross-frame object similarities and object features are recorded by
this approach to trace the trajectories of the objects.
7. References
1. Samuel Murray, "Real-Time Multiple Object Tracking: A Study on the Importance of Speed".
2. B. Tharanidevi, R. Vadivu, K. B. Sethupathy, "Moving Object Tracking Distance and Velocity Determination based on Background Subtraction Algorithm", IOSR Journal of Electronics and Communication Engineering (IOSR-JECE).
3. Imran Khan Pathan, Chetan Chauhan, "A Survey on Moving Object Detection and Tracking Methods", International Journal of Computer Science and Information Technologies.
4. Deep Learning in Computer Vision - Coursera.
5. Afef Salhi and Ameni Yengui Jammoussi, "Object tracking system using Camshift, Meanshift and Kalman filter".
6. Rohini Chavan, "Multiple Object Detection using GMM Technique and Tracking using Kalman Filter", International Journal of Computer Applications (0975-8887).
7. B. Y. Lee, L. H. Liew, W. S. Cheah and Y. C. Wang, "Occlusion handling in videos object tracking: A survey", published under licence by IOP Publishing Ltd.
8. Multi-object tracking with dlib, https://www.pyimagesearch.com/2018/10/29/multi-object-tracking-with-dlib/
9. L. Zhang, Y. Li, and R. Nevatia, "Global data association for multi-object tracking using network flows", in CVPR, 2008, pp. 1-8.
10. H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, "Globally-optimal greedy algorithms for tracking a variable number of objects", in CVPR, 2011, pp. 1201-1208.
11. J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, "Multiple object tracking using k-shortest paths optimization", IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 9, pp. 1806-1819, 2011.
12. Nicolai Wojke, Alex Bewley, Dietrich Paulus, "Simple Online and Realtime Tracking with a Deep Association Metric".
13. Long Chen, Haizhou Ai, Zijie Zhuang, Chong Shang, "Real-Time Multiple People Tracking with Deeply Learned Candidate Selection and Person Re-Identification".
14. ShiJie Sun, Naveed Akhtar, HuanSheng Song, Ajmal Mian, Mubarak Shah, "Deep Affinity Network for Multiple Object Tracking".
15. Seung-Hwan Bae and Kuk-Jin Yoon, "Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking", IEEE Transactions on PAMI, 2017.
16. Loïc Fagot-Bouquet, Romaric Audigier, Yoann Dhome, and Frédéric Lerasle, "Improving multi-frame data association with sparse representations for robust near-online multi-object tracking", in ECCV, 2016.
17. Amir Sadeghian, Alexandre Alahi, and Silvio Savarese, "Tracking the untrackable: Learning to track multiple cues with long-term dependencies", in ICCV, 2017.
18. A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey", ACM Computing Surveys, vol. 38, no. 4, 2006.