
Study Project Report

On

Multiple Object Tracking in Real Time

Prepared for

Mr. Ashish Chittora

Dept. of Electrical Engineering, BITS Pilani KK Birla Goa Campus

Prepared by

Kishan Kumar Gupta 2016A8PS0406G

In partial fulfillment of the requirement of the course

INSTR F266 STUDY PROJECT
ACKNOWLEDGEMENT

I would like to express my deepest gratitude to Ashish Chittora Sir, who
gave me the opportunity to work under his guidance and continuously
conveyed a spirit of enthusiasm in his teachings. He was always willing to
meet me at any time and gave me constructive feedback on my study. Sir's
involvement kept me motivated to achieve my objective. It has been a great
privilege to work with him.

ABSTRACT

Multiple object tracking consists of detecting and identifying objects in video. In
some applications, such as robotics and surveillance, the tracking must be performed
in real-time. This is challenging because it requires the algorithm to run as fast as
the frame-rate of the video, whereas today's top performing tracking methods run at
only a few frames per second and therefore cannot be used in real-time. Further,
when reporting the speed of a tracker, it is common not to include the time it takes
to detect objects. One way of running such a method in real-time is to not look at
every frame, but to skip frames so that the effective frame-rate of the video matches
the speed of the tracking method; however, this can be expected to decrease
performance. A major challenge in the popular tracking-by-detection framework is
how to associate unreliable detection results with existing tracks. I propose to handle
unreliable detections by collecting candidates from the outputs of both the detector
and the tracker. In this project, I studied a multiple object tracker following the
tracking-by-detection paradigm, built as an extension of an existing method, and
integrated appearance information to improve its performance. Due to this extension,
objects can be tracked through longer periods of occlusion, effectively reducing the
number of identity switches. The method uses deep learning for data association in
tracking by jointly modeling object appearances and their affinities between different
frames in an end-to-end fashion. Experimental evaluation shows that the extensions
reduce the number of identity switches by 45%, achieving overall competitive
performance at high frame rates.

Contents

1. Introduction
2. Related Work
3. Traditional Methods
   3.1. Object Detection
      3.1.1. Background Subtraction using GMM
   3.2. Object Tracking
      3.2.1. Centroid Tracking
      3.2.2. MeanShift
      3.2.3. CamShift
4. Proposed Method
   4.1. Framework Overview
   4.2. Real-Time Object Classification
   4.3. Tracklet Confidence
   4.4. SORT with Deep Association Metric
      4.4.1. Track Handling and State Estimation
      4.4.2. Deep Appearance Descriptor
   4.5. Deep Affinity Network
      4.5.1. Data Preparation
      4.5.2. Feature Extractor
      4.5.3. Affinity Estimator
   4.6. DAN Deployment
   4.7. Hierarchical Data Association
5. Experiment
6. Conclusion
7. References

1. Introduction
When a video contains multiple moving objects that we wish to track, we refer to this
as multiple object tracking. Object detection is still an unsolved problem, and the
most powerful methods are limited by their speed. Adding tracking capabilities on top
of the detector usually slows down the algorithm further. Because of this, multiple
object tracking is difficult to do in real-time, since even the best algorithms can only
analyse a few frames per second, even on powerful hardware. For such algorithms to
run in real-time, it would be necessary to skip multiple frames in order to prevent an
ever-increasing delay. Object tracking is an area within computer vision with many
practical applications, such as video surveillance, human-computer interaction, and
robot navigation. Surveillance is the monitoring of behavior, activities or other
changing information, usually of people and often in a surreptitious manner. Video
surveillance is commonly used for event detection and human identification, but
detecting events and tracking objects is not as easy as it may seem.

Object tracking is a well-studied and, in many cases, complex problem. It can be
summarized as the task of finding the position of an object in every frame of a video.
The ability to track an object depends on multiple factors, such as knowledge about
the target object, the type of parameters being tracked, and the type of video showing
the object. Video-based multiple vehicle tracking is essential for many vision-based
intelligent transportation systems applications. Although many tracking methods have
been studied, challenges remain. For example, vehicles will be missed when occlusion
occurs through vehicles overlapping or connecting. Partial occlusion commonly causes
the features of a vehicle, such as its size, texture, and color, to change from the
viewpoint of the camera. In vehicle occlusion, three kinds of objects can block a
vehicle: other moving objects, background scene objects, and other vehicles. There are
several important steps towards effective object tracking, including the choice of
model to represent the object and an object tracking method suitable for the task.
Simple Online and Realtime Tracking (SORT) is a simple framework that performs
Kalman filtering in image space and frame-by-frame data association using the
Hungarian method, with an association metric that measures bounding box overlap.
SORT has a deficiency in tracking through occlusions as they typically appear in
frontal-view camera scenes. We overcome this issue by replacing the association
metric with a more informed metric that combines motion and appearance
information. Through integration of this network we increase robustness against
misses and occlusions.

In order to handle unreliable detections in an online mode, the tracking framework
optimally selects candidates from the outputs of both the detector and the tracks in
each frame. On one hand, reliable predictions from the tracker can be used for
short-term association in case of missing detections or inaccurate bounding boxes.
On the other hand, confident detection results are essential to prevent tracks from
drifting to the background in the long term.

This project aims to examine methods for multiple object tracking, their benefits and
disadvantages, and which methods are more suitable for certain applications and
environments.

2. Related Work

Tracking is a mainstream problem in computer vision, hence numerous contributions
to tracking can be found in the literature. Tracking-by-detection has become the most
popular strategy for multi-object tracking. Many recent techniques follow this generic
framework, which first detects target objects in video frames and then associates them
across different frames. Due to parallel developments in object detectors, approaches
following this line of work focus more on the data association aspect of tracking.
These techniques can be broadly categorized as local and global tracking methods.
Local methods generally consider only two frames for data association. This makes
them computationally efficient; however, their performance decreases when
tracking-irrelevant factors such as camera motion and pose variation are present.
Combining results from multiple detectors can improve tracking performance but is
not efficient for real-time applications. In contrast, the tracking framework studied
here needs only one detector and generates candidates from existing tracks. Some
earlier methods shared feature maps for classification but still had a high
computational complexity. Person re-identification has also been explored for global
optimization.

The proposed framework leverages deeply learned ReID features in an online mode
to improve the identification ability when coping with the problem of intra-category
occlusion. The detection stage in our approach expects video frames as inputs and
provides a set of bounding boxes for the target objects in those frames. We compute
the object center locations Ct using the available bounding boxes. The performance
of our approach is evaluated on online challenges in multiple object tracking that
provide their own detectors for the objects of interest. For the MOT17 challenge, we
use the provided Faster R-CNN and SDP detectors, and for UA-DETRAC, we use the
EB detector. These choices are based entirely on the online challenges for multiple
object tracking that enable fair benchmarking of our approach. However, the proposed
framework is compatible with other existing multiple object detectors as well.

3. Traditional Methods

There are three basic steps in video analysis: object detection, object tracking, and
recognition of object activities by analysing their tracks.

3.1 Object Detection


Object detection and tracking play an important role in many pattern recognition and
computer vision applications, such as autonomous robot navigation, surveillance, and
vehicle navigation. An object detection mechanism is applied either when an object
first appears in the video or in every frame. In order to reduce the number of false
detections and increase the accuracy rate, some object detection methods use temporal
information computed from analyzing a sequence of frames. Before the development
of deep learning, object detection was performed using techniques such as temporal
differencing, frame differencing, optical flow, and background subtraction.

3.1.1. Background Subtraction using GMM

Background subtraction is a popular technique to segment out objects of interest in a
frame. It involves subtracting an image that contains the object from a previous
background image that has no foreground objects of interest. Areas of the image plane
where there is a significant difference between these images indicate the pixel
locations of moving objects. These objects, represented by groups of pixels, are then
separated from the background image using a thresholding technique.

The Gaussian Mixture Model (GMM) is one of the most popular techniques for
constructing the background model used to segment moving objects from the
background. The GMM technique assigns a number of Gaussian distributions to each
pixel to estimate the reference frame. If there are no variations in the pixel values,
all Gaussian distributions approximate the same values; in that case only one
distribution effectively exists and the others are unimportant. On the other hand, if
pixel values change continuously, a constant number of Gaussians is not always
sufficient to estimate the background model, and it becomes necessary to determine
an appropriate number of Gaussians. A GMM is a parametric probability density
function, expressed as a weighted sum of Gaussian component densities, which
estimates the probability distribution of various object features. In computer vision,
it is difficult to detect multiple moving objects when severe occlusion and dynamic
scene changes occur. In order to apply background subtraction to identify moving
objects in video frames, background modeling is always the first step, and it can be
done with the Gaussian Mixture Model. To get the desired result, every incoming
frame of the video sequence is subtracted from a reference background frame, and
the difference is compared with a threshold value to segment the image into
foreground and background. Finally, stray pixels incorrectly detected as foreground
can be eliminated to improve the foreground mask.
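To make this pipeline concrete, below is a minimal sketch using OpenCV's MOG2
background subtractor (a GMM-based model); the video file name and the parameter
values are illustrative assumptions, not choices from this report.

```python
# Minimal GMM background subtraction sketch with OpenCV's MOG2 model.
import cv2
import numpy as np

cap = cv2.VideoCapture("traffic.mp4")  # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)
kernel = np.ones((3, 3), np.uint8)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Each pixel is matched against its mixture of Gaussians; pixels that fit
    # no background component are marked as foreground.
    mask = subtractor.apply(frame)
    # Threshold away shadow pixels (MOG2 marks them as 127), then remove
    # stray foreground pixels with a morphological opening.
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    cv2.imshow("foreground mask", mask)
    if cv2.waitKey(30) & 0xFF == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```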

3.2 Object Tracking

The goal of object tracking is to generate the trajectory of an object over time by
discovering its exact position in every frame of the video sequence. I have studied
several object tracking algorithms (MeanShift, CamShift). The algorithm for object
tracking is composed of three modules: a selection module for the first frame of the
sequence, the MeanShift module, and the CamShift module. The selection module
selects the position of the object in the first frame and extracts the initialization
parameters: the position, size, width, and length of the search window around the
object in the first frame of the sequence.

3.2.1. Centroid Tracking


The primary assumption of the centroid tracking algorithm is that a given object may
move between subsequent frames, but the distance between the centroids of the same
object in consecutive frames will be smaller than all other distances between objects.
It relies on the Euclidean distance between
(1) existing object centroids (i.e., objects the centroid tracker has already seen
before), and
(2) new object centroids between subsequent frames in a video.

When we detect an object, we enclose it with a bounding box using our previous
methods. Once we have the bounding box coordinates we compute the centroid, or
more simply, the center (x, y)-coordinates of the bounding box. For every subsequent
frame in the video stream we compute object centroids, and we must then determine
whether the new object centroids can be associated with the old ones. To accomplish
this, we compute the Euclidean distance between each pair of existing object centroids
and input object centroids (i.e., the object centroids at frame t−1). If the distance
between two centroids is less than a threshold, we assume they belong to the same
object. However, this method fails when two objects overlap.
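The following hedged sketch illustrates this association step; the (x, y, w, h) box
format, the distance threshold, and the greedy matching order are illustrative
assumptions rather than a fixed specification.

```python
# Centroid association between two frames by Euclidean distance.
import numpy as np
from scipy.spatial.distance import cdist

def centroids(boxes):
    """Convert (x, y, w, h) bounding boxes to center (cx, cy) coordinates."""
    return np.array([(x + w / 2.0, y + h / 2.0) for x, y, w, h in boxes])

def associate(prev_centroids, new_centroids, max_dist=50.0):
    """Greedily match each old centroid to its nearest unclaimed new centroid."""
    dists = cdist(prev_centroids, new_centroids)  # pairwise Euclidean distances
    matches = {}
    for old_idx in dists.min(axis=1).argsort():   # most confident matches first
        new_idx = dists[old_idx].argmin()
        if dists[old_idx, new_idx] <= max_dist and new_idx not in matches.values():
            matches[old_idx] = new_idx            # same object across the frames
    return matches
```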

3.2.2. MeanShift - An Object Tracking Algorithm

The MeanShift algorithm is a non-parametric method. It provides accurate
localization and efficient matching without an expensive exhaustive search. The size
of the search window is fixed. It is an iterative process: first compute the meanshift
value for the current point position, then move the point to its meanshift value as the
new position, and recompute the meanshift until a stopping condition is fulfilled. For
a frame, we use the distribution of grey levels, which describes the shape, and we
converge on the center of mass of the object, calculated by means of moments. The
algorithm converges when the subject is followed within the image sequence.

The intuition behind meanshift is simple. Consider a set of points (it can be a pixel
distribution such as a histogram backprojection). You are given a small window
(perhaps a circle) and you have to move that window to the area of maximum pixel
density (or maximum number of points).
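A minimal sketch of this procedure with OpenCV is shown below; the sequence file
name and the initial window coordinates are illustrative assumptions.

```python
# MeanShift tracking over a hue-histogram backprojection with OpenCV.
import cv2

cap = cv2.VideoCapture("sequence.mp4")
ok, frame = cap.read()
track_window = (200, 150, 80, 120)  # assumed (x, y, w, h) of the object in frame 1
x, y, w, h = track_window

# Build a hue histogram of the target region; backprojecting this histogram
# onto later frames gives the pixel density map that meanshift climbs.
roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

# Stop after 10 iterations or when the window moves less than 1 pixel.
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    # Shift the fixed-size window toward the region of maximum density.
    ret, track_window = cv2.meanShift(back_proj, track_window, term_crit)

cap.release()
```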

3.2.3. CamShift - An Object Tracking Algorithm

The CamShift algorithm is based on the principles of the MeanShift algorithm.
CamShift is able to handle a dynamic distribution by adjusting the size of the search
window for the next frame based on the zeroth moment of the current distribution.
In contrast to MeanShift, which is designed for static distributions, CamShift is
designed for dynamically evolving distributions. It adjusts the size of the search
window using invariant moments, which allows the algorithm to anticipate the
movement of objects and quickly track the object in the next frame. Even during fast
movements of an object, CamShift is still capable of tracking well. This matters when
tracked objects move such that the size and location of the probability distribution
change over time. The initial search window is determined by a detection algorithm
or software dedicated to video processing. The CamShift algorithm calls upon
MeanShift to calculate the target center in the probability distribution image, as well
as the orientation of the principal axis and the dimensions of the probability
distribution. These parameters are obtained from the first and second moments,
defined by the equations below.

We change the window size using the following equations, where the sums run over
all pixels (x, y):

M₀₀ = ∑∑ I(x, y)
M₁₀ = ∑∑ x·I(x, y)
M₀₁ = ∑∑ y·I(x, y)

Xc = M₁₀ / M₀₀
Yc = M₀₁ / M₀₀
w = r₁·√M₀₀
l = r₂·√M₀₀

Where:
M₀₀ - zeroth moment
M₁₀ - first moment for x
M₀₁ - first moment for y
(w, l) - new window size
(Xc, Yc) - new centroid for the next frame
I(x, y) - intensity value at point (x, y) in the probability distribution image
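A brief sketch of how these moment equations translate to code is given below; it
uses OpenCV's cv2.moments, and the scale factors r₁ and r₂ are illustrative
assumptions (in practice they are tuned empirically).

```python
# CamShift-style window update from the image moments of a probability image.
import cv2
import numpy as np

def camshift_window(prob_img, r1=2.0, r2=2.0):
    """Compute the new centroid and window size from a probability image."""
    m = cv2.moments(prob_img)      # dict with keys "m00", "m10", "m01", ...
    if m["m00"] == 0:              # no probability mass: the target was lost
        return None
    xc = m["m10"] / m["m00"]       # Xc = M10 / M00
    yc = m["m01"] / m["m00"]       # Yc = M01 / M00
    w = r1 * np.sqrt(m["m00"])     # window size grows with total mass
    l = r2 * np.sqrt(m["m00"])
    return (xc, yc), (w, l)
```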

4. Proposed Method

4.1 Framework Overview

In this work, I extend traditional tracking-by-detection by collecting candidates from
the outputs of both the detector and the tracks. The framework consists of two
sequential tasks: candidate selection and data association. We first measure all the
candidates using a unified scoring function, formulated by fusing a discriminatively
trained object classifier with a well-designed tracklet confidence. Non-maximal
suppression (NMS) is subsequently performed with the estimated scores. After
obtaining candidates without redundancy, we use both appearance representations
and spatial information to hierarchically associate existing tracks with the selected
candidates. The appearance representations are deeply learned from person
re-identification, and hierarchical data association is applied. I deploy a Deep
Affinity Network as the appearance model to reduce identity switching and occlusion
problems. A sketch of the candidate selection step follows.
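The sketch below shows one way candidate selection could be organized; the fusion
of the classifier score with the tracklet confidence, the helper names cls_score_fn
and nms_fn, and the IoU threshold are illustrative assumptions, not the report's
exact formulation.

```python
# Pool detections and track predictions, score them, and suppress duplicates.
import numpy as np

def select_candidates(det_boxes, det_scores, trk_boxes, trk_conf,
                      cls_score_fn, nms_fn, iou_thresh=0.3):
    """Merge detector and tracker outputs into one non-redundant candidate set."""
    boxes = np.vstack([det_boxes, trk_boxes])
    # Detections keep their detector score; track predictions are scored by the
    # classifier and weighted by their tracklet confidence.
    scores = np.concatenate([det_scores, cls_score_fn(trk_boxes) * trk_conf])
    keep = nms_fn(boxes, scores, iou_thresh)  # indices surviving NMS
    return boxes[keep], scores[keep]
```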

4.2. Real-Time Object Classification

Combining the outputs of both detection and tracks results in an excessive number of
candidates. The classifier shares most computations over the entire image by using a
region-based fully convolutional neural network (R-FCN). It is therefore much more
efficient than classifying image patches cropped from heavily overlapping candidate
regions.

Given an image frame, score maps of the entire image are predicted using a fully
convolutional neural network with an encoder-decoder architecture. The encoder is a
light-weight convolutional backbone for real-time performance, and the decoder uses
up-sampling to increase the spatial resolution of the output score maps for later
classification.

4.3. Tracklet Confidence

Given a new frame, we estimate the new location of each existing track using the
Kalman filter. These predictions are adopted to handle detection failures caused by
varying visual properties of objects and occlusion in crowded scenes, but they are not
suitable for long-term tracking: the accuracy of the Kalman filter decreases if it is not
updated by detections over a long time. Tracklet confidence is designed to measure
the accuracy of the filter using temporal information. A tracklet is generated through
temporal association of candidates from consecutive frames. We can split a track into
a set of tracklets, since a track can be interrupted and retrieved several times during
its lifetime. Every time a track is retrieved from the lost state, the Kalman filter is
reinitialized; therefore, only the information of the last tracklet is used to formulate
the confidence of the track.
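One plausible shape for such a confidence term is sketched below: it grows with the
length of the current tracklet and decays with the time since the last detection
update. The functional form and the rate constant alpha are illustrative assumptions,
not the report's precise definition.

```python
# Hedged sketch of a tracklet confidence term.
import math

def tracklet_confidence(tracklet_len, frames_since_detection, alpha=0.05):
    """High when the filter was recently supported by detections, low otherwise."""
    support = 1.0 - math.exp(-alpha * tracklet_len)    # longer tracklet: trusted
    decay = math.exp(-alpha * frames_since_detection)  # stale filter: decays
    return support * decay
```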

4.4. SORT with Deep Association Metric

I adopt a conventional single-hypothesis tracking methodology with recursive Kalman
filtering and frame-by-frame data association.

4.4.1. Track Handling and State Estimation

I assume a general tracking scenario where the camera is uncalibrated and no
ego-motion information is available. While these circumstances pose a challenge to
the filtering framework, this is the most common setup considered in recent multiple
object tracking benchmarks. The tracking scenario is therefore defined on the
eight-dimensional state space (u, v, γ, h, u’, v’, γ’, h’) that contains the bounding
box center position (u, v), aspect ratio γ, height h, and their respective velocities
in image coordinates.

For each track k, I count the number of frames since the last successful measurement
association. This counter is incremented during Kalman filter prediction and reset to
0 when the track has been associated with a measurement. Tracks that exceed a
predefined maximum age Amax are considered to have left the scene and are deleted
from the track set. New track hypotheses are initiated for each detection that cannot
be associated with an existing track. These new tracks are classified as tentative
during their first three frames, during which a successful measurement association is
expected at each time step. Tracks that are not successfully associated with a
measurement within their first three frames are deleted.
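The life-cycle bookkeeping described above can be sketched as follows; the class
layout and default Amax value are illustrative assumptions, and the Kalman predict
and update steps are only indicated in comments.

```python
# Hedged sketch of track life-cycle management (tentative/confirmed/deleted).
class Track:
    TENTATIVE, CONFIRMED, DELETED = range(3)

    def __init__(self, track_id, max_age=30):
        self.track_id = track_id
        self.state = Track.TENTATIVE   # tentative during the first three frames
        self.hits = 0                  # successful measurement associations
        self.time_since_update = 0     # frames since the last association
        self.max_age = max_age         # Amax: delete the track when exceeded

    def predict(self):
        # The Kalman prediction step would run here; the age counter increments.
        self.time_since_update += 1
        if self.time_since_update > self.max_age:
            self.state = Track.DELETED

    def update(self):
        # The Kalman update with the associated measurement would run here.
        self.hits += 1
        self.time_since_update = 0     # reset on successful association
        if self.state == Track.TENTATIVE and self.hits >= 3:
            self.state = Track.CONFIRMED
```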

4.4.2 Deep Appearance Descriptor

Since the method uses simple nearest neighbor queries without additional metric
learning, its successful application requires a well-discriminating feature embedding
to be trained offline, before the actual online tracking application. To this end, we
employ a CNN trained on a large-scale person re-identification dataset that contains
over 1,100,000 images of 1,261 pedestrians, making it well suited for deep metric
learning in a people tracking context. The network is a wide residual network with
two convolutional layers followed by six residual blocks. A global feature map of
dimensionality 128 is computed in dense layer 10. A final batch normalization and
ℓ2 normalization project the features onto the unit hypersphere so that they are
compatible with the cosine appearance metric. In total, the network has 2,800,864
parameters, and one forward pass of 32 bounding boxes takes approximately 30 ms
on an Nvidia GeForce GTX 1050 GPU.
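Given such unit-normalized embeddings, the cosine appearance metric reduces to a
matrix product, as the minimal sketch below shows; the feature extractor itself is
assumed given.

```python
# Pairwise cosine distance between L2-normalized appearance embeddings.
import numpy as np

def cosine_distance(track_feats, det_feats):
    """track_feats: (T, 128), det_feats: (D, 128); returns a (T, D) cost matrix."""
    a = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    b = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    return 1.0 - a @ b.T  # 0 = identical direction, 2 = opposite direction
```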

4.5. Deep Affinity Network

We model the appearance of objects in video frames and compute their cross-frame
affinities using the Deep Affinity Network (DAN). The network is presented as two
components, (a) a feature extractor and (b) an affinity estimator; however, the overall
network is end-to-end trainable. DAN training requires a video frame It along with
its object centers Ct, and a video frame It−n along with its object centers Ct−n, as
inputs. The two frames are not restricted to being consecutive in the video; they are
allowed to be n time stamps apart. Whereas the network is eventually deployed to
track objects in consecutive video frames, training it with non-consecutive frames
benefits the overall approach in reliably associating objects in a given frame with
those in multiple previous frames. DAN also requires the ground truth binary data
association matrix L(t−n, t) of the input frame pair for computing the network cost
during training.

4.5.1. Data Preparation

Multiple object tracking datasets often fall short of fully capturing camera
photometric distortions, background scene variations, and other practical factors to
which tracking approaches should remain robust. For model-based approaches, it is
important that the training data contains sufficient variations of such
tracking-irrelevant factors to induce robustness in the learned models. Hence, I
perform preprocessing steps such as photometric distortion, frame expansion, and
cropping over the available data.

The preprocessing steps are applied to the frame pairs sequentially, each with a
probability of 0.3. The frames are then resized to fixed dimensions H × W × 3 and
horizontally flipped with a probability of 0.5. The resulting processed frames are used
as inputs to the DAN along with the associated object centers computed by a detector.
The purpose of this data preparation is to increase the robustness of the model by
adding some noise.
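A hedged sketch of this preparation using torchvision-style transforms is given
below; the distortion parameters are illustrative assumptions, and handling of the
object center coordinates is only indicated in a comment.

```python
# Frame-pair augmentation sketch for DAN training.
import random
from torchvision import transforms

H, W = 900, 900  # fixed input resolution, per Section 4.5.2

def preprocess_pair(frame_a, frame_b):
    """Augment a training frame pair (PIL images) before feeding the DAN."""
    aug = []
    if random.random() < 0.3:  # each distortion step fires with probability 0.3
        aug.append(transforms.ColorJitter(brightness=0.3, contrast=0.3,
                                          saturation=0.3))
    aug.append(transforms.Resize((H, W)))  # fixed H x W x 3 input
    if random.random() < 0.5:              # horizontal flip with probability 0.5
        aug.append(transforms.RandomHorizontalFlip(p=1.0))
    pipeline = transforms.Compose(aug)
    # Object center coordinates must be transformed consistently with the frames
    # (omitted here); ColorJitter also re-samples its parameters per call, so
    # strictly identical photometric parameters across the pair would require
    # torchvision's functional API.
    return pipeline(frame_a), pipeline(frame_b)
```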

4.5.2. Feature Extractor

The first major component of the DAN is called the feature extractor for its
functionality. This sub-network models comprehensive yet compact features of the
detected objects in video frames. Feature extraction is performed by passing pairs of
video frames and object centers through two streams of convolution layers. These
streams share model parameters in our implementation, and their architecture is
inspired by the VGG16 network. We use the VGG architecture after converting its
fully-connected and softmax layers to convolution layers. This modification is made
because the spatial features of objects, which are of more interest in our task, are
better encoded by convolution layers. Compared to the original VGG16, the input
frame size for this network is much larger (3 × 900 × 900) due to the nature of the
task at hand and the available tracking datasets. Consequently, we are still able to
compute 56 × 56 feature maps after the last layer of the modified VGG network.
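The conversion of VGG16's fully connected head into convolutions might look like
the sketch below; the head layer widths are illustrative assumptions, not the exact
DAN configuration.

```python
# Hedged sketch of a VGG16-based feature extractor with a convolutional head.
import torch.nn as nn
from torchvision.models import vgg16

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = vgg16(weights=None).features  # convolutional VGG16 trunk
        # Convolutional head in place of VGG's FC/softmax layers, preserving
        # the spatial structure of the object features.
        self.head = nn.Sequential(
            nn.Conv2d(512, 1024, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(1024, 512, kernel_size=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # x: (B, 3, 900, 900). With VGG's default pooling this yields 28 x 28
        # maps; the modified network in the report reaches 56 x 56, e.g. by
        # relaxing one pooling stage.
        return self.head(self.backbone(x))
```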

4.5.3. Affinity Estimator

The objective of this component of the DAN is to encode affinities between objects
using their extracted features. To that end, the network arranges the columns of Ft
and Ft−n such that the columns of the two feature matrices are concatenated along
the depth dimension. The architecture of the compression network is inspired by the
physical significance of its input and output signals: the network maps a tensor
encoding combinations of object features to a matrix that encodes similarities between
those features (and hence the objects). It performs a gradual dimension reduction
along the depth of the input tensor with convolution kernels that do not allow
neighboring elements of the feature maps to influence each other.
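Kernels that never mix neighboring map elements are simply 1 × 1 convolutions; the
sketch below builds all pairwise feature combinations and compresses them down to
a scalar affinity per pair. The channel widths are illustrative assumptions.

```python
# Hedged sketch of the affinity estimator's 1x1-convolution compression network.
import torch
import torch.nn as nn

class AffinityEstimator(nn.Module):
    def __init__(self, feat_dim=520):
        super().__init__()
        # Gradual depth reduction from 2*feat_dim down to a single similarity.
        dims = [2 * feat_dim, 512, 256, 64, 1]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Conv2d(d_in, d_out, kernel_size=1), nn.ReLU(inplace=True)]
        self.compress = nn.Sequential(*layers[:-1])  # no ReLU after the last layer

    def forward(self, feats_t, feats_tn):
        # feats_t: (N, feat_dim), feats_tn: (M, feat_dim).
        n, d = feats_t.shape
        m = feats_tn.shape[0]
        # Concatenate every feature pair along the depth dimension.
        grid = torch.cat([feats_t[:, None, :].expand(n, m, d),
                          feats_tn[None, :, :].expand(n, m, d)], dim=2)
        grid = grid.permute(2, 0, 1)[None]  # shape (1, 2*d, N, M)
        return self.compress(grid)[0, 0]    # (N, M) affinity matrix
```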

4.6. DAN Deployment

Although the feature extractor component of the DAN is trained as a two-stream
network, it is deployed as a one-stream model in our approach. This is possible
because the parameters are shared between the two streams. The deployment is best
described in terms of the network's two major components. The network expects a
single frame It as its input, along with the object center locations Ct. The feature
extractor computes the feature matrix Ft for the frame and passes it to the affinity
estimator. The latter uses the feature matrix of a previous frame, say It−n, to compute
the permutation tensor for the frame pair. The tensor is then mapped to an affinity
matrix by a simple forward pass through the network and a concatenation operation,
as described above. Thus each frame is passed through the object detector and feature
extractor only once, but its features are used multiple times for computing affinities
with multiple other frames in pairs.

4.7. Hierarchical Data Association

Predictions of tracks are utilized to handle missed detections in crowded scenes.
Influenced by intra-category occlusion, these predictions may overlap with other
objects. To avoid taking unwanted objects and background into the appearance
representations, we hierarchically associate tracks with different candidates using
different features. In particular, we first apply data association on candidates from
detection, using appearance representations with a threshold on the maximum
distance. Then, we associate the remaining candidates with unassociated tracks based
on the IoU between candidates and tracks, again with a threshold. We only update
the appearance representations of tracks when they are associated with detections;
the update is conducted by saving ReID features from the associated detection.
Finally, new tracks are initialized from the remaining detection results. With
hierarchical data association, we only need to extract ReID features for candidates
from detection once per frame. A sketch of the two stages follows.
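The sketch below uses Hungarian matching for both stages; the distance function
names, thresholds, and matching strategy are illustrative assumptions rather than the
report's exact procedure.

```python
# Hedged sketch of two-stage hierarchical data association.
from scipy.optimize import linear_sum_assignment

def hierarchical_associate(tracks, det_cands, trk_cands,
                           appearance_dist, iou_dist,
                           max_app_dist=0.4, min_iou=0.3):
    """Stage 1: ReID matching on detections; stage 2: IoU matching on the rest."""
    matches = []
    if tracks and det_cands:
        cost = appearance_dist(tracks, det_cands)  # (T, D) cosine distances
        for r, c in zip(*linear_sum_assignment(cost)):
            if cost[r, c] <= max_app_dist:
                matches.append((tracks[r], det_cands[c]))  # update ReID features here
    matched = {id(t) for t, _ in matches}
    rem_tracks = [t for t in tracks if id(t) not in matched]
    if rem_tracks and trk_cands:
        cost = iou_dist(rem_tracks, trk_cands)     # 1 - IoU between boxes
        for r, c in zip(*linear_sum_assignment(cost)):
            if cost[r, c] <= 1.0 - min_iou:
                matches.append((rem_tracks[r], trk_cands[c]))
    # Unmatched detections would initialize new tracks in the full pipeline.
    return matches
```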

5. Experiment

I assess the performance of the tracker on the MOT16 benchmark. This benchmark
evaluates tracking performance on seven challenging test sequences, including
frontal-view scenes with a moving camera as well as top-down surveillance setups.
As input, the tracker relies on the detections provided with the benchmark, produced
by a Faster R-CNN trained on a collection of public and private datasets for excellent
performance. For a fair comparison, I have re-run SORT on the same detections.
Detections have been thresholded at a confidence score of 0.3. The remaining
parameters of the method have been found on separate training sequences provided
by the benchmark. Evaluation is carried out according to the following metrics:

• Multi-object tracking accuracy (MOTA): summary of overall tracking accuracy in
terms of false positives, false negatives, and identity switches.
• Multi-object tracking precision (MOTP): summary of overall tracking precision in
terms of bounding box overlap between ground truth and reported location.
• Mostly tracked (MT): percentage of ground-truth tracks that have the same label for
at least 80% of their life span.
• Mostly lost (ML): percentage of ground-truth tracks that are tracked for at most
20% of their life span.
• Identity switches (ID): number of times the reported identity of a ground-truth
track changes.
• Fragmentation (FM): number of times a track is interrupted by a missing detection.

6. Conclusion

I have studied the traditional methods of multiple object tracking in real time and
then explored deep learning and computer vision techniques to track objects more
robustly. I have presented an extension to SORT that incorporates appearance
information through a pre-trained association metric. Due to this extension, we are
able to track through longer periods of occlusion. The studied online multiple people
tracking framework takes full advantage of recent deep neural networks. It tackles
unreliable detection by selecting candidates from the outputs of both detection and
tracks. The scoring function for candidate selection is formulated by an efficient
R-FCN, which shares computations over the entire image. Moreover, the
identification ability when coping with intra-category occlusion is improved by
introducing ReID features for data association. The tracker derives its strength from
the proposed convolutional neural network architecture, referred to as the Deep
Affinity Network (DAN). The DAN models features of pre-detected objects in video
frames at multiple levels of abstraction, and infers object affinities across different
frames by analyzing exhaustive permutations of the extracted features. The
cross-frame object similarities and object features are recorded by this approach to
trace the trajectories of the objects.

REFERENCES

1. Samuel Murray, "Real-Time Multiple Object Tracking: A Study on the Importance of Speed."
2. B. Tharanidevi, R. Vadivu, and K. B. Sethupathy, "Moving Object Tracking Distance and Velocity Determination based on Background Subtraction Algorithm," IOSR Journal of Electronics and Communication Engineering (IOSR-JECE).
3. Imran Khan Pathan and Chetan Chauhan, "A Survey on Moving Object Detection and Tracking Methods," International Journal of Computer Science and Information Technologies.
4. Deep Learning in Computer Vision - Coursera.
5. Afef Salhi and Ameni Yengui Jammoussi, "Object tracking system using Camshift, Meanshift and Kalman filter."
6. Rohini Chavan, "Multiple Object Detection using GMM Technique and Tracking using Kalman Filter," International Journal of Computer Applications (0975-8887).
7. B. Y. Lee, L. H. Liew, W. S. Cheah, and Y. C. Wang, "Occlusion handling in videos object tracking: A survey," published under licence by IOP Publishing Ltd.
8. Multi-object tracking with dlib, https://www.pyimagesearch.com/2018/10/29/multi-object-tracking-with-dlib/
9. L. Zhang, Y. Li, and R. Nevatia, "Global data association for multi-object tracking using network flows," in CVPR, 2008, pp. 1-8.
10. H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, "Globally-optimal greedy algorithms for tracking a variable number of objects," in CVPR, 2011, pp. 1201-1208.
11. J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, "Multiple object tracking using k-shortest paths optimization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 9, pp. 1806-1819, 2011.
12. Nicolai Wojke, Alex Bewley, and Dietrich Paulus, "Simple Online and Realtime Tracking with a Deep Association Metric."
13. Long Chen, Haizhou Ai, Zijie Zhuang, and Chong Shang, "Real-Time Multiple People Tracking with Deeply Learned Candidate Selection and Person Re-Identification."
14. ShiJie Sun, Naveed Akhtar, HuanSheng Song, Ajmal Mian, and Mubarak Shah, "Deep Affinity Network for Multiple Object Tracking."
15. Seung-Hwan Bae and Kuk-Jin Yoon, "Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking," IEEE Transactions on PAMI, 2017.
16. Loïc Fagot-Bouquet, Romaric Audigier, Yoann Dhome, and Frédéric Lerasle, "Improving multi-frame data association with sparse representations for robust near-online multi-object tracking," in ECCV, 2016.
17. Amir Sadeghian, Alexandre Alahi, and Silvio Savarese, "Tracking the untrackable: Learning to track multiple cues with long-term dependencies," in ICCV, 2017.
18. A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Computing Surveys, vol. 38, no. 4, 2006.
