Multiple Object Tracking for Video Analysis and Surveillance: A Literature Survey
ISSN No:-2456-2165
Abstract:- Multiple Object Tracking (MOT) is the detection of unique objects and the tracking of their movements through frames. MOT faces various issues such as occlusions, similar appearance of different objects, and interaction among multiple objects. Object tracking in aerial image sequences faces further issues due to weather conditions, small object sizes, distortion in proportions, etc. Of the various techniques and methods developed for MOT, only a fraction perform well on aerial image sequences, with the rest focusing on ground-level image sequences. Research in Multiple Object Tracking has been gaining a lot of attention, with an increasing trend in the number of research papers published each year. This paper focuses on documenting the advancements in Multiple Object Tracking in recent years, with extra attention paid to techniques that help with aerial image sequences. The various architectures and techniques are classified and compared along with their advantages, disadvantages and performance scores, with detailed mention of the datasets generally used for various applications.

Keywords:- Multiple object tracking, aerial images, attention networks, occlusion, pedestrian tracking.

I. INTRODUCTION
In recent years, Multiple Object Tracking has gained a lot of attention and has become one of the more important tasks in computer vision. Object tracking is usually done in two phases: an object is first detected and uniquely identified, and is then tracked as it moves through the frames in phase two. Multiple object tracking performs the detection and tracking over multiple objects in the same frames. The detection phase must be able to handle situations with deformation, varying illumination, cluttered or textured backgrounds, etc. Re-identification and association of objects is also very important and dictates the accuracy of the tracking, as there are various issues like occlusion, similar appearance of different objects, and interaction among multiple objects. Object detection and tracking in aerial images specifically has been a problem for a long time. Even with the rise of numerous methods that deal with sequences involving people and objects, the issues that arise when applying the same methods to aerial image sequences need to be addressed. These issues include smaller object sizes, weather conditions, changes of scale, and distortion in proportions. With the increasing trend in MOT research, more and more approaches have been looked into to counter these various issues. In this paper, these methods are examined and tabulated, with more emphasis on object tracking in aerial image sequences, along with a comparison of their performance scores, advantages, disadvantages and the particular scenarios they excel at.

II. FUNDAMENTAL ARCHITECTURES

A. Convolutional Neural Networks (CNN)
The convolutional neural network (CNN) is a specialization of the artificial neural network that is excellent for pattern recognition in images and is mostly employed for image processing and recognition. A CNN is made up of an input layer, hidden layers (typically convolutional, pooling and fully connected layers), and an output layer. The input layer receives the initial data that will be processed further. With the help of the activation function, each hidden layer uses its collection of weighted inputs to generate an outcome. The required output is subsequently produced by the output layer.

B. Recurrent Neural Networks (RNN)
Recurrent neural networks are a modification of feedforward neural networks with the addition of internal memory. RNNs are iterative in nature, as they perform the same function for each data input, but the output for the current input depends on the last computation. Unlike feedforward neural networks, RNNs can use internal states (memory) to process a sequence of inputs. This way, context from earlier inputs is retained while the rest of the sequence is processed.

C. Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) networks are modified versions of RNNs that are more effective at remembering past data in memory. They mitigate the vanishing gradient problem of RNNs. The LSTM network has three gates.
Input Gate - Determines which values from the input are used to modify the memory. The sigmoid function determines the values passed.
Forget Gate - Determines which details are discarded from the block. This is determined by the sigmoid function. Forget gates help solve the vanishing gradient problem in RNNs.
Output Gate - Uses the block's input and memory to determine the output. The sigmoid function determines the value to pass and the tanh function weights the value.

D. Attention Based
While training an image model, the model should be able to focus on key elements of an image. This can be accomplished through attention mechanisms. Attention can be described as a function mapping a query vector Q together with a key-value vector pair K, V to an output.
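
As an illustration of attention as a mapping from a query and key-value pairs to an output, the following is a minimal sketch that assumes the scaled dot-product form used in the Transformer of [1]; other attention variants score queries against keys differently. The function names and the toy vectors are hypothetical and chosen only for illustration, and plain Python lists are used instead of a tensor library to keep the example self-contained.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query (assumed form, see lead-in).

    query:  list of d_k floats
    keys:   list of n key vectors, each of length d_k
    values: list of n value vectors, each of length d_v
    Returns (output vector of length d_v, attention weights of length n).
    """
    d_k = len(query)
    # Similarity of the query to every key, scaled by sqrt(d_k).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    # The output is the attention-weighted sum of the value vectors.
    d_v = len(values[0])
    output = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(d_v)
    ]
    return output, weights


# Toy example (made-up numbers): the query is aligned with the first key,
# so most of the weight goes to the first value vector.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attention([5.0, 0.0], keys, values)

# A query that matches both keys equally attends to both values equally.
even_output, even_weights = attention([1.0, 1.0], keys, values)
```

In this sketch the weights always sum to one, so the output is a convex combination of the value vectors. This is how the mechanism lets a model focus on the most relevant elements.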
Ref. | Architecture / technique | Summary of results | Datasets
[4] | The model has three main components: a convolutional neural network backbone to get the feature representation, a transformer, and a simple feed forward network that makes the final prediction. | The main result of this paper is that DETR can achieve results comparable to an optimized Faster R-CNN network on the COCO dataset. This means that the transformer network can be used to detect objects in images with performance similar to the well-established Faster R-CNN baseline. | COCO 2017 and panoptic segmentation datasets
[5] | A weakly supervised network using background attention that merges feature maps of various scales. | The results of this paper show that the proposed Guided Attention Network (GANet) outperforms existing methods for object detection and counting on the CARPK, PUCPR+ and UAVDT datasets. Moreover, GANet also achieved better results in terms of speed and computational efficiency. | UAVDT, CARPK, PUCPR+
[6] | Semi-supervised multi-task network with self-attention. | This paper shows that a multi-task learning approach with self-attention can improve the accuracy of object detection on two traffic surveillance datasets, UA-DETRAC and UAVDT. This model could potentially help with instance segmentation at almost no cost. | UAVDT, UA-DETRAC
[7] | Dense anchor scales with large scale variance for detection. Squeeze-and-Excitation blocks are used to capture the channel dependencies. A deep association network is used after the detection module and feeds the generated hypotheses to the DeepSORT network. | The proposed model was able to accurately detect objects in drone images with a maximum of 500 detections using denser anchor scales with large scale variance, Squeeze-and-Excitation (SE) blocks, and a trained deep association network. | VisDrone2019
[8] | SMSOT-CNN, GOTURN (Generic Object Tracking Using Regression Networks). | The created model is able to accurately and efficiently track multiple pedestrians and cars in drone-captured images. The results of the ablation study on the dataset showed that the proposed approach achieved the highest accuracy and the shortest execution time. | KIT AIS, DLR's Aerial Crowd Dataset
[9] | Attentional Asymmetric Siamese Network and Discriminative Correlation Filter. | The results of this paper show that the proposed DCF-ASN tracking framework, which combines discriminative correlation filters (DCF) with an asymmetric siamese network (ASN), achieves state-of-the-art performance on five popular tracking datasets. | UAV123 and UAVDT
[10] | A module for data augmentation. An algorithm to merge the tiny objects into several clusters of similar sizes. The candidate center points and the size of the clusters are found and a network finely calculates the center points and | The network can effectively detect objects from aerial images, with precision results higher than the state-of-the-art approaches at the time it was implemented. | VisDrone and UAVDT
[11] | The TLD algorithm and the ATLD algorithm. | The results of this paper show that the Appearance and Tracking Learning Detection (ATLD) algorithm outperforms the original Tracking Learning Detection (TLD) algorithm in terms of accuracy. The ATLD algorithm also performs better than other benchmark learning-based algorithms. | Aerial sequences from the UCF website, TLD dataset and various classified sources
[12] | LSTM, GCNN and a siamese neural network module. | The results of this paper show that the proposed AerialMPTNet method outperforms all earlier methods on the pedestrian datasets and achieves competitive results on the vehicle dataset. The results also show that adding LSTM and GCNN modules to the tracking algorithm improves the tracking performance. | KIT AIS, DLR-ACD and AerialMPT
[13] | The accuracy of detection is improved by separating the ReID and detection branches into two parts. This also makes the two parts more independent. Temporal information is used in target detection and the ReID head. | The results of this paper show an improvement in multiple object tracking performance compared to other models on UAV video datasets. Additionally, the model was able to detect more targets than the baseline model and reduced the issues of false and missed detections. | VisDrone2019
[14] | Built using MMDetection and PyTorch. Different modules are used for detection and for extraction of features. Merging of detected focals is done with the help of NMS and the final predictions are obtained using IBS. | The results of this paper show that the two-stage object detection framework achieved a 42.06 AP score on the validation set of VisDrone, outperforming all other small-object detection methods seen in the literature. | VisDrone and UAVDT
Table 1: Summaries of papers focused on Aerial Image Sequences
Ref. | Architecture / technique | Summary of results | Datasets
[15] | CNN, Spatial-Temporal Attention Mechanism. | The evaluation reports false negatives, false positives, identity switches and other errors that can occur when tracking multiple objects. The results show that the algorithm tracks multiple objects accurately and efficiently. | MOT15, MOT16
[16] | ResNet-50, Transformer with multi-head attention. | The evaluation accounts for false negatives, false positives and identity switches when calculating the accuracy. TransTrack was found to be competitive with other methods. | MOT17, MOT20, CrowdHuman
[29] | ResNet-34, CenterNet. | While the time spent on re-ID matching grows linearly with density, the time spent on joint detection and re-ID is only slightly impacted by density. | CrowdHuman, 2DMOT15, MOT16, MOT17 and MOT20
[17] | Uses the convolutional block attention module (CBAM), constructed using a spatial-temporal network on a ResNet-50 backbone. | Uses a cost-sensitive tracking loss to concentrate on negative distractions, while attention networks consisting of both temporal and spatial mechanisms are used to suppress noisy observations. These mechanisms help the algorithm to track objects in video frames better than both online and offline trackers, as indicated by the identity-preserving metrics. | MOT16, MOT17
[18] | R-FCN architecture with an encoder-decoder block and Kalman Filter. | The R-FCN (Region-Based Fully Convolutional Network) is a deep learning model that performs the majority of its calculations on the entirety of the image. ReID (Person Re-Identification) features are deep-learned appearance representations, trained on huge person re-identification datasets, that increase the ability to identify targets. | MOT16, IDRI
[28] | The main neural network used is ResNet-50. It generates a multi-scale feature representation by integrating Feature Pyramid Networks. | The model takes into consideration various factors, such as false negatives, false positives, fragmentation and identity switches. The proposed model is also faster and simpler than existing methods, and it does not require any extra training data. | MOT16, MOT17
[19] | Attention-based model that makes use of higher-order attention mechanisms for re-identification. | The experiments performed to validate the superiority of the MHN for person re-identification show that it outperforms a wide range of state-of-the-art methods on various datasets, which include Market-1501, DukeMTMC-ReID and CUHK03-NP. | Market-1501, DukeMTMC-ReID and CUHK03-NP
[20] | Bidirectional recurrent neural network (Bi-RNN) which determines the assignment matrix based on the prediction-to-ground-truth distance matrix. | The results of this paper show that the proposed framework increases the performance of existing multi-object trackers, and that it established a new state-of-the-art score on the MOTChallenge benchmark. | MOT15, MOT16, MOT17
[21] | Variational Bayesian Model, VEM algorithm. | The paper's results show that the proposed OVBT (Online Variational Bayesian Tracker) algorithm performs well on the MOT 2016 dataset, with low accuracy (MOTA) but high precision (MOTP). This is likely because the algorithm is sometimes unable to detect targets or misidentifies them, leading to identity switches (ID) and missed targets (FN). The results also show that when multiple observations are considered within the visibility process, performance is improved for all sequences and most measures. | MOT16
[22] | Joint Detection and Embedding with the DarkNet-53 backbone network. A feature pyramid network is also used. | The results of this paper show that the proposed MOT system is the first real-time MOT system, with a high running speed of about 22 to 40 frames per second. It also has a high MOTA score. | ETH, CityPersons, CalTech, MOT16, PRW and CUHK-SYSU datasets
[23] | Global Context Disentanglement, Deformable Attention, Guided Transformer Encoder. | The paper demonstrates the superiority of the proposed RelationTrack MOT framework. The experiments conducted on the MOT20, MOT16 and MOT17 benchmarks show that the proposed framework has established a new state-of-the-art performance and has surpassed preceding methods. | MOT16, MOT17 and MOT20
[25] | Uses the YOLOX detector and performs association between the detection boxes and tracks. | ByteTrack can significantly improve the IDF1 score. It has also achieved a high MOTA score. | MOT17, MOT20, ETHZ
[26] | A frame-level hypothesis generation module, a track-level memory encoding module, and a memory decoding module. | The results of the paper show that MeMOT achieves better performance in normal as well as crowded scenarios. MeMOT also achieves good performance on the MOT Challenge benchmarks for pedestrian tracking, outperforming other methods. | MOT16, MOT17, MOT20
[27] | Encoder-decoder transformer. | The model achieves high performance for both public and private detections on MOT17 and MOT20. It is able to track objects accurately for a long period of time. | MOT17, MOTS20
Table 2: Summaries of papers focused on Pedestrian Tracking
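
Several rows of Table 2 refer to MOTA together with false negatives, false positives and identity switches. As a reference point, the sketch below assumes the conventional CLEAR MOT definition, in which MOTA is one minus the ratio of the summed error counts to the number of ground-truth objects. The function name and the counts in the example are hypothetical and are not taken from any of the surveyed papers.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy under the assumed CLEAR MOT definition.

    All counts are totals accumulated over every frame of a sequence.
    The score is at most 1.0 and can be negative when the tracker makes
    more errors than there are ground-truth objects.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


# Made-up counts for illustration only.
example_score = mota(
    false_negatives=120,
    false_positives=60,
    id_switches=20,
    num_ground_truth=1000,
)
```

Because all three error types are summed in the numerator, a tracker with few identity switches can still have a low MOTA if it misses many targets. This is consistent with the behaviour reported for [21].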
III. STRUCTURED RELATED WORK

Vaswani, Ashish, et al. [1] introduce the Transformer model, which follows the encoder-decoder approach. The encoder extracts features from the input and passes them on to the feed forward neural network. The decoder receives keys and values from the encoder block. The model is auto-regressive, which means that it uses the previously generated outputs as the input for the present step. With the Transformer model, which is based solely on attention mechanisms, better quality was achieved compared to RNN- and CNN-based models. Owing to parallelization, the Transformer model also took significantly less time to train.

Beheim, Tsuyoshi [2] discusses tracking of multiple vehicles in aerial image sequences. To establish a benchmark, MOT methods from different fields are applied to an aerial dataset. Based on that, several adjustments are made to examine their impact on the methods' ability to track. The proposed collection of adjustments includes a motion predictor, object re-identification, and a vehicle orientation prediction module.

Jetley, Saumya, et al. [3] employed attention maps to locate and utilize the useful spatial support of visual data that CNNs use to make classification predictions. The paper uses a 2D matrix of scalar scores that indicates how important the layer activations at each spatial location are in relation to the goal. When the technique was used throughout a network, the model's performance improved considerably.

Carion, Nicolas, et al. [4] proposed a novel approach where the detection of objects is treated as a direct set prediction problem. The method simplifies the pipeline by eliminating the need for many hand-designed components, such as a non-maximum suppression mechanism or anchor generation, that explicitly encode prior knowledge about the task. The key components of the proposed framework are a transformer architecture with an encoder and a decoder, and a global loss that uses bipartite matching to make sure unique predictions are made. Given a small, fixed set of object queries, the transformer, through its understanding of the different objects, the relationships between them, and the overall context of the image, gives the final predictions as output.

Cai, Yuanqiang, et al. [5] introduce the guided attention network for the detection of objects in scenes that are captured using drones. The method is an anchorless approach that fuses feature maps of different scales, using background attention to learn a background-discriminative representation and a foreground attention module to examine the local view of the object. It applies data augmentation to the training data, synthesizing different image brightness settings and adding noise to imitate different weather conditions, which leads to better accuracy.

Perreault, Hughes, et al. [6] implemented a strategy that makes use of multi-task learning to create a network with attention. Segmentation labels are generated for the foreground and the background in a supervised and unsupervised manner to train visual attention via background subtraction or optical flow. With the help of these labels, an object detection model is trained to generate bounding boxes and foreground/background segmentation maps while sharing the majority of model parameters. The segmentation maps are employed inside the network to weight the feature map that leads to the creation of the bounding boxes, reducing the weight of irrelevant parts of the image. The model is thus trained to perform multiple tasks: segmentation and bounding box detection.

Jadhav, Ajit, et al. [7] created a model for object recognition in aerial view images. The model is built on top of the RetinaNet model. The anchor scales are adjusted for the dense distributions and smaller objects. Squeeze-and-Excitation (SE) blocks are used to model channel interdependencies. This contributes to large performance increases at a small additional computational cost. A custom-built DeepSORT network is used for tracking the objects detected with the above architecture on the VisDrone2019 MOT dataset.
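
To make the role of the Squeeze-and-Excitation blocks mentioned for [7] more concrete, the following is a minimal sketch of the generic SE operation: a global average pool per channel (squeeze), a small two-layer gating network (excitation), and a per-channel rescaling of the feature map. It is not the implementation used in [7]. The function name, the omission of bias terms, the weight values and the toy feature map are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
import math


def se_block(feature_map, w_reduce, w_expand):
    """Generic Squeeze-and-Excitation over a C x H x W feature map.

    feature_map: list of C channels, each a list of H rows of W floats
    w_reduce:    (C // r) x C weights of the bottleneck layer (bias omitted)
    w_expand:    C x (C // r) weights of the expansion layer (bias omitted)
    Returns (rescaled feature map, per-channel gates in (0, 1)).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation, step 1: bottleneck fully connected layer with ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)))
        for row in w_reduce
    ]
    # Excitation, step 2: expansion layer with sigmoid gives channel gates.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: every activation in a channel is multiplied by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


# Toy input (made-up values): two 2x2 channels and a reduction to one unit.
feature_map = [
    [[4.0, 4.0], [4.0, 4.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
w_reduce = [[1.0, -1.0]]
w_expand = [[2.0], [-2.0]]
scaled, gates = se_block(feature_map, w_reduce, w_expand)
```

With these hand-picked weights the first channel is kept almost unchanged while the second is suppressed. In a trained network the two weight matrices are learned, so the gates reflect which channels are informative for the input at hand.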