Volume 8, Issue 2, February – 2023 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Multiple Object Tracking for Video Analysis and Surveillance: A Literature Survey
Advitiya C S., Adarsh R Shenoy, Shravya A R., Abhishek Battula, Akash Raghavendra, Dr. Krishnan R
Department of Computer Science and Engineering
Dayananda Sagar College Of Engineering
Bangalore, India

Abstract:- Multiple Object Tracking (MOT) is the detection of unique objects and their movements through frames. There are various issues in Multiple Object Tracking like occlusions, similar appearance of different objects, interaction among multiple objects, etc. There are even more issues with Object Tracking in aerial image sequences due to weather conditions, small sizes of objects, distortion in proportions, etc. Even with the various techniques and methods developed for MOT, only a fraction are able to perform on aerial image sequences, with the rest focusing on ground level image sequences. Research in Multiple Object Tracking has been gaining a lot of attention, with an increasing trend in the number of Multiple Object Tracking research papers published each year. This paper focuses on documenting the advancements in Multiple Object Tracking in recent years, with extra attention paid to techniques that help with aerial image sequences. The various architectures and techniques are classified and compared along with their advantages, disadvantages and performance scores, with detailed mention of the datasets generally used for various applications.

Keywords:- Multiple object tracking, aerial images, attention networks, occlusion, pedestrian tracking.

I. INTRODUCTION

In recent years, Multiple Object Tracking has gained a lot of attention and has become one of the more important tasks in computer vision. Object tracking is usually done in two phases: an object is first detected and uniquely identified, and is then tracked as it moves through the frames in phase two. Multiple object tracking performs the detection and tracking over multiple objects in the same frames. The detection phase must be able to handle situations with deformation, varying illumination, cluttered or textured backgrounds, etc. Re-identification and association of objects is also very important and dictates the accuracy of the tracking, as there are various issues like occlusion, similar appearance in different objects, interaction among multiple objects, etc. Object detection and tracking in aerial images specifically has been a problem for a long time. Even with the rise of numerous methods which deal with sequences involving people and objects, the issues that arise when applying the same methods to aerial image sequences need to be addressed. These issues include smaller object sizes, weather conditions, change of scales, and distortion in proportions. With the increasing trend in research of MOT, more and more approaches have been looked into to counter these various issues. In this paper these methods are examined and tabulated, with more emphasis on object tracking in aerial image sequences, along with a comparison of their performance scores, advantages, disadvantages and the particular scenarios they excel at.

II. FUNDAMENTAL ARCHITECTURES

A. Convolutional Neural Networks (CNN)
The convolutional neural network (CNN) is an improvement of the artificial neural network that is excellent for pattern recognition in images and is mostly employed for image processing and recognition. A CNN is organized into three kinds of layers: the input layer, the hidden layers, and the output layer. The input layer receives the initial data that will be processed further. With the help of the activation function, the hidden layers use this collection of weighted inputs to generate an outcome. The necessary output is subsequently produced by the output layer.

B. Recurrent Neural Networks (RNN)
Recurrent neural networks are a modification of feedforward neural networks with the addition of internal memory. RNNs are iterative in nature as they perform the same function for each data input, but the output for the current input depends on the last computation. Unlike feedforward neural networks, RNNs can use internal states (memory) to process a sequence of inputs. This way the context of earlier inputs is retained while the sequence is processed.

C. Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) networks are modified versions of RNNs that are better at retaining past data in memory. They mitigate the vanishing gradient problem of RNNs. The LSTM network has three gates.
• Input Gate - Determines which values from the input are used to modify the memory. The sigmoid function determines the value passed.
• Forget Gate - Determines which details are discarded from the block. This is determined by the sigmoid function. Forget gates help mitigate the vanishing gradient problem in RNNs.
• Output Gate - Uses the block's input and memory to determine the output. The sigmoid function determines the value to pass and the tanh function weights the value.

D. Attention Based
While training an image model, the model should be able to focus on key elements of an image. This can be accomplished through attention mechanisms. Attention can be described as a function mapping a query vector Q together with a key-value vector pair K, V to an output. The

IJISRT23FEB961 www.ijisrt.com 1617


weights applied to the values V are computed by a softmax over the query-key scores. Running several such attention operations in parallel and combining their outputs is termed Multi-Head Attention.

The Transformer is an encoder-decoder model which uses a multi-headed attention mechanism to improve training speed. This model was designed to tackle problems in natural language processing but is now widely used in computer vision tasks since it handles long-term dependencies well. The encoder takes the inputs combined with the positional encodings to retain the order of sequential data. A layer in the encoder consists of the multi-head attention block, which performs self-attention. The output is fed into a feed forward neural network. The decoder works similarly but also receives keys and values from the encoder. Finally, the decoder output is fed into a feed-forward layer and a softmax layer to produce probabilities for the next item.
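As a rough illustration of the attention function described above, the following Python sketch computes scaled dot-product attention for small lists of vectors. The scaling by the square root of the key dimension follows the formulation in [1]; the function names and the list-based representation are assumptions made for this example and are not taken from any of the surveyed implementations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query.

    queries: list of vectors of length d_k
    keys:    list of vectors of length d_k
    values:  list of vectors, one per key
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        # Softmax turns the scores into weights that sum to 1.
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

Multi-head attention would run several such functions on differently projected copies of Q, K and V and concatenate the results; the projections are omitted here to keep the sketch short.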

Ref. No.: [4]
Algorithm/Technique: The model has three main components: a convolutional neural network backbone to get the feature representation, a transformer, and a simple feed forward network that makes the final prediction.
Results: The main result of this paper is that DETR can achieve results comparable to an optimized Faster R-CNN network on the COCO dataset. This means that the transformer network can be used to detect objects in images with performance similar to the well-established Faster R-CNN baseline.
Datasets Used: COCO 2017 and panoptic segmentation datasets

Ref. No.: [5]
Algorithm/Technique: A weakly supervised network using background attention that merges feature maps of various scales.
Results: The results of this paper show that the proposed Guided Attention Network (GANet) outperforms existing methods for object detection and counting on the CARPK, PUCPR+ and UAVDT datasets. Moreover, GANet also achieved better results in terms of speed and computational efficiency.
Datasets Used: UAVDT, CARPK, PUCPR+

Ref. No.: [6]
Algorithm/Technique: Semi-supervised Multi-Task Network with Self-Attention
Results: This paper shows that a multi-task learning approach with self-attention can improve the accuracy of object detection on two traffic surveillance datasets, UA-DETRAC and UAVDT. This model could potentially help with instance segmentation at almost no cost.
Datasets Used: UAVDT, UA-DETRAC

Ref. No.: [7]
Algorithm/Technique: Dense anchor scales with large scale variance for detection. Squeeze-and-Excitation blocks are used to capture the channel dependencies. A deep association network is used after the detection module and feeds the generated hypotheses to the DeepSORT network.
Results: The proposed model was able to accurately detect objects in drone images with a maximum of 500 detections using denser anchor scales with large scale variance, Squeeze-and-Excitation (SE) blocks, and a trained deep association network.
Datasets Used: VisDrone2019

Ref. No.: [8]
Algorithm/Technique: SMSOT-CNN, GOTURN (Tracking Using Regression Networks)
Results: The created model is able to accurately and efficiently track multiple pedestrians and cars in drone captured images. The results of the ablation study on the dataset showed that the proposed approach achieved the highest accuracy and the shortest execution time.
Datasets Used: KIT AIS, DLR's Aerial Crowd Dataset

Ref. No.: [9]
Algorithm/Technique: Attentional Asymmetric Siamese Network and Discriminative Correlation Filter
Results: The results of this paper show that the proposed DCF-ASN tracking framework, which combines discriminative correlation filters (DCF) with an asymmetric siamese network (ASN), achieves state-of-the-art performance on five popular tracking datasets.
Datasets Used: UAV123 and UAVDT

Ref. No.: [10]
Algorithm/Technique: A module for data augmentation. An algorithm to merge the tiny objects into several clusters of similar sizes. The candidate center points and the size of the clusters are found, and a network finely calculates the center points and sizes of the small objects.
Results: The network can effectively detect objects from aerial images, with precision results higher than the state-of-the-art approaches when it was implemented.
Datasets Used: VisDrone and UAVDT

Ref. No.: [11]
Algorithm/Technique: The TLD algorithm and the ATLD algorithm
Results: The results of this paper show that the Appearance and Tracking Learning Detection (ATLD) algorithm outperforms the original Tracking Learning Detection (TLD) algorithm in terms of accuracy. The ATLD algorithm also performs better than other benchmark learning based algorithms.
Datasets Used: Aerial sequences from the UCF website, TLD dataset and various classified sources

Ref. No.: [12]
Algorithm/Technique: LSTM, GCNN and a siamese neural network module
Results: The results of this paper show that the proposed AerialMPTNet method outperforms all earlier methods on the pedestrian datasets and achieves competitive results on the vehicle dataset. The results also show that adding LSTM and GCNN modules to the tracking algorithm improves the tracking performance.
Datasets Used: KIT AIS, DLR-ACD and AerialMPT

Ref. No.: [13]
Algorithm/Technique: The accuracy of detection is improved by separating the ReID and detection branches into two parts. This also makes the two parts more independent. Temporal information is used in target detection and the ReID head.
Results: The results of this paper show an improvement in performance on the tracking of multiple objects in UAV video compared to other models. Additionally, the model was able to detect more targets than the baseline model and reduced the issues of false and missed detections.
Datasets Used: VisDrone2019 dataset

Ref. No.: [14]
Algorithm/Technique: Built using MMDetection and PyTorch. Different modules are used for detection and for extraction of features. Merging of detected focal regions is done with the help of NMS, and final predictions are obtained using IBS.
Results: The results of this paper show that the two-stage object detection framework achieved a 42.06 AP score on the validation dataset of VisDrone, outperforming all other small object detection methods seen in the literature.
Datasets Used: VisDrone and UAVDT

Table 1: Summaries of papers focused on Aerial Image Sequences
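Several detectors in Table 1, for example [14], rely on Non-Max Suppression (NMS) to merge overlapping detections. A minimal sketch of greedy NMS is given below, assuming boxes in (x1, y1, x2, y2) format and an IoU threshold of 0.5. Both the box format and the threshold are illustrative choices for this example, not values reported by the surveyed papers.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy NMS.

    Boxes are visited from highest to lowest score; a box is kept only
    if it does not overlap an already kept box by more than the threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

The same IoU function is also the usual building block for associating detection boxes with tracking boxes between frames.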

Ref. No.: [15]
Algorithm/Technique: CNN, Spatial-Temporal Attention Mechanism
Results: The evaluation reports false negatives, false positives, identity switches and other errors that can occur when tracking multiple objects. The results show that the algorithm tracks multiple objects accurately.
Datasets Used: MOT15, MOT16

Ref. No.: [16]
Algorithm/Technique: ResNet-50, Transformer with multi-head attention
Results: The evaluation accounts for false negatives, false positives and identity switches when calculating the accuracy. TransTrack was found to be competitive with other methods.
Datasets Used: MOT17, MOT20, CrowdHuman

Ref. No.: [29]
Algorithm/Technique: ResNet-34, CenterNet
Results: While the time spent on re-ID matching grows linearly with density, the time spent on joint detection and re-ID is only slightly impacted by density.
Datasets Used: CrowdHuman, 2DMOT15, MOT16, MOT17 and MOT20

Ref. No.: [17]
Algorithm/Technique: Uses a convolutional block attention module (CBAM), constructed using a spatial temporal network on a ResNet-50 backbone.
Results: Uses a cost-sensitive tracking loss to concentrate on negative distractions, and attention networks which consist of both temporal and spatial mechanisms to suppress noisy observations. These mechanisms help the algorithm to track objects in video frames better than both online and offline trackers, as indicated by the identity-preserving metrics.
Datasets Used: MOT16, MOT17

Ref. No.: [18]
Algorithm/Technique: R-FCN architecture with an encoder-decoder block and Kalman Filter
Results: The R-FCN (Region-Based Fully Convolutional Network) is a deep learning model that performs the majority of its calculations on the entirety of the image. ReID (Person Re-Identification) features are deep-learned appearance representations that are trained on huge person re-identification datasets to improve identification ability.
Datasets Used: MOT 16, IDRI

Ref. No.: [28]
Algorithm/Technique: The main neural network used is ResNet-50. It generates a multi-scale feature representation by integrating Feature Pyramid Networks.
Results: The model takes into consideration various factors, such as false negatives, false positives, fragmentation and identity switches. The proposed model is also faster and simpler than existing methods, and it does not require any extra training data.
Datasets Used: MOT16, MOT17

Ref. No.: [19]
Algorithm/Technique: Attention based model that makes use of higher order attention mechanisms for re-identification.
Results: The experiments performed to validate the superiority of the MHN for person re-identification show that it outperforms a wide range of state-of-the-art methods on various datasets, which include Market-1501, DukeMTMC-ReID and CUHK03-NP.
Datasets Used: Market-1501, DukeMTMC-ReID and CUHK03-NP

Ref. No.: [20]
Algorithm/Technique: Bidirectional recurrent neural network (Bi-RNN) which determines the assignment matrix based on the prediction-to-ground-truth distance matrix.
Results: The results of this paper show that the proposed framework increases the performance of existing multi-object trackers, and on the MOTChallenge benchmark it established a new state-of-the-art score.
Datasets Used: MOT15, MOT16, MOT17

Ref. No.: [21]
Algorithm/Technique: Variational Bayesian Model, VEM algorithm
Results: The paper's results show that the proposed OVBT (Online Variational Bayesian Tracker) algorithm achieves high precision (MOTP) but low accuracy (MOTA) on the MOT 2016 dataset. This is likely because the algorithm is sometimes unable to detect targets or misidentifies them, leading to identity switches (ID) and missed targets (FN). The results also show that when multiple observations are considered within the visibility process, performance is improved for all sequences and most measures.
Datasets Used: MOT16

Ref. No.: [22]
Algorithm/Technique: Joint Detection and Embedding with the DarkNet-53 backbone network. A feature pyramid network is also used.
Results: The results of this paper show that the proposed MOT system is the first real-time MOT system, with a high running speed of about 22 to 40 frames per second. It also has a high MOTA score.
Datasets Used: ETH, CityPersons, CalTech, MOT16, PRW and CUHK-SYSU datasets

Ref. No.: [23]
Algorithm/Technique: Global Context Disentanglement, Deformable Attention, Guided Transformer Encoder
Results: The paper demonstrates the superiority of the proposed RelationTrack MOT framework. The experiments conducted on the MOT20, MOT16 and MOT17 benchmarks show that the proposed framework has established a new state-of-the-art performance and has surpassed preceding methods.
Datasets Used: MOT16, MOT17 and MOT20

Ref. No.: [24]
Algorithm/Technique: Joint object detection and tracking system (Transformer-based)
Results: The paper presents a joint-detection-and-tracking system called PatchTrack that uses patches from the current frame of interest in order to infer both appearance information and object motion.
Datasets Used: MOT16, MOT17, CrowdHuman

Ref. No.: [25]
Algorithm/Technique: Uses the YOLOX detector and performs association between the detection boxes and tracks.
Results: ByteTrack can significantly improve the IDF1 score. It has achieved a high MOTA performance.
Datasets Used: MOT17, MOT20, ETHZ

Ref. No.: [26]
Algorithm/Technique: A frame-level hypothesis generation module, a track-level memory encoding module, a memory decoding module
Results: The results of the paper show that MeMOT achieves better performance in normal as well as crowded scenarios. MeMOT also achieves good performance on the MOT Challenge benchmarks for pedestrian tracking, outperforming other methods.
Datasets Used: MOT16, MOT17, MOT20

Ref. No.: [27]
Algorithm/Technique: Encoder-decoder transformer
Results: The model achieves high performance for both public and private detections on MOT17 and MOT20. It is able to track objects accurately for a long period of time.
Datasets Used: MOT17, MOTS20

Table 2: Summaries of papers focused on Pedestrian Tracking
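Many of the results in Table 2 are stated in terms of MOTA and IDF1. As a reminder of what these scores measure, the sketch below computes them from error counts using the standard CLEAR MOT and identity-metric definitions. The function and argument names are hypothetical, and any counts passed in would come from an evaluation tool that has already matched predictions to ground truth.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_objects):
    """Multiple Object Tracking Accuracy.

    MOTA = 1 - (FN + FP + IDSW) / GT, where every count is summed over
    all frames of the sequence. It can be negative when the tracker
    makes more errors than there are ground-truth objects.
    """
    if num_gt_objects <= 0:
        raise ValueError("num_gt_objects must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt_objects


def idf1(id_true_positives, id_false_positives, id_false_negatives):
    """Identity F1 score: 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    denominator = 2 * id_true_positives + id_false_positives + id_false_negatives
    if denominator == 0:
        return 0.0
    return 2 * id_true_positives / denominator
```

MOTA is dominated by detection errors (FN and FP), whereas IDF1 rewards keeping the same identity on a target over time, which is why trackers are usually compared on both.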

III. STRUCTURED RELATED WORK

Vaswani, Ashish, et al. [1] introduce the Transformer model, which follows the encoder-decoder approach. The encoder extracts features from the input and passes them on to the feed forward neural network. The decoder receives keys and values from the encoder block. The model is auto-regressive, which means that it uses the previously generated outputs as the input for the present step. With the Transformer model, which is based solely on attention mechanisms, better quality was achieved compared to RNN and CNN based models. Owing to parallelization, the Transformer model took significantly less time to train.

Beheim, Tsuyoshi [2] discusses tracking of multiple vehicles in aerial image sequences. To establish a benchmark, MOT methods from different fields are applied to an aerial dataset. Based on that, several adjustments are made to examine the impact on the methods' ability to track. The proposed set of adjustments includes a motion predictor, object re-identification, and a vehicle orientation prediction module.

Jetley, Saumya, et al. [3] employed attention maps to locate and utilize the useful spatial support of visual data that CNNs use to make classification predictions. The paper computes a matrix of scalar scores that shows how important layer activations are in relation to the goal at various 2D spatial locations. When the technique was used throughout a network, the model's performance improved drastically.

Carion, Nicolas, et al. [4] proposed a novel approach where the detection of objects is treated as a direct set prediction problem. The method simplifies the pipeline by eliminating the need for many hand-designed components, like a non-maximum suppression mechanism or anchor generation, that explicitly encode prior knowledge about the task. The key components of the proposed framework are a transformer architecture with an encoder and a decoder, and a global loss that ensures unique predictions through bipartite matching. Given a fixed, small set of object queries, the transformer reasons about the different objects, the relationships between them, and the overall context of the image, and gives the final predictions as output.

Cai, Yuanqiang, et al. [5] introduce the guided attention network for the detection of objects in scenes that are captured using drones. The method is an anchor-free approach that fuses the feature maps of different scales by making use of background attention to learn a background-discriminative representation, and makes use of the foreground attention module to examine the local view of the object. It uses data augmentation on the training data, synthesizing image brightness from different settings and adding noise to imitate different weather conditions, which leads to better accuracy.

Hughes Perreault, et al. [6] implemented a strategy that makes use of multi-task learning to create a network with attention. Segmentation labels are generated for the foreground and the background in a supervised and unsupervised manner to train visual attention via background subtraction or optical flow. With the help of these labels, an object detection model is trained to generate bounding boxes and foreground/background segmentation maps while sharing the majority of model parameters. The segmentation maps are employed inside the network to weight the feature maps that lead to the creation of the bounding boxes, reducing the weight of irrelevant parts of the image. The model is thus trained to do multiple tasks: segmentation and bounding box detection.

Jadhav, Ajit, et al. [7] created a model for object recognition in aerial view photos. The model is built on top of the RetinaNet model. The anchor scales are adjusted for the dense distributions and smaller objects. Squeeze-and-Excitation (SE) blocks are used to model channel interdependencies. This contributes to large performance increases at a small additional computational cost. A custom-built DeepSORT network is used to track the objects detected by the above architecture on the VisDrone2019 MOT dataset.
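To make the Squeeze-and-Excitation (SE) idea mentioned for [7] more concrete, the following plain Python sketch reweights the channels of a small feature map. It follows the general SE recipe (global average pooling, two fully connected layers with ReLU and sigmoid, then channel-wise scaling); the function names and the weight matrices passed in are hypothetical and would be learned parameters in a real network.

```python
import math


def dense(vector, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def squeeze_excite(feature_map, w1, b1, w2, b2):
    """Apply an SE block to `feature_map` (a list of C channels, each H x W).

    w1, b1: reduce C channel descriptors to a smaller hidden size
    w2, b2: expand the hidden vector back to C gate values
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: bottleneck MLP with ReLU, then sigmoid gates in (0, 1).
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]
    gates = [1.0 / (1.0 + math.exp(-g)) for g in dense(hidden, w2, b2)]
    # Scale: multiply every value of a channel by that channel's gate.
    return [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
```

Because the block only adds two small fully connected layers per stage, the extra computation is small relative to the convolutional backbone.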

Bahmanyar, Reza, Seyedmajid Azimi, and Peter Reinartz [8] propose a model based on CNNs. For each object, the CNN extracts features from two consecutive frames, the current frame and the previous one. The object which is to be tracked is obtained from the previous frame and the search region is extracted from the current frame. These frames are then processed to predict the locations of the objects, in a manner that facilitates parallelism.

Xizhe Xue, et al. [9] propose a tracking framework that first coarsely infers the state of the target using a discriminative correlation filter (DCF) module and then accurately locates the object using a trained Asymmetric Siamese Network. Tracking is done with the discriminative correlation filter trackers, which can be computed efficiently with a frequency domain transform. Using guidance from the DCF and the channel weights learnt from the annotated data, the model refines the feature representation and accurately finds the object.

Tang, Ziyang, Xiang Liu, and Baijian Yang [10] propose a network which employs a module to improve the unbalanced datasets, an anchor-free detector to coarsely estimate the center points of tiny object clusters, and another anchor-free detector to finely increase precision. To improve the viability of detecting dense tiny objects, an adaptive merging technique is used with a hierarchical loss function that is designed to optimize classification.

Malagi, Vindhya P, Ramesh Babu DR, and Krishnan Rangarajan [11] propose a technique for tracking multiple and single objects in aerial photos, termed aerial tracking learning detection, that is based on the well-known TLD (Tracking Learning Detection) algorithm and tracks using both motion and appearance information. The proposed algorithm includes adjustments for camera movement, algorithmic alterations that integrate appearance and motion cues for multiple object tracking and detection, and improvements to the distance between objects to enhance tracker performance when there are several similar objects close by.

Azimi, Seyed Majid, et al. [12] assessed a range of conventional and neural network based Single Object and Multi Object Tracking algorithms on how they handle various issues in tracking multiple pedestrians and vehicles in high-resolution aerial images. The proposed Deep Learning system combines graphical, temporal, and appearance data utilizing a Siamese Neural Network, LSTM, and a GCNN module for more precise and steady tracking. Additionally, Online Hard Example Mining and Squeeze-and-Excitation layers are looked into to see how they affect the system's performance.

Lin Y, Wang M, Chen W, Gao W, Li L and Liu Y [13] implement a technique based on FairMOT that suggests an enhanced multiple object tracking paradigm. The detection and ReID procedures are separated in the designed structure of the model to reduce the effect one function has on the other. A temporal embedding structure is also created to enhance the representation capacity of the model. The performance of the model in object tracking and object detection tasks is enhanced by merging temporal-association structures and separating distinct processes.

Koyun, Onur Can, et al. [14] suggest a two-stage object detection system to handle the issue of small object detection. The focused zones are produced from clusters of objects in stage one, which comprises an object detection network under the supervision of a probabilistic model. The second stage, also an object detection network, predicts objects in the focus zones. To get around the truncation impact of the region search methodology, a method of suppression is also put forth. After combining the predicted boxes, overlapping boxes are suppressed using the standard Non-Max Suppression (NMS) algorithm.

Chu, Qi, et al. [15] apply single object tracking techniques to MOT. They make use of dynamic CNN-based spatial temporal networks with ROIPooling and shared CNN features. Features are obtained from the search area provided by the motion model. These features are weighted using spatial attention. The candidate with the highest score is marked as the target state. Samples which were previously positive are also used for updating the tracker.

Sun, Peize, et al. [16] introduce a simple yet effective solution to the multiple object tracking problem, proposed in their paper as TransTrack. It employs the transformer architecture. The CNN takes the image frame as input and outputs the feature map. This feature map, along with the previous map, is passed on to the encoder to generate common features. Two decoder blocks are used: one takes the features and determines the detection boxes, while the other generates tracking boxes from previous frames. IoU matching is used to associate these boxes.

Zhu, Ji, et al. [17] implement a technique based on single object tracking. The model uses a spatial attention network which determines the area of interest by feature extraction. A single object tracker monitors each target in each frame. The tracklet consists of neighboring frames to preserve the trajectory of occluded objects. The temporal attention network is made up of bidirectional LSTMs. Weighted average pooling is used to determine if the identified object newly entered the frame or had been occluded. The target is placed in a lost state and the tracker is stopped when the tracking process becomes unreliable. The framework can handle both noisy and occluded data.

Chen, Long, et al. [18] use an encoder-decoder network with a fully convolutional architecture. The score maps for the entire image can be predicted from one image frame. The encoder component is a shallow convolutional backbone. Features from the encoder network are concatenated with up-sampled features in the decoding part to capture both the semantic and low-level information.

Chen, Binghui, Weihong Deng, and Jiani Hu [19] propose a module in order to capture subtle differences among pedestrians and offer discriminative attention recommendations. It is designed to represent and exploit complex and high-order statistical information in the attention mechanism. ReID is seen as a zero-shot learning
problem that explicitly enhances the richness and discrimination of attention information by using the Mixed High-Order Attention Network.

Xu, Yihong, et al. [20] suggest a module that simulates the Hungarian matching process. In order to directly improve deep trackers, the module enables evaluating the correlation between ground truth objects and object tracks. There are currently just a few submodules that are trained using loss functions, and these loss functions frequently do not correspond with standard tracking assessment metrics like MOTA and MOTP.

Ban, Yutong, et al. [21] suggest a variational Bayesian model for tracking multiple persons. A variational expectation maximization algorithm results from this. Due to the use of closed-form equations for both the estimate of the parameters of the model and the posterior distributions of the latent variables, the proposed method is computationally efficient. The tracker can manage a variable number of people over extended time periods.

Wang, Zhongdao, et al. [22] propose a system which combines the detection model, used to isolate targets, and the appearance embedding model, used to link detections between frames, into one framework. The model can detect many objects at once and is faster than current methods. It is trained on several tasks, including bounding box regression and anchor classification, where the separate losses are automatically weighted.

Yu, En, et al. [23] propose a module which disentangles the learnt representation to give embeddings unique to detection and ReID. The implicit method provided by this module helps to balance the requirements of these subtasks. MOT techniques often use local information to link the identified objects together without taking into account the global semantic relationship. The authors employ a module called the Guided Transformer Encoder to get around this limitation. It combines the robust reasoning capacity of the Transformer Encoder with deformable attention.

Chen, Xiaotong, Seyed Mehdi Iranmanesh, and Kuo-Chin Lien [24] propose the model PatchTrack, a combined tracking and detection system based on Transformers that can predict tracks using patches from the current frame of interest. The Kalman filter is used to forecast the locations of tracks that are already visible in the current frame. Patches cropped from the predicted bounding boxes are sent to the Transformer decoder in order to infer new tracks. The proposed approach leverages object motion and object appearance information encoded in patches to concentrate on the regions where new tracks are most likely to develop.

Cai, Jiarui, et al. [26] propose a model which takes sequences of frames as input and generates trajectories of the tracked objects. The model uses a memory which holds the tracked states of different objects. It uses a transformer encoder to convert the frames to embeddings, which are fed to the decoder while simultaneously being stored in memory. The output from the decoder and the embeddings from the memory buffer are fused together to generate bounding boxes.

Meinhardt, Tim, et al. [27] propose TrackFormer, which uses attention to track objects. The model uses a CNN to extract features from the input images, which are fed to the transformer encoder. The encoder makes use of attention to focus on important features and generate bounding boxes. The decoder generates new tracks and also follows existing tracks using track queries.

Zhang, Yifu, et al. [28] propose FairMOT, a single framework which combines detection and re-ID through multi-task learning. This approach allows joint optimization of the two tasks. It achieves great accuracy for both tasks and outperforms state-of-the-art methods. Multi-task learning is a technique that allows a single model to learn several tasks at once, sharing information between them to improve performance.

Peng, Jinlong, et al. [29] introduce a new multiple-object tracking (MOT) model called Chained-Tracker (CTracker), which is an end-to-end solution that integrates all three subtasks of object detection, feature extraction, and data association into a single framework. It also provides open source code for the model, making it freely available for anyone to use and improve. This could allow for more efficient and accurate tracking systems to be developed, and facilitate further research into MOT. Finally, the paired attentive regression methodology used in this model could be applied to other computer vision tasks, potentially leading to more efficient and effective models.

IV. DATASETS

A. UAVDT
UAVDT is a massive Detection and Tracking dataset for detection, Single Object Tracking and Multi-Object Tracking.

The dataset contains about 80,000 frames extracted from 10 hours of UAV footage. The videos are recorded in various complex scenarios like in the rain, fog, night, etc. The main objects of interest in this dataset are vehicles
Zhang, Yifu, et al. [25] propose a simple method where each frame is annotated with bounding boxes around
called BYTE. The model detects objects and generates the vehicles and are described with various attributes like
bounding boxes. These boxes are classified into two weather condition, occlusion, elevation, vehicle type. etc.
categories, high score and low score. The model first The videos are recorded with image resolution of 1080x540
connects the tracklets to the high score boxes, if they do not pixels at 30 fps.
match, the low score boxes are linked to the objects. To
predict new locations, Kalman filters are used.

B. VisDrone2019
VisDrone2019 is a sizable benchmark dataset with annotated ground truth for a variety of significant computer vision applications. Over 280 video segments totaling 261,908 frames and 10,209 static pictures make up the benchmark dataset. The dataset includes information on location, environment, objects, and density, among other things. It was gathered using a variety of drone platforms in varied contexts, with varying weather and illumination conditions. The frames are carefully annotated with bounding boxes around targets of common interest, including pedestrians, vehicles, bicycles, and tricycles. The VisDrone2019 benchmark is used for single and multiple object tracking in pictures and videos.

C. UA-DETRAC
This is a difficult real-world benchmark for multi-object detection and tracking. The dataset is made up of 10 hours' worth of videos that were recorded at 24 different locations in cities in China. The videos have a resolution of 960 x 540 pixels and are shot at 25 fps. The UA-DETRAC collection contains more than 140 thousand frames and 8250 manually annotated vehicles, totaling over a million labeled bounding boxes.

D. MOT 16/17
The MOT16/17 benchmarks are often used in MOT to detect and track pedestrians. The fourteen sequences that make up MOT16 encompass a range of situations, vantage positions, camera angles, and weather conditions. In MOT16, 7 image sequences are used for training and 7 for testing. MOT17 was rebuilt from MOT16 with improvements. In comparison to MOT16, MOT17 offers more accurate ground truth and more bounding boxes, with detections from a variety of detectors, including DPM, SDP, and Faster RCNN.

E. DOTA
The DOTA dataset contains over 1.7 million object instances in 18 categories, annotated with oriented bounding boxes and collected from more than 11,000 aerial images. Baselines encompassing 10 algorithms with more than 70 configurations were built on this large and well-annotated dataset. The speed and accuracy of each model are compared in the paper published by Ding, Jian, et al. [30].

V. METRICS USED

A. MOTA (Multiple Object Tracking Accuracy)
MOTA is the most popular measure for Multiple Object Tracking and the one that most closely matches human visual assessment. In every frame, predicted detections are matched to ground-truth detections when they are sufficiently similar, and the matches are used to compute the contingency table values (true positives, false positives and false negatives). Identity Switch (IDSW), which happens when a tracker mistakenly switches the identities of the objects or when a tracker loses its target and is then re-initialized with a different object, is used to quantify association errors. Three types of tracking mistakes are measured by MOTA: ID Switch, False Positive, and False Negative.

$$\mathrm{MOTA} = 1 - \frac{|FN| + |FP| + |IDSW|}{|gtDet|}$$

Here FN, FP and IDSW are the sets of false negatives, false positives and identity switches, and gtDet is the set of ground-truth detections.

B. MOTP (Multiple Object Tracking Precision)
MOTP is a metric used for evaluating precision in MOT algorithms. It is defined as the mean localization score between the estimated object location and the matched ground-truth object location, taken over all true-positive matches. In simpler terms, it is a measure of how well the exact positions of objects are estimated. As MOTP primarily measures the detector's localization accuracy, it says little about the tracker's actual performance.

$$\mathrm{MOTP} = \frac{1}{|TP|} \sum_{TP} S$$

Here TP is the set of true-positive matches and S is the localization score of each match.

C. IDF1 (Identification Metrics)
IDF1 is utilized as a supplementary statistic on the MOTChallenge benchmark since it focuses on quantifying association accuracy rather than detection accuracy. IDF1 computes a bijective (one-to-one) mapping between ground-truth trajectories and predicted trajectories. The IDF1 ratio measures the proportion of correctly identified detections to the average number of computed detections and ground-truth detections. A high IDF1 score indicates a good estimate of the overall number of distinct objects in a scene more than it indicates effective detection or association. Additionally, it does not assess how well trackers perform localization.

$$\mathrm{IDF1} = \frac{|IDTP|}{|IDTP| + 0.5\,|IDFN| + 0.5\,|IDFP|}$$

VI. CONCLUSION

Multiple Object Tracking is being researched around the world by many institutions and organizations. This paper offers a comprehensive view of recent advancements in the field, covering a range of techniques including those that have produced state-of-the-art results. In particular, this paper pays more attention to techniques that tackle issues related to object tracking in aerial image sequences, including new architectures such as attention models. The various techniques and algorithms are compared and tabulated with their pros and cons. Finally, this paper goes through the common datasets and benchmarks used, and potential strategies to tackle the MOT challenges.

REFERENCES

[1.] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
[2.] Beheim, Tsuyoshi. (2021). Multi-Vehicle Detection and Tracking in Aerial Imagery Sequences using Deep Learning Algorithms.

[3.] Jetley, S., Lord, N. A., Lee, N., & Torr, P. H. (2018). Learn to pay attention. arXiv preprint arXiv:1804.02391.
[4.] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020, August). End-to-end object detection with transformers. In European conference on computer vision (pp. 213-229). Springer, Cham.
[5.] Cai, Y., Du, D., Zhang, L., Wen, L., Wang, W., Wu, Y., & Lyu, S. (2019). Guided attention network for object detection and counting on drones. arXiv preprint arXiv:1909.11307.
[6.] Perreault, H., Bilodeau, G. A., Saunier, N., & Héritier, M. (2020, May). Spotnet: Self-attention multi-task network for object detection. In 2020 17th Conference on Computer and Robot Vision (CRV) (pp. 230-237). IEEE.
[7.] Jadhav, A., Mukherjee, P., Kaushik, V., & Lall, B. (2020, February). Aerial multi-object tracking by detection using deep association networks. In 2020 National Conference on Communications (NCC) (pp. 1-6). IEEE.
[8.] Bahmanyar, R., Azimi, S. M., and Reinartz, P.: MULTIPLE VEHICLES AND PEOPLE TRACKING IN AERIAL IMAGERY USING STACK OF MICRO SINGLE-OBJECT-TRACKING CNNS, Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLII-4/W18, 163–170, https://doi.org/10.5194/isprs-archives-XLII-4-W18-163-2019, 2019.
[9.] Xue, X., Li, Y., Yin, X., & Shen, Q. (2021). DCF-ASN: Coarse-to-fine Real-time Visual Tracking via Discriminative Correlation Filter and Attentional Siamese Network. arXiv preprint arXiv:2103.10607.
[10.] Tang, Z., Liu, X., & Yang, B. (2020, December). PENet: Object detection using points estimation in high definition aerial images. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA) (pp. 392-398). IEEE.
[11.] Malagi, Vindhya & Babu, Ramesh & Rangarajan, Krishnan. (2016). Multi-object Tracking in Aerial Image Sequences using Aerial Tracking Learning and Detection Algorithm. Defence Science Journal. 66. 122. 10.14429/dsj.66.8972.
[12.] Azimi, S. M., Kraus, M., Bahmanyar, R., & Reinartz, P. (2020). Multiple pedestrians and vehicles tracking in aerial imagery: A comprehensive study. arXiv preprint arXiv:2010.09689.
[13.] Lin Y, Wang M, Chen W, Gao W, Li L, Liu Y. Multiple Object Tracking of Drone Videos by a Temporal-Association Network with Separated-Tasks Structure. Remote Sensing. 2022; 14(16):3862. https://doi.org/10.3390/rs14163862
[14.] Koyun, O. C., Keser, R. K., Akkaya, İ. B., & Töreyin, B. U. (2022). Focus-and-Detect: A small object detection framework for aerial images. Signal Processing: Image Communication, 104, 116675.
[15.] Chu, Q., Ouyang, W., Li, H., Wang, X., Liu, B., & Yu, N. (2017). Online multi-object tracking using CNN-based single object tracker with spatial-temporal attention mechanism. In Proceedings of the IEEE international conference on computer vision (pp. 4836-4845).
[16.] Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., ... & Luo, P. (2020). Transtrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460.
[17.] Zhu, J., Yang, H., Liu, N., Kim, M., Zhang, W., & Yang, M. H. (2018). Online multi-object tracking with dual matching attention networks. In Proceedings of the European conference on computer vision (ECCV) (pp. 366-382).
[18.] Chen, L., Ai, H., Zhuang, Z., & Shang, C. (2018, July). Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In 2018 IEEE international conference on multimedia and expo (ICME) (pp. 1-6). IEEE.
[19.] Chen, B., Deng, W., & Hu, J. (2019). Mixed high-order attention network for person re-identification. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 371-381).
[20.] Xu, Y., Osep, A., Ban, Y., Horaud, R., Leal-Taixé, L., & Alameda-Pineda, X. (2020). How to train your deep multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6787-6796).
[21.] Ban, Y., Ba, S., Alameda-Pineda, X., Horaud, R. (2016). Tracking Multiple Persons Based on a Variational Bayesian Model. In: Hua, G., Jégou, H. (eds) Computer Vision – ECCV 2016 Workshops. ECCV 2016. Lecture Notes in Computer Science, vol 9914. Springer, Cham. https://doi.org/10.1007/978-3-319-48881-3_5
[22.] Wang, Z., Zheng, L., Liu, Y., Li, Y., & Wang, S. (2020, August). Towards real-time multi-object tracking. In European Conference on Computer Vision (pp. 107-122). Springer, Cham.
[23.] Yu, E., Li, Z., Han, S., & Wang, H. (2022). Relationtrack: Relation-aware multiple object tracking with decoupled representation. IEEE Transactions on Multimedia.
[24.] Chen, X., Iranmanesh, S. M., & Lien, K. C. (2022). PatchTrack: Multiple Object Tracking Using Frame Patches. arXiv preprint arXiv:2201.00080.
[25.] Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., ... & Wang, X. (2022). Bytetrack: Multi-object tracking by associating every detection box. In European Conference on Computer Vision (pp. 1-21). Springer, Cham.
[26.] Cai, J., Xu, M., Li, W., Xiong, Y., Xia, W., Tu, Z., & Soatto, S. (2022). MeMOT: Multi-Object Tracking with Memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8090-8100).
[27.] Meinhardt, T., Kirillov, A., Leal-Taixe, L., & Feichtenhofer, C. (2022). Trackformer: Multi-object tracking with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8844-8854).
[28.] Peng, J., Wang, C., Wan, F., Wu, Y., Wang, Y., Tai, Y., ... & Fu, Y. (2020, August). Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In European conference on computer vision (pp. 145-161). Springer, Cham.
[29.] Zhang, Y., Wang, C., Wang, X., Zeng, W., & Liu, W. (2021). Fairmot: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129(11), 3069-3087.
[30.] Ding, Jian, et al. Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges.
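
As a supplementary illustration of the metric definitions in Section V, the following minimal Python sketch evaluates the MOTA, MOTP and IDF1 formulas from quantities that have already been counted. The function names, argument names and example counts are hypothetical and chosen only for illustration. The matching step that assigns predictions to ground truth is assumed to have been done elsewhere, and the per-match score S is assumed here to be a similarity such as bounding-box overlap.

```python
from typing import Sequence


def mota(num_fn: int, num_fp: int, num_idsw: int, num_gt: int) -> float:
    """MOTA = 1 - (|FN| + |FP| + |IDSW|) / |gtDet|."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (num_fn + num_fp + num_idsw) / num_gt


def motp(tp_scores: Sequence[float]) -> float:
    """MOTP = mean of the localization score S over all true-positive matches."""
    if len(tp_scores) == 0:
        raise ValueError("at least one true-positive match is required")
    return sum(tp_scores) / len(tp_scores)


def idf1(num_idtp: int, num_idfn: int, num_idfp: int) -> float:
    """IDF1 = |IDTP| / (|IDTP| + 0.5 |IDFN| + 0.5 |IDFP|)."""
    denominator = num_idtp + 0.5 * num_idfn + 0.5 * num_idfp
    if denominator == 0:
        raise ValueError("all identity counts are zero")
    return num_idtp / denominator


# Hypothetical counts for a short sequence (assumed values, not from any dataset).
example_mota = mota(num_fn=20, num_fp=10, num_idsw=5, num_gt=100)  # 1 - 35/100
example_motp = motp([0.9, 0.8, 0.7])                               # mean of three scores
example_idf1 = idf1(num_idtp=80, num_idfn=20, num_idfp=10)         # 80 / 95

# MOTA has no lower bound of zero: more errors than ground-truth detections
# gives a negative score.
negative_mota = mota(num_fn=60, num_fp=50, num_idsw=10, num_gt=100)  # 1 - 120/100

print(f"MOTA={example_mota:.3f} MOTP={example_motp:.3f} IDF1={example_idf1:.3f}")
```

The sketch keeps the three metrics as separate functions because, as noted in Section V, they measure different things: MOTA aggregates detection and identity errors, MOTP reflects localization only, and IDF1 reflects identity association.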
