
Applied Intelligence

https://doi.org/10.1007/s10489-021-02293-7

Deep learning in multi-object detection and tracking: state of the art


Sankar K. Pal1 · Anima Pramanik2 · J. Maiti2 · Pabitra Mitra3

Accepted: 26 February 2021


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021

Abstract
Object detection and tracking is one of the most important and challenging branches of computer vision, and has been widely applied in various fields, such as health-care monitoring, autonomous driving, and anomaly detection. With the rapid development of deep learning (DL) networks and GPU computing power, the performance of object detectors and trackers has improved greatly. To provide a thorough understanding of the main development status of the object detection and tracking pipeline, in this survey we critically analyze the existing DL network-based methods of object detection and tracking and describe various benchmark datasets. This includes the recent development in granulated DL models. Primarily, we provide a comprehensive overview of a variety of both generic and specific object detection models. We list various comparative results for obtaining the best detector, tracker, and their combination. Moreover, we list the traditional and new applications of object detection and tracking, showing their developmental trends. Finally, challenging issues, including the relevance of granular computing, in the said domain are elaborated as a future scope of research, together with some concerns. An extensive bibliography is also provided.

Keywords Deep learning (DL) · Object detection · Object tracking · Video analysis · Machine learning · Granular computing

This article belongs to the Topical Collection: 30th Anniversary Special Issue

Corresponding author: Anima Pramanik, [email protected]
Sankar K. Pal, [email protected] · J. Maiti, [email protected] · Pabitra Mitra, [email protected]

1 Center for Soft Computing Research, Indian Statistical Institute, Kolkata, West Bengal 700108, India
2 ISE Dept., IIT Kharagpur, Kharagpur, West Bengal 721302, India
3 Department of Computer Science & Engineering, IIT Kharagpur, Kharagpur 721302, India

1 Introduction

In recent years, object detection and tracking has gained increasing attention due to its wide range of applications and recent breakthrough research. Object detection and tracking is equally important in real-world applications and in academia. Some real-world applications include autonomous driving, security monitoring, transportation surveillance, and robotic vision [1]. A variety of sensing modalities, such as radar, Light Detection and Ranging (LIDAR), and computer vision (CV), have become available for object detection and tracking. Imaging technology has progressed immensely in recent years: cameras are cheaper, smaller, and of higher quality than ever before. Concurrently, computing power has increased dramatically, and computing platforms are geared toward parallelization, such as multi-core processing and the graphics processing unit (GPU). Such hardware allows CV-based object detection and tracking to pursue real-time implementation. Rapid development in deep convolutional neural networks (CNNs) and the GPU's enhanced computing power are the main reasons behind the fast evolution of CV-based object detection and tracking.

In this context, let us mention the evolution of deep learning (DL) from machine learning (ML) and their characteristic differences. ML is a branch of artificial intelligence (AI), and it basically means learning patterns from examples or sample data. Here the machine is given
access to the data and has the ability to learn from it. The data (or examples) could be labeled, unlabeled, or their combination. Accordingly, the learning could be supervised, unsupervised, or semi-supervised. Artificial neural networks (ANNs) that have the ability to learn the relation between input and output from examples are good candidates for ML. ANNs enjoy characteristics like adaptivity, speed, robustness/ruggedness, and optimality. In the early 2000s, certain breakthroughs in multi-layered neural networks (MLP) facilitated the advent of deep learning. DL means learning in depth in different stages [2]. DL is thus a specialized form of ML which takes the latter to the next level in an advanced form. DL is characterized by learning data representations, in contrast to task-specific algorithms [3]. The convolutional neural network (CNN) represents one such deep architecture, which is most popular for learning with images and video.

In the DL framework, the problem of object recognition can be viewed as a task of labeling different objects in an image frame with their correct classes and predicting their bounding boxes with a high probability. The learning performance in DL depends on the number of samples (or previous experiences): the larger the number, the more accurate the performance. Today, we have abundant data, which, in turn, makes DL a meaningful choice [3, 4]. However, DL often needs hundreds or thousands of images for obtaining the best results, unlike conventional (shallow) learning; the term "shallow" is meant in contrast to "deep" [3, 4]. Therefore, DL is computationally intensive and difficult to engineer. It requires a high-performance GPU to provide very fast object recognition and motion detection.

DL models can be used in both generic and domain-specific object detection and tracking. In the detection network, a deep CNN is used as a backbone to extract the key features from an input image/video frame. These features are used to localize and classify the objects in the same frame. Thereafter, in object tracking, these detected objects are tracked based on feature-nearness from frame to frame. Object detection refers to scanning and searching for objects of certain classes (e.g., human, car, and building) in an image/video frame. In the domain of object detection, diverse studies have been conducted, including edge detection [5, 6], image segmentation [7, 8], pose detection [9], face detection [10], multi-category detection [11], pedestrian detection [12], scene text detection [13], and salient object detection [14, 177]. Object detection is the heart of scene understanding, so it has wide use in various fields, including security, military, transportation, and medicine. Further, segmentation is the mother task of object detection in an image. Segmentation can be performed using various conventional and modern approaches [15]. Better segmentation results in higher object detection accuracy. As the task is unsupervised, segmentation poses several challenging issues.

Object detection can be performed using either image processing techniques or DL networks. Image processing techniques usually do not require historical data for training and are unsupervised in nature. But these techniques are restricted by various factors, such as complex scenarios, illumination effects, occlusion effects, and clutter effects. All these issues are better tackled in DL-based object detection. The working principle of DL networks is supervised in nature, and is restricted by the need for a huge amount of training data and GPU computing power. Many benchmark datasets, for example, Caltech [16], KITTI [17], ImageNet [18], PASCAL VOC [19], MS COCO [20], and V5 [21], have already been developed in the object detection field. Due to the availability of such a huge amount of data and the development of GPUs, DL network-based object detection is widely accepted by researchers.

Object detection is followed by object tracking. The aim of object tracking is to localize the trajectory of a detected object and link it to that object. An efficient and robust system design is required to track objects in either a domain-specific or a generic scenario. This target is fulfilled by recently developed DL networks. For example, consider the research on DL networks for image classification in the ILSVRC 2012 competition [22], where the error rate was reduced by 10% as compared to conventional methods. Thereafter, newer, deeper learning networks were gradually developed for the classification of images. They are well received by the vision community due to their efficiency. Advancements in object detection are observed in face recognition [23], re-identification of persons [24], image semantic segmentation [15, 25], and action recognition [26], among others. All the successes of DL networks for object detection inspire improvement in object tracking. However, DL networks cannot be directly used for object tracking, since for tracking, objects need to be detected [27–29] first from the image frame, either manually or by a network using supervised or semi-supervised learning. This learning task requires huge samples to learn the features of the selected object(s).

Earlier DL networks [30] were inferior to the correlation filter [31] for object tracking. Thereafter, different strategies were revealed to improve DL for object tracking [3, 32, 33]. These strategies may be classified based on three main aspects: i) more samples are used to perform the feature learning for tracking objects [34, 35], ii) features are extracted from multiple layers or low layers of deep CNNs [36, 37], and iii) deep networks (end-to-end) are developed to obtain the tracking results directly [38]. Recently, two reviews [39, 40] have been published on DL for object tracking. Multiple object tracking (MOT) is
more complicated than single object tracking and is more applicable in real-time scenarios. Therefore, research on MOT has attracted overwhelming interest. Although it has been observed that DL is efficient for MOT problems, the tracking performance is purely based on the success of proper image localization and classification [3, 28, 29, 41, 178]. Therefore, it is necessary to summarize and analyze the existing DL networks for both object detection and tracking. Recently, there have been two reviews, one on DL-based object detection [1, 42] and the other on DL-based object tracking [40]. These surveys have covered either the DL-based object detection task or the DL-based object tracking task independently, but not both together. The present review deals with the tasks of DL-based object detection and tracking, considering them both individually and in combination. In other words, it analyses, in addition, which combinations of detectors and trackers are suitable for which kinds of data. In that sense, this review, integrating DL-based object detection and tracking, is the first of its kind. With the rapid development in CV research, the article provides a systematic and comprehensive study of the characteristic features, functionalities, and performances of the various state-of-the-art methods at this juncture that offer several efficient solutions and new directions in this domain. It intends to provide an overview of how different DL models are being deployed in generic object detection, specific object detection, and object tracking, as well as in finding the best detector-tracker combined models. This facilitates the selection of appropriate deep models for multi-object detection and tracking, and in turn enhances the scope for further improvement. These are followed by some crucial application areas of object detection, various challenging research issues in detection and tracking, and certain concerns for future researchers in DL. The last aspect is very crucial as a kind of caution to beginners in DL and AI research. A comprehensive bibliography of the up-to-date research work on DL-based object detection and tracking is also presented.

The article proceeds as follows: Section 2 presents the broad approaches for object detection and tracking. Generic object detectors are presented in Section 3. Then, reviews of the application of CNN for various specific tasks are exhibited in Section 4. Section 5 elaborates the most representative and pioneering DL-based approaches for object tracking. Results of detailed analysis of deep networks for both object detection and tracking are stated in Section 6. We conclude the paper in Section 7. Various applications and challenges of the object detection and tracking task, together with some concerns, are discussed in Section 8.

2 Object detection and tracking: Broad approaches

In this section, we briefly discuss different approaches, both conventional and DL-based, for multi-object detection and tracking, along with their characteristic features. As mentioned before, both object detection and tracking are important in the field of CV. In general, object detection is performed in two steps: finding the foreground entities (using features), which are considered as object hypotheses, and then verifying these candidates (using a classifier). We divide object detection into three broad categories: i) appearance-based, ii) motion-based, and iii) DL-based. Appearance-based approaches use image processing techniques to recognize objects directly from images/video. But these approaches usually fail in the detection of occluded objects. In motion-based approaches, on the other hand, a sequence of images is used for the recognition of objects. These methods may not function properly for detecting objects in complex scenarios. DL-based approaches use either appearance features or motion features, or their combination, for object detection in images/video frames. Due to the recent technological breakthroughs, DL-based approaches for object detection have gained much more attention than either appearance- or motion-based approaches.

Deep CNNs are used as the backbone in DL-based object detectors to extract features from the input image/video frame. These features are used to classify the object(s). DL-based approaches have two categories: i) two-stage detectors [43] and ii) one-stage detectors [44]. In two-stage detectors, at first, approximate object regions are proposed using deep features, and then these features are used for the classification as well as bounding box regression of the object candidate. In one-stage detectors, on the other hand, bounding boxes are predicted over the images without the region proposal step. This process consumes less time and hence can be used in real-time devices. Two-stage detectors achieve high detection accuracy, whereas one-stage detectors have high speed. Various backbone networks (feature generation networks) used in DL-based object detection are: i) AlexNet [45], ii) ResNet [46], and iii) VGG16 [43], among others. With the advancement of backbone networks and the increasing capability of GPUs, remarkable progress has been achieved in two-stage object detectors. Recently, the concept of granular computing has been embedded in deep networks in order to enhance the computation speed significantly, keeping a balance with detection accuracy. Some such networks are granulated CNN [3] and Granulated RCNN [178]. Detailed reviews of DL-based generic and specific object detection are provided in Sections 3 and 4.
As said earlier, the task of object detection is followed by that of object tracking. Tracking aims to serve two major purposes: i) prediction of the location of foreground objects in videos and ii) correct association between detected objects and trajectories in the current frame. Optical flow is used in [47] to track objects by measuring the distance between the new detection and the displacement of the trajectory. In [48], the motion of a newly detected object in the current frame is estimated by a Kalman filter, as sketched below. Since real-life dynamic problems are often non-linear, there have been several variations of the traditional Kalman filter, such as the Extended Kalman Filter (EKF) [49] and the particle filter [50]. These two filters work based on the non-linear transformation of random variables.
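For concreteness, the following is a minimal constant-velocity Kalman filter of the kind commonly paired with detection-based trackers, written in NumPy. The state layout (position plus velocity) and the noise magnitudes are illustrative assumptions, not the exact settings of [48].

import numpy as np

class ConstantVelocityKF:
    """Minimal Kalman filter tracking an (x, y) position with a
    constant-velocity motion model. State: [x, y, vx, vy]."""

    def __init__(self, x, y, dt=1.0):
        self.x = np.array([x, y, 0.0, 0.0], dtype=float)   # state estimate
        self.P = np.eye(4) * 10.0                          # state covariance
        self.F = np.array([[1, 0, dt, 0],                  # motion model
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],                   # we only observe (x, y)
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * 0.01                          # process noise (assumed)
        self.R = np.eye(2) * 1.0                           # measurement noise (assumed)

    def predict(self):
        """Project the state one frame ahead; returns the predicted (x, y)."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, zx, zy):
        """Correct the prediction with a new detection centre (zx, zy)."""
        z = np.array([zx, zy], dtype=float)
        innovation = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)           # Kalman gain
        self.x = self.x + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P

# Associate a detection with the trajectory whose predicted position
# is nearest, then update that trajectory's filter:
kf = ConstantVelocityKF(100.0, 50.0)
pred = kf.predict()
kf.update(103.0, 52.0)   # detection centre in the next frame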
In recent years, deep architecture has gained popularity in MOT. We roughly classify deep architecture-based MOT into three categories. The first category involves deep feature-based MOT enhancement, where the (semantic) features are typically extracted from a deep CNN; one such example is multiple hypothesis tracking (MHT) [51]. The second category includes MOT using deep CNN (end-to-end) learning; such end-to-end DL networks, viz., RNN-LSTM and hierarchical RNN models, are developed in [38]. The third category involves MOT using deep network embedding, where the core part of the tracking is designed with the help of a deep CNN. A detailed review of all the tracking categories is provided in Section 5.

Since the performance of tracking objects depends on the performance of their detection, we have provided in Section 6 a comparative analysis of performance and challenges among different combinations of detectors and trackers on various videos. The purpose is to show which pair of detector and tracker is suitable for which kind of data. For this analysis, we have focused only on those investigations concerning DL-based multi-object detection and tracking algorithms which are competitive on the benchmark datasets.

3 Generic object detectors

Generic object detectors aim at locating and classifying objects in an image and labeling them with rectangular bounding-boxes to show the confidence of existence. Generic object detectors are of two types: two-stage detectors and one-stage detectors. Two-stage detectors follow the traditional object detection pipeline, i.e., object localization and then classification, whereas one-stage detectors consider the object detection task as a regression/classification problem. For both detectors, the classification task is done based on some features which are generated using a feature generation network, called the backbone network. A detailed discussion on backbone networks, two-stage detectors, and one-stage detectors is provided in Sections 3.1, 3.2, and 3.3, respectively.

3.1 Backbone networks

This network acts as a feature generation network for object detection. It takes an image as input and generates its feature map. CNN and its variants are used as the backbone networks. Most backbone networks for object detection perform the feature generation task at the convolution layers and the classification task at the last fully connected layers. Some example deep networks are AlexNet [45], ZFNet [43], and VGG16 [52]. Improved versions of the basic deep networks are also available. For instance, in [53], to make an existing network much deeper, some specially designed layers are added to it, some existing layers are replaced, and some are removed. Specially designed deep networks are also used [44, 54] to meet specific requirements. To achieve better accuracy and efficiency, researchers can choose deeper and denser backbones, such as ResNet [55], ResNetXt [56], and AmoebaNet [57], or lightweight backbones, such as MobileNet [58], SqueezeNet [59], Xception [60], and MobileNetV2 [61]. The lightweight backbones are capable of meeting the requirements of mobile applications. To meet the necessity of highly precise and accurate applications, complex backbone structures are required. But real-time video surveillance systems require high processing speed as well as high accuracy [44]. Therefore, researchers favor improved backbones that adapt to the detection architecture and make a fair trade-off between accuracy and speed.

As mentioned earlier, deeper and densely connected backbones replace the shallower and sparsely connected ones to obtain higher detection accuracy. For instance, in [44], VGG16 is replaced by a high-capacity backbone: ResNet, which can identify rich features, is adopted in Faster RCNN for a further gain in accuracy. So, it can be said that the quality of features determines the upper bound of network performance. Deeper and densely connected backbones can provide more qualitative features than shallower and sparsely connected backbones. Therefore, further exploration of deeper networks is required. Out of the aforesaid networks, let us explain the features of AlexNet [45], as it is used frequently in the subsequent discussions. AlexNet consists of five convolution (Conv1, Conv2, Conv3, Conv4, and Conv5), three pooling (Pool1, Pool2, and Pool5), and three fully connected (FC1, FC2, and FC3) layers. It takes an image as input and constructs its reduced feature map as the output of Pool5. The number of channels in this feature map is equal to that of the filters
used in the Conv5 layer. Thereafter, this map is converted to a 1-dimensional weighted array through the FC1 and FC2 layers. This array of the image is then fed into a classifier with N class labels through the FC3 layer, where N is the number of object classes trained. During the training of AlexNet, the classification loss is minimized through back propagation, i.e., the error with respect to the class label of the objects is minimized. For more details about other deeper networks, one may refer to [62].
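As a rough PyTorch sketch of the layer arrangement just described — five convolution layers with pooling after Conv1, Conv2, and Conv5, followed by three fully connected layers producing N class scores — consider the following. The channel widths and kernel sizes follow the commonly quoted AlexNet configuration and should be treated as illustrative.

import torch
import torch.nn as nn

class AlexNetLike(nn.Module):
    """Sketch of an AlexNet-style backbone: Conv1-Conv5 with pooling
    after Conv1, Conv2 and Conv5, followed by FC1-FC3 (N classes)."""

    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),  # Conv1
            nn.MaxPool2d(3, stride=2),                                         # Pool1
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),           # Conv2
            nn.MaxPool2d(3, stride=2),                                         # Pool2
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),          # Conv3
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),          # Conv4
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),          # Conv5
            nn.MaxPool2d(3, stride=2),                                         # Pool5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),   # FC1
            nn.Linear(4096, 4096), nn.ReLU(),          # FC2
            nn.Linear(4096, num_classes),              # FC3: N class scores
        )

    def forward(self, x):
        fmap = self.features(x)        # reduced feature map (output of Pool5)
        return self.classifier(fmap)   # class scores; train with a classification loss

scores = AlexNetLike(num_classes=20)(torch.randn(1, 3, 224, 224))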
3.2 Two stage detectors

Two stage detectors involve two tasks: object region proposal and object classification. First, the object region is proposed using either conventional methods or deep networks. The classification task is done based on the features extracted from this proposed region, thus increasing the detection accuracy. The basic architecture of a two stage detector is shown in Fig. 1.

Fig. 1 Basic architecture of two stage detector [1]

Various two stage detectors include the region convolutional neural network (RCNN) [45], Fast RCNN [63], Faster RCNN [43], Mask RCNN [55], R-FCN [64], FPN [53], granulated CNN [3], and granulated RCNN (G-RCNN) [178]. These are explained in the following sections.

3.2.1 RCNN

RCNN [45] is, perhaps, the first model as a two stage detector to show that a deep CNN is better than conventional methods for object detection. RCNN has four modules. The first module proposes object regions in the image frame. In the second module, a fixed-length feature vector is extracted from these regions. The third module deals with the object classification task. In the last module, bounding boxes are fitted over the classified objects.

In the first module, a selective search method is adopted to propose the approximate object region(s) in the input image. Then, a deep CNN takes each region proposal as input and generates a fixed-length (4096-dimensional) feature vector that is further used in the classification task. The classification task is done through fully connected layers, which need fixed-length input vectors. Therefore, the feature vectors extracted from all the region proposals should have the same size. An image may contain one or more objects of different sizes and aspect ratios. Therefore, different sized region proposals are obtained in the first module. Features extracted from these different sized region proposals are warped into a fixed-sized bounding-box. Then, this fixed-sized feature vector is used for object classification. Here, the feature generation/backbone network consists of five convolution and two fully connected layers. All convolution parameters are shared across all the object categories that are used for training. Training of RCNN has two stages. First, RCNN is trained using a large-scale dataset, and then it is fine-tuned using some particular dataset. In RCNN, the last fully connected layer is connected with (N + 1) classification layers (where N: number of object classes, and 1: background) for performing the final object classification. Stochastic Gradient Descent (SGD) is used here for fine-tuning the convolution parameters. For fine-tuning, the IoU (intersection over union) overlap between the region proposal and the ground truth is measured. If the IoU of a region proposal is less than 0.5, then it is considered negative; otherwise, positive. The region proposal whose IoU-value with respect to the ground truth is maximum is considered as the ground


truth in the next training process. In RCNN, both the region proposal and classification tasks are performed separately, with no shared computation. Therefore, RCNN consumes a prolonged time for the classification task.
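The IoU test described above is easy to make concrete. The following NumPy sketch labels region proposals against one ground-truth box under the 0.5 threshold mentioned; the [x1, y1, x2, y2] box format is an assumption for illustration.

import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_proposals(proposals, gt_box, thresh=0.5):
    """Mark each region proposal positive if its IoU with the ground
    truth is at least `thresh`, negative otherwise."""
    scores = np.array([iou(p, gt_box) for p in proposals])
    labels = scores >= thresh
    best = int(np.argmax(scores))  # proposal treated as ground truth next round
    return labels, best

props = [[10, 10, 60, 60], [5, 5, 20, 20], [12, 8, 58, 55]]
labels, best = label_proposals(props, gt_box=[11, 9, 59, 58])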
3.2.2 Fast RCNN

The next advanced version of RCNN is Fast RCNN [45], which addressed the runtime issue of RCNN. Fast RCNN takes the entire image as input and generates pooling feature-maps corresponding to the input image. Each feature in the pooling-map is considered as a region of interest (RoI). Thereafter, this fixed-sized RoI-map is passed through three fully connected layers for object classification and bounding-box fitting over the classified object. As the locations of pooling features are considered as the probable regions and are used for the classification task, the computation time can be reduced significantly as compared to RCNN. Another difference between RCNN and Fast RCNN is that RCNN involves a multi-stage training process, whereas Fast RCNN uses a one-stage, end-to-end training process.

As said earlier, instead of considering the input region proposals, the RoI pooling-map is used for the classification task. This feature map consists of some key features that belong to different regions of different sizes. Therefore, Fast RCNN does not require warping regions or reversing spatial features for the region proposals. Here, truncated singular value decomposition (SVD) is used for quick detection by updating the weight parameters, which helps in accelerating the speed. Experimental results revealed that Fast RCNN achieves 66.9% mAP (mean average precision) on the PASCAL VOC 07 [19] dataset, whereas RCNN results in 66.0% mAP on the same dataset. The training time of Fast RCNN is 9 times lower than that of RCNN. Fast RCNN trained with truncated SVD achieves a higher detection speed as compared to RCNN. An Nvidia K40 GPU was used during these experiments. From the aforesaid experimental results, it is evident that Fast RCNN is better than RCNN in terms of detection performance metrics. However, Fast RCNN uses a selective search method over the convolution feature map to propose its pooling map, which slows down its operation.
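The truncated-SVD speed-up can be sketched in a few lines of NumPy: a fully connected layer's weight matrix W is approximated by two thinner factors, cutting the multiply count whenever the kept rank t is much smaller than the layer dimensions. The sizes below are arbitrary.

import numpy as np

def truncate_fc(W, t):
    """Approximate an FC weight matrix W (out x in) by two smaller
    factors, keeping only the top-t singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :t] * S[:t]        # (out x t): second, small layer
    B = Vt[:t, :]               # (t x in):  first, small layer
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 9216))
A, B = truncate_fc(W, t=256)
x = rng.standard_normal(9216)
y_fast = A @ (B @ x)            # ~ (4096*256 + 256*9216) multiplies
y_full = W @ x                  # 4096*9216 multiplies, nearly the same output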
be discriminative in classification of images, while any
3.2.3 Faster RCNN translation of an object in a bounding-box may be
meaningful in object detection. If the RoI pooling layer
Faster RCNN [43] is an improved version of Fast RCNN is manually inserted into a convolutional network, the
in terms of detection accuracy and runtime. As stated translational invariance property may get affected. To
earlier, in Fast RCNN, a selective search method is used address this issue, R-FCN was proposed in [64].
for region proposal that makes the system slow. Faster For each object category in R-FCN, the last convolution
RCNN replaces this method with a new region proposal layer initially generates g 2 position sensitive score maps
network (RPN) which is a fully-connected CNN. RPN having a grid size of (g × g). Then one position sensitive
predicts the object region(s) more efficiently in a wide pooling layer is appended to the last convolution layer to
range of aspect ratios and scales. In Faster RCNN, the aggregate the responses from these score maps. At last,
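The anchor scheme is simple to illustrate. The sketch below generates the 3 × 3 = 9 anchor boxes centred on one feature-map location and then tiles them over the map; the particular pixel scales are illustrative assumptions rather than the exact values of [43].

import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the 3 x 3 = 9 anchor boxes [x1, y1, x2, y2] centred at
    (cx, cy): one box per (scale, aspect-ratio) pair."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)      # keep area ~ s*s while setting w/h = r
            h = s / np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

# One anchor set for every sliding-window position of a feature map
# with stride 16 relative to the input image:
stride, fh, fw = 16, 38, 50
all_anchors = np.concatenate([anchors_at(x * stride, y * stride)
                              for y in range(fh) for x in range(fw)])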
3.2.4 R-FCN

As mentioned earlier, Faster RCNN has two sub-networks: one is a fully convolutional (shared) sub-network which is typically independent of RoIs, and the other is an RoI-based unshared network. Faster RCNN uses deep CNNs, such as AlexNet [45] and VGG16 [43], and provides efficient results. The more recent networks for image classification, including ResNets [65] and GoogleNets [66], are, on the other hand, eventually fully convolutional; that means ResNet and GoogleNet architectures construct a fully convolutional object detection network without an RoI network. However, using Faster RCNN with ResNet and GoogleNet architectures provides inferior results. This happens because the object detection task is translation-variant, whereas the image classification task is translation-invariant. Shifting of an object within an image should make no difference in the classification of the image, while any translation of an object in a bounding-box may be meaningful in object detection. If the RoI pooling layer is manually inserted into a convolutional network, the translation-invariance property may get affected. To address this issue, R-FCN was proposed in [64].

For each object category in R-FCN, the last convolution layer initially generates g² position-sensitive score maps corresponding to a grid size of (g × g). Then one position-sensitive pooling layer is appended to the last convolution layer to aggregate the responses from these score maps. At last, in every RoI, the g² scores are averaged to generate an (N + 1)-dimensional (N: number of object categories, 1: background) vector, and then softmax responses are calculated. Another (4 × g²)-d convolution layer is appended to obtain the class-agnostic bounding-boxes. The testing speed of R-FCN on both MS COCO and PASCAL VOC is 170 ms per image.
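To make the position-sensitive averaging concrete, here is a toy NumPy version for a single RoI and a single category: bin (i, j) of the (g × g) grid reads only its own score map, and the g² pooled scores are averaged into one class vote. The shapes and g = 3 are illustrative.

import numpy as np

def psroi_score(score_maps, roi, g=3):
    """score_maps: (g*g, H, W) position-sensitive maps for ONE class.
    roi: [x1, y1, x2, y2] in map coordinates. Bin (i, j) pools only
    from map index i*g + j; the g*g bin scores are then averaged."""
    x1, y1, x2, y2 = roi
    bw, bh = (x2 - x1) / g, (y2 - y1) / g
    votes = []
    for i in range(g):          # rows of the RoI grid
        for j in range(g):      # columns of the RoI grid
            r0 = int(y1 + i * bh); r1 = max(int(y1 + (i + 1) * bh), r0 + 1)
            c0 = int(x1 + j * bw); c1 = max(int(x1 + (j + 1) * bw), c0 + 1)
            votes.append(score_maps[i * g + j, r0:r1, c0:c1].mean())
    return float(np.mean(votes))   # averaged per-class vote for this RoI

maps = np.random.rand(9, 64, 64)   # g*g maps for one category
print(psroi_score(maps, roi=[8, 8, 40, 40]))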
3.2.5 FPN

Feature pyramids, built upon image pyramids, have been widely adopted by many object detection systems to improve scale invariance [67, 68]. However, the training time and memory consumption are high in this process. In some techniques, the pyramids are built only during testing, which leads to a lack of consistency between training- and testing-time inference [43, 63]. The hierarchy of in-network features of a deep CNN produces feature maps of various spatial resolutions, but introduces semantic gaps caused by the different depths. This issue is addressed in some studies [69, 70] where the pyramid building starts from the middle layers, but the resulting systems miss the maps of higher resolution. The feature pyramid network (FPN), proposed in [53], instead holds an architecture with a bottom-up (BU) pathway, a top-down (TD) pathway, and a number of lateral connections. These connections are used to combine strong semantic features (low resolution) with weak semantic features (high resolution). The BU pathway produces a feature hierarchy by down-sampling the corresponding feature map with a stride of 2. The layers having the same sized output maps are grouped into network stages, and the output of the last layer of each stage is chosen as the reference set of feature maps to build the following TD pathway. In the TD pathway, first, the feature maps from higher network stages are up-sampled, and then enhanced using those of the same spatial size from the BU pathway via the lateral connections. A (1 × 1) convolution layer is applied to reduce the channel dimensions, and the merged map is achieved by element-wise addition. Finally, a (3 × 3) convolution is appended to each merged map to reduce the aliasing effect of up-sampling, and the final feature map is generated. This process is iterated until the finest resolution map is generated. As rich semantics can be extracted by the feature pyramid, FPN achieves this without compromising memory or speed. Moreover, FPN can be implemented at various stages of the detection of objects.
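A compact PyTorch sketch of one TD merge step as just described — a (1 × 1) lateral convolution, 2× up-sampling, element-wise addition, and a (3 × 3) smoothing convolution — is given below, with illustrative channel counts.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    """One FPN merge: lateral 1x1 conv on the bottom-up map, add the
    up-sampled top-down map, then a 3x3 conv to reduce aliasing."""

    def __init__(self, c_bottom, c_pyramid=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_bottom, c_pyramid, kernel_size=1)
        self.smooth = nn.Conv2d(c_pyramid, c_pyramid, kernel_size=3, padding=1)

    def forward(self, bottom_up, top_down):
        lat = self.lateral(bottom_up)                    # match channel dims
        up = F.interpolate(top_down, scale_factor=2,     # x2 up-sampling
                           mode="nearest")
        return self.smooth(lat + up)                     # element-wise addition

c4 = torch.randn(1, 512, 50, 50)      # a bottom-up stage output
p5 = torch.randn(1, 256, 25, 25)      # the coarser pyramid level above it
p4 = FPNMerge(c_bottom=512)(c4, p5)   # finer pyramid level: (1, 256, 50, 50)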
3.2.6 Mask RCNN

An extended version of Faster RCNN is Mask RCNN [55], which is mainly developed to serve the instance segmentation task. Here, a ResNet-feature pyramid network (FPN) [53] is added to Faster RCNN [65] as the backbone to generate informative features, thereby increasing the detection accuracy and speed. RoI features that are extracted from different layers of the FPN have different scales. The FPN generates a feature hierarchy that consists of RoI feature maps of different scales; this is done in the BU pathway. On the other hand, the TD pathway offers features of higher resolution by up-sampling the feature maps from higher pyramid levels. The feature maps at the top of the pyramid are nothing but the last convolution layer feature maps of the bottom-up pathway. Then, the same spatial-sized feature maps from the BU pathway and the TD pathway are merged to generate the region proposal. Both higher-resolution and lower-resolution feature maps are generated by the FPN, thereby resulting in significant features for improving the detection accuracy.

Another improvement in detection accuracy is obtained by replacing the RoI pooling layer with RoI Align to retrieve a (comparatively small) feature map from each of the RoIs. The traditional RoI pooling quantization method suffers from the misalignment problem that arises between RoIs and pooled features. This issue is addressed by the RoI Align layer. Here, first, the floating-point coordinates of each RoI-map are computed. Then, bilinear interpolation is done using these floating-point coordinates to compute the exact values of the features. These features are distributed into four RoI bins. Max or average pooling is done to get significant feature values from all four bins. Finally, these feature values are aggregated and used for object classification. The aforesaid two modifications improve the detection precision. The ResNet-FPN backbone achieves 71.2% AP (average precision) and the RoI Align operation achieves 70.9% AP on the MS COCO dataset.
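The bilinear interpolation at the heart of RoI Align can be shown in isolation: sampling a feature map at a floating-point coordinate instead of snapping to the nearest cell. A full RoI Align would repeat this for several sample points per bin and then max- or average-pool them; the sketch below covers only the sampling step.

import numpy as np

def bilinear_sample(fmap, x, y):
    """Sample a 2-D feature map at a floating-point (x, y) by
    bilinear interpolation of the four surrounding cells."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, fmap.shape[1] - 1), min(y0 + 1, fmap.shape[0] - 1)
    dx, dy = x - x0, y - y0
    top = fmap[y0, x0] * (1 - dx) + fmap[y0, x1] * dx
    bot = fmap[y1, x0] * (1 - dx) + fmap[y1, x1] * dx
    return top * (1 - dy) + bot * dy

fmap = np.arange(25, dtype=float).reshape(5, 5)
# RoI pooling would round (1.7, 2.3) to a cell; RoI Align samples exactly:
print(bilinear_sample(fmap, 1.7, 2.3))   # 13.2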
3.2.7 Incorporating granular computing in CNN

In this section, we mention some recent developments in CNN incorporating the concept of granular computing (GrC) for object detection and tracking. There are two such new models, namely granulated CNN and granulated RCNN, in short G-CNN and G-RCNN, respectively. Before explaining these models, let us describe, in brief, the concept of granules and granular computing along with its characteristic features.

Granulation is a basic step of human cognition systems. Granular computing (GrC) is a nature-inspired information processing framework where computations/operations are performed on information granules. Granules evolve during the abstraction of knowledge from the data. The significance of GrC is based on the realization that precision is sometimes expensive and not very meaningful in modelling and controlling complex systems. When the data has an overlapping character, it may be convenient to represent it in terms of granules (a clump of indiscernible elements drawn together, for example, by likelihood, similarity, proximity, or functionality).

As GrC deals with granules, rather than individual elements, it leads to a gain in computation time, thereby signifying its application to large data sets.

While DL is a computationally intensive process, the GrC paradigm, on the other hand, leads to a gain in computation time; it may therefore be appropriate and logical to integrate the two judiciously so as to make the DL framework efficient in terms of computation time, requiring only a CPU. Based on this realization, G-CNN and G-RCNN are formulated for object detection, tracking, and scene description. These are described as follows:

(a) Granulated CNN: As stated, granulation is the process of formation of granules using information abstraction. For processing an image frame in the GrC paradigm, granules could be made of equal or unequal sizes, and of regular or irregular shapes, over the image frames, although irregular ones are more natural for real-life problems. Region growing can be used to obtain irregular shaped (natural) spatio-color neighborhood granules. These granules represent both static and moving object regions in the image/video frame. These object regions are then fed to the deep CNN architecture for performing object classification, thereby resulting in G-CNN. The functioning principle of G-CNN is as follows: instead of scanning the entire image pixel by pixel in the convolution layer of DL, it jumps over only the granules which were formed before. That means, for a (32 × 32) image with N granules, sliding the filter is done only N times instead of over (32 × 32) pixels, where N << (32 × 32). Hence a significant speed-up is observed, compromising some accuracy [3].

This is the first investigation [3] incorporating granular computing in the deep CNN framework for object detection. Granulated CNN achieves 48.59% detection accuracy and 1.5 fps speed over the MS COCO dataset. Further, the concept of Z-numbers [71] was used to provide a granulated linguistic description of the output scene, which is unique.
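The claimed speed-up can be illustrated with a toy sketch: an ordinary convolution evaluates the filter at every pixel, whereas a granulated pass evaluates it only at the N granule centres supplied by a prior granulation step. The granule list below is invented for illustration; the actual granules in [3] come from spatio-color region growing.

import numpy as np

def conv_at_locations(image, kernel, centers):
    """Evaluate `kernel` only at the given granule centres instead of
    sliding it over every pixel of `image` (valid positions assumed)."""
    k = kernel.shape[0] // 2
    out = {}
    for (r, c) in centers:                       # N evaluations, N << H*W
        patch = image[r - k:r + k + 1, c - k:c + k + 1]
        out[(r, c)] = float(np.sum(patch * kernel))
    return out

rng = np.random.default_rng(1)
img = rng.random((32, 32))
kern = rng.random((3, 3))
granule_centers = [(5, 7), (16, 16), (25, 9)]    # N = 3 granules (illustrative)
responses = conv_at_locations(img, kern, granule_centers)
# versus 32*32 = 1024 positions for an exhaustive pixel-by-pixel scan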

(b) Granulated RCNN: G-RCNN [178] is an advanced version of Faster RCNN. Here, object detection has two stages: object localization (i.e., RoI) and its classification. G-RCNN is effective for the extraction of RoIs from an image/video frame. This is done by incorporating the unique concept of granulation in a deep CNN. Here, granules are constructed using spatio-temporal information. These granules represent the object localization (i.e., region) in an image/video frame. Unlike Fast and Faster RCNNs, G-RCNN uses (i) granules formed over the pooling feature map, instead of the entire feature map, in defining RoIs, (ii) only the objects in RoIs, instead of the entire pooling feature map, for performing object classification, and (iii) only positive RoIs during training, instead of the entire RoI-map. In addition, both images and video can be used for the training of G-RCNN. All these lead to improvements in real-time detection accuracy and speed. G-RCNN with the AlexNet backbone achieves 80.9% detection mAP and 5.6 fps speed over the PASCAL VOC 12 dataset.

3.3 One stage detectors

In one stage detectors, the bounding boxes are predicted over the images without the region proposal step, thereby increasing the detection speed. The basic architecture of a one stage detector is shown in Fig. 2.

Fig. 2 Basic architecture of one stage detector [1]

Various one stage detectors include YOLO [44], YOLOv2 [46], YOLOv3 [72], SSD [70], DSSD [73], RetinaNet [74], M2Det [75], RefineDet [76], and DCN [77]. These are explained in the following sections.

3.3.1 YOLO

YOLO [44] is an object detector with a single stage which was designed after Faster RCNN. It is mainly applicable
for the detection of real-time images. YOLO predicts fewer than 100 region proposals, whereas Fast RCNN and Faster RCNN predict 2000 and 300 region proposals per image, respectively. YOLO considers the detection problem as a regression problem, so as to retrieve features from an input image straightaway for the prediction of class probabilities and bounding-boxes. The speed of the YOLO network is 45 fps excluding batch processing on a Titan X GPU, whereas Fast RCNN and Faster RCNN achieve speeds of 0.5 fps and 5 fps, respectively, on the same GPU.

An input image is divided here into (g × g) grids. Features extracted from each grid cell are used for object classification. Each grid cell predicts B bounding boxes, and for each box, C class probabilities are obtained for C object classes. Two measures are considered for each bounding-box: first, the probability (P) of the bounding-box is defined to check whether the bounding-box belongs to any object or not, and then the IoU between the ground truth and the bounding-box is defined to check how accurately the bounding-box contains that object. The bounding-box with the highest IoU and a non-zero class probability is considered as the object region. The YOLO network consists of 24 convolution layers and 2 fully connected layers. YOLO is not so good at object localization, which affects its detection accuracy.

As compared to Fast RCNN, YOLO reduces the background false positives by 3 times. However, YOLO obtains 63.4% mAP at 45 fps as compared to Fast RCNN (70.0% mAP, 0.5 fps) and Faster RCNN (73.2% mAP, 7 fps). The YOLO detector is restricted to high resolution detection and single-class prediction.
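A toy decoding of the grid prediction just described follows: for each cell and each of its B boxes, the object-ness probability scales the class scores, and confident boxes are kept. The tensor layout is an assumption for illustration.

import numpy as np

def decode_grid(pred, conf_thresh=0.25):
    """pred: (g, g, B, 5 + C) with [x, y, w, h, P(object), C class probs]
    per box. Returns (cell, box, class, score) for confident boxes."""
    g, _, B, _ = pred.shape
    picks = []
    for gy in range(g):
        for gx in range(g):
            for b in range(B):
                p_obj = pred[gy, gx, b, 4]
                cls = pred[gy, gx, b, 5:]
                score = p_obj * cls.max()     # class-specific confidence
                if score >= conf_thresh:
                    picks.append(((gy, gx), b, int(cls.argmax()), float(score)))
    return picks

rng = np.random.default_rng(2)
pred = rng.random((7, 7, 2, 5 + 20))   # g=7, B=2, C=20: YOLO-like sizes
detections = decode_grid(pred, conf_thresh=0.9)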
3.3.2 YOLOv2

YOLOv2 [46] is an advancement of YOLO. Decisions from the past training task, together with novel concepts, are adopted in YOLOv2 to improve the speed and detection precision of YOLO. YOLOv2 involves six techniques: i) batch normalization, ii) a high-resolution classifier, iii) convolution with anchor boxes, iv) size and aspect ratio prediction of the anchor box, v) fine-grained features, and vi) multi-scale training. These are explained below:

(i) Batch normalization (BN): Training of YOLOv2 is done using the SGD approach. SGD uses mini-batches for the training process. For each mini-batch, the mean and variance are computed and used for activation. Then, for each mini-batch, the activation is normalized to zero mean and a standard deviation of 1. Finally, all the elements in each of the mini-batches are sampled using the same distribution. This operation may be viewed as batch normalization [78]. It produces activations of the same distribution. YOLOv2 adds a batch normalization layer ahead of each of the convolution layers to accelerate convergence, and hence it regularizes the model. Using BN in YOLOv2, the mAP is increased by 2% as compared to YOLO.

(ii) High-resolution classifier: An input resolution of (224 × 224) was adopted in the YOLO backbone, whereas in YOLOv2 the input resolution is increased to (448 × 448). Therefore, the network needs to be adjusted to the new resolution inputs for the object detection task. Accordingly, some fine-tuning of the classification network is done in YOLOv2 for images of resolution (448 × 448) over 10 epochs. This increases the mAP by 4%.

(iii) Convolution with anchor boxes: As already discussed, Faster RCNN utilizes an anchor box as a reference for generating the region proposals, which is then parameterized relative to that reference anchor box to predict the bounding-box. This prediction mechanism is used in YOLOv2. It then predicts the class and object-ness score for each predicted bounding-box. This operation increases the recall by 7% and reduces the mAP by 0.3%.

(iv) Size and aspect ratio prediction of the anchor box: YOLOv2 utilizes the k-means clustering method on the training bounding-boxes to obtain better priors (see the clustering sketch after this list). These priors are used to define the center location of the predicted anchor box. The aspect ratio and size of this anchor box are predicted using the cluster information. This operation improves the detection accuracy.

(v) Fine-grained features: As discussed, YOLO was trained with (224 × 224) images. The YOLOv2 architecture is a modification of the YOLO architecture. For localizing smaller objects, YOLOv2 is re-trained with higher resolution images (448 × 448). In this re-training process, YOLOv2 uses both the higher and lower resolution features by stacking the adjacent features into different channels. This increases the detection mAP by 1%.

(vi) Multi-scale training: To make the network robust to images of different sizes, every ten batches, the network randomly chooses a new image dimension from {320, 352, ..., 608}. This basically implies that the same network can detect at different levels of resolution. For example, YOLOv2 achieves 78.4% mAP at 40 fps at higher resolution, whereas YOLO achieves 63.4% mAP at 45 fps on VOC 07. Although YOLOv2 achieves high detection precision with high speed, it is restricted to high resolution detection and single-class objects.
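Regarding item (iv), anchor priors are commonly clustered with k-means under a 1 − IoU distance, so that only box shape matters (width and height, with both boxes anchored at a common origin). The following NumPy sketch follows that recipe; k, the iteration count, and the data are illustrative assumptions.

import numpy as np

def wh_iou(wh, centroids):
    """IoU between a (w, h) box and each centroid, both anchored at
    the origin, so only shape matters."""
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=5, iters=50, seed=0):
    """Cluster training box shapes with distance d = 1 - IoU; the
    final centroids serve as anchor-box priors."""
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        assign = np.array([np.argmax(wh_iou(b, centroids)) for b in boxes_wh])
        for j in range(k):                      # move centroid to cluster mean
            if np.any(assign == j):
                centroids[j] = boxes_wh[assign == j].mean(axis=0)
    return centroids

rng = np.random.default_rng(3)
boxes = rng.uniform(10, 300, size=(500, 2))     # (w, h) of training boxes
priors = kmeans_anchors(boxes, k=5)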
3.3.3 YOLOv3

YOLOv3 [72] is the next advanced version of YOLOv2. The deep CNN Darknet-53 is used as the feature generation network in YOLOv3. YOLOv3 uses multi-label classification with overlapping patterns for training, so that it can be used in complex scenarios for object detection. Moreover, during training, three feature maps of different scales are used in predicting the bounding-box. In YOLOv3, the last convolution layer generates three-dimensional tensors that contain class predictions, object-ness, and bounding-box. YOLOv3 achieves 57.9% mAP on the MS COCO dataset, as compared to 53.3% for DSSD513 and 61.1% for RetinaNet. Because of the advantages of multi-class prediction, YOLOv3 can be used for small object classification. However, YOLOv3 shows worse performance for the detection of medium and large sized objects.

3.3.4 SSD

The Single-shot detector (SSD) [79] is a one-stage detector that can predict multiple classes. Within SSD, at each layer, several feature maps of different scales are generated. SSD predicts the class scores for a set of default bounding-boxes of varying scales at every location in the aforesaid feature maps. These default bounding-boxes have different scales and aspect ratios for a particular feature map. The scale of the default bounding-boxes in one feature map is calculated based on the difference between the highest feature map and the lowest feature map, where each specific feature map learns to be responsive to a particular scale of objects. For each default bounding-box, SSD predicts the multi-label classification scores. During training, the default bounding-boxes are matched with the ground-truth boxes. The matched bounding-boxes are considered positives and the rest negatives. In the case of a large number of negatives, the system adopts hard negative mining over the background to retain a sufficient proportion of positive boxes for training. In this approach, a loss is defined for each bounding box; then, the negatives with the highest loss are chosen, so that the ratio between total negatives and positives is at most 3:1. From experiments, it was evident that SSD512 (with input image size 512 × 512) produced better results in both speed and mAP with the VGG16 [43] backbone. Further, SSD512 obtained a mAP of 81.6% on the PASCAL VOC 07 test set and 80.0% on the PASCAL VOC 12 test set.

3.3.5 DSSD

The De-convolutional Single Shot Detector (DSSD) [73] is a modified version of SSD. In DSSD, both a prediction module and a de-convolution module are added to SSD, and it uses ResNet-101 as the backbone. In the prediction module, a residual block is added to each prediction layer to do element-wise addition of the outputs of this layer. The de-convolution module augments the feature-map resolution so that small objects can be detected using DSSD. By integrating these two modules with SSD, DSSD can predict a different set of objects of different sizes. During the training of DSSD, the baseline network ResNet-101 is first pre-trained on the ILSVRC CLS-LOC dataset, and thereafter the original SSD model (ResNet-101) is trained using (513 × 513) images from the same dataset. Parameters of this trained SSD model are then fine-tuned through the training of the de-convolution module. Experiments on both the PASCAL VOC and MS COCO datasets showed the effectiveness of the DSSD513 model [73]. Addition of the prediction module and de-convolution module to the SSD model enhances the mAP by 2.2% on the PASCAL VOC 07 test set.

3.3.6 RetinaNet

RetinaNet [74] is another kind of single-stage object detector, which works by considering the focal loss as the classification loss. One-stage detectors provide a dense set of object locations with an extreme foreground (positive) and background (negative) class imbalance. Due to this class imbalance issue, the training process is biased toward the major class, thereby reducing the detection precision. This problem is addressed in RetinaNet, where a loss function named focal loss is defined. This reduces the weight of the loss assigned to the negative (background) samples. The loss concentrates on the hard positive training samples and avoids the vast number of negative samples. In this way, RetinaNet can be trained with unbalanced negative and positive samples. The experimental results revealed that RetinaNet with a ResNet-101-FPN backbone achieved 39.1% AP, as compared to DSSD513 with 33.2% AP, on the MS COCO test-dev dataset.
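The focal loss has a compact closed form, FL(p_t) = −α_t (1 − p_t)^γ log(p_t), where p_t is the predicted probability of the true class. A minimal binary version in NumPy follows, with the commonly quoted defaults γ = 2 and α = 0.25 taken as assumptions.

import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss. p: predicted foreground probability in (0, 1),
    y: 1 for object, 0 for background. The (1 - p_t)**gamma factor
    down-weights easy examples, mostly the huge background class."""
    p_t = np.where(y == 1, p, 1.0 - p)             # prob. of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)

# An easy, confident background sample contributes almost nothing,
# while a hard positive keeps a large loss (and gradient):
print(focal_loss(np.array([0.01, 0.1]), np.array([0, 1])))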
3.3.7 M2Det

M2Det is developed in [75] to handle the wide variation of scale across different object instances. It comprises a multi-level feature pyramid network (MLFPN) which constructs more effective feature pyramids. Three steps are carried out to get the enhanced feature pyramids. First, multi-level features extracted from multiple layers in the backbone are fused into a base feature. Second, the base feature is fed into a block consisting of joint Thinned U-shape Modules and Feature Fusion Modules to obtain decoder-layer features. Finally, a feature pyramid with multi-level features is built by integrating the decoder layers of equivalent
scale. In this way, multi-level and multi-scale features are generated. These features are then fed to an SSD for object localization, classification, and bounding-box fitting. M2Det achieves an AP of 41.0% at a speed of 11.8 fps with the single-scale inference strategy, and an AP of 44.2% with the multi-scale inference strategy, utilizing VGG16 on the MS COCO test-dev dataset. It outperforms RetinaNet800 (Res101-FPN as backbone) by 0.9% with the single-scale inference strategy; however, it is two times slower than RetinaNet800.

3.3.8 RefineDet

The RefineDet network [76] has two interconnected modules: (i) a refinement module and (ii) an object detection module. These two modules are interconnected through a transfer connection block, which is used to transfer features from the first module to the following one for improved prediction of objects. Here, the training is done in an end-to-end manner. It has three important stages: (i) preprocessing, (ii) detection by the two interconnected modules, and (iii) NMS. Other one-stage detectors, including YOLO, SSD, and RetinaNet, utilize single-step regression to obtain the final outputs, whereas RefineDet uses a cascaded (two-step) regression method to predict hard-to-detect objects (i.e., small objects) more accurately.
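Since the NMS stage is mentioned above without detail, a standard greedy non-maximum suppression sketch follows: keep the highest-scoring box, drop boxes that overlap it beyond a threshold, and repeat. The box format and the 0.5 overlap threshold are conventional assumptions, not settings taken from [76].

import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression. boxes: (n, 4) rows of
    [x1, y1, x2, y2]; returns indices of kept boxes, best first."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]   # drop heavy overlaps with the winner
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))   # -> [0, 2]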
Salient object detection aims at focusing on the dominant
3.3.9 DCN

A regular CNN can focus only on features of a fixed square size (according to the kernel); therefore, the receptive field cannot cover the entire set of object pixels properly. Deformable convolutional networks (DCNs) [77] can handle this issue by producing a deformable kernel. DCN has two varieties, DCNv1 and DCNv2. DCNv2 [80] utilizes more deformable convolution layers than DCNv1 to replace the regular convolution layers. All the deformable layers are modulated by a learnable scalar value, which enhances the deformable effect and the accuracy. DCNv2 achieved 45.3% mAP, as compared to DCNv1 with 41.7% mAP, on the MS COCO test-dev dataset.

In summary, the aforesaid generic detectors enhance the accuracy by extracting richer features of objects and adopting multi-level and multi-scale features for the detection of objects of different sizes. To achieve higher speed and precision, the one-stage detectors utilize newly designed loss functions to filter out the easy samples, which significantly lowers the number of region proposals. Adaptation of deformable convolution layers is seen to be effective in addressing the geometric variation in images. Modeling the relationship between different objects in an image is also necessary to improve the performance. An overview of various object detectors in terms of characteristics, like region proposal, input feature, loss function, learning method, and softmax layer, is provided in Table 1. Comparative studies of their performances are provided in Section 6.

Table 1 Overview of the prominent object detectors

Detector              Region proposal   Multi-scale input   Learning method     Loss function                       Softmax layer   End-to-end train
SPPNet [68]           EB                +                   SGD                 HL + BBR                            +               -
RCNN [45]             SS                -                   SGD, BP             HL + BBR                            +               -
Fast RCNN [63]        SS                +                   SGD                 CLL + BBR                           +               -
Faster RCNN [43]      RPN               +                   SGD                 CLL + BBR                           +               +
R-FCN [64]            RPN               +                   SGD                 CLL + BBR                           -               +
FPN [53]              RPN               +                   Synchronized SGD    CLL + BBR                           +               +
Mask RCNN [55]        RPN               +                   SGD                 CLL + BBR + semantic sigmoid loss   +               +
YOLO [44]             -                 -                   SGD                 CSSC + BBR + OC + BC                +               +
YOLOv2 [46]           -                 -                   SGD                 CSSC + BBR + OC + BC                +               +
YOLOv3 [72]           -                 -                   SGD                 CSSC + BBR + OC + BC                +               +
SSD [79]              -                 -                   SGD                 CSL + BBR                           -               +
DSSD [73]             -                 -                   SGD                 CSL + BBR                           -               +
RetinaNet [74]        -                 -                   SGD                 CSL + BBR                           -               +
M2Det [75]            -                 -                   SGD                 CSL + BBR                           -               +
RefineDet [76]        -                 -                   SGD                 Cascaded CSL + BBR                  -               +
DCN [77]              -                 -                   SGD                 CSL + BBR                           -               +
Granulated CNN [3]    -                 -                   SGD                 CSL + BBR                           -               +
G-RCNN [178]          FRPN              -                   SGD                 CLL + BBR                           +               +

Note: '+' denotes that the corresponding technique is employed; '-' denotes that it is not. EB: Edge Boxes, SS: Selective Search, RPN: Region Proposal Network, FRPN: Foreground Region Proposal Network, SGD: Stochastic Gradient Descent [32], BP: Batch Processing, HL: Hinge Loss, BBR: Bounding-box Regression, OC: Object Confidence, BC: Background Confidence, CSL: Class Softmax Loss, CSSC: Class Sum-Squared Error, CLL: Class Log Loss.

So far, we have explained the different detectors and their relative merits and demerits. Let us now present some applications of CNN for certain specific detection tasks.

This issue is addressed in [96], where saliency prediction is integrated into pre-trained object recognition DNNs. Here, the DNN's weights are fine-tuned by transferring the saliency evaluation metrics (i.e., KL-divergence and normalized scan-path saliency) which are based on the specific object function. Here, local features combined with global features improve the salient object detection performance. In [97], two independent deep CNNs (DNN-G and DNN-L) are trained using both local estimation and global search to obtain the global contrast as well as the local information, and to predict the saliency maps. In [98], a semi-supervised saliency detection network is proposed by integrating visual saliencies from both BU and TD saliency maps. This network produces an object-ness score by averaging the intensities of multi-scale super-pixels.
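A minimal sketch of this super-pixel averaging step is given below, where a dense saliency map is pooled over SLIC super-pixels to produce per-region object-ness scores. The function and parameter names are our own; a multi-scale version would simply repeat the computation for several values of n_segments.

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_objectness(image, saliency, n_segments=200):
    """image: HxWx3 float array; saliency: HxW map in [0, 1]."""
    labels = slic(image, n_segments=n_segments, start_label=0)
    scores = np.zeros(labels.max() + 1)
    for s in range(labels.max() + 1):
        scores[s] = saliency[labels == s].mean()  # average intensity per region
    return labels, scores                          # per-super-pixel object-ness
```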
Salient object detection necessitates both semantic segmentation and context modeling. A novel super-pixel-wise CNN approach, called Super CNN, is developed in [99] to learn the internal representations of saliency efficiently. Here, salient object detection is considered as a two-class problem. A novel deep saliency detection framework, namely CRPSD, is presented in [100], which combines both region-level saliency estimation and pixel-level saliency prediction. In addition, multi-scale feature maps are significant in improving the detection accuracy. A deep network, called Region Net, based on this idea is formulated in [101] for performing salient object detection. This network is based on Fast RCNN. Two specific tasks, namely, multi-scale contextual modeling and end-to-end edge preserving, are integrated in the Region Net for saliency detection.

4.2 Face detection

Detection of faces is essential due to several face-related applications, including face recognition [102, 103], face synthesis [104], and facial expression analysis [105]. Unlike generic object detection, the face detection task is performed to recognize and locate face regions covering a very large range of scales. Some generic detectors (e.g., Faster RCNN) are modified so that they can act as face detectors [106–108]. In some studies, CNNs are trained with face landmarks and 3-dimensional modeling. For instance, a unified FCN end-to-end framework, called DenseBox, is proposed in [109] for detecting faces and localizing face landmarks. In [110], a multi-task learning discriminative framework is developed. It integrates a CNN with the help of a 3-dimensional mean face model. This framework solves
two issues during the conversion of a generic detector to a face detector: the elimination of anchor boxes by a 3-dimensional mean face model, and the replacement of the RoI pooling layer with a face-configured pooling layer.

4.3 Pedestrian detection

The generic Faster RCNN is modified in [111] for pedestrian detection. Here, a downstream classifier takes boosted forests, high-resolution convolution feature maps, and an RPN to take care of the small instances and negative examples. Based on DPM [67], a DL framework, called DeepParts, is developed in [112] for addressing intricate occlusions within images. DeepParts makes decisions based on 45 fine-tuned DCNN models, and some strategies, such as part selection and shifting of bounding boxes. Another deep net, called CompACT-Deep [113], combines hand-crafted features and fine-tuned deep CNNs to handle positive proposals of low IoU-value, and partial occlusion. Another deep CNN, called multispectral DNN [70], combines the complementary information from both color and thermal images for pedestrian detection.

5 Deep learning-based object tracking

Object tracking follows the object detection task. Based on the functionalities of DL, MOT methods are classified into three main categories: i) deep network features-based MOT enhancement, ii) deep network embedding, and iii) deep network (end-to-end) learning. Generally, it is hard to obtain MOT results using a single network, as some inter-related sub-modules (i.e., detection, feature extraction, and matching) are essential for MOT. Besides, assumptions, like fixed distributions and the Markov property, are made to achieve effective tracking performance. These three categories of MOT are explained in Sections 5.1–5.3.

5.1 Enhancement of MOT using deep network features

In this technique, the tracking framework uses semantic deep features instead of conventional handcrafted features to obtain effective tracking performance. The success of DNNs in image classification is because of their ability to learn deep features. These features have rich semantic information. They are useful not only for image classification but also for other tasks, including object detection, image segmentation, and MOT.

In object detection and segmentation tasks, deep features are useful for region proposals. Similarly, in the MOT task, deep features are extracted from a deep CNN (AlexNet [45]) and used in MHT [114]. MHT holds multiple association hypotheses for a detected object and builds a hypothesis tree. Then, a scoring function is defined to determine the best suitable hypothesis for a detected object to obtain effective tracking performance. The MHT method is extended in [51] with appearance features of reduced dimension. This is done using a multi-output regularized least squares method. To increase the discrimination in the person re-identification task, a wide residual network (WRN) is introduced in [115]. l2-normalized, 128-dimensional deep features are extracted from the WRN and used for cosine softmax classification. These deep features are used to compute two distances (i.e., the minimum cosine distance and the Mahalanobis distance) between detections and existing tracks. The minimum dissimilarity from a cascade of these two distances is used to match a detection with the appropriate track, as sketched below. This method is able to obtain competitive on-line tracking performance in real time.
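The following is a simplified sketch of this two-distance association, assuming l2-normalized appearance features and pre-computed Mahalanobis distances from a Kalman filter. The gating thresholds are illustrative (9.4877 is the commonly used chi-square gate for a 4-dimensional measurement), and the full matching cascade of [115] additionally prioritizes recently updated tracks.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats, det_feats, maha, maha_gate=9.4877, cos_gate=0.3):
    """track_feats: (T, 128) l2-normalized; det_feats: (D, 128) l2-normalized;
    maha: (T, D) Mahalanobis distances from a Kalman motion model."""
    cos_dist = 1.0 - track_feats @ det_feats.T        # appearance (cosine) distance
    cost = np.where((maha > maha_gate) | (cos_dist > cos_gate),
                    1e5, cos_dist)                    # gate infeasible pairs
    rows, cols = linear_sum_assignment(cost)          # minimum-cost matching
    return [(t, d) for t, d in zip(rows, cols) if cost[t, d] < 1e5]
```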
Feature learning aims to assess the commonalities between detections and tracks. Considering this goal, a Siamese CNN [116] with two similar branches is developed for feature learning. Siamese CNNs fall into three categories: i) two branches with one cost layer, ii) two branches sharing some common CNN layers, and iii) double-stream stacked inputs. Based on a comparative study [116], the third category is found to be the best for extracting deep features; a minimal example of the first category is sketched below. Both motion information and deep features are fused with a gradient boosting algorithm to solve the tracking problem.
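The sketch below shows the first category, i.e., two weight-shared branches followed by a single cost (similarity) layer; the tiny backbone is an illustrative assumption. Training would typically use a binary cross-entropy loss on same/different labels for the two input patches.

```python
import torch
import torch.nn as nn

class SiameseCNN(nn.Module):
    """Two weight-shared branches with one cost layer (category i)."""
    def __init__(self):
        super().__init__()
        self.branch = nn.Sequential(                  # shared by both inputs
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, 64))
        self.cost = nn.Linear(128, 1)                 # cost layer on joined features

    def forward(self, patch_a, patch_b):
        fa, fb = self.branch(patch_a), self.branch(patch_b)
        return torch.sigmoid(self.cost(torch.cat([fa, fb], dim=1)))  # affinity
```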
The first architecture of Siamese CNN is utilized in [117] to learn the affinities of track-lets, replacing the previous features from ILDA [118]. This architecture is extended in [119] to learn the association affinities between the existing track-lets and detections. Here, the tracking is formulated as a generalized linear assignment problem and is solved using the soft-margin approach. The hinge loss is considered as the loss of the network. Both spatial and temporal information is required in distance learning for the MOT problem. For distance learning, to impart the effects of both constraints, segment-wise Mahalanobis distance-based matrices are used.

It is stated that pairwise images may be used in a Siamese CNN to learn affinities. This architecture can also be used to learn optical flow features that are extracted by deep CNNs [47]. It is therefore evident that the optical flow features are efficient in the on-line association of data as well as in tracking [120]. Compared to traditional algorithms, deep CNNs can produce more robust and smooth optical flow [47]. The optical flow-based features are effective in enhancing the tracking performance. In [119], a multi-cut framework is developed to construct a matching cost between detections and track-lets through deep matching features, and to enhance the association outputs. The cost of direct matching between long-term track-lets and detections using deep optical flow can lose the information
related to valid paths, and be unable to use them for tracking. Accordingly, the said method is modified in [121], where lifted edges are added for encoding re-identification deep features for tracking multiple objects.

5.2 Deep network embedding-based MOT

In this category, deep CNNs are designed as the core part of the tracking framework. They are usually trained with the help of samples obtained from tracking-related data. Here, a deep CNN is designed to obtain scores of multiple classifications for various track-lets. A deep binary classifier is then developed to indicate whether two detections belong to the same object or not. These deep network embedding-based MOT methods are mainly of three types, depending on three types of learning task, namely, discriminative deep network learning, deep metric learning, and generative deep network learning. Let the corresponding MOT methods be referred to as DN-MOT, DM-MOT, and GN-MOT, respectively. These methods are explained in Sections 5.2.1, 5.2.2, and 5.2.3, respectively.

5.2.1 DN-MOT

In this approach, object trackers optimize the discriminative models initially and then seek the best locations in the following frames to associate the detections with track-lets. The best locations are obtained according to these discriminative models. As deep CNNs are adopted widely for discriminative tasks, it is common that discriminative deep network models are used in tracking. As an example, a particle filtering framework is proposed in [122] for MOT. To track each detected object, two CNN-based classifiers are developed. Features from different layers of a deep CNN (i.e., VGG16 [43]) model-based object detector (i.e., Faster RCNN) are fed to these classifiers as inputs to classify the detected object. The first classifier uses features from the region proposal to classify the object instance, and the second classifier extracts features from the convolution layer and thereafter compares the classified object instance with the past features of the object to determine whether they are similar or not. The confidence scores of the classifiers are used to evaluate the weights of the particle filter, and finally, the tracking is done by particle filtering. A crucial issue of such a model is that the training of the network is done in off-line mode, whereas the object's historical features are updated in on-line mode.

Similar to [122], another MOT framework using object trackers is developed in [123]. Here, the tracker searches for a candidate which is the best among the image patches and neighboring detections. To handle occlusion in [123], spatial features are eventually learned based on the visible map using the convolution and fully connected layers. These spatial maps improve the tracking accuracy. Moreover, to reduce the time complexity of this model, the RoI pooling layer map, instead of the whole image frame, is shared with the classifier for tracking. The main difference between the studies of [122] and [123] is that the former uses a category classifier, whereas the latter considers occlusion features for tracking.

In tracking, a deep CNN can be used either for classification tasks or for learning regression models. The task of object detection and tracking can be considered as a regression task and learned with the aid of DL [64]. There are a few studies in MOT which use regression models. The tracking performance (i.e., precision) can be enhanced by using a regression loss. For example, the regression losses related to the bounding boxes in [124] are considered to improve the tracking performance. In [125], the tracking problem is treated as a bounding-box regression task using an RNN. However, this method can hardly handle occlusion and similar-object problems in the MOT task.

5.2.2 DM-MOT

In this category, deep metric learning-based methods are used for MOT. The training of such MOT methods results in learning which track-let belongs to a specific detection and whether two detections belong to the same object or not. It can be considered as an image-patch verification process. Similar to person re-identification [126] or face recognition [23], accurate affinity learning through a distance metric is adopted in DM-MOT methods. In [115], a deep metric learning network, called deep SORT, is designed and trained for the person re-identification and MOT problems. Here, motion features are fused with appearance features to achieve this goal. Deep SORT is good at tracking single-class objects, but it fails in multi-class object tracking. This is solved by the Multi-class Deep SORT (MCD-SORT) tracker [178]. Both motion and appearance features are used there to make the correct association between the detected object and the track-let. Searching for this association of an object with a trajectory is restricted only within the same class. This increases the performance in multi-class tracking.

A Siamese network is developed in [124] for MOT. Here, first, quadruplets of image patches are fed to this network as inputs. Thereafter, triple distances are measured between these image patches. The output of the network provides a ranking among the triple distances. Both motion features and appearance features are fused in this network with the help of the distance metrics. A CNN based on the triplet loss is developed in [127] to obtain the information of the distance metrics between track-lets and detections (see the sketch below). In [128], it is shown how motion features can be learned using the difference between the LSTM prediction and the detections in the next frame.
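A minimal sketch of such triplet-loss training is given below, where an anchor patch is pulled toward a patch of the same target and pushed away from a patch of a different one. The toy embedding network is our own assumption.

```python
import torch
import torch.nn as nn

# Toy embedding network for 64x64 RGB patches (an assumption).
embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
criterion = nn.TripletMarginLoss(margin=1.0)

anchor, positive, negative = (torch.randn(8, 3, 64, 64) for _ in range(3))
loss = criterion(embed(anchor), embed(positive), embed(negative))
loss.backward()   # pulls same-target patches together, pushes others apart
```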
Instead of learning the distance metric between detections and track-lets, the investigation in [129] is based on learning the distance metric between two track-lets. The network is able to extract a set of features from track-lets for each detection. Then, these features are fed to a gated recurrent unit (GRU) network as input. The output of the GRU network is pooled temporally and is used to build the local features in a Euclidean space. Based on the distance between the GRU network's outputs, several sub-track-lets are generated. These sub-track-lets are then re-connected to the long trajectories with the help of the similarity between the global features and local features.
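A minimal sketch of this track-let descriptor extraction is given below, where the per-detection appearance features of one track-let are passed through a GRU and the outputs are pooled temporally; the feature sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=128, hidden_size=64, batch_first=True)

tracklet = torch.randn(1, 10, 128)   # 10 detections, 128-d appearance each
outputs, _ = gru(tracklet)           # (1, 10, 64) per-step outputs
descriptor = outputs.mean(dim=1)     # temporal pooling -> one (1, 64) descriptor
```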
5.2.3 GN-MOT

In this approach, generative learning-based methods are used for MOT. This learning strategy is used in deep networks for appropriate parameter estimation. For the MOT problem [130–132], deep generative learning is used to increase the tracking performance. In [133], the posterior probability of the movement of an object and appearance features having a Gaussian distribution are modeled through linear regression. Here, the parameters of the regression model are learned with the help of a GRU network. The hidden layers of this network are updated after the completion of the operation for each frame, and are utilized to evaluate the mean and the deviation of the distribution for the following frame. In tracking, the joint probability between motion and appearance features is calculated [120]. This joint probability is used to match the track-let with a detection in the current frame. Thereafter, a greedy search algorithm for matching is used to determine the best results to associate the detection with an existing track-let. During this process, a threshold is preset to delete the matching results that have low probability values. This reduces the computation time.

An LSTM-based generative model is developed in [134] for prediction. This model consists of an encoder which is composed of stacked convolution layers. This encoder takes a sequence of ten image frames as input, and generates a pixel-wise probability map. The LSTM-based prediction module has two parts: short-term prediction and long-term prediction. Short-term prediction is done to associate detections with the track-lets, and long-term prediction is used for updating the trajectories. During this process, detections are generated through a Generative Adversarial Network (GAN). For a given frame, the associated detections are added to existing trajectories, and non-associated detections are considered as newly detected objects. When some trajectory does not get associated with any detection for more than ten frames, that trajectory is deleted from the tracking system.

5.3 End-to-end DL-based MOT

In this technique, DL networks are directly designed for obtaining the tracking results. MOT problems have various stages: building the relationships between detections and track-lets, updating the existing trajectories, initialization of new track-lets, and deletion of trajectories from the tracking system based on some criterion. It is difficult to model all the stages within a single framework and learn them entirely. Of late, the process of tracking has been simplified using some assumptions. Therefore, a few end-to-end learning approaches have been developed to implement these stages for MOT.

The states of track-lets, in the on-line MOT task, can be estimated with the aid of a recursive Bayesian filter, and each new detection can then be associated with one track-let based on a maximum similarity score. A network, called RNN-LSTM, is developed in [38] to model the stages of MOT. All these stages, such as the state estimation of track-lets, new detections, their matching matrix, and existence probabilities, are embedded into this network. The updated results on trajectories are outputted from this network. New probability scores corresponding to these trajectories are then computed to check whether some trajectory is terminated or not. Here, LSTMs are used to calculate the matching matrix between the track-lets and the detections. This matching matrix is used to train the RNN in an end-to-end fashion. This process shows promising tracking results over single-object tracking datasets only. The reasons are: i) this approach considers only motion information, ii) the initialization and termination of the trajectories do not use context information, and iii) the number of training images is not sufficient for the learning of this model.
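To make the association step concrete, the following sketch shows how an optimal one-to-one matching between track-lets and detections can be obtained from a similarity (matching) matrix, such as the one predicted by the LSTM above, using the Hungarian algorithm; the similarity threshold is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(similarity, min_sim=0.5):
    """similarity: (num_tracklets, num_detections) matching matrix."""
    rows, cols = linear_sum_assignment(-similarity)   # maximize total similarity
    pairs = [(t, d) for t, d in zip(rows, cols) if similarity[t, d] >= min_sim]
    unmatched_dets = set(range(similarity.shape[1])) - {d for _, d in pairs}
    return pairs, unmatched_dets   # unmatched detections may start new track-lets
```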
The aforesaid issue is solved in [135], where a hierarchical RNN model is designed to integrate different features, including appearance, motion, and their interaction features for each tracked object. This model has three typical sub-LSTM networks that can predict long-term motion features, and extract contextual features and multi-frame appearance for track-lets. The features of all such networks are thereafter concatenated. Then, these features are fed to the top hierarchy layer of the RNN as input to measure the matching scores between track-lets and detections in the current frame. For the training of this model, each LSTM network is pre-trained individually and fine-tuned after obtaining the results of the top LSTM network of the RNN. Here, the training is done in an end-to-end way. This model achieves better results as compared to existing methods in re-identifying a person. Six or fewer frames are used in the hierarchical RNNs to obtain optimal tracking results. This work is further extended in [136], where the detailed operation of the network (LSTM) in learning the appearance features is explored.
Between the input features and hidden states, a multiplication layer is added to explore the regression module and thereafter develop a bilinear LSTM module to associate detections with track-lets. This modified LSTM is good at dealing with appearance features only. Therefore, a bilinear LSTM for the appearance features and a conventional LSTM for the motion features are combined to obtain the matching classifier. This is called the MHT framework and it can be used for on-line tracking.

In globally optimized MOT, tracking can be modeled with the help of a network flow and a probabilistic graph. In [137], a min-cost network flow-based DL method is designed for MOT. Here, the loss function is defined as the weighted l2 distance of edge labels. Thus, min-cost network flows are built on different layers of the deep model and are optimized. Experimental results reveal its effectiveness in global tracking [137]. It is, therefore, expected that the graph model (network flow)-based global tracking algorithms can be extended by deep architectures.
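A toy network-flow formulation of tracking in this spirit is sketched below: each unit of flow corresponds to one trajectory, and edge costs encode association affinities (negative costs for likely links). The graph, costs, and capacities are illustrative assumptions, not the learned costs of [137].

```python
import networkx as nx

G = nx.DiGraph()
G.add_node("S", demand=-2)                 # two trajectories enter ...
G.add_node("T", demand=2)                  # ... and leave the video
for det in ["a1", "b1", "a2", "b2"]:       # frame-1 and frame-2 detections
    G.add_edge("S", det, weight=10, capacity=1)   # trajectory birth cost
    G.add_edge(det, "T", weight=10, capacity=1)   # trajectory death cost
G.add_edge("a1", "a2", weight=-5, capacity=1)     # strong affinity links
G.add_edge("b1", "b2", weight=-5, capacity=1)
G.add_edge("a1", "b2", weight=3, capacity=1)      # unlikely association

flow = nx.min_cost_flow(G)                 # trajectories = unit-flow paths
print(flow["a1"])                          # a1's outgoing flow goes to a2
```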
An overview of various object trackers, including their network, working principle, and type (on-line/off-line), is given in Table 2.

5.4 Deep network structure and training for tracking

A deep network has a huge number of parameters. Therefore, it is crucial to train the network accurately. Different network structures are utilized in the tracking process. Based on their functionality, deep network structures can be categorized into RNNs, CNNs, and their different integrations and variants. As a training strategy mainly depends on the network structure, we review here the different DL structures with their corresponding training strategies.

5.4.1 CNN-based MOT and training

CNNs are widely used in tracking due to their excellent capability in feature learning. During the training of a CNN, a task-specific objective function is defined and training data for holistic tracking is used. Object tracking follows object detection. Therefore, CNNs are pre-trained initially for the object detection task, and later fine-tuned according to the tracking task.

In order to improve the tracking performance, either conventional hand-crafted features are replaced with features extracted from CNN models [51, 138], or the training of CNN models is done using classified (labeled) datasets [115, 121]. Such datasets used for the training of CNN models are ImageNet [18] and the person re-identification datasets CUHK03 [139] and MARS [140]. For example, in deep SORT [115] tracking, the WRN is trained with the help of the MARS dataset.

Table 2 Overview of various trackers

Tracker Network Working principle Type

CNNTCM [119] CNN CNNs are trained using temporally constrained metrics for MOT Offline
JointMC [138] CNN Multi-person tracking is done by multi-cut and deep matching Offline
LMP [121] CNN Lifted multi-cut with person re-identification is done for multiple people tracking Offline
QuadMOT [124] CNN Quadruplet CNNs are co-related to track multi-objects Offline
DeepNetFlow [137] CNN MOT is done by deep network flow Offline
Generation cleaving and reconnection association (GCRA) [156] GRU Track-let cleaving and reconnection are done by deep Siamese Bi-GRU for solving the MOT problem Offline
Deep SORT [115] CNN Detections are associated with appropriate track-lets using both the appearance and Online
motion information
MHT-DAM [51] CNN Multiple hypothesis tracking revisited Online
AP-HWDPL [122] CNN Learning appearance model with deep features Online
STAM-MOT [123] CNN Spatio-temporal attention mechanism is adopted for MOT Online
CDA-DDAL [117] CNN CNNs are trained with discriminative appearance features for MOT Online
RNN-LSTM [38] RNN+LSTM Recurrent neural network-based features are stored by LSTM and further used to Online
make association between detections and track-lets
RAN [133] LSTM Recurrent autoregressive network-based appearance and motion features are used for MOT Online
AMIR [135] LSTM Appearance, motion, and their interaction are used to track long-term detections Online
MHT-bLSTM [136] LSTM Bilinear long short-term memory is used for MHT Online
MCD-SORT [178] CNN Association between detection and track-let is restricted within same object class Online
In a real-time tracking context, the person re-identification task based on such MARS training data may result in a huge number of mis-detections, partial detections, and false alarms. It is, therefore, required to train CNN models with the help of real-time tracking data [116, 122, 124].

For some nested CNNs, such as STAM-MOT [123] and CNNTCM [119], it is hard to optimize the network by adopting end-to-end training. Therefore, the sub-networks are first pre-trained and then cascaded one after another to obtain the whole network. Thereafter, fine-tuning of the whole network is required. STAM-MOT is developed using the VGG16 network [43], and it has three sub-networks: (i) a visible map, (ii) spatial features, and (iii) a classifier. These sub-networks are pre-trained, and the whole network is fine-tuned once the samples for tracking are stored. In CNNTCM, the sequence of images is split into a number of segments. Using these segments, the whole network is fine-tuned.
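A minimal sketch of this pre-train/fine-tune recipe is given below: a backbone pre-trained for classification receives a new head for the tracking-related (e.g., re-identification) task, and the optimizer uses a smaller learning rate for the pre-trained layers. The head size and learning rates are illustrative assumptions; depending on the torchvision version, the pre-trained weights may be requested differently.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

net = models.resnet18(weights="IMAGENET1K_V1")   # pre-trained for classification
net.fc = nn.Linear(net.fc.in_features, 128)      # new head for the tracking task

backbone = [p for n, p in net.named_parameters() if not n.startswith("fc")]
optimizer = optim.SGD([
    {"params": backbone, "lr": 1e-4},            # gently adapt pre-trained layers
    {"params": net.fc.parameters(), "lr": 1e-2}, # train the new head faster
], momentum=0.9)
```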
5.4.2 RNN-based MOT and training

Unlike CNNs, RNNs are suited to sequence modeling. They are able to predict a tracking state based on historical information. Therefore, RNNs show more effective tracking performance than CNNs. But the training of an RNN is always difficult, since in an RNN the integration of both appearance and motion features is a little difficult. Similar to CNNs, the training of RNNs requires both the pre-training of sub-networks and the fine-tuning of the entire network. The integration of the long-term motion of an object and its appearance features is done using the combination of LSTM and RNN [38, 135, 136]. To learn the track-lets' state and prediction, and the matching probability between track-lets and object detections, a modified RNN and LSTM are developed in [38]. Both mean square and log-likelihood errors are used for training here. In [136], the LSTM and its bilinear version are used to accommodate various appearance features. Here, first, the LSTMs are pre-trained individually with appearance and motion features, and then these two LSTMs are fine-tuned using the training data in an end-to-end manner. Of late, GRU-based RNNs are used for tracking [129]. Here, regression is adopted for the track-lets' prediction, and the training of the GRU is done by minimizing the log-likelihood error.

6 Some results on object detection and tracking

In this section, we summarize the results of some well-known detectors, trackers, and their different combinations over various benchmark datasets, such as ImageNet [18], PASCAL VOC [19], MS COCO [20], MOT2015 [141], and MOT2016 [142]. These datasets are considered in many areas of research because they allow a standard comparison between different algorithms and set goals for solutions. For each dataset, evaluation is done based on some specific performance metrics. The datasets and performance metrics are briefly described in Section 6.1. Based on the nature of the task, the results can be categorized into detection results and tracking results, as explained in Sections 6.2 and 6.3.

6.1 Benchmark datasets and performance metrics for detection and tracking

For the detection task, static images are required, whereas for tracking, videos are required. Datasets, such as PASCAL VOC, MS COCO and ImageNet, are utilized for general object detection. MOT2015 and MOT2016 are used for tracking. All these datasets, along with the performance metrics, are discussed in the following sections.

(a) PASCAL VOC: PASCAL VOC [19] has two series, called PASCAL VOC 07 and PASCAL VOC 12. PASCAL VOC 07 has 5K training and 5K test images, whereas PASCAL VOC 12 has 5.7K training and 5.7K test images. Each series contains 20 categories of objects: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and TV monitor. These 20 categories can be grouped into 4 main branches: vehicles, person, animals, and household objects. In the PASCAL VOC datasets, bounding boxes are labeled over 27,000 objects. Some examples of annotated images are shown in Fig. 3.

(b) MS COCO: The Microsoft Common Objects in Context (MS COCO) dataset [20] is created for two specific tasks: object detection and segmentation. This dataset contains 91 object categories, out of which 82 categories have more than 5,000 labeled instances. These labeled samples cover all 20 object classes that are present in the PASCAL VOC datasets. The dataset consists of 2,500,000 labeled instances in a total of 328,000 images. MS COCO concentrates on varied viewpoints and real-time instances (i.e., objects from the natural environment), resulting in rich contextual information. The three categories of images in the MS COCO dataset are shown in Fig. 4.

(c) ImageNet: ImageNet [18], also known as ILSVRC2014, is another important large-scale dataset. It has 200 object classes, nearly 450k training images, 20k validation images, and 40k test images. ImageNet is used for the task of object detection.

(d) MOT: This dataset has 11 videos, each containing either one and/or two object classes, namely, person and car, and is used widely in state-of-the-art MOT approaches.
Fig. 3 Annotated sample images from the PASCAL VOC dataset [1]

MOT has two parts: MOT2015 [141] and MOT2016 [142]. MOT contains sequences of images from diverse scenarios having different distributions for the detection of pedestrians.

MOT2015 and MOT2016 consist of sequences of 22 and 16 videos, respectively. Half of these sequences are utilized for the purpose of training. The rest of them are used only for testing. These videos are usually captured at different low and high frame rates, from both moving and static platforms. Other issues, such as illumination, occlusion, and (or) weather conditions, are also considered during the capturing of these videos.

(e) Performance metrics: The performance metrics used in the object detection and tracking tasks are:

(i) Mean average precision (mAP(%)): It is the mean of the average precision scores over all categories.
(ii) Multi-Object Tracking Accuracy (MOTA(%)): It is the overall tracking accuracy in terms of false positives, false negatives and identity switches.
(iii) Identity switches (IDS): Every trajectory is assigned one ID; identity switches refer to the number of times two trajectories switch their IDs.
(iv) Multi-Object Tracking Precision (MOTP(%)): It is the percentage measuring the alignment of the predicted bounding boxes with the ground-truth.
(v) Mostly tracked targets (MT(%)): It is the percentage of ground-truth trajectories covered by a track hypothesis for 80% of their life or more.
(vi) Mostly lost targets (ML(%)): It is the percentage of ground-truth trajectories covered by a track hypothesis for 20% of their life or less.
(vii) Speed (frames per sec (fps)): It is the number of frames processed per second in detection and tracking.

Metrics mAP and Speed are used for object detection, while MOTA, IDS, MOTP, MT, ML, and Speed are used for object tracking.

(i) Mean average precision (mAP(%)): It is the mean of 6.2 Analysis of existing general object detection
the average precision scores for each category. methods
(ii) Multi-Object Tracking Accuracy (MOTA(%)): It is
the overall tracking accuracy in terms of false Tables 3 and 4 summarize the performing results of
positives, false negatives and identity switches. various object detectors for MS COCO and PASCAL
(iii) Identity switches (IDS): Every trajectory is assigned VOC datasets, respectively. Both PASCAL VOC data
to one ID, Identity switches are referred to the and MS COCO data are widely used as large image
number of times two trajectories switch their IDs. databases for the tasks of object detection and classification.

(a) Iconic object images (b) Iconic scene images (c) Non-iconic images

Fig. 4 Image samples from MS-COCO dataset [1]


Table 3 Detection results of various general object detectors over MS COCO test-dev dataset

Method Data Backbone mAP (%)

Fast RCNN [63] train VGG-16 19.7


Faster RCNN [43] trainval VGG-16 21.9
R-FCN [64] trainval VGG-16 22.6
CoupleNet [143] trainval ResNet-101 34.4
Faster RCNN+++ [33] trainval ResNet-101-C4 34.9
Faster RCNN w FPN [65] trainval35k ResNet-101-FPN 36.2
Deformable R-FCN [77] trainval Alignmed-inception-ResNet 37.5
umd-ted [144] trainval ResNet-101 40.8
Mask RCNN [55] trainval35k ResNeXt-101 39.8
DCNv2+Faster RCNN [80] train118k ResNet-101 44.8
YOLOv2 [46] trainval35k DarkNet-19 21.6
YOLOv3 [72] trainval35k DarkNet-53 33.0
DSSD321 [73] trainval35k ResNet-101 28.0
SSD513 [79] trainval35k ResNet-101 31.2
DSSD513 [73] trainval35k ResNet-101 33.2
RetinaNet500 [74] trainval35k ResNet-101 34.4
RetinaNet800 [74] trainval35k ResNet-101-FPN 39.1
M2Det512 [75] trainval35k ResNet-101 38.8
M2Det800 [75] trainval35k VGG16 41.0
RefineDet320+ [76] trainval35k ResNet-101 38.6
RefineDet512+ [76] trainval35k ResNet-101 41.8
FPN [53] trainval35k ResNet101 39.8
NAS-FPN [57] trainval35k RetinaNet 40.5
NAS-FPN [57] trainval35k AmoebaNet 48.0
Granulated CNN [3] trainval35k ResNet-101 32.0

These two public datasets contain a large number of both annotated images and object classes. These images characterize varied viewpoints and real-time instances of different kinds of objects from the natural environment. As a result, researchers can get rich information for training, validation, and testing of their deep models using these data.

Table 4 Detection results of various detectors over PASCAL VOC dataset

Method Training Data Test data Region proposal Backbone mAP (%)

RCNN [45] VOC 07 VOC 07 SS AlexNet 58.5


RCNN [45] VOC 07 VOC 07 SS VGG16 66.0
Fast RCNN [63] VOC 07 + VOC 12 VOC 07 SS VGG16 66.9
YOLO + Fast RCNN [44] VOC 07 + VOC 12 VOC 12 SS VGG16 70.7
YOLOv2 [46] VOC 07 + VOC 12+MS COCO VOC 12 - DarkNet-19 78.2
Fast RCNN [63] VOC 07 + VOC 12 VOC 12 SS VGG16 68.4
Faster RCNN [43] VOC 07 + VOC 12 VOC 12 RPN VGG16 70.4
Faster RCNN [43] VOC 07 + VOC 12+MS COCO VOC 12 RPN VGG16 75.9
YOLO + Fast RCNN [44] VOC 07 + VOC 12 VOC 12 RPN VGG16 70.7
YOLOv2 [46] VOC 07 + VOC 12+MS COCO VOC 12 - DarkNet-19 78.2
SSD300 [79] VOC 07 + VOC 12+MS COCO VOC 12 - ResNet101 79.3
SSD512 [79] VOC 07 + VOC 12+MS COCO VOC 12 - ResNet101 82.2
R-FCN [64] VOC 07 + VOC 12+MS COCO VOC 12 RPN ResNet101 85.0
G-RCNN [178] VOC 07 + VOC 12 VOC 12 FRPN G-AlexNet 80.9
Accordingly, we have adopted them for the object detection and tracking problems, and for providing comparisons in performance among different models.

In our study, the results of various detectors (e.g., Faster RCNN, Mask RCNN, YOLO, YOLOv2, YOLOv3, SSD, DSSD, FPN, R-FCN, and DCN), trackers (e.g., AMIR, Deep SORT, MHT-DAM, CDA-DDAL, RNN-LSTM, QuadMOT, STAM-MOT, and Siamese CNN), and their different combinations are compared in depth to obtain the best detector-tracker model. These models have different characteristic features. For example, Faster RCNN, Mask RCNN, FPN, and R-FCN are widely used two-stage detectors, whereas YOLO, SSD, DSSD, and DCN are among the most advanced one-stage detectors. Among the aforesaid trackers, AMIR and RNN-LSTM are categorized as end-to-end DL-based trackers. MHT-DAM, CDA-DDAL, and Siamese CNN are widely used deep feature-based trackers. QuadMOT, STAM-MOT, and Deep SORT are among the most advanced deep embedding-based trackers. All these detectors and trackers are top-ranked and widely used in the domain of computer vision as state-of-the-art models. Therefore, we have adopted them in our paper for a comparative study. These results can be helpful to researchers who intend to use the existing deep models for object detection and tracking, as well as for comparing any new models, whenever designed.

We have collected these results from various research papers. From Table 3, it is evident that the typical baseline architectures augment the accuracy through the extraction of rich features (i.e., multi-scale and multi-level features) of objects having different sizes. As an example, by adopting VGG16 as the backbone with an input size of 512 on the MS COCO test-dev dataset, the mAP of RefineDet512 exceeded that of RefineDet320 (which uses VGG16 with a 320 input size) by 3.6%. Two-stage detectors, such as Faster RCNN, Mask RCNN, and FPN and its variants, achieve higher mAP scores as compared to one-stage detectors (e.g., YOLOv2, YOLOv3, SSD, DSSD and RefineDet). On the other hand, one-stage detectors achieve higher speed. In addition, it is seen that the integration of one- and two-stage detectors in one model achieves higher accuracy and speed than those obtained individually for object detection. For example, the integrated networks DCNv2+Faster RCNN [80] and NAS-FPN [57], with ResNet backbones, achieve the highest detection accuracy over the MS COCO test-dev dataset.

The testing results of various detectors over the PASCAL VOC dataset are shown in Table 4. It is seen that the region proposal network (RPN) enhances the detection accuracy as compared to conventional region proposal methods (see 85% mAP vs. 78.2% mAP in the 13th row and 5th row of Table 4). A larger amount of training data results in higher detection accuracy. Adopting the VGG16 network as a backbone, Faster RCNN trained with VOC07 + VOC12 + MS COCO data provides better detection accuracy as compared to the Faster RCNN having the same backbone network but trained with VOC07 + VOC12 data. Rich features always provide good results. R-FCN having a ResNet101 backbone is superior to R-FCN with ResNet50 in terms of detection accuracy. R-FCN is superior to the other two-stage detectors for the PASCAL VOC dataset.

It may be mentioned that the deep networks that result in a high mAP score also require high computation time, i.e., they have a slow frame-processing capability (low fps). For example, the method DCNv2+Faster RCNN [80] in Table 3 (10th row) that provides mAP = 44.8% can process only five frames per second (fps = 5), whereas the method YOLOv3 [72] (12th row) has fps = 45, i.e., it can process 45 frames per second, but it results in an mAP of 33%. Similarly, consider NAS-FPN [57] (24th row) and RefineDet320+ [76] (20th row). They have mAP scores of 48% and 38.6%, respectively, with corresponding fps values of 5 and 40.2. That means there is a trade-off between detection speed and accuracy.

Nothing is free!

This constitutes a big challenge: to strike a balanced compromise between these two performance indices depending on the problem and the need. Here comes the significance of the Granulated CNN [3] (last row), where by changing the granule size one can dictate this balance.

6.3 Results of tracking methods

Comparative performances of some popular trackers over the MOT2015 and MOT2016 datasets are shown in Tables 5 and 6, respectively, based on the results available in the existing literature. From the results on MOT2015 (Table 5), the end-to-end DL approaches (e.g., MHT-bLSTM and RNN-LSTM) are seen to provide overall better results. Deep network embedded approaches involving a deep metric (e.g., Siamese CNN and DAN) outperform (in terms of fps) the other approaches using only deep features as representation, except AP-HWDPL. From Table 6, the global optimization methods, namely LMP and GCRA, are seen to outperform the others, including the end-to-end RNN-based models. Further, the MOTA values show less deviation for the MOT2016 data as compared to MOT2015. This is due to the fact that object detection for MOT2016 is more stable than that for the MOT2015 data.

DL-based trackers with higher-order features for appearance and motion are seen to be more stable and robust. For instance, the AMIR tracker is more stable than the LMP tracker. Here, the former is a tracker based on an end-to-end RNN involving more features than the latter, which is a globally optimized method with lifted edges. Comparative results for various combinations of detectors and trackers are shown in Table 7. From this table, it is evident that the
Table 5 Tracking results over MOT2015 dataset

Tracker Type MOTA MOTP MT ML IDS FPS

AP-HWDPL [122] Online 38.51 72.6 8.73 37.45 586 6.7


AMIR [135] Online 37.56 71.7 15.81 26.77 1026 1.9
AM [123] Online 30.23 72.2 12.90 46.75 755 0.5
DAN [145] Online 38.30 71.1 17.60 41.20 1648 6.3
RAN [133] Online 35.10 70.9 13.04 42.31 381 5.4
STAM-MOT [123] Offline 34.34 70.5 11.41 43.39 348 0.5
QuadMOT [124] Offline 33.81 73.4 12.89 36.88 703 3.7
CDA-DDAL [117] Online 32.81 70.7 9.71 42.16 614 2.3
MHT-DAM [51] Offline 32.34 71.8 15.96 43.82 435 0.7
CNNTCM [119] Offline 29.63 71.8 11.25 43.97 712 1.7
SiameseCNN [116] Offline 29.06 71.2 8.46 48.41 639 52.8
RNN-LSTM [38] Online 18.98 71.0 5.53 45.65 1490 165.2

Table 6 Tracking results over MOT2016 dataset

Tracker Type MOTA MOTP MT ML IDS FPS

LMP [121] Offline 48.75 79.0 18.17 40.06 481 0.5


GCRA [129] Offline 48.15 77.5 12.90 41.10 821 2.8
AMIR [135] Online 47.17 75.8 13.95 41.62 774 1.0
RAN [133] Online 45.88 74.8 13.18 41.90 648 0.9
STAM-MOT [123] Offline 45.96 74.9 14.62 43.61 473 0.2
QuadMOT [124] Offline 44.10 76.4 14.62 44.93 745 1.8
CDA-DDAL [117] Online 43.88 74.7 10.66 44.40 676 0.5
MHT-DAM [51] Offline 45.82 76.3 16.22 43.22 590 0.8
MHT-bLSTM [136] Offline 42.09 75.9 14.88 44.41 753 1.8

Table 7 Results of (Detector + Tracker) over MOT2015 dataset

Detector Tracker MOTA MOTP MT ML IDS FPS

Fast feature pyramid [146] Submodular Optimization [147] 13.4 71.5 2.6 1123 14
SPPNet [68] IOUT [148] 19.4 28.9 17.7 18.4 2311 6902
RCNN [45] IOUT [148] 16.0 38.3 13.8 20.7 5029 -
CompACT [113] GOG [149] 14.2 37.0 13.9 19.9 3334 389
RCNN [45] DCT [150] 11.7 38.0 10.1 22.8 758 0.7
CompACT [113] CMOT [118] 12.6 36.1 16.1 18.6 285 3.8
CompACT [113] H2T [151] 12.4 35.7 14.8 19.4 852 3.0
ComapACT [113] IHTLS [152] 11.1 36.8 13.8 19.9 953 19.8
ComapACT [113] CEM [153] 5.1 35.2 3.0 35.3 267 4.6
SPPNet [68] DAN [145] 38.30 71.1 17.60 41.20 1648 6.3
Faster RCNN [43] SORT [154] 67.5 74.5 46.2 7.7 124 60
Faster RCNN [43] Deep SORT [115] 69.9 74.2 51.4 4.1 108 21
G-RCNN [178] MCD-SORT [178] 80.1 80.9 61.8 3.6 54 29
combination of Faster RCNN and Deep SORT is superior to the other combinations according to all kinds of tracking evaluation metrics.

7 Conclusions

In this study, we have provided a detailed review primarily of various deep learning (DL)-based models for the tasks of generic object detection, specific object detection, and object tracking, considering detection and tracking both individually and in combination. Some key observations on DL-based generic object detection are as follows. The baseline deep architecture of two-stage detectors enhances the accuracy by extracting richer features of objects and adopting multi-level and multi-scale features for different-sized object detection. By defining the focal loss function appropriately, one-stage detectors are found to be able to filter out the easy samples (background), thereby reducing greatly the number of target proposals and improving in turn the detection speed and precision. This may be applicable to two-stage detectors too. Combining one-stage and two-stage detectors produces better results as compared to those obtained individually. To address the geometric variation in image frames, adopting deformable convolution layers is an effective way. Modeling the relationship between different objects in an image, as expected, improves the detection performance. Incorporation of granulation within the deep learning model improves the detection speed, with only a little sacrifice in accuracy.

Some salient observations on DL-based specific object detection are as follows. CNN facilitates the extraction of salient information from local regions in an image frame. Modeling the visual saliency along the boundaries of different regions using super-pixel segmentation improves the CNN performance in occlusion detection. Extraction of multi-scale deep features is of significance for characterizing the local context in images. Strengthening the local connections (weight parameters) between different CNN layers based on the local and global information from images improves object detection.

Similarly, for object tracking, end-to-end DL-based methods are superior to deep feature-based and deep embedding-based methods. Generative networks exhibit outstanding tracking results as compared to discriminative networks. Learning higher-order features or transferring on-line features is expected to provide good tracking performance in complex environments. Object tracking using higher-order appearance and motion features is seen to be more stable and robust. Finally, the combination of Faster RCNN and Deep SORT is seen to be superior to the other combinations in terms of both speed and accuracy as per the indices considered.

8 Discussions: applications, challenges, and concerns

DL-based object detection and tracking is growing rapidly due to the continuous upgradation of powerful computing equipment. Object detection is followed by object tracking. Therefore, the tracking accuracy primarily depends on the accuracy of the detection of objects over video frames. Comparative studies among various popular detectors and trackers, as well as their different combinations, have been provided in detail. These comparisons are made in terms of both the characteristic features of the models and their performances. In this section, we discuss some current applications and trends of object detection and tracking in different domains. This also includes several pertinent challenging issues for future investigations. Finally, certain concerns for researchers are mentioned.

8.1 Object detection: applications and challenges

Object detection has been applied widely in various fields, including military, security, transportation, medical, and life. These are briefly explained, citing references, as follows:

8.1.1 Security

In security, the most popular applications include the detection of faces [155], pedestrians [156] and anomalies [157]. The objective of face detection is to detect people's faces in an image. Facial landmark localization, estimation of head pose, and recognition of gender are three main components concerning face detection. Readers may refer to the survey paper [10] for more details about face detection, including the application of DL. Pedestrian detection means detecting pedestrians in a natural scene. For more details, refer to the survey [12]. Anomaly detection has various applications, such as fraud detection, road safety and health-care monitoring. A good survey on this is provided in [157].

8.1.2 Military

The military field involves various tasks, for example, object detection using remote sensing [158], topographic survey, and the detection of flyers. In remote sensing object detection [158], objects are detected from remote sensing images/videos. This task has two challenges. First, the target size is extremely small, which makes the object detection procedure very time-consuming (i.e., too slow) for practical use. Second, the complex backgrounds often result in false detections. Due to the dearth of information in remote sensing object detection, strong pipelines, like Faster
RCNN, SSD, FCN and YOLO, cannot work well in this domain. Therefore, remote sensing object detection remains a hot research topic. For more details, readers are referred to the survey [159].

8.1.3 Transportation

Object detection in the transportation field involves various applications, such as license plate recognition, automatic driving, and traffic sign recognition. License plate recognition is required in detecting residential access and traffic violations. Various features, such as edge, texture, morphology, and sliding concentric windows, are integrated using connected component analysis for making the task of license plate recognition more robust [160]. Recently, DL has been adopted for license plate recognition [161], too. One may refer to [162] in this context. Sensor fusion is utilized in [163] to obtain features for autonomous driving. The survey [164] provides more details.

8.1.4 Medical

Medical image detection, cancer/disease detection, and health-care monitoring represent some applications of object detection in the medical field. A domain adaptation framework [165] is required for the detection of medical images. Computer-Aided Diagnosis (CAD) can assist doctors in the classification of varying types of cancers. Recently, CNNs have been trained with a large-scale glaucoma dataset for glaucoma detection [166]. Two recent survey papers [167, 168] may be referred to.

8.1.5 Life

Applications, such as pattern detection, event detection, rain/shadow detection, image caption generation, and species identification, represent some key tasks here. Event detection aims to detect real-world events from Internet news of festivals, disasters, talks and elections, among others. One may refer to the survey in [11] for further information. Research on the appropriate detection of patterns is challenging due to several factors, including pose variation, scene occlusion, different illumination and sensor noise. To achieve promising results, some researchers designed strong baseline architectures for pattern detection in 2D images [169] and 3-dimensional point clouds [170]. In image caption generation, the computer automatically generates a caption for a given image. Here, the semantic information of images is captured and expressed using natural language processing. Both computer vision and natural language processing technologies are used for image caption generation, and it is a major challenging task. The issue is handled by adopting encoder-decoder frameworks, multi-modal embedding, the attention mechanism [169], and, most importantly, reinforcement learning [171]. The survey article [171] provides more details. A DL architecture is also designed in [172] for rain detection from images.

The aforesaid applications are just some example applications of DL. There are several other domains where the merits of DL technology are being explored.

8.1.6 Challenging issues

Although the achievement of object detection in various fields is enormous, there still remain many scopes for further improvement. These include: i) combining single-stage and two-stage detectors for object detection, ii) exploration of post-processing methods for object detection improvement, iii) development of weakly supervised object detection (WSOD) algorithms, iv) designing unsupervised frameworks for intelligent detection systems, v) development of multi-domain object detectors, vi) adaptation of multi-task learning in object detection, vii) fusing multi-source information, viii) exploration of GAN-based detectors when labeled images are scarce, and ix) making use of cell phone-based family diagnostic tools. Besides these, there are some higher-level challenging issues leading to much broader and deeper future scopes of DL research, as follows:

(a) One may note that granular computing (GrC) has recently drawn the attention of researchers for designing intelligent systems, in general. Its application based on rough-fuzzy sets to image processing, and object detection and tracking, has been evident [28, 29, 87, 173] for dealing with uncertainties arising from, say, overlapping, occlusion, and the sudden appearance of objects, among others. Since GrC is reputed for computational gain, attempts [3, 178] have been made recently to integrate it with deep CNNs judiciously in order to make the CNN computationally speedy, while sacrificing the detection accuracy little. The formation of granules dictates the extent of compromise, or trade-off balance, between the speed and accuracy. Therefore, it is a challenging issue for future researchers.

(b) Z-numbers, as explained by Zadeh in 2011 [71], provide a summary of the meaning of a natural language expression in terms of its qualitative aspect and embedded uncertainty. They may be used to design a framework for the quantitative abstraction of information in describing the output scene of deep networks for video-object detection [3, 179–181]. Exploiting the
merits of Z-numbers in modeling the interpret-ability than false positives [135]. All these may constitute a part of
of the output in natural language, therefore, constitutes future investigations.
another challenge.
(c) It may be mentioned that fuzzy sets and rough sets are 8.3 Some concerns
reputed for input/ output representation and learning
the network parameters [174, 175] when the input data While developing AI and DL technologies for various
is vague, linguistic, ill-defined, or incomplete. These applications in data science, one may observe its evolution
characteristics may therefore be crucial in designing through related technologies/ disciplines over the decades
DNNs in ambiguous situations, and thus need to be more or less as:
explored. Pattern Recognition (1960’s) → Image Processing
(d) Further, ANN based models for machine leaning (1970’s) → AI/ML/Artificial Neural Networks (1980’s)
are known as “black-box” models, where even their → Knowledge Based System (1990’s) → Data Mining
designers may not be able to explain why the AI (2000’s) → Big Data (2010) → Deep Learning and Data-
arrived at a specific decision. The technical challenge driven Science (2017).
of explaining AI based decisions is sometimes known At each evolution of the mother subject – Pattern
as the “interpret-ability” problem. Deep leaning Recognition (PR), new approaches were developed for
models, being a complex AI system, is naturally its different tasks to handle the varying nature of data,
non-interpretable. Therefore, it leads to the issue of as well as decision-making problems. New terms and
trust-ability of the output solution. Here comes the technologies were accordingly coined with Big Hopes.
necessity of explainable AI systems, i.e., explainable However, a beginner should not suddenly jump into
deep models which can explain to a user to understand the new technologies without knowing its background
the AI’s cognition so as to determine when to trust theories adequately. For example, to know DL, one should
or accept the output solution and when to discard. know Artificial Neural Networks (ANN) and ML (shallow
To make this explanation in natural language for learning). And to know the latter, one should have complete
convenience, fuzzy set theory may be used. One may knowledge on pattern recognition. Otherwise, it may lead
refer to [176] in this context concerning the basic to dis-satisfaction just by blaming the DL technology and
concepts for generation of linguistic rules explaining CNNs. One may remember in this context, for example,
the output decision in terms of input features. what happened with ANN research when it revived in
8.2 Object tracking and issues

The task of object tracking aims to detect specific objects in a static image frame and then estimate the objects’ moving trajectories across subsequent video frames. Object tracking follows object detection; therefore, the difficulties in the tracking task mainly arise from: i) incorrect or imprecise object detection, ii) deciding whether an object is a true incomer or not, iii) proper association between the detections and tracklets, and iv) occurrence of false alarms. There are certain further issues concerning tracking, as follows. Although a large number of studies have been done to solve the MOT problem for a single class, the same for multi-class problems is not yet much explored. Task-specific deep networks are effective in tracking, but they are not suitable for complex conditions. Learning deep networks using higher-order features is required to increase tracking performance. Scenario learning is required to differentiate moving objects from the background and to promote motion prediction; this is useful for moving platforms. End-to-end DL-based tracking approaches output a large number of false negatives.
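Of these difficulties, the detection-to-tracklet association step in iii) is the easiest to make concrete. The following minimal sketch, in the spirit of the SORT family of trackers [154], matches the current frame’s detections to existing tracklets by maximizing the total IoU with the Hungarian algorithm; unmatched detections become candidate incomers, and weak matches are rejected as possible false alarms. The boxes, the 0.3 threshold, and the helper names are assumptions made for this example, not values taken from any surveyed method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) form."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracklets, detections, iou_min=0.3):
    """Match detections to tracklets by maximizing total IoU; unmatched
    detections are treated as new incomers, and unmatched tracklets as
    missed targets (possible exits or occlusions)."""
    if not tracklets or not detections:
        return [], list(range(len(tracklets))), list(range(len(detections)))
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracklets])
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    matches = []
    un_t, un_d = set(range(len(tracklets))), set(range(len(detections)))
    for r, c in zip(rows, cols):
        if 1.0 - cost[r, c] >= iou_min:        # reject weak matches as false alarms
            matches.append((r, c))
            un_t.discard(r)
            un_d.discard(c)
    return matches, sorted(un_t), sorted(un_d)

tracks = [(10, 10, 50, 80), (200, 40, 260, 120)]
dets   = [(12, 12, 53, 82), (400, 100, 450, 180)]
print(associate(tracks, dets))  # one match, one missed track, one new detection
```

Practical trackers replace the raw IoU cost with motion predictions and learned appearance affinities, e.g., the Kalman filter of SORT [154] and the deep association metric of Deep SORT [115]; that is precisely where the deep networks surveyed above enter the pipeline.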
→ Knowledge Based System (1990’s) → Data Mining (2000’s) → Big Data (2010) → Deep Learning and Data-driven Science (2017).

At each evolution of the mother subject, Pattern Recognition (PR), new approaches were developed for its different tasks to handle the varying nature of data as well as decision-making problems. New terms and technologies were accordingly coined with big hopes. However, a beginner should not suddenly jump into the new technologies without knowing their background theories adequately. For example, to know DL, one should know Artificial Neural Networks (ANN) and ML (shallow learning); and to know the latter, one should have complete knowledge of pattern recognition. Otherwise, it may lead to dissatisfaction just by blaming the DL technology and CNNs. One may remember in this context, for example, what happened with ANN research when it revived in the 1980’s with big expectations: lots of R & D (research and development) funds were invested in academia and industry, several new journals appeared, and so on. But within a span of about twelve to fifteen years, the subject almost lost its interest at a rate similar to that of its growth. One of the main reasons was too quick an expectation without developing the science behind the functioning of this “black-box” system, together with attempts to apply the same set of models or frameworks, be they supervised or unsupervised, to almost every domain of application without studying their relevance or realizing that the requisite framework might have demanded building new application-specific models.

Hopefully, learning from that previous example will prevent a recurrence of similar feelings about Deep Learning research!

Acknowledgements S.K. Pal acknowledges the National Science Chair, SERB-DST, Government of India.

Declarations

Conflict of interest The paper is original in its contents and is not under consideration for publication in any other journals/proceedings. There is no potential conflict of interest to disclose, such as employment or financial or non-financial interests. No funding was received for this work. The authors have no financial or proprietary interests in any material discussed in this article.
References

1. Jiao L, Zhang F, Liu F, Yang S, Li L, Feng Z, Qu R (2019) A survey of deep learning-based object detection. IEEE Access 7:128837–128868
2. Pal SK (2018) Data science and technology: challenges, opportunities and national relevance. 14th Annual Convocation Speech, National Institute of Technology, Calicut
3. Pal SK, Bhoumik D, Chakraborty DB (2020) Granulated deep learning and z-numbers in motion detection and object recognition. Neural Comput Appl 32(21):16533–16548
4. Chakraborty DB, Pal SK (2021) Granular Video Computing: with Rough Sets, Deep Learning and in IoT. World Scientific, Singapore
5. Liu Y, Cheng M-M, Hu X, Wang K, Bai X (2017) Richer convolutional features for edge detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3000–3009
6. Pal SK, King RA (1983) On edge detection of x-ray images using fuzzy sets. IEEE Trans Pattern Anal Mach Intell 5(1):69–77
7. Deravi F, Pal SK (1983) Grey level thresholding using second-order statistics. Pattern Recogn Lett 1(5-6):417–422
8. Pal SK, King RA, Hashim AA (1983) Automatic grey level thresholding through index of fuzziness and entropy. Pattern Recogn Lett 1(3):141–146
9. Cao Z, Simon T, Wei S-E, Sheikh Y (2017) Realtime multi-person 2d pose estimation using part affinity fields. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7291–7299
10. Masi I, Wu Y, Hassner T, Natarajan P (2018) Deep face recognition: A survey. In: 2018 31st SIBGRAPI conference on graphics, patterns and images (SIBGRAPI). IEEE, pp 471–478
11. Hasan M, Orgun MA, Schwitter R (2018) A survey on real-time event detection from the twitter data stream. J Inf Sci 44(4):443–463
12. Brunetti A, Buongiorno D, Trotta GF, Bevilacqua V (2018) Computer vision and deep learning techniques for pedestrian detection and tracking: A survey. Neurocomputing 300:17–33
13. Ren X, Zhou Y, He J, Chen K, Yang X, Sun J (2016) A convolutional neural network-based chinese text detection algorithm via text structure modeling. IEEE Trans Multimed 19(3):506–518
14. Fan D-P, Wang W, Cheng M-M, Shen J (2019) Shifting more attention to video salient object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8554–8564
15. Pal NR, Pal SK (1993) A review on image segmentation techniques. Pattern Recogn 26(9):1277–1294
16. Dollar P, Wojek C, Schiele B, Perona P (2011) Pedestrian detection: An evaluation of the state of the art. IEEE Trans Pattern Anal Mach Intell 34(4):743–761
17. Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp 3354–3361
18. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
19. Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2010) The pascal visual object classes (voc) challenge. Int J Comput Vis 88(2):303–338
20. Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: Common objects in context. In: European conference on computer vision. Springer, pp 740–755
21. Kuznetsova A, Rom H, Alldrin N, Uijlings J, Krasin I, Pont-Tuset J, Kamali S, Popov S, Malloci M, Duerig T et al (2018) The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982
22. Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. Commun ACM 60(6):84–90
23. Zhang X, Fang Z, Wen Y, Li Z, Qiao Y (2017) Range loss for deep face recognition with long-tailed training data. In: Proceedings of the IEEE International Conference on Computer Vision, pp 5409–5418
24. Chung D, Tahboub K, Delp EJ (2017) A two stream siamese convolutional neural network for person re-identification. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1983–1991
25. Zhao S, Liu Y, Han Y, Hong R, Hu Q, Tian Q (2017) Pooling the convolutional layers in deep convnets for video action recognition. IEEE Trans Circ Syst Video Technol 28(8):1839–1849
26. Geng H, Zhang H, Xue Y, Zhou M, Xu G, Gao Z (2017) Semantic image segmentation with fused cnn features. Optoelectron Lett 13(5):381–385
27. Chakraborty DB, Pal SK (2016) Neighborhood granules and rough rule-base in tracking. Nat Comput 15(3):359–370
28. Chakraborty DB, Pal SK (2017) Neighborhood rough filter and intuitionistic entropy in unsupervised tracking. IEEE Trans Fuzzy Syst 26(4):2188–2200
29. Pal SK, Chakraborty DB (2016) Granular flow graph, adaptive rule generation and tracking. IEEE Trans Cybern 47(12):4096–4107
30. Wang N, Yeung D-Y (2013) Learning a deep compact image representation for visual tracking. In: Advances in neural information processing systems, pp 809–817
31. Henriques JF, Caseiro R, Martins P, Batista J (2014) High-speed tracking with kernelized correlation filters. IEEE Trans Pattern Anal Mach Intell 37(3):583–596
32. Choi J, Jin Chang H, Fischer T, Yun S, Lee K, Jeong J, Demiris Y, Young Choi J (2018) Context-aware deep feature compression for high-speed visual tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 479–488
33. Valmadre J, Bertinetto L, Henriques J, Vedaldi A, Torr PHS (2017) End-to-end representation learning for correlation filter based tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2805–2813
34. Li J, Zhou X, Chan S, Chen S (2017) Object tracking using a convolutional network and a structured output svm. Comput Vis Media 3(4):325–335
35. Nam H, Han B (2016) Learning multi-domain convolutional neural networks for visual tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4293–4302
36. Danelljan M, Robinson A, Khan FS, Felsberg M (2016) Beyond correlation filters: Learning continuous convolution operators for visual tracking. In: European conference on computer vision. Springer, pp 472–488
37. Ma C, Huang J-B, Yang X, Yang M-H (2015) Hierarchical convolutional features for visual tracking. In: Proceedings of the IEEE international conference on computer vision, pp 3074–3082
38. Milan A, Rezatofighi SH, Dick A, Reid I, Schindler K (2016) Online multi-target tracking using recurrent neural networks. arXiv:1604.03635
39. Li P, Wang D, Wang L, Lu H (2018) Deep visual tracking: Review and experimental comparison. Pattern Recogn 76:323–338
40. Xu Y, Zhou X, Chen S, Li F (2019) Deep learning for multiple object tracking: a survey. IET Comput Vis 13(4):355–368
41. Leal-Taixé L, Milan A, Schindler K, Cremers D, Reid I, Roth S (2017) Tracking the trackers: an analysis of the state of the art in multiple object tracking. arXiv:1704.02781
42. Zhao Z-Q, Zheng P, Xu S, Wu X (2019) Object detection with deep learning: A review. IEEE Trans Neural Networks Learn Syst 30(11):3212–3232
43. Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 91–99
44. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788
45. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587
46. Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7263–7271
47. Weinzaepfel P, Revaud J, Harchaoui Z, Schmid C (2013) Deepflow: Large displacement optical flow with deep matching. In: Proceedings of the IEEE international conference on computer vision, pp 1385–1392
48. Cheng HY, Hwang JN (2007) Multiple-target tracking for crossroad traffic utilizing modified probabilistic data association. In: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'07), vol 1. IEEE, pp I–921
49. Lim Y-C, Lee M, Lee C-H, Kwon S, Lee J (2010) Improvement of stereo vision-based position and velocity estimation and tracking using a stripe-based disparity estimation and inverse perspective map-based extended kalman filter. Opt Lasers Eng 48(9):859–868
50. Cao X, Lan J, Yan P, Li X (2012) Vehicle detection and tracking in airborne videos by multi-motion layer analysis. Mach Vis Appl 23(5):921–935
51. Kim C, Li F, Ciptadi A, Rehg JM (2015) Multiple hypothesis tracking revisited. In: Proceedings of the IEEE international conference on computer vision, pp 4696–4704
52. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
53. Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2117–2125
54. Li Z, Peng C, Yu G, Zhang X, Deng Y, Sun J (2018) Detnet: A backbone network for object detection. arXiv:1804.06215
55. He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp 2961–2969
56. Xie S, Girshick R, Dollár P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1492–1500
57. Ghiasi G, Lin T-Y, Le QV (2019) Nas-fpn: Learning scalable feature pyramid architecture for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7036–7045
58. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861
59. Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K (2016) Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv:1602.07360
60. Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1251–1258
61. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C (2018) Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4510–4520
62. Rawat W, Wang Z (2017) Deep convolutional neural networks for image classification: A comprehensive review. Neural Comput 29(9):2352–2449
63. Girshick R (2015) Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp 1440–1448
64. Dai J, Li Y, He K, Sun J (2016) R-fcn: Object detection via region-based fully convolutional networks. In: Advances in neural information processing systems, pp 379–387
65. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
66. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
67. Felzenszwalb PF, Girshick RB, McAllester D, Ramanan D (2009) Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell 32(9):1627–1645
68. He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37(9):1904–1916
69. Bell S, Lawrence Zitnick C, Bala K, Girshick R (2016) Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2874–2883
70. Liu J, Zhang S, Wang S, Metaxas DN (2016) Multispectral deep neural networks for pedestrian detection. arXiv:1611.02644
71. Zadeh LA (2011) A note on z-numbers. Inf Sci 181(14):2923–2932
72. Redmon J, Farhadi A (2018) Yolov3: An incremental improvement. arXiv:1804.02767
73. Fu C-Y, Liu W, Ranga A, Tyagi A, Berg AC (2017) Dssd: Deconvolutional single shot detector. arXiv:1701.06659
74. Lin T-Y, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp 2980–2988
75. Zhao Q, Sheng T, Wang Y, Tang Z, Chen Y, Cai L, Ling H (2019) M2det: A single-shot object detector based on multi-level feature pyramid network. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 33, pp 9259–9266
76. Zhang S, Wen L, Bian X, Lei Z, Li SZ (2018) Single-shot refinement neural network for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4203–4212
77. Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 764–773
78. Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167
79. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, Berg AC (2016) Ssd: Single shot multibox detector. In: European conference on computer vision. Springer, pp 21–37
80. Zhu X, Hu H, Lin S, Dai J (2019) Deformable convnets v2: More deformable, better results. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 9308–9316
81. Yang Z, Nevatia R (2016) A multi-scale cascade fully convolutional network face detector. In: 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, pp 633–638
82. Tu W-C, He S, Yang Q, Chien S-Y (2016) Real-time salient object detection with a minimum spanning tree. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2334–2342
83. Yang J, Yang M-H (2016) Top-down visual saliency via joint crf and dictionary learning. IEEE Trans Pattern Anal Mach Intell 39(3):576–588
84. Tomè D, Monti F, Baroffio L, Bondi L, Tagliasacchi M, Tubaro S (2016) Deep convolutional neural networks for pedestrian detection. Signal Process Image Commun 47:482–489
85. Zhao Z-Q, Bian H, Hu D, Cheng W, Glotin H (2017) Pedestrian detection based on fast r-cnn and batch normalization. In: International Conference on Intelligent Computing. Springer, pp 735–746
86. Rother C, Bordeaux L, Hamadi Y, Blake A (2006) Autocollage. ACM Trans Graph (TOG) 25(3):847–852
87. Chakraborty D, Shankar BU, Pal SK (2013) Granulation, rough entropy and spatiotemporal moving object detection. Appl Soft Comput 13(9):4001–4009
88. Pal SK, Mitra P (2002) Multispectral image segmentation using the rough-set-initialized em algorithm. IEEE Trans Geosci Remote Sens 40(11):2495–2501
89. Pal SK, Shankar BU, Mitra P (2005) Granular computing, rough entropy and object extraction. Pattern Recogn Lett 26(16):2509–2517
90. Rosin PL (2009) A simple method for detecting salient regions. Pattern Recogn 42(11):2363–2371
91. Liu T, Yuan Z, Sun J, Wang J, Zheng N, Tang X, Shum H-Y (2010) Learning to detect a salient object. IEEE Trans Pattern Anal Mach Intell 33(2):353–367
92. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
93. Gao D, Han S, Vasconcelos N (2009) Discriminant saliency, the detection of suspicious coincidences, and applications to visual recognition. IEEE Trans Pattern Anal Mach Intell 31(6):989–1005
94. Xie S, Tu Z (2015) Holistically-nested edge detection. In: Proceedings of the IEEE international conference on computer vision, pp 1395–1403
95. Vig E, Dorr M, Cox D (2014) Large-scale optimization of hierarchical features for saliency prediction in natural images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2798–2805
96. Huang X, Shen C, Boix X, Zhao Q (2015) Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp 262–270
97. Wang L, Lu H, Ruan X, Yang M-H (2015) Deep networks for saliency detection via local estimation and global search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3183–3192
98. Cholakkal H, Johnson J, Rajan D (2018) Backtracking spatial pyramid pooling-based image classifier for weakly supervised top-down salient object detection. IEEE Trans Image Process 27(12):6064–6078
99. He S, Lau RWH, Liu W, Huang Z, Yang Q (2015) Supercnn: A superpixelwise convolutional neural network for salient object detection. Int J Comput Vis 115(3):330–344
100. Tang Y, Wu X (2016) Saliency detection via combining region-level and pixel-level predictions with cnns. In: European Conference on Computer Vision. Springer, pp 809–825
101. Wang X, Ma H, Chen X, You S (2017) Edge preserving and multi-scale contextual neural network for salient object detection. IEEE Trans Image Process 27(1):121–134
102. Gao X, Wang N, Tao D, Li X (2012) Face sketch–photo synthesis and retrieval using sparse representation. IEEE Trans Circ Sys Video Technol 22(8):1213–1226
103. Niu B, Yang Q, Shiu SCK, Pal SK (2008) Two-dimensional laplacianfaces method for face recognition. Pattern Recogn 41(10):3237–3243
104. Wang N, Tao D, Gao X, Li X, Li J (2014) A comprehensive survey to face hallucination. Int J Comput Vis 106(1):9–30
105. Majumder A, Behera L, Subramanian VK (2016) Automatic facial expression recognition system using deep network-based data fusion. IEEE Trans Cybern 48(1):103–114
106. Jiang H, Learned-Miller E (2017) Face detection with the faster r-cnn. In: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017). IEEE, pp 650–657
107. Sun X, Wu P, Hoi SCH (2018) Face detection using deep learning: An improved faster rcnn approach. Neurocomputing 299:42–50
108. Wang H, Li Z, Ji X, Wang Y (2017) Face r-cnn. arXiv:1706.01061
109. Huang L, Yang Y, Deng Y, Yu Y (2015) Densebox: Unifying landmark localization with end to end object detection. arXiv:1509.04874
110. Li Y, Sun B, Wu T, Wang Y (2016) Face detection with end-to-end integration of a convnet and a 3d model. In: European Conference on Computer Vision. Springer, pp 420–436
111. Zhang L, Lin L, Liang X, He K (2016) Is faster r-cnn doing well for pedestrian detection? In: European conference on computer vision. Springer, pp 443–457
112. Tian Y, Luo P, Wang X, Tang X (2015) Deep learning strong parts for pedestrian detection. In: Proceedings of the IEEE international conference on computer vision, pp 1904–1912
113. Cai Z, Saberian M, Vasconcelos N (2015) Learning complexity-aware cascades for deep pedestrian detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp 3361–3369
114. Reid D (1979) An algorithm for tracking multiple targets. IEEE Trans Autom Control 24(6):843–854
115. Wojke N, Bewley A, Paulus D (2017) Simple online and realtime tracking with a deep association metric. In: 2017 IEEE international conference on image processing (ICIP). IEEE, pp 3645–3649
116. Leal-Taixé L, Canton-Ferrer C, Schindler K (2016) Learning by tracking: Siamese cnn for robust target association. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 33–40
117. Bae S-H, Yoon K-J (2017) Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking. IEEE Trans Pattern Anal Mach Intell 40(3):595–610
118. Bae S-H, Yoon K-J (2014) Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1218–1225
119. Wang B, Wang L, Shuai B, Zuo Z, Liu T, Luk Chan K, Wang G (2016) Joint learning of convolutional neural networks and temporally constrained metrics for tracklet association. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 1–8
120. Xiang Y, Alahi A, Savarese S (2015) Learning to track: Online multi-object tracking by decision making. In: Proceedings of the IEEE international conference on computer vision, pp 4705–4713
121. Tang S, Andriluka M, Andres B, Schiele B (2017) Multiple people tracking by lifted multicut and person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3539–3548
122. Chen L, Ai H, Shang C, Zhuang Z, Bai B (2017) Online multi-object tracking with convolutional neural networks. In: 2017 IEEE International Conference on Image Processing (ICIP). IEEE, pp 645–649
123. Chu Q, Ouyang W, Li H, Wang X, Liu B, Yu N (2017) Online multi-object tracking using cnn-based single object tracker with spatial-temporal attention mechanism. In: Proceedings of the IEEE International Conference on Computer Vision, pp 4836–4845
124. Son J, Baek M, Cho M, Han B (2017) Multi-object tracking with quadruplet convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5620–5629
125. Fang K (2016) Track-rnn: joint detection and tracking using recurrent neural networks. In: Proceedings of the 29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona
126. Zhou S, Wang J, Wang J, Gong Y, Zheng N (2017) Point to set similarity based deep feature learning for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3741–3750
127. Xiang J, Zhang G, Hou J, Sang N, Huang R (2018) Multiple target tracking by learning feature representation and distance metric jointly. arXiv:1802.03252
128. Cheng D, Gong Y, Zhou S, Wang J, Zheng N (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1335–1344
129. Ma C, Yang C, Yang F, Zhuang Y, Zhang Z, Jia H, Xie X (2018) Trajectory factory: Tracklet cleaving and re-connection by deep siamese bi-gru for multiple object tracking. In: 2018 IEEE International Conference on Multimedia and Expo (ICME). IEEE, pp 1–6
130. Fernando T, Denman S, Sridharan S, Fookes C (2018) Task specific visual saliency prediction with memory augmented conditional generative adversarial networks. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp 1539–1548
131. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
132. Gregor K, Danihelka I, Mnih A, Blundell C, Wierstra D (2014) Deep autoregressive networks. In: International Conference on Machine Learning. PMLR, pp 1242–1250
133. Fang K, Xiang Y, Li X, Savarese S (2018) Recurrent autoregressive networks for online multi-object tracking. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp 466–475
134. Fernando T, Denman S, Sridharan S, Fookes C (2018) Tracking by prediction: A deep generative model for multi-person localisation and tracking. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp 1122–1132
135. Sadeghian A, Alahi A, Savarese S (2017) Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In: Proceedings of the IEEE International Conference on Computer Vision, pp 300–311
136. Kim C, Li F, Rehg JM (2018) Multi-object tracking with neural gating using bilinear lstm. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 200–215
137. Schulter S, Vernaza P, Choi W, Chandraker M (2017) Deep network flow for multi-object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6951–6960
138. Tang S, Andres B, Andriluka M, Schiele B (2016) Multi-person tracking by multicut and deep matching. In: European Conference on Computer Vision. Springer, pp 100–111
139. Li W, Zhao R, Xiao T, Wang X (2014) Deepreid: Deep filter pairing neural network for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 152–159
140. Zheng L, Bie Z, Sun Y, Wang J, Su C, Wang S, Tian Q (2016) Mars: A video benchmark for large-scale person re-identification. In: European Conference on Computer Vision. Springer, pp 868–884
141. Leal-Taixé L, Milan A, Reid I, Roth S, Schindler K (2015) Motchallenge 2015: Towards a benchmark for multi-target tracking. arXiv:1504.01942
142. Milan A, Leal-Taixé L, Reid I, Roth S, Schindler K (2016) Mot16: A benchmark for multi-object tracking. arXiv:1603.00831
143. Zhu Y, Zhao C, Wang J, Zhao X, Wu Y, Lu H (2017) Couplenet: Coupling global structure with local parts for object detection. In: Proceedings of the IEEE international conference on computer vision, pp 4126–4134
144. Bodla N, Singh B, Chellappa R, Davis LS (2017) Soft-nms: improving object detection with one line of code. In: Proceedings of the IEEE international conference on computer vision, pp 5561–5569
145. Sun S, Akhtar N, Song H, Mian AS, Shah M (2019) Deep affinity network for multiple object tracking. IEEE Trans Pattern Anal Mach Intell
146. Dollár P, Appel R, Belongie S, Perona P (2014) Fast feature pyramids for object detection. IEEE Trans Pattern Anal Mach Intell 36(8):1532–1545
147. Shen J, Liang Z, Liu J, Sun H, Shao L, Tao D (2018) Multiobject tracking by submodular optimization. IEEE Trans Cybern 49(6):1990–2001
148. Bochinski E, Eiselein V, Sikora T (2017) High-speed tracking-by-detection without using image information. In: 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, pp 1–6
149. Pirsiavash H, Ramanan D, Fowlkes CC (2011) Globally-optimal greedy algorithms for tracking a variable number of objects. In: CVPR 2011. IEEE, pp 1201–1208
150. Andriyenko A, Schindler K, Roth S (2012) Discrete-continuous optimization for multi-target tracking. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp 1926–1933
151. Wen L, Li W, Yan J, Lei Z, Yi D, Li SZ (2014) Multiple target tracking based on undirected hierarchical relation hypergraph. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1282–1289
152. Dicle C, Camps OI, Sznaier M (2013) The way they move: Tracking multiple targets with similar appearance. In: Proceedings of the IEEE international conference on computer vision, pp 2304–2311
153. Andriyenko A, Schindler K (2011) Multi-target tracking by continuous energy minimization. In: CVPR, vol 2, pp 7
154. Bewley A, Ge Z, Ott L, Ramos F, Upcroft B (2016) Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP). IEEE, pp 3464–3468
155. He R, Wu X, Sun Z, Tan T (2018) Wasserstein cnn: Learning invariant features for nir-vis face recognition. IEEE Trans Pattern Anal Mach Intell 41(7):1761–1773
156. Saberian MJ, Vasconcelos N (2012) Learning optimal embedded cascades. IEEE Trans Pattern Anal Mach Intell 34(10):2005–2018
157. Datondji SRE, Dupuis Y, Subirats P, Vasseur P (2016) A survey of vision-based traffic monitoring of road intersections. IEEE Trans Intell Transp Syst 17(10):2681–2698
158. Cheng G, Zhou P, Han J (2016) Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images. IEEE Trans Geosci Remote Sens 54(12):7405–7415
159. Cheng G, Han J (2016) A survey on object detection in optical remote sensing images. ISPRS J Photogramm Remote Sens 117:11–28
160. Shivakumara P, Tang D, Asadzadehkaljahi M, Lu T, Pal U, Anisi MH (2018) Cnn-rnn based method for license plate recognition. CAAI Trans Intell Technol 3(3):169–175
161. Sarfraz M, Ahmed MJ (2019) An approach to license plate recognition system using neural network. In: Exploring Critical Approaches of Evolutionary Computation. IGI Global, pp 20–36
162. Nair AS, Raju S, Harikrishnan KJ, Mathew A (2018) A survey of techniques for license plate detection and recognition. i-manager's J Image Process 5(1):25
163. Banerjee K, Notz D, Windelen J, Gavarraju S, He M (2018) Online camera lidar fusion and object detection on hybrid data for autonomous driving. In: 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, pp 1632–1638
164. Arnold E, Al-Jarrah OY, Dianati M, Fallah S, Oxtoby D, Mouzakitis A (2019) A survey on 3d object detection methods for autonomous driving applications. IEEE Trans Intell Transp Syst 20(10):3782–3795
165. Li Z, Dong M, Wen S, Hu X, Zhou P, Zeng Z (2019) Clu-cnns: Object detection for medical images. Neurocomputing 350:53–59
166. Lu W, Zhou Y, Wan G, Hou S, Song S (2019) L3-net: Towards learning based lidar localization for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6389–6398
167. Altaf F, Islam SMS, Akhtar N, Janjua NK (2019) Going deep in medical image analysis: Concepts, methods, challenges, and future directions. IEEE Access 7:99540–99572
168. Naji S, Jalab HA, Kareem SA (2019) A survey on skin detection in colored images. Artif Intell Rev 52(2):1041–1087
169. Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6077–6086
170. Friedman S, Stamos I (2013) Online detection of repeated structures in point clouds of urban scenes for compression and registration. Int J Comput Vis 102(1-3):112–128
171. Bai S, An S (2018) A survey on automatic image caption generation. Neurocomputing 311:291–304
172. Yang W, Tan RT, Feng J, Guo Z, Yan S, Liu J (2019) Joint rain detection and removal from a single image with contextualized deep networks. IEEE Trans Pattern Anal Mach Intell 42(6):1377–1393
173. Sen D, Pal SK (2008) Generalized rough sets, entropy, and image ambiguity measures. IEEE Trans Syst Man Cybern Part B (Cybern) 39(1):117–128
174. Ganivada A, Ray SS, Pal SK (2012) Fuzzy rough granular self-organizing map and fuzzy rough entropy. Theor Comput Sci 466:37–63
175. Pal SK, Mitra S (1992) Multi-layer perceptron, fuzzy sets and classification. IEEE Trans Neural Netw 3(5):683–697
176. Mitra S, Pal SK (1995) Fuzzy multi-layer perceptron, inferencing and rule generation. IEEE Trans Neural Netw 6(1):51–63
177. Sen D, Pal SK (2010) Gradient histogram: thresholding in a region of interest for edge detection. Image Vis Comput 28(4):677–695
178. Pramanik A, Pal SK, Maiti J, Mitra P (2021) Granulated RCNN and multi-class deep sort for multi-object detection and tracking. IEEE Transactions on Emerging Topics in Computational Intelligence. https://doi.org/10.1109/TETCI.2020.3041019
179. Pal SK, Banerjee R, Dutta S, Sarma SS (2013) An insight into the Z-number approach to CWW. Fundamenta Informaticae 124(1–2):197–229
180. Banerjee R, Pal SK (2015) Z*-numbers: augmented Z-numbers for machine-subjectivity representation. Inform Sci 323:143–178
181. Pal SK, Mandal DP (1992) Linguistic recognition system based on approximate reasoning. Inform Sci 61(1–2):135–161

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Sankar K. Pal is currently a National Science Chair, SERB-DST, Govt. of India. He is a Distinguished Scientist and former Director of Indian Statistical Institute, a former Distinguished Professor of Indian National Science Academy, and a former Chair Professor of Indian National Academy of Engineering. He founded the Machine Intelligence Unit and the Center for Soft Computing Research: A National Facility in the Institute in Calcutta. He received a Ph.D. in Radio Physics and Electronics from the University of Calcutta in 1979, and another Ph.D. in Electrical Engineering along with DIC from Imperial College, University of London in 1982.

He worked at the University of California, Berkeley and the University of Maryland, College Park in 1986-87; the NASA Johnson Space Center, Houston, Texas in 1990-92 & 1994; and in the US Naval Research Laboratory, Washington DC in 2004. Since 1997 he has been a Distinguished Visitor of the IEEE Computer Society (USA) for the Asia-Pacific Region, and has held several visiting positions in Italian, Polish, Hong Kong and Australian universities.

Prof. Pal is a Fellow of IEEE, the World Academy of Sciences (TWAS), the International Association for Pattern Recognition, the International Association of Fuzzy Systems, the International Rough Set Society, and all the four National Academies for Science/Engineering in India. He is a coauthor of twenty-one books and more than four hundred research publications in the areas of Pattern Recognition and Machine Learning, Image Processing, Data Mining and Web Intelligence, Soft Computing, Neural Nets, Genetic Algorithms,
Fuzzy Sets, Rough Sets, Cognitive Machines and Bioinformatics. He introduced and promoted soft computing research and teaching in India. He has visited forty-five countries as a Keynote/Invited speaker or an academic visitor.

He received the 1990 S.S. Bhatnagar Prize (the most coveted award for a scientist in India), the 2013 Padma Shri (one of the highest civilian awards) from the President of India, and many prestigious awards in India and abroad, including the 2000 Khwarizmi International Award from the President of Iran, 2000-2001, the 1993 NASA Tech Brief Award (USA), the 1994 IEEE Trans. Neural Networks Outstanding Paper Award, the 1995 NASA Patent Application Award (USA), the 1999 G.D. Birla Award, the 1998 Om Bhasin Award, the 2005-06 Indian Science Congress-P.C. Mahalanobis Birth Centenary Gold Medal from the Prime Minister of India for Lifetime Achievement, the 2007 Sir J.C. Bose National Fellowship, the 2015 DAE Raja Ramanna Fellowship, the 2015 INAE-S.N. Mitra Award, the 2017 INSA-Jawaharlal Nehru Birth Centenary Lecture Award, the 2018 INSA Distinguished Professorial Chair, and the 2020 National Science Chair, Govt. of India.

Prof. Pal acts/acted as an Associate Editor of IEEE Trans. PAMI (2002-06), IEEE Trans. NN (1994-98 & 2003-06), Neurocomputing (1995-2005), Pattern Recog. Lett. (1993-2011), Int. J. Patt. Recog. & Art. Intell., Inform. Sci., Fuzzy Sets and Syst., LNCS Trans. Rough Sets, Journal of Data, Information and Management, Int. J. Comput. Intell. and Appl., Applied Intelligence (2002-12), Fundamenta Informaticae (2003-19), IET Image Process. (2007-19), Ingeniera y Ciencia (2014-15), and J. Intell. Inform. Syst. (2008-12); Editor-in-Chief, Int. J. Signal Processing, Image Processing and Pattern Recognition (2008-19); a Book Series Editor of Frontiers in Artificial Intelligence and Applications, IOS Press, and Statistical Science and Interdisciplinary Research, World Scientific; a Member, Executive Advisory Editorial Board, of IEEE Trans. Fuzzy Systems, Int. Journal on Image and Graphics, Int. Journal of Approximate Reasoning, and Data-Centric Engineering (Cambridge Univ.); and a Guest Editor of IEEE Computer, IEEE Trans. SMC, and Theoretical Computer Science.

Anima Pramanik received the B. Tech degree in Electronics and Communication Engineering from the West Bengal University of Technology, West Bengal, India, and the M. Tech degree in Mechatronics Engineering from the National Institute of Technical Teachers' Training and Research (NITTTR), Kolkata, West Bengal, India. Currently, she is pursuing the PhD degree with the Department of Industrial and Systems Engineering at IIT Kharagpur, Kharagpur, India. Her research interests include computer vision, image/video processing, machine learning, and traffic safety.

J. Maiti (PhD, FIE), the Founder Chairman of the Centre of Excellence in Safety Engineering & Analytics (CoE-SEA) and Professor in the Department of Industrial and Systems Engineering, IIT Kharagpur, is a pioneer in making safety analytics a core area of research in the broad domain of Safety Science. He has established a unique world-class laboratory called the "Safety Analytics and Virtual Reality Laboratory" at IIT Kharagpur. He has authored over 150 publications and executed several funded research and consultancy projects in the areas of safety engineering, analytics and management. He is currently serving on the Editorial Boards of Safety Science (as Associate Editor), International Journal of Injury Control and Safety Promotion (as Associate Editor) and Safety and Health at Work (as Member). Prof Maiti is a truly interdisciplinary and multidisciplinary researcher working on the interfaces of engineering, management science, and statistics including analytics, where the research embodies (i) solving engineering and socio-technical problems in safety, quality, reliability, and ergonomics, (ii) development of methodologies/models with innovative engineering and management science approaches, (iii) development of novel tools and techniques using advanced statistics, data analytics, machine learning, and artificial intelligence, and (iv) application of advanced technologies. A recipient of several awards, Prof Maiti is a member of several international societies.

Pabitra Mitra is a professor of Computer Science and Engineering at Indian Institute of Technology Kharagpur. He did his PhD from Indian Statistical Institute Calcutta and B. Tech from Indian Institute of Technology Kharagpur. He has been an Assistant Professor at IIT Kanpur and a Scientist at the Centre for AI and Robotics, Bangalore. He received the INAE Young Engineer Award, and IBM and Yahoo Faculty Awards. He has co-authored a book and about 100 research papers in pattern recognition and machine learning.