Real-Time Multiple Object Tracking Using Deep Learning Methods (2021)
https://doi.org/10.1007/s00521-021-06391-y
Abstract
Multiple-object tracking is a fundamental computer vision task which is gaining increasing attention due to its academic
and commercial potential. Multiple-object detection, recognition and tracking are quite desired in many domains and
applications. However, accurate object tracking is very challenging, and it becomes even harder when multiple objects are involved. The main challenges that multiple-object tracking faces include the similarity and the high density of detected objects, while occlusions and viewpoint changes can also occur as the objects move. In this article, we
introduce a real-time multiple-object tracking framework that is based on a modified version of the Deep SORT algorithm.
The modification concerns the process of the initialization of the objects, and its rationale is to consider an object as tracked
if it is detected in a set of previous frames. The modified Deep SORT is coupled with YOLO detection methods, and a
concrete and multi-dimensional analysis of the performance of the framework is performed in the context of real-time
multiple tracking of vehicles and pedestrians in various traffic videos from datasets and various real-world footage. The
results show that our framework performs very well and that the improvements to the Deep SORT algorithm are effective. Lastly, we show improved detection and execution performance by custom training YOLO on the UA-DETRAC dataset and provide a new vehicle dataset consisting of 7 scenes, 11,025 frames and 25,193 bounding boxes.
Keywords: Computer vision · Multiple-object tracking · Deep learning · Deep SORT · YOLO
and compression, traffic control, medical imaging, self-driving cars and robotics [24, 29, 30].

Although object tracking is quite useful and desired, accurate, real-time object tracking is quite challenging, and things become even more challenging when multiple objects are involved [37, 38]. One major issue that MOT faces is bounding-box-level tracking performance and saturation; therefore, most research focuses on handling these aspects [3, 4]. For tracking to perform sufficiently well, object detection needs to work flawlessly across all frames, without having to resort to interpolation [39]. However, this is almost impossible due to occlusion, the variety of viewpoints and the noise that may be introduced in a video. Real-time tracking also requires substantial computational resources and needs to face challenges such as identity switches and various detection failures [33, 34, 40].

In this paper, we first explore the performance of various deep learning methods on the task of multiple-object tracking. We examine how widespread deep learning architectures perform under various contexts in a wide range of scene scenarios. The paper also introduces a modification of the Deep SORT [5, 25] algorithm, which greatly improves the performance of object tracking methods, using different object detection models, such as YOLOv3-608 [6], YOLOv3-Tiny [6] and YOLOv4 [7]. The Deep SORT implementation is an extension of the simple online and real-time tracking (SORT) [8] algorithm, and the SORT framework is also utilized as a means to measure bounding-box overlaps. While this is a high-performance method, the number of identity switches due to occlusions from poor camera angles is too high [9-12]. By incorporating convolutional neural networks to additionally include appearance information, Deep SORT can substantially reduce identity switches. Our modification of Deep SORT concerns the process of the initialization of the tracked object IDs and the way they are assigned and passed through to be shown during the visualization process. The results indicate that our modified Deep SORT now properly displays the track IDs and is closer to the ground truth in all the examined cases, fixing a problem that exists in all YOLO and Deep SORT implementations we have found on the Internet. In addition, we present a way to improve the real-time operation of the deep learning methods by identifying and addressing a bottleneck in the MOT framework. We tested and provide a way that can greatly improve the execution time of the tracking procedure. The results show an increase in frames per second (FPS) of up to 22% in all examined deep learning networks.

The novelty and contribution of this paper can be summarized as follows. First, we explore the performance of various deep learning methods on the task of real-time multiple-object tracking. Our focus is road traffic, but we also include pedestrian tests, and we compare a grand total of 7 YOLO derivatives, ranging from YOLOv3-Tiny all the way to a fully fledged YOLOv4 implementation. For the tracking mechanism, we use the Deep SORT framework, which we modified to properly display the correct track ID in the real-time video feedback, a problem that occurred in all YOLO and Deep SORT implementations we could find on the Internet. For each implementation, we provide the optimal detection and tracking parameters, which could be useful for fellow researchers and hobbyists. Moreover, we perform custom transfer learning training of the YOLO detector using a slightly modified version of the UA-DETRAC dataset. The UA-DETRAC-trained YOLOv4 provided state-of-the-art performance when compared to the publicly available MS-COCO-trained YOLOv4 in our test scenes. In addition, we performed a performance characterization of this framework and found a bottleneck in the execution pipeline, which we resolved, obtaining an execution performance increase of up to 22%. For the evaluation process, we provide a wide variety of metrics and we test nine different scenes, six of which are our own. For all test scenes, we also provide the ground truth files, which we generated either from the ground up or from existing data. Finally, through the creation of the ground truth files, we also provide a new vehicle multiple-object dataset consisting of 7 scenes, 11,025 frames and 25,193 bounding boxes.

The rest of the paper is structured as follows. Section 2 examines the literature and presents related works and methods on multiple-object tracking. Section 3 presents our implementation, which is based on the modified Deep SORT algorithm and the YOLO detection networks. The modification made to the Deep SORT algorithm is presented, and the way it affects the visualization of the tracking of moving objects is illustrated. Section 3 also describes how certain bottlenecks in the multiple-object tracking procedure are addressed and how the real-time performance in terms of frames per second is improved. Section 4 deals with the experimental study, explains the datasets used for the training and testing phases and presents the results collected. Finally, Sect. 5 concludes the article and draws directions for future work.

2 Related work

Multiple-object tracking is attracting the increasing attention of researchers in computer vision and artificial intelligence. Several works in the literature study the performance of methods and systems, and a detailed description of approaches and techniques can be found in [1, 2, 12].
In the work presented in [3], the authors propose an online multi-target tracker that exploits both high- and low-confidence target detections in a probability hypothesis density particle filter framework. The authors formulate an early association strategy between trajectories and detections after the prediction stage, which allows performing target estimation and state labeling without any additional mechanisms. The authors' solution has a peak multiple-object tracking accuracy (MOTA) score of 53 on MOT15 and 52.5 on MOT16.

The authors in [8] present an approach to multi-object tracking whose main focus is to associate objects efficiently for online and real-time applications. To this end, detection quality is identified as a key factor influencing tracking performance, where changing the detector can improve tracking by up to 18.9%. Despite only using a rudimentary combination of familiar techniques such as the Kalman filter and the Hungarian algorithm for the tracking components, the approach achieves accuracy comparable to state-of-the-art online trackers. Additionally, emphasis is placed on efficiency to facilitate real-time tracking and to promote greater uptake in applications such as pedestrian tracking for autonomous vehicles. While being an overall good framework at the time, the identity switches are rather high, with a value of 1001 in the MOT benchmark. Their solution has a peak MOTA score of 33.4 on MOT15.

In the work presented in [4], the authors present an online method that encodes long-term temporal dependencies across multiple cues. One key challenge of tracking methods is to accurately track occluded targets or those that share similar appearance properties with surrounding objects. To address this challenge, the authors present a structure of recurrent neural networks (RNN) that jointly reasons on multiple cues over a temporal window. Their motion and interaction models leverage two separate long short-term memory (LSTM) networks that track the motion and interactions of targets for a longer period, suitable for the presence of long-term occlusions. Their solution has a peak MOTA score of 37.6 on MOT15.

In the work presented in [2], the authors present a comprehensive survey of works that employ deep learning models to solve the task of MOT on single-camera videos. Four main steps in MOT algorithms are identified, and an in-depth review of how deep learning was employed in each of these stages is presented. A complete experimental comparison of the presented works on the three MOTChallenge datasets is also provided, identifying a number of similarities among the top-performing methods and presenting some possible future research directions.

In the work presented in [13], the authors build on a neural class-agnostic single-object tracker named HART and introduce a multi-object tracking method, MOHART, capable of relational reasoning. The authors explore a number of relational reasoning architectures and show that multi-headed self-attention outperforms the provided baselines and better accounts for complex physical interactions in a toy experiment. The authors find that it leads to consistent performance gains in tracking as well as future trajectory prediction on three real-world datasets (MOTChallenge, UA-DETRAC and the Stanford Drone dataset), particularly in the presence of ego-motion, occlusions, crowded scenes and faulty sensor inputs. On the MOTChallenge dataset, HART achieves 66.6% IOU, which is itself impressive given the small amount of training data of only 5225 training frames and no pre-training.

In the work presented in [14], the authors present an end-to-end model, named FAMNet, in which feature extraction, affinity estimation and multi-dimensional assignment are refined in a single network. All layers in FAMNet are designed to be differentiable and can thus be optimized jointly to learn the discriminative features and a higher-order affinity model. The authors also integrate a single-object tracking technique and a dedicated target management scheme into the FAMNet-based tracking system to further recover false negatives and inhibit noisy target candidates generated by the external detector. The proposed method is evaluated on a diverse set of benchmarks including MOT2015, MOT2017, KITTI-Car and UA-DETRAC and achieves promising performance on all of them in comparison with the state-of-the-art. The authors' method has a MOTA score of 40.6 on MOT15.

In the work presented in [15], the authors introduce a focal loss-based RetinaNet, which works as a one-stage object detector and is utilized for vehicle detection; it matches the speed of regular one-stage detectors while also defeating two-stage detectors in accuracy. State-of-the-art performance has been shown on the DETRAC vehicle dataset. This is important because one-stage and two-stage object detectors are regarded as the two most important groups of convolutional neural network-based object detection methods. A one-stage object detector usually outperforms a two-stage object detector in speed; however, it normally trails in detection accuracy compared with two-stage object detectors.

In the work presented in [16], the authors introduce the deep motion modeling network (DMM-Net), which can estimate multiple objects' motion parameters to perform joint detection and association in an end-to-end manner. DMM-Net models object features over multiple frames and simultaneously infers object classes, visibility and their motion parameters. These outputs are readily used to update the tracklets for efficient MOT. DMM-Net achieves a PR-MOTA score of 12.80 at 120+ fps on the popular UA-DETRAC challenge. The authors also introduce a synthetic large-scale public dataset, Omni-MOT, for vehicle tracking
that provides precise ground-truth annotations to eliminate the detector influence in MOT evaluation.

In the work presented in [17], the authors present a CNN-based framework for online MOT. This framework utilizes the merits of single-object trackers in adapting appearance models and searching for the target in the next frame. Simply applying a single-object tracker to MOT encounters problems with computational efficiency and drifted results caused by occlusion. Their framework achieves computational efficiency by sharing features and using ROI-Pooling to obtain individual features for each target. In the framework, they introduce a spatial-temporal attention mechanism (STAM) to handle the drift caused by occlusion and interaction among targets. Besides, the occlusion status can be estimated from the visibility map, which controls the online updating process via a weighted loss on training samples with different occlusion statuses in different frames. It can be considered a temporal attention mechanism. The proposed algorithm achieves 34.3% and 46.0% MOTA on the challenging MOT15 and MOT16 benchmark datasets, respectively.

3 Methodology

In this section, we present our framework for object tracking, which relies on the modification of the Deep SORT algorithm. We describe the main methods for the object detection and tracking, as well as the architecture of this implementation. More specifically, for the object detection procedure, YOLO models are utilized to detect desired objects in a frame, and after that, a modified version of the Deep SORT algorithm is introduced to perform object tracking in the sequences of the frames. Our modification of the Deep SORT algorithm concerns the process of the initialization of the object IDs, and its rationale is to consider an object as "tracked" if it is detected in a set of previous frames. We assess the performance of the YOLO object detection models trained on the MS COCO and UA-DETRAC datasets using transfer learning, in order to assess their performance and identify suitable and optimal synergies for multi-object tracking. The modified Deep SORT algorithm is tested using the YOLO models in tracking cars and pedestrians, on a variety of datasets and scenes, and its performance is compared to the original Deep SORT algorithm. The results indicate that our modified Deep SORT algorithm now properly displays the assigned track IDs, while also providing good tracking performance. In the following subsections, we present the modification made to the Deep SORT algorithm, the implementation of the YOLO models, as well as the optimization of our framework.

3.1 Modified Deep SORT tracking algorithm

One of the most widely used object tracking frameworks is Deep SORT, which is an extension of SORT (simple online and real-time tracking) [5]. Deep SORT achieves better tracking and fewer identity switches by including an appearance feature vector for the tracks, which is derived, in this case, by a pre-trained CNN that runs on the YOLO-detected bounding boxes. Since simple detection models are very likely to fail at detecting numerous objects consecutively as the frames go by, we need to add new methods to keep track of them and properly identify them. This is where Deep SORT comes in to make a proper MOT framework.

The Kalman filter is a crucial component in Deep SORT. Each state contains 8 variables (u, v, a, h, u', v', a', h'), where (u, v) are the coordinates of the bounding box, a is its aspect ratio and h is its height. The respective velocities are given by u', v', a', h'. The state contains only absolute position and velocity factors, since we assume a simple linear velocity model. The Kalman filter helps us face the problems that may arise from non-perfect detection and uses prior states to predict a good fit for future bounding boxes. Now that we have the new bounding boxes tracked by the Kalman filter, the next problem lies in associating new detections with the predictions that have been created. Since they are processed independently, a method is needed to associate track_i with an incoming detection_k. To solve this, Deep SORT implements two things: a distance metric to quantify the association and an efficient algorithm to associate the data. The authors decided to use the squared Mahalanobis distance (an effective metric when dealing with distributions) to incorporate the uncertainties from the Kalman filter. Thresholding this distance can give us a very good idea of the actual associations. This metric is more accurate than, say, the Euclidean distance, as we are effectively measuring the distance between two distributions. For the data association part, the Hungarian algorithm is used. Lastly, the feature vector becomes our "appearance descriptor" of the object. The authors have added this vector as part of the distance metric. Now, the updated distance metric is:

D = λ · D_k + (1 − λ) · D_a

where D_k is the Mahalanobis distance, D_a is the cosine distance between the appearance feature vectors and λ is the weighting factor. The importance of D_a is so high that the authors claim they were able to achieve state-of-the-art results even with λ = 0, meaning that they only used the appearance descriptor for the calculation. We provide the pseudocode for the Deep SORT-enabled framework below.
8. Update the tracker using IOUs and the Hungarian algorithm.
9. If an initiated track has been consecutively detected for n_init frames, then confirm the track.
10. Else:
12. If a confirmed track has not been detected for MaxAge consecutive frames, then delete the track.
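To make the association cost and the track lifecycle described above concrete, the following is a minimal Python sketch, not the authors' implementation (which builds on the original Deep SORT code). All names (Track, TrackState, combined_cost, LAMBDA) are illustrative, and the MAX_AGE value is an assumed placeholder since the text does not report it.

```python
# Minimal sketch of (a) the weighted association cost D = lambda*D_k + (1-lambda)*D_a and
# (b) the tentative -> confirmed -> deleted track lifecycle controlled by n_init and MaxAge.
# Illustrative names only; MAX_AGE is an assumed value.
from dataclasses import dataclass, field
from enum import Enum

import numpy as np

LAMBDA = 0.0   # weighting factor; the Deep SORT authors report good results even with 0
N_INIT = 7     # consecutive detections required before a track is confirmed (paper default)
MAX_AGE = 30   # consecutive misses allowed before a confirmed track is deleted (assumed value)


class TrackState(Enum):
    TENTATIVE = 1
    CONFIRMED = 2
    DELETED = 3


def combined_cost(mahalanobis_d: float, cosine_d: float, lam: float = LAMBDA) -> float:
    """Weighted sum of the motion (Mahalanobis) and appearance (cosine) distances."""
    return lam * mahalanobis_d + (1.0 - lam) * cosine_d


@dataclass
class Track:
    track_id: int
    state: TrackState = TrackState.TENTATIVE
    hits: int = 1    # consecutive frames in which the track was matched
    misses: int = 0  # consecutive frames in which the track was not matched
    bbox: np.ndarray = field(default_factory=lambda: np.zeros(4))  # (x, y, w, h)

    def mark_matched(self, bbox: np.ndarray) -> None:
        self.bbox = bbox
        self.hits += 1
        self.misses = 0
        if self.state is TrackState.TENTATIVE and self.hits >= N_INIT:
            self.state = TrackState.CONFIRMED  # only confirmed IDs are drawn on screen

    def mark_missed(self) -> None:
        self.hits = 0
        self.misses += 1
        if self.state is TrackState.TENTATIVE or self.misses > MAX_AGE:
            self.state = TrackState.DELETED


def confirmed_ids(tracks: list[Track]) -> list[int]:
    """IDs the modified visualization would display: confirmed tracks only."""
    return [t.track_id for t in tracks if t.state is TrackState.CONFIRMED]
```

In this sketch, only tracks that have been matched for n_init consecutive frames reach the confirmed state, and only their IDs are returned for drawing on the output frame, which reflects the behavior enforced by the modification described in this section.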
The variables that cause the biggest change in performance are the score and IOU of each respective model, and n_init, max_cosine_distance and max_iou_distance from the Deep SORT framework. We will present the optimal values for each implementation in the videos used. We keep max_cosine_distance and max_iou_distance at the same values, 0.4 and 0.7, respectively, for all our tests. We also set n_init to 7, unless stated otherwise. The variable n_init dictates how many successful detections a track must have before it goes from its initial tentative state to its confirmed state.

A main aspect we noticed in the functionality and utilization of Deep SORT is that it shows the initiated track IDs on the bounding boxes rather than the confirmed tracks. This behavior may cause problems in tracking and correctly numbering the detected objects in the sequences of frames. To address this problem, the main modification that we implemented in the Deep SORT algorithm relates to the proper display and count of the confirmed detections. Specifically, each track has three states: the initial tentative state, the confirmed state and the deleted state. Every new track is classified as tentative for the first n_init frames. If the n_init frames pass and the track is still identified, it becomes confirmed and feature similarity is also employed. If a track fails to be identified properly using IOU similarity for every frame in the n_init phase, it is classified as deleted. We made sure that every bounding box shown on screen carries the ID of a confirmed track.

Below we provide some example cases of the comparison between Deep SORT and the modified version. In Figs. 1 and 2, we present a case from the MOT15 dataset, where the ground truth for that part of the scene is 34 people. Our modified Deep SORT framework measured 32 people, while the original Deep SORT measured 79 people, as illustrated by the
bounding box IDs in Fig. 1. The difference is massive, and the main reason for this is the large number of identity switches, either from occlusion or from a poor viewing angle, as the detector struggles to maintain accurate detections across all frames. This can result in tracking numbers that, when the original Deep SORT algorithm is used, are considerably higher than the ground truth. The modified version correctly displays the confirmed tracks on the video output.

An additional example case is illustrated in Figs. 3 and 4. The example case is a frame of the "Racetrack" test scene, where cars are detected and tracked. The ground truth of the example case is 19. In Fig. 3, the performance of the original Deep SORT is off by 6, tracking and assigning IDs to 25 different cars. It is also worth noting that, even in the same frame, the non-modified code had already failed to properly ID this stack of cars, as illustrated by the fact that there is no car numbered with ID 24. The modified Deep SORT has a much better performance, which matches the ground truth.

Finally, another example case is presented in Figs. 5 and 6, in our own real-world test scene named "Rural road dusk." The ground truth at that part of the scene is 18. In Fig. 5, the results of the original Deep SORT are off by 20. Although there were 18 different cars in the scene, the original Deep SORT and YOLO framework resulted in tracking and assigning IDs up to 38. In Fig. 6, the results of the modified Deep SORT are presented, and we can see that the resulting IDs are very close to the ground truth, off by just 1. The YOLO detector works the same way in both cases, and the modified Deep SORT performs considerably better than the original version, reporting almost perfect performance.

As illustrated in the above three example cases, the modified version of Deep SORT performs quite well and results in better tracking and consistent annotation of IDs. This is crucial when we create real-time online MOT systems, since we can even feed in a live video from a camera in real time and have it display proper IDs and tracking results.

3.2 Detection models

The Deep SORT tracking algorithm needs to be integrated with a multiple-object detection model that performs the detection of the desired objects in a frame; after that, Deep SORT performs the tracking procedure. In the context of our study, we examine the performance of Deep SORT in a pipeline with YOLO (You Only Look Once) [19]. YOLO has been proven to offer high performance and detection accuracy [35], and the YOLO models used and examined here are (i) YOLOv3-Tiny, (ii) YOLOv3-416 and 608 and (iii) YOLOv4-608. These models are trained on the MS-COCO and DETRAC datasets, and we use the weights that have been generated by the training on these datasets. The first implementation works with the YOLOv3-Tiny model and weights. In the second implementation, the YOLOv3-416 and 608 model and the corresponding weights are used, and lastly, we use YOLOv4-608 with a 608-by-608 tensor input to test our framework with state-of-the-art models. These YOLO detection models have been formulated in Keras along with their weights, which were generated from their respective Darknet projects. Darknet [32] is an open-source neural network framework written in C and CUDA, and it supports CPU and GPU computation.

The developers of YOLO reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection. First, YOLO is extremely fast. The frame detection process is treated as a regression problem, which enables YOLO to have a simplified pipeline: the neural network is simply run on a new image at test time to predict detections. This model achieves high throughput, which makes it suitable for processing streaming video in real time. Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method, mistakes background patches in an image for objects, because it cannot see the larger context. Third, YOLO learns generalizable representations of objects. The network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means that YOLO reasons globally about the full image and all the objects in it.

3.2.1 YOLOv3-Tiny integration

For the tiny YOLOv3 model, we are using these anchors: [10, 13], [23, 27], [37, 58], [81, 82], [135, 169], [344, 319], which correspond to the sizes of the bounding boxes and are fundamental for the correct training and detection of our CNN. Moreover, we set anchor mask values of [[3, 4, 5], [0, 1, 2]]. We configure them based on the design and the dimensions of the objects we want to detect. To begin with, for this framework we set score = 0.3 and IOU = 0.2 for our tests in section 7.1, and score = 0.6 and IOU = 0.3 for 7.2 and 7.3. Score is the confidence percentage for the detection coming out of our CNN. Keep in mind that YOLOv3-Tiny has only 21 layers and needs only 5.5 billion FLOPs per frame. For this study, we use a tensor input of (416, 416). YOLOv3-Tiny has a mean average precision (mAP) of 23.7% on the MS COCO dataset.
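As an illustration of how these anchors, masks and thresholds could be gathered for a Keras-style YOLO detector, the following is a hedged configuration sketch. The dictionary layout and key names are assumptions for illustration only; the numeric values are the ones reported above.

```python
# Hedged sketch of the YOLOv3-Tiny detection settings listed above.
# The structure of this configuration (and its keys) is illustrative; the anchors,
# mask, score and IOU values are those reported in the text.
import numpy as np

tiny_yolov3_config = {
    "input_size": (416, 416),                       # tensor input used in this study
    "anchors": np.array([[10, 13], [23, 27], [37, 58],
                         [81, 82], [135, 169], [344, 319]], dtype=np.float32),
    "anchor_mask": [[3, 4, 5], [0, 1, 2]],          # which anchors each output scale uses
    "score_threshold": 0.3,                          # confidence threshold (0.6 in later tests)
    "iou_threshold": 0.2,                            # NMS IOU threshold (0.3 in later tests)
    "classes": ["car"],                              # only the 'car' class is used in our vehicle tests
}
```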
3.2.2 YOLOv3 integration

For YOLOv3-416 and 608, we are using these anchors: [10, 13], [16, 30], [33, 23], [30, 61], [62, 45], [59, 119], [116, 90], [156, 198] and [373, 326], which correspond to the sizes of the bounding boxes and are fundamental for the correct training of our CNN. Moreover, we set anchor mask values of [[6, 7, 8], [3, 4, 5], [0, 1, 2]]. They were configured and fine-tuned based on the design and the dimensions of the objects we want to detect. For this framework, we set score = 0.4 and IOU = 0.2 for our tests in Sect. 6.1, and score = 0.6 and IOU = 0.3 for 6.2 and 6.3. YOLOv3-608 has 106 layers and needs 140 billion FLOPs per frame with a tensor input of (608, 608) and 65.86 billion FLOPs per frame at (416, 416). The complexity is much higher, and the increased computation requirements cause a considerable drop in average frame rate. That being said, it achieves significantly better results on the MS COCO dataset, having a mAP of 55.3 for a tensor input of 416 and 57.9 for a tensor input of 608. In our study, we use it with tensor inputs of 416 and 608.

3.2.3 YOLOv4-608 integration

For YOLOv4-608, we are using the same anchors and mask that we used for YOLOv3-608. For this framework, we set score = 0.6 and IOU = 0.3 in all of our tests, where the score is the confidence percentage for the detection coming out of our CNN. Notice that we gradually increase our detection thresholds as we continue testing more complex and higher-performing models. With YOLOv4, we now set our tensor input at 608 instead of 416, which we previously did for YOLOv3, because we want to see how the framework behaves when aiming at the best possible detection and feature extraction. YOLOv4 achieves a mAP of 65.7% with an input tensor of (608, 608), which is significantly higher than the previous models.

3.3 Framework optimization

An important part of the proper real-time operation of our framework concerns a set of optimization procedures that were performed. We monitor the functionality of the framework on our systems with help from Intel's VTune software stack. We launch our application and then hook VTune to the corresponding process ID of our framework. We detected a bottleneck in the CPU section of our systems, indicating that the CPU cannot feed our GPU fast enough while also performing the necessary calculations for the tracking algorithm provided by Deep SORT. In Fig. 7, we can see that the single-threaded nature of the software on crucial functions causes issues and, if we were to multithread and batch our functions for video pre-processing, it would not meet the criteria for a real-time tracking algorithm, since it would not process every frame as it is created. Looking closer at the graph shown in Fig. 7, we see our primary thread failing to hold steady at 100% CPU time, which is caused by our GPU having to run our CNN on a per-frame basis, leaving the CPU idle.

The infrastructure used for our experiments is equipped with an NVIDIA GTX 1070 with 8 GB of VRAM, paired with 16 gigabytes of RAM and an Intel i7 6700K. We used the NVIDIA CUDA toolkit version 10.0 and Tensorflow-gpu version 1.14. Keras version 2.2.4 and the Python Anaconda distribution 2019.03 were the frameworks for the implementations.
Fig. 7 Spawned threads during execution of our framework using Intel VTune
For the experiments, the processor was set at 4.5 GHz and the graphics card was locked at 2.1 GHz, while the training and validation data were kept on an NVMe drive to alleviate any potential storage bottlenecks. In Fig. 7, each row represents a thread spawned by Python for this framework. The CPU first has to preprocess a frame from the video input and then send it to the GPU for the object detection part of the framework. While the GPU is doing calculations, the CPU idles, waiting for the GPU to send back the result. When the results get sent back to the CPU, it is time for the Deep SORT algorithm to take over and match each bounding box with the correct IDs. This is also executed on the CPU. After this process is completed, the CPU writes back to the frame the output from the model along with the correct IDs. This process is repeated until there are no more frames to process. The rest of the threads remain mostly idle, and that is expected behavior, since we have not multithreaded the video pre-processing task or the CPU-side tasks of our Deep SORT framework.

Knowing that, by default, Python and TensorFlow installations are not compiled to make use of the more advanced SSE4.1/SSE4.2 and AVX instructions, we tried to find ways to improve the performance by using optimized libraries for our system. Initially, Intel's Python packages for NumPy, SciPy and others were installed. These packages have improvements mainly from the use of SSE4.2, AVX and AVX2 instructions. However, the performance did not improve much, because these libraries were not hotspots in our code. After this, we started timing every part of our code and found that the video processing tasks, which were powered by the Pillow library, were taking a big part of our execution time. Using VTune to perform HPC profiling of our framework, as shown in Fig. 8, we can see that our primary thread can grow in terms of vectorization. Since we are not bandwidth-bound, with just 1.2% of time spent waiting for data from DRAM, we know that our memory subsystem is ready to handle an increase in the data flow. We installed and used the Pillow 6.0.0 SIMD AVX2 package, and we got an improvement that ranged from 10 to 22%. The percentage of the improvement depends on the number of cars detected per frame, the detector used and the video input resolution. This improved performance is presented in detail in the results of the experimental study.

In Table 1, we present some results from the tests, where we compare the generic SSE Pillow 6.0 version versus the AVX2-enabled Pillow version 6.0.

The results indicate that, after the optimization procedures, we have a substantial improvement in the real-time operation and performance of the detection procedure. As illustrated in Table 1, the SIMD optimizations that were introduced in the framework have a substantial impact on the frames per second. We recorded the highest improvement (21.87% in frames per second) in the crossroad video when running the YOLOv3-Tiny framework. As the input resolution goes up and the time spent on object detection goes down, we will experience more and more performance improvements from this.

4 Experimental study

In this section, we present the experimental study and the results collected. The experiments focus on the examination of the performance of the modified version of the Deep SORT tracking algorithm, and its performance is assessed against the performance of the original version of the algorithm. We examine its performance in a wide range of datasets and integrations with YOLO detectors and assess the performance of the YOLO multiple-object detection models trained on the MS-COCO and the UA-DETRAC datasets using transfer learning on the testing datasets.
Fig. 8 Gathering performance metrics for our framework using Intel VTune
Based on our tests, we found that the UA-DETRAC-trained YOLOv3-Tiny and YOLOv4-608 models were able to outperform the MS-COCO ones on average, in terms of execution speed and detection accuracy.

4.1 Datasets used for the training procedure

4.1.1 MS-COCO

The Microsoft Common Objects in COntext (MS-COCO) dataset contains 91 common object categories, with 82 of them having more than 5000 labeled instances. In total, the dataset has 2,500,000 labeled instances in 328,000 images. Additionally, a critical distinction between this dataset and others is the number of labeled instances per image, which may aid in learning contextual information. MS COCO contains considerably more object instances per image (7.7) as compared to ImageNet (3.0) and PASCAL (2.3). Utilizing over 70,000 working hours, a vast collection of object instances was gathered, annotated and organized to drive the advancement of object detection and segmentation algorithms. Emphasis was placed on finding non-iconic images of objects in natural environments and varied viewpoints. Dataset statistics indicate that the images contain rich contextual information, with many objects present per image. We only briefly mention this dataset, because we used the weights created by the YOLO authors that were trained on the MS COCO dataset. We used the pre-trained weights of all the YOLO models trained on the MS-COCO dataset as described in [6]. The models are trained to detect 80 classes, and in the context of our experiments, we used the models' detections for the "car" class.

4.1.2 UA-DETRAC dataset

The UA-DETRAC dataset [12] was created by the University at Albany for comprehensive performance evaluation of MOT systems. The UA-DETRAC dataset consists of 100 videos, selected from over 10 h of image sequences acquired with a Canon EOS 550D camera at 24 different locations, which represent various traffic patterns and conditions including urban highways, traffic crossings and T-junctions. Notably, to ensure diversity, the creators captured the data at different locations with various illumination conditions and shooting angles. The videos are recorded at 25 frames per second (fps) with a JPEG image resolution of 960 × 540 pixels. More than 140,000 frames in the UA-DETRAC dataset are annotated with 8250 vehicles, and a total of 1.21 million bounding boxes of vehicles are labeled. The creators asked over 10 domain experts to annotate the collected data for more than two months. They also carried out several rounds of cross-checking to ensure high-quality annotations. The UA-DETRAC dataset is divided into training (UA-DETRAC-train) and testing (UA-DETRAC-test) sets, with 60 and 40 sequences, respectively. The creators selected training videos that are taken at different locations from the testing videos, but ensured that the training and testing videos have similar traffic conditions and attributes. This setting reduces the chances of detection or tracking methods over-fitting to particular scenarios. The four classes are "car," "bus," "van" and "others." The vast majority of the dataset is labeled as "car." In Fig. 9, example cases from the datasets are presented, as well as an example case with the corresponding bounding boxes.

4.2 Datasets used for testing

In the context of the study, we employ nine scenes for assessing the performance of the methods and our modified version of Deep SORT. The nine scenes that were used are different from the datasets used for the training procedure of the models. Seven of the nine datasets concern the tracking of vehicles, and two (MOT16 and MOT20) concern the tracking of pedestrians. For all test scenes used, we have generated ground truth files either from scratch or from existing data. We have created a new vehicle dataset from the videos we captured, consisting of 7 scenes, 11,025 frames and 25,193 bounding boxes.
Fig. 9 Example training instances from the datasets. On the left, diverse example cases with cars are illustrated, while on the right an example case with the corresponding bounding boxes is illustrated
Lastly, all the bounding box coordinates are given in top-left, width, height format, and we also provide ground truth files in MOT16 format for the "Rural road dusk" scene.

The Crossroad [20] is a publicly available car traffic video, which is 3 min and 31 s long. The base resolution is 1080p with a 16:9 aspect ratio, and it runs at just 10 frames per second. This video has been captured by a road traffic camera.

The "Straight road" and "Racetrack" datasets are captured from a racing simulator called Assetto Corsa. The datasets were created by us with a resolution of 1080p and at 60 frames per second. We encode the same file down to 480p and 60 frames per second for further testing. These captures from a video game are utilized to take full control over the test scene and avoid recording artifacts, while simultaneously being able to capture a lossless and high-resolution file. It is also worth noting that the "Racetrack" video has higher rates of identity switches, due to the higher occlusion rate from having more cars close to each other on a per-frame basis.

Furthermore, we created two new testing scenes captured by a drone of our team, which we name "Rural road" and "Rural road dusk." They were created using real-world traffic from a public road. The Rural road video scene concerns a public road on a sunny day. We filmed approximately 15 min of public road traffic using a DJI Phantom 3 drone at 1080p and 25 fps. We cut the video down to approximately 2 min by taking out the parts where there was no traffic. The video incorporates a balance between cars, large vans and pickup trucks. The "Rural road dusk" video concerns the same public road at dusk, under different lighting conditions. Specifically, we recorded the traffic of the same road one hour before dusk. The camera was facing the sun, and the cars were generating shadows on the road. So, the Rural road dusk scene is of quite higher difficulty. All the datasets are publicly available via the GitHub account of our team.

MOT16 [21] is a widely used dataset for object tracking procedures. We used the "MOT16-09" scene, which is captured outdoors, facing a sidewalk from a low angle. It is a 30-frames-per-second video at 1080p resolution and has a duration of 18 s. The ground truth for the tracks in this video is 25.

MOT20 [22] is another widely used dataset for object tracking procedures, and we used the "MOT20-01" scene. The scene is captured indoors in a crowded train station. It is a quite challenging scene and comes as a 25-frames-per-second video at 1080p resolution. Its duration is 17 s, and its ground truth for the tracks in this video is 90.

4.3 Results

In this subsection, we present the results of the experimental study. We present the performance results of the methods examined and the modified version of Deep SORT. The experimental results are structured in two parts: the first concerns the performance of the methods when the YOLO detectors are trained on MS-COCO, and the second when they are trained on the DETRAC dataset.

We rank these frameworks based on the results we get from a wide variety of metrics. The first one is the Deep SORT Tracks Initiated metric. The closer this metric is to the ground truth, the better the tracking performance is. A number greatly higher than the ground truth usually shows that the detector struggles to keep track of a certain object across the scene. Every time the detector fails, there is a chance that a new initiated track is created if the object gets detected in future frames. The second one is the modified Deep SORT Count metric, which is the number of confirmed tracks for the scene, based on the modification performed to take into consideration a set of previous frame detections. Moreover, we provide recall, precision, F1
score and a confusion matrix with the TP, FP and FN metrics to evaluate detector performance along with tracking performance. Lastly, we also provide MOTA and MOTP scores when available. The MOTP metric is the total position error for matched object-hypothesis pairs over all frames, averaged by the total number of matches made. It shows the ability of the tracker to estimate precise object positions, independent of its skill at recognizing object configurations, keeping consistent trajectories and so on. MOTA accounts for all object configuration errors made by the tracker (false positives, misses, mismatches) over all frames. It gives a very intuitive measure of the tracker's performance at keeping accurate trajectories, independent of its precision in estimating object positions. To evaluate detector performance, we use a slightly modified version (extended to include all the metrics we wanted) of the open-source evaluation tool created by [36].

4.3.1 Results with optimized detection models trained on MS-COCO

Here, we present the results of our framework when the YOLO detectors are trained on MS-COCO. Initially, we present the performance when YOLOv3-Tiny is used as detector and, after that, the performance when YOLOv3 and YOLOv4 are used.

4.3.1.1 YOLOv3-Tiny In Table 2, the results of the Deep SORT and the modified Deep SORT using YOLOv3-Tiny trained on MS-COCO as detector are presented. A first point concerns the tracking performance in the "Racetrack" video, which is poor due to the subpar detection performance of YOLOv3-Tiny. Looking at the initiated tracks metric of the Deep SORT (85), we can tell that this detection model consistently failed to hold track of the objects it detected, which is also shown by the high number of false negatives. The closer this metric is to the ground truth, the better the tracking performance is. A number greatly higher than the ground truth usually shows that the detector struggles to keep tracking a certain object across the scenes. Having said that, the Deep SORT algorithm did a decent job at mitigating this issue, as seen from the Deep SORT count metric.

It is worth noting that trying to track on a 1080p source only gets us 15 fps using this setup. This may not be viable for real-time tracking. Looking at the results for the Straight road at 480p, the average frame rate is 33.1 FPS, and for Racetrack at 480p, it is 35.2 FPS. This indicates that the framework is quite suitable for real-time tracking. In the Crossroad video, massive amounts of identity switches were experienced due to the extremely low frame rate of the source, which, in turn, caused big gaps from frame to frame for the bounding boxes. This often causes the trajectory estimation algorithm to fail.

Lastly, in the "Rural road" and "Rural road dusk" videos, the results indicate poor tracking performance due to the poor detection exhibited by YOLOv3-Tiny, as shown by the very poor recall and F1 score. There are many detection failures, as pointed out by the significantly higher initiated tracks metric compared to the Deep SORT count for every scene except the "Straight road" 1080p and 480p. On the Racetrack dataset, where the ground truth was 23 cars, we measured 31 at both resolutions, and the tracks initiated for both tests were close to 86 for the 480p video and 85 when tested at 1080p, which indicates a large amount of detection failures. The tracking of the modified Deep SORT in the "Straight road" is lower than the ground truth (6 vs 9). This is because the cars in the back are not detected by this YOLO model in time. The tracking for the cars that were detected is excellent.

In Fig. 10, we provide the precision–recall curve for all scenes tested, and in Figs. 11, 12, 13 and 14, example detection frames from the YOLOv3-Tiny framework are illustrated.
Table 2 Results of Deep SORT and the modified Deep SORT using the YOLOv3-Tiny (MS-COCO) detector

Scene | Frames per second | Deep SORT Tracks Initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
Crossroad-Init = 4 | 15.6 | 263 | 104 | 92 | 0.944 | 0.375 | 0.537 | 4211 | 246 | 6990
Crossroad-Init = 7 | 15.6 | 269 | 80 | 92 | 0.944 | 0.375 | 0.537 | 4211 | 246 | 6990
Straight road 1080p | 15.1 | 10 | 6 | 9 | 1 | 0.436 | 0.608 | 187 | 0 | 241
Straight road 480p | 33.1 | 12 | 6 | 9 | 1 | 0.483 | 0.651 | 207 | 0 | 221
Racetrack | 15 | 85 | 31 | 23 | 0.997 | 0.801 | 0.889 | 3486 | 8 | 862
Racetrack 480p | 35.2 | 86 | 31 | 23 | 0.996 | 0.828 | 0.905 | 3604 | 12 | 744
Rural road | 15.1 | 220 | 56 | 44 | 0.99 | 0.447 | 0.616 | 1321 | 13 | 1632
Rural road dusk | 16 | 102 | 18 | 24 | 0.965 | 0.4 | 0.565 | 595 | 21 | 892
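The precision, recall and F1 values in these tables follow the standard definitions based on the reported TP, FP and FN counts, and MOTA follows the standard CLEAR-MOT definition; the small helper below is illustrative only and is not the evaluation tool of [36]. As a worked check, the first Crossroad row gives precision = 4211 / (4211 + 246) ≈ 0.944 and recall = 4211 / (4211 + 6990) ≈ 0.375, matching the table.

```python
# Standard detection/tracking metrics used throughout the result tables
# (illustrative helper, not the open-source evaluator of [36]).
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


def mota(false_negatives: int, false_positives: int, id_switches: int, num_gt_objects: int) -> float:
    # CLEAR-MOT accuracy: 1 minus the ratio of all errors to the number of ground-truth objects.
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt_objects
```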
Fig. 11 YOLOv3-Tiny-enabled framework on the crossroad video
Fig. 12 YOLOv3-Tiny-enabled framework on the Racetrack video
As seen in Fig. 11, YOLOv3-Tiny has trouble detecting the cars in the distance, which results in numbering errors by a considerable amount. In Fig. 12, even with the modified Deep SORT algorithm now properly displaying the tracks, we still see skipped IDs, which is caused by the red car in the front, which YOLOv3-Tiny has trouble detecting consistently. This causes issues for Deep SORT in marking it as a confirmed track. In Figs. 13 and 14, the same problem is illustrated. The cars in the distance at the back cannot be properly detected, and the car numbered "2" has been eliminated because of excessive identity switches.
Fig. 13 YOLOv3-Tiny-enabled framework on the Straight road 1080p video
Fig. 14 YOLOv3-Tiny-enabled framework on the Straight road 480p video
Table 3 Results of Deep SORT and the modified Deep SORT using the YOLOv3-416 (MS-COCO) detector

Scene | Frames per second | Deep SORT Tracks Initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
Crossroad-Init = 4 | 10.3 | 411 | 150 | 92 | 0.892 | 0.823 | 0.856 | 9226 | 1110 | 1975
Crossroad-Init = 7 | 10.4 | 451 | 103 | 92 | 0.892 | 0.823 | 0.856 | 9226 | 1110 | 1975
Straight road 1080p | 10.3 | 10 | 8 | 9 | 0.876 | 0.759 | 0.813 | 325 | 46 | 103
Straight road 480p | 16.6 | 10 | 7 | 9 | 0.829 | 0.626 | 0.713 | 268 | 55 | 160
Racetrack | 10.5 | 35 | 24 | 23 | 0.985 | 0.919 | 0.951 | 3999 | 58 | 349
Racetrack 480p | 16.9 | 46 | 24 | 23 | 0.988 | 0.893 | 0.938 | 3887 | 47 | 461
Rural road | 10.6 | 83 | 45 | 44 | 0.877 | 0.894 | 0.885 | 2640 | 368 | 313
Rural road dusk | 10.9 | 58 | 23 | 24 | 0.937 | 0.802 | 0.865 | 1194 | 79 | 293
We can also notice that the lower video input resolution did not affect the detection process.

4.3.1.2 YOLOv3-416 In Table 3, the results of the Deep SORT and the modified Deep SORT using YOLOv3-416 as detector are presented. The results show that the detection performance on the "Straight road" video and on the more complex "Racetrack" video is significantly better than YOLOv3-Tiny. The increased detection performance, as seen in the recall and F1 score metrics, allows the Deep SORT framework to perform even better. The performance in the "Crossroad" video is low, mainly because of the low resolution and frame rate of the video captured by the CCTV. The results also point out that we now experience more identity switches, as seen from the tracks initiated by Deep SORT (411 and 451, respectively). The reason for this is that YOLOv3-416 is better at detecting hard-to-see cars compared to YOLOv3-Tiny. The results also show that the modified Deep SORT algorithm performed quite well and produced quite good tracking, counting 150 (vs 411) and 103 (vs 451) cars, respectively. On the Racetrack scene, we noticed significantly fewer initiated tracks, because this scene has a clear view of the cars. This allowed the much-improved YOLOv3-416 to keep track of the initiated objects. We now also notice near-perfect performance in the Rural road videos, which is attributed to the much better detection performance of YOLOv3-416 over YOLOv3-Tiny. The Deep SORT count is off by 1
compared to the ground truth, and the initiated tracks are significantly lower compared to YOLOv3-Tiny.

In Fig. 15, we provide the precision–recall curve for all scenes tested, and in Figs. 16, 17, 18 and 19, example detection frames from the YOLOv3-416 framework are illustrated. In Fig. 16, we can see that YOLOv3-416 can now detect cars that are far in the back distance. In Figs. 17 and 18, we see the same; the cars in the back distance are now detected and tracked properly. However, we still experience an identity switch even with this improved detection performance on both video inputs. We can also notice that the lower video input resolution did not greatly affect the detection process, since we only saw a tiny increase in detection performance for the cars that were furthest away. Finally, in Fig. 19, we can see proper detection and tracking performance for this part of the test. The increased accuracy of YOLOv3-416 over YOLOv3-Tiny is noticeable and provided better tracking performance, as illustrated above.

4.3.1.3 YOLOv4 In Table 4, the results of the Deep SORT and the modified Deep SORT using the YOLOv4-608 detector are presented. We again see that the detection performance on the "Straight road" video is good and the performance on the more complex "Racetrack" video is significantly better than using YOLOv3-Tiny and roughly equal to YOLOv3-416.
Fig. 16 YOLOv3-416-enabled framework on the crossroad video
Fig. 17 YOLOv3-416-enabled framework on the Straight road 1080p video
Fig. 18 YOLOv3-416-enabled framework on the Straight road 480p video
Fig. 19 YOLOv3-416-enabled framework on the Racetrack video
The performance in the "Crossroad" video is good enough when we use an n_init value of 4, something that is necessary because of the low resolution and frame rate of the video captured by the CCTV. It is the only way to track most of the cars, since a lot of them are visible for less than 7 frames. The increased performance of YOLOv4-608 is now visible in this instance. Lastly, it is worth noting that YOLOv4-608 is noticeably slower than YOLOv3-416, but not by a large amount. The increased detection performance is good and worth the cost of a few FPS, as we witness an uplift in all detection performance metrics compared to YOLOv3-416. Looking at the tracks initiated by the Deep SORT, we can once more see a lot more initiated tracks than expected on the crossroad video, which is caused mainly by the poor video quality. A small problem we noticed with YOLOv4 is that it sometimes detected cars in places where there were none. That happened for only one frame, so the Deep SORT algorithm was able to exclude that result without trouble. It is worth noting that we now have a tensor input resolution of (608, 608), so drops in detection accuracy are more noticeable on the 480p videos. The results on the "Straight road 480p" and "Racetrack 480p" indicate a drop in the detection performance in both cases. Performance in the Rural road videos remains good and significantly better than YOLOv3-Tiny.

In Fig. 20, we provide the precision–recall curve for all scenes tested, and in Figs. 21, 22, 23 and 24, we present example cases from the detection frames of YOLOv4-608.
Fig. 21 YOLOv4-608-enabled framework on the crossroad video
Fig. 22 YOLOv4-608-enabled framework on the Straight road 1080p video
Fig. 23 YOLOv4-608-enabled framework on the Straight road 480p video
Fig. 24 YOLOv4-608-enabled framework on the Racetrack video
Table 5 MOT comparison on the YOLOv4-608-enabled framework using the "Racetrack 480p" video

YOLOv4 | Frames per second | Deep SORT Tracks Initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
As seen in Fig. 21, YOLOv4-608 can now detect cars that are far away, and we can keep tracking them without having to be close to the camera. In Figs. 22 and 23, we do see good tracking. However, at that point of the video, the cars further away cannot be properly identified. Once the cars get closer to the camera, the detection and the tracking are excellent. We can also note that, this time, the lower video input resolution affected the detection process a bit more than previously, mainly because we have a tensor input of (608, 608) and the vertical resolution of the video was 480 pixels. Small cars were hard to detect properly from the reduced vector information. Finally, in Fig. 24, we once more see good tracking and detection performance by YOLOv4-608.

4.3.2 Results with optimized detection models trained on UA-DETRAC

with our YOLOv4 detector trained on the UA-DETRAC dataset. We also notice an increase in the execution performance of approximately 10%, as seen in Table 5. This is due to the simplification of the YOLOv4 network, since we train it on just one class.

We now pay attention to the "Tracks Initiated" metric and see that YOLOv4-608 works best on this test scene when trained on the UA-DETRAC dataset, with a perfect Deep SORT count and much lower initiated tracks compared to the MS-COCO one, which failed to track well during a mild occlusion phase and had trouble detecting some of the vehicles.

In Fig. 25, we provide the precision–recall curve, and in Fig. 26 we now see perfect numbering and tracking of all cars in that particular frame. All other detectors we tested failed to achieve this performance.
Table 6 MOT comparison on the YOLOv4-608-enabled framework using the "Crossroad" video

YOLOv4 | Frames per second | Deep SORT Tracks Initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
Fig. 27 Precision–recall curve for YOLOv4-608 comparison on the crossroad video
Fig. 28 Precision–recall curve for YOLOv4-608 comparison on the Rural road video
In Table 6, we see that in the "Crossroad" video we have an increase in the performance of approximately 10%. The results show that the UA-DETRAC-trained YOLOv4 detector reports better performance compared to the MS-COCO one. While the initiated tracks are significantly closer to the real counts, we do see a worse Deep SORT count for the UA-DETRAC-trained YOLOv4. We notice a much better tracking process and much higher detection accuracy for big trucks and vans compared to the MS-COCO-trained YOLOv4. A major benefit of the UA-DETRAC dataset is the wide variety of vehicles that is included. The MS-COCO dataset is lacking in that
Table 7 MOT comparison on the YOLOv4-608-enabled framework using the "Rural road" video

YOLOv4 | Frames per second | Deep SORT Tracks Initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
Table 8 MOT comparison on the YOLOv4-608-enabled framework using the "Rural road dusk" video

YOLOv4 | Frames per second | Deep SORT Tracks Initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
Table 10 MOT comparison on the YOLOv3-608-enabled framework using the "Racetrack 480p" video

YOLOv3 | Frames per second | Deep SORT Tracks Initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
Fig. 32 Precision–recall curve for YOLOv3-608 comparison on the Racetrack 480p video
Fig. 33 Precision–recall curve for YOLOv3-608 comparison on the crossroad video
The UA-DETRAC framework measured one above ground truth, because once more there was a car carrying a trailer in the video, which both models detected as a vehicle. The reason why MS-COCO managed to match ground truth is shown in Figs. 30 and 31: the MS-COCO framework completely failed to detect the truck shown in the pictures. As seen in Table 9, the UA-DETRAC-enabled YOLOv4 scored significantly better in the MOTA and MOTP metrics.

4.3.2.2 Results on YOLOv3 In Table 10, the results of the Deep SORT and the modified Deep SORT using YOLOv3 trained on UA-DETRAC as the detector are presented. For the training procedure on the UA-DETRAC dataset, the tensor input was set at 416 for 8000 batches, and we measured an average loss of 0.823 and a mAP of 96.31%. The results show an increase in the execution performance of approximately 15% on the ‘‘Racetrack 480p’’ video. This is due to the same reasons we described for YOLOv4. We again notice that the UA-DETRAC-enabled YOLO is better in this test scene compared to the MS-COCO one, which had a small mishap. The Deep SORT count is now perfect, since we once more achieved perfect tracking. However, YOLOv3, due to its worse overall mAP performance, exhibits more initiated tracks compared to YOLOv4. Looking at the detection metrics, we can tell that the UA-DETRAC-trained YOLO had trouble detecting the cars at times, but what made it score better in tracking was the detection consistency once detection occurred. We also provide the precision–recall curve in Fig. 32.

In Table 11, we notice an increase in the execution performance of approximately 15% on the ‘‘Crossroad’’ video. The UA-DETRAC-trained YOLOv3, while providing a Deep SORT count closer to ground truth…
Table 11 MOT comparison on the YOLOv3-608-enabled framework using the ‘‘Crossroad’’ video. Columns as in Table 10.
MS-COCO    8.2  312  108  92  0.929  0.868  0.897  9729  740  1472
UA-DETRAC  9.5  217   89  92  0.995  0.429  0.599  4807   24  6394
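The detection columns of these tables follow the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. As a quick sanity check, the snippet below (written for this text, not taken from the framework) reproduces the UA-DETRAC row of Table 11 from its raw counts.

    # Recompute precision, recall and F1 from the raw TP/FP/FN counts.
    def detection_metrics(tp, fp, fn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    p, r, f1 = detection_metrics(tp=4807, fp=24, fn=6394)
    print(f'precision={p:.3f} recall={r:.3f} F1={f1:.3f}')
    # prints precision=0.995 recall=0.429 F1=0.600, matching the table up to rounding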
Table 12 MOT comparison on the YOLOv3-608-enabled framework using the ‘‘Rural road’’ video. Columns as in Table 10.
Table 13 MOT comparison on the YOLOv3-608-enabled framework using the ‘‘Rural road dusk’’ video. Columns as in Table 10.
MS-COCO 59.1 71
UA-DETRAC 19.4 67.6
Table 15 MOT comparison on the YOLOv3-Tiny-enabled framework using the ‘‘Racetrack 480p’’ video. Columns: YOLOv3-Tiny training set, frames per second, Deep SORT count, tracks initiated, modified Deep SORT count, ground truth, precision, recall, F1 score, TP, FP, FN.
Table 16 MOT comparison on the YOLOv3-Tiny-enabled framework using the ‘‘Crossroad’’ video. Columns as in Table 15.
Fig. 39 Precision–recall curve for YOLOv3-Tiny comparison on the crossroad video
Fig. 40 Precision–recall curve for YOLOv3-Tiny comparison on the Rural road video
…improved detection performance, as seen by the significantly better F1 score and precision–recall curve (Fig. 39).

Moving on to the ‘‘Rural road’’ video, we again see an increase in execution performance of roughly 15%. We again notice that the UA-DETRAC-enabled YOLO is better in this test scene than the MS-COCO one, by a large margin, as indicated by the detection performance metrics. In Table 17, the results show that both models had poor performance in this test, given how far they are from the ground truth. YOLOv3-Tiny failed numerous times and, even though its execution performance is nearly doubled, the loss in accuracy in this test is substantial (Fig. 40).

Lastly, the results for the ‘‘Rural road dusk’’ video are presented in Table 18. We notice an increase in execution performance of roughly 15%. This time the UA-DETRAC-trained YOLOv3-Tiny showed better results compared to the MS-COCO one. The MS-COCO-trained YOLOv3-Tiny failed consistently to keep track of cars, and this is shown by the modified Deep SORT count metric. The UA-DETRAC model had much better detection performance, although still poor, but that helped the Deep SORT count metric to be closer to ground truth. Tracking performance remains significantly worse when compared to YOLOv4, as seen from the high tracks-initiated metric of 73, and it is clear that YOLOv3-Tiny is not suitable for accurate tracking and numbering of cars. Here, we can also see the improvement in tracking performance through the MOTA and MOTP metrics in Table 19. While there is an improvement, it is once more obvious that tracking was poor, as is also seen in Fig. 41.

In conclusion, the use of the UA-DETRAC dataset assisted in creating an overall better framework for car traffic, especially when YOLOv4 is used as the detector. The framework reports the best tracking performance, while also being slightly faster compared to its MS-COCO counterpart.
Table 17 MOT comparison on the YOLOv3-Tiny-enabled framework using the ‘‘Rural road’’ video. Columns as in Table 15.
Table 18 MOT comparison on the YOLOv3-Tiny-enabled framework using the ‘‘Rural road dusk’’ video. Columns as in Table 15.
Table 19 MOT metrics on the YOLOv3-Tiny-enabled framework using the ‘‘Rural road dusk’’ video
YOLOv3-Tiny   MOTA   MOTP
MS-COCO        3.7   67.6
UA-DETRAC     10.8   49.7

Table 21 MOT metrics on the MOT16 scene
MOT16-09      MOTA   MOTP
YOLOv3-Tiny   37.2   76.1
YOLOv3-608    57.1   78.8
YOLOv4-608    42.3   61.9
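For reference, MOTA and MOTP are the CLEAR MOT summary metrics: MOTA accumulates misses, false positives and identity switches over all frames relative to the total number of ground-truth objects, while MOTP is the average bounding-box overlap of the correctly matched pairs (reported as a percentage, so higher is better). The function below is an illustrative restatement of those standard formulas, not the evaluation code used for the tables.

    # CLEAR MOT summary metrics (standard definitions):
    #   MOTA = 1 - (FN + FP + IDSW) / GT, with all terms summed over frames
    #   MOTP = average overlap of the matched boxes, as a percentage
    def clear_mot(frames):
        """frames: per-frame dicts with keys 'fn', 'fp', 'idsw', 'gt', 'match_overlaps'."""
        fn = fp = idsw = gt = n_matches = 0
        overlap_sum = 0.0
        for f in frames:
            fn += f['fn']
            fp += f['fp']
            idsw += f['idsw']
            gt += f['gt']
            overlap_sum += sum(f['match_overlaps'])
            n_matches += len(f['match_overlaps'])
        mota = 100.0 * (1.0 - (fn + fp + idsw) / gt)
        motp = 100.0 * overlap_sum / n_matches if n_matches else 0.0
        return mota, motp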
Fig. 41 Precision–recall curve for YOLOv3-Tiny comparison on the Rural road dusk video
Fig. 42 Precision–recall curve comparison on the MOT16-09 video
4.3.3 Exploring the modified Deep SORT on pedestrian videos

The second part of the experimental study concerned the evaluation of our framework and the modified Deep SORT in pedestrian videos and scenarios. In the context of this experiment, two scenes from the MOT benchmarks were used, one from MOT16 and one from MOT20, respectively. We also provide the MOTA and MOTP metrics as defined by the MOT benchmark.
Fig. 43 YOLOv3-Tiny-enabled framework on the MOT16-09 scene
Fig. 44 YOLOv3-608-enabled framework on the MOT16-09 scene
4.3.3.1 Results on the MOT16 scene In Tables 20 and 21, the results on the MOT16 scene are illustrated. The results show that YOLOv3-Tiny achieves the best execution throughput, with 15 frames per second using 1080p video as input. YOLOv3-608 and YOLOv4-608 run at nearly half that rate, with 8 and 7.6 FPS, respectively. In Fig. 42, we also provide the precision–recall curve.

However, YOLOv3-Tiny failed to keep track of most objects due to its poor detection rate, while it also had a MOTA score of 37.2 and a MOTP score of 76.1. Many pedestrians that were not close to the camera could not get tracked, as seen in Fig. 43. The YOLOv3-Tiny-powered Deep SORT could not perform well enough, having a modified Deep SORT count of 14 and 52 initiated tracks.

YOLOv3-608 did much better compared to YOLOv3-Tiny, having a modified Deep SORT count of 22 and 81 initiated tracks, with a MOTA score of 57.1 and a MOTP score of 78.8. The high number of initiated tracks does show that tracking was still relatively poor, given that we only need to track 25 pedestrians. As can be seen in Fig. 44, this detector did a much better job at detecting pedestrians that were far away from the camera. It did, however, face a problem that we describe below in our YOLOv4-608 notes.
Fig. 45 YOLOv4-608-enabled framework on the MOT16-09 scene
YOLOv4-608 achieved the best tracking performance, and our framework reached a MOTA score of 42.3 and a MOTP score of 61.9. It was able to detect the people behind the glass entrance of the shop, and it kept track of people better than YOLOv3. It also initiated fewer tracks when compared to YOLOv3-608. One major advantage of YOLOv4 was that it was much better at the proper distinction and feature capture of the tracks. As an example, notice in Fig. 45 how the old man in the background is now properly labeled as track 25 and not as track 5, in contrast to Fig. 44, where YOLOv3 thought it was the same pedestrian that had passed at the beginning of the video.

Table 23 MOT metrics on the MOT20 scene
MOT20-01      MOTA   MOTP
YOLOv3-Tiny    4.9   67.1
YOLOv3-608    31.8   73.3
YOLOv4-608    18.9   59.7
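The ‘‘distinction and feature capture’’ mentioned above relies on Deep SORT's appearance association: each confirmed track keeps a small gallery of appearance embeddings, and a new detection is only assigned to a track if the smallest cosine distance between its embedding and that gallery falls below a matching threshold (in addition to the motion gate). The snippet below is a simplified illustration with assumed names and an assumed threshold value; it is not the exact association code of the framework.

    import numpy as np

    def cosine_distance(a, b):
        # Cosine distance between two appearance embeddings.
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b)
        return 1.0 - float(np.dot(a, b))

    def appearance_match(track_gallery, detection_embedding, max_distance=0.2):
        # A detection fits a track if it is close to ANY stored embedding of that track.
        best = min(cosine_distance(e, detection_embedding) for e in track_gallery)
        return best <= max_distance, best

Better and tighter detections give the appearance network cleaner crops to embed, which is one plausible reason the YOLOv4-based runs confuse distinct pedestrians less often.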
Fig. 47 YOLOv3-Tiny-enabled framework on the MOT20-01 scene
Fig. 48 YOLOv3-608-enabled framework on the MOT20-01 scene
Fig. 49 YOLOv4-608-enabled framework on the MOT20-01 scene
…confirm tracks. The worst detector was again YOLOv3-Tiny, since it failed to keep track of most objects due to its poor detection rate, having a MOTA score of 4.9 and a MOTP score of 67.1. The framework using the YOLOv3-Tiny detector had a count of 20 and 114 initiated tracks, which is far from the ground truth of 90 tracks. We provide a screenshot in Fig. 47 that shows its failure to detect many of the pedestrians.

YOLOv3-608 did much better compared to YOLOv3-Tiny, having a Deep SORT count of 48 and 188 initiated tracks, with a MOTA score of 31.8 and a MOTP score of 73.3. The high number of initiated tracks does show that tracking was still relatively poor, given that we only need to track 90 pedestrians. As shown in Fig. 48, this detector did a much better job at detecting pedestrians that were far away from the camera. Having said that, the people who were very far away could not be detected by any of our models; at that part of the scene, heavy occlusion occurs and there is a hefty amount of video noise and blur from the poor lighting conditions.

Finally, this time YOLOv4-608 achieved worse tracking performance, while also having a worse Deep SORT count compared to YOLOv3-608, with a MOTA score of 18.9 and a MOTP score of 59.7. It initiated 135 tracks, which is significantly fewer when compared to YOLOv3-608, but that is just because it failed to track many of the pedestrians,
as seen in Fig. 49. The population density of this scene, paired with the increased video noise and the reflection from the sun at the back, makes this scene incredibly difficult to complete. We show the ground-truth bounding boxes for this scene at the same frame in Fig. 50.

…plan to compare and explore tracking performance using our own custom, fine-tuned and purpose-built detectors.

6 Supplementary materials