
Neural Computing and Applications

https://doi.org/10.1007/s00521-021-06391-y

S.I.: INFORMATION, INTELLIGENCE, SYSTEMS AND APPLICATIONS

Real-time multiple object tracking using deep learning methods


Dimitrios Meimetis · Ioannis Daramouskas · Isidoros Perikos · Ioannis Hatzilygeroudis

Received: 21 October 2020 / Accepted: 26 July 2021


© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2021

Abstract
Multiple-object tracking is a fundamental computer vision task which is gaining increasing attention due to its academic
and commercial potential. Multiple-object detection, recognition and tracking are quite desired in many domains and
applications. However, accurate object tracking is very challenging, and things are even more challenging when multiple
objects are involved. The main challenges that multiple-object tracking is facing include the similarity and the high density
of detected objects, while also occlusions and viewpoint changes can occur as the objects move. In this article, we
introduce a real-time multiple-object tracking framework that is based on a modified version of the Deep SORT algorithm.
The modification concerns the process of the initialization of the objects, and its rationale is to consider an object as tracked
if it is detected in a set of previous frames. The modified Deep SORT is coupled with YOLO detection methods, and a
concrete and multi-dimensional analysis of the performance of the framework is performed in the context of real-time
multiple tracking of vehicles and pedestrians in various traffic videos from datasets and various real-world footage. The
results are quite interesting and highlight that our framework has very good performance and that the improvements on
Deep SORT algorithm are functional. Lastly, we show improved detection and execution performance by custom training
YOLO on the UA-DETRAC dataset and provide a new vehicle dataset consisting of 7 scenes, 11,025 frames and 25,193
bounding boxes.

Keywords Computer vision · Multiple-object tracking · Deep learning · Deep SORT · YOLO

Correspondence: Isidoros Perikos, [email protected]; Dimitrios Meimetis, [email protected]; Ioannis Daramouskas, [email protected]; Ioannis Hatzilygeroudis, [email protected]
Computer Engineering and Informatics Department, University of Patras, Patras, Greece

1 Introduction

Computer vision is a fundamental domain that aims to allow computer systems to analyze images, extract knowledge and interpret them as humans do. Multi-object tracking (MOT), also called multi-target tracking (MTT), plays a very important part in computer vision [27]. In general, the task of MOT is largely partitioned into locating multiple objects, maintaining their identities and yielding their individual trajectories given an input video [1]. MOT aims to process videos in order to identify and track objects that belong to one or more categories, such as cars, pedestrians, objects and animals, without any prior knowledge concerning the appearance, the movement and the number of targets [2, 26]. Many computer vision problems depend to a large extent on multiple-object tracking systems [18, 28]. There are two important steps involved in designing such systems. The first step involves the detection of the objects [23]: the desired objects are detected in each frame of the video. The quality of the detection directly affects the performance of the overall monitoring procedure. The second step involves the matching of the identified objects to the previous ones to obtain their trajectories. High accuracy in the object detection system results in a smaller number of missing detections and ultimately produces smooth and accurate trajectories. Multiple-object tracking is also considered the process of locating multiple moving objects over time [27, 31, 41]. Multiple-object tracking systems have a variety of uses in a wide range of topics like security, video communication and compression, traffic control, medical imaging, self-driving cars and robotics [24, 29, 30].

Although object tracking is quite useful and desired, accurate, real-time object tracking is quite challenging, and things are even more challenging when multiple objects are involved [37, 38]. One major issue that MOT is facing is the bounding box level tracking performance and saturation; therefore, most research is focused on handling these aspects [3, 4]. For tracking to perform sufficiently, object detection needs to work flawlessly across all frames without having to use interpolation [39]. However, this is almost impossible due to occlusion, the variety in viewpoints and the noise that may be introduced in a video. Also, real-time tracking requires great computational resources and needs to face challenges like identity switches and various detection failures [33, 34, 40].

In this paper, first we explore the performance of various deep learning methods on the task of multiple-object tracking. We examine how widespread deep learning architectures perform under various contexts in a wide range of scene scenarios. Also, the paper introduces a modification of the Deep SORT [5, 25] algorithm, which greatly improves the performance of object tracking methods, using different object detection models, such as YOLOv3-608 [6], YOLOv3-Tiny [6] and YOLOv4 [7]. The Deep SORT implementation is an extension to the simple online and real-time tracking (SORT) [8] algorithm, and the SORT framework is utilized too as a means to measure bounding box overlaps. While this is a high-performance method, the number of identity switches due to occlusions from poor camera angles is too high [9–12]. By incorporating convolutional neural networks to additionally include appearance information, Deep SORT can substantially reduce identity switches. Our modification on Deep SORT is based on the process of the initialization of the tracked object IDs and the way they are assigned and passed through, to be shown during the visualization process. The results indicate that our modified Deep SORT now properly displays the track IDs and is closer to the ground truth in all the examined cases, a problem that exists on all YOLO & Deep SORT implementations we have found on the Internet. In addition, we present a way to improve the real-time operation of the deep learning methods by identifying and facing a bottleneck in the MOT framework. We tested and provide a way that can greatly improve the execution time of the tracking procedure. The results show that we have an increase of frames per second (FPS) in all examined deep learning networks which is up to 22%.

The novelty and the contribution of this paper can be summarized as follows. First, we explore the performance of various deep learning methods on the task of real-time multiple-object tracking. Our focus is road traffic, but we also include pedestrian tests and we compare a grand total of 7 YOLO derivatives ranging from YOLOv3-Tiny all the way to a fully fledged YOLOv4 implementation. For the tracking mechanism, we use the Deep SORT framework, which we modified to properly display the correct track ID in the real-time video feedback, a problem that occurred in all YOLO and Deep SORT implementations we could find on the Internet. For each implementation, we provide the optimal detection and tracking parameters, which could be useful for fellow researchers and hobbyists. Moreover, we perform custom transfer learning training of the YOLO detector using a slightly modified version of the UA-DETRAC dataset. The UA-DETRAC trained YOLOv4 provided state-of-the-art performance when compared to the publicly available MS-COCO trained YOLOv4 in our test scenes. In addition to that, we also performed performance characterization on this framework and found a bottleneck in the execution pipeline, which we resolved, and we saw an execution performance increase of up to 22%. For the evaluation process, we provide a wide variety of metrics and we test nine different scenes, six of which are our own. For all test scenes, we also provide the ground truth files, which we generated either from the ground up or from existing data. Finally, through the creation of the ground truth files, we also provide a new vehicle multiple-object dataset consisting of 7 scenes, 11,025 frames and 25,193 bounding boxes.

The rest of the paper is structured as follows. Section 2 examines the literature and presents related works and methods on multiple-object tracking. Section 3 presents our implementation that is based on the modified Deep SORT algorithm and the YOLO detection networks. The modification made on the Deep SORT algorithm is presented and the way it affects the visualization of the tracking of moving objects is illustrated. Section 3 also describes the way that some bottlenecks in the multiple-object tracking procedure are faced and presents the way that improves the real-time performance in terms of frames per second. Section 4 deals with the experimental study, explains the datasets used for the training and testing phases and presents the results collected. Finally, Sect. 5 concludes the article and draws directions for future work.

2 Related work

Multiple-object tracking is attracting the increasing attention of researchers in computer vision and artificial intelligence. Several works in the literature study the performance of methods and systems, and a detailed description of approaches and techniques can be found in [1, 2, 12].

In the work presented in [3], authors propose an online multi-target tracker that exploits both high and low-confidence target detections in a probability hypothesis density particle filter framework. Authors formulate an early association strategy between trajectories and detections after the prediction stage, which allows performing target estimation and state labeling without any additional mechanisms. The authors' solution has a peak multiple-object tracking accuracy (MOTA) score of 53 on MOT15 and 52.5 on MOT16.

Authors in [8] present an approach to multi-object tracking where the main focus is to associate objects efficiently for online and real-time applications. To this end, detection quality is identified as a key factor influencing tracking performance, where changing the detector can improve tracking by up to 18.9%. Despite only using a rudimentary combination of familiar techniques such as the Kalman filter and the Hungarian algorithm for the tracking components, the approach achieves accuracy that is comparable to state-of-the-art online trackers. Additionally, emphasis is placed on efficiency for facilitating real-time tracking and to promote greater uptake in applications such as pedestrian tracking for autonomous vehicles. While being an overall good framework at the time, the identity switches are rather high, with a value of 1001 in the MOT benchmark. Their solution has a peak MOTA score of 33.4 on MOT15.

In the work presented in [4], authors present an online method that encodes long-term temporal dependencies across multiple cues. One key challenge of tracking methods is to accurately track occluded targets or those which share similar appearance properties with surrounding objects. To address this challenge, authors present a structure of recurrent neural networks (RNN) that jointly reasons on multiple cues over a temporal window. Their motion and interaction models leverage two separate long short-term memory (LSTM) networks that track the motion and interactions of targets for a longer period, suitable for the presence of long-term occlusions. Their solution has a peak MOTA score of 37.6 on MOT15.

In the work presented in [2], authors present a comprehensive survey on works that employ deep learning models to solve the task of MOT on single-camera videos. Four main steps in MOT algorithms are identified, and an in-depth review of how deep learning was employed in each one of these stages is presented. A complete experimental comparison of the presented works on the three MOTChallenge datasets is also provided, identifying a number of similarities among the top-performing methods and presenting some possible future research directions.

In the work presented in [13], authors build on a neural class-agnostic single-object tracker named HART and introduce a multi-object tracking method, MOHART, capable of relational reasoning. Authors explore a number of relational reasoning architectures and show that multi-headed self-attention outperforms the provided baselines and better accounts for complex physical interactions in a toy experiment. Authors find that it leads to consistent performance gains in tracking as well as future trajectory prediction on three real-world datasets (MOTChallenge, UA-DETRAC and the Stanford Drone dataset), particularly in the presence of ego-motion, occlusions, crowded scenes and faulty sensor inputs. On the MOTChallenge dataset, HART achieves 66.6% IOU, which itself is impressive given the small amount of training data of only 5225 training frames and no pre-training.

In the work presented in [14], authors present an end-to-end model, named FAMNet, where feature extraction, affinity estimation and multi-dimensional assignment are refined in a single network. All layers in FAMNet are designed to be differentiable and thus can be optimized jointly to learn the discriminative features and a higher-order affinity model. Authors also integrate a single-object tracking technique and a dedicated target management scheme into the FAMNet-based tracking system to further recover false negatives and inhibit noisy target candidates generated by the external detector. The proposed method is evaluated on a diverse set of benchmarks including MOT2015, MOT2017, KITTI-Car and UA-DETRAC and achieves promising performance on all of them in comparison with the state-of-the-art. The authors' method has a MOTA score of 40.6 on MOT15.

In the work presented in [15], authors introduce a focal loss-based RetinaNet for vehicle detection, which works as a one-stage object detector that matches the speed of regular one-stage detectors while defeating two-stage detectors in accuracy. State-of-the-art performance has been shown on the DETRAC vehicle dataset. This is important because one-stage and two-stage object detectors are regarded as the two most important groups of convolutional neural network-based object detection methods. A one-stage object detector can usually outperform a two-stage object detector in speed; however, it normally trails in detection accuracy, compared with two-stage object detectors.

In the work presented in [16], the authors introduce the deep motion modeling network (DMM-Net), which can estimate multiple objects' motion parameters to perform joint detection and association in an end-to-end manner. DMM-Net models object features over multiple frames and simultaneously infers object classes, visibility and their motion parameters. These outputs are readily used to update the tracklets for efficient MOT. DMM-Net achieves a PR-MOTA score of 12.80 at 120+ fps on the popular UA-DETRAC challenge. Authors also introduce a synthetic large-scale public dataset, Omni-MOT, for vehicle tracking that provides precise ground-truth annotations to eliminate the detector influence in MOT evaluation.
In the work presented in [17], authors present a CNN-based framework for online MOT. This framework utilizes the merits of single-object trackers in adapting appearance models and searching for the target in the next frame. Simply applying a single-object tracker for MOT will encounter problems in computational efficiency and drifted results caused by occlusion. Their framework achieves computational efficiency by sharing features and using ROI-Pooling to obtain individual features for each target. In the framework, they introduce a spatial–temporal attention mechanism (STAM) to handle the drift caused by occlusion and interaction among targets. Besides, the occlusion status can be estimated from the visibility map, which controls the online updating process via weighted loss on training samples with different occlusion statuses in different frames. It can be considered as a temporal attention mechanism. The proposed algorithm achieves 34.3% and 46.0% in MOTA on the challenging MOT15 and MOT16 benchmark datasets, respectively.

3 Methodology

In this section, we present our framework for object tracking that relies on the modification of the Deep SORT algorithm. We describe the main methods for object detection and tracking and the architecture of this implementation. More specifically, for the object detection procedure, YOLO models are utilized to detect desired objects in a frame, and after that, a modified version of the Deep SORT algorithm is introduced to perform object tracking in the sequences of frames. Our modification on the Deep SORT algorithm concerns the process of the initialization of the object IDs, and its rationale is to consider an object as "tracked" if it is detected in a set of previous frames. We assess the performance of the YOLO object detection models trained on the MS COCO and UA-DETRAC datasets using transfer learning in order to assess their performance and identify suitable and optimal synergies for multi-object tracking. The modified Deep SORT algorithm is tested using the YOLO models in tracking cars and pedestrians, on a variety of datasets and scenes, and its performance is compared to the original Deep SORT algorithm. The results indicate that our modified Deep SORT algorithm now properly displays the assigned track IDs, while also providing good tracking performance. In the following subsections, we present the modification made on the Deep SORT algorithm, the implementation of the YOLO models as well as the optimization of our framework.

3.1 Modified Deep SORT tracking algorithm

One of the most widely used object tracking frameworks is Deep SORT, which is an extension to SORT (simple real-time tracker) [5]. Deep SORT achieves better tracking and fewer identity switches by including an appearance feature vector for the tracks, which is derived, in this case, by a pre-trained CNN that runs on the YOLO detected bounding boxes. Since simple detection models are very likely to fail at detecting numerous objects consecutively as the frames go by, we need to add new methods to keep track of them and properly identify them. This is where Deep SORT comes in to make a proper MOT framework.

The Kalman filter is a crucial component in Deep SORT. Each state contains 8 variables (u, v, a, h, u′, v′, a′, h′), where (u, v) are the coordinates of the bounding box, a is the aspect ratio and h is its height. The respective velocities are given by u′, v′, a′, h′. The state contains only absolute position and velocity factors, since we assume a simple linear velocity model. The Kalman filter helps us face the problems that may arise from non-perfect detection and uses prior states to predict a good fit for future bounding boxes.
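As a concrete illustration of this constant-velocity model, the sketch below builds the transition and measurement matrices for the 8-dimensional state and applies the standard Kalman predict/update equations with NumPy. It is only a minimal rendering of the idea described above, not the actual Deep SORT filter code; the process and measurement noise covariances, which the real filter tunes carefully, are left as plain inputs here.

```python
import numpy as np

# State x = [u, v, a, h, u', v', a', h']: box position/shape plus velocities.
dt = 1.0                              # one frame per step
F = np.eye(8)
F[:4, 4:] = dt * np.eye(4)            # position += velocity * dt (constant velocity)

H = np.zeros((4, 8))
H[:4, :4] = np.eye(4)                 # the detector only measures (u, v, a, h)

def predict(x, P, Q):
    """Propagate the state mean x (8,) and covariance P (8x8) one frame ahead."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, R):
    """Fuse a detection z = (u, v, a, h) with measurement noise R into the state."""
    y = z - H @ x                     # innovation
    S = H @ P @ H.T + R               # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    return x + K @ y, (np.eye(8) - K @ H) @ P
```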
Now that we have the new bounding boxes tracked from the Kalman filter, the next problem lies in associating new detections with the predictions that have been created. Since they are processed independently, a method is needed to associate track_i with an incoming detection_k. To solve this, Deep SORT implements two things: a distance metric to quantify the association and an efficient algorithm to associate the data. The authors decided to use the squared Mahalanobis distance (an effective metric when dealing with distributions) to incorporate the uncertainties from the Kalman filter. Thresholding this distance can give us a very good idea of the actual associations. This metric is more accurate than, say, the Euclidean distance, as we are effectively measuring the distance between two distributions. For the data association part, the Hungarian algorithm is used. Lastly, the feature vector becomes our "appearance descriptor" of the object. The authors have added this vector as part of the distance metric. Now, the updated distance metric is:

D = λ · Dk + (1 − λ) · Da

where Dk is the Mahalanobis distance, Da is the cosine distance between the appearance feature vectors and λ is the weighting factor. The importance of Da is so high that the authors claim they were able to achieve state of the art even with λ = 0, meaning that they only used the appearance descriptor for the calculation. We provide the pseudocode for the Deep SORT-enabled framework below.


1. For every frame:
2. Perform prediction for the tracks using the Kalman filter.
3. For every YOLO detection:
4. Get detection features.
5. Run non-maxima suppression.
6. Calculate the squared Mahalanobis distance based on the predicted Kalman states.
7. Find the smallest feature cosine distance for every existing track.
8. Update the tracker with the use of IOUs and the Hungarian algorithm.
9. If an initiated track has been consecutively detected for n_init frames, then confirm the track.
10. Else:
11. Delete the track.
12. If a confirmed track has been consecutively not detected for MaxAge frames, then delete the track.
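To make the association step of the pseudocode concrete, the following is a minimal sketch of how the gated cost matrix and the assignment could be computed with NumPy and SciPy. The helper names and the dictionary layout of tracks and detections are our own illustrative assumptions, not the actual Deep SORT code; only the combined metric D = λ·Dk + (1 − λ)·Da and the use of the Hungarian algorithm follow the description above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

LAMBDA = 0.0                 # weighting factor; 0 means appearance distance only
GATE = 9.4877                # chi-square 0.95 quantile for 4 degrees of freedom

def squared_mahalanobis(mean, cov, det_xyah):
    # Distance between the position part of a predicted Kalman state and a detection.
    d = det_xyah - mean[:4]
    return float(d @ np.linalg.inv(cov[:4, :4]) @ d)

def min_cosine_distance(track_feats, det_feat):
    # Smallest cosine distance between a detection feature and a track's stored features.
    a = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    b = det_feat / np.linalg.norm(det_feat)
    return float(1.0 - np.max(a @ b))

def associate(tracks, detections):
    cost = np.full((len(tracks), len(detections)), 1e5)
    for i, trk in enumerate(tracks):
        for j, det in enumerate(detections):
            dk = squared_mahalanobis(trk["mean"], trk["cov"], det["xyah"])
            if dk > GATE:                     # gate away implausible motion
                continue
            da = min_cosine_distance(trk["features"], det["feature"])
            cost[i, j] = LAMBDA * dk + (1.0 - LAMBDA) * da
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 1e5]
```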

The variables that cause the biggest change in performance are the score and IOU of each respective model, and n_init, max_cosine_distance and max_iou_distance from the Deep SORT framework. We will present the optimal values for each implementation in the videos used. We keep max_cosine_distance and max_iou_distance at the same values, of 0.4 and 0.7, respectively, for all our tests. We also set our n_init at 7, unless stated otherwise. The variable n_init dictates how many successful detections a track must have before it goes from its initial tentative state to its confirmed state.
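For concreteness, this is roughly how those parameters would be wired into a tracker object in the widely used open-source Deep SORT implementation; the module and constructor names below reflect our understanding of that code base and should be treated as an assumption rather than a quotation of it.

```python
# Hypothetical wiring of the tracking parameters discussed above.
from deep_sort import nn_matching          # assumed module layout of the open-source Deep SORT
from deep_sort.tracker import Tracker

MAX_COSINE_DISTANCE = 0.4                  # appearance gate used in all our tests
MAX_IOU_DISTANCE = 0.7                     # IOU gate used in all our tests
N_INIT = 7                                 # detections needed before a track is confirmed

metric = nn_matching.NearestNeighborDistanceMetric("cosine", MAX_COSINE_DISTANCE)
tracker = Tracker(metric, max_iou_distance=MAX_IOU_DISTANCE, n_init=N_INIT)
```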
A main aspect we noticed in the functionality and utilization of Deep SORT concerns the fact that it shows the initiated track IDs on the bounding boxes and not the confirmed tracks. This operation may cause problems in tracking and correctly numbering the detected objects in the sequences of frames. To address this problem, a main modification that we implemented to the Deep SORT algorithm relates to the proper display and count of the confirmed detections. Specifically, each track has three states: the initial tentative state, the confirmed state and the deleted state. Every new track is classified as tentative for the first n_init frames. If the n_init frames pass and the track is still identified, it will become confirmed and feature similarity will also be employed. If a track fails to be identified properly using IOU similarity for every frame in the n_init phase, then it will be classified as deleted. We made sure that every bounding box shown on screen has the proper confirmed state ID on it.
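A minimal sketch of the tentative/confirmed/deleted bookkeeping described above is shown below; the class and attribute names are illustrative rather than the actual Deep SORT code, and in our modified version only confirmed IDs are drawn on the output frames.

```python
class TrackState:
    TENTATIVE, CONFIRMED, DELETED = 1, 2, 3

class Track:
    def __init__(self, track_id, n_init=7, max_age=30):
        self.track_id = track_id
        self.state = TrackState.TENTATIVE
        self.hits = 1                  # consecutive successful associations so far
        self.time_since_update = 0     # frames since the last successful association
        self.n_init = n_init
        self.max_age = max_age

    def mark_hit(self):
        self.hits += 1
        self.time_since_update = 0
        if self.state == TrackState.TENTATIVE and self.hits >= self.n_init:
            self.state = TrackState.CONFIRMED      # only confirmed IDs are displayed

    def mark_missed(self):
        self.time_since_update += 1
        if self.state == TrackState.TENTATIVE:
            self.state = TrackState.DELETED        # failed during the n_init phase
        elif self.time_since_update > self.max_age:
            self.state = TrackState.DELETED        # confirmed track lost for too long
```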
Below we provide some example cases of the comparison results between Deep SORT and the modified version. In Figs. 1 and 2, we present a case from the MOT15 dataset, where the ground truth for that part of the scene is 34 people. Our modified Deep SORT framework measured 32 people, while the original Deep SORT resulted in measuring 79 people, as illustrated in the bounding box IDs in Fig. 1.

Fig. 1 Non-modified Deep SORT+YOLO framework. Frame ground truth is 34
Fig. 2 Results after our modifications. Frame ground truth is 34
Fig. 3 Non-modified Deep SORT+YOLO framework in the "Racetrack" scene. Frame ground truth is 19
Fig. 4 Results after our modifications in the "Racetrack" scene. Frame ground truth is 19
Fig. 5 Non-modified Deep SORT+YOLO framework in the "Rural road dusk" scene. Frame ground truth is 18
Fig. 6 Results after our modifications in the "Rural road dusk" scene. Frame ground truth is 18


The difference is massive, and the main reason for this is the large number of identity switches either from occlusion or poor viewing angle, as the detector struggles to maintain accurate detections across all frames. This can result in tracking numbers that, when the original Deep SORT algorithm is used, are quite higher than the ground truth. The modified version correctly displays the confirmed tracks on the video output.

An additional example case is illustrated in Figs. 3 and 4. The example case is a frame of the "Racetrack" test scene, where cars are detected and tracked. The ground truth of the example case is 19. In Fig. 3, the performance of the original Deep SORT is off by 6, tracking and giving IDs to 25 different cars. Also, it is worth noting that even in the same frame the non-modified code had already failed to properly ID this stack of cars, as illustrated by the fact that there is no car numbered with ID 24. The modified Deep SORT has a quite better performance, which matches the ground truth.

Finally, another example case is presented in Figs. 5 and 6 in our own, real-world test scene named "Rural road dusk." The ground truth at that part of the scene is 18. In Fig. 5, the results of the original Deep SORT are off by 20. Although there were 18 different cars in the scene, the original Deep SORT and YOLO framework resulted in tracking and giving IDs up to 38. In Fig. 6, the results of the modified Deep SORT are presented and we can see that the resulting IDs are very close to the ground truth and are off by just 1. The YOLO detector works the same way in both cases and the modified Deep SORT performs quite better compared to the original version, reporting almost perfect performance.

As illustrated in the above three example cases, the modified version of Deep SORT has quite good performance and resulted in better tracking and consistent annotation of IDs. This is crucial when we create real-time online MOT systems, since we can even feed in real time a live video from a camera and have it display proper IDs and tracking results.

3.2 Detection models

The Deep SORT tracking algorithm needs to be integrated with a multiple-object detection model that will perform the detection of the desired objects in a frame; after that, Deep SORT will perform the tracking procedure. In the context of our study, we examine the performance of Deep SORT in a pipeline with YOLO (You Only Look Once) [19]. YOLO has been proven to offer high performance and detection accuracy [35], and the YOLO models used and examined here are (i) YOLOv3-Tiny, (ii) YOLOv3-416 and 608 and (iii) YOLOv4-608. These models are trained on the MS-COCO and DETRAC datasets, and we use the weights that have been generated by the training on these datasets. The first implementation works with the YOLOv3-Tiny model and weights. In the second implementation, the YOLOv3-416 and 608 model and the corresponding weights are used, and lastly, we use YOLOv4-608 with a 608-by-608 tensor input to test our framework with state-of-the-art models. These YOLO detection models have been formulated into Keras along with their weights that were generated from their respective Darknet projects. Darknet [32] is an open-source neural network framework written in C and CUDA, and it supports CPU and GPU computation.

The developers of YOLO reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection. First, YOLO is extremely fast. The frame detection process is treated as a regression problem, which enables YOLO to have a simplified pipeline. They simply run this neural network on a new image at test time to predict detections. This model achieves high throughput, which makes it suitable for processing streaming video in real time. Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method, mistakes background patches in an image for objects, because it cannot see the larger context. Third, YOLO learns generalizable representations of objects. This network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means that YOLO reasons globally about the full image and all the objects in the image.

3.2.1 YOLOv3-Tiny integration

For the tiny YOLOv3 model, we are using these anchors: [10, 13], [23, 27], [37, 58], [81, 82], [135, 169], [344, 319], which correspond to the size of the bounding boxes and are fundamental for the correct training and detection of our CNN. Moreover, we set anchor mask values of [[3, 4, 5], [0, 1, 2]]. We configure them based on the design and the dimension of the objects we want to detect. To begin with, for this framework we set a score = 0.3 and IOU = 0.2 for our tests in section 7.1 and score = 0.6 and IOU = 0.3 for 7.2 and 7.3. The score is the confidence percentage for the detection coming out of our CNN.


Do keep in mind that YOLOv3-Tiny has only 21 layers and it needs only 5.5 billion flops per frame. For the study, we use a tensor input of (416, 416). YOLOv3-Tiny has a mean average precision (mAP) of 23.7% on the MS COCO dataset.

3.2.2 YOLOv3 integration

For YOLOv3-416 and 608, we are using these anchors: [10, 13], [16, 30], [33, 23], [30, 61], [62, 45], [59, 119], [116, 90], [156, 198] and [373, 326], which correspond to the size of the bounding boxes and are fundamental for the correct training of our CNN. Moreover, we set anchor mask values of [[6, 7, 8], [3, 4, 5], [0, 1, 2]]. They were configured and fine-tuned based on the design and the dimension of the objects we want to detect. For this framework, we set a score = 0.4 and IOU = 0.2 for our tests in Sect. 6.1 and score = 0.6 and IOU = 0.3 for 6.2 and 6.3. YOLOv3-608 has 106 layers and it needs 140 billion flops per frame when having a tensor input of (608, 608) and 65.86 billion flops per frame when at (416, 416). The complexity is much higher, and the increased computation requirements cause a considerable drop in the average frame rate. That being said, it achieves significantly better results on the MS COCO dataset, having a mAP of 55.3 for a tensor input of 416 and 57.9 for a tensor input of 608. In our study, we use it with tensor inputs of 416 and 608.

3.2.3 YOLOv4-608 integration

For YOLOv4-608, we are using the same anchors and mask that we used on YOLOv3-608. For this framework, we set a score = 0.6 and IOU = 0.3 in all of our tests, where the score is the confidence percentage for the detection coming out of our CNN. Notice that we gradually increase our detection thresholds as we continue testing more complex and higher performing models. With YOLOv4, we now set our tensor input at 608 instead of 416, which we previously did for YOLOv3, and that is because we want to see how the framework will behave when aiming at the best possible detection and feature extraction. YOLOv4 achieves an mAP of 65.7% with an input tensor of (608, 608), which is significantly higher than the previous models.

3.3 Framework optimization

An important part of the proper real-time operation of our framework concerns a set of optimization procedures that were performed. We monitor the functionality of the framework on our systems with help from Intel's VTune software stack. We launch our application, and then we hook VTune to the corresponding process ID of our framework. We detected a bottleneck in the CPU section of our systems, indicating that the CPU cannot feed our GPU fast enough while also performing the necessary calculations for the tracking algorithm provided by Deep SORT. In Fig. 7, we can see that the single-threaded nature of the software on crucial functions causes issues and, if we were to multithread and batch our functions for video pre-processing, it would not meet the criteria for a real-time tracking algorithm, since it would not process every frame as it is created. Looking closer at the graph shown in Fig. 7, we see our primary thread failing to hold steady at 100% CPU time, which is caused by our GPU having to run our CNN on a per-frame basis, leaving the CPU idle.

The infrastructure used for our experiments is equipped with an NVIDIA GTX 1070 with 8 GB of VRAM, paired with 16 gigabytes of RAM and an Intel i7 6700K. We used the NVIDIA CUDA toolkit version 10.0 and Tensorflow-gpu version 1.14. Keras version 2.2.4 and the Python Anaconda distribution 2019.03 were the frameworks for the implementations.

Fig. 7 Spawned threads during execution of our framework using Intel VTune


For the experiments, the processor was set at 4.5 GHz and the graphics card was also locked at 2.1 GHz, while the training and validation data were kept on an NVMe drive to alleviate any potential storage bottlenecks. In Fig. 7, each row represents a thread spawned by Python for this framework. The CPU first has to preprocess a frame from the video input and then send it to the GPU for the object detection part of the framework. When the GPU is doing calculations, the CPU is idling while waiting for the GPU to send back the result. When the results get sent back to the CPU, it is time for the Deep SORT algorithm to take over and match each bounding box with the correct IDs. This is also executed on the CPU. After this process is completed, the CPU writes back to the frame the output from the model along with the correct IDs. This process is repeated until there are no more frames to process. The rest of the threads remain mostly idle and that is expected behavior, since we have not multithreaded the video pre-processing task or the CPU-side tasks of our Deep SORT framework.

Knowing that, by default, Python and TensorFlow installations are not compiled to make use of the more advanced SSE4.1/SSE4.2 and AVX instructions, we try to find ways to improve the performance by using optimized libraries for our system. Initially, Intel's Python packages for NumPy, SciPy and others were installed. These packages have improvements mainly from the use of SSE4.2, AVX and AVX2 instructions. However, the performance did not improve much, because these libraries were not hotspots in our code. After this, we started timing every part of our code and found that the video processing tasks, which were powered by the Pillow library, were taking a big part of our execution time. With the use of VTune to perform HPC profiling of our framework, as shown in Fig. 8, we can see that our primary thread can grow in terms of vectorization. Since we are not bandwidth bound, with just 1.2% of the time spent waiting for data from DRAM, we know that our memory subsystem is ready to handle an increase in the data flow. We installed and used the Pillow 6.0.0 SIMD AVX2 package, and we got an improvement that ranged from 10 to 22%. The percentage of the improvement depends on the number of cars detected per frame, the detector used and the video input resolution. This improved performance is presented in detail in the results of the experimental study.

In Table 1, we present some results from the tests, where we compare the generic SSE Pillow 6.0 version versus the AVX2-enabled Pillow version 6.0. The results indicate that, after the optimization procedures, we have a substantial improvement in the real-time operation and performance of the detection procedure. As illustrated in Table 1, the SIMD optimizations that were introduced on the framework have a substantial impact on the frames per second. We recorded the highest improvement (21.87% in the frames per second) in the crossroad video when running the YOLOv3-Tiny framework. As the input resolution goes up and the time spent for object detection goes down, we will experience more and more performance improvements from this.
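Since pillow-simd is a drop-in replacement for Pillow, switching to it does not change any code; only the installed package changes. The snippet below is a rough way to time the pre-processing hotspot before and after the switch, under the assumption that resizing 1080p frames to the 416 × 416 YOLO tensor dominates the CPU-side cost, as our profiling suggested.

```python
# pip uninstall pillow && pip install pillow-simd   (AVX2 build where available)
import time
import numpy as np
from PIL import Image

frame = Image.fromarray(np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8))

start = time.perf_counter()
for _ in range(200):
    resized = frame.resize((416, 416), Image.BICUBIC)        # per-frame CPU work
    tensor = np.asarray(resized, dtype=np.float32) / 255.0    # normalized model input
elapsed = time.perf_counter() - start
print(f"pre-processing throughput: {200 / elapsed:.1f} frames/s")
```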
4 Experimental study

In this section, we present the experimental study and the results collected. The experiments focus on the examination of the performance of the modified version of the Deep SORT tracking algorithm, and its performance is assessed against the performance of the original version of the algorithm. We examine its performance in a wide range of datasets and integrations with YOLO detectors and assess the performance of the YOLO multiple-object detection models trained on the MS-COCO and the UA-DETRAC datasets using transfer learning on the testing datasets.

Fig. 8 Gathering performance metrics for our framework using Intel VTune


Table 1 MOT performance comparison (frames per second before and after the SIMD optimization)

Configuration | Video resolution | Before SIMD (fps) | After SIMD (fps) | Improvement %
YOLOv3-Tiny on the crossroad video | 1080p | 12.8 | 15.6 | 21.87
YOLOv3-416 on the crossroad video | 1080p | 9.2 | 10.4 | 13.04
YOLOv4-608 on the crossroad video | 1080p | 6.7 | 7.8 | 16.41
YOLOv3-Tiny on the Racetrack video | 1080p | 12.8 | 15 | 17.18
YOLOv3-416 on the Racetrack video | 1080p | 9.2 | 10.5 | 14.13
YOLOv4-608 on the Racetrack video | 1080p | 6.7 | 7.9 | 17.91
YOLOv3-Tiny on the Straight road 480p video | 480p | 29.4 | 33.1 | 12.58
YOLOv3-416 on the Straight road 480p video | 480p | 15 | 16.6 | 10.66
YOLOv4-608 on the Straight road 480p video | 480p | 9.9 | 10.9 | 10.1

Based on our tests, we found that the UA-DETRAC trained YOLOv3-Tiny and YOLOv4-608 models were able to outperform the MS-COCO ones on average, in terms of execution speed and detection accuracy.

4.1 Datasets used for the training procedure

4.1.1 MS-COCO

The Microsoft Common Objects in COntext (MS-COCO) dataset contains 91 common object categories, with 82 of them having more than 5000 labeled instances. In total, the dataset has 2,500,000 labeled instances in 328,000 images. Additionally, a critical distinction between this dataset and others is the number of labeled instances per image, which may aid in learning contextual information. MS COCO contains considerably more object instances per image (7.7) as compared to ImageNet (3.0) and PASCAL (2.3). Utilizing over 70,000 working hours, a vast collection of object instances was gathered, annotated and organized to drive the advancement of object detection and segmentation algorithms. Emphasis was placed on finding non-iconic images of objects in natural environments and varied viewpoints. Dataset statistics indicate that the images contain rich contextual information with many objects present per image. We only briefly mention this dataset, because we used the weights created by the YOLO authors that were trained on the MS COCO dataset. We used the pre-trained weights of all the YOLO models trained on the MS-COCO dataset as described in [6]. The models are trained to detect 80 classes, and in the context of our experiments, we used the model's detections for the "car" class.

4.1.2 UA-DETRAC dataset

The UA-DETRAC dataset [12] was created by the University at Albany for comprehensive performance evaluation of MOT systems. The UA-DETRAC dataset consists of 100 videos, selected from over 10 h of image sequences acquired by a Canon EOS 550D camera at 24 different locations, which represent various traffic patterns and conditions including urban highways, traffic crossings and T-junctions. Notably, to ensure diversity, the creators captured the data at different locations with various illumination conditions and shooting angles. The videos are recorded at 25 frames per second (fps) with a JPEG image resolution of 960 × 540 pixels. More than 140,000 frames in the UA-DETRAC dataset are annotated with 8250 vehicles, and a total of 1.21 million bounding boxes of vehicles are labeled. The creators asked over 10 domain experts to annotate the collected data for more than two months. They also carried out several rounds of cross-checking to ensure high-quality annotations. The UA-DETRAC dataset is divided into training (UA-DETRAC-train) and testing (UA-DETRAC-test) sets, with 60 and 40 sequences, respectively. The creators selected training videos that are taken at different locations from the testing videos, but ensured the training and testing videos have similar traffic conditions and attributes. This setting reduces the chances of detection or tracking methods over-fitting to particular scenarios. The four classes are "car," "bus," "van" and "others." The vast majority of the dataset is labeled as "car." In Fig. 9, example cases from the datasets are presented, as well as an example case with the corresponding bounding boxes of the objects.

4.2 Datasets used for testing

In the context of the study, we employ nine scenes for assessing the performance of the methods and our modified version of Deep SORT. The nine scenes that were used are different from the datasets used for the training procedure of the models. Seven datasets out of the nine concern the tracking of vehicles, and two (MOT16 and MOT20) concern the tracking of pedestrians. For all test scenes used, we have generated ground truth files either from scratch or through existing data. We have created a new vehicle dataset from the videos we captured, consisting of 7 scenes, 11,025 frames and 25,193 bounding boxes.


Fig. 9 Example training instances from the datasets. On the left, diverse example cases with cars are illustrated, while on the right an example case with the corresponding bounding boxes is illustrated

Lastly, all the bounding box coordinates are given in top-left, width, height format and we also provide ground truth files in MOT16 format for the "Rural road dusk" scene.

The Crossroad [20] is a publicly available car traffic video, which is 3 min and 31 s long. The base resolution is 1080p with a 16:9 aspect ratio, and it runs at just 10 frames per second. This video has been captured from a road traffic camera.

The "Straight road" and "Racetrack" datasets are captured from a racing simulator called Assetto Corsa. The datasets were created by us with a resolution of 1080p and at 60 frames per second. We encode the same file down to 480p and 60 frames per second for further testing. These captures from a video game are utilized to take full control over the test scene and avoid recording artifacts, while simultaneously being able to capture a lossless and high-resolution file. Also, it is worth noting that the "Racetrack" video has higher rates of identity switches, due to the higher occlusion rate from having more cars close to each other on a per-frame basis.

Furthermore, we created two new testing scenes captured by a drone of our team, which we name "Rural road" and "Rural road dusk." They were created using real-world traffic from a public road. The Rural road video scene concerns a public road on a sunny day. We have filmed approximately 15 min of public road traffic using the DJI Phantom 3 drone at 1080p and 25 fps. We cut down the video to approximately 2 min, by taking out the parts where there was no traffic. The video incorporates a balance between cars, large vans and pickup trucks. The "Rural road dusk" video concerns the same public road at dusk under different lighting conditions. Specifically, we recorded the traffic of the same road one hour before dusk. The camera was facing the sun, and the cars were generating shadows on the road. So, the Rural road dusk scene is of quite higher difficulty. All the datasets are publicly available via the GitHub account of our team.

The MOT16 [21] is a widely used dataset for object tracking procedures. We used the "MOT16-09" scene, which is captured outdoors, facing a sidewalk from a low angle. It is a 30-frames-per-second video at 1080p resolution and has a duration of 18 s. The ground truth for the tracks in this video is 25.

The MOT20 [22] is another widely used dataset for object tracking procedures, and we used the "MOT20-01" scene. The scene is captured indoors in a crowded train station. It is a quite challenging scene and comes as a 25-frames-per-second video at 1080p resolution. Its duration is 17 s, and its ground truth for the tracks in this video is 90.

4.3 Results

In this subsection, we present the results of the experimental study. We present the performance results of the methods examined and the modified version of Deep SORT. The experimental results are structured in two parts: the first concerns the performance of the methods when the YOLO detectors are trained on MS-COCO, and the second when they are trained on the DETRAC dataset.

We rank these frameworks based on the results we get from a wide variety of metrics. The first one is the Deep SORT Tracks Initiated metric. The closer this metric is to the ground truth, the better the performance of the tracking is. A number greatly higher than the ground truth usually shows that the detector struggles to keep track of a certain object across the scene. Every time the detector fails, there is a chance that a new initiated track is created if the object gets detected in future frames. The second one is the modified Deep SORT Count metric, which is the number of confirmed tracks for the scene, based on the modification performed to take into consideration a set of previous frame detections.


Moreover, we provide recall, precision, F1 score and a confusion matrix for the TP, TN, FN metrics to evaluate detector performance along with tracking performance. Lastly, we also provide MOTA and MOTP scores when available. The MOTP metric is the total position error for matched object-hypothesis pairs over all frames, averaged by the total number of matches made. It shows the ability of the tracker to estimate precise object positions, independent of its skill at recognizing object configurations, keeping consistent trajectories and more. MOTA accounts for all object configuration errors made by the tracker (false positives, misses, mismatches) over all frames. It gives a very intuitive measure of the tracker's performance at keeping accurate trajectories, independent of its precision in estimating object positions. To evaluate detector performance, we use a slightly modified version (to include all the metrics we wanted) of the tool used and created by [36], an open-source evaluator.
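For completeness, the detection metrics reported in the tables follow the usual definitions computed from the TP/FP/FN counts, and the MOTA/MOTP descriptions above correspond to the standard CLEAR MOT formulas; we restate them here (the matching thresholds applied by the evaluator of [36] are not repeated):

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t}, \qquad \mathrm{MOTP} = \frac{\sum_{i,t} d_{i,t}}{\sum_t c_t}$$

where FN_t, FP_t and IDSW_t are the misses, false positives and identity switches in frame t, GT_t is the number of ground-truth objects in frame t, d_{i,t} is the position error of matched pair i in frame t and c_t is the number of matches in frame t.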
4.3.1 Results with optimized detection models trained on MS-COCO

Here, we present the results of our framework when the YOLO detectors are trained on MS-COCO. Initially, we present the performance when YOLOv3-Tiny is used as the detector and, after that, the performance when YOLOv3 and YOLOv4 are used.

4.3.1.1 YOLOv3-Tiny In Table 2, the results of the Deep SORT and the modified Deep SORT using YOLOv3-Tiny trained on MS-COCO as detector are presented. A first point concerns the tracking performance in the "Racetrack" video, which is poor due to the subpar detection performance of YOLOv3-Tiny. Looking at the initiated tracks metric of the Deep SORT (85), we can tell that this detection model consistently failed to hold track of the objects it detected, which is also shown by the high number of false negatives. The closer this metric is to the ground truth, the better the performance of the tracking is. A number greatly higher than the ground truth usually shows that the detector struggles to keep tracking a certain object across the scenes. Having said that, the Deep SORT algorithm did a decent job at mitigating this issue, as seen from the Deep SORT count metric.

It is worth noting that trying to track on a 1080p source only gets us 15 fps using this setup. This may not be viable for real-time tracking. Looking at the results for the Straight road at 480p, the average frame rate is 33.1 FPS, and for Racetrack at 480p, it is 35.2 FPS. This indicates that the framework is quite suitable for real-time tracking. In the Crossroad video, massive amounts of identity switches were experienced due to the extremely low frame rate of the source, which, in turn, caused big gaps from frame to frame for the bounding boxes. This makes the trajectory estimation algorithm fail often.

Lastly, in the "Rural road" and "Rural road dusk" videos the results indicate poor tracking performance due to the poor detection exhibited by YOLOv3-Tiny, as shown by the very poor recall and F1 score. There are many detection failures, as pointed out by the significantly higher initiated tracks metric compared to the Deep SORT count for every scene except for the "Straight road" 1080p and 480p. On the Racetrack dataset, where the ground truth was 23 cars, we measured 31 on both resolutions, and the tracks initiated for both tests were close to 86 for the 480p video and to 85 when tested at 1080p, which indicates a large number of detection failures. The tracking of the modified Deep SORT in the "Straight road" is lower than the ground truth (6 vs 9). This is because the cars in the back are not detected by this YOLO model in time. The tracking for the cars that were detected is excellent.

Table 2 MOT results on the YOLOv3-Tiny-enabled framework

YOLOv3-Tiny | Frames per second | Deep SORT Tracks Initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
Crossroad-Init = 4 | 15.6 | 263 | 104 | 92 | 0.944 | 0.375 | 0.537 | 4211 | 246 | 6990
Crossroad-Init = 7 | 15.6 | 269 | 80 | 92 | 0.944 | 0.375 | 0.537 | 4211 | 246 | 6990
Straight road 1080p | 15.1 | 10 | 6 | 9 | 1 | 0.436 | 0.608 | 187 | 0 | 241
Straight road 480p | 33.1 | 12 | 6 | 9 | 1 | 0.483 | 0.651 | 207 | 0 | 221
Racetrack | 15 | 85 | 31 | 23 | 0.997 | 0.801 | 0.889 | 3486 | 8 | 862
Racetrack 480p | 35.2 | 86 | 31 | 23 | 0.996 | 0.828 | 0.905 | 3604 | 12 | 744
Rural road | 15.1 | 220 | 56 | 44 | 0.99 | 0.447 | 0.616 | 1321 | 13 | 1632
Rural road dusk | 16 | 102 | 18 | 24 | 0.965 | 0.4 | 0.565 | 595 | 21 | 892


Fig. 10 Precision–recall curve for YOLOv3-Tiny
Fig. 11 YOLOv3-Tiny-enabled framework on the crossroad video
Fig. 12 YOLOv3-Tiny-enabled framework on the Racetrack video

In Fig. 10, we provide the precision–recall curve for all scenes tested and in Figs. 11, 12, 13 and 14, example detection frames from the YOLOv3-Tiny framework are illustrated. As seen in Fig. 11, YOLOv3-Tiny has trouble detecting the cars in the distance, something that results in numbering errors by a considerable amount. In Fig. 12, even with the modified Deep SORT algorithm now properly displaying the tracks, we still see skipped IDs, which is caused by the red car in the front, which YOLOv3-Tiny has trouble detecting consistently. This causes issues for Deep SORT in marking it as a confirmed track. In Figs. 13 and 14, the same problem is illustrated. The cars in the distance at the back cannot be properly detected and the car numbered "2" has been eliminated, because of the excessive identity switches.


Fig. 13 YOLOv3-Tiny-enabled framework on the Straight road 1080p video
Fig. 14 YOLOv3-Tiny-enabled framework on the Straight road 480p video

Table 3 MOT results on the YOLOv3-416-enabled framework

YOLOv3-416 | Frames per second | Deep SORT Tracks Initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
Crossroad-Init = 4 | 10.3 | 411 | 150 | 92 | 0.892 | 0.823 | 0.856 | 9226 | 1110 | 1975
Crossroad-Init = 7 | 10.4 | 451 | 103 | 92 | 0.892 | 0.823 | 0.856 | 9226 | 1110 | 1975
Straight road 1080p | 10.3 | 10 | 8 | 9 | 0.876 | 0.759 | 0.813 | 325 | 46 | 103
Straight road 480p | 16.6 | 10 | 7 | 9 | 0.829 | 0.626 | 0.713 | 268 | 55 | 160
Racetrack | 10.5 | 35 | 24 | 23 | 0.985 | 0.919 | 0.951 | 3999 | 58 | 349
Racetrack 480p | 16.9 | 46 | 24 | 23 | 0.988 | 0.893 | 0.938 | 3887 | 47 | 461
Rural road | 10.6 | 83 | 45 | 44 | 0.877 | 0.894 | 0.885 | 2640 | 368 | 313
Rural road dusk | 10.9 | 58 | 23 | 24 | 0.937 | 0.802 | 0.865 | 1194 | 79 | 293

We can also notice that the lower video input resolution did not affect the detection process.

4.3.1.2 YOLOv3-416 In Table 3, the results of the Deep SORT and the modified Deep SORT using YOLOv3-416 as detector are presented. The results show that the detection performance on the "Straight road" video and on the more complex "Racetrack" video is significantly better than with YOLOv3-Tiny. The increased detection performance, as seen in the recall and F1 score metrics, allows the Deep SORT framework to perform even better. The performance in the "Crossroad" video is low, mainly because of the low-resolution and low-frame-rate video captured by the CCTV. The results also point out that now we experience more identity switches, as seen from the tracks initiated by Deep SORT (411 and 451, respectively). The reason for this is the fact that YOLOv3-416 is better at detecting hard-to-see cars compared to YOLOv3-Tiny. The results also show that the modified Deep SORT algorithm performed quite well and produced quite good tracking, counting 150 (vs 411) and 103 (vs 451) cars, respectively. On the Racetrack scene, we noticed significantly fewer initiated tracks, because this scene has a clear view of the cars. This allowed the much-improved YOLOv3-416 to keep track of the initiated objects. We now also notice near perfect performance in the Rural road videos, which is attributed to the much better detection performance of YOLOv3-416 over YOLOv3-Tiny. The Deep SORT count is off by 1 compared to the ground truth and the initiated tracks are significantly lower compared to YOLOv3-Tiny.


In Fig. 15, we provide the precision–recall curve for all scenes tested and in Figs. 16, 17, 18 and 19, example detection frames from the YOLOv3-416 framework are illustrated. In Fig. 16, we can see that YOLOv3-416 can now detect cars that are far in the back distance. In Figs. 17 and 18, we see the same; the cars in the back distance are now detected and tracked properly. However, we still experience an identity switch even with this improved detection performance on both video inputs. We can also notice that the lower video input resolution did not greatly affect the detection process, since we only saw a tiny increase in detection performance for the cars that were furthest away. Finally, in Fig. 19, we can see proper detection and tracking performance for this part of the test. The increased accuracy of YOLOv3-416 over YOLOv3-Tiny is noticeable and provided better tracking performance as illustrated above.

4.3.1.3 YOLOv4 In Table 4, the results of the Deep SORT and the modified Deep SORT using the YOLOv4-608 detector are presented.

Fig. 15 Precision–recall curve for YOLOv3-416
Fig. 16 YOLOv3-416-enabled framework on the crossroad video
Fig. 17 YOLOv3-416-enabled framework on the Straight road 1080p video
Fig. 18 YOLOv3-416-enabled framework on the Straight road 480p video
Fig. 19 YOLOv3-416-enabled framework on the Racetrack video

Table 4 MOT results on the YOLOv4-608-enabled framework

YOLOv4-608 | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
Crossroad, Init = 4 | 7.8 | 396 | 91 | 92 | 0.939 | 0.8 | 0.864 | 8966 | 578 | 2235
Crossroad, Init = 7 | 7.8 | 434 | 66 | 92 | 0.939 | 0.8 | 0.864 | 8966 | 578 | 2235
Straight road 1080p | 7.7 | 20 | 10 | 9 | 1 | 0.899 | 0.947 | 385 | 0 | 43
Straight road 480p | 10.9 | 12 | 7 | 9 | 0.996 | 0.719 | 0.835 | 308 | 1 | 120
Racetrack | 7.9 | 44 | 24 | 23 | 0.997 | 0.973 | 0.985 | 4234 | 10 | 114
Racetrack 480p | 11.3 | 37 | 25 | 23 | 0.994 | 0.965 | 0.98 | 4200 | 22 | 148
Rural road | 7.8 | 90 | 48 | 44 | 0.959 | 0.92 | 0.939 | 2718 | 115 | 235
Rural road dusk | 8.1 | 59 | 24 | 24 | 0.952 | 0.841 | 0.893 | 1252 | 63 | 235
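The precision, recall and F1 score columns in these tables follow the standard detection definitions and can be recomputed directly from the reported TP, FP and FN counts. A minimal Python sketch, using the Racetrack row of Table 4 as a check:

def detection_metrics(tp, fp, fn):
    # Standard detection metrics: precision, recall and their harmonic mean (F1).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Racetrack row of Table 4: TP = 4234, FP = 10, FN = 114
p, r, f1 = detection_metrics(4234, 10, 114)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.998 0.974 0.986

The output agrees with the tabulated 0.997, 0.973 and 0.985 up to rounding.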

4.3.1.3 YOLOv4 In Table 4, the results of the Deep SORT and the modified Deep SORT using the YOLOv4-608 detector are presented. We again see that the detection performance on the "Straight road" video is good, and the performance on the more complex "Racetrack" video is significantly better than with YOLOv3-Tiny and roughly equal to YOLOv3-416. The performance on the "Crossroad" video is good enough when we use an n_init value of 4, something that is necessary because of the low resolution and frame rate of the video captured by the CCTV. It is the only way to track most of the cars, since a lot of them are visible for fewer than 7 frames. The increased performance of YOLOv4-608 is clearly visible in this instance. Lastly, it is worth noting that YOLOv4-608 is noticeably slower than YOLOv3-416, but not by a large amount. The increased detection performance is worth the cost of a few FPS, as we witness an uplift in all detection performance metrics compared to YOLOv3-416. Looking at the tracks initiated by Deep SORT, we once more see a lot more initiated tracks than expected on the crossroad video, which is caused mainly by the poor video quality. A small problem we noticed with YOLOv4 is that it sometimes detected cars in places where there were none. That happened for only one frame, so the Deep SORT algorithm was able to exclude that result without trouble. It is also worth noting that we now have a tensor input resolution of (608, 608), so drops in detection accuracy are more noticeable on the 480p videos. The results on "Straight road 480p" and "Racetrack 480p" indicate a drop in detection performance in both cases. Performance on the Rural road videos remains good and significantly better than YOLOv3-Tiny.
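The n_init behaviour discussed above is the core of the modified initialization: a track is only treated as a real, countable object once its detections have persisted for n_init frames. The following is a simplified illustrative sketch of that confirmation logic in Python; it is not the exact implementation used in the framework, and the class and attribute names are placeholders:

class Track:
    def __init__(self, track_id, n_init):
        self.track_id = track_id
        self.n_init = n_init          # frames required before the track is trusted
        self.hits = 0                 # consecutive frames with an associated detection
        self.confirmed = False

    def update(self, matched):
        # Called once per frame after data association.
        if matched:
            self.hits += 1
            if self.hits >= self.n_init:
                self.confirmed = True  # only confirmed tracks are displayed and counted
        elif not self.confirmed:
            self.hits = 0              # tentative tracks are reset (or deleted) on a miss

# Low-frame-rate CCTV footage: objects may be visible for fewer than 7 frames,
# so a smaller n_init (e.g. 4) is needed to confirm them at all.

Roughly speaking, counting the tracks that ever reach the confirmed state corresponds to the "Modified Deep SORT" column in the tables, while "Deep SORT tracks initiated" counts every tentative track that was ever created.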


Fig. 20 Precision–recall curve for YOLOv4-608
Fig. 21 YOLOv4-608-enabled framework on the crossroad video
Fig. 22 YOLOv4-608-enabled framework on the Straight road 1080p video
Fig. 23 YOLOv4-608-enabled framework on the Straight road 480p video
Fig. 24 YOLOv4-608-enabled framework on the Racetrack video

Table 5 MOT comparison on the YOLOv4-608-enabled framework using the "Racetrack 480p" video

YOLOv4 | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
MS-COCO | 11.3 | 37 | 25 | 23 | 0.994 | 0.965 | 0.98 | 4200 | 22 | 148
UA-DETRAC | 12.4 | 31 | 23 | 23 | 0.999 | 0.953 | 0.975 | 4144 | 3 | 204

In Fig. 20, we provide the precision–recall curve for all scenes tested, and in Figs. 21, 22, 23 and 24, we present example cases from the detection frames of YOLOv4-608. As seen in Fig. 21, YOLOv4-608 can now detect cars that are far away, and we can keep tracking them without having to be close to the camera. In Figs. 22 and 23, we do see good tracking. However, at that point of the video, the cars further away cannot be properly identified. Once the cars get closer to the camera, the detection and the tracking are excellent. We can also note that this time the lower video input resolution affected the detection process a bit more than previously, mainly because we have a tensor input of (608, 608) while the vertical resolution of the video was 480 pixels. Small cars were hard to detect properly from the reduced image information. Finally, in Fig. 24, we once more see good tracking and detection performance by YOLOv4-608.

4.3.2 Results with optimized detection models trained on UA-DETRAC

In this part of the experimental study, we examine the performance of the modified Deep SORT when integrated with YOLO detectors that are trained on the UA-DETRAC dataset. In the Racetrack and Rural road videos we use n_init = 7, and in the Crossroad video n_init = 4, due to the lower frame rate of that video.

4.3.2.1 Results on YOLOv4 In Table 5, the results of the Deep SORT and the modified Deep SORT using YOLOv4 trained on UA-DETRAC as detector are presented. For the training procedure, we measured an average loss of 1.583 and a mAP of 98.68%. The results are quite good and can facilitate the good performance of the Deep SORT framework.

Starting with the "Racetrack 480p" scene, we achieve perfect numbering and tracking across the whole test scene with our YOLOv4 detector trained on the UA-DETRAC dataset. We also notice an increase in execution performance of approximately 10%, as seen in Table 5. This is due to the simplification of the YOLOv4 network, since we train it on just one class. Paying attention to the "Tracks Initiated" metric, we see that YOLOv4-608 works best on this test scene when trained on the UA-DETRAC dataset, with a perfect Deep SORT count and far fewer initiated tracks than the MS-COCO one, which failed to track well during a mild occlusion phase and had trouble detecting some of the vehicles.

In Fig. 25, we provide the precision–recall curve, and in Fig. 26 we now see perfect numbering and tracking of all cars in that particular frame. All other detectors we tested failed to achieve this performance.

Fig. 25 Precision–recall curve for YOLOv4-608 comparison on the Racetrack 480p video
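Part of the speed-up from single-class training comes from the smaller YOLO detection heads: in Darknet-style configurations, the convolutional layer feeding each YOLO layer has (classes + 5) x 3 filters, so collapsing the 80 MS-COCO classes into one vehicle class shrinks those output tensors considerably. The following small calculation illustrates this convention; it is not taken from the paper's configuration files:

def yolo_head_filters(num_classes, anchors_per_scale=3):
    # Darknet YOLOv3/v4 convention: each predicted box carries
    # 4 box coordinates + 1 objectness score + one score per class.
    return anchors_per_scale * (num_classes + 5)

print(yolo_head_filters(80))  # 255 filters per detection head (MS-COCO model)
print(yolo_head_filters(1))   # 18 filters per detection head (single vehicle class)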

Fig. 26 YOLOv4-608 UA-DETRAC-enabled framework on the Racetrack 480p video
Table 6 MOT comparison on the YOLOv4-608-enabled framework using the "Crossroad" video

YOLOv4 | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
MS-COCO | 7.8 | 396 | 91 | 92 | 0.939 | 0.8 | 0.864 | 8966 | 578 | 2235
UA-DETRAC | 8.5 | 291 | 87 | 92 | 0.997 | 0.694 | 0.818 | 7779 | 23 | 3422

Fig. 27 Precision–recall curve for YOLOv4-608 comparison on the crossroad video
Fig. 28 Precision–recall curve for YOLOv4-608 comparison on the Rural road video

In Table 6, we see that on the "Crossroad" video we have an increase in execution performance of approximately 10%. The results show that the UA-DETRAC-trained YOLOv4 detector reports better performance compared to the MS-COCO one. While the initiated tracks are significantly closer to the real counts, we do see a worse Deep SORT count for the UA-DETRAC-trained YOLOv4. We notice a much better tracking process and much higher detection accuracy for big trucks and vans compared to the MS-COCO-trained YOLOv4. A major benefit of the UA-DETRAC dataset is the wide variety of vehicles that it includes. The MS-COCO dataset is lacking in that department, and its inability to properly train our models to detect big vehicles became apparent in this test scene. Having said that, we did notice that the UA-DETRAC-trained YOLOv4 performed worse at detecting cars that are further away, and this is also shown by the false negative metric. Lastly, in Fig. 27 the precision–recall curve is provided.
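One practical difference between the two detectors is how vehicle detections have to be collected. The MS-COCO model spreads road vehicles over several class indices ("car", "bus", "truck"), which must be filtered and merged before tracking, whereas the UA-DETRAC model emits a single vehicle class. A hedged sketch of the MS-COCO-side filtering; the class indices follow the standard 80-class COCO list used by YOLO, but the exact filtering in the paper's framework may differ:

# Standard 80-class COCO indices for the road-vehicle categories.
VEHICLE_CLASS_IDS = {2: "car", 5: "bus", 7: "truck"}

def keep_vehicles(detections):
    # detections: iterable of (class_id, confidence, box) tuples from the detector.
    return [(cls, conf, box) for cls, conf, box in detections
            if cls in VEHICLE_CLASS_IDS]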

Table 7 MOT comparison on the YOLOv4-608-enabled framework using the "Rural road" video

YOLOv4 | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
MS-COCO | 7.8 | 90 | 48 | 44 | 0.959 | 0.92 | 0.939 | 2718 | 115 | 235
UA-DETRAC | 8.4 | 82 | 46 | 44 | 0.994 | 0.935 | 0.963 | 2762 | 16 | 191


Table 8 MOT comparison on the YOLOv4-608-enabled framework using the "Rural road dusk" video

YOLOv4 | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
MS-COCO | 8.1 | 59 | 24 | 24 | 0.952 | 0.841 | 0.893 | 1252 | 63 | 235
UA-DETRAC | 8.7 | 48 | 25 | 24 | 0.962 | 0.944 | 0.953 | 1405 | 55 | 82

Fig. 31 MS-COCO on the Rural road dusk video

Fig. 29 Precision–recall curve for YOLOv4-608 comparison on the Rural road dusk video

Table 9 MOT metrics on the YOLOv4-608-enabled framework using the "Rural road dusk" video

YOLOv4 | MOTA | MOTP
MS-COCO | 58.4 | 81.1
UA-DETRAC | 71.1 | 85.7
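The MOTA and MOTP values in Table 9 (and in the later MOT tables) follow the standard CLEAR MOT definitions: MOTA aggregates misses, false positives and identity switches relative to the number of ground-truth objects, while MOTP averages the localization overlap of the matched boxes. A simplified illustrative sketch of those formulas (not the evaluation code used for the paper):

def clear_mot(num_misses, num_false_positives, num_switches, num_gt_objects,
              total_overlap, num_matches):
    # MOTA: 1 - (FN + FP + identity switches) / total ground-truth objects over all frames.
    mota = 1.0 - (num_misses + num_false_positives + num_switches) / num_gt_objects
    # MOTP: mean overlap (e.g. IoU) over all matched ground-truth/hypothesis pairs.
    motp = total_overlap / num_matches
    return 100 * mota, 100 * motp  # reported as percentages, as in the tables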
In Table 7, we notice an increase in execution performance of approximately 10% on the "Rural road" video. The UA-DETRAC-trained YOLOv4 once more showed better performance compared to the MS-COCO one, as seen in all the performance metrics we provide, including Fig. 28. While the initiated tracks are significantly closer to the real numbers, we also see better results in the modified Deep SORT count metric. This test scene is considered of medium difficulty. We did notice much better tracking and much higher detection accuracy for big trucks and vans compared to the MS-COCO-trained YOLOv4. Also, this test has a few frames where heavy occlusions take place. The MS-COCO model exhibited slightly worse detection performance, which caused a few more identity switches during easy parts of the scene.

In Table 8, the results show an increase in execution performance of 7%. The UA-DETRAC-trained YOLOv4 once more showed better performance compared to the MS-COCO one. While the initiated tracks are significantly closer to the real numbers, we see that the Deep SORT count is off by just one for our custom framework, while the MS-COCO one matches the ground truth exactly. The results show better tracking and much higher detection accuracy compared to the MS-COCO-trained YOLOv4. The MS-COCO model exhibited slightly worse detection performance, which caused a few more identity switches during easy parts of the scene (Fig. 29).
Fig. 30 UA-DETRAC on the Rural road dusk video


Table 10 MOT comparison on the YOLOv3-608-enabled framework using the "Racetrack 480p" video

YOLOv3 | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
MS-COCO | 12.4 | 55 | 24 | 23 | 0.996 | 0.921 | 0.957 | 4005 | 14 | 343
UA-DETRAC | 14.1 | 58 | 23 | 23 | 0.999 | 0.77 | 0.869 | 3348 | 1 | 1000

Fig. 32 Precision–recall curve for YOLOv3-608 comparison on the Racetrack 480p video
Fig. 33 Precision–recall curve for YOLOv3-608 comparison on the crossroad video

The UA-DETRAC framework measured one above ground truth, because once more there was a car carrying a trailer in the video, which both models detected as a vehicle. The reason why MS-COCO managed to match ground truth is shown in Figs. 30 and 31: the MS-COCO framework completely failed to detect the truck shown in the pictures. As seen in Table 9, the UA-DETRAC-enabled YOLOv4 scored significantly better in the MOTA and MOTP metrics.

4.3.2.2 Results on YOLOv3 In Table 10, the results of the Deep SORT and the modified Deep SORT using YOLOv3 trained on UA-DETRAC as detector are presented. For the training procedure on the UA-DETRAC dataset, the tensor input was set at 416 for 8000 batches, and we measured an average loss of 0.823 and a mAP of 96.31%. The results show an increase in execution performance of approximately 15% on the "Racetrack 480p" video. This is due to the same reasons we described for YOLOv4. We again notice that the UA-DETRAC-enabled YOLO is better in this test scene compared to the MS-COCO one, which had a small mishap. The Deep SORT count is now perfect, since we once more achieved perfect tracking. However, YOLOv3, due to its worse overall mAP performance, exhibits more initiated tracks compared to YOLOv4. Looking at the detection metrics, we can tell that the UA-DETRAC-trained YOLO had trouble detecting the cars at times, but what made it score better in tracking was its detection consistency once detection occurred. We also provide the precision–recall curve in Fig. 32.

In Table 11, we notice an increase in execution performance of approximately 15% on the "Crossroad" video. The UA-DETRAC-trained YOLOv3, while providing a Deep SORT count closer to ground truth compared to the MS-COCO one, has significantly worse detection performance during this test, as seen in the precision, recall and F1 score metrics (Fig. 33). This test scene is considered of high difficulty due to the extremely low frame rate and video noise.

Table 11 MOT comparison on the YOLOv3-608-enabled framework using the "Crossroad" video

YOLOv3 | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
MS-COCO | 8.2 | 312 | 108 | 92 | 0.929 | 0.868 | 0.897 | 9729 | 740 | 1472
UA-DETRAC | 9.5 | 217 | 89 | 92 | 0.995 | 0.429 | 0.599 | 4807 | 24 | 6394


Table 12 MOT comparison on the YOLOv3-608-enabled framework using the "Rural road" video

YOLOv3 | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
MS-COCO | 8.4 | 93 | 48 | 44 | 0.952 | 0.911 | 0.931 | 2692 | 133 | 261
UA-DETRAC | 9.3 | 120 | 46 | 44 | 0.991 | 0.527 | 0.688 | 1557 | 14 | 1396

Moving on to the "Rural road" video, which is now tested using YOLOv3, in Table 12 we notice an increase in execution performance of roughly 10%. We also notice once again that the UA-DETRAC-enabled YOLO detector has worse detection performance in this test scene compared to the MS-COCO-trained one, but it was able to detect vehicles more consistently once detection started to occur. The Deep SORT tracking counts are closer to ground truth, but we again see more initiated tracks compared to YOLOv4, which in this case confirms that it misses a lot of detections, as is also shown in Fig. 34. Lastly, we notice slightly higher frame rates compared to the YOLOv4 detectors.

Fig. 34 Precision–recall curve for YOLOv3-608 comparison on the Rural road video
In both UA-DETRAC-trained models, we noticed much more consistent detections of trucks, buses and vans. To illustrate the point, we attach two screenshots from the detection output: in Fig. 35, the MS-COCO-trained model fails to detect the vans, while in Fig. 36, the UA-DETRAC model provides consistent tracking of them.

Fig. 35 MS-COCO framework on the Rural road video
Fig. 36 UA-DETRAC framework on the Rural road video

Lastly, the results on the Rural road dusk video are presented in Table 13. We notice an increase in execution performance of roughly 15%. The UA-DETRAC-trained YOLOv3 detector showed worse performance compared to the MS-COCO one. YOLOv3 exhibited more initiated tracks and counted significantly fewer cars compared to YOLOv4. This time we notice better overall tracking when using the MS-COCO-trained YOLO, even though the UA-DETRAC YOLO could detect trucks better, and all metrics show this. The UA-DETRAC model had worse detection performance, which caused more identity switches even during easy parts of the scene. The lighting conditions of this scene proved troublesome for the UA-DETRAC-trained model, as is also seen in the precision–recall curve (Fig. 37). In Table 14, we can also see the significantly better MOTA and MOTP scores that the MS-COCO-trained YOLOv3 achieved compared to the UA-DETRAC one.


Table 13 MOT comparison on the YOLOv3-608-enabled framework using the "Rural road dusk" video

YOLOv3 | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
MS-COCO | 8.7 | 57 | 23 | 24 | 0.96 | 0.809 | 0.878 | 1204 | 50 | 283
UA-DETRAC | 9.7 | 64 | 18 | 24 | 0.967 | 0.317 | 0.477 | 472 | 16 | 1015

Table 14 MOT metrics on the YOLOv3-608-enabled framework using the "Rural road dusk" video

YOLOv3 | MOTA | MOTP
MS-COCO | 59.1 | 71
UA-DETRAC | 19.4 | 67.6

Fig. 38 Precision–recall curve for YOLOv3-Tiny comparison on the Racetrack 480p video

Fig. 37 Precision–recall curve for YOLOv3-608 comparison on the Rural road dusk video

4.3.2.3 Results on YOLOv3-Tiny In Table 15, the results of the Deep SORT and the modified Deep SORT using YOLOv3-Tiny trained on UA-DETRAC as detector are presented. During training, we measured an average loss of 0.762 and a mAP of 96.32%. This time, during testing we keep the model input size at (416, 416) instead of (608, 608) to further increase throughput. The results show a great increase in execution performance: the UA-DETRAC-powered YOLOv3-Tiny is approximately 50% faster when tested on the "Racetrack" video. We notice that the UA-DETRAC-enabled YOLO is better in this test scene compared to the MS-COCO one regarding the tracks-initiated metric, but the modified Deep SORT count of the MS-COCO one is closer to the ground truth. Overall, the detection performance of the UA-DETRAC model was slightly better, as shown by the recall, the F1 score and Fig. 38.
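The (416, 416) versus (608, 608) trade-off mentioned above is purely a runtime choice of the network's input tensor size. A hedged sketch of how such a Darknet-trained detector is typically driven through OpenCV's DNN module follows; the file names and thresholds are placeholders, not the paper's exact settings:

import cv2
import numpy as np

# Hypothetical paths; substitute the actual cfg/weights produced by training.
net = cv2.dnn.readNetFromDarknet("yolov3-tiny-vehicle.cfg", "yolov3-tiny-vehicle.weights")
layer_names = net.getUnconnectedOutLayersNames()

def detect(frame, input_size=(416, 416), conf_thresh=0.5, nms_thresh=0.45):
    # Smaller input sizes trade detection accuracy for throughput.
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, input_size, swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(layer_names)

    h, w = frame.shape[:2]
    boxes, confidences = [], []
    for output in outputs:
        for det in output:
            score = float(det[5:].max())
            if score >= conf_thresh:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                confidences.append(score)
    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_thresh, nms_thresh)
    return [boxes[i] for i in np.array(keep).flatten()]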

Table 15 MOT comparison on the YOLOv3-Tiny-enabled framework using the "Racetrack 480p" video

YOLOv3-Tiny | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
MS-COCO | 35.7 | 136 | 29 | 23 | 0.999 | 0.585 | 0.738 | 2544 | 2 | 1804
UA-DETRAC | 54.3 | 99 | 31 | 23 | 0.999 | 0.591 | 0.743 | 2572 | 2 | 1776


Table 16 MOT comparison on the YOLOv3-Tiny-enabled framework using the "Crossroad" video

YOLOv3-Tiny | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
MS-COCO | 15.7 | 257 | 97 | 92 | 0.999 | 0.258 | 0.41 | 2895 | 1 | 8306
UA-DETRAC | 18 | 225 | 90 | 92 | 0.982 | 0.374 | 0.542 | 4193 | 73 | 7008

Fig. 39 Precision–recall curve for YOLOv3-Tiny comparison on the crossroad video
Fig. 40 Precision–recall curve for YOLOv3-Tiny comparison on the Rural road video

In Table 16, the results show a great increase in execution performance, of approximately 15%, on the "Crossroad" video. The UA-DETRAC-trained YOLOv3-Tiny is significantly better compared to the MS-COCO one, and all metrics show this. The improved tracking performance is attributed to the improved detection performance, as seen in the significantly better F1 score and precision–recall curve (Fig. 39).

Moving on to the "Rural road" video, we again see an increase in execution performance of roughly 15%. We again notice that the UA-DETRAC-enabled YOLO is better in this test scene compared to the MS-COCO one, by a large margin, as indicated by the detection performance metrics. In Table 17, the results show that both models performed poorly in this test, given how far they are from the ground truth. YOLOv3-Tiny failed numerous times, and even though its execution performance is nearly doubled, the loss in accuracy in this test is substantial (Fig. 40).

Lastly, the results on the Rural road dusk video are presented in Table 18. We notice an increase in execution performance of roughly 15%. This time the UA-DETRAC-trained YOLOv3-Tiny showed better results compared to the MS-COCO one. The MS-COCO-trained YOLOv3-Tiny consistently failed to keep track of cars, and this is shown by the modified Deep SORT count metric. The UA-DETRAC model had much better detection performance, although still poor, and that helped the Deep SORT count metric to be closer to ground truth. Tracking performance remains significantly worse when compared to YOLOv4, as seen from the high tracks-initiated metric of 73, and it is clear that YOLOv3-Tiny is not suitable for accurate tracking and numbering of cars. Here, we can also see the improvement in tracking performance through the MOTA and MOTP metrics in Table 19. While there is an improvement, it is once more obvious that tracking was poor, as is also seen in Fig. 41.

In conclusion, the use of the UA-DETRAC dataset assisted in creating an overall better framework for car traffic, especially when YOLOv4 is used as the detector. That framework reports the best tracking performance, while also being slightly faster compared to its MS-COCO counterpart.

Table 17 MOT comparison on the YOLOv3-Tiny-enabled framework using the "Rural road" video

YOLOv3-Tiny | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
MS-COCO | 15.2 | 211 | 24 | 44 | 0.994 | 0.228 | 0.371 | 675 | 4 | 2278
UA-DETRAC | 17.4 | 172 | 64 | 44 | 0.997 | 0.594 | 0.745 | 1757 | 5 | 1196


Table 18 MOT comparison on the YOLOv3-Tiny-enabled framework using the "Rural road dusk" video

YOLOv3-Tiny | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
MS-COCO | 16.1 | 114 | 6 | 24 | 0.983 | 0.163 | 0.28 | 243 | 4 | 1244
UA-DETRAC | 18.8 | 73 | 20 | 24 | 0.988 | 0.241 | 0.388 | 359 | 4 | 1128

Table 19 MOT metrics on the YOLOv3-Tiny-enabled framework using the "Rural road dusk" video

YOLOv3-Tiny | MOTA | MOTP
MS-COCO | 3.7 | 67.6
UA-DETRAC | 10.8 | 49.7

Table 21 MOT metrics on the MOT16 scene

MOT16-09 | MOTA | MOTP
YOLOv3-Tiny | 37.2 | 76.1
YOLOv3-608 | 57.1 | 78.8
YOLOv4-608 | 42.3 | 61.9

Fig. 41 Precision–recall curve for YOLOv3-Tiny comparison on the Rural road dusk video
Fig. 42 Precision–recall curve comparison on the MOT16-09 video
4.3.3 Exploring the modified Deep SORT on pedestrian videos

The second part of the experimental study concerned the evaluation of our framework and the modified Deep SORT on pedestrian videos and scenarios. In the context of this experiment, two scenes from the MOT benchmark were used, one from the MOT16 benchmark and one from MOT20, respectively. We also provide the MOTA and MOTP metrics as used in the MOT benchmark.
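For MOT-style scenes, MOTA and MOTP scores such as those in Tables 21 and 23 are conventionally computed with the py-motmetrics package. A minimal sketch of that evaluation, shown here on a single toy frame (a real evaluation would loop over every frame of the sequence); this is an illustration of the standard tooling, not the exact scripts used for the paper:

import motmetrics as mm

acc = mm.MOTAccumulator(auto_id=True)

# Toy example: one frame with two ground-truth objects and two hypotheses.
gt_ids, gt_boxes = [1, 2], [[10, 10, 50, 80], [200, 40, 60, 90]]
hyp_ids, hyp_boxes = [7, 8], [[12, 11, 50, 78], [205, 42, 58, 92]]

# Pairwise IoU-based distances; pairs with IoU below 0.5 stay unmatched.
dists = mm.distances.iou_matrix(gt_boxes, hyp_boxes, max_iou=0.5)
acc.update(gt_ids, hyp_ids, dists)

mh = mm.metrics.create()
summary = mh.compute(acc, metrics=["mota", "motp", "num_switches"], name="toy")
print(summary)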

Table 20 MOT results on the MOT16 scene

MOT16-09 | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
YOLOv3-Tiny | 15 | 52 | 14 | 25 | 0.952 | 0.234 | 0.375 | 2025 | 101 | 6621
YOLOv3-608 | 8 | 81 | 22 | 25 | 0.948 | 0.476 | 0.634 | 4122 | 225 | 4524
YOLOv4-608 | 7.6 | 72 | 30 | 25 | 0.943 | 0.452 | 0.611 | 3909 | 236 | 4737


Fig. 43 YOLOv3-Tiny-enabled framework on the MOT16-09 scene
Fig. 44 YOLOv3-608-enabled framework on the MOT16-09 scene

4.3.3.1 Results on the MOT16 scene In Tables 20 and 21, the results on the MOT16 scene are illustrated. The results show that YOLOv3-Tiny achieves the best execution throughput, with 15 frames per second using 1080p video as input. YOLOv3-608 and YOLOv4-608 run at nearly half that rate, with 8 and 7.6 FPS, respectively. In Fig. 42, we also provide the precision–recall curve.

However, YOLOv3-Tiny failed to keep track of most objects due to its poor detection rate, and it had a MOTA score of 37.2 and a MOTP score of 76.1. Many pedestrians that were not close to the camera could not get tracked, as seen in Fig. 43. The YOLOv3-Tiny-powered Deep SORT could not perform well enough, having a modified Deep SORT count of 14 and 52 initiated tracks.

YOLOv3-608 did much better compared to YOLOv3-Tiny, having a modified Deep SORT count of 22 and 81 initiated tracks, with a MOTA score of 57.1 and a MOTP score of 78.8. The high number of initiated tracks does show that tracking was still relatively poor, given that we only need to track 25 pedestrians. As can be seen in Fig. 44, this detector did a much better job at detecting pedestrians that were far away from the camera. It did face a problem that we describe below in our YOLOv4-608 notes.

Fig. 45 YOLOv4-608-enabled framework on the MOT16-09 scene


Table 22 MOT results on the MOT20 scene

MOT20-01 | Frames per second | Deep SORT tracks initiated | Modified Deep SORT | Ground truth | Precision | Recall | F1 score | TP | FP | FN
YOLOv3-Tiny | 15.5 | 114 | 20 | 90 | 0.989 | 0.051 | 0.097 | 1360 | 14 | 25,287
YOLOv3-608 | 6.9 | 188 | 48 | 90 | 0.995 | 0.309 | 0.472 | 8245 | 40 | 18,402
YOLOv4-608 | 7.3 | 135 | 41 | 90 | 0.998 | 0.191 | 0.320 | 5092 | 9 | 21,555

Table 23 MOT metrics on the MOT20 scene

MOT20-01 | MOTA | MOTP
YOLOv3-Tiny | 4.9 | 67.1
YOLOv3-608 | 31.8 | 73.3
YOLOv4-608 | 18.9 | 59.7

YOLOv4-608 achieved the best tracking performance on the MOT16 scene, and our framework has a MOTA score of 42.3 and a MOTP score of 61.9. It was able to detect the people behind the glass entrance of the shop, and it was able to keep track of people better than YOLOv3. It initiated fewer tracks when compared to YOLOv3-608. One major advantage of YOLOv4 was that it was much better at properly distinguishing tracks and capturing their appearance features. As an example, notice in Fig. 45 how the old man in the background is now properly labeled as track 25 and not as track 5, in contrast to Fig. 44, where YOLOv3 thought it was the same pedestrian that had passed at the beginning of the video.

4.3.3.2 Results on the MOT20 scene In Tables 22 and 23, we see the results for the MOT20 scene. The results show that YOLOv3-Tiny achieves the best execution throughput, with 15.5 frames per second. YOLOv3-608 and YOLOv4-608 run at nearly half that rate, with 6.9 and 7.3 FPS, respectively. In Fig. 46, we also provide the precision–recall curve.

In this scene, the modified Deep SORT coupled with YOLOv4 was faster than with YOLOv3, and this is attributed to the overall fewer detected tracks per frame that YOLOv4 produced, as also confirmed by the recall and F1 score. This reduced the number of CPU cycles needed to calculate trajectories and confirm tracks.

Fig. 46 Precision–recall curve comparison on the MOT20-01 video

Fig. 47 YOLOv3-Tiny-enabled framework on the MOT20-01 scene


Fig. 48 YOLOv3-608-enabled framework on the MOT20-01 scene
Fig. 49 YOLOv4-608-enabled framework on the MOT20-01 scene
Fig. 50 Ground truth of the MOT20-01 scene for the same frame tested

The worst detector was again YOLOv3-Tiny, since it failed to keep track of most objects due to its poor detection rate, having a MOTA score of 4.9 and a MOTP score of 67.1. The framework using the YOLOv3-Tiny detector had a count of 20 and 114 initiated tracks, which is far from the ground truth of 90 tracks. We provide a screenshot in Fig. 47 that shows its failure to detect many of the pedestrians.

YOLOv3-608 did much better compared to YOLOv3-Tiny, having a Deep SORT count of 48 and 188 initiated tracks, with a MOTA score of 31.8 and a MOTP score of 73.3. The high number of initiated tracks does show that tracking was still relatively poor, given that we only need to track 90 pedestrians. As shown in Fig. 48, this detector did a much better job at detecting pedestrians that were far away from the camera. Having said that, the people who were very far away could not be detected by any of our models. At that part of the scene, heavy occlusion occurs and there is a hefty amount of video noise and blur from the poor lighting conditions.

Finally, this time YOLOv4-608 achieved worse tracking performance, while also having a worse Deep SORT count compared to YOLOv3-608, with a MOTA score of 18.9 and a MOTP score of 59.7. It initiated 135 tracks, which is significantly lower when compared to YOLOv3-608, but that is only because it failed to track many of the pedestrians,
as seen in Fig. 49. The population density of this scene, paired with the increased video noise and the reflection from the sun at the back, makes this scene incredibly difficult to complete. We show the ground truth bounding boxes for this scene at the same frame in Fig. 50.

5 Conclusions

In this paper, we first explore the performance of various deep learning methods on the task of multiple-object tracking. We examine how widespread deep learning architectures perform under various contexts in a wide range of scene scenarios. We introduced a modification of the Deep SORT algorithm, which aids in properly displaying the track IDs, a crucial aspect of real-time object tracking. Our modification of Deep SORT is based on the process of the initialization of the object IDs, and its rationale is to consider an object as "tracked" if it is detected in a set of previous frames, while properly passing the information to the framework, a problem that occurred in all Deep SORT and YOLO implementations we found. The results indicate that our Deep SORT modification is functional across all tests.

In addition, we present a way to improve the real-time operation of the deep learning methods by identifying and addressing bottlenecks in the MOT framework. We tested and provide a way that can greatly improve the execution time of the tracking process. The results show an increase in frames per second (FPS) of up to 22% in all examined deep learning networks. Through our experimental process and our results, we found that, through the use of a dataset specialized in car traffic, we can achieve better performance than using the models trained on the MS-COCO dataset. As we saw during testing, the YOLOv3-Tiny-enabled framework was only suitable for simple scenes, where small occlusions occur and the field of view remains constrained. YOLOv4 offered the best performance, which was later enhanced by using the UA-DETRAC dataset during training, and we also provide what we consider the optimal parameters for each framework tested. Finally, we have created and introduced a new vehicle dataset from the videos we captured, consisting of 7 scenes, 11,025 frames and 25,193 bounding boxes. The dataset is suitable for testing and training multiple-object detectors and includes a variety of scenes, capture devices and daytime changes in an effort to cover weak points detectors may have.

A main direction that future work could examine concerns the addition of more features to the Deep SORT algorithm, such as being able to track and label more than one class at a time, and also adjusting the tracking process to take camera movement into account. Furthermore, we also plan to compare and explore tracking performance using our own custom, fine-tuned and purpose-built detectors.

6 Supplementary materials

All necessary materials, code and datasets can be found at https://github.com/Jimmeimetis/Deepsort-Yolo-implementations.

Declarations

Conflict of interest The authors declare that they have no conflict of interest.



