(2024-AEJ) Z-YOLOv8s-based Approach For Road Object Recognition in Complex Traffic Scenarios
Original article
Keywords: Road Environmental Object Detection; YOLOv8; Deep Learning; Autonomous driving

Abstract: Object detection in road scenarios is crucial for intelligent transport systems and autonomous driving, but complex traffic conditions pose significant challenges. This paper introduces Z-You Only Look Once version 8 small (Z-YOLOv8s), designed to improve both accuracy and real-time efficiency under real-world uncertainties. By incorporating the Revisiting Perspective Vision Transformer (RepViT) and C2f into the YOLOv8s framework, and integrating the Large Selective Kernel Network (LSKNet), the model enhances spatial feature extraction. Additionally, the YOLOv8s backbone is optimized with Space-to-Depth Convolution (SPD-Conv) for better small object detection. The SoftPool-Spatial Pyramid Pooling Fast (SoftPool-SPPF) module ensures precise preservation of characteristic information. Z-YOLOv8s improves mean average precision (mAP)@0.5 on the Berkeley DeepDrive 100K (BDD100K) and Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) datasets by 7.3 % and 3.8 %, respectively. It also achieves accuracy increases of 5.7 % and 6.5 % in Average Precision (AP)-Small, and a real-time detection speed of 78.41 frames per second (FPS) on BDD100K. Z-YOLOv8s balances detection precision and processing speed more effectively than other detectors, as demonstrated by experimental results and comparisons.
1. Introduction

With the rapid advancement in autonomous driving technology, road object detection has emerged as a pivotal area of research. The detection of objects in road environments is regarded as a critical aspect of the environmental perception system of autonomous vehicles. For devices with limited computational resources, it is imperative to detect objects quickly and accurately in real traffic scenarios to ensure safe and reliable driving behaviors and decision-making [1,2]. With their improved generalization and precision, deep-learning-based methods are increasingly supplanting traditional algorithms as the dominant approach for object detection. These techniques have demonstrated promising results in the detection of objects in traffic scenes. However, several significant challenges remain. In the current landscape of autonomous driving environment perception tasks, factors such as diverse weather conditions, lighting variations, object occlusions, and the presence of small objects in real traffic scenarios introduce substantial uncertainty, thereby reducing the precision of road object detectors.

Currently, road object detection algorithms that use deep learning are classified into one-stage and two-stage approaches. Among two-stage algorithms, the Regions with Convolutional Neural Network (R-CNN) model was initially reported [3]. The R-CNN transforms traditional object detection into a regional feature extraction and classification process. Spatial Pyramid Pooling (SPP) efficiently handles objects of different sizes and addresses the issue of information loss in the R-CNN model owing to normalization [4]. Fast R-CNN employs end-to-end training to jointly learn classification and regression tasks by sharing convolutional layers, thereby significantly reducing the network training and testing times [5]. Faster R-CNN incorporates a Region Proposal Network (RPN), leading to substantial improvements in detection speed and precision [6]. Mask R-CNN extends this by not only detecting and localizing objects, but also by generating pixel-level segmentation masks for each detected object, thus achieving accurate segmentation [7]. Although two-stage algorithms have become increasingly precise, they often suffer from a low detection speed. In contrast, one-stage algorithms such as the You Only Look Once (YOLO) family [8–15], the single-shot multibox detector (SSD) algorithm [16], and RetinaNet utilize regression methods to simultaneously classify objects and predict bounding boxes [17].
The YOLO algorithm uses the entire image as input and directly regresses the location and class of the bounding box. The YOLO and SSD algorithms offer real-time detection speeds and are increasingly superior to two-stage algorithms in terms of precision, making them widely used in autonomous driving applications.

To achieve faster and more precise road-object detection in complex traffic situations, research efforts have focused on enhancing deep-learning object detection algorithms to improve precision and speed. For instance, Reference [18] proposed a YOLOv4-Tiny-based method for traffic sign object detection, and [19] introduced an enhanced YOLOv5 model for small object detection in traffic scenes. Therefore, YOLO methods can learn significant features to detect objects in traffic scenarios.

To reduce the effect of uncertainties on the effectiveness of object detection in complex traffic scenarios, this study primarily investigated one-stage methods and explored their potential for performance improvement. We chose YOLOv8 [15] as the foundational model and proposed a real-time accurate object detector called Z-YOLOv8s. The proposed model aims to accomplish two main objectives: enhancing the detection accuracy, particularly for small objects, and achieving real-time processing speed.

The primary contributions are as follows:

First, a Revisiting Perspective Vision Transformer (RepViT)C2f module was introduced, combining the RepViT structure with the C2f module to enable the network to integrate local features and global semantic information. This fosters strong local correlations and global modeling capabilities.

Second, the proposed SoftPool-Spatial Pyramid Pooling Fast (SoftPool-SPPF) module enhances the detection of small objects in complex traffic scenarios by addressing the challenge of insufficient fine-grained information extraction at the edges of the model.

Third, experiments performed on publicly available datasets demonstrated that the Z-YOLOv8s model achieved advanced detection accuracy while maintaining superior speed compared with several state-of-the-art models, showing a balanced performance in terms of both speed and precision.

The structure of this work is organized as follows: Section 2 presents a review of prior studies. Sections 3 and 4 propose the Z-YOLOv8s method and present the experimental analysis, respectively. Section 5 summarizes key contributions and outlines potential directions for future research.

2. Related works

In this section, we present a concise overview of recent object detection methods.

2.1. Object detection in traffic scenarios

With the rapid developments in science, technology, and the automotive industry, autonomous driving has gradually become a major focus of automotive research. This shift has positioned autonomous driving systems as a central point of interest in the transportation sector [20]. Deep learning object detection methods in traffic scenarios have gained widespread popularity.

Specifically, one-stage detection algorithms are characterized by high detection speeds, offering an excellent balance between precision and speed. This balance is particularly important for real-time detection in autonomous vehicles. Various methods for detecting traffic scenario objects have been continuously developed. For example, Yu et al. proposed a three-dimensional (3D) multi-view learning-based person re-identification (ReID) method that addresses the challenges of occluded pedestrians. Traditional two-dimensional (2D) methods fail to capture the full 3D characteristics of an individual. The new network structure, multiview learning (MV-3DSReID), combines the advantages of 2D and 3D multiviews, captures geometric and shape details from a 3D space, and extracts semantic representations using 2D networks. This approach significantly improved the accuracy of ReID tasks in both occluded and holistic scenarios [21]. Zhang et al. reviewed deep learning-based person search methods and highlighted the integration of detection and re-identification tasks. They introduced a new taxonomy, evaluated state-of-the-art techniques, and explored future research directions to address challenges such as occlusion and scale variation in practical applications [22]. However, these methods only improved the accuracy of identifying occluded objects and were ineffective in detecting small objects in complex backgrounds.

Wang et al. developed an automatic fog detection algorithm using YOLOv5, with a new backbone network, re-parameterization aggregated residual transformations for deep neural networks (Rep-ResNeXt), to enhance the network feature extraction speed and accuracy in foggy driving scenes. Additionally, a feature enhancement module (FEM) is employed to automatically extract features from the foggy images and other key parameters. This approach enhanced the accuracy and speed of object detection in foggy weather conditions [23]. However, a complex network architecture increases the inference time, which further limits its applicability to autonomous driving. To address the Zero-Shot Open-Set Recognition (ZS-OSR) problem, Li et al. proposed the adversarial semantic embedding (ASE) method. This method ensures that these embeddings are closely clustered around the unseen class embeddings, while remaining distinct from the unknown class embeddings. Using both novel and unfamiliar features for training, this approach effectively trained an open-set classifier. Experimental results indicate that this method significantly boosts the classification accuracy and improves the rejection rates of unknown classes [24].

To improve YOLO performance in traffic scenarios, Shi et al. introduced a network module to enhance feature extraction, used a dense neck structure to merge details and semantics, and combined SCYLLA-Intersection over Union (SIoU) with orientation information in the loss function to improve convergence and precision for detecting small objects in traffic [25]. Tian et al. developed a technique for identifying small objects in intelligent transportation scenarios that incorporated object feedback and retained feature information. The small object Intersection over Union (SOIoU) loss function is designed to adaptively optimize small objects; a small-object path aggregation network (SOPNet) was adopted to retain detailed features. The results demonstrate that the proposed method achieves superior detection accuracy and outperforms existing methods [26]. However, these methods are generally limited to the detection of small objects in traffic scenarios and have significant constraints.

Oreski improved the detection outcomes of the YOLO method in traffic scenarios by considering the multi-context (MCTX) module and integrating changes in the loss function. This approach effectively exploits rich global contextual information without compromising efficiency [27]. Cong et al. introduced a lightweight detection algorithm based on a modified YOLO model. This architecture primarily achieves the effective extraction and utilization of object feature information through the interaction of information between subnets. In addition, a lightweight distributed shift C3 (DSC3) module was designed to resolve issues related to model computation and label assignment. This method enhances the ability to detect environmental objects in traffic scenarios [28]. Tang et al. presented a pioneering object detection method named the pyramid integration and attention-enhanced network (PIAENet), which seamlessly integrates the Pyramid Integration Module (PIM) and the attention-enhanced module (AEM) to attain superior accuracy and efficiency. The PIM augments the receptive field of the model by amalgamating multiscale features through multiple branches. Moreover, the AEM improves feature fusion by utilizing double-attention mechanisms to reduce the impact of irrelevant information effectively [29]. Zhan et al. proposed an anchor-free multitasking learning network for panoptic driving perception (YOLOPX). It features an anchor-free detection head for improved adaptability and scalability, a lightweight lane-detection head with multiscale high-resolution features, and Polarized Self-Attention (PSA) modules for efficient training and superior performance [30]. However, the effectiveness of these methods is diminished because of their high computational complexity and poor real-time performance.
2.2. Vision transformer

With progress in deep learning, the transformer model has achieved significant breakthroughs in natural language processing. Because of the limitations of convolutional kernels in acquiring information, researchers have begun to apply transformer models to computer vision tasks [31]. Dosovitskiy et al. introduced the vision transformer (ViT), which demonstrated the excellent performance of self-attention mechanisms in computer vision tasks [32]. Unlike traditional convolutional neural networks (CNNs) that recognize local patterns and features, ViT uses a transformer architecture on image patches for object classification. By employing multi-head self-attention to capture long-range dependencies, transformer models have achieved state-of-the-art results in classic computer vision tasks. ViT demonstrated that a pure transformer architecture could outperform CNNs in computer vision tasks when trained on large datasets. However, it requires extensive data to perform optimally and lacks the inductive bias inherent in CNNs. Moreover, as the input image size increases, the sequence length and complexity also increase. This was evident in the Detection Transformer (DETR), which was the first successful attempt to use a transformer for object detection [33]. DETR comprises a pretrained CNN backbone and a transformer. It uses ResNets to generate low-dimensional features, combines these features into a single feature set, adds position encodings, and feeds them into a transformer. However, the complexity and high computational and hardware demands of transformers render them less practical for real-world applications. The Swin transformer introduced the concept of shifted windows from CNNs to transformers. This approach leverages the ViT patch-based technique by segmenting the input image into separate nonoverlapping patches. The computational load of the local self-attention mechanism increases linearly with the image dimensions. Consequently, the Swin transformer uses more parameters than convolutional models [34]. Wang et al. proposed RepViT, which optimizes the mobile network MobileNetV3 by integrating it with the ViT architecture. Consequently, RepViT achieves an excellent balance between precision and real-time performance [35].
spatial characteristics of each object. This flexibility is crucial for the accurate detection of road objects in complex traffic scenarios.

The Large Selective Kernel (LSK) attention mechanism dynamically selects convolutional kernels and adapts to diverse contextual information by considering local details from the input feature map. It adjusts its receptive area to suit different object types and contexts. LSKNet is divided into two sub-blocks: the FFN and large-kernel (LK) selection. The FFN is employed to combine channels and enhance feature details, and consists of a sequence that includes a fully connected layer, depth-wise convolution, Gaussian Error Linear Unit (GELU) activation, and another fully connected layer. Similarly, the LK selection sub-block consists of a sequence that incorporates a fully connected layer, an LSK sub-block, and GELU activation, followed by another fully connected layer. The key elements of LSKNet include the LSK sub-block, which incorporates LK convolutions generated by decomposing them into a sequence of kernels with progressively larger sizes and depth-wise convolutions featuring higher dilation rates. Specifically, the expansion of the i-th depth-wise convolution in terms of the kernel size k, dilation rate d, and receptive field RF is shown in Eq. (1).

k_{i-1} \le k_i; \quad d_1 = 1, \; d_{i-1} < d_i \le RF_{i-1}; \quad RF_1 = k_1, \; RF_i = d_i(k_i - 1) + RF_{i-1}   (1)

The receptive field can be expanded rapidly by enlarging the kernel sizes and increasing their dilation rates. A larger upper limit for the dilation rate was adopted to avoid gaps among the feature maps. The proposed approach simplifies the subsequent kernel selection and significantly decreases the number of parameters. Furthermore, using a set of decomposed depth-wise convolutions with various receptive fields, it is possible to capture characteristics at various ranges using contextual information. This enables the spatial feature vectors to be mixed across the channels. The calculations are presented in Eqs. (2) and (3).

U_0 = X, \quad U_{i+1} = F_i^{dw}(U_i)   (2)

\tilde{U}_i = F^{1 \times 1}(U_i), \quad i \in [1, N]   (3)

The expression F_i^{dw} represents a depth-wise convolution with kernel k_i and dilation d_i, assuming the existence of N decomposed kernels, each processed through a 1 × 1 convolution layer denoted as F^{1 \times 1}.

LSKNet uses a spatial kernel selection algorithm to enhance its focus on key spatial context areas by selecting features from large convolutional kernels of various scales. It segments an input feature map into smaller subsets, applies kernels of varying sizes to each subset, and generates multiple output feature maps. These individual output feature maps are then combined or aggregated according to Eq. (4).

\tilde{U} = [\tilde{U}_1; \cdots; \tilde{U}_i]   (4)

Spatial relation descriptors are obtained by pooling the feature maps using both average and maximum pooling operations across the channel direction, that is, SA_{avg} and SA_{max}, as defined in Eq. (5).

SA_{avg} = P_{avg}(\tilde{U}), \quad SA_{max} = P_{max}(\tilde{U})   (5)

After concatenating SA_{avg} and SA_{max}, convolutional layers are used to convert these descriptors into spatial attention maps whose number equals the number of depth-wise convolutions N, as defined in Eq. (6).

\widehat{SA} = F^{2 \to N}([SA_{avg}; SA_{max}])   (6)

Applying the sigmoid activation function to each spatial attention map yields the spatial selection weights for each depth-wise convolution. These weights are used for element-wise multiplication with the corresponding depth-wise convolution feature maps, resulting in weighted feature maps. Finally, a convolution layer fuses the maps to generate the final attention features. This process is described by Eqs. (7) and (8). The eventual output of the LSK module is expressed as the element-wise product of the input feature X and the spatial attention map S, as illustrated in Eq. (9).

\widetilde{SA}_i = \sigma(\widehat{SA}_i)   (7)

S = F\left( \sum_{i=1}^{N} \widetilde{SA}_i \cdot \tilde{U}_i \right)   (8)

Y = X \cdot S   (9)

Therefore, LSKNet fulfills the requirement for a more comprehensive and adaptable understanding of information in complex traffic scenarios. Consequently, LSKNet was integrated into the detection head network to enhance the ability of the model to extract features. Fig. 3 presents the structural diagram of LSKNet.
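To make the selection mechanism in Eqs. (2)–(9) concrete, the following is a minimal PyTorch sketch of an LSK-style block; the two-kernel decomposition (a 5 × 5 depth-wise convolution followed by a 7 × 7 depth-wise convolution with dilation 3), the channel widths, and the module layout are illustrative assumptions rather than the exact configuration used in Z-YOLOv8s.

```python
import torch
import torch.nn as nn

class LSKBlock(nn.Module):
    """Sketch of a Large Selective Kernel block (Eqs. (2)-(9)), under assumed settings."""
    def __init__(self, dim: int):
        super().__init__()
        # Decomposed large-kernel branch: 5x5 depth-wise, then 7x7 depth-wise with dilation 3.
        self.dw_small = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_large = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        # 1x1 projections producing the candidate features U~_i (Eq. (3)).
        self.proj1 = nn.Conv2d(dim, dim // 2, 1)
        self.proj2 = nn.Conv2d(dim, dim // 2, 1)
        # Maps the pooled 2-channel descriptor to N = 2 spatial attention maps (Eq. (6)).
        self.attn_conv = nn.Conv2d(2, 2, 7, padding=3)
        # Fuses the weighted features back to the input width (Eq. (8)).
        self.fuse = nn.Conv2d(dim // 2, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u1 = self.dw_small(x)                              # U_1 (Eq. (2))
        u2 = self.dw_large(u1)                             # U_2 (Eq. (2))
        v1, v2 = self.proj1(u1), self.proj2(u2)            # U~_1, U~_2 (Eq. (3))
        cat = torch.cat([v1, v2], dim=1)                   # U~ (Eq. (4))
        sa_avg = cat.mean(dim=1, keepdim=True)             # SA_avg (Eq. (5))
        sa_max = cat.max(dim=1, keepdim=True).values       # SA_max (Eq. (5))
        sa = self.attn_conv(torch.cat([sa_avg, sa_max], dim=1))  # Eq. (6)
        w = torch.sigmoid(sa)                              # selection weights (Eq. (7))
        s = self.fuse(v1 * w[:, 0:1] + v2 * w[:, 1:2])     # Eq. (8)
        return x * s                                       # Eq. (9)

# Example: y = LSKBlock(64)(torch.randn(1, 64, 80, 80)) keeps the input shape (1, 64, 80, 80).
```

In this sketch, the kernel decomposition yields receptive fields of 5 and 23 pixels (Eq. (1) with k = 5, 7 and d = 1, 3), so the sigmoid weights decide, per spatial location, how much of the short-range and long-range context is kept.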
3.2.3. SPD-Conv module

Images captured in typical traffic scenarios generally have good resolution and moderately sized objects. Object detection models employ design elements, such as strided convolutions and pooling operations, to skip unnecessary pixel-level details and efficiently extract object features. However, when detecting small objects in traffic scenarios, particularly overlapping and occluded small objects, the assumption that the skipped information is redundant no longer holds. This limitation can severely restrict the ability of the model to capture detailed features, thereby reducing the accuracy of road-object detection. SPD-Conv [39] detects small objects in challenging traffic environments and retains detailed features during convolution, and was therefore integrated into the backbone network of YOLOv8s. By removing strided convolutions and pooling operations, SPD-Conv preserves detailed information during downsampling, thereby improving pattern-learning efficiency and feature representation capability.

The SPD-Conv module consists of an SPD layer and a non-strided convolution, effectively replacing the traditional pooling and strided convolution layers in CNNs. As illustrated in Fig. 4, the SPD layer decreases the resolution of the feature map X while maintaining all channel information, thereby avoiding data loss. When SPD is applied to an input feature map X of size (S, S, C_1), it generates sub-feature maps. The specific splitting formula is given by Eq. (10).

f_{0,0} = X[0:S:scale, 0:S:scale], \; f_{1,0} = X[1:S:scale, 0:S:scale], \; \ldots, \; f_{scale-1,0} = X[scale-1:S:scale, 0:S:scale];
f_{0,1} = X[0:S:scale, 1:S:scale], \; \ldots, \; f_{scale-1,1} = X[scale-1:S:scale, 1:S:scale];
\ldots
f_{scale-1,scale-1} = X[scale-1:S:scale, scale-1:S:scale].   (10)

At a scale of 2, four sub-features (f_{0,0}, f_{0,1}, f_{1,0}, and f_{1,1}), each of size (S/2, S/2, C_1), are derived from the feature map X. These sub-feature maps are concatenated to obtain a feature map X′ of size (S/2, S/2, 4C_1). Importantly, this process maintains all of the information along the channel dimension.

Finally, a non-strided convolution with a stride of 1 is adopted to process the characteristic map output by the SPD operation and capture rich feature information. This approach prevents the potential loss of feature-map information that may occur with strided convolution. The SPD-Conv module effectively preserves the critical characteristics of low-resolution images and enhances the detection of small objects, thereby significantly improving accuracy in traffic scenario applications.
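The space-to-depth rearrangement in Eq. (10) followed by a non-strided convolution can be sketched in a few lines of PyTorch; the channel widths and the 3 × 3 kernel below are illustrative assumptions, not the exact Z-YOLOv8s settings.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Sketch of SPD-Conv: space-to-depth split (Eq. (10)) + non-strided convolution."""
    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Non-strided (stride = 1) convolution applied after the lossless downsampling.
        self.conv = nn.Conv2d(in_ch * scale * scale, out_ch, kernel_size=3, stride=1, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.scale
        # Eq. (10): sample every s-th pixel at all (x, y) offsets, i.e. f_{0,0}, f_{1,0}, ...
        subs = [x[..., i::s, j::s] for i in range(s) for j in range(s)]
        # Concatenate along channels: (S, S, C1) -> (S/s, S/s, s^2 * C1); no pixels are discarded.
        x = torch.cat(subs, dim=1)
        return self.conv(x)

# Example: y = SPDConv(64, 128)(torch.randn(1, 64, 640, 640)) -> shape (1, 128, 320, 320).
```

Unlike a strided convolution, every input pixel contributes to the downsampled map, which is the property the SPD layer relies on for small, partially occluded objects.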
3.2.4. SoftPool-SPPF module

As a CNN deepens, its receptive field size also increases. However, because the size of the input image is restricted, feature extraction is repeated over a large receptive field. To address this issue, YOLOv8s uses Spatial Pyramid Pooling Fast (SPPF) [12], which combines characteristic maps from diverse receptive fields. This module integrates both local and global characteristics, thereby maximizing the expressive power of the characteristic map. The purpose of SPPF is to extract the most crucial contextual characteristics for size-based object detection. To achieve this, SPPF employs multiple parallel max-pooling operations to integrate features from receptive fields of different scales. Max pooling selects the highest value from the feature points within a neighborhood and effectively preserves texture features. Although this approach is generally suitable for standard applications where model accuracy is not significantly impacted, it fails to capture spatial information when only limited road object feature information is available in complex traffic scenes. This phenomenon not only results in the loss of valuable information but also complicates the detection of small objects.

This study proposes the SoftPool-SPPF module as an alternative to the SPPF module in the YOLOv8s backbone, as illustrated in Fig. 5.

Fig. 5. Structural diagram of SoftPool-SPPF.

SoftPool is a pooling variant that reduces data loss during the pooling process while retaining the functionality of the pooling layer [40]. It is particularly suitable for small object detection. SoftPool uses an exponential weighting method to retain more feature information in downsampled activation mappings, offering finer control than other pooling methods. It operates through distinct forward and backward phases. The forward process is differentiable, ensuring that each activation in the local neighborhood receives at least a minimal gradient value during the backward propagation phase. The characteristic map size is represented by C × H × W. In addition, R is the index set associated with activations in the two-dimensional spatial area, and each activation i in the activation area R is associated with a weight w_i. The index weighting calculation method is illustrated in Eqs. (11) and (12).

w_i = \frac{e^{a_i}}{\sum_{j \in R} e^{a_j}}   (11)

\tilde{a} = \sum_{i \in R} w_i \cdot a_i   (12)
This weighting ensures the propagation of important features. The SoftPool operation assigns nonlinear weights to the associated activation values, thereby ensuring that all activations contribute to the final output. \tilde{a} denotes the output of SoftPool, obtained by aggregating the weighted activations across the pooling kernel. Therefore, replacing SPPF with SoftPool-SPPF allows better preservation of the comprehensive characteristics of small objects in complicated traffic backgrounds.
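A compact PyTorch sketch of the SoftPool weighting in Eqs. (11) and (12) is shown below; the 2 × 2 kernel and the use of average pooling to accumulate the exponential weights are illustrative choices, not the exact implementation of [40].

```python
import torch
import torch.nn.functional as F

def soft_pool2d(x: torch.Tensor, kernel_size: int = 2, stride: int = 2) -> torch.Tensor:
    """SoftPool sketch: exponentially weighted average over each pooling region (Eqs. (11)-(12))."""
    # w_i = exp(a_i) / sum_j exp(a_j) within each kernel region R (Eq. (11)).
    e = torch.exp(x)
    # Region sums, computed via average pooling scaled by the region size.
    sum_e = F.avg_pool2d(e, kernel_size, stride) * (kernel_size * kernel_size)
    sum_ea = F.avg_pool2d(e * x, kernel_size, stride) * (kernel_size * kernel_size)
    # a~ = sum_i w_i * a_i (Eq. (12)); every activation contributes to the pooled output.
    return sum_ea / (sum_e + 1e-8)

# Example: soft_pool2d(torch.randn(1, 256, 80, 80)) -> shape (1, 256, 40, 40).
```

Because the output is a softmax-weighted average rather than a hard maximum, weak but informative activations around small objects still influence the downsampled feature map, which is the behaviour the SoftPool-SPPF module relies on.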
4. Experimental results

To demonstrate the efficiency of Z-YOLOv8s, extensive experiments were conducted using the BDD100K and KITTI datasets. This section provides details on the datasets used, the implementation and settings, and the assessment results.

4.1. Datasets

The BDD100K dataset, in which each image contains up to 90 objects, many of which are small and occluded, was selected for the primary experimental validation to enable the model to recognize different complex traffic scenarios [41]. The KITTI dataset is a major international benchmark for evaluating computer vision methods in intelligent driving scenarios [42]. The KITTI dataset includes practical image data from different areas, with each image containing up to 15 vehicles and 30 pedestrians and featuring many small objects and varying levels of occlusion. For the BDD100K dataset, this paper performed a category reassignment by merging bike, car, bus, truck, train, and motorcycle into the "car" category, and person and rider into the "pedestrian" category, while retaining traffic signs and traffic lights. For the KITTI dataset, the classes "Truck," "Van," and "Tram" were incorporated into the "car" category, and the class "Person (sitting)" was combined with the "Pedestrian" category. Consequently, the final dataset retained labels for three categories: "Car," "Pedestrian," and "Cyclist." Following official recommendations, the BDD100K dataset was divided into training, validation, and test sets in a ratio of 7:1:2, whereas the KITTI dataset was divided using a ratio of 8:1:1. Fig. 6 and Fig. 7 show the normalized label distributions of BDD100K and KITTI, respectively. Darker colors represent a higher density of bounding boxes. The plots reveal that the bounding boxes in both datasets are primarily clustered in the lower-left corner, indicating the notable presence of small objects. Images of these samples are shown in Fig. 8.
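As a hedged illustration of the category reassignment described above, the merging could be expressed as a simple lookup applied to each label; the dictionaries below only mirror the merges stated in this section, and the label names and handling of unlisted classes are assumptions.

```python
# Hypothetical label remapping mirroring the category merges described in Section 4.1.
BDD100K_REMAP = {
    "bike": "car", "car": "car", "bus": "car", "truck": "car",
    "train": "car", "motorcycle": "car",
    "person": "pedestrian", "rider": "pedestrian",
    "traffic sign": "traffic sign", "traffic light": "traffic light",
}

KITTI_REMAP = {
    "Truck": "Car", "Van": "Car", "Tram": "Car", "Car": "Car",
    "Person_sitting": "Pedestrian", "Pedestrian": "Pedestrian",
    "Cyclist": "Cyclist",
}

def remap_labels(labels, table):
    """Return labels mapped to the merged categories, dropping classes not in the table."""
    return [table[name] for name in labels if name in table]

# Example: remap_labels(["rider", "truck", "traffic light"], BDD100K_REMAP)
# -> ["pedestrian", "car", "traffic light"]
```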
4.2. Implementation

To prove the efficiency of Z-YOLOv8s for road object detection in traffic scenarios, model training and testing experiments were conducted using the following hardware and software configurations. The hardware setup included an Intel(R) Core(TM) i9-13900K processor with a clock frequency of 3.19 gigahertz (GHz), 60 gigabytes (GB) of random-access memory (RAM), and a GeForce Ray Tracing eXtreme
(RTX) 4090 graphics processor with 24 GB of video memory. The deep learning framework used was PyTorch 1.8.1 with Torchvision 0.9.1, and the base version of YOLOv8 employed was Ultralytics 8.0.25. The algorithm was configured with a batch size of 8 and trained for 300 epochs. To address the presence of small objects in the sample images and achieve a balance between real-time performance and accuracy, the input size was normalized to 640 × 640. These dimensions facilitate the deployment of the model on edge devices without
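The training setup above could be reproduced roughly with the Ultralytics API; this is a minimal sketch under the stated settings (batch size 8, 300 epochs, 640 × 640 input), and the weight file name and dataset YAML path are assumptions rather than the authors' released configuration.

```python
from ultralytics import YOLO

# Hypothetical reproduction of the training settings described in Section 4.2.
model = YOLO("yolov8s.pt")          # YOLOv8s baseline weights (assumed file name)
model.train(
    data="bdd100k.yaml",            # assumed dataset configuration file
    imgsz=640,                      # inputs normalized to 640 x 640
    batch=8,                        # batch size reported in the paper
    epochs=300,                     # training epochs reported in the paper
    device=0,                       # single RTX 4090 GPU
)
```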
Table 1. Ablation results of the Z-YOLOv8s algorithm on the BDD100K dataset.
(Module columns: RepViTC2f, LSKNet, SPD-Conv, SoftPool-SPPF; result columns: [email protected] (%), [email protected]:0.95 (%), P (%), R (%), FPS (f/s), Parameters (M).)
Table 2. Ablation results of the Z-YOLOv8s algorithm on the KITTI dataset.
(Module columns: RepViTC2f, LSKNet, SPD-Conv, SoftPool-SPPF; result columns: [email protected] (%), [email protected]:0.95 (%), P (%), R (%), FPS (f/s), Parameters (M).)
Fig. 9. P-R curve of every class for YOLOv8s and Z-YOLOv8s on BDD100K.
Fig. 10. P-R curve of every class for YOLOv8s and Z-YOLOv8s on KITTI.
advancements across multiple metrics compared with YOLOv8s. The network incorporates the RepViTC2f, LSKNet, SPD-Conv, and SoftPool-SPPF structures. Despite the increase in the number of parameters, the detection performance of the model showed significant improvement. On the BDD100K dataset, [email protected] increased from 67.9 % to 75.2 %, a gain of 7.3 %, whereas [email protected]:0.95 increased from 35.2 % to 39.5 %, a gain of 4.3 %. On the KITTI dataset, [email protected] increased by 3.8 % and [email protected]:0.95 increased by 6.4 %. By incorporating the various enhancement modules, the YOLOv8s model parameter count grew from its initial 11.1 million (M) to a peak of 16.13 M. Despite the increase in parameters, which led to a reduction in inference speed, the model continued to satisfy the demands of real-time detection, reaching 78.41 FPS and 87.3 FPS on the two datasets, respectively. Therefore, the Z-YOLOv8s algorithm not only fulfills real-time detection requirements but also significantly enhances detection accuracy. The RepViTC2f module captures contextual semantic information and effectively extracts global features. The LSKNet module enhances object feature information, suppresses background noise, and improves visual representation capabilities. Furthermore, the SPD-Conv module enhances the detection of small objects in complex traffic scenarios by mitigating the degradation of detailed feature information during convolution. In addition, the introduction of the SoftPool-SPPF module enables the algorithm to preserve the intricate features of small objects in complex traffic scenarios more effectively. Consequently, the experimental results demonstrated that the improvements at each stage enhanced the learning capabilities of the model, confirming the generality and effectiveness of the proposed algorithm.

4.5. Algorithm performance analysis

Fig. 9 and Fig. 10 show the precision-recall (P-R) results for the BDD100K and KITTI test sets, respectively, presenting the P-R curves for each class under [email protected] for both YOLOv8s and Z-YOLOv8s. On the BDD100K dataset, the most notable disparities are evident in the curves of traffic signs and traffic lights, which are typically the smallest objects in traffic scenes. On the KITTI dataset, the cyclist category exhibits the most outstanding results, with a performance increase of 6.2 %. Additionally, the detection performance for traffic signs and traffic lights on the BDD100K dataset exhibited improvements of 8.3 % and 5.1 %, respectively. These two categories are more densely distributed in traffic scenes and occupy fewer pixels. These findings indicate that the proposed method enhances the accuracy of small object detection and effectively reduces object omission rates.

To assess the effectiveness of the Z-YOLOv8s approach for small object detection in complex traffic scenarios, we used the Common Objects in Context (COCO) benchmark evaluation criteria. Traffic scene objects were categorized into small, medium, and large objects based on their sizes: small objects have an area of fewer than 32² pixels, medium objects between 32² and 96² pixels, and large objects more than 96² pixels. Table 3 presents the results for various object scales. The Z-YOLOv8s detector exhibited higher AP and Average Recall (AR) values for all detected objects across various IoU thresholds on the BDD100K and KITTI datasets. For objects of various sizes under the same IoU, the Z-YOLOv8s algorithm achieved higher AP and AR values than the YOLOv8s algorithm. The improvements in AP and AR values for small objects are particularly noteworthy. The Z-YOLOv8s algorithm achieved AP-small values of 26 % and 42 % for traffic objects, which is an improvement of 5.7 % and 6.5 % on the BDD100K and KITTI datasets, respectively. The AR-small values for traffic object detection with Z-YOLOv8s reached 42.1 % and 52.7 %, representing improvements of 9.2 % and 7.7 % on the BDD100K and KITTI datasets, respectively. Overall, the Z-YOLOv8s algorithm performed better in detecting small objects in road environments while maintaining good generalization.

Table 3. Performance comparison of the YOLOv8s and Z-YOLOv8s algorithms for object detection at various sizes on BDD100K and KITTI.
(Columns: IoU, Area, maxDets, YOLOv8s (BDD100K), Z-YOLOv8s (BDD100K), YOLOv8s (KITTI), Z-YOLOv8s (KITTI).)
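For reference, the COCO-style size buckets used above follow directly from the 32² and 96² area thresholds; the small helper below is only a sketch of that categorization, not the evaluation code used for Table 3, which follows the standard COCO protocol.

```python
SMALL_MAX = 32 ** 2    # area < 32^2 pixels          -> "small"
MEDIUM_MAX = 96 ** 2   # 32^2 <= area < 96^2 pixels  -> "medium", otherwise "large"

def size_bucket(width: float, height: float) -> str:
    """Assign a ground-truth box to the COCO small/medium/large bucket by its pixel area."""
    area = width * height
    if area < SMALL_MAX:
        return "small"
    if area < MEDIUM_MAX:
        return "medium"
    return "large"

# Example: a 20 x 25-pixel traffic light (area 500 < 1024) counts toward AP-small / AR-small.
# size_bucket(20, 25) -> "small"
```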
4.6. Comparative experimental analysis

To assess the effectiveness of the proposed method, we performed a comparative analysis of several traditional object-detection networks using the BDD100K and KITTI datasets. The results are presented in

4.7. Visualizations

Finally, we present numerous visual results to provide a detailed understanding of the detection properties of Z-YOLOv8s. These visuals include trends in the loss function, object detection outcomes, and Gradient-weighted Class Activation Mapping (Grad-CAM), allowing an intuitive analysis of the detection results.

(1) Tendency of the loss function: To investigate the impact of the loss function, we recorded the loss-function trends for YOLOv8s and Z-YOLOv8s on the BDD100K dataset, as shown in Fig. 11.

Fig. 11. Comparison of the loss functions of YOLOv8s and Z-YOLOv8s on BDD100K.

Fig. 11 indicates that our method exhibits a certain degree of reduction in three categories of loss functions: box loss, class (cls) loss,
and distribution focal loss (dfl). This reveals that the algorithm is better suited to the training data, whereas the improved convergence indicates that it can more effectively address the problem of class imbalance, leading to enhanced detection performance.

(2) Object detection results: To enhance the understanding of the algorithm performance, we visualized the detection results of YOLOv8s and Z-YOLOv8s on the BDD100K and KITTI test datasets, as shown in Fig. 12 and Fig. 13. YOLOv8s is prone to inaccurate localization and produces numerous imprecise results. Conversely, Z-YOLOv8s demonstrates highly accurate object localization, significantly reducing false detections and erroneous identifications. In addition, our proposed detector can identify object samples that YOLOv8s fails to detect, such as pedestrians in shadows or obscured cars. Z-YOLOv8s achieved these objectives with precise detection, as shown in Fig. 12. This confirms the excellent performance of Z-YOLOv8s.

(3) Grad-CAM results: Grad-CAM was employed to provide a visual interpretation of the associations learned by CNNs, allowing the discrimination and localization of significant areas in input images that strongly affect the network predictions. Grad-CAM [45] produces a coarse localization map that highlights the important areas in an image used to predict a concept, obtained by analyzing the gradients of the target concept flowing into the final convolutional layer. To better analyze the influence of Z-YOLOv8s, the detection results are visualized in Fig. 14 and Fig. 15.
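The Grad-CAM computation referenced here can be sketched as follows; the hook-based implementation and the choice of a detection confidence score as the backpropagated target are assumptions for illustration, not the exact visualization pipeline used in this paper.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, score_fn):
    """Sketch of Grad-CAM [45]: weight the final conv feature maps by their pooled gradients."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    score = score_fn(model(image))        # scalar target, e.g. the top detection confidence
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    fmap, grad = feats[0], grads[0]                            # (1, C, h, w)
    weights = grad.mean(dim=(2, 3), keepdim=True)              # global-average-pooled gradients
    cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))    # weighted sum + ReLU
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalized heat map
```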
The visualization results clearly show that the baseline model YOLOv8s exhibits issues with missed detections and false positives. By contrast, the improved model Z-YOLOv8s demonstrates an enhanced focus on traffic objects. This indicates that the incorporated modules can reduce boundary uncertainty, leverage the central features of objects, and suppress extraneous information, thereby improving the detection accuracy in complex traffic scenarios with object occlusion. Furthermore, in traffic scenarios involving small objects, the Z-YOLOv8s model can effectively concentrate on the information relevant to these objects and accurately capture their key features. This improvement in the learning capacity of the network reduces the number of missed small-object detections, thereby enhancing the overall detection performance.

5. Conclusion

To solve the issues of incorrect and missed detections caused by occluded and small objects in complex traffic scenarios, we developed a road object detection network model based on YOLOv8s, namely Z-YOLOv8s. This model incorporates RepViTC2f modules to suppress background information loss. In addition, the LSKNet block attention module was incorporated to improve the feature extraction capabilities of the model. We employed SPD-Conv convolution and developed a SoftPool-SPPF module to further mitigate overfitting. These enhancements improve the ability of the YOLOv8s model to handle noisy and low-quality images, thereby enhancing its performance in small-object detection. Through multiple validation and ablation experiments on the BDD100K and KITTI benchmarks, the assessment results verified the excellence of Z-YOLOv8s. Compared with YOLOv8s, Z-YOLOv8s achieved a significant enhancement of 7.3 % in [email protected] on the BDD100K dataset and a 3.8 % increase in [email protected] on the KITTI dataset. Additionally, the optimized algorithm improved AP-small accuracy by 5.7 % and 6.5 % on these datasets, respectively. Moreover, Z-YOLOv8s maintains real-time inference speeds, achieving 78.41 FPS on BDD100K and 87.3 FPS on KITTI, thereby ensuring efficient and accurate performance
in complex traffic scenarios. Furthermore, Z-YOLOv8s exhibited a better balance between detection precision and speed than the state-of-the-art detectors, suggesting its suitability for intelligent driving applications. The ablation experiments further illustrate the effectiveness of our proposed algorithm, and the visualization results provide a novel direction for analyzing the contribution of each module and the variations in the loss function. In the future, we plan to deploy the enhanced model on resource-constrained embedded devices for traffic scene object detection. This deployment aims to enhance the algorithm's robustness and practicality in real-world scenarios while we continue to refine the proposed algorithm and methodology.

CRediT authorship contribution statement

Ruixin Zhao: Writing – original draft, Validation, Conceptualization. SaiHong Tang: Writing – review & editing, Supervision. Eris Elianddy Bin Supeni: Software, Investigation. Sharafiz Abdul Rahim: Resources, Data curation. Luxin Fan: Visualization, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] A. Boukerche, Z. Hou, Object detection using deep learning methods in traffic scenarios, ACM Comput. Surv. 54 (2022) 1–35, https://doi.org/10.1145/3434398.
[2] B. Wu, F. Iandola, P.H. Jin, K. Keutzer, SqueezeDet: unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (2017) 129–137.
[3] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2014) 580–587.
[4] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (2015) 1904–1916.
[5] R. Girshick, Fast R-CNN, Proc. IEEE Int. Conf. Comput. Vis. (2015) 1440–1448.
[6] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, Adv. Neural Inf. Process. Syst. 28 (2015).
[7] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, Proc. IEEE Int. Conf. Comput. Vis. (2017) 2961–2969.
[8] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2016) 779–788.
[9] J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2017) 7263–7271.
[10] J. Redmon, A. Farhadi, YOLOv3: an incremental improvement, (2018).
[11] A. Bochkovskiy, C.-Y. Wang, H.-Y.M. Liao, YOLOv4: optimal speed and accuracy of object detection, (2020).
[12] G. Jocher, A. Stoken, J. Borovec, L. Changyu, A. Hogan, L. Diaconu, J. Poznanski, L. Yu, P. Rai, R. Ferriday, ultralytics/yolov5: v3.0, Zenodo (2020).
[13] C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, W. Nie, Y. Li, B. Zhang, Y. Liang, L. Zhou, X. Xu, X. Chu, X. Wei, X. Wei, YOLOv6: a single-stage object detection framework for industrial applications, (2022).
[14] C.-Y. Wang, A. Bochkovskiy, H.-Y.M. Liao, YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (2023) 7464–7475.
[15] G. Jocher, A. Chaurasia, J. Qiu, YOLO by Ultralytics, Ultralytics, (2023).
[16] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A.C. Berg, SSD: single shot multibox detector, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Comput. Vis. – ECCV 2016, Springer International Publishing, Cham, 2016, pp. 21–37, https://doi.org/10.1007/978-3-319-46448-0_2.
[17] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, Proc. IEEE Int. Conf. Comput. Vis. (2017) 2980–2988.
[18] V.K. Sharma, P. Dhiman, R.K. Rout, Improved traffic sign recognition algorithm based on YOLOv4-tiny, J. Vis. Commun. Image Represent. 91 (2023) 103774.
[19] S. Wang, Z. Qu, C. Li, L. Gao, BANet: small and multi-object detection with a bidirectional attention network for traffic scenes, Eng. Appl. Artif. Intell. 117 (2023) 105504.
[20] S. Grigorescu, B. Trasnea, T. Cocias, G. Macesanu, A survey of deep learning techniques for autonomous driving, J. Field Robot. 37 (2020) 362–386, https://doi.org/10.1002/rob.21918.
[21] Z. Yu, L. Li, J. Xie, C. Wang, W. Li, X. Ning, Pedestrian 3D shape understanding for person re-identification via multi-view learning, IEEE Trans. Circuits Syst. Video Technol. (2024), https://doi.org/10.1109/TCSVT.2024.3358850.
[22] P. Zhang, X. Yu, C. Wang, J. Zheng, X. Ning, X. Bai, Towards effective person search with deep learning: a survey from systematic perspective, Pattern Recognit. 152 (2024) 110434, https://doi.org/10.1016/j.patcog.2024.110434.
[23] H. Wang, Y. Xu, Y. He, Y. Cai, L. Chen, Y. Li, M.A. Sotelo, Z. Li, YOLOv5-Fog: a multiobjective visual detection algorithm for fog driving scenes based on improved YOLOv5, IEEE Trans. Instrum. Meas. 71 (2022) 1–12.
[24] T. Li, G. Pang, X. Bai, J. Zheng, L. Zhou, X. Ning, Learning adversarial semantic embeddings for zero-shot recognition in open worlds, Pattern Recognit. 149 (2024) 110258, https://doi.org/10.1016/j.patcog.2024.110258.
[25] Y. Shi, X. Li, M. Chen, SC-YOLO: a object detection model for small traffic signs, IEEE Access 11 (2023) 11500–11510.
[26] D. Tian, Y. Han, S. Wang, Object feedback and feature information retention for small object detection in intelligent transportation scenes, Expert Syst. Appl. 238 (2024) 121811, https://doi.org/10.1016/j.eswa.2023.121811.
[27] G. Oreski, YOLO*C — Adding context improves YOLO performance, Neurocomputing 555 (2023) 126655.
[28] P. Cong, H. Feng, S. Li, T. Li, Y. Xu, X. Zhang, A visual detection algorithm for autonomous driving road environment perception, Eng. Appl. Artif. Intell. 133 (2024) 108034, https://doi.org/10.1016/j.engappai.2024.108034.
[29] X. Tang, W. Xu, K. Li, M. Han, Z. Ma, R. Wang, PIAENet: pyramid integration and attention enhanced network for object detection, Inf. Sci. 670 (2024) 120576, https://doi.org/10.1016/j.ins.2024.120576.
[30] J. Zhan, Y. Luo, C. Guo, Y. Wu, J. Meng, J. Liu, YOLOPX: anchor-free multi-task learning network for panoptic driving perception, Pattern Recognit. 148 (2024) 110152, https://doi.org/10.1016/j.patcog.2023.110152.
[31] X. Xiang, Z. Wang, Y. Qiao, An improved YOLOv5 crack detection method combined with transformer, IEEE Sens. J. 22 (2022) 14328–14335, https://doi.org/10.1109/JSEN.2022.3181003.
[32] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: transformers for image recognition at scale, (2021).
[33] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: A. Vedaldi, H. Bischof, T. Brox, J.-M. Frahm (Eds.), Comput. Vis. – ECCV 2020, Springer International Publishing, Cham, 2020, pp. 213–229, https://doi.org/10.1007/978-3-030-58452-8_13.
[34] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: hierarchical vision transformer using shifted windows, Proc. IEEE/CVF Int. Conf. Comput. Vis. (2021) 10012–10022.
[35] A. Wang, H. Chen, Z. Lin, J. Han, G. Ding, RepViT: revisiting mobile CNN from ViT perspective, (2023).
[36] J. Pan, A. Bulat, F. Tan, X. Zhu, L. Dudziak, H. Li, G. Tzimiropoulos, B. Martinez, EdgeViTs: competing light-weight CNNs on mobile devices with vision transformers, in: S. Avidan, G. Brostow, M. Cissé, G.M. Farinella, T. Hassner (Eds.), Comput. Vis. – ECCV 2022, Springer Nature Switzerland, Cham, 2022, pp. 294–311, https://doi.org/10.1007/978-3-031-20083-0_18.
[37] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2018) 7132–7141.
[38] Y. Li, Q. Hou, Z. Zheng, M.-M. Cheng, J. Yang, X. Li, Large selective kernel network for remote sensing object detection, (2023).
[39] R. Sunkara, T. Luo, No more strided convolutions or pooling: a new CNN building block for low-resolution images and small objects, in: M.-R. Amini, S. Canu, A. Fischer, T. Guns, P. Kralj Novak, G. Tsoumakas (Eds.), Mach. Learn. Knowl. Discov. Databases, Springer Nature Switzerland, Cham, 2023, pp. 443–459, https://doi.org/10.1007/978-3-031-26409-2_27.
[40] A. Stergiou, R. Poppe, G. Kalliatakis, Refining activation downsampling with SoftPool, Proc. IEEE/CVF Int. Conf. Comput. Vis. (2021) 10357–10366.
[41] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, T. Darrell, BDD100K: a diverse driving dataset for heterogeneous multitask learning, Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (2020) 2636–2645.
[42] A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: the KITTI dataset, Int. J. Robot. Res. 32 (2013) 1231–1237, https://doi.org/10.1177/0278364913491297.
[43] Z. Cai, N. Vasconcelos, Cascade R-CNN: delving into high quality object detection, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2018) 6154–6162.
[44] Z. Ge, S. Liu, F. Wang, Z. Li, J. Sun, YOLOX: exceeding YOLO series in 2021, (2021).
[45] R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: visual explanations from deep networks via gradient-based localization, IEEE Int. Conf. Comput. Vis. (2017) 618–626.