
Alexandria Engineering Journal 106 (2024) 298–311


Original article

Z-YOLOv8s-based approach for road object recognition in complex traffic scenarios

Ruixin Zhao *, Sai Hong Tang *, Eris Elianddy Bin Supeni, Sharafiz Abdul Rahim, Luxin Fan
Faculty of Engineering, Universiti Putra Malaysia, Serdang 43400, Malaysia

ARTICLE INFO

Keywords: Road environment; Object detection; YOLOv8; Deep learning; Autonomous driving

ABSTRACT

Object detection in road scenarios is crucial for intelligent transport systems and autonomous driving, but complex traffic conditions pose significant challenges. This paper introduces Z-You Only Look Once version 8 small (Z-YOLOv8s), designed to improve both accuracy and real-time efficiency under real-world uncertainties. By incorporating the Revisiting Perspective Vision Transformer (RepViT) and C2f into the YOLOv8s framework, and integrating the Large Selective Kernel Network (LSKNet), the model enhances spatial feature extraction. Additionally, the YOLOv8s backbone is optimized with Space-to-Depth Convolution (SPD-Conv) for better small object detection, and the SoftPool-Spatial Pyramid Pooling Fast (SoftPool-SPPF) module ensures precise preservation of characteristic information. Z-YOLOv8s improves mean average precision (mAP)@0.5 on the Berkeley Deep Drive 100K (BDD100K) and Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) datasets by 7.3 % and 3.8 %, respectively. It also achieves accuracy increases of 5.7 % and 6.5 % in Average Precision (AP)-Small, and a real-time detection speed of 78.41 frames per second (FPS) on BDD100K. Z-YOLOv8s balances detection precision and processing speed more effectively than other detectors, as demonstrated by experimental results and comparisons.

1. Introduction

With the rapid advancement of autonomous driving technology, road object detection has emerged as a pivotal area of research. The detection of objects in road environments is regarded as a critical aspect of the environmental perception system of autonomous vehicles. For devices with limited computational resources, it is imperative to detect objects quickly and accurately in real traffic scenarios to ensure safe and reliable driving behaviors and decision-making [1,2]. With their improved generalization and precision, deep-learning-based methods are increasingly supplanting traditional algorithms as the dominant approach for object detection. These techniques have demonstrated promising results in the detection of objects in traffic scenes. However, several significant challenges remain. In the current landscape of autonomous driving environment perception tasks, factors such as diverse weather conditions, lighting variations, object occlusions, and the presence of small objects in real traffic scenarios introduce substantial uncertainty, thereby reducing the precision of road object detectors.

Currently, road object detection algorithms that use deep learning are classified into one-stage and two-stage approaches. Among two-stage algorithms, the Regions with Convolutional Neural Network (R-CNN) model was reported first [3]. R-CNN transforms traditional object detection into a regional feature extraction and classification process. Spatial Pyramid Pooling (SPP) efficiently handles objects of different sizes and addresses the information loss in the R-CNN model caused by normalization [4]. Fast R-CNN employs end-to-end training to jointly learn classification and regression tasks by sharing convolutional layers, thereby significantly reducing network training and testing times [5]. Faster R-CNN incorporates a Region Proposal Network (RPN), leading to substantial improvements in detection speed and precision [6]. Mask R-CNN extends this by not only detecting and localizing objects but also generating pixel-level segmentation masks for each detected object, thus achieving accurate segmentation [7]. Although two-stage algorithms have become increasingly precise, they often suffer from low detection speed. In contrast, one-stage algorithms such as the You Only Look Once (YOLO) family [8–15], the single-shot multibox detector (SSD) [16], and RetinaNet [17] use regression to simultaneously classify objects and predict bounding boxes. The YOLO algorithm uses the entire image as input and directly regresses the location and

* Corresponding authors.
E-mail addresses: [email protected] (R. Zhao), [email protected] (S.H. Tang).

https://doi.org/10.1016/j.aej.2024.07.011
Received 18 May 2024; Received in revised form 21 June 2024; Accepted 2 July 2024
Available online 13 July 2024
1110-0168/© 2024 The Author(s). Published by Elsevier BV on behalf of Faculty of Engineering, Alexandria University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

class of the bounding box. The YOLO and SSD algorithms offer real-time detection speeds and are increasingly superior to two-stage algorithms in terms of precision, making them widely used in autonomous driving applications.

To achieve faster and more precise road-object detection in complex traffic situations, research efforts have focused on enhancing deep-learning object detection algorithms to improve precision and speed. For instance, Reference [18] proposed a YOLOv4-Tiny-based method for traffic sign object detection, and [19] introduced an enhanced YOLOv5 model for small object detection in traffic scenes. Therefore, YOLO methods can learn significant features to detect objects in traffic scenarios.

To reduce the effect of uncertainties on the effectiveness of object detection in complex traffic scenarios, this study primarily investigated one-stage methods and explored their potential for performance improvement. We chose YOLOv8 [15] as the foundational model and propose a real-time accurate object detector called Z-YOLOv8s. The proposed model aims to accomplish two main objectives: enhancing the detection accuracy, particularly for small objects, and achieving real-time processing speed.

The primary contributions are as follows:
First, a Revisiting Perspective Vision Transformer (RepViT)C2f module was introduced, combining the RepViT structure with the C2f module to enable the network to integrate local features and global semantic information. This fosters strong local correlations and global modeling capabilities.
Second, the proposed SoftPool-Spatial Pyramid Pooling Fast (SoftPool-SPPF) module enhances the detection of small objects in complex traffic scenarios by addressing the challenge of insufficient fine-grained information extraction at the edges of the model.
Third, experiments performed on publicly available datasets demonstrated that the Z-YOLOv8s model achieves advanced detection accuracy while maintaining superior speed compared with several state-of-the-art models, showing a balanced performance in terms of both speed and precision.

The structure of this work is organized as follows: Section 2 presents a review of prior studies. Sections 3 and 4 propose the Z-YOLOv8s method and present the experimental analysis, respectively. Section 5 summarizes the key contributions and outlines potential directions for future research.

2. Related works

In this section, we present a concise overview of recent object detection methods.

2.1. Object detection in traffic scenarios

With the rapid developments in science, technology, and the automotive industry, autonomous driving has gradually become a major focus of automotive research. This shift has positioned autonomous driving systems as a central point of interest in the transportation sector [20]. Deep learning object detection methods in traffic scenarios have gained widespread popularity.

Specifically, one-stage detection algorithms are characterized by high detection speeds, offering an excellent balance between precision and speed. This balance is particularly important for real-time detection in autonomous vehicles. Various methods for detecting traffic scenario objects have been continuously developed. For example, Yu et al. proposed a three-dimensional (3D) multi-view learning-based person re-identification (ReID) method that addresses the challenges of occluded pedestrians. Traditional two-dimensional (2D) methods fail to capture the full 3D characteristics of an individual. The new network structure, multiview learning (MV-3DSReID), combines the advantages of 2D and 3D multiviews, captures geometric and shape details from a 3D space, and extracts semantic representations using 2D networks. This approach significantly improved the accuracy of ReID tasks in both occluded and holistic scenarios [21]. Zhang et al. reviewed deep learning-based person search methods and highlighted the integration of detection and re-identification tasks. They introduced a new taxonomy, evaluated state-of-the-art techniques, and explored future research directions to address challenges such as occlusion and scale variation in practical applications [22]. However, their method only improved the accuracy of identifying occluded objects and was ineffective in detecting small objects in complex backgrounds.

Wang et al. developed an automatic fog detection algorithm using YOLOv5, with a new backbone network, re-parameterization aggregated residual transformations for deep neural networks (Rep-ResNeXt), to enhance the network feature extraction speed and accuracy in fog driving scenes. Additionally, a feature enhancement module (FEM) is employed to automatically extract features from the foggy images and other key parameters. This approach enhanced the accuracy and speed of object detection in foggy weather conditions [23]. However, a complex network architecture increases the inference time, which further limits its applicability to autonomous driving. To address the Zero-Shot Open-Set Recognition (ZS-OSR) problem, Li et al. proposed the adversarial semantic embedding (ASE) method. This method ensures that these embeddings are closely clustered around the unseen class embeddings, while remaining distinct from the unknown class embeddings. Using both novel and unfamiliar features for training, this approach effectively trained an open-set classifier. Experimental results indicate that this method significantly boosts the classification accuracy and improves the rejection rates of unknown classes [24].

To improve YOLO performance in traffic scenarios, Shi et al. introduced a network module to enhance feature extraction, used a dense neck structure to merge details and semantics, and combined SCYLLA-Intersection over Union (SIoU) with orientation information in the loss function to improve convergence and precision for detecting small objects in traffic [25]. Tian et al. developed a technique for identifying small objects in intelligent transportation scenarios that incorporates object feedback and retains feature information. The small object Intersection over Union (SOIoU) loss function is designed to adaptively optimize small objects, and a small-object path aggregation network (SOPNet) was adopted to retain detailed features. The results demonstrate that the proposed method achieves superior detection accuracy and outperforms existing methods [26]. However, these methods are generally limited to the detection of small objects in traffic scenarios and have significant constraints.

Oreski improved the detection outcomes of the YOLO method in traffic scenarios by considering the multi-context (MCTX) module and integrating changes in the loss function. This approach effectively exploits rich global contextual information without compromising efficiency [27]. Cong et al. introduced a lightweight detection algorithm based on a modified YOLO model. This architecture primarily achieves the effective extraction and utilization of object feature information through the interaction of information between subnets. In addition, a lightweight distributed shift C3 (DSC3) module was designed to resolve issues related to model computation and label assignment. This method enhances the ability to detect environmental objects in traffic scenarios [28]. Tang et al. presented a pioneering object detection method named the pyramid integration and attention-enhanced network (PIAENet), which seamlessly integrates the Pyramid Integration Module (PIM) and the attention-enhanced module (AEM) to attain superior accuracy and efficiency. The PIM augments the receptive field of the model by amalgamating multiscale features through multiple branches. Moreover, the AEM improves feature fusion by utilizing double-attention mechanisms to effectively reduce the impact of irrelevant information [29]. Zhan et al. proposed an anchor-free multitasking learning network for panoptic driving perception (YOLOPX). It features an anchor-free detection head for improved adaptability and scalability, a lightweight lane-detection head with multiscale high-resolution features, and Polarized Self-Attention (PSA) modules for efficient training and

Fig. 1. Overview of architecture of Z-YOLOv8s.

superior performance [30]. However, the effectiveness of these methods is diminished because of their high computational complexity and poor real-time performance.

2.2. Vision transformer

With progress in deep learning, the transformer model has achieved significant breakthroughs in natural language processing. Because of the limitations of convolutional kernels in acquiring information, researchers have begun to apply transformer models to computer vision tasks [31]. Dosovitskiy et al. introduced the vision transformer (ViT), which demonstrated the excellent performance of self-attention mechanisms in computer vision tasks [32]. Unlike traditional convolutional neural networks (CNNs) that recognize local patterns and features, ViT applies a transformer architecture to image patches for object classification. By employing multi-head self-attention to capture long-range dependencies, transformer models have achieved state-of-the-art results in classic computer vision tasks. ViT demonstrated that a pure transformer architecture can outperform CNNs in computer vision tasks when trained on large datasets. However, it requires extensive data to perform optimally and lacks the inductive bias inherent in CNNs. Moreover, as the input image size increases, the sequence length and complexity also increase. This was evident in the Detection Transformer (DETR), which was the first successful attempt to use a transformer for object detection [33]. DETR comprises a pretrained CNN backbone and a transformer. It uses ResNets to generate low-dimensional features, combines these features into a single feature set, adds position encodings, and feeds them into a transformer. However, the complexity and the high computational and hardware demands of transformers render them less practical for real-world applications. The Swin Transformer introduced the concept of shifted windows from CNNs to transformers. This approach leverages the ViT patch-based technique by segmenting the input image into separate non-overlapping patches, so that the computational load of the local self-attention mechanism increases only linearly with the image dimensions. Consequently, the Swin Transformer uses more parameters than convolutional models [34]. Wang et al. proposed RepViT, which optimizes the MobileNetV3 mobile network by revisiting it from a ViT perspective. Consequently, RepViT achieves an excellent balance between precision and real-time performance [35].


Fig. 2. Structural diagram of RepViTC2f.

3. Methodology

In this section, we present a comprehensive description of the proposed method. Section 3.1 introduces the overall architecture. In Section 3.2, we describe the specific components, including the transformer-based backbone module (RepViT), the LSKNet attention module, the SPD-Conv module, and the SoftPool-SPPF module.

3.1. Architecture overview

In this study, we developed a traffic-scene detection network model, namely Z-YOLOv8s. This model preserves real-time detection speed while enhancing object detection accuracy in complex traffic scenarios, effectively addressing challenges such as occlusions and small objects. The improved Z-YOLOv8s architecture is illustrated in Fig. 1.
First, the backbone network employs a reconfigured RepViTC2f architecture, which enhances the feature extraction capabilities by integrating global semantic information.
Second, an LSKNet module with an attention mechanism was added before the detection head to better address incorrect and missed detections in densely occluded scenes. LSKNet dynamically adjusts the receptive fields, allowing the model to adaptively employ different kernels and to adjust the receptive fields based on the spatial requirements of each object. This flexibility is crucial for the detection of road objects in complex traffic situations.
Third, the SPD-Conv module is integrated within the backbone network to ensure lossless information transfer, thereby improving the extraction of features related to small objects.
Finally, the SoftPool-SPPF module is implemented as a substitute for the original SPPF module. SoftPool retains more information in the downsampled activation mapping, addressing the spatial information limitations caused by max-pooling operations and thereby preserving the detailed features of small objects within complex backgrounds.
We observed that the modules within our framework enhance the detection accuracy while maintaining real-time detection speed.

3.2. Improvement of YOLOv8s network architecture design

In this section, we introduce the RepViTC2f, LSKNet, SPD-Conv, and SoftPool-SPPF modules used in this study.

3.2.1. RepViTC2f module
The YOLOv8s backbone network relies heavily on the C2f module to utilize features of various scales and incorporate contextual information, resulting in improved detection accuracy. However, the stacking of multiple C2f modules can result in redundant channels. Additionally, the standard convolutional kernels used in the C2f module primarily involve local processing, which potentially limits their effectiveness in capturing the relationships between global features. This can result in false negatives, particularly when multiple or occluded objects are handled.
To enhance the global modeling capability of the detection model, the RepViT [35] structure was incorporated into the backbone of the YOLOv8s model by replacing the bottleneck module in C2f with RepViT, that is, RepViTC2f. The RepViTC2f structure is illustrated in Fig. 2. This integration aims to overcome the limitations of the C2f module and enhance the capacity of the model to capture broader global relationships. Consequently, the RepViTC2f module replaces the C2f module in the YOLOv8s backbone, creating a new backbone network with enhanced global modeling capabilities. YOLOv8s leverages the RepViTC2f network to promote interaction between local and global information and construct sufficient feature representations. This approach mitigates the issue of densely occluded objects.
RepViT leverages a multi-head self-attention mechanism to enable the model to capture diverse global representations effectively [36]. The RepViT module advances deep convolutions within the MobileNetV3 architecture, enabling the separation of the channel and token mixers. It introduces structural reparameterization to establish a large topology with multiple branches for the depth filters throughout the training process. In the channel mixer, the expansion ratio is reduced to 2, effectively reducing parameter redundancy. A separate and deeper downsampling layer is employed during spatial downsampling that adjusts the channel dimensions using a convolution. The inputs and outputs of the two convolutions are then connected via residual connections to form a feedforward network (FFN). This approach increases the network depth and reduces information loss owing to the decreased resolution [37]. Therefore, integrating C2f with RepViT enhances the ability of the network to interact with both local and global information, thereby constructing comprehensive feature representations.
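As a rough illustration of this design, the PyTorch sketch below shows one way a C2f-style block could host RepViT-style blocks in place of the standard bottlenecks. It is a simplified interpretation under stated assumptions: the structural-reparameterization branches, squeeze-and-excitation attention, and exact layer names are omitted or invented here and do not reproduce the authors' implementation.

import torch
import torch.nn as nn

class RepViTBlock(nn.Module):
    # Simplified RepViT-style block: a 3x3 depthwise token mixer and a 1x1
    # channel mixer with expansion ratio 2, each wrapped in a residual connection.
    def __init__(self, dim):
        super().__init__()
        self.token_mixer = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim, bias=False),
            nn.BatchNorm2d(dim),
        )
        self.channel_mixer = nn.Sequential(
            nn.Conv2d(dim, 2 * dim, 1, bias=False),
            nn.BatchNorm2d(2 * dim),
            nn.GELU(),
            nn.Conv2d(2 * dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        x = x + self.token_mixer(x)    # local (spatial) mixing
        x = x + self.channel_mixer(x)  # FFN-style channel mixing
        return x

class RepViTC2f(nn.Module):
    # C2f-style block with the bottlenecks replaced by RepViT blocks.
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.hidden = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * self.hidden, 1)
        self.blocks = nn.ModuleList(RepViTBlock(self.hidden) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * self.hidden, c_out, 1)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for block in self.blocks:
            y.append(block(y[-1]))
        return self.cv2(torch.cat(y, dim=1))

A quick shape check, e.g. RepViTC2f(64, 64, n=2)(torch.randn(1, 64, 80, 80)), confirms that the block preserves the spatial resolution while mixing local and cross-stage information, which is the behavior the RepViTC2f module is intended to provide.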
3.2.2. LSKNet attention module
Because of the complexity of real-world traffic scenarios, including adverse weather conditions, lighting variations, and dense object occlusions, relying on limited background information can lead to erroneous detection. For instance, misidentifying a billboard as a traffic sign can result in incorrect judgments and potentially cause traffic accidents. Additionally, different viewpoints and distances in traffic scenarios require varying contextual information for accurate detection. However, excessive contextual information may obscure object features and reduce the detection precision.
To better extract the relevant feature information, the LSKNet [38] module is combined with the head of the baseline YOLOv8s model. LSKNet flexibly adjusts the receptive field, allowing the model to use different large kernels and adjust the receptive field according to the specific


Fig. 3. Structural diagram of LSKNet.

spatial characteristics of each object. This flexibility is crucial for the accurate detection of road objects in complex traffic scenarios.
The Large Selective Kernel (LSK) attention mechanism dynamically selects convolutional kernels and adapts to diverse contextual information by considering local details from the input feature map. It adjusts its receptive area to suit different object types and contexts. LSKNet is divided into two sub-blocks: the FFN and large-kernel (LK) selection. The FFN is employed to combine channels and enhance feature details, and consists of a sequence that includes a fully connected layer, depth-wise convolution, Gaussian Error Linear Unit (GELU) activation, and another fully connected layer. Similarly, the LK selection sub-block consists of a sequence that incorporates a fully connected layer, an LSK sub-block, and GELU activation, followed by another fully connected layer. The key element of LSKNet is the LSK sub-block, which incorporates large-kernel convolutions generated by decomposing them into a sequence of kernels with progressively larger sizes and depthwise convolutions with higher dilation rates. Specifically, the expansion of the kernel size k, dilation rate d, and receptive field RF of the i-th depth-wise convolution is shown in Eq. (1).

k_{i−1} ≤ k_i;  d_1 = 1,  d_{i−1} < d_i ≤ RF_{i−1};  RF_1 = k_1,  RF_i = d_i (k_i − 1) + RF_{i−1}   (1)

The receptive field can be expanded rapidly by enlarging the kernel sizes and increasing their dilation rates. A larger upper limit for the dilation rate was adopted to avoid gaps among the feature maps. The proposed approach simplifies the subsequent kernel selection and significantly decreases the number of parameters. Furthermore, using a set of decomposed depthwise convolutions with various receptive fields, it is possible to capture characteristics at various ranges using contextual information. This enables the spatial feature vectors to be mixed across the channels. The calculations are presented in Eqs. (2) and (3).

U_0 = X,  U_{i+1} = F_i^{dw}(U_i)   (2)

Ũ_i = F^{1×1}(U_i),  for i in [1, N]   (3)

The expression F_i^{dw} represents a depth-wise convolution with kernel k_i and dilation d_i, assuming the existence of N decomposed kernels, each processed through a 1 × 1 convolution layer denoted as F^{1×1}.
LSKNet uses a spatial kernel selection algorithm to enhance its focus on key spatial context areas by selecting features from large convolutional kernels of various scales. It segments an input feature map into smaller subsets, applies kernels of varying sizes to each subset, and generates multiple output feature maps. These individual output feature maps are then combined or aggregated according to Eq. (4).

Ũ = [Ũ_1; …; Ũ_i]   (4)

Spatial relation descriptors are obtained by pooling the feature maps using both average and maximum pooling operations along the channel direction, that is, SA_avg and SA_max, as defined in Eq. (5).

SA_avg = P_avg(Ũ),  SA_max = P_max(Ũ)   (5)

After concatenating SA_avg and SA_max, convolutional layers are used to convert these descriptors into spatial attention maps whose number equals the number of depth convolutions N, as defined in Eq. (6). Applying the sigmoid activation function to each spatial attention map yields the spatial selection weights for each depth-wise convolution.

Fig. 4. Structural diagram of SPD-Conv.


These weights are used for element-wise multiplication with the corresponding depth-wise convolution feature maps, resulting in weighted feature maps. Finally, a convolution layer fuses the maps to generate the final attention features. This process is described by Eqs. (7) and (8). The eventual output of the LSK module is expressed as the element-wise product of the input feature X and the spatial attention map S, as illustrated in Eq. (9).

ŜA = F^{2→N}([SA_avg; SA_max])   (6)

S̃A_i = σ(ŜA_i)   (7)

S = F(∑_{i=1}^{N} S̃A_i · Ũ_i)   (8)

Y = X · S   (9)

Therefore, LSKNet fulfills the requirement for a more comprehensive and adaptable understanding of information in complex traffic scenarios. Consequently, LSKNet was integrated into the detection head network to enhance the ability of the model to extract features. Fig. 3 presents the structural diagram of LSKNet.
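To make Eqs. (1)–(9) concrete, the following PyTorch sketch renders a simplified LSK block with two decomposed depthwise kernels (a 5 × 5 kernel with dilation 1 followed by a 7 × 7 kernel with dilation 3, one plausible decomposition). The layer arrangement and kernel choices are assumptions for illustration and are not claimed to match the authors' configuration.

import torch
import torch.nn as nn

class LSKBlock(nn.Module):
    # Large Selective Kernel block: decomposed depthwise convolutions (Eq. (2)),
    # 1x1 projections (Eq. (3)), concatenation (Eq. (4)), channel-wise avg/max
    # pooling (Eq. (5)), per-branch spatial attention (Eqs. (6)-(7)), weighted
    # fusion (Eq. (8)), and gating of the input feature (Eq. (9)).
    def __init__(self, dim):
        super().__init__()
        self.dw_small = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_large = nn.Conv2d(dim, dim, 7, padding=9, groups=dim, dilation=3)
        self.proj1 = nn.Conv2d(dim, dim // 2, 1)
        self.proj2 = nn.Conv2d(dim, dim // 2, 1)
        self.squeeze = nn.Conv2d(2, 2, 7, padding=3)  # F^{2->N}, N = 2 branches
        self.fuse = nn.Conv2d(dim // 2, dim, 1)

    def forward(self, x):
        s = self.dw_small(x)                      # U_1
        u1 = self.proj1(s)                        # U~_1
        u2 = self.proj2(self.dw_large(s))         # U~_2 (sequential decomposition)
        u = torch.cat([u1, u2], dim=1)            # Eq. (4)
        sa_avg = u.mean(dim=1, keepdim=True)      # SA_avg
        sa_max, _ = u.max(dim=1, keepdim=True)    # SA_max
        weights = torch.sigmoid(self.squeeze(torch.cat([sa_avg, sa_max], dim=1)))
        attn = u1 * weights[:, 0:1] + u2 * weights[:, 1:2]  # Eq. (8)
        return x * self.fuse(attn)                # Eq. (9)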
3.2.3. SPD-Conv module
Images captured in typical traffic scenarios generally offer good resolution and moderately sized objects. Object detection models therefore employ design elements, such as strided convolutions and pooling operations, to skip unnecessary pixel-level details and efficiently extract object features. However, for small objects in traffic scenarios, particularly overlapping and occluded small objects, the assumption that the skipped information is redundant no longer holds. This limitation can severely restrict the ability of the model to capture detailed features, thereby reducing the accuracy of road-object detection. SPD-Conv [39] detects small objects in challenging traffic environments and retains detailed features during convolution, and it was integrated into the backbone network of YOLOv8s. By removing strided convolutions and pooling operations, SPD-Conv preserves detailed information during downsampling, thereby improving pattern-learning efficiency and feature representation capability.
The SPD-Conv module consists of an SPD layer and a non-strided convolution, effectively replacing traditional pooling and strided convolution layers in CNNs. As illustrated in Fig. 4, the SPD layer decreases the resolution of the feature map X while maintaining all channel information, thereby avoiding data loss. When SPD is applied to an input feature map X of size (S, S, C_1), it generates sub-feature maps according to the splitting formula in Eq. (10). At a scale of 2, four sub-features (f_{0,0}, f_{0,1}, f_{1,0}, and f_{1,1}), each of size (S/2, S/2, C_1), are derived from the feature map X. These sub-feature maps are concatenated to obtain a feature map X′ of size (S/2, S/2, 4C_1). Importantly, this process maintains all of the information along the channel dimension.

f_{0,0} = X[0:S:scale, 0:S:scale], f_{1,0} = X[1:S:scale, 0:S:scale], …, f_{scale−1,0} = X[scale−1:S:scale, 0:S:scale];
f_{0,1} = X[0:S:scale, 1:S:scale], …, f_{scale−1,1} = X[scale−1:S:scale, 1:S:scale];
…;
f_{scale−1,scale−1} = X[scale−1:S:scale, scale−1:S:scale].   (10)

Finally, a non-strided convolution (stride of 1) is applied to the characteristic map output by the SPD operation to capture rich feature information. This approach prevents the loss of feature-map information that can occur with strided convolution. The SPD-Conv module effectively preserves the critical characteristics of low-resolution images and enhances the detection of small objects, thereby significantly improving accuracy in traffic scenario applications.
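A compact sketch of the space-to-depth step in Eq. (10) followed by a non-strided convolution is given below. For scale = 2 the slicing is equivalent to a pixel-unshuffle operation; the output channel count and kernel size of the mixing convolution are assumptions of this sketch rather than the authors' exact settings.

import torch
import torch.nn as nn

class SPDConv(nn.Module):
    # Space-to-depth (Eq. (10)) followed by a stride-1 convolution:
    # an (S, S, C1) map becomes (S/2, S/2, 4*C1) without discarding pixels,
    # and a non-strided 3x3 convolution then mixes the stacked channels.
    def __init__(self, c_in, c_out, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(c_in * scale * scale, c_out, 3, stride=1, padding=1)

    def forward(self, x):
        s = self.scale
        # Gather the interleaved sub-feature maps f_{i,j} = X[i::s, j::s].
        parts = [x[..., i::s, j::s] for i in range(s) for j in range(s)]
        return self.conv(torch.cat(parts, dim=1))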
3.2.4. SoftPool-SPPF module
As the depth of a CNN increases, the receptive field size also increases. However, because the size of the input image is restricted, characteristic extraction is repeated over a large receptive field. To address this issue, YOLOv8s uses Spatial Pyramid Pooling Fast (SPPF) [12], which combines characteristic maps from diverse receptive fields. This module integrates both local and global characteristics, thereby maximizing the expressive power of the characteristic map. The purpose of SPPF is to extract the most crucial contextual characteristics for size-based object detection. To achieve this, SPPF employs multiple max-pooling operations to integrate features from receptive fields of different scales. Max pooling selects the highest value among the feature points within a neighborhood and effectively preserves texture features. Although this approach is generally suitable for standard applications where the model accuracy is not significantly impacted, it fails to capture spatial information when only limited road-object feature information is available in complex traffic scenes. This phenomenon not only results in the loss of valuable information but also complicates the detection of small objects.
This study proposes the SoftPool-SPPF module as an alternative to the SPPF module in the YOLOv8s backbone, as illustrated in Fig. 5.

Fig. 5. Structural diagram of SoftPool-SPPF.

SoftPool is a pooling variant that reduces data loss during the pooling process while retaining the functionality of the pooling layer [40]. It is particularly suitable for small object detection. SoftPool uses an exponential weighting method to retain more feature information in downsampled activation mappings, offering finer control than other pooling methods. It operates through distinct forward and backward phases. The forward process is differentiable, ensuring that each activation in the local neighborhood receives at least a minimal gradient value during the backward propagation phase. The characteristic map size is represented by C × H × W. In addition, R is the index set associated with activations in the two-dimensional spatial area, and each activation i in the activation area R is associated with a weight w_i. The index weighting calculation is given in Eqs. (11) and (12).

w_i = e^{a_i} / ∑_{j∈R} e^{a_j}   (11)

ã = ∑_{i∈R} w_i · a_i   (12)

In this process, w_i represents the weight of the i-th activation, and a_i is the i-th activation value in R. These weights ensure the effective propagation of important features.

Fig. 6. Labels length and width distribution of BDD100K.

The SoftPool operation assigns nonlinear weights to the associated activation values, thereby ensuring that all activations contribute to the final output. Here, ã denotes the output of SoftPool obtained by aggregating the weighted activations across the pooled kernel. Therefore, replacing SPPF with SoftPool-SPPF allows better preservation of the comprehensive characteristics of small objects in complicated traffic backgrounds.
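The exponential weighting in Eqs. (11) and (12) can be written compactly, as in the sketch below (a 2 × 2 kernel with stride 2 is an assumed configuration). It computes the SoftPool output as the ratio of an average pooling of exp(x)·x to an average pooling of exp(x), which is algebraically the weighted sum over each pooling region.

import torch
import torch.nn.functional as F

def soft_pool2d(x, kernel_size=2, stride=2):
    # SoftPool (Eqs. (11)-(12)): each activation a_i in a region R receives the
    # weight w_i = exp(a_i) / sum_j exp(a_j), and the output is sum_i w_i * a_i.
    # Note: production code would clamp x before exponentiation to avoid overflow.
    e = torch.exp(x)
    weighted = F.avg_pool2d(e * x, kernel_size, stride)
    norm = F.avg_pool2d(e, kernel_size, stride)
    return weighted / (norm + 1e-7)

In a SoftPool-SPPF module, a function of this kind would stand in for the successive max-pooling operations of SPPF, so that the pooled pyramid levels retain a weighted contribution from every activation rather than only the maxima.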
4. Experimental results

To demonstrate the efficiency of Z-YOLOv8s, extensive experiments were conducted using the BDD100K and KITTI datasets. This section provides details on the datasets used, the implementation and settings, and the assessment results.

4.1. Datasets

The BDD100K dataset, with each image containing up to 90 objects, many of which are small and occluded, was selected for the primary experimental validation to enable the model to recognize different complex traffic scenarios [41]. The KITTI dataset is a major international benchmark for evaluating computer vision methods in intelligent driving scenarios [42]. The KITTI dataset includes practical image data from different areas, with each image containing up to 15 vehicles and 30 pedestrians and featuring many small objects and varying levels of occlusion. For the BDD100K dataset, this paper performed a category reassignment by merging bike, car, bus, truck, train, and motorcycle into the "car" category, and person and rider into the "pedestrian" category, while retaining traffic signs and traffic lights. For the KITTI dataset, the classes "Truck," "Van," and "Tram" were incorporated into the "car" category, and the class "Person (sitting)" was combined with the "Pedestrian" category. Consequently, the final dataset retained labels for three categories: "Car," "Pedestrian," and "Cyclist." Following official recommendations, the BDD100K dataset was divided into training, validation, and test sets in a ratio of 7:1:2, whereas the KITTI dataset was divided using a ratio of 8:1:1. Fig. 6 and Fig. 7 show the normalized label distributions of BDD100K and KITTI, respectively; darker colors represent a higher density of bounding boxes. The plots reveal that the bounding boxes in both datasets are primarily clustered in the lower-left corner, indicating the notable presence of small objects. Sample images are shown in Fig. 8.
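The category reassignment described above can be expressed as a simple lookup table, as in the illustrative sketch below. The mapping follows the text; the exact label strings (for example, the KITTI "Person_sitting" spelling) and the helper function are assumptions and not the authors' preprocessing code.

# BDD100K: merge vehicle-like classes into "car" and person-like classes into
# "pedestrian", keeping traffic signs and traffic lights as separate classes.
BDD100K_REMAP = {
    "bike": "car", "car": "car", "bus": "car", "truck": "car",
    "train": "car", "motorcycle": "car",
    "person": "pedestrian", "rider": "pedestrian",
    "traffic sign": "traffic sign", "traffic light": "traffic light",
}

# KITTI: fold Truck/Van/Tram into "Car" and the sitting-person class into
# "Pedestrian", leaving three classes: Car, Pedestrian, Cyclist.
KITTI_REMAP = {
    "Car": "Car", "Truck": "Car", "Van": "Car", "Tram": "Car",
    "Pedestrian": "Pedestrian", "Person_sitting": "Pedestrian",
    "Cyclist": "Cyclist",
}

def remap_labels(labels, table):
    # Keep only annotations covered by the reassignment table and rewrite
    # their category names in place of the originals.
    return [dict(l, category=table[l["category"]]) for l in labels
            if l["category"] in table]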
4.2. Implementation

To prove the efficiency of Z-YOLOv8s for road object detection in traffic scenarios, model training and testing experiments were conducted using the following hardware and software configurations. The hardware setup included an Intel(R) Core(TM) i9-13900K processor with a clock frequency of 3.19 gigahertz (GHz), 60 gigabytes (GB) of random-access memory (RAM), and a GeForce Ray Tracing eXtreme (RTX) 4090 graphics processor with 24 GB of video memory.


Fig. 7. Labels length and width distribution of KITTI.

Fig. 8. Samples of the BDD100K (a-c) and KITTI dataset (d-f).

The deep learning framework used was PyTorch 1.8.1 with Torchvision 0.9.1, and the base version of YOLOv8 employed was Ultralytics 8.0.25. The algorithm was configured with a batch size of 8 and trained for 300 epochs. To address the presence of small objects in the sample images and achieve a balance between real-time performance and accuracy, the input size was normalized to 640 × 640. These dimensions facilitate the deployment of the model on edge devices without compromising the essential image information.


Table 1
Results of the Z-YOLOv8s ablation experiments on the BDD100K dataset. All rows use YOLOv8s as the base detector; the final row corresponds to Z-YOLOv8s.
RepViTC2f | LSKNet | SPD-Conv | SoftPool-SPPF | mAP@0.5 (%) | mAP@0.5:0.95 (%) | P (%) | R (%) | FPS | Parameters (M)
– | – | – | – | 67.9 | 35.2 | 74.6 | 60.8 | 136.25 | 11.1
√ | – | – | – | 68.3 | 35.9 | 75.2 | 61.5 | 110.31 | 13.13
√ | √ | – | – | 70.7 | 36.8 | 75.9 | 63.5 | 89.24 | 14.41
√ | √ | √ | – | 72.4 | 38.2 | 76.4 | 65.6 | 69.64 | 16.13
√ | √ | √ | √ | 75.2 | 39.5 | 77.1 | 68.5 | 78.41 | 15.7

Table 2
Results of the Z-YOLOv8s ablation experiments on the KITTI dataset. All rows use YOLOv8s as the base detector; the final row corresponds to Z-YOLOv8s.
RepViTC2f | LSKNet | SPD-Conv | SoftPool-SPPF | mAP@0.5 (%) | mAP@0.5:0.95 (%) | P (%) | R (%) | FPS | Parameters (M)
– | – | – | – | 90.6 | 67.1 | 91.4 | 80.6 | 156.1 | 11.1
√ | – | – | – | 92.2 | 67.6 | 92.7 | 85.3 | 132.8 | 13.13
√ | √ | – | – | 92.7 | 68 | 94.1 | 83.3 | 110.5 | 14.41
√ | √ | √ | – | 93.5 | 72.2 | 93.6 | 89.5 | 76.4 | 16.13
√ | √ | √ | √ | 94.4 | 73.5 | 94.3 | 89.7 | 87.3 | 15.7

To ensure fairness and comparability of the model properties, all ablation experiments and comparisons between different models were performed without the use of pre-trained weights. YOLOv8s was selected as the baseline model for enhancement and extension, aligned with the principles of the v8 series, where scaling is applied exclusively to the network width and depth.
4.3. Measurement index the total number of classes to be classified, [email protected] represents the
average AP across all classes at an intersection over union (IoU)
To quantitatively assess the properties of the object detector, we threshold of 0.5, and [email protected]:0.95 spans from 0.5 to 0.95 with a step
selected precision indices, such as precision (P), recall (R), mAP, and size of 0.05.
FPS. FPS denotes the number of frames that the model can detect per
second. The precision and recall are expressed in Eqs. (13) and (14), 4.4. Experimental of ablation analysis
respectively.
TP To validate the effectiveness of the proposed enhancements in object
P= × 100% (13) detection within traffic scenes, we performed ablation analysis using the
TP + FP
BDD100K and KITTI test sets. In the results tables, "√" indicates the use
TP of the modular approach described in this work, and comparisons are
R= × 100% (14)
TP + FN made with YOLOv8s. To ensure the authenticity of the experiments, we
∫ used [email protected], [email protected]:0.95, P, R, and FPS as the assessment indices.
1
AP = P(R)dR (15) The test results are presented in Tables 1 and 2, respectively.
0 Tables 1 and 2 illustrate that the Z-YOLOv8s algorithm shows

Fig. 9. P-R curve of every class for YOLOv8s and Z-YOLOv8s on BDD100K.


Fig. 10. P-R curve of every class for YOLOv8s and Z-YOLOv8s on KITTI.

Tables 1 and 2 illustrate that the Z-YOLOv8s algorithm shows advancements across multiple metrics compared with YOLOv8s. The network incorporates the RepViTC2f, LSKNet, SPD-Conv, and SoftPool-SPPF structures. Despite the increase in the number of parameters, the detection performance of the model shows significant improvement. On the BDD100K dataset, mAP@0.5 increased from 67.9 % to 75.2 %, a gain of 7.3 %, whereas mAP@0.5:0.95 increased from 35.2 % to 39.5 %, a gain of 4.3 %. On the KITTI dataset, mAP@0.5 increased by 3.8 % and mAP@0.5:0.95 increased by 6.4 %. By incorporating the various enhancement modules, the YOLOv8s parameter count grew from its initial 11.1 million (M) to a peak of 16.13 M. Despite the increase in parameters, which led to a reduction in inference speed, the model continued to satisfy the demands of real-time detection, reaching 78.41 FPS and 87.3 FPS on the two datasets, respectively. Therefore, the Z-YOLOv8s algorithm not only fulfills real-time detection requirements but also significantly enhances detection accuracy. The RepViTC2f module captures contextual semantic information and effectively extracts global features. The LSKNet module enhances object feature information, suppresses background noise, and improves visual representation capabilities. Furthermore, the SPD-Conv module enhances the detection of small objects in complex traffic scenarios by mitigating the degradation of detailed feature information during convolution. In addition, the introduction of the SoftPool-SPPF module enables the algorithm to preserve the intricate features of small objects in complex traffic scenarios more effectively. Consequently, the experimental results demonstrate that the improvement at each stage enhances the learning capability of the model, confirming the generality and effectiveness of the proposed algorithm.

4.5. Algorithm performance analysis

Fig. 9 and Fig. 10 show the precision-recall (P-R) results for the BDD100K and KITTI test sets, respectively, plotting the P-R curve of each class at mAP@0.5 for both YOLOv8s and Z-YOLOv8s. On the BDD100K dataset, the most notable disparities are evident in the curves for traffic signs and traffic lights, which are typically the smallest objects in traffic scenes. On the KITTI dataset, the cyclist category exhibits the most outstanding result, with a performance increase of 6.2 %. Additionally, the detection performance for traffic signs and traffic lights on the BDD100K dataset improved by 8.3 % and 5.1 %, respectively. These two categories are more densely distributed in traffic scenes and occupy fewer pixels. These findings indicate that the proposed method enhances the accuracy of small object detection and effectively reduces object omission rates.
To assess the effectiveness of the Z-YOLOv8s approach for small object detection in complex traffic scenarios, we used the Common Objects in Context (COCO) benchmark evaluation criteria. Traffic scene objects were categorized into small, medium, and large objects based on their sizes: small objects have fewer than 32² pixels, medium objects have between 32² and 96² pixels, and large objects have more than 96² pixels. Table 3 presents the results for various object scales. The Z-YOLOv8s detector exhibits higher AP and Average Recall (AR) values for all detected objects across the various IoU thresholds on the BDD100K and KITTI datasets. For objects of various sizes under the same IoU, the Z-YOLOv8s algorithm achieves higher AP and AR values than the YOLOv8s algorithm. The improvements in AP and AR for small objects are particularly noteworthy: Z-YOLOv8s achieves AP-small values of 26 % and 42 % for traffic objects, which is an improvement of 5.7 % and 6.5 % on the BDD100K and KITTI datasets, respectively.
values for small objects. The Z-YOLOv8s algorithm achieved AP-small
values of 26 % and 42 % for traffic objects, which is an improvement

Table 3
Performance comparison of the YOLOv8s and Z-YOLOv8s algorithms for object detection at various scales on BDD100K and KITTI.
Metric | IoU | Area | maxDets | YOLOv8s (BDD100K) | Z-YOLOv8s (BDD100K) | YOLOv8s (KITTI) | Z-YOLOv8s (KITTI)
AP (%) | 0.50:0.95 | all | 100 | 34.7 | 39 | 65.8 | 71.4
AP (%) | 0.50 | all | 100 | 67.3 | 74.7 | 91.1 | 93.9
AP (%) | 0.75 | all | 100 | 30.4 | 34.4 | 59.5 | 62.2
AP (%) | 0.50:0.95 | small | 100 | 20.3 | 26 | 35.5 | 42
AP (%) | 0.50:0.95 | medium | 100 | 51.2 | 53.6 | 63.6 | 65.6
AP (%) | 0.50:0.95 | large | 100 | 68.3 | 68.5 | 70.1 | 70.9
AR (%) | 0.50:0.95 | all | 1 | 10.5 | 11.1 | 26.6 | 27.5
AR (%) | 0.50:0.95 | all | 10 | 39 | 42.9 | 63.1 | 65.4
AR (%) | 0.50:0.95 | all | 100 | 45.2 | 51.7 | 65.1 | 67.6
AR (%) | 0.50:0.95 | small | 100 | 32.9 | 42.1 | 45 | 52.7
AR (%) | 0.50:0.95 | medium | 100 | 61.6 | 64.6 | 69.9 | 70.9
AR (%) | 0.50:0.95 | large | 100 | 73.5 | 73.6 | 74.6 | 74.9


The AR-small values for traffic object detection with Z-YOLOv8s reach 42.1 % and 52.7 %, representing improvements of 9.2 % and 7.7 % on the BDD100K and KITTI datasets, respectively. Overall, the Z-YOLOv8s algorithm performs better in detecting small objects in road environments while maintaining good generalization.

4.6. Comparative experimental analysis

To assess the effectiveness of the proposed method, we performed a comparative analysis against several traditional object-detection networks on the BDD100K and KITTI datasets. The results are presented in Tables 4 and 5.

Table 4
Experimental results comparing different algorithmic models on the BDD100K dataset.
Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | FPS
Faster R-CNN (ResNet50) [6] | 76.5 | 75.8 | 73.1 | 41 | 19.4
Cascade R-CNN (ResNet50) [43] | 76.1 | 76.7 | 73.5 | 42.6 | 12.3
RetinaNet (ResNet50) [17] | 71 | 67.1 | 66.3 | 34.6 | 26.3
SSD [16] | 56.5 | 52.8 | 48.2 | 23.5 | 48.6
YOLOv4 [11] | 76.9 | 65.5 | 72.3 | 37.6 | 97.4
YOLOv5s [12] | 73.3 | 59.1 | 64.9 | 32.2 | 218.4
YOLOv6s [13] | 67.8 | 65.3 | 66.5 | 34.2 | 141.3
YOLOXs [44] | 75.7 | 64.6 | 70.1 | 35.4 | 157.9
YOLOv7-tiny [14] | 72.1 | 58.1 | 63.5 | 30 | 264.6
YOLOv8s [15] | 74.6 | 60.8 | 67.9 | 35.2 | 136.2
Z-YOLOv8s | 77.1 | 68.5 | 75.2 | 39.5 | 78.41

Table 5
Experimental results for various algorithmic models on the KITTI dataset.
Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | FPS
Faster R-CNN (ResNet50) [6] | 90.6 | 82.1 | 87.5 | 58.1 | 24.1
Cascade R-CNN (ResNet50) [43] | 76.8 | 71.6 | 73.7 | 43.2 | 35.4
RetinaNet (ResNet50) [17] | 88.7 | 85.9 | 86.8 | 58.8 | 17.6
SSD [16] | 70.6 | 67.3 | 68.2 | 37.8 | 56.3
YOLOv4 [11] | 92.9 | 88.8 | 92.7 | 67.9 | 97.5
YOLOv5s [12] | 93.8 | 83.2 | 91.3 | 62.7 | 286.3
YOLOv6s [13] | 89.2 | 82.7 | 88.8 | 61.9 | 175.4
YOLOXs [44] | 94.7 | 84.8 | 92.5 | 65.3 | 186.7
YOLOv7-tiny [14] | 90.3 | 80 | 87.5 | 56.4 | 328.1
YOLOv8s [15] | 91.4 | 86 | 90.6 | 67.1 | 156.1
Z-YOLOv8s | 94.3 | 89.5 | 94.4 | 73.2 | 87.3

On the BDD100K dataset, Z-YOLOv8s achieves state-of-the-art performance, reaching 75.2 % mAP@0.5. Although Faster R-CNN [6] and Cascade R-CNN [43] are well-known two-stage detection algorithms, Z-YOLOv8s demonstrates significant improvements in both speed and accuracy compared with these traditional object detection networks. Compared with Faster R-CNN, our model achieves a 2.1 % higher mAP@0.5 with approximately four times faster inference (78.41 FPS vs. 19.4 FPS). Similarly, compared with Cascade R-CNN, Z-YOLOv8s has a 1.7 % higher mAP@0.5 and is 6.3 times faster (78.41 FPS vs. 12.3 FPS). Additionally, compared with state-of-the-art single-stage detection algorithms, Z-YOLOv8s achieves a superior balance between speed and precision: it surpasses SSD [16] by 27 % in mAP@0.5. Although YOLOv5s [12] and YOLOv7-tiny [14] exhibit detection speeds approximately three times faster than our algorithm, Z-YOLOv8s outperforms them in accuracy, improving mAP@0.5 by 10.3 % and 11.7 %, respectively. On the KITTI dataset, our algorithm also delivers outstanding detection results. Compared with Faster R-CNN, our model realizes a 6.9 % higher mAP@0.5 with approximately 3.6 times faster inference (87.3 FPS vs. 24.1 FPS). Compared with state-of-the-art one-stage detection algorithms, Z-YOLOv8s again achieves a better balance between speed and precision: it surpasses RetinaNet by 20.7 % and SSD by 26.2 % in mAP@0.5. Similarly, compared with YOLOv6s and YOLOXs, Z-YOLOv8s demonstrates notable improvements, achieving a 5.6 % higher mAP@0.5 than YOLOv6s and outperforming YOLOXs by 1.9 % in mAP@0.5. Although YOLOv5s and YOLOv7-tiny exhibit detection speeds approximately three times faster than Z-YOLOv8s, the proposed model outperforms them in accuracy, with improvements of 3.1 % and 6.9 % in mAP@0.5, respectively. Overall, compared with other algorithms, Z-YOLOv8s stands out in complex traffic scenarios, exhibiting significantly improved detection precision for occluded and small objects. Its robustness to small objects at unusual angles and to vehicles in obscured regions is noticeably enhanced. Consequently, these results indicate that Z-YOLOv8s delivers superior performance in terms of both detection accuracy and speed, particularly in challenging traffic scenarios.

4.7. Visualizations

Finally, we present numerous visual results to provide a detailed understanding of the detection properties of Z-YOLOv8s. These visuals include object detection outcomes, trends in the loss function, and Gradient-weighted Class Activation Mapping (Grad-CAM) for an intuitive analysis of the detection results.
(1) Tendency of the loss function: To investigate the impact of the loss function, we recorded the loss function trends for YOLOv8s and Z-YOLOv8s on the BDD100K dataset, as shown in Fig. 11. Fig. 11 indicates that our method exhibits a certain degree of reduction in three categories of loss functions: box loss, class (cls) loss,
Fig. 11. Comparison of loss function between YOLOv8s and Z-YOLOv8s on BDD100K.


Fig. 12. Detection results of YOLOv8s and Z-YOLOv8s on BDD100K.

Fig. 13. Detection results of YOLOv8s and Z-YOLOv8s on KITTI.

and distribution focal loss (dfl). This reveals that the algorithm fits the training data better, whereas the improved convergence indicates that it can more effectively address the problem of class imbalance, leading to enhanced detection performance.
(2) Object detection results: To enhance the understanding of the algorithm's performance, we visualized the detection results of YOLOv8s and Z-YOLOv8s on the BDD100K and KITTI test datasets, as shown in Fig. 12 and Fig. 13. YOLOv8s is prone to inaccurate localization and produces numerous imprecise results. Conversely, Z-YOLOv8s demonstrates highly accurate object localization, significantly reducing false detections and erroneous identifications. In addition, our proposed detector can identify object samples that YOLOv8s fails to detect, such as pedestrians in shadows or obscured cars, with precise detection, as shown in Fig. 12. This confirms the excellent performance of Z-YOLOv8s.
(3) Grad-CAM results: Grad-CAM was employed to provide a visual interpretation of the predictions made by CNNs, allowing the discrimination and localization of the significant areas in the input images that most strongly affect the network predictions. Grad-CAM [45] produces a coarse localization map that highlights the important areas in an image used to predict a concept by analyzing the gradients flowing into the final convolutional layer for the object of interest. To better analyze the influence of Z-YOLOv8s, the detection results are visualized in Fig. 14 and Fig. 15.
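A minimal from-scratch illustration of the Grad-CAM computation described above is sketched below. The hooks, the choice of target layer, and the score_fn helper are assumptions for a generic detector; this is not the visualization code used in the paper.

import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, score_fn):
    # Grad-CAM: weight the activations of a chosen convolutional layer by the
    # spatially averaged gradients of a scalar target score, then apply ReLU
    # to the weighted sum and upsample the map to the input resolution.
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = score_fn(model(x))          # e.g. the class/objectness score of interest
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    a, g = feats[0], grads[0]           # activations and their gradients
    weights = g.mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))
    return F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)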
The visualization results clearly show that the baseline model YOLOv8s exhibited issues with missed detections and false positives. By contrast, the improved model Z-YOLOv8s demonstrates an enhanced focus on traffic objects. This indicates that the incorporated modules can


Fig. 14. Grad-CAM visualization of BDD100K.

Fig. 15. Grad-CAM visualization of KITTI.

reduce boundary uncertainty, leverage the central features of objects, and suppress extraneous information, thereby improving the detection accuracy in complex traffic scenarios with object occlusion. Furthermore, in traffic scenarios involving small objects, the Z-YOLOv8s model can effectively concentrate on the information relevant to these objects and accurately capture their key features. This improvement in the learning capacity of the network reduces the number of missed small-object detections, thereby enhancing the overall detection performance.

5. Conclusion

To solve the issues of incorrect and missed detections caused by occluded and small objects in complex traffic scenarios, we developed a road object detection network model based on YOLOv8s, namely Z-YOLOv8s. This model combines RepViTC2f modules to suppress background information loss. In addition, the LSKNet block attention module was incorporated to improve the feature extraction capabilities of the model. We employed SPD-Conv and developed the SoftPool-SPPF module to further mitigate overfitting. These enhancements improve the ability of the YOLOv8s model to handle noise and low-quality images, thereby enhancing its performance in small-object detection. Through multiple confirmation and ablation experiments on the BDD100K and KITTI benchmarks, the assessment results verified the excellence of Z-YOLOv8s. Compared with YOLOv8s, Z-YOLOv8s achieved a significant enhancement of 7.3 % in mAP@0.5 on the BDD100K dataset and a 3.8 % increase in mAP@0.5 on the KITTI dataset. Additionally, the optimized algorithm improved AP-small accuracy by 5.7 % and 6.5 % on these datasets, respectively. Moreover, Z-YOLOv8s maintains real-time inference speeds, achieving 78.41 FPS on BDD100K and 87.3 FPS on KITTI, thereby ensuring efficient and accurate performance in complex traffic scenarios.


Furthermore, Z-YOLOv8s exhibited a better balance between detection precision and speed than state-of-the-art detectors, suggesting its suitability for intelligent driving applications. The ablation experiments further illustrate the effectiveness of our proposed algorithm, and the visualization results provide a novel direction for analyzing the content of each module and the variations in the loss function. In the future, we plan to deploy an enhanced model on resource-constrained embedded devices for traffic scene object detection. This deployment aims to enhance the algorithm's robustness and practicality in real-world scenarios while continuing to refine the proposed algorithm and methodology.

CRediT authorship contribution statement

Ruixin Zhao: Writing – original draft, Validation, Conceptualization. Sai Hong Tang: Writing – review & editing, Supervision. Eris Elianddy Bin Supeni: Software, Investigation. Sharafiz Abdul Rahim: Resources, Data curation. Luxin Fan: Visualization, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] A. Boukerche, Z. Hou, Object detection using deep learning methods in traffic scenarios, ACM Comput. Surv. 54 (2022) 1–35, https://doi.org/10.1145/3434398.
[2] B. Wu, F. Iandola, P.H. Jin, K. Keutzer, SqueezeDet: unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (2017) 129–137.
[3] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2014) 580–587.
[4] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (2015) 1904–1916.
[5] R. Girshick, Fast R-CNN, Proc. IEEE Int. Conf. Comput. Vis. (2015) 1440–1448.
[6] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, Adv. Neural Inf. Process. Syst. 28 (2015).
[7] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, Proc. IEEE Int. Conf. Comput. Vis. (2017) 2961–2969.
[8] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2016) 779–788.
[9] J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2017) 7263–7271.
[10] J. Redmon, A. Farhadi, YOLOv3: an incremental improvement, (2018).
[11] A. Bochkovskiy, C.-Y. Wang, H.-Y.M. Liao, YOLOv4: optimal speed and accuracy of object detection, (2020).
[12] G. Jocher, A. Stoken, J. Borovec, L. Changyu, A. Hogan, L. Diaconu, J. Poznanski, L. Yu, P. Rai, R. Ferriday, ultralytics/yolov5: v3.0, Zenodo (2020).
[13] C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, W. Nie, Y. Li, B. Zhang, Y. Liang, L. Zhou, X. Xu, X. Chu, X. Wei, X. Wei, YOLOv6: a single-stage object detection framework for industrial applications, (2022).
[14] C.-Y. Wang, A. Bochkovskiy, H.-Y.M. Liao, YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (2023) 7464–7475.
[15] G. Jocher, A. Chaurasia, J. Qiu, YOLO by Ultralytics, (2023).
[16] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A.C. Berg, SSD: single shot multibox detector, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Comput. Vis. – ECCV 2016, Springer International Publishing, Cham, 2016, pp. 21–37, https://doi.org/10.1007/978-3-319-46448-0_2.
[17] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, Proc. IEEE Int. Conf. Comput. Vis. (2017) 2980–2988.
[18] V.K. Sharma, P. Dhiman, R.K. Rout, Improved traffic sign recognition algorithm based on YOLOv4-tiny, J. Vis. Commun. Image Represent. 91 (2023) 103774.
[19] S. Wang, Z. Qu, C. Li, L. Gao, BANet: small and multi-object detection with a bidirectional attention network for traffic scenes, Eng. Appl. Artif. Intell. 117 (2023) 105504.
[20] S. Grigorescu, B. Trasnea, T. Cocias, G. Macesanu, A survey of deep learning techniques for autonomous driving, J. Field Robot. 37 (2020) 362–386, https://doi.org/10.1002/rob.21918.
[21] Z. Yu, L. Li, J. Xie, C. Wang, W. Li, X. Ning, Pedestrian 3D shape understanding for person re-identification via multi-view learning, IEEE Trans. Circuits Syst. Video Technol. (2024), https://doi.org/10.1109/TCSVT.2024.3358850.
[22] P. Zhang, X. Yu, C. Wang, J. Zheng, X. Ning, X. Bai, Towards effective person search with deep learning: a survey from systematic perspective, Pattern Recognit. 152 (2024) 110434, https://doi.org/10.1016/j.patcog.2024.110434.
[23] H. Wang, Y. Xu, Y. He, Y. Cai, L. Chen, Y. Li, M.A. Sotelo, Z. Li, YOLOv5-Fog: a multiobjective visual detection algorithm for fog driving scenes based on improved YOLOv5, IEEE Trans. Instrum. Meas. 71 (2022) 1–12.
[24] T. Li, G. Pang, X. Bai, J. Zheng, L. Zhou, X. Ning, Learning adversarial semantic embeddings for zero-shot recognition in open worlds, Pattern Recognit. 149 (2024) 110258, https://doi.org/10.1016/j.patcog.2024.110258.
[25] Y. Shi, X. Li, M. Chen, SC-YOLO: a object detection model for small traffic signs, IEEE Access 11 (2023) 11500–11510.
[26] D. Tian, Y. Han, S. Wang, Object feedback and feature information retention for small object detection in intelligent transportation scenes, Expert Syst. Appl. 238 (2024) 121811, https://doi.org/10.1016/j.eswa.2023.121811.
[27] G. Oreski, YOLO*C—Adding context improves YOLO performance, Neurocomputing 555 (2023) 126655.
[28] P. Cong, H. Feng, S. Li, T. Li, Y. Xu, X. Zhang, A visual detection algorithm for autonomous driving road environment perception, Eng. Appl. Artif. Intell. 133 (2024) 108034, https://doi.org/10.1016/j.engappai.2024.108034.
[29] X. Tang, W. Xu, K. Li, M. Han, Z. Ma, R. Wang, PIAENet: pyramid integration and attention enhanced network for object detection, Inf. Sci. 670 (2024) 120576, https://doi.org/10.1016/j.ins.2024.120576.
[30] J. Zhan, Y. Luo, C. Guo, Y. Wu, J. Meng, J. Liu, YOLOPX: anchor-free multi-task learning network for panoptic driving perception, Pattern Recognit. 148 (2024) 110152, https://doi.org/10.1016/j.patcog.2023.110152.
[31] X. Xiang, Z. Wang, Y. Qiao, An improved YOLOv5 crack detection method combined with transformer, IEEE Sens. J. 22 (2022) 14328–14335, https://doi.org/10.1109/JSEN.2022.3181003.
[32] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: transformers for image recognition at scale, (2021).
[33] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: A. Vedaldi, H. Bischof, T. Brox, J.-M. Frahm (Eds.), Comput. Vis. – ECCV 2020, Springer International Publishing, Cham, 2020, pp. 213–229, https://doi.org/10.1007/978-3-030-58452-8_13.
[34] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: hierarchical vision transformer using shifted windows, Proc. IEEE/CVF Int. Conf. Comput. Vis. (2021) 10012–10022.
[35] A. Wang, H. Chen, Z. Lin, J. Han, G. Ding, RepViT: revisiting mobile CNN from ViT perspective, (2023).
[36] J. Pan, A. Bulat, F. Tan, X. Zhu, L. Dudziak, H. Li, G. Tzimiropoulos, B. Martinez, EdgeViTs: competing light-weight CNNs on mobile devices with vision transformers, in: S. Avidan, G. Brostow, M. Cissé, G.M. Farinella, T. Hassner (Eds.), Comput. Vis. – ECCV 2022, Springer Nature Switzerland, Cham, 2022, pp. 294–311, https://doi.org/10.1007/978-3-031-20083-0_18.
[37] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2018) 7132–7141.
[38] Y. Li, Q. Hou, Z. Zheng, M.-M. Cheng, J. Yang, X. Li, Large selective kernel network for remote sensing object detection, (2023).
[39] R. Sunkara, T. Luo, No more strided convolutions or pooling: a new CNN building block for low-resolution images and small objects, in: M.-R. Amini, S. Canu, A. Fischer, T. Guns, P. Kralj Novak, G. Tsoumakas (Eds.), Mach. Learn. Knowl. Discov. Databases, Springer Nature Switzerland, Cham, 2023, pp. 443–459, https://doi.org/10.1007/978-3-031-26409-2_27.
[40] A. Stergiou, R. Poppe, G. Kalliatakis, Refining activation downsampling with SoftPool, Proc. IEEE/CVF Int. Conf. Comput. Vis. (2021) 10357–10366.
[41] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, T. Darrell, BDD100K: a diverse driving dataset for heterogeneous multitask learning, Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (2020) 2636–2645.
[42] A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: the KITTI dataset, Int. J. Robot. Res. 32 (2013) 1231–1237, https://doi.org/10.1177/0278364913491297.
[43] Z. Cai, N. Vasconcelos, Cascade R-CNN: delving into high quality object detection, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2018) 6154–6162.
[44] Z. Ge, S. Liu, F. Wang, Z. Li, J. Sun, YOLOX: exceeding YOLO series in 2021, (2021).
[45] R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: visual explanations from deep networks via gradient-based localization, IEEE Int. Conf. Comput. Vis. (2017) 618–626.
