Article
A Contraband Detection Scheme in X-ray Security Images Based
on Improved YOLOv8s Network Model
Qingji Gao 1 , Haozhi Deng 1, * and Gaowei Zhang 2
Abstract: X-ray inspections of contraband are widely used to maintain public transportation safety
and protect life and property when people travel. To improve detection accuracy and reduce the
probability of missed and false detection, a contraband detection algorithm YOLOv8s-DCN-EMA-
IPIO* based on YOLOv8s is proposed. Firstly, the super-resolution reconstruction method based
on the SRGAN network enhances the original data set, which is more conducive to model training.
Secondly, DCNv2 (deformable convolution net v2) is introduced in the backbone network and
merged with the C2f layer to improve the ability of the feature extraction and robustness of the
model. Then, an EMA (efficient multi-scale attention) mechanism is proposed to suppress the
interference of complex background noise and occlusion overlap in the detection process. Finally,
the IPIO (improved pigeon-inspired optimization), which is based on the cross-mutation strategy,
is employed to optimize the convolutional neural network’s learning rate so as to derive the optimal
group of weights and ultimately improve the model’s detection and recognition accuracy.
The experimental results show that on the self-built data set, the mAP (mean average precision) of
the improved model YOLOv8s-DCN-EMA-IPIO* is 73.43%, 3.98% higher than that of the original
model YOLOv8s, and the FPS is 95, meeting the deployment requirements of both high precision
and real-time performance.
Traditional target detection algorithms have made many contributions to the classi-
fication and recognition of security contraband. At present, a commonly used method is
based on the image features of a package, and the contour detection algorithm is used to
extract the package. The typical features of the package image are the gray feature and
the edge feature. Wu Ying et al. extracted the image’s golden zone using edge detection
operators [7]. Mei Hongyang et al. extracted the outlines of moving objects using
edge features [8]. Su Bingshan et al. proposed a new detection and classification method.
Contourlet transform was used to decompose the images scanned using X-rays, and the
decomposed co-occurrence matrix, Tamura texture feature and histogram feature were
extracted [9]. Finally, the feature vectors of these three features were linked in series to
obtain the joint feature vector. Wang Yu et al. proposed a study on the classification of
foreign objects in X-ray images based on computer vision, which mainly uses Tamura
texture features and random forest to automatically identify and classify prohibited objects
in X-rays [10]. Han Ping et al. proposed a method of adaptive sinusoidal gray transform
to achieve two-level enhancement, which can significantly improve image quality [11].
Traditional methods have poor feature extraction and capture abilities, and their robustness
to the diversity of targets is low. As a result, such detection models lack generalization
ability and cannot be applied well to large amounts of data.
The convolutional neural network (CNN) has been widely used in image process-
ing and analysis in recent years, thanks to the high-speed development of deep learning
methods [12], and many computer vision tasks, such as face recognition, behavior recogni-
tion, medical image processing, and automatic driving, have made significant progress,
reaching the most advanced and efficient detection performance [13,14]. At the same time,
researchers are also looking at the field of X-ray security contraband detection. Before that,
the main method for image recognition and classification was the bag-of-visual-words
(BoVW) model, which extracted visible words through feature descriptors [15]. The k-
means algorithm was used to cluster features and combine visible words with similar
meanings [16]. It is also possible to create vocabularies for classification via RF, SVM,
or sparse representation [17,18]. The application of the standard BoVW method to the
classification and detection of X-ray images can improve the performance. Mery et al. com-
pared the effectiveness of X-ray contraband detection techniques based on deep learning,
sparse representation, the visual bag-of-words model, and traditional pattern recognition
schemes [19–21]. They discovered that the deep learning approach produced the best
results [2]. Wu Haibin et al. added improved void convolution based on the YOLOv4
algorithm, used a multi-scale aggregation of context information, and finally optimized
the candidate boxes via the k-means clustering algorithm, which eventually improved
the detection accuracy [22]. Dong Yishan et al. proposed an improved YOLOv5 X-ray
contraband detection model. Based on the YOLOv5 algorithm, the model introduced an
attention mechanism, border fusion and data enhancement strategies to improve detection
accuracy [23]. Because the volume of contraband is smaller than that of luggage, and the
ordinary network model is weak in detecting small targets, it is easy to cause the problem of
missing and misdetection. To solve the above problems, Zhang Youkang et al. added three
detection modules based on the one-stage target detection network SSD framework [24]
and proposed a multi-scale contraband detection network ACMNet suitable for X-ray
security inspection images, and they achieved satisfactory results [25]. Based on the im-
provement of the YOLOv3 network model, Guo Shouxiang et al. changed its backbone
network to a new backbone network composed of two darknets and introduced a feature
enhancement module to improve the detection effect of small targets [26]. To detect small
residual items on X-ray clothing images, Rong Gao et al. proposed combining the feature
pyramid network (FPN) with the Faster R-CNN algorithm [27–29]. The combination of
FPN and Faster R-CNN can make better use of feature maps with higher resolution. Wei
et al. proposed an attentional module DOAM for removing occluding baggage. When
faced with a high degree of occluding baggage, DOAM can bring a good improvement
effect [30]. Zhang Na et al. proposed a dangerous-goods-detection algorithm for X-ray
security based on the improved Cascade RCNN network to improve the detection accuracy
by enhancing local feature learning and changing the weight proportional coefficient [31].
Aiming at solving the problems of false detection and missed detection of contraband in
X-ray security screening scenarios, You xi et al. proposed an adaptive security contraband
detection method XPIC R-CNN based on Cascade R-CNN that integrates spatial attention,
which significantly improves the average detection accuracy, but a large amount of com-
putation leads to low real-time performance [32]. The aforementioned techniques have
made significant strides toward improving the accuracy of security contraband detection
based on deep learning; yet, the following issues persist: (1) In the case of severe overlap
occlusion, it is easy to confuse the object to be inspected with the background, and similar
feedback will be obtained when images are transferred to the network to extract features
through the convolutional layer, thus reducing the recognition and detection accuracy
of contraband. (2) Due to the fixed sampling position, the ordinary convolution cannot
adapt to the actual detection target receptive field, which affects the ability of the model to
extract features and ultimately results in missing detection. The size and shape of various
contraband items vary greatly. (3) The setting of the anchor frame depends on manual
marker setting, resulting in a weak generalization ability of the initial value of the anchor
frame and low detection accuracy for small samples and small targets, which affects the
final detection accuracy of the model.
Since the single-stage object detection model YOLO was published, it has been widely
considered and applied in the industry. In recent years, the YOLO series algorithm has
been updated and iterated in several versions. The Ultralytics team proposed the YOLOv8
version in 2023, which not only meets the requirements of real-time performance, but also
has a lighter network structure and eliminates the anchor frame mechanism while meeting
high detection accuracy. Currently, a variety of real-world issues have been resolved by
the YOLOv8 model and its enhanced approach such as dense pedestrian detection [33,34],
road damage detection [35], gesture recognition [36], etc., but there are few examples of
dangerous goods detection. Therefore, in view of the above three problems, this paper takes
the YOLOv8s model as the base framework and makes improvements in four aspects: data
enhancement, fused deformable convolution (DCNv2), an added attention mechanism (EMA), and
hyperparameter optimization, which together improve the model’s detection accuracy for
contraband detection. The main innovations are as follows:
(1) The data enhancement of the X-ray security inspection data set and the introduction
of SRGAN super-resolution reconstruction technology can improve the resolution and
brightness of the original image, make the appearance and shape of the items to be
inspected in the picture more precise and have more information, which is conducive
to the feature extraction operation of the model.
(2) To enhance the feature extraction network and improve the ability of the model to
detect contraband on different scales, DCNv2 (deformable convolution net v2) is in-
troduced into the backbone network. An EMA (efficient multi-scale attention) module
is proposed to implement the adaptive calibration of feature map channels, thereby
improving the model’s attention to the target region and reducing the complicated
background interference and the overlapping occlusion issue.
(3) A pigeon colony algorithm based on a cross-mutation approach is developed to im-
prove the learning rate parameters in the model’s hyperparameters by mimicking the
behavioral traits of flock homing. The algorithm’s features include a broad search
range and quick response time. The model’s initial learning rate is generated by the
improved pigeon-inspired optimization, and the mAP index serves as the fitness func-
tion. Continuous iteration is applied to obtain the optimal learning rate, which is then
translated into a corresponding mAP value that improves target detection accuracy.
The remaining contents of this paper are as follows: Chapter 2 introduces the principle
of YOLOv8s and the related improved model mentioned above; Chapter 3 mainly presents
the test experiment and results from the analysis; and Chapter 4 concludes the full text.
2. Materials and Methods
2.1. Data Enhancement
Generally speaking, in deep learning, the higher the resolution and the greater the number of image datasets, the better the performance of the model trained using these datasets. When the number of data sets is insufficient, the recognition accuracy of the model will be low, the recognition ability will be weak, and the generalization ability will be inadequate. In actuality though, gathering high-quality scene image data is frequently challenging because of the environment’s complexity and unpredictability as well as the capacity limitations of the acquisition equipment. As a result, picture augmentation technology is required to prevent the model from overfitting and increase the model’s recognition accuracy.
The super-resolution reconstruction technique is applied in this research to improve the images. Increasing and enhancing an image is referred to as super-resolution, whereas super-resolution reconstruction technology is the technique of reconstructing a high-resolution image from one or more low-resolution photographs. Its purpose is to recover image details, improve clarity, and improve visual quality. Traditional reconstruction approaches are classified into two categories: interpolation-based methods and regularization-based methods. However, the results obtained in this manner are unsatisfactory, resulting in blurred images. As a result, super-resolution reconstruction technology based on deep learning has been developed and has proven to be a potent and effective answer with the growth of artificial intelligence. It can generate incredibly detailed high-resolution photos, which significantly improves the reconstruction effect. Because the resolution of the picture data set scanned by the security detector in this article is low, it is important to convert the image to HD using the SRGAN network, a super-resolution reconstruction approach based on deep learning. SRGAN is a hybrid of the generative adversarial network (GAN) and the deep convolutional neural network (CNN) [37,38]. The SRGAN network with a generative adversarial network model can obtain more accurate amplification results than traditional super-resolution reconstruction techniques, resulting in natural images with excellent perceptual quality. Figures 1 and 2 represent the structure of the generator network and the discriminator network, respectively.
Figure 1. The network structure of the SRGAN generator.
Figure 2. The network structure of the SRGAN discriminator.
The generator network and the discriminator network make up the two fundamental components of the SRGAN method. An input image with poor resolution is first sent to a 9 × 9 convolution layer by the generator network design. Next, it is received as input by the PReLU function. These values are then entered into residual blocks, each containing two 3-by-3-sized 64-channel convolution layers, using the ReLU function as the activation function after the residual block [39]. Lastly, the super-resolution image is formed as the output after two deconvolution layers are employed for upsampling to increase the image size. Eight convolutional layers make up the discriminator network structure, which is in charge of differentiating between the created super-resolution image and the actual high-resolution image. First, the convolution layer is used for the input image so that it extracts features from the input image and passes them to the Leaky ReLU activation function. It is then processed by seven modules, which include a convolution layer, a batch normalizing layer, and the Leaky ReLU function. Finally, the authentication results are produced by the fully connected layer and the Sigmoid function.
In contrast to a single-structure deep learning network, a generative adversarial network generates objects using a generator network and a discriminator network for adversarial learning. Both generators and discriminators learn by competing with one another. The generator’s objective is to convert a low-resolution (LR) image into a super-resolution (SR) image. The discriminator aims to discriminate between true high-resolution images (HR) and super-resolution images (SR) and to provide the generator and discriminator models with the discriminant results. The generator and the discriminator both update the relevant parameters at the same time; the generator continues to fool the discriminator by generating a realistic super-resolution image that captures the image’s details and overall visual features, and the discriminator provides feedback to the generator on developing the image to encourage the generator to learn more efficiently and minimize losses. In this paper, the DIV data set is utilized for 300 rounds of training to achieve the appropriate weight, and then it is applied for super-resolution reconstruction of the security detector data set to produce high-resolution images. Figure 3 depicts the training procedure for super-resolution reconstruction.
Figure 3. The super-resolution reconstruction training procedure.
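As a rough illustration of the adversarial training procedure described above, the following PyTorch-style sketch alternates discriminator and generator updates. The `Generator`, `Discriminator`, and data loader are hypothetical stand-ins (the discriminator is assumed to end with a Sigmoid), and the loss weighting is a simplification of the full SRGAN perceptual loss rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the SRGAN adversarial loop described in the text.
# generator, discriminator and loader are hypothetical placeholders; the real
# SRGAN additionally uses a VGG-based perceptual (content) loss, omitted here.
def train_srgan(generator, discriminator, loader, epochs=300, device="cuda"):
    bce = nn.BCELoss()
    mse = nn.MSELoss()
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

    for _ in range(epochs):
        for lr_img, hr_img in loader:                  # low-res input, high-res target
            lr_img, hr_img = lr_img.to(device), hr_img.to(device)
            sr_img = generator(lr_img)                 # super-resolved output

            # discriminator step: real HR vs. generated SR
            opt_d.zero_grad()
            real_pred = discriminator(hr_img)
            fake_pred = discriminator(sr_img.detach())
            d_loss = bce(real_pred, torch.ones_like(real_pred)) + \
                     bce(fake_pred, torch.zeros_like(fake_pred))
            d_loss.backward()
            opt_d.step()

            # generator step: pixel loss plus adversarial feedback
            opt_g.zero_grad()
            fake_pred = discriminator(sr_img)
            g_loss = mse(sr_img, hr_img) + 1e-3 * bce(fake_pred, torch.ones_like(fake_pred))
            g_loss.backward()
            opt_g.step()
```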
2.2. The Design of the YOLOv8s-DCN-EMA Model to Optimize the Detection Accuracy
of Contraband
2.2.1. The YOLOv8s Network Structure
In the division of security contraband, it is necessary to identify and detect dangerous
goods. Based on the location of the detected contraband, the size of the item is estimated to
determine whether the item belongs to the category of contraband.
The YOLO series detection algorithm, which is one of the most well-known object
detection algorithms, divides the image into multiple networks, predicts the bounding
box within each grid and the category of objects it contains, and uses the non-maximum
suppression (NMS) algorithm to eliminate overlapping bounding boxes, which has the
characteristics of fast speed and high precision. YOLOv1 has a relatively quick detection
speed, but its detection effect is not optimal for objects that are relatively close together
and small targets [40]. The YOLOv2 algorithm employs the Darknet19 network, which
is very adaptable and can accommodate images of varying sizes [41]. To enhance its
capacity to detect data at various scales, YOLOv3 incorporates the spatial pool pyramid
and feature pyramid modules [42]. YOLOv4 introduces mish activation functions to
improve accuracy [43]. YOLOv5 introduces SPPF and C3 modules to optimize the detection
performance. To achieve the goal of continuously enhancing the network learning ability
without destroying the original gradient path, YOLOv7 introduces the E-ELAN module,
which uses extension, shuffle, and merge cardinality [44]. This improves the ability of
feature extraction and semantic information expression. The YOLOv8 detection algorithm
used in this paper, proposed by Glenn Jocher, is improved on the YOLOv5 algorithm.
Compared with previous generations of networks, YOLOv8 further optimizes the network
structure and improves the comprehensive performance of object detection. The network
structure of the YOLOv8s model is shown in Figure 4.
Figure 4. The network structure of the YOLOv8s model.
The YOLOv8 model can be categorized into YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x from small to large based on the parameters of the model. The YOLOv8s model is chosen as the detecting head in consideration of the size of the goods and the precision and real-time detection of contraband. The input layer, the neck layer of the feature enhancement module, the head layer of the output end, and the backbone layer of the network backbone module make up the four main components of the YOLOv8s network topology, which is depicted in the picture below.
Input
Since YOLOv4, Mosaic, Mixup and other image enhancement technologies have been added to the data preprocessing module. YOLOv8 mainly adopts the processing strategy of YOLOv5. It includes four augmentation methods: Mosaic, Mixup, random perspective and the HSV augment. Among them, Mosaic technology is used to enhance the data set. The Mosaic technique is used to randomly arrange and crop four feature maps to enrich the background of the feature maps. Although image aliasing enhancement has many advantages, experience has shown it to reduce training effectiveness if used throughout the training process. Therefore, the YOLOv8 model will choose to turn off the Mosaic enhancement operation in the last 10 epochs during the training process, which can improve accuracy.
Backbone
Backbone is the feature extraction part of the YOLOv8 model and uses the same idea
of CSPNet. CSPNet is generally combined with ResNet, DenseNet, and other networks to
optimize the network structure and reduce computing and memory consumption. Take
DenseNet as an example. Through CSPNet, the underlying feature map is divided into two
parts. One part is output via the original DenseNet and other modules, and the other part is
directly combined with the output just mentioned. The backbone network comprises three
modules: CBS, C2f, and SPPF. The CBS module is composed of a Conv2d convolution module,
batch normalization and the SiLU function. The neck network refers to the ELAN structure
design idea of YOLOv7, replaces the C3 module with the C2f structure, and optimizes the
module structure, a bottleneck module connected by gradient flow. Furthermore, the network
structure is lightweight, and rich gradient flow information can be obtained. Therefore, the
C2f structure can enhance the feature fusion ability of the convolutional neural network and
improve the reasoning speed. The SPPF is optimized on the structure of the SPP module.
The SPPF module changes the original three convolution kernels of different sizes into three 5 × 5 convolution kernels. This is because two 5 × 5 convolution kernels in series have the same effect as one 9 × 9 convolution kernel. Similarly, three 5 × 5 convolution kernels in series are equivalent to one 13 × 13 convolution kernel. Multiple convolution kernels in series can reduce the computation of the network and improve the detection efficiency. Figure 5 illustrates some typical modules in the YOLOv8s network model.
Figure 5. The schematic diagram of SPPF module, bottleneck module, CspLayer module, and C2f module. (a) SPPF module; (b) bottleneck module; (c) CspLayer module; (d) C2f module.
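The serial-kernel idea behind SPPF can be seen in the minimal sketch below, written after the widely used Ultralytics-style implementation in which one 5 × 5 pooling operator is applied three times in series and the intermediate outputs are concatenated. It is an illustrative reimplementation under that assumption, not the authors' exact code.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Sketch of an SPPF block: one 5x5 operator applied three times in series
    approximates the 5x5 / 9x9 / 13x13 branches of the original SPP module."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, c_hidden, 1, bias=False),
                                 nn.BatchNorm2d(c_hidden), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c_hidden * 4, c_out, 1, bias=False),
                                 nn.BatchNorm2d(c_out), nn.SiLU())
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)          # receptive field roughly 5x5
        y2 = self.pool(y1)         # roughly 9x9
        y3 = self.pool(y2)         # roughly 13x13
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))
```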
Neck and Head
YOLOv8 continues the FPN + PAN idea from YOLOv5 but also removes the 1 × 1 convolutional layers before upsampling and directly upsamples the input features of different stages in the backbone feature extraction network, which can optimize the network structure and improve detection efficiency. In comparison to YOLOv5, the head section has undergone significant changes. The decoupling head structure replaces the original coupling detection head structure. According to the research, employing the current decoupling detection head for the same target detection task can expedite convergence and enhance detection accuracy. YOLOv8 also changes the previous anchor-based method and uses the idea of the anchor-free method to no longer predict the offset of the anchor frame, thus improving many problems caused by the anchor frame.
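A minimal sketch of the decoupled, anchor-free prediction idea follows: classification and box regression run through separate convolutional branches on each feature level. This is a simplified illustration; the real YOLOv8 head additionally uses a distribution focal loss formulation for the box branch.

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Simplified decoupled head: separate branches predict class scores and
    box offsets directly per grid cell (anchor-free), instead of sharing one
    coupled output as in earlier YOLO versions."""
    def __init__(self, c_in, num_classes):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, num_classes, 1))
        self.reg_branch = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, 4, 1))          # left/top/right/bottom distances

    def forward(self, feat):
        return self.cls_branch(feat), self.reg_branch(feat)
```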
Figure 6. The structure diagram of the EMA attention mechanism.
In this paper, the EMA attention module is added to the neck end of YOLOv8s for
the following reasons: (1) The EMA attention module is relatively computationally small,
has lower model complexity, and has higher computational efficiency. (2) EMA not only
outperforms other attention mechanisms such as SA, CBAM, CA, and ECA in terms of
results, but it is also more efficient in terms of the required parameters [46–49]. (3) Good
task adaptability: the EMA attention module is suitable for a wide range of visual tasks.
The neck end of the YOLOv8 network is crucial for linking the prediction output head
to the backbone network. Due to the particular structure of the neck end from bottom to
top, features of different scales are fully integrated here, laying a foundation for future
prediction, so the network structure of the neck end can greatly affect the performance of
the algorithm. The following figure shows the network structure after the EMA module
is added. This module is added after the Upsample layer in the up-sampling phase of
PAN-FPN and after each C2f module in the down-sampling phase, before the convolution
of the CBS module. The feature attention is strengthened before the feature fusion, so that
the model can pay more attention to the smaller target, provide information on security
inspection contraband, improve its identification and classification ability of dangerous
goods and improve its positioning accuracy.
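To make the placement concrete, the sketch below inserts an attention block after an upsampling step and after a C2f block in a PAN-FPN-style neck, before the following CBS convolution. `attn_up` and `attn_down` are placeholders for EMA-style attention modules, and the surrounding layer names are illustrative rather than the exact YOLOv8s graph.

```python
import torch
import torch.nn as nn

class NeckStage(nn.Module):
    """Illustrative PAN-FPN fragment: attention is applied right after the
    Upsample layer in the up-sampling phase and after the C2f block in the
    down-sampling phase, before the CBS convolution, as described above."""
    def __init__(self, c2f_block, attn_up, attn_down, cbs_block):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.attn_up = attn_up          # e.g., an EMA-style attention module
        self.c2f = c2f_block
        self.attn_down = attn_down
        self.cbs = cbs_block

    def forward(self, x, skip):
        x = self.attn_up(self.upsample(x))            # attention after up-sampling
        x = self.c2f(torch.cat((x, skip), dim=1))     # fuse with the backbone feature
        return self.cbs(self.attn_down(x))            # attention before the CBS convolution
```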
y(u_0) = ∑_{u_k ∈ R} x(u_0 + u_k) · w(u_k) (2)
In Formula (2), R = {(−1, −1), (−1, 0), (−1, 1), (0, −1), (0, 0), (0, 1), (1, −1), (1, 0), (1, 1)}
is the receptive field area; w(uk ) is the weight of the convolution kernel at the sampling
position uk ; x (u0 + uk ) is the feature of the input feature map x at the location u0 + uk ; and
uk is the location element of R, that is, all sampling locations in the receptive field.
In a deformable convolution, the output feature value y(u_0) is defined as follows:
y(u_0) = ∑_{u_k ∈ R} x(u_0 + u_k + ∆u_k) · w(u_k) (3)
In Formula (3), ∆uk represents the learnable offset that the standard convolution
increases at the sampling point, generally a decimal number, and u0 + uk + ∆uk is also a
decimal number, where {∆uk |k = 1, 2, . . . , N }, N =| R|. Therefore, the sampling position
of pixels x(u_0 + u_k + ∆u_k) after the introduction of the offset is usually implemented via
the bilinear interpolation method. The formula for bilinear interpolation is as follows:
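The bilinear formula itself is not reproduced in this excerpt; as a reference, the sketch below shows the standard bilinear sampling used in the deformable convolution literature, where a fractional location is read as a weighted average of its four integer neighbours. It is a generic illustration, not the paper's own notation.

```python
import math
import torch

def bilinear_sample(feature, y, x):
    """Sample a (C, H, W) feature map at a fractional location (y, x) using
    standard bilinear interpolation, as required once the learned offsets
    move the sampling points off the integer grid."""
    _, h, w = feature.shape
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature[:, y0, x0] + wx * feature[:, y0, x1]
    bottom = (1 - wx) * feature[:, y1, x0] + wx * feature[:, y1, x1]
    return (1 - wy) * top + wy * bottom
```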
In this paper, the deformable convolutional DCNv2 module is added to the backbone
network of the YOLOv8s model. The backbone network plays a role in picture feature ex-
traction. Blending and merging information on multiple levels can result in more extensive
and accurate visual features of contraband. The second, third, and fourth C2f modules
of the backbone network are replaced in this research by the deformable convolutional
module C2f_DCN, which improves the model’s attention on small and medium-sized
targets and lays the groundwork for future feature fusion.
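One way to realize such a C2f_DCN building block is with torchvision's modulated deformable convolution, as in the hedged sketch below: a plain convolution predicts the offsets and modulation masks, which are then fed to `DeformConv2d`. This is an illustrative component under that assumption, not the authors' exact module.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DCNv2Block(nn.Module):
    """Sketch of a modulated deformable convolution (DCNv2-style) block that
    could replace the 3x3 convolution inside a C2f bottleneck. A plain conv
    predicts per-position offsets and modulation masks for DeformConv2d."""
    def __init__(self, c_in, c_out, k=3, padding=1):
        super().__init__()
        # 2 offsets (dy, dx) plus 1 mask value per kernel sampling point
        self.offset_mask = nn.Conv2d(c_in, 3 * k * k, kernel_size=k, padding=padding)
        self.dcn = DeformConv2d(c_in, c_out, kernel_size=k, padding=padding)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        out = self.offset_mask(x)
        o1, o2, mask = torch.chunk(out, 3, dim=1)
        offset = torch.cat((o1, o2), dim=1)     # learnable offsets, cf. Formula (3)
        mask = torch.sigmoid(mask)              # DCNv2 modulation term
        return self.act(self.bn(self.dcn(x, offset, mask)))
```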
To sum up, the structure diagram of the YOLOv8s network with the EMA attention
mechanism and deformable convolution added is shown in Figure 8.
Figure 8. The network structure of the YOLOv8s-DCN-EMA model.
2.3. The Pigeon-Inspired Optimization (PIO) Design Based on Cross-Mutation Operator to Optimize Learning Rate
2.3.1. The Basic Theory of the PIO Algorithm
The PIO algorithm comprises two operators: the compass and map operator and the landmark operator [52]. The flock optimization model initializes the flock’s location and speed based on the compass operator, and the flock’s location and speed are updated during each iteration of the search process. In this case, speed and position are denoted as follows:
speed based on the compass operator, and the flock’s location and speed are updated
during each iteration of the search process. In this case, speed and position are denoted dur-
ing each iteration of the search process. In this case, speed and position are denoted as
as follows:
follows: XtNt = XiNt −1 + ViNt (7)
N t 1
ViNt = ViNt −1 · e−XRt× N ·(VXigbest − XiNt −1 )
Nt Nt
t +Xrand
i (8)(7)
In the above formula, NRt is theNt compass
Vi Vi 1 eRNt and ( X gbest XiNt 1 ) is used to generate a(8)
map operator; rand
rand
random number in (0, 1); X gbest is the best position globally after the t − 1 iteration loop;
ViNt −In
1
is
thetheabove
current velocity R
formula, ofis
the pigeon;
the XtNt map
andand
compass is theoperator; rand ofisthe
current position i-thto
used pigeon
gener-
in the Nt-th iteration.
ate a random number in (0, 1); X gbest is the best position globally after the t 1 iteration
The number of pigeons in each generation will be cut in half for the landmark operator.
Nt V N t 1 is the current velocity of the pigeon; and X N t is the current
loop; −1
Ntposition ofto
the
Np is used
i to represent the amount of pigeons in each generation,
t and Xcenter is used
i-th pigeon in the Nt-th iteration.
The number of pigeons in each generation will be cut in half for the landmark oper-
ator. N pNt is used to represent the amount of pigeons in each generation, and N t 1
X center is
used to represent the center of the pigeons that are left. Therefore, the pigeons near their
destination can use this as a landmark, as a reference direction of their flight.
Sensors 2024, 24, 1158 12 of 27
represent the center of the pigeons that are left. Therefore, the pigeons near their destination
can use this as a landmark, as a reference direction of their flight.
NpNt −1
NpNt = (9)
2
N Nt −1
N −1 N −1
∑ Xi t · f itness( Xi t )
Nt −1 i =1
Xcenter = (10)
N Nt −1
NpNt −1 · ∑ f itness( XiNt −1 )
i =1
where N_p^{Nt} represents the number of the t-th generation pigeon flock. The fitness value is expressed as fitness(X_i^{Nt−1}), and the fitness value of each pigeon is evaluated and arranged to find the optimal path. Formulas (9) and (10) represent N_p^{Nt} and X_center^{Nt−1}, respectively. Formula (11) is used to update the flock position:
X_i^{Nt} = X_i^{Nt−1} + rand · (X_center^{Nt−1} − X_i^{Nt−1}) (11)
X_i^{Nt−1} = (x_{1i}^{Nt−1}, x_{2i}^{Nt−1}, ..., x_{ni}^{Nt−1}) (12)
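A compact numerical sketch of the two operators defined by Formulas (7) to (12) is given below. It assumes a generic setting in which `fitness` returns a score to be maximized and is illustrative rather than the authors' implementation.

```python
import numpy as np

def pio_step(positions, velocities, fitness, R=0.2, t=1, landmark=False):
    """One iteration of basic pigeon-inspired optimization.
    The map/compass phase implements Formulas (7)-(8); the landmark phase halves
    the flock and pulls the rest toward the fitness-weighted center, (9)-(11)."""
    rng = np.random.default_rng()
    scores = np.array([fitness(p) for p in positions])

    if not landmark:                                    # map and compass operator
        x_gbest = positions[np.argmax(scores)]
        velocities = velocities * np.exp(-R * t) + \
            rng.random(positions.shape) * (x_gbest - positions)       # Formula (8)
        positions = positions + velocities                             # Formula (7)
    else:                                               # landmark operator
        keep = np.argsort(scores)[::-1][: max(1, len(positions) // 2)]  # Formula (9)
        positions, velocities, scores = positions[keep], velocities[keep], scores[keep]
        center = (positions * scores[:, None]).sum(0) / \
                 (len(positions) * scores.sum())                        # Formula (10)
        positions = positions + rng.random(positions.shape) * (center - positions)  # (11)
    return positions, velocities
```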
The mutation operation is the core operation of the evolution operator, and the main
purpose is to generate an intermediate individual through the mutation mechanism, whose
mutation equation is as follows:
s_{ji}^{Nt} = w_{ji}^{Nt}, if rand(j) ≤ CR or j = rnbi(i); s_{ji}^{Nt} = x_{ji}^{Nt−1}, if rand(j) > CR or j ≠ rnbi(i), (i = 1, 2, ..., N_p^{Nt}; j = 1, 2, ..., n) (18)
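Formula (18) is the differential-evolution-style crossover used by the improved algorithm. A small sketch of that selection rule follows, where `w` is the mutated intermediate individual and `x_prev` the previous-generation pigeon, both assumed to be given.

```python
import numpy as np

def crossover(w, x_prev, CR=0.5, rng=None):
    """Formula (18): dimension j of the trial individual takes the mutated
    value w[j] when rand(j) <= CR or j equals a randomly chosen index rnbi,
    and otherwise keeps the previous-generation value x_prev[j]."""
    rng = rng or np.random.default_rng()
    n = len(w)
    rnbi = rng.integers(n)                  # guarantees at least one mutated dimension
    take_mutant = rng.random(n) <= CR
    take_mutant[rnbi] = True
    return np.where(take_mutant, w, x_prev)
```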
inadequate diversity, and the algorithm’s pigeons’ location update formula has a poor
global search capability. The adaptive differential mutation crossover operator, which is
based on a previously improved strategy, is introduced. It can improve the algorithm’s
ability to search locally and globally, increase population diversity, and change the position
of pigeons by randomly selecting individuals within the flock.
The original compass operator, map operator and landmark operator will be changed in
the algorithm of pigeon-inspired optimization improved by the mutation crossover operator.
Compass and map operators are
This paper applies the improved pigeon-inspired optimization algorithm to the hy-
perparameter optimization of deep learning. The optimal learning rate and the optimal
mAP (mean average precision) are found via an iterative search using the pigeon-inspired
optimization algorithm based on the cross-mutation operator. The primary procedure is
depicted in Figure 9:
Figure 9. The flow chart of optimization of learning rate based on improved pigeon-inspired optimization algorithm.
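The overall procedure in Figure 9 can be summarized by the hedged sketch below: each pigeon encodes one learning rate, the fitness of a pigeon is the mAP obtained by training and evaluating the detector with that learning rate, and the flock is updated until the best mAP stabilizes. `train_and_eval_map` is a hypothetical callback wrapping model training and test-set evaluation, and the update rule is a simplified map/compass-style move rather than the full IPIO algorithm.

```python
import numpy as np

def optimize_learning_rate(train_and_eval_map, n_pigeons=4,
                           lr_low=0.001, lr_high=0.1, generations=20):
    """IPIO-style search for the initial learning rate. Each pigeon is a
    candidate learning rate; its fitness is the mAP returned by the
    (expensive) train-and-evaluate callback. Bounds follow the text."""
    rng = np.random.default_rng()
    lrs = rng.uniform(lr_low, lr_high, n_pigeons)
    vels = np.zeros(n_pigeons)
    best_lr, best_map = None, -1.0

    for t in range(1, generations + 1):
        fit = np.array([train_and_eval_map(lr) for lr in lrs])
        if fit.max() > best_map:
            best_map, best_lr = fit.max(), lrs[fit.argmax()]
        # map/compass-style move toward the best learning rate found so far
        vels = vels * np.exp(-0.2 * t) + rng.random(n_pigeons) * (best_lr - lrs)
        lrs = np.clip(lrs + vels, lr_low, lr_high)
    return best_lr, best_map
```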
eight common types of contraband, such as knives, scissors, and power banks. According
to the construction principle of the training set, verification set and test set, the ratio of
7:1.5:1.5 is used to divide the data set. The model image has an input size of 640 × 640.
Different detection and improved detection models were trained using the training set of
security check baggage as comparative experiments, and the experimental parameters of
other models were identical. Each image contained information such as the category and
location of dangerous goods in the luggage. The operating system used in the experiment
was 64-bit Ubuntu 20.04, the CPU was Intel(R) Core(TM) [email protected], the GPU
was NVIDIA GeForce RTX 3060, and the CUDA version was 11.6. The deep learning
framework was the pytorch2.0 framework and the Python version was 3.9. For the setting
of hyperparameters, the initial learning rate was set to 0.012, the number of training rounds
was set to 100, and the batch was set to 8. The Mosaic enhancement was turned off for the
last 10 rounds throughout the training process. The SGD optimizer was employed in the
model to update the parameters iteratively. It can dynamically modify the learning rate to
improve the convergence of the loss function.
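Under the reported settings, a training run could be launched roughly as follows with the Ultralytics API; the dataset YAML path is a placeholder, and `close_mosaic=10` disables Mosaic augmentation for the final 10 epochs as described above.

```python
from ultralytics import YOLO

# Approximates the reported configuration: 640x640 input, 100 epochs, batch 8,
# SGD, initial learning rate 0.012, Mosaic turned off for the last 10 epochs.
# "xray_contraband.yaml" is a hypothetical dataset description file.
model = YOLO("yolov8s.pt")
model.train(
    data="xray_contraband.yaml",
    imgsz=640,
    epochs=100,
    batch=8,
    optimizer="SGD",
    lr0=0.012,
    close_mosaic=10,
)
```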
(1) Precision rate
Precision = TP / (TP + FP) (21)
(2) Recall rate
Recall = TP / (TP + FN) (22)
(3) Balanced score (F1 Score)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall) (23)
(4) Average precision (AP)
AP = ∫_0^1 Precision(Recall) d(Recall) (24)
(5) Mean average precision (mAP)
mAP = (1/n) ∑_{i=1}^{n} AP_i (25)
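As a small worked example of Formulas (21) to (25), the helpers below compute precision, recall, F1, a discretely integrated AP, and mAP; they assume detections have already been matched to ground truths at a fixed IoU threshold.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """Formulas (21)-(23) from counts of true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(precisions, recalls):
    """Formula (24) approximated by trapezoidal integration of the
    precision-recall curve (recalls must be sorted in ascending order)."""
    return float(np.trapz(precisions, recalls))

def mean_average_precision(per_class_ap):
    """Formula (25): the mean of per-class AP values."""
    return float(np.mean(per_class_ap))
```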
set and test set according to the ratio of 7:1.5:1.5. The data set contained eight types of
items to be detected, including computer, power bank, lighter, scissors, pressure bottle,
umbrella, water bottle and knife. The four models YOLOv8s-DCN-EMA, Yolov8n-DCN-
EMA, YOLOv8s and YOLOv8n were trained with 3360 training set images. When the model
was trained to the 100th round, the returned loss function had reached the convergence
state. The detection and recognition effects of the original and revised model were then
tested using 720 test set photos. The results of the experiments are displayed in the Table 1
and Figure 10. In general, it can be seen that in the test set, the mAP values corresponding to
the best models trained by YOLOv8s, YOLOv8s-DCN-EMA, YOLOv8n, and YOLOv8n-DCN-EMA are 69.45%, 71.94%, 62.65% and 66.06%, respectively. Compared to the initial model, the YOLOv8s-DCN-EMA model optimized by deformable convolution and channel attention has a detection effect mAP that is 2.49% greater. This suggests that the YOLOv8s-DCN-EMA has a stronger ability to extract features and discriminate between different types of information. The enhanced model has significantly increased the detection accuracy of multiple small contraband objects in each category of detection tasks, including lighters, scissors, water bottles, and knives. The corresponding improvement values are 2.2%, 7.3%, 4.1%, and 3.7%, respectively, which indicates that the improved module greatly affects the detection task of small targets. Similarly, the upgraded YOLOv8n-DCN-EMA model’s detection effect is better than the original YOLOv8n model’s detection effect, with an increase of 3.41%, and significant improvements have been made in the model’s overall detection performance and the detection performance of a single item. The YOLOv8s model series contains a comparatively larger amount of parameters than the YOLOv8n model. This is because the model is better at extracting features, which leads to a higher detection accuracy.
Figure 10. The comparative experiment of detection accuracy of YOLOv8s and other improved models on all kinds of contraband. (a) YOLOv8s, (b) YOLOv8s-EMA, (c) YOLOv8s-DCN, (d) YOLOv8s-DCN-EMA.
3.2. Test Experiment of Detection Model Based on Data Enhancement
The collected security luggage data set contains low-resolution photos, limiting the
model’s capacity to extract features and resolve primary and secondary information. Con-
sequently, in order to produce high-resolution images and enhance the model’s detection
Table 1. Comparative experiment of detection accuracy of different improved models on all kinds of
contraband. Among them, YOLOv8s is abbreviated as Y8s, yolov8s-DCN is abbreviated as Y8sD,
yolov8s-EMA is abbreviated as Y8sE, yolov8s-DCN-EMA is abbreviated as Y8sDE, and the rest are
the same in the following table.
Model   mAP/%   Categories (AP/%): Computer   Powerbank   Lighter   Scissors   Pressure   Umbrella   Bottle   Knives
Y8n 62.65 94.0 51.2 41.0 45.8 68.2 87.5 63.1 50.4
Y8nE 64.95 94.2 53.9 43.6 48.7 70.9 89.8 65.7 52.8
Y8nD 65.55 94.9 52.1 41.2 52.9 73.1 89.5 62.6 58.1
Y8nDE 66.06 94.7 52.8 41.4 53.9 69.8 90.5 65.8 59.6
Y8s 69.45 95.5 60.6 53.5 59.4 75.0 91.5 63.7 56.4
Y8sE 71.35 94.3 60.4 54.2 62.5 76.8 91.5 68.0 63.1
Y8sD 71.75 94.4 62.5 54.3 65.1 78.2 92.2 67.5 59.8
Y8sDE 71.94 94.9 60.2 55.7 66.7 77.5 92.6 67.8 60.1
Figure 11. Samples of eight types of contraband and corresponding X-ray scan images and enhanced X-ray scan images.
To validate the impact of the data set on the precision of the model following data enhancement, the training set divided by the data set was utilized to train the YOLOv8 and YOLOv8-DCN-EMA models. The partitioning of the data sets was consistent with Section 3.1.1, with a ratio of 7:1.5:1.5. The setting of experimental parameters was also the same as in Section 3.1.1, and the image input size was 640 × 640. Since the improved model
needs to be used in actual airport security checks, it is more appropriate to use the original
test data as the test set, which can better reflect the model’s capacity for generalization.
The results of the experiments are displayed in the Table 2 and Figure 12. YOLOv8s*,
YOLOv8s-DCN-EMA*, YOLOv8n*, and YOLOv8n-DCN-EMA* represent the four improved
models. The models obtained after image enhancement training are marked with * in the
upper right corner shown below. It can be found that no matter the YOLOv8n series and
its improved model or the YOLOv8s series and its improved model, the detection effect
has been greatly improved. Taking the YOLOv8 series as an example, the detection effect
of YOLOv8s-DCN-EMA* was the best, and its mAP value was 72.96%.
Compared with other groups, the detection effect of YOLOv8s-DCN-EMA was 1.02% worse than that of YOLOv8s-DCN-EMA*. The detection effect mAP of YOLOv8s was 1.18% worse than that of YOLOv8s*. Similarly, in this group of experiments, we also conducted experiments on the model with a single module added as a comparison, and we can see that the detection effect of the model trained on the dataset after super-resolution reconstruction was better than that of the dataset without super-resolution reconstruction. For example, the detection effect mAP of YOLOv8s-DCN* was 0.49% higher than that of YOLOv8s-DCN. The detection effect mAP of YOLOv8s-EMA* was 0.50% higher than that of YOLOv8s-EMA. For the detection effect of different categories of items, the model trained on the improved data set had a better detection effect than the model trained before the improvement. Thus, this experiment demonstrates that when training the security contraband detection model, the more varied the data set content and the higher the image quality, the more robust the model’s ability to extract features and resolve information, and the more effective the corresponding detection effect.
Table 2. Comparative experiment of detection accuracy of different improved models based on SRGAN and data enhancement of all kinds of contraband.
Model   mAP/%   Categories (AP/%): Computer   Powerbank   Lighter   Scissors   Pressure   Umbrella   Bottle   Knives
Y8n* 64.68 94.5 53.5 42.8 48.1 70.5 89.6 65.9 52.5
Y8nE* 66.13 94.7 53.7 43.8 51.6 72.3 89.4 65.7 57.8
Y8nD* 66.85 95.2 53.9 44.3 53.7 73.5 89.6 66.1 58.5
Y8nDE* 67.48 95.3 54.3 44.8 54.5 73.9 90.8 66.2 60.0
Y8s* 70.63 95.5 62.7 53.4 61.2 77.5 92.3 62.2 60.2
Y8sE* 71.85 94.7 62.6 54.9 62.6 78.6 91.6 66.9 62.9
Y8sD* 72.24 94.6 62.9 55.3 65.7 78.4 92.4 67.8 60.8
Y8sDE* 72.66 95.1 62.2 55.6 65.9 78.6 92.5 68.3 63.1
Figure 12. The comparative experiment of detection accuracy of YOLOv8s and other improved models based on SRGAN and data enhancement of all kinds of contraband. (a) YOLOv8s*, (b) YOLOv8s-EMA*, (c) YOLOv8s-DCN*, (d) YOLOv8s-DCN-EMA*.
3.3. Detection Model Test Experiment Based on an Improved Pigeon‐Inspired Algorithm to
Optimize Model Learning Rate
We provide an enhanced pigeon-inspired optimization technique based on the cross-
mutation strategy to optimize the learning rate parameters in the hyperparameters and
raise the detection accuracy by mimicking the homing properties of pigeons. In the improved
flock method, the number of pigeons in the flock is N = 4, and each pigeon represents a
candidate learning rate. The learning rate has three limits: 0.1 for the upper limit, 0.001
Figure 13. Evolution and optimization of improved pigeon-inspired optimization. (a) Outer-cycle evolution curve (best mAP versus the number of generations), (b) Optimization process.
Figure 14. The YOLOv8s-DCN-EMA-IPIO* detection model and the YOLOv8s-DCN-EMA* detection
model are compared experimentally in this diagram. (a) YOLOv8s-DCN-EMA*, (b) YOLOv8s-DCN-
EMA-IPIO*.
Table 3. The experimental comparison between the YOLOv8s-DCN-EMA-IPIO* detection model and
the YOLOv8s-DCN-EMA* detection model.
Model   mAP/%   Categories (AP/%): Computer   Powerbank   Lighter   Scissors   Pressure   Umbrella   Bottle   Knives
Y8sDE* 72.66 95.1 62.2 55.6 65.9 78.6 92.5 68.3 63.1
Y8sDEP* 73.43 95.5 62.9 56.2 67.4 79.4 93.1 69.0 63.9
The bolded font indicates the experimental results of the improved final model in this article.
The results of the experiments reveal that the mAP of the YOLOv8s-DCN-EMA*
model with an optimized learning rate is 0.77% better than that of the non-optimized
model, and the overall detection performance is improved. Meanwhile, the detection
accuracy of each prohibited item has been improved accordingly. In the operation of
outer-loop evolution, when the number of iteration training reaches the 15th round, the
optimal mAP measured by the model on the test set reaches the highest, and the mAP
value in the next five rounds gradually becomes stable and no longer rises. It can be seen
from Figure 13b that the pigeons will eventually reach the destination. That is, the points
corresponding to the optimal learning rate and the optimal mAP will converge generation
by generation and eventually converge to one location. In conclusion, the cross-mutation
strategy-based modified pigeon swarm algorithm may dynamically optimize the model’s
learning rate, allowing it to converge to the ideal learning rate more quickly and effectively
while enhancing detection performance and accuracy.
YOLOv8s   SRGAN   DCN   EMA   IPIO   Precision   Recall   F1 Score   mAP   mAP (50–95)   FPS
✓ 71.4% 65.2% 68.2% 69.5% 47.4% 123
✓ ✓ 73.2% 65.8% 69.3% 70.6% 47.9% 124
✓ ✓ 75.6% 64.7% 69.7% 71.8% 48.9% 96
✓ ✓ 74.1% 64.4% 68.9% 71.4% 48.8% 111
✓ ✓ ✓ 76.4% 64.8% 70.1% 72.2% 49.5% 96
✓ ✓ ✓ 74.8% 66.5% 70.4% 71.9% 49.3% 112
✓ ✓ ✓ 75.3% 64.9% 69.9% 71.9% 49.0% 91
✓ ✓ ✓ ✓ 77.7% 66.2% 71.5% 72.7% 50.3% 92
✓ ✓ ✓ ✓ ✓ 78.2% 67.3% 72.3% 73.4% 50.6% 95
The bolded font indicates the experimental results of the improved final model in this article.
To verify the performance improvement effect of optimizing the deep learning hyper-
parameters under different configurations, we added IPIO to YOLOv8s, YOLOv8s-DCN,
YOLOv8s-EMA, YOLOv8s-DCN-EMA, and YOLOv8s-DCN-EMA* to test the optimized
effect, respectively. The experimental results are shown in Table 7. Optimizing the learning
rate in different models can all improve the final detection accuracy. The highest final
detection accuracy value of 73.4% was obtained by introducing the IPIO algorithm to
optimize the learning rate in the YOLOv8s-DCN-EMA* model.
Table 7. Comparative experiments to optimize model learning rate under different configurations.
4. Conclusions
Taking the detection and identification of security contraband as the research target, this paper proposes a network structure based on YOLOv8s that addresses the missed and false detections of X-ray contraband that occur in real-world screening. YOLOv8s-DCN-EMA-IPIO* combines data enhancement, the deformable convolution DCNv2, the multi-scale attention mechanism EMA, and automatic hyperparameter optimization. DCNv2 is merged with the C2f module in the backbone network to form the C2f_DCN layer, which enlarges the receptive field for contraband of small size and varied shape and thus improves detection performance on small and medium targets (an illustrative deformable-convolution sketch is given at the end of this section). The neck network incorporates the multi-scale attention mechanism EMA to reduce interference from complex background noise and overlapping occlusion during detection, forcing the model to focus on primary information and disregard secondary information. Data enhancement with SRGAN super-resolution reconstruction converts low-resolution X-ray security images into super-resolution images, improving the quality of the training data set and increasing the detection accuracy of the enhanced model. The improved pigeon-inspired optimization based on the cross-mutation strategy is used to optimize the network's hyperparameters, specifically the learning rate: the position of the flock is updated continuously throughout the global search and iteration process, which both selects a good initial flock position and converges quickly to the optimal position of the pigeons, i.e., the optimal learning rate, thereby yielding better detection and recognition accuracy. In the experiments, the mAP of the improved model YOLOv8s-DCN-EMA-IPIO* is 3.98% higher than that of the original YOLOv8s, and the precision and recall are improved by 6.8% and 2.1%, respectively. Because the algorithm is computationally heavy, the next step is to reduce the number of parameters and the amount of computation through a lightweight network while maintaining detection accuracy, further improving the detection speed and real-time performance of the model and facilitating embedded deployment.
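As an illustration of the C2f_DCN idea, the following sketch (PyTorch with torchvision's DeformConv2d) shows a modulated deformable convolution block of the kind that could replace the standard 3×3 convolution inside a C2f bottleneck. The class name, channel sizes, and activation are assumptions made for illustration and do not reproduce the authors' exact implementation.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DCNv2Block(nn.Module):
    """Illustrative modulated deformable convolution (DCNv2-style) block.
    Offsets and modulation masks are predicted from the input feature map,
    and the mask is passed through a sigmoid, as in DCNv2."""

    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        p = k // 2
        self.kk = k * k
        # 2*k*k offset channels (x/y per sampling point) + k*k mask channels.
        self.offset_mask = nn.Conv2d(c_in, 3 * self.kk, k, s, p)
        self.dcn = DeformConv2d(c_in, c_out, k, s, p)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        om = self.offset_mask(x)
        offset, mask = om.split([2 * self.kk, self.kk], dim=1)
        return self.act(self.bn(self.dcn(x, offset, mask.sigmoid())))

# Quick shape check on a dummy backbone feature map.
feat = torch.randn(1, 64, 80, 80)
print(DCNv2Block(64, 64)(feat).shape)  # torch.Size([1, 64, 80, 80])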
H.D.; supervision, Q.G.; project administration, H.D.; funding acquisition, Q.G. All authors have
read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data are not available due to privacy restrictions.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. European Parliament. Aviation Security with a Special Focus on Security Scanners. European Parliament Resolution of 6
July 2011 on Aviation Security, with a Special Focus on Security Scanners (2010/2154(INI)). 2012; pp. 1–10. Available online:
https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:52011IP0329&rid=1 (accessed on 6 February 2024).
2. Mery, D.; Svec, E.; Arias, M.; Riffo, V.; Saavedra, J.M.; Banerjee, S. Modern Computer Vision Techniques for X-Ray Testing in
Baggage Inspection. IEEE Trans. Syst. Man Cybern. Syst. 2017, 47, 682–692. [CrossRef]
3. Schwaninger, A.; Bolfing, A.; Halbherr, T.; Helman, S.; Belyavin, A.; Hay, L. The impact of image based factors and training
on threat detection performance in X-ray screening. In Proceedings of the 3rd International Conference on Research in Air
Transportation, ICRAT 2008, Fairfax, VA, USA, 1–4 June 2008; pp. 317–324.
4. Blalock, G.; Kadiyali, V.; Simon, D.H. The impact of post-9/11 airport security measures on the demand for air travel. J. Law Econ.
2007, 50, 731–755. [CrossRef]
5. Hou, Y. Research on the Relationship between Work-Stress and Safety Performance of Airport Security Inspectors. Master’s
Thesis, Beijing Jiaotong University, Beijing, China, 2018. (In Chinese).
6. Michel, S.; Koller, S.M.; de Ruiter, J.C.; Moerland, R.; Hogervorst, M.; Schwaninger, A. Computer-based training increases
efficiency in X-ray image interpretation by aviation security screeners. In Proceedings of the 2007 41st Annual IEEE International
Carnahan Conference on Security Technology, Ottawa, ON, Canada, 8–11 October 2007; pp. 201–206.
7. Wu, Y.; Zhao, X.; Jin, Y.; Zhang, X. Application of edge detection operator in extracting golden region of image. Beijing Inst. Print.
Technol. J. 2013, 21, 34–37. (In Chinese)
8. Mei, H. Research and Application of Contour Extraction Method for Moving Objects in Surveillance Video. Master’s Thesis,
Central China Normal University, Wuhan, China, 2015. (In Chinese).
9. Su, B.; Chen, J.; Chen, Y. X-ray Image Contraband Classification Method Based on Joint Feature. Digit. Technol. Appl. 2019, 37,
76–77. (In Chinese)
10. Wang, Y.; Zhou, W.H.; Yang, X.M.; Jiang, W.; Wu, W. Classification of foreign bodies in X-ray images based on computer vision.
Chin. J. Liq. Cryst. Disp. 2017, 32, 287–293. (In Chinese) [CrossRef]
11. Han, P.; Liu, Z.; He, W. An effective two-stage enhancement method for Airport Security X-ray carry-on image. Photoelectronics
2011, 38, 99–105.
12. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of
the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
13. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-ResNet and the impact of residual connections
on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017;
pp. 4278–4284.
14. Zhu, Y.; Newsam, S. DenseNet for dense flow. In Proceedings of the 2017 IEEE International Conference on Image Processing
(ICIP), Beijing, China, 17–20 September 2017; pp. 790–794.
15. Bastan, M.; Yousefi, M.R.; Thomas, M.B. Visual words on baggage X-ray images. In International Conference on Computer Analysis
of Images and Patterns; Springer: Berlin/Heidelberg, Germany, 2011; pp. 360–368.
16. Jongseo, P.; Minjoo, C. A k-means Clustering Algorithm to Determine Representative Operational Profiles of a Ship Using AIS
Data. J. Mar. Sci. Eng. 2022, 10, 1245.
17. Esteve, M.; Aparicio, J.; Rodriguez-Sala, J.J.; Zhu, J. Random Forests and the measurement of super efficiency in the context of
Free Disposal Hull. Eur. J. Oper. Res. 2023, 304, 729–744. [CrossRef]
18. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Appl. 1998, 13, 18–28.
[CrossRef]
19. Mery, D.; Svec, E.; Arias, M. Object recognition in baggage inspection using adaptive sparse representations of X-ray images. In
Proceedings of the PSIVT 2015: Image and Video Technology, Auckland, New Zealand, 25–27 November 2015; Springer: Cham,
Switzerland, 2016; pp. 709–720.
20. Mery, D.; Riffo, V.; Zuccar, I.; Pieringer, C. Automated X-ray object recognition using an efficient search algorithm in multiple
views. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA,
23–28 June 2013; pp. 368–374.
21. Mery, D.; Mondragon, G.; Riffo, V.; Zuccar, I. Detection of regular objects in baggage using multiple X-ray views. Insight-Non-Destr.
Test. Cond. Monit. 2013, 55, 16–20. [CrossRef]
22. Wu, H.-B.; Wei, X.-Y.; Liu, M.-H.; Wang, A.-L.; Liu, H.; Iwahori, Y. Improved YOLOv4 for dangerous goods detection in X-ray
inspection combined with atrous convolution and transfer learning. Chin. Opt. 2021, 14, 1417–1425. (In Chinese)
23. Dong, Y.; Li, Z.; Guo, J.; Chen, T.; Lu, S. An improved YOLOv5 model for X-ray prohibited items detection. Laser Optoelectron.
Prog. 2023, 60, 0415005.
24. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In European
Conference on Computer Vision; Springer: Cham, Switzerland; Amsterdam, The Netherlands, 2016; pp. 21–37.
25. Zhang, Y.K.; Su, Z.G.; Zhang, H.G.; Yang, J.F. Multi-scale Prohibited Item Detection in X-ray Security Image. J. Signal Process.
2020, 36, 1096–1106.
26. Guo, S.; Zhang, L. Yolo-C: One-stage network for prohibited items detection within X-ray images. Laser Optoelectron. Prog. 2021,
58, 0810003. (In Chinese)
27. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In
Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99.
28. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
29. Gao, R.; Sun, Z.; Huyan, J.; Li, W.; Xiao, L.; Yao, B.; Wang, H. Small Foreign Metal Objects Detection in X-Ray Images of Clothing
Products Using Faster R-CNN and Feature Pyramid Network. IEEE Trans. Instrum. Meas. 2021, 70, 99. [CrossRef]
30. Wei, Y.; Tao, R.; Wu, Z.; Ma, Y.; Zhang, L.; Liu, X. Occluded prohibited items detection: An X-ray security inspection benchmark
and de-occlusion attention module. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA,
12–16 October 2020; pp. 138–146.
31. Zhang, N.; Luo, Y.; Bao, X.; Jin, Y.; Tu, X. X-ray Security Inspection for Contraband Detection Based on Improved Cascade RCNN
Network. Comput. Syst. Appl. 2022, 31, 224–230. (In Chinese)
32. You, X.; Hou, J.; Ren, D.; Yang, P.; Du, M. Adaptive Security Check Prohibited Items Detection Method with Fused Spatial
Attention. Comput. Eng. Appl. 2023, 59, 176–186. (In Chinese)
33. Wang, Z.; Xu, H.; Zhu, X.; Li, S.; Liu, Z.; Wang, Z. Improved Dense pedestrian detection algorithm based on YOLOv8: MER-YOLO.
Comput. Eng. Sci. 2023, 43, 1–17. (In Chinese) [CrossRef]
34. Gao, A.; Liang, X.; Xia, C.; Zhang, C. An Improved YOLOv8 Dense pedestrian detection algorithm. J. Graph. 2023, 44, 890–898.
35. Li, S.; Shi, T.; Jing, F. Improved Road damage detection algorithm of YOLOv8. Comput. Eng. Appl. 2023, 59, 165–174. (In Chinese)
36. Leng, R. Application of Foreign Objects Identification of Transmission Lines Based on YOLOv8 Algorithm. Master’s Thesis,
Northeast Forestry University, Harbin, China, 2023. (In Chinese).
37. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial
networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December
2014; p. 27.
38. Ledig, C.; Theis, L.; Huszar, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al.
Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690.
39. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
40. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
41. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
42. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
43. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020,
arXiv:2004.10934.
44. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada,
18–22 June 2023; pp. 7464–7475.
45. Ouyang, D.; He, S.; Zhang, G.; Luo, M. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023;
pp. 1–5.
46. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference
on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
47. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
48. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
49. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
50. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE
International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
51. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9300–9308.
52. Duan, H.; Qiao, P. Pigeon-inspired optimization: A new swarm intelligence optimizer for air robot path planning. Int. J. Intell.
Comput. Cybern. 2014, 7, 24–37. [CrossRef]
53. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In
Proceedings of the ECCV 2020, Glasgow, UK, 23–28 August 2020.
54. Lv, W.; Zhao, Y.; Xu, S. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2023, arXiv:2304.08069.
55. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection
Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.