
Received 24 February 2023, accepted 2 May 2023, date of publication 8 May 2023, date of current version 12 May 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3273736

Object Detection and X-Ray Security Imaging:


A Survey
JIAJIE WU, XIANGHUA XU, AND JUNYAN YANG
Department of Computer Science, Hangzhou Dianzi University, Qiantang, Hangzhou, Zhejiang 310018, China
Corresponding author: Jiajie Wu ([email protected])
This work was supported by ‘‘Pioneer’’ and ‘‘Leading Goose’’ Research and Development Program of Zhejiang, China, under Grant
2022C03132.

ABSTRACT Security is paramount in public places such as subways, airports, and train stations, where security screeners use X-ray imaging technology to check passengers' luggage for potential threats. To streamline this process and make it more efficient, researchers have turned to object detection techniques built on deep learning. While some progress has been made, there are few comprehensive literature reviews. This paper provides a comprehensive overview of the standard object detection algorithms and principles used in X-ray dangerous goods detection. The article begins by classifying and describing the more popular deep learning object detection techniques in detail and presenting the commonly used publicly available datasets and metrics. It then summarizes previous applications of deep learning techniques in X-ray dangerous goods detection, highlighting their successes and limitations. Finally, based on an analysis of the experimental results, it summarizes some of the limitations of deep learning in X-ray baggage detection thus far and offers insights into the future of this exciting field. With this review, we hope to provide valuable insights and guidance for those seeking to improve public safety through X-ray imaging and deep learning technology.

INDEX TERMS Object detection, deep learning, X-ray image detection, baggage security, YOLO.

I. INTRODUCTION
X-ray baggage security screening is essential to maintaining public safety in airports, train stations, and subways. In the past, this process was primarily performed by hand, relying on the experience and knowledge of security staff. However, this manual approach was prone to human error and could be impacted by factors such as fatigue and emotional turmoil [1]. A fully automated approach to X-ray baggage security screening is required to overcome these limitations. In this regard, deep learning object detection techniques offer new hope for achieving fully automated detection.
Object detection is one of the most fundamental tasks in the field of computer vision. Getting computers to recognize objects in their field of view, especially dangerous materials in luggage and backpacks, has long been difficult. With the rise of new-generation deep learning techniques such as the Convolutional Neural Network (CNN) and the Transformer [2], [3], which completely replaced the original hand-crafted feature extractors [4], stunning results have been achieved.
The current mainstream trend is towards hybrid algorithms, as they combine the advantages of CNN and Transformer algorithms, with the former specializing in extracting texture structures from images and the latter preferring to extract the contour features of objects [5]. Figure 1 lists some of the most representative algorithms by category.
On the other hand, as a particular image type, X-ray images have always been a focus of research in computer vision. However, the imaging principle of X-ray images differs from that of ordinary images taken in natural light, which leads to poor detection performance of ordinary algorithmic models on X-ray datasets [76]. An increasing number of researchers are conducting focused research and developing various algorithms to alleviate these problems:

The associate editor coordinating the review of this manuscript and approving it for publication was Zhongyi Guo.

FIGURE 1. Overview of CNN-based, Transformer-based and hybrid algorithms.

• In terms of research content, work can be divided into two main points. Firstly, from a model perspective, it focuses on improving the accuracy of detection tasks under multiple overlapping obstacles or in pseudo-color images. Secondly, from a data perspective, it focuses on using deep learning techniques to synthesize better pseudo-color images similar to natural images and on how to quickly and efficiently expand X-ray datasets to improve the final recognition accuracy.
• In terms of research approaches, there are three main types: image classification, object detection, and object segmentation. In practical applications, these three types of algorithms are often combined in order to improve the final recognition accuracy. The primary pieces of literature are shown in figure 2.

We are motivated by the fact that few articles have provided a detailed and comprehensive summary of deep learning algorithms and their applications in X-ray hazardous materials detection. This paper provides a comprehensive review of object detectors based on CNNs, Transformers, and hybrid algorithms and a summary of their application to the X-ray image security screening field to fill this gap. The main contributions of the article are:
• An introduction to the popular object detection algorithms so far and an overview of their classification, including CNN-based, Transformer-based, and hybrid algorithms.
• A series of models applying deep learning algorithms to X-ray baggage hazardous materials detection are described in detail.
• Experiments using different detection algorithms are conducted on an open dataset of X-ray baggage, and meaningful analytical results are given.
• The article provides an outlook on the application of deep learning algorithms in the field of X-ray image security detection.

The other sections of this paper are organized as follows: Section 2 introduces the basic principles of object detection algorithms, the differences between CNN and Transformer algorithms, and the principles of X-ray imaging; Section 3 details each of the three types of object detection algorithms according to their algorithmic structure, namely, CNN-based, Transformer-based, and hybrid algorithms; Section 4 describes the application of deep learning to X-ray hazardous material detection, including classification, detection, and segmentation. In Section 5, four algorithms (YOLOv5 [96], YOLOv7 [27], DINO [43], and Next-ViT [72]) are used to perform dangerous goods detection on a publicly available X-ray baggage dataset, and a meaningful analysis of the results is given; Section 6 provides a summary and outlook, showing the shortcomings of current deep learning algorithms applied in X-ray security screening and looking into the future.

II. BACKGROUND
A. DEEP LEARNING ARCHITECTURE
Object detection aims to localize and identify the targets in a given image. Locating the number and class of objects can become quite challenging due to the masking, exposure, and perspective of the objects in the image. The computer model must overcome these problems to the best of its ability while also considering timeliness [97]. Three common backbone architectures are listed below.

1) CNN-BASED BACKBONE
Krizhevsky et al. [2] won the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) due to the outstanding performance of their model, quickly making CNN the first choice for handling various tasks in the field of computer vision. Many classical CNN feature extraction networks have since emerged, including VGG [6], GoogLeNet [7], ResNets [8],


FIGURE 2. Overview of deep learning algorithms in the field of X-ray baggage dangerous goods detection.

ResNeXt [98], CSPNet [99], and EfficientNet [16]. These network structures are shown in figure 3.
Take VGG-16 as an example; its specific structure is shown in figure 4.¹ The process of extracting features in a convolutional layer is shown in equation 1:

x^l_j = σ( Σ_{i=1}^{N^{l−1}} x^{l−1}_i · w^l_{i,j} + b^l_j ),   (1)

where x^l_j denotes the j-th feature map of the l-th layer, N^{l−1} denotes the number of input features, w^l_{i,j} denotes the convolution kernel of the l-th layer, b^l_j denotes the corresponding bias term, and σ denotes the nonlinear ReLU function. It has been shown that CNNs are not good at processing high-frequency noise in images, so they are more biased toward extracting the texture features of images [100].

¹ Drawing tool from https://github.com/HarisIqbal88/PlotNeuralNet
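As a purely illustrative sketch of the per-layer computation in equation 1, the naive NumPy loop below keeps the summation over input feature maps explicit; real frameworks use optimized convolution kernels, and the shapes here are arbitrary assumptions, not values from this survey.

import numpy as np

def conv_layer_forward(x_prev, W, b):
    # Naive forward pass of equation 1: x^l_j = ReLU( sum_i x^{l-1}_i * w^l_{i,j} + b^l_j ).
    # x_prev: (N_in, H, W) input feature maps; W: (N_in, N_out, k, k) kernels; b: (N_out,).
    n_in, H, Wd = x_prev.shape
    _, n_out, k, _ = W.shape
    out_h, out_w = H - k + 1, Wd - k + 1
    out = np.zeros((n_out, out_h, out_w))
    for j in range(n_out):                              # each output feature map
        for r in range(out_h):
            for c in range(out_w):
                patch = x_prev[:, r:r + k, c:c + k]             # all input maps under the kernel
                out[j, r, c] = np.sum(patch * W[:, j]) + b[j]   # sum over i and the kernel window
    return np.maximum(out, 0.0)                         # ReLU nonlinearity (sigma in equation 1)

x = np.random.rand(3, 8, 8)                             # e.g. 3 input feature maps
W = np.random.randn(3, 4, 3, 3) * 0.1                   # 4 output maps, 3x3 kernels
b = np.zeros(4)
print(conv_layer_forward(x, W, b).shape)                # (4, 6, 6)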
2) TRANSFORMER-BASED BACKBONE
The Transformer was originally used to solve problems such as machine translation in the NLP domain [101], and its structure is shown in figure 5. After the great success of the Transformer-based model, [3] applied it to the image classification task and proposed the ViT model, whose structure is shown in figure 6. Later, the ViT model and its variants were applied to various computer vision tasks, including object detection, scene segmentation, and so on [102], [103], [104], [105], [106].
A large part of the reason why the Transformer is so successful is attributed to the attention mechanism, namely the multi-head self-attentions (MSAs) [107]. Specifically, given the query matrix Q ∈ R^{N×D_k}, the key matrix K ∈ R^{M×D_k}, and the value matrix to be matched V ∈ R^{M×D_v}, where N and M denote the lengths corresponding to Q and K, and D_k and D_v denote the dimensions corresponding to K and V, the computation process is as follows:

Attention(Q, K, V) = softmax( QK^⊤ / √D_k ) V = AV,   (2)

where the attention matrix A = softmax( QK^⊤ / √D_k ). Dividing the dot product of Q and K by √D_k alleviates the gradient vanishing problem of the softmax function.
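The scaled dot-product attention of equation 2 can be sketched in a few lines of NumPy; the shapes below are illustrative and the sketch is not tied to any particular detector implementation.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (N, Dk), K: (M, Dk), V: (M, Dv) -- matches the matrices in equation 2
    Dk = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(Dk))        # attention matrix A, shape (N, M)
    return A @ V                              # output, shape (N, Dv)

# toy usage: 4 queries attending over 6 key/value tokens
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)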
Besides, the training process of MSAs can be viewed as the process of smoothing the feature mapping space, as shown in figure 7. Subfigure (a) is the Loss-Landscape of ViT before smoothing, and (b) is the Loss-Landscape after smoothing using SAM [108]. The flatter the Loss-Landscape, the better the model performance and generalization ability. It is also shown through equation 2 that a positive average eigenvalue mapping enhances the performance of MSAs, while a negative value disrupts the optimization of the model. This provides a direction for optimizing ViT. The reason why ViT requires a large amount of data for pre-training to achieve better results is that a large amount of data can help the model suppress negative Hessian eigenvalues in the early stages of training, smoothing the Loss-Landscape and making the loss more convex [5], [109]. [110] shows that MSAs are good at extracting outline information of objects and are not good at processing low-frequency signals.
In vanilla ViT, if the pixels are directly processed using the attention mechanism as in NLP tasks, the computational complexity is a quadratic multiple of the image size, which is unacceptable for most image processing tasks. In addition, ViT with a fixed-scale token is not fully applicable to vision tasks because the objects in images vary in scale. Many improvements have been made to address these flaws.
Taking the Swin Transformer as an example, the aforementioned defects are solved by using hierarchical feature maps obtained through a downsampling operation and the shifted-window attention mechanism [54]. From figure 8, we can see that the spatial resolution of the hierarchical feature map in the Swin Transformer is the same as that in ResNet, so it can easily replace ResNet as the backbone in a network. The use of W-MSA and SW-MSA modules to implement the attention mechanism greatly reduces the computational resources required during computation through window exchange.

3) HYBRID BACKBONE
Hybrid frameworks are one of the current research hotspots [59], [60], [61], [72], [74], [111]. It has been shown that CNN will filter the low-frequency part of the image, and MSAs will filter the high-frequency part of the image; it is known that the high-frequency signal corresponds to the outline edges in the image, and the low-frequency part mostly corresponds to the background [5].
The latest generation of hybrid frameworks hybridizes CNN with MSAs inside the stage [5], [72], not outside the stage, as shown in figure 9.

B. X-RAY IMAGING
The different imaging principles lead to differences between X-ray and natural light images, as shown in figure 10.


FIGURE 4. A diagram of the VGG-16 architecture. In the diagram, ‘‘conv1’’ denotes the first convolutional layer, and ‘‘fc’’ denotes the fully connected layer. Specifically, conv1...5 = Convolution + ReLU + MaxPooling, fc = FullyConnected + ReLU.

FIGURE 5. The transformer architecture in vanilla ViT [102].

FIGURE 6. Overview of vanilla ViT architecture [3].

Current X-ray images are available in 2D and 3D. 3D images are usually baggage images scanned by CT (Computed Tomography) machines [112]. Because of their high price, the most common X-ray dangerous goods images on the market are mainly 2D images, so the study in this paper mainly focuses on 2D images.

FIGURE 3. CNN-based backbone architectures [97]. From left to right in the figure are AlexNet, VGG-16, GoogLeNet, ResNet-50, CSPResNeXt-50, and EfficientNet-B4.


FIGURE 7. (a) is the Loss-Landscape of ViT-B; (b) is the Loss-Landscape after smoothing by SAM [109].

FIGURE 10. Different imaging demonstrations of the same objects under natural light and X-rays [92].

FIGURE 8. In this figure, (a) is the Swin Transformer architecture; (b) is the W-MSA and SW-MSA modules; (c) is the abstracted hierarchical feature graph; (d) represents the patch merging process, which is used to simulate the convolutional kernel operation in CNN; (e) is the cyclic shift self-attention mechanism, which is used to simulate the Cross-Attention operation [54].

FIGURE 9. A mix of MSAs and Conv in the stage. The above diagram
shows the regular CNN block. The diagram below shows the convergence
of the Conv and MSAs layers within the stage.

FIGURE 11. X-ray pseudo-color map imaging process [116] and [114].

Principle of X-ray imaging: The main principle of X-ray imaging is that the X-ray tube produces a beam that can penetrate the scanned object. Different objects have different densities, and X-rays attenuate differently for different material densities. This process can be expressed as equation 3:

I_x = I_0 e^{−µx}   (3)

where I_x denotes the intensity of the X-ray after passing through thickness x, I_0 is the initial intensity value, and µ denotes the linear attenuation coefficient of the material.
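As a one-line numerical illustration of equation 3, the snippet below evaluates the attenuation; the coefficient and thickness are arbitrary placeholder values, not measured material data.

import math

def transmitted_intensity(i0, mu, x):
    # Beer-Lambert style attenuation: I_x = I_0 * exp(-mu * x)
    return i0 * math.exp(-mu * x)

print(transmitted_intensity(i0=1.0, mu=0.5, x=2.0))   # ~0.368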
Currently, with the development of technology, X-ray machines are equipped with various energies that can produce a wide range of X-ray images for identifying the density and effective atomic number Z_eff of objects. The intensity estimates can be combined with Z_eff by [113] and converted into the corresponding pseudo-color images, as shown in figure 11. In addition, it is also possible to form X-ray maps from multiple angles [76], [114], [115].

C. OBJECT DETECTION METRICS
The two most common metrics used in object detection tasks are as follows:
1) GFLOPs: Giga Floating-point Operations Per Second refers to the number of billions of floating-point operations per second, often used to evaluate the performance of a model on a GPU.
2) mAP: i.e., Mean Average Precision, was first proposed in the VOC competition. The calculation

of precision requires the participation of IoU (Intersection over Union), which is the ratio of the overlap area between the ground truth and the predicted bounding box to their union area, as shown in figure 12.

FIGURE 12. IOU example diagram. The ground truth in the figure is the complete horse. The ratio of the green candidate box to the overlapping part of the ground truth is 0.42 [104].

When quantifying the predicted bounding box, a threshold must be set to determine if the detection is correct. If the IoU is greater than the threshold, it is classified as a True Positive, while an IoU below the threshold is classified as a False Positive. If the model fails to detect an object in the ground truth, it is called a False Negative. Precision is used to measure the percentage of correct predictions. In contrast, recall measures the correct predictions relative to the ground truth. They are calculated by referring to equation 4 and equation 5:

Precision = True Positive / (True Positive + False Positive) = True Positive / All Observations   (4)

Recall = True Positive / (True Positive + False Negative) = True Positive / All Ground Truth   (5)

Based on equation 4 and equation 5, the average precision of each category is calculated separately, i.e., with N different recall rates, N precision rates are obtained. Moreover, mAP is the average of the precision over all categories, which can serve as a single metric for the final evaluation [117].
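To make the IoU-thresholded counting behind equations 4 and 5 concrete, here is a small self-contained sketch; the greedy one-to-one matching rule and the 0.5 threshold are illustrative assumptions, not part of the original text.

def iou(box_a, box_b):
    # boxes are (x1, y1, x2, y2); returns intersection-over-union
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(pred_boxes, gt_boxes, thr=0.5):
    # greedily match each prediction to the best unmatched ground truth at a fixed IoU threshold
    matched, tp = set(), 0
    for p in pred_boxes:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gt_boxes):
            if j not in matched and iou(p, g) > best_iou:
                best_j, best_iou = j, iou(p, g)
        if best_iou >= thr:
            tp += 1
            matched.add(best_j)
    fp = len(pred_boxes) - tp            # unmatched predictions
    fn = len(gt_boxes) - tp              # undetected ground truths
    precision = tp / (tp + fp + 1e-9)    # equation 4
    recall = tp / (tp + fn + 1e-9)       # equation 5
    return precision, recall

print(precision_recall([(0, 0, 10, 10), (20, 20, 30, 30)], [(1, 1, 10, 10)]))  # (~0.5, ~1.0)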
D. DATASETS
Four common datasets for object detection tasks are presented in this section, along with many X-ray baggage detection datasets.

1) OBJECT DETECTION DATASETS
Four standard public datasets for object detection tasks are described below, with detailed parameters in table 1.

a: PASCAL VOC [118]
The PASCAL VOC (Pascal Visual Object Classes) dataset refers to the dataset used in the challenge that started in 2005 and included four object classes for classification and object detection tasks [117]. In 2007, the VOC07 dataset collected 5K training images and 12K labeled objects [119]. In 2012, VOC12 expanded the dataset to 11K training images, over 27K labeled objects, and 20 classes, and also included tasks such as segmentation and action detection.

b: ILSVRC [120], [121]
ILSVRC (ImageNet Large Scale Visual Recognition Challenge) is a challenge that ran from 2010 to 2017. Two hundred of its categories were hand-picked for the object detection task, consisting of more than 500,000 images. Meanwhile, ImageNet1000 is a subset of ImageNet with 1000 different object categories and a total of 1.2 million images. It provides a standardized benchmark for the ILSVRC image classification challenge.

c: MS-COCO [122]
The MS-COCO (Microsoft Common Objects in Context) dataset is the most commonly used dataset for the object detection task. Eighty target categories are used in the detection task, as shown in figure 13, corresponding to the recognition level of a 4-year-old child. It was launched in 2015, and its popularity has only grown since then. It has over 2 million instances with an average of 3.5 categories per image. In addition, it contains 7.7 instances per image, much more than other popular datasets. MS-COCO also includes images from different perspectives.

d: Open Image [123]
Open Image is from Google. This dataset contains 9.2 million images, and each image is annotated at the image level with object bounding boxes and segmentation masks. Sixteen million bounding boxes, 600 categories, and an average of 8.3 object categories per image were annotated on 1.9 million images by Open Image for the object detection task.

2) X-RAY BAGGAGE DETECTION DATASETS
Six public datasets and one private dataset are described in detail below. The X-ray images in the six public datasets are shown in figure 14. In addition, table 2 summarizes additional X-ray security screening imaging datasets.

a: GDXray [140]
The Grima X-Ray Dataset contains 19,407 images of castings, welds, luggage, natural images, and backgrounds, as shown in table 3. Although this dataset contains multi-view luggage detection images, it is not suitable for deploying contemporary large-scale deep learning algorithms due to its single scene.


TABLE 1. Detailed parameters for four types of object detection datasets [104].

TABLE 2. Commonly used data sets for X-ray baggage detection of dangerous goods [125].

b: SIXray [81]
The dataset contains 1,059,231 X-ray images, divided into six categories: guns, knives, wrenches, pliers, scissors, and hammers, as shown in table 4. The SIXray dataset restores the actual scene and is a class of dataset with a severe imbalance between positive and negative samples, which is suitable for real-time detection.
In the experimental part of the article, the authors set up three sub-datasets, SIXray10, SIXray100, and SIXray1000, where the numbers 10 and 100 in the first two sub-datasets represent the ratio of normal to prohibited items. For example, in the SIXray10 dataset, there are 98,219 images, of which 8,929 are prohibited items and 89,290 are normal items. The SIXray1000 sub-dataset is a mixture of 1,000 randomly selected prohibited-item images and 1,050,302 ordinary images.

c: Compass-XP Dataset [116]
The instances in this dataset are 501 instances drawn from 369 target classes of ImageNet, as shown in table 5. Moreover, it contains 1901 image pairs, and each pair contains


TABLE 4. SIXray Dataset [81].

X-ray images scanned with a Gilardoni FEP ME 536 paired with natural images taken with a Sony DSC-W800 digital camera. Each X-ray image package contains different image versions: low energy, high energy, material density, grayscale (a combination of low and high energy), and pseudo-color RGB, which are ideal for studying X-ray imaging principles. However, it is unsuitable for deep learning-based target recognition tasks.

d: OPIXray [141]
This dataset comes from real-time airport security data, manually annotated by professional security officers. It contains 8,885 X-ray images, divided into five categories: folding knives, straight knives, scissors, utility knives, and multi-tool knives, as described in table 6.

e: PIDray [89]
This dataset covers many real-world scenarios for detecting prohibited items, especially intentionally hidden items. The PIDray dataset contains 12 categories with a total of 47,677 X-ray images, and each image comes with a high-quality annotated segmentation mask and bounding box. The test set is composed of three classes according to the degree of masking: easy, medium, and challenging, as described in table 7.

f: HiXray [92]
This dataset contains eight categories: lithium-ion prismatic batteries, lithium-ion cylindrical batteries, water, laptops, cell phones, tablets, cosmetics, and non-metallic lighters, for a total of 102,928 contraband items, which is by far the largest number of contraband items included in a dataset. The data was collected from real airport security checks and manually annotated by professional security screeners, as described in table 8.

g: DB
Durham Baggage (DB) Patch/Full Image. This database is private and not publicly available. The dataset includes 15,449 pseudo-color X-ray images from dual-energy four-view Smiths 6040i machines, including 494 cameras, 1,596 ceramic knives, 3,208 knives, 3,192 guns, 1,203 gun parts, 2,390 laptops, and 3,366 benign images. The derived databases include DBP2 and DBP6 [142], which are used for the classification task, and DBF2 and DBF6 [84], [91].

FIGURE 13. Data distribution of the COCO training and validation datasets [124].
FIGURE 14. The six most common public datasets in X-ray baggage detection (2022).

III. OBJECT DETECTION
Early detectors can be classified into two categories according to the detection process: single-stage and two-stage. The latter uses more candidate frames than the former, which may contain object suggestions for the detected objects. However, as research progressed, the introduction of more advanced detectors broke the boundaries of single/dual-stage classification, and classifying detectors by the single/dual-stage distinction alone has become inadequate. Therefore, instead of presenting the classification of detectors according to the single/dual-stage approach as other papers have done, this paper presents the


TABLE 5. COMPASS-XP Dataset [116].

TABLE 6. OPIXray Dataset [141].
TABLE 7. PIDray Dataset [89].
TABLE 8. HiXray Dataset [92].

classification according to model architecture categories [97], [104].
Three types of object detectors are presented in this section: CNN-based detectors (section III-A), Transformer-based detectors (section III-B), and hybrid detectors (section III-C). Figure 15 lists the more mainstream models in the field of object detection.

A. CNN-BASED DETECTOR
The two most common series of CNN-based detectors are the R-CNN series and the YOLO series [97], [104]. The former is a two-stage detector with slow speed and long training time but high accuracy. Meanwhile, the latter is a single-stage detector, characterized by relatively low accuracy and poor detection of small objects but faster detection. The most common object detection algorithms in both families are summarised in table 9.

1) RCNN SERIES DETECTORS
R-CNN [18] was the first to use a CNN classification pre-training model (such as AlexNet [2]) to extract image features and implement the object detection task with the selective search region-candidate-box algorithm [155], as shown in figure 16. The R-CNN algorithm is divided into the following four steps:
1) Draw candidate regions. A selective search algorithm is used on the input image to select multiple high-quality candidate regions. These regions usually have different shapes and sizes, and each candidate region may contain a certain number of targets.
2) Fine-tune a CNN pre-trained model. For fine-tuning, the classification head of the CNN model pre-trained on the ImageNet1K dataset is changed from 1000 classes to 20. After that, each candidate region of the original image is resized and input to the CNN model, and the pre-trained model is used to extract features of the image in each region.
3) Train an SVM classifier. The SVM classifier (containing two categories, i.e., positive and negative samples) is trained using the features extracted by the CNN to decide the target category in each region. When training the SVM, if the result belongs to the target classification, it is determined to be a positive sample; otherwise, it is a negative sample. The N candidate regions (typically 2K) are labeled as positive or negative samples using IoU: if the IoU between a candidate region and the ground truth is greater than 0.5, the sample is considered positive, and its category is kept the same as the ground-truth category; if the IoU is less than 0.3, it is considered a negative sample.
4) Train a regression-based bounding box. Regression is used to fine-tune the location of the bounding box; a linear regression model is trained for each class to determine whether the bounding box is optimal.

Afterward, a series of improved algorithms were proposed based on R-CNN; these improvements are of two kinds: speed and network structure.
SPPNet [19] adds spatial pyramid pooling [156] on top of the last convolutional layer of the CNN for the first time, generating fixed-length features for candidate regions of arbitrary size on the image, which speeds up R-CNN evaluation.
Fast R-CNN [20] is an improvement of the training process of R-CNN and SPPNet. It achieves unified training


FIGURE 15. Timeline of object detection algorithm development.

TABLE 9. CNN-based detectors on the COCO 2017 val dataset.

of the softmax classifier, SVM, and bounding box regressor through shared convolutional computation over the region candidate boxes and adds a new layer, namely the RoI (Region of Interest) pooling layer, to the network. Fast R-CNN significantly improves computational efficiency.

FIGURE 16. Overview of R-CNN architecture.

Faster R-CNN [21] uses a region proposal network (RPN) to replace the selective search strategy of the previous networks, which can generate a series of candidate boxes for any input image. From figure 17(b), we can see that Faster R-CNN consists of four major components in general, which are:
• Backbone. It is used to extract features from images;
• Region Proposal Layer. This layer is the core layer of the network and consists of four parts, namely the RPN (region proposal network), the proposal layer, the anchor target layer, and the proposal target layer. (i) The RPN is used to calculate and generate class prediction scores and bounding box regression coefficients; (ii) the proposal layer uses the anchor boxes generated by the anchor generator and trims the number of bounding boxes by applying non-maximum suppression (NMS) based on the foreground score. It also generates a converted bounding box


by applying the regression coefficients generated by the RPN to the corresponding anchor boxes; (iii) the goal of the anchor target layer is to select the anchors that can be used to train the RPN network; (iv) the proposal target layer aims to select good RoIs from the list of RoIs output from the proposal layer. These RoIs undergo RoI pooling operations with the feature maps generated by the backbone and are passed to the rest of the network for computing the predicted category scores and bounding box regression coefficients;
• RoI Pooling Layer implements the spatial transformation of features; specifically, it samples the input feature map given the coordinates of the proposed bounding box of the region generated by the proposal target layer. These coordinates are usually not on integer boundaries and therefore require interpolation-based sampling;
• Classification Layer takes the output feature maps generated by the RoI Pooling Layer and performs convolution operations. The final output is realized by two fully connected layers, namely bbox_pred_net and cls_score_net. The former generates a set of class-specific bounding boxes, and the latter generates the class probabilities of each region suggestion.
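Because the proposal layer above relies on non-maximum suppression, the following small, framework-free sketch shows greedy NMS; the box format and threshold values are illustrative assumptions rather than the settings of any specific detector.

def box_iou(a, b):
    # a, b: (x1, y1, x2, y2)
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def nms(boxes, scores, iou_thr=0.7):
    # Greedy non-maximum suppression over (x1, y1, x2, y2) boxes with foreground scores.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)                       # keep the highest-scoring remaining box
        keep.append(i)
        order = [j for j in order              # discard boxes that overlap it too much
                 if box_iou(boxes[i], boxes[j]) < iou_thr]
    return keep

print(nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7], iou_thr=0.5))
# -> [0, 2]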

In general, Faster R-CNN is still the most widely used type of detector in industry today.

FIGURE 17. Overview of Faster R-CNN.

R-FCN [146] further improves on the problem of position sensitivity of targets in the different sub-networks of Faster R-CNN. In Faster R-CNN, the image is processed in the first step by the backbone and then fed into the RoI-related sub-network in the second step for processing. In the feature maps obtained after the first step of processing, the RoIs are shared and insensitive to object location; in the second step, the RoIs are processed independently, i.e., they are sensitive to the target location.
R-FCN divides each RoI into k × k regions, which are mapped by a position-sensitive score map, with each region generating a corresponding response value. If all of this response information is greater than the threshold of a certain category, then the region is judged as that category; otherwise, it is judged as the background category.
FPN [150] is a feature processing approach that is often applied in frameworks such as Faster R-CNN. FPN connects, in a top-down and lateral manner, high-level features with low resolution and high semantic information to low-level features with high resolution and low semantic information, so that features at all scales are rich in semantic information. This allows multi-scale feature representations with strong semantic information to be learned while improving computational efficiency.
Cascade R-CNN [149] is more like a training method. The essence is to cascade multiple R-CNN networks based on different IoU thresholds on top of Faster R-CNN, i.e., the output of the previous R-CNN network is used as input to the latter R-CNN network, after which the detection results are optimized by progressively raising the IoU threshold. Almost all detectors based on R-CNN structures can use this cascading approach to improve detection accuracy.
Mask R-CNN [22] extends Faster R-CNN and can handle pixel-level segmentation of target instances. It classifies each pixel into multiple segments, uses Faster R-CNN for object framing, and adds an extra mask head. Mask R-CNN replaces the previous RoIPool layer with a RoIAlign layer to avoid pixel-level misalignment due to spatial quantization. It is also among the most popular models in the R-CNN family.
HTC [151] is an improvement on Cascade R-CNN and Mask R-CNN:
• By introducing the Interleaved Execution operation, the algorithm increases the information interaction between different branches within each stage, i.e., the information processed by the box branch is then passed to the mask branch for processing, eliminating the information gap between the training and testing processes.
• The algorithm adds an information flow connection between the mask branches of adjacent stages, allowing the mask branches of different stages to interact with each other.


• The algorithm also processes information for semantic segmentation.

The HTC algorithm is mainly used for semantic segmentation problems but can also be modified for object detection problems, e.g., HTC++ [157].

2) YOLO SERIES DETECTORS
The YOLO series detectors received much attention once they were introduced. YOLO, or You Only Look Once, differs from R-CNN in that YOLO does not select targets by generating candidate frames but performs target prediction and recognition directly at the pixel level.²
YOLOv1 [23] grids the image and then uses the grid for detection. Each grid cell is responsible for detecting targets that fall within its region. Each cell can detect multiple bounding boxes, and a bounding box is denoted by p_box = (x, y, w, h, α), where (x, y) denotes the center coordinates of the target object, (w, h) denotes the width and height, and the confidence α = p_obj × IOU, with p_obj denoting the probability that an object falls in the grid cell. The final prediction is res = (p_box, Classes), where C denotes the number of target classifications in the dataset. The above procedure is shown in figure 18.
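To make the cell-based prediction format above concrete, here is a toy decoding of one YOLOv1-style cell prediction into an image-space box; the 7×7 grid and 448-pixel input size are illustrative assumptions based on the description, not values taken from this survey.

def decode_cell(pred, row, col, grid=7, img_size=448):
    # pred = (x, y, w, h, conf, class_scores): x, y are offsets within the cell,
    # w, h are fractions of the whole image, conf = p_obj * IOU.
    x, y, w, h, conf, cls = pred
    cell = img_size / grid
    cx = (col + x) * cell                  # box center in pixels
    cy = (row + y) * cell
    bw, bh = w * img_size, h * img_size
    best_class = max(range(len(cls)), key=lambda c: cls[c])
    score = conf * cls[best_class]         # class-specific confidence
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2), best_class, score

box, cls_id, score = decode_cell((0.5, 0.5, 0.2, 0.3, 0.9, [0.1, 0.8, 0.1]), row=3, col=2)
print(box, cls_id, round(score, 2))        # ((115.2, 156.8, 204.8, 291.2), 1, 0.72)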
YOLOv2 [24] is an improvement on YOLOv1. The YOLO9000 model can detect 9000 target classes. YOLOv2 strives for a balance of speed and accuracy.
YOLOv3 [25] used Darknet-53 as the backbone for feature extraction, compared to the previous two versions, and achieved SOTA results in the same period.
YOLOv4 [26] makes changes to the model structure. A BOF (Bag of Freebies) strategy is used in the model, which can improve detection accuracy with only an increase in training cost and no impact on inference speed. The BOF strategy in YOLOv4 includes data augmentation, regularization, CmBN (Cross mini-Batch Normalization), CIoU-loss [162], and other techniques. BOS (Bag of Specials) strategies, i.e., plug-in modules and post-processing methods that add only a small amount of inference time but can significantly improve object detection accuracy, are also used in YOLOv4. The former enhances certain model properties, such as expanding the receptive field, introducing attention mechanisms, or enhancing feature integration capabilities, while the latter filters the model prediction results. BOS strategies in YOLOv4 include the Mish activation [163], CSP (Cross-Stage Partial connections) [99], and other techniques. The architecture is shown in figure 19.
YOLOv5³ and YOLOv6⁴ only release source code and have no related academic paper. YOLOv5 retains much of the network structure of YOLOv4 and is one of the most used and popular object detection models in industry, especially for applications in the field of real-time video detection. The backbone design idea of YOLOv6 is mainly derived from RepVGG [17], with the primary purpose of achieving more straightforward deployment and faster inference speed in industry.
YOLOx [30] argues that YOLOv4 and YOLOv5 may have over-optimized the anchor-based detection method, so YOLOx chose to make improvements to YOLOv3. Previously, YOLOv5 achieved the best performance on the COCO dataset (48.2% AP). The improvements made by YOLOx include: (i) YOLOx changes the original Coupled Head to a Decoupled Head, as shown in figure 20(b); specifically, YOLOx decouples Cls, Reg, and IoU, allowing the network to learn the categories and the corresponding coordinate regressions better. (ii) Detection is carried out in an anchor-free manner, i.e., without generating a priori frames (anchors) of different sizes and aspect ratios at each position on a given feature map. The purposes of this are:
1) Reducing the computational effort of the model by producing fewer prediction frames;
2) Mitigating positive and negative sample imbalance;
3) Removing the need to design the anchor parameters manually.
YOLOR [28]'s encoder is used to learn both implicit and explicit knowledge representations, using implicit information to perform different tasks; this technique is also integrated into YOLOv7.
YOLOv7 [27] increases the training cost in order to improve accuracy. By using the reparametrization trick, the inference cost is kept constant. In addition, YOLOv7's backbone is based on a cascade structure. Changes in network depth often bring about changes in width when the model is scaled, and thus need to be considered comprehensively when the model is scaled for evaluation. Several trainable Bag-of-Freebies methods are designed in the paper to solve the above problems, including planned reparametrization module design, a dynamic label assignment strategy (coarse for auxiliary, fine for loss), batch normalization in the topology, and EMA module usage. YOLOv7 can effectively reduce about 40% of the parameters and 50% of the computation of existing real-time object detectors and has faster inference and higher detection accuracy, with the structure shown in figure 21.

² [158], [159], [160], [161] integrate the common YOLO series algorithms.
³ https://github.com/ultralytics/yolov5
⁴ https://github.com/meituan/YOLOv6

B. TRANSFORMER-BASED DETECTORS
The MSAs mechanism in the Transformer allows for better extraction of contour information from the image. The main limitation of the Transformer is its high computational overhead, which is usually quadratic in the input feature size. Common Transformer-based object detection algorithms are outlined in table 10.
DETR [31] is one of the first end-to-end transformer-based object detectors. It treats the object detection problem as a set prediction problem. Unlike traditional object detectors, DETR learns anchors (they are not hand-designed) and does not use non-maximum suppression (NMS) post-processing. Instead, position-encoded ‘‘object queries’’ are fed to the


FIGURE 18. Overview of YOLOv1 architecture [23].

TABLE 10. Transformer-based detectors on the COCO 2017 val dataset.

decoder for finding the features of an object in the image and decoding image features. The predictor produces detection results directly from the decoder's output queries, as figure 22 shows.

FIGURE 19. Overview of YOLOv4 architecture [26].

To avoid the problem of ‘‘object queries’’ in which an object cannot be matched accurately during the query process, DETR adds a particular class, the no-object label (∅), in addition to the regular class labels. In the training process, the Hungarian algorithm, a bipartite graph matching algorithm, is used to perform one-to-one matching of the ground truth y_i with the predicted target ŷ_σ(i), with the matching strategy σ̂ and the matching cost L_match given by:

σ̂ = argmin_{σ∈S_N} Σ_{i}^{N} L_match(y_i, ŷ_σ(i)),
L_match(y_i, ŷ_σ(i)) = −1_{c_i≠∅} p̂_σ(i)(c_i) + 1_{c_i≠∅} L_box(b_i, b̂_σ(i)).   (6)

FIGURE 22. DETR architecture [103], and [31].

In addition, the Hungarian loss function L_H includes the class label loss and the bounding box loss over all matching pairs (c_i ≠ ∅):

L_H(y, ŷ) = Σ_{i=1}^{N} [ −log p̂_σ̂(i)(c_i) + 1_{c_i≠∅} L_box(b_i, b̂_σ̂(i)) ].   (7)

where L_box denotes the bounding box loss and is calculated as:

L_box(b_σ(i), b̂_i) = λ_iou L_iou(b_σ(i), b̂_i) + λ_L1 ||b_σ(i) − b̂_i||_1,   (8)

where λ_iou, λ_L1 ∈ R are hyperparameters and L_iou(·) is calculated as:

L_iou(b_σ(i), b̂_i) = 1 − ( |b_σ(i) ∩ b̂_i| / |b_σ(i) ∪ b̂_i| − |B(b_σ(i), b̂_i) \ (b_σ(i) ∪ b̂_i)| / |B(b_σ(i), b̂_i)| ).   (9)

where |·| denotes the area of a set; the intersection and the union can be calculated from linear functions of b_σ(i) with the min/max of b̂_i, and B(b_σ(i), b̂_i) denotes the maximum bounding box containing b_σ(i) and b̂_i [31].

FIGURE 20. YOLOx architecture and its decoupled head mechanism [109].
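As a hedged illustration of the bipartite matching in equation 6, the sketch below uses SciPy's Hungarian solver with a simplified cost (negative class probability plus an L1 box distance); the exact cost weighting and the GIoU term used by DETR are omitted, and all shapes are illustrative.

import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes):
    # pred_probs: (N, num_classes) softmax scores; pred_boxes: (N, 4);
    # gt_labels: (M,); gt_boxes: (M, 4). Returns one-to-one (pred_idx, gt_idx) pairs.
    cost_class = -pred_probs[:, gt_labels]                                        # (N, M)
    cost_bbox = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)     # (N, M)
    cost = cost_class + cost_bbox
    pred_idx, gt_idx = linear_sum_assignment(cost)                                # Hungarian solver
    return list(zip(pred_idx, gt_idx))

# toy example: 3 predictions, 2 ground-truth objects
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
boxes = np.array([[0.1, 0.1, 0.2, 0.2], [0.7, 0.7, 0.2, 0.2], [0.4, 0.4, 0.2, 0.2]])
gt_l = np.array([0, 1])
gt_b = np.array([[0.1, 0.1, 0.2, 0.2], [0.7, 0.7, 0.2, 0.2]])
print(hungarian_match(probs, boxes, gt_l, gt_b))   # [(0, 0), (1, 1)]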
FIGURE 21. Overview of YOLOv7 architecture.

Although DETR was one of the first object detectors to use a transformer structure, its disadvantages include poor convergence and poor performance on high-resolution images, mainly due to:
1) Encoder: the input is an image feature extracted by the backbone (ResNet [8]), and the length and width of this feature are denoted by H and W, respectively. The complexity of the self-attention calculation (i.e., equation 2) grows quadratically with the feature pixel space, i.e., O(H²W²C), where C is the feature dimension.
2) Decoder: DETR requires the computation of both cross-attention and self-attention modules. In computing cross-attention, ‘‘object queries’’ are computed and extracted from the feature mapping output by the encoder through the attention mechanism (i.e., equation 2); specifically, Q is the ‘‘object queries’’ and K is the feature mapping of the encoder output. So the computational complexity of cross-attention grows linearly with the feature pixel space, i.e., O(HWC² + NHWC), where N denotes the number of ‘‘object queries’’. When computing self-attention, ‘‘object queries’’ perform QKV computations on each other, so the computational time complexity is O(2NC² + N²C) [33].
In response to the shortcomings of DETR, many algorithms have focused on improving the internal structure of the encoder and decoder, including reducing the cost of the attention computation, improving the structure of ‘‘object queries’’, and selecting the feature-mapping part of the encoder output used in the cross-attention calculation.
Deformable DETR [33] is one of the most widely used improved algorithms for DETR. Its most significant

FIGURE 23. The Deformable Attention Module [33].

contribution is to improve the performance and accuracy of DETR in detecting small objects using a multi-level deformable attention mechanism. With 10× fewer training epochs than the original DETR, the computational complexity is O(2N_qC² + min(HWC², N_qKC²)), with an inference speedup of 1.6 times.
The Deformable Attention Module in Deformable DETR attends only to a few key sampling points near the reference point rather than the entire feature map. This reduces the dimensionality of K in equation 2 and thus reduces the computational complexity, as shown in figure 23.
The deformable attention features are calculated as:

DeformAttn(z_q, p_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_{mqk} · W′_m x(p_q + Δp_{mqk}) ],   (10)

where x ∈ R^{C×H×W} denotes the feature map output by the encoder, p_q denotes the 2D reference point, z_q denotes the q-th content feature, A_{mqk} denotes the attention weight of the m-th attention head at the k-th sampling point, and Δp_{mqk} denotes the sampling offset with respect to the reference point. From equation 10, it can be seen that the 2D reference points are involved in the computation as part of the cross-attention query.
Based on equation 10, the formula for multi-level deformable attention features can be written as:

MSDeformAttn(z_q, p̂_q, {x^l}_{l=1}^{L}) = Σ_{m=1}^{M} W_m [ Σ_{l=1}^{L} Σ_{k=1}^{K} A_{mlqk} · W′_m x^l(φ_l(p̂_q) + Δp_{mlqk}) ],   (11)

where {x^l}_{l=1}^{L} are the multi-level input feature maps, p̂_q ∈ [0, 1]² denotes the normalized coordinates of the reference point corresponding to each query element q, Δp_{mlqk} denotes the sampling offset, and A_{mlqk} denotes the attention weight of the m-th attention head under the l-th level input feature map with respect to the k-th sampling point.
In addition, Deformable DETR provides a so-called ‘‘two-stage’’ selection strategy as a candidate strategy. The top-K feature values output from the last layer of the encoder are selected as prior knowledge to strengthen the queries in the decoder.

FIGURE 24. Overview of Conditional DETR architecture.

Conditional DETR's [37] most significant contribution is the different treatment of content queries in the decoder (the output of self-attention in the decoder) from spatial location queries, as shown in figure 24. From the figure, the K in the decoder cross-attention is generated from the position embedding p_k and the encoder output content embedding c_k, respectively. Moreover, Q includes the spatial location embedding p_q in addition to the content embedding c_q formed by the decoder self-attention.
The spatial embedding is generated by first normalizing the reference point s and mapping it into a 256-dimensional sinusoidal position embedding p_s (generated in the same way as p_k), after which the p_s formed by the reference point s is transformed in the embedding space using T to obtain the spatial position embedding p_q in Q, i.e., p_q = T · p_s. Since the decoder's embedding includes the position information, T can be obtained by T = FFN(decoder embedding).
DAB-DETR [40] is based on Conditional DETR, combining reference points with ‘‘object queries’’ to form a 4D learnable probe frame (x, y, w, h), where (x, y) is the center coordinate of the probe frame and (w, h) is its width and height, as shown in figure 25.
DAB-DETR updates the detection frame as a learnable parameter in the model. In the decoder, the self-attention module is used for query updates, and the cross-attention module is used for feature detection.
Given the q-th probe frame A_q = (x_q, y_q, w_q, h_q), the position query P_q is generated by P_q = MLP(PE(A_q)), where PE denotes the position encoder that generates the sine embedding, calculated as:

PE(A_q) = PE(x_q, y_q, w_q, h_q) = Cat(PE(x_q), PE(y_q), PE(w_q), PE(h_q)).   (12)


FIGURE 27. Training score network using DAB in SPARSE DETR [35].

where Cat denotes the concatenation function. The self-attention module is calculated using Self-Attn: Q_q = C_q + P_q, K_q = C_q + P_q, V_q = C_q, and the cross-attention module is calculated with the following formula:

Cross-Attn: Q_q = Cat(C_q, PE(x_q, y_q) · MLP^{(csq)}(C_q)),
K_{x,y} = Cat(F_{x,y}, PE(x, y)), V_{x,y} = F_{x,y}.   (13)

where F_{x,y} represents the image features at the point (x, y). Figure 26 shows all of the above processes.

FIGURE 25. The difference between the DAB-DETR, DETR, and Conditional DETR architectures [40].

FIGURE 26. Overview of the DAB architecture.
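A small sketch of the sinusoidal positional encoding PE(·) used in equations 12 and 13; the 128-dimensions-per-coordinate split and the temperature value follow common DETR-style implementations and are assumptions here, not details stated in this survey.

import numpy as np

def sine_pe(value, num_feats=128, temperature=10000.0):
    # Map a scalar coordinate in [0, 1] to a num_feats-dim sinusoidal embedding.
    dim_t = temperature ** (2 * (np.arange(num_feats) // 2) / num_feats)
    pos = value * 2 * np.pi / dim_t
    return np.where(np.arange(num_feats) % 2 == 0, np.sin(pos), np.cos(pos))

def pe_anchor(x, y, w, h, num_feats=128):
    # PE(A_q) = Cat(PE(x), PE(y), PE(w), PE(h)) as in equation 12
    return np.concatenate([sine_pe(v, num_feats) for v in (x, y, w, h)])

print(pe_anchor(0.25, 0.5, 0.1, 0.2).shape)   # (512,)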
DN-DETR [39] changes the training procedure of DETR. Based on the previous framework, DN-DETR improves the bipartite graph-matching strategy of the original DETR. DN-DETR adds noise to the ground truth, and the training model re-learns the ground truth, which can substantially reduce the matching instability caused by the Hungarian algorithm, thus speeding up convergence. The decoder comprises two parts, the noise reduction module and the matching module. The two modules are trained collaboratively through a complex masking mechanism, and the ‘‘no-object’’ class of the original DETR is eliminated from the model.
SPARSE DETR [35] improves on the encoder in DETR; previous algorithms had mainly been improving the decoder, but SPARSE DETR offers a different perspective. SPARSE DETR proposes a DAM (Decoder cross-Attention Map) to construct a score network for the encoder part, which filters the features that can participate in the calculation of the cross-attention part of the decoder, as shown in figure 27.
DINO [43] summarises its previous work and upgrades DETR, including three improvements on the encoder and decoder:
1) DINO builds on the previous DN-DETR by using a contrastive learning strategy for the noise reduction module and re-adding the ‘‘no-object’’ category;
2) The model uses a Mixed Query Selection strategy to select the output portion of the encoder dynamically. When making a selection, the model retains only the a priori position information and not the content information, as the feature content information at this point can mislead the decoder into making a wrong selection;
3) When iteratively updating the probe box, DINO updates the i-th and (i + 1)-th layers using the parameters of the i-th layer. This makes better use of the previous position information, and the iterative formula is equation 14:

Δb_i = Layer_i(b_{i−1}), b′_i = Update(b_{i−1}, Δb_i),
b_i = Detach(b′_i), b_i^{(pred)} = Update(b′_{i−1}, Δb_i).   (14)

where b_{i−1} denotes the (i − 1)-th input box, b_i^{(pred)} denotes the prediction box to be obtained, and b′_i denotes a quantity that is not involved in the backpropagation calculation.

C. HYBRID MODELS
By hybrid detector, in this paper we mean a detector with both CNN and MSA layers in the backbone. In terms of the training process, the algorithm is generally first trained on a pre-training dataset (generally the ImageNet-1K dataset [121]) for classification or other tasks. After obtaining the pre-trained model, its powerful representation capability is used by fine-tuning it on the downstream object detection task. Some of the most common algorithms in this series are listed in table 11.
ConvViT [58] was an early use of the MSA layer to simulate CNN layer operations in order to improve the representation capability of the model, which inspired the development of subsequent models. ConvViT is influenced by [167] and [168], i.e., MSAs can simulate arbitrary convolution layers as long as they have enough heads. It developed a new self-attention layer, the GPSA (Gated Positional Self-Attention) layer. The GPSA has a strong inductive bias similar to the


TABLE 11. The hybrid architecture detectors on the COCO 2017 val dataset. ‘‘P-Epochs’’ indicates the number of epochs for which the model was pre-trained.

convolutional layer, and its role is to replace the original SA (Self-Attention) layer in ViT. Specifically, the GPSA layer is initialized to simulate the locality of the convolutional layer. It is then freed from locality by adjusting the gating parameters so that each head of the MSAs targets fixed content information, giving it the ability to pay extra attention to different locations. The whole process is as follows:

GPSA_h(X) := normalize[ A^h ] X W^h_val,   (15)

A^h_{ij} := (1 − σ(λ_h)) softmax( Q^h_i K^{h⊤}_j ) + σ(λ_h) softmax( v^{h⊤}_pos r_{ij} ).   (16)

where σ: x ↦ 1/(1 + e^{−x}) denotes the activation function, Q^h_i denotes the query of the i-th patch under the h-th self-attention head, v^{h⊤}_pos denotes the trainable embedding, and r_{ij} denotes the relative position encoding.
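A minimal NumPy sketch of one GPSA head as described by equations 15 and 16; the shapes, the random initialization, and the form of the relative-position encoding here are illustrative assumptions rather than the actual ConvViT implementation.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gpsa_head(X, Wq, Wk, Wv, v_pos, rel_pos, lam):
    # One GPSA head. X: (N, D) patch embeddings; rel_pos: (N, N, Dp) relative position
    # encodings r_ij; v_pos: (Dp,) trainable embedding; lam: scalar gating parameter.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    content = softmax(Q @ K.T)                    # content-based attention, softmax(Q_i K_j^T)
    position = softmax(rel_pos @ v_pos)           # position-based attention, softmax(v_pos^T r_ij)
    gate = 1.0 / (1.0 + np.exp(-lam))             # sigma(lambda_h)
    A = (1.0 - gate) * content + gate * position  # equation 16
    A = A / A.sum(-1, keepdims=True)              # row normalization as in equation 15
    return A @ V

N, D, Dp = 16, 32, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))
out = gpsa_head(X, *(rng.normal(size=(D, D)) for _ in range(3)),
                rng.normal(size=Dp), rng.normal(size=(N, N, Dp)), lam=0.0)
print(out.shape)   # (16, 32)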
CVT [59] improves the efficiency of ViT-like models by introducing convolution into the transformer through two modules: the CTE (Convolutional Token Embedding) layer and the CP (Convolutional Projection) layer, whose primary function is to downsample in order to enrich the feature map's representation and to replace the original position-wise linear projection in ViT. Specifically, the token is mapped into 1D space by Flatten(Conv2d(Reshape2D(x_i), s)), where x_i denotes the token to be mapped, Conv2d denotes the depth-separable convolution, i.e., Depth-wise Conv2d → BatchNorm2d → Point-wise Conv2d, and s denotes the convolution kernel, as shown in figure 28(b).

FIGURE 28. (a): The CVT framework; (b): detail of the CTB (Convolutional Transformer Block) [59].

BoTNet [60] replaces the bottleneck 3 × 3 convolutional layer in ResNet [8] with MSAs. The CMT modules consider that CNN captures local information in images and MSAs extract global correlation information; combining the two yields a model that is both efficient and accurate.
Mobile-Former [61] aims to improve recognition accuracy in lightweight applications. The model connects MobileNet [10] with the transformer in parallel in both directions. Due to efficiency issues, the transformer in Mobile-Former only uses a minimal number of tokens, which affects the contribution of MHSA to the model's accuracy.
MetaFormer [62] is an abstraction framework that abstracts the transformer encoder into two components: the mutable component responsible for attention, the token-mixer, and the other remaining invariant components (such as the MLP and residual connections).
Next-ViT [72] contributes to both academic research and industrial deployment. Industrial deployment is very demanding in terms of model execution time and computational resources, which requires the model to achieve a specific scaling ratio to ensure that it is optimal. Next-ViT comprises four stages, P2, P3, P4, and P5, each consisting of two core modules, the NCB (Next Convolution Block) and the NTB (Next Transformer Block). The NCB is used to compute local information, and the NTB is used to compute global information.
NCB follows the abstract design of MetaFormer and uses MHCA (Multi-Head Convolutional Attention) as the token-mixer, where the convolutional attention mechanism can learn the relationship between different tokens through the trainable parameter W in the local perceptual field. The formula is as follows:

CA(z) = O(W, (T_m, T_n)) where T_{m,n} ∈ z.   (17)


computed in two Gaussian spaces instead of three, thus saving


a 1 × 1 convolution in each SA. In addition, ELAN uses
tricks such as shared SA scores and circular shifts along the
diagonal to speed up the model’s computation according to
its network characteristics. It is worth mentioning that the
current backbone of YOLOv7 [27] is the basis of the ELAN
extension.
ELAN [75] consists of three modules; in this paper, we only discuss its core module, the ELAB (Efficient Long-range Attention Block). The ELAB module takes inspiration from the Swin Transformer [54]: it uses Shift-Convolution layers to extract local structural information from images and a GMSA (Group-wise Multi-scale Self-Attention) module to extract global information. The Shift-Convolution layer consists of a shift operator and a 1 × 1 convolution. Its primary function is to shift the first four of five equal channel groups of the input features to the left, right, top, and bottom, respectively, so that the 1 × 1 convolution can gather information from the surrounding pixels. The GMSA applies group window-based self-attention, with the freedom to adjust the window size within each group [169]. When computing SA (Self-Attention), an ASA (Accelerated Self-Attention) mechanism is used, in which the LN (Layer Normalization) of the standard transformer is replaced with Batch Normalization so that the SA between groups is computed without additional overhead, and the SA is computed in two Gaussian spaces instead of three, saving a 1 × 1 convolution in each SA. In addition, ELAN uses tricks such as shared SA scores and circular shifts along the diagonal to speed up computation according to its network characteristics. It is worth mentioning that the backbone of YOLOv7 [27] is built on an extension of ELAN.
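The shift-and-1 × 1-convolution idea is easy to illustrate. The helper below is a hedged sketch, not the ELAN code: it splits the channels into five equal groups, shifts four of them by one pixel in the four directions, and applies a 1 × 1 convolution. torch.roll gives a circular shift here purely for brevity; a zero-padded shift could equally be used.

import torch
import torch.nn as nn

def shift_conv(x, conv1x1):
    """Sketch of a Shift-Convolution layer: channels are split into five equal
    groups; four are shifted by one pixel (left, right, up, down) so that the
    following 1x1 convolution sees neighbouring pixels without a 3x3 kernel."""
    groups = torch.chunk(x, 5, dim=1)
    shifted = [
        torch.roll(groups[0], shifts=-1, dims=3),   # left
        torch.roll(groups[1], shifts=1, dims=3),    # right
        torch.roll(groups[2], shifts=-1, dims=2),   # up
        torch.roll(groups[3], shifts=1, dims=2),    # down
        groups[4],                                  # identity group
    ]
    return conv1x1(torch.cat(shifted, dim=1))


x = torch.randn(1, 60, 32, 32)                      # channel count divisible by 5
conv = nn.Conv2d(60, 60, kernel_size=1)
print(shift_conv(x, conv).shape)                    # torch.Size([1, 60, 32, 32])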
The ideas behind the GC ViT [73], MaxViT [74], ConvMAE [165], and InternImage [166] models are borrowed from the architectural design of the Swin Transformer [54]: the accuracy of an object detection algorithm can be improved by generating multi-level feature maps in stages, extracting information about objects at different spatial resolutions of the image.

The Pix2seq [32] algorithm is designed to exploit the transformer's ability to process NLP-style sequences: the bounding box and class of each object in the image are serialized, and predictions are made on the serialized sequence. Pix2seq is essentially a generative self-supervised learning algorithm [170].

IV. X-RAY BAGGAGE DETECTION WITH DEEP LEARNING
X-ray baggage detection is a task that, at this stage, is still mainly carried out manually, so there is a massive opportunity for deep learning in this task. According to the detection method, X-ray dangerous goods detection algorithms mainly include conventional image analysis, machine learning, and deep learning algorithms. This paper focuses on deep learning algorithms. Moreover, three types of supervised learning algorithms (classification, detection, and segmentation) are used to organize the introduction [76], [171]. Table 12 summarizes the application of deep learning algorithms in this field.

A. CLASSIFICATION
Classification algorithms were among the first to emerge in this field. In simple terms, the need is fulfilled by determining whether prohibited items are present during the security screening process. The limitation of this approach is that it treats dangerous goods detection as a simple classification problem, so the final result does not accurately detect the type and location of hazardous materials.
Akcay et al. [77] used a CNN with transfer learning to classify the dataset. Solving a binary classification problem, the presence or absence of firearms, demonstrated that using a CNN is more effective than traditional machine learning algorithms such as SVM.

Rogers et al. [78] made the first use of dual-energy X-ray images for imaging detection. The high-energy (H) and low-energy (L) X-ray images captured by the dual-energy X-ray machine were used as single-channel (H), dual-channel ({H, −log H}, {−log H, −log L}), and four-channel ({−log L, L, H, −log H}) inputs to train a VGG-19 network for classification.
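A hedged sketch of how such channel stacks can be assembled is shown below. The choice of a small epsilon to guard the logarithm and the placeholder arrays are our own assumptions; the point is simply that the high- and low-energy attenuation images H and L are combined into 1-, 2-, or 4-channel tensors before being fed to the classifier.

import numpy as np

def build_channels(h, low, mode="four"):
    """Stack dual-energy X-ray images into the channel layouts described above.
    h, low: 2-D arrays in (0, 1] representing high- and low-energy attenuation.
    A small epsilon guards the logarithm (our assumption, not from [78])."""
    eps = 1e-6
    log_h = -np.log(np.clip(h, eps, None))
    log_l = -np.log(np.clip(low, eps, None))
    if mode == "single":
        channels = [h]
    elif mode == "dual":
        channels = [h, log_h]              # e.g. {H, -log H}
    else:
        channels = [log_l, low, h, log_h]  # {-log L, L, H, -log H}
    return np.stack(channels, axis=0)      # (C, height, width), ready for a CNN


h = np.random.rand(512, 512) * 0.9 + 0.1       # placeholder high-energy image
low = np.random.rand(512, 512) * 0.9 + 0.1     # placeholder low-energy image
print(build_channels(h, low, mode="four").shape)   # (4, 512, 512)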
Zhao et al. [79], [80] introduce GAN [172] into the X-ray imaging detection task through a three-stage learning method for classification. The input X-ray dataset is first classified and labeled using the angular information of the foreground objects extracted from the input images. A GAN then generates new objects, and finally a small classification network is used to confirm whether the generated images belong to the correct class.
FIGURE 30. (a) ELAB framework, (b) Shift-Convolution layer, (c) GMSA layer, (d) calculation of ASA [75].

TABLE 12. Deep learning algorithms in the field of X-ray baggage detection. (If not specified, ACC, mAP, and mIOU in the last column correspond to the
performance metrics of the classification, detection, and segmentation algorithms in the first column).

The CHR [81] model performs classification on the SIXray dataset [81]. Since SIXray is constructed to simulate a realistic environment in which security screening data are heavily imbalanced, the CHR model copes with the class imbalance by extracting image features from three consecutive layers. Specifically, the backward (deeper) layer in the CHR model is upsampled and connected to the preceding layer, as shown in figure 31. In the figure, the g() function is used to remove redundant information from the feature map by feeding the high-level features of the three layers, ({h(x̃_n^{(l−1)}), h(x̃_n^{(l)}), h(x̃_n^{(l+1)})}), into the classifier for discrimination. This multi-level strategy allows the CHR model to extract the features of the objects better.
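The hierarchical idea, features from consecutive stages fused and classified jointly, can be sketched as follows. This is a generic illustration written for this survey, not the CHR code: each deeper feature map is upsampled to the size of the preceding one, the levels are pooled and concatenated, and a multi-label head produces the class logits. All layer sizes and the class count are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalHead(nn.Module):
    """Generic sketch of hierarchical refinement: deeper feature maps are upsampled,
    connected to the preceding level, and classified jointly."""

    def __init__(self, channels=(512, 1024, 2048), num_classes=5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(sum(channels), num_classes)

    def forward(self, f3, f4, f5):
        f5_up = F.interpolate(f5, size=f4.shape[-2:], mode="nearest")   # lift deepest map
        f4_up = F.interpolate(f4, size=f3.shape[-2:], mode="nearest")
        fused = [f3, f4_up, f5_up]                                      # connected levels
        pooled = [self.pool(f).flatten(1) for f in fused]
        return self.fc(torch.cat(pooled, dim=1))                        # (B, num_classes) logits


f3, f4, f5 = (torch.randn(2, 512, 28, 28),
              torch.randn(2, 1024, 14, 14),
              torch.randn(2, 2048, 7, 7))
print(HierarchicalHead()(f3, f4, f5).shape)    # torch.Size([2, 5])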


FIGURE 31. Overview of CHR architecture [81].

Caldwell et al. [83] investigate the generalization ability of models trained on datasets collected from different scanners. The authors created training and test samples from single or multiple domains to study the effect of transferring between them. The limitation is that transfer learning remains challenging due to the unknown parameters of the scanners and the limited ability of the CNN to generalize to unseen target datasets.

FIGURE 32. Overview of the SDANet framework [89].

FIGURE 33. Overview of the detailed LIM architecture [92].

B. DETECTION
Most existing X-ray dangerous goods detection algorithms use a CNN framework to detect the type and location of dangerous goods in luggage. The limitation of this approach is that the detection results rely heavily on the texture information of the detected objects and do not fully use their shape and contour information, so the actual detection results often fall short of what is desired.

Akcay et al. [84] detect and identify objects in the DBF2/6 datasets using the Faster R-CNN [21] algorithm. The mAP on DBF6 reached 88.3%.

Sigman et al. [85] propose a semi-supervised domain adaptation learning algorithm, Background Adaptive Faster R-CNN. The authors assume that, in reality, security-checked images contain no dangerous goods, an assumption that makes it possible to collect the dataset more quickly. The algorithm uses two domains: a manually collected domain with threats and a real-world domain without threats. In addition, two domain discriminators are trained adversarially, one discriminating the target proposal boxes and the other discriminating the image features. Only the background area outside the target proposal regions and ground truth is used for feature extraction when training on the manual dataset. This allows the model to identify the threats better, as the background features of the images with threats (manual dataset) are matched to the features of the images without threats (real dataset).

Subramani et al. [86] trained the SSD [145] and RetinaNet [148] detectors on the SIXray10 dataset, with mAPs of 60.5% and 60.9%, respectively. Liu et al. [87] used YOLOv2 [24] to achieve an average accuracy of 94.5% and a recall of 92.6% on the unpublished dataset SASC.

Hassan et al. [88] use a cascaded multi-scale structure to form RoIs after extracting tensors from different angles of the object. The mAP reached over 96% on both GDXray and SIXray.

Wang et al. [89] designed a Selective Dense Attention Network, SDANet, which constructs a strong baseline on PIDray and consists of a dense attention module and a dependency refinement module, as shown in figure 32. SDANet uses the attention mechanism to focus on target objects in complex contexts within a multi-level feature pyramid. The final APs on the easy, medium, and hard subsets are 71.2%, 64.2%, and 49.5%, respectively.

Tao et al. [92] propose the LIM (Lateral Inhibition Module), which ignores task-irrelevant information and focuses only on discriminative features when objects overlap each other. Specifically, LIM is a carefully designed, flexible add-on module that suppresses the flow of noisy information through a Bidirectional Propagation module and activates the boundaries of the most discriminative features from four directions through a Boundary Activation module, as shown in figure 33. LIM achieved 83.2% and 90.6% mAP on HiXray and OPIXray, respectively.

Isaac-Medina et al. [90] used a conditional GAN model together with the FFL technique to analyze four types of images, namely high-energy, low-energy, and effective-Z images, as well as pseudo-color images synthesized from these three. The results show that the pseudo-color maps synthesized by the conditional GAN model achieve significant results on the private dataset deei6.
C. SEGMENTATION
Object segmentation algorithms divide an image into multiple sub-regions. In X-ray images with highly overlapping targets, segmentation algorithms often rely on additional information, such as the contour information of hazardous materials, to complete the segmentation. This brings extra computational effort, while the contour information of the hazardous material dominates the final segmentation result.

Gaus et al. [93] use a dual convolutional neural network architecture for automatic anomaly detection in X-ray images. The paper uses R-CNN [18], Mask R-CNN [22], and RetinaNet-like detection networks to provide object localization for specific target object classes. Specifically, the images are segmented using Mask R-CNN to initialize the RoIs, followed by a negative/positive classification of the obtained RoIs by a network such as RetinaNet, with a segmentation accuracy of 97.6%.

Hassan et al. [95] segment the targets in the images by extracting structure tensors from different angles, and finally achieve segmentation mAPs of 96.7%/96.16%/75.32%/58.4% on GDXray/SIXray/OPIXray/Compass-XP, respectively.

Ma et al. [94] address the problem of inaccurate identification of different contraband or dangerous goods caused by differences in appearance. The model, named DDoAS, consists of two modules: DDoM, which accurately infers contraband information from a densely overlapping background by means of dense back-links, and ADM, which aims to improve the low learning efficiency caused by differences in shape and size between different contraband items. The limitation of DDoAS is that the model uses additional optical information (object edges and vertices) to assist verification, which makes it challenging to detect contraband with poor edge information, such as small folding knives.
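As a point of reference for this kind of two-stage pipeline, the snippet below shows how a pre-trained Mask R-CNN from torchvision (version 0.13 or later) can be run to obtain candidate boxes and masks that a second-stage classifier could then re-score. It is a generic sketch using COCO weights, not the model trained in [93]; the placeholder image and the 0.5 confidence threshold are our own choices.

import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# COCO-pretrained weights; in practice the model would be fine-tuned on X-ray imagery.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 512, 512)            # placeholder pseudo-color X-ray image in [0, 1]
with torch.no_grad():
    output = model([image])[0]             # dict with 'boxes', 'labels', 'scores', 'masks'

keep = output["scores"] > 0.5              # keep confident candidate RoIs
boxes = output["boxes"][keep]              # (K, 4) boxes a second-stage network could re-score
masks = output["masks"][keep] > 0.5        # (K, 1, H, W) binary masks
print(boxes.shape, masks.shape)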
V. EXPERIMENT
In this section, to test the accuracy of standard models in detecting X-ray images without modifying their original structure, and to provide directions for subsequent research, we select the four most common models among the three types of algorithms for experimentation. Specifically, these are YOLOv5,⁵ YOLOv7 [27], DINO [43], and Next-ViT [72]. The datasets used in this experiment are the processed SIXray⁶ and PIDray,⁷ which we have marked as SIXrayp and PIDrayp, respectively. SIXrayp contains five classes, namely Gun, Knife, Pliers, Scissors, and Wrench, and PIDrayp contains 12 classes, namely Baton, Bullet, Gun, Hammer, HandCuffs, Knife, Lighter, Pliers, Powerbank, Scissors, Sprayer, and Wrench, as shown in table 13. We fine-tuned the models directly on the training set and tested the results on the testing set.

5 https://github.com/ultralytics/yolov5
6 https://universe.roboflow.com/object-detection/ugku
7 https://universe.roboflow.com/object-detection/security_xray

TABLE 13. The SIXrayp and PIDrayp datasets.
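For reproducibility, the fine-tuning itself follows the standard Ultralytics workflow (the repository's train.py with a dataset YAML listing the SIXrayp or PIDrayp classes). The snippet below is a hedged sketch of running such a model for inference; the image file name, the confidence threshold, and the fallback to the generic yolov5s weights are our own illustrative assumptions, not the exact settings used for the results reported here.

import torch

# Load YOLOv5 through torch.hub (an entry point provided by the ultralytics/yolov5 repo).
# torch.hub.load("ultralytics/yolov5", "custom", path="best.pt") would load a checkpoint
# fine-tuned on SIXrayp or PIDrayp; the generic small model is used here for illustration.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.conf = 0.4                           # confidence threshold (our choice)

results = model("baggage_scan.png")        # hypothetical pseudo-color X-ray scan
detections = results.pandas().xyxy[0]      # columns: xmin, ymin, xmax, ymax, confidence, class, name
print(detections[["name", "confidence"]])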
TABLE 14. Basic detection results of YOLOv5 on the PIDrayp dataset.

As the detection process must run in real time, the YOLOv5 and YOLOv7 models are preferred for deployment; the results are shown in tables 14, 15, 16, and 17, respectively. These tables show that YOLOv7 is more accurate than YOLOv5, and YOLOv7 also has a higher inference speed than the other models due to techniques such as model re-parameterization. The visual comparison of the four models is shown in figure 34. As can be seen in subplot (c), the Transformer-based and hybrid models do not work well in X-ray image detection.

The main reason is that the MSAs mechanism learns the contour information of objects. However, in an X-ray image the hazardous object to be detected is covered or obscured by a large number of other objects, which leads to confusing feature information for the MSAs, so the exact location of the object cannot be distinguished correctly; by contrast, the CNN learns more texture information from the pseudo-color image, which helps identify the type of object. Inspired by several models, DINO incorporates a variety of factors that improve recognition accuracy, but it performs no domain-specific optimization and has a lower mAP than the other models in this setting. In addition, Next-ViT, as a hybrid model, combines the advantages of both convolution and MSAs; however, since it is not an end-to-end detection model and its structure is not designed for X-ray images with many overlapping objects, it shows no particular advantage in accuracy or efficiency. A hybrid model better suited to X-ray images should be designed.
FIGURE 34. (a) Detailed identification results of YOLOv5 versus YOLOv7 for each category on the SIXrayp dataset; ''@.5₅'' and ''@.5₇'' denote the value of mAP@0.5 under YOLOv5 and YOLOv7, respectively, and so on. (b) Detailed identification results of the YOLOv5 versus YOLOv7 models for each category on the PIDrayp dataset. (c) Comparison of the mAP results of the four models YOLOv5, YOLOv7, DINO, and Next-ViT.

TABLE 15. Basic detection results of YOLOv5 on the SIXrayp dataset.

TABLE 16. Basic detection results of YOLOv7 on the PIDrayp dataset.

TABLE 17. Basic detection results of YOLOv7 on the SIXrayp dataset.

VI. CONCLUSION
This paper reviews the more popular deep learning object detection algorithms of recent years and summarises the application of deep learning to the field of X-ray baggage dangerous goods detection. While many models have temporarily solved some of the problems in this area, significant limitations remain:
1) The pseudo-color pictures formed by dual-energy X-ray still do not work well with modern detection models and must be modified in depth to obtain more reasonable results.
2) The timeliness (inference speed) of the algorithm is a factor that must be considered at the moment.
3) In reality, if prohibited goods are in luggage, they will inevitably be wrapped in layers. The resulting X-ray images can be extreme, with objects stacked on top of each other over a large area. The accuracy of existing models on such images may not be sufficient.
4) The current X-ray baggage image datasets are still small and of low quality, which affects the training of deep learning models.
In response to the above challenges, we offer the following suggestions:
1) Use image translation or style transfer techniques to generate corresponding natural-light images from X-ray images, expanding the X-ray baggage dataset.
2) Use the image pairs formed by high- and low-energy rays, combined with images in natural light, to enrich the color of X-ray images and bring them closer to natural-light images.
3) Reduce the cost of 3D CT scan recognition technology by converting 2D algorithms to 3D algorithms to recognize stacked layers that are difficult to recognize in the 2D case.
4) Use image feature extraction and synthesis with a diffusion model, more advanced than GAN, to generate high-quality X-ray images containing prohibited items.
5) Although most prohibited items are occluded, they do not change their original shape excessively when exposed to X-ray, so they can still be identified using contour information through a rational algorithm design. One of the future directions in X-ray dangerous goods detection is to use hybrid algorithms that combine the texture features and contour information of prohibited items for identification.
6) In order to make fair comparisons, evaluation criteria must be established on public datasets.
[22] K. He, G. Gkioxari, P. Dollár, and R. Girshick, ''Mask R-CNN,'' in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2961–2969.
[23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ''You only look once: Unified, real-time object detection,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[24] J. Redmon and A. Farhadi, ‘‘YOLO9000: Better, faster, stronger,’’
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017,
REFERENCES pp. 7263–7271.
[1] A. Schwaninger, A. Bolfing, T. Halbherr, S. Helman, A. Belyavin, and [25] J. Redmon and A. Farhadi, ‘‘YOLOv3: An incremental improvement,’’
L. Hay, ‘‘The impact of image based factors and training on threat detec- 2018, arXiv:1804.02767.
tion performance in X-ray screening,’’ Tech. Rep., 2008. [26] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, ‘‘YOLOv4: Optimal
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification speed and accuracy of object detection,’’ 2020, arXiv:2004.10934.
with deep convolutional neural networks,’’ in Proc. Adv. Neural Inf. [27] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, ‘‘YOLOv7: Trainable
Process. Syst., vol. 25, 2012, pp. 1–9. bag-of-freebies sets new state-of-the-art for real-time object detectors,’’
[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, 2022, arXiv:2207.02696.
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, [28] C.-Y. Wang, I.-H. Yeh, and H.-Y. M. Liao, ‘‘You only learn one represen-
J. Uszkoreit, and N. Houlsby, ‘‘An image is worth 16 × 16 words: tation: Unified network for multiple tasks,’’ 2021, arXiv:2105.04206.
Transformers for image recognition at scale,’’ 2020, arXiv:2010.11929. [29] X. Long, K. Deng, G. Wang, Y. Zhang, Q. Dang, Y. Gao, H. Shen, J. Ren,
[4] P. Viola and M. Jones, ‘‘Rapid object detection using a boosted cascade of S. Han, E. Ding, and S. Wen, ‘‘PP-YOLO: An effective and efficient
simple features,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern implementation of object detector,’’ 2020, arXiv:2007.12099.
Recognit. (CVPR), vol. 1, Dec. 2001. p. I.
[30] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, ‘‘YOLOX: Exceeding YOLO
[5] N. Park and S. Kim, ‘‘How do vision transformers work?’’ 2022, series in 2021,’’ 2021, arXiv:2107.08430.
arXiv:2202.06709.
[31] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
[6] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for
S. Zagoruyko, ‘‘End-to-end object detection with transformers,’’ in Com-
large-scale image recognition,’’ 2014, arXiv:1409.1556.
puter Vision—ECCV 2020. Glasgow, U.K., Aug. 2020, pp. 213–229.
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
[32] T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. Hinton, ‘‘Pix2seq:
V. Vanhoucke, and A. Rabinovich, ‘‘Going deeper with convolutions,’’
A language modeling framework for object detection,’’ 2021,
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015,
arXiv:2109.10852.
pp. 1–9.
[33] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, ‘‘Deformable DETR:
[8] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for
Deformable transformers for end-to-end object detection,’’ in Proc. Int.
image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
Conf. Learn. Represent., 2021, pp. 1–16.
(CVPR), Jun. 2016, pp. 770–778.
[34] M. Zheng, P. Gao, R. Zhang, K. Li, X. Wang, H. Li, and H. Dong, ‘‘End-
[9] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
to-end object detection with adaptive clustering transformer,’’ 2020,
M. Andreetto, and H. Adam, ‘‘MobileNets: Efficient convolutional neural
arXiv:2011.09315.
networks for mobile vision applications,’’ 2017, arXiv:1704.04861.
[35] B. Roh, J. Shin, W. Shin, and S. Kim, ‘‘Sparse DETR: Efficient end-to-
[10] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen,
end object detection with learnable sparsity,’’ 2021, arXiv:2111.14330.
‘‘MobileNetV2: Inverted residuals and linear bottlenecks,’’ in
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, [36] P. Gao, M. Zheng, X. Wang, J. Dai, and H. Li, ‘‘Fast convergence of
pp. 4510–4520. DETR with spatially modulated co-attention,’’ in Proc. IEEE/CVF Int.
Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3621–3630.
[11] A. Howard, M. Sandler, B. Chen, W. Wang, L.-C. Chen, M. Tan, G. Chu,
V. Vasudevan, Y. Zhu, R. Pang, H. Adam, and Q. Le, ‘‘Searching for [37] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang,
MobileNetV3,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), ‘‘Conditional DETR for fast training convergence,’’ in Proc. IEEE/CVF
Oct. 2019, pp. 1314–1324. Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3651–3660.
[12] X. Zhang, X. Zhou, M. Lin, and J. Sun, ‘‘ShuffleNet: An extremely [38] Y. Wang, X. Zhang, T. Yang, and J. Sun, ‘‘Anchor DETR: Query design
efficient convolutional neural network for mobile devices,’’ in for transformer-based detector,’’ in Proc. AAAI Conf. Artif. Intell., 2022,
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, vol. 36, no. 3, pp. 2567–2575.
pp. 6848–6856. [39] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, ‘‘DN-DETR:
[13] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, ‘‘ShuffleNet V2: Practical Accelerate DETR training by introducing query DeNoising,’’ in Proc.
guidelines for efficient CNN architecture design,’’ in Proc. Eur. Conf. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022,
Comput. Vis. (ECCV), 2018, pp. 116–131. pp. 13619–13627.
[14] J. Hu, L. Shen, and G. Sun, ‘‘Squeeze-and-excitation networks,’’ in [40] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang,
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, ‘‘DAB-DETR: Dynamic anchor boxes are better queries for DETR,’’
pp. 7132–7141. 2022, arXiv:2201.12329.
[15] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, ‘‘GhostNet: More [41] Y. Fang, B. Liao, X. Wang, J. Fang, J. Qi, R. Wu, J. Niu, and W. Liu,
features from cheap operations,’’ in Proc. IEEE/CVF Conf. Comput. Vis. ‘‘You only look at one sequence: Rethinking transformer in vision through
Pattern Recognit. (CVPR), Jun. 2020, pp. 1580–1589. object detection,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021,
[16] M. Tan and Q. Le, ‘‘EfficientNet: Rethinking model scaling for con- pp. 26183–26197.
volutional neural networks,’’ in Proc. Int. Conf. Mach. Learn., 2019, [42] Z. Sun, S. Cao, Y. Yang, and K. Kitani, ‘‘Rethinking transformer-based set
pp. 6105–6114. prediction for object detection,’’ in Proc. IEEE/CVF Int. Conf. Comput.
[17] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, ‘‘RepVGG: Vis. (ICCV), Oct. 2021, pp. 3611–3620.
Making VGG-style ConvNets great again,’’ in Proc. IEEE/CVF Conf. [43] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum,
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 13733–13742. ‘‘DINO: DETR with improved denoising anchor boxes for end-to-end
[18] R. Girshick, J. Donahue, T. Darrell, and J. Malik, ‘‘Rich feature hierar- object detection,’’ 2022, arXiv:2203.03605.
chies for accurate object detection and semantic segmentation,’’ in Proc. [44] Z. Yao, J. Ai, B. Li, and C. Zhang, ‘‘Efficient DETR: Improving end-to-
IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587. end object detector with dense prior,’’ 2021, arXiv:2104.01318.
[19] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Spatial pyramid pooling in [45] X. Dai, Y. Chen, J. Yang, P. Zhang, L. Yuan, and L. Zhang, ‘‘Dynamic
deep convolutional networks for visual recognition,’’ IEEE Trans. Pattern DETR: End-to-end object detection with dynamic attention,’’ in Proc.
Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2014. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 2988–2997.
[20] R. Girshick, ‘‘Fast R-CNN,’’ in Proc. IEEE Int. Conf. Comput. Vis. [46] Z. Dai, B. Cai, Y. Lin, and J. Chen, ‘‘UP-DETR: Unsupervised pre-
(ICCV), Dec. 2015, pp. 1440–1448. training for object detection with transformers,’’ in Proc. IEEE/CVF Conf.
[21] S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster R-CNN: Towards real-time Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 1601–1610.
object detection with region proposal networks,’’ IEEE Trans. Pattern [47] H. Bao, L. Dong, S. Piao, and F. Wei, ‘‘BEiT: BERT pre-training of image
Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017. transformers,’’ 2021, arXiv:2106.08254.


[48] Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei, ‘‘BEiT v2: [70] Q. Zhang and Y.-B. Yang, ‘‘ResT: An efficient transformer for visual
Masked image modeling with vector-quantized visual tokenizers,’’ 2022, recognition,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021,
arXiv:2208.06366. pp. 15475–15485.
[49] W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, [71] Z. Dai, H. Liu, Q. V. Le, and M. Tan, ‘‘CoAtNet: Marrying convolution
O. K. Mohammed, S. Singhal, S. Som, and F. Wei, ‘‘Image as a foreign and attention for all data sizes,’’ in Proc. Adv. Neural Inf. Process. Syst.,
language: BEiT pretraining for all vision and vision-language tasks,’’ vol. 34, 2021, pp. 3965–3977.
2022, arXiv:2208.10442. [72] J. Li, X. Xia, W. Li, H. Li, X. Wang, X. Xiao, R. Wang, M. Zheng,
[50] W. Wang, Y. Cao, J. Zhang, and D. Tao, ‘‘FP-DETR: Detection trans- and X. Pan, ‘‘Next-ViT: Next generation vision transformer for efficient
former advanced by fully pre-training,’’ in Proc. Int. Conf. Learn. Repre- deployment in realistic industrial scenarios,’’ 2022, arXiv:2207.05501.
sent., 2021, pp. 1–14. [73] A. Hatamizadeh, H. Yin, J. Kautz, and P. Molchanov, ‘‘Global context
[51] J. Yang, C. Li, P. Zhang, X. Dai, B. Xiao, L. Yuan, and J. Gao, ‘‘Focal vision transformers,’’ 2022, arXiv:2206.09959.
self-attention for local-global interactions in vision transformers,’’ 2021, [74] Z. Tu, H. Talebi, H. Zhang, F. Yang, P. Milanfar, A. Bovik, and Y. Li,
arXiv:2107.00641. ‘‘MaxViT: Multi-axis vision transformer,’’ 2022, arXiv:2204.01697.
[52] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and [75] X. Zhang, H. Zeng, S. Guo, and L. Zhang, ‘‘Efficient long-range attention
L. Shao, ‘‘Pyramid vision transformer: A versatile backbone for dense network for image super-resolution,’’ 2022, arXiv:2203.06697.
prediction without convolutions,’’ in Proc. IEEE/CVF Int. Conf. Comput. [76] S. Akcay and T. Breckon, ‘‘Towards automatic threat detection: A survey
Vis. (ICCV), Oct. 2021, pp. 568–578. of advances of deep learning within X-ray security imaging,’’ Pattern
[53] P. Zhang, X. Dai, J. Yang, B. Xiao, L. Yuan, L. Zhang, and J. Gao, ‘‘Multi- Recognit., vol. 122, Feb. 2022, Art. no. 108245.
scale vision longformer: A new vision transformer for high-resolution [77] S. Akcay, M. E. Kundegorski, M. Devereux, and T. P. Breckon, ‘‘Transfer
image encoding,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), learning using convolutional neural networks for object classification
Oct. 2021, pp. 2998–3008. within X-ray baggage security imagery,’’ in Proc. IEEE Int. Conf. Image
[54] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, Process. (ICIP), Sep. 2016, pp. 1057–1061.
‘‘Swin transformer: Hierarchical vision transformer using shifted win- [78] T. W. Rogers, N. Jaccard, and L. D. Griffin, ‘‘A deep learning frame-
dows,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, work for the automated inspection of complex dual-energy X-ray cargo
pp. 10012–10022. imagery,’’ Proc. SPIE, vol. 10187, pp. 106–117, May 2017.
[55] D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, and Q. Sun, ‘‘Feature [79] Z. Zhao, H. Zhang, and J. Yang, ‘‘A GAN-based image generation method
pyramid transformer,’’ in Proc. Eur. Conf. Comput. Vis. Springer, 2020, for X-ray security prohibited items,’’ in Proc. Chin. Conf. Pattern Recog-
pp. 323–339. nit. Comput. Vis. (PRCV). Springer, 2018, pp. 420–430.
[56] Y. Yuan, R. Fu, L. Huang, W. Lin, C. Zhang, X. Chen, and J. Wang, [80] J. Yang, Z. Zhao, H. Zhang, and Y. Shi, ‘‘Data augmentation for X-ray
‘‘HRFormer: High-resolution vision transformer for dense predict,’’ in prohibited item images using generative adversarial networks,’’ IEEE
Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 7281–7293. Access, vol. 7, pp. 28894–28902, 2019.
[57] J. Gu, H. Kwon, D. Wang, W. Ye, M. Li, Y.-H. Chen, L. Lai, V. Chandra, [81] C. Miao, L. Xie, F. Wan, C. Su, H. Liu, J. Jiao, and Q. Ye, ‘‘SIXray:
and D. Z. Pan, ‘‘Multi-scale high-resolution vision transformer for seman- A large-scale security inspection X-ray benchmark for prohibited item
tic segmentation,’’ 2021, arXiv:2111.01236. discovery in overlapping images,’’ in Proc. IEEE/CVF Conf. Comput. Vis.
[58] S. d’Ascoli, H. Touvron, M. L. Leavitt, A. S. Morcos, G. Biroli, and Pattern Recognit. (CVPR), Jun. 2019, pp. 2119–2128.
L. Sagun, ‘‘ConViT: Improving vision transformers with soft convo- [82] T. Morris, T. Chien, and E. Goodman, ‘‘Convolutional neural net-
lutional inductive biases,’’ in Proc. Int. Conf. Mach. Learn., 2021, works for automatic threat detection in security X-ray images,’’ in
pp. 2286–2296. Proc. 17th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2018,
[59] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, ‘‘CvT: pp. 285–292.
Introducing convolutions to vision transformers,’’ in Proc. IEEE/CVF Int. [83] M. Caldwell, M. Ransley, T. W. Rogers, and L. D. Griffin, ‘‘Trans-
Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 22–31. ferring X-ray based automated threat detection between scanners with
[60] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and different energies and resolution,’’ in Proc. SPIE, vol. 10441, 2017,
A. Vaswani, ‘‘Bottleneck transformers for visual recognition,’’ in Proc. pp. 130–139.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, [84] S. Akcay and T. P. Breckon, ‘‘An evaluation of region based object
pp. 16519–16529. detection strategies within X-ray baggage security imagery,’’ in Proc.
[61] Y. Chen, X. Dai, D. Chen, M. Liu, X. Dong, L. Yuan, and Z. Liu, ‘‘Mobile- IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 1337–1341.
former: Bridging MobileNet and transformer,’’ 2021, arXiv:2108.05895. [85] J. B. Sigman, G. P. Spell, K. J. Liang, and L. Carin, ‘‘Background adap-
[62] W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and tive faster R-CNN for semi-supervised convolutional object detection of
S. Yan, ‘‘MetaFormer is actually what you need for vision,’’ in Proc. threats in X-ray images,’’ Proc. SPIE, vol. 11404, pp. 12–21, May 2020.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, [86] M. Subramani, K. Rajaduari, S. D. Choudhury, A. Topkar, and
pp. 10819–10829. V. Ponnusamy, ‘‘Evaluating one stage detector architecture of convolu-
[63] B. Wu, C. Xu, X. Dai, A. Wan, P. Zhang, Z. Yan, M. Tomizuka, tional neural network for threat object detection using X-ray baggage
J. Gonzalez, K. Keutzer, and P. Vajda, ‘‘Visual transformers: Token- security imaging,’’ Revue d’Intell. Artificielle, vol. 34, no. 4, pp. 495–500,
based image representation and processing for computer vision,’’ 2020, Sep. 2020.
arXiv:2006.03677. [87] Z. Liu, J. Li, Y. Shu, and D. Zhang, ‘‘Detection and recognition of security
[64] T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. Girshick, detection object based on YOLO9000,’’ in Proc. 5th Int. Conf. Syst.
‘‘Early convolutions help transformers see better,’’ in Proc. Adv. Neural Informat. (ICSAI), Nov. 2018, pp. 278–282.
Inf. Process. Syst., vol. 34, 2021, pp. 30392–30400. [88] T. Hassan, S. H. Khan, S. Akcay, M. Bennamoun, and N. Werghi, ‘‘Deep
[65] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani, cmst framework for the autonomous recognition of heavily occluded and
‘‘Bottleneck transformers for visual recognition,’’ in Proc. IEEE/CVF cluttered baggage items from multivendor security radiographs,’’ CoRR,
Conf. Comput. Vis. Pattern Recognit. (CVPR). Washington, DC, USA: vol. 14, p. 17, Dec. 2019.
IEEE Computer Society, Jun. 2021, pp. 16514–16524. [89] B. Wang, L. Zhang, L. Wen, X. Liu, and Y. Wu, ‘‘Towards real-world
[66] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, prohibited item detection: A large-scale X-ray benchmark,’’ in Proc.
‘‘Training data-efficient image transformers & distillation through atten- IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 5412–5421.
tion,’’ in Proc. Int. Conf. Mach. Learn., 2021, pp. 10347–10357. [90] B. K. S. Isaac-Medina, N. Bhowmik, C. G. Willcocks, and T. P. Breckon,
[67] K. Yuan, S. Guo, Z. Liu, A. Zhou, F. Yu, and W. Wu, ‘‘Incorporating ‘‘Cross-modal image synthesis within dual-energy X-ray security
convolution designs into visual transformers,’’ in Proc. IEEE/CVF Int. imagery,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work-
Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 579–588. shops (CVPRW), Jun. 2022, pp. 333–341.
[68] Y. Li, K. Zhang, J. Cao, R. Timofte, and L. Van Gool, ‘‘LocalViT: [91] S. Akcay, M. E. Kundegorski, C. G. Willcocks, and T. P. Breckon, ‘‘Using
Bringing locality to vision transformers,’’ 2021, arXiv:2104.05707. deep convolutional neural network architectures for object classification
[69] X. Chu, Z. Tian, B. Zhang, X. Wang, and C. Shen, ‘‘Conditional positional and detection within X-ray baggage security imagery,’’ IEEE Trans. Inf.
encodings for vision transformers,’’ 2021, arXiv:2102.10882. Forensics Security, vol. 13, no. 9, pp. 2203–2215, Sep. 2018.


[92] R. Tao, Y. Wei, X. Jiang, H. Li, H. Qin, J. Wang, Y. Ma, L. Zhang, [114] D. Mery, D. Saavedra, and M. Prasad, ‘‘X-ray baggage inspection with
and X. Liu, ‘‘Towards real-world X-ray security inspection: A high- computer vision: A survey,’’ IEEE Access, vol. 8, pp. 145620–145633,
quality benchmark and lateral inhibition module for prohibited items 2020.
detection,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, [115] D. Mery, Computer Vision for X-Ray Testing, vol. 10. Cham, Switzerland:
pp. 10923–10932. Springer, 2015.
[93] Y. F. A. Gaus, N. Bhowmik, S. Akçay, P. M. Guillén-Garcia, J. W. Barker, [116] M. Caldwell and L. D. Griffin, ‘‘Limits on transfer learning from pho-
and T. P. Breckon, ‘‘Evaluation of a dual convolutional neural network tographic image data to X-ray threat detection,’’ J. X-Ray Sci. Technol.,
architecture for object-wise anomaly detection in cluttered X-ray security vol. 27, no. 6, pp. 1007–1020, Jan. 2020.
imagery,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2019, [117] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and
pp. 1–8. A. Zisserman, ‘‘The PASCAL visual object classes (VOC) challenge,’’
[94] B. Ma, T. Jia, M. Su, X. Jia, D. Chen, and Y. Zhang, ‘‘Automated Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Sep. 2010.
segmentation of prohibited items in X-ray baggage images using dense [118] M. Everingham and J. Winn, ‘‘The PASCAL visual object classes chal-
de-overlap attention snake,’’ IEEE Trans. Multimedia, early access, lenge 2012 (VOC2012) development kit,’’ Pattern Anal. Stat. Model.
May 11, 2022, doi: 10.1109/TMM.2022.3174339. Comput. Learn., Tech. Rep., 2012, pp. 1–45.
[95] T. Hassan and N. Werghi, ‘‘Trainable structure tensors for autonomous [119] M. Everingham and J. Winn, ‘‘The PASCAL visual object classes chal-
baggage threat detection under extreme occlusion,’’ in Proc. Asian Conf. lenge 2007 (VOC2007) development kit,’’ Univ. Leeds, Leeds, U.K.,
Comput. Vis., 2020, pp. 1–16. Tech. Rep., 2007.
[96] Ultralytics. (2022). YOLOv5. [Online]. Available: https://github.com/ [120] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
ultralytics/yolov5/releases/tag/v6.1 A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei,
[97] S. S. A. Zaidi, M. S. Ansari, A. Aslam, N. Kanwal, M. Asghar, and B. Lee, ‘‘ImageNet large scale visual recognition challenge,’’ Int. J. Comput. Vis.,
‘‘A survey of modern deep learning based object detection models,’’ Digit. vol. 115, no. 3, pp. 211–252, Dec. 2015.
Signal Process., vol. 126, Jun. 2022, Art. no. 103514. [121] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ‘‘ImageNet:
[98] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, ‘‘Aggregated residual A large-scale hierarchical image database,’’ in Proc. IEEE Conf. Comput.
transformations for deep neural networks,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1492–1500. [122] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
[99] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and P. Dollár, and C. L. Zitnick, ‘‘Microsoft COCO: Common objects in
I.-H. Yeh, ‘‘CSPNet: A new backbone that can enhance learning capa- context,’’ in Proc. Eur. Conf. Comput. Vis. Springer, 2014, pp. 740–755.
bility of CNN,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. [123] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset,
Workshops (CVPRW), Jun. 2020, pp. 390–391. S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, and V. Ferrari,
[100] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, ‘‘The open images dataset V4,’’ Int. J. Comput. Vis., vol. 128, no. 7,
and W. Brendel, ‘‘ImageNet-trained CNNs are biased towards tex- pp. 1956–1981, 2020.
ture; increasing shape bias improves accuracy and robustness,’’ 2018, [124] M. Akbari, A. Banitalebi-Dehkordi, and Y. Zhang, ‘‘EBJR: Energy-based
arXiv:1811.12231. joint reasoning for adaptive inference,’’ 2021, arXiv:2110.10343.
[101] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, [125] NeelBhowmik. (2022). X-Ray Datasets. [Online]. Available: https://
Ł. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv. github.com/NeelBhowmik/xray
Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11. [126] (2022). FSOD EDS. [Online]. Available: https://github.com/DIG-
[102] T. Lin, Y. Wang, X. Liu, and X. Qiu, ‘‘A survey of transformers,’’ 2021, Beihang/XrayDetection#x-ray-fsod
arXiv:2106.04554. [127] LPAIS. (2022). Xray-Pi. [Online]. Available: https://github.com/LPAIS/
[103] Y. Liu, Y. Zhang, Y. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Xray-PI
Z. Shi, J. Fan, and Z. He, ‘‘A survey of visual transformers,’’ 2021, [128] (2022). Pixray. [Online]. Available: https://github.com/Mbwslib/DDoAS
arXiv:2111.06091. [129] (2022). CLCXray. [Online]. Available: https://github.com/Greyson
[104] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and Phoenix/CLCXray
M. Pietikäinen, ‘‘Deep learning for generic object detection: A survey,’’ [130] (2022). HiXray. [Online]. Available: https://github.com/DIG-
Int. J. Comput. Vis., vol. 128, pp. 261–318, Feb. 2020. Beihang/XrayDetection
[105] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, [131] N. Bhowmik, Y. F. A. Gaus, and T. P. Breckon, ‘‘On the impact of
‘‘Transformers in vision: A survey,’’ ACM Comput. Surv., vol. 54, no. 10, using X-ray energy response imagery for object detection via convolu-
pp. 1–41, Jan. 2022. tional neural networks,’’ in Proc. IEEE Int. Conf. Image Process. (ICIP),
[106] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, Sep. 2021, pp. 1224–1228.
C. Xu, Y. Xu, Z. Yang, Y. Zhang, and D. Tao, ‘‘A survey on vision [132] LPAIS. (2022). Pidray. [Online]. Available: https://github.com/
transformer,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, bywang2018/security-dataset
pp. 87–110, Jan. 2023. [133] M. Naji, A. Anaissi, A. Braytee, and M. Goyal, ‘‘Anomaly detection in
[107] D. Zhou, Z. Yu, E. Xie, C. Xiao, A. Anandkumar, J. Feng, and X-ray security imaging: A tensor-based learning approach,’’ in Proc. Int.
J. M. Alvarez, ‘‘Understanding the robustness in vision transformers,’’ in Joint Conf. Neural Netw. (IJCNN), Jul. 2021, pp. 1–8.
Proc. Int. Conf. Mach. Learn., 2022, pp. 27378–27394. [134] B. K. S. Isaac-Medina, C. G. Willcocks, and T. P. Breckon, ‘‘Multi-view
[108] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, ‘‘Sharpness- object detection using epipolar constraints within cluttered X-ray security
aware minimization for efficiently improving generalization,’’ 2020, imagery,’’ in Proc. 25th Int. Conf. Pattern Recognit. (ICPR), Jan. 2021,
arXiv:2010.01412. pp. 9889–9896.
[109] X. Chen, C.-J. Hsieh, and B. Gong, ‘‘When vision transformers outper- [135] LPAIS. (2022). OPIXray. [Online]. Available: https://github.com/
form ResNets without pre-training or strong data augmentations,’’ 2021, OPIXray-author/OPIXray
arXiv:2106.01548. [136] (2022). SIXray. [Online]. Available: https://github.com/MeioJane/SIXray
[110] M. M. Naseer, K. Ranasinghe, S. H. Khan, M. Hayat, F. S. Khan, and [137] (2022). Compass-XP. [Online]. Available: https://zenodo.org/record/
M.-H. Yang, ‘‘Intriguing properties of vision transformers,’’ in Proc. Adv. 2654887/#%5C.YUtGVHVKikA
Neural Inf. Process. Syst., vol. 34, 2021, pp. 23296–23308. [138] K. J Liang, J. B. Sigman, G. P. Spell, D. Strellis, W. Chang, F. Liu,
[111] J. Guo, K. Han, H. Wu, Y. Tang, X. Chen, Y. Wang, and C. Xu, T. Mehta, and L. Carin, ‘‘Toward automatic threat recognition for air-
‘‘CMT: Convolutional neural networks meet vision transformers,’’ 2021, port X-ray baggage screening with deep convolutional object detection,’’
arXiv:2107.06263. 2019, arXiv:1912.06329.
[112] F. L. Roder, ‘‘Explosives detection by dual-energy computed tomogra- [139] LPAIS. (2022). GDXray. [Online]. Available: https://domingomery.
phy (CT),’’ Proc. SPIE, vol. 182, pp. 171–178, Oct. 1979. ing.puc.cl/material/gdxray/
[113] B. Abidi, Y. Zheng, A. Gribok, and M. Abidi, ‘‘Screener evaluation of [140] D. Mery, V. Riffo, U. Zscherpel, G. Mondragón, I. Lillo, I. Zuccar,
pseudo-colored single energy X-ray luggage images,’’ in Proc. IEEE H. Lobel, and M. Carrasco, ‘‘GDXray: The database of X-ray images for
Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR) Workshops, nondestructive testing,’’ J. Nondestruct. Eval., vol. 34, no. 4, pp. 1–12,
Sep. 2005, p. 35. 2015.


[141] Y. Wei, R. Tao, Z. Wu, Y. Ma, L. Zhang, and X. Liu, ‘‘Occluded pro- [166] W. Wang, J. Dai, Z. Chen, Z. Huang, Z. Li, X. Zhu, X. Hu, T. Lu,
hibited items detection: An X-ray security inspection benchmark and de- L. Lu, H. Li, X. Wang, and Y. Qiao, ‘‘InternImage: Exploring large-
occlusion attention module,’’ in Proc. 28th ACM Int. Conf. Multimedia, scale vision foundation models with deformable convolutions,’’ 2022,
Oct. 2020, pp. 138–146. arXiv:2211.05778.
[142] M. E. Kundegorski, S. Akçay, M. Devereux, A. Mouton, and [167] J.-B. Cordonnier, A. Loukas, and M. Jaggi, ‘‘On the relationship between
T. P. Breckon, ‘‘On using feature descriptors as visual words for self-attention and convolutional layers,’’ 2019, arXiv:1911.03584.
object detection within X-ray baggage security screening,’’ Tech. Rep., [168] S. Li, X. Chen, D. He, and C.-J. Hsieh, ‘‘Can vision transformers perform
2016. convolution?’’ 2021, arXiv:2111.01353.
[143] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, [169] B. Yang, L. Wang, D. Wong, L. S. Chao, and Z. Tu, ‘‘Convolutional self-
‘‘OverFeat: Integrated recognition, localization and detection using con- attention networks,’’ 2019, arXiv:1904.03107.
volutional networks,’’ 2013, arXiv:1312.6229. [170] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, ‘‘Self-
[144] L. Zhang, L. Lin, X. Liang, and K. He, ‘‘Is faster R-CNN doing well for supervised learning: Generative or contrastive,’’ IEEE Trans. Knowl.
pedestrian detection?’’ in Proc. Eur. Conf. Comput. Vis. Springer, 2016, Data Eng., vol. 35, no. 1, pp. 857–876, Jan. 2023.
pp. 443–457. [171] D. Mery and C. Pieringer, Computer Vision for X-Ray Testing, 2nd ed.
[145] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and Cham, Switzerland: Springer, 2021.
A. C. Berg, ‘‘SSD: Single shot multibox detector,’’ in Proc. Eur. Conf. [172] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
Comput. Vis. Springer, 2016, pp. 21–37. S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adversarial nets,’’ in
[146] J. Dai, Y. Li, K. He, and J. Sun, ‘‘R-FCN: Object detection via region- Proc. Adv. Neural Inf. Process. Syst., vol. 27, 2014, pp. 1–9.
based fully convolutional networks,’’ in Proc. Adv. Neural Inf. Process.
Syst., vol. 29, 2016, pp. 1–9.
[147] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, ‘‘Densely
connected convolutional networks,’’ in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[148] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, ‘‘Focal loss for
dense object detection,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV),
Oct. 2017, pp. 2980–2988.
[149] Z. Cai and N. Vasconcelos, ‘‘Cascade R-CNN: Delving into high quality JIAJIE WU received the M.S. degree in computer
object detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog- science from the Shanxi University of Finance and
nit., Jun. 2018, pp. 6154–6162. Economics, Taiyuan, China, in 2017. He is cur-
[150] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, rently pursuing the Ph.D. degree in computer sci-
‘‘Feature pyramid networks for object detection,’’ in Proc. IEEE Conf. ence with Hangzhou Danzi University. He focuses
Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125. on the field of object detection and X-ray imaging
[151] K. Chen, W. Ouyang, C. C. Loy, D. Lin, J. Pang, J. Wang, Y. Xiong, X. Li, security detection.
S. Sun, W. Feng, Z. Liu, and J. Shi, ‘‘Hybrid task cascade for instance
segmentation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.
(CVPR), Jun. 2019, pp. 4974–4983.
[152] PaddlePaddle. (2022). YOLOv6-L (V2.1). [Online]. Available:
https://github.com/meituan/YOLOv6/releases/tag/0.2.1
[153] C.-Y. Wang, A. Bochkovskiy, and H.-Y.-M. Liao, ‘‘Scaled-YOLOv4:
Scaling cross stage partial network,’’ in Proc. IEEE/CVF Conf. Comput.
Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 13029–13038.
[154] S. Xu, X. Wang, W. Lv, Q. Chang, C. Cui, K. Deng, G. Wang, Q. Dang,
S. Wei, Y. Du, and B. Lai, ‘‘PP-YOLOE: An evolved version of YOLO,’’
2022, arXiv:2203.16250. XIANGHUA XU received the Ph.D. degree
[155] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, in computer science from Zhejiang University,
‘‘Selective search for object recognition,’’ Int. J. Comput. Vis., vol. 104, Hangzhou, China, in 2005. He is currently a
no. 2, pp. 154–171, Feb. 2013. Professor with the School of Computer Science
[156] K. Grauman and T. Darrell, ‘‘The pyramid match kernel: Discriminative and Technology, Hangzhou Dianzi University,
classification with sets of image features,’’ in Proc. 10th IEEE Int. Conf. Hangzhou. He has authored or coauthored over
Comput. Vis. (ICCV), vol. 1, Oct. 2005, pp. 1458–1465. 100 peer-reviewed journals and conference papers.
[157] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, His current research interests include computer
L. Dong, F. Wei, and B. Guo, ‘‘Swin transformer v2: Scaling up capacity vision and parallel and distributed computing.
and resolution,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. He was a recipient of the Best Paper Award at
(CVPR), Jun. 2022, pp. 12009–12019. the 2012 IEEE International Symposium on Workload Characterization.
[158] D. Thuan, ‘‘Evolution of YOLO algorithm and YOLOv5: The state-of-
the-art object detention algorithm,’’ Tech. Rep., 2021.
[159] G. Chaucer. (2022). YOLOU: United, Study and Easier to Deploy.
[Online]. Available: https://github.com/jizhishutong/YOLOU
[160] PaddlePaddle. (2022). YOLOSeries. [Online]. Available: https://github.
com/nemonameless/PaddleDetection_YOLOSeries
[161] Iscyy. (2022). YOLOAir: Makes Improvements Easy Again. [Online].
Available: https://github.com/iscyy/yoloair
[162] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, ‘‘Distance-IoU loss: JUNYAN YANG received the B.S. degree in civil
Faster and better learning for bounding box regression,’’ in Proc. AAAI engineering from Beijing Forestry University, Bei-
Conf. Artif. Intell., 2020, vol. 34, no. 7, pp. 12993–13000. jing, China, in 2021. He is currently pursuing the
[163] D. Misra, ‘‘Mish: A self regularized non-monotonic activation function,’’ M.S. degree in computer science with Hangzhou
2019, arXiv:1908.08681. Danzi University. He focuses on the field of com-
[164] Y. Fang, W. Wang, B. Xie, Q. Sun, L. Wu, X. Wang, T. Huang, X. Wang, puter vision and X-ray imaging security detection.
and Y. Cao, ‘‘EVA: Exploring the limits of masked visual representation
learning at scale,’’ 2022, arXiv:2211.07636.
[165] P. Gao, T. Ma, H. Li, Z. Lin, J. Dai, and Y. Qiao, ‘‘ConvMAE: Masked
convolution meets masked autoencoders,’’ 2022, arXiv:2205.03892.
