A Survey of Object Detection and X-Ray Security Imaging
ABSTRACT Security is paramount in public places such as subways, airports, and train stations, where secu-
rity screeners use X-ray imaging technology to check passengers’ luggage for potential threats. To streamline
this process and make it more efficient, researchers have turned to object detection techniques with the
help of deep learning. While some progress has been made, there are few comprehensive literature reviews.
This paper provides a comprehensive overview of the standard object detection algorithms and principles
in X-ray dangerous goods detection. The article begins by classifying and describing the more popular
deep learning object detection techniques in detail and presenting the commonly used publicly available
datasets and metrics. It then summarizes previous applications of deep learning techniques in
X-ray dangerous goods detection, highlighting their successes and limitations. Finally, based on an analysis
of the experimental results, it summarizes some of the limitations of deep learning in X-ray baggage detection
thus far. It offers insights into the future of this exciting field. With this review, we hope to provide valuable
insights and guidance for those seeking to improve public safety through X-ray imaging and deep learning
technology.
INDEX TERMS Object detection, deep learning, X-ray image detection, baggage security, YOLO.
it focuses on improving the accuracy of detection tasks under multiple overlapping obstacles or in pseudo-color images. Secondly, from a data perspective, it focuses on using deep learning techniques to synthesize better pseudo-color images that are closer to natural images, and on how to quickly and efficiently expand X-ray datasets to improve the final recognition accuracy.
• In terms of research approaches, there are three main types: image classification, object detection, and object segmentation. In practical applications, these three types of algorithms are often combined in order to improve the final recognition accuracy. The primary pieces of literature are shown in figure 2.

FIGURE 2. Overview of deep learning algorithms in the field of X-ray baggage dangerous goods detection.

We are motivated by the fact that few articles have provided a detailed and comprehensive summary of deep learning algorithms and their applications in X-ray hazardous materials detection. This paper provides a comprehensive review of object detectors based on CNNs, Transformers, and hybrid algorithms, together with a summary of their application to the X-ray image security screening field, to fill this gap. The main contributions of the article are:
• An introduction to the popular object detection algorithms so far and an overview of their classification into CNN-based, Transformer-based, and hybrid algorithms.
• A detailed description of a series of models applying deep learning algorithms to X-ray baggage hazardous materials detection.
• Experiments with different detection algorithms on an open X-ray baggage dataset, with a meaningful analysis of the results.
• An outlook on the application of deep learning algorithms in the field of X-ray image security detection.

The other sections of this paper are organized as follows: Section 2 introduces the basic principles of object detection algorithms, the differences between CNN and Transformer algorithms, and the principles of X-ray imaging; Section 3 details each of the three types of object detection algorithms according to their algorithmic structure, namely CNN-based, Transformer-based, and hybrid algorithms; Section 4 describes the application of deep learning to X-ray hazardous material detection, including classification, detection, and segmentation; Section 5 uses four algorithms (YOLOv5 [96], YOLOv7 [27], DINO [43], and Next-ViT [72]) to perform dangerous goods detection on publicly available X-ray baggage datasets and gives a meaningful analysis of the results; Section 6 provides a summary and outlook, showing the shortcomings of current deep learning algorithms applied in X-ray security screening and looking into the future.

II. BACKGROUND
A. DEEP LEARNING ARCHITECTURE
Object detection aims to localize and identify the targets in a given image. Locating the number and class of objects can become quite challenging due to the masking, exposure, and perspective of the objects in the image. The computer model must overcome these problems as far as possible while also taking timeliness into account [97]. Three common backbone architectures are listed below.

1) CNN-BASED BACKBONE
Krizhevsky et al. [2] won the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) with AlexNet thanks to its outstanding performance, quickly making CNNs the first choice for handling various tasks in the field of computer vision. Many classical CNN feature extraction networks have since emerged, including VGG [6], GoogLeNet [7], ResNets [8],
ResNeXt [98], CSPNet [99], and EfficientNet [16]. These network structures are shown in figure 3.
Take VGG-16 as an example; its specific structure is shown in figure 4.¹ The process of extracting features in a convolutional layer is shown in equation 1:

x_j^l = \sigma\left( \sum_{i=1}^{N^{l-1}} x_i^{l-1} \cdot w_{i,j}^{l} + b_j^{l} \right),    (1)

where x_j^l denotes the jth feature of the lth layer, N^{l-1} denotes the number of features in layer l-1, w_{i,j}^l denotes the convolution kernel of the lth layer, b_j^l denotes the corresponding bias term, and \sigma denotes the nonlinear ReLU function. It has been shown that CNNs are not good at processing high-frequency noise in images, so they are biased toward extracting the texture features of images [100].
1 Drawing tool from https://github.com/HarisIqbal88/PlotNeuralNet
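As a concrete illustration of equation 1, the following PyTorch sketch applies a single convolution followed by ReLU to a batch of feature maps; the channel counts and spatial size are illustrative and are not taken from VGG-16 itself.

```python
import torch
import torch.nn as nn

# Minimal sketch of equation 1: one convolutional layer followed by ReLU.
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
act = nn.ReLU(inplace=True)

x = torch.randn(1, 64, 56, 56)   # x^{l-1}: N^{l-1} = 64 input feature maps
features = act(conv(x))          # x^l = ReLU(sum_i x_i^{l-1} * w_{i,j}^l + b_j^l)
print(features.shape)            # torch.Size([1, 128, 56, 56])
```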
2) TRANSFORMER-BASED BACKBONE
The transformer was originally used to solve problems such as machine translation in the NLP domain [101], and its structure is shown in figure 5. After the great success of the Transformer-based model, [3] applied it to the image classification task and proposed the ViT model, whose structure is shown in figure 6. Later, the ViT model and its variants were applied to various computer vision tasks, including object detection, scene segmentation, and so on [102], [103], [104], [105], [106].
A large part of the transformer's success is attributed to the attention mechanism, namely multi-head self-attention (MSAs) [107]. Specifically, given the query matrix Q \in R^{N \times D_k}, the key matrix K \in R^{M \times D_k}, and the value matrix to be matched V \in R^{M \times D_v}, where N and M denote the lengths corresponding to Q and K, and D_k and D_v denote the dimensions corresponding to K and V, the computation process is as follows:

Attention(Q, K, V) = softmax\left( \frac{QK^{\top}}{\sqrt{D_k}} \right) V = AV,    (2)

where the attention matrix A = softmax(QK^{\top} / \sqrt{D_k}). The dot product of Q and K divided by \sqrt{D_k} alleviates the gradient vanishing problem of the softmax function.
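A minimal PyTorch sketch of the scaled dot-product attention in equation 2 is given below; the sequence lengths and dimensions are arbitrary, and the multi-head splitting and learned projections of full MSAs are omitted for brevity.

```python
import math
import torch

def attention(Q, K, V):
    """Scaled dot-product attention of equation 2: softmax(QK^T / sqrt(D_k)) V."""
    d_k = Q.size(-1)
    A = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_k), dim=-1)  # attention matrix A
    return A @ V

# N = 4 queries, M = 6 keys/values, D_k = D_v = 32 (illustrative sizes)
Q, K, V = torch.randn(4, 32), torch.randn(6, 32), torch.randn(6, 32)
print(attention(Q, K, V).shape)  # torch.Size([4, 32])
```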
Besides, the training process of MSAs can be viewed as a process of smoothing the feature mapping space, as shown in figure 7. Subfigure (a) is the Loss-Landscape of ViT before smoothing, and (b) is the Loss-Landscape after smoothing using SAM [108]. The flatter the Loss-Landscape, the better the model's performance and generalization ability. It is also shown in equation 2 that a positive average eigenvalue mapping enhances the performance of MSAs, while a negative value disrupts the optimization of the model. This provides a direction for optimizing ViT. The reason why ViT requires a large amount of data for pre-training to achieve better results is that a large amount of data can help the model suppress negative Hessian eigenvalues in the early stages of training, achieving the effect of smoothing the Loss-Landscape and making the loss convex [5], [109]. [110] shows that MSAs are good at extracting the outline information of objects and are not good at processing low-frequency signals.
In vanilla ViT, if the pixels are directly processed using the attention mechanism as in NLP tasks, the computational complexity is quadratic in the image size, which is unacceptable for most image processing tasks. In addition, ViT with a fixed-scale token is not fully applicable to vision tasks because the objects in images vary in scale. Many improvements have been made to address these flaws.
Taking the Swin Transformer as an example, the above defects are addressed by using hierarchical feature maps obtained through a downsampling operation and a shifted-window attention mechanism [54]. From figure 8, we can see that the spatial resolution of the hierarchical feature maps in the Swin Transformer is the same as that in ResNet, so it can easily replace ResNet as the backbone in a network. The use of the W-MSA and SW-MSA modules to implement the attention mechanism greatly reduces the computational resources required, through window exchange.

3) HYBRID BACKBONE
Hybrid frameworks are one of the current research hotspots [59], [60], [61], [72], [74], [111]. It has been shown that CNNs filter the low-frequency part of an image, while MSAs filter the high-frequency part; it is known that the high-frequency signal corresponds to the outline edges in the image, and the low-frequency part mostly corresponds to the background [5].
The latest generation of hybrid frameworks hybridizes CNN with MSAs inside the stage [5], [72], not outside the stage, as shown in figure 9.

B. X-RAY IMAGING
The different imaging principles lead to differences between X-ray and natural light images, as shown in figure 10.
FIGURE 9. A mix of MSAs and Conv in the stage. The above diagram
shows the regular CNN block. The diagram below shows the convergence
of the Conv and MSAs layers within the stage.
Principle of X-ray imaging: The main principle of X-ray imaging is that the X-ray tube produces a beam that can penetrate the scanned object. Different objects have different densities, and X-rays attenuate differently at different material densities. This process can be expressed as equation 3:

I_x = I_0 e^{-\mu x},    (3)

where I_x denotes the intensity of the X-ray after passing through a thickness x of material, I_0 is the initial intensity value, and \mu denotes the linear attenuation coefficient of the material.
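For illustration, the following Python snippet evaluates equation 3 for two hypothetical materials; the attenuation coefficients are made-up values chosen only to show that a denser material transmits less intensity.

```python
import numpy as np

def transmitted_intensity(I0, mu, x):
    """Beer-Lambert attenuation of equation 3: I_x = I_0 * exp(-mu * x)."""
    return I0 * np.exp(-mu * x)

I0 = 1.0
# Illustrative linear attenuation coefficients (per cm); not calibrated values.
for name, mu in [("organic, low density", 0.2), ("metal, high density", 2.0)]:
    print(name, transmitted_intensity(I0, mu, x=1.0))
```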
Currently, with the development of technology, X-ray machines are equipped with various energies and can produce a wide range of X-ray images for identifying the density and effective atomic number Z_eff of objects. The intensity estimates can be combined with Z_eff by [113] and converted to the corresponding pseudo-color images, as shown in figure 11. In addition, it is also possible to form X-ray maps from multiple angles [76], [114], [115].

FIGURE 11. X-ray pseudo-color map imaging process [116] and [114].

C. OBJECT DETECTION METRICS
The two most common metrics used in object detection tasks are as follows:
1) GFLOPs: giga floating-point operations, i.e., billions of floating-point operations required by the model, often used to evaluate the computational cost of a model on a GPU.
2) mAP: Mean Average Precision, first proposed in the VOC competition. The AP of each class is computed from its precision-recall curve, and mAP is the mean of the AP values over all classes.
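Both AP and the mAP50 values reported later in this survey rest on the IoU between a predicted box and a ground-truth box; the sketch below shows a plain IoU computation with illustrative boxes and is not tied to any particular evaluation toolkit.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# At mAP50, a prediction counts as a true positive if IoU >= 0.5 with a ground-truth box.
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39
```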
TABLE 1. Detailed parameters for four types of object detection datasets [104].
TABLE 2. Commonly used data sets for X-ray baggage detection of dangerous goods [125].
b: SIXray [81]
The dataset contains 1,059,231 X-ray images, divided into six categories: guns, knives, wrenches, pliers, scissors, and hammers, as shown in table 4. The SIXray dataset reproduces the actual scene and is a class of datasets with a severe imbalance between positive and negative samples, which makes it suitable for real-time detection studies.
In the experimental part of the article, the authors set up three sub-datasets, SIXray10, SIXray100, and SIXray1000, where the numbers 10 and 100 in the first two sub-datasets represent the ratio of normal to prohibited items. For example, in the SIXray10 dataset, there are 98,219 images, of which 8,929 are prohibited items and 89,290 are normal items. The SIXray1000 sub-dataset is a mixture of 1,000 randomly selected prohibited-item images and 1,050,302 ordinary images.

c: Compass-XP Dataset [116]
The instances in this dataset are 501 instances drawn from 369 target classes of ImageNet, as shown in table 5. Moreover, it contains 1,901 image pairs, and each pair contains an X-ray image scanned with a Gilardoni FEP ME 536 and a natural image taken with a Sony DSC-W800 digital camera. Each X-ray image package contains different image versions: low energy, high energy, material density, grayscale (a combination of low and high energy), and pseudo-color RGB, which are ideal for studying X-ray imaging principles. However, it is unsuitable for deep learning-based target recognition tasks.

d: OPIXray [141]
This dataset comes from real-time airport security data, manually annotated by professional security officers. It contains 8,885 X-ray images, divided into five categories: folding knives, straight knives, scissors, utility knives, and multi-tool knives, as described in table 6.

TABLE 6. OPIXray Dataset [141].

e: PIDray [89]
This dataset covers many real-world scenarios for detecting prohibited items, especially intentionally hidden items. The PIDray dataset contains 12 categories with a total of 47,677 X-ray images, and each image comes with a high-quality annotated segmentation mask and bounding box. The test set is composed of three subsets according to the degree of masking: easy, medium, and challenging, as described in table 7.

f: HiXray [92]
This dataset contains eight categories: lithium-ion prismatic batteries, lithium-ion cylindrical batteries, water, laptops, cell phones, tablets, cosmetics, and non-metallic lighters, for a total of 102,928 contraband items, which is by far the largest number of contraband items included in a dataset (as of 2022). The data was collected from actual airport security checks and manually annotated by professional security screeners, as described in table 8.

g: DB
Durham Baggage (DB) Patch/Full Image. This database is private and not publicly available. The dataset includes 15,449 pseudo-color X-ray images from dual-energy four-view Smiths 6040i machines, including 494 cameras, 1,596 ceramic knives, 3,208 knives, 3,192 guns, 1,203 gun parts, 2,390 laptops, and 3,366 benign images. The derived databases include DBP2 and DBP6 [142], which address the classification task, and DBF2 and DBF6 [84], [91].

FIGURE 13. Data distribution of the COCO training and validation datasets [124].
FIGURE 14. The six most common public datasets in X-ray baggage detection (2022).
TABLE 3. GDXray Dataset [140].

III. OBJECT DETECTION
Early detectors can be classified into two categories according to the detection process: single-stage and two-stage. The latter, unlike the former, first generates candidate boxes (region proposals) that may contain the objects to be detected. However, as research progressed, the introduction of more advanced detectors broke the boundaries of the single/two-stage classification, and classifying detectors by the single/two-stage approach alone has become inadequate. Therefore, instead of presenting the classification of detectors according to the single/two-stage approach as other papers have done, this paper presents the detectors according to their backbone architectures. The common object detection algorithms in both families are summarised in table 9.
• The algorithm will process the information for semantic segmentation.
The HTC algorithm is mainly used for semantic segmentation problems but can also be modified for object detection problems, e.g., HTC++ [157].

2) YOLO SERIES DETECTORS
The YOLO series detectors received much attention as soon as they were introduced. YOLO, or You Only Look Once, differs from R-CNN in that YOLO does not select targets by generating candidate boxes but performs target prediction and recognition directly at the pixel level.²
2 [158], [159], [160], [161] integrate the common YOLO series algorithms.
YOLOv1 [23] divides the image into a grid and then uses the grid for detection. Each grid cell is responsible for detecting targets that fall within its region. Each cell can predict multiple bounding boxes, and a bounding box is denoted by p_box = (x, y, w, h, \alpha), where (x, y) denotes the center coordinates of the target object, (w, h) denotes the width and height, and the confidence \alpha = p_obj \times IOU, with p_obj the probability that an object falls in the grid cell. The final prediction is res = (p_box, Classes), where Classes holds the scores of the C target classes in the dataset. The above procedure is shown in figure 18.
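The following sketch decodes one YOLOv1-style cell prediction using the confidence definition above; the tensors, class count, and threshold are illustrative, and the grid-offset and non-maximum-suppression steps are omitted.

```python
import torch

def decode_cell(pred_boxes, class_probs, conf_thresh=0.25):
    """Score YOLOv1-style cell predictions.

    pred_boxes:  (B, 5) tensor of (x, y, w, h, alpha) per predicted box,
                 where alpha = p_obj * IoU is the box confidence.
    class_probs: (C,) tensor of conditional class probabilities for the cell.
    Returns (box, class_id, score) tuples above the confidence threshold.
    """
    detections = []
    for box in pred_boxes:
        scores = box[4] * class_probs        # class-specific confidence
        score, cls = scores.max(dim=0)
        if score >= conf_thresh:
            detections.append((box[:4], int(cls), float(score)))
    return detections

boxes = torch.tensor([[0.5, 0.5, 0.2, 0.3, 0.8],   # confident box
                      [0.4, 0.6, 0.1, 0.1, 0.1]])  # low-confidence box
classes = torch.tensor([0.1, 0.7, 0.2])
print(decode_cell(boxes, classes))  # keeps only the first box (class 1, score 0.56)
```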
YOLOv2 [24] is an improvement on YOLOv1; the YOLO9000 model built on it can detect 9000 target classes. YOLOv2 strives for a balance of speed and accuracy.
YOLOv3 [25] used Darknet-53 as the backbone for feature extraction, unlike the previous two versions, and achieved state-of-the-art results in its period.
YOLOv4 [26] makes changes to the model structure. A BoF (Bag of Freebies) strategy is used in the model, which can improve detection accuracy with only an increase in training cost and no impact on inference speed. The BoF strategy in YOLOv4 includes data augmentation, regularization, CmBN (Cross mini-Batch Normalization), CIoU loss [162], and other techniques. BoS (Bag of Specials) strategies, i.e., plug-in modules and post-processing methods that add only a small amount of inference time but can significantly improve object detection accuracy, are also used in YOLOv4. The former enhances certain model properties, such as expanding the receptive field, introducing attention mechanisms, or enhancing feature integration capabilities, while the latter filters the model prediction results. BoS strategies in YOLOv4 include the Mish activation [163], CSP (cross-stage partial connections) [99], and other techniques. The architecture is shown in figure 19.
YOLOv5³ and YOLOv6⁴ only release source code and have no related academic paper. YOLOv5 retains much of the network structure of YOLOv4 and is one of the most used and popular object detection models in industry, especially for applications in real-time video detection.
3 https://github.com/ultralytics/yolov5
4 https://github.com/meituan/YOLOv6
The backbone design idea of YOLOv6 is mainly derived from RepVGG [17], with the primary purpose of achieving more straightforward deployment and faster inference speed in industry.
YOLOX [30] argues that YOLOv4 and YOLOv5 may have over-optimized the anchor-based detection method, so YOLOX chose to build its improvements on YOLOv3. Previously, YOLOv5 had achieved the best performance on the COCO dataset (48.2% AP). The improvements made by YOLOX include: (i) YOLOX changes the original coupled head to a decoupled head, as shown in figure 20(b); specifically, YOLOX decouples the Cls, Reg, and IoU branches, allowing the network to learn the categories and the corresponding coordinate regressions better. (ii) Detection is carried out in an anchor-free manner, i.e., without generating a priori boxes (anchors) of different sizes and aspect ratios at each position on a given feature map. The purpose of this is to:
1) reduce the computational effort of the model by producing fewer prediction boxes;
2) mitigate positive and negative sample imbalance;
3) remove the need to design anchor parameters manually.
YOLOR [28] uses an encoder to learn both implicit and explicit knowledge representations, using the implicit information to perform different tasks; this technique is also integrated into YOLOv7.
YOLOv7 [27] increases the training cost in order to improve accuracy. By using the reparametrization trick, the inference cost is kept constant. In addition, YOLOv7's backbone is based on a cascade structure. Changes in network depth often bring about changes in width when the model is scaled, and thus both need to be considered comprehensively when the model is scaled for evaluation. Several trainable Bag-of-Freebies methods are designed in the paper to solve the above problems, including a planned reparametrization module design, a dynamic label assignment strategy (coarse for auxiliary, fine for loss), batch normalization in the topology, and EMA module usage. YOLOv7 can effectively reduce about 40% of the parameters and 50% of the computation of existing real-time object detectors, and it has faster inference and higher detection accuracy, with the structure shown in figure 21.

B. TRANSFORMER-BASED DETECTORS
The MSAs mechanism in the transformer allows better extraction of contour information from the image. The main limitation of the Transformer is its high computational overhead, which is usually quadratic in the input feature size. Common Transformer-based object detection algorithms are outlined in table 10.
DETR [31] is one of the first end-to-end transformer-based object detectors. It treats the object detection problem as a set prediction problem. Unlike traditional object detectors, DETR learns anchors (rather than designing them by hand) and does not use non-maximum suppression (NMS) post-processing. Instead, position-encoded ''object queries'' are fed to the
decoder to find the features of an object in the image and decode the image features. The predictor produces detection results directly from the decoder's output queries, as figure 22 shows. The optimal bipartite matching between the N predictions and the ground-truth objects is found as

\hat{\sigma} = \arg\min_{\sigma \in S_N} \sum_{i}^{N} L_{match}(y_i, \hat{y}_{\sigma(i)}),
L_{match}(y_i, \hat{y}_{\sigma(i)}) = -1_{\{c_i \neq \varnothing\}} \hat{p}_{\sigma(i)}(c_i) + 1_{\{c_i \neq \varnothing\}} L_{box}(b_i, \hat{b}_{\sigma(i)}).    (6)
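The matching \hat{\sigma} in equation 6 is typically computed with the Hungarian algorithm. The sketch below runs SciPy's linear_sum_assignment on a toy cost matrix that loosely mimics -\hat{p}(c_i) + L_box; all values are made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy matching cost between 3 ground-truth objects (rows) and 4 predictions
# (columns): cost = -p_hat(c_i) + L_box, loosely mirroring equation 6.
class_prob = np.array([[0.9, 0.1, 0.2, 0.3],
                       [0.2, 0.8, 0.1, 0.1],
                       [0.1, 0.2, 0.1, 0.7]])
box_cost = np.array([[0.2, 1.5, 1.2, 0.9],
                     [1.4, 0.3, 1.1, 1.0],
                     [1.3, 1.2, 1.5, 0.4]])
cost = -class_prob + box_cost

gt_idx, pred_idx = linear_sum_assignment(cost)               # Hungarian algorithm
print([(int(i), int(j)) for i, j in zip(gt_idx, pred_idx)])  # [(0, 0), (1, 1), (2, 3)]
```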
FIGURE 27. Training score network using DAB in SPARSE DETR [35].
TABLE 11. The hybrid architecture detector on the COCO 2017 val dataset. ‘‘P-Epochs’’ indicates the number of times the model was pre-trained.
where T_m and T_n denote adjacent tokens in the input feature z, and O denotes the inner product operation.
NTB uses E-MHSA (Efficient Multi-Head Self-Attention) to capture low-frequency signals, such as the background and other global information, where the SA operator is used to reduce the spatial attention computational complexity. It can be expressed as SA(X) = Attention(X \cdot W^Q, P_s(X \cdot W^K), P_s(X \cdot W^V)), where Attention(Q, K, V) = softmax(QK^{\top} / \sqrt{d_k}) V and P_s means average pooling with stride s, i.e., the computational complexity of attention is reduced by downsampling. The overall framework is shown in figure 29.

FIGURE 29. Overview of the Next-ViT architecture [72].
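The core idea of E-MHSA, downsampling the keys and values before attention, can be sketched in PyTorch as follows; this is a simplified single-head version with assumed shapes, not the actual Next-ViT implementation.

```python
import math
import torch
import torch.nn as nn

class PooledSelfAttention(nn.Module):
    """Sketch of the E-MHSA idea: average-pool K and V with stride s before attention."""
    def __init__(self, dim, stride):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)

    def forward(self, x):                      # x: (B, N, dim) flattened tokens
        q = self.q(x)
        k = self.pool(self.k(x).transpose(1, 2)).transpose(1, 2)   # (B, N/s, dim)
        v = self.pool(self.v(x).transpose(1, 2)).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(x.size(-1)), dim=-1)
        return attn @ v                        # (B, N, dim)

x = torch.randn(1, 64, 96)                     # 64 tokens of dimension 96
print(PooledSelfAttention(dim=96, stride=4)(x).shape)  # torch.Size([1, 64, 96])
```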
ELAN [75] consists of three modules; in this paper, we only discuss the core module, called ELAB (Efficient Long-range Attention Block). The ELAB module takes reference from the Swin Transformer model [54]: it uses Shift-Convolution layers to extract local structural information from images and a GMSA (Group-wise Multi-scale Self-Attention) module to extract global information. The Shift-Convolution layer consists of a shift operator and a 1 × 1 convolution. Its primary function is to shift the first four of five equal groups of the input features in the left, right, top, and bottom directions, so that the 1 × 1 convolution obtains information about the surrounding pixels. GMSA applies group window-based self-attention, with the freedom to adjust the window size within each group [169]. When computing SA (Self-Attention), the ASA (Accelerated Self-Attention) mechanism is used, except that the LN (Layer Normalization) in the transformer is replaced with Batch Normalization to ensure that the SA between groups is computed without additional overhead.

FIGURE 30. (a) ELAB framework, (b) Shift-Convolution layer, (c) GMSA layer, (d) calculation of ASA [75].

IV. X-RAY BAGGAGE DETECTION WITH DEEP LEARNING
X-ray baggage detection is a task that, at this stage, is mainly carried out manually, so there is a massive market for deep learning in this task. According to the detection method, X-ray dangerous goods detection algorithms mainly include conventional image analysis, machine learning, and deep learning algorithms. This paper focuses on deep learning algorithms. Moreover, three types of supervised learning algorithms, namely classification, detection, and segmentation, are used for the introduction [76], [171]. Table 12 summarizes the application of deep learning algorithms.

TABLE 12. Deep learning algorithms in the field of X-ray baggage detection. (If not specified, ACC, mAP, and mIOU in the last column correspond to the performance metrics of the classification, detection, and segmentation algorithms in the first column.)

A. CLASSIFICATION
Classification algorithms were among the first algorithms to emerge in this field. In simple terms, the need is fulfilled by determining whether prohibited items are present during the security screening process. The limitation of this approach is that it treats dangerous goods detection as a simple classification problem, so the final result does not accurately give the type and location of the hazardous materials.
Akcay et al. [77] used a CNN to classify the dataset through transfer learning. Solving a binary classification problem, such as the presence or absence of firearms, demonstrated that using a CNN is more effective than traditional machine learning algorithms such as SVM.
Rogers et al. [78] made the first use of dual-energy X-ray pictures for imaging detection. High-energy (H) and low-energy (L) X-ray images captured by the dual-energy X-ray machine were used as single-channel (H), dual-channel ({H, -log H}, {-log H, -log L}), and four-channel ({-log L, L, H, -log H}) inputs to train a VGG-19 network for classification.
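A small NumPy sketch of these channel arrangements is shown below; the H and L arrays are synthetic stand-ins for real high- and low-energy scans.

```python
import numpy as np

# Synthetic high-energy (H) and low-energy (L) transmission images in (0, 1].
rng = np.random.default_rng(0)
H = rng.uniform(0.05, 1.0, size=(128, 128))
L = rng.uniform(0.05, 1.0, size=(128, 128))

single_channel = H[..., None]                          # {H}
dual_channel = np.stack([H, -np.log(H)], axis=-1)      # {H, -log H}
four_channel = np.stack([-np.log(L), L, H, -np.log(H)], axis=-1)

print(single_channel.shape, dual_channel.shape, four_channel.shape)
# (128, 128, 1) (128, 128, 2) (128, 128, 4)
```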
Zhao et al. [79], [80] introduce GANs [172] into the X-ray imaging detection task through a three-stage learning method for classification. The input X-ray dataset is first
classified and labeled by the angular information of the foreground objects extracted from the input images. A GAN then generates new objects, and finally a small classification network is used to confirm whether the generated images belong to the correct class.
The CHR [81] model performs classification on the SIXray dataset [81]. Since the SIXray dataset is constructed to simulate a realistic environment in which a significant imbalance in security screening data occurs, the CHR model copes with the class imbalance by extracting image features hierarchically.
The results show that the pseudo-color maps synthesized by the conditional GAN model achieve significant results on the private dataset deei6.

C. SEGMENTATION
Object segmentation algorithms segment images into multiple sub-regions. In X-ray images with highly overlapping targets, segmentation algorithms often rely on additional information, such as the contour information of hazardous materials, to complete the segmentation. This brings additional computational effort, while the hazardous material contour information dominates the final segmentation result.
Gaus et al. [93] use a dual convolutional neural network architecture to automatically detect anomalies in X-ray images. The paper uses R-CNN [18], Mask R-CNN [22], and RetinaNet-like detection networks to provide object localization for specific target object classes. Specifically, the images are segmented using Mask R-CNN to initialize the RoIs, followed by a negative/positive classification of the previous RoIs by a network such as RetinaNet, with a segmentation accuracy of 97.6%.
Hassan et al. [95] segmented the targets in the images by extracting structure tensors from different angles, and finally achieved a segmentation mAP of 96.7%/96.16%/75.32%/58.4% on GDXray/SIXray/OPIXray/Compass-XP, respectively.
Ma et al. [94] address the problem of inaccurate identification of different contraband or dangerous goods due to differences in appearance. The model, named DDoAS, consists of two modules: DDoM, which accurately infers contraband information from a dense overlapping background by means of dense backlinks, and ADM, which aims to improve the low learning efficiency caused by differences in shape and size between different contraband items. The limitation of the DDoAS algorithm is that the model uses additional optical information (object edges and vertices) to assist in verification, which makes it challenging to detect contraband with poor edge information, such as small folding knives.

V. EXPERIMENT
In this section, to test the accuracy of standard models in detecting X-ray images without modifying their original structures, and to provide directions for subsequent research, we select the four most common models among the three types of algorithms for experimentation. Specifically, these are YOLOv5,⁵ YOLOv7 [27], DINO [43], and Next-ViT [72]. The datasets used in this experiment are the processed SIXray⁶ and PIDray,⁷ which we denote SIXrayp and PIDrayp, respectively. SIXrayp contains five classes, namely Gun, Knife, Pliers, Scissors, and Wrench, and PIDrayp contains 12 classes, namely Baton, Bullet, Gun, Hammer, HandCuffs, Knife, Lighter, Pliers, Powerbank, Scissors, Sprayer, and Wrench, as shown in table 13. We fine-tuned the models directly on the training set and tested the results on the test set.
5 https://github.com/ultralytics/yolov5
6 https://universe.roboflow.com/object-detection/ugku
7 https://universe.roboflow.com/object-detection/security_xray

TABLE 13. The SIXrayp and PIDrayp datasets.
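As a rough sketch of how a fine-tuned YOLOv5 checkpoint from the procedure above can be run on a single image, the snippet below uses the PyTorch Hub interface documented in the ultralytics/yolov5 repository; the checkpoint path and image name are placeholders.

```python
import torch

# Load a fine-tuned checkpoint (placeholder path) through PyTorch Hub.
model = torch.hub.load('ultralytics/yolov5', 'custom',
                       path='runs/train/exp/weights/best.pt')
model.conf = 0.25                            # confidence threshold for reported detections

results = model('xray_baggage_sample.jpg')   # placeholder test image
results.print()                              # per-class counts and speed summary
print(results.xyxy[0])                       # (x1, y1, x2, y2, confidence, class) per detection
```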
TABLE 14. Basic detection results of YOLOv5 on the PIDrayp dataset.

Since the detection process needs to run in real time, the YOLOv5 and YOLOv7 models are preferred for the detection task, and the results are shown in table 14, table 15, table 16, and table 17, respectively. These tables show that YOLOv7 is more accurate in recognition than YOLOv5, and v7 also has faster inference than the other models thanks to techniques such as model re-parameterization. The four models' visual comparison results are shown in figure 34. As can be seen in subplot (c), the Transformer-based and hybrid models do not work well in X-ray image detection.
The main reason for this is that the MSAs mechanism learns the contour information of the object; however, in an X-ray image the hazardous object to be detected is covered or obscured by a large number of other objects, which leads to confusing feature information obtained by MSAs, so the model cannot correctly distinguish the exact location of the object. On the contrary, the CNN learns more information about the texture of the object from the pseudo-color image, which helps identify the type of object. Inspired by several models, DINO incorporates a variety of factors that facilitate improved recognition accuracy, but it does not perform optimizations specific to this area and has a lower mAP than the other models here. In addition, Next-ViT, as a hybrid model, combines the advantages of both Conv and MSAs; however, as it is not an end-to-end detection model and its structural design does not suit X-ray images with many overlapping objects, it has no particular advantage in accuracy or operational efficiency. A hybrid model more suitable for X-ray images should be designed.
FIGURE 34. (a) Detailed identification results of YOLOv5 versus YOLOv7 for each category on the SIXrayp dataset; ''@.5 (v5)'' and ''@.5 (v7)'' denote the mAP50 values under YOLOv5 and YOLOv7, respectively, and so on. (b) Detailed identification results of the YOLOv5 and YOLOv7 models for each category on the PIDrayp dataset. (c) Comparison of the mAP results of the four models YOLOv5, YOLOv7, DINO, and Next-ViT.
TABLE 15. Basic detection results of YOLOv5 on the SIXrayp dataset.
TABLE 16. Basic detection results of YOLOv7 on the PIDrayp dataset.
TABLE 17. Basic detection results of YOLOv7 on the SIXrayp dataset.

VI. CONCLUSION
This paper reviews the more popular deep learning object detection algorithms of recent years and summarises the application of deep learning to the field of X-ray baggage dangerous goods detection. While many models have temporarily solved some of the problems in this area, major limitations remain:
1) The pseudo-color pictures formed by dual-energy X-ray still do not work well with modern detection models, which must be modified in depth to obtain more reasonable results.
2) The timeliness of the algorithm is a factor that must be considered at the moment.
3) In reality, if prohibited goods are in luggage, they will inevitably be wrapped in layers. The resulting X-ray images can be extreme, with objects stacked on top of each other over a large area. The identification accuracy of existing models may not be sufficient.
4) The current X-ray baggage image datasets are still small and of low quality, which affects the training of deep learning models.
In response to the above challenges, we offer the following suggestions:
1) Use image translation or style transfer techniques to generate corresponding natural light images from X-ray images, expanding the X-ray baggage dataset.
2) Use image pairs formed by high- and low-energy rays, combined with images in natural light, to enrich the color of X-ray images and bring them closer to natural light images.
3) Reduce the cost of 3D CT scan recognition technology by converting 2D algorithms to 3D algorithms to recognize stacked layers that are difficult to recognize in the 2D case.
4) Perform image feature extraction and synthesis with a diffusion model, which is more advanced than a GAN, to generate high-quality X-ray images containing prohibited items.
5) Although most prohibited items are masked, they do not change their original shape excessively when exposed to X-ray, so they can still be identified using contour information through a rational algorithm design. One of the future directions in X-ray dangerous goods detection is using hybrid algorithms that combine texture features and contour information of prohibited items for identification.
6) In order to make fair comparisons, evaluation criteria must be established on public datasets.

REFERENCES
[22] K. He, G. Gkioxari, P. Dollár, and R. Girshick, ''Mask R-CNN,'' in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2961–2969.
[23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ''You only look once: Unified, real-time object detection,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[24] J. Redmon and A. Farhadi, ''YOLO9000: Better, faster, stronger,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 7263–7271.
[1] A. Schwaninger, A. Bolfing, T. Halbherr, S. Helman, A. Belyavin, and [25] J. Redmon and A. Farhadi, ‘‘YOLOv3: An incremental improvement,’’
L. Hay, ‘‘The impact of image based factors and training on threat detec- 2018, arXiv:1804.02767.
tion performance in X-ray screening,’’ Tech. Rep., 2008. [26] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, ‘‘YOLOv4: Optimal
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification speed and accuracy of object detection,’’ 2020, arXiv:2004.10934.
with deep convolutional neural networks,’’ in Proc. Adv. Neural Inf. [27] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, ‘‘YOLOv7: Trainable
Process. Syst., vol. 25, 2012, pp. 1–9. bag-of-freebies sets new state-of-the-art for real-time object detectors,’’
[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, 2022, arXiv:2207.02696.
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, [28] C.-Y. Wang, I.-H. Yeh, and H.-Y. M. Liao, ‘‘You only learn one represen-
J. Uszkoreit, and N. Houlsby, ‘‘An image is worth 16 × 16 words: tation: Unified network for multiple tasks,’’ 2021, arXiv:2105.04206.
Transformers for image recognition at scale,’’ 2020, arXiv:2010.11929. [29] X. Long, K. Deng, G. Wang, Y. Zhang, Q. Dang, Y. Gao, H. Shen, J. Ren,
[4] P. Viola and M. Jones, ‘‘Rapid object detection using a boosted cascade of S. Han, E. Ding, and S. Wen, ‘‘PP-YOLO: An effective and efficient
simple features,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern implementation of object detector,’’ 2020, arXiv:2007.12099.
Recognit. (CVPR), vol. 1, Dec. 2001. p. I.
[30] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, ‘‘YOLOX: Exceeding YOLO
[5] N. Park and S. Kim, ‘‘How do vision transformers work?’’ 2022, series in 2021,’’ 2021, arXiv:2107.08430.
arXiv:2202.06709.
[31] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
[6] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for
S. Zagoruyko, ‘‘End-to-end object detection with transformers,’’ in Com-
large-scale image recognition,’’ 2014, arXiv:1409.1556.
puter Vision—ECCV 2020. Glasgow, U.K., Aug. 2020, pp. 213–229.
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
[32] T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. Hinton, ‘‘Pix2seq:
V. Vanhoucke, and A. Rabinovich, ‘‘Going deeper with convolutions,’’
A language modeling framework for object detection,’’ 2021,
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015,
arXiv:2109.10852.
pp. 1–9.
[33] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, ‘‘Deformable DETR:
[8] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for
Deformable transformers for end-to-end object detection,’’ in Proc. Int.
image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
Conf. Learn. Represent., 2021, pp. 1–16.
(CVPR), Jun. 2016, pp. 770–778.
[34] M. Zheng, P. Gao, R. Zhang, K. Li, X. Wang, H. Li, and H. Dong, ‘‘End-
[9] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
to-end object detection with adaptive clustering transformer,’’ 2020,
M. Andreetto, and H. Adam, ‘‘MobileNets: Efficient convolutional neural
arXiv:2011.09315.
networks for mobile vision applications,’’ 2017, arXiv:1704.04861.
[35] B. Roh, J. Shin, W. Shin, and S. Kim, ‘‘Sparse DETR: Efficient end-to-
[10] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen,
end object detection with learnable sparsity,’’ 2021, arXiv:2111.14330.
‘‘MobileNetV2: Inverted residuals and linear bottlenecks,’’ in
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, [36] P. Gao, M. Zheng, X. Wang, J. Dai, and H. Li, ‘‘Fast convergence of
pp. 4510–4520. DETR with spatially modulated co-attention,’’ in Proc. IEEE/CVF Int.
Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3621–3630.
[11] A. Howard, M. Sandler, B. Chen, W. Wang, L.-C. Chen, M. Tan, G. Chu,
V. Vasudevan, Y. Zhu, R. Pang, H. Adam, and Q. Le, ‘‘Searching for [37] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang,
MobileNetV3,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), ‘‘Conditional DETR for fast training convergence,’’ in Proc. IEEE/CVF
Oct. 2019, pp. 1314–1324. Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3651–3660.
[12] X. Zhang, X. Zhou, M. Lin, and J. Sun, ‘‘ShuffleNet: An extremely [38] Y. Wang, X. Zhang, T. Yang, and J. Sun, ‘‘Anchor DETR: Query design
efficient convolutional neural network for mobile devices,’’ in for transformer-based detector,’’ in Proc. AAAI Conf. Artif. Intell., 2022,
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, vol. 36, no. 3, pp. 2567–2575.
pp. 6848–6856. [39] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, ‘‘DN-DETR:
[13] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, ‘‘ShuffleNet V2: Practical Accelerate DETR training by introducing query DeNoising,’’ in Proc.
guidelines for efficient CNN architecture design,’’ in Proc. Eur. Conf. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022,
Comput. Vis. (ECCV), 2018, pp. 116–131. pp. 13619–13627.
[14] J. Hu, L. Shen, and G. Sun, ‘‘Squeeze-and-excitation networks,’’ in [40] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang,
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, ‘‘DAB-DETR: Dynamic anchor boxes are better queries for DETR,’’
pp. 7132–7141. 2022, arXiv:2201.12329.
[15] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, ‘‘GhostNet: More [41] Y. Fang, B. Liao, X. Wang, J. Fang, J. Qi, R. Wu, J. Niu, and W. Liu,
features from cheap operations,’’ in Proc. IEEE/CVF Conf. Comput. Vis. ‘‘You only look at one sequence: Rethinking transformer in vision through
Pattern Recognit. (CVPR), Jun. 2020, pp. 1580–1589. object detection,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021,
[16] M. Tan and Q. Le, ‘‘EfficientNet: Rethinking model scaling for con- pp. 26183–26197.
volutional neural networks,’’ in Proc. Int. Conf. Mach. Learn., 2019, [42] Z. Sun, S. Cao, Y. Yang, and K. Kitani, ‘‘Rethinking transformer-based set
pp. 6105–6114. prediction for object detection,’’ in Proc. IEEE/CVF Int. Conf. Comput.
[17] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, ‘‘RepVGG: Vis. (ICCV), Oct. 2021, pp. 3611–3620.
Making VGG-style ConvNets great again,’’ in Proc. IEEE/CVF Conf. [43] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum,
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 13733–13742. ‘‘DINO: DETR with improved denoising anchor boxes for end-to-end
[18] R. Girshick, J. Donahue, T. Darrell, and J. Malik, ‘‘Rich feature hierar- object detection,’’ 2022, arXiv:2203.03605.
chies for accurate object detection and semantic segmentation,’’ in Proc. [44] Z. Yao, J. Ai, B. Li, and C. Zhang, ‘‘Efficient DETR: Improving end-to-
IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587. end object detector with dense prior,’’ 2021, arXiv:2104.01318.
[19] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Spatial pyramid pooling in [45] X. Dai, Y. Chen, J. Yang, P. Zhang, L. Yuan, and L. Zhang, ‘‘Dynamic
deep convolutional networks for visual recognition,’’ IEEE Trans. Pattern DETR: End-to-end object detection with dynamic attention,’’ in Proc.
Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2014. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 2988–2997.
[20] R. Girshick, ‘‘Fast R-CNN,’’ in Proc. IEEE Int. Conf. Comput. Vis. [46] Z. Dai, B. Cai, Y. Lin, and J. Chen, ‘‘UP-DETR: Unsupervised pre-
(ICCV), Dec. 2015, pp. 1440–1448. training for object detection with transformers,’’ in Proc. IEEE/CVF Conf.
[21] S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster R-CNN: Towards real-time Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 1601–1610.
object detection with region proposal networks,’’ IEEE Trans. Pattern [47] H. Bao, L. Dong, S. Piao, and F. Wei, ‘‘BEiT: BERT pre-training of image
Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017. transformers,’’ 2021, arXiv:2106.08254.
[48] Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei, ‘‘BEiT v2: [70] Q. Zhang and Y.-B. Yang, ‘‘ResT: An efficient transformer for visual
Masked image modeling with vector-quantized visual tokenizers,’’ 2022, recognition,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021,
arXiv:2208.06366. pp. 15475–15485.
[49] W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, [71] Z. Dai, H. Liu, Q. V. Le, and M. Tan, ‘‘CoAtNet: Marrying convolution
O. K. Mohammed, S. Singhal, S. Som, and F. Wei, ‘‘Image as a foreign and attention for all data sizes,’’ in Proc. Adv. Neural Inf. Process. Syst.,
language: BEiT pretraining for all vision and vision-language tasks,’’ vol. 34, 2021, pp. 3965–3977.
2022, arXiv:2208.10442. [72] J. Li, X. Xia, W. Li, H. Li, X. Wang, X. Xiao, R. Wang, M. Zheng,
[50] W. Wang, Y. Cao, J. Zhang, and D. Tao, ‘‘FP-DETR: Detection trans- and X. Pan, ‘‘Next-ViT: Next generation vision transformer for efficient
former advanced by fully pre-training,’’ in Proc. Int. Conf. Learn. Repre- deployment in realistic industrial scenarios,’’ 2022, arXiv:2207.05501.
sent., 2021, pp. 1–14. [73] A. Hatamizadeh, H. Yin, J. Kautz, and P. Molchanov, ‘‘Global context
[51] J. Yang, C. Li, P. Zhang, X. Dai, B. Xiao, L. Yuan, and J. Gao, ‘‘Focal vision transformers,’’ 2022, arXiv:2206.09959.
self-attention for local-global interactions in vision transformers,’’ 2021, [74] Z. Tu, H. Talebi, H. Zhang, F. Yang, P. Milanfar, A. Bovik, and Y. Li,
arXiv:2107.00641. ‘‘MaxViT: Multi-axis vision transformer,’’ 2022, arXiv:2204.01697.
[52] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and [75] X. Zhang, H. Zeng, S. Guo, and L. Zhang, ‘‘Efficient long-range attention
L. Shao, ‘‘Pyramid vision transformer: A versatile backbone for dense network for image super-resolution,’’ 2022, arXiv:2203.06697.
prediction without convolutions,’’ in Proc. IEEE/CVF Int. Conf. Comput. [76] S. Akcay and T. Breckon, ‘‘Towards automatic threat detection: A survey
Vis. (ICCV), Oct. 2021, pp. 568–578. of advances of deep learning within X-ray security imaging,’’ Pattern
[53] P. Zhang, X. Dai, J. Yang, B. Xiao, L. Yuan, L. Zhang, and J. Gao, ‘‘Multi- Recognit., vol. 122, Feb. 2022, Art. no. 108245.
scale vision longformer: A new vision transformer for high-resolution [77] S. Akcay, M. E. Kundegorski, M. Devereux, and T. P. Breckon, ‘‘Transfer
image encoding,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), learning using convolutional neural networks for object classification
Oct. 2021, pp. 2998–3008. within X-ray baggage security imagery,’’ in Proc. IEEE Int. Conf. Image
[54] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, Process. (ICIP), Sep. 2016, pp. 1057–1061.
‘‘Swin transformer: Hierarchical vision transformer using shifted win- [78] T. W. Rogers, N. Jaccard, and L. D. Griffin, ‘‘A deep learning frame-
dows,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, work for the automated inspection of complex dual-energy X-ray cargo
pp. 10012–10022. imagery,’’ Proc. SPIE, vol. 10187, pp. 106–117, May 2017.
[55] D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, and Q. Sun, ‘‘Feature [79] Z. Zhao, H. Zhang, and J. Yang, ‘‘A GAN-based image generation method
pyramid transformer,’’ in Proc. Eur. Conf. Comput. Vis. Springer, 2020, for X-ray security prohibited items,’’ in Proc. Chin. Conf. Pattern Recog-
pp. 323–339. nit. Comput. Vis. (PRCV). Springer, 2018, pp. 420–430.
[56] Y. Yuan, R. Fu, L. Huang, W. Lin, C. Zhang, X. Chen, and J. Wang, [80] J. Yang, Z. Zhao, H. Zhang, and Y. Shi, ‘‘Data augmentation for X-ray
‘‘HRFormer: High-resolution vision transformer for dense predict,’’ in prohibited item images using generative adversarial networks,’’ IEEE
Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 7281–7293. Access, vol. 7, pp. 28894–28902, 2019.
[57] J. Gu, H. Kwon, D. Wang, W. Ye, M. Li, Y.-H. Chen, L. Lai, V. Chandra, [81] C. Miao, L. Xie, F. Wan, C. Su, H. Liu, J. Jiao, and Q. Ye, ‘‘SIXray:
and D. Z. Pan, ‘‘Multi-scale high-resolution vision transformer for seman- A large-scale security inspection X-ray benchmark for prohibited item
tic segmentation,’’ 2021, arXiv:2111.01236. discovery in overlapping images,’’ in Proc. IEEE/CVF Conf. Comput. Vis.
[58] S. d’Ascoli, H. Touvron, M. L. Leavitt, A. S. Morcos, G. Biroli, and Pattern Recognit. (CVPR), Jun. 2019, pp. 2119–2128.
L. Sagun, ‘‘ConViT: Improving vision transformers with soft convo- [82] T. Morris, T. Chien, and E. Goodman, ‘‘Convolutional neural net-
lutional inductive biases,’’ in Proc. Int. Conf. Mach. Learn., 2021, works for automatic threat detection in security X-ray images,’’ in
pp. 2286–2296. Proc. 17th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2018,
[59] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, ‘‘CvT: pp. 285–292.
Introducing convolutions to vision transformers,’’ in Proc. IEEE/CVF Int. [83] M. Caldwell, M. Ransley, T. W. Rogers, and L. D. Griffin, ‘‘Trans-
Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 22–31. ferring X-ray based automated threat detection between scanners with
[60] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and different energies and resolution,’’ in Proc. SPIE, vol. 10441, 2017,
A. Vaswani, ‘‘Bottleneck transformers for visual recognition,’’ in Proc. pp. 130–139.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, [84] S. Akcay and T. P. Breckon, ‘‘An evaluation of region based object
pp. 16519–16529. detection strategies within X-ray baggage security imagery,’’ in Proc.
[61] Y. Chen, X. Dai, D. Chen, M. Liu, X. Dong, L. Yuan, and Z. Liu, ‘‘Mobile- IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 1337–1341.
former: Bridging MobileNet and transformer,’’ 2021, arXiv:2108.05895. [85] J. B. Sigman, G. P. Spell, K. J. Liang, and L. Carin, ‘‘Background adap-
[62] W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and tive faster R-CNN for semi-supervised convolutional object detection of
S. Yan, ‘‘MetaFormer is actually what you need for vision,’’ in Proc. threats in X-ray images,’’ Proc. SPIE, vol. 11404, pp. 12–21, May 2020.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, [86] M. Subramani, K. Rajaduari, S. D. Choudhury, A. Topkar, and
pp. 10819–10829. V. Ponnusamy, ‘‘Evaluating one stage detector architecture of convolu-
[63] B. Wu, C. Xu, X. Dai, A. Wan, P. Zhang, Z. Yan, M. Tomizuka, tional neural network for threat object detection using X-ray baggage
J. Gonzalez, K. Keutzer, and P. Vajda, ‘‘Visual transformers: Token- security imaging,’’ Revue d’Intell. Artificielle, vol. 34, no. 4, pp. 495–500,
based image representation and processing for computer vision,’’ 2020, Sep. 2020.
arXiv:2006.03677. [87] Z. Liu, J. Li, Y. Shu, and D. Zhang, ‘‘Detection and recognition of security
[64] T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. Girshick, detection object based on YOLO9000,’’ in Proc. 5th Int. Conf. Syst.
‘‘Early convolutions help transformers see better,’’ in Proc. Adv. Neural Informat. (ICSAI), Nov. 2018, pp. 278–282.
Inf. Process. Syst., vol. 34, 2021, pp. 30392–30400. [88] T. Hassan, S. H. Khan, S. Akcay, M. Bennamoun, and N. Werghi, ‘‘Deep
[65] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani, cmst framework for the autonomous recognition of heavily occluded and
‘‘Bottleneck transformers for visual recognition,’’ in Proc. IEEE/CVF cluttered baggage items from multivendor security radiographs,’’ CoRR,
Conf. Comput. Vis. Pattern Recognit. (CVPR). Washington, DC, USA: vol. 14, p. 17, Dec. 2019.
IEEE Computer Society, Jun. 2021, pp. 16514–16524. [89] B. Wang, L. Zhang, L. Wen, X. Liu, and Y. Wu, ‘‘Towards real-world
[66] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, prohibited item detection: A large-scale X-ray benchmark,’’ in Proc.
‘‘Training data-efficient image transformers & distillation through atten- IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 5412–5421.
tion,’’ in Proc. Int. Conf. Mach. Learn., 2021, pp. 10347–10357. [90] B. K. S. Isaac-Medina, N. Bhowmik, C. G. Willcocks, and T. P. Breckon,
[67] K. Yuan, S. Guo, Z. Liu, A. Zhou, F. Yu, and W. Wu, ‘‘Incorporating ‘‘Cross-modal image synthesis within dual-energy X-ray security
convolution designs into visual transformers,’’ in Proc. IEEE/CVF Int. imagery,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work-
Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 579–588. shops (CVPRW), Jun. 2022, pp. 333–341.
[68] Y. Li, K. Zhang, J. Cao, R. Timofte, and L. Van Gool, ‘‘LocalViT: [91] S. Akcay, M. E. Kundegorski, C. G. Willcocks, and T. P. Breckon, ‘‘Using
Bringing locality to vision transformers,’’ 2021, arXiv:2104.05707. deep convolutional neural network architectures for object classification
[69] X. Chu, Z. Tian, B. Zhang, X. Wang, and C. Shen, ‘‘Conditional positional and detection within X-ray baggage security imagery,’’ IEEE Trans. Inf.
encodings for vision transformers,’’ 2021, arXiv:2102.10882. Forensics Security, vol. 13, no. 9, pp. 2203–2215, Sep. 2018.
[92] R. Tao, Y. Wei, X. Jiang, H. Li, H. Qin, J. Wang, Y. Ma, L. Zhang, [114] D. Mery, D. Saavedra, and M. Prasad, ‘‘X-ray baggage inspection with
and X. Liu, ‘‘Towards real-world X-ray security inspection: A high- computer vision: A survey,’’ IEEE Access, vol. 8, pp. 145620–145633,
quality benchmark and lateral inhibition module for prohibited items 2020.
detection,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, [115] D. Mery, Computer Vision for X-Ray Testing, vol. 10. Cham, Switzerland:
pp. 10923–10932. Springer, 2015.
[93] Y. F. A. Gaus, N. Bhowmik, S. Akçay, P. M. Guillén-Garcia, J. W. Barker, [116] M. Caldwell and L. D. Griffin, ‘‘Limits on transfer learning from pho-
and T. P. Breckon, ‘‘Evaluation of a dual convolutional neural network tographic image data to X-ray threat detection,’’ J. X-Ray Sci. Technol.,
architecture for object-wise anomaly detection in cluttered X-ray security vol. 27, no. 6, pp. 1007–1020, Jan. 2020.
imagery,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2019, [117] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and
pp. 1–8. A. Zisserman, ‘‘The PASCAL visual object classes (VOC) challenge,’’
[94] B. Ma, T. Jia, M. Su, X. Jia, D. Chen, and Y. Zhang, ‘‘Automated Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Sep. 2010.
segmentation of prohibited items in X-ray baggage images using dense [118] M. Everingham and J. Winn, ‘‘The PASCAL visual object classes chal-
de-overlap attention snake,’’ IEEE Trans. Multimedia, early access, lenge 2012 (VOC2012) development kit,’’ Pattern Anal. Stat. Model.
May 11, 2022, doi: 10.1109/TMM.2022.3174339.
[95] T. Hassan and N. Werghi, "Trainable structure tensors for autonomous baggage threat detection under extreme occlusion," in Proc. Asian Conf. Comput. Vis., 2020, pp. 1–16.
[96] Ultralytics. (2022). YOLOv5. [Online]. Available: https://github.com/ultralytics/yolov5/releases/tag/v6.1
[97] S. S. A. Zaidi, M. S. Ansari, A. Aslam, N. Kanwal, M. Asghar, and B. Lee, "A survey of modern deep learning based object detection models," Digit. Signal Process., vol. 126, Jun. 2022, Art. no. 103514.
[98] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1492–1500.
[99] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 390–391.
[100] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel, "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness," 2018, arXiv:1811.12231.
[101] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11.
[102] T. Lin, Y. Wang, X. Liu, and X. Qiu, "A survey of transformers," 2021, arXiv:2106.04554.
[103] Y. Liu, Y. Zhang, Y. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. Shi, J. Fan, and Z. He, "A survey of visual transformers," 2021, arXiv:2111.06091.
[104] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, "Deep learning for generic object detection: A survey," Int. J. Comput. Vis., vol. 128, pp. 261–318, Feb. 2020.
[105] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, "Transformers in vision: A survey," ACM Comput. Surv., vol. 54, no. 10, pp. 1–41, Jan. 2022.
[106] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, Z. Yang, Y. Zhang, and D. Tao, "A survey on vision transformer," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 87–110, Jan. 2023.
[107] D. Zhou, Z. Yu, E. Xie, C. Xiao, A. Anandkumar, J. Feng, and J. M. Alvarez, "Understanding the robustness in vision transformers," in Proc. Int. Conf. Mach. Learn., 2022, pp. 27378–27394.
[108] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, "Sharpness-aware minimization for efficiently improving generalization," 2020, arXiv:2010.01412.
[109] X. Chen, C.-J. Hsieh, and B. Gong, "When vision transformers outperform ResNets without pre-training or strong data augmentations," 2021, arXiv:2106.01548.
[110] M. M. Naseer, K. Ranasinghe, S. H. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, "Intriguing properties of vision transformers," in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 23296–23308.
[111] J. Guo, K. Han, H. Wu, Y. Tang, X. Chen, Y. Wang, and C. Xu, "CMT: Convolutional neural networks meet vision transformers," 2021, arXiv:2107.06263.
[112] F. L. Roder, "Explosives detection by dual-energy computed tomography (CT)," Proc. SPIE, vol. 182, pp. 171–178, Oct. 1979.
[113] B. Abidi, Y. Zheng, A. Gribok, and M. Abidi, "Screener evaluation of pseudo-colored single energy X-ray luggage images," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR) Workshops, Sep. 2005, p. 35.
Comput. Learn., Tech. Rep., 2012, pp. 1–45.
[119] M. Everingham and J. Winn, "The PASCAL visual object classes challenge 2007 (VOC2007) development kit," Univ. Leeds, Leeds, U.K., Tech. Rep., 2007.
[120] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[121] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[122] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. Springer, 2014, pp. 740–755.
[123] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, and V. Ferrari, "The open images dataset V4," Int. J. Comput. Vis., vol. 128, no. 7, pp. 1956–1981, 2020.
[124] M. Akbari, A. Banitalebi-Dehkordi, and Y. Zhang, "EBJR: Energy-based joint reasoning for adaptive inference," 2021, arXiv:2110.10343.
[125] NeelBhowmik. (2022). X-Ray Datasets. [Online]. Available: https://github.com/NeelBhowmik/xray
[126] (2022). FSOD EDS. [Online]. Available: https://github.com/DIG-Beihang/XrayDetection#x-ray-fsod
[127] LPAIS. (2022). Xray-Pi. [Online]. Available: https://github.com/LPAIS/Xray-PI
[128] (2022). Pixray. [Online]. Available: https://github.com/Mbwslib/DDoAS
[129] (2022). CLCXray. [Online]. Available: https://github.com/GreysonPhoenix/CLCXray
[130] (2022). HiXray. [Online]. Available: https://github.com/DIG-Beihang/XrayDetection
[131] N. Bhowmik, Y. F. A. Gaus, and T. P. Breckon, "On the impact of using X-ray energy response imagery for object detection via convolutional neural networks," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2021, pp. 1224–1228.
[132] LPAIS. (2022). Pidray. [Online]. Available: https://github.com/bywang2018/security-dataset
[133] M. Naji, A. Anaissi, A. Braytee, and M. Goyal, "Anomaly detection in X-ray security imaging: A tensor-based learning approach," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2021, pp. 1–8.
[134] B. K. S. Isaac-Medina, C. G. Willcocks, and T. P. Breckon, "Multi-view object detection using epipolar constraints within cluttered X-ray security imagery," in Proc. 25th Int. Conf. Pattern Recognit. (ICPR), Jan. 2021, pp. 9889–9896.
[135] LPAIS. (2022). OPIXray. [Online]. Available: https://github.com/OPIXray-author/OPIXray
[136] (2022). SIXray. [Online]. Available: https://github.com/MeioJane/SIXray
[137] (2022). Compass-XP. [Online]. Available: https://zenodo.org/record/2654887/#%5C.YUtGVHVKikA
[138] K. J. Liang, J. B. Sigman, G. P. Spell, D. Strellis, W. Chang, F. Liu, T. Mehta, and L. Carin, "Toward automatic threat recognition for airport X-ray baggage screening with deep convolutional object detection," 2019, arXiv:1912.06329.
[139] LPAIS. (2022). GDXray. [Online]. Available: https://domingomery.ing.puc.cl/material/gdxray/
[140] D. Mery, V. Riffo, U. Zscherpel, G. Mondragón, I. Lillo, I. Zuccar, H. Lobel, and M. Carrasco, "GDXray: The database of X-ray images for nondestructive testing," J. Nondestruct. Eval., vol. 34, no. 4, pp. 1–12, 2015.
[141] Y. Wei, R. Tao, Z. Wu, Y. Ma, L. Zhang, and X. Liu, "Occluded prohibited items detection: An X-ray security inspection benchmark and de-occlusion attention module," in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, pp. 138–146.
[142] M. E. Kundegorski, S. Akçay, M. Devereux, A. Mouton, and T. P. Breckon, "On using feature descriptors as visual words for object detection within X-ray baggage security screening," Tech. Rep., 2016.
[143] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," 2013, arXiv:1312.6229.
[144] L. Zhang, L. Lin, X. Liang, and K. He, "Is faster R-CNN doing well for pedestrian detection?" in Proc. Eur. Conf. Comput. Vis. Springer, 2016, pp. 443–457.
[145] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis. Springer, 2016, pp. 21–37.
[146] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 29, 2016, pp. 1–9.
[147] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[148] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[149] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6154–6162.
[150] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125.
[151] K. Chen, W. Ouyang, C. C. Loy, D. Lin, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, and J. Shi, "Hybrid task cascade for instance segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4974–4983.
[152] PaddlePaddle. (2022). YOLOv6-L (V2.1). [Online]. Available: https://github.com/meituan/YOLOv6/releases/tag/0.2.1
[153] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "Scaled-YOLOv4: Scaling cross stage partial network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 13029–13038.
[154] S. Xu, X. Wang, W. Lv, Q. Chang, C. Cui, K. Deng, G. Wang, Q. Dang, S. Wei, Y. Du, and B. Lai, "PP-YOLOE: An evolved version of YOLO," 2022, arXiv:2203.16250.
[155] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, Feb. 2013.
[156] K. Grauman and T. Darrell, "The pyramid match kernel: Discriminative classification with sets of image features," in Proc. 10th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 1, Oct. 2005, pp. 1458–1465.
[157] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, F. Wei, and B. Guo, "Swin transformer v2: Scaling up capacity and resolution," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 12009–12019.
[158] D. Thuan, "Evolution of YOLO algorithm and YOLOv5: The state-of-the-art object detection algorithm," Tech. Rep., 2021.
[159] G. Chaucer. (2022). YOLOU: United, Study and Easier to Deploy. [Online]. Available: https://github.com/jizhishutong/YOLOU
[160] PaddlePaddle. (2022). YOLOSeries. [Online]. Available: https://github.com/nemonameless/PaddleDetection_YOLOSeries
[161] Iscyy. (2022). YOLOAir: Makes Improvements Easy Again. [Online]. Available: https://github.com/iscyy/yoloair
[162] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, "Distance-IoU loss: Faster and better learning for bounding box regression," in Proc. AAAI Conf. Artif. Intell., 2020, vol. 34, no. 7, pp. 12993–13000.
[163] D. Misra, "Mish: A self regularized non-monotonic activation function," 2019, arXiv:1908.08681.
[164] Y. Fang, W. Wang, B. Xie, Q. Sun, L. Wu, X. Wang, T. Huang, X. Wang, and Y. Cao, "EVA: Exploring the limits of masked visual representation learning at scale," 2022, arXiv:2211.07636.
[165] P. Gao, T. Ma, H. Li, Z. Lin, J. Dai, and Y. Qiao, "ConvMAE: Masked convolution meets masked autoencoders," 2022, arXiv:2205.03892.
[166] W. Wang, J. Dai, Z. Chen, Z. Huang, Z. Li, X. Zhu, X. Hu, T. Lu, L. Lu, H. Li, X. Wang, and Y. Qiao, "InternImage: Exploring large-scale vision foundation models with deformable convolutions," 2022, arXiv:2211.05778.
[167] J.-B. Cordonnier, A. Loukas, and M. Jaggi, "On the relationship between self-attention and convolutional layers," 2019, arXiv:1911.03584.
[168] S. Li, X. Chen, D. He, and C.-J. Hsieh, "Can vision transformers perform convolution?" 2021, arXiv:2111.01353.
[169] B. Yang, L. Wang, D. Wong, L. S. Chao, and Z. Tu, "Convolutional self-attention networks," 2019, arXiv:1904.03107.
[170] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, "Self-supervised learning: Generative or contrastive," IEEE Trans. Knowl. Data Eng., vol. 35, no. 1, pp. 857–876, Jan. 2023.
[171] D. Mery and C. Pieringer, Computer Vision for X-Ray Testing, 2nd ed. Cham, Switzerland: Springer, 2021.
[172] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., vol. 27, 2014, pp. 1–9.

JIAJIE WU received the M.S. degree in computer science from the Shanxi University of Finance and Economics, Taiyuan, China, in 2017. He is currently pursuing the Ph.D. degree in computer science with Hangzhou Dianzi University. He focuses on the field of object detection and X-ray imaging security detection.

XIANGHUA XU received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China, in 2005. He is currently a Professor with the School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou. He has authored or coauthored over 100 peer-reviewed journal and conference papers. His current research interests include computer vision and parallel and distributed computing. He was a recipient of the Best Paper Award at the 2012 IEEE International Symposium on Workload Characterization.

JUNYAN YANG received the B.S. degree in civil engineering from Beijing Forestry University, Beijing, China, in 2021. He is currently pursuing the M.S. degree in computer science with Hangzhou Dianzi University. He focuses on the field of computer vision and X-ray imaging security detection.