
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIE.2020.3026285, IEEE Transactions on Industrial Electronics.

Automatic and robust object detection in X-ray baggage inspection using deep convolutional neural networks

Bangzhong Gu, RongJun Ge, Yang Chen, Senior Member, IEEE, Limin Luo, Senior Member, IEEE, and Gouenou Coatrieux, Senior Member, IEEE

Abstract—For the purpose of ensuring public security, automatic inspection with X-ray scanners has been deployed at the entry points of many public places to detect dangerous objects. However, current surveillance systems cannot function without human supervision and intervention. In this paper, we propose an effective method using deep convolutional neural networks to detect objects during X-ray baggage inspection. First, a large amount of training data is generated by a specific data augmentation technique. Second, a feature enhancement module is used to improve feature extraction capabilities, and focal loss is adopted to address the foreground-background imbalance in the region proposal network. Third, multi-scale fused regions of interest (RoIs) are utilized to obtain more robust proposals. Finally, soft non-maximum suppression (NMS) is adopted to alleviate overlaps in baggage detection. Compared with existing algorithms, the proposed method proves more accurate and robust when dealing with densely cluttered backgrounds during X-ray baggage inspection.

Index Terms—Convolutional neural networks, baggage inspection, baggage detection, X-ray images for security applications.

Manuscript received December 20, 2019; revised February 18, 2020 and June 19, 2020; accepted August 28, 2020. This work was supported in part by the National Key R&D Program of China (2017YFA0104302, 2018YFA0704102, and 2017YFC0109202), in part by the National High Technology Research and Development Program of China (863 Program, No. 2015AA043203), in part by the National Natural Science Foundation (81827805, 61801003, 61871117, and 81530060), and in part by the R&D Projects in Key Technology Areas of Guangdong Province (No. 2018B030333001).

B. Gu, R. Ge, Y. Chen, and L. Luo are with the Laboratory of Image Science and Technology, the School of Computer Science and Technology, Southeast University, Nanjing 210096, China, with the School of Cyberspace Security, Southeast University, Nanjing 210096, China, with the Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, Nanjing 210096, China, and also with the Centre de Recherche en Information Biomedicale Sino Francais (LIA CRIBs), Rennes 35000, France (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). (Corresponding author: Yang Chen.)

G. Coatrieux is with the Institut Mines-Telecom, Telecom Bretagne; INSERM U1101 LaTIM, 29238 Brest, France (e-mail: [email protected]).

I. INTRODUCTION

Baggage inspection with security screening is a powerful tool to reduce the risk of potential terrorism and crime [1]. Security screening using X-ray scanners is widely used in public places [2]. These scans are visually inspected by a specifically trained human inspector to ensure there are no dangers. It is extremely tedious to perform this task manually, since the baggage might actually be dangerous [3]. During rush hour, an inspector has only a few seconds to determine whether a piece of baggage contains any dangers or not [4]. Since each employee has to check a large amount of baggage, the possibility of human error over a long time is considerable, even with specialized training [5]. Automated X-ray analysis therefore remains a crucial issue in baggage inspection.

X-ray imaging is quite different from natural optical imaging in several aspects. The main difference is that the X-ray image is formed by irradiating the object with X-rays, while the natural optical image is formed by light reflection, which gives information about the surface of the objects [6], [7]. Thus, an X-ray image consists of shadows from overlapping transparent layers. The transparency of the image is determined by the material density along the X-ray path. The visibility of objects in X-ray images depends on the object's density: high-density objects (e.g. thick metal) are substantially opaque and occlude all the other overlapping objects, while very low-density objects (e.g. clothes) are barely visible [8]. Fig. 1 shows some examples of baggage imaged with X-ray scanners. If part of the image is too dark to be visible, the human inspector needs to open the baggage and check it manually.

Fig. 1. Samples of X-ray baggage in inspection.

Objects in X-ray baggage usually undergo in-plane and out-of-plane rotation, which makes X-ray baggage inspection a quite difficult task. Despite these differences, object recognition with both imaging techniques suffers from many similar issues, such as perspective projection, geometric distortion, pose problems, self-occlusions and large intra-class variability [9]. Observing these common problems, algorithms based on computer vision technology for optical imaging can also be used for X-ray baggage inspection.


In recent years, convolutional neural networks (CNNs) have been widely used in image analysis and interpretation. Methods based on deep learning have achieved state-of-the-art detection performance in many computer vision tasks [10]–[12], such as face recognition and automatic driving. However, few efforts have been dedicated to investigating object detection in X-ray baggage inspection, due to many limitations. Owing to the lack of training data, most of the existing methods fine-tune pre-trained networks [13] to achieve good performance. But this is not feasible in X-ray baggage inspection: direct adoption of a pre-trained network has little flexibility to adjust the structure, and there might be bias in the learning process. A good solution to tackle these critical issues is to train the models from scratch. However, due to the numerous parameters and inefficient training strategies with the limited training data, previous approaches are difficult to converge [14].

To address these issues, in this paper we propose an effective approach for object detection in X-ray baggage inspection. Compared with other detection methods designed for object surfaces, such as the Faster Region-based Convolutional Neural Network (Faster R-CNN) [15] and the Feature Pyramid Network (FPN) [16], our method has great advantages for object detection in X-ray baggage inspection, which concerns the interior characteristics of objects. The main contributions of the proposed method are as follows. First, a specific data augmentation pipeline is designed to accommodate the varied data. Second, an effective feature enhancement module is added to improve feature extraction capabilities, and focal loss is adopted to address the foreground-background imbalance. Third, multi-scale fused RoIs are adopted to obtain more accurate region proposals. Finally, soft NMS is used to reduce errors when detecting adjacent objects. Two new datasets are built for X-ray baggage object detection. To evaluate the method, a list of representative CNN-based methods is investigated on the task of object detection during X-ray baggage inspection. The results are reported as a useful performance baseline, and the proposed method outperforms the existing ones.

The rest of this paper is organized as follows. Related work is explored in Section II. The proposed method is presented in detail in Section III. All methods evaluated in this work are reported in Section IV. Finally, a conclusion is given in Section V.

II. RELATED WORK

In this section, we briefly introduce traditional object detection methods in X-ray baggage inspection and CNN-based models for object detection.

A. Traditional Object Detection in X-ray Baggage Inspection

Some approaches attempted to perform object detection in X-ray baggage images from a single view of a single energy. The adapted implicit shape model (AISM) based on visual codebooks was proposed in [17]. This method used visual vocabulary and appearance structures generated from a training dataset that includes representative X-ray images of the target object. Domingo Mery and colleagues [18] used adaptive sparse representations [19], [20] to automatically detect objects under less restrictive conditions, including some contrast, pose, intra-class variability and focal distance. The work presented in [21] considered a bag of visual words (BoVW) model with several hand-crafted feature representations and achieved an average precision of 57%. Thorsten Franzel and colleagues [22] studied the applicability and efficiency of sparse local features for X-ray baggage object detection. This work investigated how the material information given by multi-view X-ray imaging affects detection performance. As can be seen, these methods are mostly based on hand-crafted features. However, the advances in automated baggage inspection are minimal and very limited compared to what is required for X-ray inspection systems that rely less on human inspectors.

B. Convolutional Neural Networks for Object Detection

Deep convolutional neural networks have made huge steps in object detection in recent years. State-of-the-art deep CNN-based object detection methods can be divided into two groups: two-stage methods and single-stage methods. 1) Two-stage methods, such as R-CNN [23], Fast R-CNN [24], Faster R-CNN [15], R-FCN [25] and FPN [16], achieve detection in two steps. The first step generates a set of candidate region proposals and the second step classifies them into the target object categories. To date, two-stage methods have achieved the highest accuracy among object detection methods. 2) Single-stage methods, such as YOLO [26]–[28] and SSD [29], use a single feed-forward convolutional network to directly predict classes and bounding boxes. Although these methods have been tuned for speed, their accuracy is lower than that of two-stage methods.

Object detection during X-ray baggage inspection is a more challenging task than in natural optical images. To the best of our knowledge, most of the previous work used networks that were pre-trained on the ImageNet classification dataset. The study presented in [30] compared a BoVW approach with a CNN approach, exploring the use of transfer learning by fine-tuning the weights of different layers, where the layers were transferred from a network trained on a different task. Experiments show that CNN-based methods outperform BoVW methods. Samet Akcay et al. [31] explored several frameworks for X-ray baggage image classification and detection. Their results showed that CNN-based methods outperform hand-crafted methods.

III. PROPOSED METHOD

Our method is inspired by the design principles of two-stage methods. Thus it inherits the accuracy advantages of region proposal based methods. Fig. 2 illustrates the architecture of the proposed method, which can mainly be divided into two parts: the X-ray proposal network (XPN) and the X-ray discriminative network (XDN). XPN takes an image as input and outputs predicted region boxes. XDN is added after XPN; it takes the coarse region boxes as input and outputs the refined category and position simultaneously.


Fig. 2. The architecture of the proposed method. Part A represents the XPN. Part E represents the XDN.

In XPN, data augmentation is used to accommodate the diversity of the input image, and the following feature enhancement module is utilized to make information easier to propagate. In XDN, the fused RoI layer allows each proposal to access information from all levels, and then the bounding box regression and class prediction are processed. After XDN, soft NMS is used to alleviate object overlaps in the model.

A. Data Augmentation

Data augmentation plays an important role in increasing the network robustness against normal changes that might appear in X-ray images, such as density changes or changes in object orientation. Additionally, it can be used to achieve better generalization and to simulate different X-ray object conditions, thus overcoming one of the main weaknesses of CNNs: their heavy reliance on previous training data. X-ray images are quite different from natural images since they undergo severe geometric transformations and are densely cluttered; this makes it difficult to cover most situations in the training set. In this paper, we use an online augmentation approach that provides a virtually infinite dataset and does not require extra storage space on disk. Many applications use basic geometric transformations for data augmentation, such as mirroring and flipping. In order to change the position of the objects, affine transformations are performed. Besides the basic data augmentation, we design an effective pipeline for X-ray image inspection to handle the problem of densely cluttered objects in X-ray images. Fig. 3(a) shows the details of the specific technique. We select two random images A and B from the database; image A belongs to the data that contain target objects and image B belongs to the data that contain no target objects. We cut the part containing the target object from image A, namely patch A. Then we apply basic data augmentation to patch A. Finally, we combine image B with patch A to build the augmented data used for training. We define the combined operation as:

C = inv(λ × inv(op(A)) + (1 − λ) × inv(B))   (1)

where C represents the composited image, λ represents the combination ratio, which is sampled from the uniform distribution (0, 1), inv(·) represents the complement of the image, and op(·) represents the basic augmentation operation. In this method, we use the following basic data augmentations: affine transformations, mirroring and flipping, cropping, and perspective transformations. Fig. 3(b) shows an example of our data augmentation. A gun patch is cut from image A. Then we apply some basic augmentation to the gun patch. Finally, we paste the gun patch on another image B for training. This technique can make full use of the data that contain no target object.

Fig. 3. Description of the proposed data augmentation technology.
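A minimal NumPy sketch of the combination in (1), applied locally at the paste location on a target-free image B. The 8-bit grayscale inputs, the explicit patch position, and restricting the blend to the patch region are our assumptions; op(·), the basic augmentation of patch A, is assumed to have been applied beforehand.

```python
import numpy as np

def invert(img):
    # Complement of an 8-bit X-ray image: inv(x) = 255 - x.
    return 255.0 - img

def compose(patch_a, image_b, y, x, rng=np.random.default_rng()):
    """Paste an augmented target patch A onto a target-free image B, Eq. (1):
    C = inv(lam * inv(op(A)) + (1 - lam) * inv(B))."""
    lam = rng.uniform(0.0, 1.0)                     # combination ratio from U(0, 1)
    composite = image_b.astype(np.float32).copy()
    h, w = patch_a.shape
    region = composite[y:y + h, x:x + w]            # assumed: blend only where the patch lands
    blended = invert(lam * invert(patch_a.astype(np.float32))
                     + (1.0 - lam) * invert(region))
    composite[y:y + h, x:x + w] = blended
    return np.clip(composite, 0, 255).astype(np.uint8)
```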


B. Feature Enhancement Module

The feature enhancement module is utilized to enhance the information flow, which is not trivial for traditional CNN-based detectors, especially for densely cluttered backgrounds and small objects. It has been found in [32] that low layers contain less semantic information compared with high layers, but they have a higher localization accuracy. Since object detection requires both accurate positions and precise categories, multi-layer fusion is required. Several recent models for object detection utilize different layers in a network. Some of them fuse the high layers with low layers. Intuitively, low layers can provide more detailed information about X-ray objects with cluttered backgrounds. In contrast, high layers can capture more global context information. Although these methods have achieved excellent performance on some datasets, X-ray object detection is still a challenge for them, given its characteristics, including the existence of disruptors and densely cluttered backgrounds.

Part B in Fig. 2 is the proposed feature enhancement module. Part B consists of part C and part D. Part C includes the first bottom-up path (indicated as {C1, C2, C3, C4}), the second top-down path (indicated as {M4, M3, M2, M1}) and the last bottom-up path (indicated as {N1, N2, N3, N4}). The bottom-up path {C1, C2, C3, C4} and the top-down path {M4, M3, M2, M1} form a feature pyramid network with a ResNet-50 backbone. The last bottom-up path {N1, N2, N3, N4} is a combination of the first bottom-up path {C1, C2, C3, C4} and the second top-down path {M4, M3, M2, M1}. The layers in the last bottom-up path {N1, N2, N3, N4} are obtained by lateral connections, enhancing the spread of semantic information. The last bottom-up path starts from the low-level feature map N1 and gradually reaches N4. The spatial down-sampling factor is 2 and the channel number is 256, consistent with the first bottom-up path {C1, C2, C3, C4}. The N1 layer comes from the M1 layer through a 1×1 convolution. The N_{i+1} layer is defined as:

N_{i+1} = N_i ⊕ (C_{i+1} ⊕ M_{i+1})   (2)

where ⊕ is a concatenation operation. A 1×1 convolution operation is applied to the previous layers (C_{i+1} and M_{i+1}). Together with the down-sampled layer of N_i, the three components are combined to produce the fused layer N_{i+1}. With this design, the current layers can take full advantage of prior information to extract more discriminative representations. Fig. 4 shows the structure of the last bottom-up path {N1, N2, N3, N4}.

Fig. 4. The detail of the last bottom-up path, which brings more combinations for the feature maps.
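A PyTorch sketch of one fusion step in the last bottom-up path under our reading of (2): N_i is spatially down-sampled by 2, C_{i+1} and M_{i+1} pass through 1×1 lateral convolutions, and the three 256-channel maps are concatenated and reduced back to 256 channels. The stride-2 convolution used for down-sampling and the final 1×1 reduction are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class BottomUpFusion(nn.Module):
    """One step of the last bottom-up path: N_{i+1} = fuse(N_i, C_{i+1}, M_{i+1})."""
    def __init__(self, channels=256):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # assumed: spatial /2 for N_i
        self.lat_c = nn.Conv2d(channels, channels, 1)   # 1x1 lateral conv on C_{i+1}
        self.lat_m = nn.Conv2d(channels, channels, 1)   # 1x1 lateral conv on M_{i+1}
        # assumed: reduce the concatenated maps back to the 256-channel width of the path
        self.reduce = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, n_i, c_next, m_next):
        fused = torch.cat([self.down(n_i), self.lat_c(c_next), self.lat_m(m_next)], dim=1)
        return self.reduce(fused)
```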
Dilated convolution layers have achieved progressive improvement in semantic segmentation, where they provide context information [33]–[35]. In this work, dilated convolution layers are utilized to enhance the feature maps for region proposals, thus making them more discriminable and robust. Fig. 5 shows the details of our context enhancement module (CEM). It takes the feature maps N as input and outputs the feature maps P. The module contains one convolution layer and several dilated convolution layers with different dilation rates. We then concatenate the output feature maps of the different convolution layers. More details are shown in part D of Fig. 2. We use two CEMs in this work, named CEM_s and CEM_l. CEM_s has one 1×1 convolution layer and two 3×3 dilated convolution layers with dilation rates of 1 and 2. CEM_l has one more dilated convolution layer than CEM_s, with a dilation rate of 5. CEM_s is used for the feature maps with small spatial resolution, while CEM_l is used for the feature maps with large spatial resolution. The output layers, indicated as {P1, P2, P3, P4}, have the same spatial resolution and channel size as the last bottom-up path {N1, N2, N3, N4}. This module effectively captures multi-scale information, especially for objects that are small or contained in a densely cluttered background.

Fig. 5. The structure of the context enhancement module.
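A sketch of the larger context enhancement module (CEM_l) as described above: a 1×1 convolution and 3×3 dilated convolutions with rates 1, 2 and 5 whose outputs are concatenated. The final 1×1 projection that restores the 256-channel width of P_i is our assumption.

```python
import torch
import torch.nn as nn

class ContextEnhancementModule(nn.Module):
    """CEM_l variant: 1x1 conv plus 3x3 dilated convs (rates 1, 2, 5), outputs concatenated."""
    def __init__(self, channels=256, dilations=(1, 2, 5)):
        super().__init__()
        self.branches = nn.ModuleList([nn.Conv2d(channels, channels, 1)])
        for d in dilations:
            self.branches.append(nn.Conv2d(channels, channels, 3, padding=d, dilation=d))
        # assumed: project the concatenation so P_i keeps the same channel size as N_i
        self.project = nn.Conv2d(channels * len(self.branches), channels, 1)

    def forward(self, n_i):
        return self.project(torch.cat([b(n_i) for b in self.branches], dim=1))

# CEM_s would simply drop the rate-5 branch: ContextEnhancementModule(dilations=(1, 2))
```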
C. Focal Loss in Region Proposal Network

Focal loss [36] is utilized to avoid the class imbalance problem by down-weighting the losses of the vast number of easy samples during the training of XPN. The basic region proposal network generates a large number of regions, including more negative regions than positive ones. To compensate for this imbalance, sampling strategies including random sampling and hard negative mining are adopted in most models. Only a fixed number of anchors with a fixed ratio are sampled. However, the resulting sampled positives cannot fully represent the objects. In this paper, we use focal loss to take all regions into account for training.

With the traditional definitions, the training regions of the nth proposal layer are defined as S^n = {(p_i^*, b_i^*)}, where p_i^* and b_i^* are the corresponding label and ground-truth coordinates, respectively. Similar to most CNN-based detectors, the loss of the ith sample in the nth detection layer is a combination of classification and bounding box regression, which is defined as follows:

L_n(p_i, b_i | W) = L_cls(p_i, p_i^*) + λ L_reg(b_i, b_i^*)   (3)

where W represents the parameters of the region proposal network, p_i is the probability distribution over the background and foreground object calculated by a softmax layer, λ is the balancing parameter, and b_i stands for the regressed bounding box. For regions that are positively labeled, the bounding box b_i^* is regressed from the corresponding region box b_i. The regression loss is a smooth L1 loss, defined as:


L_reg(b_i, b_i^*) = (1/4) Σ_{j∈{x,y,w,h}} f_L1(b_i^j − b_i^{*j})   (4)

where

f_L1(x) = { 0.5x²,       if |x| < 1
            |x| − 0.5,   otherwise }   (5)
Most CNN-based detectors adopt the cross-entropy (CE) loss as the softmax loss function. This loss is only effective when training on mini-batch samples in which the sampled positive and negative anchors have a ratio of up to 1:1 or 1:2. In order to naturally handle the foreground-background imbalance, we adopt a focal loss function that allows efficient training on all anchors without any sampling strategy. The focal loss is defined as follows:

L_cls(p_i, p_i^*) = β Σ_{i∈S_+^m} [−(1 − p_i)² log p_i] + (1 − β) Σ_{i∈S_−^m} [−p_i² log(1 − p_i)]   (6)

where p_i is the probability confidence of the object and 1 − p_i is the probability confidence of the background, S_+^m and S_−^m represent the positive and negative anchors, respectively, and β is the balancing parameter that avoids domination of the training loss by the negative anchors. Compared with the CE loss, the focal loss has two advantages. First, the loss is similar to the CE loss for misclassified samples. For example, when a positive sample is misclassified and p_i is small, the modulating factor (1 − p_i)² is near 1, while in the case of misclassifying a negative sample with a large p_i, the modulating factor p_i² is near 1. This leaves the loss unaffected. Second, the loss is smoothly down-weighted for well-classified samples. For example, when a positive sample is well classified with a large p_i, the modulating factor (1 − p_i)² goes to 0, while a well-classified negative sample with a small p_i leads to a modulating factor p_i² that is near 0. Thus, the focal loss prevents the large number of easy negatives from dominating the loss during training.
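A PyTorch sketch of the classification loss in (6). It assumes p holds the foreground probability of every anchor and labels marks positives as 1 and negatives as 0 (anchors labeled −1 are ignored, an assumption about the labeling convention); β = 0.25 is the value reported in Section IV.

```python
import torch

def xpn_focal_loss(p, labels, beta=0.25, eps=1e-6):
    """Eq. (6): beta * sum_pos -(1 - p)^2 log p + (1 - beta) * sum_neg -p^2 log(1 - p)."""
    p = p.clamp(eps, 1.0 - eps)
    pos = labels == 1
    neg = labels == 0                      # anchors labeled -1 contribute nothing (assumption)
    pos_loss = -((1.0 - p[pos]) ** 2) * torch.log(p[pos])
    neg_loss = -(p[neg] ** 2) * torch.log(1.0 - p[neg])
    return beta * pos_loss.sum() + (1.0 - beta) * neg_loss.sum()
```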
D. Multi-scale RoI Fusion

Multi-scale RoI fusion aims at combining different levels of RoIs for each proposal, which makes the proposals stronger. Since each feature map contains some specific information, the classification and regression of the bounding boxes are usually operated individually on different feature levels. However, a small region proposal can obtain more semantic information from the higher layers, which is helpful for classification, and a large region proposal can get better details from the lower layers to facilitate its localization ability. Thus, in our work, RoIs are taken from all levels of the layers {P1, P2, P3, P4} for each proposal; they are shown as dark blue regions in part D of Fig. 2. After they are sent to the RoI align pooling layer, we obtain all the feature grids with the same size. Then an element-wise max fusion operation is utilized to fuse the feature grids from all levels. The fused feature grid is used for the following prediction. Fig. 6 shows the details of the multi-scale RoI fusion module. Following Mask R-CNN [37], RoI align pooling is utilized to pool feature grids from each level. The original RoI pooling uses quantization to transfer the coordinates of the object in an image to its coordinates on the feature map. RoI align pooling uses bilinear interpolation to compute the exact values of the input features, which makes the results more accurate. By using the bilinear interpolation operation, the RoI align pooling layer can take full advantage of the information learned from previous layers. In contrast, other interpolation operations may cause a loss of detail or semantic information. Nearest-neighbor interpolation computes the sampling points by using the nearest neighboring point; this operation may lead to a loss of detail information in the pooled feature grids. Bicubic interpolation computes the sampling points by using the nearest 16 points, some of which come from outside the RoI; this operation may cause a loss of semantic information in the pooled feature grids.

Fig. 6. The details of multi-scale RoI fusion.
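A sketch of the fusion step in Fig. 6 using torchvision's RoIAlign: the same proposals are pooled from every level P1–P4 and merged by element-wise max. The 7×7 output size and the per-level spatial scales are assumptions.

```python
import torch
from torchvision.ops import roi_align

def fused_roi_features(feature_maps, spatial_scales, boxes, output_size=7):
    """Pool each proposal from every level {P1..P4} and fuse by element-wise max.

    feature_maps:   list of tensors [N, C, H_l, W_l], one per level
    spatial_scales: list of floats mapping image coordinates to each level
    boxes:          list (length N) of [K_i, 4] tensors in image coordinates
    """
    fused = None
    for fmap, scale in zip(feature_maps, spatial_scales):
        pooled = roi_align(fmap, boxes, output_size=output_size,
                           spatial_scale=scale, aligned=True)   # bilinear sampling (RoI align)
        fused = pooled if fused is None else torch.maximum(fused, pooled)
    return fused  # [sum(K_i), C, output_size, output_size]
```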
E. Soft Non-Maximum Suppression

NMS has been a crucial part of many detection algorithms, in which it is used to obtain the final set of detections by significantly reducing the number of false positives. In non-dense scenes, greedy NMS is applied to object scores and can resolve overlapping detections. In densely cluttered images, however, multiple overlapping bounding boxes often reflect multiple, tightly packed objects, among which many receive high object scores. NMS does not adequately discriminate between overlapping detections and suppresses partial detections.

To address these problems, we adopt soft NMS [38] to alleviate the mistakes in detecting adjacent objects. Let B = {b1, b2, ..., bN} be a list of candidate bounding boxes and S = {s1, s2, ..., sN} be a list of corresponding detection scores. The detection box with the maximum score is denoted as b_m and the threshold is T. iou(b_m, b_i) denotes the intersection over union between b_m and b_i. The choice criterion in NMS can be written as follows:

s_i = { s_i,  if iou(b_m, b_i) < T
        0,    if iou(b_m, b_i) ≥ T }   (7)


Hence, NMS sets a hard threshold when deciding what should be kept or removed. Instead of simply removing the neighboring detections with an overlap greater than the threshold T, soft NMS assigns them a graded penalty, imposing a higher penalty on bounding boxes with a higher overlap. Thus, the following choice criterion is proposed instead of (7):

s_i = s_i · exp(−iou(b_m, b_i)² / δ),  i = 1, 2, ..., N   (8)

where δ is the variance.
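A plain-NumPy sketch of the Gaussian re-scoring in (8) with greedy selection; δ = 1 follows the setting in Section IV, while the final score threshold used to drop fully decayed boxes is our assumption.

```python
import numpy as np

def iou(box, boxes):
    # box: [x1, y1, x2, y2]; boxes: [M, 4]
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter + 1e-9)

def soft_nms(boxes, scores, delta=1.0, score_thresh=0.001):
    """Gaussian soft NMS (Eq. 8): s_i <- s_i * exp(-iou(b_m, b_i)^2 / delta)."""
    boxes, scores = boxes.astype(float).copy(), scores.astype(float).copy()
    keep = []
    idx = np.arange(len(scores))
    while idx.size > 0:
        m = idx[np.argmax(scores[idx])]          # current maximum-score box b_m
        keep.append(m)
        idx = idx[idx != m]
        overlaps = iou(boxes[m], boxes[idx])
        scores[idx] *= np.exp(-(overlaps ** 2) / delta)   # decay neighbours instead of dropping them
        idx = idx[scores[idx] > score_thresh]    # prune boxes whose score has decayed away (assumption)
    return keep
```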
IV. EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we evaluate our proposed method for object detection in X-ray baggage inspection. Experiments were implemented based on the deep learning framework PyTorch [39].

A. Dataset Description and Experimental Setup

1) Dataset Description: To perform this task, we use the two datasets described below:

Xdb1: This dataset is collected from a simulated situation. To build a dataset for multi-class detection, baggage with random target objects is packed and then sent into a fixed X-ray machine. The X-ray tube voltage is 55 kV and the X-ray tube current is 5 mA. Finally, we obtain the simulated X-ray baggage images. In addition to baggage with target objects, we also build a set of images with no target objects. Following these approaches, this dataset consists of 4,127 X-ray baggage images that contain target objects and 2,352 X-ray baggage images without target objects. The target objects include scissors, bottle, metal cup, kitchen knife, knife, battery and umbrella.

Xdb2: This dataset is collected from subway security checks. To build a dataset for multi-class detection, we manually select baggage images that contain target objects. In addition, we randomly select images from a large number of baggage images with no target objects. Following these approaches, this dataset consists of 21,538 X-ray sample images with target objects and 35,254 X-ray sample images with no target objects. The target objects include bottles, metal cup, knives, scissors, gun, battery, laptop, umbrella, lighter and pressure cans.

2) Evaluation Metrics: The performance of our method is measured using the PASCAL evaluation criteria [40]: average precision (AP) and mean average precision (mAP). To calculate mAP, we perform the following: we first sort the detections based on their confidence scores. Next, we calculate the IoU for each detection. Assuming each detection is unique, and denoting the IoU area as a_i, we threshold it by 0.5, giving a logical value l_i, where:

l_i = { 1,  if a_i > 0.5
        0,  otherwise }   (9)

This is followed by a prefix sum giving both the true positives t and the false positives f:

t_i = t_{i−1} + l_i,   f_i = f_{i−1} + 1 − l_i   (10)

The precision p and recall r curves are calculated as:

p_i = t_i / (t_i + f_i),   r_i = t_i / n_p   (11)

where n_p is the number of positive samples. We then calculate the average precision (AP) based on the area under the precision-recall curve:

AP = Σ_{i=1}^{n_p} p_i Δr   (12)

Finally, we find the mAP by averaging the AP values calculated for the N classes:

mAP = (1/N) Σ_{i=1}^{N} AP_i   (13)
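A compact sketch of the AP/mAP computation in (9)-(13) for one class; it assumes each detection has already been matched to at most one ground-truth object and carries an IoU value, which is a simplification of the full PASCAL protocol.

```python
import numpy as np

def average_precision(scores, ious, num_positives, iou_thresh=0.5):
    """AP for one class from per-detection confidence scores and IoU with ground truth."""
    order = np.argsort(-np.asarray(scores))                       # sort detections by confidence
    l = (np.asarray(ious)[order] > iou_thresh).astype(float)      # Eq. (9)
    t = np.cumsum(l)                                              # true-positive prefix sums, Eq. (10)
    f = np.cumsum(1.0 - l)                                        # false-positive prefix sums
    precision = t / (t + f)                                       # Eq. (11)
    recall = t / num_positives
    # Eq. (12): area under the precision-recall curve
    return float(np.sum(precision * np.diff(np.concatenate(([0.0], recall)))))

def mean_average_precision(per_class_aps):
    return float(np.mean(per_class_aps))                          # Eq. (13)
```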
3) Implementation Details and Parameter Optimization: Due to the large size of the X-ray images, training the proposed detector requires a large amount of memory, resulting in a long training time. For fast training, in our implementation we randomly crop small patches around the target object from the input images used for training. This significantly reduces the memory used and enables training with a larger mini-batch size, which speeds up the training.

In our method, the anchor sizes are empirically set according to the size distribution of the target objects. We use 5 scales with box areas of 16², 32², 64², 128² and 256² pixels and 5 aspect ratios of 4:1, 2:1, 1:1, 1:2 and 1:4. Since some target objects are long, such as knives, we added the 4:1 and 1:4 aspect ratios to the anchors with aspect ratios of 2:1, 1:1 and 1:2. The branch weight was empirically set based on the effect of the corresponding detection branch on the gradient. Other parameters are set to default values, including the learning rate (0.02 with a linear warm-up) for the first 500k mini-batches and 0.002 for the next 400k, the weight decay (0.0001), the momentum (0.9) and the tradeoff coefficient. In our training process, the X-ray proposal network and the X-ray discriminative network are jointly trained, and all models are trained using synchronized SGD. The standard Faster R-CNN architecture typically adopts a fixed scale for all the training images. By randomly resizing the images to one of several scales, the detector is able to learn features across a wide range of sizes, thus improving its performance towards scale invariance. In this work, we randomly assign one of three scales (800, 600 and 400) to each image before it is fed into the network. In the XPN stage, anchors whose overlap with the ground truth is greater than 0.6 are treated as positive samples, while anchors whose overlap with the ground truth is less than 0.3 are treated as negative samples. The balancing parameter of the focal loss is set to 0.25, and the soft NMS parameters are as follows: the threshold is 0.5 and δ is 1.


To compare with other methods, we split the datasets into training (80%), validation (10%) and test (10%) sets, such that the splits have similar class distributions but the unseen test set contains somewhat challenging samples never used for training.
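For reference, the hyper-parameters reported above collected into a single configuration sketch; the key names are ours, and anything not stated in the text (e.g. batch size) is omitted.

```python
# Training settings as reported in Section IV-A (key names are ours).
train_config = {
    "anchor_scales":   [16, 32, 64, 128, 256],          # box areas 16^2 ... 256^2 pixels
    "anchor_ratios":   [4/1, 2/1, 1/1, 1/2, 1/4],
    "lr_schedule":     [(500_000, 0.02), (400_000, 0.002)],  # (mini-batches, learning rate), linear warm-up first
    "weight_decay":    1e-4,
    "momentum":        0.9,
    "train_scales":    [800, 600, 400],                 # one random scale per image
    "rpn_pos_iou":     0.6,                              # anchors above this IoU are positives
    "rpn_neg_iou":     0.3,                              # anchors below this IoU are negatives
    "focal_loss_beta": 0.25,
    "soft_nms":        {"threshold": 0.5, "delta": 1.0},
    "split":           {"train": 0.8, "val": 0.1, "test": 0.1},
}
```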

B. Comparison Results on Dataset Xdb1


First, we train and evaluate our model using the Xdb1 training set and testing set, respectively. As shown in Table I, our model achieves promising results, i.e., 0.954 in terms of mAP, which outperforms most of the previous methods tested on this dataset by a large margin. As we can see, Faster R-CNN shows a similar performance to R-FCN. Their region proposals are collected from a single-scale feature map, leading to a loss of semantic and position information. FPN adopts the pyramid structure to deal with multi-scale detection and fuses features from different levels; therefore it gets a higher mAP than methods based on a single-scale feature map. For single-stage methods, SSD and YOLOv3 also generate RoIs from multi-scale feature maps. However, large feature maps in SSD lead to a lack of semantic information, and small feature maps may lead to a lack of position information. Hence, SSD obtains a lower performance than the others. YOLOv3 takes advantage of the ideas in FPN and gets a higher mAP than SSD. Compared to FPN, YOLOv3 falls behind in some categories because it does not have the proposal step. Single-stage methods still get a worse performance than two-stage methods. Table I shows that our method achieves impressive improvements in scissors and knife detection, which proves that our model can be more stable when dealing with densely cluttered backgrounds and small objects in X-ray detection. Fig. 7 shows some examples of the object detection results using the proposed method on the Xdb1 dataset.

Fig. 7. Some qualitative detection results of our proposed method on the Xdb1 test dataset. Only detections with scores higher than 0.5 are shown.

To assess the robustness of our approach, we reduced the size of the training set from 80% to 70% and 60%. Accordingly, the validation sets are all 10%, and the test sets grow from 10% to 20% and 30%. Fig. 8 shows the mAP with different proportions on the dataset Xdb1. As we can see, even with less training data and more test data, the proposed method still performs better than the other representative methods.

TABLE I
AVERAGE PRECISION (AP) AND MEAN AVERAGE PRECISION (mAP) ON THE TESTING SET OF THE XDB1 DATASET

model | mAP | scissors | bottle | metal cup | kitchen knife | knife | battery | umbrella
Faster R-CNN (resnet50) | 0.837 | 0.808 | 0.879 | 0.891 | 0.844 | 0.730 | 0.835 | 0.872
R-FCN (resnet50) | 0.847 | 0.827 | 0.861 | 0.895 | 0.853 | 0.752 | 0.848 | 0.896
FPN (resnet50) | 0.922 | 0.895 | 0.951 | 0.962 | 0.936 | 0.823 | 0.938 | 0.952
SSD (vgg) | 0.823 | 0.788 | 0.873 | 0.866 | 0.801 | 0.714 | 0.827 | 0.893
YOLOv3 (darknet53) | 0.870 | 0.842 | 0.897 | 0.906 | 0.872 | 0.785 | 0.856 | 0.930
Ours | 0.954 | 0.938 | 0.981 | 0.989 | 0.963 | 0.880 | 0.951 | 0.974

Fig. 8. Performance with different data proportions on the dataset Xdb1.


C. Comparison Results on Dataset Xdb2
We also use the dataset Xdb2 training and testing sets to
train and evaluate our method, respectively.


As shown in Table II, our method achieves promising results, i.e., 0.835 in terms of mAP, which outperforms most of the tested methods on this dataset by a large margin. For two-stage methods, R-FCN shows a similar performance to Faster R-CNN. They use a single-scale feature map to generate region proposals, which may lead to a lack of semantic and localization information. FPN makes full use of the pyramid structure to deal with multi-scale detection; its feature maps are fused from different levels, so this structure contains more detail and semantic information. Therefore, FPN gets a better performance than Faster R-CNN or R-FCN, especially for knife, scissors and lighter detection. For single-stage methods, SSD gets a worse performance than YOLOv3. Compared to Xdb1, Xdb2 is more complicated, with densely cluttered backgrounds. SSD does not take enough advantage of the different feature maps, which may lead to a lack of detail and semantic information. YOLOv3 also generates RoIs from multi-scale feature maps; however, it obtains only a slight improvement when dealing with such a complicated background. Compared to FPN, YOLOv3 still gets a worse performance, which may be due to the lack of a region proposal step. Table II shows that the proposed method achieves impressive improvements, especially on small objects such as knives and lighters. This proves that our method can be more stable when dealing with densely cluttered backgrounds and small objects during X-ray baggage inspection. Fig. 9 shows some qualitative detection results on the Xdb2 dataset.

Fig. 9. Some qualitative detection results of our proposed method on the Xdb2 test dataset. Only detections with scores higher than 0.5 are shown.

To assess the robustness of our approach, we reduced the size of the training set from 80% to 70% and 60%. Accordingly, the validation sets are all 10%, and the test sets grow from 10% to 20% and 30%. Fig. 10 shows the mAP with different proportions on the dataset Xdb2. As we can see, even with less training data and more test data, the proposed method still performs better than the other methods.

Fig. 10. Performance with different data proportions on the dataset Xdb2.

TABLE II
AVERAGE PRECISION (AP) AND MEAN AVERAGE PRECISION (mAP) ON THE TESTING SET OF THE XDB2 DATASET

model | mAP | bottles | metal cup | knives | scissors | gun | battery | laptop | umbrella | lighter | pressure cans
Faster R-CNN (resnet50) | 0.706 | 0.808 | 0.809 | 0.528 | 0.614 | 0.710 | 0.683 | 0.818 | 0.781 | 0.557 | 0.749
R-FCN (resnet50) | 0.706 | 0.817 | 0.801 | 0.525 | 0.623 | 0.722 | 0.688 | 0.815 | 0.762 | 0.551 | 0.752
FPN (resnet50) | 0.797 | 0.860 | 0.858 | 0.677 | 0.690 | 0.834 | 0.766 | 0.927 | 0.853 | 0.669 | 0.831
SSD (vgg) | 0.694 | 0.803 | 0.810 | 0.521 | 0.604 | 0.704 | 0.679 | 0.792 | 0.754 | 0.548 | 0.727
YOLOv3 (darknet53) | 0.713 | 0.826 | 0.822 | 0.532 | 0.612 | 0.731 | 0.696 | 0.811 | 0.795 | 0.552 | 0.755
Ours | 0.835 | 0.898 | 0.886 | 0.737 | 0.743 | 0.864 | 0.816 | 0.938 | 0.885 | 0.706 | 0.872
D. Analysis of the Proposed Modules

To examine the effectiveness and contributions of the different modules used in the proposed method, we conduct additional ablation studies, listed in Table IV and Table V. We mainly analyze the data augmentation module, feature enhancement module, focal loss module, RoI fusion module and soft NMS module. The variants are listed in Table III, where FPN represents our modified feature pyramid network, V1 represents the basic FPN with our data augmentation module, V2 represents V1 with our feature enhancement module, V3 represents V2 with our focal loss module, V4 represents V3 with our RoI fusion module, and Ours represents V4 with the soft NMS module. For the Xdb1 dataset, the data augmentation module improves mAP by 1.5% over our modified FPN, the feature enhancement module improves by 0.8% over V1, the focal loss module improves by 0.1% over V2, the RoI fusion module improves by 0.6% over V3, the soft NMS module improves by 0.3% over V4, and our complete method improves by 3.4% over our modified FPN. For the Xdb2 dataset, the data augmentation module improves by 2.3% over our modified FPN, the feature enhancement module improves by 1.3% over V1, the focal loss module improves by 0.2% over V2, the RoI fusion module improves by 0.6% over V3, the soft NMS module improves by 0.3% over V4, and our proposed method improves by 4.7% over our modified FPN. The results of this comparison clearly reveal the advantages of our method.


The data augmentation module improves the mAP the most. It can generate diverse training data, which makes the detector more robust when it is fed with new data. The feature enhancement module is also effective thanks to combining enhanced multi-scale feature maps, which is helpful for small objects such as knives. The focal loss module and the soft NMS module bring slight improvements, which are still useful for detection. The RoI fusion module is also effective, fusing all feature maps to generate more robust results.

TABLE III
ADDITIONAL EXPERIMENTS WITH DIFFERENT MODULES OF OUR PROPOSED METHOD

model | Pyramid Structure | Data Augmentation | Feature Enhancement | Focal Loss | RoI Fusion | Soft NMS
FPN  | ✓ | ✗ | ✗ | ✗ | ✗ | ✗
V1   | ✓ | ✓ | ✗ | ✗ | ✗ | ✗
V2   | ✓ | ✓ | ✓ | ✗ | ✗ | ✗
V3   | ✓ | ✓ | ✓ | ✓ | ✗ | ✗
V4   | ✓ | ✓ | ✓ | ✓ | ✓ | ✗
Ours | ✓ | ✓ | ✓ | ✓ | ✓ | ✓

TABLE IV
AVERAGE PRECISION (AP) AND MEAN AVERAGE PRECISION (mAP) ON THE TESTING SET OF THE XDB1 DATASET

model | mAP | scissors | bottle | metal cup | kitchen knife | knife | battery | umbrella
FPN  | 0.922 | 0.895 | 0.951 | 0.962 | 0.936 | 0.823 | 0.935 | 0.952
V1   | 0.936 | 0.919 | 0.964 | 0.973 | 0.950 | 0.845 | 0.941 | 0.963
V2   | 0.944 | 0.928 | 0.968 | 0.979 | 0.956 | 0.864 | 0.943 | 0.967
V3   | 0.945 | 0.929 | 0.970 | 0.980 | 0.956 | 0.866 | 0.944 | 0.967
V4   | 0.951 | 0.936 | 0.979 | 0.984 | 0.964 | 0.876 | 0.948 | 0.973
Ours | 0.954 | 0.938 | 0.981 | 0.989 | 0.963 | 0.880 | 0.951 | 0.974

TABLE V
AVERAGE PRECISION (AP) AND MEAN AVERAGE PRECISION (mAP) ON THE TESTING SET OF THE XDB2 DATASET

model | mAP | bottles | metal cup | knives | scissors | gun | battery | laptop | umbrella | lighter | pressure cans
FPN  | 0.797 | 0.860 | 0.858 | 0.677 | 0.690 | 0.834 | 0.766 | 0.927 | 0.853 | 0.669 | 0.831
V1   | 0.816 | 0.879 | 0.873 | 0.718 | 0.719 | 0.846 | 0.785 | 0.931 | 0.869 | 0.684 | 0.858
V2   | 0.827 | 0.887 | 0.879 | 0.727 | 0.735 | 0.857 | 0.805 | 0.935 | 0.877 | 0.699 | 0.864
V3   | 0.829 | 0.890 | 0.881 | 0.730 | 0.737 | 0.859 | 0.809 | 0.935 | 0.880 | 0.701 | 0.867
V4   | 0.833 | 0.895 | 0.885 | 0.735 | 0.743 | 0.862 | 0.813 | 0.938 | 0.883 | 0.705 | 0.870
Ours | 0.835 | 0.898 | 0.886 | 0.737 | 0.743 | 0.864 | 0.816 | 0.938 | 0.885 | 0.706 | 0.872

E. False Alarm and Miss Alarm

Although the proposed algorithm outperforms the relevant methods for object detection during X-ray baggage inspection, there are still some targets that are missed or misreported. This section briefly analyzes these situations. Test results show that most errors occur in situations like the ones shown in Fig. 11 and Fig. 12. Due to the impact of object clutter, some objects are missed or misreported during testing. The left parts of Fig. 11 and Fig. 12 show some missed samples, marked in red rectangles. A missed object may be heavily overlapped by other objects or resemble similar objects; due to the impact of different views, some objects have a shape and appearance similar to those of other objects. These reasons make it difficult for the model to correctly distinguish the targets. The right parts of Fig. 11 and Fig. 12 show some misreported samples, marked in red rectangles. In the right part of Fig. 11, a kitchen knife is recognized as a dagger; due to the different view, this sample has a shape similar to that of a dagger. The right part of Fig. 12 shows that a knife was recognized as a lighter; due to their small size, they may have the same shape in this view. This situation may be alleviated by multi-view detection.

In order to further show the effect of the proposed method, we also validate the number of baggage items without target objects that are detected as containing a target object. For the Xdb1 dataset, 0.4% of the baggage without target objects is detected as containing a target object. For the Xdb2 dataset, 0.2% of the baggage without target objects is detected as containing a target object. The results show that the proposed method yields a very low percentage of baggage without target objects being detected as containing a target object. This proves that our method can be used in practice.

Fig. 11. Missed and false detections on the Xdb1 dataset. (Left) miss alarm and (right) false alarm.

Fig. 12. Missed and false detections on the Xdb2 dataset. (Left) miss alarm and (right) false alarm.

V. CONCLUSIONS

In this paper, an effective approach is proposed to build a deep object detector and train it from scratch for X-ray image inspection. The novelties that distinguish the proposed work from previous works lie in two major aspects. First, instead of fine-tuning ImageNet pre-trained models, our method trains the deep detector from scratch, which provides the freedom to adjust or redesign the structures. Second, in order to improve the detection performance for cluttered objects, we adopt focal loss to address the foreground-background imbalance and predict multi-scale object proposals from several enhanced intermediate layers to improve the accuracy.


This is followed by element-level fusion and soft NMS post-processing. The quantitative comparison results on the Xdb1 and Xdb2 datasets show that the proposed method achieves better performance than the comparative methods and is more effective than existing algorithms for detecting small and densely cluttered X-ray objects. However, as stated above, our method still produces some false alarms and omissions in some severe situations. Hence, in our future studies, we will focus on discriminating the false alarms and learning the structure of the network adaptively. In addition, we will improve the transferability of our model using domain adaptation methods.

REFERENCES

[1] G. Zentai, "X-ray imaging for homeland security," in 2008 IEEE International Workshop on Imaging Systems and Techniques, pp. 1–6. IEEE, 2008.
[2] E. Parliament, "Aviation security with a special focus on security scanners," European Parliament Resolution (2010/2154 (INI)), pp. 1–10, 2012.
[3] A. Schwaninger, A. Bolfing, T. Halbherr, S. Helman, A. Belyavin, and L. Hay, "The impact of image based factors and training on threat detection performance in x-ray screening," 2008.
[4] G. Blalock, V. Kadiyali, and D. H. Simon, "The impact of post-9/11 airport security measures on the demand for air travel," The Journal of Law and Economics, vol. 50, no. 4, pp. 731–755, 2007.
[5] S. Michel, S. M. Koller, J. C. de Ruiter, R. Moerland, M. Hogervorst, and A. Schwaninger, "Computer-based training increases efficiency in x-ray image interpretation by aviation security screeners," in 2007 41st Annual IEEE International Carnahan Conference on Security Technology, pp. 201–206. IEEE, 2007.
[6] Z. Chen, Y. Zheng, B. R. Abidi, D. L. Page, and M. A. Abidi, "A combinational approach to the fusion, de-noising and enhancement of dual-energy x-ray luggage images," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Workshops, pp. 2–2. IEEE, 2005.
[7] D. Mery, Computer Vision for X-ray Testing. Switzerland: Springer International Publishing, 2015.
[8] V. Rebuffel and J.-M. Dinten, "Dual-energy x-ray imaging: benefits and limits," Insight - Non-Destructive Testing and Condition Monitoring, vol. 49, no. 10, pp. 589–594, 2007.
[9] D. Mery, E. Svec, M. Arias, V. Riffo, J. M. Saavedra, and S. Banerjee, "Modern computer vision techniques for x-ray testing in baggage inspection," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 4, pp. 682–692, 2016.
[10] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[11] Y. Zhu and S. Newsam, "Densenet for dense flow," in 2017 IEEE International Conference on Image Processing (ICIP), pp. 790–794. IEEE, 2017.
[12] A. Kamilaris and F. X. Prenafeta-Boldú, "Deep learning in agriculture: A survey," Computers and Electronics in Agriculture, vol. 147, pp. 70–90, 2018.
[13] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[14] M. Baştan, "Multi-view object detection in dual-energy x-ray images," Machine Vision and Applications, vol. 26, no. 7-8, pp. 1045–1060, 2015.
[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, pp. 91–99, 2015.
[16] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125, 2017.
[17] V. Riffo and D. Mery, "Automated detection of threat objects using adapted implicit shape model," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 46, no. 4, pp. 472–482, 2015.
[18] D. Mery, E. Svec, and M. Arias, "Object recognition in baggage inspection using adaptive sparse representations of x-ray images," in Image and Video Technology, pp. 709–720. Springer, 2015.
[19] J. Liu, Y. Hu, J. Yang, Y. Chen, H. Shu, L. Luo, Q. Feng, Z. Gui, and G. Coatrieux, "3d feature constrained reconstruction for low-dose ct imaging," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 5, pp. 1232–1247, 2016.
[20] J. Liu, J. Ma, Y. Zhang, Y. Chen, J. Yang, H. Shu, L. Luo, G. Coatrieux, W. Yang, Q. Feng et al., "Discriminative feature representation to improve projection data inconsistency for low dose ct imaging," IEEE Transactions on Medical Imaging, vol. 36, no. 12, pp. 2499–2509, 2017.
[21] M. Baştan, M. R. Yousefi, and T. M. Breuel, "Visual words on baggage x-ray images," in International Conference on Computer Analysis of Images and Patterns, pp. 360–368. Springer, 2011.
[22] D. Mery, V. Riffo, I. Zuccar, and C. Pieringer, "Automated x-ray object recognition using an efficient search algorithm in multiple views," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 368–374, 2013.
[23] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-based convolutional networks for accurate object detection and segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142–158, 2015.
[24] R. Girshick, "Fast r-cnn," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.
[25] J. Dai, Y. Li, K. He, and J. Sun, "R-fcn: Object detection via region-based fully convolutional networks," in Advances in Neural Information Processing Systems, pp. 379–387, 2016.
[26] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, 2016.
[27] J. Redmon and A. Farhadi, "Yolo9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271, 2017.
[28] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "Ssd: Single shot multibox detector," in European Conference on Computer Vision, pp. 21–37. Springer, 2016.
[30] S. Akçay, M. E. Kundegorski, M. Devereux, and T. P. Breckon, "Transfer learning using convolutional neural networks for object classification within x-ray baggage security imagery," in 2016 IEEE International Conference on Image Processing (ICIP), pp. 1057–1061. IEEE, 2016.
[31] S. Akcay, M. E. Kundegorski, C. G. Willcocks, and T. P. Breckon, "Using deep convolutional neural network architectures for object classification and detection within x-ray baggage security imagery," IEEE Transactions on Information Forensics and Security, vol. 13, no. 9, pp. 2203–2215, 2018.
[32] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, "Understanding convolution for semantic segmentation," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1451–1460. IEEE, 2018.
[33] F. Yu, V. Koltun, and T. Funkhouser, "Dilated residual networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 472–480, 2017.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
[35] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[36] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988, 2017.
[37] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask r-cnn," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969, 2017.
[38] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, "Soft-nms: improving object detection with one line of code," in Proceedings of the IEEE International Conference on Computer Vision, pp. 5561–5569, 2017.
[39] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "Pytorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.
[40] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (voc) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.


BANGZHONG GU received the M.E. degree from the School of Computer Science and Engineering, Southeast University, Nanjing, China, in 2015. He is currently pursuing the Ph.D. degree with the Laboratory of Image Science and Technology, School of Computer Science and Engineering, Southeast University, Nanjing, China. His current research interests include image processing and machine learning.

RONGJUN GE received the B.E. and M.E. degrees from the School of Information Science and Engineering, Lanzhou University, Lanzhou, China, in 2013 and 2016, respectively. He is currently pursuing the Ph.D. degree with the Laboratory of Image Science and Technology, School of Computer Science and Engineering, Southeast University, Nanjing, China. His current research interests include image encryption, chaos, medical image processing, and machine learning.

YANG CHEN received the M.E. and Ph.D. degrees in biomedical engineering from First Military Medical University, Guangzhou, China, in 2004 and 2007, respectively. Since 2008, he has been a Professor with the Laboratory of Image Science and Technology, School of Computer Science and Engineering, Southeast University, Nanjing, China. His research interests include medical image reconstruction, image analysis, pattern recognition, and computer-aided diagnosis.

LIMIN LUO received the Ph.D. degree from the University of Rennes, Rennes, France, in 1986. He is currently a Professor with the Laboratory of Image Science and Technology, School of Computer Science and Engineering, Southeast University, Nanjing, China. His current research interests include medical imaging, image analysis, computer-assisted systems for diagnosis and therapy in medicine, and computer vision.

GOUENOU COATRIEUX received the Ph.D. degree in signal processing and telecommunication from the University of Rennes 1, Rennes, France, in collaboration with Ecole Nationale Supérieure des Télécommunications, Paris, France, in 2002. He is currently a Professor with the Information and Image Processing Department, Institut Mines-Télécom, Telecom Bretagne, Brest, France. His research is conducted in the LaTIM Laboratory, INSERM U1101, Brest. His research interests include data security, encryption, watermarking, secure processing of outsourced data, digital forensics in medical imaging, and electronic patient records.

