978-1-7281-9835-4/23/$31.00 ©2023 IEEE    ICIP 2023
LOCALIZATION
ABSTRACT
Recent advances in deep learning have facilitated significant progress in the autonomous detection of concealed security threats from baggage X-ray scans, a plausible solution to overcome the pitfalls of manual screening. However, these data-hungry schemes rely on extensive instance-level annotations that involve strenuous skilled labor. Hence, this paper proposes a context-aware transformer for weakly supervised baggage threat localization, exploiting the inherent capacity of transformers to learn long-range semantic relations to capture the object-level context of the illegal items. Unlike conventional single-class token transformers, the proposed dual-token architecture can generalize well to different threat categories by learning the threat-specific semantics from the token-wise attention to generate context maps. The framework has been evaluated on two public datasets, Compass-XP and SIXray, and surpassed other SOTA approaches.

Index Terms— Baggage security, X-ray Imagery, Weakly Supervised Localization, Threat Recognition, Transformer.

Fig. 1. Visualization of baggage threat localization. The top row showcases the result of the proposed approach, with (A) the threat-aware context map extracted from the proposed CGM (dimension 14 × 14), (B) the context map interpolated and overlaid on the input scan, capturing global features of the threat items, and (C) the final threat localization result. The bottom row shows the comparative results with different approaches: Grad-CAM [9], Ablation CAM [10] and TS-CAM [11].
Authorized licensed use limited to: Georgia Institute of Technology. Downloaded on November 24,2023 at 08:42:29 UTC from IEEE Xplore. Restrictions apply.
Fig. 2. The proposed Context-aware dual token transformer model translates the input scan into a sequence of patch tokens to
which learnable dual-class tokens are affixed to capture the global context of both the threat and benign categories. Position
embeddings are also added before passing the tokens through the encoder layers. CGM captures the global context of the
concealed threats by leveraging the threat-specific class token CT and the patch tokens. The context map is then refined using
patch-wise attention which is passed to PSM to expose complete threat objects from cluttered and occluded baggage scans.
gradient-free approaches [10, 15].

However, these approaches, based on CNNs [16, 17], are constrained to localized interactions (see Fig. 1). Alternatively, vision transformers [18] have gained attention due to their ability to model global features by leveraging long-range semantics, which is crucial in localizing the object of interest. Gao et al. [11] incorporated CAMs with transformers to emphasize the distinctive local features while diverting attention from the irrelevant parts. Meanwhile, Su et al. [19] proposed token-prioritizing to comprehend the objects precisely.

Despite the progress, WSOL has not yet been investigated in security threat recognition, primarily due to additional challenges: a) Occlusion: threat items may be impeded by other high-density benign materials, rendering them indistinguishable; b) Heavily cluttered background: precise localization is challenging due to noisy activation maps. Towards this goal, we explore weakly supervised baggage threat localization using transformers to exploit their ability to model long-range spatial correlations. Furthermore, transformers are ideal for X-ray baggage threat localization, as they favor shape over texture and are robust to occlusion [20].

Even though the multi-headed attention mechanism enables transformers to focus on several semantic regions, the attentions are not class-specific [11]. Further, the class token captures interactions between different classes and the background, which can yield very noisy activation maps and unsatisfactory localization results, especially in compactly packed baggage scans where it is difficult to distinguish between overlapping threats and normal items. Hence, unlike conventional single-class-token transformers, we propose a dual-token transformer architecture to capture the object-level context of concealed security threats and to generalize well to different threat categories by localizing them with only binary labels (Threat vs. Benign). A class-specific training strategy is employed to associate the class tokens with the specific object category (detailed in Section 2). We have also designed a Context map Generation Module (CGM) to capture the global semantics of the threat items. Further, we have integrated a Patch Scoring Module (PSM) to expose additional relevant occluded object regions.

2. PROPOSED METHOD

This section provides an overview of the proposed context-aware transformer (Fig. 2), along with detailed explanations of the CGM, the PSM, and the implemented training strategy.

Context-aware Transformer architecture: The input baggage X-ray image x of resolution W × H is initially divided into M patches, where each patch x_Pn ∈ R^(s×s×3), n = 1, 2, ..., M, does not overlap with the adjacent patches, such that M = N × N and N = W/s. The patches are then vectorized and linearly projected (represented by F(·) in Eq. 1) into M patch embeddings x_n ∈ R^(M×D), to which class tokens x_CL ∈ R^(2×D) are affixed, where D denotes the embedding dimension and x_CL = [x_CT; x_CB] comprises x_CT and x_CB ∈ R^(1×D). It is to be noted that, unlike the standard transformer design where a single class token is employed, the proposed framework has a dual-token architecture to capture the context of both the threat and benign categories by learning discriminative representations for each. The tokens are then updated using position embeddings x_pos ∈ R^((2+M)×D), yielding the input token embeddings x_in ∈ R^((2+M)×D), which are then passed through L stacked encoder blocks.

x_in = [x_CT; x_CB; F(x_P1); F(x_P2); ...; F(x_PM)] ⊕ x_pos    (1)
     = [C_T; C_B; P_1; P_2; ...; P_M]    (2)

Each of these encoder blocks comprises a multi-headed attention layer with k heads and a multilayer perceptron. As the tokens pass through multiple encoder blocks, C_T captures the contextual information of threat items from the scans.

Context map Generation: The proposed CGM is responsible for extracting the global context of concealed security threats by leveraging the long-range inter-dependencies between the tokens learned by the self-attention blocks within the encoder. More specifically, the input tokens x_in are transformed into queries Q, keys K, and values V for computing the attention (Eq. 3). The token-wise similarity map A_T (Eq. 4) is then obtained by fusing the attention across the k heads.

Attention(Q, K, V) = softmax(QK^T / √D_k) V    (3)

A_T = softmax(QK^T / √D_k)    (4)

where Q, K, V ∈ R^((2+M)×D_k) and D_k = D/k. The attention map A_T ∈ R^((2+M)×(2+M)) captures the pair-wise attention between the input tokens, as shown in Fig. 2. The orange-colored columns represent the attention between the class tokens and patch tokens, from which we can extract the threat-specific context map A_CT ∈ R^(1×N×N). A_CT is obtained by leveraging and reshaping the attention scores between the threat-specific class token C_T and the patch tokens (P_1, P_2, ..., P_M). In this work, we have only used the final encoder block in our implementation, because the low-level semantics learned by the early layers can lead to noisy activation that can hinder threat localization.

The context map A_CT is then refined using patch-wise attention leveraged from A_T, which is straightforward in contrast to prior works [21]. The blue-colored columns (see Fig. 2) represent the attention scores between the patch tokens, which are averaged across the k attention heads, given by A_P ∈ R^(M×M), and utilized to refine the threat context map A_CT:

A_CTref(j) = Σ_{n=1}^{M} A_P(j, n) · A_CT(n)    (5)

where A_CTref is later reshaped into a 2D tensor to yield the refined map (A_CTref ∈ R^(N×N)). It can be observed from Section 3 that A_P enhances localization continuity.

Patch Scoring Module: The PSM employs a perturbation-based strategy to expose additional relevant and occluded object regions. It reveals more salient parts while retaining the regions captured by the CGM. The technique adapts Score-CAM [15] for transformer models. Score-CAM was initially proposed to grasp the significance of the activation maps of CNNs. However, as discussed in Section 3, employing Score-CAM on patch tokens can activate unwanted backgrounds. In the proposed PSM, the patch embeddings from the final encoder block {P_1, P_2, ..., P_M} are first reshaped and transposed into feature maps P_F, where each feature map P_F^d, d ∈ {1, 2, ..., D}, highlights different semantically related regions. However, this might also add unwanted parts to the localization results. Hence, the refined context map A_CTref is added to the feature maps to suppress the background:

P_FT = A_CTref ⊕ P_F    (6)

where P_FT ∈ R^(N×N×D) is then upsampled and normalized:

P̂_FT = (P_FT − min(P_FT)) / (max(P_FT) − min(P_FT))    (7)

The feature maps are superimposed over the input scan x to generate scans with partial masking. These masked images are then fed to the trained transformer model to yield target scores, which are utilized as weights to linearly combine the respective feature maps into the final threat localization map. The bounding boxes are then drawn using the technique in [12].

Dual Token Training Strategy: To capture the contextual information of the threats from the scans, it is essential to build a one-to-one association between each class token and the respective ground-truth label. This is attained by modifying the head of the proposed framework, where the final MLP head used for classification in standard transformer models is replaced with an average pooling layer. The dual output tokens from the final layer (C_Tok = [C_T, C_B], C_Tok ∈ R^(2×D)) are averaged along the embedding dimension to obtain the scores corresponding to the threat and benign classes, which are supervised by the one-hot encoded class labels, constrained via binary cross-entropy loss.

y(c) = (1/D) Σ_{l=1}^{D} C_Tok(c, l),  c ∈ {0, 1}    (8)

where C_Tok(c, l) is the l-th element along the embedding dimension of the c-th token. The proposed training strategy enables each of the dual tokens to model distinctive global semantic correlations specific to the two classes.

3. EXPERIMENTAL ANALYSIS AND RESULTS

The proposed context-aware baggage threat localization approach was evaluated on Compass-XP [22] and SIXray [3]. Compass-XP, released in 2019, comprises 11,568 scans (with different representations such as low and high energy, grayscale, color, and density variants), from which 80% were used for training, per the protocol. The SIXray dataset, on the other hand, consists of five threat categories (guns, pliers, scissors, wrenches, knives) and is very unbalanced and occluded.
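To make the tokenization step of Section 2 concrete, here is a minimal NumPy sketch of the token construction in Eq. 1. The 224 × 224 input (giving N = 14, matching the 14 × 14 context map of Fig. 1), the patch size s = 16, the embedding dimension D = 192, and the randomly initialized projection and tokens are illustrative assumptions standing in for the model's learned parameters.

```python
import numpy as np

def build_input_tokens(x, s=16, D=192, seed=0):
    """Sketch of Eq. 1: x_in = [x_CT; x_CB; F(x_P1); ...; F(x_PM)] + x_pos."""
    rng = np.random.default_rng(seed)
    H, W, C = x.shape
    N = W // s                      # N = W / s, so M = N * N patches
    M = N * N
    # Split into M non-overlapping s x s x C patches and vectorize them.
    patches = (x.reshape(N, s, N, s, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(M, s * s * C))
    # Linear projection F(.) into D-dim patch embeddings (random weights
    # stand in for the learned projection).
    F = rng.normal(0.0, 0.02, size=(s * s * C, D))
    patch_emb = patches @ F
    # Dual class tokens x_CT (threat) and x_CB (benign), randomly initialized.
    x_CL = rng.normal(0.0, 0.02, size=(2, D))
    tokens = np.concatenate([x_CL, patch_emb], axis=0)   # (2 + M, D)
    # Position embeddings x_pos are added before the encoder blocks.
    x_pos = rng.normal(0.0, 0.02, size=(2 + M, D))
    return tokens + x_pos

x = np.random.rand(224, 224, 3)
x_in = build_input_tokens(x)        # shape (2 + 14*14, 192) = (198, 192)
```

The resulting (2 + M) × D token matrix is what the L encoder blocks consume.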
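The CGM computation of Eqs. 3-5 reduces to a few matrix operations. The sketch below uses a single attention head and random queries/keys purely for illustration; the actual model fuses the attention across k heads and takes Q and K from the final encoder block.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cgm_context_map(Q, K, N):
    """Context map Generation Module (single-head sketch).

    Q, K: (2 + M, Dk) queries/keys, token order [C_T, C_B, P_1, ..., P_M],
    with M = N * N. Returns the refined threat context map of Eq. 5.
    """
    Dk = Q.shape[-1]
    # Token-wise similarity map A_T (Eq. 4).
    A_T = softmax(Q @ K.T / np.sqrt(Dk))
    # Threat-specific context map: attention of class token C_T over patches.
    A_CT = A_T[0, 2:]                      # (M,)
    # Patch-wise attention A_P between the patch tokens.
    A_P = A_T[2:, 2:]                      # (M, M)
    # Refinement (Eq. 5): A_CTref(j) = sum_n A_P(j, n) * A_CT(n).
    A_CTref = A_P @ A_CT                   # (M,)
    return A_CTref.reshape(N, N)

rng = np.random.default_rng(0)
M, Dk, N = 196, 64, 14
Q = rng.normal(size=(2 + M, Dk))
K = rng.normal(size=(2 + M, Dk))
A_CTref = cgm_context_map(Q, K, N)         # (14, 14) threat context map
```

The refinement in Eq. 5 is simply a matrix-vector product between the patch-patch attention and the class-patch attention, which is what smooths the map and improves localization continuity.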
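The PSM pipeline of Eqs. 6-7 can be sketched as follows. The nearest-neighbour upsampling, the global min-max normalization, the single-channel scan, and the toy score_fn are simplifying assumptions that stand in for the trained transformer's per-mask threat scores.

```python
import numpy as np

def psm_localization_map(P_F, A_CTref, x, score_fn):
    """Patch Scoring Module (sketch).

    P_F: (N, N, D) feature maps reshaped from the final patch embeddings.
    A_CTref: (N, N) refined threat context map (suppresses background).
    x: (H, W) input scan (single channel for brevity).
    score_fn: callable returning a threat score for a (masked) scan;
              stands in for the trained transformer model.
    """
    N, _, D = P_F.shape
    H, W = x.shape
    # Eq. 6: add the refined context map to every feature map.
    P_FT = P_F + A_CTref[..., None]
    # Upsample to scan resolution (nearest-neighbour for simplicity).
    up = np.repeat(np.repeat(P_FT, H // N, axis=0), W // N, axis=1)
    # Eq. 7: min-max normalization.
    up = (up - up.min()) / (up.max() - up.min() + 1e-8)
    # Mask the scan with each map, score it, and linearly combine the maps.
    weights = np.array([score_fn(x * up[..., d]) for d in range(D)])
    return (up * weights).sum(axis=-1)

rng = np.random.default_rng(0)
N, D, H, W = 14, 8, 224, 224
P_F = rng.random((N, N, D))
A_CTref = rng.random((N, N))
x = rng.random((H, W))
loc = psm_localization_map(P_F, A_CTref, x, score_fn=lambda m: m.mean())
```

Each of the D feature maps thus contributes to the final localization map in proportion to how strongly its masked scan still triggers the threat class, mirroring the Score-CAM idea adapted here to patch tokens.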
Table 1. Performance on COMPASS-XP [22] and SIXray [3].
                       Compass-XP          SIXray
Methods                Top-1   GT-Known    Top-1   GT-Known   Loc.
Grad-CAM (ResNet)      36.5    39.2        22.6    30.1       -
Ablation-CAM           35.9    38.3        21.9    28.8       -
TS-CAM                 43.4    45.1        33.8    35.2       -
CHR [3]                -       -           -       -          54.8
Ours                   55.3    58.2        37.6    38.3       82.9
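For reference, the GT-Known scores in Table 1 follow the conventional weakly supervised localization protocol: a scan counts as correct when the predicted box overlaps the ground-truth box with IoU ≥ 0.5, given the ground-truth class. The 0.5 threshold and the (x1, y1, x2, y2) box format in this sketch are assumptions, as the paper does not restate the protocol details here.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def gt_known_acc(pred_boxes, gt_boxes, thresh=0.5):
    """Fraction of scans whose predicted box reaches IoU >= thresh
    against the ground truth, given the ground-truth class (GT-Known)."""
    hits = sum(box_iou(p, g) >= thresh for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)

acc = gt_known_acc([(0, 0, 10, 10), (5, 5, 15, 15)],
                   [(0, 0, 10, 10), (20, 20, 30, 30)])   # 1 hit of 2 -> 0.5
```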
datasets demonstrate the superiority of the approach.

5. REFERENCES

[1] Divya Velayudhan, Taimur Hassan, Ernesto Damiani, and Naoufel Werghi, "Recent advances in baggage threat detection: A comprehensive and systematic survey," ACM Computing Surveys (CSUR), 2022.

[2] Divya Velayudhan, Taimur Hassan, Abdelfatah Hassan Ahmed, Ernesto Damiani, and Naoufel Werghi, "Baggage threat recognition using deep low-rank broad learning detector," in 2022 IEEE 21st Mediterranean Electrotechnical Conference (MELECON), 2022, pp. 966–971.

[3] C. Miao, L. Xie, F. Wan, C. Su, H. Liu, J. Jiao, and Q. Ye, "SIXray: A large-scale security inspection X-ray benchmark for prohibited item discovery in overlapping images," in IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[4] Abdelfatah Ahmed, Ahmad Obeid, Divya Velayudhan, Taimur Hassan, Ernesto Damiani, and Naoufel Werghi, "Balanced affinity loss for highly imbalanced baggage threat contour-driven instance segmentation," in 2022 IEEE International Conference on Image Processing (ICIP), 2022, pp. 981–985.

[5] Taimur Hassan, Samet Akçay, Mohammed Bennamoun, Salman Khan, and Naoufel Werghi, "Unsupervised anomaly instance segmentation for baggage threat recognition," Journal of Ambient Intelligence and Humanized Computing, pp. 1–12, 2021.

[6] Taimur Hassan, Samet Akcay, Mohammed Bennamoun, Salman Khan, and Naoufel Werghi, "Tensor pooling-driven instance segmentation framework for baggage threat recognition," Neural Computing and Applications, vol. 34, no. 2, pp. 1239–1250, 2022.

[7] Renshuai Tao, Yanlu Wei, Xiangjian Jiang, Hainan Li, Haotong Qin, Jiakai Wang, Yuqing Ma, Libo Zhang, and Xianglong Liu, "Towards real-world x-ray security inspection: A high-quality benchmark and lateral inhibition module for prohibited items detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10923–10932.

[8] Boying Wang, Libo Zhang, Longyin Wen, Xianglong Liu, and Yanjun Wu, "Towards real-world prohibited item detection: A large-scale x-ray benchmark," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5412–5421.

[9] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.

[10] Harish Guruprasad Ramaswamy et al., "Ablation-CAM: Visual explanations for deep convolutional network via gradient-free localization," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 983–991.

[11] Wei Gao, Fang Wan, Xingjia Pan, Zhiliang Peng, Qi Tian, Zhenjun Han, Bolei Zhou, and Qixiang Ye, "TS-CAM: Token semantic coupled attention map for weakly supervised object localization," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2886–2895.

[12] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba, "Learning deep features for discriminative localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.

[13] Xiaolin Zhang, Yunchao Wei, Jiashi Feng, Yi Yang, and Thomas S Huang, "Adversarial complementary learning for weakly supervised object localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1325–1334.

[14] Sahil Singla, Besmira Nushi, Shital Shah, Ece Kamar, and Eric Horvitz, "Understanding failures of deep networks via robust feature extraction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12853–12862.

[15] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu, "Score-CAM: Score-weighted visual explanations for convolutional neural networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 24–25.

[16] Bilal Hassan, Shiyin Qin, Taimur Hassan, Ramsha Ahmed, and Naoufel Werghi, "Joint segmentation and quantification of chorioretinal biomarkers in optical coherence tomography scans: A deep learning approach," IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–17, 2021.

[17] E. A. Hadhrami, M. A. Mufti, B. Taha, and N. Werghi, "Transfer learning with convolutional neural networks for moving target classification with micro-doppler radar spectrograms," in 2018 International Conference on Artificial Intelligence and Big Data, ICAIBD 2018, 2018, pp. 148–154.

[18] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou, "Training data-efficient image transformers & distillation through attention," in International Conference on Machine Learning. PMLR, 2021, pp. 10347–10357.

[19] Hui Su, Yue Ye, Zhiwei Chen, Mingli Song, and Lechao Cheng, "Re-attention transformer for weakly supervised object localization," arXiv preprint arXiv:2208.01838, 2022.

[20] Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang, "Intriguing properties of vision transformers," Advances in Neural Information Processing Systems, vol. 34, pp. 23296–23308, 2021.

[21] Jiwoon Ahn and Suha Kwak, "Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4981–4990.

[22] Matthew Caldwell and Lewis D Griffin, "Limits on transfer learning from photographic image data to x-ray threat detection," Journal of X-ray Science and Technology, vol. 27, no. 6, pp. 1007–1020, 2019.