
CONTEXT-AWARE TRANSFORMERS FOR WEAKLY SUPERVISED BAGGAGE THREAT LOCALIZATION

Divya Velayudhan1, Abdelfatah Ahmed1, Taimur Hassan2, Mohammed Bennamoun3, Ernesto Damiani1, Naoufel Werghi1

1 KUCARS and C2PS, Department of Electrical Engineering and Computer Science, Khalifa University
2 Department of Electrical, Computer and Biomedical Engineering, Abu Dhabi University
3 Department of Computer Science and Software Engineering, The University of Western Australia

2023 IEEE International Conference on Image Processing (ICIP) | 978-1-7281-9835-4/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICIP49359.2023.10221975

ABSTRACT

Recent advances in deep learning have facilitated significant progress in the autonomous detection of concealed security threats from baggage X-ray scans, a plausible solution to overcome the pitfalls of manual screening. However, these data-hungry schemes rely on extensive instance-level annotations that involve strenuous skilled labor. Hence, this paper proposes a context-aware transformer for weakly supervised baggage threat localization, exploiting the inherent capacity of transformers to learn long-range semantic relations and capture the object-level context of illegal items. Unlike conventional single-class-token transformers, the proposed dual-token architecture generalizes well to different threat categories by learning threat-specific semantics from token-wise attention to generate context maps. The framework has been evaluated on two public datasets, Compass-XP and SIXray, and surpassed other SOTA approaches.

Index Terms— Baggage security, X-ray Imagery, Weakly Supervised Localization, Threat Recognition, Transformer.

Fig. 1. Visualization of baggage threat localization. The top row showcases the result of the proposed approach, with (A) the threat-aware context map extracted from the proposed CGM (dimension 14 × 14), (B) the context map interpolated and overlaid on the input scan, capturing global features of the threat items, and (C) the final threat localization result. The bottom row shows comparative results with different approaches: Grad-CAM [9], Ablation-CAM [10], and TS-CAM [11].

1. INTRODUCTION

The increasing passenger traffic at airports and other transit hubs aggravates the risk of concealed security threats within baggage, raising serious concerns about public safety. Since existing techniques in baggage monitoring are reliant on human expertise, researchers have proposed automated baggage security threat identification from X-ray scans as a plausible solution [1–6]. Despite the significant progress, these data-hungry schemes feed on massive amounts of well-annotated training data, which are procured at great cost. Further, studies have reported that the instance-level annotation of security datasets necessitates skilled security personnel to identify the different threat categories and also involves strenuous labor (taking up to 3 minutes for a single scan) [7, 8].

Within the broader vision community, researchers have focused on weakly supervised localization (WSOL), exploring weak supervisory data (image labels) to locate the object categories instead of demanding instance-level annotations [12, 13]. Further, WSOL provides visual reasoning, which is vital for critical applications such as baggage security [14]. Visual inferences not only aid security personnel in locating the threats but also help researchers identify the pitfalls of the framework and develop more robust models.

The pioneering work in this field, proposed by Zhou et al. [12], redesigned the classifier and linearly merged the activation maps to depict model predictions. Other methods based on class activation maps (CAM) include gradient-based approaches such as Grad-CAM [9], which is widely embraced due to its applicability to all CNN architectures. However, it fails to highlight integral object regions and struggles with multiple instances. To counteract this, several other techniques were contributed, such as adversarial erasing [13] and gradient-free approaches [10, 15].

This work is supported by a research fund from Khalifa University, Ref: CIRA-2019-047, and the Abu Dhabi Department of Education and Knowledge (ADEK), Ref: AARE19-156.

Fig. 2. The proposed context-aware dual-token transformer model translates the input scan into a sequence of patch tokens, to which learnable dual class tokens are affixed to capture the global context of both the threat and benign categories. Position embeddings are also added before passing the tokens through the encoder layers. The CGM captures the global context of the concealed threats by leveraging the threat-specific class token C_T and the patch tokens. The context map is then refined using patch-wise attention and passed to the PSM to expose complete threat objects in cluttered and occluded baggage scans.

However, these approaches, based on CNNs [16, 17], are constrained to localized interactions (see Fig. 1). Alternatively, vision transformers [18] have gained attention due to their ability to model global features by leveraging long-range semantics, which is crucial in localizing the object of interest. Gao et al. [11] incorporated CAMs with transformers to emphasize the distinctive local features while diverting attention from the irrelevant parts. Meanwhile, Su et al. [19] proposed token prioritization to comprehend the objects precisely.

Despite this progress, WSOL has not yet been investigated in security threat recognition, primarily due to two additional challenges: a) Occlusion: threat items may be impeded by other high-density benign materials, rendering them indistinguishable; b) Heavily cluttered background: precise localization is challenging due to noisy activation maps. Towards this goal, we explore weakly supervised baggage threat localization using transformers to exploit their ability to model long-range spatial correlations. Furthermore, transformers are ideal for X-ray baggage threat localization, as they favor shape over texture and are robust to occlusion [20].

Even though the multi-headed attention mechanism enables transformers to focus on several semantic regions, the attentions are not class-specific [11]. Further, this can lead to very noisy activation maps, as the class token captures interactions between different classes and the background, yielding unsatisfactory localization results, especially in compactly packed baggage scans where it is difficult to distinguish between overlapping threats and normal items. Hence, we propose a dual-token transformer architecture, unlike conventional single-class-token transformers, to capture the object-level context of concealed security threats and to generalize well to different threat categories by localizing them with only binary labels (Threat vs. Benign). A class-specific training strategy is employed to associate the class tokens with the specific object category (detailed in Section 2). We have also designed a Context map Generation Module (CGM) to capture the global semantics of the threat items. Further, we have integrated a Patch Scoring Module (PSM) to expose additional relevant occluded object regions.

2. PROPOSED METHOD

This section provides an overview of the proposed context-aware transformer (Fig. 2) along with detailed explanations of the CGM, the PSM, and the implemented training strategy.

Context-aware Transformer architecture: The input baggage X-ray image x of resolution W × H is initially divided into M patches, where each patch x_Pn ∈ R^{s×s×3}, n = 1, 2, …, M, does not overlap with the adjacent patches, such that M = N × N and N = W/s. The patches are then vectorized and linearly projected (represented by F(·) in Eq. 1) into M patch embeddings x_n ∈ R^{M×D}, to which class tokens x_CL ∈ R^{2×D} are affixed, where D denotes the embedding dimension and x_CL = [x_CT; x_CB] comprises x_CT and x_CB ∈ R^{1×D}. It is to be noted that, unlike the standard transformer design where a single class token is employed, the proposed framework has a dual-token architecture to capture the context of both the threat and benign categories by learning discriminative representations for each. The tokens are then updated using positional embeddings x_pos ∈ R^{(2+M)×D}, yielding the input token embeddings x_in ∈ R^{(2+M)×D}, which are then passed through L stacked encoder blocks.

  x_in = [x_CT; x_CB; F(x_P1); F(x_P2); …; F(x_PM)] ⊕ x_pos        (1)
       = [C_T; C_B; P_1; P_2; …; P_M]        (2)

Each of these encoder blocks comprises a multi-headed attention layer with k heads and a multilayer perceptron. As the tokens pass through multiple encoder blocks, C_T captures the contextual information of threat items from the scans.

Context map Generation: The proposed CGM is responsible for extracting the global context of concealed security threats by leveraging the long-range inter-dependencies between the tokens learned by the self-attention blocks within the encoder. More specifically, the input tokens x_in are transformed into queries Q, keys K, and values V for computing the attention (Eq. 3). The token-wise similarity map A_T (Eq. 4) is then obtained by fusing the attention across the k heads.

  Attention(Q, K, V) = softmax(QKᵀ / √D_k) · V        (3)

  A_T = softmax(QKᵀ / √D_k)        (4)

where Q, K, V ∈ R^{(2+M)×D_k} and D_k = D/k. The attention map A_T ∈ R^{(2+M)×(2+M)} captures the pair-wise attention between the input tokens, as shown in Fig. 2. The orange-colored columns represent the attention between the class tokens and patch tokens, from which we can extract the threat-specific context map A_CT ∈ R^{1×N×N}. A_CT is obtained by leveraging and reshaping the attention scores between the threat-specific class token C_T and the patch tokens (P_1, P_2, …, P_M). In this work, we have only used the final encoder block in our implementation, because the low-level semantics learned by the early layers can lead to noisy activations that hinder threat localization.

The context map A_CT is then refined using patch-wise attention leveraged from A_T, which is straightforward in contrast to prior works [21]. The blue-colored columns (see Fig. 2) represent the attention scores between the patch tokens, which are averaged across the k attention heads, given by A_P ∈ R^{M×M}, and utilized to refine the threat context map A_CT:

  A_CTref(j) = Σ_{n=1}^{M} A_P(j, n) · A_CT(n)        (5)

where A_CTref is later reshaped into a 2D tensor to yield the refined map (A_CTref ∈ R^{N×N}). It can be observed from Section 3 that A_P enhances localization continuity.

Patch Scoring Module: The PSM employs a perturbation-based strategy to expose additional relevant and occluded object regions. It reveals more salient parts while retaining the regions captured by the CGM. The technique adapts Score-CAM [15] for transformer models. Score-CAM was initially proposed to grasp the significance of the activation maps of CNNs. However, as discussed in Section 3, employing Score-CAM directly on patch tokens can activate unwanted background. In the proposed PSM, the patch embeddings from the final encoder block {P_1, P_2, …, P_M} are first reshaped and transposed into feature maps P_F, where each feature map P_F^d, d ∈ {1, 2, …, D}, highlights different semantically related regions. However, this might also add unwanted parts to the localization results. Hence, the refined context map A_CTref is added to the feature maps to suppress the background.

  P_FT = A_CTref ⊕ P_F        (6)

where P_FT ∈ R^{N×N×D} is then upsampled and normalized:

  P̂_FT = (P_FT − min(P_FT)) / (max(P_FT) − min(P_FT))        (7)

The feature maps are superimposed over the input scan x to generate scans with partial masking. These masked images are then fed to the trained transformer model to yield target scores, which are utilized as weights to linearly combine the respective feature maps into the final threat localization map. The bounding boxes are then drawn using the technique in [12].

Dual Token Training Strategy: To capture the contextual information of the threats from the scans, it is essential to build a one-to-one association between each class token and the respective ground-truth label. This is attained by modifying the head of the proposed framework, where the final MLP head used for classification in standard transformer models is replaced with an average pooling layer. The dual output tokens from the final layer (C_Tok = [C_T, C_B], C_Tok ∈ R^{2×D}) are averaged along the embedding dimension to obtain the scores corresponding to the threat and benign classes, which are supervised by the one-hot encoded class labels, constrained via a binary cross-entropy loss.

  y(c) = (1/D) Σ_{l=1}^{D} C_Tok(c, l),  c ∈ {0, 1}        (8)

where C_Tok(c, l) is the l-th element along the embedding dimension of the c-th token. The proposed training strategy enables each of the dual tokens to model distinctive global semantic correlations specific to the two classes.

3. EXPERIMENTAL ANALYSIS AND RESULTS

The proposed context-aware baggage threat localization approach was evaluated on Compass-XP [22] and SIXray [3]. Compass-XP, released in 2019, comprises 11,568 scans (with different representations such as low- and high-energy, grayscale, color, and density variants), from which 80% were used for training, per the protocol. The SIXray dataset, on the other hand, consists of five threat categories (guns, pliers, scissors, wrenches, knives) and is very unbalanced and occluded.

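To make the token construction of Eqs. (1)–(2) concrete, the following is a minimal NumPy sketch (not the authors' code): the scan is split into non-overlapping s × s patches, vectorized, linearly projected, prefixed with the two class tokens, and shifted by positional embeddings. All sizes (s = 16, D = 384, matching DeiT-S) are illustrative, and random matrices stand in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

W = H = 224                 # input resolution (illustrative)
s = 16                      # patch size
N = W // s                  # 14 patches per side
M = N * N                   # 196 patch tokens, M = N x N
D = 384                     # embedding dimension (DeiT-S)

x = rng.standard_normal((H, W, 3))        # input scan x
F = rng.standard_normal((s * s * 3, D))   # stand-in for the learned projection F(.)

# Vectorize the non-overlapping s x s patches: shape (M, s*s*3)
patches = x.reshape(N, s, N, s, 3).transpose(0, 2, 1, 3, 4).reshape(M, -1)
x_n = patches @ F                          # M patch embeddings, shape (M, D)

x_CT = rng.standard_normal((1, D))         # threat class token
x_CB = rng.standard_normal((1, D))         # benign class token
x_pos = rng.standard_normal((2 + M, D))    # positional embeddings

# Eq. (1): x_in = [x_CT; x_CB; F(x_P1); ...; F(x_PM)] (+) x_pos
x_in = np.concatenate([x_CT, x_CB, x_n], axis=0) + x_pos
assert x_in.shape == (2 + M, D)
```

In the paper's notation, the rows of `x_in` are then [C_T; C_B; P_1; …; P_M] of Eq. (2) after the encoder blocks.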
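The CGM computation of Eqs. (3)–(5) can likewise be sketched in NumPy: per-head attention maps are fused by averaging over the k heads, the threat-token row gives the context map A_CT, and the patch-to-patch block A_P refines it. Random Q and K stand in for the projections learned by the final encoder block; sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

M, N, D, k = 196, 14, 384, 6     # patch tokens, grid side, embedding dim, heads
Dk = D // k                       # per-head dimension D_k = D / k
T = 2 + M                         # dual class tokens + patch tokens

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Q = rng.standard_normal((k, T, Dk))   # stand-ins for learned query/key projections
K = rng.standard_normal((k, T, Dk))

# Eq. (4): token-wise similarity map, fused (averaged) across the k heads
A_T = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(Dk)).mean(axis=0)   # (T, T)

# Threat-specific context map: attention from class token C_T (row 0) to patches
A_CT = A_T[0, 2:]                 # (M,)

# Eq. (5): refine with the patch-wise attention block A_P
A_P = A_T[2:, 2:]                 # (M, M)
A_CT_ref = (A_P @ A_CT).reshape(N, N)
assert A_CT_ref.shape == (N, N)
```

Averaging softmax maps keeps each row of `A_T` a valid distribution over tokens, so `A_CT` can be read directly as per-patch attention mass.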
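The PSM's perturbation step (Eqs. (6)–(7) plus the masking-and-scoring pass) can be sketched as follows. The `model_threat_score` function is a placeholder for the trained transformer's threat score, nearest-neighbour upsampling stands in for interpolation, and a small D keeps the example fast; none of these choices come from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

N, D, s = 14, 8, 16                       # grid side, feature maps (small for demo), patch size
P_F = rng.standard_normal((N, N, D))      # patch embeddings reshaped into feature maps
A_CT_ref = rng.random((N, N))             # refined context map from the CGM

# Eq. (6): add the refined context map to every feature map to suppress background
P_FT = P_F + A_CT_ref[:, :, None]

# Eq. (7): min-max normalization
P_hat = (P_FT - P_FT.min()) / (P_FT.max() - P_FT.min())

# Upsample each map to scan resolution (nearest-neighbour stand-in for interpolation)
masks = np.kron(P_hat.transpose(2, 0, 1), np.ones((s, s)))   # (D, 224, 224)

x = rng.random((N * s, N * s))            # grayscale stand-in for the input scan

def model_threat_score(img):              # placeholder for the trained transformer
    return img.mean()

# Masked scans -> target scores -> weights for the linear combination
w = np.array([model_threat_score(x * m) for m in masks])
loc_map = np.tensordot(w, masks, axes=1)  # final threat localization map
assert loc_map.shape == x.shape
```

Bounding boxes would then be extracted from `loc_map` by thresholding, as in the CAM technique of [12].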
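Finally, the dual-token head of Eq. (8) amounts to average pooling each output token over the embedding dimension and supervising the two scores with a binary cross-entropy loss. A hedged sketch, with random tokens in place of the encoder output and a sigmoid applied before the loss (an assumption; the paper does not specify the score activation):

```python
import numpy as np

rng = np.random.default_rng(3)

D = 384
C_tok = rng.standard_normal((2, D))   # dual output tokens [C_T, C_B]

# Eq. (8): average pooling along the embedding dimension replaces the MLP head
y = C_tok.mean(axis=1)                # scores for threat (c = 0) and benign (c = 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Binary cross-entropy against one-hot labels, e.g. a threat scan -> [1, 0]
t = np.array([1.0, 0.0])
p = sigmoid(y)
bce = -(t * np.log(p) + (1 - t) * np.log(1 - p)).mean()
assert np.isfinite(bce)
```

Because each token is pooled separately against its own label, gradients tie C_T to threat evidence and C_B to benign evidence, which is what makes the attention of C_T class-specific.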
Table 1. Performance on Compass-XP [22] and SIXray [3].

Methods             Compass-XP        SIXray
                    Top-1  GT-Known   Top-1  GT-Known  Loc.
Grad-CAM (ResNet)   36.5   39.2       22.6   30.1      -
Ablation-CAM        35.9   38.3       21.9   28.8      -
TS-CAM              43.4   45.1       33.8   35.2      -
CHR [3]             -      -          -      -         54.8
Ours                55.3   58.2       37.6   38.3      82.9

Table 2. Comparative study to analyze the significance of CGM and PSM in the proposed method.

Dataset      Method     Top-1  GT-Known
Compass-XP   CGM Only   43.8   45.2
             PSM Only   41.1   42.6
SIXray       CGM Only   31.3   34.1
             PSM Only   29.7   34.4

Fig. 3. Visualization of baggage threat localization using different methods. Results in columns 2 and 3 are based on CNNs, while TS-CAM (column 4) is based on transformers.

Fig. 4. Qualitative analysis to study the significance of CGM and PSM in the proposed framework.

We assessed the architecture using the SIXray10 subset, which consists of 89,290 benign scans and 8,910 threat scans.

Implementation: The proposed framework was constructed using the ImageNet-trained DeiT-S backbone [18]. In particular, we used its single class token to initialize the proposed dual-token architecture. During training, data augmentation was applied (horizontal and vertical flipping), and scans were resized to 224 × 224. The framework was implemented using PyTorch on a machine with an Intel(R) Core(TM) i7-10700K CPU @ 3.80 GHz and an NVIDIA GeForce RTX 3060 Ti, and the model was trained for 20 epochs with batch size 12 and an initial learning rate of 2e-5.

Evaluation Metrics: We adopted the commonly used metrics as in [13]: GT-Known Localization Accuracy (positive for over 50% IoU between the predicted and ground-truth bounding boxes) and Top-1 Localization Accuracy (positive if correctly classified with over 50% IoU between the predictions and ground truth). In addition, we computed the localization metric (Loc.) as in [3] for the comparative study (considered positive if the maximal response falls within one of the ground-truth boxes).

Main Results: As evident from Table 1, the proposed framework outperforms TS-CAM [11] in terms of both Top-1 Loc. Acc. and GT-Known Loc. Acc., yielding 55.3% and 58.2%, respectively, on Compass-XP [22]. On SIXray, the proposed framework delivers the best results, achieving a Top-1 Loc. Acc. of 37.6% and a GT-Known Loc. Acc. of 38.3%. Further, we computed the localization metric per the protocol in [3] for comparative analysis, surpassing [3] by 28%. The comparatively lower results on SIXray are primarily due to the heavy occlusion and compactly packed scans in the dataset. The qualitative analysis is given in Fig. 3, which depicts the superior results of the proposed method, localizing the knife that overlays the metal band in Row 2 and identifying both instances of guns in Row 3.

Ablative Study: We also qualitatively and quantitatively assessed our approach to analyze the significance of both CGM and PSM, as shown in Table 2 and Fig. 4. From Fig. 4, it can be observed that our approach can expertly localize occluded baggage threats. As can be seen from the top row, our approach localizes the second occluded knife (shown using a bounding box), which is not possible with the CGM or PSM block alone.

4. CONCLUSION

This work presents a context-aware transformer framework for weakly supervised X-ray baggage threat localization by encoding the object-level context of concealed security threats. The proposed dual-token architecture can generalize well to different threat categories by learning the threat-specific semantics from the token-wise attention to generate context maps. The patch tokens from the transformer output are then scored to expose other salient regions, including occluded threats. Experiments on two public X-ray baggage datasets demonstrate the superiority of the approach.

5. REFERENCES

[1] Divya Velayudhan, Taimur Hassan, Ernesto Damiani, and Naoufel Werghi, "Recent advances in baggage threat detection: A comprehensive and systematic survey," ACM Computing Surveys (CSUR), 2022.

[2] Divya Velayudhan, Taimur Hassan, Abdelfatah Hassan Ahmed, Ernesto Damiani, and Naoufel Werghi, "Baggage threat recognition using deep low-rank broad learning detector," in 2022 IEEE 21st Mediterranean Electrotechnical Conference (MELECON), 2022, pp. 966–971.

[3] C. Miao, L. Xie, F. Wan, C. Su, H. Liu, J. Jiao, and Q. Ye, "SIXray: A large-scale security inspection X-ray benchmark for prohibited item discovery in overlapping images," IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[4] Abdelfatah Ahmed, Ahmad Obeid, Divya Velayudhan, Taimur Hassan, Ernesto Damiani, and Naoufel Werghi, "Balanced affinity loss for highly imbalanced baggage threat contour-driven instance segmentation," in 2022 IEEE International Conference on Image Processing (ICIP), 2022, pp. 981–985.

[5] Taimur Hassan, Samet Akçay, Mohammed Bennamoun, Salman Khan, and Naoufel Werghi, "Unsupervised anomaly instance segmentation for baggage threat recognition," Journal of Ambient Intelligence and Humanized Computing, pp. 1–12, 2021.

[6] Taimur Hassan, Samet Akcay, Mohammed Bennamoun, Salman Khan, and Naoufel Werghi, "Tensor pooling-driven instance segmentation framework for baggage threat recognition," Neural Computing and Applications, vol. 34, no. 2, pp. 1239–1250, 2022.

[7] Renshuai Tao, Yanlu Wei, Xiangjian Jiang, Hainan Li, Haotong Qin, Jiakai Wang, Yuqing Ma, Libo Zhang, and Xianglong Liu, "Towards real-world X-ray security inspection: A high-quality benchmark and lateral inhibition module for prohibited items detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10923–10932.

[8] Boying Wang, Libo Zhang, Longyin Wen, Xianglong Liu, and Yanjun Wu, "Towards real-world prohibited item detection: A large-scale X-ray benchmark," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5412–5421.

[9] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.

[10] Harish Guruprasad Ramaswamy et al., "Ablation-CAM: Visual explanations for deep convolutional network via gradient-free localization," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 983–991.

[11] Wei Gao, Fang Wan, Xingjia Pan, Zhiliang Peng, Qi Tian, Zhenjun Han, Bolei Zhou, and Qixiang Ye, "TS-CAM: Token semantic coupled attention map for weakly supervised object localization," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2886–2895.

[12] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba, "Learning deep features for discriminative localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.

[13] Xiaolin Zhang, Yunchao Wei, Jiashi Feng, Yi Yang, and Thomas S. Huang, "Adversarial complementary learning for weakly supervised object localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1325–1334.

[14] Sahil Singla, Besmira Nushi, Shital Shah, Ece Kamar, and Eric Horvitz, "Understanding failures of deep networks via robust feature extraction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12853–12862.

[15] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu, "Score-CAM: Score-weighted visual explanations for convolutional neural networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 24–25.

[16] Bilal Hassan, Shiyin Qin, Taimur Hassan, Ramsha Ahmed, and Naoufel Werghi, "Joint segmentation and quantification of chorioretinal biomarkers in optical coherence tomography scans: A deep learning approach," IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–17, 2021.

[17] E. A. Hadhrami, M. A. Mufti, B. Taha, and N. Werghi, "Transfer learning with convolutional neural networks for moving target classification with micro-Doppler radar spectrograms," in 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD 2018), 2018, pp. 148–154.

[18] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou, "Training data-efficient image transformers & distillation through attention," in International Conference on Machine Learning. PMLR, 2021, pp. 10347–10357.

[19] Hui Su, Yue Ye, Zhiwei Chen, Mingli Song, and Lechao Cheng, "Re-attention transformer for weakly supervised object localization," arXiv preprint arXiv:2208.01838, 2022.

[20] Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H. Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang, "Intriguing properties of vision transformers," Advances in Neural Information Processing Systems, vol. 34, pp. 23296–23308, 2021.

[21] Jiwoon Ahn and Suha Kwak, "Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4981–4990.

[22] Matthew Caldwell and Lewis D. Griffin, "Limits on transfer learning from photographic image data to X-ray threat detection," Journal of X-ray Science and Technology, vol. 27, no. 6, pp. 1007–1020, 2019.
