
Concealed Object Detection For Passive Millimeter-Wave Security Imaging Based On Task-Aligned Detection Transformer

This paper introduces a Task-Aligned Detection Transformer network, PMMW-DETR, designed for concealed object detection in passive millimeter-wave (PMMW) security imaging. The proposed network addresses challenges such as low resolution and high noise in PMMW images by employing a Denoising Coarse-to-Fine Transformer backbone and a Query Selection module to enhance semantic perception. Experimental results demonstrate that PMMW-DETR outperforms existing methods in accuracy and classification confidence, showcasing its robustness in detecting concealed objects in low-quality PMMW images.

Concealed Object Detection for Passive Millimeter-Wave Security Imaging Based on Task-Aligned Detection Transformer

Cheng Guo, Fei Hu, and Yan Hu

arXiv:2212.00313v1 [cs.CV] 1 Dec 2022

Manuscript received XX XX, 2022; revised XX XX, 2022. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61871438. (Corresponding author: Fei Hu.)
C. Guo, F. Hu, and Y. Hu are with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China, and F. Hu is also with the National Key Laboratory of Science and Technology on Multi-Spectral Information Processing, Wuhan 430074, China (e-mail: [email protected], [email protected], dawal-[email protected]).

Abstract—Passive millimeter-wave (PMMW) imaging is a promising technique for human security screening. Several popular object detection networks have been used for PMMW images. However, restricted by the low resolution and high noise of PMMW images, deep-learning-based PMMW hidden object detection usually suffers from low accuracy and low classification confidence. To tackle these problems, this paper proposes a Task-Aligned Detection Transformer network, named PMMW-DETR. In the first stage, a Denoising Coarse-to-Fine Transformer (DCFT) backbone is designed to extract long- and short-range features at different scales. In the second stage, we propose the Query Selection module to introduce learned spatial features into the network as prior knowledge, which enhances the semantic perception capability of the network. In the third stage, to improve classification performance, we use a Task-Aligned Dual-Head block to decouple the classification and regression tasks. Based on our self-developed PMMW security screening dataset, experimental results, including a comparison with state-of-the-art (SOTA) methods and an ablation study, demonstrate that PMMW-DETR obtains higher accuracy and classification confidence than previous works and is robust to low-quality PMMW images.

Index Terms—Millimeter-wave radiometry, deep learning, object detection, transformer, security screening.

I. INTRODUCTION

Passive millimeter-wave (PMMW) imaging has attracted increasing attention from academia and industry in recent years [1]–[3]. Similar to sensors in the infrared and visible bands, PMMW sensors receive the self-emitted radiation of objects and reflected environmental radiation. PMMW sensors, on the other hand, do not require an irradiation source and can detect through smoke, dust, and light rain, which makes PMMW imaging capable of all-day and quasi-all-weather operation, so it is widely used in astronomy and remote sensing [4], [5]. With the improvement in resolution brought by the development of millimeter-wave (MMW) devices, MMW radiometry is gradually being applied to information acquisition from close-range objects [6]–[9]. Among these applications, the most widely used is the detection of concealed objects such as firearms, gasoline, and knives in human security screening, owing to the harmlessness of MMW to humans and its ability to penetrate textile materials.

Since the 1990s, researchers have designed a variety of PMMW security systems and conducted many human imaging experiments [8], [10], [11]. However, such systems are still not commercially available, mainly because concealed object detection algorithms for PMMW security images lag behind the hardware. The PMMW radiation signal is weak, with only one hundred thousandths of the infrared signal, so the image contains a lot of noise. Meanwhile, the resolution of PMMW images is much lower than that of infrared and visible images, making concealed object detection a challenging task.

Many methods have been proposed for the detection of concealed objects in PMMW security images. Earlier research was mainly based on statistical learning and machine learning. For example, Martinez et al. [12] used the Iterative Steering Kernel Regression (ISKR) method for denoising and the Local Binary Fitting (LBF) method to segment hidden objects and the human body. Lee et al. [13] applied Bayesian learning and expectation maximization algorithms to detect and segment concealed objects. Building on Lee's work, Yeom et al. [14], [15] further proposed the use of principal component analysis (PCA) to extract object features and measure the similarity between the object and the ground truth. Lopez et al. [16] extracted Haar features from images and used classical machine learning algorithms such as random forest and support vector machine for detection.

The need for manually designed features is a common disadvantage of the above methods; it leads to poor generalization, and the methods do not adapt well when the object distribution changes. Deep learning has been very successful in the field of computer vision, and many networks have achieved impressive performance on available datasets. Inspired by these works, deep learning, especially the convolutional neural network (CNN), has in recent years been applied to PMMW image feature extraction and object detection, which avoids the need to manually design complex features. For example, Lopez et al. [17] performed smoothing preprocessing and applied SCNN networks for object segmentation. Mainstream networks in industrial applications, such as YOLOv3 and SSD, have also been used for concealed object detection in PMMW security images [18], [19]. The common shortcoming of the above works is that they all need to set the prior anchor boxes manually, whose shape and size have a significant impact on the final performance of

the network.

The attention mechanism [20], [21] is a bio-inspired approach modeled on human vision that lets the detection network focus on the object region of interest by computing the mutual correlation between sample elements. With its global perception capability and superior detection performance, Transformer excels in the field of computer vision. Typical computer vision Transformer models include the Vision Transformer (ViT) [22], Swin Transformer [23], and Detection Transformer (DETR) [24]. ViT pioneered the application of Transformer to computer vision tasks by treating each image patch as a vector. However, the computation required by ViT to apply the attention mechanism directly to the full image is very large, so Swin Transformer proposed a windowed attention mechanism and a shifted windowed attention mechanism to address this problem, reducing the computation of the detection network while weakening its long-range sensing capability. DETR, on the other hand, pioneered a set-prediction-based object detection method that eliminates hand-designed components such as prior anchor boxes and post-processing such as Non-Maximum Suppression (NMS). However, DETR converges very slowly due to the cross-attention mechanism and the instability brought by the randomness of its bipartite graph matching. Because they were designed for optical images, both Swin and DETR perform poorly on the PMMW dataset, so it is necessary to design a Transformer that adapts to the characteristics of PMMW imaging.

Yang et al. [25] proposed the PMMW-Transformer for human security screening, an anchor-free Transformer network that implements adaptive assignment of positive and negative sample labels. The network proposes a new local-global attention mechanism and an Attention Weighting Module inspired by Twins [26], Swin, ATSS [27], and other networks. However, it does not adequately address the problems of difficult classification on low-quality images, low AP at high IoU thresholds, and difficult convergence on small datasets. The main reasons are the following.

First, PMMW security images have low resolution and little texture information, which easily leads to low precision and low recall. Second, due to the intrinsic conflict between classification and regression [28], the network model has difficulty learning classification information on such a dataset. Third, the inductive bias of Transformer is much weaker than that of CNN and RNN, which makes the network harder to train on small-scale datasets [29], especially for data with high noise such as PMMW security images.

To address the characteristics of PMMW security images in security scenarios, this paper proposes a task-aligned detection Transformer network named PMMW-DETR. The backbone of the network consists of several hierarchical multi-scale Denoising Coarse-to-Fine Transformer (DCFT) attention blocks that extract features, and the extracted feature maps enter the neck, composed of an encoder and a query selection module, for further feature aggregation. After a spatial prior is introduced, the output features are passed to two decoders that perform the classification and regression tasks; the queries of both decoders are shared to ensure that the tasks are aligned. The main contributions of this paper are as follows:

1) For the classification-regression conflict noted above, we propose a new task-aligned dual-head block design, in which two decoders share queries while performing different (classification/regression) tasks with different parameters, achieving decoupled classification and regression in two different feature spaces.

2) We propose a query selection method adapted to the security check scenario, i.e., introducing a spatial prior to the detection head. Thus, the semantic perception capability of the network is enhanced. In addition, we introduce an interaction method between the two decoders for the selected queries, which we call query sharing.

3) To balance global information perception and computational effort, we propose a coarse-to-fine multi-scale DCFT attention block. The coarse global attention helps the network understand human body semantic information, reducing false alarms caused by object-like backgrounds such as gaps between limbs and torso, while the fine local attention provides rich classification information. In addition, by redesigning the pooling part of the block, we improve the robustness of the network to measurement noise.

4) PMMW-DETR surpasses popular SOTA models in accuracy. To train our network, we contribute a new dataset that contains 4543 images with high-quality annotations. To the best of our knowledge, this is the dataset with the best image quality to date, representing the most advanced PMMW imaging technology. With this dataset, we compare the classification performance of different models for the first time.

II. PROPOSED METHOD

A. The Overall Structure

To solve the problems of insufficient information and difficult convergence on the PMMW image dataset, we propose a Transformer-based concealed object detection network named PMMW-DETR. Like current mainstream detection networks, our Transformer network consists of a backbone, a neck, and a head. As shown in Figure 1, the backbone mainly extracts low-level image features, the neck further aggregates and extracts high-dimensional features, and the head performs the classification and localization tasks on the obtained feature maps. For the problems mentioned in Section I, we propose 1) a coarse-to-fine multi-scale DCFT attention block that balances global information and computational effort, 2) a multi-scale deformable attention module that effectively integrates independent semantic features, together with a query selection module that introduces a spatial prior for the detection head, and 3) a task-aligned dual-head block with query sharing that fully aligns the classification and localization tasks, on the backbone, neck, and head, respectively.

B. Denoising Coarse-to-Fine Transformer (DCFT) Attention-Based Backbone

The attention mechanism has the advantage of long-range perception compared to convolution. The ability to fully extract features of the whole picture makes it well suited for
[Figure 1 (block diagram): the input image (H × W × 1) passes through the backbone of four DCFT attention blocks (Stages 1 to 4), giving feature maps of size H/4 × W/4 × c, H/8 × W/8 × 2c, H/16 × W/16 × 4c, and H/32 × W/32 × 8c. The neck applies a deformable attention encoder with positional embeddings and the query selection block. The task-aligned dual-head block contains two PMMW-DETR decoders linked by query sharing: the classification head outputs labels with a confidence vector, and the regression head outputs a box vector.]
Fig. 1. The overall structure of the Task-aligned PMMW-DETR.
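The feature-map sizes annotated in Fig. 1 follow from a stride-4 patch embedding in the first stage and, in each later stage, a halving of the spatial size with a doubling of the channel dimension. The short Python sketch below reproduces that bookkeeping only. The function name `stage_shapes` and the example values of H, W, and c are illustrative assumptions and are not taken from the authors' code.

```python
def stage_shapes(height, width, channels, num_stages=4):
    """Return (H_i, W_i, C_i) for each backbone stage.

    Stage 1 downsamples by 4 (patch embedding with stride 4); every later
    stage halves the spatial size and doubles the channel dimension.
    """
    shapes = []
    h, w, c = height // 4, width // 4, channels
    for stage in range(num_stages):
        if stage > 0:
            h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes


if __name__ == "__main__":
    # Hypothetical input size and base width c, chosen only for illustration.
    shapes = stage_shapes(256, 128, 96)
    for index, shape in enumerate(shapes, start=1):
        print(f"stage {index}: {shape}")
    # The outputs of the last three stages are the multi-scale input of the neck.
    print("to neck:", shapes[1:])
```

Under these assumptions the last three tuples correspond to the H/8, H/16, and H/32 maps that Fig. 1 routes to the neck.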

detection tasks with little local information, such as PMMW security screening. However, the original attention mechanism performs fine-grained operations over the whole image, resulting in a computational overhead that is quadratic in the image size [23], which is unacceptable. Although Swin Transformer proposed the local attention mechanism and the shifted window attention mechanism to reduce the computational cost, the long- and short-range perception capability of the attention mechanism is greatly weakened because the window is far smaller than the image, which might cause false alarms due to similar local features between objects and the background.

To balance the performance and computational overhead of the attention mechanism, we propose the DCFT attention-based backbone, which consists of four DCFT attention blocks, as shown in Figure 1. To obtain a hierarchical multi-scale backbone structure, we partition each image $I \in \mathbb{R}^{H \times W}$ into patches of size $4 \times 4$, resulting in $\frac{H}{4} \times \frac{W}{4}$ visual tokens with dimension $4 \times 4$. Then, as shown in Figure 2(a), a patch embedding layer consisting of a convolutional layer with stride and filter size both equal to 4 is used to project the patches into the feature map $z \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times c}$. After the projection, similar operations are performed on the feature map in the four stages of DCFT attention blocks. At each stage $i \in \{1, 2, 3, 4\}$, the DCFT attention block consists of $N_i$ DCFT layers. In particular, the patch embedding layer in stages $i \in \{2, 3, 4\}$ reduces the spatial size of the feature map by a factor of 2 and increases the feature dimension by a factor of 2, which differs from the first stage described above. The outputs of the last three stages form the multi-scale feature maps, which are then fed to the neck.

Figure 2(b) illustrates a single DCFT attention block. Considering that the visual dependencies between close regions are usually stronger than those between distant regions, we maintain fine-grained attention on the pixels closest to the query tokens in the feature map and capture coarse-grained attention on distant regions after pooling. The attention mechanism therefore has a far-to-close, coarse-to-fine character, which gives a reasonable trade-off between computational overhead and long-range perception capability.

As shown in Figure 2(b), for an input feature map $z \in \mathbb{R}^{H \times W \times c}$, we first partition it into a grid of query windows of size $n_{wp} \times n_{wp}$, in which the query tokens inside each window share the same surroundings. Each query window passes through a linear projection layer to obtain $Q_i = f_q(z^1)$ for the attention computation, where $i$ indexes the $i$-th query window. For the $i$-th query window $Q_i \in \mathbb{R}^{n_{wp} \times n_{wp} \times d}$, we split the input feature map $z$ into $N^l \times N^l$ sub-windows. Then we perform pooling for each feature map level $l \in \{1, \dots, L\}$ to generate fine-grained and coarse-grained information. Since PMMW security images bear much more random noise than optical images, we adopt a denoising kernel in the pooling operation, as it has been shown in [17] that denoising pre-processing is beneficial for object detection in PMMW security screening. A simple linear layer $f_p^l$ is used to pool the sub-windows spatially by:

$$z^l = f_p^l(\hat{z}) \in \mathbb{R}^{\frac{H}{n^l} \times \frac{W}{n^l} \times c}, \qquad \hat{z} = \mathrm{Reshape}(z) \in \mathbb{R}^{\left(\frac{H}{n^l} \times \frac{W}{n^l} \times c\right) \times \left(n^l \times n^l\right)}, \tag{1}$$

where $n^l \in \{n^1, \dots, n^L\}$ is the sub-window size at level $l$, and $n^1$ of the first feature map level is set to 1 to extract the finest-granularity local feature. Once we obtain the pooled feature maps $\{z^1, \dots, z^L\}$ at all $L$ levels, we compute the key and value for all levels using two linear projection layers $f_k$ and $f_v$, i.e., $K_i = f_k(\{z^1, \dots, z^L\})$ and $V_i = f_v(\{z^1, \dots, z^L\})$.

To perform DCFT attention, we first need to extract the surrounding tokens for each query token in the feature map. As mentioned earlier, query tokens inside a query window of size $n_{wp} \times n_{wp}$ share the same surroundings. For the queries inside the $i$-th query window $Q_i \in \mathbb{R}^{n_{wp} \times n_{wp} \times d}$, we extract the $N^l \times N^l$ keys and values from $K^l$ and $V^l$ around the window in which the query lies, and then gather the keys and values from all $L$ levels to obtain $K_i = \{K_i^1, \dots, K_i^L\} \in \mathbb{R}^{N \times c}$ and
[Figure 2 (diagram): (a) DCFT Attention Block: a Patch Embedding layer followed by $N_i$ DCFT layers, each made of LayerNorm, DCFT Self-Attention, LayerNorm, and a Multi-layer Perceptron. (b) DCFT Self-Attention: the input feature map $z$ is partitioned into query windows of size $n_{wp} \times n_{wp}$; for query window $i$, three levels with sub-window sizes $n^1$, $n^2$, $n^3$ are pooled by denoising pooling into $z^1$, $z^2$, $z^3$ of size $N^l \times N^l \times c$ and passed to multi-head self-attention. Attended patches range from fine (close) to coarse (far).]

Fig. 2. An illustration of DCFT Attention Block. (a) A DCFT Attention Block consists of a Patch Embedding layer and Ni DCFT layers. (b) An illustration
of the DCFT attention at window levels. Each of the finest square cells represents a query token.
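The sub-window pooling of Eq. (1) can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the learned linear layer $f_p^l$ with its denoising kernel is replaced by a fixed uniform (mean) kernel, the feature map has a single channel, and the names `pool_level` and `sub_window_sizes` are hypothetical.

```python
def pool_level(z, n):
    """Pool an H x W map (list of lists) over non-overlapping n x n sub-windows.

    Stand-in for f_p^l in Eq. (1): a uniform kernel (plain mean) is assumed
    here, whereas the paper uses a learned linear layer with a denoising kernel.
    """
    height, width = len(z), len(z[0])
    if height % n != 0 or width % n != 0:
        raise ValueError("map size must be divisible by the sub-window size")
    pooled = []
    for top in range(0, height, n):
        row = []
        for left in range(0, width, n):
            window = [z[top + dy][left + dx] for dy in range(n) for dx in range(n)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    z = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    sub_window_sizes = [1, 2, 4]  # hypothetical n^l; n^1 = 1 keeps the finest level
    for level, n in enumerate(sub_window_sizes, start=1):
        z_l = pool_level(z, n)
        print(f"level {level}: {len(z_l)} x {len(z_l[0])} ->", z_l)
```

With $n^1 = 1$ the first level returns the map unchanged, which matches the statement that the first level keeps the finest granularity, while larger $n^l$ give progressively coarser maps of size $\frac{H}{n^l} \times \frac{W}{n^l}$.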

$V_i = \{V_i^1, \dots, V_i^L\} \in \mathbb{R}^{N \times c}$, where $N$ is the sum of the DCFT regions from all levels, i.e., $N = \sum_{l=1}^{L} (N^l)^2$. Finally, we follow [23] to introduce a relative position bias and compute the DCFT attention for $Q_i$ by:

$$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{Softmax}\left(\frac{Q_i K_i^T}{\sqrt{c}} + B\right) V_i, \tag{2}$$

where $B = \{B^1, \dots, B^L\}$ is the learnable relative position bias. Similar to [23], we parameterize the bias $B^1 \in \mathbb{R}^{(2n_{wp}-1) \times (2n_{wp}-1)}$ in the first level, since the relative positions along the horizontal and vertical axes both lie in $[-n_{wp}+1, n_{wp}-1]$. After obtaining the attention scores of the feature map, we send the scores to the LayerNorm and Multi-Layer Perceptron (MLP) blocks.

The overall computational cost of our DCFT attention becomes $O\left(\left(L + \sum_l (n^l)^2\right)(HW)c\right)$, while the computational costs of the original ViT and Swin Transformer are $O\left((4c + 2HW)(HW)c\right)$ and $O\left((4c + 2n_{wp}^2)(HW)c\right)$, where $n_{wp}$ is the window partition size of Swin Transformer, which is usually set to 7. It can be seen that our proposed DCFT attention is more accurate with less computational complexity; the accuracy comparison experiments are given in the ablation study in Table IV.

C. Multi-Scale Deformable Attention Neck

Since PMMW security images contain high noise and sparse texture, the object and the noise are relatively similar in the low-level feature dimension, which leads to network confusion and convergence difficulties. Therefore, a neck that further learns the high-dimensional semantic features and aggregates different classes of features is highly desired. DETR has been very successful in this direction. DETR proposes the idea of treating the object detection task as a set prediction to eliminate post-processing (e.g., NMS) while extracting the semantic features of individual instances well. The shortcomings of DETR are its slow convergence rate and data-hungry characteristics [30], which are intensified in the low-resolution PMMW image situation. The deformable attention module [31] was proposed in response, and its data-efficient properties have been shown [30] to alleviate the inherent data-hungry problem of the DETR model. We follow the design of Deformable DETR, using the deformable attention module for further feature aggregation of the multi-scale feature maps output by the backbone.

Also, some recent work has demonstrated that the encoder of DETR is functionally identical to the region proposal network in traditional two-stage networks [31], [32], and the queries output by the encoder can be regarded as proposals in the region proposal network. Inspired by these works, we propose a method for introducing a spatial prior adapted to security screening, called query selection.

1) Deformable Attention Module: The deformable attention module only attends to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps, as shown in Fig. 3. We first take the output multi-scale feature maps $\{z^s\}_{s=1}^{S} \in \mathbb{R}^{H \times W \times c}$ from the backbone, the content feature $z_q \in \mathbb{R}^{C}$, and its corresponding normalized coordinates of the 2-D reference point $\hat{p}_q \in [0, 1]^2$ as the input of deformable attention, where $q$ indexes a query element and $s$ indexes the input feature map level. Then, via two independent linear projection layers, we use the query feature $z_q$ to obtain the sampling offset $\Delta p_{msqk}$ and the attention weight $A_{msqk}$, where $m$ indexes the attention head, $k$ indexes the sampling point, and $\sum_{s=1}^{S} \sum_{k=1}^{K} A_{msqk} = 1$. The calculation of the multi-scale deformable attention can be expressed as:
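[Figure 3 (schematic): image spatial features and position information enter the encoder, built from the deformable attention module (labeled ×4). In query selection, a class embedding and an anchor embedding produce classification scores and anchors; top-K selection keeps the anchor proposals (the remaining anchors are unselected), and anchor encoding turns the proposals into dynamic anchor queries. Together with the static content queries, they feed the dual-head decoder, which outputs labels and anchors.]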

Fig. 3. Schematic of deformable attention neck. The Multi-Scale feature maps of the encoder output are used for the query selection and the cross-attention
in the decoder.
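The complexity expressions quoted for the DCFT attention in Section II-B can be compared numerically. The Python sketch below simply evaluates the three asymptotic expressions as they are written in the text; the results are not measured FLOPs, and the feature-map size, channel width, and sub-window sizes in the example are illustrative assumptions.

```python
def vit_cost(h, w, c):
    """Global self-attention, as quoted: (4c + 2HW) * HW * c."""
    return (4 * c + 2 * h * w) * h * w * c


def swin_cost(h, w, c, n_wp=7):
    """Window attention with window size n_wp: (4c + 2 * n_wp^2) * HW * c."""
    return (4 * c + 2 * n_wp ** 2) * h * w * c


def dcft_cost(h, w, c, sub_window_sizes):
    """DCFT attention, as quoted: (L + sum_l (n^l)^2) * HW * c."""
    levels = len(sub_window_sizes)
    return (levels + sum(n ** 2 for n in sub_window_sizes)) * h * w * c


if __name__ == "__main__":
    h, w, c = 56, 56, 96          # hypothetical stage-1 feature map
    sizes = [1, 2, 4]             # hypothetical sub-window sizes n^l
    print("ViT :", vit_cost(h, w, c))
    print("Swin:", swin_cost(h, w, c))
    print("DCFT:", dcft_cost(h, w, c, sizes))
```

The global-attention term grows with $(HW)^2$, while the other two grow linearly in $HW$, which is the point the text makes. Because the quoted DCFT expression omits the projection term that the other two include, the absolute numbers are only indicative.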

$$\mathrm{MSDeformAttn}\left(z_q, \hat{p}_q, \{z^s\}_{s=1}^{S}\right) = \sum_{m=1}^{M} W_m \left[ \sum_{s=1}^{S} \sum_{k=1}^{K} A_{msqk} \cdot W'_m z^s\left(\phi_s(\hat{p}_q) + \Delta p_{msqk}\right) \right], \tag{3}$$

where $\phi_s(\hat{p}_q)$ re-scales the normalized coordinates $\hat{p}_q$ to the input feature map of the $s$-th level, and $W'_m \in \mathbb{R}^{C_v \times C}$ and $W_m \in \mathbb{R}^{C \times C_v}$ are learnable weights ($C_v = C/M$).

2) Query Selection Module: As shown on the right side of Figure 3, the object queries used as input to the decoder contain spatial queries and content queries [32], [33]. In DETR [24], both queries are randomly initialized as static embeddings without using any encoder features as a spatial prior. For the spatial queries, DAB-DETR [33] has shown that anchor boxes are better queries for DETR; it learns anchor boxes directly instead of learning the spatial queries. For the content queries, DETR sets them as all-zero vectors, while Deformable DETR [31] learns content queries together with the spatial queries, which is an alternative approach to query initialization. Furthermore, Deformable DETR also has a similar query selection module called "two-stage", in which both the spatial and content queries are generated from the top-K selected features. Similarly, Efficient DETR [34] also selects top-K features based on the classification score of each image spatial feature. Furthermore, unlike the static 2-D queries in DETR, the dynamic 4-D anchor boxes initialized in PMMW-DETR are more closely related to spatial queries, so they can serve as the spatial prior introduced by query selection.

Inspired by the above practice [24], [31], [34], we propose a query selection module that combines static content query initialization with dynamic anchor query selection. Dynamic anchor queries help to enhance the spatial queries. However, since the multi-scale features contain rich content information, we make the content queries static and learnable to avoid introducing incomplete objects that may lose information and mislead the decoder. In more detail, as shown in Figure 3, the output multi-scale features are first embedded into classification scores and anchors by the class embedding layer and the anchor embedding layer, respectively. Then we select the top-K anchors as proposals based on the classification score. Finally, we obtain the dynamic anchor queries by encoding the selected anchor proposals, while initializing the static content queries as in the original DETR.

With the query selection module, the decoder of PMMW-DETR learns the spatial distribution information directly from the feature maps output by the encoder, instead of starting from zero with blind queries. It has been shown in [31] that each object query attends to objects within its own spatial distribution range, and human security images all share a highly similar subject, the human body. Besides, as mentioned above, to avoid misleading the network, we refine the content queries layer by layer. The detailed comparative ablation experiments are given in Table III.

D. Task-Aligned Dual-Head Block

Most current concealed object detection networks share a head for both classification and bounding box regression. However, [28], [35], [36] have shown that there is a spatial misalignment between classification and regression tasks, that
[Figure 4 (comparison diagram): (a) previous end-to-end detectors: an encoder feature extractor and randomly initialized object queries feed a cascade feature refiner of 6× decoder layers, followed by a detection head. (b) ours: an encoder feature extractor and query selection provide the object queries to the task-aligned dual-head block, in which three classification layers produce the classification output (labels) and three regression layers produce the regression output.]

Fig. 4. Previous end-to-end detectors like DETR and Deformable DETR (a) and ours (b).
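The top-K step of the query selection module in Section II-C can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: the class embedding and anchor embedding layers are taken as already applied, so the inputs are one classification score and one normalized (x, y, w, h) anchor per encoder feature, and the name `select_top_k_anchors` is hypothetical. The static content queries and the anchor encoding are not modeled here.

```python
def select_top_k_anchors(class_scores, anchors, k):
    """Keep the k anchors with the highest classification score.

    class_scores: one score per encoder feature (hypothetical flat list).
    anchors: matching (x, y, w, h) tuples in normalized coordinates.
    Returns (selected_anchors, selected_indices), highest score first.
    """
    if len(class_scores) != len(anchors):
        raise ValueError("scores and anchors must have the same length")
    order = sorted(range(len(class_scores)),
                   key=lambda i: class_scores[i], reverse=True)
    top = order[:k]
    return [anchors[i] for i in top], top


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.4, 0.7]
    boxes = [(0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.6),
             (0.2, 0.8, 0.1, 0.1), (0.6, 0.4, 0.2, 0.5)]
    proposals, indices = select_top_k_anchors(scores, boxes, k=2)
    print(indices, proposals)
```

The selected anchors play the role of the proposals that are subsequently encoded into dynamic anchor queries, while the unselected ones are discarded.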

is, the two tasks have different sensitive locations for the same object. For example, some salient areas may be beneficial for classification, while the boundary might hold rich information for regression. The spatial task misalignment greatly limits the performance of concealed object detection. Therefore, we propose a task-aligned dual-head block. We perform concealed object classification and regression independently by using two independent branches in parallel heads. However, such a two-branch design might lead to a lack of interaction between the two tasks, resulting in inconsistent predictions. To avoid this problem, we propose query sharing to explicitly align the two tasks with the sharing query cross attention.

As shown in Figure 4, the proposed task-aligned dual-head block has two different types of decoder layers with query sharing modules. Different from the six-layer structure of the original DETR decoder, we design a decoder layer with two branches and add the decoder to the head to refine the features. The dual-head design decouples the classification and regression tasks, avoiding the inherent conflict between them. The regression head consists of regression layers on the left, and on the right side is the classification head consisting of classification layers. There are three regression layers and three classification layers, so the total number of layers is 6, which is consistent with the original DETR. In the PMMW-DETR network, the regression layer is responsible for refining the anchor boxes, while the classification layer refines the class labels. Therefore, in the following, we call the regression layer the anchor refinement layer and the classification layer the class refinement layer.

Figure 5 shows the specific data flow of the task-aligned dual-head block. Each block layer includes a self-attention module and a cross-attention module. The self-attention module is used for query updating, and can be expressed as:

$$\mathrm{SelfAttn}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V, \tag{4}$$

where $d$ is the dimension of the query and key vectors, and the queries $Q_q$, keys $K_q$, and values $V_q$ can be expressed as

$$Q_q = C_q + P_q, \quad K_q = C_q + P_q, \quad V_q = C_q, \tag{5}$$

where $C_q \in \mathbb{R}^{D}$ indicates the learnable content query, and $P_q \in \mathbb{R}^{D}$ indicates the spatial query generated by

$$P_q = \mathrm{MLP}\left\{\mathrm{Cat}\left[\mathrm{PE}(x_q), \mathrm{PE}(y_q), \mathrm{PE}(w_q), \mathrm{PE}(h_q)\right]\right\}, \tag{6}$$

where $(x_q, y_q, w_q, h_q)$ is the $q$-th anchor generated by the query selection module; $\mathrm{PE}: \mathbb{R} \to \mathbb{R}^{D/2}$ is the positional encoding function that generates sinusoidal embeddings; $\mathrm{Cat}$ denotes concatenation; and $\mathrm{MLP}: \mathbb{R}^{2D} \to \mathbb{R}^{D}$ is composed of two linear layers.

The cross-attention module is used for feature probing, where the queries, keys, and values can be expressed as

$$Q_q = \mathrm{Cat}\left(C_q, \mathrm{PE}(x_q, y_q) \cdot \mathrm{MLP}(C_q)\right), \quad K_{x,y} = \mathrm{Cat}\left(F_{x,y}, \mathrm{PE}(x, y)\right), \quad V_{x,y} = F_{x,y}, \tag{7}$$

where $\mathrm{MLP}: \mathbb{R}^{D} \to \mathbb{R}^{D}$ is composed of two linear layers that learn a scale vector from the content information. We concatenate the position and content information as queries and keys to consider both the content and position contributions, so that the query matrix and key matrix can be expressed as $Q = \mathrm{Cat}(Q_C, Q_P)$ and $K = \mathrm{Cat}(K_C, K_P)$. Based on the
[Figure 5 (data-flow diagram): left, the decoder stacks three pairs of anchor refine and class refine layers linked by query sharing; its inputs are the image spatial features, position information, anchor boxes, class label embeddings, and a denoising indicator, and its outputs are anchors and labels. Right, in each layer the object query passes through multi-head self-attention, multi-head cross-attention (sharing query cross-attention in the class refine branch), and an FFN, each followed by Add & Norm. An MLP on the anchor branch predicts (Δx, Δy, Δw, Δh) to update the anchor (x, y, w, h) to (x', y', w', h'), and the reference size (w_ref, h_ref) is combined with (1/w, 1/h) to modulate the positional query. Legend: element-wise multiplication, element-wise addition, concatenation, modules, variables.]

Fig. 5. Illustration of the task-aligned dual-head block. The decoder consists of three task-aligned dual-head blocks.
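The construction of the spatial query input in Eq. (6) can be made concrete with a small Python sketch of the sinusoidal encoding and the concatenation step. Several details are assumptions rather than statements from the paper: the temperature of 10000 is the common Transformer default, the sine and cosine terms are interleaved, and the function names are hypothetical. The learned two-layer MLP that maps the $2D$-dimensional vector to $P_q \in \mathbb{R}^{D}$ is omitted because its weights are trained.

```python
import math


def sinusoidal_pe(value, dim, temperature=10000.0):
    """Map a scalar coordinate to a `dim`-dimensional sinusoidal embedding.

    Even indices hold sines and odd indices cosines of the scaled value.
    The temperature is an assumed default; the paper does not state it.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    embedding = []
    for i in range(dim // 2):
        freq = temperature ** (2 * i / dim)
        embedding.append(math.sin(value / freq))
        embedding.append(math.cos(value / freq))
    return embedding


def anchor_embedding(anchor, model_dim):
    """Cat[PE(x), PE(y), PE(w), PE(h)] for one anchor; length is 2 * model_dim."""
    half = model_dim // 2  # PE maps each scalar to D/2 values, as in Eq. (6)
    out = []
    for coord in anchor:
        out.extend(sinusoidal_pe(coord, half))
    return out


if __name__ == "__main__":
    anchor = (0.5, 0.5, 0.2, 0.3)  # hypothetical normalized (x, y, w, h)
    vec = anchor_embedding(anchor, model_dim=8)
    print(len(vec))  # 2 * model_dim values, the input size of the MLP in Eq. (6)
```

Four embeddings of size $D/2$ concatenate to a vector of size $2D$, which is why the MLP in Eq. (6) is defined on $\mathbb{R}^{2D}$.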

above queries, keys, and values, the cross-attention module is formulated as follows:

$$\mathrm{CrossAttn}(Q, K, V) = \mathrm{softmax}\left(\frac{Q_C K_C^{T} + \mathrm{ModAttn}(Q_P, K_P)}{\sqrt{d}}\right) V, \tag{8}$$

where the modulated positional attention helps us extract features of objects with different widths and heights, which can be expressed as

$$\mathrm{ModAttn}(\mathrm{PE}(x_q, y_q), \mathrm{PE}(x, y)) = \left(\mathrm{PE}(x_q)^{T}\,\mathrm{PE}(x)\,\frac{w_{\mathrm{ref}}}{w} + \mathrm{PE}(y_q)^{T}\,\mathrm{PE}(y)\,\frac{h_{\mathrm{ref}}}{h}\right) / \sqrt{D}, \tag{9}$$

where $1/\sqrt{D}$ is used for value rescaling [37], and the reference width and height are calculated as $w_{\mathrm{ref}}, h_{\mathrm{ref}} = \sigma(\mathrm{MLP}(C_q))$, as shown in Figure 5.

The content queries and spatial queries are updated layer by layer. Using coordinates as spatial queries gives them a clear spatial meaning. As shown in Figure 5, each anchor refine layer outputs an updated object anchor by predicting the relative offsets (Δx, Δy, Δw, Δh). To decouple the classification and regression tasks, we take only the output of the class refine layer as the classification result. Further, we design a query sharing mechanism to enhance the collaboration between these two tasks: the updated anchor of the anchor refine layer, after being embedded by Eq. 6, is used as the input spatial query of the sharing query cross-attention in the class refine layer. The sharing query cross-attention is computed in the same way as Eq. 4. However, it effectively mitigates the problem of inter-spatial variation by exploiting the property of attention aggregation target correlation. The validity of the query sharing is verified in Section III-D. In addition, we introduce query denoising learning [38], which adds a denoising part containing a denoising indicator and a denoising loss as a training shortcut to accelerate the training of PMMW-DETR.

III. EXPERIMENTS AND RESULTS

A. Experimental Environments

Fig. 6. The PMMW security screening system and experimental environments. Labeled components: radiometer, near-field antenna, elevation rotation, azimuth rotation.

The dataset was acquired by our self-developed PMMW security system, as shown in Figure 6. The system consists of a Cassegrain antenna, a radiometer consisting of an ortho-mode transducer, two direct detection modules, a data acquisition module, and a three-axis scanning turntable. The system operates in the 94±2 GHz band with a sensitivity of about 0.4 K. Using an antenna with a diameter of 0.6096 m, it can achieve a resolution of 0.36°. As a research-oriented rather than a commercial imaging system, it uses a scanning imaging regime that trades a longer imaging time (approximately 5 min per image) for better imaging quality. Compared with existing publicly available datasets [16], [17], [19], [25], the image quality of our dataset is closer to or even surpasses that of reported commercial security systems, such as the X250 and S350 of Millivision Inc., the iPat imager of Trex Enterprises Inc., and the SPO-NX of Qinetiq Inc. Our dataset is therefore more representative and forward-looking than the existing datasets.
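As a rough plausibility check of the quoted angular resolution, the sketch below evaluates the Rayleigh diffraction limit θ ≈ 1.22 λ/D for the 94 GHz center frequency and the 0.6096 m aperture. The use of the Rayleigh criterion and the function name `rayleigh_resolution_deg` are assumptions made for illustration; the text does not state how the 0.36° figure was obtained.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # speed of light in vacuum, m/s


def rayleigh_resolution_deg(frequency_hz: float, aperture_m: float) -> float:
    """Diffraction-limited angular resolution in degrees.

    Assumption: Rayleigh criterion for a uniformly illuminated circular
    aperture, theta = 1.22 * wavelength / diameter.
    """
    wavelength_m = SPEED_OF_LIGHT / frequency_hz
    theta_rad = 1.22 * wavelength_m / aperture_m
    return math.degrees(theta_rad)


if __name__ == "__main__":
    # Center frequency and antenna diameter as quoted for the system.
    theta = rayleigh_resolution_deg(94e9, 0.6096)
    print(f"Estimated angular resolution: {theta:.3f} deg")
```

Under these assumptions the estimate comes out at roughly 0.37°, which is in line with the reported 0.36°.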

B. Security Dataset and Implementation Details

We collected a total of 247 security images, in which the types of concealed objects included a metal wrench, alcohol bottle, metal knife, and metal pistol. 186 PMMW security images formed the training set and the remaining 77 formed the test set. The training of the Transformer network requires a large amount of data, but the scanning imaging regime is too time-consuming to form a large-scale dataset. To compensate for the lack of data, we added 379 simulated images to the original training set. The simulated images are generated with the ray tracing method [3], [39]. In sum, the resulting dataset has a total of 715 images, of which 638 images form the training set and 77 images form the test set. After a series of data augmentation operations such as flipping, rotating, Gaussian degradation filtering, and changing the brightness, the training set is expanded by a factor of seven to 4466 images. All our datasets will be available at https://github.com/Ch3ngguo/opening-source-PMMW-dataset.

Several typical PMMW security images are given in Figure 7, from which the following can be seen:
1) Due to the uneven ambient brightness, the human chest resembles the texture features of the alcohol bottle, which may lead to false alarms.
2) The metal wrench in the fourth image is thin and shares similar contour features with the scanning noise, and the brightness of the alcohol bottle in the second image is close to that of the human body, both of which may lead to missed detection.
3) Flat areas of the human body reflect strong brightness, such as the central part in the first image, which may affect the detection of centrally located hidden objects.
Therefore, it is a difficult task to detect objects on such a low-resolution, high-noise dataset with little texture information.

For the implementation details, we use 1, 3, 5, 7 as the size of $n_l$ in the DCFT module. We add uniform noise on boxes and set the noise hyperparameters to λ1 = 0.4, λ2 = 0.4, and γ = 0.4 in the query denoising learning. We train the network for 50 epochs and initialize the parameters with Xavier initialization. We adopt AdamW [40] as the optimizer with a weight decay of $1 \times 10^{-4}$. The initial learning rate is $1 \times 10^{-4}$ and is multiplied by 0.1 at the 40th epoch. The batch size is 4, and all the models are trained on a single NVIDIA TESLA A30 GPU.

Fig. 7. (a) Optical schematic photos corresponding to human security screening. (b) Representative PMMW security images and the hidden objects we used, which include a metal wrench, alcohol bottle, metal knife, and metal pistol.

C. Comparison with the State-of-the-Art

Following the mainstream approach of object detection evaluation, we use average precision (AP) and average recall (AR) to evaluate the performance of the proposed model. To show the detection results for all classes, the mAP/mAR is obtained by summing the AP/AR of each class and taking the average. Essentially, AR is twice the area under the recall-IoU curve, and AP is the area under the precision-recall curve (PR curve), where recall is the ratio of detected samples to the actual total samples, and precision is the ratio of correct detections to all detections. The calculation of mAP and mAR is given as follows:

$$\mathrm{mAP} = \frac{\sum_{k=1}^{K} \mathrm{AP}_k}{K}, \quad \mathrm{mAR} = \frac{\sum_{k=1}^{K} \mathrm{AR}_k}{K}, \tag{10}$$

where $K$ is the number of classes, and $\mathrm{AP}_k$ and $\mathrm{AR}_k$ are given as follows:

$$\mathrm{AP}_k = \sum_{i=1}^{n-1} (r_{i+1} - r_i)\, p_{\mathrm{interp}}(r_{i+1}), \quad \mathrm{AR}_k = 2 \int_{0.5}^{1} \mathrm{Recall}(o)\, \mathrm{d}o, \tag{11}$$

where $o$ is the IoU and $\mathrm{Recall}(o)$ is the corresponding recall, and $r_1, r_2, \ldots, r_n$ are the recall levels (in ascending order) at which the precision is first interpolated. The interpolated precision $p_{\mathrm{interp}}(r) = \max_{r' \ge r} \mathrm{Precision}(r')$ is defined as the highest precision found for any recall level $r' \ge r$. The Precision and Recall are defined as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \tag{12}$$

where TP is the number of true positives, FN is the number of false negatives, and FP is the number of false positives.

To show the superior performance of our proposed model, we present a comparison of experimental results with

TABLE I
Comparison with the SOTA object detection models.

Method            Epochs  mAP50  mAP75  mAP   mAPS  mAPM  AR100  End-to-End
YOLOv3            273     95.0   76.4   61.0  39.3  62.8  65.2   -
SSD               24      95.5   65.3   59.5  31.4  62.7  64.3   -
Faster-RCNN       24      95.7   55.7   56.1  31.9  58.4  61.0   -
ATSS              24      96.2   74.1   59.8  32.2  61.5  64.0   -
FCOS              24      95.7   60.0   58.8  34.4  60.1  62.1   -
PVT               24      96.1   80.4   64.0  39.8  65.3  69.1   -
Dyhead            24      96.3   74.0   64.2  38.4  66.9  68.5   -
PMMW-DETR         24      97.0   87.1   67.8  45.7  68.4  75.5   ✓
PMMW-DETR (Ours)  50      97.5   91.8   71.3  49.1  74.1  80.4   ✓

seven representative state-of-the-art (SOTA) detectors, including YOLOv3 [41] (one-stage detector), SSD [42] (one-stage detector), Faster-RCNN [43] (two-stage detector), ATSS [27] (anchor-free detector), FCOS [44] (anchor-free detector), PVT [45] (Transformer-based detector), and Dynamic Head [46] (Transformer-based detector). To ensure a fair comparison, we used the same hyperparameters in the above networks. All the experiments were run on the same dataset, and the above models are implemented based on mmdetection [47].

In Table I, the number of epochs is used to measure the convergence speed, where an epoch is one pass of all the data through the network, including the forward calculation and back propagation. For a fair comparison, we also set up an experiment with 24 epochs (2x schedule), in which the learning rate of our model is multiplied by 0.1 at the 20th epoch. The mAP50 is the standard metric of Pascal VOC, which defines the mAP metric using a single IoU threshold of 0.5. Similarly, the mAP75 uses a single IoU threshold of 0.75. The mAP is the average of mAP under 10 IoU thresholds (i.e., 0.50, 0.55, 0.60, ..., 0.95), which is the strictest metric in the COCO challenge (the mAP of the SOTA model on the COCO dataset is 63.3). The mAR100 is the average recall calculated when one hundred anchor boxes are given per image. End-to-End indicates whether the model requires neither NMS nor manually designed anchors.

YOLOv3 is a widely used one-stage detector, which was already adopted for PMMW security screening [19]. It directly learns the bounding box, confidence, and category probability by regression. YOLOv3 has a fast detection speed and a low false detection rate because it can see global information. However, the detection accuracy of YOLOv3 is lower than that of other SOTA detectors, especially for small objects. It can be seen that our method is +10.4% mAP higher than YOLOv3. We believe that the low detection rate is closely related to the fact that the original YOLOv3 model does not include multi-scale detection. Besides, YOLOv3 often requires more than two hundred epochs to converge, which is unacceptable on PMMW datasets.

SSD is another widely used one-stage detector. It introduces the pyramid feature hierarchy, that is, objects are predicted on the feature maps of different receptive fields, thus improving the detection accuracy of the network. However, as shown in Table I, the mAPS of SSD for small objects is very low, only 31.4% (18.9% lower than our method). At the same time, there are missed detections and wrong classifications, as shown in Figure 8. In the fifth image, for example, the confidence level of the metal knife is only 0.16, which would make it a false negative in practical applications.

Faster-RCNN is a classic two-stage detection network. It introduces an RPN to generate anchor points and nine proposal anchors around each point. After screening, the effective proposal anchors are sent to the classification and regression network. As we can see from Table I, our model is 36.1% higher in mAP75 and 19.4% higher in AR than Faster-RCNN. It can be seen from Figure 8 that Faster-RCNN produces fewer boxes with higher confidence and thus is more likely to miss objects.

The above methods need manually designed anchors as prior information for the network. In contrast, FCOS is a well-known anchor-free detector. It treats object detection as a regression of the distance between each position on the feature map and the bounding box. FCOS avoids all the hyperparameters related to the proposal anchors, and so reduces the memory occupation and improves the calculation efficiency. ATSS reveals that the essential difference between anchor-free and anchor-based methods is the selection of positive and negative samples, and proposes a more reasonable sample selection strategy. FCOS and ATSS implement anchor-free detection by a similar approach, so they face similar problems. In Figure 8, we show as many anchor boxes as possible to ensure that the experimental conclusions are complete. It can be seen that FCOS and ATSS do not perform well in regressing anchor boxes: both failed to enclose the metal pistol in any of their anchor boxes, and many anchor boxes have low confidence. Meanwhile, in the ATSS detection results, there is a confusion between the gaps and the object, resulting in false alarms as we mentioned in Section III-B.

PVT is a well-known Transformer-based backbone, which combines the advantages of CNNs and Transformers. It introduces a step-by-step contraction pyramid to obtain multi-scale output similar to a CNN. Meanwhile, thanks to the self-attention mechanism in the Transformer, PVT maintains the global receptive field at different scales. Dyhead is a novel detection head with the attention mechanism. It combines multiple self-attention mechanisms between feature levels, spatial locations, and output channels, and thus significantly improves the representation ability of object detection heads without heavy computational overhead. Introducing the attention mechanism makes the results of PVT and Dyhead very competitive, being
only 7.3% and 7.1% lower than ours in mAP, respectively. However, as shown in Figure 8, since the attention mechanism of PVT and Dyhead was carried out on a single head, they produced some wrong classifications in the detection results.

Besides, Figure 9 shows the performance curves of our model and the other models. It can be seen that our model is superior to the other methods after the fifth epoch.

Fig. 8. Visualization of the detection results of the SOTA models and our proposed model. Panels: FCOS, YOLOv3, ATSS, SSD, Dyhead, Faster-RCNN, PVT, Ours, Ground Truth.

Fig. 9. The mAP75 curves of PMMW-DETR and other models.
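The evaluation metrics of Eqs. (10)-(12) can be made concrete with a small sketch. The pure-Python code below computes the interpolated AP of one class from ranked detections, the AR as twice the area under a sampled recall-IoU curve, and the class average used for mAP/mAR. The function names, the toy inputs, the convention of prepending a recall level of 0, and the trapezoidal integration are all assumptions for illustration; this is not the evaluation code used in the experiments.

```python
from typing import Sequence


def average_precision(is_tp: Sequence[int], num_gt: int) -> float:
    """Interpolated AP for one class (Eq. 11).

    is_tp  : 0/1 flags for detections sorted by descending confidence;
             1 means the detection matches a ground-truth box.
    num_gt : number of ground-truth boxes of this class (TP + FN).
    """
    tp = 0
    recalls = [0.0]      # assumed convention: start the curve at recall 0
    precisions = [0.0]
    for rank, flag in enumerate(is_tp, start=1):
        tp += flag
        precisions.append(tp / rank)   # TP / (TP + FP), Eq. 12
        recalls.append(tp / num_gt)    # TP / (TP + FN), Eq. 12
    # p_interp(r): highest precision at any recall level r' >= r.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum of (r_{i+1} - r_i) * p_interp(r_{i+1}).
    return sum(
        (recalls[i + 1] - recalls[i]) * precisions[i + 1]
        for i in range(len(recalls) - 1)
    )


def average_recall(ious: Sequence[float], recalls: Sequence[float]) -> float:
    """AR as twice the area under the recall-IoU curve on [0.5, 1] (Eq. 11).

    ious must be ascending samples from 0.5 to 1.0; the integral is
    approximated with the trapezoidal rule (an assumption of this sketch).
    """
    area = 0.0
    for i in range(len(ious) - 1):
        area += (ious[i + 1] - ious[i]) * (recalls[i] + recalls[i + 1]) / 2.0
    return 2.0 * area


def mean_over_classes(per_class_values: Sequence[float]) -> float:
    """mAP or mAR as the plain average over the K classes (Eq. 10)."""
    return sum(per_class_values) / len(per_class_values)
```

For instance, four ranked detections flagged `[1, 0, 1, 1]` against four ground-truth boxes give an AP of 0.625 under this convention, and a constant recall of 0.8 over the IoU range gives an AR of 0.8.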
D. Ablation Study

In this section, we conduct a series of ablation studies to evaluate the contribution of the proposed modules. In Tables II∼III, we use the terms “QSE”, “QSH”, “DAM”, “DAB”, and “DN” to denote “Query Selection”, “Query Sharing”, “Deformable Attention Module”, “Dynamic Anchor Boxes”, and “DeNoising learning”, and “R50”, “Swin-T”, and “DCFT” to denote “ResNet-50”, “Swin Transformer-Tiny”, and “Denoising Coarse-to-Fine Transformer”, respectively. We use DETR as the baseline, and build a strong baseline with the DAM, DAB, and DN modules. Then we add QSE, QSH, and DCFT to the strong baseline in turn to verify the performance of our proposed modules. The results are available in Table II, Rows 3∼5. In addition, Figure 10 shows the performance curves of each component. We can see that the QSE, QSH, and DCFT modules improve performance significantly.

Fig. 10. The mAP75 (a) and mAR100 (b) curves for each component of PMMW-DETR.

TABLE II
Ablation study of the proposed algorithm components.

#Row  Method            QSH  QSE  Backbone  AP50  AP75  AP    APS   APM   AR100
1     DETR (baseline)   -    -    R50       92.9  63.6  49.1  45.7  54.7  56.1
2     Strong baseline¹  -    -    R50       95.8  72.4  58.0  56.9  63.8  63.1
3     PMMW-DETR         ✓    -    R50       97.0  86.8  65.7  42.5  67.4  71.9
4     PMMW-DETR         ✓    ✓    R50       97.1  89.3  68.7  44.4  70.5  77.1
5     PMMW-DETR (Ours)  ✓    ✓    DCFT      97.5  91.8  71.3  49.1  74.1  80.4
¹ The strong baseline is built on DETR with modules of DAM, DAB, and DN.

To verify the effectiveness of query selection and the impact of dynamic initialization, using static initialization as a baseline, we compare the performance of dynamic initialization and query selection. As can be seen from Table III, our query selection method outperforms the other methods. The dynamic initialization method indeed limits the performance of the network to a certain extent.

TABLE III
Comparison between different query initialization methods.

Method   AP50  AP75  AP    APS   APM   AR100
Static   97.0  86.8  65.7  42.5  67.4  71.9
Dynamic  97.1  88.3  67.9  43.1  68.9  75.8
QSE      97.1  89.3  68.7  44.4  70.5  77.1

TABLE IV
Comparison between different backbones.

Backbone  #Params  FLOPs  AP50  AP75  AP    APS   APM   AR100
R50       37.7M    239G   95.8  72.4  58.0  56.9  63.8  63.1
PVT       34.2M    226G   96.1  80.4  64.0  39.8  65.3  69.1
Swin-T    38.5M    245G   97.4  87.1  68.7  43.9  70.4  75.5
DCFT      39.4M    265G   97.5  91.8  71.3  49.1  74.1  80.4

On the basis of Table II Row 5, we replaced the DCFT with different backbones and compared their performance; the results are shown in Table IV. The parameters and FLOPs are calculated using the same network (RetinaNet). Since the multi-scale calculation is introduced in the backbone, the parameters and FLOPs of our proposed DCFT are slightly higher. However, as we mentioned in Section II-B, the computational complexity of our model is lower, and our model performs better in mAP and mAR. Because full-map attention is computationally excessive, it can only handle the image classification task, which has less computational overhead, and cannot complete the object detection task. Therefore, we added the PVT from the SOTA comparison. The AP of R50 is poor because of the lack of global information. Similar to R50, PVT loses more long- and short-range information in the cascading reduction process, which results in the lower AP of PVT. Swin-T achieved competitive results with its compact design. However, its diminished long-range information resulted in a mAP75 4.7% lower and a mAR100 4.9% lower than ours.

IV. CONCLUSION

In this paper, we have presented a robust task-aligned Detection Transformer named PMMW-DETR. Different from popular neural networks, we employ the denoising coarse-to-fine Transformer backbone to deal with the problem of high noise and lack of texture information in the PMMW image, the query selection module to introduce spatial prior information, and the task-aligned dual-head block to enhance the ability of the network to identify object categories. Numerical experiments show that PMMW-DETR outperforms other SOTA methods, and achieves the classification of different objects for the first time. It is worth noting that PMMW-DETR is also applicable to infrared, optical, and radar images that possess the following characteristics: (1) high noise; (2) lack of local texture information; (3) the object of interest appears in a relatively stable spatial position (for example, in the security image, hidden objects always appear in the human body area).

REFERENCES

[1] N. A. Salmon, “Outdoor passive millimeter-wave imaging: Phenomenology and scene simulation,” IEEE Transactions on Antennas and Propagation, vol. 66, no. 2, pp. 897–908, 2017.
[2] L. Yujiri, M. Shoucri, and P. Moffa, “Passive millimeter wave imaging,” IEEE microwave magazine, vol. 4, no. 3, pp. 39–50, 2003.
[3] N. A. Salmon, “Indoor full-body security screening: Radiometric microwave imaging phenomenology and polarimetric scene simulation,” IEEE Access, vol. 8, pp. 144621–144637, 2020.
[4] D. Casella, G. Panegrossi, P. Sano, B. Rydberg, V. Mattioli, C. Accadia, M. Papa, F. S. Marzano, and M. Montopoli, “Can we use atmospheric targets for geolocating spaceborne millimeter-wave ice cloud imager (ici) acquisitions?” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, p. 5302622, 2022.
[5] D. Cuadrado-Calle, P. Piironen, and N. Ayllon, “Solid-state diode technology for millimeter and submillimeter-wave remote sensing applications: Current status and future trends,” IEEE Microwave Magazine, vol. 23, no. 6, pp. 44–56, 2022.
[6] J. Su, H. Wu, P. Li, Y. Hu, and F. Hu, “Detection for ship by dual-polarization imaging radiometer,” Optics Express, vol. 29, no. 17, pp. 27830–27844, 2021.
[7] Y. Cheng, Y. Wang, Y. Niu, and Z. Zhao, “Concealed object enhancement using multi-polarization information for passive millimeter and terahertz wave security screening,” Optics Express, vol. 28, no. 5, pp. 6350–6366, 2020.
[8] Y. Cheng, L. Qiao, D. Zhu, Y. Wang, and Z. Zhao, “Passive polarimetric imaging of millimeter and terahertz waves for personnel,” IEEE Transactions on Geoscience and Remote Sensing, vol. 1, no. 15, 2020.
[9] A. Y. Owda, “Passive millimeter-wave imaging for burns diagnostics under dressing materials,” Sensors, vol. 22, no. 7, p. 2428, 2022.
[10] Y. Zhao, W. Si, B. Han, Z. Yang, A. Hu, and J. Miao, “A novel near field image reconstruction method based on beamforming technique for real-time passive millimeter wave imaging,” IEEE Access, vol. 10, pp. 32879–32888, 2022.
[11] Y. Cheng, L. Qiao, D. Zhu, Y. Wang, and Z. Zhao, “Passive polarimetric imaging of millimeter and terahertz waves for personnel security screening,” Optics Letters, vol. 46, no. 6, pp. 1233–1236, 2021.
[12] O. Martínez, L. Ferraz, X. Binefa, I. Gómez, and C. Dorronsoro, “Concealed object detection and segmentation over millimetric waves images,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops. IEEE, 2010, pp. 31–37.
[13] D.-S. Lee, S. Yeom, J.-Y. Son, and S.-H. Kim, “Automatic image segmentation for concealed object detection using the expectation-maximization algorithm,” Optics Express, vol. 18, no. 10, pp. 10659–10667, 2010.
[14] S. Yeom, D.-S. Lee, J.-Y. Son, M.-K. Jung, Y. Jang, S.-W. Jung, and S.-J. Lee, “Real-time outdoor concealed-object detection with passive millimeter wave imaging,” Optics Express, vol. 19, no. 3, pp. 2530–2536, 2011.
[15] S. Yeom, D.-S. Lee, Y. Jang, M.-K. Lee, and S.-W. Jung, “Real-time concealed-object detection and recognition with passive millimeter wave imaging,” Optics Express, vol. 20, no. 9, pp. 9371–9381, 2012.
[16] S. López-Tapia, R. Molina, and N. P. de la Blanca, “Using machine learning to detect and localize concealed objects in passive millimeter-wave images,” Engineering Applications of Artificial Intelligence, vol. 67, pp. 81–90, 2018.
[17] S. Lopez-Tapia, R. Molina, and N. P. de la Blanca, “Deep cnns for object detection using passive millimeter sensors,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 9, pp. 2580–2589, 2017.
[18] M. Kowalski, “Real-time concealed object detection and recognition in passive imaging at 250 ghz,” Applied optics, vol. 58, no. 12, pp. 3134–3140, 2019.
[19] L. Pang, H. Liu, Y. Chen, and J. Miao, “Real-time concealed object detection from passive millimeter wave images based on the yolov3 algorithm,” Sensors, vol. 20, no. 6, p. 1678, 2020.
[20] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
[21] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2017.
[22] A. Kolesnikov, A. Dosovitskiy, D. Weissenborn, G. Heigold, J. Uszkoreit, L. Beyer, M. Minderer, M. Dehghani, N. Houlsby, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2021.
[23] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[24] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision. Springer, 2020, pp. 213–229.
[25] H. Yang, D. Zhang, A. Hu, C. Liu, T. J. Cui, and J. Miao, “Transformer-based anchor-free detection of concealed objects in passive millimeter wave images,” IEEE Transactions on Instrumentation and Measurement, 2022.
[26] X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia, and C. Shen, “Twins: Revisiting the design of spatial attention in vision transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 9355–9366, 2021.
[27] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9759–9768.
[28] Y. Wu, Y. Chen, L. Yuan, Z. Liu, L. Wang, H. Li, and Y. Fu, “Rethinking classification and localization for object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10186–10195.
[29] L. Xuhong, Y. Grandvalet, and F. Davoine, “Explicit inductive bias for transfer learning with convolutional networks,” in International Conference on Machine Learning. PMLR, 2018, pp. 2825–2834.
[30] W. Wang, J. Zhang, Y. Cao, Y. Shen, and D. Tao, “Towards data-efficient detection transformers,” arXiv preprint arXiv:2203.09507, 2022.
[31] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020.
[32] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang, “Conditional detr for fast training convergence,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3651–3660.
[33] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, “Dab-detr: Dynamic anchor boxes are better queries for detr,” arXiv preprint arXiv:2201.12329, 2022.
[34] Z. Yao, J. Ai, B. Li, and C. Zhang, “Efficient detr: improving end-to-end object detector with dense prior,” arXiv preprint arXiv:2104.01318, 2021.
[35] G. Song, Y. Liu, and X. Wang, “Revisiting the sibling head in object detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11563–11572.
[36] C. Feng, Y. Zhong, Y. Gao, M. R. Scott, and W. Huang, “Tood: Task-aligned one-stage object detection,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, 2021, pp. 3490–3499.
[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[38] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “Dn-detr: Accelerate detr training by introducing query denoising,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13619–13627.
[39] B. Qi, L. Lang, Y. Cheng, S. Liu, F. Hu, X. He, P. Deng, and L. Gui, “Passive millimeter-wave scene imaging simulation based on fast ray-tracing,” in 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE, 2016, pp. 2642–2645.
[40] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[41] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
[42] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
[43] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
[44] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9627–9636.
[45] W. Wang, L. Yao, L. Chen, B. Lin, D. Cai, X. He, and W. Liu, “Crossformer: A versatile vision transformer hinging on cross-scale attention,” in International Conference on Learning Representations, 2021.
[46] X. Dai, Y. Chen, B. Xiao, D. Chen, M. Liu, L. Yuan, and L. Zhang, “Dynamic head: Unifying object detection heads with attentions,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 7373–7382.
[47] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu et al., “Mmdetection: Open mmlab detection toolbox and benchmark,” arXiv preprint arXiv:1906.07155, 2019.
