Concealed Object Detection For Passive Millimeter-Wave Security Imaging Based On Task-Aligned Detection Transformer
arXiv:2212.00313v1 [cs.CV] 1 Dec 2022

Abstract—Passive millimeter-wave (PMMW) imaging is a technique with significant potential for human security screening. Several popular object detection networks have been used for PMMW images. However, restricted by the low resolution and high noise of PMMW images, PMMW hidden object detection based on deep learning usually suffers from low accuracy and low classification confidence. To tackle these problems, this paper proposes a Task-Aligned Detection Transformer network, named PMMW-DETR. In the first stage, a Denoising Coarse-to-Fine Transformer (DCFT) backbone is designed to extract long- and short-range features at different scales. In the second stage, we propose the Query Selection module to introduce learned spatial features into the network as prior knowledge, which enhances the semantic perception capability of the network. In the third stage, aiming to improve the classification performance, we use a Task-Aligned Dual-Head block to decouple the classification and regression tasks. Based on our self-developed PMMW security screening dataset, experimental results, including a comparison with State-Of-The-Art (SOTA) methods and an ablation study, demonstrate that PMMW-DETR obtains higher accuracy and classification confidence than previous works, and exhibits robustness to PMMW images of low quality.

Index Terms—Millimeter-wave radiometry, deep learning, object detection, transformer, security screening.

Manuscript received XX XX, 2022; revised XX XX, 2022. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61871438. (Corresponding author: Fei Hu.)
C. Guo, F. Hu, and Y. Hu are with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China, and F. Hu is also with the National Key Laboratory of Science and Technology on Multi-Spectral Information Processing, Wuhan 430074, China (e-mail: [email protected], [email protected], dawal-[email protected]).

I. INTRODUCTION

Passive millimeter-wave (PMMW) imaging has attracted increasing attention from academia and industry in recent years [1]–[3]. Similar to sensors in the infrared and visible bands, PMMW sensors receive the self-emitted radiation of objects and reflected environmental radiation. PMMW sensors, on the other hand, do not require an irradiation source and can detect through smoke, dust, and light rain, which makes PMMW imaging capable of all-day and quasi-all-weather operation, so it is widely used in astronomy and remote sensing [4], [5]. With the improvement in resolution brought by the development of millimeter-wave (MMW) devices, MMW radiometry is gradually being applied to the information acquisition of close-range objects [6]–[9]. Among these, the most widely used application is the detection of concealed objects such as firearms, gasoline, and knives in human security screening, owing to the harmlessness of MMW to the human body and its ability to penetrate textile materials.

Since the 1990s, researchers have designed a variety of PMMW security systems and conducted many human imaging experiments [8], [10], [11]. However, such systems are still not commercially available, mainly because concealed object detection algorithms for PMMW security images lag behind the hardware. The PMMW radiation signal is weak, with only one hundred thousandths of the infrared signal, so the image contains a lot of noise. Meanwhile, the resolution of PMMW images is much lower than that of infrared and visible images, making concealed object detection a challenging task. Many methods have been proposed for the detection of concealed objects in PMMW security images. Earlier research works were mainly based on statistical learning and machine learning. For example, Martinez et al. [12] used the Iterative Steering Kernel Regression (ISKR) method for denoising and the Local Binary Fitting (LBF) method to segment hidden objects and the human body. Lee et al. [13] applied Bayesian learning and expectation maximization algorithms to detect and segment concealed objects. Building on Lee's work, Yeom et al. [14], [15] further proposed the use of principal component analysis (PCA) to extract object features and measure the similarity between the object and ground truth. Lopez et al. [16] extracted Haar features from images and used classical machine learning algorithms such as random forest and support vector machine for detection.

A common disadvantage of the above methods is that they require manually designed features, which leads to poor generalization and poor adaptation when the object distribution changes. Deep learning has been very successful in the field of computer vision, and many networks have achieved impressive performance on available datasets. Inspired by these works, deep learning, especially the convolutional neural network (CNN), has been applied to PMMW image feature extraction and object detection in recent years, which avoids the need to manually design complex features. For example, Lopez et al. [17] performed smoothing preprocessing and applied SCNN networks for object segmentation. Mainstream networks in industrial applications, such as YOLOv3 and SSD, have also been used for concealed object detection in PMMW security images [18], [19]. The common shortcoming of the above works is that they all need to set the prior anchor boxes manually, whose shape and size have a significant impact on the final performance of
[Fig. 1. Network overview (figure not reproduced): input image; four DCFT Attention Blocks (Stages 1 to 4) with multi-scale outputs of size H/8 × W/8 × 2c, H/16 × W/16 × 4c, and H/32 × W/32 × 8c; Neck; Query Selection Block with positional embeddings; Task-Aligned Dual-Head Block, whose PMMW-DETR decoders produce the output labels with confidence (classification head) and the output boxes.]
detection tasks with less local information such as PMMW security screening. However, the original attention mechanism performs fine-grained operations over the whole image, resulting in a computational overhead that is quadratic w.r.t. the image size [23], which is unacceptable. Although Swin Transformer proposed the local attention mechanism and the shifted window attention mechanism to reduce the computational cost, the long- and short-range perception capability of the attention mechanism is greatly weakened since the size of the window is far smaller than that of the image, which might cause false alarms due to similar local features between objects and the background.

To balance the performance and computational overhead of the attention mechanism, we propose the DCFT attention-based backbone, which consists of four DCFT attention blocks, as shown in Figure 1. In order to obtain a hierarchical multi-scale backbone structure, we partition each image $I \in \mathbb{R}^{H \times W}$ into patches of size $4 \times 4$, resulting in $\frac{H}{4} \times \frac{W}{4}$ visual tokens with dimension $4 \times 4$. Then, as shown in Figure 2(a), a patch embedding layer consisting of a convolutional layer with stride and filter size both equal to 4 is used to project the patches into the feature map $z \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times c}$. After the projection, similar operations are performed on the feature map in the four

character. With this character, we get a reasonable trade-off between computational overhead and long-range perception capabilities.

As shown in Figure 2(b), for an input feature map $z \in \mathbb{R}^{H \times W \times c}$, we first partition it into a grid of query windows of size $n_{wp} \times n_{wp}$, in which the query window shares the same surroundings. Each query window passes through a linear projection layer to obtain $Q_i = f_q(z^1)$ in the calculation of the attention mechanism, where $i$ indexes the $i$-th query window. For the $i$-th query window $Q_i \in \mathbb{R}^{n_{wp} \times n_{wp} \times d}$, we split the input feature map $z$ into multiple sub-windows, $N^l \times N^l$ in number. Then we perform pooling for each feature map level $l \in \{1, ..., L\}$ to generate fine-grained and coarse-grained information. Since PMMW security images bear much more random noise than optical images, we adopt a denoising kernel in the pooling operation, since it has been shown in [17] that denoising pre-processing is beneficial for object detection in PMMW security screening. A simple linear layer $f_p^l$ is used to pool the sub-windows spatially by:

$$z^{l} = f_p^{l}(\hat{z}) \in \mathbb{R}^{\frac{H}{n^l} \times \frac{W}{n^l} \times c}, \qquad \hat{z} = \mathrm{Reshape}(z) \in \mathbb{R}^{\left(\frac{H}{n^l} \times \frac{W}{n^l} \times c\right) \times \left(n^l \times n^l\right)}, \tag{1}$$
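The reshape-then-project pooling of Eq. (1) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles a single channel, the function name `denoising_pool` is hypothetical, and the default uniform weights (plain averaging) stand in for the learned denoising kernel $f_p^l$, whose actual form is not given in this excerpt.

```python
def denoising_pool(z, n, weights=None):
    """Pool an H x W single-channel map over non-overlapping n x n sub-windows.

    `weights` plays the role of the linear layer f_p^l in Eq. (1). The uniform
    default (plain averaging) is an assumption used as a simple
    noise-suppressing kernel, not the learned kernel of the paper.
    """
    height, width = len(z), len(z[0])
    if height % n or width % n:
        raise ValueError("map size must be divisible by the sub-window size")
    if weights is None:
        weights = [1.0 / (n * n)] * (n * n)
    pooled = []
    for top in range(0, height, n):
        row = []
        for left in range(0, width, n):
            # Reshape step: flatten one n x n sub-window into a vector.
            window = [z[top + dy][left + dx] for dy in range(n) for dx in range(n)]
            # Linear step: weighted sum over the flattened sub-window.
            row.append(sum(w * v for w, v in zip(weights, window)))
        pooled.append(row)
    return pooled


# Hypothetical 4 x 4 feature map with one isolated bright (noisy) pixel.
feature_map = [
    [1.0, 3.0, 0.0, 0.0],
    [5.0, 7.0, 0.0, 8.0],
    [2.0, 2.0, 4.0, 4.0],
    [2.0, 2.0, 4.0, 4.0],
]
coarse = denoising_pool(feature_map, n=2)
```

With sub-window size 2 the 4 × 4 map becomes 2 × 2, matching the $\frac{H}{n^l} \times \frac{W}{n^l}$ output size in Eq. (1). The isolated value 8.0 is averaged down with its neighbours, which is the intuition behind pooling with a smoothing kernel.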
Fig. 2. An illustration of DCFT Attention Block. (a) A DCFT Attention Block consists of a Patch Embedding layer and N_i DCFT layers. (b) An illustration of the DCFT attention at window levels. Each of the finest square cells represents a query token.
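The Patch Embedding layer of Fig. 2(a) is described in the text as a convolution whose stride and filter size both equal 4. Because the patches do not overlap, this is the same as flattening each 4 × 4 patch and multiplying it by a weight matrix. The sketch below shows that equivalence and the resulting $\frac{H}{4} \times \frac{W}{4} \times c$ shape. The function name `patch_embed`, the image size, and the three hand-written filters are illustrative assumptions, not values from the paper.

```python
def patch_embed(image, weight, patch=4):
    """Project non-overlapping patch x patch blocks of an H x W image to c channels.

    Equivalent to a convolution whose stride and filter size both equal `patch`.
    `weight` is a c x (patch * patch) matrix standing in for learned filters.
    """
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height - patch + 1, patch):
        row = []
        for left in range(0, width - patch + 1, patch):
            # Flatten one patch into a vector of patch * patch pixels.
            flat = [image[top + dy][left + dx]
                    for dy in range(patch) for dx in range(patch)]
            # One output channel per filter.
            row.append([sum(w * v for w, v in zip(filt, flat)) for filt in weight])
        tokens.append(row)
    return tokens


H, W = 16, 8  # hypothetical image size
image = [[float(y * W + x) for x in range(W)] for y in range(H)]
# Three hypothetical filters: patch mean, all zeros, top-left pixel.
weight = [[1.0 / 16] * 16, [0.0] * 16, [1.0] + [0.0] * 15]
z = patch_embed(image, weight)
shape = (len(z), len(z[0]), len(z[0][0]))  # (H/4, W/4, c)
```

Here a 16 × 8 image yields a 4 × 2 grid of tokens with c = 3 channels each.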
$V_i = \{V_i^1, ..., V_i^L\} \in \mathbb{R}^{N \times c}$, where $N$ is the sum of the DCFT regions from all levels, i.e., $N = \sum_{l=1}^{L} (N^l)^2$. Finally, we follow [23] to introduce a relative position bias and compute the DCFT attention for $Q_i$ by:

$$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{Softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{c}} + B\right) V_i, \tag{2}$$

where $B = \{B^1, ..., B^L\}$ is the learnable relative position bias. Similar to [23], we parameterize the bias $B^1 \in \mathbb{R}^{(2n_{wp}-1) \times (2n_{wp}-1)}$ in the first level, since the relative positions along the horizontal and vertical axes both lie in $[-n_{wp}+1, n_{wp}-1]$. After obtaining the attention scores of the feature map, we send the scores to the LayerNorm and Multi-Layer Perceptron (MLP) blocks.

The overall computational cost of our DCFT attention becomes $O\left(\left(L + \sum_l (n^l)^2\right)(HW)c\right)$, while the computational costs of the original ViT and Swin Transformer are $O\left((4c + 2HW)(HW)c\right)$ and $O\left((4c + 2n_{wp}^2)(HW)c\right)$, where $n_{wp}$ is the window partition size of Swin Transformer, which is usually set to 7. It can be seen that our proposed DCFT attention has a lower computational complexity; the accuracy comparison experiments are given in the ablation study in Table IV.

C. Multi-Scale Deformable Attention Neck

Since PMMW security images contain high noise and sparse texture, the object and noise are relatively similar in the low-level feature dimension, which leads to network confusion and convergence difficulties. Therefore, a neck that further learns the high-dimensional semantic features and aggregates different classes of features is highly desirable. DETR has been very successful in this direction. DETR proposes the idea of treating the object detection task as a set prediction to eliminate post-processing (e.g., NMS) while extracting the semantic features of individual instances well. The shortcomings of DETR are its slow convergence rate and its data-hungry characteristics [30], which are intensified in the low-resolution PMMW image situation. The deformable attention module [31] was proposed to address this, and its data-efficient properties have been shown [30] to alleviate the inherent data-hungry problem of the DETR model. We follow the design of Deformable DETR and use the deformable attention module for further feature aggregation of the multi-scale feature maps output by the backbone.

Also, some recent work has demonstrated that the encoder of DETR is functionally identical to the region proposal network in traditional two-stage networks [31], [32], and the queries output by the encoder can be regarded as proposals in the region proposal network. Inspired by these works, we propose a method for introducing a spatial prior adapted to security screening, called query selection.

1) Deformable Attention Module: The deformable attention module only attends to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps, as shown in Fig. 3. We first take the output multi-scale feature map $\{z^s\}_{s=1}^{S} \in \mathbb{R}^{H \times W \times c}$ from the backbone, the content feature $z_q \in \mathbb{R}^{C}$, and the corresponding normalized coordinates of the 2-d reference point $\hat{p}_q \in [0, 1]^2$ as input of deformable attention, where $q$ indexes a query element and $s$ indexes the input feature map level. Then, via two independent linear projection layers, we use the query feature $z_q$ to obtain the sampling offset $\Delta p_{msqk}$ and the attention weight $A_{msqk}$, where $m$ indexes the attention head, $k$ indexes the sampling point, and $\sum_{s=1}^{S} \sum_{k=1}^{K} A_{msqk} = 1$. The calculation of the multi-scale deformable attention can be expressed as:
Fig. 3. Schematic of deformable attention neck. The Multi-Scale feature maps of the encoder output are used for the query selection and the cross-attention in the decoder.
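The query selection step named in Fig. 3 can be illustrated with a short sketch. The details of the authors' module are not contained in this excerpt, so the following rests on an assumption drawn only from the figure labels (classification scores, top-K selection, unselected anchors): encoder proposals are ranked by classification score and the K best become the anchor queries for the decoder. The function name `select_queries` and all values are hypothetical.

```python
def select_queries(anchors, scores, k):
    """Keep the k anchors with the highest classification scores.

    Returns (selected, unselected) lists of (x, y, w, h) boxes; `selected`
    is ordered from the highest score downwards.
    """
    order = sorted(range(len(anchors)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    selected = [anchors[i] for i in order[:k]]
    unselected = [anchors[i] for i in range(len(anchors)) if i not in keep]
    return selected, unselected


# Hypothetical encoder proposals in normalized (x, y, w, h) form.
anchors = [
    (0.2, 0.3, 0.1, 0.1),
    (0.5, 0.5, 0.2, 0.4),
    (0.7, 0.2, 0.1, 0.2),
    (0.4, 0.8, 0.3, 0.1),
]
scores = [0.10, 0.85, 0.40, 0.62]
selected, unselected = select_queries(anchors, scores, k=2)
```

Ranking by score means the decoder starts from image-dependent anchors rather than fixed ones, which is one way to read the "spatial prior" the text attributes to query selection.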
Fig. 4. Previous end-to-end detectors like DETR and Deformable DETR (a) and ours (b).
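As a numerical check on the complexity expressions quoted in Section II-B for ViT, Swin Transformer, and DCFT attention, the three terms can be evaluated side by side. The sub-window sizes 1, 3, 5, 7 are the values the paper reports for $n^l$. The feature-map size (56 × 56) and channel count (96) are hypothetical. Big-O expressions drop constants and lower-order terms, so the resulting ratios are only indicative.

```python
def vit_cost(h, w, c):
    """Leading term for global attention: (4c + 2HW)(HW)c."""
    return (4 * c + 2 * h * w) * (h * w) * c


def swin_cost(h, w, c, n_wp=7):
    """Leading term for window attention: (4c + 2 n_wp^2)(HW)c."""
    return (4 * c + 2 * n_wp ** 2) * (h * w) * c


def dcft_cost(h, w, c, sizes=(1, 3, 5, 7)):
    """Leading term for DCFT attention: (L + sum_l (n^l)^2)(HW)c."""
    return (len(sizes) + sum(n * n for n in sizes)) * (h * w) * c


# Hypothetical stage-1 feature map: 56 x 56 tokens with 96 channels.
h, w, c = 56, 56, 96
costs = {
    "ViT": vit_cost(h, w, c),
    "Swin": swin_cost(h, w, c),
    "DCFT": dcft_cost(h, w, c),
}
for name, value in costs.items():
    print(f"{name}: {value:,}")
```

Only the ViT term grows quadratically with the number of tokens HW. The Swin and DCFT terms are linear in HW, with a factor fixed by the window or sub-window sizes.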
is, the two tasks have different sensitive locations for the same object. For example, some salient areas may be beneficial for classification while the boundary might have rich information for regression. This spatial task misalignment greatly limits the performance of concealed object detection. Therefore, we propose a task-aligned dual-head block. We perform concealed object classification and regression independently by using two independent branches in parallel heads. But such a two-branch design might lead to a lack of interaction between the two tasks, resulting in inconsistent predictions. To avoid this problem, we propose query sharing to explicitly align the two tasks with the sharing query cross attention.

As shown in Figure 4, the proposed task-aligned dual-head block has two different types of decoder layers with query sharing modules. Different from the six-layer structure of the original DETR decoder, we design a decoder layer with two branches and add the decoder to the head to refine the features. The dual-head design decouples the classification and regression tasks, avoiding the inherent conflict between them. The regression head, on the left, consists of regression layers, and the classification head, on the right, consists of classification layers. There are three regression layers and three classification layers, so the total number of layers is 6, which is consistent with the original DETR. In the PMMW-DETR network, the regression layer is responsible for refining the anchor boxes, while the classification layer refines the class labels. Therefore, in the following, we call the regression layer the anchor refinement layer and the classification layer the class refinement layer.

Figure 5 shows the specific data flow of the task-aligned dual-head block. Each block layer includes a self-attention module and a cross-attention module. The self-attention module is used for query updating, and can be expressed as:

$$\mathrm{SelfAttn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right) V, \tag{4}$$

where $d$ is the number of queries, and queries $Q_q$, keys $K_q$, and values $V_q$ can be expressed as

$$Q_q = C_q + P_q, \quad K_q = C_q + P_q, \quad V_q = C_q, \tag{5}$$

where $C_q \in \mathbb{R}^{D}$ indicates the learnable content query, and $P_q \in \mathbb{R}^{D}$ indicates the spatial query generated by

$$P_q = \mathrm{MLP}\left\{\mathrm{Cat}\left[\mathrm{PE}(x_q), \mathrm{PE}(y_q), \mathrm{PE}(w_q), \mathrm{PE}(h_q)\right]\right\}, \tag{6}$$

where $(x_q, y_q, w_q, h_q)$ is the $q$-th anchor generated from the query selection module; $\mathrm{PE}: \mathbb{R} \to \mathbb{R}^{D/2}$ is the positional encoding function that generates sinusoidal embeddings; Cat denotes the concatenation function; and $\mathrm{MLP}: \mathbb{R}^{2D} \to \mathbb{R}^{D}$ is composed of two linear layers.

The cross-attention module is used for feature probing, where queries, keys, and values can be expressed as

$$Q_q = \mathrm{Cat}\left(C_q, \mathrm{PE}(x_q, y_q) \cdot \mathrm{MLP}(C_q)\right), \quad K_{x,y} = \mathrm{Cat}\left(F_{x,y}, \mathrm{PE}(x, y)\right), \quad V_{x,y} = F_{x,y}, \tag{7}$$

where $\mathrm{MLP}: \mathbb{R}^{D} \to \mathbb{R}^{D}$ is composed of two linear layers to learn a scale vector of the content information. We concatenate the position and content information together as queries and keys to consider both the content and position contributions, so that the query matrix and key matrix can be expressed as $Q = \mathrm{Cat}(Q_C, Q_P)$ and $K = \mathrm{Cat}(K_C, K_P)$. Based on the
above queries, keys, and values, the cross-attention module is formulated as follows:

$$\mathrm{CrossAttn}(Q, K, V) = \mathrm{softmax}\left(\frac{Q_C K_C^{T} + \mathrm{ModAttn}(Q_P, K_P)}{\sqrt{d}}\right) V, \tag{8}$$

where the modulated positional attention helps us extract features of objects with different widths and heights, which can be expressed as

$$\mathrm{ModAttn}\left(\mathrm{PE}(x_q, y_q), \mathrm{PE}(x, y)\right) = \left(\mathrm{PE}(x_q)^{T}\,\mathrm{PE}(x)\,\frac{w_{\mathrm{ref}}}{w} + \mathrm{PE}(y_q)^{T}\,\mathrm{PE}(y)\,\frac{h_{\mathrm{ref}}}{h}\right) / \sqrt{D}, \tag{9}$$

where $1/\sqrt{D}$ is used for value rescaling [37], and the reference width and height are calculated by $w_{\mathrm{ref}}, h_{\mathrm{ref}} = \sigma(\mathrm{MLP}(C_q))$, as shown in Figure 5.

Fig. 5. The illustration of the task-aligned dual-head block. The decoder consists of three task-aligned dual-head blocks.

Both the content queries and the spatial queries are updated layer by layer. Using coordinates as spatial queries for learning gives them a clear spatial meaning. As shown in Figure 5, each anchor refine layer outputs an updated object anchor by predicting the relative positions $(\Delta x, \Delta y, \Delta w, \Delta h)$. For the sake of decoupling the classification and regression tasks, we only consider the output of the class refine layer as the classification result. Further, we design a query sharing mechanism to enhance the collaboration between these two tasks: the updated anchor of the anchor refine layer is embedded by Eq. (6) and then used as the input spatial query of the sharing query cross attention in the class refine layer. The calculation of the sharing query cross attention is simple and is the same as Eq. (4). However, the sharing query cross attention effectively mitigates the problem of inter-spatial variation by exploiting the property of attention aggregation target correlation. The validity of the query sharing is verified in Section III-D. In addition, we introduce the query denoising learning [38], which adds a denoising part containing a denoising indicator and a denoising loss as a training shortcut to accelerate the training of PMMW-DETR.

III. EXPERIMENTS AND RESULTS

A. Experimental Environments

Fig. 6. The PMMW security screening system and Experimental Environments. (Labels in the figure: radiometer, near-field antenna, elevation rotation, azimuth rotation.)

The dataset was acquired by our self-developed PMMW security system, as shown in Figure 6. The system consists of a Cassegrain antenna, a radiometer consisting of an ortho-mode transducer, two direct detection modules, a data acquisition module, and a three-axis scanning turntable. The system operates in the 94±2 GHz band with a sensitivity of about 0.4 K. Using an antenna with a diameter of 0.6096 m, it
B. Security Dataset and Implementation Details

We collected a total of 247 security images, in which the types of concealed objects included the metal wrench, alcohol bottle, metal knife, and metal pistol. 186 PMMW security images formed the training set and the remaining 77 formed the test set. The training of the Transformer network requires a large amount of data, but the scanning imaging regime is too time-consuming to form a large-scale dataset. To compensate for the lack of data, we added 379 simulated images to the original training set. The simulated images are generated with the ray tracing method [3], [39]. In sum, the above original dataset has a total of 715 images, of which 638 images form the training set and 77 images form the test set. After a series of data augmentation operations such as flipping, rotating, Gaussian degradation filtering, and changing the brightness, the training set is expanded by a factor of seven to 4466 images. All our datasets will be available at https://github.com/Ch3ngguo/opening-source-PMMW-dataset.

Fig. 7. (a) Optical schematic photos corresponding to human security screening. (b) Representative PMMW security images and the hidden objects we used, which include a metal wrench, alcohol bottle, metal knife, and metal pistol.

Several typical PMMW security images are given in Figure 7, from which the following can be seen:
1) Due to the uneven ambient brightness, the human chest resembles the texture features of the alcohol bottle, which may lead to false alarms.
2) The metal wrench in the fourth image is thin and shares similar contour features with the scanning noise, and the brightness of the alcohol bottle in the second image is close to that of the human body, both of which may lead to missed detection.
3) Flat areas of the human body reflect strong brightness, such as the central part in the first image, which may affect the detection of hidden objects in the center.
Therefore, it is a difficult task to detect objects on such a low-resolution, high-noise dataset with little texture information.

For the implementation details, we use 1, 3, 5, 7 as the sizes $n^l$ in the DCFT module. We add uniform noise on boxes and set the hyperparameters with respect to noise as $\lambda_1 = 0.4$, $\lambda_2 = 0.4$, and $\gamma = 0.4$ in the query denoising learning. We train the network for 50 epochs and initialize the parameters with Xavier initialization. We adopt AdamW [40] as the optimizer with a weight decay of $1 \times 10^{-4}$. The initial learning rate is $1 \times 10^{-4}$ and it is multiplied by 0.1 at the 40-th epoch. The batch size is 4, and all the models are trained on a single NVIDIA TESLA A30 GPU.

C. Comparison with the State-of-the-Art

Following the mainstream approach of object detection evaluation, we use average precision (AP) and average recall (AR) to evaluate the performance of the proposed model. In order to show the detection results for all classes, the mAP/mAR is obtained by summing the AP/AR of each class and taking the average. Essentially, AR is twice the area under the recall-IoU curve and AP is the area under the precision-recall curve (PR curve), where recall is the ratio of detected samples to the actual total samples, and precision reflects the detection rate. The calculation of mAP and mAR is given as follows:

$$\mathrm{mAP} = \frac{\sum_{k=1}^{K} \mathrm{AP}_k}{K}, \qquad \mathrm{mAR} = \frac{\sum_{k=1}^{K} \mathrm{AR}_k}{K}, \tag{10}$$

where $K$ is the number of classes, and $\mathrm{AP}_k$ and $\mathrm{AR}_k$ are given as follows:

$$\mathrm{AP}_k = \sum_{i=1}^{n-1} (r_{i+1} - r_i)\, p_{\mathrm{interp}}(r_{i+1}), \qquad \mathrm{AR}_k = 2 \int_{0.5}^{1} \mathrm{Recall}(o)\, do, \tag{11}$$

where $o$ is the IoU and $\mathrm{Recall}(o)$ is the corresponding recall, and $r_1, r_2, \ldots, r_n$ are the recall levels (in ascending order) at which the precision is first interpolated. The interpolated precision $p_{\mathrm{interp}}(r) = \max_{r' \geq r} \mathrm{Precision}(r')$ is defined as the highest precision found for any recall level $r' \geq r$. The Precision and Recall are defined as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \tag{12}$$

where TP is the number of true positives, FN is the number of false negatives, and FP is the number of false positives.

To show the superior performance of our proposed model, we present an experimental comparison with
TABLE I
Comparison with the SOTA object detection models.
seven representative state-of-the-art (SOTA) detectors, including YOLOv3 [41] (one-stage detector), SSD [42] (one-stage detector), Faster-RCNN [43] (two-stage detector), ATSS [27] (anchor-free detector), FCOS [44] (anchor-free detector), PVT [45] (Transformer-based detector), and Dynamic Head [46] (Transformer-based detector). To ensure a fair comparison, we used the same hyperparameters in the above networks. All the experiments were run on the same dataset, and the above models are implemented based on mmdetection [47].

In Table I, the epoch is the metric used to measure the convergence speed, where an epoch is one pass of all data through the network, completing a forward calculation and back propagation. To fairly compare the performance, we set up an experiment with 24 epochs (2x schedule) and multiply the learning rate of our model by 0.1 at the 20-th epoch. The mAP50 is the standard metric for Pascal VOC, which computes mAP at a single IoU threshold of 0.5. Similarly, the mAP75 uses a single IoU threshold of 0.75. The mAP is the average of mAP under 10 IoU thresholds (i.e., 0.50, 0.55, 0.60, ..., 0.95), which is the strictest metric in the COCO challenge (the mAP of the SOTA model on the COCO dataset is 63.3). The mAR100 means the average recall calculated when one hundred anchor boxes are given per image. The End-to-End column indicates whether the model dispenses with NMS and manually designed anchors.

YOLOv3 is a widely used one-stage detector, which was already adopted for PMMW security screening [19]. It directly learns the bounding box, confidence, and category probability by regression. YOLOv3 has a fast detection speed and a low false detection rate because it can see global information. However, the detection accuracy of YOLOv3 is lower than that of other SOTA detectors, especially for small objects. It can be seen that our method is 10.4% higher in mAP than YOLOv3. We believe that the low detection rate is closely related to the fact that the original YOLOv3 model does not include multi-scale detection. Besides, YOLOv3 often requires more than two hundred epochs to converge, which is unacceptable on PMMW datasets.

SSD is another widely used one-stage detector. It introduces the pyramid feature hierarchy, that is, objects are predicted on the feature maps of different receptive fields, thus improving the detection accuracy of the network. However, as shown in Table I, the mAPS of SSD for small objects is very low, only 31.4% (18.9% lower than our method). At the same time, there are missed detections and wrong classifications, as shown in Figure 8. In the fifth image, the confidence level of the metal knife is only 0.16, which would be a false negative in practical applications.

Faster-RCNN is a classic two-stage detection network. It introduces an RPN to generate anchor points and nine proposal anchors around each point. After screening, the effective proposal anchors are sent to the classification and regression network. As we can see from Table I, our model is 36.1% higher in mAP75 and 19.4% higher in AR than Faster-RCNN. It can be seen from Figure 8 that Faster-RCNN has fewer boxes with higher confidence and is thus more likely to miss objects.

The above methods need manually designed anchors as prior information for the network. In contrast, FCOS is a well-known anchor-free detector. It treats object detection as a regression of the distance between each position on the feature map and the bounding box. FCOS avoids all the hyperparameters related to the proposal anchors, which saves memory and improves computational efficiency. ATSS reveals that the essential difference between anchor-free and anchor-based methods is the selection of positive and negative samples, and it proposes a more reasonable sample selection strategy. FCOS and ATSS implement the anchor-free design in a similar way, so they face similar problems. In Figure 8, we show as many anchor boxes as possible to ensure that the experimental conclusions are complete. It can be seen that FCOS and ATSS do not perform well in regressing anchor boxes: both failed to enclose the metal pistol in any of their anchor boxes, and many anchor boxes have low confidence. Meanwhile, in the ATSS detection results, there is confusion between the gaps and the object, resulting in false alarms, as we mentioned in Section III-B.

PVT is a well-known Transformer-based backbone, which combines the advantages of CNNs and Transformers. It introduces a step-by-step contraction pyramid to obtain multi-scale output similar to a CNN. Meanwhile, thanks to the self-attention mechanism in the Transformer, PVT maintains the global receptive field at different scales. Dyhead is a novel detection head with the attention mechanism. It combines multiple self-attention mechanisms between feature levels, spatial locations, and output channels, and thus significantly improves the representation ability of object detection heads without heavy computational overhead. Introducing the attention mechanism makes the results of PVT and Dyhead very competitive, being
Fig. 8. Visualization detection results of SOTA models and our proposed model. (Panels: FCOS, YOLOv3, ATSS, SSD, Dyhead, Faster-RCNN, PVT, Ours, Ground Truth.)
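The evaluation metrics behind the comparison above, Eqs. (10) to (12) in Section III-C, can be sketched as follows. This is an illustration of the formulas as written, not the mmdetection or COCO evaluation code. The function names and the three-point precision-recall curve are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Eq. (12): precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def interpolated_ap(recalls, precisions):
    """Eq. (11): sum of (r_{i+1} - r_i) * p_interp(r_{i+1}).

    `recalls` must be in ascending order; p_interp(r) is the highest
    precision observed at any recall level r' >= r.
    """
    ap = 0.0
    for i in range(len(recalls) - 1):
        p_interp = max(precisions[i + 1:])
        ap += (recalls[i + 1] - recalls[i]) * p_interp
    return ap


def mean_over_classes(per_class_values):
    """Eq. (10): mAP (or mAR) is the plain average over the K classes."""
    return sum(per_class_values) / len(per_class_values)


# Hypothetical precision-recall points for one class.
recalls = [0.0, 0.5, 1.0]
precisions = [1.0, 0.5, 0.75]
ap_single = interpolated_ap(recalls, precisions)
```

The interpolation replaces the dip to 0.5 at recall 0.5 with the later, higher precision 0.75, so the area is computed under a monotonically non-increasing envelope of the PR curve.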
TABLE II
Ablation study of the proposed algorithm components.
TABLE III
Comparison between different query initialization methods.

Method    AP50   AP75   AP     APS    APM    AR100
Static    97.0   86.8   65.7   42.5   67.4   71.9
Dynamic   97.1   88.3   67.9   43.1   68.9   75.8
QSE       97.1   89.3   68.7   44.4   70.5   77.1

line, we compare the performance of dynamic initialization and query selection. As can be seen from Table III, our query selection method outperforms the other methods. The dynamic initialization method indeed limits the performance of the network to a certain extent.

TABLE IV
Comparison between different backbones.

Backbone  #Params  FLOPs  AP50   AP75   AP     APS    APM    AR100
R50       37.7M    239G   95.8   72.4   58.0   56.9   63.8   63.1
PVT       34.2M    226G   96.1   80.4   64.0   39.8   65.3   69.1
Swin-T    38.5M    245G   97.4   87.1   68.7   43.9   70.4   75.5
DCFT      39.4M    265G   97.5   91.8   71.3   49.1   74.1   80.4

On the basis of Table II Row 7, we replaced the DCFT with different backbones and compared their performance, and the results are shown in Table IV. The parameters and FLOPs are calculated using the same network (RetinaNet). Since the multi-scale calculation is introduced in the backbone, the parameters and FLOPs of our proposed DCFT are slightly higher. However, as we mentioned in Section II-B, the computational complexity of our model is lower, and our model outperforms the others in mAP and mAR. Because full-map attention is computationally excessive, it can only handle image classification, which has a lower computational overhead, and cannot complete the object detection task. Therefore, we use PVT as the SOTA model for comparison. The AP of R50 is poor because of the lack of global information. Similar to R50, PVT loses more long- and short-range information in the cascading reduction

noise and lack of texture information in the PMMW image, the query selection module to introduce spatial prior information, and the task-aligned dual-head block to enhance the ability of the network to identify object categories. Numerical experiments show that PMMW-DETR outperforms other SOTA methods, and achieves the classification of different objects for the first time. It is worth noting that PMMW-DETR is also applicable to infrared, optical, and radar images that possess the following characteristics: (1) high noise; (2) lack of local texture information; (3) the object of interest appears in a relatively stable spatial position (for example, in the security image, hidden objects always appear in the human body area).

REFERENCES

[1] N. A. Salmon, “Outdoor passive millimeter-wave imaging: Phenomenology and scene simulation,” IEEE Transactions on Antennas and Propagation, vol. 66, no. 2, pp. 897–908, 2017.
[2] L. Yujiri, M. Shoucri, and P. Moffa, “Passive millimeter wave imaging,” IEEE Microwave Magazine, vol. 4, no. 3, pp. 39–50, 2003.
[3] N. A. Salmon, “Indoor full-body security screening: Radiometric microwave imaging phenomenology and polarimetric scene simulation,” IEEE Access, vol. 8, pp. 144 621–144 637, 2020.
[4] D. Casella, G. Panegrossi, P. Sano, B. Rydberg, V. Mattioli, C. Accadia, M. Papa, F. S. Marzano, and M. Montopoli, “Can we use atmospheric targets for geolocating spaceborne millimeter-wave ice cloud imager (ici) acquisitions?” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, p. 5302622, 2022.
[5] D. Cuadrado-Calle, P. Piironen, and N. Ayllon, “Solid-state diode technology for millimeter and submillimeter-wave remote sensing applications: Current status and future trends,” IEEE Microwave Magazine, vol. 23, no. 6, pp. 44–56, 2022.
[6] J. Su, H. Wu, P. Li, Y. Hu, and F. Hu, “Detection for ship by dual-polarization imaging radiometer,” Optics Express, vol. 29, no. 17, pp. 27 830–27 844, 2021.
[7] Y. Cheng, Y. Wang, Y. Niu, and Z. Zhao, “Concealed object enhancement using multi-polarization information for passive millimeter and terahertz wave security screening,” Optics Express, vol. 28, no. 5, pp. 6350–6366, 2020.
[8] Y. Cheng, L. Qiao, D. Zhu, Y. Wang, and Z. Zhao, “Passive polarimetric imaging of millimeter and terahertz waves for personnel,” IEEE Transactions on Geoscience and Remote Sensing, vol. 1, no. 15, 2020.
[9] A. Y. Owda, “Passive millimeter-wave imaging for burns diagnostics under dressing materials,” Sensors, vol. 22, no. 7, p. 2428, 2022.
[10] Y. Zhao, W. Si, B. Han, Z. Yang, A. Hu, and J. Miao, “A novel near
field image reconstruction method based on beamforming technique for
process, which results in the lower AP of PVT. Swin-T real-time passive millimeter wave imaging,” IEEE Access, vol. 10, pp.
achieved competitive results with its compact design. However, 32 879–32 888, 2022.
its diminished distance to information resulted in 4.7% lower [11] Y. Cheng, L. Qiao, D. Zhu, Y. Wang, and Z. Zhao, “Passive polari-
metric imaging of millimeter and terahertz waves for personnel security
in mAP75 and 4.9% lower in mAR100 than ours. screening,” Optics Letters, vol. 46, no. 6, pp. 1233–1236, 2021.
[12] O. Martı́nez, L. Ferraz, X. Binefa, I. Gómez, and C. Dorronsoro,
IV. C ONCLUSION “Concealed object detection and segmentation over millimetric waves
images,” in 2010 IEEE Computer Society Conference on Computer
In this paper, we have presented a robust task-aligned Vision and Pattern Recognition-Workshops. IEEE, 2010, pp. 31–37.
Detection Transformer named PMMW-DETR. Different from [13] D.-S. Lee, S. Yeom, J.-Y. Son, and S.-H. Kim, “Automatic image
segmentation for concealed object detection using the expectation-
popular neural networks, we perform the denoising coarse-to- maximization algorithm,” Optics Express, vol. 18, no. 10, pp. 10 659–
fine Transformer backbone to deal with the problem of high 10 667, 2010.
[14] S. Yeom, D.-S. Lee, J.-Y. Son, M.-K. Jung, Y. Jang, S.-W. Jung, and S.-J. Lee, “Real-time outdoor concealed-object detection with passive millimeter wave imaging,” Optics Express, vol. 19, no. 3, pp. 2530–2536, 2011.
[15] S. Yeom, D.-S. Lee, Y. Jang, M.-K. Lee, and S.-W. Jung, “Real-time concealed-object detection and recognition with passive millimeter wave imaging,” Optics Express, vol. 20, no. 9, pp. 9371–9381, 2012.
[16] S. López-Tapia, R. Molina, and N. P. de la Blanca, “Using machine learning to detect and localize concealed objects in passive millimeter-wave images,” Engineering Applications of Artificial Intelligence, vol. 67, pp. 81–90, 2018.
[17] S. Lopez-Tapia, R. Molina, and N. P. de la Blanca, “Deep cnns for object detection using passive millimeter sensors,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 9, pp. 2580–2589, 2017.
[18] M. Kowalski, “Real-time concealed object detection and recognition in passive imaging at 250 ghz,” Applied Optics, vol. 58, no. 12, pp. 3134–3140, 2019.
[19] L. Pang, H. Liu, Y. Chen, and J. Miao, “Real-time concealed object detection from passive millimeter wave images based on the yolov3 algorithm,” Sensors, vol. 20, no. 6, p. 1678, 2020.
[20] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
[21] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2017.
[22] A. Kolesnikov, A. Dosovitskiy, D. Weissenborn, G. Heigold, J. Uszkoreit, L. Beyer, M. Minderer, M. Dehghani, N. Houlsby, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2021.
[23] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[24] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
[25] H. Yang, D. Zhang, A. Hu, C. Liu, T. J. Cui, and J. Miao, “Transformer-based anchor-free detection of concealed objects in passive millimeter wave images,” IEEE Transactions on Instrumentation and Measurement, 2022.
[26] X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia, and C. Shen, “Twins: Revisiting the design of spatial attention in vision transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 9355–9366, 2021.
[27] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9759–9768.
[28] Y. Wu, Y. Chen, L. Yuan, Z. Liu, L. Wang, H. Li, and Y. Fu, “Rethinking classification and localization for object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10186–10195.
[29] L. Xuhong, Y. Grandvalet, and F. Davoine, “Explicit inductive bias for transfer learning with convolutional networks,” in International Conference on Machine Learning. PMLR, 2018, pp. 2825–2834.
[30] W. Wang, J. Zhang, Y. Cao, Y. Shen, and D. Tao, “Towards data-efficient detection transformers,” arXiv preprint arXiv:2203.09507, 2022.
[31] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020.
[32] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang, “Conditional detr for fast training convergence,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3651–3660.
[33] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, “Dab-detr: Dynamic anchor boxes are better queries for detr,” arXiv preprint arXiv:2201.12329, 2022.
[34] Z. Yao, J. Ai, B. Li, and C. Zhang, “Efficient detr: improving end-to-end object detector with dense prior,” arXiv preprint arXiv:2104.01318, 2021.
[35] G. Song, Y. Liu, and X. Wang, “Revisiting the sibling head in object detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11563–11572.
[36] C. Feng, Y. Zhong, Y. Gao, M. R. Scott, and W. Huang, “Tood: Task-aligned one-stage object detection,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, 2021, pp. 3490–3499.
[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[38] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “Dn-detr: Accelerate detr training by introducing query denoising,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13619–13627.
[39] B. Qi, L. Lang, Y. Cheng, S. Liu, F. Hu, X. He, P. Deng, and L. Gui, “Passive millimeter-wave scene imaging simulation based on fast ray-tracing,” in 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE, 2016, pp. 2642–2645.
[40] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[41] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
[42] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[43] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in Neural Information Processing Systems, vol. 28, 2015.
[44] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9627–9636.
[45] W. Wang, L. Yao, L. Chen, B. Lin, D. Cai, X. He, and W. Liu, “Crossformer: A versatile vision transformer hinging on cross-scale attention,” in International Conference on Learning Representations, 2021.
[46] X. Dai, Y. Chen, B. Xiao, D. Chen, M. Liu, L. Yuan, and L. Zhang, “Dynamic head: Unifying object detection heads with attentions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7373–7382.
[47] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu et al., “Mmdetection: Open mmlab detection toolbox and benchmark,” arXiv preprint arXiv:1906.07155, 2019.