A Light-Weighted Hypergraph Neural Network For Multimodal Remote Sensing Image Retrieval
and manifold learning, which improve retrieval accuracy significantly. Apart from the abovementioned methods, the transformer framework has been used for image retrieval as well. In [49] and [50], the authors adopted transformers to generate image descriptors and trained the models with metric learning, achieving good results on different datasets.
B. Remote Sensing Image Retrieval

RS images are larger in size and contain more targets than natural scene images, so they often carry multiple labels, which makes them more difficult and costlier to represent. Therefore, compared with manually designed features, deep learning methods, especially CNNs [51], [52], [53], [54], [55], [56], [57], are increasingly used in RSIR due to their powerful feature representation capabilities. Zhang et al. [58] proposed a novel framework called T-NLNN, which combines nonlocal operations and deep metric learning; T-NLNN turns out to work better than traditional methods on all datasets. Shao et al. [59] introduced a multilabel RSIR method on the basis of FCN, which works better than classical CNN models. Many new approaches have been proposed to improve the performance of RSIR. Chaudhuri et al. [60] proposed a semisupervised method based on graph theory that builds an image neighborhood graph and associates class labels through a novel region labeling strategy; a new subgraph matching method is then used to find the image most similar to the given query image. Imbriaco et al. [61] introduced an image retrieval method that forms a global descriptor by combining attention, local convolutional features, and vectors of locally aggregated descriptors.
RS images often have a large scale; because of that, current methods need very deep networks and a great quantity of parameters to extract features well. In order to save storage space and speed up computation, feature reduction [62], [63] has been adopted in RSIR. Li and Ren [64] proposed a three-step partial randomness hashing method: first, the initial hash code estimation is randomly generated; second, the hash code is modified by a linear model; finally, the method uses a projection matrix to generate the final binary code. Li et al. [65] proposed a network on the basis of deep hashing neural networks, which can be optimized end-to-end on large-scale datasets.
As for datasets, many researchers have already established RSIR datasets. Zhou et al. [66] introduced a specifically collected dataset named PatternNet based on a single data modality; so far the dataset has been used with over 35 methods, and the results can serve as baselines for future research. Yuan et al. [19] introduced a fine-grained dataset containing thousands of query texts, keywords, and RS images, which is designed for the TBRSIR task.
C. Hypergraph Neural Network

The topological structure of a hypergraph can establish semantic relationships among multiple nodes at the same time and mine nonlinear high-order correlations. Owing to the explosive growth of graph-structured data, hypergraph neural networks are widely used in image retrieval [67], social network analysis [68], [69], image processing [70], recommender systems [71], [72], and other fields. In the hypergraph representation field, Gui et al. [73] proposed a hyperedge-based embedding framework for heterogeneous event data to represent proximity and relationships among objects. Based on this work, Tu et al. [74] designed a deep hypernetwork embedding model, which takes the highly sparse hypergraph structure into consideration to improve performance. Besides, Feng et al. [75] presented the HGNN framework; a hyperedge convolution operation is designed to tackle representation learning, which achieves better fusion of the structural attribute information and node feature information of the hypergraph. Jiang et al. [76] designed the DHGNN model to extract hidden relationships of the hypergraph; clustering methods are utilized to build and update the topology according to local and global features. To diminish the heavy use of computing resources when handling non-k-uniform hypergraphs and improve model generalization ability, Zhang et al. [77] introduced a self-attention mechanism and proposed the Hyper-SAGNN model. As for filtering noise, Yadati et al. [78] proposed the HyperGCN model to filter out possible data noise in the sampling process. Yang et al. [79] came up with a solution to the possible information loss due to the lack of symmetry in data cooccurrence.
III. METHODOLOGY

In this section, we start our discussion by introducing the basic setup of the CBRSIR task and then elaborate the details of the proposed method. HGNLSF-Net contains a hypergraph convolution layer, which lies at the core of this article. The hypergraph convolution layer is capable of relating and linking the features in the feature space. A hard-link filter is also applied in the hypergraph layer, which helps remove uncorrelated noise.

A. Problem Description

Given the collection of multimodal remote sensing images, i.e., the image set I, and a query image q ∈ Q, the positive and negative images are denoted by P_q and N_q, respectively. Ground truth labels are assigned according to the class of each image: if two images belong to the same class, they form a positive pair; otherwise, they form a negative pair. The retrieval task aims to produce a similarity ranking of the dataset with respect to the query image q.

In order to accomplish this goal, we first extract the query image and dataset features and map them into a k-dimensional space, û, v̂ ∈ R^k. The tensors are then fed into the hypergraph layer followed by a fully connected layer, and the feature similarity between different images can be computed from the output of the fully connected layer

similarity = s(F û, F v̂)    (1)

where F indicates the feature extraction process for the input image, and the function s computes the similarity between the target and query features. Here, we apply the L2 norm to measure the similarity.
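A minimal sketch of this ranking step (assuming PyTorch, with the extracted descriptors taken as given) is as follows:

```python
import torch

def rank_gallery(query_feat: torch.Tensor, gallery_feats: torch.Tensor) -> torch.Tensor:
    """Rank gallery images for one query: s in (1) is taken as the L2 distance
    between feature vectors, so a smaller distance means a higher similarity."""
    dists = torch.cdist(query_feat.unsqueeze(0), gallery_feats).squeeze(0)  # (G,)
    return torch.argsort(dists)  # gallery indices, most similar first

# Toy usage with random 100-D descriptors (the length of the final FC output).
query = torch.randn(100)
gallery = torch.randn(500, 100)
ranking = rank_gallery(query, gallery)
```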
B. Overall Method

In Fig. 2, the general structure of HGNLSF-Net is demonstrated. The first module of the network is the feature extraction layer. Initially, the multimodal images are resized to 224 × 224 × 3 and the matrix values are normalized to the interval [−1, 1] as the input. We utilize the pretrained weights of ResNet50 to get the general features from the dataset. More precisely, the output feature size is 2048 × 7 × 7.
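A rough sketch of this backbone truncation, assuming the torchvision implementation of ResNet50 (dropping the global pooling and classification head matches keeping the first eight top-level layers):

```python
import torch
import torchvision

# Keep everything up to the last residual stage; for a 224x224x3 input this
# yields the 2048x7x7 feature map described above. The weights argument assumes
# a recent torchvision; older versions use pretrained=True instead.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

x = torch.randn(1, 3, 224, 224)  # one normalized input image
features = backbone(x)           # shape: (1, 2048, 7, 7)
```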
C. Hypergraph Convolution Calculation

Given the need to match the query sample against the multimodal dataset, the multiimage relationship is no longer suitable for the traditional graph structure, since an edge in a graph can only represent a binary relationship. Compared with the pairwise relations in a graph, the hypergraph models high-order constraints, which makes it possible to regulate more than two images in the same hyperedge. Specifically, the feature maps are treated as nodes and we build a hypergraph in order to better model the long- and short-range relationships and improve the information exchange among nodes, as shown in Fig. 3.

The mathematical expression of a hypergraph is G = (V, E, W), where V represents the vertex set, E denotes the hyperedge set, each hyperedge contains two or more vertices, and W ∈ R^{M×M} is the weight matrix whose elements denote the weight of each hyperedge. In order to formulate the Laplacian operator for the hypergraph, the incidence matrix H ∈ R^{N×M} is necessary. It is usually defined by the characteristic function

H(v, e) = 1, if v ∈ e; 0, otherwise.    (2)
Unlike a normal graph, a hypergraph owns both vertex and hyperedge degree matrices. The vertex degree matrix D ∈ R^{N×N} records, for each vertex, the total number of hyperedges containing that vertex, and the hyperedge degree matrix B ∈ R^{M×M} indicates how many vertices are contained in each hyperedge. The normalized hypergraph Laplacian matrix is denoted by Δ ∈ R^{N×N}, which is computed by Δ = I − D^{−1/2} H W B^{−1} H^T D^{−1/2}. The eigenvalues and the corresponding eigenvectors of Δ are represented by the diagonal matrix λ = diag(λ_1, ..., λ_N) and Φ = [φ_1, ..., φ_N], respectively. Given the input vector x = (x_1, ..., x_N), the vector is decomposed into the spectral domain by the Fourier transform on the basis of Φ. The hypergraph convolution can be formulated as

g ∗ x = Φ g(λ) Φ^T x    (3)

where g(λ) = diag(g(λ_1), ..., g(λ_N)) implies the Fourier coefficients. However, the computation of the eigenvectors is highly time-consuming, with computational complexity approximately O(n^3). Nevertheless, Defferrard et al. [80] parameterized g(λ) with truncated Chebyshev polynomials up to the Kth order, which simplifies the calculation to a finite sum. The truncated formula for the hypergraph convolution is

g ∗ x = Σ_{k=0}^{K} θ_k T_k(Δ̂) x    (4)

where the θ_k are the coefficients of the Chebyshev sequence, Δ̂ = (2/λ_max) Δ − I, and T_k denotes the Chebyshev polynomial. We can further simplify the formula by restricting K = 1. With the estimation λ_max ≈ 2, the final version of the hypergraph convolution becomes

g ∗ x ≈ (D^{−1/2} H W B^{−1} H^T D^{−1/2}) x θ    (5)

where θ denotes the Chebyshev parameter, which will be learned in the neural network. Therefore, the iteration for the hypergraph convolution layer can be defined as

x^{l+1} = σ(D^{−1/2} H W B^{−1} H^T D^{−1/2} x^l θ).    (6)

The incidence matrix H forms the basic structure of the hypergraph and can also be utilized to propagate information among the hypergraph vertices. Therefore, better connections among hyperedges boost the information exchange among the vertices and can improve the accuracy of the retrieval task.
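To make the propagation in (6) concrete, the following is a minimal PyTorch sketch of a single hypergraph convolution layer; the incidence matrix H is taken as given (its construction and filtering are described next), σ is taken to be a ReLU, and all sizes are toy values:

```python
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    """One hypergraph convolution step, x' = σ(D^-1/2 H W B^-1 H^T D^-1/2 x θ), as in (6)."""

    def __init__(self, in_dim: int, out_dim: int, num_edges: int):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)      # Chebyshev parameter θ
        self.edge_weight = nn.Parameter(torch.ones(num_edges))   # diagonal of W

    def forward(self, x: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        # x: (N, C) node features; H: (N, M) incidence matrix.
        W = torch.diag(self.edge_weight)
        d_v = (H @ self.edge_weight).clamp(min=1e-6)   # vertex degrees, diagonal of D
        d_e = H.sum(dim=0).clamp(min=1e-6)             # hyperedge degrees, diagonal of B
        Ds = torch.diag(d_v.pow(-0.5))
        Binv = torch.diag(1.0 / d_e)
        prop = Ds @ H @ W @ Binv @ H.T @ Ds            # propagation operator of (6)
        return torch.relu(prop @ self.theta(x))

# Toy usage: 49 nodes (a 7x7 feature map) with 2048-D features and 16 hyperedges.
layer = HypergraphConv(2048, 2048, num_edges=16)
x, H = torch.randn(49, 2048), torch.rand(49, 16)
out = layer(x, H)  # (49, 2048)
```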
D. Hypergraph Features Representation

The receptive field of the CNN architecture restricts the features to local areas instead of the global structure. However, the image retrieval task requires global feature matching between the retrieval image and the target one. Therefore, we need a more advanced tool to acquire the global features of the dataset. The hypergraph behaves well in constructing multiple relationships and is capable of acquiring global information. In our network, the spatial feature F^l ∈ R^{h×w×c} is manipulated as graph-like nodes, and each feature is considered a vertex, giving X^l ∈ R^{hw×c}.

In many studies [75], [81], [82], the incidence matrix H plays a core role in the hypergraph layer, and each element of H is usually constructed by computing the norm distance between features. In our network, we take the long-term intraspatial dependency into account, and the correlation of the features is computed to give each vertex's weight on each hyperedge. Finally, the incidence matrix is given by

H = Γ(x) λ(x) Γ^T(x) μ(x)    (7)

where Γ(x) ∈ R^{N×Ĉ} denotes the embedding of the original signal with a linear transform, and Ĉ denotes its feature dimension. λ(x) ∈ R^{Ĉ×Ĉ} denotes a diagonal matrix, which learns a better distance on each vertex. μ(x) ∈ R^{N×M} evaluates the weights of each vertex on its hyperedge, which helps relate the global relationships on the feature maps to build hyperedges. The abovementioned three functions Γ(x), λ(x), and μ(x) are all calculated from the feature map: Γ(x) is embedded through a 1 × 1 convolution kernel with dimension Ĉ, λ(x) is calculated from a global average pooling, and μ(x) is calculated by an M × M convolution. They can be computed as follows:

Γ(x^l) = conv(x^l, W_Γ)
λ(x^l) = diag(conv(x^l, W_λ))
μ(x^l) = conv(x^l, W_μ)    (8)

where x^l ∈ R^{1×1×Ĉ} is the feature vector given by the global pooling on the input features, and W_Γ, W_λ, and W_μ are three parameters built for the linear embedding.
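One plausible reading of (7) and (8) in code is sketched below. The exact kernel shapes are not fully specified above, so several choices are assumptions: μ is realized here with a 1 × 1 convolution for shape consistency (the text names an M × M convolution), and a sigmoid squashing is added so that the entries of H fall in (0, 1) and the threshold values discussed next apply:

```python
import torch
import torch.nn as nn

class IncidenceBuilder(nn.Module):
    """Sketch of (7): H = Γ(x) λ(x) Γ(x)^T μ(x), built from a (C, h, w) feature map."""

    def __init__(self, in_ch: int, embed_ch: int, num_edges: int):
        super().__init__()
        self.gamma = nn.Conv2d(in_ch, embed_ch, kernel_size=1)  # Γ: 1x1 conv embedding
        self.lam = nn.Linear(in_ch, embed_ch)                   # λ: diagonal metric from GAP
        self.mu = nn.Conv2d(in_ch, num_edges, kernel_size=1)    # μ: per-node edge scores (assumed 1x1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (1, C, h, w); N = h*w nodes.
        g = self.gamma(feat).flatten(2).squeeze(0).T   # (N, Ĉ)
        pooled = feat.mean(dim=(2, 3)).squeeze(0)      # global average pooling, (C,)
        lam = torch.diag(self.lam(pooled))             # (Ĉ, Ĉ) diagonal metric
        mu = self.mu(feat).flatten(2).squeeze(0).T     # (N, M)
        H = g @ lam @ g.T @ mu                         # (N, M), as in (7)
        return torch.sigmoid(H)                        # assumption: normalize entries to (0, 1)

builder = IncidenceBuilder(in_ch=2048, embed_ch=256, num_edges=49)
H = builder(torch.randn(1, 2048, 7, 7))                # (49, 49)
```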
Remote sensing images often include many targets and much background information. When a global correlation is established, a lot of noise is introduced as well. In the hypergraph, the elements of the incidence matrix H represent the correlations among nodes, and many of them are only weak connections. Thus, we build a hard-link filter block, which examines the amplitude of each element in the incidence matrix H and only allows large values to pass through, in order to remove noise that would reduce the model's effectiveness. The mathematical expression of the hard link can be defined as

H_new = {x ∈ H : x ≥ c}.    (9)

From (9), it can be concluded that the hyperparameter c is very important for the proposed method. If c is set too large, useful semantic connections among nodes will also be filtered out; if c is set too small, it is difficult to fully filter the weak connections and potential background noise in the global correlation of the remote sensing image. In order to determine the threshold c, we conduct a series of comparative experiments with different values of c (0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9). After comparing the experimental results, the threshold is set so as to remove 70% of the total elements in the incidence matrix H.

After the hard-link block, we substitute the filtered incidence matrix into (6) and finally acquire the hypergraph layer as

x^{l+1} = σ(Δ x^l θ)    (10)

where Δ = D^{−1/2} H_new W B^{−1} H_new^T D^{−1/2}, and D, W, and B are the matrices defined above, recomputed from the filtered incidence matrix H_new.
IV. EXPERIMENTS

A. Dataset Description

Our proposed HGNLSF-Net is trained and tested on a self-crafted dataset called MMRSIRD.

Fig. 4. Examples of multimodal images in MMRSIRD.

To ensure the continuity, multimodality, and multiscale characteristics of the spatio-temporal images, the dataset used in this article is captured by multiple ultra-high-resolution satellites with spatial resolutions up to 0.3 m and with cloud cover ≤ 2%. Compared with optical images, SAR images are more difficult to acquire. The dataset contains images sampled from various satellites, including the optical sensors Gaofen-2 (spatial resolution of 1 m), SuperView-1 (spatial resolution of 0.5 m), Worldview-3 (spatial resolution of 0.3 m), and GeoEye-1 (spatial resolution of 0.8 m), and the SAR sensor Gaofen-3 (spatial resolution of 1 m). Typical samples of the multimodal images are demonstrated in Fig. 4.

For a consistent analysis of urban areas around the globe, we choose Beijing and Berlin to build the dataset. The dataset covers more than 240 km², and its time span ranges up to eight years. Both cities show strong development trends in recent years: Beijing is a rapidly expanding city, converting large amounts of farmland and bungalows to high-rise buildings; Berlin is undergoing a renewal plan, continually rebuilding and restructuring dense urban areas to keep the city alive.

MMRSIRD consists of 1774 groups of images, and each group includes 11 images shot by multiple sensors at different times over the same area. Each sample is named in the form "A_B", where A represents the group number the sample belongs to (from 0 to 1773), and B indicates the number of the sample within this group (from 0 to 10). For example, 23_7 means the 8th sample of the 24th group, as shown in Fig. 4.

B. Experimental Setup

… day without more advanced GPU support. To guarantee the stability of the experiment and prevent the model from overfitting, we use an adjustable learning rate (LR). Initially, the LR defaults to 0.001; it is then changed to 10^{−4} at epoch 16 and 10^{−5} at epoch 21. Furthermore, the model achieves its best performance using the SGD optimizer with momentum set to 0.9.
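This schedule corresponds to a standard SGD plus MultiStepLR configuration in PyTorch; a minimal sketch (the model variable is a placeholder):

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder for the actual network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# LR 1e-3 initially, then 1e-4 from epoch 16 and 1e-5 from epoch 21.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16, 21], gamma=0.1)

for epoch in range(25):
    # the training pass over the dataset goes here
    scheduler.step()
```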
For the image transformation, we first resize the input image to 224 × 224 × 3 and then normalize the input. A pretrained ResNet50 is used for image feature extraction; we only take the first eight layers of ResNet50 and output a tensor of size 2048 × 7 × 7. The tensor becomes a vector of size 2048 after the hypergraph and global average pooling layers; finally, a fully connected layer is applied, reducing the vector length to 100. This vector is used for the similarity computation.

As for evaluation, we adopt ANMRR, mAP@10, R@1, R@2, and R@5 as evaluation metrics for the image retrieval task. In the following parts, we introduce the mathematical definition of each metric.
ANMRR is short for average normalized modified retrieval rank, which is commonly applied in detection and retrieval tasks. ANMRR can be computed as [83]

ANMRR = (1/Z) Σ_{i=1}^{Z} NMRR(i)    (12)

where Z is the number of queries and NMRR(i) is the normalized modified retrieval rank of query i,

NMRR(i) = MRR(i) / (1.25 K(i) − 0.5 [1 + NG(i)])    (13)

where NG(i) is the number of ground truth results for query i, and K(i) defines the "relevant ranks," i.e., the total length of the retrieval result that counts; K(i) is usually set to min{4 × NG(i), 2 × max{NG(j) | ∀j}}. MRR(i) is short for modified retrieval rank and can be calculated as follows:

MRR(i) = AVR(i) − 0.5 [1 + NG(i)]    (14)

where AVR(i) is the average rank, indicating the rank of the correct retrieval results in the entire list. AVR(i) can be formulated as follows:

AVR(i) = (1/NG(i)) Σ_{k=1}^{NG(i)} Rank*(k)    (15)

where Rank*(k) is defined as

Rank*(k) = Rank(k), if Rank(k) ≤ K(i); 1.25 K(i), if Rank(k) > K(i)    (16)

where Rank(k) means the rank of the kth correct retrieval result given the query. According to the definition, the ANMRR value ranges from 0 to 1, and a smaller value indicates a more accurate retrieval result.
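For reference, a direct transcription of (12)-(16) into NumPy; ranks are the 1-based positions of the ground truth items in the retrieval list:

```python
import numpy as np

def nmrr(ranks: list[int], max_ng: int) -> float:
    """NMRR for one query, following (13)-(16). `ranks` holds the 1-based
    positions of the ground truth items in the retrieval list."""
    ng = len(ranks)
    k = min(4 * ng, 2 * max_ng)                              # cutoff K(i)
    penalized = [r if r <= k else 1.25 * k for r in ranks]   # Rank*(k), (16)
    avr = sum(penalized) / ng                                # AVR(i), (15)
    mrr = avr - 0.5 * (1 + ng)                               # MRR(i), (14)
    return mrr / (1.25 * k - 0.5 * (1 + ng))                 # NMRR(i), (13)

def anmrr(all_ranks: list[list[int]]) -> float:
    """ANMRR over Z queries, (12)."""
    max_ng = max(len(r) for r in all_ranks)
    return float(np.mean([nmrr(r, max_ng) for r in all_ranks]))

# Two toy queries whose ground truth items were retrieved at these positions:
print(anmrr([[1, 2, 4], [3, 8]]))
```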
Recall@k is another standard criterion to assess the accuracy. For any query image q, Recall@k is the number of positive examples retrieved within the top k over the total number of positive examples for q. It can be formulated as

R_I^k(q) = ( Σ_{x∈P_q} He(k − r_I(q, x)) ) / |P_q|    (17)

where r_I(q, x) denotes the rank of example x, and He(·) denotes the Heaviside step function, which gives 0 for negative values and 1 otherwise. The rank r_I(q, x) is calculated as

r_I(q, x) = 1 + Σ_{z∈I, z≠x} He(s_qz − s_qx)    (18)

where s is the similarity function in (1). Therefore, Recall@k can be written as

R_I^k(q) = ( Σ_{x∈P_q} He(k − 1 − Σ_{z∈I, z≠x} He(s_qz − s_qx)) ) / |P_q|.    (19)
|Pq | dynamic triplet loss function is included in the model and the
(19) model performs very well.
Mean average precision at top-k(mAP@K) is another stan- Table I gives overall results. The bold indicates the best
dard criterion for image retrieval task. For any positive example results. From Table I, it can be concluded that all the other
x ∈ PQ , we can name the sequence x1 , x2 , ..., xv , v ≤ k accord- methods fail in complex scene understanding and relationship
ing to the rank of example x base on the similarity to query q. modeling except ours, which demonstrates the challenge of the
Then, average precision (AP) can be calculated as MMRSIRD dataset.
m
Compared with other methods, our proposed model achieves
i
AP = , x i ∈ PQ good performance across all evaluation criteria, including AN-
i=1
rI (q, xi ) MRR, mAP@10, R@1, R@2, and R@5. That is to say, our
HGNLSF-Net is the best one among all the compared models.
m= He(k − rI (q, x)). (20)
To be specific, the HGNLSF-Net performs better than DELF, one
x∈Pq
of the most widely used benchmarks in CBRSIR field and gains
Finally, the mAP@K for n query image and top-k examples a significant improvement of 0.0909, 10.53%, 31.84%, 29.84%,
is calculated as and 18.50% in ANMRR, mAP@10, R@1, R@2, and R@5,
n m respectively. In addition, compared with Rerank-Transformer,
1 i
mAP = . (21) a transformer-based model that obtains best results in many
n j=1 i=1 rI (qj , xi ) datasets, the improvement of our model gains 0.0493, 3.40%,
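Similarly, Recall@k from (17)-(19) and AP@k from (20) reduce to a few lines; mAP@k is then the mean of AP@k over all queries, as in (21):

```python
import numpy as np

def recall_at_k(ranks: np.ndarray, k: int) -> float:
    """Recall@k for one query, as in (17)-(19): the fraction of the query's
    positive examples that appear within the top-k results."""
    return float(np.mean(ranks <= k))

def ap_at_k(ranks: np.ndarray, k: int) -> float:
    """AP@k for one query, as in (20): the i-th best positive, retrieved at
    rank r, contributes i/r."""
    hits = np.sort(ranks[ranks <= k])  # positives inside the top k
    if hits.size == 0:
        return 0.0
    return float(np.mean(np.arange(1, hits.size + 1) / hits))

# A toy query whose positives were retrieved at ranks 1, 2, and 4:
ranks = np.array([1, 2, 4])
print(recall_at_k(ranks, 5), ap_at_k(ranks, 10))  # 1.0 and ~0.917
```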
C. Performance Evaluation of the HGNLSF-Net

We use three SOTA models to conduct comparative experiments against our proposed model as the performance validation. The three models are DELF, Rerank-Transformer, and AMFMN.

DELF [84]: A new descriptor is introduced to solve the retrieval challenge of large-scale remote sensing images. It first extracts dense features by utilizing a fully convolutional network. Furthermore, to make the features more robust and generic, an attention-based keypoint selection module using a weak supervision signal, i.e., image-level labels, is introduced; a subset of the dense features is effectively filtered by the keypoints. Finally, L2 normalization and PCA are used for feature reduction and for balancing compactness and discriminativeness. Several experimental results show that DELF is an effective method to improve the retrieval accuracy of remote sensing images.

Rerank-Transformer [85]: The method uses two steps to retrieve images. First of all, the initial retrieval result is calculated by comparing the feature similarity between the query condition and the dataset. Second, a new rerank module is introduced to improve the model effect. To be specific, by using the top-M images as anchors, the module is able to extract affinity features for the query image and the top-L results. In addition, a transformer module is introduced to update the extracted features. The model is more robust because its input features are not extracted from the original image.

AMFMN [19]: The method comes up with an effective feature extraction model for the multimodal remote sensing retrieval task. By designing a new multimodal feature matching model, it can better handle the difficulties of remote sensing images compared with natural scene images, i.e., a large number of targets and multiple scales of different targets. The self-attention mechanism is employed to enhance features and can dynamically filter redundant features. To make the model fit better to the characteristic of strong intraclass similarity, a new dynamic triplet loss function is included in the model, and the model performs very well.

TABLE I: PERFORMANCE OF IMAGE RETRIEVAL METHODS ON MMRSIRD

Table I gives the overall results; bold indicates the best results. From Table I, it can be concluded that all the other methods fail in complex scene understanding and relationship modeling except ours, which demonstrates the challenge of the MMRSIRD dataset.

Compared with the other methods, our proposed model achieves good performance across all evaluation criteria, including ANMRR, mAP@10, R@1, R@2, and R@5. That is to say, our HGNLSF-Net is the best among all the compared models. To be specific, HGNLSF-Net performs better than DELF, one of the most widely used benchmarks in the CBRSIR field, gaining a significant improvement of 0.0909, 10.53%, 31.84%, 29.84%, and 18.50% in ANMRR, mAP@10, R@1, R@2, and R@5, respectively. In addition, compared with Rerank-Transformer, a transformer-based model that obtains the best results on many datasets, our model gains 0.0493, 3.40%, 13.95%, 12.82%, and 6.36% in the five metrics. AMFMN is originally designed for the TBRSIR task and has a strong image feature representation capability; HGNLSF-Net still achieves the optimal performance in all metrics. In summary, our model can better learn the nonlocal semantic features in the entire image because of the hypergraph module. Also, thanks to the hard-link block that erases the unnecessary noise between nodes, it improves the overall retrieval accuracy.
However, we observe an imbalanced result in our experiment, since there are fewer examples of SAR images in the dataset compared with optical images; more precisely, MMRSIRD has about ten times as many optical samples as SAR samples. For this reason, we observe a significant difference in accuracy between optical and SAR queries: the accuracy is higher when we test with an optical query. The results are demonstrated in Table II.

TABLE II: COMPARISON RESULTS OF DIFFERENT QUERY TYPES

This phenomenon can be explained by the imbalance of training examples: the network learns more optical features than SAR features; thus, it is harder to match a SAR query to its corresponding optical examples. Moreover, since one group consists of ten optical images and one SAR image in MMRSIRD, when we test one optical query, 90% of its group candidates are optical images and only 10% are SAR images; for a SAR image query, the situation is the opposite. We have to admit that homogeneous retrieval does have an advantage over cross-modal retrieval, since the same modality shares more common characteristics in the feature space, which explains the difference between the two kinds of query. Even considering this factor, however, the overall retrieval accuracy for SAR image queries is still higher than that of all other methods on the MMRSIRD dataset in R@1 and R@2.
D. Analysis on the Light-Weighted Characteristics of the Proposed Model

The hypergraph convolution layer naturally preserves a light-weighted structure for two reasons. First of all, both degree matrices B and D are calculated from the incidence matrix H, and the computation only involves summation rather than multiplication, which contributes little to the computational complexity and storage memory. Second, the hypergraph convolution layer can achieve a better performance with just one layer; more layers will cause an unfavorable oversmoothing problem and contribute a negative effect to the result. The comparison of parameter quantities between the different models is given in Table III.

TABLE III: COMPARISON OF PARAMETERS OF DIFFERENT MODELS

The results are calculated by the package torchsummary. The estimated total size accounts for the input and output sizes and the forward- and backward-propagation sizes in addition to the parameter size. It can be summarized that our model not only achieves the best results but also has the minimum parameter quantity, parameter size, and total model size. The HGNLSF-Net model is just slightly bigger than the backbone ResNet50 in the three metrics, which demonstrates the feature representation ability of the hypergraph convolution network. Among all the models, the Rerank-Transformer has the largest number of parameters and model size due to its many stacked transformer blocks. DELF and AMFMN are relatively close in terms of parameter quantity and parameter size, but the overall model size of AMFMN is much larger than that of DELF, probably because AMFMN is originally designed for TBRSIR and its input and output processing and loss function are quite different from DELF's. In the following, we give a simple proof of why our proposed model can achieve better results with fewer layers.
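For reference, such a report can be produced as follows; the model shown is a stand-in for the network actually being measured:

```python
import torch
import torchvision
from torchsummary import summary

# Placeholder model: the torchvision ResNet50 backbone; substitute the network
# under measurement. summary() prints per-layer output shapes, the parameter
# count, and the estimated sizes (params, forward/backward pass, input, total).
model = torchvision.models.resnet50()
summary(model, input_size=(3, 224, 224), device="cpu")
```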
For the hypergraph operator Δ, if we apply it k times, it can be written as

lim_{k→∞} Δ^k = lim_{k→∞} (D^{−1/2} H W B^{−1} H^T D^{−1/2})^k = lim_{k→∞} V (I − λ̃)^k V^T = lim_{k→∞} V diag((1 − λ̃_1)^k, (1 − λ̃_2)^k, ..., (1 − λ̃_N)^k) V^T    (22)
where V = D^{−1/2} H and the λ̃_i are the eigenvalues of the operator. Since the λ̃_i all lie between 0 and 1 except λ̃_1 = 0, the abovementioned equation becomes

lim_{k→∞} Δ^k = lim_{k→∞} V diag(1, 0, ..., 0) V^T.    (23)

If we have a signal input x, then

lim_{k→∞} Δ^k x = lim_{k→∞} V diag(1, 0, ..., 0) V^T x = ⟨x, v_1⟩ v_1 = x̃_1 v_1.    (24)

Since the eigenvector v_1 corresponds to the eigenvalue λ̃_1 = 0, v_1 is a constant vector. Therefore, if a signal is continuously smoothed, it eventually becomes equal everywhere, and there is no distinguishability at all. This property of the hypergraph layer restricts it to a single layer, which contributes to the light-weight structure.
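The collapse predicted by (22)-(24) is easy to verify numerically. The sketch below uses a random toy incidence matrix (not the learned one) and shows that two unrelated inputs become indistinguishable after repeated smoothing, because both collapse onto the leading eigenvector of the operator:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 4                                   # toy sizes: 8 vertices, 4 hyperedges
H = (rng.random((N, M)) > 0.5).astype(float)
H[:, 0] = 1.0                                 # every vertex belongs to hyperedge 0
H[0, :] = 1.0                                 # every hyperedge contains vertex 0
W = np.eye(M)                                 # unit hyperedge weights

Dv = np.diag(H.sum(axis=1))                   # vertex degree matrix D
B = np.diag(H.sum(axis=0))                    # hyperedge degree matrix B
Ds = np.diag(1.0 / np.sqrt(np.diag(Dv)))
A = Ds @ H @ W @ np.linalg.inv(B) @ H.T @ Ds  # D^-1/2 H W B^-1 H^T D^-1/2

x = rng.standard_normal(N)
y = rng.standard_normal(N)
for _ in range(100):                          # smooth two unrelated signals
    x, y = A @ x, A @ y

# Both collapse onto the same direction: all input-dependent structure is lost.
cos = abs(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(f"cosine similarity after smoothing: {cos:.6f}")  # ~1.0
```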
E. Ablation Experiments

We investigate the effectiveness of each block by conducting ablation experiments on HGNLSF-Net. Three different blocks are tested.

In the first experiment, VGG-16 and ResNet101 are used to replace ResNet50. In the second experiment, the network becomes purely ResNet50 after removing the hypergraph layer; since the input and output dimensions of the hypergraph layer are the same, we can simply write off the code lines to get rid of the layer. The third experiment involves the hard-link block; since this block does not alter the input-output dimension either, we can simply erase those code lines to accomplish the ablation. It should be noticed that the hard-link is part of the hypergraph layer, so once the hypergraph layer is removed, the hard-link block does not exist anymore either. The ablation experiment results are demonstrated in Table IV.

TABLE IV: ABLATION EXPERIMENTS OF HGNLSF-NET ON THE MMRSIRD DATASET

As shown in Table IV, we take V1-V5 to represent the following different blocks:
1) V1: ResNet50;
2) V2: ResNet101;
3) V3: VGG16;
4) V4: Hypergraph layer;
5) V5: Hard-link.

From the abovementioned results, there is an accuracy decay when we replace ResNet with VGG16, because the skip connections in ResNet reduce the gradient vanishing issue by providing an alternate route for the gradient. Meanwhile, the accuracy of ResNet101 turns out to be a little lower than that of ResNet50, meaning that adding more layers to the CNN backbone does not always lead to better accuracy, and more layers lead to more time consumption during training. Although more stacked layers in the backbone can help acquire general features of the dataset, the accuracy may reach a limit and slowly decrease after a threshold; finally, the accuracy may dwindle in both training and testing.

Compared with the small accuracy difference in the backbone-replacement experiment, there is a significant accuracy decay from removing the hypergraph layer. The accuracy drops 0.3510, 35.22%, 59.33%, 51.50%, and 44.32% in ANMRR, mAP@10, R@1, R@2, and R@5 compared with the complete model. Thanks to the flexibility and capability of hypergraph in …
F. Visual Analysis

In this section, we give visual results of our experiments. In Fig. 5, we have an optical query image; the top five results contain three positive examples at ranks 1, 2, and 4 and two negative examples at ranks 3 and 5.

Fig. 5. Visual results for an optical image query on the MMRSIRD dataset; the top five results are demonstrated.

Based on (20), the AP is computed as follows:

AP = (1/m) Σ_{i=1}^{m} i / r_I(q, x_i) = (1/3) (1/1 + 2/2 + 3/4) ≈ 0.917.    (25)

In this experiment, the first negative example appears at position 3, and we observe that there is a texture-level similarity between this image and the query, as shown in Fig. 6. Since the convolution layers in ResNet50 capture local texture features, we believe that is the reason why such an image can achieve a high rank among the others.

Fig. 6. Comparison of texture features for the query and retrieval image. (a) Query image (partial). (b) Retrieval image (partial).

In Fig. 7, we select one SAR image as a query; the top five results turn out to be two positive and three negative examples. The positive examples rank at positions 2 and 4, whereas the negative examples rank at positions 1, 3, and 5. The average precision is calculated as

AP = (1/m) Σ_{i=1}^{m} i / r_I(q, x_i) = (1/2) (1/2 + 2/4) = 0.5.    (26)

Fig. 7. Visual results for a SAR image query on the MMRSIRD dataset; the top five results are demonstrated.

The previous section has explained that SAR image retrieval has a relatively low accuracy compared with optical image retrieval. For SAR image queries, we observe that at least one SAR image appears in the top five of most results, whether it is a positive or a negative example, which reconfirms that images of the same modality share more common features, promoting the retrieval accuracy.
V. CONCLUSION

In this article, in order to realize the retrieval of large-scale, semantic-level, multimodal remote sensing images from one query image in the CBRSIR task, we propose a new hypergraph-based nonlocal semantic fusion network, i.e., HGNLSF-Net. Because the topological property of the hypergraph allows it to model associations among multiple nodes at the same time, the hypergraph neural network is designed to model nonlocal semantic features rather than local, target-level relationships. However, because of the complexity of the foreground and background in remote sensing images, the global semantic association often contains a lot of noise unrelated to the task. To solve this problem, a hard-link module is introduced to filter the noise. In addition, as too many stacked layers reduce the accuracy, the model can achieve better results with fewer parameters. The hypergraph convolutional module proposed in this article can also be embedded into other networks as a feature extraction layer to provide stronger representation ability. The results on the typical dataset demonstrate that HGNLSF-Net can improve CBRSIR task performance.

REFERENCES
[1] Y. Ma et al., “Remote sensing big data computing: Challenges and opportunities,” Future Gener. Comput. Syst., vol. 51, pp. 47–60, 2015.
[2] N. Skytland, “What is NASA doing with big data today,” 2012. [Online]. Available: https://www.opennasa.org/what-is-nasa-doing-with-big-data-today.html
[3] P. Gamba, P. Du, C. Juergens, and D. Maktav, “Foreword to the special issue on ‘human settlements: A global remote sensing challenge’,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 4, no. 1, pp. 5–7, Mar. 2011.
[4] X. Huang, D. Wen, J. Li, and R. Qin, “Multi-level monitoring of subtle urban changes for the megacities of China using high-resolution multi-view satellite imagery,” Remote Sens. Environ., vol. 196, pp. 56–75, 2017.
[5] S. Tian, Y. Zhong, A. Ma, and L. Zhang, “Three-dimensional change detection in urban areas based on complementary evidence fusion,” IEEE Trans. Geosci. Remote Sens., vol. 60, no. 1, Jan. 2021, Art. no. 5608913.
[6] R. Hang, P. Yang, F. Zhou, and Q. Liu, “Multiscale progressive segmentation network for high-resolution remote sensing imagery,” IEEE Trans. Geosci. Remote Sens., vol. 60, no. 1, Jan. 2022, Art. no. 5412012.
[7] D. Qin et al., “MSIM: A change detection framework for damage assessment in natural disasters,” Expert Syst. Appl., vol. 97, pp. 372–383, 2018.
[8] M. Zhang, W. Shi, S. Chen, Z. Zhan, and Z. Shi, “Deep multiple instance learning for landslide mapping,” IEEE Geosci. Remote Sens. Lett., vol. 18, no. 10, pp. 1711–1715, Oct. 2021.
[9] R. A. Bindschadler, T. A. Scambos, H. Choi, and T. M. Haran, “Ice sheet change detection by satellite image differencing,” Remote Sens. Environ., vol. 114, no. 7, pp. 1353–1362, Oct. 2011.
[10] A. Ochtyra, A. Marcinkowska-Ochtyra, and E. Raczko, “Threshold- and trend-based vegetation change monitoring algorithm based on the inter-annual multi-temporal normalized difference moisture index series: A case study of the Tatra Mountains,” Remote Sens. Environ., vol. 249, 2020, Art. no. 112026.
[11] R. Hang, X. Qian, and Q. Liu, “Cross-modality contrastive learning for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 60, no. 1, Jan. 2022, Art. no. 5532812.
[12] L. Zhang and Y. Rui, “Image search–from thousands to billions in 20 years,” ACM Trans. Multimedia Comput., Commun., Appl., vol. 9, no. 1s, pp. 1–20, 2013.
[13] R. R. Larson, Introduction to Information Retrieval. Cambridge, MA, USA: Cambridge Univ. Press, 2010.
[14] M. Wolfmuller, D. Dietrich, E. Sireteanu, S. Kiemle, E. Mikusch, and M. Bottcher, “Data flow and workflow organization—The data management for the TerraSAR-X payload ground segment,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 1, pp. 44–50, Jan. 2009.
[15] X. Wang, N. Chen, Z. Chen, X. Yang, and J. Li, “Earth observation metadata ontology model for spatiotemporal-spectral semantic-enhanced satellite observation discovery: A case study of soil moisture monitoring,” GIScience Remote Sens., vol. 53, no. 1, pp. 22–44, 2016.
[16] C.-R. Shyu et al., “GeoIRIS: Geospatial information retrieval and indexing system–content mining, semantics modeling, and complex queries,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 4, pp. 839–852, Apr. 2007.
[17] Y. Chen and X. Lu, “A deep hashing technique for remote sensing image-sound retrieval,” Remote Sens., vol. 12, no. 1, 2019, Art. no. 84.
[18] Z. Shi and Z. Zou, “Can a machine generate humanlike language descriptions for a remote sensing image?,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 6, pp. 3623–3634, Jun. 2017.
[19] Z. Yuan et al., “Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval,” IEEE Trans. Geosci. Remote Sens., vol. 60, no. 1, Jan. 2022, Art. no. 4404119.
[20] Q. Cheng, Y. Zhou, P. Fu, Y. Xu, and L. Zhang, “A deep semantic alignment network for the cross-modal image-text retrieval in remote sensing,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, no. 1, pp. 4284–4297, Jun. 2021.
[21] T. Bretschneider, R. Cavet, and O. Kao, “Retrieval of remotely sensed imagery using spectral information content,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., 2002, pp. 2253–2255.
[22] A. Vellaikal, C.-C. J. Kuo, and S. K. Dao, “Content-based retrieval of remote-sensed images using vector quantization,” in Visual Information Processing IV. Bellingham, WA, USA: SPIE, 1995, pp. 178–189.
[23] G. Healey and A. Jain, “Retrieving multispectral satellite images using physics-based invariant representations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 8, pp. 842–848, Aug. 1996.
[24] B. Luo, J.-F. Aujol, Y. Gousseau, and S. Ladjal, “Indexing of satellite images with different resolutions by wavelet features,” IEEE Trans. Image Process., vol. 17, no. 8, pp. 1465–1472, Aug. 2008.
[25] S. Newsam, L. Wang, S. Bhagavathy, and B. S. Manjunath, “Using texture to analyze and manage large collections of remote sensed image and video data,” Appl. Opt., vol. 43, no. 2, pp. 210–217, 2004.
[26] Z. Shao, W. Zhou, L. Zhang, and J. Hou, “Improved color texture descriptors for remote sensing image retrieval,” J. Appl. Remote Sens., vol. 8, no. 1, 2014, Art. no. 083584.
[27] A. Ma and I. K. Sethi, “Local shape association based retrieval of infrared satellite images,” in Proc. 7th Int. Symp. Multimedia, 2005.
[28] G. J. Scott, M. N. Klaric, C. H. Davis, and C.-R. Shyu, “Entropy-balanced bitmap tree for shape-based object retrieval from large-scale satellite imagery databases,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 5, pp. 1603–1616, May 2011.
[29] G.-S. Xia, W. Yang, J. Delon, Y. Gousseau, H. Sun, and H. Maître, “Structural high-resolution satellite image indexing,” in Proc. ISPRS TC VII Symp.-100 Years ISPRS, 2010, pp. 298–303.
[30] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Proc. IEEE Int. Conf. Comput. Vis., 2003, vol. 3, pp. 1470–1477.
[31] W. Zhou, Z. Shao, C. Diao, and Q. Cheng, “High-resolution remote-sensing imagery retrieval using sparse features by auto-encoder,” Remote Sens. Lett., vol. 6, no. 10, pp. 775–783, 2015.
[32] J.-E. Lee, R. Jin, and A. K. Jain, “Rank-based distance metric learning: An application to image retrieval,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[33] B. Chaudhuri, B. Demir, L. Bruzzone, and S. Chaudhuri, “Region-based retrieval of remote sensing images using an unsupervised graph-theoretic approach,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 7, pp. 987–991, Jul. 2016.
[34] Y. Li, Y. Zhang, C. Tao, and H. Zhu, “Content-based high-resolution remote sensing image retrieval via unsupervised feature learning and collaborative affinity metric fusion,” Remote Sens., vol. 8, no. 9, 2016, Art. no. 709.
[35] U. Chaudhuri, B. Banerjee, and A. Bhattacharya, “Siamese graph convolutional network for content based remote sensing image retrieval,” Comput. Vis. Image Understanding, vol. 184, pp. 22–30, 2019.
[36] Y. Li, Y. Zhang, X. Huang, and J. Ma, “Learning source-invariant deep hashing convolutional neural networks for cross-source remote sensing image retrieval,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 11, pp. 6521–6536, Nov. 2018.
[37] U. Chaudhuri, B. Banerjee, A. Bhattacharya, and M. Datcu, “CMIR-NET: A deep learning based model for cross-modal retrieval in remote sensing,” Pattern Recognit. Lett., vol. 131, pp. 456–462, 2020.
[38] W. Xiong, Z. Xiong, Y. Zhang, Y. Cui, and X. Gu, “A deep cross-modality hashing network for SAR and optical remote sensing images retrieval,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, no. 1, pp. 5284–5296, Jun. 2020.
[39] R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Image retrieval: Ideas, influences, and trends of the new age,” ACM Comput. Surv., vol. 40, no. 2, pp. 1–60, 2008.
[40] W. Chen et al., “Deep learning for instance retrieval: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., early access, Nov. 1, 2022, doi: 10.1109/TPAMI.2022.3218591.
[41] M. Verma and B. Raman, “Local neighborhood difference pattern: A new feature descriptor for natural and texture image retrieval,” Multimedia Tools Appl., vol. 77, no. 10, pp. 11843–11866, 2018.
[42] J. Pradhan, A. Ajad, A. K. Pal, and H. Banka, “Multi-level colored directional motif histograms for content-based image retrieval,” Vis. Comput., vol. 36, no. 9, pp. 1847–1868, 2020.
[43] A. Singhal, M. Agarwal, and R. B. Pachori, “Directional local ternary co-occurrence pattern for natural image retrieval,” Multimedia Tools Appl., vol. 80, no. 10, pp. 15901–15920, 2021.
[44] J. Wan et al., “Deep learning for content-based image retrieval: A comprehensive study,” in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 157–166.
[45] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, “Deep image retrieval: Learning global representations for image search,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 241–257.
[46] N. Garcia and G. Vogiatzis, “Learning non-metric visual similarity for image retrieval,” Image Vis. Comput., vol. 82, pp. 18–25, 2019.
[47] C. Chang, G. Yu, C. Liu, and M. Volkovs, “Explore-exploit graph traversal for image retrieval,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9423–9431.
[48] C. Liu et al., “Guided similarity separation for image retrieval,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 1554–1564.
[49] F. Tan, J. Yuan, and V. Ordonez, “Instance-level image retrieval using reranking transformers,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 12105–12115.
[50] A. El-Nouby, N. Neverova, I. Laptev, and H. Jégou, “Training vision transformers for image retrieval,” 2021, arXiv:2102.05644.
[51] Y. Sun et al., “Multisource data reconstruction-based deep unsupervised hashing for unisource remote sensing image retrieval,” IEEE Trans. Geosci. Remote Sens., vol. 60, no. 1, Jan. 2022, Art. no. 5546316.
[52] Y. Sun et al., “Multisensor fusion and explicit semantic preserving-based deep hashing for cross-modal remote sensing image retrieval,” IEEE Trans. Geosci. Remote Sens., vol. 60, no. 1, Jan. 2021, Art. no. 5219614.
[53] Y. Sun et al., “Unsupervised deep hashing through learning soft pseudo label for remote sensing image retrieval,” Knowl.-Based Syst., vol. 239, 2022, Art. no. 107807.
[54] W. Zhou, S. Newsam, C. Li, and Z. Shao, “Learning low dimensional convolutional neural networks for high-resolution remote sensing image retrieval,” Remote Sens., vol. 9, no. 5, 2017, Art. no. 489.
[55] Y. Liu, L. Ding, C. Chen, and Y. Liu, “Similarity-based unsupervised deep transfer learning for remote sensing image retrieval,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 11, pp. 7872–7889, Nov. 2020.
[56] W. Xiong, Y. Lv, Y. Cui, X. Zhang, and X. Gu, “A discriminative feature learning approach for remote sensing image retrieval,” Remote Sens., vol. 11, no. 3, 2019, Art. no. 281.
[57] R. Cao et al., “Enhancing remote sensing image retrieval using a triplet deep metric learning network,” Int. J. Remote Sens., vol. 41, no. 2, pp. 740–751, 2020.
[58] M. Zhang, Q. Cheng, F. Luo, and L. Ye, “A triplet nonlocal neural network with dual-anchor triplet loss for high-resolution remote sensing image retrieval,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, no. 1, pp. 2711–2723, Jun. 2021.
[59] Z. Shao, W. Zhou, X. Deng, M. Zhang, and Q. Cheng, “Multilabel remote sensing image retrieval based on fully convolutional network,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, no. 1, pp. 318–328, Jan. 2020.
[60] B. Chaudhuri, B. Demir, S. Chaudhuri, and L. Bruzzone, “Multilabel remote sensing image retrieval using a semisupervised graph-theoretic method,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 2, pp. 1144–1158, Feb. 2018.
[61] R. Imbriaco, C. Sebastian, E. Bondarev, and P. H. de With, “Aggregated deep local features for remote sensing image retrieval,” Remote Sens., vol. 11, no. 5, 2019, Art. no. 493.
[62] T. Mukhtar et al., “Dimensionality reduction using discriminative autoencoders for remote sensing image retrieval,” in Proc. Int. Conf. Image Anal. Process., 2019, pp. 499–508.
[63] Y. Wang, S. Ji, M. Lu, and Y. Zhang, “Attention boosted bilinear pooling for remote sensing image retrieval,” Int. J. Remote Sens., vol. 41, no. 7, pp. 2704–2724, 2020.
[64] P. Li and P. Ren, “Partial randomness hashing for large-scale remote sensing image retrieval,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 3, pp. 464–468, Mar. 2017.
[65] Y. Li, Y. Zhang, X. Huang, H. Zhu, and J. Ma, “Large-scale remote sensing image retrieval by deep hashing neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 2, pp. 950–965, Feb. 2018.
[66] W. Zhou, S. Newsam, C. Li, and Z. Shao, “PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval,” ISPRS J. Photogrammetry Remote Sens., vol. 145, pp. 197–209, 2018.
[67] Y. Huang, Q. Liu, S. Zhang, and D. N. Metaxas, “Image retrieval via probabilistic hypergraph ranking,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3376–3383.
[68] Z.-K. Zhang and C. Liu, “A hypergraph model of social tagging networks,” J. Stat. Mechanics Theory Experiment, vol. 2010, no. 10, 2010, Art. no. P10005.
[69] J. Yu, H. Yin, J. Li, Q. Wang, N. Q. V. Hung, and X. Zhang, “Self-supervised multi-channel hypergraph convolutional network for social recommendation,” in Proc. Web Conf., 2021, pp. 413–424.
[70] S. Zhang, S. Cui, and Z. Ding, “Hypergraph-based image processing,” in Proc. IEEE Int. Conf. Image Process., 2020, pp. 216–220.
[71] J. Bu et al., “Music recommendation by unified hypergraph: Combining social media information and music content,” in Proc. 18th ACM Int. Conf. Multimedia, 2010, pp. 391–400.
[72] Y. Zhu, Z. Guan, S. Tan, H. Liu, D. Cai, and X. He, “Heterogeneous hypergraph embedding for document recommendation,” Neurocomputing, vol. 216, pp. 150–162, 2016.
[73] H. Gui, J. Liu, F. Tao, M. Jiang, B. Norick, and J. Han, “Large-scale embedding learning in heterogeneous event data,” in Proc. IEEE 16th Int. Conf. Data Mining, 2016, pp. 907–912.
[74] K. Tu, P. Cui, X. Wang, F. Wang, and W. Zhu, “Structural deep embedding for hyper-networks,” in Proc. AAAI Conf. Artif. Intell., 2018, pp. 426–433.
[75] Y. Feng, H. You, Z. Zhang, R. Ji, and Y. Gao, “Hypergraph neural networks,” in Proc. AAAI Conf. Artif. Intell., vol. 33, no. 01, 2019, pp. 3558–3565.
[76] J. Jiang, Y. Wei, Y. Feng, J. Cao, and Y. Gao, “Dynamic hypergraph neural networks,” in Proc. Int. Joint Conf. Artif. Intell., 2019, pp. 2635–2641.
[77] R. Zhang, Y. Zou, and J. Ma, “Hyper-SAGNN: A self-attention based graph neural network for hypergraphs,” in Proc. Int. Conf. Learn. Representations, 2020, pp. 1–18.
[78] N. Yadati, M. Nimishakavi, P. Yadav, V. Nitin, A. Louis, and P. Talukdar, “HyperGCN: A new method for training graph convolutional networks on hypergraphs,” in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019, pp. 1511–1522.
[79] C. Yang, R. Wang, S. Yao, and T. Abdelzaher, “Hypergraph learning with line expansion,” in Proc. IEEE Int. Conf. Big Data, 2020, pp. 669–678.
[80] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 3844–3852.
[81] S. Bai, F. Zhang, and P. H. Torr, “Hypergraph convolution and hypergraph attention,” Pattern Recognit., vol. 110, 2021, Art. no. 107637.
[82] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[83] B. S. Manjunath, J.-R. Ohm, V. V. Vasudevan, and A. Yamada, “Color and texture descriptors,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 6, pp. 703–715, Jun. 2001.
[84] H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han, “Large-scale image retrieval with attentive deep local features,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3456–3465.
[85] J. Ouyang, H. Wu, M. Wang, W. Zhou, and H. Li, “Contextual similarity aggregation with self-attention for visual re-ranking,” in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 3135–3148.
Hongfeng Yu received the B.Sc. degree in cartography and geographical information system and the M.Sc. degree in photogrammetry and remote sensing from Peking University, Beijing, China, in 2013 and 2016, respectively. He is currently working toward the Ph.D. degree in signal and information processing with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China. His research interests include deep learning and multimodal remote sensing interpretation.

Xiaoyu Liu received the B.Sc. degree in information management and information system and the M.Sc. degree in management science and engineering from the Harbin Institute of Technology, Harbin, China, in 2020 and 2022, respectively. She is currently an Assistant Professor with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China. Her research interests include deep learning and remote sensing image processing.

Chubo Deng received the B.Sc. degree in applied mathematics from Hong Kong Baptist University, Hong Kong, in 2012, and the Ph.D. degree in applied mathematics from the George Washington University, Washington, DC, USA, in 2018. He is currently an Assistant Professor with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China. His research interests include information modeling and remote sensing image processing.

Wanxuan Lu received the B.Sc. degree in detection, guidance and control technology from the Beijing Institute of Technology, Beijing, China, in 2016, and the Ph.D. degree in signal and information processing from the Institute of Electronics, Chinese Academy of Sciences, Beijing, China, in 2021. She is currently an Assistant Professor with the Aerospace Information Research Institute, Chinese Academy of Sciences. Her research interests include computer vision and remote sensing image processing.
processing.