0% found this document useful (0 votes)
41 views13 pages

A Light-Weighted Hypergraph Neural Network For Multimodal Remote Sensing Image Retrieval

Uploaded by

P. VENKATESHWARI
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
41 views13 pages

A Light-Weighted Hypergraph Neural Network For Multimodal Remote Sensing Image Retrieval

Uploaded by

P. VENKATESHWARI
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

2690 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL.

16, 2023

A Light-Weighted Hypergraph Neural Network for


Multimodal Remote Sensing Image Retrieval
Hongfeng Yu , Chubo Deng , Liangjin Zhao , Lingxiang Hao, Xiaoyu Liu, Wanxuan Lu , and Hongjian You

Abstract—With the continuous maturity of remote sensing tech- I. INTRODUCTION


nology, the obtained remote sensing images’ quality and quantity
ARTH observation has become one of the technologies
have surpassed any previous period. In this context, the content-
based remote sensing image retrieval (CBRSIR) task attracts a lot
of attention and research interest. Nowadays, the previous CBRSIR
E vigorously developed by more and more countries, both
the amount and the resolution of the data are undergoing an
works mainly face the following problems. First of all, few works explosive growth [1], [2], [3]. The application of remote sensing
can realize one to many cross-modal image retrieval task (such as
using optical image to retrieve SAR, optical images at the same
images in urban construction [4], [5], [6], disaster emergency
time); second, research works mainly focus on small-area, target- response [7], [8], environmental protection [9], [10], [11], etc.,
level retrieval, and few on semantic-level retrieval of the whole is becoming more and more extensive. As a consequence, how
image; last but not the least, most of the existing networks are char- to quickly and accurately retrieve the desired image from the
acterized by massive parameters and huge computing need, which database is an important task to improve the application range
cannot be applied to resource-constrained edge devices with power
and storage limit. For the sake of alleviating these bottlenecks, this
and application rate of remote sensing images. Meanwhile, it
article introduces a novel light-weighted nonlocal semantic fusion also supports many downstream tasks, for example, instance
network based on hypergraph structure for CBRSIR (abbrevi- element extraction, image classification, object detection, and
ated as HGNLSF-Net). Specifically, in the framework, using the change detection.
topological characteristics of hypergraph, the relationship among The retrieval of remote sensing images can be divided into
multiple nodes can be modeled, so as to understand the global fea-
tures on remote sensing images better with fewer parameters and
two subtasks. One is text-based remote sensing image retrieval
less computation. In addition, since the nonlocal semantics often (TBRSIR) and the other is content-based remote sensing image
involves a lot of noise, the hard-link module is constructed to filter retrieval (CBRSIR). As for TBRSIR, the retrieval condition
noise. A series of experimental results on typical CBRSIR dataset, is a sentence, and the development can be traced back to the
i.e., Multi-modal Multi-temporal Remote Sensing Image Retrieval 1970s [12]. Traditional methods are based on manually made
Dataset (MMRSIRD), well show that with fewer parameters, the
proposed HGNLSF-Net outperforms other methods and achieves
metadata of images and text [13], [14], [15]. Recently, deep
optimal retrieval performance. neural network is more and more used in TBRSIR because of
its powerful feature representation capability. The models of
Index Terms—Hypergraph neural network, light weighted, TBRSIR consist of two-step models and one-step models. The
multimodal remote sensing image retrieval, nonlocal semantic
fusion. two-step models apply image caption first and then compare
the feature similarity of query text and caption [16], [17], [18].
The one-step models directly compare the semantic similarity of
query sentence and images [19], [20]. CBRSIR uses query image
Manuscript received 9 February 2023; revised 27 February 2023; accepted to retrieve similar remote sensing images. Conventional methods
1 March 2023. Date of publication 6 March 2023; date of current version 22 first utilize features designed by domain experts [21], [22],
March 2023. This work was supported in part by the National Key R&D Program
of China under Grant 2021YFB3900504 and in part by the National Natural
[23], [24], [25], [26], [27], [28], [29], then feature fusion [30],
Science Foundation of China under Grant 62201550. (Corresponding author: [31] and feature matching [32], [33]. Nowadays, convolutional
Chubo Deng.) neural network (CNN) methods have become the mainstream
Hongfeng Yu and Hongjian You are with the Aerospace Information Research
Institute, Chinese Academy of Sciences, Beijing 100190, China, also with the
in this field. Li et al. [34] used unsupervised manner to extract
School of Electronic, Electrical, and Communication Engineering, University features of images and achieve good result. However, CNNs
of Chinese Academy of Sciences, Beijing 100190, China, also with the Key can only model the local relationship of images; therefore, many
Laboratory of Network Information System Technology (NIST), Aerospace
Information Research Institute, Chinese Academy of Sciences, Beijing 100190,
graph neural network (GNN)-based retrieval methods have been
China, and also with the University of Chinese Academy of Sciences, Beijing designed. Chaudhuri et al. [35] proposed a GNN with pairwise
100190, China (e-mail: [email protected]; [email protected]). similarity constraint to capture the nonlocal spatial details.
Chubo Deng, Liangjin Zhao, Lingxiang Hao, Xiaoyu Liu, and Wanxuan
Lu are with the Aerospace Information Research Institute, Chinese Academy
Although the abovementioned methods have made great im-
of Sciences, Beijing 100190, China, and also with the Key Laboratory of provement in CBRSIR task, the deficiencies of the following
Network Information System Technology (NIST), Aerospace Information Re- aspects reduce the generalization performance of the model.
search Institute, Chinese Academy of Sciences, Beijing 100190, China (e-mail:
[email protected]; [email protected]; [email protected]; liuxi-
First of all, researchers mainly focus on single-modal remote
[email protected]; [email protected]). sensing image retrieval. Although there are already cross-modal
Digital Object Identifier 10.1109/JSTARS.2023.3252670 retrieval methods [36], [37], [38], most of them are given one

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see
https://creativecommons.org/licenses/by-nc-nd/4.0/
YU et al.: LIGHT-WEIGHTED HYPERGRAPH NEURAL NETWORK FOR MULTIMODAL REMOTE SENSING IMAGE RETRIEVAL 2691

Image Retrieval Dataset (MMRSIRD) show that our pro-


vided HGNLSF-Net model achieves the best quantitative
and qualitative results. Specifically, it gains an increase
of around 10%, 12%, and 7% in R@1, R@2, and R@5,
respectively. In addition, in terms of lightweight design,
the number and size of parameters of our model are signif-
icantly smaller than that of other models. The hypergraph
module can be easily embedded in other networks as
feature extraction layer to improve the model effect.
Next, this work will be introduced in detail from the fol-
lowing aspects. The rest of this article is organized as follows.
We introduce some existing research on natural scene image
retrieval, remote sensing image retrieval, and hypergraph neural
network in Section II. In Section III, we will introduce the details
of our HGNLSF-Net. A series of experimental comparison
results from different aspects are shown in Section IV. Finally,
Fig. 1. Example of multimodal, semantic-level, and full-image retrieval. The Section V concludes this article.
query image is a multispectral optical image and the retrieval results contain
different time and different modal images.
II. RELATED WORK
modal image (e.g., optical image) to retrieve another modal We first analyze some of the research works on image retrieval
image (e.g., SAR image). That is, the retrieval result can only methods for natural scene images and RS images. After this, we
be a single image. Second, the mainstream CBRSIR task is introduce some works on the basis of hypergraph neural network,
target-level, small-scale regional retrieval in remote sensing which is also used in our proposed method.
images, and there are few comprehensive semantic level retrieval
of full images, as shown in Fig. 1. Finally, most of the current
A. Natural Scene Image Retrieval
mainstream models need very deep layers and massive parame-
ters to extract better features. The computational cost and storage Content-based image retrieval (CBIR) framework has two key
need is high, which is difficult to adapt to the light-weighted edge parts. In the first part, image features extraction is accomplished
computing environment. to describe the image. After this, quantified similarity can be
To solve the aforementioned challenges, a novel light- calculated according to the conditions and images in the second
weighted nonlocal semantic fusion network based hypergraph part. This framework outputs the ranking list based on the deep
for CBRSIR, named as HGNLSF-Net, is proposed, which can features eventually. A lot of related works have been done in
model the multinode semantic feature in the whole remote this framework on issues of retrieval accuracy and efficiency
sensing image using hypergraph structure. Specifically, con- [39], [40]. Verma and Raman [41] proposed an IR technique
sidering the background noise in remote sensing images, the named LNDP, the innovative feature descriptor transforms and
noise suppression module is introduced to the model. In addi- expresses the interrelations of all adjacent pixels in binary form.
tion, according to the topological properties of hypergraph, too Pradhan et al. [42] proposed a multilevel colored directional
many stacked layers will make the features of nodes tend to be motif histogram for designing a CBIR scheme to extract local
consistent, which reduces model generalization performance. structural features of different levels. Singhal et al. [43] designed
As a consequence, we can use fewer hypergraph convolution a new texture feature, which achieves higher speed performance.
layers to achieve better results. Directional filter masks are leveraged to extract derivatives of
In conclusion, the article mainly innovates in the following the image and capture some details from the image in four
aspects. directions. Furthermore, owing to development of deep learning
1) Motivated by topological features of hypergraph, this and neural networks, such approaches have brought significant
work proposes a new framework specifically designed for improvements in image retrieval [44]. Gordo et al. [45] intro-
CBRSIR task based on hypergraph neural network, i.e., duced an architecture trained for the specific IR task, they utilize
HGNLSF-Net, which can extract the global feature and a ranking framework to construct region features, and a region
have a better understanding of the images. proposal network to find which regions ought to be merged to
2) We design a hard-link module to further optimize the build the global descriptor. Garcia and Vogiatzis [46] have an
hypergraph neural network. While modeling the global exploration about neural networks on the basis of a nonmetric
correlation of remote sensing images, a large amount of similarity function and offer an end-to-end technique for IR
background noise will inevitably be introduced to the tasks. To better capture the underlying image manifold, Chang
model. The hard-link module can filter the noise and make et al. [47] introduced a new graph-based network which traverses
the model more robust. the nearest neighbor graph in feed-forward steps. Besides, Liu
3) A series of experimental results on the typical CBR- et al. [48] leveraged GCN to form the descriptors containing
SIR dataset Multi-modal Multi-temporal Remote Sensing neighbor information encoding and use methods from clustering
2692 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 16, 2023

and manifold learning, which improve retrieval accuracy signif- [72], and other fields. In hypergraph representation filed, Gui
icantly. Apart from the abovementioned methods, transformers et al. [73] proposed a hyperedge-based embedding framework
framework is used for image retrieval as well. In [49] and [50], for heterogeneous event data to represent proximity and rela-
authors adopted transformers for generating image descriptors tionships among objects. Based on this work, Tu et al. [74]
and train models with metric learning method, which achieve designed a deep hypernetwork embedding model, which takes
good results in different datasets. highly sparse hypergraph structure into consideration to im-
prove performance. Besides, Feng et al. [75] presented HGNN
B. Remote Sensing Image Retrieval framework; a hyperedge convolution operation is designed to
tackle representation learning, which achieves better fusion of
RS images are larger in size and contain more targets than
structural attribute information and node feature information of
natural scene images, so they often have multiple labels, which
hypergraph. Jiang et al. [76] designed DHGNN model to extract
make them more difficult and costlier to represent. Therefore,
hidden relationships of the hypergraph; clustering methods are
compared with manually designed features, deep learning meth-
utilized to build and update topology according to the local
ods, especially CNNs [51], [52], [53], [54], [55], [56], [57], are
and global feature. To diminish heavy use of computing re-
increasingly used in RSIR due to powerful feature representation
sources when handling non-k-uniform hypergraph and improve
capabilities. Zhang et al. [58] proposed a novel framework
model generalization ability, Zhang et al. [77] introduced a
called T-NLNN, which puts nonlocal operation and deep metric
self-attention mechanism and proposed Hyper-SAGNN model.
learning together. It turns out that T-NLNN works better than
As for filtering noise, Yadati et al. [78] proposed Hyper-GCN
traditional methods on all datasets. Shao et al. [59] introduced a
model to filter out possible data noise in the sampling process.
multilabel RSIR method on the basis of FCN, which works better
Yang et al. [79] came up with a solution to possible information
than classical CNN models. Many new approaches have been
loss due to the lack of the symmetric nature of data cooccurrence.
proposed to improve the performance of RSIR. Chaudhuri et al.
[60] proposed a semisupervised method based on graph-theory
that builds an image neighborhood graph and associates class III. METHODOLOGY
labels through a novel region labeling strategy, then they use a In the methodology, we start our discussion by introducing
new subgraph matching method to find the exact image that has the basic setup for CBRSIR task and then elaborate the details
high similarity with the given query image. Imbriaco et al. [61] of proposed method. The HGNLSF-net consists a hypergraph
introduced a new image retrieval method that forms a global convolution layer, which lies at the core of this article. The
descriptor by combining attention, local convolutional features, hypergraph convolution layer is capable to relate and link the
and locally aggregated descriptors vectors. features in the feature space. A hard link filter is also performed
RS images often have large scale, because of that, current in hypergraph layer, which helps remove the uncorrelated noise.
methods need very deep layers and a great quantity of parameters
to extract features better. In order to save storage space and speed A. Problem Description
up computation, feature reduction [62], [63] has been adopted to
RSIR. Li and Ren [64] proposed a three-step partial randomness Given the collection of multimodal remote sensing image
hashing method. First of all, the initial hash code estimation is set I, query image q ∈ Q. The positive and negative images
randomly generated; second, the hash code is modified by a are denoted by Pq and Nq . Ground truth labels are acquired
linear model; finally, the method uses a projection matrix to according to different class per image, i.e., if two images are the
generate the final binary code. Li et al. [65] proposed a network same class, then they are both positive, otherwise negative. The
on the basis of deep hashing neural network, and the new network retrieval task aims to give the similarity ranking according to the
can be end-to-end optimized on large-scale datasets. query image q.
As for datasets, many researchers have already established In order to accomplish such goal, we first extract the query
RSIR datasets. Zhou et al. [66] introduced a specifically col- image and dataset features, map them into a k-dimensional
lected dataset named PatternNet based on one single data modal- space, û, v̂ ∈ R. Then, the tensor is fed into the hypergraph layer
ity, so far the dataset has been used on over 35 methods and the followed by a fully connected layer, and the feature similarity
result can be used as a baseline for future research. Yuan et al. from different images can be computed by the output of the fully
[19] introduced a fine-grained dataset contained thousands of connected layer
query texts, keywords, and RS images, which is designed for similarity = s(F û, F v̂) (1)
TBRSIR task.
where F indicates the feature extraction process for the input
C. Hypergraph Neural Network image, and the function s computes the similarity between the
target and query features. Here, we apply the L2 norm to measure
The topological structure of hypergraph can establish the
the similarity.
semantic relationship of multiple nodes at the same time, and
mine nonlinear high-order correlations. Owing to the explosive
B. Overall Method
growth of graph-structured data, hypergraph neural networks
are widely used in image retrieval [67], social network analysis In Fig. 2, the general structure of HGNLSF-net is demon-
[68], [69], image processing [70], recommender system [71], strated. The first module of the network is feature exaction layer.
YU et al.: LIGHT-WEIGHTED HYPERGRAPH NEURAL NETWORK FOR MULTIMODAL REMOTE SENSING IMAGE RETRIEVAL 2693

Fig. 2. Overall structure of the HGNLSF-net.

Fig. 3. Overview of the Hypergraph block structure.

Initially, the multimodal images are resized to 224 × 224 × 3 as nodes and we build a hypergraph in order to better model
and the matrix values are normalized to the interval [−1, 1] as the long and short relationships and improve the information
the input. We utilize the pretrained weight of the ResNet50 to get exchange among nodes, as shown in Fig. 3.
the general feature from the dataset. More precisely, the output The mathematics expression of hypergraph is denoted by
feature size is 2048 × 7 × 7. G = (V, E, W ), where V represents the vertex, E denotes the
hyperedge, each hyperedge contains two or more vertices, and
W ∈ RM ×M is the weight matrix. Each element denotes the
C. Hypergraph Convolution Calculation weight for each hyperedge. In order to formulate the Laplacian
Given the need to match between the query sample and the operator for hypergraph, the incidence matrix H ∈ RN ×M is
multimodal dataset, the multiimage relationship is no longer necessary. It is usually denoted as a characteristic function as
suitable in the tradition graph structure, since the edge in graph follows:
can only represent a binary relationship. Compared with the
pairwise relation in graph, the hypergraph models the high-order 
constraints, which turns out to regulate more than two images in 1, if v ∈ e
H(v, e) = (2)
the same hyperedge. Specifically, the feature maps are treated 0, otherwise.
2694 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 16, 2023

Unlike the normal graph, a hypergraph owns both In many studies [75], [81], [82], the incidence matrix H plays
vertex and edge degree matrixes; the vertex degree matrix a core role in the hypergraph layer, and each element of H is
D ∈ RN ×N denotes that as for each vertex, the total number usually constructed by computing the norm distance between
of hyperedges contain the same node. The hyperedge degree features. In our network, we take the long-term intraspatial
matrix B ∈ RM ×M indicates how many vertex contained in dependency, and the correlation of the features is computed
each hyperedge. The normalized hypergraph Laplacian ma- to give the vertex’s weight for each hyperedge. Finally, the
trix is denoted by Δ ∈ RN ×N , which is computed by Δ = incidence matrix is given by
I − D−1/2 HB −1 H T D−1/2 . The eigenvalues and the corre-
sponding eigenvectors of Δ are represented by diagonal matrix H = Γ(x)λ(x)ΓT (x)μ(x) (7)
λ = diag(λ1 , ..., λN ) and Φ = φ1 , ..., φN , respectively. Given
the input vector x = (x1 , ..., xN ), the vector is decomposed into where Γ(x) ∈ RN ×Ĉ denotes the function of the original signal
the spectral domain by Fourier Transform on the basis of Φ. The embedding with a linear transform, and Ĉ denotes its feature di-
hypergraph convolution structure can be formulated as mension. λ(x) ∈ RĈ×Ĉ denotes a diagonal matrix, which learns
a better distance on each vertex. μ(x) ∈ RN ×M evaluates the
g ∗ x = Φg(λ)ΦT x (3) weights for each vertex on its hyperedge, which can help to relate
where g(λ) = diag(g(λ1 ), ..., g(λN )) implies the Fourier coef- the global relationship on the feature maps to build hyperedges.
ficients. However, the computation of the eigenvectors requires The abovementioned three functions Γ(x), λ(x), and μ(x) are
huge consumption time with computation complexity approxi- all calculated from the feature map. Γ(x) is embedded through
mately O(n3 ). Nevertheless, Defferrard et al. [80] parameterized a 1 × 1 convolution kernel with dimension Ĉ, λ(x) is calculated
g(λ) with truncated Chebyshev polynomials up to Kth order, by a global average pooling, and μ(x) is calculated by a M × M
which can simplify the calculation of eigenvectors to a finite convolution. They can be computed as follows:
sum. The truncated formula on the hypergraph convolution is
Γ(xl ) = conv(xl , WΓ )
computed by
K
 λ(xl ) = diag(conv(xl , Wλ ))
g∗x= θk Tk (Δ̂)x (4)
μ(xl ) = conv(xl , Wμ ) (8)
k=0

where θk is the coefficients of the Chebyshev sequence, Δ̂ = where xl ∈ R1×1×Ĉ is the feature vector given by the global

λmax − 1 and Tk denote the Chebyshev function. We can further pooling on the input features. WΓ , Wλ , and Wμ are three
simplify the formula by restricting K = 1. With the estimation parameters built for the linear embedding.
of λmax ≈ 2, the final version of the hypergraph convolution Remote sensing images often include a lot of targets and
becomes background information. When global correlation is established,
g ∗ x ≈ (D−1/2 HW B −1 H T D−1/2 )xθ (5) a lot of noise will be introduced. As for hypergraph, the element
of incidence matrix H represents the correlation among nodes
where θ denotes the Chebyshev parameter, which will be learned and many of them only have a weak connection. Thus, we build
in the neural network. Therefore the iteration for the hypergraph a hard-link filter block, which examines the amplitude of the
convolution layer can be defined as element in the incidence matrix H, The module only allows
xl+1 = σ(D−1/2 HW B −1 H T D−1/2 xl θ). (6) large number passing through the filter in order to remove any
noise that will reduce model effect. The mathematics expression
The incidence matrix H is formed as the basic structure of the of hard-link can be defined as
hypergraph, which can also be utilized to propagate information
among the hypergraph vertices. Therefore, better connections Hnew = x ≥ c, x ∈ H. (9)
among hyperedges would boost a better information exchange
among the vertices, and can improve the accuracy of retrieval From (9), it can be concluded that the hyperparameter c is
task. very important for the proposed method. If c is set too large, the
useful semantic connection among nodes will also be filtered
D. Hypergraph Features Representation out, and if c is set too small, it will be difficult to fully filter
the weak connections and potential background noise in the
The receptive field of CNN architecture restricts the feature global correlation of the remote sensing image. In order to better
to the local area instead of global structure. However, the image determine the threshold c, we conduct a series of comparative
retrieval task requires the global feature matching between the experiments with different values of c (0.5, 0.6, 0.65, 0.7, 0.75,
retrieval image and the target one. Therefore, we need more 0.8, 0.85, 0.9). After comparing the experimental results, the
advanced tool to acquire the global feature in the dataset. Hy- threshold is set to remove 70% of the total elements in incidence
pergraph behaves well in the multiple relationship construction matrix H.
and it is capable to acquire global information. In our network, After the hard link block, we substitute the incidence matrix
spatial feature F l ∈ Rh×w×c is manipulated as a graph-like into (6), and finally acquire the hypergraph layer as
node, and each feature is considered as a vertex with dimension
X l ∈ Rhw×c . xl+1 = σ(Δxl θ) (10)
YU et al.: LIGHT-WEIGHTED HYPERGRAPH NEURAL NETWORK FOR MULTIMODAL REMOTE SENSING IMAGE RETRIEVAL 2695

where Δ = D−1/2 HW B −1 H T D−1/2 . The degree matrix D is


corresponding to the number of hyperedges that this vertex par-
ticipates in, each diagonal entry in D is computed by summing
up the row of matrix H. The edge degree matrix B represents the
number of vertex that each hyperedge contains. Both B and D
are diagonal matrixes and computed from the incidence matrix
H in the following manner:
M
 N

Dii = W Hi , B = Hi (11)
=1 i=1

where Dii , W , Hi , and B indicate the elements of matrix
D, W , H, and B, respectively.

IV. EXPERIMENTAL RESULT


In order to examine and analyze the proposed HGNLSF-Net,
experimental results are demonstrated through four kind of
comparisons test: overall comparison, parameter amount com-
parison, ablation experiments, and visual analysis.

A. Dataset Description
Our proposed HGNLSF-Net is trained and tested on a self- Fig. 4. Examples of multimodal images in MMRSIRD.
crafted dataset called MMRSIRD.
To ensure the continuity, multimodality, and multiscale char-
acteristics of spatio-temporal images, the dataset used in this day without more advanced GPU support. To guarantee the sta-
article is captured by multiple ultra-high resolution satellites bility of the experiment and prevent the model from overfitting
with a spatial resolution up to 0.3 m and with cloud cover problem, we use an adjustable learning rate (LR). Initially, the
≤ 2%. Compared with optical images, SAR images are more LR is default to 0.001, then it is changed to 10−4 at epoch 16
difficult to acquire. The dataset contains images sampled from and 10−5 at epoch 21. Furthermore, the model achieves its best
various satellites including optical sensor: Gaofen-2 (spatial performance by using SGD optimizer by setting momentum at
resolution of 1 m), SuperView-1 (spatial resolution of 0.5 m), 0.9.
and Worldview-3 (spatial resolution of 0.3 m), GeoEye-1 (spatial For the image transformation, we first resize the input image
resolution of 0.8 m) and SAR sensor: Gaofen-3 (spatial resolu- to 224 × 224 × 3, and then perform normalization to the input.
tion of 1 m). The typical samples of the multimodal images are An pretrained ResNet50 is used for image feature extraction, we
demonstrated in Fig. 4. only take the first eight layers of ResNet50 and output a tensor
For a consistent analysis of urban areas of global, we choose with size 2048 × 7 × 7. The tensor becomes a vector with size
Beijing and Berlin to build the dataset. The dataset covers more 2048 after the hypergraph and global average pooling layer, and
than 240 km2 and time ranges up to eight years. Both of these two finally, the fully connection layer is applied to the vector and
cities own developing trends in recent years: Beijing is a rapidly the vector length becomes 100; this vector is used for similarity
expanding city, including a large amount of farming land and computation.
bungalows to high-rise buildings; Berlin is undergoing a renewal As for evaluation, we utilize four kinds of evaluation metrics
plan, continually rebuilding and restructuring dense urban areas for the image retrieval task, including ANMRR, mAP@10,
to keep the city alive. R@1, R@2, and R@5. In the following parts we introduce the
MMRSIRD consists of 1774 group images, and each group mathematics definition of each metric.
includes 11 images shot by multiple sensors at different times in ANMRR is short for average normalized modified retrieval
the same area. Each sample is named in the form “A_B”, where rate, which is commonly applied in detection and retrieval tasks.
A represents the group number the sample belongs to (from 0 to ANMRR can be computed as [83]
1773), and B indicates the number of the sample in this group Z
(from 0 to 10). For example, 23_7 means the 8th sample of the 1 
ANMRR = NMRR(i) (12)
24th group, as shown in Fig. 4. Z i=1

B. Implement Details where Z is the number of query samples. NMRR(i) is another


metric which considers the number and the ranking of the correct
All training and evaluation are executed on a single NVIDIA query results at the same time. NMRR(i) is computed as follows:
GeForce RTX3090Ti GPU. Thanks to the single hypergraph
convolution layer, HGNLSF-Net is a relative small network. MRR(i)
NMRR(i) = (13)
The training with 24 iterations can be accomplished within one 1.25k(i) − 0.5[1 + NG(i)]
2696 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 16, 2023

where NG(i) is number of the ground truth results given the From the above definition, we can draw the conclusion that
query i. K(i) is defined as the relevant ranks, which mean the the value of mAP is highly correlated to the number K, and we
total length of retrieval result that counts. K(i) is usually set take mAP@10 as one of the metrics in this article.
to min{4 × NG(i), 2 × max × {NG(j) | ∀j}}. MRR(i) is short
for modified retrieval rank and can be calculated as follows:
C. Performance Evaluation of the HGNLSF-Net
MRR(i) = AVR(i) − 0.5[1 + NG(i)] (14) We use three SOTA models to conduct comparative experi-
where AVR(i) is the average rank indicating the rank of the cor- ments with our proposed model as the performance validation.
rect retrieval results in the entire list. AVR(i) can be formulated The three models are DELF, Rerank-Transformer, and AMFMN.
as follows: DELF [84]: A new descriptor is introduced to solve the
retrieval challenge of large-scale remote sensing images. It
1 
NG(i)
first extracts dense features by utilizing a fully convolutional
AVR(i) = Rank∗ (k) (15) network. Furthermore, to make the features more robust and
NG(i)
k=1
generic, an attention-based keypoint selection module using the
where Rank∗ (k) is defined as weak supervision signal is introduced, i.e., image-level labels. A
 subset of dense features is effectively filtered by the keypoints.
∗ Rank(k), if Rank(k) ≤ K(i) Finally, L2 normalization and PCA methods are used for feature
Rank (k) = (16)
1.25K, if Rank(k) > K(i) reduction and balancing compactness and discriminativeness.
Several experimental results show that DELF is an effective
where Rank(k) means the rank of the kth correct retrieval result method to improve the retrieval accuracy of remote sensing
given the query. According to the definition, ANMRR value is images
from 0 to 1 and small value indicates more accurate retrieval Rerank-Transformer [85]: The method uses two steps to
result. retrieve images. First of all, the initial retrieval result is calculated
Recall@k is another standard criteria to assess the accuracy. by comparing the feature similarity between query condition and
For any query image q, Recall@k can be evaluated as the number dataset. Second, a new rerank module is introduced to improve
of positive examples over the total number of examples for q. It the model effect. To be specific, by using the top-M images
can be formulated as as anchors, the module is able to extract affinity features for

x∈Pq He(k − rI (q, x))
the query image and top-L results. In addition, the transformer
k
RI (q) = (17) module is introduced to update the extracted features. The model
|Pq |
is more robust because its input features are not extracted from
where rI (q, x) denotes the rank of example x. He(.) denotes the original image.
the Heaviside step function, which gives 0 for negative values, AMFMN [19]: The method comes up with an effective feature
otherwise equal to 1. The rI (q, x) is calculated as extraction model for multimodal remote sensing retrieval task.
 By designing a new multimodal feature matching model, it can
rI (q, x) = 1 + He(sqz − sqx ) (18) better solve the difficult problems of remote sensing images
z∈I,z ∈
/x compared with natural scene images, i.e., containing a large
where s is the similarity function in (1). Therefore, the Recall@k number of targets and multiple scales of different targets. The
can be defined as self-attention mechanism is employed to enhance features and
  can dynamically filter redundant features. To make the model fit
k x∈Pq H(k − 1 − /x He(sqz − sqx ))
z∈I,z ∈ better to the characteristics of strong intraclass similarity, a new
RI (q) = .
|Pq | dynamic triplet loss function is included in the model and the
(19) model performs very well.
Mean average precision at top-k(mAP@K) is another stan- Table I gives overall results. The bold indicates the best
dard criterion for image retrieval task. For any positive example results. From Table I, it can be concluded that all the other
x ∈ PQ , we can name the sequence x1 , x2 , ..., xv , v ≤ k accord- methods fail in complex scene understanding and relationship
ing to the rank of example x base on the similarity to query q. modeling except ours, which demonstrates the challenge of the
Then, average precision (AP) can be calculated as MMRSIRD dataset.
m
 Compared with other methods, our proposed model achieves
i
AP = , x i ∈ PQ good performance across all evaluation criteria, including AN-
i=1
rI (q, xi ) MRR, mAP@10, R@1, R@2, and R@5. That is to say, our
 HGNLSF-Net is the best one among all the compared models.
m= He(k − rI (q, x)). (20)
To be specific, the HGNLSF-Net performs better than DELF, one
x∈Pq
of the most widely used benchmarks in CBRSIR field and gains
Finally, the mAP@K for n query image and top-k examples a significant improvement of 0.0909, 10.53%, 31.84%, 29.84%,
is calculated as and 18.50% in ANMRR, mAP@10, R@1, R@2, and R@5,
n m respectively. In addition, compared with Rerank-Transformer,
1  i
mAP = . (21) a transformer-based model that obtains best results in many
n j=1 i=1 rI (qj , xi ) datasets, the improvement of our model gains 0.0493, 3.40%,
YU et al.: LIGHT-WEIGHTED HYPERGRAPH NEURAL NETWORK FOR MULTIMODAL REMOTE SENSING IMAGE RETRIEVAL 2697

TABLE I
PERFORMANCE OF IMAGE RETRIEVAL METHODS ON MMRSIRD

TABLE II
COMPARISON RESULT OF DIFFERENT QUERY TYPE

13.95%, 12.82%, 6.36% in the five metrics. AMFMN is origi- and storage memory. Second, the hypergraph convolution layer
nally designed for TBRSIR task and has strong image feature can achieve a better performance with just one layer, and more
representation capability. The HGNLSF-Net achieves an opti- layers will cause an unfavorable oversmoothing problem and
mal performance in all metrics. In summary, our model can better contribute negative effect to the result. The comparison results
learn the nonlocal semantic features in the entire image because of parameter quantities between different models are given in
of hypergraph module. Also thanks to the hard-link block that Table III.
erases the unnecessary noise between nodes, it can improve the The results are calculated by the package torchsummary.
overall retrieval accuracy. The estimated total size also adds input and output parameters,
However, we observe an imbalance result in our experiment, and forward propagation and back propagation parameters in
since there are fewer examples of SAR images in the dataset addition to parameters size. It can be summarized that our model
compared with optical images. More precisely, MMRSIRD has not only can achieve the best results, but also has the minimum
about tenth as many optical than SAR samples. For such reason, parameter quantity, parameters size and total model size. The
we observe a significant difference in accuracy for optical and HGNLSF-Net model is just slightly bigger than the backbone
SAR query, the accuracy is higher when we test with an optical Resnet50 in three metrics, which demonstrates the feature repre-
query. The results are demonstrated in Table II. sentation ability of hypergraph convolution network. Among all
This phenomenon can be explained by the imbalance of the models, the Rerank-Transformer has the largest number of
training examples, and the network learns more optical feature parameters and model size due to its many stacked transformer
than SAR feature; thus, it is harder to match a SAR query to blocks. DELF and AMFMN are relatively close in terms of
its corresponding optical examples. Moreover, since one group parameter quantity and parameters size, but the overall model
consists ten optical images and one SAR image in MMRSIRD, size of AMFMN is much larger than that of DELF. It is probably
when we test one optical query, it has 90% to match with optical because that AMFMN is originally designed for TBRSIR, its
images and only 10% to a SAR image. Nevertheless, for a SAR input and output processing, and loss function are quite different
image query, it owns an opposite situation. We have to admit it from DELF. In the following, we give a simple proof to the
does have an advantage for a homogenous retrieval compared reason why our proposed model can achieve better results with
to cross-modal retrieval, since they share more common char- less layers.
acteristic in feature space, which explains the result difference For the hypergraph operator Δ, if we implement it k times,
in two kinds of query. However, the overall retrieval accuracy then it can be written as:
for SAR image query is still higher than all other methods on
the MMRSIRD dataset in R@1 and R@2, even considering this lim Δk = lim (D−1/2 HW B −1 H T D−1/2 )k
factor. k→∞ k→∞

= lim V (I − λ̃)k V T
k→∞
D. Analysis on the Light-Weighted Characteristics of the ⎡ ⎤
Proposed Model (1 − λ̃1 )k
⎢ ⎥
The hypergraph convolution layer naturally preserves a light- ⎢ (1 − λ̃2 )k ⎥ T
= lim V ⎢
⎢ ..
⎥V

weighted structure for two reasons. First of all, both the degree k→∞ ⎣ . ⎦
matrixes B and D are calculated from the incidence matrix H
(1 − λ̃N )k
and the computation only involves summation rather than mul-
tiplication, which contributes little to the calculation complexity (22)
2698 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 16, 2023

TABLE III
COMPARISON OF PARAMETERS OF DIFFERENT MODELS

TABLE IV
ABLATION EXPERIMENTS OF HGNLSF-NET ON MMRSIRD DATASET

where V = D−1/2 H and λ̃ are the eigenvalues of the operator. layer are the same, we can simply write off the code lines to
Since the λ̃i are all greater between 0 and 1 except λ̃1 = 0, then get rid of the layer. The third experiment involves the hard-link
the abovementioned equation becomes block; while the block does not alter the input–output dimension
⎡ ⎤ as well, we can simply erase those code lines to accomplish
1 the ablation. It should be noticed that the hard-link is part of
⎢ ⎥
⎢ 0 ⎥ T hypergraph layer, and once the hypergraph layer is removed,
k
lim Δ = lim V ⎢ ⎢ ⎥V . (23)
k→∞ k→∞ .. ⎥ the hard-link block does not exist anymore too. The ablation
⎣ . ⎦ experiment results are demonstrated in Table IV.
0 As shown in Table IV, we take V1-5 to represent the following
different blocks:
If we have a signal input x, then
1) V1: ResNet50;
⎡ ⎤
1 2) V2: ResNet101;
⎢ ⎥ 3) V3: VGG16;
⎢ 0 ⎥ T
lim Δk x = lim V ⎢ ⎢ ..
⎥V x

4) V4: Hypergraph layer;
k→∞ k→∞ ⎣ . ⎦ 5) V5: Hard-link.
0 From the abovementioned results, there is an accuracy decay
when we replace Resnet to VGG16. Because the skip connec-
= < x · v1 > v1 = x̃1 v1 . (24) tions in Resnet reduce the gradient vanishing issue through
providing an alternate routine for the gradient to connect. Mean-
Since the eigenvector v1 is corresponding to the eigenvalue
while, the accuracy from ResNet101 turns out to be a little
λ̃1 = 0, then v1 is a constant vector equal to 1. Therefore, if
lower compared with ResNet50, meaning adding more layers
a signal is continuously smoothed, it eventually becomes equal
to the CNNs in the backbone does not always come to a better
everywhere, and there will be no distinguishability at all. This
accuracy and more layers leads to more time consumption at
property of the hypergraph layer restricts it to a single layer,
training. Although more stacked layers in backbone can help
which contributes to the light weight structure.
acquire general feature of the dataset, however, the accuracy
levels may reach to a limit and slowly decrease after a threshold.
E. Ablation Experiments Finally, the accuracy may turn out to dwindle on both training
We investigate the effectiveness of each block by conducting and testing process.
ablation experiments on HGNLSF-Net. Three different blocks Compared with the small accuracy difference in replacing
are tested. backbone experiment, there is a significant decay at accu-
In the first experiment, the VGG-16 and ResNet101 are used racy from removing the hypergraph layer. The accuracy drops
to replace with the ResNet50. In the second experiment, the 0.3510, 35.22%, 59.33%, 51.50%, and 44.32% in ANMRR,
network becomes purely Resnet50 after removing the hyper- mAP@10, R@1, R@2, and R@5 compared with the complete
graph layer. Since the input and output dimensions of hypergraph model. Thanks to the flexibility and capability of hypergraph in
YU et al.: LIGHT-WEIGHTED HYPERGRAPH NEURAL NETWORK FOR MULTIMODAL REMOTE SENSING IMAGE RETRIEVAL 2699

Fig. 6. Comparison of texture feature for query and retrieval image. (a) Query
image (partial). (b) Retrieval image (partial).

Fig. 5. Visual results for an optical image query on MMRSIRD dataset, and
the top five results are demonstrated.

constructing complicated data correlation, the result proves once


again the importance of hypergraph layer, which lies at the core
of this model.
The hard-link also contributes to the improvement of model
accuracy, and the result implies that there is around 15% reduc-
tion in mAP@10 and R@K. Because there always exist some
connections between different features, most of them are only
weak connections caused by noise or pseudocorrelation, which
gives negative effect on the result. A hard-link block removes
those week connections and, therefore, gives an improvement
on the model result.

F. Visual Analysis
In this section, we give visual results to our experiment. In
Fig. 7. Visual results for an SAR image query on MMRSIRD dataset, the top
Fig. 5, we have an optical query image, the top five results give five results are demonstrated.
three positive examples at rank equal to 1, 2, and 4 and two
negative examples rank at positions 3 and 5.
Based on (20), the AP is computed as follows:
m
1 
precision is calculated as
i 1 1 2 3
AP = = + + ≈ 0.917. (25)
m i=1 rI (q, xi ) 3 1 2 4
m
1  i 1 1 2
In this experiment, the first negative example appears at AP = = + = 0.5. (26)
m i=1 rI (q, xi ) 2 2 4
position 3, and we observe there is a texture-level similarity
between this image and query one as shown in Fig. 6. Since
the convolution layer in the Resnet50 captures the local texture The previous section has explained that the SAR image re-
feature, we believe that is the reason why such image can achieve trieval has a relative low accuracy compared with optical image
a high rank among others. retrieval. For SAR image queries, we observe that there is at
In Fig. 7, we select one SAR image as a query, the top five least one SAR image appearing at top five in most results, no
results turn out to be two positive and three negative examples. matter it is positive or negative example, which reconfirms the
The positive examples rank at positions 2 and 4, whereas the homogeneous modality shares more common features, which
negative examples rank at positions 1, 3, and 5. The average promotes the retrieval accuracy.
2700 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 16, 2023

V. CONCLUSION [15] X. Wang, N. Chen, Z. Chen, X. Yang, and J. Li, “Earth observa-
tion metadata ontology model for spatiotemporal-spectral semantic-
In this article, as for CBRSIR task, in order to realize the re- enhanced satellite observation discovery: A case study of soil mois-
trieval of large-scale, semantic-level, multimodal remote sensing ture monitoring,” GIScience Remote Sens., vol. 53, no. 1, pp. 22–44,
2016.
images using one query image, we propose a new hypergraph [16] C.-R. Shyu et al., “GeoIRIS: Geospatial information retrieval and
based nonlocal semantic fusion network, i.e., HGNLSF-Net. indexing system–content mining, semantics modeling, and complex
Because the topological property of hypergraph determines that queries,” IEEE Trans. Geosci. remote Sens., vol. 45, no. 4, pp. 839–852,
Apr. 2007.
it can model the association among multiple nodes at the same [17] Y. Chen and X. Lu, “A deep hashing technique for remote sensing image-
time, the hypergraph neural network is designed to model the sound retrieval,” Remote Sens., vol. 12, no. 1, 2019, Art. no. 84.
nonlocal semantic features rather than the local, target-level rela- [18] Z. Shi and Z. Zou, “Can a machine generate humanlike language descrip-
tions for a remote sensing image?,” IEEE Trans. Geosci. Remote Sens.,
tionship. However, because of the complexity of the foreground vol. 55, no. 6, pp. 3623–3634, Jun. 2017.
and background in remote sensing images, the global semantic [19] Z. Yuan et al., “Exploring a fine-grained multiscale method for cross-
association often contains a lot of noise unrelated to the task. modal remote sensing image retrieval,” IEEE Trans. Geosci. Remote Sens.,
vol. 60, no. 1, Jan. 2022, Art. no. 4404119.
To solve the abovementioned problem, a hard-link module is [20] Q. Cheng, Y. Zhou, P. Fu, Y. Xu, and L. Zhang, “A deep semantic
introduced to filter noise. In addition, as too many stacked layers alignment network for the cross-modal image-text retrieval in remote
will reduce the accuracy, the model can achieve better results sensing,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14,
no. 1, pp. 4284–4297, Jun. 2021.
with less parameters. The hypergraph convolutional module [21] T. Bretschneider, R. Cavet, and O. Kao, “Retrieval of remotely sensed
proposed in this article can be embedded into other networks as imagery using spectral information content,” in Proc. IEEE Int. Geosci.
feature extraction layer to have stronger representation ability. Remote Sens. Symp., 2002, pp. 2253–2255.
[22] A. Vellaikal, C.-C. J. Kuo, and S. K. Dao, “Content-based retrieval of
The results on the typical dataset demonstrate that the HGNLSF- remote-sensed images using vector quantization,” in Visual Information
Net can improve the CBRSIR task performance. Processing IV. Bellingham, WA, USA: SPIE, 1995, pp. 178–189.
[23] G. Healey and A. Jain, “Retrieving multispectral satellite images using
physics-based invariant representations,” IEEE Trans. Pattern Anal. Mach.
REFERENCES Intell., vol. 18, no. 8, pp. 842–848, Aug. 1996.
[1] Y. Ma et al., “Remote sensing big data computing: Challenges and oppor- [24] B. Luo, J.-F. Aujol, Y. Gousseau, and S. Ladjal, “Indexing of satellite
tunities,” Future Gener. Comput. Syst., vol. 51, pp. 47–60, 2015. images with different resolutions by wavelet features,” IEEE Trans. Image
[2] N. Skytland, “What is NASA doing with big data today,” 2012. [Online]. Process., vol. 17, no. 8, pp. 1465–1472, Aug. 2008.
Available: https://www.opennasa.org/what-is-nasa-doing-with-big-data- [25] S. Newsam, L. Wang, S. Bhagavathy, and B. S. Manjunath, “Using texture
today.html to analyze and manage large collections of remote sensed image and video
[3] P. Gamba, P. Du, C. Juergens, and D. Maktav, “Foreword to the special data,” Appl. Opt., vol. 43, no. 2, pp. 210–217, 2004.
issue on ‘human settlements: A global remote sensing challenge’,” IEEE [26] Z. Shao, W. Zhou, L. Zhang, and J. Hou, “Improved color texture descrip-
J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 4, no. 1, pp. 5–7, tors for remote sensing image retrieval,” J. Appl. Remote Sens., vol. 8,
Mar. 2011. no. 1, 2014, Art. no. 083584.
[4] X. Huang, D. Wen, J. Li, and R. Qin, “Multi-level monitoring of subtle [27] A. Ma and I. K. Sethi, “Local shape association based retrieval of infrared
urban changes for the megacities of china using high-resolution multi-view satellite images,” in Proc. 7th Int. Symp. Multimedia, 2005, pp. 7–pp.
satellite imagery,” Remote Sens. Environ., vol. 196, pp. 56–75, 2017. [28] G. J. Scott, M. N. Klaric, C. H. Davis, and C.-R. Shyu, “Entropy-balanced
[5] S. Tian, Y. Zhong, A. Ma, and L. Zhang, “Three-dimensional change bitmap tree for shape-based object retrieval from large-scale satellite
detection in urban areas based on complementary evidence fusion,” IEEE imagery databases,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 5,
Trans. Geosci. Remote Sens., vol. 60, no. 1, Jan. 2021, Art. no. 5608913. pp. 1603–1616, May 2011.
[6] R. Hang, P. Yang, F. Zhou, and Q. Liu, “Multiscale progressive segmen- [29] G.-S. Xia, W. Yang, J. Delon, Y. Gousseau, H. Sun, and H. Maître,
tation network for high-resolution remote sensing imagery,” IEEE Trans. “Structural high-resolution satellite image indexing,” in Proc. ISPRS TC
Geosci. Remote Sens., vol. 60, no. 1, Jan. 2022, Art. no. 5412012. VII Symp.-100 Years ISPRS, 2010, pp. 298–303.
[7] D. Qin et al., “MSIM: A change detection framework for damage assess- [30] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to
ment in natural disasters,” Expert Syst. with Appl., vol. 97, pp. 372–383, object matching in videos,” in Proc. Comput. Vis., IEEE Int. Conf., 2003,
2018. vol. 3, pp. 1470–1470.
[8] M. Zhang, W. Shi, S. Chen, Z. Zhan, and Z. Shi, “Deep multiple instance [31] W. Zhou, Z. Shao, C. Diao, and Q. Cheng, “High-resolution remote-
learning for landslide mapping,” IEEE Geosci. Remote Sens. Lett., vol. 18, sensing imagery retrieval using sparse features by auto-encoder,” Remote
no. 10, pp. 1711–1715, Oct. 2021. Sens. Lett., vol. 6, no. 10, pp. 775–783, 2015.
[9] R. A. Bindschadler, T. A. Scambos, H. Choi, and T. M. Haran, “Ice sheet [32] J.-E. Lee, R. Jin, and A. K. Jain, “Rank-based distance metric learning: An
change detection by satellite image differencing,” Remote Sens. Environ., application to image retrieval,” in Proc. IEEE Conf. Comput. Vis. Pattern
vol. 114, no. 7, pp. 1353–1362, Oct. 2011. Recognit., 2008, pp. 1–8.
[10] A. Ochtyra, A. Marcinkowska-Ochtyra, and E. Raczko, “Threshold-and [33] B. Chaudhuri, B. Demir, L. Bruzzone, and S. Chaudhuri, “Region-based
trend-based vegetation change monitoring algorithm based on the inter- retrieval of remote sensing images using an unsupervised graph-theoretic
annual multi-temporal normalized difference moisture index series: A case approach,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 7, pp. 987–991,
study of the tatra mountains,” Remote Sens. Environ., vol. 249, 2020, Jul. 2016.
Art. no. 112026. [34] Y. Li, Y. Zhang, C. Tao, and H. Zhu, “Content-based high-resolution remote
[11] R. Hang, X. Qian, and Q. Liu, “Cross-modality contrastive learning for sensing image retrieval via unsupervised feature learning and collaborative
hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., affinity metric fusion,” Remote Sens., vol. 8, no. 9, 2016, Art. no. 709.
vol. 60, no. 1, Jan. 2022, Art. no. 5532812. [35] U. Chaudhuri, B. Banerjee, and A. Bhattacharya, “Siamese graph convolu-
[12] L. Zhang and Y. Rui, “Image search–from thousands to billions in 20 tional network for content based remote sensing image retrieval,” Comput.
years,” ACM Trans. Multimedia Comput., Commun., Appl., vol. 9, no. 1s, Vis. Image Understanding, vol. 184, pp. 22–30, 2019.
pp. 1–20, 2013. [36] Y. Li, Y. Zhang, X. Huang, and J. Ma, “Learning source-invariant deep
Hongfeng Yu received the B.Sc. degree in cartography and geographical information system and the M.Sc. degree in photogrammetry and remote sensing from Peking University, Beijing, China, in 2013 and 2016, respectively. He is currently working toward the Ph.D. degree in signal and information processing with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China. His research interests include deep learning and multimodal remote sensing interpretation.

Chubo Deng received the B.Sc. degree in applied mathematics from Hong Kong Baptist University, Hong Kong, in 2012, and the Ph.D. degree in applied mathematics from the George Washington University, Washington, DC, USA, in 2018. He is currently an Assistant Professor with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China. His research interests include information modeling and remote sensing image processing.

Liangjin Zhao received the B.Sc. degree in automation from the University of Electronic Science and Technology of China, Chengdu, China, in 2015, and the M.Sc. degree in aeronautical and astronautical science and technology from the Beijing Institute of Technology, Beijing, China, in 2018. He is currently a Research Assistant with the Institute of Electronics, Chinese Academy of Sciences. His research interests include target detection and recognition in unmanned aerial vehicle remote sensing images and simultaneous localization and mapping.

Lingxiang Hao received the B.Sc. degree in electronic information science and technology and the M.Sc. degree in electronic and communication engineering from the Beijing University of Posts and Telecommunications, Beijing, China, in 2018 and 2021, respectively. He is currently an Assistant Professor with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China. His research interests include deep learning and remote sensing image processing.

Xiaoyu Liu received the B.Sc. degree in information management and information system and the M.Sc. degree in management science and engineering from the Harbin Institute of Technology, Harbin, China, in 2020 and 2022, respectively. She is currently an Assistant Professor with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China. Her research interests include deep learning and remote sensing image processing.

Wanxuan Lu received the B.Sc. degree in detection, guidance and control technology from the Beijing Institute of Technology, Beijing, China, in 2016, and the Ph.D. degree in signal and information processing from the Institute of Electronics, Chinese Academy of Sciences, Beijing, China, in 2021. She is currently an Assistant Professor with the Aerospace Information Research Institute, Chinese Academy of Sciences. Her research interests include computer vision and remote sensing image processing.

Hongjian You received the B.Sc. degree in computer science and technology from Wuhan University, Wuhan, China, in 1992, the M.Sc. degree in computer science and technology from Tsinghua University, Beijing, China, in 1995, and the Ph.D. degree in signal and information processing from the University of Chinese Academy of Sciences, Beijing, China, in 2001. He is currently a Professor with the Aerospace Information Research Institute, Chinese Academy of Sciences. His research interests include remote sensing image processing and analysis, and SAR image applications.
