
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 60, 2022, 5603814

High-Resolution Remote Sensing Image Captioning Based on Structured Attention

Rui Zhao, Zhenwei Shi, Member, IEEE, and Zhengxia Zou

Manuscript received May 13, 2020; revised September 3, 2020, December 21, 2020, and March 3, 2021; accepted March 29, 2021. Date of publication April 12, 2021; date of current version December 16, 2021. This work was supported in part by the National Key Research and Development Program of China under Grant 2019YFC1510905, in part by the National Natural Science Foundation of China under Grant 61671037, and in part by the Beijing Natural Science Foundation under Grant 4192034. (Corresponding author: Zhenwei Shi.)
Rui Zhao and Zhenwei Shi are with the Image Processing Center, School of Astronautics, Beihang University, Beijing 100191, China, also with the Beijing Key Laboratory of Digital Media, Beihang University, Beijing 100191, China, and also with the State Key Laboratory of Virtual Reality Technology and Systems, School of Astronautics, Beihang University, Beijing 100191, China (e-mail: [email protected]; [email protected]).
Zhengxia Zou is with the Department of Computational Medicine and Bioinformatics, University of Michigan at Ann Arbor, Ann Arbor, MI 48109 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TGRS.2021.3070383

Abstract— Automatically generating language descriptions of remote sensing images has become an emerging research hot spot in the remote sensing field. Attention-based captioning, as a representative group of recent deep learning-based captioning methods, shares the advantage of generating the words while highlighting corresponding object locations in the image. Standard attention-based methods generate captions based on coarse-grained and unstructured attention units, which fail to exploit structured spatial relations of semantic contents in remote sensing images. Although this structural characteristic makes remote sensing images widely divergent from natural images and poses a greater challenge for the remote sensing image captioning task, the key of most remote sensing captioning methods is usually borrowed from the computer vision community without considering the domain knowledge behind it. To overcome this problem, a fine-grained, structured attention-based method is proposed to utilize the structural characteristics of semantic contents in high-resolution remote sensing images. Our method learns better descriptions and can generate pixelwise segmentation masks of semantic contents. The segmentation can be jointly trained with the captioning in a unified framework without requiring any pixelwise annotations. Evaluations are conducted on three remote sensing image captioning benchmark data sets with detailed ablation studies and parameter analysis. Compared with the state-of-the-art methods, our method achieves higher captioning accuracy and can generate high-resolution and meaningful segmentation masks of semantic contents at the same time.

Index Terms— Image captioning, image segmentation, remote sensing image, structured attention.

I. INTRODUCTION

IMAGE captioning is an important computer vision task that emerged in recent years and aims to automatically generate language descriptions of an input image [1], [2]. In the remote sensing field, image captioning also has attracted increasing attention recently due to its broad application prospects both in civil and military usages, such as remote sensing image retrieval and military intelligence generation [3]. Different from other tasks in the remote sensing field, such as object detection [4]–[8], classification [9]–[11], and segmentation [12]–[15], remote sensing image captioning focuses more on generating comprehensive sentence descriptions rather than predicting individual category tags or words. To generate accurate and detailed descriptions, the captioning model needs to not only determine the semantic contents that exist in the image but also have a good understanding of the relationship between them and what activities they are engaged in [2].

In the most recent deep learning-based image captioning methods, the models are built based on the "encoder–decoder" network architecture [1], [2], [16]–[18]. In the encoding stage, deep convolutional neural networks (CNNs) are used to extract high-level internal representations of the input image. In the decoding stage, a recurrent neural network (RNN) is typically trained to decode the representations to sentence descriptions. More recently, the visual attention mechanism, a technique derived from automatic machine translation [19], [20], has greatly promoted the research progress in image captioning [21]–[26]. The attention mechanism was originally introduced to improve the performance of an RNN model by taking into account the input from several time steps to make one prediction [19]. In image captioning, visual attention can help the model better exploit spatial correlations of semantic contents in the image and highlight those contents while generating corresponding words [21].

For the remote sensing image captioning task, Qu et al. [27] first proposed a deep multimodal neural network model for high-resolution remote sensing image caption generation. Shi and Zou [3] proposed a fully convolutional network (FCN) captioning model, which mainly focuses on the multilevel semantics and semantic ambiguity problems. Lu et al. [28] explored several encoder–decoder-based methods and their attention-based variants and published a remote sensing image caption data set named RSICD. Wang et al. [29] introduced the multisentence captioning task and proposed a framework using semantic embedding to measure the image representation and the sentence representation to improve captioning results. Lu et al. [30] proposed a sound active attention (SAA) framework for more specific caption generation according to the interest of the observer. Wang et al. [31] proposed the retrieval topic recurrent memory network that first retrieves the topic words of input remote sensing images from the
topic repository and then generates the captions by using a recurrent memory network [32] based on both the topic
words and the image features. Ma et al. [33] proposed two
multiscale captioning methods to grab multiscale information
for generating better captions. Cui et al. [34] proposed an
attention-based remote sensing image semantic segmentation
and spatial relationship recognition method. However, the cap-
tioning module in their method just follows the classical
model [21] without modification based on the characteristic of
remote sensing images. The captioning module is independent
of other modules, and the accuracy of caption generation is
not improved by other modules. Sumbul et al. [35] proposed
a summarization-driven image captioning method, which inte-
grated the summarized ground-truth captions to generate more
detailed captions for remote sensing images. Li et al. [36]
proposed a truncation cross entropy loss to deal with the overfitting problem. Wang et al. [37] proposed a word–sentence framework to extract the valuable words first and then organize them into a well-formed caption. Huang et al. [38] proposed a denoising-based multiscale feature fusion mechanism to enhance the image feature extraction. Li et al. [39] proposed a multilevel attention model to enhance the effect of attention through a hierarchical structure. Wu et al. [40] proposed a scene attention mechanism that tries to capture the scene information to improve the captions.

These remote sensing image captioning methods are all based on the encoder–decoder architecture and can be roughly divided into two groups: 1) methods without visual attention mechanisms that are constructed between the caption and image spaces [3], [27]–[29], [31], [35]–[38] and 2) methods with visual attention mechanisms [28], [30], [33], [34], [39], [40]. The visual attention mechanisms in these methods are designed based on coarse-grained, unstructured attention units, which fail to exploit structured spatial relations of semantic contents in remote sensing images. For example, in the popular natural image captioning method "Show, attend and tell" [21], the authors uniformly divide the image feature map into 14 × 14 spatial units. However, in remote sensing images, the semantic contents are usually highly structured, where narrow and irregularly shaped objects, such as roads, rivers, and structures, usually occupy a large portion. The uniform division of the feature map inevitably leads to an underexploitation of the spatial structure of remote sensing semantic contents. Besides, due to the coarse division of attention units, these methods also fail to produce fine-grained attention maps of irregularly shaped semantic contents.

In this article, we show that structured and pixel-level regional information can be used to enhance the efficacy of attention-based remote sensing image captioning. In computer vision, in-depth research has been made on the pixel-level description of irregularly shaped semantic contents, where a representative group of methods is semantic segmentation [41]–[45]. We, thus, introduce a structured attention module in our captioning model and propose a joint captioning and segmentation framework for high-resolution remote sensing images by taking advantage of the structured attention mechanism. The structured attention module aims to focus on the semantic contents in the remote sensing images with structured geometry and appearance. Structured attention is performed on each structured unit obtained in the segmentation proposal generation, that is, the pixels within each structured unit receive the same attention weight, while different structured units get different attention weights. In this way, the proposed method can exploit the spatial structure of semantic contents and produce fine-grained attention maps to guide the decoder to more proper caption generation. We show that our method generates better sentence descriptions and pixel-level object masks under a unified framework. It is worth mentioning that, in our method, the segmentation is trained solely based on the image-level ground-truth sentences and does not require any pixelwise annotations.

Fig. 1. Brief comparison between (a) the standard attention-based captioning method and (b) the proposed structured attention-based captioning method. In (a), the captions are learned from a set of coarse and unstructured image regions. As a comparison, in (b), our method exploits the fine-grained structure of the image and, thus, generates more accurate descriptions.

Fig. 1 shows the key differences between the proposed method and previous attention-based methods. In our method, we first divide the input image into a set of class-agnostic segmentation proposals and then encode the structure of each of the segmentation proposals into our attention module. Structured attention can guide the model to accurately focus on highly structured semantic contents during the training, thereby improving the performance of the image captioning task. Although the class label of each proposal is not available during the training and is considered as a latent variable, we show that the correspondence between the predicted words and the attention weights for each proposal can be adaptively learned under a weakly supervised training process. Our method, therefore, produces much more accurate attention maps for the semantic contents than those unstructured attention methods.

Extensive evaluations of our method are made on three benchmark data sets. Our method achieves higher captioning accuracy than other state-of-the-art captioning methods and generates object masks with high quality. Detailed ablation studies and parameter analysis are also conducted, which suggest the effectiveness of our method.

The contributions of this article are summarized as follows.

Fig. 2. Overview of the proposed captioning method. Our method consists of three parts: an encoder, which maps the input image to feature maps; a decoder,
which generates the sentences based on the image feature; and a structured attention module, which interacts with the decoder during the captioning and at
the same time generates pixel-level object masks.

1) We propose a novel image captioning method for high-resolution remote sensing images based on the structured attention mechanism. The proposed method deals with image captioning and pixel-level segmentation under a unified framework.
2) We investigate the possibility of using structured attention for weakly supervised image segmentation. To the best of our knowledge, such a topic has rarely been studied before.
3) We achieve higher captioning accuracy over other state-of-the-art methods on three remote sensing image captioning benchmark data sets.

The rest of this article is organized as follows. In Section II, we will introduce the structured attention and the details of our method. Experimental results and analysis are given in Section III. The conclusions are drawn in Section IV.

II. METHODOLOGY

In this section, we give a detailed introduction to the proposed structured attention method and how we build our image captioning model on top of it.

A. Overview of the Method

The captioning model proposed in this article mainly consists of three parts: an encoder, a decoder, and a structured attention module. Fig. 2 shows the processing flow of the proposed model. From the natural image captioning literature, we borrow the encoder–decoder framework that has been shown to work well in the image captioning task. We use a deep CNN as our encoder to extract high-level feature representations from the input image. It is worth mentioning that our method is, indeed, independent of the choice of the backbone model: any deep CNN can be used as an encoder. We use a long short-term memory (LSTM) network [46], [47] as our decoder to decode the image features to the sentence description. Before feeding features to the structured attention module, we use a predefined method, "Selective Search" [48], to segment the input image into a set of class-agnostic segmentation proposals based on color and texture features. The selective search module in our framework requires the remote sensing image to be of high resolution to extract available segmentation proposals. Fig. 3 shows some samples generated by the selective search. The proposals are then synchronously encoded to our attention module with the image features by using a newly proposed pooling method, named the "structured pooling" method. In this way, the original image features are recalibrated, and we, thus, can obtain a set of structured region descriptions for captioning and mask generation. The attention weights generated by the model for each region on predicting a certain word are considered as the probability that the region belongs to the word category (e.g., building, tree, and bridge).

Fig. 3. Input images (first row) and the class-agnostic segmentation proposals generated by the selective search method (second row).

B. Encoder and Decoder

We use the 50-layer deep residual network (ResNet-50) [49] as our encoder. We remove the fully connected layer (prediction layer) of the ResNet-50 and use the feature maps produced by the last convolution block "Conv_5" as our internal feature representations.
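As a concrete illustration of the encoder just described, the following sketch builds an ImageNet-pretrained ResNet-50 truncated after its last convolution block so that it returns spatial feature maps rather than class scores. It is only a minimal example written with torchvision; the class name and tensor shapes are ours and are not taken from the paper.

```python
# Illustrative sketch of the encoder described above (not the authors' code):
# an ImageNet-pretrained ResNet-50 truncated after its last convolution block
# ("Conv_5" / layer4), so it outputs spatial feature maps instead of logits.
import torch
import torch.nn as nn
import torchvision.models as models

class ResNet50Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        # Drop the global average pooling and the fully connected (prediction)
        # layer, keeping conv1 ... layer4 ("Conv_5").
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, images):
        # images: (B, 3, H, W)  ->  feature maps F: (B, 2048, H/32, W/32)
        return self.features(images)

if __name__ == "__main__":
    encoder = ResNet50Encoder().eval()
    with torch.no_grad():
        feats = encoder(torch.randn(2, 3, 256, 256))
    print(feats.shape)  # torch.Size([2, 2048, 8, 8])
```

For a 256 × 256 input, the resulting feature map F has a spatial size of 8 × 8 with 2048 channels.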

Our decoder is a one-layer LSTM with 512 hidden units. The LSTMs are a special kind of RNN capable of learning long-term dependencies. By selectively forgetting and updating information in the training process, the LSTM can achieve better performance on complex sequential prediction problems than the vanilla RNNs. Our decoder is trained to generate the word score vector y_t ∈ R^K at each time step t based on a context vector z_t, a previous hidden state vector h_{t-1}, and a previously generated word score vector y_{t-1}, where K is the size of the vocabulary. The prediction of y_t can be written as follows:

    y_t = L_o ( L_h h_t + L_y y_{t-1} + L_z z_{t-1} )                                (1)

where L_h, L_y, and L_z are a group of trainable parameters that transform the input vectors to calibrate their dimensions, and L_o is a group of trainable parameters that transforms the summarized vectors to the output score vectors. To compute h_t, z_t, and y_t at each time step, the LSTM gates and internal states are defined as follows:

    i_t = \sigma( W_i x_t + b_i )
    f_t = \sigma( W_f x_t + b_f )
    o_t = \sigma( W_o x_t + b_o )
    \tilde{c}_t = \tanh( W_c x_t + b_c )
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \tanh( c_t )                                                     (2)

where x_t is the concatenation of the previous hidden state h_{t-1}, the previously generated word vector y_{t-1}, and the context vector \hat{z}_t, that is, x_t = [h_{t-1}; P y_{t-1}; \hat{z}_t], and P ∈ R^{m×K} is an embedding matrix, where m denotes the embedding dimension. i_t, f_t, o_t, c_t, and h_t are the outputs of the input gate, forget gate, output gate, memory, and hidden state of the LSTM, respectively. W_i, W_f, W_o, and W_c are trainable weight matrices, and b_i, b_f, b_o, and b_c are their trainable biases. σ(·), tanh(·), and ⊙ represent the logistic sigmoid activation, the hyperbolic tangent function, and the elementwise multiplication operation, respectively.

Finally, to produce the word probabilities p_t, we use a "softmax" layer to normalize the generated score vectors to probabilities

    p_t = softmax( y_t ) = \exp( y_t ) \Big/ \sum_{i=1}^{K} \exp\big( y_t^{(i)} \big).    (3)

C. Structured Attention

1) Structured Pooling: Most CNNs produce unstructured image feature representations. Here, we propose a new pooling operation called "structured pooling" to generate structured feature representations given a set of region proposals of any shape. The structured pooling can be considered as a modification of the standard "region of interest (ROI) pooling." The difference between the two operations is that the ROI pooling is only designed to pool the features from rectangular regions, while the structured pooling applies to regions of any shape. Suppose that I is the input image and F ∈ R^{h×w×c} represents the image features produced by the encoder, where h, w, and c are the height, width, and the number of channels, respectively. The region proposals R_i, i = 1, . . . , N, produced by the selective search are considered as the base units when performing structured pooling.

For the unit i, the structured feature representation s_i produced by the structured pooling can be represented as follows:

    s_i = \frac{1}{hw} \sum_{(x,y) \in \tilde{R}_i} F(x, y) \odot \tilde{R}_i             (4)

where \tilde{R}_i is the projected region proposal, which is resized from the size of the input image to the size of the feature map. The summation is performed over the pixels (x, y) within the region \tilde{R}_i along the spatial dimensions. It should be noticed that, when we average the feature values within a certain region \tilde{R}_i, we divide by the number of all spatial pixels in the feature map (hw) instead of the number of valid pixels in that region. The reason behind this is that we want to weight the features according to their structured unit size, that is, the features of small structured units will be less weighted to reduce the noise effect from these regions.

Fig. 4 gives a simple illustration of the proposed structured pooling operation. To help understanding, in this figure, we show an alternative but equivalent way of performing structured pooling, where we first pixelwisely multiply the features with a set of resized region masks and then perform global average pooling to produce the pooling output. To reduce the misalignment effect, we use bilinear interpolation when we reduce the size of the binary region masks.

Fig. 4. Illustration of the proposed structured pooling method. The notation ⊙ represents the elementwise product operation.
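A minimal sketch of the structured pooling in (4) is given below, following the equivalent mask-multiply-then-average formulation of Fig. 4. It assumes the N proposals are available as binary masks at the input resolution; the function and tensor names are illustrative and not taken from the paper.

```python
# Illustrative sketch of the structured pooling in (4); tensor names are ours.
import torch
import torch.nn.functional as F

def structured_pooling(feature_map, region_masks):
    """
    feature_map:  (B, C, h, w)   encoder features F
    region_masks: (B, N, H, W)   binary masks of the N segmentation proposals
    returns s:    (B, N, C)      one structured feature vector s_i per proposal
    """
    B, C, h, w = feature_map.shape
    # Project each proposal mask to the feature-map resolution (bilinear, as in Fig. 4).
    masks = F.interpolate(region_masks.float(), size=(h, w),
                          mode="bilinear", align_corners=False)       # (B, N, h, w)
    # Elementwise product of features and masks, then average over ALL h*w positions
    # (dividing by h*w rather than the region area, so small regions get less weight).
    s = torch.einsum("bchw,bnhw->bnc", feature_map, masks) / (h * w)
    return s
```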

It is worth noting that, although proposal-based approaches are widely used in computer vision tasks, the proposed structured attention method is designed specifically for remote sensing images. Since remote sensing images are captured from high above, many semantic contents in remote sensing images, such as rivers and bridges, show highly structured characteristics. For example, bridges are always long and straight, and rivers are always winding and slender, while buildings mostly appear in the form of regular polygon aggregations. As a comparison, semantic contents in natural images often lack regular and structured shape outlines under different views and occlusions. Since the proposed structured attention mechanism relies on effective structure extraction of the semantic contents in images, it is more suitable for remote sensing images than for natural images.

2) Context Vector Generation: The context vector \hat{z}_t in our method is a dynamic representation of the corresponding structured unit of the image at time t. Given the feature representations s_i and the previous hidden state h_{t-1}, we calculate an attention weight value \alpha_t^{(i)}, which represents the degree of correlation between the structured units and the generated word vector y_t:

    \tilde{\alpha}_t^{(i)} = f_{att}( s_i, h_{t-1} )                                   (5)

where f_{att}(·) represents a multilayer perceptron (MLP), which is trained to generate the attention weights.

To build the network f_{att}(·), we first adjust the dimensions of s_i and h_{t-1} to the same number by passing each of them through a fully connected layer. Then, the transformed vectors are added together to fuse the information from both the structured unit and the context, and the fusion vector is further fed to another fully connected layer to produce the attention weight \alpha_t^{(i)}:

    f_{att}( s_i, h_{t-1} ) = f_3( \mathrm{ReLU}( f_1(s_i) + f_2(h_{t-1}) ) )           (6)

where f_1, f_2, and f_3 represent the three fully connected layers and ReLU(·) represents the rectified linear unit activation function. Then, the attention weights of the N unique regions at the time step t are normalized with a softmax layer to produce the final attention vector \alpha_t:

    \alpha_t = softmax\big( [ \tilde{\alpha}_t^{(1)}, \ldots, \tilde{\alpha}_t^{(N)} ] \big).    (7)

Once we get the attention weight vector, the context vector z_t can be finally computed as a linear combination of the structured feature representations s_i and their attention weights \alpha_t^{(i)}:

    z_t = \sum_{i=1}^{N} \alpha_t^{(i)} s_i.                                           (8)

Note that, at the time step t, the attention weight of each structured unit is computed based on the same context information h_{t-1}, which ensures that the initial competitiveness of each structured unit is fair and reduces the possibility of introducing deviation. This is called the "context information broadcast" mechanism, which was introduced by Vinyals et al. [1] for the first time.

3) Object Masks' Generation: We generate the object masks based on the attention weights of each structured region. In our attention module, the attention weights \alpha_t^{(i)}, i = 1, . . . , N, represent the semantic correlation between the tth word of the sentence and each of the N regions. The larger the \alpha_t^{(i)}, the more relevant it is to the ith structured unit R_i. We use the attention weights as the category probability of the segmentation output. The nouns of the semantic contents of interest can be easily picked out from the generated captions by comparing each word to a predefined noun set. The pixelwise object masks can be finally generated by binarizing the segmentation weights of each region.

D. Loss Functions

Since image captioning is a sequential prediction problem of each word in the sentence, we follow the previous works [1], [21] and formulate the prediction of each word as a regularized classification process. The loss can, thus, be written as a running sum of the regularized cross-entropy loss of each word in the sentence:

    L(x) = - \sum_{t=1}^{C} \log\Big( \sum_{j=1}^{K} \hat{y}_t^{(j)} p_t^{(j)} \Big) + \beta\, r_d(\alpha_t) + \gamma\, r_v(\alpha_t)    (9)

where p_t = [ p_t^{(1)}, . . . , p_t^{(K)} ] is the predicted word probability vector, \hat{y}_t = [ \hat{y}_t^{(1)}, . . . , \hat{y}_t^{(K)} ] is the one-hot label of the tth word in the ground-truth caption, and \hat{y}_t^{(j)} ∈ {0, 1}. C is the number of words in the generated sentence. r_d(\alpha_t) and r_v(\alpha_t) are the doubly stochastic regularization (DSR) [21] and the proposed attention variance regularization (AVR), which we will introduce later. β and γ are the weight coefficients for balancing different loss terms.

1) Doubly Stochastic Regularization: In Section II-C2, we show that \sum_i \alpha_{t,i} = 1 since the attention weights are finally normalized by a softmax function. Here, we further regularize the attention weights from the time dimension and introduce the DSR as follows:

    r_d(\alpha_t) = \sum_{i=1}^{N} \Big( 1 - \sum_{t=1}^{C} \alpha_{t,i} \Big)^2.       (10)

This regularization term encourages the model to pay equal attention to each part of the image during the generation of captions. In other words, it can prevent some regions from always receiving strong attention, while it can prevent other regions from being ignored all the time.

2) Attention Variance Regularization: When we fix the time step t and look at the attention weights of each structured region, we usually hope to see these regions receive highly diverse attention. This means that we do not want each region to receive equal attention. We, thus, design the AVR term to enforce the regions to have a high variance in their attention weights:

    r_v(\alpha_t) = - \sum_{t=1}^{C} \| \alpha_t - E\{\alpha_t\} \|_2^2 = - \sum_{t=1}^{C} \Big\| \alpha_t - \frac{1}{N} \Big\|_2^2      (11)

where we have E{\alpha_t} = 1/N since \alpha_t has been normalized by the softmax function. We take the negative value of the l_2 norm since we want to maximize the attention variance. It is easy to prove that, when \alpha_t is a one-hot vector, i.e., only one region contributes to the prediction of the current word, the value of r_v(\alpha_t) will reach its minimum. On the contrary, when every region receives equal attention, i.e., \alpha_t^{(i)} = 1/N, i = 1, . . . , N, which we do not hope to see, r_v(\alpha_t) will be maximized.

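To make the attention module of (5)–(8) and the two regularizers of (10) and (11) concrete, a compact sketch is given below. The layer sizes, names, and tensor layout are assumptions made for illustration only; during training, the two terms would simply be added to the per-word cross-entropy with the weights β and γ of (9).

```python
# Illustrative sketch of the structured attention of (5)-(8) and the
# regularizers of (10) and (11); module and variable names are ours.
import torch
import torch.nn as nn

class StructuredAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.f1 = nn.Linear(feat_dim, attn_dim)    # transforms s_i
        self.f2 = nn.Linear(hidden_dim, attn_dim)  # transforms h_{t-1}
        self.f3 = nn.Linear(attn_dim, 1)           # produces the scalar weight

    def forward(self, s, h_prev):
        # s: (B, N, feat_dim) structured features, h_prev: (B, hidden_dim)
        e = self.f3(torch.relu(self.f1(s) + self.f2(h_prev).unsqueeze(1)))  # Eq. (6), (B, N, 1)
        alpha = torch.softmax(e.squeeze(-1), dim=1)                          # Eq. (7), (B, N)
        z = (alpha.unsqueeze(-1) * s).sum(dim=1)                             # Eq. (8), (B, feat_dim)
        return z, alpha

def dsr_term(alphas):
    # alphas: (B, C, N), attention weights over the C generated words; Eq. (10).
    return ((1.0 - alphas.sum(dim=1)) ** 2).sum(dim=1).mean()

def avr_term(alphas):
    # Negative squared distance to the uniform distribution at each step; Eq. (11).
    # It is minimized when one region takes (almost) all of the attention.
    N = alphas.size(-1)
    per_step = ((alphas - 1.0 / N) ** 2).sum(dim=-1)   # ||alpha_t - 1/N||^2 for each t
    return -per_step.sum(dim=1).mean()
```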
E. Implementation Details

1) Training Details: In the training phase, we use the Adam optimizer [50] to train our model. We set the regularization coefficients β = γ = 1. We set the learning rate of our encoder to 1e−4 and the learning rate of our decoder to 4e−4. The batch size is set to 64, and the model is trained for 100 epochs. In our encoder, the ResNet-50 is pretrained on the ImageNet data set [51]. To speed up training, we only fine-tune the convolutional blocks 2–4 of the ResNet-50 during training. In our decoder, the memory and the hidden state of the LSTM at time step 0 are initialized separately based on the averaged image features. We use a fully connected layer to transform the features to produce their 0-time inputs.

2) Segmentation Proposal Generation: When we use the selective search to generate segmentation proposals, three key parameters need to be specifically tuned, including a smoothing parameter σ of the Gaussian filter, a min_size parameter, which controls the minimum bounding box size of the proposals, and a scale parameter s, which controls the initial segmentation scales. We set σ = 0.8, min_size = 100, and s = 100. Besides, to prevent oversegmentation, we apply the guided image filter [52] to preprocess the image before the selective search. The guided image filter can effectively smooth the input image while keeping its edges and structures. The smoothed images are only used for generating segmentation proposals. When the encoder computes the image features, we still use the original images.

3) Beam Search: At the inference stage, instead of using a greedy search that chooses the word with the highest score and uses it to predict the next word, we apply the beam search [53] to generate more stabilized captions. The beam search selects the top-k candidates at each time step and then predicts top-k new words accordingly for each of these sequences in the next step. Then, the new top-k sequences of the next time step are selected out of all k × k candidates. It is worth mentioning that, in consideration of computational efficiency, top-k candidate sequences are selected for each time step, and the sequence with the highest score is selected as the final caption output at the last time step. Therefore, up to time t, k sequences are generated instead of k^t. k is called the "beam size," which is set to 5 in our experiments.

III. EXPERIMENTS

In this section, we will introduce in detail our experimental data sets, metrics, and comparison results. We also provide ablation experiments, parameter analysis, and speed analysis to verify the effectiveness of the proposed structured attention module.

A. Experimental Setup

1) Data Sets: We conduct experiments on three widely used remote sensing image captioning data sets: UCM-Captions [27], Sydney-Captions [27], and RSICD [28]. For each data set, we followed the standard protocols on splitting the data set into training, validation, and test sets. In any of the three data sets, each image is labeled with five sentences as ground-truth captions. The following are the details of the three data sets.

a) UCM-Captions: The UCM-Captions data set [27] is built based on the UC Merced land use data set [54]. It contains 2100 remote sensing images from 21 types of scenes. Each image has a size of 256 × 256 pixels and a spatial resolution of 0.3 m/pixel.

b) Sydney-Captions: The Sydney-Captions data set [27] is built based on the Sydney land data set [55]. It contains 613 remote sensing images in total, collected from the Google Earth imagery of Sydney, Australia. Each image has a size of 500 × 500 pixels and a spatial resolution of 0.5 m/pixel.

c) RSICD: The RSICD [28] is the most widely used data set for the remote sensing image caption generation task. It contains 10 921 remote sensing images collected from the AID data set [56] and other platforms, such as Baidu Map, MapABC, and Tianditu. The images are in various spatial resolutions. The size of each image is 224 × 224 pixels.

2) Evaluation Metrics: We use four different metrics to evaluate the accuracy of the generated captions, including the bilingual evaluation understudy (BLEU) [57], ROUGE-L [58], METEOR [59], and CIDEr-D [60], which are all widely used in the recent image captioning literature.

a) BLEU: The BLEU [57] measures the co-occurrences between the generated caption and the ground truth by using n-grams (a set of n ordered words). The key of the BLEU-n (n = {1, 2, 3, 4}) is the n-gram precision, that is, the proportion of the matched n-grams out of the total number of n-grams in the evaluated caption.

b) METEOR: Since the BLEU does not take the recall into account directly, to address this weakness, the METEOR [59] is introduced to compute the accuracy based on explicit word-to-word matches between the captioning and the ground truth.

c) ROUGE-L: ROUGE-L [58] is a modified version of ROUGE, which computes an F-measure with a recall bias using the longest common subsequence (LCS) between the generated and the ground-truth captions.

d) CIDEr-D: CIDEr-D [60] is an improved version of CIDEr, which first converts the caption into the form of the term frequency-inverse document frequency (TF-IDF) vector [61] and then calculates the cosine similarity of the reference caption and the caption generated by the model. CIDEr-D penalizes the repetition of specific n-grams beyond the number of times they occur in the reference sentence.

For any of the above four metrics, a higher score indicates higher accuracy. The scores of BLEU, ROUGE-L, and METEOR are between 0 and 1.0. The score of CIDEr-D is between 0 and 10.0.

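As a practical note, the n-gram based scores above are available in standard toolkits. For instance, a corpus-level BLEU-4 over tokenized captions can be computed with NLTK roughly as follows; the caption strings are invented purely for illustration.

```python
# Quick illustration of corpus-level BLEU-4 with NLTK; the captions are made up.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# For each image: a list of reference token lists (five per image in the data
# sets above; two are shown here) and one generated caption.
references = [
    [
        "many buildings and some green trees are in a dense residential area".split(),
        "lots of houses arranged in lines with trees around them".split(),
    ],
]
hypotheses = [
    "many buildings and green trees are in a residential area".split(),
]

bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.4f}")
```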
TABLE I
ABLATION STUDIES ON THE PROPOSED STRUCTURED ATTENTION MECHANISM. THE EVALUATION SCORES (%) ARE REPORTED ON THE UCM-CAPTIONS DATA SET [27]

TABLE II
ABLATION STUDIES ON THE TWO REGULARIZATION TERMS: DSR AND AVR. THE EVALUATION SCORES (%) ARE REPORTED ON THE UCM-CAPTIONS DATA SET [27]

TABLE III
EVALUATION SCORES (%) OF OUR METHODS WITH A DIFFERENT NUMBER OF PROPOSALS PER IMAGE. ALL MODELS ARE TRAINED AND EVALUATED ON THE UCM-CAPTIONS DATA SET [27]
B. Ablation Studies

The ablation studies are conducted to analyze the importance of three different technical components of the proposed method, including the structured attention module, the DSR, and the AVR. The ablation studies and parameter analysis experiments are performed on the UCM-Captions data set, the Sydney-Captions data set, and the RSICD data set. We found that the proposed method behaves similarly on these data sets. For brevity, we only report results on the UCM-Captions.

We first remove the proposed structured attention module of our method and replace it with a standard soft-attention module [21] while keeping other configurations unchanged. Table I shows the comparison results. The best scores are marked in bold. The results show that, compared with the baseline method, the structured attention improves the accuracy by a large margin in terms of all evaluation metrics (+5.47% on BLEU-4, +3.39% on METEOR, +3.78% on ROUGE-L, and +20.11% on CIDEr-D).

We then gradually remove the regularization terms from our loss function and train the corresponding captioning models separately. Table II shows their evaluation accuracy. We show that the DSR and the AVR can both yield noticeable improvements in captioning accuracy. Particularly, the proposed method trained with both of these two regularization terms achieves the best accuracy on all metrics.

C. Parameter Analysis

We also analyze two important parameters in our method: 1) the number of segmentation proposals N and 2) the beam size k.

We set the number of segmentation proposals N in the selective search to 4, 8, 12, and 16, train each model, and then evaluate the captioning accuracy accordingly. Table III shows the accuracy of our model with different numbers of segmentation proposals. The results show that, when the number of regions increases from 4 to 8, the evaluation scores are greatly improved. However, when the number of regions further increases from 8 to 16, the improvement of the evaluation scores becomes less significant, and some scores even decrease. This is because, when the number of segmentation proposals set by N is much larger than the actual number of regions in the remote sensing image, the oversegmentation of the image will destroy the structures of the semantic contents. We also report the models' training time in the last column of Table III. We show that increasing the number of regions leads to an increase in training time. To balance the accuracy of different metrics, we finally set the number of regions to N = 8.

The beam size k affects the captioning accuracy and the inference time. We set different beam sizes in our method and analyze their accuracy and speed. All models are trained and evaluated on the UCM-Captions data set [27]. Table IV shows the evaluation results of our method with different beam sizes. We can see that, when the beam size increases from 1 to 5, the evaluation scores are improved but saturate at 6. We also show that increasing the beam size leads to a slower inference speed. To balance the captioning accuracy and the inference speed, we set the beam size to k = 5.
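For reference, the beam search procedure of Section II-E3 that produces these results can be sketched as follows. The decoder_step callable, which runs one LSTM step and returns log probabilities together with the updated decoder state, is an assumption of this sketch and is not part of the paper.

```python
# Illustrative beam search (Section II-E3); decoder_step(...) is an assumed
# callable that runs one LSTM step and returns (log_probs, new_state).
import heapq

def beam_search(decoder_step, start_token, end_token, beam_size=5, max_len=30):
    # Each beam entry: (cumulative log prob, token sequence, decoder state).
    beams = [(0.0, [start_token], None)]
    for _ in range(max_len):
        candidates = []
        for score, seq, state in beams:
            if seq[-1] == end_token:          # finished sequences are kept as-is
                candidates.append((score, seq, state))
                continue
            log_probs, new_state = decoder_step(seq[-1], state)
            # Expand each sequence with its top-k next words ...
            for tok, lp in heapq.nlargest(beam_size, enumerate(log_probs),
                                          key=lambda x: x[1]):
                candidates.append((score + lp, seq + [tok], new_state))
        # ... then keep only the best k of the k x k candidates.
        beams = heapq.nlargest(beam_size, candidates, key=lambda x: x[0])
        if all(seq[-1] == end_token for _, seq, _ in beams):
            break
    return max(beams, key=lambda x: x[0])[1]   # highest-scoring caption
```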

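The segmentation proposals whose number N is analyzed above are generated by the selective search step of Section II-E2. A rough sketch of that preprocessing, assuming the third-party selectivesearch Python package and the parameter values reported in the paper (σ = 0.8, scale = 100, min_size = 100), is given below; the guided-filter smoothing and the filtering of duplicate or nested proposals are omitted for brevity.

```python
# Illustrative proposal generation with the selectivesearch package (assumed
# dependency), using the parameter values of Section II-E2.
import numpy as np
import selectivesearch
from skimage import io

def generate_proposals(image_path, n_proposals=8):
    image = io.imread(image_path)                       # (H, W, 3) RGB image
    img_lbl, regions = selectivesearch.selective_search(
        image, scale=100, sigma=0.8, min_size=100)
    base_labels = img_lbl[:, :, 3]                      # per-pixel base segment label
    # Keep the n largest proposals and turn each one into a binary mask
    # (a proposal is the union of the base segments merged into it).
    regions = sorted(regions, key=lambda r: -r['size'])[:n_proposals]
    masks = np.stack([np.isin(base_labels, list(r['labels'])).astype(np.float32)
                      for r in regions])
    return masks                                        # (n_proposals, H, W)
```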
Figs. 5 and 6 show the percentage of accuracy improvement of the proposed method with different numbers of segmentation proposals and different beam sizes. The percentage of improvement is defined as (acc − acc_0)/acc_0 × 100%. We define the accuracy at N = 4 and k = 1 as the baseline accuracy acc_0.

TABLE IV
EVALUATION SCORES (%) OF OUR METHODS WITH DIFFERENT BEAM SIZES. ALL MODELS ARE EVALUATED ON THE UCM-CAPTIONS DATA SET [27]

Fig. 5. (Better viewed in color) Percentage of accuracy improvement of the proposed method with a different number of segmentation proposals N = {4, 6, 8, 10, 12, 16}. All models are trained and evaluated on the UCM-Captions data set [27].

Fig. 6. (Better viewed in color) Percentage of accuracy improvement of the proposed method with different beam sizes k = {1, 2, 3, 4, 5, 6}. All models are evaluated on the UCM-Captions data set [27].

D. Comparison With Other Methods

In this section, we evaluate our method on three data sets and compare it with a variety of recent image captioning methods. The comparison methods include VLAD + RNN [28], VLAD + LSTM [28], mRNN [27], mLSTM [27], mGRU [62], mGRU-embedword [30], ConvCap [63], Soft-attention [28], Hard-attention [28], CSMLF [29], RTRMN [31], and SAA [30]. Among these methods, most of them (except mGRU and ConvCap) are initially designed for the remote sensing image captioning task. However, their basic ideas are mainly borrowed from natural image captioning [1], [21]. The details of these models are described as follows.

1) VLAD + RNN: VLAD + RNN [28] uses the handcrafted feature descriptor "VLAD" [64] as its encoder to compute image representations and uses a naive RNN as its decoder to generate captions.

2) VLAD + LSTM: VLAD + LSTM [28] also uses VLAD to compute the image features, but the difference is that it uses an LSTM as its decoder.

3) mRNN, mLSTM, and mGRU: These three methods [27], [27], [62] all use the VGG-16 [65] as their encoders but use different RNNs (naive RNN, LSTM, and GRU) as their decoders.

4) mGRU-Embedword: Similar to the mGRU [62], the mGRU-embedword [30] also uses the VGG-16 as its encoder and the GRU as its decoder. The difference is that mGRU-embedword uses a pretrained global vector, namely, GloVe [66], to embed words.

5) ConvCap: The ConvCap [63] uses the VGG-16 as its encoder and computes the attention weights based on the activations of the last convolutional layer. Instead of using an RNN-based decoder, this method generates captions by using a CNN-based decoder [63].

6) Soft-Attention and Hard-Attention: Soft-attention [28] and Hard-attention [28] are two methods using the VGG-16 as the encoder and an LSTM as the decoder. The decoders are built based on the soft attention and hard attention mechanisms [21], respectively.

7) CSMLF: CSMLF [29] is a retrieval-based method that uses latent semantic embedding to measure the similarity between the image representation and the sentence representation in a common semantic space.

8) RTRMN: RTRMN [31] uses ResNet-101 as its encoder and then uses a topic extractor to extract topic information. A retrieval topic recurrent memory network is used to generate captions based on the topic words. "RTRMN (semantic)" and "RTRMN (statistical)" are two variants of the RTRMN, which are based on a semantic topics repository and a statistical topics repository, respectively.

9) SAA: SAA [30] introduces an SAA framework to combine the sound information during the generation of captions. SAA uses the VGG-16 and sound GRUs as its encoder and uses another GRU as its decoder.

10) Baseline: We first remove the proposed structured attention module of our method and replace it with a standard soft-attention module [21] while keeping other configurations unchanged, which serves as the baseline method.

TABLE V
EVALUATION SCORES (%) OF DIFFERENT METHODS ON THE UCM-CAPTIONS DATA SET [27]

TABLE VI
EVALUATION SCORES (%) OF DIFFERENT METHODS ON THE SYDNEY-CAPTIONS DATA SET [27]

TABLE VII
EVALUATION SCORES (%) OF DIFFERENT METHODS ON THE RSICD DATA SET [28]

Tables V–VII show the accuracy of our method and the above comparison ones on the three different data sets.

Fig. 7. Captioning results of the baseline method and the proposed method on the RSICD data set [28]. In each group of the image, (a) shows one of the
five ground-truth captions, (b) shows the captions generated by the baseline method, and (c) shows the captions generated by the proposed method. The words
that do not match the images are marked in red.

All comparison methods follow the same fixed partitions of the data (80% for training, 10% for validation, and 10% for test), which makes the comparison fair. In these tables, the best scores are marked in bold. For the comparison methods, the metric scores are taken from the articles that proposed them. Since Qu et al. [27] did not report the ROUGE-L scores on the UCM-Captions and Sydney-Captions data sets, these numbers are missing in Tables V and VI. We can see that our method achieves the best accuracy in most of the entries. For example, on the RSICD data set, our baseline method (Resnet50 + LSTM + soft attention), which also applies beam search with the same beam size during the inference stage, is already better than most of the other methods, as shown in Table VII. When we integrate the structured attention, we further improve our baseline by 2.97% on BLEU-4, 2.56% on METEOR, 1.30% on ROUGE-L, and 15.51% on CIDEr-D. While the baseline and the proposed structured attention method both use ResNet-50 as the encoder and LSTM as the decoder, the proposed approach always achieves a score higher than the baseline, which means that the achieved improvement is due to the proposed structured attention method.

E. Qualitative Analysis

1) Caption Generation Results: In Fig. 7, we show some captioning results of our method on the RSICD data set. In each group of the results, we show one of the five ground-truth sentences, the sentence generated by our method (Resnet50 + LSTM + structured attention), and that generated by our baseline (Resnet50 + LSTM + soft attention) accordingly. The generated words that do not match the images are marked in red. As we can see, the baseline method tends to generate false descriptions of small objects or irregularly shaped objects. This is because the regions captured by the soft attention are coarse-grained and unstructured, which leads to insufficient use of structured information and low-level visual information. As a comparison, our method generates more accurate descriptions, which is because the structured attention can fully exploit the structured information of remote sensing semantic contents.

2) Visualization of the Attention Weights: In Fig. 8, we visualize the weights produced by the standard soft attention and the proposed structured attention. For each image, we visualize the attention weights of the attention module when the decoder is generating corresponding words, such as trees, playgrounds, and buildings.

Fig. 8. Visualization of the attention weights on the RSICD data set [28]. The attention weights are displayed in different colors and are overlaid on
the original image for a better view. The proposed structured attention-based method produces much more detailed and structurally rational attention weights
compared to the standard soft attention-based method [21]. To reduce the mosaic effect, the soft attention maps are smoothed by bilinear interpolation multiple
times as suggested by previous literature [21].

For each word, the attention weights are displayed in different colors and are overlaid on the original image for a better view. As we can see, the structured attention can produce much more detailed and structurally rational attention weights compared to the standard soft attention. Although we only train our method with image-level annotations and do not use any pixelwise annotations during the training, our method still produces accurate and meaningful segmentation results. The attention maps generated by our method, thus, can be considered as a new way for weakly supervised image segmentation.

3) Visualization of Object Masks: Fig. 9 visualizes the regions in some images where the attention is heavily weighted as the decoder generates specific classes of words, such as the river, bridge, building, and tree. We can also see that some generated object masks usually have specific structures. For example, the river is winding, and the bridge is long and straight. By leveraging the low-level vision from the segmentation proposals with typical structures, the proposed structured attention can help the decoder to generate more accurate captions.

F. Speed Performance

TABLE VIII
COMPARISON BETWEEN OUR METHOD AND THE BASELINE ON THE NUMBER OF PARAMETERS, FLOPS, TRAINING TIME, AND INFERENCE SPEED (IMAGES PER SECOND). ALL RESULTS ARE REPORTED BASED ON THE UCM-CAPTIONS DATA SET

In Table VIII, we report the number of model parameters, the number of floating-point operations (FLOPs), the training time, and the inference speed (images per second) of our method. The results are computed on the RSICD data set. The training time and inference speed are tested on an NVIDIA TITAN X (Pascal) graphics card. Compared with the baseline method, the proposed method does not increase the number of model parameters. This is because the basic network structure does not need to be changed when we modify the standard attention module to the proposed structured attention module. The proposed method even decreases the number of FLOPs because the baseline method generates 14 × 14 = 196 groups of attention weights, each corresponding to a uniform grid in the image, while the proposed method only needs to generate eight groups of attention weights since we set the number of structured regions to 8 in our method. We extract and save the region proposals of the images locally before training the model and directly load them from the local disk during training.

Fig. 9. Visualization of the object masks generated by our method from the RSICD data set. There are ten classes of semantic contents whose masks can be generated by our method, including the baseball field, beach, bridge, center, church, playground, pond, port, river, and storage tanks.

Therefore, the training time is not affected by the region proposal extraction. Since the structured attention module produces only eight structured attention weights instead of the 196 used in a standard soft attention module, if we do not consider the segmentation time, the inference speed even becomes faster.

IV. CONCLUSION

We proposed a new image captioning method for remote sensing images based on the structured attention mechanism. The proposed method achieves captioning and weakly supervised segmentation under a unified framework. Different from the previous methods that are based on coarse-grained and soft attentions, we show that the proposed structured attention-based method can exploit the structured information of semantic contents and generate more accurate sentence descriptions. Experiments on three public remote sensing image captioning data sets suggest the effectiveness of our method. Compared with other state-of-the-art captioning methods, our method achieves the best results on most evaluation metrics. The visualization experiments also show that our method can generate much more detailed and meaningful object masks than the soft attention-based method.

Our method also has limitations. It is only applicable to high-resolution remote sensing images. If the image is not of high resolution, the selective search method will fail to extract available segmentation proposals to support the proposed structured attention mechanism. In addition, the number of segmentation proposals used in our structured attention module is fixed. When the input image contains more semantic content than the predefined number of proposals, the structured attention module may fail to focus on the most salient regions. In future work, we will make the number of segmentation proposals adaptive. Another future direction is to combine remote sensing image captioning with object detection. Particularly, we will focus on weakly supervised detection, i.e., training the detector only based on the sentence annotations.

REFERENCES

[1] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3156–3164.
[2] X. Chen and C. L. Zitnick, "Learning a recurrent visual representation for image caption generation," 2014, arXiv:1411.5654. [Online]. Available: http://arxiv.org/abs/1411.5654
[3] Z. Shi and Z. Zou, "Can a machine generate humanlike language descriptions for a remote sensing image?" IEEE Trans. Geosci. Remote Sens., vol. 55, no. 6, pp. 3623–3634, Jun. 2017.

[4] Z. Zou, Z. Shi, Y. Guo, and J. Ye, "Object detection in 20 years: A survey," 2019, arXiv:1905.05055. [Online]. Available: http://arxiv.org/abs/1905.05055
[5] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, "Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 6, pp. 3325–3337, Jun. 2015.
[6] Z. Zou and Z. Shi, "Ship detection in spaceborne optical image with SVD networks," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 10, pp. 5832–5845, Oct. 2016.
[7] G. Cheng, P. Zhou, and J. Han, "Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Dec. 2016.
[8] Z. Zou and Z. Shi, "Random access memories: A new paradigm for target detection in high resolution aerial remote sensing images," IEEE Trans. Image Process., vol. 27, no. 3, pp. 1100–1111, Mar. 2018.
[9] G. Cheng, J. Han, and X. Lu, "Remote sensing image scene classification: Benchmark and state of the art," Proc. IEEE, vol. 105, no. 10, pp. 1865–1883, Oct. 2017.
[10] X. Lu, X. Zheng, and Y. Yuan, "Remote sensing scene classification by unsupervised representation learning," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 9, pp. 5148–5157, Sep. 2017.
[11] B. Pan, X. Xu, Z. Shi, N. Zhang, H. Luo, and X. Lan, "DSSNet: A simple dilated semantic segmentation network for hyperspectral imagery classification," IEEE Geosci. Remote Sens. Lett., vol. 17, no. 11, pp. 1968–1972, Nov. 2020.
[12] M. Pesaresi and J. A. Benediktsson, "A new approach for the morphological segmentation of high-resolution satellite imagery," IEEE Trans. Geosci. Remote Sens., vol. 39, no. 2, pp. 309–320, Feb. 2001.
[13] A. A. Farag, R. M. Mohamed, and A. El-Baz, "A unified framework for MAP estimation in remote sensing image segmentation," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 7, pp. 1617–1634, Jul. 2005.
[14] P. Ghamisi, M. S. Couceiro, F. M. L. Martins, and J. A. Benediktsson, "Multilevel image segmentation based on fractional-order Darwinian particle swarm optimization," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2382–2394, May 2014.
[15] Z. Zou, T. Shi, W. Li, Z. Zhang, and Z. Shi, "Do game data generalize well for remote sensing image segmentation?" Remote Sens., vol. 12, no. 2, p. 275, Jan. 2020.
[16] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, "Explain images with multimodal recurrent neural networks," 2014, arXiv:1410.1090. [Online]. Available: http://arxiv.org/abs/1410.1090
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[18] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, "DRAW: A recurrent neural network for image generation," 2015, arXiv:1502.04623. [Online]. Available: http://arxiv.org/abs/1502.04623
[19] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014, arXiv:1409.0473. [Online]. Available: http://arxiv.org/abs/1409.0473
[20] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[21] K. Xu et al., "Show, attend and tell: Neural image caption generation with visual attention," in Proc. Int. Conf. Mach. Learn., Jun. 2015, pp. 2048–2057.
[22] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4651–4659.
[23] L. Zhou, C. Xu, P. Koch, and J. J. Corso, "Image caption generation with text-conditional semantic attention," 2016, arXiv:1606.04621. [Online]. Available: http://arxiv.org/abs/1606.04621
[24] J. Lu, C. Xiong, D. Parikh, and R. Socher, "Knowing when to look: Adaptive attention via a visual sentinel for image captioning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 375–383.
[25] L. Li, S. Tang, L. Deng, Y. Zhang, and Q. Tian, "Image caption with
[28] X. Lu, B. Wang, X. Zheng, and X. Li, "Exploring models and data for remote sensing image caption generation," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 2183–2195, Apr. 2018.
[29] B. Wang, X. Lu, X. Zheng, and X. Li, "Semantic descriptions of high-resolution remote sensing images," IEEE Geosci. Remote Sens. Lett., vol. 16, no. 8, pp. 1274–1278, Aug. 2019.
[30] X. Lu, B. Wang, and X. Zheng, "Sound active attention framework for remote sensing image captioning," IEEE Trans. Geosci. Remote Sens., vol. 58, no. 3, pp. 1985–2000, Mar. 2020.
[31] B. Wang, X. Zheng, B. Qu, and X. Lu, "Retrieval topic recurrent memory network for remote sensing image captioning," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, pp. 256–270, 2020.
[32] K. Tran, A. Bisazza, and C. Monz, "Recurrent memory networks for language modeling," 2016, arXiv:1601.01272. [Online]. Available: http://arxiv.org/abs/1601.01272
[33] X. Ma, R. Zhao, and Z. Shi, "Multiscale methods for optical remote-sensing image captioning," IEEE Geosci. Remote Sens. Lett., early access, Jul. 29, 2020, doi: 10.1109/LGRS.2020.3009243.
[34] W. Cui et al., "Multi-scale semantic segmentation and spatial relationship recognition of remote sensing images based on an attention model," Remote Sens., vol. 11, no. 9, p. 1044, May 2019.
[35] G. Sumbul, S. Nayak, and B. Demir, "SD-RSIC: Summarization-driven deep remote sensing image captioning," IEEE Trans. Geosci. Remote Sens., early access, Oct. 26, 2020, doi: 10.1109/TGRS.2020.3031111.
[36] X. Li, X. Zhang, W. Huang, and Q. Wang, "Truncation cross entropy loss for remote sensing image captioning," IEEE Trans. Geosci. Remote Sens., early access, Jul. 30, 2020, doi: 10.1109/TGRS.2020.3010106.
[37] Q. Wang, W. Huang, X. Zhang, and X. Li, "Word-sentence framework for remote sensing image captioning," IEEE Trans. Geosci. Remote Sens., early access, Dec. 25, 2020, doi: 10.1109/TGRS.2020.3044054.
[38] W. Huang, Q. Wang, and X. Li, "Denoising-based multiscale feature fusion for remote sensing image captioning," IEEE Geosci. Remote Sens. Lett., vol. 18, no. 3, pp. 436–440, Mar. 2021.
[39] Y. Li, S. Fang, L. Jiao, R. Liu, and R. Shang, "A multi-level attention model for remote sensing image captions," Remote Sens., vol. 12, no. 6, p. 939, Mar. 2020.
[40] S. Wu, X. Zhang, X. Wang, C. Li, and L. Jiao, "Scene attention mechanism for remote sensing image caption generation," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2020, pp. 1–7.
[41] J. Xiao and L. Quan, "Multiple view semantic segmentation for street view images," in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep. 2009, pp. 686–693.
[42] Y. Liu, J. Liu, Z. Li, J. Tang, and H. Lu, "Weakly-supervised dual clustering for image semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2075–2082.
[43] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," 2014, arXiv:1412.7062. [Online]. Available: http://arxiv.org/abs/1412.7062
[44] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, "The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 3234–3243.
[45] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," 2017, arXiv:1706.05587. [Online]. Available: http://arxiv.org/abs/1706.05587
[46] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[47] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," 2014, arXiv:1409.2329. [Online]. Available: http://arxiv.org/abs/1409.2329
[48] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, Sep. 2013.
[49] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
global-local attention,” in Proc. 21st AAAI Conf. Artif. Intell., 2017, [50] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-
pp. 4133–4139. tion,” 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/
[26] P. Anderson et al., “Bottom-up and top-down attention for image 1412.6980
captioning and visual question answering,” in Proc. IEEE/CVF Conf. [51] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:
Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6077–6086. A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput.
[27] B. Qu, X. Li, D. Tao, and X. Lu, “Deep semantic understanding of Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
high resolution remote sensing image,” in Proc. Int. Conf. Comput., Inf. [52] K. He, J. Sun, and X. Tang, “Guided image filtering,” IEEE Trans.
Telecommun. Syst. (CITS), Jul. 2016, pp. 1–5. Pattern Anal. Mach. Intell., vol. 35, no. 6, pp. 1397–1409, Jun. 2013.

Authorized licensed use limited to: AMRITA VISHWA VIDYAPEETHAM AMRITA SCHOOL OF ENGINEERING. Downloaded on July 16,2024 at 11:49:18 UTC from IEEE Xplore. Restrictions apply.
5603814 IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 60, 2022

Rui Zhao received the B.S. degree from the Image Processing Center, School of Astronautics, Beihang University, Beijing, China, in 2019, where he is pursuing the master's degree.
His research interests include computer vision, deep learning, and image captioning.

Zhenwei Shi (Member, IEEE) received the Ph.D. degree in mathematics from the Dalian University of Technology, Dalian, China, in 2005.
He was a Post-Doctoral Researcher with the Department of Automation, Tsinghua University, Beijing, China, from 2005 to 2007. He was a Visiting Scholar with the Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL, USA, from 2013 to 2014. He is a Professor and the Dean of the Image Processing Center, School of Astronautics, Beihang University, Beijing. He has authored or coauthored more than 100 scientific articles in refereed journals and proceedings, including the IEEE Transactions on Pattern Analysis and Machine Intelligence, the IEEE Transactions on Image Processing, the IEEE Transactions on Geoscience and Remote Sensing, the IEEE Conference on Computer Vision and Pattern Recognition, and the IEEE International Conference on Computer Vision. His research interests include remote sensing image processing and analysis, computer vision, pattern recognition, and machine learning.
Dr. Shi serves as an Associate Editor for Infrared Physics and Technology and an Editorial Advisory Board Member for ISPRS Journal of Photogrammetry and Remote Sensing.

Zhengxia Zou received the B.S. and Ph.D. degrees from the Image Processing Center, School of Astronautics, Beihang University, Beijing, China, in 2013 and 2018, respectively.
He is working as a Post-Doctoral Research Fellow with the University of Michigan at Ann Arbor, Ann Arbor, MI, USA. His research interests include computer vision and the related applications in remote sensing, self-driving vehicles, and video games.
Dr. Zou serves as a Senior Program Committee Member/Reviewer for a number of top conferences and top journals, including the Neural Information Processing Systems Conference (NeurIPS), the International Conference on Computer Vision and Pattern Recognition (CVPR), the AAAI Conference on Artificial Intelligence (AAAI), the IEEE Transactions on Image Processing, IEEE Signal Processing Magazine (IEEE SPM), and the IEEE Transactions on Geoscience and Remote Sensing. His personal website is http://www-personal.umich.edu/~zzhengxi/.