Article
Target Detection Method for Low-Resolution Remote Sensing
Image Based on ESRGAN and ReDet
Yuwu Wang , Guobing Sun * and Shengwei Guo
College of Electronics Engineering, Heilongjiang University, Harbin 150080, China; [email protected] (Y.W.);
[email protected] (S.G.)
* Correspondence: [email protected]
Abstract: With the widespread use of remote sensing images, low-resolution target detection in
remote sensing images has become a hot research topic in the field of computer vision. In this
paper, we propose a Target Detection on Super-Resolution Reconstruction (TDoSR) method to solve
the problem of low target recognition rates in low-resolution remote sensing images under foggy
conditions. The TDoSR method uses the Enhanced Super-Resolution Generative Adversarial Network
(ESRGAN) to perform defogging and super-resolution reconstruction of foggy low-resolution remote
sensing images. In the target detection part, the Rotation Equivariant Detector (ReDet) algorithm,
which has a higher recognition rate at this stage, is used to identify and classify various types of
targets. A large number of experiments carried out on the remote sensing image dataset DOTA-v1.5 show that the proposed method achieves good results in the target detection of low-resolution foggy remote sensing images. The principal result of this paper is that the recognition rate of the TDoSR method increases by roughly 20% compared with direct detection on low-resolution foggy remote sensing images.
Keywords: remote sensing images; super-resolution reconstruction; target detection; ESRGAN; ReDet
was still a significant gap when compared to the real image [7,9–11]. The Enhanced Deep Residual Networks for Single Image Super-Resolution (EDSR) [12] removed the batch normalization (BN) layers [10] from the network on the basis of SRGAN, expanded the model size, and obtained better super-resolution images after training. While the performance of the above-mentioned methods in the super-resolution reconstruction of remote sensing images was not very good, the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) [13] achieved a markedly better effect. The ESRGAN [13] made three improvements on the basis of the SRGAN. First, the Residual in Residual Dense Block (RRDB), which had a larger capacity and was easier to train, was introduced into the network to replace the original basic residual blocks. Second, the BN layers were removed [10] and the GAN was improved to a Relativistic average GAN (RaGAN) [13]; ESRGAN's discriminator then learned to predict which of two images is relatively more realistic, rather than judging whether a single image is real or fake. Third, the perceptual domain loss function was modified to use the VGG features [14] before activation. After these improvements, the image reconstructed by the ESRGAN had more realistic texture details and attained a better visual effect. Previously, the ESRGAN had not been applied to
low-resolution foggy remote sensing images. The experiments in this paper prove that the
ESRGAN is very suitable for the super-resolution reconstruction of remote sensing images.
Kwan et al. have proposed a method to enhance low-resolution images based on a point
spread function, which has provided a new direction for future research [15].
Through the ESRGAN network, high-quality super-resolution images are obtained
and target detection is performed on the super-resolution images. In this paper, the Rotation-equivariant Detector (ReDet) [16] is selected as the target detection network, and the results are displayed as Oriented Bounding Boxes (OBBs) [17,18]. For an ordinary CNN, rotating the feature map of an input image is not the same as computing the feature map of the rotated image; the features are not rotation equivariant. The ReDet detection method consists of two parts: rotation-equivariant feature extraction and rotation-invariant feature extraction [17,19]. The combination of these two parts realizes the detection of small and dim targets in remote sensing images with higher accuracy.
2. Related Work
Currently, there is no open-source, complete low-resolution remote sensing image dataset. Therefore, the public remote sensing image dataset DOTA-v1.5 [20] has been selected for this research. The Bicubic method is used to down-sample the DOTA dataset so as to obtain low-resolution remote sensing images. Then, fog is artificially simulated on the down-sampled remote sensing images by the mainstream RGB channel fog synthesis method. The resulting low-quality image dataset is then used for the following super-resolution reconstruction research.
Figure 1. (a) Interpolation basis function diagram; (b) schematic diagram of the bicubic algorithm.
The Q point is the source image coordinate point corresponding to the point (X, Y) on the target image B after being reduced several times; the coefficients of the 16 points around point Q are calculated, and the pixel value of point Q is obtained after weighting, as shown in Figure 1b. Here, m is the distance between the point and the abscissa of the upper-left corner point, and n is the distance between the point and the ordinate of the upper-left corner point. To find the coefficient corresponding to each coordinate point:

The distances between the four points in the X axis direction of each row and the Q point are 1 + m, m, 1 − m, 2 − m.

The distances between the four points in the Y axis direction of each column and the Q point are 1 + n, n, 1 − n, 2 − n.

From the interpolation basis function operation, if the row coefficient corresponding to point (i, j) is W(1 + m) and the corresponding column coefficient is W(1 + n), then the coefficient of this point is $K_{0,0} = W(1 + m)\cdot W(1 + n)$.

The coefficients of the remaining points are calculated in the same way, so the coefficients of the four points in the first row are:

$$K_{0,0} = W(1+m)\cdot W(1+n),\quad K_{1,0} = W(m)\cdot W(1+n),\quad K_{2,0} = W(1-m)\cdot W(1+n),\quad K_{3,0} = W(2-m)\cdot W(1+n). \tag{2}$$

The coefficients of the four points in the second row follow the same pattern with W(n) in place of W(1 + n), and likewise the third and fourth rows use W(1 − n) and W(2 − n), respectively.
Figure 2. (a) The image after down-sampling; (b) the fogged image after down-sampling.
2.2. Effective Algorithms for SISR
The following will introduce some of the effective algorithms used in recent years for single image super-resolution reconstruction.
2.2.1. GAN
The GAN [8] consists of a generator G and a discriminator D that are trained against each other by optimizing a minimax value function, which in its standard form is

$$\min_G \max_D V(D, G) = \mathbb{E}_{x}\big[\log D(x)\big] + \mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big]$$

where x represents a real image, z represents the random noise input to the generator G, and G(z) represents the image generated by the generator G. D(x) represents the probability that the discriminator D judges the real image to be real; for D, the closer D(x) is to 1, the better. D(G(z)) represents the probability that D judges the image generated by G to be real. G wants D(G(z)) to be as large as possible, which makes V(D, G) smaller; conversely, D wants D(x) to be as large and D(G(z)) as small as possible, which makes V(D, G) larger.
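As a minimal sketch of this alternating optimization (the tiny fully connected networks below are placeholders, not any model from this paper), one training step in PyTorch might look as follows:

```python
import torch
import torch.nn as nn

# Placeholder generator and discriminator; real models are task-specific.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784))
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

def train_step(real):                      # real: (batch, 784) tensor
    z = torch.randn(real.size(0), 64)      # random noise input to G
    fake = G(z)

    # D step: maximize log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(real.size(0), 1))
    loss_d.backward()
    opt_d.step()

    # G step: make D(G(z)) large (the usual non-saturating form)
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(real.size(0), 1))
    loss_g.backward()
    opt_g.step()
```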
2.2.2. SRGAN
SRGAN [7] applied the GAN [8] to the task of image super-resolution reconstruction for the first time and made improvements in the loss function. SRGAN's network model is divided into three parts: the generator, the discriminator, and the VGG [14] network. In the training process, the generator and the discriminator are trained alternately against each other and iterate continuously; the VGG [14,22,23] network only participates in the calculation of the loss.
The generator of the SRGAN is an improvement made on the basis of SRResNet. The generator network contains multiple residual blocks; each residual block contains two convolutional layers, each followed by batch normalization (BN) [10]. PReLU is taken as the activation function, and two 2× sub-pixel convolution layers are chosen to increase the feature size [6]. The discriminator network of the SRGAN contains 8 convolutional layers. As the network deepens, the number of features continues to increase while the feature size continues to decrease. LeakyReLU is selected as the activation function [22], and the network finally passes through two fully connected layers and a sigmoid activation function to predict the probability that the generated image is a real image.
The loss function of the SRGAN is divided into the generator loss and the discriminator loss. The generator loss consists of a content loss and an adversarial loss. The loss function of the generator is as follows:
$$ l^{SR} = l_X^{SR} + 10^{-3}\, l_{Gen}^{SR} \tag{7} $$

where $l_X^{SR}$ is the content loss and $l_{Gen}^{SR}$ is the adversarial loss. The content loss includes
the MSE loss [6,21] and the VGG loss. The MSE loss is used to calculate the matching
degree between pixels, and the VGG loss is used to calculate the matching degree of a
feature layer. Using the MSE loss yields a good objective evaluation index, but a super-resolution image reconstructed with the MSE loss alone loses more high-frequency information. The VGG loss is added to recover the high-frequency information of the image more effectively.
The calculation of the MSE loss is as follows:

$$ l_{MSE}^{SR} = \frac{1}{r^2 W H} \sum_{x=1}^{rW} \sum_{y=1}^{rH} \left( I_{x,y}^{HR} - G_{\theta_G}\!\left(I^{LR}\right)_{x,y} \right)^2 \tag{8} $$
where r is the up-scaling factor, W and H represent the width and height of the low-resolution image, respectively, $I^{HR}$ is the real high-resolution image, and $I^{LR}$ is the low-resolution image corresponding to the real high-resolution image.
The calculation of the VGG loss is as follows:

$$ l_{VGG/i,j}^{SR} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}\!\left(I^{HR}\right)_{x,y} - \phi_{i,j}\!\left(G_{\theta_G}\!\left(I^{LR}\right)\right)_{x,y} \right)^2 \tag{9} $$
where $\phi_{i,j}$ represents the feature map obtained by the j-th convolution before the i-th max-pooling layer of the VGG network, and $W_{i,j}$ and $H_{i,j}$ are the dimensions of the corresponding feature map in the VGG network.
The adversarial loss of the generator is calculated as follows:

$$ l_{Gen}^{SR} = \sum_{n=1}^{N} -\log D_{\theta_D}\!\left(G_{\theta_G}\!\left(I^{LR}\right)\right) \tag{10} $$

where $D_{\theta_D}\!\left(G_{\theta_G}\!\left(I^{LR}\right)\right)$ is the estimated probability that the reconstructed image $G_{\theta_G}\!\left(I^{LR}\right)$ is a natural high-resolution image.
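Putting Equations (7)–(10) together, a sketch of the SRGAN generator loss in PyTorch could look as follows. Here `vgg_features` stands in for the truncated VGG network φ_{i,j}, which is assumed to be provided, and the combination of both content terms follows the description in the text above.

```python
import torch
import torch.nn.functional as F

def srgan_generator_loss(sr, hr, d_logits_sr, vgg_features):
    """sr, hr: generated and real images; d_logits_sr: discriminator logits for D(G(I_LR));
    vgg_features: a callable returning the phi_{i,j} feature maps (assumed given)."""
    mse_loss = F.mse_loss(sr, hr)                               # Equation (8)
    vgg_loss = F.mse_loss(vgg_features(sr), vgg_features(hr))   # Equation (9)
    adv_loss = F.softplus(-d_logits_sr).mean()                  # Equation (10): -log sigmoid(x) = softplus(-x)
    content_loss = mse_loss + vgg_loss
    return content_loss + 1e-3 * adv_loss                      # Equation (7)
```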
2.2.3. EDSR
Compared with the SRGAN, the EDSR removes the BN layers from the network. For the task of image super-resolution reconstruction, the image generated by the network is required to be consistent with the input source image in terms of brightness, contrast, and color, while only the resolution and some of the details are changed. In the processing of images, the BN layer is equivalent to contrast stretching: the color distribution of the image is normalized, which destroys the original contrast information of the image [12]. Therefore, the BN layer does not perform well in image super-resolution. Adding BN layers also increases the training time and can make the training unstable or even divergent.
The model performance of the EDSR is improved by removing the BN layers from the residual network, increasing the number of residual layers from 16 to 32, and thus expanding the model size. The EDSR uses an L1-norm [23] loss function to optimize the network model. During training, the low-multiple up-sampling model is trained first, and the obtained parameters are then used to initialize the high-multiple up-sampling model, which not only reduces the training time of the high-multiple up-sampling model but also achieves a very good training effect.
The EDSR has achieved a good effect in the super-resolution reconstruction task, but
there is still a large gap in edge detail from the real image.
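A sketch of an EDSR-style residual block with the BN layers removed, as described above (the channel count and the residual scaling factor are illustrative assumptions, not the paper's exact settings):

```python
import torch.nn as nn

class EDSRResBlock(nn.Module):
    """Residual block with the BN layers removed, in the spirit of EDSR."""
    def __init__(self, channels=64, res_scale=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.res_scale = res_scale  # scaling the residual stabilizes wide models

    def forward(self, x):
        # No BN: brightness/contrast statistics of the input are preserved.
        return x + self.res_scale * self.body(x)
```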
3. Experimental Method
Through the research and comparison of image super-resolution reconstruction algorithms, the ESRGAN algorithm, which thus far has the best effect in the field of remote sensing image reconstruction, is selected for the following research. Through the super-resolution processing of low-resolution remote sensing images, the generated super-resolution images are identified and classified. The flow chart of the whole identification pipeline is shown in Figure 3.
Figure 3. A flow chart of the remote sensing image recognition algorithm (GT → down-sampling → LR → fogging → LR(FOG) → super-resolution → SR → identification → result). The real high-definition image is down-sampled to obtain a low-resolution image; fogging is then simulated on the low-resolution image to obtain a low-resolution foggy image, which is super-resolution reconstructed; finally, target recognition is performed on the reconstructed image.
3.1. ESRGAN
ESRGAN's [13] generator network refers to the SRResNet structure. The ESRGAN makes two improvements on the basis of this generator network. First, it removes all the BN layers in the network; after removing the BN layers, the generalization ability of the model and the training speed are both improved. Second, the original residual block is changed to the Residual in Residual Dense Block (RRDB), which combines multi-layer residual networks and dense connections [12,24]. Previous algorithms for super-resolution reconstruction based on the GAN use the discriminator network to determine whether the image generated by the generator is true and natural [9]. The most important improvement of the ESRGAN discriminator network is that it estimates the probability that a real image is relatively more realistic than a generated one. The ESRGAN uses the VGG features before activation in the perceptual domain loss, which overcomes two shortcomings. First, the features after activation are very sparse, especially in deep networks; this sparse activation provides weaker supervision, which lowers the performance of the generator network. Second, using the activated features causes the super-resolution reconstructed image to differ in brightness from the real image.
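A compact sketch of the RRDB structure is given below. The dense-block depth, growth channels, and the 0.2 residual scaling follow the public ESRGAN implementation, but are assumptions here rather than settings taken from this paper.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Five densely connected conv layers; each layer sees all earlier features."""
    def __init__(self, nf=64, gc=32):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(nf + i * gc, gc if i < 4 else nf, 3, padding=1) for i in range(5)
        )
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        feats = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))
            if i < 4:
                out = self.lrelu(out)
                feats.append(out)
        return x + 0.2 * out  # residual scaling

class RRDB(nn.Module):
    """Residual in Residual Dense Block: three dense blocks inside one outer residual."""
    def __init__(self, nf=64, gc=32):
        super().__init__()
        self.blocks = nn.Sequential(DenseBlock(nf, gc), DenseBlock(nf, gc), DenseBlock(nf, gc))

    def forward(self, x):
        return x + 0.2 * self.blocks(x)
```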
The ESRGAN uses a relativistic average discriminator, and the loss function of the discriminator is defined as:

$$ L_D = -\mathbb{E}_{x_r}\!\left[\log D_{Ra}(x_r, x_f)\right] - \mathbb{E}_{x_f}\!\left[\log\big(1 - D_{Ra}(x_f, x_r)\big)\right] \tag{12} $$

where $x_r$ is the real image, $x_f$ is the fake image generated by the generator, $D_{Ra}(x_r, x_f)$ estimates the probability that the real image is relatively more realistic than the average of the generated images, and $D_{Ra}(x_f, x_r)$ estimates the probability that a generated image is relatively more realistic than the average of the real images.
The adversarial loss function of the generator is defined as:

$$ L_G = L_{percep} + \lambda L_G^{Ra} + \eta L_1 \tag{13} $$

where $L_{percep}$ is the perceptual domain loss, $L_G^{Ra}$ is the adversarial loss of the generator, $L_1$ is the 1-norm pixel content loss, and $\lambda$ and $\eta$ are the balancing coefficients.
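A sketch of the relativistic average terms in Equations (12) and (13), with C(·) denoting the raw discriminator logits; the function name and batching are illustrative, and the perceptual and L1 terms are assumed to be computed elsewhere:

```python
import torch
import torch.nn.functional as F

def ragan_losses(c_real, c_fake):
    """c_real, c_fake: raw discriminator logits C(x_r), C(x_f) for a batch."""
    # Relativistic average: D_Ra(x_r, x_f) = sigmoid(C(x_r) - E[C(x_f)])
    d_real = c_real - c_fake.mean()
    d_fake = c_fake - c_real.mean()

    # Equation (12): discriminator loss
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

    # L_G^Ra in Equation (13): the symmetric adversarial term for the generator
    loss_g_ra = F.binary_cross_entropy_with_logits(d_real, torch.zeros_like(d_real)) + \
                F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    return loss_d, loss_g_ra

# Total generator loss, Equation (13): L_G = L_percep + lambda * loss_g_ra + eta * L_1
```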
$$
\begin{aligned}
t_x^* &= \frac{1}{w_r}\big((x^* - x_r)\cos\theta_r + (y^* - y_r)\sin\theta_r\big);\\
t_y^* &= \frac{1}{h_r}\big((y^* - y_r)\cos\theta_r - (x^* - x_r)\sin\theta_r\big);\\
t_w^* &= \log\frac{w^*}{w_r};\quad t_h^* = \log\frac{h^*}{h_r};\\
t_\theta^* &= \frac{1}{2\pi}\big((\theta^* - \theta_r) \bmod 2\pi\big)
\end{aligned} \tag{14}
$$
where the five values are the ground-truth (gt) offsets of the RRoI relative to the HRoI. These offsets are fed into the decoder module to decode the relevant parameters of the RRoI, namely (x, y, w, h, θ). This makes the final RRoI as close as possible to the gt value, which reduces the number of parameters and improves the performance of rotated-box detection.
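A NumPy sketch of the offset encoding in Equation (14), mapping a ground-truth rotated box (x*, y*, w*, h*, θ*) onto a reference RoI (x_r, y_r, w_r, h_r, θ_r); the function name is hypothetical:

```python
import numpy as np

def encode_rroi_offsets(gt, roi):
    """gt, roi: (x, y, w, h, theta) tuples; returns (tx, ty, tw, th, ttheta) per Equation (14)."""
    x, y, w, h, t = gt
    xr, yr, wr, hr, tr = roi
    cos_r, sin_r = np.cos(tr), np.sin(tr)
    tx = ((x - xr) * cos_r + (y - yr) * sin_r) / wr   # project onto the RoI's own axes
    ty = ((y - yr) * cos_r - (x - xr) * sin_r) / hr
    tw = np.log(w / wr)                               # log-scale size offsets
    th = np.log(h / hr)
    tt = ((t - tr) % (2 * np.pi)) / (2 * np.pi)       # normalized angle offset
    return tx, ty, tw, th, tt
```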
3.3. TDoSR
In this method, the down-sampling and super-resolution reconstruction of the image are carried out with a scale factor of ×4. As the images in the DOTA dataset are too large, each image was cropped to 1024 × 1024 before the experiment. Then the MATLAB Bicubic algorithm is used to down-sample the original high-definition remote sensing image to obtain a low-resolution remote sensing image with a size of 256 × 256. The RGB channel fog synthesis method in MATLAB is used to artificially simulate and add fog to the low-resolution remote sensing images.
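The exact MATLAB fogging code is not given in this paper; one common equivalent, sketched here in Python, blends each RGB channel toward an airlight value under the standard atmospheric scattering model I = J·t + A·(1 − t). The parameter values are illustrative.

```python
import numpy as np

def add_synthetic_fog(img, t=0.6, airlight=0.9):
    """img: float RGB array in [0, 1]; t: transmission (lower = thicker fog);
    airlight: atmospheric light intensity. Returns the fogged image."""
    A = np.full_like(img, airlight)      # uniform airlight applied per channel
    return img * t + A * (1.0 - t)       # I = J*t + A*(1 - t)
```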
The training process is divided into two stages. First, we train a PSNR-oriented
model with the L1 loss. The learning rate is initialized as 2 × 10^−4 and decayed by a factor of 2 every 2 × 10^5 iterations. We then employ the trained PSNR-oriented model as an initialization for the generator. The generator is trained using the loss function in Equation (13) with λ = 5 × 10^−3 and η = 0.01. The learning rate is set to 1 × 10^−4 and
halved at 50 K, 100 K, 200 K, and 300 K iterations. Pre-training with pixel-wise loss helps
GAN-based methods to obtain more visually pleasing results. We use Adam [27] and
alternately update the generator and discriminator network until the model converges.
The low-resolution foggy remote sensing image is then input into the trained ESRGAN
model for super-resolution reconstruction, and a high-resolution remote sensing image
with a size of 1024 × 1024 is obtained.
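In PyTorch terms, the second-stage learning-rate schedule described above corresponds to something like the following sketch; the optimizer construction and the stand-in generator are illustrative only.

```python
import torch
import torch.nn as nn

generator = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the ESRGAN generator
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50_000, 100_000, 200_000, 300_000], gamma=0.5
)  # halve the learning rate at 50K, 100K, 200K, and 300K iterations
# scheduler.step() is called once per training iteration, not per epoch.
```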
The training for ReDet is as follows. For the original ResNet, we directly use the
ImageNet pretrained models from PyTorch [28]. For ReResNet, we implement it based on
the mmclassification [29]. We train ReResNet on the ImageNet-1K with an initial learning
rate of 0.1. All models are trained for 100 epochs and the learning rate is divided by 10 at
(30, 60, 90) epochs. The batch size is set to 256. We then fine-tune on detection. We adopt ResNet with FPN [25] as the backbone of the baseline method. ReResNet with ReFPN is adopted as the backbone of the proposed ReDet. For RPN, we set 15 anchors per location of each pyramid
level. For R-CNN, we sample 512 RoIs with a 1:3 positive to negative ratio for training. For
testing, we adopt 10,000 RoIs (2000 for each pyramid level) before NMS and 2000 RoIs after
NMS. We adopt the same training schedules as mmdetection [29]. The SGD optimizer is
adopted with an initial learning rate of 0.01, and the learning rate is divided by 10 at each
decay step. The momentum and weight decay are 0.9 and 0.0001, respectively. We train all
models in 12 epochs for the DOTA.
The high-resolution remote sensing image obtained in the previous step is then input into the trained ReDet detector, which finally yields the recognition rate of the various targets.
Hardware equipment: This experiment was carried out on two pieces of equipment.
The hardware configurations of Device 1 are: CPU—Intel(R) Core (TM) [email protected]
x12 from Intel San Francisco, USA; GPU—NVIDIA GeForce GTX 1650 from NVIDIA in
Santa Clara, USA; memory—16 GB from GALAXY Hong Kong, China.
The hardware configurations of Device 2 are: CPU—Intel(R) Xeon(R) Gold [email protected]
x32 from Intel San Francisco, USA; GPU—NVIDIA Quadro P5000 from NVIDIA in Santa
Clara, USA; memory—128 GB from GALAXY Hong Kong, China.
Software configuration: The environment configurations of the two devices are the
same. The operating system is the 64-bit Ubuntu 18.04 LTS for both devices.
The driver version of the graphics card is: Nvidia-Linux-x64-450.80.02; CUDA version
is 10.0; PyTorch 1.3.1.
The PSNR is calculated as follows:

$$
\mathrm{MSE} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \big(X(i,j) - Y(i,j)\big)^2; \qquad
\mathrm{PSNR} = 10 \log_{10} \frac{(2^n - 1)^2}{\mathrm{MSE}} \tag{15}
$$
where MSE represents the mean square error between the current image X and the reference image Y, X(i, j) and Y(i, j) represent the pixel values at the corresponding coordinates, H and W are the height and width of the image, respectively, and n is the number of bits per pixel (generally 8). The unit of PSNR is dB; the larger the value, the smaller the MSE, the closer the two images, and thus the smaller the distortion.
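A direct NumPy implementation of Equation (15):

```python
import numpy as np

def psnr(x, y, n_bits=8):
    """Peak signal-to-noise ratio per Equation (15); x, y: same-shape image arrays."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')              # identical images: zero distortion
    peak = (2 ** n_bits - 1) ** 2        # (2^n - 1)^2, i.e., 255^2 for 8-bit images
    return 10 * np.log10(peak / mse)
```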
SSIM is structural similarity, an index that measures the similarity of two images. The calculation formula of SSIM is as follows:

$$
\begin{aligned}
L(X, Y) &= \frac{2 u_X u_Y + C_1}{u_X^2 + u_Y^2 + C_1};\\
C(X, Y) &= \frac{2 \sigma_X \sigma_Y + C_2}{\sigma_X^2 + \sigma_Y^2 + C_2};\\
S(X, Y) &= \frac{\sigma_{XY} + C_3}{\sigma_X \sigma_Y + C_3};\\
\mathrm{SSIM}(X, Y) &= L(X, Y)\cdot C(X, Y)\cdot S(X, Y)
\end{aligned} \tag{16}
$$

where $u_X$ and $u_Y$ represent the mean values of images X and Y, respectively; $\sigma_X$ and $\sigma_Y$ represent the standard deviations of images X and Y, respectively; and $\sigma_{XY}$ represents the covariance of images X and Y. $C_1$, $C_2$, and $C_3$ are constants, usually taken as $C_1 = (K_1 L)^2$, $C_2 = (K_2 L)^2$, and $C_3 = C_2/2$, with generally $K_1 = 0.01$, $K_2 = 0.03$, and $L = 255$.
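A global (single-window) implementation of Equation (16) follows; note that practical SSIM implementations additionally average the index over local sliding windows, an extra step not shown here.

```python
import numpy as np

def ssim(x, y, K1=0.01, K2=0.03, L=255):
    """Global SSIM per Equation (16); x, y: grayscale arrays of equal shape."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    C3 = C2 / 2
    ux, uy = x.mean(), y.mean()
    sx, sy = x.std(), y.std()
    sxy = ((x - ux) * (y - uy)).mean()                 # covariance
    l = (2 * ux * uy + C1) / (ux**2 + uy**2 + C1)      # luminance term L(X, Y)
    c = (2 * sx * sy + C2) / (sx**2 + sy**2 + C2)      # contrast term C(X, Y)
    s = (sxy + C3) / (sx * sy + C3)                    # structure term S(X, Y)
    return l * c * s
```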
The image after super-resolution reconstruction is shown in Figure 4. The reconstruction effect of the ESRGAN algorithm is the best, and its reconstructed image is closest to the original image. The average values of the objective evaluation indexes PSNR and SSIM after the super-resolution reconstruction of the various algorithms are calculated on the test set, as listed in Table 1. After comparison, the traditional interpolation algorithm has the worst effect, while the ESRGAN algorithm selected for this paper not only achieves the best objective evaluation indexes but also shows its superiority in sensory vision.
Figure 4. Comparison of the results of different super-resolution methods.
Table 1. Comparison of the objective evaluation indicators of different super-resolution methods.

Method    PSNR (dB)   SSIM
Bicubic   28.3841     0.7345
SRGAN     36.3838     0.8830
EDSR      36.4970     0.8841
ESRGAN    36.5556     0.8846
It may be concluded from the recognition accuracy of Table 2 that the recognition effect
is the best in the original high-definition image, while the traditional Bicubic interpolation
algorithm has the worst effect, with a 10% decline when compared to the original image.
No additional effective information was introduced in the process, the reconstruction effect
was poor, and the recognition rate was also the lowest. While several other deep learning-
based super-resolution algorithms rebuild images and improve the image resolution,
they also introduce external information for the image reconstruction so that the images
have more detailed information. The ESRGAN algorithm selected for this paper has the best performance in terms of both visual effects and objective evaluation indicators. The reconstructed remote sensing image has rich texture details, the edge information is more obvious, and the recognition rate is the highest among all the algorithms, only roughly 1.2% lower than that of the original high-definition image.
The remote sensing image recognition algorithm proposed in this paper effectively
solves the problem of the low recognition rate of low-resolution remote sensing images in
foggy scenes.
5. Discussion
This paper proposed a new method for target detection in low-resolution remote
sensing images in foggy weather. The low-resolution foggy remote sensing image was super-resolution reconstructed via the ESRGAN network, and the reconstructed super-resolution image was input into the trained detector model for recognition and classification.
After many experiments, this method improved the target recognition rate of low-resolution
remote sensing images by nearly 20%. The main contributions of this paper are as follows.
First, the application of image super-resolution reconstruction technology to the task of
target detection in remote sensing images has broadened the application range of image
super-resolution reconstruction technology. Furthermore, this research has realized the
recognition and detection of small and weak targets on low-resolution remote sensing
images under foggy conditions and achieved a very good detection effect. Finally, this
paper compared the different methods of image super-resolution reconstruction at this
stage, and ultimately selected the ESRGAN method as the best through many experiments,
which helps the target detection task of remote sensing images at low resolution. The
research undertaken in this paper has some benefit to the application of super-resolution
reconstruction technology in the field of target detection. In the past two years, the Transformer has shown advantages in processing computer vision tasks and has provided new research directions for the future super-resolution reconstruction of remote sensing images.
Author Contributions: Conceptualization, Y.W. and G.S.; methodology, Y.W.; software, Y.W.; valida-
tion, Y.W. and G.S.; formal analysis, Y.W.; investigation, Y.W. and S.G.; resources, G.S.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W., G.S. and S.G.;
visualization, Y.W.; supervision, Y.W. and G.S.; project administration, G.S.; funding acquisition, G.S.
All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by Heilongjiang University, grant number JM201911; and
Heilongjiang University, grant number YJSCX2021-176HLJU.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets were analyzed in this study. This data can
be found here: https://captain-whu.github.io/DOTA/dataset.html (accessed on 1 July 2021).
Acknowledgments: The authors acknowledge the support of Heilongjiang University.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Lawrence, S.; Giles, C.L.; Tsoi, A.C.; Back, A.D. Face Recognition: A Convolutional Neural-Network Approach. IEEE Trans. Neural
Netw. 1997, 8, 98–113.
2. Nebauer, C. Evaluation of Convolutional Neural Networks for Visual Recognition. IEEE Trans. Neural Netw. 1998, 9, 685–696.
3. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with
Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA,
USA, 7–12 June 2015.
4. Dong, C.; Loy, C.C.; He, K.; Tang, X. Part IV—Learning a Deep Convolutional Network for Image Super-Resolution. In
Proceedings of the Computer Vision—ECCV 2014—13th European Conference, Zurich, Switzerland, 6–12 September 2014.
5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016.
6. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video
Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016.
7. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.P.; Tejani, A.; Totz, J.; Wang, Z.; et al.
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the 2017 IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017.
8. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.C.; Bengio, Y. Generative Adversarial Networks.
Commun. ACM 2020, 63, 139–144.
9. Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the
2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016.
10. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In
Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015.
11. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on Imagenet Classification.
In Proceedings of the 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, 7–13 December 2015.
12. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings
of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI,
USA, 21–26 July 2017.
13. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. Part V—ESRGAN: Enhanced Super-Resolution Generative
Adversarial Networks. In Proceedings of the Computer Vision—ECCV 2018 Workshops, Munich, Germany, 8–14 September 2018.
14. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd
International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
15. Kwan, C.; Dao, M.; Chou, B.; Kwan, L.M.; Ayhan, B. Mastcam Image Enhancement Using Estimated Point Spread Functions. In
Proceedings of the 8th IEEE Annual Ubiquitous Computing, Electronics and Mobile Communication Conference, UEMCON,
New York, NY, USA, 19–21 October 2017.
16. Han, J.; Ding, J.; Xue, N.; Xia, G.S. ReDet: A Rotation-Equivariant Detector for Aerial Object Detection. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2021, Nashville, TN, USA, 19–25 June 2021.
17. Ding, J.; Xue, N.; Long, Y.; Xia, G.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019.
18. Yang, X.; Yan, J. Part VIII—Arbitrary-Oriented Object Detection with Circular Smooth Label. In Proceedings of the Computer
Vision—ECCV 2020—16th European Conference, Glasgow, UK, 23–28 August 2020.
19. Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-Oriented Scene Text Detection via Rotation Proposals.
IEEE Trans. Multimed. 2018, 20, 3111–3122.
20. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object
Detection in Aerial Images. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR
2018, Salt Lake City, UT, USA, 18–22 June 2018.
21. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal.
Mach. Intell. 2015, 38, 295–307.
22. Bruna, J.; Sprechmann, P.; LeCun, Y. Super-Resolution with Deep Convolutional Sufficient Statistics. In Proceedings of the 4th
International Conference on Learning Representations, ICLR 2016, San Juan, PR, USA, 2–4 May 2016.
23. Johnson, J.; Alahi, A.; Fei-Fei, L. Part II—Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the
Computer Vision—ECCV 2016—14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016.
24. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.
In Proceedings of the 4th International Conference on Learning Representations, ICLR 2016, San Juan, PR, USA, 2–4 May 2016.
25. Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution.
In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA,
21–26 July 2017.
26. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connec-
tions on Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA,
4–9 February 2017.
27. Lin, T.-Y.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature Pyramid Networks for Object Detection.
In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA,
21–26 July 2017.
28. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
29. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on
Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
30. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An
Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing
Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada,
8–14 December 2019.
31. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection
Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155.
32. Tai, Y.; Yang, J.; Liu, X. Image Super-Resolution via Deep Recursive Residual Network. In Proceedings of the 2017 IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017.