
Received: August 3, 2019. Revised: October 22, 2019.


Object Detection Using Adaptive Mask RCNN in Optical Remote Sensing Images

Amira S. Mahmoud1*, Sayed A. Mohamed1, Reda A. El-Khoribi2, Hisham M. AbdelSalam2

1 National Authority for Remote Sensing and Space Science, Cairo, Egypt
2 Faculty of Computers and Information, Cairo University, Giza, Egypt
* Corresponding author’s Email: [email protected]

Abstract: Fast and automatic object detection in remote sensing images is a critical and challenging task for civilian and military applications. Recently, deep learning approaches were introduced to overcome the limitations of traditional object detection methods. In this paper, an adaptive mask Region-based Convolutional Network (Mask-RCNN) is utilized for multi-class object detection in remote sensing images. Transfer learning, data augmentation, and fine-tuning were adopted to overcome object scale variability, small object size, high object density, and the scarcity of annotated remote sensing images. In addition, five optimization methods were investigated, namely Adaptive Moment Estimation (Adam), stochastic gradient descent (SGD), the adaptive learning rate method (Adadelta), Root Mean Square Propagation (RMSprop), and hybrid optimization. In hybrid optimization, the training process begins with Adam and then switches to SGD when appropriate, and vice versa. The behaviour of the adaptive Mask RCNN was also compared to baseline deep object detection methods. Several experiments were conducted on the challenging NWPU VHR-10 dataset. The hybrid method Adam_SGD achieved the highest average precision, 95%. Experimental results showed a boost in detection performance of up to 6% in terms of accuracy and intersection over union (IOU).
Keywords: Object detection, Deep learning, Mask RCNN, Adam, SGD, RMSprop.

1. Introduction

Object detection is a complex, multi-objective problem involving the classification and localization of single or multiple objects in an image [1]. In the remote sensing domain, object detection becomes even more complicated due to the complex nature of remote sensing images. The term object may include both sharp boundaries (man-made objects) and vague boundaries fused with the background (landscape) [2]. Object detection has a comprehensive range of applications such as robot vision, face recognition, content-based image retrieval, military applications, and pedestrian detection [1]. Very high-resolution satellite images capture detailed information about the size, shape, texture, and topology of objects on Earth, in addition to a complicated background, a variety of illumination intensities, the influence of weather, and noise. Recently, object detection has become a hot research topic in the remote sensing domain [3]. Object detection methods can be divided into four main groups [3]: 1) template matching based methods, which can be subdivided into rigid template and deformable template-based methods [3]; 2) knowledge-based methods [4], which are divided into geometric knowledge and context information [5]; 3) Object Based Image Analysis (OBIA) based methods, which require two main steps, object segmentation and object classification [6]; and 4) machine learning based methods, which consist of two main stages. The first stage is feature extraction using feature engineering methods [7] such as histogram of oriented gradients (HOG) [8], Bag of Words (BOW) [9], and sparse representation; a human activity recognition (HAR) approach was adopted in [10]. A group of these features may be used, and different feature reduction methods can also be utilized to improve the feature selection stage. In the second stage, a classifier is trained using these features. The widely used classifiers include support vector machines [10], AdaBoost [11], and artificial neural networks [12].


Recently, deep learning algorithms have shown their superiority in feature representation tasks across the computer vision and remote sensing domains. The recent evolution of deep learning (DL) in detecting complicated patterns in big remote sensing imagery reveals its high potential to address various challenges such as the complexity of satellite images, the lack of training datasets, multi-sensor data, complex backgrounds, and atmospheric conditions. These are the primary challenges to achieving robust automatic object detection using deep learning.

The Region-based Convolutional Network (R-CNN) [18] achieved excellent object detection accuracy by using a very deep CNN to classify object proposals. R-CNN has notable drawbacks such as multi-stage pipeline training, extensive training time and space, and slow detection. An enhancement was introduced by Spatial Pyramid Pooling Networks (SPPnets) [19], which share convolutions across proposals to limit training time cost. Fast RCNN [20] operates in a single stage with a multi-task loss during the training phase. This enhancement limits the required storage space and improves accuracy, but region proposal computation is still considered the main bottleneck. To overcome this problem, Ren et al. introduced an additional region proposal network (RPN) [13] that replaced selective search for region proposal generation, thereby combining region proposal, classification, and localization regression to improve speed and accuracy; it is, however, still too slow for real-time detection. Another approach to overcoming the time consumed in the region selection step is to directly predict confidences for both classification and localization bounding boxes. YOLO [14] achieved real-time performance by computing a single loss. YOLOv2 [15] is an enhancement that provides a smooth trade-off between speed and accuracy. The SSD method [16] achieved significantly more accurate performance than YOLO by adding a feature map at each scale. The YOLO versions and SSD methods struggle with small objects within the image due to the spatial constraints of the algorithms. R-FCN [17, 32] is a two-stage object detector that applies position-sensitive ROI-pooling to tackle the dilemma between translation invariance in classification and translation variance in localization; however, it is less accurate than Faster R-CNN.

In [12], a deep neural network was utilized for the ship detection task in optical images. Various augmentation methods, such as rotation, scaling, and illumination conditions, were adopted to enhance the learning procedure. In [22], Pan et al. utilized a cascade convolutional neural network (CCNN) framework based on transfer learning and geometric feature constraints (GFC) to improve the accuracy of aircraft detection; the detection accuracy increased by an average of 3.66%. In [23], an enhancement of Faster R-CNN was introduced to detect densely packed objects in satellite images. Extensive experiments were conducted to evaluate the effectiveness of the proposed method in terms of accuracy and IOU, and the results showed its effectiveness.

In [24], AlexNet was adopted to extract generic features for the ship detection task in very high-resolution images; the proposed method outperforms You Only Look Once (YOLO) and SSD in terms of accuracy and IOU. Moreover, Nie et al. [25] proposed a novel framework based on Mask R-CNN for the inshore ship detection task and adopted Soft Non-Maximum Suppression (Soft-NMS) to improve the robustness and efficiency of the proposed method. In [26], a three-stage framework for object detection was proposed. In the first stage, a sliding window technique is utilized to generate candidate region proposals. Next, AlexNet and GoogleNet were chosen to extract generic image features from each region proposal. Finally, an unsupervised score-based bounding box regression (USB-BBR) algorithm was proposed to optimize the bounding box of the detected object; the results of the framework surpass other methods in terms of accuracy and IOU quality under complex backgrounds. Inspired by Faster-RCNN, Li et al. [27] used a region proposal network to generate translation-invariant and multi-scale candidate regions. Next, a local-contextual feature fusion network was used to form a discriminative joint representation (local and contextual features) for each candidate region, and finally accurate classification and object localization were implemented. In [28], Cheng et al. presented a two-stage approach based on Faster R-CNN, namely the deep adaptive proposal network (DAPNet). The input image is fed to the backbone network to generate the high-level feature representation of the image; then the category prior network (CPN) sub-network and the fine-region proposal network (F-RPN) use these high-level features to obtain the category prior information and candidate regions for each image, respectively. Both results are combined to achieve an adaptive region proposal, and an accurate detection sub-network is finally used for classification and regression of each adaptive candidate box. Several experiments were carried out on the public NWPU VHR dataset to evaluate the proposed approach, and the results show its superiority. Ammour et al. [29] proposed a car detection method in unmanned aerial vehicle (UAV) images.

A mean-shift algorithm was used to segment the UAV input image into small homogeneous regions. Then, a pre-trained VGG-16 was adopted to extract a generic feature for each segment. Finally, a linear support vector machine (SVM) classifier was adopted to map each segment into "car" or "no car". The proposed method outperformed state-of-the-art methods in terms of both accuracy and computational time. To overcome the limited accuracy of traditional ship detection methods, Yang et al. [30] proposed an approach called the Rotation Dense Feature Pyramid Network (R-DFPN). The method has two stages: a Dense Feature Pyramid Network (DFPN) for feature fusion and a Rotation Region Detection Network (RDN) for prediction. Comprehensive evaluations on remote sensing images extracted from Google Earth for ship detection demonstrated the superiority of the proposed method. In [31], Cheng et al. proposed an effective approach to learn a rotation-invariant CNN model. First, a new rotation-invariant layer was trained by optimizing a new objective function that imposes a regularization constraint; then the whole CNN network was fine-tuned to boost the performance further. The method was evaluated on the public NWPU VHR dataset, and the results demonstrated its effectiveness.

To address the problem investigated in this paper, we utilize Mask-RCNN to boost object detection accuracy in the remote sensing domain. The main contribution of this paper is an adaptive Mask RCNN framework for detecting multi-scale objects in optical remote sensing images. The proposed adaptive Mask RCNN efficiently reduces the redundancy of detector boxes and handles multi-scale targets under complex background images. Transfer learning and fine-tuning were adopted to overcome the scarcity and complexity of remote sensing images. The paper also studies the behaviour of the adaptive Mask RCNN under baseline optimization methods, namely Adam, SGD, Adadelta, RMSprop, hybrid SGD_Adam, and hybrid Adam_SGD, and compares the adaptive Mask RCNN with baseline object detection methods: the Faster RCNN (FRCN) method [13], the You Only Look Once (YOLO) method [14], the YOLO2 method [15], the Single Shot Multibox Detector (SSD) method [16], and the Region-based Fully Convolutional Network (R-FCN) [17]. All experiments were conducted on the publicly available 10-class geospatial object NWPU VHR-10 dataset [33].

The remainder of this paper is organized as follows. The proposed adaptive Mask R-CNN is presented in Section 2, experimental results and discussion are introduced in Section 3, and Section 4 draws the conclusion.

2. Proposed method

In recent years, deep learning techniques have achieved state-of-the-art results for object detection on standard benchmarks. Mask R-CNN outperformed other deep learning object detection models and won the COCO object detection challenge in 2016. However, the performance of Mask R-CNN in the remote sensing domain hardly achieves comparable results due to the complex nature of satellite images, the lack of annotated samples, and varied object scales. This work studies the behavior of different optimization methods and a hybrid training strategy that starts with an adaptive method (Adam) and then switches to SGD (SWATS), and vice versa.

Mask-RCNN [33] was introduced by He et al. in 2018 as an extension of Faster RCNN [13] that allows accurate pixel-based segmentation. It consists of two main stages, namely a Feature Pyramid Network (FPN) and a Region Proposal Network (RPN). In the feature pyramid network, a number of proposals are generated for the regions where there might be an object, based on the input image. First, we utilize a standard convolutional neural network to serve as a feature extractor. The state-of-the-art architectures AlexNet, VGG Net, and GoogleNet have 5, 19, and 22 layers, respectively. As networks get deeper they suffer from the vanishing gradient problem, which results in performance saturation or even rapid degradation. Several attempts [34] have been introduced to overcome the vanishing gradient problem. Based on the residual block, the ResNet50 architecture was first introduced in [35]: a skip connection, or shortcut, takes the activation from one layer and feeds it to another layer about 2-3 hops away. ResNet50 has become a seminal architecture for different computer vision applications. In this paper, we use an architecture pre-trained on the ImageNet (1000-class) dataset. Generally, the size of this recent model is substantially smaller due to the usage of global average pooling rather than fully-connected layers. We choose ResNet50 as the feature extractor network, which encodes the input image into a 32x32x2048 feature map. The FPN extracts regions of interest from features at different levels according to the size of the feature, and these are fed as input to the next stage, the RPN.

Figure 1. The proposed object detection method for optical remote sensing images. Training phase: the satellite image dataset is split into 70% training and 30% test sets, input images are resized to 1024x1024 and augmented (rotation/flipping), and Mask-RCNN with a ResNet-50 backbone pre-trained on ImageNet is fine-tuned. Testing phase: the trained model outputs a class label, mask, and bounding box.

In the Region Proposal Network (RPN), the regions are scanned individually and the network predicts whether or not an object is present. The actual input image is never scanned by the RPN; instead, the RPN scans the feature map, which makes it much faster. Next, each of the regions of interest proposed by the RPN is taken as input, and a classification (SoftMax) and a bounding box (regressor) are output. Finally, Mask-RCNN adds a new branch that outputs a binary mask indicating whether or not a given pixel is part of an object. This added branch is a Fully Convolutional Network on top of the backbone architecture. The proposed method consists of two main phases, training and testing, as illustrated in Fig. 1.
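As a concrete illustration of the backbone and proposal stage described above, the following sketch builds a ResNet50 feature extractor pre-trained on ImageNet that maps a 1024x1024 input to a 32x32x2048 feature map and attaches a small RPN-style convolutional head scoring k anchors per location. It is a hedged example under assumed design choices (the value of k, the 512-channel shared layer, and the head layout are not specified in the paper), not the authors' released code.

```python
import tensorflow as tf

# ResNet50 backbone pre-trained on ImageNet; include_top=False drops the
# 1000-class classifier head so the network acts as a feature extractor.
backbone = tf.keras.applications.ResNet50(
    include_top=False,
    weights="imagenet",
    input_shape=(1024, 1024, 3),
)
features = backbone.output          # shape (None, 32, 32, 2048)

k = 9                               # anchors per spatial location (assumed value)
shared = tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu")(features)
objectness = tf.keras.layers.Conv2D(k, 1, activation="sigmoid")(shared)  # object vs. background score
box_deltas = tf.keras.layers.Conv2D(4 * k, 1)(shared)                    # box regression targets

rpn = tf.keras.Model(backbone.input, [objectness, box_deltas])
```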
2.1 Loss function

Mask R-CNN utilizes a multi-task loss function that combines the losses of classification, localization, and segmentation mask, as illustrated in Eq. (1):

L = L_{cls} + L_{bbox} + L_{mask}    (1)

where L_{cls} and L_{bbox} are the same as in Faster R-CNN [13]. The added mask loss L_{mask}, given in Eq. (2), is the average binary cross-entropy that only includes the k-th mask if the region is associated with the ground-truth class k:

L_{mask} = -\frac{1}{m^2} \sum_{1 \le i,j \le m} \left[ y_{ij} \log \hat{y}_{ij}^{k} + (1 - y_{ij}) \log\left(1 - \hat{y}_{ij}^{k}\right) \right]    (2)

where the mask branch generates a mask of dimension m x m for each RoI and each class, and y_{ij} and \hat{y}_{ij}^{k} denote the label of cell (i, j) in the true mask and the predicted value, respectively.
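For reference, a minimal sketch of the per-RoI mask term of Eq. (2) might look as follows. The epsilon clipping is an implementation detail added here for numerical stability and is not part of the paper.

```python
import tensorflow as tf

def mask_loss(y_true, y_pred_k):
    """Average binary cross-entropy mask loss of Eq. (2).

    y_true:   (m, m) ground-truth binary mask of the RoI.
    y_pred_k: (m, m) predicted mask probabilities for the ground-truth class k;
              masks predicted for other classes do not contribute to the loss.
    """
    eps = 1e-7  # numerical-stability clipping (assumption, not in the paper)
    y_true = tf.cast(y_true, tf.float32)
    y_pred_k = tf.clip_by_value(tf.cast(y_pred_k, tf.float32), eps, 1.0 - eps)
    bce = y_true * tf.math.log(y_pred_k) + (1.0 - y_true) * tf.math.log(1.0 - y_pred_k)
    return -tf.reduce_mean(bce)     # mean over the m*m cells, i.e. the (1/m^2) sum
```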
2.2 Training phase

Mask-RCNN requires a large amount of annotated data for training to avoid overfitting. To overcome the problem of limited annotated datasets in the remote sensing domain, we adopted transfer learning by selecting the pre-trained network weights of the ResNet50 model, which was successfully trained on the ImageNet dataset [36]. We utilized the pre-trained ResNet50 and fine-tuned the network weights on the NWPU VHR dataset. Due to limited memory, we consider three different strategies in fine-tuning. In the first strategy, we train the head layers for 30 epochs while freezing the other layers, with a learning rate of 0.1. Second, the convolution layers (5+) and (4+) were trained for 30 epochs each, using learning rates of 0.01 and 0.001, respectively. Finally, the convolution layers (3+) were trained for 400 epochs with a learning rate of 0.001. We used different augmentation methods such as horizontal flip, vertical flip, image rotation, and image translation to enlarge the training data. One can observe that this domain-specific fine-tuning allows learning good network weights for a high-capacity CNN on the NWPU VHR dataset.
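A minimal sketch of this staged fine-tuning schedule is given below, assuming a Keras-style model with a ResNet50 backbone. The placeholder classification head, the SGD optimizer used for illustration, and the layer indices used to approximate the conv5+/conv4+/conv3+ boundaries are assumptions, not the authors' implementation.

```python
import tensorflow as tf

# Backbone plus a simple placeholder head; the real model carries RPN/box/mask heads.
backbone = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                           input_shape=(1024, 1024, 3))
model = tf.keras.Sequential([backbone,
                             tf.keras.layers.GlobalAveragePooling2D(),
                             tf.keras.layers.Dense(10, activation="softmax")])

schedule = [
    # (first trainable backbone layer index, or None for "heads only"; epochs; learning rate)
    (None, 30, 0.1),    # stage 1: train heads, backbone frozen
    (143,  30, 0.01),   # stage 2a: roughly conv5_* upwards (index is an assumption)
    (81,   30, 0.001),  # stage 2b: roughly conv4_* upwards (index is an assumption)
    (39,  400, 0.001),  # stage 3: roughly conv3_* upwards, the longest stage
]

for first_trainable, epochs, lr in schedule:
    backbone.trainable = first_trainable is not None
    if first_trainable is not None:
        for layer in backbone.layers[:first_trainable]:
            layer.trainable = False
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr, momentum=0.9),
                  loss="sparse_categorical_crossentropy")
    # model.fit(train_ds, epochs=epochs)  # train_ds: augmented NWPU VHR-10 batches (not shown)
```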

International Journal of Intelligent Engineering and Systems, Vol.13, No.1, 2020 DOI: 10.22266/ijies2020.0229.07
Received: August 3, 2019. Revised: October 22, 2019. 69

2.3 Testing phase

The learned model is used directly to predict the class label, bounding box, and mask segment for each image in the testing data. To evaluate the learned model's performance, the predicted labels and bounding boxes are matched with those in the dataset.
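One simple way to perform this matching is to pair each predicted box with an unmatched ground-truth box of the same class whose intersection over union (IOU) exceeds a threshold. The greedy routine below is a sketch of such a procedure, not the authors' evaluation code; the 0.5 default threshold is an assumption (Section 3 sweeps several thresholds).

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_detections(pred_boxes, pred_labels, gt_boxes, gt_labels, iou_thr=0.5):
    """Greedily match predictions to ground truth of the same class; returns tp, fp, fn."""
    matched, tp = set(), 0
    for p_box, p_lab in zip(pred_boxes, pred_labels):
        best, best_iou = None, iou_thr
        for g, (g_box, g_lab) in enumerate(zip(gt_boxes, gt_labels)):
            if g in matched or g_lab != p_lab:
                continue
            overlap = iou(p_box, g_box)
            if overlap >= best_iou:
                best, best_iou = g, overlap
        if best is not None:
            matched.add(best)
            tp += 1
    return tp, len(pred_boxes) - tp, len(gt_boxes) - len(matched)
```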
their performance of the hybrid approach in object In our work, the total number of objects in the NWPU
detection in remote sensing domain. We conducted VHR-10 data set is divided into 70% and 30% for
several experiments to investigate the triggering training and testing in class level. Fig. 2 presents the
condition to switch between Adam and SGD. The statistics of the total number of objects in each class
triggering condition includes the number of epochs used in both training and testing. Overall, it can be
and value of learning rate. The optimal triggering seen that the 10- classes included in NWPU dataset
condition in object detection was to set the learning are not equally distributed in terms of the number of
rate to 0.001 or epochs achieved 400. images or objects.
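The sketch below illustrates one way this hybrid Adam-to-SGD strategy could be wired into a Keras-style training loop; the reverse ordering (SGD_Adam) is analogous. The `model`, `train_ds`, and `loss` objects are placeholders, and the multiplicative learning-rate decay toward 0.001 is an assumption; only the switching trigger comes from the text.

```python
import tensorflow as tf

def train_with_swats(model, train_ds, loss, total_epochs=400, start_lr=0.01):
    """Hybrid Adam -> SGD (SWATS-style) training loop: a sketch, not the authors' code."""
    lr = start_lr
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), loss=loss)
    for epoch in range(total_epochs):
        model.fit(train_ds, epochs=1, verbose=0)
        lr = max(lr * 0.99, 0.001)              # assumed decay schedule toward 0.001
        model.optimizer.learning_rate = lr
        # Trigger from the text: switch once the learning rate reaches 0.001
        # (the 400-epoch cap is the loop bound itself).
        if lr <= 0.001 and not isinstance(model.optimizer, tf.keras.optimizers.SGD):
            model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr, momentum=0.9),
                          loss=loss)
```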

3. Results

In this paper, the NWPU VHR-10 geospatial object detection dataset is used to evaluate the performance of our proposed method. We describe the dataset and the evaluation metrics in Sections 3.1 and 3.2, respectively. The implementation details of the proposed method are presented in Section 3.3. Finally, the proposed adaptive Mask RCNN method is compared with other state-of-the-art deep object detection methods, and the results and numerical analysis are presented in Section 3.4.

3.1 Dataset

The NWPU VHR-10 dataset [2-4] is one of the pioneering works in the remote sensing object detection field, designed to provide a standard dataset for multi-class object detection in remote sensing images. The dataset was cropped from Google Earth and then manually annotated by experts; it contains ten classes of objects, namely airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle, as shown in Fig. 2.

In our work, the total number of objects in the NWPU VHR-10 dataset is divided into 70% for training and 30% for testing at the class level. Fig. 2 presents the statistics of the total number of objects in each class used in training and testing. Overall, it can be seen that the 10 classes included in the NWPU dataset are not equally distributed in terms of the number of images or objects.

Figure 2. Statistics of the total number of objects of each category used in training and testing in the NWPU VHR-10 dataset
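A per-class 70/30 split of this kind can be sketched as follows; the `annotations` list of (image_id, class_name) records and the fixed random seed are assumptions about the bookkeeping format, not the NWPU VHR-10 distribution files.

```python
import random
from collections import defaultdict

def split_per_class(annotations, train_frac=0.7, seed=0):
    """Split object annotations 70/30 independently within each class."""
    random.seed(seed)
    by_class = defaultdict(list)
    for image_id, class_name in annotations:
        by_class[class_name].append((image_id, class_name))
    train, test = [], []
    for class_name, items in by_class.items():
        random.shuffle(items)
        cut = int(round(train_frac * len(items)))
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test
```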

International Journal of Intelligent Engineering and Systems, Vol.13, No.1, 2020 DOI: 10.22266/ijies2020.0229.07
Received: August 3, 2019. Revised: October 22, 2019. 70

3.2 Evaluation metrics

Two evaluation metrics were used to evaluate the proposed object detection method: the Average Precision (AP) [42] and the precision-recall curve (PRC). Precision measures the fraction of detections that are true positives, as illustrated in Eq. (3), and recall measures the fraction of positives that are correctly identified, as illustrated in Eq. (4). The AP metric measures the area under the PRC: the higher the AP value, the better the performance, and vice versa. The precision and recall indicators are formulated as follows:

precision = \frac{tp}{tp + fp}    (3)

recall = \frac{tp}{tp + fn}    (4)

where tp denotes true positives, fp false positives, and fn false negatives.
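The two metrics, and an area-under-the-curve approximation of AP, can be sketched as follows; the trapezoidal integration is one common convention and is an assumption here, since the paper does not state which AP variant it uses.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Eqs. (3) and (4)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def average_precision(precisions, recalls):
    """Area under the precision-recall curve, integrated over recall."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order]))
```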
3.3 Implementation details

We randomly selected 500 images from the positive image set as training images. The remaining 150 images were used to evaluate the performance of the proposed object detection method. Owing to the limited size of the training set, different data augmentations were adopted, such as rotation and horizontal and vertical flipping, to expand the number of samples. The augmented images act as a representation of target rotation, lighting changes, and the variety of sensors. We conducted our experiments using an NVIDIA GeForce GTX 1080 Ti with 11 GB of memory to considerably speed up the deep learning training computations. TensorFlow [43] was selected as the implementation framework.
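Since TensorFlow is the implementation framework, the augmentations listed above can be sketched with its image ops as below; note that in a detection setting the box and mask annotations must be transformed consistently with the image, which this fragment omits.

```python
import tensorflow as tf

def augment(image):
    """Random horizontal/vertical flips and 90-degree rotations for a single image tensor."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    k = tf.random.uniform([], 0, 4, dtype=tf.int32)   # 0-3 quarter turns
    return tf.image.rot90(image, k)
```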
3.4 Results

In this section, we conducted several experiments to evaluate the performance of the proposed method in terms of average precision, computation time, Intersection over Union (IOU), and precision-recall curves (PRC).

First, we compare the performance of the proposed method against the deep learning baseline object detection techniques, namely the Faster Region-based Convolutional Network (FRCN) method [13], You Only Look Once (YOLO1) [14], YOLO2 [15], the Single Shot Multibox Detector (SSD) method [16], and the Region-based Fully Convolutional Network (R-FCN) [17], in terms of average precision and computation time.

Table 1. Performance of YOLO, Faster RCNN, SSD, R-FCN, and the proposed method on the NWPU dataset in terms of AP percentage values and average running time in seconds per image

Class | FRCN [13] | YOLO1 [14] | YOLO2 [15] | SSD [16] | R-FCN [17] | Proposed Method
Airplane | 82.8 | 60.8 | 87.3 | 95.7 | 96.1 | 99.9
Ship | 77.5 | 62.7 | 84.7 | 93.6 | 98.3 | 92.7
Storage tank | 52.5 | 28.7 | 42.7 | 60.9 | 72.5 | 94.5
Baseball diamond | 96.3 | 85.7 | 93.1 | 99.4 | 99.4 | 99.5
Tennis court | 62.9 | 58.4 | 65.7 | 87.7 | 90.7 | 97.3
Basketball | 68.8 | 82.2 | 85.5 | 92 | 97.8 | 88.9
Ground track field | 98.4 | 88.7 | 97.1 | 98.6 | 99.3 | 93.8
Harbor | 82.5 | 75 | 80.5 | 94.6 | 92.5 | 95.9
Bridge | 78.8 | 72.5 | 90 | 97.0 | 93.4 | 95.8
Vehicle | 63.8 | 52.3 | 70.8 | 74.5 | 88.4 | 91.6
Mean AP | 76.4 | 66.7 | 79.7 | 89.4 | 92.8 | 95
Time (s) | 6.21 | 3.36 | 4.24 | 5.72 | 4.32 | 7.1

Table 1 shows the obtained results measured by AP values for each class of the NWPU dataset, together with the average running time per image; the best performances are highlighted in bold. One can observe that YOLO2 and SSD achieved comparable performance, 79.7% and 89.4%, respectively. However, YOLO is slightly faster than the other techniques. R-FCN has the highest mean AP value (92.8%) among the baselines. Our proposed method outperforms the other techniques and boosts performance by 6%, achieving a better trade-off between detection accuracy and speed. However, although our method achieved the best performance in terms of AP, the detection accuracy for the object categories of basketball court and vehicle is still relatively low. The proposed method achieved the best performance in terms of mean AP but the lowest in terms of speed, and it still needs to be improved to achieve real-time performance. The balance between computational complexity and performance remains a big challenge; in the future, we would like to investigate other approaches with lower computational burden and system complexity.

Second, we conducted several experiments to evaluate six optimization methods for the remote sensing object detection task: four optimization techniques (Adam, Adadelta, RMSprop, and SGD) and two hybrid techniques (Adam_SGD and SGD_Adam) were tested in terms of IOU. The recall rates of these optimization techniques under different IOU thresholds are plotted in Fig. 3.

Figure 3. Recall vs. IOU overlap ratio on the NWPU VHR-10 dataset for the airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbour, bridge, and vehicle classes, respectively


It can be observed that: (1) the recall curves decline with increasing IoU thresholds; in particular, the recall of the Adadelta and SGD optimizers decreases more quickly than that of the other techniques, which demonstrates their limited performance for object detection in the remote sensing domain; (2) for object classes such as basketball court, ground track field, and harbor, the recall of the different optimization techniques is higher than for the other object classes, because small objects with a complex background are harder to detect; (3) the hybrid optimization Adam_SGD achieves the highest recall rate compared with the other optimization techniques, while the remaining optimization techniques have comparable performance. Overall, our proposed method achieved higher recall for small objects, which is vital for object detection in the remote sensing domain due to the differing resolutions of satellite data.

Third, to evaluate the quality of the proposed method for each class, Fig. 4 displays the precision-recall curves (PRCs) of the six aforementioned optimization techniques. For better comparison and visualization, we plot (1 - recall) on the X-axis and (1 - precision) on the Y-axis. As can be seen: (1) all optimization techniques achieve superb performance for the object categories of airplane and baseball diamond, whereas for the other eight object categories the PRCs of the different optimization techniques vary; this is because both of these classes have relatively larger training sample counts and sizes; (2) hybrid Adam_SGD and hybrid SGD_Adam have higher precision than Adam and SGD, respectively, which demonstrates that the hybrid optimization method can boost detection performance; (3) the Adadelta optimization method is not favoured for object detection, while the other optimization techniques achieve comparable performance. Overall, the hybrid optimization (SWATS) is very effective for object detection in remote sensing images.

Figure 4. Precision and recall on the NWPU VHR-10 dataset for the airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbour, bridge, and vehicle classes, respectively

Finally, Table 2 shows the quantitative comparison results of the six aforementioned optimization methods in terms of AP values and average running time per image. The best performances are highlighted in bold. The Adadelta optimizer struggles with small-size objects, whereas the hybrid optimization method (Adam_SGD) is more effective for detecting small objects in remote sensing images. The hybrid Adam_SGD method attained the highest AP values for most object classes, which shows that our method is effective for detecting objects of various sizes. Adam and hybrid SGD_Adam obtained detection performance comparable to hybrid Adam_SGD. A gain in performance of up to 6% in terms of mean AP was obtained, which illustrates that switching from Adam to SGD can effectively improve the generalization of object detection in the remote sensing domain. Compared with SGD and Adadelta, hybrid Adam_SGD achieved up to 6% and 40% performance gains in terms of mean AP, respectively.

Table 2. Performance of the six optimization techniques in terms of AP percentage values and average running time per image

Class | Adam | SGD | RMSprop | Adadelta | SGD_Adam | Adam_SGD
Airplane | 99.0 | 97.6 | 97.2 | 82.8 | 100 | 99.9
Ship | 91 | 81.7 | 84.3 | 50 | 89.8 | 92.7
Storage tank | 93.3 | 87.6 | 94.3 | 43.5 | 96.9 | 94.5
Baseball diamond | 96.6 | 97.1 | 95.8 | 89.2 | 98.4 | 99.5
Tennis court | 95.3 | 83.2 | 87.8 | 51.2 | 96.8 | 97.3
Basketball | 87.6 | 72.8 | 81.2 | 9.4 | 87.7 | 88.9
Ground track field | 87.7 | 90.9 | 92.2 | 77.4 | 93.9 | 93.8
Harbor | 93.8 | 95.5 | 85.8 | 16 | 93.6 | 95.9
Bridge | 79.6 | 95.3 | 74.3 | 8.3 | 76.3 | 95.8
Vehicle | 83.5 | 75.6 | 79.8 | 58.5 | 79.2 | 91.6
Mean AP | 90.8 | 87.7 | 87.3 | 48.6 | 91.2 | 95.0

Fig. 5 shows sample object detection results of the proposed approach. Some objects, such as storage tanks and tennis courts, are densely packed; vehicles and ships are small with a complex background; and ground track fields have a large size. The proposed method successfully detected most of these objects, demonstrating the effectiveness of our method. As can be seen from Table 2, Adam and hybrid SGD_Adam achieve near 100% AP for airplane and baseball diamond, but their AP values degenerate for the smaller classes. This is mainly because the small size of the object leads to limited feature representation for accurate object detection.

Figure 5. Samples of object detection results with the proposed approach

4. Conclusion

This paper proposed an adaptive Mask RCNN approach for detecting multi-class objects in remote sensing images. We utilized transfer learning, fine-tuning, and augmentation techniques such as rotation, scaling, and illumination conditions to overcome the insufficiency of labeled remote sensing imagery. The paper also draws a comparison between the proposed method and baseline deep object detection techniques in terms of average precision, computation time, Intersection over Union (IOU), and precision-recall curves (PRC). Numerous experiments were conducted using the challenging multi-class NWPU VHR-10 dataset, which was split into 70% and 30% for training and testing, respectively. Several experiments were also performed to evaluate the effectiveness of the optimization techniques, namely Adam, SGD, Adadelta, RMSprop, hybrid SGD_Adam, and hybrid Adam_SGD.

Analyzing the results, the proposed method outperforms the other baseline object detection methods and boosts the performance by 6% in terms of AP. In terms of IOU and PRCs, the results obtained from all optimization techniques show superb performance for the object categories of airplane and baseball diamond, whereas for the other eight object categories the results vary; this is because both of these classes have relatively larger training sample counts and sizes. The AP metric measures the area under the PRC: the higher the AP value, the better the performance, and vice versa. The average precision (AP) results of the optimization techniques Adam, SGD, RMSprop, Adadelta, hybrid SGD_Adam, and hybrid Adam_SGD were 90.8%, 87.7%, 87.3%, 48.6%, 91.2%, and 95%, respectively. The proposed adaptive Mask RCNN firstly outperformed other deep learning methods and achieved the highest accuracy in terms of IOU and PRC by utilizing the switch between optimizers, SWATS (switch from Adam to SGD), in the training phase, compared with utilizing the default optimizer (SGD) in the other methods. Secondly, SWATS achieved verified high accuracy while reducing computation time and cost. Hence, in our future work, we intend to implement an ensemble of heterogeneous object detection approaches, in addition to incorporating a multi-GPU configuration to further reduce the computation time.

References
[1] G. Cheng and J. Han, "A survey on object detection in optical remote sensing images", ISPRS Journal of Photogrammetry and Remote Sensing, Vol.117, pp. 11-28, 2016.
[2] T.R. Martha, N. Kerle, C.J. Westen, V. Jetten, and K.V. Kumar, "Segment optimization and data-driven thresholding for knowledge-based landslide detection by object-based image analysis", IEEE Transactions on Geoscience and Remote Sensing, Vol.49, No.12, pp. 4928-4943, 2011.
[3] D. Chaudhuri, N.K. Kushwaha, and A. Samal, "Semi-automated road detection from high resolution satellite images by directional morphological enhancement and segmentation techniques", IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol.5, No.5, pp. 1538-1544, 2012.
[4] A.O. Ok, "Automated detection of buildings from single VHR multispectral images using shadow information and graph cuts", ISPRS Journal of Photogrammetry and Remote Sensing, Vol.86, pp. 21-40, 2013.
[5] T. Blaschke, G.J. Hay, M. Kelly, S. Lang, P. Hofmann, E. Addink, R.Q. Feitosa, F.V. Meer, H.V. Werff, F.V. Coillie, and D. Tiede, "Geographic object-based image analysis - towards a new paradigm", ISPRS Journal of Photogrammetry and Remote Sensing, Vol.87, pp. 180-191, 2014.
[6] Y. Li, S. Wang, Q. Tian, and X. Ding, "Feature representation for statistical-learning-based object detection: A review", Pattern Recognition, Vol.48, No.11, pp. 3542-3559, 2015.
[7] G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren, "Effective and efficient midlevel visual elements-oriented land-use classification using VHR remote sensing images", IEEE Transactions on Geoscience and Remote Sensing, Vol.53, No.8, pp. 4238-4249, 2015.
[8] D. Zhang, J. Han, G. Cheng, Z. Liu, S. Bu, and L. Guo, "Weakly supervised learning for target detection in remote sensing images", IEEE Geoscience and Remote Sensing Letters, Vol.12, No.4, pp. 701-705, 2015.

[9] N. Yokoya and A. Iwasaki, "Object detection based on sparse representation and Hough voting for optical remote sensing imagery", IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol.8, No.5, pp. 2053-2062, 2015.
[10] G. Mountrakis, J. Im, and C. Ogole, "Support vector machines in remote sensing: A review", ISPRS Journal of Photogrammetry and Remote Sensing, Vol.66, No.3, pp. 247-259, 2011.
[11] Z. Shi, X. Yu, Z. Jiang, and B. Li, "Ship detection in high-resolution optical imagery based on anomaly detector and local shape feature", IEEE Transactions on Geoscience and Remote Sensing, Vol.52, No.8, pp. 4511-4523, 2014.
[12] J. Tang, C. Deng, G.B. Huang, and B. Zhao, "Compressed-domain ship detection on spaceborne optical image using deep neural network and extreme learning machine", IEEE Transactions on Geoscience and Remote Sensing, Vol.53, No.3, pp. 1174-1185, 2015.
[13] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks", Advances in Neural Information Processing Systems, pp. 91-99, 2015.
[14] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection", In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779-788, 2016.
[15] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger", In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263-7271, 2017.
[16] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.Y. Fu, and A.C. Berg, "SSD: Single shot multibox detector", In: Proc. of the European Conference on Computer Vision, pp. 21-37, 2016.
[17] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks", Advances in Neural Information Processing Systems, pp. 379-387, 2016.
[18] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation", In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587, 2014.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.37, No.9, pp. 1904-1916, 2015.
[20] R. Girshick, "Fast R-CNN", In: Proc. of the IEEE International Conference on Computer Vision, pp. 1440-1448, 2015.
[21] M.M.U. Rathore, A. Paul, A. Ahmad, B.W. Chen, B. Huang, and W. Ji, "Real-time big data analytical architecture for remote sensing application", IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol.8, No.10, pp. 4610-4621, 2015.
[22] B. Pan, J. Tai, Q. Zheng, and S. Zhao, "Cascade convolutional neural network based on transfer-learning for aircraft detection on high-resolution remote sensing images", Journal of Sensors, Vol.2017, 2017.
[23] Z. Deng, L. Lei, H. Sun, H. Zou, S. Zhou, and J. Zhao, "An enhanced deep convolutional neural network for densely packed objects detection in remote sensing images", In: Proc. of the International Workshop on Remote Sensing with Intelligent Processing, pp. 1-4, 2017.
[24] T. Wang and Y. Gu, "CNN based renormalization method for ship detection in VHR remote sensing images", In: Proc. of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 1252-1255, 2018.
[25] S. Nie, Z. Jiang, H. Zhang, B. Cai, and Y. Yao, "Inshore ship detection based on Mask R-CNN", In: Proc. of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 693-696, 2018.
[26] Y. Long, Y. Gong, Z. Xiao, and Q. Liu, "Accurate object localization in remote sensing images based on convolutional neural networks", IEEE Transactions on Geoscience and Remote Sensing, Vol.55, No.5, pp. 2486-2498, 2017.
[27] K. Li, G. Cheng, S. Bu, and X. You, "Rotation-insensitive and context-augmented object detection in remote sensing images", IEEE Transactions on Geoscience and Remote Sensing, Vol.56, No.4, pp. 2337-2348, 2017.
[28] L. Cheng, X. Liu, L. Li, L. Jiao, and X. Tang, "Deep adaptive proposal network for object detection in optical remote sensing images", arXiv preprint arXiv:1807.07327, 2018.
[29] N. Ammour, H. Alhichri, Y. Bazi, B. Benjdira, N. Alajlan, and M. Zuair, "Deep learning approach for car detection in UAV imagery", Remote Sensing, Vol.9, No.4, pp. 31, 2017.
[30] X. Yang, H. Sun, K. Fu, J. Yang, X. Sun, M. Yan, and Z. Guo, "Automatic ship detection in remote sensing images from Google Earth of complex scenes based on multiscale rotation dense feature pyramid networks", Remote Sensing, Vol.10, No.1, pp. 132, 2018.

[31] G. Cheng, P. Zhou, and J. Han, "Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images", IEEE Transactions on Geoscience and Remote Sensing, Vol.54, No.12, pp. 7405-7415, 2016.
[32] Z. Deng, H. Sun, S. Zhou, J. Zhao, L. Lei, and H. Zou, "Multi-scale object detection in remote sensing imagery with convolutional neural networks", ISPRS Journal of Photogrammetry and Remote Sensing, Vol.145, pp. 3-22, 2018.
[33] K. Zhao, J. Kang, J. Jung, and G. Sohn, "Building extraction from satellite images using Mask R-CNN with building boundary regularization", In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 247-251, 2018.
[34] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, and T. Chen, "Recent advances in convolutional neural networks", Pattern Recognition, Vol.77, pp. 354-377, 2018.
[35] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition", In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[36] J. Wang, C. Luo, H. Huang, H. Zhao, and S. Wang, "Transferring pre-trained deep CNNs for remote scene classification with general features learned from linear PCA network", Remote Sensing, Vol.9, No.3, pp. 225, 2017.
[37] H. Robbins and S. Monro, "A stochastic approximation method", The Annals of Mathematical Statistics, pp. 400-407, 1951.
[38] D.P. Kingma and J. Ba, "Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980, 2014.
[39] M.D. Zeiler, "ADADELTA: An adaptive learning rate method", arXiv preprint arXiv:1212.5701, 2012.
[40] T. Tieleman and G. Hinton, "Divide the gradient by a running average of its recent magnitude", Coursera: Neural Networks for Machine Learning, Technical Report. Available online: https://zh.coursera.org/learn/neuralnetworks/lecture/YQHki/rmsprop-divide-the-gradient-by-a-running-average-of-its-recent-magnitude (accessed on 21 April 2017).
[41] N.S. Keskar and R. Socher, "Improving generalization performance by switching from Adam to SGD", arXiv preprint arXiv:1712.07628, 2017.
[42] K. Oksuz, B.C. Cam, E. Akbas, and S. Kalkan, "Localization recall precision (LRP): A new performance metric for object detection", In: Proc. of the European Conference on Computer Vision (ECCV), pp. 504-519, 2018.
[43] C. Szegedy, S. Ioffe, V. Vanhoucke, and A.A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning", In: Proc. of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

