Article
Modified Yolov3 for Ship Detection with Visible and
Infrared Images
Lena Chang 1 , Yi-Ting Chen 2, *, Jung-Hua Wang 2,3 and Yang-Lang Chang 4
1 Department of Communications, Navigation and Control Engineering, National Taiwan Ocean University,
Keelung 202301, Taiwan; [email protected]
2 Department of Electrical Engineering, National Taiwan Ocean University, Keelung 202301, Taiwan;
[email protected]
3 Department of Electrical Engineering, AI Research Center, National Taiwan Ocean University,
Keelung 202301, Taiwan
4 Department of Electrical Engineering, National Taipei University of Technology, Taipei 106344, Taiwan;
[email protected]
* Correspondence: [email protected]; Tel.: +886-02-2462-2192 (ext. 7214)
Abstract: As the demands for international marine transportation increase rapidly, effective port
management has become an important issue. Automatic ship recognition can facilitate the realization
of smart ports, and improve the efficiency of port operation and management. In order to take
into account the processing efficiency and detection accuracy at the same time, the study presented
an improved deep-learning network based on You only look once version 3 (Yolov3) for all-day
ship detection with visible and infrared images. Yolov3 network can simultaneously improve the
recognition ability of large and small objects through multiscale feature-extraction architecture.
Considering reducing computational time and network complexity with relatively competitive
detection accuracy, the study modified the architecture of Yolov3 by choosing an appropriate input
image size, fewer convolution filters, and detection scales. In addition, the reduced Yolov3 was further
modified with the spatial pyramid pooling (SPP) module to improve the network performance in
feature extraction. Therefore, the proposed modified network can achieve the purpose of multi-scale,
multi-type, and multi-resolution ship detection. In the study, a common self-built data set was
introduced, aiming to conduct all-day and real-time ship detection. The data set included a total of
5557 infrared and visible light images from six common ship types in northern Taiwan ports. The
experimental results on the data set showed that the proposed modified network architecture achieved
acceptable performance in ship detection, with the mean average precision (mAP) of 93.2%, processing
104 frames per second (FPS), and 29.2 billion floating point operations (BFLOPs). Compared with the
original Yolov3, the proposed method can increase mAP and FPS by about 5.8% and 8%, respectively,
while reducing BFLOPs by about 47.5%. Furthermore, the computational efficiency and detection
performance of the proposed approach have been verified in the comparative experiments with some
existing convolutional neural networks (CNNs). In conclusion, the proposed method can achieve
high detection accuracy with lower computational costs compared to other networks.
Keywords: ship detection; Yolov3; spatial pyramid pooling; infrared images; visible images
1. Introduction
With the dramatic increase in the demand for international maritime trade, effective
management of ports plays a pivotal role in many developing countries. In addition,
real-time monitoring of ships to provide safe coastal areas is also an important issue when
developing the fishery economy and maritime transportation. As computer vision and artificial
intelligence develop rapidly, intelligent surveillance systems have gradually been adopted
in various fields. Recently, ports are getting smarter through intelligent
navigation, automation, and reducing the need for manpower. For instance, the target
detection technology based on deep learning algorithms has attracted widespread attention
in the field of autonomous ship navigation and intelligent ship monitoring [1]. Moreover,
real-time detection of ships based on computer vision technology has greatly improved
port management and maritime inspections [2].
Ship detection plays an important but challenging role in the field of image recognition.
There are two types of data available for ship detection: radar images and optical
images. In general, radar images cover a wider range and optical images provide more
detailed information. In the literature, Synthetic Aperture Radar (SAR) imagery [3–5] and
optical images [6–9] have been widely used for ship detection methods. These studies
conducted experiments in different complex backgrounds for SAR images and optical
images, respectively. It was shown in [6] that the complex background will cause a lot of
false alarms and even increase the computational time. Therefore, it is difficult to develop a
suitable detection model for the complex ocean background characterized by rough sur-
faces, coastal areas, and river estuaries. Moreover, the utilization of SAR images is limited
by noise response and low resolution. For example, the resolution of SAR images degrades
the detection performance of small and densely distributed ships, especially for fishing
vessels moored in ports. Furthermore, due to the time-consuming image collection and
preprocessing, it is difficult to use remote sensing data to achieve real-time ship detection.
With the rapid development of digital cameras, intelligent video surveillance systems
are increasingly deployed in ports and coastal areas, which can be utilized for visible
ship target detection. Through video surveillance, the port management system can
automatically assign a suitable berthing position according to the ship detection results,
which reduces ship waiting time and improves the throughput of berthing areas. This not
only reduces the port operating cost but also improves the port service quality. Moreover,
ship detection also plays an important role in coastal defense. In order to ensure the safety
of coastal areas, the coast guard currently spends a lot of manpower in performing patrol
and defense tasks. With the aid of ship detection, the coast guard can instantly understand
the conditions of the coastal area. For example, ships smuggling or crossing the border can
be detected by the video surveillance systems along the coastline. Therefore, the study used
infrared and visible images for ship detection to monitor ships in the harbor day and night.
Clear and low-noise images are beneficial for subsequent object detection. However,
real-world images are inevitably affected by noise, which may originate from adverse
weather conditions, image acquisition chains, or image compression. These lead to the
degradation of the obtained visual image. This degradation can be canceled or at least
reduced by denoising preprocessing. In general, the image denoising methods can be
divided into spatial domain methods and transform domain methods, such as kernel
regression [10], nonlinear digital filters [11], and the most efficient denoising methods
based on first-generation [13] or second-generation [14] wavelets [12]. Such operations can
not only improve the quality of the image, but also improve the performance of subsequent
image processing (extraction of the desired information, prediction, classification, texture
analysis, and segmentation).
In recent years, there have been several studies on ship detection using optical im-
ages [15–17]. Most algorithms contain three common processes, including region selection,
feature extraction, and classification. Region selection [8] generally adopts the sliding-window
method to scan the entire image, which causes considerable computational redundancy
and thus increases processing time. Then, features of the target are extracted, which
will affect the performance of subsequent target detection. There are some well-known
feature extraction methods, such as local binary patterns (LBP) [15], the scale-invariant feature
transform (SIFT) [18], and the histogram of oriented gradients (HOG) [19], which need
to be manually designed to obtain valuable features. In addition, the establishment of
manual features relies too much on expert experience, and the generalization ability is
weak. Based on the extracted features, targets are mapped and classified using a classifier,
such as a support vector machine (SVM) [16,17,20] and Adaboost [6,21]. Most of the tradi-
tional ship detection methods were based on remote sensing data, which were captured
from a top-down view. Therefore, the handcrafted features can be defined according to
the ship’s aspect ratio, size, or scattering characteristics. In this paper, the ship images
were taken by the camera in ports from different side-view angles. Even for the same
ship type, different perspectives will lead to different ship characteristics. The traditional
methods are limited by the manually designed object features and templates. For ship
detection, the methods based on handcrafted features encounter bottlenecks in the case
of ship targets with multiple scales, multiple types, and multiple side views, or under
complex weather and ocean conditions [6,7]. When it is difficult to define object features by
hand programming, machine learning provides a feasible solution to learn features from a
large amount of observational data. Recently, computer vision based on deep learning and
convolutional neural networks (CNNs) have been widely used in various fields, especially
for object detection and classification. Semantic image features extracted by the deep CNNs
(DCNNs) are robust to morphological changes, image noise, and relative object positions
in visual images [22–25]. Therefore, this research was motivated to utilize an efficient deep
learning network to achieve automatic feature extraction for machine learning. Ships of
various sizes, shapes, and colors can be detected by deep learning methods with higher
detection accuracy than traditional methods. However, it remains a challenge in detecting
small or densely distributed ships, especially in ports.
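For illustration, the following is a minimal sketch of the kind of handcrafted-feature pipeline described above, combining HOG features, a linear SVM, and an exhaustive sliding-window scan. The window size, stride, training patches, and the use of scikit-image and scikit-learn are illustrative assumptions, not configurations evaluated in this paper.

```python
# Minimal sketch of a traditional ship detector: HOG features + linear SVM applied
# with a sliding window. Window size, stride, and training patches are assumptions.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

WIN = (64, 128)   # (rows, cols) of the sliding window
STEP = 16         # sliding-window stride in pixels

def hog_feature(patch):
    # 9-bin HOG over 8x8 cells and 2x2 blocks (common default settings).
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

def train_classifier(ship_patches, background_patches):
    X = np.array([hog_feature(p) for p in ship_patches + background_patches])
    y = np.array([1] * len(ship_patches) + [0] * len(background_patches))
    clf = LinearSVC(C=1.0)
    clf.fit(X, y)
    return clf

def detect(image, clf, threshold=0.5):
    # Exhaustive sliding-window scan: the computational redundancy criticized above.
    detections = []
    for r in range(0, image.shape[0] - WIN[0], STEP):
        for c in range(0, image.shape[1] - WIN[1], STEP):
            patch = image[r:r + WIN[0], c:c + WIN[1]]
            score = clf.decision_function([hog_feature(patch)])[0]
            if score > threshold:
                detections.append((r, c, WIN[0], WIN[1], score))
    return detections
```

Such a pipeline depends entirely on the manually chosen window, features, and classifier, which is precisely the limitation that motivates the learned features discussed next.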
However, many high-precision methods are computationally intensive. In recent years,
the deep learning methods implemented on GPU have accelerated the computing speed of
object detection [26–30]. Generally, there are two approaches for object detection based on
deep learning: the one-stage method and the two-stage method. The two-stage approaches
consist of two modules, a DCNN and a region proposal network. The representative two-stage
methods mainly include region-based CNN (R-CNN) [30], Fast R-CNN [31], and Faster
R-CNN [32]. For instance, Fan et al. [33] proposed the modified Faster R-CNN for ship
detection using Polarimetric SAR (PolSAR) data, which still had difficulty detecting inshore
small ships. The study [34] predicted ship navigation direction and detected dense ships
through a detection model with multi-scale rotational R-CNN. The research [35] proposed a
region of interest (ROI) method, which can achieve better small ship detection performance
in SAR images by combining SVM and Faster R-CNN. Dong et al. [36] adopted a multi-
angle box-based rotation in-sensitive structure of object detection to improve the R-CNN for
very-high-resolution (VHR) ship images. The computational efficiency is still insufficient
for real-time processing, even though the detection performance of the two-stage approach is
better than that of the traditional one. Subsequently, considering the requirement of fast
processing in real-time object detection, the one-stage method was proposed to directly
detect the category and position of the object by omitting the region proposal step. The main
one-stage representative methods are Single Shot Multibox Detector (SSD) [37], Yolo [38],
Yolov2 [39], Yolov3 [40], and Yolov4 [41]. In the literature, some studies have applied deep
learning methods to ship detection in SAR imagery. For example, Wang et al. [42] improved
the overall performance and detection accuracy on Sentinel-1 SAR images by using SSD
to perform transfer learning. Zhang et al. [43] proposed a grid CNN (G-CNN) approach
for real-time ship detection in SAR images, which had a faster detection performance by
meshing the input images. Furthermore, studies [44,45] have proposed improved Yolo-
based networks for ship tracking. Zhang et al. [45] addressed the problems of missed detections and
inaccurate localization by combining HOG and LBP features in a ship
detection method based on an optimized Yolo network. The study [44] realized the tracking
and detection of ships in monitored marine areas by improving Yolov3 architecture based
on Darknet.
In addition to the detection accuracy, improving the processing speed, reducing
the model complexity, and adapting the ship detection model to the actual hardware
conditions are of great significance to the system implementation. Considering the relatively
balanced detection performance in processing time and detection accuracy of the Yolov3
algorithm [40], this paper utilized the Yolov3 architecture for the ship detection method
by modifying the parameters and architecture of the network. In our previous study [46],
the concept of modifying Yolov3 parameters for ship detection was proposed based on
changing the input image size, the number of filters in the convolutional layer, and the
detection scale. Compared with [46], this study further modified the Yolov3 network by
using a spatial pyramid pooling (SPP) module to improve feature extraction. More complete
experiments have been conducted, such as selecting a more appropriate input image
size for ship detection and comparing the proposed approach with other deep learning
networks. In addition, the built dataset has been augmented and images of different
complex backgrounds, ship types, and target scales have been used to verify the ship
detection method in this paper. Experimental results showed that the proposed modified
network achieved low computational complexity and robustness in real-time ship detection.
The rest of the paper was organized as follows. The framework of Yolo networks
was given in Section 2. Section 3 described the details of the modified method. Section 4
presented the self-built ship data set and the experimental results. Finally, some conclusions
were drawn in Section 5.
IoU is often used to evaluate the accuracy of an object detector. If the IoU is greater
than the defined threshold, the prediction of a bounding box containing an object is
considered “correct”. IoU is useful when assigning anchor boxes during training dataset
preparation and when removing duplicate prediction boxes for the same object with the
non-maximum suppression algorithm. The default IoU threshold is usually set to 0.5,
which requires the predicted box and the ground truth to overlap in at least half of their combined area.
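As a concrete illustration of how IoU and non-maximum suppression work together, the sketch below computes IoU for axis-aligned boxes and greedily suppresses overlapping predictions. The (x1, y1, x2, y2) corner format and the greedy strategy are assumptions of this sketch, not details taken from a specific Yolo implementation.

```python
# IoU between two axis-aligned boxes given as (x1, y1, x2, y2) corners, and a greedy
# non-maximum suppression that keeps the highest-scoring box per object.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    # Sort candidates by confidence, then greedily suppress overlapping boxes.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```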
Yolov1 is generally reported as having a faster network speed and less computation
time. However, its detection accuracy is lower than that of other popular algorithms,
such as SSD and Faster R-CNN. Compared with Yolov1, Yolov2 has significant improvements
in computational efficiency and detection performance. Many improvements were
proposed in Yolov2. The fully connected layers were replaced by convolutional layers,
and the concept of anchor boxes was introduced. To match objects of different shapes and
sizes, the anchors are usually set according to the size of the object in the training data set.
Instead of computing class probabilities for each cell as in Yolov1, the class probabilities are
calculated for each anchor box in Yolov2. In addition, the backbone network architecture of
Yolov2 is Darknet-19.
λ_coord is the weight of the coordinate error. s² is the number of grid cells per detection
layer and B is the number of bounding boxes in each grid cell. I_ij^obj indicates whether a
target lies in the j-th bounding box of the i-th grid cell. (x_i, y_i, h_i, w_i) and (x̂_i, ŷ_i, ĥ_i, ŵ_i)
represent the center coordinate, height, and width of the ground truth and predicted box,
respectively. The IoU error indicates the degree of overlap between the ground truth and
the predicted box, which is given by

$$\mathrm{Error}_{IoU}=\sum_{i=1}^{s^2}\sum_{j=1}^{B} I_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2+\lambda_{noobj}\sum_{i=1}^{s^2}\sum_{j=1}^{B} I_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \quad (4)$$

λ_noobj is the confidence penalty when the prediction box does not contain an object.
C_i and Ĉ_i represent the true and predicted confidence, respectively. Classification error
represents the accuracy of classification. It can be defined as:

$$\mathrm{Error}_{cls}=\lambda_{coord}\sum_{i=1}^{s^2}\sum_{j=1}^{B} I_{ij}^{obj}\sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2 \quad (5)$$

c represents the class to which the detected target belongs. p_i(c) and p̂_i(c) refer to the
true probability and predicted value of the target, respectively. Combining the above errors,
the loss function of Yolov3 is expressed as the sum of the coordinate, IoU, and classification errors:

$$\mathrm{Loss}=\mathrm{Error}_{coord}+\mathrm{Error}_{IoU}+\mathrm{Error}_{cls} \quad (6)$$
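The error terms above can also be written compactly in code. The following numpy sketch mirrors Equations (4) and (5) for a single detection layer with an s × s grid and B anchor boxes per cell; the tensor shapes and default weights are assumptions, and the coordinate error is assumed to be computed elsewhere.

```python
# Sketch of the Yolo-style error terms in Equations (4)-(6). obj_mask has shape
# (s, s, B) and marks the anchors responsible for a target; shapes and the default
# weights are illustrative assumptions.
import numpy as np

def iou_error(conf_true, conf_pred, obj_mask, lambda_noobj=0.5):
    noobj_mask = 1.0 - obj_mask
    return np.sum(obj_mask * (conf_true - conf_pred) ** 2) + \
           lambda_noobj * np.sum(noobj_mask * (conf_true - conf_pred) ** 2)

def cls_error(p_true, p_pred, obj_mask, lambda_coord=5.0):
    # p_true, p_pred: (s, s, B, num_classes); the lambda_coord weighting follows
    # Equation (5) as printed above.
    return lambda_coord * np.sum(obj_mask[..., None] * (p_true - p_pred) ** 2)

def total_loss(coord_err, iou_err, cls_err):
    # Equation (6): the three error terms are summed.
    return coord_err + iou_err + cls_err
```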
3. Methodology
3.1. Proposed Modified Yolov3 Network Architecture
The proposed modified network architecture in this paper was based on the Yolov3
network. First, the study chose the anchor box size that was more suitable for the self-built
ship data set in network training. The anchor boxes originally proposed in Faster R-CNN
were used to detect multiple objects in one grid cell. Then, the Yolo matched the ratio of
width to height of objects by anchor boxes. In Yolo, the width and height of anchor boxes
were obtained based on the Pascal VOC [47] and COCO [48] data sets. Since those data sets
contained various types of objects, the defined anchor box size was not suitable for the ship
data set in this research. Based on the ship-type characteristics in the built ship data set,
this research obtained the appropriate anchor boxes by the K-means [49] algorithm. Since
the prediction layer of the Yolov3 network contains three anchor boxes for each scale, it is
necessary to partition the sizes of bounding boxes into nine categories. In order to acquire
optimal sizes of anchor boxes, the width and height of the bounding box are selected as
the clustering features in K-means. In the clustering process, the bounding box size of
each target in the dataset is divided into nine clusters according to the feature similarity,
which is measured by the IoU value between the current anchor box and the bounding box.
Then, the anchor box size is updated by the mean value of each cluster. These processes are
performed iteratively until the centroid of each cluster does not change. Since the selected
anchor boxes are much closer to the ship shapes in the ship data set, these anchor boxes
can speed up the network training. The sizes of the anchor boxes obtained by K-means were
(14, 21), (26, 36), (47, 38), (62, 59), (91, 77), (73, 113), (130, 109), (105, 170), and (186, 172), which were
applied in the following experiments.
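The anchor-clustering procedure described above can be sketched as follows; the random initialization, the convergence test, and the final sorting by box area are implementation assumptions of this sketch.

```python
# K-means over ground-truth box (width, height) pairs using 1 - IoU as the distance,
# as described above. Boxes are compared as if they shared one corner.
import numpy as np

def wh_iou(box, anchors):
    # box: (2,), anchors: (k, 2); IoU of width/height pairs anchored at the origin.
    inter = np.minimum(box[0], anchors[:, 0]) * np.minimum(box[1], anchors[:, 1])
    union = box[0] * box[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, seed=0, max_iter=1000):
    # boxes: (N, 2) array of ground-truth widths and heights.
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)].astype(float)
    last = np.zeros(len(boxes), dtype=int)
    for _ in range(max_iter):
        # Assign each box to the anchor with the highest IoU (smallest 1 - IoU).
        assign = np.array([np.argmax(wh_iou(b, anchors)) for b in boxes])
        if np.all(assign == last):
            break                       # cluster assignments stopped changing
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes[assign == j].mean(axis=0)
        last = assign
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
```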
Next, the study evaluated the influence of the input image size on the detection per-
formance of the Yolo-based networks. For this purpose, the study examined the efficiency
of networks with different input image sizes, from 288 × 288 to 512 × 512. Generally, the
larger the input image size, that is, the larger the feature maps in the deep learning network,
the more features and details of the image can be retained. Although the detection accuracy
is better when the input image size increases, the computational complexity also increases.
In order to achieve better detection performance and computational efficiency at the same
time, the research will select the appropriate input image size in ship detection.
The multiscale detection module helps the Yolo network search for and detect
objects of different scales in the same image. However, the more complex the entire
deep learning network, the longer the computation time required for object detection. In
addition, with the refinement of the grid, more retained image details will increase the
detection accuracy, but at the same time, more training and prediction times will reduce
the computational efficiency. Considering the trade-off between detection accuracy and
computation time, appropriate detection scales not only simplify network architecture but
also improve detection performance. Therefore, it is important to choose an appropriate
network scale for specific object detection, such as ships. The study will consider three com-
binations of detection scales, one of which has all three scales, another retains medium and
small target scales (removing the large target scale), and the other only has the small target
scale. Experiments will examine the ship detection efficiency of these three combinations.
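Since the three Yolov3 detection scales correspond to feature-map strides of 32, 16, and 8, the grid sizes follow directly from the input resolution. The small sketch below computes them for a 384 × 384 input; mapping the stride-32 head to the "large target scale" follows the usual Yolov3 convention and is an assumption here.

```python
# Grid sizes of the Yolov3 detection scales for a given input resolution.
# Stride 32 is the coarse grid for large targets; strides 16 and 8 handle
# medium and small targets (the usual Yolov3 convention, assumed here).
def grid_sizes(input_size=384, strides=(32, 16, 8)):
    return {s: input_size // s for s in strides}

print(grid_sizes(384))           # {32: 12, 16: 24, 8: 48}
# Keeping only the medium- and small-target scales drops the stride-32 head:
print(grid_sizes(384, (16, 8)))  # {16: 24, 8: 48}
```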
Finally, the influence of the convolution filters on the network performance was con-
sidered. More convolutional filters mean more weights in the deep learning network, which
can improve the detection accuracy of the network, but also increase the computational
burden of the system. Since the built ship dataset includes only six types of ships, choosing
an appropriate number of filters will improve the efficiency of storage and system imple-
mentation of the proposed Yolov3 network architecture. Therefore, this study examined the
ship detection performance by reducing the filters of the convolutional layers in Darknet-53,
the backbone of Yolov3. For example, when a 20% filter reduction was performed, the
number of filters of 32 and 64 in the convolutional layers of the first residual block, shown
in Figure 1, would be reduced to 26 and 52, respectively. The experiments in the next section
showed that an appropriate number of filters can reduce the computational complexity of
the system, improve the classification speed, and maintain the detection accuracy at the
same time.
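As a simple illustration of the filter-reduction step, the sketch below scales a layer's filter count by the chosen reduction ratio. Rounding to the nearest even integer reproduces the 32 to 26 and 64 to 52 example mentioned above, but the exact rounding rule used in the paper is an assumption.

```python
# Scale the number of filters in each convolutional layer by a reduction ratio.
# Rounding to the nearest even integer reproduces the example in the text
# (32 -> 26 and 64 -> 52 for a 20% reduction); the rounding rule is an assumption.
def reduce_filters(filters, reduction=0.2):
    return int(round(filters * (1.0 - reduction) / 2.0)) * 2

print([reduce_filters(f, 0.2) for f in (32, 64)])   # [26, 52]
print([reduce_filters(f, 0.3) for f in (32, 64)])   # [22, 44]
```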
Figure 1. Proposed modified Yolov3 network with SPP module.
3.2. Spatial Pyramid Pooling
Spatial pyramid pooling (SPP) [50,51] is one of the most popular approaches for vision
recognition. The SPP module divides each feature map into several different grid sizes
(such as 4 × 4, 2 × 2, 1 × 1) and then performs the maximum pooling operation on
each grid. After the maximum pooling, three feature maps with dimensions of 16 × C,
4 × C, and 1 × C will be generated for a C-dimensional input feature map. Then, the three
feature maps are able to generate a fixed-length output feature map regardless of the input
size and will connect to the following fully connected layers. Thus, regardless of the input
dimension, the SPP module provides fixed-dimensional output, which was impossible in
the previous networks using sliding windows. Due to the flexibility of the input dimensions,
SPP can incorporate the functionality obtained in variable dimensions.
Moreover, SPP extracts the main spatial information of the feature map and performs
stitching, which makes it a feature enhancement module. The receptive field of a single neuron
gradually increases as the convolutional layers of the Yolov3 network are deepened during
the feature extraction process. At the same time, the feature extraction capability has also
been improved, and the extracted features have become more abstract. If the shape of the
object's feature map is blurred, the spatial information of the small object will be inaccurate
at this time. Experimental results show that when using Yolov3 to detect multiple ships
in one image, missed detections will happen and the ship detection
performance will be greatly reduced. Due to the enhanced feature extraction capability of
SPP, the study proposed a modified Yolov3 network that adopts the SPP module to improve
the performance of Yolov3 in multi-ship target detection. As shown in Figure 1, the
SPP module is added between the Darknet-53 backbone and the FPN. The feature maps are
pooled at different scales by sliding windows of sizes 1, 5, 9, and 13
in local spatial bins, respectively. The stride of max-pooling is set to 1 and padding is
utilized to keep the size of the output feature maps unchanged. Then, these four feature
maps are concatenated and fed into the subsequent detection layer. Experiments verified that
the proposed modified Yolov3 has improved the ship detection performance, especially for
blurred images with densely distributed ships.
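The SPP block described above can be sketched as follows in PyTorch (the paper's implementation is based on Darknet, so the framework and module structure here are assumptions): the input feature map is max-pooled with 5 × 5, 9 × 9, and 13 × 13 windows at stride 1 with padding, the size-1 branch is the identity, and the four maps are concatenated along the channel axis.

```python
# Sketch of the SPP block: three max-pooling branches plus the identity branch,
# concatenated along the channel dimension so the spatial size is preserved.
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    def __init__(self, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes
        )

    def forward(self, x):
        # x: (batch, C, H, W) -> (batch, 4 * C, H, W)
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# Example: a 512-channel 12 x 12 feature map becomes 2048 channels of the same size.
feat = torch.randn(1, 512, 12, 12)
print(SPPBlock()(feat).shape)   # torch.Size([1, 2048, 12, 12])
```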
Class           Container   Cruise   War Ship   Yacht   Sailboat   Fishing Boat
Total numbers   1009        528      1008       1043    1000       969
4.2. Evaluation Methods
In the study, the metrics including IoU, precision, recall, F1-score, mean Average
Precision (mAP), frames per second (FPS), and billion floating point operations (BFLOPs)
were utilized to evaluate the detection performance of the proposed modified network. The
effectiveness of the predicted bounding box is determined according to whether the IoU is
greater than the specified threshold [55]. In the experiment, the IoU threshold was set to 0.5.
Precision, recall rate, and F1-score are common performance indicators for evaluating object
detectors. Precision (P) refers to the ratio of true ships to all ships predicted by the network.
Recall (R) refers to the proportion of true ships predicted by the networks among all true
ships. F1-score is a comprehensive indicator that combines precision and recall to evaluate
the performance of different networks. The calculation formulas of the abovementioned
indicators are as follows:

$$\mathrm{Precision}=\frac{TP}{TP+FP} \quad (7)$$

$$\mathrm{Recall}=\frac{TP}{TP+FN} \quad (8)$$

$$\mathrm{F1\text{-}score}=2\times\frac{\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \quad (9)$$

where TP (True Positive) represents samples that are actually positive and predicted to be
positive; FP (False Positive) represents samples that are actually negative but predicted to
be positive; FN (False Negative) refers to samples that are actually positive but predicted to
be negative; TN (True Negative) refers to samples that are actually negative and predicted
to be negative.
Average precision (AP) is usually used as a performance index for object detection.
It represents the accuracy of the model in a specific category, which can be calculated
by the area under the Precision-Recall (P-R) curve, as shown in Equation (10),

$$AP=\int_{0}^{1} P(R)\,dR \quad (10)$$

Moreover, to evaluate the precision of all categories, the mean AP (mAP) is often used
as a performance measure for the network.
Frames per second (FPS) represents the number of frames processed by the detection
method in one second. It is also an important metric for evaluating the real-time performance
of the object detector. Besides the above performance metrics, BFLOPs represent the
number of operations required by the detection algorithm and can be used as an indicator
to evaluate the complexity of the network.
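For reference, the evaluation metrics in Equations (7)-(10) can be computed as in the sketch below; the all-point interpolation used for AP is an assumption, as the interpolation scheme is not stated here.

```python
# Precision, recall, and F1-score from TP/FP/FN counts (Equations (7)-(9)), and AP
# as the area under the precision-recall curve (Equation (10)), using all-point
# interpolation (an assumption of this sketch).
import numpy as np

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

def average_precision(recalls, precisions):
    # recalls, precisions: arrays for predictions sorted by descending confidence.
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]     # monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP is then the mean of the per-class AP values over the six ship types.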
4.3. Modified Yolov3 Performance
In this experiment, the study compared the performance of the modified Yolov3 with
different parameters, including input image sizes, detection scales, and the number of
convolution filters.
First, ship detection experiments were conducted to evaluate the impact of the input
image size on Yolov3 performance. In the experiment, the detection scales and convolution
filters were maintained as those in the original Yolov3. Figure 3 displayed the mAP values
of Yolov3 with input image sizes varying from 288 to 512. It can be observed that the mAP
of Yolov3 increased from 89.7% to 91.6%. However, the mAP only increased slightly
when the input image size was larger than 384. In addition to mAP, the performance
metrics BFLOPs and FPS were evaluated for simulation schemes with input image sizes
of 352 × 352, 384 × 384, 416 × 416, and 448 × 448. These results were presented in the
first block of Table 2. It can be observed that the input image size has a great influence on
the computational complexity of the network architecture. Although the mAP is higher
when the input image size increases, the required BFLOPs increase accordingly. In
general, the larger the image size, the higher the mAP of the ship detection. Comparing the
results in Table 2, the mAP of the 384 × 384 image remained above 91%, which was only
0.2% and 0.1% lower than the mAP of the 416 × 416 and 448 × 448 images, respectively.
In contrast, the BFLOPs of the 384 × 384 image were 55.7, which was about 85% and 73% of
the BFLOPs required for the 416 × 416 and 448 × 448 images, respectively. Considering the
computational efficiency and mAP, the study selected the input image size to be 384 × 384
for the following experiments.
Figure 3. The mAP of Yolov3 with different input image sizes.
Table 2 (continued). Scales (input image size 384 × 384):
Two detection scales: 51.0 BFLOPs, 98.4 FPS, 90.8% mAP, 0.93 precision, 0.84 recall, 0.88 F1-score
Small target scale:   46.3 BFLOPs, 101.2 FPS, 88.0% mAP, 0.93 precision, 0.80 recall, 0.86 F1-score
Next, the experiments were performed to examine the influence of the detection scales
of Yolov3 on ship detection. The input image size was 384 × 384, and the convolution
filters remained the same as those in the original Yolov3. Three combinations of detection
scales were considered in the experiment: (1) with all three scales, (2) with two detection
scales for medium and small targets (removing the large target scale), and (3) with only
the small target scale. The detection performance of the three combinations was compared
in the second block of Table 2. For the two detection scales combination scheme (2), the
mAP was 90.8%, which was 0.5% lower than the mAP of Yolov3 with all detection scales
(as shown in the first block of Table 2); and the BFLOPs was 51, which was about 91% of
the operations required in Yolov3 with all detection scales. Although the mAP of the two
detection scales decreased slightly, it still remained above 90%. Therefore, the simulation
schemes of 384 × 384 image size and two detection scales (for medium and small targets)
were considered in the following experiments.
Finally, the impact of reducing convolutional filters on network performance was
examined. In the experiment, the network with 20%, 30%, and 40% filter reduction was
considered. Moreover, the input image size was 384 × 384, and the network preserved
two detection scales for small and medium targets. The detection performance was shown
in the third block of Table 2. It can be observed that the Yolov3 with 30% filter reduction
had a better performance, with 90.7% mAP and 28.9 BFLOPs. Compared with the network
without reducing the convolutional filters, the network with 30% filter reduction had better
calculation efficiency with similar ship detection performance. The BFLOPs of the network
with 30% filter reduction have been reduced by about 43.3% relative to the operations required by the
network with all filters retained, while the mAP remained above 90%.
According to the above experimental results, the proposed Yolov3 modified the net-
work parameters, in which the input image size was 384 × 384, the detection module
retained two scales of small and medium targets, and the convolution filters were reduced
by 30%. The modified Yolov3 greatly reduced the computation cost while maintaining ship
detection accuracy, with 90.7% mAP and 28.9 BFLOPs. Moreover, the FPS of the modified
Yolov3 was up to 106.2, which is about 9.6% higher than the original Yolov3. Figure 4
showed the training process of this modified Yolov3 model. It can be observed that after
completing 20,000 iterations, the modified Yolov3 has reached an accuracy of more than
90%, with a loss of 0.13.
Figure 4. Training process of the modified Yolov3. The simulation scheme was 384 × 384 input size,
two detection scales, and 30% filter reduction. The red line and blue line represent the mAP and
training average loss, respectively.
In the following, the detection accuracy of various types of ships in the testing data
was studied for the Yolov3 network with different parameters. The study examined the
effect of input image size, detection scales, and convolution filters on ship detection accuracy
by the same three simulation schemes as Table 2. The corresponding results were shown in
the first, second, and third blocks of Table 3, respectively.

Table 3. Detection accuracy of Yolov3 on test data under different simulation scenarios.

Parameters                                   Scheme                 Warship   Container Ship   Cruise Ship   Yacht   Sailboat   Fishing Boat   mAP
Input image size (three detection scales)    448 × 448              88.4      91.2             90.1          87.1    94.4       87.8           89.8
                                             416 × 416              84.6      91.1             88.0          85.1    92.0       84.0           87.5
                                             384 × 384              84.5      91.2             87.6          85.2    92.4       83.8           87.4
                                             352 × 352              82.4      92.3             85.8          82.7    94.6       81.0           86.4
Scales (input image size 384 × 384)          Two detection scales   90.4      93.1             88.6          90.5    93.0       83.6           89.9
                                             Small target scale     87.6      92.2             85.0          87.2    89.4       80.4           87.0
Filters (input image size 384 × 384          −20%                   89.6      93.2             95.1          93.0    92.7       83.7           91.2
and two detection scales)                    −30%                   91.8      92.4             94.5          94.1    96.3       87.6           92.8
                                             −40%                   90.4      90.3             92.4          90.1    93.8       81.6           89.4

Considering the effect of image size, it can be observed that the network with the
input image size of 448 × 448 had better performance for every type of ship. The detection
accuracy of the 416 × 416 and 384 × 384 image sizes was very close, only about 2% lower than
the detection accuracy of the 448 × 448 image size. In fact, the larger the input image size, the
better the detection accuracy, but the computational burden of the network also increases.
In order to reduce the computational complexity, the research tried to select a moderate
image size and used a network with appropriate detection scales and convolution filters.
Based on the results of the second block of Table 3, when selecting the input image size of
384 × 384 and using the network with two detection scales and all convolution filters, the
detection accuracy of ships has improved and the mAP reached 89.9%, which was 0.1%
higher than the mAP corresponding to the input image size of 448 × 448. Finally, the results
in the third block of Table 3 also verified that an appropriate number of convolutional
filters would further improve the accuracy of ship detection. For the network with 30%
filter reduction, mAP was up to 92.8%, which was 2.9% higher than the 89.9% mAP of the
abovementioned network with all convolution filters.
The experimental results validated that the modified Yolov3, with input image size
384 × 384, two detection scales, and a 30% filter reduction in the convolutional layer,
can achieve higher ship detection accuracy, superior performance, and better calculation
efficiency than the original Yolov3 network.
Table 4. Performance comparison of the proposed method with other networks on training data.

Network               BFLOPs   FPS     mAP     Precision   Recall   F1-score
Modified Yolov3       28.9     106.2   90.7%   0.92        0.84     0.88
Modified Yolov3-spp   29.2     104.7   93.0%   0.93        0.86     0.89
performance with high detection accuracy, low computational complexity, and fast processing speed.
Figure 5. Performance evaluation of the Yolo-based networks, including (a) BFLOPs, FPS and mAP;
(b) Precision, Recall, and F1-score.
Then, the detection results of the proposed modified Yolov3 and other CNN networks
by using testing data were shown in Table 5. Yolov2-tiny and EfficientDet had poor
detection results, with mAP of 64.8% and 61.9%, respectively. The detection accuracy of
SSD was similar to that of Yolov2, and the mAP was about 76%. The reason for the poor
detection efficiency was that the network cannot extract effective features from multiscale
images, while the Yolov3 applied the FPN technique to address this problem. The mAP
of Yolov3-spp has improved by 1.4% compared to the original Yolov3. The proposed
modified Yolov3 and modified Yolov3-spp can improve the detection performance, with
mAP of 92.8% and 93.2%, which were 5.4% and 4.4% higher than the original Yolov3
networks, respectively. Among Yolo-based models, Yolov4 achieved the highest detection
performance, reaching 94.3% mAP. The mAP of the proposed modified Yolov3-spp was
1.1% lower than that of Yolov4, which was due to the slightly lower detection accuracy of
the proposed approach for small vessels such as fishing boats. The precision and recall of
Yolov4 were 0.02 lower and 0.02 higher than those of the proposed method, respectively. Both Yolov4
and the proposed method had an F1-score of 0.89. However, the BFLOPs of the proposed
modified Yolov3-spp were only 57.5% of the required operations of Yolov4.
Table 5. Detection accuracy (AP, %) of the proposed modified methods and other networks on testing data.

Network               Warship   Container Ship   Cruise Ship   Yacht   Sailboat   Fishing Boat   mAP
EfficientDet          64.9      62.1             63.8          68.1    65.2       47.4           61.9
Resnet151             81.2      86.5             83.5          86.8    87.8       80.2           84.3
SSD                   75.2      73.5             83.6          79.8    81.5       65.5           76.5
Yolov2                74.8      72.6             80.4          78.2    78.1       67.4           75.3
Yolov3                84.5      91.2             87.6          85.2    92.4       83.8           87.4
Yolov3-spp            86.7      92.1             90.7          84.1    95.6       83.5           88.8
Yolov4                92.9      93.8             95.7          92.9    97.2       93.4           94.3
Yolov2-tiny           65.8      67.8             64.7          65.3    71.8       53.4           64.8
Yolov3-tiny           71.1      78.7             72.5          70.2    74.8       67.8           72.5
Yolov4-tiny           79.6      87.0             82.3          77.3    87.5       77.2           81.8
Modified Yolov3       91.8      92.4             94.5          94.1    96.3       87.6           92.8
Modified Yolov3-spp   92.9      93.2             95.4          93.5    95.8       88.9           93.2
In summary, compared with other networks, the proposed modified Yolov3-spp can
provide high detection accuracy and high calculation efficiency for ship detection. The results
in Tables 4 and 5 verified the superior performance of the proposed modified networks.
Figure 6. (1) Container ship; (2) cruise; (3) yacht; (4) war ship; (5) sailboat; (6) fishing boat; (7) fishing boat (infrared images); (8) war ship (blurred images); (9) multi-type ship targets.
From the results, it can be observed that the modified Yolov3 and the original Yolov3
have missed some small and obscure ships. However, the modified Yolov3-spp can avoid
missing some densely arranged ships, and even detect partially obstructed ship targets.
The adopted SPP modules improve the feature extraction and preserve spatial information
by pooling in local spatial bins, thereby improving the ability to express ship features and
alleviating the problem of multiscale ship detection. The modified Yolov3-spp has better
detection performance than the modified Yolov3. In addition, the Yolov3 networks can
successfully detect ships in infrared images, as shown in the seventh row of Figure 6. Due
to the dense distribution of fishing boats in harbors, some fishing boats in this infrared
image were missed by Yolov3 and the modified Yolov3. However, the modified
Yolov3-spp achieved better detection and only missed one fishing boat, compared with the
other two Yolov3 networks. Finally, even for blurred images or multiple types of ship targets in
one image, as shown in the eighth and ninth rows of Figure 6, the modified Yolov3-spp can
detect almost all ships correctly and achieve the highest confidence score. In general, the
proposed modified Yolov3-spp network can improve the performance of multi-scale ship
detection, and the detection box is more accurate than that of the original Yolov3 network.
5. Conclusions
This study proposed a modified Yolov3-spp model for ship detection with visible
and infrared images. The effectiveness of the proposed method in real-time detection was
verified by the experiments on the built data set consisting of six types of ship images.
Experimental results showed that the proposed modified Yolov3-spp outperforms most
of the current CNN networks in terms of detection accuracy and computation efficiency.
The proposed method achieved better detection performance than the original Yolov3
in ship detection, increasing mAP by 5.8%, FPS by 8%, and reducing BFLOPs by about
47.6%. Experiments also showed that the proposed method has high detection accuracy in
multiscale detection situations, especially for the detection of densely distributed ships in
ports. In conclusion, the proposed method has high computational efficiency and detection
accuracy and meets the requirements of real-time detection. Furthermore, this study has
investigated the ship detection algorithms in detail and developed a common ship dataset
consisting of visible and infrared images. In future work, attention mechanisms and a more
complete data set will be key research directions.
Author Contributions: Data curation, Y.-T.C.; Methodology, L.C.; Project administration, Y.-L.C.;
Software, Y.-T.C.; Supervision, J.-H.W.; Validation, L.C.; Writing—original draft, Y.-T.C.; Writing—
review & editing, L.C. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Ministry of Science and Technology, Taiwan, under Grant
Nos: MOST-109-2221-E019-054, MOST-110-2119-M-027-001 and MOST-110-2221-E-027-101.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Chen, X.; Chen, H.; Wu, H.; Huang, Y.; Yang, Y.; Zhang, W.; Xiong, P. Robust visual ship tracking with an ensemble framework
via multi-view learning and wavelet filter. Sensors 2020, 20, 932. [CrossRef]
2. Hu, W.C.; Yang, C.Y.; Huang, D.Y. Robust real-time ship detection and tracking for visual surveillance of cage aquaculture. J. Vis.
Commun. Image Represent. 2011, 22, 543–556. [CrossRef]
3. Wang, X.; Chen, C. Adaptive ship detection in SAR images using variance WIE-based method. Signal Image Video Process. 2016,
10, 1219–1224. [CrossRef]
4. Hwang, J.; Kim, D.; Jung, H.S. An efficient ship detection method for KOMPSAT-5 synthetic aperture radar imagery based on
adaptive filtering approach. Korean J. Remote Sens. 2017, 33, 89–95. [CrossRef]
5. Chang, Y.L.; Anagaw, A.; Chang, L.; Wang, Y.C.; Hsiao, C.Y.; Lee, W.H. Ship Detection Based on YOLOv2 for SAR Imagery. Remote
Sens. 2019, 11, 786. [CrossRef]
6. Shi, Z.; Yu, X.; Jiang, Z.; Li, B. Ship detection in high-resolution optical imagery based on anomaly detector and local shape
feature. IEEE Trans. Geosci. Remote Sens. 2014, 52, 4511–4523.
7. Liu, G.; Zhang, Y.; Zheng, X.; Sun, X.; Fu, K.; Wang, H. A new method on inshore ship detection in high-resolution satellite images
using shape and context information. IEEE Geosci. Remote Sens. Lett. 2014, 11, 617–621. [CrossRef]
8. Nie, T.; He, B.; Bi, G.; Zhang, Y. A Method of Ship Detection under Complex Background. ISPRS Int. J. Geo-Inf. 2017, 6, 159.
[CrossRef]
9. Dong, C.; Liu, J.; Xu, F. Ship detection in optical remote sensing images based on saliency and a rotation-invariant descriptor.
Remote Sens. 2018, 10, 400. [CrossRef]
10. Takeda, H.; Farsiu, S.; Milanfar, P. Kernel regression for image processing and reconstruction. IEEE Trans. Image Process. 2007, 16,
349–366. [CrossRef]
11. Pitas, I.; Venetsanopoulos, A.N. Nonlinear Digital Filters: Principles and Applications; Kluwer: Boston, MA, USA, 1990.
12. Ouahabi, A. Signal and Image Multiresolution Analysis; ISTE-Wiley: London, UK; Hoboken, NJ, USA, 2013.
13. Ouahabi, A. A review of wavelet denoising in medical imaging. In Proceedings of the 8th International Workshop on Systems,
Signal Processing and Their Applications (IEEE/WoSSPA), Algiers, Algeria, 12–15 May 2013; pp. 19–26.
14. Ahmed, S.S.; Messali, Z.; Ouahabi, A.; Trepout, S.; Messaoudi, C.; Marco, S. Nonparametric denoising methods based on
contourlet transform with sharp frequency localization: Application to low exposure time electron microscopy images. Entropy
2015, 17, 3461–3478. [CrossRef]
15. Yang, F.; Xu, Q.; Li, B. Ship detection from optical satellite images based on saliency segmentation and structure-LBP feature.
IEEE Geosci. Remote Sens. Lett. 2017, 14, 602–606. [CrossRef]
16. Xia, Y.; Wan, S.; Yue, L. A novel algorithm for ship detection based on dynamic fusion model of multi-feature and support
vector machine. In Proceedings of the IEEE Sixth International Conference on Image and Graphics (ICIG), Hefei, China,
12–15 August 2011; pp. 521–526.
17. Xu, J.; Sun, X.; Zhang, D.; Fu, K. Automatic detection of inshore ships in high-resolution remote sensing images using robust
invariant generalized Hough transform. IEEE Geosci. Remote Sens. Lett. 2014, 11, 2070–2074.
18. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [CrossRef]
19. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. IEEE Conf. Comput. Vis. Pattern Recognit. 2005, 1,
886–893.
20. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [CrossRef]
21. Schapire, R.E. Explaining AdaBoost. In Empirical Inference; Springer: Berlin/Heidelberg, Germany, 2013; pp. 37–52.
22. Kim, K.; Hong, S.; Choi, B.; Kim, E. Probabilistic ship detection and classification using deep learning. Appl. Sci. 2018, 8, 936.
[CrossRef]
23. Huang, H.; Sun, D.; Wang, R.; Zhu, C.; Liu, B. Ship target detection based on improved Yolo network. Math. Probl. Eng. 2020,
2020, 6402149. [CrossRef]
24. Li, H.; Deng, L.; Yang, C.; Liu, J.; Gu, Z. Enhanced Yolov3 tiny network for real-time ship detection from visual image. IEEE
Access. 2021, 9, 16692–16706. [CrossRef]
25. Li, Z.; Zhao, L.; Han, X.; Pan, M. Lightweight ship detection methods based on Yolov3 and DenseNet. Math. Probl. Eng. 2020,
2020, 4813183. [CrossRef]
26. Yao, Y.; Jiang, Z.; Zhang, H.; Zhao, D.; Cai, B. Ship detection in optical remote sensing images based on deep convolutional neural
networks. J. Appl. Remote Sens. 2017, 11, 042611. [CrossRef]
27. Lin, H.; Shi, Z.; Zou, Z. Fully convolutional network with task partitioning for inshore ship detection in optical remote sensing
images. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1665–1669. [CrossRef]
28. Yang, X.; Sun, H.; Fu, K.; Yang, J.; Sun, X.; Yan, M.; Guo, Z. Automatic ship detection in remote sensing images from google earth
of complex scenes based on multiscale rotation dense feature pyramid networks. Remote Sens. 2018, 10, 132. [CrossRef]
29. Li, Q.; Mou, L.; Liu, Q.; Wang, Y.; Zhu, X.X. HSF-Net: Multiscale deep feature embedding for ship detection in optical remote
sensing imagery. IEEE Trans. Geosci. Remote Sens. 2018, 56, 7147–7161. [CrossRef]
30. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014;
pp. 580–587.
31. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile,
13–16 December 2015; pp. 1440–1448.
32. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings
of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; pp. 91–99.
33. Fan, W.; Zhou, F.; Bai, X.; Tao, M.; Tian, T. Ship detection using deep convolutional neural networks for PolSAR images. Remote
Sens. 2019, 11, 2862. [CrossRef]
34. Yang, X.; Sun, H.; Sun, X.; Yan, M.; Guo, Z.; Fu, K. Position detection and direction prediction for arbitrary-oriented ships via
multitask rotation region convolutional neural network. IEEE Access. 2018, 6, 50839–50849. [CrossRef]
35. Zhang, S.; Wu, R.; Xu, K.; Wang, J.; Sun, W. R-CNN-Based ship detection from high resolution remote sensing imagery. Remote
Sens. 2019, 11, 631. [CrossRef]
36. Dong, Z.; Lin, B. Learning a robust CNN-based rotation insensitive model for ship detection in VHR remote sensing images. Int.
J. Remote Sens. 2020, 41, 3614–3626. [CrossRef]
37. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of
the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Amsterdam, The
Netherlands, 2016; pp. 21–37.
38. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
39. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
40. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
41. Bochkovskiy, A.; Wang, C.Y.; Mark Liao, H.Y. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020,
arXiv:2004.10934.
42. Wang, Y.; Wang, C.; Zhang, H. Combining a single shot multibox detector with transfer learning for ship detection using Sentinel-1
SAR images. Remote Sens. Lett. 2018, 9, 780–788. [CrossRef]
43. Zhang, T.; Zhang, X. High-speed ship detection in SAR images based on a grid convolutional neural network. Remote Sens. 2019,
11, 1206. [CrossRef]
44. Liu, B.; Wang, S.; Zhao, J.; Li, M. Ship tracking and recognition based on Darknet network and YOLOv3 algorithm. J. Comput.
Appl. 2019, 39, 1663–1668.
45. Zhang, Y.; Shu, J.; Hu, L.; Zhou, Q.; Du, Z. A Ship Target Tracking Algorithm Based on Deep Learning and Multiple Features; SPIE:
Bellingham, WA, USA, 2020; Volume 11433.
46. Chang, L.; Chen, Y.T.; Hung, M.H.; Wang, J.H.; Chang, Y.L. Yolov3 based ship detection in visible and infrared images. In
Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16
July 2021.
47. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (VOC) challenge. Int. J.
Comput. Vis. 2010, 88, 303–338. [CrossRef]
48. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft
COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV) 2014, Zurich,
Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740–755.
49. Kanungo, T.; Mount, D.M.; Netanyahu, N.S.; Piatko, C.D.; Silverman, R.; Wu, A.Y. An efficient k-means clustering algorithm:
Analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 881–892. [CrossRef]
50. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans.
Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [CrossRef]
51. Huang, Z.; Wang, J. DC-SPP-YOLO: Dense connection and spatial pyramid pooling based YOLO for object detection. Inf. Sci.
2020, 522, 241–258. [CrossRef]
52. AlexeyAB. AlexeyAB/Darknet: Yolov3. 2020. Available online: https://github.com/AlexeyAB/darknet (accessed on 10 February 2022).
53. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747.
54. Tzutalin. Tzutalin/Labelimg. 2018. Available online: https://github.com/tzutalin/labelImg (accessed on 10 February 2022).
55. Li, K.; Huang, Z.; Cheng, Y.C.; Lee, C.H. A maximal figure-of-merit learning approach to maximizing mean average precision
with deep neural network based classifiers. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 4503–4507.
56. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–18 June 2020; pp. 10781–10790.
57. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. arXiv 2015, arXiv:1512.03385.