Object Detection and Ship Classification Using YOLOv5
Abstract. Using a public dataset of images of maritime vessels provided by Analytics Vidhya, manual annotations
were made on a subsample of images with Roboflow, using the ground truth classifications provided by the dataset.
YOLOv5, a prominent open-source family of object detection models that comes pre-trained on the Common
Objects in Context (COCO) dataset, was then trained on these annotations of maritime vessel subclassifications.
Out of the box, YOLOv5 already detects the generic class “boat” well. The training, validation, and test image sets
were used to train and evaluate YOLOv5 in the cloud using Google Colab. Three of our five subclasses, namely,
cruise ships, ROROs (Roll On Roll Off, typically car carriers), and military ships, have very distinct shapes and
features and yielded positive results. Two of our subclasses, the tanker and the cargo ship, have similar
characteristics when the cargo ship is unloaded and not carrying any cargo containers. This yielded interesting
misclassifications that could be improved in future work. Our trained model achieved a validation mean Average
Precision (mAP@0.5) of 0.932 across all subclassifications of ships.
Keywords: Object detection, image classification, maritime, ship classification, YOLOv5.
INTRODUCTION

In the military maritime realm, there is a desire to understand the environment passively, without using emitters such as RADAR and LiDAR, so that a vessel can keep a low profile while detecting navigational hazards or threats. In the commercial maritime realm, ships can use images and videos in conjunction with RADAR to add another layer of confidence to hazard avoidance. These industries need reliable object detection using cameras. Hence, this study explores the use of still images from electro-optical cameras as a dataset to retrain a YOLOv5 model, a deep learning object detector, and evaluates its performance.

Object detection is a technique for locating instances of objects in either images or key frames from videos by leveraging machine learning algorithms, with the goal of replicating recognition intelligence using a computer. Object detection can be applied to many domains as long as a sufficient domain-based dataset is available. However, object detection algorithms must also consider condition factors applicable to that domain, such as poor weather or lighting conditions.

To study the application of object detection using still images and videos, an applicable dataset of maritime images from Analytics Vidhya [1, 2] was used, and five different classifications of ships were defined for the algorithm to detect. This study also takes into consideration various factors that can affect the results and attempts to improve the training dataset and the algorithm’s configurable parameters. A subsample of images was annotated using Roboflow [3] and trained through an existing off-the-shelf object detection algorithm called YOLOv5 [4].

YOLOv5 was born out of the improvements made by Glenn Jocher [5], who ported YOLOv3’s Darknet weights [6] to PyTorch [7]. PyTorch is an open-source machine learning framework that allows users to implement powerful computational functions accelerated by a GPU’s processing power. YOLOv5 version 6.1, which featured the YOLOv5n model for ultralight edge devices, was released on February 22, 2022; it was the latest version at the time of this report and is the version utilized here. The YOLOv5s model was used due to the constraints of using free cloud-based tools to facilitate our training and detection, as described later in the “Methodology” section.

This study is organized as follows. The “Literature Review” section reviews works related to the use of machine learning techniques on images of maritime ships, specifically with the YOLO family of algorithms and similar machine learning algorithms.
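Because the ported weights live in PyTorch, the pretrained model can be pulled down and exercised in a few lines. The sketch below uses the public PyTorch Hub entry point for YOLOv5; the image path is a placeholder, not a file from this study’s dataset:

```python
# Minimal sketch: load COCO-pretrained YOLOv5s through PyTorch Hub and run it
# on one image. "ship.jpg" is a placeholder path, not a file from our dataset.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model("ship.jpg")  # also accepts URLs, PIL images, or numpy arrays

results.print()          # class names, confidences, and counts per image
boxes = results.xyxy[0]  # tensor of [x1, y1, x2, y2, confidence, class]
```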
The “Methodology” section describes the dataset and the methods of preprocessing, annotation, training, and validation. The “Results and Discussion” section discusses the results and the key metrics of mAP, precision, recall, AUC, and F1 scores. Finally, the “Conclusion and Future Work” section discusses the conclusions and future ideas.

In the literature review, we will compare and contrast previous works with this study and determine the uniqueness of the dataset and methodology, as well as increase the fundamental knowledge on the subject of object detection and classification. It should be noted that demand for advanced video surveillance and perception capability has been expressed by the United States Department of Defense (DoD). The DoD has set aside millions of dollars to procure and field innovative technologies from non-traditional vendors, making this research in-demand and valuable.

Many studies discuss machine learning in the application of ship classification [8], but to date, none of these works address the problem with the off-the-shelf application of YOLOv5 and the dataset from Analytics Vidhya. The uniqueness of this study is that it applies YOLOv5 to a large dataset consisting of various types of ships, in addition to the varying quality of images from multiple viewpoints.

LITERATURE REVIEW

Research has been done on ship classification using alternative algorithms, methods, and datasets. Kim et al. [9] used a different dataset that includes images and focused on the improvement of a preexisting classification with different defining subclasses of ships, including “Boat,” “Speed Boat,” “Vessel Ship,” “Ferry,” “Kayak,” “Buoy,” “Sail Boat,” and “Others.” They achieved a mean average precision mAP@0.5 value of 0.898 and an mAP@0.5:0.95 value of 0.528. Li et al. [8] presented a combination of real-time ship classification, Ship Detection from Visual Image, and YOLOv3 to achieve an mAP value of 0.741.

Tang et al. [10] compared different versions of YOLO on datasets of Synthetic Aperture Radar images and traditional satellite camera images of ships. While similar in nature in terms of using YOLO for object detection of ships, those datasets are quite different in terms of image capture angle. The dataset chosen for this study, from Analytics Vidhya, contains much closer images of various horizontal profiles with five distinct classifications, namely, “Cargo,” “Military,” “Carrier,” “Cruise,” and “Tankers.” Using YOLOv3, Liu et al. [11] took steps to improve the algorithm’s accuracy on large, pixel-dense satellite images by reducing the original network’s 32× down-sampling to 4×, as well as by using a sliding window method to cut large images down into many smaller ones.

The structure of YOLOv5 can be split into input, backbone, neck, and prediction. Some research has been done on creating new backbones, improving existing backbones [12], or swapping YOLOv5’s backbone for another existing backbone. Ting et al. [13] swapped the existing backbone for Huawei’s GhostNet [14] and stacked two of these GhostNets into what they call a Ghostbottlenet. Using the Ghostbottlenet instead of YOLOv5’s original backbone, they were able to improve feature extraction and reduce the overall model size. Zhou et al. [15] also replaced the original backbone, using Mixed Receptive Field Convolution (MixConv), which makes use of multiple convolution kernels to improve feature extraction by increasing attention to pixel coordinates in the horizontal and vertical channels. Qiao et al. [16] attempted to re-identify maritime vessels that the model has already seen, even at other orientations, using Global-and-Local Fusion-based Multi-view Feature Learning, replacing the backbone with ResNet-50 for global and local feature extraction.

Orientation recognition is the focus of another study, where the researchers use a single shot detector (SSD) for both multiclass vessel detection and five defined orientations (i.e., front, front side, side, backside, and back) [17, 18]. SSD is a feedforward ConvNet that explores the presence of an object instance in predefined default bounding boxes, followed by a non-maximum suppression stage to produce the final detection. Tang et al. [19] explored the use of an SSD with hue, saturation, and value preprocessing to improve the Intersection over Union. The hue, saturation, and value preprocessing operation is used to extract regions of interest to feed to the YOLO network.

There are other studies in the same field that cross-compare different object detection algorithms, such as Faster R-CNN [20], R-FCN [21], SSD, and EfficientDet [22], while still attempting to detect maritime vessels. Iancu et al. [23] found that for small to medium size objects greater than 162 pixels, Faster R-CNN with Inception-ResNet v2 outperforms the others, except in detecting large objects, where EfficientDet does a better job [23]. It is interesting to note that all of the convolutional neural network (CNN) based detectors were also pre-trained on the COCO [24] dataset, similar to YOLO, which is also a CNN [23].

The COCO [24] dataset by Microsoft houses 330,000 images, 1.5 million object instances, and 80 object categories. The richly annotated dataset contains objects in their natural context and depicts complex everyday scenes. The dataset focuses on segmenting individual object instances rather than what other object recognition datasets support, such as image classification, object localization, or segmentation. Since one of the 80 object categories is “boat,” we can utilize transfer learning for our five subclassifications of “boat,” since YOLOv5 is pre-trained on COCO.
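Concretely, this transfer learning is driven by a small dataset configuration file that maps our five subclasses onto the pretrained network. The sketch below writes such a config; the directory layout is an assumption based on a typical Roboflow YOLOv5 export, while the class names are the five used in this study:

```python
# Sketch of the dataset config that drives transfer learning from COCO's
# "boat" class to our five ship subclasses. The directory layout is an
# assumption based on a typical Roboflow YOLOv5 export; the class names are
# the five used in this study.
config = """\
train: ../train/images
val: ../valid/images

nc: 5
names: ['Cargo', 'Military', 'Carrier', 'Cruise', 'Tankers']
"""
with open("data.yaml", "w") as f:
    f.write(config)
```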
Lastly, the “cache” flag was used to cache images for faster training.

Post-processing

Our first attempt at training a custom YOLOv5 model used some recommended parameters for train.py. After viewing some of the charts produced by TensorBoard, it was determined that our model was overfitting: an undesirable positive slope occurred at the end of the bounding box regression loss (Mean Squared Error), or box_loss, graph. To correct this overfitting problem, the epochs were lowered from 150 to 100 to reduce the number of times the training data passes through the algorithm.
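For reference, a Colab cell for the corrected run might look like the sketch below. The epoch count and cache flag come from the text above; the image and batch sizes are illustrative assumptions, not values reported in this paper:

```python
# Colab/Jupyter cell sketch of the corrected training run: epochs reduced from
# 150 to 100, with --cache keeping images in memory for faster epochs.
# --img and --batch are illustrative values, and data.yaml is the dataset
# config described earlier.
!python train.py --img 640 --batch 16 --epochs 100 --data data.yaml --weights yolov5s.pt --cache
```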
RESULTS AND DISCUSSION

Dataset Quality

The dataset was obtained from Analytics Vidhya for the Game of Deep Learning: Computer Vision Hackathon [1, 2]. It was essential to gather a wide variety of photos with a few key factors to test the performance of the machine learning model. The first type of image contained in the dataset is a simple image of one ship, with that ship being the focus of the image. The second type is a single image containing multiple ships of the same classification. The third type is a single image containing numerous vessels of different classifications. The fourth type is an image of a ship in which the ship is not the main focus of the photograph; it may be in the background or blend into the environment more so than in other photographs.

The dataset also contains images both in color and in black and white, as well as blurry images and clear images. Including all of these types of images in the dataset ensures that it challenges the machine learning model to predict classifications that are not inherently obvious, and pushes the limits of how well the model can predict given less than ideal images.

Object Detection

As mentioned in the “Methodology” section, the machine learning model was run through Google Colab using YOLOv5 and Python in a Jupyter notebook. The model ran 100 epochs and produced images in which it made predictions on classifications of ships. Some of these images were classified correctly, and some were not. In a few of the predictions, there were various ships within the images for added complexity to the model. Figures 1 through 7 show examples of the algorithm’s output. The model performs differently on different images for object detection: it is possible to correctly classify an object, misclassify a ship, or classify background objects as a ship, and it is possible to miss the detection entirely.

Figure 3 shows the model successfully classifying multiple ships of type “Cruise Ship” within the same image, so it is understood that the model has the capability to identify more than one object per image.

Figures 2, 3, and 4 all show successful classifications of three different types of ships with images of varying quality. It can be seen that Figures 2 and 3 have a higher confidence in classification than Figure 4, likely due to the poor image quality of Figure 4. However, it is important to note that the model can detect an object even when the image quality is not ideal.
Figure 5. Missing classification example.

Figure 5 is an example of the model missing a detection entirely. It can be seen that there are two cruise ships in the image, but the model identifies only one.

Figure 6 shows another capability of the model in that it can identify overlapping ships. In this image, the tanker was located directly in the foreground of the container ship, and the model was still able to detect both ships successfully. Likely the container ship suffers lower confidence due to the overlap.
Post-processing

As described in the “Post-processing” subsection of the “Methodology” section, the method of rectifying the model’s apparent overfitting observed in the box_loss graph was to reduce the epochs from 150 to 100. As seen in Figure 13, the box_loss data now trends in a healthy manner and does not indicate overfitting, as seen in the previous training attempt.

The phenomenon of overfitting occurs when the model has been trained for too long and becomes too specific to the training set, so it performs poorly on new data. An ideal learning curve graph is one in which training and validation loss decrease to the point of stability.
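One way to verify this behavior numerically, rather than visually, is to read the per-epoch losses YOLOv5 logs during training. The sketch below assumes YOLOv5’s default run directory and results.csv layout; a validation box loss that rises while the training loss keeps falling is the overfitting signature described above:

```python
# Sketch: numerically check the loss curves that YOLOv5 writes during training.
# Assumes the default output directory and CSV layout of YOLOv5 v6.x; adjust
# "exp" to the actual run name.
import pandas as pd

df = pd.read_csv("runs/train/exp/results.csv")
df.columns = df.columns.str.strip()  # YOLOv5 pads its column names with spaces

# A val/box_loss that starts rising while train/box_loss keeps falling is the
# overfitting signature described above.
print(df[["epoch", "train/box_loss", "val/box_loss"]].tail(10))
```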
Performance

The concept of algorithm performance is interpreted here as the quality of the results the machine learning model produced.
Validation

The goal of validation is to assess the model’s performance: how well it predicts the correct classification of a ship and how accurately it detects the object. There can be, and are, instances where the model identifies a background object as some classification of vessel or classifies a ship incorrectly. Figures 13, 14, and 15 represent the performance of the validation set. In each of these figures, it is important to compare the dark orange line, which is the validation data, to the faded orange line, which is the training data.
Figure 11. Precision–recall chart.

The validation box loss chart in Figure 13 shows the mean squared error (MSE) of the validation data vs. the training data. MSE depicts how close the regression line is to a set of points by calculating the distance from the points to the regression line and squaring the error. You can observe from the graph that the MSE consistently declines and never trends in the positive direction. Box loss shows how well the predicted box overlaps with the validation bounding box.
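For reference, the mean squared error over n predictions ŷ against ground truths y is the standard:

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```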
Figure 12. F1 curve chart.

Figure 13. Validation box loss chart.

Figure 14 shows the classification loss, or cross-entropy. Entropy measures the average amount of information needed to represent a random event drawn from a probability distribution for a random variable. Cross-entropy measures the difference between two probability distributions and can be used as a loss function when optimizing classification models. Figure 14 shows the validation of the model’s ability to correctly classify the object given a set of possible classifications. Lastly, in Figure 15, the object loss chart shows binary cross-entropy, or the ability of the model to accurately detect whether an object of interest is present.
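As a toy illustration of the two loss types named above (not YOLOv5’s exact internal implementation), PyTorch’s stock criteria can be applied to stand-in tensors:

```python
# Toy illustration of the two loss types named above, using stock PyTorch
# criteria on stand-in tensors (not YOLOv5's exact internal implementation).
import torch
import torch.nn as nn

# Classification loss: cross-entropy over our five ship classes.
cls_logits = torch.randn(8, 5)           # 8 predictions, 5 class scores each
cls_targets = torch.randint(0, 5, (8,))  # ground-truth class indices
cls_loss = nn.CrossEntropyLoss()(cls_logits, cls_targets)

# Objectness loss: binary cross-entropy on "is an object present here?"
obj_logits = torch.randn(8)                      # one objectness score each
obj_targets = torch.randint(0, 2, (8,)).float()  # 1 = object, 0 = background
obj_loss = nn.BCEWithLogitsLoss()(obj_logits, obj_targets)

print(f"classification loss: {cls_loss.item():.3f}, "
      f"objectness loss: {obj_loss.item():.3f}")
```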
In each of these charts, it would be evident if the model were overfitting or underfitting. If overfitting, the algorithm would do well on the training dataset but poorly on the validation data. This scenario would be visually identifiable by the validation line sitting consistently above the training line even though the loss on the training data is low; it may mean that the algorithm is too complex for the given dataset. Underfitting, in comparison, occurs when the algorithm performs poorly not only on the validation data but on the training data as well, and may mean that the model is not complex enough for the given dataset. It is visually identifiable in the validation charts when the validation and training lines are separated and the training line sits above the validation line. All three of our validation charts show healthy loss functions.
Table 1 shows the training and validation results, including precision, recall, and mAP. The standard for comparing object detectors is mAP, and for our classifications, we are pleased with the results; we compare them with other related projects in the “Conclusion and Future Work” section.
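For reference, the precision, recall, and F1 metrics reported in Table 1 and charted in Figures 11 and 12 are defined from true positives (TP), false positives (FP), and false negatives (FN) as:

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```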
CONCLUSION AND FUTURE WORK

With the methods chosen, using a public dataset, annotating images from the dataset with Roboflow, and training an off-the-shelf machine learning algorithm, YOLOv5, in a cloud-based environment, the application of commercial and military passive perception of maritime ships is achievable. Object detection and classification of maritime ships can draw on many machine learning algorithms, but our results show that YOLOv5 is a competitive CNN, as indicated by our mAP value of 0.929 and the healthy validation curves presented in the “Validation” section.

We found that our custom-trained model using the Analytics Vidhya dataset performed better in terms of mAP over Intersection over Union thresholds from 0.5 to 0.95 in 0.05 increments than the related work of Kim et al. [9]: their model resulted in an mAP@0.5:0.95 of 0.528, while our model performed at 0.598. Another comparison is to the related work of Li et al. [8], who used YOLOv3 to achieve an mAP value of 0.741, whereas our custom YOLOv5 model produced an mAP value of 0.929. Iancu et al. [23] at best produced an mAP value of 55.48% while cross-comparing four competitors to YOLOv5, compared to our model’s mAP of 92.9%.
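The Intersection over Union underlying these mAP figures is a simple ratio of box areas. The sketch below shows the computation and the ten thresholds over which mAP@0.5:0.95 averages; it is illustrative, not the evaluation code used in this study:

```python
# Illustrative Intersection over Union computation (not the evaluation code
# used in this study). mAP@0.5:0.95 averages AP over ten IoU thresholds.
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A prediction counts as a true positive at threshold t if IoU >= t;
# mAP@0.5:0.95 is the mean of the per-threshold average precisions.
thresholds = [0.5 + 0.05 * i for i in range(10)]   # 0.50, 0.55, ..., 0.95
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))         # ~0.143 for these boxes
```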
In future work, to rectify the misclassifications of cargo container ships and tankers with similar features, additional images with varying characteristics would have to be added to the datasets. The algorithm would benefit from annotated images where a container ship may be empty, and the same for tankers at various stages depending on cargo load. Other future work that could be useful to commercial and military customers is estimating the distance and bearing of the target after properly detecting and classifying an object of interest.

We plan to investigate improving our results by increasing the number of hand-annotated images in our training and validation datasets. Increasing the number of images per classification will also further refine our results. There is also news of future versions, YOLOv6 and YOLOv7, that could be utilized, as well as changing to larger pretrained weights to compare results. With the growing fleet of commercial and military ships, there is demand for research like this study, and we think the future will use cameras and machine learning as a passive perception system for maritime ships.

CONFLICT OF INTEREST

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS

SB and CH prepared the dataset by annotating the images and also worked on applying the machine learning algorithms. RG and SB provided guidance and supervised the whole project. SB and CH wrote the initial draft of the manuscript. RG provided comments, and SB did the overseeing and final edits of the manuscript to bring it to publication form.

REFERENCES

[1] Game of Deep Learning: Computer Vision Hackathon. (n.d.). Retrieved June 5, 2022, from https://datahack.analyticsvidhya.com/contest/game-of-deep-learning/
[2] Game of Deep Learning: Ship Datasets | Kaggle. (n.d.). Retrieved June 5, 2022, from https://www.kaggle.com/datasets/arpitjain007/game-of-deep-learning-ship-datasets
[3] Roboflow, Inc. (2020). Roboflow Annotate [Computer software].
[4] Ultralytics. (n.d.). GitHub – ultralytics/yolov5: YOLOv5. Retrieved June 5, 2022, from https://github.com/ultralytics/yolov5
[5] YOLOv5 New Version Explained [May 2022]. (n.d.). Retrieved June 12, 2022, from https://blog.roboflow.com/yolov5-improvements-and-evaluation/
[6] Ultralytics. (n.d.). GitHub – ultralytics/yolov3: YOLOv3. Retrieved June 12, 2022, from https://github.com/ultralytics/yolov3
[7] PyTorch. (n.d.). GitHub – pytorch/pytorch: PyTorch v1.11. Retrieved June 12, 2022, from https://github.com/pytorch/pytorch
[8] Li, H., Deng, L., Yang, C., et al. (2021). Enhanced YOLO v3 tiny network for real-time ship detection from visual image. IEEE Access: Practical Innovations, Open Solutions, 9, 16692–16706. https://doi.org/10.1109/ACCESS.2021.3053956
[9] Kim, J.-H., Kim, N., Park, Y., et al. (2022). Object detection and classification based on YOLO-V5 with improved maritime dataset. Journal of Marine Science and Engineering, 10(3), 377. https://doi.org/10.3390/jmse10030377
[10] Tang, G., Zhuge, Y., Claramunt, C., et al. (2021). N-YOLO: A SAR ship detection using noise-classifying and complete-target extraction. Remote Sensing, 13(5), 871. https://doi.org/10.3390/rs13050871
[11] Liu, R., Wang, T., Zhou, Y., et al. (2019). A satellite image target detection model based on an improved single-stage target detection network. 2019 Chinese Automation Congress (CAC), 4931–4936. https://doi.org/10.1109/CAC48633.2019.8997495
[12] Zhang, X., Yan, M., Zhu, D., et al. (2022). Marine ship detection and classification based on YOLOv5 model. Journal of Physics: Conference Series, 2181(1), 012025. https://doi.org/10.1088/1742-6596/2181/1/012025
[13] Ting, L., Baijun, Z., Yongsheng, Z., et al. (2021). Ship detection algorithm based on improved YOLO V5. Proceedings of the 2021 6th International Conference on Automation, Control and Robotics Engineering (CACRE), Shanghai, China, September 23–25, 2021, 483–487. https://doi.org/10.1109/CACRE52464.2021.9501331
[14] Han, K., Wang, Y., Tian, Q., et al. (2020). GhostNet: More features from cheap operations. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, June 13–19, 2020, 1577–1586. https://doi.org/10.48550/arXiv.1911.11907
[15] Zhou, S., Yin, J. (2022). YOLO-Ship: A visible light ship detection method. Proceedings of the 2022 2nd International Conference on Consumer Electronics and Computer Engineering (ICCECE), Guangzhou, China, January 14–16, 2022, 113–118. https://doi.org/10.1109/ICCECE54139.2022.9712768
[16] Qiao, D., Liu, G., Dong, F., et al. (2020). Marine vessel re-identification: A large-scale dataset and global-and-local fusion-based discriminative feature learning. IEEE Access: Practical Innovations, Open Solutions, 8, 27744–27756. https://doi.org/10.1109/ACCESS.2020.2969231
[17] Ghahremani, A., Kong, Y., Bondarev, E., et al. (2019). Multi-class detection and orientation recognition of vessels in maritime surveillance. Electronic Imaging, 31(11), 266-1–266-5. https://doi.org/10.2352/ISSN.2470-1173.2019.11.IPAS-266
[18] Liu, W., Anguelov, D., Erhan, D., et al. (2015). SSD: Single shot multibox detector. arXiv. https://doi.org/10.48550/arxiv.1512.02325
[19] Tang, G., Liu, S., Fujino, I., et al. (2020). H-YOLO: A single-shot ship detection approach based on region of interest preselected network. Remote Sensing, 12(24), 4192. https://doi.org/10.3390/rs12244192
[20] Ren, S., He, K., Girshick, R., et al. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
[21] Dai, J., Li, Y., He, K., et al. (2016). R-FCN: Object detection via region-based fully convolutional networks. arXiv. https://doi.org/10.48550/arxiv.1605.06409
[22] Tan, M., Pang, R., Le, Q. (2020). EfficientDet: Scalable and efficient object detection. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, June 13–19, 2020, 10778–10787. https://doi.org/10.1109/CVPR42600.2020.01079
[23] Iancu, B., Soloviev, V., Zelioli, L., et al. (2021). ABOships—An inshore and offshore maritime vessel detection dataset with precise annotations. Remote Sensing, 13(5), 988. https://doi.org/10.3390/rs13050988
[24] Lin, T.-Y., Maire, M., Belongie, S., et al. (2014). Microsoft COCO: Common objects in context. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, September 6–12, 2014 (Vol. 8693, pp. 740–755). Springer International Publishing. https://doi.org/10.1007/978-3-319-10602-1_48
[25] Google. (2019). Google Colaboratory (Version 2022/5/20) [Computer software]. Google.