Underwater Object Detection Using YOLOv4
Mohamed Syazwan Asyraf Bin Rosli1, Iza Sazanita Isa1, Mohd Ikmal Fitri Maruzuki1, Siti Noraini Sulaiman1, Ibrahim Ahmad2
1School of Electrical Engineering, College of Engineering, Universiti Teknologi MARA,
Abstract—
Underwater computer vision systems have been widely used for many underwater applications such as ocean exploration, biological research, and monitoring the sustainability of underwater life. However, the underwater environment presents several challenges, such as water murkiness, dynamic backgrounds, low light, and low visibility, which limit the ability to explore this area. To overcome these challenges, it is crucial to improve underwater vision systems so that they can adapt efficiently to varying environments. It is therefore of great significance to propose an efficient and precise underwater detector based on YOLOv4, a deep learning algorithm. In this research, an open-source underwater dataset was used to investigate YOLOv4 performance in terms of precision and processing speed (FPS). The results show that YOLOv4 achieves a remarkable mean average precision of 97.96% at 46.6 frames per second. This study shows that the YOLOv4 model is highly suitable for implementation in underwater vision systems, as it is able to accurately detect underwater objects in hazy and low-light environments.

Keywords—Underwater detection, computer vision, YOLOv4, mean average precision, real-time

I. INTRODUCTION

In recent years, various types of underwater vision systems have been developed for practical integration with underwater vehicles such as the autonomous underwater vehicle (AUV) and the remotely operated vehicle (ROV). The vast demand for such underwater vision systems is driven by the need to collect huge amounts of data. Furthermore, underwater vision systems are applicable to analyzing and understanding in-depth criteria that range from the inspection of physical oceanography to the identification and counting of marine life for biological research. Most modern underwater research is equipped with devices such as cameras that can withstand underwater pressure and corrosion and, most importantly, are waterproof. With this rapid development, underwater vision systems have kept pace by pairing these devices with numerous computer vision and artificial intelligence algorithms that help accelerate practical research.

The integration of object detection with deep learning is one of the applications used in underwater vision systems. Object detection involves training a classifier to understand and learn semantic, high-level features in order to classify different images. Conceptually, object detection precisely estimates the desired object and locates its position in each image [1].

From the literature, numerous developments have been accomplished and have produced excellent results despite underwater challenges such as variation in lighting and water murkiness. Implemented through deep architectures with numerous features, deep learning models have the capability to achieve high performance, primarily in the area of computer vision and especially for underwater object detection.

Technically, modern machine learning utilizes the Convolutional Neural Network (CNN) as its base network. Because traditional machine learning requires domain expertise and human intervention, many researchers prefer deep learning for its flexibility and its superior accuracy in certain applications [2], [3]. In addition, many comparative studies have proven that the performance of CNN-based deep learning surpasses traditional methods [4], [5]. A main reason is that object detection using deep learning with a CNN combines classification with object localization. CNNs also benefit image classification because they learn by assigning weights and biases to the various objects in an image. Generally, modern object detection falls into two types: multistage detection and single-stage detection. The Region-based Convolutional Neural Network (R-CNN) [6] is the pioneer multistage detector, while Faster R-CNN [7] is its latest improvement. Even though Faster R-CNN is able to achieve good accuracy, it has limited ability to reach sufficient speed for real-time implementation [8].

In recent years, You Only Look Once (YOLO) has arisen as one of the most popular architectures for single-stage detection, famous for being efficient, fast, and accurate [8], [9]. The YOLO architecture has four versions: YOLOv1 [10], YOLOv2 [11], YOLOv3 [12], and the latest, YOLOv4 [13]. The latest improvements in YOLOv4 optimize both speed and accuracy, and the model uses CSPDarknet53 as its backbone. This backbone enhances the learning capability of the CNN and helps to build a robust object detection model, especially for underwater computer vision. In addition, a block called Spatial Pyramid Pooling (SPP) was added to the backbone to increase the receptive field and capture the most significant features, which benefits the detection of objects under varying visibility.
In this paper, the single-stage detector YOLOv4 was trained and tested on an underwater dataset to assess the model's robustness in detecting objects under several challenges, including varying visibility. The dataset used was The Brackish Dataset, which is composed of six different classes of underwater animals [14]. This dataset is challenging because it was recorded 9 meters below the surface of a brackish strait in the northern part of Denmark. The YOLOv4 model's performance was evaluated using two major measures: mean Average Precision (mAP) and Frames Per Second (FPS). This paper is divided into several sections: Section II covers related research and implementation, Section III elaborates on the method used to train and test the YOLOv4 model, and Section IV presents the results and discussion of underwater detection in terms of speed and accuracy.
II. RELATED RESEARCH AND IMPLEMENTATION

Considering its high-efficiency performance in object detection, YOLO has been implemented by many researchers for underwater detection, which involves more challenging environments, especially murky water and low-light surroundings. For example, Xu et al. [15] utilized YOLOv3 for underwater fish detection in waterpower applications. The datasets used to train and test the model were very challenging, with high turbidity, high velocity, and murky water, as the three datasets were recorded at marine and hydrokinetic energy projects and river hydropower projects. Training and testing of the model showed adequate results, with a mean average precision (mAP) of 53.92%. Apart from underwater animal detection, underwater computer vision has also been used for other underwater purposes. One of them is the detection of underwater pipeline leakage proposed by X. Zhao et al. [16]. That research used the YOLOv3 algorithm with a total of 900 three-channel images as the dataset to locate the oil spill point of an underwater pipeline. The trained model achieved 77.5% leakage-point detection accuracy at 36 frames per second.

Meanwhile, M. Fulton et al. [17] proposed robotic detection of marine litter for an AUV system using several detection models, namely YOLOv2, Tiny-YOLO, Faster R-CNN, and the Single Shot Detector (SSD). The research concluded that YOLOv2 strikes the best balance between detection accuracy and processing speed. Another study that adopts YOLO as the architecture for object detection, aimed at underwater sustainability, was proposed by Wu et al. [18]. To overcome challenges such as light absorption and low visibility in turbid waters, this research implemented YOLOv4 to detect underwater rubbish using an ROV. The YOLOv4 model was trained using 1120 images from three different sources: captured by phone, captured by the ROV, and scraped from the internet. The study claimed that the trained model is "fast and effective", achieving an mAP of 82.7%. Furthermore, the proposed system was successfully implemented on hardware, namely an ROV, to detect rubbish underwater.

The YOLO algorithm has also been combined with other algorithms that help enhance its capability in detecting underwater objects. Mohamed et al. [19] utilized YOLOv3 for fish detection and tracking in fish farms. In that study, pre-processing of the underwater images was performed using the Multi-Scale Retinex (MSR) algorithm, while an optical flow algorithm was used to track fish. The results show that the model is able to track fish trajectories with the help of YOLO, compared to without YOLO. Another hybrid algorithm was proposed by A. Jalal et al. [20], combining optical flow and a Gaussian mixture model (GMM) with the YOLOv3 algorithm. The study revealed that the GMM and optical flow alone failed to produce an acceptable score for fish detection compared to YOLO; however, further enhancement of the YOLO model increased the score by around 5% in F1-score. All these proposed hybrid systems can achieve better accuracy, but they require relatively high computational power due to the complex mixture of algorithms and may therefore suffer from poor real-time performance.

As discussed above, several advantages highlight the effectiveness of YOLO models for underwater implementation, especially in terms of detection precision and processing speed. The CNN-based feature extraction layers lead to good precision, while the ability to achieve a high frame rate is due to the single-stage detection scheme. From the literature on underwater computer vision applications and research, YOLOv3 is the most popular algorithm in the YOLO family; its breakthrough in deep learning-based computer vision has yielded many applications, especially in underwater detection. The recent YOLO advancement, YOLOv4, is still new, and its underwater application is still limited, especially for underwater animal detection. The motivation of this work is therefore to study and utilize YOLOv4 for detecting underwater creatures using a challenging underwater dataset, testing YOLOv4's capability in terms of precision and real-time application.

III. METHODOLOGY

This section presents the acquisition and preparation of the underwater dataset for the YOLOv4 architecture, a detection model based on a Deep Convolutional Neural Network (DCNN). In addition, this section describes the evaluation of YOLOv4 performance on the underwater dataset. Generally, the overall proposed work in this study is presented in Fig. 1.

Fig. 1. Overall proposed object detection of YOLOv4
A. Dataset Acquisition

In object detection, the majority of datasets are in the form of images. A dataset contains a large number of images that are used to train an algorithm, with the goal of learning the detailed features in every image; the algorithm is then able to find the most common or predictable patterns of the dataset as a whole. In this study, an underwater open-source dataset [14] was applied to build the YOLOv4 detection system for an underwater application and to investigate the model's performance within the scope of this challenging dataset.
The dataset was taken from The Brackish Dataset [14], which contains six underwater categories, namely big fish, jellyfish, crab, shrimp, small fish, and starfish. With several environmental effects such as variation in luminosity, water murkiness, and low resolution, this challenging dataset was recorded in the brackish strait of Limfjorden, which runs through Aalborg, Denmark. The dataset consists of 10,995 annotated files and 14,518 images extracted from recorded videos. The dataset was separated into 80% for training, 10% for validation, and 10% for testing. The features and description of the dataset are summarized in Table I.

TABLE I. THE BRACKISH DATASET FEATURES AND DESCRIPTION

Dataset Feature | Description
Annotated images | 25,613 annotations; 10,995 images with annotations and 3,523 images without annotations (background only)
Number of classes | 6 (Big fish, Jellyfish, Crab, Shrimp, Small fish and Starfish)
Training | 11,614 images
Validation | 1,452 images
Testing | 1,452 images
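For orientation, an 80/10/10 split of this kind can be produced with a short script. The sketch below is illustrative only: the folder and output file names are hypothetical and not taken from the dataset's own tooling; it simply writes the plain-text image lists that Darknet reads for training and validation.

```python
import random
from pathlib import Path

# Hypothetical folder of extracted frames; adjust to the actual dataset layout.
images = sorted(Path("brackish/images").glob("*.png"))
random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.8 * n), int(0.1 * n)
splits = {
    "train.txt": images[:n_train],                 # 80% training
    "valid.txt": images[n_train:n_train + n_val],  # 10% validation
    "test.txt": images[n_train + n_val:],          # 10% testing
}

# Darknet expects newline-separated image paths in plain-text list files.
for name, subset in splits.items():
    Path(name).write_text("\n".join(str(p) for p in subset))
```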
Since YOLO is a supervised learning method, this dataset was annotated using the standardized YOLO bounding-box annotation format. Each image has its own annotation in a .txt file. The YOLO annotation format consists of five components: object-id, center x, center y, width, and height. The object-id represents the class number, while center x and center y are the coordinates of the center point of the bounding box. The width and height represent the size of the bounding box.
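To make the format concrete, the sketch below converts one pixel-coordinate box into a YOLO annotation line. The helper function, class id, and example values are illustrative and not taken from the paper; in the standard YOLO convention the center coordinates and box size are normalized by the image width and height.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box to a YOLO annotation line:
    '<object-id> <center-x> <center-y> <width> <height>', all normalized to [0, 1]."""
    cx = (x_min + x_max) / 2.0 / img_w
    cy = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Example: a box for a hypothetical class id 2 in a 960x540 frame.
print(to_yolo_line(2, 400, 250, 520, 330, 960, 540))
# -> "2 0.479167 0.537037 0.125000 0.148148"
```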
B. Deep Learning Framework and Training Platform

A neural network framework provides flexible APIs and configuration options for performance optimization and is designed to facilitate and speed up the training of deep learning models [21]. In this study, the neural network framework used was Darknet, an open-source framework written in C and CUDA. Using this framework also allows training and detection to be executed on a Graphics Processing Unit (GPU), which is faster than using a Central Processing Unit (CPU).

YOLOv4 was trained and tested using a Jupyter notebook in Google Colaboratory. "Colab", for short, is an open platform that allows users to write and execute Python and is widely used for machine learning since it provides free and powerful computing resources, including GPUs [22]. In this paper, a Tesla T4 was chosen as the GPU to train and test the YOLOv4 model. In addition, Google Drive was connected to Colab so that the training weights could be saved at particular iterations. The overall process of the deep learning framework and training platform is depicted in Fig. 2.

Fig. 2. Overall proposed object detection of YOLOv4
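As a rough illustration of how such a Colab session drives Darknet, the sketch below launches training from Python. The file paths are placeholders rather than the paper's actual file names, and it assumes the AlexeyAB Darknet fork has already been compiled with GPU support.

```python
import subprocess

# Placeholder paths: obj.data points to the class list, train/valid lists and a backup
# folder; yolov4-brackish.cfg is the edited configuration; yolov4.conv.137 holds the
# pre-trained convolutional weights commonly used to initialize YOLOv4 training.
cmd = [
    "./darknet", "detector", "train",
    "data/obj.data",
    "cfg/yolov4-brackish.cfg",
    "yolov4.conv.137",
    "-dont_show",  # no GUI window on a headless Colab runtime
    "-map",        # periodically evaluate mAP on the validation set
]
subprocess.run(cmd, check=True)
```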
C. YOLOv4 Model Execution

YOLOv4 is a single-stage detector in which the network is separated into four sections, namely input, backbone, neck, and dense prediction, as shown in Fig. 3. Since it is supervised learning, labelled images with bounding boxes must be fed as input during training. The backbone of YOLOv4 is defined as the essential feature-extraction architecture. The backbone is still based on the original YOLOv3 backbone, Darknet53, but with an improvement that utilizes Cross-Stage-Partial (CSP) connections [23], after which the backbone is called CSPDarknet53.

Next is the neck section, which mixes and combines the features from the backbone before they are fed forward for detection. YOLOv4's authors picked a modified version of the Path Aggregation Network (PANet) [24] as the neck of the architecture. Apart from that, YOLOv4 also adopts Spatial Pyramid Pooling (SPP) [25]; before a feature moves to the fully connected layer for prediction, it needs to be flattened first. The final section, dense prediction, also known as the head, plays an important role in producing the final predictions and locating the bounding boxes. YOLOv4 deploys the same head as YOLOv3, in which the network detects the bounding-box coordinates as well as a confidence score for a specific class.
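To illustrate the SPP idea mentioned above, the sketch below implements an SPP-style block in PyTorch: parallel max-pooling at several kernel sizes with stride 1, concatenated with the input, which enlarges the receptive field without changing the spatial resolution. This is a simplified illustration rather than the Darknet implementation; the kernel sizes 5, 9, and 13 are the commonly used choice for YOLOv4-style SPP and are assumed here.

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """Spatial Pyramid Pooling: pool the same feature map at several scales
    and concatenate the results along the channel dimension."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # stride 1 and symmetric padding keep the spatial size unchanged
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# Example: a 512-channel, 13x13 feature map becomes 2048 channels after SPP.
features = torch.randn(1, 512, 13, 13)
print(SPPBlock()(features).shape)  # torch.Size([1, 2048, 13, 13])
```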
Before training began, the YOLOv4 configuration file was also modified to define several parameters used during training. The parameters set in the configuration file are listed in Table II. Training was set to run for 12,000 iterations. In machine learning, a test dataset is used to assess the performance of the final trained model so as to provide an unbiased evaluation. For measuring processing speed, the trained model was run on a test video and the FPS was assessed.

TABLE II. YOLOV4 TRAINING PARAMETERS CONFIGURATION

Parameter | Value
Batch size | 64
Subdivisions | 16
Width | 416
Height | 416
Momentum | 0.949
Decay | 0.0005
Learning rate | 0.001
Activation | Mish
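In Darknet, most of these values live in the [net] section of the .cfg file, along with max_batches for the iteration budget, so preparing a run usually means editing a copy of the template configuration. The sketch below shows one possible way to apply the Table II values programmatically; the file names are placeholders and max_batches=12000 mirrors the iteration count stated above.

```python
import re

# Table II values plus the iteration budget; these keys appear in the [net] section
# of a Darknet .cfg file. 'activation=mish' is set per convolutional layer in the
# cfg rather than in [net], so it is not patched here.
params = {
    "batch": 64,
    "subdivisions": 16,
    "width": 416,
    "height": 416,
    "momentum": 0.949,
    "decay": 0.0005,
    "learning_rate": 0.001,
    "max_batches": 12000,
}

# Placeholder file names: start from a YOLOv4 template and write an edited copy.
with open("cfg/yolov4-custom.cfg") as f:
    cfg = f.read()

for key, value in params.items():
    # replace lines such as "batch=64", matching only lines that start with the key
    cfg = re.sub(rf"(?m)^{key}\s*=.*$", f"{key}={value}", cfg)

with open("cfg/yolov4-brackish.cfg", "w") as f:
    f.write(cfg)
```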
D. Performance Evaluation

To evaluate the performance of the YOLO model, the evaluation criteria were measured and calculated based on five common evaluation metrics. Precision and Recall are defined in Eq. 1 and Eq. 2, respectively. Precision indicates how many of all the instances predicted as a particular class actually belong to that class; it reflects the robustness of detection, since high precision means more true detections than false detections from the trained model. Recall, meanwhile, determines the ability of the model to find all relevant instances in the dataset. High precision indicates a low number of false positives, while high recall corresponds to a small number of false negatives. Another evaluation measure is the F1-measure, shown in Eq. 3, which represents the harmonic mean of precision and recall.
\[
\mathrm{Precision} = \frac{\mathrm{True\ Positive\ (TP)}}{\mathrm{True\ Positive\ (TP)} + \mathrm{False\ Positive\ (FP)}} \tag{1}
\]
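Recall (Eq. 2) and the F1-measure (Eq. 3), referenced above, follow their standard definitions, given here for completeness; FN denotes a false negative:

\[
\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}
\]

\[
\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3}
\]

The mAP@0.5 value reported in the results is, in the standard sense, the average precision over all classes, with an intersection-over-union threshold of 0.5 used to decide whether a detection counts as a true positive.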
IV. RESULTS AND DISCUSSIONS

Overall, an excellent result was achieved for a single-stage deep learning based object detector. The results shown in Table III were obtained on the test dataset, which is considered challenging for the model. This dataset also provides an unbiased representation of how the trained YOLOv4 model reacts to "never seen" images. Fig. 4 shows the training curve of mAP versus iteration. The model started to converge with good performance at the 4,000th iteration and reached a stagnant performance at the 10,000th iteration.

Fig. 4. mAP@0.5 versus iteration

TABLE III. YOLOV4 PERFORMANCE METRICS

Performance Metrics | Result (%)
Precision | 94.00
Recall | 97.00
F1-Score | 95.48
mAP @ 0.5 | 97.96

Fig. 5. Detection output (a) – (d): ability to detect underwater life and differentiate classes

This work was supported by grants from the Ministry of Higher Education (MOHE), Malaysia, under FRGS grant 600-RMI/FRGS 5/3 (291/2019).