Underwater Target Detection Algorithm Based On YOLO and Swin Transformer For Sonar Images

Abstract—To detect underwater targets more effectively, this paper optimizes the one-stage detection algorithm YOLO by combining it with Swin Transformer blocks and layers. Because of the special underwater environmental conditions, the work focuses on acoustic detection methods based on sonar images. A deep learning neural network, replacing traditional detection algorithms, is introduced as the detection framework. It is verified in the paper that the combination of the lightweight network YOLO and the well-performing network Swin Transformer can achieve higher detection precision while still meeting the requirements of real-time detection on the hardware available on an Autonomous Underwater Vehicle (AUV).

Index Terms—underwater target detection, sonar images, deep learning, YOLO, Swin Transformer

I. INTRODUCTION

The complex underwater environment makes it difficult to detect the desired targets because of turbidity and weak illumination, and researchers have made great efforts to overcome these difficulties. Traditional detection methods are mainly based on manually designed features, which are strongly influenced by the experts' own experience and ability. However, with the development of deep learning and convolutional neural networks (CNN), from AlexNet [1] in 2012 to ConvNeXt [2] in 2022, target detection technology has become efficient and fast-evolving [3] and has expanded to many different fields, including underwater maritime targets [4]. CNN-based technology has gradually replaced the traditional detection methods because of its superiority in all aspects [5], and CNN has demonstrated efficiency at extracting and selecting features from images. Among the well-known CNN frameworks, You Only Look Once (YOLO) [6] is one of the most widely used lightweight real-time one-stage detectors and is suitable to be mounted on AUVs [7]; Einsidler investigated the capability of YOLO for detecting seafloor anomalies in AUV-based side-scan sonar imagery and showed its high accuracy [8].

Thanks to the great success of Transformer in the Natural Language Processing (NLP) area, researchers have been devoted to introducing it into classic computer vision (CV) tasks such as object detection [9]–[11] in order to unify NLP and CV under one framework. This effort has revealed Transformer's extraordinary performance and has even triggered a debate about whether Transformer will replace CNN to dominate CV. The attention mechanism in Transformer is inspired by the human brain and is mainly used to force the network to pay more attention to the targets instead of background or blank areas. Among all the significant models, Swin Transformer [12] has managed to surpass CNN in many CV tasks and has become one of the state-of-the-art (SOTA) models by using a more flexible and connected attention mechanism.

Sonar is a more effective underwater detection device than cameras and lidar because acoustic waves propagate better in the seawater medium. It can provide more accurate and detailed information even in dark and turbid scenarios. Target detection technology based on sonar images has the advantages of higher frequency, longer range, higher resolution, more beams, and stronger real-time performance [13], and it has already been applied to object detection [14]–[16].

Above all, the purpose of this paper is to further study and improve underwater target detection algorithms based on sonar images, balancing and improving precision and speed together by combining YOLO and Swin Transformer, so that an efficient real-time detection network suitable to be mounted on AUVs can be built.

II. TARGET DETECTION SYSTEM DESIGN

Fig. 1. General Detection System Frame

The target detection system designed in this paper is based on a combined CNN and Transformer network. Firstly, the network is trained with a certain amount of sonar images
MHSA concatenates multiple heads of self-attention and then passes the result through a linear transformation, which is given as

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\, W^O$    (3)

The structure of the network that uses Swin Transformer as the backbone, including Swin Transformer Layers and other modules, is shown in Fig. 7.
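For illustration only (not the authors' implementation; the token count, embedding size, and head count below are assumptions, e.g. one 7x7 attention window of 96-dimensional tokens), Eq. (3) together with scaled dot-product attention can be sketched in PyTorch as follows:

# Minimal multi-head self-attention sketch for Eq. (3); illustrative only.
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim=96, num_heads=3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)    # produces Q, K, V for all heads at once
        self.proj = nn.Linear(dim, dim)       # the final linear map W^O in Eq. (3)

    def forward(self, x):                     # x: (batch, tokens, dim)
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (batch, heads, tokens, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)  # Concat(head_1, ..., head_h)
        return self.proj(out)                 # apply W^O

x = torch.randn(1, 49, 96)                    # e.g. one 7x7 window of 96-dim tokens
print(MultiHeadSelfAttention()(x).shape)      # torch.Size([1, 49, 96])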
TABLE I
EXPERIMENTATION PLATFORM

Items              Version
Operating system   Windows 11
CPU                AMD Ryzen 7 5800 8-Core Processor
GPU                NVIDIA GeForce RTX 3070 Ti
Video memory       8 GB
RAM                16 GB
CUDA               CUDA v11.0, cuDNN v8.0.4
Python             3.7.11
PyTorch            1.7.0

TABLE II
TRAINING SETTINGS

Training Parameter             Value
batch size                     8
epochs                         300
input size                     (640, 640)
momentum                       0.937
weight decay                   0.0005
initial learning rate (lr0)    0.01 (SGD)
cyclical learning rate (lrf)   0.1 (cosine annealing)
warmup epochs                  3
warmup momentum                0.8
warmup bias lr                 0.1
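For reference, the settings in Tab. II correspond roughly to the following YOLOv5 [28] training invocation. This is a sketch, not the authors' exact command: sonar.yaml is a hypothetical dataset description file, and the learning-rate, momentum, and warmup values are normally placed in a hyperparameter YAML inside the repository rather than on the command line.

# Hedged sketch of launching YOLOv5 training with the Tab. II settings.
# Run from inside the cloned Ultralytics YOLOv5 repository [28]; "sonar.yaml" is a
# hypothetical dataset config, and lr0/lrf/momentum/warmup live in a YAML passed via --hyp.
import subprocess

subprocess.run([
    "python", "train.py",
    "--img", "640",                # input size (640, 640)
    "--batch", "8",                # batch size
    "--epochs", "300",             # training epochs
    "--data", "sonar.yaml",        # hypothetical sonar dataset description
    "--weights", "yolov5s.pt",     # start from the pretrained small model
], check=True)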
C. Evaluation of Detection Network

Before training, we need to clarify some metrics used to evaluate our network.

The confusion matrix is often used to define the evaluation indexes. To understand it (Tab. III), take the detection of anchors as an example: TP means the prediction given by the network is an anchor and the target actually is an anchor; FP means the network predicts an anchor but the target is not one; TN means the network predicts a non-anchor and the target is indeed not an anchor; and FN means the detection network misses the target.

TABLE III
CONFUSION MATRIX

                            Prediction
                            Positive                Negative
Ground Truth   Positive     True Positive (TP)      False Negative (FN)
               Negative     False Positive (FP)     True Negative (TN)

• Precision

$\mathrm{Precision} = \dfrac{TP}{TP + FP}$    (4)

• Recall

$\mathrm{Recall} = \dfrac{TP}{TP + FN}$    (5)

• mAP

Mean Average Precision (mAP) stands for the area under the curve (AUC) of the P-R curve, i.e., the mean value of the AP of all classes under a fixed Intersection over Union (IoU) threshold. mAP 0.5 means setting the IoU threshold to 0.5, and mAP 0.5:0.95 means averaging the mAP over IoU thresholds from 0.5 to 0.95 with a step of 0.05.

• F1

$F_\beta$ can also reflect the relationship between P and R:

$\dfrac{1}{F_\beta} = \dfrac{1}{1+\beta^2}\left(\dfrac{1}{P} + \dfrac{\beta^2}{R}\right)$    (6)

We can set $\beta$ to different values to balance the importance of precision and recall; setting it to 1 gives the harmonic mean of precision and recall.
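As a small worked illustration of Eqs. (4)–(6) (ours, not part of the paper's toolchain; the counts in the example are made up), the metrics can be computed directly from the confusion-matrix entries:

# Precision, Recall and F-beta from confusion-matrix counts, following Eqs. (4)-(6).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    # Eq. (6): 1/F_beta = 1/(1+beta^2) * (1/P + beta^2/R), solved for F_beta
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Example: 90 anchors detected correctly, 10 false alarms, 5 anchors missed.
p, r = precision(90, 10), recall(90, 5)
print(f"P={p:.3f}  R={r:.3f}  F1={f_beta(p, r):.3f}")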
V. RESULTS

YOLOv5s [28] is used as our baseline and provides an intuitive reference result. We then compare the three methods introduced before and find that they can all improve the performance of the network to varying degrees.

A. Performance of YOLOv5s

YOLOv5s is the small version of the YOLOv5 family. It has only 7.02 M parameters and can infer a single image in 3.5 ms at shape (8, 3, 640, 640) in our environment with one NVIDIA 3070 Ti.
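For context, the YOLOv5s baseline can be pulled and run directly through PyTorch Hub [28]. This is a minimal usage sketch, not the authors' detection pipeline, and sonar_sample.png is a hypothetical input image:

# Minimal YOLOv5s inference sketch via PyTorch Hub [28]; "sonar_sample.png" is hypothetical.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.conf = 0.25                       # confidence threshold for reported detections
results = model("sonar_sample.png")     # accepts file paths, URLs, PIL images or numpy arrays
results.print()                         # classes, confidences and speed summary
print(results.pandas().xyxy[0])         # bounding boxes as a pandas DataFrame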
1) Visualization Results:

As shown in Fig. 11, when we detect underwater targets using a moving AUH, sonar images are shown on the screen, and the network we designed can directly draw a bounding box around the target we want and report its class together with the confidence of how likely it is to be that target.

Fig. 11. Bounding Boxes, Confidence, and Classes

Anchors and pillars can be detected well by YOLOv5s, but it may miss the ship shown directly above because of the lack of ship samples. From Fig. 11 we can see the shortcoming of YOLOv5s, so it is necessary to improve the accuracy of this one-stage detection network.

2) Quantitative Results:

After 300 epochs, which took 6.5 hours, we obtain the more detailed results shown below.

The network converges well and the training succeeds according to Fig. 12.

Through the confusion matrix shown in Fig. 13, we can directly see which targets are predicted well and which are not. Given the lack of ship and aircraft images,
B. Comparison of Three Models

1) Work on C3:

We set up four groups of experiments on modifying C3, including replacing the C3 module in the last layer of the backbone, and also in the neck, with C3STR and C3MHSA. The evaluation indexes we care about are shown in Tab. IV.
TABLE IV
COMPARISON BETWEEN C3, C3MHSA, AND C3STR
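The paper's exact C3STR and C3MHSA definitions are not reproduced here; as a rough sketch of the idea (our illustration, modeled on YOLOv5's C3 block with its bottlenecks swapped for multi-head self-attention; the channel counts, head count, and class names are assumptions), such a module could look like the following:

# Rough sketch of a C3-style block whose bottlenecks are replaced by multi-head
# self-attention (here called C3MHSA); illustrative, not the paper's exact module.
import torch
import torch.nn as nn

class Conv(nn.Module):
    # Conv2d + BatchNorm + SiLU, the standard convolution unit in YOLOv5 [20]-[22]
    def __init__(self, c1, c2, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class MHSABlock(nn.Module):
    # Self-attention over the flattened feature map, with a residual connection
    def __init__(self, c, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(c, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(c)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)      # (B, H*W, C) token sequence
        seq = seq + self.attn(self.norm(seq), self.norm(seq), self.norm(seq))[0]
        return seq.transpose(1, 2).reshape(b, c, h, w)

class C3MHSA(nn.Module):
    # Two 1x1 branches as in C3: one passes through attention blocks, then concat + 1x1 fuse
    def __init__(self, c1, c2, n=1):
        super().__init__()
        c_ = c2 // 2
        self.cv1, self.cv2 = Conv(c1, c_), Conv(c1, c_)
        self.m = nn.Sequential(*(MHSABlock(c_) for _ in range(n)))
        self.cv3 = Conv(2 * c_, c2)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

y = C3MHSA(512, 512, n=1)(torch.randn(1, 512, 20, 20))
print(y.shape)                                  # torch.Size([1, 512, 20, 20])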
Fig. 16. mAP of Four Models: (a) mAP 0.5; (b) mAP 0.5:0.95

Fig. 17. Bbox Loss, Class Loss and Object Loss

2) Work on Backbone:

We only use Dataset1 to analyze the idea of replacing CSPDarknet with Swin Transformer. The results are shown in Tab. V.

TABLE V
COMPARISON BETWEEN YOLOV5S AND YOLOV5S+SWIN

Structure (Dataset1)   YOLOv5s    YOLOv5s+Swin
mAP(0.50)              0.990      0.991
mAP(0.50:0.95)         0.843      0.861 (+1.8%)
Weights Size           27.1 MB    118 MB

mAP has increased by 1.8%, but the model size has also expanded to more than four times that of the original model. This is predictable because of the design of Swin Transformer. Accordingly, directly inserting this large model into the backbone is not an optimal choice for our AUV system, given the lightweight requirement imposed by the limited hardware.

VI. CONCLUSIONS

Experiments show that Swin Transformer and its core modules (Window and Shifted-Window Multi-Head Self-Attention) can improve the performance of neural networks and, if used properly, can boost the development of target detection technology, especially in complex environments such as underwater scenarios. The combination of YOLO and Swin Transformer blocks achieves higher accuracy without increasing weights or inference time, which is of great significance for future hardware implementation. We can obtain at most 2.2% higher precision than YOLOv5 and 0.8 s less time at the same time. Our next step will be to adopt the methods mentioned in the paper and carry out sea trials to validate and compare them.

However, the large size of Swin Transformer causes some problems, and directly replacing YOLO's backbone with it diminishes the lightweight advantage of YOLO. The key to the widespread use of a real-time target detection framework designed for underwater scenarios is to balance speed and accuracy. Therefore, taking Swin Transformer apart to minimize the model size as before, forming C3MHSA and C3STR, is a better direction to be analyzed and exploited in depth. More efficient methods of combining YOLO and Transformer will also be tested and compared in future work.

Furthermore, we will continue to optimize our network, for example by adding more attention mechanism modules such as CBAM [29] and CA [30] to make it faster, lighter, and more accurate, and we will also use networks such as GAN [31] to pre-process sonar images in order to solve the long-standing problems of dataset shortage and poor image quality that plague underwater deep learning and target detection.

REFERENCES

[1] Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Neural Inf. Process. Syst. 2012, 25, doi:10.1145/3065386.
[2] Liu Z, Mao H, Wu C Y, et al. A ConvNet for the 2020s[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 11976-11986.
[3] Valdenegro-Toro M. Best practices in convolutional networks for forward-looking sonar image recognition[J]. OCEANS 2017 - Aberdeen, 2017: 1-9.
[4] Galusha A, Dale J, Keller J M, et al. Deep convolutional neural network target classification for underwater synthetic aperture sonar imagery[C]//Detection and Sensing of Mines, Explosive Objects, and Obscured Targets XXIV. SPIE, 2019, 11012: 18-28.
[5] Valdenegro-Toro M. Objectness Scoring and Detection Proposals in Forward-Looking Sonar Images with Convolutional Neural Networks[J]. Artificial Neural Networks in Pattern Recognition, 2016: 209-219.
[6] Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real-Time Object Detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 779-788.
[7] Kim J, Cho H, Pyo J, et al. The convolution neural network based agent vehicle detection using forward-looking sonar image[C]//OCEANS 2016 MTS/IEEE Monterey. 2016: 1-5.
[8] Einsidler D, Dhanak M, Beaujean P P. A deep learning approach to target recognition in side-scan sonar imagery[C]//OCEANS 2018 MTS/IEEE Charleston. IEEE, 2018: 1-4.
[9] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[J]. arXiv:2010.11929, 2021.
[10] Carion N, Massa F, Synnaeve G, et al. End-to-end object detection with transformers[C]//European Conference on Computer Vision. Springer, Cham, 2020: 213-229.
[11] Zhu X, Su W, Lu L, et al. Deformable DETR: Deformable Transformers for end-to-end object detection[J]. arXiv preprint arXiv:2010.04159, 2020.
[12] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo,
B. (2021). Swin Transformer: Hierarchical Vision Transformer using
Shifted Windows. 2021 IEEE/CVF International Conference on Com-
puter Vision (ICCV), 9992-10002.
[13] Dobeck G J. Algorithm fusion for the detection and classification of sea
mines in the very shallow water region using side-scan sonar imagery[C].
In: Aerosense. 2000.
[14] Wang J, Shan T, Chandrasekaran M, et al. Deep learning for detection
and tracking of underwater pipelines using multibeam imaging sonar.
Proceedings of the IEEE International Conference on Robotics and
Automation Workshop; 2019.
[15] Lee S, Park B, Kim A. Deep learning from shallow dives: sonar
image generation and training for underwater object detection.
arXiv:1810.07990, preprint; 2018.
[16] Sung M, Kim J, Lee M, et al. Realistic sonar image simulation using
deep learning for underwater object detection. Int J Control Autom Syst.
2020;18(3):523–534.
[17] Weiss K, Khoshgoftaar T M, Wang D D. A survey of transfer learning[J].
Journal of Big data, 2016, 3(1): 1-40.
[18] Deng J, Dong W, Socher R, et al. Imagenet: A large-scale hierarchical
image database[C]//2009 IEEE conference on computer vision and
pattern recognition. Ieee, 2009: 248-255.
[19] Bochkovskiy A, Wang C Y, Liao H Y M. Yolov4: Optimal speed and
accuracy of object detection[J]. arXiv preprint arXiv:2004.10934, 2020.
[20] Wang, C.; Liao, H.M.; Wu, Y.; Chen, P.; Hsieh, J.; Yeh, I. CSPNet:
A New Backbone that can Enhance Learning Capability of CNN.In
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19
June 2020; pp. 1571–1580.
[21] Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network
training by reducing internal covariate shift. In Proceedings of the
32nd International Conference on International Conference on Machine
Learning (ICML), Lille, France, 6–11 July 2015; pp. 448–456.
[22] Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-weighted linear units for
neural network function approximation in reinforcement learning. Neural
Netw. 2018, 107, 3–11. [CrossRef] [PubMed]
[23] A. Srinivas, T. -Y. Lin, N. Parmar, J. Shlens, P. Abbeel and A. Vaswani,
”Bottleneck Transformers for Visual Recognition,” 2021 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2021,
pp. 16514-16524, doi: 10.1109/CVPR46437.2021.01625.
[24] ZHOU Y, CHEN S, WU Ke, NING M, CHEN H, ZHANG P. SCTD
1.0: Sonar Common Target Detection Dataset[J]. Computer Science,
2021, 48(11A): 334-339.
[25] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J].
Advances in neural information processing systems, 2017, 30.
[26] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object
detection[C]//Proceedings of the IEEE conference on computer vision
and pattern recognition. 2017: 2117-2125.
[27] Wang K, Liew J H, Zou Y, et al. Panet: Few-shot image semantic seg-
mentation with prototype alignment[C]//Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2019: 9197-9206.
[28] Ultralytics-YOLOv5. Available online:
https://github.com/ultralytics/yolov5 (accessed on 1 January 2021).
[29] Woo S, Park J, Lee J Y, et al. Cbam: Convolutional block attention
module[C]//Proceedings of the European conference on computer vision
(ECCV). 2018: 3-19.
[30] Hou Q, Zhou D, Feng J. Coordinate attention for efficient mobile net-
work design[C]//Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition. 2021: 13713-13722.
[31] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial
nets[J]. Advances in neural information processing systems, 2014, 27.