
Underwater Target Detection Algorithm Based on YOLO and Swin Transformer for Sonar Images

1st Ruoyu Chen, Ocean College, Zhejiang University, Zhoushan 316021, China ([email protected])
2nd Shuyue Zhan, Ocean College, Zhejiang University, Zhoushan 316021, China (shuyue [email protected])
3rd Ying Chen, Ocean College, Zhejiang University, Zhoushan 316021, China ([email protected])
OCEANS 2022, Hampton Roads. 978-1-6654-6809-1/22/$31.00 ©2022 IEEE. DOI: 10.1109/OCEANS47191.2022.9976986

Abstract—To detect underwater targets more effectively, this paper optimizes the one-stage detection algorithm YOLO by combining it with Swin Transformer blocks and layers. Because of the special underwater environmental conditions, the task mainly focuses on acoustic detection methods based on sonar images. A deep learning neural network replacing the traditional detection algorithm is introduced as the detection framework. The paper verifies that the combination of the lightweight network YOLO and the well-performing network Swin Transformer can achieve higher detection precision while meeting the requirements of real-time detection on an Autonomous Underwater Vehicle (AUV)'s available hardware.

Index Terms—underwater target detection, sonar images, deep learning, YOLO, Swin Transformer

I. INTRODUCTION

The complex underwater environment makes it difficult to detect the desired targets because of turbidity and weak illumination. Researchers have made a lot of effort to overcome these difficulties. Traditional detection methods are mainly based on manually designed features, which are strongly influenced by the experts' own experience and ability. However, with the development of deep learning and convolutional neural networks (CNN), from AlexNet [1] in 2012 to ConvNeXt [2] in 2022, target detection technology has become efficient and fast-evolving [3] and has expanded to many different fields, including underwater maritime targets [4]. Technology based on CNN has gradually replaced the traditional detection methods because of its superiority in all aspects [5]. CNN has demonstrated efficiency at extracting and selecting features from images. Among the great CNN frameworks, You Only Look Once (YOLO) [6] is one of the most widely used lightweight real-time one-stage detectors and is suitable to be mounted on AUVs [7]; Einsidler investigated the capability of YOLO for detecting seafloor anomalies in AUV-based side-scan sonar imagery and showed its high accuracy [8].

Thanks to the great success of the Transformer in the Natural Language Processing (NLP) area, researchers have been devoted to introducing it to classic computer vision (CV) missions such as object detection [9]–[11], in order to unify NLP and CV in one framework. This effort has revealed the Transformer's extraordinary performance and has even triggered a great debate over whether the Transformer will replace CNN to dominate CV. The attention mechanism in the Transformer is inspired by the human brain and is mainly used to force the network to pay more attention to the targets instead of the background or blank areas. Among all of the significant models, Swin Transformer [12] has managed to exceed CNN in many CV missions and has become one of the state-of-the-art (SOTA) models by using a more flexible and connected attention mechanism.

Sonar is a more effective underwater detection device than camera or lidar because of the better propagation of acoustic waves in the seawater medium. It can provide more accurate and detailed information even in dark and turbid scenarios. Target detection technology based on sonar images has the advantages of higher frequency, longer range, higher resolution, more beams, and stronger real-time capability [13], and sonar images have already been applied to object detection [14]–[16].

Above all, the purpose of this paper is to further study and improve underwater target detection algorithms based on sonar images, balancing and improving precision and speed together by combining YOLO and Swin Transformer, so that an efficient real-time detection network suitable to be mounted on AUVs can be built.

II. TARGET DETECTION SYSTEM DESIGN

Fig. 1. General Detection System Frame
The target detection system designed in this paper is based on a combined CNN and Transformer network. Firstly, the network is trained with a certain amount of sonar images including some common underwater targets such as shipwreck, aircraft, anchor, and so on. However, in order to diminish the network's poor performance and low precision caused by the lack of sonar images, transfer learning [17] is used to improve the accuracy: the weights of a network trained on the large ImageNet dataset [18] are transferred to our network used underwater. Also, data augmentation tricks such as Mosaic [19], flipping, mix-up, and rotation are all used before training to improve the performance of the detector. After the network is trained, the detector is deployed on an embedded development board loaded on our AUV to fulfill the mission of real-time detection. The whole workflow is shown in Fig. 1.
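To make the transfer-learning step concrete, the following is a minimal PyTorch sketch of copying matching ImageNet-pretrained weights into a detection model. The function name and the checkpoint-format handling are our own illustration, not the authors' code; only tensors whose names and shapes match are transferred, and the remaining layers (for example a sonar-specific detection head) keep their random initialization.

```python
import torch


def transfer_pretrained_weights(model: torch.nn.Module, checkpoint_path: str) -> int:
    """Copy every tensor from an ImageNet-pretrained checkpoint whose name and
    shape match the detection model; return how many tensors were transferred."""
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    pretrained = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt
    model_state = model.state_dict()
    matched = {k: v for k, v in pretrained.items()
               if k in model_state and v.shape == model_state[k].shape}
    model_state.update(matched)
    model.load_state_dict(model_state)
    return len(matched)
```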
III. MAIN METHODS

Basically, we want to combine YOLO and Swin Transformer; more precisely, we try to introduce the attention mechanism into a basic but fast one-stage detection architecture to keep the speed while achieving better accuracy.

Three main methods are proposed in the paper and compared. The methods are based on improving the backbone, which is CSPDarknet in the original YOLOv5. Cross Stage Partial (CSP) [20] is used for feature extraction and selection, and CBS is its basic module, consisting of convolution layers, batch normalization [21], and activation layers using the Sigmoid Weighted Linear Unit (SiLU) [22], which is given as

SiLU(x) = x × 1 / (1 + e^(-x))    (1)
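For reference, here is a minimal PyTorch sketch of a CBS-style block (convolution, batch normalization, and the SiLU activation of Eq. (1)). The class name and default arguments are illustrative rather than taken from the YOLOv5 source.

```python
import torch
import torch.nn as nn


class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic building block described around Eq. (1)."""

    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()  # SiLU(x) = x * sigmoid(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))


if __name__ == "__main__":
    # A 640x640 three-channel (sonar) image batch passed through one CBS block.
    x = torch.randn(1, 3, 640, 640)
    print(CBS(3, 32)(x).shape)  # torch.Size([1, 32, 640, 640])
```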
Firstly, we only change the basic layer of CSPDarknet called C3 (an improved version of the CSP bottleneck with 3 convolutions), which is shown in Fig. 2. Instead of directly inserting a Swin Transformer Layer into C3, we choose to look first at the more fundamental component, Multi-Head Self-Attention (MHSA), and see how it influences the results.

Fig. 2. Basic Component of CSPDarknet: C3

After the improvement of C3 only, we also try to replace the whole backbone with Swin Transformer, as it is designed to be a significant member of the backbone family. Through the comparison of these three networks, we can learn what really matters and how the changes affect the performance of the network.

A. C3MHSA

Multi-Head Self-Attention (MHSA) is one of the basic layers of the Transformer, and Swin Transformer applies Window MHSA and Shift-Window MHSA to improve it. This Transformer module is introduced to calculate the self-attention of pixels before detection, so that the network focuses on the targets rather than on the background or other unimportant regions.

So, at first, we study how MHSA affects the network and insert it into C3, inspired by BoTNet [23]. BoTNet replaces the 3×3 convolution with MHSA in the ResNet block and surpasses ResNet when evaluated on the COCO validation set. Its authors hope this simple and effective approach can serve as a strong baseline for future research on self-attention models for computer vision. We apply the same idea by replacing the 3×3 convolution with MHSA in the BottleNeck, forming a new block called BottleNeck Transformer. C3MHSA is shown in Fig. 3.

Fig. 3. Combination of C3 and MHSA: C3MHSA

In particular, MHSA in BoTNet uses relative distance encodings R_h and R_w for height and width, respectively. The attention logits in MHSA are qk^T + qr^T, where q, k, and r represent the query, key, and position encodings; qk^T stands for content-content and qr^T for content-position. In practice, the attention function can be written as

Attention(Q_h, K_h, V_h) = softmax((Q_h K_h^T + Q_h R_h^T) / sqrt(d_k)) V_h    (2)

where d_k is the dimension of the keys; scaling by sqrt(d_k) normalizes the variance of the dot products of Q_h and K_h and alleviates the vanishing-gradient problem of the softmax.

Fig. 4. MHSA Layer in C3MHSA
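The following is a simplified PyTorch sketch of such an MHSA layer over a 2-D feature map with learnable height and width position encodings, mirroring the q k^T + q r^T logits of Eq. (2). It uses per-location encodings rather than the true relative encodings of BoTNet and assumes a fixed feature-map size, so it should be read as an illustration, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn


class MHSA2d(nn.Module):
    """Multi-head self-attention over a (fixed-size) feature map with learnable
    height/width position encodings, following the logits structure of Eq. (2)."""

    def __init__(self, channels: int, feat_size: tuple, heads: int = 4):
        super().__init__()
        h, w = feat_size
        self.feat_size = (h, w)
        self.heads, self.dim_head = heads, channels // heads
        self.q = nn.Conv2d(channels, channels, 1, bias=False)
        self.k = nn.Conv2d(channels, channels, 1, bias=False)
        self.v = nn.Conv2d(channels, channels, 1, bias=False)
        # Learnable per-location encodings standing in for R_h and R_w.
        self.rel_h = nn.Parameter(torch.randn(1, heads, self.dim_head, h, 1))
        self.rel_w = nn.Parameter(torch.randn(1, heads, self.dim_head, 1, w))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        assert (h, w) == self.feat_size, "MHSA2d is built for a fixed feature-map size"
        heads, dh = self.heads, self.dim_head

        def split(t):  # (B, C, H, W) -> (B, heads, dim_head, H*W)
            return t.reshape(b, heads, dh, h * w)

        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        r = (self.rel_h + self.rel_w).reshape(1, heads, dh, h * w)
        qt = q.transpose(-2, -1)                # (B, heads, HW, dim_head)
        logits = qt @ k + qt @ r                # content-content + content-position
        attn = torch.softmax(logits / math.sqrt(dh), dim=-1)
        out = v @ attn.transpose(-2, -1)        # (B, heads, dim_head, HW)
        return out.reshape(b, c, h, w)          # heads concatenated along channels
```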

MHSA concatenates multiple heads of self-attention and then applies a linear transformation, which is given as

MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) L    (3)

where L is the matrix of the linear transformation.

Multi-Head Self-Attention linearly projects the queries, keys, and values several times with different learned linear projections to different dimensions, so it allows the model to jointly attend to information from different representation subspaces at different positions and benefits parallel computing [25].

Considering the memory and computation requirements of the limited hardware, we only incorporate MHSA at the lowest-resolution feature maps of the backbone, replacing only the last C3 with C3MHSA to validate its influence.
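Below is a sketch of how the bottleneck's 3×3 convolution can be swapped for self-attention and wrapped in a C3-style block, reusing the CBS and MHSA2d sketches above. The class names follow the paper's terminology (BottleNeck Transformer, C3MHSA), but the code itself is an illustration under those assumptions, not the authors' module.

```python
import torch
import torch.nn as nn

# Reuses the CBS and MHSA2d classes from the sketches above.


class BottleneckTransformer(nn.Module):
    """Bottleneck whose 3x3 convolution is replaced by multi-head self-attention,
    in the spirit of BoTNet; a residual shortcut is kept."""

    def __init__(self, c: int, feat_size: tuple, heads: int = 4):
        super().__init__()
        self.cv1 = CBS(c, c, k=1)
        self.attn = MHSA2d(c, feat_size, heads)   # replaces the 3x3 convolution
        self.cv2 = CBS(c, c, k=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.cv2(self.attn(self.cv1(x)))


class C3MHSA(nn.Module):
    """C3-style block with BottleneckTransformer in its residual branch; intended
    only for the last (lowest-resolution) stage of the backbone."""

    def __init__(self, c_in: int, c_out: int, feat_size: tuple, n: int = 1):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = CBS(c_in, c_hidden, k=1)
        self.cv2 = CBS(c_in, c_hidden, k=1)
        self.m = nn.Sequential(*[BottleneckTransformer(c_hidden, feat_size)
                                 for _ in range(n)])
        self.cv3 = CBS(2 * c_hidden, c_out, k=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
```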
B. C3STR

Next, the Swin Transformer Layer is merged into C3 to form C3STR, going a step further.

Fig. 5. Swin Transformer Layer in C3: C3STR

One Swin Transformer Layer includes two successive blocks. It is similar to the Vision Transformer [9] but replaces one of the regular standard Multi-Head Self-Attention (MHSA) modules with a module based on shifted windows (SW-MHSA), which is the most creative idea in Swin Transformer, as shown in Fig. 6.

Fig. 6. One Swin Transformer Layer with Two Blocks
Fig. 7. Swin Transformer Backbone
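To illustrate the window idea, here is a minimal sketch of window partitioning and the cyclic shift applied before the second block of a Swin layer. The attention computation itself and the masking that Swin applies after shifting are omitted, so this only shows how W-MHSA and SW-MHSA end up attending over different pixel groupings.

```python
import torch


def window_partition(x: torch.Tensor, window: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping window x window
    patches so self-attention can be computed inside each window only."""
    b, h, w, c = x.shape
    x = x.view(b, h // window, window, w // window, window, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, c)


def shift_windows(x: torch.Tensor, window: int) -> torch.Tensor:
    """Cyclically shift the map by half a window (torch.roll) so the next
    windowed attention pass mixes pixels across the previous window borders;
    this is the SW-MHSA idea, shown here without Swin's attention mask."""
    return torch.roll(x, shifts=(-window // 2, -window // 2), dims=(1, 2))


if __name__ == "__main__":
    feat = torch.randn(1, 8, 8, 96)             # (B, H, W, C), H and W divisible by window
    print(window_partition(feat, 4).shape)      # torch.Size([4, 16, 96])
    print(window_partition(shift_windows(feat, 4), 4).shape)
```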
C. Whole Backbone

Finally, the whole Swin Transformer is used to replace CSPDarknet as the backbone of YOLOv5, while FPN [26] and PANet [27] are kept as the neck for feature fusion and the YOLO head is kept to detect the targets, as shown in Fig. 8, producing a large model whose performance we then examine. The whole structure with Swin Transformer as the backbone, including the Swin Transformer Layers and the other modules, is shown in Fig. 7.

Fig. 8. Swin Transformer as Backbone of YOLOv5

IV. EXPERIMENTS

A. Datasets

The experiments cover two different datasets including six common targets: anchor, basket, pillar, shipwreck, human, and aircraft. The anchor, basket, and pillar were placed inside the school's experimental tank, and an Oculus BluePrint MD750d sonar was used to collect Dataset1, which has 2239 images in total. Dataset2 comes from an open-source dataset called the Sonar Common Target Detection Dataset (SCTD) [24], which has 357 images in total, captured not only by Forward-Looking Sonar (FLS) but also by Side-Scan Sonar (SSS) and Synthetic Aperture Sonar (SAS). This dataset is added to the training set in the hope that our model can handle all kinds of sonar images, and to prevent overfitting.

Fig. 9. The Experimental Tank and the Sonar We Use

Fig. 10. Example of Datasets (The pictures in the first row are captured by ourselves, including anchor, basket, and pillar. The second row is from SCTD, including shipwreck, human, and aircraft.)

B. Platform and Training Settings

The software and hardware environment and the parameters we use are shown in Tab. I and Tab. II.

TABLE I: EXPERIMENTATION PLATFORM

Item             | Version
Operating system | Windows 11
CPU              | AMD Ryzen 7 5800 8-Core Processor
GPU              | NVIDIA GeForce RTX 3070 Ti
Video memory     | 8 GB
RAM              | 16 GB
CUDA             | CUDA v11.0, cuDNN v8.0.4
Python           | 3.7.11
PyTorch          | 1.7.0
TABLE II: TRAINING SETTINGS

Training parameter           | Value
batch size                   | 8
epochs                       | 300
input size                   | (640, 640)
momentum                     | 0.937
weight decay                 | 0.0005
initial learning rate (lr0)  | 0.01 (SGD)
cyclical learning rate (lrf) | 0.1 (cosine annealing)
warmup epochs                | 3
warmup momentum              | 0.8
warmup bias lr               | 0.1
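The settings of Tab. II can be collected in a single configuration dictionary. The key names below follow the common YOLOv5 hyperparameter convention (lr0, lrf, and so on), and the cosine schedule is the one those names usually imply, so treat this as a restatement of the table rather than the authors' configuration file.

```python
import math

# Training configuration mirroring Tab. II (key names follow the usual
# YOLOv5-style hyperparameter naming; values are taken from the table).
train_cfg = {
    "batch_size": 8,
    "epochs": 300,
    "img_size": (640, 640),
    "optimizer": "SGD",
    "lr0": 0.01,            # initial learning rate
    "lrf": 0.1,             # final learning-rate fraction for cosine annealing
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "warmup_epochs": 3,
    "warmup_momentum": 0.8,
    "warmup_bias_lr": 0.1,
}


def lr_at(epoch: int, cfg: dict = train_cfg) -> float:
    """Cosine-annealed learning rate implied by lr0 and lrf over the run:
    lr(0) = lr0 and lr(epochs) = lr0 * lrf."""
    t = epoch / cfg["epochs"]
    return cfg["lr0"] * (((1 - math.cos(math.pi * t)) / 2) * (cfg["lrf"] - 1) + 1)
```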
C. Evaluation of Detection Network

Before training, we need to make clear the metrics used to evaluate our network.

The confusion matrix is often used to define the evaluation indexes. To understand Tab. III, take detecting anchors as an example: TP means the network predicts an anchor and the target actually is an anchor; FP means the network predicts an anchor but the target is not one; TN means the network predicts non-anchor and the target is indeed not an anchor; and FN means the detection network misses the target.

TABLE III: CONFUSION MATRIX

                      | Prediction: Positive | Prediction: Negative
Ground truth Positive | True Positive (TP)   | False Negative (FN)
Ground truth Negative | False Positive (FP)  | True Negative (TN)

• Precision

  Precision = TP / (TP + FP)    (4)

• Recall

  Recall = TP / (TP + FN)    (5)

• mAP

  Mean Average Precision (mAP) is the area under the curve (AUC) of the P-R curve, i.e., the mean value of the AP of all classes under a fixed Intersection over Union (IoU) threshold. mAP 0.5 means setting the IoU threshold to 0.5, and mAP 0.5:0.95 means averaging the mAP over IoU thresholds from 0.5 to 0.95 with a step of 0.05.

• F1

  F_beta can also reflect the relationship between P and R:

  1/F_beta = (1 / (1 + beta^2)) × (1/P + beta^2/R)    (6)

  We can set beta to different values to balance the importance of precision and recall. Setting it to 1 gives the harmonic mean of precision and recall.
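A short sketch of these metrics computed directly from confusion-matrix counts follows; the example numbers at the bottom are made up for illustration and are not results from the paper.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0


def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """F_beta from Eq. (6); beta = 1 gives the harmonic mean of P and R."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


# Illustrative anchor-class counts: 90 true positives, 10 false positives,
# 5 missed anchors (not the paper's numbers).
p, r = precision(90, 10), recall(90, 5)
print(round(p, 3), round(r, 3), round(f_beta(p, r), 3))  # 0.9 0.947 0.923
```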
V. RESULTS

YOLOv5s [28] is used as our baseline and shows the intuitive result we want. We then compare the three methods introduced above and find that they all improve the performance of the network to varying degrees.

A. Performance of YOLOv5s

YOLOv5s is the small version of the YOLOv5 family. It has only 7.02 M parameters and can infer a single image in 3.5 ms at shape (8, 3, 640, 640) in our environment with one NVIDIA 3070 Ti.
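For completeness, here is a sketch of the kind of GPU timing behind such a latency figure, written for an arbitrary PyTorch model rather than the authors' network; CUDA events and a warm-up loop are used because GPU kernels launch asynchronously. Dividing the result by the batch size gives a per-image number.

```python
import torch


@torch.no_grad()
def gpu_latency_ms(model: torch.nn.Module, shape=(8, 3, 640, 640),
                   warmup: int = 10, iters: int = 100) -> float:
    """Average forward-pass time in milliseconds for a batch of the given shape,
    measured with CUDA events after a short warm-up."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(*shape, device=device)
    for _ in range(warmup):
        model(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        model(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```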
1) Visualization Results:

As shown in Fig. 11, when we detect underwater targets using the moving AUH, sonar images are shown on the screen, and the network we designed directly draws a bounding box around the target we want and reports the class together with the confidence that it is that target.

Fig. 11. Bounding Boxes, Confidence, and Classes

Anchors and pillars are detected well by YOLOv5s, but it may miss the ship shown directly above because of the lack of ship samples. From Fig. 11 we can see the shortcoming of YOLOv5s, so it is necessary to improve the accuracy of this one-stage detection framework.

2) Quantitative Results:

After 300 epochs, which took 6.5 hours, we obtain the more detailed results shown below.
The network converges well and the training succeeds, as shown in Fig. 12.

Fig. 12. Loss, Precision, Recall, and mAP

Through the confusion matrix shown in Fig. 13, we can directly see which targets are predicted well and which are not. Because of the lack of ship and aircraft images, their predictions are not as precise as those of the other targets: the model misses 33% of the aircraft and 11% of the ships.

Fig. 13. Confusion Matrix

The F1, Precision, Recall, and P-R curves of the six targets are shown in Fig. 14. Thanks to the larger dataset, the detection precision of anchor, basket, and pillar is much better than that of the other three targets.

Fig. 14. F1, Precision, Recall, and P-R curves ((a) F1 curve, (b) P curve, (c) R curve, (d) P-R curve)

B. Comparison of Three Models

1) Work on C3:

We set up four groups of experiments on the modification of C3, including replacing the C3 in the last layer of the backbone, and also in the neck, with C3STR and C3MHSA. The evaluation indexes we care about are shown in Tab. IV.

TABLE IV: COMPARISON BETWEEN C3, C3MHSA, AND C3STR (datasets 1+2)

Structure           | YOLOv5s (baseline) | YOLOv5s+2C3STR | YOLOv5s+C3STR | YOLOv5s+MHSA
Backbone (last C3)  | C3                 | C3STR          | C3STR         | C3MHSA
Neck (last C3)      | C3                 | C3STR          | C3            | C3
Precision           | 0.905              | 0.929          | 0.948         | 0.968
Recall              | 0.918              | 0.880          | 0.913         | 0.898
mAP(0.50)           | 0.931              | 0.933          | 0.931         | 0.929
mAP(0.50:0.95)      | 0.769              | 0.777 (+0.8%)  | 0.791 (+2.2%) | 0.787 (+1.8%)
Layers              | 213                | 231            | 222           | 217
Parameters (M)      | 7.02               | 7.30           | 7.16          | 6.70
Inference time (ms) | 3.5                | 3.3            | 2.7           | 3.1
(All other C3 modules in the backbone and neck remain standard C3.)

C3STR replacing the C3 in the last layer of the backbone achieves the best result: it surpasses the mAP of the baseline by 2.2% with a small parameter increase and has the lowest inference time. C3MHSA also gains 1.8% with fewer parameters and less time. However, replacing C3 in both the backbone and the neck gives worse results than replacing only one C3 and adds more parameters.

Intuitively, C3STR replacing C3 helps us detect the ship we want, as shown in Fig. 15.

Fig. 15. Ship Detected by C3STR

To analyze the outcome more effectively, we plot the mAP and loss of the four models over the training epochs together in TensorBoard. As Fig. 16 shows, C3STR and C3MHSA both reach higher mAP in early epochs and improve the performance of the network. Fig. 17 shows that, as training continues, the loss descends steadily.

Fig. 16. mAP of Four Models ((a) mAP 0.5, (b) mAP 0.5:0.95)

Fig. 17. Bbox Loss, Class Loss and Object Loss

2) Work on Backbone:

We only use Dataset1 to analyze the idea of replacing CSPDarknet with Swin Transformer. The results are shown in Tab. V.

TABLE V: COMPARISON BETWEEN YOLOV5S AND YOLOV5S+SWIN (datasets 1)

Structure      | YOLOv5s | YOLOv5s+Swin
mAP(0.50)      | 0.990   | 0.991
mAP(0.50:0.95) | 0.843   | 0.861 (+1.8%)
Weights size   | 27.1 MB | 118 MB

mAP has increased by 1.8%, but the model size has expanded to more than four times that of the original model. This is predictable because of the design of Swin Transformer. Accordingly, directly inserting this large model into the backbone is not an optimal choice for our AUV system, because of the lightweight requirement imposed by the limited hardware.

VI. CONCLUSIONS

Experiments show that Swin Transformer and its core module (Window and Shift-Window Multi-Head Self-Attention) can improve the performance of neural networks and, if used properly, can boost the development of target detection technology, especially in complex environments such as underwater scenarios. The combination of YOLO and Swin Transformer blocks achieves higher accuracy without increasing weights and inference time, which is of great significance for future hardware implementation. We obtain at most a 2.2% higher mAP(0.5:0.95) than YOLOv5s and, at the same time, 0.8 ms less inference time. Our next step is to adopt the methods described in the paper and carry out sea trials to validate and compare them.

However, the large size of Swin Transformer causes some problems, and directly replacing YOLO's backbone with it diminishes the lightweight advantage of YOLO. The key to the widespread use of a real-time target detection framework designed for underwater scenarios is to balance speed and accuracy. So taking Swin Transformer apart to minimize the size of the model, as done above when forming C3MHSA and C3STR, is a better direction to be analyzed and exploited in depth. More efficient methods of combining YOLO and the Transformer will also be tested and compared in future work.

Furthermore, we will continue to optimize our network, for example by adding more attention modules such as CBAM [29] and CA [30] to make it faster, lighter, and more accurate, and we will also use networks such as GAN [31] to pre-process sonar images in order to solve the long-standing problems of dataset shortage and poor image quality that plague underwater deep learning and target detection.

REFERENCES

[1] Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Neural Inf. Process. Syst. 2012, 25, doi:10.1145/3065386.
[2] Liu Z, Mao H, Wu C Y, et al. A ConvNet for the 2020s[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 11976-11986.
[3] Valdenegro-Toro M. Best practices in convolutional networks for forward-looking sonar image recognition[J]. OCEANS 2017 - Aberdeen, 2017: 1-9.
[4] Galusha A, Dale J, Keller J M, et al. Deep convolutional neural network target classification for underwater synthetic aperture sonar imagery[C]//Detection and Sensing of Mines, Explosive Objects, and Obscured Targets XXIV. SPIE, 2019, 11012: 18-28.
[5] Valdenegro-Toro M. Objectness Scoring and Detection Proposals in Forward-Looking Sonar Images with Convolutional Neural Networks[J]. Artificial Neural Networks in Pattern Recognition, 2016: 209-219.
[6] Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 779-788.
[7] Kim J, Cho H, Pyo J, et al. The convolution neural network based agent vehicle detection using forward-looking sonar image[C]. In: OCEANS 2016 MTS/IEEE Monterey. 2016. 1-5.
[8] Einsidler D, Dhanak M, Beaujean P P. A deep learning approach to target recognition in side-scan sonar imagery[C]//OCEANS 2018 MTS/IEEE Charleston. IEEE, 2018: 1-4.
[9] Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021-06-03). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929.
[10] Carion N, Massa F, Synnaeve G, et al. End-to-end object detection with transformers[C]//European Conference on Computer Vision. Springer, Cham, 2020: 213-229.
[11] Zhu X, Su W, Lu L, et al. Deformable DETR: Deformable transformers for end-to-end object detection[J]. arXiv preprint arXiv:2010.04159, 2020.

[12] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo,
B. (2021). Swin Transformer: Hierarchical Vision Transformer using
Shifted Windows. 2021 IEEE/CVF International Conference on Com-
puter Vision (ICCV), 9992-10002.
[13] Dobeck G J. Algorithm fusion for the detection and classification of sea
mines in the very shallow water region using side-scan sonar imagery[C].
In: Aerosense. 2000.
[14] Wang J, Shan T, Chandrasekaran M, et al. Deep learning for detection
and tracking of underwater pipelines using multibeam imaging sonar.
Proceedings of the IEEE International Conference on Robotics and
Automation Workshop; 2019. [Google Scholar]
[15] Lee S, Park B, Kim A. Deep learning from shallow dives: sonar
image generation and training for underwater object detection.
arXiv:1810.07990. preprint; 2018. [Google Scholar]
[16] Sung M, Kim J, Lee M, et al. Realistic sonar image simulation using
deep learning for underwater object detection. Int J Control Autom Syst.
2020;18(3):523–534. [Crossref], [Web of Science ®], [Google Scholar]
[17] Weiss K, Khoshgoftaar T M, Wang D D. A survey of transfer learning[J].
Journal of Big data, 2016, 3(1): 1-40.
[18] Deng J, Dong W, Socher R, et al. Imagenet: A large-scale hierarchical
image database[C]//2009 IEEE conference on computer vision and
pattern recognition. Ieee, 2009: 248-255.
[19] Bochkovskiy A, Wang C Y, Liao H Y M. Yolov4: Optimal speed and
accuracy of object detection[J]. arXiv preprint arXiv:2004.10934, 2020.
[20] Wang, C.; Liao, H.M.; Wu, Y.; Chen, P.; Hsieh, J.; Yeh, I. CSPNet:
A New Backbone that can Enhance Learning Capability of CNN.In
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19
June 2020; pp. 1571–1580.
[21] Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 448–456.
[22] Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-weighted linear units for
neural network function approximation in reinforcement learning. Neural
Netw. 2018, 107, 3–11. [CrossRef] [PubMed]
[23] A. Srinivas, T. -Y. Lin, N. Parmar, J. Shlens, P. Abbeel and A. Vaswani,
”Bottleneck Transformers for Visual Recognition,” 2021 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2021,
pp. 16514-16524, doi: 10.1109/CVPR46437.2021.01625.
[24] ZHOU Y, CHEN S, WU Ke, NING M, CHEN H, ZHANG P. SCTD
1.0:Sonar Common Target Detection Dataset[J]. Computer Science,
2021, 48(11A): 334-339.
[25] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J].
Advances in neural information processing systems, 2017, 30.
[26] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object
detection[C]//Proceedings of the IEEE conference on computer vision
and pattern recognition. 2017: 2117-2125.
[27] Wang K, Liew J H, Zou Y, et al. Panet: Few-shot image semantic seg-
mentation with prototype alignment[C]//Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2019: 9197-9206.
[28] Ultralytics-YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 1 January 2021).
[29] Woo S, Park J, Lee J Y, et al. Cbam: Convolutional block attention
module[C]//Proceedings of the European conference on computer vision
(ECCV). 2018: 3-19.
[30] Hou Q, Zhou D, Feng J. Coordinate attention for efficient mobile net-
work design[C]//Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition. 2021: 13713-13722.
[31] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial
nets[J]. Advances in neural information processing systems, 2014, 27.
