

Self-supervised acoustic representation learning via acoustic-embedding memory unit modified space autoencoder for underwater target recognition

Xingmei Wang,a) Jiaxiang Meng, Yangtao Liu, Ge Zhan, and Zhaonan Tian
College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China

ABSTRACT:
Because of the expensive annotation of high-quality signals obtained from passive sonars and the weak generalization ability of a single feature in the ocean, this paper proposes self-supervised acoustic representation learning under the acoustic-embedding memory unit modified space autoencoder (ASAE) and performs the underwater target recognition task. In the manner of an animal-like acoustic auditory system, the first step is to design a self-supervised representation learning method called the space autoencoder (SAE) to merge the Mel filter-bank (FBank), with its acoustic discrimination, and the gammatone filter-bank (GBank), with its anti-noise robustness, into the SAE spectrogram (SAE Spec). Meanwhile, because SAE Spec carries poor high-level semantic information, an acoustic-embedding memory unit (AEMU) is introduced as a strategy of adversarial enhancement. During the auxiliary task, more negative samples are included in the improved contrastive loss function to obtain adversarially enhanced features called the ASAE spectrogram (ASAE Spec). Ultimately, comprehensive comparison and ablation experiments on two underwater datasets show that ASAE Spec increases accuracy, convergence rate, and anti-noise robustness by more than 0.96% over other mainstream acoustic features. The results prove the potential value of ASAE in practical applications. © 2022 Acoustical Society of America. https://doi.org/10.1121/10.0015138

(Received 10 April 2022; revised 20 October 2022; accepted 23 October 2022; published online 11 November 2022)
[Editor: Haiqiang Niu] Pages: 2905–2915

I. INTRODUCTION

Underwater target recognition plays a crucial role in military and civil fields. This technology uses acoustic signals from passive sonar to classify targets, including sea creatures, ships, and sea waves. Nowadays, underwater target recognition with high efficiency and high accuracy has become one of the most important projects in the ocean science field, and highly intelligent and automatic machine identification has gradually been used for underwater target recognition (Neupane and Seok, 2020; Steiniger et al., 2022). However, many factors increase the difficulty of underwater target recognition: (1) Acoustic signals from the sonar contain complex information components, such as sea waves, ocean biological acoustic information, and acoustic signals from artificial exploration activity. (2) Underwater target recognition is mainly based on the radiated noise from targets' power systems, and recognizing target radiated noise is a great challenge for both humans and machines. Therefore, many scholars try to solve these problems in underwater target recognition.

Statistical methods for the representation of target radiated noise were raised in the 1990s (Xuegong, 2000; Ji and Liao, 2005; Yang et al., 2016; Lian et al., 2017). For example, Yang et al. (2016) developed support vector machines (SVM) (Cao et al., 2013; Freund and Schapire, 1997) to reduce the scale of samples in the training set. Lian et al. (2017) found that the acoustic features from gammatone filters have strong anti-noise robustness. However, information loss and complicated operations in statistical methods make it difficult to obtain a good acoustic representation. The professional knowledge necessary for machine learning is also difficult to acquire in the complicated ocean environment. Hence, early models for underwater target recognition hardly meet the requirements of underwater tasks for high efficiency and high accuracy.

Over recent years, the outstanding performance of deep learning in various fields, such as object tracking (Ciaparrone et al., 2020) and emotion recognition (Kim and Saurous, 2018), has provided novel ideas for underwater target recognition. Deep learning has a strong representation learning ability that requires no professional knowledge but much data (Kamal et al., 2013; Cao et al., 2016; Chen and Xu, 2017; Zhu et al., 2017; Shen et al., 2018; Yang et al., 2018; Cao et al., 2019; Tian et al., 2021). Cao et al. (2016) proposed a modified autoencoder to learn the key knowledge of underwater data for convolutional neural networks (CNN). Zhu et al. (2017) introduced CNN for representation learning and applied SVM to recognize unmanned underwater detection equipment (UUDE). To improve representational performance in terms of convergence rate and anti-noise robustness, a compressed competitive deep belief network was developed by discretization and network reconstruction with few samples (Shen et al., 2018).

a) Electronic mail: [email protected]

Nevertheless, most deep learning methods employ supervised learning, which relies on "clean" labeled data, and it is difficult to obtain many labeled "clean" acoustic signals in underwater target recognition. Instead, self-supervised learning designs upstream auxiliary tasks that do not require labels. These auxiliary tasks focus on mining samples without labels according to their own supervisory information and learn features valuable for downstream tasks (Doersch, 2016; Gidaris et al., 2018; Lee et al., 2020). Further, supervised learning assigns the same label to augmented samples from the same source; even so, if augmentation results in large differences in their distribution, forcing such invariance may degrade model performance. Thus, Lee et al. (2020) proposed a self-supervised label augmentation technique with aggregation and self-distillation in a multi-task learning framework to alleviate few-shot and imbalanced classification tasks. To learn acoustic features from unlabeled data, some general solutions have been put forward based on different auxiliary tasks. For example, convolutional encoders have been used to extract self-supervised acoustic features in the problem-agnostic speech encoder (PASE) (Pascual et al., 2019) and PASE+ (Ravanelli et al., 2020), and these models can solve various auxiliary tasks, such as the reconstruction of waveforms and the prediction of Mel frequency cepstral coefficients (MFCC) (Lim et al., 2007). These acoustic features improve the performance of speaker recognition, sentiment classification, and speech recognition. In addition, autoregressive predictive coding (APC) (Chung and Glass, 2020) and contrastive predictive coding (CPC) (Oord et al., 2018) have been introduced for self-supervised acoustic representation learning. Both APC and CPC aim to predict future acoustic frames from past audio frames. Wang et al. (2020a) built the auxiliary task of reconstructing masked Log Mel spectrograms (Nandan and Vepa, 2020) with transformer models on the LibriSpeech and Wall Street Journal datasets. In this paper, self-supervised learning is applied to acoustic representation for the downstream underwater target recognition task.

Classical self-supervised learning mainly focuses on pixel-level reconstruction loss but ignores high-level semantic information. Recently, adversarial enhancement based on contrastive learning (Kong et al., 2019; Zhou et al., 2020) has been proposed. It compares different samples to enrich the information with different loss functions (Gutmann and Hyvarinen, 2010; Sohn, 2016; Song et al., 2016; Movshovitz-Attias et al., 2017; Deng et al., 2019; Fan et al., 2019; Sun et al., 2020). In detail, the typical loss functions of contrastive learning include the contrastive loss (Hadsell et al., 2006) and the triplet loss (Schroff et al., 2015), which optimize for clustering positive paired samples and pushing negative paired samples away. Building on these two typical loss functions, the contrastive losses used for self-supervised learning (Zhuang et al., 2019; Li et al., 2020) have inspired many variations, such as the noise contrastive estimation loss (NCE Loss) (Gutmann and Hyvarinen, 2010) and the N-pair loss (Sohn, 2016). This paper aims to introduce adversarial enhancement as contrastive-based self-supervised learning; meanwhile, how to mine more negative samples in contrastive-based self-supervised learning needs to be addressed.

To address the problems of limited labeled underwater acoustic samples and poor negative sample learning, this paper proposes an underwater target recognition method with self-supervised acoustic representation learning to obtain adversarially enhanced features under ASAE. To the best of our knowledge, this is the first attempt to combine self-supervised acoustic representation learning with underwater target recognition. The main contributions are as follows: (1) To take advantage of FBank and GBank together, this paper proposes SAE for the auxiliary task of feature reconstruction and obtains SAE Spec with high acoustic discrimination and anti-noise robustness. (2) To compensate for the lack of high-level semantic information in SAE Spec by mining more negative samples, this paper introduces a strategy of adversarial enhancement by adding an acoustic-embedding memory unit (AEMU) to SAE. (3) The results of comparison and ablation experiments on underwater datasets show that the adversarially enhanced features called ASAE Spec from ASAE outperform existing acoustic features for underwater target recognition.

The rest of our paper is organized as follows: Section II presents the details of the proposed models. Section III shows the experimental results and analysis. Section IV discusses the conclusion and future work.

II. METHODOLOGY

The core of underwater target recognition is to learn acoustic features with strong generalization, rich high-level semantic information, and anti-noise robustness from a low-quality underwater dataset. Thus, this paper applies the contrastive-based self-supervised learning method to underwater acoustic representation learning under ASAE, which is then adopted for downstream underwater target recognition. Contrastive-based self-supervised learning usually explores the similarities or differences between samples and then learns a better representation through a distance metric over samples. Thereinto, SAE is proposed to reconstruct FBank into GBank so that the acoustic features carry both of their advantages; SAE Spec therefore has good acoustic characteristics and strong anti-noise robustness. FBank and GBank, as the baseline acoustic features, are described in brief. Subsequently, AEMU is introduced into SAE as ASAE to mine more negative samples and further enrich the feature information, and thus ASAE Spec can serve as the acoustic feature for underwater target recognition.

A. FBank and GBank

A spectrogram (Spec) (Jin et al., 2020) contains key information in the temporal and frequency dimensions. As shown in Fig. 1, inspired by the animal-like acoustic auditory system, Mel filters and gammatone filters are applied to Spec to obtain FBank (Yoshioka et al., 2014) and GBank, respectively.

FIG. 1. (Color online) The process of adopting Mel filters and gammatone filters to Spec as FBank and GBank. FBank and GBank are more informative features than MFCC and GFCC, which apply a DCT and thereby lose some information.

GBank is the gammatone frequency cepstral coefficient (GFCC) without the discrete cosine transform (DCT), in the same way that FBank relates to MFCC. These filters produce the top-performing and reliable features in underwater target recognition. FBank fits acoustic discrimination, and GBank fits the auditory mechanism of the human cochlear filtering system. Both filter banks follow the short-time Fourier transform (STFT) (Durak and Arikan, 2003).

1. Mel filter

Mel filters select the wave energy in different frequency bands as target representations, like the auditory system of the human ear, with strong acoustic discrimination. Several filters H_m(k) are set in the spectral range. Each filter is a triangular filter with a center frequency f(m),

H_m(k) =
\begin{cases}
0, & k < f(m-1) \\
\dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\
\dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\
0, & k > f(m+1).
\end{cases}    (1)
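As an illustration of Eq. (1), the following minimal sketch builds a bank of triangular Mel filters with NumPy; the sample rate, FFT size, and number of filters are illustrative assumptions rather than the settings used in this paper.

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK-style Mel scale conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=64, n_fft=1024, sr=16000):
    """Return a (n_filters, n_fft//2 + 1) matrix of triangular filters H_m(k), Eq. (1)."""
    # Filter edges are spaced uniformly on the Mel scale, then mapped to FFT bins f(m).
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / sr).astype(int)

    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            H[m - 1, k] = (k - left) / max(center - left, 1)    # rising edge
        for k in range(center, right):
            H[m - 1, k] = (right - k) / max(right - center, 1)  # falling edge
    return H

# FBank of one frame: apply the filterbank to the STFT power spectrum, e.g.
# power_spectrum = np.abs(np.fft.rfft(frame, 1024)) ** 2
# fbank = np.log(mel_filterbank() @ power_spectrum + 1e-10)
```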
2. Gammatone filter

Gammatone filters (Johannesma, 1972) are applied to the auditory system with frequency sensing to simulate the energy feedback in the human ear. Each gammatone filter is regarded as the multiplication of the gamma function and the acoustic signal as follows:

g_i(t) = t^{\,n-1} \exp(-2\pi b_i t) \cos(2\pi f_i t + \varphi_i)\, u(t), \quad 1 \le i \le N,    (2)

where f_i is the center frequency, b_i is the time attenuation rate, \varphi_i is the phase, N is the number of filters, and u(t) is the step function.
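Equation (2) can be sketched directly from the impulse response; the sample rate, filter order, and ERB-style bandwidth below are illustrative assumptions, not the configuration used in this paper.

```python
import numpy as np

def gammatone_ir(fc, sr=16000, duration=0.05, order=4, b=None, phase=0.0):
    """Impulse response g_i(t) = t^(n-1) exp(-2*pi*b*t) cos(2*pi*fc*t + phase) u(t), Eq. (2)."""
    if b is None:
        # Equivalent rectangular bandwidth (ERB)-based attenuation rate, a common choice.
        b = 1.019 * (24.7 + 0.108 * fc)
    t = np.arange(0, duration, 1.0 / sr)        # u(t): the response starts at t = 0
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phase)
    return g / np.max(np.abs(g))                # peak-normalize for convenience

# GBank of one frame: filter the frame with gammatone filters at several center
# frequencies and take the log energy of each band, e.g.
# bands = [np.convolve(frame, gammatone_ir(fc), mode="same") for fc in center_freqs]
# gbank = np.log(np.sum(np.square(bands), axis=1) + 1e-10)
```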

B. SAE

In this paper, because deep learning depends on labeled data, SAE is proposed as a self-supervised acoustic representation learning method. It aims at implementing an animal-like acoustic auditory system with FBank and GBank: it takes FBank as input and performs the auxiliary task of reconstructing FBank into GBank, so it needs no labels as supervised information to learn acoustic features. Therefore, a new high-performing feature named SAE Spec is created by combining the good acoustic discrimination of FBank and the strong anti-noise robustness of GBank. Moreover, the self-supervised SAE copes with the problems of largely unlabeled and class-imbalanced datasets.

FIG. 2. (Color online) The framework of the proposed SAE with the auxiliary task of reconstructing FBank to GBank.

SAE contains an encoder h(·) and a decoder g(·), as illustrated in Fig. 2. To make SAE more robust to noise, the random Gaussian noise embedded module (RGNE) is added to FBank. It simulates the noise of a real underwater environment so that the extracted features achieve better anti-noise robustness. The input is encoded into the embedding space by h(·). The decoder then reconstructs the encoder output as a pseudo-GBank by upsampling. The pseudo-GBank and the true GBank are used for the loss of spatial similarity. Therefore, SAE Spec in the embedding space is beneficial for recognition and anti-noise robustness.

A labeled and clean dataset is not necessary for SAE, owing to its auxiliary task of feature reconstruction, so SAE is well suited to representation learning for downstream underwater target recognition.

As shown in Fig. 2, RGNE is used to generate the Gaussian noise n given in Eq. (3). FBank and the random noise n are superimposed to form a noisy FBank for the encoder h(·),

n(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2}.    (3)

The encoder is similar to ResNet50 (He et al., 2016). The encoder h(·) has three stages in total, and each contains a residual network of three convolutional layers. The principle of the encoder is

z_i = h(x_i).    (4)

In this paper, z_i (i.e., the SAE Spec feature) is what is finally used for the downstream recognition task. The output of the encoder is the input of the decoder g(·), which performs the upsampling corresponding to the encoder. The principle of the decoder is defined as

r_i = g(z_i).    (5)
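A minimal PyTorch sketch of the RGNE-encoder-decoder pipeline of Eqs. (3)-(5) is given below. The layer widths, strides, and noise parameters are illustrative assumptions, and the small residual encoder is only a stand-in for the ResNet50-like encoder described above.

```python
import torch
import torch.nn as nn

class RGNE(nn.Module):
    """Random Gaussian noise embedded module: x + n, with n ~ N(mu, sigma^2), Eq. (3)."""
    def __init__(self, mu=0.0, sigma=0.1):
        super().__init__()
        self.mu, self.sigma = mu, sigma

    def forward(self, x):
        return x + torch.randn_like(x) * self.sigma + self.mu

class ResidualBlock(nn.Module):
    """Three-convolution residual block, the building block of each encoder stage."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class SAE(nn.Module):
    """Space autoencoder: z_i = h(x_i) (Eq. (4)) and r_i = g(z_i) (Eq. (5))."""
    def __init__(self, ch=32):
        super().__init__()
        self.rgne = RGNE()
        self.stem = nn.Conv2d(1, ch, 3, stride=2, padding=1)
        # Three encoder stages, each a residual block followed by downsampling.
        self.encoder = nn.Sequential(
            ResidualBlock(ch), nn.Conv2d(ch, ch, 3, stride=2, padding=1),
            ResidualBlock(ch), nn.Conv2d(ch, ch, 3, stride=2, padding=1),
            ResidualBlock(ch),
        )
        # The decoder upsamples the embedding back to the GBank resolution (pseudo-GBank).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1),
        )

    def forward(self, fbank):
        z = self.encoder(self.stem(self.rgne(fbank)))   # SAE Spec used downstream
        pseudo_gbank = self.decoder(z)
        return z, pseudo_gbank

# fbank shaped (batch, 1, n_mels, n_frames); pseudo_gbank is compared with the true GBank:
# z, pseudo_gbank = SAE()(torch.randn(8, 1, 64, 96))
```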
The projection head module tightly follows the decoder output as well as the true GBank. This module adopts a fully connected layer to flatten the pseudo-GBank and the true GBank into one-dimensional vectors h_i and h_j. Thereafter, the loss of spatial similarity in SAE is the cosine embedding loss (CE Loss) rather than the mean square error loss (MSE loss), because MSE loss is not suitable when the underwater dataset shows a class imbalance. In addition, the CE loss shown in Eq. (6) better captures feature similarity and avoids overfitting,

loss((x_i, x_j), y) =
\begin{cases}
1 - \cos(x_i, x_j), & y = 1 \\
\max\big(0, \cos(x_i, x_j) - margin\big), & y = -1,
\end{cases}    (6)

where y indicates whether the paired samples are positive or negative, and x_i and x_j represent the one-dimensional vectors h_i and h_j of the pseudo-GBank and the true GBank.
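As a sketch, Eq. (6) corresponds to the cosine embedding loss available in PyTorch, applied to the flattened pseudo-GBank and true GBank; the margin value here is an assumed placeholder.

```python
import torch
import torch.nn as nn

# Flatten pseudo-GBank and true GBank into one-dimensional vectors h_i, h_j
# (the role of the fully connected projection head), then apply CE loss, Eq. (6).
ce_loss = nn.CosineEmbeddingLoss(margin=0.2)   # margin is an illustrative choice

pseudo_gbank = torch.randn(8, 1, 64, 96)
true_gbank = torch.randn(8, 1, 64, 96)
h_i = pseudo_gbank.flatten(start_dim=1)
h_j = true_gbank.flatten(start_dim=1)

# y = 1 marks reconstructed/true pairs from the same signal as positive pairs.
y = torch.ones(h_i.size(0))
loss = ce_loss(h_i, h_j, y)
```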

C. AEMU

Because traditional self-supervised learning usually assumes that all pixels are independent, high-level semantic information is ignored. However, high-level semantic information plays an important role in classification: the richer the high-level semantic information, the better the classification ability (Shelhamer et al., 2017). Therefore, in the underwater target recognition field, the core of better performance is to learn more high-level semantic information that represents different targets for easier classification. The contrastive-based self-supervised learning in our paper needs positive samples and negative samples and then measures the distance between them. The core idea is that the similarity score between the sample and the positive samples should be much greater than that between the sample and the negative samples,

score\big(f(x), f(x^{+})\big) \gg score\big(f(x), f(x^{-})\big),    (7)

where x represents the anchor sample, x^{+} the positive samples, and x^{-} the negative samples. A loss such as the information noise contrastive estimation loss (InfoNCE Loss) (Oord et al., 2018) can realize Eq. (7). From this, it is crucial to find reliable positive samples and mine valid negative samples. However, SAE suffers from poor negative sample learning. Thus, this paper applies the strategy of adversarial enhancement in SAE. Inspired by the cross-batch memory (XBM) (Wang et al., 2020b), AEMU is introduced into SAE to form ASAE, which is illustrated in Fig. 3. It builds a dynamic negative sample library for more negative samples and uses the loss function of contrastive learning. Meanwhile, the dynamic negative sample library allows SAE to learn from massive negative samples without extra computational effort. Therefore, ASAE generates adversarially enhanced features called ASAE Spec that contain more high-level semantic information.

FIG. 3. (Color online) The framework of the proposed ASAE. The strategy of adversarial enhancement is added with the loss of contrastive learning with AEMU.

Negative sample learning is normally performed within the mini-batch. A larger mini-batch means more negative samples, which produces better training in contrastive learning. However, the size of the mini-batch is limited by the memory size. Therefore, AEMU is applied to collect information across different mini-batches, and it plays an important role in adversarial enhancement. AEMU stores many negative samples in a dynamic queue, and the scale of the dynamic queue is controllable. In each iteration, the dynamic queue is updated with a mini-batch by dequeue and enqueue operations. In detail, the queue is updated as follows: the GBank of each mini-batch is appended at the end of the queue, while the GBank at the front of the queue is taken out. This ensures that the dynamic queue contains the positive samples required at each iteration, while the remaining features are treated as negative samples.
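The dequeue and enqueue bookkeeping described above can be sketched as a fixed-size feature queue; the queue size and feature dimension are illustrative assumptions, and the class below is only a schematic of the memory unit rather than the authors' implementation.

```python
import torch

class AEMU:
    """Dynamic negative-sample library: a FIFO queue of GBank embeddings across mini-batches."""
    def __init__(self, queue_size=256, feat_dim=128):
        self.queue = torch.zeros(queue_size, feat_dim)
        self.ptr = 0
        self.filled = 0

    @torch.no_grad()
    def enqueue_dequeue(self, gbank_feats):
        """Append the current mini-batch at the tail; the oldest entries are overwritten (dequeued)."""
        n = gbank_feats.size(0)
        idx = (self.ptr + torch.arange(n)) % self.queue.size(0)
        self.queue[idx] = gbank_feats.detach()
        self.ptr = int((self.ptr + n) % self.queue.size(0))
        self.filled = min(self.filled + n, self.queue.size(0))

    def negatives(self):
        """All stored features other than the current positives act as negative samples."""
        return self.queue[: self.filled]
```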

In addition, AEMU applies the modified InfoNCE over the N paired samples,

\ell_{i,j} = -\log \frac{\exp\big(\mathrm{sim}\big(f(x_i), f(x_j)\big)/\tau\big)}{\sum_{k \ne i}^{N} \exp\big(\mathrm{sim}\big(f(x_i), f(x_k)\big)/\tau\big)},    (8)

where f(·) represents the positive and negative samples, sim(·) denotes the similarity function of Eq. (6), and \tau is the temperature hyperparameter. There is one positive sample and N − 1 negative samples. More mutual information between negative pairs is obtained by minimizing \ell_{i,j}; therefore, the information of f(x_j) is reduced in Eq. (8) since f(x_j) is knowledgeable.
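A sketch of the InfoNCE-style objective of Eq. (8), computed over one positive pair and the negatives held by the memory queue, might look as follows; cosine similarity is used for sim(·), and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.07):
    """Eq. (8): -log( exp(sim(x_i, x_j)/tau) / sum_k exp(sim(x_i, x_k)/tau) ).

    anchor:    (batch, dim) embeddings of the reconstructed features
    positive:  (batch, dim) matching true GBank embeddings
    negatives: (queue, dim) features held by AEMU
    tau:       temperature hyperparameter (illustrative value)
    """
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negatives = F.normalize(negatives, dim=1)

    pos_logit = (anchor * positive).sum(dim=1, keepdim=True) / tau   # (batch, 1)
    neg_logits = anchor @ negatives.t() / tau                        # (batch, queue)
    logits = torch.cat([pos_logit, neg_logits], dim=1)

    # The positive pair sits at index 0, so InfoNCE reduces to a cross entropy.
    targets = torch.zeros(anchor.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)
```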
In summary, to achieve underwater target recognition based on self-supervised acoustic representation learning, we use the ASAE trained on the feature reconstruction task as the feature extraction module, i.e., only the encoder module of ASAE is used.

III. EXPERIMENTS AND RESULTS

In this section, to validate the proposed model, underwater target recognition is adopted as the downstream task. Multi-layer perceptron (MLP) and multinomial logistic regression (MLR) are used as the classification models for the downstream tasks in this paper. Since the proposed ASAE has already carried out the important representation learning for underwater target recognition, this paper only performs the downstream tasks with shallow classification models to verify the proposed SAE Spec and ASAE Spec. Two datasets are utilized to validate the performance of the obtained acoustic features. Comparisons are made against existing strong acoustic features, such as MFCC, GFCC (Mao et al., 2015; Lian et al., 2017; Zhang et al., 2018), FBank, and GBank. The details of the proposed model are presented in the Methodology section. In addition, we also perform ablation experiments for the ideas proposed in this paper.
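As a sketch of this downstream stage, the two shallow classifiers can be instantiated with scikit-learn; the hidden-layer size and iteration limits are illustrative assumptions, and train_feats/train_labels stand for hypothetical arrays of flattened ASAE Spec features and class labels.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

# Shallow downstream classifiers on top of the frozen ASAE encoder output (ASAE Spec).
mlp = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500)
mlr = LogisticRegression(multi_class="multinomial", max_iter=500)

# train_feats: (n_samples, feat_dim) flattened ASAE Spec features; train_labels: class ids.
# mlp.fit(train_feats, train_labels)
# mlr.fit(train_feats, train_labels)
# print(mlp.score(test_feats, test_labels), mlr.score(test_feats, test_labels))
```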

A. Dataset

This paper performs experiments on two datasets. One is the cooperative-project dataset (CP Dataset) collected from our underwater environment. It contains the signals of sea creatures and the radiated noise of four types of ships, grouped into five classes: A, B, C, D, and E. A is a tug ship with 50 samples. B is a sailboat with 27 samples. C is a passenger ferry with 151 samples. D is a roll-on-roll-off ship with 17 309 samples. E contains whales and dolphins with 7097 samples. The total length is about 20 h, and each short signal is split into multiple 2 s signals to facilitate experimental processing. Besides the above five classes, there is another class of 41 samples, the water flow, which serves as background noise to verify the anti-noise robustness. Figure 4 shows the Spec of an example signal of class E in the CP dataset.

FIG. 4. (Color online) Spec of FBank, GBank, and the original underwater acoustic signal with class E in the CP dataset. The vertical axis is the time, and the horizontal axis represents the frequency. (a) FBank Spec, (b) GBank Spec, (c) original underwater acoustic signal Spec.

The other dataset is the ShipsEar dataset (Santos-Domínguez et al., 2016), with ship noise recorded on the Galicia coast from 2012 to 2013. Eleven types of ships in this dataset are grouped into four main classes: A, B, C, and D. A includes the fishing boat, trawler, mussel boat, tugboat, and dredgers. B includes the motorboat, pilot ship, and sailboat. C includes passenger ships. D includes the ocean liner and ro-ro vessels. There are 90 records from 15 s to 10 min, and each signal is split into multiple 0.34 s signals. In this experiment, A has 17 records with 31 min, B contains 19 records with 26 min, C is divided into 30 records with 71 min, and D has 12 records with 41 min. There is also an extra class named background noise with 12 records. In summary, the ratio between the training set and the testing set in all datasets is 7:3. Meanwhile, the training set is used to train the ASAE until convergence.
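The segmentation and 7:3 split described above might be implemented along the following lines; the sample rate and the clips/labels variables are assumptions for illustration, not the authors' preprocessing code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_into_clips(signal, sr, clip_seconds=2.0):
    """Cut a long recording into fixed-length clips (2 s for CP, 0.34 s for ShipsEar)."""
    clip_len = int(sr * clip_seconds)
    n_clips = len(signal) // clip_len
    return [signal[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]

# clips, labels collect every fixed-length segment with its class id, then a 7:3 split:
# train_x, test_x, train_y, test_y = train_test_split(clips, labels, test_size=0.3, stratify=labels)
```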
B. Evaluation Metrics

In our paper, accuracy (Acc), classifier loss convergence rate (CLCR), and anti-noise robustness (ANR) are utilized as the evaluation metrics for our proposed model.

1. Acc

Accuracy (Acc) is defined as

Accuracy(f', D_{Test}) = \frac{1}{N} \sum_{i=1}^{N} \big(f'(x_i) = label_i\big),    (9)

where D_{Test} is the testing dataset, f'(·) represents the classification model, label is the true data annotation, and N is the size of the dataset D_{Test}.

2. CLCR

To verify that the features contain more high-level semantic information, CLCR is introduced as one of the performance evaluation metrics. CLCR is the number of epochs required for the classification model to converge under different features. For the classification model f' and the training dataset D_{Train}, CLCR is expressed as

CLCR(f', D_{Train}) = epochs(loss = loss_{min}).    (10)

3. ANR

Due to the obvious background noise in most underwater datasets, ANR becomes an important performance evaluation metric for underwater target recognition on such "unclean" datasets. In this paper, the acoustic signals with added noise are used as a testing dataset DN_{Test}. For the classification model f' and the data annotation label, ANR is given as follows:

ANR = \frac{Accuracy(f', D_{Test}) - Accuracy(f', DN_{Test})}{Accuracy(f', D_{Test})}.    (11)
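The three metrics of Eqs. (9)-(11) are straightforward to compute once predictions and per-epoch losses are available; the sketch below is one possible implementation, with the convergence tolerance as an assumed parameter.

```python
import numpy as np

def accuracy(preds, labels):
    """Eq. (9): fraction of test samples whose prediction matches the true label."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    return float(np.mean(preds == labels))

def clcr(loss_per_epoch, tol=1e-3):
    """Eq. (10): number of epochs needed for the classifier loss to reach its minimum."""
    losses = np.asarray(loss_per_epoch)
    return int(np.argmax(losses <= losses.min() + tol)) + 1

def anr(clean_acc, noisy_acc):
    """Eq. (11): relative accuracy drop between the clean and noisy testing datasets."""
    return (clean_acc - noisy_acc) / clean_acc
```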

IV. RESULTS

Self-supervised learning in our model is proposed to obtain key features from underwater acoustic signals and improve the performance of the downstream recognition task. Therefore, this paper compares the underwater target recognition performance of the proposed features with state-of-the-art mainstream features, including MFCC (Wang et al., 2016; Zhang et al., 2016), GFCC, FBank, and GBank. In addition, some other fusion features are considered, such as the fusion of MFCC and modified empirical mode decomposition (MM) (Wang et al., 2019), the fusion of GFCC and modified empirical mode decomposition as the multi-dimensional fusion features (MFF) (Wang et al., 2019), the fusion of MFCC and GFCC (MG), and the fusion of FBank and GBank (FG). There are few studies that fuse features from Mel filters with features from gammatone filters as new fusion features, so we simply add them. In this paper, the Acc of the different features, including MFCC, GFCC, FBank, GBank, SAE Spec, ASAE Spec, etc., with MLP and MLR is shown in Table I.

TABLE I. The Acc (%) of various acoustic features. The bold numbers mark the best accuracy among the various features on the two datasets.

Feature      CP dataset (MLP)   CP dataset (MLR)   ShipsEar (MLP)   ShipsEar (MLR)
MFCC         88.52 ± 0.16       87.59 ± 0.19       83.73 ± 0.17     81.47 ± 0.17
GFCC         87.79 ± 0.16       86.83 ± 0.20       83.48 ± 0.17     80.74 ± 0.17
MG           88.86 ± 0.20       87.89 ± 0.19       84.36 ± 0.21     81.86 ± 0.19
MM           89.62 ± 0.32       88.69 ± 0.22       85.65 ± 0.20     82.56 ± 0.20
MFF          89.84 ± 0.27       88.84 ± 0.23       85.90 ± 0.26     82.80 ± 0.21
FBank        89.40 ± 0.21       88.45 ± 0.21       85.34 ± 0.21     82.35 ± 0.22
GBank        89.29 ± 0.24       88.34 ± 0.24       85.26 ± 0.26     82.22 ± 0.25
FG           89.60 ± 0.33       88.68 ± 0.32       85.55 ± 0.27     82.55 ± 0.29
SAE Spec     89.87 ± 0.29       88.90 ± 0.25       85.81 ± 0.28     82.79 ± 0.25
ASAE Spec    91.95 ± 0.28       89.86 ± 0.21       87.29 ± 0.26     84.44 ± 0.27

As seen in Table I, ASAE Spec performs best with both MLP and MLR. Without AEMU, the SAE Spec obtained from SAE drops by at least 0.96% in accuracy, which proves that AEMU is effective at enriching high-level semantic information for better performance. FBank offers high acoustic discrimination but low anti-noise robustness, while GBank has strong anti-noise robustness but low acoustic discrimination. In addition, MFCC and GFCC are shallow machine learning features with less acoustic information. Therefore, the accuracy of ASAE Spec is higher than that of the other features. The performance of the fusion features is close to SAE Spec; however, our self-supervised representation learning not only combines the benefits of FBank and GBank but also includes the adversarial-enhancement strategy AEMU to learn more high-level semantic information than traditional manual features. Overall, this proves that ASAE provides higher effectiveness.

To demonstrate that ASAE Spec contains more high-level semantic information, this paper carries out experiments on CLCR. Figure 5 shows the convergence rate of the various features with MLP and MLR over 500 training epochs. About 100 epochs are shown in the figure; the convergence under subsequent epochs tends to level off. Table II shows the CLCR of the various acoustic features.

As derived from Fig. 5 and Table II, ASAE Spec has better convergence rates than the other features. Clearly, compared to SAE Spec, ASAE Spec shows a 23.55% improvement in CLCR on the ShipsEar dataset. The improved convergence rates of MLP and MLR demonstrate that ASAE Spec contains more high-level semantic information. Therefore, ASAE Spec has better adaptability for underwater target recognition.

FIG. 5. (Color online) The loss of various acoustic features. (a) The loss of MLP on CP dataset, (b) the loss of MLP on ShipsEar dataset, (c) the loss of MLR on CP dataset, (d) the loss of MLR on ShipsEar dataset.

TABLE II. CLCR of various acoustic features at model convergence. The bold numbers mark the best convergence speed among the various features on the two datasets.

Feature      CP dataset, MLP (epochs)   CP dataset, MLR (epochs)   ShipsEar, MLP (epochs)   ShipsEar, MLR (epochs)
MFCC         87                         85                         89                       94
GFCC         92                         89                         95                       96
MG           89                         82                         87                       89
MM           83                         81                         78                       79
MFF          74                         71                         65                       72
FBank        57                         55                         59                       65
GBank        61                         59                         63                       70
FG           67                         65                         71                       73
SAE Spec     42                         40                         45                       50
ASAE Spec    22                         22                         31                       42

For the ANR of the various features, this paper utilizes a noisy testing dataset for the accuracy experiments. To better reflect the real-world noise environment, we add the background noise of the dataset to the clean testing dataset. The signal-to-noise ratio (SNR) of the testing dataset is greater than 6 dB. This gives evidence of good model performance in the real underwater environment. To visualize the anti-noise robustness of the various features, Table III presents the mean accuracy and drop rate (the ANR metric) of the various features on the clean and noisy datasets.

As shown in Table III, the performance of each feature is significantly poorer on the noisy dataset, since random noise poses difficulties in underwater target recognition. Meanwhile, although GFCC has the worst accuracy on the noisy dataset, the features derived from gammatone filters have a smaller drop rate than the features derived from Mel filters, indicating better ANR. Of them, the performance of the ASAE Spec is still outstanding. In summary, the accuracy of the ASAE Spec is higher than the other features on both noisy and clean datasets, and it has the smallest drop rate. This indicates that the proposed ASAE Spec has better anti-noise robustness. To further verify the anti-noise robustness of ASAE Spec, Gaussian white noise at different SNRs is added for the mainstream acoustic features on the CP dataset in Table IV.

TABLE III. ANR (%) of various acoustic features with MLP and MLR. Clean Acc (%) means accuracy on the clean dataset; Noisy Acc (%) means accuracy on the noisy dataset. The bold numbers mark the best performance among the various features on the noisy and clean datasets.

Dataset            Feature      MLP Clean Acc   MLP Noisy Acc   MLP ANR   MLR Clean Acc   MLR Noisy Acc   MLR ANR
CP dataset         MFCC         88.52           82.89           6.36      87.59           83.19           5.02
                   GFCC         87.79           82.30           6.26      86.83           83.17           4.21
                   MG           88.86           83.51           6.02      87.89           84.21           4.19
                   MM           89.62           84.84           5.33      88.69           85.87           3.18
                   MFF          89.84           85.45           4.88      88.84           86.11           3.07
                   FBank        89.40           84.22           5.79      88.45           85.10           3.79
                   GBank        89.29           84.77           5.06      88.34           85.66           3.03
                   FG           89.60           84.82           5.07      88.68           85.89           3.14
                   SAE Spec     89.87           85.57           4.79      88.90           86.42           2.79
                   ASAE Spec    91.95           88.15           4.13      89.86           87.87           2.22
ShipsEar dataset   MFCC         83.73           79.10           5.53      81.47           76.76           5.78
                   GFCC         83.48           78.91           5.47      80.74           76.53           5.22
                   MG           84.36           80.08           5.07      81.86           77.72           5.06
                   MM           85.65           82.97           3.13      82.56           79.16           4.12
                   MFF          85.90           83.32           3.00      82.80           79.94           3.45
                   FBank        85.34           81.92           3.97      82.35           78.69           4.45
                   GBank        85.26           82.66           3.04      82.22           79.06           3.84
                   FG           85.55           82.97           3.02      82.55           79.41           3.80
                   SAE Spec     85.81           83.30           2.92      82.79           79.92           3.47
                   ASAE Spec    87.29           85.19           2.41      84.44           81.89           3.02

TABLE IV. Acc (%) of mainstream acoustic features on the CP dataset at different SNRs. The bold numbers mark the best accuracy among the various features on the CP dataset at each SNR.

Feature      MLP: −10 dB   −5 dB   0 dB    5 dB    MLR: −10 dB   −5 dB   0 dB    5 dB
MFCC         16.48         26.28   67.36   73.78   17.85         28.24   68.29   79.24
GFCC         22.95         61.83   77.88   80.04   25.82         63.53   79.25   81.24
MG           21.67         59.64   71.60   79.27   22.13         50.26   75.07   80.74
MM           23.62         61.28   78.91   80.20   26.81         65.09   79.40   82.30
MFF          31.10         69.79   81.86   82.96   30.74         69.49   81.92   83.77
FBank        19.48         29.85   69.32   77.88   21.14         29.72   70.11   82.25
GBank        24.65         65.15   80.91   82.12   27.75         67.79   81.26   83.34
FG           29.25         69.23   81.21   82.85   29.27         68.99   81.90   83.66
SAE Spec     44.50         72.31   82.22   83.53   45.56         75.64   83.25   84.17
ASAE Spec    50.25         77.78   84.17   85.14   52.46         78.22   84.17   85.63
From Table IV, it is found that MFCC has the worst robustness while GFCC has better ANR. Compared with the mainstream acoustic features, SAE Spec and ASAE Spec maintain high accuracy and ANR under different SNRs with both MLP and MLR. Finally, to show the improved performance of the proposed self-supervised acoustic features more clearly, the accuracy of each class is compared with the mainstream acoustic features, as shown in Table V and Table VI.

It can be observed from Tables V and VI that the accuracy of our proposed acoustic features is higher than that of the mainstream common features. According to all the comparison experiments, it can be concluded that ASAE, with its adversarially enhanced features, achieves strong effectiveness in underwater target recognition. Moreover, it improves the performance of underwater target recognition under class imbalance, weak labels, and heavy environmental noise. Overall, ASAE can work with few high-quality ocean target samples and expensive annotation.

TABLE V. Acc (%) of acoustic features in each class of the CP dataset. The bold numbers mark the best accuracy among the various features in each class of the CP dataset.

Feature      MLP: A   B       C       D       E       MLR: A   B       C       D       E
MFCC         47.50    45.00   70.99   86.84   86.53   45.00    45.00   68.70   85.91   85.68
GFCC         42.50    35.00   67.94   85.32   84.23   37.50    25.00   67.18   85.37   84.48
FBank        47.50    45.00   72.52   87.71   87.34   47.50    45.00   70.23   86.69   86.73
GBank        45.00    45.00   71.76   88.24   85.25   47.50    40.00   69.47   86.67   86.37
SAE Spec     52.50    50.00   77.10   90.31   89.96   50.00    45.00   70.99   88.85   89.96
ASAE Spec    55.00    55.00   83.21   92.33   91.81   55.00    50.00   76.34   90.12   90.59

TABLE VI. Acc (%) of acoustic features in each class of the ShipsEar dataset. The bold numbers mark the best accuracy among the various features in each class of the ShipsEar dataset.

Feature      MLP: A   B       C       D       MLR: A   B       C       D
MFCC         83.17    73.67   89.18   84.76   79.52    75.70   82.31   82.38
GFCC         82.98    73.42   88.96   84.58   78.94    79.46   84.02   79.30
FBank        83.94    74.94   89.40   85.46   81.63    77.59   82.76   81.85
GBank        83.85    74.81   89.10   85.37   80.77    78.48   82.41   81.99
SAE Spec     85.48    78.86   90.37   86.08   82.55    80.11   85.93   82.87
ASAE Spec    86.35    83.92   90.52   86.87   83.93    83.00   86.57   84.13

A. Ablation experiment

1. RGNE

To further verify the contribution of RGNE to the improvement of the anti-noise robustness, an ablation experiment on RGNE is carried out. The experiments are run 15 times on the two datasets, and the results are shown in Fig. 6. USAE Spec means the unrobust SAE Spec learned by SAE without RGNE (USAE).

FIG. 6. (Color online) The ablation experiment of SAE Spec and USAE Spec. (a) The ablation experiment of SAE Spec with MLP on noisy CP dataset, (b) the ablation experiment of SAE Spec with MLP on noisy ShipsEar dataset, (c) the ablation experiment of SAE Spec with MLR on noisy CP dataset, (d) the ablation experiment of SAE Spec with MLR on noisy ShipsEar dataset.

As seen in Fig. 6, although the accuracies of SAE Spec and USAE Spec show slight fluctuations, the anti-noise robustness is still strong. On the CP dataset with MLP, the accuracy of SAE Spec is higher than that of USAE Spec by 2.10%; on the ShipsEar dataset with MLP, it is about 2.03% higher. With MLR, the accuracy of SAE Spec is about 2.05% higher on the CP dataset and about 2.12% higher on the ShipsEar dataset. In summary, this indicates that RGNE can improve the anti-noise robustness of acoustic features in underwater target recognition.

2. AEMU

To verify AEMU, this paper compares the accuracy of ASAE Spec for different sizes of the dynamic queue in AEMU, as shown in Fig. 7.

FIG. 7. The accuracy of ASAE Spec under different AEMU scales. (a) The ablation experiment of AEMU with MLP on CP dataset, (b) the ablation experiment of AEMU with MLP on ShipsEar dataset, (c) the ablation experiment of AEMU with MLR on CP dataset, (d) the ablation experiment of AEMU with MLR on ShipsEar dataset.

As can be seen from Fig. 7, the accuracy of ASAE Spec with MLP and MLR grows as the scale of AEMU increases. Specifically, the accuracy is correlated with the AEMU scale up to 256. In total, AEMU is introduced to help SAE mine more negative samples for adversarial enhancement.

V. CONCLUSION

Based on underwater acoustic signals, we propose self-supervised acoustic representation learning for downstream underwater target recognition. SAE is designed to perform the auxiliary task of feature reconstruction so as to form an animal-like acoustic auditory system. Furthermore, AEMU, with a modified contrastive learning loss function, provides more high-level semantic information in the form of ASAE Spec. This acoustic feature captures the benefits of FBank and GBank together with more useful contrastive information and is well suited to downstream underwater target recognition. Even with a shallow classification model, it outperforms common features such as MFCC and GFCC and demonstrates better underwater target identification and classification performance.

Furthermore, the results of ASAE are validated on the CP dataset and the ShipsEar dataset, and we confirm that ASAE Spec performs better than the mainstream features in underwater target recognition. It is also well illustrated that underwater target recognition with the adversarially enhanced feature has the potential to enrich high-level semantic information. However, the lack of high-quality datasets and the weak generalization of representation learning remain key challenges in underwater target recognition. These problems need to be explored in future research.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China under Grant No. 41876110. This work was also supported by the Fundamental Research Funds for the Central Universities in China under Grant No. 3072022JC0601.

Cao, X., Togneri, R., Zhang, X., and Yu, Y. (2019). "Convolutional neural network with second-order pooling for underwater target classification," IEEE Sens. J. 19, 3058–3066.
Cao, X., Zhang, X., Yu, Y., and Niu, L. (2016). "Deep learning-based recognition of underwater target," in 2016 IEEE International Conference on Digital Signal Processing, October 16–18, 2016, Beijing, China, pp. 89–93.
Cao, Y., Miao, Q.-G., Liu, J.-C., and Gao, L. (2013). "Advance and prospects of AdaBoost algorithm," Zidonghua Xuebao/Acta Automatica Sin. 39, 745–758.
Chen, Y., and Xu, X. (2017). "The research of underwater target recognition method based on deep learning," in 7th IEEE International Conference on Signal Processing, Communications and Computing, October 22–25, 2017, Xiamen, Fujian, China, pp. 1–5.
Chung, Y.-A., and Glass, J. (2020). "Generative pre-training for speech with autoregressive predictive coding," in 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, May 4–8, 2020, Barcelona, Spain, pp. 3497–3501.
Ciaparrone, G., Luque Sanchez, F., Tabik, S., Troiano, L., Tagliaferri, R., and Herrera, F. (2020). "Deep learning in video multi-object tracking: A survey," Neurocomputing 381, 61–88.
Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2019). "ArcFace: Additive angular margin loss for deep face recognition," in 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 16–20, 2019, Long Beach, CA, pp. 4685–4694.
Doersch, C. (2016). "Tutorial on variational autoencoders," arXiv:05908.
Durak, L., and Arikan, O. (2003). "Short-time Fourier transform: Two fundamental properties and an optimal implementation," IEEE Trans. Signal Process. 51, 1231–1242.
Fan, X., Jiang, W., Luo, H., and Fei, M. (2019). "SphereReID: Deep hypersphere manifold embedding for person re-identification," J. Visual Commun. Image Represent. 60, 51–58.
Freund, Y., and Schapire, R. E. (1997). "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci. 55, 119–139.
Gidaris, S., Singh, P., and Komodakis, N. (2018). "Unsupervised representation learning by predicting image rotations," in 6th International Conference on Learning Representations, April 30–May 3, 2018, Vancouver, BC, Canada, pp. 1–16.
Gutmann, M., and Hyvarinen, A. (2010). "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models," in 13th International Conference on Artificial Intelligence and Statistics, May 13–15, 2010, Sardinia, Italy, pp. 297–304.
Hadsell, R., Chopra, S., and LeCun, Y. (2006). "Dimensionality reduction by learning an invariant mapping," in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 17–22, 2006, New York, NY, pp. 1735–1742.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). "Deep residual learning for image recognition," in 29th IEEE Conference on Computer Vision and Pattern Recognition, June 26–July 1, 2016, Las Vegas, NV, pp. 770–778.
Ji, S., Liao, X., and Carin, L. (2005). "Adaptive multiaspect target classification and detection with hidden Markov models," IEEE Sens. J. 5, 1035–1042.
Jin, G., Liu, F., Wu, H., and Song, Q. (2020). "Deep learning-based framework for expansion, recognition and classification of underwater acoustic signal," J. Exp. Theor. Artif. Intell. 32, 205–218.
Johannesma, P. (1972). "The pre-response stimulus ensemble of neurons in the cochlear nucleus," in Symposium on Hearing Theory, June 22–23, 1972, Eindhoven, Holland, pp. 58–69.
Kamal, S., Mohammed, S. K., Pillai, P. R. S., and Supriya, M. H. (2013). "Deep learning architectures for underwater target recognition," in 12th Symposium on Ocean Electronics, October 23–25, 2013, Kochi, India, pp. 48–54.
Kim, J. W., and Saurous, R. A. (2018). "Emotion recognition from human speech using temporal information and deep learning," in 19th Annual Conference of the International Speech Communication Association, September 2–6, 2018, Hyderabad, India, pp. 937–940.
Kong, L., d'Autume, C. d. M., Ling, W., Yu, L., Dai, Z., and Yogatama, D. (2019). "A mutual information maximization perspective of language representation learning," arXiv:08350.
Lee, H., Hwang, S. J., and Shin, J. (2020). "Self-supervised label augmentation via input transformations," in 37th International Conference on Machine Learning [Virtual], pp. 5670–5680.
Li, J., Zhou, P., Xiong, C., and Hoi, S. C. (2020). "Prototypical contrastive learning of unsupervised representations," arXiv:04966.
Lian, Z., Xu, K., Wan, J., and Li, G. (2017). "Underwater acoustic target classification based on modified GFCC features," in 2nd IEEE Advanced Information Technology, Electronic and Automation Control Conference, Chongqing, China, pp. 258–262.
Lim, T., Bae, K., Hwang, C., and Lee, H. (2007). "Classification of underwater transient signals using MFCC feature vector," in 2007 9th International Symposium on Signal Processing and Its Applications, Sharjah, United Arab Emirates, pp. 1–4.
Mao, Z., Wang, Z., and Wang, D. (2015). "Speaker recognition algorithm based on Gammatone filter bank," Comput. Eng. Appl. 51, 200–203.
Movshovitz-Attias, Y., Toshev, A., Leung, T. K., Ioffe, S., and Singh, S. (2017). "No fuss distance metric learning using proxies," in 16th IEEE International Conference on Computer Vision, October 22–29, 2017, Venice, Italy, pp. 360–368.
Nandan, A., and Vepa, J. (2020). "Language agnostic speech embeddings for emotion classification," in Proc. ICML Workshop SAS, July 18, 2020, Virtual, pp. 1–6.
Neupane, D., and Seok, J. (2020). "A review on deep learning-based approaches for automatic sonar target recognition," Electronics 9, 1972–2001.
Oord, A. v. d., Li, Y., and Vinyals, O. (2018). "Representation learning with contrastive predictive coding," arXiv:03748.
Pascual, S., Ravanelli, M., Serra, J., Bonafonte, A., and Bengio, Y. (2019). "Learning problem-agnostic speech representations from multiple self-supervised tasks," in 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, September 15–19, 2019, Graz, Austria, pp. 161–165.
Ravanelli, M., Zhong, J., Pascual, S., Swietojanski, P., Monteiro, J., Trmal, J., and Bengio, Y. (2020). "Multi-task self-supervised learning for robust speech recognition," in 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, May 4–8, 2020, Barcelona, Spain, pp. 6989–6993.
Santos-Domínguez, D., Torres-Guijarro, S., Cardenal-López, A., and Pena-Giménez, A. (2016). "ShipsEar: An underwater vessel noise database," Appl. Acoust. 113, 64–69.
Schroff, F., Kalenichenko, D., and Philbin, J. (2015). "FaceNet: A unified embedding for face recognition and clustering," in IEEE Conference on Computer Vision and Pattern Recognition, June 7–12, 2015, Boston, MA, pp. 815–823.
Shelhamer, E., Long, J., and Darrell, T. (2017). "Fully convolutional networks for semantic segmentation," IEEE Trans. Pattern Anal. Mach. Intell. 39, 640–651.
Shen, S., Yang, H., and Sheng, M. (2018). "Compression of a deep competitive network based on mutual information for underwater acoustic targets recognition," Entropy 20, 243–245.
Sohn, K. (2016). "Improved deep metric learning with multi-class N-pair loss objective," in 30th Annual Conference on Neural Information Processing Systems, Barcelona, Spain, pp. 1857–1865.
Song, H. O., Xiang, Y., Jegelka, S., and Savarese, S. (2016). "Deep metric learning via lifted structured feature embedding," in 29th IEEE Conference on Computer Vision and Pattern Recognition, June 26–July 1, 2016, Las Vegas, NV, pp. 4004–4012.
Steiniger, Y., Kraus, D., and Meisen, T. (2022). "Survey on deep learning based computer vision for sonar imagery," Eng. Appl. Artif. Intell. 114, 105157.
Sun, Y., Cheng, C., Zhang, Y., Zhang, C., Zheng, L., Wang, Z., and Wei, Y. (2020). "Circle loss: A unified perspective of pair similarity optimization," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition [Virtual], pp. 6397–6406.
Tian, S., Chen, D., Wang, H., and Liu, J. (2021). "Deep convolution stack for waveform in underwater acoustic target recognition," Sci. Rep. 11, 9614.
Wang, W., Li, S., Yang, J., Liu, Z., and Zhou, W. (2016). "Feature extraction of underwater target in auditory sensation area based on MFCC," in 2016 IEEE/OES China Ocean Acoustics (COA), pp. 1–6.
Wang, X., Liu, A., Zhang, Y., and Xue, F. (2019). "Underwater acoustic target recognition: A combination of multi-dimensional fusion features and modified deep neural network," Remote Sens. 11, 1888–1904.
Wang, W., Tang, Q., and Livescu, K. (2020a). "Unsupervised pre-training of bidirectional speech encoders via masked reconstruction," in IEEE International Conference on Acoustics, Speech, and Signal Processing, May 4–8, 2020, pp. 6889–6893.
Wang, X., Zhang, H., Huang, W., and Scott, M. R. (2020b). "Cross-batch memory for embedding learning," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition [Virtual], June 14–19, 2020, pp. 6387–6396.
Xuegong, Z. J. (2000). "Introduction to statistical learning theory support vector machines," Acta Automatica Sin. 26, 32–42.
Yang, H., Gan, A., Chen, H., Pan, Y., Tang, J., and Li, J. (2016). "Underwater acoustic target recognition using SVM ensemble via weighted sample and feature selection," in 13th International Bhurban Conference on Applied Sciences and Technology, Islamabad, Pakistan, pp. 522–527.
Yang, H., Shen, S., Yao, X., Sheng, M., and Wang, C. (2018). "Competitive deep-belief networks for underwater acoustic target recognition," Sensors 18, 952.
Yoshioka, T., Ragni, A., and Gales, M. J. F. (2014). "Investigation of unsupervised adaptation of DNN acoustic models with filter bank input," in 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing, May 4–9, 2014, Florence, Italy, pp. 6344–6348.
Zhang, L., Wu, D., Han, X., and Zhu, Z. (2016). "Feature extraction of underwater target signal using mel frequency cepstrum coefficients based on acoustic vector sensor," J. Sens. 2016, 1.
Zhang, W., Wu, Y., Wang, D., Wang, Y., Wang, Y., and Zhang, L. (2018). "Underwater target feature extraction and classification based on gammatone filter and machine learning," in 15th International Conference on Wavelet Analysis and Pattern Recognition, July 15–18, 2018, Chengdu, China, pp. 42–47.
Zhou, K., Wang, H., Zhao, W. X., Zhu, Y., Wang, S., Zhang, F., Wang, Z., and Wen, J.-R. (2020). "S3-Rec: Self-supervised learning for sequential recommendation with mutual information maximization," in Proceedings of the 29th ACM International Conference on Information and Knowledge Management, CIKM 2020, October 19–23, 2020, Virtual, Online, Ireland, pp. 1893–1902.
Zhu, P., Isaacs, J., Fu, B., and Ferrari, S. (2017). "Deep learning feature extraction for target recognition and classification in underwater sonar images," in 56th IEEE Annual Conference on Decision and Control, CDC 2017, December 12–15, 2017, Melbourne, VIC, Australia, pp. 2724–2731.
Zhuang, C., Zhai, A., and Yamins, D. (2019). "Local aggregation for unsupervised learning of visual embeddings," in 17th IEEE/CVF International Conference on Computer Vision, October 27–November 2, 2019, South Korea, pp. 6001–6011.
