Self-Supervised Acoustic Representation Learning Via Acoustic-Embedding Memory Unit Modified Space Autoencoder For Underwater Target Recognition
ABSTRACT:
Owing to the expensive annotation of high-quality signals obtained from passive sonars and the weak generalization ability of any single feature in the ocean, this paper proposes self-supervised acoustic representation learning under an acoustic-embedding memory unit modified space autoencoder (ASAE) and applies it to the underwater target recognition task. In the manner of an animal-like acoustic auditory system, the first step is to design a self-supervised representation learning method called the space autoencoder (SAE), which merges the Mel filter-bank (FBank), with its acoustic discrimination, and the gammatone filter-bank (GBank), with its anti-noise robustness, into the SAE spectrogram (SAE Spec). Meanwhile, because SAE Spec is poor in high-level semantic information, an acoustic-embedding memory unit (AEMU) is further introduced to enrich it through adversarial enhancement.
I. INTRODUCTION

Underwater target recognition plays a crucial role in military and civil fields. This technology utilizes acoustic signals from passive sonar to classify targets including sea creatures, ships, and sea waves. Nowadays, underwater target recognition with high efficiency and high accuracy has become one of the most important projects in the ocean science field, and highly intelligent and automatic machine identification has gradually been used for underwater target recognition (Neupane and Seok, 2020; Steiniger et al., 2022). However, many factors increase the difficulty of underwater target recognition: (1) There are complex information components in acoustic signals from the sonar, such as sea waves, ocean biological acoustic information, and acoustic signals from artificial exploration activity. (2) Underwater target recognition is mainly based on the radiated noise from targets' power systems, whereas recognizing target radiated noise is a great challenge for humans and machines. Therefore, many scholars try to solve these problems in underwater target recognition.

Statistical methods for representing target radiated noise were raised in the 1990s (Xuegong, 2000; Ji and Liao, 2005; Yang et al., 2016; Lian et al., 2017). For example, Yang et al. (2016) developed support vector machines (SVM) (Cao et al., 2013; Freund and Schapire, 1997) to reduce the scale of samples in the training set. Lian et al. (2017) found that the acoustic features from gammatone filters have strong anti-noise robustness. However, information loss and complicated operations in statistical methods make it difficult to obtain good acoustic representations. The professional knowledge necessary for machine learning is also difficult to acquire in the complicated ocean environment. Hence, early models for underwater target recognition hardly meet the requirements of underwater tasks for high efficiency and high accuracy.

Over recent years, the outstanding performance of deep learning in various fields, such as object tracking (Ciaparrone et al., 2020) and emotion recognition (Kim and Saurous, 2018), provides novel ideas for underwater target recognition. Deep learning has a strong representation learning ability that does not require professional knowledge, given enough data (Kamal et al., 2013; Cao et al., 2016; Chen and Xu, 2017; Zhu et al., 2017; Shen et al., 2018; Yang et al., 2018; Cao et al., 2019; Tian et al., 2021). Cao et al. (2016) proposed a modified autoencoder to learn the key knowledge of underwater data for convolutional neural networks (CNN). Zhu et al. (2017) introduced CNN for representation learning and applied SVM to recognize unmanned underwater detection equipment (UUDE). To improve representational performance in terms of convergence rate and anti-noise robustness, a novel compressed competitive deep belief network was developed by discretization and network reconstruction with few samples (Shen et al., 2018).
Nevertheless, most deep learning methods employ supervised learning, which relies on "clean" data with labels. It is difficult to obtain many labeled "clean" acoustic signals in underwater target recognition. Instead, self-supervised learning designs upstream auxiliary tasks that do not require labels. These auxiliary tasks focus on mining samples without labels according to their own supervisory information and learn features valuable for downstream tasks (Doersch, 2016; Gidaris et al., 2018; Lee et al., 2020). Further, supervised learning assigns the same labels to augmented samples from the same source; even so, if augmentation results in large differences in their distributions, forcing such invariance may degrade model performance. Thus, Lee et al. (2020) proposed a self-supervised label augmentation technique with aggregation and self-distillation in a multi-task learning framework for alleviating few-shot and imbalanced classification tasks. To learn acoustic features from unlabeled data, some general solutions have been put forward, such as contrastive-based self-supervised learning. Meanwhile, how to mine more negative samples in contrastive-based self-supervised learning should be addressed.

To address the problems of limited labeled underwater acoustic samples and poor negative sample learning, this paper proposes an underwater target recognition method with self-supervised acoustic representation learning to obtain adversarial enhanced features under ASAE. To the best of our knowledge, this is the first attempt to combine self-supervised acoustic representation learning with underwater target recognition. The main contributions are as follows: (1) To take advantage of FBank and GBank together, this paper proposes SAE for the auxiliary task of feature reconstruction and obtains SAE Spec with high acoustic discrimination and anti-noise robustness. (2) To compensate for the lack of high-level semantic information in SAE Spec by mining more negative samples, this paper introduces a strategy of adversarial enhancement with AEMU.
FIG. 1. (Color online) The process of applying Mel filters and gammatone filters to the spectrogram (Spec) to obtain FBank and GBank. FBank and GBank are more informative than MFCC and GFCC, which apply a DCT and thereby lose some information.
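To make the distinction concrete, the sketch below computes log Mel filter-bank energies (FBank), the DCT that would turn them into MFCC, and log gammatone filter-bank energies (GBank). It is only an illustration: the center-frequency spacing, ERB-style bandwidths, and all names and parameter values are our assumptions, not the authors' implementation.

```python
import numpy as np
import librosa
from scipy.fftpack import dct
from scipy.signal import fftconvolve

def fbank(y, sr=16000, n_fft=512, hop=256, n_mels=64):
    """Log Mel filter-bank energies (FBank): triangular Mel filters on the power Spec, no DCT."""
    power = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # triangular filters, cf. Eq. (1)
    return np.log(mel_fb @ power + 1e-10)

def mfcc_from_fbank(fb, n_coef=20):
    """MFCC keeps only the leading DCT coefficients of FBank, discarding part of the pattern."""
    return dct(fb, type=2, axis=0, norm="ortho")[:n_coef]

def gbank(y, sr=16000, n_filters=64, order=4, frame=512, hop=256):
    """Log gammatone filter-bank energies (GBank), using impulse responses of the form in Eq. (2)."""
    t = np.arange(0, 0.064, 1.0 / sr)                    # 64-ms impulse responses
    centers = np.linspace(50.0, 0.45 * sr, n_filters)    # assumed center-frequency spacing
    feats = []
    for fc in centers:
        b = 1.019 * (24.7 + 0.108 * fc)                  # ERB-style bandwidth (our choice)
        g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
        band = fftconvolve(y, g, mode="same") ** 2       # band-limited energy over time
        frames = librosa.util.frame(band, frame_length=frame, hop_length=hop)
        feats.append(np.log(frames.mean(axis=0) + 1e-10))
    return np.stack(feats)                               # shape: (n_filters, n_frames)
```

FBank and GBank keep the full filter-bank energy pattern, whereas `mfcc_from_fbank` retains only a few DCT coefficients, which is the information loss referred to in Fig. 1.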
GBank is the counterpart of the gammatone frequency cepstral coefficient (GFCC) without discrete cosine transform (DCT), just as FBank is for MFCC. These filters produce top-performing and reliable features in underwater target recognition: FBank fits the acoustic discrimination, and GBank fits the anti-noise robustness.

1. Mel filter

Mel filters select wave energy in different frequency bands as target representations, like the auditory system of the human ear, with strong acoustic discrimination. Several filters $H_m(k)$ are set in the spectral range. Each filter is a triangular filter with a center frequency $f(m)$,

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1). \end{cases} \quad (1)$$

2. Gammatone filter

Gammatone filters (Johannesma, 1972) are applied to the auditory system with frequency sensing to simulate the energy feedback in the human ear. Each gammatone filter is regarded as the multiplication of the gamma function and the acoustic signal as follows:

$$g_i(t) = t^{\,n-1} \exp(-2\pi b_i t)\cos(2\pi f_i t + \varphi_i)\,u(t), \quad 1 \le i \le N, \quad (2)$$

where $f_i$ is the center frequency and $b_i$ is the time attenuation coefficient of the $i$th filter.

In this paper, due to the dependence of deep learning on labeled data, SAE is proposed as a self-supervised acoustic representation learning method. It aims at implementing an animal-like acoustic auditory system by FBank and GBank. It takes FBank as input and performs the auxiliary task of reconstructing FBank to GBank. Obviously, it does not need labels as supervised information to learn acoustic features. Therefore, a new high-performance feature named SAE Spec is created by combining the good acoustic discrimination of FBank and the strong anti-noise robustness of GBank. Moreover, the self-supervised SAE copes with the problems of abundant unlabeled data and class-imbalanced datasets.

SAE contains an encoder $h(\cdot)$ and a decoder $g(\cdot)$, as illustrated in Fig. 2. To make SAE more anti-noise robust, the random Gaussian noise embedded module (RGNE) is added to FBank. It simulates the noise from a real underwater environment so that the extracted features achieve better anti-noise robustness. The input is encoded into the embedding space by $h(\cdot)$. The decoder then reconstructs the encoder output as pseudo-GBank by upsampling. The pseudo-GBank and true GBank are used in a loss of spatial similarity. Therefore, SAE Spec in the embedding space is beneficial for recognition and anti-noise robustness. A labeled and clean dataset is not necessary for SAE, owing to the auxiliary task of feature reconstruction. SAE is therefore well suited to representation learning for downstream underwater target recognition.
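The SAE pipeline described above can be sketched as follows. This is a minimal illustration under our own assumptions: the small convolutional encoder/decoder, the cosine-similarity reconstruction loss, and names such as `SpaceAutoencoder` are ours, while the paper's encoder is a deeper ResNet50-like network with three stages.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceAutoencoder(nn.Module):
    """Minimal SAE sketch: noisy FBank -> embedding (SAE Spec) -> pseudo-GBank."""

    def __init__(self, emb_ch=128):
        super().__init__()
        # Encoder h(.): the paper uses a ResNet50-like network; a small CNN stands in here.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, emb_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder g(.): upsamples the embedding back to the filter-bank resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(emb_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, fbank, sigma=0.1):
        # RGNE: superimpose random Gaussian noise on FBank, in the spirit of Eq. (3).
        noisy = fbank + sigma * torch.randn_like(fbank)
        z = self.encoder(noisy)          # SAE Spec lives in this embedding space, Eq. (4)
        pseudo_gbank = self.decoder(z)   # reconstruction toward the true GBank
        return z, pseudo_gbank

def sae_loss(pseudo_gbank, gbank):
    """Spatial-similarity reconstruction loss between pseudo-GBank and true GBank
    (a cosine-similarity form is assumed here for illustration)."""
    p = F.normalize(pseudo_gbank.flatten(1), dim=1)
    g = F.normalize(gbank.flatten(1), dim=1)
    return (1.0 - (p * g).sum(dim=1)).mean()

# Usage (hypothetical): fbank and gbank are (batch, 1, n_bands, n_frames) tensors of
# log filter-bank energies, with both spatial sizes divisible by 8 so shapes match.
#   z, rec = model(fbank); loss = sae_loss(rec, gbank); loss.backward()
```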
As shown in Fig. 2, RGNE is used to generate the Gaussian noise $n$ given in Eq. (3). FBank and the random noise $n$ are superimposed to form a noisy FBank for the encoder $h(\cdot)$,

$$n(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}. \quad (3)$$

The encoder is similar to ResNet50 (He et al., 2016). Encoder $h(\cdot)$ has three stages in total, and each contains a residual network of three convolutional layers. The principle of the encoder is

$$z_i = h(x_i). \quad (4)$$

In the pairwise loss, $y$ indicates whether the paired samples are positive or negative, and $x_i$ and $x_j$ represent the one-dimensional vectors $h_i$ and $h_j$ of pseudo-GBank and true GBank.

C. AEMU

As traditional self-supervised learning usually assumes that all pixels are independent, high-level semantic information is ignored. However, high-level semantic information plays an important role in classification: the richer the high-level semantic information, the better the classification ability (Shelhamer et al., 2017). Therefore, in the underwater target recognition field, the core of better performance is to learn more high-level semantic information that represents different targets for easier classification. The contrastive-based self-supervised learning in this paper needs positive samples and negative samples and then measures the distance between positive and negative samples.
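As an illustration of measuring the distance between positive and negative samples, the sketch below computes a standard InfoNCE-style contrastive loss. The paper uses a modified InfoNCE whose exact form is not reproduced in this excerpt, so the temperature value and function names here are our own assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, temperature=0.07):
    """Generic InfoNCE-style contrastive loss (illustrative, not the paper's modified form).

    query:     (batch, dim)  anchor embeddings (e.g., encoded pseudo-GBank)
    positive:  (batch, dim)  matching embeddings (e.g., encoded true GBank)
    negatives: (n_neg, dim)  embeddings treated as negative samples
    """
    q = F.normalize(query, dim=1)
    k_pos = F.normalize(positive, dim=1)
    k_neg = F.normalize(negatives, dim=1)

    pos_logit = (q * k_pos).sum(dim=1, keepdim=True)   # (batch, 1) similarity to the positive
    neg_logits = q @ k_neg.t()                         # (batch, n_neg) similarities to negatives
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature

    # The positive pair occupies index 0, so the target class is 0 for every sample.
    targets = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, targets)
```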
FIG. 3. (Color online) The framework of the proposed ASAE. The adversarial enhancement strategy is combined with the contrastive learning loss through AEMU.
Negative sample learning is performed within the mini-batch: a larger mini-batch provides more negative samples, which produces better training in contrastive learning. However, the size of the mini-batch is limited by the memory size. Therefore, AEMU is applied to collect information across different mini-batches, and it plays an important role in adversarial enhancement. AEMU stores many negative samples in a dynamic queue whose scale is controllable. In each iteration, the dynamic queue is updated with a mini-batch by dequeue and enqueue operations: the GBank of each mini-batch is appended at the end of the queue, while the GBank at the front of the queue is taken out. This ensures that the dynamic queue contains the positive samples required at each iteration, while the remaining features are treated as negative samples. Also, AEMU supplies the N paired samples adopted in the modified InfoNCE.
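A minimal sketch of such a dynamic negative-sample queue is given below; it is our illustration of the dequeue/enqueue update only, and the queue size, normalization, and class name are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

class NegativeQueue:
    """Fixed-size dynamic queue of feature vectors used as negative samples."""

    def __init__(self, dim, queue_size=4096):
        self.queue = F.normalize(torch.randn(queue_size, dim), dim=1)  # arbitrary initial content
        self.ptr = 0
        self.size = queue_size

    @torch.no_grad()
    def dequeue_and_enqueue(self, keys):
        """Replace the oldest entries with the GBank embeddings of the current mini-batch."""
        n = keys.size(0)
        idx = (self.ptr + torch.arange(n)) % self.size   # wrap around like a ring buffer
        self.queue[idx] = F.normalize(keys, dim=1)
        self.ptr = (self.ptr + n) % self.size

    def negatives(self):
        return self.queue                                # all stored entries act as negatives

# Per-iteration usage (hypothetical names):
#   loss = info_nce(query_emb, positive_emb, queue.negatives())
#   queue.dequeue_and_enqueue(positive_emb.detach())
```

Because the queue decouples the number of negatives from the mini-batch size, far more negatives can be accumulated than a single batch in GPU memory would allow.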
downstream tasks in this paper. Since the proposed ASAE has already achieved the important representation learning in underwater target recognition, this paper only accomplishes the downstream tasks with a shallow classification model to verify the proposed SAE Spec and ASAE Spec. Also, two datasets are utilized to validate the performance of the obtained acoustic features. Comparisons are made with existing strong acoustic features, such as MFCC, GFCC (Mao et al., 2015; Lian et al., 2017; Zhang et al., 2018), FBank, and GBank. The details of the proposed model are presented in the Methods section. In addition, we also perform an ablation experiment for the idea proposed in this paper.

A. Dataset

This paper performs experiments on two datasets. One is the cooperative-project dataset (CP Dataset) collected
FIG. 4. (Color online) Spec of FBank, GBank, and the original underwater acoustic signal for class E in the CP dataset. The vertical axis is time, and the horizontal axis represents frequency. (a) FBank Spec, (b) GBank Spec, (c) original underwater acoustic signal Spec.
extra class named background noise with 12 records. In summary, the ratio between the training set and the testing set in all datasets is 7:3. Meanwhile, the training set is adopted to train the ASAE until convergence.

B. Evaluation Metrics

In our paper, accuracy (Acc), classifier loss convergence rate (CLCR), and anti-noise robustness (ANR) are utilized as evaluation metrics for our proposed model.

1. Acc

Accuracy (Acc) is defined as

$$\mathrm{Accuracy}(f', D_{\mathrm{Test}}) = \frac{1}{N}\sum_{i=1}^{N} \big(f'(x_i) = \mathrm{label}_i\big), \quad (9)$$

where $D_{\mathrm{Test}}$ is the testing dataset and $f'(\cdot)$ represents the classifier. ANR is defined as

$$\mathrm{ANR} = \frac{\mathrm{Accuracy}(f', D_{\mathrm{Test}}) - \mathrm{Accuracy}(f', DN_{\mathrm{Test}})}{\mathrm{Accuracy}(f', D_{\mathrm{Test}})}. \quad (11)$$
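A small sketch of these two metrics is given below, assuming a classifier with a scikit-learn-style `predict` method and assuming that $DN_{\mathrm{Test}}$ in Eq. (11) denotes the test set with added noise; all names are illustrative.

```python
import numpy as np

def accuracy(clf, x_test, labels):
    """Eq. (9): fraction of test samples whose prediction matches the label."""
    pred = clf.predict(x_test)
    return np.mean(pred == labels)

def anti_noise_robustness(clf, x_test, x_test_noisy, labels):
    """Eq. (11): relative accuracy drop when the clean test set is replaced by a noisy one."""
    acc_clean = accuracy(clf, x_test, labels)
    acc_noisy = accuracy(clf, x_test_noisy, labels)
    return (acc_clean - acc_noisy) / acc_clean
```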
The comparison features also include the fusion of GFCC and modified empirical mode decomposition as multi-dimensional fusion features (MFF) (Wang et al., 2019), the fusion of MFCC and GFCC (MG), and the fusion feature of FBank and GBank (FG). However, there are few studies that fuse features from Mel filters and features from gammatone filters into new fusion features, so we simply add them. In this paper, the Acc of different features, including MFCC, GFCC, FBank, GBank, SAE Spec, ASAE Spec, etc., with MLP and MLR is shown in Table I.

As seen in Table I, ASAE Spec performs best with both MLP and MLR. Without AEMU, the SAE Spec obtained from SAE drops by at least 0.96% in accuracy, which shows that AEMU effectively improves performance by enriching high-level semantic information. FBank offers high acoustic discrimination but low anti-noise robustness, while GBank has strong anti-noise robustness but low acoustic discrimination. In addition, MFCC and GFCC are shallow machine learning features with less acoustic information.

TABLE I. The Acc (%) of various acoustic features. The bold numbers mean the best accuracy among various features in two datasets.
TABLE IV. Acc (%) of mainstream acoustic features on CP dataset at different SNRs. The bold numbers mean the best accuracy among various features in CP dataset with different SNRs.

                 MLP                                  MLR
Feature      -10 dB   -5 dB    0 dB    5 dB       -10 dB   -5 dB    0 dB    5 dB
MFCC          16.48   26.28   67.36   73.78        17.85   28.24   68.29   79.24
GFCC          22.95   61.83   77.88   80.04        25.82   63.53   79.25   81.24
MG            21.67   59.64   71.60   79.27        22.13   50.26   75.07   80.74
MM            23.62   61.28   78.91   80.20        26.81   65.09   79.40   82.30
MFF           31.10   69.79   81.86   82.96        30.74   69.49   81.92   83.77
FBank         19.48   29.85   69.32   77.88        21.14   29.72   70.11   82.25
GBank         24.65   65.15   80.91   82.12        27.75   67.79   81.26   83.34
FG            29.25   69.23   81.21   82.85        29.27   68.99   81.90   83.66
SAE Spec      44.50   72.31   82.22   83.53        45.56   75.64   83.25   84.17
ASAE Spec     50.25   77.78   84.17   85.14        52.46   78.22   84.17   85.63

TABLE VI. Acc (%) of acoustic features in each class of ShipsEar dataset. The bold numbers mean the best accuracy among various features in each class of ShipsEar dataset.

                 MLP                              MLR
Feature         A       B       C       D         A       B       C       D
MFCC          83.17   73.67   89.18   84.76     79.52   75.70   82.31   82.38
GFCC          82.98   73.42   88.96   84.58     78.94   79.46   84.02   79.30
FBank         83.94   74.94   89.40   85.46     81.63   77.59   82.76   81.85
GBank         83.85   74.81   89.10   85.37     80.77   78.48   82.41   81.99
SAE Spec      85.48   78.86   90.37   86.08     82.55   80.11   85.93   82.87
ASAE Spec     86.35   83.92   90.52   86.87     83.93   83.00   86.57   84.13

under 15 times on two datasets, which are shown in Fig. 6. USAE Spec means unrobust SAE Spec learned by SAE without RGNE.
TABLE V. Acc (%) of acoustic features in each class of CP dataset. The bold numbers mean the best accuracy among various features in each class of CP dataset.

                 MLP                                       MLR
Feature         A       B       C       D       E          A       B       C       D       E
MFCC          47.50   45.00   70.99   86.84   86.53      45.00   45.00   68.70   85.91   85.68
GFCC          42.50   35.00   67.94   85.32   84.23      37.50   25.00   67.18   85.37   84.48
FBank         47.50   45.00   72.52   87.71   87.34      47.50   45.00   70.23   86.69   86.73
GBank         45.00   45.00   71.76   88.24   85.25      47.50   40.00   69.47   86.67   86.37
SAE Spec      52.50   50.00   77.10   90.31   89.96      50.00   45.00   70.99   88.85   89.96
ASAE Spec     55.00   55.00   83.21   92.33   91.81      55.00   50.00   76.34   90.12   90.59
Spec has better performance than the mainstream features in underwater target recognition. It is also well illustrated that underwater target recognition with the adversarial enhanced feature has the potential to enrich high-level semantic information. However, the lack of high-quality datasets and the weak generalization of representation learning remain key challenges in underwater target recognition. These problems also need to be explored in future research.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China under Grant No. 41876110. This work was also supported by the Fundamental Research Funds for the Central Universities in China under Grant No. 3072022JC0601.

Cao, X., Togneri, R., Zhang, X., and Yu, Y. (2019). "Convolutional neural
Ji, S., Liao, X., and Carin, L. (2005). "Adaptive multiaspect target classification and detection with hidden Markov models," IEEE Sens. J. 5, 1035–1042.
Jin, G., Liu, F., Wu, H., and Song, Q. (2020). "Deep learning-based framework for expansion, recognition and classification of underwater acoustic signal," J. Exp. Theor. Artif. Intell. 32, 205–218.
Johannesma, P. (1972). "The pre-response stimulus ensemble of neurons in the cochlear nucleus," in Symposium on Hearing Theory, June 22–23, 1972, Eindhoven, Holland, pp. 58–69.
Kamal, S., Mohammed, S. K., Pillai, P. R. S., and Supriya, M. H. (2013). "Deep learning architectures for underwater target recognition," in 12th Symposium on Ocean Electronics, October 23–25, 2013, Kochi, India, pp. 48–54.
Kim, J. W., and Saurous, R. A. (2018). "Emotion recognition from human speech using temporal information and deep learning," in 19th Annual Conference of the International Speech Communication, September 2–6, 2018, Hyderabad, India, pp. 937–940.
Kong, L., d'Autume, C. d M., Ling, W., Yu, L., Dai, Z., and Yogatama, D. (2019). "A mutual information maximization perspective of language representation learning," arXiv:08350.
Lee, H., Hwang, S. J., and Shin, J. (2020). "Self-supervised label augmentation via input transformations," in 37th International Conference on Machine Learning [Virtual], pp. 5670–5680.
Song, H. O., Xiang, Y., Jegelka, S., and Savarese, S. (2016). "Deep metric learning via lifted structured feature embedding," in 29th IEEE Conference on Computer Vision and Pattern Recognition, June 26–July 1, 2016, Las Vegas, NV, pp. 4004–4012.
Steiniger, Y., Kraus, D., and Meisen, T. (2022). "Survey on deep learning based computer vision for sonar imagery," Eng. Appl. Artif. Intell. 114, 105157.
Sun, Y., Cheng, C., Zhang, Y., Zhang, C., Zheng, L., Wang, Z., and Wei, Y. (2020). "Circle loss: A unified perspective of pair similarity optimization," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition [Virtual], pp. 6397–6406.
Tian, S., Chen, D., Wang, H., and Liu, J. (2021). "Deep convolution stack for waveform in underwater acoustic target recognition," Sci. Rep. 11, 9614.
Wang, W., Li, S., Yang, J., Liu, Z., and Zhou, W. (2016). "Feature extraction of underwater target in auditory sensation area based on MFCC," in 2016 IEEE/OES China Ocean Acoustics (COA), pp. 1–6.
Wang, X., Liu, A., Zhang, Y., and Xue, F. (2019). "Underwater acoustic target recognition: A combination of multi-dimensional fusion features and modified deep neural network," Remote Sens. 11, 1888–1904.
Wang, W., Tang, Q., and Livescu, K. (2020a). "Unsupervised pre-training of bidirectional speech encoders via masked reconstruction," in IEEE International Conference on Acoustics, Speech, and Signal Processing,
Conference on Applied Sciences and Technology, Islamabad, Pakistan, pp. 522–527.
Yang, H., Shen, S., Yao, X., Sheng, M., and Wang, C. (2018). "Competitive deep-belief networks for underwater acoustic target recognition," Sensors 18, 952.
Yoshioka, T., Ragni, A., and Gales, M. J. F. (2014). "Investigation of unsupervised adaptation of DNN acoustic models with filter bank input," in 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing, May 4–9, Florence, Italy, pp. 6344–6348.
Zhang, L., Wu, D., Han, X., and Zhu, Z. (2016). "Feature extraction of underwater target signal using mel frequency cepstrum coefficients based on acoustic vector sensor," J. Sens. 2016, 1.
Zhang, W., Wu, Y., Wang, D., Wang, Y., Wang, Y., and Zhang, L. (2018). "Underwater target feature extraction and classification based on gammatone filter and machine learning," in 15th International Conference on Wavelet Analysis and Pattern Recognition, July 15–18, 2018, Chengdu, China, pp. 42–47.
Zhou, K., Wang, H., Zhao, W. X., Zhu, Y., Wang, S., Zhang, F., Wang, Z., and Wen, J.-R. (2020). "S3-Rec: Self-supervised learning for sequential recommendation with mutual information maximization," in Proceedings of the 29th ACM International Conference on Information and Knowledge Management, CIKM 2020, October 19–23, 2020, Virtual,