Synthesizing Underwater Sounds Using Generative Artificial Intelligence
Garip, Mustafa
Monterey, CA; Naval Postgraduate School
2024-09
https://hdl.handle.net/10945/73316
THESIS
SYNTHESIZING UNDERWATER SOUNDS USING GENERATIVE AI
by
Mustafa Garip
September 2024
Distribution Statement A. Approved for public release: Distribution is unlimited.
Mustafa Garip
Lieutenant, Turkish Naval Forces
BSIE, Turkish Naval Academy, 2013
Nicholas Durofchalk
Second Reader
Oleg A. Godin
Chair, Department of Engineering Acoustics Academic Committee
ABSTRACT
TABLE OF CONTENTS
I. INTRODUCTION
     A. MOTIVATION AND RESEARCH QUESTIONS
     B. THESIS ORGANIZATION
LIST OF FIGURES
Figure 2. Wenz Curves for underwater ambient sounds. Source: [30].
Figure 4. Left: The sound signal of a dolphin. Right: Spectrogram of the sound of a dolphin.
Figure 5. A boat signal and its recovered version using GLA.
Figure 11. Overall mean performance metrics obtained after 80 classifier runs.
Figure 12. Mean performance metrics for boat signals after 5 runs of each combination.
Figure 13. Mean performance metrics for dolphin signals after 5 runs of each combination.
Figure 14. Mean performance metrics for whale signals after 5 runs of each combination.
Figure 15. Confusion matrix obtained after 5th run of classifier with 100% real signal combination.
Figure 16. Confusion matrices obtained after 5th runs of classifier with 100% synthetic signal combinations.
LIST OF TABLES
Table 2. Column sizes of STFT outputs for specific window length and overlap percentage combinations.
LIST OF ACRONYMS AND ABBREVIATIONS
1D one-dimensional
2D two-dimensional
AI artificial intelligence
a.k.a. also known as
CNN convolutional neural network
COLA constant overlap-add
DDPM denoising diffusion probabilistic model
DFT discrete Fourier transform
FF feedforward neural network
FT Fourier transform
GAN generative adversarial network
GLA Griffin-Lim algorithm
GPU graphics processing unit
ISTFT inverse short-time Fourier transform
ML machine learning
ms millisecond
nfft number of discrete Fourier transform points
NM nautical miles
RGB red-green-blue
RNN recurrent neural network
SDE stochastic differential equation
SGM score-based generative model
STFT short-time Fourier transform
VAE variational autoencoder
ACKNOWLEDGMENTS
No matter how strong a nation’s armed forces are, no matter how glorious
the victories they achieve, if that nation does not have an army of science,
the victories on the battlefields will come to an end.
—M.K. Atatürk
My heartfelt thanks to all distinguished faculty members whose courses I had the
privilege to attend during my stay at NPS. I learned a lot from each of them, and their
knowledge and collective wisdom have contributed to the successful completion of this
research.
Additionally, I want to offer my thanks to the Turkish Naval Forces for granting
me this opportunity to pursue my master’s degree from NPS.
Last, but most importantly, I am profoundly grateful to my wife, Gökçe, for her
endless love, support, patience, and encouragement. Almost 6000 NM away from our home
country, she provided me with a calm and peaceful haven in Monterey. Her presence, together with that of our Californian twin daughters, Meltem and Kumsal, gave me the strength to accomplish my research. Also, to my parents: thank you for your valuable support here and
from afar.
I. INTRODUCTION
integration of ML and underwater acoustics continues to open new avenues for research
and exploration in marine sciences and technology.
B. THESIS ORGANIZATION
There are six chapters in this thesis. In Chapter I, the motivations behind the study and the research questions to be addressed are presented. A brief discussion of previous work on generative models is presented in Chapter II. General information about the dataset used in the study, the underwater sounds, and the signal processing schemes considered in this work is provided in Chapter III. The experimental design and evaluation schemes considered in the study are introduced in Chapter IV. Results and discussion of the findings
are presented in Chapter V. Conclusions and recommendations for future work are discussed in Chapter VI. The MATLAB code used in this thesis can be found in Appendix A, and additional figures and tables are included in Appendix B.

In this chapter, we introduced the motivation behind the study and the research questions to be addressed. Previous work is discussed in the next chapter.
II. PREVIOUS WORK
A. GENERATIVE AI MODELS
Since their first introduction, various GAN architectures have been proposed in several research areas over the last decade [3]. In acoustics, Liu et al. introduced GANs with spectrograms in [4]. The GAN in [4] uses spectrograms of .wav files as input and generates new (fake) spectrograms as output. The generated spectrograms are evaluated by an AlexNet-based classifier. This GAN does not use audio signals directly. Another example of GANs for sound generation is found in [5], which is one of the first applications of
GANs to unsupervised audio generation. In [5], two GAN architectures are used; one GAN
uses time domain information while the other uses frequency domain spectrograms. Both
implementations are compared by using inception score, nearest neighbor, and real listener
methods as detailed in [5]. Results show that both can be used for sound generation;
however, the qualitative scores of produced audio signals are lower than those obtained
with real audio signals. Another frequency-based GAN model is the instantaneous
frequency GAN discussed in [6]. In this work, the input audio data was pre-processed by computing the Short-Time Fourier Transform (STFT), unwrapping the phase, and scaling (normalizing) the magnitude. Then, the resulting processed data is used for training. The
generated samples are then evaluated using five different metrics: human evaluation,
statistical analysis, inception score, quality of pitch, and Fréchet inception distance. Results
show that the GANs can generate audio which has high pitch quality and inception score
while scoring worse in other metrics.
Despite their successes, the main challenge with GANs is training instability, which
remains a dynamic field of research with continuous efforts to enhance robustness and
scalability. This instability problem mostly results from a lack of overlap between the distributions of real and generated data [7]. Other prominent challenges are mode collapse, computational load (especially for high-resolution data), and limited applicability to sequential data, according to [8].
2. Variational Autoencoders
Variational autoencoders (VAEs) are a class of generative models that combine the
principles of deep learning and probabilistic graphical models to generate new data that are
statistically similar to those present in a given dataset. Introduced by Kingma and Welling
in 2013 [9], VAEs include an encoder that transforms input data into a latent space, and a
decoder that reconstructs the data from this latent space. The key innovation in VAEs is
the use of variational inference to estimate the posterior distribution of the latent variables,
facilitating efficient training through backpropagation. This approach allows VAEs to learn
a continuous and smooth latent space, enabling them to generate coherent and diverse
samples.
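To make the variational-inference step concrete, the following minimal sketch shows the reparameterization trick at the heart of VAE training; the latent dimension and the random placeholder values standing in for encoder outputs are illustrative assumptions, not details of any model cited above.

% Reparameterization trick: draw a latent sample as a deterministic,
% differentiable function of the encoder outputs plus external noise,
% so gradients can flow through the sampling step during backpropagation.
mu     = randn(16,1);                 % encoder mean output (illustrative)
logvar = randn(16,1);                 % encoder log-variance output
noise  = randn(size(mu));             % external sample from N(0, I)
z      = mu + exp(0.5*logvar).*noise; % latent sample fed to the decoder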
In the area of sound generation, [10] proposes a type of VAE model for fast and
high-quality audio synthesis. This model uses a two-stage training phase such that the first
part is representation learning with a regular VAE and the latter is an adversarial fine-
tuning VAE. Another model in [11] consists of a modified VAE. In this model, a variational
recurrent autoencoder is supported by history (already generated outputs). Then, the
generated sounds are compared with outputs of a classical VAE model and evaluated by
human assessments, statistical analysis, and mutual information methods. To be able to
generate audio using raw data, [12] uses an enhanced VAE model, called the VQ-VAE
model. This model in [12] compresses audio into a discrete space at first, then utilizes a
loss function crafted to preserve as much musical information (i.e., coherence, musicality,
diversity, and novelty) as possible, even at higher levels of compression.
3. Other Networks and Proposed Models
Apart from the main generative models, GANs and VAEs, other neural networks can be used in data generation. Some of the most commonly used ones are recurrent neural
networks (RNNs), feedforward neural networks (FFs), and transformer networks [1]. All
utilize neural network architectures to model complex data relationships, enabling them to
perform tasks such as classification, generation, and prediction across various domains.
Additionally, they can be used as a part of complex constructions together with other
models including GANs and VAEs.
FFs are the simplest form of neural networks, with information moving from inputs
to outputs in one direction only, passing through any number of hidden layers. Each layer
is completely connected to the next, making them suitable for tasks with straightforward
input-output relationships, such as image classification, regression, and pattern
recognition. In [13], FFs are used as a part of GAN structure.
RNNs are designed to handle sequential data by keeping a hidden state that retains
information from the prior time steps, allowing information to persist across sequences. As
RNNs mainly deal with sequential data, they are mostly suitable for tasks such as time
series prediction, language modeling, and speech recognition. However, in [14], an RNN is implemented within a GAN structure, and [11] has an example of a hybrid RNN-VAE combination.
B. DIFFUSION MODELS
Diffusion models, introduced in 2015 [17], are a class of generative models that
leverage the principles of iterative refinement to generate data by progressively denoising
a sample initialized with random noise. These models are based on two
processes. In the forward process, data points are progressively corrupted with noise, simulating a diffusion from data to noise. In the reverse process, samples undergo a series of transformations governed by stochastic differential equations, effectively simulating a reverse diffusion from noise to data. Training involves learning this reverse diffusion process, allowing the model to produce high-quality samples that mimic the original data distribution after the training is completed. An intuition of diffusion models can be seen in Figure 1.
DDPM, proposed in 2020 by [19], uses a Markov chain in both the forward and reverse diffusion processes. Additionally, these Markov chains, trained with variational inference, enable the model to effectively capture complex data distributions and generate samples
that represent the original data. The study also shows that when forward diffusion involves
small increments of standard Gaussian noise, the transitions in the reverse process can also
be set to conditional Gaussian distributions, enabling a straightforward neural network
parameterization of the DDPM.
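Using the standard notation of [19], the forward transitions, the resulting closed-form corruption of a clean sample, and the learned reverse transitions described above can be written compactly as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t \mathbf{I}\big), \qquad q(x_t \mid x_0) = \mathcal{N}\big(x_t;\, \sqrt{\bar{\alpha}_t}\,x_0,\, (1-\bar{\alpha}_t)\mathbf{I}\big),$$

where $\beta_t$ is the noise variance schedule and $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$; the reverse transitions then take the conditional Gaussian form $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t,t),\, \sigma_t^2 \mathbf{I}\big)$, with $\mu_\theta$ parameterized by a neural network.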
SGMs and Score-SDEs utilize the gradient of the data distribution (score function)
to lead the generation process, while DDPMs focus on learning the denoising steps through
variational inference. SGM, first proposed in [20], involves perturbing data with progressively increasing levels of Gaussian noise and concurrently estimating the score functions
for all noisy data distributions through the training of a deep neural network model, which
is conditioned on various noise levels. Score-SDE in [21] extends the concept of SGMs by
incorporating stochastic differential equations to model the data generation process
continuously over time.
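In the continuous-time formulation of [21], the forward corruption and its generative reversal are a pair of stochastic differential equations, with the score function appearing explicitly in the reverse dynamics:

$$\mathrm{d}x = f(x,t)\,\mathrm{d}t + g(t)\,\mathrm{d}w, \qquad \mathrm{d}x = \big[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w},$$

where $w$ and $\bar{w}$ are forward- and reverse-time Wiener processes, and a neural network is trained to approximate the score $\nabla_x \log p_t(x)$.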
In this chapter, we briefly introduced various generative AI models and reviewed
diffusion models and their development using previous works in the literature. In the next
chapter we will summarize common features of underwater sounds, discuss spectrogram
and a signal recovery algorithm, and present the dataset selected for our study.
III. UNDERWATER SOUNDS AND SIGNAL PRINCIPLES
Figure 2. Wenz Curves for underwater ambient sounds. Source: [30].
B. SPECTROGRAMS
The fundamental idea behind the Fourier transform (FT) process is to convert the
input signal from the time domain to the frequency domain. In digital signal processing, a
discrete-time version of the FT is used since continuous signals are converted into discrete-
time sequences by sampling. Generating spectrograms is accomplished by first splitting a
signal into short segments of equal length, a.k.a. windows, where successive windows may partially overlap. Next, the discrete Fourier transform (DFT) is
applied to each window. This process is called the discrete-time short-time Fourier
transform (referred to as the STFT from here) and allows for the analysis of changes in
signal frequency contents. The general mathematical representation of the STFT output for an input signal $x[n]$ is given by:

$$X[m,f] = \sum_{n=0}^{L-1} x[n+mH]\,w[n]\,e^{-j2\pi f n/N_{\mathrm{fft}}}, \qquad (3.1)$$

where $m$ represents the time frame index, i.e., the column number of the STFT output matrix, $H$ is the number of samples between successive windows, and $w[n]$ is the window function. The row size of the STFT matrix corresponds to the number of frequency bins or components of the signal's spectrum at a particular time frame. The number of columns, $k$, and the number of rows, $r$, of a one-sided STFT output matrix are given by:

$$k = \left\lfloor \frac{N-L}{L-n_{\mathrm{ov}}} \right\rfloor + 1, \qquad r = \frac{N_{\mathrm{fft}}}{2} + 1, \qquad (3.2)$$

where $N$ is the signal length, $L$ is the window length, $n_{\mathrm{ov}}$ is the overlap length in samples, and $N_{\mathrm{fft}}$ is the number of discrete Fourier transform (DFT) points. Note the rounded-down value of the parameter $k$ is used when it is not an integer value. A graphical depiction of the STFT is shown in Figure 3.
After computing the STFT output of the signal, the remaining steps for the spectrogram are to compute the magnitude of the STFT and plot the result. Usually, spectrograms are plotted using a dB scale and a colormap option, where the dB scale is obtained by using $20\log_{10}(|X[m,f]|)$. At that point, spectrogram values are displayed by using plotting tools and colormaps to visualize the sound signal, as shown in Figure 4, where the left plot represents a dolphin sound sampled at 200 kHz, and the right plot represents the resulting spectrogram.
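As a sketch of this computation and plotting step (the STFT parameters are the ones selected later in Chapter IV, and x is assumed to hold one 200-sample signal):

% Compute a one-sided STFT and display its magnitude in dB as a spectrogram.
fs = 2e5;                                    % 200 kHz sampling frequency
[S,f,t] = stft(x, fs, Window=rectwin(40), OverlapLength=36, ...
    FFTLength=80, FrequencyRange="onesided");
SdB = 20*log10(abs(S) + eps);                % dB scale (eps avoids log of 0)
imagesc(t*1e3, f/1e3, SdB); axis xy;         % time in ms, frequency in kHz
xlabel('Time (ms)'); ylabel('Frequency (kHz)'); colorbar;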
Figure 3. Representation of STFT. Source: [31].
Several parameters must be selected when generating spectrograms: window size, window type, window overlap amount, and FT length. The choice of these parameters affects the resolution and accuracy of the resulting spectrogram. For example, a larger window size provides improved frequency resolution but lower time resolution and vice
versa. Overlapping windows can help in reducing spectral leakage and improving the
smoothness of the spectrogram while increasing the computational load. The window type
affects the trade-off between the time and frequency resolution, with different windows
emphasizing either sharper temporal features or more precise frequency components.
C. SIGNAL RECONSTRUCTION

Achieving a good reconstruction when computing the STFT of an input signal and
subsequently inverting it using Inverse short-time Fourier transform (ISTFT) is inherently
challenging. The accuracy of this reconstruction is contingent upon fulfilling specific
conditions: constant overlap-add (COLA) compliance of the window and the number of time frames used in the STFT.
1. Constant Overlap-Add Compliance
The window used in the STFT process is called COLA compliant when the following mathematical constraint is satisfied:

$$\sum_{m=-\infty}^{\infty} w[n-mH] = C \quad \text{for all } n, \qquad (3.3)$$

where $H = L - n_{\mathrm{ov}}$ is the hop size and $C$ is a nonzero constant.
2. STFT Size

The STFT size is another factor to consider during the reconstruction process. To
ensure that the length of the signal reconstructed from the ISTFT is the same as the original
input signal length, the total number of time frames used in the STFT (in other words, the column number $k$, as defined in Equation 3.2, of the STFT magnitude) should be an integer.
This condition guarantees that the specific combination of the signal segmentation and
overlap during the STFT and ISTFT processes are appropriately aligned, facilitating the
preservation of the signal’s structural integrity.
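A short sketch of this screening, using the built-in Signal Processing Toolbox function iscola to test the window constraint and Equation 3.2 for the frame count (the candidate values mirror Table 2 in Chapter IV):

% Screen window length / overlap combinations for the perfect
% reconstruction conditions (signal length is 200 samples).
npts = 200;
winLengths = [5 10 25 40 50];
overlaps   = [0.72 0.75 0.80 0.875 0.90 0.95 0.96 0.975 0.98];
for L = winLengths
    for op = overlaps
        nov = L*op;                       % overlap in samples
        if nov ~= fix(nov), continue, end % require a whole-sample overlap
        hop = L - nov;                    % hop size
        k   = (npts - L)/hop + 1;         % number of time frames (Eq. 3.2)
        if iscola(rectwin(L), nov) && k == fix(k)
            fprintf('L = %2d, overlap = %4.1f%%: k = %d frames\n', L, 100*op, k);
        end
    end
end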
D. GRIFFIN-LIM ALGORITHM
The Griffin-Lim algorithm (GLA) is one of the classic methods used for signal
reconstruction applications. The algorithm was developed by Griffin and Lim [40] and
aims to estimate a time-domain signal from its STFT magnitude only, in an iterative
fashion. The challenge in this process lies in recovering the signal without access to the
phase information, which is typically not provided or lost. The GLA is an iterative process
designed to take advantage of the common information present in neighboring time
windows used in the STFT. This partially common information is used by the GLA to
estimate the missing phase information. The iterative process is designed to produce a
stable signal reconstruction in which the STFT magnitude is equal to the initial STFT
magnitude and the STFT phase has converged to some stable behavior.
The first step of the GLA initializes the STFT magnitude estimate with a random
or zero phase. A cost function is defined as the difference between the reconstructed and
initial STFT magnitudes. Then, the iterative process begins. During the iterative process,
the GLA finds the ISTFT of the estimate to obtain the time-domain signal estimate and
computes the difference between the reconstructed and original STFT magnitudes. The
iterative process continues until it produces a stable signal reconstruction; one in which the
STFT magnitude is equal to the initial STFT magnitude and the STFT phase has converged
to a stable behavior. Even though the iterative process may converge to a local minimum
and is computationally intensive due to multiple STFT/ISTFT operations, the GLA
remains widely used due to its simplicity and effectiveness in various practical
applications.
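A minimal sketch of this iteration, assuming the STFT parameters adopted later in Chapter IV (rectangular 40-sample window, 36-sample overlap, 80-point DFT) and a target one-sided STFT magnitude matrix S; MATLAB's built-in stftmag2sig function packages an equivalent implementation:

% Griffin-Lim: alternate between the time and STFT domains, keeping the
% target magnitude and retaining only the estimated phase each pass.
X = S .* exp(1j*2*pi*rand(size(S)));            % random initial phase
for it = 1:100                                  % fixed iteration budget
    x = istft(X, Window=rectwin(40), OverlapLength=36, ...
        FFTLength=80, FrequencyRange="onesided");
    X = stft(real(x), Window=rectwin(40), OverlapLength=36, ...
        FFTLength=80, FrequencyRange="onesided");
    X = S .* exp(1j*angle(X));                  % enforce target magnitude
end
x = real(istft(X, Window=rectwin(40), OverlapLength=36, ...
    FFTLength=80, FrequencyRange="onesided"));  % recovered signal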
Figure 5. A boat signal and its recovered version using GLA.
Several improvements to the original GLA approach have been proposed over the
years. For example, in [41], an optimization approach is proposed for the phase recovery
problem with a faster solution and lower-cost computation than the original method. In
[42], another fast phase estimation algorithm is presented which uses the sparseness of the
input signal with lower computational load than that present in the original GLA. However,
the proposed version in [41] is not theoretically guaranteed to converge every time.
Additionally, we observed in our study that both methods in [41] and [42] resulted in higher
error levels than those produced using the original algorithm. Therefore, the original
algorithm was used in this work.
E. DATASET
The datasets used in this thesis are recordings of underwater event detections
collected in the Southern California Bight (Site-E, located at 32º 50.5’ N-119º 10.2’ W)
and published by Frasier in 2021 [43]. Hydrophones [44] were deployed at a depth of 1300
meters for 122 days in 2018 and 2019 [43]. The original dataset includes various
echolocation clicks and broadband anthropogenic events; however, only three of the source
classes, boats, Risso's dolphins (Grampus griseus), and sperm whales (Physeter macrocephalus), are selected for this thesis work. The selected dataset has already been labeled using an unsupervised classification algorithm and an expert-reviewed manual labeling workflow, as stated in [45].
The signals consist of impulsive events with received levels greater than 120 dB re
1 µPa and sampled at 200 kHz. All signals have 200 time-samples (1 ms duration) and are
centered so that the maximum signal energy is located at the 100th sample. The original
dataset is organized with signal waveforms and manual labels in different disc folders. The
selected classes were extracted using the detection time and label of each event. There were
more than 710,000 dolphin samples in the original dataset. However, such a high number
of samples in one class can cause significant bias in the training of ML and AI algorithms. Therefore, a random selection was applied to the dolphin samples to bring the size of this class more in line with the other class sizes. After this extraction process, the final distribution
of the samples available in each class is shown in Table 1.
Table 1. Final distribution of the samples in each class.

Source Class        Boat     Dolphin   Whale    Total
Number of Signals   43,431   42,532    15,059   101,022
IV. EXPERIMENTAL DESIGN AND EVALUATION
METHODOLOGY
In this chapter we introduce the three phases followed in our study. The first phase
focuses on data preprocessing, with the goal of organizing raw data to fit required format
constraints for the diffusion model type selected for our study. The second phase focuses
on the diffusion model training stage, as the diffusion model is used to generate new
underwater signals. Finally, the last phase focuses on evaluating the quality of the
generated underwater signals. The overall flow of the study is presented in Figure 6.
Simulations were implemented using MATLAB (ver. R2024b) with custom and
built-in functions. We used a Dell Precision 5820 computer with a NVIDIA T1000 GPU.
MATLAB code is provided in Appendix A.
A. DATA PREPROCESSING
Diffusion models were initially designed for image applications, i.e., designed to take 2D input data and generate 2D output data. 2D data typically refers to data organized in
a two-dimensional structure, often represented as a matrix or a table. Diffusion models
have also shown impressive performances in text, audio and video applications. The
underwater sounds considered in this study are one-dimensional (1D) while the diffusion
model utilized for the study was designed for 2D data, thus requiring preprocessing to
match the expected format for the diffusion model.
First, the STFT is applied to each underwater signal sample to convert 1D signals
into the 2D data format expected by the diffusion model used in our study. Then, the resulting STFT magnitudes are used to train the diffusion model, and the diffusion model generates new STFT magnitudes. After training, the GLA, previously described in Chapter III, Section D, is applied to recover 1D time-domain synthetic underwater signals from the generated STFT
magnitudes.
2. Parameter Selection
The STFT and GLA algorithms rely on several user-specified parameters which need
to be selected judiciously for the scheme to result in good signal recovery. At this point,
the signal reconstruction constraints discussed previously in Chapter III, Section C need to
be considered while selecting the following input parameters to generate the 2D STFT
magnitudes of 200 sample-long signals in the dataset: spectral window (type and length),
overlap length (number of overlapped samples between successive windows), and the
number of DFT samples (nfft).
We chose a rectangular window as the baseline window type used in the STFT
generation step and investigated the impact other parameters, i.e., window length, overlap
length, and nfft, have on recovered 1D signals using the GLA. First, we considered window
lengths equal to 5, 10, 25, 40, and 50 samples (given each signal was 200 samples) and
various overlapping amounts. Then, for each window length and overlap percentage values
considered, we investigated whether these combinations led to COLA compliant windows
and satisfied the other perfect reconstruction constraints discussed in Chapter III, Section C.
The combinations of window lengths and overlap percentages that meet the reconstruction
constraints for the range of window length and overlap amounts considered can be seen in
Table 2.
Table 2. Column sizes of STFT outputs for specific window length and overlap percentage combinations.

Overlap (%)    Window length (samples)
               5      10     25     40     50
72             -      -      -      -      -
75             -      -      -      17     -
80             196    96     36     21     16
87.5           -      -      -      33     -
88             -      -      -      -      -
90             -      191    -      41     31
94             -      -      -      -      -
95             -      -      -      81     -
96             -      -      176    -      76
97.5           -      -      -      161    -
98             -      -      -      -      151

Note: Filled-in cells show the number of time windows (number of columns) present in the STFT output for combinations satisfying the perfect reconstruction constraints presented in Chapter III, Section C. Empty cells show combinations for which the reconstruction conditions are not satisfied.
The second parameter required for the STFT computation step is the nfft used in
the FT operation. This value must be equal to or greater than the window size and its
selection directly impacts the dimension of the STFT magnitude matrices used as inputs to
the diffusion model according to Equation 3.2. Note that generative AI models generally
work with square-sized data (e.g., 16x16, 64x64, 256x256) for several practical reasons such
as dimensional symmetry and simplicity in processing. Additionally, input data sizes affect
training time, computer memory, model architecture and output resolution. At this point,
we decided to use a window length equal to 40 samples and 90% window overlap (i.e., 36
samples) with nfft equal to 80 samples. These parameters result in STFT magnitude
matrices of size 41x41, given we used a one-sided STFT matrix, according to Table 2 and
Equation 3.2. This size keeps the matrix dimension low, reducing computational burden (training time and memory requirements) while retaining sufficient signal information.
The last steps in the preprocessing phase are data resizing and scaling. Note it is usually preferable to work with even-sized matrices in algorithms, taking advantage of simpler memory management and better efficiency during implementation. Therefore, we resized our 41x41 matrices to 40x40 using the built-in MATLAB function resize [46]. This function removes the last row and column of the input matrix (which corresponds to removing the last time window computed in the STFT and the highest frequency bin present in the frequency axis, as also stated in [47]) when the desired matrix size is smaller than the original
size. Simulations showed that this truncation scheme did not remove any significant signal
information as all signals present in the dataset are centered around the 100th sample.
The second step is data rescaling. The diffusion model uses Gaussian zero-mean
noise during the training phase. Initial simulations conducted using the raw STFT
magnitude matrices for training resulted in poor quality generated signals. As a result, we
rescaled the STFT magnitude values so that all values would be below 1, which resulted in
a much-improved outcome. Specifically, we divided all STFT magnitude values by 34,000, which is slightly greater than the maximum STFT magnitude present in our matrix set. This
rescaling factor was also considered when reconstructing 1D signals from the diffusion
model-generated 2D STFT magnitude matrices using the GLA.
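A sketch of these two steps for one signal x, with index-based truncation shown as the equivalent of the resize call:

% 41x41 one-sided STFT magnitude -> 40x40 rescaled diffusion-model input.
Smag = abs(stft(x, Window=rectwin(40), OverlapLength=36, ...
    FFTLength=80, FrequencyRange="onesided"));   % 41x41 magnitude matrix
Smag = Smag(1:40, 1:40);   % drop last time window and highest frequency bin
Smag = Smag / 34000;       % rescale so training values stay below 1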
B. DIFFUSION MODEL
The diffusion model method selected in our work to generate synthetic underwater
signals is the denoising diffusion probabilistic model (DDPM) (referred to as diffusion
model from here). The diffusion model used in this thesis is based on the model in [19] and
the MATLAB code template used for this study is available (since ver. R2023b) [48]. The
network properties can be seen in Figure 7 and the network diagram with a sample of
network layer block is presented in Figure 8.
Figure 7. Diffusion network properties.
Note the diffusion model in [19] was initially designed for images, and the demo MATLAB code of the model in [48] uses RGB color matrices as inputs. We had to customize this code template in order to use 2D STFT magnitude matrices as inputs. As a result, we made the following modifications to the code (detailed in Appendix A): the input type was changed to 32-bit Tiff matrices to preserve the precision of the double-type STFT magnitudes, the input size was changed to [40 40], a new input variable was added to the network-creation function to support different input sizes, the mini-batch preprocessing function was reduced to concatenation since scaling is already performed in preprocessing, and a new function was added to generate STFT magnitudes with the trained network and reconstruct time-domain signals using the GLA.
C. EVALUATION

In this phase we evaluate the "quality" of the generated signals, i.e., how close the generated signal properties are to those of real ones.
1. Evaluation Method
2. Classification Approach
a. Classifier Algorithm
b. Classifier Specifications
In this study we randomly partitioned the classifier input data into a training set
(70%) and validation set (10%) while using a fixed test set (20%). After partitioning the
data, we used the predefined 1D CNN architecture and features provided in [52]. Prior to
applying the classification scheme, we specifically investigated the effect the filter size has
on resulting classification performances, due to the short signal length (200 samples long).
For this purpose, we selected a random subset of real data only and trained the classifier
using filter sizes of length equal to 5, 10, 20, 25, 40, and 50. Results obtained on that subset showed the best classification performance was obtained for a filter size equal to 50, so we fixed the filter size at 50 in all subsequent work. All other user-specified classifier
parameters required in the MATLAB 1D CNN architecture were set as listed in Table 4.
Table 4. Training parameters of the 1D CNN Classifier.
Parameter Value
Solver type Adaptive movement estimation (adam)
Maximum number of epochs 60
Learning rate 0.01
Padding direction Left
Validation patience 5
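As a sketch, the Table 4 values map onto MATLAB's trainingOptions call as follows; the validation data and plotting arguments are illustrative additions not listed in the table:

options = trainingOptions("adam", ...            % solver type
    MaxEpochs=60, ...                            % maximum number of epochs
    InitialLearnRate=0.01, ...                   % learning rate
    SequencePaddingDirection="left", ...         % padding direction
    ValidationPatience=5, ...                    % validation patience
    ValidationData={XValidation,TValidation}, ...
    Plots="training-progress");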
3. Performance Metrics
Performance metrics are used for assessing the efficacy of classifiers, providing
quantitative measures that inform the effectiveness, reliability, and robustness of predictive
models. In this thesis, we use some of the most common metrics: accuracy, F1 score,
precision, and recall.
Accuracy quantifies the ratio of accurately classified instances (true positives and
true negatives) relative to the total number of instances (all data), serving as a fundamental
metric for evaluating classifier performance. Precision and recall metrics provide a more
accurate classifier performance evaluation than the accuracy metric does when dealing with
unbalanced datasets. Precision reflects the classifier’s ability to avoid false positives while
recall indicates the classifier’s capability to identify all true instances. The F1-score, which
depends on both precision and recall, provides a comprehensive metric that balances these
two aspects, making it particularly advantageous when using imbalanced datasets in
classifiers.
Accuracy is calculated in the same way for both binary and multi-class classification tasks. However, the other metrics first need to be computed separately in a one-class-versus-all-others fashion for multi-class problems, and the results averaged over all classes to obtain a single value. Accuracy, the precision of a class, the recall of a class, the F1-score of a class, and the macro average of a metric are defined as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad (4.1)$$

$$\text{Precision}_N = \frac{TP_N}{TP_N + FP_N}, \qquad (4.2)$$

$$\text{Recall}_N = \frac{TP_N}{TP_N + FN_N}, \qquad (4.3)$$

$$\text{F1}_N = \frac{2 \cdot \text{Precision}_N \cdot \text{Recall}_N}{\text{Precision}_N + \text{Recall}_N}, \qquad (4.4)$$

$$\text{Metric}_{\text{macro}} = \frac{1}{n} \sum_{N=1}^{n} \text{Metric}_N, \qquad (4.5)$$
where N represents the class-specific label and n is the total number of classes. In this study
we used the MATLAB code provided in [54] to get macro average values for recall,
precision and F1-score metrics.
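For reference, the macro averages of Equations 4.1 to 4.5 can be computed directly from a confusion matrix; the sketch below assumes categorical true labels TTest and classifier predictions YPred:

C    = confusionmat(TTest, YPred);   % rows: true class, cols: predicted
tp   = diag(C);                      % true positives per class
prec = tp ./ sum(C,1)';              % per-class precision (Eq. 4.2)
rec  = tp ./ sum(C,2);               % per-class recall (Eq. 4.3)
f1   = 2*prec.*rec ./ (prec + rec);  % per-class F1-score (Eq. 4.4)
macroPrec = mean(prec); macroRec = mean(rec); macroF1 = mean(f1);
accuracy  = sum(tp) / sum(C(:));     % overall accuracy (Eq. 4.1)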
In this chapter, we described the three phases of the study flow followed in this
thesis: data preprocessing, diffusion model training, and output evaluation. In the next
chapter we will present the results.
V. RESULTS AND DISCUSSIONS
In this study we used three types of underwater signals (boat, dolphin, and whale) to
design three class-specific diffusion models, i.e., one for each class of signal, and generate
synthetic signals for each class.
Figure 9. Data structure of the evaluation phase.
Figure 10. Data usage in classifier.
Note: In the figure, accuracy and recall lie on top of each other because the data classes used have the same size.
Figure 11. Overall mean performance metrics obtained after 80 classifier
runs.
In this section, we present performance metric results obtained on a per-class basis.
Detailed plots for each performance metric are included in Appendix C.
1. Boat Signals
Mean performance metrics obtained after 5 runs for the combinations of boat
signals are presented in Figure 12. Results show the decreasing proportion of real data does
not have a significant impact on classifier performances until it reaches 10%. Results also show that all performance metric values are close to each other, have small variations, and have similar trends as those observed in the overall results presented in Figure 11.
Note: In the figure, accuracy and recall lie on top of each other because the data classes used have the same size.
Figure 12. Mean performance metrics for boat signals after 5 runs of each
combination.
2. Dolphin Signals
The mean performance metrics obtained after 5 runs for the combinations of
dolphin signals are presented in Figure 13. Results show the decreasing proportion of real
data does not have a significant impact on classifier performances until it goes below 10%.
Results also show that all performance metric values are close to each other and have small variations.
Thus, results indicate the artificial dolphin signals approximate real dolphin signals
with a high level of accuracy.
Note: In the figure, accuracy and recall lie on top of each other because the data classes used have the same size.
Figure 13. Mean performance metrics for dolphin signals after 5 runs of each
combination.
3. Whale Signals
The mean performance metrics obtained after 5 runs for the combinations of whale signals are presented in Figure 14. Results show the decreasing proportion of real
data does not have a significant impact on classifier performance until it goes below 10%,
as previously observed for the boat signal class.
Note: In the figure, accuracy and recall lie on top of each other because the data classes used have the same size.
Figure 14. Mean performance metrics for whale signals after 5 runs of each
combination.
Overall results show the synthetic dolphin signals performed relatively better than
synthetic boat and whale signals. We hypothesize that this different behavior may be linked
to the large class size and consistent characteristics of the dolphin signals used in the
training of the diffusion model. By comparison, in our data set, the number of samples in
the whale class is 3 times smaller than the dolphin class or the boat class, resulting in a
diffusion model with potentially slightly worse generative capability. Finally, even though
the boat class is similar in size to the larger dolphin class, signals within that class are not
as consistent, given the large variety of boats. As a result, we hypothesize that a larger
number of boat and whale signals may have been needed to improve the diffusion model
generative capability for these classes.
C. DATA PROCESSING DURATIONS
Each phase of the study involves specific data handling durations, which directly determine the computational cost of the overall study. Average data processing times are presented in Table 5.
VI. CONCLUSIONS AND RECOMMENDATIONS
A. CONCLUSIONS
In this study, we investigated the ability of diffusion models designed for 2D image data to generate three different types of synthetic underwater signals (boat, dolphin, and whale) and evaluated the quality of the generated artificial signals using a 1D CNN classifier approach, considering various proportions of synthetic data used in the classifier training. This study had two major phases: generating synthetic signals with class-specific diffusion models, and evaluating the quality of the generated signals with a 1D CNN classifier.
Results show that diffusion models designed for 2D data can generate highly accurate synthetic underwater sounds when combined with suitable preprocessing and recovery phases. Findings indicate classification performances remained stable as the proportion of
synthetic signals involved in the 1D CNN classifier design stage increased up to 90% for
the three classes of signals considered in this study (boat, dolphin, and whale). These results
reflect the high quality of the synthetic signals generated by the diffusion models.
B. FUTURE WORK
One main area of follow-on work should be improving the speed of the model. As
currently implemented, the computational cost is high due to long processing times in all
phases, the pre- and post-processing data handling algorithms required, and the use of MATLAB for
this initial investigation. Faster computer processors and different software could reduce
this computational load. In addition, optimizing hyperparameters present in the diffusion
model should be included in future research efforts.
Another avenue for follow-on work could focus on investigating whether better
preprocessing parameters can be defined when using diffusion models designed for 2D
data on 1D signals. Usually, higher dimensional STFT magnitude matrices carry more
accurate information about the signal under investigation. However, larger matrix sizes can
significantly increase computational burden. Investigating a “best” choice of preprocessing
parameters which takes into account the signal information and the 2D input matrix size
while keeping computational load to a manageable level would be quite useful.
Another avenue for follow-on work could be investigating the effect the size of a signal class has on the quality of the resulting diffusion model. A source producing a high variety of signals may need more training samples than a source with consistent signal behavior. A smaller but representative number of training samples would shorten computational durations while still generating new signals effectively.
Finally, the type of diffusion models used in this study was designed for 2D data.
More recently, diffusion models designed for 1D data have been proposed. Applying such
a model would eliminate the preprocessing and post processing stages currently needed in
our study to transform the 1D signal into 2D STFT magnitude images, and the need to use
the Griffin-Lim algorithm to recover 1D time-domain data from synthetic STFT magnitude
outputs generated by the diffusion models. Comparing results obtained in this study to
other generative model approaches on the same data set would be quite useful.
APPENDIX A. MATLAB CODES
A. DATA DISCOVERY

% DATA DISCOVERY
% Template script for extracting class detections from a disc file.
% initial variables
classLabel=1;     % pick class label (boat=1, sperm whale=5, risso dolphin=2)
ix=[];            % index numbers of detections in desired class
count=zeros(1,5); % count vector, one counter per class label (initializing
                  % as [] would error on the increment below)
last=1;           % last detection index
for d=1:length(datenum)   % note: variable 'datenum' shadows the built-in
    for s=last:length(MTT)
        if MTT(s) == datenum(d)   % match detection times
            count(classLabel)=count(classLabel)+1;
            ix=[ix;s];            % append matching detection index
            last=s;
            break
        end
    end
end
B. DOLPHIN DATA REDUCTION

clear; clc;
rng(1923); % fix seed
load('selectedData.mat'); % see the datadiscovery_template script
% 7 disks (from 02d to 03b) have a high number of detections compared with
% the other disks. Select 4000x7 samples randomly and keep the others.
n=sum(count(1:end-7,2)); nsi=1:1:n; % number of signals we keep as is
% random selection of 28,000 indices (note: randi draws with replacement,
% so duplicate indices are possible; randperm would avoid repeats)
rsi=randi([n+1,length(MSNd(:,1))],[28000,1]);
C. DATA PREPROCESSING

% DATA PREPROCESSING
% This script preprocesses the data for the diffusion model.
% For details, please see Chapter IV Section A in the thesis.
tic
clear; clc;
format compact;
rng(1923);
load("selectedData.mat");
%% Define initial parameters
% elective parameters
nfft=80;            % nfft
WinLength=40;       % window length
op=0.9;             % overlap ratio
magSize=[40 40];    % desired STFT mag size
scaleFactor=34000;  % scaling factor
% standard parameters
fs=2e5;             % sampling frequency
npts=200;           % number of data points
time=0:npts-1; time=time/fs;  % time array
WinType=rectwin(WinLength);   % window function
ovrlp=fix(WinLength*op);      % overlap samples
%% Produce STFT magnitudes and change variable type as a final output
% boat data
source='boat';
fprintf('boat\n');
for g=1:length(MSNb(:,1))
    xx=MSNb(g,:);
    % STFT mag
    targetFolder='C:\boat\';
    [STFTmag, Mmagb(g), Imagb(g)] = extractSTFTmag(xx,WinType,ovrlp,nfft,fs, ...
        source,g,magSize,targetFolder);
    % Tiff
    targetFolder='C:\boat\';
    double2tiff(STFTmag/scaleFactor,source,g,targetFolder);
end
tic
% dolphin data (reduced)
source='dolphin';
fprintf('dolphin\n');
for g=1:length(MSNdr(:,1))
    xx=MSNdr(g,:);
    % STFT mag
    targetFolder='C:\rossidolphin\';
    [STFTmag, Mmagd(g), Imagd(g)] = extractSTFTmag(xx,WinType,ovrlp,nfft,fs, ...
        source,g,magSize,targetFolder);
    % Tiff
    targetFolder='C:\rossidolphin\';
    double2tiff(STFTmag/scaleFactor,source,g,targetFolder);
end
% whale data
source='whale';
fprintf('whale\n');
for g=1:length(MSNw(:,1))
    xx=MSNw(g,:);
    % STFT mag
    targetFolder='C:\spermwhale\';
    [STFTmag, Mmagw(g), Imagw(g)] = extractSTFTmag(xx,WinType,ovrlp,nfft,fs, ...
        source,g,magSize,targetFolder);
    % Tiff
    targetFolder='C:\spermwhale\';
    double2tiff(STFTmag/scaleFactor,source,g,targetFolder);
end
toc
save preprocessOutputs.mat Mmagb Mmagd Mmagw Imagb Imagd Imagw
%% Analysis: scaling factor
% max amplitude distribution of STFT magnitudes
figure;
tlo=tiledlayout(1,3,"TileSpacing","compact");
nexttile(); histogram(Mmagb); title('Boat Signals');
hold on; xline(scaleFactor,'r'); subtitle('<34K = 99.87%');
nexttile(); histogram(Mmagd); title('Dolphin Signals');
hold on; xline(scaleFactor,'r'); subtitle('<34K = 100%');
nexttile(); histogram(Mmagw); title('Whale Signals');
hold on; xline(scaleFactor,'r'); subtitle('<34K = 98.11%');
title(tlo,'Max STFT Magnitudes');
D. FUNCTION: EXTRACTSTFTMAG

function [mag,M,I]=extractSTFTmag(x,window,noverlap,nfft,fs,source,index, ...
    desiredSize,folder)
% This function computes the STFT magnitude of the input signal iaw the
% inputs, resizes it, and saves it in {source}{index}.mat format into the
% selected folder. Conversion of the STFT magnitude into Tiff format for
% the diffusion model is done separately with the double2tiff function.
Xstft=stft(x,fs,"Window",window,'OverlapLength',noverlap,'FFTLength', ...
    nfft,'FrequencyRange',"onesided");
Xstft_mag=abs(Xstft);
% (assumed completion, per Sec. IV.A) record the peak magnitude and its
% index for the scaling-factor analysis, truncate to the desired size,
% and save the result
[M,I]=max(Xstft_mag(:));
mag=Xstft_mag(1:desiredSize(1),1:desiredSize(2));
save(fullfile(folder,sprintf('%s%d.mat',source,index)),'mag');
end
E. FUNCTION: DOUBLE2TIFF

function double2tiff(inputM,typeName,index,folder)
% (assumed header, consistent with the calls above) Writes a double matrix
% to a 32-bit floating-point Tiff file named {typeName}{index}.tif in the
% selected folder.
inputMsingle = single(inputM); % cast to single for 32-bit samples
tiffObject = Tiff(fullfile(folder,sprintf('%s%d.tif',typeName,index)),'w');
% Set tags
tagstruct.ImageLength = size(inputMsingle,1);
tagstruct.ImageWidth = size(inputMsingle,2);
tagstruct.Compression = Tiff.Compression.None;
tagstruct.SampleFormat = Tiff.SampleFormat.IEEEFP;
tagstruct.Photometric = Tiff.Photometric.MinIsBlack;
tagstruct.BitsPerSample = 32;
tagstruct.SamplesPerPixel = size(inputMsingle,3);
tagstruct.PlanarConfiguration = Tiff.PlanarConfiguration.Chunky;
tiffObject.setTag(tagstruct);
% Write the array to disk
tiffObject.write(inputMsingle);
tiffObject.close;
end
F. DIFFUSION MODEL

The diffusion model for the thesis. Please check Chapter IV Section B for detailed explanations of the model.

This diffusion model takes 2D data and generates new 2D data. Input data are modified STFT magnitudes obtained from the data preprocessing step, which computes the STFTs of the underwater signals, resizes the STFT magnitude matrices from 41x41 to 40x40, and scales these matrices by dividing by 34,000. For the details of data preprocessing, please see Chapter IV Section A.

As the original model of [19] works with images, some changes have been made to it for this thesis work:

• Input type — STFT magnitudes are double-type variables, but we convert them into tif type, which is a kind of image. This data type can preserve the precision of the original double data.
• Input size — Changed to [40 40] according to the input feature. For details see Section IV.A.
• Network — A new input variable "matrixSize" is added to the function createDiffusionNetwork to make it compatible with different input sizes.
• Batch processing — We use the scaling factor of 34,000 in data preprocessing. Therefore, the function preprocessMiniBatch only performs concatenation.
• Generation — A new function generateAndReconstruct is coded for the generation of new STFT magnitudes using the trained network and the reconstruction of signals from these STFT magnitudes using the GLA.

Please see the original script in [48] for detailed explanations of the model.
Input Data
Initial Properties and Training Options
Training options:
Network
averageGrad = [];
averageSqGrad = [];
numObservationsTrain = numel(imds.Files);
numIterationsPerEpoch = ceil(numObservationsTrain/miniBatchSize);
numIterations = numEpochs*numIterationsPerEpoch;
if doTraining
    monitor = trainingProgressMonitor( ...
        Metrics="Loss", ...
        Info=["Epoch","Iteration"], ...
        XLabel="Iteration");
end
if doTraining
    iteration = 0;
    epoch = 0;
    % (assumed loop structure; mini-batch drawing and the creation of
    % noisyImage, noiseStep, and targetNoise follow the original script [48])
    while epoch < numEpochs && ~monitor.Stop
        epoch = epoch + 1;
        shuffle(mbq); % (assumed) mbq is the minibatchqueue of training data
        while hasdata(mbq) && ~monitor.Stop
            iteration = iteration + 1;
            % Compute loss.
            [loss,gradients] = dlfeval(@modelLoss,net,noisyImage,noiseStep,targetNoise);
            % Update model.
            [net,averageGrad,averageSqGrad] = adamupdate(net,gradients, ...
                averageGrad,averageSqGrad,iteration, ...
                learnRate,gradientDecayFactor,squaredGradientDecayFactor);
            % Record metrics.
            recordMetrics(monitor,iteration,Loss=loss);
            updateInfo(monitor,Epoch=epoch,Iteration=iteration);
            monitor.Progress = 100 * iteration/numIterations;
        end
    end
else
    % If doTraining is false, load the pretrained network.
    load("DiffusionNetworkTrained.mat"); % change folder path properly
end
save DiffusionNetworkTrained net % save trained network separately
save DiffusionNetworkData.mat % save training data
Supporting Functions
Model Loss Function

function [loss,gradients] = modelLoss(net,noisyImage,noiseStep,T)
% (assumed signature, consistent with the dlfeval call above)
% Predict the noise added at the given noise step.
noisePrediction = forward(net,noisyImage,noiseStep);
% Compute mean squared error loss between predicted noise and target.
loss = mse(noisePrediction,T);
gradients = dlgradient(loss,net.Learnables);
end
Mini-batch Preprocessing Function
function X = preprocessMiniBatch(data)
% Concatenate mini-batch.
X = cat(4,data{:});
end
Generation and Reconstruction Function

function imagesAll = generateAndReconstruct(net,varianceSchedule,imageSize, ...
    numImages,numChannels)
tic % start chronometer for generation process
npts=200; % (assumed) length of the recovered time-domain signals
%--- GENERATION
% Compute variance schedule parameters.
alphaBar = cumprod(1 - varianceSchedule);
alphaBarPrev = [1 alphaBar(1:end-1)];
posteriorVariance = varianceSchedule.*(1 - alphaBarPrev)./(1 - alphaBar);
% (assumed) the reverse diffusion sampling loop that produces 'images'
% from random noise is elided here; it follows the original script in [48]
%--- RECONSTRUCTION
tic % start chronometer for reconstruction process
% get generated magnitudes
genSingles=gather(abs(images));
imagesAll=zeros([numImages npts]);
for g=1:numImages
    genMag=genSingles(:,:,g);
    % (assumed recovery steps, per Sec. IV.A: pad back to 41x41, undo the
    % 34,000 rescaling, then apply the GLA to recover a 1D signal)
    genMag(41,41)=0;
    XReconr=stftmag2sig(genMag*34000,80,Window=rectwin(40), ...
        OverlapLength=36,FrequencyRange="onesided");
    imagesAll(g,:)=XReconr;
end
end
G. FUNCTION: CREATEDIFFUSIONMODELNETWORK
% Revised by Mustafa Garip, 2024. Use with SpatialFlattenLayer.m and SpatialUnflattenLayer.m in [48].
transposedConv2dLayer(filterSize,initialNumChannels,Cropping="same",Stride=2, ...
    Name="upsample_12")
depthConcatenationLayer(2,Name="cat_13")
residualBlock(initialNumChannels,filterSize,numGroups,"13")
depthConcatenationLayer(2,Name="cat_14")
residualBlock(initialNumChannels,filterSize,numGroups,"14")
depthConcatenationLayer(2,Name="cat_end")
% Output
groupNormalizationLayer(numGroups)
swishLayer
convolution2dLayer(filterSize,numImageChannels,Padding="same");
];
net = dlnetwork(layers, Initialize=false);
for ii = 1:numResidualBlocks
    numChannels = channelMultipliers(ii)*initialNumChannels;
    noiseStepConnectorLayers = [
        groupNormalizationLayer(numGroups,Name="normEmbed_"+ii)
        fullyConnectedLayer(numChannels,Name="fcEmbed_"+ii)
        ];
    net = addLayers(net,noiseStepConnectorLayers);
    net = connectLayers(net,"noiseEmbed","normEmbed_"+ii);
    net = connectLayers(net,"fcEmbed_"+ii,"addEmbedRes_"+ii+"/in2");
end
% Add missing skip connections in each residual and attention block
for ii = 1:numResidualBlocks
    skipConnectionSource = "norm1Res_" + ii;
    numChannels = channelMultipliers(ii)*initialNumChannels;
    % Add 1x1 convolution to ensure the correct number of channels
    net = addLayers(net, convolution2dLayer([1,1], numChannels, ...
        Name="skipConvRes_"+ii));
    net = connectLayers(net,skipConnectionSource,"skipConvRes_"+ii);
    net = connectLayers(net,"skipConvRes_"+ii,"addRes_"+ii+"/in2");
    if ismember(ii,attentionBlockIndices)
        skipConnectionSource = "normAttn_"+ii;
net = connectLayers(net,skipConnectionSource,"addAttn_"+ii+"/in2");
end
end
% Helper functions
% Residual block
function layers = residualBlock(numChannels,filterSize,numGroups,name)
layers = [
groupNormalizationLayer(numGroups,Name="norm1Res_"+name)
swishLayer()
convolution2dLayer(filterSize,numChannels,Padding="same")
functionLayer(@(x,y) x + y,Formattable=true,Name="addEmbedRes_"+name)
groupNormalizationLayer(numGroups)
swishLayer()
convolution2dLayer(filterSize,numChannels,Padding="same")
additionLayer(2,Name="addRes_"+name)
];
end
% Attention block
function layers = attentionBlock(numHeads,numKeyChannels,numGroups,name)
layers = [
groupNormalizationLayer(numGroups,Name="normAttn_"+name)
SpatialFlattenLayer()
selfAttentionLayer(numHeads,numKeyChannels)
SpatialUnflattenLayer()
additionLayer(2,Name="addAttn_"+name)
];
end
H. FUNCTION: CENTERSIGNAL
function centeredSig = centerSignal(sig)
% Center the signal by circularly shifting its peak to the middle sample.
% (Assumption: i is the peak index and c the center index; their
% computation is not shown in the source listing.)
[~,i] = max(abs(sig)); % peak location
c = floor(length(sig)/2) + 1; % center sample
% shift signal
if i~=c
delay = i - c;
centeredSig = circshift(sig,-delay);
else
centeredSig = sig;
end
end
These scripts classify signal data using a 1-D convolutional neural network based on the demo
code in [52]. For other details, please see Chapter IV, Section C of the thesis.
Please use the supporting function findDataCombination to pick the desired combination and
change the designated lines of the following code accordingly. The code then runs 5 times for
the selected combination and reports the performance metrics of each run in a matrix.
Load Data
Load the data file that contains the training and test set combinations of real and generated
signal samples.
for cs=1:numCombinations
[trainData,trainLabel,testData,testLabel,saveName]=findDataCombination(cs,datafile);
tic % start timer to collect processing time
numIteration=5;
for m=1:numIteration
traind=trainData.'; % CHANGE right side according to the data combination
testd=testData.'; % CHANGE right side according to the data combination
Organize Data
Separate the data into training, validation, and test sets using the selected percentages. Because
our data combinations already provide separate training (80%) and test (20%) sets, we partition
only the training set here, choosing sample counts that yield overall ratios of 70% for training
and 10% for validation. The test set is used as-is.
dataTrain = cell(size(traind,2),1); % preallocate cell array of training sequences
for g=1:size(traind,2)
dataTrain{g,1}=traind(:,g);
end
dataTest = cell(size(testd,2),1); % preallocate cell array of test sequences
for g=1:size(testd,2)
dataTest{g,1}=testd(:,g);
end
numChannels = 1;
numObservations = length(traind(1,:));
[idxTrain,idxValidation] = trainingPartitions(numObservations,[10500/12000 1500/12000]); % overall 70–10 split
XTrain = dataTrain(idxTrain);
TTrain = categorical(trainLabel(idxTrain));
XValidation = dataTrain(idxValidation);
TValidation = categorical(trainLabel(idxValidation));
XTest = dataTest;
TTest = categorical(testLabel);
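trainingPartitions is a support function distributed with the MathWorks demo [52] and is not listed in the thesis. A minimal sketch of its behavior, returning mutually exclusive randomized index partitions (implementation details assumed):

function varargout = trainingPartitions(numObservations,splits)
% Split 1:numObservations into random, non-overlapping index sets
% whose sizes follow the ratios in splits.
idx = randperm(numObservations);
varargout = cell(1,numel(splits));
start = 1;
for k = 1:numel(splits)
    partitionSize = floor(splits(k)*numObservations);
    varargout{k} = idx(start:start+partitionSize-1);
    start = start + partitionSize;
end
end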
1-D Convolutional Network Architecture
The network uses two blocks of 1-D convolution, ReLU, and layer normalization layers, followed by global average pooling, a fully connected layer, and softmax.
filterSize = 50;
numFilters = 32;
classNames = categories(TTrain);
numClasses = numel(classNames);
cnnlayers = [ ...
sequenceInputLayer(numChannels)
convolution1dLayer(filterSize,numFilters,Padding="causal")
reluLayer
layerNormalizationLayer
convolution1dLayer(filterSize,2*numFilters,Padding="causal")
reluLayer
layerNormalizationLayer
globalAveragePooling1dLayer
fullyConnectedLayer(numClasses)
softmaxLayer];
Training Options
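The options definition under this heading is elided in the source. A plausible sketch following the 1-D CNN demo [52] (the epoch count and learning rate are assumptions; the left padding direction matches the test code below):

options = trainingOptions("adam", ...
    MaxEpochs=60, ...
    InitialLearnRate=0.01, ...
    SequencePaddingDirection="left", ...
    ValidationData={XValidation,TValidation}, ...
    Metrics="accuracy", ...
    Verbose=false);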
cnnnet = trainnet(XTrain,TTrain,cnnlayers,"crossentropy",options);
Test Neural Network
scores = minibatchpredict(cnnnet,XTest,SequencePaddingDirection="left");
YTest = scores2label(scores, classNames);
cmat = confusionmat(TTest,YTest);
metrics = multiclass_metrics_common(cmat); % for code of this function please see [54]
results(m,1)=metrics.Accuracy;
results(m,2)=metrics.F1score;
results(m,3)=metrics.Precision;
results(m,4)=metrics.Recall;
end % m (iteration loop)
load briefMetrics.mat % reload metrics accumulated over earlier combinations
t(cs)=toc; % processing time for this combination
allMetrics(cs,1)=mean(results(:,1)); % mean accuracy over the 5 runs
allMetrics(cs,2)=mean(results(:,2)); % mean F1 score
allMetrics(cs,3)=mean(results(:,3)); % mean precision
allMetrics(cs,4)=mean(results(:,4)); % mean recall
save briefMetrics.mat t allMetrics
clear
close all
end % cs
APPENDIX B. SUPPLEMENTARY CONFUSION MATRICES
APPENDIX C. ADDITIONAL PLOTS AND TABLES OF
PERFORMANCE METRICS
Note: Each circle represents a different run. Error bars indicate maximum and minimum
values at that data point.
Figure 17. Overall accuracy after 80 runs of the classifier.
Table 6. Accuracy values after 80 runs of the classifier.
Note: Each circle represents a different run. Error bars indicate maximum and minimum
values at that data point.
Figure 18. Overall F1 score after 80 runs of the classifier.
Table 7. F1 score values after 80 runs of the classifier.
Note: Each circle represents a different run. Error bars indicate maximum and minimum
values at that data point.
Figure 19. Overall precision after 80 runs of the classifier.
Table 8. Precision values after 80 runs of the classifier.
Note: Each circle represents a different run. Error bars indicate maximum and minimum
values at that data point.
Figure 20. Overall recall after 80 runs of the classifier.
Table 9. Recall values after 80 runs of the classifier.
Figure 21. Detailed performance metrics for boat signals.
Figure 22. Detailed performance metrics for dolphin signals.
Figure 23. Detailed performance metrics for whale signals.
LIST OF REFERENCES
[3] J. Gui, Z. Sun, Y. Wen, D. Tao, and J. Ye, “A Review on Generative Adversarial
Networks: Algorithms, Theory, and Applications,” IEEE Transactions on
Knowledge and Data Engineering, vol. 35, no. 4, pp. 3313–3332, 1 April 2023.
Available: https://doi.org/10.1109/TKDE.2021.3130191.
[4] F. Liu, Q. Song, and G. Jin, “Expansion of restricted sample for underwater
acoustic signal based on generative adversarial networks,” in Proceedings of
SPIE, vol. 11069, art. 1106948, pp. 1–8, May 2019.
[7] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.
Yang, “Diffusion models: a comprehensive survey of methods and applications,”
ACM Computing Surveys, vol. 56, no. 4, art. 105, April 2024. Available:
https://doi.org/10.1145/3626235
[8] S. Ji, J. Luo, and X. Yang, “A Comprehensive Survey on Deep Music Generation:
Multi-Level Representations, Algorithms, Evaluations, and Future Directions,”
arXiv, 2020. Available: https://doi.org/10.48550/arXiv.2011.06801
[10] A. Caillon and P. Esling, “RAVE: A variational autoencoder for fast and high-
quality neural audio synthesis,” arXiv, 2021. Available: https://doi.org/10.48550/
arXiv.2111.05011
[11] I. P. Yamshchikov and A. Tikhonov, “Music generation with variational recurrent
autoencoder supported by history,” SN Applied Sciences, vol. 2, art. 1937,
2020. Available: https://doi.org/10.1007/s42452-020-03715-w
[12] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox:
a generative model for music,” arXiv, 2020. Available: https://doi.org/10.48550/
arXiv.2005.00341
[14] H.-M. Liu and Y.-H. Yang, “Lead Sheet Generation and Arrangement by
Conditional Generative Adversarial Network,” in 2018 17th IEEE International
Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA,
2018, pp. 722–727. Available: https://doi.org/10.1109/ICMLA.2018.00114
[19] J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” in 34th
Conference on Neural Information Processing Systems (NIPS 2020), Vancouver,
Canada, 2020. Available: https://proceedings.neurips.cc/paper_files/paper/2020/
file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf
[20] Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data
distribution,” in 33rd Conference on Neural Information Processing Systems
(NIPS 2019), Vancouver, Canada, 2019. Available:
https://proceedings.neurips.cc/paper_files/paper/2019/file/
3001ef257407d5a371a96dcd947c7d93-Paper.pdf
[21] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole,
“Score-based generative modeling through stochastic differential equations,”
arXiv, 2021. Available: https://doi.org/10.48550/arXiv.2011.13456
[25] W. Huang and F. Zhan, “A novel probabilistic diffusion model based on the weak
selection mimicry theory for the generation of hypnotic songs,” Mathematics,
vol. 11, art. 3345, 2023. Available: https://doi.org/10.3390/math11153345
[30] P. Roux, W. A. Kuperman, and K. G. Sabra, “Ocean acoustic noise and passive
coherent array processing,” Comptes Rendus Geoscience, vol. 343, issues 8–9, pp.
533–547, 2011. Available: https://doi.org/10.1016/j.crte.2011.02.003
[31] The MathWorks Inc., “spectrogram.” Accessed: June 23, 2024. Available:
https://www.mathworks.com/help/signal/ref/spectrogram.html?s_tid=doc_ta
[32] B. G. Greene, D. B. Pisoni, and T. D. Carrell, “Recognition of speech spectrograms,”
Journal of the Acoustical Society of America, vol. 76, no. 1, pp. 32–43, 1984.
Available: https://doi.org/10.1121/1.391035
[34] Y. M. G. Costa, L. S. Oliveira, A. L. Koerich, and F. Gouyon, “Music genre recognition
using spectrograms,” in 18th International Conference on Systems, Signals and
Image Processing, Sarajevo, Bosnia and Herzegovina, 2011, pp. 1–4.
[36] J. Huang, B. Chen, B. Yao, and W. He, “ECG Arrhythmia Classification Using
STFT-Based Spectrogram and Convolutional Neural Network,” IEEE Access,
vol. 7, pp. 92871–92880, 2019. Available: http://dx.doi.org/10.1109/
ACCESS.2019.2928017
[40] D.W. Griffin and J.S. Lim, “Signal estimation from modified short-time Fourier
transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.
ASSP-32, no. 2, pp. 236–243, April 1984.
[42] J. Le Roux, H. Kameoka, N. Ono, and S. Sagayama, “Fast Signal Reconstruction
from Magnitude STFT Spectrogram Based on Spectrogram Consistency,” in 13th
International Conference on Digital Audio Effects (DAFx-10), Graz, Austria,
September 6–10, 2010.
[47] The MathWorks Inc., “Resize data by adding or removing elements.” Available:
https://www.mathworks.com/help/matlab/ref/resize.html?searchHighlight=
resize&s_tid=srchtitle_support_results_2_resize
[48] The MathWorks Inc, “Generate Images Using Diffusion Example.” Available:
https://www.mathworks.com/help/deeplearning/ug/generate-images-using-
diffusion.html#GenerateImagesUsingDiffusionExample-3
[49] Z. Xiong, W. Wang, J. Yu, Y. Lin, and Z. Wang, “A Comprehensive Survey for
Evaluation Methodologies of AI-Generated Music,” arXiv, 2023. Available:
https://doi.org/10.48550/arXiv.2308.13736
INITIAL DISTRIBUTION LIST