Synthesizing Underwater Sounds Using Generative Artificial Intelligence
Garip, Mustafa
Monterey, CA; Naval Postgraduate School
2024-09
https://hdl.handle.net/10945/73316
THESIS
SYNTHESIZING UNDERWATER SOUNDS USING GENERATIVE AI
by
Mustafa Garip
September 2024
Distribution Statement A. Approved for public release: Distribution is unlimited.
Mustafa Garip
Lieutenant, Turkish Naval Forces
BSIE, Turkish Naval Academy, 2013
Nicholas Durofchalk
Second Reader
Oleg A. Godin
Chair, Department of Engineering Acoustics Academic Committee
ABSTRACT
TABLE OF CONTENTS
I. INTRODUCTION
     A. MOTIVATION AND RESEARCH QUESTIONS
     B. THESIS ORGANIZATION
LIST OF FIGURES
Figure 2. Wenz Curves for underwater ambient sounds. Source: [30].
Figure 4. Left: The sound signal of a dolphin. Right: Spectrogram of the sound of a dolphin.
Figure 5. A boat signal and its recovered version using GLA.
Figure 11. Overall mean performance metrics obtained after 80 classifier runs.
Figure 12. Mean performance metrics for boat signals after 5 runs of each combination.
Figure 13. Mean performance metrics for dolphin signals after 5 runs of each combination.
Figure 14. Mean performance metrics for whale signals after 5 runs of each combination.
Figure 15. Confusion matrix obtained after 5th run of classifier with 100% real signal combination.
Figure 16. Confusion matrices obtained after 5th runs of classifier with 100% synthetic signal combinations.
LIST OF TABLES
Table 2. Column sizes of STFT outputs for specific window length and overlap percentage combinations.
LIST OF ACRONYMS AND ABBREVIATIONS
1D one-dimensional
2D two-dimensional
AI artificial intelligence
a.k.a. also known as
CNN convolutional neural network
COLA constant overlap-add
DDPM denoising diffusion probabilistic model
DFT discrete Fourier transform
FF feedforward neural network
FT Fourier transform
GAN generative adversarial network
GLA Griffin-Lim algorithm
GPU graphics processing unit
ISTFT inverse short-time Fourier transform
ML machine learning
ms millisecond
nfft number of discrete Fourier transform points
NM nautical miles
RGB red-green-blue
RNN recurrent neural network
SDE stochastic differential equation
SGM score-based generative model
STFT short-time Fourier transform
VAE variational autoencoder
ACKNOWLEDGMENTS
No matter how strong a nation’s armed forces are, no matter how glorious
the victories they achieve, if that nation does not have an army of science,
the victories on the battlefields will come to an end.
—M.K. Atatürk
My heartfelt thanks to all distinguished faculty members whose courses I had the
privilege to attend during my stay at NPS. I learned a lot from each of them, and their
knowledge and collective wisdom have contributed to the successful completion of this
research.
Additionally, I want to offer my thanks to the Turkish Naval Forces for granting
me this opportunity to pursue my master’s degree from NPS.
Last, but most importantly, I am profoundly grateful to my wife, Gökçe, for her
endless love, support, patience, and encouragement. Almost 6000 NM away from our home
country, she provided me with a calm and peaceful haven in Monterey. Her presence, together with that of our Californian twin daughters, Meltem and Kumsal, gave me the strength to accomplish my research. Also, to my parents: thank you for your valuable support here and
from afar.
I. INTRODUCTION
integration of ML and underwater acoustics continues to open new avenues for research
and exploration in marine sciences and technology.
B. THESIS ORGANIZATION
There are six chapters in this thesis. In Chapter I, the motivations behind the study and the research questions to be addressed are presented. A brief discussion of previous work on generative models is presented in Chapter II. General information about the dataset used in the study, the underwater sounds, and the signal processing schemes considered in this work is provided in Chapter III. The experimental design and evaluation schemes considered in the study are introduced in Chapter IV. Results and discussion of the findings
are presented in Chapter V. Conclusions and recommendations for future work are discussed in Chapter VI. The MATLAB code used in this thesis can be found in Appendix A, and additional figures and tables are included in Appendix B.

In this chapter, we introduced the motivation behind the study and the research questions to be addressed. Previous work is discussed in the next chapter.
II. PREVIOUS WORK
A. GENERATIVE AI MODELS
Since their first introduction, various GAN architectures have been proposed in several research areas over the last decade [3]. In acoustics, Liu et al. introduced GANs with spectrograms in [4]. The GAN in [4] uses spectrograms of .wav files as input and generates new (fake) spectrograms as output. The generated spectrograms are evaluated by an AlexNet-based classifier. This GAN does not use audio signals directly. Another example of GANs for sound generation is found in [5], which is one of the first applications of
GANs to unsupervised audio generation. In [5], two GAN architectures are used; one GAN
uses time domain information while the other uses frequency domain spectrograms. Both
implementations are compared by using inception score, nearest neighbor, and real listener
methods as detailed in [5]. Results show that both can be used for sound generation;
however, the qualitative scores of produced audio signals are lower than those obtained
with real audio signals. Another frequency-based GAN model is the instantaneous
frequency GAN discussed in [6]. In this work, the input audio data was pre-processed by computing the Short-Time Fourier Transform (STFT), unwrapping the phase, and scaling (normalizing) the magnitude. Then, the resulting processed data is used for training. The
generated samples are then evaluated using five different metrics: human evaluation,
statistical analysis, inception score, quality of pitch, and Fréchet inception distance. Results
show that the GANs can generate audio which has high pitch quality and inception score
while scoring worse in other metrics.
Despite their successes, the main challenge with GANs is training instability, which
remains a dynamic field of research with continuous efforts to enhance robustness and
scalability. This instability problem mostly results from a lack of overlap between the distributions of real and generated data [7]. Other prominent challenges are mode collapse, computational load (especially for high-resolution data), and limited applicability to sequential data, according to [8].
2. Variational Autoencoders
Variational autoencoders (VAEs) are a class of generative models that combine the
principles of deep learning and probabilistic graphical models to generate new data that are
statistically similar to those present in a given dataset. Introduced by Kingma and Welling
in 2013 [9], VAEs include an encoder that transforms input data into a latent space, and a
decoder that reconstructs the data from this latent space. The key innovation in VAEs is
the use of variational inference to estimate the posterior distribution of the latent variables,
facilitating efficient training through backpropagation. This approach allows VAEs to learn
a continuous and smooth latent space, enabling them to generate coherent and diverse
samples.
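To make the variational-inference step concrete, the following minimal sketch shows the reparameterization trick at the heart of VAE training; the latent dimension and the random placeholder values standing in for encoder outputs are illustrative assumptions, not details of any model cited above.

% Reparameterization trick: draw a latent sample as a deterministic,
% differentiable function of the encoder outputs plus external noise,
% so gradients can flow through the sampling step during backpropagation.
mu     = randn(16,1);                 % encoder mean output (illustrative)
logvar = randn(16,1);                 % encoder log-variance output
noise  = randn(size(mu));             % external sample from N(0, I)
z      = mu + exp(0.5*logvar).*noise; % latent sample fed to the decoder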
In the area of sound generation, [10] proposes a type of VAE model for fast and
high-quality audio synthesis. This model uses a two-stage training phase such that the first
part is representation learning with a regular VAE and the latter is an adversarial fine-
tuning VAE. Another model in [11] consists of a modified VAE. In this model, a variational
recurrent autoencoder is supported by history (already generated outputs). Then, the
generated sounds are compared with outputs of a classical VAE model and evaluated by
human assessments, statistical analysis, and mutual information methods. To be able to
generate audio using raw data, [12] uses an enhanced VAE model, called the VQ-VAE
model. This model in [12] compresses audio into a discrete space at first, then utilizes a
loss function crafted to preserve as much musical information (i.e., coherence, musicality,
diversity, and novelty) as possible, even at higher levels of compression.
3. Other Networks and Proposed Models
Apart from the main generative models, GANs and VAEs, other neural networks can be used in data generation. Some of the most commonly used ones are recurrent neural
networks (RNNs), feedforward neural networks (FFs), and transformer networks [1]. All
utilize neural network architectures to model complex data relationships, enabling them to
perform tasks such as classification, generation, and prediction across various domains.
Additionally, they can be used as a part of complex constructions together with other
models including GANs and VAEs.
FFs are the simplest form of neural networks, with information moving from inputs
to outputs in one direction only, passing through any number of hidden layers. Each layer
is completely connected to the next, making them suitable for tasks with straightforward
input-output relationships, such as image classification, regression, and pattern
recognition. In [13], FFs are used as a part of GAN structure.
RNNs are designed to handle sequential data by keeping a hidden state that retains
information from the prior time steps, allowing information to persist across sequences. As
RNNs mainly deal with sequential data, they are mostly suitable for tasks such as time
series prediction, language modeling, and speech recognition. However, in [14], an RNN is implemented within a GAN structure, and [11] has an example of a hybrid RNN-VAE combination.
B. DIFFUSION MODELS
Diffusion models, introduced in 2015 [17], are a class of generative models that
leverage the principles of iterative refinement to generate data by progressively denoising
a sample initialized with random noise. These models are based on two
processes. In the forward process, data points are progressively corrupted with noise, simulating a diffusion from data to noise. In the reverse process, samples undergo a series of transformations governed by stochastic differential equations, effectively simulating a reverse diffusion from noise to data. Training involves learning this reverse diffusion process, allowing the model to produce high-quality samples that mimic the original data distribution after the training is completed. An intuition of diffusion models can be seen in Figure 1.
DDPM, proposed in 2020 by [19], uses a Markov chain in both the forward and reverse diffusion processes. Additionally, these Markov chains, trained with variational inference, enable the model to effectively capture complex data distributions and generate samples
that represent the original data. The study also shows that when forward diffusion involves
small increments of standard Gaussian noise, the transitions in the reverse process can also
be set to conditional Gaussian distributions, enabling a straightforward neural network
parameterization of the DDPM.
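Using the standard notation of [19], the forward transitions, the resulting closed-form corruption of a clean sample, and the learned reverse transitions described above can be written compactly as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t \mathbf{I}\big), \qquad q(x_t \mid x_0) = \mathcal{N}\big(x_t;\, \sqrt{\bar{\alpha}_t}\,x_0,\, (1-\bar{\alpha}_t)\mathbf{I}\big),$$

where $\beta_t$ is the noise variance schedule and $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$; the reverse transitions then take the conditional Gaussian form $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t,t),\, \sigma_t^2 \mathbf{I}\big)$, with $\mu_\theta$ parameterized by a neural network.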
SGMs and Score-SDEs utilize the gradient of the data distribution (score function)
to lead the generation process, while DDPMs focus on learning the denoising steps through
variational inference. SGM, first proposed in [20], involves perturbing data with progressively increasing levels of Gaussian noise and concurrently estimating the score functions
for all noisy data distributions through the training of a deep neural network model, which
is conditioned on various noise levels. Score-SDE in [21] extends the concept of SGMs by
incorporating stochastic differential equations to model the data generation process
continuously over time.
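In the continuous-time formulation of [21], the forward corruption and its generative reversal are a pair of stochastic differential equations, with the score function appearing explicitly in the reverse dynamics:

$$\mathrm{d}x = f(x,t)\,\mathrm{d}t + g(t)\,\mathrm{d}w, \qquad \mathrm{d}x = \big[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w},$$

where $w$ and $\bar{w}$ are forward- and reverse-time Wiener processes, and a neural network is trained to approximate the score $\nabla_x \log p_t(x)$.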
In this chapter, we briefly introduced various generative AI models and reviewed
diffusion models and their development using previous works in the literature. In the next
chapter we will summarize common features of underwater sounds, discuss spectrogram
and a signal recovery algorithm, and present the dataset selected for our study.
III. UNDERWATER SOUNDS AND SIGNAL PRINCIPLES
Figure 2. Wenz Curves for underwater ambient sounds. Source: [30].
B. SPECTROGRAMS
The fundamental idea behind the Fourier transform (FT) process is to convert the
input signal from the time domain to the frequency domain. In digital signal processing, a
discrete-time version of the FT is used since continuous signals are converted into discrete-
time sequences by sampling. Generating spectrograms is accomplished by first splitting a
signal into short segments of equal length, a.k.a. windows, where successive windows may partially overlap. Next, the discrete Fourier transform (DFT) is
applied to each window. This process is called the discrete-time short-time Fourier
transform (referred to as the STFT from here) and allows for the analysis of changes in
signal frequency contents. The general mathematical representation of the STFT output for an input signal $x[n]$ is given by:

$$X[m,f] = \sum_{n=0}^{L-1} x[n+mH]\,w[n]\,e^{-j2\pi f n/N_{\mathrm{fft}}}, \qquad (3.1)$$

where $m$ represents the time frame index, i.e., the column number of the STFT output matrix, $H$ is the number of samples between successive windows, and $w[n]$ is the window function. The row size of the STFT matrix corresponds to the number of frequency bins or components of the signal's spectrum at a particular time frame. The number of columns, $k$, and the number of rows, $r$, of a one-sided STFT output matrix are given by:

$$k = \left\lfloor \frac{N-L}{L-n_{\mathrm{ov}}} \right\rfloor + 1, \qquad r = \frac{N_{\mathrm{fft}}}{2} + 1, \qquad (3.2)$$

where $N$ is the signal length, $L$ is the window length, $n_{\mathrm{ov}}$ is the overlap length in samples, and $N_{\mathrm{fft}}$ is the number of discrete Fourier transform (DFT) points. Note the rounded-down value of the parameter $k$ is used when it is not an integer value. A graphical depiction of the STFT is shown in Figure 3.
After computing the STFT output of the signal, the remaining steps for the spectrogram are to compute the magnitude of the STFT and plot the result. Usually, spectrograms are plotted using a dB scale and a colormap option, where the dB scale is obtained by using $20\log_{10}(|X[m,f]|)$. At that point, spectrogram values are displayed by using plotting tools and colormaps to visualize the sound signal, as shown in Figure 4, where the left plot represents a dolphin sound sampled at 200 kHz, and the right plot represents the resulting spectrogram.
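As a sketch of this computation and plotting step (the STFT parameters are the ones selected later in Chapter IV, and x is assumed to hold one 200-sample signal):

% Compute a one-sided STFT and display its magnitude in dB as a spectrogram.
fs = 2e5;                                    % 200 kHz sampling frequency
[S,f,t] = stft(x, fs, Window=rectwin(40), OverlapLength=36, ...
    FFTLength=80, FrequencyRange="onesided");
SdB = 20*log10(abs(S) + eps);                % dB scale (eps avoids log of 0)
imagesc(t*1e3, f/1e3, SdB); axis xy;         % time in ms, frequency in kHz
xlabel('Time (ms)'); ylabel('Frequency (kHz)'); colorbar;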
Figure 3. Representation of STFT. Source: [31].
Several parameters must be selected when generating spectrograms: window size, window type, window overlap amount, and FT length. The choice of these parameters affects the resolution and accuracy of the resulting spectrogram. For example, a larger window size provides improved frequency resolution but lower time resolution and vice
versa. Overlapping windows can help in reducing spectral leakage and improving the
smoothness of the spectrogram while increasing the computational load. The window type
affects the trade-off between the time and frequency resolution, with different windows
emphasizing either sharper temporal features or more precise frequency components.
C. SIGNAL RECONSTRUCTION

Achieving a good reconstruction when computing the STFT of an input signal and
subsequently inverting it using Inverse short-time Fourier transform (ISTFT) is inherently
challenging. The accuracy of this reconstruction is contingent upon fulfilling specific
conditions: constant overlap-add (COLA) compliance of the window and the number of time frames used in the STFT.
1. Constant Overlap-Add Compliance
The window used in the STFT process is called COLA compliant when the following mathematical constraint is satisfied:

$$\sum_{m=-\infty}^{\infty} w[n-mH] = C \quad \text{for all } n, \qquad (3.3)$$

where $H = L - n_{\mathrm{ov}}$ is the hop size and $C$ is a nonzero constant.
2. STFT Size

The STFT size is another factor to consider during the reconstruction process. To
ensure that the length of the signal reconstructed from the ISTFT is the same as the original
input signal length, the total number of time frames used in the STFT (in other words, the column number $k$, as defined in Equation 3.2, of the STFT magnitude) should be an integer.
This condition guarantees that the specific combination of the signal segmentation and
overlap during the STFT and ISTFT processes are appropriately aligned, facilitating the
preservation of the signal’s structural integrity.
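A short sketch of this screening, using the built-in Signal Processing Toolbox function iscola to test the window constraint and Equation 3.2 for the frame count (the candidate values mirror Table 2 in Chapter IV):

% Screen window length / overlap combinations for the perfect
% reconstruction conditions (signal length is 200 samples).
npts = 200;
winLengths = [5 10 25 40 50];
overlaps   = [0.72 0.75 0.80 0.875 0.90 0.95 0.96 0.975 0.98];
for L = winLengths
    for op = overlaps
        nov = L*op;                       % overlap in samples
        if nov ~= fix(nov), continue, end % require a whole-sample overlap
        hop = L - nov;                    % hop size
        k   = (npts - L)/hop + 1;         % number of time frames (Eq. 3.2)
        if iscola(rectwin(L), nov) && k == fix(k)
            fprintf('L = %2d, overlap = %4.1f%%: k = %d frames\n', L, 100*op, k);
        end
    end
end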
D. GRIFFIN-LIM ALGORITHM
The Griffin-Lim algorithm (GLA) is one of the classic methods used for signal
reconstruction applications. The algorithm was developed by Griffin and Lim [40] and
aims to estimate a time-domain signal from its STFT magnitude only, in an iterative
fashion. The challenge in this process lies in recovering the signal without access to the
phase information, which is typically not provided or lost. The GLA is an iterative process
designed to take advantage of the common information present in neighboring time
windows used in the STFT. This partially common information is used by the GLA to
estimate the missing phase information. The iterative process is designed to produce a
stable signal reconstruction in which the STFT magnitude is equal to the initial STFT
magnitude and the STFT phase has converged to some stable behavior.
The first step of the GLA initializes the STFT magnitude estimate with a random
or zero phase. A cost function is defined as the difference between the reconstructed and
initial STFT magnitudes. Then, the iterative process begins. During the iterative process,
the GLA finds the ISTFT of the estimate to obtain the time-domain signal estimate and
computes the difference between the reconstructed and original STFT magnitudes. The
iterative process continues until it produces a stable signal reconstruction; one in which the
STFT magnitude is equal to the initial STFT magnitude and the STFT phase has converged
to a stable behavior. Even though the iterative process may converge to a local minimum
and is computationally intensive due to multiple STFT/ISTFT operations, the GLA
remains widely used due to its simplicity and effectiveness in various practical
applications.
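A minimal sketch of this iteration, assuming the STFT parameters adopted later in Chapter IV (rectangular 40-sample window, 36-sample overlap, 80-point DFT) and a target one-sided STFT magnitude matrix S; MATLAB's built-in stftmag2sig function packages an equivalent implementation:

% Griffin-Lim: alternate between the time and STFT domains, keeping the
% target magnitude and retaining only the estimated phase each pass.
X = S .* exp(1j*2*pi*rand(size(S)));            % random initial phase
for it = 1:100                                  % fixed iteration budget
    x = istft(X, Window=rectwin(40), OverlapLength=36, ...
        FFTLength=80, FrequencyRange="onesided");
    X = stft(real(x), Window=rectwin(40), OverlapLength=36, ...
        FFTLength=80, FrequencyRange="onesided");
    X = S .* exp(1j*angle(X));                  % enforce target magnitude
end
x = real(istft(X, Window=rectwin(40), OverlapLength=36, ...
    FFTLength=80, FrequencyRange="onesided"));  % recovered signal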
Figure 5. A boat signal and its recovered version using GLA.
Several improvements to the original GLA approach have been proposed over the
years. For example, in [41], an optimization approach is proposed for the phase recovery
problem with a faster solution and lower-cost computation than the original method. In
[42], another fast phase estimation algorithm is presented which uses the sparseness of the
input signal with lower computational load than that present in the original GLA. However,
the proposed version in [41] is not theoretically guaranteed to converge every time.
Additionally, we observed in our study that both methods in [41] and [42] resulted in higher
error levels than those produced using the original algorithm. Therefore, the original
algorithm was used in this work.
E. DATASET
The datasets used in this thesis are recordings of underwater event detections
collected in the Southern California Bight (Site-E, located at 32º 50.5’ N-119º 10.2’ W)
and published by Frasier in 2021 [43]. Hydrophones [44] were deployed at a depth of 1300
meters for 122 days in 2018 and 2019 [43]. The original dataset includes various
echolocation clicks and broadband anthropogenic events; however, only three of the source
classes, boats, Risso's dolphins (Grampus griseus), and sperm whales (Physeter macrocephalus), are selected for this thesis work. The selected dataset has already been labeled using an unsupervised classification algorithm and an expert-reviewed manual labeling workflow, as stated in [45].
The signals consist of impulsive events with received levels greater than 120 dB re
1 µPa and sampled at 200 kHz. All signals have 200 time-samples (1 ms duration) and are
centered so that the maximum signal energy is located at the 100th sample. The original
dataset is organized with signal waveforms and manual labels in different disc folders. The
selected classes were extracted using the detection time and label of each event. There were
more than 710,000 dolphin samples in the original dataset. However, such a high number
of samples in one class can cause significant bias in the training of ML and AI algorithms. Therefore, a random selection was applied to the dolphin samples to bring the size of this class more in line with the other class sizes. After this extraction process, the final distribution
of the samples available in each class is shown in Table 1.
Table 1. Final distribution of the samples in each class.

Source Class        Boat     Dolphin   Whale    Total
Number of Signals   43,431   42,532    15,059   101,022
IV. EXPERIMENTAL DESIGN AND EVALUATION
METHODOLOGY
In this chapter we introduce the three phases followed in our study. The first phase
focuses on data preprocessing, with the goal of organizing raw data to fit required format
constraints for the diffusion model type selected for our study. The second phase focuses
on the diffusion model training stage, as the diffusion model is used to generate new
underwater signals. Finally, the last phase focuses on evaluating the quality of the
generated underwater signals. The overall flow of the study is presented in Figure 6.
Simulations were implemented using MATLAB (ver. R2024b) with custom and
built-in functions. We used a Dell Precision 5820 computer with a NVIDIA T1000 GPU.
MATLAB code is provided in Appendix A.
A. DATA PREPROCESSING
Diffusion models were initially designed for image applications, i.e., designed to take 2D input data and generate 2D output data. 2D data typically refers to data organized in
a two-dimensional structure, often represented as a matrix or a table. Diffusion models
have also shown impressive performances in text, audio and video applications. The
underwater sounds considered in this study are one-dimensional (1D) while the diffusion
model utilized for the study was designed for 2D data, thus requiring preprocessing to
match the expected format for the diffusion model.
First, the STFT is applied to each underwater signal sample to convert 1D signals
into the 2D data format expected by the diffusion model used in our study. Then, the resulting STFT magnitudes are used to train the diffusion model, and the diffusion model generates new STFT magnitudes. After training, the GLA, previously described in Chapter III, Section D, is applied to recover 1D time-domain synthetic underwater signals from the generated STFT
magnitudes.
2. Parameter Selection
The STFT and GLA algorithms rely on several user-specified parameters which need
to be selected judiciously for the scheme to result in good signal recovery. At this point,
the signal reconstruction constraints discussed previously in Chapter III, Section C need to
be considered while selecting the following input parameters to generate the 2D STFT
magnitudes of 200 sample-long signals in the dataset: spectral window (type and length),
overlap length (number of overlapped samples between successive windows), and the
number of DFT samples (nfft).
We chose a rectangular window as the baseline window type used in the STFT
generation step and investigated the impact other parameters, i.e., window length, overlap
length, and nfft, have on recovered 1D signals using the GLA. First, we considered window
lengths equal to 5, 10, 25, 40, and 50 samples (given each signal was 200 samples) and
various overlapping amounts. Then, for each window length and overlap percentage values
considered, we investigated whether these combinations led to COLA compliant windows
and satisfied the other perfect reconstruction constraints discussed in Chapter III, Section C.
The combinations of window lengths and overlap percentages that meet the reconstruction
constraints for the range of window length and overlap amounts considered can be seen in
Table 2.
Table 2. Column sizes of STFT outputs for specific window length and overlap percentage combinations.

Overlap (%)    Window length (samples)
               5      10     25     40     50
72             -      -      -      -      -
75             -      -      -      17     -
80             196    96     36     21     16
87.5           -      -      -      33     -
88             -      -      -      -      -
90             -      191    -      41     31
94             -      -      -      -      -
95             -      -      -      81     -
96             -      -      176    -      76
97.5           -      -      -      161    -
98             -      -      -      -      151

Note: Filled-in cells show the number of time windows (number of columns) present in the STFT output for combinations satisfying the perfect reconstruction constraints presented in Chapter III, Section C. Empty cells show combinations for which the reconstruction conditions are not satisfied.
The second parameter required for the STFT computation step is the nfft used in
the FT operation. This value must be equal to or greater than the window size and its
selection directly impacts the dimension of the STFT magnitude matrices used as inputs to
the diffusion model according to Equation 3.2. Note that generative AI models generally
work with square-sized data (e.g., 16x16, 64x64, 256x256) for several practical reasons such
as dimensional symmetry and simplicity in processing. Additionally, input data sizes affect
training time, computer memory, model architecture and output resolution. At this point,
we decided to use a window length equal to 40 samples and 90% window overlap (i.e., 36
samples) with nfft equal to 80 samples. These parameters result in STFT magnitude
matrices of size 41x41, given we used a one-sided STFT matrix, according to Table 2 and
Equation 3.2. This size keeps the matrix dimension low, reducing computational burden (training time and memory requirements) while retaining sufficient signal information.
The last steps in the preprocessing phase are data resizing and scaling. Note it is usually preferable to work with even-sized matrices in algorithms, taking advantage of simpler memory management and better efficiency during implementation. Therefore, we resized our 41x41 matrices to 40x40 using the built-in MATLAB function resize [46]. This function removes the last row and column of the input matrix (which corresponds to removing the last time window computed in the STFT and the highest frequency bin present in the frequency axis, as also stated in [47]) when the desired matrix size is smaller than the original
size. Simulations showed that this truncation scheme did not remove any significant signal
information as all signals present in the dataset are centered around the 100th sample.
The second step is data rescaling. The diffusion model uses Gaussian zero-mean
noise during the training phase. Initial simulations conducted using the raw STFT
magnitude matrices for training resulted in poor quality generated signals. As a result, we
rescaled the STFT magnitude values so that all values would be below 1, which resulted in
a much-improved outcome. Specifically, we divided all STFT magnitude values by 34,000, which is slightly greater than the maximum STFT magnitude present in our matrix set. This
rescaling factor was also considered when reconstructing 1D signals from the diffusion
model-generated 2D STFT magnitude matrices using the GLA.
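A sketch of these two steps for one signal x, with index-based truncation shown as the equivalent of the resize call:

% 41x41 one-sided STFT magnitude -> 40x40 rescaled diffusion-model input.
Smag = abs(stft(x, Window=rectwin(40), OverlapLength=36, ...
    FFTLength=80, FrequencyRange="onesided"));   % 41x41 magnitude matrix
Smag = Smag(1:40, 1:40);   % drop last time window and highest frequency bin
Smag = Smag / 34000;       % rescale so training values stay below 1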
B. DIFFUSION MODEL
The diffusion model method selected in our work to generate synthetic underwater
signals is the denoising diffusion probabilistic model (DDPM) (referred to as diffusion
model from here). The diffusion model used in this thesis is based on the model in [19] and
the MATLAB code template used for this study is available (since ver. R2023b) [48]. The
network properties can be seen in Figure 7 and the network diagram with a sample of
network layer block is presented in Figure 8.
Figure 7. Diffusion network properties.
Note the diffusion model in [19] was initially designed for images, and the demo MATLAB code of the model in [48] uses RGB color matrices as inputs. We had to customize this code template in order to use 2D STFT magnitude matrices as inputs. As a result, we made the following modifications to the code (detailed in Appendix A): the input type was changed to 32-bit Tiff matrices to preserve the precision of the double-type STFT magnitudes, the input size was changed to [40 40], a new input variable was added to the network-creation function to support different input sizes, the mini-batch preprocessing function was reduced to concatenation since scaling is already performed in preprocessing, and a new function was added to generate STFT magnitudes with the trained network and reconstruct time-domain signals using the GLA.
C. EVALUATION

In this phase we evaluate the "quality" of the generated signals, i.e., how close the generated signal properties are to those of real ones.
1. Evaluation Method
2. Classification Approach
a. Classifier Algorithm
b. Classifier Specifications
In this study we randomly partitioned the classifier input data into a training set
(70%) and validation set (10%) while using a fixed test set (20%). After partitioning the
data, we used the predefined 1D CNN architecture and features provided in [52]. Prior to
applying the classification scheme, we specifically investigated the effect the filter size has
on resulting classification performances, due to the short signal length (200 samples long).
For this purpose, we selected a random subset of real data only and trained the classifier
using filter sizes of length equal to 5, 10, 20, 25, 40, and 50. Results obtained on that subset showed the best classification performance was obtained for a filter size equal to 50, so we fixed the filter size at 50 in all subsequent work. All other user-specified classifier
parameters required in the MATLAB 1D CNN architecture were set as listed in Table 4.
Table 4. Training parameters of the 1D CNN Classifier.
Parameter Value
Solver type Adaptive movement estimation (adam)
Maximum number of epochs 60
Learning rate 0.01
Padding direction Left
Validation patience 5
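As a sketch, the Table 4 values map onto MATLAB's trainingOptions call as follows; the validation data and plotting arguments are illustrative additions not listed in the table:

options = trainingOptions("adam", ...            % solver type
    MaxEpochs=60, ...                            % maximum number of epochs
    InitialLearnRate=0.01, ...                   % learning rate
    SequencePaddingDirection="left", ...         % padding direction
    ValidationPatience=5, ...                    % validation patience
    ValidationData={XValidation,TValidation}, ...
    Plots="training-progress");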
3. Performance Metrics
Performance metrics are used for assessing the efficacy of classifiers, providing
quantitative measures that inform the effectiveness, reliability, and robustness of predictive
models. In this thesis, we use some of the most common metrics: accuracy, F1 score,
precision, and recall.
Accuracy quantifies the ratio of accurately classified instances (true positives and
true negatives) relative to the total number of instances (all data), serving as a fundamental
metric for evaluating classifier performance. Precision and recall metrics provide a more
accurate classifier performance evaluation than the accuracy metric does when dealing with
unbalanced datasets. Precision reflects the classifier’s ability to avoid false positives while
recall indicates the classifier’s capability to identify all true instances. The F1-score, which
depends on both precision and recall, provides a comprehensive metric that balances these
two aspects, making it particularly advantageous when using imbalanced datasets in
classifiers.
Accuracy is calculated in the same way for both binary and multi-class classification tasks. However, the other metrics first need to be computed separately in a one-class-versus-all-others fashion for multi-class problems, and the results averaged over all classes to obtain a single value. Accuracy, the precision of a class, the recall of a class, the F1-score of a class, and the macro average of a metric are defined as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad (4.1)$$

$$\text{Precision}_N = \frac{TP_N}{TP_N + FP_N}, \qquad (4.2)$$

$$\text{Recall}_N = \frac{TP_N}{TP_N + FN_N}, \qquad (4.3)$$

$$\text{F1}_N = \frac{2 \cdot \text{Precision}_N \cdot \text{Recall}_N}{\text{Precision}_N + \text{Recall}_N}, \qquad (4.4)$$

$$\text{Metric}_{\text{macro}} = \frac{1}{n} \sum_{N=1}^{n} \text{Metric}_N, \qquad (4.5)$$
where N represents the class-specific label and n is the total number of classes. In this study
we used the MATLAB code provided in [54] to get macro average values for recall,
precision and F1-score metrics.
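For reference, the macro averages of Equations 4.1 to 4.5 can be computed directly from a confusion matrix; the sketch below assumes categorical true labels TTest and classifier predictions YPred:

C    = confusionmat(TTest, YPred);   % rows: true class, cols: predicted
tp   = diag(C);                      % true positives per class
prec = tp ./ sum(C,1)';              % per-class precision (Eq. 4.2)
rec  = tp ./ sum(C,2);               % per-class recall (Eq. 4.3)
f1   = 2*prec.*rec ./ (prec + rec);  % per-class F1-score (Eq. 4.4)
macroPrec = mean(prec); macroRec = mean(rec); macroF1 = mean(f1);
accuracy  = sum(tp) / sum(C(:));     % overall accuracy (Eq. 4.1)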
In this chapter, we described the three phases of the study flow followed in this
thesis: data preprocessing, diffusion model training, and output evaluation. In the next
chapter we will present the results.
V. RESULTS AND DISCUSSIONS
In this study we used three types of underwater signals (boat, dolphin, and whale) to
design three class-specific diffusion models, i.e., one for each class of signal, and generate
synthetic signals for each class.
Figure 9. Data structure of the evaluation phase.
Figure 10. Data usage in classifier.
Note: In the figure, accuracy and recall lie on top of each other because the data classes used have the same size.
Figure 11. Overall mean performance metrics obtained after 80 classifier
runs.
In this section, we present performance metric results obtained on a per-class basis.
Detailed plots for each performance metric are included in Appendix C.
1. Boat Signals
Mean performance metrics obtained after 5 runs for the combinations of boat
signals are presented in Figure 12. Results show the decreasing proportion of real data does
not have a significant impact on classifier performances until it reaches 10%. Results also show that all performance metric values are close to each other, have small variations, and have similar trends as those observed in the overall results presented in Figure 11.
Note: In the figure, accuracy and recall lie on top of each other because the data classes used have the same size.
Figure 12. Mean performance metrics for boat signals after 5 runs of each
combination.
2. Dolphin Signals
The mean performance metrics obtained after 5 runs for the combinations of
dolphin signals are presented in Figure 13. Results show the decreasing proportion of real
data does not have a significant impact on classifier performances until it goes below 10%.
Results also show that all performance metric values are close to each other and have small variations.
Thus, results indicate the artificial dolphin signals approximate real dolphin signals
with a high level of accuracy.
Note: In the figure, accuracy and recall lie on top of each other because the data classes used have the same size.
Figure 13. Mean performance metrics for dolphin signals after 5 runs of each
combination.
3. Whale Signals
The mean performance metrics obtained after 5 runs for the combinations of whale signals are presented in Figure 14. Results show the decreasing proportion of real
data does not have a significant impact on classifier performance until it goes below 10%,
as previously observed for the boat signal class.
Note: In the figure, accuracy and recall lie on top of each other because the data classes used have the same size.
Figure 14. Mean performance metrics for whale signals after 5 runs of each
combination.
Overall results show the synthetic dolphin signals performed relatively better than
synthetic boat and whale signals. We hypothesize that this different behavior may be linked
to the large class size and consistent characteristics of the dolphin signals used in the
training of the diffusion model. By comparison, in our data set, the number of samples in
the whale class is 3 times smaller than the dolphin class or the boat class, resulting in a
diffusion model with potentially slightly worse generative capability. Finally, even though
the boat class is similar in size to the larger dolphin class, signals within that class are not
as consistent, given the large variety of boats. As a result, we hypothesize that a larger
number of boat and whale signals may have been needed to improve the diffusion model
generative capability for these classes.
C. DATA PROCESSING DURATIONS
Each phase of the study involves specific data handling durations, which directly determine the computational cost of the overall study. Average data processing times are presented in Table 5.
VI. CONCLUSIONS AND RECOMMENDATIONS
A. CONCLUSIONS
In this study, we investigated the ability of diffusion models designed for 2D image data to generate three different types of synthetic underwater signals (boat, dolphin, and whale) and evaluated the quality of the generated artificial signals using a 1D CNN classifier approach, considering various proportions of synthetic data used in the classifier training. This study had two major phases: generating synthetic signals with class-specific diffusion models, and evaluating the quality of the generated signals with a 1D CNN classifier.
Results show that diffusion models designed for 2D data can generate highly accurate synthetic underwater sounds when combined with suitable preprocessing and recovery phases. Findings indicate classification performances remained stable as the proportion of
synthetic signals involved in the 1D CNN classifier design stage increased up to 90% for
the three classes of signals considered in this study (boat, dolphin, and whale). These results
reflect the high quality of the synthetic signals generated by the diffusion models.
B. FUTURE WORK
One main area of follow-on work should be improving the speed of the model. As
currently implemented, the computational cost is high due to long processing times in all
phases, the pre- and post-processing data handling algorithms required, and the use of MATLAB for
this initial investigation. Faster computer processors and different software could reduce
this computational load. In addition, optimizing hyperparameters present in the diffusion
model should be included in future research efforts.
Another avenue for follow-on work could focus on investigating whether better
preprocessing parameters can be defined when using diffusion models designed for 2D
data on 1D signals. Usually, higher dimensional STFT magnitude matrices carry more
accurate information about the signal under investigation. However, larger matrix sizes can
significantly increase computational burden. Investigating a “best” choice of preprocessing
parameters which takes into account the signal information and the 2D input matrix size
while keeping computational load to a manageable level would be quite useful.
Another avenue for follow-on work could be investigating the effect the size of a signal class has on the quality of the resulting diffusion model. A source producing a high variety of signals may need more training samples than a source with consistent signal behavior. A smaller but representative number of training samples would shorten computational durations while still generating new signals effectively.
Finally, the type of diffusion models used in this study was designed for 2D data.
More recently, diffusion models designed for 1D data have been proposed. Applying such
a model would eliminate the preprocessing and post processing stages currently needed in
our study to transform the 1D signal into 2D STFT magnitude images, and the need to use
the Griffin-Lim algorithm to recover 1D time-domain data from synthetic STFT magnitude
outputs generated by the diffusion models. Comparing results obtained in this study to
other generative model approaches on the same data set would be quite useful.
APPENDIX A. MATLAB CODES
A. DATA DISCOVERY

% DATA DISCOVERY
% Template script for extracting class detections from a disc file.
% initial variables
classLabel=1;     % pick class label (boat=1, sperm whale=5, risso dolphin=2)
ix=[];            % index numbers of detections in desired class
count=zeros(1,5); % count vector, one counter per class label (initializing
                  % as [] would error on the increment below)
last=1;           % last detection index
for d=1:length(datenum)   % note: variable 'datenum' shadows the built-in
    for s=last:length(MTT)
        if MTT(s) == datenum(d)   % match detection times
            count(classLabel)=count(classLabel)+1;
            ix=[ix;s];            % append matching detection index
            last=s;
            break
        end
    end
end
B. DOLPHIN DATA REDUCTION

clear; clc;
rng(1923); % fix seed
load('selectedData.mat'); % see the datadiscovery_template script
% 7 disks (from 02d to 03b) have a high number of detections compared with
% the other disks. Select 4000x7 samples randomly and keep the others.
n=sum(count(1:end-7,2)); nsi=1:1:n; % number of signals we keep as is
% random selection of 28,000 indices (note: randi draws with replacement,
% so duplicate indices are possible; randperm would avoid repeats)
rsi=randi([n+1,length(MSNd(:,1))],[28000,1]);
C. DATA PREPROCESSING

% DATA PREPROCESSING
% This script preprocesses the data for the diffusion model.
% For details, please see Chapter IV Section A in the thesis.
tic
clear; clc;
format compact;
rng(1923);
load("selectedData.mat");
%% Define initial parameters
% elective parameters
nfft=80;            % nfft
WinLength=40;       % window length
op=0.9;             % overlap ratio
magSize=[40 40];    % desired STFT mag size
scaleFactor=34000;  % scaling factor
% standard parameters
fs=2e5;             % sampling frequency
npts=200;           % number of data points
time=0:npts-1; time=time/fs;  % time array
WinType=rectwin(WinLength);   % window function
ovrlp=fix(WinLength*op);      % overlap samples
%% Produce STFT magnitudes and change variable type as a final output
% boat data
source='boat';
fprintf('boat\n');
for g=1:length(MSNb(:,1))
    xx=MSNb(g,:);
    % STFT mag
    targetFolder='C:\boat\';
    [STFTmag, Mmagb(g), Imagb(g)] = extractSTFTmag(xx,WinType,ovrlp,nfft,fs, ...
        source,g,magSize,targetFolder);
    % Tiff
    targetFolder='C:\boat\';
    double2tiff(STFTmag/scaleFactor,source,g,targetFolder);
end
tic
% dolphin data (reduced)
source='dolphin';
fprintf('dolphin\n');
for g=1:length(MSNdr(:,1))
    xx=MSNdr(g,:);
    % STFT mag
    targetFolder='C:\rossidolphin\';
    [STFTmag, Mmagd(g), Imagd(g)] = extractSTFTmag(xx,WinType,ovrlp,nfft,fs, ...
        source,g,magSize,targetFolder);
    % Tiff
    targetFolder='C:\rossidolphin\';
    double2tiff(STFTmag/scaleFactor,source,g,targetFolder);
end
% whale data
source='whale';
fprintf('whale\n');
for g=1:length(MSNw(:,1))
    xx=MSNw(g,:);
    % STFT mag
    targetFolder='C:\spermwhale\';
    [STFTmag, Mmagw(g), Imagw(g)] = extractSTFTmag(xx,WinType,ovrlp,nfft,fs, ...
        source,g,magSize,targetFolder);
    % Tiff
    targetFolder='C:\spermwhale\';
    double2tiff(STFTmag/scaleFactor,source,g,targetFolder);
end
toc
save preprocessOutputs.mat Mmagb Mmagd Mmagw Imagb Imagd Imagw
%% Analysis: scaling factor
% max amplitude distribution of STFT magnitudes
figure;
tlo=tiledlayout(1,3,"TileSpacing","compact");
nexttile(); histogram(Mmagb); title('Boat Signals');
hold on; xline(scaleFactor,'r'); subtitle('<34K = 99.87%');
nexttile(); histogram(Mmagd); title('Dolphin Signals');
hold on; xline(scaleFactor,'r'); subtitle('<34K = 100%');
nexttile(); histogram(Mmagw); title('Whale Signals');
hold on; xline(scaleFactor,'r'); subtitle('<34K = 98.11%');
title(tlo,'Max STFT Magnitudes');
D. FUNCTION: EXTRACTSTFTMAG

function [mag,M,I]=extractSTFTmag(x,window,noverlap,nfft,fs,source,index, ...
    desiredSize,folder)
% This function computes the STFT magnitude of the input signal iaw the
% inputs, resizes it, and saves it in {source}{index}.mat format into the
% selected folder. Conversion of the STFT magnitude into Tiff format for
% the diffusion model is done separately with the double2tiff function.
Xstft=stft(x,fs,"Window",window,'OverlapLength',noverlap,'FFTLength', ...
    nfft,'FrequencyRange',"onesided");
Xstft_mag=abs(Xstft);
% (assumed completion, per Sec. IV.A) record the peak magnitude and its
% index for the scaling-factor analysis, truncate to the desired size,
% and save the result
[M,I]=max(Xstft_mag(:));
mag=Xstft_mag(1:desiredSize(1),1:desiredSize(2));
save(fullfile(folder,sprintf('%s%d.mat',source,index)),'mag');
end
E. FUNCTION: DOUBLE2TIFF

function double2tiff(inputM,typeName,index,folder)
% (assumed header, consistent with the calls above) Writes a double matrix
% to a 32-bit floating-point Tiff file named {typeName}{index}.tif in the
% selected folder.
inputMsingle = single(inputM); % cast to single for 32-bit samples
tiffObject = Tiff(fullfile(folder,sprintf('%s%d.tif',typeName,index)),'w');
% Set tags
tagstruct.ImageLength = size(inputMsingle,1);
tagstruct.ImageWidth = size(inputMsingle,2);
tagstruct.Compression = Tiff.Compression.None;
tagstruct.SampleFormat = Tiff.SampleFormat.IEEEFP;
tagstruct.Photometric = Tiff.Photometric.MinIsBlack;
tagstruct.BitsPerSample = 32;
tagstruct.SamplesPerPixel = size(inputMsingle,3);
tagstruct.PlanarConfiguration = Tiff.PlanarConfiguration.Chunky;
tiffObject.setTag(tagstruct);
% Write the array to disk
tiffObject.write(inputMsingle);
tiffObject.close;
end
F. DIFFUSION MODEL

The diffusion model for the thesis. Please check Chapter IV Section B for detailed explanations of the model.

This diffusion model takes 2D data and generates new 2D data. Input data are modified STFT magnitudes obtained from the data preprocessing step, which computes the STFTs of the underwater signals, resizes the STFT magnitude matrices from 41x41 to 40x40, and scales these matrices by dividing by 34,000. For the details of data preprocessing, please see Chapter IV Section A.

As the original model of [19] works with images, some changes have been made to it for this thesis work:

• Input type — STFT magnitudes are double-type variables, but we convert them into tif type, which is a kind of image. This data type can preserve the precision of the original double data.
• Input size — Changed to [40 40] according to the input feature. For details see Section IV.A.
• Network — A new input variable "matrixSize" is added to the function createDiffusionNetwork to make it compatible with different input sizes.
• Batch processing — We use the scaling factor of 34,000 in data preprocessing. Therefore, the function preprocessMiniBatch only performs concatenation.
• Generation — A new function generateAndReconstruct is coded for the generation of new STFT magnitudes using the trained network and the reconstruction of signals from these STFT magnitudes using the GLA.

Please see the original script in [48] for detailed explanations of the model.
Input Data
Initial Properties and Training Options
Training options:
Network
averageGrad = [];
averageSqGrad = [];
numObservationsTrain = numel(imds.Files);
numIterationsPerEpoch = ceil(numObservationsTrain/miniBatchSize);
numIterations = numEpochs*numIterationsPerEpoch;
if doTraining
    monitor = trainingProgressMonitor( ...
        Metrics="Loss", ...
        Info=["Epoch","Iteration"], ...
        XLabel="Iteration");
end
if doTraining
    iteration = 0;
    epoch = 0;
    % (assumed loop structure; mini-batch drawing and the creation of
    % noisyImage, noiseStep, and targetNoise follow the original script [48])
    while epoch < numEpochs && ~monitor.Stop
        epoch = epoch + 1;
        shuffle(mbq); % (assumed) mbq is the minibatchqueue of training data
        while hasdata(mbq) && ~monitor.Stop
            iteration = iteration + 1;
            % Compute loss.
            [loss,gradients] = dlfeval(@modelLoss,net,noisyImage,noiseStep,targetNoise);
            % Update model.
            [net,averageGrad,averageSqGrad] = adamupdate(net,gradients, ...
                averageGrad,averageSqGrad,iteration, ...
                learnRate,gradientDecayFactor,squaredGradientDecayFactor);
            % Record metrics.
            recordMetrics(monitor,iteration,Loss=loss);
            updateInfo(monitor,Epoch=epoch,Iteration=iteration);
            monitor.Progress = 100 * iteration/numIterations;
        end
    end
else
    % If doTraining is false, load the pretrained network.
    load("DiffusionNetworkTrained.mat"); % change folder path properly
end
save DiffusionNetworkTrained net % save trained network separately
save DiffusionNetworkData.mat % save training data
Supporting Functions
Model Loss Function

function [loss,gradients] = modelLoss(net,noisyImage,noiseStep,T)
% (assumed signature, consistent with the dlfeval call above)
% Predict the noise added at the given noise step.
noisePrediction = forward(net,noisyImage,noiseStep);
% Compute mean squared error loss between predicted noise and target.
loss = mse(noisePrediction,T);
gradients = dlgradient(loss,net.Learnables);
end
Mini-batch Preprocessing Function
function X = preprocessMiniBatch(data)
% Concatenate mini-batch.
X = cat(4,data{:});
end
Generation and Reconstruction Function

function imagesAll = generateAndReconstruct(net,varianceSchedule,imageSize, ...
    numImages,numChannels)
tic % start chronometer for generation process
npts=200; % (assumed) length of the recovered time-domain signals
%--- GENERATION
% Compute variance schedule parameters.
alphaBar = cumprod(1 - varianceSchedule);
alphaBarPrev = [1 alphaBar(1:end-1)];
posteriorVariance = varianceSchedule.*(1 - alphaBarPrev)./(1 - alphaBar);
% (assumed) the reverse diffusion sampling loop that produces 'images'
% from random noise is elided here; it follows the original script in [48]
%--- RECONSTRUCTION
tic % start chronometer for reconstruction process
% get generated magnitudes
genSingles=gather(abs(images));
imagesAll=zeros([numImages npts]);
for g=1:numImages
    genMag=genSingles(:,:,g);
    % (assumed recovery steps, per Sec. IV.A: pad back to 41x41, undo the
    % 34,000 rescaling, then apply the GLA to recover a 1D signal)
    genMag(41,41)=0;
    XReconr=stftmag2sig(genMag*34000,80,Window=rectwin(40), ...
        OverlapLength=36,FrequencyRange="onesided");
    imagesAll(g,:)=XReconr;
end
end
G. FUNCTION: CREATEDIFFUSIONMODELNETWORK
% Revised by Mustafa Garip, 2024. Use with SpatialFlattenLayer.m and SpatialUnflattenLayer.m in [48].
transposedConv2dLayer(filterSize,initialNumChannels,Cropping="same",Stride=2, ...
    Name="upsample_12")
depthConcatenationLayer(2,Name="cat_13")
residualBlock(initialNumChannels,filterSize,numGroups,"13")
depthConcatenationLayer(2,Name="cat_14")
residualBlock(initialNumChannels,filterSize,numGroups,"14")
depthConcatenationLayer(2,Name="cat_end")
% Output
groupNormalizationLayer(numGroups)
swishLayer
convolution2dLayer(filterSize,numImageChannels,Padding="same");
];
net = dlnetwork(layers, Initialize=false);
for ii = 1:numResidualBlocks
    numChannels = channelMultipliers(ii)*initialNumChannels;
    noiseStepConnectorLayers = [
        groupNormalizationLayer(numGroups,Name="normEmbed_"+ii)
        fullyConnectedLayer(numChannels,Name="fcEmbed_"+ii)
        ];
    net = addLayers(net,noiseStepConnectorLayers);
    net = connectLayers(net,"noiseEmbed","normEmbed_"+ii);
    net = connectLayers(net,"fcEmbed_"+ii,"addEmbedRes_"+ii+"/in2");
end
% Add missing skip connections in each residual and attention block
for ii = 1:numResidualBlocks
    skipConnectionSource = "norm1Res_" + ii;
    numChannels = channelMultipliers(ii)*initialNumChannels;
    % Add 1x1 convolution to ensure the correct number of channels
    net = addLayers(net, convolution2dLayer([1,1], numChannels, ...
        Name="skipConvRes_"+ii));
    net = connectLayers(net,skipConnectionSource,"skipConvRes_"+ii);
    net = connectLayers(net,"skipConvRes_"+ii,"addRes_"+ii+"/in2");
    if ismember(ii,attentionBlockIndices)
        skipConnectionSource = "normAttn_"+ii;
net = connectLayers(net,skipConnectionSource,"addAttn_"+ii+"/in2");
end
end
% Helper functions
% Residual block
function layers = residualBlock(numChannels,filterSize,numGroups,name)
layers = [
groupNormalizationLayer(numGroups,Name="norm1Res_"+name)
swishLayer()
convolution2dLayer(filterSize,numChannels,Padding="same")
functionLayer(@(x,y) x + y,Formattable=true,Name="addEmbedRes_"+name)
groupNormalizationLayer(numGroups)
swishLayer()
convolution2dLayer(filterSize,numChannels,Padding="same")
additionLayer(2,Name="addRes_"+name)
];
end
% Attention block
function layers = attentionBlock(numHeads,numKeyChannels,numGroups,name)
layers = [
groupNormalizationLayer(numGroups,Name="normAttn_"+name)
SpatialFlattenLayer()
selfAttentionLayer(numHeads,numKeyChannels)
SpatialUnflattenLayer()
additionLayer(2,Name="addAttn_"+name)
];
end
H. FUNCTION: CENTERSIGNAL
function centeredSig = centerSignal(sig)
% Center the signal by circularly shifting its peak to the middle sample.
% (Assumption: i is the peak index and c the center index; their
% computation is not shown in the source listing.)
[~,i] = max(abs(sig)); % peak location
c = floor(length(sig)/2) + 1; % center sample
% shift signal
if i~=c
delay = i - c;
centeredSig = circshift(sig,-delay);
else
centeredSig = sig;
end
end
These scripts classify signal data using a 1-D convolutional neural network based on the demo
code in [52]. For other details, please see Chapter IV, Section C of the thesis.
Please use the supporting function findDataCombination to pick the desired combination and
change the designated lines of the following code accordingly. The code then runs 5 times for
the selected combination and reports the performance metrics of each run in a matrix.
Load Data
Load the data file that contains the training and test set combinations of real and generated
signal samples.
for cs=1:numCombinations
[trainData,trainLabel,testData,testLabel,saveName]=findDataCombination(cs,datafile);
tic % start timer to collect processing time
numIteration=5;
for m=1:numIteration
traind=trainData.'; % CHANGE right side according to the data combination
testd=testData.'; % CHANGE right side according to the data combination
Organize Data
Separate the data into training, validation, and test sets using the selected percentages. Because
our data combinations already provide separate training (80%) and test (20%) sets, we partition
only the training set here, choosing sample counts that yield overall ratios of 70% for training
and 10% for validation. The test set is used as-is.
dataTrain = cell(size(traind,2),1); % preallocate cell array of training sequences
for g=1:size(traind,2)
dataTrain{g,1}=traind(:,g);
end
dataTest = cell(size(testd,2),1); % preallocate cell array of test sequences
for g=1:size(testd,2)
dataTest{g,1}=testd(:,g);
end
numChannels = 1;
numObservations = length(traind(1,:));
[idxTrain,idxValidation] = trainingPartitions(numObservations,[10500/12000 1500/12000]); % overall 70–10 split
XTrain = dataTrain(idxTrain);
TTrain = categorical(trainLabel(idxTrain));
XValidation = dataTrain(idxValidation);
TValidation = categorical(trainLabel(idxValidation));
XTest = dataTest;
TTest = categorical(testLabel);
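trainingPartitions is a support function distributed with the MathWorks demo [52] and is not listed in the thesis. A minimal sketch of its behavior, returning mutually exclusive randomized index partitions (implementation details assumed):

function varargout = trainingPartitions(numObservations,splits)
% Split 1:numObservations into random, non-overlapping index sets
% whose sizes follow the ratios in splits.
idx = randperm(numObservations);
varargout = cell(1,numel(splits));
start = 1;
for k = 1:numel(splits)
    partitionSize = floor(splits(k)*numObservations);
    varargout{k} = idx(start:start+partitionSize-1);
    start = start + partitionSize;
end
end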
1-D Convolutional Network Architecture
The network uses two blocks of 1-D convolution, ReLU, and layer normalization layers, followed by global average pooling, a fully connected layer, and softmax.
filterSize = 50;
numFilters = 32;
classNames = categories(TTrain);
numClasses = numel(classNames);
cnnlayers = [ ...
sequenceInputLayer(numChannels)
convolution1dLayer(filterSize,numFilters,Padding="causal")
reluLayer
layerNormalizationLayer
convolution1dLayer(filterSize,2*numFilters,Padding="causal")
reluLayer
layerNormalizationLayer
globalAveragePooling1dLayer
fullyConnectedLayer(numClasses)
softmaxLayer];
Training Options
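The options definition under this heading is elided in the source. A plausible sketch following the 1-D CNN demo [52] (the epoch count and learning rate are assumptions; the left padding direction matches the test code below):

options = trainingOptions("adam", ...
    MaxEpochs=60, ...
    InitialLearnRate=0.01, ...
    SequencePaddingDirection="left", ...
    ValidationData={XValidation,TValidation}, ...
    Metrics="accuracy", ...
    Verbose=false);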
cnnnet = trainnet(XTrain,TTrain,cnnlayers,"crossentropy",options);
Test Neural Network
scores = minibatchpredict(cnnnet,XTest,SequencePaddingDirection="left");
YTest = scores2label(scores, classNames);
cmat = confusionmat(TTest,YTest);
metrics = multiclass_metrics_common(cmat); % for code of this function please see [54]
results(m,1)=metrics.Accuracy;
results(m,2)=metrics.F1score;
results(m,3)=metrics.Precision;
results(m,4)=metrics.Recall;
end % m (iteration loop)
load briefMetrics.mat % reload metrics accumulated over earlier combinations
t(cs)=toc; % processing time for this combination
allMetrics(cs,1)=mean(results(:,1)); % mean accuracy over the 5 runs
allMetrics(cs,2)=mean(results(:,2)); % mean F1 score
allMetrics(cs,3)=mean(results(:,3)); % mean precision
allMetrics(cs,4)=mean(results(:,4)); % mean recall
save briefMetrics.mat t allMetrics
clear
close all
end % cs
APPENDIX B. SUPPLEMENTARY CONFUSION MATRICES
APPENDIX C. ADDITIONAL PLOTS AND TABLES OF
PERFORMANCE METRICS
Note: Each circle represents a different run. Error bars indicate maximum and minimum
values at that data point.
Figure 17. Overall accuracy after 80 runs of the classifier.
Table 6. Accuracy values after 80 runs of the classifier.
Note: Each circle represents a different run. Error bars indicate maximum and minimum
values at that data point.
Figure 18. Overall F1 score after 80 runs of the classifier.
Table 7. F1 score values after 80 runs of the classifier.
Note: Each circle represents a different run. Error bars indicate maximum and minimum
values at that data point.
Figure 19. Overall precision after 80 runs of the classifier.
Table 8. Precision values after 80 runs of the classifier.
Note: Each circle represents a different run. Error bars indicate maximum and minimum
values at that data point.
Figure 20. Overall recall after 80 runs of the classifier.
Table 9. Recall values after 80 runs of the classifier.
Figure 21. Detailed performance metrics for boat signals.
Figure 22. Detailed performance metrics for dolphin signals.
Figure 23. Detailed performance metrics for whale signals.
LIST OF REFERENCES
[3] J. Gui, Z. Sun, Y. Wen, D. Tao, and J. Ye, “A Review on Generative Adversarial
Networks: Algorithms, Theory, and Applications,” IEEE Transactions on
Knowledge and Data Engineering, vol. 35, no. 4, pp. 3313–3332, 1 April 2023.
Available: https://doi.org/10.1109/TKDE.2021.3130191.
[4] F. Liu, Q. Song, and G. Jin, “Expansion of restricted sample for underwater
acoustic signal based on generative adversarial networks,” in Proceedings of
SPIE, vol. 11069, art. 1106948, pp. 1–8, May 2019.
[7] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.
Yang, “Diffusion models: a comprehensive survey of methods and applications,”
ACM Computing Surveys, vol. 56, no. 4, art. 105, April 2024. Available:
https://doi.org/10.1145/3626235
[8] S. Ji, J. Luo, and X. Yang, “A Comprehensive Survey on Deep Music Generation:
Multi-Level Representations, Algorithms, Evaluations, and Future Directions,”
arXiv, 2020. Available: https://doi.org/10.48550/arXiv.2011.06801
[10] A. Caillon and P. Esling, “RAVE: A variational autoencoder for fast and high-
quality neural audio synthesis,” arXiv, 2021. Available: https://doi.org/10.48550/
arXiv.2111.05011
[11] I. P. Yamshchikov and A. Tikhonov, “Music generation with variational recurrent
autoencoder supported by history,” SN Applied Sciences, vol. 2, art. 1937,
2020. Available: https://doi.org/10.1007/s42452-020-03715-w
[12] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox:
a generative model for music,” arXiv, 2020. Available: https://doi.org/10.48550/
arXiv.2005.00341
[14] H.-M. Liu and Y.-H. Yang, “Lead Sheet Generation and Arrangement by
Conditional Generative Adversarial Network,” in 2018 17th IEEE International
Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA,
2018, pp. 722–727. Available: https://doi.org/10.1109/ICMLA.2018.00114
[19] J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” in 34th
Conference on Neural Information Processing Systems (NIPS 2020), Vancouver,
Canada, 2020. Available: https://proceedings.neurips.cc/paper_files/paper/2020/
file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf
[20] Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data
distribution,” in 33rd Conference on Neural Information Processing Systems
(NIPS 2019), Vancouver, Canada, 2019. Available:
https://proceedings.neurips.cc/paper_files/paper/2019/file/
3001ef257407d5a371a96dcd947c7d93-Paper.pdf
[21] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole,
“Score-based generative modeling through stochastic differential equations,”
arXiv, 2021. Available: https://doi.org/10.48550/arXiv.2011.13456
[25] W. Huang and F. Zhan, “A novel probabilistic diffusion model based on the weak
selection mimicry theory for the generation of hypnotic songs,” Mathematics,
vol. 11, art. 3345, 2023. Available: https://doi.org/10.3390/math11153345
[30] P. Roux, W. A. Kuperman, and K. G. Sabra, “Ocean acoustic noise and passive
coherent array processing,” Comptes Rendus Geoscience, vol. 343, issues 8–9, pp.
533–547, 2011. Available: https://doi.org/10.1016/j.crte.2011.02.003
[31] The MathWorks Inc., “spectrogram.” Accessed: June 23, 2024. Available:
https://www.mathworks.com/help/signal/ref/spectrogram.html?s_tid=doc_ta
[32] B. G. Greene, D. B. Pisoni, and T. D. Carrell, “Recognition of speech spectrograms,”
Journal of the Acoustical Society of America, vol. 76, no. 1, pp. 32–43, 1984.
Available: https://doi.org/10.1121/1.391035
[34] Y. M. G. Costa, L. S. Oliveira, A. L. Koerich, and F. Gouyon, “Music genre recognition
using spectrograms,” in 18th International Conference on Systems, Signals and
Image Processing, Sarajevo, Bosnia and Herzegovina, 2011, pp. 1–4.
[36] J. Huang, B. Chen, B. Yao, and W. He, “ECG Arrhythmia Classification Using
STFT-Based Spectrogram and Convolutional Neural Network,” IEEE Access,
vol. 7, pp. 92871–92880, 2019. Available: http://dx.doi.org/10.1109/
ACCESS.2019.2928017
[40] D.W. Griffin and J.S. Lim, “Signal estimation from modified short-time Fourier
transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.
ASSP-32, no. 2, pp. 236–243, April 1984.
[42] J. Le Roux, H. Kameoka, N. Ono, and S. Sagayama, “Fast Signal Reconstruction
from Magnitude STFT Spectrogram Based on Spectrogram Consistency,” in 13th
International Conference on Digital Audio Effects (DAFx-10), Graz, Austria,
September 6–10, 2010.
[47] The MathWorks Inc., “Resize data by adding or removing elements.” Available:
https://www.mathworks.com/help/matlab/ref/resize.html?searchHighlight=
resize&s_tid=srchtitle_support_results_2_resize
[48] The MathWorks Inc, “Generate Images Using Diffusion Example.” Available:
https://www.mathworks.com/help/deeplearning/ug/generate-images-using-
diffusion.html#GenerateImagesUsingDiffusionExample-3
[49] Z. Xiong, W. Wang, J. Yu, Y. Lin, and Z. Wang, “A Comprehensive Survey for
Evaluation Methodologies of AI-Generated Music,” arXiv, 2023. Available:
https://doi.org/10.48550/arXiv.2308.13736
INITIAL DISTRIBUTION LIST