
Deepfake audio detection and justification with

Explainable Artificial Intelligence (XAI)


Aditi Govindu ([email protected])
MIT World Peace University
Preeti Kale
MIT World Peace University
Aamir Hullur
MIT World Peace University
Atharva Gurav
MIT World Peace University
Parth Godse
MIT World Peace University

Research Article

Keywords: Generative Adversarial Neural Networks (GANs), deepfake audio, VGG16, Explainable Artificial Intelligence (XAI), Fréchet Audio Distance (FAD)

Posted Date: October 17th, 2023

DOI: https://doi.org/10.21203/rs.3.rs-3444277/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Read Full License

Additional Declarations: No competing interests reported.


Deepfake audio detection and justification with
Explainable Artificial Intelligence (XAI)
Aditi Govindu1*†, Preeti Kale1*, Aamir Hullur1†, Atharva
Gurav1† and Parth Godse1†
1*Computer Science Engineering, Dr. Vishwanath Karad, MIT
World Peace University, Kothrud, Pune, 411038, Maharashtra,
India.

*Corresponding author(s). E-mail(s): [email protected];


[email protected]

ORCID ID: Aditi Govindu (0009-0008-5432-7435)

Contributing authors: [email protected];


[email protected]; [email protected]
†These authors contributed equally to this work.

Abstract
Deepfake audio refers to synthetically generated audio, often used to perpetrate hoaxes by impersonating human voices. This paper generates fake audio from the Fake or Real (FoR) dataset using Generative Adversarial Neural Networks (GANs). The FoR dataset has the advantage of a diverse set of speakers across 195,000 samples. The proposed work analyses the quality of the generated fake data using the Fréchet Audio Distance (FAD) score; a FAD score of 23.814 indicates that good-quality fakes have been produced by the generator. The study further enables glass-box analysis of deepfake audio detection through the Explainable Artificial Intelligence (XAI) models LIME, SHAP and GradCAM. This research assists in understanding the impact of frequency bands on audio classification, based on quantitative analysis of Shapley values and qualitative comparison of the explainability masks of LIME and GradCAM. The FAD metric provides a quantitative evaluation of generator performance. XAI and FAD metrics together help in the development of deepfake audio through GANs with minimal data input. The results of this research are applicable to the detection of phishing audio calls and voice impersonation.

Keywords: Generative Adversarial Neural Networks (GANs), deepfake audio, VGG16, Explainable Artificial Intelligence (XAI), Fréchet Audio Distance (FAD).

1 Introduction
Deepfake audio is synthetic audio, generated by Artificial Intelligence (AI) models, through
manipulation of pre-existing audio data (Agarwal et al., 2020). AI algorithms apply replay-based, synthesis-based or imitation-based approaches to real data and create human-like audio samples. They modify the pitch, frequency, amplitude, or background noise of the original audio to create new samples vastly different from the original (Müller et al., 2022). Originally conceived to give audiobooks a human touch or a new voice to speech-impaired persons, deepfake audio has evolved into a cybersecurity concern. It is being used to manipulate public opinion (Lyu, 2020) and falsely incriminate
individuals.
The evolution of technology has led to the use of GANs (Goodfellow et. al., 2020) for
generation of deep fake audio. These models comprise a generator and discriminator, that
compete to improve the overall GAN performance. The generator applies latent 1-
dimensional noise to the original audio data and creates a fake sample. The discriminator’s
goal is to segregate audio as fake or real. Once a decision is made, the generator learns to
create better fakes and the discriminator becomes more robust in its decision-making. The
quality of fakes thus generated is obtained by comparing the probability distribution of real
samples to fake samples through Fréchet distance metrics (Eiter and Mannila, 1994). Our research emphasizes the use of FAD to judge the quality of fake audio generated by GANs. The samples created by our GAN generator, combined with an existing dataset of real and fake samples, are nearly impossible for humans to distinguish as real or fake. Therefore, it is imperative to train deep learning models to identify fake audio quickly and efficiently. Through our work, we train various classifiers on real and fake data.
Deep learning models are black box models that give minimal insight into the classification process. Given the rise of deepfake audio creation, we highlight the
frequencies a model flagged as fake through XAI techniques (Došilović, Brčić and Hlupić,
2018). This allows humans to not only review the classification features, but also improve
fake audio creation through GANs.
Through this study, we convert black box deep learning models into white box models.
We use XAI that comprises tools and techniques to detect, study and explain the “thought
process” of a computational model. It converts a deep learning model to a white box model,
by highlighting parameters the model learnt and interpreting model predictions (Guo, 2020).
It focuses on explanation, meaningfulness, accuracy, and knowledge limits of the model.
We use LIME, an acronym for Local Interpretable Model-Agnostic Explanations (Salih et al., 2023), SHAP or Shapley Additive Explanations (Lujain et al., 2020) and Gradient-weighted Class Activation Mapping (GradCAM) (Ramprasaath et al., 2017) to provide qualitative metrics for model study.
This paper explains the creation of deepfake audio using GANs and the subsequent evaluation of the GANs using the FAD metric (Santiago et al., 2023). We convert audio files to spectrograms for real-or-fake classification and explain the process through qualitative XAI metrics. Our work highlights the use of Fréchet distance and XAI metrics in deepfake audio detection.
The next section describes the background of the paper with emphasis on features
crucial in identifying audio discrepancies. It is followed by a thorough survey on deep-fake
attacks, counter measures, XAI-GANs and datasets used. In section 4, the methodology
along with problem statement and algorithms are explained at length. Furthermore, section
5 on performance analysis and results defines our experimental setup and highlights the
results. Discussion in section 6 is summed up by conclusion in section 7. Section 8 explains
future work that can be undertaken.

2 Background
Audio hoaxes and spam calls have been known in the market for several years. However, the
dawn of generative AI has led to the creation of digital audio samples, making it near
impossible for a novice ear to identify the difference.
In this section, we discuss the features of audio used in identifying it as real or fake. We lay
emphasis on cepstral scores as our chosen feature for identification, amongst pitch, frequency,
amplitude etc.

2.1 Deepfake audio detection features


Human speech is slow, with appropriate pauses and breaths. Deep learning models, on the other hand, generate a continuous stream of frequencies and therefore lack such pronounced pauses and patterns. These pacing differences between real and generated audio make it easier for classifiers to label generated audio as fake (Yi, Fu, et al., 2022).
The length of audio clips and levels of background noise play a critical role in the classification
process. Longer clips have greater chances of being labelled as fake, due to variations in pitch and
background noise (Wijethunga, 2020). Deepfake audio created from samples with low background noise is more likely to be identified as fake, because additional noise is applied when creating a fake sample (Li, Ahmadiadli and Zhang, 2022).

Fricatives (Metehan et al., 2021), consonant sounds such as f, v, s and z, are produced by humans by constricting airflow in the vocal tract. For instance, the letter f is created by labiodental contact between the lower lip and upper teeth, while s is an alveolar sound. Fricatives have a rounded feel when said by humans, as opposed to deep learning models, because air-pressure variations in the vocal tract shape their enunciation. Hence, audio containing these sounds can be discerned as real or fake with ease.
The most popular approach to detection is conversion of audio to a MEL Spectrogram scale and
comparison of frequency distributions in real and fake samples. The quantification of frequency and
representation of data in numeric format makes cepstral score (Shadle and Mair, 1996) based
classification a common approach in deep fake audio detection.
Keeping these approaches in mind, we apply cepstral-score quantification of frequency and represent audio in a numeric format. By converting the signal from the time domain to the frequency domain, we can represent audio as spectrograms in a reference-free format. This is a quantifiable, generic representation used at a later stage for classification into real and fake. Hence, we selected cepstral scores and spectral analysis as our quantifiable approach. We use GANs to create fake audio samples, augment an existing fake dataset with these fakes, and apply transfer learning (Logan, 2000) to classifiers on the resulting dataset.
This work contributes to XAI, GANs and audio deep fake detection in the following ways:
1. Generation of fake samples using GANs and objective evaluation with the FAD metric.
2. Use of the Fake or Real (FoR) dataset, a new and effective dataset for our study, augmented with the generated fake samples.
3. Application of transfer learning to deepfake audio classification.
4. Conversion of black box deep learning models to white box models, through qualitative analysis of LIME, SHAP and GradCAM results.
This research highlights the impact of XAI in classification and conversely, creation of better fake
audio samples through GANs.

3 Literature survey
We highlight the common types of deep fake audio attacks in the past in section 3.1. Section 3.2
compares the techniques used to discover such attacks. Section 3.3 explains the use of XAI-
GANs models for image datasets and the gap in research on GANs for audio. Section 3.4
summarizes the audio datasets available in the market and features of the data chosen for this
study. An overview of each section has been depicted in Fig. 1.

Fig. 1 Literature survey overview: (1) Attacks: cloning, conversion, text-to-speech, synthesis; (2) Detection: VAE, GANs, RNN, ensemble, CNN; (3) XAI-GANs: LIME, GradCAM, SHAP, LRP; (4) Datasets: ASV Spoof, ReMASC, VoxCeleb, FakeAVCeleb, Fake or Real.

3.1 Deepfake attacks

Deepfake audio attacks refer to the use of altered audio to clone, replicate or reproduce a human
voice by digital means for the purpose of fooling the listener. This section describes prevalent
attacks of cloning, substitution, generation, and synthesis.
Voice cloning involves the use of deep learning procedures to replicate a person's voice by
emulating their speech patterns. Once trained, the system can generate new speech that mimics
the speaker’s voice. This technique can be used to create fake audio content making it difficult
to distinguish from real samples (Tan et. al., 2018). However, it is a resource intensive digital
process, requiring manual speaker recording and preprocessing.
Voice conversion (Arik et. al., 2018) involves the modification of a human’s voice to
resemble another. This technique can be used to create fake audio content that appears to be
spoken by a particular person. This can also expand to encompass speaker impersonation
(Bisman, Yamagishi and Li, 2020) where the voice of a particular person is replicated, to make
them “say” something they did not. Lip-syncing involves altering the speech of a person to
match a particular video or image, making it appear as if the person is speaking the words in the
video or image (Chen et. al., 2017). Syllable substitution involves replacing specific syllables in
a person's speech with other syllables to create new words or phrases (Suwajanakorn et. al,
2017). All these techniques require careful mapping of audio and video, as slight distortions can
be easily identified.
Text-To-Speech (TTS) synthesis transforms text to audio using AI algorithms (Carson-
Berndsen, 1990). These algorithms can generate human-like speech that can be difficult to
distinguish from real audio. This technique can be used to create fake audio content and is most
popularly used by voice assistants (Shen et. al, 2018) in the market today.
Audio synthesis (Subhash et. al., 2020) involves the creation of audio content that does not
exist. This technique can be used to create realistic fake audio content, such as fake speeches,
interviews, or events that did not transpire. Such content is created entirely digitally, and we aim to generate samples of this kind through GANs. Our approach avoids the manual labor of cloning, the audio-to-video mapping of conversion, and the text input of the TTS approach. Noise is added to audio to generate synthetic files, making our approach reliable and scalable.
3.2 Deepfake attack detection techniques

To detect the attacks mentioned in section 3.1, researchers have used generative, and memory
based deep learning models in the detection process. These detection methods have been
explained in this section.
GANs and Variational Auto Encoders (VAEs) are generative models used to detect audio deepfakes. These models are trained to distinguish between genuine and manipulated audio clips by studying the underlying probability distributions of the audio. The authors of (Donahue et al., 2018) proposed a GAN-based model to detect voice conversion and speaker impersonation attacks; however, it requires a long training time.
Recurrent Neural Networks or RNNs are commonly used to detect audio deepfakes by
analyzing the temporal dependencies in audio data (Jones, 2020). Temporal convolutional
neural networks and RNNs can be combined in an ensemble model or attention-based LSTM
networks, to detect manipulated audio clips (Chintha et. al., 2020).
Convolutional Neural Networks or CNNs detect audio deepfakes by analyzing the spectral
and temporal characteristics of audio data. The work of (Su et al., 2021) uses a CNN model to extract temporal and spectral audio features from the data and applies them to detect manipulated
audio clips. The spectrum of audio signal represents a higher-order Fourier correlation. Fast
Fourier Transform (FFT) is applied to create Fourier coefficients, that depict first-order statistics
or first-order correlation in frequencies. These coefficients have lower correlation for human
audio, as opposed to high correlation statistics for deep fake audio. Furthermore, analysis of
cepstral score demonstrated a component of power in speech data which AI synthesized audio
cannot recreate (Lewis et. al., 2020).
Ensemble models combine multiple deep learning models to improve the accuracy of audio deepfake recognition (AK Singh and P Singh, 2021). An ensemble of CNNs and RNNs to detect manipulated audio clips is the most popular approach to deepfake detection.
We integrate the research done by Su, Xia, Liang, Nie (2021) and Lewis, Toubal, Chen,
Sandesera et. al. (2020) to create our model to detect AI synthesized speech. This approach
improves work done by (Rana and Sung, 2020), where cepstral score is represented as MEL
Spectrogram, prior to classification. It requires less training time than the VAE and RNN approaches. The models in this study are pre-trained; hence, we utilize pre-defined weights to provide accurate results within a short period of time, compared to CNN or ensemble approaches.

3.3 Explainable Generative Adversarial Neural Networks (XAI-GANs)

As generative techniques evolve, work on detection must grow at the same pace to develop
more effective methods for detecting audio deepfakes. The models discussed in section 3.2 have
shown promising results, but can be improved, especially when it comes to detecting
sophisticated audio attacks. This leads us to the use of XAI to illustrate the process of
classification and how humans can intervene to advance model performance.
The research by (Chugh, Gupta et al., 2020) explains the differentiation of the 10 digits in the MNIST dataset using 20% of the original dataset, where noise is used to modify the most important features in the data before passing it to the generator. We considered this research as our baseline and benchmarked the results on the Audio MNIST dataset before proceeding with our work on deepfake audio creation.
Work done by (Yu, 2022) compares the performance of diffusion models and GANs. The
diversity of images explored by GANs is less than that of diffusion models. Diffusion models
apply gradual noise, covering distribution of data in the entire image. Thus, fidelity or trueness
of data distribution is better in diffusion models, compared to GANs.
In the study performed by (Dhariwal and Nichol, 2021), XAI is used to understand why certain classes or nodes were dropped by generators. A discriminator score was assigned to identify whether the dropping rate was correlated with the architecture of the deep learning model.
XAI is used to limit the learning rate of generators in LDGAN and reduce overfitting problems, as stated by (Huang, Lin, et al., 2021). The use of the DiME (Decision Making Analysis) diffusion model to generate counterfactual images and the verification of results with XAI encouraged (Kim and Park, 2023) to create a new metric, called correlation difference, to find spurious relations in the fake images of generators.
A study on Audio MNIST data by (Guillaume, Simon, and Jurie, 2022) classified the gender of the speaker using AlexNet and AudioNet models, with the Layer-wise Relevance Propagation (LRP) XAI metric used to decipher the audio features identified in the classification process.
Through the work on XAI-GANs models, we identified an opportunity for use of GANs for
audio data. Our survey highlights an opportunity to research normal human speech dataset with
XAI models of LIME, SHAP or GradCAM. Therefore, we use custom explainable Deep
Convolutional Generative Adversarial Neural Networks (DCGANs) as done by (Becker et. al.,
2018) to create fake audio data from the Fake or Real dataset which is explained in section 3.4.

3.4 Audio datasets

Given a fair understanding of deepfake audio attacks, techniques for detection and
explainability in improvement of GANs performance, we perform an extensive comparison of
datasets suitable to our research work.
The ASV Spoof dataset proposed by (S Jindal and N Jindal, 2021) is one of the most
widely used datasets for fake audio detection. It includes a collection of 8076 synthetic and real
speech samples, covering various spoofing attacks, such as replay attacks and speech
conversion. The dataset was used in several studies for detection, generation, and study
purposes.
The ReMASC dataset first used by (Kawa, Plata and Syga, 2022) is a replay attack-based
collection of 132 voice commands. It includes clippings of voice conversion and speaker
impersonation, along with their corresponding original clips.
The VoxCeleb dataset (Nagrani, Chung and Zisserman, 2017) is an extensive dataset encompassing a collection of speech samples from 6,112 celebrities. Although not designed specifically for fake audio detection, it has been used in some studies, as mentioned by (Gong, Yang et al., 2019).
The FakeAVCeleb dataset (Khalid, Tariq, Kim and Woo, 2021) contains manipulated audio clips of celebrities created using voice conversion techniques. The dataset was developed
specifically for the task of voice conversion detection and includes collection of samples from 5
racial backgrounds. Each celebrity has 20 real speech samples and 40 fake speech samples. The
FakeAVCeleb dataset was used in a study which proposed a deep learning-based approach for
detecting digitally faked voices. We have generated fake audio from this dataset and augmented
it to the Fake or Real dataset.
The Fake or Real (FoR) dataset is a 2 GB dataset owned by the Lassonde School of Engineering. It is an aggregation of VoxCeleb, Google WaveNet TTS and LJSpeech data, comprising 195,000 clippings across 4 versions. It was created by (Reimao and Tzerpos, 2019). The for-
original version contains balanced, unaltered raw audio data, as obtained from the source. The
for-norm version contains normalized data of for-original samples. The for-2sec version is
derived from the for-norm version, where files have been truncated to 2 seconds. The fourth
version, which we have considered in this work is the for-rerec version. It is a rerecorded
version of for-2sec, designed to emulate an attacker’s utterances over voice calls.
The FoR dataset is the largest, latest benchmarked audio dataset. Given its length of
clippings and diversity of speakers, we generate fake samples from Fake AV Celeb data and
append it to the FoR dataset, for our use case.
Post our extensive literature survey, we identified research gaps that we bridge through our
novel work. We enumerate them as:
1. Scope to perform quantitative evaluation of generator performance in GANs using the
FAD metric.
2. The state-of-the-art survey demonstrates an opportunity to augment audio datasets from various sources for deepfake detection. We explore the synthesis of audio using DCGANs on the FakeAVCeleb dataset and subsequent augmentation of the novel FoR dataset, to create custom data for detection.
3. Current research explores a single explainable approach for each machine learning or
deep learning model. Through this paper, we contrast the frequency bands in heatmaps
of LIME, SHAP and GradCAM models. We infer subtle differences in frequency
bands that each XAI model identifies.
Our scope to work on audio creation using DCGANs and contrast explainable models for
audio detection are established in section 4.

4 Methodology
Our problem statement is the creation of deepfake audio samples using GANs on the FakeAVCeleb dataset and the evaluation of the quality of the samples thus created using FAD scores. We augment the fake samples in the FoR dataset with these generated fakes. We convert this dataset to spectrograms, classify the spectrograms with pretrained classifiers, and ultimately explain the classification process with the XAI models LIME, GradCAM and SHAP.
In section 4.1, we establish our problem statement with pre-requisites. Section 4.2 depicts
model architecture and data flow. We conclude our methodology with algorithms in section
4.3.

4.1 Problem statement and pre-requisites

We define a DCGAN architecture and use the FakeAVCeleb dataset to create 128 fake audio samples per batch. The GAN applies 1-dimensional noise (y) to an audio signal a, leading to the creation of a fake sample x.
We evaluate the performance of the generator using the FAD metric (Kilgour et al., 2018). FAD measures the magnitude of distortion between the multivariate Gaussian distributions of real and fake audio. As distortion decreases, the separation between the two distributions decreases and the FAD score decreases. Thus, a low FAD score indicates better fake sample generation.
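To make the FAD computation (see Eq. (2) below) concrete, the following is a minimal sketch assuming NumPy and SciPy are available and that fixed-length embeddings for the real and generated audio have already been extracted (for example with a VGGish-style model, which is omitted here); the random arrays are only placeholders for such embeddings.

import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(real_emb, fake_emb):
    """Compute FAD between two sets of audio embeddings (n_samples x n_features)."""
    mu_r, mu_f = real_emb.mean(axis=0), fake_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_f = np.cov(fake_emb, rowvar=False)
    # Matrix square root of the product of covariances; drop tiny imaginary parts from numerical error.
    cov_sqrt = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(cov_sqrt):
        cov_sqrt = cov_sqrt.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * cov_sqrt))

# Illustrative usage with random placeholder embeddings.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 128))
fake = rng.normal(loc=0.1, size=(200, 128))
print(frechet_audio_distance(real, fake))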
After creating samples with the DCGAN and evaluating our model using the FAD score, we augment the fake samples in the FoR dataset to create a large sample of 50,000 audio clippings. Of these, 80% is training data and the rest is split equally between testing and validation. We leverage the power of pre-trained image classifiers by converting our audio samples to MEL Spectrograms. The conversion to MEL spectrograms uses Fourier transforms applied to each 2-second clip. The Fourier Transform is the mathematical operation used in our approach: its input is the audio signal in the time domain and its output is the decomposition of that signal into frequencies. A minor adjustment is applied by plotting frequency on the Y-axis on a logarithmic scale, time on the X-axis, and amplitude as colour intensity in decibels (dB). The resultant output is called a spectrogram. The conversion of the frequency axis from Hz to the MEL scale is also done at this stage, and the final spectrogram on the MEL scale is called a MEL spectrogram.
To create the MEL Spectrogram, we define the variables inspired by (Kilgour et al., 2018). The sampling rate is the number of samples read from the audio file per second; it is 22050 Hz by default, but since we consider only 2-second clippings we reduce it to 16000 Hz. The Fourier transform is applied to the audio signal in sections, as defined by a window size of 2048 (2^11) samples. The frequency of the original audio signal is labelled f, and the number of samples considered in one batch is 128. The formula for the MEL scale is listed as Eq. (1):

m = 1127 \times \ln\left(1 + \frac{f}{700}\right)    (1)

Where,
Window size of transform = 2048
Sampling rate = 16000 Hz
Frequency of original signal = f Hz
Batch size = 128
A sample spectrogram has been depicted in Fig. 2 below. It denotes the frequency
distribution of the original signal as a heatmap.

Fig. 2 MEL Spectrogram
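As a small worked example of Eq. (1), the sketch below (the function name is our own) converts a few frequencies to the MEL scale.

import math

def hz_to_mel(f_hz: float) -> float:
    """MEL scale value for a frequency in Hz, per Eq. (1)."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

for f in (100, 1000, 8000):  # 8000 Hz is the Nyquist limit at the 16000 Hz sampling rate
    print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")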

The steps of our problem statement are as follows. Consider a set of audio examples (A) with n instances of 2-second clippings {a1, a2, a3 … an}, a 1-dimensional Gaussian noise vector (y), a generator architecture (Gen) and a discriminator architecture (D).
1. Create fake audio by applying y to each ai in A.
2. Evaluate the generator using the FAD score (d) as per Eq. (2),

d^2 = \lVert \mu_r - \mu_f \rVert^2 + \mathrm{tr}\left(\Sigma_r + \Sigma_f - 2\sqrt{\Sigma_r \Sigma_f}\right)    (2)

Where,
Fréchet distance = d^2
Feature-wise mean of real data = µr
Feature-wise mean of fake data = µf
Covariance of real data = Σr
Covariance of fake data = Σf
Trace of square matrix = tr operation
3. Add generated samples to FoR dataset.
4. Convert audio files in FoR dataset to MEL Spectrograms, by applying Fourier
transform to original frequency f of audio signal and generating MEL spectrogram, as
mentioned in Eq. (3).

m = \frac{1000}{\log 2} \times \log\left(1 + \frac{f}{1000}\right)    (3)
Where,
MEL Spectrogram = m
Frequency of audio = f
5. Repeat step 4 for real audio files from FoR dataset.
6. Apply transfer learning to classifier models on FoR dataset and classify Spectrogram
as real or fake.
7. Use LIME and GradCAM to generate super pixels of frequency bands in MEL
Spectrogram, to visualize as an image mask.
8. To evaluate the model using SHAP, we use the SHAPley value (g) in Eq. (4).

g(z') = \phi_0 + \sum_{j=1}^{M} \phi_j z'_j    (4)

Where,
Explanation model = g
Coalition vector = z' ∈ {0,1}, interpreted here as {real, fake}
Maximum number of features considered = M
Feature index = j
Shapley value of feature j = ϕj ∈ R (the set of real numbers)
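To illustrate Eq. (4) numerically, the following minimal sketch uses made-up Shapley values for four frequency-band features and reconstructs the explanation model's output from the base value and a coalition vector.

import numpy as np

phi_0 = 0.45                               # base value: expected model output over a background set
phi = np.array([0.08, -0.03, 0.12, 0.05])  # made-up Shapley values for 4 frequency-band features
z_prime = np.array([1, 1, 0, 1])           # coalition vector: which features are "present"

g = phi_0 + np.sum(phi * z_prime)          # Eq. (4)
print(g)                                   # 0.45 + 0.08 - 0.03 + 0.05 = 0.55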

4.2 System design

Fig. 3 illustrates the flow of data in our classification process from conversion of audio to
Spectrograms, classification, and application of XAI. We also define the 3 algorithms in this
process in section 4.3.

Fig. 3 Data flow: Algorithm 1 (generateFakeAudio) → Algorithm 2 (createSpectrogram) → Algorithm 3 (classifyExplain)

4.3 Algorithms

For our research, we use 3 algorithms. Fig. 4 and Algorithm 1 represent the first algorithm, which generates fake samples from the FakeAVCeleb data and is known as the generateFakeAudio algorithm. Real audio (RA) is the input to the generator (Gen). Noise (y) is applied to the real audio (RA). In every epoch, the discriminator's (D) performance in classifying audio as real or fake is used to improve y. In the last epoch, the fake audio (FA) created is saved to a folder and passed, along with RA, to the FAD module. The FAD score, once generated, is used to objectively measure GAN performance.

Fig. 4 Components of algorithm 1

The estimated time complexity of this algorithm is no. of epochs (k) × no. of samples in
dataset (n), which approximates to O(kn). These steps have been depicted in Algorithm 1.

Algorithm 1 Generate Fake Audio


Require
Audio recordings (AR) = {a1, a2, a3 … an}
Sample count = n
1 dimensional gaussian noise (y)
Produce
Fake audio samples (FR) = AR + y

1: Define generator layers (Gen)


2: Define discriminator layers (D)
3: Define epochs of GANs model (default: 50)
4: while epoch e < 50 do
5: for ai in AR do
6: G(x): x = ai + y
7: Discriminator loss = (1/n) × Σ(log(D(x)) + log(1 − D(Gen(x))))
8: Generator loss = (1/n) × Σ(log(1 − D(Gen(x))))

9: Minimize losses using optimizer


10: Update samples generated (a’)
11: Modify noise (y)
12: end for
13: if e = 50 then
14: Store results as generator output (Gen)
15: else [e != 50]
16: Continue to step 5
17: end if
18: end while
19: for sample in Gen do
20: Write audio file to folder (/generated)
21: end for
22: Import VGG model for FAD score and pass samples in AR folder, as well as
FR folder.
23: Read FAD score on a scale of 0-100
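For concreteness, the following is a minimal Keras sketch of the adversarial training step in Algorithm 1. The layer sizes, optimizers and the 2-second, 16 kHz input length are illustrative assumptions, and for simplicity the generator maps a noise vector directly to an audio sample rather than adding noise to a real clip as in line 6 of Algorithm 1.

import tensorflow as tf

AUDIO_LEN, NOISE_DIM, BATCH = 32000, 100, 128  # 2 s at 16 kHz; illustrative sizes

generator = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(NOISE_DIM,)),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(AUDIO_LEN, activation="tanh"),    # fake 1-D audio sample
])
discriminator = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation="relu", input_shape=(AUDIO_LEN,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),          # probability the input is real
])

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_audio):
    noise = tf.random.normal([BATCH, NOISE_DIM])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_audio = generator(noise, training=True)
        d_real = discriminator(real_audio, training=True)
        d_fake = discriminator(fake_audio, training=True)
        # Discriminator: label real clips as 1 and generated clips as 0 (line 7 of Algorithm 1).
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        # Generator: fool the discriminator into labelling fakes as real (line 8 of Algorithm 1).
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss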

The second algorithm applies the MEL transformation to the audio samples; it is known as the createSpectrogram function and is depicted in Algorithm 2. The time complexity of Algorithm 2 depends on the time to compute the Fast Fourier Transform (FFT) for each of the n audio files, that is, O(n) overall.

Algorithm 2 Create Spectrogram


Require
Real audio (RA)
Fake audio (FA)
Window size of transform = 2048
Sampling rate = 16000 Hz
Frequency of original signal = f Hz
Batch size = 128
Produce
Real spectrograms in /real folder.
Fake spectrograms in /fake folder.

1: Load real audio files (RA)


2: for ri in RA do
3: Apply the Fast Fourier Transform to the original signal to get the FFT
4: Apply the Short-Time Fourier Transform to the original signal to get the STFT
5: Use librosa library in Python to create MEL Spectrogram for STFT
6: Write spectrogram as .png image to /real/img.name
7: end for
8: Load fake audio files (FA)
9: Repeat steps 2 – 6 for fake audio files
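A minimal sketch of the createSpectrogram step using the librosa and matplotlib packages is shown below. The file paths, figure size and 512-sample hop length (librosa's default) are illustrative assumptions; the 16000 Hz sampling rate and 2048-sample window follow Section 4.1.

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def create_spectrogram(audio_path: str, out_path: str,
                       sr: int = 16000, n_fft: int = 2048, hop_length: int = 512) -> None:
    """Convert a 2-second audio clip to a MEL spectrogram image (Algorithm 2)."""
    signal, sr = librosa.load(audio_path, sr=sr, duration=2.0)
    # Short-Time Fourier Transform -> power MEL spectrogram -> decibel scale.
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=n_fft, hop_length=hop_length)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    fig, ax = plt.subplots(figsize=(3, 3))
    librosa.display.specshow(mel_db, sr=sr, hop_length=hop_length, ax=ax)
    ax.set_axis_off()                      # keep only the heatmap for the classifier
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)

# Example usage (paths are placeholders):
# create_spectrogram("real/audio_001.wav", "real/audio_001.png")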

The final algorithm in our pipeline is used to train the classifier on spectrograms and explain the process through XAI models. It is labelled the classifyExplain module; its components are illustrated in Fig. 5 and the algorithm itself in Algorithm 3. The results of the classifier are passed into an XAI module to identify the pixels of the spectrogram image that the model relies on, exposing the essential features it learns during classification.

Fig. 5 Components of algorithm 3

In Algorithm 3, the time complexity depends on the time needed to load the XAI module in Python and apply it to one image. Since the models are pretrained, minimal time is required to load and test the MEL Spectrograms; thus, the time is estimated as constant, O(1), across the 4 classifiers.

Algorithm 3 Classify Explain


Require
Real spectrograms
Fake spectrograms
Trained model
Produce
Explainable image mask
1: Import LIME, GradCAM, SHAP explainers.
2: Load spectrogram image
3: Load classifier model, say VGG16
4: Pass model and image to LIME module
5: Visualize feature mask of classifier
6: Repeat steps 2 – 5 for GradCAM and SHAP
7: Repeat steps 1 – 6 for all classifier models
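The LIME step of Algorithm 3 can be sketched as follows, assuming the lime package and a previously trained Keras classifier are available; the model path, perturbation count and mask settings are illustrative. GradCAM and SHAP are applied analogously to the same spectrogram.

import numpy as np
import tensorflow as tf
from lime import lime_image
from skimage.segmentation import mark_boundaries
import matplotlib.pyplot as plt

# Placeholder: a spectrogram classifier trained earlier (path is illustrative).
model = tf.keras.models.load_model("models/vgg16_spectrogram.h5")

def predict_fn(images: np.ndarray) -> np.ndarray:
    """LIME expects a batch of images and per-class probabilities in return."""
    p_fake = model.predict(images, verbose=0).reshape(-1, 1)
    return np.hstack([1.0 - p_fake, p_fake])       # columns: [real, fake]

spectrogram = plt.imread("spectrograms/sample.png")[:, :, :3]   # drop alpha channel if present

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    spectrogram.astype("double"), predict_fn, top_labels=1, num_samples=1000)

# Super-pixel mask over the frequency regions that drove the prediction.
image, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=False, num_features=8, hide_rest=False)
plt.imshow(mark_boundaries(image, mask))
plt.axis("off")
plt.show()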

5 Performance analysis and results


5.1 Experimental setup
As described in section 4, we implement the algorithms of section 4.3 in the data pipeline. We
create fake audio and convert all audio samples to spectrograms. These images are passed to
classifiers to identify as fake and real audio. The models used for this purpose are VGG16,
Mobile Net, InceptionV3 and custom CNN.
VGG16 contains 16 weight layers (13 convolutional and 3 fully connected) with 5 max-pooling layers in between, trained on ImageNet data. Mobile Net consists of 28 layers, including depthwise separable convolutional layers, batch normalization, ReLU and softmax. The InceptionV3 model uses a combination of 1x1, 3x3 and 5x5 convolutional layers, with the outputs concatenated into a single output vector. The custom CNN
comprises a convolutional layer, followed by a max-pooling layer, repeated thrice. The layers
are depicted in Table 4 below.

Table 4 Convolutional Neural Network layers

Layer name                       No. of nodes

Convolutional layer 1            32
Max pooling layer 1              -
Convolutional layer 2            128
Max pooling layer 2              -
Convolutional layer 3            128
Max pooling layer 3              -
Dense layer with 30% dropout     512
Output layer                     1
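A minimal Keras sketch of the custom CNN in Table 4 is shown below; the input size, kernel sizes and activations are assumptions, since the table specifies only the layer order and node counts.

import tensorflow as tf

def build_custom_cnn(input_shape=(224, 224, 3)) -> tf.keras.Model:
    """Custom CNN per Table 4: three conv/pool blocks, a dense layer with 30% dropout, one output node."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(128, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(128, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # real vs fake spectrogram
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_custom_cnn()
model.summary()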

Section 5.2 compares the precision, accuracy, recall and F1 metrics of all models. XAI is
used to identify features in each image to explain the classification process. The results of XAI
metrics have been enlisted in section 5.3.

5.2 Performance comparison metrics

Performance evaluation of the models on our audio spectrogram dataset is presented here.
Specifically, we used a custom CNN architecture and 3 pretrained models. We trained all 4
models on a Graphics Processing Unit (GPU) to identify differences in execution time. We vary
the batch sizes and epochs to optimize performance. Table 5 highlights the comparison of
models we have used.

Table 5 Model comparison metrics

Metrics        InceptionV3    VGG16    Mobile Net    CNN

Accuracy (%)   90.17          93.37    91.57         88.72
Precision      0.815          0.808    0.707         0.816
Recall         0.729          0.721    0.851         0.723
F1 score       0.772          0.765    0.779         0.765
Epochs         5              5        5             20
Batch size     32             32       32            10
Time (sec)     2200           2100     250           160

A visualization of the accuracy of the models implemented thus far is depicted in Fig. 6 below. It shows the highest bar for VGG16 at 93.37% and the lowest for CNN at 88.72%.

Fig. 6 Model accuracies: bar chart of accuracy per model (InceptionV3 90.17%, Mobile Net 91.57%, VGG16 93.37%, CNN 88.72%)

As gathered from Table 5 and Fig. 6 above, VGG16 achieved the highest accuracy. This is due to the transfer of prior weights learnt from the ImageNet dataset. The Mobile Net and InceptionV3 models achieved slightly lower accuracy due to significant structural differences compared to VGG16. The CNN architecture had the lowest accuracy. We attribute this performance difference to transfer learning: the custom CNN is trained from scratch, whereas the other models start from prior weights to optimize, which leads to better generalization of features from pre-trained models to spectrograms.
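The transfer-learning setup described above can be sketched as follows; freezing the convolutional base and the particular classification head are assumptions, as the exact fine-tuning configuration is not detailed here.

import tensorflow as tf

# Load the VGG16 convolutional base with ImageNet weights and freeze it,
# so that only the new classification head is trained on spectrograms.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # real vs fake spectrogram
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Example training call on spectrogram images (directory is a placeholder):
# train_ds = tf.keras.utils.image_dataset_from_directory(
#     "spectrograms/train", image_size=(224, 224), batch_size=32, label_mode="binary")
# model.fit(train_ds, epochs=5)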
In terms of execution time, custom CNN had the least time of 160 seconds for maximum
epochs of 20, owing to its small size. The other 3 models take significant time due to large
architectural structure and weights. A comparative graph of the impact of epochs on accuracy
has been depicted in Fig. 7.

Fig. 7 Epochs vs accuracy vs batch size: epochs (5, 5, 5, 20), batch size (32, 32, 32, 10) and accuracy (90.17%, 93.37%, 91.57%, 88.72%) for InceptionV3, VGG16, Mobile Net and CNN respectively

5.3 XAI metrics

We used three different XAI techniques, LIME, GradCAM and SHAP, to understand how each
model was making its predictions. Specifically, we used these techniques to identify frequency
bands or pixels of the spectrogram that were critical for each prediction, and to compare the
importance of different features across different models.
We tested all models on a spectrogram and viewed the Shapley values for the image, as illustrated in column 1 of Table 6. Green regions indicate the pixels with maximum positive impact on classification, while red regions denote maximum negative impact. A sample SHAP colour
scale is illustrated in Fig. 8 below.

Fig. 8 SHAPley value range

The SHAP value range for each row in Table 6 has been established as:
Row 1: Inception = -0.08 (Red) to +0.08 (Green)
Row 2: VGG16 = -0.10 (Red) to +0.10 (Green)
Row 3: Mobile Net = -0.075 (Red) to +0.075 (Green)
Row 4: CNN = -0.075 (Red) to +0.075 (Green)
In Table 6, the second column depicts the GradCAM heatmap visualization on the same image; red regions mark the most important features in the classification, while pink regions have minimal impact.
LIME applies perturbations to the image and identifies pixels that contribute to
classification of each class. A collection of pixels is created to form super pixel groups and an
explainable mask is applied to interpret model predictions. Red regions indicate the least
important or negative impact on classification, while green highlights maximum positive impact
on classification. This has been illustrated in column 3.
These results on a MEL Spectrogram have been depicted in Table 6 below.

Table 6 XAI image mask comparison: SHAP, GradCAM and LIME explainability masks (columns) for InceptionV3, VGG16, Mobile Net and the custom CNN (rows)

6 Discussion
To summarize this study, we started with fake audio generation using custom GANs. The generated fake samples produced a FAD score of 23.814, within the comparison range of 0-100. This indicates roughly a 23% difference between the real and fake sample distributions. Our FAD score shows that moderate-to-good quality fake audio is being generated.
The generated fakes were then combined with the fake samples in the FoR dataset and passed to various deep learning models: a custom CNN, VGG16, InceptionV3 and Mobile Net. At 93.37% accuracy, VGG16 outperforms the others due to transfer learning of weights from the ImageNet dataset to spectrograms.
The classification of these models was explained using the XAI techniques LIME, SHAP and GradCAM. SHAP highlighted a range of values from approximately -0.10 to +0.10 for every frequency band or pixel of the MEL Spectrogram. LIME generated results along similar lines, drawing Regions of Interest (ROI) boxes around the most important frequency bands. GradCAM generated a heatmap in which red regions indicated maximum impact, while pink regions indicated minimal impact on the current class segregation.
Our qualitative study using LIME, GradCAM and SHAP highlighted that different models focus on different parts of the spectrogram while making predictions. This is an important conclusion when interpreting the results of this study. These findings suggest that different models may be more appropriate depending on the specific requirements of a given application.
Table 7 highlights the contributions of researchers in deepfake and XAI domain, whilst
comparing our contributions. We build upon previous work, to use GANs for augmenting the
novel FoR dataset, compare 4 deep learning models and contrast 3 XAI models. Our analysis
has been summarized as follows.

Table 7 Previous work comparison

1. Detecting Deepfake Voice using Explainable Deep Learning Techniques by (Mossad, ElNainay and Torki, 2018)
Previous findings: Taylor decomposition was applied to the ASV Spoof and LJSpeech datasets; ASV Spoof has 8076 fake audio samples. CNN and LSTM models were compared using frequency-band analysis of MEL Spectrograms. The LRP XAI method was used to interpret results qualitatively through heatmaps.
Our contributions: Use of the 2 GB FoR dataset, with 195,000 audio samples, which is larger than ASV Spoof. Compared CNN, VGG16, Mobile Net and InceptionV3 to confirm the advantage of pre-trained model weights over the LSTM architecture. Compared the XAI results of LIME, SHAP and GradCAM through frequency regions of interest (ROI).

2. Voice Impersonation Detection using LSTM-based RNN and Explainable AI by (Lim, Chae, and Lee, 2022)
Previous findings: Use of CNN-LSTM based models on the FoR dataset. Cepstral frequency scores (MFCC coefficients) were explained for real and fake audio through LIME models.
Our contributions: Built upon the findings of this study by using VGG16, Mobile Net and InceptionV3 in addition to a CNN model. Frequency bands in the MEL Spectrogram highlighted differences in the explainability masks of LIME, SHAP and GradCAM.

3. Explaining Deep Learning Models for Spoofing and Deepfake Detection with SHAPley Additive Explanations by (Kawshik, Rahim, Parizat, Noor and Jannah, 2021)
Previous findings: An open-source project, based on the ASV Spoof dataset, explained the Shapley values for deepfake detection through MEL Spectrograms. Its future work outlined the exploration of SHAP values to study classifier performance on low-level spectro-temporal intervals.
Our contributions: Studied SHAP results in the range of -0.08 to 0.10 for 4 deep learning models on the FoR dataset. Explored the relation between waveform models of Spectrograms and SHAP values. High noise application by GANs led to distorted MEL waveforms, causing a negative impact on detection, i.e., large negative SHAP values.

7 Conclusion
This research demonstrates the feasibility of using custom DCGANs to generate convincing fake audio samples and the potential for such samples to deceive classification models. The FAD metric increases confidence in the generated fake samples. Furthermore, by incorporating XAI techniques such as LIME, SHAP and GradCAM, intuition is gained into the prediction process of the classifiers. This is useful in identifying and addressing vulnerabilities in decision-making processes. These findings are useful in developing more accurate and efficient models for audio classification, and they are applicable to interpreting the behavior of deep learning models in a transparent and convenient manner.
Overall, our study highlights the need for continued research and development of more
robust and secure deep learning models, particularly in audio and speech recognition, where the
potential for malicious use of fake samples is high.

8 Future work
In our research, only 128-256 fake samples were generated by DCGANs. Thus, we need to raise
the number of samples to 16,000 to satisfy the requirements to compute Fréchet Inception
Distance (FID) scores. This score can add value towards comparing GANs with other enhanced
XAI-GANs models using methods like saliency, fidelity, etc. Additionally, the results provided by XAI in Table 6 can be enhanced by quantifying the explanations with specific metrics, such as the area of the LIME ROI, the pixel overlap between real and fake masks, or other mathematical interpretations of the images.
Future work involves exploring different deep learning architectures and XAI techniques,
as well as evaluating model performances on various types of audio data. Additionally,
investigating the impact of different data pre-processing techniques on the FoR dataset can provide insight into the performance of new models.

Declarations
The authors of this research assert no conflict of interest in publication. This is a solemn
declaration stating no personal agenda or circumstances that may be alleged as
misappropriating, misguiding, or maligning the depiction or analysis of results in this research.

There are no financial or non-financial competing interests in the publication of this work.

Acknowledgements

This project was completed under the supervision of our guide, in accordance with final year project
requirements of BTech CSE degree at our college in India. A special mention to researchers and scientists
across the globe, who provided us with access to their datasets and tips on how to proceed with this novel
project. We also thank our friends and family, without whom no project is ever a success.

References
[1] Agarwal, S., Farid, H., El-Gaaly, T. and Lim, S. N.: Detecting deep-fake videos from
appearance and behavior, In: Proc. Of IEEE international workshop on information forensics
and security (WIFS), pp. 1-6 (2020).
[2] Müller, N. M., Czempin, P., Dieckmann, F., Froghyar, A., and Böttinger, K.: Does audio
deepfake detection generalize? arXiv preprint, arXiv:2203.16263 (2022).
[3] Lyu, S., Deepfake detection: Current challenges and next steps. In: Proc. Of 2020 IEEE
international conference on multimedia & expo workshops (ICMEW), London, UK, pp. 1-6,
(2020).
[4] Goodfellow I., Jean, PA., Mehdi, M., Bing, X., David, WF., Sherjil, O., Aaron, C. and
Yoshua, B.: Generative adversarial networks., Commun. ACM, 63(11), pp. 139-144 (2020).
[5] Eiter, T. and Mannila, H.: Computing discrete Fréchet distance, technical report CD-TR
94/64, Technische Universitat Wien (1994).
[6] Došilović, F.K., Brčić, M., and Hlupić, N.: Explainable artificial intelligence: A survey. In:
Proc. Of 41st International convention on information and communication technology,
electronics, and microelectronics (MIPRO), Opatija, Croatia, pp. 0210-0215 (2018).
[7] Guo, W.: Explainable artificial intelligence for 6G: Improving trust between human and
machine, IEEE Communications Magazine, 58(6), pp. 39-45 (2020).
[8] Salih, A., Raisi-Estabragh, Z., Galazzo, I. B., Radeva, P., Petersen, S.E., Menegaz G. and
Lekadir, K.: Commentary on explainable artificial intelligence methods: SHAP and LIME,
arXiv preprint arXiv:2305.02012 (2023).
[9] Lujain I., Mesinovic, M., Yang, K. W., and Eid, M. A.: Explainable prediction of acute
myocardial infarction using machine learning and shapley values, IEEE Access, (8), pp.
210410-210417, 2020.
[10] Ramprasaath, S. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D. and Batra, D. Grad-
cam: Visual explanations from deep networks via gradient-based localization. In: Proc. of
IEEE international conference on computer vision (ICCV), Venice, Italy, pp. 618-626
(2017).
[11] Santiago, P., Bhattacharya, G., Yeh, C., Pons, J. and Serrà, J.: Full-band general audio
synthesis with score-based diffusion. In: Proc. Of IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, pp. 1-5 (2023).
[12] Yi, J., Fu, R., Tao, J., Nie, S., Ma, H., Wang, C., Wang, T., Tian, Z., Bai, Y., Fan, C., and
Liang, S.: Add 2022: the first audio deep synthesis detection challenge. In: Proc. Of IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore,
pp. 9216-9220 (2022).
[13] Wijethunga, RLMAPC., Matheesha, DMK., Noman, AA., De Silva, KHVTA., Tissera, M.,
and Rupasinghe, L.: Deepfake audio detection: a deep learning-based solution for group
conversations. In: Proc. Of 2nd International Conference on Advancements in Computing
(ICAC), Malabe, Sri Lanka, (1), pp. 192-197, (2020).
[14] Li, M., Ahmadiadli, Y., and Zhang, X.P.: A Comparative Study on Physical and Perceptual
Features for Deepfake Audio Detection, In: Proc. of the 1st International Workshop on
Deepfake Detection for Audio Multimedia, ACM, Lisbon, Portugal, pp. 35-41 (2022).

[15] Metehan, Y., Kantharaju, P., Disch, P., Niedermeier, A., Escalante-B, A. N., and
Morgenshtern, V. I.: Fricative phoneme detection using deep neural networks and its
comparison to traditional methods., In: Proc. Of Interspeech, Husova, Czechia, pp. 51-55,
2021.
[16] Shadle, C. H., and Mair, S. J.: Quantifying spectral characteristics of fricatives. In
Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP'96
(3), pp. 1521-1524). IEEE, (1996).
[17] Logan, B.: Mel frequency cepstral coefficients for music modelling, Ismir, 270(1), p. 11,
(2000).
[18] Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., and Liu, C.: A survey on deep transfer
learning. In: Proc. Of 27th International Conference on Artificial Neural Networks and
Machine Learning (ICANN), Rhodes, Greece, Part III 27, pp. 270-279, 2018.
[19] Arik, S., Chen, J., Peng, K., Ping, W., and Zhou, Y.: Neural voice cloning with a few
samples, Advances in neural information processing systems, 31 (2018).
[20] Bisman, B., Yamagishi, J., King, S., and Li, H.: An overview of voice conversion and its
challenges: From statistical modeling to deep learning, IEEE/ACM Transactions on Audio,
Speech, and Language Processing, 29, pp. 132-157 (2020).
[21] Chen, S., Ren, K., Piao, S., Wang, C., Wang, Q., Weng, J., and Mohaisen, A.: You can hear
but you cannot steal: Defending against voice impersonation attacks on smartphones. In:
Proc. 2017 IEEE 37th International Conference on Distributed Computing Systems
(ICDCS), Atlanta, USA, pp. 183-195 (2017).
[22] Suwajanakorn, S., Seitz, S.M., and Kemelmacher-Shlizerman, I.: Synthesizing obama:
learning lip sync from audio, ACM Transactions on Graphics (ToG), 36(4), pp. 1-13,
(2017).
[23] Carson-Berndsen, J.: Phonological processing of speech variants. In: Proc. Of 13th
International Conference on Computational Linguistics (COLING), Toronto, Canada, 3,
(1990).
[24] Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Chen, Z., Zang, Y., Wang, Y.,
Skerrv-Ryan, RJ., Saurous, RA., Agiomvrgiannakis, Y., Wu Y, and Yang, Z.: Natural tts
synthesis by conditioning wavenet on mel spectrogram predictions. In: Proc. Of IEEE
international conference on acoustics, speech and signal processing (ICASSP), Calgary,
Canada, pp. 4779-4783 (2018).
[25] Subhash, S., Srivatsa, P.N., Siddesh, S., Ullas, A. and Santhosh, B.: Artificial intelligence-
based voice assistant. In: Proc. Of 2020 Fourth world conference on smart trends in systems,
security and sustainability (WorldS4), London, UK, pp. 593-596 (2020).
[26] Donahue, C., McAuley, J. and Puckette, M.: Adversarial audio synthesis, arXiv preprint
arXiv:1802.04208 (2018).
[27] Jones, V.A.: Artificial intelligence enabled deepfake technology: The emergence of a new
threat, Doctoral dissertation, Utica College (2020).
[28] Chintha, A., Thai, B., Sohrawardi, S.J., Bhatt, K., Hickerson, A., Wright, M. and Ptucha, R.:
Recurrent convolutional structures for audio spoof and video deepfake detection, IEEE
Journal of Selected Topics in Signal Processing, 14(5), pp. 1024-1037, (2020).
[29] Su, Y., Xia, H., Liang, Q., and Nie, W.: Exposing DeepFake videos using attention based
convolutional LSTM network, Neural Processing Letters, 53, pp. 4159-4175, (2021).
[30] Lewis, J.K., Toubal, I.E., Chen, H., Sandesera, V., Lomnitz, M., Hampel-Arias, Z., Prasad,
C., and Palaniappan, K., Deepfake video detection based on spatial, spectral, and temporal
inconsistencies using multimodal deep learning. In: Proc. Of 2020 IEEE Applied Imagery
Pattern Recognition Workshop (AIPR), Washington DC, USA, pp. 1-9, 2020.
[31] Singh, A.K. and Singh, P., Detection of ai-synthesized speech using cepstral & bispectral
statistics, In: Proc. Of 4th International Conference on Multimedia Information Processing
and Retrieval (MIPR), Tokyo, Japan, pp. 412-417 (2021).
[32] Rana, M.S., and Sung, A.H.: Deepfakestack: A deep ensemble-based learning technique for
deepfake detection, In: Proc. 2020 7th IEEE international conference on cyber security and

cloud computing (CSCloud)/2020 6th IEEE international conference on edge computing and
scalable cloud (EdgeCom), New York, USA, pp. 70-75 (2020).
[33] Chugh, K., Gupta, P., Dhall, A., and Subramanian, R.: Not made for each other-audio-visual
dissonance-based deepfake detection and localization, In: Proc. of the 28th ACM
international conference on multimedia, Seattle, USA, pp. 439-447 (2020).
[34] Yu, X.: Towards Explainable Generative Adversarial Networks, Master’s thesis, University
of Waterloo (2022).
[35] Dhariwal, P. and Nichol, A.: Diffusion models beat gans on image synthesis, Advances in
Neural Information Processing Systems, 34, pp. 8780-8794 (2021).
[36] Huang, CY., Lin, YY., Lee, HY., and Lee, L.S.: Defending your voice: Adversarial attack
on voice conversion, In: Proc. of 2021 IEEE Spoken Language Technology Workshop
(SLT), Shenzhen, China, pp. 552-559. IEEE (2021).
[37] Kim, J., and Park, H.: Limited Discriminator GAN using explainable AI model for
overfitting problem, ICT Express, 9(2), pp. 241-246 (2023).
[38] Guillaume, J., Simon, L., and Jurie, F.: Diffusion models for counterfactual explanations, In:
Proc. of the Asian Conference on Computer Vision (ACCV), Macau, China, pp. 858-876,
(2022).
[39] Becker, S., Ackermann, M., Lapuschkin, S., Müller, K. R., and Samek, W.: Interpreting and
explaining deep neural networks for classification of audio signals, arXiv preprint,
arXiv:1807.03418 (2018).
[40] Jindal, S., and Jindal, N.: Comparative Study of GANs Available for Audio Classification.,
In: Proc. of the International Conference on Paradigms of Computing, Communication and
Data Sciences (PCCDS), Springer, Singapore, pp. 901-908 (2021).
[41] Kawa, P. Plata, M., and Syga, P.: Attack Agnostic Dataset: Towards Generalization and
Stabilization of Audio DeepFake Detection., arXiv preprint arXiv:2206.13979, (2022).
[42] Gong, Y., Yang, J., Huber, J., MacKnight, M., Poellabauer, C.: ReMASC: realistic replay
attack corpus for voice controlled systems., arXiv preprint arXiv:1904.03365 (2019).
[43] Nagrani, A., Chung, J.S., and Zisserman, A.: VoxCeleb: a large-scale speaker identification
dataset, arXiv preprint, arXiv:1706.08612 (2017).
[44] Khalid, H., Tariq, S., Kim, M. and Woo, S. S.: FakeAVCeleb: A novel audio-video
multimodal deepfake dataset, arXiv preprint, arXiv:2108.05080 (2021).
[45] Reimao, R. and Tzerpos, V.: FoR: a dataset for synthetic speech detection. In Proc. of 2019
International Conference on Speech Technology and Human–Computer Dialogue (SpeD),
Timisoara, Romania, pp. 1–10, (2019).
[46] Kilgour, K., Zuluaga, M., Roblek, D., and Sharifi, M.: Fréchet Audio Distance: A Metric for
Evaluating Music Enhancement Algorithms, arXiv preprint arXiv:1812.08466 (2018).
[47] Mossad, O.S., ElNainay, M., and Torki, M.: Deep convolutional neural network with multi-
task learning scheme for modulations recognition., In: Proc. of 2019 15th International
Wireless Communications & Mobile Computing Conference (IWCMC), Tangier, Morocco,
pp. 1644-1649 (2019).
[48] Lim, S.Y., Chae, D.K., and Lee, S.C.: Detecting Deepfake Voice Using Explainable Deep
Learning Techniques, Applied Sciences, 12(8), pp. 3926 (2022).
[49] Kawshik, B., Rahim, A., Parizat, P. S., Noor, MAU., and Jannah, M.: Voice impersonation
detection using LSTM based RNN and explainable AI., Doctoral dissertation, Brac
University (2021).
