0% found this document useful (0 votes)
15 views13 pages

Audio Deepfake (Camera Ready Paper)

This paper presents a neural network-based system for detecting audio deepfakes using machine learning and deep learning techniques, particularly focusing on Mel-frequency Cepstral Coefficients (MFCC) for feature extraction. The proposed model demonstrates superior accuracy in classifying audio as real or fake compared to traditional CNN models, achieving an accuracy of 80.7%. The research includes a comprehensive analysis of existing literature, data preprocessing methods, and various feature extraction techniques to enhance the robustness and effectiveness of deepfake detection.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views13 pages

Audio Deepfake (Camera Ready Paper)

This paper presents a neural network-based system for detecting audio deepfakes using machine learning and deep learning techniques, particularly focusing on Mel-frequency Cepstral Coefficients (MFCC) for feature extraction. The proposed model demonstrates superior accuracy in classifying audio as real or fake compared to traditional CNN models, achieving an accuracy of 80.7%. The research includes a comprehensive analysis of existing literature, data preprocessing methods, and various feature extraction techniques to enhance the robustness and effectiveness of deepfake detection.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

A Neural Network-Based Deepfake Detection System for Audio Clips

1
Rahul Saha 2Deep Sengupta and 3Anupam Mondal
1,2,3
Computer Science and Engineering, Institute of Engineering &
Management, University of Engineering & Management, Kolkata,
700091, West Bengal, India

[email protected],[email protected],
[email protected]

Abstract: Deepfake detection using machine learning and deep learning


is a rapidly growing field where machine learning algorithms are used to
distinguish between real and fake content. Applications of Audio
Deepfakes(AD) range from audiobook enhancement to public safety
threats. This paper will provide a study of ways to identify and
overcome AD using a combination of machine learning(ML),mainly
deep learning (DL) with convolution neural network. This research
covers many areas of depth aspects, focusing on Mel-frequency
Cepstral Coefficients (MFCC) and other techniques to extract features
from an audio. Preliminary experiments on fake or real data
demonstrate the effectiveness of support vector machine (SVM) for
short words, the possibility of gradient boosting on similar data, and
the performance of the VGG-16 model.
In this study, Fake or Real (FoR) dataset is used to explore fea-
tures and image-based methods in addition to deep audio. Deep
learning, specifically Convolutional Networks, outperforms machine
learning with 80.7 percent accuracy. Compared to traditional, CNN
models such as VGG16 and XceptionNet, the proposed model shows
greater accuracy in classifying sounds as false or real. We conduct a
comprehensive review of the existing literature, including numerical
analysis, simulated and synthetic AD, and quantitative comparisons of
detection methods.

Keywords: Audio Deepfakes (ADs), Machine Learning (ML), Deep


Learning (DL), Mel-frequency Cepstral Coefficients, Convolution
Neural Network

1 Introduction

Deepfake discovery utilizing machine learning and profound learning may


be a quickly developing field where counterfeit insights and machine
learning calculations are to identify fake substance. Applications of Sound
Deepfakes (Advertisement) run from audiobook improvement to open
security dangers. This paper will give a study of ways to overcome
Advertisement employing a combination of machine learning (ML),
1
profound learning (DL), and other strategies. This inquire about covers
numerous ranges of profundity viewpoints, centering on Mel-frequency
Cepstral Coefficients (MFCC) and other strategies to extricate highlights
from an sound. Preparatory tests on fake or genuine information illustrate
the viability of bolster vector machine (SVM) for words, the plausibility
of slope boosting on comparable information, and the execution of the
VGG-16 model.
In this consider, Fake or Genuine (FoR) dataset is utilized to investigate
features and image-based strategies in expansion to profound sound.
Profound learning, particularly Convolutional Systems, beats machine
learning with 80.7 percent exactness. Compared to conventional, CNN
models such as VGG16 and XceptionNet, the proposed demonstrate
appears more noteworthy exactness in classifying sounds as wrong or
genuine. We conduct a comprehensive survey of the existing writing,
counting numerical investigation, recreated and manufactured
Advertisements, and quantitative comparisons of discovery methods.

2 Literature Survey

This work employments machine learning and profound learning,


particularly Mel Recurrence Cepstral Coefficient (MFCC), to distinguish
profound sounds in the Fake or Genuine dataset. Exploratory comes about
appear that Back Vector Machines (SVM) perform well on introduction to
brief sounds, angle boosting performs well for ancient information, and
the VGG-16 demonstrate performs way better in other cases, particularly
on crude information[1].This work presents a profound learning strategy
to recognize deepfake interactive media substance by analyzing varying
media likenesses and feelings in recordings. The plan propelled by
Siamese engineering and triple misfortune outflanks the state-of-the-art
strategy by accomplishing a great single video AUC score of 84.4 percent
on DFDC and 96.6 percent on the DF- TIMIT dataset. and spearheaded
the sound. Video and visualization for more profound information
disclosure[2]. This think about addresses two vision and hearing dangers
in profundity, proposing a common look operation that leads to a common
combination of these models. These tests appear the finest execution and
detail compared to the preparation demonstration itself, highlighting the
significance of visual and sound-related judgment in profound
investigation[3].This inquiry centers on utilizing untrue or genuine data
(FoR) created by progressed text-to-speech models to bargain with the
danger of voice communication. Two strategies based on visuals and
pictures were explored for a profound voice look. The proposed show
appears to have more noteworthy precision in classifying sounds as fake
or genuine compared to conventional CNN models such as VGG16 and
XceptionNet[4].This article presents the Sound Profound Amalgamation
Location Challenge (Include) 2022, which addresses distinctive real-life
and complex circumstances for profound sound discovery. This challenge
incorporates three strategies: Idealize Information Revelation (LF),
Idealize Information Disclosure (PF), and Idealize Information Disclosure
(FG). This article professional- vides an diagram of information,
measurements, and methods and highlights later progresses and

2
discoveries within the field of profound dialect search[5].This article gives
an outline of sound deepfakes (Advertisement) and the conceivable
outcomes for proceeded advancement of discovery strategies when there
are concerns around their affect on open security. It analyzes existing
machine learning and profound learning strategies, compares sound
information blunders, and distinguishes vital trade-offs between exactness
and measurement methods. This review highlights the need for further
research to resolve these inconsistencies and suggests potential guide-
lines for more robust AD detection models, particularly in addressing
noise and the sound of the world[6]. This study evaluates various CNN
architectures for deep sound detection, including concepts such as size,
technique, and accuracy. The customized architecture of Malik et al. The
most accurate is 100 percent. But it seems to depend on the context,
indicating the need for different architectures. Experiments with different
sound representations demonstrate the consistency of the customized
model. Although these standards lag behind legal standards, they have led
the development of quality standards that will meet legal restrictions while
solving deep-rooted problems[7]. This study addresses the threat of misuse
of synthetic speech by arguing that the real voice is different from the
synthetic voice in group discussion. The system uses deep neural networks
combining negative speech, speech binarization, and CNN-based methods
to achieve high accuracy and effective speech analysis[8].This study
demonstrates the application of securing transmission points using manual
and automatic extraction. It uses CNN for histogram analysis and shows
the performance evaluation of the model-based application and Deep
Voice speech recognition[9]. This paper solves the problem of detecting
deep voice spoofs, using the ASVspoof dataset, and combining data
enhancement and hybrid feature extraction. The proposed model adopts
LSTM in the backend, uses MFCC + GTCC and SMOTE, and achieves
99 percent testing accuracy and 1.6 percent EER performance on
ASVspoof 2021 deepfake sections. Noise evaluations and experiments on
the DECRO dataset further demonstrate the effectiveness of the
model[10]. Motivation: This paper demonstrates the challenges of XAI
image classification by focusing on the quality score of similar objects.
Recognize the difference between human and machine understanding and
the impact on interpreting XAI output[11]. Audio analysis by Fourier
transform becomes a spectrogram, converting the audio signal into a visual
form. Analysis of melspectrograms for audio, providing visual and audio
interpretation using scores based on spectrogram scores[12]. Model:
Emphasis on simplicity rather than realism, using general communication
techniques for speech recognition[13]. Explainable Artificial Intelligence
(XAI): Introduce the XAI method based on Taylor decomposition, use of
gradient integration to judge the accuracy of the relationship[14]. Speech
Generation: Describe the Griffin-Lim algorithm for generating voice from
scores of spectrogram scores, for simplicity and ability to influence the
factor properties of the voice even without perfect segmentation[15].
Understanding with Humans: Inspired by XAI’s human assistance in
visual processing, this paper explores the classification of scores in
complex language, absorption detection and comparison with
spectrogram-based audio[16].
3
3 Analysis

The analysis section of an audio deepfake detection paper focuses on


evaluating the proposed system’s performance, robustness, and
generalization capabilities. The system is typically assessed on diverse
datasets comprising real audio from sources like LibriSpeech or VCTK
and synthetic audio generated by techniques such as WaveNet,
Tacotron, or Deep Voice. Additionally, Equal Error Rate (EER) may
be employed, particularly for applications involving biometric
security. To better understand the system’s capabilities, experiments
are conducted under varying conditions, such as the presence of noise,
compression artifacts, or audio generated by unseen synthesis models.
Performance evaluation involves comparing the proposed model which
uses the basic or the best method, showcasing improvements in
detection rates and robustness to challenging scenarios like noisy
environments or low-quality audio. Ablation studies examine the
impact of different components, such as feature types (e.g., MFCCs,
spectrograms) and model architectures, on the overall accuracy.
Sensitivity analysis evaluates the model’s reliability under specific
conditions, including varying signal-to-noise ratios or different
deepfake generation techniques. Cross-dataset testing is important for
understanding generalization capabilities, which guarantees that the
model performs well on unseen datasets or diverse speaker profiles.
Additionally, adversarial robustness is analyzed to determine how
resistant the system is to carefully designed attacks aimed at bypassing
detection. The interpretability of the model is explored by identifying
features or patterns that differentiate real from fake audio. Error
analysis highlights common failure cases, such as low-quality real
audio misclassified as fake or high-quality synthetic audio passing as
real, providing a foundation for further refinement. Computational
efficiency is also considered, with an evaluation of the system’s
runtime and resource requirements for both training and inference,
ensuring feasibility for real-time applications. Finally, the analysis
emphasizes the broader implications of the findings, including
potential forensic applications and limitations, while addressing ethical
considerations to prevent misuse of the detection technology.
Visualizations like ROC curves, confusion matrices, and spectrogram
comparisons, along with detailed tables of results, support the analysis,
offering a comprehensive understanding of power the system and the
drawbacks which can be improved.

4
4 Data Preprocessing

One of the important step in this data preprocessing, an effective audio


deepfake detection system, which ensures the dataset is maintains
cleanliness, and is standardized, and is prepared for feature extraction
and model training. The dataset, has many real and fake audio samples
estimated about over 195000 which serves as a important resource for
training classifiers to detect fake audio. This dataset includes audio
generated using advanced technique for text-to-speech which include
systems like Google Wavenet and Deep Voice 3, as well as a variety
of genuine human recordings. The dataset is provided in different
variations: pure, type0, type1(for-2sec), and type2(for-rerec). There are
several files in original version ,as they were originally extracted, while
for-norm offers standardized versions of the same files with consistent
sampling rates, volumes, and channels to ensure class and gender
balance. The for-2sec version truncates the audio files to two seconds,
making it suitable for analyzing short utterances, while the forrerec
version re-records the truncated files through a voice channel to
simulate real-world attack scenarios. Despite its utility, the FoR dataset
suffers from issues such as duplicate files, 0-bit files, and varying bit
rates, which can adversely affect machine learning model performance.
To address these challenges, preprocessing steps are employed to
remove duplicates and 0-bit files that do not contribute meaningful
information to the training process. Additionally, audio signals are
standardized by normalizing bit rates and applying zero-padding to
ensure that all waveforms conform to an operationally viable length of
16,000 samples, which aligns with the requirements of TensorFlow’s
audio signal processing library. These preprocessing steps not only
improve data quality but also enhance the uniformity of the input
features, enabling more accu- rate and robust model training. Properly
preprocessed data ensures that the deepfake detection system can
effectively generalize across diverse audio formats and scenarios,
providing a strong foundation for detecting synthetic speech in real-
world applications.

5
5 Feature Extraction

One of the important step in building a robust deep voice detection


system is feature extraction as it converts the audio signal into a
meaningful representation that can be used by machine learning (ML)
and deep learning (DL) models. Audio signals are complex and have
many time and frequency characteristic features, so special selection is
important to distinguish the real voice from the synthetic one.
Traditional extraction methods such as Mel frequency cepstral
coefficients (MFCC), spectral roll-off, zero-crossing rate, and
chromaticity features are widely used due to their ability to capture
important features of Speech and Tone behavior. This feature can
identify the difference between the audio and acoustic features of
deepfake voices. Pre-trained neural networks (e.g., convolutional and
recurrent architectures) can produce high resolution images that cover
subtle structures introduced during synthesis, such as inconsistencies

in the audio, spectral patterns, and temporal dynamics. The distribution


pipeline combines traditional classification techniques with today’s
deep learning techniques to discover hidden patterns and anomalies in
heterogeneous data and hybrid methods. These two approaches enable
the system to be widely used for many types of tasks, from text-to-
speech (TTS) systems to voice-based artificial intelligence (GAN). The
effectiveness of the results removal directly affects the detection
performance, making it an important basis for combating th e
increasing deep-voice threat in the fields of security, forensics, and
public safety.
5.1 a. Chroma STFT (Short-Time Fourier Transform)
Chroma STFT. uses short time Fourier transform to calculate
chromatic properties. STFT means information about tone and
signal pattern distribution. It depicts peaks with high value. Figure
5.1 a. shows a spectrogram representation that uses color to show
the intensity (or loudness) of sound at different frequencies (or
tones) over time.

5.1 b. MFCC (Mel-frequency cepstral coefficients)


MFCC (Mel Frequency Cepstral Coefficient) MFCC represents the short
duration of the sound. MFCC provides valuable inputs about the frequency
value by capturing some spectral features of a sound signal. In general the
first few coefficients (e.g. MFCC1, MFCC2, etc.) contain the most
important information for the classification function. Figure 5.1 b. Show a
spectrogram representing the intensity (or loudness) of a sound using color
at each frequency (or sound) at a given time. It also shows the amplitude
or power of the MFCC coefficient at a given time. Changes in intensity
can indicate differences in the acoustic patterns between the actual sound
and the noise.

6
5.1 c. RMS (Root Mean Square)
Root mean square gives the measure of the amplitude’s average for a
signal. Its value represents, the signal amplitude over a given period.
RMS is often used for tasks such as recognition a speech and
classification of audio. Higher RMS values indicate more energy in the
audio signal. Figure 5.1 c. shows the audio waveform with a recorded
root mean square (RMS) amplitude value of 33.4 dB, which represents
the power level’s average of the audio signal at a single time.

5.1 d. Spectral Centroid


The Spectral Centroid depicts the main center of the audio signal
power spectrum. It provides valuable inputs about the brightness of
a given audio signal. The spectral center of gravity is a measure of
the frequency distribution in an given audio signal, with higher
values indicating greater detail. Figure 5.1 d can be considered a
representation of the brightness or trueness of the sound..

5.1 e. Spectral Bandwidth


Spectral bandwidth measures the width of a few frequencies where
most of the spectrum energy is concentrated. It provides valuable
inputs about the frequencies in an audio signal. Higher spectral
bandwidth values indicate greater content in the signal. In Figure
5.1 e, the upper graph plots the spectral bandwidth over time in
Hertz (Hz) and shows how the transmission frequency changes as
the signal changes. In the movie below, the spectral center is marked
and the shadow or curve around it indicates the spectral bandwidth.
This represents the frequencies of different range that make up the
signal's power at any given time.

5.1 f. Zero Crossing Rate (ZCR)


ZCR helped to measure the rate at which audio signal changed its sign. It
is a simple function that provides information about the number of
waveform discontinuities in the audio signal. ZCR is used primarily for
speech and music analysis. A higher ZCR value means faster changes in
the audio signal, which can indicate certain features such as touch sounds
or speech segments. Time passes over the horizontal axis (zero amplitude).

7
Fig 5.1 a Fig 5.1 b

Fig 5.1d -
Fig 5.1 c

Fig 5.1 e Fig 5.1 f

8
6 Proposed Methodology
A proposed method for deep sound detection using CNN, this work
presents a design and generalization for deep sound detection using
network neural networks (CNN). This approach focuses on using
artificial intelligence techniques and deep learning to improve the model,
for which it will be able to identify and use real data. Each step of the
scheme is explained below: Extract important features of the sound to
prepare the sound data for depth perception. Mel frequency cepstral
coefficients (MFCCs) are considered as important feature as it can easily
capture the timbre and spectral features of the signal. Using the Librosa
library, a 40-dimensional MFCC is calculated for each audio sample. To
ensure consistency, the average MFCC is calculated across all frames and
each model is reduced to a full-length feature vector. Zero mean and unit
variation increase the stability and convergence of CNN model training.
Use data augmentation to have diversity and robustness to the capacity
and size of deep data archives. The following improvements have been
made: Change the pitch of the audio signal without changing its structure.
Add noise to simulate real voice. CNN architecture is designed to
generate MFCC feature vectors and classify true or false messages. The
main components of this model include many layer which accepts 40-
dimensional MFCC feature vectors. The relationship between models is
based on the ReLU function. The output of the neuron is deep. The
training materials are divided into training, validation and test files.
Failure depicts Binary cross entropy is used as it measures the error in
binary problems. Gradient-based optimization. The sample size is 32 to
ensure computational efficiency while maintaining sufficient gradient
tuning. Another independent evaluation of the training model to assess
its effectiveness. Key performance indicators include: Actually: all
examples excluded. Average of the consistency between the truth and the
reward, indicating equality of the two measurements.

Fig 6.1 Flowchart depicting the workflow


Deploy CNN learning models, used in environments with TensorFlow
dependencies. The transmission process completes the audio data by
extracting MFCC features and combining them into models for
projection. The developing data and technology provide a stable
foundation for solving the growing deep voice search problem by
combining high-quality results, advanced CNN architectures and a
rigorous approach. High performance has been achieved while the system
is seeking further research and improvement.
9
Convolution Neural Network
In our approach to audio deepfake detection, we utilize a Convolu-
tional Neural Arrange (CNN), a capable profound learning design
broadly recognized for its capacity to extricate spatial and transient
highlights from information. CNNs are especially successful in
analyzing spectrograms, which speak to sound signals in a visual frame,
empowering the demonstrate to memorize intricate patterns and
irregularities within the recurrence and time spaces that recognize
genuine sound from engineered deepfakes.
The CNN in our approach comprises of different layers, each playing a
particular part in highlight extraction and classification. Convolutional
layers shape the center of the arrange, applying channels to the input
information to identify neighborhood designs such as changes in
recurrence and adequacy. These layers capture basic low-level features,
such as edges or recurrence shifts, within the beginning stages and
steadily learn more complex structures as the organize develops.
Pooling layers (e.g., max-pooling) take after the convolutional layers
to diminish the spatial dimensions of the include maps, subsequently
minimizing computational complexity whereas holding the foremost
data.

Fig 6.2 Layers of convolution


To improve the network capacity to generalize over differing inputs,
we utilize group normalization layers, which normalize the
enactments of each layer to progress preparing solidness and
convergence speed. Also, enactment capacities like ReLU
(Amended Direct Unit) present non-linearity, empowering the
organize to memorize complex highlight representations that are
pivotal for recognizing whether the given audio is real or fake.

7 Results and Discussion

CNN-based deep data forging methods have been shown to be


effective in various tests validating the model performance evaluated
on different real and interactive data Power absorption fakes. The
model performance showed an overall test accuracy of 80.7% and had

10
the area under the curve (AUC) of 0.88, demonstrating the ability of
the model to distinguish between the two classes. Furthermore, the fact
that accuracy and recall are equal increases the confidence of the model
in detecting and reducing defects. It detected 430 out of 500 fake news
samples and 415 out of 500 fake news samples. Although a small
percentage of real samples are not classified as fake samples, the false
positive rate is low. More importantly, using layers for continuous
operation and utilizing MFCC features for noise representation helps
improve the overall capacity, as evidenced by the small difference
between representation, performance, and evaluation. Figure 7 depicts
the performance of our model

Fig. 7. Model performance


This work demonstrates the ability of a CNN-based framework to
solve deep speech recognition problems. The model demonstrates
confidence in the analysis of manipulated messages by leveraging
MFCC capabilities and a well-designed architecture. When the
results are good, the method can be further improved by searching
for the best results, such as spectrograms or embedding previous
samples.

To assess the model’s performance in a broader context, we compared its


accuracy with other state-of-the-art deepfake detection models:

Model Accuracy AUC Precision Recall F1- Dataset


(%) score Used
WaveNet 74.5 0.82 0.75 0.73 0.74 Fake or
Real
(FoR)
Tacotron 78.1 0.82 0.79 0.77 0.78 Fake or
Real
(FoR)
0.80
DeepVoice 79.3 0.82 0.81 0.79 Fake or
Proposed 80.7 0.82 0.83 0.83 0.83 Fake or
CNN Real
(FoR)
Table 1. Comparative Analysis with State-of-the-Art Methods

11
In Table 1 ,shows the benchmark the model’s effectiveness, a
comparative study was conducted against state-of-the-art deepfake
detection systems, including WaveNet, Tacotron, and Deep Voice.
Table I presents the evaluation metrics for these models,
highlighting the CNN-based approach’s superiority.

8 Conclusion

This work supports Fake or Real (FoR) data for the analysis of real and
fake audio data using advanced methods such as Mel Frequency
Cepstral Coefficients (MFCC) and search machine learning and
performance depth. Future research directions emphasize exploring
advanced feature extraction techniques, such as wavelet transforms
and hybrid methods, to better preserve intricate audio patterns.
Integrating audio and visual data could enhance detection accuracy,
while addressing adversarial attacks remains crucial for ensuring
robustness in real-world scenarios. Testing models under diverse real-
world conditions—considering ambient noise, reverberation, and
varied recording equipment—will further validate their
generalizability. Continuous development of datasets that reflect
diverse accents, languages, and environments is vital to enhance
robustness. Ethical considerations, including privacy, consent, and
responsible use, are increasingly important as deepfake detection
technology evolves. Explainability techniques are critical for
transparency and understanding model decisions, while standardizing
benchmarks will ensure fair comparisons of detection methods. Human-
in-the-loop systems could improve accuracy by combining machine
intelligence with human judgment. Lastly, fostering collaboration
among researchers, industry experts, and policymakers is essential
for staying ahead of emerging challenges and advancing deepfake
detection systems.

References

[1]A. Hamza et al., "Deepfake Audio Detection via MFCC Features


Using Machine Learning," IEEE Access, vol. 10, pp. 134018-134028,
2022, doi: 10.1109/ACCESS.2022.3231480.

[2]T. Mittal et al., "Emotions Don’t Lie: An Audio-Visual Deepfake


Detection Method using Affective Cues," in Proceedings of the 28th
ACM International Conference on Multimedia (MM ’20), ACM, New
York, NY, USA, 2020, pp. 2823–2832, doi:
10.1145/3394171.3413570.

[3] Y. Zhou, S.-N. Lim, "Joint Audio-Visual Deepfake Detection," in


Proceedings of the IEEE/CVF International Conference on Computer
Vision (ICCV), 2021, pp. 14800-14809.

[4] J. Khochare et al., "A Deep Learning Framework for Audio


Deepfake Detection," in Arabian Journal for Science and Engineering,
vol. 47, no. 3, 2022, pp. 3447, doi: [journal DOI].

12
[5] J. Yi et al., "ADD 2022: the first Audio Deep Synthesis Detection
Challenge," in ICASSP 2022 - 2022 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022,
pp. 9216-9220, doi: 10.1109/ICASSP43922.2022.9746939.

[6] "Challenges A Review of Modern Audio Deepfake Detection


Methods and Future Directions," https://doi.org/10.3390/a15050155.

[7] M. Mcuba et al., "The Effect of Deep Learning Methods on


Deepfake Audio Detection for Digital Investigation," in Procedia
Computer Science, vol. 219, 2023, pp. 211-219, doi:
10.1016/j.procs.2023.01.283.

[8] R.L.M.A.P.C.Wijethunga et al., "Deepfake Audio Detection: A


Deep Learning Based Solution for Group Conversations," in 2020 2nd
International Conference on Advancements in Com- puting (ICAC),
Malabe, Sri Lanka, 2020, pp. 192-197, doi:
10.1109/ICAC51239.2020.9357161.

[9] D.M. Ballesteros et al., "Deep4SNet: deep learning for fake speech
classification," in Expert Systems with Applications, vol. 184, 2021,
115465, doi: 10.1016/j.eswa.2021.115465.

[10] N. Chakravarty and M. Dua, "Data augmentation and hybrid


feature amalgamation to detect audio deep fake attacks," in Physica
Scripta, vol. 98, no. 9, 2023.

[11] S.-Y. Lim et al., "Detecting Deepfake Voice Using Explainable


Deep Learning Techniques."

[12] "A Review of Deep Learning Based Speech Synthesis," in Appl.


Sci., 2019, 9, 4050, doi: [DOI].

[13] Y. Ren et al., "Fastspeech 2: Fast and High-Quality End-to-End


Text to Speech," arXiv, 2020, arXiv:2006.04558.

[14] J. Shen et al., "Natural Tts Synthesis by Conditioning Wavenet on


Mel Spectrogram Predictions," in IEEE, 2018, pp. 4779–4783.

[15] W. Ping et al., "Deep voice 3: Scaling text-to-speech with


convolutional sequence learning," arXiv, 2017, arXiv:1710.07654.

[16] Z. Khanjani et al., "How deep are the fakes? Focusing on audio
deepfake: A survey," arXiv, 2021, arXiv:2111.14203.

13

You might also like