Audio Deepfake (Camera Ready Paper)
1Rahul Saha, 2Deep Sengupta and 3Anupam Mondal
1,2,3
Computer Science and Engineering, Institute of Engineering &
Management, University of Engineering & Management, Kolkata,
700091, West Bengal, India
[email protected],[email protected],
[email protected]
1 Introduction
2 Literature Survey
discoveries within the field of deep speech research [5]. This article gives an overview of audio deepfakes (AD) and the prospects for continued advancement of detection methods amid concerns about their impact on public security. It analyzes existing machine learning and deep learning techniques, compares errors across audio datasets, and identifies important trade-offs between accuracy and measurement methods. The review highlights the need for further research to resolve these inconsistencies and suggests potential guidelines for more robust AD detection models, particularly in addressing noise and real-world acoustic conditions [6].

Another study evaluates various CNN architectures for audio deepfake detection, considering aspects such as model size, technique, and accuracy. The customized architecture of Malik et al. is the most accurate, reaching 100 percent, but its performance appears to depend on the context, indicating the need for different architectures. Experiments with different audio representations demonstrate the consistency of the customized model. Although these benchmarks lag behind legal standards, they have guided the development of quality standards intended to meet legal restrictions while solving deep-rooted problems [7]. A further study addresses the threat of misused synthetic speech by arguing that real voices differ from synthetic voices in group discussion; the system uses deep neural networks, combining negative speech samples, speech binarization, and CNN-based methods to achieve high accuracy and effective speech analysis [8]. Another work demonstrates fake speech detection using both manual and automatic feature extraction: it applies a CNN to histogram analysis and presents a performance evaluation of the resulting application on Deep Voice speech [9]. A related paper addresses the problem of detecting deep voice spoofs using the ASVspoof dataset, combining data augmentation and hybrid feature extraction. The proposed model adopts an LSTM back end, uses MFCC + GTCC features with SMOTE, and achieves 99 percent testing accuracy and 1.6 percent EER on the ASVspoof 2021 deepfake partition. Noise evaluations and experiments on the DECRO dataset further demonstrate the effectiveness of the model [10].

Motivation: one paper illustrates the challenges of explaining image classification with XAI by focusing on relevance scores for similar objects, recognizing the difference between human and machine understanding and its impact on interpreting XAI output [11]. Audio analysis via the Fourier transform yields a spectrogram, converting the audio signal into a visual form; mel-spectrograms are analyzed to provide both visual and auditory interpretation of relevance scores computed on the spectrogram [12]. Model: the emphasis is on simplicity rather than realism, using general-purpose techniques for speech recognition [13]. Explainable Artificial Intelligence (XAI): XAI methods based on Taylor decomposition and on integrated gradients are introduced to judge the accuracy of relevance attributions [14]. Speech generation: the Griffin-Lim algorithm is described for reconstructing audio from spectrogram-based relevance scores, valued for its simplicity and its ability to convey salient properties of the voice even without perfect reconstruction [15]. Understanding with humans: inspired by how XAI assists human visual interpretation, this work explores relevance scores for complex spoken language, deepfake detection, and comparison with spectrogram-based audio explanations [16].
3 Analysis
4 Data Preprocessing
5 Feature Extraction
5.1 c. RMS (Root Mean Square)
Root mean square (RMS) measures the average amplitude of a signal; its value represents the signal's amplitude over a given period. RMS is often used for tasks such as speech recognition and audio classification, and higher RMS values indicate more energy in the audio signal. Figure 5.1 c shows the audio waveform with a recorded RMS amplitude of 33.4 dB, which represents the average power level of the audio signal.
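The computation itself is straightforward; a minimal sketch using Librosa is shown below. The dB reference level behind the reported 33.4 dB value is not stated in the paper, so the conversion here is only illustrative, and the function name is ours:

import numpy as np
import librosa

def average_rms_db(path):
    # Return the average frame-wise RMS amplitude of an audio file, in decibels.
    signal, _ = librosa.load(path, sr=None)      # mono load at the native sampling rate
    rms = librosa.feature.rms(y=signal)[0]       # frame-wise RMS values
    return 20 * np.log10(np.mean(rms) + 1e-10)   # convert the amplitude ratio to dB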
Fig 5.1 a, Fig 5.1 b, Fig 5.1 c, Fig 5.1 d: visualizations of the extracted audio features (Fig 5.1 c shows the waveform with its RMS amplitude).
6 Proposed Methodology
This work presents the design of an audio deepfake detection method based on convolutional neural networks (CNNs) and examines its generalization. The approach applies deep learning techniques to build a model that can distinguish real audio from fake audio. Each step of the scheme is explained below.

Feature extraction prepares the audio data for the detection model. Mel frequency cepstral coefficients (MFCCs) are chosen as the principal feature because they capture the timbre and spectral characteristics of the signal. Using the Librosa library, a 40-dimensional MFCC is calculated for each audio sample. To ensure consistency, the MFCCs are averaged across all frames, reducing each sample to a fixed-length feature vector, which is then normalized to zero mean and unit variance to improve the stability and convergence of CNN training.
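A minimal sketch of this feature-extraction step, assuming mono audio files readable by Librosa (the function name and the small epsilon added for numerical safety are ours, not from the paper):

import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40):
    # Load one audio sample and return a fixed-length, normalized MFCC vector.
    signal, sr = librosa.load(path, sr=None)                     # native sampling rate, mono
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # shape: (40, frames)
    features = np.mean(mfcc, axis=1)                             # average across frames -> (40,)
    # Normalize to zero mean and unit variance for stable CNN training.
    return (features - features.mean()) / (features.std() + 1e-8)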
Data augmentation is used to add diversity and robustness given the limited size of deepfake audio archives. Two augmentations are applied: shifting the pitch of the audio signal without changing its structure, and adding noise to simulate real recording conditions.
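A sketch of the two augmentations, assuming the signal has already been loaded with Librosa (the shift of two semitones and the 0.005 noise level are illustrative values, not taken from the paper):

import numpy as np
import librosa

def pitch_shift(signal, sr, n_steps=2.0):
    # Shift the pitch by n_steps semitones without changing the signal's duration.
    return librosa.effects.pitch_shift(y=signal, sr=sr, n_steps=n_steps)

def add_noise(signal, noise_level=0.005):
    # Mix in low-level Gaussian noise to simulate real recording conditions.
    return signal + noise_level * np.random.randn(len(signal))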
The CNN architecture is designed to take the MFCC feature vectors and classify each sample as real or fake. The input layer accepts the 40-dimensional MFCC feature vector, the connections between layers use the ReLU activation function, and a single output neuron produces the deepfake decision.
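The excerpt does not give the exact layer configuration, so the following Keras sketch is only one plausible reading: a small 1-D convolutional stack over the 40-dimensional MFCC vector with ReLU activations and a sigmoid output (all layer sizes are assumptions):

from tensorflow.keras import layers, models

def build_model(n_mfcc=40):
    # 1-D CNN over the 40-dimensional MFCC vector; layer sizes are illustrative.
    return models.Sequential([
        layers.Input(shape=(n_mfcc, 1)),          # MFCC vector treated as a 1-D sequence
        layers.Conv1D(32, 3, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(64, 3, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),    # probability that the clip is fake
    ])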
training materials are divided into training, validation and test files.
Failure depicts Binary cross entropy is used as it measures the error in
binary problems. Gradient-based optimization. The sample size is 32 to
ensure computational efficiency while maintaining sufficient gradient
tuning. Another independent evaluation of the training model to assess
its effectiveness. Key performance indicators include: Actually: all
examples excluded. Average of the consistency between the truth and the
reward, indicating equality of the two measurements.
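A sketch of training and evaluation under the stated settings (binary cross-entropy, a gradient-based optimizer, batch size 32); the Adam optimizer, the epoch count, and the X_/y_ arrays holding the MFCC vectors and labels are assumptions, and build_model refers to the architecture sketch above:

from tensorflow.keras.metrics import AUC

model = build_model()
model.compile(optimizer="adam",                    # gradient-based optimization (Adam assumed)
              loss="binary_crossentropy",          # error measure for the binary problem
              metrics=["accuracy", AUC(name="auc")])

# X_train, X_val, X_test: MFCC vectors reshaped to (n_samples, 40, 1); y_*: binary labels.
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=30,                     # epoch count not stated in the paper
                    batch_size=32)                 # batch size of 32 as described

test_loss, test_acc, test_auc = model.evaluate(X_test, y_test)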
The model achieved an area under the curve (AUC) of 0.88, demonstrating its ability to distinguish between the two classes. Furthermore, the fact that accuracy and recall are equal increases confidence in the model's ability to detect fakes while limiting errors. It correctly detected 430 out of 500 fake audio samples and correctly classified 415 out of 500 real audio samples. Although a small percentage of real samples were misclassified as fake, the false-positive rate is low. More importantly, the use of convolutional layers together with MFCC features as a robust representation helps improve the model's overall generalization capacity, as evidenced by the small gap between training, validation, and test performance. Figure 7 depicts the performance of our model.
To benchmark the model's effectiveness, a comparative study was conducted against state-of-the-art deepfake detection systems, including WaveNet, Tacotron, and Deep Voice. Table 1 presents the evaluation metrics for these models, highlighting the CNN-based approach's superiority.
8 Conclusion
This work uses the Fake or Real (FoR) dataset for the analysis of real and fake audio, employing features such as Mel Frequency Cepstral Coefficients (MFCC) together with machine learning and deep learning methods. Future research directions emphasize exploring
advanced feature extraction techniques, such as wavelet transforms
and hybrid methods, to better preserve intricate audio patterns.
Integrating audio and visual data could enhance detection accuracy,
while addressing adversarial attacks remains crucial for ensuring
robustness in real-world scenarios. Testing models under diverse real-
world conditions—considering ambient noise, reverberation, and
varied recording equipment—will further validate their
generalizability. Continuous development of datasets that reflect
diverse accents, languages, and environments is vital to enhance
robustness. Ethical considerations, including privacy, consent, and
responsible use, are increasingly important as deepfake detection
technology evolves. Explainability techniques are critical for
transparency and understanding model decisions, while standardizing
benchmarks will ensure fair comparisons of detection methods. Human-
in-the-loop systems could improve accuracy by combining machine
intelligence with human judgment. Lastly, fostering collaboration
among researchers, industry experts, and policymakers is essential
for staying ahead of emerging challenges and advancing deepfake
detection systems.
References
[5] J. Yi et al., "ADD 2022: the first Audio Deep Synthesis Detection
Challenge," in ICASSP 2022 - 2022 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022,
pp. 9216-9220, doi: 10.1109/ICASSP43922.2022.9746939.
[9] D.M. Ballesteros et al., "Deep4SNet: deep learning for fake speech
classification," in Expert Systems with Applications, vol. 184, 2021,
115465, doi: 10.1016/j.eswa.2021.115465.
[16] Z. Khanjani et al., "How deep are the fakes? Focusing on audio
deepfake: A survey," arXiv, 2021, arXiv:2111.14203.