Deepfake Report Finalll-1
PROJECTWORK
21ISP76
BACHELOR OF ENGINEERING
IN
INFORMATION SCIENCE AND ENGINEERING
For the Academic Year 2024 - 2025
Submitted by:
Moksha Prada P
1MV22IS402
Under the guidance of
CERTIFICATE
It is certified that the PROJECT WORK [21ISP76] entitled “Deepfake Voice Detection Using
Machine Learning” is carried out by 1MV22IS402 – Moksha Prada P, a bonafide student of Sir
M Visvesvaraya Institute of Technology in partial fulfilment for the 8th semester for the award
of the Degree of Bachelor of Engineering in Information Science and Engineering of the
Visvesvaraya Technological University, Belagavi during the academic year 2024-2025. It is
certified that all corrections and suggestions indicated for Internal Assessment have been
incorporated in the report deposited in the department library. The project report has been approved
as it satisfies the academic requirements in respect of project work prescribed for the course of
Bachelor of Engineering.
Examination:
Name of Examiner Signature with Date
1)
2)
DECLARATION
We hereby declare that the entire project work embodied in this dissertation has been carried out
by us and no part has been submitted for any degree or diploma of any institution previously.
Place: Bengaluru
Date:
Signature of Student
ACKNOWLEDGMENT
It gives us immense pleasure to express our sincere gratitude to the management of Sir M.
Visvesvaraya Institute of Technology, Bengaluru for providing the opportunity and the
resources to accomplish our project work in their premises.
On the path of learning, the presence of an experienced guide is indispensable, and we would like
to thank our guide Ms. Sowjanya Lakshmi A, Asst. Professor, Dept. of ISE, for her invaluable
help and guidance.
Heartfelt and sincere thanks to Dr. G. C. Bhanu Prakash, Prof. and Head, Dept. of ISE,
for his suggestions, constant support and encouragement.
We would also like to convey our regards to Prof. S. G. Rakesh, Principal, Sir MVIT for
providing us with the infrastructure and facilities needed to develop our project.
We would also like to thank the staff of Department of Information Science and Engineering and
lab-in-charges for their co-operation and suggestions. Finally, we would like to thank our Parents
and friends for their help and suggestions without which completing this project would not have
been possible.
ABSTRACT
Deep learning has made significant strides in audio synthesis, making it increasingly
difficult to distinguish authentic speech from fake speech. This study develops a robust fake
speech detection system that targets Logical Access (LA) threats, namely speech produced by
voice-conversion and speech-synthesis techniques. To increase dataset size and improve model
generalization, the system combines a deep learning model with data augmentation
techniques such as time stretching, pitch shifting, and volume scaling.
Preprocessing consists of normalization, noise reduction, and segmentation of the audio
into fixed 4-second frames. Mel spectrograms, computed with the Fast Fourier Transform (FFT)
and normalized using Z-normalization, are used as feature representations. The proposed
architecture is a multi-layer convolutional model with 2D and 1x1 convolutions, batch
normalization, max-pooling, ReLU activation, and fully connected layers. Dropout and other
regularization techniques are employed to reduce overfitting.
The ASVspoof 2019 corpus was used for training and testing, with additional variants to
simulate real-world situations. Classification behavior was examined using the confusion
matrix and the metrics of accuracy, precision, recall, F1-score, and ROC-AUC. The results
show that the system distinguishes real from fake speech with high detection accuracy.
CONTENTS
SL No  Chapters
1  Introduction
   1.1 Overview
   1.2 Organization of Report
2  Literature Review
3  Problem Statement and Objectives
   3.1 Problem Statement
   3.2 Objectives
   3.3 Significance of the Project Work
4  Methodology
   4.1 Block Diagram
   4.2 System Architecture
   4.3 Control Flow Diagram
   4.4 Data Flow Diagram
   4.5 Sequence Diagram / Activity Diagram
5  Implementation
   5.1 System Requirements
   5.2 Algorithms / Pseudocodes
   5.3 Mathematical Description
   5.4 Testing and Test Cases
6  Results and Discussion
   6.1 Dataset Samples
   6.2 Results
   6.3 Result Analysis
   6.4 Summary
7  Conclusion and Future Scope
   7.1 Conclusion
   7.2 Future Scope
8  References
LIST OF FIGURES
Fig. No.  Description
1   Mel Spectrogram and MFCC Generation
2   Block diagram
3   System architecture
4   Control flow diagram
5   Level 1 Data Flow Diagram
6   Activity diagram
7   Performance Matrix
8   Confusion Matrix
9   Accuracy Curve
10  Loss Curve
LIST OF TABLES
Table No.  Description
1  Summary of the CNN model architecture
2  Comparison of Accuracy and Equal Error Rate (EER) with Similar Studies
Deepfake Voice Detection Introduction
CHAPTER 1
INTRODUCTION
1.1 Overview
With the development of AI, deepfake technology has improved to the point where it can
create highly lifelike synthetic voices. Because these AI-generated voices can sound very much
like real people, they raise serious concerns about fraud, deception, and cybersecurity. Abuse
of fake speech has taken various forms, from defeating voice-based identification systems to
conduct fraudulent transactions to impersonating politicians or celebrities. Because synthetic
speech technology is developing so quickly, it is essential to build efficient detection methods
that differentiate between real and fake audio. Traditional voice authentication methods, such
as rule-based approaches and human inspection, can no longer withstand these attacks,
necessitating the use of advanced machine learning algorithms.
Since our model is an offline detector, it is more practical in real-world situations
than models that need online processing. This is helpful in settings such as
secure areas and forensic investigations, where there is little internet access. The
approach also applies to voice verification systems, media authentication, and
cybersecurity, where speech deepfake detection is essential to combating disinformation and
identity theft.
1.2 Organization of Report
The model architecture is described, focusing on the implementation of the CNN to process
the extracted audio features.
The report then outlines the training procedures employed to optimize the model and
the evaluation criteria used to assess performance. The results showcase the system's
effectiveness in distinguishing synthetic speech, underlining its potential in real-world security
applications. Finally, the report discusses future directions, such as enhancing the model for
real-time detection to increase its practical utility across various domains.
CHAPTER 2
LITERATURE REVIEW
The rise of deepfake technologies has led to significant research efforts in detecting
synthetic speech. Reimao and Tzerpos [2] introduced the FOR dataset for synthetic speech
detection in 2019, emphasizing data quality. Their methodology involved using a
comprehensive dataset to train detection models, focusing on spectral features to improve
model performance. However, their approach faced challenges in detecting highly
sophisticated fake speech and small distortions in speech. In 2020, Subramani and Rao [5]
developed efficient neural representations for fake speech detection, leveraging
autoencoders and deep learning models. This method improved detection performance by
learning representations that generalized well across various datasets, though the model’s
dependency on large labelled datasets remains a limitation. Wijethunga et al. [7] applied
deep learning techniques to group conversations, detecting deepfake audio by analysing
group interactions. The accuracy of their model improved significantly over traditional
methods, but real-time processing in multi-speaker environments was still a challenge.
Capoferri et al. [6] used reverberation cues to detect audio splicing, which helped improve
detection accuracy for manipulated speech. However, their method struggles with detecting
speech segments that are minimally altered or highly coherent, limiting its application.
Purevdagva et al. [8] introduced a machine-learning framework for detecting fake
political speech, employing multiple feature extraction techniques. Their methodology
combined prosodic, spectral, and phonetic features to detect inconsistencies in speech.
While this method achieved high accuracy, its limitation lies in the fact that it may not
perform as well on non-political speech or in environments with background noise. Hamza
et al. [16] applied MFCC features in combination with machine learning models for
deepfake audio detection, achieving high detection accuracy. However, their model is
sensitive to variations in environmental noise, leading to potential inaccuracies in real-
world scenarios. Zhang et al. [12] employed residual networks with transformer encoders
to detect fake speech, significantly improving the model’s accuracy in detecting subtle
distortions. This method is particularly effective for detecting deepfake speech at higher
quality levels. However, the model is computationally expensive, making it less suitable
for low-resource environments.
through contrastive training strategies. This approach enhanced model robustness but
required careful tuning of contrastive learning parameters and still lacked resilience against
highly realistic fake samples. Khochare et al. [13] developed a deep learning framework
for audio deepfake detection, combining feature extraction and classification in a single
pipeline. Their model achieved commendable accuracy across various datasets, though it
faced limitations with language diversity and accents.
Shaaban et al. [17] offered a comprehensive analysis of various audio deepfake
approaches, highlighting vulnerabilities and suggesting layered countermeasures. While
their work served as a valuable reference for system design, it lacked experimental
validation and real-world benchmarking. Albazony et al. [18] examined the use of
recurrent neural networks (RNNs) for detecting deepfake videos, and although their focus
was visual, the temporal modeling techniques had implications for sequential audio
detection. However, the adaptation of such models to the audio domain requires further
investigation.
Bansal et al. [19] proposed a hybrid deepfake detection approach using
convolutional neural networks and DCGANs to suppress fake multimedia content,
including audio. While the method showed strong performance, it struggled with
overfitting and required significant processing power. Li et al. [20] studied voice
characteristics like jitter and shimmer to detect fake audio at the acoustic signal level. Their
approach offered interpretability and high accuracy for certain synthetic voices but faltered
when faced with more complex deepfake generation methods.
Paramarthalingam et al. [22] introduced a deep learning model for detecting
potholes to assist the visually impaired. Although not directly related to deepfake
audio, and based on computer vision rather than audio analysis, the work illustrates
environment-aware machine learning that could inspire future detection architectures.
Xue et al. [24] proposed
a dynamic ensemble distillation framework using teacher-student models to create
lightweight deepfake detectors. The solution showed promise for resource-constrained
environments but required substantial offline training with large teacher models.
Deng et al. [25] presented VFD-Net, a vocoder fingerprint-based deepfake
detection system that leverages unique spectral artifacts introduced during synthetic voice
generation. While the model achieved state-of-the-art results in vocoder-based detection,
it performed poorly on end-to-end models designed to bypass vocoder artifacts. Al Ajmi
et al. [26] introduced a zero prior knowledge approach that enabled detection of fake
speech without needing prior examples of fake audio. This unsupervised technique
expanded the detection landscape but suffered from lower accuracy and higher false
positives.
Mathew et al. [27] focused on real-time detection of deepfake audio in
communication platforms, addressing latency and integration challenges. Their model was
efficient for real-time streaming but still faced trade-offs in terms of detection granularity
and scalability. Kang et al. [29] proposed FADEL, an uncertainty-aware fake audio
detection system using evidential deep learning. FADEL enhanced reliability by
quantifying prediction confidence, though its complexity and resource needs may hinder
deployment in lightweight or embedded systems.
More recently, Pham et al. [21] focused on spectrogram-based features in
combination with deep learning for deepfake audio detection, improving detection
accuracy but struggling to handle background noise or low-quality recordings. Dixit et al.
[30] employed speech-to-text conversion to detect fake news in live media, using deep
learning for improved accuracy. Their method, while highly effective for specific cases,
faced challenges in handling speech with varying accents or speech from non-native
speakers.
Accuracy in deepfake audio detection is largely contingent on the quality and
variety of training data, the model architecture, and the feature extraction methods. While
some models achieve high accuracy, especially on high-quality deepfakes, real-time
detection and detection on low-resource devices remain significant challenges. Moreover,
limitations in detecting advanced deepfake techniques, such as those involving minute
audio distortions or adversarial methods, persist.
The future scope lies in improving the adaptability of detection models, enabling
real-time detection on mobile and embedded devices, and reducing dependency on large
datasets. Exploring unsupervised learning methods, reducing computational overhead,
and improving cross-domain detection (e.g., speech-to-text deepfake detection) will be
crucial in expanding the applicability of deepfake voice detection systems.
CHAPTER 3
PROBLEM STATEMENT AND OBJECTIVES
3.1 Problem Statement
The emergence of deepfake technology, driven by developments in machine
learning (ML) and artificial intelligence (AI), has made it extremely difficult to preserve the
integrity and authenticity of multimedia material. Distinguishing authentic from fraudulent
content becomes more challenging when deepfakes closely imitate the voices and appearance
of real people, frequently with great accuracy.
Key Challenges:
• Increasing Sophistication of Deepfake Algorithms:
Contemporary deepfake generation approaches use GANs (Generative Adversarial
Networks) and other AI models that can produce highly lifelike fake media. By
reproducing minute nuances in audio signals and frames, these models leave
irregularities that are too subtle for humans to identify reliably.
• Threats to Digital Trust and Security:
Deepfake content has been used as a weapon for identity theft, personal defamation,
and disinformation operations. This has sparked worries in a variety of fields where
confidence in digital material is crucial, such as social media, politics, law enforcement,
and the media.
• Limited Effectiveness of Traditional Detection Methods:
Because deepfake technologies are dynamic and constantly changing, traditional
detection techniques that depend on static characteristics or heuristic-based algorithms
are unable to keep up. Most current methods are susceptible to cross-modal
manipulations because they concentrate on either video or audio analysis alone.
• Scalability and Generalization Issues:
In real-world applications, deepfake detection methods are less successful because they
frequently have trouble generalizing to new manipulation techniques or unseen
datasets. It is also computationally hard to identify deepfakes in real time in
high-resolution or live-streamed footage.
The machine learning model developed in this project is capable of near real-time
detection, which is essential for immediate response and practical deployment in real-
world scenarios.
• Scalable and Adaptable Framework:
The system is designed to be scalable and adaptable, allowing future integration
with additional audio manipulation detection mechanisms, making it relevant for
long-term applications across various industries.
CHAPTER 4
METHODOLOGY
The methodology adopted in this project encompasses a systematic sequence of steps
designed to develop a reliable machine learning model capable of detecting deepfake (spoofed)
voice recordings. Each stage, from data preprocessing to model evaluation, is carefully designed
to ensure robustness and accuracy in detecting fake voice inputs.
Dataset Collection
The dataset used in this project is the ASVspoof 2019 dataset, a widely recognized
benchmark dataset for automatic speaker verification and spoofing countermeasures. It
contains audio samples of both bonafide (genuine human speech) and spoofed (fake)
speech generated using various voice synthesis (TTS) and voice conversion (VC)
techniques. This dataset helps simulate real-world scenarios and provides a diverse set of
samples that are essential for training and testing the model effectively.
Data Preparation
Once the features are extracted, the dataset is organized and prepared for training:
• Label Encoding: Bonafide samples are labeled as 0 and spoofed samples as 1.
• Data Splitting: The dataset is split into training and testing sets to
evaluate generalization. A separate validation set may also be used to fine-tune
model parameters.
• Normalization: Input features are normalized to scale the values uniformly,
which helps in faster convergence during training.
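The preparation steps above can be sketched in plain Python. This is a minimal illustration, not the project's code: the helper names `encode_label` and `z_normalize`, the class-name keys, and the feature values are assumptions made for the example.

```python
import math

# Label convention stated in the report: bonafide -> 0, spoofed -> 1.
# The dictionary keys are assumed class names for this sketch.
LABELS = {"bonafide": 0, "spoof": 1}

def encode_label(name):
    """Map a class name to its integer label."""
    return LABELS[name]

def z_normalize(values):
    """Scale values to zero mean and unit variance (Z-normalization)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for constant input
    return [(v - mean) / std for v in values]

features = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up feature values
normalized = z_normalize(features)
```

In practice the normalization statistics would usually be computed on the training set only and then reused for the validation and test sets, so that no information leaks from held-out data.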
Model Training
• The model is trained using the binary cross-entropy loss function, which is
suitable for binary classification tasks.
• The Adam optimizer is employed for its efficiency in handling sparse
gradients and adaptive learning rates.
• The training is conducted over multiple epochs, with batch processing for
computational efficiency.
• Early stopping is utilized to halt training when validation accuracy stops
improving, preventing overfitting.
Model Evaluation
• After training, the model is tested on unseen data to evaluate its
performance. The evaluation metrics include:
• Accuracy: Measures the overall correctness of predictions.
• Confusion Matrix: Provides a detailed breakdown of true positives, false
positives, true negatives, and false negatives.
• Precision, Recall, and F1-Score: These metrics offer insight into the model's
effectiveness in identifying spoofed and bonafide voices, especially when the
classes are imbalanced.
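As a small worked example of how these metrics follow from the confusion matrix, the sketch below computes them from raw counts. The function name `classification_metrics` and the counts are illustrative assumptions; spoofed speech is taken as the positive class, matching the label encoding described earlier.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Spoofed speech is treated as the positive class (label 1).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts for illustration only; these are not the project's results.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
```

With imbalanced classes, accuracy alone can look high even when most samples of the minority class are misclassified, which is why precision, recall, and F1 are reported alongside it.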
The Data Flow Diagram (DFD) shows how data moves through the system. It begins with the
user inputting an audio file, followed by preprocessing and feature extraction. The processed
data is then sent to the classification model, which returns the detection result. This diagram
helps in understanding how data is transformed at each stage.
CHAPTER 5
IMPLEMENTATION PROCESS
5.1 System Requirements
• Processor: Intel Core i7/i9 (12th Gen) or AMD Ryzen 7/9
• RAM: 16GB or more
• GPU: NVIDIA RTX 3060 (6GB) or higher (RTX 4090 for high-end deep learning)
• Storage: 512GB NVMe SSD + 1TB HDD (for datasets & models)
• Operating System: Windows 11 / Ubuntu 20.04 / macOS
• Software: Python 3.8+ with virtual environment support
5.2 Algorithm
Deep learning techniques are used in the proposed methodology to accurately
distinguish between real and fraudulent speech. This approach places a strong emphasis on
feature extraction, preprocessing, CNN-based model construction, and the analysis of speech
patterns from carefully chosen datasets.
The steps below give a detailed description of the research methodology used in this
study:
Data Selection
The ASVspoof2019 Logical Access (LA) dataset was chosen as the main dataset for this
work. This dataset, which includes both synthetic and real speech samples, is frequently
used to benchmark fake speech detection. The LA subset contains audio files with synthetic
speech generated by different text-to-speech (TTS) and voice conversion (VC) methods.
The dataset is divided into three subsets:
• Training set: Used for model learning and parameter tuning.
• Validation set: Used for model optimization and hyperparameter adjustments.
• Test set: Used to evaluate the model’s performance on unseen data.
The audio samples in the ASVspoof2019 Logical Access (LA) dataset are 16-bit recordings
sampled at 16 kHz, handled here in WAV format. During preprocessing, each track is
segmented to a fixed duration of 4 seconds to guarantee efficient feature extraction for
fake speech detection.
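A fixed 4-second duration at 16 kHz corresponds to 64,000 samples per clip. The sketch below shows one simple way to bring every clip to that length; truncating long clips and zero-padding short ones is an assumption made for illustration, and the name `fix_length` is hypothetical.

```python
SAMPLE_RATE = 16_000      # Hz, as stated for the dataset
CLIP_SECONDS = 4          # fixed clip duration used in the report
TARGET_LEN = SAMPLE_RATE * CLIP_SECONDS  # 64,000 samples

def fix_length(samples, target_len=TARGET_LEN):
    """Truncate long clips and zero-pad short ones to target_len samples."""
    if len(samples) >= target_len:
        return list(samples[:target_len])
    return list(samples) + [0.0] * (target_len - len(samples))

short_clip = [0.1] * SAMPLE_RATE        # 1-second clip
long_clip = [0.1] * (6 * SAMPLE_RATE)   # 6-second clip
```

Very short inputs, such as clips of one second or less, end up mostly padded under this scheme, so they are worth covering with a dedicated test case.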
Input Layer
Accepts an MFCC feature matrix of shape (time frames × 13 MFCC coefficients),
reshaped for 2D convolutional processing.
Training
The training process follows a structured pipeline to ensure optimal performance and
generalization of the CNN model for fake speech detection. The steps are as follows:
Data Splitting:
The dataset is divided into three subsets:
• Training Set (55%): Used to optimize model weights.
• Validation Set (20%): Used for hyperparameter tuning and to monitor
performance during training.
• Test Set (25%): Reserved for final evaluation to measure the model’s
generalization ability.
The prepare_datasets(0.25, 0.2) function handles this splitting, ensuring balanced
distribution across the three sets.
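How the two fractions (0.25 and 0.2) translate into subset sizes depends on whether the validation fraction is taken from the full dataset or from what remains after the test split. The report does not show the splitting function's body, so the sketch below computes both readings; `split_sizes` and the sample count of 10,000 are assumptions for illustration.

```python
def split_sizes(n_samples, test_size, validation_size, nested=True):
    """Return (train, validation, test) counts for a two-stage split.

    nested=True : validation_size is a fraction of what remains after the
                  test split (a common two-step pattern).
    nested=False: validation_size is a fraction of the full dataset.
    """
    n_test = round(n_samples * test_size)
    remaining = n_samples - n_test
    base = remaining if nested else n_samples
    n_val = round(base * validation_size)
    n_train = remaining - n_val
    return n_train, n_val, n_test

nested = split_sizes(10_000, 0.25, 0.2, nested=True)    # (6000, 1500, 2500)
flat = split_sizes(10_000, 0.25, 0.2, nested=False)     # (5500, 2000, 2500)
```

The 55% / 20% / 25% proportions listed above match the second reading. A nested split with the same arguments would instead give 60% / 15% / 25%.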
Model Compilation:
The CNN model’s input shape is set to the extracted MFCC feature dimensions, (13, 13, 1),
and the model is built as a convolutional architecture.
For steady convergence and adaptive learning, it uses the Adam optimizer with a
learning rate of 0.0001. The output layer employs softmax activation for two-class
classification, paired with the sparse categorical cross-entropy loss function. Accuracy
is the primary metric used to track the model’s performance.
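To make the pairing of softmax output and sparse categorical cross-entropy concrete, the following sketch evaluates both for a single sample in plain Python. It is an illustration of the arithmetic only, not the framework implementation; the logits are made up and the class order (bonafide, spoof) is assumed.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_cross_entropy(probs, label):
    """Loss for one sample whose true class is given as an integer index."""
    return -math.log(probs[label])

logits = [0.0, 0.0]            # made-up scores for (bonafide, spoof)
probs = softmax(logits)        # equal scores give [0.5, 0.5]
loss = sparse_categorical_cross_entropy(probs, label=1)  # equals ln(2)
```

The "sparse" variant takes the integer label directly (0 or 1), so the labels do not need to be one-hot encoded before training.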
Early Stopping:
To prevent overfitting, an early stopping callback is applied. It monitors the validation
loss (val_loss) and stops training if no improvement is observed for 5 consecutive epochs.
The restore_best_weights=True setting ensures that the model reverts to the best-
performing weights before stopping.
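The stopping rule can be traced by hand with a short sketch. The function below mimics the described behaviour (patience of 5 on the validation loss, with the best epoch remembered for weight restoration) on a made-up loss history; it illustrates the logic and is not the framework callback itself.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch) using 0-based epoch indices.

    Training stops once val_loss has not improved for `patience`
    consecutive epochs; best_epoch is where weights would be restored from.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch  # ran out of epochs

# Made-up validation losses: best value at epoch 2, then no improvement.
history = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.63, 0.64, 0.6]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=5)
```

Here training halts at epoch 7, five epochs after the best loss at epoch 2, and the weights from epoch 2 are the ones kept.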
Training Process:
To achieve the best possible balance between training speed and stability, the model is
trained for 30 epochs with a batch size of 32. When training, the validation set is used to
track performance and make dynamic learning adjustments.
Evaluation:
After training, the model is tested on the hold-out test set using model.evaluate(X_test,
y_test). The test accuracy is printed to assess how well the model generalizes to unseen
data. By limiting overfitting and optimizing performance on the classification of real
and fake speech, this structured procedure supports efficient training. This is the
proposed approach.
Sigmoid Activation:
ŷ = 1 / (1 + e^(-z))
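A minimal sketch of the sigmoid above, together with the binary cross-entropy loss it is normally paired with, is given below. The branch on the sign of z is a standard trick to avoid overflow, and the clipping constant `eps` is an assumption added for numerical safety.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function: 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # avoids overflow for large negative z
    return ez / (1.0 + ez)

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Loss for one sample with label y_true in {0, 1}."""
    p = min(max(y_prob, eps), 1.0 - eps)   # clip to keep log() finite
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

y_hat = sigmoid(0.0)                       # 0.5
loss = binary_cross_entropy(1, y_hat)      # ln(2)
```

An output of z = 0 maps to a probability of 0.5, the point of maximum uncertainty between the bonafide and spoofed classes.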
Confusion matrix layout (fake speech as the positive class):
              Predicted Fake   Predicted Real
Actual Fake   TP               FN
Actual Real   FP               TN
Test Cases
TC03   Edge case: short-duration audio   Very short clip (≤1 sec)   Proper classification
CHAPTER 6
RESULTS AND DISCUSSION
The dataset used in this study is the ASVspoof 2019 Logical Access (LA) dataset, a
publicly available benchmark that includes both real and fake speech samples. These samples
are generated using various Text-to-Speech (TTS) and Voice Conversion (VC) methods. Each
audio file is:
Performance Metrics
• Accuracy: 97%
• Precision: High, indicating few false positives
• Recall: High, suggesting that most fake audio samples were correctly identified
• F1-Score: A strong balance between precision and recall
These metrics were derived from predictions on the test dataset, which consisted of previously
unseen samples. The model was particularly effective in identifying subtle artifacts in synthetic
audio that are often missed by human listeners.
Learning Curves
Strengths:
• High detection rate for various forms of synthetic audio
• Robust performance across multiple attack methods, including TTS and VC
• Effective learning from a relatively moderate-sized dataset
• Well-balanced precision and recall, indicating reliable performance
Limitations
• Exploring multi-modal detection (e.g., combining text and audio) for higher reliability
To highlight the performance of the proposed model, we compare it with existing
deepfake detection methods with respect to accuracy and Equal Error Rate (EER).
• The proposed model outperforms the ResNet and VGG models.
• ResNet models struggle with generalization, particularly when tested on unseen datasets.
Table 2: Comparison of Accuracy and Equal Error Rate (EER) with Similar Studies
Study                                      Accuracy (%)   EER (%)
Chinguun Purevdagva et al. [8] approach    59.2           40.8
Kai Li et al. [20] approach                63.82          36.18
J. Khochare et al. [13] approach           67.0           33.0
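For reference, the Equal Error Rate is the operating point at which the false-positive rate (bonafide speech flagged as fake) equals the false-negative rate (fake speech missed). It is obtained by sweeping the decision threshold over the classifier scores and is not, in general, the same as 100 minus accuracy. The sketch below approximates it on made-up scores; the function name and the data are assumptions for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a decision threshold.

    scores: higher means "more likely spoofed"; labels: 1 = spoofed, 0 = bonafide.
    Returns the average of the false-positive and false-negative rates at the
    threshold where the two are closest.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        fpr, fnr = fp / n_neg, fn / n_pos
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer

# Made-up scores for illustration only.
scores = [0.1, 0.2, 0.6, 0.3, 0.4, 0.8, 0.9, 0.7]
labels = [0,   0,   0,   0,   1,   1,   1,   1]
eer = equal_error_rate(scores, labels)
```

On this toy data the two error rates meet at a threshold of 0.6, where one of four bonafide clips is flagged and one of four spoofed clips is missed, giving an EER of 25%.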
6.4 Summary
This project presented a detailed analysis of the model's performance in detecting fake
audio. The proposed CNN-based approach achieved 97% accuracy, outperforming several
state-of-the-art methods.
Key conclusions:
• Our best-performing model, a single-task variant of the CNN, achieves a macro F1 score
of 97.61 on the validation set. The model can further be trained on augmented data
to enhance generalisation, which will prove helpful in deployment. Its
evaluation metrics are above the human observation level of about 85%.
Overall, the results validate the success of the proposed methodology and highlight its potential
as a scalable and reliable tool for combating audio-based misinformation and identity fraud.
CHAPTER 7
CONCLUSION AND FUTURE SCOPE
7.1 Conclusion
This project developed a CNN-based Fake Speech Detection system that effectively
distinguishes between genuine and manipulated audio samples. By employing data augmentation
techniques, we enhanced the model’s robustness, enabling better generalization to various audio
manipulations. As the prevalence of fake speech increases, this research underscores the need for
advanced methodologies to combat misinformation in audio communications. Future work will
focus on refining model architectures and incorporating larger, more diverse datasets to further
enhance detection capabilities, contributing to audio forensics and security efforts.
The rise of synthetic voice generation through advanced AI techniques like Text-to-
Speech (TTS) and Voice Conversion (VC) presents significant threats to digital security, identity
verification, and media authenticity. This project presents a Convolutional Neural Network
(CNN)-based fake speech detection model trained on the ASVspoof 2019 dataset. By utilizing
MFCCs and spectrogram-based features, the model achieved a remarkable accuracy of 97% and
an F1-score of 97.61%, demonstrating its effectiveness in identifying fake speech even when the
differences are imperceptible to the human ear.
Data augmentation techniques like pitch shifting, noise addition, and time-stretching were
employed to improve model generalization, significantly enhancing robustness against unseen and
adversarial audio samples. The lightweight nature of the model, with a memory footprint
under 100 KB, makes it viable for deployment in real-world environments, including mobile
and embedded systems.
Comparative studies also showed the superiority of this model over traditional and state-
of-the-art techniques such as ResNet, VGG, and Transformer encoders. Despite minor limitations
like performance drop on highly compressed audio or near-perfect synthetic voices, the system
provides a reliable, efficient, and scalable solution for the growing problem of voice-based
deepfake attacks.
CHAPTER 8
REFERENCES
[1] Li, Yang, et al. “Universal voice conversion.” In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 3526-3535. 2019.
[2] Reimao, R., and V. Tzerpos. “FOR: A dataset for synthetic speech detection,” in 2019
International Conference on Speech Technology and Human-Computer Dialogue (SpeD).
IEEE, pp. 1–10.
[3] Korshunov, Pavel, and S. Marcel. “Deepfake detection using inverse contrastive loss.” In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Workshops, pp. 3920-3929. 2020.
[4] Prakash, Shreya, et al. “A comprehensive study on deep fake audio detection.” In
Proceedings of the 2020 ACM Workshop on Information Hiding and Multimedia Security,
pp. 81-90.
[5] Subramani, N. and D. Rao, “Learning efficient representations for fake speech detection,”
in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp.
5859–5866, 2020.
[6] Capoferri, D., et al., “Speech audio splicing detection and localization exploiting
reverberation cues,” in 2020 IEEE International Workshop on Information Forensics and
Security (WIFS). IEEE, pp. 1–6.
[7] Wijethunga, R., et al., “Deepfake audio detection: a deep learning-based solution for group
conversations,” in 2020 2nd International conference on advancements in computing
(ICAC), vol. 1. IEEE, pp. 192–197.
[8] Purevdagva, C., et al., “A machine-learning based framework for detection of fake
political speech,” in 2020 IEEE 14th International Conference on Big Data Science and
Engineering (BigDataSE). IEEE, pp. 80–87.
[9] Mukhopadhyay, Rudrabha, et al. “A comprehensive survey of voice conversion and deep
fake techniques.” arXiv preprint arXiv:2103.03230, 2021.
[10] Xie, Jin, et al. “Voice Deep Guard: Towards Intelligent Voice Deepfake Detection.” In
Proceedings of the 28th ACM International Conference on Multimedia, 2021.
[11] Ballesteros, D. M., et al., “Deep4snet: deep learning for fake speech classification,” Expert
Systems with Applications, vol. 184, p. 115465, 2021.
[12] Zhang, Z., et al., “Fake speech detection using residual network with transformer
encoder,” in Proceedings of the 2021 ACM workshop on information hiding and
multimedia security, pp. 13–22.
[13] Khochare, J., et al., “A deep learning framework for audio deepfake detection,” Arabian
Journal for Science and Engineering, pp. 1–12, 2021.
[14] Pasupathi, Panupong, and Taxing Li. “Detecting AI-Generated Text with BERT.” In
Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, pp. 672-684.
[15] Zhang, L., et al., “The partialspoof database and countermeasures for the detection of short
fake speech segments embedded in an utterance,” IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 813–825, 2022.
[16] Hamza, A., et al., “Deepfake audio detection via MFCC features using machine learning,”
IEEE Access, vol. 10, pp. 134018–134028, 2022.
[17] Shaaban, O. A., et al., “Audio deepfake approaches,” IEEE Access, vol. 11, pp. 132652–
132682, 2023.
[18] Albazony, A. A. M., et al., “Deepfake videos detection by using recurrent neural network
(RNN),” in 2023 Al-Sadiq International Conference on Communication and Information
Technology (AICCIT). IEEE, pp. 103–107.
[19] Bansal, K., et al., “Deepfake detection using CNN and DCGANs to drop-out fake
multimedia content: a hybrid approach,” in 2023 International Conference on IoT,
Communication and Automation Technology (ICICAT). IEEE, pp. 1–6.
[20] Li, K., et al., “Contributions of jitter and shimmer in the voice for fake audio detection,”
IEEE Access, vol. 11, pp. 84689–84698, 2023.
[21] Pham, L., et al., “Deepfake audio detection using spectrogram-based feature and ensemble
of deep learning models,” in 2024 IEEE 5th International Symposium on the Internet of
Sounds (IS2). IEEE, pp. 1–5.
[22] Paramarthalingam, A., et al., “A deep learning model to assist visually impaired in pothole
detection using computer vision,” Decision Analytics Journal, vol. 12, p. 100507, 2024.
[23] Basha, S. A. Y., and K. U. Priya, “Recognition of deep fake voice acoustic using ensemble
bagging model,” in 2024 5th International Conference on Electronics and Sustainable
Communication Systems (ICESC). IEEE, pp. 1211–1217.
[24] Xue, J., et al., “Dynamic ensemble teacher-student distillation framework for light-weight
fake audio detection,” IEEE Signal Processing Letters, 2024.
[25] Deng, J., et al., “VFD-Net: Vocoder fingerprints detection for fake audio,” in ICASSP
2024 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, pp. 12151–12155.
[26] Al Ajmi, S. A., et al., “Faked speech detection with zero prior knowledge,” Discover
Applied Sciences, vol. 6, no. 6, p. 288, 2024.
[27] Mathew, J. J., et al., “Towards the development of a real-time deepfake audio detection
system in communication platforms,” arXiv preprint arXiv:2403.11778, 2024.
[28] Song, D., et al., “Anomaly detection of deepfake audio based on real audio using
generative adversarial network model,” IEEE Access, 2024.
[29] Kang, J. Y., et al., “FADEL: Uncertainty-aware fake audio detection with evidential deep
learning,” in ICASSP 2025 IEEE International Conference on Acoustics, Speech and
Signal Processing. IEEE, pp. 1–5.
[30] Dixit, Y., et al., “Fake news detection of live media using speech to text conversion,” in
2021 Innovations in Power and Advanced Computing Technologies (i-PACT). IEEE, pp.
1–5.