Final Deepfake Voice Detection Report
PROJECT WORK
21ISP76
Deepfake Voice Detection Using Machine Learning
Submitted in partial fulfillment for the requirements for the Eighth semester
BACHELOR OF ENGINEERING
IN
INFORMATION SCIENCE AND ENGINEERING
For the Academic Year 2024 - 2025
Submitted by:
Amruthesh S G 1MV21IS006
Maninder Kaur 1MV21IS021
Tejaswini G H 1MV21IS038
Moksha Prada P 1MV22IS402
CERTIFICATE
It is certified that the PROJECT WORK [21ISP76] entitled “Deepfake Voice Detection Using
Machine Learning” is carried out by 1MV21IS006 – Amruthesh S G, 1MV21IS021 –
Maninder Kaur, 1MV21IS038 – Tejaswini G H, 1MV22IS402 – Moksha Prada P, bonafide
students of Sir M Visvesvaraya Institute of Technology, in partial fulfilment for the 8th semester
for the award of the Degree of Bachelor of Engineering in Information Science and Engineering
of the Visvesvaraya Technological University, Belagavi, during the academic year 2024-2025.
It is certified that all corrections and suggestions indicated for Internal Assessment have been
incorporated in the report deposited in the department library. The project report has been
approved as it satisfies the academic requirements in respect of project work prescribed for the
course of Bachelor of Engineering.
Examination:
Name of Examiner Signature with Date
1)
2)
DECLARATION
We hereby declare that the entire project work embodied in this dissertation has
been carried out by us and no part has been submitted for any degree or diploma of
any institution previously.
Place: Bengaluru
Date:
Signature of Students
Amruthesh S G 1MV21IS006
Maninder Kaur 1MV21IS021
Tejaswini G H 1MV21IS038
Moksha Prada P 1MV22IS402
ACKNOWLEDGMENT
We extend our heartfelt and sincere thanks to Dr. G. C. Bhanu Prakash, Prof. and Head,
Dept. of ISE, for his suggestions, constant support and encouragement.
We would also like to thank the staff of the Department of Information Science and
Engineering and the lab in-charges for their co-operation and suggestions. Finally, we
would like to thank our parents and friends for their help and suggestions, without
which completing this project would not have been possible.
ABSTRACT
Deep learning has made significant strides in audio synthesis, making it increasingly
challenging to distinguish authentic speech from fake speech. This study develops a robust
fake speech detection system aimed at speech produced by voice-conversion and speech-synthesis
techniques, focusing on Logical Access (LA) threats. To increase dataset size and improve model
generalization, the system makes use of a deep learning model with data augmentation
techniques such as time stretching, pitch shifting, and volume scaling.
Preprocessing includes normalization, noise reduction, and segmentation of the audio into
regular 4-second frames. Mel spectrograms, produced with the Fast Fourier Transform (FFT)
and normalized using Z-normalization, are used as feature representations. The proposed
architecture is a multi-layer convolutional model with 2D and 1x1 convolutions, batch
normalization, max-pooling, ReLU activation, and fully connected layers. Regularization
techniques such as dropout are employed to strengthen the model’s resistance to overfitting.
The ASVspoof 2019 corpus was used for training and testing, with additional variants to
simulate real-world conditions. Classification behavior was examined using the confusion
matrix together with accuracy, precision, recall, F1-score, and ROC-AUC. The results
demonstrate that the system distinguishes real from fake speech with high detection accuracy.
CONTENTS
1. Introduction
   1.1 Overview
   1.2 Organization of Report
2. Literature Review
3. Problem Statement and Objectives
   3.1 Problem Statement
   3.2 Objectives
   3.3 Significance of the Project Work
4. Methodology
   4.1 Block Diagram
   4.2 System Architecture
   4.3 Control Flow Diagram
   4.4 Data Flow Diagram
   4.5 Sequence Diagram / Activity Diagram
5. Implementation
   5.1 System Requirements
   5.2 Algorithms / Pseudocodes
   5.3 Mathematical Description
   5.4 Testing and Test Cases
6. Results and Discussion
   6.1 Dataset Samples
   6.2 Results
   6.3 Result Analysis
   6.4 Summary
7. Conclusion and Future Scope
   7.1 Conclusion
   7.2 Future Scope
8. References
LIST OF FIGURES
Fig. 1   Mel Spectrogram and MFCC Generation
Fig. 2   Block diagram
Fig. 3   System architecture
Fig. 4   Control flow diagram
Fig. 5   Level 1 Data Flow Diagram
Fig. 6   Activity diagram
Fig. 7   Performance Matrix
Fig. 8   Confusion Matrix
Fig. 9   Accuracy Curve
Fig. 10  Loss Curve
LIST OF TABLES
Table 1  Summary of the CNN model architecture
Table 2  Comparison of Accuracy and Equal Error Rate (EER) with Similar Studies
CHAPTER 1
INTRODUCTION
1.1 Overview
With the development of AI, deepfake technology has also improved and is now capable of
creating highly lifelike fake voices. Because these AI-generated voices can sound very much
like real humans, they raise serious concerns about fraud, deception, and cybersecurity.
Abuse of fake speech has taken various forms, from defeating voice-based identification
systems to carry out fraudulent transactions to impersonating politicians or celebrities.
Because synthetic speech technology is developing so quickly, it is essential to build
efficient detection methods that differentiate between real and fake voices. Traditional
voice authentication methods, such as rule-based approaches and human inspection, can no
longer withstand these attacks, which necessitates advanced machine learning algorithms.
Our focus is on developing a deep learning-based fake speech detection system that can
distinguish between artificial and genuine speech. The ASVspoof 2019 benchmark, which
includes a comprehensive collection of spoof and real speech samples, is used in our approach.
We employ spectrograms and Mel-Frequency Cepstral Coefficients (MFCCs) to extract
important elements that help detect relevant speech patterns. A Convolutional Neural Network
(CNN), a powerful deep learning framework that can recognize patterns in audio and visual
data, is then used to examine the acquired features.
Since our model is an offline detector, it is more practical to use in real-world situations than
models that need online processing. Such a capability can be helpful in settings like secure
areas and forensic investigations where there is little internet access. Applications of this
approach may also be seen in voice verification systems, media authentication, and
cybersecurity, where speech deepfake detection is essential to combating disinformation and
identity theft.
This report provides a comprehensive explanation of the proposed deepfake voice detection
system. It begins with the dataset selection, specifically highlighting the use of the ASVspoof
2019 dataset, which offers a wide range of real and spoofed speech samples.
Next, the preprocessing methods are detailed, including the conversion of raw audio into
Mel spectrograms, noise reduction, and clarity enhancement steps. Following this, the model
architecture is described, focusing on the implementation of the CNN to process the extracted
audio features.
The report then outlines the training procedures employed to optimize the model and the
evaluation criteria used to assess performance. The results showcase the system's
effectiveness in distinguishing synthetic speech, underlining its potential in real-world
security applications.
Finally, the report discusses future directions, such as enhancing the model for real-time
detection to increase its practical utility across various domains.
CHAPTER 2
LITERATURE REVIEW
The rise of deepfake technologies has led to significant research efforts in detecting
synthetic speech. Reimao and Tzerpos [2] introduced the FOR dataset for synthetic speech
detection in 2019, emphasizing data quality. Their methodology involved using a
comprehensive dataset to train detection models, focusing on spectral features to improve
model performance. However, their approach faced challenges in detecting highly
sophisticated fake speech and small distortions in speech. In 2020, Subramani and Rao [5]
developed efficient neural representations for fake speech detection, leveraging
autoencoders and deep learning models. This method improved detection performance by
learning representations that generalized well across various datasets, though the model’s
dependency on large labelled datasets remains a limitation. Wijethunga et al. [7] applied
deep learning techniques to group conversations, detecting deepfake audio by analysing
group interactions. The accuracy of their model improved significantly over traditional
methods, but real-time processing in multi-speaker environments was still a challenge.
Capoferri et al. [6] used reverberation cues to detect audio splicing, which helped improve
detection accuracy for manipulated speech. However, their method struggles with
detecting speech segments that are minimally altered or highly coherent, limiting its
application.
Purevdagva et al. [8] introduced a machine-learning framework for detecting fake
political speech, employing multiple feature extraction techniques. Their methodology
combined prosodic, spectral, and phonetic features to detect inconsistencies in speech.
While this method achieved high accuracy, its limitation lies in the fact that it may not
perform as well on non-political speech or in environments with background noise. Hamza
et al. [16] applied MFCC features in combination with machine learning models for
deepfake audio detection, achieving high detection accuracy. However, their model is
sensitive to variations in environmental noise, leading to potential inaccuracies in real-
world scenarios. Zhang et al. [12] employed residual networks with transformer encoders
to detect fake speech, significantly improving the model’s accuracy in detecting subtle
distortions. This method is particularly effective for detecting deepfake speech at higher
quality levels. However, the model is computationally expensive, making it less suitable
for low-resource environments.
Korshunov and Marcel [3] improved the detection of fake speech through contrastive training strategies. This approach enhanced model robustness but
required careful tuning of contrastive learning parameters and still lacked resilience against
highly realistic fake samples. Khochare et al. [13] developed a deep learning framework
for audio deepfake detection, combining feature extraction and classification in a single
pipeline. Their model achieved commendable accuracy across various datasets, though it
faced limitations with language diversity and accents.
Shaaban et al. [17] offered a comprehensive analysis of various audio deepfake
approaches, highlighting vulnerabilities and suggesting layered countermeasures. While
their work served as a valuable reference for system design, it lacked experimental
validation and real-world benchmarking. Albazony et al. [18] examined the use of
recurrent neural networks (RNNs) for detecting deepfake videos, and although their focus
was visual, the temporal modeling techniques had implications for sequential audio
detection. However, the adaptation of such models to the audio domain requires further
investigation.
Bansal et al. [19] proposed a hybrid deepfake detection approach using
convolutional neural networks and DCGANs to suppress fake multimedia content,
including audio. While the method showed strong performance, it struggled with
overfitting and required significant processing power. Li et al. [20] studied voice
characteristics like jitter and shimmer to detect fake audio at the acoustic signal level. Their
approach offered interpretability and high accuracy for certain synthetic voices but faltered
when faced with more complex deepfake generation methods.
Paramarthalingam et al. [22] introduced a deep learning model for detecting
potholes for visually impaired assistance, which, while not directly related to deepfake
audio, demonstrated the broader applications of audio analysis and environment-aware
machine learning that could inspire future detection architectures. Xue et al. [24] proposed
a dynamic ensemble distillation framework using teacher-student models to create
lightweight deepfake detectors. The solution showed promise for resource-constrained
environments but required substantial offline training with large teacher models.
Deng et al. [25] presented VFD-Net, a vocoder fingerprint-based deepfake
detection system that leverages unique spectral artifacts introduced during synthetic voice
generation. While the model achieved state-of-the-art results in vocoder-based detection,
it performed poorly on end-to-end models designed to bypass vocoder artifacts. Al Ajmi
et al. [26] introduced a zero prior knowledge approach that enabled detection of fake
speech without needing prior examples of fake audio. This unsupervised technique
expanded the detection landscape but suffered from lower accuracy and higher false
positives.
Mathew et al. [27] focused on real-time detection of deepfake audio in
communication platforms, addressing latency and integration challenges. Their model was
efficient for real-time streaming but still faced trade-offs in terms of detection granularity
and scalability. Kang et al. [29] proposed FADEL, an uncertainty-aware fake audio
detection system using evidential deep learning. FADEL enhanced reliability by
quantifying prediction confidence, though its complexity and resource needs may hinder
deployment in lightweight or embedded systems.
More recently, Pham et al. [21] focused on spectrogram-based features in
combination with deep learning for deepfake audio detection, improving detection
accuracy but struggling to handle background noise or low-quality recordings. Dixit et al.
[30] employed speech-to-text conversion to detect fake news in live media, using deep
learning for improved accuracy. Their method, while highly effective for specific cases,
faced challenges in handling speech with varying accents or speech from non-native
speakers.
Accuracy in deepfake audio detection is largely contingent on the quality and
variety of training data, the model architecture, and the feature extraction methods. While
some models achieve high accuracy, especially on high-quality deepfakes, real-time
detection and detection on low-resource devices remain significant challenges. Moreover,
limitations in detecting advanced deepfake techniques, such as those involving minute
audio distortions or adversarial methods, persist.
Future Scope lies in improving the adaptability of detection models, enabling
real-time detection on mobile and embedded devices, and reducing dependency on large
datasets. Exploring unsupervised learning methods, reducing computational overhead,
and improving cross-domain detection (e.g., speech-to-text deepfake detection) will be
crucial in expanding the applicability of deepfake voice detection systems.
CHAPTER 3
PROBLEM STATEMENT AND OBJECTIVES
3.1 Problem Statement
The emergence of deepfake technology, which is driven by developments in machine learning
(ML) and artificial intelligence (AI), has made it extremely difficult to preserve the integrity
and authenticity of multimedia material. It becomes more challenging to distinguish between
authentic and fraudulent information when deepfakes modify audio in a way that closely
resembles the voices and looks of actual people, frequently with great accuracy.
Key Challenges:
1. Increasing Sophistication of Deepfake Algorithms
• Modern deepfake generation approaches rely on GANs (Generative Adversarial Networks) and
other AI models that can produce incredibly lifelike fake media. By exploiting minute nuances
in audio signals and frames, these models create forgeries whose irregularities are difficult
for humans to identify.
2. Threats to Digital Trust and Security
• Deepfake content has been used as a weapon for identity theft, personal defamation,
and disinformation operations. This has sparked worries in a variety of fields where
confidence in digital material is crucial, such as social media, politics, law enforcement,
and the media.
3. Limited Effectiveness of Traditional Detection Methods
• Because deepfake technologies are dynamic and constantly changing, traditional
detection techniques that depend on static characteristics or heuristic-based algorithms
are unable to keep up.
• The majority of current methods are susceptible to cross-modal manipulations since
they concentrate on audio or visual analysis alone.
4. Scalability and Generalization Issues
• In real-world applications, deepfake detection methods are less successful because they
frequently have trouble generalizing to new modification techniques or unexplored
datasets.
• It is computationally demanding to identify deepfakes in real time in high-resolution or live-
streamed footage.
3.2 Objectives
• Identification of utterance: The primary objective of the Fake Speech Detection project is to
be able to tell fake speech utterances from bonafide (authentic) ones. The project should prove
viable in detecting Logical Access attacks such as TTS and VC.
• Extension of ASVspoof 2019 Dataset: We intend to increase the number of training examples
in our dataset by 5 to 10 times by employing various audio signal processing and speech
augmentation techniques on the existing dataset, such as Time Shifting, Time Stretching, Pitch
Scaling, and Noise Addition (a minimal sketch of these operations follows this list). This will
make the model more robust and improve its generalisation capabilities.
• Performance Assessment: After the model is built, we will assess the performance of the proposed
model against established benchmarks in fake speech detection, focusing on metrics such as
precision, recall, and F1-score. Evaluation of the model on the original dataset, the augmented
dataset, and both combined will be done separately and compared against various studies involving
similar models and datasets.
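The sketch below illustrates the augmentation techniques named above; it assumes librosa and numpy are available, and the shift, stretch, pitch, and noise parameters are purely illustrative.

import numpy as np
import librosa

def augment(y, sr=16000):
    # y: raw waveform; returns a dictionary of augmented copies
    out = {}
    out["time_shift"] = np.roll(y, int(0.1 * sr))                          # shift by 0.1 s
    out["time_stretch"] = librosa.effects.time_stretch(y, rate=1.1)        # 10% faster playback
    out["pitch_scale"] = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # +2 semitones
    out["noise"] = y + 0.005 * np.random.randn(len(y))                     # additive Gaussian noise
    return out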
3.3 Significance of the Project Work
• Mitigates Misinformation Risks
With the increasing use of synthetic voices in spreading fake news, executing scams, and
conducting impersonation attacks, this project plays a crucial role in identifying and
preventing the misuse of AI-generated audio content.
• Enhances Security in Voice-Based Systems
Voice authentication systems, such as those used in banking or smart home devices, are
vulnerable to spoofing attacks. This project adds a significant layer of protection by
accurately detecting deepfake voices, thereby strengthening overall system security.
• Supports Legal and Ethical Standards
Deepfake voice detection is vital for maintaining the integrity of digital evidence in legal
proceedings and ensuring ethical usage of audio content in media and communication.
• Promotes Trust in Media Content
The ability to verify the authenticity of audio enhances consumer trust in media,
journalism, and broadcasting. It helps content creators and consumers distinguish between
real and manipulated voice recordings.
• Real-Time Detection Capability
The machine learning model developed in this project is capable of near real-time
detection, which is essential for immediate response and practical deployment in real-world
scenarios.
• Scalable and Adaptable Framework
The system is designed to be scalable and adaptable, allowing future integration with
additional audio manipulation detection mechanisms, making it relevant for long-term
applications across various industries.
CHAPTER 4
METHODOLOGY
The methodology adopted in this project encompasses a systematic sequence of steps designed to
develop a reliable machine learning model capable of detecting deepfake (spoofed) voice recordings.
Each stage, from data preprocessing to model evaluation, is carefully designed to ensure robustness and
accuracy in detecting fake voice inputs.
1. Dataset Collection
The dataset used in this project is the ASVspoof 2019 dataset, a widely recognized benchmark
dataset for automatic speaker verification and spoofing countermeasures. It contains audio
samples of both bonafide (genuine human speech) and spoofed (fake) speech generated
using various voice synthesis (TTS) and voice conversion (VC) techniques. This dataset helps
simulate real-world scenarios and provides a diverse set of samples that are essential for training
and testing the model effectively.
2. Preprocessing and Feature Extraction
Raw audio signals cannot be directly fed into machine learning models, especially deep learning
networks. Therefore, the following preprocessing steps are performed:
• Audio Loading: Each .flac or .wav file is loaded using the librosa library, which also
helps in resampling the audio at a consistent sampling rate.
• Spectrogram Generation: The log-mel spectrogram is extracted from each audio file.
It captures the frequency domain features by applying the Mel scale, which aligns more
closely with how humans perceive sound.
• MFCC (Mel Frequency Cepstral Coefficients): MFCCs are also extracted as they are
known to represent the timbral texture of speech and are widely used in speech
recognition and spoof detection.
These extracted features are converted into 2D image-like arrays, suitable for feeding into
convolutional neural networks (CNNs).
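A minimal feature-extraction sketch in Python is shown below; it assumes librosa and numpy are installed, and the function name, sampling rate, and Mel/MFCC parameters are illustrative.

import numpy as np
import librosa

def extract_features(path, sr=16000, n_mels=128, n_mfcc=13):
    # Load the audio (.flac or .wav) and resample to a consistent rate
    y, sr = librosa.load(path, sr=sr)
    # Log-mel spectrogram: frequency-domain features on the Mel scale
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # MFCCs: a compact representation of the timbral texture of speech
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return log_mel, mfcc   # 2D, image-like arrays (bands x frames)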
3. Data Preparation
Once the features are extracted, the dataset is organized and prepared for training:
• Label Encoding: Bonafide samples are labeled as 0 and spoofed samples as 1.
• Data Splitting: The dataset is split into training and testing sets to evaluate
generalization. A separate validation set may also be used to fine-tune model parameters.
• Normalization: Input features are normalized to scale the values uniformly, which helps
in faster convergence during training.
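The following sketch illustrates label encoding, splitting, and normalization; the array shapes and the random stand-in data are illustrative only.

import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the extracted features: 100 samples of a 13 x 157 MFCC matrix
X = np.random.randn(100, 13, 157).astype("float32")
y = np.random.randint(0, 2, size=100)          # 0 = bonafide, 1 = spoofed

X = (X - X.mean()) / (X.std() + 1e-8)          # global normalization of input features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)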
4. Model Design – Convolutional Neural Network (CNN)
A CNN architecture is designed to classify input spectrograms into real or fake voice. CNNs are
chosen for their ability to recognize spatial hierarchies in 2D feature maps. The architecture
includes:
• Convolutional Layers: To detect local audio features like pitch, tone, and modulation
patterns.
• Pooling Layers: To reduce dimensionality and computation, while preserving important
features.
• Batch Normalization: To stabilize and accelerate training.
• Dropout Layers: To prevent overfitting by randomly disabling neurons during training.
• Fully Connected Dense Layers: Final classification layers with a sigmoid or softmax
function to output probabilities.
5. Model Training
• The model is trained using the binary cross-entropy loss function, which is suitable for
binary classification tasks.
• The Adam optimizer is employed for its efficiency in handling sparse gradients and
adaptive learning rates.
• The training is conducted over multiple epochs, with batch processing for computational
efficiency.
• Early stopping is utilized to halt training when validation accuracy stops improving,
preventing overfitting.
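A minimal training sketch with these settings is given below (Keras/TensorFlow assumed); the tiny placeholder model and random arrays only stand in for the real CNN and spectrogram features, and the epoch count and batch size follow the values quoted later in Chapter 5.

import numpy as np
import tensorflow as tf

# Placeholder data: replace with the real feature arrays
X_train = np.random.randn(80, 13, 157, 1).astype("float32")
y_train = np.random.randint(0, 2, size=80)
X_val = np.random.randn(20, 13, 157, 1).astype("float32")
y_val = np.random.randint(0, 2, size=20)

# Placeholder model: replace with the CNN described in the previous section
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(13, 157, 1)),
    tf.keras.layers.Conv2D(8, (3, 3), activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Binary cross-entropy loss with the Adam optimizer, as described above
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping halts training when the validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=30, batch_size=32,
                    callbacks=[early_stop])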
6. Model Evaluation
After training, the model is tested on unseen data to evaluate its performance. The
evaluation metrics include:
• Accuracy: Measures the overall correctness of predictions.
• Confusion Matrix: Provides a detailed breakdown of true positives, false positives, true
negatives, and false negatives.
• Precision, Recall, and F1-Score: These metrics offer insight into the model's effectiveness in
identifying spoofed and bonafide voices, especially when the classes are imbalanced.
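A short evaluation sketch using scikit-learn is shown below; it assumes a trained Keras model and a held-out test split X_test, y_test of matching shape (the names follow the sketches above).

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_prob = model.predict(X_test).ravel()     # predicted probability of "spoofed"
y_pred = (y_prob >= 0.5).astype(int)       # threshold at 0.5

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))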
7. Visualization and Interpretation
• Loss and Accuracy Curves: Plotted for both training and validation sets to visualize
learning progression.
• Spectrogram Visuals: Used to understand what types of patterns the CNN is learning
to distinguish between real and fake audio.
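The loss and accuracy curves can be plotted directly from the Keras training history, as in the sketch below (assuming the history object returned by model.fit in the training sketch above).

import matplotlib.pyplot as plt

plt.figure()
plt.plot(history.history["accuracy"], label="train accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("Epoch"); plt.ylabel("Accuracy"); plt.legend()

plt.figure()
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("Epoch"); plt.ylabel("Loss"); plt.legend()
plt.show()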
CHAPTER 5
IMPLEMENTATION PROCESS
5.1 System Requirements
• Processor: Intel Core i7/i9 (12th Gen) or AMD Ryzen 7/9
• RAM: 16GB or more
• GPU: NVIDIA RTX 3060 (6GB) or higher (RTX 4090 for high-end deep learning)
• Storage: 512GB NVMe SSD + 1TB HDD (for datasets & models)
• Operating System: Windows 11 / Ubuntu 20.04/ MacOS
• Software: Python 3.8+ with virtual environment support
5.2 Algorithm
Deep learning techniques are used in the proposed methodology to accurately distinguish between real
and fraudulent speech. This approach places a strong emphasis on feature extraction, preprocessing,
CNN-based model construction, and the analysis of speech patterns from carefully chosen datasets.
The steps below give a detailed description of the research methodology used in this study:
A. Data Selection
The ASVspoof 2019 Logical Access (LA) dataset was chosen as the main dataset for this task. This
dataset, which includes both synthetic and real speech samples, is widely used for fake speech
detection. The LA subset contains audio files with synthetic speech generated by different
text-to-speech (TTS) and voice conversion (VC) methods.
The dataset is divided into three subsets: training, development, and evaluation.
The audio samples in the ASVspoof 2019 Logical Access (LA) dataset are encoded in 16-bit, 16
kHz WAV format and have a fixed duration of 4 seconds per track. Preprocessing and
segmentation techniques are applied to guarantee efficient feature extraction for fake speech
detection.
1) Preprocessing and Segmentation
• Frame Segmentation: Each audio file is divided into 10 segments to ensure sufficient
temporal resolution.
• Sampling Consistency: Given a sample rate of 16,000 Hz and a track duration of 4
seconds, each track contains 64,000 samples (i.e., 16,000 × 4).
• Windowing: A Hamming window is applied to each frame to reduce spectral leakage.
• Padding (if required): Zero-padding is applied to maintain uniform segment lengths
across all samples.
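A short segmentation sketch in numpy is given below; the parameters mirror the text (16 kHz audio, 4-second tracks, 10 segments, Hamming window, zero-padding), while the function name is illustrative.

import numpy as np

def segment_track(y, sr=16000, duration=4.0, n_segments=10):
    total = int(sr * duration)                          # 64,000 samples per track
    y = np.pad(y, (0, max(0, total - len(y))))[:total]  # zero-pad or truncate to 4 s
    seg_len = total // n_segments                       # 6,400 samples per segment
    window = np.hamming(seg_len)                        # reduces spectral leakage
    segments = [y[i * seg_len:(i + 1) * seg_len] * window
                for i in range(n_segments)]
    return np.stack(segments)                           # shape: (10, 6400)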
2) Feature Extraction
To extract meaningful representations of speech, Mel-Frequency Cepstral Coefficients
(MFCCs) are computed from each segmented frame. The extraction applies framing and
windowing, the Fast Fourier Transform (FFT), Mel filter-bank analysis, and a discrete cosine
transform to obtain the cepstral coefficients.
B. Model Architecture
1) Input Layer
Accepts an MFCC feature matrix of shape (time frames × 13 MFCC coefficients), reshaped for
2D convolutional processing.
2) Convolutional and Pooling Layers
The model employs multiple convolutional layers to extract features from speech signals:
a. First Conv Layer: A Conv2D layer with 32 filters and (3×3) kernel size applies
feature extraction to learn key speech patterns.
b. Max Pooling (2×2) with same padding reduces spatial dimensions while preserving
critical information.
c. Batch Normalization stabilizes training and accelerates convergence.
d. This structure is repeated across three convolutional layers, progressively refining
feature maps.
e. The third convolutional layer uses a (2×2) kernel to extract fine-grained speech details.
3) Flatten Layer
Converts the 2D feature maps into a 1D vector for classification.
4) Fully Connected (Dense) Layers
a. A Dense layer with 64 neurons and ReLU activation further processes extracted
features.
b. A Dropout layer (0.3 probability) prevents overfitting.
5) Output Layer
A Dense layer with 2 neurons and Softmax activation produces probability scores for real vs.
fake speech classification.
This CNN-based architecture efficiently captures deep-fake speech artifacts while maintaining
computational efficiency.
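A Keras sketch of this architecture is shown below. The input time-frame count and the filter counts of the second and third convolutional blocks are not fixed by the description above and are assumed here for illustration.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(time_frames=157, n_mfcc=13):
    model = models.Sequential([
        layers.Input(shape=(time_frames, n_mfcc, 1)),
        # Block 1: Conv2D (32 filters, 3x3) -> MaxPool (2x2, same) -> BatchNorm
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2), padding="same"),
        layers.BatchNormalization(),
        # Block 2: same structure (filter count assumed)
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2), padding="same"),
        layers.BatchNormalization(),
        # Block 3: (2x2) kernel for fine-grained speech detail (filter count assumed)
        layers.Conv2D(32, (2, 2), activation="relu"),
        layers.MaxPooling2D((2, 2), padding="same"),
        layers.BatchNormalization(),
        # Classification head: Flatten -> Dense(64, ReLU) -> Dropout(0.3) -> Softmax(2)
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(2, activation="softmax"),
    ])
    return model

Calling build_model().summary() prints a layer-by-layer listing comparable to the architecture summary in Table 1.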
C. Model Training and Evaluation
1) Early Stopping
To prevent overfitting, an early stopping callback is applied. It monitors the validation loss
(val_loss) and stops training if no improvement is observed for 5 consecutive epochs. The
restore_best_weights=True setting ensures that the model reverts to the best-performing
weights before stopping.
2) Training Process
To achieve the best possible balance between training speed and stability, the model is
trained for 30 epochs with a batch size of 32. When training, the validation set is used to
track performance and make dynamic learning adjustments.
3) Evaluation
After training, the model is tested on the hold-out test set using model.evaluate(X_test,
y_test). The test accuracy is printed to assess how well the model generalizes to unseen data.
By limiting overfitting and optimizing performance on the task of classifying real and fake
speech, this systematic methodology ensures efficient training. This constitutes the proposed
approach.
5.3 Mathematical Description
Sigmoid Activation:
ŷ = 1 / (1 + e^(-z))
Binary Cross Entropy Loss:
L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]
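The two formulas can be illustrated numerically with a few lines of numpy; the sample logits and labels below are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)    # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z = np.array([2.0, -1.0])                   # raw model outputs (logits)
y = np.array([1.0, 0.0])                    # ground-truth labels
print(binary_cross_entropy(y, sigmoid(z)))  # ~[0.127, 0.313]: small loss for correct, confident predictions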
Confusion matrix layout (rows: actual class; columns: predicted class):
              Predicted Fake    Predicted Real
Actual Fake   TP                FN
Actual Real   FP                TN
5.4 Testing and Test Cases
2. Test Cases
Test ID   Description                        Input                       Expected Result
TC03      Edge case: short-duration audio    Very short clip (≤ 1 sec)   Proper classification
CHAPTER 6
RESULTS AND DISCUSSION
6.1 Dataset Samples
The dataset used in this study is the ASVspoof 2019 Logical Access (LA) dataset, a publicly
available benchmark that includes both real and fake speech samples. These samples are
generated using various Text-to-Speech (TTS) and Voice Conversion (VC) methods. Each
audio file is encoded in 16-bit, 16 kHz format with a fixed duration of about 4 seconds, as
described in Chapter 5.
6.2 Results
Performance Metrics
• Accuracy: 97%
• Precision: High, indicating few false positives
• Recall: High, suggesting that most fake audio samples were correctly identified
• F1-Score: A strong balance between precision and recall
These metrics were derived from predictions on the test dataset, which consisted of previously
unseen samples. The model was particularly effective in identifying subtle artifacts in synthetic
audio that are often missed by human listeners.
Learning Curves
The model’s performance reflects its capacity to effectively differentiate between real and
synthetic audio samples. Several key observations and insights emerged from the evaluation:
Strengths
Future Improvements
6.4 Summary
This project presented a detailed analysis of the model's performance in detecting fake audio.
The proposed CNN-based approach achieved 97% accuracy, outperforming several state-of-
the-art methods.
Key conclusions:
• Our best performing model, a single-task variant of the CNN, achieves a macro F1-score
of 97.61 on the validation set. The model can be further trained on augmented data
to enhance generalisation and will prove helpful in deployment. The model's
evaluation metrics are well above the human detection level of about 85%.
• Due to the use of an efficient CNN architecture, the processing time from feeding the
input to generating the result is very low. The model needs fewer than 50,000
parameters and has a memory footprint of around 100 KB, which is highly commendable
in the field of Audio Signal Processing.
• The model is highly effective at identifying fake audio generated using modern TTS
and VC methods.
• Performance is competitive when benchmarked against recent literature.
• Certain limitations exist, particularly regarding compressed audio and high-end
synthetic voices.
• Future improvements can significantly boost the generalizability and efficiency of
the system, especially for deployment in real-world applications.
Overall, the results validate the success of the proposed methodology and highlight its potential
as a scalable and reliable tool for combating audio-based misinformation and identity fraud.
CHAPTER 7
CONCLUSION AND FUTURE SCOPE
7.1 Conclusion
This project developed a CNN-based Fake Speech Detection system that effectively
distinguishes between genuine and manipulated audio samples. By employing data
augmentation techniques, we enhanced the model’s robustness, enabling better generalization
to various audio manipulations. As the prevalence of fake speech increases, this research
underscores the need for advanced methodologies to combat misinformation in audio
communications. Future work will focus on refining model architectures and incorporating
larger, more diverse datasets to further enhance detection capabilities, contributing to audio
forensics and security efforts.
The rise of synthetic voice generation through advanced AI techniques like Text-to-Speech
(TTS) and Voice Conversion (VC) presents significant threats to digital security, identity
verification, and media authenticity. This project presents a Convolutional Neural Network
(CNN)-based fake speech detection model trained on the ASVspoof 2019 dataset. By utilizing
MFCCs and spectrogram-based features, the model achieved a remarkable accuracy of 97%
and an F1-score of 97.61%, demonstrating its effectiveness in identifying fake speech even
when the differences are imperceptible to the human ear.
Data augmentation techniques like pitch shifting, noise addition, and time-stretching were
employed to improve model generalization, significantly enhancing robustness against unseen
and adversarial audio samples. The lightweight nature of the model—with under 100 KB
memory footprint—makes it viable for deployment in real-world environments, including
mobile and embedded systems.
Comparative studies also showed the superiority of this model over traditional and some state-
of-the-art techniques such as ResNet, VGG, and Transformer encoders. Despite minor
limitations like performance drop on highly compressed audio or near-perfect synthetic voices,
the system provides a reliable, efficient, and scalable solution for the growing problem of voice-
based deepfake attacks.
7.2 Future Scope
1. Real-Time and Edge Deployment
• Extend the framework by fusing audio with visual cues (lip-sync analysis,
facial expressions) and text transcripts (Natural Language Processing) to
provide holistic deepfake detection.
9. User-Interactive Interfaces
CHAPTER 8
REFERENCES
[1] Li, Yang, et al. “Universal voice conversion.” In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 3526-3535. 2019.
[2] Reimao, R., and V. Tzerpos. “FOR: A dataset for synthetic speech detection,” in 2019
International Conference on Speech Technology and Human-Computer Dialogue (SpeD).
IEEE, pp. 1–10.
[3] Korshunov, Pavel, and S. Marcel. “Deepfake detection using inverse contrastive loss.” In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Workshops, pp. 3920-3929. 2020.
[4] Prakash, Shreya, et al. “A comprehensive study on deep fake audio detection.” In
Proceedings of the 2020 ACM Workshop on Information Hiding and Multimedia Security,
pp. 81-90.
[5] Subramani, N. and D. Rao, “Learning efficient representations for fake speech detection,”
in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp.
5859–5866, 2020.
[6] Capoferri, D., et al., “Speech audio splicing detection and localization exploiting
reverberation cues,” in 2020 IEEE International Workshop on Information Forensics and
Security (WIFS). IEEE, pp. 1–6.
[7] Wijethunga, R., et al., “Deepfake audio detection: a deep learning-based solution for
group conversations,” in 2020 2nd International conference on advancements in
computing (ICAC), vol. 1. IEEE, pp. 192–197.
[8] Purevdagva, C., et al., “A machine-learning based framework for detection of fake
political speech,” in 2020 IEEE 14th International Conference on Big Data Science and
Engineering (BigDataSE). IEEE, pp. 80–87.
[9] Mukhopadhyay, Rudrabha, et al. “A comprehensive survey of voice conversion and deep
fake techniques.” arXiv preprint arXiv:2103.03230, 2021.
[10] Xie, Jin, et al. “Voice Deep Guard: Towards Intelligent Voice Deepfake Detection.” In
Proceedings of the 28th ACM International Conference on Multimedia, 2021.
[11] Ballesteros, D. M., et al., “Deep4snet: deep learning for fake speech classification,” Expert
Systems with Applications, vol. 184, p. 115465, 2021.
[12] Zhang, Z., et al., “Fake speech detection using residual network with transformer
encoder,” in Proceedings of the 2021 ACM workshop on information hiding and
multimedia security, pp. 13–22.
[13] Khochare, J., et al., “A deep learning framework for audio deepfake detection,” Arabian
Journal for Science and Engineering, pp. 1–12, 2021.
[14] Pasupathi, Panupong, and Taxing Li. “Detecting AI-Generated Text with BERT.” In
Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, pp. 672-684.
[15] Zhang, L., et al., “The partialspoof database and countermeasures for the detection of short
fake speech segments embedded in an utterance,” IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 813–825, 2022.
[16] Hamza, A., et al., “Deepfake audio detection via MFCC features using machine learning,”
IEEE Access, vol. 10, pp. 134018–134028, 2022.
[17] Shaaban, O. A., et al., “Audio deepfake approaches,” IEEE Access, vol. 11, pp. 132652–
132682, 2023.
[18] Albazony, A. A. M., et al., “Deepfake videos detection by using recurrent neural network
(RNN),” in 2023 Al-Sadiq International Conference on Communication and Information
Technology (AICCIT). IEEE, pp. 103–107.
[19] Bansal, K., et al., “Deepfake detection using CNN and DCGANs to drop-out fake
multimedia content: a hybrid approach,” in 2023 International Conference on IoT,
Communication and Automation Technology (ICICAT). IEEE, pp. 1–6.
[20] Li, K., et al., “Contributions of jitter and shimmer in the voice for fake audio detection,”
IEEE Access, vol. 11, pp. 84689–84698, 2023.
[21] Pham, L., et al., “Deepfake audio detection using spectrogram-based feature and ensemble
of deep learning models,” in 2024 IEEE 5th International Symposium on the Internet of
Sounds (IS2). IEEE, pp. 1–5.
[22] Paramarthalingam, A., et al., “A deep learning model to assist visually impaired in pothole
detection using computer vision,” Decision Analytics Journal, vol. 12, p. 100507, 2024.
[23] Basha, S. A. Y., and K. U. Priya, “Recognition of deep fake voice acoustic using ensemble
bagging model,” in 2024 5th International Conference on Electronics and Sustainable
Communication Systems (ICESC). IEEE, pp. 1211–1217.
[24] Xue, J., et al., “Dynamic ensemble teacher-student distillation framework for light-weight
fake audio detection,” IEEE Signal Processing Letters, 2024.
[25] Deng, J., et al., “VFD-Net: Vocoder fingerprints detection for fake audio,” in ICASSP
2024 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, pp. 12151–12155.
[26] Al Ajmi, S. A., et al., “Faked speech detection with zero prior knowledge,” Discover
Applied Sciences, vol. 6, no. 6, p. 288, 2024.
[27] Mathew, J. J., et al., “Towards the development of a real-time deepfake audio detection
system in communication platforms,” arXiv preprint arXiv:2403.11778, 2024.
[28] Song, D., et al., “Anomaly detection of deepfake audio based on real audio using
generative adversarial network model,” IEEE Access, 2024.
[29] Kang, J. Y., et al., “FADEL: Uncertainty-aware fake audio detection with evidential deep
learning,” in ICASSP 2025 IEEE International Conference on Acoustics, Speech and
Signal Processing. IEEE, pp. 1–5.
[30] Dixit, Y., et al., “Fake news detection of live media using speech to text conversion,” in
2021 Innovations in Power and Advanced Computing Technologies (i-PACT). IEEE, pp.
1–5.