0% found this document useful (0 votes)
29 views36 pages

Final Deepfake Voice Detection Report

The document outlines a project titled 'Deepfake Voice Detection Using Machine Learning' conducted by students of Sir M. Visvesvaraya Institute of Technology as part of their Bachelor of Engineering program. The project aims to develop a robust system for detecting synthetic speech using deep learning techniques, particularly focusing on audio features and employing a Convolutional Neural Network (CNN) architecture. The report includes a literature review, methodology, results, and future directions for enhancing the detection system's practical applications in cybersecurity.

Uploaded by

m7086651
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
29 views36 pages

Final Deepfake Voice Detection Report

The document outlines a project titled 'Deepfake Voice Detection Using Machine Learning' conducted by students of Sir M. Visvesvaraya Institute of Technology as part of their Bachelor of Engineering program. The project aims to develop a robust system for detecting synthetic speech using deep learning techniques, particularly focusing on audio features and employing a Convolutional Neural Network (CNN) architecture. The report includes a literature review, methodology, results, and future directions for enhancing the detection system's practical applications in cybersecurity.

Uploaded by

m7086651
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 36

VISVESVARAYA TECHNOLOGICAL UNIVERSITY

Jnana Sangama, Belagavi-590010

PROJECT WORK
21ISP76
Deepfake Voice Detection Using Machine Learning

Submitted in partial fulfillment for the requirements for the Eighth semester

BACHELOR OF ENGINEERING
IN
INFORMATION SCIENCE AND ENGINEERING
For the Academic Year 2024 - 2025
Submitted by:

Amruthesh S G 1MV21IS006
Maninder Kaur 1MV21IS021
Tejaswini G H 1MV21IS038
Moksha Prada P 1MV22IS402

Under the guidance of

Ms. Sowjanya Lakshmi A


Asst. Professor, Department of ISE

DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING


SIR M. VISVESVARAYA INSTITUTE OF TECHNOLOGY
Krishnadevaraya Nagar, International Airport Road,
Hunasmaranahalli, Bengaluru – 562157
DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING

CERTIFICATE
It is certified that the PROJECT WORK [21ISP76] entitled “Deepfake Voice Detection Using
Machine Learning” is carried out by 1MV21IS006 – Amruthesh S G , 1MV21IS021 –
Maninder Kaur , 1MV21IS038 – Tejaswini G H , 1MV22IS402 – Moksha Prada P bonafide
Students of Sir M Visvesvaraya Institute of Technology in partial fulfilment for the 8th semester
for the award of the Degree of Bachelor of Engineering in Information Science and Engineering
of the Visvesvaraya Technological University, Belagavi during the academic year 2024-2025.
It is certified that all corrections and suggestions indicated for Internal Assessment have been
incorporated in the report deposited in the department library. The project report has been
approved as it satisfies the academic requirements in respect of project work prescribed for the
course of Bachelor of Engineering.

Ms Sowjanya Lakshmi A Dr. G. C. Bhanu Prakash Prof. S G Rakesh


Assistant Professor Head of Department, Principal,
Dept. of ISE, Dept. of ISE, Sir MVIT
Sir MVIT Sir MVIT Bengaluru – 562157
Bengaluru – 562157 Bengaluru – 562157

Examination:
Name of Examiner Signature with Date

1)

2)
DECLARATION
We hereby declare that the entire project work embodied in this dissertation has
been carried out by us and no part has been submitted for any degree or diploma of
any institution previously.

Place: Bengaluru
Date:

Signature of Student

Amruthesh S G 1MV21IS006

Maninder Kaur 1MV21IS021

Tejaswini G H 1MV21IS038

Moksha Prada P 1MV22IS402

I
ACKNOWLEDGMENT

It gives us immense pleasure to express our sincere gratitude to the management


of Sir M. Visvesvaraya Institute of Technology, Bengaluru for providing the
opportunity and the resources to accomplish our project work in their premises.

On the path of learning, the presence of an experienced guide is indispensable, and


we would like to thank our guide Ms. Sowjanya Lakshmi A, Asst. Professor,
Dept. of ISE, for her invaluable help and guidance.

Heartfelt and sincere thanks to Dr. G. C. Bhanu Prakash, Prof. and Head,
Dept. of ISE, for his suggestions, constant support and encouragement.

We would also like to convey our regards to Prof. S. G. Rakesh, Principal,


Sir MVIT for providing us with the infrastructure and facilities needed to develop
our project.

We would also like to thank the staff of Department of Information Science and
Engineering and lab-in-charges for their co-operation and suggestions. Finally, we
would like to thank our Parents and friends for their help and suggestions without
which completing this project would not have been possible.

II
ABSTRACT

Deep learning has made significant strides in audio synthesis techniques, making it
more challenging to distinguish between authentic and fake speech. This study uses voice-
conversion and synthetic techniques to develop a robust fake speech detection system that
focuses on Logical Access (LA) threats. To increase dataset size and improve model
generalization, the system makes use of a deep learning model with data augmentation
techniques like time stretching, pitch shifting, and volume scaling.

Normalization, noise reduction, and audio segmentation into regular 4-second frames
are all examples of preprocessing. Melspectrograms, which are produced by normalizing
inputs using Z-normalization and the Fast Fourier Transform (FFT), are utilized as feature
representations. A multi-layer convolutional model with 2D and 1x1 convolutions, batch
normalization, max-pooling, ReLU activation, and fully linked layers makes up the suggested
architecture. Dropout and other regularization techniques are employed to strengthen the
model’s resistance to overfitting.

The ASVspoof 2019 corpus was used for training and testing, with further variants to
simulate real-world situations. To examine classification behavior, the confusion matrix and
the metrics of accuracy, precision, recall, F1-score, and ROC-AUC were employed. The
results demonstrate that the system was quite successful at distinguishing between real and
phony speech, with a high detection accuracy.

This work makes a significant addition to voice-based security systems by offering a


scalable, practical, and broadly applicable defense against new audio spoofing threat.

III
CONTENTS

SL No Chapters Page No
Introduction 1-2
1 1.1 Overview 1
1.2 Organization of Report 1-2
2 Literature Review 3-6
Problem Statement and Objectives 7-8
3.1 Problem Statement 7
3
3.2 Objectives 7-8
3.3 Significance of the Project Work 8
4 Methodology 9-14
4.1 Block Diagram 12
4.2 System Architecture 12
4.3 Control Flow Diagram 13
4.4 Data Flow Diagram 13
4.5 Sequence Diagram / Activity Diagram 14
Implementation 15-20
5.1 System Requirements 15
5 5.2 Algorithms / Pseudocodes 15-19
5.Mathematical Description 19-20
5.4 Testing and Test Cases 20
Results and Discussion 21-25
6.1 Dataset Samples 21
6 6.2 Results 21-23
6.3 Result Analysis 23-24
6.4 Summary 24-25
Conclusion and Future Scope 26-27
7 7.1 Conclusion 26
7.2 Future Scope 26-27
8 References 28-29

IV
LIST OF FIGURES

Page
Fig. No. Description
No
1 Mel Spectrogram and MFCC Generation 10
2 Block diagram 13
3 System architecture 13
4 Control flow diagram 14
5 Level 1 Data Flow Diagram 14
6 Activity diagram 15
7 Performance Matrix 22
8 Confusion Matrix 23
9 Accuracy Curve 23
10 Loss Curve 24

LIST OF TABLES
Table Page
Description
No. No
1 Summary of the CNN model architecture 19
2 Comparison Of Accuracy And Equal Error Rate (EER) With Similar Studies 25

V
Deepfake Voice Detection Introduction

CHAPTER 1
INTRODUCTION
1.1 Overview
With the development of AI, deepfake technology has also improved and is now capable of
creating incredibly lifelike fake sounds. Because these AI voices might sound a lot like actual
humans, there are serious worries about fraud, deception, and cybersecurity. Abuse of fake
speech has taken various forms, from using voice-based identification systems to conduct
fraudulent transactions to posing as politicians or celebrities. Because synthetic speech
technology is developing so quickly, it is essential to build efficient detection methods to
differentiate between real and fake sounds. These attacks are now too strong for traditional
voice authentication methods, such as rule-based approaches and human inspections,
necessitating the use of advanced machine learning algorithms.

Our focus is on developing a deep learning-based artificial speech recognition system that can
distinguish between artificial and genuine speech. The ASVspoof 2019 benchmark, which
includes a comprehensive collection of spoof and real speech samples, is used in our approach.
We employ spectrograms and Mel-Frequency Cepstral Coefficients (MFCCs) to extract
important elements that help detect relevant speech patterns. A Convolutional Neural Network
(CNN), a powerful deep learning framework that can recognize patterns in audio and visual
data, is then used to examine the acquired features.

Since our model is an offline detector, it is more practical to use in real-world situations than
other models that need online processing. Such features might be helpful in settings like secure
regions and forensic investigations when there is little internet access. Applications of this
approach may also be seen in voice verification systems, media authentication, and
cybersecurity, where speech deepfake detection is essential to combating disinformation and
identity theft.

1.2 Organisation of Report

This report provides a comprehensive explanation of the proposed deepfake voice detection
system. It begins with the dataset selection, specifically highlighting the use of the ASVspoof
2019 dataset, which offers a wide range of real and spoofed speech samples.
Next, the preprocessing methods are detailed, including the conversion of raw audio into
MEL spectrograms, noise reduction, and clarity enhancement steps. Following this, the model

DEPT OF ISE, SIR MVIT 2024-2025 Page | 1


Deepfake Voice Detection Introduction

architecture is described, focusing on the implementation of the CNN to process the extracted
audio features.
The report then outlines the training procedures employed to optimize the model and the
evaluation criteria used to assess performance. The results showcase the system's
effectiveness in distinguishing synthetic speech, underlining its potential in real-world
security applications.
Finally, the report discusses future directions, such as enhancing the model for real-time
detection to increase its practical utility across various domains.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 2


Deepfake Voice Detection Literature Review

CHAPTER 2
LITERATURE REVIEW

The rise of deepfake technologies has led to significant research efforts in detecting
synthetic speech. Reimao and Tzerpos [2] introduced the FOR dataset for synthetic speech
detection in 2019, emphasizing data quality. Their methodology involved using a
comprehensive dataset to train detection models, focusing on spectral features to improve
model performance. However, their approach faced challenges in detecting highly
sophisticated fake speech and small distortions in speech. In 2020, Subramani and Rao [5]
developed efficient neural representations for fake speech detection, leveraging
autoencoders and deep learning models. This method improved detection performance by
learning representations that generalized well across various datasets, though the model’s
dependency on large labelled datasets remains a limitation. Wijethunga et al. [7] applied
deep learning techniques to group conversations, detecting deepfake audio by analysing
group interactions. The accuracy of their model improved significantly over traditional
methods, but real-time processing in multi-speaker environments was still a challenge.
Capoferri et al. [6] used reverberation cues to detect audio splicing, which helped improve
detection accuracy for manipulated speech. However, their method struggles with
detecting speech segments that are minimally altered or highly coherent, limiting its
application.
Purevdagva et al. [8] introduced a machine-learning framework for detecting fake
political speech, employing multiple feature extraction techniques. Their methodology
combined prosodic, spectral, and phonetic features to detect inconsistencies in speech.
While this method achieved high accuracy, its limitation lies in the fact that it may not
perform as well on non-political speech or in environments with background noise. Hamza
et al. [16] applied MFCC features in combination with machine learning models for
deepfake audio detection, achieving high detection accuracy. However, their model is
sensitive to variations in environmental noise, leading to potential inaccuracies in real-
world scenarios. Zhang et al. [12] employed residual networks with transformer encoders
to detect fake speech, significantly improving the model’s accuracy in detecting subtle
distortions. This method is particularly effective for detecting deepfake speech at higher
quality levels. However, the model is computationally expensive, making it less suitable
for low-resource environments.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 3


Deepfake Voice Detection Literature Review

Mukhopadhyay et al. [9] provided a comprehensive survey on voice conversion


and deepfake techniques, offering insights into various methods and their effectiveness.
This work is valuable in providing a broader perspective on deepfake detection; however,
its limitation lies in the lack of empirical testing and evaluation of the surveyed methods.
Xie et al. [10] introduced Voice DeepGuard, a model that combines adversarial networks
and machine learning to detect synthetic voices. The methodology demonstrated high
robustness across different types of synthetic voices but had limitations when it came to
detecting synthetic voices with minute alterations. Prakash et al. [4] proposed a
comprehensive study on deepfake audio detection using an ensemble of multiple deep
learning models. Their methodology improved detection accuracy, but their system faced
difficulties with real-time detection in large datasets, making it impractical for real-time
applications.
Li et al. [1] explored universal voice conversion techniques for improving the
adaptability of deepfake detection systems. Their methodology demonstrated effectiveness
across different speakers and speech conditions but struggled with low-quality audio or
speech distortions. Pasupathi and Li [14] presented a method for detecting AI-generated
text through BERT, which provided valuable cross-domain insights, although the model
was specifically designed for text and does not directly address deepfake audio detection.
In 2022, Zhang et al. [15] introduced a model for detecting fake speech segments
embedded within utterances. Their method, which utilized deep learning techniques,
showed promise in detecting short, fake segments, though it had limitations in dealing with
longer fake speech passages. Ballesteros et al. [11] presented Deep4snet, a deep learning-
based model for fake speech classification. Their model was highly accurate in detecting
fake speech; however, it showed reduced performance with low-quality or highly
compressed audio.
In 2023, Song et al. [28] proposed an anomaly detection method for deepfake
audio, employing generative adversarial networks (GANs) to detect discrepancies between
real and fake speech. The method was effective but had limitations when dealing with
adversarial deepfake speech generated through advanced techniques. Basha and Priya [23]
applied an ensemble bagging model for deepfake voice recognition, offering improved
performance with diverse audio sources. However, the method was computationally
intensive, which could hinder its application in real-time systems.
Korshunov and Marcel [3] presented a deepfake detection method using inverse
contrastive loss, which improved the model’s ability to differentiate real and fake speech

DEPT OF ISE, SIR MVIT 2024-2025 Page | 4


Deepfake Voice Detection Literature Review

through contrastive training strategies. This approach enhanced model robustness but
required careful tuning of contrastive learning parameters and still lacked resilience against
highly realistic fake samples. Khochare et al. [13] developed a deep learning framework
for audio deepfake detection, combining feature extraction and classification in a single
pipeline. Their model achieved commendable accuracy across various datasets, though it
faced limitations with language diversity and accents.
Shaaban et al. [17] offered a comprehensive analysis of various audio deepfake
approaches, highlighting vulnerabilities and suggesting layered countermeasures. While
their work served as a valuable reference for system design, it lacked experimental
validation and real-world benchmarking. Albazony et al. [18] examined the use of
recurrent neural networks (RNNs) for detecting deepfake videos, and although their focus
was visual, the temporal modeling techniques had implications for sequential audio
detection. However, the adaptation of such models to the audio domain requires further
investigation.
Bansal et al. [19] proposed a hybrid deepfake detection approach using
convolutional neural networks and DCGANs to suppress fake multimedia content,
including audio. While the method showed strong performance, it struggled with
overfitting and required significant processing power. Li et al. [20] studied voice
characteristics like jitter and shimmer to detect fake audio at the acoustic signal level. Their
approach offered interpretability and high accuracy for certain synthetic voices but faltered
when faced with more complex deepfake generation methods.
Paramarthalingam et al. [22] introduced a deep learning model for detecting
potholes for visually impaired assistance, which, while not directly related to deepfake
audio, demonstrated the broader applications of audio analysis and environment-aware
machine learning that could inspire future detection architectures. Xue et al. [24] proposed
a dynamic ensemble distillation framework using teacher-student models to create
lightweight deepfake detectors. The solution showed promise for resource-constrained
environments but required substantial offline training with large teacher models.
Deng et al. [25] presented VFD-Net, a vocoder fingerprint-based deepfake
detection system that leverages unique spectral artifacts introduced during synthetic voice
generation. While the model achieved state-of-the-art results in vocoder-based detection,
it performed poorly on end-to-end models designed to bypass vocoder artifacts. Al Ajmi
et al. [26] introduced a zero prior knowledge approach that enabled detection of fake
speech without needing prior examples of fake audio. This unsupervised technique

DEPT OF ISE, SIR MVIT 2024-2025 Page | 5


Deepfake Voice Detection Literature Review

expanded the detection landscape but suffered from lower accuracy and higher false
positives.
Mathew et al. [27] focused on real-time detection of deepfake audio in
communication platforms, addressing latency and integration challenges. Their model was
efficient for real-time streaming but still faced trade-offs in terms of detection granularity
and scalability. Kang et al. [29] proposed FADEL, an uncertainty-aware fake audio
detection system using evidential deep learning. FADEL enhanced reliability by
quantifying prediction confidence, though its complexity and resource needs may hinder
deployment in lightweight or embedded systems.
More recently, Pham et al. [21] focused on spectrogram-based features in
combination with deep learning for deepfake audio detection, improving detection
accuracy but struggling to handle background noise or low-quality recordings. Dixit et al.
[30] employed speech-to-text conversion to detect fake news in live media, using deep
learning for improved accuracy. Their method, while highly effective for specific cases,
faced challenges in handling speech with varying accents or speech from non-native
speakers.
Accuracy in deepfake audio detection is largely contingent on the quality and
variety of training data, the model architecture, and the feature extraction methods. While
some models achieve high accuracy, especially on high-quality deepfakes, real-time
detection and detection on low-resource devices remain significant challenges. Moreover,
limitations in detecting advanced deepfake techniques, such as those involving minute
audio distortions or adversarial methods, persist.
Future Scope lies in improving the adaptability of detection models, enabling
real-time detection on mobile and embedded devices, and reducing dependency on large
datasets. Exploring unsupervised learning methods, reducing computational overhead,
and improving cross-domain detection (e.g., speech-to-text deepfake detection) will be
crucial in expanding the applicability of deepfake voice detection systems.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 6


Deepfake Voice Detection Problem Statement and Objectives

CHAPTER 3
PROBLEM STATEMENT AND OBJECTIVES
3.1 Problem Statement
The emergence of deepfake technology, which is driven by developments in machine learning
(ML) and artificial intelligence (AI), has made it extremely difficult to preserve the integrity
and authenticity of multimedia material. It becomes more challenging to distinguish between
authentic and fraudulent information when deepfakes modify audio in a way that closely
resembles the voices and looks of actual people, frequently with great accuracy.

Key Challenges:
1. Increasing Sophistication of Deepfake Algorithms
• GANs (Generative Adversarial Networks) and other AI models that can produce
incredibly lifelike fake media are used in contemporary deepfake generating approaches.
By taking advantage of minute nuances in audio signals and frames, these computers are
able to identify irregularities more accurately than humans.
2. Threats to Digital Trust and Security
• Deepfake content has been used as a weapon for identity theft, personal defamation,
and disinformation operations. This has sparked worries in a variety of fields where
confidence in digital material is crucial, such as social media, politics, law enforcement,
and the media.
3. Limited Effectiveness of Traditional Detection Methods
• Because deepfake technologies are dynamic and constantly changing, traditional
detection techniques that depend on static characteristics or heuristic-based algorithms
are unable to keep up.
• The majority of current methods are susceptible to cross-modal manipulations since
they either concentrate on audio or audio analysis alone.
4. Scalability and Generalization Issues
• In real-world applications, deepfake detection methods are less successful because they
frequently have trouble generalizing to new modification techniques or unexplored
datasets.
• It is computationally hard to identify deepfakes in realtime in high-resolution or live-
streamed footage.
3.2 Objectives
• Identification of utterance: The primary objective of the Fake Speech Detection project is to
be able to tell fake speech utterances from bonafide (authentic) ones. The project should prove
viable in detecting Logical Attacks such as TTS and VC.
• Extension of ASVSpoof 2019 Dataset: We intend to increase the number of training examples
in our dataset by 5 times to 10 times by employing various audio signal processing and speech
augmentation techniques on the existing dataset such as Time Shifting, Time Stretching, Pitch
Scaling, Noise Addition, etc. This will make the model more robust and improve generalisation
capabilities of the model.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 7


Deepfake Voice Detection Problem Statement and Objectives

• Performance Assessment: After model is built, we will assess the performance of proposed
model against established benchmarks in fake speech detection, focusing on metrics such as
precision, recall, and F1-score. Evaluation of model on original dataset, augmented dataset and
both combined will be done seperately and compared against various studies involving similar
models and datasets.
3.3 Significance of the Project Work
• Mitigates Misinformation Risks
With the increasing use of synthetic voices in spreading fake news, executing scams, and
conducting impersonation attacks, this project plays a crucial role in identifying and
preventing the misuse of AI-generated audio content.
• Enhance Security in Voice-Based Systems
Voice authentication systems, such as those used in banking or smart home devices, are
vulnerable to spoofing attacks. This project adds a significant layer of protection by
accurately detecting deepfake voices, thereby strengthening overall system security.
• Supports Legal and Ethical Standards
Deepfake voice detection is vital for maintaining the integrity of digital evidence in legal
proceedings and ensuring ethical usage of audio content in media and communication.
• Promote Trust in Media Content
The ability to verify the authenticity of audio enhances consumer trust in media,
journalism, and broadcasting. It helps content creators and consumers distinguish between
real and manipulated voice recordings.
• Real-Time Detection Capability
The machine learning model developed in this project is capable of near real-time
detection, which is essential for immediate response and practical deployment in real-world
scenarios.
• Scalable and Adaptable Framework
The system is designed to be scalable and adaptable, allowing future integration with
additional audio manipulation detection mechanisms, making it relevant for long-term
applications across various industries.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 8


Deepfake Voice Detection Methodology

CHAPTER 4
METHODOLOGY
The methodology adopted in this project encompasses a systematic sequence of steps designed to
develop a reliable machine learning model capable of detecting deepfake (spoofed) voice recordings.
Each stage, from data preprocessing to model evaluation, is carefully designed to ensure robustness and
accuracy in detecting fake voice inputs.
1. Dataset Collection
The dataset used in this project is the *ASVspoof 2019* dataset, a widely recognized benchmark
dataset for automatic speaker verification and spoofing countermeasures. It contains audio
samples of both *bonafide (genuine human speech)* and *spoofed (fake)* speech generated
using various voice synthesis (TTS) and voice conversion (VC) techniques. This dataset helps
simulate real-world scenarios and provides a diverse set of samples that are essential for training
and testing the model effectively.
2. Preprocessing and Feature Extraction
Raw audio signals cannot be directly fed into machine learning models, especially deep learning
networks. Therefore, the following preprocessing steps are performed:

• Audio Loading: Each .flac or .wav file is loaded using the librosa library, which also
helps in resampling the audio at a consistent sampling rate.
• Spectrogram Generation: The log-mel spectrogram is extracted from each audio file.
It captures the frequency domain features by applying the Mel scale, which aligns more
closely with how humans perceive sound.
• MFCC (Mel Frequency Cepstral Coefficients): MFCCs are also extracted as they are
known to represent the timbral texture of speech and are widely used in speech
recognition and spoof detection.

FIG 1 Mel Spectrogram and MFCC Generation

DEPT OF ISE, SIR MVIT 2024-2025 Page | 9


Deepfake Voice Detection Methodology

These extracted features are converted into 2D image-like arrays, suitable for feeding into
convolutional neural networks (CNNs).
3. Data Preparation
Once the features are extracted, the dataset is organized and prepared for training:
• Label Encoding: Bonafide samples are labeled as 0 and spoofed samples as 1.
• Data Splitting: The dataset is split into training and testing sets to evaluate
generalization. A separate validation set may also be used to fine-tune model parameters.
• Normalization: Input features are normalized to scale the values uniformly, which helps
in faster convergence during training.
4. Model Design – Convolutional Neural Network (CNN)
A CNN architecture is designed to classify input spectrograms into real or fake voice. CNNs are
chosen for their ability to recognize spatial hierarchies in 2D feature maps. The architecture
includes:

• Convolutional Layers: To detect local audio features like pitch, tone, and modulation
patterns.
• Pooling Layers: To reduce dimensionality and computation, while preserving important
features.
• Batch Normalization: To stabilize and accelerate training.
• Dropout Layers: To prevent overfitting by randomly disabling neurons during training.
• Fully Connected Dense Layers: Final classification layers with a sigmoid or softmax
function to output probabilities.
5. Model Training

• The model is trained using the binary cross-entropy loss function, which is suitable for
binary classification tasks.
• The Adam optimizer is employed for its efficiency in handling sparse gradients and
adaptive learning rates.
• The training is conducted over multiple epochs, with batch processing for computational
efficiency.
• Early stopping is utilized to halt training when validation accuracy stops improving,
preventing overfitting.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 10


Deepfake Voice Detection Methodology

6. Model Evaluation

• After training, the model is tested on unseen data to evaluate its performance. The
evaluation metrics include:
• Accuracy: Measures the overall correctness of predictions.
• Confusion Matrix: Provides a detailed breakdown of true positives, false positives, true
negatives, and false negatives.
Precision, Recall, and F1-Score: These metrics offer insight into the model's effectiveness in
identifying spoofed and bonafide voices, especially when the classes are imbalanced.
7. Visualization and Interpretation

• Loss and Accuracy Curves: Plotted for both training and validation sets to visualize
learning progression.
• Spectrogram Visuals: Used to understand what types of patterns the CNN is learning
to distinguish between real and fake audio.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 11


Deepfake Voice Detection Methodology

4.1 Block Diagram

FIG 2 :BLOCK DIAGRAM


4.2 System Architecture

FIG 3: SYSTEM ARCHITECTURE

DEPT OF ISE, SIR MVIT 2024-2025 Page | 12


Deepfake Voice Detection Methodology

4.3 Control Flow Diagram

FIG 4: CONTROL FLOW DIAGRAM

4.4 Data Flow Diagram

FIG 5: Level 1 Data Flow Diagram

DEPT OF ISE, SIR MVIT 2024-2025 Page | 13


Deepfake Voice Detection Methodology

4.5 Sequence Diagram / Activity Diagram

Fig 6 : ACTIVITY DIAGRAM

DEPT OF ISE, SIR MVIT 2024-2025 Page | 14


Deepfake Voice Detection Implementation Process

CHAPTER 5
IMPLEMENTATION PROCESS
5.1 System Requirements
• Processor: Intel Core i7/i9 (12th Gen) or AMD Ryzen 7/9
• RAM: 16GB or more
• GPU: NVIDIA RTX 3060 (6GB) or higher (RTX 4090 for high-end deep learning)
• Storage: 512GB NVMe SSD + 1TB HDD (for datasets & models)
• Operating System: Windows 11 / Ubuntu 20.04/ MacOS
• Software: Python 3.8+ with virtual environment support

5.2 Algorithm
Deep learning techniques are used in the proposed methodology to accurately distinguish between real
and fraudulent speech. This approach places a strong emphasis on feature extraction, preprocessing,
CNN-based model construction, and the analysis of speech patterns from carefully chosen datasets.
The steps listed below are as follows: detailed description of the research methodology used in this
study:

A. Data Selection
The ASVspoof2019 Logical Access (LA) dataset was chosen as the main dataset for this job. This
dataset, which includes both synthetic and real speech samples, is frequently used to detect bogus
speech. Audio files containing synthetic speech generated by different textto-speech (TTS) and
voice conversion (VC) methods are stored in the LA subset.
The dataset is divided into three subsets:

• Training set: Used for model learning and parameter tuning.


• Validation set: Used for model optimization and hyperparameter adjustments.
• Test set: Used to evaluate the model’s performance on unseen data.

The audio samples in the ASVspoof2019 Logical Access (LA) dataset are encoded in 16-bit, 16
kHz WAV format and have a fixed duration of 4 seconds per track. Preprocessing and
segmentation techniques are applied to guarantee efficient feature extraction for false speech
detection.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 15


Deepfake Voice Detection Implementation Process

B. Preprocessing and Feature Extraction


The ASVspoof2019 Logical Access (LA) dataset consists of audio samples stored in 16-bit, 16 kHz
WAV format, with each track having a fixed duration of 4 seconds. To ensure effective feature
extraction for fake speech detection, pre-processing and segmentation techniques are applied.
1) Preprocessing
Since speech signals vary in structure, segmentation is performed to create uniform input sizes
for feature extraction. The preprocessing steps include:

• Frame Segmentation: Each audio file is divided into 10 segments to ensure sufficient
temporal resolution.
• Sampling Consistency: Given a sample rate of 16,000 Hz and a track duration of 4
seconds, each track contains 64,000 samples (i.e., 16, 000 × 4). – Windowing: A
Hamming window is applied to each frame to reduce spectral leakage.
• Padding (if required): Zero-padding is applied to maintain uniform segment lengths
across all samples.
2) Feature Extraction
To extract meaningful representations of speech, MelFrequency Cepstral Coefficients
(MFCCs) are computed from each segmented frame. The extraction process includes:

• Number of MFCCs: 13 MFCCs are computed per frame.


• Fourier Transform: The Fast Fourier Transform (FFT) size is 2048, converting each
frame into the frequency domain.
• Hop Length: A hop length of 512 samples is used, determining the overlap between
consecutive frames.
• Mel-Scale Filtering: A filter bank is applied to mimic human auditory perception.
• Feature Matrix Construction: The final MFCC feature matrix consists of 13
coefficients per frame, serving as input to the deep learning model.
By structuring the input data into 10 uniform segments and extracting MFCC features, this
methodology ensures the effective differentiation of real and fake speech.
C. Model Architecture
To classify real and fake speech, we employ a Convolutional Neural Network (CNN)-based
model that processes MFCC feature matrices extracted from segmented audio. The model is
designed to capture both spectral and temporal patterns indicative of fake speech artifacts.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 16


Deepfake Voice Detection Implementation Process

1) Input Layer
Accepts an MFCC feature matrix of shape (time frames × 13 MFCC coefficients), reshaped for
2D convolutional processing.
2) Convolutional and Pooling Layers
The model employs multiple convolutional layers to extract features from speech signals:
a. First Conv Layer: A Conv2D layer with 32 filters and (3×3) kernel size applies
feature extraction to learn key speech patterns.
b. Max Pooling (2×2) with same padding reduces spatial dimensions while preserving
critical information.
c. Batch Normalization stabilizes training and accelerates convergence.
d. This structure is repeated across three convolutional layers, progressively refining
feature maps.
e. The third convolutional layer uses a (2×2) kernel to extract fine-grained speech details.
3) Flatten Layer
Converts D feature maps into a 1D vector for classification.
4) Fully Connected (Dense) Layers
a. A Dense layer with 64 neurons and ReLU activation further processes extracted
features.
b. A Dropout layer (0.3 probability) prevents overfitting.
5) Output Layer
A Dense layer with 2 neurons and Softmax activation produces probability scores for real vs.
fake speech classification.
This CNN-based architecture efficiently captures deep-fake speech artifacts while maintaining
computational efficiency.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 17


Deepfake Voice Detection Implementation Process

TABLE 1: SUMMARY OF THE CNN MODEL ARCHITECTURE


D. Training
The training process follows a structured pipeline to ensure optimal performance and
generalization of the CNN model for fake speech detection. The steps are as follows:
1) Data Splitting
The dataset is divided into three subsets:

• Training Set (55%): Used to optimize model weights.


• Validation Set (20%): Used for hyperparameter tuning and to monitor performance
during training.
Test Set (25%): Reserved for final evaluation to measure the model’s
generalization ability.
The prepare_datasets(0.25, 0.2) function handles this splitting, ensuring balanced
distribution across the three sets.
2) Model Compilation
The CNN model’s input shape is set to the retrieved MFCC feature dimensions, or (13, 13,
1). It is constructed using a convolutional architecture
with batch normalization and pooling layers andbuild_model(input_shape), For steady
convergence and adaptive learning, it uses the Adam optimizer with a learning rate of
0.0001. The output layer employs softmax activation for two-class classification, while the
sparse categorical cross-entropy loss function is employed. The primary test tool is an
accurate tracking of the model’s performance.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 18


Deepfake Voice Detection Implementation Process

1) Early Stopping
To prevent overfitting, an early stopping callback is applied. It monitors the validation loss
(val_loss) and stops training if no improvement is observed for a 5 consecutive epochs. The
restore_best_weights=True setting ensures that the model reverts to the best-performing
weights before stopping.
2) Training Process
To achieve the best possible balance between training speed and stability, the model is
trained for 30 epochs with a batch size of 32. When training, the validation set is used to
track performance and make dynamic learning adjustments.
3) Evaluation
After training, the model is tested on the hold-out test set using model.evaluate(X_test,
y_test). The test accuracy is printed to assess how well the model generalizes to unseen data.
By limiting overfitting and optimizing performance on tasks involving the categorization of
actual and false speech, this methodical methodology guarantees efficient training. Thus,
this is the suggested approach.

5.3 Mathematical Description


1. Feature Extraction (MFCC / Spectrograms):
S(m, n) = log(Σ|X(k)|² · Hₘ(k))
Where:
- X(k): FFT of the signal
- Hₘ(k): Mel filter bank
- m: Mel filter index
- n: time frame
2. Convolutional Neural Network (CNN):
Zᵢⱼ^(l) = σ(Σ Wₘₙ^(l) · Xᵢ₊ₘⱼ₊ₙ^(l-1) + b^(l))
Where:
- W: weights of layer l
- b: bias
- σ: activation function (ReLU)
- Z: output feature map
3. Binary Classification Output:

DEPT OF ISE, SIR MVIT 2024-2025 Page | 19


Deepfake Voice Detection Implementation Process

Sigmoid Activation:
ŷ = 1 / (1 + e^(-z))
Binary Cross Entropy Loss:
L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

5.4 Testing and Test Cases


1. Testing
Model Evaluation Metrics:

Accuracy = (TP + TN) / (TP + TN + FP + FN)


Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Confusion Matrix:

Predicted Fake Predicted Real

Actual Fake TP FN

Actual Real FP TN

2. Test Cases

Test Case ID Description Input Expected Output

TC01 Test real voice Real voice audio Label = 0 (Real)


sample (.flac)

TC02 Test spoofed voice Synthesized spoofed Label = 1 (Fake)


sample audio

TC03 Edge case: short- Very short clip (≤1 Proper classification
duration audio sec)

TC04 Noise handling Real audio with Still predicts "Real"


background noise

TC05 Dataset corruption Malformed audio file Handled gracefully

TC06 Empty file input Empty file Error or skip

DEPT OF ISE, SIR MVIT 2024-2025 Page | 20


Deepfake Voice Detection Results and Discussion

CHAPTER 6
RESULTS AND DISCUSSION
6.1 Dataset Samples
The dataset used in this study is the ASVspoof 2019 Logical Access (LA) dataset, a publicly
available benchmark that includes both real and fake speech samples. These samples are
generated using various Text-to-Speech (TTS) and Voice Conversion (VC) methods. Each
audio file is:

• Encoded in 16-bit, 16 kHz WAV format


• 4 seconds in duration
• Segmented into 10 uniform frames for processing
The dataset is divided into:

• Training set: for model training and parameter learning


• Validation set : for hyperparameter tuning
• Test set: for final evaluation on unseen data
6.2 Results
The CNN-based fake speech detection model was evaluated using standard classification
metrics: accuracy, precision, recall, and F1-score. The results indicate high effectiveness in
identifying deepfake audio signals.

Performance Metrics

• Accuracy: 97%
• Precision: High, indicating few false positives
• Recall: High, suggesting that most fake audio samples were correctly identified
• F1-Score: A strong balance between precision and recall
These metrics were derived from predictions on the test dataset, which consisted of previously
unseen samples. The model was particularly effective in identifying subtle artifacts in synthetic
audio that are often missed by human listeners.

Fig 7 Performance Matrix

DEPT OF ISE, SIR MVIT 2024-2025 Page | 21


Deepfake Voice Detection Results and Discussion

Fig 8 Confusion Matrix

Learning Curves

• Figure 1: Accuracy Curve – Demonstrates consistent increase in model accuracy


during training and validation phases
• Figure 2: Loss Curve – Shows decreasing trend in both training and validation loss,
confirming that the model learned efficiently without significant overfitting

Fig 9 Accuracy Curve

DEPT OF ISE, SIR MVIT 2024-2025 Page | 22


Deepfake Voice Detection Results and Discussion

Fig 10 Loss Curve

6.3 Result Analysis

The model’s performance reflects its capacity to effectively differentiate between real and
synthetic audio samples. Several key observations and insights emerged from the evaluation:

Strengths

• High detection rate for various forms of synthetic audio


• Robust performance across multiple attack methods, including TTS and VC
• Effective learning from a relatively moderate-sized dataset
• Well-balanced precision and recall, indicating reliable performance
Limitations

1. Performance degradation on compressed audio


o Compression artifacts may hide the subtle clues that help identify deepfake audio.
o This affects real-world deployment where audio might be compressed (e.g., mobile
networks, messaging apps).
2. Misclassification of high-quality synthetic voices
o Some sophisticated TTS systems generate audio that is very close to human speech.
o The model sometimes struggles to identify these as fake.
3. Computational demands
o While efficient on GPUs, running the model on resource-constrained devices (like
smartphones) may not yield real-time results.
o Optimization techniques such as model pruning, quantization, or distillation can be
applied in future work to address this.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 23


Deepfake Voice Detection Results and Discussion

Future Improvements

• Use of GANs for adversarial training to increase robustness


• Integration of Transformer-based architectures for improved contextual understanding
• Adding prosodic and rhythm-based features to complement MFCC-based inputs
• Exploring multi-modal detection (e.g., combining text and audio) for higher reliability
Comaparative analysis with other models
To highlight the superiority of the proposed model, we compare its performance with existing
deepfake detection methods wrt to accuracy and Equal Error Rate (EER).
• The proposed model outperforms Res-Net and VGG models by a fair bit.
• ResNet models struggle with generalization, particularly when tested on unseen datasets.

Study Accuracy (%) EER (%)


Chinguun Purevdagva et al. 59.2 40.8
[8] approach
Kai Li et al. [20] approach 63.82 36.18
J. Khochare et al. [13] 67.0 33.0
approach
R. Reimao, V. Tzerpos [2] 71.47 28.53
approach
Lin Zhang et al. [15] 83.0 17.0
approach
Transformer Encoder [12] 90.43 9.57
approach
Ameer Hamza et al. [16] 93.1 6.9
approach
Sahar Abdullah Al Ajmi et al. 94.2 5.8
[26] approach
Our Model 97.0 3
Table 2: Comparison Of Accuracy And Equal Error Rate (EER) With Similar Studies

6.4 Summary

This project presented a detailed analysis of the model's performance in detecting fake audio.
The proposed CNN-based approach achieved 97% accuracy, outperforming several state-of-
the-art methods.

Key conclusions:

• Our best performing model, single-task variant of CNN, achieves a macro F1 score
of 97.61 on the validation set. The model can be further applied on augmented data
to enhance generalisation and will prove helpful in deployment. The loss in
evaluation metrics is above the human observation level of about 85%.
• Due to the usage of an efficient CNN architecture, the processing time is very low,
from feeding input to generating the result. These models need fewer than 50,000
parameters and have around 100 KB memory footprint. This is highly commendable
in the field of Audio Signal Processing.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 24


Deepfake Voice Detection Results and Discussion

• The model is highly effective at identifying fake audio generated using modern TTS
and VC methods.
• Performance is competitive when benchmarked against recent literature.
• Certain limitations exist, particularly regarding compressed audio and high-end
synthetic voices.
• Future improvements can significantly boost the generalizability and efficiency of
the system, especially for deployment in real-world applications.
Overall, the results validate the success of the proposed methodology and highlight its potential
as a scalable and reliable tool for combating audio-based misinformation and identity fraud.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 25


Deepfake Voice Detection Conclusion and Future Scope

CHAPTER 7
CONCLUSION AND FUTURE SCOPE
7.1 Conclusion
This project developed a CNN-based Fake Speech Detection system that effectively
distinguishes between genuine and manipulated audio samples. By employing data
augmentation techniques, we enhanced the model’s robustness, enabling better generalization
to various audio manipulations. As the prevalence of fake speech increases, this research
underscores the need for advanced methodologies to combat misinformation in audio
communications. Future work will focus on refining model architectures and incorporating
larger, more diverse datasets to further enhance detection capabilities, contributing to audio
forensics and security efforts.
The rise of synthetic voice generation through advanced AI techniques like Text-to-Speech
(TTS) and Voice Conversion (VC) presents significant threats to digital security, identity
verification, and media authenticity. This project presents a Convolutional Neural Network
(CNN)-based fake speech detection model trained on the ASVspoof 2019 dataset. By utilizing
MFCCs and spectrogram-based features, the model achieved a remarkable accuracy of 97%
and an F1-score of 97.61%, demonstrating its effectiveness in identifying fake speech even
when the differences are imperceptible to the human ear.
Data augmentation techniques like pitch shifting, noise addition, and time-stretching were
employed to improve model generalization, significantly enhancing robustness against unseen
and adversarial audio samples. The lightweight nature of the model—with under 100 KB
memory footprint—makes it viable for deployment in real-world environments, including
mobile and embedded systems.
Comparative studies also showed the superiority of this model over traditional and some state-
of-the-art techniques such as ResNet, VGG, and Transformer encoders. Despite minor
limitations like performance drop on highly compressed audio or near-perfect synthetic voices,
the system provides a reliable, efficient, and scalable solution for the growing problem of voice-
based deepfake attacks.
7.2 Future Scope
1. Real-Time and Edge Deployment

• Optimize the current model using techniques like quantization, model


pruning, and knowledge distillation to ensure compatibility with low-
power devices.

• Implement lightweight versions of the model for real-time use in mobile


apps, smart home devices, and voice-controlled systems.
2. Transformer and Attention-Based Architectures

DEPT OF ISE, SIR MVIT 2024-2025 Page | 26


Deepfake Voice Detection Conclusion and Future Scope

• Explore advanced models such as Vision Transformers (ViT), Audio


Spectrogram Transformers (AST), or multimodal transformers to capture
complex temporal patterns and context dependencies in audio.
3. Robustness Against Compression and Noise

• Fine-tune the system using datasets with varying compression levels


(MP3, AAC) and environmental noise to improve detection in realistic
communication channels like calls and online meetings.

4. Cross-Language and Accent Generalization

• Train the system with multilingual datasets to ensure the model


generalizes well across different accents, dialects, and languages,
especially for global-scale applications.

5. Adversarial Training and GAN Resistance

• Incorporate adversarial examples and Generative Adversarial Network


(GAN)-based spoofed voices during training to improve resistance to
evolving attack techniques.

6. Multimodal Deepfake Detection

• Extend the framework by fusing audio with visual cues (lip-sync analysis,
facial expressions) and text transcripts (Natural Language Processing) to
provide holistic deepfake detection.

7. Cloud-Based Forensic Tools

• Develop an API or cloud-based platform for law enforcement and media


agencies to verify audio authenticity, particularly useful for legal
proceedings, cybercrime analysis, and journalism.

8. Integration with Biometric and Authentication Systems

• Collaborate with developers of voice biometrics and AI-based


authentication systems to integrate the detection engine for real-time
spoof prevention in banking, customer service, and smart assistants.

9. User-Interactive Interfaces

• Build user-friendly interfaces or browser extensions for journalists,


content creators, or educators to detect and visualize potential fake audio
in media files.

10. Ethical Auditing and Regulation Support

• Assist in developing ethical standards, regulatory guidelines, and


certification systems by providing explainable AI outputs and verifiable
proof of speech authenticity.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 27


Deepfake Voice Detection References

CHAPTER 8
REFERENCES
[1] Li, Yang, et al. “Universal voice conversion.” In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 3526-3535. 2019.
[2] Reimao, R., and V. Tzerpos. “FOR: A dataset for synthetic speech detection,” in 2019
International Conference on Speech Technology and Human-Computer Dialogue (SpeD).
IEEE, pp. 1–10.
[3] Korshunov, Pavel, and S. Marcel. “Deepfake detection using inverse contrastive loss.” In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Workshops, pp. 3920-3929. 2020.
[4] Prakash, Shreya, et al. “A comprehensive study on deep fake audio detection.” In
Proceedings of the 2020 ACM Workshop on Information Hiding and Multimedia Security,
pp. 81-90.
[5] Subramani, N. and D. Rao, “Learning efficient representations for fake speech detection,”
in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp.
5859–5866, 2020.
[6] Capoferri, D., et al., “Speech audio splicing detection and localization exploiting
reverberation cues,” in 2020 IEEE International Workshop on Information Forensics and
Security (WIFS). IEEE, pp. 1–6.
[7] Wijethunga, R., et al., “Deepfake audio detection: a deep learning-based solution for
group conversations,” in 2020 2nd International conference on advancements in
computing (ICAC), vol. 1. IEEE, pp. 192–197.
[8] Purevdagva, C., et al., “A machine-learning based framework for detection of fake
political speech,” in 2020 IEEE 14th International Conference on Big Data Science and
Engineering (BigDataSE). IEEE, pp. 80–87.
[9] Mukhopadhyay, Rudrabha, et al. “A comprehensive survey of voice conversion and deep
fake techniques.” arXiv preprint arXiv:2103.03230, 2021.
[10] Xie, Jin, et al. “Voice Deep Guard: Towards Intelligent Voice Deepfake Detection.” In
Proceedings of the 28th ACM International Conference on Multimedia, 2021.
[11] Ballesteros, D. M., et al., “Deep4snet: deep learning for fake speech classification,” Expert
Systems with Applications, vol. 184, p. 115465, 2021.
[12] Zhang, Z., et al., “Fake speech detection using residual network with transformer
encoder,” in Proceedings of the 2021 ACM workshop on information hiding and
multimedia security, pp. 13–22.
[13] Khochare, J., et al., “A deep learning framework for audio deepfake detection,” Arabian
Journal for Science and Engineering, pp. 1–12, 2021.
[14] Pasupathi, Panupong, and Taxing Li. “Detecting AI-Generated Text with BERT.” In
Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, pp. 672-684.
[15] Zhang, L., et al., “The partialspoof database and countermeasures for the detection of short
fake speech segments embedded in an utterance,” IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 813–825, 2022.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 28


Deepfake Voice Detection References

[16] Hamza, A., et al., “Deepfake audio detection via MFCC features using machine learning,”
IEEE Access, vol. 10, pp. 134018–134028, 2022.
[17] Shaaban, O. A., et al., “Audio deepfake approaches,” IEEE Access, vol. 11, pp. 132652–
132682, 2023.
[18] Albazony, A. A. M., et al., “Deepfake videos detection by using recurrent neural network
(RNN),” in 2023 Al-Sadiq International Conference on Communication and Information
Technology (AICCIT). IEEE, pp. 103–107.
[19] Bansal, K., et al., “Deepfake detection using CNN and DCGANs to drop-out fake
multimedia content: a hybrid approach,” in 2023 International Conference on IoT,
Communication and Automation Technology (ICICAT). IEEE, pp. 1–6.
[20] Li, K., et al., “Contributions of jitter and shimmer in the voice for fake audio detection,”
IEEE Access, vol. 11, pp. 84689–84698, 2023.
[21] Pham, L., et al., “Deepfake audio detection using spectrogram-based feature and ensemble
of deep learning models,” in 2024 IEEE 5th International Symposium on the Internet of
Sounds (IS2). IEEE, pp. 1–5.
[22] Paramarthalingam, A., et al., “A deep learning model to assist visually impaired in pothole
detection using computer vision,” Decision Analytics Journal, vol. 12, p. 100507, 2024.
[23] Basha, S. A. Y., and K. U. Priya, “Recognition of deep fake voice acoustic using ensemble
bagging model,” in 2024 5th International Conference on Electronics and Sustainable
Communication Systems (ICESC). IEEE, pp. 1211–1217.
[24] Xue, J., et al., “Dynamic ensemble teacher-student distillation framework for light-weight
fake audio detection,” IEEE Signal Processing Letters, 2024.
[25] Deng, J., et al., “VFD-Net: Vocoder fingerprints detection for fake audio,” in ICASSP
2024 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, pp. 12151–12155.
[26] Al Ajmi, S. A., et al., “Faked speech detection with zero prior knowledge,” Discover
Applied Sciences, vol. 6, no. 6, p. 288, 2024.
[27] Mathew, J. J., et al., “Towards the development of a real-time deepfake audio detection
system in communication platforms,” arXiv preprint arXiv:2403.11778, 2024.
[28] Song, D., et al., “Anomaly detection of deepfake audio based on real audio using
generative adversarial network model,” IEEE Access, 2024.
[29] Kang, J. Y., et al., “FADEL: Uncertainty-aware fake audio detection with evidential deep
learning,” in ICASSP 2025 IEEE International Conference on Acoustics, Speech and
Signal Processing. IEEE, pp. 1–5.
[30] Dixit, Y., et al., “Fake news detection of live media using speech to text conversion,” in
2021 Innovations in Power and Advanced Computing Technologies (i-PACT). IEEE, pp.
1–5.

DEPT OF ISE, SIR MVIT 2024-2025 Page | 29

You might also like