VISVESVARAYA TECHNOLOGICAL UNIVERSITY

Jnana Sangama, Belagavi-590010

PROJECT WORK
21ISP76

Deepfake Voice Detection Using Machine Learning


Submitted in partial fulfillment of the requirements for the Eighth Semester of

BACHELOR OF ENGINEERING
IN
INFORMATION SCIENCE AND ENGINEERING
For the Academic Year 2024 - 2025
Submitted by:
Moksha Prada P
1MV22IS402
Under the guidance of

Ms. Sowjanya Lakshmi A


Asst. Professor, Department of ISE

DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING


SIR M. VISVESVARAYA INSTITUTE OF TECHNOLOGY
Krishnadevaraya Nagar, International Airport Road,
Hunasmaranahalli, Bengaluru – 562157
DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING

CERTIFICATE
It is certified that the PROJECT WORK [21ISP76] entitled “Deepfake Voice Detection Using
Machine Learning” is carried out by 1MV22IS402 – Moksha Prada P, a bonafide student of Sir
M. Visvesvaraya Institute of Technology, in partial fulfilment of the 8th semester requirements
for the award of the Degree of Bachelor of Engineering in Information Science and Engineering
of Visvesvaraya Technological University, Belagavi, during the academic year 2024-2025. It is
certified that all corrections and suggestions indicated for Internal Assessment have been
incorporated in the report deposited in the department library. The project report has been approved
as it satisfies the academic requirements in respect of project work prescribed for the course of
Bachelor of Engineering.

Ms. Sowjanya Lakshmi A        Dr. G. C. Bhanu Prakash       Prof. S. G. Rakesh
Assistant Professor,          Head of Department,           Principal,
Dept. of ISE, Sir MVIT        Dept. of ISE, Sir MVIT        Sir MVIT
Bengaluru – 562157            Bengaluru – 562157            Bengaluru – 562157

Examination:
Name of Examiner Signature with Date

1)

2)
DECLARATION
We hereby declare that the entire project work embodied in this dissertation has been carried out
by us and no part has been submitted for any degree or diploma of any institution previously.

Place: Bengaluru
Date:

Signature of Student

Moksha Prada P 1MV22IS402

ACKNOWLEDGMENT

It gives us immense pleasure to express our sincere gratitude to the management of Sir M.
Visvesvaraya Institute of Technology, Bengaluru for providing the opportunity and the
resources to accomplish our project work in their premises.

On the path of learning, the presence of an experienced guide is indispensable, and we would like
to thank our guide Ms. Sowjanya Lakshmi A, Asst. Professor, Dept. of ISE, for her invaluable
help and guidance.

Heartfelt and sincere thanks to Dr. G. C. Bhanu Prakash, Prof. and Head, Dept. of ISE,
for his suggestions, constant support and encouragement.

We would also like to convey our regards to Prof. S. G. Rakesh, Principal, Sir MVIT for
providing us with the infrastructure and facilities needed to develop our project.

We would also like to thank the staff of Department of Information Science and Engineering and
lab-in-charges for their co-operation and suggestions. Finally, we would like to thank our Parents
and friends for their help and suggestions without which completing this project would not have
been possible.

ABSTRACT
Deep learning has made significant strides in audio synthesis, making it increasingly
difficult to distinguish authentic speech from fake speech. This study develops a robust fake
speech detection system targeting voice-conversion and speech-synthesis attacks, with a focus
on Logical Access (LA) threats. To increase dataset size and improve model generalization,
the system combines a deep learning model with data augmentation techniques such as time
stretching, pitch shifting, and volume scaling.

Preprocessing includes normalization, noise reduction, and segmentation of the audio into
uniform 4-second frames. Mel spectrograms, computed via the Fast Fourier Transform (FFT)
and normalized using Z-normalization, are used as feature representations. The proposed
architecture is a multi-layer convolutional model with 2D and 1x1 convolutions, batch
normalization, max-pooling, ReLU activation, and fully connected layers. Dropout and other
regularization techniques are employed to strengthen the model’s resistance to overfitting.

The ASVspoof 2019 corpus was used for training and testing, with additional variants to
simulate real-world conditions. Classification behavior was examined using the confusion
matrix together with accuracy, precision, recall, F1-score, and ROC-AUC. The results show
that the system distinguishes real from fake speech with high detection accuracy.

This work makes a significant contribution to voice-based security systems by offering a
scalable, practical, and broadly applicable defense against emerging audio spoofing threats.

CONTENTS

1  Introduction
   1.1  Overview
   1.2  Organization of Report
2  Literature Review
3  Problem Statement and Objectives
   3.1  Problem Statement
   3.2  Objectives
   3.3  Significance of the Project Work
4  Methodology
   4.1  Block Diagram
   4.2  System Architecture
   4.3  Control Flow Diagram
   4.4  Data Flow Diagram
   4.5  Sequence Diagram / Activity Diagram
5  Implementation
   5.1  System Requirements
   5.2  Algorithms / Pseudocodes
   5.3  Mathematical Description
   5.4  Testing and Test Cases
6  Results and Discussion
   6.1  Dataset Samples
   6.2  Results
   6.3  Result Analysis
   6.4  Summary
7  Conclusion and Future Scope
   7.1  Conclusion
   7.2  Future Scope
8  References

LIST OF FIGURES

Fig. 1   Mel Spectrogram and MFCC Generation
Fig. 2   Block Diagram
Fig. 3   System Architecture
Fig. 4   Control Flow Diagram
Fig. 5   Level 1 Data Flow Diagram
Fig. 6   Activity Diagram
Fig. 7   Performance Matrix
Fig. 8   Confusion Matrix
Fig. 9   Accuracy Curve
Fig. 10  Loss Curve

LIST OF TABLES

Table 1  Summary of the CNN Model Architecture
Table 2  Comparison of Accuracy and Equal Error Rate (EER) with Similar Studies

CHAPTER 1
INTRODUCTION
1.1 Overview

With the development of AI, deepfake technology has improved to the point where it can
create remarkably lifelike synthetic voices. Because these AI-generated voices can sound very
much like real people, they raise serious concerns about fraud, deception, and cybersecurity.
Abuse of fake speech has taken various forms, from defeating voice-based identification
systems to carry out fraudulent transactions, to impersonating politicians or celebrities. Because
synthetic speech technology is developing so quickly, it is essential to build efficient detection
methods that differentiate between real and fake voices. Traditional voice authentication
defenses, such as rule-based approaches and human inspection, are no longer strong enough
against these attacks, necessitating the use of advanced machine learning algorithms.

Our focus is on developing a deep learning-based synthetic speech detection system that can
distinguish between artificial and genuine speech. The ASVspoof 2019 benchmark, which
includes a comprehensive collection of spoofed and real speech samples, is used in our
approach. We employ spectrograms and Mel-Frequency Cepstral Coefficients (MFCCs) to
extract features that capture relevant speech patterns. A Convolutional Neural Network
(CNN), a deep learning architecture well suited to recognizing patterns in audio and visual
data, is then used to examine the extracted features.

Since our model is an offline detector, it is more practical in real-world situations than
models that require online processing. This is helpful in settings such as secure facilities and
forensic investigations where internet access is limited. The approach is also applicable to
voice verification systems, media authentication, and cybersecurity, where speech deepfake
detection is essential to combating disinformation and identity theft.

1.2 Organization of Report

This report provides a comprehensive explanation of the proposed deepfake voice detection
system. It begins with the dataset selection, specifically highlighting the use of the ASVspoof
2019 dataset, which offers a wide range of real and spoofed speech samples. Next, the
preprocessing methods are detailed, including the conversion of raw audio into Mel
spectrograms, noise reduction, and clarity enhancement steps. Following this, the model
architecture is described, focusing on the implementation of the CNN to process the extracted
audio features.
The report then outlines the training procedures employed to optimize the model and
the evaluation criteria used to assess performance. The results showcase the system's
effectiveness in distinguishing synthetic speech, underlining its potential in real-world security
applications. Finally, the report discusses future directions, such as enhancing the model for
real-time detection to increase its practical utility across various domains.


CHAPTER 2
LITERATURE REVIEW

The rise of deepfake technologies has led to significant research efforts in detecting
synthetic speech. Reimao and Tzerpos [2] introduced the FOR dataset for synthetic speech
detection in 2019, emphasizing data quality. Their methodology involved using a
comprehensive dataset to train detection models, focusing on spectral features to improve
model performance. However, their approach faced challenges in detecting highly
sophisticated fake speech and small distortions in speech. In 2020, Subramani and Rao [5]
developed efficient neural representations for fake speech detection, leveraging
autoencoders and deep learning models. This method improved detection performance by
learning representations that generalized well across various datasets, though the model’s
dependency on large labelled datasets remains a limitation. Wijethunga et al. [7] applied
deep learning techniques to group conversations, detecting deepfake audio by analysing
group interactions. The accuracy of their model improved significantly over traditional
methods, but real-time processing in multi-speaker environments was still a challenge.
Capoferri et al. [6] used reverberation cues to detect audio splicing, which helped improve
detection accuracy for manipulated speech. However, their method struggles with detecting
speech segments that are minimally altered or highly coherent, limiting its application.
Purevdagva et al. [8] introduced a machine-learning framework for detecting fake
political speech, employing multiple feature extraction techniques. Their methodology
combined prosodic, spectral, and phonetic features to detect inconsistencies in speech.
While this method achieved high accuracy, its limitation lies in the fact that it may not
perform as well on non-political speech or in environments with background noise. Hamza
et al. [16] applied MFCC features in combination with machine learning models for
deepfake audio detection, achieving high detection accuracy. However, their model is
sensitive to variations in environmental noise, leading to potential inaccuracies in real-
world scenarios. Zhang et al. [12] employed residual networks with transformer encoders
to detect fake speech, significantly improving the model’s accuracy in detecting subtle
distortions. This method is particularly effective for detecting deepfake speech at higher
quality levels. However, the model is computationally expensive, making it less suitable
for low-resource environments.

Mukhopadhyay et al. [9] provided a comprehensive survey on voice conversion and deepfake
techniques, offering insights into various methods and their effectiveness.
This work is valuable in providing a broader perspective on deepfake detection; however,
its limitation lies in the lack of empirical testing and evaluation of the surveyed methods.
Xie et al. [10] introduced Voice DeepGuard, a model that combines adversarial networks
and machine learning to detect synthetic voices. The methodology demonstrated high
robustness across different types of synthetic voices but had limitations when it came to
detecting synthetic voices with minute alterations. Prakash et al. [4] proposed a
comprehensive study on deepfake audio detection using an ensemble of multiple deep
learning models. Their methodology improved detection accuracy, but their system faced
difficulties with real-time detection in large datasets, making it impractical for real-time
applications.
Li et al. [1] explored universal voice conversion techniques for improving the
adaptability of deepfake detection systems. Their methodology demonstrated effectiveness
across different speakers and speech conditions but struggled with low-quality audio or
speech distortions. Pasupathi and Li [14] presented a method for detecting AI-generated
text through BERT, which provided valuable cross-domain insights, although the model
was specifically designed for text and does not directly address deepfake audio detection.
In 2022, Zhang et al. [15] introduced a model for detecting fake speech segments embedded
within utterances. Their method, which utilized deep learning techniques, showed promise
in detecting short, fake segments, though it had limitations in dealing with longer fake
speech passages. Ballesteros et al. [11] presented Deep4snet, a deep learning-based model
for fake speech classification. Their model was highly accurate in detecting fake speech;
however, it showed reduced performance with low-quality or highly compressed audio.
In 2023, Song et al. [28] proposed an anomaly detection method for deepfake audio,
employing generative adversarial networks (GANs) to detect discrepancies between real
and fake speech. The method was effective but had limitations when dealing with
adversarial deepfake speech generated through advanced techniques. Basha and Priya [23]
applied an ensemble bagging model for deepfake voice recognition, offering improved
performance with diverse audio sources. However, the method was computationally
intensive, which could hinder its application in real-time systems.
Korshunov and Marcel [3] presented a deepfake detection method using inverse
contrastive loss, which improved the model’s ability to differentiate real and fake speech
through contrastive training strategies. This approach enhanced model robustness but
required careful tuning of contrastive learning parameters and still lacked resilience against
highly realistic fake samples. Khochare et al. [13] developed a deep learning framework
for audio deepfake detection, combining feature extraction and classification in a single
pipeline. Their model achieved commendable accuracy across various datasets, though it
faced limitations with language diversity and accents.
Shaaban et al. [17] offered a comprehensive analysis of various audio deepfake
approaches, highlighting vulnerabilities and suggesting layered countermeasures. While
their work served as a valuable reference for system design, it lacked experimental
validation and real-world benchmarking. Albazony et al. [18] examined the use of
recurrent neural networks (RNNs) for detecting deepfake videos, and although their focus
was visual, the temporal modeling techniques had implications for sequential audio
detection. However, the adaptation of such models to the audio domain requires further
investigation.
Bansal et al. [19] proposed a hybrid deepfake detection approach using
convolutional neural networks and DCGANs to suppress fake multimedia content,
including audio. While the method showed strong performance, it struggled with
overfitting and required significant processing power. Li et al. [20] studied voice
characteristics like jitter and shimmer to detect fake audio at the acoustic signal level. Their
approach offered interpretability and high accuracy for certain synthetic voices but faltered
when faced with more complex deepfake generation methods.
Paramarthalingam et al. [22] introduced a deep learning model for detecting
potholes for visually impaired assistance, which, while not directly related to deepfake
audio, demonstrated the broader applications of audio analysis and environment-aware
machine learning that could inspire future detection architectures. Xue et al. [24] proposed
a dynamic ensemble distillation framework using teacher-student models to create
lightweight deepfake detectors. The solution showed promise for resource-constrained
environments but required substantial offline training with large teacher models.
Deng et al. [25] presented VFD-Net, a vocoder fingerprint-based deepfake
detection system that leverages unique spectral artifacts introduced during synthetic voice
generation. While the model achieved state-of-the-art results in vocoder-based detection,
it performed poorly on end-to-end models designed to bypass vocoder artifacts. Al Ajmi
et al. [26] introduced a zero prior knowledge approach that enabled detection of fake
speech without needing prior examples of fake audio. This unsupervised technique
expanded the detection landscape but suffered from lower accuracy and higher false
positives.
Mathew et al. [27] focused on real-time detection of deepfake audio in
communication platforms, addressing latency and integration challenges. Their model was
efficient for real-time streaming but still faced trade-offs in terms of detection granularity
and scalability. Kang et al. [29] proposed FADEL, an uncertainty-aware fake audio
detection system using evidential deep learning. FADEL enhanced reliability by
quantifying prediction confidence, though its complexity and resource needs may hinder
deployment in lightweight or embedded systems.
More recently, Pham et al. [21] focused on spectrogram-based features in
combination with deep learning for deepfake audio detection, improving detection
accuracy but struggling to handle background noise or low-quality recordings. Dixit et al.
[30] employed speech-to-text conversion to detect fake news in live media, using deep
learning for improved accuracy. Their method, while highly effective for specific cases,
faced challenges in handling speech with varying accents or speech from non-native
speakers.
Accuracy in deepfake audio detection is largely contingent on the quality and
variety of training data, the model architecture, and the feature extraction methods. While
some models achieve high accuracy, especially on high-quality deepfakes, real-time
detection and detection on low-resource devices remain significant challenges. Moreover,
limitations in detecting advanced deepfake techniques, such as those involving minute
audio distortions or adversarial methods, persist.
Future Scope lies in improving the adaptability of detection models, enabling
real-time detection on mobile and embedded devices, and reducing dependency on large
datasets. Exploring unsupervised learning methods, reducing computational overhead,
and improving cross-domain detection (e.g., speech-to-text deepfake detection) will be
crucial in expanding the applicability of deepfake voice detection systems.


CHAPTER 3
PROBLEM STATEMENT AND OBJECTIVES
3.1 Problem Statement
The emergence of deepfake technology, which is driven by developments in machine
learning (ML) and artificial intelligence (AI), has made it extremely difficult to preserve the
integrity and authenticity of multimedia material. It becomes more challenging to distinguish
between authentic and fraudulent information when deepfakes modify audio in a way that
closely resembles the voices and looks of actual people, frequently with great accuracy.

Key Challenges:
• Increasing Sophistication of Deepfake Algorithms:
Contemporary deepfake generation approaches rely on GANs (Generative Adversarial
Networks) and other AI models that can produce remarkably lifelike fake media. By
exploiting minute nuances in audio signals and frames, these systems create forgeries whose
irregularities are difficult even for attentive humans to spot.
• Threats to Digital Trust and Security:
Deepfake content has been used as a weapon for identity theft, personal defamation,
and disinformation operations. This has sparked worries in a variety of fields where
confidence in digital material is crucial, such as social media, politics, law enforcement,
and the media.
• Limited Effectiveness of Traditional Detection Methods:
Because deepfake technologies are dynamic and constantly evolving, traditional detection
techniques that depend on static characteristics or heuristic-based algorithms are unable to
keep up. Most current methods are also susceptible to cross-modal manipulations because
they concentrate on a single modality, such as audio analysis alone.
• Scalability and Generalization Issues:
In real-world applications, deepfake detection methods are less successful because they
often fail to generalize to new manipulation techniques or unseen datasets. Identifying
deepfakes in real time in high-resolution or live-streamed content is also computationally
demanding.

3.2 Objectives

1. Identification of utterance: The primary objective of the Fake Speech Detection project is
to distinguish fake speech utterances from bonafide (authentic) ones. The project should prove
viable in detecting Logical Access attacks such as text-to-speech (TTS) and voice conversion
(VC).
2. Extension of the ASVspoof 2019 Dataset: We intend to increase the number of training
examples in our dataset by 5 to 10 times by applying audio signal processing and speech
augmentation techniques to the existing data, such as time shifting, time stretching, pitch
scaling, and noise addition (a minimal augmentation sketch is given after this list). This will
make the model more robust and improve its generalization capabilities.
3. Performance Assessment: After the model is built, we will assess the performance of the
proposed model against established benchmarks in fake speech detection, focusing on metrics
such as precision, recall, and F1-score. The model will be evaluated separately on the original
dataset, the augmented dataset, and both combined, and compared against studies involving
similar models and datasets.
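
To make Objective 2 concrete, a minimal augmentation sketch is given below. It assumes librosa
and NumPy (consistent with the tools used elsewhere in this report); the file name, stretch rate,
pitch step, and noise level are illustrative placeholders rather than the exact values used in our
experiments.

import numpy as np
import librosa

def augment(y, sr):
    """Return simple augmented variants of one waveform."""
    stretched = librosa.effects.time_stretch(y, rate=1.1)       # time stretching
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # pitch scaling
    noisy = y + 0.005 * np.random.randn(len(y))                 # noise addition
    rolled = np.roll(y, int(0.1 * sr))                          # time shifting
    return [stretched, shifted, noisy, rolled]

y, sr = librosa.load("sample.flac", sr=16000)                   # placeholder file name
augmented_clips = augment(y, sr)

Applying these four transforms to every clip yields roughly a five-fold increase in training
examples, in line with the 5x to 10x target stated above.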

3.3 Significance of the Project Work

• Mitigates Misinformation Risks:
With the increasing use of synthetic voices in spreading fake news, executing scams,
and conducting impersonation attacks, this project plays a crucial role in identifying
and preventing the misuse of AI-generated audio content.
• Enhances Security in Voice-Based Systems:
Voice authentication systems, such as those used in banking or smart home devices,
are vulnerable to spoofing attacks. This project adds a significant layer of protection
by accurately detecting deepfake voices, thereby strengthening overall system
security.
• Supports Legal and Ethical Standards:
Deepfake voice detection is vital for maintaining the integrity of digital evidence in
legal proceedings and ensuring ethical usage of audio content in media and
communication.
• Promotes Trust in Media Content:
The ability to verify the authenticity of audio enhances consumer trust in media,
journalism, and broadcasting. It helps content creators and consumers distinguish
between real and manipulated voice recordings.

• Real-Time Detection Capability:

The machine learning model developed in this project is capable of near real-time
detection, which is essential for immediate response and practical deployment in real-
world scenarios.
• Scalable and Adaptable Framework:
The system is designed to be scalable and adaptable, allowing future integration
with additional audio manipulation detection mechanisms, making it relevant for
long-term applications across various industries.


CHAPTER 4
METHODOLOGY
The methodology adopted in this project encompasses a systematic sequence of steps
designed to develop a reliable machine learning model capable of detecting deepfake (spoofed)
voice recordings. Each stage, from data preprocessing to model evaluation, is carefully designed
to ensure robustness and accuracy in detecting fake voice inputs.

Dataset Collection
The dataset used in this project is the ASVspoof 2019 dataset, a widely recognized
benchmark dataset for automatic speaker verification and spoofing countermeasures. It
contains audio samples of both bonafide (genuine human speech) and spoofed (fake)
speech generated using various voice synthesis (TTS) and voice conversion (VC)
techniques. This dataset helps simulate real-world scenarios and provides a diverse set of
samples that are essential for training and testing the model effectively.

Preprocessing and Feature Extraction


Raw audio signals cannot be directly fed into machine learning models, especially deep
learning networks. Therefore, the following preprocessing steps are performed:
• Audio Loading: Each .flac or .wav file is loaded using the librosa library, which
also helps in resampling the audio at a consistent sampling rate.
• Spectrogram Generation: The log-mel spectrogram is extracted from each
audio file. It captures the frequency domain features by applying the Mel scale,
which aligns more closely with how humans perceive sound.
• MFCC (Mel Frequency Cepstral Coefficients): MFCCs are also extracted as
they are known to represent the timbral texture of speech and are widely used in
speech recognition and spoof detection.

Fig 1: Mel Spectrogram and MFCC Generation

These extracted features are converted into 2D image-like arrays, suitable for feeding into
convolutional neural networks (CNNs).
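
A minimal sketch of this extraction step, assuming the librosa library mentioned above, is shown
below; the file name and the mel/FFT settings are illustrative assumptions rather than the
project's fixed configuration.

import numpy as np
import librosa

# Load and resample one clip at a consistent rate (file name is a placeholder)
y, sr = librosa.load("LA_T_example.flac", sr=16000)

# Log-mel spectrogram: Mel-scale filter bank applied to the FFT magnitudes
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)

# 13 MFCCs capturing the timbral texture of the speech
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=2048, hop_length=512)

# Add a channel axis so the 2D array can be fed to a CNN
cnn_input = mfcc[..., np.newaxis]
print(log_mel.shape, cnn_input.shape)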

Data Preparation
Once the features are extracted, the dataset is organized and prepared for training:
• Label Encoding: Bonafide samples are labeled as 0 and spoofed samples as 1.
• Data Splitting: The dataset is split into training and testing sets to
evaluate generalization. A separate validation set may also be used to fine-tune
model parameters.
• Normalization: Input features are normalized to scale the values uniformly,
which helps in faster convergence during training.
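
The preparation steps above might be sketched as follows; the array shapes are placeholders, and
stratified splitting and training-set statistics for normalization are our assumptions rather than
details stated in this report.

import numpy as np
from sklearn.model_selection import train_test_split

# X: stacked feature arrays, y: 0 = bonafide, 1 = spoof (random stand-ins here)
X = np.random.rand(1000, 13, 13, 1).astype("float32")
y = np.random.randint(0, 2, size=1000)

# Hold out a test set, then carve a validation set out of the remainder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2,
                                                  stratify=y_train, random_state=42)

# Normalize with statistics computed on the training set only
mean, std = X_train.mean(), X_train.std()
X_train, X_val, X_test = [(a - mean) / std for a in (X_train, X_val, X_test)]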

Model Design – Convolutional Neural Network (CNN)


A CNN architecture is designed to classify input spectrograms into real or fake voice.
CNNs are chosen for their ability to recognize spatial hierarchies in 2D feature maps.
The architecture includes:
• Convolutional Layers: To detect local audio features like pitch, tone, and
modulation patterns.
• Pooling Layers: To reduce dimensionality and computation, while preserving
important features.
• Batch Normalization: To stabilize and accelerate training.
• Dropout Layers: To prevent overfitting by randomly disabling neurons during training.
• Fully Connected Dense Layers: Final classification layers with a sigmoid or
softmax function to output probabilities.
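
A compact Keras sketch of a CNN with the layer types listed above is given below; the filter
counts, layer ordering, and the (13, 13, 1) input shape are assumptions for illustration (the
implementation-level architecture is detailed in Chapter 5).

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(13, 13, 1)),                               # MFCC patch, one channel
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),  # local audio features
    layers.MaxPooling2D((2, 2), padding="same"),                   # reduce dimensionality
    layers.BatchNormalization(),                                   # stabilize training
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2), padding="same"),
    layers.BatchNormalization(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),                                           # guard against overfitting
    layers.Dense(1, activation="sigmoid"),                         # probability of spoofed input
])
model.summary()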

Model Training
• The model is trained using the binary cross-entropy loss function, which is
suitable for binary classification tasks.
• The Adam optimizer is employed for its efficiency in handling sparse
gradients and adaptive learning rates.
• The training is conducted over multiple epochs, with batch processing for
computational efficiency.
• Early stopping is utilized to halt training when validation accuracy stops
improving, preventing overfitting.
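
Continuing the sketches above (the prepared arrays and the model), the training configuration
described here might look as follows; the epoch count and batch size follow the values stated in
Chapter 5, while the monitored quantity and patience are assumptions.

from tensorflow.keras import callbacks, optimizers

model.compile(optimizer=optimizers.Adam(),
              loss="binary_crossentropy",            # binary classification loss
              metrics=["accuracy"])

early_stop = callbacks.EarlyStopping(monitor="val_accuracy", patience=5,
                                     restore_best_weights=True)

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=30, batch_size=32,
                    callbacks=[early_stop])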


Model Evaluation
• After training, the model is tested on unseen data to evaluate its
performance. The evaluation metrics include:
• Accuracy: Measures the overall correctness of predictions.
• Confusion Matrix: Provides a detailed breakdown of true positives, false
positives, true negatives, and false negatives.
• Precision, Recall, and F1-Score: These metrics offer insight into the model's effectiveness in
identifying spoofed and bonafide voices, especially when the classes are imbalanced.
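
Continuing the same sketch, the evaluation step could be expressed with scikit-learn as follows;
the 0.5 decision threshold is an assumption.

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Threshold the sigmoid outputs to obtain hard labels (0 = bonafide, 1 = spoof)
y_prob = model.predict(X_test).ravel()
y_pred = (y_prob >= 0.5).astype(int)

print(confusion_matrix(y_test, y_pred))               # TP/FN/FP/TN breakdown
print(classification_report(y_test, y_pred, target_names=["bonafide", "spoof"]))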

Visualization and Interpretation


• Loss and Accuracy Curves: Plotted for both training and validation sets to
visualize learning progression.
• Spectrogram Visuals: Used to understand what types of patterns CNN is
learning to distinguish between real and fake audio.


4.1 Block Diagram


The block diagram provides a simplified visual representation of the deepfake voice detection
system. It outlines the main components such as input audio, preprocessing, feature extraction,
model classification, and output generation. This helps understand the overall structure and data
flow in the system.

Fig 2: Block Diagram


4.2 System Architecture
The system architecture diagram showcases how different modules of the project interact
with each other. It highlights the backend processes including data input, preprocessing using
libraries like Librosa, MFCC/spectrogram extraction, CNN model processing, and final
classification of audio as real or fake. This layered view ensures modular and scalable design.

Fig 3: System Architecture


4.3 Control Flow Diagram


The Control Flow Diagram illustrates the logical execution flow of the system components. It
begins with receiving the input audio and continues through preprocessing steps such as
normalization and segmentation. The extracted features are then passed to the CNN model for
classification. Based on the model's prediction, the system outputs whether the audio is real or
fake. This diagram helps visualize the control and sequencing of operations in the detection
pipeline.

Fig 4: Control Flow Diagram

4.4 Data Flow Diagram

The Data Flow Diagram (DFD) shows how data moves through the system. It begins with the
user inputting an audio file, followed by preprocessing and feature extraction. The processed
data is then sent to the classification model, which returns the detection result. This diagram
helps in understanding how data is transformed at each stage.

Fig 5: Level 1 Data Flow Diagram


4.5 Sequence Diagram / Activity Diagram


The activity diagram outlines the step-by-step workflow of the system. It begins from audio
input and follows through preprocessing, feature extraction, model inference, and ends with
result display. It captures the sequence of operations and decision points involved in the
deepfake detection process.

Fig 6 : Activity Diagram


CHAPTER 5
IMPLEMENTATION PROCESS
5.1 System Requirements
• Processor: Intel Core i7/i9 (12th Gen) or AMD Ryzen 7/9
• RAM: 16GB or more
• GPU: NVIDIA RTX 3060 (6GB) or higher (RTX 4090 for high-end deep learning)
• Storage: 512GB NVMe SSD + 1TB HDD (for datasets & models)
• Operating System: Windows 11 / Ubuntu 20.04/ MacOS
• Software: Python 3.8+ with virtual environment support

5.2 Algorithm
Deep learning techniques are used in the proposed methodology to accurately
distinguish between real and fraudulent speech. This approach places a strong emphasis on
feature extraction, preprocessing, CNN-based model construction, and the analysis of speech
patterns from carefully chosen datasets.
The steps listed below are as follows: detailed description of the research methodology
used in this study:
Data Selection
The ASVspoof 2019 Logical Access (LA) dataset was chosen as the main dataset for this
work. This dataset, which includes both synthetic and real speech samples, is widely used for
fake speech detection. The LA subset contains audio files with synthetic speech generated by
different text-to-speech (TTS) and voice conversion (VC) methods.
The dataset is divided into three subsets:
• Training set: Used for model learning and parameter tuning.
• Validation set: Used for model optimization and hyperparameter adjustments.
• Test set: Used to evaluate the model’s performance on unseen data.

Preprocessing and Feature Extraction
The ASVspoof2019 Logical Access (LA) dataset consists of audio samples stored in 16-bit, 16
kHz WAV format, with each track having a fixed duration of 4 seconds. To ensure effective
feature extraction for fake speech detection, pre-processing and segmentation techniques are
applied.
Preprocessing
Since speech signals vary in structure, segmentation is performed to create uniform input
sizes for feature extraction. The preprocessing steps include:
• Frame Segmentation: Each audio file is divided into 10 segments to ensure
sufficient temporal resolution.
• Sampling Consistency: Given a sample rate of 16,000 Hz and a track duration of
4 seconds, each track contains 64,000 samples (i.e., 16,000 × 4).
• Windowing: A Hamming window is applied to each frame to reduce spectral leakage.
• Padding (if required): Zero-padding is applied to maintain uniform segment lengths
across all samples.
Feature Extraction
To extract meaningful representations of speech, Mel-Frequency Cepstral
Coefficients (MFCCs) are computed from each segmented frame. The extraction
process includes:
• Number of MFCCs: 13 MFCCs are computed per frame.
• Fourier Transform: The Fast Fourier Transform (FFT) size is 2048,
converting each frame into the frequency domain.
• Hop Length: A hop length of 512 samples is used, determining the overlap
between consecutive frames.
• Mel-Scale Filtering: A filter bank is applied to mimic human auditory perception.
• Feature Matrix Construction: The final MFCC feature matrix consists
of 13 coefficients per frame, serving as input to the deep learning model.
By structuring the input data into 10 uniform segments and extracting MFCC features, this
methodology ensures the effective differentiation of real and fake speech.
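
To make the arithmetic concrete: a 4-second, 16 kHz track has 64,000 samples, so each of the 10
segments contains 6,400 samples, and with an FFT size of 2048 and a hop length of 512 each
segment yields about 13 MFCC frames, i.e. roughly a (13, 13) matrix per segment. The helper
below is an illustrative reading of this pipeline (the file name and the exact windowing call are
assumptions), not the project's verified code.

import numpy as np
import librosa

SR, DURATION, NUM_SEGMENTS = 16000, 4, 10
SAMPLES_PER_TRACK = SR * DURATION                           # 64,000 samples
SAMPLES_PER_SEGMENT = SAMPLES_PER_TRACK // NUM_SEGMENTS     # 6,400 samples

def extract_segment_mfccs(path):
    """Split one 4 s track into 10 segments and compute an MFCC matrix for each."""
    y, _ = librosa.load(path, sr=SR)
    y = librosa.util.fix_length(y, size=SAMPLES_PER_TRACK)   # zero-pad if required
    features = []
    for s in range(NUM_SEGMENTS):
        start = s * SAMPLES_PER_SEGMENT
        segment = y[start:start + SAMPLES_PER_SEGMENT]
        mfcc = librosa.feature.mfcc(y=segment, sr=SR, n_mfcc=13,
                                    n_fft=2048, hop_length=512, window="hamming")
        features.append(mfcc.T)                              # about 13 frames x 13 coefficients
    return np.array(features)

# Example: extract_segment_mfccs("LA_T_example.flac").shape -> (10, 13, 13)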

Model Architecture
To classify real and fake speech, we employ a Convolutional Neural Network (CNN)-
based model that processes MFCC feature matrices extracted from segmented audio.
The model is designed to capture both spectral and temporal patterns indicative of
fake speech artifacts.

Input Layer
Accepts an MFCC feature matrix of shape (time frames × 13 MFCC coefficients),
reshaped for 2D convolutional processing.

Convolutional and Pooling Layers


The model employs multiple convolutional layers to extract features from speech signals:
• First Conv Layer: A Conv2D layer with 32 filters and (3×3) kernel size
applies feature extraction to learn key speech patterns.
• Max Pooling (2×2) with same padding reduces spatial dimensions while
preserving critical information.
• Batch Normalization stabilizes training and accelerates convergence.
• This structure is repeated across three convolutional layers, progressively
refining feature maps.
• The third convolutional layer uses a (2×2) kernel to extract fine-grained speech details.
Flatten Layer
Converts the 2D feature maps into a 1D vector for classification.
Fully Connected (Dense) Layers
• A Dense layer with 64 neurons and ReLU activation further processes
extracted features.
• A Dropout layer (0.3 probability) prevents overfitting.
Output Layer
A Dense layer with 2 neurons and Softmax activation produces probability scores for
real vs. fake speech classification.
This CNN-based architecture efficiently captures deep-fake speech artifacts while
maintaining computational efficiency.


Table 1: Summary of the CNN Model Architecture
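
One possible Keras realization of the architecture described above and summarized in Table 1 is
sketched below; the padding choices and the repetition of identical 32-filter blocks are our
reading of the text rather than the project's verified layer-by-layer configuration.

from tensorflow.keras import layers, models

def build_model(input_shape=(13, 13, 1)):
    """CNN sketch: three conv blocks, a 64-unit dense layer, dropout, and a softmax output."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        # First conv block: 32 filters, 3x3 kernel
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2), padding="same"),
        layers.BatchNormalization(),
        # Second conv block
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2), padding="same"),
        layers.BatchNormalization(),
        # Third conv block: finer 2x2 kernel
        layers.Conv2D(32, (2, 2), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2), padding="same"),
        layers.BatchNormalization(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(2, activation="softmax"),        # probabilities for real vs. fake
    ])

model = build_model()
model.summary()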

Training
The training process follows a structured pipeline to ensure optimal performance and
generalization of the CNN model for fake speech detection. The steps are as follows:
Data Splitting:
The dataset is divided into three subsets:
• Training Set (55%): Used to optimize model weights.
• Validation Set (20%): Used for hyperparameter tuning and to monitor
performance during training.
• Test Set (25%): Reserved for final evaluation to measure the model’s
generalization ability.
The prepare_datasets(0.25, 0.2) function handles this splitting, ensuring a balanced
distribution across the three sets.
Model Compilation:
The CNN model’s input shape is set to the extracted MFCC feature dimensions, (13, 13, 1),
and the network is constructed using the convolutional architecture described above. For
steady convergence and adaptive learning, the Adam optimizer is used with a learning rate of
0.0001. The output layer employs softmax activation for two-class classification, and the
sparse categorical cross-entropy loss function is used. Accuracy is tracked as the primary
metric of the model’s performance.


Early Stopping:
To prevent overfitting, an early stopping callback is applied. It monitors the validation
loss (val_loss) and stops training if no improvement is observed for 5 consecutive epochs.
The restore_best_weights=True setting ensures that the model reverts to the best-
performing weights before stopping.
Training Process:
To achieve the best possible balance between training speed and stability, the model is
trained for 30 epochs with a batch size of 32. When training, the validation set is used to
track performance and make dynamic learning adjustments.
Evaluation:
After training, the model is tested on the held-out test set using model.evaluate(X_test,
y_test), and the test accuracy is printed to assess how well the model generalizes to unseen
data. By limiting overfitting and optimizing performance on the task of classifying real and
fake speech, this systematic procedure ensures efficient training and constitutes the proposed
approach.
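
Putting the steps of this subsection together, a hedged sketch of the training and evaluation
pipeline is given below. prepare_datasets mirrors the 55/20/25 split described above, build_model
refers to the architecture sketch earlier in this chapter, and the random feature arrays are
placeholders for the real MFCC data.

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

def prepare_datasets(test_size, validation_size, X, y):
    """Split the data into roughly 55% train, 20% validation, and 25% test."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, stratify=y)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=validation_size / (1 - test_size), stratify=y_train)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Placeholder feature arrays standing in for the extracted (13, 13, 1) MFCC patches
X = np.random.rand(1000, 13, 13, 1).astype("float32")
y = np.random.randint(0, 2, size=1000)
X_train, X_val, X_test, y_train, y_val, y_test = prepare_datasets(0.25, 0.2, X, y)

model = build_model((13, 13, 1))                     # from the architecture sketch above
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=30, batch_size=32, callbacks=[early_stop])

test_loss, test_acc = model.evaluate(X_test, y_test)
print("Test accuracy:", test_acc)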

5.3 Mathematical Description


1. Feature Extraction (MFCC / Spectrograms):
S(m, n) = log(Σ|X(k)|² · Hₘ(k))
Where:
- X(k): FFT of the signal
- Hₘ(k): Mel filter bank
- m: Mel filter index
- n: time frame
2. Convolutional Neural Network

Zᵢⱼ^(l) = σ(Σₘ Σₙ Wₘₙ^(l) · Xᵢ₊ₘ,ⱼ₊ₙ^(l-1) + b^(l))

Where:
- W: weights of layer l
- b: bias
- σ: activation function (ReLU)
- Z: output feature map


3. Binary Classification Output:

Sigmoid Activation:
ŷ = 1 / (1 + e^(-z))

Binary Cross Entropy Loss:


L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]
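
As a small worked example of these formulas (values chosen only for illustration): for a logit
z = 2.0 the sigmoid gives ŷ ≈ 0.881, and for a true label y = 1 the binary cross-entropy loss is
≈ 0.127.

import math

z = 2.0                                                          # example logit
y_hat = 1 / (1 + math.exp(-z))                                   # sigmoid -> ~0.881
y = 1                                                            # true label (fake)
loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))    # ~0.127
print(round(y_hat, 3), round(loss, 3))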

5.4 Testing and Test Cases


Testing

Model Evaluation Metrics:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Confusion Matrix:

                  Predicted Fake    Predicted Real
Actual Fake       TP                FN
Actual Real       FP                TN
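
A quick check of the formulas above with illustrative counts (not results from this project): with
TP = 45, FN = 5, FP = 3, TN = 47 the metrics work out to an accuracy of 0.92, a precision of
about 0.94, a recall of 0.90, and an F1 of about 0.92.

TP, FN, FP, TN = 45, 5, 3, 47                            # illustrative counts only
accuracy = (TP + TN) / (TP + TN + FP + FN)               # 0.92
precision = TP / (TP + FP)                               # ~0.94
recall = TP / (TP + FN)                                  # 0.90
f1 = 2 * precision * recall / (precision + recall)       # ~0.92
print(accuracy, precision, recall, f1)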

Test Cases

Test Case ID   Description                        Input                               Expected Output
TC01           Test real voice sample             Real voice audio (.flac)            Label = 0 (Real)
TC02           Test spoofed voice sample          Synthesized spoofed audio           Label = 1 (Fake)
TC03           Edge case: short-duration audio    Very short clip (≤1 sec)            Proper classification
TC04           Noise handling                     Real audio with background noise    Still predicts "Real"
TC05           Dataset corruption                 Malformed audio file                Handled gracefully
TC06           Empty file input                   Empty file                          Error or skip


CHAPTER 6
RESULTS AND DISCUSSION

6.1 Dataset Samples

The dataset used in this study is the ASVspoof 2019 Logical Access (LA) dataset, a
publicly available benchmark that includes both real and fake speech samples. These samples
are generated using various Text-to-Speech (TTS) and Voice Conversion (VC) methods. Each
audio file is:

• Encoded in 16-bit, 16 kHz WAV format


• 4 seconds in duration
• Segmented into 10 uniform frames for processing
The dataset is divided into:

• Training set: for model training and parameter learning


• Validation set: for hyperparameter tuning
• Test set: for final evaluation on unseen data
6.2 Results
The CNN-based fake speech detection model was evaluated using standard classification
metrics: accuracy, precision, recall, and F1-score. The results indicate high effectiveness in
identifying deepfake audio signals.

Performance Metrics

• Accuracy: 97%
• Precision: High, indicating few false positives
• Recall: High, suggesting that most fake audio samples were correctly identified
• F1-Score: A strong balance between precision and recall
These metrics were derived from predictions on the test dataset, which consisted of previously
unseen samples. The model was particularly effective in identifying subtle artifacts in synthetic
audio that are often missed by human listeners.

Fig 7 : Performance Matrix


Fig 8 Confusion Matrix

Learning Curves

• Accuracy Curve (Fig 9) – Demonstrates a consistent increase in model accuracy during the
training and validation phases
• Loss Curve (Fig 10) – Shows a decreasing trend in both training and validation loss,
confirming that the model learned efficiently without significant overfitting

Fig 9 Accuracy Curve


Fig 10 Loss Curve

6.3 Result Analysis

The model’s performance reflects its capacity to effectively differentiate between real and
synthetic audio samples. Several key observations and insights emerged from the evaluation:

Strengths:
• High detection rate for various forms of synthetic audio
• Robust performance across multiple attack methods, including TTS and VC
• Effective learning from a relatively moderate-sized dataset
• Well-balanced precision and recall, indicating reliable performance
Limitations

1. Performance degradation on compressed audio
   • Compression artifacts may hide the subtle clues that help identify deepfake audio.
   • This affects real-world deployment where audio might be compressed (e.g., mobile
     networks, messaging apps).
2. Misclassification of high-quality synthetic voices
   • Some sophisticated TTS systems generate audio that is very close to human speech.
   • The model sometimes struggles to identify these as fake.
3. Computational demands
   • While efficient on GPUs, running the model on resource-constrained devices (like
     smartphones) may not yield real-time results.
   • Optimization techniques such as model pruning, quantization, or distillation can be
     applied in future work to address this.
Future Improvements
• Use of GANs for adversarial training to increase robustness

• Integration of Transformer-based architectures for improved contextual understanding

• Adding prosodic and rhythm-based features to complement MFCC-based inputs

• Exploring multi-modal detection (e.g., combining text and audio) for higher reliability

Comparative Analysis with Other Models

To highlight the strengths of the proposed model, we compare its performance with existing
deepfake detection methods with respect to accuracy and Equal Error Rate (EER).

• The proposed model outperforms ResNet and VGG models by a clear margin.
• ResNet models struggle with generalization, particularly when tested on unseen datasets.
Table 2: Comparison of Accuracy and Equal Error Rate (EER) with Similar Studies

Study                                          Accuracy (%)   EER (%)
Chinguun Purevdagva et al. [8] approach        59.2           40.8
Kai Li et al. [20] approach                    63.82          36.18
J. Khochare et al. [13] approach               67.0           33.0
R. Reimao, V. Tzerpos [2] approach             71.47          28.53
Lin Zhang et al. [15] approach                 83.0           17.0
Transformer Encoder [12] approach              90.43          9.57
Ameer Hamza et al. [16] approach               93.1           6.9
Sahar Abdullah Al Ajmi et al. [26] approach    94.2           5.8
Our Model                                      97.0           3.0

6.4 Summary
This project presented a detailed analysis of the model's performance in detecting fake
audio. The proposed CNN-based approach achieved 97% accuracy, outperforming several
state-of-the-art methods.

Key conclusions:

• Our best performing model, a single-task variant of the CNN, achieves a macro F1 score
of 97.61 on the validation set. The model can further be trained on augmented data to
enhance generalisation, which will prove helpful in deployment. Even allowing for some
loss in evaluation metrics, performance remains above the human observation level of
about 85%.

• Due to the use of an efficient CNN architecture, the processing time from feeding the
input to generating the result is very low. The model needs fewer than 50,000 parameters
and has a memory footprint of around 100 KB, which is notably compact in the field of
audio signal processing.
• The model is highly effective at identifying fake audio generated using modern
TTS and VC methods.

• Performance is competitive when benchmarked against recent literature.

• Certain limitations exist, particularly regarding compressed audio and high-end synthetic
voices.

• Future improvements can significantly boost the generalizability and efficiency of the
system, especially for deployment in real-world applications.

Overall, the results validate the success of the proposed methodology and highlight its potential
as a scalable and reliable tool for combating audio-based misinformation and identity fraud.


CHAPTER 7
CONCLUSION AND FUTURE SCOPE
7.1 Conclusion
This project developed a CNN-based Fake Speech Detection system that effectively
distinguishes between genuine and manipulated audio samples. By employing data augmentation
techniques, we enhanced the model’s robustness, enabling better generalization to various audio
manipulations. As the prevalence of fake speech increases, this research underscores the need for
advanced methodologies to combat misinformation in audio communications. Future work will
focus on refining model architectures and incorporating larger, more diverse datasets to further
enhance detection capabilities, contributing to audio forensics and security efforts.

The rise of synthetic voice generation through advanced AI techniques like Text-to-
Speech (TTS) and Voice Conversion (VC) presents significant threats to digital security, identity
verification, and media authenticity. This project presents a Convolutional Neural Network
(CNN)-based fake speech detection model trained on the ASVspoof 2019 dataset. By utilizing
MFCCs and spectrogram-based features, the model achieved a remarkable accuracy of 97% and
an F1-score of 97.61%, demonstrating its effectiveness in identifying fake speech even when the
differences are imperceptible to the human ear.

Data augmentation techniques like pitch shifting, noise addition, and time-stretching were
employed to improve model generalization, significantly enhancing robustness against unseen and
adversarial audio samples. The lightweight nature of the model—with under 100 KB memory
footprint—makes it viable for deployment in real-world environments, including mobile and
embedded systems.

Comparative studies also showed the superiority of this model over traditional and state-
of-the-art techniques such as ResNet, VGG, and Transformer encoders. Despite minor limitations
like performance drop on highly compressed audio or near-perfect synthetic voices, the system
provides a reliable, efficient, and scalable solution for the growing problem of voice-based
deepfake attacks.


7.2 Future Scope


• Real-Time and Edge Deployment:
Optimize the model through quantization, pruning, and distillation for deployment on
low-power devices. Develop lightweight versions for real-time use in mobile apps,
smart assistants, and IoT systems.
• Advanced Architectures and Robustness:
Explore transformer-based models (e.g., AST, ViT, multimodal transformers) to
improve temporal and contextual understanding. Enhance robustness against
compression and noise through fine-tuning on diverse, real-world audio conditions.
• Multimodal and Biometric Integration:
Extend detection capabilities by fusing audio with visual cues (lip-sync, facial analysis)
and integrating with biometric systems for secure authentication in domains like
banking and virtual assistants.
• Cloud-Based Tools and Ethical Applications:
Develop accessible, cloud-based forensic tools for legal and media verification. Support
ethical auditing by offering explainable AI and helping shape regulatory standards for
deepfake detection.


CHAPTER 8
REFERENCES
[1] Li, Yang, et al. “Universal voice conversion.” In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 3526-3535. 2019.
[2] Reimao, R., and V. Tzerpos. “FOR: A dataset for synthetic speech detection,” in 2019
International Conference on Speech Technology and Human-Computer Dialogue (SpeD).
IEEE, pp. 1–10.
[3] Korshunov, Pavel, and S. Marcel. “Deepfake detection using inverse contrastive loss.” In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Workshops, pp. 3920-3929. 2020.
[4] Prakash, Shreya, et al. “A comprehensive study on deep fake audio detection.” In
Proceedings of the 2020 ACM Workshop on Information Hiding and Multimedia Security,
pp. 81-90.
[5] Subramani, N. and D. Rao, “Learning efficient representations for fake speech detection,”
in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp.
5859–5866, 2020.
[6] Capoferri, D., et al., “Speech audio splicing detection and localization exploiting
reverberation cues,” in 2020 IEEE International Workshop on Information Forensics and
Security (WIFS). IEEE, pp. 1–6.
[7] Wijethunga, R., et al., “Deepfake audio detection: a deep learning-based solution for group
conversations,” in 2020 2nd International conference on advancements in computing
(ICAC), vol. 1. IEEE, pp. 192–197.
[8] Purevdagva, C., et al., “A machine-learning based framework for detection of fake
political speech,” in 2020 IEEE 14th International Conference on Big Data Science and
Engineering (BigDataSE). IEEE, pp. 80–87.
[9] Mukhopadhyay, Rudrabha, et al. “A comprehensive survey of voice conversion and deep
fake techniques.” arXiv preprint arXiv:2103.03230, 2021.
[10] Xie, Jin, et al. “Voice Deep Guard: Towards Intelligent Voice Deepfake Detection.” In
Proceedings of the 28th ACM International Conference on Multimedia, 2021.
[11] Ballesteros, D. M., et al., “Deep4snet: deep learning for fake speech classification,” Expert
Systems with Applications, vol. 184, p. 115465, 2021.
[12] Zhang, Z., et al., “Fake speech detection using residual network with transformer
encoder,” in Proceedings of the 2021 ACM workshop on information hiding and
multimedia security, pp. 13–22.
[13] Khochare, J., et al., “A deep learning framework for audio deepfake detection,” Arabian
Journal for Science and Engineering, pp. 1–12, 2021.
[14] Pasupathi, Panupong, and Taxing Li. “Detecting AI-Generated Text with BERT.” In
Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, pp. 672-684.
[15] Zhang, L., et al., “The partialspoof database and countermeasures for the detection of short
fake speech segments embedded in an utterance,” IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 813–825, 2022.


[16] Hamza, A., et al., “Deepfake audio detection via MFCC features using machine learning,”
IEEE Access, vol. 10, pp. 134018–134028, 2022.
[17] Shaaban, O. A., et al., “Audio deepfake approaches,” IEEE Access, vol. 11, pp. 132652–
132682, 2023.
[18] Albazony, A. A. M., et al., “Deepfake videos detection by using recurrent neural network
(RNN),” in 2023 Al-Sadiq International Conference on Communication and Information
Technology (AICCIT). IEEE, pp. 103–107.
[19] Bansal, K., et al., “Deepfake detection using CNN and DCGANs to drop-out fake
multimedia content: a hybrid approach,” in 2023 International Conference on IoT,
Communication and Automation Technology (ICICAT). IEEE, pp. 1–6.
[20] Li, K., et al., “Contributions of jitter and shimmer in the voice for fake audio detection,”
IEEE Access, vol. 11, pp. 84689–84698, 2023.
[21] Pham, L., et al., “Deepfake audio detection using spectrogram-based feature and ensemble
of deep learning models,” in 2024 IEEE 5th International Symposium on the Internet of
Sounds (IS2). IEEE, pp. 1–5.
[22] Paramarthalingam, A., et al., “A deep learning model to assist visually impaired in pothole
detection using computer vision,” Decision Analytics Journal, vol. 12, p. 100507, 2024.
[23] Basha, S. A. Y., and K. U. Priya, “Recognition of deep fake voice acoustic using ensemble
bagging model,” in 2024 5th International Conference on Electronics and Sustainable
Communication Systems (ICESC). IEEE, pp. 1211–1217.
[24] Xue, J., et al., “Dynamic ensemble teacher-student distillation framework for light-weight
fake audio detection,” IEEE Signal Processing Letters, 2024.
[25] Deng, J., et al., “VFD-Net: Vocoder fingerprints detection for fake audio,” in ICASSP
2024 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, pp. 12151–12155.
[26] Al Ajmi, S. A., et al., “Faked speech detection with zero prior knowledge,” Discover
Applied Sciences, vol. 6, no. 6, p. 288, 2024.
[27] Mathew, J. J., et al., “Towards the development of a real-time deepfake audio detection
system in communication platforms,” arXiv preprint arXiv:2403.11778, 2024.
[28] Song, D., et al., “Anomaly detection of deepfake audio based on real audio using
generative adversarial network model,” IEEE Access, 2024.
[29] Kang, J. Y., et al., “FADEL: Uncertainty-aware fake audio detection with evidential deep
learning,” in ICASSP 2025 IEEE International Conference on Acoustics, Speech and
Signal Processing. IEEE, pp. 1–5.
[30] Dixit, Y., et al., “Fake news detection of live media using speech to text conversion,” in
2021 Innovations in Power and Advanced Computing Technologies (i-PACT). IEEE, pp.
1–5.
