
ACKNOWLEDGEMENT

We express our humble gratitude to Dr. C. Muthamizhchelvan, Vice-Chancellor, SRM Institute of Science and Technology, for the facilities extended for the project work and his continued support. We extend our sincere thanks to the Dean-CET, SRM Institute of Science and Technology, Dr. T. V. Gopal, for his invaluable support.

We wish to thank Dr. Revathi Venkataraman, Professor & Chairperson, School of Computing, SRM Institute of Science and Technology, for her support throughout the project work. We are incredibly grateful to our Head of the Department, Dr. Annapurani Panaiyappan K, Professor and Head, Department of Networking and Communications, School of Computing, SRM Institute of Science and Technology, for her suggestions and encouragement at all stages of the project work.

We want to convey our thanks to our Project Coordinator, Dr. G. Suseela, Associate Professor; our Panel Head, Dr. K. Venkatesh, Professor; and the panel members, Dr. V. Rajaram, Assistant Professor, Dr. Angayarkanni S A, Assistant Professor, and Dr. B. Balakiruthiga, Assistant Professor, Department of Networking and Communications, School of Computing, SRM Institute of Science and Technology, for their inputs during the project reviews and their support.

We register our immeasurable thanks to our Faculty Advisor, Dr. Kalaiselvi K, Associate Professor, Department of Networking and Communications, School of Computing, SRM Institute of Science and Technology, for leading and helping us to complete our course.

Our inexpressible respect and thanks go to our guide, Dr. Saravanan M, Professor, Department of Networking and Communications, SRM Institute of Science and Technology, for providing us with an opportunity to pursue our project under his mentorship. He provided us with the freedom and support to explore the research topics of our interest. His passion for solving problems and making a difference in the world has always been inspiring.

We sincerely thank the staff and students of the Department of Networking and Communications, SRM Institute of Science and Technology, for their help during our project. Finally, we would like to thank our parents, family members, and friends for their unconditional love, constant support, and encouragement.

Sivaramakrishnan M [Reg No: RA2011029010002]

Aaryan Rajput [Reg No: RA2011029010005]


SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

KATTANKULATHUR – 603 203

BONAFIDE CERTIFICATE

Certified that this project report titled “Classification of Deep fake audio using MFCC” is the bonafide work of Mr. Sivaramakrishnan M [Reg No: RA2011029010002] and Mr. Aaryan Rajput [Reg No: RA2011029010005], who carried out the project work under my supervision. Certified further that, to the best of my knowledge, the work reported herein does not form part of any other thesis or dissertation on the basis of which a degree or award was conferred on an earlier occasion for this or any other candidate.

Dr. Saravanan M
Professor
Department of Networking and Communications,
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

Dr. Annapurani Panaiyappan K
Professor and Head
Department of Networking and Communications,
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

EXAMINER I

EXAMINER II
TABLE OF CONTENTS

C.No.  TITLE                                                          PAGE NO.

       Abstract                                                        viii
       List of Figures                                                 ix
       List of Tables                                                  xi
       List of Symbols and Abbreviations                               xii
1      INTRODUCTION                                                    1
1.1    What is Deep Fake                                               2
1.2    Pre-Existing Issues of Deepfake                                 3
1.3    Need of Detection and the Purpose                               7
1.4    Scope of this Project                                           8
1.5    Artificial Intelligence, Machine Learning and Deep Learning     9
1.6    Motivation                                                      13
2      LITERATURE REVIEW                                               16
2.1    Related Literature Works                                        16
2.2    Detection using Deep Learning                                   18
2.3    Previous Frameworks                                             21
2.4    ASVspoof 2021                                                   22
2.5    Challenges and Future Directions                                23
3      DEEP FAKE AUDIO DETECTION                                       27
3.1    System Architecture Diagram of Our Application                  27
3.2    Mel Frequency Cepstral Coefficients                             29
3.3    Convolutional Neural Network                                    33
3.4    Data Normalization                                              37
3.5    Extract MFCC Features                                           38
3.6    Load Audio Data                                                 40
3.7    Audio Authenticity Algorithm                                    42
3.8    Activity Diagram                                                43
4      RESULTS                                                         45
5      CONCLUSION                                                      54
6      FUTURE SCOPE                                                    55
7      REFERENCES                                                      56
8      APPENDIX 1                                                      59
9      APPENDIX 2                                                      62
10     PAPER PUBLICATION STATUS                                        68
11     PLAGIARISM REPORT                                               69
ABSTRACT

Creating fake voices, often done with advanced computer programs, poses a growing threat in the digital era. This work addresses the pressing need for effective detection of AI-generated voices through the development of an advanced application. Utilizing state-of-the-art auto-regressive models and encoder components, the application allows users to upload audio files and provides real-time probability estimates on whether a voice is AI-generated. A theoretical exposition on auto-regressive models and the iterative generation process in Text-to-Speech models forms the foundation of our approach.

Voice forgery is becoming a bigger problem in today's digital world, and there is an urgent demand for a better way to spot AI-generated voices. Our research presents a new method for detecting fake audio by combining advanced signal processing and deep learning techniques. We analyzed a dataset containing both real and fake audio samples by extracting Mel-Frequency Cepstral Coefficients (MFCC) to understand the sound patterns. Using sophisticated neural networks such as Convolutional Neural Networks (CNNs) and fully connected layers, we examined voice samples on a large scale. Through systematic testing, we identified the most effective models for distinguishing real from fake audio. Our system includes a user-friendly interface for easy audio upload and analysis. This research advances audio forensics and helps combat the spread of deep fake content online. It not only addresses the immediate challenges posed by voice forgery but also establishes a scalable and adaptable framework for future developments in the fight against digital deception. By making this technology accessible, we empower individuals and organizations to safeguard the authenticity of digital communication in an era where trust in digital content is increasingly under scrutiny.

LIST OF FIGURES

1.4.1 Supervised Learning 5

1.4.2 Unsupervised Learning 5

1.4.3 Reinforcement Learning 6

1.4.4 (a) A single-layer perceptron 7

(b) a multi-layer perceptron 8

1.5.1 Face Manipulation working 9

1.5.2 Voice Synthesis Working 10

1.5.3 Scene Reconstruction Working 11

1.5.4 Image Restoration Working 12

2.1.1 Proposed approach for detection of deep fake audios using various ML models 15

2.4.1 An illustration of the AD detection process 21

2.4.2 Imitation-based Deep fake 22

2.5.1 The overall structure of pre-training and fine-tuning 23

3.1.1 Fake Audio 27


3.1.2 Real Audio 27

3.2.1 Architecture Diagram 29

3.3.1 Use case Diagram 31

3.4.1 Activity Diagram 32

3.5.1 Steps involved in MFCC Feature Extraction 34

3.5.2 Fourier Transform for an audio signal 35

3.5.3 Mel Filter bank 35

3.7.1 Sigmoid Function 40

3.7.2 CNN Architecture 43

3.7.3 Summary of our Deep Learning Model 44

3.8 Algorithm of our Deep learning model 45

4.1 a) Training and Validation Accuracy 46

b) Training and Validation Loss 47

c) Confusion Matrix 48

d)Audio Fake Detection & Result 49


LIST OF TABLES

2.2.1 Complexity and accuracy of known convolutional neural networks (CNNs) 22

2.3.1 t-DCF cost function parameters assumed in ASVspoof 2021 22


LIST OF SYMBOLS AND ABBREVIATIONS

MFCC Mel Frequency Cepstral Coefficients

STFT Short-Time Fourier Transform

TTS Text-to-speech

CNN Convolutional Neural Network

AI Artificial Intelligence

ReLU Rectified Linear Unit

LSTM Long Short-Term Memory


CHAPTER 1

INTRODUCTION

In an age dominated by technological advancements, the emergence of deep fake technology has introduced a new dimension to the challenges surrounding the authenticity of digital content. Deep fakes, fuelled by sophisticated artificial intelligence algorithms, have the capability to convincingly manipulate images, videos, and audio, blurring the lines between reality and manipulation. The widespread accessibility of deep fake tools has raised significant concerns across various sectors, including politics, media, and cybersecurity. The potential for misinformation and the erosion of trust in digital media underscore the urgent need for robust and effective deep fake detection mechanisms. This document delves into the intricacies of deep fake detection, exploring the challenges posed by evolving synthetic media techniques and presenting innovative solutions to mitigate their impact.

The detection of this kind of audio relies on leveraging advanced algorithms, neural networks, and multi-modal analyses [1]; our approach aims not only to identify deep fake content but also to stay ahead of the dynamic landscape of deceptive technologies. As we navigate through the following sections, we will uncover the key objectives, methodologies, and technologies that form the foundation of our deep fake detection system, a solution designed to uphold the integrity of digital content and counter the threats posed by synthetic media deception. In the ever-evolving landscape of artificial intelligence, the emergence of deep fake audio technology has introduced a nuanced layer of complexity to the detection and mitigation of synthetic soundscapes.

Deep fake audio, a product of advanced machine learning algorithms, mimics natural human speech and other auditory elements with an unprecedented level of realism. As we embark on this comprehensive exploration, this report aims to dissect the mechanisms, challenges, and innovative solutions within the domain of deep fake audio detection, unraveling the deceptive potential of synthetic audio and the strategies employed to safeguard against its nefarious applications. Expanding on this introduction, the following sections examine deep fake technology in depth, with particular focus on detection mechanisms and the evolving landscape of synthetic media.

1.1 What is Deep fake

Deepfake technology, a fusion of AI sophistication and creative manipulation, is


transforming the landscape of digital media, pushing the boundaries of reality and
fiction. At its core, deepfake employs advanced machine learning models to synthesize
audiovisual content that is remarkably lifelike, making it possible to generate videos
and audio recordings that appear to show people saying and doing things they never
actually did. This capability arises from the power of deep neural networks, specifically
generative adversarial networks (GANs),[4] where two models work in tandem—one
generating the fake content and the other evaluating its authenticity, continuously
improving the realism of the output. The creation process of deepfakes is both intricate
and fascinating, involving the collection of extensive datasets of the target's images,
videos, or voice clips.
These datasets train the neural network to understand and replicate the
minute details of the target's appearance or voice, allowing for the generation of new
content that bears an uncanny resemblance to genuine material. This technology
leverages encoder-decoder frameworks, where the encoder learns a compressed
knowledge representation of the target's features, and the decoder uses this knowledge
to create the deep fake content by applying these features onto another person's likeness
or voice. Beyond entertainment and creative expression, where deep fakes have been
used to astonishing effect in films, video games, and virtual reality experiences, their
potential applications extend into areas like education, where historical figures could be
brought to life to deliver lectures, or language learning, where deep fake technology
could help in lip-syncing videos in different languages, enhancing the learning
experience.

However, the ethical and societal implications of deep fakes are profound. The
technology poses a formidable challenge to the notion of trust in the digital age, as it
can be weaponized to create misleading or harmful content. Politically motivated deep
fakes, pornography, and scams threaten personal reputations, democratic processes, and
social harmony, blurring the line between truth and deception. The ease with which
individuals can be portrayed in misleading contexts without their consent has raised
alarm bells, leading to calls for more robust digital literacy, advanced detection
technologies, and legal measures to protect individuals and societies from the potential
misuse of deepfake technology. The ongoing battle between deepfake creators and
detectors resembles an arms race, with each advancement in creation techniques being
met with sophisticated detection methods that analyze inconsistencies in videos or
sounds. Researchers and technologists are exploring a variety of approaches, including
deep learning models that can identify subtle artifacts or discrepancies in facial
expressions, speech patterns, or background noise that may indicate a piece of content
is a deep fake.

In conclusion, while deepfake technology showcases the remarkable


advances in AI and machine learning, it also underscores the critical need for ethical
considerations, regulatory frameworks, and technological solutions to mitigate its risks.
Society must navigate these challenges carefully, ensuring that while we embrace the
positive transformations deep fakes can bring to entertainment, education, and beyond,
we also safeguard the principles of truth, consent, and security in our increasingly
digital world. As deep fake technology evolves, distinguishing between real and
synthetic content becomes increasingly difficult, underscoring the need for vigilance,
critical media literacy, and continued development of detection technologies.

1.2. Pre-Existing issues of Deepfake

Deepfake technology represents a sophisticated application within the realm of computer


vision, leveraging artificial intelligence to manipulate visual data in unprecedented ways.
Deepfake algorithms utilize techniques such as generative adversarial networks (GANs)
[5] and variational autoencoders (VAEs) to create highly convincing synthetic images,
videos, or audio recordings.
At its core, deepfake technology involves the synthesis of hyper-realistic content by
understanding and mimicking the visual patterns present in authentic media. By training
on vast datasets of real-world images and videos, deep learning models can learn to
replicate the intricate nuances of human facial expressions, gestures, and speech patterns.
This process enables the creation of synthetic media that closely resembles genuine
recordings, often indistinguishable to the human eye or ear.

Deepfake applications span various domains within computer vision, including:

1. Face Manipulation: Deep fake algorithms excel in altering facial expressions,


identities, or attributes within images or videos. This capability enables the creation of
convincing impersonations or the insertion of individuals into scenarios they never
participated in. Fig. 1.2.1 depicts a streamlined overview of the supervised machine
learning process. It begins with "Labeled Data," showcasing various geometric shapes—
hexagons, squares, and triangles—each tagged with a label, signifying the training dataset
that instructs the machine learning algorithm to discern and categorize these shapes.
Adjacent to this, the "Labels" section presents the corresponding identifiers for the
shapes, delineating the classes the algorithm is to predict. At the core of the diagram,
"Model Training" is symbolized by gears, metaphorically representing the computational
and algorithmic operations involved in the model's learning phase.[20] Following this
stage is "Prediction," where the trained model is depicted as ready to classify new,
unlabelled data, signified by an icon radiating outward lines. Beneath this, "Test Data"
comprises two unlabeled shapes, indicating fresh data introduced to the model to evaluate
its classifying proficiency. The culmination of this process is illustrated on the right,
where the shapes are accurately identified as a "Square" and "Triangle" by the model.[20]
This diagrammatic representation encapsulates the essence of supervised machine
learning, illustrating the training of a model on labeled data and its subsequent
application in predicting classifications for new data.

2. Voice Synthesis: Beyond visual content, deepfake technology extends to audio


synthesis, allowing the generation of realistic speech patterns and vocal inflections.[7] This enables the creation of synthetic voice recordings that mimic specific individuals with remarkable accuracy. Fig. 1.2.2 depicts how voices are selected in order to train a voice synthesis model: the model must be fed both real and fake data so that it is trained on both and yields more accurate results.

3. Scene Reconstruction: Some advanced deepfake techniques involve reconstructing


entire scenes or environments from limited visual input. By understanding the spatial relationships and context within an image or video, deep learning models can extrapolate and generate additional content seamlessly.[5] Fig. 1.2.3 depicts the construction of deepfakes for face recognition models.

Figure 1.2.1- Face Manipulation working


Figure 1.2.2- Voice Synthesis Working

4. Image Restoration: Deep learning models are also employed in the restoration of
damaged or degraded visual data. By learning the underlying structures and patterns
within images, these algorithms can reconstruct missing or corrupted information,
enhancing the overall quality of the content. Figure 1.2.4 depicts how image restoration works.

However, despite its remarkable capabilities, deepfake technology raises significant


concerns regarding its potential for misuse, particularly in the dissemination of
misleading or fraudulent content.[17] As a result, researchers and practitioners within the
field of computer vision are actively developing methods for detecting and mitigating the
impact of deep fakes on society. These efforts include the development of deepfake
detection algorithms, the establishment of ethical guidelines for the responsible use of
synthetic media, and ongoing research into techniques for preserving the integrity of
visual data in an increasingly digital age.
Figure 1.2.3- Scene Reconstruction Working

Figure 1.2.4 - Image Restoration Working

1.3 Need of Detection and the purpose

This research is driven by the imperative to confront the growing menace of voice
forgery, an issue exacerbated by the capabilities of advanced generative models in the
contemporary digital landscape. Our central purpose is the development of a
sophisticated application aimed at proficiently detecting AI-generated voices within
uploaded audio files. By harnessing the prowess of state-of-the-art auto-regressive
models and encoder components, our objective is to furnish users with a dynamic tool
that provides instantaneous probability estimates, discerning whether a voice originates
from artificial intelligence. The theoretical foundation of our approach is laid through a
comprehensive exploration of auto-regressive models and the iterative generation
process inherent in Text-to-Speech models. Beyond the immediate goal of voice
detection, our overarching purpose extends to contributing to the broader field of AI-
generated voice detection.

The amalgamation of these tools and techniques culminates in the


development of a potent AI-generated voice detection and analysis tool. Notably, this
tool transcends traditional quantitative assessments by affording users the invaluable
ability to audit and comprehend the intricacies of the analyzed audio, further solidifying
its practical utility. In essence, our research endeavors to make a substantial
contribution to the burgeoning field of AI-generated voice detection. By offering both
theoretical insights and practical applications, we not only elevate our understanding of
the underlying challenges but also fortify our collective defenses against the
proliferating threat of sophisticated generative models in voice forgery.

1.4 Scope of this project

Deep fake audio encompasses a broad spectrum of artificially generated audio content
that mirrors authentic human speech and auditory experiences. From impersonation for
malicious purposes to the creation of synthetic voice overs in the realms of
entertainment and media, the applications of deep fake audio traverse diverse sectors,
necessitating a multifaceted approach to detection and prevention. At the core of deep
fake audio generation lies the utilization of generative models, including but not limited
to Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).
These models meticulously analyze extensive datasets, capturing the nuances of natural
speech patterns and auditory cues. Furthermore, manipulation of Text-to-Speech (TTS)
[18] systems contributes to the creation of highly convincing audio, posing a
formidable challenge to conventional detection methodologies. The dynamic and
adaptive nature of deep fake audio presents formidable challenges for detection
mechanisms. Traditional approaches struggle to keep pace with the rapid evolution of
generative models, as creators continuously refine their techniques to produce
increasingly convincing synthetic soundscapes. The inherent ability of deep fake audio
to exploit vulnerabilities in human auditory perception further complicates the task of
identifying manipulated content.

As the arms race between creators of deep fake audio and detection systems
intensifies, researchers and technologists have devised state-of-the-art strategies to
combat synthetic audio deception. Advanced machine learning algorithms, coupled
with forensic audio analysis, form the bedrock of modern detection methodologies.
Multi-modal approaches, incorporating visual and contextual cues, contribute to the
robustness of these strategies, enabling a more comprehensive understanding of
synthetic soundscapes. Beyond the technological intricacies, this report explores the
far-reaching implications of deep fake audio in various sectors, including politics,
cybersecurity, and the entertainment industry.[19] The potential for misinformation,
identity theft, and the erosion of trust in auditory media underscores the urgency of
developing effective detection mechanisms to mitigate the societal impact of synthetic
soundscapes. The path forward involves continuous advancements in deep fake audio
detection technologies, as well as collaborative efforts across academia, industry, and
policymakers. This report sheds light on the ongoing research and development
initiatives, providing insights into the future landscape of deep fake audio detection and
the proactive measures required to stay ahead of emerging threats.

1.5 Artificial Intelligence, Machine Learning and Deep Learning

A branch of computational theory and logic known as artificial intelligence is


concerned with creating computer systems and algorithms that can carry out activities
that would typically need human intelligence, logical reasoning, and skill. Since these
apps and algorithms mimic how the human mind functions, they necessitate an extreme
amount of knowledge about every aspect of the issue. These consist of the different
items, their characteristics, how they are categorized differently, and how they are
related to one another. Generally speaking, there are two categories of artificial
intelligence based on the types of problems that they are designed to solve:
1) Vertical AI: These artificial intelligence algorithms concentrate on mastering a
specific problem. Typically, they are hardcoded to obey commands for a single,
mechanized, and monotonous duty. For instance, setting up daily appointments and
calls and automatically updating the database with data from a single outside source.

2) Horizontal AI: These AI systems concentrate on more general issue formulations.


Applications falling under this category of artificial intelligence are able to manage
numerous jobs and satisfy the various needs of their users using a single logic and
configuration. For instance, users can assign several tasks to virtual assistants such as
Cortana, Alexa, and Siri.

Machine Learning (ML) is a subset of artificial intelligence concerned with creating machine-intelligent algorithms that enable a computer to learn from past and present data; although the two terms are frequently used synonymously, they are not the same. To do this, the algorithms recognize and learn the different links between the data's constituent features, search for recurring themes, react to diverse scenarios outside the bounds of their programming, and make predictions in line with those responses.

Depending on the kind of data they utilize to learn and the kinds of outputs they
provide, machine learning may be generally divided into three groups. They are as
follows:

1) Supervised Learning: The datasets that the machine uses to learn in this kind of
learning are fully tagged and structured. The datasets are used as inputs for the ML
algorithm's training and contain a set of input features and their related outputs.
Throughout the training process, the model assesses its performance and automatically
modifies its learning curve in order to improve performance metrics. It learns by
mapping its predictions to the true labels found in the dataset. Classification and regression tasks, such as classifying images and predicting commodity prices, are typical examples. Figure 1.5.1 below illustrates how supervised learning is carried out: the data is first labeled with various identities and used to train the model; test data is then used to complete the prediction step, and the final output is recognized.

Figure 1.5.1- Supervised Learning
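For illustration only, the following minimal sketch (not part of our detection system) shows this labeled-data-to-prediction flow using scikit-learn on synthetic data; the dataset, classifier choice, and parameters are arbitrary assumptions made for the example.

# Minimal supervised-learning sketch: a classifier is fitted on labeled
# training data and then asked to predict labels for unseen test samples,
# mirroring the train -> predict flow described above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=4, random_state=0)  # synthetic labeled data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)      # model training
print("Predicted labels:", model.predict(X_test[:5]))   # prediction on new data
print("Test accuracy:", model.score(X_test, y_test))    # evaluation on test data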

2) Unsupervised Learning: In unsupervised learning, the datasets are not mapped to outputs. The models are meant to generate outputs by discovering and learning common patterns, without adhering to a predetermined "correct answer", after being trained on a dataset containing many features. Examples include the clustering methods employed in evolutionary biology, DNA pattern identification, and consumer segmentation [16]. Figure 1.3.2 illustrates the process of intercepting DNA data, applying an algorithm, processing the data, and producing an output for an unsupervised learning algorithm.

3) Reinforcement Learning: In order to maximize the agent's performance, the algorithm in this kind of learning modifies its learning curve by examining the characteristics of the learning problem, environment, and behavior. It employs an action-reward system to reinforce any approach that is suitable for solving the problem and to correct any that is not. The machine makes sure to learn only those techniques that enable the end goal to be attained quickly and efficiently; gaming and autonomous driving are typical examples. Figure 1.3.3 illustrates how the environment and agent generate an output, as well as how the raw data is processed.

Figure 1.3.2- Unsupervised Learning

Figure 1.3.3- Reinforcement Learning

As a branch of machine learning, deep learning uses intelligent models that simulate the
functioning of the human brain. These models, termed artificial neural networks, have
connected elements called neurons that mimic the architecture of the brain. Neural
network neurons, in contrast to brain neurons, are constrained by network architecture in
terms of their connectivity and data flow direction. Neural networks are more robust than
typical machine learning algorithms and can perform a wide range of jobs.

They are highly skilled at analyzing unprocessed data, deriving features


from several levels, compressing data, and producing results based on these features.
Neural network layers' constituent units, or perceptrons, are complex mathematical
operations. To get outputs, they take inputs, apply biases and weights, and then run them
through nonlinear functions. Via connection links, these outputs and additional features like activation signals are sent to the layer of perceptrons that follows.[15] Two varieties of perceptrons exist. Binary classification and other linearly separable tasks are handled by single-layer perceptrons. Fully connected neural networks, or multi-layer perceptrons, are made up of three kinds of layers: input, hidden, and output. They improve processing capacity for applications like image recognition and stock analysis by introducing non-linearity to the input data.
Figure 1.3.4: (a) A single-layer perceptron and (b) a multi-layer perceptron

Figure 1.3.4 (a) gives a brief idea of how a single-layer perceptron works: raw data inputs x1, x2, ..., xn are fed to a single layer, and the desired output is trained. Figure 1.3.4 (b) differs slightly in that an additional hidden layer is introduced; the input is given to the multi-layer perceptron, and the desired output is trained.
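As a rough illustration of the difference between the two structures (a sketch with arbitrary weights, not the model used later in this report), the forward passes can be written in NumPy as follows.

import numpy as np

def single_layer_perceptron(x, w, b):
    # One weighted sum followed by a step activation: suited only to
    # linearly separable tasks such as simple binary classification.
    return 1 if np.dot(w, x) + b > 0 else 0

def multi_layer_perceptron(x, W1, b1, W2, b2):
    # A hidden layer with a nonlinear activation (ReLU) lets the network
    # handle data that is not linearly separable before the output layer.
    h = np.maximum(0, W1 @ x + b1)            # hidden layer
    return 1 / (1 + np.exp(-(W2 @ h + b2)))   # sigmoid output

x = np.array([0.5, -1.2, 3.0])                # raw inputs x1, x2, x3
print(single_layer_perceptron(x, np.array([0.4, 0.1, -0.2]), 0.05))

rng = np.random.default_rng(0)
print(multi_layer_perceptron(x, rng.normal(size=(4, 3)), np.zeros(4),
                             rng.normal(size=4), 0.0))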

1.6. Motivation

In recent years, the proliferation of deepfake
technology has emerged as a formidable challenge in the fields of information security,
digital media integrity, and personal privacy. Deep fake audio, a subset of this
technology, involves the manipulation or generation of human-like speech with the intent
to deceive, entertain, or experiment. The motivations for developing deepfake audio
detection systems are both varied and urgent, grounded in a need to uphold the
authenticity of communication, protect individual identities, and maintain trust in digital
media.[14] The term 'deepfake' is a combination of 'deep learning' and 'fake,' and it
originally referred to synthetic media where a person's likeness or voice is replaced with
someone else's. With the democratization of AI tools and techniques, generating
convincing fake audio has become increasingly accessible, leading to a surge in potential
misuse. The ability to mimic voices accurately raises ethical concerns, particularly
around consent and deception. Impersonating public figures, celebrities, or private
individuals without consent infringes on rights to likeness and can lead to misinformation
and reputation damage. The motivation for developing detection mechanisms is clear: to create a line of defense against the misuse of synthetic media. As deep fake
technology evolves, so too must the tools designed to detect and mitigate its effects. This
arms race is not just technical—it’s also societal, as the norms around digital content
authenticity are being challenged. For cybersecurity, the emergence of deepfake audio
introduces a new vector for attacks. From voice phishing (vishing) to bypassing voice
authentication systems, the threats are real and present. The detection systems serve as
the bulwark against these vulnerabilities, aiming to protect individuals and organizations
alike. Law enforcement and digital forensic experts are at the frontline, requiring tools to
distinguish between authentic and tampered audio recordings.

The accuracy and reliability of these tools are crucial for legal proceedings,
evidence validation, and the enforcement of digital communication laws. In journalism,
the sanctity of reported speech is sacrosanct. Deepfake audio detection tools assist
journalists and media houses in verifying the authenticity of audio clips before
publication, maintaining the trust and integrity crucial to the profession. The political
arena, where the authenticity of communication can sway public opinion and policy
decisions, is especially vulnerable. Ensuring the veracity of statements attributed to
political figures is essential to maintain a fair democratic process. While deepfake
technology can be used for creative expression in the entertainment industry,
distinguishing between performance art and reality becomes critical when content has the
potential to deceive the audience. For individuals, the invasion of privacy manifested
through unauthorized use of one's voice is a direct attack on personal autonomy.
Detection systems can empower individuals to reclaim control over their digital personas.
Researchers are motivated by the technical challenge deepfakes represent, as well as the
social responsibility to counteract the potential harm. The development of detection
systems is as much about advancing AI as it is about curbing its negative impacts. At the
core of the motivation for deep fake audio detection is the broader issue of trust in
technology. As society becomes increasingly reliant on digital communication,
establishing and maintaining trust in the tools we use is essential. Part of the motivation
also lies in education and awareness. Detection tools can help educate the public about
the existence and nature of deep fakes, fostering a more discerning consumption of digital
media. The drive to develop deep fake audio detection technology stems from a
multifaceted motivation matrix. It is a response rooted in the need for security, the pursuit
of justice, the protection of individual rights, and the preservation of societal trust in the
digital age. The quest to develop these tools is a testament to the resilience and
adaptability of our society in the face of emerging technologies. Figure 3.4.1 presents a
diagram illustrating the various malicious applications of deep fake technology.

In the center, a caption reads "WHAT ARE DEEPFAKES USED FOR?", indicating that the surrounding nodes explain different uses of deep fakes. Surrounding this central question are circles, each describing a particular use: the creation of false narratives or fake news that can influence the outcome of elections; the unethical practice of superimposing celebrities' faces onto explicit content without their consent; the manipulation of individuals or social groups, often for fraudulent purposes or to gain unauthorized access to systems;[7] the impersonation of individuals for unauthorized access to personal information or resources; and the deception of individuals or organizations into making financial transactions under false pretenses.
CHAPTER 2

LITERATURE REVIEW

The discussions revolve around the utilization and advancement of Mel Frequency Cepstral Coefficients (MFCC) within the sphere of deep learning models. The proposed research endeavors aim to harness and optimize the MFCC feature extraction process, thereby bolstering the efficacy of deepfake audio identification and verification. This enhanced focus on MFCC, integrated with sophisticated neural network architectures, seeks to significantly amplify the precision and reliability of deepfake audio detection systems.

2.1 Related Literature works

Kumar, Basant, and Shatha Rashad Alraisi: This article presents and discusses a new method for identifying AI-generated audio deepfakes. Convolutional neural networks (CNNs) are used to develop deep learning techniques and act as a kind of black box for detecting acoustic events. The suggested models may serve as reliable reference networks for audio classification. "Extreme Inception," or "XCeption," represents the application of the concepts of Inception taken to their logical conclusion.[1] The CNN uses a deep convolutional architecture made up of depthwise separable convolutions to detect audio, with the goal of generating real-time results and optimizing the entire system (COM-IT-CON, vol. 1, pp. 463–468, IEEE, 2022).

Hamza, A., Javed, A. R. R., Iqbal, F., Kryvinska, N., Almadhor, A. S., Jalil, Z., & Borghol, R. (2022): The goal of this research is to detect deepfake audio using machine learning and deep learning techniques. Specifically, audio features are extracted by means of Mel-frequency cepstral coefficients (MFCCs). The Fake-or-Real dataset, which is divided into four smaller datasets according to bit rate and audio length, is used in this study.[2] The results show that on the for-rece and for-2-sec datasets, the support vector machine (SVM) model obtained the maximum accuracy. On the for-norm dataset, the gradient boosting model performed better than other advanced methods; on the for-original dataset, the VGG-16 deep learning model performed better than any other model.
Xuechen Liu; Xin Wang; Md Sahidullah; Jose Patino; Héctor Delgado; Tomi Kinnunen;
Massimiliano Todisco; The ASVspoof 2021 challenge focused on detecting spoofing and
deepfake audio with 54 teams participating. It included three tasks: logical access (LA),
physical access (PA), and the new Deep fake (DF) task. LA task findings suggest
countermeasures are effective against encoding and transmission effects. PA task results
show promise in detecting replay attacks in real environments but highlight a
vulnerability to differences between real and simulated acoustics.[3] The DF task,
targeting online manipulated speech detection, reveals resilience to compression but a
lack of generalization across datasets. The challenge also provided insights into
influential data factors, performance on unseen data, and outlined future directions for
ASVspoof.

Almutairi, Z., & Elgibreen, H. (2022).This article offers a thorough analysis of Audio
Deepfakes (ADs), a human voice cloning technology that was first developed to improve
audiobooks but now poses threats to public safety. In order to identify imitation- and
synthetic-based Deepfakes, it analyzes datasets and examines current AD detection
techniques. The review demonstrates how different approach types affect detection
performance and points out the trade-off between scalability and accuracy. Along with
discussing obstacles and future objectives for AD detection research, the article
highlights the need for more reliable models that can identify fakeness in a variety of
audio situations.[4] All things considered, it is an invaluable tool for scholars who want
to comprehend the situation of the AD literature right now and create better detection
methods.

Abderrahim Fathan, Jahangir Alam, Woo Hyun Kang, IEEE/2022: The paper focuses on
detecting fake audio clips, emphasizing the need for improved models against audio
spoofing attacks. The study introduces specialized audio augmentations, achieving a
notable 2.8% EER on ASVspoof 2019. Unlike traditional approaches, it explores Mel-
spectrogram image features and diverse audio codecs for robustness against variations in
ASVspoof2021.[5] Employing WaveletCNN and VGG16 architectures, the research
highlights their superiority over baselines in handling crucial spectral information for
spoofing detection. Additionally, it reveals a significant degradation in countermeasure
system robustness when exposed to degraded speech samples through VoIP or
mismatched audio compression. The findings contribute insights to deep fake detection,
showcasing the effectiveness of novel augmentations, image features, and specialized
architectures.

Abu Qais, Akshar Rastogi, Akash Saxena, IEEE/2019: This paper proposes a


Convolutional Neural Network (CNN)-based system for detecting speech spoofing using
diverse audio features. Addressing the risks of deep fake audios, the study emphasizes the
potential threats to individual and national security. The system optimizes computation
by converting audios into images (Spectrogram, MFCC, FFT, STFT) and feeding
numeric array values into the model. Various approaches for data feeding are explored,
both individually and in a concatenated manner.

Basant Kumar, Shatha Rashad Alraisi, IEEE/2022: This article uses a Convolutional Neural Network (CNN), a reliable deep learning technique, to present a novel tool for identifying AI-generated audio deep fakes. Based on the XCeption architecture, the suggested models function as reliable standard networks for audio classification. CNN's audio detection capabilities are improved by XCeption, which stands for "extreme inception," through the use of depthwise separable convolutions. The method seeks to simplify the system and provide real-time deepfake detection findings.

2.2 Detection Using Deep learning

The MFCC series of an audio file showcases the amplitude in decibels (dB), emphasizing the auditory power of the signal. Subsequently, feature extraction and selection processes are elucidated, wherein each audio waveform undergoes initial processing to generate a vector group representing the MFCCs for every frame. The comparison between fake and genuine audio signals in spectrogram representation, as depicted, underscores[9] the relevance of auditory features in distinguishing between deep fake and authentic audio signals. Our approach involves meticulous feature selection, guided by Principal Component Analysis (PCA), to identify salient characteristics conducive to deep fake detection. By reducing the feature set to 65 crucial components, selected based on PCA's explained variance ratio metric (97%), we ensure the efficacy and relevance of the data utilized in our detection models. This introductory insight lays the foundation for a comprehensive exploration of deep fake audio detection methodologies, underscoring the importance of feature selection and representation in enhancing model performance and robustness.
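A minimal sketch of this feature extraction and reduction step is shown below, assuming the librosa and scikit-learn libraries; the file paths, the number of MFCCs per frame, and the mean/standard-deviation summary are illustrative choices, while the 0.97 variance threshold mirrors the explained-variance criterion described above (which retained 65 components in our configuration).

import numpy as np
import librosa
from sklearn.decomposition import PCA

def mfcc_features(path, n_mfcc=40):
    # Load the waveform, compute per-frame MFCCs, and summarise each
    # coefficient by its mean and standard deviation over time.
    signal, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical list of genuine and counterfeit audio files.
audio_paths = ["real/sample_001.wav", "fake/sample_001.wav"]
features = np.array([mfcc_features(p) for p in audio_paths])

# Keep the components explaining ~97% of the variance, as described above.
pca = PCA(n_components=0.97)
reduced = pca.fit_transform(features)
print(reduced.shape, pca.explained_variance_ratio_.sum())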
The proposed research endeavours to address the challenge of deepfake audio detection through the application of machine learning algorithms. Deepfake technologies have become increasingly sophisticated, posing significant threats to various domains, including media authenticity, privacy, and security. To combat these threats, researchers have explored the efficacy of different machine learning techniques, including the Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), and Extreme Gradient Boosting (XGB), in identifying and mitigating the impact of deepfake audio content. The SVM is a well-established supervised learning method known for its ability to construct optimal decision boundaries in high-dimensional spaces. By maximizing the margin between classes (Equation 2.2.1), SVM aims to achieve robust classification performance. In the context of deepfake audio detection, SVM leverages a hyperplane to separate genuine and manipulated audio samples, utilizing support vectors to define the decision boundary. Despite its effectiveness in high-dimensional environments, SVM may encounter challenges in handling large datasets and providing probability estimates directly.

Equation 2.2.1 - Support Vector Machine Equation
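The equation image is not reproduced in this draft; in standard notation (assumed here), the hard-margin formulation that Equation 2.2.1 refers to can be written as:

\[
f(\mathbf{x}) = \operatorname{sign}\!\left(\mathbf{w}^{\top}\mathbf{x} + b\right),
\qquad
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1,\ \ i = 1, \dots, N
\]

where \(\mathbf{x}_i\) are the feature vectors (here, the reduced MFCC features), \(y_i \in \{-1, +1\}\) are the genuine/fake labels, and \(\mathbf{w}, b\) define the separating hyperplane.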


The information flow through the LSTM cell (Equation 2.2.2) in a neural network is updated and controlled by gates and states, which are implemented as mathematical equations. This enables the network to learn and remember patterns in sequential input over extended periods, making it effective for tasks like speech recognition, language translation, and time series prediction. The LSTM is a recurrent neural network (RNN) architecture intended to extract long-term dependencies from sequential data. The following are the equations for an LSTM cell.
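The equation image is not reproduced in this draft; in the standard notation assumed here, the gates and states referenced as Equation 2.2.2 are:

\[
\begin{aligned}
f_t &= \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)}\\
o_t &= \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)}\\
\tilde{c}_t &= \tanh\!\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
\]

where \(\sigma\) is the sigmoid function and \(\odot\) denotes element-wise multiplication.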

Equation 2.2.2- Long Short-Term Memory Equation

These gates can be understood through the analogy of reading a book. The forget gate is like the brain's way of deciding what is no longer important: if a character in the book is introduced briefly but never mentioned again, the brain might decide to forget about them. The input gate is like the brain's way of deciding what new information is worth remembering: if a new character is introduced and becomes important to the story, the brain might choose to remember details about them. The output gate is like the brain's way of deciding what information to use or share from its memory: when we want to talk about the book with a friend, we only mention the parts that are relevant to the conversation. The cell's memory states behave like short-term and long-term memory: one stores information temporarily and gets updated based on what to remember (input gate) and what to forget (forget gate), while the other retains important information for a longer time and is read out according to what is currently relevant (output gate).

Extreme Gradient Boosting (XGB) represents a powerful
ensemble learning technique that combines the strength of gradient boosting with
optimized parallel processing. By iteratively refining weak base models, XGB effectively
minimizes prediction errors and enhances overall model performance. However, XGB
may be susceptible to outliers and requires careful parameter tuning to mitigate the risk of
overfitting. Despite these challenges, XGB has demonstrated promising results in
detecting deepfake audio content, particularly when deployed with appropriate learning
rates and estimator settings.
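As a hedged illustration (assuming the xgboost package; the variable names reduced and labels carry over from the earlier MFCC/PCA sketch, and the learning rate, tree depth, and number of estimators are arbitrary example settings), such a classifier could be trained as follows.

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# 'reduced' (PCA-reduced MFCC features) and 'labels' (0 = real, 1 = fake)
# are hypothetical arrays from the earlier feature-extraction sketch.
X_train, X_test, y_train, y_test = train_test_split(
    reduced, labels, test_size=0.2, random_state=42)

xgb = XGBClassifier(n_estimators=300, learning_rate=0.05,
                    max_depth=5, eval_metric="logloss")
xgb.fit(X_train, y_train)                       # iteratively refines weak trees
print("XGB test accuracy:", xgb.score(X_test, y_test))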

2.3 Previous frameworks

In terms of accuracy, the suggested method for deep fake audio detection employing machine learning models, specifically the VGG-16 model and LSTM, performs better than the baseline method. By contrast, the SVM baseline technique received a testing score of 67%, so the suggested approach demonstrates a notable improvement in accuracy. Furthermore, the suggested method received the greatest testing score of 73%, surpassing the best score of the previous study utilizing the SVM model by 26%. This illustrates the superiority of the suggested method over the baseline and how well it works to detect deepfake audio. Several CNN models have been trained; a number of them are enumerated in Table 2.3 according to their accuracy. The table reports parameters and MACCs in millions.

Table 2.3: Complexity and accuracy of known convolutional neural networks (CNNs)

2.4 ASV spoof 2021

The ASVspoof 2021 challenge addressed the detection of spoofed and deepfake speech in
realistic and useful circumstances with the goal of furthering the field of automatic
speaker verification (ASV). 54 teams entered the challenge, submitting work for the three
tasks of deepfake (DF) detection, physical access (PA), and logical access (LA). The
development of trustworthy countermeasures (CMs) against different spoofing techniques, such as voice conversion (VC), text-to-speech (TTS) synthesis, and replay attacks, was emphasized; Table 2.4 lists the t-DCF cost function parameters assumed in the challenge.

Table 2.4 : t-DCF cost function parameters assumed in ASVspoof 2021


The inclusion of the DF task, which sought to identify distorted and compressed speech data released online, was one of the challenge's main features. The DF task was centered on assessing the performance of independent CM systems, in contrast to the LA and PA tasks that utilized both ASV systems and CMs. Although detection methods in the DF task demonstrated some resistance to compression effects, the challenge results showed that these systems lacked cross-source dataset generalization. The ASVspoof 2021 challenge highlighted the significance of assessing CM performance in real-world scenarios where speech data is coded, compressed, and transmitted via telephone channels (LA task) or undergoes acoustic propagation in physical spaces (PA task). The challenge aims to test the effectiveness of CM solutions under more realistic conditions by modeling scenarios that resemble real-world applications. The challenge results also made clear how important it is to address key data elements that have an impact on CM performance.

For the purpose of creating reliable and broadly applicable deepfake and spoofing speech
detection systems, it is essential to comprehend these elements. The challenge's best systems showed t-DCFs at the ASV floor, demonstrating low error rates and proficient voice spoofing detection.
voice spoofing detection. The ASVspoof 2021 challenge yielded significant insights into
the creation of trustworthy ASV systems and CMs for the identification of deepfake and
spoofed speech. The outcomes emphasized how crucial it is to take into account real-
world circumstances, significant data elements, and the resilience of detection systems in
the face of changing spoofing attempts. Future studies in this field ought to concentrate
on resolving the challenges' constraints and investigating fresh approaches to the
development of speech and language processing technologies.

2.5 Challenges and Future Directions

This section reviews the creation of datasets and techniques for audio deepfake detection in response to the growing usage of AI-generated tools for producing false audio. The review includes datasets used for imitation- and synthetic-based deepfake detection, as well as comparative analysis of current approaches. It also draws attention to difficulties and possible avenues for further study in this field. The introduction describes how the misuse
of audio deep fakes and the emergence of AI-generated technologies have raised
concerns for public safety. In order to prevent the spread of false information, it
highlights the necessity of distributed audio recordings being authenticated. It divides
audio deepfake assaults into three categories: replay, synthetic, and imitation. It explores
different machine learning (ML) and deep learning (DL) techniques created for the
purpose of identifying phony audio, emphasizing the drawbacks and difficulties of each
technique.

It also discusses the need for further research to address the gaps in audio
deepfake detection. This provides a comprehensive overview of recent studies and
datasets related to audio deepfake detection. It compares the performance of different
detection methods, highlighting the varying success rates, overfitting concerns, and the
impact of specific features and datasets on the accuracy of detection methods. It
emphasizes the need for more robust detection models that can handle diverse languages,
accents, and real-world noises. It also stresses the importance of addressing the
challenges in audio deep fake detection and the potential for future research in this field.
The current AD detection methods require extensive preprocessing, particularly in
classical ML methods, which necessitates significant manual labor to prepare the data.
Similarly, DL-based methods use an image-based approach to understand audio features,
which can affect the performance of the method. Most existing studies focus on
developing detection methods for English-speaking voices, with limited attention to non-
English languages such as Arabic. The unique challenges of languages like Arabic,
including alphabet pronunciation and accents, pose significant obstacles for traditional
audio processing and ML learning models.

The figure below demonstrates the basic working concept behind the AD detection process: two different types of voice input, synthetic and imitation audio, are provided; the training process is then carried out using these voices with the help of an audio deepfake detection model; and fake voice recognition is performed.

Figure 2.4.1 - An illustration of the AD detection process

There is limited coverage of other languages and accents in the majority of datasets
created for AD detection, which are centered on English. To supplement the current AD
detection methods, for instance, fresh datasets based on the syntactic fakeness of the
Arabic language are required. In AD detection techniques, there is a significant trade-off between scalability and accuracy. ML techniques are more accurate, but their scalability is impacted by the need for extensive training and manual feature extraction, whereas certain audio file changes cause DL algorithms to falter, which limits their scalability. Imitation-based deepfake voice generation is shown in Figure 2.4.2.

Figure 2.4.2 - Imitation-based deep fake

2.6 Deep Convolutional Neural Network (CNN)

The study makes use of these cutting-edge neural network architectures to identify deepfake sounds, an important role in preventing malicious attacks on voice-based authentication systems. The careful approach taken in model creation is highlighted by the use of the Adam optimizer in the CNN model's training, together with choices such as the number of epochs, the ReLU activation function in hidden layers, and the SoftMax function in the output layer. The model's robustness is further increased by the strategic use of Dropout to avoid overfitting and the inclusion of the categorical cross-entropy function for loss computation. Additionally, the study uses a variety of datasets to support the CNN model's training and evaluation, including DFDC, Blazeface, Deepfakes Inference Demo, and particular models like Deepfake XCeption.
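A minimal sketch in this spirit is given below, assuming TensorFlow/Keras; the input shape, number of filters, and layer sizes are illustrative assumptions rather than the exact configuration used in the cited studies.

from tensorflow.keras import layers, models

# CNN sketch: ReLU hidden layers, Dropout against overfitting, a SoftMax
# output, the Adam optimizer, and categorical cross-entropy loss.
model = models.Sequential([
    layers.Input(shape=(40, 200, 1)),           # e.g. an MFCC "image" per clip
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(2, activation="softmax"),      # real vs. fake
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()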

The all-inclusive process includes importing libraries, obtaining test films,


generating helpers, putting the XCeption Net into practice, and finishing with the
combination of ResNEXT and XCeption architectures. The study's research design
includes developing strategies, choosing datasets, choosing algorithms, and putting a
deep fake audio clip characteristic equation into practice. The thorough examination of
both genuine and deepfake audio recordings, together with differences in text grammar
and depth, highlights how rigorously model evaluation and performance assessment are
done. In order to evaluate the accuracy and effectiveness of the various models in audio
detection, the tests required examining them according to predetermined criteria. The effectiveness of this strategy was first evaluated using the UrbanSound8K dataset. CNN had the best accuracy of the three models, but even though the model correctly identified sounds, it did not fare well in the accuracy test, receiving an average score of 75.1%.

Figure 2.6.1: The overall structure of pre-training and fine-tuning

The above image demonstrates a basic approach to how CNN models have been used in some previously implemented approaches; Figure 2.6.1 shows (a) pre-training and (b) fine-tuning, which illustrate how the CNN encoder works with both types of audio inputs.
CHAPTER 3

DEEP FAKE AUDIO DETECTION USING MFCC

In the proposed methodology, we elaborate on the dataset employed, outline the proposed architecture for detecting deep fake audio, and discuss the algorithms utilized for analysis. Initially, we curated a dataset comprising both genuine and counterfeit audio samples. Subsequently, we extracted Mel-Frequency Cepstral Coefficient (MFCC) features from the audio data to facilitate a thorough examination of the dataset. Utilising a suite of sophisticated and pre-trained neural networks, we then proceeded to analyse the voice samples on a grand scale. This approach allowed us to evaluate the effectiveness of various models in accurately identifying deepfake audio, ultimately focusing on the one that demonstrated the highest accuracy for this specific task; the use of a balanced dataset of genuine and counterfeit samples is central to developing reliable detection models.
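For concreteness, a minimal sketch of this curation step is shown below; it assumes a hypothetical folder layout with real/ and fake/ subdirectories of WAV files and reuses the mfcc_features helper from the sketch in Section 2.2, so the names and paths are illustrative rather than prescriptive.

import os
import numpy as np

def build_dataset(root="dataset"):
    # Walk a hypothetical dataset folder with 'real/' and 'fake/' subfolders,
    # extract one MFCC feature vector per clip, and build the feature matrix
    # X and label vector y (0 = genuine, 1 = counterfeit).
    X, y = [], []
    for label, folder in enumerate(["real", "fake"]):
        for name in os.listdir(os.path.join(root, folder)):
            if name.endswith(".wav"):
                X.append(mfcc_features(os.path.join(root, folder, name)))
                y.append(label)
    return np.array(X), np.array(y)

X, y = build_dataset()
print(X.shape, np.bincount(y))   # check the genuine/counterfeit balance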

3.1 System Architecture Diagram of our Application


In the rapidly evolving digital era, the security of user authentication systems has become
paramount. Traditional methods based on passwords and PINs are increasingly
supplanted by biometric technologies, with voice recognition emerging as a uniquely
promising avenue. Figure 3.1 depicts an Architecture Diagram. This report details a
sophisticated Voice Authentication System (VAS) designed to ensure high security and
user convenience through voice biometrics. Voice authentication is predicated on the
unique characteristics of an individual's speech, which includes rhythm, pitch, and
speaking style. As with fingerprints and other biometric traits, these vocal attributes are
inherently difficult to replicate, making them potent tools for secure identification. The
flowchart presented provides a visual representation of the complex processes behind the
VAS, starting with user interaction and culminating in voice classification.

The user-first approach is emphasized by the prominent placement of user


interaction options at the system's commencement. At the heart of the system lies a
sophisticated Database, presumed to store a multitude of user credentials, including voice
prints. This centralized repository serves as the reference point against which all
authentication attempts are verified. The user database is a secure vault where personal
identifiers are stored. For voice authentication, this includes digital models of users' voice
patterns. The presence of an Image Database suggests multimodal biometrics where voice
recognition might be paired with facial recognition or other image-based authentication
methods for enhanced security. When a user attempts to access the system, their voice is
uploaded and undergoes rigorous analysis.

The system employs advanced signal processing algorithms to detect the presence and
quality of the voice input, ensuring that the sample is clear and usable. Voice extraction
involves isolating the voice from any background noise and processing it to extract the
salient features necessary for authentication. After feature extraction, the VAS compares
the input against stored voice prints. A dynamic comparison algorithm analyzes the
sample for matches within the database, accounting for minor fluctuations in the user's
voice. The final classification is the system's decisive step, determining the authenticity
of the user's voice. Sophisticated machine learning models are utilized to distinguish
between genuine users and fraudulent attempts, possibly involving deep fake audio or
voice spoofing.

The system's emphasis on security is evident from the multifactorial


approach to user verification. The VAS operates on the premise that each voice contains
an array of immutable and individual features, creating a 'voiceprint' as unique as a
fingerprint. In comparison to other biometric systems, voice authentication provides a
seamless user experience, as it requires no physical interaction with the device, making it
suitable for remote authentication scenarios. A critical feature of the system is its
feedback loop, which suggests a continuous learning mechanism. Real voice detections
are likely utilized to refine the model's accuracy, indicating the employment of adaptive
algorithms that can learn and update the system to respond to new data and emerging
threats. Looking forward, the VAS stands as a beacon of modern security technology. Its
integration with Artificial Intelligence (AI) and machine learning models not only
ensures robust protection against unauthorized access but also offers scalability and
adaptability for future security challenges.

As the technology matures, we can anticipate broader applications


extending beyond secure access, into realms like voice-based transactions and smart
home control. The report concludes that the Voice Authentication System flowchart
exemplifies a well-architected framework for secure and user-friendly authentication. The
system's methodical approach, as depicted in the flowchart, guarantees both the
protection against and the detection of increasingly sophisticated fraudulent activities in
the digital domain. As voice-forging technologies advance, the VAS's role will only grow
in importance, underscoring the need for continued innovation in the field of biometric
security.

This section has explored the intricacies of the Voice Authentication System as depicted
in the flowchart and extended into a broader discussion of the implications and
applications of voice authentication technologies.

3.2 Mel Frequency Cepstral Coefficients

Mel-Frequency Cepstral Coefficients (MFCCs) represent a widely adopted feature in the


field of audio and speech processing, especially for tasks involving voice recognition,
speech detection, and, more pertinently, the analysis of deep fake audio. The efficacy of
MFCCs in these domains can be attributed to their ability to capture the primary
characteristics of human speech, rendering them a robust feature for distinguishing
between genuine and synthetic (deepfake) voices. This document delves into the rationale
behind employing MFCCs for analyzing deepfake audio, highlighting the methodological
underpinnings that make MFCCs particularly suited for this purpose. The digital realm
has witnessed a surge in the creation and distribution of deep fake audio, primarily due to
advancements in artificial intelligence and machine learning technologies. These
deepfake audios, often indistinguishable from real voice recordings to the untrained ear,
pose significant challenges in areas ranging from security to misinformation.

Given the subtle nuances that differentiate genuine audio from its
counterfeit counterparts, conventional analysis methods that rely on simple amplitude-
time graphs prove inadequate. It is within this context that MFCCs emerge as a critical
tool. Figure 3.2.1 demonstrates the steps that are performed during MFCC feature
extraction.

Figure 3.2.1 - Steps involved in MFCC Feature Extraction

MFCCs are predicated on the understanding that human auditory perception does not
follow a linear scale. Instead, humans perceive frequency in a Mel-scale, a concept that
approximates the human ear's response to different frequencies. This non-linear
perception is crucial when analyzing complex sounds like speech, where distinguishing
between different phonemes (the smallest units of sound) becomes essential, especially in
nuanced applications like identifying deepfake audio.

1. Fast Fourier Transform (FFT): The first step involves applying the FFT to the signal.
This transformation is critical as it shifts the signal from the time domain to the frequency
domain, providing a spectrum that represents the signal's frequency components over time.
Figure 3.2.2 gives a brief overview of the Fourier transform for an audio signal.

Figure-3.2.2 Fourier Transform for an audio signal

2. Mel Filter Bank: After FFT, the spectrum is passed through a set of triangular filters,
collectively known as the Mel Filter Bank which is depicted in Figure 3.2.3. These filters
are designed to mimic the human ear's response, emphasizing the frequencies most
pertinent to human speech. The result of this step is a series of energy outputs from each
filter, which effectively captures the most significant aspects of the speech signal.

Fig-3.2.3 Mel Filter bank

3. Logarithmic Scaling: The energies obtained from the Mel Filter Bank are then
subjected to a logarithmic scale. This step is vital because human perception of sound
intensity is also logarithmic, meaning that this scaling makes the features more
representative of how humans perceive sound.

4. Discrete Cosine Transform (DCT): Finally, to decorrelate the log filter bank energies
and achieve a compact representation, a Discrete Cosine Transform is applied. The result
of the DCT is a set of coefficients that constitute the MFCCs. The first few coefficients
(typically the first 12-13) are used as they capture the most significant characteristics of
the signal, while higher-order coefficients, which represent finer details that are less
important for speech recognition, are discarded.
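
These four steps can be reproduced almost directly with librosa; the sketch below is only
illustrative (the file name and the frame and filter-bank parameters are placeholders, not the
exact settings used in this project):

import numpy as np
import scipy.fftpack
import librosa

audio, sr = librosa.load('example.wav', sr=None)              # hypothetical input file

# 1. FFT on short overlapping frames (power spectrum)
power_spec = np.abs(librosa.stft(audio, n_fft=2048, hop_length=512)) ** 2

# 2. Mel filter bank applied to the power spectrum
mel_spec = librosa.feature.melspectrogram(S=power_spec, sr=sr, n_mels=40)

# 3. Logarithmic scaling of the filter-bank energies
log_mel = librosa.power_to_db(mel_spec)

# 4. Discrete cosine transform, keeping the first 13 coefficients
mfcc = scipy.fftpack.dct(log_mel, axis=0, type=2, norm='ortho')[:13]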

In the realm of deep fake audio detection, MFCCs offer a nuanced understanding of the
voice signal, enabling the identification of inconsistencies that may not be apparent in the
waveform directly. These inconsistencies are often the hallmark of synthetic audio,
arising from the differences in how deep learning models generate speech compared to
natural speech patterns. By focusing on the perceptually most relevant parts of the speech
signal, MFCCs can highlight anomalies that differentiate deepfakes from genuine audio.
Moreover, the robustness of MFCCs against various noise types and their effectiveness in
low-resource settings make them particularly suitable for this task. Deepfake audio
detection models leveraging MFCCs can achieve high accuracy, as these features
encapsulate the essence of human speech perception, making them a powerful
discriminator between real and counterfeit voices. The utilization of MFCCs in the
analysis of deep fake audio is underpinned by their ability to transform complex voice
signals into a compact, perceptually meaningful representation. This transformation,
grounded in the human auditory system's characteristics, allows for the effective
differentiation between genuine and synthetic audio. As deep fake technology evolves,
the role of MFCCs in identifying and mitigating its implications remains indispensable,
providing a reliable method for safeguarding against the challenges posed by these
advanced synthetic audio technologies.

3.3 Convolutional Neural Network :

The architecture of a neural network, including the number of layers and the choice of
activation functions, deeply influences its ability to learn complex patterns in the data.
When distinguishing between real and fake instances—common in tasks such as binary
classification, fraud detection, or generating synthetic data with Convolutional Neural
Networks (CNNs)—the model's depth (number of layers) and breadth (number of units
per layer) play crucial roles. The mathematical principles behind CNNs for audio data are
similar to those used for image data, with the primary difference being the dimensionality
of the data. While image data is typically 2D (height x width), audio data is 1D (time).
Here's how the key mathematical components of CNNs apply to audio data. For audio,
the convolution operation involves sliding a 1D convolutional kernel (filter) across the
audio signal:
Equation 3.3.1 - 1D convolutional kernel
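
Since the equation itself is embedded as an image, a standard written form of a 1D
convolution over an input signal x with a kernel w of length K and bias b (the exact
notation in the report may differ) is:

y[t] = b + \sum_{k=0}^{K-1} w[k] \, x[t+k]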

This Equation 3.3.1 can extract local features like frequency patterns and changes over
time. Pooling layers in audio work by reducing the time resolution of the feature maps
(often called down-sampling). Max pooling is common:

Equation 3.3.2 - Max pooling layers
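
In its usual written form, max pooling with pool size P and stride S reduces the time
resolution as:

y[t] = \max_{0 \le k < P} x[tS + k]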

After convolutions and pooling, the feature maps might be flattened into a 1D vector and
connected to dense layers, similar to those in image CNNs. These layers can combine the
local features to form more global features, which are then used for tasks like
classification or regression. Activation functions like ReLU or sigmoid introduce non-
linearity:

Equation 3.3.3 - Activation function (Sigmoid)
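
The sigmoid referred to here has the standard form:

\sigma(z) = \frac{1}{1 + e^{-z}}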


Figure 3.3.1 Sigmoid Function

For audio, these functions help to model complex patterns in the data. The same loss
functions apply, with the cross-entropy loss being common for classification:

Equation 3.3.4- Cross entropy loss
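
For binary real-versus-fake classification over N samples with labels y_i and predicted
probabilities \hat{y}_i, the cross-entropy loss is commonly written as:

L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]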

Often, audio signals are converted into a time-frequency representation called a


spectrogram before being input into a CNN. The spectrogram is a 2D representation with
time on one axis and frequency on the other, enabling the use of 2D CNNs similar to
those used for images. The spectrogram is computed using the Short-Time Fourier
Transform (STFT):
Equation 3.3.5 - Short-Time Fourier Transform(STFT)
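
A standard discrete-time form of the STFT, with analysis window w and hop size H (the
report's embedded equation may use slightly different notation), is:

X(m, \omega) = \sum_{n=-\infty}^{\infty} x[n] \, w[n - mH] \, e^{-j\omega n}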

In summary, CNNs process audio data by learning to recognize important temporal


features and patterns, which can then be used for tasks such as audio classification,
speech recognition, and audio generation. The transformation of audio into a time-
frequency domain via spectrograms allows for a rich and nuanced representation that can
be exploited by the hierarchical nature of CNNs to learn robust features relevant to the
task at hand.

1) Input Layer: The network starts with an 8-neuron dense layer that receives input of
shape (13,). This shows that the model expects each input instance to be a 1-dimensional
array with 13 attributes. Since all 13 inputs are connected to every neuron in the Dense
layer, this layer is fully connected.

2) Activation Layer - ReLU: Each neuron in the first dense layer is followed by a
Rectified Linear Unit (ReLU) activation function. ReLU is chosen for its efficiency and
effectiveness in introducing non-linearity to the model, allowing it to learn complex
patterns.

3) Dropout Layer: A Dropout layer follows the activation, randomly "dropping out" a
fraction of the neurons during each training pass. This regularization step reduces
overfitting by forcing the network to learn features that do not depend on any single
neuron.
4) Hidden Layers: The model includes two additional sets of Dense, Activation (ReLU),
and Dropout layers, progressively increasing in complexity with 16 and 32 neurons,
respectively. This design enables the model to capture more complex relationships in the
data as it moves deeper.

5) Output Layer: The network concludes with an additional dense layer containing a
single neuron, succeeded by a sigmoid activation function. This configuration is
commonly used for binary classification problems, in which the model produces a
probability, usually labeled as 1, indicating the chance that the input belongs to the
positive class.
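
A minimal Keras sketch of the layer stack described in points 1 to 5 is given below; the
dropout rate is a placeholder (the text does not state it at this point), and the optimizer and
loss follow the choices reported later in the document:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout

model = Sequential([
    Dense(8, input_shape=(13,)),        # 13 MFCC attributes per input instance
    Activation('relu'),
    Dropout(0.2),                       # placeholder rate
    Dense(16),
    Activation('relu'),
    Dropout(0.2),
    Dense(32),
    Activation('relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')      # probability of the positive (fake) class
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()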

Real-world data is often complex and high-dimensional. Deep neural networks (DNNs)
with multiple layers can learn a hierarchy of features at different levels of abstraction. For
example, in image recognition, lower layers might detect edges, while deeper layers
might identify more complex shapes or objects. Activation functions introduce non-
linearities into the model, which are essential for learning complex, non-linear decision
boundaries between real and fake instances. Without non-linear activation functions, a
deep neural network would essentially become a linear model, unable to capture the
complexity or the nuances in data. A model with too few layers might underfit, meaning
it cannot capture the underlying structure of the data. Conversely, a very deep model
might overfit, especially if it has more parameters than there are examples in the training
data, capturing noise rather than the signal. Dropout layers, as in our model, help
mitigate overfitting by randomly "dropping out" a subset of neurons during training,
forcing the network to learn more robust features.

The goal of a neural network is not just to memorize the training data but to
generalize well to unseen data. The architecture of the network, including the number of
layers and the types of activation functions, must be carefully chosen to balance between
learning the training data's patterns and maintaining the ability to generalize to new,
unseen data. More layers and neurons require more computational resources and time to
train. It's essential to find a balance that allows the model to learn the necessary features
to distinguish between real and fake instances without unnecessarily increasing the
computational cost. The exact number of layers and neurons often comes down to
empirical experimentation. Researchers and practitioners use techniques like cross-
validation, grid search, and domain knowledge to find an architecture that works best for
their specific problem. The process of finding the right model architecture involves a
combination of theoretical knowledge, empirical testing, and adjustments based on the
model's performance on validation data. The architecture described above gives a
complete overview of how the CNN works.

3.4 Data Normalization

Normalization is paramount in ensuring that the data fed into the model is on a
comparable scale. This process involves adjusting the audio samples so that their
amplitude levels are standardized, preventing models from being biased by variations in
loudness. Similarly, duration normalization is essential, as audio samples might vary
significantly in length. Techniques such as padding shorter samples or truncating longer
ones ensure uniformity, allowing the model to focus on the content's quality rather than
its quantity. Exploratory Data Analysis (EDA) plays a crucial role in identifying outliers
or anomalous samples that could potentially skew the model’s learning. Anomalies might
arise from recording errors, background noise, or corrupt files. By employing
visualization tools and statistical methods to scrutinize the dataset, such anomalies are
flagged. A decision is then made whether to correct or remove these samples altogether,
ensuring the dataset’s integrity is preserved. To enhance the model's ability to generalize
across a broader spectrum of voice samples, real voice data is augmented. This
augmentation can involve:

1) Pitch Adjustments: Slightly altering the pitch of the voice samples can simulate
different speaking tones, helping the model learn to recognize real voices across a wider
range of variations.

2) Noise Additions: Introducing background noises or white noise to clean voice samples
mimics real-world recording conditions, training the model to focus on the voice
characteristics despite external disturbances.
3) Speed Variation: Adjusting the speed of voice samples without altering the pitch
(time-stretching) introduces temporal variations, further diversifying the training data.
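
The three augmentations listed above map onto standard librosa operations; the sketch
below is illustrative only (the helper names and parameter values are our own, not the
exact settings used in this project):

import numpy as np
import librosa

def pitch_shift(audio, sr, n_steps=2):
    # 1) Pitch adjustment: shift by n_steps semitones without changing duration
    return librosa.effects.pitch_shift(audio, sr=sr, n_steps=n_steps)

def add_noise(audio, noise_level=0.005):
    # 2) Noise addition: mix in low-level white noise
    return audio + noise_level * np.random.randn(len(audio))

def change_speed(audio, rate=1.1):
    # 3) Speed variation: time-stretch without altering pitch
    return librosa.effects.time_stretch(audio, rate=rate)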

3.5 Extract MFCC Features

Algorithm 3.5.1 for extracting Mel-Frequency Cepstral Coefficients (MFCC) from a


collection of audio data is a cornerstone in the field of audio analysis, particularly in
applications that require a nuanced understanding of audio characteristics. The input for
this process, `audioDataList`, comprises tuples pairing audio arrays with their
corresponding sample rates, reflecting the digital encapsulation of sound and the
frequency at which its data points were sampled. The MFCC extraction algorithm then
processes this data to produce a series of MFCC feature arrays, each a compact
representation of an audio sample’s spectral properties.

At the heart of the algorithm lies the initialization of an empty list, `mfccFeatures`,
destined to hold the MFCC arrays derived from each audio sample. The algorithm
meticulously iterates over `audioDataList`, applying the `librosa.feature.mfcc` function to
each audio tuple. This function, a key component of the `librosa` library, is instrumental
in transforming raw audio data into the MFCC features. The transformation process
involves a sophisticated sequence of steps starting with the Fourier transform to transition
the audio from the time domain to the frequency domain. Subsequent warping of these
frequencies onto a Mel scale, followed by the logarithmic scaling of power at each Mel
frequency, and the application of a discrete cosine transform (DCT), culminates in the
generation of the MFCCs. These coefficients effectively capture the essence of the audio
signal’s form, making them invaluable for various audio processing tasks.
The MFCCs for each audio sample are systematically appended to the `mfccFeatures`
list. Upon completion of the extraction process across all tuples in `audioDataList`, the
algorithm concludes by returning the `mfccFeatures` list.

Algorithm 3.5.1 Extract MFCC Features
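
A minimal Python sketch of this procedure, using librosa.feature.mfcc and averaging each
coefficient over time as in the application code of Appendix 2 (the function name is ours),
might look like:

import numpy as np
import librosa

def extract_mfcc_features(audio_data_list, n_mfcc=13):
    # audio_data_list: list of (audio_array, sample_rate) tuples
    mfcc_features = []
    for audio, sample_rate in audio_data_list:
        mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc)
        mfcc_features.append(np.mean(mfccs, axis=1))   # one 13-value vector per clip
    return mfcc_features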

This list now contains a comprehensive set of MFCC feature arrays, each array a detailed
acoustic fingerprint of an audio sample. The versatility of MFCCs extends to a wide array
of applications, including but not limited to speech recognition, where they facilitate the
identification of linguistic content; speaker identification, which leverages the unique
vocal traits encapsulated in the MFCCs; and audio authenticity verification. In the latter,
the ability of MFCCs to distinguish between genuine and artificially manipulated audio,
such as deepfake content, is particularly critical. This capability underscores the
importance of the MFCC extraction algorithm in maintaining the integrity and
authenticity of audio content in an era where digital forgery is increasingly prevalent.

The development and application of this algorithm represent a significant stride in the
field of audio analysis, offering a robust tool for dissecting and understanding the
complex nature of sound. By providing a method to capture and analyze the unique
spectral characteristics of audio signals, the algorithm not only enhances existing audio
processing techniques but also paves the way for new innovations in digital forensics,
security, and multimedia applications.

3.6 Load audio data

Algorithm 3.6.1, an essential preparatory step in the broader context of audio analysis and
processing workflows, often serving as the gateway to sophisticated audio manipulation,
feature extraction, and machine learning applications. Its purpose is to systematically
load audio files from a specified directory, ensuring they are in a uniform format (.wav in
this instance) that's conducive to further analysis. This process is critical in scenarios
where consistency and accuracy in audio data representation are paramount, such as in
speech recognition, audio forensics, and digital signal processing projects.

Algorithm 3.6.1 Load Audio Data


This algorithm is meticulously designed to automate the ingestion of audio data,
showcasing a streamlined approach to handling potentially large datasets with varying
audio lengths and properties. By focusing on .wav files, the algorithm taps into a format
widely recognized for its lossless quality and straightforward structure, making it ideal
for tasks that require high fidelity and unaltered audio data.

The use of the `librosa` library, a cornerstone in the Python audio processing community,
underscores the algorithm's reliance on established, robust methods for audio loading.
`Librosa` not only simplifies the process of loading audio data but also ensures that the
data is ready for complex operations like MFCC (Mel-Frequency Cepstral Coefficients)
extraction, pitch detection, and temporal analysis, which are often subsequent steps in
audio processing pipelines. Moreover, by encapsulating the audio data and sample rate in
tuples, the algorithm lays a versatile foundation for downstream tasks. This tuple
structure facilitates easy access to both the raw audio waveform and its corresponding
metadata, a necessity for precise, context-aware audio analysis. Whether it's training a
neural network to recognize specific sounds or conducting a detailed forensic analysis of
audio recordings, having immediate access to both the waveform and sample rate is
invaluable In practice, this algorithm could be the first step in a complex chain of
operations aimed at detecting deepfake audio, enhancing speech clarity in noisy
recordings, or even identifying unique acoustic signatures for biodiversity monitoring.

The universality and simplicity of its approach mean that it can be easily adapted or
expanded upon to meet the specific requirements of a wide array of audio processing
tasks. As we move towards more sophisticated and computationally demanding
applications of audio analysis, the importance of efficient, reliable data loading
mechanisms cannot be overstated. Algorithms like "Load Audio Data" not only
streamline the initial stages of these applications but also ensure that the data integrity is
maintained, setting the stage for accurate and insightful outcomes.
3.7 Audio Authenticity Algorithm

Algorithm 3.7.1 builds on the MFCC features extracted in Section 3.5 and decides whether
a given audio clip is genuine or a deepfake. For each clip, the 13 averaged MFCC values
are normalized and passed to the trained neural network described in Section 3.3. The
network's sigmoid output is interpreted as the probability that the clip is fake; following
the application code in Appendix 2, outputs below 0.5 are labelled real and outputs of 0.5
or above are labelled fake. In this way the authenticity check reuses the same compact
spectral representation that drives the rest of the pipeline, keeping the detection step
lightweight enough to run interactively.

Algorithm 3.7.1 Audio Authenticity Detection Algorithm
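
A minimal sketch of the decision step, mirroring the threshold used in the Appendix 2
application code (the function name is ours), is:

import numpy as np

def classify_audio(model, mfcc_vector, threshold=0.5):
    # mfcc_vector: the 13 averaged MFCC values for one clip
    features = np.asarray(mfcc_vector, dtype='float32').reshape(1, -1)
    probability = float(model.predict(features)[0])
    # Outputs below the threshold are treated as real, as in Appendix 2
    return 'Real' if probability < threshold else 'Fake'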

3.8 Activity Diagram

Figure 3.8.1 Activity Diagram

Figure 3.8.1 serves as a detailed blueprint, mapping out the intricate dance of interactions
within an audio processing system that leverages the power of a Hidden Markov Model
(HMM). At the heart of this system is a dynamic interplay between user actions,
hardware capabilities, and sophisticated algorithmic analysis, designed to transform raw
audio into meaningful insights.

The journey begins with the user, whose intent to capture an audio sample sets the entire
process in motion. This user-driven action is a crucial first step, as it underscores the
system's responsiveness to human inputs, bridging the gap between the tangible actions
of recording and the digital world of audio analysis.
As the user initiates a recording, the system springs to life, with the 'Audio' component
acting as the conductor, orchestrating the flow of data from the physical realm of sound
waves to the digital domain. This component, possibly embodying both software routines
and hardware interfaces, ensures that the transition from an analog signal (sound waves)
to a digital representation (audio data) is seamless and efficient. The 'Microphone',
whether a literal hardware device or a software abstraction, plays a pivotal role in
capturing the essence of the sound. Its ability to convert acoustic energy into electrical
signals or digital data is fundamental to the system's operation. This conversion is not
merely a technical procedure but an act of preservation, capturing moments of sound in a
form that can be analyzed and interpreted.

Upon successful capture, the audio data travels back to the 'Audio' component, which acts
as a gateway to the 'Main' processing unit. Here, preliminary processing takes place, a
critical step where audio data is conditioned, possibly through noise reduction,
normalization, or segmentation, preparing it for the intricate analysis ahead. By the time
the analysis concludes, the system has traversed a vast spectrum of interactions, from the
tangible act of recording audio to the abstract complexities of statistical modeling. The
narrative encapsulated in Fig 3.8.1 is more than a technical diagram; it is a testament to
the synergy between human interaction, technological sophistication, and mathematical
elegance in the quest to unlock the stories hidden within audio data.
CHAPTER 4

RESULTS & DISCUSSION

For a comprehensive analysis of our deepfake voice detection model's performance, it is
crucial to choose the right dataset for the analysis of the real and fake audio and to
compare the real voices with the fake ones. In total there are 2473 real voices and 2265
fake voices. We initially examined the amplitude-time graphs of all these voices, which
are stored as .wav files. We observed that the amplitude-time characteristics of the fake
voices deviate very little from those of the real voices, so we had to consider only the
characteristics that are relevant for our task. Figure 4.1 (a) depicts how a fake sound wave
differs from a real sound wave, which is shown in Figure 4.1 (b).

Figure 4.1 (a) Fake Audio (b) Real Audio

Initially, both the training and validation accuracy increase as the number of epochs
increases. This is typical of the early stages of training, where the model is learning from
the data. The training loss continues to decrease and starts to flatten out as the number of
epochs grows, suggesting that the model is starting to converge on a solution. The
validation loss decreases alongside the training loss up to a certain point, after which it
begins to show some volatility and slight increases. This could be indicative of the model
starting to overfit the training data. Since real and fake audio are difficult to distinguish
from the raw waveform alone, we chose to work with the MFCC features.

The MFCC-generated data is in the form of an array of 13 numbers, each representing a
mel-frequency cepstral coefficient of the wave. We fed this array of numbers into a neural
network whose first dense layer has 8 neurons, activated with the tanh function. We used
tanh because the array contains both positive and negative values, which tanh handles and,
in this case, provides better results than the ReLU activation function. We then added a
layer with 16 features. We used a Dropout layer of 0.6, meaning there is a 60% chance that
any given neuron will be ignored during a training pass. The output is a single feature that
uses the sigmoid activation function, which is well suited to binary classification. The
model is then compiled with the Adam optimizer and the binary cross-entropy loss.

The real and fake MFCC features are then normalized so that every sample used to train
the model is on the same scale, which helps the model produce consistent outputs on the
trained audio. The normalized features are then fed into the CNN architecture, and the
model is run for 20, 30 and 50 epochs. Initially, the training accuracy of the model was
noted to be 23.8%. We trained it in a multilayered neural network with many input
features, using five layers of 8, 16, 32, 64 and 128 neurons, and after training it produces a
single output value.
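
A minimal sketch of this training setup is given below; the synthetic X_train and y_train
arrays are placeholders standing in for the normalized MFCC vectors and their real/fake
labels, and the exact position of the dropout layer is our assumption:

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Placeholder data standing in for the normalized 13-value MFCC vectors
X_train = np.random.randn(200, 13).astype('float32')
y_train = np.random.randint(0, 2, size=(200,))

model = Sequential([
    Dense(8, activation='tanh', input_shape=(13,)),
    Dense(16, activation='tanh'),
    Dropout(0.6),                      # 60% of neurons ignored per training pass
    Dense(32, activation='tanh'),
    Dense(64, activation='tanh'),
    Dense(128, activation='tanh'),
    Dense(1, activation='sigmoid')     # probability that the clip is fake
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=20, validation_split=0.2)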

These were the values noted after training for the first 20 epochs. Regularization
techniques were applied to the model as well, but to check the accuracy of the model we
also tested it for 30 and 50 epochs. The results for the different epoch counts are shown
below.

Figure 4.2 CNN Architecture


(a)

(b)
(c)

Figure 4.3 a) Results after 20 epochs

b) after 50 epochs c) after 100 epochs

Fig 4.3 gives the accuracy for the various epoch counts that were trained, and it is observed
that the accuracy saturates after 20 epochs, so 20 epochs was chosen for this model. The
confusion matrix was built for the 20-epoch model. The accuracy of our model on the test
dataset is recorded to be 99%, indicating a robust capability to distinguish between
authentic and synthetically generated audio samples. This level of accuracy underscores
the model's potential as a reliable tool in the ongoing effort to identify and mitigate the
impact of deepfake technology on digital media integrity. Initially, both the training and
validation accuracy increase as the number of epochs increases. This is typical of the early
stages of training, where the model is learning from the data. The training loss continues
to decrease and starts to flatten out as the number of epochs increases, suggesting the
model is converging on one particular solution. The validation accuracy then decreases
slightly as a result of overfitting.

The confusion matrix in Figure 4.4 shows the performance of the binary classification
model. The matrix is divided into four quadrants. There are 452 instances where the model
correctly predicted the negative class, meaning that 452 times the model accurately
identified samples that should not carry the positive label. There are 2 instances where the
model incorrectly predicted the positive class; this is also known as a Type I error, the case
where the model predicts a positive outcome that is actually negative. There are 0 false
negatives, meaning there were no cases where the model incorrectly predicted the negative
class; this is also referred to as a Type II error, the situation where the model predicts a
negative outcome that should in fact be positive. Finally, there are 494 instances where the
model correctly predicted the positive class, the cases where the model accurately
identified samples that should carry the positive label. The columns represent the model's
predictions, and the rows represent the actual labels. Ideally, in a perfect classifier, all
predictions would fall into the top-left and bottom-right quadrants, indicating 100%
accuracy with no false positives or false negatives.

Figure 4.4 Confusion matrix


Figure 4.5 Audio Fake Detection Result

This model shows a high number of true positives and true negatives, with a very low
number of false positives and no false negatives. It suggests that the model is highly
accurate in classifying the positive class and quite good at classifying the negative class
with very few mistakes. Such a confusion matrix indicates that the model has high
precision and high recall for the positive class. Precision is a measure of result relevancy,
while recall is a measure of how many truly relevant results are returned. Overall, this
model's performance is excellent for the data and conditions it was tested under.
However, it's also important to consider other metrics like the F1 score for a more
comprehensive understanding of model performance and to evaluate it against different
datasets to understand its generalizability.
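
For reference, the precision, recall and F1 score implied by the confusion matrix in
Figure 4.4 can be computed directly from its four counts; a short worked computation
(assuming the counts shown above) is:

# Metrics implied by the confusion matrix in Figure 4.4 (TN=452, FP=2, FN=0, TP=494)
tn, fp, fn, tp = 452, 2, 0, 494

precision = tp / (tp + fp)                               # ~0.996
recall = tp / (tp + fn)                                  # 1.0
f1 = 2 * precision * recall / (precision + recall)       # ~0.998
accuracy = (tp + tn) / (tp + tn + fp + fn)               # ~0.998

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}, accuracy={accuracy:.3f}")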
CHAPTER 5

CONCLUSION

The performance evaluation of our deepfake voice detection model reveals significant
insights into its efficacy and optimization requirements. The ResNet model gave an accuracy
of around 56%, but our model gave an accuracy of 84% on the test dataset, indicating a robust
capability to distinguish between authentic and synthetically generated audio samples. This
level of accuracy underscores the model's potential as a reliable tool in the ongoing effort to
identify and mitigate the impact of deepfake technology on digital media integrity. Initially,
both the training and validation accuracy increase as the number of epochs increases. This is
typical of the early stages of training, where the model is learning from the data. The training
loss continues to decrease and starts to flatten out as the number of epochs grows, suggesting
that the model is starting to converge on a solution. The validation loss decreases alongside
the training loss up to a certain point, after which it begins to show some volatility and slight
increases.

This could be indicative of the model starting to overfit the training data. Overfitting
occurs when the model learns the training data too well, including its noise and outliers,
which can negatively impact its performance on new, unseen data. The most notable aspect of
the validation loss is that after around 20 epochs it shows an upward trend, which is a strong
indication of overfitting. This suggests that while the model continues to improve on the
training data, it is becoming less generalized to new data. From the confusion matrix we can
infer that the deep learning model correctly predicted 452 samples as negative, meaning that
for 452 cases it predicted properly. The model predicted 2 samples as positive when they were
actually negative; the low number here suggests that the model is quite good at not mistakenly
identifying a negative case as positive. The model correctly predicted 494 instances as
positive, meaning that for 494 cases it identified the presence of the condition or category
correctly. The F1 score for the model, based on the provided confusion matrix, is
approximately 0.998. This is a very high F1 score, indicating that the model has a good
balance of precision and recall, at least for the dataset it was tested on.
CHAPTER 6

FUTURE SCOPE

While our current framework effectively distinguishes short-span audio files, its
applicability to longer recordings remains limited. Future research should prioritize
enhancing the framework's scalability to process extended audio durations without
compromising accuracy. This entails exploring techniques to extract and analyze features
over extended time frames while maintaining computational efficiency.

Additionally, investigating methods to detect deepfake audio manipulations in real-time


streaming contexts would be invaluable for addressing evolving cybersecurity threats.
This focused approach ensures our system remains accurate and adaptable, meeting the
demands of diverse audio detection scenarios. Furthermore, to bolster the robustness of
our framework, it is imperative to explore advanced feature extraction techniques beyond
MFCC. Investigating the integration of additional audio features, such as pitch, rhythm,
and spectral density, could provide richer information for detecting subtle manipulations
in audio recordings. Moreover, leveraging ensemble learning methods and incorporating
diverse neural network architectures may enhance the model's ability to generalize across
various deepfake generation techniques and audio characteristics. In parallel, efforts
should focus on expanding the dataset used for training and validation purposes. Access
to larger, more diverse datasets encompassing a wide range of deep fake audio samples
will enable the model to learn from a broader spectrum of scenarios, thereby improving
its detection accuracy and generalization capabilities. Moreover, curating datasets that
reflect real-world conditions, including different languages, accents, and recording
environments, will ensure the model's effectiveness across diverse settings.
CHAPTER 7

REFERENCES

[1] Kumar, B., & Alraisi, S. R. (2022). Deepfakes audio detection techniques using deep
convolutional neural network. In 2022 International Conference on Machine Learning, Big
Data, Cloud and Parallel Computing (COM-IT-CON) (Vol. 1, pp. 463-468). IEEE.

[2] Hamza, A., Javed, A. R. R., Iqbal, F., Kryvinska, N., Almadhor, A. S., Jalil, Z., &
Borghol, R. (2022). Deepfake audio detection via MFCC features using machine
learning. IEEE Access, 10, 134018-134028.

[3] Liu, X., Wang, X., Sahidullah, M., Patino, J., Delgado, H., Kinnunen, T., ... & Lee, K.
A. (2023). Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild.
IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[4] Almutairi, Z., & Elgibreen, H. (2022). A review of modern audio deepfake detection
methods: Challenges and future directions. Algorithms, 15(5), 155.

[5] Abbasi, A., Javed, A. R. R., Yasin, A., Jalil, Z., Kryvinska, N., & Tariq, U. (2022). A
large-scale benchmark dataset for anomaly detection and rare event classification for
audio forensics. IEEE Access, 10, 38885-38894.

[6] Heidari, A., Jafari Navimipour, N., Dag, H., & Unal, M. (2023). Deepfake detection
using deep learning methods: A systematic and comprehensive review. Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery, e1520.

[7] Heidari, A., Jafari Navimipour, N., Dag, H., & Unal, M. (2023). Deepfake detection
using deep learning methods: A systematic and comprehensive review. Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery, e1520.

[8] Ahmed, A., Javed, A. R., Jalil, Z., Srivastava, G., & Gadekallu, T. R. (2022). Privacy
of web browsers: A challenge in digital forensics. In Genetic and Evolutionary
Computing: Proceedings of the Fourteenth International Conference on Genetic and
Evolutionary Computing, October 21-23, 2021, Jilin, China 14 (pp. 493-504). Springer
Singapore.

[9] Pianese, A., Cozzolino, D., Poggi, G., & Verdoliva, L. (2022, December). Deepfake
audio detection by speaker verification. In 2022 IEEE International Workshop on
Information Forensics and Security (WIFS) (pp. 1-6). IEEE.

[10] Abbasi, A., Javed, A. R., Iqbal, F., Jalil, Z., Gadekallu, T. R., & Kryvinska, N.
(2022). Authorship identification using ensemble learning. Scientific reports, 12(1), 9537.

[11] Raza, M. A., & Malik, K. M. (2023). Multimodaltrace: Deepfake detection using
audiovisual representation learning. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (pp. 993-1000).

[12] Stupp, C. (2019). Fraudsters used AI to mimic CEO’s voice in unusual cybercrime
case. The Wall Street Journal, 30(08).

[13] Wijethunga, R. L. M. A. P. C., Matheesha, D. M. K., Al Noman, A., De Silva, K. H.


V. T. A., Tissera, M., & Rupasinghe, L. (2020, December). Deepfake audio detection: a
deep learning based solution for group conversations. In 2020 2nd International
conference on advancements in computing (ICAC) (Vol. 1, pp. 192-197). IEEE.

[14] Fathan, A., Alam, J., & Kang, W. H. (2022). Mel-spectrogram image-based end-to-end
audio deepfake detection under channel-mismatched conditions. In 2022 IEEE International
Conference on Multimedia and Expo (ICME). IEEE. DOI: 10.1109/ICME52920.2022.9859621.

[15] Ulutas, G., Tahaoglu, G., & Ustubioglu, B. (2023, July). Deepfake audio detection
with vision transformer based method. In 2023 46th International Conference on
Telecommunications and Signal Processing (TSP) (pp. 244-247). IEEE.

[16] Fathan, A., Alam, J., & Kang, W. H. (2022, July). Mel-spectrogram image-based
end-to-end audio deepfake detection under channel-mismatched conditions. In 2022
IEEE International Conference on Multimedia and Expo (ICME) (pp. 1-6). IEEE.
[17] Yan, R., Wen, C., Zhou, S., Guo, T., Zou, W., & Li, X. (2022, May). Audio
deepfake detection system with neural stitching for add 2022. In ICASSP 2022-2022
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
(pp. 9226-9230). IEEE.

[18] Raza, M. A., & Malik, K. M. (2023). Multimodaltrace: Deepfake detection using


audiovisual representation learning. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (pp. 993-1000).

[19] Pianese, A., Cozzolino, D., Poggi, G., & Verdoliva, L. (2022, December). Deepfake
audio detection by speaker verification. In 2022 IEEE International Workshop on
Information Forensics and Security (WIFS) (pp. 1-6). IEEE.

[20] Zhang, Y., Li, X., Yuan, J., Gao, Y., & Li, L. (2021, December). A deepfake video
detection method based on multi-modal deep learning method. In 2021 2nd International
Conference on Electronics, Communications and Information Technology (CECIT) (pp.
28-33). IEEE.
APPENDIX 1

This segment outlines the programming languages, software, and libraries utilized in our
project. Our project is crafted using Python, a versatile, general-purpose programming
language known for its object-oriented features and high-level structure. Python is
celebrated for its succinct and intelligible syntax. Although it can handle highly intricate
processes with multifaceted workflows, Python's capabilities in artificial intelligence (AI)
and machine learning (ML) algorithms empower developers to construct dependable,
sophisticated systems with machine intelligence. The Python libraries employed in our
project include:

(1) Numpy: This library offers a multi-dimensional array object and various
derivatives (like masked arrays and matrices), alongside a collection of
operations for quick array processing. These operations encompass
mathematical functions, logic operations, shape manipulation, sorting,
selection, input/output, discrete Fourier transforms, elementary linear
algebra, basic statistics, and random simulation, making it a critical
foundation for scientific computing in Python.

(2) Pandas: This library provides efficient and flexible data structures
designed to work seamlessly with structured data, making data analysis in
Python both straightforward and intuitive. Its aim is to be an essential tool
for real-world data analysis, striving to be the most powerful and versatile
tool for data manipulation and analysis available across all programming
languages.
(3) TensorFlow: is a versatile and open-source library designed for machine
learning and artificial intelligence applications. Primarily focused on deep
neural network training and inference, it was originally created by
Google's Brain team for both research and production purposes within
Google. TensorFlow supports a broad array of programming languages,
including Python, JavaScript, C++, and Java, making it accessible for
various development projects. It features an expansive and adaptable
ecosystem of numerous tools, libraries, and community resources,
enabling researchers to advance the frontiers of machine learning.

(4) Matplotlib: This is a comprehensive library for creating static, animated,


and interactive visualizations in Python. It offers an extensive array of
functions and tools to generate high-quality graphs, charts, figures, and
plots, ranging from histograms to scatter plots, suitable for various data
analysis and visualization needs. Matplotlib is designed to provide
complete control over the appearance of the plots, including colors, legend
placement, line styles, and font properties, making it possible to produce
publication-quality figures. The library serves as a foundational plotting
tool for the Python scientific computing ecosystem, supporting various
output formats and interactive environments. By providing a powerful and
flexible platform for visualizing data, Matplotlib plays a crucial role in
data analysis, machine learning projects, and scientific research, enabling
users to convey complex data insights in a visually appealing and
understandable manner.

(5) Librosa: This is a Python package specifically designed for music and
audio analysis. It offers the tools to analyze audio signals and music to
extract information, making it easier to process, manipulate, and
understand audio data. Librosa supports a wide range of functionalities,
including audio feature extraction, such as tempo detection, beat tracking,
and mel-frequency cepstral coefficients (MFCCs), which are pivotal for
tasks in music information retrieval (MIR), audio signal processing, and
machine learning applications related to sound. Its comprehensive suite of
libraries and functions simplifies the complex process of audio analysis,
enabling developers to focus on creating innovative applications in the
domain of audio and music technology.

(6) Streamlit: This is an open-source Python library that simplifies the process
of creating and sharing beautiful, custom web apps for machine learning
and data science projects. With Streamlit, developers can quickly turn
data scripts into shareable web applications with minimal coding. It is
designed to make the deployment of interactive apps straightforward,
eliminating the need for complex web development skills. Streamlit’s
intuitive API allows for the easy integration of interactive widgets, such as
sliders, buttons, and text inputs, enabling users to interact with their data
and ML models dynamically. The library supports rapid prototyping and
provides an efficient way to visualize data, display models, and present
results, making it an invaluable tool for data scientists and developers
looking to showcase their projects in a user-friendly format.
APPENDIX 2

import streamlit as st
import librosa
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import load_model
import os


def extract_features(audio, sample_rate):
    # Extract 13 MFCCs and average them over time to get a fixed-size feature vector
    mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13)
    mean_mfcc = np.mean(mfccs, axis=1)
    return mean_mfcc.reshape(1, -1)


# Streamlit app title
st.title('Audio Fake Detection')

# Upload the file
uploaded_file = st.file_uploader("Choose an audio file...", type=["wav", "mp3"])

# Load the trained model
model = load_model('./my_model.h5')
print(model)

if uploaded_file is not None:
    # Load the uploaded audio at its native sample rate
    audio, sample_rate = librosa.load(uploaded_file, sr=None)
    st.text(f"Sample rate: {sample_rate} Hz")

    # Display audio player
    st.audio(uploaded_file, format='audio/wav', start_time=0)

    # Extract features
    with st.spinner('Extracting features...'):
        mfcc_reshaped = extract_features(audio, sample_rate)
        st.write('Features extracted')

    with st.spinner('Making prediction...'):
        prediction = model.predict(mfcc_reshaped)
        st.write('Prediction made', prediction[0])

    if prediction[0] < 0.5:
        st.success('Prediction: Real')
    else:
        st.error('Prediction: Fake')
PAPER PUBLICATION STATUS
