Proceedings of the SMART–2023, IEEE Conference ID: 59791

12th International Conference on System Modeling & Advancement in Research Trends, 22nd–23rd December 2023
College of Computing Sciences & Information Technology, Teerthanker Mahaveer University, Moradabad, India

Audio Deepfake Detection Using Deep Learning


2023 12th International Conference on System Modeling & Advancement in Research Trends (SMART) | 979-8-3503-6988-5/23/$31.00 ©2023 IEEE | DOI: 10.1109/SMART59791.2023.10428163

R. Anagha1, A. Arya2, V. Hari Narayan3, S. Abhishek4 and T. Anjali5


1,2,3,4,5Department of Computer Science and Engineering,

Amrita School of Computing, Amrita Vishwa Vidyapeetham, Amritapuri, India


E-mail: [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—The capacity to identify real audio recordings from their modified counterparts is essential in the age of sophisticated digital manipulation for maintaining security and trust in a variety of applications, from media forensics to voice authentication systems. This research aims to create a deep learning model that can distinguish between authentic and altered audio files, with an emphasis on identifying audio deepfakes. The study uses Mel spectrogram representations and data augmentation techniques to effectively extract features from the ASVspoof 2019 dataset and train models. Convolutional neural networks (CNNs) comprising a number of layers, including convolutional, pooling, batch normalization, ReLU activation, dropout, global average pooling, and a dense classification layer, form the foundation of the design. The model is trained using binary cross-entropy loss and optimized with the Adam optimizer, and a variety of metrics, such as accuracy, F1 score, ROC curve, and AUC, are used to track its performance. By making it easier to identify audio deepfakes, this project will ultimately increase the security and integrity of audio data in the digital world.

Keywords: Audio Deepfake Detection, Deep Learning, Convolutional Neural Network (CNN), Mel Spectrogram, ASVspoof 2019 Dataset, Binary Classification, Feature Extraction, Performance Metrics.

I. Introduction

Information is now disseminated instantly and widely in a time dominated by social media and sophisticated digital technologies. Although these platforms have democratized communication, they have also turned into hubs for the dissemination of false information and destructive content. The rise of deepfakes, sophisticated digital forgeries made using Deep Learning (DL) algorithms, is one of the most concerning effects of the digital era. These deepfakes come in a variety of formats, including images, videos, and increasingly realistic audio recordings, which has led to the rise of a serious threat called "Audio Deepfakes" (AD) [1].

Audio deepfakes in particular have become instruments for misleading behaviour, impersonation, and the dissemination of rumours. The legitimacy and dependability of audio data have come under intense scrutiny due to the capacity to create convincing audio modifications, posing severe problems for key fields including voice authentication, audio forensics, and media credibility. Distinguishing real audio files from modified ones has grown more difficult as these alterations have become more sophisticated [2].

This study aims to address the urgent problem of locating altered audio recordings in the vast ocean of digital content. The goal is to create a powerful deep-learning model, more precisely a Convolutional Neural Network (CNN), that is capable of telling the difference between original audio and altered versions. To accomplish this, the study makes use of the ASVspoof 2019 dataset, a vast assortment of real and spoofed audio recordings. The aim is to use this dataset to teach the CNN to distinguish real audio data from edited versions through subtle cues and patterns.

Designing and implementing a highly accurate deep-learning model for audio deepfake detection is the main goal of this study. The work uses state-of-the-art methods to accomplish this, including data augmentation, converting audio samples into Mel spectrograms, and exploiting CNNs' strong pattern-detection abilities. By concentrating on these technical techniques, the project seeks to make a substantial contribution to the field of digital forensics, improving the security and reliability of audio data and reducing the negative effects of audio deepfakes on numerous applications and technologies.

II. Related Works

Dixit, Kaur, et al. [1] (2023) give a review of audio-focused deepfake detection techniques. The paper presents the major issues and openings for further study in this field, including the effect of accents, the need for robust and generalizable models, and the ethical implications of deepfake technology. The ways in which deepfakes are generated are also surveyed. According to their survey, the SVM classifier was shown to reach an accuracy of 99% when compared with other traditional classification algorithms, although it could only be trained on datasets in Arabic and Chinese. They also note that detection models must be trained on more sophisticated deepfake datasets.

Almutairi and Elgibreen [2] (2022) provide a comprehensive survey of the existing methods for detecting imitation-based and synthetic-based audio deepfake attacks, which are two main types of audio fakeness. The paper gives a brief description of the available datasets for audio deepfake detection and compares their characteristics and limitations. The authors discuss the shortcomings of audio deepfake detection methods when it comes to non-English languages. A comparison of different modern techniques for detecting audio deepfakes according to three criteria, the tandem Detection Cost Function (tDCF), the Equal Error Rate (EER), and accuracy, is also given.

Mcuba, Singh, et al. [3] (2023) provide a comparative study of different deep learning techniques for finding deepfake audio in forensic investigations. They utilize four visual representations of the frequency spectrum: Mel-spectrograms, MFCCs, spectrograms, and chromagrams. They compare the performance of different CNN architectures on a public dataset. According to their results, the VGG-16 architecture performs best for the MFCC image feature, while a fully connected custom architecture performs excellently for the Chromagram, Spectrogram, and Mel-Spectrum images.

Bikku, Bhargavi, et al. [4] (2023) describe how an audio deepfake can be used to manipulate the voice and speech of a person, making it challenging to tell real audio from fake. The paper offers a thorough analysis of the different algorithms employed for audio deepfake creation as well as identification, highlighting their advantages and disadvantages. It also performs a detailed comparison of representative features and classifiers across various datasets for audio deepfake detection, and concludes that additional studies should focus on the difficulties of data scarcity, generalization, and interpretability.

Xue, Fan, et al. [5] (2022) suggested a technique for detecting audio deepfakes based on a combination of F0 information and real plus imaginary spectrogram features. The system combines fundamental frequency (F0) information, selected from the frequency band containing most of the F0, with real and imaginary spectrogram elements to distinguish between genuine and fake speech. Experimental results showing an equal error rate (EER) of 0.43% demonstrate the effectiveness of the suggested system.

Wijethunga, Matheesha, et al. [6] (2020) suggested a system made up of four main components: speech denoising, speaker diarization, natural language processing, and synthetic speech detection. The system uses different deep neural network architectures and datasets for each component. They constructed Deep Neural Network models and incorporated them into one unified solution using various datasets, comprising UrbanSound8K, Conversational, AMI-Corpus, and FakeOrReal. The speech-denoising component's Multilayer Perceptron and Convolutional Neural Network designs achieve accuracy rates of 93% and 94%, respectively. For text conversion, Natural Language Processing was used with a 93% accuracy rate, and an RNN model was used for voice tagging with an accuracy rate of 80% and a diarization error rate of 0.52. CNNs were used to correctly separate real and fake audio with 94% precision. They concluded that the proposed system can effectively detect synthetic speech in group conversations and contribute to the field of speech analysis.

Khochare, Joshi, et al. [7] (2021) explored the potential of deep learning for the detection of audio deepfakes using two approaches: feature-based and image-based. The study makes use of the Fake or Real (FoR) dataset, which contains audio samples generated by the latest text-to-speech models, as the basis for its analysis. In the feature-based approach, audio samples are converted into spectral features, while in the image-based approach, audio samples are converted into Mel spectrograms. Different machine learning and deep learning algorithms are compared in terms of their performance on the classification task. The paper finds that the Temporal Convolutional Network (TCN) performs best among the image-based methods, while the Support Vector Machine (SVM) performs best among the feature-based techniques.

Lyu [8] (2020) proposed a technique that, when applied to various deepfake video datasets, produces cutting-edge outcomes. He also makes the point that AI-based imitation is not limited to imagery but is also enabling the production of incredibly lifelike audio files. Because sound signals are one-dimensional and possess a distinct nature from videos and pictures, different techniques must be developed to specifically target such forgeries. There has been growing interest in creating efficient techniques for detecting and mitigating the impact of deepfake audio.

Martin-Donas and Alvarez [9] (2022) proposed a system for detecting audio deepfakes based on the wav2vec2 self-supervised model. The novel aspect of the paper is that it feeds the downstream classifier contextualized representations from various transformer layers of the pre-trained wav2vec2 model. The paper also explores various methods of data augmentation to help the classifier adapt to difficult conditions. The results demonstrate that the suggested system performs competitively on the ASVspoof 2021 and 2022 ADD challenges, which simulate natural scenarios with spoofed audio. The paper concludes that data augmentation techniques can be used to effectively adapt the classifier and that the wav2vec2 features are robust to varied speech content and conditions.

III. Preliminaries

A. Audio Deepfake

A revolutionary age in artificial intelligence has begun with the emergence of audio deepfakes, commonly referred to as voice cloning. The technology was initially created with good intentions: first developed to improve human experiences, it has found uses in the production of audiobooks and in helping people whose voices have been lost due to medical issues regain their capacity for communication. The way we engage with technology has also been revolutionized by the development of personalized digital assistants, natural-sounding text-to-speech services, and effective voice translation tools.

But the availability of audio deepfake techniques, even on ubiquitous platforms like smartphones and personal PCs, has raised serious issues. What was once a promising invention is now being used to distribute false information. On social media platforms, these audio modifications have been abused to mislead and sway public opinion. The ramifications are extensive, encompassing ethical issues and cybersecurity difficulties. Given how simple these deepfakes are to make, concerns are raised about their possible use in logical-access voice spoofing, which would allow bad actors to influence public opinion for the purposes of propaganda, slander, or even terrorism. Considering the enormous volume of voice recordings transferred every day over the internet, spotting these manipulations is particularly difficult. As a result, these deceptive techniques have been turned against individuals and institutions, such as governments and political figures. The dire dangers of this technology are highlighted by actual events, such as con artists impersonating CEOs for financial crime using AI-based software.

An alarming number of respondents to a 2023 McAfee study conducted worldwide reported experiencing financial losses as a result of these fraudulent actions. Additionally, the growth of audio deepfakes puts the security of financial customers at risk by endangering already-deployed speech recognition systems, particularly those used in the banking industry.

There are three basic types of audio deepfakes: imitation-based, synthetic-based, and replay-based. Replay-based deepfakes replay recordings of the victim's speech and are countered with methods like far-field detection and cut-and-paste detection, supported by deep convolutional neural networks and text-dependent speaker verification. In synthetic-based deepfakes, artificial speech is produced using text analysis, acoustic modeling, and vocoding modules, with technologies like WaveNet. Imitation-based deepfakes, also known as voice conversion, change a speaker's voice to imitate that of another using Generative Adversarial Networks (GANs). These methods show how complex audio deepfakes have become and how, in the face of advancing synthetic audio technology, real voices must be discerned with advanced detection and monitoring.

B. Mel Spectrogram

A thorough visual depiction of an audio signal's frequency components across time is offered by spectrograms. Spectrograms use Fourier Transforms to dissect the intricate audio wave into its component frequencies and display the magnitude of each frequency present in the signal. These visualisations are essential in a variety of disciplines, including music, linguistics, speech recognition, and seismology. Fig. 1 shows the spectrogram of an audio signal.

Fig. 1: Mel Spectrogram

On displaying a plain spectrogram, there is little visible information available to analyze. This is because, unlike other senses, human auditory perception works on a logarithmic scale. Due to this peculiarity, the Mel Spectrogram was developed, which changes the y-axis to the Mel Scale, a scale that simulates how humans perceive pitch variations.

Mel Spectrograms are an improved spectrogram designed specifically for human auditory comprehension. The Mel Scale offers a logarithmic frequency scale, inspired by how people perceive sound, that addresses the non-linearity of human hearing. Additionally, the Decibel Scale is incorporated into the colour representation to account for logarithmic loudness perception, guaranteeing proper representation of amplitude fluctuations. Mel Spectrograms are essential for capturing the precise nuances of audio recordings in the context of the audio deepfake detection project, enabling the deep learning model to differentiate between real and altered audio sources.
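The scale change described above can be written down directly. The following minimal Python sketch converts frequencies in Hz to the Mel scale using one common variant of the conversion formula (the constants 2595 and 700 are the widely used HTK convention; the paper does not state which variant it relies on):

import numpy as np

def hz_to_mel(f_hz):
    """Map a frequency in Hz to the Mel scale (HTK-style formula)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# The logarithmic compression grows with frequency:
for f in (100, 500, 1000, 4000, 8000):
    print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} mel")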
IV. Dataset

The project makes use of the ASVspoof 2019 dataset, which consists of real (bona fide) and fake (spoofed) audio recordings and serves as a standard for research on automated speaker verification spoofing and countermeasures.

A. Real Samples

The recordings of actual human speech in the dataset are the real samples. For training models to reliably identify and validate actual speakers, these samples serve as authentic speaker data. By grasping the nuances of real speech, the model can set a baseline for what genuine speech sounds like.

B. Fake Samples

On the other side, altered recordings known as spoof samples are used to deceive speaker verification systems. Replay attacks (playing back a real recording) and voice synthesis (creating fake speech) are two examples of these manipulations. By including spoofed samples, researchers can evaluate how susceptible speaker verification systems are to different kinds of attacks.

The dataset offers a wide range of recording circumstances, including various environments and degrees of background noise. Given the great variety of real-world conditions, this diversity is crucial. In contrast to a system operating in a controlled environment, a speaker verification system used in a public setting may experience significant amounts of background noise. Because the models are trained on data that represents these variations, the resulting system is better prepared to deal with real-world conditions.

Fig. 2: System Architecture

V. Methodology

The methodology consists of four main steps: Data Preparation, Model Architecture, Training Process, and Evaluation Process.

A. Data Preparation

This process involves steps to ensure that the data is organized and optimized for training and testing the model.

1) Labelling

In the initial phase of data preparation, the primary task is to link each audio file with its corresponding label. The labels are binary, dividing the audio files into two groups: "real" or "bonafide" and "fake" or "spoof". If the audio is a genuine recording ("bonafide"), the corresponding label is set to 1. If the audio is a spoofed recording or falls into any other category, the label is set to 0. This labeling process is crucial, as it helps the model learn the relevant patterns and characteristics associated with genuine and fake audio recordings. By assigning distinct labels to each sound file, the model can recognize the differences between authentic and manipulated data.

2) Feature Extraction

After labeling, the audio files are loaded using the Librosa library, and Mel spectrograms are extracted as features. Mel spectrograms offer a visual representation of how an audio signal's frequency components vary over time, and they are chosen for feature extraction in this project due to their effectiveness in capturing crucial details of audio signals. Unlike traditional spectrograms, which represent the entire frequency spectrum uniformly, Mel spectrograms focus on the aspects of sound that our ears are more sensitive to. The extracted spectrograms are either padded or truncated to a fixed width. The features obtained, along with their corresponding labels, are appended to the X array and Y array, respectively. A sketch of this step is given below.
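The following is a minimal sketch of the labelling and feature-extraction step. The bonafide/ and spoof/ directory layout, the 16 kHz sampling rate, and the fixed width of 128 frames are assumptions for illustration; the paper does not specify these values:

import os
import numpy as np
import librosa

MAX_FRAMES = 128  # assumed fixed width; the paper does not state the value

def extract_mel(path, sr=16000, n_mels=128):
    """Load an audio file and return a fixed-width Mel spectrogram in dB."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Pad or truncate along the time axis to a fixed number of frames.
    if mel_db.shape[1] < MAX_FRAMES:
        pad = MAX_FRAMES - mel_db.shape[1]
        mel_db = np.pad(mel_db, ((0, 0), (0, pad)), mode="constant")
    else:
        mel_db = mel_db[:, :MAX_FRAMES]
    return mel_db

X, Y = [], []
for label, folder in ((1, "bonafide"), (0, "spoof")):  # 1 = real, 0 = fake
    for name in os.listdir(folder):
        X.append(extract_mel(os.path.join(folder, name)))
        Y.append(label)

X = np.array(X)[..., np.newaxis]  # add a channel axis for the CNN
Y = np.array(Y)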
B. Model Architecture

1) Convolutional Neural Network

The design of the model is based on a Convolutional Neural Network (CNN). CNNs belong to the category of deep neural networks and are extensively utilized for the analysis of visual data in deep learning. The fundamental elements of a CNN are convolutional layers, which apply filters to the input data, allowing the network to recognise local patterns and features. These layers are frequently followed by pooling layers that reduce spatial dimensions. Although CNNs were originally created for image processing, they can be used for audio classification by adapting their architecture to the nature of audio data.

The model is composed of an input layer that accepts Mel spectrograms, followed by two convolutional layers, each followed by a max-pooling layer. A flattening layer, a dense layer with ReLU activation, a dropout layer for regularisation, and a final dense layer with softmax activation come after that. Spatial features are extracted from the input spectrograms by the convolutional layers, which use 32 and 64 filters, respectively, each followed by a ReLU activation function. Spatial dimensions are reduced by the max-pooling layers with a 2x2 pool size. Regularization is applied with a dropout rate of 0.5 to reduce overfitting during training.
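A minimal Keras sketch of the architecture as described above. The 3x3 kernel size, the 128-unit dense layer, and the input shape are assumptions, since the paper specifies only the filter counts (32 and 64), the 2x2 pooling, and the 0.5 dropout rate:

from tensorflow.keras import layers, models

def build_model(input_shape=(128, 128, 1)):  # assumed Mel-spectrogram shape
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),  # kernel size assumed
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),  # width assumed
        layers.Dropout(0.5),
        layers.Dense(2, activation="softmax"),  # bonafide vs. spoof
    ])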
C. Training Process

The dataset is divided into two parts: a training set with 80% of the data and a validation set with the remaining 20%. The model is evaluated on a separate test set to assess its generalization to unseen data. The training process uses the Adam optimizer together with categorical cross-entropy loss. The Adam optimizer dynamically adjusts learning rates based on prior gradient information, accelerating convergence. Dropout is applied to prevent overfitting: it selectively deactivates neurons during training, preventing the model from relying too heavily on specific pathways and promoting adaptability to diverse data patterns.
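A sketch of the training step under the split and loss described above, reusing the X and Y arrays and the build_model function sketched earlier. The batch size and epoch count are assumptions, as the paper does not report them; with a softmax output and categorical cross-entropy, the integer labels are one-hot encoded first:

from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# 80/20 train/validation split, as described in the text.
X_train, X_val, y_train, y_val = train_test_split(
    X, to_categorical(Y, num_classes=2), test_size=0.2, random_state=42
)

model = build_model(input_shape=X_train.shape[1:])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=20, batch_size=32)  # epochs/batch size assumed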

D. Evaluation Process

To evaluate the model's performance, a separate test set consisting of real and fake audio is used. The test data is subjected to the same preprocessing steps employed during the model training phase, specifically the extraction of Mel spectrograms with a consistent number of time steps. The model's performance on the test data is then evaluated using various metrics, as sketched below.
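A minimal sketch of the evaluation step, assuming X_test and y_test were prepared with the same extract_mel preprocessing; it computes the accuracy, ROC-AUC, and average precision reported in the next section using scikit-learn:

from sklearn.metrics import (accuracy_score, roc_auc_score,
                             average_precision_score, confusion_matrix)

probs = model.predict(X_test)[:, 1]  # probability of the "bonafide" class
preds = (probs >= 0.5).astype(int)   # decision threshold assumed at 0.5

print("Accuracy:         ", accuracy_score(y_test, preds))
print("ROC AUC:          ", roc_auc_score(y_test, probs))
print("Average precision:", average_precision_score(y_test, probs))
print("Confusion matrix:\n", confusion_matrix(y_test, preds))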
VI. Results and Discussion

The research employs metrics such as accuracy, precision, and the ROC curve to assess the model's performance. CNNs have proven to yield superior results in traditional audio classification [12]. The results of the study after model evaluation, shown in Table I, affirm the effectiveness of Convolutional Neural Networks (CNNs) not only in standard audio classification but also in the context of audio deepfake detection.

Table I: Evaluation Metrics

Serial Num   Metric Name                       Result
1            Accuracy                          85%
2            Area Under the ROC Curve (AUC)    0.87
3            Average Precision                 0.90

The accuracy of the model is reported at 85%, indicating the percentage of correct predictions across all classifications. The area under the ROC curve stands at 0.87, reflecting the model's capability to distinguish between classes; a higher AUC suggests better overall performance. Additionally, the average precision, a measure of the model's precision across various classification thresholds, is reported as 0.90, highlighting the model's effectiveness in precision-recall trade-offs.

The confusion matrix (Fig. 3) gives the distribution of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). A true positive represents a correct identification of real audio, while a true negative represents a correct identification of fake audio. False positives and false negatives denote cases where the model made classification errors. Analysis of these counts provides a more detailed understanding of the model's ability to predict accurately.

Fig. 3: Confusion Matrix

The ROC curve (Fig. 4) shows how well the classification model performs at various classification thresholds. The area under the ROC curve (AUC) is a quantitative measure of this; higher AUC values indicate better discriminatory power. The ROC curve helps in understanding how well the model distinguishes between real and fake audio at various thresholds.

Fig. 4: ROC Curve

The precision-recall curve (Fig. 5) shows the trade-off between precision and recall for different threshold values. Strong recall combined with high precision yields a high area under the curve. A low false-positive rate is associated with high precision, and a low false-negative rate corresponds to high recall. A classifier that returns accurate findings and captures the majority of all positive results will have high precision and recall scores. Analysis of the graph indicates that the model demonstrates strong performance.

Fig. 5: Precision-Recall Curve
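Curves like Figs. 4 and 5 can be reproduced from the test-set scores with scikit-learn and matplotlib; a minimal sketch, reusing the probs array from the evaluation step above:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve

fpr, tpr, _ = roc_curve(y_test, probs)                # data for Fig. 4
prec, rec, _ = precision_recall_curve(y_test, probs)  # data for Fig. 5

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(fpr, tpr)
ax1.set_xlabel("False positive rate")
ax1.set_ylabel("True positive rate")
ax1.set_title("ROC Curve")
ax2.plot(rec, prec)
ax2.set_xlabel("Recall")
ax2.set_ylabel("Precision")
ax2.set_title("Precision-Recall Curve")
plt.tight_layout()
plt.show()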

VII. Conclusion

The proposed work implements an effective method for audio deepfake detection. The utilization of deep learning techniques, especially CNNs, represents a novel approach to addressing the challenges associated with identifying manipulated audio content. Unlike traditional methods, the CNN model is designed to automatically learn relevant features from the audio data, enabling it to effectively distinguish between authentic and deepfake audio.

Convolutional layers, max-pooling layers, batch normalization, ReLU activation functions, dropout layers, global average pooling, and a final dense layer are used in this project. Evaluation measures, including accuracy, ROC curve analysis, and the AUC metric, are used to assess how well the model performs. The results indicate that the model performs reasonably well in distinguishing between real and fake audio.

References

[1] Dixit, A., Kaur, N., & Kingra, S. (2023). Review of audio deepfake detection techniques: Issues and prospects. Expert Systems, 40(8), e13322. https://doi.org/10.1111/exsy.13322
[2] Almutairi, Z., & Elgibreen, H. (2022). A review of modern audio deepfake detection methods: Challenges and future directions. Algorithms, 15(5), 155. https://doi.org/10.3390/a15050155
[3] Mcuba, Singh, Ikuesan, & Venter (2023). The effect of deep learning methods on deepfake audio detection for digital investigation. Procedia Computer Science. https://doi.org/10.1016/j.procs.2023.01.283
[4] T. Bikku, K. Bhargavi, J. Bhavitha, Y. Lalithya and T. Vineetha, "Deep Residual Learning for Unmasking DeepFake," 2023 International Conference on Advances in Electronics, Communication, Computing and Intelligent Information Systems (ICAECIS), Bangalore, India, 2023, pp. 435-440, doi: 10.1109/ICAECIS58353.2023.10170400.
[5] J. Xue, C. Fan, Z. Lv, J. Tao, J. Yi, C. Zheng, Z. Wen, M. Yuan, and S. Shao, "Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features," in Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia (DDAM '22), Association for Computing Machinery, New York, NY, USA, 2022, pp. 19-26. https://doi.org/10.1145/3552466.3556526
[6] R. L. M. A. P. C. Wijethunga, D. M. K. Matheesha, A. A. Noman, K. H. V. T. A. De Silva, M. Tissera and L. Rupasinghe, "Deepfake Audio Detection: A Deep Learning Based Solution for Group Conversations," 2020 2nd International Conference on Advancements in Computing (ICAC), Malabe, Sri Lanka, 2020, pp. 192-197, doi: 10.1109/ICAC51239.2020.9357161.
[7] Khochare, J., Joshi, C., Yenarkar, B., Suratkar, S., & Kazi, F. (2021). A deep learning framework for audio deepfake detection. Arabian Journal for Science and Engineering, 1-12.
[8] S. Lyu, "Deepfake Detection: Current Challenges and Next Steps," 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), London, UK, 2020, pp. 1-6, doi: 10.1109/ICMEW46912.2020.9105991.
[9] J. M. Martín-Doñas and A. Álvarez, "The Vicomtech Audio Deepfake Detection System Based on Wav2vec2 for the 2022 ADD Challenge," ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore, 2022, pp. 9241-9245, doi: 10.1109/ICASSP43922.2022.9747768.
[10] B. Vimal, M. Surya, Darshan, V. S. Sridhar and A. Ashok, "MFCC Based Audio Classification Using Machine Learning," 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 2021, pp. 1-4, doi: 10.1109/ICCCNT51525.2021.9579881.
[11] Lalitha, S., Geyasruti, D., Ramamurthi, N., & Shravani, M. (2015). Emotion detection using MFCC and Cepstrum features. Procedia Computer Science, 70, 29-35. https://doi.org/10.1016/j.procs.2015.10.020
[12] K. Jaiswal and D. Kalpeshbhai Patel, "Sound Classification Using Convolutional Neural Networks," 2018 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), Bangalore, India, 2018, pp. 81-84, doi: 10.1109/CCEM.2018.00021.
[13] S. Hershey et al., "CNN architectures for large-scale audio classification," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 2017, pp. 131-135, doi: 10.1109/ICASSP.2017.7952132.
[14] F. Rong, "Audio Classification Method Based on Machine Learning," 2016 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS), Changsha, China, 2016, pp. 81-84, doi: 10.1109/ICITBS.2016.98.
[15] A. Aditya, R. Vinod, A. Kumar, I. Bhowmik and J. Swaminathan, "Classifying Speech into Offensive and Hate Categories along with Targeted Communities using Machine Learning," 2022 International Conference on Inventive Computation Technologies (ICICT), Nepal, 2022, pp. 291-295, doi: 10.1109/ICICT54344.2022.9850944.
[16] Kartik, P. V., & Gb, J. (2020). A deep learning based system to predict the noise (disturbance) in audio files. Intelligent Systems and Computer Technology, 37, 154.
[17] R. Chinmayi, N. Sreeja, A. S. Nair, M. K. Jayakumar, R. Gowri and A. Jaiswal, "Emotion Classification Using Deep Learning," 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 2020, pp. 1063-1068, doi: 10.1109/ICSSIT48917.2020.9214103.
[18] S. S. Poorna, C. Y. Jeevitha, S. J. Nair, S. Santhosh and G. J. Nair, "Emotion recognition using multi-parameter speech feature classification," 2015 International Conference on Computers, Communications, and Systems (ICCCS), Kanyakumari, India, 2015, pp. 217-222, doi: 10.1109/CCOMS.2015.7562904.
