Audio Deepfake Detection Using Deep Learning
12th International Conference on System Modeling & Advancement in Research Trends, 22nd–23rd, December, 2023
College of Computing Sciences & Information Technology, Teerthanker Mahaveer University, Moradabad, India
Abstract—The capacity to identify real audio recordings from their modified counterparts is essential in the age of sophisticated digital manipulation for maintaining security and trust in a variety of applications, from media forensics to voice authentication systems. This research aims to create a deep learning model that can distinguish between authentic and altered audio files, with an emphasis on identifying audio deepfakes. The study uses Mel spectrogram representations and data augmentation techniques to effectively extract features from the ASVspoof 2019 dataset and train models. Convolutional neural networks (CNNs) comprising a number of layers, including convolutional, pooling, batch normalization, ReLU activation, dropout, global average pooling, and a dense classification layer, are used as the foundation of the design. The model is trained using binary cross-entropy loss and optimized with the Adam optimizer, and a variety of metrics, such as accuracy, F1 score, ROC curve, and AUC, are used to track its performance. By making it easier to identify audio deepfakes, this project will ultimately increase the security and integrity of audio data in the digital world.
Keywords: Audio Deepfake Detection, Deep Learning, Convolutional Neural Network (CNN), Mel Spectrogram, ASVspoof 2019 Dataset, Binary Classification, Feature Extraction, Performance Metrics.
I. Introduction
Information is now disseminated instantly and widely in a time dominated by social media and sophisticated digital technologies. Although these platforms have democratized communication, they have also turned into hubs for the dissemination of false information and destructive content. The rise of deepfakes, sophisticated digital forgeries made utilising Deep Learning (DL) algorithms, is one of the most concerning effects of the digital era. These deepfakes come in a variety of formats, including images, movies, and increasingly realistic audio recordings, which has resulted in the rise of a serious threat called "Audio Deepfakes" (AD) [1].
Audio deepfakes in particular have evolved into instruments for misleading behaviour, impersonation, and the dissemination of rumours. The legitimacy and dependability of audio data have come under intense scrutiny due to the capacity to create convincing audio modifications, posing severe problems for key fields including voice authentication, audio forensics, and media credibility. Identifying real audio files from modified ones has grown more difficult as these alterations have become more sophisticated [2].
This study aims to address the urgent problem of locating altered audio recordings in the large ocean of digital content. The goal is to create a powerful deep-learning model, more precisely a Convolutional Neural Network (CNN), that is capable of telling the difference between original audio and altered versions. The study makes use of the ASVspoof 2019 dataset, a vast assortment of real and spoofed audio recordings, to accomplish this. The aim of the work is to use this dataset to teach the CNN how to distinguish between real audio data and edited versions by using subtle cues and patterns.
Designing and implementing a highly accurate deep-learning model for audio deepfake detection is the main goal of this study. The work uses cutting-edge methods to accomplish this, including data augmentation, converting audio samples into Mel spectrograms, and utilising CNNs' strong pattern detection abilities. The project seeks to make a substantial contribution to the field of digital forensics by concentrating on these technical techniques, improving the security and reliability of audio data, and reducing the negative effects of audio deepfakes on numerous applications and technologies.
II. Related Works
Dixit, Kaur, et al. [1] (2023) give a review of audio-focused deepfake detection techniques. The paper presents major issues and openings for further study in this field, including the effect of accents, the need for robust and generalizable models, and the ethical implications of deepfake technology. The ways through which deepfakes are generated are also shown. According to their survey, when compared to other traditional classification algorithms, the SVM classifier was shown to reach an accuracy of 99%, although it could only be trained on datasets in Arabic and Chinese. The survey also notes that detection models must be trained on more sophisticated deepfake datasets.
Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on April 09,2025 at 12:15:55 UTC from IEEE Xplore. Restrictions apply.
Almutairi and Elgibreen [2] (2022) provide a comprehensive survey of the existing methods for detecting imitation-based and synthetic-based audio deepfake attacks, which are the two main types of audio fakeness. It gives a brief description of the available datasets for audio deepfake detection and compares their characteristics and limitations. They discuss the limitations of audio deepfake detection methods when it comes to non-English languages. They also compare different modern techniques for detecting audio deepfakes according to three criteria: tandem Detection Cost Function (tDCF), Equal Error Rate (EER), and accuracy.
Mcuba, Singh, et al. [3] (2023) provide a comparative study of different deep learning techniques for finding deepfake audio in forensic investigations. They utilize four visual representations of the frequency spectrum: Mel-spectrograms, MFCCs, spectrograms, and chromagrams. They compare the performance of different CNN architectures on a public dataset. According to their results, the VGG-16 architecture performs best for the MFCC image feature, while a fully connected custom architecture performs excellently for the Chromagram, Spectrogram, and Mel-Spectrum images.
Bikku, Bhargavi, et al. [4] (2023) describe how an audio deepfake can be used to manipulate the voice and speech of a person, making it challenging to tell real audio from fake. The paper offers a thorough analysis of the different algorithms employed for audio deepfake creation as well as identification, highlighting their advantages and disadvantages. The paper also performs a detailed comparison of representative features and classifiers across various datasets for audio deepfake detection. It concludes that additional studies should focus on the difficulties of data scarcity, generalization, and interpretability for audio deepfake detection.
Xue, Fan, et al. [5] (2022) suggested a technique for detecting audio deepfakes based on a combination of F0 information and real plus imaginary spectrogram features. The system combines fundamental frequency (F0) information, selected from the frequency band containing most of the F0, with real and imaginary spectrogram features to distinguish between genuine and fake speech. Experimental results showing an equal error rate (EER) of 0.43% demonstrate how effective the suggested system is.
Wijethunga, Matheesha, et al. [6] (2020) suggested a system that is made up of four main components: speech denoising, speaker diarization, natural language processing, and synthetic speech detection. The system uses different deep neural network architectures and datasets to perform each component. They constructed Deep Neural Network models and incorporated them into one unified solution by utilizing various datasets, comprising UrbanSound8K, Conversational, AMI-Corpus, and FakeOrReal. The speech-denoising component's Multilayer-Perceptron and Convolutional Neural Network designs have accuracy rates of 93% and 94%, respectively. For text conversion, Natural Language Processing was used with a 93% accuracy rate, and an RNN model was used for voice tagging with an accuracy rate of 80% and a diarization error rate of 0.52. CNNs were used to correctly separate real and fake audio with 94% precision. They concluded that the proposed system can effectively detect synthetic speech in group conversations as well as contribute to the field of speech analysis.
Khochare, Joshi, et al. [7] (2021) explored the potential of deep learning for the detection of audio deepfakes using two approaches: feature-based and image-based. The study makes use of the Fake or Real (FoR) dataset, which contains audio samples generated by the latest text-to-speech models, as a basis for its analysis. In feature-based approaches, audio samples are converted into spectral features, while in image-based approaches, audio samples are converted into Mel spectrograms. The paper compares different machine learning and deep learning algorithms in terms of their performance on a classification task. It claims that the Temporal Convolutional Network (TCN) performs the best among the image-based methods, and among the feature-based techniques, the Support Vector Machine (SVM) performs the best.
Lyu [8] (2020) proposed a technique that, when applied to various deepfake video datasets, can produce cutting-edge outcomes. He also makes the point that AI-based imitation is not limited to imagery but is also enabling the production of incredibly lifelike audio files. To specifically target such forgeries, different techniques must be developed, because sound signals are one-dimensional signals that possess a distinct nature from videos and pictures. There has been an increase in interest in creating efficient techniques for detecting and mitigating the impact of deepfake audio.
Martín-Doñas and Álvarez [9] (2022) proposed a system for detecting audio deepfakes based on the wav2vec2 self-supervised model. The novel aspect of the paper is that it feeds the downstream classifier contextualized representations from various transformer layers of the previously trained wav2vec2 model. The paper also explores various methods of data augmentation to assist the classifier in adjusting to difficult situations. The results demonstrate that the suggested system performs competitively in the ASVspoof 2021 and 2022 ADD challenges, which simulate natural scenarios with spoof audio. The paper concludes that data augmentation techniques can be used to effectively adapt the classifier and that the wav2vec2 features demonstrate robustness to various speech content and conditions.
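Several of the surveyed systems are compared by their equal error rate (EER), the operating point where the false-acceptance rate equals the false-rejection rate. As a rough illustration of how that number is obtained from detection scores, here is a minimal numpy sketch (the scores and the threshold sweep are made up for demonstration; this is not the code of any surveyed system):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds and returning the point
    where the false-acceptance rate (FAR) and false-rejection rate (FRR)
    are closest.  scores: higher = more likely genuine; labels: 1 = genuine,
    0 = spoof."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best = (1.0, 0.0)  # (FAR, FRR) with the smallest gap seen so far
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        far = np.mean(accept[labels == 0])   # spoofed clips accepted
        frr = np.mean(~accept[labels == 1])  # genuine clips rejected
        if abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2.0

# Toy example: genuine files score high, spoofs score low.
scores = [0.9, 0.8, 0.75, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(equal_error_rate(scores, labels))  # perfectly separable -> 0.0
```

A lower EER means better separation of genuine and spoofed audio; the 0.43% reported by Xue, Fan, et al. [5] corresponds to an almost perfectly separating score distribution.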
Copyright © IEEE–2023 ISBN: 979-8-3503-6988-5
D. Evaluation Process
To evaluate the model’s performance, separate test
data consisting of real and fake audio is used. The test data
is then subjected to the same preprocessing steps employed
during the model training phase, specifically involving the
extraction of Mel spectrograms with consistent time steps.
The model’s performance on test data is then evaluated
using various metrics.
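The "consistent time steps" requirement means every Mel spectrogram must be padded or truncated to one common length before being fed to the CNN, since the network expects a fixed-size input. The numpy sketch below illustrates only that normalization step (the spectrograms themselves would typically come from an audio library such as librosa; the function name and sizes are illustrative, not the authors' code):

```python
import numpy as np

def fix_time_steps(mel_spec, target_frames):
    """Pad with zeros or truncate a Mel spectrogram along its time axis
    so every example ends up with shape (n_mels, target_frames)."""
    n_mels, n_frames = mel_spec.shape
    if n_frames >= target_frames:
        # Long clip: keep only the first target_frames frames.
        return mel_spec[:, :target_frames]
    # Short clip: append silence (zero-valued columns) on the right.
    return np.pad(mel_spec, ((0, 0), (0, target_frames - n_frames)))

# Hypothetical 128-mel spectrogram with 95 frames, normalized two ways.
spec = np.random.rand(128, 95)
print(fix_time_steps(spec, 160).shape)  # (128, 160)
print(fix_time_steps(spec, 64).shape)   # (128, 64)
```

Applying the same transformation to training and test data, as the text prescribes, keeps the input distribution consistent between the two phases.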
VI. Results and Discussion
The research employs metrics such as Accuracy, Precision, and the ROC curve to assess the model’s performance. CNNs have proven to yield superior results in traditional audio classification [12]. The results of the study after the model evaluation, shown in Table I below, affirm the effectiveness of using Convolutional Neural Networks (CNNs) not only in standard audio classification but also in the context of audio deepfake detection.
Fig. 3: Confusion Matrix
Table I: Evaluation Metrics
1  Accuracy  85%
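For reference, metrics like these can be computed directly from model scores. The sketch below (numpy only, using hypothetical scores rather than the study's actual predictions) shows accuracy at a 0.5 threshold and AUC via the Mann-Whitney pairwise-ranking formulation that underlies the ROC curve:

```python
import numpy as np

def accuracy(scores, labels, threshold=0.5):
    """Fraction of clips classified correctly at the given score threshold."""
    preds = (np.asarray(scores, dtype=float) >= threshold).astype(int)
    return float(np.mean(preds == np.asarray(labels, dtype=int)))

def roc_auc(scores, labels):
    """AUC as the probability that a random positive (real-audio) score
    outranks a random negative (fake) score, counting ties as 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical model scores for six test clips (1 = real, 0 = fake).
scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(accuracy(scores, labels))  # 4 of 6 clips classified correctly
print(roc_auc(scores, labels))   # 8 of 9 real/fake pairs ranked correctly
```

In practice a library routine such as scikit-learn's `roc_auc_score` would be used, but the pairwise formulation makes explicit what the AUC reported by the study measures.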
VII. Conclusion
The proposed work implements an effective method for Audio Deepfake detection. The utilization of deep learning techniques, especially CNNs, represents a novel approach to addressing the challenges associated with identifying manipulated audio content. Unlike traditional methods, the CNN model is designed to automatically learn relevant features from the audio data, enabling it to effectively distinguish between authentic and deepfake audio.
Convolutional layers, max-pooling layers, batch normalization, ReLU activation functions, dropout layers, global average pooling, and a final dense layer are used in this project. Evaluation measures, including accuracy, ROC curve analysis, and the AUC metric, are used to assess how well it performs. Results indicate that the model performs reasonably well in distinguishing between real and fake audio.
References
[1] Dixit, A., Kaur, N., & Kingra, S. (2023). Review of audio deepfake detection techniques: Issues and prospects. Expert Systems, 40(8), e13322. https://doi.org/10.1111/exsy.13322
[2] Almutairi, Z., & Elgibreen, H. (2022). A Review of Modern Audio Deepfake Detection Methods: Challenges and Future Directions. Algorithms, 15(5), 155. https://doi.org/10.3390/a15050155
[3] Mcuba, Singh, Ikuesan, & Venter. (2023, March 22). The Effect of Deep Learning Methods on Deepfake Audio Detection for Digital Investigation. https://doi.org/10.1016/j.procs.2023.01.283
[4] T. Bikku, K. Bhargavi, J. Bhavitha, Y. Lalithya and T. Vineetha, "Deep Residual Learning for Unmasking DeepFake," 2023 International Conference on Advances in Electronics, Communication, Computing and Intelligent Information Systems (ICAECIS), Bangalore, India, 2023, pp. 435-440, doi: 10.1109/ICAECIS58353.2023.10170400.
[5] Jun Xue, Cunhang Fan, Zhao Lv, Jianhua Tao, Jiangyan Yi, Chengshi Zheng, Zhengqi Wen, Minmin Yuan, and Shegang Shao. 2022. Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features. In Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia (DDAM '22). Association for Computing Machinery, New York, NY, USA, 19–26. https://doi.org/10.1145/3552466.3556526
[6] R. L. M. A. P. C. Wijethunga, D. M. K. Matheesha, A. A. Noman, K. H. V. T. A. De Silva, M. Tissera and L. Rupasinghe, "Deepfake Audio Detection: A Deep Learning Based Solution for Group Conversations," 2020 2nd International Conference on Advancements in Computing (ICAC), Malabe, Sri Lanka, 2020, pp. 192-197, doi: 10.1109/ICAC51239.2020.9357161.
[7] Khochare, J., Joshi, C., Yenarkar, B., Suratkar, S., & Kazi, F. (2021). A deep learning framework for audio deepfake detection. Arabian Journal for Science and Engineering, 1-12.
[8] S. Lyu, "Deepfake Detection: Current Challenges and Next Steps," 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), London, UK, 2020, pp. 1-6, doi: 10.1109/ICMEW46912.2020.9105991.
[9] J. M. Martín-Doñas and A. Álvarez, "The Vicomtech Audio Deepfake Detection System Based on Wav2vec2 for the 2022 ADD Challenge," ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore, 2022, pp. 9241-9245, doi: 10.1109/ICASSP43922.2022.9747768.
[10] B. Vimal, M. Surya, Darshan, V. S. Sridhar and A. Ashok, "MFCC Based Audio Classification Using Machine Learning," 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 2021, pp. 1-4, doi: 10.1109/ICCCNT51525.2021.9579881.
[11] Lalitha, S., Geyasruti, D., Ramamurthi, N., & Shravani, M. (2015). Emotion detection using MFCC and Cepstrum features. Procedia Computer Science, 70, 29–35. https://doi.org/10.1016/j.procs.2015.10.020
[12] K. Jaiswal and D. Kalpeshbhai Patel, "Sound Classification Using Convolutional Neural Networks," 2018 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), Bangalore, India, 2018, pp. 81-84, doi: 10.1109/CCEM.2018.00021.
[13] S. Hershey et al., "CNN architectures for large-scale audio classification," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 2017, pp. 131-135, doi: 10.1109/ICASSP.2017.7952132.
[14] F. Rong, "Audio Classification Method Based on Machine Learning," 2016 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS), Changsha, China, 2016, pp. 81-84, doi: 10.1109/ICITBS.2016.98.
[15] A. Aditya, R. Vinod, A. Kumar, I. Bhowmik and J. Swaminathan, "Classifying Speech into Offensive and Hate Categories along with Targeted Communities using Machine Learning," 2022 International Conference on Inventive Computation Technologies (ICICT), Nepal, 2022, pp. 291-295, doi: 10.1109/ICICT54344.2022.9850944.
[16] Kartik, P. V., & Gb, J. (2020). A Deep Learning Based System to Predict the Noise (Disturbance) in Audio Files. Intelligent Systems and Computer Technology, 37, 154.
[17] R. Chinmayi, N. Sreeja, A. S. Nair, M. K. Jayakumar, R. Gowri and A. Jaiswal, "Emotion Classification Using Deep Learning," 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 2020, pp. 1063-1068, doi: 10.1109/ICSSIT48917.2020.9214103.
[18] S. S. Poorna, C. Y. Jeevitha, S. J. Nair, S. Santhosh and G. J. Nair, "Emotion recognition using multi-parameter speech feature classification," 2015 International Conference on Computers, Communications, and Systems (ICCCS), Kanyakumari, India, 2015, pp. 217-222, doi: 10.1109/CCOMS.2015.7562904.