Deepfake Audio Detection Via MFCC Features Using Machine Learning
ABSTRACT Deepfake content is created or altered synthetically using artificial intelligence (AI) approaches to appear real. It can include synthesized audio, video, images, and text. Deepfakes can now produce natural-looking content, making them harder to identify. Much progress has been achieved in identifying video deepfakes in recent years; nevertheless, most investigations into detecting audio deepfakes have employed the ASVspoof or AVspoof dataset with various machine learning and deep learning algorithms. This research uses machine and deep learning-based approaches to identify deepfake audio. The Mel-frequency cepstral coefficients (MFCC) technique is used to acquire the most useful information from the audio. We choose the Fake-or-Real dataset, which is the most recent benchmark dataset. The dataset was created with a text-to-speech model and is divided into four sub-datasets, for-rerec, for-2-sec, for-norm, and for-original, according to audio length and bit rate. The experimental results show that the support vector machine (SVM) outperformed the other machine learning (ML) models in terms of accuracy on the for-rerec and for-2-sec datasets, while the gradient boosting model performed very well on the for-norm dataset. The VGG-16 model produced highly encouraging results when applied to the for-original dataset and outperforms other state-of-the-art approaches.
INDEX TERMS Deepfakes, Deepfake audio, Synthetic audio, Machine learning, Acoustic Data
second, Lyrebird uses deep learning models. The success of a TTS system is highly dependent on the quality of the speech corpus upon which it is built, and it is costly to collect and annotate speech samples. Char2Wav is a framework for end-to-end speech synthesis. PixelCNN is also the foundation of WaveNet [24], a speech synthesis (SS) framework. WaveGlow prioritizes stage two of the two-stage process generally used by text-to-speech synthesis systems (encoder and decoder); it is therefore concerned with modifying specific time-aligned data, incorporating information into sound files via encodings such as a mel-spectrogram. The Tacotron 2 [25] system comprises two parts. The first component is an attention-based recurrent sequence-to-sequence feature prediction network whose output is a predicted sequence of mel-spectrogram frames. A modified WaveNet vocoder is the second component. For audio data, [26], [27] used GAN-based generative models. Such a model operates on mel spectrograms and employs a fully convolutional feed-forward network as the generator. The authors give a summary of their recently created dataset. It comprises 117,985 generated audio segments in 16-bit Pulse Code Modulation (PCM) WAV format and is available on Zenodo (https://zenodo.org/record/5642694).

Current studies show poor validation and testing performance in detecting deepfake audio. Feature-based techniques are required to improve the outputs of machine learning models. Deep learning approaches show better results but require greater training time and computational resources. Hence, the potential of machine learning approaches for deepfake detection is explored, while their limitation in handling higher feature sets and complexities can be addressed through a transfer-learning-based deep learning approach.

III. PROPOSED METHODOLOGY
In machine learning, training a model always involves the trade-off between over-fitting and under-fitting, which negatively impacts the model's real-time performance. It is difficult to handle this trade-off so that models neither over-fit nor under-fit. One of the major issues in deepfake detection is the high false-positive rate, which occurs because most models classify an unseen pattern as abnormal if it is not included in the training set. This is due to the model's inability to be trained on a sufficiently large dataset: a dataset that covers all possible patterns and cases, deepfake and real, is a theoretical concept that cannot be implemented practically. Hence, the Fake-or-Real dataset [28] is divided into four datasets (for-rerec, for-2-sec, for-norm, and for-original), where the for-original dataset is the collection of the other three datasets without much preprocessing.

This research aimed to develop a technique to classify deepfake synthetic audio under different background noises, audio sizes, and durations. We proposed a framework that handles the big training set and performs detection using different supervised and unsupervised machine learning algorithms. The following section explains the proposed framework for all sub-datasets, including data handling, preprocessing, feature engineering, and the classification phase. Figure 1 shows the detailed architecture diagram of the proposed framework, consisting of 1) data preprocessing, 2) feature extraction, and 3) classification models. The detailed description of each phase is as follows:

A. DATA PREPROCESSING
More than 195,000 real human and synthetic computer-generated speech samples are included in the Fake-or-Real (FoR) collection. Classifiers may be trained on the dataset to identify fake speech better. It includes information from Deep Voice 3 [29] and Google WaveNet TTS [24], as well as various human sound recordings. This dataset may be accessed in four different varieties: 1) for-original, 2) for-norm, 3) for-2sec, and 4) for-rerec. The original version includes the files without any changes from when they were first extracted from the speech sources. The second (for-norm) contains the same files as the first, but they have been standardized in terms of sampling rate, volume, and channels to achieve gender and class parity. The second is the basis for the third (for-2sec), except that the files are truncated after 2 seconds instead of keeping the original length. The fourth and final version (for-rerec) is a re-recorded version of the for-2sec dataset created to simulate an attacker transmitting an utterance via a voice channel. However, these datasets suffer from duplicate files, 0-bit files, and differing bit rates across audio signals, which negatively affect the ML models' training and performance. Hence, we preprocess the dataset to remove duplicate and 0-bit files, which do not contribute to model training. The bit rate is also standardized by zero-padding any audio waveform with fewer than 16,000 samples, conforming to an operationally viable bit rate for the TensorFlow audio signal processing library. Finally, the data is normalized using a standard scaler to ease model training.

B. FEATURE EXTRACTION
A deepfake audio signal often has a feature set similar to that of the original signal. However, distinguishing the two is challenging owing to advances in the deep learning approaches used to generate deepfakes. Hence, the extracted features can strongly affect the model's predictive power and accuracy. It is observed that audio signals in the frequency domain can provide features that are helpful in the detection and classification of deepfake audios, which can deceive a human under specific scenarios. For this purpose, we use Mel-frequency cepstral coefficients (MFCC), a widely used feature for speech recognition [30], [31]. The Fake-or-Real dataset used in this study is a more recent dataset, which was previously used in only one study. This study is not limited to MFCC features; we also employed cepstral (MFCC), spectral (roll-off point, centroid, contrast, bandwidth), raw-signal (zero-crossing rate), and signal-energy features and made a feature ensemble, but our primary focus is on MFCC features.
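The preprocessing steps of Section III-A can be made concrete with a short sketch. This is a minimal illustration, not the authors' released code: it assumes librosa for loading audio, the 16,000-sample zero-padding target stated above, and scikit-learn's StandardScaler for normalization; the helper names and the file-hash duplicate check are our own assumptions.

```python
import hashlib
import os

import librosa
import numpy as np
from sklearn.preprocessing import StandardScaler

TARGET_LEN = 16_000  # zero-padding target from Section III-A


def clean_file_list(paths):
    """Drop 0-bit files and byte-identical duplicates (hypothetical helper)."""
    seen, kept = set(), []
    for path in paths:
        if os.path.getsize(path) == 0:  # 0-bit files add nothing to training
            continue
        with open(path, "rb") as fh:
            digest = hashlib.md5(fh.read()).hexdigest()
        if digest not in seen:  # skip duplicate recordings
            seen.add(digest)
            kept.append(path)
    return kept


def load_and_pad(path, sr=16_000):
    """Load a mono waveform and zero-pad it to TARGET_LEN samples."""
    wav, _ = librosa.load(path, sr=sr, mono=True)
    if len(wav) < TARGET_LEN:
        wav = np.pad(wav, (0, TARGET_LEN - len(wav)))
    return wav


# Feature matrices (built in Section III-B) are then normalized; the scaler
# should be fit on the training split only.
scaler = StandardScaler()
```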
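The feature ensemble of Section III-B (MFCC-40 plus spectral, raw-signal, and energy features) maps naturally onto librosa's feature module. Below is a hedged sketch: mean-pooling the frame-level features into one fixed-length vector is our assumption, not a detail given in the paper.

```python
import librosa
import numpy as np


def feature_ensemble(wav, sr=16_000):
    """Cepstral + spectral + raw-signal + energy features for one waveform."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=40)          # MFCC-40
    rolloff = librosa.feature.spectral_rolloff(y=wav, sr=sr)      # roll-off point
    centroid = librosa.feature.spectral_centroid(y=wav, sr=sr)
    contrast = librosa.feature.spectral_contrast(y=wav, sr=sr)
    bandwidth = librosa.feature.spectral_bandwidth(y=wav, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(wav)                 # raw signal
    energy = librosa.feature.rms(y=wav)                           # signal energy
    parts = [mfcc, rolloff, centroid, contrast, bandwidth, zcr, energy]
    # Mean-pool each feature over time frames (pooling choice is an assumption).
    return np.concatenate([p.mean(axis=1) for p in parts])
```

Stacking this vector over all cleaned files yields the feature matrix that the scaler above and the classifiers described later operate on.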
[Figure 1: Detailed architecture of the proposed framework: data preprocessing (duplicate removal, normalization, audio windowing), features processing, and classification models.]
decision surfaces to smaller problems that may be solved by making them linearly separable, and 2) only training patterns near the decision surface provide the most sensitive details for classification. Assume a deepfake detection problem as a binary classification with linearly separable vectors xi ∈ R^n, where the decision surface used to classify a pattern as belonging to one of the two classes is the hyperplane H0. If x is a random vector in R^n, we define

f(x) = w · x + b    (4)

The dot product is denoted by (·) in Equation (4). The set of all vectors x that satisfy f(x) = 0 is denoted by H0. Assuming two hyperplanes, H1 and H2, the distance between them is referred to as their margin, which can be represented as follows:

2 / ∥w∥    (5)

The decision hyperplane H0 depends on the vectors closest to the two parallel hyperplanes, called support vectors. The margin must be maximal to obtain a classifier that is not overly adapted to the training data. Consider a collection of training data vectors X = x1, . . . , xL, xi ∈ R^n, and a set of matching labels Y = y1, . . . , yL, yi ∈ {+1, −1}. We consider the hyperplane H0 to be optimally separating if the vectors are categorized without error and the margin is greatest. To be accurately categorized, the vectors must verify

f(xi) ≥ +1 for yi = +1    (6)
f(xi) ≤ −1 for yi = −1    (7)

Hence, finding the SVM classifying function H0 can be stated as follows:

minimize (1/2)∥w∥²    (8)
subject to yi f(xi) ≥ 1, ∀i    (9)

The SVM was chosen for its properties that aid in classifying deepfake audios. It performs well with a clear margin of separation between samples and is effective in high-dimensional environments. It employs a subset of training points in the decision function, making it memory efficient, and it works well when the number of dimensions exceeds the size of the sample set. SVM does not perform very well on our for-original dataset because the required training time and the noise in the dataset are higher. It does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation that takes a long time to train. However, the clean datasets extracted from the for-original dataset perform better on the classification task. SVM has been shown to perform effectively on higher-dimensional data, most notably when detecting events in audio data. Hence, for deepfake audio, we implemented it using the Scikit-learn library with a radial basis function (RBF) kernel, C = 4, and probability = True.
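These SVM settings translate directly into scikit-learn. The following sketch uses the settings stated above (RBF kernel, C = 4, probability = True) on placeholder data, since the real input would be the scaled feature matrix built earlier; the synthetic data and its dimensionality are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder features standing in for the scaled MFCC-based matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 52))    # 52-dim ensemble from the sketch above
y = rng.integers(0, 2, size=200)  # 0 = real, 1 = fake (synthetic labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Settings from the text: RBF kernel, C = 4, probability = True.
clf = SVC(kernel="rbf", C=4, probability=True).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print("P(real), P(fake):", clf.predict_proba(X_te[:1]))  # needs probability=True
```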
3) Multi-layer Perceptron (MLP)
MLP is adequate for classification tasks; a multilayer perceptron, through its layers, can effectively filter the relevant features from data and tune the parameters of the model for optimal predictions. There are at least three levels in the MLP model: an input layer, a hidden layer of computation nodes, and an output layer of processing nodes. In this study, we use the following MLP hyperparameters: hidden-layer size = 100, solver = adam or RMSprop (RMSprop is used for the smaller datasets), shuffle = True, verbose = False, and activation function = relu.

4) Extreme Gradient Boosting (XGB)
XGB is a parallel and optimized version of gradient boosting algorithms that combines efficiency and resource management. It implements gradient-boosted decision trees in an iterative model, combining weak base models into a stronger learner. The residual is utilized to refine the loss function and improve the prior prediction at each iteration of the gradient boosting algorithm. We use a learning rate of 0.1 and 10,000 estimators for the XGBoost algorithm. However, XGB is vulnerable to outliers because each successive classifier is compelled to correct the mistakes made by its prior learners. This is because the estimators rely on historical predictions to determine their accuracy. For this reason, streamlining the process is complex.

IV. EXPERIMENTS AND RESULTS
About 195,000 human and synthetic speech samples were used to create the Fake-or-Real (FoR) dataset; Table 1 offers a summary of the dataset. Classifiers may be trained on the dataset to identify fake speech better. The dataset is an amalgamation of information from the following recent sources: first, text-to-speech programs such as Deep Voice 3 [29] and Google WaveNet TTS [24]; second, many different types of recorded human voice, including those from the Arctic dataset, the LJSpeech dataset, the VoxForge dataset, and user-submitted recordings [33], [34], [35]. The four dataset versions available for public consumption are for-original, for-norm, for-2sec, and for-rerec. The for-original folder stores the raw data from the speech sources. The for-norm version contains the same files but is well-balanced across demographic categories (gender and class) and technical parameters (sample rate, volume, and channels). The third one is like the second, only the files are cut off after 2 seconds, and it is called for-2sec. The last variant, dubbed for-rerec, is a re-recording of the for-2sec dataset meant to mimic a situation in which an attacker transmits speech over a vocal channel like a phone call or voice message. We provide the outcomes of our binary classification analysis of the suggested method; Table 2 shows the experimental findings for spotting deepfakes.

The experiments were also performed using noisy audio signals. For this purpose, we added synthetic noise to each audio signal of three datasets (for-2sec, for-norm, and for-rerec). This method kept both the original and noisy audio in the dataset and increased the number of audio samples. The original for-2sec dataset contains 17,870 audio samples; after adding noise, the new dataset is composed of 35,740 audio samples, and the same holds for the for-rerec and for-norm datasets.

A. FOR-REREC DATASET
The results for the for-rerec dataset are presented in Table 2. Multiple ML models are applied to obtain better results. The machine learning algorithms perform as follows: Support Vector Machine (SVM) 98.83% accuracy, Decision Tree 88.28%, Random Forest 96.60%, AdaBoost 87.67%, Gradient Boosting 93.51%, and XGB 93.40%. The SVM model exhibited the highest results on the for-rerec dataset.

The results for classifying the noisy for-rerec audio signals are presented in Table 3. The results depict that the MLP and SVM models obtained the highest accuracy scores of 98.66% and 98.43% compared to the other ML models. The other ML models, DT, LR, and XGB, obtained 82.12%, 88%, and 88.92% accuracy, respectively.
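The noise-augmentation step just described, which doubles for-2sec from 17,870 to 35,740 samples, can be sketched as follows. The paper says synthetic noise was added but not which kind; additive Gaussian noise at a fixed signal-to-noise ratio is our assumption for this illustration.

```python
import numpy as np


def add_noise(wav, snr_db=20.0, rng=None):
    """Return a noisy copy of wav at the requested SNR (Gaussian assumption)."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(wav ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=wav.shape)
    return wav + noise


def augment(waveforms):
    """Keep each clean waveform alongside its noisy copy, doubling the count,
    e.g. for-2sec: 17,870 originals -> 35,740 samples after augmentation."""
    return list(waveforms) + [add_noise(w) for w in waveforms]
```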
Tables 2 and 3 compare the results of the machine learning models trained with the feature-based approach. Our approach of selecting the best features and ML classifiers obtained promising results on three datasets (for-rerec, for-2sec, and for-norm). However, the for-norm dataset does not perform well with our approach using a simple SVM algorithm, as the data is of high dimensionality; without dimensionality reduction on a complex dataset, it performs poorly. This dataset contains audio of length greater than 12 seconds. Hence, a windowing technique can perform better in combination with MFCC. The proposed approach is compared with the baseline approach that used the for-original dataset for experimentation [36]. The existing approach used various ML models (SVM, RF, KNN, XGB) to detect deepfakes in the for-original dataset. The proposed approach obtains the highest testing score of 93%, which is 26% higher than the best score of the existing work using the SVM model. It is concluded that the proposed approach can efficiently detect deepfake audio. The dataset used in this study was used in only one previous study, and the proposed and existing approaches' experimental settings are similar (dataset, data split). In addition, a comparative analysis of the proposed method against state-of-the-art feature extraction techniques is presented in Table 4. The proposed approach combines features from multiple feature extraction techniques and extracts the most optimal features for classification. Two deep learning models are employed in this research: the proposed approach employs VGG16 and LSTM models with a feature ensemble of MFCC-40, roll-off point, centroid, contrast, and bandwidth features. The features extracted from each method are combined for model classification. The VGG16 model obtained the highest results compared to the existing study, with an accuracy of 93%. Furthermore, the LSTM model obtained an accuracy of 91%.

The existing approach proposed by Khochare et al. (2021) used MFCC features and various machine learning models for deepfake audio detection [36]. They utilized 20 MFCC features for each audio sample and employed multiple machine learning models (SVM, RF, KNN, and XGB). Using the 20 MFCC features with the SVM model, they obtained their highest accuracy rate of 67%. Another study, proposed by Reimao et al. (2019), used both machine learning and deep learning techniques along with various feature extraction methods [28]. The authors used Timbre Model Analysis (brightness, hardness, depth, roughness) features with multiple ML models (NB, SVM, DT, and RF). According to the ML model classification results, the SVM model using these feature extraction methods obtained a 73.46% accuracy rate. Furthermore, STFT, mel-spectrogram, MFCC, and CQT feature extraction methods were used with the VGG19 model and obtained 89.79% accuracy. Compared to the previous research, our VGG16 model achieved the highest results, with an accuracy of 93%, and the LSTM model achieved 91% accuracy. The VGG16 model's loss and its training and validation accuracy are shown in Figure 4. The comparison showing that the proposed approach with the features mentioned in Section III-B outperforms the previous state-of-the-art feature extraction techniques is presented in Table 4.

V. DISCUSSION
This research extended work on deepfake audio detection using the Fake-or-Real dataset, a state-of-the-art benchmark for audio deepfake detection and classification. We improved upon the performance of algorithms previously trained on feature-based approaches by using MFCC-based features, indicating considerable improvements in accuracy. Our features outperform the feature-based approach by 10 to 20 percent on average across these datasets. The for-norm dataset performs poorly with our approach using simple SVM algorithms. Windowing techniques, in combination with MFCC, can perform better.
TABLE 4: Comparison between results of the proposed approach and existing approaches

Approach                | Features                                                        | Model     | Accuracy (%)
------------------------|-----------------------------------------------------------------|-----------|-------------
Existing approach [36]  | MFCC-20                                                         | SVM       | 67
                        |                                                                 | RF        | 62
                        |                                                                 | KNN       | 62
                        |                                                                 | XGB       | 59
Existing approach [28]  | Timbre Model Analysis (Brightness, Hardness, Depth, Roughness)  | NB        | 67.27
                        |                                                                 | SVM       | 73.46
                        |                                                                 | DT (J48)  | 70.26
                        |                                                                 | RF        | 71.47
Existing approach [28]  | STFT, Mel-Spectrograms, MFCC, and CQT                           | VGG19     | 89.79
Proposed approach       | MFCC-40, Roll-off point, centroid, contrast, bandwidth          | LSTM      | 91
                        |                                                                 | VGG16     | 93
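To make the VGG16 transfer-learning setup of Table 4 concrete, here is a minimal Keras sketch. Treating the MFCC matrices as 3-channel images, freezing the ImageNet backbone, and the specific input shape and classification head are all assumptions of this illustration, not details given in the paper.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# MFCC "images": 40 coefficients x 128 frames, tiled to 3 channels so the
# ImageNet-pretrained backbone accepts them (shape is illustrative).
INPUT_SHAPE = (40, 128, 3)

base = VGG16(weights="imagenet", include_top=False, input_shape=INPUT_SHAPE)
base.trainable = False  # transfer learning: keep pretrained filters frozen

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # real vs. fake
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(train_mfcc_images, train_labels, validation_split=0.1, epochs=10)
```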
We conducted additional experiments on machine learning algorithms categorized into: (1) statistical models such as QDA, LDA, and Gaussian Naive Bayes, which use dimensionality reduction to reduce noise in the data; (2) tree-based models such as Decision Tree, Extra Tree, and Random Forest, which can handle multidimensional data, do not involve domain knowledge or parameter setting, and are appropriate for exploratory pattern detection; and (3) boosting models, namely AdaBoost, Gradient Boosting, and XGBoost, which fundamentally create several weak learners and combine their predictions to build a strong rule, helping to increase the accuracy of a model on feature-rich audio data. These three classes of ML algorithms were chosen for our approach to explore and improve performance on MFCC-based feature sets. Besides this, we proposed a VGG-16-based deep learning model for the bigger dataset, which is the superset of the other three datasets. It uses transfer learning and is trained on MFCC image features. We obtained an accuracy of 93% while using half of the original dataset; a larger amount of data correlates with higher model accuracy. We tried to obtain a limited-performance dataset. The entire dataset can be explored for even better results in the future.

VI. CONCLUSION
The detection of deepfake audio is significant as an essential tool for enhancing security against scamming and spoofing. Deepfake audios have garnered significant public attention as society rapidly recognizes their possible security danger. However, deepfake audio has mostly been studied in combination with the spatio-temporal data of video. This study improves upon the Fake-or-Real (FoR) dataset, which comprises state-of-the-art audio datasets and custom audios for deepfake audio classification and is further compiled into four sub-datasets. This study conducted experiments with multiple audio data features to detect deepfakes in audio data. This work extracts MFCC features from audio for feature engineering. Several machine learning algorithms are applied to the selected feature set to detect deepfake audio. This approach gave higher accuracy and results in all cases than other state-of-the-art studies for audio data. This study obtained 97.57% accuracy with SVM on the for-2sec dataset compared to other ML models, while 92.63% was obtained by the Gradient Boosting classifier on the for-norm dataset, and the highest accuracy of 98.83% was obtained using the SVM model on the for-rerec dataset. We plan to explore different window sizes for MFCC and various input sizes for models in the future. Future work can also evaluate these models against potential fluctuation and distortion in the audio signal. Moreover, studies on state-of-the-art few-shot learning and Bidirectional Encoder Representations from Transformers (BERT) based models can be conducted. Furthermore, we plan to evaluate our models under ambient noise and reverberation circumstances. We intend to use feature extraction methods like i-vectors, x-vectors, a combination of MFCC and GFCC, and a combination of DWT and MFCC, which were not taken into account in the current set of experiments because this is the beginning of our journey to identify deepfake audio.

REFERENCES
[1] A. Abbasi, A. R. R. Javed, A. Yasin, Z. Jalil, N. Kryvinska, and U. Tariq, "A large-scale benchmark dataset for anomaly detection and rare event classification for audio forensics," IEEE Access, vol. 10, pp. 38885–38894, 2022.
[2] A. R. Javed, W. Ahmed, M. Alazab, Z. Jalil, K. Kifayat, and T. R. Gadekallu, "A comprehensive survey on computer forensics: State-of-the-art, tools, techniques, challenges, and future directions," IEEE Access, 2022.
[3] A. R. Javed, Z. Jalil, W. Zehra, T. R. Gadekallu, D. Y. Suh, and M. J. Piran, "A comprehensive survey on digital video forensics: Taxonomy, challenges, and future directions," Engineering Applications of Artificial Intelligence, vol. 106, p. 104456, 2021.
[4] A. Ahmed, A. R. Javed, Z. Jalil, G. Srivastava, and T. R. Gadekallu, "Privacy of web browsers: a challenge in digital forensics," in International Conference on Genetic and Evolutionary Computing, pp. 493–504, Springer, 2021.
[5] A. R. Javed, F. Shahzad, S. ur Rehman, Y. B. Zikria, I. Razzak, Z. Jalil, and G. Xu, "Future smart cities requirements, emerging technologies, applications, challenges, and future aspects," Cities, vol. 129, p. 103794, 2022.
[6] A. Abbasi, A. R. Javed, F. Iqbal, Z. Jalil, T. R. Gadekallu, and N. Kryvinska, "Authorship identification using ensemble learning," Scientific Reports, vol. 12, no. 1, pp. 1–16, 2022.
[7] S. Anwar, M. O. Beg, K. Saleem, Z. Ahmed, A. R. Javed, and U. Tariq, "Social relationship analysis using state-of-the-art embeddings," Transactions on Asian and Low-Resource Language Information Processing, 2022.
[8] C. Stupp, "Fraudsters used AI to mimic CEO's voice in unusual cybercrime case," The Wall Street Journal, vol. 30, no. 08, 2019.
[9] T. T. Nguyen, Q. V. H. Nguyen, C. M. Nguyen, D. Nguyen, D. T. Nguyen, and S. Nahavandi, "Deep learning for deepfakes creation and detection: A survey," arXiv preprint arXiv:1909.11573, 2019.
[10] Z. Khanjani, G. Watson, and V. P. Janeja, "How deep are the fakes? Focusing on audio deepfake: A survey," arXiv preprint arXiv:2111.14203, 2021.
[11] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, "ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[12] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, "The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection," ISCA (the International Speech Communication Association), 2017.
[13] J. Yamagishi, M. Todisco, M. Sahidullah, H. Delgado, X. Wang, N. Evans, T. Kinnunen, K. A. Lee, V. Vestman, and A. Nautsch, "ASVspoof 2019: Automatic speaker verification spoofing and countermeasures challenge evaluation plan," tech. rep., 2019. [Online]. Available: http://www.asvspoof.org
[14] S. Ö. Arık, H. Jun, and G. Diamos, "Fast spectrogram inversion using multi-head convolutional neural networks," IEEE Signal Processing Letters, vol. 26, no. 1, pp. 94–98, 2018.
[15] Y. Chen, Y. Kang, Y. Chen, and Z. Wang, "Probabilistic forecasting with temporal convolutional neural network," Neurocomputing, vol. 399, pp. 491–501, 2020.
[16] Y. Kawaguchi, "Anomaly detection based on feature reconstruction from subsampled audio signals," in 2018 26th European Signal Processing Conference (EUSIPCO), pp. 2524–2528, IEEE, 2018.
[17] Y. Kawaguchi and T. Endo, "How can we detect anomalies from subsampled audio signals?," in 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6, IEEE, 2017.
[18] H. Landau, "Sampling, data transmission, and the Nyquist rate," Proceedings of the IEEE, vol. 55, no. 10, pp. 1701–1706, 1967.
[19] H. Yu, Z.-H. Tan, Z. Ma, R. Martin, and J. Guo, "Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 4633–4644, 2017.
[20] S. Pradhan, W. Sun, G. Baig, and L. Qiu, "Combating replay attacks against voice assistants," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 3, no. 3, pp. 1–26, 2019.
[21] J. Villalba and E. Lleida, "Preventing replay attacks on speaker verification systems," in 2011 Carnahan Conference on Security Technology, pp. 1–8, IEEE, 2011.
[22] F. Tom, M. Jain, and P. Dey, "End-to-end audio replay attack detection using deep convolutional networks with attention," in Interspeech, pp. 681–685, 2018.
[23] K. Kuligowska, P. Kisielewicz, and A. Włodarz, "Speech synthesis systems: disadvantages and limitations," Int J Res Eng Technol (UAE), vol. 7, pp. 234–239, 2018.
[24] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[25] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783, IEEE, 2018.
[26] J. Frank and L. Schönherr, "WaveFake: A data set to facilitate audio deepfake detection," arXiv preprint arXiv:2111.02813, 2021.
[27] M. Hassaballah, M. A. Hameed, and M. H. Alkinani, "Introduction to digital image steganography," in Digital Media Steganography, pp. 1–15, Elsevier, 2020.
[28] R. Reimao and V. Tzerpos, "FoR: A dataset for synthetic speech detection," in 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), pp. 1–10, IEEE, 2019.
[29] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," arXiv preprint arXiv:1710.07654, 2017.
[30] F. M. Rammo and M. N. Al-Hamdani, "Detecting the speaker language using CNN deep learning algorithm," Iraqi Journal For Computer Science and Mathematics, vol. 3, no. 1, pp. 43–52, 2022.
[31] Z. A. Abbood, B. T. Yasen, M. R. Ahmed, A. D. Duru, et al., "Speaker identification model based on deep neural networks," Iraqi Journal For Computer Science and Mathematics, vol. 3, no. 1, pp. 108–114, 2022.
[32] A. Winursito, R. Hidayat, and A. Bejo, "Improvement of MFCC feature extraction accuracy using PCA in Indonesian speech recognition," in 2018 International Conference on Information and Communications Technology (ICOIACT), pp. 379–383, IEEE, 2018.
[33] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004.
[34] K. Ito and L. Johnson, "The LJ Speech dataset." https://keithito.com/LJ-Speech-Dataset/, 2017.
[35] K. MacLean, "VoxForge," 2018. [Online]. Available: http://www.voxforge.org/home (accessed 2012).
[36] J. Khochare, C. Joshi, B. Yenarkar, S. Suratkar, and F. Kazi, "A deep learning framework for audio deepfake detection," Arabian Journal for Science and Engineering, pp. 1–12, 2021.

AMEER HAMZA is with the Department of Creative Technology, Air University, Islamabad. He is doing his Master's degree in Artificial Intelligence from Air University, Islamabad, Pakistan.

ABDUL REHMAN JAVED is a lecturer at the Department of Cyber Security, Air University, Islamabad, Pakistan. He has worked with the National Cybercrimes and Forensics Laboratory at Air University, Islamabad, Pakistan. He received his Master's degree in Computer Science from the National University of Computer and Emerging Sciences, Islamabad, Pakistan. He is a member of both IEEE and ACM. He is a cybersecurity researcher and practitioner with industry and academic experience. He has reviewed over 150 scientific research articles for various well-known journals. He is a TPC member of CID2021 (Fourth International Workshop on Cybercrime Investigation and Digital Forensics) and the 44th International Conference on Telecommunications and Signal Processing. He has served as a moderator in the 1st IEEE International Conference on Cyber Warfare and Security (ICCWS). He has authored over 50 peer-reviewed research articles and is supervising/co-supervising several graduate (BS and MS) students on health informatics, cybersecurity, mobile computing, and digital forensics topics. His current research interests include but are not limited to mobile and ubiquitous computing, data analysis, knowledge discovery, data mining, natural language processing, smart homes, and their applications in human activity analysis, human motion analysis, and e-health. He aims to contribute to interdisciplinary research in computer science and human-related disciplines.
FARKHUND IQBAL holds the position of Associate Professor in the College of Technological Innovation, Zayed University, United Arab Emirates. He is an Affiliate Professor at the School of Information Studies, McGill University, Canada, and an Adjunct Professor at the Faculty of Business and IT, Ontario Tech University, Canada. He leads the Cybersecurity and Digital Forensics (CAD) research group at the Center for Smart Cities and Intelligent Systems, Zayed University. He holds a Master's (2005) and a Ph.D. degree (2011) from Concordia University, Canada. He uses Artificial Intelligence, Machine Learning, and Data Analytics techniques for problem-solving in cybersecurity, health care, and cybercrime investigation in the smart city domain. He has published more than 120 papers in high-ranked journals and conferences. He has served as a chair and co-chair for several IEEE/ACM conferences and has been a guest editor and reviewer for multiple high-rank journals.

ZUNERA JALIL is an Assistant Professor at the Department of Cyber Security, Faculty of Computing & Artificial Intelligence, Air University, Islamabad, and a senior researcher at the National Cybercrimes and Forensics Lab, National Center for Cyber Security, Islamabad, Pakistan. She earned her Ph.D. in Computer Science with a specialization in Information Security from FAST National University of Computer and Emerging Sciences, Islamabad, Pakistan, in 2010. She received her Master's degree in Computer Science in 2007 with a scholarship from the Higher Education Commission of Pakistan. She has served as a full-time faculty member at International Islamic University, Islamabad; Iqra University, Islamabad; and Saudi Electronic University, Riyadh, Saudi Arabia. Her research interests include but are not limited to computer forensics, machine learning, criminal profiling, software watermarking, intelligent systems, and data privacy protection.