Deepfake Audio Detection Via MFCC Features Using Machine Learning
ABSTRACT Deepfake content is created or altered synthetically using artificial intelligence (AI) approaches to appear real. It can include synthesized audio, video, images, and text. Deepfakes can now produce natural-looking content, making them harder to identify. Much progress has been achieved in identifying video deepfakes in recent years; nevertheless, most investigations into detecting audio deepfakes have employed the ASVspoof or AVspoof dataset with various machine learning and deep learning algorithms. This research uses machine and deep learning-based approaches to identify deepfake audio. The Mel-frequency cepstral coefficients (MFCC) technique is used to acquire the most useful information from the audio. We choose the Fake-or-Real dataset, which is the most recent benchmark dataset. The dataset was created with a text-to-speech model and is divided into four sub-datasets, for-rerec, for-2-sec, for-norm, and for-original, according to audio length and bit rate. The experimental results show that the support vector machine (SVM) outperformed the other machine learning (ML) models in terms of accuracy on the for-rerec and for-2-sec datasets, while the gradient boosting model performed very well on the for-norm dataset. The VGG-16 model produced highly encouraging results when applied to the for-original dataset and outperforms other state-of-the-art approaches.
INDEX TERMS Deepfakes, Deepfake audio, Synthetic audio, Machine learning, Acoustic Data
second, Lyrebird uses deep learning models. The success of a TTS system is highly dependent on the quality of the speech corpus upon which it is built, and it is costly to collect and annotate speech samples. Char2Wav is a framework for end-to-end speech synthesis. PixelCNN is also the foundation of WaveNet [24], a speech synthesis (SS) framework. WaveGlow prioritizes stage two of the two-stage process generally used by text-to-speech synthesis systems (encoder and decoder); it is therefore concerned with modifying specific time-aligned data, incorporating information into sound files via encodings such as a mel-spectrogram. The Tacotron 2 [25] system comprises two parts. The first component is an attention-based recurrent sequence-to-sequence feature prediction network whose output is a predicted sequence of mel-spectrogram frames. A modified WaveNet vocoder is the second component. For audio data, [26], [27] used GAN-based generative models. Such a model operates on mel spectrograms and employs a fully convolutional feed-forward network as the generator. The authors give a summary of their recently created dataset. It comprises 117,985 generated audio segments in 16-bit Pulse Code Modulation (PCM) WAV format and is available on Zenodo (https://zenodo.org/record/5642694).

Current studies show poor validation and testing performance in detecting deepfake audio. Feature-based techniques are required to improve the outputs of machine learning models. Deep learning approaches show better results but require greater training time and computational resources. Hence, the potential of machine learning approaches for deepfake detection is explored, while their limitation in handling higher feature sets and complexities can be addressed through a transfer-learning-based deep learning approach.

III. PROPOSED METHODOLOGY
In machine learning, training a model always involves the trade-off between over-fitting and under-fitting, which negatively impacts the model's real-time performance. It is difficult to handle this trade-off so that models neither over-fit nor under-fit. One of the major issues in deepfake detection is the high false-positive rate, which occurs because most models classify an unseen pattern as abnormal if it is not included in the training set. This is due to the model's inability to be trained on a sufficiently large dataset: a dataset that covers all possible patterns and cases, deepfake and real, is a theoretical concept that cannot be implemented practically. Hence, the Fake-or-Real dataset [28] is divided into four datasets (for-rerec, for-2-sec, for-norm, and for-original), where the for-original dataset is the collection of the other three datasets without much preprocessing.

This research aimed to develop a technique to classify deepfake synthetic audio under different background noises, audio sizes, and durations. We proposed a framework that handles the big training set and performs detection using different supervised and unsupervised machine learning algorithms. The following section explains the proposed framework for all sub-datasets, including data handling, preprocessing, feature engineering, and the classification phase. Figure 1 shows the detailed architecture diagram of the proposed framework, consisting of 1) data preprocessing, 2) feature extraction, and 3) classification models. The detailed description of each phase is as follows:

A. DATA PREPROCESSING
More than 195,000 real human and synthetic computer-generated speech samples are included in the Fake-or-Real (FoR) collection. Classifiers may be trained on the dataset to identify fake speech better. It includes information from Deep Voice 3 [29] and Google WaveNet TTS [24], as well as various human sound recordings. This dataset may be accessed in four different varieties: 1) for-original, 2) for-norm, 3) for-2sec, and 4) for-rerec. The original version includes the files without any changes from when they were first extracted from the speech sources. The second (for-norm) contains the same files as the first, but they have been standardized in terms of sampling rate, volume, and channels to achieve gender and class parity. The second is the basis for the third (for-2sec), except that the files are truncated after 2 seconds instead of keeping the original length. The fourth and final version (for-rerec) is a re-recorded version of the for-2sec dataset created to simulate an attacker transmitting an utterance via a voice channel. However, these datasets suffer from duplicate files, 0-bit files, and differing bit rates across audio signals, which negatively affect the ML models' training and performance. Hence, we preprocess the dataset to remove duplicate and 0-bit files, which do not contribute to model training. The bit rate is also standardized by zero-padding any audio waveform with fewer than 16,000 samples, conforming to an operationally viable bit rate for the TensorFlow audio signal processing library. Finally, the data is normalized using a standard scaler to ease model training.

B. FEATURE EXTRACTION
A deepfake audio signal often has a feature set similar to that of the original signal. However, distinguishing the two is challenging owing to advances in the deep learning approaches used to generate deepfakes. Hence, the extracted features can strongly affect the model's predictive power and accuracy. It is observed that audio signals in the frequency domain can provide features that are helpful in the detection and classification of deepfake audios, which can deceive a human under specific scenarios. For this purpose, we use Mel-frequency cepstral coefficients (MFCC), a widely used feature for speech recognition [30], [31]. The Fake-or-Real dataset used in this study is a more recent dataset, which was previously used in only one study. This study is not limited to MFCC features; we also employed cepstral (MFCC), spectral (roll-off point, centroid, contrast, bandwidth), raw-signal (zero-crossing rate), and signal-energy features and made a feature ensemble, but our primary focus is on MFCC features.
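The preprocessing steps of Section III-A can be made concrete with a short sketch. This is a minimal illustration, not the authors' released code: it assumes librosa for loading audio, the 16,000-sample zero-padding target stated above, and scikit-learn's StandardScaler for normalization; the helper names and the file-hash duplicate check are our own assumptions.

```python
import hashlib
import os

import librosa
import numpy as np
from sklearn.preprocessing import StandardScaler

TARGET_LEN = 16_000  # zero-padding target from Section III-A


def clean_file_list(paths):
    """Drop 0-bit files and byte-identical duplicates (hypothetical helper)."""
    seen, kept = set(), []
    for path in paths:
        if os.path.getsize(path) == 0:  # 0-bit files add nothing to training
            continue
        with open(path, "rb") as fh:
            digest = hashlib.md5(fh.read()).hexdigest()
        if digest not in seen:  # skip duplicate recordings
            seen.add(digest)
            kept.append(path)
    return kept


def load_and_pad(path, sr=16_000):
    """Load a mono waveform and zero-pad it to TARGET_LEN samples."""
    wav, _ = librosa.load(path, sr=sr, mono=True)
    if len(wav) < TARGET_LEN:
        wav = np.pad(wav, (0, TARGET_LEN - len(wav)))
    return wav


# Feature matrices (built in Section III-B) are then normalized; the scaler
# should be fit on the training split only.
scaler = StandardScaler()
```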
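The feature ensemble of Section III-B (MFCC-40 plus spectral, raw-signal, and energy features) maps naturally onto librosa's feature module. Below is a hedged sketch: mean-pooling the frame-level features into one fixed-length vector is our assumption, not a detail given in the paper.

```python
import librosa
import numpy as np


def feature_ensemble(wav, sr=16_000):
    """Cepstral + spectral + raw-signal + energy features for one waveform."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=40)          # MFCC-40
    rolloff = librosa.feature.spectral_rolloff(y=wav, sr=sr)      # roll-off point
    centroid = librosa.feature.spectral_centroid(y=wav, sr=sr)
    contrast = librosa.feature.spectral_contrast(y=wav, sr=sr)
    bandwidth = librosa.feature.spectral_bandwidth(y=wav, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(wav)                 # raw signal
    energy = librosa.feature.rms(y=wav)                           # signal energy
    parts = [mfcc, rolloff, centroid, contrast, bandwidth, zcr, energy]
    # Mean-pool each feature over time frames (pooling choice is an assumption).
    return np.concatenate([p.mean(axis=1) for p in parts])
```

Stacking this vector over all cleaned files yields the feature matrix that the scaler above and the classifiers described later operate on.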
[Figure 1: Detailed architecture of the proposed framework: data preprocessing (duplicate removal, normalization, audio windowing), features processing, and classification models.]
decision surfaces to smaller problems that may be solved by making them linearly separable, and 2) only training patterns near the decision surface provide the most sensitive details for classification. Assume a deepfake detection problem as a binary classification with linearly separable vectors xi ∈ R^n, where the decision surface used to classify a pattern as belonging to one of the two classes is the hyperplane H0. If x is a random vector in R^n, we define

f(x) = w · x + b    (4)

The dot product is denoted by (·) in Equation (4). The set of all vectors x that satisfy f(x) = 0 is denoted by H0. Assuming two hyperplanes, H1 and H2, the distance between them is referred to as their margin, which can be represented as follows:

2 / ∥w∥    (5)

The decision hyperplane H0 depends on the vectors closest to the two parallel hyperplanes, called support vectors. The margin must be maximal to obtain a classifier that is not overly adapted to the training data. Consider a collection of training data vectors X = x1, . . . , xL, xi ∈ R^n, and a set of matching labels Y = y1, . . . , yL, yi ∈ {+1, −1}. We consider the hyperplane H0 to be optimally separating if the vectors are categorized without error and the margin is greatest. To be accurately categorized, the vectors must verify

f(xi) ≥ +1 for yi = +1    (6)
f(xi) ≤ −1 for yi = −1    (7)

Hence, finding the SVM classifying function H0 can be stated as follows:

minimize (1/2)∥w∥²    (8)
subject to yi f(xi) ≥ 1, ∀i    (9)

The SVM was chosen for its properties that aid in classifying deepfake audios. It performs well with a clear margin of separation between samples and is effective in high-dimensional environments. It employs a subset of training points in the decision function, making it memory efficient, and it works well when the number of dimensions exceeds the size of the sample set. SVM does not perform very well on our for-original dataset because the required training time and the noise in the dataset are higher. It does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation that takes a long time to train. However, the clean datasets extracted from the for-original dataset perform better on the classification task. SVM has been shown to perform effectively on higher-dimensional data, most notably when detecting events in audio data. Hence, for deepfake audio, we implemented it using the Scikit-learn library with a radial basis function (RBF) kernel, C = 4, and probability = True.
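These SVM settings translate directly into scikit-learn. The following sketch uses the settings stated above (RBF kernel, C = 4, probability = True) on placeholder data, since the real input would be the scaled feature matrix built earlier; the synthetic data and its dimensionality are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder features standing in for the scaled MFCC-based matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 52))    # 52-dim ensemble from the sketch above
y = rng.integers(0, 2, size=200)  # 0 = real, 1 = fake (synthetic labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Settings from the text: RBF kernel, C = 4, probability = True.
clf = SVC(kernel="rbf", C=4, probability=True).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print("P(real), P(fake):", clf.predict_proba(X_te[:1]))  # needs probability=True
```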
3) Multi-layer Perceptron (MLP)
MLP is adequate for classification tasks; a multilayer perceptron, through its layers, can effectively filter the relevant features from data and tune the parameters of the model for optimal predictions. There are at least three levels in the MLP model: an input layer, a hidden layer of computation nodes, and an output layer of processing nodes. In this study, we use the following MLP hyperparameters: hidden-layer size = 100, solver = adam or RMSprop (RMSprop is used for the smaller datasets), shuffle = True, verbose = False, and activation function = relu.

4) Extreme Gradient Boosting (XGB)
XGB is a parallel and optimized version of gradient boosting algorithms that combines efficiency and resource management. It implements gradient-boosted decision trees in an iterative model, combining weak base models into a stronger learner. The residual is utilized to refine the loss function and improve the prior prediction at each iteration of the gradient boosting algorithm. We use a learning rate of 0.1 and 10,000 estimators for the XGBoost algorithm. However, XGB is vulnerable to outliers because each successive classifier is compelled to correct the mistakes made by its prior learners. This is because the estimators rely on historical predictions to determine their accuracy. For this reason, streamlining the process is complex.

IV. EXPERIMENTS AND RESULTS
About 195,000 human and synthetic speech samples were used to create the Fake-or-Real (FoR) dataset; Table 1 offers a summary of the dataset. Classifiers may be trained on the dataset to identify fake speech better. The dataset is an amalgamation of information from the following recent sources: first, text-to-speech programs such as Deep Voice 3 [29] and Google WaveNet TTS [24]; second, many different types of recorded human voice, including those from the Arctic dataset, the LJSpeech dataset, the VoxForge dataset, and user-submitted recordings [33], [34], [35]. The four dataset versions available for public consumption are for-original, for-norm, for-2sec, and for-rerec. The for-original folder stores the raw data from the speech sources. The for-norm version contains the same files but is well-balanced across demographic categories (gender and class) and technical parameters (sample rate, volume, and channels). The third one is like the second, only the files are cut off after 2 seconds, and it is called for-2sec. The last variant, dubbed for-rerec, is a re-recording of the for-2sec dataset meant to mimic a situation in which an attacker transmits speech over a vocal channel like a phone call or voice message. We provide the outcomes of our binary classification analysis of the suggested method; Table 2 shows the experimental findings for spotting deepfakes.

The experiments were also performed using noisy audio signals. For this purpose, we added synthetic noise to each audio signal of three datasets (for-2sec, for-norm, and for-rerec). This method kept both the original and noisy audio in the dataset and increased the number of audio samples. The original for-2sec dataset contains 17,870 audio samples; after adding noise, the new dataset is composed of 35,740 audio samples, and the same holds for the for-rerec and for-norm datasets.

A. FOR-REREC DATASET
The results for the for-rerec dataset are presented in Table 2. Multiple ML models are applied to obtain better results. The machine learning algorithms perform as follows: Support Vector Machine (SVM) 98.83% accuracy, Decision Tree 88.28%, Random Forest 96.60%, AdaBoost 87.67%, Gradient Boosting 93.51%, and XGB 93.40%. The SVM model exhibited the highest results on the for-rerec dataset.

The results for classifying the noisy for-rerec audio signals are presented in Table 3. The results depict that the MLP and SVM models obtained the highest accuracy scores of 98.66% and 98.43% compared to the other ML models. The other ML models, DT, LR, and XGB, obtained 82.12%, 88%, and 88.92% accuracy, respectively.
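The noise-augmentation step just described, which doubles for-2sec from 17,870 to 35,740 samples, can be sketched as follows. The paper says synthetic noise was added but not which kind; additive Gaussian noise at a fixed signal-to-noise ratio is our assumption for this illustration.

```python
import numpy as np


def add_noise(wav, snr_db=20.0, rng=None):
    """Return a noisy copy of wav at the requested SNR (Gaussian assumption)."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(wav ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=wav.shape)
    return wav + noise


def augment(waveforms):
    """Keep each clean waveform alongside its noisy copy, doubling the count,
    e.g. for-2sec: 17,870 originals -> 35,740 samples after augmentation."""
    return list(waveforms) + [add_noise(w) for w in waveforms]
```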
Tables 2 and 3 compare the results of the machine learning models trained with the feature-based approach. Our approach of selecting the best features and ML classifiers obtained promising results on three datasets (for-rerec, for-2sec, and for-norm). However, the for-norm dataset does not perform well with our approach using a simple SVM algorithm, as the data is of high dimensionality; without dimensionality reduction on a complex dataset, it performs poorly. This dataset contains audio of length greater than 12 seconds. Hence, a windowing technique can perform better in combination with MFCC. The proposed approach is compared with the baseline approach that used the for-original dataset for experimentation [36]. The existing approach used various ML models (SVM, RF, KNN, XGB) to detect deepfakes in the for-original dataset. The proposed approach obtains the highest testing score of 93%, which is 26% higher than the best score of the existing work using the SVM model. It is concluded that the proposed approach can efficiently detect deepfake audio. The dataset used in this study was used in only one previous study, and the proposed and existing approaches' experimental settings are similar (dataset, data split). In addition, a comparative analysis of the proposed method against state-of-the-art feature extraction techniques is presented in Table 4. The proposed approach combines features from multiple feature extraction techniques and extracts the most optimal features for classification. Two deep learning models are employed in this research: the proposed approach employs VGG16 and LSTM models with a feature ensemble of MFCC-40, roll-off point, centroid, contrast, and bandwidth features. The features extracted from each method are combined for model classification. The VGG16 model obtained the highest results compared to the existing study, with an accuracy of 93%. Furthermore, the LSTM model obtained an accuracy of 91%.

The existing approach proposed by Khochare et al. (2021) used MFCC features and various machine learning models for deepfake audio detection [36]. They utilized 20 MFCC features for each audio sample and employed multiple machine learning models (SVM, RF, KNN, and XGB). Using the 20 MFCC features with the SVM model, they obtained their highest accuracy rate of 67%. Another study, proposed by Reimao et al. (2019), used both machine learning and deep learning techniques along with various feature extraction methods [28]. The authors used Timbre Model Analysis (brightness, hardness, depth, roughness) features with multiple ML models (NB, SVM, DT, and RF). According to the ML model classification results, the SVM model using these feature extraction methods obtained a 73.46% accuracy rate. Furthermore, STFT, mel-spectrogram, MFCC, and CQT feature extraction methods were used with the VGG19 model and obtained 89.79% accuracy. Compared to the previous research, our VGG16 model achieved the highest results, with an accuracy of 93%, and the LSTM model achieved 91% accuracy. The VGG16 model's loss and its training and validation accuracy are shown in Figure 4. The comparison showing that the proposed approach with the features mentioned in Section III-B outperforms the previous state-of-the-art feature extraction techniques is presented in Table 4.

V. DISCUSSION
This research extended work on deepfake audio detection using the Fake-or-Real dataset, a state-of-the-art benchmark for audio deepfake detection and classification. We improved upon the performance of algorithms previously trained on feature-based approaches by using MFCC-based features, indicating considerable improvements in accuracy. Our features outperform the feature-based approach by 10 to 20 percent on average across these datasets. The for-norm dataset performs poorly with our approach using simple SVM algorithms. Windowing techniques, in combination with MFCC, can perform better.
TABLE 4: Comparison between results of the proposed approach and existing approaches

Approach                | Features                                                        | Model     | Accuracy (%)
------------------------|-----------------------------------------------------------------|-----------|-------------
Existing approach [36]  | MFCC-20                                                         | SVM       | 67
                        |                                                                 | RF        | 62
                        |                                                                 | KNN       | 62
                        |                                                                 | XGB       | 59
Existing approach [28]  | Timbre Model Analysis (Brightness, Hardness, Depth, Roughness)  | NB        | 67.27
                        |                                                                 | SVM       | 73.46
                        |                                                                 | DT (J48)  | 70.26
                        |                                                                 | RF        | 71.47
Existing approach [28]  | STFT, Mel-Spectrograms, MFCC, and CQT                           | VGG19     | 89.79
Proposed approach       | MFCC-40, Roll-off point, centroid, contrast, bandwidth          | LSTM      | 91
                        |                                                                 | VGG16     | 93
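To make the VGG16 transfer-learning setup of Table 4 concrete, here is a minimal Keras sketch. Treating the MFCC matrices as 3-channel images, freezing the ImageNet backbone, and the specific input shape and classification head are all assumptions of this illustration, not details given in the paper.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# MFCC "images": 40 coefficients x 128 frames, tiled to 3 channels so the
# ImageNet-pretrained backbone accepts them (shape is illustrative).
INPUT_SHAPE = (40, 128, 3)

base = VGG16(weights="imagenet", include_top=False, input_shape=INPUT_SHAPE)
base.trainable = False  # transfer learning: keep pretrained filters frozen

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # real vs. fake
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(train_mfcc_images, train_labels, validation_split=0.1, epochs=10)
```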
We conducted additional experiments on machine learning algorithms categorized into: (1) statistical models such as QDA, LDA, and Gaussian Naive Bayes, which use dimensionality reduction to reduce noise in the data; (2) tree-based models such as Decision Tree, Extra Tree, and Random Forest, which can handle multidimensional data, do not involve domain knowledge or parameter setting, and are appropriate for exploratory pattern detection; and (3) boosting models, namely AdaBoost, Gradient Boosting, and XGBoost, which fundamentally create several weak learners and combine their predictions to build a strong rule, helping to increase the accuracy of a model on feature-rich audio data. These three classes of ML algorithms were chosen for our approach to explore and improve performance on MFCC-based feature sets. Besides this, we proposed a VGG-16-based deep learning model for the bigger dataset, which is the superset of the other three datasets. It uses transfer learning and is trained on MFCC image features. We obtained an accuracy of 93% while using half of the original dataset; a larger amount of data correlates with higher model accuracy. We tried to obtain a limited-performance dataset. The entire dataset can be explored for even better results in the future.

VI. CONCLUSION
The detection of deepfake audio is significant as an essential tool for enhancing security against scamming and spoofing. Deepfake audios have garnered significant public attention as society rapidly recognizes their possible security danger. However, deepfake audio has mostly been studied in combination with the spatio-temporal data of video. This study improves upon the Fake-or-Real (FoR) dataset, which comprises state-of-the-art audio datasets and custom audios for deepfake audio classification and is further compiled into four sub-datasets. This study conducted experiments with multiple audio data features to detect deepfakes in audio data. This work extracts MFCC features from audio for feature engineering. Several machine learning algorithms are applied to the selected feature set to detect deepfake audio. This approach gave higher accuracy and results in all cases than other state-of-the-art studies for audio data. This study obtained 97.57% accuracy with SVM on the for-2sec dataset compared to other ML models, while 92.63% was obtained by the Gradient Boosting classifier on the for-norm dataset, and the highest accuracy of 98.83% was obtained using the SVM model on the for-rerec dataset. We plan to explore different window sizes for MFCC and various input sizes for models in the future. Future work can also evaluate these models against potential fluctuation and distortion in the audio signal. Moreover, studies on state-of-the-art few-shot learning and Bidirectional Encoder Representations from Transformers (BERT) based models can be conducted. Furthermore, we plan to evaluate our models under ambient noise and reverberation circumstances. We intend to use feature extraction methods like i-vectors, x-vectors, a combination of MFCC and GFCC, and a combination of DWT and MFCC, which were not taken into account in the current set of experiments because this is the beginning of our journey to identify deepfake audio.

REFERENCES
[1] A. Abbasi, A. R. R. Javed, A. Yasin, Z. Jalil, N. Kryvinska, and U. Tariq, "A large-scale benchmark dataset for anomaly detection and rare event classification for audio forensics," IEEE Access, vol. 10, pp. 38885–38894, 2022.
[2] A. R. Javed, W. Ahmed, M. Alazab, Z. Jalil, K. Kifayat, and T. R. Gadekallu, "A comprehensive survey on computer forensics: State-of-the-art, tools, techniques, challenges, and future directions," IEEE Access, 2022.
[3] A. R. Javed, Z. Jalil, W. Zehra, T. R. Gadekallu, D. Y. Suh, and M. J. Piran, "A comprehensive survey on digital video forensics: Taxonomy, challenges, and future directions," Engineering Applications of Artificial Intelligence, vol. 106, p. 104456, 2021.
[4] A. Ahmed, A. R. Javed, Z. Jalil, G. Srivastava, and T. R. Gadekallu, "Privacy of web browsers: a challenge in digital forensics," in International Conference on Genetic and Evolutionary Computing, pp. 493–504, Springer, 2021.
[5] A. R. Javed, F. Shahzad, S. ur Rehman, Y. B. Zikria, I. Razzak, Z. Jalil, and G. Xu, "Future smart cities requirements, emerging technologies, applications, challenges, and future aspects," Cities, vol. 129, p. 103794, 2022.
[6] A. Abbasi, A. R. Javed, F. Iqbal, Z. Jalil, T. R. Gadekallu, and N. Kryvinska, "Authorship identification using ensemble learning," Scientific Reports, vol. 12, no. 1, pp. 1–16, 2022.
[7] S. Anwar, M. O. Beg, K. Saleem, Z. Ahmed, A. R. Javed, and U. Tariq, "Social relationship analysis using state-of-the-art embeddings," Transactions on Asian and Low-Resource Language Information Processing, 2022.
[8] C. Stupp, "Fraudsters used AI to mimic CEO's voice in unusual cybercrime case," The Wall Street Journal, vol. 30, no. 08, 2019.
[9] T. T. Nguyen, Q. V. H. Nguyen, C. M. Nguyen, D. Nguyen, D. T. Nguyen, and S. Nahavandi, "Deep learning for deepfakes creation and detection: A survey," arXiv preprint arXiv:1909.11573, 2019.
[10] Z. Khanjani, G. Watson, and V. P. Janeja, "How deep are the fakes? Focusing on audio deepfake: A survey," arXiv preprint arXiv:2111.14203, 2021.
[11] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, "ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[12] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, "The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection," ISCA (the International Speech Communication Association), 2017.
[13] J. Yamagishi, M. Todisco, M. Sahidullah, H. Delgado, X. Wang, N. Evans, T. Kinnunen, K. A. Lee, V. Vestman, and A. Nautsch, "ASVspoof 2019: Automatic speaker verification spoofing and countermeasures challenge evaluation plan," tech. rep., 2019. [Online]. Available: http://www.asvspoof.org
[14] S. Ö. Arık, H. Jun, and G. Diamos, "Fast spectrogram inversion using multi-head convolutional neural networks," IEEE Signal Processing Letters, vol. 26, no. 1, pp. 94–98, 2018.
[15] Y. Chen, Y. Kang, Y. Chen, and Z. Wang, "Probabilistic forecasting with temporal convolutional neural network," Neurocomputing, vol. 399, pp. 491–501, 2020.
[16] Y. Kawaguchi, "Anomaly detection based on feature reconstruction from subsampled audio signals," in 2018 26th European Signal Processing Conference (EUSIPCO), pp. 2524–2528, IEEE, 2018.
[17] Y. Kawaguchi and T. Endo, "How can we detect anomalies from subsampled audio signals?," in 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6, IEEE, 2017.
[18] H. Landau, "Sampling, data transmission, and the Nyquist rate," Proceedings of the IEEE, vol. 55, no. 10, pp. 1701–1706, 1967.
[19] H. Yu, Z.-H. Tan, Z. Ma, R. Martin, and J. Guo, "Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 4633–4644, 2017.
[20] S. Pradhan, W. Sun, G. Baig, and L. Qiu, "Combating replay attacks against voice assistants," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 3, no. 3, pp. 1–26, 2019.
[21] J. Villalba and E. Lleida, "Preventing replay attacks on speaker verification systems," in 2011 Carnahan Conference on Security Technology, pp. 1–8, IEEE, 2011.
[22] F. Tom, M. Jain, and P. Dey, "End-to-end audio replay attack detection using deep convolutional networks with attention," in Interspeech, pp. 681–685, 2018.
[23] K. Kuligowska, P. Kisielewicz, and A. Włodarz, "Speech synthesis systems: disadvantages and limitations," Int J Res Eng Technol (UAE), vol. 7, pp. 234–239, 2018.
[24] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[25] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783, IEEE, 2018.
[26] J. Frank and L. Schönherr, "WaveFake: A data set to facilitate audio deepfake detection," arXiv preprint arXiv:2111.02813, 2021.
[27] M. Hassaballah, M. A. Hameed, and M. H. Alkinani, "Introduction to digital image steganography," in Digital Media Steganography, pp. 1–15, Elsevier, 2020.
[28] R. Reimao and V. Tzerpos, "FoR: A dataset for synthetic speech detection," in 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), pp. 1–10, IEEE, 2019.
[29] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," arXiv preprint arXiv:1710.07654, 2017.
[30] F. M. Rammo and M. N. Al-Hamdani, "Detecting the speaker language using CNN deep learning algorithm," Iraqi Journal For Computer Science and Mathematics, vol. 3, no. 1, pp. 43–52, 2022.
[31] Z. A. Abbood, B. T. Yasen, M. R. Ahmed, A. D. Duru, et al., "Speaker identification model based on deep neural networks," Iraqi Journal For Computer Science and Mathematics, vol. 3, no. 1, pp. 108–114, 2022.
[32] A. Winursito, R. Hidayat, and A. Bejo, "Improvement of MFCC feature extraction accuracy using PCA in Indonesian speech recognition," in 2018 International Conference on Information and Communications Technology (ICOIACT), pp. 379–383, IEEE, 2018.
[33] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004.
[34] K. Ito and L. Johnson, "The LJ Speech dataset." https://keithito.com/LJ-Speech-Dataset/, 2017.
[35] K. MacLean, "VoxForge," 2018. [Online]. Available: http://www.voxforge.org/home (accessed 2012).
[36] J. Khochare, C. Joshi, B. Yenarkar, S. Suratkar, and F. Kazi, "A deep learning framework for audio deepfake detection," Arabian Journal for Science and Engineering, pp. 1–12, 2021.

AMEER HAMZA is with the Department of Creative Technology, Air University, Islamabad. He is doing his Master's degree in Artificial Intelligence from Air University, Islamabad, Pakistan.

ABDUL REHMAN JAVED is a lecturer at the Department of Cyber Security, Air University, Islamabad, Pakistan. He has worked with the National Cybercrimes and Forensics Laboratory at Air University, Islamabad, Pakistan. He received his Master's degree in Computer Science from the National University of Computer and Emerging Sciences, Islamabad, Pakistan. He is a member of both IEEE and ACM. He is a cybersecurity researcher and practitioner with industry and academic experience. He has reviewed over 150 scientific research articles for various well-known journals. He is a TPC member of CID2021 (Fourth International Workshop on Cybercrime Investigation and Digital Forensics) and the 44th International Conference on Telecommunications and Signal Processing. He has served as a moderator in the 1st IEEE International Conference on Cyber Warfare and Security (ICCWS). He has authored over 50 peer-reviewed research articles and is supervising/co-supervising several graduate (BS and MS) students on health informatics, cybersecurity, mobile computing, and digital forensics topics. His current research interests include but are not limited to mobile and ubiquitous computing, data analysis, knowledge discovery, data mining, natural language processing, smart homes, and their applications in human activity analysis, human motion analysis, and e-health. He aims to contribute to interdisciplinary research in computer science and human-related disciplines.
FARKHUND IQBAL holds the position of Associate Professor in the College of Technological Innovation, Zayed University, United Arab Emirates. He is an Affiliate Professor at the School of Information Studies, McGill University, Canada, and an Adjunct Professor at the Faculty of Business and IT, Ontario Tech University, Canada. He leads the Cybersecurity and Digital Forensics (CAD) research group at the Center for Smart Cities and Intelligent Systems, Zayed University. He holds a Master's (2005) and a Ph.D. degree (2011) from Concordia University, Canada. He uses Artificial Intelligence, Machine Learning, and Data Analytics techniques for problem-solving in cybersecurity, health care, and cybercrime investigation in the smart city domain. He has published more than 120 papers in high-ranked journals and conferences. He has served as a chair and co-chair for several IEEE/ACM conferences and has been a guest editor and reviewer for multiple high-rank journals.

ZUNERA JALIL is an Assistant Professor at the Department of Cyber Security, Faculty of Computing & Artificial Intelligence, Air University, Islamabad, and a senior researcher at the National Cybercrimes and Forensics Lab, National Center for Cyber Security, Islamabad, Pakistan. She earned her Ph.D. in Computer Science with a specialization in Information Security from FAST National University of Computer and Emerging Sciences, Islamabad, Pakistan, in 2010. She received her Master's degree in Computer Science in 2007 with a scholarship from the Higher Education Commission of Pakistan. She has served as a full-time faculty member at International Islamic University, Islamabad; Iqra University, Islamabad; and Saudi Electronic University, Riyadh, Saudi Arabia. Her research interests include but are not limited to computer forensics, machine learning, criminal profiling, software watermarking, intelligent systems, and data privacy protection.