
Received 6 October 2023, accepted 13 November 2023, date of publication 16 November 2023, date of current version 29 November 2023.


Digital Object Identifier 10.1109/ACCESS.2023.3333866

Audio Deepfake Approaches


OUSAMA A. SHAABAN1, REMZI YILDIRIM2, AND ABUBAKER A. ALGUTTAR1
1 Graduate School of Natural and Applied Sciences, Ankara Yıldırım Beyazıt University, 06760 Ankara, Turkey
2 Department of Computer Engineering, Ankara Yıldırım Beyazıt University, 06760 Ankara, Turkey

Corresponding author: Ousama A. Shaaban ([email protected])

ABSTRACT This paper presents a review of techniques involved in the creation and detection of audio deepfakes. The first section provides information about deepfakes in general. In the second section, the main methods for audio deepfakes are outlined and subsequently compared. The results discuss various methods for detecting audio deepfakes, including analyzing statistical properties, examining media consistency, and utilizing machine learning and deep learning algorithms. The major methods used to detect fake audio in these studies included Support Vector Machines (SVMs), Decision Trees (DTs), Convolutional Neural Networks (CNNs), Siamese CNNs, Deep Neural Networks (DNNs), and a combination of CNNs and Recurrent Neural Networks (RNNs). The accuracy of these methods varied, with the highest being 99% for SVM and the lowest being 73.33% for DT. The Equal Error Rate (EER) was reported in a few of the studies, with the lowest being 2% for Deep-Sonar and the highest being 12.24% for DNN-HLLs. The t-DCF was also reported in some of the studies, with the Siamese CNN performing the best, showing a 55% improvement in min t-DCF and EER compared to other methods.

INDEX TERMS Deepfakes, artificial intelligence, deep learning, audio deepfakes, forensics, datasets,
survey.

I. INTRODUCTION
Deepfake refers to synthetic information or materials that have been developed or altered using artificial intelligence (AI) technologies and are intended to be taken as authentic. These may include audio, video, picture, and text synthesis [1].

In a narrower definition, deepfakes (a term combining Deep Learning (DL) and ‘‘fake’’) rely on artificial neural network (ANN) innovations to manipulate media files. AI-based software such as FaceApp and FakeApp has been used to superimpose a victim's face onto a video of another person, creating a video in which the intended target appears to say or do something that the original subject never did. Because of this trade, anybody may buy or sell a newly generated appearance, a different chronological age, or even a new hairstyle. Many concerns have been raised regarding the dissemination of these hoaxes [2]. Although deepfake technology may be used for beneficial objectives such as virtual reality and cinematography, its usage in criminal activities persists at high rates [3], [4], [5]. Over the last several years, thousands of fake videos have gone viral online, largely aimed at public figures and famous people. In 2017, a Reddit user under the moniker ‘‘deepfakes’’ produced the first piece of deepfake material, a viral pornographic video. Since the invention of deepfake technology, dishonest applications have become commonplace. Soon after, more and more deepfake-based apps such as FakeApp and FaceSwap appeared. The intelligent stripping software DeepNude was released in June 2019 and immediately caused a frenzy. In addition to being a privacy risk, videos made with these apps are increasingly used to sway public opinion in elections. The identification of false information is now at the forefront of concern for people, companies, and governments, with an increasing amount of research devoted to deepfakes.

Deepfake technology is not limited to its use in pornography, but is also utilized for a range of nefarious and unethical purposes. This includes the dissemination of false information, the instigation of political turmoil, and various forms of cybercrime.

More specifically, AI-synthesized systems that can produce convincing audio have recently been developed for audio faking [6].

However, despite the fact that these tools were designed to benefit people, they have also been utilized to disseminate false information via audio [7], resulting in concern about ‘‘audio deepfakes.’’ Also referred to as audio manipulations, audio deepfakes are becoming more accessible through mobile devices and desktop computers [8]. This has resulted in widespread public concern regarding the adverse consequences of deepfakes for cybersecurity. Despite the advantages of the underlying technology, audio deepfakes are a more complex threat than simple fake text, emails, or email links. They can be used as a logical-access audio-spoofing method [9], which opens the door to propaganda, slander, and even terrorism as a means of influencing public opinion. Detecting fakeness in the vast quantities of audio recordings shared online every day is difficult [10]. Politicians and governments are not immune to deepfake attacks [11]: in 2019, scammers exploited AI software to mimic a CEO over the phone and stole more than $243,000 [12]. Consequently, the legitimacy of all publicly available audio recordings should be verified to prevent the propagation of false information, and the topic has recently received attention in the scientific community. It is becoming increasingly difficult to identify audio forgeries because of the emergence of three distinct types of deepfakes: those based on synthetic data, imitation audio, and replay data.

In addition, various detection methods are available for determining whether audio recordings contain real or fake speech. Several DL and Machine Learning (ML) models have been developed to detect fake audio using various approaches, but many gaps remain in current algorithms [13]. Therefore, additional research is essential to enhance the detection capability for audio deepfakes and to address the deficiencies identified in the existing literature. It has become more difficult to identify audio deepfakes owing to the emergence of new forms such as those based on synthesis, imitation, and replay, as discussed below.

With the advent of cutting-edge tools and DL approaches, audio deepfake detection has become an important field of study. Current DL approaches have not yet compensated for these limitations, so further research is needed to determine which aspects of audio deepfake detection require improvement. In addition, detection approaches for imitated and synthetically produced audio have not been examined thoroughly in the literature; we believe this is a significant distinction of the present study.

When evaluated on a publicly available dataset, the effectiveness of deepfake detection has remained at around 82.56% in recent years [14], despite the fact that deepfake creation has improved significantly over the same period. Although this performance is substantial from a scholarly perspective, it is insufficient for real-world application. Recently, two major obstacles have emerged that make it crucial to consider the interpretability of deepfake detection: lower detection accuracy and an increased target range. However, the current work on comprehensible deepfake detection is confined to visual deepfake detection [15] and is therefore not very broad.

Many deepfake detection strategies have emerged as a result of the increased focus on deepfake detection by academics and specialists in recent years as a means of combating these dangers. In addition, research into the existing literature on detection strategies and performance evaluation is underway. However, the scientific community and practitioners may benefit from a more in-depth study that summarizes information on deepfakes from all perspectives, including accessible datasets (something that has been significantly lacking in prior surveys).

This review provides a detailed analysis of audio deepfake detection techniques, along with generative approaches. The key contributions include:
a) Providing researchers with an overview of the different methods for generating and detecting audio deepfakes.
b) Updating the reader on what is new and noteworthy in the world of audio deepfakes, including techniques, tools, regulations, and problems.
c) Helping the reader realize the probable effects of audio deepfakes.
d) Providing guidance for the research community to comprehend future audio deepfake developments.

This article is structured as follows: In Section I, we present an introduction to general deepfakes, setting the foundation for the subsequent sections. Section II delves into audio deepfakes, outlining and comparing the main methods employed in their creation and detection. Section III explores various datasets that play a crucial role in the development and evaluation of audio deepfake detection techniques. Finally, Section IV offers a conclusion and discussion, summarizing the limitations identified and outlining potential directions for future research.

Deepfakes can be classified into four main categories: text, image, video, and audio. While most scientists are preoccupied with investigating deepfakes in videos, the other types must also receive attention, owing to the comprehensive advances in creating them, as shown in FIGURE 1.

FIGURE 1. Deepfake classification [1].

Due to the rise of social media and digitalization, fake news has become a prevalent issue, challenging conventional definitions of news [2]. False information, presented as fact, is widely disseminated on online platforms [16]. Zhou and Zafarani [17] define false news as deliberately published incorrect material that can be debunked by fact-checking [5].


Previous research shows that people of all ages and backgrounds struggle to identify false news [18]. During times of uncertainty, like the COVID-19 pandemic, false rumors spread rapidly on social media, impacting public perception [19]. This phenomenon affects various aspects of life, including election campaigns, healthcare, and the economy [20], [21]. Detecting fake text is a complex task [22]. GROVER, a text generation method using GPT-2, can create highly convincing fake news [25]. Some studies employ transformer-based algorithms to identify fraudulent text on social media [24], [25]. One study investigates the detection of brief deepfake text samples from Twitter using dynamic model adjustments and a specialized BERT model [32].

Image deepfakes encompass three primary types. First, there is face swap, widely popularized by Snapchat, which allows users to modify facial features in photographs for playful transformations [1]. Second, synthesis techniques powered by generative adversarial networks (GANs) have revolutionized image creation, with models from NVIDIA generating countless variations of images [26]. Lastly, editing, including AI-driven methods, enables significant image alterations [26]. Detecting fake images has been a focus of research, employing algorithms such as k-NN, LDA, and SVM [27]. Additionally, Face-Aware Liquify in Adobe Photoshop and human artist modifications have been used [28]. Detection methods have evolved, employing supervised and unsupervised scenarios, and datasets such as StyleGAN-generated faces and iFakeFaceDB have been employed [29]. Innovations such as ‘‘facial X-ray’’ [30] and attention-based CNN models [31] enhance detection capabilities, achieving impressive accuracy rates.

FIGURE 2. Video deepfake classification.

Video deepfakes encompass five main categories based on the degree of manipulation, as shown in FIGURE 2. These categories include face swaps, face reenactment, lip-syncing, full-face synthesis, and facial attribute manipulation. Face reenactment, for instance, manipulates facial expressions by emulating the movements of a reference actor, and is often used for post-production modifications in films and video games [32]. Various techniques, such as 3D facial modeling and real-time RGB-D sensor-based methods, have been employed to enhance the realism of these manipulations [33].

Lip-syncing synchronizes mouth movements with audio stimuli, which is essential for effective communication and for accessibility for individuals with hearing impairments [11]. Techniques such as Recurrent Neural Networks (RNNs) have been utilized to achieve this synchronization, while GAN-based models have improved accuracy [34].

Facial attribute manipulation alters facial features such as identity or expression in images or videos, with methods like StarGAN-v2 and AttGAN producing convincing results [35], [36], [37].

Detecting video deepfakes has involved diverse approaches, from analyzing frame boundaries to utilizing CNNs and attention mechanisms, each with its strengths and limitations [38], [39], [40], [76], [77], [79], [80]. Researchers have also employed capsule networks [45] and EfficientNetB4 [46] for identification purposes. Bondi et al. examined the performance of EfficientNetB4 using multiple datasets and found that triplet loss delivered exceptional results [47].

II. AUDIO DEEPFAKES
Deepfake technology has been implemented in the realm of audio, specifically in the context of voice assistants and other computer-generated voices that are becoming more ubiquitous in our daily routines [48]. The utilization of artificially generated or modified audio information poses a significant threat to society, as it has the potential to generate issues of trust when individuals are incapable of distinguishing between authentic and counterfeit material [49].

There are three different varieties of speech: voiced, unvoiced, and silent. Voiced speech contains a limited amount of energy and a periodic sequence of impulses, whereas unvoiced speech consists of random, non-periodic noise-like patterns. Silence, on the other hand, refers to the duration during which there are no significant signals.

Numerous linguistic factors, including formants, are used to analyze and categorize utterances. Formants are frequencies where energy is densely packed, resulting in a spectral peak. In typical human speech, formants range from three to five and are categorized according to their increasing frequency. The first three formants are crucial to both voiced and unvoiced speech and are frequently employed as surrogates.

The term ‘‘anti-forensics’’ has recently been incorporated into the language of digital forensics. Although there is no consensus on the precise meaning of the term, Rogers has proposed a definition of anti-forensics as encompassing activities that aim to undermine the presence, amount, or authenticity of evidence at a crime scene, or to obstruct the examination and interpretation of such evidence during an investigation [50].

Another definition of anti-forensics is ‘‘an attempt to prevent the recognition, collection, collation, and validation of digital data’’; this definition establishes four categories of anti-forensics: data concealment, data deletion, data generation prevention, and new approaches. For example, transformation methods may be performed by malicious or rootkit-shared libraries that abuse system calls or change data during the construction process by using runtime links [50]. Practically speaking, it is the ‘‘application of the scientific approach to digital media in order to invalidate factual material for court examination’’ [51].


‘‘Anti-forensics’’ is, in general, a ‘‘collection of procedures and actions performed by a person with the intent to impede the digital investigative process’’ [51].

Over the past ten years, digital multimedia forensics has garnered significant interest. Most studies have concentrated on the detection of image forgery [52], with document forgery detection accounting for a large proportion [53]. However, the detection of digital audio forgeries has received little attention. Before using any data as evidence in multimedia forensics, it is critical to verify both the originality and the integrity of the material in possession. The basic objective of audio forensics is to authenticate the audio by determining whether it is fake, and to identify the person or persons who were really speaking.

There are a variety of possible purposes, such as presenting the audio as proof in a legal proceeding or putting an end to rumors that have spread through social media or the paparazzi. Digital impersonation refers to the production of speech in such a manner as to mislead humans or computers into believing that the speech originates from a reliable and genuine source, thereby causing loss to society or the economy.

Synthesizing speech and altering the speaker's tone are both possible using audio-specific deep-learning techniques [54], [55]. Audio waveforms, spectrograms (which integrate information from the frequency and time domains), and other acoustic features are commonly analyzed in audio forensics to identify artificially produced or modified audio clips. Waveform-based techniques analyze the amplitude of the time-varying audio stream.

In [56], the authors suggested a time-domain artificial audio detection network with numerous blocks, similar to those seen in ResNet and Inception networks. The study in [57] introduced a technique for detecting fabricated speech that employs a convolutional recurrent neural network (CRNN). Rather than relying on image-based methodologies, this technique converts audio signals directly into spectrograms and applies computer vision techniques for analysis. The spectrogram illustrates the frequency content and intensity of the audio source over time.

A mel spectrogram, which represents frequencies on the mel scale, is a variation of the spectrogram [58]. For the detection of synthetic speech, Bartusiak et al. utilized a CNN and a convolutional transformer [59], [60] in combination with normalized grayscale spectrograms of the audio stream. The authors of [61] used mel spectrograms to train a spatial transformer network and a temporal CNN. Audio features are coefficients and other values derived from such transformations [62]; two examples are constant-Q cepstral coefficients and mel-frequency cepstral coefficients [63], [64].

A copy-move attack can be detected by dividing an audio stream into segments and comparing the audio characteristics of each segment, such as delta-MFCC [63], mel-frequency cepstral coefficients (MFCC) [63], and pitch [65], using the Pearson correlation coefficient [65]; higher degrees of resemblance indicate a copy-and-paste attack. Hassan and Javed [66] utilized an RNN to evaluate MFCC, Gammatone Cepstral Coefficients (GTCC), spectral flux, and spectral centroid as potential markers of artificial noise in their study. Das et al. and Li et al. [67], [68] have suggested the utilization of the Inverted Constant-Q Coefficient (ICQC), Inverted Constant-Q Cepstral Coefficient (ICQCC), and Long-term Variable Q Transform (L-VQT) to identify synthetic speech. Additionally, the authors of [106] explored the use of a Res2Net network trained on log-power magnitude spectrograms, Linear Frequency Cepstral Coefficients (LFCC), and the Constant-Q Transform (CQT) for identifying synthetic audio.
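The features named above are straightforward to compute with standard audio tooling. The short sketch below is a minimal illustration (not taken from any of the surveyed papers), assuming the librosa library and a local file speech.wav (hypothetical path); it extracts MFCCs, delta-MFCCs, a constant-Q transform, and the spectral centroid of a clip.

# Minimal feature-extraction sketch (assumes: pip install librosa numpy, and a local speech.wav).
import numpy as np
import librosa

# Load the clip as a mono waveform at 16 kHz, a common rate in speech forensics.
y, sr = librosa.load("speech.wav", sr=16000, mono=True)

# Mel-frequency cepstral coefficients and their first-order deltas,
# the segment features used in the copy-move comparisons discussed above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)      # shape: (20, n_frames)
delta_mfcc = librosa.feature.delta(mfcc)                # frame-to-frame dynamics

# Constant-Q transform magnitude in dB (a log-power representation).
cqt_db = librosa.amplitude_to_db(np.abs(librosa.cqt(y=y, sr=sr)))

# Spectral centroid, one of the spectral statistics cited as a marker of artificial noise.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

# A fixed-length summary vector (per-coefficient means) that a classical classifier could consume.
feature_vector = np.concatenate([mfcc.mean(axis=1),
                                 delta_mfcc.mean(axis=1),
                                 centroid.mean(axis=1)])
print(feature_vector.shape, cqt_db.shape)

Segment-wise versions of such vectors can then be compared with, for example, a Pearson correlation to flag copy-move edits, or passed to the classifiers discussed later in Section II-B.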


A. AUDIO DEEPFAKE GENERATION METHODS
One type of deepfake is AI-generated audio manipulation, which can clone a human voice and portray it as having said something controversial that it never really uttered [70], [71]. Synthetic audio, which recent breakthroughs in AI generation methods have made nearly indistinguishable from genuine speech, is suitable for several applications, such as automatic audio labelling for television and film, AI assistance that combines AI with human editing, and individualized synthetic voices for persons with vocal issues. At the same time, fake or synthetic audio has become a growing challenge to vocal biometrics [72], and it can be used for malicious purposes such as spreading propaganda, spreading false news, or even committing fraud.

Higher-quality audio can now be synthesized and cloned. Synthesis models powered by neural networks, such as Google's WaveNet [73] and Tacotron [74], or Adobe VoCo [75], may generate convincing counterfeit audio that sounds like the target's voice in response to text or utterances; audio-editing software [76] may then be used to produce even more convincing audio by combining natural and synthetic sources.

In addition to images, recent developments in AI-generated synthetic audio have enabled the production of incredibly convincing false films [11]. Such advances in audio synthesis have already demonstrated their capacity to produce convincing and natural-sounding acoustic deepfakes, thereby presenting significant threats to society [77]. The appeal and destructive impact of a deepfake video can be increased by combining fake audio with visual manipulation [11]. These synthetic utterances, however, lack features of audio quality linked to the identity of the target, including expressiveness, roughness, breathing, tension, and emotion [78].

AI researchers are attempting to solve these issues to enable machines to mimic human speech in terms of how it sounds and how easily it can be understood. Text-to-Speech (TTS) synthesis and Voice Conversion (VC) are the two methods used to produce audio deepfakes. A TTS synthesizer is a piece of software that can mimic the speech of any speaker [79], while VC is a methodology utilized to transform an audio waveform originating from a source audio into one that emulates the vocal characteristics of a selected speaker [80].

The VC system uses an audio clip collected from the user as its source and generates a radically false audio file for the target subject. It maintains the grammatical and phonetic aspects of the original sentence while emphasizing its quality and likeness to the target speaker. Both VC and TTS synthesis pose significant dangers, because they create audio that is nearly indistinguishable from human speech. In addition, duplicated replays can initiate attacks: vocal biometric devices are of concern because improved audio synthesis algorithms can create audio similar to that received from loudspeakers [81].

This section summarizes recent advances in speech synthesis, including TTS, voice conversion, and detection approaches. Table 1 shows a list of tools, applications, and open-source projects which synthesize audiovisual deepfakes.

TABLE 1. List of tools, applications, and open-source projects which synthesize audiovisual deepfakes.

1) TTS (TEXT-TO-SPEECH) AUDIO SYNTHESIS
TTS, which has been around for over a decade, is a system that uses text input to generate artificial audio and thereby allows voice-based interaction between humans and computers. The first investigations of TTS synthesis were conducted using audio concatenation and parameter estimation. The concatenative TTS technique entails the fragmentation of high-quality audio recordings into smaller units, which are subsequently reassembled to generate a novel speech pattern. Nonetheless, owing to the absence of advancement and clarity in this methodology over the years, its appeal has diminished. Parametric models differ in that they employ a mapping technique to convert text into basic speech features, which are subsequently transformed into an audio stream using vocoders. Subsequently, DL became an important audio synthesis approach, resulting in a much higher degree of audio quality. DL-based approaches include neural audio encoders [82], [83], autoencoders [84], autoregressive models [74], [85], [86], and GANs [87], [88], as well as other emerging technologies. References [89] and [90] have contributed to the fast expansion of the speech synthesis industry. Figure 3 shows the rationale underlying recent Text-to-Speech (TTS) techniques.

FIGURE 3. Workflow diagram of the most recent TTS systems.

WaveNet [85] is a primary advancement in audio and speech synthesis; WaveNet, Tacotron [74], and DeepVoice3 [91] can produce realistic synthetic audio from text inputs to enhance the interaction between people and machines. WaveNet, developed in [85], builds on the conditional image generation model PixelCNN [92], [121]. WaveNet models employ acoustic information such as spectrograms to generate raw audio waves within a generative framework trained on real audio data. Parallel WaveNet technology was developed to enhance sampling efficiency and provide high-quality audio streams [93]. An additional DL model that depends on a WaveNet variation, Deep Voice 1 [94], is also available; it swaps a neural-network template into each module of a conventional pipeline (audio source, speech synthesizer, or text-processing interface). It is not a true end-to-end synthesis technology, however, because each module is trained individually. Google introduced Tacotron in 2017: this all-inclusive audio synthesis model can synthesize audio from text and audio pairings, making it sufficiently versatile for use with many different types of datasets. Tacotron-style models, like WaveNet, are generative systems composed of a sequence-to-sequence model, attention-based decoding, and a post-processing net. However, the superior performance of the Tacotron model may be accompanied by certain drawbacks: the generation process is repeated several times, and model creation requires high-performance systems because of the inefficiency of combining these components.


The system in [95] synthesizes speech using a combination of Tacotron and WaveNet: Tacotron first transforms the source text into a linear-scale spectrogram, which WaveNet then translates into speech. Tacotron 2, created in [96], is a speech synthesis system that utilizes a neural network architecture consisting of an encoder network, a decoder network, and an attention mechanism to generate speech waveforms of superior quality. The encoder network receives textual input and produces a series of embeddings that encapsulate the semantic content of the input. After the encoder produces these embeddings, the decoder network generates a sequence of mel-spectrogram frames. The attention mechanism allows the decoder to concentrate on pertinent segments of the encoder output during the production of mel-spectrogram frames.

The Tacotron 2 model has been extensively employed in various domains, including but not limited to audiobook narration, virtual assistants, and chatbots. The system has exhibited exceptional proficiency, producing speech that closely resembles human quality and intonation. Furthermore, Tacotron 2 exhibits a high degree of adaptability and can be trained on diverse languages and vocal modalities.

Notwithstanding its accomplishments, Tacotron 2 encounters certain obstacles, including generating speech that sounds authentic in the presence of ambient noise and managing texts of extended length. Nevertheless, current research endeavors to tackle these concerns and enhance the efficacy of vocal synthesis systems such as Tacotron 2. The researchers behind Deep Voice 3 created an entirely convolutional character-to-spectrogram model to overcome the temporal complexity associated with recurrent-unit-based audio synthesis models. Reference [91] shows that the Deep Voice 3 model outperforms its rivals in terms of speed because all calculations are performed in parallel. There are three main components to Deep Voice 3: (1) an encoder that transforms the input text into a learned internal representation; (2) a decoder that interprets the learned representations autoregressively; and (3) a wholly convolutional post-processing network that predicts the parameters of the vocoder. VoiceLoop is an alternative audio synthesis model: using a memory frame, speech can be generated from voices that were not heard during training. VoiceLoop constructs phonological storage by using an offset buffer as a matrix; phonemes in a string of text are converted into small vectors for representation, and the generated phonemes are evaluated and their codes added to form a new contextual vector.

There has been a significant amount of research on, and development of, end-to-end audio synthesis models. In [91], researchers discussed various methods that can be used to construct such models. Additionally, commercial products such as Amazon AWS Polly, Baidu TTS, and Google Cloud TTS have been introduced [131] with the goal of achieving a high degree of similarity between synthesized and natural audio. These products aim to achieve this similarity using a variety of techniques. Such systems are becoming more and more popular, and they are currently utilized for a wide variety of applications including chatbots, virtual assistants, and audiobooks.

Modern Text-to-Speech (TTS) systems are capable of effectively converting written text into speech that sounds natural and possesses particular characteristics. Researchers have been able to develop speech models that can replicate the voice of a particular speaker with remarkable accuracy, even with only a small number of reference samples to use as a guide. This has been made possible by the development of neural network models, and it has ushered in a new era of real-time voice-cloning technology, in which it is now possible to synthesize a person's voice in real time using only a few seconds of their speech as input.

This has opened the door to a wide range of real-world applications, ranging from individualized audio assistants to assistive technology for people who have difficulty communicating through speech [89], [97]. Generic audio synthesis systems do not seek to imitate a person's distinctive speech features, whereas speech cloning systems do [98]. iSpeech, VoiceApp, and Overdub are only a few examples of AI-powered audio-cloning platforms that make this technology publicly available by generating synthetic false audio that mimics targeted speech.

The authors of [89] developed a TTS system that relies on Tacotron 2 and can synthesize the voices of several speakers, even those not seen during training. Three neural networks, each trained separately, constitute the framework. The synthetic speech correctly imitates a target speaker's voice but not their prosody.

In [107], the authors recommended two Deep Voice 3 modules: speaker encoding and speaker adaptation. Speaker adaptability was prioritized in the framework to produce audio for several channels. To encode speakers, a second model was trained to use the multi-speaker generative framework to determine new speaker embeddings.

In [133], researchers unveiled a speech-cloning algorithm that, given a text input or an audio waveform of a speaker as input, can synthesize audio that sounds similar to the voice the system is supposed to mimic. The architecture incorporates a neural vocoder, in addition to text and audio encoders and decoders. The speech-generation model is guided by a representation disentangled from the speaker, and the approach is jointly trained using latent linguistic characteristics. Cloning a speaker's voice takes approximately five minutes, but the final product is of exceptional quality and is similar to the original speaker.

The authors of [99] have suggested a meta-learning approach to enhance the efficacy of voice cloning. This approach involves the integration of a WaveNet model that can operate with restricted data.


The initial stage of this methodology entails the computation of speaker adaptation through the refinement of audio embeddings. Subsequently, the embedding vectors of novel speakers can be predicted using a parametric method that is not dependent on textual data. This approach may prove advantageous in situations where data are scarce and prompt adjustment to novel speakers is imperative.

The findings of this investigation exhibit the efficacy of the suggested methodology in producing superior synthetic voices for diverse speakers. An additional encoding network must be constructed to achieve this. The technique works well when trained on high-quality, clean data; the quality of the synthesized speech is reduced when background noise is present during the encoding process. In [125], the researchers presented a multi-speaker sequence-to-sequence model. The model utilizes domain-specific training data to reconstruct the speech of a target speaker from a restricted number of noisy input samples. The methodology entails training the model on a dataset that encompasses audio samples from various speakers; subsequently, the model is fine-tuned on a smaller dataset comprising samples from the specific target speaker. The model can produce synthetic speech of superior quality that bears a striking resemblance to the natural voice of the target speaker, despite the restricted training data. This methodology can facilitate a diverse array of applications, such as audio replication and audio transformation, while necessitating minimal data prerequisites. The results showed that the artificial speech became more lifelike. Nevertheless, creating convincingly equivalent synthetic speech from a limited amount of poor-quality audio data remains a challenge. Table 2 summarizes these sophisticated audio synthesis approaches.

TABLE 2. Overview of the latest audio synthesis methods.
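Most of the systems in Table 2 share the two-stage structure described above: an acoustic model predicts a mel-spectrogram from text, and a (neural) vocoder turns those frames into a waveform. The sketch below is a toy stand-in for the second stage only, assuming librosa and soundfile and a local recording reference.wav (hypothetical path); it computes the mel-spectrogram of an existing utterance and inverts it with Griffin-Lim instead of a neural vocoder, which makes the role of the intermediate representation easy to hear.

# Toy illustration of the "acoustic features -> vocoder" stage (assumes: pip install librosa soundfile).
import librosa
import soundfile as sf

y, sr = librosa.load("reference.wav", sr=22050, mono=True)

# Stage 1 stand-in: the mel-spectrogram a Tacotron-style acoustic model would predict from text.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Stage 2 stand-in: invert the mel-spectrogram back to a waveform with Griffin-Lim.
# A neural vocoder (WaveNet, Parallel WaveGAN, ...) would replace this call in a real system.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)

sf.write("resynthesized.wav", y_hat, sr)

The audible artefacts of the Griffin-Lim reconstruction are precisely what neural vocoders were introduced to remove.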


2) VOICE CONVERSION (VC)
Voice synthesis using VC makes the source voice appear more like the desired voice while keeping the grammar of the original phrase intact. VC is used for various purposes in the entertainment sector, including expressive audio synthesis, individualized speech assistance for individuals with hearing impairments, and audio dubbing [80]. Recent advances in anti-spoofing technology for automatic speaker recognition [72] have included VC systems for generating fake data [72], [100]. Audio control relies on more advanced aspects of speech such as timbre and prosody. Prosody is concerned with suprasegmental features, such as pitch, amplitude, stress, and duration, whereas vocal timbre concerns the spectral characteristics produced by the vocal apparatus through phonation. Several Voice Conversion Challenges (VCCs) have been organized to promote research on voice conversion methods and improve the accuracy of existing methods [72], [100].

Scholars in the domain of speech conversion have been investigating techniques to enhance the quality of conversion by utilizing both parallel and non-parallel data. The Voice Conversion Challenge (VCC) aims to convert input audio to output speech through the integration of parallel and non-parallel training data, as described in [72] and [101]. The objective of the VCC was to mitigate the constraints associated with conventional parallel training data by investigating the potential of non-parallel data, which is frequently more prevalent in practical settings. In [136], significant endeavors were undertaken to devise techniques for cross-lingual voice conversion (VC), which pertains to the process of transforming recorded speech from one language to another.

That research was centered on non-parallel training data and encompassed a diverse array of languages in order to tackle the difficulty of cross-lingual voice conversion. The findings indicate encouraging enhancements in the quality of speech conversion, highlighting the viability of integrating non-parallel data with conventional parallel training techniques. Previous research has shown that VC methods depend on spectrum mapping with paired training data and require audio samples from both the target and source voices that share common linguistic content. Gaussian Mixture Model (GMM)-based techniques [26], [102], regression using partial least squares [103], exemplar-based techniques [104], and parallel spectral modeling [105], [106] have been suggested. Some of these [102], [104] are ‘‘shallow’’ VC approaches that directly modify the spectral features of the source audio in its native feature space [105]. To capture temporal correlation in an audio stream, researchers previously proposed an RNN-based speaker-dependent sequential approach.

In [106] and [143], the deep bidirectional LSTM (DBLSTM) methodology made it feasible to extract long-range contextual data while producing high-quality converted voices using DNN-based techniques. In [105] and [106], feature representations were efficiently learned for simultaneous VC feature mapping. Parallel training requires a large number of source and target spoken phrase samples, which is impractical for real-world application. Researchers have therefore proposed VC algorithms for non-parallel (unpaired) training data as a means to achieve voice conversion for speakers of diverse languages. These algorithms endeavor to enhance the quality of speech conversion through the utilization of both parallel and non-parallel data.

In a particular research endeavor [136], significant effort was devoted to devising techniques for cross-lingual voice conversion (VC) utilizing non-parallel training data and a diverse set of languages; this process entails translating recorded speech from one language into another. Robust voice conversion approaches, such as neural-network-based [108], vocoder-based [109], [110], GAN-based [111], [112], and VAE-based [113], [114] methods, have been developed to aid in the modeling of non-parallel spectral data.

Techniques based on autoencoders attempt to learn how to modify speaker identity independently of the linguistic content. In [114], the quality of learned representations was compared using various autoencoding techniques. It was found that, whenever WaveNet [85] and a Vector-Quantized VAE are used together, the decoder enhances the preservation of speaker-invariant linguistic content and recovers discarded information. Owing to the dimensionality-reduction bottleneck, VAE/GAN-based techniques over-smooth the transformed features, resulting in voice conversion with audible buzz.

Recent GAN-based approaches such as VAW-GAN [115], CycleGAN [111], [116], and StarGAN [153] aim to produce high-quality converted speech. Studies [117], [118] have demonstrated superior performance in terms of naturalness and similarity to the target speaker compared to other multilingual VC methods. However, performance is reliant on speakers seen during training and diminishes for unseen speakers. Owing to their capacity to create human-like speech, neural vocoders have surpassed other vocoding technologies and become the standard for audio synthesis in recent years [91]. A neural vocoder can learn to produce audio waves that bear a striking resemblance to the distinct acoustic characteristics of the speaker.

Research [110] examined the performance of a variety of vocoders and determined that Parallel WaveGAN performed the best. Using acoustic properties, [119] effectively simulated the transmission of human speech data over IP for voice conversion; nevertheless, there is scope for improvement in addressing unidentified, louder speakers [71]. Using AttS2S-VC [120], Cotatron [121], and VTN [122], three modern VC techniques based on TTS, researchers can directly synthesize speech from text labels by detecting aligned linguistic features from the source speech. By doing so, neither the source nor the destination speaker's identity changes throughout the conversion process.


Unfortunately, these strategies rely on text labels, which are not always easily accessible.

There have been recent attempts at ‘‘one-and-done’’ VC techniques [123], [160]. Unlike prior methodologies, training few-shot voice conversion models does not necessitate direct access to paired source and target speaker data samples; merely one utterance from each speaker suffices for the conversion procedure. The speech of the source speaker is utilized to derive a speaker embedding, which is subsequently employed to produce the converted speech. Notwithstanding recent progress, few-shot voice conversion techniques still encounter obstacles in attaining dependable performance for speakers who have not been previously encountered [125]. This is largely because a speaker embedding generated from a single unseen speaker's utterance is insufficient [126], which has a noticeable effect on the dependability of one-shot conversions. In additional work [127], [128], speaker identities are concealed during training using zero-shot VC, and the model does not need to be retrained.

The speaker encoder breaks down data about the speaker's delivery into individual ‘‘embeddings’’ for style and substance, while the decoder uses these embeddings to construct audio clips. The zero-shot VC scenario is interesting because it does not require the collection of adaptation data or the adjustment of parameters. However, adaptation falls short, especially in situations in which both the target and source speakers are unseen, vastly dissimilar, and very noisy [125].
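The speaker-embedding idea behind zero-shot VC can be made concrete with an off-the-shelf speaker encoder. The sketch below is a minimal example, assuming the open-source Resemblyzer package and two local recordings source.wav and target.wav (hypothetical paths); it embeds each utterance and reports their cosine similarity, the kind of representation a zero-shot VC decoder conditions on.

# Speaker-embedding sketch (assumes: pip install resemblyzer, plus two local WAV files).
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # pretrained d-vector-style speaker encoder

# Each utterance is reduced to a fixed-length embedding of the speaker's vocal identity.
emb_source = encoder.embed_utterance(preprocess_wav("source.wav"))
emb_target = encoder.embed_utterance(preprocess_wav("target.wav"))

# Cosine similarity between the two speaker embeddings (closer to 1.0 means more similar voices).
similarity = float(np.dot(emb_source, emb_target) /
                   (np.linalg.norm(emb_source) * np.linalg.norm(emb_target)))
print(f"speaker similarity: {similarity:.3f}")

In a zero-shot VC system, such an embedding, rather than a retrained model, is what conditions the decoder on the target speaker.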


B. AUDIO DEEPFAKE DETECTION METHODS
Due to recent advancements in TTS [126], [129] and VC [125] techniques, audio deepfakes pose a growing threat to voice biometric interfaces and to society. Previous research has not fully addressed the detection of synthetic speech [130], but DL methods, such as CNNs, RNNs, and LSTMs, show promise in detecting deepfakes by analyzing spectral content, pitch, and time-frequency patterns. The use of these methods holds great potential for preventing the spread of audio deepfakes. This section examines the methodologies for detecting audio deepfakes.

TABLE 3. Overview of the various methods for detecting deepfake audio.

Table 3 provides an overview of various methods for detecting deepfake audio. The techniques can be divided into two primary categories: handcrafted techniques and DL techniques. Handcrafted techniques involve manually designing and implementing algorithms to detect deepfake audio, while DL techniques utilize neural networks to automatically learn patterns in the audio data and detect deepfakes. In the following text, we delve deeper into each of these categories and discuss the specific methods used.

1) HANDCRAFTED TECHNIQUES
Yi et al. [131] proposed a technique for identifying audio content that has been modified using TTS synthetic speech. The recognizer may be trained using GMM and LCNN classifiers on constant-Q cepstral coefficients (CQCC), which are handcrafted features. Although this technique performed better on completely synthesized audio, its performance progressively declined on partially generated audio samples. In [106], Res2Net, a modified version of ResNet, was used; the authors assessed the model using a variety of acoustic features and determined that CQT features provided the best results. This model performs better at detecting audio tampering; nonetheless, there is room for further enhancement of its capacity for generalization.

The authors of [132] utilized a combination of mel-spectrogram features and ResNet-34 for the purpose of detecting counterfeit speech. Despite the success of this approach, there is room for further improvement. Monteiro et al. [133] employed an ensemble approach to distinguish between authentic and synthetic speech: deep learning models, specifically LCNNs and ResNets, were utilized to compute deep attributes, which were subsequently combined. Despite the robustness of the resulting false-speech detection, it is crucial to evaluate this model on a representative dataset.

A method for identifying counterfeit speech was devised by Gao et al. [134], which relies on the detection of signal inconsistencies. A residual network was trained to identify altered speech through the utilization of a global 2D-DCT feature. Although the model exhibited a higher degree of generalization, its performance deteriorated when noisy data were used. An artificial speech detection model based on a ResNet network and a transformer encoder (TEResNet) was developed by Zhang et al. [135]. The initial stage involved the utilization of a transformer encoder to build context-specific representations of acoustic key points by analyzing the correlation between the frames of the audio input. Subsequently, a residual network was trained using the determined key points to differentiate between unaltered and altered speech. This study demonstrates improved effectiveness in detecting bogus audio but requires substantial training data.

In the study [136] conducted by Das et al., a technique was developed for determining whether an individual's speech has been altered. First, a signal-companding approach was utilized to boost the variety of the training data. Subsequently, the gathered data were employed to produce CQT characteristics, which were then utilized for training an LCNN classifier. Although this approach improves the accuracy of detecting counterfeit audio, it requires a substantial quantity of training data.

The detection of cloned conversations was proposed by Aljasem et al. [137] through a technique that relies on handcrafted features. At the outset, sign-modified acoustic local ternary patterns were utilized to extract features from the input data. The acquired features were subsequently utilized to develop classifiers based on an asymmetric bagging technique for discriminating between authentic and artificially generated speech. The technique exhibits resilience towards high-volume cloned vocal playback attacks; nevertheless, it necessitates additional refinement with regard to its efficiency.

Ma et al. [138] introduced a method based on continual learning to improve the ability of modified-speech detection systems to generalize. The learning capabilities of the model were enhanced by adding a loss function to distill accumulated knowledge. Although this technique is computationally efficient and capable of detecting previously undiscovered spoofing operations, its performance with noisy data has not been examined.

Borrelli et al. [139] included both short- and long-term bicoherence characteristics in their study. Three classifiers were trained using the gathered features: a linear SVM, a radial basis function (RBF) SVM, and a random forest. This technique achieves the highest degree of precision when an SVM classifier is used. However, because it is a manual feature-design process, it does not generalize to concealed manipulation procedures. When analyzing GAN-generated audio samples, the researchers in [140] employed bispectral analysis to identify unique spectral correlations.

Similarly, the authors of [141] utilized bispectral and mel-cepstral analyses to identify the robust power elements missing in counterfeit speech. These characteristics were employed to train diverse ML-based classifiers, among which a Quadratic Support Vector Machine (SVM) exhibited the best performance. These strategies [140], [141] are not fooled by TTS synthetic audio but may miss audio synthesized at superior quality. Malik and Changalvala [142] suggested using a CNN to identify cloned speech.

First, audio samples were transformed into spectrograms, which were then used to calculate deep features and categorize the real and false speech samples using a CNN architecture. Although this method is more effective in identifying phony audio, it struggles with samples that contain high levels of background noise. Chen et al. [9] developed a technique that exploits DL to recognize fake audio. Audio samples were used to create 60-dimensional linear filter banks (LFB), based on which a specialized ResNet model was trained. This study enhances the identification of bogus audio, albeit at considerable computational expense.

Huang and Pun [143] proposed a technique to detect audio spoofing. First, silences were identified by analyzing the rate and intensity of each speech signal's short-term zero crossings. Then, in the relatively high-frequency domain, the chosen sections were used to identify LFBank critical points. Finally, a DenseNet-BiLSTM framework was developed for audio manipulation detection. However, the computational cost of this method [143] for avoiding overfitting is high. Based on keypoints and a light CNN (LCNN), Wu et al. [144] suggested a novel approach for detecting synthetic audio manipulations.

The unique characteristics of human voices were used to train a CNN model, and alterations were made to make the emphasis distribution of synthetic speech more similar to that of normal speech. An LCNN was then used in combination with the modified keypoints to distinguish natural speech from artificial speech. As a result, this method [144] cannot be easily fooled by artificially altered audio. However, it cannot prevent replay attacks, because it has no way of identifying them.

2) DEEP LEARNING FEATURES-BASED TECHNIQUES
Zhang et al. [145] showed that a DL strategy can be developed using OC-Softmax and ResNet-18. The model was trained to identify a feature space that allows differentiation between natural and modified audio samples. Despite its superior generalization against unknown attacks, this technique suffers from VC attacks owing to waveform filtering.

As shown in FIGURE 4, the system adheres to a conventional deep-learning architecture for detecting audio spoofing. The characteristics of the speech are input into a neural network that computes an embedding vector corresponding to the input utterance. The neural network is trained to learn an embedding space that enables efficient differentiation between authentic utterances and those produced through spoofing. Subsequently, the embedding is employed to evaluate the level of certainty regarding whether the utterance is genuine speech or spoofing.

FIGURE 4. An illustration of Softmax and AM-Softmax for binary classification, alongside the proposed OC-Softmax for one-class learning. The embeddings and weight vectors presented in the illustration are not normalized [145].

The study presents a new loss function, denoted One-Class Softmax (OC-Softmax), designed for detecting audio spoofing. It is juxtaposed with the loss functions frequently employed for binary classification. The OC-Softmax loss was developed to condense the authentic speech representation and segregate instances of spoofing attacks in the embedding space. The loss functions used by the system are as follows:

L_S = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{w_{y_i}^{\top}x_i}}{e^{w_{y_i}^{\top}x_i}+e^{w_{1-y_i}^{\top}x_i}} = \frac{1}{N}\sum_{i=1}^{N}\log\left(1+e^{(w_{1-y_i}-w_{y_i})^{\top}x_i}\right)   (1)

The softmax loss is a loss function commonly employed in the training of classification models. It quantifies the dissimilarity between the predicted probability distribution and the true probability distribution. During training, the model aims to minimize the softmax loss, which facilitates learning to predict the appropriate class for every input accurately.

L_{AMS} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\alpha(\hat{w}_{y_i}^{\top}\hat{x}_i-m)}}{e^{\alpha(\hat{w}_{y_i}^{\top}\hat{x}_i-m)}+e^{\alpha\hat{w}_{1-y_i}^{\top}\hat{x}_i}} = \frac{1}{N}\sum_{i=1}^{N}\log\left(1+e^{\alpha\left(m-(\hat{w}_{y_i}-\hat{w}_{1-y_i})^{\top}\hat{x}_i\right)}\right)   (2)

The AM-Softmax loss function is utilized in the training of one-class classification models. It is a variant of the softmax loss that incorporates an angular margin to enhance the compactness of the embedding distribution of each class. During training, the AM-Softmax loss is minimized, thereby facilitating the model's ability to differentiate between authentic and counterfeit vocalizations.

L_{OCS} = \frac{1}{N}\sum_{i=1}^{N}\log\left(1+e^{\alpha\left(m_{y_i}-\hat{w}_{0}^{\top}\hat{x}_i\right)(-1)^{y_i}}\right)   (3)

The OC-Softmax loss is a loss function utilized in the training of one-class classification models. It is a variation of the AM-Softmax loss that incorporates dual margins to compact the genuine speech representation and effectively isolate instances of spoofing attacks. During training, the model is trained to minimize the OC-Softmax loss, which in turn enhances its ability to accurately detect deepfakes.
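The three losses above differ only in how they margin the score between an embedding and the target-class weight vector. The following is a minimal PyTorch sketch of the OC-Softmax loss of Eq. (3), written from the formula as stated here rather than taken from the authors' code; the margin values m0 and m1 and the scale alpha are illustrative.

# OC-Softmax sketch following Eq. (3): a single weight vector w0 for the bona-fide class,
# a large margin m0 for genuine speech and a smaller margin m1 for spoofed speech.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OCSoftmax(nn.Module):
    def __init__(self, embed_dim=256, alpha=20.0, m0=0.9, m1=0.2):
        super().__init__()
        self.w0 = nn.Parameter(torch.randn(embed_dim))
        self.alpha, self.m0, self.m1 = alpha, m0, m1

    def forward(self, embeddings, labels):
        # Cosine score between each (normalized) embedding and the bona-fide direction w0.
        score = F.normalize(embeddings, dim=1) @ F.normalize(self.w0, dim=0)
        # Per-sample margin: m0 for genuine utterances (label 0), m1 for spoofed ones (label 1).
        margin = torch.where(labels == 0,
                             torch.full_like(score, self.m0),
                             torch.full_like(score, self.m1))
        # (-1)^y flips the sign so genuine scores are pushed above m0 and spoofed scores below m1.
        sign = torch.where(labels == 0, torch.ones_like(score), -torch.ones_like(score))
        return torch.log1p(torch.exp(self.alpha * (margin - score) * sign)).mean()

# Usage with a batch of embeddings from any front-end network (for example a ResNet-18 on LFCCs):
loss_fn = OCSoftmax(embed_dim=256)
emb = torch.randn(8, 256)                     # stand-in embeddings
lab = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1])  # 0 = genuine, 1 = spoofed
loss = loss_fn(emb, lab)
loss.backward()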


weighted cross-entropy loss to address data imbalance, and implementation of mixup regularization to enhance generalization. The findings demonstrate the efficacy of the proposed models in relation to established techniques, as evaluated on the ASVspoof2019 dataset. An ablation study is conducted to assess the influence of network depth and width. According to the research, lighter models are able to attain a favorable balance between precision and effectiveness.

WCE(z, y_i) = -w_{y_i}\log z_{y_i}   (4)

The weighted cross-entropy loss of Equation (4) is a strategy to address the issue of data imbalance, whereby the minority class is assigned a higher weight.

CE_{mixup}(\tilde{z}, y_i, y_j) = \lambda\, CE(\tilde{z}, y_i) + (1 - \lambda)\, CE(\tilde{z}, y_j)   (5)

The study employs the mixup regularization loss of Equation (5), which integrates the cross-entropy (CE) losses of the mixed examples in the synthetic speech detection domain.

\tilde{x}_i = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y}_i = \lambda y_i + (1 - \lambda) y_j   (6)

The mixup regularization of Equation (6) pertains to the amalgamation of training examples and labels to enhance the generalization of the model.
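A minimal sketch of how Equations (4) to (6) can be combined during training is shown below. The Beta-distribution parameter, the class weights, and the tensor shapes are assumptions for illustration, not the values used in [56].

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(x, y, beta_param=0.5):
    """Blend a batch with a shuffled copy of itself, as in Eq. (6)."""
    lam = float(np.random.beta(beta_param, beta_param))   # assumed Beta(0.5, 0.5) mixing
    perm = torch.randperm(x.size(0))
    x_tilde = lam * x + (1.0 - lam) * x[perm]
    return x_tilde, y, y[perm], lam

def mixup_wce_loss(logits, y_a, y_b, lam, class_weights):
    """Weighted cross-entropy on the mixed examples, combining Eqs. (4) and (5)."""
    ce_a = F.cross_entropy(logits, y_a, weight=class_weights)
    ce_b = F.cross_entropy(logits, y_b, weight=class_weights)
    return lam * ce_a + (1.0 - lam) * ce_b

# Toy usage with a 2-class problem in which the minority class receives a larger weight.
x = torch.randn(16, 1, 16000)                      # placeholder raw-waveform batch
y = torch.randint(0, 2, (16,))
x_tilde, y_a, y_b, lam = mixup_batch(x, y)
logits = torch.randn(16, 2, requires_grad=True)    # stand-in for the network output on x_tilde
loss = mixup_wce_loss(logits, y_a, y_b, lam, class_weights=torch.tensor([1.0, 9.0]))
loss.backward()
```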
Wang et al. [146] devised a deep neural network (DNN) model, which they named DeepSonar, to identify artificially-generated counterfeit audios in speaker recognition (SR) systems. The employed methodology utilizes a stratified configuration of neural units to execute the task of classification. The DeepSonar system was assessed by the authors on the audios of English speakers obtained from the FoR dataset [147]; the results showed a detection rate of 98.1% and an equal error rate (EER) of approximately 2%. The model's efficacy was notably impacted by the existence of noise in practical settings, leading to a reduction in precision. Wang et al. presented a noise-reduction methodology to tackle the aforementioned problem. The proposed technique yielded a 5% improvement in the model's accuracy, leading to a detection rate of 98.6% and an EER of 1.9%.

As can be seen from Figure (5), the system comprises three primary constituents. A deep neural network (DNN) was proposed to differentiate between authentic and counterfeit audios. The DNN was trained on a speaker recognition (SR) system and extracts activation patterns from the SR system for both authentic and synthetic vocalizations. A classification system was then used to distinguish between authentic and counterfeit audios based on the extracted activation patterns.

FIGURE 5. The DeepSonar system's block diagram for spotting AI-generated doppelganger voices [146].

The operational process involves the initial input of authentic and counterfeit vocalizations into the SR mechanism. The SR system then generates a collection of activation patterns for every vocalization. The layer-wise neuron activation pattern extractor is then used to extract the activation patterns. Finally, the activation patterns are inputted into the classifier, which performs the task of categorizing the audios into either genuine or counterfeit.

The authors found that layer-wise neuron behaviors can be used to detect artificially-generated fake audios. The TKAN neuron coverage criterion was more effective than the ACN neuron coverage criterion because it can distinguish between real and artificial audios more effectively.

The system has reportedly shown effectiveness in identifying vocalizations that were created artificially, with a detection rate of 98.1% and a false alarm rate of about 2%. The system also demonstrates resistance to manipulation attempts, including but not limited to audio alteration and the addition of outside noises.

The equations utilized are:

\delta_l = \frac{\sum_{x \in X,\, i \in I} \varphi(x, i; \theta)}{|I| \cdot |X|}   (7)

Equation (7) computes the l-th layer threshold \delta_l for the SR system.

ACN(l, i) = |\{x \mid \forall x \in l,\ \varphi(x, i; \theta) > \delta_l\}|   (8)

TKAN(l, i) = \{\operatorname{argmax}(\varphi(x, i; \theta), k) : x \in X\}   (9)

Equation (8) defines the neuron coverage criterion for ACN, and Equation (9) defines the neuron coverage criterion for TKAN.
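A loose numpy sketch of the thresholding and coverage idea behind Equations (7) to (9) is given below. The array layout, the top-k value, and the function names are assumptions made for this sketch, not DeepSonar's actual implementation.

```python
import numpy as np

def layer_threshold(activations):
    """Eq. (7): average activation over all probe inputs X and neurons I of one layer."""
    # activations: (num_inputs, num_neurons) array holding phi(x, i; theta)
    return activations.mean()

def acn_neurons(activations, delta_l):
    """Eq. (8)-style criterion: per neuron, how many inputs push it above the layer threshold."""
    return (activations > delta_l).sum(axis=0)

def tkan_neurons(activations, k=5):
    """Eq. (9)-style criterion: indices of the k most strongly activated neurons for each input."""
    return np.argsort(activations, axis=1)[:, -k:]

# Toy usage on random "activations" for a layer with 64 neurons and 100 probe inputs.
acts = np.random.rand(100, 64)
delta = layer_threshold(acts)
print(acn_neurons(acts, delta).shape, tkan_neurons(acts, k=5).shape)   # (64,) (100, 5)
```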
In their study, Yu et al. [148] introduced a new method for scoring, referred to as Human Log-Likelihoods (HLLs), that utilizes a Deep Neural Network (DNN) classifier. In contrast to the conventional employment of Gaussian Mixture Models (GMMs) in the Log-Likelihood Ratios (LLRs) scoring system, HLLs are specifically devised to augment the precision of the classification procedure. The efficacy of the HLLs approach was assessed by the authors through the utilization of the ASVspoof Challenge 2015 dataset and the automated extraction of feature sets. The findings of the experiment indicate that the DNN-HLLs exhibited superior detection accuracy compared to GMM-LLRs, as evidenced by an Equal Error Rate (EER) of 12.24. This study provides evidence


supporting the enhanced dependability and precision of the HLLs technique in identifying falsified audio.

FIGURE 6. The model of the spoofing detection system in an ASV system [148].
As shown in Figure (6), the system for detecting spoofing comprises three main components: feature extraction, spoofing detection, and decision-making. The feature extraction component extracts distinctive characteristics from the input audio signal. The spoofing detection component is a deep neural network (DNN) that has been trained to differentiate between authentic and falsified audio. The decision-making component determines the authenticity of the input audio by analyzing the DNN's output.

The spoofing detection score can be computed using the log-likelihood ratio (LLR), the human-like likelihood (HLL) scoring technique, or a combination of both. The mean (m) and standard deviation (σ) of the spoofing scores are then determined. The false rejection rate (FRR) and false acceptance rate (FAR) are also calculated.

S_{GMM}(X) = \frac{1}{T}\sum_{i=1}^{T}\{\log P(X_i \mid \lambda_{human}) - \log P(X_i \mid \lambda_{spoof})\}   (10)

The scores S1_{DNN}(F) and S2_{DNN}(F) are derived from a DNN specifically designed to discriminate against spoofing, resulting in a spoofing detection mechanism; they are given in Equations (11) and (12). These equations calculate the disparity between the logarithmic posterior probabilities of authentic human speech and the fraudulent spoofing classes.

S1_{DNN}(F) = \frac{1}{T}\sum_{i=1}^{T}\left\{\log[P(h \mid F_i)] - \log\left[\sum_{k=1}^{K} P(s_k \mid F_i)\right]\right\}   (11)

S2_{DNN}(F) = \frac{1}{T}\sum_{i=1}^{T}\{\log[P(h \mid F_i)] - \log[\max_k(P(s_k \mid F_i))]\}   (12)

Equation (13) denotes that S3_{DNN}(F) is an additional metric for detecting spoofing, which is computed from the outputs of the deep neural network. The approach utilizes the log-likelihood of human speech, which represents the probability of a given frame belonging to a human utterance, as the metric for determining the degree of spoofing.

S3_{DNN}(F) = \frac{1}{T}\sum_{i=1}^{T}\log(P(h \mid F_i))   (13)
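As a rough illustration, the sketch below turns per-frame likelihoods and posteriors into the utterance-level scores of Equations (10) to (13). The array names, shapes, and the Dirichlet toy data are assumptions for this sketch only.

```python
import numpy as np

def gmm_score(ll_human, ll_spoof):
    """Eq. (10): mean per-frame log-likelihood gap between the human and spoof GMMs."""
    return float(np.mean(ll_human - ll_spoof))

def dnn_scores(post_human, post_spoof):
    """Eqs. (11)-(13): utterance scores from per-frame DNN posteriors.
    post_human: (T,) posterior of the human class per frame.
    post_spoof: (T, K) posteriors of the K spoofing classes per frame."""
    s1 = np.mean(np.log(post_human) - np.log(post_spoof.sum(axis=1)))
    s2 = np.mean(np.log(post_human) - np.log(post_spoof.max(axis=1)))
    s3 = np.mean(np.log(post_human))
    return float(s1), float(s2), float(s3)

# Toy usage with T = 200 frames and K = 5 spoofing classes.
post = np.random.dirichlet(np.ones(6), size=200)     # rows sum to one: [human, s_1 .. s_5]
print(dnn_scores(post[:, 0], post[:, 1:]))
```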
Equations (14) to (17) are utilized to determine the mean (m) and standard deviation (σ) of the spoofing scores (S_{HLL} and S_{LLR}) obtained with the human-like likelihood (HLL) and log-likelihood ratio (LLR) scoring techniques.

m_{S_{HLL}} = E[y_1]   (14)

\sigma_{S_{HLL}} = \sqrt{\left(E[y_1^2] - E[y_1]^2\right)/T}   (15)

m_{S_{LLR}} = E[y_2]   (16)

\sigma_{S_{LLR}} = \sqrt{\left(E[y_2^2] - E[y_2]^2\right)/T}   (17)

Equations (18) and (19) denote FRR(θ) and FAR(θ), the false rejection rate and the false acceptance rate, respectively, at a specific threshold value θ. The cumulative distribution functions of the normal distribution are utilized in these equations to estimate FRR and FAR.

FRR(\theta) = CDF(\theta \mid m_h, \sigma_h)   (18)

FAR(\theta) = 1 - CDF(\theta \mid m_s, \sigma_s)   (19)

The authors of [149] built a model using a Light Convolutional Gated Recurrent Neural Network (LC-GRNN), a full-stack approach to synthetic speech detection that combines deep feature computation and classification in a single network. The model can be adapted to new data, although this requires more than the usual amount of processing.

FIGURE 7. The proposed LC-GRNN utterance-level identity vector extractor block diagram [149].

The initial stage of data preparation involves the elimination of noise and the normalization of features. Feature extraction employs a convolutional neural network (CNN) to extract features from the speech signal, and classification employs a recurrent neural network (RNN) to differentiate between authentic and fraudulent speech signals.

The following equation effectively encapsulates the temporal dependencies and contextual information present within the audio sequence, thereby playing a pivotal role in the precise detection of counterfeit audio. The update equation of the Gated Recurrent Unit (GRU) can be expressed as follows:

z_t^n = \sigma\left(MFM\left(W_z^n * x_t^n + U_z^n * h_{t-1}^n\right)\right)   (20)
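A minimal PyTorch sketch of a convolutional update gate with a max-feature-map nonlinearity, in the spirit of Equation (20), is shown below. The channel counts, kernel size, and class names are placeholders and this is not the exact LC-GRNN of [149].

```python
import torch
import torch.nn as nn

class MFM(nn.Module):
    """Max-Feature-Map: split the channels in two halves and keep the element-wise maximum."""
    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)
        return torch.max(a, b)

class ConvUpdateGate(nn.Module):
    """Sketch of the gate in Eq. (20): z_t = sigmoid(MFM(W_z * x_t + U_z * h_{t-1}))."""
    def __init__(self, in_ch=64, hid_ch=64, k=3):
        super().__init__()
        self.wz = nn.Conv1d(in_ch, 2 * hid_ch, k, padding=k // 2)
        self.uz = nn.Conv1d(hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.mfm = MFM()

    def forward(self, x_t, h_prev):
        return torch.sigmoid(self.mfm(self.wz(x_t) + self.uz(h_prev)))

# Toy usage: a batch of 4 feature maps with 64 channels and 100 frames.
gate = ConvUpdateGate()
z_t = gate(torch.randn(4, 64, 100), torch.randn(4, 64, 100))
print(z_t.shape)   # torch.Size([4, 64, 100])
```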

The symbol ∗ is utilized to represent a convolution opera- is calculated through a technique that involves the extrac-
tion performed by an operator. The convolutional layers may tion of significant features from the input, such as the
be construed as filter banks that have undergone training Mel-Frequency Cepstral Coefficients (MFCC), Constant-Q
and optimization to identify anomalies in the counterfeit Cepstral Coefficients (CQCC), and (STFT). The findings
speech. The primary benefit of utilizing these filters lies indicate a notable enhancement of 71% and 75% in the t-DCF
in the extraction of frame-level characteristics at each tem- (0.1569) and EER (6.02) matrices, respectively. Nevertheless,
poral interval, which exhibit greater discriminatory power the system exhibits generalization errors, underscoring the
than those obtained through the utilization of fully connected necessity for additional research to augment its efficacy.
units.
Also The variable znt denotes the update gate at time t,
which governs the extent to which the prior hidden state
hht−1 is modified in response to the present input xnt . The
computation of the update gate involves the utilization of
a sigmoid activation function (σ ), in conjunction with the
weight matrices Wzn and Uzn , and the bias term b_z.
FIGURE 9. The Spec-ResNet model architecture [151].
Cheng et al. [150] proposed a strategy that utilized the
Squeeze-Excitation Network (SENet) to train a Deep Neural
The system comprises four main parts: pre-processing,
Network (DNN) by incorporating log power magnitude spec-
feature extraction, classification, and post-processing.
tra and CQCC acoustic features. The ASVspoof 2019 dataset
The pre-processing stage removes noise from the audio
was utilized to evaluate the method, which demonstrated a
signal and normalizes it to a standard range. The feature
17% enhancement in the identification of synthetic audio.
extraction stage extracts feature from the audio signal, such
Nonetheless, the model’s efficacy exhibited a decline when
as Mel-frequency cepstral coefficients (MFCCs). The classi-
subjected to a logical access scenario, wherein overfitting was
fication stage uses a deep residual neural network (ResNet)
detected, leading to a t-DCF cost and an EER of zero.
to classify the audio signal as authentic or spoofed. The
post-processing stage produces the classification decision
from the ResNet.
The authors propose investigating alternative methods for
feature extraction and incorporating supplementary data into
the model to reduce generalization errors.

CM (s) = log(p(bona fide | s; θ)) − log(p(spoof | s; θ))


(21)
The equation involves the assignment of probabilities
to the input being either genuine or a spoof, denoted as
P(genuine) and P(spoof), respectively. The computation of
FIGURE 8. Feature map technique using the Unified method illustrated. the log-likelihood ratio involves the natural logarithm of the
A unified feature map is created by repeatedly repeating an utterance ratio between the probability of genuine events and the prob-
after extracting low-level acoustical features. The feature map is then
divided into segments with M frames of length and L frames of overlap, ability of spoof events.
and input into the DNN models [150]. Through utilization of the aforementioned formula, the
system is capable of producing a numerical value indicative
Figure (8) depicts Feature map technique of the system of the probability that the input is authentic or fraudulent.
designed for detecting audio-visual spoofing. The process Elevated CM values are indicative of a greater probability
entails the retrieval of auditory and visual characteristics from of the input’s authenticity, whereas reduced values suggest
the input, which are subsequently merged to capture their a higher probability of it being a counterfeit.
interrelation. The temporal modeling module is designed to Rahul et al. [132] introduced a novel methodology for
capture temporal dependencies and contextual information identifying falsified English-speaking audios using transfer
within the fused features. Ultimately, the authenticity of the learning and the ResNet-34 technique, which outperformed
input is determined through a classification stage that lever- unimodal and multimodal approaches. The CNN network
ages the output generated by the temporal modeling module. was used for pre-training the transfer learning model, which
The presented block diagram illustrates the integration of utilized the Rest-34 technique to address the vanishing gra-
auditory and visual data within the system, which serves to dient problem. The framework yielded optimal outcomes,
augment its ability to detect instances of spoofing. as evidenced by an EER metric of 5.32% and a t-DCF met-
Alzantot et al. [151] have proposed a technique that utilizes ric of 0.11514%. Khochare et al. [61] conducted a study
a residual (CNN) for the identification of audio deepfakes. on the detection of artificially generated fraudulent audio
The Counter Major (CM) score of the counterfeit audio by utilizing two innovative deep learning models, namely


the Temporal Convolutional Network (TCN) and the Spatial the signal as genuine or counterfeit. The decision maker uses
Transformer Network (STN). The study explored the effi- the output of the RCNN to determine the authenticity of the
cacy of feature-based and image-based approaches. The TCN signal.
model demonstrated a favorable outcome with an accuracy
rate of 92%. In contrast, the STN model exhibited an accuracy
rate of 80%; however, it lacked the capability to accom-
modate inputs such as (STFT) and Mel Frequency Cepstral
Coefficient (MFCC).

FIGURE 10. The framework of the proposed speech spoofing detection system [132].

FIGURE 11. An overview of the problem domain. The audio and visual components are extracted and subjected to spoof and deepfake detection models for processing [57].

The proposed framework uses transfer learning to train a The present architectural design incorporates mathemati-
deep CNN to classify speech signals as genuine or fraudulent. cal equations, which are outlined below.
The CNN is first trained on a large dataset of speech signals  
using a pre-existing model. This allows the network to learn eyc
universal speech characteristics, which can then be used to LCE = − log  P1+n  (22)
f yj
classify new speech signals. The CNN is then fine-tuned j=1 e
using a smaller dataset of speech signals that have been The cross-entropy loss function (L_CE) is used to train
labeled as genuine or fraudulent. This fine-tuning allows the the RCNN. It measures the difference between the predicted
neural network to learn the specific characteristics of the probability distribution and the actual probability distribu-
speech signals in the dataset, which improves the accuracy tion.
of classification.
Mel-spectrograms are used to represent speech signals in DKL (N ((µ1 , µ2 )T , diag(σ12 , σ22 )) ∥ N (0, I ))
Xn
the frequency domain. This representation captures both the = λ( σi2 + µ2i − log(σi ) − 1) (23)
temporal and spectral characteristics of the signal, which is i=1
important for classifying speech signals. The Kullback-Leibler (KL) divergence (L_KL) is used to
ResNet is a CNN that is known for its ability to learn measure the difference between two probability distributions.
long-range dependencies in data. This makes it well-suited
LEN = λ1 LKL + λ2 LCE (24)
for classifying speech signals, which can have long temporal
dependencies. The ensemble loss (L_EN) is a combination of L_CE and
The proposed framework has been shown to achieve high L_KL. It is minimized to improve the overall performance of
accuracy in classification across a variety of speech datasets. the RCNN.
This is due to the use of transfer learning, which allows the Shan and Tsai [153] developed a method for aligning audio
neural network to learn the fundamental characteristics of recordings using three different classification models: LSTM,
speech, and the use of Mel-spectrograms, which captures the bidirectional LSTM, and transformer architectures. The goal
temporal and frequency aspects of the signal. of this approach was to classify individual audio frames from
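As a rough illustration of the transfer-learning step described above, the sketch below fine-tunes an ImageNet-pretrained ResNet-34 from torchvision for two-class genuine/spoofed classification of spectrogram images. The layer-freezing choice, the weights argument (older torchvision versions use pretrained=True instead), and the 224x224 three-channel input assumption are placeholders, not the exact configuration of [132].

```python
import torch.nn as nn
from torchvision import models

def build_spoof_classifier(num_classes=2, freeze_backbone=True):
    """Fine-tune an ImageNet-pretrained ResNet-34 on Mel-spectrogram images."""
    net = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)
    if freeze_backbone:                      # keep the generic features, train only the new head
        for p in net.parameters():
            p.requires_grad = False
    net.fc = nn.Linear(net.fc.in_features, num_classes)   # genuine vs. fraudulent
    return net

# The backbone expects 3-channel images, so Mel-spectrograms are typically rendered or tiled
# to shape (3, 224, 224) before being fed in; that preprocessing choice is an assumption here.
model = build_spoof_classifier()
```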
Chintha et al. [57]proposed two novel models for audio a set of 50 distinct recordings into either a matching or non-
deepfake detection. The first model, CRNN-Spoof, uses a matching status. The bidirectional LSTM model was found
bidirectional LSTM network to predict counterfeit audio to have the best performance, achieving a precision rate of
based on five layers of extracted audio signals. The sec- 99.7% and an error rate of 0.43%.
ond model, WIRE-Net-Spoof, uses a weighted negative As shown in figure (12) The system consists of three main
log-likelihood function and outperformed CRNN-Spoof by components:
0.132% in the Tandem Decision Cost Function (t-DCF) A repository of unprocessed audio recordings sourced
with an EER of 4.27% in the ASV Spoof Challenge 2019 from reliable entities, a search sub-system that retrieves
dataset [152]. relevant matches from a database in response to an audio
Figure (11) depicts the block diagram of the system. The query, and a cross-verification sub-system that uses a refer-
system consists of an audio/video encoder, a feature extractor, ence recording to authenticate the audio query and ensure
and a recurrent neural network (RCNN). The audio/video its validity.
encoder converts the input signal into a digital format. The The cross-verification sub-system comprises three steps:
feature extractor extracts feature from the digital signal. The First, feature extraction: The audio query and the refer-
RCNN is a machine learning model that is trained to classify ence recording are transformed into a collection of features


data by converting the sample rate, merging audio channels,


and extracting MFCCs. The MFCCs are then used to train
a DNN to predict the existence of background noises. The
DNN is then used to filter out the background noises from
the original audio signal.
The system was evaluated on the UrbanSound8K dataset,
which consists of labeled urban audio excerpts from 10 dis-
tinct classes. The system achieved a success rate of 94% in
detecting audio produced by AI synthesizers.
The system’s block diagram is shown in Figure 13. The
system consists of three main components,

FIGURE 12. An overview of the entire system. The objective is to


authenticate a speech recording provided by a prominent
global figure [153].

that capture the essence of their acoustic content. Second,


FIGURE 13. Block diagram for synthetic speech detection with DNN [154].
Alignment: The features of the audio query and the reference
recording are aligned using dynamic programming. Last, A data preprocessing component that converts the sample
decision-making: The system determines the legitimacy of rate, merges audio channels, and extracts MFCCs.
the audio query based on the results of the alignment. A DNN classifier that predicts the existence of background
The process of feature extraction involves the computation noises.
of Mel-frequency cepstral coefficients (MFCCs) from the An adaptive filter that eliminates the background noises
audio recordings. The MFCCs comprise a set of 39 dimen- from the original audio signal.
sions, which encompass both delta and delta-delta features. The system was able to achieve high accuracy by com-
The alignment process involves the computation of a pair- bining the strengths of CNNs and RNNs. CNNs are good at
wise cost matrix C, which is obtained by evaluating the extracting features from the input data, while RNNs are good
Euclidean distance between the query and reference features. at capturing long-term dependencies. By combining these
The cumulative cost matrix D is generated through dynamic two types of neural networks, the system was able to learn
programming using the following guidelines: to identify the subtle differences between real and synthetic
D[i, j] audios.
 The system is a promising step towards developing effec-

 C[i, j] i=0 tive methods for detecting audio produced by AI synthesizers.
 α + D[i − 1, j]

i > 0, j = 0 It can be used to protect against the spread of misinformation
=

 min(γ + D[i, j − 1], α + D[i − 1, j], and disinformation, and it can also be used to improve the
security of audio-based authentication systems.

i > 0, j > 0

C[i, j] + D[i − 1, j − 1])
(25) Jiang et al. [155] introduced a self-supervised spoofing
audio detection (SSAD) model, which draws inspiration
The backtrace matrix B is concurrently updated with D in from PASE+, a pre-existing self-supervised deep learning
order to maintain a record of the optimal transition types. methodology. The employed approach involves the utiliza-
The optimal alignment can be identified by the lowest cost tion of multilayer convolutional blocks to extract contextual
element located in the final row of D. The determination of features from the audio stream. Although SSAD exhibited
the optimal subsequence path is achieved through the process commendable scalability and efficiency, its performance was
of tracking the back pointers. comparatively weaker than other deep learning methodolo-
The aforementioned equations encapsulate the procedure gies, as evidenced by an EER of 5.31 percent. Subsequent
of aligning the features of a query and a reference through the investigations may delve into the prospective advantages of
calculation of costs, cumulative costs, and optimal transitions self-supervised learning and scrutinize techniques to aug-
via dynamic programming. ment its efficacy, with the aim of further ameliorating the
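A compact numpy sketch of the cumulative-cost recursion of Equation (25) is given below. The penalty values alpha and gamma, the feature dimensionality, and the function name are assumptions for illustration, not the exact settings of [153].

```python
import numpy as np

def subsequence_align(query, ref, alpha=1.0, gamma=1.0):
    """Dynamic-programming alignment in the spirit of Eq. (25).
    query: (N, d) MFCC frames of the audio query; ref: (M, d) MFCC frames of the reference."""
    C = np.linalg.norm(query[:, None, :] - ref[None, :, :], axis=2)   # pairwise Euclidean cost
    N, M = C.shape
    D = np.zeros_like(C)
    D[0, :] = C[0, :]                       # a matching subsequence may start anywhere in ref
    for i in range(1, N):
        D[i, 0] = alpha + D[i - 1, 0]
        for j in range(1, M):
            D[i, j] = min(gamma + D[i, j - 1],
                          alpha + D[i - 1, j],
                          C[i, j] + D[i - 1, j - 1])
    end = int(np.argmin(D[-1]))             # cheapest end point; back-pointers would recover the path
    return D, end

# Toy usage on random 39-dimensional features (delta and delta-delta MFCCs).
D, end = subsequence_align(np.random.rand(50, 39), np.random.rand(300, 39))
print(D.shape, end)
```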
Wijethunga et al. [154] proposed a system for detecting SSAD model’s performance. Additionally, research could
audio produced by AI synthesizers using a combination of be conducted to investigate the potential of combining
CNNs and RNNs. The system first preprocesses the audio self-supervised learning with other DL approaches to create


a hybrid model that could potentially outperform existing The quality of the preceding layer’s representations is
models. The features of the MLmodels can be extracted enhanced by the inclusion of an additional hidden layer with
automatically, reducing the need for extensive preprocessing ReLU activation in the nonlinear projection.
and saving time. To further improve the performance of the The Congener Info Max (CIM) task aims to reduce the dis-
models. parity between two comparable types of speeches, as defined
by L1, while simultaneously increasing the disparity between
two distinct types of speeches, as defined by L2.

L1 = ESr log (d (sa , sr ))


 
(27)
L2 = ESf log 1 − d sa , sf
 
(28)
L = L1 + L2 (29)

The discriminator function, denoted as d, is evaluated with


the expectation over positive samples (ESf ) and negative sam-
ples (ESr ).
The A-Softmax loss function, also known as Angular soft-
max, is a mathematical function used in ML.
The A-Softmax loss function can be denoted as follows:
 

∥Xi ∥cos mθyi ,i
1 X e
Lang = −log 
 


N i 
∥X ∥cos θyj ,i
e∥Xi ∥cos mθyi ,i + j̸=yi e i
P
FIGURE 14. The architecture of SSAD [155].
(30)
As can be seen from Figure (14) The architecture of SSAD In the context of ML, the variable N represents the quan-
as follow. tity of training samples denoted by the set {Xi }Ni=1 , along
SSAD’s architecture includes an encoder—a CNN—to with their respective labels {yi}Ni=1 . These training pairs are
extract audio features. The encoder comprises convolutional utilized in the calculation of θyi ,i , which represents the angle
layers followed by max pooling, extracting features at various between Xi and the corresponding column yi of weights W
scales and reducing dimensionality. Workers, compact neural in the fully connected classification layer. Additionally, the
networks, perform self-supervised tasks on encoder-extracted integer m serves as a parameter that governs the magnitude
features to enhance discriminative qualities between real and of the angular margin between classes.
fake audio. The classifier, another neural network, is trained Subramani and Rao [156] propose a number of methods
to categorize recordings based on encoder features and for improving the accuracy of fake speech detection models,
worker predictions, using a dataset of labeled authentic and including, Lightweight convolutional neural networks: The
synthetic audio. The system excels in precise classification authors propose two lightweight convolutional neural net-
due to its self-supervised learning, acquiring distinctive fea- work architectures for fake speech detection: EfficientCNN
tures for differentiation. This learning method outperforms and RES-EfficientCNN. These models have fewer param-
conventional supervised learning that relies on annotated eters and require less memory than traditional methods,
data. SSAD modifies the encoder’s architecture accordingly. making them more efficient and easier to deploy on resource-
The architecture of the encoder is modified by SSAD in the constrained devices.
following manner. Multi-task learning: The authors also propose a multi-task
The dilated convolution, denoted as F, is an operation learning setting for fake speech detection. In this setting, the
performed on an element s within a sequence. It involves the model is trained to jointly predict the veracity (bonafide vs.
expansion of the receptive field of the convolutional kernel by fake) and the source of the fake speech. The authors argue that
inserting gaps between the kernel elements. This results in a this helps the model to learn more discriminative features for
larger effective kernel size, which allows for the incorporation fake speech detection.
of a larger context into the convolution operation. Transfer learning: The authors also investigate the use of
Xk−1 transfer learning for fake speech detection. Transfer learning
F(s) = (x ·d f ) (s) = f (i)·xs−d·i (26) is a technique where a model trained on one task is used as a
i=0
starting point for training a model on a new task. The authors
The dilation factor is represented by d, the filter size is show that transfer learning can be used to adapt fake speech
denoted by k, and the term s − d · i takes into consideration detection models to new attack vectors (synthesis models)
the past direction. The process of dilation can be understood with less training data.
as the incorporation of a constant interval between each pair The authors evaluate their methods on two datasets of fake
of neighboring filter taps. speech: the ASVSpoof2019 dataset [72] and the RTVCSpoof
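The effect of dilation in Equation (26) can be seen in the short PyTorch sketch below; the channel counts, kernel size, and dilation value are placeholder assumptions rather than the SSAD encoder's actual configuration.

```python
import torch
import torch.nn as nn

# A 1-D convolution with dilation d inserts d - 1 gaps between kernel taps (Eq. 26),
# so a kernel of size k spans d * (k - 1) + 1 input frames at no extra parameter cost.
dilated = nn.Conv1d(in_channels=64, out_channels=64, kernel_size=3, dilation=4, padding=4)
x = torch.randn(1, 64, 400)     # (batch, channels, frames) -- placeholder feature map
y = dilated(x)
print(y.shape)                  # torch.Size([1, 64, 400]); padding = dilation * (k - 1) / 2 keeps length
```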


dataset. They show that their methods significantly outper- in this system is a one-dimensional convolutional neural
form previous methods on both datasets. network (1)-D CNN). The CNN is provided with a set of log-
The findings of the evaluation indicate that RES- probabilities, which have been generated for each frame of
EfficientCNN outperformed EfficientCNN with an F1- the audio signal through the utilization of a Gaussian mixture
score of 97.61 points, surpassing the latter’s F1-score of model (GMM) as its input. Following this, the 1-D CNN
94.14 points by 3.47 points. The aforementioned results indi- produces a set of features that illustrate the local and global
cate that the proposed approach is efficacious in enhancing relationships between the frames.
the precision of the model. Classifier, this component receives a sequence of features
Lei et al. [157] proposed a hybrid model that integrates extracted by the feature extractor and produces a probability
1-D CNN and Siamese CNN to optimize the performance of estimate of the authenticity of the speech signal. The classifier
the latter. The hybrid architecture was formulated by amal- used in this system is a Siamese CNN as can be seen from
gamating two CNN and appending a fully connected layer at Figure (16). The Siamese CNN is composed of two iden-
the terminal stage. The results obtained from the experiment tical CNNs that undergo simultaneous training on identical
indicate that the employment of the hybrid model led to a datasets. The two CNNs have identical weights and biases,
notable enhancement of around 50% in both the min-tDCF albeit having distinct inputs and outputs. The two CNNs
and EER metrics, specifically when utilizing the LFCC fea- receive inputs in the form of feature sequences extracted
tures. The utilization of CQCC features in conjunction with from two distinct utterances. The probabilities indicating the
the hybrid model resulted in a notable enhancement in model authenticity of the two utterances are generated as outputs
performance, as evidenced by a roughly 20% improvement in by the two CNNs. The Siamese CNN integrates the two
both the min-tDCF and EER metrics. The results of this study probabilities to generate a conclusive probability regarding
indicate that the hybrid model exhibits greater resilience and the authenticity of the speech in the given input utterance.
efficacy in accommodating diverse feature sets. Furthermore, XM
the hybrid model exhibited greater resilience to noise and p(x) = wi pi (x) (31)
i=1
improved capacity for detecting fraudulent audio.
The system is composed of two fundamental components,
specifically a feature extractor and a classifier.
Feature extractor, as shown in figure (15) This component
converts the raw audio signal into a set of discrete features
that can be used by the classifier to differentiate between
genuine and deceptive speech. The feature extractor used

FIGURE 16. The architecture of the Siamese Convolutional Neural


Network (CNN) model [157].

The log-probabilities of each frame in the audio signal are


computed utilizing the (GMM). The log-probabilities denote
the probability that the frame was produced by an authentic
speaker or an impostor speaker.
 
1 1
pi (x) = exp − (x − µi ) 6i (x − µi )
′ −1
(2π)D/2 |6i |1/2 2
FIGURE 15. The architecture of the CNN model [157]. (32)


1-D (CNN) is a neural network architecture that is fre-


quently employed in the domains of speech and image
processing. The (CNN) implemented in this system com-
prises 512 filters.
scorebaseline = logp (X | λh ) − logp (X | λs ) (33)
The Siamese (CNN) is a prevalent neural network archi-
tecture utilized for various applications, including but not
limited to object matching and facial recognition. The system
employs a Siamese (CNN) architecture, wherein two CNNs
with identical structures are concurrently trained on the same
dataset.
FIGURE 17. The representation of all classifiers that have been
fij = log wj · pj (xi )

(34) implemented in their entirety [158].

In the context of speech feature sequences, it is observed


that the GMM approach operates by independently accumu-
lating scores across all frames, without taking into account the categorized into two domains, namely shallow learning and
specific contribution of each Gaussian component towards deep learning. Logistic Regression (LR), Decision Tree (DT),
the final score. Random Forest (RF), and Gradient Boosting: XGBoost (XG)
Furthermore, the disregard for the correlation between are examples of shallow learning classifiers. The category
consecutive frames has been observed. The objective is to of deep learning classifiers encompasses Convolution Neural
construct a model for the distribution of scores on each com- Networks (CNN) and Bidirectional (BiLSTM).
ponent of the (GMM) and introduce the Gaussian probability Khochare et al. [61] conducted a comprehensive inves-
feature. tigation to evaluate the effectiveness of feature-based and
In this experiment, it was observed that for a raw frame image-based techniques in the classification of synthetically
feature such as CQCC or LFCC, the size of the new feature f produced counterfeit audio. The present study employed
is dependent on the order of GMM. Additionally, the compo- two innovative deep learning models, namely the Temporal
nent f7 is also a crucial factor. Convolutional Network (TCN) and the Spatial Transformer
Subsequently, the mean and standard deviation values of Network (STN), to achieve the intended objective. The find-
the training dataset are computed and subsequently employed ings of the study indicate that TCN exhibited a high level of
for the purpose of mean and variance normalization for every precision in distinguishing authentic from fabricated audio,
individual utterance. with a notable accuracy rate of 92%. In contrast, STN
Lataifeh et al. [158] conducted an experimental study demonstrated a comparatively lower accuracy rate of 80%.
aimed to evaluate the effectiveness of ML(ML) models in Despite exhibiting exceptional performance with sequential
comparison to (CNNs) and Bidirectional Long Short-Term data, it was discovered that the (STFT) and (MFCC) features,
Memory (BiLSTM) in detecting imitation-based fakeness on when transformed into inputs, were incompatible with TCN,
the Arabic Diversified Audio (AR-DAD) dataset [159]. The as per the findings.
research conducted an investigation on a range of MLmethod- As shown in figure (18), The system proposed in this study
ologies, encompassing SVM, SVM-Linear, Radial Basis consists of two approaches, feature-based classification and
Function (SVMRBF), LR, Decision Tree (DT), Radial Basis image-based classification.
Function (RF), and Gradient Boosting (XGBoost). According
to the findings, the Support Vector Machine (SVM) exhib-
ited the most noteworthy precision rate of 99%, whereas
the Decision Tree (DT) demonstrated the least accuracy rate
of 73.33%. CNN attained a detection rate of 94.33%, sur-
passing the performance of BiLSTM. CNN demonstrated
a high level of efficacy in detecting false correlations and
autonomously extracting characteristics through its capacity
for generalization. Nonetheless, a limitation of (CNN) archi-
tectures in the context of Audio Deepfake pertains to their
exclusive capacity to handle visual data as input. Preprocess-
ing of the audio is necessary to convert it into a spectrogram
or a two-dimensional representation prior to input into the
network.
Classifier systems based on MLtechniques are employed to
classify data by utilizing input features. The classifiers can be FIGURE 18. Diagram of the method for detecting deepfake audio [61].


Feature-based classification: This approach converts audio calculated as the number of times the signal crosses the zero
samples into a dataset of features, such as mean square axis in a given time interval.
energy, Chroma features, spectral centroid, spectral band- 1 X W
width, spectral roll off, zero crossing rate, and MFCCs. |sgn[x(n)] − sgn[x(n − 1)]| (39)
WL n=1 L
These features are then used to train machine learning
(ML) models, such as support vector machines (SVMs), The given expression pertains to an audio signal repre-
light gradient boosting machines (LGBMs), extreme gradi- sented by x(n), wherein WL denotes the window length and
ent boosting (XGBoosts), k-nearest neighbors (KNNs), and sgn represents the signum function.
random forests (RFs). The trained models are then used to In their study [160], E.R. Bartusiak and E.J. Delp proposed
classify new audio samples as either authentic or counterfeit. a novel approach to assign synthetic speech to its originator.
Image-based classification: This approach converts audio The employed technique utilizes a transformer, which is a
samples into melspectrograms using the librosa library. Mel- neural network framework that has demonstrated efficacy in
spectrograms are visual representations of the frequency various natural language processing endeavors. The efficacy
content of audio signals. The melspectrograms are then used of the method was evaluated on three distinct sets of synthetic
to train deep learning models, such as spatial transformer net- speech data, and it demonstrated a notable level of precision
works (STNs) and temporal convolutional networks (TCNs). across all three datasets. The method attained a 99.8% accu-
The trained models are then used to classify new audio sam- racy rate on the ASVspoof2019 dataset.
ples as either authentic or counterfeit. The method attained a 96.3% accuracy on the SP Cup
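A minimal sketch of both input pipelines with librosa is shown below; the file name, sample rate, and the 20-MFCC / 128-Mel-band settings are assumptions chosen for illustration, not necessarily the values used in [61].

```python
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000)                    # placeholder path and sample rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)            # input for the feature-based models
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)                 # log-Mel "image" for the STN/TCN models
print(mfcc.shape, mel_db.shape)
```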
The concept of Mean Square Energy of a signal x(n) can dataset. The method attained a precision rate of 93.4% on
be expressed as follows: the DARPA SemaFor Audio Attribution dataset. The efficacy
r of the technique was also evaluated in an open-set context,
1 2 wherein it demonstrated the ability to accurately detect unfa-
x + x22 + · · · + xn2

xrms = (35)
n 1 miliar speech generation techniques, achieving a precision
In this context, the variable ‘‘n’’ represents the total number rate of 90.2% on the ASVspoof2019 dataset and 88.45% on
of samples, while xi = ith sample. the DARPA SemaFor Audio Attribution dataset.
Spectral centroid: The spectral centroid is a measure of the The method exhibits robustness towards AAC compression
center of gravity of the spectrum of a signal. It is calculated when the data rates are equal to or greater than 32kbps.
as the weighted average of the frequencies in the spectrum, The method’s transformer comprises a total of approximately
with the weights being the magnitudes of the frequencies. 87 million parameters. The authors intend to enhance the
Pk=N precision and resilience of the approach in their forthcoming
f (k) · f (k)
µ = k=1 Pn=N (36) research.
k=1 m(k)
The magnitude at the kth frequency bin is denoted as
m(k), while the center frequency at the kth frequency bin is
represented by f(k).
Spectral bandwidth: The spectral bandwidth is a measure
of the width of the spectrum of a signal. It is calculated as the
square root of the variance of the frequencies in the spectrum.
X 1
p
m(k)(f (k) − µ)p (37)
k
The expression m(k) denotes the magnitude at the kth
frequency bin, while f (k) represents the center frequency at
the same bin. The parameter µ corresponds to the spectral
centroid.
Spectral rolloff: The spectral rolloff is a measure of the fre-
quency below which a certain percentage of the total energy
in the spectrum is located. It is calculated as the frequency at
which 85% of the energy in the spectrum is located.
Xfr XN
arg max m(k) ≥ 0.85 m(k) (38)
fr ∈{1,...,N } k=1 k=1
FIGURE 19. The diagrammatic representation of the proposed approach,
The rolloff frequency is denoted as fr , and the magnitude namely Synthetic Speech Attribution Transformer (SSAT) [160].

at the kth frequency bin is represented by m(k).


The computation of the Zero Crossing Rate: In Figure (19), the initial stage transforms speech into
Zero crossing rate: The zero crossing rate is a measure of a melspectrogram, emphasizing frequencies significant for
the frequency at which a signal crosses the zero axis. It is human hearing. It partitions the spectrum into mel bands and


computes power spectra within each. Mel bands are logarith-


mic, enhancing perceptual significance. The melspectrogram
is segmented for better classification precision, providing
additional data to the classifier. Each region is assigned a
vector with statistical speech signal characteristics. Positional
encoding is then added to each vector for the transformer
neural network to understand their relative positions. The
transformer network excels at capturing distant relationships
using self-attention.
The transformer processes vectors and encodings, generat-
ing concealed states. These states are aggregated to create a
single 768-dimensional representation for the auditory input.
Categorization is done through a linear layer with SoftMax
activation. It transforms the 768-dimensional representation
into a probability distribution over potential classes, ensuring
their total equals unity. The output includes a classification
label denoting speech origin and a confidence score reflecting
the classifier’s certainty.
However, ASVspoof 2021 [161] presents new challenges.
It introduces a category of compressed TTS and VC deepfake
samples without speaker verification or original speakers’
audios.
Arif et al. [162] introduced a new audio feature descrip-
tor, named ELTP-LFCC, which is created by merging two
existing techniques: Local Ternary Pattern (ELTP) and Linear
Frequency Cepstral Coefficients (LFCC).
The researchers utilized a Deep Bidirectional Long
Short-Term Memory (DBiLSTM) network in conjunction
with this descriptor to construct a model capable of detecting
fraudulent audio in diverse indoor and outdoor settings. The
ASVspoof 2019 dataset, comprising of artificially generated
FIGURE 20. The proposed framework’s architectural design [162].
and impersonation-based fraudulent audio, was utilized to
assess the efficacy of the model. The findings indicate that
the model exhibited greater efficacy in identifying artificially
generated audio (with an equal error rate of 0.74%) as com- sample c and the 10 adjacent audio samples i, through the
pared to samples produced through imitation (with an equal application of θ around c.
error rate of 33.28%). The auto-adaptive threshold is computed dynamically by
The block diagram can be explained as shown in utilizing the standard deviation of each frame.
Figure (20), where a bidirectional LSTM model classi- θ = α × σ (0 < α ≤ 1) (41)
fied using ELTP-LFCC features. Each BiLSTM layer had
64 units. Concatenated outputs were passed to a FC layer, The symbol σ denotes the standard deviation that is calcu-
then a softmax layer for classification. lated for every frame of the audio, while α represents a scaling
The suggested architecture integrated ELTP, LFCC, and factor. A linear search algorithm was utilized to optimize
BiLSTM to accurately detect logical access attacks in audio the scaling factor α by identifying the point of convergence
signals. within the interval of 0 and 1.
Extended Local Tertiary Patterns (ELTP): Linear Frequency Cepstral Coefficients (LFCC)
 The computation of the 20-dimensional Linear Frequency
  
 1,
 si ≥ c +θ Cepstral Coefficients (LFCC) involves the utilization of a
P si , c, θ = 0, si − θ < θ (40) series of linear filters on the Fast Fourier Transform (FFT)
of the audio signals.

 −1, si ≤ (c − θ)

(2k − 1) iπ
XK  
The acoustic signal is denoted by P si , c, θ , where c log (gk ) cos , 1≤i≤I

(42)
corresponds to the central sample of the frame F that has
k=1 2K
si neighbors. The neighbor index is represented by i, while In the given context, K denotes the quantity of filters while
the threshold is denoted by θ. The ELTP is computed by I represent the number of Local Feature Coding Coefficients
determining the magnitude difference between the central (LFCC) utilized. The final 40-dimensional ELTP-LFCC


feature vector was obtained by integrating the 20-dimensional


LFCC features with the 20-dimensional ELTP features.
Bidirectional Long-Term Short-Term Memory (BiLSTM):
BiLSTM’s calculation of the concealed vector and the
output vector
ht = H (Wxh xt + Whh ht−1 + Bh ) (43)
yt = Why ht + by (44)
The variables in the equation are denoted as follows:
Wrepresents the weight matrices, where Wxh specifically
denotes the input-hidden weight matrix. B represents the bias
vectors, with Bh representing the hidden bias vector. Finally,
FIGURE 21. Conceptual diagram of the proposed approach [10].
H represents the hidden function.
The computation involved in the Long Short-Term Mem-
ory (LSTM) cell pertains to the forget gate, input gate, output Subsequently, the (CNN) is trained on the histogram
gate, cell memory, and hidden vector. images by the system. The (CNN) acquires the ability to dis-
ft = σg Wxf × xt + Whf × ht−1 + Wcf × ct−1 + Bf

cern distinctive attributes within the images that are indicative
(45) of counterfeit audio recordings.
it = σg (Wxi × xt + Whi × ht−1 + Wci × ct−1 + Bi ) (46) Upon completion of the training process, (CNN) can be
employed to categorize novel audio data as either authentic
ot = σg (Wxo × xt + Who × ht−1 + Wco × ct + Bo ) (47) or counterfeit.
ct = ft ct−1 + it tanh (Wxc × xt + Whc × ht−1 + Bc ) (48) Binary crossentropy loss function
ht = ot tanh (ct ) (49) 1 XN
yi · log ŷi + (1 − yi ) · log 1 − ŷi
 
L(y, ŷ) = −
The hard-sigmoid function, denoted as σg , is utilized in the N i=0
context of the forget gate (f ), input gate (i), output gate (o), (51)
cell memory (c), and hidden vector (h). he selected loss function is binary crossentropy L(y, ŷ), which
combining the outputs of the forward and backward hidden is related to the dissimilarity in terms of entropy between two
sequences. data sequences, in our case, the entropy of the known labels

← yi , and the entropy of the predicted labels ŷ. This kind of loss
⃗ ht + W← h t + By
yt = Why (50)
hy function is very useful in binary classification problems.
⃗ The backward hidden RMSprop optimizer
A sequence that is forward hidden h,

sequence h and output sequence are obtained through an f (x) = max(0, x) (52)
iterative process that involves the forward layer being iterated
The optimizer is utilized for the purpose of training the
from t = 1 to T, and the backward layer being iterated from
(CNN). The approach in question pertains to a form of
t = T to 1.
the stochastic gradient descent algorithm that incorporates a
The equations presented encapsulate the fundamental prin-
rolling average of the squared gradients for the purpose of
ciples of the proposed methodology, encompassing both the
weight updating in a (CNN).
feature extraction components (ELTP and LFCC) and the
The Rectified Linear Unit (ReLU) activation function
classification aspect (BiLSTM) of the framework.
Ballesteros et al. [10] developed a classification model 1
f (x) = (53)
called Deep4SNet that employed a 2D CNN model (his- 1 + e−x
togram) to encode the audio dataset and discriminate between The (ReLU) activation functions are utilized for the con-
synthetic and imitation audios. This model was incredibly volutional and hidden layers due to their favorable balance
accurate, with an impressive 98.5% accuracy rate when it between computational cost and performance. The objective
came to identifying counterfeit and synthetic audio. of the (ReLU) is to eliminate negative values while permitting
Unfortunately, the performance of Deep4SNet was not positive values to propagate, as specified by Equation (52).
scalable and was negatively impacted by the process of data Sigmoid activation function
translation, thus limiting its potential applications.
As can be seen from Figure (21) The audio data is initially 1
f (x) = (54)
subjected to pre-processing, wherein it is transformed into a 1 + e−x
histogram image. The aforementioned process involves the The activation function of the final neuron is sigmoid,
segmentation of the audio signal into equidistant temporal as derived from Equation (54). The scientific community
intervals, followed by the computation of the frequency count widely recommends this type of activation for binary clas-
for each interval. sification problems.


The field of Deepfake identification has been expanded by PyTorch deep learning framework was utilized to implement
the release of the FakeAVCeleb dataset [163], the system, which underwent training on a dataset consisting
Khalid et al. [164] conducted an investigation into the effi- of 1000 audio deepfakes and 1000 authentic audio record-
cacy of unimodal techniques for detecting Deepfakes. Specif- ings. An assessment was conducted on a corpus comprising
ically, they evaluated the performance of five classifiers, 500 fabricated audio files and 500 authentic audio recordings,
namely MesoInception-4, Meso-4, Xception, EfficientNet- whereby the system attained a precision rate of 97%.
B0, and VGG16. The research aimed to evaluate the efficacy Figure (22) provides a visual representation of the block
of unimodal techniques in detecting Deepfakes. The Xception diagram for the system. The audio recording is used as
classifier demonstrated the highest level of efficiency, yield- an input to the system, which then extracts vector repre-
ing a 76% outcome, whereas the EfficientNet-B0 classifier sentations from the recording using a feature convolutional
exhibited the lowest level of performance, producing a result layer. The output of the system is the vector representations.
of 50%. Nevertheless, the investigation demonstrated that all After that, the vector representations are separated into audio
unimodal classifiers were unsuccessful in accurately detect- streams that are either masked or unmasked, depending on
ing counterfeit audio despite their endeavors. which state they are currently in. Encoding processes are
The model comprises of two distinct components, namely applied to the inputs after they have been unmasked. These
a visual network and an audio network. The (CNN) known as processes lead to the generation of significant and unin-
the visual network has been trained to detect visual anomalies terrupted latent representations. The TE is responsible for
present in deepfake videos. The (RNN) known as the audio deriving the contextualized representations, and it does so by
network has been trained to detect audio artifacts present in first receiving the encoded inputs that have not been masked
deepfake audio. and then using those inputs. After obtaining contextualized
The operational mechanism of the system commences with representations, the next step is to input them into the pro-
the preliminary processing of the input audio and video data. jection layer, which is responsible for projecting the ultimate
The preprocessing of video data involves the conversion context vector. After the context vector has been obtained, it is
of the data into a series of still images. The audio information then sent on to the ASP layer by utilizing the Mean (µ) metric
undergoes preprocessing through the transformation into a as the transmission method. The result that was produced by
sequence of (MFCC) features. Subsequently, the data that
has undergone preprocessing is inputted into the visual and
audio networks. The visual network generates a probability
score indicating the likelihood of the input video being a
deepfake. The audio network generates a probability score
indicating the likelihood of the input audio being a deepfake.
The ultimate likelihood of the input being a deepfake is
determined by the multiplication of the probabilities derived
from the visual and audio networks
A (CNN) architecture has been proposed by the authors
of a recent academic publication [165], with the intention
of addressing the issue of generalization that is frequently
experienced in deep learning models. Before the audio data
could be fed into the architecture of the CNN, it was first
transformed into scatter plot images of adjacent samples. This
was done so that it could be used to overcome the challenge.
On the Fake or Real (FoR) dataset [147], the accuracy of the
model was evaluated, and the results showed that it had a per-
formance of 88.9%. However, its accuracy of 88% and EER
of 11% were lower than those of other DL models tested in
the study. This indicates the need for additional development
as well as the inclusion of more data transformers in order to
improve its performance.
Almutairi and Elgibreen [166] Proposed a deep neural
network architecture for the purpose of identifying manip-
ulated audio content, commonly referred to as deepfakes.
The proposed model was derived from the HuBERT pre-
trained model, a substantial language model that underwent
training on an extensive corpus of unannotated speech data.
The model underwent fine-tuning using a dataset comprising FIGURE 22. The proposed method for detecting Arabic-language Audio
both audio deepfakes and authentic audio recordings. The Deepfake [166].


TABLE 4. Comparing the effectiveness of classical ML and DL models in detecting fake audio.

FIGURE 23. The effectiveness of classical machine learning and deep learning models in detecting fake audio.

the ASP layer is then sent onward to a dense layer that makes In the following equation, the cross-entropy loss function
use of a Tanh activation function. The result that is produced is specified. This is a method that is frequently utilized for the
by the dense layer is a forecast regarding the authenticity of training of models.
the audio recording, more specifically whether or not it is a
1 XM
deepfake. Jbce = − [ym × log(h∅(Xm)) + (1 − ym)
The Tanh activation function, which is employed in the M m=1

dense layer is: ×log(1 − h∅(Xm))] (56)

The notation M denotes the quantity of training examples


ex − e−x
 
f (x) = −(1.10) (55) in a given dataset. The label of the m training example is
ex + e−x represented by ym, while Xm denotes the inputs associated
with the m training example. The function h∅ is employed
The mathematical constant known as e is the foundation to represent the method that utilizes hidden neural network
upon which the natural logarithm is built. It is denoted by the weights ∅.
variable known as e. The value read from the input device is Table 4 and Figure 23 indicates that classical MLmodels,
represented by the variable x. namely logistic regression (LR), quadratic support vector


TABLE 5. Comparing the performance of various methods for audio deepfake detection.

machine (Q-SVM), and SVM, have been employed in sev- tested by Lataifeh et al. [158] on the Arabic Diversified Audio
eral investigations for detecting counterfeit audio, and have (AR-DAD) dataset [159], while the decision tree (DT) model
demonstrated notable levels of efficacy. Nevertheless, these had the lowest accuracy at 73.33%. The CNN model had a
models frequently necessitate substantial preprocessing and higher detection rate than the BiLSTM model, with 94.33%
feature extraction. Conversely, (CNNs) and bidirectional long accuracy.
short-term memory (BiLSTM) are alternative deep learn- The Siamese CNN model proposed by Lei et al. [157]
ing models that have been employed to address the same improved the min-tDCF and Equal Error Rate (EER) by
problem, yielding mixed outcomes. Several studies have approximately 55% when compared to other models, but its
reported varying results regarding the comparative robustness performance was slightly lower when using certain features.
of (CNNs) and Support Vector Machines (SVMs). While The CNN model developed in [165] and trained on the Fake
some studies have demonstrated the superior robustness of or Real (FoR) dataset [147] achieved an accuracy of 88.9%
CNNs, others have indicated that SVMs exhibit the highest and an EER of 11%.
accuracy among all tested models. The selection of a suitable It is worth mentioning that the performance of these meth-
model for detecting fake audio is contingent upon several fac- ods may vary depending on the specific dataset and evaluation
tors, such as the characteristics and magnitude of the dataset, criteria used. Further research is needed to improve the accu-
the intricacy of the features, and the extent of preprocessing racy and robustness of audio deepfake detection methods.
necessary.
Overall, it appears that both classical ML and DL models III. DATASETS
can be effective in detecting fake audio, with the choice of Baidu Dataset, Baidu is a tool for spotting replicated speech,
model depending on the specific task and dataset. and is a collection created by AI researchers at Baidu’s Sil-
Furthermore, the ability to accurately detect fake audio is icon Valley outpost [70]. There are ten authentic recordings
crucial in maintaining the integrity of information and pro- of human speech in this collection, along with 120 cloned
tecting against malicious intent. In Table 5, we will compare samples and four morphed samples.
the effectiveness of classical ML and DL, in detecting fake Mozilla TTS, The world’s largest publicly accessible
audio, the data will demonstrate the relative performance of database of speakers has been made available via the widely
each model. used open-source browser Mozilla Firefox [167]. As of 2019,
In addition to TABLE 5 the RES-EfficientCNN model, a the database currently has over 1,400 h of voice recordings in
(CNN) developed by Subramani and Rao [156], achieved a 18 different languages. The audio was recorded in 54 other
F1-score of 97.61 when tested on the ASV spoof challenge languages over 7,226 h. These 5.5 million audio samples were
2019 dataset [152]. The Deep4SNet model, which uses a 2D used using the Deep Speech Toolkit from Mozilla.
CNN model to classify imitation and synthetic audio, had Fake-or-Real (FOR), Another popular dataset utilized in
an accuracy of 98.5% in detecting such audio. The (SVM) SVR research is the FOR database [147]. Roughly 195,000
model had the highest accuracy at 99% among the ML models snippets of both human and AI-generated speech may be
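To make the classical pipeline concrete, the following minimal sketch reduces each clip to averaged MFCC statistics and trains an SVM with a quadratic kernel, mirroring the Q-SVM setting discussed above. The directory layout, sampling rate, and hyperparameters are illustrative assumptions and do not reproduce the exact configurations of the cited studies.

```python
# Minimal sketch of a classical ML baseline for fake-audio detection.
# Assumes a hypothetical folder layout data/{real,fake}/*.wav; this is
# not the exact setup of any study cited in this survey.
from pathlib import Path

import librosa
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def mfcc_features(wav_path, sr=16000, n_mfcc=20):
    """Summarize one clip as the mean and std of its MFCCs."""
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


features, labels = [], []
for label, name in enumerate(["real", "fake"]):          # 0 = real, 1 = fake
    for wav in Path("data", name).glob("*.wav"):
        features.append(mfcc_features(wav))
        labels.append(label)

X, y = np.array(features), np.array(labels)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# kernel="poly" with degree=2 approximates the quadratic SVM (Q-SVM);
# kernel="rbf" is the usual alternative.
clf = SVC(kernel="poly", degree=2, C=1.0)
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```

Deep detectors such as CNNs and BiLSTMs replace this hand-crafted summary with representations learned directly from spectrogram-like inputs, which is why they typically require less explicit feature engineering.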
III. DATASETS
TABLE 6. Comparing audio deepfake detection datasets.

Baidu Dataset: This collection, intended for spotting replicated (cloned) speech, was created by AI researchers at Baidu's Silicon Valley outpost [70]. It contains ten authentic recordings of human speech, along with 120 cloned samples and four morphed samples.

Mozilla TTS: The world's largest publicly accessible database of speakers has been made available by Mozilla, the developer of the open-source Firefox browser, through its Common Voice project [167]. As of 2019, the database contained over 1,400 h of voice recordings in 18 different languages, and audio totaling 7,226 h had been recorded in 54 other languages. These 5.5 million audio samples were used with Mozilla's DeepSpeech toolkit.

Fake-or-Real (FoR): Another popular dataset utilized in SVR research is the FoR database [147]. Roughly 195,000 snippets of both human and AI-generated speech may be found in this collection.
Human speech samples and examples of recently developed TTS techniques (such as Google WaveNet [85] and [126]) are available in this repository. The authors provide four variations of FoR: ‘‘for-original’’ (FO), ‘‘for-norm’’ (FN), ‘‘for-2seconds’’ (F2S), and ‘‘for-rerecorded’’ (FR). Audio in FO is not balanced across classes and has not been edited, whereas FN is balanced and normalized. F2S consists of FN data truncated to two seconds to simulate short, attack-style utterances, and FR is simply a re-recording of the F2S data.

ASVspoof 2019: One of the most well-known datasets for detecting fake audio [72], it is divided into two parts: one for analyzing physical access (PA) and the other for logical access (LA). Both the LA and PA partitions were created from audio samples of 107 unique speakers included in the VCTK base corpus (46 male and 61 female). LA contains examples of converted and cloned audio in addition to the original and reconstructed audio. Samples from 20, 10, and 48 speakers were assigned to the training, development, and evaluation partitions, respectively (the evaluation speakers comprising 21 males and 27 females). All source samples were recorded under the same conditions, notwithstanding the varying speaker categorizations. The evaluation set contains examples of unknown attacks, whereas spoofing cases of the same types and parameters are included in the development and training sets.

M-AILABS: The real-speech dataset created by M-AILABS [168] is widely utilized by TTS programs such as Deep Voice 3 [127]. The M-AILABS dataset contains 999 h and 32 min of audio, and many native speakers of nine languages contributed to its creation.

AR-DAD: This dataset of authentic and imitated Arabic speech was collected from a Holy Quran audio website and given the name Ar-DAD (Arabic Diversified Audio) [159]. It features both authentic and imitated recitations of the Quran; the audio covers 30 reciters from Arab countries and 12 imitators. More specifically, the reciters were male native speakers of Arabic from the United Arab Emirates, Yemen, Egypt, Kuwait, Sudan, and Saudi Arabia. The data comprise 15,810 real samples and 379 fake samples, each ten seconds long. The language of the dataset is referred to as Classical Arabic (CA), the register in which the Quran is written.

H-Voice (Histograms of Voice): Recent work produced the H-Voice dataset [169], which covers imitated and synthesized voices in languages including English, French, Tagalog, Portuguese, and Spanish. The samples are stored as histograms in PNG format. A total of 6,672 samples, organized into numerous subfolders, were collected: the imitation-based portion contains 3,332 real and 3,264 fake samples, while the synthetic-based portion contains four real and 72 fake samples. Deep Voice 3, which was used to generate the synthetic files, is freely accessible to the public.

FakeAVCeleb: The FakeAVCeleb dataset [163] is an innovative but limited dataset of English speakers whose fake samples were created with the SV2TTS tool through a synthetic generation process. It includes 20,490 samples, of which 490 are authentic and the remaining 20,000 are fake. Samples are provided in MP3 format and each lasts precisely seven seconds.
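Because the H-Voice samples described above are distributed as amplitude histograms saved as PNG images rather than as raw waveforms, the following minimal sketch shows how a clip could be reduced to such a histogram image; the bin count, figure size, and file names are assumptions for illustration and not the official H-Voice generation procedure.

```python
# Minimal sketch: reduce a waveform to an amplitude-histogram PNG,
# in the spirit of the H-Voice representation. Bin count and figure
# size are illustrative assumptions.
import librosa
import matplotlib
matplotlib.use("Agg")                 # render off-screen
import matplotlib.pyplot as plt


def waveform_to_histogram_png(wav_path, png_path, bins=100):
    y, _ = librosa.load(wav_path, sr=16000, mono=True)   # samples in [-1, 1]
    fig, ax = plt.subplots(figsize=(2, 2), dpi=64)       # small square image
    ax.hist(y, bins=bins, color="black")
    ax.axis("off")                                       # keep only the shape
    fig.savefig(png_path, bbox_inches="tight")
    plt.close(fig)


waveform_to_histogram_png("sample.wav", "sample_hist.png")
```

A 2D CNN such as Deep4SNet [10] can then treat detection as ordinary image classification over these histogram images.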
ADD: The Audio Deep Synthesis Detection (ADD) challenge [170] introduced a novel dataset aimed at identifying synthetic audio. The dataset comprises three distinct tracks: low-quality fake audio detection (LF), partially fake audio detection (PF), and the fake audio game (FG). The LF track comprises a total of 1,052 audio samples, predominantly of synthetic origin, whereas the PF track encompasses 300 authentic vocal recordings alongside 700 artificially generated utterances accompanied by ambient noise. The dataset has been curated in Chinese and is readily accessible to the public for use by researchers.
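The corpora above are generally consumed as labelled real/fake audio folders that are converted into time-frequency inputs for the deep detectors discussed earlier. The following minimal sketch illustrates that pipeline with log-mel spectrograms and a small Keras CNN, assuming a hypothetical FoR-style directory layout (for-norm/training/{real,fake}); the layer sizes and training settings are illustrative and do not correspond to any published detector.

```python
# Minimal sketch: log-mel spectrogram front-end + small 2D CNN detector.
# The directory layout (for-norm/training/{real,fake}) and all layer
# sizes are illustrative assumptions, not a published configuration.
from pathlib import Path

import librosa
import numpy as np
import tensorflow as tf

SR, CLIP_SAMPLES, N_MELS = 16000, 2 * 16000, 64   # 2-second clips


def logmel(wav_path):
    y, _ = librosa.load(wav_path, sr=SR, mono=True)
    y = np.pad(y, (0, max(0, CLIP_SAMPLES - len(y))))[:CLIP_SAMPLES]
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=N_MELS)
    return librosa.power_to_db(mel, ref=np.max)    # shape: (n_mels, frames)


def load_split(root):
    feats, labels = [], []
    for label, name in enumerate(["real", "fake"]):
        for wav in Path(root, name).glob("*.wav"):
            feats.append(logmel(wav))
            labels.append(label)
    x = np.array(feats)[..., np.newaxis]           # add channel axis
    return x, np.array(labels)


x_train, y_train = load_split("for-norm/training")
x_val, y_val = load_split("for-norm/validation")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=x_train.shape[1:]),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # P(fake)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=5, batch_size=32)
```

In practice, detectors trained this way are then scored with EER and t-DCF on held-out evaluation partitions such as those defined in ASVspoof 2019.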
IV. CONCLUSION AND DISCUSSION
Numerous investigations have been carried out to identify audio deepfakes using diverse DL methodologies, including DNNs, CNNs, and CRNNs. The Human Log-Likelihoods (HLLs) approach, which employs a DNN classifier, exhibited superior performance compared with the conventional GMM technique, achieving an equal error rate (EER) of 12.24% on the ASVspoof 2015 challenge dataset. Although the ASSERT technique, which relies on a Squeeze-Excitation network, and a residual CNN-based approach both produced encouraging outcomes, the two methods encountered challenges with generalization. By comparison, a framework combining transfer learning with ResNet-34 yielded the best results, as evidenced by an EER of 5.32% and a t-DCF of 0.11514% on the ASVspoof 2019 dataset.

The ASVspoof 2019 dataset was also evaluated using a Temporal Convolutional Network and a Spatial Transformer Network, which achieved accuracy rates of 92% and 80%, respectively. In addition, two CRNN-based models were produced, one of which outperformed the other by 0.132% in the tandem detection cost function (t-DCF) and by 4.27% in EER on the same dataset. An alignment technique that utilized three classification models was also proposed and exhibited satisfactory performance on the ASVspoof 2019 dataset.

Overall, the framework based on transfer learning with ResNet-34 demonstrated the strongest performance, attaining the lowest EER and t-DCF on the ASVspoof 2019 dataset.
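Since EER is the headline metric in the comparisons above, the following minimal sketch shows how it can be computed from a detector's output scores, assuming that higher scores indicate ‘‘more likely fake’’; the scores and labels shown are placeholder values.

```python
# Minimal sketch: Equal Error Rate (EER) from detector scores.
# EER is the operating point where the false-positive rate equals the
# false-negative (miss) rate. Scores and labels below are placeholders.
import numpy as np
from sklearn.metrics import roc_curve


def equal_error_rate(labels, scores):
    """labels: 1 = fake, 0 = real; scores: higher means more likely fake."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))     # closest crossing of the curves
    return (fpr[idx] + fnr[idx]) / 2.0        # average at the crossing point


labels = np.array([0, 0, 0, 1, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.5, 0.3])
print(f"EER = {100 * equal_error_rate(labels, scores):.2f}%")
```

The tandem t-DCF complements EER by weighting countermeasure errors according to their impact on a downstream automatic speaker verification system, which is why the two metrics are usually reported together.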
REFERENCES
[1] Z. Khanjani, G. Watson, and V. P. Janeja, ‘‘How deep are the fakes? Focusing on audio deepfake: A survey,’’ 2021, arXiv:2111.14203.
[2] X. Zhang and A. A. Ghorbani, ‘‘An overview of online fake news: Characterization, detection, and discussion,’’ Inf. Process. Manag., vol. 57, no. 2, Mar. 2020, Art. no. 102025, doi: 10.1016/j.ipm.2019.03.004.
[3] J. Shin, L. Jian, K. Driscoll, and F. Bar, ‘‘The diffusion of misinformation on social media: Temporal pattern, message, and source,’’ Comput. Hum. Behav., vol. 83, pp. 278–287, Jun. 2018, doi: 10.1016/j.chb.2018.02.008.
[4] G. Gravanis, A. Vakali, K. Diamantaras, and P. Karadais, ‘‘Behind the cues: A benchmarking study for fake news detection,’’ Exp. Syst. Appl., vol. 128, pp. 201–213, Aug. 2019, doi: 10.1016/j.eswa.2019.03.036.
[5] A. Bondielli and F. Marcelloni, ‘‘A survey on fake news and rumour detection techniques,’’ Inf. Sci., vol. 497, pp. 38–55, Sep. 2019, doi: 10.1016/j.ins.2019.05.035.
[6] S. Lyu, ‘‘DeepFake detection: Current challenges and next steps,’’ in Proc. IEEE Int. Conf. Multimedia Expo Workshops (ICMEW), London, U.K., 2020, pp. 1–6, doi: 10.1109/ICMEW46912.2020.9105991.
[7] N. Diakopoulos and D. Johnson, ‘‘Anticipating and addressing the ethical implications of deepfakes in the context of elections,’’ New Media Soc., vol. 23, no. 7, pp. 2072–2098, Jul. 2021, doi: 10.1177/1461444820925811.
[8] Y. Rodríguez-Ortega, D. M. Ballesteros, and D. Renza, ‘‘A machine learning model to detect fake voice,’’ in Applied Informatics (Communications in Computer and Information Science). Springer, 2020, pp. 3–13.
[9] T. Chen, A. Kumar, P. Nagarsheth, G. Sivaraman, and E. Khoury, ‘‘Generalization of audio deepfake detection,’’ in Proc. Speaker Lang. Recognit. Workshop (Odyssey), Tokyo, Japan, Nov. 2020, pp. 132–137, doi: 10.21437/odyssey.2020-19.
[10] D. M. Ballesteros, Y. Rodriguez-Ortega, D. Renza, and G. Arce, ‘‘Deep4SNet: Deep learning for fake speech classification,’’ Exp. Syst. Appl., vol. 184, Dec. 2021, Art. no. 115465, doi: 10.1016/j.eswa.2021.115465.
[11] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman, ‘‘Synthesizing Obama,’’ ACM Trans. Graph., vol. 36, no. 4, pp. 1–13, Aug. 2017, doi: 10.1145/3072959.3073640.
[12] C. Stupp. Fraudsters Used AI to Mimic CEO’s Voice in Unusual Cybercrime Case. Accessed: Jul. 15, 2022. [Online]. Available: https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402
[13] Z. Almutairi and H. Elgibreen, ‘‘A review of modern audio deepfake detection methods: Challenges and future directions,’’ Algorithms, vol. 15, no. 5, p. 155, May 2022, doi: 10.3390/a15050155.
[14] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. C. Ferrer, ‘‘The DeepFake detection challenge (DFDC) dataset,’’ 2020, arXiv:2006.07397.
[15] B. Malolan, A. Parekh, and F. Kazi, ‘‘Explainable deep-fake detection using visual interpretability methods,’’ in Proc. 3rd Int. Conf. Inf. Comput. Technol. (ICICT), Mar. 2020, pp. 289–293, doi: 10.1109/ICICT50521.2020.00051.
[16] E. C. Tandoc, D. Lim, and R. Ling, ‘‘Diffusion of disinformation: How social media users respond to fake news and why,’’ Journalism, vol. 21, no. 3, pp. 381–398, Mar. 2020, doi: 10.1177/1464884919868325.
[17] X. Zhou and R. Zafarani, ‘‘A survey of fake news: Fundamental theories, detection methods, and opportunities,’’ ACM Comput. Surv., vol. 53, no. 5, pp. 1–40, Sep. 2021, doi: 10.1145/3395046.
[18] C.-S. Atodiresei, A. Tanaselea, and A. Iftene, ‘‘Identifying fake news and fake users on Twitter,’’ Proc. Comput. Sci., vol. 126, pp. 451–461, Jan. 2018, doi: 10.1016/j.procs.2018.07.279.
[19] M. Aldwairi and A. Alwahedi, ‘‘Detecting fake news in social media networks,’’ Proc. Comput. Sci., vol. 141, pp. 215–222, Jan. 2018, doi: 10.1016/j.procs.2018.10.171.
[20] C. Sindermann, A. Cooper, and C. Montag, ‘‘A short review on susceptibility to falling for fake political news,’’ Current Opinion Psychol., vol. 36, pp. 44–48, Dec. 2020.
[21] D. D. Parsons. (2020). The Impact of Fake News on Company Value: Evidence From Tesla and Galena Biopharma. [Online]. Available: https://trace.tennessee.edu/utk_chanhonoproj
[22] D. Ippolito, D. Duckworth, C. Callison-Burch, and D. Eck, ‘‘Automatic detection of generated text is easiest when humans are fooled,’’ in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 1808–1822, doi: 10.18653/v1/2020.acl-main.164.
[23] D. I. Adelani, H. Mai, F. Fang, H. H. Nguyen, J. Yamagishi, and I. Echizen, ‘‘Generating sentiment-preserving fake online reviews using neural language models and their human- and machine-based detection,’’ in Advanced Information Networking and Applications (Advances in Intelligent Systems and Computing), vol. 1151. Springer, 2020, pp. 1341–1354, doi: 10.1007/978-3-030-44041-1_114.
[24] T. Fagni, F. Falchi, M. Gambini, A. Martella, and M. Tesconi, ‘‘TweepFake: About detecting deepfake tweets,’’ PLoS ONE, vol. 16, no. 5, May 2021, Art. no. e0251415, doi: 10.1371/journal.pone.0251415.
[25] D. Dukic, D. Keca, and D. Stipic, ‘‘Are you human? Detecting bots on Twitter using BERT,’’ in Proc. IEEE 7th Int. Conf. Data Sci. Adv. Analytics (DSAA), Oct. 2020, pp. 631–636, doi: 10.1109/DSAA49011.2020.00089.
[26] T. Toda, A. W. Black, and K. Tokuda, ‘‘Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,’’ IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 8, pp. 2222–2235, Nov. 2007, doi: 10.1109/TASL.2007.907344.
[27] L. Guarnera, O. Giudice, and S. Battiato. DeepFake Detection by Analyzing Convolutional Traces. Accessed: Dec. 17, 2022. [Online]. Available: https://www.Guarnera_DeepFake_Detection_by_Analyzing_Convolutional_Traces_CVPRW_2020_paper.pdf
[28] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. Efros, ‘‘Detect- [48] N. M. Müller, K. Pizzi, and J. Williams, ‘‘Human perception of audio
ing photoshopped faces by scripting photoshop,’’ in Proc. IEEE/CVF deepfakes,’’ 2021, arXiv:2107.09667.
Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 10071–10080, doi: [49] B. Chesney and D. Citron, ‘‘Deep fakes: A looming challenge for privacy,
10.1109/ICCV.2019.01017. democracy, and national security,’’ Calif. Law Rev., vol. 107, no. 6,
[29] J. C. Neves, R. Tolosana, R. Vera-Rodriguez, V. Lopes, H. Proença, pp. 1753–1820, 2019, doi: 10.15779/Z38RV0D15J.
and J. Fierrez, ‘‘GANprintR: Improved fakes and evaluation of the [50] M. R. Al-Mousa, N. A. Sweerky, G. Samara, M. Alghanim,
state of the art in face manipulation detection,’’ IEEE J. Sel. Top- A. S. I. Hussein, and B. Qadoumi, ‘‘General countermeasures of anti-
ics Signal Process., vol. 14, no. 5, pp. 1038–1048, Aug. 2020, doi: forensics categories,’’ in Proc. Global Congr. Electr. Eng. (GC-ElecEng),
10.1109/JSTSP.2020.3007250. Dec. 2021, pp. 5–10, doi: 10.1109/GC-ElecEng52322.2021.9788230.
[30] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo, ‘‘Face X-
[51] G. C. Kessler, ‘‘Anti-forensics and the digital investigator,’’
ray for more general face forgery detection,’’ in Proc. IEEE/CVF Conf.
in Proc. 5th Aust. Digit. Forensics Conf., 2007, pp. 1–7, doi:
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 5000–5009, doi:
10.4225/75/57ad39ee7ff25.
10.1109/CVPR42600.2020.00505.
[31] H. Dang, F. Liu, J. Stehouwer, X. Liu, and A. K. Jain, ‘‘On the [52] T. Qazi, K. Hayat, S. U. Khan, S. A. Madani, I. A. Khan, J. Kolodziej,
detection of digital face manipulation,’’ in Proc. IEEE/CVF Conf. Com- H. Li, W. Lin, K. C. Yow, and C. Xu, ‘‘Survey on blind image forgery
put. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 5780–5789, doi: detection,’’ IET Image Process., vol. 7, no. 7, pp. 660–670, Oct. 2013,
10.1109/CVPR42600.2020.00582. doi: 10.1049/iet-ipr.2012.0388.
[32] T. Khakhulin, V. Sklyarova, V. Lempitsky, and E. Zakharov, ‘‘Realistic [53] K. Hayat and T. Qazi, ‘‘Forgery detection in digital images via discrete
one-shot mesh-based head avatars,’’ 2022, arXiv:2206.08343. wavelet and discrete cosine transforms,’’ Comput. Electr. Eng., vol. 62,
[33] J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, M. Stamminger, and pp. 448–458, Aug. 2017, doi: 10.1016/j.compeleceng.2017.03.013.
C. Theobalt, ‘‘Real-time expression transfer for facial reenactment,’’ [54] J. Kim, J. Kong, and J. Son, ‘‘Conditional variational autoen-
ACM Trans. Graph., vol. 34, no. 6, pp. 1–14, Nov. 2015. coder with adversarial learning for end-to-end text-to-speech,’’ 2021,
[34] K. R. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. V. Jawahar, arXiv:2106.06103.
‘‘A lip sync expert is all you need for speech to lip generation in the [55] K. Zhou, B. Sisman, R. Liu, and H. Li, ‘‘Seen and unseen emotional
wild,’’in Proc. 28th ACM Int. Conf. Multimedia, 2020, pp. 484–492, doi: style transfer for voice conversion with a new emotional speech dataset,’’
10.1145/3394171.3413532. in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP),
[35] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, ‘‘StarGAN: Jun. 2021, pp. 920–924, doi: 10.1109/ICASSP39728.2021.9413391.
Unified generative adversarial networks for multi-domain image-to- [56] G. Hua, A. B. J. Teoh, and H. Zhang, ‘‘Towards end-to-end
image translation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern synthetic speech detection,’’ IEEE Signal Process. Lett., vol. 28,
Recognit., Jun. 2018, pp. 8789–8797, doi: 10.1109/CVPR.2018.00916. pp. 1265–1269, 2021, doi: 10.1109/LSP.2021.3089437.
[36] Y. Choi, Y. Uh, J. Yoo, and J. W. Ha, ‘‘StarGAN v2: Diverse image
[57] A. Chintha, B. Thai, S. J. Sohrawardi, K. Bhatt, A. Hickerson, M. Wright,
synthesis for multiple domains,’’ in Proc. IEEE/CVF Conf. Comput. Vis.
and R. Ptucha, ‘‘Recurrent convolutional structures for audio spoof and
Pattern Recognit., Jun. 2020, pp. 8188–8197.
video deepfake detection,’’ IEEE J. Sel. Topics Signal Process., vol. 14,
[37] R. Huang, S. Zhang, T. Li, and R. He, ‘‘Beyond face rotation: Global
no. 5, pp. 1024–1037, Aug. 2020, doi: 10.1109/JSTSP.2020.2999185.
and local perception GAN for photorealistic and identity preserving
frontal view synthesis,’’ in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2019, [58] J. Volkmann, S. S. Stevens, and E. B. Newman, ‘‘A scale for the measure-
pp. 2439–2448. ment of the psychological magnitude pitch,’’ J. Acoust. Soc. Amer., vol. 8,
[38] G. Y. Kang, Y. P. Feng, R. K. Wang, and Z. M. Lu, ‘‘Edge and feature p. 208, Jan. 1937, doi: 10.1121/1.1901999.
points based video intra-frame passive-blind copy-paste forgery detec- [59] E. R. Bartusiak and E. J. Delp, ‘‘Frequency domain-based detection of
tion,’’ J. Netw. Intell., vol. 6, no. 3, pp. 637–645, 2021. generated audio,’’ Electron. Imag., vol. 33, no. 4, pp. 1–7, Jan. 2021.
[39] G. Singh and K. Singh, ‘‘Chroma key foreground forgery detection under [60] E. R. Bartusiak and E. J. Delp, ‘‘Synthesized speech detection using
various attacks in digital video based on frame edge identification,’’ convolutional transformer-based spectrogram analysis,’’ in Proc. 55th
Multimedia Tools Appl., vol. 81, no. 1, pp. 1419–1446, Jan. 2022, doi: Asilomar Conf. Signals, Syst., Comput., Oct. 2021, pp. 1426–1430, doi:
10.1007/s11042-021-11380-3. 10.1109/IEEECONF53345.2021.9723142.
[40] O. Alamayreh and M. Barni, ‘‘Detection of GAN-synthesized street [61] J. Khochare, C. Joshi, B. Yenarkar, S. Suratkar, and F. Kazi, ‘‘A deep
videos,’’ in Proc. 29th Eur. Signal Process. Conf. (EUSIPCO), Aug. 2021, learning framework for audio deepfake detection,’’ Arabian J. Sci. Eng.,
pp. 811–815, doi: 10.23919/EUSIPCO54536.2021.9616262. vol. 47, no. 3, pp. 3447–3458, Mar. 2022, doi: 10.1007/s13369-021-
[41] V. Kumar, A. Singh, V. Kansal, and M. Gaur, ‘‘A comprehensive sur- 06297-w.
vey on passive video forgery detection techniques,’’ in Recent Studies [62] K. Bhagtani, A. K. S. Yadav, E. R. Bartusiak, Z. Xiang, R. Shao,
on Computational Intelligence (Studies in Computational Intelligence), S. Baireddy, and E. J. Delp, ‘‘An overview of recent work in multimedia
vol. 921. Singapore: Springer, 2021, pp. 39–57, doi: 10.1007/978-981- forensics,’’ in Proc. IEEE 5th Int. Conf. Multimedia Inf. Process. Retr.,
15-8469-5_4. Aug. 2022, pp. 324–329, doi: 10.1109/MIPR54900.2022.00064.
[42] C. Rathgeb, R. Tolosana, R. Vera-Rodriguez, and C. Busch, Advances [63] F. Akdeniz and Y. Becerikli, ‘‘Detection of copy-move forgery in audio
in Computer Vision and Pattern Recognition Handbook of Digital Face signal with Mel Frequency and Delta-Mel Frequency Kepstrum Coef-
Manipulation and Detection From DeepFakes to Morphing Attacks. ficients,’’ in Proc. Innov. Intell. Syst. Appl. Conf. (ASYU), Oct. 2021,
Springer, 2022. pp. 1–6, doi: 10.1109/ASYU52992.2021.9598977.
[43] D. Güera and E. J. Delp, ‘‘Deepfake video detection using recurrent
[64] M. Todisco, H. Delgado, and N. Evans, ‘‘Constant Q cepstral coef-
neural networks,’’ in Proc. 15th IEEE Int. Conf. Adv. Video Signal Based
ficients: A spoofing countermeasure for automatic speaker verifica-
Surveill. (AVSS), Nov. 2018, pp. 1–6, doi: 10.1109/AVSS.2018.8639163.
tion,’’ Comput. Speech Lang., vol. 45, pp. 516–535, Sep. 2017, doi:
[44] D. Güera, ‘‘Media forensics using machine learning approaches,’’ Doc-
10.1016/j.csl.2017.01.001.
toral dissertation, Dept. Elect. Comput. Eng., Purdue Univ., West
Lafayette, IN, USA, 2019. [65] Q. Yan, R. Yang, and J. Huang, ‘‘Robust copy-move detection of
[45] D. M. Montserrat, H. Hao, S. K. Yarlagadda, S. Baireddy, R. Shao, speech recording using similarities of pitch and formant,’’ IEEE Trans.
J. Horváth, E. Bartusiak, J. Yang, D. Güera, F. Zhu, and E. J. Delp, ‘‘Deep- Inf. Forensics Security, vol. 14, no. 9, pp. 2331–2341, Sep. 2019, doi:
fakes detection with automatic face weighting,’’ in Proc. IEEE/CVF 10.1109/TIFS.2019.2895965.
Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, [66] F. Hassan and A. Javed, ‘‘Voice spoofing countermeasure for synthetic
pp. 2851–2859. speech detection,’’ in Proc. Int. Conf. Artif. Intell. (ICAI), Apr. 2021,
[46] L. Bondi, E. Daniele Cannas, P. Bestagini, and S. Tubaro, ‘‘Train- pp. 209–212, doi: 10.1109/ICAI52203.2021.9445238.
ing strategies and data augmentations in CNN-based DeepFake video [67] R. K. Das, J. Yang, and H. Li, ‘‘Long range acoustic features for spoofed
detection,’’ in Proc. IEEE Int. Workshop Inf. Forensics Secur. (WIFS), speech detection,’’ in Proc. INTERSPEECH, 2019, pp. 1058–1062, doi:
Dec. 2020, pp. 1–6, doi: 10.1109/WIFS49906.2020.9360901. 10.21437/Interspeech.2019-1887.
[47] H. H. Nguyen, J. Yamagishi, and I. Echizen, ‘‘Capsule-forensics networks [68] J. Li, H. Wang, P. He, S. M. Abdullahi, and B. Li, ‘‘Long-term variable
for deepfake detection,’’ in Handbook of Digital Face Manipulation Q transform: A novel time-frequency transform algorithm for syn-
and Detection: From DeepFakes to Morphing Attacks. Berlin, Germany: thetic speech detection,’’ Digit. Signal Process., vol. 120, Jan. 2022,
Springer, 2022. Art. no. 103256, doi: 10.1016/j.dsp.2021.103256.
[69] X. Li, N. Li, C. Weng, X. Liu, D. Su, D. Yu, and H. Meng, ‘‘Replay [89] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang,
and synthetic speech detection with Res2Net architecture,’’ in Proc. I. L. Moreno, and Y. Wu, ‘‘Transfer learning from speaker verification to
IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Jun. 2021, multispeaker text-to-speech synthesis,’’ in Proc. Adv. Neural Inf. Process.
pp. 6354–6358, doi: 10.1109/ICASSP39728.2021.9413828. Syst., 2018, pp. 4480–4490.
[70] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, ‘‘Neural voice cloning [90] J. Cong, S. Yang, L. Xie, G. Yu, and G. Wan, ‘‘Data efficient voice
with a few samples,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 110, cloning from noisy samples with domain adversarial training,’’ in Proc.
2018, pp. 10019–10029. Interspeech, Oct. 2020, pp. 811–815, doi: 10.21437/interspeech.2020-
[71] J. Lorenzo-Trueba, F. Fang, X. Wang, I. Echizen, J. Yamagishi, and 2530.
T. Kinnunen, ‘‘Can we steal your vocal identity from the internet: Initial [91] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang,
investigation of cloning Obama’s voice using GAN, WaveNet and low- J. Raiman, and J. Miller, ‘‘Deep voice 3: Scaling text-to-speech with con-
quality found data,’’ in Proc. Speaker Lang. Recognit. Workshop, 2018, volutional sequence learning,’’ in Proc. 6th Int. Conf. Learn. Represent.,
pp. 240–247, doi: 10.21437/odyssey.2018-34. 2018, pp. 1–16.
[72] X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, [92] A. Van Den, N. Kalchbrenner, L. Espeholt, O. Vinyals, and A. Graves,
M. Sahidullah, V. Vestman, T. Kinnunen, K. A. Lee, and L. Juvela, ‘‘Conditional image generation with PixelCNN decoders,’’ in Proc. Adv.
‘‘ASVspoof 2019: A large-scale public database of synthesized, con- Neural Inf. Process. Syst., 2016, pp. 4790–4798. [Online]. Available:
verted and replayed speech,’’ Comput. Speech Lang., vol. 64, Nov. 2020, https://papers.nips.cc/paper/6527-conditional-image-generation-with-
Art. no. 101114, doi: 10.1016/j.csl.2020.101114. pixelcnn-decoders.pdf
[73] M. Swan. WaveNet: A Generative Model for Raw Audio. Accessed: [93] A. Van Den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals,
Feb. 23, 2023. [Online]. Available: https://research.google/pubs/ K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg,
pub45774/ and N. Casagrande, ‘‘Parallel WaveNet: Fast high-fidelity speech syn-
[74] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, thesis,’’ in Proc. 35th Int. Conf. Mach. Learn. (ICML), vol. 9, 2018,
Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, pp. 6270–6278.
R. Clark, and R. A. Saurous, ‘‘Tacotron: Towards end-to-end speech [94] S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky,
synthesis,’’ in Proc. Interspeech, Aug. 2017, pp. 4006–4010, doi: Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, and S. Sengupta, ‘‘Deep voice:
10.21437/interspeech.2017-1452. Real-time neural text-to-speech,’’ in Proc. Adv. Neural Inf. Process. Syst.,
[75] Z. Jin, G. J. Mysore, S. Diverdi, J. Lu, and A. Finkelstein, ‘‘VoCo: Text- 2017, pp. 2963–2971.
based insertion and replacement in audio narration,’’ ACM Trans. Graph., [95] S. O. Arik and J. Miller, ‘‘Deep voice 2: Multi-speaker neural text-to-
vol. 36, no. 4, pp. 1–13, Aug. 2017, doi: 10.1145/3072959.3073702. speech,’’ in Proc. NIPS, vol. 1, 2017, pp. 1–9.
[76] Audacity. accessed: Sep. 9, 2020. [Online]. Available: https://www. [96] Y. Yasuda, X. Wang, S. Takaki, and J. Yamagishi, ‘‘Investigation
audacityteam.org of enhanced tacotron text-to-speech synthesis systems with self-
attention for pitch accent language,’’ in Proc. IEEE Int. Conf. Acoust.,
[77] J. Damiani. A Voice Deepfake Was Used To Scam A CEO Out Of $243,000.
Speech Signal Process. (ICASSP), May 2019, pp. 6905–6909, doi:
Accessed: Sep. 6, 2020. [Online]. Available: https://www.forbes.com/
10.1109/ICASSP.2019.8682353.
sites/jessedamiani/2019/09/03/a-voice-deepfake-was-used-to-scam-a-
ceo-out-of-243000/ [97] Y. Lee, T. Kim, and S.-Y. Lee, ‘‘Voice imitating text-to-speech neural
networks,’’ 2018, arXiv:1806.00927.
[78] A. Leung. NVIDIA Reveals That Part of Its CEO’s Keynote Presentation
Was Deepfaked. Accessed: Aug. 29, 2021. [Online]. Available: https:// [98] H.-T. Luong and J. Yamagishi, ‘‘NAUTILUS: A versatile voice cloning
hypebeast.com/2021/8/nvidia-deepfake-jensen-huang-omniverse- system,’’ IEEE/ACM Trans. Audio, Speech, Language Process., vol. 28,
keynote-video pp. 2967–2981, 2020, doi: 10.1109/TASLP.2020.3034994.
[99] Y. Chen, Y. Assael, B. Shillingford, D. Budden, S. Reed, H. Zen, Q. Wang,
[79] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and
L. C. Cobo, A. Trask, B. Laurie, and C. Gulcehre, ‘‘Sample efficient
Y. Bengio, ‘‘Char2Wav: End-to-end speech synthesis,’’ in Proc. 5th Int.
adaptive text-to-speech,’’ in Proc. 7th Int. Conf. Learn. Represent., 2019,
Conf. Learn. Represent., 2015, pp. 1–6.
pp. 1–16.
[80] B. Sisman, J. Yamagishi, S. King, and H. Li, ‘‘An overview of voice
[100] Z. Yi, W. C. Huang, X. Tian, J. Yamagishi, R. K. Das, T. Kinnunen,
conversion and its challenges: From statistical modeling to deep learn-
Z. Ling, and T. Toda, ‘‘Voice conversion challenge 2020—Intra-lingual
ing,’’ IEEE/ACM Trans. Audio, Speech, Language Process., vol. 29,
semi-parallel and cross-lingual voice conversion,’’ in Proc. Joint Work-
pp. 132–157, 2021, doi: 10.1109/TASLP.2020.3038524.
shop Blizzard Challenge Voice Convers. Challenge, 2020, pp. 80–98, doi:
[81] P. Partila, J. Tovarek, G. H. Ilk, J. Rozhon, and M. Voznak, ‘‘Deep 10.21437/vcc_bc.2020-14.
learning serves voice cloning: How vulnerable are automatic speaker
[101] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio,
vulnerable systems to Spoofing trials?’’ IEEE Commun. Mag., vol. 58,
T. Kinnunen, and Z. Ling, ‘‘The voice conversion challenge 2018:
no. 2, pp. 100–105, Feb. 2020, doi: 10.1109/MCOM.001.1900396.
Promoting development of parallel and nonparallel methods,’’ in
[82] R. Natsume, T. Yatagawa, and S. Morishima, ‘‘RSGAN: Face swapping Proc. Speaker Lang. Recognit. Workshop, 2018, pp. 195–202, doi:
and editing using face and hair representation in latent spaces,’’ 2009, 10.21437/odyssey.2018-28.
arXiv:1804.03447. [102] Y. Stylianou, O. Cappe, and E. Moulines, ‘‘Continuous probabilis-
[83] Faceswap-GAN. Accessed: May 29, 2022. [Online]. Available: tic transform for voice conversion,’’ IEEE Trans. Speech Audio
https://github.com/shaoanlu/faceswap-GAN Process., vol. 6, no. 2, pp. 131–142, Mar. 1998, doi: 10.1109/89.
[84] R. Natsume, T. Yatagawa, and S. Morishima, ‘‘RSGAN: Face swap- 661472.
ping and editing using face and hair representation in latent spaces,’’ [103] E. Helander, H. Silen, T. Virtanen, and M. Gabbouj, ‘‘Voice con-
in Proc. ACM SIGGRAPH Posters, 2018, pp. 1–2, Art. no. 69, doi: version using dynamic kernel partial least squares regression,’’ IEEE
10.1145/3230744.3230818. Trans. Audio, Speech, Language Process., vol. 20, no. 3, pp. 806–817,
[85] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, Mar. 2012, doi: 10.1109/TASL.2011.2165944.
A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, [104] Z. Wu, T. Virtanen, E. S. Chng, and H. Li, ‘‘Exemplar-based sparse repre-
‘‘WaveNet: A generative model for raw audio,’’ 2016, arXiv:1609. sentation with residual compensation for voice conversion,’’ IEEE/ACM
03499. Trans. Audio, Speech, Language Process., vol. 22, no. 10, pp. 1506–1521,
[86] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, ‘‘VoiceLoop: Voice Oct. 2014, doi: 10.1109/TASLP.2014.2333242.
fitting and synthesis via a phonological loop,’’ in Proc. 6th Int. Conf. [105] T. Nakashika, T. Takiguchi, and Y. Ariki, ‘‘High-order sequence mod-
Learn. Represent., 2018, pp. 1–14. eling using speaker-dependent recurrent temporal restricted Boltzmann
[87] I. Korshunova, J. Dambre, and L. Theis, ‘‘Fast face-swap using convolu- machines for voice conversion,’’ in Proc. Interspeech, Sep. 2014,
tional neural networks,’’ in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 2278–2282, doi: 10.21437/interspeech.2014-447.
pp. 3677–3685. [106] L. Sun, S. Kang, K. Li, and H. Meng, ‘‘Voice conversion using
[88] Y. Nirkin, Y. Keller, and T. Hassner, ‘‘FSGAN: Subject agnostic deep bidirectional long short-term memory based recurrent neural
face swapping and reenactment,’’ in Proc. IEEE/CVF Int. Conf. Com- networks,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Pro-
put. Vis. (ICCV), Oct. 2019, pp. 7183–7192, doi: 10.1109/ICCV.2019. cess. (ICASSP), Apr. 2015, pp. 4869–4873, doi: 10.1109/ICASSP.2015.
00728. 7178896.
[107] H. Ming, D. Huang, L. Xie, J. Wu, M. Dong, and H. Li, ‘‘Deep [126] N. Li, D. Tuo, D. Su, Z. Li, and D. Yu, ‘‘Deep discriminative embeddings
bidirectional LSTM modeling of timbre and prosody for emotional for duration robust speaker verification,’’ in Proc. Interspeech, Sep. 2018,
voice conversion,’’ in Proc. Interspeech, Sep. 2016, pp. 2453–2457, doi: pp. 2262–2266.
10.21437/interspeech.2016-1053. [127] J.-C. Chou and H.-Y. Lee, ‘‘One-shot voice conversion by
[108] J. Wu, Z. Wu, and L. Xie, ‘‘On the use of I-vectors and average voice separating speaker and content representations with instance
model for voice conversion without parallel data,’’ in Proc. Asia–Pacific normalization,’’ in Proc. Interspeech, Sep. 2019, pp. 664–668, doi:
Signal Inf. Process. Assoc. Annu. Summit Conf. (APSIPA), Dec. 2016, 10.21437/interspeech.2019-2663.
pp. 1–6, doi: 10.1109/APSIPA.2016.7820901. [128] Y. Rebryk and S. Beliaev, ‘‘ConVoice: Real-time zero-shot voice style
[109] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, ‘‘WaveNet transfer with convolutional network,’’ 2020, arXiv:2005.07815.
vocoder with limited training data for voice conversion,’’ in Proc. [129] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,
Interspeech, Sep. 2018, pp. 1983–1987, doi: 10.21437/interspeech.2018- A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, ‘‘WaveNet:
1190. A generative model for raw audio,’’ 2016, pp. 5206–5210.
[110] P.-C. Hsu, C.-H. Wang, A. T. Liu, and H.-Y. Lee, ‘‘Towards robust neural [130] M. R. Kamble, H. B. Sailor, H. A. Patil, and H. Li, ‘‘Advances in anti-
vocoding for speech generation: A survey,’’ 2019, arXiv:1912.02461. spoofing: From the perspective of ASVspoof challenges,’’ APSIPA Trans.
[111] T. Kaneko and H. Kameoka, ‘‘CycleGAN-VC: Non-parallel voice con- Signal Inf. Process., vol. 9, p. e2, Jan. 2020, doi: 10.1017/ATSIP.2019.21.
version using cycle-consistent adversarial networks,’’ in Proc. 26th [131] J. Yi, Y. Bai, J. Tao, H. Ma, Z. Tian, C. Wang, T. Wang, and R. Fu, ‘‘Half-
Eur. Signal Process. Conf. (EUSIPCO), Sep. 2018, pp. 2100–2104, doi: truth: A partially fake audio detection dataset,’’ in Proc. Interspeech,
10.23919/EUSIPCO.2018.8553236. Aug. 2021, pp. 2683–2687, doi: 10.21437/interspeech.2021-930.
[112] M. Zhang, B. Sisman, L. Zhao, and H. Li, ‘‘DeepConversion: Voice con- [132] P. R Aravind, U. Nechiyil, and N. Paramparambath, ‘‘Audio spoofing ver-
version with limited parallel training data,’’ Speech Commun., vol. 122, ification using deep convolutional neural networks by transfer learning,’’
pp. 31–43, Sep. 2020, doi: 10.1016/j.specom.2020.05.004. 2020, arXiv:2008.03464.
[113] W.-C. Huang, H. Luo, H.-T. Hwang, C.-C. Lo, Y.-H. Peng, Y. Tsao, and [133] J. Monteiro, J. Alam, and T. H. Falk, ‘‘Generalized end-to-end
H.-M. Wang, ‘‘Unsupervised representation disentanglement using cross detection of spoofing attacks to automatic speaker recognizers,’’
domain features and adversarial learning in variational autoencoder based Comput. Speech Lang., vol. 63, Sep. 2020, Art. no. 101096, doi:
voice conversion,’’ IEEE Trans. Emerg. Topics Comput. Intell., vol. 4, 10.1016/j.csl.2020.101096.
no. 4, pp. 468–479, Aug. 2020, doi: 10.1109/TETCI.2020.2977678. [134] Y. Gao, T. Vuong, M. Elyasi, G. Bharaj, and R. Singh, ‘‘Generalized
[114] J. Chorowski, R. J. Weiss, S. Bengio, and A. Van Den Oord, ‘‘Unsu- spoofing detection inspired from audio generation artifacts,’’ in Proc.
pervised speech representation learning using wavenet autoencoders,’’ Interspeech, Aug. 2021, pp. 3691–3695, doi: 10.21437/interspeech.2021-
IEEE/ACM Trans. Audio, Speech, Language Process., vol. 27, no. 12, 1705.
pp. 2041–2053, Dec. 2019. [135] Z. Zhang, X. Yi, and X. Zhao, ‘‘Fake speech detection using
[115] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, residual network with transformer encoder,’’ in Proc. ACM Work-
‘‘Voice conversion from unaligned corpora using variational autoencod- shop Inf. Hiding Multimedia Secur., Jun. 2021, pp. 13–22, doi:
ing Wasserstein generative adversarial networks,’’ in Proc. Interspeech, 10.1145/3437880.3460408.
Aug. 2017, pp. 3364–3368, doi: 10.21437/interspeech.2017-63. [136] R. K. Das, J. Yang, and H. Li, ‘‘Data augmentation with signal com-
[116] F. Fang, J. Yamagishi, I. Echizen, and J. Lorenzo-Trueba, ‘‘High- panding for detection of logical access attacks,’’ in Proc. IEEE Int. Conf.
quality nonparallel voice conversion based on cycle-consistent Acoust., Speech Signal Process., Jun. 2021, pp. 6349–6353.
adversarial network,’’ in Proc. IEEE Int. Conf. Acoust., Speech [137] M. Aljasem, A. Irtaza, H. Malik, N. Saba, A. Javed, K. M. Malik,
Signal Process. (ICASSP), Apr. 2018, pp. 5279–5283, doi: and M. Meharmohammadi, ‘‘Secure automatic speaker verification
10.1109/ICASSP.2018.8462342. (SASV) system through sm-ALTP features and asymmetric bagging,’’
[117] K. Tanaka, T. Kaneko, N. Hojo, and H. Kameoka, ‘‘Synthetic-to-natural IEEE Trans. Inf. Forensics Security., vol. 16, pp. 3524–3537, 2021, doi:
speech waveform conversion using cycle-consistent adversarial net- 10.1109/TIFS.2021.3082303.
works,’’ in Proc. IEEE Spoken Lang. Technol. Workshop, Dec. 2018, [138] H. Ma, J. Yi, J. Tao, Y. Bai, Z. Tian, and C. Wang, ‘‘Continual learning for
pp. 632–639. fake audio detection,’’ in Proc. Interspeech, Aug. 2021, pp. 1748–1752,
[118] J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee, ‘‘Multi-target voice doi: 10.21437/interspeech.2021-794.
conversion without parallel data by adversarially learning disentangled [139] C. Borrelli, P. Bestagini, F. Antonacci, A. Sarti, and S. Tubaro, ‘‘Synthetic
audio representations,’’ in Proc. Interspeech, Sep. 2018, pp. 501–505, speech detection through short-term and long-term prediction traces,’’
doi: 10.21437/interspeech.2018-1830. EURASIP J. Inf. Secur., vol. 2021, no. 1, pp. 1–14, Dec. 2021, doi:
[119] R. Yamamoto, E. Song, and J.-M. Kim, ‘‘Parallel wavegan: A fast 10.1186/s13635-021-00116-3.
waveform generation model based on generative adversarial networks [140] E. A. AlBadawy, S. Lyu, and H. Farid, ‘‘Detecting AI-synthesized speech
with multi-resolution spectrogram,’’ in Proc. IEEE Int. Conf. Acoust., using bispectral analysis,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis.
Speech Signal Process. (ICASSP), May 2020, pp. 6199–6203, doi: Pattern Recognit. Work., Jun. 2019, pp. 104–109.
10.1109/ICASSP40776.2020.9053795. [141] A. K. Singh and P. Singh, ‘‘Detection of AI-synthesized speech using
[120] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, ‘‘AttS2S-VC: cepstral & bispectral statistics,’’ in Proc. IEEE 4th Int. Conf. Multimedia
Sequence-to-sequence voice conversion with attention and context Inf. Process. Retr. (MIPR), Sep. 2021, pp. 412–417.
preservation mechanisms, NTT Corporation, Japan,’’ in Proc. IEEE Int. [142] H. Malik and R. Changalvala, ‘‘Fighting AI with AI: Fake speech detec-
Conf. Acoustics, Speech Signal Process., May 2019, pp. 6805–6809. tion using deep learning,’’ in Proc. AES Int. Conf., Jun. 2019, pp. 1–9.
[121] S.-W. Park, D.-Y. Kim, and M.-C. Joe, ‘‘Cotatron: Transcription- [143] L. Huang and C.-M. Pun, ‘‘Audio replay spoof attack detection
guided speech encoder for Any-to-Many voice conversion without by joint segment-based linear filter bank feature extraction and
parallel data,’’ in Proc. Interspeech, Oct. 2020, pp. 4696–4700, doi: attention-enhanced DenseNet-BiLSTM network,’’ IEEE/ACM Trans.
10.21437/interspeech.2020-1542. Audio, Speech, Language Process., vol. 28, pp. 1813–1825, 2020, doi:
[122] W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, and T. Toda, 10.1109/TASLP.2020.2998870.
‘‘Voice transformer network: Sequence-to-Sequence voice conversion [144] Z. Wu, R. K. Das, J. Yang, and H. Li, ‘‘Light convolutional neu-
using transformer with text-to-speech pretraining,’’ in Proc. Interspeech, ral network with feature genuinization for detection of synthetic
Oct. 2020, pp. 4676–4680, doi: 10.21437/interspeech.2020-1066. speech attacks,’’ in Proc. Interspeech, Oct. 2020, pp. 1101–1105, doi:
[123] H. Lu, Z. Wu, D. Dai, R. Li, S. Kang, J. Jia, and H. Meng, ‘‘One-shot 10.21437/interspeech.2020-1810.
voice conversion with global speaker embeddings,’’ in Proc. Interspeech, [145] Y. Zhang, F. Jiang, and Z. Duan, ‘‘One-class learning towards syn-
2021, pp. 669–673. thetic voice spoofing detection,’’ IEEE Signal Process. Lett., vol. 28,
[124] R. Gontijo Lopes, S. Fenu, and T. Starner, ‘‘Data-free knowledge distil- pp. 937–941, 2021, doi: 10.1109/LSP.2021.3076358.
lation for deep neural networks,’’ 2017, arXiv:1710.07535. [146] R. Wang, F. Juefei-Xu, Y. Huang, Q. Guo, X. Xie, L. Ma, and Y. Liu,
[125] T. H. Huang, J. H. Lin, and H. Y. Lee, ‘‘How far are we from robust voice ‘‘DeepSonar: Towards effective and robust detection of AI-synthesized
conversion: A survey,’’ in Proc. IEEE Spoken Lang. Technol. Workshop, fake voices,’’ in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020,
Jan. 2021, pp. 514–521, doi: 10.1109/SLT48900.2021.9383498. pp. 1207–1216, doi: 10.1145/3394171.3413716.
[147] R. Reimao and V. Tzerpos, ‘‘FoR: A dataset for synthetic speech detec- [169] D. M. Ballesteros, Y. Rodriguez, and D. Renza, ‘‘A dataset of histograms
tion,’’ in Proc. Int. Conf. Speech Technol. Human-Computer Dialogue of original and fake voice recordings (H-Voice),’’ Data Brief, vol. 29,
(SpeD), Oct. 2019, pp. 1–10, doi: 10.1109/SPED.2019.8906599. Apr. 2020, Art. no. 105331, doi: 10.1016/j.dib.2020.105331.
[148] H. Yu, Z.-H. Tan, Z. Ma, R. Martin, and J. Guo, ‘‘Spoofing [170] J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai,
detection in automatic speaker verification systems using DNN C. Fan, S. Liang, S. Wang, S. Zhang, X. Yan, L. Xu, Z. Wen, and H. Li,
classifiers and dynamic acoustic features,’’ IEEE Trans. Neural ‘‘ADD 2022: The first audio deep synthesis detection challenge,’’ in Proc.
Netw. Learn. Syst., vol. 29, no. 10, pp. 4633–4644, Oct. 2018, doi: IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2022,
10.1109/TNNLS.2017.2771947. pp. 9216–9220, doi: 10.1109/ICASSP43922.2022.9746939.
[149] A. Gomez-Alanis, A. M. Peinado, J. A. Gonzalez, and A. M. Gomez,
‘‘A light convolutional GRU-RNN deep feature extractor for ASV spoof-
ing detection,’’ in Proc. Interspeech. 2021, pp. 1068–1072.
[150] C. I. Lai, N. Chen, J. Villalba, and N. Dehak, ‘‘ASSERT: Anti-spoofing
with squeeze-excitation and residual networks,’’ in Proc. Interspeech,
Sep. 2019, pp. 1013–1017, doi: 10.21437/Interspeech.2019-1794.
[151] M. Alzantot, Z. Wang, and M. B. Srivastava, ‘‘Deep residual neural
networks for audio spoofing detection,’’ in Proc. Interspeech, Sep. 2019,
pp. 1078–1082, doi: 10.21437/interspeech.2019-3174. OUSAMA A. SHAABAN received the B.S.
[152] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, degree in computer maintenance from the Col-
A. Nautsch, J. Yamagishi, N. Evans, T. H. Kinnunen, and K. A. Lee,
lege of Computer Technology, Tripoli, Libya,
‘‘ASVspoof 2019: Future horizons in spoofed and fake audio
in 2003, and the M.S. degree in information tech-
detection,’’ in Proc. Interspeech, Sep. 2019, pp. 1008–1012, doi:
10.21437/interspeech.2019-2249.
nology from Universiti Utara Malaysia (UUM),
Kedah, Malaysia, in 2015. He is currently pur-
[153] M. Shan and T. Tsai, ‘‘A cross-verification approach for protecting world
leaders from fake and tampered audio,’’ 2020, arXiv:2010.12173. suing the Ph.D. degree in computer engineering
with Ankara Yıldırım Beyazıt University, Ankara,
[154] R. L. P. C. Wijethunga, D. M. K. Matheesha, A. A. Noman,
K. H. A. De Silva, M. Tissera, and L. Rupasinghe, ‘‘Deepfake audio Turkey. His research interests include augmented
detection: A deep learning based solution for group conversations,’’ in reality, machine learning, and DL.
Proc. 2nd Int. Conf. Advancements Comput. (ICAC), vol. 1, Dec. 2020,
pp. 192–197, doi: 10.1109/ICAC51239.2020.9357161.
[155] Z. Jiang, H. Zhu, L. Peng, W. Ding, and Y. Ren, ‘‘Self-supervised spoofing
audio detection scheme,’’ in Proc. Interspeech, 2020, pp. 4223–4227.
[156] N. Subramani and D. Rao, ‘‘Learning efficient representations for
fake speech detection,’’ in Proc. 34th AAAI Conf. Artif. Intell., 2020,
pp. 5859–5866, doi: 10.1609/aaai.v34i04.6044.
[157] Z. Lei, Y. Yang, C. Liu, and J. Ye, ‘‘Siamese convolutional neu-
ral network using Gaussian probability feature for spoofing speech REMZI YILDIRIM received the B.S. and M.S.
detection,’’ in Proc. Interspeech, Oct. 2020, pp. 1116–1120, doi: degrees in electronics and computer engineering
10.21437/interspeech.2020-2723. from Gazi University, Ankara, Turkey, in 1988 and
[158] M. Lataifeh, A. Elnagar, I. Shahin, and A. B. Nassif, ‘‘Arabic audio 1993, respectively, and the Ph.D. degree in elec-
clips: Identification and discrimination of authentic cantillations from tronics from Erciyes University, Kayseri, Turkey,
imitations,’’ Neurocomputing, vol. 418, pp. 162–177, Dec. 2020, doi: in 1996. From 1999 to 2002, he was a Visiting
10.1016/j.neucom.2020.07.099. Scholar with the Massachusetts Institute of Tech-
[159] M. Lataifeh and A. Elnagar, ‘‘Ar-DAD: Arabic diversified audio nology, Boston, MA, USA. From 2004 to 2005,
dataset,’’ Data Brief, vol. 33, Dec. 2020, Art. no. 106503, doi: he was with Liverpool University, U.K. He is
10.1016/j.dib.2020.106503. currently a Professor with the Department of Com-
[160] E. R. Bartusiak and E. J. Delp, ‘‘Frequency domain-based detection of puter Engineering, Ankara Yıldırım Beyazıt University, Ankara. He has
generated audio,’’ in Proc. Electron. Imag., Soc. Imag. Sci. Technol., 2022, published more than 100 journal articles and conference papers and 21 books.
pp. 273–281. His research interests include optoelectronics, cyber-physical systems, and
[161] H. E. Delgado. ASVspoof 2021. Accessed: Aug. 6, 2021. [Online]. Avail- model checking.
able: https://www.asvspoof.org/
[162] T. Arif, A. Javed, M. Alhameed, F. Jeribi, and A. Tahir, ‘‘Voice spoofing
countermeasure for logical access attacks detection,’’ IEEE Access, vol. 9,
pp. 162857–162868, 2021, doi: 10.1109/ACCESS.2021.3133134.
[163] H. Khalid, S. Tariq, M. Kim, and S. S. Woo, ‘‘FakeAVCeleb: A novel
audio-video multimodal deepfake dataset,’’ 2021, arXiv:2108.05080.
[164] H. Khalid, M. Kim, S. Tariq, and S. S. Woo, ‘‘Evaluation of an audio-
video multimodal deepfake dataset using unimodal and multimodal
detectors,’’ in Proc. 1st Workshop Synth. Multimedia-Audiovisual Deep-
fake Gener. Detection, Oct. 2021, pp. 7–15. ABUBAKER A. ALGUTTAR received the B.Sc.
degree in computer science from Benghazi Uni-
[165] S. Camacho, D. M. Ballesteros, and D. Renza, ‘‘Fake speech recognition
using deep learning,’’ in Applied Computer Sciences in Engineering versity, in 1993, and the M.Eng.Sc. degree in
(Communications in Computer and Information Science), vol. 1431. computer science and engineering and the M.T.M.
Cham, Switzerland: Springer, 2021, pp. 38–48. degree in technology management from the Uni-
[166] Z. Almutairi and H. Elgibreen, ‘‘Detecting fake audio of Arabic versity of New South Wales, Sydney, NSW,
speakers using self-supervised deep learning,’’ IEEE Access, vol. 11, Australia, in 2004 and 2005, respectively. He is
pp. 72134–72147, 2023, doi: 10.1109/ACCESS.2023.3286864. currently pursuing the Ph.D. degree in computer
[167] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, engineering with Ankara Yıldırım Beyazıt Univer-
R. Morais, L. Saunders, F. M. Tyers, and G. Weber, ‘‘Common voice: sity, Ankara, Turkey. From 2006 to 2018, he was a
A massively-multilingual speech corpus,’’ in Proc. 12th Int. Conf. Lang. Lecturer with the College of Information Technology, Misurata University,
Resour. Eval. Conf., 2020, pp. 4218–4222. Misurata, Libya. His research interests include machine and DL, and fake
[168] The M-AILABS Speech Dataset. Accessed: Feb. 25, 2021. [Online]. Avail- news detection.
able: https://www.caito.de/2019/01/the-m-ailabs-speech-dataset