Deepfakes Generation and Detection: State-of-the-Art, Open Challenges, Countermeasures, and Way Forward - Springer 2022
Abstract
Easy access to audio-visual content on social media, combined with the availability of modern tools such as TensorFlow and Keras, open-source trained models, economical computing infrastructure, and the rapid evolution of deep-learning (DL) methods have heralded a new and frightening trend. In particular, the advent of easily available and ready-to-use Generative Adversarial Networks (GANs) has made it possible to generate deepfake media that are partially or completely fabricated with the intent to deceive: to disseminate disinformation and revenge porn, to perpetrate financial fraud and other hoaxes, and to disrupt government functioning. Existing surveys have mainly focused on the detection of deepfake images and videos; this paper provides a comprehensive review and detailed analysis of existing tools and machine learning (ML) based approaches for deepfake generation, and of the methodologies used to detect such manipulations in both audio and video. For each category of deepfake, we discuss information related to manipulation approaches, current public datasets, and key standards for the evaluation of the performance of deepfake detection techniques, along with their results. Additionally, we discuss open challenges and enumerate future directions to guide researchers on issues which need to be considered in order to improve the domains of both deepfake generation and detection. This work is expected to assist readers in understanding how deepfakes are created and detected, along with their current limitations and where future research may lead.
Keywords Artificial intelligence · Deepfakes · Deep learning · Face swap · Lip-synching · Puppet-master · Speech synthesis · Voice conversion
disinformation around the globe and may pose a severe threat, in the form of fake news, in the future [2], if they have not already.

Multimedia content as evidence is the current standard of proof in every sector of the legal world. It goes without saying that the audio-visual content admitted as evidence must be authentic and its integrity must be verified. At the same time, the introduction of easy-to-use manipulation tools (e.g. Zao [3], REFACE [4], FaceApp [5], Audacity [6], Soundforge [7]) has increased the perceived realism of fabricated data, which makes the authentication and integrity verification of such content even more challenging. Soon deepfakes are expected to be routinely used as weapons of disinformation, which will lead to a loss of credibility in state institutions, electronic media, and others due to the inability of common people to differentiate between original and fake videos. Moreover, the emergence of machine-generated text, along with manipulated audio-visual data, on social sites will bring more devastating effects and mislead decision-makers [8]. Currently, most of the existing multimedia forensic examiners focus on the challenge of analyzing multimedia files from social networks and sharing websites, e.g., YouTube, Facebook, etc. Satisfying the authentication and integrity requirements when flagging manipulated videos on social media is a challenging task because sophisticated deepfake generation algorithms with the potential to create more realistic fake videos have become more readily available.

Deepfake videos can be categorized into the following types: i) face-swap, ii) lip-synching, iii) puppet-master, iv) face synthesis and attribute manipulation, and v) audio-only deepfakes. In face-swap deepfakes, the face of the source person is replaced with the face of a victim to generate a fake video of the victim doing what the source person has actually done. Face-swap-oriented deepfakes usually target a famous person by showing them in scenarios in which they never appeared in order to damage their reputation in the face of the public, for example, in non-consensual pornography. In lip-synching-based deepfakes, the movement of the target person's lips is manipulated to make them consistent with a specific audio recording so that the victim appears to say whatever is in the recording. In puppet-master deepfakes, video is created which mimics the expressions of the target person, such as eye movement, facial expressions, and head movement. Puppet-master deepfakes aim to hijack the source person's expression, or even full body, in a video in order to animate it according to the impersonator's desire [9]. Face synthesis and attribute manipulation involve the generation of photo-realistic face images as well as facial attribute editing. This manipulation has been used to spread disinformation on social media using fake profiles. Lastly, audio deepfakes focus on the generation of the target speaker's voice using deep learning techniques to portray the speaker saying something they have not said [10, 11]. The fake voices can be generated using either text-to-speech synthesis (TTS) or voice conversion (VC). TTS aims to produce natural and intelligible voice waveforms, based on the provided text, that sound as if they have been spoken by the target identity. VC techniques transform the speech signal produced by a source speaker to seem like it was spoken by a target speaker while keeping the linguistic contents intact.

Unlike deepfake videos, less attention has been paid to the detection of audio deepfakes. In the last few years, voice manipulation has also become very sophisticated. Synthetic voices are not only a threat to automated speaker verification systems, but also to voice-controlled systems deployed in the Internet of Things (IoT) [12, 13]. Voice cloning has tremendous potential to destroy public trust and to empower criminals to manipulate business dealings, even private phone calls. For example, recently a case was reported in which bank robbers cloned a company executive's speech to dupe their subordinates into transferring hundreds of thousands of dollars into a secret account [14]. Voice cloning is expected to become a unique challenge in the future of deepfake detection. Therefore, it is important that, unlike current approaches that focus only on detecting video signal manipulations, audio forgeries should also be examined.

Most of the existing surveys focus only on reviewing deepfake still image and video detection [15-17]. There is no recently published survey on deepfakes that specifically focuses on the generation and detection of both audio and video. The discussion of generic image manipulation and multimedia forensic techniques was addressed in detail in [18]; however, deepfake generation techniques were not included. In [19], an overview of face manipulation and detection techniques was presented. Another survey, [20], reviewed visual deepfake detection approaches but does not discuss speech manipulation and its detection. The latest work, presented by Mirsky et al. [21], gives an in-depth analysis of visual deepfake creation techniques. Deepfake detection approaches are, however, only briefly discussed, and moreover, it lacks a discussion of audio deepfakes. To the best of our knowledge, this paper is the first attempt to provide a detailed analysis and review of both audio and visual deepfake detection techniques and generative approaches. The following are the main contributions of our work:

i. To give the research community an insight into the various types of video and audio-based deepfake generation and detection methods.
ii. To provide the reader with the latest improvements, trends, limitations, and challenges in the field of audio-visual deepfakes.
iii. To give an understanding to the reader about the possible implications of audio-visual deepfakes.
iv. To act as a guide to the reader to understand the future trends of audio and visual deepfakes.
1.1 Literature collection and selection criteria

In this survey, we reviewed the existing publications that propose techniques for the generation and detection of manipulated audio and video. A detailed description of the approach and protocols employed for the review is given in Table 1 and Figs. 1 and 2.

The rest of the paper is organized as follows: Section 2 presents a discussion of deepfakes as a source of disinformation. In Section 3, the history and evolution of deepfakes are briefly discussed. Section 4 presents an overview of state-of-the-art audio and visual deepfake generation and detection techniques. Section 5 presents the details of available datasets used for both audio and video deepfake detection. We have identified the open challenges for both audio-visual deepfake generation and detection in Section 6. In Section 7, we have discussed the possible future trends of both deepfake generation and detection, and finally, we conclude our work in Section 8.

2 Disinformation and misinformation using deepfakes

Misinformation is defined as false or inaccurate information that is communicated, regardless of an intention to deceive, whereas disinformation is the set of strategies employed to fabricate original "information" in order to achieve planned political or financial objectives, and is becoming increasingly prevalent. Because of the extensive use of social media platforms, it is now very easy to spread false information [22]. Although all categories of fake multimedia (i.e. video, images, and audio) could be sources of both disinformation and misinformation, audiovisual-based deepfakes are expected to be much more devastating. Historically, deepfakes were created to defame or discredit public figures. For example, in 2017 a female celebrity faced such a situation when her fake pornographic video was circulated in cyberspace [20]. This is evidence that deepfakes can be used to damage reputations, i.e., the character assassination of renowned people in order to defame them [20], the blackmail of individuals for monetary benefits, or the creation of political or religious unrest by targeting politicians or religious figures with fake video/speech [23], etc. This damage is not limited to targeting individuals; rather, deepfakes can be used to manipulate elections or even, theoretically, to start wars or to deceive military analysts with fake information, and so on. Deepfakes are expected to advance these archetypes of disinformation and misinformation to the next level.
Table 1 Approach and protocols employed for the literature review

Purpose: • To provide a brief overview of existing state-of-the-art techniques and identify potential gaps in both audio-visual deepfake generation and detection. • To provide a systematic review and structure to the existing state-of-the-art techniques with respect to each category of audio-visual deepfake generation and detection.

Data sources: Google Scholar, Springer Link, ACM Digital Library, IEEE Xplore, and DBLP.

Query: A methodical approach was designed to systematically utilize the data sources mentioned above, and the following query strings were used: Deepfakes / Faceswap / Face reenactment / lip-syncing / Deepfakes AND Faceswap / Deepfakes AND Face reenactment / Deepfakes AND lip-syncing / GAN synthesized / face manipulation / Attribute Manipulation / GAN AND Puppet Mastery / GAN AND Expression Manipulation / Video Synthesis / Audio synthesis / Deep learning AND TTS / Deep learning AND Voice Conversion / Deep learning AND Voice Cloning / Deepfakes AND Dataset / Deepfakes AND Audio / Deepfakes AND Video / Deepfakes AND image.

Method: We have systematically categorized the literature on video and audio deepfakes as follows (Fig. 1):
a) Video deepfake generation and detection into the following categories: face swap, lip-syncing, puppet-mastery, entire face synthesis, and facial attribute manipulation.
b) Audio deepfake generation and detection into the following categories: text-to-speech synthesis and voice conversion.

Size: A total of 436 papers were retrieved using the method and query mentioned above from the listed data sources up to 03-15-2022. We selected only those studies that were relevant and included the criterion 'deepfakes' in the positive set. Other relevant publications, where 'deepfake' was not in the positive set, were included in the negative set. All other studies, i.e., white papers and articles, were excluded from the final selection of papers. The number of publications in the deepfake research area and their distribution by category and year are presented in Fig. 2. It can be observed that the majority of the related articles are from conferences and informal publications, i.e., arXiv. This is because the articles that have been published in top-tier journals make up only a small portion of the total currently available research.

Study types / inclusion and exclusion: Peer-reviewed journal papers and articles from conference proceedings were given more importance. Additionally, articles from archive literature were also considered.
Fig. 2 Number of papers in the area of deepfake research: (a) year-wise publication count, and (b) the number of publications per year belonging to the studied categories, obtained from Google Scholar

Trolls Trolls are hobbyists who spread inflammatory information solely to cause disorder or to get a reaction [14], for example, by posting audio-visual manipulated racist or sexist content to promote hatred. Similarly, during the 2020 US presidential campaign, conflicting narratives about Trump and Biden were circulated on social media, contributing to an environment of fear [24]. In contrast to independent trolls, who spread disinformation for their own satisfaction, hired trolls do the same for monetary benefit. Different actors, like political parties, businessmen, and companies, routinely hire people to forge news related to their competitors and spread it in the market [25]. Deepfake videos generated by hired trolls are the latest weapon in the ongoing fabricated news war and can bring a more devastating effect on society [26].

Bots Bots are automated software or algorithms used to spread fabricated or misleading content among people [27]. A study published in [28] concluded that during the 2016 US presidential election, bots generated one-fifth of the tweets during the last month of the campaign. The emergence of deepfakes has bolstered the negative impact of bots; recently, for instance, bots on the messaging app Telegram were used to post nude pictures of women [14].

Conspiracy theorists Conspiracy theorists range from nonprofessional filmmakers to Reddit agents who spread vague and doubtful claims on the internet, either through "documentaries" or by posting stories and memes [29]. Recently, several conspiracy theorists have connected the COVID-19 pandemic with China [30]. In such a situation, the use of fabricated audio-visual deepfake content by these theorists can increase controversy in global politics.

Hyper-partisan media Hyper-partisan media includes fake news websites and blogs which intentionally spread false information to a specific political demographic. Because of the extensive usage of social media, hyper-partisan media is one of the biggest potential incubators for the spread of fabricated news [31]. Convincing AI-generated fake content assists these bloggers to easily spread disinformation, to attract visitors, or to increase views. As social platforms are largely independent and ad-driven mediums, spreading fabricated information may purely be a profit-making strategy [32].

Politicians One of the main sources of disinformation is the political parties themselves, which may spread manipulated information for point-scoring. Due to a large number of followers on social platforms, politicians are central nodes in online networks, so they may use their fame and public support to spread false news among their followers. To defame opponent parties, politicians may use deepfakes to post controversial content about their competitors on conventional media [29].

Foreign governments As the Internet has converted the world into a "Global Village," it has become easier for conflicting countries to spread false news to target the reputation of any
country in the world. Many countries are running government-sponsored social media accounts, websites, and applications, contributing to political propaganda globally [14]. These non-state actors are anticipated to become more active in this sector as deepfake techniques cut the costs of online propaganda. This raises the risk that extremist groups skilled in information warfare may exploit the technology and initiate foreign attacks on their own to increase the stress among countries.

3 DeepFakes evolution

The earliest example of manipulated multimedia content occurred in 1860, when a portrait of southern politician John Calhoun was skillfully manipulated by replacing his head with that of the US President for propaganda purposes [33]. Usually, such manipulation is accomplished by adding (splicing), removing (inpainting), and replicating (copy-move) objects within or between two images [18]. Then, suitable post-processing steps, such as scaling, rotation, and color adjustment, are applied to improve the visual appearance, scale, and perspective coherence [34].

Aside from these traditional manipulation methods, advancements in computer graphics and deep learning (DL) techniques now offer a variety of different automated approaches for digital manipulation with better semantic consistency. A recent trend involves the synthesis of videos from scratch using autoencoders or generative adversarial networks (GANs) for different applications [35] and, more specifically, photorealistic human face generation based on any attribute [36-39]. Another pervasive manipulation, called "shallow fakes" or "cheap fakes," comprises audio-visual manipulations created using cheaper and more accessible software. Shallow fakes involve basic editing of a video utilizing slowing, speeding, cutting, and selectively splicing together unaltered existing footage, which can alter the whole context of the information delivered. In May 2019, a video of US Speaker Nancy Pelosi was selectively edited to make it appear that she was slurring her words and was drunk or confused [14]. The video was shared on Facebook and received more than 2.2 million views within 48 hours. Video manipulation for the entertainment industry, specifically in film production, has been done for decades. Figure 3 shows the evolution of deepfakes over the years. An early notable academic project was the Video Rewrite Program [40], intended for applications in movie dubbing and published in 1997. It was the first software able to automatically reanimate facial movements in an existing video to a different audio track, and it achieved surprisingly convincing results.

The first true deepfake appeared online in September 2017, when a Reddit user named "deepfake" posted a series of computer-generated videos of famous actresses with their faces swapped onto pornographic content [20]. Another notorious deepfake case was the release of the DeepNude application, which allowed users to generate fake nude images [41]. This was the point at which deepfakes gained wider recognition within a large community. Today, deepfake technology and applications, e.g. FakeApp [42], FaceSwap [43], and ZAO [3], are easily accessible, and users without a computer engineering background can create a fake video within seconds. Moreover, open-source projects on GitHub, such as DeepFaceLab [44], and related tutorials are easily available on YouTube. A list of other available deepfake creation applications, software, and open-source projects is given in Table 2. Contemporary academic projects that led to the development of deepfake technology are Face2Face [38] and Synthesizing Obama [37], published in 2016 and 2017
Table 2 An overview of audio-visual deepfake generation software, applications, and open-source projects

Tool | Type | Source/Developer | Underlying technique

Cheap fakes
Adobe Premiere | Commercial desktop software | Adobe | Audio/video editing, AI-powered video reframing
Corel VideoStudio | Commercial desktop software | Corel | Proprietary AI

Lip-sync
Dynalips | Commercial web app | www.dynalips.com/ | Proprietary
CrazyTalk | Commercial web app | www.reallusion.com/crazytalk/ | Proprietary
Wav2Lip | Open-source implementation | github.com/Rudrabha/Wav2Lip | GAN with pre-trained discriminator network and visual quality loss function

Facial attribute manipulation
FaceApp | Mobile app | FaceApp Inc | Deep generative CNNs
Adobe | Commercial desktop software | Adobe | DNNs + filters
Rosebud | Commercial web app | www.rosebud.ai/ | Proprietary AI

Face swap
ZAO | Mobile app | Momo Inc | Proprietary
REFACE | Mobile app | Neocortext, Inc | Proprietary
Reflect | Mobile app | Neocortext, Inc | Proprietary
Impressions | Mobile app | Synthesized Media, Inc. | Proprietary
FakeApp | Desktop app | www.malavida.com/en/soft/fakeapp/ | GAN
FaceSwap | Open-source implementation | faceswapweb.com/ | Two separate encoder-decoder pairs with shared encoder parameters
DFaker | Open-source implementation | github.com/dfaker/df | DSSIM loss function for facial reconstruction; Keras library-based implementation
DeepFaceLab | Open-source implementation | github.com/iperov/DeepFaceLab | Several face extraction methods, e.g. dlib, MTCNN, S3FD, etc.; extends different Faceswap models, i.e. H64, H128, LIAEF128, SAE [44]
FaceSwapGAN | Open-source implementation | github.com/shaoanlu/faceswap-GAN | Adds two loss functions, namely adversarial loss and perceptual loss, to the auto-encoder
DeepFake-tf | Open-source implementation | github.com/StromWine/DeepFake-tf | Same as DFaker, but implemented in TensorFlow
Faceswapweb | Commercial web app | faceswapweb.com/ | GAN

Face reenactment
Face2Face | Open-source implementation | web.stanford.edu/~zollhoef/papers/CVPR2016_Face2Face/page.html | Uses 3DMM and ML techniques
Dynamixyz | Commercial desktop software | www.dynamixyz.com/ | Machine learning
FaceIT3 | Open-source implementation | github.com/alew3/faceit_live3 | GAN

Face generation
Generated Photos | Commercial web app | generated.photos/ | StyleGAN

Voice synthesis
Overdub | Commercial web app | www.descript.com/overdub | Proprietary (AI-based)
Respeecher | Commercial web app | www.respeecher.com/ | Combines traditional digital signal processing algorithms with proprietary deep generative modeling techniques
SV2TTS | Open-source implementation | github.com/CorentinJ/Real-Time-Voice-Cloning | LSTM with generalized end-to-end loss
ResembleAI | Commercial web app | www.resemble.ai/ | Proprietary (AI-based)
Voicery | Commercial web app | www.voicery.com/ | Proprietary AI and deep learning
VoiceApp | Mobile app | Zoezi AB | Proprietary (AI-based)
respectively. Face2Face [38] captures the real-time facial expressions of the source person as they talk into a commodity webcam. It modifies the target person's face in the original video to depict them mimicking the source facial expressions. Synthesizing Obama [37] is a "video rewrite 2.0" program, used to modify mouth movements in video footage of a person in order to depict the person "saying" the words contained in an arbitrary audio clip. These works [37, 38] are focused on the manipulation of the head and facial region only. Recent development expands the application of deepfakes to the entire body [9, 45, 46], the generation of deepfakes from a single image [47-50], and temporally smooth video synthesis [51].

Most of the deepfakes currently present on social platforms like YouTube, Facebook or Twitter may be regarded as harmless, entertaining, or artistic. There are also some examples, however, where deepfakes have been used for revenge porn, hoaxes, political or non-political influence, and financial fraud [52]. In 2018, a deepfake video went viral online in which former U.S. President Barack Obama appeared to insult the then-current president, Donald Trump [53]. In June 2019, a fake video of Facebook CEO Mark Zuckerberg was posted to Instagram by the Israeli advertising company "Canny" [52]. More recently, extremely realistic deepfake videos of Tom Cruise posted on the TikTok platform gained 1.4 million views within just a few days [54].

Apart from visual manipulation, audio deepfakes are a new form of cyber-attack, with the potential to cause severe damage to individuals due to highly sophisticated speech synthesis techniques, e.g. WaveNet [55], Tacotron [56], and Deep Voice [57]. Fake audio-assisted financial scams increased significantly in 2019 as a direct result of the progression in speech synthesis technology. In August 2019, a European company's chief executive officer, tricked by an audio deepfake, made a fraudulent transfer of $243,000 [58]. A voice-mimicking AI software was used to clone the voice patterns of the victim by training ML algorithms on audio recordings obtained from the internet. If such techniques can be used to imitate the voice of a top government official or a military leader and applied at scale, they could have serious national security implications [59].

4 Audio-visual deepfake types and categorization of the literature

This section provides an in-depth analysis of existing state-of-the-art methods for audio and visual deepfakes. A review of each category of deepfake in terms of creation and detection is provided to give a deeper understanding of the various approaches. We provide a critical investigation of existing literature which includes the technologies, their capabilities, limitations, challenges, and future trends for both deepfake creation and detection. Deepfakes are broadly categorized into two groups, visual and audio manipulations, depending on the targeted forged modality (Fig. 1). Visual deepfakes are further grouped into the following types based on manipulation level: (i) face swap or identity swap, (ii) lip-syncing, (iii) face reenactment or puppet-mastery, (iv) entire face synthesis, and (v) facial attribute manipulation. Audio deepfakes are further classified as (i) text-to-speech synthesis and (ii) voice conversion.

Numerous models have been created to perform video manipulation. For manipulating both audio and video, different variants and combinations of GANs and encoder-decoder architectures are used. We have presented a generic pipeline for deepfake generation in Fig. 4. To perform manipulation, an image or audio of the target identity and a conditioning source input, such as an image, video, sketch map, etc., are used. First, the facial region is detected and then cropped, before translating both the target face and the source data into intermediate representations such as deep features, facial landmark keypoints, UV maps, and 3D morphable model parameters. The intermediate representations are then passed to different synthesis models, or combinations of models, such as GANs [1], encoder-decoders, the Pix2Pix network [60], and RNN/LSTM. For audio deepfake generation, the input can be either text or a voice signal. In the case of text input, a linguistic analyzer is used to generate linguistic features such as phonemes, duration, and other granularities. The obtained features are then passed to an acoustic analyzer to produce intermediate representations, e.g., MCCs (mel-cepstral coefficients), MGCs (mel-generalized coefficients), or mel-spectrograms, that are later used to generate the output audio waveform. Finally, the output is acquired by re-rendering the generated face into the target frame. Figure 5 shows the general processing steps for the detection of audio-visual deepfakes. Most of the deepfake detection approaches have employed either handcrafted-feature-based or deep-learning-based methods for feature extraction. A few approaches fuse both handcrafted and deep features, or use multiple modalities, i.e. both the audio and visual signals, for effective manipulation detection. The computed features are then used to classify the input media as real or fake. In the following sub-sections, we analyze the above-mentioned manipulation types in detail in terms of both synthesis and detection techniques.
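To make the generation pipeline above more concrete, the following minimal sketch outlines its visual branch: face detection and cropping, translation to an intermediate representation, synthesis, and re-rendering into the target frame. Every function here is a hypothetical placeholder standing in for the detectors, representations, and generative models discussed above, not the implementation of any particular published system.

```python
import numpy as np

def detect_and_crop_face(frame, size=256):
    """Placeholder face detector: a real pipeline would use MTCNN, dlib, etc.;
    here we simply take a center crop of the frame."""
    h, w, _ = frame.shape
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]

def to_intermediate_representation(face):
    """Placeholder for landmarks / UV maps / 3DMM parameters / deep features:
    here just a flattened, normalized pixel vector."""
    return (face.astype(np.float32) / 255.0).reshape(-1)

def synthesis_model(target_repr, source_repr, size=256):
    """Placeholder generator (a GAN / encoder-decoder / Pix2Pix in real systems):
    blends the two representations and reshapes the result back to an image."""
    blended = 0.5 * target_repr + 0.5 * source_repr
    return (blended.reshape(size, size, 3) * 255.0).astype(np.uint8)

def rerender_into_frame(frame, generated_face, size=256):
    """Paste the generated face back into the original target frame."""
    out = frame.copy()
    h, w, _ = frame.shape
    top, left = (h - size) // 2, (w - size) // 2
    out[top:top + size, left:left + size] = generated_face
    return out

# Toy inputs standing in for a target video frame and a source (driving) image.
target_frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
source_image = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)

target_face = detect_and_crop_face(target_frame)
fake_face = synthesis_model(to_intermediate_representation(target_face),
                            to_intermediate_representation(source_image))
output_frame = rerender_into_frame(target_frame, fake_face)
print(output_frame.shape)
```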
4.1 Visual manipulations

4.1.1 Face-swap

Generation Visual manipulation is nothing new; images and videos have been forged since the early days of photography. In face-swap [61], or face replacement, the face of the person in the source video is replaced by the face in the target video, as shown in Fig. 6. Traditional face-swap approaches [62-64] generally take three steps to perform a face-swap operation. First, these tools detect the face in the source images and then select a target candidate face image from the facial library that is similar to the input facial appearance and pose. Second, the method replaces the eyes, nose, and mouth of the face, further adjusts the lighting and color of the candidate face image to match the appearance of the input images, and seamlessly blends the two faces. Finally, the third step positions the blended candidate replacement by computing a match distance over the overlap region. These approaches generally offer good results but have two major limitations. First, they completely replace the input face with the target face, and the expressions of the input face image are lost. Second, the synthetic result is very rigid, and the replaced face looks unnatural, i.e., a matching pose is required to generate good results.

Recently, DL-based approaches have become popular for synthetic media creation due to their realistic results. Recent deepfakes have shown how these approaches can be applied with automated digital multimedia manipulation. In 2017, the first deepfake video that appeared online was created using a face-swap approach, where the face of a celebrity was shown in pornographic content [20]. This approach used a neural network to morph a victim's face onto someone else's while preserving the original facial expression. As time went on, face-swap software, e.g. FakeApp [42] and FaceSwap [43], made it both easier and quicker to produce deepfakes with more convincing results by replacing the face in a video. These approaches typically use two encoder-decoder pairs. In this technique, an encoder is used to extract the latent features of the face from the image, and the decoder is then used to reconstruct the face. To swap faces between the source and target image, two pairs of encoder and decoder are required, where one encoder-decoder pair is trained on images of the source and the other on images of the target. Once training is complete, the decoders are swapped, so that the encoder of the source image and the decoder of the target image are used to regenerate the target identity with the features of the source image. The resulting image has the source's face on the target's face, while keeping the target's facial expressions. Fig. 7 is an example of a deepfake crafted in such a way that the feature set of face A is connected with decoder B to reconstruct face B from the original face A. The recently launched ZAO [3], REFACE [4], and FakeApp [42] applications are popular due to their effectiveness in producing realistic face-swap-based deepfakes. FakeApp allows the selective modification of facial parts. ZAO and REFACE have gone viral lately, used by less tech-savvy users to swap their faces with movie stars and embed themselves into well-known movies and TV clips. There are many publicly available implementations of face-swap technology using deep neural networks, such as FaceSwap [43], DeepFaceLab [44], and FaceSwapGAN [65], leading to the creation of a growing number of synthesized media clips.

Until recently, most of the research focused on advances in face-swapping technology, either using a reconstructed 3D morphable model (3DMM) [61, 66] or GAN-based models [65, 67]. Korshunova et al. [66] proposed a convolutional neural network (CNN) based approach that transferred the semantic content, e.g., face posture, facial expression, and illumination conditions, of the input image to create the same effects in another image. They introduced a loss function that was a weighted combination of style loss, content loss, light loss, and total variation regularization. This method [66] generates more realistic deepfakes compared to [62]; however, it requires a large amount of training data. Moreover, the trained model can be used to transform only one image at a time. Nirkin et al. [61] presented a method that used a fully convolutional network (FCN) for face segmentation and replacement in concert with a 3DMM to estimate facial geometry and the corresponding texture. The face reconstruction is then performed on the target image by adjusting the model parameters. These approaches [61, 66] have the limitation of subject-specific or pair-specific training. Recently, subject-agnostic approaches have been proposed to address this limitation [65, 67]. In [65], an improved deepfake generation approach using a GAN was proposed, which adds an adversarial loss and a VGGFace perceptual loss to the auto-encoder architecture [43]. The addition of the VGGFace perceptual loss made the direction of the eyes appear more realistic and consistent with the input, and also helped to smooth the artifacts added in the segmentation mask, resulting in a high-quality output video. FSGAN [67] allowed face swapping and reenactment in real time by following a reenact-and-blend strategy. This method simultaneously manipulates pose, expression, and identity while producing high-quality and temporally coherent results. These GAN-based approaches [65, 67] outperform several existing autoencoder-decoder methods [42, 43] as they work without being explicitly trained on subject-specific images. Moreover, their iterative nature makes them well-suited for face manipulation tasks such as generating realistic images of fake faces.
Fig. 7 Creation of a Deepfake using an auto-encoder and decoder. The same encoder-decoder pair is used to learn the latent features of the faces during
training, while during generation decoders are swapped, such that latent face A is subjected to decoder B to generate face A with the features of face B
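The shared-encoder/two-decoder training and swapping scheme summarized above and illustrated in Fig. 7 can be sketched as follows. This is only an illustrative outline under simplified assumptions (tiny networks, random tensors standing in for aligned face crops of identities A and B), not the code of FakeApp, FaceSwap, or any other specific tool.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, latent_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, latent_dim=512):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 64 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),     # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),   # 32 -> 64
        )
    def forward(self, z):
        return self.net(self.fc(z).view(-1, 64, 16, 16))

encoder, decoder_a, decoder_b = Encoder(), Decoder(), Decoder()
optim = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder_a.parameters()) + list(decoder_b.parameters()),
    lr=1e-4)
loss_fn = nn.L1Loss()

faces_a = torch.rand(8, 3, 64, 64)  # stand-in batch of aligned face crops of identity A
faces_b = torch.rand(8, 3, 64, 64)  # stand-in batch of aligned face crops of identity B

# Training: the shared encoder and each decoder learn to reconstruct their own identity.
for _ in range(10):
    loss = loss_fn(decoder_a(encoder(faces_a)), faces_a) + \
           loss_fn(decoder_b(encoder(faces_b)), faces_b)
    optim.zero_grad()
    loss.backward()
    optim.step()

# Generation: encode a face of identity A and decode it with B's decoder (the "swap").
swapped = decoder_b(encoder(faces_a))
print(swapped.shape)  # (8, 3, 64, 64)
```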
Some of the work used a disentanglement concept for face swap by using VAEs. RSGAN [68] employed two separate VAEs to encode the latent representations of the facial and hair regions, respectively. Both encoders were conditioned to predict the attributes that describe the target identity. Another approach, FSNet [69], presented a framework to achieve face-swapping using a latent space to separately encode the face region of the source identity and the landmarks of the target identity, which were later combined to generate the swapped face. However, these approaches [68, 69] do not preserve target attributes, like target occlusion and illumination conditions, well.

Facial occlusions are always challenging to handle in face-swapping methods. In many cases, the facial region in the source or target is partially covered with hair, glasses, a hand, or some other object. This results in visual artifacts and inconsistencies in the resultant image. FaceShifter [70] generates a swapped face with high fidelity and preserves target attributes such as pose, expression, and occlusion. An identity encoder is used to encode the source identity and the target attributes, with feature maps being obtained via a U-Net decoder. These encoded features are passed to a novel generator with cascaded Adaptive Attentional Denormalization layers inside residual blocks, which adaptively adjust the identity region and target attributes. Finally, another network is used to fix occlusion inconsistencies and refine the results. Table 3 presents the details of face-swap based deepfake creation approaches.

Detection Several recent studies have developed novel methods to identify face-swap manipulations. Table 4 shows a comparison of face-swap detection techniques using both handcrafted and deep features.

Techniques based on handcrafted features: Zhang et al. [73] propose a technique to detect swapped faces using a Speeded-Up Robust Features (SURF) descriptor for feature extraction. This is then used to train an SVM for classification, which is tested on a set of Gaussian-blurred images. While this approach has improved deepfake image detection performance, it is unable to detect manipulated videos. Yang et al. [74] introduce an approach to detect deepfakes by estimating the 3D head pose from 2D facial landmarks. The
Table 3 Details of face-swap based deepfake creation approaches

Method | Architecture | Features | Dataset | Output resolution | Limitations
Faceswap [43] | Encoder-decoder | Facial landmarks | Private | 256×256 | Blurry results due to lossy compression; lack of pose, facial expression, gaze direction, hairstyle, and lighting; requires a massive number of target images
FaceSwapGAN [65] | GAN | VGGFace | VGGFace | 256×256 | Lack of texture details; generates overly smooth results
DeepFaceLab [71] | Encoder-decoder | Facial landmarks | Private | 256×256 | Fails to blend very different facial hues; requires target training data
Fast Face-swap [66] | CNN | VGGFace | CelebA (200,000 images); Yale Face Database B (different pose and lighting conditions) | 256×256 | Works for a single person only; gives better results for frontal face views; lack of skin texture details (overly smooth results) and facial expression transfer; does not deal well with occluding objects, e.g. glasses
Nirkin et al. [61] | FCN-8s-VGG architecture | Basel Face Model to represent faces; 3DDFA model for expression | IARPA Janus CS2 (1275 face videos) | 256×256 | Poor results in case of different image resolutions; fails to blend very different facial hues
Chen et al. [72] | VGG-16 net | 68 facial landmarks | Helen (2330 images) | 256×256 | Provides more realistic results but is sensitive to variation in posture and gaze
FSNet [69] | GAN | Facial landmarks | CelebA | 128×128 | Sensitive to variation in angle
RSGAN [68] | GAN | Facial landmarks, segmentation mask | CelebA | 128×128 | Sensitive to variation in angle, occlusion, lighting; limited output resolution
FaceShifter [70] | GAN | Attributes (face, occlusions, lighting or styles) | VGG Face; CelebA-HQ; FFHQ | 256×256 | Stripped artifacts
computed difference among the head poses is used as a feature vector to train an SVM classifier, which is later used to differentiate between original and forged content. This technique exhibits good performance for deepfake detection but has a limitation in estimating landmark orientation in blurred images, which degrades the performance of this method under those conditions. Guera et al. [75] present a method for detecting synthesized faces in videos. Multimedia stream descriptors [76] are used to extract features, which are then used to train both an SVM and a random forest classifier to differentiate between real and manipulated faces in a sample video. This technique gives an effective solution to deepfake detection but is unable to perform well against video re-encoding attacks. Ciftci et al. [77] introduce an approach to detect forensic changes within videos by computing biological signals (e.g. heart rate) from the face portion of the videos. Temporal and spatial characteristics of facial features are computed to train SVM and CNN models to differentiate between bonafide and fake videos. This technique has improved deepfake detection accuracy; however, it has a large feature vector space and its detection accuracy drops significantly when dimensionality reduction techniques are applied. Jung et al. [78] propose a technique to detect deepfakes by identifying anomalies based on the time, repetition, and duration of eye blinking within videos. This method combines Fast-HyperFace [79] and the eye aspect ratio (EAR) eye-detection technique [80] to detect eye blinking. An integrity authentication method is employed by tracking the fluctuation of eye blinks based on gender, age, behavior, and time factors to spot real and fake videos. The approach in [78] exhibits better deepfake detection performance; however, it is not appropriate if the subject in the video suffers from a mental illness, as abnormal eye-blinking patterns are often observed in that population. Furthermore, the works in [81, 83] present ML-based approaches for face-swap detection; however, they still require performance improvement in the presence of post-processing attacks.

Techniques based on deep features: Several studies have employed DL-based methods for face-swap manipulation detection. Li et al. [84] proposed a method for detecting forensic modifications made within videos. First, facial landmarks are extracted using the dlib software package [96].
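Several of the handcrafted-feature detectors above begin with facial-landmark extraction, e.g. the dlib 68-point predictor used in [84] and the eye-aspect-ratio (EAR) blink cue exploited in [78, 80, 85]. Before turning to the deep-feature models compared in Table 4, a minimal sketch of that preprocessing step is given below; the landmark model path is an assumed local file, and the downstream SVM/LSTM classifier is omitted.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Path to the publicly distributed 68-landmark model (assumed to be downloaded locally).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def eye_aspect_ratio(eye):
    """EAR: ratio of vertical to horizontal eye-landmark distances (the blink cue in [80])."""
    v1 = np.linalg.norm(eye[1] - eye[5])
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])
    return (v1 + v2) / (2.0 * h)

def blink_features(gray_frame):
    """Return the EAR of both eyes for one grayscale frame (None if no face is found)."""
    faces = detector(gray_frame, 1)
    if not faces:
        return None
    shape = predictor(gray_frame, faces[0])
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)], dtype=np.float32)
    right_ear = eye_aspect_ratio(pts[36:42])  # landmarks 36-41: right eye
    left_ear = eye_aspect_ratio(pts[42:48])   # landmarks 42-47: left eye
    return left_ear, right_ear

# A per-frame EAR sequence (thresholded to count blinks) would then be fed to an
# SVM/LSTM-style classifier, in the spirit of the handcrafted approaches above.
```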
Table 4 Comparison of face-swap deepfake detection techniques based on handcrafted and deep features

Method | Classifier/Model | Features | Performance | Dataset | Limitations

Handcrafted features
Zhang et al. [73] | SURF + SVM | 64-D features using SURF | Precision = 97%, Recall = 88%, Accuracy = 92% | Deepfake dataset generated using the LFW face database | Unable to preserve facial expressions; works with static images only
Yang et al. [74] | SVM classifier | 68-D facial landmarks using DLib | ROC = 89%; ROC = 84% | UADFV; DARPA MediFor GAN Image/Video Challenge | Degraded performance for blurry images
Guera et al. [75] | SVM, RF classifier | Multimedia stream descriptors [76] | AUC = 93% (SVM), AUC = 96% (RF) | Custom dataset | Fails on video re-encoding attacks
Ciftci et al. [77] | CNN | Medical (biological) signal features | Accuracy = 96% | FaceForensics dataset | Large feature vector space
Jung et al. [78] | Fast-HyperFace [79], EAR [80] | Landmark features | Accuracy = 87.5% | Eye Blinking Prediction dataset | Inappropriate for people with mental illness
Matern et al. [81] | MLP, LogReg | 16-D texture energy based features of eyes and teeth [82] | AUC = 0.851 (MLP), AUC = 0.784 (LogReg) | FF++ | Only applicable to face images with open eyes and clear teeth
Agarwal et al. [83] | SVM classifier | 16 AUs using the OpenFace2 toolkit | AUC = 93% | Own dataset | Degraded performance in cases where a person is looking off-camera

Deep learning-based features
Li et al. [84] | VGG16, ResNet50, ResNet101, ResNet152 | DLib facial landmarks | AUC = 84.5 (VGG16), 97.4 (ResNet50), 95.4 (ResNet101), 93.8 (ResNet152) | DeepFake-TIMIT | Not robust to multiple video compression
Guera et al. [33] | CNN/RNN | Deep features | Accuracy = 97.1% | Customized dataset | Applicable to short videos only (2 s)
Li et al. [85] | CNN/RNN | DLib facial landmarks | TPR = 99% | Customized dataset | Fails over frequent closed eyes or blinking
Montserrat et al. [86] | CNN + RNN | Deep features | Accuracy = 92.61% | DFDC | Performance needs improvement
Lima et al. [87] | VGG11 + LSTM | Deep features | Accuracy = 98.26%, AUC = 99.73% | Celeb-DF | Computationally complex
Agarwal et al. [88] | VGG16 + encoder-decoder network | Deep features + behavioral biometrics | AUC = 99% (WLDR), 99% (FF), 93% (DFD), 99% (Celeb-DF) | WLDR; FF; DFD; Celeb-DF | Unable to generalize well to unseen deepfakes
Fernandes et al. [89] | Neural-ODE model | Heart rate | Loss = 0.0215 (Custom), 0.0327 (DeepfakeTIMIT) | Custom; DeepfakeTIMIT | Computationally complex
Yang et al. [90] | GAN | Deep features | Accuracy = 97.37% (FF++), AUC = 0.9999 (CelebDF), AUC = 0.9579 (DFDC), Accuracy = 99.86% (DeeperForensics) | FF++; CelebDF; DFDC; DeeperForensics | Low generalization ability
Sabir et al. [91] | CNN/RNN | Deep features | Accuracy = 96.3% | FF++ | Results are reported for static images only
Afchar et al. [92] | MesoInception-4 | Deep features | TPR = 81.3% | FF++ | Performance degrades on low-quality videos
Nguyen et al. [93] | CNN | Deep features | Accuracy = 83.71% | FF++ | Degraded detection performance for unseen cases
Stehouwer et al. [94] | CNN | Deep features | Accuracy = 99.43% | Diverse Fake Face Dataset (DFFD) | Computationally complex due to large feature vector space
Rossle et al. [95] | SVM + CNN | Co-occurrence matrix + deep features | Accuracy = 90.29% | FF++ | Low performance on compressed videos
Next, CNN-based models, namely ResNet152, ResNet101, ResNet50, and VGG16, are trained to detect forged content in video. This approach is more robust in detecting forensic changes, but it exhibits low performance on multiply-compressed videos. Guera et al. [33] propose a novel CNN to extract features at the frame level. An RNN is then trained on the set of extracted features to detect deepfakes in the input video. This work achieves good detection performance, but only on videos of short duration, i.e. videos of 2 seconds or less. Li et al. [85] propose a technique to detect deepfakes by exploiting the fact that manipulated videos lack accurate eye blinking in synthesized faces. A CNN/RNN approach is used to detect a lack of eye blinking in the videos in order to expose the forged content. This technique shows better deepfake detection performance; however, it only uses the lack of eye blinking as a clue to detect the deepfakes. This approach has the following potential limitations: i) it is unable to detect forgeries in videos with frequent eye blinking, ii) it is unable to detect manipulated faces with closed eyes in training, and iii) it is inapplicable in scenarios where forgers can create realistic eye blinking in synthesized faces. Montserrat et al. [86] introduce a method for detecting visual manipulation in a video. Initially, a multi-task convolutional neural network (MTCNN) [97] is employed to detect the faces in all video frames to compute the features. In the next step, an Automatic Face Weighting (AFW) mechanism, along with a Gated Recurrent Unit, is used to discard incorrectly identified faces. Finally, an RNN is employed to combine the features from all steps and locate the manipulated content in the video samples. The approach in [86] works well for deepfake detection; however, it is unable to obtain a prediction from the features in multiple frames. Lima et al. [87] introduce a technique to detect video manipulation by learning the temporal information of frames. Initially, VGG-11 is employed to compute the features from video frames, on which an LSTM is applied for temporal sequence analysis. Several CNN frameworks, namely R3D, ResNet, and I3D, are trained on the temporal sequence descriptors output by the LSTM in order to identify original and manipulated video. This approach [87] improves deepfake detection accuracy but at the expense of high computational cost. Agarwal et al. [88] present an approach to locate face-swap-based manipulations by combining both facial and behavioral biometrics. Behavioral biometrics are recognized with an encoder-decoder network (Facial Attributes-Net, FAb-Net) [98], whereas VGG-16 is employed for facial feature computation. Finally, by merging both metrics, the inconsistencies in matching identities are revealed in order to locate face-swap deepfakes. The approach in [88] works well for unseen cases; however, it may not generalize well to lip-synch-based deepfakes. Fernandes et al. [89] introduce a technique to locate visual manipulation by measuring the heart rate of the subjects. Initially, three techniques, skin color variation [99], average optical intensity [100], and Eulerian video magnification [101], are used to measure heart rate. The computed heart rate is used to train a Neural Ordinary Differential Equations (Neural-ODE) model [102] to differentiate the original and altered content. This technique [89] works well for deepfake detection but also has increased computational complexity. In [103] a multi-scale texture difference network is introduced for face manipulation detection. The model is comprised of a ResNet-18 based textural difference information block and a multi-scale information extraction block. The obtained features at different scales are then fused to perform classification using a cross-entropy loss. Yang et al. [90] propose a multi-scale self-texture attention deepfake detection framework based on facial texture analysis. The architecture works by identifying the potential texture differences between real and fake faces. It consists of a trace generator and a classification network. The trace generator network is comprised of an image analysis encoder followed by a self-texture attention module for the calculation of texture autocorrelation in features in order to differentiate between real and forged faces. For trace generation, the triplet loss is used for fake faces and logistic regression for the real face images. A loss function based on a probability-constrained trace control loss for trace construction, confined by the classification probability, is applied. This method is robust to different textural post-processing operations; however, the overall detection accuracy is low due to a lack of generalizability. Other works [91-95] have explored CNN-based methods for the detection of swapped faces; however, there is a need for a more robust approach.

4.1.2 Lip-syncing

Generation The lip-syncing approach involves synthesizing a video of a target identity such that the mouth region in the manipulated video is consistent with a specific audio input [37] (Fig. 8). A key aspect of synthesizing a video with an audio segment is the movement and appearance of the lower portion of the mouth and its surrounding region. To convey a message more effectively and naturally, it is important to generate proper lip movements along with expressions. From a scientific point of view, lip-syncing has many applications in the entertainment industry, such as creating audio-driven photorealistic digital characters in films or games, voice-bots, and dubbing films in foreign languages. Moreover, it can also help the hearing-impaired understand a scenario by lip-reading from a video created using genuine audio.

Existing works on lip-syncing [104, 105] require the reselection of frames from a video or transcription, along with target emotions, to synthesize lip motions. These approaches are limited to a dedicated emotional state and do not generalize well to unseen faces. However, DL models are capable of learning and predicting lip movements from audio features. A
detailed analysis of existing DL-based methods used for lip-sync based deepfake generation is presented in Table 5. Suwajanakorn et al. [37] propose an approach to generate a photo-realistic lip-synced video using a target's video and an arbitrary audio clip as input. A recurrent neural network (RNN) based model is employed to learn the mapping between audio features and mouth shape for every frame, and frame reselection is later used to fill in the texture around the mouth based on the landmarks. This synthesis is performed on the lower facial regions, i.e. mouth, chin, nose, and cheeks, and applies a series of post-processing steps, such as smoothing the jaw location and re-timing the video to align vocal pauses and talking-head motion, to produce videos that appear more natural and realistic. In this work, Barack Obama is considered as a case study due to the sufficient availability of online video footage. Thus, this model requires retraining and a large amount of data for each individual. The Speech2Vid [106] model takes an audio clip and a static image of a target subject as input and generates a video that is lip-synced with the audio clip. This model uses Mel Frequency Cepstral Coefficient (MFCC) features, extracted from the audio input, and feeds them into a CNN-based encoder-decoder. As a post-processing step, a separate CNN is used for frame deblurring and sharpening in order to preserve the quality of the visual content. This model generalizes well to unseen faces and thus does not need retraining for new identities. However, this work is unable to synthesize a variety of emotions in facial expressions.
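A simplified sketch of the audio-conditioned encoder-decoder idea behind models such as Speech2Vid [106] is given below: an MFCC window and a still identity image are encoded separately, the embeddings are concatenated, and a decoder produces a lip-synced frame. The layer sizes, input shapes, and random tensors are placeholders, not the published architecture.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Encodes a short window of MFCC features (here 13 coefficients x 35 frames)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(13 * 35, dim), nn.ReLU())
    def forward(self, mfcc):
        return self.net(mfcc)

class IdentityEncoder(nn.Module):
    """Encodes the still image of the target identity."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(), nn.Linear(64 * 16 * 16, dim), nn.ReLU())
    def forward(self, img):
        return self.net(img)

class FrameDecoder(nn.Module):
    """Decodes the concatenated audio + identity embedding into an output frame."""
    def __init__(self, dim=256):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 64 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid())
    def forward(self, z):
        return self.net(self.fc(z).view(-1, 64, 16, 16))

audio_enc, id_enc, dec = AudioEncoder(), IdentityEncoder(), FrameDecoder()
mfcc_window = torch.rand(4, 13, 35)        # stand-in MFCC windows
identity_image = torch.rand(4, 3, 64, 64)  # stand-in still images of the target

frame = dec(torch.cat([audio_enc(mfcc_window), id_enc(identity_image)], dim=1))
print(frame.shape)  # (4, 3, 64, 64): frames whose mouth region follows the audio
```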
Table 5 Overview of DL-based lip-sync deepfake generation methods

Method | Architecture | Features | Dataset | Output resolution | Limitations
Suwajanakorn et al. [37] | RNN (single-layer unidirectional LSTM) | Mouth landmarks (36-D features); MFCC audio features (28-D) | YouTube videos (17 hours) | 2048×1024 | Requires a large amount of training data for the target person; requires retraining for each identity; sensitive to 3D movement of the head; no direct control over facial expressions
Speech2Vid [106] | Encoder-decoder CNN | VGG-M network; MFCC audio features | VGG Face; LRS2 (41.3-hour video); VoxCeleb2 (test) | 109×109 | Lacks the ability to synthesize emotional facial expressions
Vougioukas et al. [107] | Temporal GAN | MFCC audio features | GRID; TCD TIMIT | 96×128 | Lacks the ability to synthesize emotional facial expressions; flickering and jitter; sensitive to large facial motions
Zhou et al. [108] | Temporal GAN | Deep audio-video features | LRW; MS-Celeb-1M | 256×256 | Lacks the ability to synthesize emotional facial expressions
Vdub [109] | 3DMM | 66 facial feature points; MFCC features | Private | 1024×1024 | Requires video of the target
LipGAN [110] | GAN | VGG-M network; MFCC features | LRS2 | 1280×720 | Visual artifacts and temporal inconsistency; unable to preserve source lip region characteristics
Wav2Lip [111] | GAN | Mel-spectrogram representation | LRS2 | 1280×720 | Lacks the ability to synthesize emotional facial expressions
The GAN-based manipulations, such as [107], employ a temporal GAN, consisting of an RNN, to generate a photorealistic video directly from a still image and a speech signal. The resulting video includes synchronized lip movements, eye-blinking, and natural facial expressions without relying on manually handcrafted audio-visual features. Multiple discriminators are employed to control frame quality, audio-visual synchronization, and overall video quality. This model can generate lip-syncing for any individual in real time. In [108], an adversarial learning method is employed to learn a disentangled audio-visual representation. The speech encoder is trained to project both audio and visual representations into the same latent space. The advantage of using a disentangled representation is that both the audio and the video can serve as a source of speech information during the generation process. As a result, it is possible to generate realistic talking-face sequences for an arbitrary identity with synchronized lip movement. Garrido et al. [109] present the VDub system, which captures high-quality 3D facial models of both the source and the target actor. The computed facial model is used to photo-realistically reconstruct a 3D mouth model of the dubber to be applied to the target actor. An audio channel analysis is performed to better align the synthesized visual content with the audio. This approach renders a coarse-textured teeth proxy well; however, it fails to synthesize a high-quality interior mouth region. In [110], a face-to-face translation method, LipGAN, is proposed which can synthesize a talking-face video of any individual from a given single image and audio segment as input. LipGAN consists of a generator network that synthesizes portrait video frames with a modified mouth and jaw area from the given audio and target frames, and a discriminator network that decides whether the synthesized face is synchronized with the given audio. This approach is unable to ensure temporal consistency in the synthesized content, as blurriness and jitter can be observed in the resulting video. Recently, Prajwal et al. [111] proposed Wav2Lip, a speaker-independent model that can accurately synchronize lip movements in a video recording to a given audio clip. This approach employs a pre-trained lip-sync discriminator that is further trained on noisy generated videos in the absence of a generator. The model uses several consecutive frames instead of a single frame in the discriminator and employs a visual quality loss along with a contrastive loss, thus increasing the visual quality by considering temporal correlation.
Recent approaches can synthesize photo-realistic fake videos from speech (audio-to-video) or text (text-to-video) with convincing video results. The methods proposed in [37, 112] can alter an existing video of a person to the desired speech to be spoken from text input by modifying the mouth movement and speech accordingly. These approaches are more focused on synchronizing lip movements by synthesizing the region around the mouth. In [113], a VAE-based framework is proposed to synthesize full-pose video with facial expressions, gestures, and body posture movements from given speech audio.
Detection techniques based on handcrafted features Initially, ML-based methods were employed for the detection of lip-sync visual deepfakes. Korshunov et al. [114] propose a technique employing 40-D MFCC features, containing the 13-D static, 13-D delta, and 13-D double-delta coefficients along with the energy, in combination with mouth landmarks to train four classifiers, i.e., SVM, LSTM, multilayer perceptron (MLP), and Gaussian mixture model (GMM). Three publicly available datasets, namely VidTIMIT [115], the AMI corpus [116], and the GRID corpus [117], are used to evaluate the performance of this technique. From the results, it is concluded in [114] that the LSTM achieves better performance than the other techniques. The lip-syncing deepfake detection performance of the LSTM method drops, however, for the VidTIMIT [115] and AMI [116] datasets due to fewer training samples per person in both of these datasets compared to the GRID dataset. In [118], the MFCC features were substituted with DNN embeddings, i.e., language-specific phonetic features used for automatic speaker recognition. The evaluation shows improved performance compared to [114]; however, performance is not evaluated on large-scale realistic datasets or on GAN-based manipulations.
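To make the 40-D MFCC representation used in [114] concrete, the sketch below shows one plausible way to assemble such features and train one of the four classifiers (an SVM). The mouth-landmark distance, file paths, and hyper-parameters are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch (not the exact pipeline of [114]): 13 static MFCCs,
# their deltas and double-deltas, plus frame energy -> 40-D audio features,
# concatenated with simple mouth-landmark statistics and fed to an SVM.
import numpy as np
import librosa
from sklearn.svm import SVC

def audio_features(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # (13, T) static
    delta = librosa.feature.delta(mfcc)                   # (13, T) delta
    delta2 = librosa.feature.delta(mfcc, order=2)         # (13, T) double-delta
    energy = librosa.feature.rms(y=y)                     # (1, T) frame energy
    feats = np.vstack([mfcc, delta, delta2, energy])      # (40, T)
    return feats.mean(axis=1)                             # clip-level 40-D vector

def mouth_openness(landmarks):
    # landmarks: (68, 2) facial landmarks; vertical gap between inner-lip points
    return np.linalg.norm(landmarks[62] - landmarks[66])

def clip_feature(wav_path, landmark_seq):
    visual = np.array([mouth_openness(l) for l in landmark_seq])
    return np.concatenate([audio_features(wav_path),
                           [visual.mean(), visual.std()]])

# X: stacked clip features, y: 0 = real, 1 = lip-sync deepfake (hypothetical data)
# clf = SVC(kernel="rbf").fit(X, y)
```

A recurrent classifier such as the LSTM used in [114] would instead consume the per-frame feature sequence rather than a clip-level average.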
Techniques based on Deep Features: Other DL-based techniques, such as [119], propose a detection approach that exploits inconsistencies between phoneme-viseme pairs. In [119], the authors observe that the lip shape associated with specific phonemes such as M, B, or P must be completely closed to pronounce them; however, deepfake videos often lack this aspect. They analyze the performance by creating deepfakes using Audio-to-Video (A2V) [37] and Text-to-Video (T2V) [112] synthesis techniques. However, this method fails to generalize well to samples unseen during training. Haliassos et al. [120] propose a lip-sync deepfake detection approach, namely LipForensics, using a spatio-temporal network. Initially, a feature extractor (a 3D-CNN ResNet18) and a multiscale temporal convolutional network (MS-TCN) are trained on lip-reading datasets such as Lipreading in the Wild (LRW). Then, the model is fine-tuned on deepfake videos using the FaceForensics++ (FF++) dataset. This method also performs well under different post-processing operations such as blur, noise, and compression; however, the performance substantially decreases when there is limited mouth movement in the video, such as pauses in speech or little movement of the lips. Chugh et al. [121] propose a deepfake detection mechanism based on finding a lack of synchronization between the audio and visual channels. They compute a modality dissimilarity score (MDS) between the audio and visual modalities. A sub-network based on the 3D-ResNet architecture is used for feature computation and employs two loss functions: a cross-entropy loss at the output
layer for robust feature learning, and a contrastive loss computed over the segment-level audiovisual features. The MDS is calculated as the total audiovisual dissonance over all segments of the video and is used for the classification of the video as real or fake. Mittal et al. [122] propose a Siamese network architecture for audio-visual deepfake detection. This approach compares the correlation between emotion-based differences in facial movements and speech in order to distinguish between real and fake. However, this approach requires a real-fake video pair for the training of the network and fails to classify correctly if only a few frames in the video have been manipulated. Chintha et al. [123] propose a framework based on the XceptionNet CNN for facial feature extraction, whose output is then passed to a bidirectional LSTM network for the detection of temporal inconsistencies. The network is trained via two loss functions, i.e., cross-entropy and KL-divergence, to discriminate the feature distribution of real video from that of manipulated video. Table 6 presents a comparison of handcrafted and deep learning techniques employed for the detection of lip sync-based deepfakes.
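The modality dissimilarity score of [121] can be pictured as an aggregate distance between per-segment audio and visual embeddings. The sketch below is a simplified, hypothetical rendering of that idea: the actual method uses 3D-ResNet sub-networks and learned losses, whereas here the segment embeddings and the decision threshold are assumed to be given.

```python
# Simplified illustration of a modality dissimilarity score (MDS):
# the video is split into segments, each segment yields an audio embedding
# and a visual embedding, and the per-segment distances are summed.
import numpy as np

def modality_dissimilarity_score(audio_emb, visual_emb):
    """audio_emb, visual_emb: arrays of shape (num_segments, dim)."""
    per_segment = np.linalg.norm(audio_emb - visual_emb, axis=1)  # dissonance per segment
    return per_segment.sum()

def classify_video(audio_emb, visual_emb, threshold):
    # threshold would be chosen on a validation set; deepfakes tend to score higher
    score = modality_dissimilarity_score(audio_emb, visual_emb)
    return "fake" if score > threshold else "real"

# Example with random embeddings standing in for 3D-ResNet features
rng = np.random.default_rng(0)
a, v = rng.normal(size=(10, 128)), rng.normal(size=(10, 128))
print(classify_video(a, v, threshold=150.0))
```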
Table 6 Comparison of handcrafted and deep learning-based techniques for lip-sync deepfake detection

Reference | Method | Performance | Dataset | Limitations
Handcrafted features
Korshunov et al. [114] | SVM, LSTM, MLP, GMM | EER = 24.74 (LSTM), 53.45 (MLP), 56.18 (SVM), 56.09 (GMM) | VidTIMIT | LSTM performs better than the others, but its performance degrades as the training samples decrease
 | | EER = 33.86 (LSTM), 41.21 (MLP), 48.39 (SVM), 47.84 (GMM) | AMI |
 | | EER = 14.12 (LSTM), 28.58 (MLP), 30.06 (SVM), 46.81 (GMM) | GRID |
Agarwal et al. [119] | SVM | Accuracy = 99.6% | Custom dataset | Performance degrades for unseen samples
Deep learning-based features
Haliassos et al. [120] | 3D-ResNet18, multi-scale temporal convolutional network | AUC = 97.1% | FF++ | Performance degrades in cases where there is limited lip movement
Mittal et al. [122] | Siamese network architecture | Accuracy = 84.4% (DFDC); AUC = 96.3% (LQ), 94.9% (HQ) (DF-TIMIT) | DFDC, DF-TIMIT | Requires a real-fake video pair for training
Chintha et al. [123] | XceptionNet CNN with bidirectional LSTM network | Accuracy = 97.83% (Celeb-DF), 96.89% (FF++) | Celeb-DF, FF++ | Performance degrades on compressed samples

4.1.3 Puppet-master

Generation Puppet-master, also known as face reenactment, is another common variety of deepfake that manipulates the facial expressions of a person, e.g., transferring the facial gestures, eye, and head movements to an output video so that they reflect those of the source actor [124], as shown in Fig. 9. Puppet-mastery aims to deform the person's mouth movement to make fabricated content. Facial reenactment has various applications, like altering the facial expressions and mouth movements of a participant speaking a foreign language in an online multilingual video conference, dubbing or editing an actor's head and facial expressions in film-industry post-production systems, or creating photorealistic animation for movies and games.
Initially, 3D facial modeling-based approaches for facial reenactment were proposed because of their ability to accurately capture geometry and movement, and for improved photorealism in reenacted faces. Thies et al. [125, 126] presented the first real-time facial expression transfer method from an actor to a target person. A commodity RGB-D sensor was used to track and reconstruct the 3D models of the source and target actors. For each frame, the tracked deformations of the source face were applied to the target face model, and the altered face was then blended onto the original target face while preserving the facial appearance of the target face model. Face2Face [38] is an advanced form of the facial reenactment technique presented in [125]. This method worked in real time and was capable of altering the facial movements of generic RGB video streams, e.g., YouTube videos, using a standard webcam. The 3D model reconstruction approach was combined with image rendering techniques to generate the output. This could create a convincing and instantaneous re-rendering of a target actor with a relatively simple home setup. This work was further extended to control the facial expressions of a person in a target video based on intuitive hand gestures using an inertial measurement unit [127].
Later, GANs were successfully applied to facial reenactment due to their ability to generate photo-realistic images. Pix2pixHD [60] produced high-resolution images with better fidelity by combining a multi-scale conditional GAN (cGAN) architecture [128] with a perceptual loss. Kim et al. [47] proposed an approach that allowed the full reanimation of portrait videos by an actor, such as changing the head pose, eye gaze, and blinking, rather than just modifying the facial expression of the target identity, and thus produced photorealistic dubbing results. At first, a face reconstruction approach was used to obtain a parametric representation of the face and illumination information from each video frame to produce a synthetic rendering of the target identity. This representation was then fed to a render-to-video translation network based on a cGAN in order to convert the synthetic rendering into photo-realistic video frames. This approach required training videos of the target identity. Wu et al. [129] proposed ReenactGAN, which encodes input facial features into a boundary latent space. A target-specific transformer was used to adapt the source boundary space to the specified target, and the latent space was then decoded onto the target face. GANimation [130] employed a dual cGAN generator conditioned on emotion action units (AUs) to transfer facial expressions. The AU-based generator used an attention map to interpolate between the reenacted and original images. Instead of relying on AU estimations, GANnotation [131] used facial landmarks along with a self-attention mechanism for facial reenactment. This approach introduced a triple consistency loss to minimize visual artifacts but required the images to be synthesized with a frontal facial view for further processing. These models [130, 131] required a large amount of training data for the target identity to perform well at oblique angles, or they lacked the ability to generate photo-realistic reenactment for unknown identities.
Recently, few-shot or one-shot face reenactment approaches have been proposed to achieve reenactment using a few, or even a single, source image. In [39], a self-supervised learning model, X2Face, was proposed that uses multiple modalities, such as a driving frame, facial landmarks, or audio, to transfer the pose and expression of the input source to the target. X2Face uses two encoder-decoder networks: an embedding network and a driving network. The embedding network learns a face representation from the source frame, and the driving network learns pose and expression information from the driving frame to a vector map. The driving network was crafted to interpolate the face representation from the embedding network in order to produce the target expressions. Zakharov et al. [132] present a meta-transfer learning approach where the network is first trained on multiple identities and then fine-tuned on the target identity. First, a target identity encoding is obtained by averaging the target's expressions and associated landmarks from different frames. Then a pix2pixHD [60] GAN is used to generate the target identity using the source landmarks as input and the identity encoding via adaptive instance normalization (AdaIN) layers. This approach works well at oblique angles and directly transfers the expression without requiring an intermediate boundary latent space or an interpolation map, as in [39]. Zhang et al. [133] propose an auto-encoder-based structure to learn the latent representation of the target's facial appearance and the source's face shape. These features are used as input to SPADE residual blocks for the face reenactment task, which preserve the spatial information and concatenate the feature maps in a multi-scale manner from the face reconstruction decoder. This approach can better handle large pose changes and exaggerated facial actions. In FaR-GAN [134], learnable features from convolution layers are used as input to the SPADE module instead of the multi-scale landmark masks used in [133]. Usually, few-shot learning fails to completely preserve the source identity in the generated results in cases where there is a large pose difference between the reference and target image. MarioNETte [48] is proposed to mitigate this identity leakage by employing an attention block and target feature alignment, which help the model to better accommodate the variations between face structures. Finally, the identity is retained by using a novel landmark transformer influenced by the 3DMM facial model [135].
Real-time face reenactment approaches, such as FSGAN [67], perform both facial replacement and reenactment with occlusion handling. For reenactment, a pix2pixHD [60]
generator takes the target's image and the source's 3D facial landmarks as input and outputs a reenacted image and a 3-channel (hair, face, and background) encoded segmentation mask. The recurrent generator is trained recursively, where the output is iterated multiple times for incremental interpolation from the source to the target landmarks. The results are further improved by applying Delaunay triangulation and barycentric coordinate interpolation to generate output similar to the target's pose. This method achieves real-time facial reenactment at 30 fps and can be applied to any face without requiring identity-specific training. Table 7 provides a summary of the techniques adopted for the facial expression manipulation mentioned above.
In the next few years, photo-realistic full-body reenactment [9, 136] videos will also become viable, where the target's expressions, along with their mannerisms, will be manipulated to create realistic deepfakes. The videos generated using the above-mentioned techniques can be further merged with fake audio to create completely fabricated content [137]. These progressions enable the real-time manipulation of facial expressions and motion in videos while making it challenging to distinguish between what is real and what is fake.
Detection Techniques based on handcrafted Features: Matern et al. [81] presented an approach for classifying forged content by employing simple handcrafted facial features, such as the color of the eyes, missing artifact information in the eyes and teeth, and missing reflections. These features were used to train two models, i.e., logistic regression and an MLP, to distinguish manipulated content from original data. This technique has a low computational cost; however, it applies only
to visual content with open eyes or visible teeth. Amerini et al. [138] proposed an approach based on optical flow fields to detect synthesized faces in digital videos. The optical flow fields [139] of each video frame were computed using PWC-Net [140]. The estimated optical flow fields of the frames were used to train VGG16 and ResNet50 networks to classify real and fake content. This method [138] exhibited better deepfake detection performance; however, only initial results have been reported. Agarwal et al. [83] presented a user-specific technique for deepfake detection. First, a GAN was used to generate all three types of deepfakes of US ex-president Barack Obama. Then the OpenFace2 [141] toolkit was used to estimate facial and head movements. The estimated difference between the 2D and 3D facial and head landmarks was used to train a binary SVM to classify between the original and synthesized faces of Barack Obama. This technique provided good detection accuracy; however, it was vulnerable in scenarios where a person is looking off-camera.
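As a rough illustration of the optical-flow idea in [138], consecutive frames can be converted into dense flow fields that a standard CNN then classifies. The sketch below substitutes OpenCV's classical Farnebäck estimator for the PWC-Net used by the authors, so it is an assumption-laden approximation rather than their implementation.

```python
# Sketch: dense optical flow between consecutive frames as CNN input.
# [138] uses PWC-Net; Farneback flow from OpenCV is used here for brevity.
import cv2
import numpy as np

def flow_fields(video_path, max_frames=16):
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flows = []
    while ok and len(flows) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2)
        flows.append(flow)
        prev_gray = gray
    cap.release()
    return np.stack(flows)  # (N, H, W, 2), fed to e.g. a VGG16/ResNet50 classifier
```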
Techniques based on Deep Features: Several research works have focused on employing DL-based methods for puppet-mastery deepfake detection. Sabir et al. [91] observed that, while generating manipulated content, forgers often do not impose temporal coherence in the synthesis process. So, in [91], a recurrent convolutional model was used to investigate temporal artifacts in order to identify synthesized faces in the images. This technique [91] achieved better detection performance; however, it worked best on static frames. Rossler et al. [95] employed both handcrafted (co-occurrence matrix) and learned features for detecting manipulated content. It was concluded in [95] that the detection performance of both networks, whether employing hand-crafted or deep features, degraded when evaluated on compressed videos. To analyze the mesoscopic properties of manipulated content, Afchar et al. [92] proposed an approach employing two variants of the CNN model with a small number of layers, named Meso-4 and MesoInception-4. This method managed to reduce the computational cost by downsampling the frames, but at the expense of a decrease in deepfake detection accuracy. Nguyen et al. [93] proposed a multi-task learning-based CNN to simultaneously detect and localize manipulated content in videos. An autoencoder was used for the classification of forged content, while a Y-shaped decoder was applied to share the extracted information for the segmentation and reconstruction steps. This model was robust for deepfake detection; however, the evaluation accuracy degraded when presented with unseen scenarios. To overcome the issue of performance degradation, as in [93], Stehouwer et al. [94] proposed a Forensic Transfer (FT) based CNN approach for deepfake detection. This work [94], however, suffered from high computational cost due to a large feature space. The comparison of these handcrafted and deep features-based face reenactment deepfake detection techniques is presented in Table 8.

Table 8 Comparison of handcrafted and deep features-based face reenactment deepfake detection techniques

Reference | Method | Features | Performance | Dataset | Limitations
Handcrafted
Matern et al. [81] | MLP, LogReg | 16-D texture energy-based features of eyes and teeth | AUC = 0.823 (MLP), 0.866 (LogReg) | FF++ [82] | Only applicable to face images with open eyes and clear teeth
Agarwal et al. [83] | SVM classifier | 16 AUs using the OpenFace2 toolkit | AUC = 98% | Own dataset | Degraded performance in cases where a person is looking off-camera
Amerini et al. [138] | VGG16, ResNet | Optical flow fields | Accuracy = 81.61% (VGG16), 75.46% (ResNet) | FF++ | Very few results are reported
Deep learning
Sabir et al. [91] | CNN/RNN | CNN features | Accuracy = 94.35% | FF++ | Results are reported for static images only
Afchar et al. [92] | MesoInception-4 | Deep features (DF) | TPR = 81.3% | FF++ | Performance degrades on low-quality videos
Nguyen et al. [93] | CNN | Deep features | Accuracy = 92.50% | FF++ | Degraded detection performance for unseen cases
Stehouwer et al. [94] | CNN | Deep features | Accuracy = 99.4% | Diverse Fake Face Dataset (DFFD) | Computationally complex due to large feature vector space
Rossler et al. [95] | SVM + CNN | Co-occurrence matrix + DF | Accuracy = 86.86% | FF++ | Low performance on compressed videos

4.1.4 Face synthesis

Generation Facial editing in digital images has been heavily explored for decades. It has been widely adopted in the art, animation, and entertainment industries; however, lately it has been exploited to create deepfakes for identity impersonation. Face generation involves the synthesis of photorealistic images of a human face that may or may not exist in real life. The tremendous evolution in deep generative models has made them widely adopted tools for face image synthesis and editing. Generative deep learning models, i.e., GANs [1] and VAEs [142], have been successfully used to generate photo-realistic fake human face images. In facial synthesis, the objective is to generate non-existent but realistic-looking faces. Face synthesis has enabled a wide range of beneficial applications, like automatic character creation for the video game and 3D face modeling industries. AI-based face synthesis could also be used for malicious purposes, such as the synthesis of a photorealistic fake profile picture for a fake social network account in order to spread disinformation. Several approaches have been proposed to generate realistic-looking facial images that humans are unable to recognize as synthesized. Figure 10 shows the improvement in the quality of synthetic facial images between 2014 and 2019. Table 9 provides a summary of works presented for the generation of entirely synthetic faces.
Since the emergence of GANs [1] in 2014, significant efforts have been made to improve the quality of synthesized images. The images generated using the first GAN model [1] were low-resolution and not very convincing. DCGAN [143] was the first approach that introduced a deconvolution layer in the generator to replace the fully connected layer, which achieved better performance in synthetic image generation. Liu et al. [144] proposed CoGAN, based on a VAE, for learning joint distributions of two-domain images. This model trained a couple of GANs rather than a single one, with each responsible for synthesizing images in one domain. The size of the generated images still remained relatively small, e.g., 64 × 64 or 128 × 128 pixels.
The generation of high-resolution images was earlier limited by memory constraints. Karras et al. [145] presented ProGAN, a training methodology for GANs that employed an adaptive mini-batch size and progressively increased the resolution by adding layers to the networks during the training process. StyleGAN [146] was an improved version of ProGAN [145]. Instead of mapping a latent code z directly to a resolution, a mapping network was employed that learned to map the input latent vector (Z) to an intermediate latent vector (W), which controlled different visual features. The improvement was that the intermediate latent vector was free from any distribution restriction, and this reduced the correlation between features (disentanglement). The layers of the generator network were controlled via an AdaIN operation, which helped decide the features in the
output layer. Compared to [1, 143, 144], StyleGAN [146] achieved state-of-the-art high resolution in the generated images, i.e., 1024 × 1024, with fine detail. StyleGAN2 [147] further improved the perceived image quality by removing unwanted artifacts, such as changes in gaze direction and teeth alignment with the facial pose. Huang et al. [148] presented a Two-Pathway Generative Adversarial Network (TP-GAN) that could simultaneously perceive global structures and local details, like humans, and synthesized a high-resolution frontal-view facial image from a single ill-posed face image. Image synthesis using this approach preserved the identity under large pose variations and illumination. Zhang et al. [149] introduced a self-attention module in convolutional GANs (SAGAN) to handle global dependencies, thus ensuring that the discriminator can accurately determine the related features in distant regions of the image. This work further improved the semantic quality of the generated image. In [150], the authors proposed the BigGAN architecture, which used residual networks to improve image fidelity and the variety of generated samples by increasing the batch size and varying the latent distribution. In BigGAN, the latent distribution was embedded into multiple layers of the generator to influence features at different resolutions and levels of the hierarchy, rather than just being added to the initial layer. Thus, the generated images were photo-realistic and very close to real-world images from the ImageNet dataset. Zhang et al. [151] proposed a stacked GAN (StackGAN) model to generate high-resolution images (e.g., 256 × 256) with details based on a given textual description. In [152], spatial and channel attention layers were added to the generator network to improve texture learning details for super-resolution image generation.
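The AdaIN operation mentioned above has a simple closed form: content features are normalized per channel and then rescaled and shifted with statistics predicted from the style (latent) code. The PyTorch-style sketch below is a minimal illustration; the layer sizes and the exact affine parameterization are assumptions rather than the StyleGAN implementation.

```python
# Adaptive Instance Normalization (AdaIN), as used in style-based generators:
# normalize each channel of the content features, then apply a scale and bias
# predicted from the style vector w.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, num_channels, w_dim):
        super().__init__()
        self.to_scale = nn.Linear(w_dim, num_channels)
        self.to_bias = nn.Linear(w_dim, num_channels)

    def forward(self, x, w):
        # x: (N, C, H, W) feature maps, w: (N, w_dim) intermediate latent vector
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True) + 1e-8
        normalized = (x - mean) / std
        scale = self.to_scale(w).unsqueeze(-1).unsqueeze(-1)   # (N, C, 1, 1)
        bias = self.to_bias(w).unsqueeze(-1).unsqueeze(-1)     # (N, C, 1, 1)
        return scale * normalized + bias

# feats = AdaIN(512, 512)(torch.randn(1, 512, 8, 8), torch.randn(1, 512))
```

In StyleGAN the scale and bias are produced, per generator layer, from the intermediate latent vector W, which is how the style controls features at different resolutions.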
Fig. 10 Improvement in the quality of synthetic faces generated by variations on GANs. In order, the images are from papers by Goodfellow et al. (2014) [1], Radford et al. (2015) [143], Liu et al. (2016) [144], Karras et al. (2017) [145], and Style-based (2018 [146], 2019 [147])

Table 9 Summary of works presented for the generation of entirely synthetic faces

Reference | Model | Features | Dataset | Output resolution | Limitations
Liu et al. [144] | CoGAN | Deep features | CelebA | 64×64 or 128×128 | Generates low-quality samples
Karras et al. [145] | ProGAN | Deep features | CelebA | 1024×1024 | Limited control over the generated output
Karras et al. [147] | StyleGAN | Deep features | ImageNet | 1024×1024 | Blob-like artifacts
Huang et al. [148] | TP-GAN | Deep features | LFW | 256×256 | Lacks fine details; lacks semantic consistency
Zhang et al. [149] | SAGAN | Deep features | ImageNet2012 | 128×128 | Unwanted visible artifacts
Brock et al. [150] | BigGAN | Deep features | ImageNet | 512×512 | Class-conditional image synthesis; class leakage
Zhang et al. [151] | StackGAN | Deep features | CUB, Oxford, MS-COCO | 256×256 | Lacks semantic consistency

Detection Techniques based on handcrafted Features: A lot of literature is available on image forgery detection [153–158]. As AI-manipulated data is a new phenomenon, there are few forensic techniques that work well for deepfake detection. Recently, some researchers [73, 159] have adopted the idea of employing traditional image forgery identification methods to detect synthesized faces; however, these approaches are unable to identify fake facial images. Current research has therefore focused on new ML-based techniques. McCloskey et al. [160] present an approach to identify fake images by exploiting the fact that the color information differs between a real camera and synthesized samples. The color key-points from input samples are used to train an SVM for classification. This approach [160] exhibits better fake sample detection accuracy; however, it may not perform well for blurred images. Guarnera et al. [161] propose a method to identify fake images. Initially, the EM algorithm is used to calculate the image features. The computed key-points are used to train three types of classifiers: KNN, SVM, and LDA. The approach in [161] performs well for synthesized image identification, but may not perform well for compressed images.
Techniques based on Deep Features: In DL-based work such as [162], the authors proposed a method to detect forged images by calculating the pixel co-occurrence matrices on the three color channels of the image. A CNN model was then trained to learn important features from the co-occurrence matrices to differentiate manipulated from non-manipulated content. Yu et al. [163] presented an attribution network architecture to map an input sample to its related fingerprint image. The correlation index between each sample fingerprint and model fingerprint acts as a softmax logit for classification. This approach [163] exhibited better detection accuracy; however, it may not perform well under post-processing operations, i.e., noise, compression, and blurring. Marra et al. [164] proposed a study to identify GAN-generated fake images. In particular, [164] introduced a multi-task incremental learning detection approach to locate and classify new types of GAN-generated samples without affecting the detection accuracy on previous ones. Two solutions related to the position of the classifier were introduced by employing the iCaRL algorithm for incremental learning [165], named Multi-Task MultiClassifier and Multi-Task Single Classifier. This approach [164] was robust to unseen GAN-generated samples but was unable to perform well if information on the fake content generation method is not available. Table 10 presents a comparison of the face synthesis deepfake detection techniques mentioned above.

Table 10 Comparison of face synthesis deepfake detection techniques

Reference | Method | Features | Performance | Dataset | Limitations
Handcrafted
Guarnera et al. [161] | EM + (KNN, SVM, LDA) | Deep features | Accuracy = 99.22% (KNN), 99.81% (SVM), 99.61% (LDA) | CelebA | Not robust to compressed images
McCloskey et al. [160] | SVM | Color channels | AUC = 70% | MFC2018 | Performance degrades with blurry samples
Deep learning
Nataraj et al. [162] | CNN | Deep features + co-occurrence matrices | Accuracy = 99.49% (cycleGAN), 93.42% (StarGAN) | cycleGAN, StarGAN | Works with static images only; low performance for JPEG-compressed images
Yu et al. [163] | CNN | Deep features | Accuracy = 99.43% | CelebA | Poor performance under post-processing operations
Marra et al. [164] | CNN + incremental learning | Deep features | Accuracy = 99.3% | Customized | Needs source manipulation technique information
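To illustrate the co-occurrence-matrix representation used by [162] above (and, later, by [173] for attribute manipulation), the sketch below computes one matrix per color channel from horizontally adjacent pixel pairs; the resulting 256×256×3 tensor would then be passed to a CNN. The offset and normalization choices here are illustrative assumptions.

```python
# Pixel co-occurrence matrices on each color channel, in the spirit of [162]:
# count how often pixel value i is followed horizontally by value j.
import numpy as np

def cooccurrence_matrix(channel, offset=(0, 1)):
    h, w = channel.shape
    dy, dx = offset
    a = channel[:h - dy, :w - dx].ravel()
    b = channel[dy:, dx:].ravel()
    mat = np.zeros((256, 256), dtype=np.float32)
    np.add.at(mat, (a, b), 1.0)             # accumulate pair counts
    return mat / mat.sum()                  # normalized co-occurrence matrix

def cooccurrence_tensor(rgb_image):
    """rgb_image: (H, W, 3) uint8 array -> (256, 256, 3) CNN input."""
    return np.stack([cooccurrence_matrix(rgb_image[..., c]) for c in range(3)],
                    axis=-1)
```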
4.1.5 Facial attribute manipulation

Generation Face attribute editing involves altering the facial appearance of an existing sample by modifying an attribute-specific region while keeping the irrelevant regions unchanged. Face attribute editing includes removing/wearing eyeglasses, changing the viewpoint, skin retouching (e.g., smoothing skin, removing scars, and minimizing wrinkles), and even some higher-level modifications, such as age and gender. Increasingly, people are using commercially available AI-based face editing mobile applications, such as FaceApp [5], to automatically alter the appearance of an input image.
Recently, several GAN-based approaches have been proposed to edit facial attributes, such as skin color, hairstyle, age, and gender, by adding/removing glasses and facial expressions in a given face. In this manipulation, the GAN takes the original face image as input and generates the edited face image with the given attribute, as shown in Fig. 11. A summary of face attribute manipulation approaches is presented in Table 11.
Perarnau et al. [166] introduce the Invertible Conditional GAN (IcGAN), which uses an encoder in combination with cGANs for face attribute editing. The encoder maps the input face image into a latent representation and an attribute manipulation vector, and a cGAN reconstructs a face image with new attributes, given the altered attribute vector as the condition. This approach suffers from information loss and alters the original face identity in the synthesized image. In [167], a Fader Network is presented, where an encoder-decoder architecture is trained in an end-to-end manner to generate an image by disentangling the salient information of the image and the attribute values directly in the latent space. This approach, however, adds unexpected distortion and blurriness, and thus fails to preserve the fine details of the original in the generated image.
Fig. 11 Examples of different face manipulations: original sample (Input) and manipulated samples
Table 11 Summary of face attribute manipulation approaches

Reference | Model | Features | Dataset | Output resolution | Limitations
Perarnau et al. [166] | IcGAN | Deep features | CelebA, MNIST | 64×64 | Fails to preserve the original face identity
Fader Network [167] | Encoder-decoder | Deep features | CelebA | 256×256 | Unwanted distortion and blurriness; fails to preserve fine details
Choi et al. [168] | StarGAN | Deep features | CelebA, RaFD | 512×512 | Undesired visible artifacts in the facial skin, e.g., uneven color tone
He et al. [169] | AttGAN | Deep features | CelebA, LFW | 384×384 | Generates low-quality results and adds unwanted changes, blurriness
Liu et al. [170] | STGAN | Deep features | CelebA | 384×384 | Poor performance for multiple attribute manipulation
Zhang et al. [171] | SAGAN | Deep features | CelebA | 256×256 | Lack of details in the attribute-irrelevant region
He et al. [172] | PA-GAN | Deep features | CelebA | 256×256 | Undesired artifacts in the case of baldness, open mouth, etc.
Prior studies [166, 167] focused on handling image-to-image translations between two domains. These methods required different generators to be trained independently to handle translations between each pair of image domains, which limited their practical usage. StarGAN [36], an enhanced approach, was capable of translating images among multiple domains using a single generator. A conditional facial attribute transfer network was trained via an attribute classification loss and a cycle consistency loss. StarGAN achieved promising visual results in terms of attribute manipulation and expression synthesis. This approach, however, added some undesired visible artifacts, such as an uneven color tone in the facial skin, to the output image. The recently proposed StarGAN-v2 [168] achieved state-of-the-art visual quality of the generated images compared to [36] by adding a random Gaussian noise vector into the generator. In AttGAN [169], an encoder-decoder architecture was proposed that considered the relationship between the attributes and the latent representation. Instead of imposing an attribute-independent constraint on the latent representation, as in [166, 167], an attribute classification constraint was applied to the generated image in order to guarantee the correct change of the desired attributes. AttGAN provided improved facial attribute editing results, with other facial details well preserved. However, the bottleneck layer, i.e., the down-sampling in the encoder-decoder architecture, added unwanted changes and blurriness and generated low-quality edited results. Liu et al. [170] proposed the STGAN model, which incorporated an attribute difference indicator and a selective transfer unit with an encoder-decoder to adaptively select and modify the encoded features. STGAN focused only on the attribute-specific region and did not guarantee good preservation of details in attribute-irrelevant regions.
Other works introduce attention mechanisms for attribute manipulation. SAGAN [171] introduces a GAN-based attribute manipulation network to perform the alteration and a global spatial attention mechanism to localize and explicitly constrain editing within a specified region. This approach preserves the irrelevant details well, but at the cost of attribute correctness in the case of multiple attribute manipulations. PA-GAN [172] employs a progressive attention mechanism in a GAN to progressively blend the attribute features into the encoder features, constrained inside a proper attribute area, by employing an attention mask from high to low feature levels. As the feature level gets lower (higher resolution), the attention mask becomes more precise and the attribute editing becomes finer. This approach successfully performs multiple attribute manipulations and preserves attribute-irrelevant regions well within a single model. However, some undesired artifacts appear in cases where significant modifications are required, such as baldness and an open mouth.
Detection Techniques based on handcrafted Features: Researchers have employed traditional ML-based approaches for the detection of facial attribute manipulation. In [173], the authors used pixel co-occurrence matrices to compute features from the suspect samples. The extracted keypoints were used to train a CNN classifier to differentiate original and manipulated faces. The method in [173] showed better facial attribute manipulation detection accuracy; however, it may not perform well on noisy samples. An identification approach using keypoints computed from the frequency domain, instead of raw sample pixels, was introduced in [174]. For each input sample, a 2D discrete Fourier transform (DFT) was applied to transform the image to the frequency domain in order to acquire one frequency sample per RGB channel. The work in [174] used an AutoGAN classifier to predict real and fake samples. The generalization ability of the work in [174] was evaluated over unseen GAN frameworks; more specifically, they considered two GAN frameworks, namely StarGAN [36] and GauGAN [175]. The work showed better prediction accuracy for the StarGAN model; however, in the case of GauGAN the technique faced a serious performance drop.
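A minimal sketch of the frequency-domain representation used in [174] is given below: each RGB channel is mapped through a 2D DFT and the log-magnitude spectrum is kept as the feature. The normalization and the downstream AutoGAN classifier are not reproduced here, so this is only an approximation of the published pipeline.

```python
# Frequency-domain features in the spirit of [174]: one 2D DFT log-magnitude
# spectrum per RGB channel, later fed to a real/fake classifier.
import numpy as np

def dft_spectrum(channel):
    f = np.fft.fft2(channel.astype(np.float32))
    f = np.fft.fftshift(f)                     # move low frequencies to the center
    return np.log(np.abs(f) + 1e-8)            # log-magnitude spectrum

def frequency_features(rgb_image):
    """rgb_image: (H, W, 3) array -> (H, W, 3) stacked spectra."""
    return np.stack([dft_spectrum(rgb_image[..., c]) for c in range(3)], axis=-1)
```

GAN upsampling tends to leave periodic patterns in these spectra, which is why such features generalize across some generators but, as noted above, not all of them.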
Techniques based on Deep Features: The research community has presented several methods to detect facial manipulations by evaluating the internal GAN pipeline. Similar work was presented in [176], where the authors introduced the concept that analyzing internal neuron behavior can assist in identifying manipulated faces, as layer-by-layer neuron activation arrangements can extract a more representative set of significant image features for recognizing original and fake faces. The proposed solution in [176], namely FakeSpotter, computed deep features by employing several DL-based face recognition frameworks, i.e., VGG-Face [177], OpenFace [178], and FaceNet [179]. The extracted features were used to train an SVM classifier to categorize fake and real faces. The solution [176] performed well for facial attribute
manipulation detection; however, it may not perform well for samples with intense light variation. Existing works on facial attribute manipulation have either employed entire faces or face patches in order to spot real and manipulated content. A face patch-based technique was presented in [180], where a Restricted Boltzmann Machine (RBM) was used to compute deep features. The extracted features were then used to train a two-class SVM classifier to classify real and forged faces. The method in [180] was robust for manipulated face detection; however, this came at the expense of increased computational cost. Another similar approach was proposed in [181], where a CNN-based keypoint extractor was presented. The CNN comprised six convolutional layers along with two fully connected layers. Additionally, residual connections were introduced, which allowed ResNet frameworks to compute the deep features from the input samples. Finally, the calculated features were used to train an SVM classifier to predict real and manipulated faces. The approach in [181] showed better manipulation identification performance; however, it did not perform well under various post-processing attacks, i.e., noise, blurring, intensity variations, and color changes. Some researchers have employed entire faces rather than face patches in order to detect facial attribute manipulation in visual content. One such work was presented by Tariq et al. [182], where several DL-based frameworks, i.e., VGG-16, VGG-19, ResNet, and XceptionNet, were trained on suspect samples in order to locate facial attribute forgeries. The work in [182] showed better face attribute manipulation detection; however, its performance declined in real-world scenarios. Some authors used attention mechanisms to further enhance training in attribute manipulation detection systems. Dang et al. [183] introduced a framework to identify several types of facial manipulation. This framework employed attention mechanisms in order to enhance the feature map calculation in CNN frameworks. Two different methods of attribute manipulation generation were taken into account: i) fake samples generated using the publicly available FaceApp software, with various available filters, and ii) fake samples generated with the StarGAN network. The work in [183] is robust for face forgery detection, however, at the expense of high computational cost.
Wang et al. [170] proposed a framework to detect manipulated faces which encompassed two classification steps: local and global predictors. A Dilated Residual Network (DRN) model was used as a global predictor to identify real and fake samples, while optical flow fields were utilized for local predictions. The approach in [170] worked well for face attribute manipulation identification but required extensive training data. Similarly, [164] proposed a DL-based framework, XceptionNet, for the detection of face attribute forgeries. However, the method in [164] suffered from high computational cost. Rathgeb et al. [184] introduced a Photo Response Non-Uniformity (PRNU)-based method. In this method, scores gathered after analyzing spatial and spectral features, computed from the PRNU patterns of entire image samples, were fused. The approach [184] was able to robustly differentiate between bona fide and retouched facial samples; however, its accuracy was lacking.
Many of these DL-based methods achieve near-perfect accuracy, as shown in Table 12; however, this accuracy appears to be largely due to the presence of GAN fingerprints in the manipulated samples. Newer research focuses on detection in samples from which the GAN signatures have been removed, and this has proven to be challenging for previously high-performing frameworks. Hence, the research community needs to develop strategies that are resistant to such attacks.
Table 12 Comparison of facial attribute manipulation detection techniques

Reference | Method | Features | Performance | Dataset | Limitations
Handcrafted
[173] | Co-occurrence matrices with CNN | Co-occurrence matrix | Accuracy = 99.4% | Private dataset | Evaluation performance degrades on noisy images
[174] | GAN discriminator | Frequency domain features | Accuracy = 100% | Private dataset | Serious performance degradation for GauGAN framework-based face attribute manipulations
Deep learning
[176] | FakeSpotter | Deep features | Accuracy = 84.7% | Private dataset | Detection performance decreases when samples have significant light variation
[180] | RBM with SVM classifier | Deep features | Accuracy = 96.2%; 87.1% | Private datasets (Celebrity Retouching, ND-IIITD Retouching) | High computational cost
[181] | CNN + SVM | Deep features | Accuracy = 99.7% | Private dataset | Results are reported for post-processing attacks
[182] | CNNs | Deep features | AUC = 74.9% | Private dataset | Performance degrades in real-world scenarios
[183] | Attention mechanism with CNN | Deep features | AUC = 99.9% | DFFD | Computationally complex
[170] | DRN | Deep features | Average precision = 99.8% | Private dataset | Should be evaluated over a standard dataset
[164] | Incremental learning with CNN | Deep features | Accuracy = 99.3% | Private dataset | Inefficient
[184] | Score-level fusion | PRNU features | EER = 13.7% | Private dataset | Classification accuracy needs improvement
4.1.6 Discussion of visual manipulation methods

Generation Deepfake generation has advanced significantly in recent years. The high quality of generated images across the different visual manipulation categories (face swap, face reenactment, lip-sync, entire face synthesis, and attribute manipulation) has made it increasingly difficult for human eyes to differentiate between fake and genuine content. Among the significant advances are: (i) unpaired, self-supervised training strategies that avoid the requirement for extensive labeled training data; (ii) the addition of AdaIN layers, the pix2pixHD network, self-attention modules, and feature disentanglement for improved synthesized faces; (iii) one/few-shot learning strategies that enable identity theft with limited target training data; (iv) the use of temporal discriminators and optical flow estimation to improve coherence in the synthesized videos; (v) the introduction of a secondary network for seamless blending of composites in order to reduce boundary artifacts; (vi) the use of multiple loss functions to handle different tasks, such as conversion, blending, occlusion, pose, and illumination, for improved final output; and (vii) the adoption of a perceptual loss with a pre-trained VGG-Face network, which dramatically enhanced synthesized facial quality. Current deepfake systems still have limitations; e.g., in facial reenactment generation techniques, frontal poses are always used to drive and create the content. As a result, reenactment is restricted to a somewhat static performance. Currently, face swapping onto the body of a lookalike is performed to achieve facial reenactment; however, this approach has limited flexibility because a good match is not always achievable with current technology. Moreover, face reenactment depends on the driver's performance to portray the target identity's personality. Recently, there has been a trend towards identity-independent deepfake generation models. Another development is real-time deepfakes that allow face swapping in video chats; real-time deepfakes at 30 fps have been achieved in works such as [67, 106]. The next generation of deepfakes is expected to utilize video stylization techniques to generate manipulated content with the target's projected expressions and mannerisms. Although existing deepfakes are not perfect, the rapid development of high-quality real/fake image datasets promotes deepfake generation research.

Detection In this subsection, we presented a summary of the work performed on visual deepfake detection. Based on an in-depth analysis of various detection approaches, we conclude that most of the existing detection work employs DL-based approaches and shows robust performance approaching 100%. The main reason for the accuracy of these models is the presence of fingerprint information and visible artifacts in the audiovisual manipulated samples. More recently, however, researchers have presented approaches that remove this information from the forged samples, which is proving to be a challenge even for high-performing attribute manipulation detection frameworks. It has been observed that most of the existing detection techniques perform well on face-swap detection and are relatively easily able to identify when the entire face is swapped with the target identity, which usually leaves artifacts. However, expression swap and lip-sync are more challenging to detect, as these manipulations tamper with soft biometrics of the same person's identity. For visual deepfake detection, it has been observed that it is relatively easy for the research community to detect image-based manipulations in comparison to video-based deepfakes. For both audio and visual deepfakes, most of the research has used publicly available datasets instead of self-synthesized datasets. Existing works have reported robust performance for visual deepfake detection but face a serious performance drop for unseen cases, indicating a lack of generalization ability. Moreover, these approaches are unable to definitively prove the difference between real and manipulated content, so they lack explainability. Several deepfake detection methods presented in previous years have proven to be nearly unusable due to implementation complexities such as variations in datasets, configuration environments, and complicated architectures. More recently, software and online platforms such as DeepFake-o-meter [185], FakeBuster [186], and Video Authenticator (not publicly available) [187] have been introduced which can easily detect audio-visual manipulations and give the general audience access to such tools. However, these platforms are in their infancy and need further development to handle emerging deepfakes.
Figure 12 groups the existing work performed on visual deepfake detection, and Table 13 presents a detailed description of each category. Existing approaches have either targeted spatial and temporal artifacts left during generation or used data-driven classification. The spatial artifacts include inconsistencies [78, 81, 114, 188, 193, 201–203], abnormalities in the background [160, 194, 198], and GAN fingerprints [74, 163, 204, 205]. The temporal artifacts involve detecting variations in a person's behavior [83, 88, 200], physiological signals [77, 78, 85, 89], coherence [190, 199, 206], or video frame synchronization [33, 75, 91, 138, 207, 208]. Instead of focusing on a specific artifact, some approaches are data-driven and detect manipulations by classification [58, 73, 84, 86, 87, 92–95, 119, 123, 161, 162, 164, 189, 191, 192, 209–213] or anomaly identification [121, 122, 195, 196, 214–216]. Moreover, in Fig. 12, the references marked with * denote DL-based approaches employed for deepfake detection, while the others denote hand-coded feature extraction methods.
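For the data-driven "classification" category in Fig. 12 and Table 13, a representative (though deliberately simplified) setup is an off-the-shelf CNN fine-tuned as a binary real/fake frame classifier, as sketched below. The backbone choice, input size, and training details are assumptions for illustration, not a specific published pipeline.

```python
# Minimal data-driven deepfake frame classifier: an ImageNet-pretrained CNN
# fine-tuned for binary real/fake prediction on face crops.
import torch
import torch.nn as nn
from torchvision import models

def build_detector():
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, 2)   # real vs. fake
    return model

def train_step(model, frames, labels, optimizer):
    # frames: (N, 3, 224, 224) face crops, labels: (N,) with 0 = real, 1 = fake
    logits = model(frames)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# model = build_detector()
# opt = torch.optim.Adam(model.parameters(), lr=1e-4)
```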
Fig. 12 Categorization of visual deepfake detection techniques (the red color shows face-swap detection approaches, purple face reenactment, orange lip-syncing, blue facial image synthesis, and pink facial attribute manipulation detection techniques, where * shows deep learning-based approaches)
Table 13 Description of classification categories for existing deepfake detection methods

Category | Description
Inconsistencies | Visible artifacts within the frame, such as inconsistent head poses and landmarks
Environment | Abnormalities in the background, such as lighting and other details
Forensics | GAN fingerprints left during the generation process
Behavioral | Monitoring abnormal gestures and facial expressions
Synchronization | Temporal consistency, such as inconsistencies between adjacent frames/modalities
Physiology | Lack of biological signals, such as eye blinking patterns and heart rate
Coherence | Missing optical flow fields and artifacts such as flickering and jitter between frames
Classification | End-to-end CNN-based data-driven models
Anomaly detection | Outlier identification, e.g., reconstructing real images and comparing them to the encoded image; used to detect unknown creation methods
4.2 Audio manipulations

AI-synthesized audio manipulation is a type of deepfake that can clone a person's voice and depict that voice saying something that the person never said. Recent advancements in AI-synthesized algorithms for speech synthesis and voice cloning have shown the potential to produce realistic fake voices that are nearly indistinguishable from genuine speech. These algorithms can generate synthetic speech that sounds like the target speaker, based on text or samples of the target speaker, with highly convincing results [59, 217]. Synthetic voices are widely adopted for the development of different applications, such as automated dubbing for TV and film, chatbots, AI assistants, text readers, and personalized synthetic voices for vocally handicapped people. Aside from this, synthetic/fake voices have become an increasing threat to voice biometric systems [218] and are also used for malicious purposes, such as political gain, fake news, or fraud [14, 58]. More complex audio synthesis can combine the power of AI with manual editing. For example, neural network-powered voice synthesis models, such as Google's Tacotron [56], WaveNet [55], or Adobe VoCo [219], can generate realistic and convincing fake voices that resemble the victim's voice. Later on, audio editing software, e.g., Audacity [6], can be used to integrate the original and synthesized audio to make more convincing fakes.
AI-based impersonation is not limited to visual content; recent advancements in AI-synthesized fake voices are assisting the creation of highly realistic deepfake videos [37]. These developments in speech synthesis have shown the potential to produce realistic and highly natural-sounding audio deepfakes, presenting a real threat to society [14]. Combining synthetic audio content with visual manipulation can make deepfake videos significantly more convincing and increase their impact [37]. Despite much progress, synthesized speech still lacks some aspects of voice quality, such as expressiveness, roughness, breathiness, stress, and emotion, specific to a target identity [220]. The AI research community is making a concerted effort to overcome these challenges and produce human-like voice quality with high speaker similarity.
Two distinct modalities for audio deepfakes are text-to-speech (TTS) synthesis and voice conversion (VC). TTS synthesis is a technology that can synthesize a natural-sounding sample of any speaker based on the given input text [221]. VC is a technique that modifies the audio waveform of a source speaker to sound similar to the target speaker's voice [222]. A VC system takes the recording of an individual as a source and creates a deepfake audio in the target's voice. It preserves the linguistic and phonetic characteristics of the source sample and changes them to those of the target speaker. TTS synthesis and VC represent a genuine threat when used maliciously, as both generate completely synthetic, computer-generated voices that are nearly indistinguishable from genuine speech. Moreover, cloned replay attacks [13] pose a potential risk for voice biometric devices because the latest speech synthesis techniques can produce a vocal sample with high speaker similarity [223]. This section lists the latest progress in speech synthesis, including TTS and VC techniques, as well as detection strategies.

4.2.1 TTS voice synthesis

TTS is a decades-old technology that can synthesize a natural-sounding voice from a given input text, and thus enables a voice to be used for better human-computer interaction. The initial research on TTS synthesis was based on speech concatenation or parameter estimation. Concatenative TTS systems separate high-quality recorded speech into small fragments that are then concatenated into new speech. In recent years, this method has become outdated and unpopular, as it is neither scalable nor consistent. In contrast, parametric models map text to salient speech parameters and convert them into an audio signal using vocoders. Later on, the deployment of deep neural networks gradually became the dominant method for speech synthesis, achieving much better voice quality. These methods include neural vocoders [55, 221, 224], GANs [225–227], autoencoders [228], autoregressive models [229–231], and other emerging techniques [228, 232–236], which have promoted the rapid development of the speech synthesis industry. Figure 13 shows the principal design of modern TTS methods.
The significant developments in voice/speech synthesis are WaveNet [55], Tacotron [56], and Deep Voice 3 [224], which can generate realistic-sounding synthetic speech from an input text to provide an enhanced interaction experience between humans and machines. Table 14 presents an overview of the state-of-the-art speech synthesis methods.
Table 14 Overview of state-of-the-art speech synthesis methods

Reference | Model | Features | Dataset | Limitations
WaveNet [55] | Deep neural network | Linguistic features; fundamental frequency (log F0) | VCTK (44 hrs.) | Computationally complex
Tacotron [56] | Encoder-decoder with RNN | Deep features | Private (24.6 hrs.) | Costly to train the model
Deep Voice 1 [57] | Deep neural networks | Linguistic features | Private (20 hrs.) | Independent training of each module leads to cumulative error in the synthesized speech
Deep Voice 2 [237] | RNN | Deep features | VCTK (44 hrs.) | Costly to train the model
Deep Voice 3 [224] | Encoder-decoder | Deep features | Private (20 hrs.); VCTK (44 hrs.); LibriSpeech ASR (820 hrs.) | Does not generalize well to unseen samples
Parallel WaveNet [231] | Feed-forward neural network with dilated causal convolutions | Linguistic features | Private | Requires a large amount of the target's speech training data
VoiceLoop [230] | Fully-connected neural network | 63-dimensional audio features | VCTK (44 hrs.); private | Low ecological validity
Tacotron 2 [238] | Encoder-decoder | Linguistic features | Japanese speech corpus from the ATR Ximera dataset (46.9 hrs.) | Lack of real-time speech synthesis
Arik et al. [59] | Encoder-decoder | Mel spectrograms | LibriSpeech (820 hrs.); VCTK (44 hrs.) | Low performance for multi-speaker speech generation in the case of low-quality audio
Jia et al. [233] | Encoder-decoder | Mel spectrograms | LibriSpeech (436 hrs.); VCTK (44 hrs.) | Fails to attain human-level naturalness; lacks transfer of the target accent and prosody to synthesized speech
Luong et al. [228] | Encoder-decoder | Mel spectrograms | LibriSpeech (245 hrs.); VCTK (44 hrs.) | Low performance in the case of noisy audio samples
Chen et al. [235] | Encoder + deep neural network | Mel spectrograms | LibriSpeech (820 hrs.); private | Low performance in the case of low-quality audio samples
Cong et al. [236] | Encoder-decoder | Mel spectrograms | MULTI-SPK; CHiME-4 | Lacking in synthesizing utterances of a target speaker
Another DL-based model, Deep Voice 1 [57], uses a variant of WaveNet and puts each module, containing an audio signal, voice generator, or a text-analysis front-end, through a related NN model. Due to the independent training of each module, however, it is not a true end-to-end speech synthesis system. In 2017, Google introduced Tacotron [56], an end-to-end speech synthesis model. Tacotron could synthesize speech from given <text, audio> pairs and thus generalized well to other datasets. Similar to WaveNet, the Tacotron framework was a generative framework comprised of a seq2seq model that contained an encoder, an attention-based decoder, and a post-processing network. Even though the Tacotron model attained better performance, it had one key limitation: it must employ multiple recurrent components. The inclusion of these units made it computationally inefficient, so it required high-performance systems for model training. Deep Voice 2 [237] combined the capabilities of both the Tacotron and WaveNet models for voice synthesis. Initially, Tacotron was employed for converting the input text to a linear-scale spectrogram, which was later converted to voice through the WaveNet model.

In [238], Tacotron 2 is introduced for vocal synthesis, and it exhibits an impressively high mean opinion score, very similar to human speech. Tacotron 2 consists of a recurrent sequence-to-sequence keypoint estimation framework that maps character embeddings to mel-scale spectrograms. To deal with the time complexity of recurrent unit-based speech synthesis models, a new, fully-convolutional character-to-spectrogram model named Deep Voice 3 is presented in [224]. The Deep Voice 3 model is faster than its peers because it performs fully parallel computations. Deep Voice 3 is comprised of three main modules: i) an encoder that accepts text as input and transforms it into an internal learned representation, ii) a decoder that converts the learned representations into spectrograms in an autoregressive manner, and iii) a post-processing, fully convolutional network that predicts the final vocoder parameters.
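Although Tacotron, Tacotron 2, and Deep Voice 3 differ in their details, they share the encoder, attention-based decoder, and post-processing structure described above. The sketch below is a minimal, illustrative PyTorch rendering of that shared structure rather than any of the published architectures; all layer types and dimensions are assumptions, and the predicted spectrogram would still need a vocoder such as WaveNet to become a waveform.

```python
# Minimal sketch of a seq2seq character-to-mel-spectrogram model with an encoder,
# attention-based decoder, and post-processing network. Dimensions are illustrative.
import torch
import torch.nn as nn

class TextToMelSketch(nn.Module):
    def __init__(self, n_chars=80, emb=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)                       # character embedding
        self.encoder = nn.GRU(emb, emb, batch_first=True)             # text encoder
        self.attn = nn.MultiheadAttention(emb, num_heads=4, batch_first=True)
        self.decoder = nn.GRU(n_mels, emb, batch_first=True)          # autoregressive decoder (teacher-forced here)
        self.to_mel = nn.Linear(emb, n_mels)                          # projects decoder states to mel frames
        self.postnet = nn.Conv1d(n_mels, n_mels, kernel_size=5, padding=2)  # post-processing refinement

    def forward(self, char_ids, prev_mels):
        memory, _ = self.encoder(self.embed(char_ids))                # (B, T_text, emb)
        query, _ = self.decoder(prev_mels)                            # (B, T_mel, emb)
        context, _ = self.attn(query, memory, memory)                 # attend over the encoded text
        mels = self.to_mel(context)                                   # coarse mel prediction
        refined = mels + self.postnet(mels.transpose(1, 2)).transpose(1, 2)
        return refined                                                # handed to a vocoder for waveform synthesis

text = torch.randint(0, 80, (1, 30))        # 30 toy character ids
prev = torch.zeros(1, 120, 80)              # 120 previous mel frames (teacher forcing)
print(TextToMelSketch()(text, prev).shape)  # torch.Size([1, 120, 80])
```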
Another model for voice synthesis is VoiceLoop [230], which uses a memory framework to generate speech from voices unseen during training. VoiceLoop builds a phonological store by implementing a shifting buffer as a matrix. Text strings are characterized as a list of phonemes that are later decoded into short vectors. The new context vector is produced by weighting the encodings of the resulting phonemes and summing them together. The above-mentioned powerful end-to-end speech synthesizer models [224, 238] have enabled the production of large-scale commercial products, such as Google Cloud TTS, Amazon AWS Polly, and Baidu TTS. All these projects aim to attain a high similarity between synthesized and human voices.

The latest TTS systems can convert given text to human speech with a particular voice identity. Using generative models, researchers have built voice-imitating TTS models that can clone the voice of a particular speaker in real time using a few samples of reference speech [233, 234]. The key distinction between voice cloning and speech synthesis systems is that the former focuses on preserving the speech attributes of a specific identity, while the latter lacks this feature and instead focuses on the quality of the generated speech [228]. Various AI-enabled voice cloning online platforms are available, such as Overdub (https://www.descript.com/overdub), VoiceApp (https://apps.apple.com/us/app/voiceapp/id1122985291), and iSpeech (https://www.ispeech.org/apps), which can produce synthesized voices that closely resemble the target's speech and give the public access to this technology. Jia et al. [233] propose a Tacotron 2-based TTS system capable of producing multi-speaker speech, including speakers unseen during training. The framework consists of three independently trained neural networks. The findings show that although the synthetic speech resembles a target speaker's voice, it does not fully isolate the voice of the speaker from the prosody of the audio reference. Arik et al. [59] propose a Deep Voice 3-based technique comprised of two modules: speaker adaptation and speaker encoding. For speaker adaptation, a multi-speaker generative framework is fine-tuned. For speaker encoding, an independent model is trained to directly infer a new speaker embedding, which is applied to the multi-speaker generative model.

Luong et al. [228] propose a speech generation framework that can synthesize a target-specific voice, either from input text or from a reference raw audio waveform of a source speaker. The framework consists of a separate encoder and decoder for text and speech, and a neural vocoder. The model is jointly trained with linguistic latent features, and the speech generation model learns a speaker-disentangled representation. The obtained results achieve quality and speaker similarity to the target speaker; however, it takes almost 5 minutes to produce the cloned speech. Chen et al. [235] propose a meta-learning approach using the WaveNet model for voice adaptation with limited data. Initially, speaker adaptation is computed by fine-tuning the speaker embedding. Then, a text-independent parametric approach is applied whereby an auxiliary encoder network is trained to predict the embedding vector of new speakers. This approach performs well on clean and high-quality training data; however, the presence of noise distorts the speaker encoding and directly affects the quality of the synthesized speech. In [236], the authors propose a seq2seq multi-speaker framework with domain adversarial training to produce a target speaker's voice from only a few available noisy samples. The results show improved naturalness in the synthetic speech. However, similarity still remains challenging to achieve due to an inability to transfer target accents and prosody to synthesized speech with a limited amount of low-quality speech data.

Different GAN-based architectures have been applied to process and generate high-quality speech in audio synthesis. Notable works include WaveGAN [239], GAN-TTS [225], MelGAN [226], and HiFi-GAN [227]. Some works introduce GAN-based vocoders that focus on producing high-quality speech while maintaining controllability. In [225], the authors introduce GAN-TTS, a linguistic-feature-to-waveform generation model using a GAN. It is based on a conditional feed-forward generator network that generates a raw speech waveform, and an ensemble of discriminator networks that use multi-frequency random windows to assess the synthesized speech. In [226], the authors introduce MelGAN, a dilated convolutional structure that enlarges the receptive field in order to better simulate long-range correlation in the waveform sequences. A multi-scale discriminator network is used with a feature matching loss over the feature maps of real and synthetic audio. In [227], the generator is based on a multi-receptive field fusion module that processes many patterns of varying durations simultaneously. Multiple sub-discriminators are used to individually evaluate different periodic portions of the input waveform. The loss function, similar to [226], computes the distance between the produced waveform's mel-spectrogram and the ground truth. HiFi-GAN can efficiently synthesize speech that closely resembles natural speech; however, for high-quality speech synthesis, it requires model fine-tuning and the respective ground-truth data.

Aside from naturalness, expressiveness is an important factor that differentiates synthesized speech from human speech. Numerous factors influence the expressiveness of a synthetic voice, including content, timbre, phonation, style, emotion, and others. An expressive TTS requires a one-to-many mapping that matches voice variants to a text selection in terms of pitch, loudness, timing, and speaker accent. In [240], a feed-forward transformer network that generates mel-spectrograms from text and then synthesizes speech is proposed. Because a mel-spectrogram sequence is substantially
lengthier than its corresponding phoneme sequence, a monotonic alignment search is employed to extract a duration that aligns both text and speech and provides better control over the vocal speed and prosody. Similarly, the work in [229] employs a fully convolutional network to generate mel-spectrograms for speech synthesis, along with a positional attention mechanism that aligns speech and text sequences. Kim et al. [232] introduce Glow-TTS, a flow-based model for the generation of mel-spectrograms. This model uses a self-attention mechanism to internally learn mappings between the text and the latent representation of speech by using properties of flow and dynamic programming. The Glow-TTS model synthesizes natural-sounding speech and provides better control over the synthesized speech, such as the speaking rate or pitch, but it involves a huge number of training parameters. In addition, computing average mel-spectrograms from the input leads to low-quality and less expressive synthesized speech because it lacks the ability to capture the expression details of every single utterance. Therefore, more efficient approaches that can better model different variations of speech are required to improve the expressiveness of the synthesized speech.

4.2.2 Voice conversion

Voice Conversion (VC) is a speech-to-speech synthesis technology that manipulates an input voice to sound like the target voice identity while maintaining the linguistic content of the source speech. VC has numerous applications in real life, including expressive voice synthesis, personalized speaking assistants, adaptive equipment for vocally impaired people, voice dubbing for the entertainment industry, and many others [222]. The recent development of anti-spoofing for automated speaker verification [218] included VC systems for the generation of spoofing data [241, 242].

In general, to perform VC, high-level features of the speech, e.g., voice timbre and prosody characteristics, are used. Voice timbre is concerned with the spectral properties of the vocal tract during phonation, whereas prosody relates to suprasegmental characteristics, i.e., pitch, amplitude, stress, and duration. Multiple Voice Conversion Challenges (VCC) have been held to encourage the development of VC generation techniques and improve the quality of converted speech [137, 241, 242]. Earlier VCCs aimed to convert source speech to target speech by using non-parallel and parallel data [137, 241], but the more recent challenge [242] focused on the development of cross-lingual VC techniques, where the source speech is converted to sound like target speech using non-parallel training data across different languages.

In earlier studies, VC techniques were based on spectrum mapping using paired training data, where speech samples from both the source and target speaker speaking the same linguistic content are required for conversion. Methods using GMMs [243, 244], partial least square regression [245], exemplar-based techniques [246], and others [247-249] were proposed for parallel spectral modeling. These [243-246] were "shallow" VC methods that transformed source speech spectral features directly in the original feature space. Nakashika et al. [247] proposed a speaker-dependent sequence modeling method based on an RNN to capture temporal correlations in an acoustic sequence. In [248, 249], a deep bidirectional LSTM (DBLSTM) was employed to capture long-range contextual information and generate high-quality converted speech. DNN-based methods [247-249] efficiently learned feature representations for feature mapping in parallel VC; however, they require large-scale paired source and target speaker utterance data for parallel training, which is not feasible for practical applications in the real world.

VC methods for non-parallel (unpaired) training data have been proposed to achieve VC for multiple speakers with different languages. Powerful VC techniques based on neural networks [250], vocoders [251, 252], GANs [253-259], and VAEs [260-262] have been introduced for non-parallel spectral modeling. Auto-encoder-based approaches attempt to learn disentangled speaker information from linguistic content and independently convert the speaker's identity. The work in [262] investigates the quality of a learned representation by comparing different auto-encoding methods. It shows that a combination of a Vector Quantized VAE and a WaveNet [55] decoder better preserves speaker-invariant linguistic content and retrieves information discarded by the encoder. However, VAE/GAN-based methods tend to over-smooth the transformed features because of the dimensionality-reduction bottleneck. Thus, low-level information such as the pitch contour, noise, and channel data is lost, which results in buzzy-sounding converted voices.

Recent GAN-based approaches, such as CycleGAN [253-256], VAW-GAN [257], and StarGAN [258], attempt to achieve high-quality transformed speech using non-parallel training data. Studies [254, 258] demonstrate state-of-the-art performance for multilingual VC in terms of both naturalness and similarity; however, performance is speaker-dependent and degrades for unseen speakers. Neural vocoders have rapidly become the most popular vocoding approach for speech synthesis due to their ability to generate human-like speech [224]. A vocoder learns to generate an audio waveform from acoustic features. The study [252] analyzes the performance of different vocoders and shows that Parallel WaveGAN [239] can effectively simulate the data distribution of human speech, with its acoustic characteristics, for VC. The performance, however, is still restricted for unseen speaker identities and noisy samples [217]. Recent VC methods based on TTS, like AttS2S-VC [263], Cotatron [264], and VTN [265], use text labels to synthesize speech directly by extracting aligned linguistic characteristics from the input voice. This ensures that the converted speaker and the target speaker's identity
are the same. However, these methods necessitate the use of text labels, which are not always readily accessible.

Recently, one-shot VC techniques [266, 267] have been presented. In contrast to earlier techniques, data samples of the source and target speakers are not required to be seen during training. Furthermore, just one utterance from the source and target speakers is required for conversion. The speaker embedding is extracted from the target speech, which can control the speaker identity of the converted speech independently. Despite these advancements, the performance of few-shot VC techniques for unseen speakers is not stable [268]. This is primarily due to the inadequacy of a speaker embedding extracted from a single speech sample of an unseen speaker [269], which significantly impacts the reliability of one-shot conversions. Other work [270-272] adopts zero-shot VC, where the source and target speakers are unseen during training and the model is not re-trained, by employing an encoder-decoder architecture. The encoder extracts style and content information into a style embedding and a content embedding; the decoder then constructs a speech sample by combining the two. The zero-shot VC scenario is attractive because no adaptation data or parameters are required; however, the adaptation quality is insufficient, especially when the target and source speakers are unseen, diverse, or noisy [268]. A summary of the voice conversion techniques discussed above is presented in Table 15.
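The zero-shot VC formulation described above, i.e., an encoder that separates content and style (speaker) information and a decoder that recombines them, can be illustrated with a minimal sketch. The architecture and dimensions below are illustrative assumptions and are not taken from any of the cited systems.

```python
# Minimal sketch of the zero-shot VC idea: separate style (speaker) and content
# encodings, recombined by a decoder. Sizes are illustrative assumptions only.
import torch
import torch.nn as nn

class ZeroShotVCSketch(nn.Module):
    def __init__(self, n_mels=80, content_dim=128, style_dim=64):
        super().__init__()
        self.content_enc = nn.GRU(n_mels, content_dim, batch_first=True)   # what is said
        self.style_enc = nn.GRU(n_mels, style_dim, batch_first=True)       # who says it
        self.decoder = nn.GRU(content_dim + style_dim, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, source_mels, target_mels):
        content, _ = self.content_enc(source_mels)                 # (B, T, content_dim)
        _, style = self.style_enc(target_mels)                     # final state: (1, B, style_dim)
        style = style[-1].unsqueeze(1).expand(-1, content.size(1), -1)  # broadcast over time
        out, _ = self.decoder(torch.cat([content, style], dim=-1))
        return self.to_mel(out)                                    # converted mel-spectrogram

# Toy usage: convert a 100-frame source utterance toward an unseen target speaker.
src = torch.randn(1, 100, 80)
tgt = torch.randn(1, 60, 80)
print(ZeroShotVCSketch()(src, tgt).shape)   # torch.Size([1, 100, 80])
```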
4.2.3 Audio deepfake detection

Due to recent advances in TTS [55, 224] and VC [268] techniques, audio deepfakes have become a greater threat to voice biometric interfaces and to society [58]. In the field of audio forensics, there are several approaches for identifying spoofed audio. Existing works, however, fail to fully tackle the detection of synthetic speech [276]. In this section, we review the approaches proposed for the detection of audio deepfakes. Table 16 presents a comparison of audio deepfake detection techniques using both handcrafted and deep features.

Techniques based on handcrafted features: Yi et al. [278] presented an approach to identify TTS-based manipulated audio content. In [278], hand-crafted constant-Q cepstral coefficient (CQCC) features were used to train GMM and LCNN classifiers to detect TTS-synthesized speech. This approach exhibits better detection performance for fully synthesized audio; however, performance degrades rapidly for partially synthesized audio clips. Li et al. [277] propose a modified ResNet model, Res2Net. They evaluate the model using different acoustic features and obtain the best performance using CQT features. This model exhibits better audio manipulation detection performance, but its generalization ability needs further improvement. In [283], mel-spectrogram features with a ResNet-34 are employed to detect spoofed speech. This approach works well, but its performance needs improvement. Monteiro et al. [284] propose an ensemble-based model for the detection of synthetic speech. Deep learning models, LCNNs and ResNets, are used to compute deep features, which are later fused to differentiate between real and spoofed speech. This model is robust for fake speech detection; however, it needs to be evaluated on standard datasets. Gao et al. [282] propose a synthetic speech detection approach based on inconsistencies. They employ a global 2D-DCT feature to train a residual network to detect manipulated speech. This model has better generalization ability; however, its performance degrades on noisy samples. Zhang et al. [287] propose a model to detect fake speech using a ResNet model with a transformer encoder (TEResNet). Initially, a transformer encoder is employed to compute a contextual representation of the acoustic keypoints by considering the correlation between audio signal frames. The computed keypoints are then used to train a residual network to differentiate between real and manipulated speech. This work shows better fake audio detection performance; however, it requires extensive training data. Das et al. [279] propose a method to detect manipulated speech. Initially, a signal companding technique for data augmentation is used to increase the diversity of the training data. Then, CQT features are computed from the augmented data, which are later used to train an LCNN classifier. The method improves fake audio detection accuracy but requires extensive training data.

Aljasem et al. [13] propose a hand-crafted, feature-based approach to detect cloned speech. Initially, sign-modified acoustic local ternary pattern features are extracted from input samples. Then, the computed keypoints are used to train an asymmetric, bagging-based classifier to categorize the samples into bona fide and fake. This work is robust to noisy cloned-voice replay attacks; however, its performance needs further improvement. Ma et al. [280] present a continual learning-based technique to enhance the generalization ability of a manipulated speech detection system. A knowledge distillation loss function is introduced in the framework to enhance the learning ability of the model. This approach is computationally efficient and can detect unseen spoofing manipulations; however, its performance has not been evaluated on noisy samples. Borrelli et al. [293] employ bicoherence features together with long-term short-term features. The extracted features are used to train three different types of classifiers: a random forest, a linear SVM, and a radial basis function (RBF) SVM. This method obtains its best accuracy with the SVM classifier. Due to the handcrafted features, however, this work does not generalize to unseen manipulations. In [202], bispectral analysis is performed in order to identify specific and unusual spectral correlations present in GAN-generated speech samples. Similarly, in [281], bispectral and mel-cepstral analysis are performed in order to detect missing durable power components in synthesized speech.
Table 15 Summary of voice conversion techniques

Work | Model | Features | Dataset(s) | Limitations
Ming et al. [248] | DBLSTM | F0 and energy contour | CMU-ARCTIC [273] | Requires parallel training data
Nakashika et al. [247] | Recurrent temporal restricted Boltzmann machines (RTRBMs) | MCC, F0, and aperiodicity | ATR Japanese speech database [274] | Lacks temporal dependencies of speech sequences
Sun et al. [249] | DBLSTM-RNN | MCC, F0, and aperiodicity | CMU-ARCTIC [273] | Requires parallel training data
Wu et al. [250] | DBLSTM with i-vectors | 19D-MCCs, delta and delta-delta, F0, 400-D i-vector | VCTK corpus | Computationally complex
Liu et al. [251] | WaveNet vocoder | MCC and F0 | VCC 2018 | Performance degrades on inter-gender conversions
Kaneko et al. [255] | Encoder-decoder with GAN | 34D-MCC, F0, and aperiodicity | VCC 2018 | Computationally complex; domain-specific voice
Kameoka et al. [258] | Encoder-decoder with GAN | 36D-MCC, F0, and aperiodicity | VCC 2018 | Performance degrades on cross-gender conversion; low performance for unseen speakers
Zhang et al. [259] | VAW-GAN | STRAIGHT spectra [275], F0, and aperiodicity | VCC 2016 | Lacks target speaker similarity
Huang et al. [260] | Encoder-decoder | STRAIGHT spectra [275], MCCs | VCC 2018 | Lacks multi-target VC; introduces abnormal fluctuations in generated speech
Chorowski et al. [262] | VQ-VAE, WaveNet decoder | 13D-MFCC | LibriSpeech; ZeroSpeech 2017 | Over-smooth and low naturalness in generated speech; increased training complexity
Tanaka et al. [263] | BiLSTM encoder-LSTM decoder | Acoustic features | CMU Arctic database | Requires extensive training data
Park et al. [264] | Encoder-decoder | Mel-spectrogram | LibriTTS; VCTK dataset | Requires transcribed data; lacks target speaker similarity
Huang et al. [265] | VAE-vocoder | MCCs, log F0, and aperiodicity | CMU ARCTIC; VCTK corpus | Requires parallel training data
Lu et al. [266] | Attention mechanism in encoder-decoder | 13D-MFCCs, PPGs, and log F0 | VCTK corpus | Low target similarity and naturalness in generated speech
Liu et al. [267] | Encoder and DBLSTM | 19 MFCCs, log F0, and PPG | VCTK corpus | Low target similarity and naturalness in generated speech
Chou et al. [270] | Attention mechanism in encoder-decoder | 19 MFCCs, log F0, and PPG | VCTK corpus | Low quality of converted voices in the case of noisy samples
Qian et al. [271] | Encoder-decoder | Speech spectrogram | VCTK corpus | Prosody flipping between the source and the target; not well generalized to unseen data
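Most of the systems summarized in the table above operate on standard acoustic features such as mel-cepstral coefficients, fundamental frequency (F0), and mel-spectrograms rather than raw waveforms. As a point of reference, the sketch below extracts comparable features with librosa; the file name and parameter values are illustrative assumptions rather than the settings used by any cited method.

```python
# Illustrative extraction of acoustic features commonly used for voice conversion.
# Parameter values are assumptions for illustration, not any cited configuration.
import librosa

wav, sr = librosa.load("utterance.wav", sr=16000)            # hypothetical input file

# 13-dimensional MFCCs, a stand-in for the mel-cepstral features listed above
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)          # (13, n_frames)

# Fundamental frequency (F0) track via probabilistic YIN
f0, voiced_flag, voiced_prob = librosa.pyin(
    wav, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# 80-band mel-spectrogram, the representation predicted by many TTS/VC models
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)

print(mfcc.shape, f0.shape, mel.shape)
```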
The computed features are then used to train several ML-based classifiers, attaining the best performance with a quadratic SVM. These approaches [202, 281] are robust to TTS-synthesized audio; however, they may not be able to detect high-quality synthesized speech. Chen et al. [285] propose a DL-based framework for audio deepfake detection. Sixty-dimensional linear filter bank (LFB) features are extracted from speech samples and later used to train a modified ResNet model. This work improves fake audio detection performance but suffers from a high computational cost. Huang et al. [286] present an approach for audio spoofing detection where, initially, the short-term zero-crossing rate and energy are utilized to identify the periods of silence in each speech signal. In the next step, linear filter bank (LFBank) key-points are computed from the nominated segments in the relatively high-frequency domain. Lastly, an attention-enhanced DenseNet-BiLSTM framework is built to locate the places where the audio is manipulated. This method [286] avoids over-fitting at the expense of a high computational cost. Wu et al. [210] introduce a novel, key-point genuinization based light convolutional neural network (LCNN) framework for the identification of manipulated speech. The attributes of the original speech are utilized to train a model using a CNN. The output is then converted to a key-point distribution closer to that of genuine speech.
Table 16 Comparison of audio deepfake detection techniques

Work | Model | Features | Performance | Dataset | Limitations

Hand-crafted features
Li et al. [277] | Res2Net | CQT | EER=2.502 | ASVspoof2019 | Needs generalization improvement
Yi et al. [278] | GMM / LCNN | CQCC | EER=19.22 (GMM); EER=6.99 (LCNN) | Proprietary | Performance degrades for partially synthesized audio clips
Das et al. [279] | LCNN | CQT | EER=3.13 | ASVspoof2019 | Requires extensive training data
Aljasem et al. [13] | Asymmetric bagging | Combination of MFCC, GTCC, ALTP, and spectral features | EER=5.22 | ASVspoof2019 | Performance needs further improvement
Ma et al. [280] | CNN | 60-D LFCC | EER=9.25 | ASVspoof2019 | Performance degrades on noisy samples
AlBadawy et al. [202] | Logistic regression classifier | Bispectral features | AUC=0.99 | Proprietary | Performance may degrade on high-quality speech samples
Singh et al. [281] | Quadratic SVM | Bispectral and mel-cepstral features | Acc=96.1% | Proprietary | Needs evaluation on a large-scale dataset
Gao et al. [282] | ResNet | 2D-DCT features | EER=4.03 | ASVspoof2019 | Performance degrades on noisy samples
Aravind et al. [283] | ResNet34 | Mel-spectrogram features | EER=5.87 | ASVspoof2019 | Performance needs improvement
Monteiro et al. [284] | LCNN/ResNet | Spectral features | EER=6.38 | Proprietary | Results should be evaluated on a standard dataset
Chen et al. [285] | ResNet | 60-dimensional LFB | EER=1.81 | ASVspoof2019 | Computationally complex approach
Huang et al. [286] | DenseNet-BiLSTM | LFBank | EER=0.53 | ASVspoof2019 | Computationally complex approach
Wu et al. [210] | LCNN | Genuine speech features | EER=4.07 | ASVspoof2019 | Cannot deal with cloned replay attack detection
Zhang et al. [287] | TEResNet | Spectrum features | EER=5.89; EER=3.99 | ASVspoof2019; Fake-or-Real dataset [288] | Requires extensive training data

Deep learning features
Zhang et al. [289] | ResNet-18 + OC-softmax | Deep features | EER=2.19 | ASVspoof2019 | Performance degrades on VC
Gomez-Alanis et al. [290] | LCG-RNN | Deep features | EER=6.28 | ASVspoof2019 | Fails to generalize for unseen attacks
Hua et al. [291] | Res-TSSDNet | Deep features | EER=1.64 | ASVspoof2019 | Computationally complex
Jiang et al. [292] | CNN | Deep features | EER=5.31 | ASVspoof2019 | Performance needs further improvement
Wang et al. [58] | DNN | Deep features | EER=0.021 | Fake-or-Real dataset [288] | Requires evaluation on a challenging dataset
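The results in the table above are reported as equal error rate (EER), i.e., the operating point at which the false-acceptance rate equals the false-rejection rate. The sketch below shows one common way to estimate the EER from detector scores; the scores and labels are toy values, and this is not the official ASVspoof scoring tool.

```python
# Toy EER computation from detector scores (higher score = more likely bona fide).
# Illustrative sketch only, not the official ASVspoof evaluation script.
import numpy as np
from sklearn.metrics import roc_curve

labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])          # 1 = bona fide, 0 = spoofed (toy data)
scores = np.array([0.9, 0.8, 0.4, 0.35, 0.1, 0.05, 0.7, 0.6])

fpr, tpr, _ = roc_curve(labels, scores)               # sweep over all score thresholds
fnr = 1 - tpr
idx = np.nanargmin(np.abs(fnr - fpr))                 # threshold where FNR is closest to FPR
eer = (fpr[idx] + fnr[idx]) / 2
print(f"EER = {eer:.3f}")
```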
The transformed key-points are used with an LCNN to identify genuine and altered speech. This approach [210] is robust for synthetic speech manipulation detection. It is, however, unable to deal with cloned-replay attack detection.
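Several of the handcrafted-feature detectors discussed above share the same pipeline: compute a time-frequency representation such as the constant-Q transform (CQT) or CQCC and train a conventional classifier on it. The sketch below illustrates that general pattern with librosa and scikit-learn; the pooling scheme, file names, and the logistic-regression classifier are simplifications chosen for brevity, not the configurations used in the cited works.

```python
# Illustrative handcrafted-feature spoofing detector: CQT features + a simple classifier.
# Feature settings and the classifier are stand-ins, not the cited configurations.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def cqt_features(path):
    wav, sr = librosa.load(path, sr=16000)
    cqt = np.abs(librosa.cqt(wav, sr=sr))               # constant-Q spectrogram
    log_cqt = librosa.amplitude_to_db(cqt)
    return log_cqt.mean(axis=1)                          # crude per-bin average as a fixed-size vector

# Hypothetical file lists; real systems train on corpora such as ASVspoof 2019.
bona_fide_files = ["real_000.wav", "real_001.wav"]
spoofed_files = ["fake_000.wav", "fake_001.wav"]

X = np.stack([cqt_features(f) for f in bona_fide_files + spoofed_files])
y = np.array([1] * len(bona_fide_files) + [0] * len(spoofed_files))

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X)[:, 1])                        # scores for the bona fide class
```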
Techniques based on deep features: Zhang et al. [289] propose a DL-based approach using ResNet-18 and a one-class (OC) softmax. They train the model to learn a feature space in which real speech can be discriminated from manipulated samples by a certain margin. This method improves generalization against unseen attacks; however, performance degrades on VC attacks generated using waveform filtering. In [290], the authors propose a Light Convolutional Gated RNN (LCGRNN) model to compute deep features and classify real and fake speech. This model is computationally efficient; however, it does not generalize well to real-world examples. Hua et al. [291]
propose an end-to-end synthetic speech detection model, Res-TSSDNet, for the computation of deep features and classification. This model generalizes well to unseen samples; however, this comes at the expense of increased computational cost. Wang et al. [58] propose a DNN-based approach with a layer-wise neuron activation mechanism to differentiate between real and synthetic speech. This approach performs well for fake audio detection; however, the framework requires evaluation on challenging datasets. Jiang et al. [292] propose a self-supervised learning-based approach comprising eight convolutional layers to compute deep features and classify original and fake speech. This work is computationally efficient, but its detection accuracy needs enhancement. Malik et al. [294] propose a CNN for cloned speech detection. Initially, audio samples are converted to spectrograms, on which a CNN framework is used to compute deep features and classify real and fake speech samples. This approach shows better fake audio detection accuracy, but performance degrades on noisy samples. Similarly, in [295], a spatial-temporal CNN model is proposed to process mel-spectrogram sequences in order to identify a given audio sample as real or fake.

Most of the above-mentioned fake speech detection methods have been evaluated on the ASVspoof2019 [218] dataset; however, the recently launched ASVspoof2021 [296] has opened new challenges for the research community. This dataset introduces a separate speech deepfake category that includes highly compressed TTS and VC samples without speaker verification.

4.2.4 Discussion on audio manipulation methods

Generation Extensive work has been presented on the generation of correct and natural speech for real-world applications; however, several areas require further improvement. A good speech synthesis model should produce both a realistic and a clear voice. For this reason, existing works have tried to improve the articulation and genuineness of speech synthesis [55-57]. In recent years, the quality of synthetic voice has improved significantly via the use of deep learning techniques. The significant improvements include voice adaptation [59, 235], one/few-shot learning [266, 267], self-attention networks [270], and cross-lingual voice transfer [254, 258]. However, the ability to produce more human-like, natural-sounding speech in the presence of noise remains challenging. Another main aim of speech synthesis techniques is to deploy a lightweight model that requires less training data [231]. Some of the work on this subject is presented in [270-272]; however, these approaches lack the ability to maintain naturalism in synthesized speech. Therefore, there is a need to develop an efficient and effective speech synthesis model that requires less training data and fewer resources and is also able to maintain realism. Furthermore, when an audio signal is generated with a sampling frequency of less than 16 kHz, it causes a considerable drop in the perceived speech quality [297]. The quality of synthesized speech can be improved by increasing the sampling rate. Some of the existing works suffer from word repetition, skipping, long pauses, or babbling problems, which cause a loss in the intelligibility of the generated speech [229-231]. To address this problem, existing models have introduced style/prosody transfer to generate more expressive voices [229, 232, 240]. Moreover, speech synthesis techniques that maintain the audio characteristics of a specific target require further exploration [235, 236]. Therefore, there is a need to develop systems that can efficiently adapt to a specific target with limited data and high efficiency.

Detection We have presented a detailed literature review of the techniques employed for the detection of synthesized speech in Section 4.2.3. Most of the existing detection approaches are based on the employment of hand-coded features for the detection of altered speech [277-285, 287, 293]. Some additional works have utilized end-to-end training models to detect audio manipulation [58, 292], while others have employed both hand-coded and deep features in a training module for speech synthesis detection [286]. Only a few techniques focus on the detection of more than one type of audio deepfake, e.g., TTS and VC [58, 281]. In the realm of audio manipulation, VC detection has proven more challenging than TTS detection [218]. Several works have used CNN-based methods [292, 295], ensemble methods based on different feature representations [284], or methods that detect unusual aspects in human speech [202, 281]. Several variants of the ResNet model have used deep features to detect audio spoofing [289, 291]. However, one of the limitations of the existing works is the lack of generalization of the detection models. The performance significantly degrades when evaluated on unseen samples or samples generated with different manipulation methods [202, 290]. Lastly, an additional limitation of the existing techniques is detection performance with limited training data and computational resources [289-291].

5 Deepfake datasets

To analyze the detection accuracy of proposed methods, it is of utmost importance to have a good and representative dataset for performance evaluation. Moreover, the techniques should be validated across datasets to show their ability to generalize. Therefore, researchers have put in significant effort over the years preparing standardized datasets for manipulated video and audio content. In this section, we present a detailed review of the standard datasets that are currently used to evaluate the performance of audio and video deepfake detection techniques. Tables 17 and 18 show a comparison of available video and audio deepfake datasets, respectively.
Table 17 Comparison of video deepfake detection datasets

 | UADFV [74] | DF-TIMIT [191] | FF++ [95] | Celeb-DF [194] | DFDC-preview [298] | DF [299] | WDF [300] | FN [301] | FakeAVCeleb [302]
Released | Nov 2018 | Dec 2018 | Jan 2019 | Nov 2019 | Oct 2019 | June 2020 | Oct 2020 | March 2021 | Aug 2021
Total videos | 98 | 620 | 4000 | 1203 | 5250 | 60,000 | 7314 | 221,247 | 20,000
Real content | 48 | – | 1000 | 408 | 1131 | 10,000 | 3805 | – | 500
Fake content | 48 | 620 | 3000 | 795 | 4119 | 50,000 | 3509 | – | 19,500
Tool/technology used for fake content generation | FakeApp application [42] | Faceswap-GAN [65] | Deepfake, CG-manipulations | Deepfake | Unknown | DF-VAE | Collected from internet | Encoder-decoder, GAN, Pix2Pix, RNN/LSTM and 3DMM | FSGAN [67], faceswap [66], Tacotron [233, 238], and Wav2Lip [111]
Avg. duration | 11.4 sec | 4 sec | 18 sec | 13 sec | 30 sec | – | – | Varied | Varied
Resolution | 294 × 500 | 64 × 64 (LQ), 128 × 128 (HQ) | 480p, 720p, 1080p | Varied | 180p–2160p | 1920 × 1080 | Varied | Varied | 225 × 224
Format | – | JPG | H.264; CRF = 0, 23, 40 | Mp4 | H.264 | Mp4 | Mp4 | Mp4 | Mp4
Visual quality | Low | Low | Low | High | High | High | High | Both low and high | Low
Temporal flickering | Yes | Yes | Yes | Improved | Improved | Significantly improved | – | Significantly improved | Improved
Modality | Visual | Audio/visual | Visual | Visual | Audio/visual | Visual | Visual | Visual | Audio/visual
Table 18 Comparison of audio deepfake datasets: LJ Speech dataset [303], M-AILabs dataset [304], Mozilla TTS [305], FOR dataset [288], Baidu dataset [59], ASVspoof 2019 [296], and WaveFake [297].
5.1 Video datasets

UADFV The first dataset released for deepfake detection was UADFV [74]. It consists of a total of 98 videos, where 49 are real videos collected from YouTube whose copies are then manipulated by using the FakeApp application [42] to generate 49 fake videos. The average length of the videos is 11.14 sec, with an average resolution of 294 × 500 pixels. However, the visual quality of the videos is very low, and the resultant alteration is obvious and thus easy to detect.

DeepfakeTIMIT DeepfakeTIMIT [191] was introduced in 2018 and consists of a total of 620 videos of 32 subjects. For each subject, there are deepfake videos of two quality levels: DeepFake-TIMIT-LQ and DeepFake-TIMIT-HQ. In DeepFake-TIMIT-LQ, the resolution of the output image is 64 × 64, whereas in DeepFake-TIMIT-HQ, the output resolution is 128 × 128. The fake content is generated by employing a face-swap GAN [65]. The generated videos are only 4 seconds long, and the dataset contains no audio channel manipulation. Moreover, the resultant videos are often blurry, and people in the actual videos are mostly presented in a full frontal face view against a monochrome background.

FaceForensics++ One of the most famous datasets for deepfake detection is FF++ [95]. This dataset was presented in 2019 as an extended form of the FaceForensics dataset [306], which contains videos with facial expression manipulation only and was released in 2018. The FF++ dataset has four subsets, named FaceSwap [307], DeepFake [43], Face2Face [38], and NeuralTextures [308]. The dataset contains 1000 original videos collected from the YouTube-8M dataset [309] and 3000 manipulated videos generated using the computer graphics and deepfake approaches specified in [306]. This dataset is also available in two quality levels, uncompressed and H.264-compressed, which can be used to evaluate the performance of deepfake detection approaches on both compressed and uncompressed videos. The FF++ dataset fails to generalize to lip-sync deepfakes, however, and some videos exhibit color inconsistencies around the manipulated faces.

Celeb-DF Another popular dataset used for evaluating deepfake detection techniques is Celeb-DF [194]. This dataset presents videos of higher quality and tries to overcome the problem of visible source artifacts found in previous databases. The Celeb-DF dataset contains 408 original videos and 795 fake videos. The original content was collected from YouTube and is divided into two parts, named Real1 and Real2 respectively. In Real1, there are a total of 158 videos of 13 subjects with different genders and skin colors. Real2 comprises 250 videos, each having a different subject, and the synthesized videos are generated from these original videos through the refinement of existing deepfake algorithms [310, 311].

Deepfake Detection Challenge (DFDC) Recently, the Facebook community launched a challenge, aptly named the Deepfake Detection Challenge (DFDC)-preview [312], and released a new dataset that contains 1131 original videos and 4119 manipulated videos. The altered content is generated using two unknown techniques. The final version of the DFDC database is publicly available at [298]. It contains 100,000 fake videos along with 19,000 original samples. The dataset was created using various face-swap-based methods with different augmentations, i.e., geometric and color transformations, varying frame rates, etc., and distractors, i.e., overlaying different types of objects in a video.

DeeperForensics (DF) Another large-scale dataset for deepfake detection, containing 50,000 original and 10,000 manipulated videos, is found in [299]. A novel conditional autoencoder, namely DF-VAE, is used to create the manipulated videos. The dataset comprises highly diverse samples in terms of actor appearance. Further, a mixture of distortions and perturbations, such as compression, blur, and noise, is added to better represent real-world scenarios. Compared to previous datasets [74, 191, 194], the quality of the generated samples is significantly improved.

WildDeepfake WildDeepfake (WDF) [300] is considered to be one of the most challenging deepfake detection datasets. It contains both real and deepfake samples collected from the internet. This dataset contains video samples of diverse subject matter, along with variation in terms of resolution, background, illumination conditions, and compression rates.

ForgeryNet Another advanced visual-deepfake dataset, namely ForgeryNet (FN), is presented in the ForgeryNet Challenge 2021 [301]. ForgeryNet is an extensive, openly available deep face forgery dataset comprising 2.9 million static samples, along with 221,247 videos. This dataset is created by applying 7 different image-level alteration techniques and 8 video-level forgery methods. Furthermore, about 36 various perturbation attacks are added to make the dataset more challenging and closer to real-world scenarios.

FakeAVCeleb The FakeAVCeleb [302] dataset was recently released and contains multimodal deepfake videos that involve manipulation of both audio and video channels with accurate lip-syncing. The dataset is generated using real videos collected from YouTube and popular synthetic algorithms such as FSGAN [67], FaceSwap [66], Tacotron [233, 238], and Wav2Lip [111]. The dataset also includes fine-level video labelling with respect to audio-visual manipulation, resulting
in four pair combinations: real audio-real video, real audio-fake video, fake audio-real video, and fake audio-fake video. Videos featuring celebrities from different ethnic backgrounds and ages, with equal representation of each gender, are included to eliminate racial bias and improve the fairness of deepfake detectors.

A representative map of the datasets based on release year and size is shown in Fig. 14. Furthermore, we have added visual samples from the mentioned datasets in Fig. 15 so that the reader can visually assess the synthesis quality of the deepfake datasets. All of the above-mentioned datasets contain synthesized face portions only; these datasets lack upper/full-body deepfakes. A more robust dataset is needed which also covers entire-body deepfakes.

5.2 Audio datasets

LJ Speech and M-AILabs datasets The LJSpeech [303] and M-AILabs [304] datasets are well-known real-speech databases employed in numerous TTS applications, e.g., Deep Voice 3 [224]. The LJSpeech database is comprised of 13,100 clips totaling 24 hours in length. All samples are recorded by a female speaker. The M-AILabs dataset consists of a total of 999 hours and 32 minutes of audio. This dataset was created with multiple speakers in 9 different languages.

Mozilla TTS Mozilla, developer of the well-known publicly available Firefox browser, released the biggest open-source database of people speaking [305]. Initially, in 2019, the database included 1400 hours of recorded voices in 18 different languages. Later it was extended to 7226 hours of recorded voices in 54 diverse languages. This dataset contains 5.5 million audio clips and was employed by Mozilla's DeepSpeech toolkit.

ASVspoof 2019 Another well-known dataset for fake audio detection is ASVspoof 2019 [218], which is comprised of two parts for performing logical access (LA) and physical access (PA) analysis. Both LA and PA are created from the VCTK base corpus, which comprises audio clips taken from 107 speakers (46 males, 61 females). LA consists of both voice cloning and voice conversion samples, whereas PA consists of replay samples along with bona fide ones. Both parts are further divided into three databases, named training, development, and evaluation, which contain clips from 20 (8 male, 12 female), 10 (4 male, 6 female), and 48 (21 male, 27 female) speakers, respectively. This categorization is diverse in terms of presenters, and the recording conditions are the same for all source samples. The training and development sets contain spoofing occurrences created with the same methods/conditions (labeled as known attacks), while the evaluation set contains samples with unknown attacks.

Fake-or-Real (FOR) dataset The FOR database [288] is another dataset that is widely employed for synthetic voice detection. This database consists of over 195,000 samples of both human and AI-synthesized speech. It groups samples from recent TTS methods (i.e., Deep Voice 3 [224] and Google WaveNet [55]) together with diverse human speech samples (i.e., the Arctic, LJSpeech, and VoxForge datasets). The FOR database has four versions, namely for-original (FO), for-norm (FN), for-2sec (F2S), and for-rerec (FR). FO contains unbalanced voices without alteration, while FN comprises unaltered samples balanced in terms of gender, class, volume, etc. F2S contains the data from FN, but the samples are trimmed to 2 seconds, and the FR version is a re-recorded version of the F2S database, simulating a condition in which an attacker passes a sample via a voice channel (i.e., a cellphone call or a voice message).

Baidu dataset The Baidu Silicon Valley AI Lab cloned-audio dataset is another database employed for cloned speech detection [59]. This database is comprised of 10 ground-truth speech recordings, 120 cloned samples, and 4 morphed samples.

ASVspoof 2021 ASVspoof 2021 [296] is another dataset, released as part of the ASVspoof challenge [276]. Along with the earlier LA and PA partitions, this database includes an extra evaluation partition for an audio deepfake detection track. This database is an extension of ASVspoof 2019 and has no specific training set. It includes only an evaluation set, which comprises speech created from 48 speakers, including 27 females and 21 males. This dataset is more challenging than previous versions and contains various audio coding and compression attacks with different environments and transmission scenarios.

WaveFake Recently, WaveFake (WF) [297], a large-scale audio deepfake detection dataset, was released. It contains 117,985 fake audio clips in 16-bit PCM WAV format. The database uses six different advanced TTS audio generative models across two languages. The synthetic speech samples closely resemble real speech data; however, the dataset lacks diversity and includes samples from only one speaker.
6 Open challenges

6.1 Open challenges in deepfakes generation

Although extensive efforts have been made to improve the visual quality of generated deepfakes, there are still several challenges that need to be addressed. A few of them are discussed below.

Generalization Generative models are data-driven and therefore reflect, in the output, the features learned during training. To generate high-quality deepfakes, a large amount of data is required for training. Moreover, the training process itself requires hours to produce convincing deepfake audio-visual content. Usually, it is easier to obtain a dataset of the source/driving identity, but the availability of sufficient data for a specific victim is a challenging task. Also, retraining the model for each specific target identity is computationally complex. Because of this, a generalized model is required to enable the execution of a trained model for multiple target identities unseen during training or with few training samples available.

Identity Leakage The preservation of the target identity is a problem when there is a significant mismatch between the target identity and the source identity, specifically in face reenactment tasks where target expressions are driven by some source identity. The facial data of the source identity is partially transferred to the generated face. This occurs when training is performed on single or multiple identities, but data pairing is accomplished for the same identity.

Paired Training A trained, supervised model can generate high-quality output, but at the expense of data pairing. Data pairing is concerned with generating the desired output by identifying similar input examples from the training data. This process is laborious and inapplicable to scenarios where different facial behaviors and multiple identities are involved in the training stage.

Pose Variations and Distance from the Camera Existing deepfake techniques generate good results of the target for a frontal facial view. However, the quality of manipulated content degrades significantly for scenarios where a person is looking off-camera, which results in undesired visual artifacts around the facial region. Furthermore, another big challenge for convincing deepfake generation is the facial distance of the target from the camera, as an increase in distance from the capturing device results in low-quality face synthesis.

Illumination Conditions Current deepfake generation approaches produce fake content in a controlled environment with consistent lighting conditions. However, an abrupt change in illumination conditions, such as in indoor/outdoor scenes, results in color inconsistencies and strange artifacts in the resultant videos.

Occlusions One of the main challenges in deepfake generation is the occurrence of occlusion, which results when the face region of the source or victim is obscured by a hand, hair, glasses, or any other item. Moreover, occlusion can be the result of a hidden face or eye portion, which eventually causes inconsistent facial features in the manipulated content.

Temporal Coherence Another drawback of generated deepfakes is the presence of evident artifacts such as flickering and jittering among frames. These effects occur because deepfake generation frameworks work on each frame without taking temporal consistency into account. To overcome this limitation, some works either provide this context to the generator or discriminator, consider temporal coherence losses, employ RNNs, or take a combination of all these approaches.

High-quality audio speech synthesis TTS and VC based on neural networks attempt to push the boundaries and generate realistic speech for real-world applications. Currently generated audio speech signals, however, lack qualities present in human speech, such as pauses, varying emotions, realism, expressiveness, accent, robustness, and controllability. Several generative models, such as VAEs [239, 262, 271], GANs [239, 255, 257], vocoders [228, 252], and end-to-end learning models [224, 238], are used to improve the quality of the synthesized audio signal. However, there is a need for improved modeling techniques that produce speech which is spontaneous, expressive, and varied in style, to enhance the naturalness of generated audio samples.

Robust speech synthesis The synthesis of high-quality speech for different languages requires extensive training and labeled text data, and consumes huge computing resources. Such settings introduce an extensive computational burden, which usually results in a trade-off between the quality and the inference time of generated audio content. The research community has taken several initiatives to introduce lightweight audio signal generation techniques, such as the ZeroSpeech Challenge [313], where the speech signal is generated from audio data only. However, to cope with real-world scenarios, there is a need for a more robust approach that can generate a high-quality signal from a small training dataset with low resource consumption.

Speech Adaptability The existing speech synthesis techniques are target-specific, i.e., they are capable of generating an audio
signal for the specific person on which the model is trained. Such approaches lack the ability to generate a high-quality signal for unseen instances, which reveals a deficit in the generalization ability of existing speech synthesis models. The main reason models lack adaptability is over-fitting on the training data, which makes them unable to learn enough acoustic information to generate samples for a new target. Therefore, a more accurate, generalizable model is required to tackle the current challenges of speech generation models [59, 235, 268].

Realism in synthetic audio speech Though the quality of synthetic audio is certainly getting much better, there is still a need for improvement. Some of the main challenges are the lack of natural emotions and of control over duration, sound volume, and the pace at which the target speaks. The existing speech generation models use one-to-many mappings [229, 240], which produce a low-quality speech signal lacking expressiveness when sample data is insufficient. Therefore, there is a need for an efficient model that can better learn the varying qualities of speech signals in order to produce high-quality synthetic audio.

6.2 Challenges in deepfakes detection methods

Although remarkable advancements have been made in the performance of deepfake detectors, there are numerous concerns about current detection techniques that need attention. Some of the challenges of deepfake detection approaches are discussed in this section.

Quality of deepfake datasets The accessibility of large databases of deepfakes is an important factor in the development of deepfake detection techniques. Analyzing the quality of videos from these datasets, however, reveals several ambiguities when compared to actual manipulated content found on the internet. Different visual artifacts that can be observed in these databases are: i) temporal flickering in some cases during speech, ii) blurriness around the facial regions, iii) over-smoothness of facial texture or lack of facial texture detail, iv) lack of head pose movement or rotation, v) lack of face-occluding objects, such as glasses, and of lighting effects, vi) sensitivity to variations in input posture or gaze, skin color inconsistency, and identity leakage, and vii) limited availability of a combined high-quality audio-visual deepfake dataset. The aforementioned dataset ambiguities are due to imperfect steps in the manipulation techniques. Furthermore, manipulated content of low quality can hardly be convincing or create a real impression. Therefore, even if detection approaches exhibit better performance on such videos, it is not guaranteed that these methods will perform well when employed in real-world scenarios.

Performance evaluation Presently, deepfake detection methods are formulated as a binary classification problem, where each sample can be either real or fake. Such classifiers are easier to build in a controlled environment, where deepfake detection techniques are developed and verified using audio-visual content that is either original or fabricated. In real-world scenarios, however, videos can be altered in ways other than deepfakes, so the fact that content was not detected as manipulated does not guarantee that the video is an original one. Furthermore, deepfake content can be the subject of multiple types of alteration, i.e., audio and/or visual, and therefore a single label may not be completely accurate. Moreover, in visual content containing multiple people's faces, more than one of them could be manipulated with deepfakes over a segment of frames. Therefore, any binary classification scheme should be extended to a multiclass/multi-label one and should utilize local classification/detection at the frame level to cope with the challenges of real-world scenarios.

Model scalability Another main challenge for existing deepfake detection models is the lack of scalability to large-scale platforms, such as social media [197, 314]. When used in a real-world scenario, inference time becomes a critical factor for detecting fake audio-visual content. Designing a model with high accuracy but a very long inference time makes the approach unlikely to be widely used in actual applications. Therefore, there is a need for detection techniques that offer real-time performance with a high accuracy rate for massive deepfake content detection.

Explainability in detection methods Existing deepfake detection approaches are typically designed to perform batch analysis over a large dataset; however, when these techniques are employed in the field by journalists or law enforcement, there may only be a small set of videos available for analysis. A numerical score corresponding to the probability of an audio or video being real or fake is of limited value to practitioners if it cannot be supported by appropriate evidence. In those situations, it is very common to demand an explanation of the numerical score before the analysis can be believed and used for publication or in a court of law. Most deepfake detection methods lack such an explanation, however, particularly those based on DL approaches, due to their black-box nature.

Fairness and trust It has been observed that existing audio and visual deepfake datasets are biased and contain imbalanced data across different races and genders. Furthermore, the detection techniques employed can be biased as well. Although researchers have started working in this area to fill the gap, very little work is available [315]. Hence, there is an urgent need to introduce approaches that improve the data and the fairness of detection algorithms.
Temporal aggregation Existing deepfake detection methods for detection methods. Different works [316, 317] have eval-
are based on binary classification at the frame level, i.e. uated the performance of the state-of-the-art visual deepfake
checking the probability that each video frame is real or ma- detectors in the presence of adversarial attacks and display an
nipulated. These approaches do not consider temporal consis- intense reduction in accuracy. In the case of audio, studies
tency between frames, however, and suffer from two potential such as [318, 319] show that several adversarial pre/post-
problems: (i) deepfake content shows temporal artifacts, and processing operations can be used to evade spoof detection.
(ii) real or fake frames could appear in sequential intervals. Similarly, the method in [320] is concerned with improving
Furthermore, these techniques require an extra step to com- the quality of GAN-generated samples by enhancing spectral
pute the integrity score at the video level, as these methods distributions. Such methods ultimately result in removing fake
need to combine the score from each frame to generate a final traces in the frequency domain and complicate the detection
value. process [321, 322]. A third method, in [323–325], uses ad-
vanced image filtering techniques to improve generation qual-
Social media laundering Social platforms like Twitter, Facebook, or Instagram are among the main online networks used to spread audio-visual content among the general public. To save bandwidth on the network or to secure the user's privacy, such content is commonly stripped of meta-data, down-sampled, and substantially compressed before uploading. These manipulations, normally known as social media laundering, remove clues about the underlying forgeries and eventually increase false positive detection rates. Most deepfake detection approaches employ signal-level keypoints and are therefore strongly affected by social media laundering. One measure to increase the accuracy of deepfake identification approaches under social media laundering is to include simulations of these effects in the training data, and to extend evaluation databases with social media laundered visual content.
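One way to make this countermeasure concrete is to emulate laundering inside the training pipeline. The sketch below is a minimal illustration under our own assumptions (it is not the augmentation used by any cited work): it down-samples a frame and re-encodes it as a low-quality JPEG, which also discards metadata, so that a detector sees laundered variants of each sample during training.

# Minimal sketch (our own illustration): simulating "social media laundering"
# -- down-sampling and aggressive re-compression -- so that such effects are
# represented in detector training data.
import io
import random
from PIL import Image

def launder(image: Image.Image,
            scale_range=(0.4, 0.9),
            jpeg_quality_range=(30, 75)) -> Image.Image:
    """Return a copy of `image` processed the way sharing platforms often do."""
    scale = random.uniform(*scale_range)
    w, h = image.size
    small = image.resize((max(1, int(w * scale)), max(1, int(h * scale))),
                         Image.BILINEAR)
    resized = small.resize((w, h), Image.BILINEAR)        # back to original size
    buf = io.BytesIO()
    quality = random.randint(*jpeg_quality_range)
    resized.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).copy()                         # re-encoded, metadata gone

# Usage: apply with some probability inside the training data loader, e.g.
# augmented = launder(frame) if random.random() < 0.5 else frame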
Diversified audio DeepFake detection datasets Currently, extensive and diverse datasets for visual deepfake detection are available; however, there is a lack of such datasets for audio deepfake detection systems. Synthesized audio datasets such as ASVspoof-2021 [296] and WaveFake [297] have recently been introduced, but the ASVspoof-2021 dataset does not contain specific training data for the audio deepfake track, while the others contain samples from a single speaker only. Therefore, existing audio deepfake detection approaches still require a more challenging and diverse dataset for the evaluation and detection of real-world deepfakes.

DeepFake detection evasion Most deepfake detection methods are concerned with missing information and artifacts left during the generation process. Detection techniques may fail, however, when this data is unavailable, as attackers attempt to remove such traces during the manipulation generation process. Such fooling techniques are classified into three types: adversarial perturbation attacks, elimination of manipulation traces in the frequency domain, and the employment of image filtering to mislead detectors. In the case of visual adversarial attacks, different perturbations, such as random cropping, noise, and JPEG compression, are added to the input data, which ultimately results in high false alarms for detection methods. Different works [316, 317] have evaluated the performance of state-of-the-art visual deepfake detectors in the presence of adversarial attacks and show an intense reduction in accuracy. In the case of audio, studies such as [318, 319] show that several adversarial pre/post-processing operations can be used to evade spoof detection. Similarly, the method in [320] is concerned with improving the quality of GAN-generated samples by enhancing spectral distributions. Such methods ultimately remove fake traces in the frequency domain and complicate the detection process [321, 322]. A third approach, in [323–325], uses advanced image filtering techniques to improve generation quality, such as the removal of model-based fingerprints left during generation and the addition of noise to remove fake signs. These methods pose a real challenge for deepfake detection, thus the research community needs to propose techniques that are robust and resilient against such attacks.
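The frequency-domain traces that such attacks try to erase can be made concrete with a small, self-contained check. The sketch below is our own illustration (not a method from the works cited above): it computes the radially averaged power spectrum of a grayscale image, in whose high-frequency tail GAN up-sampling often leaves abnormal energy, which is exactly the statistic that spectral-enhancement attacks aim to flatten. The function name and bin count are arbitrary choices.

# Minimal sketch (our own illustration): a simple frequency-domain cue that
# spectral-enhancement attacks attempt to erase.
import numpy as np

def radial_power_spectrum(gray_image, n_bins=64):
    """Return the azimuthally averaged log power spectrum of a 2-D array."""
    img = np.asarray(gray_image, dtype=float)
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    cy, cx = h // 2, w // 2
    y, x = np.indices((h, w))
    r = np.sqrt((y - cy) ** 2 + (x - cx) ** 2)     # distance from spectrum center
    bins = np.linspace(0, r.max(), n_bins + 1)
    which = np.clip(np.digitize(r.ravel(), bins) - 1, 0, n_bins - 1)
    profile = np.bincount(which, weights=power.ravel(), minlength=n_bins)
    counts = np.bincount(which, minlength=n_bins)
    return np.log1p(profile / np.maximum(counts, 1))

# A detector could feed this 64-dimensional profile to a simple classifier;
# an attacker who reshapes the spectrum (as discussed above) degrades the cue.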
7 Future directions

Synthetic media is gaining a lot of attention because of its potential positive and negative impact on our society. The competition between deepfake generation and detection will not end in the foreseeable future, although impressive work has been presented for the generation and detection of deepfakes. There is still, however, room for improvement. In this section, we discuss the current state of deepfakes, their limitations, and future trends.

7.1 Creation

Visual media has more influence compared to text-based disinformation. Recently, the research community has focused more on the generation of identity-agnostic models and high-quality deepfakes. A few notable improvements are i) a reduction in the amount of training data due to the introduction of un-paired self-supervised methods [326], ii) quick learning, which allows identity stealing using a single image [132, 134], iii) enhancements in visual details [60, 147], iv) improved temporal coherence in generated videos by employing optical flow estimation and GAN-based temporal discriminators [107], v) the alleviation of visible artifacts around the face boundary by adding secondary networks for seamless blending [69], and vi) improvements in synthesized face quality by adding multiple losses with different responsibilities, such as occlusion, creation, conversion, and blending [112]. Several approaches have been proposed to boost the visual quality and realism of deepfake generation; however, there are a few limitations. Most current synthetic media generation focuses on a frontal face pose. In facial reenactment, for good results the face is swapped with a lookalike identity. However, it is not always possible to have the best match, which ultimately results in identity leakage.

AI-based manipulation is not restricted to the creation of visual content only; it has also led to the generation of highly realistic audio deepfakes. The quality of audio deepfakes has significantly improved, and less training data is now required to generate realistic synthetic audio of the target speaker. The employment of synthesized speech for impersonating targets can produce highly convincing deepfakes with a markedly adverse impact on society. Currently, audio-visual content is generated separately using multiple disconnected steps, which ultimately results in the generation of asynchronous content. Present deepfake generation focuses on the face region only; however, the next generation of deepfakes is expected to target full-body manipulations, such as a change in body pose, along with convincing expressions. Target-specific joint audio-visual synthesis with more naturalness and realism in speech is a new cutting-edge application of the technology in the context of persona appropriation [108, 327]. Another possible trend is the creation of real-time deepfakes. Some researchers have already reported attaining real-time deepfakes at 30 fps [67]. Such alterations will result in the generation of more believable deepfakes.
7.2 Detection

To prevent deepfake misinformation and disinformation, some authors have presented approaches to identify forensic changes made to visual content by employing the concept of blockchain and smart contracts [328–330]. In [329] the authors utilized Ethereum smart contracts to locate and track the origin and history of manipulated information and its source, even in the presence of multiple manipulation attacks. This smart contract applied InterPlanetary File System (IPFS) hashes to saved videos, together with their metadata. While this method may perform well for deepfake identification, it is applicable only if the video metadata exists. Thus, the development and adoption of such techniques could be useful for newswires; however, the vast majority of content created by ordinary citizens will not be protected by such techniques.
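The kind of record such a provenance scheme stores can be illustrated with a few lines of code. The sketch below is a simplification under our own assumptions: it stands in for the Ethereum smart contract and IPFS content identifiers of [328–330] with plain SHA-256 digests, and the function name is ours.

# Simplified illustration (not the actual contract logic of [328-330]): the kind
# of record a provenance scheme can anchor on a ledger -- a content hash of the
# video bytes plus a hash of its metadata.
import hashlib
import json

def provenance_record(video_path: str, metadata: dict) -> dict:
    sha = hashlib.sha256()
    with open(video_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):   # hash in 1 MiB chunks
            sha.update(chunk)
    meta_digest = hashlib.sha256(
        json.dumps(metadata, sort_keys=True).encode("utf-8")).hexdigest()
    return {"video_sha256": sha.hexdigest(), "metadata_sha256": meta_digest}

# Verification later re-hashes the downloaded file and compares it with the
# stored record; the check fails as soon as a single frame or the metadata has
# been altered -- but only if the original record (and metadata) exists.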
Recent automated deepfake identification approaches typically deal with face-swapping videos, and the majority of uploaded fake videos belong in this category. Major improvements in detection algorithms include i) identification of artifacts left during the generation process, such as inconsistencies in head pose [74], lack of eye blinking [80], color variations in facial texture [160], and teeth alignment, ii) detection of unseen GAN-generated samples, iii) spatio-temporal features, and iv) physiological signals like heart rate [89] and an individual's behavior patterns [83]. Although extensive work has been presented for automated detection, these automated detection methods are expected to be short-lived and require improvements on multiple fronts. The following are some of the unresolved challenges in the domain of deepfake detection.

• The existing methods are not robust to post-processing operations like compression, noise, light variations, etc. Moreover, limited work has been presented that can detect both audio and visual deepfakes.

• Recently, most techniques have focused on face-swap detection by exploiting its limitations, like visible artifacts. However, with immense developments in technology, the near future will produce more sophisticated face-swaps, such as impersonating someone with a target having a similar face shape, personality, and hairstyle. Aside from this, other types of deepfake, like face-reenactment and lip-synching, are getting stronger day by day.

• The introduction of Vision Transformer techniques, which use a self-attention mechanism to learn meaningful representations from the input, has shown remarkable performance in a variety of machine vision tasks. The concept of patch embedding with CNN features can perform well for deepfake detection due to its accuracy and high recall rate. Even though some work has been presented by researchers [331–333], there is a need for more exploration of this concept, as these approaches have the potential to better tackle the challenges of deepfake recognition, such as robustness against unseen manipulations and perturbation attacks.

• Existing deepfake detectors have mainly relied on the signatures of existing deepfakes by using ML techniques, including unsupervised clustering and supervised classification methods, and are therefore less likely to detect unknown deepfakes. Both anomaly-based and signature-based detection methods have their own pros and cons. For example, anomaly detection-based approaches show a high false alarm rate because they may misclassify a bona fide multimedia sample whose patterns are rare in the dataset. On the other hand, signature-based approaches cannot discover unknown attacks [334]. Therefore, a hybrid approach using both anomaly- and signature-based detection needs to be studied in order to identify known and unknown attacks (a minimal score-level sketch of this idea is given after this list). Furthermore, a collaboration with the Reinforcement Learning (RL) method could be added to the hybrid signature and anomaly approach. More specifically, RL can give a reward (or penalty) to the system when it selects frames that contain (or do not contain) anomalies or any signs of manipulation. Additionally, in the future, deep reinforcement active learning approaches [335, 336] could play a pivotal role in the detection of deepfakes.

• Anti-forensic, or adversarial, ML techniques can be employed to reduce the classification accuracy of automated detection methods. Game-theoretic approaches could be employed to mitigate adversarial attacks on deepfake detectors. Additionally, RL, and particularly deep reinforcement learning (DRL), is extremely efficient in solving intricate cyber-defense problems. Thus, DRL could offer great potential not only for deepfake detection but also for countering anti-forensic attacks on the detectors. Since RL can model an autonomous agent that takes sequential actions optimally with limited, or without prior, knowledge of the environment, it could be used to meet the need for algorithms that capture traces of anti-forensic processing and to design attack-aware deepfake detectors. The defense of deepfake detectors against adversarial input could be modeled as a two-player zero-sum game in which player utilities sum to zero at each time step. The defender here is represented by an actor-critic DRL algorithm [337].

• Current deepfake detectors face challenges, particularly due to incomplete, sparse, and noisy data in the training phases. There is a need to explore innovative AI architectures, algorithms, and approaches that "bake in" physics, mathematics, and prior knowledge relevant to deepfakes. Embedding physics and prior knowledge into AI using knowledge-infused learning will help to overcome the challenges of sparse data and will facilitate the development of generative models that are causal and explanative.

• Most of the existing approaches have focused on one specific type of feature, such as landmark features. However, as the complexity of deepfakes is increasing, it is important to fuse landmarks, photoplethysmography (PPG), and audio-based features (a feature-fusion sketch is shown after this list). Likewise, it is important to evaluate the fusion of classifiers. In particular, the fusion of anomaly- and signature-based ensemble learning will assist in improving the accuracy of deepfake detectors.

• Existing research on deepfakes has mainly focused on detecting manipulation in the visual content of the video; however, audio manipulation, an integral component of deepfakes, has been mostly ignored by the research community. There exists a need to develop unified deepfake detectors that are capable of effectively detecting both audio (i.e., TTS synthesis, voice conversion, cloned replay) and visual forgeries (face-swap, lip-sync, and puppet-master) simultaneously.

• Existing deepfake datasets lack the attributes (i.e., multiple visual and audio forgeries, etc.) required to evaluate the performance of more robust deepfake detection methods. As stated above, the research community has hardly explored the fact that deepfake videos contain not only visual forgeries but audio manipulations as well. Existing deepfake datasets do not consider audio forgery and only focus on visual forgeries. In the near future, the role of voice cloning (TTS synthesis, VC) and replay spoofing may increase in deepfake video generation. Additionally, shallow audio forgeries can easily be fused along with deep audio forgeries in deepfake videos. We have already developed a voice spoofing detection corpus [338] for single- and multi-order replay attacks. Currently, we are working on developing a robust voice cloning and audio-visual deepfake dataset that can be effectively used to evaluate the performance of future audio-visual deepfake detection methods.

• A unified method is needed to address the variations of cloning attacks, such as cloned replay. The majority of voice spoofing detectors target detecting either replay or cloning attacks [218, 277, 286]. These two-class oriented, genuine vs. spoof countermeasures are not ready to counter multiple spoofing attacks on automatic speaker verification (ASV) systems. A study on presentation attack detection indicated that countermeasures trained on a specific type of spoofing attack do not generalize well to other types of spoofing attacks [339]. Moreover, a unified countermeasure that can detect replay and cloning attacks in multi-hop scenarios, where multiple microphones and smart speakers are chained together, does not exist. We addressed the problem of spoofing attack detection in multi-hop scenarios in our prior work [11], but only for voice replay attacks. Therefore, there exists an urgent need to develop a unified countermeasure that can effectively detect a variety of spoofing attacks (i.e., replay, cloning, and cloned replay) in a multi-hop scenario.

• The exponential growth of smart speakers and other voice-enabled devices has made automatic speaker verification (ASV) a fundamental component. However, optimal utilization of ASV in critical domains, such as financial services, health care, etc., is not possible unless we counter the threats of multiple voice spoofing attacks on the ASV. Thus, this vulnerability also presents a need to develop a robust and unified spoofing countermeasure.

• There exists a crucial need to implement federated learning-based, lightweight approaches to detect the manipulation at the source, so that an attack does not traverse a network of smart speakers (or other IoT devices) [10, 11].
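To make the hybrid idea mentioned in the list above concrete, the following minimal sketch (our own illustration, with placeholder features, weights, and models) combines a signature classifier trained on known fakes with an anomaly detector trained only on bona fide samples, so that an unseen manipulation can still raise the final score.

# Minimal sketch of the hybrid anomaly/signature idea. Features, weights, and
# the squashing constant are placeholders, not tuned values from the paper.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
real_feats = rng.normal(0.0, 1.0, size=(500, 16))            # stand-in features
known_fake_feats = rng.normal(1.5, 1.0, size=(500, 16))

signature_clf = LogisticRegression(max_iter=1000).fit(
    np.vstack([real_feats, known_fake_feats]),
    np.concatenate([np.zeros(500), np.ones(500)]))
anomaly_det = IsolationForest(random_state=0).fit(real_feats)  # real data only

def hybrid_fake_score(features, w_sig=0.6, w_anom=0.4):
    features = np.atleast_2d(features)
    p_sig = signature_clf.predict_proba(features)[:, 1]        # known-fake probability
    # decision_function < 0 means "anomalous"; squash it into (0, 1).
    p_anom = 1.0 / (1.0 + np.exp(5.0 * anomaly_det.decision_function(features)))
    return w_sig * p_sig + w_anom * p_anom

unseen = rng.normal(3.0, 1.0, size=(5, 16))                    # "unknown" fakes
print(hybrid_fake_score(np.vstack([real_feats[:5], unseen])))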
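The multimodal fusion mentioned in the list above can be sketched as simple feature-level concatenation. The three extractors below are placeholders standing in for a real landmark tracker, rPPG estimator, and audio embedding network; only the normalize-then-concatenate pattern is the point of the sketch.

# Minimal sketch of feature-level fusion for the modalities mentioned above.
import numpy as np

def extract_landmark_features(frames):    # placeholder front-end
    return np.random.default_rng(1).normal(size=136)   # e.g. 68 (x, y) landmarks

def extract_ppg_features(frames):         # placeholder rPPG statistics
    return np.random.default_rng(2).normal(size=32)

def extract_audio_features(waveform):     # placeholder audio embedding
    return np.random.default_rng(3).normal(size=64)

def fused_feature_vector(frames, waveform):
    """Z-score each modality separately, then concatenate into one vector."""
    parts = [extract_landmark_features(frames),
             extract_ppg_features(frames),
             extract_audio_features(waveform)]
    normed = [(p - p.mean()) / (p.std() + 1e-9) for p in parts]
    return np.concatenate(normed)          # 136 + 32 + 64 = 232-dimensional input

# The fused vector can then feed any of the classifiers discussed above,
# including the hybrid anomaly/signature combination sketched previously.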
8 Conclusion

This survey paper presents a comprehensive review of existing deepfake generation and detection methods. Not all digital manipulations are harmful. Due to immense technological advancements, however, it is now very easy to produce realistic fabricated content. Therefore, malicious users can use it to spread disinformation, to attack individuals, and to cause social, psychological, religious, mental, and political stress. In the future, we imagine seeing the results of fabricated content in many other modalities and industries. There is a cold war between deepfake generation and detection methods: improvements on one side create challenges for the other. We provided a detailed analysis of existing audio and video deepfake generation and detection techniques, along with their strengths and weaknesses. We have also discussed existing challenges and the future directions of both deepfake creation and identification methods.

Acknowledgements This material is based upon work supported by the National Science Foundation (NSF) under Grant number 1815724, the Punjab Higher Education Commission of Pakistan under Award No. (PHEC/ARA/PIRCA/20527/21), and the Michigan Translational Research and Commercialization (MTRAC) Advanced Computing Technologies (ACT) Grant Case number 292883. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF and MTRAC ACT.

References

1. Goodfellow I et al (2014) Generative adversarial nets. Adv Neural Inf Proces Syst 1:2672–2680
2. Etienne H (2021) The future of online trust (and why Deepfake is advancing it). AI Ethics 1:553–562. https://doi.org/10.1007/s43681-021-00072-1
3. ZAO. https://apps.apple.com/cn/app/zao/id1465199127. Accessed September 09, 2020
4. Reface App. https://reface.app/. Accessed September 11, 2020
5. FaceApp. https://www.faceapp.com/. Accessed September 17, 2020
6. Audacity. https://www.audacityteam.org/. Accessed September 09, 2020
7. Sound Forge. https://www.magix.com/gb/music/sound-forge/. Accessed January 11, 2021
8. Shu K, Wang S, Lee D, Liu H (2020) Mining disinformation and fake news: concepts, methods, and recent advancements. In: Disinformation, misinformation, and fake news in social media. Springer, pp 1–19
9. Chan C, Ginosar S, Zhou T, Efros AA (2019) Everybody dance now. In: Proceedings of the IEEE international conference on computer vision, pp 5933–5942
10. Malik KM, Malik H, Baumann R (2019) Towards vulnerability analysis of voice-driven interfaces and countermeasures for replay attacks. In: 2019 IEEE conference on multimedia information processing and retrieval (MIPR). IEEE, pp 523–528
11. Malik KM, Javed A, Malik H, Irtaza A (2020) A light-weight replay detection framework for voice controlled iot devices. IEEE J Sel Top Sign Process 14:982–996
12. Javed A, Malik KM, Irtaza A, Malik H (2021) Towards protecting cyber-physical and IoT systems from single-and multi-order voice spoofing attacks. Appl Acoust 183:108283
13. Aljasem M, Irtaza A, Malik H, Saba N, Javed A, Malik KM, Meharmohammadi M (2021) Secure automatic speaker verification (SASV) system through sm-ALTP features and asymmetric bagging. IEEE Trans Inf Forensics Secur 16:3524–3537
14. Sharma M, Kaur M (2022) A review of Deepfake technology: an emerging AI threat. Soft Comput Secur Appl:605–619
15. Zhang T (2022) Deepfake generation and detection, a survey. Multimed Tools Appl 81:6259–6276. https://doi.org/10.1007/s11042-021-11733-y
16. Malik A, Kuribayashi M, Abdullahi SM, Khan AN (2022) DeepFake detection for human face images and videos: a survey. IEEE Access 10:18757–18775
17. Rana MS, Nobi MN, Murali B, Sung AH (2022) Deepfake detection: a systematic literature review. IEEE Access
18. Verdoliva L (2020) Media forensics and deepfakes: an overview. IEEE J Sel Top Sign Process 14:910–932
19. Tolosana R, Vera-Rodriguez R, Fierrez J, Morales A, Ortega-Garcia J (2020) Deepfakes and beyond: a survey of face manipulation and fake detection. Inf Fusion 64:131–148
20. Nguyen TT, Nguyen CM, Nguyen DT, Nguyen DT, Nahavandi S (2019) Deep learning for deepfakes creation and detection. arXiv preprint arXiv:190911573
21. Mirsky Y, Lee W (2021) The creation and detection of deepfakes: a survey. ACM Comput Surv 54:1–41
22. Oliveira L (2017) The current state of fake news. Procedia Comput Sci 121:817–825
23. Chesney R, Citron D (2019) Deepfakes and the new disinformation war: the coming age of post-truth geopolitics. Foreign Aff 98:147
24. Karnouskos S (2020) Artificial intelligence in digital media: the era of deepfakes. IEEE Trans Technol Soc 1:138–147
25. Stiff H, Johansson F (2021) Detecting computer-generated disinformation. Int J Data Sci Anal 13:363–383. https://doi.org/10.1007/s41060-021-00299-5
26. Dobber T, Metoui N, Trilling D, Helberger N, de Vreese C (2021) Do (microtargeted) deepfakes have real effects on political attitudes? Int J Press Polit 26:69–91
27. Lingam G, Rout RR, Somayajulu DV (2019) Adaptive deep Q-learning model for detecting social bots and influential users in online social networks. Appl Intell 49:3947–3964
28. Shao C, Ciampaglia GL, Varol O, Yang K-C, Flammini A, Menczer F (2018) The spread of low-credibility content by social bots. Nat Commun 9:1–9
29. Marwick A, Lewis R (2017) Media manipulation and disinformation online. Data & Society Research Institute, New York, pp 7–19
30. Tsao S-F, Chen H, Tisseverasinghe T, Yang Y, Li L, Butt ZA (2021) What social media told us in the time of COVID-19: a scoping review. Lancet Digit Health 3:e175–e194
31. Pierri F, Ceri S (2019) False news on social media: a data-driven survey. ACM SIGMOD Rec 48:18–27
32. Chesney B, Citron D (2019) Deep fakes: a looming challenge for privacy, democracy, and national security. Calif Law Rev 107:1753
33. Güera D, Delp EJ (2018) Deepfake video detection using recurrent neural networks. In: 2018 15th IEEE international conference on advanced video and signal based surveillance (AVSS). IEEE, pp 1–6
34. Gupta S, Mohan N, Kaushal P (2021) Passive image forensics using universal techniques: a review. Artif Intell Rev 1:1–51
35. Pavan Kumar MR, Jayagopal P (2021) Generative adversarial networks: a survey on applications and challenges. Int J Multimed Inf Retr 10:1–24. https://doi.org/10.1007/s13735-020-00196-w
36. Choi Y, Choi M, Kim M, Ha J-W, Kim S, Choo J (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8789–8797
37. Suwajanakorn S, Seitz SM, Kemelmacher-Shlizerman I (2017) Synthesizing Obama: learning lip sync from audio. ACM Trans Graph 36:95–108. https://doi.org/10.1145/3072959.3073640
38. Thies J, Zollhofer M, Stamminger M, Theobalt C, Nießner M (2016) Face2face: real-time face capture and reenactment of rgb videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2387–2395
39. Wiles O, Sophia Koepke A, Zisserman A (2018) X2face: a net- 60. Wang T-C, Liu M-Y, Zhu J-Y, Tao A, Kautz J, Catanzaro B
work for controlling face generation using images, audio, and pose (2018) High-resolution image synthesis and semantic manipula-
codes. In: Proceedings of the European conference on computer tion with conditional gans. In: Proceedings of the IEEE conference
vision (ECCV), pp 670–686 on computer vision and pattern recognition, pp 8798–8807
40. Bregler C, Covell M, Slaney M (1997) Video rewrite: driving 61. Nirkin Y, Masi I, Tuan AT, Hassner T, Medioni G (2018) On face
visual speech with audio. In: Proceedings of the 24th annual con- segmentation, face swapping, and face perception. In 2018 13th
ference on Computer graphics and interactive techniques, pp 353– IEEE international conference on automatic face & gesture recog-
360 nition (FG 2018). IEEE, pp 98–105
41. Johnson DG, Diakopoulos N (2021) What to do about deepfakes. 62. Bitouk D, Kumar N, Dhillon S, Belhumeur P, Nayar SK (2008)
Commun ACM 64:33–35 Face swapping: automatically replacing faces in photographs. In:
42. FakeApp 2.2.0. https://www.malavida.com/en/soft/fakeapp/. ACM transactions on graphics (TOG). ACM, pp 39
Accessed September 18, 2020 63. Lin Y, Lin Q, Tang F, Wang S (2012) Face replacement with
43. Faceswap: Deepfakes software for all. https://github.com/ large-pose differences. In: Proceedings of the 20th ACM interna-
deepfakes/faceswap. Accessed September 08, 2020 tional conference on multimedia. ACM, pp 1249–1250
44. DeepFaceLab. https://github.com/iperov/DeepFaceLab. Accessed 64. Smith BM, Zhang L (2012) Joint face alignment with non-
August 18, 2020 parametric shape models. In: European conference on computer
45. Siarohin A, Lathuilière S, Tulyakov S, Ricci E, Sebe N (2019) vision. Springer, pp 43–56
First order motion model for image animation. In: Advances in 65. Faceswap-GAN https://github.com/shaoanlu/faceswap-GAN.
neural information processing systems, pp 7137–7147 Accessed September 18, 2020
46. Zhou H, Sun Y, Wu W, Loy CC, Wang X, Liu Z (2021) Pose- 66. Korshunova I, Shi W, Dambre J, Theis L (2017) Fast face-swap
controllable talking face generation by implicitly modularized using convolutional neural networks. In: Proceedings of the IEEE
audio-visual representation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 3677–3685
conference on computer vision and pattern recognition, pp 67. Nirkin Y, Keller Y, Hassner T (2019) FSGAN: subject agnostic
4176–4186 face swapping and reenactment. In: Proceedings of the IEEE in-
47. Kim H, Garrido P, Tewari A, Xu W, Thies J, Niessner M, Pérez P, ternational conference on computer vision, pp 7184–7193
Richardt C, Zollhöfer M, Theobalt C (2018) Deep video portraits. 68. Natsume R, Yatagawa T, Morishima S (2018) RSGAN: face
ACM Trans Graph 37:163–177. https://doi.org/10.1145/3197517. swapping and editing using face and hair representation in latent
3201283 spaces. arXiv preprint arXiv:180403447
48. Ha S, Kersner M, Kim B, Seo S, Kim D (2020) Marionette: few- 69. Natsume R, Yatagawa T, Morishima S (2018) Fsnet: an identity-
shot face reenactment preserving identity of unseen targets. In: aware generative model for image-based face swapping. In: Asian
Proceedings of the AAAI conference on artificial intelligence, pp conference on computer vision. Springer, pp 117–132
10893–10900 70. Li L, Bao J, Yang H, Chen D, Wen F (2020) Advancing high
49. Wang Y, Bilinski P, Bremond F, Dantcheva A (2020) fidelity identity swapping for forgery detection. In: Proceedings
ImaGINator: conditional Spatio-temporal GAN for video gener- of the IEEE/CVF conference on computer vision and pattern rec-
ation. In: The IEEE winter conference on applications of comput- ognition, pp 5074–5083
er vision, pp 1160–1169 71. Petrov I et al. (2020) DeepFaceLab: a simple, flexible and exten-
50. Lu Y, Chai J, Cao X (2021) Live speech portraits: real-time sible face swapping framework. arXiv preprint arXiv:200505535
photorealistic talking-head animation. ACM Trans Graph 40:1–17 72. Chen D, Chen Q, Wu J, Yu X, Jia T (2019) Face swapping:
51. Lahiri A, Kwatra V, Frueh C, Lewis J, Bregler C (2021) realistic image synthesis based on facial landmarks alignment.
LipSync3D: data-efficient learning of personalized 3D talking Math Probl Eng 2019
faces from video using pose and lighting normalization. In: 73. Zhang Y, Zheng L, Thing VL (2017) Automated face swapping
Proceedings of the IEEE/CVF conference on computer vision and its detection. In: 2017 IEEE 2nd international conference on
and pattern recognition, pp 2755–2764 signal and image processing (ICSIP). IEEE, pp 15–19
52. Westerlund M (2019) The emergence of deepfake technology: a 74. Yang X, Li Y, Lyu S (2019) Exposing deep fakes using inconsis-
review. Technol Innov Manag Rev 9:39–52 tent head poses. In: 2019 IEEE international conference on acous-
53. Greengard S (2019) Will deepfakes do deep damage? Commun tics, speech and signal processing (ICASSP). IEEE, pp 8261–
ACM 63:17–19 8265
54. Lee Y, Huang K-T, Blom R, Schriner R, Ciccarelli CA (2021) To 75. Güera D, Baireddy S, Bestagini P, Tubaro S, Delp EJ (2019) We
believe or not to believe: framing analysis of content and audience need no pixels: video manipulation detection using stream de-
response of top 10 deepfake videos on youtube. Cyberpsychol scriptors. arXiv preprint arXiv:190608743
Behav Soc Netw 24:153–158 76. Jack K (2011) Video demystified: a handbook for the digital en-
55. Oord Avd et al. (2016) Wavenet: a generative model for raw gineer. Elsevier
audio. In: 9th ISCA speech synthesis workshop, p 2 77. Ciftci UA, Demir I (2020) FakeCatcher: detection of synthetic
56. Wang Y et al. (2017) Tacotron: towards end-to-end speech syn- portrait videos using biological signals. IEEE Trans Pattern Anal
thesis. arXiv preprint arXiv:170310135 Mach Intell 1
57. Arik SO et al. (2017) Deep voice: real-time neural text-to-speech. 78. Jung T, Kim S, Kim K (2020) DeepVision: Deepfakes detection
In: International conference on machine learning PMLR, pp 195– using human eye blinking pattern. IEEE Access 8:83144–83154
204 79. Ranjan R, Patel VM, Chellappa R (2017) Hyperface: a deep multi-
58. Wang R, Juefei-Xu F, Huang Y, Guo Q, Xie X, Ma L, Liu Y task learning framework for face detection, landmark localization,
(2020) Deepsonar: towards effective and robust detection of ai- pose estimation, and gender recognition. IEEE Trans Pattern Anal
synthesized fake voices. In: Proceedings of the 28th ACM inter- Mach Intell 41:121–135
national conference on multimedia, pp 1207–1216 80. Soukupova T, Cech J (2016) Eye blink detection using facial
59. Arik S, Chen J, Peng K, Ping W, Zhou Y (2018) Neural voice landmarks. In: 21st Computer Vision Winter Workshop
cloning with a few samples. In: Advances in neural information 81. Matern F, Riess C, Stamminger M (2019) Exploiting visual arti-
processing systems, pp 10019–10029 facts to expose deepfakes and face manipulations. In: 2019 IEEE
winter applications of computer vision workshops (WACVW). The 29th annual workshop of the Swedish artificial intelligence
IEEE, pp 83–92 society (SAIS). Linköping University Electronic Press
82. Malik J, Belongie S, Leung T, Shi J (2001) Contour and texture 101. Wu H-Y, Rubinstein M, Shih E, Guttag J, Durand F, Freeman W
analysis for image segmentation. Int J Comput Vis 43:7–27 (2012) Eulerian video magnification for revealing subtle changes
83. Agarwal S, Farid H, Gu Y, He M, Nagano K, Li H (2019) in the world. ACM Trans Graph 31:1–8
Protecting world leaders against deep fakes. In: Proceedings of 102. Chen RT, Rubanova Y, Bettencourt J, Duvenaud DK (2018)
the IEEE conference on computer vision and pattern recognition Neural ordinary differential equations. In: Advances in neural in-
workshops, pp 38-45 formation processing systems, pp 6571–6583
84. Li Y, Lyu S (2019) Exposing deepfake videos by detecting face 103. Yang J, Li A, Xiao S, Lu W, Gao X (2021) MTD-net: learning to
warping artifacts. In: IEEE conference on computer vision and detect deepfakes images by multi-scale texture difference. IEEE
pattern recognition workshops (CVPRW), pp 46–52 Trans Inf Forensics Secur 16:4234–4245
85. Li Y, Chang M-C, Lyu S (2018) In ictu oculi: exposing ai gener- 104. Fan B, Wang L, Soong FK, Xie L (2015) Photo-real talking head
ated fake face videos by detecting eye blinking. In: 2018 IEEE with deep bidirectional LSTM. In: 2015 IEEE international con-
international workshop on information forensics and security ference on acoustics, Speech and Signal Processing (ICASSP).
(WIFS). IEEE, pp 1–7 IEEE, pp 4884–4888
86. Montserrat DM et al. (2020) Deepfakes detection with automatic 105. Charles J, Magee D, Hogg D (2016) Virtual immortality:
face weighting. In: Proceedings of the IEEE/CVF conference on reanimating characters from tv shows. In European conference
computer vision and pattern recognition workshops, pp 668–669 on computer vision. Springer, pp 879–886
87. de Lima O, Franklin S, Basu S, Karwoski B, George A (2020) 106. Jamaludin A, Chung JS, Zisserman A (2019) You said that?:
Deepfake detection using spatiotemporal convolutional networks. Synthesising talking faces from audio. Int J Comput Vis 1:1–13
arXiv preprint arXiv:14749 107. Vougioukas K, Petridis S, Pantic M (2019) End-to-end speech-
88. Agarwal S, El-Gaaly T, Farid H, Lim S-N (2020) Detecting deep- driven realistic facial animation with temporal GANs. In:
fake videos from appearance and behavior. In 2020 IEEE interna- Proceedings of the IEEE conference on computer vision and pat-
tional workshop on information forensics and security (WIFS). tern recognition workshops, pp 37–40
IEEE, pp 1–6 108. Zhou H, Liu Y, Liu Z, Luo P, Wang X (2019) Talking face gen-
89. Fernandes S, Raj S, Ortiz E, Vintila I, Salter M, Urosevic G, Jha S eration by adversarially disentangled audio-visual representation.
(2019) Predicting heart rate variations of Deepfake videos using In: Proceedings of the AAAI conference on artificial intelligence,
neural ODE. In: Proceedings of the IEEE international conference pp 9299–9306
on computer vision workshops 109. Garrido P, Valgaerts L, Sarmadi H, Steiner I, Varanasi K, Perez P,
Theobalt C (2015) Vdub: modifying face video of actors for plau-
90. Yang J, Xiao S, Li A, Lu W, Gao X, Li Y (2021) MSTA-net:
sible visual alignment to a dubbed audio track. In: Computer
forgery detection by generating manipulation trace based on
graphics forum. Wiley Online Library, pp 193–204
multi-scale self-texture attention. IEEE Trans Circuits Syst
110. KR Prajwal, Mukhopadhyay R, Philip J, Jha A, Namboodiri V,
Video Technol
Jawahar C (2019) Towards automatic face-to-face translation. In:
91. Sabir E, Cheng J, Jaiswal A, AbdAlmageed W, Masi I, Natarajan
Proceedings of the 27th ACM international conference on multi-
P (2019) Recurrent convolutional strategies for face manipulation
media, pp 1428–1436
detection in videos. Interfaces (GUI) 3:80–87
111. Prajwal K, Mukhopadhyay R, Namboodiri VP, Jawahar C (2020)
92. Afchar D, Nozick V, Yamagishi J, Echizen I (2018) Mesonet: a A lip sync expert is all you need for speech to lip generation in the
compact facial video forgery detection network. In: 2018 IEEE wild. In: Proceedings of the 28th ACM international conference
international workshop on information forensics and security
on multimedia, pp 484–492
(WIFS). IEEE, pp 1–7 112. Fried O, Tewari A, Zollhöfer M, Finkelstein A, Shechtman E,
93. Nguyen HH, Fang F, Yamagishi J, Echizen I (2019) Multi-task Goldman DB, Genova K, Jin Z, Theobalt C, Agrawala M
learning for detecting and segmenting manipulated facial images (2019) Text-based editing of talking-head video. ACM Trans
and videos. In: 2019 IEEE 10th international conference on bio- Graph 38:1–14
metrics theory, applications and systems (BTAS), pp 1–8 113. Kim B-H, Ganapathi V (2019) LumiereNet: lecture video synthe-
94. Cozzolino D, Thies J, Rössler A, Riess C, Nießner M, Verdoliva L sis from audio. arXiv preprint arXiv:190702253
(2018) Forensictransfer: weakly-supervised domain adaptation for 114. Korshunov P, Marcel S (2018) Speaker inconsistency detection in
forgery detection. arXiv preprint arXiv:181202510 tampered video. In 2018 26th European signal processing confer-
95. Rossler A, Cozzolino D, Verdoliva L, Riess C, Thies J, Nießner M ence (EUSIPCO). IEEE, pp 2375–2379
(2019) Faceforensics++: learning to detect manipulated facial im- 115. Sanderson C, Lovell BC (2009) Multi-region probabilistic histo-
ages. In: Proceedings of the IEEE international conference on grams for robust and scalable identity inference. In: International
computer vision, pp 1–11 conference on biometrics. Springer, pp 199–208
96. King DE (2009) Dlib-ml: a machine learning toolkit. J Mach 116. Anand A, Labati RD, Genovese A, Muñoz E, Piuri V, Scotti F
Learn Res 10:1755–1758 (2017) Age estimation based on face images and pre-trained
97. Zhang K, Zhang Z, Li Z, Qiao Y (2016) Joint face detection and convolutional neural networks. In: 2017 IEEE symposium series
alignment using multitask cascaded convolutional networks. IEEE on computational intelligence (SSCI). IEEE, pp 1–7
Signal Process Lett 23:1499–1503 117. Boutellaa E, Boulkenafet Z, Komulainen J, Hadid A (2016)
98. Wiles O, Koepke A, Zisserman A (2018) Self-supervised learning Audiovisual synchrony assessment for replay attack detection in
of a facial attribute embedding from video. Paper presented at the talking face biometrics. Multimed Tools Appl 75:5329–5343
29th British machine vision conference (BMVC) 118. Korshunov P et al. (2019) Tampered speaker inconsistency detec-
99. Rezende DJ, Mohamed S, Wierstra D (2014) Stochastic tion with phonetically aware audio-visual features. In:
backpropagation and approximate inference in deep generative International Conference on Machine Learning
models. Paper presented at the international conference on ma- 119. Agarwal S, Farid H, Fried O, Agrawala M (2020) Detecting deep-
chine learning, pp 1278–1286 fake videos from phoneme-viseme mismatches. In: Proceedings of
100. Rahman H, Ahmed MU, Begum S, Funk P (2016) Real time heart the IEEE/CVF conference on computer vision and pattern recog-
rate monitoring from facial RGB color video using webcam. In: nition workshops, pp 660–661
120. Haliassos A, Vougioukas K, Petridis S, Pantic M (2021) Lips 138. Amerini I, Galteri L, Caldelli R, Del Bimbo A (2019) Deepfake
Don't lie: a Generalisable and robust approach to face forgery video detection through optical flow based CNN. In proceedings
detection. In: Proceedings of the IEEE/CVF conference on com- of the IEEE international conference on computer vision
puter vision and pattern recognition, pp 5039–5049 workshops
121. Chugh K, Gupta P, Dhall A, Subramanian R (2020) Not made for 139. Alparone L, Barni M, Bartolini F, Caldelli R (1999)
each other-audio-visual dissonance-based deepfake detection and Regularization of optic flow estimates by means of weighted
localization. In: Proceedings of the 28th ACM international con- vector median filtering. IEEE Trans Image Process 8:1462–1467
ference on multimedia, pp 439–447 140. Sun D, Yang X, Liu M-Y, Kautz J (2018) PWC-net: CNNs for
122. Mittal T, Bhattacharya U, Chandra R, Bera A, Manocha D (2020) optical flow using pyramid, warping, and cost volume. In:
Emotions Don't lie: an audio-visual deepfake detection method Proceedings of the IEEE conference on computer vision and pat-
using affective cues. In: Proceedings of the 28th ACM internation- tern recognition, pp 8934–8943
al conference on multimedia, pp 2823–2832 141. Baltrušaitis T, Robinson P, Morency L-P (2016) Openface: an
123. Chintha A, Thai B, Sohrawardi SJ, Bhatt K, Hickerson A, Wright open source facial behavior analysis toolkit. In: 2016 IEEE winter
M, Ptucha R (2020) Recurrent convolutional structures for audio conference on applications of computer vision (WACV). IEEE, pp
spoof and video deepfake detection. IEEE J Sel Top Sign Process 1–10
14:1024–1037 142. Kingma DP, Welling M (2013) Auto-encoding variational bayes.
124. Thies J, Zollhöfer M, Theobalt C, Stamminger M, Nießner M arXiv preprint arXiv:13126114
(2018) Real-time reenactment of human portrait videos. ACM 143. Radford A, Metz L, Chintala S (2015) Unsupervised representa-
Trans Graph 37:1–13. https://doi.org/10.1145/3197517.3201350 tion learning with deep convolutional generative adversarial net-
125. Thies J, Zollhöfer M, Nießner M, Valgaerts L, Stamminger M, works. arXiv preprint arXiv:151106434
Theobalt C (2015) Real-time expression transfer for facial reen- 144. Liu M-Y, Tuzel O (2016) Coupled generative adversarial net-
actment. ACM Trans Graph 34:1–14 works. In: Advances in neural information processing systems,
126. Zollhöfer M, Nießner M, Izadi S, Rehmann C, Zach C, Fisher M, pp 469–477
Wu C, Fitzgibbon A, Loop C, Theobalt C, Stamminger M (2014) 145. Karras T, Aila T, Laine S, Lehtinen J (2017) Progressive growing
Real-time non-rigid reconstruction using an RGB-D camera. of gans for improved quality, stability, and variation. In: 6th
ACM Trans Graph 33:1–12 International Conference on Learning Representations
127. Thies J, Zollhöfer M, Theobalt C, Stamminger M, Nießner M 146. Karras T, Laine S, Aila T (2019) A style-based generator archi-
(2018) Headon: real-time reenactment of human portrait videos. tecture for generative adversarial networks. In: Proceedings of the
ACM Trans Graph 37:1–13 IEEE conference on computer vision and pattern recognition, pp
128. Mirza M, Osindero S (2014) Conditional generative adversarial 4401–4410
nets. arXiv preprint arXiv:14111784 147. Karras T, Laine S, Aittala M, Hellsten J, Lehtinen J, Aila T (2020)
129. Wu W, Zhang Y, Li C, Qian C, Change Loy C (2018) Analyzing and improving the image quality of stylegan. In:
ReenactGAN: learning to reenact faces via boundary transfer. Proceedings of the IEEE/CVF conference on computer vision
In: Proceedings of the European conference on computer vision and pattern recognition, pp 8110–8119
(ECCV), pp 603–619 148. Huang R, Zhang S, Li T, He R (2017) Beyond face rotation: global
130. Pumarola A, Agudo A, Martínez AM, Sanfeliu A, Moreno- and local perception Gan for photorealistic and identity preserving
Noguer F (2018) GANimation: anatomically-aware facial anima- frontal view synthesis. In: Proceedings of the IEEE international
tion from a single image. In: Proceedings of the European confer- conference on computer vision, pp 2439–2448
ence on computer vision (ECCV), pp 818–833 149. Zhang H, Goodfellow I, Metaxas D, Odena A (2019) Self-
131. Sanchez E, Valstar M (2020) Triple consistency loss for pairing attention generative adversarial networks. In: international confer-
distributions in GAN-based face synthesis. In: 15th IEEE interna- ence on machine learning. PMLR, pp 7354–7363
tional conference on automatic face and gesture recognition. 150. Brock A, Donahue J, Simonyan K (2019) Large scale gan training
IEEE, pp 53–60 for high fidelity natural image synthesis. In: 7th International
132. Zakharov E, Shysheya A, Burkov E, Lempitsky V (2019) Few- Conference on Learning Representations
shot adversarial learning of realistic neural talking head models. 151. Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas DN
In: Proceedings of the IEEE international conference on computer (2017) Stackgan: text to photo-realistic image synthesis with
vision, pp 9459–9468 stacked generative adversarial networks. In: Proceedings of the
133. Zhang Y, Zhang S, He Y, Li C, Loy CC, Liu Z (2019) One-shot IEEE international conference on computer vision, pp 5907–5915
face reenactment. Paper presented at the British machine vision 152. Lu E, Hu X (2022) Image super-resolution via channel attention
conference (BMVC) and spatial attention. Appl Intell 52:2260–2268. https://doi.org/10.
134. Hao H, Baireddy S, Reibman AR, Delp EJ (2020) FaR-GAN for 1007/s10489-021-02464-6
one-shot face reenactment. In: IEEE Conference on Computer 153. Zhong J-L, Pun C-M, Gan Y-F (2020) Dense moment feature
Vision and Pattern Recognition (CVPR) index and best match algorithms for video copy-move forgery
135. Blanz V, Vetter T (1999) A morphable model for the synthesis of detection. Inf Sci 537:184–202
3D faces. In: Proceedings of the 26th annual conference on 154. Ding X, Huang Y, Li Y, He J (2020) Forgery detection of motion
Computer graphics and interactive techniques, pp 187–194 compensation interpolated frames based on discontinuity of opti-
136. Wehrbein T, Rudolph M, Rosenhahn B, Wandt B (2021) cal flow. Multimed Tools Appl:1–26
Probabilistic monocular 3d human pose estimation with normal- 155. Niyishaka P, Bhagvati C (2020) Copy-move forgery detection
izing flows. In: Proceedings of the IEEE/CVF international con- using image blobs and BRISK feature. Multimed Tools Appl:1–
ference on computer vision, pp 11199–11208 15
137. Lorenzo-Trueba J, Yamagishi J, Toda T, Saito D, Villavicencio F, 156. Sunitha K, Krishna A, Prasad B (2022) Copy-move tampering
Kinnunen T, Ling Z (2018) The voice conversion challenge 2018: detection using keypoint based hybrid feature extraction and im-
promoting development of parallel and nonparallel methods. In proved transformation model. Appl Intell:1–12
the speaker and language recognition workshop. ISCA, pp 195– 157. Tyagi S, Yadav D (2022) A detailed analysis of image and video
202 forgery detection techniques. Vis Comput:1–21
158. Nawaz M, Mehmood Z, Nazir T, Masood M, Tariq U, Mahdi 176. Wang R, Juefei-Xu F, Ma L, Xie X, Huang Y, Wang J, Liu Y
Munshi A, Mehmood A, Rashid M (2021) Image authenticity (2021) Fakespotter: a simple yet robust baseline for spotting AI-
detection using DWT and circular block-based LTrP features. synthesized fake faces. In: Proceedings of the 29th international
Comput Mater Contin 69:1927–1944 conference on international joint conferences on artificial intelli-
159. Akhtar Z, Dasgupta D (2019) A comparative evaluation of local gence, pp 3444–3451
feature descriptors for deepfakes detection. In: 2019 IEEE inter- 177. Parkhi OM, Vedaldi A, Zisserman A (2015) Deep face recogni-
national symposium on technologies for homeland security tion. In: Proceedings of the British Machine Vision, pp 6
(HST). IEEE, pp 1–5 178. Amos B, Ludwiczuk B, Satyanarayanan M (2016) Openface: a
160. McCloskey S, Albright M (2018) Detecting gan-generated imag- general-purpose face recognition library with mobile applications.
ery using color cues. arXiv preprint arXiv:08247 CMU School of Computer Science 6
161. Guarnera L, Giudice O, Battiato S (2020) DeepFake detection by 179. Schroff F, Kalenichenko D, Philbin J (2015) Facenet: a unified
analyzing convolutional traces. In proceedings of the IEEE/CVF embedding for face recognition and clustering. In: Proceedings of
conference on computer vision and pattern recognition work- the IEEE conference on computer vision and pattern recognition,
shops, pp 666–667 pp 815–823
162. Nataraj L, Mohammed TM, Manjunath B, Chandrasekaran S, 180. Bharati A, Singh R, Vatsa M, Bowyer KW (2016) Detecting facial
Flenner A, Bappy JH, Roy-Chowdhury AK (2019) Detecting retouching using supervised deep learning. IEEE Trans Inf
GAN generated fake images using co-occurrence matrices. Forensics Secur 11:1903–1913
Electronic Imaging 5:532–531 181. Jain A, Singh R, Vatsa M (2018) On detecting gans and
163. Yu N, Davis LS, Fritz M (2019) Attributing fake images to GANs: retouching based synthetic alterations. In: 2018 IEEE 9th interna-
learning and analyzing GAN fingerprints. In: Proceedings of the tional conference on biometrics theory, applications and systems
IEEE international conference on computer vision, pp 7556–7566 (BTAS). IEEE, pp 1–7
164. Marra F, Saltori C, Boato G, Verdoliva L (2019) Incremental 182. Tariq S, Lee S, Kim H, Shin Y, Woo SS (2018) Detecting both
learning for the detection and classification of GAN-generated machine and human created fake face images in the wild. In:
images. In: 2019 IEEE international workshop on information Proceedings of the 2nd international workshop on multimedia
forensics and security (WIFS). IEEE, pp 1–6 privacy and security, pp 81–87
165. Rebuffi S-A, Kolesnikov A, Sperl G, Lampert CH (2017) ICARL: 183. Dang H, Liu F, Stehouwer J, Liu X, Jain AK (2020) On the
incremental classifier and representation learning. In: proceedings detection of digital face manipulation. In: Proceedings of the
of the IEEE conference on computer vision and pattern recogni- IEEE/CVF conference on computer vision and pattern recogni-
tion, pp 2001–2010 tion, pp 5781–5790
166. Perarnau G, Van De Weijer J, Raducanu B, Álvarez JM (2016) 184. Rathgeb C, Botaljov A, Stockhardt F, Isadskiy S, Debiasi L, Uhl
Invertible conditional gans for image editing. arXiv preprint A, Busch C (2020) PRNU-based detection of facial retouching.
arXiv:161106355 IET Biom 9:154–164
167. Lample G, Zeghidour N, Usunier N, Bordes A, Denoyer L, 185. Li Y, Zhang C, Sun P, Ke L, Ju Y, Qi H, Lyu S (2021) DeepFake-
Ranzato MA (2017) Fader networks: manipulating images by o-meter: an open platform for DeepFake detection. In: 2021 IEEE
sliding attributes. In: Advances in neural information processing security and privacy workshops (SPW). IEEE, pp 277–281
systems, pp 5967–5976 186. Mehta V, Gupta P, Subramanian R, Dhall A (2021) FakeBuster: a
168. Choi Y, Uh Y, Yoo J, Ha J-W (2020) Stargan v2: diverse image DeepFakes detection tool for video conferencing scenarios. In
synthesis for multiple domains. In: Proceedings of the IEEE/CVF 26th international conference on intelligent user interfaces, pp
conference on computer vision and pattern recognition, pp 8188– 61–63
8197 187. Reality Defender 2020: A FORCE AGAINST DEEPFAKES.
169. He Z, Zuo W, Kan M, Shan S, Chen X (2019) Attgan: facial (2020). https://rd2020.org/index.html. Accessed August 03, 2021
attribute editing by only changing what you want. IEEE Trans 188. Durall R, Keuper M, Pfreundt F-J, Keuper J (2019) Unmasking
Image Process 28:5464–5478 deepfakes with simple features. arXiv preprint arXiv:00686
170. Liu M, Ding Y, Xia M, Liu X, Ding E, Zuo W, Wen S (2019) 189. Marra F, Gragnaniello D, Cozzolino D, Verdoliva L (2018)
Stgan: a unified selective transfer network for arbitrary image Detection of gan-generated fake images over social networks.
attribute editing. In: Proceedings of the IEEE conference on com- In: 2018 IEEE conference on multimedia information processing
puter vision and pattern recognition, pp 3673–3682 and retrieval (MIPR). IEEE, pp 384–389
171. Zhang G, Kan M, Shan S, Chen X (2018) Generative adversarial 190. Caldelli R, Galteri L, Amerini I, Del Bimbo A (2021) Optical flow
network with spatial attention for face attribute editing. In: based CNN for detection of unlearnt deepfake manipulations.
Proceedings of the European conference on computer vision Pattern Recogn Lett 146:31–37
(ECCV), pp 417–432 191. Korshunov P, Marcel S (2018) Deepfakes: a new threat to face
172. He Z, Kan M, Zhang J, Shan S (2020) PA-GAN: progressive recognition? Assessment and detection. arXiv preprint arXiv:
attention generative adversarial network for facial attribute editing. 181208685
arXiv preprint arXiv:200705892 192. Wang S-Y, Wang O, Zhang R, Owens A, Efros AA (2020) CNN-
173. Nataraj L, Mohammed TM, Manjunath B, Chandrasekaran S, generated images are surprisingly easy to spot... for now. In:
Flenner A, Bappy JH, Roy-Chowdhury AK (2019) Detecting Proceedings of the IEEE/CVF conference on computer vision
GAN generated fake images using co-occurrence matrices. and pattern recognition, pp 8695–8704
Electron Imaging 2019:532-531–532-537 193. Malik H (2019) Securing voice-driven interfaces against fake
174. Zhang X, Karaman S, Chang S-F (2019) Detecting and simulating (cloned) audio attacks. In 2019 IEEE conference on multimedia
artifacts in gan fake images. In 2019 IEEE international workshop information processing and retrieval (MIPR). IEEE, pp 512–517
on information forensics and security (WIFS). IEEE, pp 1–6 194. Li Y, Yang X, Sun P, Qi H, Lyu S (2020) Celeb-df: a new dataset
175. Isola P, Zhu J-Y, Zhou T, Efros AA (2017) Image-to-image trans- for deepfake forensics. In: IEEE Conference on Computer Vision
lation with conditional adversarial networks. In: Proceedings of and Patten Recognition (CVPR)
the IEEE conference on computer vision and pattern recognition, 195. Khalid H, Woo SS (2020) OC-FakeDect: classifying deepfakes
pp 1125–1134 using one-class variational autoencoder. In: Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition on digital futures and transformative technologies (ICoDT2).
workshops, pp 656–657 IEEE, pp 1–6
196. Cozzolino D, Rössler A, Thies J, Nießner M, Verdoliva L (2021) 214. Wang R, Ma L, Juefei-Xu F, Xie X, Wang J, Liu Y (2020)
ID-reveal: identity-aware DeepFake video detection. Paper pre- Fakespotter: a simple baseline for spotting ai-synthesized fake
sented at the international conference on computer vision, pp faces. In: Proceedings of the 29th international joint conference
15088–15097 on artificial intelligence (IJCAI), pp 3444–3451
197. Hu J, Liao X, Wang W, Qin Z (2021) Detecting compressed 215. Pan Z, Ren Y, Zhang X (2021) Low-complexity fake face detec-
deepfake videos in social networks using frame-temporality two- tion based on forensic similarity. Multimedia Systems 27:353–
stream convolutional network. IEEE Trans Circuits Syst Video 361. https://doi.org/10.1007/s00530-021-00756-y
Technol:1 216. Giudice O, Guarnera L, Battiato S (2021) Fighting deepfakes by
198. Li X, Yu K, Ji S, Wang Y, Wu C, Xue H (2020) Fighting against detecting gan dct anomalies. J Imaging 7:128
deepfake: patch & pair convolutional neural networks (ppcnn). In 217. Lorenzo-Trueba J, Fang F, Wang X, Echizen I, Yamagishi J,
companion proceedings of the web conference 2020, pp 88–89 Kinnunen T (2018) Can we steal your vocal identity from the
199. Amerini I, Caldelli R (2020) Exploiting prediction error inconsis- internet?: initial investigation of cloning Obama's voice using
tencies through LSTM-based classifiers to detect deepfake videos. GAN, WaveNet and low-quality found data. In the speaker and
In: Proceedings of the 2020 ACM workshop on information hid- language recognition workshop. ISCA, pp 240–247
ing and multimedia security, pp 97–102 218. Wang X et al (2020) ASVspoof 2019: a large-scale public data-
200. Hosler B, Salvi D, Murray A, Antonacci F, Bestagini P, Tubaro S, base of synthetized, converted and replayed speech. Comput
Stamm MC (2021) Do Deepfakes feel emotions? A semantic ap- Speech Lang 64:101114
proach to detecting deepfakes via emotional inconsistencies. In: 219. Jin Z, Mysore GJ, Diverdi S, Lu J, Finkelstein A (2017) Voco:
Proceedings of the IEEE/CVF conference on computer vision and text-based insertion and replacement in audio narration. ACM
pattern recognition, pp 1013–1022 Trans Graph 36:1–13
201. Zhao T, Xu X, Xu M, Ding H, Xiong Y, Xia W (2021) Learning self-consistency for deepfake detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 15023–15033
202. AlBadawy EA, Lyu S, Farid H (2019) Detecting AI-synthesized speech using bispectral analysis. In: CVPR workshops, pp 104–109
203. Guo Z, Hu L, Xia M, Yang G (2021) Blind detection of glow-based facial forgery. Multimed Tools Appl 80:7687–7710. https://doi.org/10.1007/s11042-020-10098-y
204. Guo Z, Yang G, Chen J, Sun X (2020) Fake face detection via adaptive residuals extraction network. arXiv preprint arXiv:04945
205. Fu T, Xia M, Yang G (2022) Detecting GAN-generated face images via hybrid texture and sensor noise based features. Multimed Tools Appl. https://doi.org/10.1007/s11042-022-12661-1
206. Fei J, Xia Z, Yu P, Xiao F (2021) Exposing AI-generated videos with motion magnification. Multimed Tools Appl 80:30789–30802. https://doi.org/10.1007/s11042-020-09147-3
207. Singh A, Saimbhi AS, Singh N, Mittal M (2020) DeepFake video detection: a time-distributed approach. SN Comput Sci 1:212. https://doi.org/10.1007/s42979-020-00225-9
208. Han B, Han X, Zhang H, Li J, Cao X (2021) Fighting fake news: two stream network for deepfake detection via learnable SRM. IEEE Trans Biom Behav Identity Sci 3:320–331
209. Rana MS, Sung AH (2020) Deepfakestack: a deep ensemble-based learning technique for deepfake detection. In: 2020 7th IEEE international conference on cyber security and cloud computing (CSCloud)/2020 6th IEEE international conference on edge computing and scalable cloud (EdgeCom). IEEE, pp 70–75
210. Wu Z, Das RK, Yang J, Li H (2020) Light convolutional neural network with feature genuinization for detection of synthetic speech attacks. In: Interspeech 2020, 21st Annual Conference of the International Speech Communication Association. ISCA, pp 1101–1105
211. Yu C-M, Chen K-C, Chang C-T, Ti Y-W (2022) SegNet: a network for detecting deepfake facial videos. Multimedia Systems 1. https://doi.org/10.1007/s00530-021-00876-5
212. Su Y, Xia H, Liang Q, Nie W (2021) Exposing DeepFake videos using attention based convolutional LSTM network. Neural Process Lett 53:4159–4175. https://doi.org/10.1007/s11063-021-10588-6
213. Masood M, Nawaz M, Javed A, Nazir T, Mehmood A, Mahum R (2021) Classification of Deepfake videos using pre-trained convolutional neural networks. In: 2021 international conference
220. Leung A NVIDIA Reveals That Part of Its CEO's Keynote Presentation Was Deepfaked. https://hypebeast.com/2021/8/nvidia-deepfake-jensen-huang-omniverse-keynote-video. Accessed August 29, 2021
221. Sotelo J, Mehri S, Kumar K, Santos JF, Kastner K, Courville A, Bengio Y (2017) Char2wav: end-to-end speech synthesis. In: 5th International Conference on Learning Representations
222. Sisman B, Yamagishi J, King S, Li H (2020) An overview of voice conversion and its challenges: from statistical modeling to deep learning. IEEE/ACM Transactions on Audio, Speech, Language Processing
223. Partila P, Tovarek J, Ilk GH, Rozhon J, Voznak M (2020) Deep learning serves voice cloning: how vulnerable are automatic speaker verification systems to spoofing trials? IEEE Commun Mag 58:100–105
224. Ping W et al (2018) Deep voice 3: 2000-speaker neural text-to-speech. Proc ICLR:214–217
225. Bińkowski M et al. (2020) High fidelity speech synthesis with adversarial networks. Paper presented at the 8th international conference on learning representations
226. Kumar K et al (2019) Melgan: generative adversarial networks for conditional waveform synthesis. Adv Neural Inf Proces Syst 32
227. Kong J, Kim J, Bae J (2020) Hifi-Gan: generative adversarial networks for efficient and high fidelity speech synthesis. Adv Neural Inf Proces Syst 33:17022–17033
228. Luong H-T, Yamagishi J (2020) NAUTILUS: a versatile voice cloning system. IEEE/ACM Trans Audio Speech Lang Process 28:2967–2981
229. Peng K, Ping W, Song Z, Zhao K (2020) Non-autoregressive neural text-to-speech. In: International conference on machine learning. PMLR, pp 7586–7598
230. Taigman Y, Wolf L, Polyak A, Nachmani E (2018) Voiceloop: voice fitting and synthesis via a phonological loop. In: 6th International Conference on Learning Representations
231. Oord A et al. (2018) Parallel wavenet: fast high-fidelity speech synthesis. In: International conference on machine learning. PMLR, pp 3918–3926
232. Kim J, Kim S, Kong J, Yoon S (2020) Glow-tts: a generative flow for text-to-speech via monotonic alignment search. Adv Neural Inf Proces Syst 33:8067–8077
233. Jia Y et al. (2018) Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In: Advances in neural information processing systems, pp 4480–4490
234. Lee Y, Kim T, Lee S-Y (2018) Voice imitating text-to-speech neural networks. arXiv preprint arXiv:00927
235. Chen Y et al. (2019) Sample efficient adaptive text-to-speech. In: 7th International Conference on Learning Representations
236. Cong J, Yang S, Xie L, Yu G, Wan G (2020) Data efficient voice cloning from noisy samples with domain adversarial training. Paper presented at the 21st Annual Conference of the International Speech Communication Association, pp 811–815
237. Gibiansky A et al. (2017) Deep voice 2: multi-speaker neural text-to-speech. In: Advances in neural information processing systems, pp 2962–2970
238. Yasuda Y, Wang X, Takaki S, Yamagishi J (2019) Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language. In: 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6905–6909
239. Yamamoto R, Song E, Kim J-M (2020) Parallel WaveGAN: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In: 2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6199–6203
240. Ren Y, Ruan Y, Tan X, Qin T, Zhao S, Zhao Z, Liu T-Y (2019) Fastspeech: fast, robust and controllable text to speech. Adv Neural Inf Proces Syst 32:3165–3174
241. Toda T, Chen L-H, Saito D, Villavicencio F, Wester M, Wu Z, Yamagishi J (2016) The voice conversion challenge 2016. In: INTERSPEECH, pp 1632–1636
242. Zhao Y et al. (2020) Voice conversion challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion. In: Proceeding joint workshop for the blizzard challenge and voice conversion challenge
243. Stylianou Y, Cappé O, Moulines E (1998) Continuous probabilistic transform for voice conversion. IEEE Trans Speech Audio Process 6:131–142
244. Toda T, Black AW, Tokuda K (2007) Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans Speech Audio Process 15:2222–2235
245. Helander E, Silén H, Virtanen T, Gabbouj M (2011) Voice conversion using dynamic kernel partial least squares regression. IEEE Trans Audio Speech Lang Process 20:806–817
246. Wu Z, Virtanen T, Chng ES, Li H (2014) Exemplar-based sparse representation with residual compensation for voice conversion. IEEE/ACM Trans Audio Speech Lang Process 22:1506–1521
247. Nakashika T, Takiguchi T, Ariki Y (2014) High-order sequence modeling using speaker-dependent recurrent temporal restricted Boltzmann machines for voice conversion. In: Fifteenth annual conference of the international speech communication association
248. Ming H, Huang D-Y, Xie L, Wu J, Dong M, Li H (2016) Deep bidirectional LSTM modeling of timbre and prosody for emotional voice conversion. In: INTERSPEECH, pp 2453–2457
249. Sun L, Kang S, Li K, Meng H (2015) Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4869–4873
250. Wu J, Wu Z, Xie L (2016) On the use of i-vectors and average voice model for voice conversion without parallel data. In: 2016 Asia-Pacific signal and information processing association annual summit and conference (APSIPA). IEEE, pp 1–6
251. Liu L-J, Ling Z-H, Jiang Y, Zhou M, Dai L-R (2018) WaveNet vocoder with limited training data for voice conversion. In: INTERSPEECH, pp 1983–1987
252. Hsu P-c, Wang C-h, Liu AT, Lee H-y (2019) Towards robust neural vocoding for speech generation: a survey. arXiv preprint arXiv:02461
253. Kaneko T, Kameoka H (2018) Cyclegan-vc: Non-parallel voice conversion using cycle-consistent adversarial networks. In: 2018 26th European signal processing conference (EUSIPCO). IEEE, pp 2100–2104
254. Chou J-c, Yeh C-c, Lee H-y, Lee L-s (2018) Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. In: 19th Annual Conference of the International Speech Communication Association. ISCA, pp 501–505
255. Kaneko T, Kameoka H, Tanaka K, Hojo N (2019) Cyclegan-vc2: improved cyclegan-based non-parallel voice conversion. In: 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6820–6824
256. Fang F, Yamagishi J, Echizen I, Lorenzo-Trueba J (2018) High-quality nonparallel voice conversion based on cycle-consistent adversarial network. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 5279–5283
257. Hsu C-C, Hwang H-T, Wu Y-C, Tsao Y, Wang H-M (2017) Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks. Paper presented at the 18th Annual Conference of the International Speech Communication Association, pp 3364–3368
258. Kameoka H, Kaneko T, Tanaka K, Hojo N (2018) Stargan-vc: Non-parallel many-to-many voice conversion using star generative adversarial networks. In: 2018 IEEE spoken language technology workshop (SLT). IEEE, pp 266–273
259. Zhang M, Sisman B, Zhao L, Li H (2020) DeepConversion: Voice conversion with limited parallel training data. Speech Comm 122:31–43
260. Huang W-C, Luo H, Hwang H-T, Lo C-C, Peng Y-H, Tsao Y, Wang H-M (2020) Unsupervised representation disentanglement using cross domain features and adversarial learning in variational autoencoder based voice conversion. IEEE Trans Emerg Top Comput Intell 4:468–479
261. Qian K, Jin Z, Hasegawa-Johnson M, Mysore GJ (2020) F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder. In: 2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6284–6288
262. Chorowski J, Weiss RJ, Bengio S, van den Oord A (2019) Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM Trans Audio Speech Lang Process 27:2041–2053
263. Tanaka K, Kameoka H, Kaneko T, Hojo N (2019) AttS2S-VC: sequence-to-sequence voice conversion with attention and context preservation mechanisms. In: ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6805–6809
264. Park S-w, Kim D-y, Joe M-c (2020) Cotatron: Transcription-guided speech encoder for any-to-many voice conversion without parallel data. In: 21st Annual Conference of the International Speech Communication Association. ISCA, pp 4696–4700
265. Huang W-C, Hayashi T, Wu Y-C, Kameoka H, Toda T (2020) Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining. In: 21st Annual Conference of the International Speech Communication Association. ISCA, pp 4676–4680
266. Lu H, Wu Z, Dai D, Li R, Kang S, Jia J, Meng H (2019) One-shot voice conversion with global speaker embeddings. In: INTERSPEECH, pp 669–673
267. Liu S, Zhong J, Sun L, Wu X, Liu X, Meng H (2018) Voice conversion across arbitrary speakers based on a single target-speaker utterance. In: INTERSPEECH, pp 496–500
268. Huang T-h, Lin J-h, Lee H-y (2021) How far are we from robust voice conversion: a survey. In: 2021 IEEE spoken language technology workshop (SLT). IEEE, pp 514–521
269. Li N, Tuo D, Su D, Li Z, Yu D, Tencent A (2018) Deep discriminative embeddings for duration robust speaker verification. In: INTERSPEECH, pp 2262–2266
270. Chou J-c, Yeh C-c, Lee H-y (2019) One-shot voice conversion by separating speaker and content representations with instance normalization. In: 20th Annual Conference of the International Speech Communication Association. ISCA, pp 664–668
271. Qian K, Zhang Y, Chang S, Yang X, Hasegawa-Johnson M (2019) Autovc: zero-shot voice style transfer with only autoencoder loss. In: International conference on machine learning. PMLR, pp 5210–5219
272. Rebryk Y, Beliaev S (2020) ConVoice: real-time zero-shot voice style transfer with convolutional network. arXiv preprint arXiv:07815
273. Kominek J, Black AW (2004) The CMU Arctic speech databases. In: Fifth ISCA workshop on speech synthesis
274. Kurematsu A, Takeda K, Sagisaka Y, Katagiri S, Kuwabara H, Shikano K (1990) ATR Japanese speech database as a tool of speech recognition and synthesis. Speech Comm 9:357–363
275. Kawahara H, Masuda-Katsuse I, De Cheveigne A (1999) Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Comm 27:187–207
276. Kamble MR, Sailor HB, Patil HA, Li H (2020) Advances in anti-spoofing: from the perspective of ASVspoof challenges. APSIPA Trans Signal Inf Process 9
277. Li X, Li N, Weng C, Liu X, Su D, Yu D, Meng H (2021) Replay and synthetic speech detection with res2net architecture. In: 2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6354–6358
278. Yi J, Bai Y, Tao J, Tian Z, Wang C, Wang T, Fu R (2021) Half-truth: a partially fake audio detection dataset. In: 22nd Annual Conference of the International Speech Communication Association. ISCA, pp 1654–1658
279. Das RK, Yang J, Li H (2021) Data augmentation with signal companding for detection of logical access attacks. In: 2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6349–6353
280. Ma H, Yi J, Tao J, Bai Y, Tian Z, Wang C (2021) Continual learning for fake audio detection. In: 22nd Annual Conference of the International Speech Communication Association. ISCA, pp 886–890
281. Singh AK, Singh P (2021) Detection of AI-synthesized speech using cepstral & bispectral statistics. In: 4th international conference on multimedia information processing and retrieval (MIPR). IEEE, pp 412–417
282. Gao Y, Vuong T, Elyasi M, Bharaj G, Singh R (2021) Generalized spoofing detection inspired from audio generation artifacts. In: 22nd Annual Conference of the International Speech Communication Association. ISCA, pp 4184–4188
283. Aravind P, Nechiyil U, Paramparambath N (2020) Audio spoofing verification using deep convolutional neural networks by transfer learning. arXiv preprint arXiv:03464
284. Monteiro J, Alam J, Falk TH (2020) Generalized end-to-end detection of spoofing attacks to automatic speaker recognizers. Comput Speech Lang 63:101096
285. Chen T, Kumar A, Nagarsheth P, Sivaraman G, Khoury E (2020) Generalization of audio deepfake detection. In: Proc. Odyssey 2020 the speaker and language recognition workshop, pp 132–137
286. Huang L, Pun C-M (2020) Audio replay spoof attack detection by joint segment-based linear filter bank feature extraction and attention-enhanced DenseNet-BiLSTM network. IEEE/ACM Trans Audio Speech Lang Process 28:1813–1825
287. Zhang Z, Yi X, Zhao X (2021) Fake speech detection using residual network with transformer encoder. In: Proceedings of the 2021 ACM workshop on information hiding and multimedia security, pp 13–22
288. Reimao R, Tzerpos V (2019) FoR: a dataset for synthetic speech detection. In: International conference on speech technology and human-computer dialogue. IEEE, pp 1–10
289. Zhang Y, Jiang F, Duan Z (2021) One-class learning towards synthetic voice spoofing detection. IEEE Signal Process Lett 28:937–941
290. Gomez-Alanis A, Peinado AM, Gonzalez JA, Gomez AM (2019) A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection. In: Proc Interspeech, pp 1068–1072
291. Hua G, Bengjinteoh A, Zhang H (2021) Towards end-to-end synthetic speech detection. IEEE Signal Process Lett 28:1265–1269
292. Jiang Z, Zhu H, Peng L, Ding W, Ren Y (2020) Self-supervised spoofing audio detection scheme. In: INTERSPEECH, pp 4223–4227
293. Borrelli C, Bestagini P, Antonacci F, Sarti A, Tubaro S (2021) Synthetic speech detection through short-term and long-term prediction traces. EURASIP J Inf Secur 2021:1–14
294. Malik H (2019) Fighting AI with AI: fake speech detection using deep learning. In: International Conference on Audio Forensics. AES
295. Khochare J, Joshi C, Yenarkar B, Suratkar S, Kazi F (2021) A deep learning framework for audio deepfake detection. Arab J Sci Eng 1:1–12
296. Yamagishi J et al. (2021) ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection. arXiv preprint arXiv:00537
297. Frank J, Schönherr L (2021) WaveFake: a data set to facilitate audio deepfake detection. In: 35th annual conference on neural information processing systems
298. Dolhansky B, Bitton J, Pflaum B, Lu J, Howes R, Wang M, Ferrer CC (2020) The DeepFake detection challenge dataset. arXiv preprint arXiv:200607397
299. Jiang L, Li R, Wu W, Qian C, Loy CC (2020) Deeperforensics-1.0: a large-scale dataset for real-world face forgery detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2889–2898
300. Zi B, Chang M, Chen J, Ma X, Jiang Y-G (2020) Wilddeepfake: a challenging real-world dataset for deepfake detection. In: Proceedings of the 28th ACM international conference on multimedia, pp 2382–2390
301. He Y et al. (2021) Forgerynet: a versatile benchmark for comprehensive forgery analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4360–4369
302. Khalid H, Tariq S, Kim M, Woo SS (2021) FakeAVCeleb: a novel audio-video multimodal deepfake dataset. In: Thirty-fifth conference on neural information processing systems
303. Ito K (2017) The LJ speech dataset. https://keithito.com/LJ-Speech-Dataset. Accessed December 22, 2020
304. The M-AILABS speech dataset (2019). https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/. Accessed Feb 25, 2021
305. Ardila R et al. (2019) Common voice: a massively-multilingual speech corpus. arXiv preprint arXiv:191206670
306. Rössler A, Cozzolino D, Verdoliva L, Riess C, Thies J, Nießner M (2018) Faceforensics: a large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179
307. Faceswap. https://github.com/MarekKowalski/FaceSwap/. Accessed August 14, 2020
308. Thies J, Zollhöfer M, Nießner M (2019) Deferred neural rendering: image synthesis using neural textures. ACM Trans Graph 38:1–12
309. Abu-El-Haija S, Kothari N, Lee J, Natsev P, Toderici G, Varadarajan B, Vijayanarasimhan S (2016) Youtube-8m: a large-scale video classification benchmark. arXiv preprint arXiv:160908675
310. Aravkin A, Burke JV, Ljung L, Lozano A, Pillonetto G (2017) Generalized Kalman smoothing: modeling and algorithms. Automatica 86:63–86
311. Reinhard E, Adhikhmin M, Gooch B, Shirley P (2001) Color transfer between images. IEEE Comput Graph 21:34–41
312. Dolhansky B, Howes R, Pflaum B, Baram N, Ferrer CC (2019) The deepfake detection challenge (dfdc) preview dataset. arXiv preprint arXiv:08854
313. Versteegh M, Thiolliere R, Schatz T, Cao XN, Anguera X, Jansen A, Dupoux E (2015) Zero resource speech challenge. In: 16th Annual Conference of the International Speech Communication Association. ISCA, pp 3169–3173
314. Mitra A, Mohanty SP, Corcoran P, Kougianos E (2021) A machine learning based approach for Deepfake detection in social media through key video frame extraction. SN Comput Sci 2:98. https://doi.org/10.1007/s42979-021-00495-x
315. Trinh L, Liu Y (2021) An examination of fairness of AI models for deepfake detection. In: Proceedings of the thirtieth international joint conference on artificial intelligence. IJCAI, pp 567–574
316. Carlini N, Farid H (2020) Evading deepfake-image detectors with white- and black-box attacks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp 658–659
317. Neekhara P, Dolhansky B, Bitton J, Ferrer CC (2021) Adversarial threats to deepfake detection: a practical perspective. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 923–932
318. Huang C-y, Lin YY, Lee H-y, Lee L-s (2021) Defending your voice: adversarial attack on voice conversion. In: 2021 IEEE spoken language technology workshop (SLT). IEEE, pp 552–559
319. Ding Y-Y, Zhang J-X, Liu L-J, Jiang Y, Hu Y, Ling Z-H (2020) Adversarial post-processing of voice conversion against spoofing detection. In: 2020 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC). IEEE, pp 556–560
320. Durall R, Keuper M, Keuper J (2020) Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7890–7899
321. Jung S, Keuper M (2021) Spectral distribution aware image generation. In: Proceedings of the AAAI conference on artificial intelligence, pp 1734–1742
322. Huang Y et al. (2020) FakeRetouch: evading DeepFakes detection via the guidance of deliberate noise. arXiv preprint arXiv:09213
323. Neves JC, Tolosana R, Vera-Rodriguez R, Lopes V, Proença H, Fierrez J (2020) Ganprintr: improved fakes and evaluation of the state of the art in face manipulation detection. IEEE J Sel Top Sign Process 14:1038–1048
324. Osakabe T, Tanaka M, Kinoshita Y, Kiya H (2021) CycleGAN without checkerboard artifacts for counter-forensics of fake-image detection. In: International workshop on advanced imaging technology (IWAIT) 2021. International Society for Optics and Photonics, pp 1176609
325. Huang Y et al. (2020) Fakepolisher: making deepfakes more detection-evasive by shallow reconstruction. In: Proceedings of the 28th ACM international conference on multimedia, pp 1217–1226
326. Bansal A, Ma S, Ramanan D, Sheikh Y (2018) Recycle-gan: unsupervised video retargeting. In: Proceedings of the European conference on computer vision (ECCV), pp 119–135
327. Abe M, Nakamura S, Shikano K, Kuwabara H (1990) Voice conversion through vector quantization. J Acoust Soc Jpn 11:71–76
328. Fraga-Lamas P, Fernández-Caramés TM (2020) Fake news, disinformation, and Deepfakes: leveraging distributed ledger technologies and Blockchain to combat digital deception and counterfeit reality. IT Prof 22:53–59
329. Hasan HR, Salah K (2019) Combating deepfake videos using blockchain and smart contracts. IEEE Access 7:41596–41606
330. Mao D, Zhao S, Hao Z (2022) A shared updatable method of content regulation for deepfake videos based on blockchain. Appl Intell:1–18
331. Kaddar B, Fezza SA, Hamidouche W, Akhtar Z, Hadid A (2021) HCiT: Deepfake video detection using a hybrid model of CNN features and vision transformer. In: 2021 international conference on visual communications and image processing (VCIP). IEEE, pp 1–5
332. Wodajo D, Atnafu S (2021) Deepfake video detection using convolutional vision transformer. arXiv preprint arXiv:11126
333. Wang J, Wu Z, Chen J, Jiang Y-G (2021) M2tr: Multi-modal multi-scale transformers for deepfake detection. arXiv preprint arXiv:09770
334. Deokar B, Hazarnis A (2012) Intrusion detection system using log files and reinforcement learning. Int J Comput Appl 45:28–35
335. Liu Z, Wang J, Gong S, Lu H, Tao D (2019) Deep reinforcement active learning for human-in-the-loop person re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6122–6131
336. Wang J, Yan Y, Zhang Y, Cao G, Yang M, Ng MK (2020) Deep reinforcement active learning for medical image classification. In: International conference on medical image computing and computer-assisted intervention. Springer, pp 33–42
337. Feng M, Xu H (2017) Deep reinforcement learning based optimal defense for cyber-physical system in presence of unknown cyber-attack. In: 2017 IEEE symposium series on computational intelligence (SSCI). IEEE, pp 1–8
338. Baumann R, Malik KM, Javed A, Ball A, Kujawa B, Malik H (2021) Voice spoofing detection corpus for single and multi-order audio replays. Comput Speech Lang 65:101132
339. Gonçalves AR, Violato RP, Korshunov P, Marcel S, Simoes FO (2017) On the generalization of fused systems in voice presentation attack detection. In: 2017 international conference of the biometrics special interest group (BIOSIG). IEEE, pp 1–5

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Momina Masood completed the B.Sc. (Computer Science) degree with honors and 3rd position from Fatima Jinnah Women University, Rawalpindi, Pakistan in 2015 and the MS (Computer Science) degree from the University of Engineering and Technology, Taxila, Pakistan in 2018. She is currently working as a Programmer at the Computer Science Department, University of Engineering and Technology, Taxila, Pakistan. Her research interests include computer vision, medical image processing, multimedia forensics, and machine learning.