A Review of Deepfake Techniques: Architecture, Detection, and Datasets
ABSTRACT Driven by continuous advancements in artificial intelligence, especially deep learning, the
level of realism associated with deepfake technology continues to improve year after year, which poses
unprecedented challenges to the field of deepfake detection. The boundary between what we as humans can
detect as real or fake becomes ever more blurred as new generations of algorithms such as Dall-E 3 and Stable
Diffusion are released. This paper provides a comprehensive study into the landscape of deepfake detection,
exploring in-depth the key challenges, recognising recent successes, and suggesting promising avenues for
future research. A meta-literature review is conducted to identify the current challenges and future directions,
which form the foundation of this work. They are investigated by analysing state-of-the-art research with
a focus on the key components that are crucial to the design of a deepfake detector, i.e., the architecture,
detection methods and datasets. A major challenge identified by this study is the lack of dataset diversity
leading to unfair attribute representation. This must be addressed by improving standardisation on dataset
ethics and privacy. This is one of the main reasons for the insufficient generalisation capability of current
deepfake detectors as demonstrated by their unsatisfactory performance when faced with unseen data or
data in the wild. This literature review provides deepfake detection researchers and practitioners with the
latest information that will serve as a vital resource for their continued and important activity, now and in
the future.
INDEX TERMS Deepfakes, deepfake detection, generative AI, deep learning, machine learning, artificial
intelligence, datasets, survey.
the integrity and authenticity of the video ensured that it failed its objective. This deepfake was one of many examples where such technology, in the wrong hands, can become a weapon in the digital age that we live in.

Academic research in the field of deepfakes has grown significantly since 2017, as illustrated in Figure 1, which shows statistical data collected from Dimensions [5] using the publication type 'Article' or 'Preprint' and a date range from '2017' to '2024'. Note that a linear trendline was used to extrapolate them until 2025. Unsurprisingly, considering the mass media attention around deepfakes and fake news, the statistics highlight that the volume of published research on deepfake detection is far outpacing research on deepfake creation, illustrating the demand for solutions on this controversial topic.

FIGURE 1. Statistics depicting the number of deepfake, deepfake creation or deepfake detection studies published over the last seven years (2017 to 2023).

Prior to deepfakes, the detection of manipulated imagery often focused on the semantic characteristics within the image, in terms of what can be seen and its overall composition. For example, research by O'Brien and Farid [6] focused on the vanishing point in an image to establish the relationship with common reflections and determine the feasibility of the image containing forged content. Lighting and shadow details within the image are other examples of inconsistencies that may occur. Furthermore, Johnson and Farid [7] report that lighting and cast shadows can be used to assess whether the lighting source is consistent for all the objects within the image and, therefore, with reasonable accuracy, identify whether manipulation has occurred. These techniques have successfully transitioned to the task of deepfake detection, with promising results. For example, Wu et al. [8] explain that subtle clues in the swapped face region can expose inconsistencies that do not match the composition of the image as a whole and are often the unintentional result of the deepfake creation pipeline. Additionally, this technique can be effective against video content, as explained by Zhu et al. [9], where inconsistencies in the inter-frame sequence highlight abnormalities. However, quality and detail improvements in image and video media have led to the need for a more robust approach.

Artefacts uncovered in the spatial and frequency domains have exposed vital information relating to either the pixel formation (spatial domain) that makes up the overall image over time, or the frequency representation, e.g., low- or high-frequency components (frequency domain), which corresponds to the rate at which the pixel information is changing. For example, a disturbance in the surrounding pixel formation between the old and new content can provide valuable statistical information that may expose the boundary of where the manipulation occurred [10]. Additionally, facial blending inconsistencies, which are inherently transferred by the synthesis process during the creation of a deepfake, leave traceable artefacts in the image statistics [11]. The camera-model fingerprinting method NoisePrint has demonstrated success in applying the Photo-Response Non-Uniformity technique to extract and compare noise signatures from images in a manner similar to extracting a person's fingerprint [12]. Furthermore, utilising the spatial and frequency domains as handcrafted features for ML has paved the way forward with novel detection methods capable of inferring information from complex data with limited human interaction. Typically used with a binary classifier and a fully connected layer, the process of selecting the features to be learned can present challenges, particularly when the underlying pipeline for deepfake creation is continuously evolving. Indeed, this can result in poor generalisation on unseen data.
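To make the idea of a frequency-domain handcrafted feature concrete, the following minimal sketch (not taken from any paper reviewed here; the radial cutoff and the feature definition are illustrative assumptions) computes the share of spectral energy in the high-frequency band of a greyscale image, a scalar that could then be fed to a conventional binary classifier:

```python
import numpy as np

def high_frequency_ratio(gray: np.ndarray, cutoff: float = 0.25) -> float:
    """Share of spectral energy above a radial cutoff (illustrative feature)."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray))   # centre the zero frequency
    energy = np.abs(spectrum) ** 2
    h, w = gray.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h / 2, xx - w / 2)       # distance from spectrum centre
    high = energy[radius > cutoff * min(h, w)].sum()
    return float(high / energy.sum())
```

Upsampling and blending operations in synthesis pipelines tend to leave statistical traces in exactly such bands, which is why features of this kind were an early staple of ML-based detectors.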
An important milestone in the detection of deepfakes has been made possible through DL, whereby a model is able to learn complex multi-dimensional patterns from complex datasets using artificial neurons that replicate the way in which the human brain works [13]. In addition, this enables a richer representation of features to be learned in a way that would not be achievable through standard ML [14].

Since the take-off of DL around 2012, DL architectures have rapidly evolved, driving significant advancements in deepfake research, which can be observed in Figure 2. This progress began with Convolutional Neural Networks (CNNs), including VGG and CapsNet, which played a foundational role in shaping modern deepfake detection techniques and paved the way for future advancements with their novel approaches. However, as the years have advanced, interest has increasingly shifted towards the development of hybrid architectures. In particular, this can be seen in the number of variant architectures, including Transformer and CapsNet, which have seen great interest from the academic community.

To establish the starting point, ten previously published literature review papers from between 2022 and 2023 were evaluated in Section III to identify the key successes and challenges. Informed by them, four challenge themes, i.e., dataset, architecture and scalability, explainability, and evaluation, were established to further complement this study and guide the reader towards the main themes. Section IV provides the reader with a high-level overview of the common architectures used in deepfake detection, whereas Figure 2 provides a timeline of the main architectural variants and shows where research activity has evolved. A comprehensive breakdown of the techniques used for deepfake detection is presented in Section V, followed by Section VI, which provides the reader with an overview of the datasets used for model training and evaluation. Finally, Section VII reviews the observations and future trends exposed by the papers studied in Section V.
FIGURE 2. Timeline of architectures by class. The diagram highlights some of the main DL architectures and their associated variant architectures over time. The coloured bubbles indicate the number of research papers dedicated to new hybrid variants.

A. AIMS AND OBJECTIVES
The aim of this literature review is to provide the reader with an in-depth evaluation of the latest research on deepfake detection. This is achieved by exploring the various types of architectures and datasets that are used in order to understand how these crucial elements contribute to delivering State-of-the-Art (SOTA) detection. The primary objectives are as follows:
• Present an overview of key findings from recent literature reviews and produce a snapshot view of the observed challenges associated with the key challenge themes from Section I-B.
• Evaluate the architectures used in deepfake detection and identify their strengths and weaknesses.
• Investigate and compare SOTA research papers on deepfake detection.
• Conduct a review of datasets used for deepfake detection and the impact they have on generalisation and bias.
• Identify and compare the key challenges against those observed in previous literature reviews.

B. CHALLENGE THEMES
To guide the categorisation and identification of current and future key challenges, this analysis is based on the most important themes highlighted in previous literature review papers (see Section III). Table 1 provides a breakdown of these challenges along with their descriptions.

II. METHODOLOGY
This section provides an overview of the process used to undertake this review paper and details the research strategy and timeframe used.

A. RESEARCH STRATEGY
In order to cast a wide net over this research domain, deliver a comprehensive review, and not be limited by any specific angle, no specific journal databases were used for the acquisition of research papers. In addition, since this review is focused on deepfake detection methods, material specific to the creation pipeline of deepfakes was not included in the search strategy. Moreover, as the process for selecting detection papers was based on image and video techniques, publications specifically associated with audio or text were excluded.
B. TIMEFRAME
The following timeframe was applied throughout this review paper in order to ensure that a comprehensive review is conducted while providing only the most relevant and recent research.
• Meta-Literature Review (Section III) – due to the fast-moving pace of research in this field, it was decided to prioritise the search for literature review papers published between 2022 and 2023 to provide coverage of the most recent work. Eventually, a subset of ten high-quality papers (see Table 2) was selected for review.
• Architecture (Section IV) – the search for papers on the subject of DL architectures was not limited by a date range; instead, the architecture needed to be associated with the learning of image media. The aim was to uncover the extensive range of research and the various key architectures and hybrid variants (see Figure 2 for a reduced diagram and Figure 5 in Appendix A for a full diagram).
• Deepfake Detection (Section V) – from over one hundred papers, a subset of twenty-nine prominent papers published in 2023 was chosen for analysis. This was based on their novel approach or contribution to this research domain.
• Datasets (Section VI) – the search for papers on the subject of datasets was not limited by a date range; instead, an extensive search was performed to identify as many key and novel datasets as possible (see Figure 3 and Figure 4).

III. META-LITERATURE REVIEW
This section explores ten recently published review papers in the field of deepfake creation and detection, aiming to provide a snapshot view of the approaches taken while understanding the common challenges and future directions. Table 2 provides a breakdown of the ten review papers selected, which cover the period between 2022 and 2023.

A. CURRENT CHALLENGES AND FUTURE DIRECTIONS
Based on the review papers listed in Table 2, current challenges and future directions have been clustered into the following four themes: dataset, architecture and scalability, explainability, and evaluation. The aim is to cover the high-level trends affecting deepfake detection research while identifying ways to mitigate these challenges and support future directions.
1) DATASET
The dataset theme is arguably the most important one, as this fundamentally determines how effective a trained model is at performing the task at hand. Access to good-quality data is therefore crucial. Although this often becomes a challenge for researchers, it can sometimes be overlooked.

A lack of publicly available global datasets, which provide a fair representation of forgery techniques and real media, has been observed as a critical limitation in the development of deepfake detection methods [15], [22], [23]. Researchers will often supplement their work by utilising custom datasets [18], [23] for training and evaluation. However, these generally provide limited insight into how effectively a model performs compared with other datasets and are often unavailable to the general public. A community-led global dataset [23] for AI-synthesised images could provide a global approach to benchmarking and evaluating model performance. Providing a structured approach [18] to the way in which data is used for training and evaluation would allow for improved measurable accuracy while offering a consistent and transparent approach. Existing datasets commonly used by researchers [24] for training and evaluation include FaceForensics++ (FF++) and Celeb-DF. However, these alone do not provide enough of a challenge [23] to evaluate success, as they do not represent current media found in the real world.

The quality and size of the datasets are also observed [17], [19], [21], [22] as a key challenge, especially as some of the detection methods are trained on limited data [17] or data that is specific to a single creation technique [20]. A model's ability to generalise to unseen data becomes a greater challenge when the quality and size of the data are limited [19]; this may render the model ineffective against real-world data. This is particularly important in situations of fairness, whereby the risk of increasing bias towards certain attributes, including ethnicity, gender and age, has genuine consequences that can lead to racial profiling and discrimination if used in real-world applications [19]. The risk of adversarial attacks to avoid detection [21] is also raised as a serious concern, whereby input perturbations are applied to the data (sometimes referred to as data poisoning) to prevent the model from learning true representations of the intended data. Moreover, the use of pre-processing techniques [9] has implications, particularly if the researcher has not fully understood the dataset in question.
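As a concrete illustration of the fairness audit implied above, the sketch below tabulates per-class attribute representation with pandas. The metadata table is hypothetical; most public deepfake datasets do not ship such attribute labels, which is itself part of the problem:

```python
import pandas as pd

# Hypothetical metadata for a labelled deepfake dataset.
meta = pd.DataFrame({
    "label":     ["real", "fake", "fake", "real", "fake"],
    "gender":    ["f", "m", "f", "m", "f"],
    "ethnicity": ["a", "b", "b", "a", "c"],
})

# Per-class share of each attribute value; strong skews here signal the
# kind of representation bias that harms fairness and generalisation.
for attr in ("gender", "ethnicity"):
    print(meta.groupby("label")[attr].value_counts(normalize=True))
```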
The general consensus indicates that many of the published detection methods are constrained to working in environments that do not reflect the real world and are therefore likely to fail to generalise or infer correctly if applied to data in real-world applications. Continuous advances in technology [22] will eventually lead to the creation of full-body deepfakes, which will ultimately present new challenges for managing new and emerging datasets going forward. As observed by Naitali et al. [25], the WildDeepfake [26] dataset provides imagery related to full-body deepfakes, yet its volume and diversity still somewhat limit its effectiveness for AI. Despite the limited amount of available full-body deepfake data, research by Hong et al. [27] highlights the progress being made to generate 3D moving full-body content, which in the future will aid the development of new datasets.

2) ARCHITECTURE
Early implementations of ML architectures exposed weaknesses in their ability to extract local and global features [15], making it difficult for the model to effectively learn from the selected handcrafted features. Indeed, selecting the relevant features [17] is a complex process, which has become more and more challenging as the quality of deepfake media improves. Combining DL approaches with traditional techniques (statistical analysis, for example) [18] has highlighted an overall improvement in model performance, suggesting that selecting handcrafted features is no longer sufficient for learning complex patterns in media content. A shift towards a hybrid approach [19], [21], combining various architectures to overcome known limitations, has prompted a new research direction focused on DL. Although at present there is no architectural framework that provides a sufficiently stable platform, the Xception Network [18] has demonstrated promising results as a DL architecture. Indeed, recent literature demonstrates how DL approaches can outperform non-DL approaches [24]. For example, a multi-modal architecture has been proposed to allow for combining features across audio, imagery and video media [17]. However, there is a lack of research on temporal aggregation [18] in video media, whereby the temporal consistency between frames is evaluated rather than providing a binary classification for the entire video sequence.

A negative trade-off in the success of DL approaches comes in the form of a steep rise in the computational resources required as the models grow in complexity [15]. Fortunately, adopting pre-trained models based on existing architectures is one way of overcoming this challenge [21]. Pre-trained models can be fine-tuned for downstream tasks [21] while reducing the number of trainable parameters and the overall computational effort, as sketched below. Managing a model's complexity to ensure its efficiency through optimisation is essential for the transition from research to capability deployment. However, ensuring optimised inference times [17] while expanding a model's size through new training data needs careful consideration. This is particularly true for DL architectures, where the number of parameters dramatically increases with network depth [15].
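As an illustration of the pre-trained-backbone strategy, the minimal PyTorch sketch below freezes an ImageNet-pre-trained ResNet-50 and trains only a new binary head; the choice of backbone and weights is an assumption for illustration, not a recommendation from the cited reviews:

```python
import torch.nn as nn
from torchvision import models

# Load a pre-trained backbone and freeze it, so only the new binary head
# is trainable -- one common way to realise the parameter savings
# described above (weight-name strings require torchvision >= 0.13).
model = models.resnet50(weights="IMAGENET1K_V2")
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)  # real vs. fake head

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")   # head only, ~4k vs. ~25M total
```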
3) EXPLAINABILITY
The third and most challenging theme is explainability [15], [16], [17], [20], [22], as most of the proposed deepfake detection methods provide only a binary classification (real or fake) as their output. The lack of additional context around the model's decision process can lead to issues in trust and interpretability [22], which ultimately results in ambiguity [15] in understanding and evaluating a model's effectiveness. In particular, many of the reviewed methods fail to provide localised information [22] that would identify where the manipulation likely occurred. This is further compounded by the black-box [20], [22] nature in which a model is trained and tested, which is not adequate [16] for providing confidence to the user that an image or video is a deepfake. By addressing the black-box nature as a main area of research in DL, progress is being made that will eventually support deepfake detection. One way of adding explainability could be through the use of multi-class classification combined with a confidence score [17], as this would give the user the power to make an informed decision.
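A minimal sketch of that suggestion (the four classes are hypothetical) is to expose the softmax probability of the predicted class alongside the decision:

```python
import torch

# Hypothetical 4-way head (e.g., real / face-swap / reenactment / synthesis).
# Reporting the per-class probability alongside the decision provides the
# confidence score discussed above.
logits = torch.tensor([[-0.3, 2.1, 0.4, -1.2]])
probs = torch.softmax(logits, dim=1)
conf, pred = probs.max(dim=1)
print(f"class {pred.item()} with confidence {conf.item():.2f}")
```

Note that raw softmax confidences are known to be poorly calibrated, so in practice such scores would need calibration (e.g., temperature scaling) before being presented to a user.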
4) EVALUATION
The absence of a structured and uniform approach to evaluating deepfake detection methods is observed in [19] and [23], with an emphasis on the creation of a consistent approach to benchmarking against other SOTA techniques. Although the accuracy metric [24] is identified as the most common measurement, it does not take into account the multitude of existing metrics, including precision, recall, and the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), which provide valuable insight into the state of a given model. In addition, reported evaluation metrics are at times over-inflated [23] and do not clearly reflect the way in which the model was trained or evaluated, which in turn leads to inconsistencies in the authors' work. As the measure of success is often based on achieving higher accuracy than other SOTA techniques while benchmarking against common datasets [21], little to no attention is paid to real-world scenarios. The fact that many of the common datasets contain outdated image and video content that does not adequately reflect new and emerging deepfake technology [19] diminishes the models' ability to perform well against data from the wild.
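For reference, all of the metrics named above are available off the shelf; the sketch below (with toy labels and scores) computes them with scikit-learn, where AUC-ROC is the only one that does not depend on a fixed decision threshold:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Toy ground truth (1 = fake) and model scores; in practice these come
# from a held-out, ideally cross-dataset, evaluation split.
y_true  = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
y_pred  = [int(s >= 0.5) for s in y_score]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_score))  # threshold-free
```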
IV. ARCHITECTURE
To put into context the architectures that have been exploited in deepfake detection, the following section provides an overview of the architectures commonly used in the computer vision domain. Figure 2 provides a timeline of the common DL architectures. The coloured bubbles within the figure indicate where increased research has been dedicated to creating new hybrid implementations based on a specific class of architecture.
A. CONVOLUTIONAL NEURAL NETWORK
One of the most prominent contributions to research on Convolutional Neural Networks (CNNs) was made by LeCun et al. [28] in 1989 to overcome the challenge of performing numerical character recognition and classification through ML. This subsequently led to the development of the LeNet-5 architecture in 1998 by LeCun et al. [29], which consisted of a seven-layer network using three convolution layers (with a kernel of 5 × 5) with feature maps of size 6, 16 and 120, two subsampling layers with 2 × 2 receptive fields for average pooling, and finally two fully connected layers. According to LeCun et al. [29], the use of back-propagation in the network can help reduce the number of trainable parameters, since the weights in the feature extractor can be learned through a shared scheme.
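A PyTorch rendering of that layout might look as follows (a sketch of the description above, not the original 1998 implementation; the Tanh activations follow the conventions of the era):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Sketch of the LeNet-5 layout described above: three 5x5 convolutions
    with 6, 16 and 120 feature maps, two 2x2 average-pooling (subsampling)
    layers, and two fully connected layers."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),
            nn.AvgPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),
            nn.AvgPool2d(2),
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```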
1) CAPSULE NETWORK
Proposed in 2011 by Hinton et al. [30], the Capsule Network (CapsNet) was designed to overcome a significant limitation in the way complex spatial relationships are learned by conventional CNNs. According to Hinton et al. [30], CNNs are non-equivariant and therefore do not take into account the precise positional relationships between facial features in both the local and global feature space. In essence, a CNN will learn the arrangement of facial features (eyes, nose, mouth, etc.) as individual components and will not take into account the location of neighbouring facial components. To learn the global feature space, Kwabena Patrick et al. [31] explain how each capsule contains a series of neurons that focus on learning from the same feature space while individually learning distinctive properties of the feature. Further to this, the CapsNet uses two layers of capsules, represented as the lower- and higher-level capsules, whereby the output from each capsule is in the form of a vector. Interest from the research community in modifying and improving the architecture can be seen in Figure 2, where the growth of hybrid variants increased between 2018 and 2021.
2) INCEPTION NETWORK
Szegedy et al. [32] proposed the Inception network in 2014, using a sparsely connected architecture combined with a CNN not only to achieve greater network depth but also to overcome the demand on computational resources. According to them, significant growth in network parameters can occur as a consequence of increasing the number of fully connected layers within the architecture, due to the connectivity between each and every layer across the network. They suggest applying dimensionality reduction through a series of max and average pooling operation aggregations to help reduce the number of expensive computational operations while preventing the transfer of a large number of filters between the layers within the network. The authors claim that a key design motivation is the scalability of the architecture to run on devices with potentially low computational resources, making it more viable for deployment in real-world situations.
3) XCEPTION NETWORK
In 2017, Chollet [33] presented the Xception network as a parameter-efficient adaptation of the Inception-V3 [34] architecture, where stackable depthwise separable convolutions are used as a replacement for the Inception module while achieving notably improved performance. The original hypothesis behind the Inception network was that cross-channel and spatial correlations should not be mapped together in the feature map, as they are sufficiently separate. Xception takes this further, assuming that cross-channel and spatial correlations can be mapped entirely separately in the feature map, as sketched below. Comparative testing on the ImageNet dataset shows only marginal improvement in performance, yet the author highlights that the parameter count is similar to that of the Inception-V3 network and that any improvement is likely attributable to the more efficient use of the parameters rather than the depth of the network.
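The building block itself is compact; a minimal PyTorch sketch of a depthwise separable convolution (channel sizes are illustrative) is:

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable convolution in the spirit of Xception: a
    per-channel (depthwise) spatial filter followed by a 1x1 (pointwise)
    projection, mapping spatial and cross-channel correlations separately."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel, padding=kernel // 2,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

print(SeparableConv2d(64, 128)(torch.randn(1, 64, 32, 32)).shape)
```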
4) DENSE CONVOLUTIONAL NETWORK
The Dense Convolutional Network was proposed by Huang et al. [35] in 2017 as a stackable feed-forward network designed as a more efficient and deeper CNN, where dense blocks act as the building blocks of the network. Inspired by the use of skip connections in the Residual Neural Network (ResNet) architecture to overcome the challenge of gradient flow, the authors propose direct connections from any layer to all preceding layers, thereby alleviating the gradient flow problem. Furthermore, a convolution and down-sampling layer using average pooling is positioned between each of the dense blocks. The paper highlights competitive results against other SOTA architectures on the tested evaluation datasets, with the potential to provide improved learning through feature sharing across the network.

B. TRANSFORMER
The Transformer was proposed by Vaswani et al. [36] in 2017 as an alternative to CNN and Recurrent Neural Network architectures and is based on the use of self-attention. According to them, the design incorporates an encoder and decoder structure and is based on a fully connected feed-forward network. Using learned embeddings, the input and output tokens are converted to vectors, with additional positional information embedded.

The Vision Transformer (ViT) was proposed by Dosovitskiy et al. [37] in 2021 and builds on the success of the Transformer for Natural Language Processing. Using self-attention as a key component, the architecture works by feeding an input image as a sequence of fixed-size patches, each flattened into a one-dimensional vector with embedded positional information, before being linearly projected to a Transformer Encoder using Multi-Headed Self-Attention, as sketched below. The classification is then performed using a Multi-Layer Perceptron block. Even though impressive results are observed on the evaluated datasets, the authors highlight the lack of image-related inductive bias when the model is trained on smaller datasets, which can result in overfitting or poor generalisation. However, the authors observe that on larger datasets, the model learns the patterns from the data itself, and therefore inductive bias becomes less essential. Similarly, the ViT has gained increased interest from the research community, as illustrated in Figure 2, where hybrid variants have continuously been developed.
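A minimal sketch of the ViT input pipeline (patch size, image size and embedding width follow common ViT-Base settings but are assumptions here) is:

```python
import torch
import torch.nn as nn

# Split the image into fixed-size patches, flatten each, project linearly,
# and add learned positional embeddings (zero-initialised in this sketch).
image = torch.randn(1, 3, 224, 224)
patch, dim = 16, 768
num_patches = (224 // patch) ** 2                                   # 196

patches = image.unfold(2, patch, patch).unfold(3, patch, patch)     # 1x3x14x14x16x16
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)
tokens = nn.Linear(3 * patch * patch, dim)(patches)                 # linear projection
tokens = tokens + nn.Parameter(torch.zeros(1, num_patches, dim))    # positional embedding
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The resulting token sequence is what the stack of Multi-Headed Self-Attention encoder layers then consumes.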
C. GENERATIVE AI
The ability to generate realistic content based on generative AI has accelerated over the past several years with the introduction of advanced DL models that are capable of generating text, image, video, and audio content [38]. Dall-E 3 [39], Imagen [40] and Stable Diffusion [41] are just some of the prominent models available, which can produce new samples based on the patterns learnt by the models [42].
1) GENERATIVE ADVERSARIAL NETWORK
The concept of training two simultaneous models that compete against each other was proposed by Goodfellow et al. [43] in 2014 and is known as the Generative Adversarial Network (GAN). The two models, a generator (creator) and a discriminator (detector), are trained against each other, where the aim of the discriminator is to estimate the probability that a given sample came from the training dataset rather than from the generative model. According to them, a key design feature is the implementation of backpropagation in the network for improved gradient flow, which is an alternative to the use of Markov chains. Furthermore, forward propagation can be used to sample from the generative model.
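A toy PyTorch sketch of this adversarial objective on synthetic 2-D data (all shapes and hyperparameters are illustrative; real deepfake generators are deep convolutional networks) is:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))   # generator
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))   # discriminator
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

for _ in range(100):
    real = torch.randn(32, 2) + 3.0            # stand-in "training data"
    fake = G(torch.randn(32, 8))
    # Discriminator: estimate the probability that a sample is real.
    d_loss = (bce(D(real), torch.ones(32, 1))
              + bce(D(fake.detach()), torch.zeros(32, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator: fool the discriminator into labelling fakes as real.
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```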
2) DIFFUSION MODELS
Diffusion Models (DMs), described by Yang et al. [44], are based on a type of probabilistic generative model that works by introducing noise into the input image as it traverses the network in a forward pass. To generate new content, the model must learn to reconstruct the image by removing the noise, which is known as the diffusion process. According to the authors, the three key types of DMs are Denoising Diffusion Probabilistic Models [45], Score-based Generative Models [46] and Stochastic Differential Equations [47].
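For Denoising Diffusion Probabilistic Models specifically, the forward (noising) process has a closed form; the sketch below (schedule values are common illustrative defaults) draws a noised sample x_t directly from a clean image x_0:

```python
import torch

# Linear noise schedule; alpha_bar[t] controls how much of the clean
# image survives at step t. The network is then trained to predict the
# added noise so it can reverse this process at sampling time.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    eps = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps

x0 = torch.randn(1, 3, 64, 64)   # stand-in for a training image
print(q_sample(x0, t=500).shape)
```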
tion and their spectral relationship within the frequency
V. DEEPFAKE DETECTION domain. The design incorporates both a Regular and Irregular
The aim of this section is to explore the most current literature Local Fourier to increase local information, whilst utilising a
on deepfake detection in order to establish a baseline view Pointwise convolution to overcome the challenge of gradient
of the current SOTA methods available. Table 3 provides loss and explosion. To extract multiple features and capture
a breakdown of twenty-nine novel methods from literature their interactions, the authors incorporate a cross-attention
in 2023 and each represents a unique approach to deepfake mechanism with multiple Binary Cross-Attention blocks.
detection. Although evaluation results demonstrate comparative results
against other SOTA methods, particularly in cross-dataset
A. CONVOLUTIONAL NEURAL NETWORK evaluations, the model underperforms as the level of image
To improve generalisation, the authors in [51] use a ResNet18 compression increases through post-processing operations.
as the base architecture and combine a K-Nearest Neig- To address this, it is proposed [62] to achieve greater robust-
bors algorithm for the feature classification. To enhance the ness by leveraging low and high features that are often
architecture, they implemented Error Level Analysis dur- destroyed during the post-processing operation with a frame-
ing the pre-processing stage and supplied it as the input work called TruFor [62]. Using a modified Noiseprint [12],
to the ResNet18 for feature extraction. Although better they intend to expose essential image information about the
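The Error Level Analysis step used in [51] is straightforward to reproduce; a minimal sketch using Pillow (the re-save quality is an assumption) is:

```python
from io import BytesIO
from PIL import Image, ImageChops

def error_level_analysis(path: str, quality: int = 90) -> Image.Image:
    """Re-save the image as JPEG and return the per-pixel difference."""
    original = Image.open(path).convert("RGB")
    buffer = BytesIO()
    original.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    resaved = Image.open(buffer)
    return ImageChops.difference(original, resaved)
```

Regions edited after the original compression tend to exhibit a different error level than their surroundings, which is what makes the difference map a useful network input.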
Study [53] employs a patch-based approach using the Gram-Net architecture proposed in [78]. According to the authors of [78], Gram-Net incorporates a Gram Block positioned before the down-sampling layer of a ResNet backbone to learn global features. However, it is worth considering that the reported 100% accuracy in the initial evaluation of the first two datasets [53] might require further validation to ensure robustness. In [60], the authors identify a potential weakness in the CNN architecture that could lead to spatial information being lost because of the design of the max pooling layer, whereby the model becomes unable to learn truthful representations from the original image features. Thus, they hypothesise that by adopting a Visual Geometry Group (VGG)-19 network as a feature extractor, spatial information in the lower layers of the network can be retained. To emulate how the human brain works, a CapsNet is used for the classification task, as it uses neurons to learn key relationships between facial components and their orientation. Eventually, their new architecture not only delivers enhanced performance and convergence speed but also reduces the model complexity. In [55], the authors also address CNNs' insufficiency in extracting spatial features, whilst bringing to attention the need for further research into temporal feature extraction. They tackle these shortcomings by implementing pre-processing steps to enhance image quality using Gaussian noise reduction and employing the Lucas-Kanade optical flow algorithm. As in previous work, the authors finalise their design using the VGG with a CapsNet, to which they add a modified Dynamic Routing Algorithm. They report significant performance gains over other SOTA methods.

Methods to extract noise information from the frequency domain have demonstrated significant progress in the field of deepfake detection. Restoration techniques to restore low-quality artefacts from compressed images could prove valuable, especially when access to high-quality deepfakes is often limited. The authors in [73] consider magnitude and phase spectra to capture contour and textural information and their spectral relationship within the frequency domain. The design incorporates both a Regular and an Irregular Local Fourier module to increase local information, whilst utilising a pointwise convolution to overcome the challenge of gradient loss and explosion. To extract multiple features and capture their interactions, the authors incorporate a cross-attention mechanism with multiple Binary Cross-Attention blocks. Although the evaluation demonstrates comparative results against other SOTA methods, particularly in cross-dataset evaluations, the model underperforms as the level of image compression increases through post-processing operations. To address this, it is proposed [62] to achieve greater robustness by leveraging low- and high-level features that are often destroyed during post-processing operations, with a framework called TruFor [62].
Using a modified Noiseprint [12], they intend to expose essential image information about the manipulation history of an image to form a unique noise signature. Then, Contrastive Learning is used to compare the similarity of random patches extracted from the input image to learn anomalies based on their noise signature and, in turn, infer whether the image has been forged. Also observing that image post-processing techniques can result in the contamination of low-level features and, fundamentally, information loss, it is proposed to use more traditional data-centric techniques for feature enhancement, such as sharpening [58]. Unfortunately, this leads only to minor performance gains. To restore low-level contaminated features to their original state, it is suggested [58] to employ an EfficientNet backbone using an adversarial learning strategy and discriminator, which is in contrast to the approach in [55]. Cross-dataset evaluation shows the value of this approach by outperforming other SOTA methods. Still, it is clear that further research in this area should be considered, particularly in the anti-forensic domain.

Research on learning temporal inconsistencies in deepfake video content is another area showing promising results. A ResNet with Contrastive Learning is used in [49], in which discriminative features are learnt through a multi-modal approach (video and audio) using two separate networks. Named 'Person of Interest' (POI), the models are trained on real video content using a ResNet50, delivering significant performance gains through the use of the combined feature embeddings, which are fused using a Multilayer Perceptron. As the study recognises a potential limitation that can result in reduced generalisation when only a single POI video is provided, the authors recommend that multiple videos of the target are used to ensure greater accuracy. Instead of focusing on a multi-modal approach, study [50] uses extracted video keyframes to overcome compression and image quality loss using a Deep Convolutional Transformer model, which utilises Convolutional Pooling and Re-Attention.
A 17-layer CNN with a kernel size of 3 × 3, batch normalisation and a GELU activation function is used to extract local features before they are fed directly to a Pooling Transformer, which uses depthwise separable convolution for learning global representations.
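One stage of such a local feature extractor might be sketched as follows (a generic sketch; channel widths and depth are assumptions, with only the 3 × 3 kernels, batch normalisation and GELU taken from the description of [50]):

```python
import torch.nn as nn

# One local-feature stage: 3x3 convolution, batch normalisation, GELU.
def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.GELU(),
    )

stem = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
```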
To address the limitations of the previous method, study [71] uses a noise-based approach where the extraction of the I-frame, or intra-frame, enables a higher level of image quality to be recovered post-compression. Furthermore, the authors utilise a Siamese network combined with a pre-trained Recursive Information Distillation Network (RIDNet) to extract noise patches from the face and background, using a Euclidean distance to ensure the background region is the furthest away. In addition, a Multi-Head Relative-Interaction module is designed as a replacement for the cosine similarity used by Siamese networks, to enhance the measurement of similarity between the noise patches whilst overcoming limitations that the authors perceive as information loss and performance degradation. Overlapping face and background patches are flagged for future research in the study, as these could result in the model generating false-positive classifications. A Multi-Rate Excitation Network is proposed in study [76], where spatial-temporal inconsistencies are learnt through bipartite groups that measure different sampling rates. The expectation is that sampling at different rates can encourage the network to learn longer-distance temporal inconsistencies. A Momentary Inconsistency Excitation module is used to extract spatial artefacts and to force the network to learn short-distance cross-group temporal inconsistencies, while a Longstanding Inconsistency Excitation module focuses on long-term temporal relationships. The evaluation highlights that spatial and temporal embeddings play an equal role in the effectiveness and robustness of a detection method.
The use of facial features to extract identity information is another research direction achieving promising results. Unique facial characteristics acquired from the local and global feature space can prove valuable, especially in the area of identity leakage, towards improving generalisation. In study [74], the authors devise a novel Implicit Identity Driven framework to measure the distance between the explicit and implicit identities of real and fake faces. This approach relies on an Explicit Identity Contrast (EIC) and an Implicit Identity Exploration (IIE) loss, using a CNN architecture as the backbone. The hypothesis is that, within the feature space, the fake face converges more closely with the implicit identity of the target face than with the explicit identity of the source face during the face-swapping process. Therefore, the EIC loss can be used to separate samples within the feature space by creating discriminative feature embeddings, while the IIE loss helps to refine the implicit identity from the target face. Alternatively, the Spatial Interaction Network, proposed in [64], utilises a Region of Interest layer and a Recursive Feature Eliminator to generate a local feature map using coordinates from four facial regions (nose, mouth, left eye and right eye). It is suggested that the removal of the max pooling layer of an Xception network (backbone) keeps the loss of low-level features to a minimum. A Spatial-Aware Module learns the weighted importance of both the local and global features to compute the similarity between the features. A predicted score is then produced via a Multi-Layer Perceptron. Eventually, it is shown that leveraging local features from the global feature map can lead to a reduction in computational resources whilst achieving competitive results. In contrast to the previous method, the authors of [77] highlight that unintentional implicit identity information may be learnt by binary classifiers, which could result in a model becoming biased and therefore increase the risk of misclassification. The study evaluates several architectures pre-trained using FFHQ [79], and despite the model not being trained on several datasets, implicit identity leakage was discovered on the Celeb-DF [80] and LFW [81] datasets. To overcome this, the authors propose a multi-scale anchor to focus on local regions and in turn limit the model's exposure to global identity information. In [75], low-dimensional embeddings are extracted using Principal Component Analysis, an Autoencoder and Fourier analysis to demonstrate how common features in the spatial domain can be used to distinguish between fake and real images. The expectation is that unique geometric attributes associated with the synthesis process can be used to exploit within-class similarities that relate to how the face is aligned, the pose and other relevant factors. Although alterations to the geometric aspects, such as cropping, could result in performance degradation, the authors believe their approach is less likely to be affected by anti-forensic or laundering attacks. Instead of focusing on features in the spatial domain, the approach taken in [54] uses a Rationale-Augmented CNN to perform facial reconstruction for deepfake detection. The authors observe that substituting the cross-entropy loss of an Inception network for a triplet loss function would enable the model to quantify new facial features during the training phase. Experiments show the triplet loss has the potential both to improve the model's overall performance and to provide computational efficiency when performing similarity matches between each face.
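The substitution itself is a one-line change in most frameworks; a minimal sketch with random stand-in embeddings (real network outputs would replace them) is:

```python
import torch
import torch.nn as nn

# Triplet objective in the spirit of [54]: pull an anchor face towards a
# same-identity positive and push it away from a negative in embedding space.
triplet = nn.TripletMarginLoss(margin=1.0)
anchor, positive, negative = (torch.randn(16, 128) for _ in range(3))
loss = triplet(anchor, positive, negative)
print(loss.item())
```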
Another direction of research has focused on Graph Neural Networks (GNNs), as it is believed that they can learn a richer representation of features, particularly when it comes to interconnected facial landmarks, which could prove crucial in the absence of diverse deepfake datasets. In [57], the authors propose to combine a GNN and a pyramid ResNet structure to enhance model interpretability. The input image is first spliced into patches before being fed to the K-Nearest Neighbours algorithm to construct the graph. There have also been attempts to utilise individual pixels [82] for constructing the graph, or to integrate GNNs with the CapsNet and Long Short-Term Memory network, but these resulted in degraded performance due to model complexity [57]. Also motivated by the ability of GNNs to structure vital information from complex facial features by constructing high-level relationships between interconnecting nodes, it is proposed in [72] to improve the learning of long-term interdependencies of a Graph Convolutional Network (GCN) by combining it with a pre-trained ViT network using Contrastive Learning. In addition, since high-level features are supposed to be less prone to interference from the manipulation process, they are key to providing greater generalisation to unseen data. The study observes a relationship between the graph convolution layers and the receptive field for learning long-term information: performance degradation could result from a reduction in the number of layers. To address the limitations of the previous study, in [59] a GCN architecture is enhanced by using a dynamic graph learning approach, since this enables the model to build the structure of each layer dynamically as the model continuously learns. Identifying the initial optimum number of neighbouring nodes for the construction of the graph is highlighted as a limiting factor in the model's design. In other words, the model could demonstrate signs of under- or over-fitting of its data during the training phase. Despite this, the adaptability of the graph structure to change allows for a more flexible approach.
B. TRANSFORMER
The authors of study [52] recognise a key weakness in traditional CNNs, where limited coverage of global features has an impact on a model's ability to learn the image as a whole. The authors further explain how the convolutional filter applied in the pooling layer results in vital information being removed. To overcome the loss of this information, the authors apply a ViT architecture, explaining how embedded global information can be captured through the use of a Multi-Head Self-Attention layer using positional patch embeddings. To improve overall performance, the authors apply a hybrid approach by combining the EfficientNet architecture with a ViT to improve the representation of spatial information across the local and global feature space. In study [56], low- and high-level semantic information is used to evaluate how discriminative features provide separability between real and fake distributions, trained separately using a pre-trained Xception and a ViT network. Using linear probing, the study concludes that the ViT presents a stronger representation of high-level features and has the potential for more robust generalisation to unseen data. However, the authors observe that fine-tuning the complete backbone of parameters would result in efficiency challenges, while identifying that limited deepfake datasets have the potential to cause the model to overfit during downstream training. To overcome this limitation, the study adopts a strategy of fine-tuning 19%, or 16.92 million, of the overall ViT-Base parameters while freezing the remaining backbone.

Access to datasets and expensive computational resources, observed in [56], were also highlighted as challenges in [61], in which the authors emphasise the importance of developing a lightweight model that is capable of operating under such conditions. In contrast to [56], the authors claim to have further reduced the number of trainable parameters of the ViT-Base from 86M to 5.2M, or 6%. The evaluation of this novel approach, termed Shallow ViT, demonstrates competitive results against SOTA methods. However, only intra-dataset experiments were carried out, and therefore it is not possible to determine how the model would generalise under cross-dataset conditions. Alternatively, a lightweight framework is achieved in study [65], where the authors claim their model uses only 3% of the parameters normally associated with other SOTA methods. The authors fuse feature embeddings from spatial, temporal and spatiotemporal features as a holistic approach to deepfake detection, using a transformer-based architecture. Each respective embedding is processed through a pre-trained model using a combination of 2D and 3D convolutional layers with max pooling, which is then fed independently to a series of Transformer encoder layers using Multi-Head Self-Attention and a Multi-Layer Perceptron. The self-attention token embeddings are further concatenated using a sequence pooling technique before a binary classifier determines whether the content is real or fake [65]. Despite concerns about spatial information being lost [60] through the max pooling layer, the authors [65] believe that their framework can lead to improved performance from the reduction of spatial information whilst maintaining results competitive with other SOTA methods. Study [69] freezes the backbone parameters of a Swin Transformer during the training phase and subsequently fine-tunes the model for downstream classification. The authors hypothesise that during the creation of a deepfake, a loss of depth information can be used as a measurement to calculate the feature distance, whereby real faces will measure closer, and real and fake faces will measure further away. Trained using the source, target and fake faces, a depth map is created before being fed to a triplet loss network where the discriminative features are learnt with respect to their latent feature space. As highlighted in the study, SOTA results are observed; however, further testing is required to demonstrate the effectiveness of the model under more challenging conditions.

On the other hand, study [63] presents a novel detection strategy named Thumbnail Layout (TALL), combined with a Swin Transformer to enhance spatial and temporal awareness with the aid of Self-Attention and a Shifted Window mechanism. The study suggests improved generalisation when TALL is used with a Transformer; however, due to its model-agnostic design, TALL can be adapted to work with other DL architectures. The TALL element employs a dense-sampling approach to extract four consecutive frames at random, which are then resized and presented as a thumbnail layout, as sketched below.
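The layout step is simple to sketch; the following function (an approximation of the idea in [63], not the authors' code) tiles four frames into a single 2 × 2 image that a standard image backbone can then process:

```python
import torch

def thumbnail_layout(clip: torch.Tensor) -> torch.Tensor:
    """Arrange four consecutive frames (4, C, H, W) into a single 2x2
    thumbnail image (C, 2H, 2W); in the original method the sampled
    frames are first resized."""
    f0, f1, f2, f3 = clip
    top = torch.cat([f0, f1], dim=2)        # concatenate along width
    bottom = torch.cat([f2, f3], dim=2)
    return torch.cat([top, bottom], dim=1)  # concatenate along height

print(thumbnail_layout(torch.randn(4, 3, 112, 112)).shape)  # (3, 224, 224)
```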
A novel approach of decomposing the spatial-temporal features through a Transformer using self-attention and a self-subtracting mechanism is observed in study [66]. The authors believe that during the creation process, spatial information is often treated in isolation and is therefore independent of the inter-frame sequence, leading to temporal inconsistencies being introduced. To capture these inconsistencies, the self-subtracting mechanism helps to guide the network to target important temporal features. Based on Layer-wise Relevance Propagation [83], the authors introduce explainability to their model by visualising the discriminative and salient areas.

The authors of study [67] observe that poor generalisation can result from model training on datasets where the discriminative features are too subtle for a standard CNN architecture to interpret. To overcome this limitation, the authors propose a Frequency-Aware Attention Feature Fusion using an Xception network as the backbone. To improve the learning of local and global features, augmentation is applied to each RGB frame using a Discrete Cosine Transform and a series of learnable weights to capture a range of frequency-domain information. The authors subsequently convert the frequency information back to the spatial domain to produce a frequency-aware image. Visualising the model using a Gradient-weighted Class Activation Map (Grad-CAM) confirms that their approach can track regions with manipulation whilst ignoring those that are in fact pristine. In contrast, study [68] proposes an audio-visual multi-modal framework that fuses spatial and temporal feature embeddings to learn inter-frame joint relationships. The audio-visual embeddings are first encoded using a spatial and a temporal encoder and are based on a ViT network, which the authors claim is better at capturing a stronger representation of temporal features over time. A modified decoder, using a Bi-Directional Cross-Attention block, was employed to overcome the challenge of collaboratively learning from multi-modal distributions that are not the same.

C. GENERATIVE AI
Deep neural networks might be unable to discriminate subtle artefacts, particularly as generative models advance, which can result in redundant information being learnt. Based on this idea, an adversarial learning strategy to perform artefact disentanglement, using an encoder and decoder structure, is proposed to provide significant improvement during feature extraction [70]. Here, the features, together with the newly constructed fake image, are re-processed through the encoder and decoder to learn the ground truth. The study highlights that adversarial learning can not only achieve stronger generalisation but also overcome some of the challenges of training based on specific datasets.

VI. DATASET
Determining the right type of data for any research topic is challenging, and AI is no exception. Data plays an essential part in the training and validation of any given model, irrespective of the task at hand. However, determining what type of data is needed, and how much, requires careful consideration. Fortunately, access to and availability of datasets for research in the field of deepfakes have become more straightforward thanks to the tremendous effort from both academia and industry, where rich and diverse data can be acquired with ease. Nonetheless, the ethical approach taken to acquiring individual subjects and the consideration for their privacy should be factored into the thought process when determining a dataset's suitability. As can be seen from Table 3, which summarises the papers reviewed in Section V, there is consistency regarding the datasets used for model training and evaluation. This is reinforced by Figure 3, which highlights the most popular datasets and their release dates. It is particularly remarkable that those published between 2019 and 2020 are still highly influential in benchmarking deepfake detectors, whereas deepfake generation technology has progressed significantly since their release. While their popularity is associated with the need to demonstrate comparative results against other SOTA methods that use the same datasets, as the field evolves, the value placed on a dataset should decrease in comparison to modern and more advanced ones. Indeed, the most relevant benchmarking should reflect the current state of deepfake datasets to show true SOTA performance and prove true generalisation.

Finally, another consideration is understanding the makeup of a given dataset in terms of pre- and post-processing activities and data diversity (age, gender, ethnicity, etc.). Indeed, if this is not adequately considered, their usage is likely to have an adverse effect on a model's ability to generalise. For example, Figure 4 illustrates the ratio between real and fake content, highlighting important variations between datasets. The following section provides an overview of the datasets commonly associated with the training and evaluation of deepfake detection methods. A comprehensive process of uncovering datasets from 2015 onwards can be seen in Figure 3. Datasets typically considered benchmarking datasets are reviewed in Sections VI-A to VI-D. Novel datasets are discussed in Sections VI-E to VI-H. Section VI-I explores the considerations of ethics, privacy and dataset diversity, and the challenges of processing operations applied to datasets.
FIGURE 3. Timeline of datasets by class. The darker coloured boxes indicate the number of papers that reference the use of a particular dataset. The papers associated with this figure are based on the deepfake detection methods from Table 3. The specific details of the private datasets are not known at the time of writing this paper.

FIGURE 4. Timeline showing the ratio of real and fake content for each common dataset. Video datasets (left) and image datasets (right). The papers associated with this figure are based on the deepfake detection methods from Table 3.
A. FACEFORENSICS++
The FaceForensics++ (FF++) [84] dataset was published in 2019 for training and benchmarking deepfake detectors. It is a revised version of the previously released FaceForensics (FF) [85] dataset from 2018. The accompanying paper [84] defines four techniques used to generate the content, based on two graphical techniques (Face2Face and FaceSwap) and two learning-based techniques (DeepFakes and NeuralTextures), containing over 1.8 million images from 1,000 video sources. To simulate real-world data, the content was subjected to post-processing using different compression rates, denoted High-Quality (HQ), Low-Quality (LQ) and the original raw format in its uncompressed state. At the time of its release, the average accuracy was 80.87% (calculated from the average accuracies of raw 95.50%, HQ 80.73% and LQ 66.38%), showing that it was quite challenging for SOTA technology. However, the best method from Table 3 (based on the average of the accuracy scores for the evaluation dataset) has since achieved an accuracy of 94.51%. In addition, this does not take into account the breakdown of raw, HQ and LQ. Furthermore, the generation of the manipulated content was based on techniques that have since been superseded. In addition, the under-representation of gender could result in gender bias if the dataset is used in the training and evaluation of a model. Thus, it can be argued that this dataset presents less of a challenge and may not be suitable as a primary benchmarking dataset to evaluate and detect current deepfakes. Despite this, the dataset provides a valuable reflection of once-considered novel manipulation techniques.
subjects currently represented in deepfake datasets. Based on 403 subjects, the content boasts a collection of fake videos using six synthesis models, together with real video content. Additionally, the setting for the video footage is based on the subject talking directly to the camera, which is assumed to be the setting most at risk of manipulation. Augmentation and other pre-processing techniques were omitted from this dataset to allow researchers to apply their own approaches accordingly. However, post-processing was applied to the synthesised videos to sharpen the quality of the content and remove unwanted noise. As reported in the study, the results from the evaluation demonstrate comparative performance against the other datasets tested. However, the study observes a marked increase in performance when additional datasets such as DFDC [87] and FF++ [84] are combined during model training.
G. GENDER BALANCED DEEPFAKE DATASET
Proposed in 2022, the Gender Balanced Deepfake Dataset (GBDF) [91] is an amalgamation of 10,000 real and fake videos sourced from the FF++ [84], Celeb-DF [80] and DeeperForensics-1.0 [86] datasets. Designed to promote gender fairness, the dataset was manually annotated to remove gender bias in model training and evaluation. In addition, the study removed deepfake videos containing irregular gender swapping. The concluding results, as reported in the study, highlight that improvements in gender bias can be achieved and should be seen not only as an important milestone in gender fairness but also in reducing overall bias across other attributes such as age and race.
H. DEEPFAKE IMAGE-HIGH-QUALITY
The Deepfake Image-High-Quality (DFIM-HQ) dataset [92] became publicly available in 2023 to provide a challenging dataset for benchmarking detection methods. The dataset consists of various scenarios, poses, degradations and illumination conditions, making it more closely aligned with the type of images found in the wild. Furthermore, the dataset is a combination of 70,000 images from the FFHQ [79] dataset and 70,000 StyleGAN2 images [93]. Consequently, the authors recognise the risk of inherent bias transfer from FFHQ, which could result in a trained model becoming discriminatory towards certain attributes (age, race and gender). However, the authors propose adversarial debiasing as a technique to overcome this limitation, which is applied to the model training and not the dataset directly. To assess the effectiveness of adversarial debiasing, the authors utilised metrics from AI Fairness 360 [94], which demonstrated reduced bias. It should be noted that the deepfake techniques used in DFIM-HQ are limited to those generated from StyleGAN2 and do not take into account imagery from newer generations of StyleGAN such as StyleGAN3 [95]. Furthermore, other forms of deepfakes derived from face-swapping techniques are also omitted. In addition, the range of permutations by users who have applied image filters in the FFHQ dataset could result in poor generalisation, as such images do not provide a true reflection of real images.
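To illustrate the type of measurement such fairness toolkits provide, the following is a minimal sketch, in plain Python, of one representative metric in the spirit of AI Fairness 360 [94]: the statistical parity difference between two groups. The function name and toy inputs are hypothetical and are not taken from the DFIM-HQ study.

def parity_difference(preds, groups, group_a, group_b):
    # preds: binary detector outputs (1 = flagged as fake);
    # groups: the protected attribute value of each sample.
    def positive_rate(g):
        members = [p for p, grp in zip(preds, groups) if grp == g]
        return sum(members) / max(1, len(members))
    # A value of 0 means both groups are flagged at the same rate.
    return positive_rate(group_a) - positive_rate(group_b)

gap = parity_difference([1, 0, 1, 1], ["f", "f", "m", "m"], "f", "m")
print(f"statistical parity difference: {gap:+.2f}")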
I. DATASET CONSIDERATIONS
Managing datasets, both in terms of volume and structure, is complex and requires careful planning. A data management process can be used to not only streamline the data but also reduce the risk of data contamination or unexpected changes to the data itself. Each dataset should be analysed to understand how the data is distributed and to avoid the risk of under- or over-representation, which could result in an unbalanced and biased dataset.
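As a minimal sketch of such an analysis, the snippet below counts how each attribute value is represented in a dataset's annotation file; the file name and column names are hypothetical placeholders.

from collections import Counter
import csv

with open("metadata.csv", newline="") as f:
    rows = list(csv.DictReader(f))

for attribute in ("label", "gender", "age_group"):
    counts = Counter(row[attribute] for row in rows)
    total = sum(counts.values())
    print(attribute)
    for value, count in counts.most_common():
        # A strongly skewed ratio signals under- or over-representation.
        print(f"  {value}: {count} ({count / total:.1%})")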
1) ETHICS AND PRIVACY
How deepfake datasets are acquired, stored, and used is important from an ethical and privacy perspective. Unfortunately, the datasets [26] and [80], reviewed in Section VI, are examples of where material has been obtained from the Internet, and the associated papers contain little to no evidence of compliance concerning ethics and subject privacy. Indeed, each subject must not only consent to the usage of their identity but also understand how their identity will be used. Recent news has highlighted a growing trend where high-profile deepfakes are targeting individuals. For example, an audio deepfake of US President Joe Biden was distributed on social media, attempting to disrupt voters ahead of an upcoming election [48], [96]. Similarly, Taylor Swift was subject to a series of deepfake images that were shared online without her consent [48], [97]. However, high-profile individuals are not the only targets. High school principal Eric Eiswert was the subject of a deepfake recording that was used to spread racial slurs. Fortunately, police investigators were able to validate the recording as a deepfake [98]. Thus, a framework was defined to outline best practices for the creation and usage of publicly available datasets, ensuring they are followed in a standardised manner [99].

In addition, the side effects of training DL models with vast volumes of personal identities are becoming a topic of interest, with recent research exposing the risk of data leakage caused by a DL model's ability to memorise its training data [100]. Worryingly, using a Stable Diffusion [41] model to generate a new identity, the authors discovered that the generated individual closely matched one of the identities in the training dataset. Furthermore, it was reported in 2022 that a patient's medical images were found in the LAION-5B [101] dataset without the person's consent [102]. This highlights the serious risk of confidential training data being exposed through adversarial attacks and reverse engineering of trained models.

In addition to the points raised above, the impact of model bias on real-world applications has far-reaching consequences that need to be addressed. For example, model bias is highlighted in several key datasets, including Celeb-DF and DFDC, where trained models are unable to correctly classify images against certain demographic populations. This novel research identifies that by addressing the imbalance of facial attributes in leading datasets, improved generalisation could also be achieved against unseen data. Furthermore, the annotated labels used in the creation of a deepfake dataset are often omitted from the public release, making it difficult to evaluate the fairness of the data and how it is represented across a global demographic [91]. The consequence of eroding trust in deepfake detection due to the inequality of data fairness presents an important challenge moving forward. The severity of misclassification caused by systems put in place to safeguard society has the potential to cause great harm, while allowing threat actors to continue taking advantage.
2) DATASET DIVERSITY
Acquiring adequate data to perform model training while maintaining a fair distribution of key attributes (age, gender, ethnicity, etc.) is a notable challenge. Using augmentation techniques in the spatial domain to increase dataset diversity has been seen as an effective approach to overcoming this limitation. However, the process of generating a fake sample for a given dataset is often dataset-specific in its implementation, which can lead to inconsistencies in the distribution from one dataset to another. To counteract this, the authors suggest that combining within- and cross-domain augmentation in the latent space to learn more intrinsic features can lead to improved generalisation. Alternatively, it has been proposed to composite multiple samples from real and fake sources. The authors claim that this technique can not only overcome forgery-specific bias within the data but also close the gap between dataset distributions.
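For reference, a typical spatial-domain augmentation pipeline of the kind discussed above can be sketched with torchvision; the operations and parameter values are illustrative rather than those used by any of the reviewed studies.

import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop and rescale
    T.RandomHorizontalFlip(p=0.5),                # mirror half of the samples
    T.RandomRotation(degrees=10),                 # small in-plane rotation
    T.ColorJitter(brightness=0.2, contrast=0.2),  # mild photometric variation
    T.ToTensor(),
])
# augmented = augment(pil_image)  # applied per sample during training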
The Synthetic-20K dataset attempts to overcome the challenge of demographically under-represented groups of people using techniques that mirror the complex and intricate details associated with the human face. While this is a positive move forward, the value associated with this dataset is somewhat limited by the StyleGAN2 [93] architecture. Opportunely, the ever-improving generative AI technology that has emerged over recent years offers substantial promise for the creation of realistic synthetic imagery. The many Diffusion Models (DMs) [44] already available to the public are a clear indicator of this high level of interest. However, the research community has been slow to respond, with little progress being made to address detection capabilities against DMs. The authors identify that existing detection capabilities are unlikely to generalise to content from DMs due to variations found in the higher-frequency space of the frequency domain. Despite this, some degree of similarity with GANs suggests that models trained with DM data may lead to improved generalisation. Already, the usage of various DMs has contributed to overcoming the limitation of access to domain-specific data with the creation of the DiffusionDB-Face and JourneyDB-Face datasets. The novelty of these datasets includes the comprehensive use of Text-to-Image prompts to generate a rich and diverse collection of imagery. Yet this may lead to additional pre-processing to remove unrealistic imagery. In any case, the presence of model bias through data diversity has the potential to erode trust. As technology continues to push boundaries in quality and realism, how we perceive and differentiate between media in the future will likely be influenced by our cognitive bias.

3) PRE AND POST-PROCESSING
Providing adequate documentation of all pre- and post-processing techniques used in a dataset's creation pipeline is crucial. Understanding the processes used in its creation helps avoid unnecessary operations that could impact model performance. Furthermore, improved transparency can help to promote the reproducibility of results, which can lead to more accurate reporting of model performance. Le et al. [103] address the need for guidance to inform the researcher not only on how the data was prepared (resizing the input vs. cropping) but also on how this may impact the input pipeline of a given deepfake detector. There are many scenarios where pre- and post-processing are used to improve the quality of the data, for example, the implementation of pre-processing operations through data enhancement to remove unwanted information during feature extraction [70]. However, applying these techniques to data that may have already been subject to these operations could have the effect of removing useful features.
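The resizing-versus-cropping trade-off highlighted by Le et al. [103] can be made concrete with a short sketch; the file name is a placeholder and the 224 x 224 target is simply a common detector input size.

from PIL import Image

img = Image.open("sample.png")  # hypothetical input image

# Resizing keeps the whole scene but resamples every pixel, which can
# attenuate the subtle high-frequency artefacts a detector relies on.
resized = img.resize((224, 224), Image.BILINEAR)

# Centre-cropping preserves native pixel statistics but discards any
# context that falls outside the crop window.
w, h = img.size
left, top = (w - 224) // 2, (h - 224) // 2
cropped = img.crop((left, top, left + 224, top + 224))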
VII. OBSERVATIONS AND FUTURE TRENDS
This section provides reflections following the challenge themes previously defined, i.e., dataset, architecture and scalability, explainability, and evaluation.

A. DATASET
Access to rich and diverse datasets is still considered a challenge, as reported in the papers covered in Section V. This is believed to be a contributing factor as to why so many detection methods achieve poor performance when generalising against data from the wild [62]. Indeed, there is a lack of data representing complex and noisy environments, as the background or situational setting of current datasets is generally limited and lacks realistic real-world composition [69]. Pre-processing techniques to enhance image quality before training have been extensively used in this field to encourage improved feature learning and model generalisation. For example, Gaussian blur noise reduction is used to enhance and refine training images to help improve overall generalisation [55]. Alternatively, augmentation techniques that present the data in various compositions, including cropping and rotation, can supplement a dataset when the data lacks diversity. Indeed, two augmentation strategies have been applied to an input image using a contrast loss to improve the feature representation [67]. In addition, new samples can be created through a generator using disentangled artefacts; an Artefacts Cycle Consistency Loss approach is proposed to help reduce the dependency on large datasets and improve overall performance [70]. New and improved architectures are presenting ways to not only optimise a model but also reduce its complexity. A rationale-augmented CNN has also been seen as a possible solution to the challenge of limited data, where a model could not only be used for detection but also for the creation of new facial identities and to extend training and evaluation data [54]. However, there is a risk of the image quality degrading from imperfections in the source training data. Many of the SOTA methods proposed in academic literature focus on model performance through evaluating against large-scale datasets. However, limited coverage is given to the training data and performance of the model itself, which can lead to challenges of experiment repeatability. To overcome this limitation, the TrainFors dataset was curated as a benchmarking dataset for model training and evaluation.
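As an illustration of the Gaussian-blur enhancement step mentioned above, a minimal OpenCV sketch follows; the kernel size and sigma are illustrative values, not the settings used in [55].

import cv2

image = cv2.imread("frame.png")  # hypothetical training frame
# Light Gaussian smoothing to suppress sensor and compression noise
# before the frame enters the training pipeline.
denoised = cv2.GaussianBlur(image, (5, 5), sigmaX=1.0)
cv2.imwrite("frame_denoised.png", denoised)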
Thus, instead of creating additional training data, it has been proposed to mitigate that need by exploiting adversarial learning to reconstruct low-level features to their former state [58]. An alternative approach is to develop lightweight models, which can not only achieve stronger generalisation compared with other SOTA models but can do so with less data [75]. Also promising is the usage of a rich representation of learned features using a GNN architecture, as this can overcome limitations when access to diverse datasets is not available [57]. Finally, the authors in [60] believe their E-Cap Net can generalise to variations in data permutation without requiring additional training data, as the model can represent the input image in its entire state and the relationship between each interconnecting part.
B. ARCHITECTURE AND SCALABILITY
It has been highlighted that, in some instances, too much emphasis is placed on forgery localisation over the direct task of detection [62]. In addition, tailoring the detection method around an architecture intended for image segmentation may contribute to some SOTA methods experiencing poor generalisation and limited robustness. Although improved generalisation can be achieved through the ViT architecture due to its ability to learn high-level features, this poses significant challenges from an increase in model complexity and the requirement for large training datasets [56]. Despite this, the authors conclude that, by applying a fine-tuning strategy while freezing the backbone of the network, significant improvements to the model's efficiency can be achieved.
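A minimal sketch of this freeze-and-fine-tune strategy is given below using a stock torchvision ViT; the choice of backbone and the two-class head are assumptions for illustration and do not reproduce the adapter design of [56].

import torch.nn as nn
import torchvision.models as models

model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)

for param in model.parameters():
    param.requires_grad = False  # freeze the pre-trained backbone

# Replace the classification head with a two-class (real/fake) head;
# only these newly created parameters are updated during fine-tuning.
model.heads = nn.Sequential(nn.Linear(model.hidden_dim, 2))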
Also exploiting a pre-trained network, a lightweight Transformer architecture is designed, where the parameters are frozen while using only 3% of the parameters typically used in SOTA methods [65]. An interesting trend is the usage of architecture-agnostic approaches, where novel detection methods are not constrained to any specific type of image classification model [63], [64]. In other words, this will help to future-proof a model to adapt to new generations of DL architectures. As convergence speed is a crucial element in model deployability, it is proposed to deliver a sparse gradient and compact feature representation by using a Max-Feature-Map (MFM) activation function [60]. Furthermore, improved model efficiency using MFM can also be achieved for other common activation functions, including ReLU and Sigmoid.
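The MFM operation itself is compact enough to sketch directly; the version below follows the classic formulation, in which the channel dimension is split in half and the element-wise maximum is kept, and may differ in detail from the E-Cap Net implementation [60].

import torch
import torch.nn as nn

class MaxFeatureMap(nn.Module):
    # Splits the channels into two halves and keeps the element-wise
    # maximum, acting as a feature selector while halving the width.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = torch.chunk(x, 2, dim=1)
        return torch.max(a, b)

# Usage sketch: a 64-channel convolution followed by MFM yields 32 channels.
layer = nn.Sequential(nn.Conv2d(3, 64, kernel_size=3, padding=1),
                      MaxFeatureMap())
out = layer(torch.randn(1, 3, 224, 224))  # shape: (1, 32, 224, 224)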
Using a Prototype Learning Layer to cluster faces based on similarity matching presents an interesting approach. The model's effectiveness at discriminating real and fake content without having been trained on fake content could lead to improved generalisation, as the model itself is not forgery-domain specific.

Finally, although advancements in generative AI and the increasing range of available Diffusion Models allow the production of realistic imagery that is on par with StyleGAN2 [93] and StyleGAN3 [95], only limited coverage of DMs is observed in the papers reviewed. Despite this, the range of generative AI models capable of creating multi-modal content has surged in recent years. The level of realism and sophistication in generating content using Text-to-Image, Text-to-Audio, etc. requires a shift towards a multi-modal approach to detection. One novel approach utilises contrastive learning to capture features from the audio-visual space using a cross-modal technique. Real video content is fed to a model trainer for downstream learning, where masked embeddings are learnt from alternate modalities. A second model is subsequently trained on the learnt embeddings as a classifier to determine between real and fake videos. Despite improved generalisation, the technique only works with single-person videos. Alternatively, it has been proposed to combine a multi-modal transformer approach to capture spatial and temporal inconsistencies across the audio-visual space. Furthermore, by using dynamic weight fusion, not only is the loss of vital information during the training phase reduced, but common features are also extracted. Evaluation on the DFDC dataset demonstrates significant improvement in performance using this multi-modal approach. It is expected that usage of the underlying architectures from systems such as Dall-E 3 [39], Imagen [40] and Stable Diffusion [41] will open new opportunities in research for the detection of deepfakes.
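A generic sketch of the audio-visual contrastive objective underlying such approaches is given below. It is an InfoNCE-style loss that assumes paired audio and video embeddings for the same clips; it is not the exact formulation of the methods described above.

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    # Pull together the audio and video embeddings of the same clip
    # (the diagonal of the similarity matrix) and push apart all others.
    audio_emb = F.normalize(audio_emb, dim=1)
    video_emb = F.normalize(video_emb, dim=1)
    logits = audio_emb @ video_emb.t() / temperature  # (N, N) similarities
    targets = torch.arange(audio_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)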
FIGURE 5. Timeline of architectures by class. The diagram highlights some of the main DL architectures and their associated variant architectures over time.
C. EXPLAINABILITY
Earlier research into deepfake detection using ML often focused on solving this challenge as a binary classification problem, for example, labelling the input source as either real or fake. However, the quality and sophistication of deepfakes that are seen in the wild pose a far greater challenge than before. Thus, the ability to understand how a model is reasoning with its input becomes more and more important. This is core to TruFor, where an integrity score is supplemented by an anomaly and confidence map to provide the user with enough evidence to make an informed decision [62]. Similarly, it is proposed to incorporate a graph Transformer relevancy map using the output activation map [72]. There, the relevancy map is designed to provide improved transparency on the subtle details that the model believes are manipulated while ignoring irrelevant information. Based on this concept of output activation maps, Grad-CAM is used as part of model evaluation, as it can provide a useful visual aid to highlight the regions within the source image that the model has deemed important [67].
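A bare-bones sketch of the Grad-CAM computation referenced above is shown next; the two-layer CNN stands in for any detector, and in practice the hooks would be attached to the last convolutional layer of a trained model.

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
model = nn.Sequential(conv, nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(8, 2))

activations, gradients = {}, {}
conv.register_forward_hook(lambda m, i, o: activations.update(a=o))
conv.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))

image = torch.randn(1, 3, 224, 224)    # placeholder input frame
logits = model(image)
logits[0, logits.argmax()].backward()  # gradient of the predicted class

# Channel weights are the spatially averaged gradients; the heat map is
# the ReLU of the weighted sum of activations, highlighting the regions
# the model deemed important.
weights = gradients["g"].mean(dim=(2, 3), keepdim=True)
cam = torch.relu((weights * activations["a"]).sum(dim=1)).squeeze(0)
cam = cam / (cam.max() + 1e-8)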
The architecture and training datasets need to be factored into the model explainability, as this will influence performance. In [104], the authors establish a list of factors that influence a model's decision by measuring the attributes of gender, race and affective state (smiling vs. non-smiling) from samples obtained from the StyleGAN2 [93] algorithm and the FFHQ [79] dataset. Tested on various architectures, it was observed that the ViT exhibited a stronger bias towards females with lighter skin tones using data from the StyleGAN2 dataset, compared with males with darker skin tones from the FFHQ dataset [104]. As reported by [105], rapid growth in the field of Explainable Artificial Intelligence (XAI) has led to a surge in research dedicated to improving AI transparency. Simply put, addressing concerns over accountability and trustworthiness, while ensuring ethical considerations are accounted for, will become a necessity in deployable AI in the future. Local Interpretable Model-Agnostic Explanations and Shapley Additive Explanations [106] are two examples of existing techniques designed to provide model interpretability. Incorporating XAI into the design of deepfake detectors would not only enhance model transparency but also aid the process of fine-tuning and optimisation.
D. EVALUATION
An important study describes the necessity of using evaluation metrics to not only understand how a model is making predictions but also to observe if under- or over-fitting is occurring based on the training data used [51]. Indeed, there are a number of metrics available for DL that can help identify issues with the model or can be used in further fine-tuning. Although the widely accepted metrics include accuracy, AUC, precision, recall and F1-score, the review in Section V highlights that the majority of the studies only evaluate their models using accuracy and AUC. Furthermore, fewer than twenty percent consider precision, recall and F1-score as part of their evaluation. Still, accuracy provides a valuable measurement for calculating the number of true positive and true negative matches, which is important for determining how well the model is classifying the given dataset [60]. In addition, AUC can be used to determine how well the model is able to discriminate between the classes that it was trained with. Less common metrics may also be considered; they include Probability of Detection [49], True Detection [67] and the Equal Error Rate Metric (EERM) [56], [59], [60], [74]. EERM is of particular interest as it measures the rate at which the model is likely to misclassify [60]. Using a threshold value, the optimised position is when the False Acceptance Rate and False Rejection Rate are equal. Finally, although over seventy percent of the papers apply evaluation to both inter- and intra-dataset comparison, some studies limit their evaluation to inter-dataset only [54]. One should also note that a multi-modal comparison is performed in one study [49].
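As a worked illustration of these metrics, including the equal-error-rate computation, consider the following sketch with purely illustrative toy labels and scores (1 = fake, 0 = real).

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, roc_curve)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6])
y_pred = (y_score >= 0.5).astype(int)  # hard decisions at a 0.5 threshold

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))

# Equal Error Rate: sweep thresholds and find where the false acceptance
# rate (FPR) and the false rejection rate (1 - TPR) are closest to equal.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
eer_index = np.argmin(np.abs(fpr - (1 - tpr)))
print("EER      :", (fpr[eer_index] + (1 - tpr[eer_index])) / 2)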
VIII. CONCLUSION
This paper presents a comprehensive review of literature relating to the field of deepfake detection. The aim is to provide the reader with the latest research on the architectures, detection methods and datasets currently used in the field while recognising and analysing their strengths and weaknesses. To conclude, Table 4 presents an overview of the original challenge themes defined in Table 1.

APPENDIX
See Figure 5.
REFERENCES
[1] I. Sample. (Jan. 13, 2020). What Are Deepfakes and How Can You Spot Them. The Guardian. Accessed: Sep. 22, 2022. [Online]. Available: https://www.theguardian.com/technology/2020/jan/13/what-are-deepfakes-and-how-can-you-spot-them
[2] Facelab. Accessed: Oct. 18, 2022. [Online]. Available: https://facelab.mobi/
[3] FaceApp: Face Editor. Accessed: Oct. 18, 2022. [Online]. Available: https://faceapp.com/
[4] M. Boháček and H. Farid, ‘‘Protecting president Zelenskyy against deep fakes,’’ 2022, arXiv:2206.12043.
[5] Publications—Dimensions. Accessed: Nov. 10, 2023. [Online]. Available: https://app.dimensions.ai/discover/publication
[6] J. F. O’Brien and H. Farid, ‘‘Exposing photo manipulation with inconsistent reflections,’’ ACM Trans. Graph., vol. 31, no. 1, pp. 1–11, Feb. 2012, doi: 10.1145/2077341.2077345.
[7] M. K. Johnson and H. Farid, ‘‘Exposing digital forgeries in complex lighting environments,’’ IEEE Trans. Inf. Forensics Security, vol. 2, no. 3, pp. 450–461, Sep. 2007, doi: 10.1109/TIFS.2007.903848.
[8] W. Wu, W. Zhou, W. Zhang, H. Fang, and N. Yu, ‘‘Capturing the lighting inconsistency for Deepfake detection,’’ in Artificial Intelligence and Security (Lecture Notes in Computer Science), X. Sun, X. Zhang, Z. Xia, and E. Bertino, Eds., Cham, Switzerland: Springer, 2022, pp. 637–647, doi: 10.1007/978-3-031-06788-4_52.
[9] C. Zhu, B. Zhang, Q. Yin, C. Yin, and W. Lu, ‘‘Deepfake detection via inter-frame inconsistency recomposition and enhancement,’’ Pattern Recognit., vol. 147, Mar. 2024, Art. no. 110077, doi: 10.1016/j.patcog.2023.110077.
[10] J. H. Bappy, A. K. Roy-Chowdhury, J. Bunk, L. Nataraj, and B. S. Manjunath, ‘‘Exploiting spatial structure for localizing manipulated image regions,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4980–4989, doi: 10.1109/ICCV.2017.532.
[11] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo, ‘‘Face X-ray for more general face forgery detection,’’ 2019, arXiv:1912.13458.
[12] D. Cozzolino and L. Verdoliva, ‘‘Noiseprint: A CNN-based camera model fingerprint,’’ 2018, arXiv:1808.08396.
[13] M. M. Taye, ‘‘Understanding of machine learning with deep learning: Architectures, workflow, applications and future directions,’’ Computers, vol. 12, no. 5, p. 91, Apr. 2023, doi: 10.3390/computers12050091.
[14] A. M. Almars, ‘‘Deepfakes detection techniques using deep learning: A survey,’’ J. Comput. Commun., vol. 9, no. 5, pp. 20–35, May 2021, doi: 10.4236/jcc.2021.95003.
[15] S. Tyagi and D. Yadav, ‘‘A detailed analysis of image and video forgery detection techniques,’’ Vis. Comput., vol. 39, no. 3, pp. 813–833, Mar. 2023, doi: 10.1007/s00371-021-02347-4.
[16] K. Patil, S. Kale, J. Dhokey, and A. Gulhane, ‘‘Deepfake detection using biological features: A survey,’’ 2023, arXiv:2301.05819.
[17] M. Masood, M. Nawaz, K. M. Malik, A. Javed, A. Irtaza, and H. Malik, ‘‘Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward,’’ Appl. Intell., vol. 53, no. 4, pp. 3974–4026, Feb. 2023, doi: 10.1007/s10489-022-03766-z.
[18] M. Zanardelli, F. Guerrini, R. Leonardi, and N. Adami, ‘‘Image forgery detection: A survey of recent deep-learning approaches,’’ Multimedia Tools Appl., vol. 82, no. 12, pp. 17521–17566, May 2023, doi: 10.1007/s11042-022-13797-w.
[19] L. Stroebel, M. Llewellyn, T. Hartley, T. S. Ip, and M. Ahmed, ‘‘A systematic literature review on the effectiveness of deepfake detection techniques,’’ J. Cyber Secur. Technol., vol. 7, no. 2, pp. 83–113, Apr. 2023, doi: 10.1080/23742917.2023.2192888.
[20] T. T. Nguyen, Q. V. H. Nguyen, D. T. Nguyen, D. T. Nguyen, T. Huynh-The, S. Nahavandi, T. T. Nguyen, Q.-V. Pham, and C. M. Nguyen, ‘‘Deep learning for deepfakes creation and detection: A survey,’’ Comput. Vis. Image Understand., vol. 223, Oct. 2022, Art. no. 103525, doi: 10.1016/j.cviu.2022.103525.
[21] J. W. Seow, M. K. Lim, R. C. W. Phan, and J. K. Liu, ‘‘A comprehensive overview of deepfake: Generation, detection, datasets, and opportunities,’’ Neurocomputing, vol. 513, pp. 351–371, Nov. 2022, doi: 10.1016/j.neucom.2022.09.135.
[22] D. Dagar and D. K. Vishwakarma, ‘‘A literature review and perspectives in deepfakes: Generation, detection, and applications,’’ Int. J. Multimedia Inf. Retr., vol. 11, no. 3, pp. 219–289, Sep. 2022, doi: 10.1007/s13735-022-00241-w.
[23] F. Juefei-Xu, R. Wang, Y. Huang, Q. Guo, L. Ma, and Y. Liu, ‘‘Countering malicious DeepFakes: Survey, battleground, and horizon,’’ Int. J. Comput. Vis., vol. 130, no. 7, pp. 1678–1734, Jul. 2022, doi: 10.1007/s11263-022-01606-8.
[24] M. S. Rana, M. N. Nobi, B. Murali, and A. H. Sung, ‘‘Deepfake detection: A systematic literature review,’’ IEEE Access, vol. 10, pp. 25494–25513, 2022, doi: 10.1109/ACCESS.2022.3154404.
[25] A. Naitali, M. Ridouani, F. Salahdine, and N. Kaabouch, ‘‘Deepfake attacks: Generation, detection, datasets, challenges, and research directions,’’ Computers, vol. 12, no. 10, p. 216, Oct. 2023, doi: 10.3390/computers12100216.
[26] B. Zi, M. Chang, J. Chen, X. Ma, and Y.-G. Jiang, ‘‘WildDeepfake: A challenging real-world dataset for deepfake detection,’’ 2021, arXiv:2101.01456.
[27] F. Hong, Z. Chen, Y. Lan, L. Pan, and Z. Liu, ‘‘EVA3D: Compositional 3D human generation from 2D image collections,’’ 2022, arXiv:2210.04888.
[28] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, ‘‘Backpropagation applied to handwritten zip code recognition,’’ Neural Comput., vol. 1, no. 4, pp. 541–551, Dec. 1989, doi: 10.1162/neco.1989.1.4.541.
[29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learning applied to document recognition,’’ Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[30] G. E. Hinton, A. Krizhevsky, and S. D. Wang, ‘‘Transforming auto-encoders,’’ in Artificial Neural Networks and Machine Learning—ICANN 2011 (Lecture Notes in Computer Science), T. Honkela, W. Duch, M. Girolami, and S. Kaski, Eds., Berlin, Germany: Springer, 2011, pp. 44–51, doi: 10.1007/978-3-642-21735-7_6.
[31] M. K. Patrick, A. F. Adekoya, A. A. Mighty, and B. Y. Edward, ‘‘Capsule networks—A survey,’’ J. King Saud Univ. Comput. Inf. Sci., vol. 34, no. 1, pp. 1295–1310, Jan. 2022, doi: 10.1016/j.jksuci.2019.09.014.
[32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, ‘‘Going deeper with convolutions,’’ 2014, arXiv:1409.4842.
[33] F. Chollet, ‘‘Xception: Deep learning with depthwise separable convolutions,’’ 2016, arXiv:1610.02357.
[34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, ‘‘Rethinking the inception architecture for computer vision,’’ 2015, arXiv:1512.00567.
[35] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, ‘‘Densely connected convolutional networks,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017. Accessed: Aug. 2, 2023. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2017/html/Huang_Densely_Connected_Convolutional_CVPR_2017_paper.html
[36] A. Vaswani, ‘‘Attention is all you need,’’ 2017, arXiv:1706.03762.
[37] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, ‘‘An image is worth 16x16 words: Transformers for image recognition at scale,’’ 2020, arXiv:2010.11929.
[38] R. Gozalo-Brizuela and E. C. Garrido-Merchán, ‘‘A survey of generative AI applications,’’ 2023, arXiv:2306.02781.
[39] J. Betker et al., ‘‘Improving image generation with better captions,’’ Comput. Sci., vol. 2, no. 3, p. 8, 2023. [Online]. Available: https://cdn.openai.com/papers/dall-e-3.pdf
[40] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, ‘‘Photorealistic text-to-image diffusion models with deep language understanding,’’ 2022, arXiv:2205.11487.
[41] Stability AI. Stable Diffusion 2.0 Release. Accessed: Jan. 14, 2024. [Online]. Available: https://stability.ai/news/stable-diffusion-v2-release
[42] S. Feuerriegel, J. Hartmann, C. Janiesch, and P. Zschech, ‘‘Generative AI,’’ Bus. Inf. Syst. Eng., vol. 66, no. 1, pp. 111–126, Sep. 2023, doi: 10.1007/s12599-023-00834-7.
[43] I. Goodfellow, ‘‘Generative adversarial nets,’’ in Proc. Adv. Neural Inf. Process. Syst., Red Hook, NY, USA: Curran Associates, 2014, pp. 1–11. Accessed: Nov. 17, 2023. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html
[44] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, ‘‘Diffusion models: A comprehensive survey of methods and applications,’’ 2022, arXiv:2209.00796.
[45] J. Ho, A. Jain, and P. Abbeel, ‘‘Denoising diffusion probabilistic models,’’ 2020, arXiv:2006.11239.
[46] Y. Song and S. Ermon, ‘‘Improved techniques for training score-based generative models,’’ 2020, arXiv:2006.09011.
[47] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, ‘‘Score-based generative modeling through stochastic differential equations,’’ 2020, arXiv:2011.13456.
[48] A. Birrer and N. Just, ‘‘What we know and don’t know about deepfakes: An investigation into the state of the research and regulatory landscape,’’ New Media Soc., May 2024, doi: 10.1177/14614448241253138.
[49] D. Cozzolino, A. Pianese, M. Nießner, and L. Verdoliva, ‘‘Audio-visual person-of-interest DeepFake detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2023, pp. 943–952, doi: 10.1109/CVPRW59228.2023.00101.
[50] T. Wang, H. Cheng, K. P. Chow, and L. Nie, ‘‘Deep convolutional pooling transformer for deepfake detection,’’ ACM Trans. Multimedia Comput., Commun., Appl., vol. 19, no. 6, pp. 1–20, May 2023, doi: 10.1145/3588574.
[51] R. Rafique, R. Gantassi, R. Amin, J. Frnda, A. Mustapha, and A. H. Alshehri, ‘‘Deep fake detection and classification using error-level analysis and deep learning,’’ Sci. Rep., vol. 13, no. 1, p. 7422, May 2023, doi: 10.1038/s41598-023-34629-3.
[52] Y.-J. Heo, W.-H. Yeo, and B.-G. Kim, ‘‘DeepFake detection algorithm based on improved vision transformer,’’ Appl. Intell., vol. 53, no. 7, pp. 7512–7527, Apr. 2023, doi: 10.1007/s10489-022-03867-9.
[53] M. Soleimani, A. Nazari, and M. E. Moghaddam, ‘‘Deepfake detection of occluded images using a patch-based approach,’’ Multimedia Syst., vol. 29, no. 5, pp. 2669–2687, Oct. 2023, doi: 10.1007/s00530-023-01140-8.
[54] S. R. A. Ahmed and E. Sonuç, ‘‘Deepfake detection using rationale-augmented convolutional neural network,’’ Appl. Nanoscience, vol. 13, no. 2, pp. 1485–1493, Feb. 2023, doi: 10.1007/s13204-021-02072-3.
[55] T. Lu, Y. Bao, and L. Li, ‘‘Deepfake video detection based on improved CapsNet and temporal–spatial features,’’ Comput., Mater. Continua, vol. 75, no. 1, pp. 715–740, 2023, doi: 10.32604/cmc.2023.034963.
[56] R. Shao, T. Wu, L. Nie, and Z. Liu, ‘‘DeepFake-adapter: Dual-level adapter for DeepFake detection,’’ 2023, arXiv:2306.00863.
[57] F. Khalid, A. Javed, Q.-U. Ain, H. Ilyas, and A. Irtaza, ‘‘DFGNN: An interpretable and generalized graph neural network for deepfakes detection,’’ Expert Syst. Appl., vol. 222, Jul. 2023, Art. no. 119843, doi: 10.1016/j.eswa.2023.119843.
[58] J. Ke and L. Wang, ‘‘DF-UDetector: An effective method towards robust deepfake detection via feature restoration,’’ Neural Netw., vol. 160, pp. 216–226, Mar. 2023, doi: 10.1016/j.neunet.2023.01.001.
[59] Y. Wang, K. Yu, C. Chen, X. Hu, and S. Peng, ‘‘Dynamic graph learning with content-guided spatial-frequency relation reasoning for deepfake detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 7278–7287. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2023/html/Wang_Dynamic_Graph_Learning_With_Content-Guided_Spatial-Frequency_Relation_Reasoning_for_Deepfake_CVPR_2023_paper.html
[60] H. Ilyas, A. Javed, K. M. Malik, and A. Irtaza, ‘‘E-cap net: An efficient-capsule network for shallow and deepfakes forgery detection,’’ Multimedia Syst., vol. 29, no. 4, pp. 2165–2180, Aug. 2023, doi: 10.1007/s00530-023-01092-z.
[61] S. Usmani, S. Kumar, and D. Sadhya, ‘‘Efficient deepfake detection using shallow vision transformer,’’ Multimedia Tools Appl., vol. 83, no. 4, pp. 12339–12362, Jun. 2023, doi: 10.1007/s11042-023-15910-z.
[62] F. Guillaro, D. Cozzolino, A. Sud, N. Dufour, and L. Verdoliva, ‘‘TruFor: Leveraging all-round clues for trustworthy image forgery detection and localization,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 20606–20615, doi: 10.1109/CVPR52729.2023.01974.
[63] Y. Xu, J. Liang, G. Jia, Z. Yang, Y. Zhang, and R. He, ‘‘TALL: Thumbnail layout for deepfake video detection,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 22658–22668. [Online]. Available: https://openaccess.thecvf.com/content/ICCV2023/html/Xu_TALL_Thumbnail_Layout_for_Deepfake_Video_Detection_ICCV_2023_paper.html
[64] J. Wang, X. Du, Y. Cheng, Y. Sun, and J. Tang, ‘‘Si-Net: Spatial interaction network for deepfake detection,’’ Multimedia Syst., vol. 29, no. 5, pp. 3139–3150, Jul. 2023, doi: 10.1007/s00530-023-01114-w.
[65] M. A. Raza, K. M. Malik, and I. Ul Haq, ‘‘HolisticDFD: Infusing spatiotemporal transformer embeddings for deepfake detection,’’ Inf. Sci., vol. 645, Oct. 2023, Art. no. 119352, doi: 10.1016/j.ins.2023.119352.
[66] C. Zhao, C. Wang, G. Hu, H. Chen, C. Liu, and J. Tang, ‘‘ISTVT: Interpretable spatial-temporal video transformer for deepfake detection,’’ IEEE Trans. Inf. Forensics Security, vol. 18, pp. 1335–1348, 2023, doi: 10.1109/TIFS.2023.3239223.
[67] C. Tian, Z. Luo, G. Shi, and S. Li, ‘‘Frequency-aware attentional feature fusion for deepfake detection,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Jun. 2023, pp. 1–5, doi: 10.1109/ICASSP49357.2023.10094654.
[68] W. Yang, X. Zhou, Z. Chen, B. Guo, Z. Ba, Z. Xia, X. Cao, and K. Ren, ‘‘AVoiD-DF: Audio-visual joint learning for detecting deepfake,’’ IEEE Trans. Inf. Forensics Security, vol. 18, pp. 2015–2029, 2023, doi: 10.1109/TIFS.2023.3262148.
[69] B. Liang, Z. Wang, B. Huang, Q. Zou, Q. Wang, and J. Liang, ‘‘Depth map guided triplet network for deepfake face detection,’’ Neural Netw., vol. 159, pp. 34–42, Feb. 2023, doi: 10.1016/j.neunet.2022.11.031.
[70] X. Li, R. Ni, P. Yang, Z. Fu, and Y. Zhao, ‘‘Artifacts-disentangled adversarial learning for deepfake detection,’’ IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 4, pp. 1658–1670, Apr. 2023, doi: 10.1109/TCSVT.2022.3217950.
[71] T. Wang and K. P. Chow, ‘‘Noise based deepfake detection via multi-head relative-interaction,’’ in Proc. AAAI Conf. Artif. Intell., vol. 37, no. 12, Jun. 2023, pp. 14548–14556, doi: 10.1609/aaai.v37i12.26701.
[72] A. Khormali and J.-S. Yuan, ‘‘Self-supervised graph transformer for deepfake detection,’’ 2023, arXiv:2307.15019.
[73] G. Yang, A. Wei, X. Fang, and J. Zhang, ‘‘FDS_2D: Rethinking magnitude-phase features for DeepFake detection,’’ Multimedia Syst., vol. 29, no. 4, pp. 2399–2413, Jun. 2023, doi: 10.1007/s00530-023-01118-6.
[74] B. Huang, Z. Wang, J. Yang, J. Ai, Q. Zou, Q. Wang, and D. Ye, ‘‘Implicit identity driven deepfake face swapping detection,’’ presented at the IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2023/html/Huang_Implicit_Identity_Driven_Deepfake_Face_Swapping_Detection_CVPR_2023_paper.html
[75] S. Mundra, G. J. Aniano Porcile, S. Marvaniya, J. R. Verbus, and H. Farid, ‘‘Exposing GAN-generated profile photos from compact embeddings,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2023, pp. 884–892, doi: 10.1109/CVPRW59228.2023.00095.
[76] G. Pang, B. Zhang, Z. Teng, Z. Qi, and J. Fan, ‘‘MRE-Net: Multi-rate excitation network for deepfake video detection,’’ IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 8, pp. 3663–3676, Aug. 2023, doi: 10.1109/TCSVT.2023.3239607.
[77] S. Dong, J. Wang, R. Ji, J. Liang, H. Fan, and Z. Ge, ‘‘Implicit identity leakage: The stumbling block to improving deepfake detection generalization,’’ presented at the IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2023/html/Dong_Implicit_Identity_Leakage_The_Stumbling_Block_to_Improving_Deepfake_Detection_CVPR_2023_paper.html
[78] Z. Liu, X. Qi, and P. Torr, ‘‘Global texture enhancement for fake face detection in the wild,’’ 2020, arXiv:2002.00133.
[79] T. Karras, S. Laine, and T. Aila, ‘‘A style-based generator architecture for generative adversarial networks,’’ 2018, arXiv:1812.04948.
[80] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, ‘‘Celeb-DF: A large-scale challenging dataset for DeepFake forensics,’’ 2019, arXiv:1909.12962.
[81] G. B. Huang, M. Ramesh, T. Berg, and E. L. Miller, ‘‘Labeled faces in the wild: A database for studying face recognition in unconstrained environments,’’ Univ. Massachusetts, Amherst, MA, USA, Tech. Rep., Oct. 2007.
[82] M. Edwards and X. Xie, ‘‘Graph based convolutional neural network,’’ 2016, arXiv:1609.08965.
[83] H. Chefer, S. Gur, and L. Wolf, ‘‘Transformer interpretability beyond attention visualization,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 782–791, doi: 10.1109/CVPR46437.2021.00084.
[84] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, ‘‘FaceForensics++: Learning to detect manipulated facial images,’’ 2019, arXiv:1901.08971.
[85] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, ‘‘FaceForensics: A large-scale video dataset for forgery detection in human faces,’’ 2018, arXiv:1803.09179.
[86] L. Jiang, R. Li, W. Wu, C. Qian, and C. C. Loy, ‘‘DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection,’’ 2020, arXiv:2001.03024.
[87] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. C. Ferrer, ‘‘The DeepFake detection challenge (DFDC) dataset,’’ 2020, arXiv:2006.07397.
[88] B. Dolhansky, R. Howes, B. Pflaum, N. Baram, and C. C. Ferrer, ‘‘The deepfake detection challenge (DFDC) preview dataset,’’ 2019, arXiv:1910.08854.
[89] M. F. Sohan, M. Solaiman, and M. A. Hasan, ‘‘A survey on deepfake video detection datasets,’’ Indonesian J. Electr. Eng. Comput. Sci., vol. 32, no. 2, p. 1168, Nov. 2023, doi: 10.11591/ijeecs.v32.i2.pp1168-1176.
[90] P. Kwon, J. You, G. Nam, S. Park, and G. Chae, ‘‘KoDF: A large-scale Korean DeepFake detection dataset,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10724–10733, doi: 10.1109/ICCV48922.2021.01057.
[91] A. V. Nadimpalli and A. Rattani, ‘‘GBDF: Gender balanced DeepFake dataset towards fair DeepFake detection,’’ 2022, arXiv:2207.10246.
[92] S. Mathews, S. Trivedi, A. House, S. Povolny, and C. Fralick, ‘‘An explainable deepfake detection framework on a novel unconstrained dataset,’’ Complex Intell. Syst., vol. 9, no. 4, pp. 4425–4437, Aug. 2023, doi: 10.1007/s40747-022-00956-7.
[93] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, ‘‘Analyzing and improving the image quality of StyleGAN,’’ 2019, arXiv:1912.04958.
[94] R. K. E. Bellamy, K. Dey, M. Hind, S. C. Hoffman, S. Houde, K. Kannan, P. Lohia, J. Martino, S. Mehta, A. Mojsilovic, S. Nagar, K. N. Ramamurthy, J. Richards, D. Saha, P. Sattigeri, M. Singh, K. R. Varshney, and Y. Zhang, ‘‘AI fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias,’’ IBM J. Res. Develop., vol. 63, nos. 4–5, pp. 4:1–4:15, Jul./Sep. 2019, doi: 10.1147/JRD.2019.2942287.
[95] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, and T. Aila, ‘‘Alias-free generative adversarial networks,’’ 2021, arXiv:2106.12423.
[96] (Jan. 22, 2024). Fake Biden Robocall Tells Voters to Skip New Hampshire Primary Election. BBC News. Accessed: Oct. 5, 2024. [Online]. Available: https://www.bbc.com/news/world-us-canada-68064247
[97] (Jan. 26, 2024). Taylor Swift Deepfakes Spark Calls in Congress for New Legislation. BBC News. Accessed: Oct. 5, 2024. [Online]. Available: https://www.bbc.com/news/technology-68110476
[98] (Apr. 26, 2024). Baltimore High School Teacher Arrested Over Deepfake Racist Audio of Principal. BBC News. Accessed: Oct. 5, 2024. [Online]. Available: https://www.bbc.com/news/world-us-canada-68907895
[99] K. Peng, A. Mathur, and A. Narayanan, ‘‘Mitigating dataset harms requires stewardship: Lessons from 1000 papers,’’ 2021, arXiv:2108.02922.
[100] N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V. Sehwag, F. Tramèr, B. Balle, D. Ippolito, and E. Wallace, ‘‘Extracting training data from diffusion models,’’ 2023, arXiv:2301.13188.
[101] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, ‘‘LAION-5B: An open large-scale dataset for training next generation image-text models,’’ 2022, arXiv:2210.08402.
[102] M. Growcoot. Shocked Artist Finds Private Medical Photos in AI Training Data Set. PetaPixel. Accessed: Nov. 20, 2023. [Online]. Available: https://petapixel.com/2022/09/26/shocked-artist-finds-private-medical-photos-in-ai-training-data-set/
[103] B. Le, S. Tariq, A. Abuadbba, K. Moore, and S. Woo, ‘‘Why do deepfake detectors fail?’’ 2023, arXiv:2302.13156.
[104] M. P. Gangan, A. Kadan, and L. V L, ‘‘Exploring fairness in pre-trained visual transformer based natural and GAN generated image detection systems and understanding the impact of image compression in fairness,’’ 2023, arXiv:2310.12076.
[105] G. P. Reddy and Y. V. P. Kumar, ‘‘Explainable AI (XAI): Explained,’’ in Proc. IEEE Open Conf. Electr., Electron. Inf. Sci. (eStream), Apr. 2023, pp. 1–6, doi: 10.1109/eStream59056.2023.10134984.
[106] S. Lundberg and S.-I. Lee, ‘‘A unified approach to interpreting model predictions,’’ 2017, arXiv:1705.07874.

PETER EDWARDS received the B.Sc. degree (Hons.) in computer science from Brunel University, Uxbridge, U.K., in 2005, and the M.Sc. degree in advanced computing and digital forensics from Edinburgh Napier University, Edinburgh, U.K., in 2020. He is currently pursuing the Ph.D. degree with the School of Computer Science and Mathematics, Kingston University, Kingston, U.K. His interests include deep learning, media forensics, and data analytics.
JEAN-CHRISTOPHE NEBEL (Senior Member, IEEE) received the M.Sc.(Eng.) degree in electronics and signal processing from the Institute of Chemistry and Industrial Physics, Lyon, France, in 1992, and the Ph.D. degree in parallel programming from the University of St Etienne, France, in 1997. From 1997 to 2004, he was a Postdoctoral Research Associate/Fellow with the Computing Science Department, The University of Glasgow, U.K. Since 2004, he has been a permanent Academic with Kingston University London. Currently, he is a Full Professor of computer science with the Faculty of Engineering, Computing and the Environment, where he leads research in AI and machine learning applied to a variety of topics, including computer vision, bioinformatics, cybersecurity, renewable energy, and ecology. Prof. Nebel was awarded the A. H. Reeve Premium by the Council of the Institute of Electrical and Electronics Engineers for a journal article describing his and the co-authors' pioneering work in developing a 3D Dynamic Whole Body Measurement System, in 2004.

XING LIANG (Member, IEEE) received the Ph.D. degree in mobile satellite communications from the University of Bradford, in 2007. She is currently a Senior Lecturer with the School of Computer Science and Mathematics, Kingston University London. Over the years, she has actively contributed to multiple EU and U.K. research projects in AI and telecommunications, including H2020 ENSURESEC, H2020 EUNOMIA, EPSRC CONCORD, IST FIFTH, IST SatNEx, and the Dunhill Medical Trust funded ATDA-BSL, as well as several industry-related research projects. Her current research interests include computer vision, deep learning, generative AI, and quantum machine learning, with applications in cyber-physical system security, social media security, healthcare, the IoT, and business intelligence.