Deepfake Generation and Detection: A Benchmark and Survey

Abstract Deepfake is a technology dedicated to creating highly realistic facial images and videos under specific conditions, which has significant application potential in fields such as entertainment, movie production, and digital human creation, to name a few. With the advancements in deep learning, techniques primarily represented by Variational Autoencoders and Generative Adversarial Networks have achieved impressive generation results. More recently, the emergence of diffusion models with powerful generation capabilities has sparked a renewed wave of research. In addition to deepfake generation, corresponding detection technologies continuously evolve to regulate the potential misuse of deepfakes, such as privacy invasion and phishing attacks. This survey comprehensively reviews the latest developments in deepfake generation and detection, summarizing and analyzing the current state of the art in this rapidly evolving field. First, we unify task definitions, comprehensively introduce datasets and metrics, and discuss the development of related technologies. Then, we discuss several related sub-fields and focus on four representative deepfake fields: face swapping, face reenactment, talking face generation, and facial attribute editing, as well as forgery detection. Subsequently, we comprehensively benchmark representative methods on popular datasets for each field, fully evaluating the latest and most influential published works. Finally, we analyze the challenges and future research directions of the discussed fields. We closely follow the latest developments in this project.

Keywords Deepfake Generation · Face Swapping · Face Reenactment · Talking Face Generation · Facial Attribute Editing · Forgery Detection · Survey

Gan Pei and Jiangning Zhang contribute equally.
1 East China Normal University, Shanghai, China.
2 Youtu Lab, Tencent, Shanghai, China.
3 Nanjing University, Suzhou, China.
4 Shanghai Jiao Tong University, Shanghai, China.
5 Zhejiang University, Hangzhou, China.
6 The University of Sydney, Sydney, Australia.

1 Introduction

Artificial Intelligence Generated Content (AIGC) garners considerable attention [227] in academia and industry. Deepfake generation, as one of the important technologies in the generative domain, gains significant attention due to its ability to create highly realistic facial media content. This technique has transitioned from traditional graphics-based methods to deep learning-based approaches. Early methods employ advanced Variational Autoencoder (VAE) [136, 239, 255] and Generative Adversarial Network (GAN) [125, 126] techniques, enabling seemingly realistic image generation, but their performance is still unsatisfactory, which limits practical applications. Recently, the diffusion architecture [15, 77, 166] has greatly enhanced the generation capability of images and videos. Benefiting from this new wave of research, deepfake technology demonstrates potential value for practical applications and can generate content indistinguishable from real ones, which has further attracted attention and is widely applied in numerous fields [11], including entertainment, movie production, online live broadcasting, and privacy protection.

Deepfake generation can generally be divided into four mainstream research fields: 1) Face swapping [6, 236, 283] is dedicated to executing identity exchanges between two person images; 2) Face reenactment [17, 97, 140] emphasizes transferring source movements and poses; 3) Talking face generation [179, 189, 326] focuses on achieving natural matching of mouth movements to textual content in character generation; and 4) Facial attribute editing [105, 210, 241] aims to modify specific facial attributes of the target image.
Fig. 1: Time diagram that reflects the survey pipeline. Zoom in for a better holistic perception of this work.
The development of related foundational technologies has gradually shifted from single-forward GAN models [69, 125] to multi-step diffusion models [15, 93, 223] with higher-quality generation capabilities, and the generated content has also gradually transitioned from single-frame images to temporal video modeling [77]. In addition, NeRF [66, 187] has frequently been incorporated into modeling to improve multi-view consistency [115, 330].

While enjoying the novelty and convenience of this technology, its unethical use raises concerns over privacy invasion and the dissemination of fake news, necessitating the development of effective forgery detection. From the earliest handcrafted feature-based methods [90, 353] to deep learning-based methods [304, 308], and the recent hybrid detection techniques [109], forgery detection has undergone substantial technological advancements along with the development of generative technologies. The data modality has also transitioned from the spatial and frequency domains [150, 219] to the more challenging temporal domain [84, 301]. Considering that current generative technologies attract greater interest, develop faster, and can generate content indistinguishable from reality [190], the corresponding detection technologies need continuous evolution.

Overall, despite the significant progress made in both directions, they still exhibit challenging limitations in specific scenarios [257], mainly reflected in the visual perceptual authenticity and generative accuracy of models. This has attracted a large number of researchers to continue their efforts and has sparked thoughts on industrial applications. Existing survey works only focus on partial deepfake fields and lack discussions on new technologies [169, 190, 227], especially diffusion-based image/video generation methods, due to their disconnection from current technologies. This survey comprehensively discusses these fields as well as related sub-fields, and covers tracking of the latest works.

• Contribution. In this survey, we comprehensively explore the key technologies and latest advancements in deepfake generation and forgery detection. We first unify the task definitions (Sec. 2.1), provide a comprehensive comparison of datasets and metrics (Sec. 2.3), and discuss the development of related technologies. Specifically, we investigate four mainstream deepfake fields: face swapping (Sec. 3.1.1), face reenactment (Sec. 3.1.2), talking face generation (Sec. 3.1.3), and facial attribute editing (mainly on multiple editing) (Sec. 3.1.4), as well as forgery detection (Sec. 3.2). We also analyze the benchmarks and settings for each domain, thoroughly evaluating the latest and influential published works, including recent diffusion-based approaches. Additionally, we discuss closely related fields, including head swapping, face super-resolution, face reconstruction, face inpainting, body animation, portrait style transfer, makeup transfer, and adversarial sample detection. Benefiting from the current popularity of AIGC, the research iteration cycle in the deepfake field has been significantly reduced, and we keep updating and discussing in-submission works in the revised version.

• Scope. This survey primarily focuses on mainstream face-related tasks, including face swapping, face reenactment, talking face generation, facial (multiple) attribute editing, and forgery detection. We also cover some related domain tasks in Sec. 2.4 and detail specific popular sub-tasks in Sec. 3.3. Considering the large number of articles (including published works and preprints), we mainly include representative and attention-grabbing works. In addition, we compare this investigation with recent surveys. Sha et al. [227] only discuss character generation while we cover a more comprehensive range of tasks. Compared to works [169, 184, 190], our study encompasses a broader range of technical models, particularly the more powerful diffusion-based methods.
Fig. 2: Top: Illustration of different deepfake generation (Sec. 3.1) and detection tasks (Sec. 3.2) that are discussed in
this survey. Bottom: Specific facial attribute modification of each task. Data from NVIDIA Keynote at COMPUTEX
2023 at 29:40.
Fig. 3: Development timeline of three mainstream generative models, i.e., VAE, GAN, and Diffusion.
Fig. 4: Works summarization on different directions per year. Data is obtained on 2024/05/13.
facial attributes, under the conditions of driving image, video, head pose, etc. This technology often involves the support of facial motion capture technology, such as facial tracking or prediction based on deep learning models [346].

• Talking Face Generation can be viewed as an extension in time, aiming at generating a talking video $I_o = \{I_o^i\}, i = 0, 1, \cdots, N-1$, with the character in the target image $I_t$ engaging in a conversation based on an arbitrary driving source, such as text, audio, video, or a multi-modal composite source. The lip movements, facial poses, expressions, emotions, and spoken content of the character in the generated video match the target information.

• Facial Attribute Editing aims to modify the semantic information of the target face $I_t$ (e.g., personality, age, expressions, skin color, etc.) in a directed manner based on individual interest and preference. Existing methods include single and comprehensive attribute editing: the former focuses on training a model for only one attribute, while the latter integrates multiple attribute editing tasks simultaneously, which is our primary focus in this survey.

• Forgery Detection aims to detect and identify anomalies, tampering, or forged areas in images or videos via the anomaly score $S_o$, and it has great research and application value in information security and multimedia forensics.

2.2 Technological History and Roadmap

• Generative Framework. Variational Autoencoders (VAEs) [136, 239, 255], Generative Adversarial Networks (GANs) [69, 125, 126], and Diffusion models [93, 223, 238] have played pivotal roles in the developmental history of generative models.
1) VAE [136] emerges in 2013, altering the relationship between latent features and linear mappings in autoencoder latent spaces. It introduces feature distributions like the Gaussian distribution and achieves the generation of new entities through interpolation. To enhance generation capabilities under specific conditions, CVAE [239] introduces conditional input. VQ-VAE [255] introduces the concept of vector quantization to improve the learning of latent representations. Subsequent models continually advance.
2) GANs [69] achieve high-quality generation through adversarial training with an extra discriminator. Subsequently, research on GANs experiences a surge, and currently, implementations based on GANs remain the mainstream approach for various deepfake tasks. CGAN [191] introduces conditional control variables to GANs. Pix2Pix [111] enhances GAN performance in specific image translation tasks. StyleGAN [125] introduces the concept of style transfer to GANs. StyleGAN2 [126] further improves the quality and controllability of generated images. Additionally, some composite research combines GANs with VAEs, e.g., CVAE-GAN [9].
3) Diffusion models [238] model the generation of data as a diffusion process. DDPM [93] gains widespread attention for its outstanding generative performance, especially in handling large-scale, high-resolution images. LDM [223] is more flexible and powerful when it comes to modeling complex data distributions. In the field of video generation, diffusion models play a crucial role [81, 164, 177]. SVD [15] fine-tunes the base model using an image-to-video conversion task based on text-to-video models. AnimateDiff [77] attaches a newly initialized motion modeling module to a frozen text-to-image model and trains it on video clips to learn reasonable motion priors, thus achieving excellent generation results.
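For concreteness, the canonical training objectives of the three families can be written as follows; these are the standard textbook forms rather than the exact losses of any specific surveyed method. VAEs maximize the evidence lower bound,
$$\mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big),$$
GANs solve the minimax game
$$\min_G \max_D \; \mathbb{E}_{x\sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z\sim p(z)}[\log(1 - D(G(z)))],$$
and DDPM-style diffusion models minimize the noise-prediction objective
$$\mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\big)\big\|^2 .$$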
Table 1: Overview of commonly used datasets. Orange-marked ones are selected to evaluate different methods in
Sec. 4.
Dataset Type Scale Highlight
Deepfake Generation
LFW [103] Image 10K Facial images captured under various lighting conditions, poses, expressions, and occlusions at 250×250 resolution.
CelebA [173] Image 200K The dataset includes over 200K facial images from more than 10K individuals, each with 40 attribute labels.
VGGFace [211] Image 2600K A super-large-scale facial dataset involving a staggering 26K participants, encompassing a total of 2600K facial images.
VoxCeleb1 [198] Video 100K voices A large scale audio-visual dataset of human speech, the audio includes noisy background interference.
CelebA-HQ [124] Image 30K A high-resolution facial dataset consisting of 30K face images, each with a resolution of 1024×1024.
VGGFace2 [22] Image 3000K A large-scale facial dataset, expanded with greater diversity in terms of ethnicity and pose compared to VGGFace.
VoxCeleb2 [39] Video 2K hours A dataset that is five times larger in scale than VoxCeleb1 and improves racial diversity.
FFHQ [125] Image 70K The dataset with over 70K high-resolution (1024×1024) facial images, showcasing diverse ethnicity, age, and backgrounds.
MEAD [263] Video 40 hours An emotional audiovisual dataset provides facial expression information during conversations with various emotional labels.
CelebV-HQ [359] Video 35K videos The dataset’s video clips have a resolution of more than 512×512 and are annotated with rich attribute labels.
Forgery Detection
DeepfakeTIMIT [138] Audio-Video 640 videos The dataset is evenly divided into two versions: LQ (64×64) and HQ (128×128), with all videos using face swapping forgery.
FF++ [225] Video 6K videos Comprising 1K original videos and manipulated videos generated using five different forgery methods.
DFDCP [51] Audio-Video 5K videos The preliminary dataset for The Deepfake Detection Challenge includes two face-swapping methods.
DFD [48] Video 3K videos The dataset comprises 3K deepfake videos generated using five forgery methods.
Deeperforensics [117] Video 60K videos The videos feature faces with diverse skin tones, and rich environmental diversity was considered during the filming process.
DFDC [50] Audio-Video 128K videos The official dataset for The Deepfake Detection Challenge and contain a substantial amount of interference information.
Celeb-DF [158] Video 1K videos Comprising 408 genuine videos from diverse age groups, ethnicities, and genders, along with 795 DeepFake videos.
Celeb-DFv2 [158] Video 6K videos An expanded version of Celeb-DFv1, this dataset not only increases in quantity but also the diversity.
FakeAVCeleb [130] Audio-Video 20K videos A novel audio-visual multimodal deepfake detection dataset, deepfake videos generated using four forgery methods.
• Discriminative Neural Network. Convolutional Neural Networks (CNNs) [89, 141, 144, 174, 329, 331] have played a pivotal role in the history of deep learning. LeNet [144], as the pioneer of CNNs, showcased the charm of machine learning. AlexNet [141] and ResNet [89] made deep learning feasible. Recently, ConvNeXt [174] has achieved excellent results surpassing those of Swin-Transformer [172]. The Transformer architecture was initially proposed [256] in 2017. Its core idea involves using self-attention mechanisms to capture dependencies between different positions in the input sequence, enabling global modeling of sequences. ViT [53] demonstrates that using Transformers in the field of computer vision can still achieve excellent performance, and PVT [271] overcomes the challenges of adapting Transformers to various dense prediction tasks. In addition, Swin-Transformer [172] addresses the limitations of Transformers in handling high-resolution image processing tasks. Subsequently, Swin-Transformer V2 [170] further improves the model’s efficiency and the resolution of manageable inputs.

• Neural Radiance Field (NeRF) is first introduced in 2020 [187], with its core idea revolving around the use of volume rendering and implicit neural fields to represent and reconstruct both the geometric and illumination information of 3D scenes [66]. Compared to traditional 3D methods, it exhibits higher visual quality and is currently widely applied in tasks such as 3D geometry enhancement [110, 228], segmentation [168], and 6D pose estimation [188]. In addition, some notable works [108, 303] combining NeRF, as a supplement to 3D information, with generation models are particularly prominent at present.

• Work Summary. The evolution of mainstream generative models is depicted chronologically in Fig. 3. This survey delves into four categories of generation tasks along with the forgery detection task, and the publication year distribution of the surveyed articles is shown in Fig. 4.

2.3 Datasets, Metrics, and Losses

• Dataset. Given the various datasets in the surveyed fields, we use numerical labels to save post-textual space. 1) Commonly used deepfake generation datasets include LFW [103], CelebA [173], CelebA-HQ [124], VGGFace [211], VGGFace2 [22], FFHQ [125], Multi-PIE [193], VoxCeleb1 [198], VoxCeleb2 [39], MEAD [263], MM CelebA-HQ [280], CelebAText-HQ [242], CelebV-HQ [359], TalkingHead-1KH [270], LRS2 [2], LRS3 [3], etc. 2) Commonly used forgery detection datasets include UADFV [156], DeepfakeTIMIT [138], FF++ [225], Deeperforensics-1.0 [117], DFDCP [51], DFDC [50], Celeb-DF [158], Celeb-DFv2 [158], FakeAVCeleb [130], DFD [48], WildDeepfake [363], KoDF [142], UADFV [302], Deephy [199], DF-Platter [200], etc. We summarize popular datasets in Table 1.

• Metric. 1) For deepfake generation tasks, commonly used metrics include: Peak Signal-to-Noise Ratio (PSNR) [262], Structural Similarity (SSIM) [274], Learned Perceptual Image Patch Similarity (LPIPS) [336], Fréchet Inception Distance (FID) [92], Kernel Inception Distance (KID) [12], Cosine Similarity (CSIM) [270], Identity Retrieval Rate (ID Ret) [258], Expression Error [28], Pose Error [226], Landmark Distance (LMD) around the mouth [32], Lip-sync Confidence (LSE-C) [217], Lip-sync Distance (LSE-D) [217], etc. 2) Forgery detection commonly uses: Area Under the ROC Curve (AUC) [143], Accuracy (ACC) [260], Equal Error Rate (EER) [102], Average Precision (AP) [249], F1-Score [35], etc. Detailed definitions are explained in Sec. 4.1.

• Loss Function. VAE-based approaches generally employ reconstruction loss and KL divergence loss [136]. Commonly used reconstruction loss functions include Mean Squared Error, Cross-Entropy, LPIPS [336], and perceptual [122] losses. GAN-based methods further introduce adversarial loss [69] to increase image authenticity, while diffusion-based works introduce a denoising loss function [93].
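As a minimal, self-contained illustration of two of the metrics listed above, the sketch below computes PSNR between a reference and a generated image, and the identity cosine similarity (CSIM) between face embeddings; the face embedder mentioned in the usage comment is a placeholder for any pretrained recognition network, not a specific API from the surveyed works.

```python
import numpy as np

def psnr(ref: np.ndarray, gen: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between a reference and a generated image."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def csim(emb_src: np.ndarray, emb_gen: np.ndarray) -> float:
    """Identity cosine similarity (CSIM) between two face embeddings,
    e.g., produced by a pretrained recognition network such as ArcFace."""
    a = emb_src / np.linalg.norm(emb_src)
    b = emb_gen / np.linalg.norm(emb_gen)
    return float(np.dot(a, b))

# Hypothetical usage: `face_embedder` stands in for any pretrained identity
# encoder; it is illustrative only, not an API from the surveyed methods.
# score = csim(face_embedder(source_face), face_embedder(swapped_face))
```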
2.4 Related Research Domains

• Head Swapping replaces the entire head information of the target image rather than only the face [237], including facial contours and hairstyle, with that of the source image. However, in terms of facial attributes, head swapping only replaces identity attributes, while leaving other attribute information unchanged. Recently, some Diffusion-based methods [87, 264] have been proposed and have shown promising performance.

• Face Super-resolution aims to enhance the resolution of low-resolution face images to generate high-resolution face images [196, 333]. This task is closely related to various deepfake sub-tasks. In the early stages of Face Swapping [201, 244] and Talking Face Generation [33, 58], there is consistently a low-resolution issue in the generated images and videos. This problem is addressed by incorporating face super-resolution methods into the models [189, 289] to improve output quality. In terms of technical approaches, FSR can be categorized into methods based on CNNs [30, 36, 104], GANs [137, 197], reinforcement learning [234], and ensemble learning [116].

• Face Reconstruction refers to the process of recreating the three-dimensional appearance of an individual’s face based on one or multiple 2D facial images [194, 229]. Facial reconstruction often plays an intermediate role in various deepfake sub-tasks. In Face Swapping and Face Reenactment tasks, facial parameters are reconstructed using 3DMM, and the model’s parameters can be controlled directionally. Additionally, reconstructing a 3D facial model is one of the methods to address the issue of facial artifacts in videos generated under large facial poses. Common technical approaches for facial reconstruction include methods based on 3DMM [160, 180], epipolar geometry [61], one-shot learning [254, 281], shadow shape reconstruction [118, 128], and hybrid learning-based reconstruction [27, 54].

• Face Inpainting, a.k.a. face completion, aims to reconstruct missing regions in face images caused by external factors such as occlusion and lighting; preserving facial texture information is crucial in this process [340]. This task is a crucial sub-task of image inpainting, and current methods are mostly based on deep learning, which can be roughly divided into two categories: GAN-based [307, 352] and Diffusion-based [286, 298].

• Body Animation aims to alter the entire bodily pose while keeping the overall body information unchanged [314]. The goal is to achieve a modification of the target image’s entire body posture using an optional driving image or video, aligning the body of the target image with the information from the driving signal. The mainstream implementation paths for body animation are based on GANs [343, 356] and Diffusion [177, 268, 273, 292].

• Portrait Style Transfer aims to reconstruct the style of a target image to match that of a source image by learning the stylistic features of the source image [213, 299]. The goal is to preserve the content information of the target image while adapting its style to that of the source image [20]. Common applications include cross-domain style transfer, such as transforming real face images into animated face styles [60, 216]. Methods based on GANs [1, 120, 288] and Diffusion [107, 163] have achieved high-quality performance in this task.

• Makeup Transfer aims to achieve style transfer learning from a source image to a target image [119, 154]. Existing models have achieved initial success in applying and removing makeup on target images [26, 71, 154, 167], allowing for quantitative control over the intensity of makeup. However, they perform poorly in transferring extreme styles [295, 296, 349]. Existing mainstream methods are based on GANs [88, 119, 246].

• Adversarial Sample Detection focuses on identifying whether the input data is an adversarial sample [86]. If recognized as such, the model can refuse to provide services for it, such as throwing an error or not producing an output [316]. Current deepfake detection models often rely on a single cue from the generation process as the basis for detection, making them vulnerable to specific adversarial samples. Furthermore, relatively little work has focused on adversarial sample testing in terms of model generalization capability and detection evaluation.

3 Deepfake Inspections: A Survey

This section systematically examines four generation tasks: Face Swapping (Sec. 3.1.1), Face Reenactment (Sec. 3.1.2), Talking Face Generation (Sec. 3.1.3), and Facial Attribute Editing (Sec. 3.1.4). Additionally, we review techniques for forgery detection (commonly known as deepfake detection) in Sec. 3.2. We document and compare the most representative works in the main text. Furthermore, we discuss several related domains that garnered significant attention in Sec. 3.3.

3.1 Deepfake Generation

3.1.1 Face Swapping

In this section, we review face swapping methods from the perspective of basic architecture, which can be mainly divided into four categories, summarized in Table 2.
Table 2: Overview of representative face swapping methods. Notations: ➊ Self-build, ➋ CelebA-HQ, ➌ FFHQ, ➍ VGGFace2, ➎ VGGFace, ➏ CelebV, ➐ CelebA, ➑ VoxCeleb2, ➒ LFW, ➓ KoDF. Abbreviations: SIGGRAPH (SIG.), EUROGRAPHICS (EG.), GANs (G.), VAEs (V.), Diffusion (D.), Split-up and Integration (SI.).
Method Venue Dataset Categorize Limitation Highlight
Traditional Graphics
Blanz et al. [14] EG.’04 ➊ 3DMM Manual intervention, unnatural output. Early face-swapping efforts simplified manual interaction steps.
A three-phase implementation framework with the help of a pre-constructed face database to match
SimSwap [34] MM’20 ➊➍ G.+V. Poor ability to preserve face feature attributes. ID modules and weak feature matching loss functions are proposed to find a balance between identity information replacement and attribute information retention.
MegaFS [361] CVPR’21 ➊➋➌ G. Poor ability to preserve face feature attributes. The first method allows for face swapping on images with a resolution of one million pixels.
FaceInpainter [149] CVPR’21 ➊➋➌➎ G.+3DMM Poor representation of image details. A two-stage framework innovatively implements heterogeneous domains face swapping.
HifiFace [272] IJCAI’21 ➍ G.+3DMM Uses a large number of parameters. A 3D shape-aware identity extractor is proposed to achieve better retention of attribute information such as facial shape.
FSGANv2 [204] TPAMI’22 ➊ [182] G. Unable to process posture differences effectively. An extension of the FSGAN method that combines Poisson optimization with perceptual loss to enhance the output image facial details.
StyleSwap [293] ECCV’22 ➊➎➑ G. Unable to process posture differences effectively. Proposing an exchange-guided ID reversal strategy to enhance the performance of attribute information replacement during face exchange.
RAFSwap [283] CVPR’22 ➋ G. Unable to process posture differences effectively. A local facial region awareness branch with a global source feature adaptation (SFA) branch is proposed to better achieve the preservation of target image attribute information.
FSLSD [289] CVPR’22 ➊➋ G. Poor ability to preserve face feature attributes. Latent semantic disentanglement is realized to obtain facial structural attributes and appearance attributes in a hierarchical manner.
Kim et al. [134] CVPR’22 ➌➍ G. Unable to process posture differences effectively. An identity embedder is proposed to enhance the training speed under supervision.
3DSwap [157] CVPR’23 ➋➌ G.+3DMM Unable to process posture differences effectively. A 3D-aware approach to the face-swapping task, disentangling identity and attribute features in latent space to achieve identity replacement and attribute feature retention.
FALCO [11] CVPR’23 ➋➒ G. Poor ability to handle facial occlusion. Oriented toward privacy-preserving applications, the method directly employs the latent space of pre-trained GANs to anonymize image identity while preserving facial attributes.
WSC-Swap [220] ICCV’23 ➊➋➌➎ G.+3DMM Poor resolution of the output image. Two mutually independent encoders are proposed to encode attribute information outside the face region and semantic-level non-identity facial attributes inside the face region.
BlendFace [236] ICCV’23 ➊➌➍➏ G. Unable to handle occlusion and extreme lighting. The identity features obtained from disentanglement are fed to the generator as an identity loss function, which guides the generator to generate an image that fits the source image identity information.
FlowFace [320] AAAI’23 ➊➋➌➍ G.+3DMM Altered target image lighting details. It consists of a face reshaping network and a face swapping network, which better addresses the influence of the difference between source and target face contours on face swapping.
S2Swap [162] MM’23 ➊➋➌➑ G.+3D Poor ability to preserve face feature attributes. Achieving high-fidelity face swapping through semantic disentanglement and structural enhancement.
StableSwap [362] TMM’24 ➊➌ G.+3D Unable to handle extreme skin color differences. Utilizing a multi-stage identity injection mechanism effectively combines facial features from both the source and target to produce high-fidelity face swapping.
Diffusion
DiffFace [135] arXiv’22 ➊➌ D. Facial lighting attributes are altered. Claims to be the first diffusion model-based face swapping framework.
DiffSwap [345] CVPR’23 ➊➌ D. Poor ability to handle facial occlusion. Reenvisioning face swapping as conditional inpainting to harness the power of the diffusion model.
FaceX [87] arXiv’23 ➊➌➏ D. Unable to handle extreme skin color differences. A novel facial all-rounder model capable of performing various facial tasks.
Liu et al. [166] arXiv’24 ➊➋➌ D. Poor ability to preserve face feature attributes. A conditional diffusion model introduces identity and expression encoder components, achieving a balance between identity replacement and attribute preservation during the generation process.
Alternative
Cui et al. [42] CVPR’23 ➊➋ Other Altered target image lighting details. Introducing a multiscale transformer network focusing on high-quality semantically aware correspondences between source and target faces.
TransFS [23] FG’23 ➊➋➓ Other Unable to process posture differences effectively. The identity generator is designed to reconstruct high-resolution images of specific identities, and an attention mechanism is utilized to enhance the retention of identity information.
Wang et al. [269] TMM’24 ➊➋ Other Poor ability to handle facial occlusion. A Global Residual Attribute-Preserving Encoder (GRAPE) is proposed, and a network flow considering the facial landmarks of the target face is introduced, achieving high-quality face swapping.
• Traditional Graphics. As representative early implementations, traditional graphics methods can be divided into two categories in terms of implementation path: 1) Key information matching and fusion. Methods [13, 205, 247] grounded in critical information matching and fusion are geared towards substituting corresponding regions by aligning key points within facial regions of interest (ROIs), such as the eyes, nose, and mouth, between the source and target images. Following this, additional procedures such as boundary blending and lighting adjustments are executed to produce the resulting image. Bitouk et al. [13] accomplish automated face replacement by constructing a substantial face database to locate faces with akin poses and lighting conditions for substitution. Meanwhile, Nirkin et al. [205] enhance keypoint matching and segmentation accuracy by incorporating a Fully Convolutional Network (FCN) into their method. 2) The construction of a 3D prior model for facial parameterization. Methods [14, 43] based on constructing a 3D prior and introducing a facial parameter model often involve building a facial parameter model using 3DMM technology based on a pre-collected face database. After matching the facial information of the source image with the constructed face model, specific modifications are made to the relevant parameters of the facial parameter model to generate a completely new face. Dale et al. [43] utilize 3DMM to track facial expressions in two videos, enabling face swapping in videos. Some methods [74, 161] explore scenarios involving significant pose differences between the source and target images. Lin et al. [161] construct a 3D face model from frontal faces, renderable in any pose. Guo et al. [74] utilize plane parameterization and affine transformation to establish a one-to-one dense mapping between 2D graphics.
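The 3DMM facial parameter model that these graphics-based pipelines (and many later learning-based methods) rely on is, in its generic textbook form, a linear statistical model of face shape:
$$S(\alpha, \beta) = \bar{S} + B_{\mathrm{id}}\,\alpha + B_{\mathrm{exp}}\,\beta,$$
where $\bar{S}$ is the mean shape, $B_{\mathrm{id}}$ and $B_{\mathrm{exp}}$ are identity and expression bases learned from a face database, and $\alpha$, $\beta$ are coefficients fitted to the input image. Face swapping or editing then amounts to recombining the fitted coefficients (e.g., identity from the source, expression and pose from the target) and re-rendering; the exact bases and parameterization differ across the surveyed methods.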
Traditional computer graphics methods solve basic face-swapping problems, exploring full automation to enhance generalization. However, these methods are constrained by the need for similarities in pose and lighting between source and target images. They also face challenges like low image resolution, modification of target attributes, and poor performance in extreme lighting and occlusion scenarios.

• Generative Adversarial Network. GAN-based methods aim to obtain realistic generated images through adversarial training between the generator and the discriminator, and have become the mainstream face swapping approach. According to different improvement objectives, methods can be classified into seven categories:
1) Early GAN-based methods [9, 165, 192, 358] address issues related to the similarity of pose and lighting between source and target images. DepthNets [192] combines GANs with 3DMM to map the source face to any target geometry, not limited to the geometric shape of the target template. This allows it to be less affected by differences in pose between the source and target faces. However, these methods face challenges in generalizing the trained model to unknown faces.
2) Improved Generalizability. To improve the model’s generalization, many efforts [34, 203, 244] are made to explore solutions. Combining GANs with VAEs, the models [10, 201] encode and process different facial regions separately. FSGAN [203] integrates face reenactment with face swapping, designing a facial blending network to mix two faces seamlessly. SimSwap [34] introduces an identity injection module to avoid integrating identity information into the decoder. However, these methods suffer from low resolution and significant attribute loss, and struggle to handle facial occlusions effectively.
3) Resolution Upgrading. Some methods [114, 282, 361] provide solutions to enhance the resolution of generated images. MegaFS [361] introduces the first one-shot face swapping method at the million-pixel level. Its face encoder no longer compresses facial information but represents it hierarchically, achieving more detailed preservation. StyleIPSB [114] constrains semantic attribute codes within the subspace of StyleGAN, thereby fixing certain semantic information during face swapping to preserve pore-level details.
4) Geometric Detail Preservation. To capture and reproduce more facial geometric details, some methods [220, 272, 320, 341] introduce 3DMM into GANs, enabling the incorporation of 3D priors. HifiFace [272] introduces a novel 3D shape-aware identity extractor, replacing traditional face recognition networks to generate identity vectors that include precise shape information. FlowFace [320] introduces a two-stage framework based on semantic guidance to achieve shape-aware face swapping. FlowFace++ [341] improves upon FlowFace by utilizing a pre-trained Masked Autoencoder to convert face images into a fine-grained representation space shared between the target and source faces. It further enhances feature fusion for both source and target by introducing a cross-attention fusion module. However, most of the aforementioned methods often struggle to effectively handle occlusion issues.
5) Facial Masking Artifacts. Some methods [152, 171, 203, 224] have partially alleviated the artifacts caused by facial occlusion. FSGAN [203] designs a restoration network to estimate missing pixels. E4S [171] redefines the face-swapping problem as a mask-exchanging problem for specific information. It utilizes a mask-guided injection module to perform face swapping in the latent space of StyleGAN. However, overall, the methods above have not thoroughly addressed the issue of artifacts in generated images under extreme occlusion conditions.
6) Trade-offs between Identity Replacement and Attribute Retention. In addition to the occlusion issues that need further handling, researchers [65, 236] discover that the balance between identity replacement and attribute preservation in generated images is akin to a seesaw. Many methods [134, 162, 293] explore the equilibrium between identity replacement and attribute retention. InfoSwap [65] aims to decouple identity and attribute information in face swapping by leveraging the information bottleneck theory, seeking controlled swapping of identity between source and target faces. StyleSwap [293] introduces a novel swapping guidance strategy, ID reversal, to enhance the similarity of facial identity in the output. Shiohara et al. [236] propose BlendFace, using an identity encoder that extracts identity features from the source image and uses them as an identity distance loss, guiding the generator to produce face swapping results.
7) Model Light-weighting is also an important topic with profound implications for the widespread application of models. FastSwap [309] achieves this by innovating a decoder block called Triple Adaptive Normalization (TAN), effectively integrating identity information from the source image and pose information from the target image. XimSwap [6] modifies the design of convolutional blocks and the identity injection mechanism, successfully deploying on an STM32H743.
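To ground the identity-preservation losses mentioned above (e.g., the identity distance used by BlendFace-style methods), the following sketch computes an identity loss from the embeddings of a frozen face recognition network; the encoder and generator names in the usage comment are placeholders, not the APIs of any specific surveyed method.

```python
import torch
import torch.nn.functional as F

def identity_loss(emb_source: torch.Tensor, emb_swapped: torch.Tensor) -> torch.Tensor:
    """Identity distance: 1 - cosine similarity between L2-normalized
    face embeddings of the source face and the swapped result."""
    emb_source = F.normalize(emb_source, dim=-1)
    emb_swapped = F.normalize(emb_swapped, dim=-1)
    return 1.0 - (emb_source * emb_swapped).sum(dim=-1).mean()

# Hypothetical usage inside a training step (`id_encoder` is a frozen,
# pretrained recognition network such as ArcFace; names are illustrative):
# loss_id = identity_loss(id_encoder(source_face),
#                         id_encoder(generator(target_face, source_face)))
# loss = loss_adv + lambda_id * loss_id + lambda_rec * loss_rec
```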
Table 3: Overview of representative face reenactment methods. Notations: ➊ Voxceleb, ➋ Self-build, ➌ Voxceleb2,
➍ TalkingHead-1KH, ➎ CelebV-HQ, ➏ VFHQ, ➐ RaFD, ➑ VGGFace, ➒ CelebV, ➓FFHQ. Abbreviations:
Expression (Exp).
Method Venue Controllable object Dataset Highlight
Based on 3DMM
Face2Face [251] CVPR’16 Exp, Pose ➋ Using 3D facial reconstruction, transfer facial expressions and pose from a driving character to a target character via affine transformation, then generate the final video through rendering techniques.
Kim et al. [133] TOG’18 Exp, Pose, Blink ➋ Using synthesized rendering images of a parameterized face model as input, creating lifelike video frames for the target actor.
Kim et al. [132] TOG’19 Lip,Exp, Pose ➋ Built on a recurrent generative adversarial network, it employs a hierarchical neural face renderer to synthesize realistic video frames.
Head2Head [140] FG’21 Exp, Pose, Blink, Gaze ➋ The model consists of two stages: facial reconstruction and tracking in the first stage, followed by video rendering in the second stage.
HeadGAN [56] ECCV’21 Exp, Pose ➊ Using 3DMM for facial modeling provides a 3D prior to the GAN, effectively guiding the generator to accurately recover pose and expression from the target frame.
Face2Faceρ [297] ECCV’22 Exp, Pose ➊ Decoupling the actor’s facial appearance and motion information with two separate encodings allows the network to learn facial appearance and motion priors.
PECHead [68] CVPR’23 Lip, Exp, Pose ➌➍➎➏ A novel multi-scale feature alignment module for motion perception is proposed to minimize distortion during motion transmission.
Based on Landmark Matching
X2Face [278] CVPR’18 Exp, Pose ➑ The training is realized in two stages: the first stage guides the generation frames towards the driving frames and the second stage accomplishes the preservation of the source identity information through the identity loss function.
Zakharov et al. [319] ICCV’19 Exp, Pose ➊➌ Proposed a meta-learning framework for adversarial generative models, reducing the required training data size.
FReeNet [334] CVPR’20 Exp, Pose ➐ [193] A new triple perceptual loss is proposed to richly reproduce facial details of the face.
Zakharov et al. [318] ECCV’20 Exp, Pose ➌ Decomposed facial information into two layers for modeling: the first layer synthesizes coarse images related to the pose using a small neural network, and the second layer defines texture images unrelated to the pose, containing high-frequency details.
MarioNETte [83] AAAI’20 Exp, Pose ➊➒ Introduced image attention blocks, target feature alignment modules, and landmark transformers, enhancing the model’s performance when generalizing to unknown individual identities.
DG [97] CVPR’22 Exp, Pose ➊➌➐ A proposed dual generator model network for large pose face reproduction.
Doukas et al. [55] TPAMI’23 Exp, Pose, Gaze ➊ [127] [339] Eye gaze control in the generated video is implemented to further enhance visual realism.
MetaPortrait [324] CVPR’23 Exp, Pose ➌ By establishing dense facial keypoint matching, accurate deformation field prediction is achieved, and the model training is expedited based on the meta-learning philosophy.
Yang et al. [300] AAAI’24 Exp, Pose ➊➌➍ The facial tri-plane is represented by canonical tri-plane, identity deformation, and motion components, achieving face reenactment without the need for 3D parameter model priors.
FSRT [222] CVPR’24 Exp, Pose ➊ The Transformer-based encoder-decoder effectively encodes attributes and improves action transmission quality.
Based on Face Feature Decoupling
HiDe-NeRF [155] CVPR’23 Lip, Exp, Pose ➊➌➍ High-fidelity and free-viewing talking head synthesis using deformable neural radiation fields.
HyperReenact [17] ICCV’23 Exp, Pose ➊➌ Exploiting the effectiveness of hypernetworks in real image inversion tasks and extending them to real image manipulation.
Stylemask [18] FG’23 Exp, Pose ➓ This work optimizes a Mask Network and combines it with StyleGAN2’s style latent space S in order to achieve the separation of the facial pose and expression of the target image from the identity features of the source image.
Bounareli et al. [16] IJCV’24 Exp, Pose ➊➌ In GANs’ latent space, head pose and expression changes are decoupled, achieving near-real outputs through real image embedding.
Based on Self-supervised
ICface [253] WACV’20 Exp, Pose ➊ The model is decoupled and driven by interpretable control signals that can be obtained from multiple sources such as external driving videos and manual controls.
Oorloff et al. [207] ICCV’23 Lip, Exp, Pose ➎ Identity and attribute decomposition are realized in StyleGAN2’s latent space, and a cyclic manifold adjustment technique enhances facial reconstruction results.
Xue et al. [294] TOMM’23 Exp, Pose ➊➌ High-fidelity facial generation is achieved by using information-rich Projected Normalized Coordinate Code (PNCC) and eye maps, replacing sparse facial landmark representations.
• Diffusion-based. The latest studies [87, 135, 166, 345] in this area produce promising generation results. DiffSwap [345] redefines the face swapping problem as a conditional inpainting task. Liu et al. [166] introduce a multi-modal face generation framework and achieve this by introducing components such as balanced identity and expression encoders to the conditional diffusion model, striking a balance between identity replacement and attribute preservation during the generation process. As a novel facial generalist model, FaceX [87] can achieve various facial tasks, including face swapping and editing. Leveraging the pre-trained StableDiffusion [15] has significantly improved the generation quality and model training speed.

• Alternative Techniques. Some methods stand independently of the above classifications and are discussed here collectively. Fast Face-swap [139] views the identity swap task as a style transfer task, achieving its goals based on VGG-Net. However, this method has poor generalization. Some methods [23, 42] apply the Transformer architecture to face swapping tasks. Leveraging a facial encoder based on the Swin Transformer [172], TransFS [23] obtains rich facial features, enabling face swapping in high-resolution images. ReliableSwap [317] enhances the model’s identity preservation capabilities by constructing a reliable supervisor called the "cyclic triplet." However, it has limitations in preserving attribute information.

3.1.2 Face Reenactment

This section reviews current methods from four perspectives: 3DMM-based, landmark matching, face feature decoupling, and self-supervised learning. We summarize them in Table 3.

• 3DMM-based. Some methods [68, 140, 297] utilize 3DMM to construct a facial parameter model as an intermediary for transferring information between the source and target. In particular, Face2Faceρ [297], based on 3DMM, consists of a U-shaped rendering network driven by head pose and facial motion fields, and a hierarchical coarse-to-fine motion network guided by landmarks at different scales. However, some methods [132, 133, 251] exhibit visible artifacts in the background when dealing with significant head movements in the input images. To address issues such as incomplete attribute decoupling in face reenactment tasks, PECHead [68] models facial expressions and pose movements. It combines self-supervised learning of landmarks with 3D facial landmarks and introduces a new motion-aware multi-scale feature alignment module to eliminate artifacts that may arise from facial motion.

• Landmark Matching. These methods [55, 83, 97, 278, 318] aim to establish a mapping relationship between semantic objects in the facial regions of the driving source and the target source through landmarks. Based on this mapping relationship [319, 334], the transfer of facial movement information is achieved. In particular, X2Face [278] achieves one-to-one driving of facial expressions and poses for each frame of the character in the driving video and source video. To address the challenge of reproducing large head poses in face reenactment, Xu et al. [97] propose a dual-generator network incorporating a 3D landmark detector into the model.
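To illustrate the landmark-matching idea in its simplest form, the sketch below transfers the driving face’s landmark displacements onto the source landmarks; `detect_landmarks` is an assumed stand-in for any 68-point facial landmark detector, and this relative-displacement formulation is a deliberate simplification of what the surveyed methods actually learn.

```python
import numpy as np

def transfer_motion(src_lms: np.ndarray,
                    drv_lms: np.ndarray,
                    drv_neutral_lms: np.ndarray) -> np.ndarray:
    """Apply the driving frame's landmark displacement (relative to a
    neutral driving frame) to the source landmarks. All arrays are (68, 2)."""
    displacement = drv_lms - drv_neutral_lms      # motion of the driving face
    return src_lms + displacement                 # retargeted source landmarks

# Hypothetical usage: the retargeted landmarks would then condition an
# image generator that renders the source identity in the driving pose.
# src_lms = detect_landmarks(source_frame)          # detector is assumed
# drv_lms = detect_landmarks(driving_frame)
# drv_neutral_lms = detect_landmarks(driving_neutral_frame)
# target_lms = transfer_motion(src_lms, drv_lms, drv_neutral_lms)
```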
Free-HeadGAN [55] comprises a 3D keypoint estimator, an eye gaze estimator, and a generator built on the HeadGAN architecture. The 3D keypoint estimator addresses the regression of deformations related to 3D poses and expressions. The eye gaze estimator controls eye movement in videos, providing finer details. MetaPortrait [324] achieves accurate deformation field prediction through dense facial keypoint matching and accelerates model training based on meta-learning principles, delivering excellent results on limited datasets.

• Feature Decoupling. Latent feature decoupling and driving methods [16–18, 155, 300] aim to disentangle facial features in the latent space of the driving video, replacing or mapping the corresponding latent information to achieve high-fidelity face reenactment under specific conditions. HyperReenact [17] uses attribute decoupling, employing a hypernetwork to refine source identity features and modify facial poses. StyleMask [18] separates facial pose and expression from the identity information of the source image by learning masks and blending corresponding channels in the pre-trained style space S of StyleGAN2. HiDe-NeRF [155] employs a deformable neural radiance field to represent a 3D scene, with a lightweight deformation module explicitly decoupling facial pose and expression attributes.

• Self-supervised Learning. Self-supervised learning employs supervisory signals inferred from the intrinsic structure of the data, reducing the reliance on external data labels [207, 253, 321, 328]. Oorloff et al. [207] employ self-supervised methods to train an encoder, disentangling identity and facial attribute information of portrait images within the pre-defined latent space of a pre-trained StyleGAN2. Zhang et al. [328] utilize 3DMM to provide geometric guidance, employ pre-computed optical flow to guide motion field estimation, and rely on pre-computed occlusion maps to guide the perception and repair of occluded areas.

3.1.3 Talking Face Generation

In this section, we review current methods from four perspectives: audio/text driven, multimodal conditioned, diffusion-based, and 3D-model technologies. We also summarize them in Table 4.

• Audio/Text Driven. These methods aim to map and guide lip and facial movements in generated videos by understanding the semantic information from the driving source [230, 259, 332]. Early methods [32, 58] perform poorly in terms of generalization and training complexity. After training, the models struggle to generalize to new individuals, requiring extensive conversational data for training new characters. Researchers [33, 217] propose solutions from various perspectives. However, most of these methods prioritize generating lip movements aligned with semantic information, overlooking essential aspects like identity and style, such as head pose changes and movement control, which are crucial in natural videos. To address this, MakeItTalk [357] decouples input audio information by predicting facial landmarks based on audio and obtaining semantic details on facial expressions and poses from audio signals. SadTalker [337] extracts 3D motion coefficients for constructing a 3DMM from audio and uses these to modulate a new 3D-aware facial rendering for generating head poses in talking videos. Additionally, some methods [63, 145, 262, 279] propose their own improvements, which will not be detailed one by one. In addition, the emotional expression varies for different texts during a conversation, and vivid emotions are an essential part of real talking face videos [64, 233]. Recently, some methods [82, 250, 322] extend previous approaches by incorporating matching between the driving information and corresponding emotions. EMMN [250] establishes an organic relationship between emotions and lip movements by extracting emotion embeddings from the audio signal, synthesizing overall facial expressions in talking faces rather than focusing solely on audio for facial expression synthesis. AMIGO [322] employs a sequence-to-sequence cross-modal emotion landmark generation network to generate vivid landmarks aided by audio information, ensuring that lips and emotions in the output image sequence are synchronized with the input audio. However, existing methods still lack effective control over the intensity of emotions. In addition, TalkCLIP [178] introduces style parameters, expanding the style categories for text-guided talking video generation. Zhong et al. [348] propose a two-stage framework, incorporating appearance priors during the generation process to enhance the model’s ability to preserve attributes of the target face. DR2 [326] explores practical strategies for reducing the training workload.

• Multimodal Conditioned. To generate more realistic talking videos, some methods [159, 261, 284, 351] introduce additional modal information on top of audio-driven methods to guide facial pose and expression. GC-AVT [159] generates realistic talking videos by independently controlling head pose, audio information, and facial expressions. This approach introduces an expression source video, providing emotional information during speech, as well as a pose source video. However, the video quality falls below expectations, and it struggles to handle complex background changes. Xu et al. [284] integrate text, image, and audio-emotional modalities into a unified space to complement the emotional content of textual information. Multimodal approaches have significantly enhanced the vividness of generated videos, but there is still room for exploring how to organically combine information driven by different sources and modalities.
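As a minimal illustration of how audio-driven pipelines such as those above condition on speech, the sketch below converts a waveform into a log-mel spectrogram that a landmark or expression predictor can consume; the predictor itself is only a placeholder, and the hyper-parameters are common defaults rather than values taken from a specific surveyed method.

```python
import torch
import torchaudio

def audio_to_mel(wav_path: str, sample_rate: int = 16000, n_mels: int = 80) -> torch.Tensor:
    """Load a speech waveform and convert it to a log-mel spectrogram,
    a typical conditioning signal for audio-driven talking face models."""
    wav, sr = torchaudio.load(wav_path)                       # (channels, samples)
    if sr != sample_rate:
        wav = torchaudio.functional.resample(wav, sr, sample_rate)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=n_mels
    )(wav.mean(dim=0))                                        # mono, (n_mels, frames)
    return torch.log(mel + 1e-6)

# Hypothetical usage: a sequence model maps per-frame mel features to
# facial landmarks or 3DMM expression coefficients (predictor is assumed).
# mel = audio_to_mel("speech.wav")
# landmarks = landmark_predictor(mel.T)   # illustrative only, not a real API
```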
Table 4: Overview of representative talking face generation methods. Notations: ➊ LRW, ➋ VoxCeleb2, ➌ MEAD,
➍ Self-build, ➎ LRS2, ➏ HDTF, ➐ LRS3, ➑ CREMA-D, ➒ VoxCeleb, ➓ FFHQ.
Method Venue Dataset Limitation Highlight
Audio / Text - Driven
Chen et al. [32] ECCV’18 ➊ [40] [221] Poor resolution, inability to control pose and emotion. Proposed a novel generator network and a comprehensive model with four complementary losses, as well as a new audio-visual related loss function to guide video generation.
Zhou et al. [350] AAAI’19 ➊ Inability to control pose and emotional variations. Generate high-quality talking face videos by disentangling audio-visual representations.
Chen et al. [33] CVPR’19 ➊ Inability to control pose and emotional variations. Proposed a cascaded approach, using facial landmarks as an intermediate high-level representation.
Wav2Lip [217] ICMR’20 ➊➎➐ Poor resolution, inability to control pose and emotional. A new evaluation framework and a dataset for training mouth synchronization are proposed.
MakeItTalk [357] TOG’20 ➋ Unable to control pose and emotional variations well. Separating content information and identity information from audio signals, combining LSTM and self-attention.
DreamTalk [179] arXiv’23 ➋➌➏ Mismatched emotion and semantics, occasional artifacts. The denoising network, style-aware lip expert, and style predictor collaborate to make the model perform well in various speaking styles.
Stypułkowski [240] WACV’24 ➊➑ High model complexity, short video generation duration. The model incorporates motion frame and audio embedding information to capture past movements and future expressions, with an emphasis on the mouth region through an additional lip sync loss.
EmoTalker [325] ICASSP’24 ➌➑ High model complexity. It achieves emotion-editable talking face generation based on a conditional diffusion model.
VASA-1 [287] arXiv’24 ➒ High model complexity. An expressive and well-decoupled facial latent space has been constructed, and highly controllable, high-quality generation effects have been achieved based on the Diffusion Transformer.
3D-Model
AD-NeRF [76] ICCV’21 ➍ Inadequate control of emotions and latent attributes. The NeRF-based approach achieves accurate reproduction of detailed facial components and generates the upper body region.
DFRF [230] ECCV’22 ➍ Lack of emotional and other latent attribute control. Combining audio with 3D perceptual features and proposing a facial deformation module.
AE-NeRF [148] AAAI’24 ➏ Lacks emotional and other latent attribute control. Facial modeling is divided into NeRF related to audio and unrelated to audio to enhance audio-visual lip synchronization and facial detail.
SyncTalk [214] CVPR’24 ➍ Lacks controllable emotional intensity regulation. The facial sync controller boosts component coordination, and a portrait generator corrects artifacts, enhancing video details.
Ye et al. [306] ICLR’24 ➋ [359] Lacks emotional control and occasional artifacts. Facial and audio information is separately represented using tri-planes, followed by rendering. The generated results are further optimized based on the super-resolution network.
• Diffusion-based. Recently, some methods [57, 179, 189, 285, 325] apply the diffusion model to the task of talking face generation. For fine-grained talking video generation, DAE-Talker [57] replaces manually crafted intermediate representations, such as facial landmarks and 3DMM coefficients, with data-driven latent representations obtained from a Diffusion Autoencoder (DAE); an image decoder then generates video frames from the predicted latent variables. EmoTalker [325] utilizes a conditional diffusion model for emotion-editable talking face generation, introducing emotion intensity blocks and the FED dataset to enhance the model's understanding of complex emotions. Very recently, diffusion models are gaining prominence in talking face generation tasks [240, 252, 275] and video generation tasks [123, 273]. EMO [252] directly predicts video from audio without the need for intermediate 3D components, achieving excellent results; however, the lack of explicit control signals may easily lead to unnecessary artifacts. Based on the Diffusion Transformer architecture, VASA-1 [287] finely encodes and reconstructs facial details, constructing an expressive and well-decoupled facial latent space, which results in highly controllable and high-resolution video generation.

• 3D-model Technologies. 3D models, exemplified by NeRF, are gaining traction in talking face generation [76, 148, 230]. AD-NeRF [76] directly feeds features from the input audio signal into a conditional implicit function to generate a dynamic NeRF. AE-NeRF [148] employs a dual NeRF framework to separately model audio-related and audio-independent regions. Furthermore, some methods [214, 306] adopt tri-planes [25] to represent facial and audio attributes. SyncTalk [214] models and renders head motion using a tri-plane hash representation and then further enhances the output quality using a portrait synchronization generator. Very recently, 3D Gaussian Splatting [129] has also been widely applied to this task. Some methods [29, 37, 151, 311] introduce 3DGS to achieve more refined facial reconstruction and motion details, aiming to address the insufficient pose and expression control caused by NeRF's implicit representation.
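To make the 3D-model line of work more concrete, the sketch below illustrates the basic idea of an audio-conditioned implicit function in the spirit of AD-NeRF [76]: an MLP predicts density and color not only from an encoded 3D position and view direction but also from a per-frame audio feature. This is only an illustrative PyTorch sketch under our own simplifying assumptions (the class name AudioConditionedNeRF, layer sizes, and feature dimensions are hypothetical), not the architecture of any specific method.

```python
import torch
import torch.nn as nn

class AudioConditionedNeRF(nn.Module):
    """Toy conditional implicit function: (position, direction, audio) -> (density, rgb).

    Hypothetical layer sizes; real audio-driven NeRFs add positional encoding,
    per-frame latent codes, and a full volume-rendering pipeline.
    """

    def __init__(self, pos_dim: int = 63, dir_dim: int = 27,
                 audio_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(pos_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, pos_enc, dir_enc, audio_feat):
        # The audio feature is concatenated with the encoded position, so the
        # predicted geometry and appearance change as the speech changes.
        h = self.backbone(torch.cat([pos_enc, audio_feat], dim=-1))
        density = torch.relu(self.density_head(h))
        rgb = self.color_head(torch.cat([h, dir_enc], dim=-1))
        return density, rgb
```

Standard volume rendering along camera rays then turns these per-point predictions into pixels, as in an ordinary NeRF.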
Table 5: Overview of representative facial attribute editing methods. Notations: ➊ FFHQ, ➋ CelebA, ➌ CelebA-HQ, ➍ CelebAMask-HQ, ➎ VoxCeleb, ➏ CelebAText-HQ, ➐ LFW, ➑ MM CelebA-HQ, ➒ CARLA, ➓ Multi-PIE. In addition, some abbreviations are used in the table: SIGGRAPH (SIG.), GANs (G.), Diffusion (D.), Transformer (T.).
Method | Venue | Category | Dataset | Highlight
SC-FEGAN [121] | ICCV'19 | G. | ➌ | Users can generate high-quality edited output images by freely sketching parts of the source image.
AttGAN [91] | TIP'19 | G. | ➋ | Applying attribute classification constraints to generated images has validated the drawbacks of enforcing stringent attribute-independence constraints in latent representations.
Shen et al. [232] | CVPR'20 | G. | ➋ | Thoroughly investigated how to encode different semantics in the latent space and explored the disentanglement between various semantics to achieve precise control over facial attributes.
Yao et al. [305] | ICCV'21 | G.+T. | ➌ | By integrating explicit decoupling terms and identity-consistent terms into the loss function, the preservation of facial identity information is improved, resulting in high-quality face editing in videos.
HifaFace [67] | CVPR'21 | G. | ➊ | Proposed a wavelet-transform-based solution to address the partial loss of attribute information in edited results caused by "cycle consistency" problems.
Preechakul et al. [218] | CVPR'22 | D. | ➊ | When encoding images, the representation is divided into semantically meaningful parts and parts that represent image details.
FENeRF [243] | CVPR'22 | G.+NeRF | ➊➍ | The introduction of semantic masks into the conditional radiance field enables finer image textures.
GuidedStyle [95] | NN'22 | G. | ➊ | Face generation after editing is guided by facial attribute classification; a sparse attention mechanism enhances the manipulation of individual attribute styles.
FDNeRF [330] | SIG.'22 | G.+NeRF | ➎ | The Conditional Feature Warping (CFW) module addresses the temporal inconsistency caused by dynamic information when editing faces in videos.
AnyFace [241] | CVPR'22 | G. | ➏➑ | Proposed a dual-branch framework for text-driven facial editing, with coordination between the two branches achieved through a Cross-Modal Distillation (CMD) module.
TransEditor [291] | CVPR'22 | G. | ➌➊ | Emphasizing the importance of dual-space GAN interaction, a transformer architecture is introduced for improved interaction.
Huang et al. [106] | CVPR'23 | D. | ➍ | Proposed the concept of assisted diffusion, integrating individual modalities to explore the complementarity between different modalities.
Ozkan et al. [208] | ICCV'23 | G. | ➊ | The entangled attribute space is decomposed into conceptual and hierarchical latent spaces, and transformer encoders are employed to modify information in the latent space.
CIPS-3D++ [354] | TPAMI'23 | G.+NeRF | ➊➒ | Replaced the convolutional architecture with an MLP (Multi-Layer Perceptron) architecture to achieve faster rendering speeds.
ClipFace [7] | SIG.'23 | G.+3DMM | ➊ | Learned texture generation from large-scale datasets, enhancing generator performance through generative adversarial training.
TG-3DFace [310] | ICCV'23 | G. | ➌➏ | Two sets of text-to-face cross-modal alignment methods were designed with specific focuses for different scenarios.
VecGAN++ [44] | TPAMI'23 | G. | ➌ | Orthogonal constraints and a disentanglement loss are used to decouple attribute vectors in the latent space.
DiffusionRig [49] | CVPR'23 | D. | ➊ | Integrating a 3DMM and a diffusion model, a two-stage method is proposed for learning personalized facial details.
Kim et al. [131] | CVPR'23 | D. | ➎ | Proposed a method for facial editing in videos based on the diffusion model.
SDGAN [277] | AAAI'24 | G. | ➌ | SDGAN introduces a semantic separation generator and a semantic mask alignment strategy, achieving satisfactory preservation of irrelevant details and precise attribute manipulation.
FaceDNeRF [327] | NIPS'24 | D.+NeRF | ➊ | Creating and editing facial NeRFs with single-view images, text prompts, and target lighting.
3.1.4 Facial Attribute Editing

In this section, we review current methods chronologically, following the progression in overcoming technical challenges, primarily focusing on multi-attribute editing methods utilizing GANs. Finally, we summarize the methods in Table 5.

• Comprehensive Editing. Facial attribute editing aims to selectively alter specific facial attributes without affecting others. Therefore, disentangling different facial attributes is a primary challenge. Early facial attribute editing models [231, 355] often achieve editing for a single attribute through data-driven training. For instance, Shen et al. [231] propose learning the difference between pre-/post-operation images, represented as residual images, to achieve attribute-specific operations. However, single-attribute editing falls short of expectations, and compression steps in the pipeline often lead to a significant loss of image resolution, a common issue in early methods. The fundamental challenge in comprehensive editing and unrelated-attribute modification is achieving complete attribute disentanglement. Many approaches [67, 232, 291, 305] have explored this. E.g., HifaFace [67] identifies cycle consistency issues as the cause of facial attribute information loss and proposes a wavelet-based method for high-fidelity face editing, while TransEditor [291] introduces a dual-space GAN structure based on the transformer framework that improves image quality and attribute editing flexibility.

• Irrelevant-attribute Retained. Another critical aspect of face editing is retaining as much target image information as possible in the generated images [95, 105, 305]. GuidedStyle [95] leverages attention mechanisms in StyleGAN [125] for the adaptive selection of style modifications for different image layers. IA-FaceS [105] embeds the face image to be edited into two branches of the model, where one branch calculates a high-dimensional component-invariant content embedding to capture facial details, and the other branch provides low-dimensional component-specific embeddings for component operations. Additionally, some approaches [115, 243, 330, 354] combine GANs with NeRF [187] for enhanced spatial awareness. Specifically, FENeRF [243] uses two decoupled latent codes to generate corresponding facial semantics and textures in a 3D volume with spatial alignment sharing the same geometry. CIPS-3D++ [354] improves training efficiency with a NeRF-based shallow 3D shape encoder and an MLP-based deep 2D image decoder.

• Text Driven facial attribute editing is a crucial application scenario and a recent hot topic in academic research [7, 94, 241, 310]. TextFace [94] introduces text-to-style mapping, directly encoding text descriptions into the latent space of a pre-trained StyleGAN. TG-3DFace [310] introduces two text-to-face cross-modal alignment techniques, including global contrastive learning and fine-grained alignment modules, to enhance semantic consistency between the generated 3D face and the input text.

• Diffusion-based models have been introduced into facial attribute editing [49, 106, 131, 218] and achieve excellent results. Huang et al. [106] propose a collaborative diffusion framework, utilizing multiple pre-trained unimodal diffusion models together for multimodal face
generation and editing. DiffusionRig [49] conditions its diffusion model on an initial 3D face model, which helps preserve facial identity information during personalized editing of facial appearance, on top of generic facial priors learned from a dataset beforehand.

3.2 Forgery Detection

In this section, we review current forgery detection techniques based on the type of detection cues, categorizing them into four groups: Space Domain (Sec. 3.2.1), Time Domain (Sec. 3.2.2), Frequency Domain (Sec. 3.2.3), and Data-Driven (Sec. 3.2.4). We also summarize the detailed information about popular methods in Table 6.

3.2.1 Space Domain

• Image-level Inconsistency. The generation process of forged images often involves partial alterations rather than global generation, leading to common local differences in non-globally generated forgery methods. Therefore, some methods focus on differences in image spatial details as criteria for determining whether an image is forged, such as color [90], saturation [183], artifacts [21, 235, 342], gradient variations [249], etc. Specifically, RECCE [21] considers forgery generation from a training perspective, utilizing the representations learned on real samples to identify image reconstruction differences. LGrad [249] utilizes a pre-trained transformation model, converting images to gradients to visualize general artifacts and subsequently classifying based on these representations. In addition, some works focus on detection based on differences between facial and non-facial regions [206], as well as the fine-grained details of image textures [24, 175]. Recently, Ba et al. [8] focus not only on the discordance in a single image region but also on the detection of fused local representation information from multiple non-overlapping areas.

• Local Noise Inconsistency. Image forgery may involve adding, modifying, or removing content in the image, potentially altering the noise distribution in the image. Detection methods based on noise aim to identify such local or even global differences in the image. Zhou et al. [353] propose a dual-stream structure, combining GoogleNet with a triplet network to focus on tampering artifacts and local noise in images. Nguyen et al. [202] utilize capsule networks to detect forged pictures and videos in various forgery scenarios. NoiseDF [267] specializes in identifying underlying noise traces left behind in Deepfake videos, introducing an efficient and novel Multi-Head Relative Interaction with depth-wise separable convolutions to enhance detection performance.

3.2.2 Time Domain

• Abnormal Physiological Information. Forgery videos often overlook the authentic physiological features of humans, failing to achieve overall consistency with authentic individuals. Therefore, some methods focus on assessing the plausibility of the physiological features of the generated faces in videos. Li et al. [156] detect blinking and blink frequency in videos as criteria for determining a video's authenticity. Yang et al. [302] focus on the inconsistency of head poses in videos, comparing the differences between head poses estimated using all facial landmarks and those estimated using only the landmarks in the central region. Peng et al. [212] focus on inter-frame gaze angles, obtaining gaze characteristics of each video frame and using a spatio-temporal feature aggregator to combine temporal gaze features, spatial attribute features, and spatial texture features as the basis for detection and classification.

• Inter-Frame Inconsistency. Methods [38, 72, 290, 304, 308, 347] based on inter-frame inconsistency for forgery detection aim to uncover differences between adjacent frames or frames with specific temporal spans. Gu et al. [72] focus on inter-frame image inconsistency by densely sampling adjacent frames, while Yin et al. [308] design a Dynamic Fine-grained Difference Capturing module and a Multi-Scale Spatio-Temporal Aggregation module to cooperatively model spatio-temporal inconsistencies. Yang et al. [304] approach forgery detection as a graph classification problem, emphasizing the relationship information between facial regions to capture the relationships among local features across different frames. Choi et al. [38] discover that the style variables in each frame of a Deepfake video change; based on this, they develop a style attention module to focus on the inconsistency of the style latent variables between frames.
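As a minimal illustration of the noise cue these methods exploit (and not the pipeline of any particular detector such as NoiseDF), the sketch below estimates a simple noise residual and compares its variance inside and outside the face region; the helper names, the boolean face mask, and the median-filter residual are our own illustrative choices, whereas published methods typically rely on learned filters.

```python
import numpy as np
from scipy.ndimage import median_filter

def noise_residual(image: np.ndarray, kernel_size: int = 3) -> np.ndarray:
    """Estimate a crude noise residual by subtracting a median-filtered copy.

    `image` is a float grayscale array in [0, 1]; real detectors usually learn
    the residual filters (e.g., SRM-style kernels) instead of this toy version.
    """
    denoised = median_filter(image, size=kernel_size)
    return image - denoised

def region_noise_statistics(image: np.ndarray, face_mask: np.ndarray) -> dict:
    """Compare residual variance inside vs. outside a (hypothetical) face mask.

    A large mismatch between the two variances is one crude cue that the face
    region was synthesized or blended separately from the background.
    """
    residual = noise_residual(image)
    face_var = residual[face_mask].var()
    background_var = residual[~face_mask].var()
    return {
        "face_var": float(face_var),
        "background_var": float(background_var),
        "ratio": float(face_var / (background_var + 1e-8)),
    }
```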
Table 6: Overview of representative forgery detection methods. Notations: ➀ FF++, ➁ DFDC, ➂ Celeb-DF, ➃ DeeperForensics, ➄ Self-built, ➅ UADFV, ➆ Celeb-HQ, ➇ DFDCp, ➈ FFHQ, ➉ DFD.
Method | Venue | Train | Test | Highlight
Space Domain:
Gram-Net [175] | CVPR'20 | ➆➈ | ➆➈ | The method posits that genuine faces and fake faces exhibit inconsistencies in texture details.
Face X-ray [153] | CVPR'20 | ➀ | ➀➁➂➉ | Focusing on boundary artifacts of face fusion for forgery detection.
Zhao et al. [342] | CVPR'21 | ➀ | ➀➁➂ | A texture enhancement module, an attention generation module, and a bilinear attention pooling module are proposed to focus on texture details.
Nirkin et al. [206] | TPAMI'21 | ➀ | ➀➁➂ | Detecting swapped faces by comparing the facial region with its context (non-facial area).
SBIs [235] | CVPR'22 | ➀ | ➁➂➇➉ | The belief that harder-to-detect forged faces typically contain more generalized traces of forgery can encourage the model to learn a feature representation with greater generalization ability.
RECCE [21] | CVPR'22 | ➀ | ➀➁➂ [363] | Reconstruction learning on real samples to learn common compressed representations of real images.
LGrad [249] | CVPR'23 | ➄ | ➄ | The gradient is utilized to present generalized artifacts that are fed into the classifier to determine the authenticity of the image.
NoiseDF [267] | AAAI'23 | ➀ | ➀➁➂➃ | Extracting noise traces and features from cropped faces and background squares in video frames.
Ba et al. [8] | AAAI'24 | ➀➁➂ | ➀➁➂ | Multiple non-overlapping local representations are extracted from the image for forgery detection; a local information loss function based on information bottleneck theory is proposed as a constraint.
Time Domain:
Yang et al. [302] | ICASSP'19 | ➅ | ➅ | Focusing on head-pose inconsistency in videos by comparing the head pose estimated from all facial landmarks with the one estimated from only the central-region landmarks.
FTCN [347] | ICCV'21 | ➀ | ➀➁➂➃ [152] | Most face video forgeries are generated frame by frame; as each altered face is independently generated, this inevitably leads to noticeable flickering and discontinuity.
LipForensics [85] | CVPR'21 | ➀ | ➀➁➂ | Concerned with the temporal inconsistency of mouth movements in videos.
M2TR [260] | ICMR'22 | ➀ | ➀➁➂➉ | Capturing local inconsistencies at different scales for forgery detection using a multi-scale transformer.
Gu et al. [72] | AAAI'22 | ➀ | ➀➁➂ [363] | Densely sampling adjacent frames to attend to inter-frame image inconsistency.
Yang et al. [304] | TIFS'23 | ➀➁➂ | ➀➁➂ | Treating detection as a graph classification problem and focusing on the relationships between local image features across different frames.
AVoiD-DF [301] | TIFS'23 | ➁➄ [130] | ➁➄ [130] | Multimodal forgery detection using audio-visual inconsistency.
Choi et al. [38] | CVPR'24 | ➀➂➃ | ➀➂ | Focusing on the inconsistency of the style latent vectors between frames.
Xu et al. [290] | IJCV'24 | ➀➁➂➃ | ➀➁➂➃ | Forgery detection is conducted by converting video clips into thumbnails containing both spatial and temporal information.
Peng et al. [212] | TIFS'24 | ➀➂➇ | ➀➂➇ | Focusing on inter-frame gaze angles, extracting gaze information and employing spatio-temporal feature aggregation to combine temporal, spatial, and texture features for detection and classification.
Frequency Domain:
F3-Net [219] | ECCV'20 | ➀ | ➀ | A two-branch frequency perception framework with a cross-attention module is proposed.
FDFL [150] | CVPR'21 | ➀ | ➀ | Proposes an adaptive frequency feature generation module to extract differential features from different frequency bands in a learnable manner.
HFI-Net [185] | TIFS'22 | ➀ | ➁➂➃➅ [138] | Observes that the forgery flaws used to distinguish between real and fake faces are concentrated in the mid- and high-frequency spectrum.
Guo et al. [80] | TIFS'23 | ➀➁ | ➀➁➂ | Designing a backbone network for Deepfake detection with space-frequency interaction convolution.
Tan et al. [248] | AAAI'24 | ➄ | ➀➄ | A lightweight frequency-domain learning network is proposed to constrain classifier operation within the frequency domain.
Data Driven:
Dang et al. [45] | CVPR'20 | ➄ | ➂➅ | Utilizing attention mechanisms to process the feature maps of the detection model.
Zhao et al. [344] | ICCV'21 | ➀ | ➀➁➂➃➇➉ | Proposes pairwise self-consistency learning for training a CNN to extract source features and detect deepfake images.
Finfer [99] | AAAI'22 | ➀ | ➀➂➇ [363] | Based on an autoregressive model, using the facial representation of the current frame to predict the facial representation of future frames.
Huang et al. [102] | CVPR'23 | ➀ | ➀➁➂➉ [152] | A new implicit identity-driven face swapping detection framework is proposed.
HiFi-Net [75] | CVPR'23 | ➄ | ➄ | Converting forgery detection and localization into a hierarchical fine-grained classification problem.
Zhai et al. [323] | ICCV'23 | [52] [98] | [276] | Weakly supervised image manipulation detection is proposed such that only binary image-level labels (real or tampered) are required for training.
• Multimodal Inconsistency. The core idea behind multimodal detection algorithms is to base the judgment on prior information flowing across multiple modalities rather than solely on the per-frame image or audio differences of individual characteristics. Audio-visual inconsistency has been studied extensively in various methods [41, 59, 84, 301]. POI-Forensics [41] proposes a deep forgery detection method based on audio-visual authentication, utilizing contrastive learning to learn the most distinctive embeddings for each identity in moving facial and audio segments. AVoiD-DF [301] embeds spatiotemporal information in a spatiotemporal encoder and employs a multimodal joint decoder to fuse multimodal features and learn their inherent relationships; subsequently, a cross-modal classifier is applied to detect disharmonious operations within and between modalities. Agarwal et al. [5] describe a forensic technique for detecting fake faces using static and dynamic auditory ear characteristics. Indeed, multimodal detection methods are currently a hotspot in forgery detection research.

3.2.3 Frequency Domain

Frequency domain-based forgery detection methods transform image time-domain information into the frequency domain. Works [62, 150, 185, 219, 248] utilize statistical measures of periodic features, frequency components, and frequency characteristic distributions, either globally or in local regions, as evaluation metrics for forgery detection. Specifically, F3-Net [219] proposes a dual-branch framework: one frequency-aware branch utilizes Frequency-aware Image Decomposition (FAD) to learn subtle forgery patterns in suspicious images, while the other branch extracts high-level semantics from Local Frequency Statistics (LFS) to describe the frequency-aware statistical differences between real and forged faces. HFI-Net [185] consists of a dual-branch network and four Global-Local Interaction (GLI) modules; it effectively explores multi-level frequency artifacts, obtaining frequency-related forgery clues for face detection. Tan et al. [248] introduce a novel frequency-aware approach called FreqNet, which focuses on the high-frequency information of images and combines it with a frequency-domain learning module to learn source-independent features. Furthermore, some approaches combine spatial, temporal, and frequency domains for joint consideration [80, 181]; for example, Guo et al. [80] design a spatial-frequency interaction convolution to construct a novel backbone network for Deepfake detection.
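To make the frequency cue concrete, the sketch below computes a high-frequency energy ratio from the 2D Fourier spectrum of a grayscale image with NumPy; the function names and the band cutoff are our own assumptions rather than the decomposition used by F3-Net or FreqNet.

```python
import numpy as np

def power_spectrum(gray: np.ndarray) -> np.ndarray:
    """2D power spectrum of a grayscale image, zero frequency at the center."""
    freq = np.fft.fftshift(np.fft.fft2(gray))
    return np.abs(freq) ** 2

def high_frequency_ratio(gray: np.ndarray, cutoff: float = 0.25) -> float:
    """Share of spectral energy outside a low-frequency disk of radius
    cutoff * (half image size); upsampling/GAN artifacts often shift it."""
    spec = power_spectrum(gray)
    h, w = spec.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    low_energy = spec[radius <= cutoff * min(h, w) / 2].sum()
    return float(1.0 - low_energy / (spec.sum() + 1e-12))
```

Such fixed band statistics can feed a simple classifier; learned decompositions (e.g., FAD in F3-Net) make the band selection trainable instead of hand-set.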
3.2.4 Data Driven

Data-driven forgery detection focuses on learning specific patterns and features from extensive image or video datasets to distinguish between genuine and potentially manipulated images. Some methods [96, 313] believe that images generated by specific models possess unique model fingerprints; based on this belief, forgery detection can be achieved by focusing on the generating model's training behavior. In addition, FakeSpotter [265] introduces the Neuron Coverage Criterion to capture layer-wise neuron activation behavior, monitoring the neural behavior of a deep face recognition system through a binary classifier to detect fake faces. There are also methods [102, 344] that attempt to classify the sources of different components in an image; for instance, Huang et al. [102] argue that the difference between explicit and implicit identity helps detect face swapping. There are numerous data-driven methods [45, 75, 79, 99], and it is not feasible to discuss each one in detail here.

3.3 Specific Related Domains

In this section, we briefly review related popular tasks beyond deepfake generation, such as Face Super-resolution (Sec. 3.3.1), Portrait Style Transfer (Sec. 3.3.2), Body Animation (Sec. 3.3.3), and Makeup Transfer (Sec. 3.3.4).

3.3.1 Face Super-resolution

• Convolutional Neural Networks. Early works [36, 101, 176] on facial super-resolution based on CNNs aim to leverage the powerful representational capabilities of CNNs to learn the mapping between low-resolution and high-resolution images from training samples. Depending on whether they focus on local details of the image, they can be divided into global methods [36, 104], local methods [101], and mixed methods [176].

• Generative Adversarial Network. GANs aim to reach the optimal output through an adversarial process between the generator and the discriminator. This type of method [209, 338] currently dominates the field owing to its flexible and efficient architecture.

3.3.2 Portrait Style Transfer

• Generative Adversarial Network. The most mature style transfer algorithms are GAN-based approaches [1, 120, 288]. However, due to the relatively poor stability of GANs, it is common for the generated images to contain artifacts and unreasonable components. 3DAvatarGAN [1] bridges a pre-trained 3D-GAN in the source domain with a 2D-GAN trained on an artistic dataset to achieve cross-domain generation. Scenimefy [120] utilizes semantic constraints provided by text models like CLIP to guide StyleGAN generation and applies a patch-based contrastive style loss to further enhance stylization and fine details.

• Diffusion-based methods [107, 146, 163] represent the generative process of cross-domain image transfer using diffusion processes. DiffusionGAN3D [146] combines 3D GANs [25] with text-to-image diffusion priors, introducing a relative distance loss and learnable tri-planes for specific scenarios to further enhance cross-domain transformation accuracy.

3.3.3 Body Animation

• Generative Adversarial Network. GAN-based approaches [112, 343, 356] aim to train a model to generate images whose conditional distribution resembles the target domain, thus transferring information from reference images to target images. CASD [356] is based on a style distribution module using a cross-attention mechanism, facilitating pose transfer between source semantic styles and target poses. VGFlow [112] introduces a visibility-guided flow module to preserve texture and perform style manipulation concurrently. However, existing methods still rely considerably on training samples and exhibit decreased performance when dealing with rare poses.

• Diffusion-based. Body animation with diffusion models aims to utilize diffusion processes to generate the propagation and interaction of movements between body parts based on a reference source. This approach [100, 177, 273] represents a current hot topic in research and deployment. LEO [273] focuses on the spatiotemporal continuity between generated actions, employing a Latent Motion Diffusion Model to represent motion as a series of flow maps during generation. Animate Anyone [100] harnesses the powerful generation capabilities of stable diffusion models combined with attention mechanisms to generate high-quality character animation videos.

3.3.4 Makeup Transfer

• Graphics-based Approaches. Before the introduction of neural networks, traditional computer graphics methods [73, 147] used image gradient editing and physics-based operations to understand the semantics of makeup. By decomposing the input image into multiple layers, each representing different facial information, traditional methods would warp the reference facial image onto the non-makeup target using facial landmarks for each layer. However, due to the limitations of manually designed
operators, the outputs of traditional methods often appear unnatural, with noticeable artifacts. Additionally, there is a tendency for background information to be modified to some extent.

• Generative Adversarial Network. Early deep learning-based methods [71] aim at fully automatic makeup transfer. However, these methods [26, 31] exhibit poor performance when faced with significant differences in pose and expression between the source and target faces and are unable to handle extreme makeup scenarios well. Several methods [119, 167, 349] propose solutions: PSGAN++ [167] comprises a Makeup Distillation Network, an Attentive Makeup Morphing module, a Style Transfer Network, and an Identity Extraction Network, further enhancing the ability of PSGAN [119] to perform targeted makeup transfer with detail preservation. ELeGANt [296], CUMTGAN [88], and HT-ASE [154] explore the preservation of detailed information; ELeGANt [296] encodes facial attributes into pyramid feature maps to retain high-frequency information. Matching facial semantic information, i.e., rendering makeup styles onto semantically corresponding positions of the target image, is often overlooked; SSAT [245] introduces the SSCFT module and a weakly supervised semantic loss for accurate semantic correspondence, and SSAT++ [246] further improves color fidelity matching, but both models are complex and costly to train. In addition, BeautyREC [295], based on a Transformer with long-range visual dependencies, achieves efficient global makeup transfer with significantly fewer model parameters than previous works.

4 Benchmark Results

We first introduce the evaluation metrics commonly used for each deepfake task, and then we evaluate the performance of representative methods for each reviewed field on the most widely used datasets, with data sourced from the original papers. Considering the differences in training datasets, testing datasets, and metrics used by different approaches, we strive to compare them as fairly as possible in each table.

4.1 Metrics

• Face Swapping. The most commonly used objective evaluation metrics for face swapping include ID Ret, Expression Error, Pose Error, and FID. ID Ret is calculated by a pre-trained face recognition model [258], measuring the distance between the generated face and the source face; a higher ID Ret indicates better preservation of identity information. Expression and pose errors quantify the differences in expression and pose between the generated face and the source face. These metrics are evaluated using a pose estimator [226] and a 3D facial model [47], which extract expression and pose vectors for the generated and source faces; lower values for expression error and pose error indicate higher facial expression and pose similarity between the swapped face and the source face. FID [92] is used to assess image quality, with lower FID values indicating that the generated images closely resemble authentic facial images in appearance.

• Face Reenactment. Face reenactment commonly uses a consistent set of evaluation metrics, including CSIM, SSIM [274], PSNR, LPIPS [336], LMD [32], and FID. CSIM describes the cosine similarity between the generated and source faces, calculated by ArcFace [46], with higher values indicating better performance. SSIM, PSNR, LPIPS, and FID are used to measure the quality of synthesized images. SSIM measures the structural similarity between two images, with higher values indicating a closer resemblance to natural images. PSNR quantifies the ratio between a signal's maximum possible power and the noise power, with higher values indicating higher quality. LPIPS assesses reconstruction fidelity using a pre-trained AlexNet [141] to extract feature maps for similarity score computation. As mentioned earlier, FID is used to evaluate image quality. LMD assesses the accuracy of lip shape in generated images or videos, with lower values indicating better model performance.

• Talking Face Generation. Expanding upon face reenactment metrics, talking face generation incorporates additional metrics, including M/F-LMD, Sync, LSE-C, and LSE-D. LSE-C and LSE-D are usually used to measure lip synchronization effectiveness [217]. The landmark distance on the mouth (M-LMD) [33] and the confidence score of SyncNet (Sync) measure synchronization between the generated lip motion and the input audio. F-LMD computes the average distance of all facial landmarks between predictions and ground truth (GT) to assess the generated expression. In addition, there are other meaningful metrics, such as AVD [155] for evaluating identity preservation performance, AUCON [83] for assessing facial pose and expression jointly, and AGD [55] for evaluating eye gaze changes. These newly proposed evaluation metrics enrich the performance assessment system by targeting various aspects of a model's behavior.

• Facial Attribute Editing. The standard evaluation metrics used in facial attribute manipulation are FID, LPIPS [336], KID [12], PSNR, and SSIM. KID is one of the distribution-level metrics: similar to FID, it measures the discrepancy between real and generated image features, using an unbiased kernel-based estimator.
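Several of these metrics reduce to a few lines of array arithmetic. The sketch below computes CSIM as the cosine similarity between two identity embeddings and PSNR between a prediction and its ground truth; the embeddings are assumed to come from a face recognition network such as ArcFace, and the helper names are ours rather than part of any benchmark toolkit.

```python
import numpy as np

def csim(emb_generated: np.ndarray, emb_source: np.ndarray) -> float:
    """Cosine similarity between two identity embeddings (higher is better)."""
    a = emb_generated / (np.linalg.norm(emb_generated) + 1e-12)
    b = emb_source / (np.linalg.norm(emb_source) + 1e-12)
    return float(np.dot(a, b))

def psnr(img_pred: np.ndarray, img_gt: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between a prediction and ground truth."""
    mse = np.mean((img_pred.astype(np.float64) - img_gt.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / (mse + 1e-12)))
```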
Table 7: Results of representative face swapping methods on FF++. Notations: ➊ CelebA-HQ, ➋ FFHQ, ➌ VGGFace, ➍ VGGFace2, ➎ VoxCeleb2.
Methods | Train | ID Ret.(%)↑ | Exp Err.↓ | Pose Err.↓ | FID↓ (Test: FF++)
FaceShifter [152] | ➊➋➌ | 97.38 | 2.06 | 2.96 | -
SimSwap [34] | ➍ | 92.83 | - | 1.53 | -
FaceInpainter [149] | ➊➋➌ | 97.63 | - | 2.21 | -
HifiFace [272] | ➍ | 98.48 | - | 2.63 | -
RAFSwap [283] | ➊ | 96.70 | 2.92 | 2.53 | -
Xu et al. [289] | ➋ | 90.05 | 2.79 | 2.46 | -
DiffSwap [345] | ➋ | 98.54 | 5.35 | 2.45 | 2.16
FlowFace [320] | ➊➋➍ | 99.26 | - | 2.66 | -
FlowFace++ [341] | ➊➋➍ | 99.51 | - | 2.20 | -
StyleIPSB [114] | ➋ | 95.05 | 2.23 | 3.58 | -
StyleSwap [293] | ➌➎ | 97.05 | 5.28 | 1.56 | 2.72
WSC-Swap [220] | ➊➋➌ | 99.88 | 5.01 | 1.51 | -

Table 8: Results of representative face reenactment methods on VoxCeleb for self-reenactment. Notations: ➊ VoxCeleb, ➋ VoxCeleb2, ➌ ETH-XGaze, ➍ Gaze360 [127], ➎ MPIIGaze [339], ➏ TalkingHead-1KH. In addition, we use gray to represent data that is partially uncertain.
Methods | Train | CSIM↑ | PSNR↑ | LPIPS↓ | FID↓ | SSIM↑ (Test: VoxCeleb)
HyperReenact [17] | ➊ | 0.710 | - | 0.230 | 27.10 | -
DG [97] | ➊ | 0.831 | - | - | 22.10 | 0.761
AVFR-GAN [4] | ➊ | - | 32.20 | - | 8.48 | 0.824
Free-HeadGAN [55] | ➊➌➍➎ | 0.810 | 22.16 | 0.100 | 35.40 | -
HiDe-NeRF [155] | ➊➋➏ | 0.931 | 21.90 | 0.084 | - | 0.862

Table 11: Evaluation results of the models involved in talking face generation on the MEAD dataset. Notations: ➊ MEAD, ➋ LRW, ➌ VoxCeleb2, ➍ HDTF.
Method | Train | CSIM↑ | LMD↓ | M/F-LMD↓ | Sync↑ | FID↓ | PSNR/SSIM↑ (Test: MEAD)
Xu et al. [284] | ➊ | 0.83 | 2.36 | - | 3.500 | 15.91 | 30.09/0.850
EMMN [250] | ➊➋ | - | - | 2.780/2.870 | 3.570 | - | 29.38/0.660
AMIGO [322] | ➊➌ | - | 2.44 | 2.140/2.440 | - | 19.59 | 30.29/0.820
SLIGO [233] | ➊ | 0.88 | 1.83 | - | 3.690 | - | -/0.790
Gan et al. [64] | ➊➌ | - | - | 2.250/2.470 | - | 19.69 | 21.75/0.680
DreamTalk [179] | ➊➌➍ | - | - | 2.910/1.930 | 3.780 | - | -/0.860
SPACE [82] | ➊➌ | - | - | - | 3.610 | 11.68 | -
TalkCLIP [178] | ➊ | - | - | 3.601/2.415 | 3.773 | - | -/0.829

Table 14: Results of the self-dataset performance on FF++. Notations: HQ (mild compression), LQ (heavy compression).
Methods | Train | FF++ (LQ) ACC(%) | FF++ (LQ) AUC(%) | FF++ (HQ) ACC(%) | FF++ (HQ) AUC(%)
F3-Net [219] | FF++ | 93.02 | 95.80 | 98.95 | 99.30
Masi et al. [181] | FF++ | 86.34 | - | 96.43 | -
Zhao et al. [342] | FF++ | 88.69 | 90.40 | 97.60 | 99.29
FDFL [150] | FF++ | 89.00 | 92.40 | 96.69 | 99.30
LipForensics [85] | FF++ | 94.20 | 98.10 | 98.80 | 99.70
RECCE [21] | FF++ | 91.03 | 95.02 | 97.06 | 99.32
Guo et al. [79] | FF++ | 92.76 | 96.85 | 99.24 | 99.75
MRL [304] | FF++ | 91.81 | 96.18 | 93.82 | 98.27
4.3 Main Results on Forgery Detection

The mainstream evaluation metrics for forgery detection are ACC and AUC. Table 14 presents the ACC and AUC of several detection models trained on FF++ [225] and tested on FF++ (HQ) and FF++ (LQ). LipForensics [85] exhibits robust performance on the strongly compressed FF++ (LQ), while Guo et al. [79] perform best on FF++ (HQ). Table 15 shows the cross-dataset evaluation using DFDC [50], Celeb-DF [158], Celeb-DFv2 [158], and DeeperForensics-1.0 [117] as validation sets. AVoiD-DF [301] and Zhao et al. [344] demonstrate excellent generalization ability, but there is still significant room for improvement on these datasets. Overall, detection performance lags most on DFDC, which has a larger sample size and more complex forgery methods.
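Both metrics can be computed directly from per-video (or per-frame) fakeness scores; a minimal illustration with scikit-learn follows, where the variable names, the toy scores, and the 0.5 threshold are our own choices rather than part of any benchmark protocol.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: 1 = fake, 0 = real; y_score: model's predicted probability of "fake".
y_true = np.array([0, 0, 1, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.90])

acc = float(np.mean((y_score >= 0.5).astype(int) == y_true))  # thresholded accuracy
auc = float(roc_auc_score(y_true, y_score))                   # threshold-free ranking metric
print(f"ACC={acc:.2f}, AUC={auc:.2f}")
```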
5 Future Prospects

This section provides a short discussion of future research directions that can facilitate and envision the development of deepfake generation and detection.

• Face Swapping. Generalization is a significant issue in face swapping models. While many models demonstrate excellent performance on their training sets, there is often noticeable performance degradation when they are applied to different datasets during testing. In addition, beyond the common evaluation metrics, various face swapping works employ different evaluation metric systems, lacking a unified evaluation protocol. This absence hinders researchers from intuitively assessing model performance. Therefore, establishing comprehensive experimental and evaluation frameworks is crucial for fair comparisons and for driving progress in the field.

• Face Reenactment. Existing methods have room for improvement, facing three main challenges: convenience, authenticity, and security. Many approaches struggle to balance lightweight deployment with generating high-quality reenactment effects, hindering the widespread adoption of facial reenactment technology in industry. Moreover, several methods claim to achieve high-quality facial reenactment, but they exhibit visible degradation in output during rapid pose changes or extreme lighting conditions in driving videos. Additionally, the computational complexity, consuming significant time and system resources, poses substantial challenges for practical applications.

• Talking Face Generation. Current methods strive to enhance the realism of generated conversational videos. However, they lack fine-grained control over the emotional nuances of the conversation: the matching of emotional intonation to audio and semantic content is not precise enough, and the control over emotional intensity is too coarse. In addition, the realistic correlation between head pose and facial expression movement seems insufficiently considered. Lastly, for text or audio with intense emotions, noticeable artifacts in head movement still occur to a significant extent.

• Facial Attribute Editing. Currently, mainstream facial attribute editing relies on disentanglement based on GANs, and diffusion models are gradually being introduced into this field. The primary challenge is effectively separating facial attributes to prevent unintended changes to other facial features during attribute editing. Additionally, there needs to be a universally accepted benchmark dataset and evaluation framework for fair assessment of facial editing.

• Forgery Detection. With the rapid development of facial forgery techniques, the central challenge in face forgery detection is accurately identifying various forgery methods using a single detection model. Simultaneously, ensuring that the model remains robust when detecting forgeries in the presence of disturbances such as compression is crucial. Most detection models follow a generic approach targeting common operational steps of a specific forgery method, such as the blending phase in face swapping or assessing temporal inconsistencies, but this manner limits the model's generalization capabilities. Moreover, as forgery techniques evolve, forged videos may evade detection by introducing interference during the detection process.

6 Conclusion

This survey comprehensively reviews the latest developments in the field of deepfake generation and detection; it is the first to cover a variety of related fields thoroughly and to discuss the latest technologies such as diffusion. Specifically, this paper covers an overview of basic background knowledge, including concepts of research tasks, the development of generative models and neural networks, and other information from closely related fields. Subsequently, we summarize the technical approaches adopted by different methods in the four mainstream generation fields and the detection field, and classify and discuss the methods from a technical perspective. In addition, we strive to fairly organize and benchmark the representative methods in each field. Finally, we summarize the current challenges and future research directions for each field.

Acknowledgement. This work is supported by the National Natural Science Foundation of China (No. 62371189).

References

1. Abdal, R., Lee, H.Y., Zhu, P., Chai, M., Siarohin, A., Wonka, P., Tulyakov, S.: 3davatargan: Bridging domains for personalized editable avatars. In: CVPR (2023) 6, 15
2. Afouras, T., Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Deep audio-visual speech recognition. TPAMI (2018) 5, 11
3. Afouras, T., Chung, J.S., Zisserman, A.: Lrs3-ted: a large-scale dataset for visual speech recognition. arXiv (2018) 5
4. Agarwal, M., Mukhopadhyay, R., Namboodiri, V.P., Jawahar, C.: Audio-visual face reenactment. In: WACV (2023) 17
5. Agarwal, S., Farid, H.: Detecting deep-fake videos from aural and oral dynamics. In: CVPR (2021) 14
6. Ancilotto, A., Paissan, F., Farella, E.: Ximswap: many-to-many face swapping for tinyml. TECS (2023) 1, 8
7. Aneja, S., Thies, J., Dai, A., Nießner, M.: Clipface: Text-guided editing of textured 3d morphable models. In: SIGGRAPH (2023) 12
8. Ba, Z., Liu, Q., Liu, Z., Wu, S., Lin, F., Lu, L., Ren, K.: Exposing the deception: Uncovering more forgery clues for deepfake detection. In: AAAI (2024) 13, 14
9. Bao, J., Chen, D., Wen, F., Li, H., Hua, G.: Cvae-gan: fine-grained image generation through asymmetric training. In: ICCV (2017) 4, 8
10. Bao, J., Chen, D., Wen, F., Li, H., Hua, G.: Towards open-set identity preserving face synthesis. In: CVPR (2018) 7, 8
11. Barattin, S., Tzelepis, C., Patras, I., Sebe, N.: Attribute-preserving face dataset anonymization via latent code optimization. In: CVPR (2023) 1, 7
12. Bińkowski, M., Sutherland, D.J., Arbel, M., Gretton, A.: Demystifying mmd gans. In: ICLR (2018) 5, 16
13. Bitouk, D., Kumar, N., Dhillon, S., Belhumeur, P., Nayar, S.K.: Face swapping: automatically replacing faces in photographs. In: SIGGRAPH (2008) 7
14. Blanz, V., Scherbaum, K., Vetter, T., Seidel, H.P.: Exchanging faces in images. In: Eurographics (2004) 7
15. Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv (2023) 1, 2, 4, 9
16. Bounareli, S., Tzelepis, C., Argyriou, V., Patras, I.: One-shot neural face reenactment via finding directions in gan's latent space. IJCV (2024) 9, 10
17. Bounareli, S., Tzelepis, C., Argyriou, V., Patras, I., Tzimiropoulos, G.: Hyperreenact: one-shot reenactment via jointly learning to refine and retarget faces. In: ICCV (2023) 1, 9, 10, 17
18. Bounareli, S., Tzelepis, C., Argyriou, V., Patras, I., Tzimiropoulos, G.: Stylemask: Disentangling the style space of stylegan2 for neural face reenactment. In: FG (2023) 9, 10
19. Burgos-Artizzu, X.P., Perona, P., Dollár, P.: Robust face landmark estimation under occlusion. In: ICCV (2013) 7
20. Cai, Q., Ma, M., Wang, C., Li, H.: Image neural style transfer: A review. Computers and Electrical Engineering (2023) 6
21. Cao, J., Ma, C., Yao, T., Chen, S., Ding, S., Yang, X.: End-to-end reconstruction-classification learning for face forgery detection. In: CVPR (2022) 13, 14, 18
22. Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A.: Vggface2: A dataset for recognising faces across pose and age. In: FG (2018) 5
23. Cao, W., Wang, T., Dong, A., Shu, M.: Transfs: Face swapping using transformer. In: FG (2023) 7, 9
24. Chai, L., Bau, D., Lim, S.N., Isola, P.: What makes fake images detectable? understanding properties that generalize. In: ECCV (2020) 13
25. Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient geometry-aware 3d generative adversarial networks. In: CVPR (2022) 11, 15
26. Chang, H., Lu, J., Yu, F., Finkelstein, A.: Pairedcyclegan: Asymmetric style transfer for applying and removing makeup. In: CVPR (2018) 6, 16
27. Chaudhuri, B., Vesdapunt, N., Shapiro, L., Wang, B.: Personalized face modeling for improved face reconstruction and motion retargeting. In: ECCV (2020) 6
28. Chaudhuri, B., Vesdapunt, N., Wang, B.: Joint face detection and facial motion retargeting for multiple faces. In: CVPR (2019) 5
29. Chen, B., Hu, S., Chen, Q., Du, C., Yi, R., Qian, Y., Chen, X.: Gstalker: Real-time audio-driven talking face generation via deformable gaussian splatting. arXiv (2024) 11
30. Chen, C., Gong, D., Wang, H., Li, Z., Wong, K.Y.K.: Learning spatial attention for face super-resolution. TIP (2020) 6
31. Chen, H.J., Hui, K.M., Wang, S.Y., Tsao, L.W., Shuai, H.H., Cheng, W.H.: Beautyglow: On-demand makeup transfer framework with reversible generative network. In: CVPR (2019) 16
32. Chen, L., Li, Z., Maddox, R.K., Duan, Z., Xu, C.: Lip movements generation at a glance. In: ECCV (2018) 5, 10, 11, 16
33. Chen, L., Maddox, R.K., Duan, Z., Xu, C.: Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In: CVPR (2019) 6, 10, 11, 16
34. Chen, R., Chen, X., Ni, B., Ge, Y.: Simswap: An efficient framework for high fidelity face swapping. In: ACM MM (2020) 7, 8, 17
35. Chen, X., Dong, C., Ji, J., Cao, J., Li, X.: Image manipulation detection by multi-view multi-scale supervision. In: ICCV (2021) 5
36. Chen, Z., Lin, J., Zhou, T., Wu, F.: Sequential gating ensemble network for noise robust multiscale face restoration. IEEE Transactions on Cybernetics (2019) 6, 15
37. Cho, K., Lee, J., Yoon, H., Hong, Y., Ko, J., Ahn, S., Kim, S.: Gaussiantalker: Real-time high-fidelity talking head synthesis with audio-driven 3d gaussian splatting. arXiv (2024) 11
38. Choi, J., Kim, T., Jeong, Y., Baek, S., Choi, J.: Exploiting style latent flows for generalizing deepfake video detection. In: CVPR (2024) 13, 14
39. Chung, J.S., Nagrani, A., Zisserman, A.: Voxceleb2: Deep speaker recognition. In: Interspeech (2018) 5
40. Cooke, M., Barker, J., Cunningham, S., Shao, X.: An audio-visual corpus for speech perception and automatic speech recognition. JASA (2006) 11
41. Cozzolino, D., Pianese, A., Nießner, M., Verdoliva, L.: Audio-visual person-of-interest deepfake detection. In: CVPR (2023) 13
42. Cui, K., Wu, R., Zhan, F., Lu, S.: Face transformer: Towards high fidelity and accurate face swapping. In: CVPR (2023) 7, 9
43. Dale, K., Sunkavalli, K., Johnson, M.K., Vlasic, D., Matusik, W., Pfister, H.: Video face replacement. In: SIGGRAPH (2011) 7
44. Dalva, Y., Altındiş, S.F., Dundar, A.: Vecgan: Image-to-image translation with interpretable latent directions. In: ECCV (2022) 12
45. Dang, H., Liu, F., Stehouwer, J., Liu, X., Jain, A.K.: On the detection of digital face manipulation. In: CVPR (2020) 14, 15
46. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR (2019) 16
47. Deng, Y., Yang, J., Xu, S., Chen, D., Jia, Y., Tong, X.: Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In: CVPR (2019) 16
48. DFD: Dfd. https://blog.research.google/2019/09/contributing-data-to-deepfake-detection.html (2019) 5
49. Ding, Z., Zhang, X., Xia, Z., Jebe, L., Tu, Z., Zhang, X.: Diffusionrig: Learning personalized priors for facial appearance editing. In: CVPR (2023) 12, 13
50. Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., Ferrer, C.C.: The deepfake detection challenge (dfdc) dataset. arXiv (2020) 5, 18
51. Dolhansky, B., Howes, R., Pflaum, B., Baram, N., Ferrer, C.C.: The deepfake detection challenge (dfdc) preview dataset. arXiv (2019) 5
52. Dong, J., Wang, W., Tan, T.: Casia image tampering detection evaluation database. In: ChinaSIP (2013) 14
53. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) 4
54. Dou, P., Shah, S.K., Kakadiaris, I.A.: End-to-end 3d face reconstruction with deep neural networks. In: CVPR (2017) 6
55. Doukas, M.C., Ververas, E., Sharmanska, V., Zafeiriou, S.: Free-headgan: Neural talking head synthesis with explicit gaze control. TPAMI (2023) 9, 16, 17
56. Doukas, M.C., Zafeiriou, S., Sharmanska, V.: Headgan: One-shot neural head synthesis and editing. In: ICCV (2021) 9
57. Du, C., Chen, Q., He, T., Tan, X., Chen, X., Yu, K., Zhao, S., Bian, J.: Dae-talker: High fidelity speech-driven talking face generation with diffusion autoencoder. In: ACM MM (2023) 11
58. Fan, B., Wang, L., Soong, F.K., Xie, L.: Photo-real talking head with deep bidirectional lstm. In: ICASSP (2015) 6, 10
59. Feng, C., Chen, Z., Owens, A.: Self-supervised video forensics by audio-visual anomaly detection. In: CVPR (2023) 13
60. Feng, J., Singhal, P.: 3d face style transfer with a hybrid solution of nerf and mesh rasterization. In: WACV (2024) 6
61. Feng, M., Gilani, S.Z., Wang, Y., Mian, A.: 3d face reconstruction from light field images: A model-free approach. In: ECCV (2018) 6
62. Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T.: Leveraging frequency analysis for deep fake image recognition. In: ICML (2020) 14
63. Fu, H., Wang, Z., Gong, K., Wang, K., Chen, T., Li, H., Zeng, H., Kang, W.: Mimic: Speaking style disentanglement for speech-driven 3d facial animation. In: AAAI (2024) 10
64. Gan, Y., Yang, Z., Yue, X., Sun, L., Yang, Y.: Efficient emotional adaptation for audio-driven talking-head generation. In: ICCV (2023) 10, 11, 18
65. Gao, G., Huang, H., Fu, C., Li, Z., He, R.: Information bottleneck disentanglement for identity swapping. In: CVPR (2021) 8
66. Gao, K., Gao, Y., He, H., Lu, D., Xu, L., Li, J.: Nerf: Neural radiance field in 3d vision, a comprehensive review. arXiv (2022) 2, 5
67. Gao, Y., Wei, F., Bao, J., Gu, S., Chen, D., Wen, F., Lian, Z.: High-fidelity and arbitrary face editing. In: CVPR (2021) 12, 18
68. Gao, Y., Zhou, Y., Wang, J., Li, X., Ming, X., Lu, Y.: High-fidelity and freely controllable talking head video generation. In: CVPR (2023) 9, 17
69. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NeurIPS (2014) 2, 4, 5
70. Gross, R., Matthews, I., Cohn, J., Kanade, T., Baker, S.: Multi-pie. IVC (2010) 7
71. Gu, Q., Wang, G., Chiu, M.T., Tai, Y.W., Tang, C.K.: Ladn: Local adversarial disentangling network for facial makeup and de-makeup. In: ICCV (2019) 6, 16
72. Gu, Z., Chen, Y., Yao, T., Ding, S., Li, J., Ma, L.: Delving into the local: Dynamic inconsistency learning for deepfake video detection. In: AAAI (2022) 13, 14
73. Guo, D., Sim, T.: Digital face makeup by example. In: CVPR (2009) 15
74. Guo, H., Niu, D., Kong, X., Zhao, X.: Face replacement based on 2d dense mapping. In: ICIGP (2019) 7
75. Guo, X., Liu, X., Ren, Z., Grosz, S., Masi, I., Liu, X.: Hierarchical fine-grained image forgery detection and localization. In: CVPR (2023) 14, 15
76. Guo, Y., Chen, K., Liang, S., Liu, Y.J., Bao, H., Zhang, J.: Ad-nerf: Audio driven neural radiance fields for talking head synthesis. In: ICCV (2021) 11
77. Guo, Y., Yang, C., Rao, A., Wang, Y., Qiao, Y., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In: ICLR (2024) 1, 2, 4
78. Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In: ECCV (2016) 7
79. Guo, Y., Zhen, C., Yan, P.: Controllable guide-space for generalizable face forgery detection. In: ICCV (2023) 15, 18
80. Guo, Z., Jia, Z., Wang, L., Wang, D., Yang, G., Kasabov, N.: Constructing new backbone networks via space-frequency interactive convolution for deepfake detection. TIFS (2023) 14
81. Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Fei-Fei, L., Essa, I., Jiang, L., Lezama, J.: Photorealistic video generation with diffusion models. arXiv (2023) 4
82. Gururani, S., Mallya, A., Wang, T.C., Valle, R., Liu, M.Y.: Space: Speech-driven portrait animation with controllable expression. In: ICCV (2023) 10, 11, 18
83. Ha, S., Kersner, M., Kim, B., Seo, S., Kim, D.: Marionette: Few-shot face reenactment preserving identity of unseen targets. In: AAAI (2020) 9, 16
84. Haliassos, A., Mira, R., Petridis, S., Pantic, M.: Leveraging real talking faces via self-supervision for robust forgery detection. In: CVPR (2022) 2, 13, 18
85. Haliassos, A., Vougioukas, K., Petridis, S., Pantic, M.: Lips don't lie: A generalisable and robust approach to face forgery detection. In: CVPR (2021) 14, 18
86. Han, S., Lin, C., Shen, C., Wang, Q., Guan, X.: Interpreting adversarial examples in deep learning: A review. CSUR (2023) 6
87. Han, Y., Zhang, J., Zhu, J., Li, X., Ge, Y., Li, W., Wang, C., Liu, Y., Liu, X., Tai, Y.: A generalist facex via learning unified facial representation. arXiv (2023) 6, 7, 8
88. Hao, M., Gu, G., Fu, H., Liu, C., Cui, D.: Cumtgan: An instance-level controllable u-net gan for facial makeup transfer. Knowledge-Based Systems (2022) 6, 16
89. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016) 4
90. He, P., Li, H., Wang, H.: Detection of fake images via the ensemble of deep representations from multi color spaces. In: ICIP (2019) 2, 13
91. He, Z., Zuo, W., Kan, M., Shan, S., Chen, X.: Attgan: Facial attribute editing by only changing what you want. TIP (2019) 12
92. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: NeurIPS (2017) 5, 16
93. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020) 2, 4, 5
94. Hou, X., Zhang, X., Li, Y., Shen, L.: Textface: Text-to-style mapping based face generation and manipulation. TMM (2022) 12, 18
95. Hou, X., Zhang, X., Liang, H., Shen, L., Lai, Z., Wan, J.: Guidedstyle: Attribute knowledge guided style manipulation for semantic face editing. Neural Networks (2022) 12, 18
96. Hsu, C.C., Lee, C.Y., Zhuang, Y.X.: Learning to detect fake face images in the wild. In: IS3C (2018) 15
97. Hsu, G.S., Tsai, C.H., Wu, H.Y.: Dual-generator face reenactment. In: CVPR (2022) 1, 9, 17
98. Hsu, Y.F., Chang, S.F.: Detecting image splicing using geometry invariants and camera characteristics consistency. In: ICME (2006) 14
99. Hu, J., Liao, X., Liang, J., Zhou, W., Qin, Z.: Finfer: Frame inference-based deepfake detection for high-visual-quality videos. In: AAAI (2022) 14, 15
100. Hu, L., Gao, X., Zhang, P., Sun, K., Zhang, B., Bo, L.: Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv (2023) 15
101. Hu, X., Ma, P., Mai, Z., Peng, S., Yang, Z., Wang, L.: Face hallucination from low quality images using definition-scalable inference. PR (2019) 15
102. Huang, B., Wang, Z., Yang, J., Ai, J., Zou, Q., Wang, Q., Ye, D.: Implicit identity driven deepfake face swapping detection. In: CVPR (2023) 5, 14, 15
103. Huang, G.B., Mattar, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In: Workshop on Faces in 'Real-Life' Images: detection, alignment, and recognition (2008) 5
104. Huang, H., He, R., Sun, Z., Tan, T.: Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: ICCV (2017) 6, 15
105. Huang, W., Tu, S., Xu, L.: Ia-faces: A bidirectional method for semantic face editing. Neural Networks (2023) 2, 12, 18
106. Huang, Z., Chan, K.C., Jiang, Y., Liu, Z.: Collaborative diffusion for multi-modal face generation and editing. In: CVPR (2023) 12
107. Hur, J., Choi, J., Han, G., Lee, D.J., Kim, J.: Expanding expressiveness of diffusion models with limited data via self-distillation based fine-tuning. In: WACV (2024) 6, 15
108. Hwang, S., Hyung, J., Kim, D., Kim, M.J., Choo, J.: Faceclipnerf: Text-driven 3d face manipulation using deformable neural radiance fields. In: ICCV (2023) 5
109. Ilyas, H., Javed, A., Malik, K.M.: Avfakenet: A unified end-to-end dense swin transformer deep learning model for audio–visual deepfakes detection. Applied Soft Computing (2023) 2
110. Irshad, M.Z., Zakharov, S., Liu, K., Guizilini, V., Kollar, T., Gaidon, A., Kira, Z., Ambrus, R.: Neo 360: Neural fields for sparse view synthesis of outdoor scenes. In: ICCV (2023) 5
111. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: ICCV (2017) 4
112. Jain, R., Singh, K.K., Hemani, M., Lu, J., Sarkar, M., Ceylan, D., Krishnamurthy, B.: Vgflow: Visibility guided flow network for human reposing. In: CVPR (2023) 15
113. Ji, X., Zhou, H., Wang, K., Wu, W., Loy, C.C., Cao, X., Xu, F.: Audio-driven emotional video portraits. In: CVPR (2021) 11
114. Jiang, D., Song, D., Tong, R., Tang, M.: Styleipsb: Identity-preserving semantic basis of stylegan for high fidelity face swapping. In: CVPR (2023) 8, 17
115. Jiang, K., Chen, S.Y., Liu, F.L., Fu, H., Gao, L.: Nerffaceediting: Disentangled face editing in neural radiance fields. In: SIGGRAPH (2022) 2, 12
116. Jiang, K., Wang, Z., Yi, P., Wang, G., Gu, K., Jiang, J.: Atmfn: Adaptive-threshold-based multi-model fusion network for compressed face hallucination. TMM (2019) 6
117. Jiang, L., Li, R., Wu, W., Qian, C., Loy, C.C.: Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection. In: CVPR (2020) 5, 18
118. Jiang, L., Zhang, J., Deng, B., Li, H., Liu, L.: 3d face reconstruction with geometry details from a single image. TIP (2018) 6
119. Jiang, W., Liu, S., Gao, C., Cao, J., He, R., Feng, J., Yan, S.: Psgan: Pose and expression robust spatial-aware gan for customizable makeup transfer. In: CVPR (2020) 6, 16
120. Jiang, Y., Jiang, L., Yang, S., Loy, C.C.: Scenimefy: Learning to craft anime scene via semi-supervised image-to-image translation. In: ICCV (2023) 6, 15
121. Jo, Y., Park, J.: Sc-fegan: Face editing generative adversarial network with user's sketch and color. In: ICCV (2019) 12
122. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: ECCV (2016) 5
123. Karras, J., Holynski, A., Wang, T.C., Kemelmacher-Shlizerman, I.: Dreampose: Fashion image-to-video synthesis via stable diffusion. In: ICCV (2023) 11
124. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for improved quality, stability, and variation. In: ICLR (2018) 5
125. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: CVPR (2019) 1, 2, 4, 5, 12
126. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of stylegan. In: CVPR (2020) 1, 4
127. Kellnhofer, P., Recasens, A., Stent, S., Matusik, W., Torralba, A.: Gaze360: Physically unconstrained gaze estimation in the wild. In: ICCV (2019) 9, 17
128. Kemelmacher-Shlizerman, I., Basri, R.: 3d face reconstruction from a single image using a single reference face shape. TPAMI (2010) 6
129. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM TOG (2023) 11
130. Khalid, H., Tariq, S., Kim, M., Woo, S.S.: Fakeavceleb: A novel audio-video multimodal deepfake dataset. NeurIPS (2021) 5, 14
131. Kim, G., Shim, H., Kim, H., Choi, Y., Kim, J., Yang, E.: Diffusion video autoencoders: Toward temporally consistent face video editing via disentangled video encoding. In: CVPR (2023) 12, 18
132. Kim, H., Elgharib, M., Zollhöfer, M., Seidel, H.P., Beeler, T., Richardt, C., Theobalt, C.: Neural style-preserving visual dubbing. ACM TOG (2019) 9
133. Kim, H., Garrido, P., Tewari, A., Xu, W., Thies, J., Niessner, M., Pérez, P., Richardt, C., Zollhöfer, M., Theobalt, C.: Deep video portraits. ACM TOG (2018) 9
134. Kim, J., Lee, J., Zhang, B.T.: Smooth-swap: a simple enhancement for face-swapping with smoothness. In: CVPR (2022) 7, 8
135. Kim, K., Kim, Y., Cho, S., Seo, J., Nam, J., Lee, K., Kim, S., Lee, K.: Diffface: Diffusion-based face swapping with facial guidance. arXiv (2022) 7, 8
136. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. stat (2014) 1, 4, 5
137. Ko, S., Dai, B.R.: Multi-laplacian gan with edge enhancement for face super resolution. In: ICPR (2021) 6
138. Korshunov, P., Marcel, S.: Deepfakes: a new threat to face recognition? assessment and detection. arXiv (2018) 5, 14
139. Korshunova, I., Shi, W., Dambre, J., Theis, L.: Fast face-swap using convolutional neural networks. In: ICCV (2017) 9
140. Koujan, M.R., Doukas, M.C., Roussos, A., Zafeiriou, S.: Head2head: Video-based neural head synthesis. In: FG (2020) 1, 9
141. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NeurIPS (2012) 4, 16
142. Kwon, P., You, J., Nam, G., Park, S., Chae, G.: Kodf: A large-scale korean deepfake detection dataset. In: ICCV (2021) 5
143. Le, B.M., Woo, S.S.: Quality-agnostic deepfake detection with intra-model collaborative learning. In: ICCV (2023) 5
Deepfake Generation and Detection: A Benchmark and Survey 23
144. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient- 164. Liu, J., Wang, Q., Fan, H., Wang, Y., Tang, Y., Qu, L.:
based learning applied to document recognition. Pro- Residual denoising diffusion models. In: CVPR (2024) 4
ceedings of the IEEE (1998) 4 165. Liu, K., Perov, I., Gao, D., Chervoniy, N., Zhou, W.,
145. Lee, D., Kim, C., Yu, S., Yoo, J., Park, G.M.: Radio: Zhang, W.: Deepfacelab: Integrated, flexible and exten-
Reference-agnostic dubbing video synthesis. In: WACV sible face-swapping framework. PR (2023) 7, 8
(2024) 10, 11 166. Liu, R., Ma, B., Zhang, W., Hu, Z., Fan, C., Lv, T., Ding,
146. Lei, B., Yu, K., Feng, M., Cui, M., Xie, X.: Diffusion- Y., Cheng, X.: Towards a simultaneous and granular
gan3d: Boosting text-guided 3d generation and domain identity-expression control in personalized face genera-
adaption by combining 3d gans and diffusion priors. tion. arXiv (2024) 1, 7, 8
arXiv (2023) 15 167. Liu, S., Jiang, W., Gao, C., He, R., Feng, J., Li, B., Yan,
147. Li, C., Zhou, K., Lin, S.: Simulating makeup through S.: Psgan++: robust detail-preserving makeup transfer
physics-based manipulation of intrinsic image layers. In: and removal. TPAMI (2021) 6, 16
CVPR (2015) 15 168. Liu, Y., Hu, B., Huang, J., Tai, Y.W., Tang, C.K.: In-
148. Li, D., Zhao, K., Wang, W., Peng, B., Zhang, Y., Dong, stance neural radiance field. In: ICCV (2023) 5
J., Tan, T.: Ae-nerf: Audio enhanced neural radiance 169. Liu, Y., Li, Q., Deng, Q., Sun, Z., Yang, M.H.: Gan-based
field for few shot talking head synthesis. In: AAAI (2024) facial attribute manipulation. TPAMI (2023) 2
11 170. Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning,
149. Li, J., Li, Z., Cao, J., Song, X., He, R.: Faceinpainter: J., Cao, Y., Zhang, Z., Dong, L., et al.: Swin transformer
High fidelity face adaptation to heterogeneous domains. v2: Scaling up capacity and resolution. In: CVPR (2022)
In: CVPR (2021) 7, 17 5
150. Li, J., Xie, H., Li, J., Wang, Z., Zhang, Y.: Frequency- 171. Liu, Z., Li, M., Zhang, Y., Wang, C., Zhang, Q., Wang,
aware discriminative feature learning supervised by J., Nie, Y.: Fine-grained face swapping via regional gan
single-center loss for face forgery detection. In: CVPR inversion. In: CVPR (2023) 8
(2021) 2, 14, 18 172. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z.,
151. Li, J., Zhang, J., Bai, X., Zheng, J., Ning, X., Zhou, J., Lin, S., Guo, B.: Swin transformer: Hierarchical vision
Gu, L.: Talkinggaussian: Structure-persistent 3d talking transformer using shifted windows. In: ICCV (2021) 4,
head synthesis via gaussian splatting. arXiv (2024) 11 5, 9
152. Li, L., Bao, J., Yang, H., Chen, D., Wen, F.: Faceshifter: 173. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face
Towards high fidelity and occlusion aware face swapping. attributes in the wild. In: ICCV (2015) 5
CVPR (2020) 7, 8, 14, 17 174. Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell,
153. Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, T., Xie, S.: A convnet for the 2020s. In: CVPR (2022) 4
F., Guo, B.: Face x-ray for more general face forgery 175. Liu, Z., Qi, X., Torr, P.H.: Global texture enhancement
detection. In: CVPR (2020) 14, 18 for fake face detection in the wild. In: CVPR (2020) 13,
154. Li, M., Yu, W., Liu, Q., Li, Z., Li, R., Zhong, B., Zhang, 14
S.: Hybrid transformers with attention-guided spatial 176. Lu, T., Wang, J., Jiang, J., Zhang, Y.: Global-local fu-
embeddings for makeup transfer and removal. TCSVT sion network for face super-resolution. Neurocomputing
(2023) 6, 16 (2020) 15
155. Li, W., Zhang, L., Wang, D., Zhao, B., Wang, Z., Chen, 177. Ma, Y., He, Y., Cun, X., Wang, X., Shan, Y., Li, X.,
M., Zhang, B., Wang, Z., Bo, L., Li, X.: One-shot high- Chen, Q.: Follow your pose: Pose-guided text-to-video
fidelity talking-head synthesis with deformable neural generation using pose-free videos. arXiv (2023) 4, 6, 15
radiance field. In: CVPR (2023) 9, 10, 16, 17 178. Ma, Y., Wang, S., Ding, Y., Ma, B., Lv, T., Fan, C., Hu,
156. Li, Y., Chang, M.C., Lyu, S.: In ictu oculi: Exposing ai Z., Deng, Z., Yu, X.: Talkclip: Talking head generation
generated fake face videos by detecting eye blinking. In: with text-guided expressive speaking styles. arXiv (2023)
WIFS (2018) 5, 13 10, 18
157. Li, Y., Ma, C., Yan, Y., Zhu, W., Yang, X.: 3d-aware 179. Ma, Y., Zhang, S., Wang, J., Wang, X., Zhang, Y., Deng,
face swapping. In: CVPR (2023) 7 Z.: Dreamtalk: When expressive talking head generation
158. Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-df: A meets diffusion probabilistic models. arXiv (2023) 1, 11,
large-scale challenging dataset for deepfake forensics. In: 18
CVPR (2020) 5, 18 180. Maninchedda, F., Oswald, M.R., Pollefeys, M.: Fast 3d
159. Liang, B., Pan, Y., Guo, Z., Zhou, H., Hong, Z., Han, X., reconstruction of faces with glasses. In: CVPR (2017) 6
Han, J., Liu, J., Ding, E., Wang, J.: Expressive talking 181. Masi, I., Killekar, A., Mascarenhas, R.M., Gurudatt, S.P.,
head generation with granular audio-visual control. In: AbdAlmageed, W.: Two-branch recurrent network for
CVPR (2022) 10, 11, 17 isolating deepfakes in videos. In: ECCV (2020) 14, 18
160. Lin, J., Yuan, Y., Shao, T., Zhou, K.: Towards high- 182. Maze, B., Adams, J., Duncan, J.A., Kalka, N., Miller,
fidelity 3d face reconstruction from in-the-wild images T., Otto, C., Jain, A.K., Niggel, W.T., Anderson, J.,
using graph convolutional networks. In: CVPR (2020) 6 Cheney, J., et al.: Iarpa janus benchmark-c: Face dataset
161. Lin, Y., Wang, S., Lin, Q., Tang, F.: Face swapping and protocol. In: ICB (2018) 7
under large pose variations: A 3d model based approach. 183. McCloskey, S., Albright, M.: Detecting gan-generated
In: ICME (2012) 7 imagery using saturation cues. In: ICIP (2019) 13
162. Liu, F., Yu, L., Xie, H., Liu, C., Ding, Z., Yang, Q., 184. Melnik, A., Miasayedzenkau, M., Makaravets, D., Pirsh-
Zhang, Y.: High fidelity face swapping via semantics tuk, D., Akbulut, E., Holzmann, D., Renusch, T., Re-
disentanglement and structure enhancement. In: ACM ichert, G., Ritter, H.: Face generation and editing with
MM (2023) 7, 8 stylegan: A survey. TPAMI (2024) 2
163. Liu, J., Huang, H., Jin, C., He, R.: Portrait diffu- 185. Miao, C., Tan, Z., Chu, Q., Yu, N., Guo, G.: Hierar-
sion: Training-free face stylization with chain-of-painting. chical frequency-assisted interactive networks for face
arXiv (2023) 6, 15 manipulation detection. TIFS (2022) 14
24 Gan Pei, Jiangning Zhang, et al.
186. Milborrow, S., Morkel, J., Nicolls, F.: The muct land- 209. Pan, Y., Tang, J., Tjahjadi, T.: Lpsrgan: Generative
marked face database. PRASA (2010) 7 adversarial networks for super-resolution of license plate
187. Mildenhall, B., Srinivasan, P., Tancik, M., Barron, J., image. Neurocomputing (2024) 15
Ramamoorthi, R., Ng, R.: Nerf: Representing scenes 210. Pang, Y., Zhang, Y., Quan, W., Fan, Y., Cun, X., Shan,
as neural radiance fields for view synthesis. In: ECCV Y., Yan, D.m.: Dpe: Disentanglement of pose and ex-
(2020) 2, 5, 12 pression for general video portrait editing. In: CVPR
188. Min, Z., Zhuang, B., Schulter, S., Liu, B., Dunn, E., (2023) 2
Chandraker, M.: Neurocs: Neural nocs supervision for 211. Parkhi, O., Vedaldi, A., Zisserman, A.: Deep face recog-
monocular 3d object localization. In: CVPR (2023) 5 nition. In: BMVC (2015) 5
189. Mir, A., Alonso, E., Mondragón, E.: Dit-head: High- 212. Peng, C., Miao, Z., Liu, D., Wang, N., Hu, R., Gao, X.:
resolution talking head synthesis using diffusion trans- Where deepfakes gaze at? spatial-temporal gaze incon-
formers. In: ICAART (2024) 1, 6, 11 sistency analysis for video face forgery detection. TIFS
190. Mirsky, Y., Lee, W.: The creation and detection of deep- (2024) 13, 14
fakes: A survey. CSUR (2021) 2 213. Peng, X., Zhu, J., Jiang, B., Tai, Y., Luo, D., Zhang,
191. Mirza, M., Osindero, S.: Conditional generative adver- J., Lin, W., Jin, T., Wang, C., Ji, R.: Portraitbooth:
sarial nets. arXiv (2014) 4 A versatile portrait model for fast identity-preserved
192. Moniz, J.R.A., Beckham, C., Rajotte, S., Honari, S., Pal, personalization. In: CVPR (2024) 6
C.: Unsupervised depth estimation, 3d face rotation and 214. Peng, Z., Hu, W., Shi, Y., Zhu, X., Zhang, X., Zhao, H.,
replacement. In: NeurIPS (2018) 8 He, J., Liu, H., Fan, Z.: Synctalk: The devil is in the
193. Moore, S., Bowden, R.: Multi-view pose and facial ex- synchronization for talking head synthesis. In: CVPR’24
pression recognition. In: BMVC (2010) 5, 9 (2024) 11
194. Morales, A., Piella, G., Sukno, F.M.: Survey on 3d face 215. Peng, Z., Wu, H., Song, Z., Xu, H., Zhu, X., He, J.,
reconstruction from uncalibrated images. Computer Liu, H., Fan, Z.: Emotalk: Speech-driven emotional dis-
Science Review (2021) 6 entanglement for 3d face animation. In: ICCV (2023)
195. Mosaddegh, S., Simon, L., Jurie, F.: Photorealistic face 11
de-identification by aggregating donors’ face components. 216. Pérez, J.C., Nguyen-Phuoc, T., Cao, C., Sanakoyeu,
In: ACCV (2015) 7 A., Simon, T., Arbeláez, P., Ghanem, B., Thabet, A.,
196. Moser, B.B., Raue, F., Frolov, S., Palacio, S., Hees, J., Pumarola, A.: Styleavatar: Stylizing animatable head
Dengel, A.: Hitchhiker’s guide to super-resolution: Intro- avatars. In: WACV (2024) 6
duction and recent advances. TPAMI (2023) 6 217. Prajwal, K., Mukhopadhyay, R., Namboodiri, V.P., Jawa-
197. Moser, B.B., Shanbhag, A.S., Raue, F., Frolov, S., har, C.: A lip sync expert is all you need for speech to
Palacio, S., Dengel, A.: Diffusion models, image super- lip generation in the wild. In: ACM MM (2020) 5, 10,
resolution and everything: A survey. In: AAAI (2024) 11, 16
218. Preechakul, K., Chatthee, N., Wizadwongsa, S., Suwa-
6
janakorn, S.: Diffusion autoencoders: Toward a meaning-
198. Nagrani, A., Chung, J.S., Zisserman, A.: Voxceleb: a
ful and decodable representation. In: CVPR (2022) 12,
large-scale speaker identification dataset. In: Interspecch
18
(2017) 5, 17
219. Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J.: Thinking
199. Narayan, K., Agarwal, H., Thakral, K., Mittal, S., Vatsa,
in frequency: Face forgery detection by mining frequency-
M., Singh, R.: Deephy: On deepfake phylogeny. In: IJCB
aware clues. In: ECCV (2020) 2, 14, 18
(2022) 5
220. Ren, X., Chen, X., Yao, P., Shum, H.Y., Wang, B.:
200. Narayan, K., Agarwal, H., Thakral, K., Mittal, S., Vatsa,
Reinforced disentanglement for face swapping without
M., Singh, R.: Df-platter: multi-face heterogeneous deep-
skip connection. In: ICCV (2023) 7, 8, 17
fake dataset. In: CVPR (2023) 5 221. Richie, C., Warburton, S., Carter, M.: Audiovisual
201. Natsume, R., Yatagawa: Rsgan: face swapping and edit- database of spoken american english. LDC (2009) 11
ing using face and hair representation in latent spaces. 222. Rochow, A., Schwarz, M., Behnke, S.: Fsrt: Facial scene
In: SIGGRAPH (2018) 6, 7, 8 representation transformer for face reenactment from
202. Nguyen, H.H., Yamagishi, J., Echizen, I.: Capsule- factorized appearance, head-pose, and facial expression
forensics: Using capsule networks to detect forged images features. In: CVPR (2024) 9
and videos. In: ICASSP (2019) 13 223. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Om-
203. Nirkin, Y., Keller, Y., Hassner, T.: Fsgan: Subject ag- mer, B.: High-resolution image synthesis with latent
nostic face swapping and reenactment. In: ICCV (2019) diffusion models. In: CVPR (2022) 2, 4
7, 8 224. Rosberg, F., Aksoy, E.E., Alonso-Fernandez, F., Englund,
204. Nirkin, Y., Keller, Y., Hassner, T.: Fsganv2: Improved C.: Facedancer: pose-and occlusion-aware high fidelity
subject agnostic face swapping and reenactment. TPAMI face swapping. In: WACV (2023) 8
(2022) 7 225. Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies,
205. Nirkin, Y., Masi, I., Tuan, A.T., Hassner, T., Medioni, G.: J., Nießner, M.: Faceforensics++: Learning to detect
On face segmentation, face swapping, and face perception. manipulated facial images. In: ICCV (2019) 5, 17, 18
In: FG (2018) 7 226. Ruiz, N., Chong, E., Rehg, J.M.: Fine-grained head pose
206. Nirkin, Y., Wolf, L., Keller, Y., Hassner, T.: Deepfake estimation without keypoints. In: CVPR (2018) 5, 16
detection based on discrepancies between faces and their 227. Sha, T., Zhang, W., Shen, T., Li, Z., Mei, T.: Deep
context. TPAMI (2021) 13, 14 person generation: A survey from the perspective of face,
207. Oorloff, T., Yacoob, Y.: Robust one-shot face video re- pose, and cloth synthesis. CSUR (2023) 1, 2
enactment using hybrid latent spaces of stylegan2. In: 228. Sharma, P., Tewari, A., Du, Y., Zakharov, S., Ambrus,
ICCV (2023) 9, 10 R.A., Gaidon, A., Freeman, W.T., Durand, F., Tenen-
208. Ozkan, S., Ozay, M., Robinson, T.: Conceptual and hier- baum, J.B., Sitzmann, V.: Neural groundplans: Persis-
archical latent space decomposition for face editing. In: tent neural scene representations from a single image.
ICCV (2023) 12 In: ICLR (2023) 5
Deepfake Generation and Detection: A Benchmark and Survey 25
229. Sharma, S., Kumar, V.: 3d face reconstruction in deep 251. Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C.,
learning era: A survey. ACME (2022) 6 Nießner, M.: Face2face: Real-time face capture and reen-
230. Shen, S., Li, W., Zhu, Z., Duan, Y., Zhou, J., Lu, J.: actment of rgb videos. In: CVPR (2016) 9
Learning dynamic facial radiance fields for few-shot talk- 252. Tian, L., Wang, Q., Zhang, B., Bo, L.: Emo: Emote por-
ing head synthesis. In: ECCV (2022) 10, 11 trait alive-generating expressive portrait videos with au-
231. Shen, W., Liu, R.: Learning residual images for face dio2video diffusion model under weak conditions. arXiv
attribute manipulation. In: CVPR (2017) 12 (2024) 11
232. Shen, Y., Gu, J., Tang, X., Zhou, B.: Interpreting the 253. Tripathy, S., Kannala, J., Rahtu, E.: Icface: Interpretable
latent space of gans for semantic face editing. In: CVPR and controllable face reenactment using gans. In: WACV
(2020) 12 (2020) 9, 10
233. Sheng, Z., Nie, L., Zhang, M., Chang, X., Yan, Y.: 254. Tulsiani, S., Zhou, T., Efros, A.A., Malik, J.: Multi-view
Stochastic latent talking face generation towards emo- supervision for single-view reconstruction via differen-
tional expressions and head poses. TCSVT (2023) 10, tiable ray consistency. In: CVPR (2017) 6
18 255. Van Den Oord, A., Vinyals, O., et al.: Neural discrete
234. Shi, Y., Li, G., Cao, Q., Wang, K., Lin, L.: Face hallu- representation learning. In: NeurIPS (2017) 1, 4
cination by attentive sequence optimization with rein- 256. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,
forcement learning. TPAMI (2019) 6 Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: At-
235. Shiohara, K., Yamasaki, T.: Detecting deepfakes with tention is all you need. In: NeurIPS (2017) 4
self-blended images. In: CVPR (2022) 13, 14, 18 257. Vyas, K., Pareek, P., Jayaswal, R., Patil, S.: Analysing
236. Shiohara, K., Yang, X., Taketomi, T.: Blendface: Re- the landscape of deep fake detection: A survey. IJISAE
designing identity encoders for face-swapping. In: ICCV (2024) 2
(2023) 1, 7, 8 258. Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou,
237. Shu, C., Wu, H., Zhou, H., Liu, J., Hong, Z., Ding, C., J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for
Han, J., Liu, J., Ding, E., Wang, J.: Few-shot head deep face recognition. In: CVPR (2018) 5, 16
swapping in the wild. In: CVPR (2022) 5 259. Wang, J., Qian, X., Zhang, M., Tan, R.T., Li, H.: Seeing
238. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Gan- what you said: Talking face generation guided by a lip
guli, S.: Deep unsupervised learning using nonequilibrium reading expert. In: CVPR (2023) 10, 11
260. Wang, J., Wu, Z., Ouyang, W., Han, X., Chen, J., Jiang,
thermodynamics. In: ICML (2015) 4
Y.G., Li, S.N.: M2tr: Multi-modal multi-scale transform-
239. Sohn, K., Lee, H., Yan, X.: Learning structured output
ers for deepfake detection. In: ICML (2022) 5, 14, 18
representation using deep conditional generative models.
261. Wang, J., Zhao, K., Zhang, S., Zhang, Y., Shen, Y.,
In: NeurIPS (2015) 1, 4
Zhao, D., Zhou, J.: Lipformer: High-fidelity and general-
240. Stypułkowski, M., Vougioukas, K., He, S., Zięba, M.,
izable talking face generation with a pre-learned facial
Petridis, S., Pantic, M.: Diffused heads: Diffusion models
codebook. In: CVPR (2023) 10, 11
beat gans on talking-face generation. In: WACV (2024)
262. Wang, J., Zhao, Y., Fan, H., Xu, T., Li, Q., Li, S., Liu,
11
L.: Memory-augmented contrastive learning for talking
241. Sun, J., Deng, Q., Li, Q., Sun, M., Ren, M., Sun, Z.: Any-
head generation. In: ICASSP (2023) 5, 10, 17
face: Free-style text-to-face synthesis and manipulation. 263. Wang, K., Wu, Q., Song, L., Yang, Z., Wu, W., Qian, C.,
In: CVPR (2022) 2, 12, 18 He, R., Qiao, Y., Loy, C.C.: Mead: A large-scale audio-
242. Sun, J., Li, Q., Wang, W., Zhao, J., Sun, Z.: Multi- visual dataset for emotional talking-face generation. In:
caption text-to-face synthesis: Dataset and algorithm. ECCV (2020) 5, 17
In: ACM MM (2021) 5 264. Wang, Q., Liu, L., Hua, M., Zhu, P., Zuo, W., Hu, Q.,
243. Sun, J., Wang, X., Zhang, Y., Li, X., Zhang, Q., Liu, Y., Lu, H., Cao, B.: Hs-diffusion: Semantic-mixing diffusion
Wang, J.: Fenerf: Face editing in neural radiance fields. for head swapping. arXiv (2023) 6
In: CVPR (2022) 12, 18 265. Wang, R., Juefei-Xu, F., Ma, L., Xie, X., Huang, Y.,
244. Sun, Q., Tewari, A., Xu, W., Fritz, M., Theobalt, C., Wang, J., Liu, Y.: Fakespotter: a simple yet robust base-
Schiele, B.: A hybrid model for identity obfuscation by line for spotting ai-synthesized fake faces. In: IJCAI
face replacement. In: ECCV (2018) 6, 7, 8 (2021) 15
245. Sun, Z., Chen, Y., Xiong, S.: Ssat: A symmetric semantic- 266. Wang, S., Ma, Y., Ding, Y., Hu, Z., Fan, C., Lv, T.,
aware transformer network for makeup transfer and re- Deng, Z., Yu, X.: Styletalk++: A unified framework for
moval. In: AAAI (2022) 16 controlling the speaking styles of talking heads. TPAMI
246. Sun, Z., Chen, Y., Xiong, S.: Ssat ++: A semantic-aware (2024) 11
and versatile makeup transfer network with local color 267. Wang, T., Chow, K.P.: Noise based deepfake detection
consistency constraint. TNNLS (2023) 6, 16 via multi-head relative-interaction. In: AAAI (2023) 13,
247. Sunkavalli, K., Johnson, M.K., Matusik, W., Pfister, H.: 14
Multi-scale image harmonization. ACM TOG (2010) 7 268. Wang, T., Li, L., Lin, K., Lin, C.C., Yang, Z., Zhang,
248. Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: H., Liu, Z., Wang, L.: Disco: Disentangled control for
Frequency-aware deepfake detection: Improving general- referring human dance generation in real world. In:
izability through frequency space domain learning. In: CVPR (2024) 6
AAAI (2024) 14 269. Wang, T., Li, Z., Liu, R., Wang, Y., Nie, L.: An efficient
249. Tan, C., Zhao, Y., Wei, S., Gu, G., Wei, Y.: Learning on attribute-preserving framework for face swapping. TMM
gradients: Generalized artifacts representation for gan- (2024) 7
generated images detection. In: CVPR (2023) 2, 5, 13, 270. Wang, T.C., Mallya, A., Liu, M.Y.: One-shot free-view
14 neural talking-head synthesis for video conferencing. In:
250. Tan, S., Ji, B., Pan, Y.: Emmn: Emotional motion mem- CVPR (2021) 5
ory network for audio-driven emotional talking face gen- 271. Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D.,
eration. In: ICCV (2023) 10, 18 Lu, T., Luo, P., Shao, L.: Pyramid vision transformer:
26 Gan Pei, Jiangning Zhang, et al.
A versatile backbone for dense prediction without con- 292. Xu, Z., Zhang, J., Liew, J.H., Yan, H., Liu, J.W., Zhang,
volutions. In: ICCV (2021) 5 C., Feng, J., Shou, M.Z.: Magicanimate: Temporally
272. Wang, Y., Chen, X., Zhu, J., Chu, W., Tai, Y., Wang, consistent human image animation using diffusion model.
C., Li, J., Wu, Y., Huang, F., Ji, R.: Hififace: 3d shape In: CVPR (2024) 6
and semantic prior guided high fidelity face swapping. 293. Xu, Z., Zhou, H., Hong, Z., Liu, Z., Liu, J., Guo, Z., Han,
In: IJCAI (2021) 7, 8, 17 J., Liu, J., Ding, E., Wang, J.: Styleswap: Style-based
273. Wang, Y., Ma, X., Chen, X., Dantcheva, A., Dai, B., generator empowers robust face swapping. In: ECCV
Qiao, Y.: Leo: Generative latent image animator for (2022) 7, 8, 17
human video synthesis. arXiv (2023) 6, 11, 15 294. Xue, H., Ling, J., Tang, A., Song, L., Xie, R., Zhang,
274. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: W.: High-fidelity face reenactment via identity-matched
Image quality assessment: from error visibility to struc- correspondence learning. TOMM (2023) 9
tural similarity. TIP (2004) 5, 16 295. Yan, Q., Guo, C., Zhao, J., Dai, Y., Loy, C.C., Li,
275. Wei, H., Yang, Z., Wang, Z.: Aniportrait: Audio-driven C.: Beautyrec: Robust, efficient, and component-specific
synthesis of photorealistic portrait animation. arXiv makeup transfer. In: CVPR (2023) 6, 16
(2024) 11 296. Yang, C., He, W., Xu, Y., Gao, Y.: Elegant: Exquisite
276. Wen, B., Zhu, Y., Subramanian, R., Ng, T.T., Shen, X., and locally editable gan for makeup transfer. In: ECCV
Winkler, S.: Coverage—a novel database for copy-move (2022) 6, 16
forgery detection. In: ICIP (2016) 14 297. Yang, K., Chen, K., Guo, D., Zhang, S.H., Guo, Y.C.,
277. Wenmin Huang, W.L., Jiwu Huang, X.C.: Sdgan: Dis- Zhang, W.: Face2face ρ: Real-time high-resolution one-
entangling semantic manipulation for facial attribute shot face reenactment. In: ECCV (2022) 9
editing. In: AAAI (2024) 12 298. Yang, P., Zhou, S., Tao, Q., Loy, C.C.: Pgdiff: Guiding
278. Wiles, O., Koepke, A., Zisserman, A.: X2face: A network diffusion models for versatile face restoration via partial
for controlling face generation using images, audio, and guidance. In: NeurIPS (2024) 6
pose codes. In: ECCV (2018) 9 299. Yang, S., Jiang, L., Liu, Z., Loy, C.C.: Vtoonify: Control-
279. Wu, X., Hu, P., Wu, Y., Lyu, X., Cao, Y.P., Shan, lable high-resolution portrait video style transfer. ACM
Y., Yang, W., Sun, Z., Qi, X.: Speech2lip: High-fidelity TOG (2022) 6
speech to lip generation by learning from a short video. 300. Yang, S., Wang, W., Lan, Y., Fan, X., Peng, B., Yang, L.,
In: ICCV (2023) 10 Dong, J.: Learning dense correspondence for nerf-based
280. Xia, W., Yang, Y., Xue, J.H., Wu, B.: Tedigan: Text- face reenactment. In: AAAI (2024) 9, 10
guided diverse face image generation and manipulation. 301. Yang, W., Zhou, X., Chen, Z., Guo, B., Ba, Z., Xia, Z.,
In: CVPR (2021) 5 Cao, X., Ren, K.: Avoid-df: Audio-visual joint learning
281. Xing, Y., Tewari, R., Mendonca, P.: A self-supervised for detecting deepfake. TIFS (2023) 2, 13, 14, 18
bootstrap method for single-image 3d face reconstruction. 302. Yang, X., Li, Y., Lyu, S.: Exposing deep fakes using
In: WACV (2019) 6 inconsistent head poses. In: ICASSP (2019) 5, 13, 14
282. Xu, C., Zhang, J., Han, Y., Tian, G., Zeng, X., Tai, 303. Yang, Y., Yang, Y., Guo, H., Xiong, R., Wang, Y., Liao,
Y., Wang, Y., Wang, C., Liu, Y.: Designing one uni- Y.: Urbangiraffe: Representing urban scenes as compo-
fied framework for high-fidelity face reenactment and sitional generative neural feature fields. arXiv (2023)
swapping. In: ECCV (2022) 8 5
283. Xu, C., Zhang, J., Hua, M., He, Q., Yi, Z., Liu, Y.: 304. Yang, Z., Liang, J., Xu, Y., Zhang, X.Y., He, R.: Masked
Region-aware face swapping. In: CVPR (2022) 1, 7, 17 relation learning for deepfake detection. TIFS (2023) 2,
284. Xu, C., Zhu, J., Zhang, J., Han, Y., Chu, W., Tai, Y., 13, 14, 18
Wang, C., Xie, Z., Liu, Y.: High-fidelity generalized emo- 305. Yao, X., Newson, A., Gousseau, Y., Hellier, P.: A latent
tional talking face generation with multi-modal emotion transformer for disentangled face editing in images and
space learning. In: CVPR (2023) 10, 11, 18 videos. In: ICCV (2021) 12
285. Xu, C., Zhu, S., Zhu, J., Huang, T., Zhang, J., Tai, Y., 306. Ye, Z., Zhong, T., Ren, Y., Yang, J., Li, W., Huang,
Liu, Y.: Multimodal-driven talking face generation via a J., Jiang, Z., He, J., Huang, R., Liu, J., et al.: Real3d-
unified diffusion-based generator. CoRR (2023) 11 portrait: One-shot realistic 3d talking portrait synthesis.
286. Xu, J., Motamed, S., Vaddamanu, P., Wu, C.H., Haene, In: ICLR (2024) 11
C., Bazin, J.C., De la Torre, F.: Personalized face inpaint- 307. Yildirim, A.B., Pehlivan, H., Bilecen, B.B., Dundar, A.:
ing with diffusion models by parallel visual attention. In: Diverse inpainting and editing with gan inversion. In:
WACV (2024) 6 ICCV (2023) 6
287. Xu, S., Chen, G., Guo, Y.X., Yang, J., Li, C., Zang, Z., 308. Yin, Q., Lu, W., Li, B., Huang, J.: Dynamic difference
Zhang, Y., Tong, X., Guo, B.: Vasa-1: Lifelike audio- learning with spatio-temporal correlation for deepfake
driven talking faces generated in real time. arXiv (2024) video detection. TIFS (2023) 2, 13, 18
11 309. Yoo, S.M., Choi, T.M., Choi, J.W., Kim, J.H.: Fastswap:
288. Xu, S., Li, L., Shen, L., Men, Y., Lian, Z.: Your3demoji: A lightweight one-stage framework for real-time face
Creating personalized emojis via one-shot 3d-aware car- swapping. In: WACV (2023) 8
toon avatar synthesis. In: SIGGRAPH (2022) 6, 15 310. Yu, C., Lu, G., Zeng, Y., Sun, J., Liang, X., Li, H., Xu,
289. Xu, Y., Deng, B., Wang, J., Jing, Y., Pan, J., He, S.: Z., Xu, S., Zhang, W., Xu, H.: Towards high-fidelity
High-resolution face swapping via latent semantics dis- text-guided 3d face generation and manipulation using
entanglement. In: CVPR (2022) 6, 7, 17 only images. In: ICCV (2023) 12, 18
290. Xu, Y., Liang, J., Sheng, L., Zhang, X.Y.: Towards 311. Yu, H., Qu, Z., Yu, Q., Chen, J., Jiang, Z., Chen, Z.,
generalizable deepfake video detection with thumbnail Zhang, S., Xu, J., Wu, F., Lv, C., et al.: Gaussiantalker:
layout and graph reasoning. IJCV (2024) 13, 14 Speaker-specific talking head synthesis via 3d gaussian
291. Xu, Y., Yin, Y., Jiang, L., Wu, Q., Zheng, C., Loy, C.C., splatting. arXiv (2024) 11
Dai, B., Wu, W.: Transeditor: Transformer-based dual- 312. Yu, L., Xie, H., Zhang, Y.: Multimodal learning for tem-
space gan for highly controllable facial editing. In: CVPR porally coherent talking face generation with articulator
(2022) 12 synergy. TMM (2021) 11
Deepfake Generation and Detection: A Benchmark and Survey 27
313. Yu, N., Davis, L.S., Fritz, M.: Attributing fake images to 333. Zhang, J., Xu, C., Li, J., Han, Y., Wang, Y., Tai, Y.,
gans: Learning and analyzing gan fingerprints. In: ICCV Liu, Y.: Scsnet: An efficient paradigm for learning simul-
(2019) 15 taneously image colorization and super-resolution. In:
314. Yu, W.Y., Po, L.M., Cheung, R.C., Zhao, Y., Xue, Y., AAAI (2022) 6
Li, K.: Bidirectionally deformable motion modulation 334. Zhang, J., Zeng, X., Wang, M., Pan, Y., Liu, L., Liu, Y.,
for video-based human pose transfer. In: ICCV (2023) 6 Ding, Y., Fan: Freenet: Multi-identity face reenactment.
315. Yu, Z., Yin, Z., Zhou, D., Wang, D., Wong, F., Wang, In: CVPR (2020) 9
B.: Talking head generation with probabilistic audio-to- 335. Zhang, N., Paluri, M., Taigman, Y., Fergus, R., Bourdev,
visual diffusion priors. In: ICCV (2023) 11 L.: Beyond frontal faces: Improving person recognition
316. Yuan, C., Liu, X., Zhang, Z.: The current status and using multiple cues. In: CVPR (2015) 7
progress of adversarial examples attacks. In: CISCE 336. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang,
(2021) 6 O.: The unreasonable effectiveness of deep features as a
317. Yuan, G., Li, M., Zhang, Y., Zheng, H.: Reliableswap: perceptual metric. In: CVPR (2018) 5, 16
Boosting general face swapping via reliable supervision. 337. Zhang, W., Cun, X., Wang, X., Zhang, Y., Shen, X., Guo,
arXiv (2023) 9 Y., Shan, Y., Wang, F.: Sadtalker: Learning realistic 3d
318. Zakharov, E., Ivakhnenko, A., Shysheya, A., Lempitsky, motion coefficients for stylized audio-driven single image
V.: Fast bi-layer neural synthesis of one-shot realistic talking face animation. In: CVPR (2023) 10, 11
head avatars. In: ECCV (2020) 9 338. Zhang, W., Liu, Y., Dong, C., Qiao, Y.: Ranksrgan:
319. Zakharov, E., Shysheya, A., Burkov, E., Lempitsky, V.: Generative adversarial networks with ranker for image
Few-shot adversarial learning of realistic neural talking super-resolution. In: ICCV (2019) 15
head models. In: CVPR (2019) 9 339. Zhang, X., Sugano, Y., Fritz, M., Bulling, A.:
320. Zeng, H., Zhang, W., Fan, C., Lv, T., Wang, S., Zhang, Appearance-based gaze estimation in the wild. In: CVPR
Z., Ma, B., Li, L., Ding, Y., Yu, X.: Flowface: Semantic (2015) 9, 17
flow-guided shape-aware face swapping. In: AAAI (2023) 340. Zhang, X., Zhai, D., Li, T., Zhou, Y., Lin, Y.: Image
7, 8, 17 inpainting based on deep learning: A review. Information
321. Zeng, X., Pan, Y., Wang, M., Zhang, J., Liu, Y.: Realistic Fusion (2023) 6
face reenactment via self-supervised disentangling of 341. Zhang, Y., Zeng, H., Ma, B., Zhang, W., Zhang, Z., Ding,
identity and pose. In: AAAI (2020) 10 Y., Lv, T., Fan, C.: Flowface++: Explicit semantic flow-
322. Zhai, S., Liu, M., Li, Y., Gao, Z., Zhu, L., Nie, L.: Talking
supervised end-to-end face swapping. arXiv (2023) 8,
face generation with audio-deduced emotional landmarks.
17
TNNLS (2023) 10, 17, 18
342. Zhao, H., Zhou, W., Chen, D., Wei, T., Zhang, W., Yu,
323. Zhai, Y., Luan, T., Doermann, D., Yuan, J.: Towards
N.: Multi-attentional deepfake detection. In: CVPR
generic image manipulation detection with weakly-
(2021) 13, 14, 18
supervised self-consistency learning. In: ICCV (2023)
343. Zhao, J., Zhang, H.: Thin-plate spline motion model for
14
image animation. In: CVPR (2022) 6, 15
324. Zhang, B., Qi, C., Zhang, P., Zhang, B., Wu, H., Chen,
D., Chen, Q., Wang, Y., Wen, F.: Metaportrait: Identity- 344. Zhao, T., Xu, X., Xu, M., Ding, H., Xiong, Y., Xia,
preserving talking head generation with fast personalized W.: Learning self-consistency for deepfake detection. In:
adaptation. In: CVPR (2023) 9, 10 ICCV (2021) 14, 15, 18
325. Zhang, B., Zhang, X., Cheng, N., Yu, J., Xiao, J., Wang, 345. Zhao, W., Rao, Y., Shi, W., Liu, Z., Zhou, J., Lu, J.:
J.: Emotalker: Emotionally editable talking face genera- Diffswap: High-fidelity and controllable face swapping
tion via diffusion model. In: ICASSP (2024) 11 via 3d-aware masked diffusion. In: CVPR (2023) 7, 8,
326. Zhang, C., Wang, C., Zhao, Y., Cheng, S., Luo, L., Guo, 17
X.: Dr2: Disentangled recurrent representation learning 346. Zheng, G., Xu, Y.: Efficient face detection and tracking
for data-efficient speech video synthesis. In: WACV in video sequences based on deep learning. Information
(2024) 1, 10, 11 Sciences (2021) 4
327. Zhang, H., DAI, T., Xu, Y., Tai, Y.W., Tang, C.K.: 347. Zheng, Y., Bao, J., Chen, D., Zeng, M., Wen, F.: Ex-
Facednerf: Semantics-driven face reconstruction, prompt ploring temporal coherence for more general video face
editing and relighting with diffusion models. In: NIPS forgery detection. In: ICCV (2021) 13, 14, 18
(2024) 12 348. Zhong, W., Fang, C., Cai, Y., Wei, P., Zhao, G., Lin, L.,
328. Zhang, H., Ren, Y., Chen, Y., Li, G., Li, T.H.: Exploiting Li, G.: Identity-preserving talking face generation with
multiple guidance from 3dmm for face reenactment. In: landmark and appearance priors. In: CVPR (2023) 10,
AAAIW (2023) 10 11
329. Zhang, J., Li, X., Li, J., Liu, L., Xue, Z., Zhang, B., 349. Zhong, X., Huang, X., Wu, Z., Lin, G., Wu, Q.: Sara:
Jiang, Z., Huang, T., Wang, Y., Wang, C.: Rethinking Controllable makeup transfer with spatial alignment and
mobile block for efficient attention-based models. In: region-adaptive normalization. arXiv (2023) 6, 16
ICCV (2023) 4 350. Zhou, H., Liu, Y., Liu, Z., Luo, P., Wang, X.: Talking face
330. Zhang, J., Li, X., Wan, Z., Wang, C., Liao, J.: Fdnerf: generation by adversarially disentangled audio-visual
Few-shot dynamic neural radiance fields for face recon- representation. In: AAAI (2019) 11
struction and expression editing. In: SIGGRAPH (2022) 351. Zhou, H., Sun, Y., Wu, W., Loy, C.C., Wang, X., Liu,
2, 12, 18 Z.: Pose-controllable talking face generation by implic-
331. Zhang, J., Li, X., Wang, Y., Wang, C., Yang, Y., Liu, itly modularized audio-visual representation. In: CVPR
Y., Tao, D.: Eatformer: Improving vision transformer (2021) 10, 11, 17
inspired by evolutionary algorithm. IJCV (2024) 4 352. Zhou, M., Liu, X., Yi, T., Bai, Z., Zhang, P.: A superior
332. Zhang, J., Liu, L., Xue, Z., Liu, Y.: Apb2face: Audio- image inpainting scheme using transformer-based self-
guided face reenactment with auxiliary pose and blink supervised attention gan model. Expert Systems with
signals. In: ICASSP (2020) 10 Applications (2023) 6
28 Gan Pei, Jiangning Zhang, et al.
353. Zhou, P., Han, X., Morariu, V.I., Davis, L.S.: Two-stream
neural networks for tampered face detection. In: CVPRW
(2017) 2, 13
354. Zhou, P., Xie, L., Ni, B., Tian, Q.: Cips-3d++: End-
to-end real-time high-resolution 3d-aware gans for gan
inversion and stylization. TPAMI (2023) 12
355. Zhou, S., Xiao, T., Yang, Y., Feng, D., He, Q., He, W.:
Genegan: Learning object transfiguration and attribute
subspace from unpaired data. In: BMVC (2017) 12
356. Zhou, X., Yin, M., Chen, X., Sun, L., Gao, C., Li, Q.:
Cross attention based style distribution for controllable
person image synthesis. In: ECCV (2022) 6, 15
357. Zhou, Y., Han, X., Shechtman, E., Echevarria, J.,
Kalogerakis, E., Li, D.: Makelttalk: speaker-aware
talking-head animation. ACM TOG (2020) 10, 11
358. Zhu, B., Fang, H., Sui, Y., Li, L.: Deepfakes for medical
video de-identification: Privacy protection and diagnostic
information preservation. In: AAAI (2020) 7, 8
359. Zhu, H., Wu, W., Zhu, W., Jiang, L., Tang, S., Zhang,
L., Liu, Z., Loy, C.C.: Celebv-hq: A large-scale video
facial attributes dataset. In: ECCV (2022) 5, 11
360. Zhu, J., Van Gool, L., Hoi, S.C.: Unsupervised face
alignment by robust nonrigid mapping. In: ICCV (2009)
7
361. Zhu, Y., Li, Q., Wang, J., Xu, C.Z., Sun, Z.: One shot
face swapping on megapixels. In: CVPR (2021) 7, 8
362. Zhu, Y., Zhao, W., Tang, Y., Rao, Y., Zhou, J., Lu,
J.: Stableswap: Stable face swapping in a shared and
controllable latent space. TMM (2024) 7
363. Zi, B., Chang, M., Chen, J., Ma, X., Jiang, Y.G.: Wild-
deepfake: A challenging real-world dataset for deepfake
detection. In: ACM MM (2020) 5, 14