
Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization

Zhixi Cai1, Shreya Ghosh2, Abhinav Dhall3, Tom Gedeon2, Kalin Stefanov1, Munawar Hayat1
1 Monash University, 2 Curtin University, 3 Indian Institute of Technology Ropar
{zhixi.cai,kalin.stefanov,munawar.hayat}@monash.edu, {shreya.ghosh,tom.gedeon}@curtin.edu.au, [email protected]

arXiv:2305.01979v3 [cs.CV] 16 Jul 2023

Abstract

Most deepfake detection methods focus on detecting spatial and/or spatio-temporal changes in facial attributes and are centered around the binary classification task of detecting whether a video is real or fake. This is because available benchmark datasets contain mostly visual-only modifications present in the entirety of the video. However, a sophisticated deepfake may include small segments of audio or audio-visual manipulations that can completely change the meaning of the video content. To address this gap, we propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF), consisting of strategic content-driven audio, visual and audio-visual manipulations. The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture which effectively captures multimodal manipulations. We further improve the baseline method (i.e. BA-TFD+) by replacing the backbone with a Multiscale Vision Transformer and guiding the training process with contrastive, frame classification, boundary matching and multimodal boundary matching loss functions. The quantitative analysis demonstrates the superiority of BA-TFD+ on temporal forgery localization and deepfake detection tasks using several benchmark datasets including our newly proposed dataset. The dataset, models and code are available at https://github.com/ControlNet/LAV-DF.

1. Introduction

Increasingly powerful deep learning algorithms (e.g. Autoencoders [77] and Generative Adversarial Networks [33]), accompanied by rapid advances in computing power, have enabled the generation of highly realistic synthetic media, commonly referred to as deepfakes2. Audio-visual deepfake content generation utilizes methods for voice cloning [42, 91], face reenactment [73, 89], and face swapping [50, 70].

1 The paper is under consideration/review at the Computer Vision and Image Understanding Journal.
2 In the text, deepfake and forgery are used interchangeably.

Audio-visual deepfakes include videos that have been either manipulated or created from scratch to primarily mislead, deceive or influence audiences. Given that access to deepfake generation technologies has become widespread and the technologies are easy to use, some researchers argue that deepfakes are a "threat to democracy" [8, 78, 80, 88]. For example, [87] used a video of the former United States president Barack Obama to demonstrate a novel face reenactment method. In the resulting realistic video, the former president's lip movement is synchronized with the speech of another person. This type of manipulation has the potential to mislead people into forming wrong opinions and could have serious consequences.

Given the rapid growth of fake videos on the Internet, robust and accurate deepfake detection methods are increasingly important. This has triggered the release of several benchmark datasets for deepfake detection [26, 37, 49, 76], and state-of-the-art deepfake detection methods [4, 15, 23, 40, 53, 75, 97] demonstrate promising performance on those benchmark datasets, which define the problem as a binary classification task (i.e. classify the whole input video as real or fake).

Fake content, however, may only constitute a small part(s) of a long real video [19], and these modified segment(s) could completely change the meaning and sentiment of the original content. Let us consider the example illustrated in Figure 1, where the real video on the left captures the person saying "Vaccinations are safe". When the word "safe" is replaced with its antonym "dangerous", the meaning and sentiment of the video is significantly changed. This type of video forgery can effectively manipulate public opinion, particularly when targeting media involving famous individuals, as in the example with Barack Obama. Given the underlying assumption (i.e. deepfake detection is a binary classification problem) of the current deepfake detection benchmark datasets and methods, it is possible that state-of-the-art techniques may not perform well in identifying this new type of manipulation.

Figure 1. Content-driven audio-visual manipulation. In the real video (left) the subject is saying "Vaccinations are safe" (sentiment score: 0.44). In the audio-visual deepfake (right) created from the real video, "safe" is changed to "dangerous" (sentiment score: -0.48), resulting in a significant change in perceived sentiment. The green-edge frames are real and the red-edge frames are fake. Note that through a subtle audio-visual manipulation, the meaning of the video content has been completely changed.

This paper addresses the important task of content-driven forgery localization and detection in video. In terms of benchmark datasets, there is a significant gap in the availability of datasets for multimodal content-driven forgery localization and detection. This paper proposes a pipeline for generating such a large-scale dataset that can serve as a valuable resource for future research in this area. Furthermore, this paper also introduces a novel multimodal method that utilizes audio and visual information to precisely detect the boundaries of fake segments in videos. The main contributions of our work are:

• A large-scale public dataset, Localized Audio Visual DeepFake, for temporal forgery localization and detection.

• A multimodal method, Boundary Aware Temporal Forgery Detection Plus, for fake segment localization and detection.

• A thorough validation of the method's components and a comprehensive comparison with the state-of-the-art.

2. Related Work

This section reviews the relevant literature on deepfake detection datasets and methods. Given the similarities between temporal forgery localization and temporal action localization, previous work in the latter area is also reviewed.

2.1. Deepfake Detection Datasets

Deepfake detection research is driven by datasets generated with various deepfake generation approaches. We present a summary of the deepfake detection datasets available to the research community in Table 1. The first deepfake dataset, named DF-TIMIT, was proposed by [49]; its curation process involved face swapping on the VidTimit dataset [79]. Later, UADFV [94], FaceForensics++ [76] and Google DFD [69] were introduced, and FaceForensics++ has become a popular benchmark dataset for multiple deepfake detection methods [74, 90]. The main limitation of the aforementioned datasets is their size (i.e. a maximum of thousands of video samples). Given that CNNs and Transformers (commonly used for deepfake detection) are data-demanding techniques, these datasets lead to low generalization capability [56]. In 2020, Facebook (i.e. Meta) published the large-scale dataset DFDC [26] for deepfake detection with more than 100K samples. Until today, DFDC is the standard benchmark used for deepfake detection methods [15, 96]. After DFDC, several datasets targeting different specializations were introduced, for example, WildDeepfake [106] for web-crawled in-the-wild fake video detection, FFIW10K [103] for detecting fake faces in videos containing multiple faces, KoDF [51] for Korean deepfake detection, and DF-Platter [66] for detecting multi-face heterogeneous deepfakes. DeeperForensics [43] is another notable dataset that overcomes the bias of having a high number of fake videos. However, all of those datasets mainly consider visual-only deepfake detection. In 2021, FakeAVCeleb [47] was introduced, including both face swapping and audio-based face reenactment. This dataset includes fake audio generated from SV2TTS [42], which makes it the first deepfake detection dataset focusing on audio-visual manipulations.

Given that all of those datasets regard deepfake detection as a binary classification problem, the ForgeryNet [37] dataset was introduced, which includes visual-only face swapping in random frames and is suitable for both video/image classification and spatial/temporal forgery localization. However, ForgeryNet only applies random face swapping in the visual modality and does not consider audio and content-driven modifications. To bridge this gap, we propose a multimodal content-driven temporal forgery localization and detection dataset.

2.2. Deepfake Detection Methods

Deepfake detection methods can be categorized into two categories: traditional machine learning and deep learning approaches. The traditional machine learning methods include EM [35] and SVM [97]. On the other hand, deep learning methods include CNNs [25], RNNs [15, 65] and ViTs [22, 38, 92]. Most of the prior deepfake detection methods focus on temporal inconsistencies [34, 52] and multimodal synchronization [19, 64, 90, 105] to detect fake videos.

All of the above mentioned methods employ a classification-centric approach; thus, they do not have temporal localization capabilities. Only MDS [19] demonstrated scenarios where only parts of the video are modified, although this approach is primarily designed for classification. Our dataset and method are designed to consider both audio-visual deepfake detection and temporal localization.
Table 1. Details for publicly available deepfake datasets in chronologically ascending order. The LAV-DF dataset details are reported in the last row. Cla: Classification, SL: Spatial Localization, TFL: Temporal Forgery Localization, FS: Face Swapping, and RE: ReEnactment.

Dataset | Year | Tasks | Manipulated Modality | Manipulation Method | #Subjects | #Real | #Fake | #Total
DF-TIMIT [49] | 2018 | Cla | V | FS | 43 | 320 | 640 | 960
UADFV [97] | 2019 | Cla | V | FS | 49 | 49 | 49 | 98
FaceForensics++ [76] | 2019 | Cla | V | FS/RE | - | 1,000 | 4,000 | 5,000
Google DFD [69] | 2019 | Cla | V | FS | - | 363 | 3,068 | 3,431
DFDC [26] | 2020 | Cla | AV | FS | 960 | 23,654 | 104,500 | 128,154
DeeperForensics [43] | 2020 | Cla | V | FS | 100 | 50,000 | 10,000 | 60,000
Celeb-DF [56] | 2020 | Cla | V | FS | 59 | 590 | 5,639 | 6,229
WildDeepfake [106] | 2020 | Cla | - | - | - | 3,805 | 3,509 | 7,314
FFIW10K [103] | 2021 | Cla | V | FS | - | 10,000 | 10,000 | 20,000
KoDF [51] | 2021 | Cla | V | FS/RE | 403 | 62,166 | 175,776 | 237,942
FakeAVCeleb [47] | 2021 | Cla | AV | RE | 600+ | 570 | 25,000+ | 25,500+
ForgeryNet [37] | 2021 | SL/TFL/Cla | V | Random FS/RE | 5,400+ | 99,630 | 121,617 | 221,247
DF-Platter [66] | 2023 | Cla | V | FS | 454 | 133,260 | 132,496 | 265,756
LAV-DF (ours) | 2022 | TFL/Cla | AV | Content-driven RE | 153 | 36,431 | 99,873 | 136,304

2.3. Temporal Action Localization

Since the temporal forgery localization task is similar to temporal action localization, we also review the literature in this domain. For temporal action localization, ActivityNet [10], THUMOS14 [39], HACS [100], EPIC-KITCHENS [24], and FineAction [63] are popular benchmark datasets. Temporal action localization methods can be classified into two types: 2-stage approaches [61, 93, 98], where temporal bounding box proposals are generated first and then classified into different classes, and 1-stage approaches [9, 58, 60, 62, 67, 82, 95, 99], which directly predict the final temporal segments.

For temporal forgery localization, there is no requirement to classify the foreground segments; in other words, the background is always real and the foreground is always fake. Hence, 1-stage temporal action localization approaches are more relevant for the task. According to [3], these approaches can be grouped into two main categories: methods based on anchors and methods based on predicting boundary probabilities. Anchor-based methods [31, 32, 84, 85] utilize sliding windows in the video to detect segments. [59] proposed a new framework to generate proposals that predicts boundary probabilities based on start and end timestamps. This approach can access global context information to generate more precise and flexible segment proposals than anchor-based methods. Based on this method, several other approaches were proposed to enhance performance [57, 86].

All temporal action localization methods described above are visual-only, which is not optimal for the task of temporal forgery localization. The importance of accessing multimodal information for temporal action localization was recently raised by [3].

2.4. Proposed Multimodal Approach

This paper proposes a multimodal method for precise boundary proposal estimation to detect and localize fake segments in videos. We quantitatively compare the performance of the proposed method with existing state-of-the-art approaches, including BMN [57], AGT [67], MDS [19], AVFusion [3], BSN++ [86], TadTR [62], ActionFormer [99], and TriDet [82].

3. Localized Audio Visual DeepFake Dataset

We created a large-scale audio-visual deepfake dataset containing 136,304 videos (36,431 real and 99,873 fake). Our data generation pipeline is illustrated in Figure 2. The generation is guided by relevant words in the video transcripts; specifically, the manipulation strategy is to replace strategic words with their antonyms, which leads to a significant change in the perceived sentiment of the statement.

3.1. Audio-Visual Data Sourcing

The real videos in this dataset are collected from the VoxCeleb2 dataset [20], which is a large-scale facial video dataset containing more than 1 million utterance videos of 6,112 speakers. To ensure consistency, the faces within these videos are tracked and cropped using the Dlib facial detector [48] at 224×224 resolution. The VoxCeleb2 dataset offers a diverse range of video lengths, spoken languages, and voice qualities. Our dataset includes only English-speaking videos, where the spoken language was detected through the confidence score generated by the Google Speech-to-Text service3. We leveraged the same service to generate the transcripts.

3 https://cloud.google.com/speech-to-text
Figure 2. Content-driven audio-visual manipulation for the creation of the LAV-DF dataset. The real transcript is used to find the word tokens for replacement based on the largest change in perceived sentiment. The modified tokens are then used as input for generating audio. Post-processing and normalization are applied to the generated audio to maintain loudness consistency in the temporal neighborhood. The generated audio is then used as input for facial reenactment. The green-edge audio and visual frames are real data, and the red-edge frames are fake data. In total, three categories of data are generated: Fake Audio and Fake Visual, Fake Audio and Real Visual, and Real Audio and Fake Visual.

3.2. Audio-Visual Data Generation

After sourcing the real videos, the next step is to analyze each video transcript for content-driven deepfake generation. The generation process includes transcript manipulation, followed by generation of the corresponding audio and visual modalities.

3.2.1 Transcript Manipulation

Following the collection and wrangling of the real data, the next step is to analyze the transcript of a video, denoted as D = {d_0, d_1, ..., d_m, ..., d_n}, where d_i represents an individual word token and n denotes the total number of tokens in the transcript. The objective is to identify the tokens within D that should be replaced in order to achieve the maximum change in perceived sentiment. This process aims to create a modified transcript D' = {d_0, d_1, ..., d'_m, ..., d_n}, which consists of most of the original tokens from D and the replacements for a few specific tokens. The replacement tokens, denoted as d', are selected from a set d̂ containing antonyms of d, sourced from WordNet [30]. To determine the sentiment value of the transcript, we employed the sentiment analyzer available in NLTK [5]. Specifically, for each token d in a transcript D, the replacement is found with

$$\tau = \operatorname*{argmax}_{d \in D,\, d' \in \hat{d}} \left| S(D) - S(D') \right|.$$

Then all replacements in a transcript D are found as follows:

$$\theta = \operatorname*{argmax}_{\{\tau_m\}_{m=1}^{M}} \left| \sum_{i=1}^{M} \Delta S(\tau_i) \right|,$$

where ΔS(τ_i) is the difference in sentiment score between the original and modified transcripts when utilizing the replacement τ_i, and M is the maximum number of replacements. There is up to 1 replacement for videos shorter than 10 seconds; otherwise, there can be a maximum of 2 replacements. The shift in sentiment distribution following the manipulations is visualized in Figure 3 (a), while the histogram of |ΔS|, indicating that the sentiment of most transcripts has been successfully changed, is shown in Figure 3 (b).
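As a concrete illustration of the single-replacement case of the rule above, the sketch below searches token/antonym pairs with NLTK's VADER sentiment analyzer and WordNet antonyms. POS filtering (ADJ/VERB/NOUN) and the multi-replacement search are omitted, and the function names are ours, so treat this as a simplified approximation of the actual pipeline.

```python
# Sketch of the single-replacement rule: pick the (token, antonym) pair that
# maximizes |S(D) - S(D')|. Requires the NLTK "vader_lexicon" and "wordnet"
# resources. POS filtering and multi-token replacement are omitted.
from nltk.corpus import wordnet
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def sentiment(tokens):
    return analyzer.polarity_scores(" ".join(tokens))["compound"]

def antonyms(word):
    found = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            for ant in lemma.antonyms():
                found.add(ant.name().replace("_", " "))
    return found

def best_replacement(tokens):
    """Return (index, antonym, |delta_S|) giving the largest sentiment change."""
    base = sentiment(tokens)
    best = None
    for i, word in enumerate(tokens):
        for ant in antonyms(word):
            modified = tokens[:i] + [ant] + tokens[i + 1:]
            delta = abs(sentiment(modified) - base)
            if best is None or delta > best[2]:
                best = (i, ant, delta)
    return best

print(best_replacement("vaccinations are safe".split()))
```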
3.2.2 Audio Generation

After the transcript manipulation, the next step is to generate speaker-specific audio for the replacement tokens. Motivated by prior work on adaptive text-to-speech methods [14, 42, 68], we adopted SV2TTS [42] for speaker-specific audio generation. SV2TTS consists of three modules: 1) an encoder module responsible for extracting the style embedding of the reference speaker, 2) a spectrogram generation module based on Tacotron 2 [81] utilizing the replacement tokens and the speaker style embedding, and 3) a vocoder module based on WaveNet [71], which generates realistic audio from the spectrogram. In the audio generation, we utilized a pre-trained SV2TTS model to generate the audio segments. Then, we performed loudness normalization on the generated audio segments by considering the corresponding real audio neighbors. The rationale behind the loudness normalization is to generate a more realistic counterpart of the audio segment chosen for replacement.
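The paper does not specify the loudness measure used for this normalization. The sketch below matches the RMS energy of a synthesized segment to the real audio immediately surrounding the replacement location, which is one simple way to implement the described behavior; the RMS criterion and the context length are assumptions.

```python
# Sketch of loudness normalization: scale a synthesized segment so its RMS
# energy matches the real audio around the replacement location. RMS matching
# and the context window are assumptions; the paper only states that loudness
# is normalized with respect to the real temporal neighborhood.
import numpy as np

def rms(x, eps=1e-8):
    return np.sqrt(np.mean(np.square(x)) + eps)

def normalize_loudness(generated, real, start, end, context=8000):
    """Scale `generated` to the RMS of real[start-context:start] + real[end:end+context]."""
    left = real[max(start - context, 0):start]
    right = real[end:end + context]
    neighborhood = np.concatenate([left, right]) if (left.size or right.size) else real
    gain = rms(neighborhood) / rms(generated)
    return np.clip(generated * gain, -1.0, 1.0)
```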
3.2.3 Video Generation

The generated audio is used as input for generating the corresponding visual frames. Wav2Lip [73] facial reenactment is used for this task, as it has been shown to achieve state-of-the-art output generation quality along with better generalization [41, 44]. We encountered several issues with other popular visual generation methods such as AD-NeRF [36] and ATVGnet [17]. For example, AD-NeRF does not fit our generation context (i.e. zero-shot generation of unseen speakers), and ATVGnet uses a static reference image as input for facial reenactment, resulting in pose inconsistencies at the boundaries between real and fake segments. In contrast, Wav2Lip uses a reference video and target audio as input and generates an output video in which the person in the reference video lip-syncs to the target audio content, ensuring pose consistency between real and fake segments. We employed a pre-trained Wav2Lip model and upscaled the generated visual segments to a resolution of 224 × 224. The generated audio-visual segments are then synchronized and used to replace the original audio-visual segments.

Similar to [46], LAV-DF includes three categories of generated data:

• Fake Audio and Fake Visual. Both the real audio and visual segments corresponding to the replacement tokens are manipulated.

• Fake Audio and Real Visual. Only the real audio segments corresponding to the replacement tokens are manipulated. To keep the fake audio and real visual segments synchronized, the corresponding real visual segments are length-normalized.

• Real Audio and Fake Visual. Only the real visual segments corresponding to the replacement tokens are manipulated, and the length of the fake visual segments is normalized to match the length of the real audio segments.

3.3. Dataset Statistics

The dataset contains 136,304 videos of 153 unique identities, with 36,431 real videos and 99,873 videos containing fake segments. For benchmarking, we split the dataset into 3 identity-independent subsets: train (78,703 videos of 91 identities), validation (31,501 videos of 31 identities), and test (26,100 videos of 31 identities). A summary of the main statistics of the dataset is presented in Figure 3.

The dataset includes a total of 114,253 fake segments, with durations in (0, 1.6] seconds and an average length of 0.65 seconds. Notably, 89.26% of the fake segments are shorter than 1 second. The maximum length of the videos in the dataset is 20 seconds and 69.61% of the videos are shorter than 10 seconds. In terms of modality modification, the distribution is balanced among the four types: visual-modified, audio-modified, both-modified and real. Additionally, the majority of the videos (62.72%) contain only 1 fake segment, while a smaller proportion of videos (10.55%) include 2 fake segments.

Figure 3. Statistics of the LAV-DF dataset. (a) Distribution of sentiment scores before and after content-driven deepfake generation, (b) Histogram of sentiment changes |ΔS|, (c) Distribution of fake segment lengths, (d) Distribution of video lengths, (e) Proportion of number of fake segments, and (f) Proportion of modifications.

3.4. Dataset Quality

Table 2 provides a quantitative comparison (PSNR and SSIM) with existing dataset generation pipelines in terms of visual quality, demonstrating that our pipeline achieves better visual quality on the VoxCeleb2 dataset.

Table 2. Visual quality of the LAV-DF dataset. We maintained the experimental protocol and adopted the scores on VoxCeleb2 for the related deepfake generation pipelines from [102].

Method | PSNR | SSIM
ATVGnet [17] | 29.41 | 0.826
Wav2Lip [73] | 29.54 | 0.846
MakeitTalk [104] | 29.51 | 0.817
Rhythmic Head [16] | 29.55 | 0.779
PC-AVS [102] | 29.68 | 0.886
LAV-DF (Ours) | 33.06 | 0.898
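For reference, the sketch below shows one way such frame-level PSNR/SSIM scores could be computed with scikit-image and averaged over a clip. The frame sampling and color handling of the protocol adopted from [102] are not reproduced here, so this is only an illustrative approximation, not the exact evaluation code.

```python
# Sketch of frame-level visual quality scoring (PSNR/SSIM) between real face
# crops and their reenacted counterparts, averaged over frames. Frame sampling
# and color handling are simplified relative to the protocol from [102].
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def video_quality(real_frames, fake_frames):
    """real_frames/fake_frames: equal-length lists of HxWx3 uint8 arrays."""
    psnr_scores, ssim_scores = [], []
    for real, fake in zip(real_frames, fake_frames):
        psnr_scores.append(peak_signal_noise_ratio(real, fake, data_range=255))
        ssim_scores.append(
            structural_similarity(real, fake, channel_axis=2, data_range=255)
        )
    return float(np.mean(psnr_scores)), float(np.mean(ssim_scores))
```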
4. Boundary Aware Temporal Forgery Detection+ Method

The objective is to detect and localize multimodal manipulations given an input video. To this end, we designed the proposed method BA-TFD+ in such a way that it has the capability to capture deepfake artifacts and localize the boundaries of fake segments. An overview of the proposed method is depicted in Figure 4 and Algorithm 1.

Algorithm 1: Training procedure of BA-TFD+
Data: Training data D ⊃ {X_i, Y_i}_{i=1}^n, modality modification flags E = {η_i = (η_vi, η_ai)}_{i=1}^n, loss weights λ
Result: Parameters of the model θ
    θ ← initialize the parameters randomly
    Y_0 ← label for real data
    while θ not converged do
        (V, A, Y) ← next sample from D;  (η_v, η_a) ← next flag from E
        Y_v ← Y if η_v else Y_0;  Y_a ← Y if η_a else Y_0
        Y^(b) ← generate labels from Y
        (Y_v^(b), Y_v^(f)) ← generate labels from Y_v;  (Y_a^(b), Y_a^(f)) ← generate labels from Y_a
        z_v ← F_Ev(V);  z_a ← F_Ea(mel-spectrogram(A))
        Y^(c) ← η_v ∧ η_a
        L_c ← ContrastiveLoss(z_v, z_a, Y^(c))
        Ŷ_v^(f) ← F_Cv(z_v);  Ŷ_a^(f) ← F_Ca(z_a)
        L_f ← ½ (FL(Ŷ_v^(f), Y_v^(f)) + FL(Ŷ_a^(f), Y_a^(f)))          /* FL: frame loss */
        (Ŷ_v^(b)(p), Ŷ_v^(b)(c), Ŷ_v^(b)(pc)) ← F_Bv(z_v ⊕ Ŷ_v^(f))     /* ⊕: concatenation */
        (Ŷ_a^(b)(p), Ŷ_a^(b)(c), Ŷ_a^(b)(pc)) ← F_Ba(z_a ⊕ Ŷ_a^(f))
        Ŷ^(b)(p) ← F_Γp(Ŷ_v^(b)(p), Ŷ_a^(b)(p), z_v, z_a)
        Ŷ^(b)(c) ← F_Γc(Ŷ_v^(b)(c), Ŷ_a^(b)(c), z_v, z_a)
        Ŷ^(b)(pc) ← F_Γpc(Ŷ_v^(b)(pc), Ŷ_a^(b)(pc), z_v, z_a)
        L_bm ← ½ (MSE(Ŷ_v^(b)(p), Y_v^(b)) + MSE(Ŷ_v^(b)(c), Y_v^(b)) + MSE(Ŷ_v^(b)(pc), Y_v^(b)) + MSE(Ŷ_a^(b)(p), Y_a^(b)) + MSE(Ŷ_a^(b)(c), Y_a^(b)) + MSE(Ŷ_a^(b)(pc), Y_a^(b)))
        L_b ← MSE(Ŷ^(b)(p), Y^(b)) + MSE(Ŷ^(b)(c), Y^(b)) + MSE(Ŷ^(b)(pc), Y^(b))
        θ ← Adam(L_b, L_bm, L_f, L_c, λ, θ)
    end
    return θ

4.1. Preliminaries

The training dataset D ⊃ {X_i, Y_i}_{i=1}^n comprises n multimodal inputs X_i with visual modality V_i and audio modality A_i, and the associated output labels Y_i. The proposed model BA-TFD+ with trainable parameters θ is optimized to map the inputs X_i to the outputs Y_i. Each X_i has a different number of frames t_i. In order to simplify batch training of the model, we padded the temporal axis of all X_i to T.

4.2. Visual Encoder

The goal of the visual encoder F_Ev is to capture frame-level spatio-temporal features from the input visual modality V ⊃ {V_i}_{i=1}^n using an MViTv2 [55]. MViTv2 achieves seminal performance gains on different video analysis tasks, including video action recognition and detection. In addition, MViTv2 leverages hierarchical multi-scale features compared to the basic ViT [27]. Our backbone MViTv2-Base model comprises 4 blocks and 24 multi-head self-attention layers. As illustrated in Figure 4, the visual encoder F_Ev maps the inputs V ∈ R^{C×T×H×W} (T is the number of frames, C is the number of channels, and H and W are the height and width of the frames) to the latent space z_v ∈ R^{C_f×T} (C_f is the dimension of the features).

4.3. Audio Encoder

The goal of the ViT-based [27] audio encoder F_Ea is to learn meaningful features from the raw input audio modality A ⊃ {A_i}_{i=1}^n. Following previous work [40, 96], we pre-process the raw audio A to generate representative mel-spectrograms A' ∈ R^{F_m×T_a} (T_a = τT is the temporal dimension, τ ∈ N*, where N* denotes the positive integers, and F_m is the length of the mel-frequency cepstrum features). In order to keep the audio-visual synchronization, we reshape the temporal axis of the mel-spectrograms to τF_m × T. The reshaped spectrograms A' are given as input to the ViT blocks of the audio encoder F_Ea4. The audio encoder F_Ea maps the mel-spectrograms A' to the latent space z_a ∈ R^{C_f×T}, where C_f is the feature dimension.

4 We only incorporate the multi-head self-attention layers of the ViT for the audio encoder.
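The sketch below illustrates the audio pre-processing and reshape described in Section 4.3 using torchaudio. The sample rate, hop length and number of mel bins are assumptions chosen so that the spectrogram has T_a = τT frames; the actual parameters used by BA-TFD+ are not stated in this section.

```python
# Sketch of mel-spectrogram extraction and the reshape that aligns the audio
# temporal axis with the T visual frames (Section 4.3). Sample rate, hop
# length and n_mels are illustrative assumptions, chosen so Ta = tau * T.
import torch
import torchaudio

def audio_to_vit_input(waveform, sample_rate=16000, n_mels=64, tau=4, T=512):
    hop_length = sample_rate // 100  # 10 ms hop (assumed)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_mels=n_mels, hop_length=hop_length
    )(waveform)
    mel = mel.squeeze(0)              # (Fm, frames)
    target = tau * T                  # Ta = tau * T audio frames
    if mel.shape[1] < target:         # pad the temporal axis to Ta
        mel = torch.nn.functional.pad(mel, (0, target - mel.shape[1]))
    else:
        mel = mel[:, :target]
    # Fold the tau audio frames belonging to each visual frame into the
    # channel axis: (tau * Fm, T), so every column is synchronized with one
    # visual frame, as required by the ViT blocks of the audio encoder.
    mel = mel.reshape(n_mels, T, tau).permute(2, 0, 1).reshape(tau * n_mels, T)
    return mel
```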
Figure 4. Overview of the BA-TFD+ method. BA-TFD+ mainly comprises 1) a visual encoder (F_Ev) that takes resized raw video frames as input, 2) an audio encoder (F_Ea) that takes a spectrogram extracted from the raw audio as input, 3) visual and audio based frame classification modules (i.e. F_Cv and F_Ca), 4) boundary localization modules to facilitate forgery localization in both the visual (F_Bv) and audio (F_Ba) modality, and finally 5) a multimodal fusion module that fuses the multimodal latent features position-wise (p), channel-wise (c) and position-channel wise (pc). During inference, a post-processing operation (averaging the boundary maps followed by Soft Non-Maximum Suppression) is applied to generate segments from the output of the fusion module. ⊕ denotes concatenation.

4.4. Frame Classification Module

We further deploy frame-level classification modules on top of the visual and audio features. Let us denote the ground truth labels for the visual and audio modality as Y_v^(f) and Y_a^(f). The visual classification module F_Cv maps the latent visual features z_v to labels Ŷ_v^(f) ∈ R^T. Similarly, the audio classification module F_Ca maps the latent audio features z_a to labels Ŷ_a^(f) ∈ R^T.

Figure 5. Overview of the BA-TFD+ fusion module. The gray block is used to normalize the visual and audio weights produced by the 1D convolutional layers, followed by an element-wise weighted average. ⊕ denotes element-wise addition, ⊗ denotes element-wise multiplication and BM denotes boundary map.

4.5. Boundary Localization Module

This module facilitates the learning of deepfake localization. Motivated by BSN++ [86], we adopted the proposal relation block (PRB) as the framework for the boundary maps (a representation of the boundary information of all densely distributed proposals). The ground truth boundary map Y^(b) ∈ R^{D×T} is generated from Y, where Y_ij^(b) is the confidence score for a segment which starts at the j-th frame and ends at the (i + j)-th frame. The PRB module contains both a position-aware attention module (capturing global dependencies) and a channel-aware attention module (capturing inter-dependencies between different channels). In order to achieve localization in each modality, we deploy two boundary modules, F_Bv for the visual and F_Ba for the audio modality.

The visual boundary module F_Bv takes as input the concatenation of the latent features z_v and the classification outputs Ŷ_v^(f), i.e. z_v ⊕ Ŷ_v^(f). F_Bv predicts the position-aware boundary maps Ŷ_v^(b)(p) ∈ R^{D×T} and the channel-aware boundary maps Ŷ_v^(b)(c) ∈ R^{D×T} as output. These results are aggregated by a convolutional layer which outputs the position-channel boundary maps denoted as Ŷ_v^(b)(pc) ∈ R^{D×T}. Similarly, the audio boundary module F_Ba takes as input the concatenation of the latent features z_a and the classification outputs Ŷ_a^(f), i.e. z_a ⊕ Ŷ_a^(f). F_Ba first predicts the audio position-aware boundary maps Ŷ_a^(b)(p) and channel-aware boundary maps Ŷ_a^(b)(c). Then Ŷ_a^(b)(p) and Ŷ_a^(b)(c) are aggregated to Ŷ_a^(b)(pc) using a convolutional layer.
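To make the dense boundary-map labels concrete, the sketch below generates a Y^(b) of shape (D, T) from the fake-segment annotations of one video, scoring every (duration, start) proposal by its maximum temporal IoU against the annotated segments. Using IoU as the confidence value is an assumption in the spirit of BMN/BSN++-style boundary-matching labels rather than the exact label definition used by the authors.

```python
# Sketch: dense boundary-map labels Y^(b) of shape (D, T). Entry (i, j) scores
# the proposal starting at frame j and ending at frame i + j. The max temporal
# IoU against annotated fake segments is an assumed confidence definition.
import numpy as np

def temporal_iou(seg_a, seg_b):
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = max(seg_a[1], seg_b[1]) - min(seg_a[0], seg_b[0])
    return inter / union if union > 0 else 0.0

def boundary_map_labels(fake_segments, D, T):
    """fake_segments: list of (start_frame, end_frame) for one video."""
    labels = np.zeros((D, T), dtype=np.float32)
    if not fake_segments:
        return labels  # real video: all-zero map
    for i in range(D):                 # proposal duration index
        for j in range(T):             # proposal start frame
            proposal = (j, j + i)
            labels[i, j] = max(temporal_iou(proposal, s) for s in fake_segments)
    return labels
```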
4.6. Multimodal Fusion Module

The fusion module, illustrated in Figure 5, takes as input the boundary maps Ŷ_v^(b)(p), Ŷ_a^(b)(p), Ŷ_v^(b)(c), Ŷ_a^(b)(c), Ŷ_v^(b)(pc), and Ŷ_a^(b)(pc) together with the features z_v and z_a from the visual and audio modalities. Since the boundary module corresponding to each modality predicts three boundary maps, there are three fusion modules, for the position-aware F_Γp, channel-aware F_Γc and aggregated position-channel F_Γpc boundary maps.

For the visual modality, the visual boundary maps and the features from both the visual and audio modalities are used to calculate the visual weights W_v ∈ R^{D×T}. Similarly, for the audio modality, the audio boundary maps and the features from both modalities are utilized to calculate the audio weights W_a ∈ R^{D×T}. The element-wise weighted average of the fused boundary map predictions Ŷ^(b)(p), Ŷ^(b)(c) and Ŷ^(b)(pc) is formed in the final step. Each boundary map Ŷ^(b)(α), α ∈ {p, c, pc}, is calculated as follows:

$$\hat{Y}^{(b)(\alpha)} = \frac{W_v \hat{Y}_v^{(b)(\alpha)} + W_a \hat{Y}_a^{(b)(\alpha)}}{W_v + W_a},$$

where all operations are element-wise.
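The sketch below expresses this element-wise weighted average in PyTorch. The `visual_weight`/`audio_weight` heads are simplified placeholders standing in for the Conv1D weight branches and normalization shown in Figure 5, so the module is an approximation of the fusion design rather than the exact implementation.

```python
# Sketch of the Section 4.6 fusion: boundary maps from the visual and audio
# branches are combined by an element-wise weighted average. The weight heads
# below are simplified placeholders for the Conv1D branches of Figure 5.
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self, feature_dim, map_duration):
        super().__init__()
        # Map the concatenated latent features (zv ++ za) to (D, T) weights.
        self.visual_weight = nn.Conv1d(2 * feature_dim, map_duration, kernel_size=1)
        self.audio_weight = nn.Conv1d(2 * feature_dim, map_duration, kernel_size=1)

    def forward(self, bm_v, bm_a, zv, za, eps=1e-6):
        # bm_v, bm_a: (B, D, T) boundary maps; zv, za: (B, Cf, T) features.
        feats = torch.cat([zv, za], dim=1)              # (B, 2*Cf, T)
        wv = torch.relu(self.visual_weight(feats))      # non-negative weights
        wa = torch.relu(self.audio_weight(feats))
        return (wv * bm_v + wa * bm_a) / (wv + wa + eps)
```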
4.7.3 Boundary Matching Loss
4.7. Loss Functions
Following the standard protocol [57, 86], we generated the
The training process of BA-TFD+ is guided by con- ground truth boundary maps as labels for efficient training.
trastive (Lc ), frame classification (Lf ), boundary matching The fusion boundary matching loss is calculated as,
(Lb ) and multimodal boundary matching (Lbm ) loss func-
tions.
n X ti
D X
1 X X (b)(α) (b)
4.7.1 Contrastive Loss Lb = P (Ŷijk − Yijk )2 ,
3D T
α∈{p,c,pc} i=1 j=1 k=1
Contrastive loss has been proven to be helpful to elimi-
nate the misalignment between different modalities [19,21]. where α is one of the boundary map types from the bound-
Motivated by this, BA-TFD+ uses the latent visual and ary module, n is the number of samples in the dataset, D is
audio features zv and za of real videos as positive pairs. the maximum proposalP duration, ti is the number of frames,
On the other hand, latent features zv and za with at least and T = {ti }n0 where T is the total number of frames in
one modified modality are considered negative pairs (i.e. the dataset.
Y (c) = 0). Thus, the contrastive loss minimizes the differ-
ence between the visual and audio modalities for positive
pairs (i.e. Y (c) = 1) and keeps that margin larger than δ for 4.7.4 Multimodal Boundary Matching Loss
negative pairs. The contrastive loss is defined as follows,
We utilized the label information for each modality to train
the proposed multimodal framework and extended the con-
n
1 X (c) (c) cept of boundary matching loss (Lb ) to more modalities.
Lc = Yi d2i + (1 − Yi ) max(δ − di , 0)2
The multimodal boundary matching loss is defined as fol-
P
Cf T i=1
lows,
di = ||zvi − zai ||2 ,
where, n is the number of samples in the dataset, di is the ti
D X
n X
1 X X X (b)(α) (b)
ℓ2 distance between visual and audio modality in the la- Lbm = P (Ŷmijk −Ymijk )2
(c) 2D T
tent space, Yi is Pthe label for contrastive learning and m∈{v,a} α∈{p,c,pc} i=1 j=1 k=1
n
T = {ti }0 where T is the total number of frames in
the dataset. (b)
Ym(b) = ηm Y (b) + (1 − ηm )Yϕ ,

4.7.2 Frame Classification Loss where, m is the modality (visual v or audio a), ηm specifies
whether modality m is modified, α is one of the boundary
This is a standard frame level cross-entropy loss depicted (b)
as, map types from the boundary module, Yϕ ∈ 0D×T is the
groundPtruth boundary maps for real videos, and T = {ti }n0
where T is the total number of frames in the dataset.
ti
n X
1 X X (f ) (f )
Lf = − P H(Ŷmij , Ymij )
2 T
m∈{a,v} i=1 j=1 4.7.5 Overall Loss

H(Ŷ (f ) , Y (f ) ) = Y (f ) log Ŷ (f ) +(1−Y (f ) ) log (1 − Ŷ (f ) ) The overall training objective of BA-TFD+ is defined as,
(f )
Ym(f ) = ηm Y (f ) + (1 − ηm )Yϕ ,
L = Lb + λbm Lbm + λf Lf + λc Lc ,
where n is the number of samples in the dataset, ti is the
number of frames, m is the modality (i.e. audio a or visual where, λbm , λf and λc are weights for different losses.
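A compact PyTorch sketch of the four losses is given below. Shapes follow the notation of Section 4.7, but the normalization is per mini-batch rather than over the whole dataset, so the scaling constants only approximate the formulas above; the default λ values are taken from Section 5.2, while everything else (function names, the dictionary of boundary maps) is our own simplification.

```python
# Sketch of the BA-TFD+ training objective (Section 4.7) for one mini-batch.
# Normalization is per batch rather than over the whole dataset, so constants
# approximate the formulas above. Predictions are assumed to be in [0, 1].
import torch
import torch.nn.functional as F

def contrastive_loss(zv, za, y_c, delta=0.99):
    # zv, za: (B, Cf, T) latent features; y_c: (B,) with 1 for aligned (real) pairs.
    d = torch.norm(zv - za, p=2, dim=1)                      # (B, T)
    loss = y_c[:, None] * d ** 2 \
        + (1 - y_c[:, None]) * torch.clamp(delta - d, min=0) ** 2
    return loss.mean() / zv.shape[1]

def frame_loss(pred_v, pred_a, y_v, y_a):
    # Frame-level binary cross-entropy averaged over both modalities.
    return 0.5 * (F.binary_cross_entropy(pred_v, y_v)
                  + F.binary_cross_entropy(pred_a, y_a))

def boundary_loss(fused_maps, y_b):
    # fused_maps: dict of position/channel/position-channel maps, each (B, D, T).
    return sum(F.mse_loss(m, y_b) for m in fused_maps.values())

def multimodal_boundary_loss(maps_v, maps_a, y_bv, y_ba):
    loss_v = sum(F.mse_loss(m, y_bv) for m in maps_v.values())
    loss_a = sum(F.mse_loss(m, y_ba) for m in maps_a.values())
    return 0.5 * (loss_v + loss_a)

def total_loss(lb, lbm, lf, lc, lambda_bm=1.0, lambda_f=2.0, lambda_c=0.1):
    # Default weights follow the values reported in Section 5.2.
    return lb + lambda_bm * lbm + lambda_f * lf + lambda_c * lc
```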
Table 3. Temporal forgery localization results on the “fullset” of the LAV-DF dataset. The visual-only version of BA-TFD+ uses the
output from the visual boundary matching layer, illustrating the performance when using only the visual modality.

Method [email protected] [email protected] [email protected] AR@100 AR@50 AR@20 AR@10


BMN [57] 10.56 01.66 00.00 48.49 44.39 37.13 31.55
BMN (E2E) 24.01 07.61 00.07 53.26 41.24 31.60 26.93
MDS [19] 12.78 01.62 00.00 37.88 36.71 34.39 32.15
AGT [67] 17.85 09.42 00.11 43.15 34.23 24.59 16.71
BSN++ [86] 56.41 32.57 00.21 74.93 71.11 64.98 59.29
AVFusion [3] 65.38 23.89 00.11 62.98 59.26 54.80 52.11
BA-TFD [12] 79.15 38.57 00.24 67.03 64.18 60.89 58.51
TadTR [62] 80.22 61.04 05.22 72.50 72.50 70.56 69.18
ActionFormer [99] 85.23 59.05 00.93 77.23 77.23 77.19 76.93
TriDet [82] 86.33 70.23 03.05 74.47 74.47 74.46 74.45
BA-TFD+ (ours) 96.30 84.96 04.44 81.62 80.48 79.40 78.75
BA-TFD+ (ours) (visual only) 64.78 54.85 02.53 64.00 59.33 55.94 54.38

Table 4. Temporal forgery localization results on the “subset” of the LAV-DF dataset. The visual-only version of BA-TFD+ uses the
output from the visual boundary matching layer, illustrating the performance when using only the visual modality.

Method [email protected] [email protected] [email protected] AR@100 AR@50 AR@20 AR@10


BMN [57] 28.10 05.47 00.01 55.49 54.44 52.14 47.72
BMN (E2E) 32.32 11.38 00.14 59.69 48.17 39.01 34.17
MDS [19] 23.43 03.48 00.00 58.53 56.68 53.16 49.67
AGT [67] 15.69 10.69 00.15 49.11 40.31 31.70 23.13
BSN++ [86] 65.26 37.70 00.22 78.89 76.32 71.00 65.38
AVFusion [3] 62.01 22.77 00.11 61.98 58.08 53.31 50.52
BA-TFD [12] 85.20 47.06 00.29 67.34 64.52 61.19 59.32
TadTR [62] 83.48 63.57 05.44 74.15 74.15 72.42 71.38
ActionFormer [99] 79.48 48.01 01.08 70.38 70.38 70.36 70.08
TriDet [82] 80.71 60.93 02.91 67.64 67.64 67.64 67.63
BA-TFD+ (ours) 96.82 86.47 03.90 81.74 80.59 79.60 79.15
BA-TFD+ (ours) (visual only) 96.47 82.02 03.79 80.65 79.00 77.46 76.90

4.8. Inference

During inference, the model generates three types of fused boundary maps: the position-aware boundary map Ŷ^(b)(p), the channel-aware boundary map Ŷ^(b)(c) and the aggregated position-channel boundary map Ŷ^(b)(pc). Following previous work [86], we average the three boundary maps to produce the final boundary map Ŷ^(b). This boundary map represents the confidence for all proposals in the video. Since this operation produces duplicated proposals, we post-process the proposals with Soft Non-Maximum Suppression (S-NMS) [7], similar to BSN++ [86].
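For completeness, the sketch below shows Soft-NMS [7] over temporal proposals with Gaussian score decay, which is one of the variants from the original paper. The decay parameter and score threshold are hypothetical defaults; in practice these hyperparameters are searched on the validation set, as described in Section 5.3.

```python
# Sketch of Soft-NMS [7] over temporal proposals (start, end, score) with
# Gaussian score decay. sigma and score_threshold are hypothetical values.
import math

def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def soft_nms(proposals, sigma=0.5, score_threshold=0.001):
    """proposals: list of (start, end, score); returns rescored, filtered list."""
    remaining = sorted(proposals, key=lambda p: p[2], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        rescored = []
        for start, end, score in remaining:
            iou = temporal_iou((best[0], best[1]), (start, end))
            score *= math.exp(-(iou ** 2) / sigma)      # Gaussian decay
            if score >= score_threshold:
                rescored.append((start, end, score))
        remaining = sorted(rescored, key=lambda p: p[2], reverse=True)
    return kept
```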
DF and 300 for ForgeryNet [37] and DFDC [26]. The latent
5. Experiments features zv and za have the same shape Cf × T where the
feature size Cf = 256 and T ∈ {512, 300}. For the bound-
5.1. Dataset Partitioning
ary matching modules FBv and FBa , we set the maximum
We splitted the LAV-DF dataset into 78,703 train, 31,501 segment duration D to 40 for LAV-DF, 200 for ForgeryNet
validation and 26,100 test videos. The test partition is de- and 300 for DFDC. We followed the training protocol pro-
Table 5. Temporal forgery localization results on the ForgeryNet dataset. The visual-only version of BA-TFD+ uses the output from
the visual boundary matching layer, illustrating the performance when using only the visual modality.

Method Avg. AP [email protected] [email protected] [email protected] AR@5 AR@2


Xception [18] 62.83 68.29 62.84 58.30 73.95 25.83
X3D-M+BSN [28, 59] 70.29 80.46 77.24 55.09 86.88 81.33
X3D-M+BMN [28, 57] 83.47 90.65 88.12 74.95 91.99 88.44
SlowFast+BSN [29, 59] 73.42 82.25 80.11 60.66 88.78 83.63
SlowFast+BMN [29, 57] 86.85 92.76 91.00 80.02 93.49 90.64
BA-TFD+ (ours) (visual only) 87.79 93.13 89.14 81.09 95.69 90.63

Table 6. Deepfake detection results on the DFDC dataset.

Method | AUC
Meso4 [1] | 0.753
FWA [54] | 0.727
Siamese [64] | 0.844
MDS [19] | 0.916
BA-TFD [12] | 0.846
BA-TFD+ (ours) | 0.937

5.3. Evaluation Details

We benchmarked the LAV-DF dataset for deepfake detection and localization tasks. For deepfake detection we follow standard evaluation protocols [26, 76] and use Area Under the Curve (AUC) as the evaluation metric for this binary classification task. We are the first to benchmark the deepfake localization task and adopt Average Precision (AP) and Average Recall (AR) as the evaluation metrics. For AP, we set the IoU thresholds to 0.5, 0.75 and 0.95, following the ActivityNet [10] evaluation protocol. For AR, since the number of fake segments is small, we set the number of proposals to 100, 50, 20 and 10 with the IoU thresholds [0.5:0.05:0.95]. When evaluating the proposed approach on ForgeryNet [37], we follow the protocol in that paper (i.e. [email protected], [email protected], [email protected], AR@5, and AR@2).

For evaluating BA-TFD+ on ForgeryNet, we used only the visual pipeline of the method to train the model (ForgeryNet is a visual-only deepfake dataset). Since only the visual modality is used in the model, only L_b and L_f are used for training. Similarly, for evaluation on DFDC [26], we consider the whole fake video as one fake segment and train our model in the temporal localization manner. Then, we train a small MLP to map the boundary map to the final binary labels.

We also evaluated the performance of several state-of-the-art methods on LAV-DF, including BMN [57], AGT [67], AVFusion [3], MDS [19], BSN++ [86], TadTR [62], ActionFormer [99], and TriDet [82]. Based on the original implementations, BMN, BSN++, TadTR, ActionFormer, and TriDet require extracted features; thus, we trained these models based on 2-stream I3D features [13]. For the methods that require S-NMS [7] during post-processing, we searched for the optimal S-NMS hyperparameters using the validation part of the concerned dataset. All reported results are based on the test partitions.
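To illustrate the localization metric, the sketch below computes AP at a single temporal IoU threshold using ActivityNet-style matching, where each ground-truth segment can be claimed by at most one prediction. The AR computation at fixed proposal counts and the averaging over the [0.5:0.05:0.95] thresholds are omitted, and the simple trapezoidal integration of the precision-recall curve is an approximation of the official interpolated AP.

```python
# Sketch of AP at one temporal IoU threshold (ActivityNet-style matching:
# each ground-truth segment is matched by at most one prediction).
import numpy as np

def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def average_precision(predictions, ground_truth, iou_threshold=0.5):
    """predictions: list of (video_id, start, end, score);
    ground_truth: dict video_id -> list of (start, end) fake segments."""
    predictions = sorted(predictions, key=lambda p: p[3], reverse=True)
    matched = {vid: [False] * len(segs) for vid, segs in ground_truth.items()}
    num_gt = sum(len(segs) for segs in ground_truth.values())
    tp = np.zeros(len(predictions))
    for k, (vid, start, end, _) in enumerate(predictions):
        best_iou, best_idx = 0.0, -1
        for idx, seg in enumerate(ground_truth.get(vid, [])):
            iou = temporal_iou((start, end), seg)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_iou >= iou_threshold and not matched[vid][best_idx]:
            matched[vid][best_idx] = True
            tp[k] = 1.0
    if not len(predictions) or num_gt == 0:
        return 0.0
    cum_tp = np.cumsum(tp)
    recall = cum_tp / num_gt
    precision = cum_tp / np.arange(1, len(predictions) + 1)
    return float(np.trapz(precision, recall))  # area under the P-R curve
```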
6. Results

6.1. Temporal Forgery Localization

6.1.1 LAV-DF Dataset

We evaluated the performance of BA-TFD+ on the LAV-DF dataset for temporal forgery localization and compared it with other approaches. For the full set, from Table 3, our method achieves the best performance for [email protected] and AR@100. Unlike temporal action localization datasets, the segments in our dataset have a single label for the fake segments, which leads to high AP scores. The multimodal MDS method is not specifically designed for temporal forgery localization and can only predict fixed-length segments, lacking the ability to precisely identify boundaries; therefore, the scores for MDS are relatively low. For BMN and BSN++, the AP scores are low because they are designed for fake proposal generation instead of forgery localization. TadTR, ActionFormer, and TriDet achieve relatively better performance as they are one-stage temporal action localization approaches that generate more precise segments. Additionally, we observe that BMN trained with an end-to-end visual encoder performs better than when using pre-trained I3D features. With the multimodal complementary information, our approach outperforms the aforementioned approaches.

We further evaluated all methods on the subset of the LAV-DF dataset. From Table 4, it is observed that the performance of the visual-only methods, including BMN, AGT, BSN++ and TadTR, is improved. The visual-only score of our method improves from 64.78 ([email protected]) to 96.47 ([email protected]), and the margin between the unimodal and multimodal versions decreases significantly from 31.52 ([email protected]) to 0.35 ([email protected]). Thus, our method demonstrates superior performance for temporal forgery localization.
Table 7. Impact of loss functions. The contribution of different losses for temporal forgery localization on the full set of the LAV-DF
dataset.

Loss Function [email protected] [email protected] [email protected] AR@100 AR@50 AR@20 AR@10


Lf 59.45 51.46 07.11 77.25 75.60 70.76 67.24
Lc , Lf 63.42 56.24 08.55 78.17 76.47 71.58 68.22
Lb 71.31 34.30 00.12 66.92 63.67 57.99 54.72
Lbm , Lb 71.97 51.17 00.50 69.86 67.58 64.44 62.64
Lf , Lbm , Lb 94.71 78.54 01.66 77.86 76.44 74.67 73.69
Lc , Lf , Lbm , Lb 96.30 84.96 04.44 81.62 80.48 79.40 78.75

Table 8. Impact of pre-trained features. Comparison of different pre-trained features for temporal forgery localization on the full set of
the LAV-DF dataset.

Visual Audio Citation [email protected] [email protected] [email protected] AR@100 AR@50 AR@20 AR@10
I3D E2E [13] 74.76 59.57 04.02 74.28 71.92 68.64 66.63
MARLIN E2E [11] 92.27 75.11 04.10 77.93 76.38 74.53 73.47
3DMM E2E [6] 01.84 00.11 00.00 34.00 31.54 20.94 11.81
E2E TRILLsson3 [83] 95.16 82.67 05.65 81.21 79.80 78.22 77.49
E2E Wav2Vec2 [2] 95.92 84.94 05.66 82.48 81.38 79.93 79.24
E2E E2E N/A 96.30 84.96 04.44 81.62 80.48 79.40 78.75

Table 9. Impact of encoder architectures. Comparison of different backbone architectures for temporal forgery localization on the full
set of the LAV-DF dataset.

Visual Audio Boundary [email protected] [email protected] [email protected] AR@100 AR@50 AR@20 AR@10
3D CNN CNN BMN 76.90 38.50 00.25 66.90 64.08 60.77 58.42
3D CNN CNN BSN++ 92.44 71.34 01.15 75.86 74.43 72.39 71.21
MViTv2-Tiny CNN BMN 89.32 59.47 01.45 72.52 70.14 67.55 65.92
MViTv2-Small CNN BMN 89.31 59.97 01.78 72.74 70.35 67.56 65.87
MViTv2-Base CNN BMN 89.90 59.67 01.51 72.22 69.99 67.29 65.64
3D CNN ViT-Tiny BMN 78.08 35.18 00.41 67.38 64.38 60.92 58.66
3D CNN ViT-Small BMN 79.61 37.63 00.42 67.10 64.23 60.77 58.51
3D CNN ViT-Base BMN 80.86 36.55 00.34 67.24 64.27 60.86 58.46
MViTv2-Small ViT-Base BSN++ 93.59 75.22 02.56 77.73 76.08 74.07 72.93
MViTv2-Base ViT-Base BSN++ 96.30 84.96 04.44 81.62 80.48 79.40 78.75

6.1.2 ForgeryNet Dataset

We evaluated the performance of the visual-only BA-TFD+ trained on the ForgeryNet dataset and compared it with other approaches (using the results reported by [37]). As shown in Table 5, the performance of the visual-only BA-TFD+ exceeds the previous best model SlowFast [29]+BMN [57], showing that the proposed method has an advantage for temporal forgery localization.

6.2. Deepfake Detection

We also compare our method with previous deepfake detection methods on a subset of the DFDC dataset following the configuration of [19]. As shown in Table 6, the performance of our method is better than previous methods such as Meso4 [1], FWA [54], Siamese [64], and MDS [19]. In summary, our method performs well on the classification task.

6.3. Ablation Studies
6.3.1 Impact of Loss Functions

To examine the contributions of each loss of BA-TFD+, we trained six models with different combinations of losses. To aggregate the frame-level predictions for the models without a boundary module, we follow the algorithm proposed in previous work [101]. From Table 7, it is evident that all of the integrated losses have a positive influence on the performance. By observing the differences between the scores, the boundary matching loss L_b and the frame classification loss L_f contribute significantly to the performance. With the frame-level labels supervising the model, the encoders are trained to have a better capacity to extract features relevant to deepfake artifacts, whereas the boundary module provides the localization ability to detect the fake segments more precisely.

6.3.2 Impact of Pre-Trained Features

In the literature [62, 99], pre-trained visual features, such as I3D [13], are commonly used for temporal action localization. Since the I3D features are pre-trained on the Kinetics dataset [45], they encode a representation of the universal scene of the video. However, temporal forgery localization requires the model to have a specialized understanding of facial information. Therefore, pre-trained features obtained from a universal visual dataset are not likely to be suitable for our task. Our quantitative results support this, e.g. the comparison between the two BMN models in Table 3, where one uses I3D features and the other uses end-to-end training.

To examine the impact of pre-trained features on BA-TFD+, we trained models using different pre-trained features, including visual (I3D, MARLIN ViT-S [11] and 3DMM [6]) and audio (TRILLsson [83] and Wav2Vec2 [2]) features. The results are shown in Table 8. From the results, we can observe the following patterns: 1) the model trained fully end-to-end reaches the best performance, and 2) compared with visual features, audio features have better task-specific performance.

6.3.3 Impact of Encoder Architectures

To find the best modality-specific architecture for BA-TFD+, we trained several architecture combinations for the visual encoder, audio encoder, and boundary module. The results are presented in Table 9. Compared to the previous model BA-TFD [12] as baseline (3D-CNN + CNN + BMN [57]), we used attention-based architectures, including the MViTv2 [55] and ViT [27] families, for the encoders and the attention-based BSN++ module [86] for predicting boundaries.

We used the variations of MViTv2 from the original paper (i.e. MViTv2-Tiny, MViTv2-Small and MViTv2-Base) as the visual encoders. We can conclude that the MViTv2 architecture plays an important role when comparing with the baseline, but the benefit of different scales of the MViTv2 architecture is not significant. As for the audio encoder, we followed the architecture definitions for ViT (i.e. ViT-Tiny, ViT-Small and ViT-Base) for comparison. We can conclude that the audio encoder benefits from different scales of the ViT architecture. We also compared the BSN++-based boundary module with the BMN-based architecture. The contribution from BSN++ is the most significant, compared with MViTv2 for the visual encoder and ViT for the audio encoder. Owing to the attention mechanism, the framework utilizes global and local context to analyze the artifacts. The combination of MViTv2-Base, ViT-Base and BSN++ produces the best performance compared to all other combinations of modules.

Figure 6. Impact of CBG in the boundary matching module. The figure shows a comparison of models containing the CBG module and a model without the CBG module.

6.3.4 Impact of CBG in the Boundary Matching Module

We adopted the method from BSN++ [86] to improve the performance for temporal forgery localization. This method includes two modules, a complementary boundary generator (CBG) and a proposal relation block (PRB). The CBG module predicts the confidence that a timestamp is a starting or ending point of a segment. The PRB module, based on BMN [57], predicts the boundary map which contains the confidences of dense segment proposals. For inference, the results from both modules are multiplied as the final output. In this ablation study, we aim to discuss the impact of the CBG module.

We trained several models containing CBG modules with different loss weights, from 10^-2 to 10^4, and also a model without the CBG module. As shown in Figure 6, the best CBG loss weight is 10^3. However, compared with the non-CBG model, the best model with CBG can only compete on AR and has a huge gap on the AP metrics. Based on this observation, we drop the CBG module in the boundary module and only use PRB.
7. Conclusion

In this paper, we introduce and investigate content-driven multimodal deepfake generation, detection, and localization. We introduce a new dataset where both the audio and visual modalities are modified at strategic locations. Additionally, we propose a new method for temporal forgery localization. Through extensive experiments, we demonstrate that our method outperforms existing state-of-the-art techniques.

The proposed dataset, LAV-DF, may raise ethical concerns due to its potential negative social impact. Given that the dataset contains facial videos of celebrities, there could be a risk of its misuse for unethical purposes. Moreover, the dataset generation pipeline itself could be used to generate fake videos. To confront the potential negative impact of our work, we have taken several measures. Most importantly, we have prepared an end-user license agreement as a preventive measure. Similarly, users need to agree to terms and conditions to use the proposed temporal forgery localization method BA-TFD+.

This work has some limitations: 1) the audio reenactment method employed for dataset creation does not consistently generate the desired reference style, 2) the resolution of the dataset is limited by the source videos, and 3) the high classification scores obtained indicate the need for further improvement in the visual reenactment method.

Major improvements in the future will be extending the generation pipeline to include word token insertion, substitution and deletion, and converting statements into questions.

References

[1] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. MesoNet: a Compact Facial Video Forgery Detection Network. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–7, Dec. 2018.
[2] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Advances in Neural Information Processing Systems, volume 33, pages 12449–12460. Curran Associates, Inc., 2020.
[3] Anurag Bagchi, Jazib Mahmood, Dolton Fernandes, and Ravi Sarvadevabhatla. Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action Localization. In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pages 144–154. SCITEPRESS, 2022.
[4] Belhassen Bayar and Matthew C. Stamm. A Deep Learning Approach to Universal Image Manipulation Detection Using a New Convolutional Layer. In Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec '16), pages 5–10, June 2016.
[5] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., June 2009.
[6] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '99), pages 187–194. ACM Press, 1999.
[7] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-NMS – Improving Object Detection With One Line of Code. In Proceedings of the IEEE International Conference on Computer Vision, pages 5561–5569, 2017.
[8] John Brandon. There Are Now 15,000 Deepfake Videos on Social Media. Yes, You Should Worry. Forbes, Oct. 2019.
[9] Shyamal Buch, Victor Escorcia, Bernard Ghanem, Li Fei-Fei, and Juan Carlos Niebles. End-to-end, single-stream temporal action detection in untrimmed videos. In Proceedings of the British Machine Vision Conference 2017. British Machine Vision Association, 2017.
[10] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
[11] Zhixi Cai, Shreya Ghosh, Kalin Stefanov, Abhinav Dhall, Jianfei Cai, Hamid Rezatofighi, Reza Haffari, and Munawar Hayat. MARLIN: Masked Autoencoder for Facial Video Representation LearnINg. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1493–1504, 2023.
[12] Zhixi Cai, Kalin Stefanov, Abhinav Dhall, and Munawar Hayat. Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization. In 2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 1–10, Nov. 2022.
[13] Joao Carreira and Andrew Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[14] Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos De Oliveira, Arnaldo Candido Jr., Anderson Da Silva Soares, Sandra Maria Aluisio, and Moacir Antonelli Ponti. SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model. In Interspeech 2021, pages 3645–3649. ISCA, Aug. 2021.
[15] Beijing Chen, Tianmu Li, and Weiping Ding. Detecting deepfake videos based on spatiotemporal attention and convolutional LSTM. Information Sciences, 601:58–70, July 2022.
[16] Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou, Yi Xu, and Chenliang Xu. Talking-Head Generation with Rhythmic Head Motion. In Computer Vision – ECCV 2020, Lecture Notes in Computer Science, pages 35–51. Springer International Publishing, 2020.
[17] Lele Chen, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu. Hierarchical Cross-Modal Talking Face Generation With Dynamic Pixel-Wise Loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7832–7841, 2019.
[18] Francois Chollet. Xception: Deep Learning With Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
[19] Komal Chugh, Parul Gupta, Abhinav Dhall, and Ramanathan Subramanian. Not made for each other – Audio-Visual Dissonance-based Deepfake Detection and Localization. In Proceedings of the 28th ACM International Conference on Multimedia (MM '20), pages 439–447, Oct. 2020.
[20] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. VoxCeleb2: Deep Speaker Recognition. In Interspeech 2018, pages 1086–1090. ISCA, Sept. 2018.
[21] Joon Son Chung and Andrew Zisserman. Out of Time: Automated Lip Sync in the Wild. In Computer Vision – ACCV 2016 Workshops, Lecture Notes in Computer Science, pages 251–263. Springer International Publishing, 2017.
[22] Davide Alessandro Coccomini, Nicola Messina, Claudio Gennaro, and Fabrizio Falchi. Combining EfficientNet and Vision Transformers for Video Deepfake Detection. In Image Analysis and Processing – ICIAP 2022, Lecture Notes in Computer Science, pages 219–229. Springer International Publishing, 2022.
[23] Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. Recasting Residual-based Local Descriptors as Convolutional Neural Networks: an Application to Image Forgery Detection. In Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec '17), pages 159–164, June 2017.
[24] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100. International Journal of Computer Vision, 130(1):33–55, Jan. 2022.
[25] Oscar de Lima, Sean Franklin, Shreshtha Basu, Blake Karwoski, and Annet George. Deepfake Detection using Spatiotemporal Convolutional Networks, June 2020. arXiv:2006.14749 [cs, eess].
[26] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The DeepFake Detection Challenge (DFDC) Dataset, Oct. 2020. arXiv:2006.07397 [cs].
[27] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2021.
[28] Christoph Feichtenhofer. X3D: Expanding Architectures for Efficient Video Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 203–213, 2020.
[29] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast Networks for Video Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019.
[30] Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[31] Jiyang Gao, Kan Chen, and Ram Nevatia. CTAP: Complementary Temporal Action Proposal Generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 68–83, 2018.
[32] Jiyang Gao, Zhenheng Yang, Kan Chen, Chen Sun, and Ram Nevatia. TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals. In Proceedings of the IEEE International Conference on Computer Vision, pages 3628–3636, 2017.
[33] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM,
[40] … deepfakes detection. Applied Soft Computing, 136:110124, Mar. 2023.
[41] Amir Jamaludin, Joon Son Chung, and Andrew Zisserman. You Said That?: Synthesising Talking Faces from Audio. International Journal of Computer Vision, 127(11):1767–1779, Dec. 2019.
[42] Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, and Yonghui Wu. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18),
63(11):139–144, Oct. 2020. pages 4485–4495, Red Hook, NY, USA, Dec. 2018.
[34] Zhihao Gu, Yang Chen, Taiping Yao, Shouhong Curran Associates Inc.
Ding, Jilin Li, Feiyue Huang, and Lizhuang Ma. Spa- [43] Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and
tiotemporal Inconsistency Learning for DeepFake Chen Change Loy. DeeperForensics-1.0: A Large-
Video Detection. In Proceedings of the 29th ACM In- Scale Dataset for Real-World Face Forgery Detec-
ternational Conference on Multimedia, pages 3473– tion. In Proceedings of the IEEE/CVF Conference
3481. Association for Computing Machinery, New on Computer Vision and Pattern Recognition, pages
York, NY, USA, Oct. 2021. 2889–2898, 2020.
[35] Luca Guarnera, Oliver Giudice, and Sebastiano Bat- [44] Prajwal K R, Rudrabha Mukhopadhyay, Jerin Philip,
tiato. DeepFake Detection by Analyzing Convolu- Abhishek Jha, Vinay Namboodiri, and C V Jawa-
tional Traces. In Proceedings of the IEEE/CVF Con- har. Towards Automatic Face-to-Face Translation.
ference on Computer Vision and Pattern Recognition In Proceedings of the 27th ACM International Con-
Workshops, pages 666–667, 2020. ference on Multimedia, MM ’19, pages 1428–1436,
[36] Yudong Guo, Keyu Chen, Sen Liang, Yong-Jin Liu, New York, NY, USA, Oct. 2019. Association for
Hujun Bao, and Juyong Zhang. AD-NeRF: Au- Computing Machinery.
dio Driven Neural Radiance Fields for Talking Head [45] Will Kay, Joao Carreira, Karen Simonyan, Brian
Synthesis. In Proceedings of the IEEE/CVF Interna- Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan,
tional Conference on Computer Vision, pages 5784– Fabio Viola, Tim Green, Trevor Back, Paul Natsev,
5794, 2021. Mustafa Suleyman, and Andrew Zisserman. The
[37] Yinan He, Bei Gan, Siyu Chen, Yichun Zhou, Guo- Kinetics Human Action Video Dataset, May 2017.
jun Yin, Luchuan Song, Lu Sheng, Jing Shao, and arXiv:1705.06950 [cs].
Ziwei Liu. ForgeryNet: A Versatile Benchmark for [46] Hasam Khalid, Minha Kim, Shahroz Tariq, and Si-
Comprehensive Forgery Analysis. In Proceedings of mon S. Woo. Evaluation of an Audio-Video Multi-
the IEEE/CVF Conference on Computer Vision and modal Deepfake Dataset using Unimodal and Multi-
Pattern Recognition, pages 4360–4369, 2021. modal Detectors. Proceedings of the 1st Workshop on
[38] Young-Jin Heo, Woon-Ha Yeo, and Byung-Gyu Synthetic Multimedia - Audiovisual Deepfake Gener-
Kim. DeepFake detection algorithm based on im- ation and Detection, pages 7–15, Oct. 2021. arXiv:
proved vision transformer. Applied Intelligence, 2109.02993.
53(7):7512–7527, Apr. 2023. [47] Hasam Khalid, Shahroz Tariq, and Simon S. Woo.
[39] Haroon Idrees, Amir R. Zamir, Yu-Gang Jiang, FakeAVCeleb: A Novel Audio-Video Multimodal
Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Deepfake Dataset, Aug. 2021. arXiv: 2108.05080
Mubarak Shah. The THUMOS Challenge on Action [cs].
Recognition for Videos ”in the Wild”. Computer Vi- [48] Davis E. King. Dlib-ml: A Machine Learning
sion and Image Understanding, 155:1–23, Feb. 2017. Toolkit. The Journal of Machine Learning Research,
arXiv: 1604.06182. 10:1755–1758, Dec. 2009.
[40] Hafsa Ilyas, Ali Javed, and Khalid Mahmood Ma- [49] Pavel Korshunov and Sebastien Marcel. DeepFakes:
lik. AVFakeNet: A unified end-to-end Dense Swin a New Threat to Face Recognition? Assessment and
Transformer deep learning model for audio–visual Detection, Dec. 2018. arXiv:1812.08685 [cs].
[50] Iryna Korshunova, Wenzhe Shi, Joni Dambre, and Network for Temporal Action Proposal Generation.
Lucas Theis. Fast Face-Swap Using Convolutional In Proceedings of the European Conference on Com-
Neural Networks. In Proceedings of the IEEE In- puter Vision (ECCV), pages 3–19, 2018.
ternational Conference on Computer Vision, pages
[60] Xiaolong Liu, Song Bai, and Xiang Bai. An Em-
3677–3685, 2017.
pirical Study of End-to-End Temporal Action Detec-
[51] Patrick Kwon, Jaeseong You, Gyuhyeon Nam, Sung- tion. In Proceedings of the IEEE/CVF Conference
woo Park, and Gyeongsu Chae. KoDF: A Large- on Computer Vision and Pattern Recognition, pages
Scale Korean DeepFake Detection Dataset. In Pro- 20010–20019, 2022.
ceedings of the IEEE/CVF International Conference
on Computer Vision, pages 10744–10753, 2021. [61] Xiaolong Liu, Yao Hu, Song Bai, Fei Ding, Xiang
Bai, and Philip H. S. Torr. Multi-Shot Temporal
[52] John K. Lewis, Imad Eddine Toubal, Helen
Event Localization: A Benchmark. In Proceedings of
Chen, Vishal Sandesera, Michael Lomnitz, Zigfried
the IEEE/CVF Conference on Computer Vision and
Hampel-Arias, Calyam Prasad, and Kannappan Pala-
Pattern Recognition, pages 12596–12606, 2021.
niappan. Deepfake Video Detection Based on Spa-
tial, Spectral, and Temporal Inconsistencies Using [62] Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Shi-
Multimodal Deep Learning. In 2020 IEEE Applied wei Zhang, Song Bai, and Xiang Bai. End-to-End
Imagery Pattern Recognition Workshop (AIPR), Temporal Action Detection With Transformer. IEEE
pages 1–9, Mar. 2020. ISSN: 2332-5615. Transactions on Image Processing, 31:5427–5441,
[53] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, 2022. Conference Name: IEEE Transactions on Im-
Dong Chen, Fang Wen, and Baining Guo. Face X- age Processing.
Ray for More General Face Forgery Detection. In [63] Yi Liu, Limin Wang, Yali Wang, Xiao Ma, and Yu
Proceedings of the IEEE/CVF Conference on Com- Qiao. FineAction: A Fine-Grained Video Dataset for
puter Vision and Pattern Recognition, pages 5001– Temporal Action Localization. IEEE Transactions
5010, 2020. on Image Processing, 31:6937–6950, 2022. Confer-
[54] Yuezun Li and Siwei Lyu. Exposing DeepFake ence Name: IEEE Transactions on Image Process-
Videos By Detecting Face Warping Artifacts. In ing.
IEEE Conference on Computer Vision and Pattern
[64] Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra,
Recognition Workshops (CVPRW), page 7, 2019.
Aniket Bera, and Dinesh Manocha. Emotions Don’t
[55] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Lie: An Audio-Visual Deepfake Detection Method
Mangalam, Bo Xiong, Jitendra Malik, and Christoph using Affective Cues. In Proceedings of the 28th
Feichtenhofer. MViTv2: Improved Multiscale Vision ACM International Conference on Multimedia, MM
Transformers for Classification and Detection. In ’20, pages 2823–2832, New York, NY, USA, Oct.
Proceedings of the IEEE/CVF Conference on Com- 2020. Association for Computing Machinery.
puter Vision and Pattern Recognition, pages 4804–
4814, 2022. [65] Daniel Mas Montserrat, Hanxiang Hao, Sri K. Yarla-
[56] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and gadda, Sriram Baireddy, Ruiting Shao, Janos Hor-
Siwei Lyu. Celeb-DF: A Large-Scale Challenging vath, Emily Bartusiak, Justin Yang, David Guera,
Dataset for DeepFake Forensics. In Proceedings of Fengqing Zhu, and Edward J. Delp. Deepfakes De-
the IEEE/CVF Conference on Computer Vision and tection With Automatic Face Weighting. In Proceed-
Pattern Recognition, pages 3207–3216, 2020. ings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition Workshops, pages 668–
[57] Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei
669, 2020.
Wen. BMN: Boundary-Matching Network for Tem-
poral Action Proposal Generation. In Proceedings [66] Kartik Narayan, Harsh Agarwal, Kartik Thakral,
of the IEEE/CVF International Conference on Com- Surbhi Mittal, Mayank Vatsa, and Richa Singh.
puter Vision, pages 3889–3898, 2019. DF-Platter: Multi-Face Heterogeneous Deepfake
[58] Tianwei Lin, Xu Zhao, and Zheng Shou. Single Shot Dataset. In Proceedings of the IEEE/CVF Confer-
Temporal Action Detection. In Proceedings of the ence on Computer Vision and Pattern Recognition,
25th ACM international conference on Multimedia, pages 9739–9748, 2023.
MM ’17, pages 988–996, New York, NY, USA, Oct. [67] Megha Nawhal and Greg Mori. Activity Graph
2017. Association for Computing Machinery. Transformer for Temporal Action Localiza-
[59] Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing tion. arXiv:2101.08540 [cs], Jan. 2021. arXiv:
Wang, and Ming Yang. BSN: Boundary Sensitive 2101.08540.
[68] Paarth Neekhara, Shehzeen Hussain, Shlomo Dub- IEEE/CVF International Conference on Computer
nov, Farinaz Koushanfar, and Julian McAuley. Ex- Vision, pages 1–11, 2019.
pressive Neural Voice Cloning. In Proceedings of [77] David E. Rumelhart, Geoffrey E. Hinton, and
The 13th Asian Conference on Machine Learning, Ronald J. Williams. Learning Internal Representa-
pages 252–267. PMLR, Nov. 2021. ISSN: 2640- tions by Error Propagation. Technical report, CAL-
3498. IFORNIA UNIV SAN DIEGO LA JOLLA INST
[69] Dufou Nick and Jigsaw Andrew. Contributing Data FOR COGNITIVE SCIENCE, Sept. 1985. Section:
to Deepfake Detection Research, Sept. 2019. Technical Reports.
[70] Yuval Nirkin, Yosi Keller, and Tal Hassner. FS- [78] Ian Sample. What are deepfakes – and how can you
GAN: Subject Agnostic Face Swapping and Reen- spot them? The Guardian, Jan. 2020.
actment. In Proceedings of the IEEE/CVF Interna-
[79] Conrad Sanderson, editor. The VidTIMIT Database.
tional Conference on Computer Vision, pages 7184–
IDIAP, 2002.
7193, 2019.
[80] Oscar Schwartz. You thought fake news was bad?
[71] Aaron van den Oord, Sander Dieleman, Heiga
Deep fakes are where truth goes to die. The
Zen, Karen Simonyan, Oriol Vinyals, Alex Graves,
Guardian, Nov. 2018.
Nal Kalchbrenner, Andrew Senior, and Koray
Kavukcuoglu. WaveNet: A Generative Model for [81] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike
Raw Audio, Sept. 2016. arXiv:1609.03499 [cs]. Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng
Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan,
[72] Adam Paszke, Sam Gross, Francisco Massa, Adam
Rif A. Saurous, Yannis Agiomvrgiannakis, and
Lerer, James Bradbury, Gregory Chanan, Trevor
Yonghui Wu. Natural TTS Synthesis by Condi-
Killeen, Zeming Lin, Natalia Gimelshein, Luca
tioning Wavenet on MEL Spectrogram Predictions.
Antiga, Alban Desmaison, Andreas Kopf, Edward
In 2018 IEEE International Conference on Acous-
Yang, Zachary DeVito, Martin Raison, Alykhan Te-
tics, Speech and Signal Processing (ICASSP), pages
jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang,
4779–4783, Apr. 2018. ISSN: 2379-190X.
Junjie Bai, and Soumith Chintala. PyTorch: An Im-
perative Style, High-Performance Deep Learning Li- [82] Dingfeng Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia
brary. In Advances in Neural Information Processing Li, and Dacheng Tao. TriDet: Temporal Action De-
Systems, volume 32. Curran Associates, Inc., 2019. tection With Relative Boundary Modeling. In Pro-
ceedings of the IEEE/CVF Conference on Computer
[73] K R Prajwal, Rudrabha Mukhopadhyay, Vinay P.
Vision and Pattern Recognition, pages 18857–18866,
Namboodiri, and C.V. Jawahar. A Lip Sync Ex-
2023.
pert Is All You Need for Speech to Lip Generation
In the Wild. In Proceedings of the 28th ACM Inter- [83] Joel Shor and Subhashini Venugopalan. TRILLsson:
national Conference on Multimedia, MM ’20, pages Distilled Universal Paralinguistic Speech Represen-
484–492, New York, NY, USA, Oct. 2020. Associa- tations. In Interspeech 2022, pages 356–360, Sept.
tion for Computing Machinery. 2022. arXiv:2203.00236 [cs, eess].
[74] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, [84] Zheng Shou, Jonathan Chan, Alireza Zareian,
and Jing Shao. Thinking in Frequency: Face Forgery Kazuyuki Miyazawa, and Shih-Fu Chang. CDC:
Detection by Mining Frequency-Aware Clues. In An- Convolutional-De-Convolutional Networks for Pre-
drea Vedaldi, Horst Bischof, Thomas Brox, and Jan- cise Temporal Action Localization in Untrimmed
Michael Frahm, editors, Computer Vision – ECCV Videos. In Proceedings of the IEEE Conference
2020, Lecture Notes in Computer Science, pages 86– on Computer Vision and Pattern Recognition, pages
103, Cham, 2020. Springer International Publishing. 5734–5743, 2017.
[75] Muhammad Anas Raza and Khalid Mahmood Malik. [85] Zheng Shou, Dongang Wang, and Shih-Fu Chang.
Multimodaltrace: Deepfake Detection Using Audio- Temporal Action Localization in Untrimmed Videos
visual Representation Learning. In Proceedings of via Multi-Stage CNNs. In Proceedings of the IEEE
the IEEE/CVF Conference on Computer Vision and Conference on Computer Vision and Pattern Recog-
Pattern Recognition, pages 993–1000, 2023. nition, pages 1049–1058, 2016.
[76] Andreas Rossler, Davide Cozzolino, Luisa Verdo- [86] Haisheng Su, Weihao Gan, Wei Wu, Yu Qiao, and
liva, Christian Riess, Justus Thies, and Matthias Junjie Yan. BSN++: Complementary Boundary Re-
Niessner. FaceForensics++: Learning to Detect Ma- gressor with Scale-Balanced Relation Modeling for
nipulated Facial Images. In Proceedings of the Temporal Action Proposal Generation. Proceedings
of the AAAI Conference on Artificial Intelligence, Kui Ren. AVoiD-DF: Audio-Visual Joint Learning
35(3):2602–2610, May 2021. Number: 3. for Detecting Deepfake. IEEE Transactions on Infor-
[87] Justus Thies, Mohamed Elgharib, Ayush Tewari, mation Forensics and Security, 18:2015–2029, 2023.
Christian Theobalt, and Matthias Nießner. Neural Conference Name: IEEE Transactions on Informa-
Voice Puppetry: Audio-Driven Facial Reenactment. tion Forensics and Security.
In Andrea Vedaldi, Horst Bischof, Thomas Brox, and [97] Xin Yang, Yuezun Li, and Siwei Lyu. Exposing
Jan-Michael Frahm, editors, ECCV 2020, Lecture Deep Fakes Using Inconsistent Head Poses. In
Notes in Computer Science, pages 716–731, Cham, ICASSP 2019 - 2019 IEEE International Confer-
2020. Springer International Publishing. ence on Acoustics, Speech and Signal Processing
[88] Daniel Thomas. Deepfakes: A threat to democracy (ICASSP), pages 8261–8265, May 2019. ISSN:
or just a bit of fun? BBC News, Jan. 2020. 2379-190X.
[89] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and [98] Runhao Zeng, Wenbing Huang, Mingkui Tan, Yu
Jan Kautz. MoCoGAN: Decomposing Motion and Rong, Peilin Zhao, Junzhou Huang, and Chuang
Content for Video Generation. In Proceedings of the Gan. Graph Convolutional Networks for Tempo-
IEEE Conference on Computer Vision and Pattern ral Action Localization. In Proceedings of the
Recognition, pages 1526–1535, 2018. IEEE/CVF International Conference on Computer
Vision, pages 7094–7103, 2019.
[90] Junke Wang, Zuxuan Wu, Wenhao Ouyang, Xintong
[99] Chen-Lin Zhang, Jianxin Wu, and Yin Li. Action-
Han, Jingjing Chen, Yu-Gang Jiang, and Ser-Nam
Former: Localizing Moments of Actions with Trans-
Li. M2TR: Multi-modal Multi-scale Transformers
formers. In Shai Avidan, Gabriel Brostow,
for Deepfake Detection. In Proceedings of the 2022
Moustapha Cissé, Giovanni Maria Farinella, and Tal
International Conference on Multimedia Retrieval,
Hassner, editors, Computer Vision – ECCV 2022,
ICMR ’22, pages 615–623, New York, NY, USA,
Lecture Notes in Computer Science, pages 492–510,
June 2022. Association for Computing Machinery.
Cham, 2022. Springer Nature Switzerland.
[91] Yuxuan Wang, R.J. Skerry-Ryan, Daisy Stan-
[100] Hang Zhao, Antonio Torralba, Lorenzo Torresani,
ton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly,
and Zhicheng Yan. HACS: Human Action Clips and
Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy
Segments Dataset for Recognition and Temporal Lo-
Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob
calization. In Proceedings of the IEEE/CVF Interna-
Clark, and Rif A. Saurous. Tacotron: Towards End-
tional Conference on Computer Vision, pages 8668–
to-End Speech Synthesis. In Interspeech 2017, pages
8678, 2019.
4006–4010. ISCA, Aug. 2017.
[101] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong
[92] Deressa Wodajo and Solomon Atnafu. Deepfake
Wu, Xiaoou Tang, and Dahua Lin. Temporal Ac-
Video Detection Using Convolutional Vision Trans-
tion Detection With Structured Segment Networks.
former, Mar. 2021. arXiv:2102.11126 [cs].
In Proceedings of the IEEE International Conference
[93] Mengmeng Xu, Chen Zhao, David S. Rojas, Ali Tha- on Computer Vision, pages 2914–2923, 2017.
bet, and Bernard Ghanem. G-TAD: Sub-Graph Lo- [102] Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change
calization for Temporal Action Detection. In Pro- Loy, Xiaogang Wang, and Ziwei Liu. Pose-
ceedings of the IEEE/CVF Conference on Computer Controllable Talking Face Generation by Implicitly
Vision and Pattern Recognition, pages 10156–10165, Modularized Audio-Visual Representation. In Pro-
2020. ceedings of the IEEE/CVF Conference on Computer
[94] Ke Yang, Peng Qiao, Dongsheng Li, Shaohe Lv, Vision and Pattern Recognition, pages 4176–4186,
and Yong Dou. Exploring Temporal Preservation 2021.
Networks for Precise Temporal Action Localization. [103] Tianfei Zhou, Wenguan Wang, Zhiyuan Liang, and
Proceedings of the AAAI Conference on Artificial In- Jianbing Shen. Face Forensics in the Wild. In Pro-
telligence, 32(1), Apr. 2018. Number: 1. ceedings of the IEEE/CVF Conference on Computer
[95] Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, Vision and Pattern Recognition, pages 5778–5788,
and Limin Wang. BasicTAD: An astounding RGB- 2021.
Only baseline for temporal action detection. Com- [104] Yang Zhou, Xintong Han, Eli Shechtman, Jose
puter Vision and Image Understanding, 232:103692, Echevarria, Evangelos Kalogerakis, and Dingzeyu
July 2023. Li. MakeltTalk: speaker-aware talking-head anima-
[96] Wenyuan Yang, Xiaoyu Zhou, Zhikai Chen, Bofei tion. ACM Transactions on Graphics, 39(6):221:1–
Guo, Zhongjie Ba, Zhihua Xia, Xiaochun Cao, and 221:15, Nov. 2020.
[105] Yizhe Zhu, Jialin Gao, and Xi Zhou. AVForensics:
Audio-driven Deepfake Video Detection with Mask-
ing Strategy in Self-supervision. In Proceedings of
the 2023 ACM International Conference on Multime-
dia Retrieval, ICMR ’23, pages 162–171, New York,
NY, USA, 2023. Association for Computing Machin-
ery.
[106] Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun
Ma, and Yu-Gang Jiang. WildDeepfake: A Chal-
lenging Real-World Dataset for Deepfake Detection.
In Proceedings of the 28th ACM International Con-
ference on Multimedia, MM ’20, pages 2382–2390,
New York, NY, USA, Oct. 2020. Association for
Computing Machinery.
