
Deepfake Video Detection Using Convolutional Vision Transformer

Deressa Wodajo, Jimma University, [email protected]
Solomon Atnafu, Addis Ababa University, [email protected]

arXiv:2102.11126v3 [cs.CV] 11 Mar 2021

Abstract

The rapid advancement of deep learning models that can generate and synthesize hyper-realistic videos, known as Deepfakes, and their ease of access have raised concerns about their possible malicious use. Deep learning techniques can now generate faces, swap faces between two subjects in a video, alter facial expressions, change gender, and alter facial features, to list a few. These powerful video manipulation methods have potential uses in many fields. However, they also pose a looming threat to everyone if used for harmful purposes such as identity theft, phishing, and scams. In this work, we propose a Convolutional Vision Transformer for the detection of Deepfakes. The Convolutional Vision Transformer has two components: a Convolutional Neural Network (CNN) and a Vision Transformer (ViT). The CNN extracts learnable features while the ViT takes in the learned features as input and categorizes them using an attention mechanism. We trained our model on the DeepFake Detection Challenge Dataset (DFDC) and achieved 91.5 percent accuracy, an AUC value of 0.91, and a loss value of 0.32. Our contribution is that we have added a CNN module to the ViT architecture and have achieved a competitive result on the DFDC dataset.

1. Introduction

Technologies for altering images, videos, and audio are developing rapidly [12, 62]. Techniques and technical expertise to create and manipulate digital content are also easily accessible. Currently, it is possible to seamlessly generate hyper-realistic digital images [28] with few resources and easy how-to instructions available online [30, 9]. Deepfake is a technique which aims to replace the face of a targeted person with the face of someone else in a video [1]. It is created by splicing a synthesized face region into the original image [62]. The term can also refer to the final hyper-realistic video that is produced. Deepfakes can be used for the creation of hyper-realistic Computer Generated Imagery (CGI), Virtual Reality (VR) [7], Augmented Reality (AR), education, animation, arts, and cinema [13]. However, since Deepfakes are deceptive in nature, they can also be used for malicious purposes.

Since the Deepfake phenomenon emerged, various authors have proposed different mechanisms to differentiate real videos from fake ones. As pointed out by [10], even though each proposed mechanism has its strengths, current detection methods lack generalizability. The authors noted that existing models focus on specific Deepfake creation tools, tackling them by studying their supposed behaviors. For instance, Yuezun et al. [33] and TackHyun et al. [25] used inconsistencies in eye blinking to detect Deepfakes. However, using the work of Konstantinos et al. [58] and Hai et al. [46], it is now possible to mimic eye blinking. The authors in [58] presented a system that generates videos of talking heads with natural facial expressions such as eye blinking. The authors in [46] proposed a model that can generate facial expressions from a portrait. Their system can animate a still picture to express emotions, including a hallucination of eye-blinking motions.

We base our work on two weaknesses of Deepfake detection methods pointed out by [10, 11]: data preprocessing and generality. Polychronis et al. [11] noted that current Deepfake detection systems focus mostly on presenting their proposed architecture and give less emphasis to data preprocessing and its impact on the final detection model. The authors stressed the importance of data preprocessing for Deepfake detection. Joshua et al. [10] focused on the generality of facial forgery detection and found that most proposed systems lacked generality. The authors defined generality as reliably detecting multiple spoofing techniques and reliably detecting unseen spoofing techniques.

Umur et al. [13] proposed a generalized Deepfake detector called FakeCatcher using biological signals (internal representations of image generators and synthesizers). They used a simple Convolutional Neural Network (CNN) classifier with only three layers. The authors used 3000 videos for training and testing. However, they did not specify in detail how they preprocessed their data. From [31, 52, 21], it is evident that very deep CNNs have superior performance to shallow CNNs in image classification tasks. Hence, there is still room for another generalized Deepfake detector that has an extensive data preprocessing pipeline and is trained on a very deep neural network model to catch as many Deepfake artifacts as possible.

Therefore, we propose a generalized Convolutional Vision Transformer (CViT) architecture to detect Deepfake videos using Convolutional Neural Networks and the Transformer architecture. We call our approach generalized for three main reasons. 1) Our proposed model can learn local and global image features using the CNN and the Transformer architecture through the attention mechanism of the Transformer [6]. 2) We give equal emphasis to our data preprocessing during training and classification. 3) We propose to train our model on a diverse set of face images using the largest dataset currently available, to detect Deepfakes created in different settings, environments, and orientations.

2. Related Work

With the rapid advancement of CNNs [4, 20], Generative Adversarial Networks (GANs) [18], and their variants [22], it is now possible to create hyper-realistic images [32], videos [61], and audio signals [53, 15] that are much harder to detect and distinguish from real, untampered audiovisuals. The ability to create seemingly real sound, images, and videos has caused a stir among concerned stakeholders seeking to prevent such technologies from being used by adversaries for malicious purposes [12]. To this effect, there is currently a push in the research community to come up with Deepfake detection mechanisms.

2.1. Deep Learning Techniques for Deepfake Video Generation

Deepfakes are generated and synthesized by deep generative models such as GANs and Autoencoders (AEs) [18, 37]. A Deepfake is created by swapping between two identities of subjects in an image or video [56]. Deepfakes can also be created using different techniques such as face swap [43], puppet-master [53], lip-sync [49, 47], face reenactment [14], synthetic image or video generation, and speech synthesis [48]. Supervised [45, 24, 51] and unsupervised image-to-image translation [19] and video-to-video translation [59, 35] can be used to create highly realistic Deepfakes.

The first Deepfake technique was FakeApp [42], which used two AE networks. An AE is a Feedforward Neural Network (FFNN) with an encoder-decoder architecture that is trained to reconstruct its input data [60]. FakeApp's encoder extracts the latent face features, and its decoder reconstructs the face images. The two AE networks share the same encoder to swap between the source and target faces, and use different decoders during training.

Most of the Deepfake creation mechanisms focus on the face region, in which face swapping and pixel-wise editing are commonly used [28]. In face swap, the face of a source image is swapped onto the face of a target image. In puppet-master, the person creating the video controls the person in the video. In lip-sync, the source person controls the mouth movement in the target video, and in face reenactment, facial features are manipulated [56]. The Deepfake creation mechanisms commonly use feature map representations of a source image and a target image. Some of the feature map representations are the Facial Action Coding System (FACS), image segmentation, facial landmarks, and facial boundaries [37]. FACS is a taxonomy of human facial expressions that defines 32 atomic facial muscle actions named Action Units (AU) and 14 Action Descriptors (AD) for miscellaneous actions. Facial landmarks are a set of defined positions on the face, such as eye, nose, and mouth positions [36].

2.1.1 Face Synthesis

Image synthesis deals with generating unseen images from sample training examples [23]. Face image synthesis techniques are used in face aging, face frontalization, and pose-guided generation. GANs are used mainly in face synthesis. GANs are generative models that are designed to create generative models of data from samples [3, 18]. GANs contain two adversarial networks, a generative model G and a discriminative model D. The generator and the discriminator act as adversaries with respect to each other to produce real-like samples [22]. The generator's goal is to capture the data distribution. The goal of the discriminator is to determine whether a sample is from the model distribution or the data distribution [18]. Face frontalization GANs change the face orientation in an image. Pose-guided face image generation maps the pose of an input image to another image. GAN architectures such as StyleGAN [26] and FSGAN [43] synthesize highly realistic-looking images.

2.1.2 Face Swap

Face swap or identity swap is a GAN-based method that creates realistic Deepfake videos. The face swap process inserts the face of a source image into a target image in which the subject has never appeared [56]. It is most popularly used to insert famous actors into a variety of movie clips [2]. Face swaps can be synthesized using GANs and traditional CV techniques such as FaceSwap (an application for swapping faces) and ZAO (a Chinese mobile application that swaps anyone's face onto any video clip) [56]. Face Swapping GAN (FSGAN) [43] and Region-Separative GAN (RSGAN) [39] are used for face swapping, face reenactment, attribute editing, and face part synthesis. The Deepfake FaceSwap method uses two AEs with a shared encoder that reconstructs training images of the source and target faces [56]. The process involves a face detector that crops and aligns the face using facial landmark information [38]. A trained
encoder and decoder of the source face swap the features of the source image to the target face. The autoencoder output is then blended with the rest of the image using Poisson editing [38].

Facial expression (face reenactment) swap alters one's facial expression or transfers facial expressions between persons. Expression reenactment turns an identity into a puppet [37]. Using facial expression swap, one can transfer the expression of one person to another [27]. Various facial reenactment methods have been proposed through the years. CycleGAN was proposed by Jun-Yan et al. [63] for facial reenactment between two video sources without any pair of training examples. Face2Face manipulates the facial expression of a source image and projects it onto another target face in real time [54]. Face2Face creates a dense reconstruction between the source image and the target image that is used for the synthesis of the face images under different lighting settings [38].

2.2. Deep Learning Techniques for Deepfake Video Detection

Deepfake detection methods fall into three categories [34, 37]. Methods in the first category focus on the physical or psychological behavior of the videos, such as tracking eye blinking or head pose movement. The second category focuses on GAN fingerprints and biological signals found in images, such as blood flow that can be detected in an image. The third category focuses on visual artifacts. Methods that focus on visual artifacts are data-driven and require a large amount of data for training. Our proposed model falls into the third category. In this section, we discuss various architectures designed and developed to detect the visual artifacts of Deepfakes.

Darius et al. [1] proposed a CNN model called MesoNet to automatically detect hyper-realistic forged videos created using Deepfake [40] and Face2Face [54]. The authors used two network architectures (Meso-4 and MesoInception-4) that focus on the mesoscopic properties of an image. Yuezun and Siwei [34] proposed a CNN architecture that takes advantage of the image transform (i.e., scaling, rotation, and shearing) inconsistencies created during the creation of Deepfakes. Their approach targets the artifacts in affine face warping as the distinctive feature to distinguish real and fake images. Their method compares the Deepfake face region with that of the neighboring pixels to spot resolution inconsistencies that occur during face warping.

Huy et al. [41] proposed a novel deep learning approach to detect forged images and videos. The authors focused on replay attacks, face swapping, facial reenactment, and fully computer-generated image spoofing. Daniel Mas Montserrat et al. [38] proposed a system that extracts visual and temporal features from faces present in a video. Their method combines a CNN and an RNN architecture to detect Deepfake videos.

Md. Shohel Rana and Andrew H. Sung [50] proposed DeepfakeStack, an ensemble method (a stack of different DL models) for Deepfake detection. The ensemble is composed of the XceptionNet, InceptionV3, InceptionResNetV2, MobileNet, ResNet101, DenseNet121, and DenseNet169 open-source DL models. Junyaup Kim et al. [29] proposed a classifier that distinguishes target individuals from a set of similar people using the ShallowNet, VGG-16, and Xception pre-trained DL models. The main objective of their system is to evaluate the classification performance of the three DL models.

3. Convolutional Vision Transformer

In this section, we present our approach to detecting Deepfake videos. The Deepfake video detection model consists of two components: the preprocessing component and the detection component. The preprocessing component consists of face extraction and data augmentation. The detection component consists of the training component, the validation component, and the testing component. The training and validation components contain a Convolutional Vision Transformer (CViT). The CViT has a feature learning component that learns the features of input images and a ViT architecture that determines whether a specific video is fake or real. The testing component applies the CViT learning model on input images to detect Deepfakes. Our proposed model is shown in Figure 1.

3.1. Preprocessing

The preprocessing component's function is to prepare the raw dataset for training, validating, and testing our CViT model. The preprocessing component has two sub-components: the face extraction component and the data augmentation component. The face extraction component is responsible for extracting face images from a video in a 224 x 224 RGB format. Figure 2 and Figure 3 show samples of the extracted faces.

Figure 2. Sample extracted fake face images.

Figure 3. Sample extracted real face images.
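The paper describes this step in prose only; the following is a minimal sketch of how 224 x 224 face crops could be pulled out of a video with OpenCV and the face recognition library [17], one of the three detectors the authors list. The function name and the frame-sampling rate are our own illustrative choices, not part of the original pipeline.

```python
# Sketch of the face-extraction step (illustrative, not the authors' code).
# Assumes: pip install opencv-python face_recognition
import cv2
import face_recognition

def extract_faces(video_path, out_size=224, frame_step=10):
    """Yield RGB face crops resized to out_size x out_size from a video."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:
            # face_recognition expects RGB images
            frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
            for top, right, bottom, left in face_recognition.face_locations(frame_rgb):
                crop = frame_rgb[top:bottom, left:right]
                yield cv2.resize(crop, (out_size, out_size))
        idx += 1
    cap.release()

# Example: save crops from one (hypothetical) video file to JPEG.
# for i, face in enumerate(extract_faces("sample.mp4")):
#     cv2.imwrite(f"face_{i:04d}.jpg", cv2.cvtColor(face, cv2.COLOR_RGB2BGR))
```

The same loop can be driven by a different detector (e.g., BlazeFace or MTCNN); only the detection call changes.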
3.2. Detection

The Deepfake detection process consists of three sub-components: the training, the validation, and the testing components. The training component is the principal part of the proposed model; it is where the learning occurs. DL models require significant time to design and fine-tune in order to fit a particular problem domain. In our case, the foremost consideration is to search for an optimal CViT model that learns the features of Deepfake videos. For this, we need to search for the right parameters appropriate for training our dataset. The validation component is similar to the training component. It is a process that fine-tunes our model: it is used to evaluate the CViT model, helps the model update its internal state, and helps us track the model's training progress and its Deepfake detection accuracy. The testing component is where we classify and determine the class of the faces extracted from a specific video. Thus, this sub-component addresses our research objectives.
Figure 1. Convolutional Vision Transformer.

The proposed CViT model consists of two components: Feature Learning (FL) and the ViT. The FL extracts learnable features from the face images. The ViT takes the FL output as input and turns it into a sequence of image patches for the final detection process.

The Feature Learning (FL) component is a stack of convolutional operations. The FL component follows the structure of the VGG architecture [52]. The FL component differs from the VGG model in that it does not have the fully connected layers of the VGG architecture, and its purpose is not classification but to extract face image features for the ViT component. Hence, the FL component is a CNN without the fully connected layers.

The FL component has 17 convolutional layers, each with a 3 x 3 kernel. The convolutional layers extract the low-level features of the face images. All convolutional layers have a stride and padding of 1. Batch normalization to normalize the output features and the ReLU activation function for non-linearity are applied in all of the layers. Batch normalization normalizes the change in the distribution of the outputs of the previous layers [41], as changes between the layers would otherwise affect the learning process of the CNN architecture. Five max-pooling operations with a 2 x 2-pixel window and a stride of 2 are also used. Each max-pooling operation reduces the spatial dimension of the image by half. After each max-pooling operation, the width of the convolutional layer (the number of channels) is doubled, with the first layer having 32 channels and the last layer 512.
The FL component has three consecutive convolutional operations in each block, except for the last two blocks, which have four convolutional operations. We call these grouped convolutional layers a CONV Block for simplicity. Each convolutional computation is followed by batch normalization and the ReLU nonlinearity. The FL component has 10.8 million learnable parameters. The FL takes in an image of size 224 x 224 x 3, which is then convolved at each convolutional operation. The FL internal state can be represented as a (C, H, W) tensor, where C is the channel, H is the height, and W is the width. The final output of the FL is a 512 x 7 x 7 spatially correlated low-level feature map of the input images, which is then fed to the ViT architecture.
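As an illustration of this configuration (a reconstruction on our part, not the authors' released code), the five CONV blocks with 3 + 3 + 3 + 4 + 4 convolutions, 3 x 3 kernels, batch normalization, ReLU, and a 2 x 2 max-pooling after each block can be sketched in PyTorch as follows; it maps a 224 x 224 x 3 input to a 512 x 7 x 7 feature map.

```python
# Sketch of the Feature Learning (FL) stack described above (our reconstruction):
# 17 conv layers in five CONV blocks (3+3+3+4+4), each conv 3x3 with stride and
# padding of 1, followed by BatchNorm and ReLU, and a 2x2 max-pool per block.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    layers = []
    for i in range(n_convs):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                      kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

class FeatureLearning(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(3, 32, 3),     # channels double after every pooling stage
            conv_block(32, 64, 3),
            conv_block(64, 128, 3),
            conv_block(128, 256, 4),
            conv_block(256, 512, 4),
        )

    def forward(self, x):          # x: (N, 3, 224, 224)
        return self.blocks(x)      # -> (N, 512, 7, 7)

# Quick shape check:
# print(FeatureLearning()(torch.randn(1, 3, 224, 224)).shape)  # (1, 512, 7, 7)
```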
Our Vision Transformer (ViT) component is identical to the ViT architecture described in [16]. The Vision Transformer is a transformer model based on the work of [57]. The transformer and its variants (e.g., GPT-3 [44]) are predominantly used for NLP tasks. ViT extends the application of the transformer from the NLP problem domain to the CV problem domain. The ViT uses the same components as the original transformer model, with a slight modification of the input signal. The FL component and the ViT component make up our Convolutional Vision Transformer (CViT) model. We named our model CViT since it is based on both a stack of convolutional operations and the ViT architecture.

The input to the ViT component is a feature map of the face images. The feature maps are split into seven patches and are then embedded into a 1 x 1024 linear sequence. The embedded patches are then added to the position embedding to retain the positional information of the image feature maps. The position embedding has a 2 x 1024 dimension.

The ViT component takes in the position embedding and the patch embedding and passes them to the Transformer. The ViT Transformer uses only an encoder, unlike the original Transformer. The ViT encoder consists of multi-head self-attention (MSA) and MLP blocks; the MLP block is an FFN, and layer normalization (Norm) normalizes the internal layers of the transformer. The Transformer has 8 attention heads. The MLP head has two linear layers and the ReLU nonlinearity; its task is equivalent to the fully connected layer of a typical CNN architecture. The first layer has 2048 channels, and the last layer has two channels that represent the Fake or Real class of a face image. The CViT model has a total of 20 weighted layers and 38.6 million learnable parameters. Softmax is applied on the MLP head output to squash the weight values between 0 and 1 for the final detection purpose.
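A simplified PyTorch sketch of this ViT side is given below. It is our reconstruction under stated assumptions: the 512 x 7 x 7 feature map is tokenized per spatial position rather than into the seven patches described above, the encoder depth is an arbitrary placeholder, and the built-in nn.TransformerEncoder stands in for the MSA/MLP blocks; only the 8 attention heads and the 2048-to-2 MLP head follow the text directly.

```python
# Sketch of the ViT side of the CViT (simplified reconstruction, not the authors' code).
import torch
import torch.nn as nn

class ViTHead(nn.Module):
    def __init__(self, in_ch=512, embed_dim=1024, n_heads=8, depth=6, n_classes=2):
        super().__init__()
        self.proj = nn.Linear(in_ch, embed_dim)               # per-token embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 7 * 7 + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads,
                                           dim_feedforward=2048, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # encoder only
        self.mlp_head = nn.Sequential(nn.Linear(embed_dim, 2048), nn.ReLU(),
                                      nn.Linear(2048, n_classes))

    def forward(self, feat):                       # feat: (N, 512, 7, 7) from the FL
        tokens = feat.flatten(2).transpose(1, 2)   # (N, 49, 512): one token per position
        tokens = self.proj(tokens)                 # (N, 49, embed_dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.mlp_head(x[:, 0])              # logits for Real / Fake
```

At inference time, softmax over the two output logits gives the probability used for the final detection decision, as described above.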
4. Experiments

In this section, we present the tools and experimental setup we used to design and develop the prototype that implements the model. We present the results acquired from the implementation of the model and give an interpretation of the experimental results.

4.1. Dataset

DL models learn from data. As such, careful dataset preparation is crucial for their learning quality and prediction accuracy. The BlazeFace neural face detector [5], MTCNN [55], and the face recognition [17] DL libraries are used to extract the faces. Both BlazeFace and face recognition are fast at processing a large number of images. The three DL libraries are used together for added accuracy of face detection. The face images are stored in a JPEG file format with a 224 x 224 image resolution, and a 90 percent compression ratio is applied. We prepared our dataset in train, validation, and test sets. We used 162,174 images, split into 112,378 for training, 24,898 for validation, and 24,898 for testing, a 70:15:15 ratio. The real and fake classes have the same number of images in all sets.

We used Albumentations for data augmentation. Albumentations is a Python data augmentation library which has a large class of image transformations. Ninety percent of the face images were augmented, making our total dataset 308,130 facial images.
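The specific Albumentations transforms are not listed in the paper; the pipeline below is only an illustrative example of how such an augmentation stage can be defined and applied to an extracted face image (the transform choices and the file name are hypothetical).

```python
# Illustrative Albumentations pipeline (example transforms, not the authors' exact set).
import albumentations as A
import cv2

augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=15, p=0.3),
    A.RandomBrightnessContrast(p=0.3),
    A.HueSaturationValue(p=0.2),
    A.GaussNoise(p=0.1),
])

# Apply to one extracted face crop (hypothetical file name).
face = cv2.cvtColor(cv2.imread("face_0001.jpg"), cv2.COLOR_BGR2RGB)
augmented_face = augment(image=face)["image"]
```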
4.2. Evaluation

The CViT model is trained using the binary cross-entropy loss function. Mini-batches of 32 images are normalized using a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225]. The normalized face images are then augmented before being fed into the CViT model at each training iteration. The Adam optimizer with a learning rate of 0.1e-3 and a weight decay of 0.1e-6 is used for optimization. The model is trained for a total of 50 epochs, and the learning rate decreases by a factor of 0.1 every 15 epochs.
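This training configuration can be sketched in PyTorch as follows (our reconstruction: the model composes the FL and ViT sketches shown earlier, the data loader is a stand-in, the learning rate 0.1e-3 is written as 1e-4 and the weight decay 0.1e-6 as 1e-7, and CrossEntropyLoss over the two output channels is used as the two-class form of binary cross-entropy).

```python
# Sketch of the training setup described above (illustrative reconstruction).
import torch
import torch.nn as nn
from torchvision import transforms

# CViT = FL stack followed by the ViT head (classes from the earlier sketches).
model = nn.Sequential(FeatureLearning(), ViTHead())

# Per-image normalization used in the dataset pipeline (values from the text).
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

# Placeholder loader: in practice this iterates over the augmented face dataset
# in mini-batches of 32 normalized images.
train_loader = [(torch.randn(32, 3, 224, 224), torch.randint(0, 2, (32,)))]

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-7)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

for epoch in range(50):
    for faces, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(faces), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()   # multiply the learning rate by 0.1 every 15 epochs
```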
The classification process takes 30 facial images from a video and passes them to our trained model. To determine the classification accuracy of our model, we used a log loss function. The log loss described in Equation 1 maps the network output into a probability distribution from 0 to 1, where 0 ≤ y < 0.5 represents the real class and 0.5 ≤ y ≤ 1 represents the fake class. We chose a log loss classification metric because it highly penalizes random guesses and confident false predictions.

LogLoss = -(1/n) Σ_{i=1}^{n} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]    (1)

Another metric we used to measure our model's capacity is the ROC and AUC metric [8]. The ROC curve is used to visualize a classifier and to select the classification threshold. The AUC is the area covered by the ROC curve, and it measures the accuracy of a classifier.
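A sketch of the per-video classification and of both metrics is shown below; averaging the per-face fake probabilities into a single video score is our assumption, and the label and score arrays are toy values used only to demonstrate the metric calls.

```python
# Sketch of per-video classification and of the log loss / AUC metrics (illustrative).
import numpy as np
import torch
from sklearn.metrics import log_loss, roc_auc_score

def predict_video(model, faces):          # faces: tensor of shape (30, 3, 224, 224)
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(faces), dim=1)[:, 1]   # P(fake) per face
    return probs.mean().item()            # video-level score in [0, 1]

# Toy ground-truth labels (0 = real, 1 = fake) and per-video scores.
y_true = np.array([0, 1, 1, 0])
y_score = np.array([0.10, 0.92, 0.61, 0.43])

print("log loss:", log_loss(y_true, y_score))        # Equation (1)
print("AUC:", roc_auc_score(y_true, y_score))
print("predicted fake:", (y_score >= 0.5).astype(int))
```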
We present our results using accuracy, AUC score, and loss value. We tested the model on 400 unseen DFDC videos and achieved 91.5 percent accuracy, an AUC value of 0.91, and a loss value of 0.32. The loss value indicates how far our model's predictions are from the actual target values. For Deepfake detection, we used 30 face images from each video. The number of frames we use affects the chance of Deepfake detection. However, accuracy might not always be the right measure to detect Deepfakes, as we might encounter only real facial images sampled from a fake video (fake videos might contain real frames).

We compared our results with other Deepfake detection models, as shown in Tables 1, 2, and 3. From Tables 1, 2, and 3, we can see that our model performed well on the DFDC, UADFV, and FaceForensics++ datasets. However, our model performed poorly on the FaceForensics++ FaceShifter dataset. The reason for this is that its visual artifacts are hard to learn, and our proposed model likely did not learn those artifacts well.

Dataset                              Accuracy
FaceForensics++ FaceSwap             69%
FaceForensics++ DeepFakeDetection    91%
FaceForensics++ Deepfake             93%
FaceForensics++ FaceShifter          46%
FaceForensics++ NeuralTextures       60%

Table 1. CViT model prediction accuracy on the FaceForensics++ dataset.

Method                        Validation    Test
CNN and RNN-GRU [38] [47]     92.61%        91.88%
CViT                          87.25%        91.5%

Table 2. Accuracy of our model and other Deepfake detection models on the DFDC dataset.

Method            Validation    FaceSwap    Face2Face
MesoNet           84.3%         96%         92%
MesoInception     82.4%         98%         93.33%
CViT              93.75%        69%         69.39%

Table 3. AUC performance of our model and other Deepfake detection models on the UADFV dataset. * FaceForensics++

4.3. Effects of Data Processing During Classification

A major potential problem that affects our model's accuracy is the set of inherent problems in the face detection DL libraries (MTCNN, BlazeFace, and face recognition). Figure 4, Figure 5, and Figure 6 show images that were misclassified by the DL libraries. The figures summarize our preliminary data preprocessing test on 200 videos selected randomly from 10 folders. We chose our test set videos to cover all the settings we could find in the DFDC dataset: indoor, outdoor, dark room, bright room, subject seated, subject standing, speaking to the side, speaking to the front, a subject moving while speaking, different genders and skin colors, one-person videos, two-people videos, a subject close to the camera, and a subject away from the camera. For the preliminary test, we extracted every frame of the videos and found 637 non-face regions.

Figure 4. face recognition non-face region detection.

Figure 5. BlazeFace non-face region detection.

Figure 6. MTCNN non-face region detection.

We tested our model to check how its accuracy is affected without any attempt to remove these images, and our model's accuracy dropped to 69.5 percent while the loss value increased to 0.4.

To minimize non-face regions and prevent wrong predictions, we used the three DL libraries and picked the best performing library for our model, as shown in Table 4. As a solution, we used face recognition as a "filter" for the face images detected by BlazeFace. We chose face recognition because, in our investigation, it rejects more false positives
than the other two models. We used face recognition for the final Deepfake detection.
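The filtering step can be sketched as follows (illustrative only): face crops coming from the faster detector, e.g. BlazeFace, are kept only when the face recognition library also finds a face inside them, which discards most non-face false positives.

```python
# Sketch of the "filter" step described above (illustrative, not the authors' code).
import face_recognition

def filter_face_crops(crops):
    """crops: iterable of RGB numpy arrays produced by a fast face detector.
    Returns only the crops in which face_recognition also finds a face."""
    kept = []
    for crop in crops:
        if face_recognition.face_locations(crop):   # empty list -> no face found
            kept.append(crop)
    return kept
```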
Dataset              BlazeFace    face recognition    MTCNN
DFDC                 83.40%       91.50%              90.25%
FaceSwap             56%          69%                 63%
FaceShifter          40%          46%                 44%
NeuralTextures       57%          60%                 60%
DeepFakeDetection    82%          91%                 79.59%
Deepfake             87%          93%                 81.63%
Face2Face            54%          61%                 69.39%
UADFV                74.50%       93.75%              88.16%

Table 4. Comparison of the DL libraries on Deepfake detection accuracy.

5. Conclusion

Deepfakes open new possibilities in digital media, VR, robotics, education, and many other fields. On the other end of the spectrum, they are technologies that can cause havoc and distrust among the general public. In light of this, we have designed and developed a generalized model for Deepfake video detection using CNNs and the Transformer, which we named the Convolutional Vision Transformer. We call our model generalized for three reasons. 1) Our first reason arises from the combined learning capacity of CNNs and Transformers. CNNs are strong at learning local features, while Transformers can learn from local and global feature maps. This combined capacity enables our model to correlate every pixel of an image and understand the relationship between nonlocal features. 2) We gave equal emphasis to our data preprocessing during training and classification. 3) We used the largest and most diverse dataset for Deepfake detection.

The CViT model was trained on a diverse collection of facial images that were extracted from the DFDC dataset. The model was tested on 400 DFDC videos and achieved an accuracy of 91.5 percent. Still, our model has a lot of room for improvement. In the future, we intend to expand on our current work by adding other datasets released for Deepfake research to make it more diverse, accurate, and robust.

References

[1] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. MesoNet: a Compact Facial Video Forgery Detection Network. pages 1–7, 2018.
[2] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. Protecting World Leaders Against Deep Fakes. In CVPR Workshops, 2019.
[3] Charu C. Aggarwal. Neural Networks and Deep Learning: A Textbook. Springer International Publishing, Switzerland, 2020.
[4] Md Zahangir Alom, Tarek M. Taha, Chris Yakopcic, Stefan Westberg, Paheding Sidike, Mst Shamima Nasrin, Mahmudul Hasan, Brian C. Van Essen, Abdul A. S. Awwal, and Vijayan K. Asari. A State-of-the-Art Survey on Deep Learning Theory and Architectures. Electronics, 8(3):292, 2019.
[5] Valentin Bazarevsky, Yury Kartynnik, Andrey Vakunov, Karthik Raveendran, and Matthias Grundmann. BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs. arXiv preprint arXiv:1907.05047v2, 2019.
[6] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention augmented convolutional networks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3285–3294, 2019.
[7] Avishek Joey Bose and Parham Aarabi. Virtual Fakes: DeepFakes for Virtual Reality. In 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), pages 1–1. IEEE, 2019.
[8] Andrew P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.
[9] John Brandon. Terrifying high-tech porn: Creepy 'deepfake' videos are on the rise, 2018. Available at https://www.foxnews.com/tech/terrifying-high-tech-porn-creepy-deepfake-videos-are-on-the-rise.
[10] Joshua Brockschmidt, Jiacheng Shang, and Jie Wu. On the Generality of Facial Forgery Detection. In 2019 IEEE 16th International Conference on Mobile Ad Hoc and Sensor Systems Workshops (MASSW), pages 43–47. IEEE, 2019.
[11] Polychronis Charitidis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, and Ioannis Kompatsiaris. Investigating the Impact of Pre-processing and Prediction Aggregation on the DeepFake Detection Task. arXiv preprint arXiv:2006.07084v1, 2020.
[12] Bobby Chesney and Danielle Citron. Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security, 2019. Available at https://ssrn.com/abstract=3213954.
[13] Umur Aybars Ciftci, Ilke Demir, and Lijun Yin. FakeCatcher: Detection of Synthetic Portrait Videos using Biological Signals. arXiv preprint arXiv:1901.02212v2, 2019.
[14] Sourabh Dhere, Suresh B. Rathod, Sanket Aarankalle, Yash Lad, and Megh Gandhi. A Review on Face Reenactment Techniques. In 2020 International Conference on Industry 4.0 Technology (I4Tech), pages 191–194, Pune, India, 2020. IEEE.
[15] Chris Donahue, Julian J. McAuley, and Miller S. Puckette. Adversarial Audio Synthesis. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929v1, 2020.
[17] Adam Geitgey. The world's simplest facial recognition api for Python and the command line. Available at https://github.com/ageitgey/face_recognition.
[18] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, pages 2672–2680, Cambridge, MA, USA, 2014. MIT Press.
[19] Arushi Handa, Prerna Garg, and Vijay Khare. Masked Neural Style Transfer using Convolutional Neural Networks. In 2018 International Conference on Recent Innovations in Electrical, Electronics Communication Engineering (ICRIEECE), pages 2099–2104, 2018.
[20] Rahul Haridas and Jyothi R L. Convolutional Neural Networks: A Comprehensive Survey. International Journal of Applied Engineering Research (IJAER), 14(03):780–789, 2019.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE, 2016.
[22] Yongjun Hong, Uiwon Hwang, Jaeyoon Yoo, and Sungroh Yoon. How Generative Adversarial Networks and Their Variants Work: An Overview. Volume 52, New York, NY, USA, 2019. Association for Computing Machinery.
[23] He Huang, Phillip S. Yu, and Changhu Wang. An Introduction to Image Synthesis with Generative Adversarial Nets. arXiv preprint arXiv:1803.04469v1, 2018.
[24] Xun Huang, Ming-Yu Liu, Serge Belongie, and Ming-Yu Liu. Multimodal Unsupervised Image-to-Image Translation. In Computer Vision – ECCV 2018, pages 179–196, Cham, 2018. Springer International Publishing.
[25] TackHyun Jung, SangWon Kim, and KeeCheon Kim. DeepVision: Deepfakes Detection Using Human Eye Blinking Pattern. IEEE Access, 8:83144–83154, 2020.
[26] Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv preprint arXiv:1812.04948, 2018.
[27] Hasam Khalid and Simon S. Woo. OC-FakeDect: Classifying Deepfakes Using One-class Variational Autoencoder. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2794–2803, 2020.
[28] Ali Khodabakhsh, Raghavendra Ramachandra, Kiran Raja, and Pankaj Wasnik. Fake face detection methods: Can they be generalized? In 2018 International Conference of the Biometrics Special Interest Group (BIOSIG), pages 1–6. IEEE, 2018.
[29] Junyaup Kim, Siho Han, and Simon S. Woo. Classifying Genuine Face images from Disguised Face Images. In 2019 IEEE International Conference on Big Data (Big Data), pages 6248–6250, 2019.
[30] Pavel Korshunov and Sebastien Marcel. DeepFakes: a New Threat to Face Recognition? Assessment and Detection. arXiv preprint arXiv:1812.08685, 2018.
[31] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM, 60(6):84–90, 2017.
[32] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. arXiv preprint arXiv:1609.04802v5, 2017.
[33] Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In Ictu Oculi: Exposing AI Generated Fake Face Videos by Detecting Eye Blinking. arXiv preprint arXiv:1806.02877v2, 2018.
[34] Yuezun Li and Siwei Lyu. Exposing DeepFake Videos By Detecting Face Warping Artifacts. arXiv preprint arXiv:1811.00656v3, 2019.
[35] Arun Mallya, Ting-Chun Wang, Karan Sapra, and Ming-Yu Liu. World-Consistent Video-to-Video Synthesis. In Computer Vision – ECCV 2020, pages 359–378, Cham, 2020. Springer International Publishing.
[36] Brais Martinez, Michel F. Valstar, Bihan Jiang, and Maja Pantic. Automatic Analysis of Facial Actions: A Survey. IEEE Transactions on Affective Computing, 10(3):325–347, 2019.
[37] Yisroel Mirsky and Wenke Lee. The Creation and Detection of Deepfakes: A Survey. ACM Comput. Surv., 54(1), 2021.
[38] Daniel Mas Montserrat, Hanxiang Hao, S. K. Yarlagadda, Sriram Baireddy, Ruiting Shao, Janos Horvath, Emily Bartusiak, Justin Yang, David Guera, Fengqing Zhu, and Edward J. Delp. Deepfakes Detection with Automatic Face Weighting. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2851–2859, 2020.
[39] Ryota Natsume, Tatsuya Yatagawa, and Shigeo Morishima. RSGAN: Face Swapping and Editing Using Face and Hair Representation in Latent Spaces. In ACM SIGGRAPH 2018 Posters, SIGGRAPH '18, New York, NY, USA, 2018. Association for Computing Machinery.
[40] Huy H. Nguyen, Ngoc-Dung T. Tieu, Hoang-Quoc Nguyen-Son, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. Modular Convolutional Neural Network for Discriminating between Computer-Generated Images and Photographic Images. In Proceedings of the 13th International Conference on Availability, Reliability and Security, New York, NY, USA, 2018. Association for Computing Machinery.
[41] Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen. Capsule-forensics: Using Capsule Networks to Detect Forged Images and Videos. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2307–2311, 2019.
[42] Thanh Thi Nguyen, Cuong M. Nguyen, Dung Tien Nguyen, Duc Thanh Nguyen, and Saeid Nahavandi. Deep Learning for Deepfakes Creation and Detection. arXiv preprint arXiv:1909.11573v1, 2019.
[43] Yuval Nirkin, Yosi Keller, and Tal Hassner. FSGAN: Subject Agnostic Face Swapping and Reenactment. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), October 27 - November 2, 2019, pages 7183–7192. IEEE, 2019.
[44] OpenAI. OpenAI API, 2020. Available at https://openai.com/blog/openai-api.
[45] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic Image Synthesis With Spatially-Adaptive Normalization. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2332–2341. IEEE, 2019.
[46] Hai X. Pham, Yuting Wang, and Vladimir Pavlovic. Generative Adversarial Talking Head: Bringing Portraits to Life with a Weakly Supervised Neural Network. arXiv preprint arXiv:1803.07716, 2018.
[47] K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C V Jawahar. A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild, pages 484–492. Association for Computing Machinery, New York, NY, USA, 2020.
[48] Mike Price and Matt Price. Playing Offense and Defense with Deepfakes, 2019. Available at https://www.blackhat.com/us-19/briefings/schedule/playing-offense-and-defense-with-deepfakes-14661.
[49] Prajwal K R, Rudrabha Mukhopadhyay, Jerin Philip, Abhishek Jha, Vinay Namboodiri, and C V Jawahar. Towards Automatic Face-to-Face Translation. In the 27th ACM International Conference on Multimedia (MM '19), pages 1428–1436, New York, NY, USA, 2019. Association for Computing Machinery.
[50] Md. Shohel Rana and Andrew H. Sung. DeepfakeStack: A Deep Ensemble-based Learning Technique for Deepfake Detection. In 2020 7th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud) / 2020 6th IEEE International Conference on Edge Computing and Scalable Cloud (EdgeCom), pages 70–75, 2020.
[51] Kuniaki Saito, Kate Saenko, and Ming-Yu Liu. COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder. In Computer Vision – ECCV 2020, pages 382–398, Cham, 2020. Springer International Publishing.
[52] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[53] Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: Learning Lip Sync from Audio. ACM Trans. Graph., 36(4):780–789, 2017.
[54] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2Face: Real-Time Face Capture and Reenactment of RGB Videos. Commun. ACM, 62(1):96–104, 2018.
[55] Timesler. Pretrained Pytorch face detection (MTCNN) and recognition (InceptionResnet) models. Available at https://github.com/timesler/facenet-pytorch.
[56] Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection. Inf. Fusion, 64:131–148, 2020.
[57] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010. Curran Associates Inc., 2017.
[58] Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. Realistic Speech-Driven Facial Animation with GANs. International Journal of Computer Vision, 128:1398–1413, 2020.
[59] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-Video Synthesis. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 1152–1164, Red Hook, NY, USA, 2018. Curran Associates Inc.
[60] M. Arif Wani, Farooq Ahmad Bhat, Saduf Afzal, and Asif Iqbal Khan. Advances in Deep Learning, volume 57 of Studies in Big Data. Springer Nature, Singapore, 2020.
[61] Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-Shot Adversarial Learning of Realistic Neural Talking Head Models. arXiv preprint arXiv:1905.08233v2, 2019.
[62] Lilei Zheng, Ying Zhang, and Vrizlynn L.L. Thing. A survey on image tampering and its detection in real-world photos. Elsevier, 58:380–399, 2018.
[63] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2242–2251, 2017.
