Deepfake Video Detection Using Convolutional Vision Transformer
encoder and decoder of the source face swap the features of the source image to the target face. The autoencoder output is then blended with the rest of the image using Poisson editing [38].

Facial expression (face reenactment) swap alters one's facial expression or transfers facial expressions between persons. Expression reenactment turns an identity into a puppet [37]. Using facial expression swap, one can transfer the expression of a person to another [27]. Various facial reenactment methods have been proposed through the years. CycleGAN was proposed by Jun-Yan et al. [63] for facial reenactment between two video sources without any pairs of training examples. Face2Face manipulates the facial expression of a source image and projects it onto another target face in real time [54]. Face2Face creates a dense reconstruction between the source image and the target image that is used for the synthesis of the face images under different light settings [38].

2.2. Deep Learning Techniques for Deepfake Video Detection

Deepfake detection methods fall into three categories [34, 37]. Methods in the first category focus on the physical or psychological behavior of the videos, such as tracking eye blinking or head pose movement. The second category focuses on GAN fingerprints and biological signals found in images, such as blood flow that can be detected in an image. The third category focuses on visual artifacts. Methods that focus on visual artifacts are data-driven and require a large amount of data for training. Our proposed model falls into the third category. In this section, we discuss various architectures designed and developed to detect the visual artifacts of Deepfakes.

Darius et al. [1] proposed a CNN model called MesoNet to automatically detect hyper-realistic forged videos created using Deepfake [40] and Face2Face [54]. The authors used two network architectures (Meso-4 and MesoInception-4) that focus on the mesoscopic properties of an image. Yuezun and Siwei [34] proposed a CNN architecture that takes advantage of the image transform (i.e., scaling, rotation, and shearing) inconsistencies created during the creation of Deepfakes. Their approach targets the artifacts of affine face warping as the distinctive feature to distinguish real and fake images. Their method compares the Deepfake face region with that of the neighboring pixels to spot resolution inconsistencies that occur during face warping.

Huy et al. [41] proposed a novel deep learning approach to detect forged images and videos. The authors focused on replay attacks, face swapping, facial reenactment, and fully computer-generated image spoofing. Daniel Mas Montserrat et al. [38] proposed a system that extracts visual and temporal features from faces present in a video. Their method combines a CNN and RNN architecture to detect Deepfake videos.

Md. Shohel Rana and Andrew H. Sung [50] proposed DeepfakeStack, an ensemble method (a stack of different DL models) for Deepfake detection. The ensemble is composed of the XceptionNet, InceptionV3, InceptionResNetV2, MobileNet, ResNet101, DenseNet121, and DenseNet169 open-source DL models. Junyaup Kim et al. [29] proposed a classifier that distinguishes target individuals from a set of similar people using the ShallowNet, VGG-16, and Xception pre-trained DL models. The main objective of their system is to evaluate the classification performance of the three DL models.

3. Convolutional Vision Transformer

In this section, we present our approach to detecting Deepfake videos. The Deepfake video detection model consists of two components: the preprocessing component and the detection component. The preprocessing component consists of face extraction and data augmentation. The detection component consists of the training component, the validation component, and the testing component. The training and validation components contain a Convolutional Vision Transformer (CViT). The CViT has a feature learning component that learns the features of input images and a ViT architecture that determines whether a specific video is fake or real. The testing component applies the CViT learning model on input images to detect Deepfakes. Our proposed model is shown in Figure 1.

3.1. Preprocessing

The preprocessing component's function is to prepare the raw dataset for training, validating, and testing our CViT model. The preprocessing component has two subcomponents: the face extraction and the data augmentation component. The face extraction component is responsible for extracting face images from a video in a 224 x 224 RGB format. Figure 2 and Figure 3 show a sample of the extracted faces.

3.2. Detection

The Deepfake detection process consists of three subcomponents: the training, the validation, and the testing components. The training component is the principal part of the proposed model. It is where the learning occurs. DL models require significant time to design and fine-tune in order to fit a particular problem domain. In our case, the foremost consideration is to search for an optimal CViT model that learns the features of Deepfake videos. For this, we need to search for the right parameters appropriate for training our dataset. The validation component is similar to that of the training component. The validation component is a process that fine-tunes our model. It is used to
Figure 1. Convolutional Vision Transformer.
having 32 channels and the last layer 512.

The FL component has three consecutive convolutional operations at each layer, except for the last two layers, which have four convolutional operations. We call these three convolutional layers a CONV Block for simplicity. Each convolutional computation is followed by batch normalization and the ReLU nonlinearity. The FL component has 10.8 million learnable parameters. The FL takes in an image of size 224 x 224 x 3, which is then convolved at each convolutional operation. The FL internal state can be represented as a (C, H, W) tensor, where C is the channel, H is the height, and W is the width. The final output of the FL is a 512 x 7 x 7 spatially correlated low-level feature map of the input images, which is then fed to the ViT architecture.

Our Vision Transformer (ViT) component is identical to the ViT architecture described in [16]. The Vision Transformer (ViT) is a transformer model based on the work of [57]. The transformer and its variants (e.g., GPT-3 [44]) are predominantly used for NLP tasks. ViT extends the application of the transformer from the NLP problem domain to the CV problem domain. The ViT uses the same components as the original transformer model with a slight modification of the input signal. The FL component and the ViT component make up our Convolutional Vision Transformer (CViT) model. We named our model CViT since the model is based on both a stack of convolutional operations and the ViT architecture.

The input to the ViT component is a feature map of the face images. The feature maps are split into seven patches and are then embedded into a 1 x 1024 linear sequence. The embedded patches are then added to the position embedding to retain the positional information of the image feature maps. The position embedding has a 2 x 1024 dimension.

The ViT component takes in the position embedding and the patch embedding and passes them to the Transformer. The ViT Transformer uses only an encoder, unlike the original Transformer. The ViT encoder consists of MSA and MLP blocks. The MLP block is an FFN. The Norm normalizes the internal layers of the transformer. The Transformer has 8 attention heads. The MLP head has two linear layers and the ReLU nonlinearity. The MLP head's task is equivalent to the fully connected layer of a typical CNN architecture. The first layer has 2048 channels, and the last layer has two channels that represent the Fake or Real class of the face image. The CViT model has a total of 20 weighted layers and 38.6 million learnable parameters. Softmax is applied to the MLP head output to squash the weight values between 0 and 1 for the final detection.
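To make the architecture concrete, a minimal PyTorch sketch of a CViT-style model is given below. It is an illustration assembled from the dimensions stated above, not our exact implementation: the number of CONV Blocks, the way the 512 x 7 x 7 map is cut into seven patches, the use of a class token, and the use of PyTorch's built-in transformer encoder are all simplifying assumptions.

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Three (convolution -> batch normalization -> ReLU) operations followed by pooling.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        layers = []
        for i in range(3):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
        layers.append(nn.MaxPool2d(2))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class CViT(nn.Module):
    # Feature learning (FL) CNN followed by a ViT-style encoder and an MLP head.
    def __init__(self, dim=1024, heads=8, depth=1):
        super().__init__()
        chs = [3, 32, 64, 128, 256, 512]                  # 224 -> 7 after five poolings
        self.fl = nn.Sequential(*[ConvBlock(chs[i], chs[i + 1]) for i in range(5)])
        self.patch_embed = nn.Linear(512 * 7, dim)        # one patch per column of the 512 x 7 x 7 map
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 8, dim))   # 7 patches + class token (assumed layout)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=2048, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)   # encoder-only transformer
        self.head = nn.Sequential(nn.Linear(dim, 2048), nn.ReLU(inplace=True), nn.Linear(2048, 2))

    def forward(self, x):                                 # x: (B, 3, 224, 224)
        f = self.fl(x)                                    # (B, 512, 7, 7)
        patches = f.permute(0, 3, 1, 2).flatten(2)        # (B, 7, 3584)
        tokens = self.patch_embed(patches)                # (B, 7, 1024)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return torch.softmax(self.head(tokens[:, 0]), dim=-1)   # (B, 2) Fake/Real probabilities

In this layout the convolutional stack supplies the local texture cues, while the encoder attends across the seven spatial patches to relate nonlocal regions of the face.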
4. Experiments

In this section, we present the tools and experimental setup we used to design and develop the prototype to implement the model. We will present the results acquired from the implementation of the model and give an interpretation of the experimental results.
4.1. Dataset

DL models learn from data. As such, careful dataset preparation is crucial for their learning quality and prediction accuracy. The BlazeFace neural face detector [5], MTCNN [55], and face recognition [17] DL libraries are used to extract the faces. Both BlazeFace and face recognition are fast at processing a large number of images. The three DL libraries are used together for added accuracy of face detection. The face images are stored in a JPEG file format at a 224 x 224 resolution, and a 90 percent compression ratio is applied. We prepared our dataset as train, validation, and test sets. We used 162,174 images split into 112,378 for training, 24,898 for validation, and 24,898 for testing, a 70:15:15 ratio. The real and fake classes have the same number of images in all sets.
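A sketch of the face extraction step is shown below. It uses the MTCNN implementation from facenet-pytorch [55] with face recognition [17] as a fallback; BlazeFace is omitted for brevity, and the frame sampling rate, output naming, and fallback logic are our assumptions rather than the exact pipeline.

import os
import cv2
import face_recognition                      # https://github.com/ageitgey/face_recognition
from PIL import Image
from facenet_pytorch import MTCNN            # https://github.com/timesler/facenet-pytorch

mtcnn = MTCNN(select_largest=False, keep_all=True)

def extract_faces(video_path, out_dir, every_n=10):
    # Save detected faces from every n-th frame as 224 x 224 JPEGs at quality 90.
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    frame_idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % every_n == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            boxes, _ = mtcnn.detect(Image.fromarray(rgb))
            if boxes is None:                 # fall back to face_recognition
                locations = face_recognition.face_locations(rgb)
                boxes = [(left, top, right, bottom) for (top, right, bottom, left) in locations]
            for (x1, y1, x2, y2) in boxes:
                crop = rgb[int(y1):int(y2), int(x1):int(x2)]
                if crop.size == 0:
                    continue
                crop = cv2.resize(crop, (224, 224))
                cv2.imwrite(os.path.join(out_dir, f"face_{saved:06d}.jpg"),
                            cv2.cvtColor(crop, cv2.COLOR_RGB2BGR),
                            [cv2.IMWRITE_JPEG_QUALITY, 90])
                saved += 1
        frame_idx += 1
    cap.release()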
We used Albumentations for data augmentation. Albumentations is a Python data augmentation library that offers a large class of image transformations. Ninety percent of the face images were augmented, bringing our total dataset to 308,130 facial images.
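The augmentation step can be expressed with an Albumentations pipeline such as the one below; the specific transforms and probabilities are illustrative assumptions, not the exact list applied to the 90 percent of augmented faces.

import albumentations as A

augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=15, p=0.3),
    A.RandomBrightnessContrast(p=0.3),
    A.HueSaturationValue(p=0.3),
    A.GaussNoise(p=0.2),
    A.Blur(blur_limit=3, p=0.2),
])

# Usage on one extracted face (an H x W x 3 uint8 array):
# augmented_face = augment(image=face)["image"]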
4.2. Evaluation

The CViT model is trained using the binary cross-entropy loss function. A mini-batch of 32 images is normalized using a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225]. The normalized face images are then augmented before being fed into the CViT model at each training iteration. The Adam optimizer with a learning rate of 0.1e-3 and a weight decay of 0.1e-6 is used for optimization. The model is trained for a total of 50 epochs, and the learning rate decreases by a factor of 0.1 with a step size of 15.
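The training configuration described above can be wired together as in the following sketch. The CViT class is the model sketch given earlier, the dataset directory layout is assumed (one sub-folder per class), and the BCE loss is applied to the softmax probability of the positive class; these wiring details are our assumptions.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder

# Normalization applied to every mini-batch (augmentation happens beforehand).
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_set = ImageFolder("dfdc_faces/train", transform=preprocess)   # illustrative path
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = CViT().cuda()
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1e-3, weight_decay=0.1e-6)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

for epoch in range(50):
    model.train()
    for images, labels in train_loader:          # labels are the ImageFolder class indices (0/1)
        images, labels = images.cuda(), labels.float().cuda()
        probs = model(images)[:, 1]              # probability of class index 1
        loss = criterion(probs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()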
The classification process takes in 30 facial images and passes them to our trained model. To determine the classification accuracy of our model, we used a log loss function. The log loss described in Equation 1 treats the network output as a probability between 0 and 1, where 0 ≤ y < 0.5 represents the real class and 0.5 ≤ y ≤ 1 represents the fake class. We chose the log loss classification metric because it highly penalizes random guesses and confident false predictions.

LogLoss = -(1/n) Σ_{i=1}^{n} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]     (1)

Another metric we used to measure our model's capacity is the ROC and AUC metrics [8]. The ROC is used to visualize a classifier in order to select the classification threshold. The AUC is the area covered by the ROC curve, and it measures the accuracy of a classifier.
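A sketch of the video-level classification and scoring is given below: the 30 face crops from a video are scored individually and averaged, and the resulting per-video scores are evaluated with the log loss of Equation 1 and the AUC. The averaging rule and the reuse of the preprocess transform from the training sketch are assumptions.

import torch
from sklearn.metrics import log_loss, roc_auc_score

@torch.no_grad()
def predict_video(model, face_crops):
    # face_crops: the 30 face images extracted from one video (H x W x 3 uint8 arrays).
    model.eval()
    batch = torch.stack([preprocess(face) for face in face_crops]).cuda()
    probs = model(batch)[:, 1]                # per-frame probability of the fake class
    return probs.mean().item()                # video-level score in [0, 1]

# y_true holds the 0/1 video labels (0 = real, 1 = fake) and y_score the averaged
# predictions, e.g. y_score = [predict_video(model, crops) for crops in videos].
# print(log_loss(y_true, y_score), roc_auc_score(y_true, y_score))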
We present our results using accuracy, the AUC score, and the loss value. We tested the model on 400 unseen DFDC videos and achieved 91.5 percent accuracy, an AUC value of 0.91, and a loss value of 0.32. The loss value indicates how far our model's predictions are from the actual target values. For Deepfake detection, we used 30 face images from each video. The number of frames we use affects the chance of detecting a Deepfake. However, accuracy might not always be the right measure to detect Deepfakes, as we might encounter only real facial images from a fake video (fake videos might contain real frames).

We compared our results with other Deepfake detection models, as shown in Tables 1, 2, and 3. From Tables 1, 2, and 3, we can see that our model performed well on the DFDC, UADFV, and FaceForensics++ datasets. However, our model performed poorly on the FaceForensics++ FaceShifter dataset. The reason is that its visual artifacts are hard to learn, and our proposed model likely did not learn those artifacts well.

Dataset                              Accuracy
FaceForensics++ FaceSwap             69%
FaceForensics++ DeepFakeDetection    91%
FaceForensics++ Deepfake             93%
FaceForensics++ FaceShifter          46%
FaceForensics++ NeuralTextures       60%

Figure 4, Figure 5, and Figure 6 show images that were misclassified by the DL libraries. The figures summarize our preliminary data preprocessing test on 200 videos selected randomly from 10 folders. We chose our test set videos to cover all settings we could find in the DFDC dataset: indoor, outdoor, dark room, bright room, subject seated, subject standing, speaking to the side, speaking to the front, a subject moving while speaking, gender, skin color, one-person videos, two-person videos, a subject close to the camera, and a subject away from the camera. For the preliminary test, we extracted every frame of the videos and found 637 non-face regions.

Figure 4. face recognition non-face region detection.
than the other two models. We used face recognition for the final Deepfake detection.

Dataset              BlazeFace   face recognition   MTCNN
DFDC                 83.40%      91.50%             90.25%
FaceSwap             56%         69%                63%
FaceShifter          40%         46%                44%
NeuralTextures       57%         60%                60%
DeepFakeDetection    82%         91%                79.59%
Deepfake             87%         93%                81.63%
Face2Face            54%         61%                69.39%
UADFV                74.50%      93.75%             88.16%

Table 4. DL libraries comparison on Deepfake detection accuracy.

5. Conclusion

Deepfakes open new possibilities in digital media, VR, robotics, education, and many other fields. On the other hand, they are technologies that can cause havoc and distrust among the general public. In light of this, we have designed and developed a generalized model for Deepfake video detection using CNNs and a Transformer, which we named the Convolutional Vision Transformer. We called our model a generalized model for three reasons. 1) Our first reason arises from the combined learning capacity of CNNs and Transformers. CNNs are strong at learning local features, while Transformers can learn from local and global feature maps. This combined capacity enables our model to correlate every pixel of an image and understand the relationship between nonlocal features. 2) We gave equal emphasis to our data preprocessing during training and classification. 3) We used the largest and most diverse dataset for Deepfake detection.

The CViT model was trained on a diverse collection of facial images that were extracted from the DFDC dataset. The model was tested on 400 DFDC videos and achieved an accuracy of 91.5 percent. Still, our model has a lot of room for improvement. In the future, we intend to expand on our current work by adding other datasets released for Deepfake research to make it more diverse, accurate, and robust.

References

[1] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. MesoNet: a Compact Facial Video Forgery Detection Network. pages 1-7, 2018.
[2] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. Protecting World Leaders Against Deep Fakes. In CVPR Workshops, 2019.
[3] Charu C. Aggarwal. Neural Networks and Deep Learning: A Textbook. Springer International Publishing, Switzerland, 2020.
[4] Md Zahangir Alom, Tarek M. Taha, Chris Yakopcic, Stefan Westberg, Paheding Sidike, Mst Shamima Nasrin, Mahmudul Hasan, Brian C. Van Essen, Abdul A. S. Awwal, and Vijayan K. Asari. A State-of-the-Art Survey on Deep Learning Theory and Architectures. Electronics, 8(3):292, 2019.
[5] Valentin Bazarevsky, Yury Kartynnik, Andrey Vakunov, Karthik Raveendran, and Matthias Grundmann. BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs. arXiv preprint arXiv:1907.05047v2, 2019.
[6] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention Augmented Convolutional Networks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3285-3294, 2019.
[7] Avishek Joey Bose and Parham Aarabi. Virtual Fakes: DeepFakes for Virtual Reality. In 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), pages 1-1. IEEE, 2019.
[8] Andrew P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145-1159, 1997.
[9] John Brandon. Terrifying high-tech porn: Creepy 'deepfake' videos are on the rise, 2018. Available at https://www.foxnews.com/tech/terrifying-high-tech-porn-creepy-deepfake-videos-are-on-the-rise.
[10] Joshua Brockschmidt, Jiacheng Shang, and Jie Wu. On the Generality of Facial Forgery Detection. In 2019 IEEE 16th International Conference on Mobile Ad Hoc and Sensor Systems Workshops (MASSW), pages 43-47. IEEE, 2019.
[11] Polychronis Charitidis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, and Ioannis Kompatsiaris. Investigating the Impact of Pre-processing and Prediction Aggregation on the DeepFake Detection Task. arXiv preprint arXiv:2006.07084v1, 2020.
[12] Bobby Chesney and Danielle Citron. Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security, 2019. Available at https://ssrn.com/abstract=3213954.
[13] Umur Aybars Ciftci, Ilke Demir, and Lijun Yin. FakeCatcher: Detection of Synthetic Portrait Videos using Biological Signals. arXiv preprint arXiv:1901.02212v2, 2019.
[14] Sourabh Dhere, Suresh B. Rathod, Sanket Aarankalle, Yash Lad, and Megh Gandhi. A Review on Face Reenactment Techniques. In 2020 International Conference on Industry 4.0 Technology (I4Tech), pages 191-194, Pune, India, 2020. IEEE.
[15] Chris Donahue, Julian J. McAuley, and Miller S. Puckette. Adversarial Audio Synthesis. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929v1, 2020.
[17] Adam Geitgey. The world's simplest facial recognition api for Python and the command line. Available at https://github.com/ageitgey/face_recognition.
[18] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, pages 2672-2680, Cambridge, MA, USA, 2014. MIT Press.
[19] Arushi Handa, Prerna Garg, and Vijay Khare. Masked Neural Style Transfer using Convolutional Neural Networks. In 2018 International Conference on Recent Innovations in Electrical, Electronics Communication Engineering (ICRIEECE), pages 2099-2104, 2018.
[20] Rahul Haridas and Jyothi R L. Convolutional Neural Networks: A Comprehensive Survey. International Journal of Applied Engineering Research (IJAER), 14(03):780-789, 2019.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778. IEEE, 2016.
[22] Yongjun Hong, Uiwon Hwang, Jaeyoon Yoo, and Sungroh Yoon. How Generative Adversarial Networks and Their Variants Work: An Overview. Volume 52, New York, NY, USA, 2019. Association for Computing Machinery.
[23] He Huang, Phillip S. Yu, and Changhu Wang. An Introduction to Image Synthesis with Generative Adversarial Nets. arXiv preprint arXiv:1803.04469v1, 2018.
[24] Xun Huang, Ming-Yu Liu, and Serge Belongie. Multimodal Unsupervised Image-to-Image Translation. In Computer Vision - ECCV 2018, pages 179-196, Cham, 2018. Springer International Publishing.
[25] TackHyun Jung, SangWon Kim, and KeeCheon Kim. DeepVision: Deepfakes Detection Using Human Eye Blinking Pattern. IEEE Access, 8:83144-83154, 2020.
[26] Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv preprint arXiv:1812.04948, 2018.
[27] Hasam Khalid and Simon S. Woo. OC-FakeDect: Classifying Deepfakes Using One-class Variational Autoencoder. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2794-2803, 2020.
[28] Ali Khodabakhsh, Raghavendra Ramachandra, Kiran Raja, and Pankaj Wasnik. Fake face detection methods: Can they be generalized? In 2018 International Conference of the Biometrics Special Interest Group (BIOSIG), pages 1-6. IEEE, 2018.
[29] Junyaup Kim, Siho Han, and Simon S. Woo. Classifying Genuine Face Images from Disguised Face Images. In 2019 IEEE International Conference on Big Data (Big Data), pages 6248-6250, 2019.
[30] Pavel Korshunov and Sebastien Marcel. DeepFakes: a New Threat to Face Recognition? Assessment and Detection. arXiv preprint arXiv:1812.08685, 2018.
[31] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM, 60(6):84-90, 2017.
[32] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. arXiv preprint arXiv:1609.04802v5, 2017.
[33] Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In Ictu Oculi: Exposing AI Generated Fake Face Videos by Detecting Eye Blinking. arXiv preprint arXiv:1806.02877v2, 2018.
[34] Yuezun Li and Siwei Lyu. Exposing DeepFake Videos By Detecting Face Warping Artifacts. arXiv preprint arXiv:1811.00656v3, 2019.
[35] Arun Mallya, Ting-Chun Wang, Karan Sapra, and Ming-Yu Liu. World-Consistent Video-to-Video Synthesis. In Computer Vision - ECCV 2020, pages 359-378, Cham, 2020. Springer International Publishing.
[36] Brais Martinez, Michel F. Valstar, Bihan Jiang, and Maja Pantic. Automatic Analysis of Facial Actions: A Survey. IEEE Transactions on Affective Computing, 10(3):325-347, 2019.
[37] Yisroel Mirsky and Wenke Lee. The Creation and Detection of Deepfakes: A Survey. ACM Comput. Surv., 54(1), 2021.
[38] Daniel Mas Montserrat, Hanxiang Hao, S. K. Yarlagadda, Sriram Baireddy, Ruiting Shao, Janos Horvath, Emily Bartusiak, Justin Yang, David Guera, Fengqing Zhu, and Edward J. Delp. Deepfakes Detection with Automatic Face Weighting. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2851-2859, 2020.
[39] Ryota Natsume, Tatsuya Yatagawa, and Shigeo Morishima. RSGAN: Face Swapping and Editing Using Face and Hair Representation in Latent Spaces. In ACM SIGGRAPH 2018 Posters, SIGGRAPH '18, New York, NY, USA, 2018. Association for Computing Machinery.
[40] Huy H. Nguyen, Ngoc-Dung T. Tieu, Hoang-Quoc Nguyen-Son, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. Modular Convolutional Neural Network for Discriminating between Computer-Generated Images and Photographic Images. In Proceedings of the 13th International Conference on Availability, Reliability and Security, New York, NY, USA, 2018. Association for Computing Machinery.
[41] Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen. Capsule-forensics: Using Capsule Networks to Detect Forged Images and Videos. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2307-2311, 2019.
[42] Thanh Thi Nguyen, Cuong M. Nguyen, Dung Tien Nguyen, Duc Thanh Nguyen, and Saeid Nahavandi. Deep Learning for Deepfakes Creation and Detection. arXiv preprint arXiv:1909.11573v1, 2019.
[43] Yuval Nirkin, Yosi Keller, and Tal Hassner. FSGAN: Subject Agnostic Face Swapping and Reenactment. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 7183-7192. IEEE, 2019.
[44] OpenAI. OpenAI API, 2020. Available at https://openai.com/blog/openai-api.
[45] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic Image Synthesis With Spatially-Adaptive Normalization. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2332-2341. IEEE, 2019.
[46] Hai X. Pham, Yuting Wang, and Vladimir Pavlovic. Generative Adversarial Talking Head: Bringing Portraits to Life with a Weakly Supervised Neural Network. arXiv preprint arXiv:1803.07716, 2018.
[47] K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C V Jawahar. A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild, pages 484-492. Association for Computing Machinery, New York, NY, USA, 2020.
[48] Mike Price and Matt Price. Playing Offense and Defense with Deepfakes, 2019. Available at https://www.blackhat.com/us-19/briefings/schedule/playing-offense-and-defense-with-deepfakes-14661.
[49] Prajwal K R, Rudrabha Mukhopadhyay, Jerin Philip, Abhishek Jha, Vinay Namboodiri, and C V Jawahar. Towards Automatic Face-to-Face Translation. In the 27th ACM International Conference on Multimedia (MM '19), pages 1428-1436, New York, NY, USA, 2019. Association for Computing Machinery.
[50] Md. Shohel Rana and Andrew H. Sung. DeepfakeStack: A Deep Ensemble-based Learning Technique for Deepfake Detection. In 2020 7th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud)/2020 6th IEEE International Conference on Edge Computing and Scalable Cloud (EdgeCom), pages 70-75, 2020.
[51] Kuniaki Saito, Kate Saenko, and Ming-Yu Liu. COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder. In Computer Vision - ECCV 2020, pages 382-398, Cham, 2020. Springer International Publishing.
[52] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[53] Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: Learning Lip Sync from Audio. ACM Trans. Graph., 36(4):780-789, 2017.
[54] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2Face: Real-Time Face Capture and Reenactment of RGB Videos. Commun. ACM, 62(1):96-104, 2018.
[55] Timesler. Pretrained Pytorch face detection (MTCNN) and recognition (InceptionResnet) models. Available at https://github.com/timesler/facenet-pytorch.
[56] Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection. Inf. Fusion, 64:131-148, 2020.
[57] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000-6010. Curran Associates Inc., 2017.
[58] Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. Realistic Speech-Driven Facial Animation with GANs. International Journal of Computer Vision, 128:1398-1413, 2020.
[59] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-Video Synthesis. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 1152-1164, Red Hook, NY, USA, 2018. Curran Associates Inc.
[60] M. Arif Wani, Farooq Ahmad Bhat, Saduf Afzal, and Asif Iqbal Khan. Advances in Deep Learning, volume 57 of Studies in Big Data. Springer Nature, Singapore, 2020.
[61] Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-Shot Adversarial Learning of Realistic Neural Talking Head Models. arXiv preprint arXiv:1905.08233v2, 2019.
[62] Lilei Zheng, Ying Zhang, and Vrizlynn L.L. Thing. A survey on image tampering and its detection in real-world photos. Elsevier, 58:380-399, 2018.
[63] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2242-2251, 2017.