DeepfakeStack A Deep Ensemble-Based Learning Technique For Deepfake Detection
Abstract-Recent advances in technology have made deep learning (DL) models available for use in a wide variety of novel applications; for example, generative adversarial network (GAN) models are capable of producing hyper-realistic images, speech, and even videos, such as the so-called “Deepfake” produced by GANs with manipulated audio and/or video clips, which are so realistic as to be indistinguishable from the real ones in human perception. Aside from innovative and legitimate applications, there are numerous nefarious or unlawful ways to use such counterfeit content in propaganda, political campaigns, cybercrimes, extortion, etc. To meet the challenges posed by Deepfake multimedia, we propose a deep ensemble learning technique called DeepfakeStack for detecting such manipulated videos. The proposed technique combines a series of DL-based state-of-the-art classification models and creates an improved composite classifier. Our experiments show that DeepfakeStack outperforms other classifiers, achieving an accuracy of 99.65% and an AUROC of 1.0 in detecting Deepfakes. Therefore, our method provides a solid basis for building a real-time Deepfake detector.

Keywords-Deepfake; DeepfakeStack; GANs; Deep Ensemble Learning; Greedy Layer-wise Pretraining.

I. INTRODUCTION
Recent progress in automated video processing applications (e.g., FaceApp [1], FakeApp [2]), artificial neural networks (ANN), ML algorithms, and social media allows cybercriminals to create and spread high-quality manipulated video content (a.k.a. fake videos) through digital media, which leads to deliberate misinformation. Depictions of people doing or saying things they never actually did or said are becoming remarkably realistic, and their authenticity is hard to verify. “Deepfake manipulation” permits anyone to swap the face of one individual with another’s, including expressions, and creates photorealistic fake images or videos that are known as Deepfakes. These videos readily lend themselves to malicious use, and some of them could be harmful to individuals and society. In the past, video manipulation was an expensive task that required an extensive amount of workforce, time, and money, but now it needs only a gaming laptop or desktop with an internet connection and a basic knowledge of ANN.

Deepfakes became popular when fabricated pornographic videos of well-known faces, for example celebrities or politicians, began to appear online. This practice violates not only the rules of consent but also the victim’s privacy, because creating Deepfakes without a person’s approval is a form of abuse that opens another avenue for crime. As presented in the annual report [3], the number of webpages that Google returns for the keyword ‘Deepfake’ has expanded rapidly since 2017, as have searches for webpages containing related videos (see Fig. 1). The report also presents:
• 1790+ Deepfake videos hosted by the top 10 adult websites, not counting pornhub.com, which has disabled searches for ‘Deepfakes’.
• 6174 Deepfake videos hosted by adult websites featuring fake video content only.
• 3 new sites dedicated to hosting Deepfake pornography.
• 902 papers published on arXiv in 2018 alone that include ‘GAN (Generative Adversarial Network)’ in their titles or abstracts.
• 25 articles published on the topic, including non-peer-reviewed ones, 12 of which are funded by DARPA.

Figure 1. (a) Number of webpages returned by Google search for "Deepfake", (b) Number of searches for webpages containing Deepfake videos.

In this paper, we apply a deep ensemble learning technique, namely DeepfakeStack, by evaluating several DL-based state-of-the-art models. The idea behind DeepfakeStack is to train a meta-learner on top of pre-trained base-learners; it offers an interface to fit the meta-learner on the predictions of the base-learners and shows how the ensemble technique performs the classification task. The architecture of DeepfakeStack involves two or more base-learners, called level-0 models, and a meta-learner, called the level-1 model, that combines the predictions of the base-learners.
Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on September 04,2024 at 16:53:35 UTC from IEEE Xplore. Restrictions apply.
a. Stacking ensemble (SE): The SE takes the outputs of the base-learners and uses them as input to train the meta-learner in such a way that the model learns how to best map the base-learners’ decisions into an enhanced output.
b. Randomized weighted ensemble (WRE): In the WRE technique, each of the base-learners is weighted by a value based on its performance, which is evaluated on a hold-out validation dataset. A model receives a higher weight if it performs better than the others. In other words, this technique optimizes the weights that are applied to the base-learners’ outputs before taking the weighted average.

III. RELATED WORKS

A. Facial Manipulations
Guera and Delp [10] proposed a system that contains two essential components: (i) a CNN and (ii) an LSTM. For a given image sequence, the CNN generates a set of features for each frame and, after concatenating the features of multiple sequential frames, passes them to the LSTM for analysis. The proposed network was trained on 600 videos and achieved an accuracy of 97.1%. Li and Lyu [11] proposed a method for detecting warping artifacts in Deepfake videos by training four different CNNs on authentic and manipulated face images and obtained an accuracy of up to 99%. In [12], the authors proposed another approach that reveals fake faces using an eye-blinking detection technique, on the assumption that this function is missing in Deepfake videos. Afchar et al. [13] trained a CNN, namely MesoNet, for classifying real and Deepfake-manipulated faces. The network has two inception modules, namely Meso-4 and MesoInception-4, in conjunction with two convolution layers connected by max-pooling segments. Zhou et al. [14] proposed a two-stream network for detecting face manipulation in video. In the first stream, a CNN-based face classification network is trained to capture tampering artifact evidence, and in the second stream, a steganalysis feature-based triplet network is trained on features that capture local noise residual evidence. In [15], the authors proposed a two-phase method, where the first phase crops and adjusts the facial area taken from the video frames using computer-generated masks, and the second phase detects the manipulation using a recurrent convolutional network (RCNN). The experiment improved detection accuracy over previously reported results by up to 4.55%. In [16], Do et al. suggested a deep CNN model for classifying an image as real or as a fake generated by GANs. The main objectives of that article can be summarized as (i) creating training datasets that can be adapted to the test dataset, (ii) building a deep learning network based on face recognition networks for extracting face features, and finally, (iii) fine-tuning to fit the face features to the real/fake face classification.

B. Digital Media Forensics
Much work has been done on digital media forensics; early papers include, e.g., [17, 18]. Cozzolino et al. [19] introduced an autoencoder-based architecture named ForensicTransfer (FT), which differentiates authentic images from counterfeit ones. Across a series of experiments, FT achieves an accuracy of up to 80-85%. Nguyen et al. [20] proposed a multi-task learning based CNN for concurrently carrying out detection and segmentation of manipulated facial images and videos. The proposed system contains an encoder that encodes features used for the binary classification and a Y-shaped decoder, where the output of one of its branches is used for segmenting the manipulated regions. In [21], the authors presented a deep CNN-based model that uses a capsule network (CN) to detect Deepfakes. In addition, it identifies replay attacks and computer-generated images. In [22], the authors proposed an approach for building a Deepfake detector called FakeCatcher (FC) to detect synthetic portrait videos, where the proposed method exploits biological signals extracted from facial areas. The method in [23] exploits missing reflections and missing details in the eye and teeth areas; texture features are extracted from the facial region based on facial landmarks and fed into ML classifiers that label the videos as Deepfake or real. Koopman et al. [24] explored photo response non-uniformity (PRNU) analysis to detect Deepfakes in video frames. In this PRNU analysis, a series of frames is generated from the input videos and kept in sequentially labeled folders. To keep the retained portion of the PRNU pattern consistent, each video frame is cropped by exactly the same pixel values. These frames are then split into 8 groups of equal size, and for each group a typical PRNU pattern is created using the second-order FSTV method; the patterns are compared to one another by calculating the normalized cross-correlation scores. For each video, the method estimates the variations in correlation scores and the average correlation score. Finally, it performs a t-test on these results for Deepfakes and real videos, which indicates whether the differences between the two are statistically significant.

IV. METHODOLOGY

A. Dataset Description
To conduct the experiment and evaluate the proposed technique, we have used the FaceForensics++ (FF++) dataset [25]. The dataset was collected by the Visual Computing Group (VGG), which is an active research group on computer vision, computer graphics, and machine learning. The dataset includes 1000 real videos that were downloaded from YouTube. These videos were then manipulated using 3 popular state-of-the-art manipulation techniques (i.e., Deepfake, FaceSwap, and Face2Face).

B. Data Analysis and Preprocessing
To achieve the best performance of the ML/DL models used, we need to preprocess the dataset by applying some data analysis. The dataset is organized as follows:
• In this experiment, we have used 49 real and 51 fake to make a balanced dataset. After that, we
separate each of the videos by category under directories (e.g., Original, Deepfakes).
• For each video, a folder is created that contains all extracted image sequences. For example, if the video file’s name is ‘485.mp4’, then we create a directory with the same name, ‘485’, which contains all the frames of ‘485.mp4’. We do this for each of the original videos and follow the same procedure for the Deepfakes data.
• We don’t consider the entire video sequence; instead, we take only 101 frames from each video to reduce the computational time.
As the main objective is to detect manipulated face images, we are concerned only with the face area, so it is a good idea to ignore everything else, such as the body and the background. Therefore, we track the face in each of the images and feed the faces into the classifier.

C. DeepfakeStack
DeepfakeStack provides a way of combining a set of k base-learners, C1, C2, …, Ck, to produce an enhanced classifier C*. For a given dataset, D, it splits D into k training sets, D1, D2, …, Dk, and uses Di to build the base-learner Ci. For classifying a new (unseen) data tuple, DeepfakeStack returns a class prediction based on the votes of these base-learners. Simply put, for a given tuple X to classify, it accumulates the class label predictions obtained from the base-learners and yields the class in the majority. The algorithm is summarized in figure 3.

Figure 3. The algorithm for the DeepfakeStack classifier.

The working procedure of DeepfakeStack is split into two sections: (i) Base-Learners Creation, (ii) Stack Generalization.
• Base-Learners Creation: As we defined this work as a binary classification problem, we fix the label 0 for real and 1 for Deepfake, and measure both accuracy and categorical log loss. Once the data analysis is done, it is crucial to decide what kind of models might work for this data. It is a good idea to prefer a CNN-based architecture, as we have an image dataset. In addition, selecting suitable hyperparameters is a huge challenge; these may include the number of layers, number of units, dropout rates, activations, learning rates, etc. Considering all of this, we can fine-tune any architecture that has already been trained and tested on a similar dataset, for example, the CNN-based networks that have already been trained and tested on the ImageNet dataset [26]. To adapt any of these architectures, the dataset needs to be preprocessed and the environment set up accordingly. These CNN-based networks were trained on normalized RGB images of equal size (224x224). Therefore, before feeding the data into the model, we must make sure that it is normalized and resized to the same size. In this experiment, we initialize 7 DL models (i.e., XceptionNet, MobileNet, ResNet101, InceptionV3, DenseNet121, InceptionResNetV2, DenseNet169) with ImageNet weights and apply transfer learning by replacing only the topmost layer with a 2-output layer with SoftMax activation. We consider these architectures as base-learners, and to train these models, we follow the Greedy Layer-wise Pretraining (GLP) [27] technique. GLP successively adds a new hidden layer to a model and refits the model. It allows the newly added layer to learn from the outputs of the existing hidden layers while keeping the weights of the existing hidden layers fixed. This procedure is called “layer-wise” because the model is trained one layer at a time, and is referred to as “greedy” because this layer-wise method resolves the problem of training a deep network one layer at a time.
• Stack Generalization: Once the base-learners are ready, we need to define the meta-learner. For the meta-learner, we create a CNN-based classifier, namely DeepfakeStackClassifier (DFC), and embed it in a larger multi-headed neural network that learns the best combination of the predictions from each input base-learner. This approach permits the stacking ensemble to be treated as a single large model, and the benefit is that the outputs of the base-learners are provided directly to the meta-learner. In addition, it makes it possible to update the weights of the base-learners as well as of the meta-learner model (see figure 4). The input layer of each base-learner is used as an individual input head to the DFC model. This means that k copies of the input data are fed to the DFC model, where k represents the number of input models (base-learners), and the outputs of these models are merged. In this experiment, a simple concatenation merge has been used, where a single 14-element vector is formed from the two class-probabilities predicted by each of the 7 base-learners. To interpret this “input” to the meta-learner, we define a hidden layer in conjunction with an output layer that makes its probabilistic prediction.
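As a point of comparison with the learned meta-learner, the majority-vote rule stated at the start of Section IV-C can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the example probabilities are hypothetical and are not taken from the paper; only the label convention (0 for real, 1 for Deepfake) comes from the text.

```python
from collections import Counter

REAL, DEEPFAKE = 0, 1  # label convention from the text


def predicted_label(probs):
    """Turn one base-learner's [P(real), P(Deepfake)] into a class label."""
    return DEEPFAKE if probs[DEEPFAKE] > probs[REAL] else REAL


def majority_vote(base_outputs):
    """Collect one label per base-learner and return the most frequent one."""
    votes = Counter(predicted_label(p) for p in base_outputs)
    return votes.most_common(1)[0][0]


# Hypothetical outputs of 7 base-learners for a single face image.
base_outputs = [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9], [0.3, 0.7],
                [0.55, 0.45], [0.4, 0.6], [0.05, 0.95]]
decision = majority_vote(base_outputs)  # 5 votes for Deepfake, 2 for real
```

With an odd number of base-learners and two classes, a tie cannot occur, which is one practical reason to keep the number of voters odd when using this rule.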
Figure 4. Overview of DeepfakeStack.
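The concatenation merge of the Stack Generalization step can be illustrated with a small, dependency-free Python sketch of the meta-learner's forward pass: seven 2-element probability vectors are flattened into one 14-element vector, passed through a hidden layer, and mapped to two class probabilities. The hidden-layer size, the ReLU activation, the random (untrained) weights, and all function names are assumptions made for illustration; the paper does not specify them, and this is not the authors' implementation.

```python
import math
import random

NUM_BASE_LEARNERS = 7   # from the text
NUM_CLASSES = 2         # from the text: 0 = real, 1 = Deepfake
HIDDEN_UNITS = 10       # assumption: the hidden size is not given in the text


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def concatenate(base_outputs):
    """Merge per-model class probabilities into one flat feature vector."""
    return [p for probs in base_outputs for p in probs]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (stand-ins for trained values)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def meta_learner_forward(base_outputs, hidden, output):
    """Concatenate -> hidden layer (ReLU, assumed) -> 2-way softmax."""
    features = concatenate(base_outputs)
    hidden_act = [max(0.0, v) for v in dense(features, *hidden)]
    return softmax(dense(hidden_act, *output))


rng = random.Random(0)
hidden = init_layer(rng, NUM_BASE_LEARNERS * NUM_CLASSES, HIDDEN_UNITS)
output = init_layer(rng, HIDDEN_UNITS, NUM_CLASSES)

# Hypothetical softmax outputs of the 7 base-learners for one face crop.
base_outputs = [[0.1, 0.9], [0.2, 0.8], [0.4, 0.6], [0.05, 0.95],
                [0.3, 0.7], [0.15, 0.85], [0.25, 0.75]]
features = concatenate(base_outputs)   # the 14-element meta-learner input
probs = meta_learner_forward(base_outputs, hidden, output)
```

In a real system the two layers would be trained on held-out base-learner predictions; in a deep learning framework, the same structure would typically be expressed as a concatenation layer followed by two dense layers.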
in figure 6, the ROC tells us how good the model is at classifying the two classes: Original and Deepfake. The area covered by the curve is the area between the colored line and the axis, where each colored line represents an individual model/classifier (e.g., the blue line represents the DFC model). The bigger the area covered, the better the model is at classifying the given classes. In other words, the closer the AUROC is to 1, the better. Based on the experiment, we can see that the DFC achieves an AUROC of 1.0, which indicates that the positive and negative data classes are perfectly separated and the model is as efficient as it can get.

VI. CONCLUSIONS AND FUTURE WORKS
Detecting Deepfakes has become a significant challenge because, even though many such manipulated videos are intended for entertainment, many of them could still be harmful to individuals and society. Based on the research needs, a few datasets of Deepfake manipulation have been made available. In this paper, we propose a deep ensemble learning technique, DeepfakeStack, by experimenting with various DL-based models on the FF++ dataset. In the experiment, a larger stacking ensemble neural network model (called DFC) is defined and fit on the test (unseen) dataset, and the new model is then used to predict the test dataset. Evaluating the results, we see that the proposed DFC model achieves an accuracy of 99.65% and an AUROC of 1.0, outperforming the individual DL-based models, and thereby provides a strong basis for developing an effective Deepfake detector. In future work, the authors intend to use the proposed method, in conjunction with Blockchain technology, to build a Deepfake detection and prevention system.

REFERENCES
[1] FaceApp, https://www.faceapp.com/, last accessed: 2020/06/07.
[2] FakeApp, https://www.fakeapp.org/, last accessed: 2020/06/07.
[3] G. Patrini, F. Cavalli, and H. Ajder, “The state of Deepfakes: reality under attack,” Annual Report v.2.3, January 2019.
[4] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner, “Face2Face: Real-Time Face Capture and Reenactment of RGB Videos,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, pp. 2387–2395, November 2016.
[5] S. Suwajanakorn, S. M. Seitz, and I. K. Shlizerman, “Synthesizing Obama: learning lip sync from audio,” ACM Transactions on Graphics, 36(4), July 2017.
[6] Faceswap, https://github.com/MarekKowalski/FaceSwap/, last accessed: 2020/06/07.
[7] Exploring DeepFakes, https://goberoi.com/exploring-deepfakes-20c9947c22d9, last accessed: 2020/06/07.
[8] How deep learning fakes videos (Deepfake) and how to detect it?, https://medium.com/@jonathan_hui/how-deep-learning-fakes-videos-deepfakes-and-how-to-detect-it-c0b50fbf7cb9, last accessed: 2020/06/07.
[9] The Power of Ensembles in Deep Learning, https://towardsdatascience.com/the-power-of-ensembles-in-deep-learning-a8900ff42be9, last accessed: 2020/06/07.
[10] D. Guera, and E. J. Delp, “Deepfake Video Detection Using Recurrent Neural Networks,” 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, pp. 1–6, November 2018.
[11] Y. Li, and S. Lyu, “Exposing DeepFake Videos by Detecting Face Warping Artifacts,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 46–52, 2019.
[12] Y. Li, M. Chang, and S. Lyu, “In Ictu Oculi: Exposing AI Created Fake Videos by Detecting Eye Blinking,” 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, pp. 1–7, December 2018.
[13] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, “MesoNet: a Compact Facial Video Forgery Detection Network,” 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, pp. 1–7, December 2018.
[14] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, “Two-Stream Neural Networks for Tampered Face Detection,” 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, pp. 1831–1839, July 2017.
[15] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan, “Recurrent Convolutional Strategies for Face Manipulation Detection in Videos,” Workshop on Applications of Computer Vision and Pattern Recognition to Media Forensics with CVPR, pp. 80–87, 2019.
[16] N. T. Do, I. S. Na, and S. H. Kim, “DeepFakes: Forensics Face Detection from GANs Using Convolutional Neural Network,” International Symposium on Information Technology Convergence (ISITC 2018), South Korea, 2018.
[17] Q. Liu, A. H. Sung, et al., “Feature Mining and Pattern Recognition in Steganalysis and Digital Forensics,” Pattern Recognition, Machine Intelligence and Biometrics (Editor Patrick S.P. Wang), High Education Press and Springer, pp. 561–604, December 2011.
[18] Q. Liu, P. Cooper, et al., “Detection of JPEG Double Compression and Identification of Smartphone Image Source and Post-Capture Manipulation,” Applied Intelligence, 39(4), pp. 705–726, 2013.
[19] D. Cozzolino, J. Thies, A. Rossler, R. Christian, M. Nießner, and L. Verdoliva, “ForensicTransfer: Weakly-supervised domain adaptation for forgery detection,” arXiv:1812.02510, December 2018.
[20] H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen, “Multi-task Learning for Detecting and Segmenting Manipulated Facial Images and Videos,” arXiv:1906.06876, June 2019.
[21] H. H. Nguyen, J. Yamagishi, and I. Echizen, “Capsule-forensics: Using Capsule Networks to Detect Forged Images and Videos,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, United Kingdom, pp. 2307–2311.
[22] U. A. Ciftci, and I. Demir, “FakeCatcher: Detection of Synthetic Portrait Videos using Biological Signals,” arXiv:1901.02212, January 2019.
[23] F. Matern, C. Riess, and M. Stamminger, “Exploiting Visual Artifacts to Expose Deepfakes and Face Manipulations,” 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), Waikoloa Village, HI, USA, pp. 83–92, January 2019.
[24] M. Koopman, A. M. Rodriguez, and Z. Geradts, “Detection of Deepfake Video Manipulation,” 20th Irish Machine Vision and Image Processing conference (IMVIP 2018), Northern Ireland, United Kingdom, August 2018.
[25] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Niessner, “FaceForensics++: Learning to Detect Manipulated Facial Images,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, pp. 1–11, October-November 2019.
[26] ImageNet, http://www.image-net.org/, last accessed: 2020/06/07.
[27] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS’06), Cambridge, MA, USA, pp. 153–160, December 2006.