
2020 7th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud)/2020 6th IEEE International Conference on Edge Computing and Scalable Cloud (EdgeCom)

DeepfakeStack: A Deep Ensemble-based Learning Technique for Deepfake Detection

Md. Shohel Rana, Andrew H. Sung
Computing Sciences and Computer Engineering
The University of Southern Mississippi
Hattiesburg, MS 39406, United States

978-1-7281-6550-9/20/$31.00 ©2020 IEEE
DOI 10.1109/CSCloud-EdgeCom49738.2020.00021
Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on September 04,2024 at 16:53:35 UTC from IEEE Xplore. Restrictions apply.

Abstract—Recent advances in technology have made deep learning (DL) models available for use in a wide variety of novel applications; for example, generative adversarial network (GAN) models are capable of producing hyper-realistic images, speech, and even videos, such as the so-called “Deepfake” produced by GANs with manipulated audio and/or video clips, which are so realistic as to be indistinguishable from the real ones in human perception. Aside from innovative and legitimate applications, there are numerous nefarious or unlawful ways to use such counterfeit content in propaganda, political campaigns, cybercrimes, extortion, etc. To meet the challenges posed by Deepfake multimedia, we propose a deep ensemble learning technique called DeepfakeStack for detecting such manipulated videos. The proposed technique combines a series of DL-based state-of-the-art classification models and creates an improved composite classifier. Our experiments show that DeepfakeStack outperforms other classifiers, achieving an accuracy of 99.65% and an AUROC score of 1.0 in detecting Deepfakes. Therefore, our method provides a solid basis for building a real-time Deepfake detector.

Keywords-Deepfake; DeepfakeStack; GANs; Deep Ensemble Learning; Greedy Layer-wise Pretraining.

I. INTRODUCTION
Recent progress in automated video processing applications (e.g., FaceApp [1], FakeApp [2]), artificial neural networks (ANN), ML algorithms, and social media allows cybercriminals to create and spread high-quality manipulated video content (a.k.a. fake videos) through digital media, leading to the appearance of deliberate misinformation. Depictions of people doing or saying things they never actually did or said are becoming remarkably realistic, and their authenticity is hard to verify. “Deepfake manipulation” permits anyone to swap the face of an individual with another’s face, including expressions, and creates photorealistic fake images or videos known as Deepfakes. These videos are readily put to malicious use, and some of them could be harmful to individuals and society. In the past, video manipulation was an expensive task that required an extensive amount of workforce, time, and money, but now it needs only a gaming laptop or desktop with an internet connection and a basic knowledge of ANN. Deepfakes became popular when fabricated porn videos of well-known faces, for example, celebrities or politicians, began to appear online. The practice violates not only the rules of consent but also the victim’s privacy, because creating Deepfakes without a person’s approval is a form of abuse that leads to another kind of crime. As presented in the annual report [3], the number of webpages returned by Google searches for the keyword ‘Deepfake’ has expanded rapidly since 2017, as has the number of searches for webpages containing related videos (see Fig. 1). The report also presents:
• 1790+ Deepfake videos hosted by the top 10 adult websites, without considering pornhub.com, which has disabled searches for ‘Deepfakes’.
• 6174 Deepfake videos hosted by adult websites featuring fake video content only.
• 3 new sites dedicated to hosting Deepfake pornography.
• 902 papers published on arXiv in 2018 alone that include ‘GAN (Generative Adversarial Network)’ in their titles or abstracts.
• 25 articles published on the topic, including non-peer-reviewed ones, of which DARPA funds 12.

Figure 1. (a) Number of webpages returned by Google search for "Deepfake", (b) Number of searches for webpages containing Deepfake videos.

In this paper, we apply a deep ensemble learning technique, namely DeepfakeStack, by evaluating several DL-based state-of-the-art models. The idea behind DeepfakeStack is to train a meta-learner on top of pre-trained base-learners; it offers an interface to fit the meta-learner on the predictions of the base-learners and shows how the ensemble technique performs the classification task. The architecture of DeepfakeStack involves two or more base-learners, called level-0 models, and a meta-learner, called the level-1 model, that combines the predictions of these level-0 models. The level-1 model is trained on the predictions made by the base models on out-of-sample data. That is, data not used to train the base models is fed to the base models, predictions are made, and these predictions, along with the expected outputs, provide the input and output pairs of the training dataset used to fit the meta-model. In this experiment, XceptionNet, ResNet101, InceptionResNetV2, MobileNet, InceptionV3, DenseNet121, and DenseNet169 are used as base-learners, and a newly defined CNN model, also known as the Deepfake Classifier (DFC), is used as the second-level meta-learner. The experiment using these models shows that DeepfakeStack achieves an accuracy of 99.65% and an AUROC of 1.0.

The rest of the paper is organized as follows: Sect. 2 gives an overview of Deepfake and a brief introduction to deep ensemble learning techniques; Sect. 3 describes related works; Sect. 4 presents the methodology, including the dataset description, data preprocessing, the proposed technique, and the technology used; Sect. 5 presents results and analysis; and Sect. 6 gives conclusions and future work.

II. OVERVIEW

A. Definition of Deepfake
The word Deepfake combines "Deep Learning" and "Fake" and refers to any photo-realistic audiovisual content created using DL technology. The technique starts by analyzing plenty of photos or a video of one’s face, training an AI algorithm to manipulate that face, and then using that algorithm to map the face onto a person in an image or video. The term “Deepfake” dates from late 2017 and is named after a Reddit user known as Deepfakes, who used DL technology to replace a target actor’s face with another’s face in pornographic videos. Recently, two popular facial manipulation methods have attracted a lot of attention from cybercriminals for video manipulation:
• Facial expression manipulation: Face2Face [4] allows anyone to transfer the facial expressions of one person to a different person using standard applications in real time. For example, “Synthesizing Obama” [5] can animate a person’s look by transferring another person's facial expression based on an audio input sequence.
• Facial identity manipulation: In FaceSwap [6], the face of a person is replaced by another person’s face instead of changing expressions; Snapchat is one example. The same methodology is applied in Deepfake using DL technology.

B. Deepfake Generation Pipeline
To create Deepfakes (Deepfake images or videos), researchers apply the face-swapping technique using Generative Adversarial Networks (GANs) [7-8]. The GANs used in Deepfake generation are built from a combination of two neural networks. One is called a synthesizer or generative network, and the other is called a detector or discriminative network. These two networks participate in generating realistic fake videos or images. The generative system is built with an encoder and a decoder and is responsible for creating images, and the discriminative network determines whether the created representation is accurate and believable. The example below describes how it works (see figure 2).
1. After collecting many images of both actors (A and B), build an encoder that encodes all these images to extract essential features, and then use a decoder to reconstruct the corresponding image.
2. Use different decoders for actor A and actor B to decode the features. To do this, train the generative network with the backpropagation algorithm in such a way that the output closely matches the input.
3. After training, process the video frame by frame to exchange A’s face with B’s face. To do this, first extract A’s face using a standard face detection technique and feed it into the encoder. Then use the decoder of actor B, instead of the decoder of A, to reconstruct the image. Finally, merge the newly formed face image into the original image.
4. In the final step, feed original images to the discriminative network and train it to identify originality better. It confirms that the created images are indistinguishable from authentic ones to the human eye.

(a) Encode all images of both actors (A and B) using the generative network.
(b) Reconstruct the image of actor A by using the decoder of actor B instead of the decoder of A, and then merge it into the original image. Use the discriminative network to confirm that the created image is accurate and believable.
Figure 2. Deepfake generation pipeline.

C. Deep Ensemble Learning Technique
The ensemble technique is a way of combining a list of sub-models (also known as base-learners) to form an ideal predictive model in which each sub-model contributes to producing the final output. The newly generated combined model is called a meta-model. The two commonly used methods are [9]:

a. Stacking ensemble (SE): The SE takes the outputs of the base-learners and uses them as input to train the meta-learner in such a way that the model learns how to best map the base-learners’ decisions into an enhanced output.
b. Randomized weighted ensemble (WRE): In the WRE technique, each base-learner is weighted by a value based on its performance as evaluated on a hold-out validation dataset. A model receives a higher weight if it performs better than the others. In other words, this technique just optimizes the weights that are applied to the base-learners’ outputs before taking the weighted average.

III. RELATED WORKS

A. Facial Manipulations
Guera and Delp [10] proposed a system that contains two essential components: (i) a CNN and (ii) an LSTM. For a given image sequence, the CNN generates a set of features for each frame, and the features of multiple sequential frames are concatenated and passed to the LSTM for analysis. The proposed network was trained on 600 videos and achieved an accuracy of 97.1%. Li and Lyu [11] proposed a method for detecting warping artifacts in Deepfake videos by training four different CNNs on authentic and manipulated face images and obtained an accuracy of up to 99%. In [12], the authors proposed another approach that reveals fake faces using an eye-blinking detection technique, on the assumption that this function is missing in Deepfakes. Afchar et al. [13] trained a CNN, namely MesoNet, for classifying real and Deepfake-manipulated faces. The network has two inception modules, namely Meso-4 and MesoInception-4, in conjunction with two convolution layers connected by max-pooling segments. Zhou et al. [14] proposed a two-stream network for detecting face manipulation in video. In its first stream, a CNN-based face classification network is trained to capture tampering artifact evidence, and in the second stream, a steganalysis feature-based triplet network is trained for controlling functions that capture local noise residual evidence. In [15], the authors proposed a two-phase method, where the first phase crops and adjusts the facial area taken from the video frames using computer-generated masks, and the second phase detects such manipulation using a recurrent convolutional network (RCNN). The experiment improved detection accuracy over previously reported results by up to 4.55%. In [16], Do et al. suggested a deep CNN model for classifying an image as real or as a fake generated by GANs. The main objectives of this article can be summarized as (i) creating training data sets that can be adapted to the test data set, (ii) building a deep learning network based on face recognition networks for extracting face features, and finally, (iii) fine-tuning to fit the face features to classify real/fake faces.

B. Digital Media Forensics
Much work has been done on digital media forensics; early papers include, e.g., [17, 18]. Cozzolino et al. [19] introduced an autoencoder-based architecture named ForensicTransfer (FT), which differentiates authentic images from counterfeit ones. In a series of experiments, FT achieves up to 80-85% accuracy. Nguyen et al. [20] proposed a multi-task learning based CNN for concurrently carrying out detection and segmentation of manipulated facial images and videos. The proposed system contains an encoder that encodes features used for the binary classification and a Y-shaped decoder in which the output of one of the branches is used for segmenting the manipulated regions. In [21], the authors presented a deep CNN-based model that uses a capsule network (CN) to detect Deepfakes. In addition, it identifies replay attacks and computer-generated images. In [22], the authors proposed an approach for building a Deepfake detector called FakeCatcher (FC) to detect synthetic portrait videos; the proposed method exploits biological signals extracted from facial areas. The method in [23] exploits missing reflections and missing details in the eye and teeth areas; texture features are extracted from the facial region based on facial landmarks and fed into ML classifiers that classify videos as Deepfakes or real. Koopman et al. [24] explored photo response non-uniformity (PRNU) analysis to detect Deepfakes in video frames. In this PRNU analysis, a series of frames is generated from the input videos and kept in sequentially labeled folders. To leave the portion of the PRNU pattern and make it consistent, each video frame is cropped by exactly the same pixel values. The frames are then split into 8 groups of equal size, and for each group a typical PRNU pattern is created using the second-order FSTV method; the patterns are compared with one another by calculating normalized cross-correlation scores. For each video, the variations in correlation scores and the average correlation score are estimated. Finally, a t-test is performed on these results to assess the statistical significance of the difference between Deepfakes and real videos.

IV. METHODOLOGY

A. Dataset Description
To conduct the experiment and evaluate the proposed technique, we use the FaceForensics++ (FF++) dataset [25]. The dataset was collected by the Visual Computing Group (VGG), an active research group on computer vision, computer graphics, and machine learning. The dataset includes 1000 real videos downloaded from YouTube, which were then manipulated using 3 popular state-of-the-art manipulation techniques (i.e., Deepfake, FaceSwap, and Face2Face).

B. Data Analysis and Preprocessing
To achieve the best performance from the ML/DL models used, we need to preprocess the dataset by applying some data analysis. The dataset is organized as follows:
• In this experiment, we have used 49 real and 51 fake to make a balanced dataset. After that, we
separate each of the videos based on its category under the directories (e.g., Original, Deepfakes).
• For each video, a folder is created that contains all extracted image sequences. For example, if the video file’s name is ‘485.mp4’, then we create a directory with the same name, ‘485’, that contains all the frames of ‘485.mp4’. We do this for each of the original videos and follow the same procedure for the Deepfakes data.
• We don’t consider the entire video sequence; instead, we take only 101 frames from each video to reduce the computational time.
As the main objective is to detect manipulated face images, we are concerned only with the face area. So, it is a good idea to ignore everything else, such as the body and background. Therefore, we track the face in each of the images and feed the faces into the classifier.

C. DeepfakeStack
DeepfakeStack provides a way of combining a set of k base-learners, C1, C2, …, Ck, to produce an enhanced classifier C*. Given a dataset D, it splits D into k training sets, D1, D2, …, Dk, and uses Di to build the base-learner Ci. To classify a new (unseen) data tuple, DeepfakeStack returns a class prediction based on the votes of these base-learners. Simply put, for a given tuple X to classify, it accumulates the class label predictions obtained from the base-learners and yields the majority class. The algorithm is summarized in figure 3.

Figure 3. The algorithm for the DeepfakeStack classifier.

The working procedure of DeepfakeStack is split into two parts: (i) Base-Learners Creation, (ii) Stack Generalization.
• Base-Learners Creation: As we define this work as a binary classification problem, we fix the label 0 for real and 1 for Deepfake, and measure both accuracy and categorical log loss. Once the data analysis is done, it is crucial to decide what kind of models might work for this data. It is a good idea to prefer a CNN-based architecture as we have an image dataset. In addition, selecting the right hyperparameters is a huge challenge; these may include the number of layers, number of units, dropout rates, activations, learning rates, etc. Considering all this, we can fine-tune any relevant architecture that has already been trained and tested on a similar dataset, for example, the CNN-based networks that have already been trained and tested on the ImageNet dataset [26]. To adapt any of these architectures, the dataset needs to be preprocessed and an environment established accordingly. These CNN-based networks were trained with normalized RGB images of equal size (224x224). Therefore, before feeding data into the model, we must make sure that the dataset is normalized and resized to the same size. In this experiment, we initialize 7 DL models (XceptionNet, MobileNet, ResNet101, InceptionV3, DenseNet121, InceptionResNetV2, and DenseNet169) with ImageNet weights and apply transfer learning by replacing only the topmost layer with a 2-output layer with SoftMax activation. We consider these architectures as base-learners, and to train these models, we follow the Greedy Layer-wise Pretraining (GLP) [27] technique. GLP successively adds a new hidden layer to a model and refits the model. It lets the newly added layer learn from the inputs it receives from the existing hidden layer while keeping the weights of the existing hidden layers fixed. This procedure is called “layer-wise” because the model is trained one layer at a time, and is referred to as “greedy” because this layer-wise method can resolve the problem of training a deep network.
• Stack Generalization: Once the base-learners are ready, we need to define the meta-learner. For the meta-learner, we create a CNN-based classifier, namely DeepfakeStackClassifier (DFC), and embed it in a larger multi-headed neural network that learns the best combination of the predictions from each input base-learner. This approach permits the stacking ensemble to be treated as a single large model, with the benefit that the outputs of the base-learners are provided directly to the meta-learner. It also makes it possible to update the weights of the base-learners as well as the meta-learner model (see figure 4). The input layer of each base-learner is used as an individual input head to the DFC model. This means that k copies of the input data are fed to the DFC model, where k is the number of input models (base-learners), and the outputs of these models are merged. In this experiment, a simple concatenation merge is used, in which a single 14-element vector is formed from the two class probabilities predicted by each of the 7 base-learners. To interpret this “input” to the meta-learner, we define a hidden layer in conjunction with an output layer that makes its probabilistic prediction.

Figure 4. Overview of DeepfakeStack.
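The concatenation merge and meta-learner described in Section IV.C can be sketched in a few dozen lines of dependency-free Python. This is an illustration under stated assumptions, not the authors' implementation: the base-learners are hypothetical stand-ins (noisy scorers instead of frozen CNNs), the data is synthetic, and the meta-learner is a single logistic output unit rather than the hidden-plus-output layers of the DFC. Only the 7 x 2 = 14-element merged vector, the 0/1 label convention, and the idea of fitting the meta-learner on held-out base-learner predictions are taken from the text; every function name and parameter is an assumption.

```python
import math
import random


def softmax2(z_real, z_fake):
    """Two-class softmax; returns [P(real), P(fake)]."""
    m = max(z_real, z_fake)
    e_real, e_fake = math.exp(z_real - m), math.exp(z_fake - m)
    total = e_real + e_fake
    return [e_real / total, e_fake / total]


def make_base_learner(noise, rng):
    """Hypothetical stand-in for a frozen, pre-trained CNN base-learner.

    It sees a scalar 'fakeness' score through its own noise and emits two
    class probabilities, like a 2-output SoftMax head.
    """
    def predict(x):
        z = x + rng.gauss(0.0, noise)
        return softmax2(-z, z)
    return predict


def stack_features(base_learners, x):
    """Concatenation merge: k learners x 2 probabilities -> 2k-element vector."""
    merged = []
    for predict in base_learners:
        merged.extend(predict(x))
    return merged


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_meta_learner(features, labels, epochs=300, lr=0.5):
    """Fit a logistic meta-learner by batch gradient descent (a simplification)."""
    n = len(features[0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for f, y in zip(features, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias) - y
            for i in range(n):
                grad_w[i] += err * f[i]
            grad_b += err
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(features)
    return weights, bias


def predict_meta(weights, bias, feature_vector):
    """Return 1 (Deepfake) or 0 (real), following the paper's label convention."""
    score = sum(w * v for w, v in zip(weights, feature_vector)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


def make_dataset(n, rng):
    """Synthetic data: label 0 (real) centred at -2, label 1 (Deepfake) at +2."""
    labels = [i % 2 for i in range(n)]
    xs = [rng.gauss(2.0 if y else -2.0, 1.0) for y in labels]
    return xs, labels


rng = random.Random(0)
# Seven base-learners of varying quality (noise levels are arbitrary).
base_learners = [make_base_learner(0.5 + 0.25 * i, rng) for i in range(7)]

# Fit the meta-learner on base-learner predictions for held-out samples.
holdout_x, holdout_y = make_dataset(200, rng)
holdout_feats = [stack_features(base_learners, x) for x in holdout_x]
weights, bias = fit_meta_learner(holdout_feats, holdout_y)

# Evaluate the stacked model on fresh synthetic samples.
test_x, test_y = make_dataset(200, rng)
test_feats = [stack_features(base_learners, x) for x in test_x]
preds = [predict_meta(weights, bias, f) for f in test_feats]
accuracy = sum(p == y for p, y in zip(preds, test_y)) / len(test_y)
print(f"stacked accuracy on synthetic data: {accuracy:.3f}")
```

Because the base-learners are never modified inside `fit_meta_learner`, the sketch mirrors a setting in which only the meta-learner's parameters are trained. The printed accuracy describes this toy data only and says nothing about the results reported in the paper.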

V. RESULTS AND ANALYSIS


After defining the DFC model, we fit it directly on the holdout test dataset for 300 epochs. Note that the weights of the base-learners are not updated during training since their trainable property is set to False (i.e., not trainable) when defining them. Only the weights of the new hidden and output layers are updated. After successful fitting, we use this DFC model to predict unseen data and expect the DFC to perform better than any individual sub-model (base-learner). For comparison, the performances are shown in Table I.

TABLE I. PERFORMANCE OF THE DEEPFAKESTACK MODEL AND INDIVIDUAL DL MODEL (BASE-LEARNER)

Model     Precision       Recall          F1-score        Accuracy   AUROC
          0       1       0       1       0       1       (%)
XCEPN     0.94    1.00    1.00    0.94    0.97    0.97    96.88      0.976
INCV3     0.88    0.95    0.85    0.88    0.86    0.87    86.49      0.866
MOBN      0.84    1.00    1.00    0.81    0.92    0.90    90.74      0.911
RSN101    0.94    0.96    0.96    0.94    0.95    0.95    94.95      0.954
IRNV2     0.82    1.00    1.00    0.79    0.90    0.88    89.26      0.899
DNS121    0.93    1.00    1.00    0.93    0.96    0.96    96.34      0.969
DNS169    0.95    1.00    1.00    0.94    0.97    0.97    97.13      0.971
DFC       0.99    1.00    1.00    0.99    1.00    1.00    99.65      1.000

Columns 0 and 1 denote the class labels (0 = real, 1 = Deepfake). XCEPN: XceptionNet; INCV3: InceptionV3; MOBN: MobileNet; RSN101: ResNet101; IRNV2: InceptionResNetV2; DNS121: DenseNet121; DNS169: DenseNet169; DFC: DeepfakeStack classifier.

Figure 5. Accuracy.
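The columns of Table I (per-class precision, recall, and F1-score, overall accuracy, and AUROC) can be computed from true labels and predicted scores as in the sketch below. It is illustrative only: the function names and the four-sample toy input are assumptions, not the paper's evaluation code, and AUROC is computed with the rank-based pairwise definition instead of integrating a plotted curve.

```python
def per_class_report(y_true, y_pred, cls):
    """Precision, recall and F1-score, treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy_score(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def auroc(y_true, scores):
    """Probability that a random Deepfake sample (label 1) receives a higher
    score than a random real sample (label 0); ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy input (hypothetical): label 0 = real, 1 = Deepfake, as in the paper.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]          # predicted probability of class 1
y_pred = [1 if s >= 0.5 else 0 for s in scores]

for cls in (0, 1):
    print(cls, per_class_report(y_true, y_pred, cls))
print("accuracy:", accuracy_score(y_true, y_pred))
print("AUROC:", auroc(y_true, scores))
```

Under this pairwise definition, an AUROC of 1.0 means that every Deepfake sample is scored higher than every real sample, which is the perfect separation the results section describes for the DFC.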

The main difference in performance among the various learners or classifiers is based on their model size; for DeepfakeStack, the model size is very large, and if the model is not carefully built, this results in overfitting. The overall accuracy of each DL model using the same parameters is summarized in figure 5, which shows that the best performance is obtained by the DeepfakeStack (DFC) model. The DFC achieves an accuracy of 99.65%. Based on the experiment, we can say that the DFC model has learned to detect manipulated videos/images and performs very well when the video or image contents are manipulated by Deepfake.

Figure 6. ROC curve.

To summarize, a ROC curve is produced per model by varying the threshold value from 0 to 1, which helps to visualize the tradeoff between sensitivity and specificity and to recognize how well separated our data classes are. As shown
in figure 6, the ROC curve tells us how good the model is at classifying the two classes: Original and Deepfake. The area covered by the curve is the area between the colored line and the axis, where each colored line represents an individual model/classifier (e.g., the blue line represents the DFC model). The bigger the area covered, the better the model is at classifying the given classes. In other words, the closer the AUROC is to 1, the better. Based on the experiment, we can see that the DFC achieves an AUROC of 1.0, which indicates that the positive and negative data classes are perfectly separated, and the model is as efficient as it can get.

VI. CONCLUSIONS AND FUTURE WORKS
Detecting Deepfakes has become a significant challenge because, even though many such manipulated videos are intended for entertainment, many of them could still be harmful to individuals and society. In response to research needs, a few datasets of Deepfake manipulation have been made available. In this paper, we propose a deep ensemble learning technique, DeepfakeStack, by experimenting with various DL-based models on the FF++ dataset. In the experiment, a larger stacking ensemble neural network model (called DFC) is defined and fit on the test (unseen) dataset, and the new model is then used to predict the test dataset. Evaluating the results, we see that the proposed DFC model achieves an accuracy of 99.65% and an AUROC of 1.0, outperforming the individual DL-based models, and thereby provides a strong basis for developing an effective Deepfake detector. In future work, the authors intend to use the proposed method, in conjunction with Blockchain technology, to build a Deepfake detection and prevention system.

REFERENCES
[1] FaceApp, https://www.faceapp.com/, last accessed: 2020/06/07.
[2] FakeApp, https://www.fakeapp.org/, last accessed: 2020/06/07.
[3] G. Patrini, F. Cavalli, and H. Ajder, “The state of Deepfakes: reality under attack,” Annual Report v.2.3, January 2019.
[4] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner, “Face2Face: Real-Time Face Capture and Reenactment of RGB Videos,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, pp. 2387–2395, November 2016.
[5] S. Suwajanakorn, S. M. Seitz, and I. K. Shlizerman, “Synthesizing Obama: learning lip sync from audio,” ACM Transactions on Graphics, 36(4), July 2017.
[6] Faceswap, https://github.com/MarekKowalski/FaceSwap/, last accessed: 2020/06/07.
[7] Exploring DeepFakes, https://goberoi.com/exploring-deepfakes-20c9947c22d9, last accessed: 2020/06/07.
[8] How deep learning fakes videos (Deepfake) and how to detect it?, https://medium.com/@jonathan_hui/how-deep-learning-fakes-videos-deepfakes-and-how-to-detect-it-c0b50fbf7cb9, last accessed: 2020/06/07.
[9] The Power of Ensembles in Deep Learning, https://towardsdatascience.com/the-power-of-ensembles-in-deep-learning-a8900ff42be9, last accessed: 2020/06/07.
[10] D. Guera, and E. J. Delp, “Deepfake Video Detection Using Recurrent Neural Networks,” 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, pp. 1–6, November 2018.
[11] Y. Li, and S. Lyu, “Exposing DeepFake Videos by Detecting Face Warping Artifacts,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 46–52, 2019.
[12] Y. Li, M. Chang, and S. Lyu, “In Ictu Oculi: Exposing AI Created Fake Videos by Detecting Eye Blinking,” 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, pp. 1–7, December 2018.
[13] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, “MesoNet: a Compact Facial Video Forgery Detection Network,” 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, pp. 1–7, December 2018.
[14] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, “Two-Stream Neural Networks for Tampered Face Detection,” 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, pp. 1831–1839, July 2017.
[15] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan, “Recurrent Convolutional Strategies for Face Manipulation Detection in Videos,” Workshop on Applications of Computer Vision and Pattern Recognition to Media Forensics with CVPR, pp. 80–87, 2019.
[16] N. T. Do, I. S. Na, and S. H. Kim, “DeepFakes: Forensics Face Detection from GANs Using Convolutional Neural Network,” International Symposium on Information Technology Convergence (ISITC 2018), South Korea, 2018.
[17] Q. Liu, A. H. Sung, et al., “Feature Mining and Pattern Recognition in Steganalysis and Digital Forensics,” Pattern Recognition, Machine Intelligence and Biometrics (Editor Patrick S.P. Wang), High Education Press and Springer, pp. 561–604, December 2011.
[18] Q. Liu, P. Cooper, et al., “Detection of JPEG Double Compression and Identification of Smartphone Image Source and Post-Capture Manipulation,” Applied Intelligence, 39(4), pp. 705–726, 2013.
[19] D. Cozzolino, J. Thies, A. Rossler, R. Christian, M. Nießner, and L. Verdoliva, “ForensicTransfer: Weakly-supervised domain adaptation for forgery detection,” arXiv:1812.02510, December 2018.
[20] H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen, “Multi-task Learning for Detecting and Segmenting Manipulated Facial Images and Videos,” arXiv:1906.06876, June 2019.
[21] H. H. Nguyen, J. Yamagishi, and I. Echizen, “Capsule-forensics: Using Capsule Networks to Detect Forged Images and Videos,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, United Kingdom, pp. 2307–2311.
[22] U. A. Ciftci, and I. Demir, “FakeCatcher: Detection of Synthetic Portrait Videos using Biological Signals,” arXiv:1901.02212, January 2019.
[23] F. Matern, C. Riess, and M. Stamminger, “Exploiting Visual Artifacts to Expose Deepfakes and Face Manipulations,” 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), Waikoloa Village, HI, USA, pp. 83–92, January 2019.
[24] M. Koopman, A. M. Rodriguez, and Z. Geradts, “Detection of Deepfake Video Manipulation,” 20th Irish Machine Vision and Image Processing conference (IMVIP 2018), Northern Ireland, United Kingdom, August 2018.
[25] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Niessner, “FaceForensics++: Learning to Detect Manipulated Facial Images,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, pp. 1–11, October-November 2019.
[26] ImageNet, http://www.image-net.org/, last accessed: 2020/06/07.
[27] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS’06), Cambridge, MA, USA, pp. 153–160, December 2006.
