
Deepfakes Detection with Automatic Face Weighting

Daniel Mas Montserrat, Hanxiang Hao, S. K. Yarlagadda, Sriram Baireddy, Ruiting Shao
János Horváth, Emily Bartusiak, Justin Yang, David Güera, Fengqing Zhu, Edward J. Delp

Video and Image Processing Laboratory (VIPER)


School of Electrical Engineering
Purdue University
West Lafayette, Indiana, USA

Abstract

Altered and manipulated multimedia is increasingly present and widely distributed via social media platforms. Advanced video manipulation tools enable the generation of highly realistic-looking altered multimedia. While many methods have been presented to detect manipulations, most of them fail when evaluated with data outside of the datasets used in research environments. In order to address this problem, the Deepfake Detection Challenge (DFDC) provides a large dataset of videos containing realistic manipulations and an evaluation system that ensures that methods work quickly and accurately, even when faced with challenging data. In this paper, we introduce a method based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) that extracts visual and temporal features from faces present in videos to accurately detect manipulations. The method is evaluated with the DFDC dataset, providing competitive results compared to other techniques.

Figure 1. Example of images from the DFDC [3] dataset: original image (left) and manipulated image with the swapped face (right).

1. Introduction

Manipulated multimedia is rapidly increasing its presence on the Internet and social media. Its rise is fueled by the mass availability of easy-to-use tools and techniques for generating realistic fake multimedia content. Recent advancements in the field of deep learning have led to the development of methods to create artificial images and videos that are eerily similar to authentic images and videos. Manipulated multimedia created using such techniques, typically involving neural networks such as Generative Adversarial Networks (GANs) [1] and Auto-Encoders (AEs) [2], is generally referred to as Deepfakes. While these tools can be useful to automate steps in movie production, video game design, or virtual reality rendering, they are potentially very damaging if used for malicious purposes. As manipulation tools become more accessible, realistic, and undetectable, the divide between real and fake multimedia is blurred. Furthermore, social media allows for the uncontrolled spread of manipulated content at a large scale. This spread of misinformation damages journalism and news providers, as it becomes increasingly difficult to distinguish between reliable and untrustworthy information sources.

Human facial manipulations are among the most common Deepfake forgeries. Through face swaps, an individual can be placed at a location where he or she was never present. By altering the lip movement and the associated speech signal, realistic videos can be generated of individuals saying words they never actually uttered. This type of Deepfake manipulation can be very damaging when used to generate graphic adult content or fake news that can alter public opinion. In fact, many images and videos containing such Deepfake forgeries are already present on adult content web sites, news articles, and social media.
Image and video manipulations have been utilized for a long time. Before the advent of Deepfakes, editing tools such as Photoshop [4] or GIMP [5] were widely used for image manipulations. Some common forgeries include splicing (inserting objects into images) [6], copying and moving parts within an image (copy-move forgery) [7], or shadow removal [8]. While research on detecting such manipulations has been conducted for more than a decade [6, 9, 10, 8, 11, 12, 13, 7, 14], many techniques fail to detect more recent and realistic manipulations, especially when the multimedia alterations are performed with deep learning methods. Fortunately, there is an increasing effort to develop reliable detection technology, such as the Deepfake Detection Challenge (DFDC) [3] organized by AWS, Facebook, Microsoft, and the Partnership on AI's Media Integrity Steering Committee.

Advances in deep learning have resulted in a great variety of methods that have provided groundbreaking results in many areas including computer vision, natural language processing, and biomedical applications [15]. While several neural networks that detect a wide range of manipulations have been introduced [16, 17, 18, 19, 20, 21], new generative methods that create very realistic fake multimedia [22, 23, 24, 25, 26] are presented every year, leading to a push-and-pull problem where manipulation methods try to fool new detection methods and vice-versa. Therefore, there is a need for methods that are capable of detecting multimedia manipulations in a robust and rapid manner.

In this paper, we present a novel model architecture that combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN) to accurately detect facial manipulations in videos. The network automatically selects the most reliable frames to detect these manipulations using a weighting mechanism combined with a Gated Recurrent Unit (GRU) that provides a final probability of a video being real or fake. We train and evaluate our method with the Deepfake Detection Challenge dataset, obtaining a final score of 0.321 (log-likelihood error, the lower the better) at position 117 of 2275 teams (top 6%) on the public leader-board.

2. Related Work

There are many techniques for face manipulation and generation. Some of the most commonly used include FaceSwap [27], Face2Face [28], DeepFakes [25], and NeuralTextures [26]. FaceSwap and Face2Face are computer graphics based methods, while the other two are learning based methods. In FaceSwap [27], a face from a source video is projected onto a face in a target video using facial landmark information. The face is successfully projected by minimizing the difference between the projected shape and the target face's landmarks. Finally, the rendered face is color corrected and blended with the target video. In Face2Face [28], facial expressions from a selected face in a source video are transferred to a face in the target video. Face2Face uses selected frames from each video to create dense reconstructions of the two faces. These dense reconstructions are used to re-synthesize the target face with different expressions under different lighting conditions. In DeepFakes [25], two autoencoders [2] (with a shared encoder) are trained to reconstruct target and source faces. To create fake faces, the trained encoder and decoder of the source face are applied on the target face. This fake face is blended onto the target video using Poisson image editing [29], creating a Deepfake video. Note the difference between DeepFakes (capital F), the technique now being described, and Deepfakes (lowercase f), which is a general term for fake media generated with deep learning-based methods. In NeuralTextures [26], a neural texture of the face in the target video is learned. This information is used to render the facial expressions from the source video on the target video.

In recent years, methods have been developed to detect such deep learning-based manipulations. In [16], several CNN architectures have been tested in a supervised setting to discriminate between GAN-generated images and real images. Preliminary results are promising, but the performance degrades as the difference between training and testing data increases or when the data is compressed. In [17, 18, 19], forensic analysis of GAN-generated images revealed that GANs leave high-frequency fingerprints in the images they generate.

Additionally, several techniques to detect videos containing facial manipulations have been presented. While some of these methods focus on detecting videos containing only DeepFake manipulations, others are designed to be agnostic to the technique used to perform the facial manipulation. The work presented in [30, 31] uses a temporal-aware pipeline composed of a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) to detect DeepFake videos. Current DeepFake videos are created by splicing synthesized face regions onto the original video frames. This splicing operation can leave artifacts that can later be detected when estimating the 3D head pose. The authors of [32] exploit this fact and use the difference between the head pose estimated with the full set of facial landmarks and a subset of them to separate DeepFake videos from real videos. This method provided competitive results on the UADFV [33] database. The same authors proposed a method [34] to detect DeepFake videos by analyzing face warping artifacts. The authors of [20] detect manipulated videos generated by the DeepFake and Face2Face techniques with a shallow neural network that acts on mesoscopic features extracted from the video frames to distinguish manipulated videos from real ones. However, the results presented in [21] demonstrated that in a supervised setting, several deep network based models [35, 36, 37] outperform the ones based on shallow networks when detecting fake videos generated with DeepFake, Face2Face, FaceSwap, and NeuralTextures.
Figure 2. Block Diagram of our proposed Deepfake detection system: MTCNN detects faces within the input frames, then EfficientNet
extracts features from all the detected face regions, and finally the Automatic Face Weighting (AFW) layer and the Gated Recurrent Unit
(GRU) predict if the video is real or manipulated.

3. Deepfake Detection Challenge Dataset

The Deepfake Detection Challenge (DFDC) [3] dataset contains a total of 123,546 videos with face and audio manipulations. Each video contains one or more people and has a length of 10 seconds with a total of 300 frames. The nature of these videos typically includes standing or sitting people, either facing the camera or not, with a wide range of backgrounds, illumination conditions, and video qualities. The training videos have a resolution of 1920 × 1080 pixels, or 1080 × 1920 pixels if recorded in vertical mode. Figure 1 shows some example frames from videos of the dataset. The dataset is composed of a total of 119,146 videos with a unique label (real or fake) in a training set, 400 videos in a validation set without labels, and 4000 private videos in a testing set. The 4000 videos of the test set cannot be inspected, but models can be evaluated on them through the Kaggle system. The ratio of manipulated to real videos is 1:0.28. Because only the training videos contain labels, we use the totality of that set to train and validate our method. The provided training videos are divided into 50 numbered parts. We use 30 parts for training, 10 for validation, and 10 for testing.

A unique label is assigned to each video specifying whether it contains a manipulation or not. However, it is not specified which type of manipulation is performed: face, audio, or both. As our method only uses video information, manipulated videos with only audio manipulations will lead to noisy labels, as the video will be labeled as fake but the faces will be real. Furthermore, more than one person might be present in the video, with face manipulations performed on only one of them.

The private set used for testing evaluates submitted methods within the Kaggle system and reports a log-likelihood loss. Log-likelihood loss drastically penalizes predictions that are both confident and wrong. In the worst case, a prediction that a video is authentic when it is actually manipulated, or the other way around, adds infinity to the error score. In practice, if this worst case happens, the loss is clipped to a very large value. This evaluation system poses an extra challenge, as methods with good performance in metrics like accuracy could still have very high log-likelihood errors.
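To make the scoring concrete, the following is a small illustrative sketch (not provided by the authors) of a clipped log-likelihood (log loss) computation of the kind used by Kaggle-style evaluation; the clipping value eps is an assumption.

import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    # Clip predictions away from 0 and 1 so a confident mistake adds a large
    # but finite penalty instead of infinity.
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# A single confident wrong prediction dominates the score:
print(log_loss([1, 0, 1], [0.9, 0.1, 0.01]))  # ~1.60, driven by the last video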
4. Proposed Method

Our proposed method (Figure 2) extracts visual and temporal features from faces by using a combination of a CNN with an RNN. Because all visual manipulations are located within face regions, and faces are typically present in a small region of the frame, using a network that extracts features from the entire frame is not ideal. Instead, we focus on extracting features only in regions where a face is present. Because networks trained on general image classification datasets such as ImageNet [38] have performed well when transferred to other tasks [39], we use pre-trained backbone networks as our starting point. Such backbone networks extract features from faces that are later fed to an RNN to extract temporal information. The method has three distinct steps: (1) face detection across multiple frames using MTCNN [40], (2) feature extraction with a CNN, and (3) prediction estimation with a layer we refer to as Automatic Face Weighting (AFW) along with a Gated Recurrent Unit (GRU). Our approach is described in detail in the following subsections, including a boosting and test augmentation approach we included in our DFDC submission.

4.1. Face Detection

We use MTCNN [40] to perform face detection. MTCNN is a multi-task cascaded model that can produce both face bounding boxes and facial landmarks simultaneously. The model uses a cascaded three-stage architecture to predict face and landmark locations in a coarse-to-fine manner. Initially, an image pyramid is generated by resizing the input image to different scales.
The first stage of MTCNN then obtains the initial candidates for facial bounding boxes and landmarks given the input image pyramid. The second stage takes the initial candidates from the first stage as input and rejects a large number of false alarms. The third stage is similar to the second stage but uses a larger input image size and a deeper structure to obtain the final bounding boxes and landmark points. Non-maximum suppression and bounding box regression are used in all three stages to remove highly overlapped candidates and refine the prediction results. With this cascaded structure, MTCNN refines the results stage by stage in order to get accurate predictions.

We choose this model because it provides good detection performance on both real and synthetic faces in the DFDC dataset. While we also considered more recent methods such as BlazeFace [41], which provides faster inference, its false positive rate on the DFDC dataset was considerably larger than that of MTCNN.

We extract faces from 1 out of every 10 frames of each video. In order to speed up the face detection process, we downscale each frame by a factor of 4. Additionally, we include a margin of 20 pixels on each side of the detected bounding boxes in order to capture a broader area of the head, as some regions such as the hair might contain artifacts useful for detecting manipulations. After processing the input frames with MTCNN, we crop all the regions where faces were detected and resize them to 224 × 224 pixels.
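The following is a minimal sketch of this face extraction step, assuming OpenCV for video decoding and the facenet-pytorch implementation of MTCNN; the function and parameter names are ours rather than the authors', and error handling is omitted.

import cv2
from PIL import Image
from facenet_pytorch import MTCNN

detector = MTCNN(keep_all=True, device="cuda")

def extract_faces(video_path, frame_step=10, downscale=4, margin=20, size=224):
    faces = []
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            # Downscale by a factor of 4 to speed up detection.
            small = cv2.resize(rgb, (rgb.shape[1] // downscale, rgb.shape[0] // downscale))
            boxes, _ = detector.detect(Image.fromarray(small))
            if boxes is not None:
                for x1, y1, x2, y2 in boxes * downscale:  # map back to full resolution
                    # Add a 20-pixel margin around the detected box.
                    x1, y1 = max(int(x1) - margin, 0), max(int(y1) - margin, 0)
                    x2 = min(int(x2) + margin, rgb.shape[1])
                    y2 = min(int(y2) + margin, rgb.shape[0])
                    crop = cv2.resize(rgb[y1:y2, x1:x2], (size, size))
                    faces.append(crop)
        idx += 1
    cap.release()
    return faces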
able regions where faces have been detected and discard the
4.2. Face Feature Extraction

After detecting face regions, a binary classification model is trained to extract features that can be used to classify real and fake faces. The large number of videos that have to be processed in a finite amount of time for the Deepfake Detection Challenge requires networks that are both fast and accurate. In this work, we use EfficientNet-b5 [42] as it provides a good trade-off between network parameters and classification accuracy. Additionally, the network has been designed using neural architecture search (NAS) algorithms, resulting in a network that is both compact and accurate. In fact, this network has outperformed previous state-of-the-art approaches on datasets such as ImageNet [38] while having fewer parameters.

Since the DFDC dataset contains many high-quality photo-realistic fake faces, discriminating between real and manipulated faces can be challenging. To achieve a better and more robust face feature extraction, we combine EfficientNet with the additive angular margin loss (also known as ArcFace) [43] instead of a regular softmax cross-entropy loss. ArcFace is a learnable loss function that is based on the classification cross-entropy loss but includes penalization terms to provide a more compact representation of the categories. ArcFace simultaneously reduces the intra-class differences and enlarges the inter-class differences between the classification features. It is designed to enforce a margin between the distance of a sample to its class center and the distances of the sample to the centers of the other classes in an angular space. Therefore, by minimizing the ArcFace loss, the classification model can obtain highly discriminative features for real and fake faces, achieving a more robust classification that succeeds even for high-quality fake faces.
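For reference, the following is a compact sketch of an additive angular margin (ArcFace-style) loss for the frame-level real/fake classification, assuming PyTorch; the scale s and margin m values are illustrative defaults and not values reported by the authors.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=2, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, features, labels):
        # Cosine similarity between L2-normalized features and class centers.
        cos = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to the target-class angle.
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * logits, labels)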
4.3. Automatic Face Weighting

While an image classification CNN provides a prediction for a single image, we need to assign a prediction to an entire video, not just a single frame. The natural choice is to average the predictions across all frames to obtain a video-level prediction. However, this approach has several drawbacks. First, face detectors such as MTCNN can erroneously report that background regions of the frames contain faces, producing false positives. Second, some videos might include more than one face, with only one of them being manipulated. Furthermore, some frames might contain blurry faces where the presence of manipulations is difficult to detect. In such scenarios, a CNN could provide a correct prediction for each frame but an incorrect video-level prediction after averaging.

In order to address this problem, we propose an automatic weighting mechanism to emphasize the most reliable regions where faces have been detected and discard the least reliable ones when determining a video-level prediction. This approach, similar to attention mechanisms [44], automatically assigns a weight, w_j, to each logit, l_j, output by the EfficientNet network for the jth face region. These weights are then used to perform a weighted average of all logits, from all face regions found in all sampled frames, to obtain a final probability of the video being fake. Both logits and weights are estimated with a fully-connected linear layer that takes the features extracted by EfficientNet as input. In other words, the features extracted by EfficientNet are used to estimate a logit (indicating whether the face is real or fake) and a weight (indicating how confident or reliable the logit prediction is). The output probability, p_w, of a video being fake, given by the automatic face weighting, is:

p_w = \sigma\left( \frac{\sum_{j=1}^{N} w_j l_j}{\sum_{j=1}^{N} w_j} \right)    (1)

where w_j and l_j are the weight and logit obtained for the jth face region, respectively, and \sigma(\cdot) is the sigmoid function. Note that after the fully-connected layer, w_j is passed through a ReLU activation function to enforce w_j ≥ 0. Additionally, a very small value is added to the denominator to avoid division by zero. This weighted average aggregates all the estimated logits, providing a video-level prediction.
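A minimal PyTorch sketch of the Automatic Face Weighting layer of Equation (1) is given below. It assumes per-face EfficientNet feature vectors of dimension 2048; the layer and variable names are our own, not taken from the authors' code.

import torch
import torch.nn as nn

class AutomaticFaceWeighting(nn.Module):
    def __init__(self, feat_dim=2048, eps=1e-8):
        super().__init__()
        self.logit_fc = nn.Linear(feat_dim, 1)   # l_j: real/fake logit per face
        self.weight_fc = nn.Linear(feat_dim, 1)  # w_j: reliability of each face
        self.eps = eps

    def forward(self, features):
        # features: (N, feat_dim), one row per detected face region in the video
        logits = self.logit_fc(features).squeeze(-1)                # (N,)
        weights = torch.relu(self.weight_fc(features)).squeeze(-1)  # (N,), w_j >= 0
        # Weighted average of logits, Equation (1); eps avoids division by zero.
        video_logit = (weights * logits).sum() / (weights.sum() + self.eps)
        p_w = torch.sigmoid(video_logit)
        return p_w, logits, weights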
4.4. Gated Recurrent Unit

The backbone model estimates a logit and a weight for each frame without using information from other frames. While the automatic face weighting combines the estimates of multiple frames, these estimates are obtained using single-frame information only. Ideally, however, the video-level prediction would be performed using information from all sampled frames.

In order to merge the features from all face regions and frames, we include a Recurrent Neural Network (RNN) on top of the automatic face weighting. We use a Gated Recurrent Unit (GRU) to combine the features, logits, and weights of all face regions and obtain a final estimate. For each face region, the GRU takes as input a vector of dimension 2051 consisting of the features extracted by EfficientNet (with dimension 2048), the estimated logit l_j, the estimated weighting value w_j, and the estimated manipulation probability p_w from the automatic face weighting. Although l_j, w_j, p_w, and the feature vectors are correlated, we input all of them to the GRU and let the network itself extract the useful information. The GRU is composed of 3 stacked bi-directional layers and a uni-directional layer, with a hidden dimension of 512. The output of the last layer of the GRU is mapped through a linear layer and a sigmoid function to estimate the final probability p_RNN of the video being manipulated.
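The sketch below shows one way to implement this sequence head in PyTorch: per-face vectors of dimension 2051 pass through stacked GRUs and a linear layer to produce p_RNN. The exact wiring of the bi-directional and uni-directional layers is our interpretation of the text, and the class and argument names are illustrative.

import torch
import torch.nn as nn

class VideoGRUHead(nn.Module):
    def __init__(self, in_dim=2051, hidden=512):
        super().__init__()
        # Three stacked bi-directional GRU layers ...
        self.bi_gru = nn.GRU(in_dim, hidden, num_layers=3,
                             bidirectional=True, batch_first=True)
        # ... followed by one uni-directional layer.
        self.uni_gru = nn.GRU(2 * hidden, hidden, num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, seq):
        # seq: (batch, num_faces, 2051) = per-face features + l_j + w_j + p_w
        out, _ = self.bi_gru(seq)
        out, _ = self.uni_gru(out)
        logit_rnn = self.fc(out[:, -1])               # use the last time step
        return torch.sigmoid(logit_rnn).squeeze(-1)   # p_RNN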
4.5. Training Process

We use a pre-trained MTCNN for face detection, and we only train the EfficientNet, GRU, and Automatic Face Weighting layers. The EfficientNet is initialized with weights pre-trained on ImageNet. The GRU and AFW layers are initialized with random weights. During the training process, we oversample real videos (containing only unmanipulated faces) to balance the dataset. The network is trained end-to-end with 3 distinct loss functions: an ArcFace loss on the output of EfficientNet, a binary cross-entropy loss on the automatic face weighting prediction p_w, and a binary cross-entropy loss on the GRU prediction p_RNN.

The ArcFace loss is used to train the EfficientNet layers with batches of cropped faces from randomly selected frames and videos. This loss allows the network to learn from a large variety of manipulated and original faces with various colors, poses, and illumination conditions. Note that ArcFace only trains the EfficientNet layers and not the GRU layers or the fully-connected layers that output the AFW weight values and logits.

The binary cross-entropy (BCE) loss is applied at the outputs of the automatic face weighting layer and the GRU. The BCE loss is computed with cropped faces from the frames of a randomly selected video. Note that this loss is based on the output probabilities of a video being manipulated (video-level predictions), while ArcFace is a loss based on frame-level predictions. The BCE applied to p_w updates the EfficientNet and AFW weights. The BCE applied to p_RNN updates all weights of the ensemble (excluding MTCNN).

While we train the complete ensemble end-to-end, we start the training process with an optional initial step consisting of 2000 batches of random crops applied to the ArcFace loss to obtain an initial set of parameters for the EfficientNet. This initial step provides the network with useful layers with which to later train the automatic face weighting layer and the GRU. While this did not lead to any increase in detection accuracy in our experiments, it provided faster convergence and a more stable training process.

Due to the computing limitations of GPUs, the size of the network, and the number of input frames, only one video can be processed at a time during training. However, the network parameters are updated after processing every 64 videos (for the binary cross-entropy losses) and 256 random frames (for the ArcFace loss). We use Adam as the optimization technique with a learning rate of 0.001.
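The sketch below illustrates the video-level part of this training procedure, assuming PyTorch. Here model and train_loader are assumed to exist; model.backbone, model.afw, model.gru_head, and model.build_sequence are placeholder names for the components described above, and the per-frame ArcFace update is omitted.

import torch

bce = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, lr = 0.001
accum_videos = 64  # parameters are updated after every 64 videos

optimizer.zero_grad()
for step, (faces, label) in enumerate(train_loader):  # one video per iteration
    features = model.backbone(faces)                  # per-face features
    p_w, logits, weights = model.afw(features)        # Equation (1)
    seq = model.build_sequence(features, logits, weights, p_w)
    p_rnn = model.gru_head(seq)
    target = label.float()
    # Video-level binary cross-entropy on both predictions, scaled so the
    # effective update accumulates gradients over 64 videos.
    loss = (bce(p_w, target) + bce(p_rnn, target)) / accum_videos
    loss.backward()
    if (step + 1) % accum_videos == 0:
        optimizer.step()
        optimizer.zero_grad()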
4.6. Boosting Network

The logarithmic nature of the binary cross-entropy loss (or log-likelihood error) used in the DFDC leads to large penalizations for predictions that are both confident and incorrect. In order to obtain a small log-likelihood error, we want a method that has good detection accuracy and is not overconfident in its predictions. To do so, we use two main approaches during testing: (1) adding a boosting network and (2) applying data augmentation during testing.

The boosting network is a replica of the previously described network. However, this auxiliary network is not trained to minimize the binary cross-entropy of the real/fake classification, but to predict the error between the predictions of our main network and the ground truth labels. We do so by estimating the error of the main network in the logit domain for both the AFW and GRU outputs. When using the boosting network, the prediction output by the automatic face weighting layer, p_w^b, is defined as:

p_w^b = \sigma\left( \frac{\sum_{j=1}^{N} (w_j l_j + w_j^b l_j^b)}{\sum_{j=1}^{N} (w_j + w_j^b)} \right)    (2)

where w_j and l_j are the weights and logits output by the main network, w_j^b and l_j^b are the weights and logits output by the boosting network for the jth input face region, and \sigma(\cdot) is the sigmoid function. In a similar manner, the prediction output by the GRU, p_RNN^b, is:

p_{RNN}^b = \sigma(l_{RNN} + l_{RNN}^b)    (3)

where l_RNN is the logit output by the GRU of the main network, l_RNN^b is the logit output by the GRU of the boosting network, and \sigma(\cdot) is the sigmoid function.
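A small sketch of how the two networks' outputs could be combined following Equations (2) and (3), assuming PyTorch; tensor and argument names (the _b suffix marks boosting-network outputs) are illustrative.

import torch

def combine_with_boosting(w, l, w_b, l_b, logit_rnn, logit_rnn_b, eps=1e-8):
    # Equation (2): jointly weighted average of main and boosting logits.
    p_w_b = torch.sigmoid((w * l + w_b * l_b).sum() / ((w + w_b).sum() + eps))
    # Equation (3): add the GRU logits of both networks before the sigmoid.
    p_rnn_b = torch.sigmoid(logit_rnn + logit_rnn_b)
    return p_w_b, p_rnn_b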
Figure 3. Diagram of the proposed method including the boosting network (dashed elements). The predictions of the main and boosting
network are combined at the AFW layer and after the GRUs. We train the main network with the training set and the boosting network
with the validation set.

While the main network is trained using the training split of the dataset, described in Section 3, we train the boosting network with the validation split.

Figure 3 presents the complete diagram of our system when including the boosting network. The dashed elements and the symbols with superscripts form part of the boosting network. The main network and the boosting network are combined at two different points: at the automatic face weighting layer, as described in Equation 2, and after the gated recurrent units, as described in Equation 3.

4.7. Test Time Augmentation

Besides adding the boosting network, we perform data augmentation during testing. For each face region detected by MTCNN, we crop the same region in the 2 previous and 2 following frames of the frame being analyzed. Therefore, we have a total of 5 sequences of detected face regions. We run the network on each of the 5 sequences and perform a horizontal flip on some of the sequences at random. Then, we average the predictions of all the sequences. This approach helps smooth out overconfident predictions: if the predictions of different sequences disagree, averaging all the probabilities leads to a lower number of both incorrect and overconfident predictions.
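The following sketch illustrates this test-time augmentation, assuming PyTorch; extract_face_sequence is a placeholder for a helper that crops the detected face regions at a given frame offset, and model is assumed to return a fake-video probability.

import random
import torch

def predict_with_tta(model, video, offsets=(-2, -1, 0, 1, 2)):
    probs = []
    for off in offsets:
        # Crop the same face regions in the frames shifted by `off`.
        faces = extract_face_sequence(video, frame_offset=off)  # assumed helper
        if random.random() < 0.5:
            faces = torch.flip(faces, dims=[-1])                # horizontal flip
        probs.append(model(faces))
    # Averaging disagreeing predictions softens overconfident outputs.
    return torch.stack(probs).mean()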
5. Experimental Results

We train and evaluate our method with the DFDC dataset, described in Section 3. Additionally, we compare the presented approach with 4 other techniques. We compare it with the work presented in [30] and a modified version of it that only processes face regions detected by MTCNN. We also evaluate two CNNs: EfficientNet [42] and Xception [37]. For these networks, we simply average the predictions for each frame to obtain a video-level prediction. We use the validation set to select the configuration of each model that provides the best balanced accuracy.

Table 1 presents the balanced accuracy results. Because it is based on extracting features from the entire video, Conv-LSTM [30] is unable to capture the manipulations that happen within face regions. However, if the method is adapted to process only face regions, the detection accuracy improves considerably. Classification networks such as Xception [37], which provided state-of-the-art results on the FaceForensics++ dataset [21], and EfficientNet-b5 [42] show good accuracy results. Our work shows that by including an automatic face weighting layer and a GRU, the accuracy is further improved.

Table 1. Balanced accuracy of the presented method and previous works.

Method | Validation | Test
Conv-LSTM [30] | 55.82% | 57.63%
Conv-LSTM [30] + MTCNN | 66.05% | 70.78%
EfficientNet-b5 [42] | 79.25% | 80.62%
Xception [37] | 78.42% | 80.14%
Ours | 92.61% | 91.88%

Additionally, we evaluate the accuracy of the predictions at every stage of our method. Table 2 shows the balanced accuracy of the prediction obtained by averaging the logits predicted by EfficientNet, l_j (logits), the prediction of the automatic face weighting layer, p_w (AFW), and the prediction after the gated recurrent unit, p_RNN (GRU).
We can observe that every stage increases the detection accuracy, with the highest accuracy obtained by the GRU prediction.

Table 2. Balanced accuracy at different stages of our method.

Method | Validation Accuracy
Ours (logits) | 85.51%
Ours (AFW) | 87.90%
Ours (GRU) | 92.61%

Figure 4. Examples of faces with manipulations from DFDC. The images at the top are incorrectly classified by the network. The bottom images are correctly classified.

Figure 4 shows some examples of correctly (bottom) and incorrectly (top) detected manipulations. We observed that the network typically fails when faced with highly realistic manipulations that are performed in blurry or low-quality images. Manipulations performed in high-quality videos seem to be properly detected, even the challenging ones.

We evaluate the effect of using the boosting network and data augmentation during testing. In order to do so, we use the private testing set on the Kaggle system and report our log-likelihood error (the lower the better). Table 3 shows that by using both the boosting network and test augmentation we are able to decrease our log-likelihood error to 0.321. This places the method at position 117 of 2275 teams (5.1%) on the competition's public leader-board.

Table 3. The log-likelihood error of our method with and without the boosting network and test augmentation.

Method | Log-likelihood
Baseline | 0.364
+ Boosting Network | 0.341
+ Test Augmentation | 0.321

6. Conclusions

In this paper, we present a new method to detect face manipulations within videos. We show that combining convolutional and recurrent neural networks achieves high detection accuracies on the DFDC dataset. We describe a method to automatically weight different face regions, and we show that boosting and test augmentation techniques can be used to obtain more robust predictions. The method processes videos quickly (in less than eight seconds) with a single GPU.

Although the results of our experiments are promising, new techniques to generate deepfake manipulations emerge continuously. The modular nature of the proposed approach allows for many improvements, such as using different face detection methods, different backbone architectures, and other techniques to obtain a prediction from the features of multiple frames. Furthermore, this work focuses on face manipulation detection and dismisses any analysis of audio content, which could provide a significant improvement in detection accuracy in future work.

7. Acknowledgment

This material is based on research sponsored by DARPA and the Air Force Research Laboratory (AFRL) under agreement number FA8750-16-2-0173. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA and the Air Force Research Laboratory (AFRL) or the U.S. Government.

Address all correspondence to Edward J. Delp, [email protected].

References

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Proceedings of Advances in Neural Information Processing Systems, pp. 2672–2680, December 2014, Montreal, Canada. 1

[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org. 1, 2

[3] B. Dolhansky, R. Howes, B. Pflaum, N. Baram, and C. C. Ferrer, "The deepfake detection challenge (dfdc) preview dataset," arXiv preprint arXiv:1910.08854, 2019. 1, 2, 3

[4] E. Bailey, Adobe Photoshop: A Beginners Guide to Photoshop Lightroom - The 52 Photoshop Lightroom Tricks You Didn't Know Existed!, vol. 1, CreateSpace Independent Publishing Platform, 2016, North Charleston, SC. 1
[5] The GIMP Development Team, "GIMP," https://www.gimp.org. 1

[6] D. Cozzolino and L. Verdoliva, "Noiseprint: a cnn-based camera model fingerprint," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 144–159, May 2019. 2

[7] M. Barni, Q.-T. Phan, and B. Tondi, "Copy move source-target disambiguation through multi-branch cnns," arXiv preprint arXiv:1912.12640, 2019. 2

[8] S. Yarlagadda, D. Güera, D. M. Montserrat, F. Zhu, E. Delp, P. Bestagini, and S. Tubaro, "Shadow removal detection and localization for forensics analysis," Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2677–2681, May 2019, Brighton, UK. 2

[9] D. Cozzolino, G. Poggi, and L. Verdoliva, "Splicebuster: A new blind image splicing detector," Proceedings of IEEE International Workshop on Information Forensics and Security, pp. 1–6, January 2015, Rome, Italy. 2

[10] S. K. Yarlagadda, D. Güera, P. Bestagini, F. Maggie Zhu, S. Tubaro, and E. J. Delp, "Satellite image forgery detection and localization using gan and one-class classifier," Electronic Imaging, vol. 2018, no. 7, pp. 214-1, January 2018. 2

[11] E. R. Bartusiak, S. K. Yarlagadda, D. Güera, P. Bestagini, S. Tubaro, F. M. Zhu, and E. J. Delp, "Splicing detection and localization in satellite imagery using conditional gans," Proceedings of IEEE Conference on Multimedia Information Processing and Retrieval, pp. 91–96, March 2019, San Jose, CA. 2

[12] J. Horváth, D. Güera, S. K. Yarlagadda, P. Bestagini, F. M. Zhu, S. Tubaro, and E. J. Delp, "Anomaly-based manipulation detection in satellite images," Networks, vol. 29, pp. 21, 2019. 2

[13] M. Barni, L. Bondi, N. Bonettini, P. Bestagini, A. Costanzo, M. Maggini, B. Tondi, and S. Tubaro, "Aligned and non-aligned double jpeg detection using convolutional neural networks," Journal of Visual Communication and Image Representation, vol. 49, pp. 153–163, November 2017. 2

[14] D. Güera, S. Baireddy, P. Bestagini, S. Tubaro, and E. J. Delp, "We need no pixels: Video manipulation detection using stream descriptors," Proceedings of the International Conference on Machine Learning, Synthetic Realities: Deep Learning for Detecting AudioVisual Fakes Workshop, June 2019, Long Beach, CA. 2

[15] Y. LeCun, Y. Bengio, and G. Hinton, "Deep Learning," Nature, vol. 521, pp. 436–444, May 2015. 2

[16] F. Marra, D. Gragnaniello, D. Cozzolino, and L. Verdoliva, "Detection of gan-generated fake images over social networks," Proceedings of the IEEE Conference on Multimedia Information Processing and Retrieval, pp. 384–389, April 2018, Miami, FL. 2

[17] F. Marra, D. Gragnaniello, L. Verdoliva, and G. Poggi, "Do gans leave artificial fingerprints?," Proceedings of IEEE Conference on Multimedia Information Processing and Retrieval, pp. 506–511, March 2019, San Diego, CA. 2

[18] X. Zhang, S. Karaman, and S.-F. Chang, "Detecting and simulating artifacts in gan fake images," arXiv preprint arXiv:1907.06515, 2019. 2

[19] N. Yu, L. S. Davis, and M. Fritz, "Attributing fake images to gans: Learning and analyzing gan fingerprints," Proceedings of the IEEE International Conference on Computer Vision, pp. 7556–7566, October 2019, Seoul, South Korea. 2

[20] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, "Mesonet: a compact facial video forgery detection network," Proceedings of the IEEE International Workshop on Information Forensics and Security, pp. 1–7, December 2018, Hong Kong. 2

[21] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "Faceforensics++: Learning to detect manipulated facial images," Proceedings of the IEEE International Conference on Computer Vision, pp. 1–11, October 2019, Seoul, South Korea. 2, 6

[22] Z. Hui, J. Li, X. Wang, and X. Gao, "Image fine-grained inpainting," arXiv preprint arXiv:2002.02609, 2020. 2

[23] H. Le and D. Samaras, "Shadow removal via shadow image decomposition," Proceedings of the IEEE International Conference on Computer Vision, pp. 8578–8587, October 2019, Seoul, South Korea. 2

[24] A. Brock, J. Donahue, and K. Simonyan, "Large scale gan training for high fidelity natural image synthesis," arXiv preprint arXiv:1809.11096, 2018. 2

[25] "DeepFakes," https://github.com/deepfakes/faceswap. 2

[26] J. Thies, M. Zollhöfer, and M. Nießner, "Deferred neural rendering: Image synthesis using neural textures," ACM Transactions on Graphics, vol. 38, no. 4, pp. 1–12, July 2019. 2

[27] M. Kowalski, "FaceSwap," https://github.com/MarekKowalski/FaceSwap/. 2

[28] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner, "Face2face: Real-time face capture and reenactment of rgb videos," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387–2395, June 2016, Las Vegas, NV. 2

[29] P. Pérez, M. Gangnet, and A. Blake, "Poisson image editing," Proceedings of the ACM Special Interest Group on Computer GRAPHics and Interactive Techniques, pp. 313–318, July 2003, San Diego, California. 2

[30] D. Güera and E. J. Delp, "Deepfake video detection using recurrent neural networks," Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 1–6, November 2018, Auckland, New Zealand. 2, 6

[31] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan, "Recurrent convolutional strategies for face manipulation detection in videos," Interfaces (GUI), vol. 3, pp. 1, 2019. 2
[32] X. Yang, Y. Li, and S. Lyu, “Exposing deep fakes using
inconsistent head poses,” Proceedings of the IEEE Interna-
tional Conference on Acoustics, Speech and Signal Process-
ing, pp. 8261–8265, May 2019, Brighton, United Kingdom.
2
[33] Y. Li, M.-C. Chang, and S. Lyu, “In ictu oculi: Exposing ai
created fake videos by detecting eye blinking,” Proceeding
IEEE International Workshop on Information Forensics and
Security, pp. 1–7, 2018, Hong Kong. 2
[34] Y. Li and S. Lyu, “Exposing deepfake videos by detecting
face warping artifacts,” arXiv preprint arXiv:1811.00656,
2018. 2
[35] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Wein-
berger, “Densely connected convolutional networks,” Pro-
ceedings of the IEEE conference on Computer Vision and
Pattern Recognition, pp. 4700–4708, July 2017, Honolulu,
HI. 2
[36] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna,
“Rethinking the inception architecture for computer vision,”
Proceedings of the IEEE conference on Computer Vision and
Pattern Recognition, pp. 2818–2826, June 2016, Las Vegas,
NV. 2
[37] F. Chollet, “Xception: Deep learning with depthwise sepa-
rable convolutions,” Proceedings of the IEEE conference on
Computer Vision and Pattern Recognition, pp. 1251–1258,
July 2017, Honolulu, HI. 2, 6
[38] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei,
“Imagenet: A large-scale hierarchical image database,” pp.
248–255, August 2009, Miami, FL. 3, 4
[39] M. Huh, P. Agrawal, and A. A. Efros, “What makes
imagenet good for transfer learning?,” arXiv preprint
arXiv:1608.08614, 2016. 3
[40] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detec-
tion and alignment using multitask cascaded convolutional
networks,” IEEE Signal Processing Letters, vol. 23, April
2016. 3
[41] V. Bazarevsky, Y. Kartynnik, A. Vakunov, K. Raveendran,
and M. Grundmann, “Blazeface: Sub-millisecond neural
face detection on mobile gpus,” Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition
Workshop on Computer Vision for Augmented and Virtual
Reality, June 2019, Long Beach, CA. 4
[42] M. Tan and Q. V. Le, “Efficientnet: Rethinking model
scaling for convolutional neural networks,” arXiv preprint
arXiv:1905.11946, 2019. 4, 6
[43] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “ArcFace: Ad-
ditive angular margin loss for deep face recognition,” Pro-
ceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, June 2019, Long Beach, CA. 4
[44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all
you need,” Proceedings of Advances in Neural Information
Processing Systems, pp. 5998–6008, December 2017, Long
Beach, CA. 4
