Paper (Related Project-3)
Daniel Mas Montserrat, Hanxiang Hao, S. K. Yarlagadda, Sriram Baireddy, Ruiting Shao
János Horváth, Emily Bartusiak, Justin Yang, David Güera, Fengqing Zhu, Edward J. Delp
3. Deepfake Detection Challenge Dataset

The Deepfake Detection Challenge (DFDC) [3] dataset contains a total of 123,546 videos with face and audio manipulations. Each video contains one or more people and has a length of 10 seconds with a total of 300 frames. The videos typically show standing or sitting people, either facing the camera or not, with a wide range of backgrounds, illumination conditions, and video qualities. The training videos have a resolution of 1920 × 1080 pixels, or 1080 × 1920 pixels if recorded in vertical mode. Figure 1 shows some example frames from videos of the dataset. The dataset is composed of 119,146 videos with a unique label (real or fake) in the training set, 400 unlabeled videos in the validation set, and 4,000 private videos in the testing set. The 4,000 test videos cannot be inspected, but models can be evaluated on them through the Kaggle system. The ratio of manipulated to real videos is 1:0.28. Because only the training videos contain labels, we use that entire set to train and validate our method. The provided training videos are divided into 50 numbered parts. We use 30 parts for training, 10 for validation, and 10 for testing.

A unique label is assigned to each video specifying whether it contains a manipulation or not. However, it is not specified which type of manipulation is performed: face, audio, or both. Because our method only uses video information, videos with only audio manipulations lead to noisy labels, as the video is labeled as fake although the faces are real. Furthermore, more than one person might be present in the video, with face manipulations performed on only one of them.

The private set used for testing evaluates submitted methods within the Kaggle system and reports a log-likelihood loss. Log-likelihood loss drastically penalizes predictions that are both confident and wrong. In the worst case, a fully confident prediction that a video is authentic when it is actually manipulated, or the other way around, adds infinity to the error score. In practice, if this worst case happens, the loss is clipped to a very large value. This evaluation system poses an extra challenge, as methods with good performance on metrics like accuracy can still have very high log-likelihood errors.
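To make the behavior of this metric concrete, the snippet below is a minimal sketch of a clipped binary log loss; the clipping constant eps is an illustrative assumption, since the exact clipping value used by the evaluation system is not given here.

```python
import numpy as np

def clipped_log_loss(y_true, y_pred, eps=1e-15):
    """Binary log-likelihood error with predictions clipped away from 0 and 1.

    Clipping keeps a single confident-but-wrong prediction from contributing
    an infinite penalty; eps is an illustrative choice, not the DFDC value.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

# One confident mistake dominates the average:
# clipped_log_loss([1, 1], [0.9, 0.0001]) ≈ (0.105 + 9.210) / 2 ≈ 4.66
```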
4. Proposed Method

Our proposed method (Figure 2) extracts visual and temporal features from faces by using a combination of a CNN with an RNN. Because all visual manipulations are located within face regions, and faces typically occupy only a small region of the frame, using a network that extracts features from the entire frame is not ideal. Instead, we focus on extracting features only in regions where a face is present. Because networks trained on general image classification datasets such as ImageNet [38] have performed well when transferred to other tasks [39], we use pre-trained backbone networks as our starting point. Such backbone networks extract features from faces that are later fed to an RNN to extract temporal information. The method has three distinct steps: (1) face detection across multiple frames using MTCNN [40], (2) feature extraction with a CNN, and (3) prediction estimation with a layer we refer to as Automatic Face Weighting (AFW) along with a Gated Recurrent Unit (GRU). Our approach is described in detail in the following subsections, including a boosting and test augmentation approach we included in our DFDC submission.

4.1. Face Detection

We use MTCNN [40] to perform face detection. MTCNN is a multi-task cascaded model that can produce both face bounding boxes and facial landmarks simultaneously. The model uses a cascaded three-stage architecture to predict face and landmark locations in a coarse-to-fine manner. Initially, an image pyramid is generated by resizing the input image to different scales. The first stage of MTCNN then obtains the initial candidates of facial bounding boxes and landmarks given the input image pyramid. The second stage takes the initial candidates from the first stage as input and rejects a large number of false alarms. The third stage is similar to the second stage but uses a larger input image size and a deeper structure to obtain the final bounding boxes and landmark points. Non-maximum suppression and bounding box regression are used in all three stages to remove highly overlapping candidates and refine the predictions. With this cascaded structure, MTCNN refines the results stage by stage in order to obtain accurate predictions.

We choose this model because it provides good detection performance on both real and synthetic faces in the DFDC dataset. While we also considered more recent methods such as BlazeFace [41], which provides faster inference, its false positive rate on the DFDC dataset was considerably larger than that of MTCNN.
We extract faces from one out of every 10 frames of each video. In order to speed up the face detection process, we downscale each frame by a factor of 4. Additionally, we include a margin of 20 pixels on each side of the detected bounding boxes in order to capture a broader area of the head, as some regions such as the hair might contain artifacts useful for detecting manipulations. After processing the input frames with MTCNN, we crop all the regions where faces were detected and resize them to 224 × 224 pixels.
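The following sketch illustrates this sampling and cropping pipeline. The detect_faces argument is a hypothetical stand-in for any detector (such as an MTCNN wrapper) that returns bounding boxes for a frame; the constants mirror the values given above.

```python
import cv2  # OpenCV for video decoding and resizing

FRAME_STRIDE = 10      # keep 1 of every 10 frames
DOWNSCALE = 4          # the detector runs on a 4x smaller frame
MARGIN = 20            # extra pixels around each detected box
CROP_SIZE = (224, 224)

def extract_face_crops(video_path, detect_faces):
    """Sample frames, detect faces on a downscaled copy, and return 224x224
    face crops taken from the full-resolution frame.

    detect_faces(frame) is assumed to return (x1, y1, x2, y2) boxes.
    """
    cap = cv2.VideoCapture(video_path)
    crops, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % FRAME_STRIDE == 0:
            small = cv2.resize(frame, None, fx=1 / DOWNSCALE, fy=1 / DOWNSCALE)
            for x1, y1, x2, y2 in detect_faces(small):
                # map the box back to full resolution and add the margin
                x1, y1, x2, y2 = [int(v * DOWNSCALE) for v in (x1, y1, x2, y2)]
                x1, y1 = max(x1 - MARGIN, 0), max(y1 - MARGIN, 0)
                x2 = min(x2 + MARGIN, frame.shape[1])
                y2 = min(y2 + MARGIN, frame.shape[0])
                crops.append(cv2.resize(frame[y1:y2, x1:x2], CROP_SIZE))
        idx += 1
    cap.release()
    return crops
```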
4.2. Face Feature Extraction

After detecting face regions, a binary classification model is trained to extract features that can be used to classify the real/fake faces. The large number of videos that have to be processed in a finite amount of time for the Deepfake Detection Challenge requires networks that are both fast and accurate. In this work, we use EfficientNet-b5 [42] as it provides a good trade-off between network parameters and classification accuracy. Additionally, the network has been designed using neural architecture search (NAS) algorithms, resulting in a network that is both compact and accurate. In fact, this network has outperformed previous state-of-the-art approaches in datasets such as ImageNet [38] while having fewer parameters.

Since the DFDC dataset contains many high-quality photo-realistic fake faces, discriminating between real and manipulated faces can be challenging. To achieve a better and more robust face feature extraction, we combine EfficientNet with the additive angular margin loss (also known as ArcFace) [43] instead of a regular softmax+cross-entropy loss. ArcFace is a learnable loss function that is based on the classification cross-entropy loss but includes penalization terms to provide a more compact representation of the categories. ArcFace simultaneously reduces the intra-class difference and enlarges the inter-class difference between the classification features. It is designed to enforce a margin between the distance of the sample to its class center and the distances of the sample to the centers of other classes in an angular space. Therefore, by minimizing the ArcFace loss, the classification model can obtain highly discriminative features for real faces and fake faces to achieve a more robust classification that succeeds even for high-quality fake faces.
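For reference, the sketch below shows one common way to implement an additive angular margin head of the kind ArcFace [43] describes. The scale s and margin m are assumed defaults taken from the ArcFace paper rather than values reported for this system, and the class name is ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginHead(nn.Module):
    """Minimal additive-angular-margin (ArcFace-style) classification head."""

    def __init__(self, feature_dim, num_classes=2, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, feature_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m  # illustrative defaults, not tuned values

    def forward(self, features, labels):
        # cosine similarity between L2-normalized features and class centers
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin m only to the target-class angle
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(logits, labels)
```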
4.3. Automatic Face Weighting

While an image classification CNN provides a prediction for a single image, we need to assign a prediction to an entire video, not just a single frame. The natural choice is to average the predictions across all frames to obtain a video-level prediction. However, this approach has several drawbacks. First, face detectors such as MTCNN can erroneously report that background regions of a frame contain faces, producing false positives. Second, some videos might include more than one face, with only one of them being manipulated. Furthermore, some frames might contain blurry faces where the presence of manipulations is difficult to detect. In such scenarios, a CNN could provide a correct prediction for each frame but an incorrect video-level prediction after averaging.

In order to address this problem, we propose an automatic weighting mechanism that emphasizes the most reliable face regions and discards the least reliable ones when determining a video-level prediction. This approach, similar to attention mechanisms [44], automatically assigns a weight, w_j, to each logit, l_j, output by the EfficientNet network for the jth face region. These weights are then used to perform a weighted average of all logits, from all face regions found in all sampled frames, to obtain a final probability of the video being fake. Both logits and weights are estimated by a fully-connected linear layer that takes the features extracted by EfficientNet as input. In other words, the features extracted by EfficientNet are used to estimate a logit (indicating whether the face is real or fake) and a weight (indicating how confident or reliable the logit prediction is). The output probability, p_w, of the video being fake, given by the automatic face weighting, is:

p_w = \sigma\left( \frac{\sum_{j=1}^{N} w_j l_j}{\sum_{j=1}^{N} w_j} \right)    (1)

where w_j and l_j are the weight and logit obtained for the jth face region, respectively, and \sigma(\cdot) is the sigmoid function. Note that after the fully-connected layer, w_j is passed through a ReLU activation function to enforce w_j ≥ 0. Additionally, a very small value is added to the denominator to avoid division by 0. This weighted average aggregates all the estimated logits to provide a video-level prediction.
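A minimal PyTorch sketch of this layer, assuming one 2048-dimensional EfficientNet feature vector per detected face region, is shown below; the module and variable names are ours.

```python
import torch
import torch.nn as nn

class AutomaticFaceWeighting(nn.Module):
    """Weighted logit averaging of Equation (1): one logit and one weight per face."""

    def __init__(self, feature_dim=2048, eps=1e-8):
        super().__init__()
        self.fc = nn.Linear(feature_dim, 2)  # predicts (logit l_j, raw weight) per face
        self.eps = eps                       # avoids division by zero

    def forward(self, features):
        # features: (N, feature_dim), one row per detected face region in the video
        out = self.fc(features)
        logits = out[:, 0]                   # l_j
        weights = torch.relu(out[:, 1])      # w_j >= 0 after the ReLU
        p_w = torch.sigmoid((weights * logits).sum() / (weights.sum() + self.eps))
        return p_w, logits, weights          # logits and weights also feed the GRU (Sec. 4.4)
```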
4.4. Gated Recurrent Unit

The backbone model estimates a logit and a weight for each frame without using information from other frames. While the automatic face weighting combines the estimates of multiple frames, these estimates are obtained using single-frame information. Ideally, however, the video-level prediction would be made using information from all sampled frames.

In order to merge the features from all face regions and frames, we include a Recurrent Neural Network (RNN) on top of the automatic face weighting. We use a Gated Recurrent Unit (GRU) to combine the features, logits, and weights of all face regions into a final estimate. For each face region, the GRU takes as input a vector of dimension 2051 consisting of the features extracted by EfficientNet (with dimension 2048), the estimated logit l_j, the estimated weight w_j, and the estimated manipulation probability after the automatic face weighting, p_w. Although l_j, w_j, p_w, and the feature vector are correlated, we input all of them to the GRU and let the network itself extract the useful information. The GRU is composed of 3 stacked bi-directional layers and a uni-directional layer, with a hidden dimension of 512. The output of the last layer of the GRU is mapped through a linear layer and a sigmoid function to estimate the final probability p_RNN of the video being manipulated.

4.5. Training Process

We use a pre-trained MTCNN for face detection and only train our EfficientNet, GRU, and Automatic Face Weighting layers. The EfficientNet is initialized with weights pre-trained on ImageNet. The GRU and AFW layers are initialized with random weights. During the training process, we oversample real videos (containing only unmanipulated faces) to balance the dataset. The network is trained end-to-end with 3 distinct loss functions: an ArcFace loss on the output of EfficientNet, a binary cross-entropy loss on the automatic face weighting prediction p_w, and a binary cross-entropy loss on the GRU prediction p_RNN.

The ArcFace loss is used to train the EfficientNet layers with batches of cropped faces from randomly selected frames and videos. This loss allows the network to learn from a large variety of manipulated and original faces with various colors, poses, and illumination conditions. Note that ArcFace only trains the EfficientNet layers, not the GRU layers or the fully-connected layers that output the AFW weights and logits.

The binary cross-entropy (BCE) loss is applied at the outputs of the automatic face weighting layer and the GRU. The BCE loss is computed with cropped faces from the frames of a randomly selected video. Note that this loss is based on the output probabilities of videos being manipulated (video-level predictions), while ArcFace is a loss based on frame-level predictions. The BCE applied to p_w updates the EfficientNet and AFW weights. The BCE applied to p_RNN updates all weights of the ensemble (excluding MTCNN).

While we train the complete ensemble end-to-end, we start the training process with an optional initial step consisting of 2000 batches of random crops trained with the ArcFace loss, which provides an initial set of parameters for the EfficientNet. This initial step gives the network useful layers for later training of the automatic face weighting layer and the GRU. While it did not increase detection accuracy in our experiments, it provided faster convergence and a more stable training process.

Due to the computing limitations of GPUs, the size of the network, and the number of input frames, only one video can be processed at a time during training. However, the network parameters are updated after processing every 64 videos (for the binary cross-entropy losses) and every 256 random frames (for the ArcFace loss). We use Adam as the optimizer with a learning rate of 0.001.

4.6. Boosting Network

The logarithmic nature of the binary cross-entropy loss (or log-likelihood error) used in the DFDC leads to large penalizations for predictions that are both confident and incorrect. In order to obtain a small log-likelihood error, we want a method that has good detection accuracy and is not overconfident in its predictions. To this end, we use two main approaches during testing: (1) adding a boosting network and (2) applying data augmentation during testing.

The boosting network is a replica of the previously described network. However, this auxiliary network is not trained to minimize the binary cross-entropy of the real/fake classification, but to predict the error between the predictions of our main network and the ground truth labels. We do so by estimating the error of the main network in the logit domain for both the AFW and GRU outputs. When using the boosting network, the prediction output by the automatic face weighting layer, p_w^b, is defined as:

p_w^b = \sigma\left( \frac{\sum_{j=1}^{N} (w_j l_j + w_j^b l_j^b)}{\sum_{j=1}^{N} (w_j + w_j^b)} \right)    (2)

where w_j and l_j are the weights and logits output by the main network, w_j^b and l_j^b are the weights and logits output by the boosting network for the jth input face region, and \sigma(\cdot) is the sigmoid function. In a similar manner, the prediction output by the GRU, p_RNN^b, is:

p_RNN^b = \sigma(l_RNN + l_RNN^b)    (3)

where l_RNN is the logit output by the GRU of the main network, l_RNN^b is the logit output by the GRU of the boosting network, and \sigma(\cdot) is the sigmoid function.
Figure 3. Diagram of the proposed method including the boosting network (dashed elements). The predictions of the main and boosting
network are combined at the AFW layer and after the GRUs. We train the main network with the training set and the boosting network
with the validation set.
While the main network is trained using the training split of the dataset, described in section 3, we train the boosting network with the validation split.

Figure 3 presents the complete diagram of our system when including the boosting network. The dashed elements and the symbols with superscripts form part of the boosting network. The main network and the boosting network are combined at two different points: at the automatic face weighting layer, as described in equation 2, and after the gated recurrent units, as described in equation 3.
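As a sketch of how the two networks could be merged at these two points, following equations 2 and 3, one might write the following; the function and argument names are ours, and the tensors are assumed to be produced elsewhere by the main and boosting networks.

```python
import torch

def combine_with_boosting(logits, weights, b_logits, b_weights,
                          gru_logit, b_gru_logit, eps=1e-8):
    """Merge main- and boosting-network outputs as in equations 2 and 3."""
    # Equation 2: weighted average over face regions using both sets of logits/weights
    num = (weights * logits + b_weights * b_logits).sum()
    den = (weights + b_weights).sum() + eps
    p_w_b = torch.sigmoid(num / den)

    # Equation 3: the boosting GRU logit acts as a correction to the main GRU logit
    p_rnn_b = torch.sigmoid(gru_logit + b_gru_logit)
    return p_w_b, p_rnn_b
```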
4.7. Test Time Augmentation

Besides adding the boosting network, we perform data augmentation during testing. For each face region detected by the MTCNN, we crop the same region in the 2 previous and 2 following frames of the frame being analyzed. Therefore, we have a total of 5 sequences of detected face regions. We run the network on each of the 5 sequences and perform a horizontal flip on some of the sequences at random. Then, we average the predictions of all the sequences. This approach helps to smooth out overconfident predictions: if the predictions of different sequences disagree, averaging all the probabilities leads to a lower number of both incorrect and overconfident predictions.
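A minimal sketch of this test-time averaging is shown below; the model interface and the 50% flip probability are assumptions made for illustration.

```python
import random
import torch

def predict_with_test_augmentation(model, face_sequences):
    """Average video-level predictions over several face-crop sequences.

    face_sequences: the 5 sequences described above, each a (T, C, H, W) tensor.
    """
    preds = []
    with torch.no_grad():
        for seq in face_sequences:
            if random.random() < 0.5:
                seq = torch.flip(seq, dims=[3])  # flip the width axis of the crops
            preds.append(model(seq))
    return torch.stack(preds).mean()
```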
5. Experimental Results

We train and evaluate our method with the DFDC dataset, described in section 3. Additionally, we compare the presented approach with 4 other techniques. We compare it with the work presented in [30] and with a modified version of it that only processes the face regions detected by MTCNN. We also evaluate two CNNs: EfficientNet [42] and Xception [37]. For these networks, we simply average the per-frame predictions to obtain a video-level prediction.

We use the validation set to select, for each model, the configuration that provides the best balanced accuracy. Table 1 presents the balanced accuracy results. Because it is based on extracting features from entire frames, Conv-LSTM [30] is unable to capture the manipulations that happen within face regions. However, if the method is adapted to process only face regions, the detection accuracy improves considerably. Classification networks such as Xception [37], which provided state-of-the-art results on the FaceForensics++ dataset [21], and EfficientNet-b5 [42] show good accuracy results. Our work shows that including an automatic face weighting layer and a GRU further improves the accuracy.

Table 1. Balanced accuracy of the presented method and previous works.
Method                     Validation   Test
Conv-LSTM [30]             55.82%       57.63%
Conv-LSTM [30] + MTCNN     66.05%       70.78%
EfficientNet-b5 [42]       79.25%       80.62%
Xception [37]              78.42%       80.14%
Ours                       92.61%       91.88%

Additionally, we evaluate the accuracy of the predictions at every stage of our method. Table 2 shows the balanced accuracy of the prediction obtained by averaging the logits predicted by EfficientNet, l_j (logits), the prediction of the automatic face weighting layer, p_w (AFW), and the prediction after the gated recurrent unit, p_RNN (GRU). We can
Table 3. The log-likelihood error of our method with and without
boosting network and test augmentation.
Method Log-likelihood
Baseline 0.364
+ Boosting Network 0.341
+ Test Augmentation 0.321
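For intuition, a log-likelihood error of 0.321 means the correct class receives a geometric-mean probability of roughly exp(−0.321) ≈ 0.73, compared with about exp(−0.364) ≈ 0.69 for the baseline.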