
DeeperForensics Challenge 2020 on Real-World Face Forgery Detection:

Methods and Results

Liming Jiang, Zhengkui Guo, Wayne Wu, Zhaoyang Liu, Ziwei Liu, Chen Change Loy,
Shuo Yang, Yuanjun Xiong, Wei Xia, Baoying Chen, Peiyu Zhuang, Sili Li, Shen Chen, Taiping Yao,
Shouhong Ding, Jilin Li, Feiyue Huang, Liujuan Cao, Rongrong Ji, Changlei Lu, Ganchao Tan
arXiv:2102.09471v1 [cs.CV] 18 Feb 2021

Abstract

This paper reports methods and results in the DeeperForensics Challenge 2020 on real-world face forgery detection. The challenge employs the DeeperForensics-1.0 dataset, one of the most extensive publicly available real-world face forgery detection datasets, with 60,000 videos constituted by a total of 17.6 million frames. The model evaluation is conducted online on a high-quality hidden test set with multiple sources and diverse distortions. A total of 115 participants registered for the competition, and 25 teams made valid submissions. We will summarize the winning solutions and present some discussions on potential research directions.

1. Introduction

Recent years have witnessed exciting progress [1, 23, 4, 5, 20, 17] in automatic face swapping. These techniques have eschewed cumbersome hand-crafted face manipulation processes, facilitating the development of various popular software tools for face editing. From another perspective, such easy-to-access tools, named "Deepfakes", have also brought risks of being misused and spread. Tampered videos on the internet could lead to perilous consequences, raising legitimate concerns among the general public and authorities. Therefore, effective face forgery detection methods are urgently needed to safeguard against these photorealistic fake videos, particularly in real-world scenarios where the video sources and distortions are unknown.

We organize the DeeperForensics Challenge 2020 with the aim to advance the state of the art in face forgery detection. Participants are expected to develop robust and generic methods for forgery detection in real-world scenarios. The challenge uses DeeperForensics-1.0 [17], a large-scale real-world face forgery detection dataset that contains 60,000 videos with a total of 17.6 million frames¹. All source videos in DeeperForensics-1.0 are carefully collected, and fake videos are generated by a newly proposed end-to-end face swapping framework. Extensive real-world perturbations are applied to obtain a more challenging benchmark of larger scale and higher diversity. The dataset also features a hidden test set, which is richer in distribution than the publicly available training set, suggesting a better setting to simulate real-world scenarios. Besides, the hidden test set will be continuously updated along with the development of Deepfakes technology. The evaluation of the challenge is performed online on the current version of the hidden test set.

In the following sections, we describe the DeeperForensics Challenge 2020, summarize the winning solutions and results, and provide discussions to take a closer look at the current status and possible future development of real-world face forgery detection.

Author affiliations:
• Liming Jiang, Ziwei Liu, and Chen Change Loy are with S-Lab, Nanyang Technological University.
• Zhengkui Guo is with The Chinese University of Hong Kong.
• Wayne Wu and Zhaoyang Liu are with SenseTime Research.
• Shuo Yang, Yuanjun Xiong, and Wei Xia are with Amazon Web Services.
• Baoying Chen, Peiyu Zhuang, and Sili Li are with Shenzhen Key Laboratory of Media Information Content Security, Shenzhen University.
• Shen Chen, Liujuan Cao, and Rongrong Ji are with Media Analytics and Computing Lab, Xiamen University.
• Taiping Yao, Shouhong Ding, Jilin Li, and Feiyue Huang are with YouTu Lab, Tencent, Shanghai.
• Changlei Lu and Ganchao Tan are with University of Science and Technology of China.

¹ Project page: https://liming-jiang.com/projects/DrF1/DrF1.html.

2. About the Challenge

2.1. Platform

The DeeperForensics Challenge 2020 is hosted on the CodaLab platform² in conjunction with ECCV 2020, the 2nd Workshop on Sensing, Understanding and Synthesizing Humans³.
The online evaluation is conducted using Amazon Web Services (AWS)⁴. First, participants register their teams on the CodaLab challenge website. Then, they are requested to submit their models to the AWS evaluation server (with one 16 GB Tesla V100 GPU per team) to perform the online evaluation on the hidden test set. When the evaluation is done, participants receive the encrypted prediction files through an automatic email. Finally, they submit the result file to the CodaLab challenge website.

² Challenge website: https://competitions.codalab.org/competitions/25228.
³ Workshop website: https://sense-human.github.io.
⁴ Online evaluation website: https://aws.amazon.com.

2.2. Dataset

The DeeperForensics Challenge 2020 employs the DeeperForensics-1.0 dataset [17] that was proposed in CVPR 2020. DeeperForensics-1.0 contains 60,000 videos constituted by a total of 17.6 million frames. The dataset features three appealing properties: good quality, large scale, and high diversity.

To ensure good quality, extensive data collection is conducted. The high-resolution (1920 × 1080) source videos are collected from 100 paid actors with four typical skin tones across 26 countries. Their eight expressions (i.e., neutral, angry, happy, sad, surprise, contempt, disgust, fear) are recorded under nine lighting conditions by seven cameras at different locations. We further ask the actors to perform 53 supplementary expressions defined by 3DMM blendshapes [9] to make the dataset more diverse. Besides, a robust end-to-end face swapping framework, DF-VAE, is developed to generate the fake videos. In addition, seven types of real-world perturbations at five intensity levels are applied to obtain a more challenging benchmark of larger scale and higher diversity. Readers are referred to [17] for details.

An indispensable component of DeeperForensics-1.0 is the hidden test set, which is richer in distribution than the publicly available training set. The hidden test set suggests a better real-world face forgery detection setting: 1) Multiple sources. Fake videos in the wild may be manipulated by different unknown methods; 2) High quality. Threatening fake videos should have high quality to deceive human eyes; 3) Diverse distortions. Different perturbations should be considered. The hidden test set will evolve by including more challenging samples along with the development of Deepfakes technology. The evaluation of the challenge is performed on its current version.

All participants using the DeeperForensics-1.0 dataset should agree to its Terms of Use [7]. They are recommended but not restricted to train their algorithms on DeeperForensics-1.0. The use of any external datasets should be disclosed and must follow the Terms of Use.

2.3. Evaluation Metric

Similar to the Deepfake Detection Challenge (DFDC) [2], the DeeperForensics Challenge 2020 uses the binary cross-entropy loss (BCELoss) to evaluate the performance of face forgery detection models:

$$\mathrm{BCELoss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \cdot \log\left(p(y_i)\right) + (1 - y_i) \cdot \log\left(1 - p(y_i)\right)\right],$$

where N is the number of videos in the hidden test set, y_i denotes the ground-truth label of video i (fake: 1, real: 0), and p(y_i) indicates the predicted probability that video i is fake. A smaller BCELoss score is better and directly contributes to a higher ranking. If two teams obtain the same BCELoss score, the one with less runtime achieves the higher ranking. To avoid an infinite BCELoss from a prediction that is both too confident and wrong, the score is bounded by a threshold value.
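To make the metric concrete, here is a minimal sketch of the bounded video-level BCELoss in Python (assuming NumPy; the exact clipping threshold used by the organizers is not stated in this report, so the 0.01 below is an illustrative choice):

```python
import numpy as np

def bounded_bce(y_true, y_pred, eps=0.01):
    """Challenge-style BCELoss over video-level predictions.

    y_true: ground-truth labels (fake: 1, real: 0).
    y_pred: predicted probabilities that each video is fake.
    eps: illustrative clipping bound; the organizers' threshold is not
         specified in this report.
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip predictions away from 0 and 1 so log() never returns -inf.
    p = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# Example: two fake videos and one real video.
print(bounded_bce([1, 1, 0], [0.9, 0.6, 0.2]))  # ~0.28
```

The clipping step is exactly why the top teams bound their submitted scores to [0.01, 0.99], as described in Section 3: a single confident wrong answer would otherwise dominate the mean.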
2.4. Timeline

The DeeperForensics Challenge 2020 lasted for nine weeks: eight weeks for the development phase and one week for the final test phase.

The challenge officially started at the ECCV 2020 SenseHuman Workshop on August 28, 2020, and it immediately entered the development phase. In the development phase, the evaluation is performed on the test-dev hidden test set, which contains 1,000 videos representing general circumstances of the full hidden test set. The test-dev hidden test set is used to maintain a public leaderboard. Participants can conduct four online evaluations (each with a 2.5-hour runtime limit) per week.

The final test phase started on October 24, 2020. The evaluation is conducted on the test-final hidden test set, containing 3,000 videos (also including the test-dev videos) with a similar distribution as test-dev, for the final competition results. A total of two online evaluations (each with a 7.5-hour runtime limit) are allowed. The final test phase ended on October 31, 2020.

Finally, the challenge results were announced in December 2020. In total, 115 participants registered for the competition, and 25 teams made valid submissions.

3. Results and Solutions

Table 1. Final results of the top-5 teams in the DeeperForensics Challenge 2020. The runtime is shown in seconds.

Ranking | TeamName     | UserName     | BCELoss↓ | Runtime↓
1       | Forensics    | BokingChen   | 0.2674   | 7690
2       | RealFace     | Iverson      | 0.3699   | 11368
3       | VISG         | zz110        | 0.4060   | 11012
4       | jiashangplus | jiashangplus | 0.4064   | 16389
5       | Miao         | miaotao      | 0.4132   | 19823

Among the 25 teams that made valid submissions, many participants achieve promising results. We show the final

results of the top-5 teams in Table 1. In the following sections, we will present the winning solutions of the top-3 entries.

3.1. Solution of First Place

Team members: Baoying Chen, Peiyu Zhuang, Sili Li

Figure 1. The framework of the first-place solution.

As shown in Figure 1, the method designed by the champion team contains three stages, namely Face Extraction, Classification, and Output.

Face Extraction. They first extract 15 frames from each video at equal intervals using VideoCapture of OpenCV. Then, they use the face detector MTCNN [30] to detect the face region in each frame and expand the region by 1.2 times to crop the face image.
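As a rough illustration of this extraction stage, the sketch below samples frames with OpenCV and crops expanded face boxes. It assumes the facenet-pytorch implementation of MTCNN, which is one common choice and not necessarily the one the team used:

```python
import cv2
import numpy as np
from facenet_pytorch import MTCNN  # assumption: one common MTCNN package

detector = MTCNN(select_largest=True)

def extract_faces(video_path, num_frames=15, expand=1.2):
    """Sample frames at equal intervals and crop the expanded face region."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    faces = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        boxes, _ = detector.detect(rgb)
        if boxes is None:
            continue
        x1, y1, x2, y2 = boxes[0]
        # Expand the detected box by 1.2x around its center before cropping.
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        w, h = (x2 - x1) * expand, (y2 - y1) * expand
        x1, x2 = int(max(cx - w / 2, 0)), int(min(cx + w / 2, rgb.shape[1]))
        y1, y2 = int(max(cy - h / 2, 0)), int(min(cy + h / 2, rgb.shape[0]))
        faces.append(rgb[y1:y2, x1:x2])
    cap.release()
    return faces
```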
Classification. They define the predicted probability that a face is fake as the face score. They use EfficientNet [25] as the backbone, which was proven effective in the Deepfake Detection Challenge (DFDC) [2]. The results of three models (EfficientNet-B0, EfficientNet-B1, and EfficientNet-B2) are ensembled for each face.

Output. The final output score of a video is the predicted probability that the video is fake, calculated as the average of the face scores over the extracted frames.
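The Classification and Output stages reduce to two pooling steps. The following simplified sketch (assuming the timm library for the EfficientNet backbones, each re-headed with a single fake-vs-real logit; the team's trained weights and preprocessing are omitted) ensembles the three models per face and then averages over frames:

```python
import timm
import torch

# Three image-level backbones, each with a single "fake" logit.
# Simple mean ensemble; the report does not specify the ensemble weighting.
models = [
    timm.create_model(name, pretrained=True, num_classes=1).eval()
    for name in ("efficientnet_b0", "efficientnet_b1", "efficientnet_b2")
]

@torch.no_grad()
def video_score(face_batch):
    """face_batch: tensor of shape (num_faces, 3, H, W), one cropped face per row.

    Returns the predicted probability that the video is fake:
    per-face ensemble average, then average over all extracted faces.
    """
    face_scores = torch.stack(
        [torch.sigmoid(m(face_batch)).squeeze(1) for m in models]
    ).mean(dim=0)                     # ensemble average per face
    return face_scores.mean().item()  # average over frames = video score
```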
Implementation Details. The team employs EfficientNet pre-trained on ImageNet as the backbone. They select EfficientNet-B0, EfficientNet-B1, and EfficientNet-B2 for the model ensemble. In addition to DeeperForensics-1.0, they use several other public datasets, i.e., UADFV [28], Deep Fake Detection [8], FaceForensics++ [24], Celeb-DF [29], and DFDC Preview [15]. They balance the class samples through down-sampling. The code of the champion solution has been made publicly available⁵.

⁵ https://github.com/beibuwandeluori/DeeperForensicsChallengeSolution.

• Training: Inspired by the DFDC winning solution, appropriate data augmentation could contribute to better results. For data augmentation, the champion team uses the perturbation implementation in DeeperForensics-1.0 [6] during training. They apply only the image-level distortions: color saturation change (CS), color contrast change (CC), local block-wise distortion (BW), white Gaussian noise in color components (GNC), Gaussian blur (GB), and JPEG compression (JPEG). They randomly mix these distortions, each with a probability of 0.2. Besides, they also try other data augmentations [3], but the performance improvement is slim. The images are resized to 224 × 224. The batch size is 128, and the total number of training epochs is 50. They use the AdamW optimizer [22] with an initial learning rate of 0.001. Label smoothing is applied with a smoothing factor of 0.05.
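A loose approximation of this augmentation scheme can be written with albumentations; the stand-in transforms below only approximate the official DeeperForensics-1.0 perturbations linked in [6], and the local block-wise distortion (BW) is omitted for brevity:

```python
import albumentations as A

# Approximate stand-ins for the image-level distortions (CS, CC, GNC, GB,
# JPEG); each fires independently with p = 0.2, loosely mimicking the
# team's random mixing. The official perturbation code is linked in [6].
train_aug = A.Compose([
    A.HueSaturationValue(p=0.2),           # ~ color saturation change (CS)
    A.RandomBrightnessContrast(p=0.2),     # ~ color contrast change (CC)
    A.GaussNoise(p=0.2),                   # ~ Gaussian noise in color (GNC)
    A.GaussianBlur(p=0.2),                 # ~ Gaussian blur (GB)
    A.ImageCompression(quality_lower=30, quality_upper=90, p=0.2),  # ~ JPEG
    A.Resize(224, 224),
])

# Usage: augmented = train_aug(image=face_rgb)["image"]
```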

• Testing: The testing pipeline follows the three stages in Figure 1. They clip the prediction score of each video to the range [0.01, 0.99] to reduce the large loss caused by prediction errors. In addition to the best BCELoss score, their fastest execution speed may be attributed to the use of the faster face extractor MTCNN and the ensemble of three image-level models with fewer parameters.

3.2. Solution of Second Place

Team members: Shen Chen, Taiping Yao, Shouhong Ding, Jilin Li, Feiyue Huang, Liujuan Cao, Rongrong Ji

Figure 2. The framework of the second-place solution. (From the figure: RetinaFace face regions feed an image-level model, an EfficientNet-B5 feature extractor and classifier whose per-frame probabilities are reduced by a median (prob1), and a video-level model, an EfficientNet-B5 feature extractor with an attention module over the video sequence (prob2); a weighted ensemble of the two yields the prediction.)

A face-manipulated video contains two types of forgery traces, i.e., image-level artifacts and video-level artifacts. The former refers to artifacts such as blending boundaries and abnormal textures within an image, while the latter is the face-jitter problem between video frames. Most previous works focused only on artifacts in a specific modality and lacked consideration of both. The second-place team proposes to use an attention mechanism to fuse the temporal information in videos, and further combines it with an image model to achieve better results.

The overall framework of their method is shown in Figure 2. First, they use RetinaFace [13] with a 20% margin to detect faces in video frames. Then, the face sequence is fed into an image-based model and a video-based model, where the backbones are both EfficientNet-B5 [25] with NoisyStudent [27] pre-trained weights. The image-based model predicts frame by frame and takes the median of the probabilities as its prediction. The video-based model takes the entire face sequence as input and adopts an attention module to fuse the temporal information between frames. Finally, the per-video prediction score is obtained by averaging the probabilities predicted by the two models.
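A condensed sketch of the two-branch inference is given below, assuming timm's NoisyStudent EfficientNet-B5 ("tf_efficientnet_b5_ns") as the shared backbone; the TemporalAttention module and head names are illustrative, not the team's exact code:

```python
import timm
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Illustrative attention pooling: softmax weights over per-frame features."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.Tanh(),
                                   nn.Linear(dim // 4, 1))
        self.head = nn.Linear(dim, 1)

    def forward(self, feats):                        # feats: (T, dim)
        w = torch.softmax(self.score(feats), dim=0)  # attention over frames
        return self.head((w * feats).sum(dim=0))     # fused video logit

backbone = timm.create_model("tf_efficientnet_b5_ns", pretrained=True,
                             num_classes=0)          # pooled features only
image_head = nn.Linear(backbone.num_features, 1)
video_head = TemporalAttention(backbone.num_features)

@torch.no_grad()
def predict(face_seq):                               # face_seq: (T, 3, H, W)
    feats = backbone(face_seq)                       # (T, dim) frame features
    prob1 = torch.sigmoid(image_head(feats)).median()  # image branch: median
    prob2 = torch.sigmoid(video_head(feats))           # video branch: attention
    return ((prob1 + prob2) / 2).item()              # average the two branches
```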
Implementation Details. The team implements the proposed method in PyTorch. All the models are trained on 8 NVIDIA Tesla V100 GPUs. In addition to the DeeperForensics-1.0 dataset, they use three external datasets, i.e., FaceForensics++ [24], Celeb-DF [29], and the Diverse Fake Face Dataset [12]. They use the official splits provided by these datasets to construct the training, validation, and test sets. They balance the positive and negative samples through down-sampling.

• Training: The second-place team uses the following data augmentations: RandAugment [11], patch Gaussian [21], Gaussian blur, image compression, random flip, random crop, and random brightness/contrast. They also employ the perturbation implementation in DeeperForensics-1.0 [6]. For the image-based model, they train a classifier based on EfficientNet-B5 [25], using binary cross-entropy as the loss function. They adopt a two-stage training strategy for the video-based model. In stage 1, they train an image-based classifier based on EfficientNet-B5. In stage 2, they fix the model parameters trained in stage 1 to serve as a face feature extractor, and introduce an attention module to learn temporal information via nonlinear transformations and softmax operations. The input of the network in stage 2 is the face sequence (i.e., 5 frames per video), and only the attention module and classification layers are trained. The binary cross-entropy loss is adopted as the loss function. The input size is scaled to 320 × 320. The Adam optimizer [19] is used with a learning rate of 0.0002, β1 = 0.9, β2 = 0.999, and a weight decay of 0.00001. The batch size is 32. The total number of training epochs is set to 20, and the learning rate is halved every 5 epochs.
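In code, stage 2 amounts to freezing the stage-1 backbone and optimizing only the attention module and classifier. A minimal sketch with the paper's stated hyperparameters, reusing the illustrative backbone and video_head from the previous snippet (the scheduler would be stepped once per epoch):

```python
import torch

# Stage 2: freeze the stage-1 backbone; train only attention + classifier.
for p in backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(video_head.parameters(), lr=2e-4,
                             betas=(0.9, 0.999), weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
criterion = torch.nn.BCEWithLogitsLoss()

def train_step(face_seq, label):
    """face_seq: (5, 3, 320, 320) face sequence; label: 1.0 fake, 0.0 real."""
    with torch.no_grad():
        feats = backbone(face_seq)         # frozen face feature extractor
    logit = video_head(feats)              # attention fusion + classifier
    loss = criterion(logit, torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```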
• Testing: They sample 10 frames at equal intervals from each video and detect faces with RetinaFace [13], as in the training phase. Then, the face images are resized to 320 × 320. Test-time augmentation (TTA) (e.g., flipping) is applied to obtain 20 images (10 original and 10 flipped), which are fed into the network to get the prediction score. They clip the prediction score of each video to [0.01, 0.99] to avoid excessive losses on extreme error samples.
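Flip-based TTA is simple to express: score the original and mirrored crops, then pool. A small sketch, assuming a model that maps a batch of face crops to fake logits:

```python
import torch

@torch.no_grad()
def predict_with_tta(model, faces):
    """faces: (10, 3, 320, 320) face crops sampled at equal intervals.

    Horizontal-flip TTA: score 20 views (10 original + 10 flipped),
    then average and clip to [0.01, 0.99] as the final video score.
    """
    views = torch.cat([faces, torch.flip(faces, dims=[3])])  # flip width axis
    probs = torch.sigmoid(model(views)).flatten()            # 20 scores
    return float(probs.mean().clamp(0.01, 0.99))
```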
3.3. Solution of Third Place

Team members: Changlei Lu, Ganchao Tan

Similar to the second-place entry, the third-place team also exploits the poor temporal consistency of existing face manipulation techniques. To this end, they propose to use a 3D convolutional neural network (3DCNN) to capture spatial-temporal features for forgery detection. The framework of their method is shown in Figure 3.

Figure 3. The framework of the third-place solution.

Implementation Details. First, the team crops faces in the video frames using the MTCNN [30] face detector. They combine all the cropped face images into a face video clip. Each video clip is then resized to 64 × 224 × 224 or 64 × 112 × 112. Various data augmentations are applied, including Gaussian blur, white Gaussian noise in color components, random crop, random flip, etc. Then, they use the processed video clips as input to train a 3D convolutional neural network (3DCNN) with the cross-entropy loss. They examine three kinds of networks: I3D [10], 3D ResNet [16], and R(2+1)D [26]. These models are pre-trained on action recognition datasets, e.g., Kinetics [18]. In addition to DeeperForensics-1.0, they use three external public face manipulation datasets, i.e., the DFDC dataset [14], Deep Fake Detection [8], and FaceForensics++ [24].
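As one concrete instance of this recipe, the sketch below fine-tunes a Kinetics-pretrained R(2+1)D from torchvision on 64-frame face clips with the cross-entropy loss; the optimizer settings are illustrative, since the report does not state them:

```python
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

# Kinetics-pretrained R(2+1)D, re-headed for binary real/fake classification.
model = r2plus1d_18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # illustrative
criterion = nn.CrossEntropyLoss()

def train_step(clips, labels):
    """clips: (B, 3, 64, 112, 112) face clips; labels: (B,) with 1 = fake."""
    logits = model(clips)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example forward pass with a random clip (batch of 2).
with torch.no_grad():
    print(model(torch.randn(2, 3, 64, 112, 112)).shape)  # torch.Size([2, 2])
```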
4. Discussion

The methods summarized above have considered different aspects of developing a robust face forgery detection model. We are glad to find that the winning solutions achieve promising performance in the DeeperForensics Challenge 2020. In summary, these methods suggest three key points that could improve real-world face forgery detection. 1) Strong backbone. The backbone selection of forgery detection models is important; the high-performance winning solutions are based on the state-of-the-art EfficientNet. 2) Diverse augmentations. Applying appropriate data augmentations may better simulate real-world scenarios and boost model performance. 3) Temporal information. Since the primary detection targets are fake videos, temporal information can be a critical clue to distinguish the real from the fake.

Despite the promising results, we believe that there is still much room for improvement in the real-world face forgery detection task. 1) More suitable and diverse data augmentations may contribute to a better simulation of the real-world data distribution. 2) Developing a robust detection method that can cope with unseen manipulation methods and distortions is a critical problem. At this stage, we observe that model training is data-dependent. Although data augmentations can help improve performance to a certain extent, the generalization ability of most forgery detection models is still poor. 3) Different artifacts in Deepfakes videos (e.g., checkerboard artifacts, fusion boundary artifacts) remain rarely explored.

Acknowledgments. We thank Amazon Web Services for sponsoring the prize of this challenge. The organization of this challenge is also supported by A*STAR through the Industry Alignment Fund - Industry Collaboration Projects Grant.

References

[1] DeepFaceLab. https://github.com/iperov/DeepFaceLab. Accessed: 2019-08-20.
[2] Deepfake Detection Challenge. https://www.kaggle.com/c/deepfake-detection-challenge. Accessed: 2020-02-15.
[3] Deepfake detection (DFDC) solution by selimsef. https://github.com/selimsef/dfdc_deepfake_challenge. Accessed: 2020-10-30.
[4] DeepFakes. https://github.com/deepfakes/faceswap. Accessed: 2019-08-16.
[5] faceswap-GAN. https://github.com/shaoanlu/faceswap-GAN. Accessed: 2019-08-16.
[6] Perturbation implementation in DeeperForensics-1.0. https://github.com/EndlessSora/DeeperForensics-1.0/tree/master/perturbation. Accessed: 2020-10-30.
[7] Terms of use: DeeperForensics-1.0 dataset. https://github.com/EndlessSora/DeeperForensics-1.0/blob/master/dataset/Terms_of_Use.pdf. Accessed: 2020-05-21.
[8] Google AI Blog. Contributing data to deepfake detection research. https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html. Accessed: 2019-09-25.
[9] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. FaceWarehouse: A 3D facial expression database for visual computing. TVCG, 20:413–425, 2013.
[10] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[11] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. RandAugment: Practical automated data augmentation with a reduced search space. In CVPRW, 2020.
[12] Hao Dang, Feng Liu, Joel Stehouwer, Xiaoming Liu, and Anil K. Jain. On the detection of digital face manipulation. In CVPR, 2020.
[13] Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. RetinaFace: Single-shot multi-level face localisation in the wild. In CVPR, 2020.
[14] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The DeepFake Detection Challenge dataset. arXiv preprint, arXiv:2006.07397, 2020.
[15] Brian Dolhansky, Russ Howes, Ben Pflaum, Nicole Baram, and Cristian Canton Ferrer. The Deepfake Detection Challenge (DFDC) preview dataset. arXiv preprint, arXiv:1910.08854, 2019.
[16] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Learning spatio-temporal features with 3D residual networks for action recognition. In ICCVW, 2017.
[17] Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection. In CVPR, 2020.
[18] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint, arXiv:1705.06950, 2017.
[19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, arXiv:1412.6980, 2014.
[20] Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. FaceShifter: Towards high fidelity and occlusion aware face swapping. In CVPR, 2020.
[21] Raphael Gontijo Lopes, Dong Yin, Ben Poole, Justin Gilmer, and Ekin D. Cubuk. Improving robustness without sacrificing accuracy with patch Gaussian augmentation. arXiv preprint, arXiv:1906.02611, 2019.
[22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[23] Ivan Petrov, Daiheng Gao, Nikolay Chervoniy, Kunlin Liu, Sugasa Marangonda, Chris Umé, Jian Jiang, Luis RP, Sheng Zhang, Pingyu Wu, et al. DeepFaceLab: A simple, flexible and extensible face swapping framework. arXiv preprint, arXiv:2005.05535, 2020.
[24] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics++: Learning to detect manipulated facial images. In ICCV, 2019.
[25] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[26] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
[27] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with Noisy Student improves ImageNet classification. In CVPR, 2020.
[28] Xin Yang, Yuezun Li, and Siwei Lyu. Exposing deep fakes using inconsistent head poses. In ICASSP, 2019.
[29] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A large-scale challenging dataset for DeepFake forensics. In CVPR, 2020.
[30] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. SPL, 23:1499–1503, 2016.
