
ADD: Frequency Attention and Multi-View based Knowledge Distillation to Detect Low-Quality Compressed Deepfake Images

Binh M. Le,¹ Simon S. Woo²
¹ Department of Computer Science and Engineering, Sungkyunkwan University, South Korea
² Department of Applied Data Science, Sungkyunkwan University, South Korea
[email protected], [email protected]
arXiv:2112.03553v1 [cs.CV] 7 Dec 2021

Abstract

Despite significant advancements of deep learning-based forgery detectors for distinguishing manipulated deepfake images, most detection approaches suffer from moderate to significant performance degradation with low-quality compressed deepfake images. Because of the limited information in low-quality images, detecting low-quality deepfakes remains an important challenge. In this work, we apply frequency domain learning and optimal transport theory in knowledge distillation (KD) to specifically improve the detection of low-quality compressed deepfake images. We explore the transfer learning capability in KD to enable a student network to learn discriminative features from low-quality images effectively. In particular, we propose the Attention-based Deepfake detection Distiller (ADD), which consists of two novel distillations: 1) frequency attention distillation that effectively retrieves the removed high-frequency components in the student network, and 2) multi-view attention distillation that creates multiple attention vectors by slicing the teacher's and student's tensors under different views to transfer the teacher tensor's distribution to the student more efficiently. Our extensive experimental results demonstrate that our approach outperforms state-of-the-art baselines in detecting low-quality compressed deepfake images.

Introduction

Recently, facial manipulation techniques using deep learning methods such as deepfakes have drawn considerable attention (Rossler et al. 2019; Pidhorskyi, Adjeroh, and Doretto 2020; Richardson et al. 2020; Nitzan et al. 2020). Moreover, deepfakes have become more realistic and sophisticated, making them difficult to distinguish by human eyes (Siarohin et al. 2020), and it has become much easier to generate such realistic deepfakes than before. Hence, such advancements and convenience enable even novices to easily create highly realistic fake faces for simple entertainment. However, these fake images raise serious security, privacy, and social concerns, as they can be abused for malicious purposes, such as impersonation (Catherine 2019), revenge pornography (Cole 2018), and fake news propagation (Quandt et al. 2019).

To address such problems arising from deepfakes, there have been immense research efforts in developing effective deepfake detectors (Dzanic, Shah, and Witherden 2019; Rossler et al. 2019; Wang et al. 2019; Li and Lyu 2018; Khayatkhoei and Elgammal 2020; Zhang, Karaman, and Chang 2019). Most approaches utilize deep learning and generally perform well if there is a large amount of high-resolution training data. However, the performance of these approaches drops dramatically (by up to 18% (Dzanic, Shah, and Witherden 2019; Rossler et al. 2019)) for compressed low-resolution images due to the lack of available pixel information to sufficiently distinguish fake images from real ones. In other words, because of the compression, subtle differences and artifacts such as sharp edges in hairs and lips that could be leveraged for differentiating deepfakes are also removed. Therefore, there still remains an important challenge to effectively detect low-quality compressed deepfakes, which frequently occur on social media and mobile platforms in bandwidth-challenged and storage-limited environments.

In this work, we propose the Attention-based Deepfake detection Distiller (ADD). Our primary goal is to detect low-quality (LQ) deepfakes, which are less explored in most previous studies but play a pivotal role in real-world scenarios. First, we assume that high-quality (HQ) images are readily available, similar to the settings in other studies (Rossler et al. 2019; Wang et al. 2019; Li and Lyu 2018; Khayatkhoei and Elgammal 2020; Zhang, Karaman, and Chang 2019; Dzanic, Shah, and Witherden 2019). And, we use knowledge distillation (KD) as an overarching backbone architecture to detect low-quality deepfakes. While most existing knowledge distillation methods aim to reduce the student size for model compression applications or improve the performance of lightweight deep learning models (Hinton, Vinyals, and Dean 2015; Tian, Krishnan, and Isola 2019; Huang and Wang 2017; Passalis and Tefas 2018), we hypothesize that a student can learn the lost distinctive features of low-quality compressed images from a teacher that is pre-trained on high-quality images for deepfake detection. We first lay out the following two major challenges associated with detecting LQ compressed deepfakes, and provide the intuitions of our approaches to overcome these issues:

1) Loss of high-frequency information. As discussed, while lossy image compression algorithms make changes visually unnoticeable to humans, they can significantly reduce DNNs' deepfake detection capability by removing the fine-grained artifacts in high-frequency components. To investigate this phenomenon more concretely, we revisit the Frequency Principle (F-Principle) (Xu, Zhang, and Xiao 2019), which describes the learning behavior of general DNNs in the frequency domain. The F-Principle states that general DNNs tend to learn dominant low-frequency components first and then capture high-frequency components during the training process (Xu, Zhang, and Xiao 2019). To illustrate this issue, Fig. 1 indicates that most of the information lost during compression is from high-frequency components. As a consequence, general DNNs shift their attention in later training epochs to high-frequency components, which now represent intrinsic characteristics of objects in each individual image rather than discriminative features. This learning process increases the variance of DNNs' decision boundaries and induces overfitting, thereby degrading the detection performance. A trivial approach to tackle the overfitting is applying the early stopping method (Morgan and Bourlard 1989); however, fine-grained artifacts of deepfakes can then be omitted, especially when they are highly compressed. To overcome this issue, we propose the novel frequency attention distiller, which guides the student to effectively recover the removed high-frequency components in low-quality compressed images from the teacher during training.

Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Degradation of high-frequency components after compression of real and fake videos. Sample fake face frames are taken from the NeuralTextures dataset in FaceForensics++ (Rossler et al. 2019). Left column: Sample faces from raw videos. Middle column: Sample faces from c40-compressed videos. Right column: Normalized spectrum differences in the frequency domain after applying the Discrete Fourier Transform (DFT) to raw and compressed images. The concentrated differences at the center are the highest frequency components.
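As a rough illustration of the analysis behind Fig. 1, the normalized spectrum difference between a raw face crop and its compressed counterpart can be sketched as follows. This is a minimal NumPy sketch under our own assumptions (log-magnitude spectra, min-max scaling, illustrative function names); it is not the authors' exact analysis code.

```python
import numpy as np

def normalized_spectrum(img: np.ndarray) -> np.ndarray:
    """Log-magnitude 2-D DFT spectrum of a grayscale image, scaled to [0, 1].

    No frequency shift is applied, so the highest frequencies sit at the
    center of the array, matching the layout described for Fig. 1.
    """
    spec = np.fft.fft2(img.astype(np.float64))
    mag = np.log1p(np.abs(spec))
    return (mag - mag.min()) / (mag.max() - mag.min() + 1e-12)

def spectrum_difference(raw_img: np.ndarray, compressed_img: np.ndarray) -> np.ndarray:
    """Element-wise spectrum gap between a raw frame and its compressed version."""
    return np.abs(normalized_spectrum(raw_img) - normalized_spectrum(compressed_img))

# Toy usage with random stand-ins; in practice these would be aligned face
# crops from a raw video and its H.264 c40-compressed counterpart.
raw = np.random.rand(128, 128)
compressed = raw + 0.05 * np.random.rand(128, 128)  # crude stand-in for compression loss
diff = spectrum_difference(raw, compressed)
print(diff.shape, diff.mean())
```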
2) Loss of correlated information. In addition, under heavy compression, crucial features and pixel correlations that not only capture the intra-class variations but also characterize the inter-class differences are degraded. In particular, these correlations are essential for CNNs' ability to learn features at the local filters, but they are significantly removed in the compressed input images.

Recent studies (Wang et al. 2018; Hu et al. 2018) have empirically demonstrated that training DNNs that are able to capture this correlated information can successfully improve their performance. Therefore, in this work, we focus on recovering the lost correlations by proposing a novel multi-view attention, inspired by the work of Bonneel et al. (Bonneel et al. 2015), and contrastive distillation (Tian, Krishnan, and Isola 2019). The element-wise discrepancy between the teacher's and student's tensors ignores the relationship within local regions of pixels, while channel-wise attention only considers a single dimension of the backbone features. In contrast, our proposed method ensures that our model attends to output tensors from multiple views (slices) using the Sliced Wasserstein distance (SWD) (Bonneel et al. 2015). Therefore, our multi-view attention distiller guides the student to mimic its teacher more efficiently through a geometrically meaningful metric based on SWD. In summary, we present our overall Attention-based Deepfake detection Distiller (ADD), which consists of two novel distillations (see Fig. 2): 1) frequency attention distillation that effectively retrieves the removed high-frequency components in the student network, and 2) multi-view attention distillation that creates multiple attention vectors by slicing the teacher's and student's tensors under different views to transfer the teacher tensor's distribution to the student more efficiently.

Our contributions are summarized as follows:
• We propose the novel frequency attention distillation, which effectively enables the student to retrieve high-frequency information from the teacher.
• We develop the novel multi-view attention distillation with contrastive distillation for the student to efficiently mimic the teacher while maintaining pixel correlations from the teacher to the student through SWD.
• We demonstrate that our approach outperforms well-known baselines, including attention-based distillation methods, on different low-quality compressed deepfake datasets.

Related Work

Deepfake detection. Deepfake detection has recently drawn significant attention, as it is related to protecting personal privacy. Therefore, there has been a large number of research works to identify such deepfakes (Rossler et al. 2019; Li et al. 2020, 2019; Jeon et al. 2020; Rahmouni et al. 2017; Wang et al. 2019; Li and Lyu 2018). Li et al. (Li et al. 2020) tried to expose the blending boundaries of generated faces and showed the effectiveness of their method when applied to unseen face manipulation techniques. Self-training with L2-starting point regularization was introduced by Jeon et al. (Jeon et al. 2020) to detect newly generated images. However, the majority of prior works are limited to high-quality (HQ) synthetic images, which are rather straightforward to detect by constructing binary classifiers with a large amount of HQ images.

Knowledge distillation (KD). Firstly introduced by Hinton et al. (Hinton, Vinyals, and Dean 2015), KD is a training technique that transfers acquired knowledge from a pre-trained teacher model to a student model for model compression applications.
Figure 2: Illustration of our proposed Attention-based Deepfake detection Distiller (ADD) distillation framework. First, a low-
quality compressed image and its corresponding raw image are used as an input to the student and pre-trained teacher model,
respectively. The student model is trained with two novel distillers: 1) frequency attention distiller and 2) multi-view attention
distiller. The frequency attention distiller creates a weighted loss, which focuses more on the degraded high-frequency com-
ponents. The multi-view attention distiller slices student’s and teacher’s tensors by different random views to spawn multiple
attention vectors. Green arrows indicate the flows of gradient descent updates to train the student's parameters.

However, many existing works (Yim et al. 2017; Tian, Krishnan, and Isola 2019; Huang and Wang 2017; Passalis and Tefas 2018) applied different types of distillation methods to conventional datasets, e.g., ImageNet, PASCAL VOC 2007, and CIFAR100, but not to deepfake datasets. On the other hand, Zhu et al. (Zhu et al. 2019) used FitNets (Romero et al. 2014) to train a student model that is able to detect low-resolution images, which is similar to our method in that the teacher and the student learn to detect high and low-quality images, respectively. However, their approach coerces the student to mimic the penultimate layer's distribution from the teacher, which does not possess rich features at the lower layers.

In order to encourage the student model to mimic the teacher more effectively, Zagoruyko and Komodakis (Zagoruyko and Komodakis 2016) proposed activation-based attention transfer, similar to FitNets, but their approach achieves better performance by creating spatial attention maps. Our multi-view attention method inherits from this approach but carries more generalization ability by not only exploiting spatial attention (in the width and height dimensions), but also introducing attention features from random dimensions using the Radon transform (Helgason 2010). Thus, our approach pushes the student's backbone features closer to the teacher's.

In addition, inspired by the InfoNCE loss (Oord, Li, and Vinyals 2018), Tian et al. (Tian, Krishnan, and Isola 2019) proposed contrastive representation distillation (CRD), which formulates a contrastive learning framework and motivates the student network to drive samples from positive pairs closer, and push away those from negative pairs. Although CRD achieves superior performance to those of previous approaches, it requires a large memory buffer to save the embedding features of each sample. This is restrictive when the training size and embedding space become larger. Instead, we directly sample positive and negative images in the same mini-batch and apply the contrastive loss to embedded features, similar to the Siamese network (Bromley et al. 1994).

Frequency domain learning. In the field of media forensics, several approaches (Jiang et al. 2020; Khayatkhoei and Elgammal 2020; Dzanic, Shah, and Witherden 2019) showed that discrepancies in the high-frequency Fourier spectrum are effective clues to distinguish CNN-generated images. Frank et al. (Frank et al. 2020) and Zhang et al. (Zhang, Karaman, and Chang 2019) utilized the checkerboard artifacts (Odena, Dumoulin, and Olah 2016) of the frequency spectrum caused by up-sampling components of generative adversarial networks (GANs) as effective features in detecting GAN-based fake images. Nevertheless, their detection performance is greatly degraded when the training synthesized images are compressed, becoming low-quality. Qian et al. (Qian et al. 2020) proposed an effective frequency-based forgery detection method, named F3Net, which decomposes an input image into many frequency components, collaborating with local frequency statistics on a two-stream network. F3Net, however, doubles the number of parameters of its backbone.

Wasserstein distance. Induced by optimal transport theory, the Wasserstein distance (WD) (Villani 2008) and its variations have been explored in training DNNs to learn a particular distribution, thanks to Wasserstein's underlying geometrically meaningful distance property. In fact, WD-based applications cover a wide range of fields, such as improving generative models (Arjovsky, Chintala, and Bottou 2017; Deshpande, Zhang, and Schwing 2018), learning the distribution of the latent space in autoencoders (Kolouri et al. 2018; Xu et al. 2020), and matching features in domain adaptation tasks (Lee et al. 2019). In this work, we utilize the Wasserstein metric to provide the student geometrically meaningful guidance to efficiently mimic the teacher's tensor distribution. Thus, the student can learn the true tensor distribution, even though its input features are partially degraded through high compression.

Figure 3: Illustration of our frequency attention distiller. The F function is applied to each channel of the input tensor. Distance metric d calculates the discrepancy of the corresponding coefficients of each frequency component from the teacher and the student. Finally, the attention map is obtained by averaging the element-wise differences across channels.

Our Approach

Our Attention-based Deepfake detection Distiller (ADD) consists of the following two novel distillations (see Fig. 2): 1) frequency attention distillation and 2) multi-view attention distillation.
Frequency Attention Distillation

Let fS and fT be the student and the pre-trained teacher network. By forwarding a low-quality compressed input image and its corresponding raw image through fS and fT, respectively, we obtain features AS and AT ∈ R^{C×W×H} from their backbone networks, which have C channels, width W, and height H. To create frequency representations, the Discrete Fourier Transform (DFT) F: R^{C×W×H} → C^{C×W×H} is applied to each channel as follows:

$$\mathcal{F}_{A_{S/T}}(c, u, v) = \sum_{x=1}^{W} \sum_{y=1}^{H} A_{S/T}(c, x, y) \cdot e^{-i 2\pi \left( \frac{ux}{W} + \frac{vy}{H} \right)}, \qquad (1)$$

where c, x and y denote the c-th, x-th and y-th slice in the channel, width and height dimension of AS and AT, respectively. Here, for convenience, we use the notation F_{A_{S/T}} to denote that the function is independently applied to both the student's and teacher's backbone features. Then, the value at (u, v) on each single feature map F_{A_{S/T}}(c, :, :) indicates the coefficient of a basic frequency component. The difference between a pair of corresponding coefficients from the teacher and the student represents the "absence" of that student's frequency component. Next, let d: C² → R⁺ be a metric that assesses the distance between two input complex numbers and supports stochastic gradient descent. Then, the frequency loss between the teacher and student can be defined as follows:

$$\mathcal{L}_{FR} = \sum_{c=1}^{C} \sum_{u=1}^{W} \sum_{v=1}^{H} w(u, v) \cdot d\big(\mathcal{F}_{A_S}(c, u, v), \mathcal{F}_{A_T}(c, u, v)\big), \qquad (2)$$

where w(u, v) is an attention weight at (u, v). In this work, we utilize the exponential of the difference across channels between the teacher and student as the weight in the following way:

$$w(u, v) = \exp\Big( \gamma_{FR} \cdot \frac{1}{C} \sum_{c=1}^{C} d\big(\mathcal{F}_{A_S}(c, u, v), \mathcal{F}_{A_T}(c, u, v)\big) \Big), \qquad (3)$$

where γ_FR is a positive hyper-parameter that governs the exponential growth of the cumulative loss as the student's removed frequency content increases. This design of attention weights ensures that the model focuses more on the lost high-frequency components, and makes Eq. 2 partly similar to the focal loss (Lin et al. 2017). Figure 3 visually illustrates our frequency loss.

Multi-view Attention Distillation

Sliced Wasserstein distance. The p-Wasserstein distance between two probability measures µ and ν (Villani 2008) with their corresponding probability density functions P_µ and P_ν in a probability space (Ω, P(Ω)), Ω ⊂ R^d, is defined as follows:

$$W_p(P_\mu, P_\nu) = \Big( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\Omega \times \Omega} \psi(x, y)^p \, d\pi(x, y) \Big)^{1/p}, \qquad (4)$$

where Π(µ, ν) is the set of all transportation plans π that have the marginal densities P_µ and P_ν, respectively, and ψ: Ω × Ω → R⁺ is a transportation cost function. Equation 4 searches for an optimal transportation plan between µ and ν, which is also known as the Kantorovitch formulation (Kantorovitch 1958). In the case of a one-dimensional probability space, i.e., Ω ⊂ R, the closed-form solution of the p-Wasserstein distance is:

$$W_p(P_\mu, P_\nu) = \Big( \int_0^1 \psi\big(F_\mu^{-1}(\kappa), F_\nu^{-1}(\kappa)\big)^p \, d\kappa \Big)^{1/p}, \qquad (5)$$

where F_µ and F_ν are the cumulative distribution functions of P_µ and P_ν, respectively.
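The one-dimensional closed form in Eq. 5 is what makes the sliced variant below cheap to evaluate: for two empirical measures with equally weighted samples, the inverse CDFs reduce to sorting. A minimal sketch (assuming PyTorch and, as an illustrative choice, p = 2 with ψ(x, y) = |x − y|):

```python
import torch

def wasserstein_1d(u: torch.Tensor, v: torch.Tensor, p: int = 2) -> torch.Tensor:
    """p-Wasserstein distance between two 1-D empirical measures (Eq. 5).

    For equally weighted samples, the inverse CDFs are realized by sorting,
    so the optimal plan matches the i-th smallest of u with the i-th smallest
    of v, giving an O(N log N) evaluation overall.
    """
    u_sorted, _ = torch.sort(u)
    v_sorted, _ = torch.sort(v)
    return (u_sorted - v_sorted).abs().pow(p).mean().pow(1.0 / p)

# Toy usage: two 1-D projections with 1,000 samples each.
x = torch.randn(1000)
y = 0.5 * torch.randn(1000) + 1.0
print(wasserstein_1d(x, y).item())
```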
A variation of the Wasserstein distance, inspired by the above closed-form solution, is the Sliced Wasserstein distance (SWD), which deploys multiple projections from a high-dimensional distribution onto various one-dimensional marginal distributions and calculates the optimal transportation cost for each projection. In order to construct these one-dimensional marginal distributions, we use the Radon transform (Helgason 2010), which is defined as follows:
$$\mathcal{R}P_\mu(t, \theta) = \int \mu(x)\, \delta(t - \langle \theta, x \rangle)\, dx, \quad \forall \theta \in \mathbb{S}^{d-1}, \; \forall t \in \mathbb{R}, \qquad (6)$$

where δ denotes the Dirac delta function, ⟨·, ·⟩ is the Euclidean inner product, and S^{d−1} ⊂ R^d is the d-dimensional unit sphere. Thus, we denote R_θ µ as a 1-D marginal distribution of µ under the projection on θ. The Sliced 1-Wasserstein distance is defined as follows:

$$SW_1(P_\mu, P_\nu) = \int_{\mathbb{S}^{d-1}} W_1(\mathcal{R}_\theta \mu, \mathcal{R}_\theta \nu)\, d\theta. \qquad (7)$$

Now, we can calculate the Sliced Wasserstein distance by optimizing a series of 1-D transportation problems, which have a closed-form solution that can be computed in O(N log(N)) (Rabin et al. 2011). In particular, by sorting R_θ µ and R_θ ν in ascending order using two permutation operators τ1 and τ2, respectively, the SWD can be approximated as follows:

$$SWD(P_\mu, P_\nu) \approx \sum_{k=1}^{K} \sum_{i=1}^{N} \psi\big(\mathcal{R}_{\theta_k}\mu_{\tau_1[i]}, \mathcal{R}_{\theta_k}\nu_{\tau_2[i]}\big), \qquad (8)$$

where K is the number of uniform random samples θ drawn with the Monte Carlo method to approximate the integration of θ over the unit sphere S^{d−1} in Eq. 7.

Figure 4: Detailed illustration of our multi-view attention distillation. Two backbone features of the teacher (top) and the student (bottom). After normalization, the obtained features are projected on a random sample θi; then two attention vectors, vT^{θi} and vS^{θi}, are obtained by sorting the projection images and binning them into g groups. Multiple values of θ provide us with multiple attention views on the two tensors.

Multi-view attention distillation. Let P_A be the square of A after being normalized by the Frobenius norm, i.e., P_A = A°2 / ||A||_F², where ° denotes the Hadamard power (Bocci, Carlini, and Kileel 2016). Consequently, we are now able to consider P_A as a discrete probability density function over Ω = R^{C×W×H} ⊂ R³, where P_A(c, x, y) indicates the density value at the c-th, x-th and y-th slice of the channel, width and height dimension, respectively. To avoid replicating the element-wise differences, we additionally need to bin the projected vectors into g groups before applying distillation. One important property of our multi-view attention is that different values of θ provide different attention views (slices) of AS and AT. For instance, with θ = (1, 0, 0), we achieve the channel-wise attention that was introduced by Chen et al. (Chen et al. 2017). Or, we can produce an attention vector in the width and height dimension when θ becomes close to (0, 1, 0) and (0, 0, 1), respectively. With this general property, a student can pay full attention to its teacher's tensor distribution instead of some pre-defined constant attention views.

Figure 4 pictorially illustrates our overall multi-view attention distillation, and we summarize our multi-view attention in Algorithm 1 in the supplementary materials. In order to encourage the semantic similarity of sample representations from the same class and discourage that of those from different classes, we further apply the contrastive loss for each instance, which is inspired by the CRD distillation framework of Tian et al. (Tian, Krishnan, and Isola 2019). Thus, our overall multi-view attention loss is defined as follows:

$$\mathcal{L}_{MV} = \gamma_{MV} \times SWD(P_{A_S}, P_{A_T}) + \eta_{MV} \times \Big( SWD(P_{A_S}, P_{A_T^{+}}) + \max\big(\Delta - SWD(P_{A_S}, P_{A_T^{-}}),\, 0\big) \Big), \qquad (9)$$

where A_T⁺ and A_T⁻ are random instance representations that belong to the same and the opposite class of AS at the teacher, respectively, Δ is a margin that manages the discrepancy of negative pairs, and γ_MV and η_MV are scaling hyper-parameters.

Overall Loss Function

The overall distillation loss in our KD framework is formulated as follows:

$$\mathcal{L}_{Distill}(A_S, A_T) = \alpha \cdot \underbrace{\mathcal{L}_{FR}}_{\text{frequency attention}} + \beta \cdot \underbrace{\mathcal{L}_{MV}}_{\text{multi-view attention}}, \qquad (10)$$

where α and β are hyper-parameters to balance the contributions of the frequency attention distiller and multi-view attention distiller, respectively. Our attention loss is parameter-free and independent of the model architecture design, and it can be directly added to any detector model's conventional loss (e.g., cross-entropy loss). Also, the frequency attention requires computational complexity in O(CWH · (log(W) + log(H))) for one backbone feature, where O(WH · (log(W) + log(H))) is the complexity of the 2-D Fast Fourier Transform applied to one channel. On the other hand, the average-case complexity of the multi-view attention is O(KN · log(N)), where O(N · log(N)) is the complexity of the 1-D closed-form solution as mentioned above, K is the number of random samples θ, and N is the number of elements in one backbone feature, i.e., N = CWH. Our end-to-end Attention-based Deepfake detection Distiller (ADD) pipeline is presented in Fig. 2.
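To make the formulas above concrete, the following is a minimal, self-contained PyTorch sketch of Eq. 10 built from Eqs. 1–3 and Eq. 8. It is only an illustration under our own simplifying assumptions (orthonormal FFT scaling, omission of the contrastive terms of Eq. 9, illustrative function names and default weights); it is not the authors' released implementation, which is linked in the Conclusion.

```python
import torch

def frequency_attention_loss(a_s: torch.Tensor, a_t: torch.Tensor,
                             gamma_fr: float = 1.0) -> torch.Tensor:
    """Frequency attention distillation (Eqs. 1-3) for features of shape (C, W, H)."""
    # Eq. 1: per-channel 2-D DFT of the student and teacher backbone features.
    # norm="ortho" is a stabilizing choice for this sketch; Eq. 1 uses the plain DFT.
    f_s = torch.fft.fft2(a_s, norm="ortho")
    f_t = torch.fft.fft2(a_t, norm="ortho")
    d = (f_s - f_t).abs().pow(2)                # d(c1, c2) = |c1 - c2|^2, as in the experiments
    w = torch.exp(gamma_fr * d.mean(dim=0))     # Eq. 3: channel-averaged exponential weights
    return (w.unsqueeze(0) * d).sum()           # Eq. 2: weighted sum over channels and frequencies

def multi_view_attention_swd(a_s: torch.Tensor, a_t: torch.Tensor,
                             num_views: int = 64, num_groups: int = None) -> torch.Tensor:
    """Sliced-Wasserstein term SWD(P_AS, P_AT) of Eq. 9 (contrastive terms omitted)."""
    c, w, h = a_s.shape
    num_groups = num_groups or c // 2           # g = C/2, as in the experiment settings
    # Normalize each tensor into a discrete density P_A = A**2 / ||A||_F**2.
    p_s = a_s.pow(2) / a_s.pow(2).sum()
    p_t = a_t.pow(2) / a_t.pow(2).sum()
    # Shared (channel, width, height) coordinates of every tensor element.
    coords = torch.stack(torch.meshgrid(
        torch.arange(c, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        torch.arange(h, dtype=torch.float32), indexing="ij"), dim=-1).reshape(-1, 3)
    loss = a_s.new_zeros(())
    for _ in range(num_views):
        theta = torch.randn(3)
        theta = theta / theta.norm()            # random view (slice) on the unit sphere
        order = torch.argsort(coords @ theta)   # same projection ordering for both tensors
        v_s = torch.stack([g.sum() for g in torch.chunk(p_s.reshape(-1)[order], num_groups)])
        v_t = torch.stack([g.sum() for g in torch.chunk(p_t.reshape(-1)[order], num_groups)])
        loss = loss + (v_s - v_t).pow(2).mean() # quadratic 1-D transportation cost
    return loss / num_views

def add_distillation_loss(a_s: torch.Tensor, a_t: torch.Tensor,
                          alpha: float = 1.0, beta: float = 16.0,
                          gamma_mv: float = 100.0) -> torch.Tensor:
    """Overall distillation loss of Eq. 10 (added to the usual cross-entropy loss)."""
    l_fr = frequency_attention_loss(a_s, a_t)
    l_mv = gamma_mv * multi_view_attention_swd(a_s, a_t)
    return alpha * l_fr + beta * l_mv

# Toy usage with random stand-ins for the backbone features.
a_student = torch.rand(8, 16, 16, requires_grad=True)
a_teacher = torch.rand(8, 16, 16)
loss = add_distillation_loss(a_student, a_teacher)
loss.backward()
print(loss.item())
```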
Experiment

Datasets

Our proposed method is evaluated on five different popular deepfake benchmark datasets: NeuralTextures (Thies, Zollhöfer, and Nießner 2019), Deepfakes (Accessed 2021-Jan-01a), Face2Face (Thies et al. 2016), FaceSwap (Accessed 2021-Jan-01c), and FaceShifter (Li et al. 2019). Every dataset has 1,000 videos generated from 1,000 real human face videos by Rössler et al. (Rossler et al. 2019). These videos are compressed into two versions: medium compression (c23) and high compression (c40), using the H.264 codec with a constant rate quantization parameter of 23 and 40, respectively. Each dataset is randomly divided into training, validation, and test sets consisting of 720, 140, and 140 videos, respectively. We randomly select 64 frames from each video and obtain 92,160, 17,920, and 17,920 images for the training, validation, and test sets, respectively. Then, we utilize Dlib (King 2009) to detect the largest face in every single frame and resize it to a square image of 128 × 128 pixels.

Experiment Settings

In our experiments, we use the Adam optimizer (Kingma and Ba 2014) with β1 = 0.9, β2 = 0.999 and ε = 10⁻⁸. The learning rate is 2 × 10⁻⁴, which follows a one-cycle learning rate schedule (Smith and Topin 2019) with a mini-batch size of 144. In every epoch, the model is validated 10 times to save the best parameters using validation accuracy. Early stopping is applied when the validation performance does not improve after 10 consecutive validations. We use ResNet50 (He et al. 2016) as our backbone to implement our proposed distillation framework. In Eq. 2, we define d as the square of the modulus of the difference between two complex numbers, i.e., d(c1, c2) = |c1 − c2|², which satisfies the properties of a general distance metric: non-negativity, symmetry, identity, and the triangle inequality. The number of binning groups g is equal to half of the number of channels of AS. Our hyper-parameter settings {γ_FR = 1, γ_MV = 100, η_MV = 50, Δ = 0.012, α = 1} are kept the same, while β is fine-tuned on each dataset in the range of 16 to 23 through the experiments. The experiments are conducted on two TITAN RTX 24GB GPUs with an Intel Xeon Gold 6230R CPU @ 2.10GHz.

Results

Our experimental results are presented in Table 1. We use the Accuracy score (ACC) and Recall at 1 (R@1), which are described in detail in the supplementary materials. We compare our ADD method with both distillation and non-distillation baselines. For a fair comparison between different methods, the same low resolution of 128 × 128 pixels mentioned above is used throughout the experiments.

Non-distillation methods. We reproduce the two highest-scoring deepfake detection benchmark methods (Accessed 2021-Jan-01b): 1) the method proposed by Rössler et al. (Rossler et al. 2019), which used the Xception model, and 2) the approach by Dogonadze et al. (Dogonadze, Obernosterer, and Hou 2020), which employed Inception ResNet V1 pre-trained on the VGGFace2 dataset (Cao et al. 2018). These are the two best performing publicly available deepfake detection methods.¹ Additionally, we use F3Net, a frequency-based deepfake detection method introduced by Qian et al. (Qian et al. 2020), for evaluation. F3Net is deployed on two streams of XceptionNet as described in the paper. Finally, ResNet50 (He et al. 2016) is also included as a baseline to compare with the distillation methods.

Distillation baseline methods. As there has not been much research that deploys KD for deepfake detection, we further integrate three other well-known distillation architectures into the ResNet50 backbone to perform comparisons, including: FitNet (Romero et al. 2014), Attention Transfer (Zagoruyko and Komodakis 2016) (AT), and Non-local (Wang et al. 2018) (NL). Each of these methods is fine-tuned on the validation set to achieve its best performance.

First, comparing ours with the non-distillation baselines, we can observe that our method improves the detection accuracy from 1% to 6% across all five datasets for both compression types. On average, our approach outperforms the other three distillation methods, and is superior on the most highly compressed (c40) datasets. The model with the FitNet loss, though it shows a small improvement, does not have competitive results due to retaining insufficient frequency information. The attention module and non-local module also provide compelling results. However, they do not surpass our method because of the lower attention dimension and the shortage of frequency information.

¹ http://kaldir.vc.in.tum.de/faceforensics_benchmark/

Ablation Study and Discussions

Effects of Attention Modules. We investigate the quantitative impact of the frequency attention and multi-view attention on the final performance. The NeuralTextures (NT) dataset has been shown to be the most difficult to differentiate by both human eyes and DNNs (Rossler et al. 2019). Hence, we conduct our ablation study on the highly compressed c40 NT dataset. The results are presented in Table 2. We can observe that frequency attention improves the accuracy by about 6.76%. Multi-view attention with contrastive loss provides a slightly better result than that without contrastive loss, at 68.14% and 67.01%, respectively. Finally, combining the frequency attention and multi-view attention distillation with contrastive loss significantly improves the accuracy, up to 68.53%. The results of our ablation study demonstrate that each proposed attention distiller makes a different contribution to the student's ability to mimic the teacher, and they are compatible when integrated together to achieve the best performance.

Sensitivity of attention weights (α and β). We conduct an experiment on the sensitivity of the frequency attention weight α and the multi-view attention weight β on the five different datasets. The detailed results are presented in the supplementary materials. The results show that when changing the values of α and β, the performance of our method continuously outperforms the baseline results, indicating that our approach is less sensitive to both α and β.
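As a side note on reproducing the preprocessing described in the Datasets subsection, a minimal face-cropping sketch could look as follows (assuming dlib and OpenCV; the exact detector settings are not specified in the paper, so this is only illustrative):

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def largest_face_crop(frame_bgr, size=128):
    """Detect the largest face in a frame and resize it to size x size pixels."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)  # upsample once to help with small faces
    if not faces:
        return None
    # Keep the largest detection by bounding-box area.
    box = max(faces, key=lambda r: r.width() * r.height())
    top, bottom = max(box.top(), 0), min(box.bottom(), frame_bgr.shape[0])
    left, right = max(box.left(), 0), min(box.right(), frame_bgr.shape[1])
    crop = frame_bgr[top:bottom, left:right]
    return cv2.resize(crop, (size, size))
```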
Dataset        | Model                 | c23 ACC | c23 R@1 | c40 ACC | c40 R@1
---------------|-----------------------|---------|---------|---------|--------
NeuralTextures | Rössler et al.        | 76.36   | 57.24   | 56.75   | 51.88
               | Dogonadze et al.      | 78.03   | 77.13   | 61.12   | 48.01
               | F3Net                 | 77.91   | 77.39   | 61.95   | 32.35
               | ResNet50              | 86.25   | 82.75   | 60.27   | 53.06
               | FitNet - ResNet50     | 86.26   | 84.83   | 66.01   | 57.28
               | AT - ResNet50         | 85.21   | 84.99   | 62.61   | 43.50
               | NL - ResNet50         | 88.26   | 86.95   | 65.65   | 46.82
               | ADD - ResNet50 (ours) | 88.48   | 87.53   | 68.53   | 58.42
DeepFakes      | Rössler et al.        | 97.42   | 96.96   | 92.43   | 82.39
               | Dogonadze et al.      | 94.67   | 94.39   | 93.97   | 93.52
               | F3Net                 | 96.26   | 95.84   | 93.06   | 93.00
               | ResNet50              | 96.34   | 95.90   | 92.89   | 91.18
               | FitNet - ResNet50     | 97.28   | 97.78   | 93.68   | 93.34
               | AT - ResNet50         | 97.37   | 98.72   | 95.11   | 94.35
               | NL - ResNet50         | 98.42   | 98.21   | 93.09   | 94.35
               | ADD - ResNet50 (ours) | 98.67   | 98.09   | 95.50   | 94.59
Face2Face      | Rössler et al.        | 91.83   | 91.02   | 80.21   | 77.42
               | Dogonadze et al.      | 89.34   | 88.73   | 83.44   | 81.00
               | F3Net                 | 95.52   | 95.40   | 81.48   | 79.31
               | ResNet50              | 95.60   | 94.77   | 83.94   | 79.88
               | FitNet - ResNet50     | 95.91   | 96.16   | 83.48   | 78.99
               | AT - ResNet50         | 96.80   | 96.84   | 83.55   | 78.72
               | NL - ResNet50         | 96.44   | 96.64   | 83.69   | 82.04
               | ADD - ResNet50 (ours) | 96.82   | 97.14   | 85.42   | 83.54
FaceSwap       | Rössler et al.        | 95.49   | 95.36   | 88.09   | 87.67
               | Dogonadze et al.      | 93.33   | 92.78   | 90.02   | 89.10
               | F3Net                 | 95.74   | 95.65   | 89.58   | 88.90
               | ResNet50              | 92.46   | 90.85   | 88.91   | 86.52
               | FitNet - ResNet50     | 97.29   | 96.29   | 89.16   | 90.13
               | AT - ResNet50         | 97.66   | 97.27   | 89.75   | 90.41
               | NL - ResNet50         | 97.34   | 96.95   | 91.86   | 90.78
               | ADD - ResNet50 (ours) | 97.85   | 97.34   | 92.49   | 92.13
FaceShifter    | Rössler et al.        | 93.04   | 93.16   | 89.20   | 87.12
               | Dogonadze et al.      | 89.80   | 89.36   | 82.03   | 79.96
               | F3Net                 | 95.10   | 95.02   | 89.13   | 88.69
               | ResNet50              | 94.89   | 93.88   | 89.56   | 88.48
               | FitNet - ResNet50     | 96.63   | 95.95   | 90.16   | 89.36
               | AT - ResNet50         | 96.32   | 96.76   | 88.28   | 89.45
               | NL - ResNet50         | 96.24   | 95.28   | 90.04   | 87.71
               | ADD - ResNet50 (ours) | 96.60   | 95.84   | 91.64   | 90.27

Table 1: Experimental results of our proposed method and seven other baseline approaches on five different deepfake datasets, reported as ACC and R@1 (%) under medium compression (c23) and high compression (c40).

Model                             | ACC (%)
----------------------------------|--------
ResNet (baseline)                 | 60.27
Our ResNet (FR)                   | 67.03
Our ResNet (MV w/o contrastive)   | 67.01
Our ResNet (MV w/ contrastive)    | 68.14
Our ResNet (FR+MV)                | 68.53

Table 2: The effect of each single attention module on the final results, experimented on the NeuralTextures dataset.

Backbone        | Model    | NT c23 | NT c40 | DeepFakes c23 | DeepFakes c40 | Face2Face c23 | Face2Face c40 | FaceSwap c23 | FaceSwap c40 | FaceShifter c23 | FaceShifter c40
----------------|----------|--------|--------|---------------|---------------|---------------|---------------|--------------|--------------|-----------------|----------------
ResNet18        | Baseline | 81.8   | 67.3   | 97.5          | 89.2          | 94.2          | 85.0          | 90.2         | 84.5         | 93.2            | 89.2
ResNet18        | ADD      | 84.3   | 67.5   | 97.7          | 94.7          | 95.7          | 85.3          | 96.0         | 91.5         | 97.0            | 92.2
ResNet34        | Baseline | 82.6   | 58.4   | 92.0          | 93.4          | 94.2          | 83.2          | 92.4         | 88.6         | 95.6            | 89.3
ResNet34        | ADD      | 84.3   | 63.5   | 97.8          | 94.6          | 94.9          | 83.4          | 96.8         | 90.6         | 97.8            | 91.5
EfficientNet-B0 | Baseline | 81.2   | 60.5   | 96.5          | 90.0          | 94.1          | 77.4          | 92.6         | 83.4         | 93.8            | 84.0
EfficientNet-B0 | ADD      | 83.5   | 67.6   | 97.5          | 92.5          | 96.7          | 80.3          | 95.3         | 87.5         | 95.1            | 85.2

Table 3: Classification accuracy (%) of the ResNet18, ResNet34 and EfficientNet-B0 baselines and their integration with our ADD training framework.

Experiment with other backbones. Table 3 shows the results with three other backbones: ResNet18 and ResNet34 (He et al. 2016), and EfficientNet-B0 (Tan and Le 2019). We set up the hyper-parameters of these DNNs the same as for ResNet50, except γ_FR is changed to 1e−3 for EfficientNet-B0. Our distilled model improves the detection accuracy on all five datasets at different compression qualities, by up to 7%, 5.8% and 7.1% with the ResNet18, ResNet34, and EfficientNet-B0 backbones compared to their baselines, respectively.

Grad-CAM (Selvaraju et al. 2017). Using Grad-CAM, we provide visual explanations regarding the merits of training an LQ deepfake detector with our ADD framework. The gallery of Grad-CAM visualizations is included in the supplementary material. First, our ADD is able to correct the facial artifacts' attention of the LQ detector to resemble its teacher trained on raw datasets. Second, ADD vigorously instructs the student model to neglect background noise and activate the facial areas as its teacher does when encountering facial images in complex backgrounds. Meanwhile, the baseline model, which is solely trained on LQ datasets, steadily makes wrong predictions with high confidence by activating non-facial areas and is deceived by complex backgrounds.

Conclusion

In this paper, we proposed a novel Attention-based Deepfake detection Distiller (ADD), exploring frequency attention distillation and multi-view attention distillation in a KD framework to detect highly compressed deepfakes. The frequency attention helps the student to retrieve and focus more on high-frequency components from the teacher. The multi-view attention, inspired by the Sliced Wasserstein distance, pushes the student's output tensor distribution toward the teacher's, maintaining correlated pixel features between tensor elements from multiple views (slices). Our experiments demonstrate that our proposed method is highly effective and achieves competitive results in most cases when detecting extremely challenging, highly compressed LQ deepfakes. Our code is available here.²

² https://github.com/Leminhbinh0209/ADD.git
Acknowledgments

This work was partly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00421, AI Graduate School Support Program (Sungkyunkwan University)), (No. 2019-0-01343, Regional strategic industry convergence security core talent training business) and the Basic Science Research Program through the National Research Foundation of Korea (NRF) grant funded by the Korea government MSIT (No. 2020R1C1C1006004). Also, this research was partly supported by an IITP grant funded by the Korea government MSIT (No. 2021-0-00017, Original Technology Development of Artificial Intelligence Industry) and was partly supported by the Korea government MSIT, under the High-Potential Individuals Global Training Program (2020-0-01550) supervised by the IITP.

References

Accessed. 2021-Jan-01a. DeepFakes GitHub. https://github.com/deepfakes/faceswap.
Accessed. 2021-Jan-01b. FaceForensics Benchmark. http://kaldir.vc.in.tum.de/faceforensics_benchmark/.
Accessed. 2021-Jan-01c. FaceSwap GitHub. https://github.com/MarekKowalski/FaceSwap/.
Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 214–223. PMLR.
Bocci, C.; Carlini, E.; and Kileel, J. 2016. Hadamard products of linear spaces. Journal of Algebra, 448: 595–617.
Bonneel, N.; Rabin, J.; Peyré, G.; and Pfister, H. 2015. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1): 22–45.
Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; and Shah, R. 1994. Signature verification using a "Siamese" time delay neural network. Advances in Neural Information Processing Systems, 737–737.
Cao, Q.; Shen, L.; Xie, W.; Parkhi, O. M.; and Zisserman, A. 2018. VGGFace2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 67–74. IEEE.
Catherine, S. 2019. Fraudsters Used AI to Mimic CEO's Voice in Unusual Cybercrime Case. https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402.
Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; and Chua, T.-S. 2017. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5659–5667.
Cole, S. 2018. We Are Truly Fucked: Everyone Is Making AI-Generated Fake Porn Now.
Deshpande, I.; Zhang, Z.; and Schwing, A. G. 2018. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3483–3491.
Dogonadze, N.; Obernosterer, J.; and Hou, J. 2020. Deep Face Forgery Detection. arXiv preprint arXiv:2004.11804.
Dzanic, T.; Shah, K.; and Witherden, F. 2019. Fourier Spectrum Discrepancies in Deep Network Generated Images. arXiv preprint arXiv:1911.06465.
Frank, J.; Eisenhofer, T.; Schönherr, L.; Fischer, A.; Kolossa, D.; and Holz, T. 2020. Leveraging frequency analysis for deep fake image recognition. In International Conference on Machine Learning, 3247–3258. PMLR.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Helgason, S. 2010. Integral geometry and Radon transforms. Springer Science & Business Media.
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; and Wei, Y. 2018. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3588–3597.
Huang, Z.; and Wang, N. 2017. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219.
Jeon, H.; Bang, Y.; Kim, J.; and Woo, S. S. 2020. T-GD: Transferable GAN-generated Images Detection Framework. arXiv preprint arXiv:2008.04115.
Jiang, L.; Dai, B.; Wu, W.; and Loy, C. C. 2020. Focal Frequency Loss for Generative Models. arXiv preprint arXiv:2012.12821.
Kantorovitch, L. 1958. On the translocation of masses. Management Science, 5(1): 1–4.
Khayatkhoei, M.; and Elgammal, A. 2020. Spatial Frequency Bias in Convolutional Generative Adversarial Networks. arXiv preprint arXiv:2010.01473.
King, D. E. 2009. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research, 10: 1755–1758.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kolouri, S.; Pope, P. E.; Martin, C. E.; and Rohde, G. K. 2018. Sliced Wasserstein auto-encoders. In International Conference on Learning Representations.
Lee, C.-Y.; Batra, T.; Baig, M. H.; and Ulbricht, D. 2019. Sliced Wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10285–10295.
Li, L.; Bao, J.; Yang, H.; Chen, D.; and Wen, F. 2019. FaceShifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457.
Li, L.; Bao, J.; Zhang, T.; Yang, H.; Chen, D.; Wen, F.; and Guo, B. 2020. Face X-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5001–5010.
Li, Y.; and Lyu, S. 2018. Exposing deepfake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656.
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, 2980–2988.
Morgan, N.; and Bourlard, H. 1989. Generalization and parameter estimation in feedforward nets: Some experiments. Advances in Neural Information Processing Systems, 2: 630–637.
Nitzan, Y.; Bermano, A.; Li, Y.; and Cohen-Or, D. 2020. Face identity disentanglement via latent space mapping. ACM Transactions on Graphics (TOG), 39(6): 1–14.
Odena, A.; Dumoulin, V.; and Olah, C. 2016. Deconvolution and checkerboard artifacts. Distill (2016).
Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Passalis, N.; and Tefas, A. 2018. Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision (ECCV), 268–284.
Pidhorskyi, S.; Adjeroh, D. A.; and Doretto, G. 2020. Adversarial latent autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14104–14113.
Qian, Y.; Yin, G.; Sheng, L.; Chen, Z.; and Shao, J. 2020. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European Conference on Computer Vision, 86–103. Springer.
Quandt, T.; Frischlich, L.; Boberg, S.; and Schatto-Eckrodt, T. 2019. Fake news. The International Encyclopedia of Journalism Studies, 1–6.
Rabin, J.; Peyré, G.; Delon, J.; and Bernot, M. 2011. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, 435–446. Springer.
Rahmouni, N.; Nozick, V.; Yamagishi, J.; and Echizen, I. 2017. Distinguishing computer graphics from natural images using convolution neural networks. In 2017 IEEE Workshop on Information Forensics and Security (WIFS), 1–6. IEEE.
Richardson, E.; Alaluf, Y.; Patashnik, O.; Nitzan, Y.; Azar, Y.; Shapiro, S.; and Cohen-Or, D. 2020. Encoding in style: A StyleGAN encoder for image-to-image translation. arXiv preprint arXiv:2008.00951.
Romero, A.; Ballas, N.; Kahou, S. E.; Chassang, A.; Gatta, C.; and Bengio, Y. 2014. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550.
Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; and Nießner, M. 2019. FaceForensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1–11.
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626.
Siarohin, A.; Lathuilière, S.; Tulyakov, S.; Ricci, E.; and Sebe, N. 2020. First order motion model for image animation. arXiv preprint arXiv:2003.00196.
Smith, L. N.; and Topin, N. 2019. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, 1100612. International Society for Optics and Photonics.
Tan, M.; and Le, Q. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, 6105–6114. PMLR.
Thies, J.; Zollhöfer, M.; and Nießner, M. 2019. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG), 38(4): 1–12.
Thies, J.; Zollhöfer, M.; Stamminger, M.; Theobalt, C.; and Nießner, M. 2016. Face2Face: Real-time face capture and reenactment of RGB videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2387–2395.
Tian, Y.; Krishnan, D.; and Isola, P. 2019. Contrastive representation distillation. arXiv preprint arXiv:1910.10699.
Villani, C. 2008. Optimal transport: old and new, volume 338. Springer Science & Business Media.
Wang, R.; Ma, L.; Juefei-Xu, F.; Xie, X.; Wang, J.; and Liu, Y. 2019. FakeSpotter: A simple baseline for spotting AI-synthesized fake faces. arXiv preprint arXiv:1909.06122.
Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794–7803.
Xu, H.; Luo, D.; Henao, R.; Shah, S.; and Carin, L. 2020. Learning autoencoders with relational regularization. In International Conference on Machine Learning, 10576–10586. PMLR.
Xu, Z.-Q. J.; Zhang, Y.; and Xiao, Y. 2019. Training behavior of deep neural network in frequency domain. In International Conference on Neural Information Processing, 264–274. Springer.
Yim, J.; Joo, D.; Bae, J.; and Kim, J. 2017. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4133–4141.
Zagoruyko, S.; and Komodakis, N. 2016. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928.
Zhang, X.; Karaman, S.; and Chang, S.-F. 2019. Detecting and simulating artifacts in GAN fake images. In 2019 IEEE International Workshop on Information Forensics and Security (WIFS), 1–6. IEEE.
Zhu, M.; Han, K.; Zhang, C.; Lin, J.; and Wang, Y. 2019. Low-resolution visual recognition via deep feature distillation. In ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3762–3766. IEEE.
Supplementary Materials

A. Multi-View Attention Algorithm

Algorithm 1 presents the pseudo-code for multi-view attention distillation between two corresponding backbone features from the student and teacher models using the Sliced Wasserstein distance (SWD). For simplicity, we formulate how each single projection θi contributes to the total SWD. However, in practice, K uniform vectors θi in S^{d−1} can be sampled simultaneously by deep learning libraries, e.g., TensorFlow or PyTorch, and the projection operation or binning can be vectorized.

B. Datasets

We describe the five different deepfake datasets used in our experiments:

• NeuralTextures. Facial reenactment is an application of the Neural Textures (Thies, Zollhöfer, and Nießner 2019) technique that is used in video re-renderings. This approach includes learned feature maps stored on top of 3D mesh proxies, called neural textures, and a deferred neural renderer. The NeuralTextures dataset used in our experiment includes facial modifications of mouth regions, while the other face regions remain the same.

• DeepFakes. The DeepFakes dataset is generated using two autoencoders with a shared encoder, each of which is trained on the source and target faces, respectively. Fake faces are generated by decoding the source face's embedding representation with the target face's decoder. Note that DeepFakes, at the beginning, was a specific facial swapping technique, but is now used to refer to AI-generated facial manipulation methods in general.

• Face2Face. Face2Face (Thies et al. 2016) is a real-time facial reenactment approach, in which the target person's expression follows the source person's, while his/her identity is preserved. Particularly, the identity corresponding to the target face is recovered by a non-rigid model-based bundling approach on a set of key-frames that are manually selected in advance. The source face's expression coefficients are transferred to the target, while maintaining the environment lighting as well as the target background.

• FaceSwap. FaceSwap (Accessed 2021-Jan-01c) is a lightweight application that is built upon the graphic structures of source and target faces. A 3D model is designed to fit 68 facial landmarks extracted from the source face. Then, it projects the facial regions back to the target face by minimizing the pair-wise landmark errors and applies color correction in the final step.

• FaceShifter. FaceShifter (Li et al. 2019) is a two-stage face swapping framework. The first stage includes an encoder-based multi-level feature extractor used for a target face and a generator that has Adaptive Attentional Denormalization (AAD) layers. In particular, AAD is able to blend the identity and the features into a synthesized face. In the second stage, they developed a novel Heuristic Error Acknowledging Refinement Network in order to enhance facial occlusions.

C. Distillation Baseline Methods

In our experiments, we actively integrate three well-known distillation losses into the teacher-student training framework for comparison with ours:

• FitNet (Romero et al. 2014). The FitNet method proposed hint algorithms, in which the student's guided layers try to predict the outputs of the teacher's hint layers. We apply this hint-based learning approach to the penultimate layer of the teacher and student.

• Attention Transfer (Zagoruyko and Komodakis 2016) (AT). The attention method transfers attention maps, which are obtained by summing up spatial values across the backbone features' channels, from the teacher to the student.

• Non-local (Wang et al. 2018) (NL). The non-local module generates self-attention features from the student's and teacher's backbone features. Subsequently, the student's self-attention tensors attempt to mimic the teacher's.

D. Evaluation Metrics

The results in our experiments are evaluated based on the following metrics:

• Accuracy (ACC). ACC is widely used to evaluate a classifier's performance, and it calculates the proportion of samples whose true classes are predicted with the highest probability. The ACC of a model fθ tested on a test set of N samples {(x1, y1), ..., (xN, yN)} is formulated as follows:

$$ACC = \frac{\sum_{i=1}^{N} \mathbb{I}\big(\arg\max(f_\theta(x_i)),\, y_i\big)}{N}, \qquad (11)$$

where I(·, ·) is the Kronecker delta function.

• Recall at k (R@k). R@k indicates the proportion of test samples with at least one observed sample from the same class among the k nearest neighbors determined in a particular feature space. A small R@k implies small intra-class variation, which usually leads to better accuracy. R@k is formulated as follows:

$$R@k = 1 - \frac{\sum_{i=1}^{N} \mathbb{I}\Big(\sum_{j=1}^{k} \mathbb{I}\big(\text{neighbor}_{x_i}[j],\, y_i\big),\, 0\Big)}{N}, \qquad (12)$$

where neighbor_{x_i}[j] is the label of the j-th nearest neighbor of x_i, and I(·, ·) is the Kronecker delta function. We use the Euclidean distance to measure the distances between queried and referenced samples, whose features are the penultimate layer's outputs, and we adopt R@1, which considers the first nearest neighbor of a test sample.
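A small sketch of both metrics, assuming NumPy arrays of logits, labels, and penultimate-layer features (excluding a sample from its own neighbor list is our assumption):

```python
import numpy as np

def accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Eq. 11: fraction of samples whose highest-probability class is correct."""
    return float((logits.argmax(axis=1) == labels).mean())

def recall_at_k(features: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    """Eq. 12: fraction of samples with at least one same-class sample
    among their k nearest neighbors (Euclidean distance, self excluded)."""
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # do not count a sample as its own neighbor
    hits = 0
    for i in range(len(features)):
        neighbors = np.argsort(dists[i])[:k]
        hits += int((labels[neighbors] == labels[i]).any())
    return hits / len(features)

# Toy usage on random embeddings and binary (real/fake) labels.
feats = np.random.randn(200, 64)
labels = np.random.randint(0, 2, size=200)
logits = np.random.randn(200, 2)
print(accuracy(logits, labels), recall_at_k(feats, labels, k=1))
```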
Algorithm 1: Multi-view attention distillation using the Sliced Wasserstein distance (SWD)

Require: Two backbone features A_S and A_T, obtained from the respective low-quality compressed and raw images through the student and teacher networks, the number of random projections K, and the number of bins g.
 1: P_AS, P_AT ← A_S°2 / ||A_S||_F², A_T°2 / ||A_T||_F²            ▷ normalize A_S and A_T
 2: SWD ← 0                                                        ▷ initialize SWD to zero
 3: for iteration i ← 1 to K do
 4:     θi ← U(S^{d−1})                                            ▷ uniformly sample θi from the unit sphere in R³
 5:     u_S^{θi}, u_T^{θi} ← R_{θi} P_AS, R_{θi} P_AT              ▷ project P_AS and P_AT on θi
 6:     u_{S,τ1}^{θi}, u_{T,τ2}^{θi} ← Sort(u_S^{θi}), Sort(u_T^{θi})   ▷ sort u_S^{θi} and u_T^{θi} in ascending order
 7:     v_S^{θi}, v_T^{θi} ← G(u_{S,τ1}^{θi}), G(u_{T,τ2}^{θi})    ▷ partition u_{S,τ1}^{θi} and u_{T,τ2}^{θi} into g bins
 8:     SWD ← SWD + (1/g) Σ_{j=1}^{g} ψ(v_S^{θi}[j], v_T^{θi}[j])  ▷ apply the 1-D transportation cost with quadratic loss ψ
 9: end for
10: return SWD
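As noted in Section A, the loop over projections in Algorithm 1 can be vectorized. A minimal sketch of sampling all K views and projecting every element coordinate at once (assuming PyTorch; names are illustrative):

```python
import torch

def sample_projections(k: int, d: int = 3) -> torch.Tensor:
    """Draw K directions uniformly on the unit sphere S^{d-1} in one batch."""
    theta = torch.randn(k, d)
    return theta / theta.norm(dim=1, keepdim=True)

def project_all(coords: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Project every tensor-element coordinate onto all K views in one matmul.

    coords: (N, d) coordinates of the tensor elements, theta: (K, d).
    Returns a (K, N) matrix of projected positions, one row per view.
    """
    return theta @ coords.T

# K = 64 views over an 8 x 16 x 16 backbone feature (N = 2048 elements).
C, W, H = 8, 16, 16
coords = torch.stack(torch.meshgrid(
    torch.arange(C, dtype=torch.float32),
    torch.arange(W, dtype=torch.float32),
    torch.arange(H, dtype=torch.float32), indexing="ij"), dim=-1).reshape(-1, 3)
proj = project_all(coords, sample_projections(64))
print(proj.shape)  # torch.Size([64, 2048])
```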

E. Hyperparameters' Sensitivity

We examine the sensitivity of the different choices of the attention weight hyperparameters, α and β. We experiment and show that our approach (solid lines) is less sensitive to α and β, and almost always outperforms the ResNet50 baselines (dashed lines) across all datasets, as shown in Fig. 5. In this work, we fix α = 1 and fine-tune β, but they can be further tuned to optimize the performance.

F. Gallery of Grad-CAM for the Teacher, Student and Baseline Model

Grad-CAM (Selvaraju et al. 2017) is generated by backpropagating the gradients of the highest-probability class to a preceding convolutional layer's output, producing a weighted combination with that output. The localization heat map visually indicates the particular contribution of each region to the final class prediction. Strong activation regions, represented in red, result from positive layer outputs and higher gradient values, whereas negative pixels or low gradients produce less activated regions, indicated in blue. In our experiments, we utilized ResNet50 pre-trained on raw data as the teacher, our ADD - ResNet50 trained on low-quality compressed data (c40) as the student, and ResNet50 trained on the c40 data alone without any KD method as a baseline for comparison. We provide visual explanations regarding the two benefits of training a low-quality deepfake detector with our ADD framework as follows:

• Correctly identifying facial activation regions. Despite being trained on low-quality compressed images, the ResNet50 baseline still makes wrong predictions with high confidence, as shown in the second column of Fig. 6. The areas pointed to by red arrows indicate the baseline's activation regions, which are different from the teacher's, indicated by green arrows in the third column. After training with our ADD framework, the student is able to produce correct predictions that are strongly correlated with, and can be further visualized by, its activation regions, which closely follow the teacher's.

• Resolving background confusion. Figure 7 presents the Grad-CAMs for the fake class of five different datasets, the selected samples of which have complex backgrounds. We can observe that when the teacher is trained with raw data, it produces predictions with nearly perfect confidence (around 1.00) for the fake class. When encountering highly compressed images with complex backgrounds, the ResNet50 baseline model makes wrong predictions, activating not the facial regions but the background (red arrows in Fig. 7). On the other hand, our ADD - ResNet50 model is also trained on low-quality compressed data, but our approach accumulates more distilled knowledge from the teacher and is able to correctly identify and emphasize the actual facial areas, as its teacher similarly does with raw images (green arrows in Fig. 7).

As shown in Fig. 7, we can demonstrate and conclude that when a low-quality compressed image has a complex background, it is easy for a conventional learning model trained without additional information (the second column) to become more vulnerable, making incorrect predictions with high confidence and activating background regions. Meanwhile, with our KD framework, the student is trained under the guidance of its teacher, which can focus more on the facial areas and effectively eliminate the background effects from the real and fake faces.

Interestingly, as shown in Fig. 7, the baseline classifier can possibly be susceptible to and exploited by adversarial attacks, which typically add a small amount of noise to the image. Therefore, one can explore the possibility of performing adversarial attacks on compressed images by adding a small amount of noise to, or disordering, the background. These changes resulting from adversarial attacks would be difficult for humans to distinguish and can easily deceive the output of the baseline classifier. Therefore, it would be interesting to explore adversarial attacks for complex, compressed images and videos. Also, as a defense mechanism, we can consider KD as a framework with our novel attention mechanism for future work to better detect face regions and increase robustness against complex background noise, as shown in Fig. 7.
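Following the description above, a minimal Grad-CAM sketch using forward/backward hooks could look as follows (assuming PyTorch and torchvision; the choice of layer4 of a ResNet50 and the two-class head are illustrative assumptions, not the exact visualization code used here):

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(num_classes=2).eval()
target_layer = model.layer4            # a preceding convolutional block, as described above
acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

def grad_cam(image: torch.Tensor) -> torch.Tensor:
    """Localization map for the highest-probability class of a (1, 3, H, W) input."""
    logits = model(image)
    model.zero_grad()
    logits[0, logits.argmax()].backward()
    weights = grads["v"].mean(dim=(2, 3), keepdim=True)   # channel-wise gradient averages
    cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-8)

heatmap = grad_cam(torch.rand(1, 3, 128, 128))
print(heatmap.shape)  # torch.Size([1, 1, 128, 128])
```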
Figure 5: Ablation study on attention weights α (left column) and β (right column). The blue and orange lines indicate results
with medium compression (c23) and high compression (c40) data, respectively. The solid and dashed lines represent results of
our ADD - ResNet50 student model and the ResNet50 baseline model, respectively.
Figure 6: Grad-CAM outputs of deepfake images from the NeuralTextures, DeepFakes, Face2Face, FaceSwap and FaceShifter datasets. The strongest activation regions are highlighted in red, while blue indicates weaker ones. The red arrows indicate the activation regions that lead to wrong predictions by the ResNet50 baseline trained on highly compressed datasets without our KD framework. Green arrows in the fourth column indicate the facial activation regions of the student, which almost match the teacher's in the third column, resulting from effectively distilling knowledge from the teacher with our ADD framework. (Figure is best viewed in color.)
Figure 7: Class activation maps of deepfake images from the NeuralTextures, DeepFakes, Face2Face, FaceSwap and FaceShifter datasets. The green arrows show the correct facial activation regions where the student successfully mimics its teacher. The red arrows indicate non-facial areas that the ResNet50 baseline relies on to make wrong predictions with high confidence when trained on highly compressed datasets. (Figure is best viewed in color.)
