pression applications. However, many existing works (Yim et al. 2017; Tian, Krishnan, and Isola 2019; Huang and Wang 2017; Passalis and Tefas 2018) applied different types of distillation methods to conventional datasets, e.g., ImageNet, PASCAL VOC 2007, and CIFAR100, but not to deepfake datasets. On the other hand, Zhu et al. (Zhu et al. 2019) used FitNets (Romero et al. 2014) to train a student model that is able to detect low-resolution images, which is similar to our method in that the teacher and the student learn to detect high- and low-quality images, respectively. However, their approach coerces the student to mimic the penultimate layer's distribution from the teacher, so it does not possess rich features at the lower layers.

In order to encourage the student model to mimic the teacher more effectively, Zagoruyko and Komodakis (Zagoruyko and Komodakis 2016) proposed activation-based attention transfer, similar to FitNets, but their approach achieves better performance by creating spatial attention maps. Our multi-view attention method inherits from this approach but carries more generalization ability by not only exploiting spatial attention (in the width and height dimensions), but also introducing attention features from random dimensions using the Radon transform (Helgason 2010). Thus, our approach pushes the student's backbone features closer to the teacher's.

In addition, inspired by the InfoNCE loss (Oord, Li, and Vinyals 2018), Tian et al. (Tian, Krishnan, and Isola 2019) proposed contrastive representation distillation (CRD), which formulates a contrastive learning framework and motivates the student network to drive samples from positive pairs closer, and push away those from negative pairs. Although CRD achieves superior performance to those of previous approaches, it requires a large memory buffer to save the embedding features of each sample. This is restrictive when the training size and embedding space become larger. Instead, we directly sample positive and negative images in the same mini-batch and apply the contrastive loss to embedded features, similar to the Siamese network (Bromley et al. 1994).

Frequency domain learning. In the field of media forensics, several approaches (Jiang et al. 2020; Khayatkhoei and Elgammal 2020; Dzanic, Shah, and Witherden 2019) showed that discrepancies in the high-frequency Fourier spectrum are effective clues to distinguish CNN-generated images. Frank et al. (Frank et al. 2020) and Zhang et al. (Zhang, Karaman, and Chang 2019) utilized the checkerboard artifacts (Odena, Dumoulin, and Olah 2016) of the frequency spectrum caused by up-sampling components of generative adversarial networks (GANs) as effective features in detecting GAN-based fake images. Nevertheless, their detection performance is greatly degraded when the training synthesized images are compressed, becoming low-quality. Qian et al. (Qian et al. 2020) proposed an effective frequency-based forgery detection method, named F3Net, which decomposes an input image into many frequency components and combines them with local frequency statistics on a two-stream network. F3Net, however, doubles the number of parameters of its backbone.

Wasserstein distance. Induced by optimal transport theory, the Wasserstein distance (WD) (Villani 2008) and its variations have been explored in training DNNs to learn a particular distribution, thanks to Wasserstein's underlying geometrically meaningful distance property. In fact, WD-based applications cover a wide range of fields, such as improving generative models (Arjovsky, Chintala, and Bottou 2017; Deshpande, Zhang, and Schwing 2018), learning the distribution of the latent space in autoencoders (Kolouri et al. 2018; Xu et al. 2020), and matching features in domain adaptation tasks (Lee et al. 2019). In this work, we utilize the Wasserstein metric to provide the student geometrically meaningful guidance to efficiently mimic the teacher's tensor distribution. Thus, the student can learn the true tensor distribution, even though its input features are partially degraded through high compression.

Figure 3: Illustration of our frequency attention distiller. The F function is applied to each channel of the input tensor. Distance metric d calculates the discrepancy of the corresponding coefficients of each frequency component from the teacher and the student. Finally, the attention map is obtained by averaging the element-wise differences across channels.

Our Approach

Our Attention-based Deepfake detection Distiller (ADD) consists of the following two novel distillations (see Fig. 2): 1) frequency attention distillation and 2) multi-view attention distillation.
Frequency Attention Distillation

Let f_S and f_T be the student and the pre-trained teacher network, respectively. By forwarding a low-quality compressed input image and its corresponding raw image through f_S and f_T, respectively, we obtain features A_S and A_T ∈ R^{C×W×H} from their backbone networks, which have C channels, a width of W, and a height of H. To create frequency representations, the Discrete Fourier Transform (DFT) F : R^{C×W×H} → C^{C×W×H} is applied to each channel as follows:

F_{A_{S/T}}(c, u, v) = Σ_{x=1}^{W} Σ_{y=1}^{H} A_{S/T}(c, x, y) · e^{−i2π(ux/W + vy/H)},   (1)

where c, x, and y denote the c-th, x-th, and y-th slice in the channel, width, and height dimensions of A_S and A_T, respectively. Here, for convenience, we use the notation F_{A_{S/T}} to denote that the function is independently applied to both the student's and the teacher's backbone features. Then, the value at (u, v) on each single feature map F_{A_{S/T}}(c, :, :) indicates the coefficient of a basic frequency component. The difference between a pair of corresponding coefficients from the teacher and the student represents the "absence" of that student's frequency component. Next, let d : C² → R⁺ be a metric that assesses the distance between two input complex numbers and supports stochastic gradient descent. Then, the frequency loss between the teacher and student can be defined as follows:

L_FR = Σ_{c=1}^{C} Σ_{u=1}^{W} Σ_{v=1}^{H} w(u, v) · d(F_{A_S}(c, u, v), F_{A_T}(c, u, v)),   (2)

where w(u, v) is an attention weight at (u, v). In this work, we utilize the exponential of the difference across channels between the teacher and student as the weight in the following way:

w(u, v) = exp( γ_FR · (1/C) Σ_{c=1}^{C} d(F_{A_S}(c, u, v), F_{A_T}(c, u, v)) ),   (3)

where γ_FR is a positive hyper-parameter that governs how quickly the weight grows as the student's removed frequency content increases. This design of attention weights ensures that the model focuses more on the missing high-frequency components, and makes Eq. 2 partly similar to the focal loss (Lin et al. 2017). Figure 3 visually illustrates our frequency loss.
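A minimal PyTorch sketch of Eqs. 1-3 is given below. This is our own illustration, not the authors' released code: the function name, the default γ_FR, and the choice of d as the magnitude of the complex coefficient difference are assumptions.

import torch

def frequency_attention_loss(a_s, a_t, gamma_fr=1.0):
    # a_s, a_t: backbone features of shape (C, W, H) from the student / teacher.
    f_s = torch.fft.fft2(a_s)                    # Eq. (1): 2-D DFT of every channel (complex tensor)
    f_t = torch.fft.fft2(a_t)
    diff = (f_s - f_t).abs()                     # assumed d(., .): magnitude of the coefficient difference
    w = torch.exp(gamma_fr * diff.mean(dim=0))   # Eq. (3): weight from the mean channel-wise discrepancy at (u, v)
    return (w * diff).sum()                      # Eq. (2): weighted sum over channels and frequencies

In a training loop this loss would simply be added to the classification loss for each student/teacher feature pair; batching adds a leading dimension to both inputs.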
Multi-view Attention Distillation

Sliced Wasserstein distance. The p-Wasserstein distance between two probability measures µ and ν (Villani 2008), with their corresponding probability density functions P_µ and P_ν in a probability space (Ω, P(Ω)), Ω ⊂ R^d, is defined as follows:

W_p(P_µ, P_ν) = ( inf_{π∈Π(µ,ν)} ∫_{Ω×Ω} ψ(x, y)^p dπ(x, y) )^{1/p},   (4)

where Π(µ, ν) is the set of all transportation plans π having the marginal densities P_µ and P_ν, respectively, and ψ : Ω × Ω → R⁺ is a transportation cost function. Equation 4 searches for an optimal transportation plan between µ and ν, which is also known as the Kantorovitch formulation (Kantorovitch 1958). In the case of a one-dimensional probability space, i.e., Ω ⊂ R, the closed-form solution of the p-Wasserstein distance is:

W_p(P_µ, P_ν) = ( ∫_0^1 ψ(F_µ^{−1}(κ), F_ν^{−1}(κ))^p dκ )^{1/p},   (5)

where F_µ and F_ν are the cumulative distribution functions of P_µ and P_ν, respectively.
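As a small illustration of our own (not an example from the paper): for two empirical one-dimensional distributions with the same number of samples, Eq. 5 amounts to matching sorted samples one-to-one. For instance, with µ supported on {1, 3, 5}, ν supported on {2, 4, 9}, and ψ(x, y) = |x − y|, the 1-Wasserstein distance is W_1 = (|1−2| + |3−4| + |5−9|)/3 = 2.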
A variation of the Wasserstein distance, inspired by the above closed-form solution, is the Sliced Wasserstein distance (SWD), which deploys multiple projections from a high-dimensional distribution to various one-dimensional marginal distributions and calculates the optimal transportation cost for each projection. In order to construct these one-dimensional marginal distributions, we use the Radon transform (Helgason 2010), which is defined as follows:

R_{P_µ}(t, θ) = ∫_Ω µ(x) δ(t − ⟨θ, x⟩) dx,  ∀θ ∈ S^{d−1}, ∀t ∈ R,   (6)

where δ denotes the Dirac delta function, ⟨·, ·⟩ is the Euclidean inner product, and S^{d−1} ⊂ R^d is the d-dimensional unit sphere. Thus, we denote R_θ µ as a 1-D marginal distribution of µ under the projection on θ. The Sliced 1-Wasserstein distance is defined as follows:

SW_1(P_µ, P_ν) = ∫_{S^{d−1}} W_1(R_θ µ, R_θ ν) dθ.   (7)

Now, we can calculate the Sliced Wasserstein distance by optimizing a series of 1-D transportation problems, which have a closed-form solution that can be computed in O(N log N) (Rabin et al. 2011). In particular, by sorting R_θ µ and R_θ ν in ascending order using two permutation operators τ_1 and τ_2, respectively, the SWD can be approximated as follows:

SWD(P_µ, P_ν) ≈ Σ_{k=1}^{K} Σ_{i=1}^{N} ψ(R_{θ_k} µ_{τ_1[i]}, R_{θ_k} ν_{τ_2[i]}),   (8)
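To make Eq. 8 concrete, the following PyTorch function is a minimal sketch of our own (not the authors' code): it treats the two inputs as sets of N points in R^d and assumes ψ is the squared difference.

import torch

def sliced_wasserstein(x, y, k=64):
    # x, y: tensors of shape (N, d) holding N samples from each distribution.
    theta = torch.randn(k, x.shape[1], device=x.device)
    theta = theta / theta.norm(dim=1, keepdim=True)   # K random directions, uniform on S^{d-1}
    proj_x, _ = torch.sort(x @ theta.t(), dim=0)      # (N, K): sorted 1-D projections of x
    proj_y, _ = torch.sort(y @ theta.t(), dim=0)      # sorting solves each 1-D problem in closed form
    return ((proj_x - proj_y) ** 2).sum()             # Eq. (8) with quadratic psi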
Figure 4: Detailed illustration of our multi-view attention distillation. Two backbone features of the teacher (top) and the student (bottom). After normalization, the obtained features are projected on a random sample θ_i; then two attention vectors, v_T^{θ_i} and v_S^{θ_i}, are obtained by sorting the projection images and binning them into g groups. Multiple values of θ provide us with multiple attention views on the two tensors.

Positive and negative samples are drawn from the same mini-batch for each instance, which is inspired by the CRD distillation framework of Tian et al. (Tian, Krishnan, and Isola 2019). Thus, our overall multi-view attention loss is defined as follows:

L_MV = γ_MV × SWD(P_{A_S}, P_{A_T}) + η_MV × SWD(P_{A_S}, P_{A_T^+}) + max(∆ − SWD(P_{A_S}, P_{A_T^-}), 0),   (9)
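A possible way to assemble the three terms of Eq. 9 is sketched below; this is our own reading, not the released implementation. Here P_{A_T^+} and P_{A_T^-} denote the teacher features of a positive and a negative sample, γ_MV, η_MV, and ∆ are hyper-parameters, the sliced_wasserstein function is the sketch given after Eq. 8, and flattening each normalized feature tensor to an (N, d) point set is a simplification (Algorithm 1 in the supplementary material describes the binned variant).

import torch

def multi_view_attention_loss(p_s, p_t, p_t_pos, p_t_neg,
                              gamma_mv=1.0, eta_mv=1.0, delta=1.0):
    # p_s: student feature; p_t: matched teacher feature; p_t_pos / p_t_neg:
    # teacher features of a positive / negative in-batch sample (hypothetical names).
    loss = gamma_mv * sliced_wasserstein(p_s, p_t)
    loss = loss + eta_mv * sliced_wasserstein(p_s, p_t_pos)
    loss = loss + torch.clamp(delta - sliced_wasserstein(p_s, p_t_neg), min=0.0)  # hinge term of Eq. (9)
    return loss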
ResNet50                          92.46   90.85   88.91   86.52
FitNet - ResNet50                 97.29   96.29   89.16   90.13
AT - ResNet50                     97.66   97.27   89.75   90.41
NL - ResNet50                     97.34   96.95   91.86   90.78
ADD - ResNet50 (ours)             97.85   97.34   92.49   92.13

FaceShifter
Rössler et al.                    93.04   93.16   89.20   87.12
Dogonadze et al.                  89.80   89.36   82.03   79.96
F3Net                             95.10   95.02   89.13   88.69
ResNet50                          94.89   93.88   89.56   88.48
FitNet - ResNet50                 96.63   95.95   90.16   89.36
AT - ResNet50                     96.32   96.76   88.28   89.45
NL - ResNet50                     96.24   95.28   90.04   87.71
ADD - ResNet50 (ours)             96.60   95.84   91.64   90.27

Table 1: Experimental results of our proposed method and seven other baseline approaches on five different deepfake datasets. The best results are highlighted in bold.

Model                               ACC (%)
ResNet (baseline)                   60.27
Our ResNet (FR)                     67.03
Our ResNet (MV w/o contrastive)     67.01
Our ResNet (MV w/ contrastive)      68.14
Our ResNet (FR+MV)                  68.53

Table 2: The effect of each single attention module on the final results, experimented on the NeuralTextures dataset.

We make the following observations when training a LQ deepfake detector with our ADD framework. The gallery of Grad-CAM visualizations is included in the supplementary material. First, our ADD is able to correct the facial artifacts' attention of the LQ detector to resemble its teacher trained on raw datasets. Second, ADD vigorously instructs the student model to neglect the background noise and activate the facial areas as its teacher does when encountering facial images in complex backgrounds. Meanwhile, the baseline model, which is solely trained on LQ datasets, steadily makes wrong predictions with high confidence by activating non-facial areas and is deceived by complex backgrounds.

Conclusion

In this paper, we proposed a novel Attention-based Deepfake detection Distiller (ADD), exploring frequency attention distillation and multi-view attention distillation in a KD framework to detect highly compressed deepfakes. The frequency attention helps the student to retrieve and focus more on high-frequency components from the teacher. The multi-view attention, inspired by the Sliced Wasserstein distance, pushes the student's output tensor distribution toward the teacher's, maintaining correlated pixel features between tensor elements from multiple views (slices). Our experiments demonstrate that our proposed method is highly effective and achieves competitive results in most cases when detecting extremely challenging, highly compressed LQ deepfakes. Our code is available at https://github.com/Leminhbinh0209/ADD.git.
Acknowledgments

This work was partly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00421, AI Graduate School Support Program (Sungkyunkwan University)), (No. 2019-0-01343, Regional strategic industry convergence security core talent training business) and the Basic Science Research Program through National Research Foundation of Korea (NRF) grant funded by Korea government MSIT (No. 2020R1C1C1006004). Also, this research was partly supported by IITP grant funded by the Korea government MSIT (No. 2021-0-00017, Original Technology Development of Artificial Intelligence Industry) and was partly supported by the Korea government MSIT, under the High-Potential Individuals Global Training Program (2020-0-01550) supervised by the IITP.
References

Accessed. 2021-Jan-01a. DeepFakes GitHub. https://github.com/deepfakes/faceswap. (Cited on page 6.)
Accessed. 2021-Jan-01b. FaceForensics Benchmark. http://kaldir.vc.in.tum.de/faceforensics_benchmark/. (Cited on page 6.)
Accessed. 2021-Jan-01c. FaceSwap GitHub. https://github.com/MarekKowalski/FaceSwap/. (Cited on pages 6 and 10.)
Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. In International conference on machine learning, 214–223. PMLR. (Cited on page 4.)
Bocci, C.; Carlini, E.; and Kileel, J. 2016. Hadamard products of linear spaces. Journal of Algebra, 448: 595–617. (Cited on page 5.)
Bonneel, N.; Rabin, J.; Peyré, G.; and Pfister, H. 2015. Sliced and radon wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1): 22–45. (Cited on page 2.)
Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; and Shah, R. 1994. Signature verification using a "siamese" time delay neural network. Advances in neural information processing systems, 737–737. (Cited on page 3.)
Cao, Q.; Shen, L.; Xie, W.; Parkhi, O. M.; and Zisserman, A. 2018. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), 67–74. IEEE. (Cited on page 6.)
Catherine, S. 2019. Fraudsters Used AI to Mimic CEO's Voice in Unusual Cybercrime Case. https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402. (Cited on page 1.)
Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; and Chua, T.-S. 2017. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5659–5667. (Cited on page 5.)
Cole, S. 2018. We Are Truly Fucked: Everyone Is Making AI-Generated Fake Porn Now. (Cited on page 1.)
Deshpande, I.; Zhang, Z.; and Schwing, A. G. 2018. Generative modeling using the sliced wasserstein distance. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3483–3491. (Cited on page 4.)
Dogonadze, N.; Obernosterer, J.; and Hou, J. 2020. Deep Face Forgery Detection. arXiv preprint arXiv:2004.11804. (Cited on page 6.)
Dzanic, T.; Shah, K.; and Witherden, F. 2019. Fourier Spectrum Discrepancies in Deep Network Generated Images. arXiv preprint arXiv:1911.06465. (Cited on pages 1 and 3.)
Frank, J.; Eisenhofer, T.; Schönherr, L.; Fischer, A.; Kolossa, D.; and Holz, T. 2020. Leveraging frequency analysis for deep fake image recognition. In International Conference on Machine Learning, 3247–3258. PMLR. (Cited on page 3.)
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778. (Cited on pages 6 and 7.)
Helgason, S. 2010. Integral geometry and Radon transforms. Springer Science & Business Media. (Cited on pages 3 and 5.)
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. (Cited on pages 1 and 2.)
Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; and Wei, Y. 2018. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3588–3597. (Cited on page 2.)
Huang, Z.; and Wang, N. 2017. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219. (Cited on pages 1 and 3.)
Jeon, H.; Bang, Y.; Kim, J.; and Woo, S. S. 2020. T-GD: Transferable GAN-generated Images Detection Framework. arXiv preprint arXiv:2008.04115. (Cited on page 2.)
Jiang, L.; Dai, B.; Wu, W.; and Loy, C. C. 2020. Focal Frequency Loss for Generative Models. arXiv preprint arXiv:2012.12821. (Cited on page 3.)
Kantorovitch, L. 1958. On the translocation of masses. Management science, 5(1): 1–4. (Cited on page 4.)
Khayatkhoei, M.; and Elgammal, A. 2020. Spatial Frequency Bias in Convolutional Generative Adversarial Networks. arXiv preprint arXiv:2010.01473. (Cited on pages 1 and 3.)
King, D. E. 2009. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research, 10: 1755–1758. (Cited on page 6.)
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. (Cited on page 6.)
Kolouri, S.; Pope, P. E.; Martin, C. E.; and Rohde, G. K. 2018. Sliced Wasserstein auto-encoders. In International Conference on Learning Representations. (Cited on page 4.)
Lee, C.-Y.; Batra, T.; Baig, M. H.; and Ulbricht, D. 2019. Sliced wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10285–10295. (Cited on page 4.)
Li, L.; Bao, J.; Yang, H.; Chen, D.; and Wen, F. 2019. Faceshifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457. (Cited on pages 2, 6, and 10.)
Li, L.; Bao, J.; Zhang, T.; Yang, H.; Chen, D.; Wen, F.; and Guo, B. 2020. Face x-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5001–5010. (Cited on page 2.)
Li, Y.; and Lyu, S. 2018. Exposing deepfake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656. (Cited on pages 1 and 2.)
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, 2980–2988. (Cited on page 4.)
Morgan, N.; and Bourlard, H. 1989. Generalization and parameter estimation in feedforward nets: Some experiments. Advances in neural information processing systems, 2: 630–637. (Cited on page 2.)
Nitzan, Y.; Bermano, A.; Li, Y.; and Cohen-Or, D. 2020. Face identity disentanglement via latent space mapping. ACM Transactions on Graphics (TOG), 39(6): 1–14. (Cited on page 1.)
Odena, A.; Dumoulin, V.; and Olah, C. 2016. Deconvolution and checkerboard artifacts. Distill (2016). (Cited on page 3.)
Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. (Cited on page 3.)
Passalis, N.; and Tefas, A. 2018. Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision (ECCV), 268–284. (Cited on pages 1 and 3.)
Pidhorskyi, S.; Adjeroh, D. A.; and Doretto, G. 2020. Adversarial latent autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14104–14113. (Cited on page 1.)
Qian, Y.; Yin, G.; Sheng, L.; Chen, Z.; and Shao, J. 2020. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European Conference on Computer Vision, 86–103. Springer. (Cited on page 6.)
Quandt, T.; Frischlich, L.; Boberg, S.; and Schatto-Eckrodt, T. 2019. Fake news. The international encyclopedia of Journalism Studies, 1–6. (Cited on page 1.)
Rabin, J.; Peyré, G.; Delon, J.; and Bernot, M. 2011. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, 435–446. Springer. (Cited on page 5.)
Rahmouni, N.; Nozick, V.; Yamagishi, J.; and Echizen, I. 2017. Distinguishing computer graphics from natural images using convolution neural networks. In 2017 IEEE Workshop on Information Forensics and Security (WIFS), 1–6. IEEE. (Cited on page 2.)
Richardson, E.; Alaluf, Y.; Patashnik, O.; Nitzan, Y.; Azar, Y.; Shapiro, S.; and Cohen-Or, D. 2020. Encoding in style: a stylegan encoder for image-to-image translation. arXiv preprint arXiv:2008.00951. (Cited on page 1.)
Romero, A.; Ballas, N.; Kahou, S. E.; Chassang, A.; Gatta, C.; and Bengio, Y. 2014. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550. (Cited on pages 3, 6, and 10.)
Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; and Nießner, M. 2019. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1–11. (Cited on pages 1, 2, and 6.)
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, 618–626. (Cited on pages 7 and 11.)
Siarohin, A.; Lathuilière, S.; Tulyakov, S.; Ricci, E.; and Sebe, N. 2020. First order motion model for image animation. arXiv preprint arXiv:2003.00196. (Cited on page 1.)
Smith, L. N.; and Topin, N. 2019. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, 1100612. International Society for Optics and Photonics. (Cited on page 6.)
Tan, M.; and Le, Q. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, 6105–6114. PMLR. (Cited on page 7.)
Thies, J.; Zollhöfer, M.; and Nießner, M. 2019. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG), 38(4): 1–12. (Cited on pages 6 and 10.)
Thies, J.; Zollhofer, M.; Stamminger, M.; Theobalt, C.; and Nießner, M. 2016. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2387–2395. (Cited on pages 6 and 10.)
Tian, Y.; Krishnan, D.; and Isola, P. 2019. Contrastive representation distillation. arXiv preprint arXiv:1910.10699. (Cited on pages 1, 2, 3, and 5.)
Villani, C. 2008. Optimal transport: old and new, volume 338. Springer Science & Business Media. (Cited on pages 3 and 4.)
Wang, R.; Ma, L.; Juefei-Xu, F.; Xie, X.; Wang, J.; and Liu, Y. 2019. Fakespotter: A simple baseline for spotting ai-synthesized fake faces. arXiv preprint arXiv:1909.06122, 2. (Cited on pages 1 and 2.)
Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7794–7803. (Cited on pages 2, 6, and 10.)
Xu, H.; Luo, D.; Henao, R.; Shah, S.; and Carin, L. 2020. Learning autoencoders with relational regularization. In International Conference on Machine Learning, 10576–10586. PMLR. (Cited on page 4.)
Xu, Z.-Q. J.; Zhang, Y.; and Xiao, Y. 2019. Training behavior of deep neural network in frequency domain. In International Conference on Neural Information Processing, 264–274. Springer. (Cited on page 2.)
Yim, J.; Joo, D.; Bae, J.; and Kim, J. 2017. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4133–4141. (Cited on page 3.)
Zagoruyko, S.; and Komodakis, N. 2016. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. (Cited on pages 3, 6, and 10.)
Zhang, X.; Karaman, S.; and Chang, S.-F. 2019. Detecting and simulating artifacts in gan fake images. In 2019 IEEE International Workshop on Information Forensics and Security (WIFS), 1–6. IEEE. (Cited on pages 1 and 3.)
Zhu, M.; Han, K.; Zhang, C.; Lin, J.; and Wang, Y. 2019. Low-resolution visual recognition via deep feature distillation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3762–3766. IEEE. (Cited on page 3.)
Supplementary materials

A. Multi-View Attention Algorithm

Algorithm 1 presents the pseudo-code for multi-view attention distillation between two corresponding backbone features using the Sliced Wasserstein distance (SWD) from the student and teacher models. For simplicity, we formulate how each single projection θ_i contributes to the total SWD. However, in practice, K uniform vectors θ_i in S^{d−1} can be sampled simultaneously by deep learning libraries, e.g., TensorFlow or PyTorch, and the projection operation or binning can be vectorized.

B. Datasets

We describe the five different deepfake datasets used in our experiments:

• NeuralTextures. Facial reenactment is an application of the Neural Textures (Thies, Zollhöfer, and Nießner 2019) technique that is used in video re-rendering. This approach includes learned feature maps stored on top of 3D mesh proxies, called neural textures, and a deferred neural renderer. The NeuralTextures dataset used in our experiment includes facial modifications of the mouth regions, while the other face regions remain the same.

• DeepFakes. The DeepFakes dataset is generated using two autoencoders with a shared encoder, each of which is trained on the source and target faces, respectively. Fake faces are generated by decoding the source face's embedding representation with the target face's decoder. Note that DeepFakes, at the beginning, was a specific facial swapping technique, but the term now refers to AI-generated facial manipulation methods in general.

• Face2Face. Face2Face (Thies et al. 2016) is a real-time facial reenactment approach, in which the target person's expression follows the source person's, while his/her identity is preserved. Particularly, the identity corresponding to the target face is recovered by a non-rigid model-based bundling approach on a set of key-frames that are manually selected in advance. The source face's expression coefficients are transferred to the target, while maintaining the environment lighting as well as the target background.

• FaceSwap. FaceSwap (Accessed 2021-Jan-01c) is a lightweight application that is built upon the graphic structures of source and target faces. A 3D model is designed to fit 68 facial landmarks extracted from the source face. Then, it projects the facial regions back to the target face by minimizing the pair-wise landmark errors and applies color correction in the final step.

• FaceShifter. FaceShifter (Li et al. 2019) is a two-stage face swapping framework. The first stage includes an encoder-based multi-level feature extractor used for a target face and a generator that has Adaptive Attentional Denormalization (AAD) layers. In particular, AAD is able to blend the identity and the features into a synthesized face. In the second stage, they developed a novel Heuristic Error Acknowledging Refinement Network in order to enhance facial occlusions.

C. Distillation Baseline Methods

In our experiment, we actively integrate three well-known distillation losses into the teacher-student training framework for comparison with ours:

• FitNet (Romero et al. 2014). The FitNet method proposed a hint-based algorithm, in which the student's guided layers try to predict the outputs of the teacher's hint layers. We apply this hint-based learning approach on the penultimate layer of the teacher and student.

• Attention Transfer (Zagoruyko and Komodakis 2016) (AT). The attention method transfers attention maps, which are obtained by summing up spatial values across the backbone features' channels, from the teacher to the student (a minimal sketch is given after this list).

• Non-local (Wang et al. 2018) (NL). The non-local module generates self-attention features from the student's and teacher's backbone features. Subsequently, the student's self-attention tensors attempt to mimic the teacher's.
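The sketch below shows the AT baseline as we read it from (Zagoruyko and Komodakis 2016); the exact normalization and reduction are our assumptions rather than the baseline's original code, and FitNet's hint loss would analogously be a regression loss between the chosen student and teacher layers.

import torch
import torch.nn.functional as F

def at_map(a):
    # a: backbone feature of shape (B, C, W, H) -> L2-normalized spatial map of shape (B, W*H).
    m = a.pow(2).sum(dim=1).flatten(1)    # sum of squared activations across channels
    return F.normalize(m, dim=1)

def at_loss(a_s, a_t):
    # Mean squared difference between the student's and teacher's attention maps.
    return (at_map(a_s) - at_map(a_t)).pow(2).mean()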
D. Evaluation Metrics

The results in our experiments are evaluated based on the following metrics:

• Accuracy (ACC). ACC is widely used to evaluate a classifier's performance, and it calculates the proportion of samples whose true classes are predicted with the highest probability. The ACC of a model f_θ tested on a test set of N samples {(x_1, y_1), ..., (x_N, y_N)} is formulated as follows:

ACC = ( Σ_{i=1}^{N} I(arg max(f_θ(x_i)), y_i) ) / N,   (11)

where I(·, ·) is the Kronecker delta function.

• Recall at k (R@k). R@k indicates the proportion of test samples with at least one observed sample from the same class among the k nearest neighbors determined in a particular feature space. A large R@k implies small intra-class variation, which usually leads to better accuracy. R@k is formulated as follows:

R@k = 1 − ( Σ_{i=1}^{N} I( Σ_{j=1}^{k} I(neighbor_{x_i}[j], y_i), 0 ) ) / N,   (12)

where neighbor_{x_i}[j] is the label of the j-th nearest neighbor of x_i, and I(·, ·) is the Kronecker delta function. We use the Euclidean distance to measure the distances between queried and referenced samples, whose features are the penultimate layer's outputs, and we adopt R@1, which considers the first nearest neighbor of a test sample.
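For k = 1, the two metrics can be computed with the following minimal PyTorch sketch (our own illustration; the function names are ours):

import torch

def accuracy(logits, labels):
    # logits: (N, num_classes); labels: (N,) integer class indices. Eq. (11).
    return (logits.argmax(dim=1) == labels).float().mean().item()

def recall_at_1(features, labels):
    # features: (N, d) penultimate-layer outputs; labels: (N,). Eq. (12) with k = 1.
    dist = torch.cdist(features, features)     # pairwise Euclidean distances
    dist.fill_diagonal_(float('inf'))          # a sample is not its own neighbor
    nearest = dist.argmin(dim=1)
    return (labels[nearest] == labels).float().mean().item()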
Algorithm 1: Multi-view attention distillation using the Sliced Wasserstein distance (SWD)

Require: Two backbone features A_S and A_T, obtained from the respective low-quality compressed and raw images by the student and teacher networks, the number of random projections K, and the number of bins g.
1:  P_{A_S}, P_{A_T} ← A_S°² / ||A_S||²_F, A_T°² / ||A_T||²_F            ▷ normalize A_S and A_T
2:  SWD ← 0                                                              ▷ initialize SWD to zero
3:  for iteration i ← 1 to K do
4:      θ_i ← U(S^{d−1})                                                 ▷ uniformly sample θ_i from the unit sphere in R³
5:      u_S^{θ_i}, u_T^{θ_i} ← R_{θ_i} P_{A_S}, R_{θ_i} P_{A_T}          ▷ project P_{A_S} and P_{A_T} on θ_i
6:      u_{S,τ_1}^{θ_i}, u_{T,τ_2}^{θ_i} ← Sort(u_S^{θ_i}), Sort(u_T^{θ_i})   ▷ sort u_S^{θ_i} and u_T^{θ_i} in ascending order
7:      v_S^{θ_i}, v_T^{θ_i} ← G(u_{S,τ_1}^{θ_i}), G(u_{T,τ_2}^{θ_i})    ▷ partition u_{S,τ_1}^{θ_i} and u_{T,τ_2}^{θ_i} into g bins
8:      SWD ← SWD + (1/g) Σ_{j=1}^{g} ψ(v_S^{θ_i}[j], v_T^{θ_i}[j])      ▷ apply 1-D transportation cost with quadratic loss ψ
9:  end for
10: return SWD
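Below is a vectorized PyTorch reading of Algorithm 1, in the spirit of the remark in Section A that all K directions can be sampled at once and the binning vectorized. It is a sketch under our own assumptions, not the released implementation: the grid embedding in [0, 1]³, the histogram approximation of the Radon projection with n_proj_bins bins (assumed divisible by g), and the use of bin means for the operator G are our choices.

import torch

def multi_view_swd(a_s, a_t, k=64, g=32, n_proj_bins=128):
    # a_s, a_t: backbone features of shape (C, W, H).
    c, w, h = a_s.shape
    # Step 1: element-wise square normalized by the squared Frobenius norm,
    # so each tensor becomes a probability mass over the C x W x H grid.
    p_s = a_s.pow(2).flatten() / a_s.pow(2).sum()
    p_t = a_t.pow(2).flatten() / a_t.pow(2).sum()
    # Voxel coordinates in [0, 1]^3 (an assumption about how the grid is embedded).
    cc, xx, yy = torch.meshgrid(torch.linspace(0, 1, c),
                                torch.linspace(0, 1, w),
                                torch.linspace(0, 1, h), indexing="ij")
    coords = torch.stack([cc, xx, yy], dim=-1).reshape(-1, 3)
    # Steps 3-4, vectorized: sample all K directions on S^2 at once.
    theta = torch.randn(k, 3)
    theta = theta / theta.norm(dim=1, keepdim=True)
    t = coords @ theta.t()                               # (num_voxels, K) projected coordinates
    # Step 5: approximate the Radon projection by histogramming mass along t.
    idx = ((t - t.min(0).values) /
           (t.max(0).values - t.min(0).values + 1e-8) * (n_proj_bins - 1)).long()
    u_s = torch.zeros(k, n_proj_bins)
    u_t = torch.zeros(k, n_proj_bins)
    for i in range(k):                                   # accumulate mass per projection bin
        u_s[i].index_add_(0, idx[:, i], p_s)
        u_t[i].index_add_(0, idx[:, i], p_t)
    # Step 6: sort each projection; Step 7: partition into g groups (bin means, an assumption).
    v_s = torch.sort(u_s, dim=1).values.reshape(k, g, -1).mean(dim=2)
    v_t = torch.sort(u_t, dim=1).values.reshape(k, g, -1).mean(dim=2)
    # Step 8: quadratic 1-D transportation cost, averaged over the g groups, summed over K.
    return ((v_s - v_t) ** 2).mean(dim=1).sum()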