
CAPSULE-FORENSICS: USING CAPSULE NETWORKS TO DETECT FORGED IMAGES AND VIDEOS

Huy H. Nguyen⋆, Junichi Yamagishi⋆†‡, and Isao Echizen⋆†

⋆ SOKENDAI (The Graduate University for Advanced Studies), Kanagawa, Japan
† National Institute of Informatics, Tokyo, Japan
‡ The University of Edinburgh, Edinburgh, UK
Email: {nhhuy, jyamagis, iechizen}@nii.ac.jp

ABSTRACT

Recent advances in media generation techniques have made it easier for attackers to create forged images and videos. State-of-the-art methods enable the real-time creation of a forged version of a single video obtained from a social network. Although numerous methods have been developed for detecting forged images and videos, they are generally targeted at certain domains and quickly become obsolete as new kinds of attacks appear. The method introduced in this paper uses a capsule network to detect various kinds of spoofs, from replay attacks using printed images or recorded videos to computer-generated videos using deep convolutional neural networks. It extends the application of capsule networks beyond their original intention to the solving of inverse graphics problems.

Index Terms— computer-generated video, replay attack, forgery detection, capsule network

1. INTRODUCTION

Forged images and videos can be used to bypass facial authentication and to create fake news media. The quality of manipulated images and videos has seen significant improvement with the development of advanced network architectures and the use of large amounts of training data. This has dramatically simplified the creation of facial forgeries. Nowadays, the only thing needed to create a forged facial image is simply a short video of the target person [1, 2] or an ID photo [3, 4]. The techniques developed by Chung et al. [4] and Suwajanakorn et al. [5] can improve the ability of attackers to learn the mapping between speech and lip motion, enabling the creation of fully synthesized audio-video data for any person. In this age of social networks serving as major sources of information, fake news with manipulated multimedia can quickly spread and have significant effects. The deepfake phenomenon [6] is a good example of this threat: any person with a personal computer can create videos incorporating the facial image of any celebrity by using a human image synthesis technique based on artificial intelligence.

Several countermeasures have been proposed to deal with manipulated images and videos. However, most of them are aimed at particular types of attacks. For example, local binary pattern (LBP)-based methods [7, 8] are effective against replay attacks, in which the attacker places a printed photo or displays a video on a screen in front of the camera. However, the eyes-focused method designed to detect a deepfake forgery [9] can fail against a replay attack when the video displayed is of the actual target person. Other methods have more generalized ability; for instance, Fridrich and Kodovsky's method [10] can be applied both for steganalysis and for detecting facial reenactment videos. However, its performance on secondary tasks is limited in comparison with task-specific methods like that of Rossler et al. [11]. Moreover, while some methods can detect a single forged image [11, 12, 13], others require video input [9].

This paper presents a method that uses a capsule network to detect forged images and videos in a wide range of forgery scenarios, including replay attack detection and (both fully and partially) computer-generated image/video detection. This is pioneering work in the use of capsule networks [14, 15, 16], which were originally designed for computer vision problems, to solve digital forensics problems. A comprehensive survey of state-of-the-art related work and intensive comparisons using four major datasets demonstrated the superior performance of the proposed method.

2. RELATED WORK

In this section, we group forgery detection approaches into replay attack detection and computer-generated image/video detection on the basis of the features used and their target. Note that some approaches are two-fold while others are applicable only to certain types of attacks. We also provide some basic information about capsule networks and the dynamic routing algorithm that made this kind of network practical.



2.1. Replay Attack Detection

Prior to the current deep learning era, LBP methods were the primary defense against replay attacks [7, 8]. The method introduced by Kim et al. [17], which is based on local patterns of the diffusion speed (local speed patterns), achieves higher accuracy than that of LBP-based methods. Now, with the introduction of deep learning, the ability to detect replay attacks has been greatly improved. The method of Yang et al. [18] uses a support vector machine to classify features extracted by a pre-trained convolutional neural network (CNN). That of Menotti et al. [19] uses a similar procedure but optimizes the filters in an available high-performance CNN architecture. The method of Alotaibi and Mahmood [20] uses nonlinear diffusion based on an additive operator splitting scheme in their own CNN. The recently introduced method of Ito et al. [21] leverages a pre-trained CNN and utilizes the whole image instead of only the extracted face region.

2.2. Computer-Generated Image/Video Detection

There are several state-of-the-art methods for detecting images or videos generated by computer using, for example, a deepfake technique for face swapping [6], the Face2Face method for facial reenactment [1], or the deep video portraits technique [2] for the purpose of forgery. Fridrich and Kodovsky [10] proposed a hand-crafted-feature noise-based approach for steganalysis that can also be used for forgery detection. Cozzolino et al. [22] implemented a CNN version of this approach. Raghavendra et al. [23] described the special case of fine-tuning two available CNNs, while Rossler et al. [11] used only one CNN. Bayar and Stamm [24], Rahmouni et al. [25], Afchar et al. [13], Quan et al. [26], and Li et al. [9] proposed their own networks. Li et al.'s network [9], for example, is video based and uses temporal information to detect eye blinking. We used a hybrid approach [12] incorporating part of a pre-trained VGG (Visual Geometry Group)-19 network [27] and a proposed CNN. Zhou et al. [28] proposed a two-stream network.

2.3. Capsule Networks

Hinton et al. [14] addressed the limitations of CNNs applied to inverse graphics tasks and laid the foundation for a more robust "capsule" architecture in 2011. However, this complex architecture could not be effectively implemented at the time due to the lack of an efficient algorithm and the limitations of computer hardware. Instead, easy-to-design, easy-to-train CNNs became widely used. Now, with the introduction of the dynamic routing algorithm [15] and the expectation-maximization routing algorithm [16], capsule networks have been implemented with remarkable initial results. Two recent studies demonstrated that, with the agreement between capsules calculated by the dynamic routing algorithm, the hierarchical pose relationships between object parts can be well described. This has improved the accuracy of vision tasks. Application of a capsule network to the forensics task, the focus of this paper, is a challenging problem. However, the agreement between capsules achieved by using the dynamic routing algorithm could boost detection performance on complex and nearly flawless forged images and videos.

3. CAPSULE-FORENSICS

3.1. Overview

[Fig. 1. Overview of proposed method: pre-processing → VGG-19 feature extraction → capsule network → post-processing.]

The proposed method (Fig. 1) works for both images and videos. For video input, the video is split into frames in the pre-processing phase. The classification results (posterior probabilities) are then acquired from the frames. The probabilities are averaged in the post-processing phase to get the final result. The remaining parts are constructed the same way as when the input is an image.

In the pre-processing phase, faces are detected and scaled to 128 × 128. As in our previous work [12], we use part of the VGG-19 network [27] to extract the latent features, which are the inputs to the capsule network. Unlike in our previous work, we take the output of the third max-pooling layer instead of three outputs before the ReLU layers. We do this because we need to reduce the size of the inputs to the capsule network.
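As an illustrative sketch of this pipeline (not our exact implementation), the following PyTorch code extracts latent features with the part of VGG-19 up to the third max-pooling layer and averages the per-frame posteriors. The features[:19] slice, the extract_face placeholder, and the capsule_network argument are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Part of VGG-19 up to and including the third max-pooling layer.
# In torchvision's vgg19().features this corresponds to indices 0..18
# (an assumption about the layer indexing, used here for illustration).
vgg19_part = models.vgg19(pretrained=True).features[:19].eval()

def extract_face(frame):
    """Placeholder for face detection + cropping; here we simply resize the
    whole frame to 128x128 (a real pipeline would crop the detected face)."""
    return F.interpolate(frame.unsqueeze(0), size=(128, 128),
                         mode='bilinear', align_corners=False).squeeze(0)

def frame_probability(frame, capsule_network):
    """One frame: face crop -> VGG-19 latent features -> capsule network."""
    face = extract_face(frame)                  # (3, 128, 128)
    with torch.no_grad():
        latent = vgg19_part(face.unsqueeze(0))  # (1, 256, 16, 16)
        return capsule_network(latent)          # posterior probabilities

def classify_video(frames, capsule_network):
    """Post-processing: average the per-frame posteriors for the final result."""
    probs = [frame_probability(f, capsule_network) for f in frames]
    return torch.stack(probs).mean(dim=0)
```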
3.2. Capsule Design

[Fig. 2. Overall design of the capsule-forensics network: three primary capsules, each consisting of 2D convolution, statistical pooling, and 1D convolution, whose outputs are dynamically routed to a "real image" output capsule and a "fake image" output capsule.]

The proposed network consists of three primary capsules and two output capsules, one for real and one for fake images (Fig. 2). The latent features extracted by part of the VGG-19 network [27] are the inputs, which are distributed to the three primary capsules (Fig. 3). As in our previous work [12], statistical pooling, which is important for forgery detection, is used. The outputs of the three capsules (u_{j|i}) are dynamically routed to the output capsules (v_j) for r iterations using Algorithm 1. The network has approximately 2.8 million parameters, a relatively small number for such networks. We slightly improved the algorithm of Sabour et al. [15] by adding Gaussian random noise to the 3-D weight tensor W and applying one additional squash (equation 1) before routing by iterating. The added noise helps reduce over-fitting while the additional equation keeps the network more stable. The outputs of the primary and output capsules are illustrated in Fig. 4.

[Fig. 3. Detailed design of a primary capsule (convolution, batch normalization, ReLU, and statistical pooling). Upper numbers indicate the number of filters (depth); lower numbers indicate the size of the outputs of the corresponding filters.]

[Fig. 4. Average results calculated by the primary capsules and the output capsules for real and fake images generated with the Face2Face method [1]. The three primary capsules react significantly differently to real and fake inputs. Although their weights also differ, there is strong agreement in the output capsules.]
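A hedged PyTorch sketch of one primary capsule from Fig. 3 follows. The conv/BN/ReLU blocks, statistical pooling (per-filter mean and variance), and 1D convolutions mirror the figure, while the kernel sizes, strides, and padding are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Statistical pooling: reduce each feature map to its mean and variance."""
    def forward(self, x):                       # x: (batch, channels, H, W)
        flat = x.flatten(start_dim=2)           # (batch, channels, H*W)
        return torch.stack([flat.mean(dim=2), flat.var(dim=2)], dim=1)

class PrimaryCapsule(nn.Module):
    """One primary capsule: 2D conv blocks -> stats pooling -> 1D convs."""
    def __init__(self, in_channels=256):        # assumed VGG-19 feature depth
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16), nn.ReLU(inplace=True),
        )
        self.stats = StatsPooling()
        self.conv1d = nn.Sequential(            # assumed kernel/stride/padding
            nn.Conv1d(2, 8, kernel_size=5, stride=2, padding=2),
            nn.Conv1d(8, 1, kernel_size=3, padding=1),
        )

    def forward(self, x):                       # x: (batch, 256, 16, 16)
        u = self.conv1d(self.stats(self.features(x)))
        return u.squeeze(1)                     # capsule output u_{j|i}, dim 8
```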

Algorithm 1 Dynamic routing between capsules.

procedure ROUTING(u_{j|i}, W, r)
    Ŵ ← W + rand(size(W))
    û_{j|i} ← Ŵ_i squash(u_{j|i})        ▷ W_i ∈ R^{m×n}
    for all input capsules i and all output capsules j do
        b_{ij} ← 0
    for r iterations do
        for all input capsules i do c_i ← softmax(b_i)
        for all output capsules j do s_j ← Σ_i c_{ij} û_{j|i}
        for all output capsules j do v_j ← squash(s_j)
        for all input capsules i and output capsules j do
            b_{ij} ← b_{ij} + û_{j|i} · v_j
    return v_j
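A minimal NumPy rendering of Algorithm 1 is given below for clarity; the capsule counts and dimensions in the example are illustrative assumptions, and the noise and extra squash are applied as described above.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Equation (1): v = (|s|^2 / (1 + |s|^2)) * (s / |s|)."""
    norm2 = np.sum(s ** 2)
    return (norm2 / (1.0 + norm2)) * s / (np.sqrt(norm2) + eps)

def routing(u, W, r, noise_std=0.01, training=True):
    """Dynamic routing (Algorithm 1). u: (num_in, n); W: (num_in, num_out, m, n)."""
    num_in, num_out = W.shape[0], W.shape[1]
    # Gaussian noise added to the weight tensor (training phase only).
    W_hat = W + np.random.normal(0.0, noise_std, W.shape) if training else W
    # Extra squash on the inputs, then prediction vectors u_hat[i, j].
    u_sq = np.stack([squash(ui) for ui in u])
    u_hat = np.einsum('ijmn,in->ijm', W_hat, u_sq)
    b = np.zeros((num_in, num_out))                   # routing logits
    for _ in range(r):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)          # c_i = softmax(b_i)
        s = np.einsum('ij,ijm->jm', c, u_hat)         # s_j = sum_i c_ij u_hat_{j|i}
        v = np.stack([squash(sj) for sj in s])        # v_j = squash(s_j)
        b = b + np.einsum('ijm,jm->ij', u_hat, v)     # agreement update
    return v

# Example with assumed sizes: 3 primary capsules (dim 8), 2 output capsules (dim 4).
u = np.random.randn(3, 8)
W = np.random.randn(3, 2, 4, 8) * 0.1
v = routing(u, W, r=2)
```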
The squash function used in Algorithm 1 is defined as

v_j = \mathrm{squash}(s_j) = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|}.    (1)

Unlike Sabour et al.'s work [15], we use the cross-entropy loss function:

L = -\left( y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right),    (2)

where y is the ground-truth label and ŷ is the predicted label calculated using equation 3, in which m is the dimension of the output capsules v_j:

\hat{y} = \mathrm{softmax}\!\left( \frac{1}{m} \sum_{i=1}^{m} \begin{bmatrix} v_1^\top \\ v_2^\top \end{bmatrix}_{:,i} \right).    (3)

The use of equation 3 instead of simply using the length of the output capsules [15] promotes separation between the two output capsules on each dimension.
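The following sketch works through equations (2) and (3) on made-up capsule values; the class ordering (real first) and the values themselves are illustrative assumptions.

```python
import numpy as np

def predict(v1, v2):
    """Equation (3): softmax of the per-dimension mean of the stacked
    output capsules v1 (real) and v2 (fake)."""
    logits = np.stack([v1, v2]).mean(axis=1)   # (2,): mean over the m dimensions
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(y, y_hat, eps=1e-12):
    """Equation (2): binary cross-entropy between label y and prediction y_hat."""
    return -(y * np.log(y_hat + eps) + (1.0 - y) * np.log(1.0 - y_hat + eps))

# Illustrative values only (capsule outputs and label are made up):
v_real = np.array([0.6, 0.7, 0.5, 0.8])
v_fake = np.array([0.1, 0.2, 0.1, 0.3])
y_hat = predict(v_real, v_fake)[0]   # probability assigned to the "real" class
loss = cross_entropy(1.0, y_hat)     # y = 1 for a real image (assumed convention)
```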
4. EVALUATION

To evaluate the advantage of using random noise, we tested the proposed method with and without random noise (Capsule-Forensics-Noise and Capsule-Forensics, respectively). The random noise was generated from a normal distribution N(0, 0.01) and was used in the training phase only. Two iterations (r = 2) were used in the dynamic routing algorithm. We used the half total error rate, HTER = (FRR + FAR)/2, and accuracy, (TP + TN)/(TP + TN + FP + FN), as metrics.
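Both metrics follow directly from the error rates and confusion-matrix counts; a minimal sketch (variable names are ours):

```python
def hter(far, frr):
    """Half total error rate: (FRR + FAR) / 2, with both rates as fractions."""
    return (far + frr) / 2.0

def accuracy(tp, tn, fp, fn):
    """Classification accuracy: (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / float(tp + tn + fp + fn)

# Example: FAR = 0.02 and FRR = 0.04 give HTER = 0.03 (i.e., 3%).
assert abs(hter(0.02, 0.04) - 0.03) < 1e-12
```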
4.1. Replay Attack Detection

To determine the ability of the proposed method to detect replay attacks, we compared its performance with that of eight state-of-the-art detection methods on the well-known Idiap REPLAY-ATTACK dataset [7]. As shown in Table 1, the proposed method with random noise (Capsule-Forensics-Noise), as well as our previous method [12], had an HTER of zero.

Table 1. Half total error rate of state-of-the-art detection methods on the REPLAY-ATTACK dataset [7].

Method                       HTER (%)
Chingovska et al. [7]        17.17
Pereira et al. [8]            8.51
Kim et al. [17]              12.50
Yang et al. [18]              2.30
Menotti et al. [19]           0.75
Alotaibi and Mahmood [20]    10.00
Ito et al. [21]               0.43
Nguyen et al. [12]            0.00
Capsule-Forensics             0.28
Capsule-Forensics-Noise       0.00

4.2. Face Swapping Detection

We determined the ability of our proposed method to detect face swapping using a deepfake technique on the deepfake dataset proposed by Afchar et al. [13] at both the frame and video levels. As shown in Tables 2 and 3, our proposed method with random noise (Capsule-Forensics-Noise) had the highest accuracy in both cases.

Table 2. Accuracy of face swapping detection at frame level on the deepfake dataset [13].

Method                     Accuracy (%)
Meso-4 [13]                89.10
MesoInception-4 [13]       91.70
Nguyen et al. [12]         92.36
Capsule-Forensics          94.47
Capsule-Forensics-Noise    95.93

Table 3. Accuracy of face swapping detection at video level on the deepfake dataset [13].

Method                     Accuracy (%)
Meso-4 [13]                96.90
MesoInception-4 [13]       98.40
Capsule-Forensics          97.69
Capsule-Forensics-Noise    99.23

4.3. Facial Reenactment Detection

We determined the ability of our proposed method to detect facial reenactment on the FaceForensics dataset [11], which was created using the Face2Face method [1]. We strictly followed the authors' guidelines for processing the data. As shown in Table 4, on average, the proposed method (with and without noise) had performance comparable to that of the best-performing state-of-the-art methods.

We also tested our method at the video level and compared its performance with that of Afchar et al.'s MesoNet facial video forgery detection network [13]. For our method, we used only the first ten frames instead of the entire video. As shown in Table 5, our method outperformed Afchar et al.'s network.

Table 4. Accuracy (%) of state-of-the-art facial reenactment detection methods at frame level on the FaceForensics dataset [11] with three levels of compression: no compression (No-C), easy compression (23) (Easy-C), and strong compression (40) (Hard-C).

Method                      No-C    Easy-C   Hard-C
Fridrich & Kodovsky [10]    99.40   75.87    58.16
Cozzolino et al. [22]       99.60   79.80    55.77
Bayar & Stamm [24]          99.53   86.10    73.63
Rahmouni et al. [25]        98.60   88.50    61.50
Raghavendra et al. [23]     97.70   93.50    82.13
Zhou et al. [28]            99.93   96.00    86.83
Rossler et al. [11]         99.93   98.13    87.81
Meso-4 [13]                 94.60   92.40    83.20
MesoInception-4 [13]        96.80   93.40    81.30
Nguyen et al. [12]          98.80   96.10    76.40
Capsule-Forensics           99.13   97.13    81.20
Capsule-Forensics-Noise     99.37   96.50    81.00

Table 5. Comparison (accuracy, %) with the MesoNet network at video level on the FaceForensics dataset [11].

Method                      No-C    Easy-C   Hard-C
Meso-4 [13]                 -       95.30    -
MesoInception-4 [13]        -       95.30    -
Capsule-Forensics           99.33   98.00    82.00
Capsule-Forensics-Noise     99.33   96.00    83.33

4.4. Fully Computer-Generated Image Detection

Finally, we compared the performance of our proposed method with that of state-of-the-art methods at discriminating between computer-generated images (CGIs) and photographic images (PIs) on the dataset proposed by Rahmouni et al. [25]. Once again, as shown in Table 6, our method had the best performance and had perfect accuracy on full-size test images.

Table 6. Accuracy (%) of state-of-the-art methods on discriminating between CGIs and PIs.

Method                      Patch   Full Size
Rahmouni et al. [25]        89.76    99.30
Quan et al. [26]            94.75    99.58
Nguyen et al. [12]          96.55    99.86
Capsule-Forensics           96.75    99.72
Capsule-Forensics-Noise     97.00   100.00

5. CONCLUSION

Our comprehensive experiments demonstrated the feasibility of building a general detection method that is effective for a wide range of forged image and video attacks. They also demonstrated that capsule networks can be used in domains other than computer vision. The proposed use of random noise in the training phase proved beneficial in most cases. Future work will mainly focus on evaluating the ability of the proposed method to resist adversarial machine attacks, especially attacks on the proposed random noise at test time, and on enhancing its ability. It will also focus on making the proposed method robust against mixed attacks, on detecting anomalies, and on raising this critical issue in the research community.

6. ACKNOWLEDGMENTS

This work was supported by JSPS KAKENHI Grant Numbers 16H06302, 17H04687, 18H04120, 18H04112, and 18KT0051 and by JST CREST Grant Number JPMJCR18A6, Japan.

7. REFERENCES

[1] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner, "Face2Face: Real-time face capture and reenactment of RGB videos," in CVPR. IEEE, 2016.
[2] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Nießner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt, "Deep video portraits," in SIGGRAPH. ACM, 2018.
[3] Hadar Averbuch-Elor, Daniel Cohen-Or, Johannes Kopf, and Michael F Cohen, "Bringing portraits to life," ACM TOG, 2017.
[4] Joon Son Chung, Amir Jamaludin, and Andrew Zisserman, "You said that?," arXiv preprint arXiv:1705.02966, 2017.
[5] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman, "Synthesizing Obama: learning lip sync from audio," ACM TOG, 2017.
[6] "Terrifying high-tech porn: Creepy 'deepfake' videos are on the rise," https://www.foxnews.com/tech/terrifying-high-tech-porn-creepy-deepfake-videos-are-on-the-rise, Accessed: 2018-02-17.
[7] Ivana Chingovska, André Anjos, and Sébastien Marcel, "On the effectiveness of local binary patterns in face anti-spoofing," in BIOSIG, 2012.
[8] Tiago de Freitas Pereira, André Anjos, José Mario De Martino, and Sébastien Marcel, "Can face anti-spoofing countermeasures work in a real world scenario?," in ICB. IEEE, 2013.
[9] Yuezun Li, Ming-Ching Chang, Hany Farid, and Siwei Lyu, "In ictu oculi: Exposing AI generated fake face videos by detecting eye blinking," arXiv preprint arXiv:1806.02877, 2018.
[10] Jessica Fridrich and Jan Kodovsky, "Rich models for steganalysis of digital images," IEEE TIFS, 2012.
[11] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner, "FaceForensics: A large-scale video dataset for forgery detection in human faces," arXiv preprint arXiv:1803.09179, 2018.
[12] Huy H Nguyen, Ngoc-Dung T Tieu, Hoang-Quoc Nguyen-Son, Vincent Nozick, Junichi Yamagishi, and Isao Echizen, "Modular convolutional neural network for discriminating between computer-generated images and photographic images," in ARES. ACM, 2018.
[13] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen, "MesoNet: a compact facial video forgery detection network," in WIFS. IEEE, 2018.
[14] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang, "Transforming auto-encoders," in ICANN. Springer, 2011.
[15] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton, "Dynamic routing between capsules," in NIPS, 2017.
[16] Geoffrey E Hinton, Sara Sabour, and Nicholas Frosst, "Matrix capsules with EM routing," in ICLRW, 2018.
[17] Wonjun Kim, Sungjoo Suh, and Jae-Joon Han, "Face liveness detection from a single image via diffusion speed model," IEEE TIP, 2015.
[18] Jianwei Yang, Zhen Lei, and Stan Z Li, "Learn convolutional neural network for face anti-spoofing," arXiv preprint arXiv:1408.5601, 2014.
[19] David Menotti, Giovani Chiachia, Allan Pinto, William Robson Schwartz, Helio Pedrini, Alexandre Xavier Falcao, and Anderson Rocha, "Deep representations for iris, face, and fingerprint spoofing detection," IEEE TIFS, 2015.
[20] Aziz Alotaibi and Ausif Mahmood, "Deep face liveness detection based on nonlinear diffusion using convolution neural network," Signal, Image and Video Processing, 2017.
[21] Koichi Ito, Takehisa Okano, and Takafumi Aoki, "Recent advances in biometrics security: A case study of liveness detection in face recognition," in APSIPA ASC. IEEE, 2017.
[22] Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva, "Recasting residual-based local descriptors as convolutional neural networks: an application to image forgery detection," in IH&MMSEC. ACM, 2017.
[23] R. Raghavendra, Kiran B. Raja, Sushma Venkatesh, and Christoph Busch, "Transferable deep-CNN features for detecting digital and print-scanned morphed face images," in CVPRW. IEEE, 2017.
[24] Belhassen Bayar and Matthew C Stamm, "A deep learning approach to universal image manipulation detection using a new convolutional layer," in IH&MMSEC. ACM, 2016.
[25] Nicolas Rahmouni, Vincent Nozick, Junichi Yamagishi, and Isao Echizen, "Distinguishing computer graphics from natural images using convolution neural networks," in WIFS. IEEE, 2017.
[26] Weize Quan, Kai Wang, Dong-Ming Yan, and Xiaopeng Zhang, "Distinguishing between natural and computer-generated images using convolutional neural networks," IEEE TIFS, 2018.
[27] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[28] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis, "Two-stream neural networks for tampered face detection," in CVPRW. IEEE, 2017.

