Locally GAN-generated Face Detection Based On An Improved Xception
Information Sciences
Article history: Received 8 September 2020; Received in revised form 7 January 2021; Accepted 2 May 2021; Available online 4 May 2021.

Keywords: Generated face; Inception block; Xception; Feature pyramid network.

Abstract: It has become a research hotspot to detect whether a face is natural or GAN-generated. However, all the existing works focus on wholly GAN-generated faces, so an improved Xception model is proposed for locally GAN-generated face detection. To the best of our knowledge, our work is the first to address this issue. The improvements over Xception are as follows: (1) four residual blocks are removed to avoid the overfitting problem as much as possible; (2) an Inception block with dilated convolutions replaces the common convolution layers in the pre-processing module of the Xception to obtain multi-scale features; (3) a feature pyramid network is utilized to obtain multi-level features for the final decision. The first locally GAN-generated face (LGGF) dataset is constructed by the pluralistic image completion method on the basis of the FFHQ dataset; it contains a total of 952,000 images with generated regions of different shapes and sizes. Experimental results demonstrate the superiority of the proposed model, which outperforms some existing models, especially for faces having small generated regions.

© 2021 Elsevier Inc. All rights reserved.
1. Introduction
In daily life, unique biological information, such as the face, fingerprint, iris, and voiceprint, is often used for identity authentication. Compared with other biological information, the face has more advantages, such as lower acquisition cost, richer information, and no need for physical contact. Therefore, face images have been widely applied in identification and authentication services. New applications, such as face payment, face retrieval, and face check-in, have appeared one after another and entered daily life in an all-round way. The "face brushing era" has arrived.
However, the emergence of Generative Adversarial Network (GAN)-based face generation technology makes "face brushing" unsafe. In the past five years, with the development of deep learning, various GAN models have been proposed, such as BEGAN [1], PGGAN [2], and StyleGAN [3]. Consequently, many tools have been published, such as
* Corresponding author.
E-mail address: [email protected] (W. Ding).
https://doi.org/10.1016/j.ins.2021.05.006
Face2Face [4], FaceSwap [5], DeepFake [6], and so on. Even a non-professional person can generate a realistic image with these tools. Although these tools enrich and facilitate people's material and spiritual life, they raise security concerns. The large number of fake faces has brought huge challenges to politics, justice, criminal investigation, reputation protection, and even social stability. For example, according to the Associated Press, a spy used a fake profile photo on a LinkedIn account under the name Katie Jones, a 30-something redhead, to attract would-be targets. She was connected to Washington political figures, including a deputy assistant secretary of state, a senior aide to a senator, and the economist Paul Winfree.
Therefore, GAN-generated face detection is well worth investigating, and many methods have been proposed in the past five years. These methods can be classified into two categories: intrinsic attribute-based [7-10] and deep learning-based [11-18]. The intrinsic attribute-based methods exploit inconsistencies of various attributes between natural images and GAN-generated images, such as global symmetry [7], brightness [8], color differences [9], compression [10], and so on. The deep learning-based methods extract the expected statistical features automatically through network learning, and thus are usually more effective than the intrinsic attribute-based methods. For generated image detection, Quan et al. [11] used two large 7 × 7 kernels in the first two layers to extract distinctive features for discriminating real from fake images. Mo et al. [12] utilized high-pass filters as pre-processing layers and then extracted features from the pre-processed input with a CNN for fake face detection. Nataraj et al. [13] adopted the color co-occurrence matrix as the network input for fake face detection. Hulzebosch et al. [14] considered several pre-processing operations, such as filtering by two high-pass filters, using the co-occurrence matrix, and transforming to the HSV color space. However, Liu et al. [15] pointed out that using handcrafted features as the input leads to a loss of raw data information. Therefore, they used the raw image as the input and proposed Gram-Net, which uses Gram matrix blocks to extract global texture features.
For fake face video detection, Afchar et al. [16] used the Inception block as the initial layers to detect DeepFake and Face2Face manipulations. In order to stack convolution outputs of different shapes and thereby increase the function space for optimizing the model, they improved the Inception block by replacing the common convolution layers with dilated convolution layers. Khodabakhsh et al. [17] utilized common image classification models to detect generated face videos, such as AlexNet, VGG19, ResNet50, Inception-v3, and Xception. Their experimental comparison showed that the Xception was the best of the five models. Rössler et al. [18] constructed a huge dataset for DeepFake and Face2Face detection and also showed that the Xception model achieved the best overall performance on their dataset among the six compared models. In addition, Chang et al. [19] and Wang et al. [20] combined data augmentation with some existing models to improve their performance.
All of the works mentioned above focus on wholly generated face images. However, there are realistic scenarios in which only a small or even tiny part of the image is generated and the rest is natural, for example, face image restoration, glasses removal, and mask removal. Since the generated regions may be very small, and small generated regions may shrink to a point or even vanish in the feature map after a deep network with multiple pooling layers, the whole-generated-face detection methods mentioned above may not be effective in the locally generated case. Therefore, how to extract effective features from such small generated regions is a critical problem.
To the best of our knowledge, this paper is the first work to address locally GAN-generated face detection. The main contributions are as follows:
(1) The first Locally GAN-Generated Face (LGGF) dataset is constructed. Randomly generated binary masks are pasted on natural images from the FFHQ dataset [3] to obtain incomplete images. Then, the incomplete images are restored by the pluralistic image completion method [21] to produce the LGGF dataset, which contains a total of 952,000 images with generated regions of different shapes and sizes.
(2) An improved model based on Xception [22] is proposed for detecting locally GAN-generated faces. The Xception is improved by removing four residual blocks to alleviate overfitting, introducing the Inception block with dilated convolution into the pre-processing module to obtain multi-scale features, and adding the Feature Pyramid Network (FPN) [23] to extract multi-level features for the final decision.
The rest of the paper is organized as follows. Section 2 reviews the relevant works and technologies. Section 3 describes the structure of the proposed model in detail. The experimental dataset construction and results analysis are presented in Section 4. Finally, Section 5 summarizes the paper.
2. Related works

In this section, firstly, the pluralistic image completion method used for constructing the locally GAN-generated dataset is introduced; then, the Xception model for whole generated face detection is reviewed; finally, some preliminaries related to the proposed model are presented, namely the Inception block with the dilated convolutions and the squeeze-and-excitation (SE) module.
2.1. Pluralistic image completion

Traditional image completion methods [24,25] perform well for background completion but are unsatisfactory at reconstructing unique content that is not present in the images [21]. So, some CNN-based methods [26,27] have been proposed. However, these methods may lead to distorted structures and blurred textures [21], especially for large missing regions. Recently, the GAN-based methods [21,28] have become more and more mature. They can generate high-level semantic content, making images more natural and realistic.
In this paper, the pluralistic method proposed by Zheng et al. [21] is utilized to construct the locally GAN-generated face
dataset because the pluralistic method not only improves the visual perception but also does not limit to the attributes of the
target. It has two parallel paths: the reconstructive path and the generative path. As for the reconstructive path, the given
ground truth of the image is utilized to get the prior distribution of the missing parts. Then, the image is rebuilt from the
prior distribution. As for the generative path, the conditional prior is coupled to the distribution obtained in the reconstruc-
tive path.
2.2. Xception model

The Xception model [22] is based on the depthwise separable convolution, which decomposes an ordinary convolution into two steps: a spatial convolution and a pointwise convolution. The spatial convolution operates on each input channel independently. Then, the pointwise convolution uses a 1 × 1 kernel to convolve point by point. This decomposition reduces both the number of parameters and the amount of computation. As shown in Fig. 1, the Xception model consists of 14 residual blocks, which contain 3 common convolution layers and 33 depthwise separable convolutions in total. All three common convolution layers are in the pre-processing module. Moreover, the Xception utilizes global average pooling with a fully connected layer in the decision module.
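As an illustration, the following sketch (our own, in TensorFlow/Keras; not the authors' released code) contrasts the parameter counts of a standard convolution and a depthwise separable convolution for a 64-channel input and 128 output channels:

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(256, 256, 64))

# Standard convolution: one 3x3 kernel per (input, output) channel pair,
# i.e. 3*3*64*128 + 128 biases = 73,856 parameters.
std = layers.Conv2D(128, 3, padding="same")(inp)

# Depthwise separable convolution: a 3x3 spatial filter per input channel,
# then a 1x1 pointwise convolution across channels,
# i.e. 3*3*64 + 64*128 + 128 biases = 8,896 parameters.
sep = layers.SeparableConv2D(128, 3, padding="same")(inp)
```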
In [18], the Xception is utilized for whole fake face detection and achieves good performance. However, although separable convolution alleviates network overfitting to some extent, the Xception model still suffers from overfitting, especially for locally generated face detection. Overfitting caused by data imbalance is a common phenomenon in classification models [29], and the Xception model also exhibits it on the locally GAN-generated face detection task. Specifically, for a pair of natural and locally GAN-generated images, far more features come from the natural regions than from the generated regions, so the data is imbalanced and the Xception model tends to learn natural features. As a result, the Xception easily falls into overfitting when detecting locally GAN-generated faces.
2.3. Inception block with the dilated convolutions and squeeze-and-excitation module
MesoNet [16] uses the Inception block as its initial layers. In addition, in order to stack convolution outputs of different shapes and thereby increase the function space for optimizing the model, the Inception block is improved by replacing the common convolution layers with dilated convolution layers. The improved Inception block contains four parallel groups of convolution layers. The first group is a 1 × 1 convolution serving as a skip connection between consecutive modules, while each of the other three groups uses dilated convolutions to obtain multi-scale information. The final output is obtained by concatenating the group outputs. In this paper, we take advantage of the Inception block with dilated convolution to extract multi-scale features from locally GAN-generated faces.
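A minimal sketch of such a block (our own TensorFlow/Keras rendering, assuming 3 × 3 kernels for the dilated branches and the (6, 10, 10, 6) filter counts reported in Section 3):

```python
import tensorflow as tf
from tensorflow.keras import layers

def dilated_inception_block(x, filters=(6, 10, 10, 6)):
    # Group 1: 1x1 convolution acting as a skip connection between modules.
    b1 = layers.Conv2D(filters[0], 1, padding="same", activation="elu")(x)
    # Groups 2-4: 3x3 convolutions with dilation rates 1, 2, and 3, giving
    # increasingly large receptive fields (multi-scale information).
    b2 = layers.Conv2D(filters[1], 3, dilation_rate=1, padding="same",
                       activation="elu")(x)
    b3 = layers.Conv2D(filters[2], 3, dilation_rate=2, padding="same",
                       activation="elu")(x)
    b4 = layers.Conv2D(filters[3], 3, dilation_rate=3, padding="same",
                       activation="elu")(x)
    # The final output is the channel-wise concatenation of the four groups.
    return layers.Concatenate()([b1, b2, b3, b4])
```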
The SE module [30] was first proposed for the ImageNet classification problem. Its motivation is to model the interdependence between feature channels explicitly. A SE module contains a global pooling layer, two fully connected layers, and an activation layer. The global pooling turns each feature map into a single value representing the weight of that channel. The obtained weights are then processed by two fully connected layers with ReLU activation and normalized to the range 0-1 by the sigmoid function. Finally, the original feature maps are multiplied by the normalized weights to complete the re-calibration.
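A sketch of the SE module (our own, following [30]; the reduction ratio r = 16 is a conventional choice, not stated in the paper):

```python
from tensorflow.keras import layers

def se_module(x, r=16):
    c = x.shape[-1]
    # Squeeze: global average pooling turns each feature map into one value.
    w = layers.GlobalAveragePooling2D()(x)
    # Excitation: two fully connected layers; the sigmoid normalizes the
    # channel weights to the range 0-1.
    w = layers.Dense(max(c // r, 1), activation="relu")(w)
    w = layers.Dense(c, activation="sigmoid")(w)
    # Re-calibration: scale each channel of the input by its learned weight.
    w = layers.Reshape((1, 1, c))(w)
    return layers.Multiply()([x, w])
```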
3. Proposed model
In this section, a model for the locally GAN-generated face detection is proposed. The model is an improved version of the
Xception model. Two improvements by removing four residual blocks and adding the SE module are considered to avoid the
overfitting problem as much as possible. The Inception block with the dilated convolution and the SE module is used to
replace each common convolution layer in the pre-processing module to obtain the multi-scale features. Moreover, the
FPN is added to achieve the multi-level features for final decision. The details are described in the following.
In this paper, the Inception block is combined with the SE module to promote the positive feature maps and suppress the negative ones for the locally generated face detection task, since the SE module can model the relationships between channels and weight them according to their potential importance. As shown in Fig. 2(a), the input is processed by the Inception block before the SE module. Here, c, h, and w denote the number of channels, rows, and columns of the feature map, respectively, and Scale denotes the multiplication operation. The SE module is also introduced into the residual block with the same purpose; this architecture is presented in Fig. 2(b).
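Reusing the two hypothetical helpers sketched in Section 2.3, the SE-Inception block of Fig. 2(a) can be composed as follows:

```python
def se_inception_block(x, filters=(6, 10, 10, 6)):
    # Inception output first, then SE re-calibration, as in Fig. 2(a).
    return se_module(dilated_inception_block(x, filters))
```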
Since very small generated regions may shrink to a point or even vanish in the feature map after multi-layer convolution and repeated pooling, the final decision may be very difficult. Therefore, the FPN [23], which has been successfully used for small object detection [31], is utilized to extract multi-level features for the final decision from the feature maps of both the information content layers and the deep semantic layers. In this paper, the FPN is combined with global average pooling and a fully connected layer. The architecture of the FPN is shown in Fig. 2(c).
As shown in Fig. 2(c), the FPN combines a bottom-up pathway, a top-down pathway, and lateral connections with an add operation to obtain feature maps at different layers. The resulting multi-level maps therefore contain both the deep semantic information and the shallow context information. Then, the feature map of each scale is connected to a global average pooling and a fully connected layer. Finally, the results of the different scales are merged to make the decision.

Fig. 2. Architecture of the SE-Inception block, SE-residual block, and FPN connecting with global average pooling.
The overall architecture of the proposed locally GAN-generated face detection model is illustrated in Fig. 3. The model can be divided into four modules: the pre-processing module, the information content module, and the deep semantic module for feature extraction, and the decision module for the final decision. All convolutions in the network are depthwise separable convolutions with 3 × 3 kernels, except for the pre-processing module, which uses the Inception block. The activation function is the ELU, which is shown to be superior to ReLU in [22].
In the pre-processing module, the SE-Inception block shown in Fig. 2(a) replaces the common convolutional layers of the original Xception model to obtain multi-scale features. The SE-Inception block increases the functional space for optimizing the pre-processing module by stacking the outputs of several convolutional layers with different kernel shapes. The pre-processing module contains two SE-Inception blocks with a common convolution layer between them. Each SE-Inception block has one common convolution layer and three dilated convolution layers; the numbers of kernels in the two blocks are (6, 10, 10, 6) and (12, 20, 20, 12), respectively, and the dilation rates are set to r = 1, 2, 3, respectively. The middle convolution layer, with 32 kernels and stride 2, is used to reduce the image resolution, as sketched below.
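Under the same assumptions as the earlier sketches (including a 3 × 3 kernel for the middle convolution, whose size the paper does not state), the pre-processing module could be assembled as:

```python
def preprocessing_module(x):
    # First SE-Inception block with (6, 10, 10, 6) kernels.
    x = se_inception_block(x, filters=(6, 10, 10, 6))
    # Common convolution with 32 kernels and stride 2 halves the resolution.
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="elu")(x)
    # Second SE-Inception block with (12, 20, 20, 12) kernels.
    return se_inception_block(x, filters=(12, 20, 20, 12))
```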
In the information content module, the inherent features of the images are learned by relatively shallow layers. The module is composed of three SE-residual pooling blocks, each consisting of an SE-residual block with a max-pooling layer. A 1 × 1 convolution with stride 2 is used as the transition convolution. The numbers of convolution kernels for the three blocks are 128, 256, and 512, respectively, as sketched below.
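Continuing the sketches above, one SE-residual pooling block could be patterned after the Xception residual block (two separable convolutions on the main path; the exact placement of the SE module is our assumption):

```python
def se_residual_pooling_block(x, filters):
    # 1x1 transition convolution with stride 2 on the shortcut path.
    shortcut = layers.Conv2D(filters, 1, strides=2, padding="same")(x)
    # Main path: two depthwise separable convolutions, SE re-calibration,
    # then max-pooling to halve the resolution.
    y = layers.SeparableConv2D(filters, 3, padding="same", activation="elu")(x)
    y = layers.SeparableConv2D(filters, 3, padding="same")(y)
    y = se_module(y)
    y = layers.MaxPooling2D(3, strides=2, padding="same")(y)
    return layers.Add()([shortcut, y])
```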
In the deep semantic module, more high-level semantic information is obtained as the network deepens. Since an excessive number of convolutional layers and kernels would lead to overfitting, the proposed model removes four residual blocks from the deep semantic module of the Xception model and also changes the number of kernels in some convolutional layers. The deep semantic module contains four SE-residual blocks, one SE-residual pooling block, and one depthwise separable convolution layer. The number of kernels is 512 for each of the four SE-residual blocks, and 728 and 1024 for the SE-residual pooling block and the depthwise separable convolution layer, respectively.
In the decision module, multi-level feature maps from three different layers are combined by the FPN for the final decision. One layer is the output of the last block of the deep semantic module; the other two are from the last two blocks of the information content module. The resolutions of the three layers are 8 × 8, 16 × 16, and 32 × 32, respectively. The details of the decision module are shown in Fig. 4. After the FPN, the multi-level features are connected to a global average pooling and a fully connected layer (Dense), where each fully connected layer has 16 neurons. Finally, the features are merged by concatenation for the final softmax classification decision, as sketched below.
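A sketch of the decision head (the lateral channel width of 256 and nearest-neighbor upsampling are our assumptions; the paper does not state them):

```python
def decision_module(c3, c4, c5, channels=256):
    # c3, c4, c5: feature maps at 32x32, 16x16, and 8x8 resolution.
    # Top-down pathway with 1x1 lateral convolutions and add connections.
    p5 = layers.Conv2D(channels, 1, padding="same")(c5)
    p4 = layers.Add()([layers.UpSampling2D(2)(p5),
                       layers.Conv2D(channels, 1, padding="same")(c4)])
    p3 = layers.Add()([layers.UpSampling2D(2)(p4),
                       layers.Conv2D(channels, 1, padding="same")(c3)])
    # Each pyramid level feeds global average pooling and a 16-neuron dense layer.
    feats = [layers.Dense(16, activation="elu")(layers.GlobalAveragePooling2D()(p))
             for p in (p3, p4, p5)]
    # Merge by concatenation and classify with a two-way softmax.
    return layers.Dense(2, activation="softmax")(layers.Concatenate()(feats))
```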
Finally, after training the network, the classification result is obtained. The loss function for the whole network is the cross-entropy

L = -\sum_i \left[ p_i \log(y_i) + (1 - p_i) \log(1 - y_i) \right]    (1)

where p_i is the ground-truth label of sample i and y_i is the predicted probability.
Fig. 4. Detailed architecture of the decision module. Numbers in the form (m × m, n) refer to the kernel size m and the number of kernels n in the convolutional layer.
4. Experiments

In this section, firstly, the construction of the LGGF dataset is described in Section 4.1. Then, the experimental setup is elaborated in Section 4.2. The experimental results are described in Section 4.3. Finally, some analysis and discussion are presented in Section 4.4.
4.1. Construction of the LGGF dataset

To the best of our knowledge, there is currently no publicly available dataset for locally GAN-generated face detection. So, the LGGF dataset is constructed by the procedure shown in Fig. 5.
First, the natural face image library FFHQ [3], which serves as a benchmark for several GANs, is used to create a partially incomplete image dataset. The FFHQ dataset is a high-quality natural face image dataset containing 70,000 images at 1024 × 1024 resolution. It contains many variations, including race, gender, age, background, and so on, and also covers a wide range of face and head accessories, for example, glasses, headbands, masks, hats, and so on. In this paper, all 70,000 images are resized to 256 × 256 in consideration of the computational complexity.
Then, binary ground-truth masks of different sizes are created with Matlab. Six sizes are considered, with ratios of the whole image from 0.5% to 5.5% in steps of 1.0%. In addition, two shapes (regular rectangular and irregular) are taken into account. Each shape-size combination contains 70,000 masks, which appear at arbitrary positions in the natural images. Combining the FFHQ dataset with these two shapes and six sizes of masks yields twelve incomplete image datasets; a sketch of the rectangular case is given below.
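As an illustration, rectangular masks with a prescribed area ratio could be generated as follows (a NumPy sketch under our own assumptions; the paper used Matlab, and the aspect-ratio range is hypothetical):

```python
import numpy as np

def random_rect_mask(h=256, w=256, ratio=0.015, rng=None):
    # Draw a rectangle whose area is `ratio` of the whole image.
    rng = rng if rng is not None else np.random.default_rng()
    area = ratio * h * w
    aspect = rng.uniform(0.5, 2.0)  # hypothetical aspect-ratio range
    rh = max(1, int(round(np.sqrt(area * aspect))))
    rw = max(1, int(round(area / rh)))
    # Place the rectangle at an arbitrary position inside the image.
    top = int(rng.integers(0, h - rh + 1))
    left = int(rng.integers(0, w - rw + 1))
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[top:top + rh, left:left + rw] = 1
    return mask
```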
Finally, the pluralistic model [21] described in Section 2.1 is utilized to restore the incomplete region of each image in the twelve incomplete image datasets. The model was trained by Zheng et al. [21] on the Celeba-HQ dataset [2]. Consequently, the LGGF dataset is obtained. It contains a total of 952,000 images with generated regions of different shapes and sizes. Some samples are shown in Fig. 6.
In addition, for the following experiments, we construct an experimental dataset that contains the 70,000 natural images of the FFHQ dataset and an equal number of generated images. Two sub-datasets are therefore created from the LGGF dataset: a regular sub-dataset and an irregular one. The regular sub-dataset consists of 70,000 randomly selected images with regular rectangular generated regions, while the irregular sub-dataset consists of 70,000 randomly selected images with irregular generated regions. The generated regions in both sub-datasets have sizes with ratios of the whole image from 0.5% to 5.5%, and the number of images is the same for each region size.

Fig. 5. Overview of the construction of the locally GAN-generated face (LGGF) dataset.

Fig. 6. Some examples from the LGGF dataset. In each group, the left image is the natural image, the middle one is the incomplete image with the mask, and the right one is the locally generated image after restoration.
4.2. Experimental setup

To evaluate the performance of the proposed locally GAN-generated face detection model, we carried out a series of experiments. In these experiments, the original FFHQ dataset and the two sub-datasets (regular and irregular) are divided into training, validation, and testing sets with the ratio 5:1:4. Moreover, in each experiment, the numbers of generated and natural samples are the same in each of the training, validation, and testing sets.
To evaluate the performance of the proposed model, three metrics are considered: Accuracy, Precision, and Recall. Let P and N be the numbers of locally generated and natural samples in the dataset, and let TP, TN, FP, and FN be the numbers of correctly detected locally generated samples, correctly detected natural samples, natural samples erroneously detected as locally generated, and locally generated samples falsely detected as natural, respectively. The three metrics are defined as
\text{Accuracy} = \frac{TP + TN}{P + N}    (2)

\text{Precision} = \frac{TP}{TP + FP}    (3)

\text{Recall} = \frac{TP}{TP + FN}    (4)
It can be seen from these equations that: (a) the higher the Accuracy, the better the overall performance; (b) the higher the Precision, the more reliable the detection of locally generated images; (c) the higher the Recall, the lower the miss rate for locally generated images. A small helper computing the three metrics is sketched below.
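A minimal helper computing the three metrics of Eqs. (2)-(4) from the four counts (our own sketch):

```python
def metrics(tp, tn, fp, fn):
    # P + N equals the total number of samples, so Eq. (2) reduces to:
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # Eq. (3)
    recall = tp / (tp + fn)      # Eq. (4)
    return accuracy, precision, recall
```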
All experiments are performed in Keras on a single 11 GB GeForce GTX 1080 Ti with an i7-6900K CPU and 64 GB RAM. For training, we use the Adam optimization algorithm with an initial learning rate of 0.001 and set the total number of epochs to 128. The source code and experimental datasets are available at https://github.com/imagecbj/Locally-GAN-generated-face-detection-based-on-an-improved-Xception.
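A sketch of this training configuration (Keras; `model`, `train_data`, and `val_data` are placeholders, and the use of categorical cross-entropy with the two-way softmax is our reading of Eq. (1)):

```python
import tensorflow as tf

def compile_and_train(model, train_data, val_data):
    # Adam with initial learning rate 0.001 and 128 epochs, as stated above.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model.fit(train_data, validation_data=val_data, epochs=128)
```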
4.3. Experimental results

The proposed model presented in Fig. 3 is an improved version of the Xception with three main modifications. To show the improvement brought by each modification, ablation experiments are carried out; the results are shown in Table 1. It can be observed from this table that each modification contributes to the final performance of the proposed model. Among the three modifications, M2, using the Inception block, contributes the most.

Table 1
Ablation experimental results for the three main modifications (M1: removing four residual blocks; M2: using the Inception block with the dilated convolution to replace the common convolution layer; M3: adding FPN to obtain multi-level features).

Table 2
Effectiveness comparison of different models. '-' means that the corresponding model does not converge.
Then, in order to evaluate the performance of the proposed model, it is compared with six recent advanced models: the NICG model [11], IH model [12], MesoNet model [16], Xception model [22], Cross-Net model [14], and Gram-Net model [15].

The first test evaluates the effectiveness of the proposed model. Table 2 shows the results of the different models on the regular and irregular sub-datasets in terms of Accuracy, Precision, and Recall. Notice that there are no results for the IH and NICG models because these two models cannot converge on either sub-dataset. It can be seen from Table 2 that the proposed model is superior to the other models and is therefore more suitable for the detection of locally GAN-generated faces. Due to their non-convergence, the IH and NICG models are no longer considered in the following tests.
The second test evaluates the effect of the size of the generated regions. Here, for each type (regular or irregular), there are six testing sets for the six sizes with ratios of the whole image between 0.5% and 5.5%. Each testing set contains 28,000 natural images and 28,000 generated images with a fixed generated region size. Fig. 7 shows the results for the different sizes of the generated regions. It can be found from Fig. 7 that: (a) the accuracy increases for all the compared models as the size of the generated regions increases; (b) the proposed model outperforms the other models at all sizes, especially the small ones. This is because of the use of the Inception block with the dilated convolution and the FPN, which extract multi-scale and multi-level features, respectively.
Table 3
Comparison of robustness against smoothing filtering in the testing set (FFHQ dataset + regular sub-dataset).
Table 4
Comparison of robustness against smoothing filtering in the testing set (FFHQ dataset + irregular sub-dataset).
The robustness of the proposed model is evaluated in the third test. Four smoothing filtering operations are performed on two testing sets comprising the regular and irregular sub-datasets, respectively: Gaussian filtering, mean filtering, median filtering, and bilateral filtering. Tables 3 and 4 present the results on the two testing sets. It can be seen from the two tables that, although the metric values decrease after the filtering operations, the proposed model still has considerable performance and is superior to the other four models except in the case of mean filtering.
The fourth test assesses the generalization performance of the proposed model across different image restoration methods. Here, we consider two other recent image restoration methods: DFNet [32] and Generative_inpainting [33]. They are used to restore the incomplete regions of each image in the two incomplete image sub-datasets corresponding to the testing sets of the regular and irregular sub-datasets. Each sub-dataset has 28,000 images. Then, the well-trained detection models are used to detect these restored images. These models were trained on the training set whose images were restored by the pluralistic image restoration method [21]. The results for the two restoration methods are shown in Tables 5 and 6. It can be observed from Tables 5 and 6 as well as Table 2 that: (i) the performance of each detection model worsens on the testing sets created by the other restoration methods, because the tests in Tables 5 and 6 are cross-dataset evaluations, while the test in Table 2 is an in-dataset evaluation; (ii) the proposed model still performs well and achieves the best performance among the five models in each test.
Table 5
Performance comparison of different models for the dataset created by the DFNet [32].
Table 6
Performance comparison of different models for the dataset created by the Generative_inpainting [33].
Table 7
Comparison of performance in detecting very small generated regions.
The fifth test verifies that the proposed model also has good discrimination ability on very small generated regions. Hence, we consider locally generated images having very small generated regions with ratios between 0.1% and 1.0%. The results are presented in Table 7. They show that the proposed model again achieves the best performance among the five compared models, whereas MesoNet almost fails, with a Recall of around 0.5.
Finally, we test the performance of the proposed model in some realistic scenarios, such as glasses removal, mask removal, and other occlusion removal. We create three sub-datasets corresponding to these three scenarios. For the mask-removal sub-dataset, the original face images with masks are collected from the Internet. For the glasses-removal and other-occlusion-removal sub-datasets, the original face images with glasses and other occlusions are from the natural FFHQ dataset. The occlusions in these original images are removed with Adobe Photoshop and the images are then repaired by the pluralistic image restoration method [21]. Finally, 300 pairs of testing images are obtained for each sub-dataset. Each pair contains an original face image and its corresponding repaired one.
Some samples are shown in Fig. 8. All these 900 pairs are used as testing images to evaluate the performance of each model in the realistic scenarios.

Table 8
Performance comparison for the different cases of occlusion removal.

The detection results are presented in Table 8. It can be observed from Table 8 that: (i) the performance of
each model on the three sub-datasets is better than that shown in Table 2 for the regular and irregular sub-datasets. The main reason is that glasses removal and mask removal usually require restoring larger regions than most cases in the regular and irregular sub-datasets; (ii) our model still achieves the best overall performance among the five compared models.
4.4. Analysis and discussion

The proposed model has been compared with six models previously designed for wholly GAN-generated faces in Tables 2-8 and Fig. 7. The comparison shows that the proposed model is well suited to locally GAN-generated face detection. The main reason is that it utilizes the Inception block with the dilated convolution and the FPN to extract multi-scale and multi-level features. Some models originally designed for wholly GAN-generated faces, such as the NICG model [11] and IH model [12], almost fail to reach a decision. Fig. 9 presents the loss and accuracy of these two models; it can be observed that they cannot converge on either sub-dataset. So, the development of models dedicated to locally GAN-generated face detection is necessary.
The proposed model is an improved version of Xception. Compared with the Xception, it achieves better performance with fewer parameters: it removes four residual blocks and reduces the number of convolution kernels in the last layer by 1024. The numbers of parameters for the proposed model and the Xception are 922 k and 2,506 k, respectively.
5. Conclusions
In this paper, the locally GAN-generated face detection problem is investigated, and an improved Xception model is proposed for it. The Inception block with the dilated convolution, the FPN, and the SE module are introduced into Xception to obtain multi-scale and multi-level features for the small generated face regions. The LGGF dataset is constructed on the basis of the natural FFHQ dataset and contains a total of 952,000 images with generated regions of different shapes and sizes. Experimental results show that the proposed model outperforms other models designed for wholly generated faces, especially for faces with small generated regions. Certainly, the detection of locally GAN-generated faces is only an image-level task; localization of the generated regions, a pixel-level task, is also very important and more difficult than detection. The localization task is therefore left as our future work.
CRediT authorship contribution statement

Beijing Chen: Conceptualization, Methodology, Writing - Reviewing and Editing. Xingwang Ju: Data curation, Software, Validation, Writing - Original draft preparation. Bin Xiao: Methodology, Writing - Reviewing and Editing. Weiping Ding: Investigation, Writing - Reviewing and Editing. Yuhui Zheng: Methodology, Visualization.
Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The authors would like to express their sincere appreciation to the editor and the anonymous reviewers for their insightful comments, which greatly improved the quality of this paper.

This work was supported by the National Natural Science Foundation of China under Grants 62072251, 61976120, and 61972206, the Natural Science Foundation of Jiangsu Province under Grant BK20191445, and the PAPD fund.
References
[1] D. Berthelot, T. Schumm, L. Metz, Began: Boundary equilibrium generative adversarial networks, arXiv preprint arXiv:1703.10717, 2017.
[2] T. Karras, T. Aila, S. Laine, et al., Progressive growing of gans for improved quality, stability, and variation, arXiv preprint arXiv:1710.10196, 2017.
[3] T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, in: Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR2019), 2019, pp. 4401-4410.
[4] J. Thies, M. Zollhöfer, M. Stamminger, et al., Face2face: Real-time face capture and reenactment of rgb videos, in: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR2016), 2016, pp. 2387-2395.
[5] I. Korshunova, W. Shi, J. Dambre, et al., Fast face-swap using convolutional neural networks, in: Proceedings of the 2017 IEEE International Conference
on Computer Vision (ICCV2017), 2017, pp. 3677-3685.
[6] D. Güera, E. Delp, Deepfake video detection using recurrent neural networks, in: Proceedings of the 15th IEEE International Conference on Advanced
Video and Signal Based Surveillance (AVSS2018), 2018, pp. 1-6.
[7] F. Matern, C. Riess, M. Stamminger, Exploiting visual artifacts to expose deepfakes and face manipulations, in: Proceedings of the 2019 IEEE Winter
Applications of Computer Vision Workshops (WACVW2019), 2019, pp. 83-92.
[8] S. McCloskey, M. Albright, Detecting GAN-generated imagery using color cues, arXiv preprint arXiv:1812.08247, 2018.
[9] H. Li, B. Li, S. Tan, et al., Detection of deep network generated images using disparities in color components, arXiv preprint arXiv:1808.07276, 2018.
[10] F. Marra, D. Gragnaniello, D. Cozzolino, et al., Detection of GAN-generated fake images over social networks, in: Proceedings of the 2018 IEEE
Conference on Multimedia Information Processing and Retrieval (MIPR2018), 2018, pp. 384-389.
[11] W. Quan, K. Wang, D.M. Yan, et al, Distinguishing between natural and computer-generated images using convolutional neural networks, IEEE Trans.
Inf. Forensics Secur. 13 (11) (2018) 2772–2787.
[12] H. Mo, B. Chen, W. Luo, Fake faces identification via convolutional neural network, in: Proceedings of the 6th ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec2018), 2018, pp. 43-47.
[13] L. Nataraj, T.M. Mohammed, B.S. Manjunath, et al., Detecting GAN generated fake images using co-occurrence matrices, Electronic Imaging, 2019, pp.
532-1-532-7.
[14] N. Hulzebosch, S. Ibrahimi, M. Worring, Detecting CNN-generated Facial images in real-world scenarios, in: Proceedings of the 2020 IEEE/CVF
Conference on Computer Vision and Pattern Recognition Workshops (CVPR-W2020), 2020, pp. 642-643.
[15] Z. Liu, X. Qi, P.H.S. Torr, Global texture enhancement for fake face detection in the wild, in: Proceedings of the 2020 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR2020), 2020, pp. 8060-8069.
[16] D. Afchar, V. Nozick, J. Yamagishi, et al., MesoNet: a compact facial video forgery detection network, in: Proceedings of the 2018 IEEE International
Workshop on Information Forensics and Security (WIFS2018), 2018, pp. 1-7.
[17] A. Khodabakhsh, R. Ramachandra, K. Raja, et al., Fake face detection methods: Can they be generalized?, in: Proceedings of the 2018 International Conference of the Biometrics Special Interest Group, 2018, pp. 1-6.
[18] A. Rössler, D. Cozzolino, L. Verdoliva, et al., Faceforensics++: Learning to detect manipulated facial images, arXiv preprint arXiv:1901.08971, 2019.
[19] X. Chang, J. Wu, T. Yang, et al., DeepFake face image detection based on improved VGG convolutional neural network, in: Proceedings of the 39th Chinese Control Conference (CCC2020), 2020, pp. 7252-7256.
[20] S.Y. Wang, O. Wang, R. Zhang, et al., CNN-generated images are surprisingly easy to spot... for now, in: Proceedings of the 2020 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR2020), 2020, pp. 8695-8704.
[21] C. Zheng, T.J. Cham, J. Cai, Pluralistic image completion, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2019), 2019, pp. 1438-1447.
[22] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2017), 2017, pp. 1251-1258.
[23] T.Y. Lin, P. Dollár, R. Girshick, et al., Feature pyramid networks for object detection, in: Proceedings of the 2017 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR2017), 2017, pp. 2117-2125.
[24] A. Criminisi, P. Pérez, K. Toyama, Region filling and object removal by exemplar-based image inpainting, IEEE Trans. Image Process. 13 (9) (2004) 1200-1212.
[25] J. Hays, A.A. Efros, Scene completion using millions of photographs, ACM Trans. Graphics 26 (3) (2007) 4-es.
[26] Y. Zeng, J. Fu, H. Chao, et al., Learning pyramid-context encoder network for high-quality image inpainting, in: Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR2019), 2019, pp. 1486-1494.
[27] R. Sakurai, S. Yamane, J.H. Lee, Restoring aspect ratio distortion of natural images with convolutional neural network, IEEE Trans. Ind. Inf. 15 (1) (2019)
563–571.
[28] G. Kanojia, S. Raman, MIC-GAN: multi-view assisted image completion using conditional generative adversarial networks, in: Proceedings of the 2020
National Conference on Communications (NCC2020), 2020, pp. 1-6.
[29] M.S. Santos, J.P. Soares, P.H. Abreu, et al, Cross-validation for imbalanced datasets: avoiding overoptimistic and overfitting approaches, IEEE Comput.
Intell. Mag. 13 (4) (2018) 59–76.
[30] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR2018), 2018, pp. 7132-7141.
[31] B. Singh, M. Najibi, L.S. Davis, SNIPER: Efficient multi-scale training, Adv. Neural Inf. Process. Syst. (2018) 9310–9320.
[32] X. Hong, P. Xiong, R. Ji, Deep fusion network for image completion, in: Proceedings of the 27th ACM International Conference on Multimedia
(MM2019), 2019, pp. 2033-2042.
[33] J. Yu, Z. Lin, J. Yang, et al., Free-form image inpainting with gated convolution, in: Proceedings of the 2019 IEEE International Conference on Computer Vision (ICCV2019), 2019, pp. 4471-4480.