Image Tampering Localization Using A Dense Fully Convolutional Network
Abstract— The emergence of powerful image editing software has substantially facilitated digital image tampering, leading to many security issues. Hence, it is urgent to identify tampered images and localize tampered regions. Although much attention has been devoted to image tampering localization in recent years, it is still challenging to perform tampering localization in practical forensic applications. The reasons include the difficulty of learning discriminative representations of tampering traces and the lack of realistic tampered images for training. Since Photoshop is widely used for image tampering in practice, this paper attempts to address the issue of tampering localization by focusing on the detection of commonly used editing tools and operations in Photoshop. In order to capture tampering traces well, a fully convolutional encoder-decoder architecture is designed, where dense connections and dilated convolutions are adopted for achieving better localization performance. In order to effectively train a model in the case of insufficient tampered images, we design a training data generation strategy by resorting to Photoshop scripting, which can imitate human manipulations and generate large-scale training samples. Extensive experimental results show that the proposed approach outperforms state-of-the-art competitors when the model is trained with only generated images or fine-tuned with a small amount of realistic tampered images. The proposed method also has good robustness against some common post-processing operations.

Index Terms— Image forensics, image tampering localization, fully convolutional network, dense convolutional network, dilated convolution, Photoshop scripting.

I. INTRODUCTION

NOWADAYS, digital images captured with convenient acquisition devices are increasingly used as evidence. However, digital images can be easily tampered with by powerful and popular image editing software, such as Photoshop, while few visually perceptible traces are left. Therefore, the identification of tampered images has become an urgent issue.

While distinguishing whether or not an image has been tampered with is addressed in many image forensic methods [1]–[3], localizing tampered image regions has attracted more and more attention from researchers in recent years [4]–[6]. Conventionally, tampering localization was performed on an image by inspecting abnormal local traces, such as JPEG compression artifacts [6]–[9], photo response non-uniformity (PRNU) [10], [11], noise level [12], and statistics of local descriptors [13]. With the success of deep learning in various computer vision tasks, such as object detection and semantic segmentation, many methods based on deep learning have been developed for image tampering detection and localization. Among them, some methods [14]–[17] perform tampering detection at the image or patch level with convolutional neural networks (CNN), and they can be adapted for tampering localization through sliding-window analysis. More recently, some methods [18]–[20] have been proposed to achieve pixel-level or region-level tampering localization directly. Such methods are based on different network structures, including fully convolutional networks (FCN) [18], faster R-CNN [19], long short-term memory (LSTM) cells [20], etc.

Despite the increasing emergence of image tampering localization approaches, exposing tampered image regions remains challenging in practice. One of the reasons is related to the sophistication of image tampering. Usually, image tampering is carried out by manipulating image regions, including splicing, copy-move, removal, etc. Instead of performing such manipulations in a naïve fashion, a forger would tweak the tampered regions with some post-processing (e.g., resizing, rotation, contrast/brightness adjustment, denoising) so as to conceal visually perceptible tampering traces. These post-processing operations are conveniently available in image editing software, especially the most widely used, Photoshop. With Photoshop, a variety of editing tools and operations can be utilized in image tampering, resulting in visually plausible images where the tampering traces are complicated and subtle. Therefore, it is difficult to extract or learn features that are discernible between tampered and original regions. Another reason is the lack of training samples. While machine learning, especially deep learning, has shown its effectiveness for tampering localization, such a solution is data-hungry. Unfortunately, collecting large-scale realistic tampered images (as well as ground-truths) is time-consuming and laborious. Without sufficient training data, it is hard to obtain a reliable tampering localization model.

Manuscript received August 17, 2020; revised January 4, 2021; accepted March 21, 2021. Date of publication April 1, 2021; date of current version April 19, 2021. This work was supported in part by the Key-Area Research and Development Program of Guangdong Province under Grant 2019B010139003; in part by the NSFC under Grant U19B2022, Grant 61802262, Grant 61872244, and Grant 61772349; in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2019B151502001; in part by the Shenzhen Research and Development Program under Grant JCYJ20200109105008228 and Grant JCYJ20180305124325555; and in part by the Alibaba Group through the Alibaba Innovative Research (AIR) Program. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. David Doermann. (Corresponding author: Haodong Li.)

The authors are with the Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, China, also with the Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen 518060, China, and also with the Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen 518060, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TIFS.2021.3070444
1556-6021 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: WASHINGTON UNIVERSITY LIBRARIES. Downloaded on August 25,2021 at 17:19:29 UTC from IEEE Xplore. Restrictions apply.
ZHUANG et al.: IMAGE TAMPERING LOCALIZATION USING DENSE FULLY CONVOLUTIONAL NETWORK 2987
2988 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 16, 2021
enhancement [32], and scaling [33], etc. One limitation of these methods is that they target only a single tampering trace. When the image undergoes multiple manipulations, the tampering detection and localization performance of these methods will be greatly degraded.

In addition to detecting specific tampering operations, some methods based on deep learning have been proposed for detecting more general tampering operations. Among them, some approaches [14], [15], [34] are able to achieve tampering localization with sliding-window analysis. However, their localization performance depends on the size of the sliding window. To get around this issue, a possible solution is to use FCN. Salloum et al. [18] proposed a multi-task FCN that integrates the edge and the interior of the tampered area to achieve splicing localization. To localize more general tampering operations such as copy-move and removal, other methods [19], [20], [35] have been proposed. A two-stream network structure that utilizes both content-related features and residual features is proposed in [19] to achieve region-level tampering localization. Bappy et al. [20] proposed an encoder-decoder architecture to analyze the differences between manipulated and non-manipulated regions with resampling features and LSTM, where the decoder can directly produce pixel-level tampering localization results. In [35], a feature extractor is first learned from 385 types of image manipulations, and then image tampering localization is achieved by solving a local anomaly detection problem. Since deep learning is data-driven, plenty of tampered images are needed for training a model. Some previous works [19], [20] utilized image segmentation datasets, such as MS-COCO [36], to generate tampered images with the provided pixel-level segmentation annotations. However, these tampered images are generated with some simple operations, and thus they often contain obvious artifacts that are easily exposed. Therefore, a model trained with these tampered images usually obtains poor performance when dealing with tampered images in real applications.

III. PROPOSED METHOD

In this section, we elaborate on our solution for localizing image regions tampered with Photoshop, including an encoder-decoder architecture based on deep networks and a strategy for generating large-scale training samples.

A. Approach Overview

The primary purpose of the proposed method is to design a network architecture that effectively captures tampering traces. Since various tools and operations in Photoshop are used in practice, it is tricky to model tampering traces. Some previous works [14], [15] try to learn features of tampering traces block by block; hence the block size heavily affects tampering localization performance. Besides, it is also hard to perform real-time detection on high-resolution images since the number of blocks is huge. Some recent works [18], [20] have alleviated these issues by using fully convolutional networks, which can produce pixel-level localization results directly. However, in these works, the last output feature maps in the encoder part are usually of very low resolution (1/16 [20] or 1/32 [18] of the input image), leading to the loss of some detailed tampering traces and thus deteriorating the final localization results output by the decoder part. To overcome these shortcomings, we specifically include dense connections and dilated convolutions in the proposed encoder-decoder architecture. In this way, the network is able to capture more subtle tampering traces within the image and obtain finer feature maps for making predictions.

Another purpose of the proposed method is to seek a way to prepare sufficient training images. Since it is hard to manually prepare large-scale tampered images, we resort to an alternative and automatic way. Basically, if a forger uses Photoshop to create tampered images, he or she is likely to use the operations and tools listed in Table I. We therefore employ Photoshop scripting to make use of these operations and tools to create tampered images, imitating the practical tampering process. In this way, we can generate a large number of tampered images programmatically, and train our model with the generated tampered images.

B. The Proposed Encoder-Decoder Architecture

In this subsection, we introduce the proposed network architecture for tampering localization. As shown in Fig. 1, the proposed architecture consists of an encoder and a decoder, where the former transforms the input image into discriminative feature maps and the latter outputs pixel-wise predictions by further processing the feature maps. To improve localization performance, we include dense blocks in both the encoder and decoder, and use dilated convolutions in the layers of dense blocks #4 and #5. The details are described as follows.

1) Dense Connection: In order to capture the subtle tampering traces, we utilize the dense connection proposed in [37]. Fig. 2 shows a typical dense block, which consists of four internal convolutional layers and one transition layer. Assuming the total number of internal convolutional layers in a dense block is L, the output of the l-th layer is defined as

x_l = H_l([x_0, x_1, ..., x_{l-1}]), l ∈ {1, 2, ..., L}, (1)

where [x_0, x_1, ..., x_{l-1}] represents the concatenation operation that combines the feature maps produced in layers 0, ..., l−1, and H_l(·) represents three consecutive operations in the l-th layer: 3 × 3 convolution, followed by batch normalization (BN) and ReLU activation. Assuming the l-th layer outputs k feature maps, the value of k is defined as the growth rate. Thus, the l-th layer's input [x_0, x_1, ..., x_{l-1}] has l × k channels in total. In [37], a transition layer, composed of a 1 × 1 convolution followed by a 2 × 2 average pooling, is used to reduce the number of output channels. In our network, we do not use a pooling operation after each dense block; hence each transition layer is a 1 × 1 convolutional layer.

In the proposed architecture, there are five dense blocks in the encoder network and two in the decoder network. There are two explanations for why the use of dense blocks can improve tampering localization performance. Firstly, with the dense connections, individual layers can receive additional supervision from the loss function through the shorter connections. This could provide stronger and more direct supervision of the kernel weights with regard to the localization task.
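As a concrete illustration of Eq. (1) and the modified transition layer, a dense block can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' released code; the channel counts, growth rate, and class name are illustrative placeholders.

```python
import torch
import torch.nn as nn


class DenseBlock(nn.Module):
    """Dense block following Eq. (1): the l-th internal layer receives the
    concatenation [x_0, x_1, ..., x_{l-1}] of all preceding feature maps.
    The transition layer is a 1x1 convolution only (no pooling), as
    described in the text. Layer count and growth rate are placeholders."""

    def __init__(self, in_channels, growth_rate=16, num_layers=4, out_channels=64):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            # H_l: 3x3 convolution, followed by batch normalization and ReLU
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
                nn.BatchNorm2d(growth_rate),
                nn.ReLU(inplace=True),
            ))
            channels += growth_rate
        # transition layer: 1x1 convolution that reduces the channel count
        self.transition = nn.Conv2d(channels, out_channels, kernel_size=1)

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # input of the l-th layer is the concatenation of all earlier outputs
            features.append(layer(torch.cat(features, dim=1)))
        return self.transition(torch.cat(features, dim=1))
```

Note that, matching the text, the transition layer here omits the 2 × 2 average pooling of [37]; in the full architecture, pooling is applied separately between only some of the blocks (cf. Fig. 1), so the feature-map resolution is preserved elsewhere.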
Fig. 1. The proposed architecture for image tampering localization. The encoder network contains five dense blocks and three 2 × 2 average pooling layers,
and the decoder network contains two dense blocks, three transposed convolutional layers, and a 5 × 5 convolutional layer.
Fig. 2. A dense block. The four former layers are internal convolutional layers and the last layer is the transition layer.

Fig. 3. Dilated convolution with different dilation rates. The dark elements represent non-zero elements and the white elements represent zero elements.

Since the tampering traces are subtle, such stronger supervision is beneficial for learning more discriminative features. Secondly, the dense connections remarkably encourage feature reuse. In this way, latter layers within a dense block receive features learned by their different preceding layers, increasing the variation of the input and thus promoting the layers to learn effective features for various tampering traces. According to our experimental results, incorporating dense connections into the network indeed improves the performance; please refer to Section IV-B for details.

2) Dilated Convolution: In tampering localization, learning features from only narrow local regions would result in volatile feature representations. Thus, it is necessary to utilize longer-range information for feature learning and to capture more tampering traces, meaning that the convolutional layers should have larger receptive fields. Traditionally, there are two possible ways to increase the receptive field: (i) using a larger convolutional kernel, and (ii) adding a pooling layer before the next convolutional layer. A significant drawback of the former is that the number of parameters increases as the receptive field grows. The second way, however, would reduce the feature map resolution, causing the loss of some detailed traces in small tampered regions. To overcome these limitations, we utilize the dilated convolution proposed in [38]. Fig. 3 shows that the dilated convolution can increase the size of the receptive field by padding adjacent elements of the convolutional kernel with zeros. The receptive field of the kernel can be controlled by the dilation rate r. As the number of layers increases, the receptive field grows while no additional parameters are added and the resolutions of the feature maps are kept, thus avoiding the loss of useful information for tampering localization.

In our proposed architecture, dilated convolutional layers are used in the fourth and fifth dense blocks. Fig. 4 shows the feature map responses with and without dilated convolutions in these two blocks. The feature map responses are obtained by performing linear discriminant analysis (LDA) to reduce the channels of the output feature maps to one. We observe that with dilated convolutions the feature responses can cover the tampered regions more completely, indicating that better localization performance can be obtained. This is supported by the final predictions shown in Fig. 4. Based on our experimental results, using dilated convolutions can increase the average F1-score by a relative improvement of about 4% (refer to Section IV-B for details).

3) Architecture Details: We use seven dense blocks in total in the proposed encoder-decoder architecture; the settings for each dense block are shown in Table II. As shown in Table II, each dense block is composed of several internal convolutional layers (four in the first block and two in the others) and a transition layer as the output layer. In the fourth and fifth dense blocks, dilated convolutional layers rather than normal convolutional layers are used. We set the kernel size as 3 and the dilation rate as 2 in each dilated convolutional layer. The kernel size of almost every normal convolutional layer is 3 × 3, except the last one in the decoder network, whose kernel size is 5 × 5. The stride of each convolutional layer is set to 1 so that no
Fig. 4. Two examples illustrating the intermediate feature maps obtained with and without dilated convolutions. In each subfigure, the first super-column shows the input image and its ground-truth; in the second super-column, the first row shows the feature map responses of blocks #4 and #5 and the final prediction output by the proposed architecture, respectively, while the second row shows those output by an architecture without dilated convolutions. (a) Example #1. (b) Example #2.
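The trade-off described above (larger receptive field with unchanged resolution and parameter count) can be verified directly. The following is a minimal PyTorch sketch; the channel sizes are illustrative and not taken from the paper's configuration.

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation rate r = 2, as used in dense blocks #4 and #5.
# Setting padding = dilation preserves the feature-map resolution, while the
# effective kernel footprint grows from 3x3 to 5x5.
plain = nn.Conv2d(16, 16, kernel_size=3, padding=1)
dilated = nn.Conv2d(16, 16, kernel_size=3, dilation=2, padding=2)

x = torch.randn(1, 16, 32, 32)
# both convolutions keep the 32x32 spatial resolution
assert plain(x).shape == dilated(x).shape == (1, 16, 32, 32)

# same parameter count: dilation inserts zeros between kernel taps,
# it does not add any learnable weights
n_params = lambda m: sum(p.numel() for p in m.parameters())
assert n_params(plain) == n_params(dilated)
```

This is why the dilated layers enlarge the receptive field without the resolution loss of pooling or the parameter growth of larger kernels.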
Fig. 5. Diagram of the training data generation strategy based on Photoshop scripting. The tools listed in Table I are used for post-processing.
TABLE III
CONFIGURATIONS OF THE NETWORK VARIANTS
TABLE V
PERFORMANCE OF DIFFERENT METHODS. THE MODELS USED IN THE FIRST THREE CASES ARE TRAINED WITH THE PS-SCRIPTED BOOK-COVER DATASET, WHILE THOSE USED IN THE LAST CASE (GRAY BACKGROUND) ARE FURTHER TRAINED WITH THE PS-SCRIPTED DRESDEN DATASET
learning rate was set as 5 × 10−4 and was decreased by 50% after every epoch. The batch size was set to 16. We tested the model with the validation set at the end of each epoch, and chose the one achieving the highest validation F1-score as the final model. All experiments were carried out on an Nvidia Tesla P100 GPU server. The source code is available at: https://github.com/ZhuangPeiyu/Dense-FCN-for-tampering-localization.

5) Performance Metrics: Since image tampering localization is a pixel-level binary classification problem, the following pixel-level performance metrics were used:

F1 = 2TP / (2TP + FN + FP),
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),
IOU = TP / (TP + FP + FN),

where TP and FN represent the numbers of correctly classified and misclassified tampered pixels, respectively, and TN and FP represent the numbers of correctly classified and misclassified original pixels, respectively. As the final output of the network is real-valued, thresholding is needed for computing the metrics. Some previous works [18], [20] choose the best threshold for every single testing image. This is unreasonable in practice, since the corresponding ground-truth is unavailable. In this work, we chose the commonly used value 0.5 as the threshold for all images when computing the metrics. In addition to the above metrics, we also calculated the pixel-level receiver operating characteristic curve and reported the area under the curve (AUC) for performance evaluation.

B. Ablation Study

In this experiment, we performed an ablation study to validate the effectiveness of the proposed network. Generally, there are three factors that may affect the localization performance of the proposed network, i.e., i) the number of dense blocks in the encoder, ii) the use of dilated convolutional layers, and iii) the use of a dense block between two transposed convolutional layers. Therefore, we designed four network variants as shown in Table III. Among them, Variants #1, #2, and #3 have the same number of convolutional layers but different dense connections; the numbers of dense blocks in the encoder networks of Variants #1, #2, and #3 are thus 5, 3, and 0, respectively, while in Variant #4 the dense block between the transposed convolutions is removed. We trained these networks with the PS-scripted book-cover dataset and recorded the training loss during the training processes, as shown in Fig. 7. It is observed that the proposed network converged faster than the other variants and achieved the lowest loss after convergence. Then, we tested the trained models with the three testing datasets mentioned above. The obtained F1-scores are shown in Table IV, from which we have the following observations.

• The number of dense blocks in the encoder. According to the localization performance shown in Table IV, we can see that Variant #1 achieves the best average F1-score on the three datasets. The relative improvements over Variants #2 and #3 are 1% and 4%, respectively. These results indicate that it is necessary to use enough dense blocks in the encoder network to improve the performance. Consequently, we decided to use 5 dense blocks in the encoder part of the proposed architecture.

• Dilated convolutional layer. By comparing the performance of Variant #1 with the proposed model, we observe that the use of dilated convolutions in the fourth and fifth dense blocks improves the F1-score by 0.01 or 0.02 on different datasets, meaning that expanding the receptive field without reducing feature map resolution via dilated convolution is beneficial for tampering localization.

• Dense block between two transposed convolutional layers. Compared with Variant #4, the F1-scores achieved by the proposed model improve by 0.03 on PS-arbitrary
Fig. 8. F1-scores obtained with different thresholds. From left to right are the results for PS-boundary dataset, PS-arbitrary dataset and NIST-2016 dataset,
respectively.
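The pixel-level metrics defined in the experimental setup, under the fixed 0.5 threshold discussed there, can be computed as follows. This is a small NumPy sketch; the function name is ours, not from the paper's released code.

```python
import numpy as np


def pixel_metrics(prob_map, ground_truth, threshold=0.5):
    """Pixel-level F1, MCC and IOU from a real-valued probability map and a
    binary ground-truth mask, using the fixed 0.5 threshold adopted in the
    paper (rather than per-image best thresholds)."""
    pred = np.asarray(prob_map) >= threshold
    gt = np.asarray(ground_truth).astype(bool)
    tp = float(np.sum(pred & gt))    # tampered pixels correctly classified
    tn = float(np.sum(~pred & ~gt))  # original pixels correctly classified
    fp = float(np.sum(pred & ~gt))   # original pixels misclassified
    fn = float(np.sum(~pred & gt))   # tampered pixels misclassified
    f1 = 2 * tp / (2 * tp + fn + fp)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    iou = tp / (tp + fp + fn)
    return f1, mcc, iou
```

Counts are cast to float before forming the MCC denominator so that the product of four large pixel counts cannot overflow an integer type on big images.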
TABLE VI
PERFORMANCE OF DIFFERENT METHODS AFTER FINE-TUNING WITH 10% TAMPERED IMAGES
and NIST-2016 datasets, meaning that adding a dense block between two transposed convolutional layers in the decoder network is helpful.

Based on these observations, we conclude that the structural components designed in the proposed network can indeed improve tampering localization performance.

C. Performance for Training With PS-Scripted Images

In this experiment, we trained models with the PS-scripted datasets and evaluated the trained models with the testing datasets. In such a case, all the testing images come from datasets that are not involved in training. Please note that training is not applicable for ADQ1, DCT, NADQ, and Mantra-net; hence these methods performed the testing directly.

Firstly, we trained the models with the PS-scripted book-cover dataset, and obtained the testing results shown in Table V. It is observed that the proposed method outperforms its rivals in most cases. The F1-scores achieved by the proposed method are 0.61 and 0.57 for PS-boundary and PS-arbitrary, respectively, which are 0.06 and 0.26 higher than the second best one (i.e., Bayar's method with a 64 × 64 sliding window). On the NIST-2016 dataset, none of the methods perform well. This may be due to the fact that the training images are pictures of book covers while the testing images in NIST-2016 are natural scenes, and the different properties of the training and testing images lead to poor performance. Nevertheless, the proposed method achieves the best F1-score, MCC, and IOU among all the methods.

In order to investigate the performance when training and testing images share the same properties, we regarded the models trained with the PS-scripted book-cover dataset as pre-trained models, and further trained them with the PS-scripted Dresden dataset. The resulting models were tested on NIST-2016 again. The obtained results are shown at the bottom of Table V.
Fig. 9. Some localization results for example images in the PS-boundary dataset (Rows #1–#3), the PS-arbitrary dataset (Rows #4–#6) and the NIST-2016 dataset (Rows #7–#10). The first and last columns are the tampered images and ground-truths, respectively. The other columns are the probability heatmaps obtained by different methods.
Fig. 11. Pixel-level AUC scores for different JPEG compression qualities.
Fig. 15. Pixel-level AUC scores for different levels of additive Gaussian noise.
relatively well in the experiments conducted in Sections IV-C and IV-D.

1) Robustness Against JPEG Compression: We compressed the tampered images in the PS-boundary and PS-arbitrary datasets using the commonly used JPEG compression qualities in Photoshop (i.e., PS8–PS12), and then fed them into the trained model mentioned above. The pixel-level AUC scores of different methods are shown in Fig. 11. It is observed that all four methods can resist JPEG compression to a certain extent. For the PS-boundary dataset, the proposed method slightly underperforms Forensic Similarity, especially for PS10. Nevertheless, for the PS-arbitrary dataset, in which the tampered images are subjected to more practical tampering operations, the proposed method achieves the best performance in all the cases.

2) Robustness Against Resizing: In practice, a tampered image is likely to be subjected to post-resizing. To test the robustness against resizing, we used Photoshop to resize the images in PS-boundary and PS-arbitrary with a ratio from 0.5 to 2.0. The AUC scores regarding different resizing factors are shown in Fig. 12. We observe that the performance of all the methods is degraded when the resizing factor is lower than 1.0; this may be due to the fact that down-scaling removes many image details, which are important for tampering localization. Overall, the proposed method achieves the best robustness to resizing. Even when an image is reduced to 25% of its original size (i.e., the resizing factor is 0.5), the AUC of our method is still higher than 0.80.

3) Robustness Against Cropping: Cropping is commonly used as a post-processing operation. In this experiment, we cropped a bottom-right part from each image in PS-boundary and PS-arbitrary with a shift of (x, y), as shown in Fig. 13. The values of x and y are given by

x = x_0 + 8i, x_0 ∈ [0, 7],
y = y_0 + 8j, y_0 ∈ [0, 7],
s.t. (W − x)(H − y) / (W × H) ≥ 2/3,

where i and j are two positive random integers, and W and H are the width and height of the image, respectively.
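The shift sampling described above can be sketched as follows. This is a hypothetical Python illustration: the function name, the upper bound on the multiples of 8 (chosen only to keep rejection sampling efficient), and the reading of the area constraint as "at least 2/3 of the original area is retained" are our assumptions.

```python
import random


def sample_crop_shift(width, height, max_tries=10000):
    """Sample a crop shift (x, y) for the bottom-right cropping test.
    Each component is a sub-grid offset in [0, 7] plus a positive multiple
    of 8, so the crops cover all 64 alignments of the 8x8 JPEG compression
    grid. Assumption: the retained area (W - x)(H - y) must be at least
    2/3 of the original image area."""
    for _ in range(max_tries):
        # x_0/y_0 in [0, 7]; i/j are positive integers (range is our choice)
        x = random.randint(0, 7) + 8 * random.randint(1, max(1, width // 24))
        y = random.randint(0, 7) + 8 * random.randint(1, max(1, height // 24))
        if x < width and y < height and \
                (width - x) * (height - y) >= 2 / 3 * width * height:
            return x, y
    raise RuntimeError("no admissible shift found")
```

Rejection sampling keeps only shifts that satisfy the area constraint, so every returned (x, y) leaves at least two thirds of the image intact while still exercising all 64 grid alignments.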
According to the cropping factor (x, y), the cropped into the data generation script program, so as to increase the
images are divided into 64 categories (regarding to the 8 × 8 diversity of training samples.
JPEG compression grid). The performance of four methods
in the case of cropping is shown in Fig. 14. It is observed R EFERENCES
that our method performs the best against cropping on both
[1] T. Bianchi and A. Piva, “Detection of nonaligned double JPEG com-
datasets, especially for the PS-arbitrary dataset, where its AUC pression based on integer periodicity maps,” IEEE Trans. Inf. Forensics
scores are higher than others about 0.1 for various cropping Security, vol. 7, no. 2, pp. 842–848, Apr. 2012.
factors. [2] M. Goljan and J. Fridrich, “CFA-aware features for steganalysis of color
4) Robustness Against Gaussian Noise: Another commonly used post-processing operation is noise addition. In this experiment, we evaluated four methods in the case that the testing images were corrupted with additive Gaussian noise. Please note that the noise levels are not fully matched between the training and testing phases. For the training images, Gaussian noise is added to make the resulting PSNRs 20 dB, 30 dB, or 40 dB. For the testing images, the standard deviations of the Gaussian noise vary over 2, 5, 10, 15, 20, and 25, and the corresponding resulting PSNRs are 42 dB, 34 dB, 28 dB, 25 dB, 22 dB, and 20 dB, respectively. Such settings introduce additional challenges. As shown in Fig. 15, the proposed method is more robust against additive Gaussian noise than the other methods. Among all the methods, Forensic Similarity has the poorest performance, because it relies on image block similarity, which may be heavily destroyed by additive noise.

5) Performance on Original Images: In addition, we have tested the 1,000 original images that were used to build the PS-boundary and PS-arbitrary datasets. By thresholding the network outputs at 0.5, the obtained average pixel-level false alarm rate is 0.005, meaning that our method produces very few false predictions for original images.

V. CONCLUSION

In this paper, we propose a method for image tampering localization. We mainly focus on the editing operations and tools used in Photoshop, considering that Photoshop is widely used in practice and its traces are representative. To learn tampering traces and achieve pixel-level localization, we design a fully convolutional network using densely connected convolutional blocks, where dilated convolutions are placed into certain blocks. Taking advantage of dense connections and dilated convolutions, the network is able to obtain good localization performance. To deal with the lack of training data, we employ Photoshop scripting to programmatically generate large-scale tampered training images. Extensive experimental evaluations show that the proposed method outperforms several state-of-the-art methods on three datasets. It is worth noting that two of the evaluated datasets were manually created from book-cover images, imitating tampering on pictures of documents and certificates, which may provide convenience for future research on image tampering localization. Besides, the proposed method is robust to several types of post-processing, including JPEG compression, resizing, cropping, and additive Gaussian noise.

In the future, we will further analyze the traces left by other image editing operations and tools, and accordingly optimize the proposed network architecture with appropriate structural components. Moreover, we will also try to include the functions of other image editing software, such as GIMP.

[2] … images,” in Proc. Media Watermarking, Secur., Forensics, vol. 9409, 2015, Art. no. 94090V.
[3] C.-M. Pun, X.-C. Yuan, and X.-L. Bi, “Image forgery detection using adaptive oversegmentation and feature point matching,” IEEE Trans. Inf. Forensics Security, vol. 10, no. 8, pp. 1705–1716, Aug. 2015.
[4] J. H. Bappy, A. K. Roy-Chowdhury, J. Bunk, L. Nataraj, and B. S. Manjunath, “Exploiting spatial structure for localizing manipulated image regions,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4970–4979.
[5] H. Li, W. Luo, X. Qiu, and J. Huang, “Image forgery localization via integrating tampering possibility maps,” IEEE Trans. Inf. Forensics Security, vol. 12, no. 5, pp. 1240–1252, May 2017.
[6] T. Bianchi and A. Piva, “Image forgery localization via block-grained analysis of JPEG artifacts,” IEEE Trans. Inf. Forensics Security, vol. 7, no. 3, pp. 1003–1017, Jun. 2012.
[7] Z. Lin, J. He, X. Tang, and C.-K. Tang, “Fast, automatic and fine-grained tampered JPEG image detection via DCT coefficient analysis,” Pattern Recognit., vol. 42, no. 11, pp. 2492–2501, Nov. 2009.
[8] P. Korus and J. Huang, “Multi-scale fusion for improved localization of malicious tampering in digital images,” IEEE Trans. Image Process., vol. 25, no. 3, pp. 1312–1326, Mar. 2016.
[9] I. Amerini, R. Becarelli, R. Caldelli, and A. Del Mastio, “Splicing forgeries localization through the use of first digit features,” in Proc. IEEE Int. Workshop Inf. Forensics Security (WIFS), Dec. 2014, pp. 143–148.
[10] G. Chierchia, D. Cozzolino, G. Poggi, C. Sansone, and L. Verdoliva, “Guided filtering for PRNU-based localization of small-size image forgeries,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2014, pp. 6231–6235.
[11] G. Chierchia, G. Poggi, C. Sansone, and L. Verdoliva, “A Bayesian-MRF approach for PRNU-based image forgery detection,” IEEE Trans. Inf. Forensics Security, vol. 9, no. 4, pp. 554–567, Apr. 2014.
[12] S. Lyu, X. Pan, and X. Zhang, “Exposing region splicing forgeries with blind local noise estimation,” Int. J. Comput. Vis., vol. 110, no. 2, pp. 202–221, Nov. 2014.
[13] W. Fan, K. Wang, and F. Cayre, “General-purpose image forensics using patch likelihood under image statistical models,” in Proc. IEEE Int. Workshop Inf. Forensics Security (WIFS), Nov. 2015, pp. 1–6.
[14] J. Chen, X. Kang, Y. Liu, and Z. J. Wang, “Median filtering forensics based on convolutional neural networks,” IEEE Signal Process. Lett., vol. 22, no. 11, pp. 1849–1853, Nov. 2015.
[15] B. Bayar and M. C. Stamm, “Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection,” IEEE Trans. Inf. Forensics Security, vol. 13, no. 11, pp. 2691–2706, Nov. 2018.
[16] M. Barni et al., “Aligned and non-aligned double JPEG detection using convolutional neural networks,” J. Vis. Commun. Image Represent., vol. 49, pp. 153–163, Nov. 2017.
[17] Y. Chen, X. Kang, Y. Q. Shi, and Z. J. Wang, “A multi-purpose image forensic method using densely connected convolutional neural networks,” J. Real-Time Image Process., vol. 16, no. 3, pp. 725–740, Jun. 2019.
[18] R. Salloum, Y. Ren, and C.-C. J. Kuo, “Image splicing localization using a multi-task fully convolutional network (MFCN),” J. Vis. Commun. Image Represent., vol. 51, pp. 201–209, Feb. 2018.
[19] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, “Learning rich features for image manipulation detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1053–1061.
[20] J. H. Bappy, C. Simons, L. Nataraj, B. S. Manjunath, and A. K. Roy-Chowdhury, “Hybrid LSTM and encoder–decoder architecture for detection of image forgeries,” IEEE Trans. Image Process., vol. 28, no. 7, pp. 3286–3300, Jul. 2019.
[21] I. Yerushalmy and H. Hel-Or, “Digital image forgery detection based on lens and sensor aberration,” Int. J. Comput. Vis., vol. 92, no. 1, pp. 71–91, Mar. 2011.
[22] O. Mayer and M. C. Stamm, “Accurate and efficient image forgery detection using lateral chromatic aberration,” IEEE Trans. Inf. Forensics Security, vol. 13, no. 7, pp. 1762–1777, Jul. 2018.
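Since every pixel of an original image is a true negative, the pixel-level false alarm rate reduces to the fraction of pixels whose predicted tampering probability exceeds the threshold. A sketch of this metric (names and data are illustrative, not the authors' evaluation code):

```python
def pixel_false_alarm_rate(prob_maps, threshold=0.5):
    """Fraction of pixels flagged as tampered across images that are
    actually pristine.

    prob_maps: list of 2-D lists holding per-pixel tampering
    probabilities output by a localization network.
    """
    total = 0
    false_alarms = 0
    for prob_map in prob_maps:
        for row in prob_map:
            total += len(row)
            false_alarms += sum(1 for p in row if p >= threshold)
    return false_alarms / total

# Toy 2x2 "network output" for one original image: one of the four
# pixels exceeds the 0.5 threshold, giving a rate of 1/4.
print(pixel_false_alarm_rate([[[0.9, 0.1], [0.2, 0.3]]]))  # 0.25
```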
ZHUANG et al.: IMAGE TAMPERING LOCALIZATION USING DENSE FULLY CONVOLUTIONAL NETWORK 2999
[23] A. C. Gallagher and T. Chen, “Image authentication by detecting traces of demosaicing,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2008, pp. 1–8.
[24] P. Ferrara, T. Bianchi, A. De Rosa, and A. Piva, “Image forgery localization via fine-grained analysis of CFA artifacts,” IEEE Trans. Inf. Forensics Security, vol. 7, no. 5, pp. 1566–1577, Oct. 2012.
[25] B. Mahdian and S. Saic, “Using noise inconsistencies for blind image forensics,” Image Vis. Comput., vol. 27, no. 10, pp. 1497–1503, Sep. 2009.
[26] W. Li, Y. Yuan, and N. Yu, “Passive detection of doctored JPEG image via block artifact grid extraction,” Signal Process., vol. 89, no. 9, pp. 1821–1829, Sep. 2009.
[27] W. Wang, J. Dong, and T. Tan, “Tampered region localization of digital color images based on JPEG compression noise,” in Proc. Int. Workshop Digit. Watermarking, 2010, pp. 120–133.
[28] I. Amerini, T. Uricchio, L. Ballan, and R. Caldelli, “Localization of JPEG double compression through multi-domain convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 1865–1871.
[29] W. Wang, J. Dong, and T. Tan, “Exploring DCT coefficient quantization effects for local tampering detection,” IEEE Trans. Inf. Forensics Security, vol. 9, no. 10, pp. 1653–1666, Oct. 2014.
[30] H.-D. Yuan, “Blind forensics of median filtering in digital images,” IEEE Trans. Inf. Forensics Security, vol. 6, no. 4, pp. 1335–1345, Dec. 2011.
[31] M. Kirchner, “Fast and reliable resampling detection by spectral analysis of fixed linear predictor residue,” in Proc. 10th ACM Workshop Multimedia Secur. (MM&Sec), 2008, pp. 11–20.
[32] M. Stamm and K. J. R. Liu, “Blind forensics of contrast enhancement in digital images,” in Proc. 15th IEEE Int. Conf. Image Process., Oct. 2008, pp. 3112–3115.
[33] B. Bayar and M. C. Stamm, “On the robustness of constrained convolutional neural networks to JPEG post-compression for image resampling detection,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 2152–2156.
[34] O. Mayer and M. C. Stamm, “Forensic similarity for digital images,” IEEE Trans. Inf. Forensics Security, vol. 15, pp. 1331–1346, 2020.
[35] Y. Wu, W. Abdalmageed, and P. Natarajan, “ManTra-Net: Manipulation tracing network for detection and localization of image forgeries with anomalous features,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 9543–9552.
[36] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[37] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[38] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in Proc. Int. Conf. Learn. Represent., 2016, pp. 1–13.
[39] T. Gloe and R. Böhme, “The ‘Dresden image database’ for benchmarking digital image forensics,” in Proc. ACM Symp. Appl. Comput., 2010, pp. 1584–1590.
[40] S. Ye, Q. Sun, and E.-C. Chang, “Detecting digital image forgeries by measuring inconsistencies of blocking artifact,” in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2007, pp. 12–15.
[41] M. Zampoglou, S. Papadopoulos, and Y. Kompatsiaris, “Large-scale evaluation of splicing localization algorithms for Web images,” Multimedia Tools Appl., vol. 76, no. 4, pp. 4801–4834, Feb. 2017.
[42] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026–1034.

Haodong Li (Member, IEEE) received the B.S. and Ph.D. degrees from Sun Yat-sen University, Guangzhou, China, in 2012 and 2017, respectively. He is currently an Assistant Professor with the College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China. His current research interests include multimedia forensics, image processing, and deep learning.

Shunquan Tan (Senior Member, IEEE) received the B.S. degree in computational mathematics and applied software and the Ph.D. degree in computer software and theory from Sun Yat-sen University, Guangzhou, China, in 2002 and 2007, respectively. From 2005 to 2006, he was a Visiting Scholar with the New Jersey Institute of Technology, Newark, NJ, USA. In 2007, he joined Shenzhen University, China, where he is currently an Associate Professor with the College of Computer Science and Software Engineering. His current research interests include multimedia security, multimedia forensics, and machine learning.

Bin Li (Senior Member, IEEE) received the B.E. degree in communication engineering and the Ph.D. degree in communication and information system from Sun Yat-sen University, Guangzhou, China, in 2004 and 2009, respectively. From 2007 to 2008, he was a Visiting Scholar with the New Jersey Institute of Technology, Newark, NJ, USA. In 2009, he joined Shenzhen University, Shenzhen, China, where he is currently a Professor. He is also the Director of the Shenzhen Key Laboratory of Media Security, the Vice Director of the Guangdong Key Laboratory of Intelligent Information Processing, and a member of the Peng Cheng Laboratory. His current research interests include multimedia forensics, image processing, and deep machine learning. Since 2019, he has been serving as a member of the IEEE Information Forensics and Security TC.