
IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 16, 2021

Image Tampering Localization Using a Dense Fully Convolutional Network

Peiyu Zhuang, Student Member, IEEE, Haodong Li, Member, IEEE, Shunquan Tan, Senior Member, IEEE, Bin Li, Senior Member, IEEE, and Jiwu Huang, Fellow, IEEE

Abstract— The emergence of powerful image editing software has substantially facilitated digital image tampering, leading to many security issues. Hence, it is urgent to identify tampered images and localize tampered regions. Although much attention has been devoted to image tampering localization in recent years, it is still challenging to perform tampering localization in practical forensic applications. The reasons include the difficulty of learning discriminative representations of tampering traces and the lack of realistic tampered images for training. Since Photoshop is widely used for image tampering in practice, this paper attempts to address the issue of tampering localization by focusing on the detection of commonly used editing tools and operations in Photoshop. In order to well capture tampering traces, a fully convolutional encoder-decoder architecture is designed, where dense connections and dilated convolutions are adopted for achieving better localization performance. In order to effectively train a model in the case of insufficient tampered images, we design a training data generation strategy by resorting to Photoshop scripting, which can imitate human manipulations and generate large-scale training samples. Extensive experimental results show that the proposed approach outperforms state-of-the-art competitors when the model is trained with only generated images or fine-tuned with a small amount of realistic tampered images. The proposed method also has good robustness against some common post-processing operations.

Index Terms— Image forensics, image tampering localization, fully convolutional network, densely convolutional network, dilated convolution, Photoshop scripting.

Manuscript received August 17, 2020; revised January 4, 2021; accepted March 21, 2021. Date of publication April 1, 2021; date of current version April 19, 2021. This work was supported in part by the Key-Area Research and Development Program of Guangdong Province under Grant 2019B010139003; in part by the NSFC under Grant U19B2022, Grant 61802262, Grant 61872244, and Grant 61772349; in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2019B151502001; in part by the Shenzhen Research and Development Program under Grant JCYJ20200109105008228 and Grant JCYJ20180305124325555; and in part by the Alibaba Group through the Alibaba Innovative Research (AIR) Program. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. David Doermann. (Corresponding author: Haodong Li.)

The authors are with the Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, China, also with the Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen 518060, China, and also with the Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen 518060, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TIFS.2021.3070444

1556-6021 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

NOWADAYS, digital images captured with convenient acquisition devices are increasingly used as evidence. However, digital images can be easily tampered with by powerful and popular image editing software, such as Photoshop, while few visually perceptible traces are left. Therefore, the identification of tampered images has become an urgent issue.

While distinguishing whether an image has been tampered with or not is addressed in many image forensic methods [1]–[3], localizing tampered image regions has attracted more and more attention from researchers in recent years [4]–[6]. Conventionally, tampering localization was performed on an image by inspecting abnormal local traces, such as JPEG compression artifacts [6]–[9], photo response non-uniformity (PRNU) [10], [11], noise level [12], and statistics of local descriptors [13]. With the success of deep learning in various computer vision tasks, such as object detection and semantic segmentation, many methods based on deep learning have been developed for image tampering detection and localization. Among them, some methods [14]–[17] perform tampering detection at the image or patch level with convolutional neural networks (CNN), and they can be adapted for tampering localization through sliding-window analysis. More recently, some methods [18]–[20] have been proposed to achieve pixel-level or region-level tampering localization directly. Such methods are based on different network structures, including fully convolutional networks (FCN) [18], Faster R-CNN [19], long short-term memory (LSTM) cells [20], etc.

Despite the increasing emergence of image tampering localization approaches, exposing tampered image regions remains challenging in practice. One of the reasons is related to the sophistication of image tampering. Usually, image tampering is carried out by manipulating image regions, including splicing, copy-move, removal, etc. Instead of performing such manipulations in a naïve fashion, a forger would tweak the tampered regions with some post-processing (e.g., resizing, rotation, contrast/brightness adjustment, denoising) so as to conceal visually perceptible tampering traces. These post-processing operations are conveniently available in image editing software, especially the most widely used Photoshop. With Photoshop, a variety of editing tools and operations can be utilized in image tampering, resulting in visually plausible images where the tampering traces are complicated and subtle. Therefore, it is difficult to extract or learn features that are discernible between tampered and original regions. Another reason is the lack of training samples. While machine learning, especially deep learning, has shown its effectiveness for tampering localization, such a solution is data-hungry. Unfortunately, collecting large-scale realistic tampered images (as well as their ground-truths) is time-consuming and laborious. Without sufficient training data, it is hard to obtain a reliable tampering localization model.


In this paper, we try to address the problem of tampering localization by focusing on unveiling the tampering traces left by Photoshop. As Photoshop is popularly involved in image tampering, the traces left by its tools and operations are representative. Consequently, we can identify a broad range of tampered regions by detecting the editing tools and operations in Photoshop. In order to capture tampering traces and provide pixel-level predictions, we design an encoder-decoder architecture that is based on a fully convolutional network. In the proposed architecture, dense connections are adopted to enhance implicit supervision and encourage feature reuse, and dilated convolutions are used to enlarge the receptive field while keeping the resolutions of feature maps. By means of these, the architecture can achieve better localization performance. For the sake of training a model in the lack of realistic tampered images, we employ Photoshop scripting to imitate image tampering in reality and create large-scale tampered images. With such a data generation strategy, it is possible to collect sufficient training images conveniently. We have conducted extensive performance evaluations on three tampered image datasets. The experimental results show that the proposed method outperforms several state-of-the-art methods, no matter whether the model is trained with only the generated tampered images or is further fine-tuned with some realistic tampered images. The proposed method also has good robustness against some common post-processing, including JPEG compression, resizing, cropping, and additive Gaussian noise.

The main contributions of this work are as follows.
• We consider the task of tampering localization by focusing on the detection of operations and tools used in Photoshop. Since Photoshop is widely used for image tampering, detecting the traces left by the operations and tools of Photoshop can directly attain the localization of tampered image regions.
• We propose an encoder-decoder architecture for image tampering localization, which comprises dense connections and dilated convolutions for improving the tampering localization performance. Extensive experimental results show that the proposed architecture outperforms state-of-the-art competitors on three datasets.
• We introduce an effective strategy to generate large-scale tampered images for training. This strategy uses Photoshop scripting to imitate the practical tampering process, so that we can obtain sufficient training samples without expensive laborious efforts.

The remainder of this paper is organized as follows. In Section II, we present the background knowledge on Photoshop tampering and introduce the related works. In Section III, we elaborate the proposed architecture and training data generation strategy. The performance evaluations and comparisons to existing methods are depicted in Section IV. Finally, we draw our conclusions in Section V.

II. BACKGROUND AND RELATED WORK

A. Image Tampering With Photoshop

TABLE I. Commonly Used Operations and Tools in Photoshop.

As a piece of powerful image editing software, Photoshop has been widely used in image tampering, resulting in the ubiquitous dissemination of fake images. Generally, a forger would first use Photoshop to perform tampering operations, such as selection, scaling, rotation, and movement, on a certain region in an image. Then, a variety of editing tools would be used to adjust the sharpness or color shade of the tampered region, aiming to conceal visually inconsistent traces. For example, one can splice a region from one image into another one in Photoshop, and apply the blur tool at the boundary of the spliced region to alleviate the high contrast between the tampered region and the original region. The commonly used tampering operations and tools in Photoshop are summarized in Table I. Please note that these operations and tools can be used separately or together in practical image tampering situations, and which one was precisely used is unknown to the forensic investigator.

We have investigated the usage frequency of the tools listed in Table I on the artificially created PS-boundary dataset (refer to Section IV). We find that more than 61% of the tampered images were processed with the blur tools in Photoshop, and more than 87% of the tampered images were subjected to one or more of the tools in Table I, meaning that these tools are the most frequently used ones in Photoshop. Thus, detecting the traces left by these tools can expose a large proportion of images tampered with Photoshop.

B. Related Work

In image tampering detection and localization, both in-camera and out-camera artifacts can be exploited. In-camera artifacts are left by the camera's internal processing operations. Several methods have been proposed to capture in-camera traces, such as lens aberration [21], [22] and color filter array (CFA) [23], [24]. Another more general in-camera artifact called the noise pattern has also been used to identify tampered images [12], [25]. In addition to in-camera artifacts, some works focus on out-camera artifacts that are left by operations external to the camera. For example, JPEG compression traces, such as block artifacts [26] and double JPEG artifacts [16], [27], [28], are used to detect and localize tampered regions. The distribution of the DCT coefficients is used to localize the tampered regions [7], [29].


Other methods aim to detect the traces left by median filtering [30], resampling [31], contrast enhancement [32], scaling [33], etc. One limitation of these methods is that they target only a single tampering trace. When the image undergoes multiple manipulations, the tampering detection and localization performance of these methods will be greatly degraded.

In addition to detecting specific tampering operations, some methods based on deep learning have been proposed for detecting more general tampering operations. Among them, some approaches [14], [15], [34] are able to achieve tampering localization with sliding-window analysis. However, their localization performance depends on the size of the sliding window. To get around this issue, a possible solution is to use FCN. Salloum et al. [18] proposed a multi-task FCN that integrates the edge and the interior of the tampered area to achieve splicing localization. To localize more general tampering operations such as copy-move and removal, other methods [19], [20], [35] have been proposed. A two-stream network structure that utilizes both content-related features and residual features is proposed in [19] to achieve region-level tampering localization. Bappy et al. [20] proposed an encoder-decoder architecture to analyze the differences between manipulated and non-manipulated regions with resampling features and LSTM, and the decoder can directly produce pixel-level tampering localization results. In [35], a feature extractor is firstly learned from 385 types of image manipulations, and then image tampering localization is achieved by solving a local anomaly detection problem. Since deep learning is data-driven, plenty of tampered images are needed for training a model. Some previous works [19], [20] utilized image segmentation datasets, such as MS-COCO [36], to generate tampered images with the provided pixel-level segmentation annotations. However, these tampered images are generated with some simple operations, and thus they often contain obvious artifacts that are easy to expose. Therefore, a model trained with these tampered images usually obtains poor performance when dealing with tampered images in real applications.

III. PROPOSED METHOD

In this section, we elaborate our solution for localizing image regions tampered with Photoshop, including an encoder-decoder architecture based on deep networks and a strategy for generating large-scale training samples.

A. Approach Overview

The primary purpose of the proposed method is to design a network architecture to effectively capture tampering traces. Since various tools and operations in Photoshop are used in practice, it is tricky to model tampering traces. Some previous works [14], [15] try to learn features of tampering traces block by block; hence the block size heavily affects tampering localization performance. Besides, it is also hard to perform real-time detection on high-resolution images since the number of blocks is huge. Some recent works [18], [20] have alleviated these issues by using fully convolutional networks, which can produce pixel-level localization results directly. However, in these works, the last output feature maps in the encoder part are usually of very low resolutions (1/16 [20] or 1/32 [18] of the input image), leading to the loss of some detailed tampering traces and thus deteriorating the final localization results output in the decoder part. To overcome these shortcomings, we specifically include dense connections and dilated convolutions in the proposed encoder-decoder architecture. In this way, the network is able to capture more subtle tampering traces within the image and obtain finer feature maps for making predictions.

Another purpose of the proposed method is to seek a way to prepare sufficient training images. Since it is hard to artificially prepare large-scale tampered images, we resort to an alternative and automatic way. Basically, if a forger uses Photoshop to create tampered images, he or she is likely to use the operations and tools listed in Table I. We therefore employ a Photoshop script program to make use of these operations and tools to create tampered images, imitating the practical tampering process. In this way, we can generate a large number of tampered images programmatically, and train our model with the generated tampered images.

B. The Proposed Encoder-Decoder Architecture

In this subsection, we introduce the proposed network architecture for tampering localization. As shown in Fig. 1, the proposed architecture consists of an encoder and a decoder, where the former transforms the input image into discriminative feature maps and the latter outputs pixel-wise predictions by further processing the feature maps. To improve localization performance, we include dense blocks in both the encoder and decoder, and use dilated convolutions in the layers of dense blocks #4 and #5. The details are described as follows.

1) Dense Connection: In order to capture the subtle tampering traces, we utilize the dense connection proposed in [37]. Fig. 2 shows a typical dense block, which consists of four internal convolutional layers and one transition layer. Assuming the total number of internal convolutional layers in a dense block is L, the output of the l-th layer is defined as

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]), \quad l \in \{1, 2, \ldots, L\}, \tag{1}$$

where [x_0, x_1, ..., x_{l-1}] represents the concatenation operation that combines the feature maps produced in layers 0, ..., l-1, and H_l(·) represents three consecutive operations in the l-th layer: a 3 × 3 convolution, followed by batch normalization (BN) and ReLU activation. Assuming the l-th layer outputs k feature maps, the value of k is defined as the growth rate. Thus, the l-th layer's input [x_0, x_1, ..., x_{l-1}] has l × k channels in total. In [37], a transition layer, composed of a 1 × 1 convolution followed by a 2 × 2 average pooling, is used to reduce the number of output channels. In our network, we do not use a pooling operation after each dense block; hence each transition layer is a 1 × 1 convolutional layer.

In the proposed architecture, there are five dense blocks in the encoder network and two in the decoder network.


Fig. 1. The proposed architecture for image tampering localization. The encoder network contains five dense blocks and three 2 × 2 average pooling layers, and the decoder network contains two dense blocks, three transposed convolutional layers, and a 5 × 5 convolutional layer.

Fig. 2. A dense block. The four former layers are internal convolutional layers and the last layer is the transition layer.

Fig. 3. Dilated convolution with different dilated rates. The dark elements represent non-zero elements and the white elements represent zero elements.

There are two explanations for why the use of dense blocks can improve tampering localization performance. Firstly, with the dense connections, individual layers can receive additional supervision from the loss function through the shorter connections. This could provide stronger and more direct supervision on the kernel weights regarding the localization task. Since the tampering traces are subtle, stronger supervision is beneficial for learning more discriminative features. Secondly, the dense connections remarkably encourage feature reuse. In this way, latter layers within a dense block receive features learned by their different preceding layers, increasing the variation of the input and thus promoting the layers to learn effective features for various tampering traces. According to our experimental results, incorporating dense connections into the network indeed improves the performance; please refer to Section IV-B for details.
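To make the structure of Eq. (1) concrete, the following sketch builds one dense block in TensorFlow/Keras: every internal layer applies a 3 × 3 convolution (optionally dilated), batch normalization, and ReLU to the concatenation of the block input and all previous layer outputs, and a 1 × 1 transition convolution (without pooling) produces the block output. This is only an illustrative sketch following the description above, not the authors' released implementation; num_layers, growth_rate, and out_channels are placeholders for the values listed in Table II.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers, growth_rate, out_channels, dilation_rate=1):
    """One dense block: each internal layer sees the concatenation of the
    block input and all previous layer outputs (Eq. (1)); the transition
    layer is a 1x1 convolution without pooling."""
    features = [x]
    for _ in range(num_layers):
        inp = features[0] if len(features) == 1 else layers.Concatenate()(features)
        h = layers.Conv2D(growth_rate, 3, padding="same",
                          dilation_rate=dilation_rate, use_bias=False)(inp)  # 3x3 conv
        h = layers.BatchNormalization()(h)                                   # BN
        h = layers.ReLU()(h)                                                  # ReLU
        features.append(h)
    merged = layers.Concatenate()(features)
    return layers.Conv2D(out_channels, 1, padding="same")(merged)  # transition layer

# Example with placeholder settings; dilation_rate=2 gives the dilated variant
# used in the fourth and fifth encoder blocks described below.
inputs = tf.keras.Input(shape=(512, 512, 3))
block_out = dense_block(inputs, num_layers=2, growth_rate=32, out_channels=64)
```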
2) Dilated Convolution: In tampering localization, just learning features from narrow local regions would result in volatile feature representations. Thus, it is necessary to utilize larger-range information for feature learning and capture more tampering traces, meaning that the convolutional layers should have larger receptive fields. Traditionally, there are two possible ways to increase the receptive field: (i) using a larger convolutional kernel, and (ii) adding a pooling layer before the next convolutional layer. A significant drawback of the former is that the number of parameters increases as the receptive field grows. The second way, however, would reduce the feature map resolution, causing the loss of some detailed traces in small tampered regions. To overcome these limitations, we utilize the dilated convolution proposed in [38]. Fig. 3 shows that the dilated convolution can increase the size of the receptive field by padding adjacent elements of the convolutional kernel with zeros. The receptive field of the kernel can be controlled by the dilated rate r. As the number of layers increases, the receptive field grows exponentially while no additional parameters are added and the resolutions of feature maps are kept, thus avoiding the loss of useful information for tampering localization.

In our proposed architecture, dilated convolutional layers are used in the fourth and fifth dense blocks. Fig. 4 shows the feature map responses with and without dilated convolutions in these two blocks. The feature map responses are obtained by performing linear discriminant analysis (LDA) to reduce the channels of the output feature maps to one. We observe that with dilated convolutions the feature responses can cover the tampered regions more completely, indicating that better localization performance can be obtained. This is supported by the final predictions shown in Fig. 4. Based on our experimental results, using dilated convolutions can increase the average F1-score by a relative improvement of about 4% (refer to Section IV-B for details).
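As a quick check of the receptive-field argument, the helper below computes the receptive field of a stack of stride-1 convolutions for given kernel sizes and dilation rates. It is generic arithmetic rather than code from the paper, and simply illustrates that dilation enlarges the field without adding parameters or reducing resolution.

```python
def receptive_field(kernel_sizes, dilation_rates):
    """Receptive field of a stack of stride-1 convolutions: each layer with
    kernel size k and dilation rate d enlarges the field by (k - 1) * d pixels,
    while its parameter count is unaffected by d."""
    rf = 1
    for k, d in zip(kernel_sizes, dilation_rates):
        rf += (k - 1) * d
    return rf

# Two plain 3x3 layers vs. two 3x3 layers with dilation rate 2
print(receptive_field([3, 3], [1, 1]))  # 5
print(receptive_field([3, 3], [2, 2]))  # 9
```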
3) Architecture Details: We use seven dense blocks in total in the proposed encoder-decoder architecture; the settings for each dense block are shown in Table II. As shown in Table II, each dense block is composed of several internal convolutional layers (four in the first block and two in the others) and a transition layer as the output layer. In the fourth and fifth dense blocks, dilated convolutional layers rather than normal convolutional layers are used. We set the kernel size as 3 and the dilated rate as 2 in each dilated convolutional layer. The kernel size of almost every normal convolutional layer is 3 × 3, except the last one in the decoder network, whose kernel size is 5 × 5.


The stride of each convolutional layer is set to 1 so that no downsampling is performed inside the dense blocks. As shown in Fig. 1, after the outputs of the first three dense blocks, the resolutions of the feature maps are reduced by 2 × 2 average pooling layers with stride 2.

Fig. 4. Two examples illustrating the intermediate feature maps obtained with and without dilated convolutions. In each subfigure, the first super-column shows the input image and its ground-truth; in the second super-column, the first row shows the feature map responses of blocks #4 and #5 and the final prediction output by the proposed architecture, respectively, while those in the second row are output by an architecture without dilated convolutions. (a) Example #1. (b) Example #2.

TABLE II. Parameters of Each Dense Block in the Proposed Network. The Number of Output Channels Is the Number of Channels in the Transition Layer.

Since three pooling layers are used in the encoder network, the resolution of the feature map is only 1/8 of the input image. In order to obtain pixel-level predictions, we need to increase the resolution to the input size. Therefore, three transposed convolutional layers are used in the decoder part, whose kernels are initialized as 4 × 4 bilinear kernels and whose convolution strides are set to 2. In order to reduce the checkerboard artifacts produced by the transposed convolutional layers, we pass the output of the last transposed convolutional layer into a 5 × 5 convolutional layer. Finally, we use a SoftMax layer to produce a tampering localization probability map.
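The decoder just described can be sketched as follows: stride-2 transposed convolutions with 4 × 4 kernels initialized as bilinear upsampling filters, a final 5 × 5 convolution to suppress checkerboard artifacts, and a softmax over two classes. This is a minimal sketch under the stated design, not the authors' code: the decoder dense blocks are omitted, channels=64 is a placeholder, and the bilinear-filter construction follows the common FCN recipe and is an assumption about how the initialization is implemented.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def bilinear_initializer(shape, dtype=None):
    """Fills a Conv2DTranspose kernel (k, k, out_ch, in_ch) with bilinear
    upsampling weights, mapping channel c to channel c."""
    k = shape[0]
    factor = (k + 1) // 2
    center = factor - 1 if k % 2 == 1 else factor - 0.5
    og = np.ogrid[:k, :k]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    weights = np.zeros(shape, dtype=np.float32)
    for c in range(min(shape[2], shape[3])):
        weights[:, :, c, c] = filt
    return tf.convert_to_tensor(weights, dtype=dtype or tf.float32)

def upsample_2x(x, channels):
    """4x4 transposed convolution with stride 2, initialized as bilinear upsampling."""
    return layers.Conv2DTranspose(channels, 4, strides=2, padding="same",
                                  kernel_initializer=bilinear_initializer)(x)

def decoder_head(encoder_features, channels=64, num_classes=2):
    # Three stride-2 transposed convolutions bring the 1/8-resolution encoder
    # output back to the input size (the two decoder dense blocks are omitted
    # in this sketch); a final 5x5 convolution reduces checkerboard artifacts
    # and a softmax yields the per-pixel tampering probability map.
    x = upsample_2x(encoder_features, channels)
    x = upsample_2x(x, channels)
    x = upsample_2x(x, channels)
    x = layers.Conv2D(num_classes, 5, padding="same")(x)
    return layers.Softmax(axis=-1)(x)

feat = tf.keras.Input(shape=(64, 64, 64))  # placeholder 1/8-resolution encoder output
prob_map = decoder_head(feat)
```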
C. Training Data Generation Strategy

Besides a well-designed network architecture, sufficient and adequate training samples also play an important role in obtaining a good tampering localization model. As mentioned above, however, it is not easy to collect large-scale training data. Some previous works [19], [20] address this issue by simulating tampered images through simple splicing or copy-move. Such approaches are suboptimal since they do not consider the various post-processing used in practice. In this work, we design a more reasonable strategy to generate training samples based on Photoshop scripting.¹ It provides a series of commands and programming interfaces that enable Photoshop to perform image editing tasks automatically.

The training sample generation strategy with Photoshop scripting is shown in Fig. 5. This Photoshop script program can be used to generate splicing and non-splicing images. In the case of generating splicing images, two images are firstly chosen at random; one is the background image and the other is the donor image. To imitate human operation, a specific region is selected as the manipulated region and then transformed with some common image operations, such as rotation and scaling. After that, region movement is adopted to generate the tampered image. In the case of generating non-splicing images, one image is chosen and then one region is randomly selected as the manipulated region. To make the tampered regions more difficult to detect, some common editing tools listed in Table I are used to eliminate the visually distinguishable traces. Finally, a JPEG compression quality commonly used in Photoshop is randomly chosen to save the tampered image as a JPEG file. By selecting different parameters for the operations, tools, and compression qualities, plenty of tampered images and their corresponding masks are generated. Though the generated tampered images are different from realistic tampered ones in visual contents/effects, they contain tampering traces introduced by Photoshop scripts. Hence, this strategy makes it possible to train a tampering localization model with sufficient tampered data, and it is better than previous ones [19], [20] as the generated tampered images are more complicated and multifarious.

¹ https://www.adobe.com/devnet/photoshop/scripting.html

Fig. 5. Diagram of the training data generation strategy based on Photoshop scripting. The tools listed in Table I are used for post-processing.
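The sampling logic of this strategy can be sketched in Python as below: the generator randomly decides between a splicing and a non-splicing case, picks source images, a manipulated region, geometric transformations, one or two editing tools, and a Photoshop JPEG quality, and packs them into a "recipe" that a Photoshop script would then execute. All numeric ranges and the tool list (beyond the blur tool) are illustrative assumptions, since Table I and the exact parameter ranges are not reproduced here; the actual generation is driven through Photoshop scripting rather than Python.

```python
import random

# Candidate post-processing tools; the blur tool is listed in Table I, the rest
# are plausible examples since the table itself is not reproduced here.
EDITING_TOOLS = ["blur", "sharpen", "smudge", "dodge", "burn"]
PS_JPEG_QUALITIES = list(range(8, 13))  # qualities commonly used in Photoshop

def make_recipe(image_pool, splicing_prob=0.5):
    """Samples one tampering 'recipe' in the spirit of Fig. 5; the dictionary
    is meant to be handed to a Photoshop script that performs the edits and
    exports the tampered JPEG together with its pixel-level mask."""
    splicing = random.random() < splicing_prob
    background = random.choice(image_pool)
    donor = random.choice(image_pool) if splicing else background
    return {
        "type": "splicing" if splicing else "non-splicing",
        "background": background,
        "donor": donor,
        # rectangular manipulated region, relative to image size (assumed ranges)
        "region": {"x": random.uniform(0.0, 0.6), "y": random.uniform(0.0, 0.6),
                   "w": random.uniform(0.1, 0.4), "h": random.uniform(0.1, 0.4)},
        "transform": {"rotation_deg": random.uniform(-30, 30),
                      "scale": random.uniform(0.7, 1.3)},
        "tools": random.sample(EDITING_TOOLS, k=random.randint(1, 2)),
        "jpeg_quality": random.choice(PS_JPEG_QUALITIES),
    }

if __name__ == "__main__":
    pool = ["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"]  # placeholder file names
    print(make_recipe(pool))
```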
IV. EXPERIMENTS

In this section, we evaluate the performance of the proposed method and compare our method with several existing state-of-the-art methods.

A. Experimental Setup

1) Datasets: Two datasets generated with Photoshop scripting are used to train the models in our experiments.


• PS-scripted book-cover dataset. This dataset was built from 1,200 images taken by 12 different mobile phone cameras. The contents of the images are book covers, which are used to simulate certificates and documents. By using the Photoshop script program, we generated 14,581 tampered images and their corresponding pixel-level tampering masks. The resolutions of the images range from 2592 × 3840 to 7296 × 5472.
• PS-scripted Dresden dataset. To prepare a corpus for training models with natural scene images, we applied the Photoshop script program to 7,876 original images downloaded from the Dresden dataset [39] and obtained 5,295 tampered images as well as their corresponding pixel-level tampering masks. The resolutions of the images range from 1944 × 2592 to 4352 × 3264.

For performance evaluation, the following three datasets are involved.
• Artificial PS dataset with post-processing on boundary (PS-boundary): This dataset was created from 1,000 book-cover images taken by 10 mobile phone cameras (different from those used in the PS-scripted book-cover dataset). The original images were tampered with by a group of five persons who are proficient in using Photoshop. To eliminate the inconsistency between the tampered region and the original region, the boundaries of the tampered regions were post-processed with certain Photoshop tools. For each tampered image, a corresponding pixel-level tampering mask and a file recording the used operations and tools were also produced. In total, 1,000 tampered images with post-processing on the boundary were obtained, whose resolutions range from 2340 × 4160 to 3744 × 4992.
• Artificial PS dataset with arbitrary post-processing (PS-arbitrary): In addition to modifying the boundaries of the tampered regions, an experienced forger may perform arbitrary manipulations on the tampered regions to conceal the tampering traces as much as possible. Considering such a practical image tampering scenario, we invited another group of 10 persons to make 1,000 forgeries with Photoshop from the above-mentioned 1,000 original images, and asked them to use any suitable operations and tools to make the tampered images visually indistinguishable from real ones. This dataset can well imitate practical tampering behaviors with Photoshop.
• NIST 2016 dataset (NIST-2016): The NIST 2016 dataset² contains 564 tampered images, each of which is subjected to one of three commonly used tampering operations, i.e., copy-move, splicing, and removal. All images are in JPEG format and are post-processed with unknown image editing software. The resolutions of the images range from 500 × 500 to 5616 × 3744.

In order to carry out the fine-tuning experiments described in Sec. IV-D, 10% of the tampered images were randomly selected from the above three datasets. Hence, the number of testing images in PS-boundary, PS-arbitrary, and NIST-2016 is 900, 900, and 508, respectively. Some tampered images and their corresponding ground-truths are shown in Fig. 6.

Fig. 6. Some tampered images and their corresponding ground-truths in PS-boundary (Row #1), PS-arbitrary (Row #2), and NIST-2016 (Row #3).

² https://www.nist.gov/itl/iad/mig/media-forensics-challenge

2) Comparative Methods: The following tampering detection and localization methods, including hand-crafted feature and deep learning (DL)-based approaches, were involved for performance comparison.
• ADQ1 [7] (hand-crafted feature): A method for detecting aligned double JPEG compression using the DCT coefficient distribution.
• DCT-based method [40] (hand-crafted feature): A method for detecting inconsistencies in JPEG DCT coefficient histograms.
• NADQ [6] (hand-crafted feature): A method for detecting non-aligned double JPEG compression by analyzing 8 × 8 DCT blocks.


• Chen's method [14] (DL-based): A method for detecting median filtering. The first layer of the network is fixed to compute the median filtering residual of the input image.
• Bayar's method [15] (DL-based): A method for general image manipulation detection. The first layer is learned in a constrained manner, acting as high-pass filtering.
• MFCN [18] (DL-based): A multi-task FCN for splicing localization, which integrates the edge and the interior of the tampered region for achieving predictions.
• LSTM-EnDec [20] (DL-based): This method performs tampering localization by fusing resampling features represented with LSTM and spatial features extracted by a CNN.
• Forensic-similarity [34] (DL-based): This method reveals tampered regions by comparing the similarities of image blocks in a sliding-window manner with a CNN-based network. The size of the image blocks was set to 256 × 256 in our experiments.
• Mantra-net [35] (DL-based): This method first learns tampering traces from many image manipulations, and then performs tampering localization via local anomaly detection.

TABLE III. Configurations of the Network Variants.

Fig. 7. Training losses of different network variants.

TABLE IV. F1-Scores Obtained by Different Network Variants.

We adopted the codes of ADQ1 [7], the DCT-based method [40], and NADQ [6] as implemented in [41], and re-implemented Chen's method [14], Bayar's method [15], and MFCN [18] based on the corresponding papers. Since Chen's and Bayar's methods are originally designed for image-level detection rather than pixel-level localization, we applied these two methods in a sliding-window manner. Two window sizes were chosen, i.e., 32 × 32 and 64 × 64, and adjacent windows were overlapped by one quarter of each other. For the remaining methods, we used the released code of LSTM-EnDec³ and Forensic-similarity⁴ for training and testing, and used the trained model and the released inference code of Mantra-net⁵ for testing.

³ https://gitlab.com/MISLgit/forensic-similarity-for-digital-images
⁴ https://github.com/jawadbappy/forgery_localization_HLED
⁵ https://github.com/ISICV/ManTraNet

3) Training and Testing Protocol: Since the proportion of tampered pixels in a tampered image is usually much lower than that of original ones, if the whole image is used for training, the model would be biased towards original pixels and it is difficult to learn the tampering traces. In addition, the mini-batch training approach is inapplicable when the image resolutions are different. To train the network with the mini-batch scheme, therefore, we extract 512 × 512 image blocks as training samples. The image blocks were selected under the constraint that the proportion of tampered pixels was from 15% to 30%. Based on our experiments, this setting obtains better performance than when no constraint is applied. In total, more than 670,000 blocks were extracted; 90% of them were used for training the network parameters and the remaining 10% were used for validation.
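A minimal sketch of the block selection is given below, assuming rejection sampling over random crop positions; the paper only states the 512 × 512 block size and the 15%–30% constraint, so the sampling procedure itself is an assumption.

```python
import numpy as np

def sample_training_block(image, mask, block=512, min_ratio=0.15, max_ratio=0.30,
                          max_tries=100, rng=None):
    """Randomly crops a 512x512 block whose proportion of tampered pixels lies
    within [15%, 30%]. `image` is HxWxC, `mask` is HxW with 1 marking tampered
    pixels; images are assumed to be larger than the block size.
    Returns (image_block, mask_block), or None if no valid crop is found."""
    rng = rng or np.random.default_rng()
    h, w = mask.shape
    for _ in range(max_tries):
        top = int(rng.integers(0, h - block + 1))
        left = int(rng.integers(0, w - block + 1))
        m = mask[top:top + block, left:left + block]
        if min_ratio <= m.mean() <= max_ratio:
            return image[top:top + block, left:left + block], m
    return None
```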
In order to get finer localization results, we applied data augmentation in the testing procedure. The testing images were augmented with horizontal and vertical flipping, 180° rotation, and transposing. All the counterparts, including the non-augmented and augmented ones, were fed into the model. The final localization result is obtained by averaging the output tampering localization maps.
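This test-time augmentation can be sketched as follows, where predict_fn stands for the trained model's forward pass on a single image. Whether the maps are averaged before or after thresholding is not specified in the paper, so raw probability maps are averaged here.

```python
import numpy as np

def predict_with_tta(predict_fn, image):
    """Test-time augmentation: horizontal/vertical flips, 180-degree rotation,
    and transposition; each prediction is mapped back to the original
    orientation and the five maps are averaged.
    `predict_fn` maps an HxWx3 image to an HxW probability map."""
    variants = [
        (image,                          lambda p: p),                 # original
        (image[:, ::-1],                 lambda p: p[:, ::-1]),        # horizontal flip
        (image[::-1, :],                 lambda p: p[::-1, :]),        # vertical flip
        (image[::-1, ::-1],              lambda p: p[::-1, ::-1]),     # 180-degree rotation
        (np.transpose(image, (1, 0, 2)), lambda p: p.T),               # transpose
    ]
    maps = [undo(predict_fn(aug)) for aug, undo in variants]
    return np.mean(maps, axis=0)
```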
4) Implementation Details: The proposed network was implemented with the TensorFlow deep learning framework. In the normal and dilated convolutional layers, the kernel weights were initialized with He initialization [42] and the biases were initialized with zero. As for the transposed convolutional layers, the kernel weights were initialized with bilinear kernels.


We adopted the Adam optimizer and used the cross entropy function as the loss function. The initial learning rate was set as 5 × 10⁻⁴ and was decreased by 50% after every epoch. The batch size was set to 16. We tested the model with the validation set at the end of each epoch, and chose the one that achieved the highest validation F1-score as the final model. All experiments were carried out with an Nvidia Tesla P100 GPU server. The source code is available at: https://github.com/ZhuangPeiyu/Dense-FCN-for-tampering-localization.
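This optimization setup translates into TensorFlow roughly as below; steps_per_epoch is an assumption derived from the number of training blocks and the batch size of 16, and the loss wrapper assumes integer-valued ground-truth masks fed against the softmax output.

```python
import tensorflow as tf

# Adam with an initial learning rate of 5e-4 halved after every epoch, and
# pixel-wise cross entropy. steps_per_epoch is a placeholder (~670k blocks
# * 0.9 / batch size 16), not a value from the paper.
steps_per_epoch = 37_000
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=5e-4,
    decay_steps=steps_per_epoch,
    decay_rate=0.5,
    staircase=True,          # drop by 50% at each epoch boundary
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()  # softmax output, integer masks
```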
TABLE V. Performance of Different Methods. The Models Used in the First Three Cases Are Trained With the PS-Scripted Book-Cover Dataset, While Those Used in the Last Case (Gray Background) Are Further Trained With the PS-Scripted Dresden Dataset.

5) Performance Metrics: Since image tampering localization is a pixel-level binary classification problem, the following pixel-level performance metrics were used:

$$F1 = \frac{2 \times TP}{2 \times TP + FN + FP},$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

$$IOU = \frac{TP}{TP + FP + FN},$$

where TP and FN represent the numbers of correctly classified and misclassified tampered pixels, respectively, and TN and FP represent the numbers of correctly classified and misclassified original pixels, respectively. As the final output of the network is a real-valued result, thresholding is needed for computing the metrics. Some previous works [18], [20] choose the best threshold for every single testing image. This is unreasonable in practice, since the corresponding ground-truth is unavailable. In this work, we chose the commonly used value of 0.5 as the threshold for all images when computing the metrics. In addition to the above metrics, we also calculated the pixel-level receiver operating characteristic curve and reported the area under the curve (AUC) for performance evaluation.
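These metrics can be computed from a probability map and a binary ground-truth mask as in the following sketch, using the fixed threshold of 0.5; handling of degenerate images (e.g., no tampered pixels at all) is left out and would need guarding in practice.

```python
import numpy as np

def pixel_metrics(prob_map, gt_mask, threshold=0.5):
    """F1, MCC, and IOU at a fixed threshold, following the definitions above
    (TP/FN count tampered pixels, TN/FP count original pixels)."""
    pred = prob_map >= threshold
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum(dtype=np.float64)
    tn = np.logical_and(~pred, ~gt).sum(dtype=np.float64)
    fp = np.logical_and(pred, ~gt).sum(dtype=np.float64)
    fn = np.logical_and(~pred, gt).sum(dtype=np.float64)
    f1 = 2 * tp / (2 * tp + fn + fp)
    mcc = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    iou = tp / (tp + fp + fn)
    return {"F1": f1, "MCC": mcc, "IOU": iou}
```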
B. Ablation Study

In this experiment, we performed an ablation study to validate the effectiveness of the proposed network. Generally, there are three factors that may affect the localization performance of the proposed network, i.e., i) the number of dense blocks in the encoder, ii) the use of dilated convolutional layers, and iii) the use of a dense block between two transposed convolutional layers. Therefore, we designed four network variants as shown in Table III. Among them, Variants #1, #2, and #3 have the same number of convolutional layers but different dense connections, and thus the numbers of dense blocks in the encoder networks of Variants #1, #2, and #3 are 5, 3, and 0, respectively; in Variant #4, the dense block between the transposed convolutions is removed. We trained these networks with the PS-scripted book-cover dataset and recorded the training loss during the training processes, as shown in Fig. 7. It is observed that the proposed network converged faster than the other variants and achieved the lowest loss after convergence. Then, we tested the trained models on the three testing datasets mentioned above. The obtained F1-scores are shown in Table IV. From Table IV we have the following observations.
• The number of dense blocks in the encoder. According to the localization performance shown in Table IV, we can see that Variant #1 achieves the best average F1-score on the three datasets. The relative improvements over Variants #2 and #3 are 1% and 4%, respectively. These results indicate that it is necessary to use enough dense blocks in the encoder network to improve the performance. Consequently, we decided to use 5 dense blocks in the encoder part of the proposed architecture.
• Dilated convolutional layer. By comparing the performance of Variant #1 with the proposed model, we observe that the use of dilated convolutions in the fourth and fifth dense blocks improves the F1-score by 0.01 or 0.02 on different datasets, meaning that expanding the receptive field without reducing feature map resolution via dilated convolution is beneficial for tampering localization.


• Dense block between two transposed convolutional layers. Compared with Variant #4, the F1-scores achieved by the proposed model improve by 0.03 on the PS-arbitrary and NIST-2016 datasets, meaning that adding a dense block between two transposed convolutional layers in the decoder network is helpful.

Based on these observations, we conclude that the structural components designed in the proposed network can indeed improve tampering localization performance.

Fig. 8. F1-scores obtained with different thresholds. From left to right are the results for the PS-boundary dataset, PS-arbitrary dataset, and NIST-2016 dataset, respectively.

TABLE VI. Performance of Different Methods After Fine-Tuning With 10% Tampered Images.

C. Performance for Training With PS-Scripted Images

In this experiment, we trained models with the PS-scripted datasets and evaluated the trained models with the testing datasets. In such a case, all the testing images come from datasets that are not involved in training. Please note that training is not applicable for ADQ1, DCT, NADQ, and Mantra-net; hence these methods performed the testing directly.

Firstly, we trained the models with the PS-scripted book-cover dataset, and obtained the testing results shown in Table V. It is observed that the proposed method outperforms its rivals in most cases. The F1-scores achieved by the proposed method are 0.61 and 0.57 for PS-boundary and PS-arbitrary, respectively, which are 0.06 and 0.26 higher than the second best one (i.e., Bayar's method with 64 × 64 sliding-window). For the NIST-2016 dataset, all methods do not perform well. This may be due to the fact that the training images are pictures of book covers while the testing images in NIST-2016 are natural scenes, and the different properties of the training and testing images lead to poor performance. Nevertheless, the proposed method achieves the best F1-score, MCC, and IOU among all the methods.

In order to investigate the performance when using training and testing images with the same property, we regarded the models trained with the PS-scripted book-cover dataset as pre-trained models, and further trained them with the PS-scripted Dresden dataset. The resulting models are tested on NIST-2016 again. The obtained results are shown in the bottom of Table V.


It is observed that the performance is generally improved by further training with natural scene images. Compared with the others, the proposed method achieves better performance in terms of F1, IOU, and MCC. However, it is noted that the localization performance for NIST-2016 is still not as good as that for PS-boundary and PS-arbitrary. A possible explanation is that more than one kind of image editing software was used to create the tampered images in NIST-2016, while all the tampered images used for training in this experiment were manipulated with Photoshop.

In order to further investigate the influence of different thresholds on the performance, we adjusted the threshold from 0 to 1 with a step of 0.01 and plotted the curves of F1-scores obtained with different thresholds in the top row of Fig. 8. It is observed that the proposed method achieves the best and most stable performance on all three datasets, while the other methods are more sensitive to the value of the threshold.

Fig. 9. Some localization results for example images in the PS-boundary dataset (Rows #1–#3), PS-arbitrary dataset (Rows #4–#6), and NIST-2016 dataset (Rows #7–#10). The first and last columns are the tampered images and ground-truths, respectively. The other columns are the probability heatmaps obtained by different methods.

Fig. 10. The elapsed time for testing a 1024 × 768 image.

D. Performance After Fine-Tuning

In this experiment, we investigate how fine-tuning can further improve the tampering localization performance. Intuitively, it is hard to train a good DL-based tampering localization model with only a small amount of realistic tampered images. Fortunately, the proposed training data generation strategy alternatively provides a valuable training approach. That is, we first train a model by using sufficient PS-scripted tampered images, and then fine-tune the resulting model with the limited number of realistic tampered images at hand. Our preliminary experiments showed that fine-tuning a model pre-trained with PS-scripted images improved the F1-score by at least 0.2 compared to training a model from scratch. Following this way, we randomly selected 10% of the tampered images as fine-tuning samples from each of the three testing datasets. For the PS-boundary and PS-arbitrary datasets, the fine-tuning was performed on the model trained with the PS-scripted book-cover dataset, while for the NIST-2016 dataset, the fine-tuning was performed on the model trained with the PS-scripted Dresden dataset. Please note that fine-tuning is only applicable for the deep learning based methods.

After fine-tuning, we tested the fine-tuned models and obtained the results shown in Table VI.


From this table, we observe that the performance of almost all the methods has been improved by fine-tuning, and the proposed method achieves the most significant improvement (0.21, 0.10, and 0.07 for PS-boundary, PS-arbitrary, and NIST-2016, respectively). It is also observed that our method performs the best in terms of all four performance metrics after fine-tuning, outperforming the second places in terms of F1-scores by 0.12 (i.e., Bayar's method with 64 × 64 sliding-window), 0.32 (i.e., Bayar's method with 64 × 64 sliding-window), and 0.14 (i.e., LSTM-EnDec) for PS-boundary, PS-arbitrary, and NIST-2016, respectively. We also plot the curves of F1-scores obtained with different thresholds in the bottom row of Fig. 8. Among all the methods, the proposed one achieves the best performance over a wide range of thresholds.

To compare the performance intuitively, we show some tampering localization results in Fig. 9. From this figure, it is observed that the results output by our method are more consistent with the ground-truths. For example, the proposed method can localize small tampered regions (e.g., the small traffic sign buckets in the 7th tampered image) and achieve a low false alarm rate, while the other methods would produce many more false alarms even though they have recognized the tampered regions.

E. Computational Time

In this subsection, we compare the computational time of different methods. The evaluations of running time for the conventional hand-crafted feature based methods were carried out with MATLAB R2015b on a server equipped with an Intel(R) Xeon(R) E5-2695 v4 CPU, and those for the deep learning based methods were carried out with TensorFlow 1.10.1 on a server equipped with an Nvidia Tesla P100 GPU. Each method was tested separately with an image set containing 50 1024 × 768 images. Fig. 10 shows the average elapsed time for testing an image. From this figure, we observe that the method based on DCT features runs fastest among the conventional methods, and MFCN has the lowest computational time among the DL-based methods. As for the proposed method, it takes less than 0.5 seconds to test a 1024 × 768 image. It runs faster than most DL-based methods working in sliding-window manners and does not require much more time than the FCN-based methods. The relatively low computational time implies that our method can be deployed in practical applications.

Fig. 11. Pixel-level AUC scores for different JPEG compression qualities.

Fig. 12. Pixel-level AUC scores for different resizing factors.

Fig. 13. Illustration of the cropping operation.

F. Robustness Analysis

In this subsection, we evaluate the robustness of different methods by considering several common types of post-processing. To this end, we enlarged the PS-scripted book-cover dataset by applying resizing, cropping, and noise adding to the generated tampered images with different factors and saving them with different JPEG qualities. Subsequently, we trained models with the enlarged dataset and performed testing on the PS-boundary and PS-arbitrary datasets. Three methods, i.e., Bayar's 64 × 64, Forensic Similarity, and Mantra-net, were included for comparison, since they perform relatively well in the experiments conducted in Sections IV-C and IV-D.


Fig. 14. Pixel-level AUC scores for different cropping factors.

Fig. 15. Pixel-level AUC scores for different levels of additive Gaussian noise.

1) Robustness Against JPEG Compression: We compressed the tampered images in the PS-boundary and PS-arbitrary datasets using the JPEG compression qualities commonly used in Photoshop (i.e., PS8–PS12), and then fed them into the trained model mentioned above. The pixel-level AUC scores of different methods are shown in Fig. 11. It is observed that all four methods can resist JPEG compression to a certain extent. For the PS-boundary dataset, the proposed method slightly underperforms Forensic Similarity, especially for PS10. Nevertheless, for the PS-arbitrary dataset, in which the tampered images are subjected to more practical tampering operations, the proposed method achieves the best performance in all cases.

2) Robustness Against Resizing: In practice, a tampered image is likely to be subjected to post-resizing. To test the robustness against resizing, we used Photoshop to resize the images in PS-boundary and PS-arbitrary with a ratio from 0.5 to 2.0. The AUC scores for different resizing factors are shown in Fig. 12. We observe that the performance of all the methods is degraded when the resizing factor is lower than 1.0; this may be due to the fact that down-scaling removes many image details, which are important for tampering localization. Overall, the proposed method achieves the best robustness to resizing. Even when an image is reduced to 25% of its original size (i.e., the resizing factor is 0.5), the AUC of our method is still higher than 0.80.

3) Robustness Against Cropping: Cropping is commonly used as a post-processing operation. In this experiment, we cropped a bottom-right part from each image in PS-boundary and PS-arbitrary with a shift of (x, y), as shown in Fig. 13. The values of x and y are given by

$$\begin{cases} x = x_0 + 8 \times i, & x_0 \in [0, 7] \\ y = y_0 + 8 \times j, & y_0 \in [0, 7] \end{cases} \quad \text{s.t.} \quad \frac{(W - x) \times (H - y)}{W \times H} \geq \frac{2}{3},$$

where i and j are two positive random integers, and W and H are the width and height of the image, respectively.
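A possible way to sample cropping shifts satisfying the constraint above is sketched below; the sub-grid offsets are written x0 and y0 as in the reconstructed equation, and rejection sampling is an assumption since the paper does not describe how valid shifts were drawn.

```python
import random

def sample_crop_shift(width, height, max_tries=1000):
    """Draws a shift (x, y) for cropping the bottom-right part of an image:
    x = x0 + 8*i and y = y0 + 8*j with sub-grid offsets x0, y0 in [0, 7] and
    positive integers i, j, subject to the retained area being at least 2/3
    of the original image area."""
    for _ in range(max_tries):
        x = random.randint(0, 7) + 8 * random.randint(1, max(1, width // 8))
        y = random.randint(0, 7) + 8 * random.randint(1, max(1, height // 8))
        if (width - x) * (height - y) >= (2 / 3) * width * height:
            return x, y
    raise RuntimeError("no valid shift found")

# The cropped image is image[y:, x:]; the pair (x % 8, y % 8) selects one of
# the 64 categories relative to the 8x8 JPEG grid.
```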


According to the cropping factor (x, y), the cropped images are divided into 64 categories (with respect to the 8 × 8 JPEG compression grid). The performance of the four methods in the case of cropping is shown in Fig. 14. It is observed that our method performs the best against cropping on both datasets, especially for the PS-arbitrary dataset, where its AUC scores are higher than the others by about 0.1 for various cropping factors.

4) Robustness Against Gaussian Noise: Another commonly used post-processing operation is noise addition. In this experiment, we evaluated the four methods in the case that the testing images were corrupted with additive Gaussian noise. Please note that the noise levels are not fully matched in the training and testing phases. For the training images, Gaussian noise is added to make the resulting PSNRs 20dB, 30dB, or 40dB. For the testing images, the standard deviations of the Gaussian noise vary over 2, 5, 10, 15, 20, and 25, and the corresponding resulting PSNRs are 42dB, 34dB, 28dB, 25dB, 22dB, and 20dB, respectively. Such settings introduce more challenges. As shown in Fig. 15, the proposed method is more robust against additive Gaussian noise than the other methods. Among all the methods, Forensic Similarity has the poorest performance, because it relies on image block similarity, which may be heavily destroyed by additive noise.
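For reference, the quoted testing PSNRs follow directly from the noise standard deviation on 8-bit images via PSNR = 20·log10(255/σ); the small check below approximately reproduces the listed values.

```python
import numpy as np

# PSNR implied by additive Gaussian noise of standard deviation sigma on
# 8-bit images: sigma 2..25 gives roughly 42, 34, 28, 25, 22, 20 dB.
for sigma in (2, 5, 10, 15, 20, 25):
    print(sigma, round(20 * np.log10(255 / sigma), 1))
```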
5) Performance on Original Images: In addition, we tested the 1,000 original images that were used to build the PS-boundary and PS-arbitrary datasets. By thresholding the network outputs at 0.5, the obtained average pixel-level false alarm rate is 0.005, meaning that our method produces very few false predictions on original images.
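The false alarm rate reported here can be computed as in the following minimal sketch (variable names are assumed; each probability map is the network output for one original image):

```python
import numpy as np

def pixel_false_alarm_rate(prob_maps, threshold: float = 0.5) -> float:
    # Average fraction of pixels (wrongly) predicted as tampered on original images,
    # after thresholding each probability map at the given value.
    return float(np.mean([(p >= threshold).mean() for p in prob_maps]))
```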
V. CONCLUSION

In this paper, we propose a method for image tampering localization. We mainly focus on the editing operations and tools used in Photoshop, considering the fact that Photoshop is widely used in practice and its traces are representative. To learn tampering traces and achieve pixel-level localization, we design a fully convolutional network built from densely connected convolutional blocks, with dilated convolutions placed in certain blocks. By taking advantage of dense connections and dilated convolutions, the network is able to achieve good localization performance. To deal with the lack of training data, we employ Photoshop scripting to programmatically generate large-scale tampered training images. Extensive experimental evaluations show that the proposed method outperforms several state-of-the-art methods on three datasets. It is worth noting that two of the evaluated datasets were manually created from book-cover images, imitating tampering on pictures of documents and certificates, which may facilitate future research in image tampering localization. Besides, the proposed method is robust to several types of post-processing, including JPEG compression, resizing, cropping, and additive Gaussian noise.
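To make the dense-connection and dilated-convolution ideas summarized above concrete, a rough sketch of such a block is given below; the layer count, growth rate, and exact placement of dilation are our assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Sketch of a densely connected block: each layer receives the concatenation
    of all preceding feature maps, and the 3x3 convolutions may be dilated."""

    def __init__(self, in_channels: int, growth_rate: int = 16,
                 num_layers: int = 4, dilation: int = 1):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3,
                          padding=dilation, dilation=dilation),
            ))
            channels += growth_rate  # dense connectivity grows the input width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```

Setting dilation greater than 1 enlarges the receptive field without reducing spatial resolution, which is the general motivation for placing dilated convolutions in some of the blocks.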
In the future, we will further analyze the traces left by other image editing operations and tools, and accordingly optimize the proposed network architecture with appropriate structural components. Moreover, we will also try to incorporate the functions of other image editing software, such as GIMP, into the data generation script, so as to increase the diversity of training samples.

Peiyu Zhuang (Student Member, IEEE) received the B.S. degree from Shenzhen University, Shenzhen, China, in 2018, where he is currently pursuing the Ph.D. degree in information and communication engineering. His current research interests include multimedia forensics and deep learning.

Haodong Li (Member, IEEE) received the B.S. and Ph.D. degrees from Sun Yat-sen University, Guangzhou, China, in 2012 and 2017, respectively. He is currently an Assistant Professor with the College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China. His current research interests include multimedia forensics, image processing, and deep learning.

Shunquan Tan (Senior Member, IEEE) received the B.S. degree in computational mathematics and applied software and the Ph.D. degree in computer software and theory from Sun Yat-sen University, Guangzhou, China, in 2002 and 2007, respectively. From 2005 to 2006, he was a Visiting Scholar with the New Jersey Institute of Technology, Newark, NJ, USA. In 2007, he joined Shenzhen University, China, where he is currently an Associate Professor with the College of Computer Science and Software Engineering. His current research interests include multimedia security, multimedia forensics, and machine learning.

Bin Li (Senior Member, IEEE) received the B.E. degree in communication engineering and the Ph.D. degree in communication and information system from Sun Yat-sen University, Guangzhou, China, in 2004 and 2009, respectively. From 2007 to 2008, he was a Visiting Scholar with the New Jersey Institute of Technology, Newark, NJ, USA. In 2009, he joined Shenzhen University, Shenzhen, China, where he is currently a Professor. He is also the Director of the Shenzhen Key Laboratory of Media Security, the Vice Director of the Guangdong Key Laboratory of Intelligent Information Processing, and a member of the Peng Cheng Laboratory. His current research interests include multimedia forensics, image processing, and deep machine learning. Since 2019, he has been serving as a member of the IEEE Information Forensics and Security Technical Committee.

Jiwu Huang (Fellow, IEEE) received the B.S. degree from Xidian University, Xi'an, China, in 1982, the M.S. degree from Tsinghua University, Beijing, China, in 1987, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, in 1998. Since 2000, he has been with the School of Information Science and Technology, Sun Yat-sen University, Guangzhou, China. He is currently a Professor with the College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China. His current research interests include multimedia forensics and security. He serves as a member of the IEEE CASS Multimedia Systems and Applications Technical Committee and the IEEE SPS Information Forensics and Security Technical Committee. He is an Associate Editor of the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY.