Constrained Convolutional Neural Networks: A New Approach Towards General Purpose Image Manipulation Detection
Belhassen Bayar, Student Member, IEEE, and Matthew C. Stamm, Member, IEEE

Abstract—Identifying the authenticity and processing history of an image is an important task in multimedia forensics. By analyzing traces left by different image manipulations, researchers have been able to develop several algorithms capable of detecting targeted editing operations. While this approach has led to the development of several successful forensic algorithms, an important problem remains: creating forensic detectors for different image manipulations is a difficult and time consuming process. Furthermore, forensic analysts need ‘general purpose’ forensic algorithms capable of detecting multiple different image manipulations. In this paper, we address both of these problems by proposing a new general purpose forensic approach using convolutional neural networks (CNNs). While CNNs are capable of learning classification features directly from data, in their existing form they tend to learn features representative of an image’s content. To overcome this issue, we have developed a new type of CNN layer, called a constrained convolutional layer, that is able to jointly suppress an image’s content and adaptively learn manipulation detection features. Through a series of experiments, we show that our proposed constrained CNN is able to learn manipulation detection features directly from data. Our experimental results demonstrate that our CNN can detect multiple different editing operations with up to 99.97% accuracy and outperform the existing state-of-the-art general purpose manipulation detector. Furthermore, our constrained CNN can still accurately detect image manipulations in realistic scenarios where there is a source camera model mismatch between the training and testing data.

Index Terms—Image forensics, deep learning, convolutional neural networks, deep convolutional features.

This material is based upon work supported by the National Science Foundation under Grant No. 1553610. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
The authors are with the Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, 19104 USA (e-mail: [email protected], [email protected]).
Digital Object Identifier: 10.1109/TIFS.2018.2825953
1556-6021 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Information about image authenticity can be used in important settings, such as evidence in legal proceedings and criminal investigations. However, many commonly available tools can allow a user to make visually realistic image forgeries. As a result, image manipulation detection has become a very important task in multimedia forensics. To determine the authenticity and processing history of digital images, researchers have developed numerous forensic approaches during the last decade [1]. Specifically, it has been observed that image manipulations typically leave behind traces unique to the type of editing an image has undergone. Researchers design forensic algorithms that extract features related to these traces and use them to detect targeted image manipulations. This approach has been successful in detecting many types of image tampering such as resizing and resampling [2], [3], [4], [5], [6], median filtering [7], [8], [9], [10], contrast enhancement [11], [12], [13], [14], multiple JPEG compression [15], [16], [17], [18], etc.

Although research in image forensics has dramatically advanced, these approaches still suffer from important drawbacks. New editing operations are frequently developed and incorporated into editing software such as Adobe Photoshop. As a result, researchers must identify traces left by these new operations and design associated detection algorithms. This is difficult and time consuming since these algorithms are often designed from estimation and detection theory. Furthermore, the forensic algorithms described above are designed to detect a single targeted manipulation. As a result, multiple forensic tests must be run to authenticate an image. This results in several challenges such as fusing the results of multiple forensic tests and controlling the overall false alarm rate among several forensic detectors.

To address these issues, researchers have recently focused on developing general-purpose image forensic techniques to determine if and how an image has undergone processing. Tools from steganalysis have been adapted to perform general-purpose image forensics [19], [20]. Specifically, powerful steganalytic features called the spatial-domain rich model (SRM) [19] have been successfully used to perform universal image manipulation detection [20]. Kirchner et al. [7] showed the effectiveness of subtractive pixel adjacent matrix (SPAM) [21] features when performing median filtering detection. Furthermore, Fan et al. developed a general-purpose manipulation detector where image manipulation traces are learned from Gaussian mixture model (GMM) parameters of small image patches [22].

While these recent approaches have resulted in noticeable gains in detection accuracy, several important questions remain, including: How should low-level forensic feature extractors be designed? Do they have a common form? Can image tampering traces be learned directly from data? Are there alternative ways of extracting higher-level features for tampering detection from low-level forensic traces?

Deep learning approaches, particularly convolutional neural networks (CNNs), provide a potential answer to these questions. CNNs have attracted a significant amount of attention given their ability to automatically learn classification features
directly from data. They have been successfully used on a variety of different types of signals such as images [23], speech [24] and text data [25]. While CNNs provide a promising approach towards automatically learning image manipulation traces, in their existing form they are not well-suited for forensic purposes. This is because existing CNNs tend to learn features representative of an image’s content as opposed to manipulation traces, which are content-independent. As a result, forensic researchers may ask: Is it possible to force a CNN to learn manipulation detection features instead of features that represent an image’s content?

In this paper, we propose a new type of CNN architecture, called a constrained CNN, designed to adaptively learn image manipulation features and accurately identify the type of editing that an image has undergone. We use our constrained CNN to construct a powerful general-purpose manipulation detector called “MISLnet”, named after our lab, the Multimedia and Information Security Lab (MISL). To accomplish this, we propose a new type of convolutional layer called a constrained convolutional layer that forces a CNN to learn low-level manipulation features. Many forensic algorithms, such as resampling detectors [2], [3], median filtering detectors [7] and other forensic detectors based on steganalytic features like the SRM [20], operate by first extracting prediction residual features, then by forming higher-level features from these residuals. Inspired by this, our constrained convolutional layer is designed to only learn prediction error filters. This jointly suppresses the image’s content and adaptively learns low-level prediction residual features that are optimal for detecting forensic traces. Higher-level forensic features are learned from these residuals by deeper layers of our CNN.

Through a series of experiments, we show that our MISLnet architecture can automatically learn to detect multiple types of image editing directly from data. This removes the need for difficult and time consuming human analysis to design forensic detection features. Our results show that when given a comparable amount of training data, our constrained CNN can perform as well as or better than the state-of-the-art general-purpose detector based on steganalytic features [20]. Furthermore, we show that we can significantly improve the performance of our CNN-based detector by using very large amounts of training data that are computationally prohibitive for forensic detectors based on the SRM and its associated ensemble classifier. These results show that our proposed method can achieve 99.97% accuracy with five different tampering operations using large-scale data.

The major contributions of this paper are as follows: (1) We propose a CNN architecture that is capable of detecting image editing and manipulations. This CNN architecture is deeper and more sophisticated than the one we initially proposed in [26], and its design choices are systematically investigated through a series of experiments. (2) We introduce our proposed constrained convolutional layer, provide a detailed discussion of how it is constructed and trained, as well as provide intuition into why it works. (3) We conduct a large scale experimental evaluation of our MISLnet CNN architecture and show that it can outperform existing image manipulation detection techniques, can differentiate between multiple editing operations even when their parameters vary, can detect sequences of operations, can provide localized manipulation detection results, and can provide extremely accurate manipulation detection results when trained using a large scale training dataset.

The remainder of this paper is organized as follows: In Section II, we present an overview of CNNs in the literature. Then, in Section III we describe how CNNs are used in multimedia forensics tasks using the constrained convolutional layer. Section IV provides details about our proposed CNN architecture. In Section V we assess our proposed deep learning approach in adaptively extracting image manipulation features through a set of experiments. Finally, Section VI concludes our work.

II. CONVOLUTIONAL NEURAL NETWORKS

Deep learning approaches, such as convolutional neural networks [27], are an extended version of neural networks. Their architecture, which is the set of parameters and components that we need to design a network, is based on stacking many hidden layers on top of one another. This has proven to be very effective in extracting hierarchical features. That is, they are capable of learning features from a set of previously learned features. In a CNN architecture, the first layer is a set of convolutional feature extractors applied in parallel to the image using a set of several learnable filters. These filters work as a sliding window that convolves with all regions of the input image with an overlapping distance called the stride and produce outputs known as feature maps. Similarly, the hidden convolutional layers extract features from each lower-level feature map. Finally, the output of these hierarchical feature extractors is fed to a fully-connected neural network that performs classification.

The convolutional operation between the input feature maps and a convolutional layer within the CNN architecture is given in Eq. (1):

$$h_j^{(n)} = \sum_{k=1}^{K} h_k^{(n-1)} \ast w_{kj}^{(n)} + b_j^{(n)}, \tag{1}$$

where $\ast$ denotes a 2D convolution, $h_j^{(n)}$ is the $j$th feature map output in the $n$th hidden layer, $h_k^{(n-1)}$ is the $k$th channel in the $(n-1)$th hidden layer, $w_{kj}^{(n)}$ is the $k$th channel in the $j$th filter in the $n$th layer and $b_j^{(n)}$ is its corresponding bias term.

The filter coefficients in each layer are initially seeded with random values, then learned using the back-propagation algorithm [27], [28]. The convolutional layers are also followed by an activation function to introduce nonlinearity. The set of convolutional layers yields a large volume of feature maps. To reduce the dimensionality of these features, the convolutional layers are followed by another type of layer called pooling. This reduces the training computational cost of the network and decreases the chances of over-fitting. There exist many types of pooling operations such as max, average, and stochastic pooling. In particular, the max-pooling layer works also as a sliding window with a stride distance which
retains the maximum value within the dimension of a sliding window.

The training process of a CNN is done through an iterative algorithm that alternates between feedforward and backpropagation passes of the data. The weights of the convolutional filters and fully-connected layers are updated at each iteration of the backpropagation passes. Ultimately, we would like to minimize the average loss $E$ between the true class labels (i.e., unaltered, manipulated, etc.) and the network outputs, i.e.,

$$E = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{c} y_i^{*(k)} \log y_i^{(k)}, \tag{2}$$

where $y_i^{*(k)}$ and $y_i^{(k)}$ are respectively the true label and the network output of the $i$th image at the $k$th class, with $m$ training images and $c$ neurons in the output layer. A variety of solvers [29], [30], [31] have been proposed to minimize the average loss.

In this paper, we use stochastic gradient descent (SGD) to train our model [29]. The iterative update rule for the kernel coefficients $w_{ij}^{(n)}$ in a CNN during the backpropagation pass is given below:

$$\nabla w_{ij}^{(n+1)} = \epsilon \cdot \frac{\partial E}{\partial w_{ij}^{(n)}} - m \cdot \nabla w_{ij}^{(n)} + d \cdot \epsilon \cdot w_{ij}^{(n)}$$
$$w_{ij}^{(n+1)} = w_{ij}^{(n)} - \nabla w_{ij}^{(n+1)}, \tag{3}$$

where $w_{ij}^{(n)}$ represents the $i$th channel from the $j$th kernel matrix in the $n$th hidden layer that convolves with the $i$th channel in the previous feature maps of the $(n-1)$th layer, $\nabla w_{ij}^{(n)}$ denotes the gradient of $w_{ij}^{(n)}$ and $\epsilon$ is the learning rate. The bias term $b_j^{(n)}$ in (1) is updated using the same equations presented in (3). For fast convergence as explained by LeCun et al. in [28], we use the decay and momentum strategy, which are respectively denoted by $d$ and $m$ in (3).

III. CONSTRAINED CONVOLUTIONAL NEURAL NETWORK

A. Constrained convolutional layer

Instead of relying on hand-designed or predetermined features, we propose a CNN-based approach to image manipulation detection. Our approach is able to use data to directly learn the changes introduced by image tampering operations into local pixel relationships. We note that if CNNs in their standard form (such as AlexNet [23]) are used to perform image manipulation detection, they will learn features that represent an image’s content. This will lead to a classifier that identifies scene content associated with the training data as opposed to learning image manipulation fingerprints.

By contrast, our approach is designed to suppress an image’s content and adaptively learn image manipulation traces. To accomplish this, we propose a new type of convolutional layer, called a constrained convolutional layer, that is designed to be used in forensic tasks. It is inspired by a common process that we have observed in many existing forensic algorithms. Several existing algorithms first use a predetermined predictor to produce a set of pixel value prediction errors. These prediction errors are then used as low-level forensic features from which more sophisticated manipulation detection features are built. To mimic this process, our constrained convolutional layer is designed to only learn prediction error filters. The feature maps it produces correspond to prediction error fields that are used as low-level forensic traces.

The constrained convolutional layer is placed at the beginning of a CNN designed to perform a forensic task. This serves to suppress an image’s content (since prediction errors largely do not contain image content) and provide the CNN with low-level forensic features. Deeper layers in the CNN will learn higher-level manipulation detection features from these low-level forensic features.

To describe the constraints we enforce upon the constrained convolutional layer, we adopt the notational conventions that the superscript $(\ell)$ denotes the $\ell$th CNN layer, the subscript $k$ denotes the $k$th convolutional filter within a layer, and that the central value of a convolutional filter is denoted by spatial index (0,0). We force the CNN to learn prediction error filters by actively enforcing the following constraints

$$\begin{cases} w_k^{(1)}(0,0) = -1, \\ \sum_{(m,n) \neq (0,0)} w_k^{(1)}(m,n) = 1, \end{cases} \tag{4}$$

on each of the $K$ filters $w_k^{(1)}$ in the constrained convolutional layer during training. Fig. 1 depicts a set of $K$ constrained convolutional filters convolved with an input image.

The prediction error filter constraints in the constrained convolutional layer are enforced through the following training process. Training proceeds by updating the filter weights $w_k^{(1)}$ at each iteration using the stochastic gradient descent (SGD) algorithm during the backpropagation step, then projecting the updated filter weights back into the feasible set of prediction error filters by reinforcing the constraints in (4). The projection is done at each training iteration by first setting the central filter weight to -1. Next, the remaining filter weights are normalized so that their sum is equal to 1. This is done by dividing each of these remaining weights by the sum of all the filter weights excluding the central value. It is worth mentioning that experimentally we have found that using a central value larger in magnitude than one (e.g., setting the central value to -10,000 and ensuring the remaining values sum to 10,000) can help improve both the numerical stability of these filters and the convergence of our CNN’s loss¹. Doing this still produces prediction error filters of the same form, but the filter weights are proportionally larger. Pseudocode outlining this process is given in Algorithm 1.

¹ The Python scripts used to conduct the experiments can be found at misl.ece.drexel.edu/downloads or the project git repository https://gitlab.com/MISLgit/constrained-conv-TIFS2018/.

B. Analogy with existing forensic approaches

Many existing state-of-the-art approaches to perform image manipulation detection proceed by extracting hand-designed prediction residual features. Our proposed constrained convolutional layer can be viewed as a generalization of these non-adaptive feature extraction based approaches. Examples of these include resampling detectors that use the variance of prediction residues [3], [32], median filter streaking artifact
residuals [7], median filter residual features [8], [33], SPAM features [21], and rich model predictors [19]. These examples of prediction residual features suppress an image’s contents but still allow traces in the form of content-independent pixel value relationships to be learned by a classifier.

Fig. 1: The constrained convolutional layer. The red coefficient is -1 and the coefficients in the green region sum to 1.

Algorithm 1 Training algorithm for constrained convolutional layer
    Initialize $w_k$’s using randomly drawn weights
    i = 1
    while i ≤ max_iter do
        Do feedforward pass
        Update filter weights through stochastic gradient descent and backpropagate errors
        Set $w_k^{(1)}(0,0) = 0$ for all $K$ filters
        Normalize $w_k^{(1)}$’s such that $\sum_{(\ell,m) \neq (0,0)} w_k^{(1)}(\ell,m) = 1$
        Set $w_k^{(1)}(0,0) = -1$ for all $K$ filters
        i = i + 1
        if training accuracy converges then
            exit
        end if
    end while

To provide intuition into this, prediction residual features are formed by using some function $f(\cdot)$ to predict the value of a pixel based on that pixel’s neighbors within a local window. The true pixel value is then subtracted from the predicted value to obtain the prediction residual $r$ such that

$$r = f(I) - I, \tag{5}$$

where $I$ is the input image or image patch. Frequently, a diverse set of $K$ different prediction functions are used to obtain many different residual features.

It can easily be shown that the $K$ feature maps produced by a constrained convolutional layer in Eq. (4) are residuals of the form (5). A simple way to see this is to define a new filter $\tilde{w}_k$ as

$$\tilde{w}_k(m,n) = \begin{cases} w_k(m,n) & \text{if } (m,n) \neq (0,0), \\ 0 & \text{if } (m,n) = (0,0). \end{cases} \tag{6}$$

As a result, $w_k$ can be expressed as $\tilde{w}_k - \delta$ where $\delta$ is an impulse filter whose central value is 1 and 0 elsewhere. The feature map produced by convolving an image with the filter $w_k^{(1)}$ in the constrained convolutional layer is

$$h_k^{(1)} = w_k^{(1)} \ast I = (\tilde{w}_k - \delta) \ast I = \tilde{w}_k \ast I - I, \tag{7}$$

where $h_k^{(1)}$ is the $k$th feature map produced by the $k$th constrained filter in the first convolutional layer defined in Eq. (1). By defining $f(I) = \tilde{w}_k \ast I$, we can see these residuals are produced in the same manner that the above mentioned feature extraction based approaches produce prediction residuals $r$ in (5).

The residual predictors used in different multimedia forensic tasks [2], [3], [7], [20] take the form $\tilde{w}_k$ to predict local pixel relationships. Resampling detectors, for instance, operate by computing a probability measure called a p-map from a prediction residual $r$ of the form shown in Eq. (5). Then higher-level features in the frequency domain of the p-map are learned to detect resampling artifacts [2], [3]. To detect median filtering, Kirchner and Fridrich similarly compute low-level residual features [7]. These residual features capture streaking artifacts, then higher-level detection features are learned. Furthermore, the steganalytic rich model method used in forensics [20] operates in the same manner by building several local models of pixel dependencies to compute a diverse set of residual features. Then higher-level features are learned using the co-occurrence of these residuals [19].

As a result, our approach suppresses an image’s content in the same manner as prediction residual based methods. In order to capture a large number of different types of dependencies among neighboring pixels, a diverse set of $\tilde{w}_k$’s is typically used. The advantage of modeling the residual instead of the pixel values is that the image content is largely suppressed.

Unlike prior forensic methods which use fixed predictors, however, our approach adaptively learns good predictors for feature extraction through backpropagation. Nonlinearities can be further introduced by the subsequent application of activation functions and pooling layers in higher CNN layers. Thus, the constrained convolutional layer based approach unifies a significant body of research in multimedia forensics.

IV. NETWORK ARCHITECTURE

In the previous section, we showed how low-level image manipulation features can be adaptively extracted using the constrained convolutional layer approach. We use this approach to design a CNN that is able to identify the type of editing operations in an image. Fig. 2 depicts the overall design of our proposed architecture with details about the size of each layer. Our architecture consists of four different conceptual blocks and has the ability to: (i) jointly suppress an image’s content and learn prediction error features while training, (ii) extract higher-level representations of the previously learned image manipulation features and (iii) learn new associations between feature maps in deeper layers by using a block that consists of 1×1 convolutional filters. This type of filter learns a linear combination of features located at the same position but in a different feature map across channels. The output of the latter block is fed to the classification block which consists of three fully-connected layers. In this work, the input layer of our CNN is a grayscale image patch sized 256×256 pixels. In what follows, we present a detailed overview of our proposed conceptual blocks as well as the different layers that we have used in our CNN’s architecture.
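
The projection step of Algorithm 1 and the residual identity in Eq. (7) can be illustrated with a short NumPy sketch. It is not the authors' released implementation: the function names `project_constrained_filters` and `conv2d_valid`, the `scale` argument, the positive uniform weight initialization, and the 32×32 test patch are all assumptions made for illustration only.

```python
import numpy as np


def project_constrained_filters(weights, scale=1.0):
    """Project K filters, shape (K, h, w) with odd h and w, onto Eq. (4).

    `scale` is a hypothetical knob for the larger central magnitude discussed
    in Section III-A (e.g. 10000.0); scale=1.0 reproduces Eq. (4) exactly.
    """
    w = np.array(weights, dtype=np.float64)      # copy; the input is not modified
    ci, cj = w.shape[1] // 2, w.shape[2] // 2    # center, i.e. spatial index (0, 0)
    w[:, ci, cj] = 0.0                           # Algorithm 1: zero the center
    # Rescale the off-center weights so they sum to `scale`. This assumes the
    # sum is not close to zero, which holds for the positive init used below.
    w *= scale / w.sum(axis=(1, 2), keepdims=True)
    w[:, ci, cj] = -scale                        # Algorithm 1: center = -scale
    return w


def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution of a single-channel image, NumPy only."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]                 # convolution flips the kernel
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for a in range(kh):
        for b in range(kw):
            out += flipped[a, b] * image[a:a + out_h, b:b + out_w]
    return out


rng = np.random.default_rng(0)
filters = project_constrained_filters(rng.uniform(0.1, 1.0, size=(3, 5, 5)))
image = rng.uniform(0.0, 255.0, size=(32, 32))   # stand-in for an image patch

w_k = filters[0]
w_tilde = w_k.copy()
w_tilde[2, 2] = 0.0                              # Eq. (6): predictor without center

lhs = conv2d_valid(image, w_k)                   # h_k = w_k * I
rhs = conv2d_valid(image, w_tilde) - image[2:-2, 2:-2]   # w~_k * I - I, Eq. (7)
print("Eq. (7) holds numerically:", np.allclose(lhs, rhs))

flat = np.full((32, 32), 128.0)                  # content-free constant patch
print("max |response| on constant patch:", np.abs(conv2d_valid(flat, w_k)).max())
```

Because each projected filter sums to zero (off-center weights sum to 1, center equals -1), a constant patch maps to an essentially all-zero feature map, which is one way to see why this layer suppresses image content. Passing `scale=10000.0` mirrors the larger central value mentioned in Section III-A.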
Fig. 2: MISLnet: our proposed constrained CNN architecture; BN: Batch-Normalization Layer; TanH: Hyperbolic Tangent Layer.

A. Conceptual Blocks

a) Prediction error feature extraction: CNNs in their existing form tend to learn content-dependent features. Therefore, in our proposed architecture the first block consists of a constrained convolutional layer [26], [34]. This suppresses the content and constrains the CNN to learn prediction error features in the first layer [35], [36]. As a result, the first conceptual block learns low-level pixel value dependency features. These features are fragile and vulnerable to being destroyed by many types of nonlinear operations [37] such as pooling and activation layers, which are explained later. Therefore, the output of this block is directly passed to a regular convolutional layer.

b) Hierarchical feature extraction: In order to learn higher-level prediction error features, we use a conceptual block that consists of a set of three consecutive convolutional layers, each followed by batch normalization, activation function and pooling layers. Each convolutional layer will learn a new representation of feature maps learned by the preceding convolutional layer (lower-level features).

c) Cross feature maps learning: The previously learned hierarchical features are produced by learning local spatial association within a receptive field (local region/patch convolved with a filter) in the same feature map. Next, a new association is learned between these feature maps. In order to constrain the CNN to learn only association across feature maps, we use a 1×1 convolutional layer after the hierarchical feature extraction conceptual block. This has been demonstrated to improve the learning ability of CNNs in steganalysis [38]. In our architecture, this layer is also followed by batch normalization, activation function and pooling layers.

d) Classification: The deepest convolutional features learned by the previous conceptual block are directly passed to a classification block. This block consists of a regular neural network that will learn to classify the input data from the previously learned features throughout the CNN. To improve the performance of our CNN we use the deep features approach [39] that we explain in Section IV-G. In what follows we give more details about each type of layer we used in our CNN.

e) Differences from original architecture: Compared to our original CNN architecture proposed in [26], our new CNN architecture has gone through substantial design refinement. Specifically, this new architecture contains fewer filters in the constrained convolutional layer, uses a different filter size in the Conv3 (referred to as Conv2 in [26]) convolutional layer, uses a different number of filters in the Conv3 and Conv4 layers, includes one more ‘traditional’ convolutional layer than our architecture in [26], adds an additional 1×1 layer after the ‘traditional’ convolutional layers, uses a different type of pooling before the fully connected layers, uses different activation functions, uses batch normalization instead of local response normalization, contains a different number of neurons in each fully connected layer (this network uses noticeably fewer neurons in these layers), and uses the activations of the last fully connected layer as “deep features” that are provided to an extremely randomized tree classifier. These design choices have been motivated by an extensive series of experiments, the most important of which are discussed in detail in Section V. Additionally, while training this network we use a different initial learning rate as well as a learning rate that decreases during training ([26] uses a fixed learning rate) and train using a different batch size. Training information for this network is discussed in detail in Section V.

B. Convolutional Layers

From Fig. 2, one can notice that we use three different types of convolutional layers, namely one “Constrained Conv” layer which is the constrained convolutional layer presented in Section III, three regular convolutional layers and the 1×1 convolutional filters in “Conv5”. More specifically, a patch sized 256×256 from a grayscale input image is first convolved
with three different 5 × 5 constrained convolutional filters with a stride equal to 1. These filters learn the prediction error features between the estimated center pixel and its local neighbors. The constrained convolutional layer yields feature maps of prediction residuals of dimension 252 × 252 × 3.

To learn higher-level representative features and new associations between the prediction residual feature maps, we use three regular convolutional layers, namely “Conv2” with 96 filters of size 7 × 7 × 3 and stride of 2, “Conv3” with 64 filters of size 5 × 5 × 96 and stride of 1 and “Conv4” with 64 filters of size 5 × 5 × 64 and stride of 1. The output dimensions of these convolutional layers are respectively 126 × 126 × 96, 63 × 63 × 64 and 31 × 31 × 64. Finally, we use 128 different 1×1 convolutional filters with stride of 1 in “Conv5”. This type of layer learns the association across feature maps, i.e., a linear combination of features across channels located at the same spatial location. The output dimension of this convolutional layer is 15 × 15 × 128. From our architecture one can also notice that we use a batch normalization layer after every regular convolutional layer. A brief overview of the batch normalization layer is given in Section IV-E.

C. Fully-connected Layers

To identify the type of processing operation that an input image has undergone, the output of all these convolutional layers is fed to a classification block which consists of a fully-connected neural network defined by three layers. More specifically, the first two fully-connected layers contain 200 neurons. These layers learn new associations between the deepest convolutional features in the CNN. The output layer, also called the classification layer, contains one neuron for each possible tampering operation and another neuron that corresponds to the unaltered image class.

D. Activation Function

A convolutional layer is typically followed by a nonlinear mapping called an activation function. This type of function is applied to each value in the feature maps of every convolutional layer. There exist many types of activation functions. In computer vision applications, the ReLU activation function has been used successfully [23], [40]. He et al. [41] proposed another type of activation function called PReLU that led to surpassing human-level performance on a visual recognition challenge [42]. Additionally, Clevert et al. [43] proposed the exponential linear units (ELU) activation function, which considerably speeds up learning and obtains less than 10% classification error compared to a ReLU network with the same architecture.

In our proposed CNN, as depicted in Fig. 2 we propose to constrain the range of data values with the saturation regions of

operations including activation functions. In our experiments (see Section V-G), we compare the performance of our proposed TanH based CNN architecture to CNN models with different choices of activation functions that we mentioned above.

Finally, the output layer is followed by a softmax activation function. This type of activation function maps features learned by the last fully-connected layer to a set of probability values where the outputs of all neurons in this layer sum to 1. The identification of the image manipulation types in subject images can be performed by choosing the editing operation associated with the neuron in the softmax layer with the highest activation level.

E. Batch Normalization

Researchers in computer vision have developed several techniques to normalize the data throughout the CNN architecture. Early deep learning architectures use the local response normalization (LRN) layer which normalizes the central coefficient within a sliding window in a feature map with respect to its neighbors. Recently, Ioffe et al. proposed in [44] the batch normalization layer, which dramatically accelerates the training of deep networks. This type of mechanism minimizes the internal covariate shift, which is the change in the input distribution to a learning system.

This is done by a zero-mean and unit-variance transformation of the data while training the CNN model. The input to each layer gets affected by the parameters of all previous layers and even small changes get amplified. Thus, this type of layer addresses an important problem and increases the final accuracy of a CNN model. Therefore, in our proposed architecture we use a batch normalization layer after each regular convolutional layer. However, the outputs of the prediction error convolutional filters are passed directly to the next convolutional layer without using the batch normalization layer.

F. Pooling

In our CNN, we use two different types of pooling, i.e., three max-pooling layers and one average-pooling layer. We experimentally demonstrate our choice of pooling layers in Section V. Similarly to [23], we use an overlapping kernel with size 3×3 and stride of 2. Explicitly, the max-pooling layer retains the maximum value within the local neighborhood of the sliding window, whereas the average-pooling layer retains the average in a local neighborhood. The purpose of this type of layer is to reduce the dimensionality of the feature maps. This reduces the computational cost of training and decreases the chances of over-fitting. More specifically, the set of parallel convolutional operations yields a high dimensional feature maps volume. Therefore, pooling layers are useful for subsampling as well
hyperbolic tangent (TanH) activation function at every stage of as improving the accuracy by retaining the most representative
the network. Introducing nonlinearity throughout the network features.
layers strengthens CNN capability to separate the feature In our architecture, the four used pooling layers have
space. However, one can notice that feature maps learned respectively reduced the feature maps dimensions from
by the constrained convolutional layer are not followed by a 126×126×96 to 63×63×96, from 63×63×64 to 31×31×64,
TanH layer. This is mainly because the learned prediction error from 31×31×64 to 15×15×64 and finally from 15×15×128
features can easily be destroyed by many types of nonlinear to 7×7×128.
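The spatial sizes quoted for the four pooling layers can be checked with a few lines of arithmetic. The sketch below assumes no zero padding and an output size that is rounded up (ceil mode), which is how Caffe's pooling layer is understood to behave; neither detail is stated in the text, and `pooled_size` is a hypothetical helper name introduced only for this illustration.

```python
import math

def pooled_size(size: int, kernel: int = 3, stride: int = 2) -> int:
    """Spatial output size of an unpadded pooling layer, rounded up (assumed ceil mode)."""
    return math.ceil((size - kernel) / stride) + 1

# Spatial sizes entering the four pooling layers, as listed in the text.
inputs = [126, 63, 31, 15]
outputs = [pooled_size(n) for n in inputs]
print(outputs)  # [63, 31, 15, 7]
```

With floor rounding the first layer would give 62 rather than 63, so the reported sizes are only reproduced under the rounding-up assumption.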
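The zero-mean and unit-variance transformation described in Section IV-E can be illustrated for a single feature over one mini-batch. This is a minimal sketch of the standard batch normalization formula, not the Caffe layer used in our experiments; the scale `gamma`, shift `beta` and stabilizer `eps` follow the usual formulation and are assumptions here, as is the helper name `batch_normalize`.

```python
import math

def batch_normalize(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch to zero mean and unit variance,
    then apply a scale (gamma) and a shift (beta)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# Toy activations of one feature across a mini-batch of four samples.
batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_normalize(batch)

# The normalized values have (approximately) zero mean and unit variance.
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
```

At test time, implementations typically replace the batch statistics with running averages collected during training; that part is omitted from this sketch.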

G. Deep Convolutional Features

As depicted in Fig. 2, to classify the output features learned by the set of convolutional layers we use a neural network classifier with a softmax activation function in the output layer. However, other approaches to producing a final classification decision may work better than a fully-connected and softmax layer. Therefore, we propose to extract the output of the activation function [39] from the second fully-connected layer “FC2” by doing a feedforward pass of the training and testing data after completing the training of our CNN. Subsequently, we train an extremely randomized trees classifier using the newly collected data. Each 256 × 256 patch in the training and testing data has a corresponding vector of 200 features.

V. EXPERIMENTS

To assess the performance of our proposed constrained CNN for performing image manipulation detection, we conducted a set of experiments and analyses. In these experiments, we first evaluated our proposed CNN’s ability to detect a single manipulation. This was done for each of the editing operations and parameters listed in Table I. Next, we evaluated our CNN’s ability to be used as a multiclass classifier to perform general image manipulation detection with the five different editing operations listed in Table III. We also evaluated our approach in two more challenging scenarios, i.e., when the editing parameters are chosen to be arbitrary as listed in Table V, and when the editing operation can be followed by another editing operation. We compared our CNN-based approach to an existing general-purpose manipulation detection approach that uses steganalysis features [19] to perform manipulation detection [20]. Next, we compared the performance of our proposed architecture with different structural design choices, e.g., the choice of pooling and activation function layers. Finally, we tested the performance of our proposed CNN trained with a large scale dataset, then we evaluated it in a real world scenario. The results of these experiments demonstrate that our CNN can accurately learn detection features directly from data and achieve state-of-the-art performance.

In every experiment, we trained each CNN for 60 epochs, where an epoch is the total number of iterations needed to pass through all the data samples in the training set. Additionally, while training our CNNs, their testing accuracies on a separate testing dataset were recorded every 1,000 iterations to produce the tables and figures in this section. Note that the training and testing datasets were disjoint. We implemented all of our CNNs using the Caffe deep learning framework [45]. We ran our experiments using an Nvidia GeForce GTX 1080 GPU with 8GB RAM. The datasets used in this work were all converted to the lmdb format.

A. Single manipulation detection

In our first set of experiments, we used our proposed CNN architecture in Fig. 2 to distinguish between images edited with one particular type of manipulation and unaltered images. The output layer of our CNN consisted of two neurons, i.e., original (OR) versus manipulated image. As described in Section IV-G, we also used the deep convolutional features extracted from our CNNs to train an extremely randomized trees (ET) classifier in order to identify manipulated images in the testing data.

TABLE I: Editing parameters used to create our 10 experimental databases for CNN-based single manipulation detection.

Editing operation                               Parameters
Median Filtering (MF)                           Ksize = 3, 5
Gaussian Blurring (GB) with σ = 1.1             Ksize = 3, 5
Additive White Gaussian Noise (AWGN)            σ = 0.5, 2
Resampling (RS) using bilinear interpolation    Scaling = 1.2, 1.5
JPEG compression                                QF = 70, 80

To conduct these experiments, we created 10 different databases, where each database corresponded to one type of manipulation with a different editing parameter. Each database consisted of 60,000 grayscale images of size 256×256 pixels. These were created using images from the first IEEE IFS-TC image forensics challenge². We used 3,334 images of size 1024×768 for the training and testing data. Each image was converted to grayscale by retaining only its green color layer. Next, each grayscale image was divided into 256×256 pixel subimages, then the nine central subimages were retained.

To train our CNNs, we used 50,000 grayscale patches of size 256×256 for each type of manipulation. This consisted of 25,000 unaltered patches and 25,000 manipulated patches. These grayscale patches were created in the same manner described above by randomly selecting 2,778 images. Each block corresponds to a new image that has its corresponding tampered images created by the 10 different editing operations listed in Table I, i.e., five types of image manipulations with two different editing parameters for each.

When training our CNN, we set the batch size equal to 64 and the parameters of the stochastic gradient descent as follows: momentum = 0.95, decay = 0.0005, and a learning rate ε = 10⁻³ that decreases every six epochs (approximately 4,700 iterations) by a factor γ = 0.5. We trained the CNN in each experiment for 60 epochs (approximately 47,000 iterations).

To evaluate the performance of our proposed approach, we used 10,000 grayscale patches of size 256×256 for each type of manipulation listed in Table I. Each testing dataset consisted of 5,000 unaltered patches and 5,000 manipulated patches. These grayscale patches were made from the 556 images not used for training, in the same manner described above. Thus, each training dataset has its corresponding testing dataset for the 10 types of manipulations. Table II depicts the detection rate for every type of manipulation using the softmax classification layer and the ET classifier associated with the deep convolutional features extracted from the second fully-connected layer as explained in Fig. 2. We trained the ET classifiers by varying the number of trees from 100 to 600 with a step of 100, then we reported the best detection rates that our ET-based CNN achieved.

From Table II, one can notice that our proposed CNN can achieve a detection rate of at least 99.36% (JPEG with QF = 70) with all types of manipulations. Noticeably, it can achieve a 99.95% detection rate with the 5×5 Gaussian blurring database.

² Dataset website: http://ifc.recod.ic.unicamp.br/fc.website/index.py?sec=5
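The dataset sizes and the training schedule given in Section V-A can be cross-checked with simple arithmetic. The sketch below is only an illustration: it assumes a staircase ("step") schedule in which the learning rate is halved after every six full epochs, which is one natural reading of the description, and the helper names `iterations` and `learning_rate` are hypothetical.

```python
BATCH_SIZE = 64
TRAIN_PATCHES = 50_000     # 25,000 unaltered + 25,000 manipulated patches
PATCHES_PER_IMAGE = 9      # nine central 256x256 subimages per image

def iterations(epochs: int, n_samples: int = TRAIN_PATCHES, batch: int = BATCH_SIZE) -> float:
    """Number of mini-batch iterations needed to cover the given number of epochs."""
    return epochs * n_samples / batch

def learning_rate(epoch: int, base: float = 1e-3, gamma: float = 0.5, step_epochs: int = 6) -> float:
    """Assumed staircase schedule: multiply the rate by gamma every step_epochs epochs."""
    return base * gamma ** (epoch // step_epochs)

print(iterations(6))    # 4687.5, quoted as approximately 4,700
print(iterations(60))   # 46875.0, quoted as approximately 47,000

# Images needed for the unaltered half of the training and testing sets.
train_unaltered = 2778 * PATCHES_PER_IMAGE   # 25002, enough for 25,000 patches
test_unaltered = 556 * PATCHES_PER_IMAGE     # 5004, enough for 5,000 patches
```

Under this assumed schedule the learning rate is halved nine times over the 60 epochs, so the final epochs run at roughly 2 × 10⁻⁶.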
We can also notice that the ET classifier improved the detection rate of each corresponding manipulation except with the 5×5 Gaussian blurring database. Our ET-based CNN approach can achieve a detection rate of at least 99.55% (re-sampling with scaling 1.5) with all types of manipulations. Our ET-based CNN can noticeably achieve 99.97% in identifying JPEG compression with QF = 70. These results show the ability of our constrained CNN to extract good image manipulation features directly from data with different editing parameters for binary detection. Additionally, these results are very promising since our proposed deep learning approach was able to accurately detect several types of single manipulations using the same network architecture.

TABLE II: CNN identification rate for single manipulation detection.

Classifier   MF 3×3   MF 5×5   GB 3×3   GB 5×5   AWGN σ=0.5   AWGN σ=2   RS Scaling=1.2   RS Scaling=1.5   JPEG QF=70   JPEG QF=80
Softmax      99.58%   99.57%   99.74%   99.95%   99.82%       99.93%     99.40%           99.53%           99.36%       99.66%
ET           99.82%   99.71%   99.85%   99.87%   99.85%       99.96%     99.62%           99.55%           99.97%       99.89%

B. Multiple manipulation detection

In the previous experiments, our proposed constrained CNN was effective at extracting image manipulation features with different types of tampering for single manipulation detection. In this part, we evaluate our proposed approach in performing multiple image manipulation detection. Similarly to the previous set of experiments, we used the images from the first IEEE IFS-TC image forensics challenge website to create a database that consisted of 132,000 grayscale patches of size 256 × 256. We used 2,445 images of size 1024×768 for the training and testing data.

To train our CNNs, we used 100,000 grayscale patches of size 256×256, of which 16,667 patches were unaltered. These grayscale patches were made from 1,852 randomly selected images in the same manner described in the previous experiment. The altered patches were created using the five tampering operations listed in Table III. When training our CNN, we used the same stochastic gradient descent parameters that we used for the binary classifier, with a learning rate ε = 10⁻³ that decreased every six epochs (9,375 iterations) by a factor γ = 0.5. We trained the CNN in each experiment for 60 epochs (93,750 iterations). We subsequently created a testing dataset that consisted of 32,000 grayscale patches of size 256×256, of which 5,337 patches were not edited. These grayscale patches were made from the remaining 593 images not used for training, in the same manner described above. To create the altered patches we used the same editing operations listed in Table III.

TABLE III: Editing parameters used to create our experimental database for CNN-based general purpose manipulation detection.

Editing operation                               Parameter
Median Filtering (MF)                           Ksize = 5
Gaussian Blurring (GB) with σ = 1.1             Ksize = 5
Additive White Gaussian Noise (AWGN)            σ = 2
Resampling (RS) using bilinear interpolation    Scaling = 1.5
JPEG compression                                QF = 70

We used our trained CNN to classify each of the images in the testing dataset. The overall manipulation identification rate of our CNN was 99.66%. Table IV shows the confusion matrices of our two proposed methods. From this table, we can see that each type of manipulation was identified with an accuracy typically greater than 99%, except for the original and re-sampled images, which were detected with an accuracy of 98.70% and 98.87% respectively. These results demonstrate that our proposed CNN was able to both accurately detect tampered images and identify the type of tampering.

Similarly to our single manipulation detection approach, we compared the performance of using a softmax layer versus the “deep features” approach with an extremely randomized trees (ET) classifier. More specifically, we used the activated deep convolutional features [39] that we extracted from the second fully-connected layer of our network to train an ET classifier with 700 trees. In the rest of our experiments, all the ET classifiers were trained with the same number of trees, i.e., 700 trees. Our experimental results show that the ET classifier increased the overall classification rate from 99.26% to 99.66%. We compare the results of the ET-based CNN classifier to our proposed CNN with a softmax classification layer in Table IV. We can notice that the ET-based CNN method increased the identification rate of each tampering operation. The lowest detection accuracy was 99.46% for Gaussian blurring, which is still very high.

As we noted previously, the constrained convolutional layer was designed to suppress the scene and learn prediction error features. Fig. 3 depicts the output of the three filters learned by the constrained convolutional layer for three different grayscale images. One can notice that our proposed constrained convolutional layer was able to successfully suppress each image’s content.

C. Multiple manipulation detection with arbitrary parameters

We assessed the performance of our approach at performing image manipulation detection in a more general scenario where editing parameters are chosen to be arbitrary. To accomplish this, we created training and testing datasets using the same unaltered 256×256 grayscale patches that we collected in the previous experiment in Section V-B. We created modified patches using each of the manipulations and associated parameter values listed in Table V. When modifying a patch using a specific manipulation, the associated parameter was chosen uniformly at random from the set of possible values. Additionally, Gaussian blurring was implemented using OpenCV, which determines the blur variance as a function of the filter size according to the equation σ² = (0.3 × ((Ksize − 1) × 0.5 − 1) + 0.8)².

In total, our database consisted of 100,000 training patches and 32,000 testing patches. Table VI shows the confusion
matrices of our two proposed approaches, namely the softmax-based CNN and the ET-based CNN. Experiments showed that our approach can achieve 98.82% and 98.99% identification rates respectively with the softmax and the ET classifiers. Noticeably, MISLnet can achieve a 99.89% identification rate for JPEG compression with a continuous quality factor parameter between 60 and 90, which is particularly high. These results demonstrate that even in this more challenging scenario, our CNN is still able to accurately identify image manipulations. Additionally, in order to have a better representation for each type of manipulation, this task requires training a CNN with a larger training dataset compared to the task where images have been manipulated with fixed parameters. In Section V-H, we demonstrate that one can significantly improve the CNN’s performance using a larger training dataset.

TABLE IV: Confusion matrix for identifying the manipulations listed in Table III using MISLnet. Rows give the true class and columns the predicted class.

Softmax
True class   OR       MF       GB       AWGN     RS       JPEG
OR           98.70%   0.67%    0.01%    0.01%    0.13%    0.45%
MF           0.01%    99.08%   0.07%    0.00%    0.00%    0.82%
GB           0.00%    0.05%    99.15%   0.00%    0.00%    0.78%
AWGN         0.03%    0.00%    0.00%    99.96%   0.00%    0.00%
RS           0.05%    0.00%    0.01%    0.00%    98.87%   1.05%
JPEG         0.07%    0.00%    0.00%    0.00%    0.13%    99.79%

Extremely Randomized Trees
True class   OR       MF       GB       AWGN     RS       JPEG
OR           99.49%   0.15%    0.09%    0.02%    0.13%    0.11%
MF           0.07%    99.77%   0.11%    0.00%    0.04%    0.00%
GB           0.02%    0.41%    99.46%   0.00%    0.11%    0.00%
AWGN         0.02%    0.00%    0.00%    99.98%   0.00%    0.00%
RS           0.07%    0.36%    0.00%    0.00%    99.51%   0.06%
JPEG         0.06%    0.00%    0.00%    0.00%    0.15%    99.79%

Fig. 3: Output of the three learned filters in the “Constrained Conv” layer using three different grayscale images.

TABLE V: Editing parameters used to create our experimental database for CNN-based general purpose manipulation detection with arbitrary parameters; Gaussian blur with adaptive σ = 0.3 × ((Ksize − 1) × 0.5 − 1) + 0.8.

Editing operation                               Parameter
Median Filtering (MF)                           Ksize = 3, 5, 7, 9
Gaussian Blurring (GB) with adaptive σ          Ksize = 3, 5, 7, 9
Additive White Gaussian Noise (AWGN)            σ = 1.4, 1.6, ..., 2
Resampling (RS) using bilinear interpolation    Scaling = 1.2, 1.4, ..., 2
JPEG compression                                QF = 60, 61, ..., 89, 90

D. Manipulation chain detection with re-compression

To evaluate the performance of MISLnet in another challenging scenario, we used our CNN to identify an image’s manipulation history when more than one manipulation could be applied, followed by JPEG re-compression. To demonstrate this, we conducted another experiment where each image patch was edited by a sequence of up to two different manipulations, then JPEG compressed using a quality factor of 90. By re-compressing each image after manipulation, we can mimic conditions similar to those in social networking applications, which typically re-compress each image before distributing it.

We created data for this experiment by using 256 × 256 grayscale image patches collected from the Dresden Image Database [46]. This was done by retaining the 16 central blocks of the green color layer of each image. These image patches were then manipulated using a sequence of the following manipulations: median filtering (MF) using a 5×5 kernel, Gaussian blurring (GB) with σ = 1.1 using a 5×5 kernel, and resizing (RS) by a factor of 1.5 using bilinear interpolation. Each manipulation sequence consisted of up to two of these operations. We adopt the notation X-Y to
denote a sequence where the patch was first edited using manipulation X, then subsequently edited using manipulation Y (i.e., MF-RS corresponds to first applying median filtering, then applying resizing). We divided these into sets of training and testing patches created from two separate sets of images of total size 2,175, resulting in a set of 296,000 training patches and 52,000 testing patches for both. In total, 29,600 training patches and 5,200 testing patches were unaltered. We then trained MISLnet to distinguish between each sequence of editing operations.

TABLE VI: Confusion matrix for identifying the manipulations listed in Table V with arbitrary editing parameters using MISLnet. Rows give the true class and columns the predicted class.

Softmax
True class   OR       MF       GB       AWGN     RS       JPEG
OR           97.21%   0.47%    0.02%    1.78%    0.00%    0.53%
MF           0.04%    99.01%   0.24%    0.02%    0.02%    0.68%
GB           0.00%    0.71%    98.73%   0.00%    0.00%    0.56%
AWGN         1.56%    0.02%    0.00%    98.39%   0.00%    0.04%
RS           0.11%    0.36%    0.02%    0.00%    99.16%   0.36%
JPEG         0.09%    0.00%    0.00%    0.02%    0.00%    99.89%

Extremely Randomized Trees
True class   OR       MF       GB       AWGN     RS       JPEG
OR           97.73%   0.11%    0.04%    1.95%    0.06%    0.11%
MF           0.07%    99.59%   0.26%    0.02%    0.06%    0.00%
GB           0.04%    1.03%    98.93%   0.00%    0.00%    0.00%
AWGN         1.31%    0.02%    0.00%    98.61%   0.00%    0.06%
RS           0.13%    0.19%    0.02%    0.00%    99.64%   0.02%
JPEG         0.17%    0.26%    0.02%    0.04%    0.04%    99.47%

Table VII shows the confusion matrix obtained from this experiment when a softmax is used to perform manipulation chain classification, while Table VIII shows the confusion matrix obtained using an extremely randomized trees (ET) classifier. Results of these experiments demonstrated that we can achieve an accuracy of 92.90% with the softmax-based CNN and 94.19% using the ET-based CNN. From Table VIII, one can observe that we can identify the type of processing operation with an accuracy typically higher than 91%. Noticeably, we can achieve a 99.17% identification rate at detecting median filtering followed by resampling and 96.69% at detecting Gaussian blur followed by resampling, which is particularly high. Additionally, one can also observe that the ET classifier significantly improved the detection rate of the processing operations followed by median filtering (i.e., GB-MF and RS-MF). These results demonstrate the robustness of our proposed CNN at performing image manipulation detection in a challenging and realistic scenario.

E. Comparison with SRM-based approach

We compared our trained MISLnet CNN for multiple manipulations to the rich model approach [19], [20] using the same training and testing datasets described in Section V-B. Table IX displays a confusion matrix containing the manipulation detection results obtained using the rich model based approach. The results of these experiments showed that the rich model approach was able to achieve an average manipulation identification accuracy of 99.63%. By contrast, our constrained CNN was able to achieve an accuracy of 99.66% on the same dataset. From Tables IV and IX one can notice that our CNN achieved better identification rates for median filtering, Gaussian blurring and re-sampling.

These results demonstrate that our CNN based detector can perform as well as, or slightly better than, the rich model based detector. In Section V-H, we show that we can achieve an even higher detection accuracy with a larger amount of training data. These results are very important because our approach can learn salient image manipulation detection features directly from data. This may allow us to learn better feature extractors than the human-designed ones incorporated into the rich model.

Training time is an important factor when devising a data-driven manipulation detection approach. Our CNN based approach took approximately six hours to train on this database. By contrast, a multi-threaded implementation (using eight threads) of the rich model took over 58 hours to perform only feature extraction on this database using the same computer. Training the classifier for the SRM took several additional hours. As a result, it becomes extremely challenging, if not infeasible, to train the SRM on a very large database.

F. Prediction error feature extractor design choices

The overall performance of MISLnet depends on several design choices. An important one of these is the design of the first CNN layer, which extracts prediction error features in our network. To determine the optimal design of this layer, we conducted several experiments to examine the influence of other filter type choices for the first CNN layer, including the use of a fixed high-pass filter [37], [38], as well as not using a prediction error feature extraction block (i.e., beginning our CNN with a standard convolutional layer). Additionally, we conducted experiments to determine the optimal number and size of the filters in the constrained convolutional layer of MISLnet.

Choice of prediction error feature extractor: We evaluated the advantage of using the constrained convolutional layer as MISLnet’s first layer through two sets of experiments. In each experiment, we trained and evaluated our proposed CNN architecture using three different choices for the first layer: (1) using a constrained convolutional layer, (2) without using a constrained convolutional layer, and (3) replacing the constrained convolutional layer in MISLnet with a generic fixed high-pass filter. In these experiments, we used the same high-pass filter commonly employed in forensics and steganalysis [37], [38].

To assess the performance gains achieved by the constrained convolutional layer, we report the classification accuracy MISLnet achieved using each choice of filter for the beginning of our CNN. Additionally, we report the relative error reduction achieved by using the constrained convolutional layer instead of each alternative. Relative error reduction measures the reduction in error achieved by a classifier normalized by total
error reduction that is possible. For reference, the relative error reduction (RER) is calculated according to the formula RER = (e1 − e2)/e1, where e1 corresponds to the error achieved by the lower performing method and e2 corresponds to the error achieved by the higher performing method.

TABLE VII: Confusion matrix for identifying manipulation chains followed by re-compression using MISLnet with a softmax. Rows give the true class and columns the predicted class.

True class   OR       MF       GB       RS       MF-GB    GB-MF    MF-RS    RS-MF    GB-RS    RS-GB
OR           99.27%   0.06%    0.12%    0.52%    0.02%    0.00%    0.02%    0.00%    0.00%    0.00%
MF           0.13%    90.54%   0.02%    0.02%    1.08%    3.02%    0.10%    5.10%    0.00%    0.00%
GB           0.00%    0.23%    93.56%   0.25%    0.44%    0.04%    0.04%    0.04%    0.02%    5.38%
RS           0.19%    0.06%    2.12%    97.15%   0.04%    0.00%    0.12%    0.08%    0.00%    0.25%
MF-GB        0.00%    0.10%    0.23%    0.00%    98.08%   0.77%    0.02%    0.04%    0.23%    0.54%
GB-MF        0.00%    1.40%    0.12%    0.00%    8.65%    80.13%   0.08%    9.48%    0.04%    0.10%
MF-RS        0.02%    0.23%    0.00%    0.19%    0.06%    0.50%    97.69%   1.04%    0.27%    0.00%
RS-MF        0.00%    2.90%    0.02%    0.00%    1.65%    10.75%   0.21%    84.21%   0.06%    0.19%
GB-RS        0.00%    0.08%    0.00%    0.02%    0.87%    0.02%    0.94%    0.12%    93.94%   4.02%
RS-GB        0.00%    0.25%    1.06%    0.02%    1.23%    0.08%    0.06%    0.10%    2.71%    94.50%

TABLE VIII: Confusion matrix for identifying manipulation chains followed by re-compression using MISLnet with an ET classifier. Rows give the true class and columns the predicted class.

True class   OR       MF       GB       RS       MF-GB    GB-MF    MF-RS    RS-MF    GB-RS    RS-GB
OR           99.33%   0.06%    0.1%     0.05%    0.00%    0.00%    0.02%    0.00%    0.00%    0.00%
MF           0.15%    91.77%   0.02%    0.02%    0.52%    2.12%    0.29%    5.10%    0.00%    0.02%
GB           0.00%    0.21%    95.00%   0.87%    0.42%    0.04%    0.04%    0.00%    0.06%    3.37%
RS           0.17%    0.00%    0.65%    98.94%   0.02%    0.00%    0.08%    0.08%    0.02%    0.04%
MF-GB        0.00%    0.31%    0.19%    0.00%    95.87%   2.48%    0.02%    0.29%    0.42%    0.42%
GB-MF        0.00%    1.44%    0.00%    0.02%    3.38%    86.02%   0.13%    8.87%    0.06%    0.08%
MF-RS        0.02%    0.04%    0.00%    0.01%    0.00%    0.08%    99.17%   0.38%    0.21%    0.00%
RS-MF        0.00%    3.54%    0.02%    0.00%    0.17%    9.38%    0.65%    86.00%   0.17%    0.06%
GB-RS        0.00%    0.02%    0.00%    0.00%    0.40%    0.02%    0.96%    0.06%    96.69%   1.85%
RS-GB        0.00%    0.10%    2.08%    0.06%    0.63%    0.13%    0.04%    0.23%    3.56%    93.17%

TABLE IX: Confusion matrix for identifying the manipulations listed in Table III using the rich model. Rows give the true class and columns the predicted class.

True class   OR       MF       GB       AWGN     RS       JPEG
OR           99.83%   0.07%    0.00%    0.00%    0.02%    0.07%
MF           0.02%    99.23%   0.06%    0.00%    0.13%    0.56%
GB           0.08%    0.09%    99.42%   0.00%    0.04%    0.38%
AWGN         0.00%    0.00%    0.00%    100%     0.00%    0.00%
RS           0.17%    0.04%    0.00%    0.00%    99.47%   0.32%
JPEG         0.02%    0.04%    0.00%    0.02%    0.04%    99.89%

In our first experiment, we examined the effect of each filter choice when performing manipulation detection in the same manner as in Section V-B, i.e., on images altered using the manipulations listed in Table III. This was done by using the same training and testing datasets described in Section V-B, as well as the same training procedures (i.e., batch size, learning rate, etc.). Each trained CNN was then used to classify each image patch in the test set.

The results of this first experiment are shown in Table X. From this table, we can see that when MISLnet was trained without the constrained convolutional layer, its performance decreased by 0.90%. This corresponds to a relative error reduction of RER = 54.87% achieved by using the constrained convolutional layer. Additionally, we can see that using the constrained convolutional layer achieved a detection accuracy 0.31% higher than when a fixed high-pass filter was used. This corresponds to a relative error reduction of RER = 29.54% over a fixed high-pass filter. These results demonstrate the advantage of using the constrained convolutional layer.

The full benefit of using a constrained convolutional layer can be seen when considering more challenging forensic scenarios. To demonstrate this, we conducted another set of experiments where each image patch was edited by a sequence of up to two different manipulations, then JPEG compressed using a quality factor of 90. By re-compressing each image after manipulation, we can mimic conditions similar to those in social networking applications, which typically re-compress each image before distributing it.

We built our experimental database by using the same training and testing datasets as in Section V-D. We then trained MISLnet to distinguish between each sequence of editing operations using a constrained convolutional layer, a fixed
high-pass filter, and without using a prediction error block (i.e., a normal convolutional layer). Next, we used each version of our CNN to determine the manipulation sequence of each patch in the testing set.

TABLE X: MISLnet accuracy with different settings in the prediction error feature extraction block for multiple manipulation detection.

Prediction error conceptual block   Accuracy   RER (w.r.t. ours)
Constrained conv layer              99.26%     -
W/out constrained conv layer        98.36%     54.87%
High-pass filter                    98.95%     29.54%

TABLE XI: MISLnet performance with different settings in the prediction error feature extraction block for sequence of manipulations detection; image patches were JPEG post-compressed with QF = 90.

Prediction error conceptual block   Accuracy   RER (w.r.t. ours)
Constrained conv layer              92.90%     -
W/out constrained conv layer        84.10%     55.34%
High-pass filter                    80.63%     63.35%

The classification accuracies obtained by our CNN using each choice for the first layer are shown in Table XI. From this table, we can see that when the constrained convolutional layer was used, MISLnet can identify the sequence of manipulations used to modify each patch with 92.90% accuracy. By contrast, using a fixed high-pass filter yields an accuracy of 80.63%, while using a normal convolutional layer yields an accuracy of 84.10%. In these experiments, using a constrained convolutional layer produces a relative error reduction of 63.35% over a fixed high-pass filter and a relative error reduction of 55.34% over a normal convolutional layer. These results clearly demonstrate the advantage of using the constrained convolutional layer in challenging manipulation detection scenarios.

“Constrained Conv” layer parameters: We conducted two sets of experiments to investigate the impact of the number of filters and their dimension in the “Constrained Conv” layer on the CNN’s performance. To accomplish this, we used the same training and testing datasets that we described in Section V-B. In our first experiment, we identified the optimal number of filters to use in the constrained convolutional layer by letting the number of filters vary from 1 to 6 and evaluating the manipulation detection accuracy achieved by our CNN under each scenario. Table XII shows the results of our experiments. We can notice that our proposed MISLnet architecture with the choice of three constrained filters maximizes the CNN’s performance and outperforms the other choices of filter numbers by at least 0.15%.

TABLE XII: MISLnet detection accuracy with different numbers of filters in the “Constrained Conv” layer.

# Filters   Testing Accuracy
1           99.04%
2           98.02%
3           99.26%
4           99.11%
5           99.05%
6           98.97%

In our second experiment, we examined the effect of the size of the filters in the constrained convolutional layer. This was done by evaluating the detection accuracy of our CNN using filters of size 3×3, 5×5 and 7×7 in the constrained convolutional layer. We kept the total number of filters in the “Constrained Conv” layer fixed (3 filters). Experiments showed that our choice of a 5×5 “Constrained Conv” layer, with a 99.26% identification rate, outperforms the other dimension choices. More specifically, MISLnet can achieve a 98.91% identification rate with 3×3 constrained filters and a 99.07% identification rate when using 7×7 constrained filters in the “Constrained Conv” layer. Taken together, the results of these two experiments show that using three 5×5 filters in the constrained convolutional layer maximizes the performance of our CNN.

G. Architecture design choices

The structural design of a CNN’s architecture has a large impact on its final accuracy. We ran several sets of additional experiments related to the structural design of our CNN’s architecture. In this paper we present three of these experiments, namely (1) the choice of the pooling layer, (2) the choice of the activation function, and (3) the choice of the stride size in the “Conv2” layer with different input patch sizes (i.e., 256×256, 128×128 and 64×64). To accomplish this, we started with the fixed architecture defined in Fig. 2. Then we changed one architectural design choice in each experiment, such as the choice of pooling layer or activation function. For smaller patch sizes, each image patch in the training and testing datasets was cropped in the center.

Fig. 4: CNN testing accuracy vs. training epochs; blue: max-pooling with avg-pooling after Conv5, red: max-pooling, green: avg-pooling.

TABLE XIII: CNN testing accuracy with different pooling operations.

Pooling operations                        Accuracy
Avg-pooling                               96.35%
Max-pooling                               98.95%
Max-pooling w/ avg-pooling after Conv5    99.26%

Pooling layer: We first evaluated the impact on our CNN’s performance of using different types of pooling layers. We trained three CNN models using the architecture described in Fig. 2 with different choices of pooling layers, i.e., max-pooling, average-pooling and max-pooling with average pooling after the “Conv5” layer. In our CNN we used 1×1 convolutional filters in the “Conv5” layer to learn new associations between feature maps. Because of this, the choice of pooling
before the fully-connected layers that perform classification is important. Table XIII summarizes the best identification rates achieved by the different pooling choices that we have considered in this experiment. We can observe that the choice of max-pooling with average pooling after the “Conv5” layer outperforms the other pooling choices and can achieve an average accuracy of 99.26%. From Fig. 4, one can also see that the average-pooling based CNN converges considerably more slowly, and to a lower overall accuracy, than the other two alternatives.

Activation function: In our second experiment, we evaluated the choices of activation functions. We compared our TanH-based network in Fig. 2 to ELU, ReLU and PReLU based networks. We report the best achieved identification rates of these networks in Table XIV. We can see that the TanH network performance is 0.63% better than a PReLU network and 0.24% better than ReLU and ELU networks. Fig. 5 depicts the testing accuracy versus the training epochs curves for the four choices of activation function. One can observe from this that the TanH and ReLU networks converge slightly quicker to a higher accuracy.

Fig. 5: CNN testing accuracy vs. training epochs with different activation functions, blue: TanH, red: ReLU, purple: PReLU, green: ELU.

TABLE XIV: CNN testing accuracy with different activation functions.
Activation function   Accuracy
PReLU                 98.63%
ReLU                  99.02%
ELU                   99.02%
TanH                  99.26%

Convolutional Stride Size: The choice of the convolutional stride size is important since it will determine the dimension of features throughout the CNN. The bigger the convolutional stride, the smaller the dimension of the feature maps produced by the CNN. This is critical when using relatively small input patch sizes. We evaluated the impact of the stride size in the “Conv2” layer on the CNN’s performance. When performing image manipulation detection using different patch sizes, the CNN architecture with a 256×256 input layer does not necessarily achieve the best identification rate. Therefore, we compared the identification rate of our CNN using a stride of 1 versus using a stride of 2 in the “Conv2” layer (see Fig. 2) with different input patch sizes.

The results of our experiments are shown in Table XV. We can see that with 128×128 and 64×64 patches, a CNN that uses a stride of 1 in the “Conv2” layer outperforms one that uses a stride of 2 in the “Conv2” layer. Thus, the “Conv2” layer with a smaller stride extracts higher dimensional features, which may lead deeper CNN layers to extract better high-level image manipulation features. However, with 256×256 patches, a CNN with a stride of 2 in the “Conv2” layer can achieve a higher identification rate. These experiments show that, by varying the structural design of the CNN, we can still improve the identification rate with smaller input patches. In the rest of our experiments we use a stride of 1 in the “Conv2” layer only with patches smaller than 256×256.

TABLE XV: Identification rate of CNN when trained with different image patch sizes; stride of 1 versus stride of 2 in “Conv2” layer.
Patch size   Stride of 1   Stride of 2
256×256      98.93%        99.26%
128×128      98.48%        98.25%
64×64        97.80%        97.33%

H. Effect of training set size

We have shown through our experiments that our proposed constrained CNN is very effective at learning manipulation detection features and accurately detecting multiple manipulations. A CNN’s performance, however, is dependent on the size and quality of the training set [47], [48]. We conducted an experiment to show the effect of the training set size on our CNN’s accuracy. This experiment shows that we can improve our results even further using a large database of training data. Given the limited number of images in the 1st IEEE IFS-TC image forensics challenge database, we used the Dresden Image Database [46] to build our training and testing data. In total, our experimental database consisted of 1,250,000 grayscale patches of size 256×256.

We built our training dataset in the same manner that we described in the previous experiments. Our training dataset consisted of 1.2 million patches, of which 200,024 patches were unaltered and 999,976 patches were altered. To accomplish this, we randomly selected 16,483 images that we divided into 256×256 blocks, and we retained all the nine central patches from the green layer of the first 14,816 images and the 40 central patches from the green layer of the remaining images. We then created their corresponding edited patches using the five tampering operations listed in Table III.

We trained our CNN with different input patch sizes (i.e., 256, 128 and 64) and different numbers of training patches (i.e., 100K, 200K, 400K, 800K and 1.2M). For training, we used the same parameters that we used in the previous set of experiments, where the learning rate was decreased every six epochs and every CNN was trained for 60 epochs.

To evaluate our trained CNNs, we built a testing dataset that consisted of 50,000 grayscale patches of size 256×256, of which 8,350 patches were not tampered. These grayscale patches were made from 334 images not used for the training. Similarly to the training dataset, all the images were divided into 256×256 blocks and we retained all the 25 central patches
from the green layer. The edited testing patches were created using the five tampering operations listed in Table III.

Fig. 6 depicts the testing accuracy of our constrained CNN versus the number of training patches. Results are shown for different input patch sizes using both softmax and ET classifiers. We can see that the performance of our CNN is almost always improved when we use a larger number of training patches. One can observe that when the number of training patches is sufficiently large, the softmax and ET classifiers achieve almost the same testing accuracy. Our CNN was able to achieve at least 99.18% accuracy with 100,000 training patches of size 64×64. Additionally, we can observe that when the CNN is trained with 1.2 million training patches of size 256×256, we can achieve 99.93% accuracy using both the softmax and ET classifiers. This is noticeably better than the results achieved by the SRM.

Fig. 6: CNN testing accuracy vs. number of training patches from Dresden database [46] using Softmax (dashed line) and Extremely Randomized Trees (ET) (solid line) with different patch sizes; blue: 256×256, red: 128×128, green: 64×64.

TABLE XVI: MISLnet performance when trained using 1.2 million patches from the Dresden Image Database [46] and tested on patches from both the Dresden Image Database and a dataset of images taken by 34 different camera models (External Testing Data).
             Dresden Testing Data    External Testing Data
Patch size   Softmax    ET           Softmax    ET
256×256      99.93%     99.93%       99.27%     99.33%
128×128      99.61%     99.67%       99.60%     99.69%
64×64        99.36%     99.43%       99.38%     99.51%

To differentiate overfitting from poor optimization, machine learning researchers monitor the loss rate on the training set and the testing set [48]. If the training and testing losses converge to approximately the same rate, and the testing loss rate does not increase over training cycles (i.e., epochs), this is evidence that the CNN is not overfitting [48].

Fig. 7 depicts the training and testing loss versus the training epochs of our CNN trained with 1.2 million patches and tested with 50 thousand patches of size 256×256. Both training and testing datasets were collected from the Dresden database and used in the previous experiment in Section V-H, where the CNN was trained with large-scale datasets of different sizes. Recall that the training and testing sets are disjoint in all our experiments.

Fig. 7: Training and testing loss versus training epochs using 1.2 million training patches and 50K testing patches of size 256×256 from the Dresden Database. Losses were recorded every 1K iterations.

One can observe that the testing loss decreases throughout the training epochs. We can see that after 48 epochs, as we add more training cycles (i.e., epochs), the testing loss does not increase and both training and testing converge to approximately the same loss rate (see Fig. 7). Thus, we experimentally demonstrated that the proposed CNN is not overfitting [48]. This also demonstrates that we can avoid overfitting when training a reasonably small CNN with large-scale training data [48]. It is worth mentioning that in all our experiments the CNN achieves its best identification rate when the testing loss decreases and converges to the same training loss rate during the last several epochs of training. In what follows we evaluate our proposed approach in a real-world scenario.

I. External experimental database

In a real-world scenario, a forensic investigator must examine images captured by several different and possibly unknown cameras. These may be different from the cameras used to train their CNN. It is important for their results to hold consistent when using two different sets of data captured by different devices for training and testing. To mimic this scenario, we used our CNN trained with 1.2 million patches
from the Dresden database to perform general-purpose image manipulation detection on images taken using 34 different camera models that we have manually collected.

To evaluate the performance of our trained CNN on a different dataset, we built a “wild” testing dataset of images taken by 34 new camera models. This dataset consisted of 50,000 grayscale patches of size 256×256, of which 8,350 patches were not edited. This was accomplished by retaining all the 25 central 256×256 blocks from the green layer of 334 randomly selected images. Next, similarly to our previous set of experiments, we created their corresponding edited patches using the five tampering operations listed in Table III.

Table XVI shows the performance of our CNN in a “real world” scenario with different input patch sizes. One can see that we can still identify the type of image editing with at least 99.33% accuracy when we use 256×256 testing input patches. We can also see that with smaller patch sizes our deep learning method is able to detect the type of image editing with an accuracy higher than when tested on patches from the Dresden testing dataset. More specifically, in the real-world scenario, our proposed ET-based constrained CNN can identify the different types of image manipulations in 64×64 and 128×128 input image patches with accuracies of 99.51% and 99.69% respectively, whereas with our Dresden experimental testing dataset it can achieve accuracies of 99.43% and 99.67% respectively. These results demonstrate the robustness of our proposed approach when used in real-world scenarios.

Finally, to further improve the learning ability of the CNN, we built a new training dataset that contains 2.5 million patches from our collected ‘34-camera-model’ database to classify each patch in the testing dataset. To accomplish this, we similarly retained all the 25 central patches of size 256×256 from the green layer of the remaining 16,667 images not used for testing. Then we created their corresponding edited patches using the five tampering operations. In total, there were 416,675 unaltered training patches. We finally trained our CNN using the same parameters that we have used in the previous experiments.

We used our re-trained CNN to identify the type of manipulation in the testing dataset described in the previous experiment. Our proposed constrained CNN was able to classify the testing dataset with 99.97% accuracy. Table XVII shows the confusion matrix of the trained CNN. Our method can achieve at least 99.89% accuracy for all manipulations considered. Noticeably, our approach can identify Gaussian blur and AWGN operations with 100% accuracy. Additionally, compared to the performance using 100K training patches (see Table IV), one can see that our CNN trained with significantly more data can accurately distinguish between JPEG compressed images and both resampled and unaltered images. Similarly to the experiments in Section V-H, these results demonstrate again that with a larger scale of training data we can improve our CNN’s performance.

TABLE XVII: Confusion matrix for identifying the operation types using MISLnet (w/ softmax) trained with 2.5 million patches. Rows give the true class and columns the predicted class.
True class   OR       MF       GB       AWGN     RS       JPEG
OR           99.95%   0.01%    0.00%    0.00%    0.00%    0.04%
MF           0.02%    99.89%   0.02%    0.00%    0.02%    0.04%
GB           0.00%    0.00%    100%     0.00%    0.00%    0.00%
AWGN         0.00%    0.00%    0.00%    100%     0.00%    0.00%
RS           0.00%    0.00%    0.00%    0.00%    99.99%   0.01%
JPEG         0.01%    0.00%    0.00%    0.00%    0.00%    99.99%

VI. CONCLUSION

In this paper, we proposed a novel deep learning based approach to perform general-purpose image manipulation detection. Unlike existing approaches that rely on hand-designed features, our proposed CNN is able to jointly suppress an image’s content and adaptively learn image manipulation detection features directly from data. To accomplish this, we developed a new type of layer, called a constrained convolutional layer, that forces our CNN to learn prediction error filters that produce low-level forensic features. Using this layer, we designed a new CNN architecture that is able to accurately detect multiple types of image manipulations.

Through a series of experiments, we assessed the ability of our proposed constrained CNN to perform image manipulation detection. The results of these experiments showed that our CNN can be trained to accurately detect individual targeted manipulations as well as multiple types of manipulations.

To further assess the performance of our constrained CNN, we compared it to the SRM-based general purpose image manipulation detection approach (i.e., the current state-of-the-art detector) using five different image manipulations. This experimental comparison showed that our proposed CNN architecture can outperform the SRM, particularly when using large-scale training data. Additionally, we conducted a set of experiments to mimic a realistic scenario where a forensic investigator will use our CNN to analyze images captured by source devices, possibly unknown, that differ from those used to train our CNN. These experimental results show that our CNN can still accurately detect image manipulations even when there is a source mismatch between the data used to train our CNN and an image under investigation.

REFERENCES

[1] M. C. Stamm, M. Wu, and K. J. R. Liu, “Information forensics: An overview of the first decade,” IEEE Access, vol. 1, pp. 167–200, 2013.
[2] A. C. Popescu and H. Farid, “Exposing digital forgeries by detecting traces of resampling,” IEEE Transactions on Signal Processing, vol. 53, no. 2, pp. 758–767, Feb. 2005.
[3] M. Kirchner, “Fast and reliable resampling detection by spectral analysis of fixed linear predictor residue,” in Proceedings of the 10th ACM Workshop on Multimedia and Security, ser. MM&Sec ’08. New York, NY, USA: ACM, 2008, pp. 11–20.
[4] N. Dalgaard, C. Mosquera, and F. Pérez-González, “On the role of differentiation for resampling detection,” in IEEE International Conference on Image Processing (ICIP). IEEE, 2010, pp. 1753–1756.
[5] X. Feng, I. J. Cox, and G. Doerr, “Normalized energy density-based forensic detection of resampled images,” IEEE Transactions on Multimedia, vol. 14, no. 3, pp. 536–545, 2012.
[6] B. Mahdian and S. Saic, “Blind authentication using periodic properties of interpolation,” IEEE Transactions on Information Forensics and Security, vol. 3, no. 3, pp. 529–538, 2008.
[7] M. Kirchner and J. Fridrich, “On detection of median filtering in digital images,” in IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2010, pp. 754 110–754 110.
[8] X. Kang, M. C. Stamm, A. Peng, and K. J. R. Liu, “Robust median filtering forensics using an autoregressive model,” IEEE Trans. Information Forensics and Security, vol. 8, no. 9, pp. 1456–1468, Sep. 2013.
[9] G. Cao, Y. Zhao, R. Ni, L. Yu, and H. Tian, “Forensic detection of median filtering in digital images,” in IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2010, pp. 89–94.
[10] C. Chen and J. Ni, “Median filtering detection using edge based prediction matrix,” Digital Forensics and Watermarking, pp. 361–375, 2012.
[11] M. C. Stamm and K. J. R. Liu, “Forensic detection of image manipulation using statistical intrinsic fingerprints,” IEEE Trans. on Information Forensics and Security, vol. 5, no. 3, pp. 492–506, 2010.
[12] H. Yao, S. Wang, and X. Zhang, “Detect piecewise linear contrast enhancement and estimate parameters using spectral analysis of image histogram,” in International Communication Conference on Wireless Mobile and Computing. IET, 2009.
[13] M. Stamm and K. R. Liu, “Blind forensics of contrast enhancement in digital images,” in IEEE International Conference on Image Processing. IEEE, 2008, pp. 3112–3115.
[14] M. C. Stamm and K. R. Liu, “Forensic estimation and reconstruction of a contrast enhancement mapping,” in IEEE International Conference on Acoustics Speech and Signal Processing. IEEE, 2010, pp. 1698–1701.
[15] T. Bianchi and A. Piva, “Detection of non-aligned double jpeg compression with estimation of primary compression parameters,” in IEEE International Conference on Image Processing. IEEE, 2011, pp. 1929–1932.
[16] T. Bianchi and A. Piva, “Image forgery localization via block-grained analysis of jpeg artifacts,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 1003–1017, 2012.
[17] R. Neelamani, R. De Queiroz, Z. Fan, S. Dash, and R. G. Baraniuk, “Jpeg compression history estimation for color images,” IEEE Transactions on Image Processing, vol. 15, no. 6, pp. 1365–1378, 2006.
[18] Z. Qu, W. Luo, and J. Huang, “A convolutive mixing model for shifted double jpeg compression with application to passive image authentication,” in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008, pp. 1661–1664.
[19] J. Fridrich and J. Kodovskỳ, “Rich models for steganalysis of digital images,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 868–882, 2012.
[20] X. Qiu, H. Li, W. Luo, and J. Huang, “A universal image forensic strategy based on steganalytic model,” in Workshop on Information hiding and multimedia security. ACM, 2014, pp. 165–170.
[21] T. Pevny, P. Bas, and J. Fridrich, “Steganalysis by subtractive pixel adjacency matrix,” IEEE Transactions on Information Forensics and Security, vol. 5, no. 2, pp. 215–224, Jun. 2010.
[22] W. Fan, K. Wang, and F. Cayre, “General-purpose image forensics using patch likelihood under image statistical models,” in IEEE International Workshop on Information Forensics and Security, Nov. 2015, pp. 1–6.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[24] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on audio, speech, and language processing, vol. 22, no. 10, pp. 1533–1545, 2014.
[25] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the 25th international conference on Machine learning. ACM, 2008, pp. 160–167.
[26] B. Bayar and M. C. Stamm, “A deep learning approach to universal image manipulation detection using a new convolutional layer,” in Workshop on Information Hiding and Multimedia Security. ACM, 2016, pp. 5–10.
[27] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989.
[28] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Neural networks: Tricks of the trade. Springer, 2012, pp. 9–48.
[29] H. Robbins and S. Monro, “A stochastic approximation method,” The annals of mathematical statistics, pp. 400–407, 1951.
[30] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
[31] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[32] M. Kirchner and T. Gloe, “On resampling detection in re-compressed images,” in 2009 First IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2009, pp. 21–25.
[33] J. Chen, X. Kang, Y. Liu, and Z. J. Wang, “Median filtering forensics based on convolutional neural networks,” IEEE Signal Processing Letters, vol. 22, no. 11, pp. 1849–1853, Nov. 2015.
[34] B. Bayar and M. C. Stamm, “On the robustness of constrained convolutional neural networks to jpeg post-compression for image resampling detection,” in The 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 2152–2156.
[35] B. Bayar and M. C. Stamm, “Towards order of processing operations detection in jpeg-compressed images with convolutional neural networks,” in International Symposium on Electronic Imaging: Media Watermarking, Security, and Forensics. IS&T, 2018.
[36] B. Bayar and M. C. Stamm, “A generic approach towards image manipulation parameter estimation using convolutional neural networks,” in Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security. ACM, 2017.
[37] B. Bayar and M. C. Stamm, “Design principles of convolutional neural networks for multimedia forensics,” in International Symposium on Electronic Imaging. IS&T, 2017.
[38] G. Xu, H.-Z. Wu, and Y.-Q. Shi, “Structural design of convolutional neural networks for steganalysis,” IEEE Signal Processing Letters, vol. 23, no. 5, pp. 708–712, 2016.
[39] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” in ICML, 2014, pp. 647–655.
[40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[41] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[43] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.
[44] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
[46] T. Gloe and R. Böhme, “The dresden image database for benchmarking digital image forensics,” Journal of Digital Forensic Practice, vol. 3, no. 2-4, pp. 150–159, 2010.
[47] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in ICDAR, vol. 3, 2003, pp. 958–962.
[48] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 437–478.
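
The relative error reductions quoted with Table XI (63.35% over a fixed high-pass filter and 55.34% over a normal convolutional layer) can be checked from the three reported accuracies. The Python sketch below is only an illustration of that arithmetic: the helper name `relative_error_reduction` is hypothetical, and the formula (baseline error minus new error, divided by baseline error) is the conventional definition, which is assumed here to be the one used in the paper.

```python
def relative_error_reduction(new_acc: float, baseline_acc: float) -> float:
    """Percentage by which the error rate shrinks when moving from
    baseline_acc to new_acc. Both accuracies are given in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    if baseline_err <= 0.0:
        raise ValueError("baseline accuracy must be below 100%")
    return 100.0 * (baseline_err - new_err) / baseline_err


# Accuracies reported in the text for the three first-layer choices.
CONSTRAINED_CONV = 92.90
FIXED_HIGH_PASS = 80.63
NORMAL_CONV = 84.10

print(f"{relative_error_reduction(CONSTRAINED_CONV, FIXED_HIGH_PASS):.2f}")  # 63.35
print(f"{relative_error_reduction(CONSTRAINED_CONV, NORMAL_CONV):.2f}")      # 55.35
```

Under this definition the first figure matches the text exactly. The second evaluates to about 55.346%, so the 55.34% in the text is consistent with truncation rather than rounding to two decimals.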
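
The conclusion describes the constrained convolutional layer as forcing the first-layer filters to behave as prediction error filters. The following Python sketch shows one way such a constraint could be re-imposed on a filter after a weight update. It assumes the constraint has the form "center weight equals -1 and the remaining weights sum to 1"; the function name `project_prediction_error_filter`, the nested-list representation, and the all-ones starting filter are illustrative assumptions and are not the authors' code.

```python
from typing import List

Filter = List[List[float]]


def project_prediction_error_filter(w: Filter) -> Filter:
    """Return a copy of the square, odd-sized filter w in which the
    off-center weights are rescaled to sum to 1 and the center is -1."""
    k = len(w)
    if k % 2 == 0 or any(len(row) != k for row in w):
        raise ValueError("expected an odd-sized square filter")
    c = k // 2
    # Sum of all weights except the center one.
    off_center_sum = sum(
        w[i][j] for i in range(k) for j in range(k) if (i, j) != (c, c)
    )
    if abs(off_center_sum) < 1e-12:
        raise ValueError("off-center weights sum to zero; cannot normalize")
    projected = [[w[i][j] / off_center_sum for j in range(k)] for i in range(k)]
    projected[c][c] = -1.0  # the filter predicts the center pixel and subtracts it
    return projected


# Hypothetical 5x5 filter (the size the paper found to work best).
raw = [[1.0] * 5 for _ in range(5)]
constrained = project_prediction_error_filter(raw)

# Response of the constrained filter to a flat (constant-valued) patch.
flat_response = sum(weight * 128.0 for row in constrained for weight in row)
```

Because the projected weights sum to zero, the response to a locally constant region is zero up to floating-point error, which is one way to see how such a filter suppresses image content while keeping the residual that carries manipulation traces.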

Belhassen Bayar (S’11) received the B.S. degree in Electrical Engineering from the Ecole Nationale d’Ingénieurs de Tunis (ENIT), Tunisia, in 2011, and the M.S. degree in Electrical and Computer Engineering from Rowan University, New Jersey, in 2014. After graduating from ENIT, he worked as a Research Assistant at the University of Arkansas at Little Rock (UALR). In Fall 2014, he joined Drexel University, Pennsylvania, where he is currently a Ph.D. candidate with the Department of Electrical and Computer Engineering. Bayar won the Best Paper Award at the IEEE International Workshop on Genomic Signal Processing and Statistics in 2013. In summer 2015 he interned at Samsung Research America in Mountain View, California. His main research interests are in image forensics, machine learning and signal processing.

Matthew C. Stamm (S’08–M’12) received the B.S., M.S., and Ph.D. degrees in electrical engineering from the University of Maryland at College Park, College Park, MD, USA, in 2004, 2011, and 2012, respectively.

Since 2013 he has been an Assistant Professor with the Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA. He leads the Multimedia and Information Security Lab (MISL), where he and his team conduct research on signal processing, machine learning, and information security with a focus on multimedia forensics and anti-forensics.

Dr. Stamm is the recipient of a 2016 NSF CAREER Award and the 2017 Drexel University College of Engineering’s Outstanding Early-Career Research Achievement Award. He was the General Chair of the 2017 ACM Workshop on Information Hiding and Multimedia Security and is the lead organizer of the 2018 IEEE Signal Processing Cup competition. He currently serves as a member of the IEEE SPS Technical Committee on Information Forensics and Security and as a member of the editorial board of IEEE SigPort. For his doctoral dissertation research, Dr. Stamm was named the winner of the Dean’s Doctoral Research Award from the A. James Clark School of Engineering. While at the University of Maryland, he was also the recipient of the Ann G. Wylie Dissertation Fellowship and a Future Faculty Fellowship. Prior to beginning his graduate studies, he worked as an engineer at the Johns Hopkins University Applied Physics Lab.

Image Forgery Detection and Deep Learning Techniques: A Review
No ratings yet
Image Forgery Detection and Deep Learning Techniques: A Review
5 pages
Generative AI Notes
100% (1)
Generative AI Notes
3 pages
Deep Learning
No ratings yet
Deep Learning
34 pages
BTech Advanced AI Unit03
No ratings yet
BTech Advanced AI Unit03
109 pages
Classification Basic Concept - Data Mining
No ratings yet
Classification Basic Concept - Data Mining
20 pages
AI Unit-4
No ratings yet
AI Unit-4
59 pages
Deep Learning - AD3501 - Notes - Unit 3 - Recurrent Neural Networks
No ratings yet
Deep Learning - AD3501 - Notes - Unit 3 - Recurrent Neural Networks
33 pages
JofAP 1.5131699
No ratings yet
JofAP 1.5131699
6 pages
Chapter 1 - Course Intro
No ratings yet
Chapter 1 - Course Intro
27 pages
AI FinalExam
No ratings yet
AI FinalExam
5 pages
Updated - Major Minor All Syllabus
No ratings yet
Updated - Major Minor All Syllabus
162 pages
Object Detection
No ratings yet
Object Detection
13 pages
Machine Learning-4
No ratings yet
Machine Learning-4
73 pages
Types of Machine Learning
No ratings yet
Types of Machine Learning
14 pages
RNN LectureNotes
No ratings yet
RNN LectureNotes
36 pages
Thomas 1
No ratings yet
Thomas 1
33 pages
Guo 1
No ratings yet
Guo 1
11 pages
Satellite Image Classification With Deep Learning Survey
No ratings yet
Satellite Image Classification With Deep Learning Survey
5 pages
12 Types of Neural Network Activation Functions
No ratings yet
12 Types of Neural Network Activation Functions
38 pages
Slides Basics Whatisml
No ratings yet
Slides Basics Whatisml
10 pages
Daftar Pustaka
100% (1)
Daftar Pustaka
3 pages
Haen 1
No ratings yet
Haen 1
5 pages
Dynamic Link Prediction by Learning The Representatio - 2024 - Expert Systems Wi
No ratings yet
Dynamic Link Prediction by Learning The Representatio - 2024 - Expert Systems Wi
8 pages
Ai PPT
No ratings yet
Ai PPT
93 pages
Unsupervised Learning With Random Forest Predictors
No ratings yet
Unsupervised Learning With Random Forest Predictors
14 pages
Riv 1
No ratings yet
Riv 1
6 pages
Evolution of AI - Final
No ratings yet
Evolution of AI - Final
14 pages
Intrusion Detection Algorithm Based On Convolutional Neural Network
No ratings yet
Intrusion Detection Algorithm Based On Convolutional Neural Network
5 pages
Corner Detection: GV12/3072 Image Processing. 1
No ratings yet
Corner Detection: GV12/3072 Image Processing. 1
64 pages
DEEP-DEPRESSION-progress Report-2
No ratings yet
DEEP-DEPRESSION-progress Report-2
29 pages
Chen 1
No ratings yet
Chen 1
10 pages
Exploiting Deep Learning For Persian Sentiment Analysis
No ratings yet
Exploiting Deep Learning For Persian Sentiment Analysis
8 pages
الشبكات العصبية الاصطناعية واستخدامها في تميز بصمة الأصبع
No ratings yet
الشبكات العصبية الاصطناعية واستخدامها في تميز بصمة الأصبع
21 pages
An EEG-based Machine Learning Framework For Depression Detection Using Effective Connectivity Analysis
No ratings yet
An EEG-based Machine Learning Framework For Depression Detection Using Effective Connectivity Analysis
20 pages
Soybean Harvest
No ratings yet
Soybean Harvest
11 pages
VineLidar Paper
No ratings yet
VineLidar Paper
10 pages
Class 9 CBSE Worksheet - AI Applications and Methodologies
No ratings yet
Class 9 CBSE Worksheet - AI Applications and Methodologies
4 pages
Unit 1
No ratings yet
Unit 1
3 pages
Kotal 1
No ratings yet
Kotal 1
4 pages
Spe-192354-MS Comparing 5-Different Artificial Intelligence Techniques To Predict Z-Factor
No ratings yet
Spe-192354-MS Comparing 5-Different Artificial Intelligence Techniques To Predict Z-Factor
8 pages
Pranay Mahawar 23M1349
No ratings yet
Pranay Mahawar 23M1349
1 page
Brochure Special Issue
100% (1)
Brochure Special Issue
1 page
Cavity-Through Drie
No ratings yet
Cavity-Through Drie
5 pages
Developing An Automated Depression Assessment Tool in Bengali - Adhering To WHO mhGAP Intervention G
No ratings yet
Developing An Automated Depression Assessment Tool in Bengali - Adhering To WHO mhGAP Intervention G
3 pages
Object Detection: Advances, Applications, and Algorithms
From Everand
Object Detection: Advances, Applications, and Algorithms
Fouad Sabry
No ratings yet