Mat 1
Abstract—Identifying the authenticity and processing history of an image is an important task in multimedia forensics. By analyzing traces left by different image manipulations, researchers have been able to develop several algorithms capable of detecting targeted editing operations. While this approach has led to the development of several successful forensic algorithms, an important problem remains: creating forensic detectors for different image manipulations is a difficult and time consuming process. Furthermore, forensic analysts need ‘general purpose’ forensic algorithms capable of detecting multiple different image manipulations. In this paper, we address both of these problems by proposing a new general purpose forensic approach using convolutional neural networks (CNNs). While CNNs are capable of learning classification features directly from data, in their existing form they tend to learn features representative of an image’s content. To overcome this issue, we have developed a new type of CNN layer, called a constrained convolutional layer, that is able to jointly suppress an image’s content and adaptively learn manipulation detection features. Through a series of experiments, we show that our proposed constrained CNN is able to learn manipulation detection features directly from data. Our experimental results demonstrate that our CNN can detect multiple different editing operations with up to 99.97% accuracy and outperform the existing state-of-the-art general purpose manipulation detector. Furthermore, our constrained CNN can still accurately detect image manipulations in realistic scenarios where there is a source camera model mismatch between the training and testing data.

Index Terms—Image forensics, deep learning, convolutional neural networks, deep convolutional features.

1556-6021 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Researchers design forensic algorithms that extract features related to these traces and use them to detect targeted image manipulations. This approach has been successful in detecting many types of image tampering such as resizing and resampling [2], [3], [4], [5], [6], median filtering [7], [8], [9], [10], contrast enhancement [11], [12], [13], [14], multiple JPEG compression [15], [16], [17], [18], etc.

Although research in image forensics has advanced dramatically, these approaches still suffer from important drawbacks. New editing operations are frequently developed and incorporated into editing software such as Adobe Photoshop. As a result, researchers must identify traces left by these new operations and design associated detection algorithms. This is difficult and time consuming since these algorithms are often designed from estimation and detection theory. Furthermore, the forensic algorithms described above are designed to detect a single targeted manipulation. As a result, multiple forensic tests must be run to authenticate an image. This results in several challenges such as fusing the results of multiple forensic tests and controlling the overall false alarm rate among several forensic detectors.

To address these issues, researchers have recently focused on developing general-purpose image forensic techniques to determine if and how an image has undergone processing. Tools from steganalysis have been adapted to perform general-purpose image forensics [19], [20]. Specifically, powerful steganalytic features called the spatial-domain rich model (SRM) [19] have been successfully used to perform universal image manipulation detection [20]. Kirchner et al. [7]
directly from data. They have been successfully used on a variety of different types of signals such as images [23], speech [24] and text data [25]. While CNNs provide a promising approach towards automatically learning image manipulation traces, in their existing form they are not well-suited for forensic purposes. This is because existing CNNs tend to learn features representative of an image’s content as opposed to manipulation traces, which are content-independent. As a result, forensic researchers may ask: Is it possible to force a CNN to learn manipulation detection features instead of features that represent an image’s content?

In this paper, we propose a new type of CNN architecture, called a constrained CNN, designed to adaptively learn image manipulation features and accurately identify the type of editing that an image has undergone. We use our constrained CNN to construct a powerful general-purpose manipulation detector called “MISLnet”, named after our lab, the Multimedia and Information Security Lab (MISL). To accomplish this, we propose a new type of convolutional layer called a constrained convolutional layer that forces a CNN to learn low-level manipulation features. Many forensic algorithms, such as resampling detectors [2], [3], median filtering detectors [7] and other forensic detectors based on steganalytic features like the SRM [20], operate by first extracting prediction residual features, then by forming higher-level features from these residuals. Inspired by this, our constrained convolutional layer is designed to only learn prediction error filters. This jointly suppresses the image’s content and adaptively learns low-level prediction residual features that are optimal for detecting forensic traces. Higher-level forensic features are learned from these residuals by deeper layers of our CNN.

Through a series of experiments, we show that our MISLnet architecture can automatically learn to detect multiple types of image editing directly from data. This removes the need for difficult and time consuming human analysis to design forensic detection features. Our results show that when given a comparable amount of training data, our constrained CNN can perform as well as or better than the state-of-the-art general-purpose detector based on steganalytic features [20]. Furthermore, we show that we can significantly improve the performance of our CNN-based detector by using very large amounts of training data that are computationally prohibitive for forensic detectors based on the SRM and its associated ensemble classifier. These results show that our proposed method can achieve 99.97% accuracy with five different tampering operations using large scale data.

The major contributions of this paper are as follows: (1) We propose a CNN architecture that is capable of detecting image editing and manipulations. This CNN architecture is deeper and more sophisticated than the one we initially proposed in [26], and its design choices are systematically investigated through a series of experiments. (2) We introduce our proposed constrained convolutional layer, provide a detailed discussion of how it is constructed and trained, as well as provide intuition into why it works. (3) We conduct a large scale experimental evaluation of our MISLnet CNN architecture and show that it can outperform existing image manipulation detection techniques, can differentiate between multiple editing operations even when their parameters vary, can detect sequences of operations, can provide localized manipulation detection results, and can provide extremely accurate manipulation detection results when trained using a large scale training dataset.

The remainder of this paper is organized as follows: In Section II, we present an overview of CNNs in the literature. Then, in Section III we describe how CNNs are used in multimedia forensics tasks using the constrained convolutional layer. Section IV provides details about our proposed CNN architecture. In Section V we assess our proposed deep learning approach in adaptively extracting image manipulation features through a set of experiments. Finally, Section VI concludes our work.

II. CONVOLUTIONAL NEURAL NETWORKS

Deep learning approaches, such as convolutional neural networks [27], are an extended version of neural networks. Their architecture, which is the set of parameters and components that we need to design a network, is based on stacking many hidden layers on top of one another. This has proven to be very effective in extracting hierarchical features. That is, they are capable of learning features from a set of previously learned features. In a CNN architecture, the first layer is a set of convolutional feature extractors applied in parallel to the image using a set of several learnable filters. These filters work as a sliding window that convolves with all regions of the input image with an overlapping distance called the stride and produce outputs known as feature maps. Similarly, the hidden convolutional layers extract features from each lower-level feature map. Finally, the output of these hierarchical feature extractors is passed to a fully-connected neural network that performs classification.

The convolutional operation between the input feature maps and a convolutional layer within the CNN architecture is given in Eq. (1):

$$h_j^{(n)} = \sum_{k=1}^{K} h_k^{(n-1)} * w_{kj}^{(n)} + b_j^{(n)}, \tag{1}$$

where $*$ denotes a 2d convolution, $h_j^{(n)}$ is the $j$th feature map output in the $n$th hidden layer, $h_k^{(n-1)}$ is the $k$th channel in the $(n-1)$th hidden layer, $w_{kj}^{(n)}$ is the $k$th channel in the $j$th filter in the $n$th layer and $b_j^{(n)}$ is its corresponding bias term.

The filter coefficients in each layer are initially seeded with random values, then learned using the back-propagation algorithm [27], [28]. The convolutional layers are also followed by an activation function to introduce nonlinearity. The set of convolutional layers yields a large volume of feature maps. To reduce the dimensionality of these features, the convolutional layers are followed by another type of layer called pooling. This reduces the training computational cost of the network and decreases the chances of over-fitting. There exist many types of pooling operations such as max, average, and stochastic pooling. In particular, the max-pooling layer works also as a sliding window with a stride distance which
retains the maximum value within the dimension of a sliding window.

The training process of a CNN is done through an iterative algorithm that alternates between feedforward and backpropagation passes of the data. The weights of the convolutional filters and fully-connected layers are updated at each iteration of the backpropagation passes. Ultimately, we would like to minimize the average loss $E$ between the true class labels (i.e., unaltered, manipulated, etc.) and the network outputs, i.e.,

$$E = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{c} y_i^{*(k)} \log y_i^{(k)}, \tag{2}$$

where $y_i^{*(k)}$ and $y_i^{(k)}$ are respectively the true label and the network output of the $i$th image at the $k$th class with $m$ training images and $c$ neurons in the output layer. A variety of solvers [29], [30], [31] have been proposed to minimize the average loss.

In this paper, we consider the stochastic gradient descent (SGD) to train our model [29]. The iterative update rule for the kernel coefficients $w_{ij}^{(n)}$ in CNN during the backpropagation pass is given below:

$$\nabla w_{ij}^{(n+1)} = \epsilon \cdot \frac{\partial E}{\partial w_{ij}^{(n)}} - m \cdot \nabla w_{ij}^{(n)} + d \cdot \epsilon \cdot w_{ij}^{(n)}$$
$$w_{ij}^{(n+1)} = w_{ij}^{(n)} - \nabla w_{ij}^{(n+1)}, \tag{3}$$

where $w_{ij}^{(n)}$ represents the $i$th channel from the $j$th kernel matrix in the $n$th hidden layer that convolves with the $i$th channel in the previous feature maps of the $(n-1)$th layer, $\nabla w_{ij}^{(n)}$ denotes the gradient of $w_{ij}^{(n)}$ and $\epsilon$ is the learning rate. The bias term $b_j^{(n)}$ in (1) is updated using the same equations presented in (3). For fast convergence as explained by LeCun et al. in [28], we use the decay and momentum strategy which are respectively denoted by $d$ and $m$ in (3).

III. CONSTRAINED CONVOLUTIONAL NEURAL NETWORK

A. Constrained convolutional layer

Instead of relying on hand-designed or predetermined features, we propose a CNN-based approach to image manipulation detection. Our approach is able to use data to directly learn the changes introduced by image tampering operations into local pixel relationships. We note that if CNNs in their standard form (such as AlexNet [23]) are used to perform image manipulation detection, they will learn features that represent an image’s content. This will lead to a classifier that identifies scene content associated with the training data as opposed to learning image manipulation fingerprints.

By contrast, our approach is designed to suppress an image’s content and adaptively learn image manipulation traces. To accomplish this, we propose a new type of convolutional layer, called a constrained convolutional layer, that is designed to be used in forensic tasks. It is inspired by a common process that we have observed in many existing forensic algorithms. Several existing algorithms first use a predetermined predictor to produce a set of pixel value prediction errors. These prediction errors are then used as low-level forensic features from which more sophisticated manipulation detection features are built. To mimic this process, our constrained convolutional layer is designed to only learn prediction error filters. The feature maps it produces correspond to prediction error fields that are used as low-level forensic traces.

The constrained convolutional layer is placed at the beginning of a CNN designed to perform a forensic task. This serves to suppress an image’s content (since prediction errors largely do not contain image content) and provide the CNN with low-level forensic features. Deeper layers in the CNN will learn higher-level manipulation detection features from these low-level forensic features.

To describe the constraints we enforce upon the constrained convolutional layer, we adopt the notational conventions that the superscript $(\ell)$ denotes the $\ell$th CNN layer, the subscript $k$ denotes the $k$th convolutional filter within a layer, and that the central value of a convolutional filter is denoted by spatial index $(0,0)$. We force the CNN to learn prediction error filters by actively enforcing the following constraints

$$\begin{cases} w_k^{(1)}(0,0) = -1, \\ \sum_{m,n \neq 0} w_k^{(1)}(m,n) = 1, \end{cases} \tag{4}$$

on each of the $K$ filters $w_k^{(1)}$ in the constrained convolutional layer during training. Fig. 1 depicts a set of $K$ constrained convolutional filters convolved with an input image.

The prediction error filter constraints in the constrained convolutional layer are enforced through the following training process. Training proceeds by updating the filter weights $w_k^{(1)}$ at each iteration using the stochastic gradient descent (SGD) algorithm during the backpropagation step, then projecting the updated filter weights back into the feasible set of prediction error filters by reinforcing the constraints in (4). The projection is done at each training iteration by first setting the central filter weight to -1. Next, the remaining filter weights are normalized so that their sum is equal to 1. This is done by dividing each of these remaining weights by the sum of all the filter weights excluding the central value. It is worth mentioning that experimentally we have found that using a central value of larger magnitude than one (e.g., setting the central value to -10,000 and ensuring the remaining values sum to 10,000) can help improve both the numerical stability of these filters and the convergence of our CNN’s loss¹. Doing this still produces prediction error filters of the same form, but the filter weights are proportionally larger. Pseudocode outlining this process is given in Algorithm 1.

¹ The Python scripts used to conduct the experiments can be found at misl.ece.drexel.edu/downloads or the project git repository https://gitlab.com/MISLgit/constrained-conv-TIFS2018/.

B. Analogy with existing forensic approaches

Many existing state-of-the-art approaches to perform image manipulation detection proceed by extracting hand-designed prediction residual features. Our proposed constrained convolutional layer can be viewed as a generalization of these non-adaptive feature extraction based approaches. Examples of these include resampling detectors that use the variance of prediction residues [3], [32], median filter streaking artifact
residuals [7], median filter residual features [8], [33], SPAM features [21], and rich model predictors [19]. These examples of prediction residual features suppress an image’s contents but still allow traces in the form of content-independent pixel value relationships to be learned by a classifier.

Fig. 1: The constrained convolutional layer. The red coefficient is -1 and the coefficients in the green region sum to 1.

Algorithm 1 Training algorithm for constrained convolutional layer
 1: Initialize $w_k$'s using randomly drawn weights
 2: i = 1
 3: while i ≤ max_iter do
 4:     Do feedforward pass
 5:     Update filter weights through stochastic gradient descent and backpropagate errors
 6:     Set $w_k^{(1)}(0,0) = 0$ for all $K$ filters
 7:     Normalize $w_k^{(1)}$'s such that $\sum_{\ell,m \neq 0} w_k^{(1)}(\ell,m) = 1$
 8:     Set $w_k^{(1)}(0,0) = -1$ for all $K$ filters
 9:     i = i + 1
10:     if training accuracy converges then
11:         exit
12: end

To provide intuition into this, prediction residual features are formed by using some function $f(\cdot)$ to predict the value of a pixel based on that pixel’s neighbors within a local window. The true pixel value is then subtracted from the predicted value to obtain the prediction residual $r$ such that

$$r = f(I) - I, \tag{5}$$

where $I$ is the input image or image patch. Frequently, a diverse set of $K$ different prediction functions are used to obtain many different residual features.

It can easily be shown that the $K$ feature maps produced by a constrained convolutional layer satisfying Eq. (4) are residuals of the form (5). A simple way to see this is to define a new filter $\tilde{w}_k$ as

$$\tilde{w}_k(m,n) = \begin{cases} w_k(m,n) & \text{if } (m,n) \neq (0,0), \\ 0 & \text{if } (m,n) = (0,0). \end{cases} \tag{6}$$

As a result, $w_k$ can be expressed as $\tilde{w}_k - \delta$ where $\delta$ is an impulse filter whose central value is 1 and 0 elsewhere. The feature map produced by convolving an image with the filter $w_k^{(1)}$ in the constrained convolutional layer is

$$h_k^{(1)} = w_k^{(1)} * I = (\tilde{w}_k - \delta) * I = \tilde{w}_k * I - I, \tag{7}$$

where $h_k^{(1)}$ is the $k$th feature map produced by the $k$th constrained filter in the first convolutional layer defined in Eq. (1). By defining $f(I) = \tilde{w}_k * I$, we can see these residuals are produced in the same manner that the above mentioned feature extraction based approaches produce prediction residuals $r$ in (5).

The residual predictors used in different multimedia forensic tasks [2], [3], [7], [20] take the form $\tilde{w}_k$ to predict local pixel relationships. Resampling detectors, for instance, operate by computing a probability measure called a p-map from a prediction residual $r$ of the form shown in Eq. (5). Then higher-level features in the frequency domain of the p-map are learned to detect resampling artifacts [2], [3]. To detect median filtering, Kirchner and Fridrich similarly compute low-level residual features [7]. These residual features capture streaking artifacts, then higher-level detection features are learned. Furthermore, the steganalytic rich model method used in forensics [20] operates in the same manner by building several local models of pixel dependencies to compute a diverse set of residual features. Then higher-level features are learned using the co-occurrence of these residuals [19].

As a result, our approach suppresses an image’s content in the same manner as prediction residual based methods. In order to capture a large number of different types of dependencies among neighboring pixels, a diverse set of $\tilde{w}_k$’s is typically used. The advantage of modeling the residual instead of the pixel values is that the image content is largely suppressed.

Unlike prior forensic methods which use fixed predictors, however, our approach adaptively learns good predictors for feature extraction through backpropagation. Nonlinearities can be further introduced by the subsequent application of activation functions and pooling layers in higher CNN layers. Thus, the constrained convolutional layer based approach unifies an important amount of research in multimedia forensics.

IV. NETWORK ARCHITECTURE

In the previous section, we showed how low-level image manipulation features can be adaptively extracted using the constrained convolutional layer approach. We use this approach to design a CNN that is able to identify the type of editing operations in an image. Fig. 2 depicts the overall design of our proposed architecture with details about the size of each layer. Our architecture consists of four different conceptual blocks and has the ability to: (i) jointly suppress an image’s content and learn prediction error features while training, (ii) extract higher level representation of the previously learned image manipulation features and (iii) learn new associations between feature maps in deeper layers by using a block that consists of 1×1 convolutional filters. This type of filter learns a linear combination of features located at the same position but in a different feature map across channels. The output of the latter block is fed to the classification block which consists of three fully-connected layers. In this work, the input layer of our CNN is a grayscale image patch sized 256×256 pixels. In what follows, we present a detailed overview of our proposed conceptual blocks as well as the different layers that we have used in our CNN’s architecture.
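To make the projection step of Algorithm 1 (steps 6 to 8) and the residual interpretation in (7) concrete, the following is a minimal pure-Python sketch; it is not the authors' released implementation. The function names `project_to_prediction_error_filter` and `filter_response`, the `scale` parameter (which mirrors the larger central value mentioned in Section III-A), and the positive random initialization are assumptions made for illustration. The filter is applied as a cross-correlation at a single location, which is how many deep learning frameworks implement "convolution".

```python
import random


def project_to_prediction_error_filter(w, scale=1.0):
    """Project a square, odd-sized filter (list of lists) onto the constraints in (4).

    Assumption: the non-central weights do not sum to (nearly) zero,
    otherwise the normalization below is ill-conditioned.
    """
    c = len(w) // 2                      # index of the central coefficient
    w = [row[:] for row in w]            # work on a copy
    w[c][c] = 0.0                        # step 6: zero the central weight
    total = sum(sum(row) for row in w)   # sum of the non-central weights
    w = [[scale * v / total for v in row] for row in w]  # step 7: normalize
    w[c][c] = -scale                     # step 8: fix the central weight
    return w


def filter_response(patch, w):
    """Response of filter w at one location (patch has the same size as w)."""
    n = len(w)
    return sum(w[i][j] * patch[i][j] for i in range(n) for j in range(n))


# Hypothetical demo: a 5x5 filter with positive random weights.
random.seed(0)
w0 = [[random.uniform(0.1, 1.0) for _ in range(5)] for _ in range(5)]
w = project_to_prediction_error_filter(w0)

# On a flat patch the predicted center equals the true center,
# so the prediction residual vanishes up to rounding error.
flat = [[7.0] * 5 for _ in range(5)]
residual_flat = filter_response(flat, w)
```

Because the neighbor weights sum to one and the central weight is minus one, the response equals the predicted center value minus the true center value, which is the residual form in (5). Calling the projection with `scale=10000.0` gives the proportionally larger filter described in Section III-A.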
Fig. 2: MISLnet: our proposed constrained CNN architecture; BN: Batch-Normalization Layer; TanH: Hyperbolic Tangent Layer.

A. Conceptual Blocks

a) Prediction error feature extraction: CNNs in their existing form tend to learn content-dependent features. Therefore, in our proposed architecture the first block consists of a constrained convolutional layer [26], [34]. This suppresses the content and constrains the CNN to learn prediction error features in the first layer [35], [36]. As a result, the first conceptual block learns low-level pixel value dependency features. These features are fragile and vulnerable to being destroyed by many types of nonlinear operations [37] such as pooling and activation layers which are explained later. Therefore, the output of this block is directly passed to a regular convolutional layer.

b) Hierarchical feature extraction: In order to learn higher-level prediction error features, we use a conceptual block that consists of a set of three consecutive convolutional layers each followed by batch normalization, activation function and pooling layers. Each convolutional layer will learn a new representation of feature maps learned by the preceding convolutional layer (lower-level features).

c) Cross feature maps learning: The previously learned hierarchical features are produced by learning local spatial association within a receptive field (local region/patch convolved with a filter) in the same feature map. Next, a new association is learned between these feature maps. In order to constrain the CNN to learn only association across feature maps, we use a 1×1 convolutional layer after the hierarchical feature extraction conceptual block. This has been demonstrated to improve the learning ability of CNNs in steganalysis [38]. In our architecture, this layer is also followed by batch normalization, activation function and pooling layers.

d) Classification: The deepest convolutional features learned by the previous conceptual block are directly passed to a classification block. This block consists of a regular neural network that will learn to classify the input data from the previously learned features throughout the CNN. To improve the performance of our CNN we use the deep features approach [39] that we explain in Section IV-G. In what follows we give more details about each type of layer we used in our CNN.

e) Differences from original architecture: Compared to our original CNN architecture proposed in [26], our new CNN architecture has gone through substantial design refinement. Specifically, this new architecture contains fewer filters in the constrained convolutional layer, uses a different filter size in the Conv3 (referred to as Conv2 in [26]) convolutional layer, uses a different number of filters in the Conv3 and Conv4 layers, includes one more ‘traditional’ convolutional layer than our architecture in [26], adds an additional 1×1 layer after the ‘traditional’ convolutional layers, uses a different type of pooling before the fully connected layers, uses different activation functions, uses batch normalization instead of local response normalization, contains a different number of neurons in each fully connected layer (this network uses noticeably fewer neurons in these layers), and uses the activations of the last fully connected layer as “deep features” that are provided to an extremely randomized tree classifier. These design choices have been motivated by an extensive series of experiments, the most important of which are discussed in detail in Section V. Additionally, while training this network we use a different initial learning rate as well as a learning rate that decreases during training ([26] uses a fixed learning rate) and train using a different batch size. Training information for this network is discussed in detail in Section V.

B. Convolutional Layers

From Fig. 2, one can notice that we use three different types of convolutional layers, namely one “Constrained Conv” layer which is the constrained convolutional layer presented in Section III, three regular convolutional layers and the 1×1 convolutional filters in “Conv5”. More specifically, a patch sized 256×256 from a grayscale input image is first convolved
with three different 5 × 5 constrained convolutional filters with a stride equal to 1. These filters learn the prediction error features between the estimated center pixel and its local neighbors. The constrained convolutional layer yields feature maps of prediction residuals of dimension 252 × 252 × 3.

To learn higher-level representative features and new associations between the prediction residual feature maps, we use three regular convolutional layers, namely “Conv2” with 96 filters of size 7 × 7 × 3 and stride of 2, “Conv3” with 64 filters of size 5 × 5 × 96 and stride of 1 and “Conv4” with 64 filters of size 5 × 5 × 64 and stride of 1. The output dimensions of these convolutional layers are respectively, 126 × 126 × 96, 63×63×64 and 31×31×64. Finally, we use 128 different 1×1 convolutional filters with stride of 1 in “Conv5”. This type of layer learns the association across feature maps, i.e., linear combination of features across channels located at the same spatial location. The output dimension of this convolutional layer is 15 × 15 × 128. Finally, from our architecture one can notice also that we use a batch normalization layer after every regular convolutional layer. A brief overview of the batch normalization layer is given in Section IV-E.

C. Fully-connected Layers

To identify the type of the processing operation that an input image has undergone, the output of all these convolutional layers is fed to a classification block which consists of a fully-connected neural network defined by three layers. More specifically, the first two fully-connected layers contain 200 neurons. These layers learn new association between the deepest convolutional features in the CNN. The output layer, also called classification layer, contains one neuron for each possible tampering operation and another neuron that corresponds to the unaltered image class.

D. Activation Function

A convolutional layer is typically followed by a nonlinear mapping called an activation function. This type of function is applied to each value in the feature maps of every convolutional layer. There exist many types of activation functions. In computer vision applications, the ReLU activation function has been used successfully [23], [40]. He et al. [41] proposed another type of activation function called PReLU that led to surpassing human-level performance on a visual recognition challenge [42]. Additionally, Clevert et al. [43] proposed the exponential linear units (ELU) activation function, which considerably speeds up learning and obtains less than 10% classification error compared to a ReLU network with the same architecture.

In our proposed CNN, as depicted in Fig. 2 we propose to constrain the range of data values with the saturation regions of the hyperbolic tangent (TanH) activation function at every stage of the network. Introducing nonlinearity throughout the network layers strengthens the CNN’s capability to separate the feature space. However, one can notice that feature maps learned by the constrained convolutional layer are not followed by a TanH layer. This is mainly because the learned prediction error features can easily be destroyed by many types of nonlinear operations including activation functions. In our experiments (see Section V-G), we compare the performance of our proposed TanH based CNN architecture to CNN models with different choices of activation functions that we mentioned above.

Finally, the output layer is followed by a softmax activation function. This type of activation function maps features learned by the last fully-connected layer to a set of probability values where the outputs of all neurons in this layer sum up to 1. The identification of the image manipulation types in subject images can be performed by choosing the editing operation associated with the neuron in the softmax layer with the highest activation level.

E. Batch Normalization

Researchers in computer vision have developed several techniques to normalize the data throughout the CNN architecture. Early deep learning architectures use the local response normalization (LRN) layer which normalizes the central coefficient within a sliding window in a feature map with respect to its neighbors. Recently, Ioffe et al. proposed in [44] the batch normalization layer which dramatically accelerates the training of deep networks. This type of mechanism minimizes the internal covariate shift which is the change in the input distribution to a learning system.

This is done by a zero-mean and unit-variance transformation of the data while training the CNN model. The input to each layer gets affected by the parameters of all previous layers and even small changes get amplified. Thus, this type of layer addresses an important problem and increases the final accuracy of a CNN model. Therefore, in our proposed architecture we use a batch normalization layer after each regular convolutional layer. However, the prediction error convolutional filters’ outputs are directly convolved with the next convolutional layer without using the batch normalization layer.

F. Pooling

In our CNN, we use two different types of pooling, i.e., three max-pooling and one average-pooling. We experimentally demonstrate our choice of pooling layers in Section V. Similarly to [23], we use an overlapping kernel with size 3×3 and stride of 2. Explicitly, the max-pooling layer retains the maximum value within the local neighborhood of the sliding window, whereas the average-pooling layer retains the average in a local neighborhood. The purpose of this type of layer is to reduce the dimensionality of the feature maps. This reduces the computational cost of training and decreases the chances of over-fitting. More specifically, the set of parallel convolutional operations yields a high dimensional feature maps volume. Therefore, pooling layers are useful for subsampling as well as improving the accuracy by retaining the most representative features.

In our architecture, the four used pooling layers have respectively reduced the feature maps dimensions from 126×126×96 to 63×63×96, from 63×63×64 to 31×31×64, from 31×31×64 to 15×15×64 and finally from 15×15×128 to 7×7×128.
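The layer dimensions quoted in Sections IV-B and IV-F can be checked with a few lines of arithmetic. The sketch below is illustrative only: the text does not state the padding of each layer or how pooling output sizes are rounded, so the padding values (3 for Conv2, 2 for Conv3 and Conv4, 0 elsewhere) and the ceiling rounding in pooling are assumptions chosen because they reproduce the stated sizes. The helper names `conv_out`, `pool_out` and `mislnet_shapes` are hypothetical.

```python
import math


def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution (floor rounding)."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel=3, stride=2):
    """Spatial output size of 3x3, stride-2 pooling (ceiling rounding, assumed)."""
    return math.ceil((size - kernel) / stride) + 1


def mislnet_shapes(patch=256):
    """Return (layer name, spatial size, channels) for each stage."""
    shapes = []
    s = conv_out(patch, 5)                 # Constrained Conv: 5x5, stride 1
    shapes.append(("Constrained Conv", s, 3))
    s = conv_out(s, 7, stride=2, pad=3)    # Conv2: 7x7, stride 2 (pad assumed)
    shapes.append(("Conv2", s, 96))
    s = pool_out(s)
    shapes.append(("Max-pool 1", s, 96))
    s = conv_out(s, 5, pad=2)              # Conv3: 5x5, stride 1 (pad assumed)
    shapes.append(("Conv3", s, 64))
    s = pool_out(s)
    shapes.append(("Max-pool 2", s, 64))
    s = conv_out(s, 5, pad=2)              # Conv4: 5x5, stride 1 (pad assumed)
    shapes.append(("Conv4", s, 64))
    s = pool_out(s)
    shapes.append(("Max-pool 3", s, 64))
    s = conv_out(s, 1)                     # Conv5: 1x1, stride 1
    shapes.append(("Conv5", s, 128))
    s = pool_out(s)
    shapes.append(("Average-pool", s, 128))
    return shapes


for name, size, channels in mislnet_shapes():
    print(f"{name}: {size}x{size}x{channels}")
```

Under these assumptions the spatial sizes follow the sequence 252, 126, 63, 63, 31, 31, 15, 15, 7, which matches the dimensions given in the text; other padding or rounding conventions would give slightly different sizes.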
TABLE IV: Confusion matrix for identifying the manipulations listed in Table III using MISLnet. Rows give the true class and columns give the predicted class.

Softmax
True class   OR       MF       GB       AWGN     RS       JPEG
OR           98.70%   0.67%    0.01%    0.01%    0.13%    0.45%
MF           0.01%    99.08%   0.07%    0.00%    0.00%    0.82%
GB           0.00%    0.05%    99.15%   0.00%    0.00%    0.78%
AWGN         0.03%    0.00%    0.00%    99.96%   0.00%    0.00%
RS           0.05%    0.00%    0.01%    0.00%    98.87%   1.05%
JPEG         0.07%    0.00%    0.00%    0.00%    0.13%    99.79%

Extremely Randomized Trees
True class   OR       MF       GB       AWGN     RS       JPEG
OR           99.49%   0.15%    0.09%    0.02%    0.13%    0.11%
MF           0.07%    99.77%   0.11%    0.00%    0.04%    0.00%
GB           0.02%    0.41%    99.46%   0.00%    0.11%    0.00%
AWGN         0.02%    0.00%    0.00%    99.98%   0.00%    0.00%
RS           0.07%    0.36%    0.00%    0.00%    99.51%   0.06%
JPEG         0.06%    0.00%    0.00%    0.00%    0.15%    99.79%
Fig. 3: Output of the three learned filters in “Constrained Conv” layer using three different grayscale images.
TABLE VI: Confusion matrix for identifying the manipulations listed in Table V with arbitrary editing parameters using MISLnet. Rows give the true class and columns the predicted class.

(a) Softmax
True class   OR       MF       GB       AWGN     RS       JPEG
OR           97.21%   0.47%    0.02%    1.78%    0.00%    0.53%
MF           0.04%    99.01%   0.24%    0.02%    0.02%    0.68%
GB           0.00%    0.71%    98.73%   0.00%    0.00%    0.56%
AWGN         1.56%    0.02%    0.00%    98.39%   0.00%    0.04%
RS           0.11%    0.36%    0.02%    0.00%    99.16%   0.36%
JPEG         0.09%    0.00%    0.00%    0.02%    0.00%    99.89%

(b) Extremely Randomized Trees
True class   OR       MF       GB       AWGN     RS       JPEG
OR           97.73%   0.11%    0.04%    1.95%    0.06%    0.11%
MF           0.07%    99.59%   0.26%    0.02%    0.06%    0.00%
GB           0.04%    1.03%    98.93%   0.00%    0.00%    0.00%
AWGN         1.31%    0.02%    0.00%    98.61%   0.00%    0.06%
RS           0.13%    0.19%    0.02%    0.00%    99.64%   0.02%
JPEG         0.17%    0.26%    0.02%    0.04%    0.04%    99.47%
denote a sequence where the patch was first edited using manipulation X, then subsequently edited using manipulation Y (e.g., MF-RS corresponds to first applying median filtering, then applying resizing). We divided these into sets of training and testing patches created from two separate sets of images of total size 2,175, resulting in a set of 296,000 training patches and 52,000 testing patches for both. In total, 29,600 training patches and 5,200 testing patches were unaltered. We then trained MISLnet to distinguish between each sequence of editing operations.

Table VII shows the confusion matrix obtained from this experiment when a softmax is used to perform manipulation chain classification, while Table VIII shows the confusion matrix obtained using an extremely randomized trees (ET) classifier. The results of these experiments demonstrate that we can achieve an accuracy of 92.90% with the softmax-based CNN and 94.19% using the ET-based CNN. From Table VIII, one can observe that we can identify the type of processing operation with an accuracy typically higher than 91%. Notably, we achieve a 99.17% identification rate at detecting median filtering followed by resampling and 96.69% at detecting Gaussian blur followed by resampling, which is particularly high. Additionally, one can also observe that the ET classifier significantly improved the detection rate of the processing operations followed by median filtering (i.e., GB-MF and RS-MF). These results demonstrate the robustness of our proposed CNN at performing image manipulation detection in a challenging and realistic scenario.

E. Comparison with SRM-based approach

We compared our trained MISLnet CNN for multiple manipulations to the rich model approach [19], [20] using the same training and testing datasets described in Section V-B. Table IX displays a confusion matrix containing the manipulation detection results obtained using the rich model based approach. The results of these experiments showed that the rich model approach was able to achieve an average manipulation identification accuracy of 99.63%. By contrast, our constrained CNN was able to achieve an accuracy of 99.66% on the same dataset. From Tables IV and IX one can notice that our CNN achieved better identification rates for median filtering, Gaussian blurring and resampling.

These results demonstrate that our CNN based detector can perform as well as, or slightly better than, the rich model based detector. In Section V-H, we show that we can achieve an even higher detection accuracy with a larger amount of training data. These results are very important because our approach can learn salient image manipulation detection features directly from data. This may allow us to learn better feature extractors than the human-designed ones incorporated into the rich model.

Training time is an important factor when devising a data-driven manipulation detection approach. Our CNN based approach took approximately six hours to train on this database. By contrast, a multi-threaded implementation (using eight threads) of the rich model took over 58 hours to perform only feature extraction on this database using the same computer. Training the classifier for the SRM took several additional hours. As a result, it becomes extremely challenging, if not infeasible, to train the SRM on a very large database.

F. Prediction error feature extractor design choices

The overall performance of MISLnet depends on several design choices. An important one of these is the design of the first CNN layer, which extracts prediction error features in our network. To determine the optimal design of this layer, we conducted several experiments to examine the influence of other filter type choices for the first CNN layer, including the use of a fixed high-pass filter [37], [38], as well as not using a prediction error feature extraction block (i.e. beginning our CNN with a standard convolutional layer). Additionally, we conducted experiments to determine the optimal number and size of the filters in the constrained convolutional layer of MISLnet.

Choice of prediction error feature extractor: We evaluated the advantage of using the constrained convolutional layer as MISLnet's first layer through two sets of experiments. In each experiment, we trained and evaluated our proposed CNN architecture using three different choices for the first layer: (1) using a constrained convolutional layer, (2) without using a constrained convolutional layer, and (3) replacing the constrained convolutional layer in MISLnet with a generic fixed high-pass filter. In these experiments, we used the same high-pass filter commonly employed in forensics and steganalysis [37], [38].

To assess the performance gains achieved by the constrained convolutional layer, we report the classification accuracy MISLnet achieved using each choice of filter for the beginning of our CNN. Additionally, we report the relative error reduction achieved by using the constrained convolutional layer instead of each alternative. Relative error reduction measures the reduction in error achieved by a classifier normalized by total
TABLE VII: Confusion matrix for identifying manipulation chains followed by re-compression using MISLnet with a softmax. Rows give the true class and columns the predicted class.

True class   OR       MF       GB       RS       MF-GB    GB-MF    MF-RS    RS-MF    GB-RS    RS-GB
OR           99.27%   0.06%    0.12%    0.52%    0.02%    0.00%    0.02%    0.00%    0.00%    0.00%
MF           0.13%    90.54%   0.02%    0.02%    1.08%    3.02%    0.10%    5.10%    0.00%    0.00%
GB           0.00%    0.23%    93.56%   0.25%    0.44%    0.04%    0.04%    0.04%    0.02%    5.38%
RS           0.19%    0.06%    2.12%    97.15%   0.04%    0.00%    0.12%    0.08%    0.00%    0.25%
MF-GB        0.00%    0.10%    0.23%    0.00%    98.08%   0.77%    0.02%    0.04%    0.23%    0.54%
GB-MF        0.00%    1.40%    0.12%    0.00%    8.65%    80.13%   0.08%    9.48%    0.04%    0.10%
MF-RS        0.02%    0.23%    0.00%    0.19%    0.06%    0.50%    97.69%   1.04%    0.27%    0.00%
RS-MF        0.00%    2.90%    0.02%    0.00%    1.65%    10.75%   0.21%    84.21%   0.06%    0.19%
GB-RS        0.00%    0.08%    0.00%    0.02%    0.87%    0.02%    0.94%    0.12%    93.94%   4.02%
RS-GB        0.00%    0.25%    1.06%    0.02%    1.23%    0.08%    0.06%    0.10%    2.71%    94.50%
TABLE VIII: Confusion matrix for identifying manipulation chains followed by re-compression using MISLnet with an ET classifier. Rows give the true class and columns the predicted class.

True class   OR       MF       GB       RS       MF-GB    GB-MF    MF-RS    RS-MF    GB-RS    RS-GB
OR           99.33%   0.06%    0.1%     0.05%    0.00%    0.00%    0.02%    0.00%    0.00%    0.00%
MF           0.15%    91.77%   0.02%    0.02%    0.52%    2.12%    0.29%    5.10%    0.00%    0.02%
GB           0.00%    0.21%    95.00%   0.87%    0.42%    0.04%    0.04%    0.00%    0.06%    3.37%
RS           0.17%    0.00%    0.65%    98.94%   0.02%    0.00%    0.08%    0.08%    0.02%    0.04%
MF-GB        0.00%    0.31%    0.19%    0.00%    95.87%   2.48%    0.02%    0.29%    0.42%    0.42%
GB-MF        0.00%    1.44%    0.00%    0.02%    3.38%    86.02%   0.13%    8.87%    0.06%    0.08%
MF-RS        0.02%    0.04%    0.00%    0.01%    0.00%    0.08%    99.17%   0.38%    0.21%    0.00%
RS-MF        0.00%    3.54%    0.02%    0.00%    0.17%    9.38%    0.65%    86.00%   0.17%    0.06%
GB-RS        0.00%    0.02%    0.00%    0.00%    0.40%    0.02%    0.96%    0.06%    96.69%   1.85%
RS-GB        0.00%    0.10%    2.08%    0.06%    0.63%    0.13%    0.04%    0.23%    3.56%    93.17%
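The overall accuracies reported for the manipulation-chain experiment can be cross-checked against the diagonals of Tables VII and VIII. The sketch below assumes that every class contributes the same number of testing patches, so that the overall accuracy is the plain average of the per-class rates; the function name `mean_accuracy` is hypothetical.

```python
# Diagonal entries (per-class accuracy, in percent) of Tables VII and VIII,
# in the order OR, MF, GB, RS, MF-GB, GB-MF, MF-RS, RS-MF, GB-RS, RS-GB.
softmax_diag = [99.27, 90.54, 93.56, 97.15, 98.08,
                80.13, 97.69, 84.21, 93.94, 94.50]
et_diag = [99.33, 91.77, 95.00, 98.94, 95.87,
           86.02, 99.17, 86.00, 96.69, 93.17]


def mean_accuracy(diag):
    """Average per-class accuracy; equals overall accuracy only when
    all classes have the same number of test samples (assumed here)."""
    return sum(diag) / len(diag)


print(f"softmax: {mean_accuracy(softmax_diag):.3f}%")  # softmax: 92.907%
print(f"ET:      {mean_accuracy(et_diag):.3f}%")       # ET:      94.196%
```

Under the equal-class-size assumption, these averages agree with the reported 92.90% and 94.19% up to the last digit.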
TABLE IX: Confusion matrix for identifying the manipulations listed in Table III using the rich model.
Predicted Class
OR MF GB AWGN RS JPEG
OR 99.83% 0.07% 0.00% 0.00% 0.02% 0.07%
True Class
error reduction that is possible. For reference, the relative error reduction (RER) is calculated according to the formula RER = (e1 − e2)/e1, where e1 corresponds to the error achieved by the lower performing method and e2 corresponds to the error achieved by the higher performing method.

In our first experiment, we examined the effect of each filter choice when performing manipulation detection in the same manner as in Section V-B, i.e. on images altered using the manipulations listed in Table III. This was done by using the same training and testing datasets described in Section V-B, as well as the same training procedures (i.e. batch size, learning rate, etc.). Each trained CNN was then used to classify each image patch in the test set.

The results of this first experiment are shown in Table X. From this table, we can see that when MISLnet was trained without the constrained convolutional layer, its performance decreased by 0.90%. This corresponds to a relative error reduction of RER = 54.87% achieved by using the constrained convolutional layer. Additionally, we can see that using the constrained convolutional layer achieved a detection accuracy 0.31% higher than when a fixed high-pass filter was used. This corresponds to a relative error reduction of RER = 29.54% over a fixed high-pass filter. These results demonstrate the advantage of using the constrained convolutional layer.

The full benefit of using a constrained convolutional layer can be seen when considering more challenging forensic scenarios. To demonstrate this, we conducted another set of experiments where each image patch was edited by a sequence of up to two different manipulations, then JPEG compressed using a quality factor of 90. By re-compressing each image after manipulation, we can mimic conditions similar to those in social networking applications, which typically re-compress each image before distributing it.

We built our experimental database by using the same training and testing datasets as in Section V-D. We then trained MISLnet to distinguish between each sequence of editing operations using a constrained convolutional layer, a fixed high-pass filter, and without using a prediction error block (i.e. a normal convolutional layer). Next, we used each version of our CNN to determine the manipulation sequence of each patch in the testing set.

Pooling operations                        Accuracy
Avg-pooling                               96.35%
Max-pooling                               98.95%
Max-pooling w/ avg-pooling after Conv5    99.26%
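As a numerical check of the RER formula, the sketch below applies it to the manipulation-chain accuracies reported in this section (92.90% with the constrained convolutional layer, 80.63% with a fixed high-pass filter and 84.10% with a normal convolutional layer). The function name is hypothetical, and the error is taken to be 100% minus the accuracy, which is an assumption about how the errors e1 and e2 were obtained.

```python
def relative_error_reduction(acc_low, acc_high):
    """RER = (e1 - e2) / e1, with each error taken as 100 - accuracy (%).

    acc_low  : accuracy of the lower performing method (gives e1)
    acc_high : accuracy of the higher performing method (gives e2)
    """
    e1 = 100.0 - acc_low
    e2 = 100.0 - acc_high
    return (e1 - e2) / e1


# Constrained layer (92.90%) versus the two alternatives.
rer_hpf = relative_error_reduction(80.63, 92.90)   # fixed high-pass filter
rer_conv = relative_error_reduction(84.10, 92.90)  # normal convolutional layer

print(f"RER over fixed high-pass filter: {100 * rer_hpf:.2f}%")
print(f"RER over normal conv layer:      {100 * rer_conv:.2f}%")
```

The two values come out at roughly 63.35% and 55.35%, in line with the reported 63.35% and 55.34% up to rounding of the last digit.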
The classification accuracies obtained by our CNN using each choice for the first layer are shown in Table XI. From this table, we can see that when the constrained convolutional layer was used, MISLnet can identify the sequence of manipulations used to modify each patch with 92.90% accuracy. By contrast, using a fixed high-pass filter yields an accuracy of 80.63%, while using a normal convolutional layer yields an accuracy of 84.10%. In these experiments, using a constrained convolutional layer produces a relative error reduction of 63.35% over a fixed high-pass filter and a relative error reduction of 55.34% over a normal convolutional layer. These results clearly demonstrate the advantage of using the constrained convolutional layer in challenging manipulation detection scenarios.

“Constrained Conv” layer parameters: We conducted two sets of experiments to investigate the impact of the number of filters and their dimension in the “Constrained Conv” layer on the CNN's performance. To accomplish this, we used the same training and testing datasets that we described in Section V-B. In our first experiment, we identified the optimal number of filters to use in the constrained convolutional layer by letting the number of filters vary from 1 to 6 and evaluating the manipulation detection accuracy achieved by our CNN under each scenario. Table XII shows the results of our experiments. We can notice that our proposed MISLnet architecture with the choice of three constrained filters maximizes the CNN's performance and outperforms the other choices of filter numbers by at least 0.15%.

TABLE XII: MISLnet detection accuracy with different numbers of filters in the “Constrained Conv” layer.

No. of filters   Testing accuracy
1                99.04%
2                98.02%
3                99.26%
4                99.11%
5                99.05%
6                98.97%

In our second experiment, we examined the effect of the size of the filters in the constrained convolutional layer. This was done by evaluating the detection accuracy of our CNN using filters of size 3×3, 5×5 and 7×7 in the constrained convolutional layer. We kept the total number of filters in the “Constrained Conv” layer fixed (3 filters). Experiments showed that our choice of a 5×5 “Constrained Conv” layer, with a 99.26% identification rate, outperforms the other dimension choices. More specifically, MISLnet achieves a 98.91% identification rate with 3×3 constrained filters and a 99.07% identification rate with 7×7 constrained filters in the “Constrained Conv” layer. Taken together, the results of these two experiments show that using three 5×5 filters in the constrained convolutional layer maximizes the performance of our CNN.

G. Architecture design choices

The structural design of a CNN's architecture has a large impact on its final accuracy. We ran several sets of additional experiments related to the structural design of our CNN's architecture. In this paper we present three of these experiments, namely (1) the choice of the pooling layer, (2) the choice of the activation function, and (3) the choice of the stride size in the “Conv2” layer with different input patch sizes (i.e., 256×256, 128×128 and 64×64). To accomplish this, we started with the fixed architecture defined in Fig. 2. Then we changed one architectural design choice in each experiment, such as the choice of pooling layer or activation function. For smaller patch sizes, each image patch in the training and testing datasets was cropped in the center.

Pooling layer: We first evaluated the impact of different types of pooling layers on our CNN's performance. We trained three CNN models using the architecture described in Fig. 2 with different choices of pooling layers, i.e., max-pooling, average-pooling and max-pooling with average pooling after the “Conv5” layer. In our CNN we used 1×1 convolutional filters in the “Conv5” layer to learn new associations between feature maps. Because of this, the choice of pooling
Fig. 6: CNN testing accuracy vs. number of training patches from the Dresden database [46] using Softmax (dashed line) and Extremely Randomized Trees (ET) (solid line) with different patch sizes; blue: 256×256, red: 128×128, green: 64×64.
TABLE XVII: Confusion matrix for identifying the operation types using MISLnet (w/ softmax) trained with 2.5 million patches. Rows give the true class and columns the predicted class.

True class   OR       MF       GB       AWGN     RS       JPEG
OR           99.95%   0.01%    0.00%    0.00%    0.00%    0.04%
MF           0.02%    99.89%   0.02%    0.00%    0.02%    0.04%
GB           0.00%    0.00%    100%     0.00%    0.00%    0.00%
AWGN         0.00%    0.00%    0.00%    100%     0.00%    0.00%
RS           0.00%    0.00%    0.00%    0.00%    99.99%   0.01%
JPEG         0.01%    0.00%    0.00%    0.00%    0.00%    99.99%
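The dataset sizes quoted for the 34-camera-model experiments in this section follow from simple patch-count arithmetic. The sketch below is a worked check under two assumptions that are not spelled out in the text: every image contributes exactly 25 unaltered patches, and each unaltered patch yields one edited patch per tampering operation (six classes in total). The names used are hypothetical.

```python
PATCHES_PER_IMAGE = 25  # central 256x256 blocks retained per image
NUM_CLASSES = 6         # unaltered + five tampering operations (assumed)


def patch_counts(num_images):
    """Return (unaltered patches, total patches over all classes)."""
    unaltered = num_images * PATCHES_PER_IMAGE
    return unaltered, unaltered * NUM_CLASSES


test_counts = patch_counts(334)      # images used for the "wild" test set
train_counts = patch_counts(16667)   # remaining images used for training

print(test_counts)   # (8350, 50100)
print(train_counts)  # (416675, 2500050)
```

The unaltered counts (8,350 and 416,675) match the text exactly, while the totals are close to the quoted 50,000 testing patches and 2.5 million training patches, which suggests those figures are rounded.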
from the Dresden database to perform general-purpose image manipulation detection on images taken using 34 different camera models that we have manually collected.

To evaluate the performance of our trained CNN on a different dataset, we built a “wild” testing dataset of images taken by 34 new camera models. This dataset consisted of 50,000 grayscale patches of size 256×256, where 8,350 patches were not edited. This was accomplished by retaining all the 25 central 256×256 blocks from the green layer of 334 randomly selected images. Next, similarly to our previous set of experiments, we created their corresponding edited patches using the five tampering operations listed in Table III.

Table XVI shows the performance of our CNN in a “real world” scenario with different input patch sizes. One can notice that we can still identify the type of image editing with at least 99.33% accuracy when we use 256×256 testing input patches. We can also notice that with smaller patch sizes our deep learning method is able to detect the type of image editing with an accuracy higher than when tested on patches from the Dresden testing dataset. More specifically, in the real world scenario, our proposed ET-based constrained CNN can identify the different types of image manipulations in 64×64 and 128×128 input image patches with accuracies of 99.51% and 99.69% respectively, whereas with our Dresden experimental testing dataset it can achieve accuracies of 99.43% and 99.67% respectively. These results demonstrate the robustness of our proposed approach when used in real world scenarios.

Finally, to further improve the learning ability of our CNN, we built a new training dataset that contains 2.5 million patches from our collected ‘34-camera-model’ database to classify each patch in the testing dataset. To accomplish this, we similarly retained all the 25 central patches of size 256×256 from the green layer of the remaining 16,667 images not used for testing. Then we created their corresponding edited patches using the five tampering operations. In total, there were 416,675 unaltered training patches. We finally trained our CNN using the same parameters that we used in the previous experiments.

We used our re-trained CNN to identify the type of manipulation applied to each patch in the testing dataset described in the previous experiment. Our proposed constrained CNN was able to classify the testing dataset with 99.97% accuracy. Table XVII shows the confusion matrix of the trained CNN. Our method can achieve at least 99.89% accuracy for all manipulations considered. Notably, our approach can identify Gaussian blur and AWGN operations with 100% accuracy. Additionally, compared to the performance using 100K training patches (see Table IV), one can notice that our CNN trained with significantly more data can accurately distinguish between JPEG compressed images and both resampled and unaltered images. Similarly to the experiments in Section V-H, these results demonstrate again that with a larger scale of training data we can improve our CNN's performance.

VI. CONCLUSION

In this paper, we proposed a novel deep learning based approach to perform general-purpose image manipulation detection. Unlike existing approaches that rely on hand-designed features, our proposed CNN is able to jointly suppress an image's content and adaptively learn image manipulation detection features directly from data. To accomplish this, we developed a new type of layer, called a constrained convolutional layer, that forces our CNN to learn prediction error filters that produce low-level forensic features. Using this layer, we designed a new CNN architecture that is able to accurately detect multiple types of image manipulations.

Through a series of experiments, we assessed the ability of our proposed constrained CNN to perform image manipulation detection. The results of these experiments showed that our CNN can be trained to accurately detect individual targeted manipulations as well as multiple types of manipulations. To further assess the performance of our constrained CNN, we compared it to the SRM-based general purpose image manipulation detection approach (i.e. the current state-of-the-art detector) using five different image manipulations. This experimental comparison showed that our proposed CNN architecture can outperform the SRM, particularly when using large scale training data. Additionally, we conducted a set of experiments to mimic a realistic scenario where a forensic investigator will use our CNN to analyze images from different, possibly unknown, source devices than those used to train our CNN. These experimental results show that our CNN can still accurately detect image manipulations even when there is a source mismatch between the data used to train our CNN and an image under investigation.

REFERENCES

[1] M. C. Stamm, M. Wu, and K. J. R. Liu, “Information forensics: An overview of the first decade,” IEEE Access, vol. 1, pp. 167–200, 2013.
[2] A. C. Popescu and H. Farid, “Exposing digital forgeries by detecting traces of resampling,” IEEE Transactions on Signal Processing, vol. 53, no. 2, pp. 758–767, Feb. 2005.
[3] M. Kirchner, “Fast and reliable resampling detection by spectral analysis of fixed linear predictor residue,” in Proceedings of the 10th ACM Workshop on Multimedia and Security, ser. MM&Sec ’08. New York, NY, USA: ACM, 2008, pp. 11–20.
[4] N. Dalgaard, C. Mosquera, and F. Pérez-González, “On the role of differentiation for resampling detection,” in IEEE International Conference on Image Processing (ICIP). IEEE, 2010, pp. 1753–1756.
[5] X. Feng, I. J. Cox, and G. Doerr, “Normalized energy density-based forensic detection of resampled images,” IEEE Transactions on Multimedia, vol. 14, no. 3, pp. 536–545, 2012.
[6] B. Mahdian and S. Saic, “Blind authentication using periodic properties of interpolation,” IEEE Transactions on Information Forensics and Security, vol. 3, no. 3, pp. 529–538, 2008.
[7] M. Kirchner and J. Fridrich, “On detection of median filtering in digital images,” in IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2010, pp. 754 110–754 110.
[8] X. Kang, M. C. Stamm, A. Peng, and K. J. R. Liu, “Robust median filtering forensics using an autoregressive model,” IEEE Trans. Information Forensics and Security, vol. 8, no. 9, pp. 1456–1468, Sep. 2013.
[9] G. Cao, Y. Zhao, R. Ni, L. Yu, and H. Tian, “Forensic detection of median filtering in digital images,” in IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2010, pp. 89–94.
[10] C. Chen and J. Ni, “Median filtering detection using edge based prediction matrix,” Digital Forensics and Watermarking, pp. 361–375, 2012.
[11] M. C. Stamm and K. J. R. Liu, “Forensic detection of image manipulation using statistical intrinsic fingerprints,” IEEE Trans. on Information Forensics and Security, vol. 5, no. 3, pp. 492–506, 2010.
[12] H. Yao, S. Wang, and X. Zhang, “Detect piecewise linear contrast enhancement and estimate parameters using spectral analysis of image histogram,” in International Communication Conference on Wireless Mobile and Computing. IET, 2009.
[13] M. Stamm and K. R. Liu, “Blind forensics of contrast enhancement in digital images,” in IEEE International Conference on Image Processing. IEEE, 2008, pp. 3112–3115.
[14] M. C. Stamm and K. R. Liu, “Forensic estimation and reconstruction of a contrast enhancement mapping,” in IEEE International Conference on Acoustics Speech and Signal Processing. IEEE, 2010, pp. 1698–1701.
[15] T. Bianchi and A. Piva, “Detection of non-aligned double JPEG compression with estimation of primary compression parameters,” in IEEE International Conference on Image Processing. IEEE, 2011, pp. 1929–1932.
[16] T. Bianchi and A. Piva, “Image forgery localization via block-grained analysis of JPEG artifacts,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 1003–1017, 2012.
[17] R. Neelamani, R. De Queiroz, Z. Fan, S. Dash, and R. G. Baraniuk, “JPEG compression history estimation for color images,” IEEE Transactions on Image Processing, vol. 15, no. 6, pp. 1365–1378, 2006.
[18] Z. Qu, W. Luo, and J. Huang, “A convolutive mixing model for shifted double JPEG compression with application to passive image authentication,” in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008, pp. 1661–1664.
[19] J. Fridrich and J. Kodovskỳ, “Rich models for steganalysis of digital images,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 868–882, 2012.
[20] X. Qiu, H. Li, W. Luo, and J. Huang, “A universal image forensic strategy based on steganalytic model,” in Workshop on Information Hiding and Multimedia Security. ACM, 2014, pp. 165–170.
[21] T. Pevny, P. Bas, and J. Fridrich, “Steganalysis by subtractive pixel adjacency matrix,” IEEE Transactions on Information Forensics and Security, vol. 5, no. 2, pp. 215–224, Jun. 2010.
[22] W. Fan, K. Wang, and F. Cayre, “General-purpose image forensics using patch likelihood under image statistical models,” in IEEE International Workshop on Information Forensics and Security, Nov. 2015, pp. 1–6.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[24] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
[25] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 160–167.
[26] B. Bayar and M. C. Stamm, “A deep learning approach to universal image manipulation detection using a new convolutional layer,” in Workshop on Information Hiding and Multimedia Security. ACM, 2016, pp. 5–10.
[27] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
[28] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 9–48.
[29] H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, pp. 400–407, 1951.
[30] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
[31] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[32] M. Kirchner and T. Gloe, “On resampling detection in re-compressed images,” in 2009 First IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2009, pp. 21–25.
[33] J. Chen, X. Kang, Y. Liu, and Z. J. Wang, “Median filtering forensics based on convolutional neural networks,” IEEE Signal Processing Letters, vol. 22, no. 11, pp. 1849–1853, Nov. 2015.
[34] B. Bayar and M. C. Stamm, “On the robustness of constrained convolutional neural networks to JPEG post-compression for image resampling detection,” in The 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 2152–2156.
[35] B. Bayar and M. C. Stamm, “Towards order of processing operations detection in JPEG-compressed images with convolutional neural networks,” in International Symposium on Electronic Imaging: Media Watermarking, Security, and Forensics. IS&T, 2018.
[36] B. Bayar and M. C. Stamm, “A generic approach towards image manipulation parameter estimation using convolutional neural networks,” in Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security. ACM, 2017.
[37] B. Bayar and M. C. Stamm, “Design principles of convolutional neural networks for multimedia forensics,” in International Symposium on Electronic Imaging. IS&T, 2017.
[38] G. Xu, H.-Z. Wu, and Y.-Q. Shi, “Structural design of convolutional neural networks for steganalysis,” IEEE Signal Processing Letters, vol. 23, no. 5, pp. 708–712, 2016.
[39] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” in ICML, 2014, pp. 647–655.
[40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[41] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[43] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” arXiv preprint arXiv:1511.07289, 2015.
[44] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
[46] T. Gloe and R. Böhme, “The Dresden image database for benchmarking digital image forensics,” Journal of Digital Forensic Practice, vol. 3, no. 2-4, pp. 150–159, 2010.
[47] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in ICDAR, vol. 3, 2003, pp. 958–962.
[48] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 437–478.
Belhassen Bayar (S’11) received the B.S. degree in Electrical Engineering from the Ecole Nationale d’Ingénieurs de Tunis (ENIT), Tunisia, in 2011, and the M.S. degree in Electrical and Computer Engineering from Rowan University, New Jersey, in 2014. After graduating from ENIT, he worked as a Research Assistant at the University of Arkansas at Little Rock (UALR). In Fall 2014, he joined Drexel University, Pennsylvania, where he is currently a Ph.D. candidate with the Department of Electrical and Computer Engineering. Bayar won the Best Paper Award at the IEEE International Workshop on Genomic Signal Processing and Statistics in 2013. In summer 2015 he interned at Samsung Research America in Mountain View, California. His main research interests are in image forensics, machine learning and signal processing.

Matthew C. Stamm (S’08–M’12) received the B.S., M.S., and Ph.D. degrees in electrical engineering from the University of Maryland at College Park, College Park, MD, USA, in 2004, 2011, and 2012, respectively.
Since 2013 he has been an Assistant Professor with the Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA. He leads the Multimedia and Information Security Lab (MISL), where he and his team conduct research on signal processing, machine learning, and information security with a focus on multimedia forensics and anti-forensics.
Dr. Stamm is the recipient of a 2016 NSF CAREER Award and the 2017 Drexel University College of Engineering’s Outstanding Early-Career Research Achievement Award. He was the General Chair of the 2017 ACM Workshop on Information Hiding and Multimedia Security and is the lead organizer of the 2018 IEEE Signal Processing Cup competition. He currently serves as a member of the IEEE SPS Technical Committee on Information Forensics and Security and as a member of the editorial board of IEEE SigPort. For his doctoral dissertation research, Dr. Stamm was named the winner of the Dean’s Doctoral Research Award from the A. James Clark School of Engineering. While at the University of Maryland, he was also the recipient of the Ann G. Wylie Dissertation Fellowship and a Future Faculty Fellowship. Prior to beginning his graduate studies, he worked as an engineer at the Johns Hopkins University Applied Physics Lab.