
Accepted to the 1st IEEE European Symposium on Security & Privacy, IEEE 2016. Saarbrücken, Germany.

The Limitations of Deep Learning in Adversarial Settings

Nicolas Papernot∗, Patrick McDaniel∗, Somesh Jha†, Matt Fredrikson‡, Z. Berkay Celik∗, Ananthram Swami§
∗ Department of Computer Science and Engineering, Penn State University
† Computer Sciences Department, University of Wisconsin-Madison
‡ School of Computer Science, Carnegie Mellon University
§ United States Army Research Laboratory, Adelphi, Maryland
{ngp5056,mcdaniel}@cse.psu.edu, {jha,mfredrik}@cs.wisc.edu, [email protected], [email protected]

arXiv:1511.07528v1 [cs.CR] 24 Nov 2015

Abstract—Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.

Fig. 1: Adversarial sample generation - Distortion is added to input samples to force the DNN to output adversary-selected classifications (min distortion = 0.26%, max distortion = 13.78%, and average distortion ε = 4.06%). [Figure: a 10x10 grid of digit images, with input class 0-9 on the rows and output classification 0-9 on the columns.]

I. INTRODUCTION

Large neural networks, recast as deep neural networks (DNNs) in the mid 2000s, altered the machine learning landscape by outperforming other approaches in many tasks. This was made possible by advances that reduced the computational complexity of training [20]. For instance, deep learning (DL) can now take advantage of large datasets to achieve accuracy rates higher than previous classification techniques. In short, DL is transforming computational processing of complex data in many domains such as vision [24], [37], speech recognition [15], [32], [33], language processing [13], financial fraud detection [23], and recently malware detection [14].

This increasing use of deep learning is creating incentives for adversaries to manipulate DNNs to force misclassification of inputs. For instance, applications of deep learning use image classifiers to distinguish inappropriate from appropriate content, and text and image classifiers to differentiate between SPAM and non-SPAM email. An adversary able to craft misclassified inputs would profit from evading detection; indeed, such attacks occur today on non-DL classification systems [6], [7], [22]. In the physical domain, consider a driverless car system that uses DL to identify traffic signs [12]. If slightly altering "STOP" signs causes DNNs to misclassify them, the car would not stop, thus subverting the car's safety.

An adversarial sample is an input crafted to cause deep learning algorithms to misclassify. Note that adversarial samples are created at test time, after the DNN has been trained by the defender, and do not require any alteration of the training process. Figure 1 shows examples of adversarial samples taken from our validation experiments. It shows how an image originally showing a digit can be altered to force a DNN to classify it as another digit. Adversarial samples are created from benign samples by adding distortions exploiting the imperfect generalization learned by DNNs from finite training sets [4], and the underlying linearity of most components used to build DNNs [18]. Previous work explored DNN properties that could be used to craft adversarial samples [18], [30], [36]. Simply put, these techniques exploit gradients computed by network training algorithms: instead of using these gradients to update network parameters as would normally be done, the gradients are used to update the original input itself, which is subsequently misclassified by DNNs.
In this paper, we describe a new class of algorithms for adversarial sample creation against any feedforward (acyclic) DNN [31] and formalize the threat model space of deep learning with respect to the integrity of output classification. Unlike previous approaches mentioned above, we compute a direct mapping from the input to the output to achieve an explicit adversarial goal. Furthermore, our approach only alters a (frequently small) fraction of input features, leading to reduced perturbation of the source inputs. It also enables adversaries to apply heuristic searches to find perturbations leading to targeted misclassifications (perturbing inputs to result in a specific output classification).

More formally, a DNN models a multidimensional function F : X ↦ Y where X is a (raw) feature vector and Y is an output vector. We construct an adversarial sample X∗ from a benign sample X by adding a perturbation vector δX solving the following optimization problem:

    arg min_{δX} ‖δX‖  s.t.  F(X + δX) = Y∗        (1)

where X∗ = X + δX is the adversarial sample, Y∗ is the desired adversarial output, and ‖·‖ is a norm appropriate to compare the DNN inputs. Solving this problem is non-trivial, as properties of DNNs make it non-linear and non-convex [25]. Thus, we craft adversarial samples by constructing a mapping from input perturbations to output variations. Note that all research mentioned above took the opposite approach: it used output variations to find corresponding input perturbations.

Our understanding of how changes made to inputs affect a DNN's output stems from the evaluation of the forward derivative: a matrix we introduce and define as the Jacobian of the function learned by the DNN. The forward derivative is used to construct adversarial saliency maps indicating input features to include in perturbation δX in order to produce adversarial samples inducing a certain behavior from the DNN.

Forward derivative approaches are much more powerful than the gradient descent techniques used in prior systems. They are applicable to both supervised and unsupervised architectures and allow adversaries to generate information for broad families of adversarial samples. Indeed, adversarial saliency maps are versatile tools based on the forward derivative and designed with adversarial goals in mind, giving greater control to adversaries with respect to the choice of perturbations. In our work, we consider the following questions to formalize the security of DL in adversarial settings: (1) "What is the minimal knowledge required to perform attacks against DL?", (2) "How can vulnerable or resistant samples be identified?", and (3) "How are adversarial samples perceived by humans?".

The adversarial sample generation algorithms are validated using the widely studied LeNet architecture (a pioneering DNN used for hand-written digit recognition [26]) and the MNIST dataset [27]. We show that any input sample can be perturbed to be misclassified as any target class with 97.10% success while perturbing on average 4.02% of the input features per sample. The computational costs of sample generation are modest; samples were each generated in less than a second in our setup. Lastly, we study the impact of our algorithmic parameters on distortion and human perception of samples.

This paper makes the following contributions:
• We formalize the space of adversaries against classification DNNs with respect to adversarial goal and capabilities. Here, we provide a better understanding of how attacker capabilities constrain attack strategies and goals.
• We introduce a new class of algorithms for crafting adversarial samples solely by using knowledge of the DNN architecture. These algorithms (1) exploit forward derivatives that inform the learned behavior of DNNs, and (2) build adversarial saliency maps enabling an efficient exploration of the adversarial-samples search space.
• We validate the algorithms using a widely used computer vision DNN. We define and measure sample distortion and source-to-target hardness, and explore defenses against adversarial samples. We conclude by studying human perception of distorted samples.

II. TAXONOMY OF THREAT MODELS IN DEEP LEARNING

Classical threat models enumerate the goals and capabilities of adversaries in a target domain [1]. This section taxonomizes threat models in deep learning systems and positions several previous works with respect to the strength of the modeled adversary. We begin by providing an overview of deep neural networks, highlighting their inputs, outputs, and function. We then consider the taxonomy presented in Figure 2.

A. About Deep Neural Networks

Deep neural networks are large neural networks organized into layers of neurons, corresponding to successive representations of the input data. A neuron is an individual computing unit transmitting to other neurons the result of the application of its activation function on its input. Neurons are connected by links with different weights and biases characterizing the strength between neuron pairs. Weights and biases can be viewed as DNN parameters used for information storage. We define a network architecture to include knowledge of the network topology, neuron activation functions, as well as weight and bias values. Weights and biases are determined during training by finding values that minimize a cost function c evaluated over the training data T. Network training is traditionally done by gradient descent using backpropagation [31].

Deep learning can be partitioned into two categories, depending on whether DNNs are trained in a supervised or unsupervised manner [29]. Supervised training leads to models that map unseen samples using a function inferred from labeled training data. On the contrary, unsupervised training learns representations of unlabeled training data, and the resulting DNNs can be used to generate new samples, or to automate feature engineering by acting as a pre-processing layer for larger DNNs. We restrict ourselves to the problem of learning multi-class classifiers in supervised settings. These DNNs are given an input X and output a class probability vector Y. Note that our work remains valid for unsupervised-trained DNNs; we leave a detailed study of this issue for future work.
Fig. 2: Threat Model Taxonomy: our class of algorithms operates in the threat model indicated by a star. [Figure: adversarial goals of increasing complexity on the X-axis; adversarial capabilities on the Y-axis, from strongest to weakest: architecture and training tools (F, T, c), architecture (F), training data (T), oracle (X → Y), and samples {(X, Y)}; attack difficulty decreases as adversary knowledge increases; previous works [13], [29], [36] are positioned on the taxonomy.]

Fig. 3: Simplified Multi-Layer Perceptron architecture with input X = (x1, x2), hidden layer (h1, h2), and output o.

Figure 3 illustrates an example shallow feedforward neural network.¹ The network has two input neurons x1 and x2, a hidden layer with two neurons h1 and h2, and a single output neuron o. In other words, it is a simple multi-layer perceptron. Both input neurons x1 and x2 take real values in [0, 1] and correspond to the network input: a feature vector X = (x1, x2) ∈ [0, 1]². Hidden layer neurons each use the logistic sigmoid function φ : x ↦ 1/(1 + e⁻ˣ) as their activation function. This function is frequently used in neural networks because it is continuous (and differentiable), demonstrates linear-like behavior around 0, and saturates as the input goes to ±∞. Neurons in the hidden layer apply the sigmoid to the weighted input layer: for instance, neuron h1 computes h1(X) = φ(zh1(X)) with zh1(X) = w11 x1 + w12 x2 + b1, where w11 and w12 are weights and b1 a bias. Similarly, the output neuron applies the sigmoid function to the weighted output of the hidden layer, where zo(X) = w31 h1(X) + w32 h2(X) + b3. Weight and bias values are determined during training. Thus, the overall behavior of the network learned during training can be modeled as a function F : X → φ(zo(X)).

¹ A shallow neural network is a small neural network that operates (albeit at a smaller scale) identically to the DL networks considered throughout.

B. Adversarial Goals

Threats are defined with a specific function to be protected/defended. In the case of deep learning systems, the integrity of the classification is of paramount importance. Specifically, an adversary of a deep learning system seeks to provide an input X∗ that results in an incorrect output classification. The nature of the incorrectness represents the adversarial goal, as identified in the X-axis of Figure 2. Consider four goals that impact classifier output integrity:

1) Confidence reduction - reduce the output classification confidence (thereby introducing class ambiguity)
2) Misclassification - alter the output classification to any class different from the original class
3) Targeted misclassification - produce inputs that force the output classification to be a specific target class. Continuing the example illustrated in Figure 1, the adversary would create a set of speckles classified as a digit.
4) Source/target misclassification - force the output classification of a specific input to be a specific target class. Continuing the example from Figure 1, adversaries take an existing image of a digit and add a small number of speckles to classify the resulting image as another digit.

The scientific community recently started exploring adversarial deep learning. Previous work on other machine learning techniques is referenced later in Section VII.

Szegedy et al. introduced a system that generates adversarial samples by perturbing inputs in a way that creates source/target misclassifications [36]. The perturbations made by their work, which focused on a computer vision application, are not distinguishable by humans: for example, small but carefully-crafted perturbations to an image of a vehicle resulted in the DNN classifying it as an ostrich. The authors named this modified input an adversarial image, which can be generalized as part of a broader definition of adversarial samples. When producing adversarial samples, the adversary's goal is to generate inputs that are correctly classified (or not distinguishable) by humans or other classifiers, but are misclassified by the targeted DNN.

Another example is due to Nguyen et al., who presented a method for producing images that are unrecognizable to humans, but are nonetheless labeled as recognizable objects by DNNs [30]. For instance, they demonstrated how a DNN will classify a noise-filled image constructed using their technique as a television with high confidence. They named the images produced by this method fooling images. Here, a fooling image is one that does not have a source class but is crafted solely to perform a targeted misclassification attack.
C. Adversarial Capabilities

Adversaries are defined by the information and capabilities at their disposal. The following (and the Y-axis of Figure 2) describes a range of adversaries loosely organized by decreasing adversarial strength (and increasing attack difficulty). Note that we only consider attacks conducted at test time; any tampering with the training procedure is outside the scope of this paper.

Training data and network architecture - This adversary has perfect knowledge of the neural network used for classification. The attacker has access to the training data T, functions and algorithms used for network training, and is able to extract knowledge about the DNN's architecture F. This includes the number and type of layers, the activation functions of neurons, as well as weight and bias matrices. He also knows which algorithm was used to train the network, including the associated loss function c. This is the strongest adversary that can analyze the training data and simulate the deep neural network in toto.

Network architecture - This adversary has knowledge of the network architecture F and its parameter values. For instance, this corresponds to an adversary who can collect information about both (1) the layers and activation functions used to design the neural network, and (2) the weights and biases resulting from the training phase. This gives the adversary enough information to simulate the network. Our algorithms assume this threat model, and show a new class of algorithms that generate adversarial samples for supervised and unsupervised feedforward DNNs.

Training data - This adversary is able to collect a surrogate dataset, sampled from the same distribution as the original dataset used to train the DNN. However, the attacker is not aware of the architecture used to design the neural network. Thus, typical attacks conducted in this model would likely include training commonly deployed deep learning architectures using the surrogate dataset to approximate the model learned by the legitimate classifier.

Oracle - This adversary has the ability to use the neural network (or a proxy of it) as an "oracle". Here the adversary can obtain output classifications from supplied inputs (much like a chosen-plaintext attack in cryptography). This enables differential attacks, where the adversary can observe the relationship between changes in inputs and outputs (continuing with the analogy, such as used in differential cryptanalysis) to adaptively craft adversarial samples. This adversary can be further parameterized by the number of absolute or rate-limited input/output trials they may perform.

Samples - This adversary has the ability to collect pairs of inputs and outputs related to the neural network classifier. However, he cannot modify these inputs to observe the difference in the output. To continue the cryptanalysis analogy, this threat model would correspond to a known-plaintext attack. These pairs are largely labeled output data, and intuition states that they would most likely only be useful in very large quantities.

III. APPROACH

In this section, we present a general algorithm for modifying samples so that a DNN yields any adversarial output. We later validate this algorithm by having a classifier misclassify samples into a chosen target class. This algorithm captures adversaries crafting samples in the setting corresponding to the upper right-hand corner of Figure 2. We show that knowledge of the architecture and weight parameters² is sufficient to derive adversarial samples against acyclic feedforward DNNs. This requires evaluating the DNN's forward derivative in order to construct an adversarial saliency map that identifies the set of input features relevant to the adversary's goal. Perturbing the features identified in this way quickly leads to the desired adversarial output, for instance misclassification. Although we describe our approach with supervised neural networks used as classifiers, it also applies to unsupervised architectures.

² This means that the algorithm does not require knowledge of the dataset used to train the DNN. Instead, it exploits knowledge of trained parameters.

A. Studying a Simple Neural Network

Recall the simple architecture introduced previously in Section II and illustrated in Figure 3. Its low dimensionality allows us to better understand the underlying concepts behind our algorithms. We indeed show how small input perturbations found using the forward derivative can induce large variations of the neural network output. Assuming that input biases b1, b2, and b3 are null, we train this toy network to learn the AND function: the desired output is F(X) = x1 ∧ x2 with X = (x1, x2). Note that non-integer inputs are rounded to the closest integer, thus we have for instance 0.7 ∧ 0.3 = 0 or 0.8 ∧ 0.6 = 1. Using backpropagation on a set of 1,000 samples, corresponding to each case of the function (1∧1 = 1, 1∧0 = 0, 0∧1 = 0, and 0∧0 = 0), we train for 100 epochs using a learning rate η = 0.0663. The overall function learned by the neural network is plotted in Figure 4 for input values X ∈ [0, 1]². The horizontal axes represent the two input dimensions x1 and x2 while the vertical axis represents the network output F(X) corresponding to X = (x1, x2).

Fig. 4: The output surface of our simplified Multi-Layer Perceptron for the input domain [0, 1]². Blue corresponds to a 0 output while yellow corresponds to a 1 output.
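To make the toy setup above concrete, the following minimal numpy sketch builds the two-hidden-unit sigmoid network of Figure 3 (with null biases) and trains it on the AND task with the learning rate and epoch count quoted in the text. The random initialization, the squared-error loss, and the per-sample gradient updates are assumptions of this sketch; they are not specified in the paper, so the learned surface will only approximate Figure 4.

# Minimal sketch (not the authors' code) of the toy 2-2-1 sigmoid network of Section III-A,
# trained by per-sample gradient descent on the AND function.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 1,000 training samples covering the four Boolean cases of x1 AND x2.
X = rng.integers(0, 2, size=(1000, 2)).astype(float)
y = np.logical_and(X[:, 0], X[:, 1]).astype(float)

W1 = rng.normal(size=(2, 2))     # hidden-layer weights (biases assumed null, as in the text)
W2 = rng.normal(size=(2,))       # output weights

eta = 0.0663                     # learning rate from the text
for epoch in range(100):         # 100 epochs, as in the text
    for x, target in zip(X, y):
        h = sigmoid(x @ W1)                          # hidden activations h1, h2
        out = sigmoid(h @ W2)                        # network output F(X)
        delta_o = (out - target) * out * (1 - out)   # squared-error loss assumed
        delta_h = delta_o * W2 * h * (1 - h)
        W2 -= eta * delta_o * h
        W1 -= eta * np.outer(x, delta_h)

# After training, the rounded output should roughly reproduce AND on the four corners.
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((a, b), float(sigmoid(sigmoid(np.array([a, b], dtype=float) @ W1) @ W2)))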
We are now going to demonstrate how to craft adversarial samples on this neural network. The adversary considers a legitimate sample X, classified as F(X) = Y by the network, and wants to craft an adversarial sample X∗ very similar to X, but misclassified as F(X∗) = Y∗ ≠ Y. Recall that we formalized this problem as:

    arg min_{δX} ‖δX‖  s.t.  F(X + δX) = Y∗

where X∗ = X + δX is the adversarial sample, Y∗ is the desired adversarial output, and ‖·‖ is a norm appropriate to compare points in the input domain. Informally, the adversary is searching for small perturbations of the input that will incur a modification of the output into Y∗. Finding these perturbations can be done using optimization techniques, simple heuristics, or even brute force. However, such solutions are hard to implement for deep neural networks because of non-convexity and non-linearity [25]. Instead, we propose a systematic approach stemming from the forward derivative.

We define the forward derivative as the Jacobian matrix of the function F learned by the neural network during training. For this example, the output of F is one dimensional, so the matrix is reduced to a vector:

    ∇F(X) = [∂F(X)/∂x1, ∂F(X)/∂x2]        (2)

Both components of this vector are computable using the adversary's knowledge, and later we show how to compute this term efficiently. The forward derivative for our example network is illustrated in Figure 5, which plots the gradient for the second component ∂F(X)/∂x2 on the vertical axis against x1 and x2 on the horizontal axes. We omit the plot for ∂F(X)/∂x1 because F is approximately symmetric in its two inputs, making the first component redundant for our purposes. This plot makes it easy to visualize the divide between the network's two possible outputs in terms of values assigned to the input feature x2: 0 to the left of the spike, and 1 to its right. Notice that this aligns with Figure 4, and gives us the information needed to achieve our adversarial goal: find input perturbations that drive the output closer to a desired value.

Fig. 5: Forward derivative of our simplified multi-layer perceptron according to input neuron x2. Sample X is benign and X∗ is adversarial, crafted by adding δX = (0, δx2).

Consulting Figure 5 alongside our example network, we can confirm this intuition by looking at a few sample points. Consider X = (1, 0.37) and X∗ = (1, 0.43), which are both located near the spike in Figure 5. Although they only differ by a small amount (δx2 = 0.06), they cause a significant change in the network's output, as F(X) = 0.11 and F(X∗) = 0.95. Recalling that we round the inputs and outputs of this network so that it agrees with the Boolean AND function, we see that X∗ is an adversarial sample: after rounding, X∗ = (1, 0) and F(X∗) = 1. Just as importantly, the forward derivative tells us which input regions are unlikely to yield adversarial samples, and are thus more immune to adversarial manipulations. Notice in Figure 5 that when either input is close to 0, the forward derivative is small. This aligns with our intuition that it will be more difficult to find adversarial samples close to (1, 0) than (1, 0.4). This tells the adversary to focus on features corresponding to larger forward derivative values in a given input when constructing a sample, making his search more efficient and ultimately leading to smaller overall distortions.

The takeaways of this example are therefore: (1) small input variations can lead to extreme variations of the output of the neural network, (2) not all regions of the input domain are conducive to finding adversarial samples, and (3) the forward derivative reduces the adversarial-sample search space.
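The sketch below illustrates Equation (2) on the toy network: both components of ∇F(X) follow from one application of the chain rule, and the larger of the two tells the adversary which feature to perturb. The weight values are placeholders standing in for a trained network, so the printed numbers will not match the F(X) = 0.11 and F(X∗) = 0.95 reported above.

# Sketch of Equation (2): the forward derivative of the toy 2-2-1 sigmoid network,
# computed analytically with the chain rule. Weights are hypothetical placeholders.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

W1 = np.array([[0.9, 1.8],     # hypothetical hidden weights (2 inputs x 2 hidden units)
               [1.1, 1.6]])
W2 = np.array([9.0, -10.0])    # hypothetical output weights (biases null, as in Section III-A)

def F(x):
    return sigmoid(sigmoid(x @ W1) @ W2)

def forward_derivative(x):
    """Jacobian of F w.r.t. the input, reduced to a 2-vector for this 1-output network."""
    z_h = x @ W1                   # pre-activations of h1, h2
    z_o = sigmoid(z_h) @ W2
    # Chain rule: dF/dx_i = sigma'(z_o) * sum_p W2[p] * sigma'(z_h[p]) * W1[i, p]
    return d_sigmoid(z_o) * ((W1 * d_sigmoid(z_h)) @ W2)

X = np.array([1.0, 0.4])
grad = forward_derivative(X)
print("F(X) =", F(X), " dF/d(x1, x2) =", grad)

# Perturb the feature with the larger derivative (here x2) in the direction that
# increases F, mirroring the delta_x2 step shown in Figure 5.
X_adv = X + np.array([0.0, 0.05]) * np.sign(grad[1])
print("F(X*) =", F(X_adv))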
B. Generalizing to Feedforward Deep Neural Networks

We now generalize this approach to any feedforward DNN, using the same assumptions and adversary model as in Section III-A. The only assumptions we make on the architecture are that its neurons form an acyclic DNN, and that each uses a differentiable activation function. Note that this last assumption is not limiting because the back-propagation algorithm imposes the same requirement. In Figure 6, we give an example of a feedforward deep neural network architecture and define some notations used throughout the remainder of the paper. Most importantly, the N-dimensional function F learned by the DNN during training assigns an output Y = F(X) when given an M-dimensional input X. We write n for the number of hidden layers. Layers are indexed by k ∈ 0..n+1 such that k = 0 is the index of the input layer, k ∈ 1..n corresponds to hidden layers, and k = n+1 indexes the output layer.
Fig. 6: Example architecture of a feedforward deep neural network with notations used in the paper. [Figure: an input layer X of M neurons, n hidden layers of m1, ..., mn neurons each, and an output layer Y of N neurons.]
Notations:
F: function learned by the neural network during training
X: input of the neural network
Y: output of the neural network
M: input dimension (number of neurons on the input layer)
N: output dimension (number of neurons on the output layer)
n: number of hidden layers in the neural network
f: activation function of a neuron
Hk: output vector of layer k neurons
Indices:
k: index for layers (between 0 and n+1)
i: index for input X components (between 1 and M)
j: index for output Y components (between 1 and N)
p: index for neurons (between 0 and mk for any layer k)
Algorithm 1 shows our process for constructing adversarial samples. As input, the algorithm takes a benign sample X, a target output Y∗, an acyclic feedforward DNN F, a maximum distortion parameter Υ, and a feature variation parameter θ. It returns a new adversarial sample X∗ such that F(X∗) = Y∗, and proceeds in three basic steps: (1) compute the forward derivative ∇F(X∗), (2) construct a saliency map S based on the derivative, and (3) modify an input feature imax by θ. This process is repeated until the network outputs Y∗ or the maximum distortion Υ is reached. We now detail each step.

Algorithm 1 Crafting adversarial samples. X is the benign sample, Y∗ is the target network output, F is the function learned by the network during training, Υ is the maximum distortion, and θ is the change made to features. This algorithm is applied to a specific DNN in Algorithm 2.
Input: X, Y∗, F, Υ, θ
1: X∗ ← X
2: Γ = {1 . . . |X|}
3: while F(X∗) ≠ Y∗ and ‖δX‖ < Υ do
4:   Compute forward derivative ∇F(X∗)
5:   S = saliency_map(∇F(X∗), Γ, Y∗)
6:   Modify X∗_imax by θ s.t. imax = arg max_i S(X, Y∗)[i]
7:   δX ← X∗ − X
8: end while
9: return X∗

1) Forward Derivative of a Deep Neural Network: The first step is to compute the forward derivative for the given sample X. As introduced previously, this is given by:

    ∇F(X) = ∂F(X)/∂X = [∂Fj(X)/∂xi]_{i∈1..M, j∈1..N}        (3)

This is essentially the Jacobian of the function corresponding to what the neural network learned during training. The forward derivative computes gradients that are similar to those computed for backpropagation, but with two important distinctions: we take the derivative of the network directly, rather than of its cost function, and we differentiate with respect to the input features rather than the network parameters. As a consequence, instead of propagating gradients backwards, we choose in our approach to propagate them forward, as this allows us to find input components that lead to significant changes in network outputs.

Our goal is to express ∇F(X∗) in terms of X and constant values only. To simplify our expressions, we now consider one element (i, j) ∈ [1..M] × [1..N] of the M × N forward derivative matrix defined in Equation 3: that is, the derivative of one output neuron Fj according to one input dimension xi. Of course, our results hold for any matrix element. We start at the first hidden layer of the neural network. We can differentiate the output of this first hidden layer in terms of the input components. We then recursively differentiate each hidden layer k ∈ 2..n in terms of the previous one:

    ∂Hk(X)/∂xi = [∂fk,p(Wk,p · Hk−1 + bk,p)/∂xi]_{p∈1..mk}        (4)

where Hk is the output vector of hidden layer k and fk,j is the activation function of output neuron j in layer k. Each neuron p on a hidden or output layer indexed k ∈ 1..n+1 is connected to the previous layer k−1 using weights defined in vector Wk,p. By defining the weight matrix accordingly, we can define fully or sparsely connected interlayers, thus modeling a variety of architectures. Similarly, we write bk,p for the bias of neuron p of layer k. By applying the chain rule, we can write a series of formulae for k ∈ 2..n:

    [∂Hk(X)/∂xi]_{p∈1..mk} = (Wk,p · ∂Hk−1/∂xi) × ∂fk,p/∂xi (Wk,p · Hk−1 + bk,p)        (5)

We are thus able to express ∂Hn/∂xi. We know that output neuron j computes the following expression:

    Fj(X) = fn+1,j (Wn+1,j · Hn + bn+1,j)

Thus, we apply the chain rule again to obtain:

    ∂Fj(X)/∂xi = (Wn+1,j · ∂Hn/∂xi) × ∂fn+1,j/∂xi (Wn+1,j · Hn + bn+1,j)        (6)

In this formula, according to our threat model, all terms are known but one: ∂Hn/∂xi. This is precisely the term we computed recursively. By plugging these results for successive layers back into Equation 6, we get an expression of component (i, j) of the DNN's forward derivative. Hence, the forward derivative ∇F of a network F can be computed for any input X by successively differentiating layers starting from the input layer until the output layer is reached. We later discuss in our methodology evaluation the computability of ∇F for state-of-the-art DNN architectures. Notably, the forward derivative can be computed using symbolic differentiation.
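As a concrete illustration of Equations (4)-(6), the sketch below propagates derivatives forward through a small fully connected network with sigmoid activations and checks one Jacobian entry against a finite difference. The architecture, random parameters, and helper names are assumptions of this sketch, not the paper's; an attacker under the assumed threat model would substitute the trained weights and biases.

# Sketch of Equations (4)-(6): the forward derivative dF/dX of a fully connected,
# sigmoid-activated feedforward network, obtained by propagating derivatives from
# the input layer to the output layer.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward_derivative(x, weights, biases):
    """Return (output Y, Jacobian of shape M x N) for layers given as (W_k, b_k)."""
    H = x                                  # H_0 is the input layer
    dH = np.eye(len(x))                    # dH_0/dx is the M x M identity
    for W, b in zip(weights, biases):      # W has shape (previous_dim, layer_dim)
        H = sigmoid(H @ W + b)
        # Equation (5)/(6): d sigma(z_p)/dx_i = sigma'(z_p) * sum_q W[q, p] * dH_prev[i, q]
        dH = (dH @ W) * (H * (1.0 - H))    # sigma'(z) broadcast over the input dimension
    return H, dH                           # dH[i, j] = dF_j / dx_i, as in Equation (3)

# Toy 3-4-2 network with random parameters standing in for a trained DNN.
rng = np.random.default_rng(1)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
biases = [rng.normal(size=4), rng.normal(size=2)]

x = rng.random(3)
Y, J = forward_derivative(x, weights, biases)

# Finite-difference check of the row of the Jacobian associated with x_0.
eps = 1e-6
x_eps = x.copy()
x_eps[0] += eps
Y_eps, _ = forward_derivative(x_eps, weights, biases)
print(J[0], (Y_eps - Y) / eps)             # the two rows should agree closely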
2) Adversarial Saliency Maps: We extend saliency maps previously introduced as visualization tools [34] to construct adversarial saliency maps. These maps indicate which input features an adversary should perturb in order to effect the desired changes in network output most efficiently, and are thus versatile tools that allow adversaries to generate broad classes of adversarial samples.

Adversarial saliency maps are defined to suit problem-specific adversarial goals. For instance, we later study a network used as a classifier: its output is a probability vector across classes, where the final predicted class value corresponds to the component with the highest probability:

    label(X) = arg max_j Fj(X)        (7)

In our case, the saliency map is therefore based on the forward derivative, as this gives the adversary the information needed to cause the neural network to misclassify a given sample. More precisely, the adversary wants to misclassify a sample X such that it is assigned a target class t ≠ label(X). To do so, the probability of target class t given by F, Ft(X), must be increased while the probabilities Fj(X) of all other classes j ≠ t decrease, until t = arg max_j Fj(X). The adversary can accomplish this by increasing input features using the following saliency map S(X, t):

    S(X, t)[i] = 0                                      if ∂Ft(X)/∂Xi < 0 or Σ_{j≠t} ∂Fj(X)/∂Xi > 0
    S(X, t)[i] = (∂Ft(X)/∂Xi) · |Σ_{j≠t} ∂Fj(X)/∂Xi|    otherwise        (8)

where i is an input feature. The condition specified on the first line rejects input components with a negative target derivative or an overall positive derivative on other classes. Indeed, ∂Ft(X)/∂Xi should be positive in order for Ft(X) to increase when feature Xi increases. Similarly, Σ_{j≠t} ∂Fj(X)/∂Xi needs to be negative for the other classes to decrease or stay constant when feature Xi is increased. The product on the second line allows us to consider all other forward derivative components together in such a way that we can easily compare S(X, t)[i] for all input features. In summary, high values of S(X, t)[i] correspond to input features that will either increase the target class, or decrease other classes significantly, or both. By increasing these features, the adversary eventually misclassifies the sample into the target class. A saliency map example is shown in Figure 7.

Fig. 7: Saliency map of a 784-dimensional input to the LeNet architecture (cf. validation section). The 784 input dimensions are arranged to correspond to the 28x28 image pixel alignment. Large absolute values correspond to features with a significant impact on the output when perturbed.

It is possible to define other adversarial saliency maps using the forward derivative, and the quality of the map can have a large impact on the amount of distortion that Algorithm 1 introduces; we will study this in more detail later. Before moving on, we introduce an additional map that acts as a counterpart to the one given in Equation 8 by finding features that the adversary should decrease to achieve misclassification. The only difference lies in the constraints placed on the forward derivative values and the location of the absolute value in the second line:

    S(X, t)[i] = 0                                      if ∂Ft(X)/∂Xi > 0 or Σ_{j≠t} ∂Fj(X)/∂Xi < 0
    S(X, t)[i] = |∂Ft(X)/∂Xi| · (Σ_{j≠t} ∂Fj(X)/∂Xi)    otherwise        (9)
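A minimal numpy sketch of the increasing-features map of Equation (8) is given below; it assumes the forward derivative has already been evaluated and arranged as a features-by-classes matrix, and the variable names are illustrative rather than the paper's. The decreasing-features map of Equation (9) is obtained by flipping the two sign tests and moving the absolute value onto the target term.

# Sketch of the increasing-features adversarial saliency map of Equation (8).
# `jacobian` holds dF_j/dX_i (shape: num_features x num_classes); `t` is the target class.
import numpy as np

def saliency_map_increase(jacobian, t):
    target_grad = jacobian[:, t]                        # dF_t/dX_i for every feature i
    others_grad = jacobian.sum(axis=1) - target_grad    # sum over j != t of dF_j/dX_i
    S = target_grad * np.abs(others_grad)               # second line of Equation (8)
    S[(target_grad < 0) | (others_grad > 0)] = 0.0      # first line: reject these features
    return S

# Example with a random matrix standing in for a real forward derivative.
rng = np.random.default_rng(2)
J = rng.normal(size=(784, 10))                          # 784 pixels, 10 classes
S = saliency_map_increase(J, t=7)
print("most salient pixel for target class 7:", int(S.argmax()))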
3) Modifying samples: Once an input feature has been identified by an adversarial saliency map, it needs to be perturbed to realize the adversary's goal. This is the last step in each iteration of Algorithm 1, and the amount by which the selected feature is perturbed (θ in Algorithm 1) is also problem-specific. We discuss in Section IV how this parameter should be set in an application to computer vision. Lastly, the maximum number of iterations, which is equivalent to the maximum distortion allowed in a sample, is specified by parameter Υ. It limits the number of features changed to craft an adversarial sample and can take any positive integer value smaller than the number of features. Finding the right value for Υ requires considering the impact of distortion on humans' perception of adversarial samples: too much distortion might cause adversarial samples to be easily identified by humans.

IV. APPLICATION OF THE APPROACH

We formally described a class of algorithms for crafting adversarial samples misclassified by feedforward DNNs using three tools: the forward derivative, adversarial saliency maps, and the crafting algorithm. We now apply these tools to a DNN used for a computer vision classification task: handwritten digit recognition. We show that our algorithms successfully craft adversarial samples from any source class to any given target class, which for this application means that any digit can be perturbed so that it is misclassified as any other digit.

We investigate a DNN based on the well-studied LeNet architecture, which has proven to be an excellent classifier for handwritten digits [26]. Recent architectures like AlexNet [24] or GoogLeNet [35] are heavily reliant on the convolutional layers introduced in the LeNet architecture, thus making LeNet a relevant DNN to validate our approach. We have no reason to believe that our method will not perform well on larger architectures. The network input is black and white images (28x28 pixels) of handwritten digits, which are flattened as
vectors of 784 features, where each feature corresponds to a pixel intensity taking normalized values between 0 and 1. This input is processed by a succession of a convolutional layer (20 then 50 kernels of 5x5 pixels) and a pooling layer (2x2 filters) repeated twice, a fully connected hidden layer (500 neurons), and an output softmax layer (10 neurons). The output is a 10-class probability vector, where each class corresponds to a digit from 0 to 9, as shown in Figure 8. The network then labels the input image with the class assigned the maximum probability, as shown in Equation 7. We train our network using the MNIST training dataset of 60,000 samples [27].

Fig. 8: Samples taken from the MNIST test set. The respective output vectors are [0, 0, 0, 0, 0, 0, 0.99, 0, 0], [0, 0, 0.99, 0, 0, 0, 0, 0, 0], and [0, 0.99, 0, 0, 0, 0, 0, 0, 0], where all values smaller than 10⁻⁶ have been rounded to 0.
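For illustration, a LeNet-like model matching the description above can be written in a few lines of Keras; this is a sketch under stated assumptions (ReLU activations in the convolutional and hidden layers, the Adam optimizer, and a short training run are not specified in this excerpt) and is not the authors' implementation, which is referenced in their Appendix A.

# A Keras approximation of the DNN described above (illustrative only).
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),             # 28x28 grayscale digits
    layers.Conv2D(20, (5, 5), activation="relu"),
    layers.MaxPooling2D((2, 2)),                 # first convolution + pooling block
    layers.Conv2D(50, (5, 5), activation="relu"),
    layers.MaxPooling2D((2, 2)),                 # second convolution + pooling block
    layers.Flatten(),
    layers.Dense(500, activation="relu"),        # fully connected hidden layer
    layers.Dense(10, activation="softmax"),      # 10-class probability vector
])

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0     # normalize pixels to [0, 1]
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=128)       # training setup is illustrative only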
We attempt to determine whether, using the theoretical framework introduced in previous sections, we can effectively craft adversarial samples misclassified by the DNN. For instance, if we have an image X of a handwritten digit 0 classified by the network as label(X) = 0 and the adversary wishes to craft an adversarial sample X∗ based on this image classified as label(X∗) = 7, the source class is 0 and the target class is 7. Ideally, the crafting process must find the smallest perturbation δX required to construct the adversarial sample X∗ = X + δX. A perturbation is a set of pixel intensities, or input feature variations, that are added to X in order to craft X∗. Note that perturbations introduced to craft adversarial samples must remain indistinguishable to humans.

A. Crafting algorithm

Algorithm 2 shows the crafting algorithm used in our experiments, which we implemented in Python (see Appendix A for more information regarding the implementation). It is based on Algorithm 1, but several details have been changed to accommodate our handwritten digit recognition problem. Given a network F, Algorithm 2 iteratively modifies a sample X by perturbing two input features (i.e., pixel intensities) p1 and p2 selected by saliency_map. The saliency map is constructed and updated between each iteration of the algorithm using the DNN's forward derivative ∇F(X∗). The algorithm halts when one of the following conditions is met: (1) the adversarial sample is classified by the DNN with the target class t, (2) the maximum number of iterations max_iter has been reached, or (3) the feature search domain Γ is empty. The crafting algorithm is fine-tuned by three parameters:

• Maximum distortion Υ: this defines when the algorithm should stop modifying the sample in order to reach the adversarial target class. The maximum distortion, expressed as a percentage, corresponds to the maximum number of pixels to be modified when crafting the adversarial sample, and thus sets the maximum number of iterations max_iter (2 pixels modified per iteration) as follows:
    max_iter = ⌊(784 · Υ) / (2 · 100)⌋
  where 784 = 28×28 is the number of pixels in a sample.
• Saliency map: subroutine saliency_map generates a map defining which input features will be modified at each iteration. Policies used to generate saliency maps vary with the nature of the data handled by the considered DNN, as well as the adversarial goals. We provide a subroutine example later in Algorithm 3.
• Feature variation per iteration θ: once input features have been selected using the saliency map, they must be modified. The variation θ introduced to these features is another parameter that the adversary must set, in accordance with the saliency maps she uses.

The problem of finding good values for these parameters is a goal of our current evaluation, and is discussed later in Section V. For now, note that human perception is a limiting factor, as it limits the acceptable maximum distortion and feature variation introduced. We now show the application of our framework with two different adversarial strategies.

Algorithm 2 Crafting adversarial samples for LeNet-5. X is the benign image, Y∗ is the target network output, F is the function learned by the network during training, Υ is the maximum distortion, and θ is the change made to pixels.
Input: X, Y∗, F, Υ, θ
1: X∗ ← X
2: Γ = {1 . . . |X|}                ▷ search domain is all pixels
3: max_iter = ⌊784·Υ / (2·100)⌋
4: s = arg max_j F(X∗)_j            ▷ source class
5: t = arg max_j Y∗_j               ▷ target class
6: while s ≠ t and iter < max_iter and Γ ≠ ∅ do
7:   Compute forward derivative ∇F(X∗)
8:   p1, p2 = saliency_map(∇F(X∗), Γ, Y∗)
9:   Modify p1 and p2 in X∗ by θ
10:  Remove p1 from Γ if p1 == 0 or p1 == 1
11:  Remove p2 from Γ if p2 == 0 or p2 == 1
12:  s = arg max_j F(X∗)_j
13:  iter++
14: end while
15: return X∗
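The loop of Algorithm 2 can be sketched in Python as follows. The helpers predict_proba, forward_derivative, and saliency_map are hypothetical wrappers around the trained DNN and the pair-selection subroutine of Algorithm 3; clipping modified pixels to the valid [0, 1] range is an implementation assumption of this sketch.

# Python sketch of the Algorithm 2 crafting loop for a flattened 784-pixel input.
import numpy as np

def craft(x, target_class, predict_proba, forward_derivative, saliency_map,
          upsilon=14.5, theta=1.0):
    x_adv = x.copy()                                   # X* <- X
    gamma = set(range(x_adv.size))                     # search domain: all pixels
    max_iter = int(784 * upsilon / (2 * 100))          # 2 pixels modified per iteration
    source = int(np.argmax(predict_proba(x_adv)))
    for _ in range(max_iter):
        if source == target_class or not gamma:
            break
        J = forward_derivative(x_adv)                  # dF_j/dX_i, shape (784, 10)
        p1, p2 = saliency_map(J, gamma, target_class)
        for p in (p1, p2):
            x_adv[p] = np.clip(x_adv[p] + theta, 0.0, 1.0)
            if x_adv[p] in (0.0, 1.0):                 # saturated pixels leave the search space
                gamma.discard(p)
        source = int(np.argmax(predict_proba(x_adv)))
    return x_adv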
B. Crafting by increasing pixel intensities

The first strategy to craft adversarial samples is based on increasing the intensity of some pixels. To achieve this purpose, we consider 10 samples of handwritten digits from the MNIST test set, one from each digit class 0 to 9. We use this small subset of samples to illustrate our techniques. We scale up the evaluation to the entire dataset in Section V. Our goal is to report whether we can reach any adversarial target class for a given source class. For instance, if we are given a handwritten 0, we increase some of the pixel intensities to produce 9 adversarial samples respectively classified in each of the classes 1 to 9. All pixel intensities changed are increased by θ = +1. We discuss this choice of parameter in Section V. We allow for an unlimited maximum distortion Υ = ∞. We simply measure for each of the 90 source-target class pairs whether an adversarial sample can be produced or not.

The adversarial saliency map used in the crafting algorithm to select pixel pairs that can be increased is an application of the map introduced in the general case of classification in Equation 8. The map aims to find pairs of pixels (p1, p2) using the following heuristic:

    arg max_{(p1,p2)} ( Σ_{i=p1,p2} ∂Ft(X)/∂Xi ) × | Σ_{i=p1,p2} Σ_{j≠t} ∂Fj(X)/∂Xi |        (10)

where t is the index of the target class, the left operand of the multiplication operation is constrained to be positive, and the right operand of the multiplication operation (the sum inside the absolute value) is constrained to be negative. This heuristic, introduced in the previous section of this manuscript, searches for pairs of pixels producing an increase in the target class output while reducing the sum of the outputs of all other classes when simultaneously increased. The pseudocode of the corresponding subroutine saliency_map is given in Algorithm 3.

The saliency map considers pairs of pixels and not individual pixels because selecting pixels one at a time is too strict, and very few pixels would meet the heuristic search criteria described in Equation 8. Searching for pairs of pixels is more likely to match the condition because one of the pixels can compensate for a minor flaw of the other pixel. Consider a simple example: p1 has a target derivative of 5 but a sum of other-class derivatives equal to 0.1, while p2 has a target derivative equal to −0.5 and a sum of other-class derivatives equal to −6. Individually, these pixels do not match the saliency map criteria stated in Equation 8, but combined, the pair does match the saliency criteria defined in Equation 10. One could also envision considering larger groups of input features to define saliency maps. However, this comes at a greater computational cost because more combinations need to be considered each time the group size is increased.

In our implementation of these algorithms, we compute the forward derivative of the network using the last hidden layer instead of the output probability layer. This is justified by the extreme variations introduced by the logistic regression computed between these two layers to ensure probabilities sum up to 1, leading to extreme derivative values. This reduces the quality of information on how the neurons are activated by different inputs and causes the forward derivative to lose accuracy when generating saliency maps. Better results are achieved when working with the last hidden layer, which is also made up of 10 neurons, each corresponding to one digit class 0 to 9. This justifies enforcing constraints on the forward derivative. Indeed, as the output of the layer used for computing the forward derivative does not sum up to 1, increasing Ft(X) does not imply that Σ_{j≠t} Fj(X) will decrease, and vice versa.

Algorithm 3 Increasing pixel intensities saliency map. ∇F(X) is the forward derivative, Γ the features still in the search space, and t the target class.
Input: ∇F(X), Γ, t
1: for each pair (p, q) ∈ Γ do
2:   α = Σ_{i=p,q} ∂Ft(X)/∂Xi
3:   β = Σ_{i=p,q} Σ_{j≠t} ∂Fj(X)/∂Xi
4:   if α > 0 and β < 0 and −α × β > max then
5:     p1, p2 ← p, q
6:     max ← −α × β
7:   end if
8: end for
9: return p1, p2
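A direct Python rendering of Algorithm 3 is sketched below; it scores every candidate pair in the remaining search space Γ, which is quadratic in |Γ| and is where most of the crafting time is spent. Variable names are illustrative, and returning None when no pair satisfies the constraints is a small deviation from the pseudocode. Flipping the two sign tests (α < 0 and β > 0) yields the decreasing-intensity variant used in the next subsection.

# Sketch of the pairwise saliency search of Algorithm 3.
import numpy as np
from itertools import combinations

def saliency_map_pairs(jacobian, gamma, t):
    target_grad = jacobian[:, t]
    others_grad = jacobian.sum(axis=1) - target_grad
    best, best_score = None, 0.0
    for p, q in combinations(sorted(gamma), 2):
        alpha = target_grad[p] + target_grad[q]        # line 2 of Algorithm 3
        beta = others_grad[p] + others_grad[q]         # line 3 of Algorithm 3
        if alpha > 0 and beta < 0 and -alpha * beta > best_score:
            best, best_score = (p, q), -alpha * beta
    return best                                        # None if no pair meets the constraints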
The algorithm is able to craft successful adversarial samples for all 90 source-target class pairs. Figure 1 shows the 90 adversarial samples obtained as well as the 10 original samples used to craft them. The original samples are found on the diagonal. A sample on row i and column j, when i ≠ j, is a sample crafted from an image originally classified as source class i to be misclassified as target class j.

To verify the validity of our algorithms, and more specifically of our adversarial saliency maps, we run a simple experiment. We run the crafting algorithm on an empty input (all pixels initially set to an intensity of 0) and craft one adversarial sample for each class from 0 to 9. The different samples shown in Figure 9 demonstrate how adversarial saliency maps are able to identify input features relevant to classification in a class.

Fig. 9: Adversarial samples generated by feeding the crafting algorithm an empty input. Each sample produced corresponds to one target class from 0 to 9. Interestingly, for classes 0, 2, 3 and 5 one can clearly recognize the target digit.

C. Crafting by decreasing pixel intensities

Instead of increasing pixel intensities to achieve the adversarial targets, the second adversarial strategy decreases pixel intensities by θ = −1. The implementation is identical with the exception of the adversarial saliency map. The formula is the
same as previously written in Equation 10, but the constraints are different: the left operand of the multiplication operation is now constrained to be negative, and the right operand to be positive. This heuristic, also introduced in the previous section of this paper, searches for pairs of pixels producing an increase in the target class output while reducing the sum of the outputs of all other classes when simultaneously decreased.

The algorithm is once again able to craft successful adversarial samples for all source-target class pairs. Figure 10 shows the 90 adversarial samples obtained as well as the 10 original samples used to craft them. One observation to be made is that the distortion introduced by reducing pixel intensities seems harder to detect by the human eye. We address the human perception aspect with a study later in Section V.

Fig. 10: Adversarial samples obtained by decreasing pixel intensities. Original samples from the MNIST dataset are found on the diagonal, whereas adversarial samples are all non-diagonal elements. Samples are organized by columns, each corresponding to a class from 0 to 9. [Figure: a 10x10 grid of digit images with input class 0-9 on the rows and output classification 0-9 on the columns.]

V. EVALUATION

We now use our experimental setup to answer the following questions: (1) "Can we exploit any sample?", (2) "How can we identify samples more vulnerable than others?" and (3) "How do humans perceive adversarial samples compared to DNNs?". Our primary result is that adversarial samples can be crafted reliably for our validation problem with a 97.10% success rate by modifying samples on average by 4.02%. We define a hardness measure to identify sample classes easier to exploit than others. This measure is necessary for designing robust defenses. We also found that humans cannot perceive the perturbation introduced to craft adversarial samples misclassified by the DNN: they still correctly classify adversarial samples crafted with a distortion smaller than 14.29%.

A. Crafting large amounts of adversarial samples

Having previously shown the feasibility of crafting adversarial samples for all source-target class pairs, we now seek to measure whether the crafting algorithm can successfully handle large quantities of distinct samples of hand-written digits. That is, we design a set of experiments to evaluate whether or not all legitimate samples in the MNIST dataset can be exploited by an adversary to produce adversarial samples. We run our crafting algorithm on three sets of 10,000 samples, each extracted from one of the MNIST training, validation, and test subsets³. For each of these samples, we craft 9 adversarial samples, each of them classified in one of the 9 target classes distinct from the original legitimate class. Thus, we generate 90,000 samples for each set, leading to a total of 270,000 adversarial samples. We set the maximum distortion to Υ = 14.5% and pixel intensities are increased by θ = +1. The maximum distortion was fixed after studying the effect of increasing it on the success rate τ. We found that 97.1% of the adversarial samples could be crafted with a distortion of less than 14.5%, and observed that the success rate did not increase significantly for larger maximum distortions. Parameter θ was set to +1 after observing that decreasing it or giving it negative values increased the number of features modified, whereas we were interested in reducing the number of features altered during crafting. One will also notice that because features are normalized between 0 and 1, if we introduce a variation of θ = +1, we always set pixels to their maximum value 1. This justifies why, in Algorithm 2, we remove modified pixels from the search space at the end of each iteration. The impact on performance is beneficial, as we reduce the size of the feature search space at each iteration. In other words, our algorithm performs a best-first heuristic search without backtracking.

We measure the success rate τ and distortion of adversarial samples on the three sets of 10,000 samples. The success rate τ is defined as the percentage of adversarial samples that were successfully classified by the DNN as the adversarial target class. The distortion is defined to be the percentage of pixels modified in the legitimate sample to obtain the adversarial sample. In other words, it is the percentage of input features modified in order to obtain adversarial samples. We compute two average distortion values: one taking into account all samples, and a second one only taking into account successful samples, which we write ε. Figure 11 presents the results for the three sets from which the original samples were extracted. The results are consistent across all sets. On average, the success rate is τ = 97.10%, the average distortion of all adversarial samples is 4.44%, and the average distortion of successful adversarial samples is ε = 4.02%. This means that the average number of pixels modified to craft a successful adversarial sample is 32 out of 784 pixels. The first distortion figure is higher because it includes unsuccessful samples, for which the crafting algorithm used the maximum distortion Υ but was unable to induce a misclassification.

³ Note that we extracted original samples from the dataset for convenience. Any sample can be used as an input to the adversarial crafting algorithm.
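The success rate τ and the two average distortions defined above reduce to a few lines of numpy once per-sample outcomes are recorded; the arrays below are placeholders for such records, not the paper's experimental data.

# Sketch of the evaluation metrics: success rate and average distortions.
import numpy as np

rng = np.random.default_rng(3)
success = rng.random(90_000) < 0.971                       # placeholder per-sample outcomes
pixels_changed = rng.integers(10, 114, size=90_000)        # placeholder per-sample distortions

tau = 100.0 * success.mean()                               # success rate, in percent
distortion_all = 100.0 * (pixels_changed / 784).mean()     # average distortion, all samples
epsilon = 100.0 * (pixels_changed[success] / 784).mean()   # average distortion, successful only

print(f"tau = {tau:.2f}%, all = {distortion_all:.2f}%, epsilon = {epsilon:.2f}%")
# For reference, epsilon = 4.02% corresponds to roughly 0.0402 * 784, i.e. about 32 pixels.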
Source set Adversarial Average distortion
of 10, 000 samples All Successful
original successfully adversarial adversarial
samples misclassified samples samples
Training 97.05% 4.45% 4.03%
Validation 97.19% 4.41% 4.01%
Test 97.05% 4.45% 4.03%
Fig. 11: Results on larger sets of 10, 000 samples

We also studied crafting of 9, 000 adversarial samples using


the decreasing saliency map. We found that the success rate
τ = 64.7% was lower and the average distortion ε = 3.62%
slightly lower. Again, decreasing pixel intensities is less suc-
cessful at producing the desired adversarial behavior than
increasing pixel intensities. Intuitively, this can be understood
because removing pixels reduces the information entropy, thus Fig. 12: Success rate per source-target class pair.
making it harder for DNNs to extract the information required
to classify the sample. Greater absolute values of intensity
variations are more confidently misclassified by the DNN.

B. Quantifying hardness and building defense mechanisms


Looking at the previous experiment, about 2.9% of the
270, 000 adversarial samples were not successfully crafted.
This suggests that some samples are harder to exploit than
others. Furthermore, the distortion figures reported are aver-
aged on all adversarial samples produced but not all samples
require the same distortion to be misclassified. Thus, we now
study the hardness of different samples in order to quantify
these phenomena. Our aim is to identify which source-target
class pairs are easiest to exploit, as well as similarities between
distinct source-target class pairs. A class pair is a pair of a
source class s and a target class t. This hardness metric allows
us to lay ground for defense mechanisms.
1) Class pair study: In this experiment, we construct a deeper understanding of the crafting algorithm's success rate and average distortion for different source-target class pairs. We use the 90,000 adversarial samples crafted in the previous experiments from the 10,000 samples of the MNIST test set. We break down the success rate τ reported in Figure 11 by source-target class pairs. This allows us to know, for a given source class, how many samples of that class were successfully misclassified into each of the target classes. In Figure 12, we draw the success rate matrix indicating which pairs are most successful. Darker shades correspond to higher success rates. The rows correspond to the success rate per source class while the columns correspond to the success rate per target class. Reading the matrix row-wise, one can see that classes 0, 2, and 8 are hard to start with, while classes 1, 7, and 9 are easy to start with. Similarly, reading the matrix column-wise, one can observe that classes 1 and 7 are very hard to make, while classes 0, 8, and 9 are easy to make.

In Figure 13, we report the average distortion ε of successful samples by source-target class pair, thus identifying class pairs requiring the most distortion to successfully craft adversarial samples. Interestingly, classes requiring lower distortions correspond to classes with higher success rates in the previous matrix. For instance, the column corresponding to class 1 is associated with the highest distortions, and it was the column with the lowest success rates in the previous matrix. Indeed, the higher the average distortion of a class pair, the more likely samples in that class pair are to reach the maximum distortion and thus produce unsuccessful adversarial samples.

Fig. 13: Average distortion ε of successful samples per source-target class pair. The scale is a percentage of pixels.

To better understand why some class pairs were harder to exploit, we tracked the evolution of class probabilities during the crafting process. We observed that the distortion required to leave the source class was higher for class pairs with high distortions, whereas the distortion required to reach the target class, once the source class had been left, remained similar. This correlates with the fact that some source classes are more confidently classified by the DNN than others.
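To make the construction of these matrices concrete, the following sketch (NumPy; not the authors' code, and the record format is an assumption) aggregates raw crafting results into the per-class-pair success rates of Figure 12 and the average distortions of successful samples shown in Figure 13.

```python
import numpy as np

NUM_CLASSES = 10

def class_pair_matrices(records):
    """Build the 10x10 success-rate and average-distortion matrices.

    records -- iterable of (source, target, success, distortion) tuples,
               one per crafted adversarial sample
    """
    attempts = np.zeros((NUM_CLASSES, NUM_CLASSES))
    successes = np.zeros((NUM_CLASSES, NUM_CLASSES))
    distortion_sum = np.zeros((NUM_CLASSES, NUM_CLASSES))
    for source, target, success, distortion in records:
        if source == target:
            continue  # only misclassification into a different class is of interest
        attempts[source, target] += 1
        if success:
            successes[source, target] += 1
            distortion_sum[source, target] += distortion
    with np.errstate(invalid="ignore", divide="ignore"):
        success_rate = successes / attempts          # rows: source class, columns: target class
        avg_distortion = distortion_sum / successes  # averaged over successful samples only
    return success_rate, avg_distortion
```

Each source class contributes samples crafted toward the nine other target classes, which is why the diagonal is excluded from the aggregation.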
Fig. 14: Hardness matrix of source-target class pairs. Darker shades correspond to harder to achieve misclassifications.

Fig. 15: Adversarial distance averaged per source-destination class pairs, computed with 1,000 samples.

2) Hardness measure: Results indicating that some source-target class pairs are not as easy to exploit as others lead us to question the existence of a measure quantifying the distance between two classes. This is relevant to a defender seeking to identify which classes of a DNN are most vulnerable to adversaries. We name this measure the hardness of a target class relative to a given source class. It normalizes the average distortion of a class pair (s, t) relative to its success rate:

    H(s, t) = \int_{\tau} \varepsilon(s, t, \tau) \, d\tau    (11)

where ε(s, t, τ) is the average distortion of a set of samples for the corresponding success rate τ. In practice, these two quantities are computed over a finite number of samples by fixing a set of K maximum distortion parameter values Υ_k in the crafting algorithm, where k ∈ 1..K. The set of maximum distortions gives a series of pairs (ε_k, τ_k) for k ∈ 1..K. Thus, the practical formula used to compute the hardness of a source-destination class pair can be derived from the trapezoidal rule:

    H(s, t) \approx \sum_{k=1}^{K-1} \frac{\varepsilon(s, t, \tau_{k+1}) + \varepsilon(s, t, \tau_k)}{2} (\tau_{k+1} - \tau_k)    (12)

We computed the hardness values for all classes using a set of K = 9 maximum distortion values Υ ∈ {0.3, 1.3, 2.6, 5.1, 7.7, 10.2, 12.8, 25.5, 38.3}% in the algorithm. Average distortions ε and success rates τ are averaged over 9,000 adversarial samples for each maximum distortion value Υ. Figure 14 shows the hardness values H(s, t) for all pairs (s, t) ∈ {0..9}². The reader will observe that the matrix has a shape similar to the average distortion matrix plotted in Figure 13. However, the hardness measure is more accurate because it is computed using a series of maximum distortions.

3) Adversarial distance: The measure introduced above lays the groundwork for finding defenses against adversarial samples. Indeed, if the hardness measure were predictive instead of being computed after adversarial crafting, the defender could identify vulnerable inputs. Furthermore, a predictive measure applicable to a single sample would allow a defender to evaluate the vulnerability of specific samples as well as class pairs. We investigated several complex estimators, including convolutional transformations of the forward derivative or Hessian matrices. However, we found that simply using a formula derived from the intuition behind adversarial saliency maps gave enough accuracy for predicting the hardness of samples in our experimental setup.

We name this predictive measure the adversarial distance of sample X to class t and write it A(X, t). Simply put, it estimates the distance between a sample X and a target class t. We define the distance as:

    A(X, t) = 1 - \frac{1}{M} \sum_{i \in 0..M} \mathbf{1}_{S(X,t)[i] > 0}    (13)

where 1_E is the indicator function for event E (i.e., it is 1 if and only if E is true). In a nutshell, A(X, t) is the normalized number of non-zero elements in the adversarial saliency map of X computed during the first crafting iteration in Algorithm 2. The closer the adversarial distance is to 1, the more likely sample X is to be hard to misclassify into target class t. Figure 15 confirms that this formula is empirically well-founded. It illustrates the value of the adversarial distance averaged per source-destination class pair, making it easy to compare the average value with the hardness matrix computed previously after crafting samples. To compute it, we slightly altered Equation 13 to sum over pairs of features, reflecting the observations made during our validation process.
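To make Equations 12 and 13 concrete, the following minimal sketch (NumPy; not the authors' implementation) computes the hardness of a class pair from the series of (ε_k, τ_k) pairs obtained with the K maximum-distortion values, and the adversarial distance of a sample from its adversarial saliency map.

```python
import numpy as np

def hardness(avg_distortions, success_rates):
    """Hardness H(s, t) of a class pair, Equation 12 (trapezoidal rule).

    avg_distortions -- epsilon(s, t, tau_k) for k = 1..K, one value per
                       maximum-distortion parameter Upsilon_k
    success_rates   -- the corresponding success rates tau_k
    """
    eps = np.asarray(avg_distortions, dtype=float)
    tau = np.asarray(success_rates, dtype=float)
    # sum over k of (eps_{k+1} + eps_k) / 2 * (tau_{k+1} - tau_k)
    return float(np.sum((eps[1:] + eps[:-1]) / 2.0 * (tau[1:] - tau[:-1])))

def adversarial_distance(saliency_map):
    """Adversarial distance A(X, t), Equation 13.

    saliency_map -- adversarial saliency map S(X, t) computed for sample X
                    and target class t during the first crafting iteration
    """
    s = np.asarray(saliency_map).ravel()
    return 1.0 - np.count_nonzero(s > 0) / s.size
```

Evaluating hardness() on the nine (ε_k, τ_k) pairs measured for one source-target pair corresponds to one cell of Figure 14, while averaging adversarial_distance() over samples of a source class for a given target corresponds to one cell of Figure 15 (up to the slight modification of Equation 13 described above).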
[Bar charts showing, for each distortion range (Fig. 16) and each intensity variation (Fig. 17), the fraction of respondents identifying a digit and the fraction correctly classifying the digit.]

Fig. 16: Human perception of different distortions ε.

Fig. 17: Human perception of different intensity variations θ.

This notion of distance between classes intuitively defines a metric for the robustness of a network F against adversarial perturbations. We suggest the following definition:

    R(F) = \min_{(X,t)} A(X, t)    (14)

where the set of samples X considered is sufficiently large to represent the input domain of the network. A good approximation of the robustness can be computed with the training dataset. Note that the min operator used here can be replaced by other relevant operators, like the statistical expectation. The study of various operators is left as future work.

C. Study of human perception of adversarial samples

Recall that adversarial samples must not only be misclassified as the target class by deep neural networks, but must also visually appear (be classified) as the source class by humans. To evaluate this property, we ran an experiment using 349 human participants on the Mechanical Turk online service. We presented three original or adversarially altered samples from the MNIST dataset to human participants. To paraphrase, participants were asked for each sample: (a) 'is this sample a numeric digit?', and (b) 'if yes to (a), what digit is it?'. These two questions were designed to determine how distortion and intensity rates affected human perception of the samples.

The first experiment was designed to identify a baseline perception rate for the input data. The 74 participants were presented 3 of 222 unaltered samples randomly picked from the original MNIST data set. Respondents identified 97.4% of the samples as digits and correctly classified 95.3% of them.

As shown in Figure 16, a second set of experiments attempted to evaluate how the amount of distortion (ε) impacts human perception. Here, 184 participants were presented with a total of 1,707 samples with varying levels of distortion (and features altered with an intensity increase θ = +1). The experiments showed that below a threshold (ε = 14.29% distortion), participants were able to identify samples as digits (95%) and correctly classify them (90%) only slightly less accurately than the unaltered samples. The classification rate dropped dramatically (71%) at distortion rates above the threshold.

A final set of experiments evaluated the impact of intensity variations (θ) on perception, as shown in Figure 17. The 203 participants were accurate at identifying 5,355 samples as digits (96%) and classifying them correctly (95%). At higher absolute intensities (θ = −1 and θ = +1), specific digit classification decreased slightly (90.5% and 90%), but identification as digits was largely unchanged.

While preliminary, these experiments confirm that the overwhelming majority of generated samples retain human recognizability. Note that because we can generate samples below the distortion threshold for almost all of the input data (ε ≤ 14.29% for roughly 97% of the MNIST data), we can produce adversarial samples that humans will mis-interpret, thus meeting our adversarial goal. Furthermore, altering feature distortion intensity provides even better results: at −0.7 ≤ θ ≤ +0.7, humans classified the sample data at essentially the same rates as the original sample data.

VI. DISCUSSION

We introduced a new class of algorithms that systematically craft adversarial samples misclassified by a DNN once an adversary possesses knowledge of the DNN architecture. Although we focused our work on DL techniques used in the context of classification and trained with supervised methods, our approach is also applicable to unsupervised architectures. Instead of achieving a given target class, the adversary achieves a target output Y∗. Because the output space is more complex, it might be harder or impossible to match Y∗. In that case, Equation 1 would need to be relaxed with an acceptable distance between the network output F(X∗) and the adversarial target Y∗. Thus, the only remaining assumption made in this paper is that DNNs are feedforward. In other words, we did not consider recurrent neural networks, with cycles in their architecture, as the forward derivative must be adapted to accommodate such networks.

One of our key results is reducing the distortion (the number of features altered) needed to craft adversarial samples, compared to previous work. We believe this makes adversarial crafting much easier for input domains like malware executables,
which are not as easy to perturb as images [11], [16]. This distortion reduction comes with a performance cost: more elaborate but accurate saliency map formulas are more expensive for the attacker to compute. We would like to emphasize that our method's high success rate can be further improved by adversaries only interested in crafting a limited number of samples. Indeed, to lower the distortion of one particular sample, an adversary can use adversarial saliency maps to fine-tune the perturbation introduced. On the other hand, if an adversary wants to craft large amounts of adversarial samples, performance is important. In our evaluation, we balanced these factors to craft adversarial samples against the DNN in less than a second. As far as our algorithm implementation was concerned, the most computationally expensive steps were the matrix manipulations required to construct adversarial saliency maps from the forward derivative matrix. The complexity depends on the number of input features. These matrix operations can be made more efficient, notably by making better use of GPU-accelerated computations.

Our efforts so far represent a first but meaningful step towards mitigating adversarial samples: the hardness and adversarial distance metrics lay out bases for defense mechanisms. Although designing such defenses is outside the scope of this paper, we outline two classes of defenses: (1) adversarial sample detection and (2) improvements of DNN robustness.

Developing techniques for adversarial sample detection is a reactive solution. During our experimental process, we noticed that adversarial samples can, for instance, be detected by evaluating the regularity of samples. More specifically, in our application example, the sum of the squared differences between each pair of neighboring pixels is always higher for adversarial samples than for benign samples. However, there is no a priori reason to assume that this technique will reliably detect adversarial samples in different settings, so extending this approach is one avenue for future work. Another approach was proposed in [19], but it is unsuccessful: by stacking the denoising auto-encoder used for detection with the original DNN, the adversary can again produce adversarial samples.

The second class of solutions seeks to improve training in order to increase the robustness of DNNs. Interestingly, the problem of adversarial samples is closely linked to training. Work on generative adversarial networks showed that a two-player game between two DNNs can lead to the generation of new samples from a training set [17]. This can help augment training datasets. Furthermore, adding adversarial samples to the training set can act like a regularizer [18]. We also observed in our experiments that training with adversarial samples makes crafting additional adversarial samples harder. Indeed, by adding 18,000 adversarial samples to the original MNIST training dataset, we trained a new instance of our DNN. We then ran our algorithms again on this newly trained network and crafted a set of 9,000 adversarial samples. Preliminary analysis of these crafted adversarial samples showed that the success rate was reduced by 7.2% while the average distortion increased by 37.5%, suggesting that training with adversarial samples can make DNNs more robust.

VII. RELATED WORK

The security of machine learning [2] is an active research topic within the security and machine learning communities. A broad taxonomy of attacks and required adversarial capabilities is discussed in [22] and [3], along with considerations for building defense mechanisms. Biggio et al. studied classifiers in adversarial settings and outlined a framework for securing them [8]. However, their work does not consider DNNs but rather other techniques used for binary classification, like logistic regression or Support Vector Machines. Generally speaking, attacks against machine learning can be separated into two categories, depending on whether they are executed during training [9] or at test time [10].

Prior work on adversarial sample crafting against DNNs derived a simple technique corresponding to the Architecture and Training Tools threat model, based on the backpropagation procedure used during network training [18], [30], [36]. This approach creates adversarial samples by defining an optimization problem based on the DNN's cost function. In other words, instead of computing gradients to update DNN weights, one computes gradients to update the input, which is then misclassified as the target class by a DNN. The alternative approach proposed in this paper is to identify the regions of the input that are most relevant to its classification by a DNN. This is accomplished by computing the saliency map of a given input, as described by Simonyan et al. in the case of DNNs handling images [34]. We extended this concept to create adversarial saliency maps highlighting regions of the input that need to be perturbed in order to accomplish the adversarial goal.

Previous work by Yosinski et al. investigated how features are transferable between deep neural networks [38], while Szegedy et al. showed that adversarial samples can indeed be misclassified across models [36]. They report that once an adversarial sample is generated for a given neural network architecture, it is also likely to be misclassified in neural networks designed differently, which explains why the attack is successful. However, the effectiveness of this kind of attack depends on (1) the quality and size of the surrogate dataset collected by the adversary, and (2) the adequacy of the adversarial network used to craft adversarial samples.

VIII. CONCLUSIONS

Broadly speaking, this paper has explored adversarial behavior in deep learning systems. In addition to exploring the goals and capabilities of DNN adversaries, we introduced a new class of algorithms to craft adversarial samples based on computing forward derivatives. This technique allows an adversary with knowledge of the network architecture to construct adversarial saliency maps that identify features of the input that most significantly impact output classification. These algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample.

Solutions to defend DNNs against adversaries can be divided into two classes: detecting adversarial samples and
improving the training phase. The detection of adversarial samples remains an open problem. Interestingly, the universal approximation theorem formulated by Hornik et al. states that one hidden layer is sufficient to represent a function to an arbitrary degree of accuracy [21]. Thus, one can intuitively conceive that improving the training phase is key to resisting adversarial samples.

In future work, we plan to address the limitations of DNNs trained in an unsupervised manner, as well as cyclical recurrent neural networks (as opposed to the acyclical networks considered throughout this paper). Also, as most models of our taxonomy have yet to be researched, this leaves room for further investigation of DL in various adversarial settings.

ACKNOWLEDGMENT

The authors would like to warmly thank Dr. Damien Octeau and Aline Papernot for insightful discussions about this work. Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
REFERENCES

[1] E. G. Amoroso. Fundamentals of Computer Security Technology. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1994.
[2] M. Barreno, B. Nelson, A. D. Joseph, and J. Tygar. The security of machine learning. Machine Learning, 81(2):121–148, 2010.
[3] M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar. Can machine learning be secure? In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, pages 16–25. ACM, 2006.
[4] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[5] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), volume 4, page 3. Austin, TX, 2010.
[6] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases, pages 387–402. Springer, 2013.
[7] B. Biggio, G. Fumera, and F. Roli. Pattern recognition systems under attack: Design issues and research challenges. International Journal of Pattern Recognition and Artificial Intelligence, 28(07):1460002, 2014.
[8] B. Biggio, G. Fumera, and F. Roli. Security evaluation of pattern classifiers under attack. IEEE Transactions on Knowledge and Data Engineering, 26(4):984–996, 2014.
[9] B. Biggio, B. Nelson, and P. Laskov. Support vector machines under adversarial label noise. In ACML, pages 97–112, 2011.
[10] B. Biggio, B. Nelson, and L. Pavel. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[11] B. Biggio, K. Rieck, D. Ariu, C. Wressnegger, I. Corona, G. Giacinto, and F. Roli. Poisoning behavioral malware clustering. In Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop, pages 27–36. ACM, 2014.
[12] D. Cireşan, U. Meier, J. Masci, et al. Multi-column deep neural network for traffic sign classification. Neural Networks, 32:333–338, 2012.
[13] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with task learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.
[14] G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu. Large-scale malware classification using random projections and neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 3422–3426. IEEE, 2013.
[15] G. E. Dahl, D. Yu, et al. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.
[16] P. Fogla and W. Lee. Evading network anomaly detection systems: formal reasoning and practical techniques. In Proceedings of the 13th ACM Conference on Computer and Communications Security, pages 59–68. ACM, 2006.
[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, et al. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[18] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the 2015 International Conference on Learning Representations. Computational and Biological Learning Society, 2015.
[19] S. Gu and L. Rigazio. Towards deep neural network architectures robust to adversarial examples. In Proceedings of the 2015 International Conference on Learning Representations. Computational and Biological Learning Society, 2015.
[20] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[21] K. Hornik, M. Stinchcombe, et al. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[22] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. Tygar. Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, pages 43–58. ACM, 2011.
[23] E. Knorr. How PayPal beats the bad guys with machine learning. http://www.infoworld.com/article/2907877/machine-learning/how-paypal-reduces-fraud-with-machine-learning.html, 2015.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[25] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. The Journal of Machine Learning Research, 10:1–40, 2009.
[26] Y. LeCun, L. Bottou, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[27] Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998.
[28] LISA lab. http://deeplearning.net/tutorial/lenet.html, 2010.
[29] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[30] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Computer Vision and Pattern Recognition (CVPR 2015). IEEE, 2015.
[31] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5, 1988.
[32] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. Convolutional, long short-term memory, fully connected deep neural networks. 2015.
[33] H. Sak, A. Senior, and F. Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
[34] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
[36] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In Proceedings of the 2014 International Conference on Learning Representations. Computational and Biological Learning Society, 2014.
[37] Y. Taigman, M. Yang, et al. DeepFace: Closing the gap to human-level performance in face verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1701–1708. IEEE, 2014.
[38] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
APPENDIX

A. Validation setup details

To train and use the deep neural network, we use Theano [5], a Python package designed to simplify large-scale scientific computing. Theano allows us to efficiently implement the network architecture, the training through backpropagation, and the forward derivative computation. We configure Theano to make computations with float32 precision, because they can then be accelerated using graphics processors. Indeed, all our experiments are run with GPU acceleration on a machine equipped with a Xeon E5-2680 v3 processor and an Nvidia Tesla K5200 graphics processor.

Our deep neural network makes some simplifications, suggested in the Theano documentation [28], to the original LeNet-5 architecture. Nevertheless, once it is trained on batches of 500 samples taken from the MNIST dataset [27] with a learning parameter of η = 0.1 for 200 epochs, the learned network parameters exhibit a 98.93% accuracy rate on the MNIST training set and a 99.41% accuracy rate on the MNIST test set, which are comparable to state-of-the-art accuracies.
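The authors implement and train this network with Theano; purely as an illustrative sketch, the same configuration (batches of 500, η = 0.1, 200 epochs) can be expressed in PyTorch as below. The layer sizes and tanh activations follow the deeplearning.net LeNet-5 tutorial and are assumptions here, not a statement of the exact architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class LeNetLike(nn.Module):
    """Simplified LeNet-5 in the style of the deeplearning.net tutorial."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, kernel_size=5)
        self.conv2 = nn.Conv2d(20, 50, kernel_size=5)
        self.fc1 = nn.Linear(50 * 4 * 4, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.max_pool2d(torch.tanh(self.conv1(x)), 2)   # 28x28 -> 12x12
        x = F.max_pool2d(torch.tanh(self.conv2(x)), 2)   # 12x12 -> 4x4
        x = torch.tanh(self.fc1(x.flatten(1)))
        return self.fc2(x)

def train():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    loader = DataLoader(
        datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor()),
        batch_size=500, shuffle=True)                    # batches of 500 samples
    model = LeNetLike().to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)    # learning parameter eta = 0.1
    for epoch in range(200):                             # 200 epochs
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
```

The accuracy figures reported above refer to the authors' Theano model; this sketch is only meant to convey the training hyperparameters and overall architecture shape.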
