The Limitations of Deep Learning in Adversarial Settings
Abstract—Imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure, and describe preliminary work on defenses based on a predictive measure of the distance between a benign input and a target classification.
I. INTRODUCTION

Fig. 1: Adversarial sample generation - Distortion is added to input samples to force the DNN to output adversary-selected classifications (min distortion = 0.26%, max distortion = 13.78%, and average distortion ε = 4.06%).

Large neural networks, recast as deep neural networks (DNNs) in the mid 2000s, altered the machine learning landscape by outperforming other approaches in many tasks. This was made possible by advances that reduced the computational complexity of training [20]. For instance, deep learning (DL) can now take advantage of large datasets to achieve accuracy rates higher than previous classification techniques. In short, DL is transforming computational processing of complex data in many domains such as vision [24], [37], speech recognition [15], [32], [33], language processing [13], financial fraud detection [23], and recently malware detection [14].

This increasing use of deep learning is creating incentives for adversaries to manipulate DNNs to force misclassification of inputs. For instance, applications of deep learning use image classifiers to distinguish inappropriate from appropriate content, and text and image classifiers to differentiate between SPAM and non-SPAM email. An adversary able to craft misclassified inputs would profit from evading detection; indeed, such attacks occur today on non-DL classification systems [6], [7], [22]. In the physical domain, consider a driverless car system that uses DL to identify traffic signs [12]. If slightly altering "STOP" signs causes DNNs to misclassify them, the car would not stop, thus subverting the car's safety.

An adversarial sample is an input crafted to cause deep learning algorithms to misclassify. Note that adversarial samples are created at test time, after the DNN has been trained by the defender, and do not require any alteration of the training process. Figure 1 shows examples of adversarial samples taken from our validation experiments. It shows how an image originally showing a digit can be altered to force a DNN to classify it as another digit. Adversarial samples are created from benign samples by adding distortions exploiting the imperfect generalization learned by DNNs from finite training sets [4], and the underlying linearity of most components used to build DNNs [18]. Previous work explored DNN properties that could be used to craft adversarial samples [18], [30], [36]. Simply put, these techniques exploit gradients computed by network training algorithms: instead of using these gradients to update network parameters as would normally be done, the gradients are used to update the original input itself, which is subsequently misclassified by DNNs.
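To make the contrast with these prior gradient-based techniques concrete, the sketch below applies the same idea to a toy differentiable classifier (a single logistic unit standing in for a DNN): the loss gradient is taken with respect to the input rather than the parameters, and the input is updated until the output flips. The weights, step size, and iteration budget are illustrative assumptions, not values from the works cited above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy differentiable "network": a single logistic unit with fixed weights.
w = np.array([0.8, -1.2, 0.5])
b = 0.1

def F(x):
    return sigmoid(np.dot(w, x) + b)

# Prior approaches: take the gradient of a training-style loss with respect
# to the INPUT (not the weights) and follow it until the model flips its label.
x = np.array([0.2, 0.4, 0.1])      # benign sample
y_target = 1.0                     # adversary-selected output
eta = 0.05                         # illustrative step size

x_adv = x.copy()
for _ in range(200):
    p = F(x_adv)
    if p > 0.5:                    # target reached
        break
    # d(cross-entropy(p, y_target)) / dx = (p - y_target) * w for this model;
    # descending it pushes the output toward the adversary's target class.
    grad_input = (p - y_target) * w
    x_adv -= eta * grad_input

print(F(x), F(x_adv))              # output moves toward the adversarial target
```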
In this paper, we describe a new class of algorithms for adversarial sample creation against any feedforward (acyclic) DNN [31] and formalize the threat model space of deep learning with respect to the integrity of output classification. Unlike previous approaches mentioned above, we compute a direct mapping from the input to the output to achieve an explicit adversarial goal. Furthermore, our approach only alters a (frequently small) fraction of input features, leading to reduced perturbation of the source inputs. It also enables adversaries to apply heuristic searches to find perturbations leading to input-targeted misclassifications (perturbing inputs to result in a specific output classification).

More formally, a DNN models a multidimensional function F : X → Y where X is a (raw) feature vector and Y is an output vector. We construct an adversarial sample X* from a benign sample X by adding a perturbation vector δX solving the following optimization problem:

arg min_{δX} ||δX|| s.t. F(X + δX) = Y*    (1)

where X* = X + δX is the adversarial sample, Y* is the desired adversarial output, and || · || is a norm appropriate to compare the DNN inputs. Solving this problem is non-trivial, as properties of DNNs make it non-linear and non-convex [25]. Thus, we craft adversarial samples by constructing a mapping from input perturbations to output variations. Note that all research mentioned above took the opposite approach: it used output variations to find corresponding input perturbations.

Our understanding of how changes made to inputs affect a DNN's output stems from the evaluation of the forward derivative: a matrix we introduce and define as the Jacobian of the function learned by the DNN. The forward derivative is used to construct adversarial saliency maps indicating input features to include in perturbation δX in order to produce adversarial samples inducing a certain behavior from the DNN. Forward derivative approaches are much more powerful than the gradient descent techniques used in prior systems. They are applicable to both supervised and unsupervised architectures and allow adversaries to generate information for broad families of adversarial samples. Indeed, adversarial saliency maps are versatile tools based on the forward derivative and designed with adversarial goals in mind, giving greater control to adversaries with respect to the choice of perturbations. In our work, we consider the following questions to formalize the security of DL in adversarial settings: (1) "What is the minimal knowledge required to perform attacks against DL?", (2) "How can vulnerable or resistant samples be identified?", and (3) "How are adversarial samples perceived by humans?".

The adversarial sample generation algorithms are validated using the widely studied LeNet architecture (a pioneering DNN used for hand-written digit recognition [26]) and the MNIST dataset [27]. We show that any input sample can be perturbed to be misclassified as any target class with 97.10% success while perturbing on average 4.02% of the input features per sample. The computational costs of the sample generation are modest; samples were each generated in less than a second in our setup. Lastly, we study the impact of our algorithmic parameters on distortion and human perception of samples. This paper makes the following contributions:

• We formalize the space of adversaries against classification DNNs with respect to adversarial goals and capabilities. Here, we provide a better understanding of how attacker capabilities constrain attack strategies and goals.
• We introduce a new class of algorithms for crafting adversarial samples solely by using knowledge of the DNN architecture. These algorithms (1) exploit forward derivatives that inform the learned behavior of DNNs, and (2) build adversarial saliency maps enabling an efficient exploration of the adversarial-samples search space.
• We validate the algorithms using a widely used computer vision DNN. We define and measure sample distortion and source-to-target hardness, and explore defenses against adversarial samples. We conclude by studying human perception of distorted samples.

II. TAXONOMY OF THREAT MODELS IN DEEP LEARNING

Classical threat models enumerate the goals and capabilities of adversaries in a target domain [1]. This section taxonomizes threat models in deep learning systems and positions several previous works with respect to the strength of the modeled adversary. We begin by providing an overview of deep neural networks, highlighting their inputs, outputs, and function. We then consider the taxonomy presented in Figure 2.

A. About Deep Neural Networks

Deep neural networks are large neural networks organized into layers of neurons, corresponding to successive representations of the input data. A neuron is an individual computing unit transmitting to other neurons the result of the application of its activation function on its input. Neurons are connected by links with different weights and biases characterizing the strength between neuron pairs. Weights and biases can be viewed as DNN parameters used for information storage. We define a network architecture to include knowledge of the network topology, neuron activation functions, as well as weight and bias values. Weights and biases are determined during training by finding values that minimize a cost function c evaluated over the training data T. Network training is traditionally done by gradient descent using backpropagation [31].

Deep learning can be partitioned into two categories, depending on whether DNNs are trained in a supervised or unsupervised manner [29]. Supervised training leads to models that map unseen samples using a function inferred from labeled training data. On the contrary, unsupervised training learns representations of unlabeled training data, and the resulting DNNs can be used to generate new samples, or to automate feature engineering by acting as a pre-processing layer for larger DNNs. We restrict ourselves to the problem of learning multi-class classifiers in supervised settings. These DNNs are given an input X and output a class probability vector Y. Note that our work remains valid for DNNs trained in an unsupervised manner; we leave a detailed study of this issue to future work.
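As a minimal illustration of the supervised multi-class setting just described, the sketch below evaluates a small feedforward network F on an input X and returns a class probability vector Y. The layer sizes, random weights, and activation functions are assumptions chosen for illustration; they are not the architecture studied later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative architecture: M = 4 input features, two hidden layers, N = 3 classes.
sizes = [4, 8, 8, 3]
weights = [rng.normal(scale=0.5, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def F(x):
    """Forward pass: returns the class probability vector Y = F(X)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)          # hidden layers with differentiable activations
    return softmax(h @ weights[-1] + biases[-1])

X = rng.normal(size=4)
Y = F(X)
print(Y, Y.sum())                        # probability vector over the N classes, sums to 1
```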
[Fig. 2: Threat model taxonomy of deep learning adversaries, organized along adversarial goals and adversarial capabilities of increasing complexity, up to knowledge of the architecture and training tools (F, T, c).]
[Figure: simple example network with inputs x1, x2, hidden neurons h1, h2, weights w11, w12, w21, w22, w31, w32, and output o.]
[…] where X* = X + δX is the adversarial sample, Y* is the desired adversarial output, and || · || is a norm appropriate to compare points in the input domain. Informally, the adversary is searching for small perturbations of the input that will incur a modification of the output into Y*. Finding these perturbations can be done using optimization techniques, simple heuristics, or even brute force. However, such solutions are hard to implement for deep neural networks because of non-convexity and non-linearity [25]. Instead, we propose a systematic approach stemming from the forward derivative.

We define the forward derivative as the Jacobian matrix of the function F learned by the neural network during training. For this example, the output of F is one-dimensional, so the matrix is reduced to a vector:

∇F(X) = [ ∂F(X)/∂x1 , ∂F(X)/∂x2 ]    (2)

Both components of this vector are computable using the adversary's knowledge, and later we show how to compute this term efficiently. The forward derivative for our example network is illustrated in Figure 5, which plots the gradient for the second component ∂F(X)/∂x2 on the vertical axis against x1 and x2 on the horizontal axes. We omit the plot for ∂F(X)/∂x1 because F is approximately symmetric in its two inputs, making the first component redundant for our purposes. This plot makes it easy to visualize the divide between the network's two possible outputs in terms of values assigned to the input feature x2: 0 to the left of the spike, and 1 to its right. Notice that this aligns with Figure 4, and gives us the information needed to achieve our adversarial goal: find input perturbations that drive the output closer to a desired value.

Consulting Figure 5 alongside our example network, we can confirm this intuition by looking at a few sample points. Consider X = (1, 0.37) and X* = (1, 0.43), which are both located near the spike in Figure 5. Although they only differ by a small amount (δx2 = 0.06), they cause a significant change in the network's output, as F(X) = 0.11 and F(X*) = 0.95.

We now generalize this approach to any feedforward DNN, using the same assumptions and adversary model from Section III-A. The only assumptions we make on the architecture are that its neurons form an acyclic DNN and that each uses a differentiable activation function. Note that this last assumption is not limiting because the back-propagation algorithm imposes the same requirement. In Figure 6, we give an example of a feedforward deep neural network architecture and define some notations used throughout the remainder of the paper. Most importantly, the N-dimensional function F learned by the DNN during training assigns an output Y = F(X) when given an M-dimensional input X. We write n for the number of hidden layers. Layers are indexed by k ∈ 0..n+1 such that k = 0 is the index of the input layer, k ∈ 1..n corresponds to hidden layers, and k = n+1 indexes the output layer.

Algorithm 1 shows our process for constructing adversarial samples. As input, the algorithm takes a benign sample X, a target output Y*, an acyclic feedforward DNN F, a maximum distortion parameter Υ, and a feature variation parameter θ. It returns a new adversarial sample X* such that F(X*) = Y*, and proceeds in three basic steps: (1) compute the forward derivative ∇F(X*), (2) construct a saliency map S based on the derivative, and (3) modify an input feature imax by θ. This process is repeated until the network outputs Y* or the maximum distortion Υ is reached. We now detail each step.

1) Forward Derivative of a Deep Neural Network: The first step is to compute the forward derivative for the given sample X. As introduced previously, this is given by:

∇F(X) = ∂F(X)/∂X = [ ∂Fj(X)/∂xi ]_{i ∈ 1..M, j ∈ 1..N}    (3)

This is essentially the Jacobian of the function corresponding to what the neural network learned during training. The forward derivative computes gradients that are similar to those computed for backpropagation, but with two important distinctions: we take the derivative of the network directly, rather than of its cost function, and we differentiate with respect to the input features instead of the network parameters.
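Because the forward derivative is simply the Jacobian of F evaluated at X, an adversary with access to F can always sanity-check an analytic or symbolic computation numerically. The sketch below approximates Equation 3 by central finite differences on a toy two-input network; the weights and the step size eps are illustrative assumptions, not the example network studied above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy network with two inputs and a single output, in the spirit of the running example.
W1 = np.array([[ 2.0, -1.0],
               [-3.0,  4.0]])
b1 = np.array([0.5, -0.5])
w2 = np.array([3.0, -2.0])
b2 = 0.1

def F(x):
    h = sigmoid(W1 @ x + b1)            # hidden layer H1
    return sigmoid(w2 @ h + b2)         # scalar output F(X)

def forward_derivative(f, x, eps=1e-5):
    """Numerically approximate the forward derivative dF/dX (Equation 3) at x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

X = np.array([1.0, 0.4])
print(F(X), forward_derivative(F, X))   # [dF/dx1, dF/dx2], as in Equation 2
```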
Fig. 6: Example architecture of a feedforward deep neural network with notations used in the paper. [Diagram: an M-dimensional input X, n hidden layers of m1, ..., mn neurons, and an N-dimensional output Y.]

Notations:
F: function learned by the neural network during training
X: input of the neural network
Y: output of the neural network
M: input dimension (number of neurons on the input layer)
N: output dimension (number of neurons on the output layer)
n: number of hidden layers in the neural network
f: activation function of a neuron
Hk: output vector of layer k neurons

Indices:
k: index for layers (between 0 and n+1)
i: index for input X components (between 1 and M)
j: index for output Y components (between 1 and N)
p: index for neurons (between 1 and mk for any layer k)
Algorithm 1 Crafting adversarial samples
X is the benign sample, Y* is the target network output, F is the function learned by the network during training, Υ is the maximum distortion, and θ is the change made to features. This algorithm is applied to a specific DNN in Algorithm 2.
Input: X, Y*, F, Υ, θ
1: X* ← X
2: Γ = {1 . . . |X|}
3: while F(X*) ≠ Y* and ||δX|| < Υ do
4:   Compute forward derivative ∇F(X*)
5:   S = saliency_map(∇F(X*), Γ, Y*)
6:   Modify X*_imax by θ s.t. imax = arg max_i S(X, Y*)[i]
7:   δX ← X* − X
8: end while
9: return X*

[…] values only. To simplify our expressions, we now consider one element (i, j) ∈ [1..M] × [1..N] of the M × N forward derivative matrix defined in Equation 3: that is, the derivative of one output neuron Fj according to one input dimension xi. Of course, our results are true for any matrix element. We start at the first hidden layer of the neural network. We can differentiate the output of this first hidden layer in terms of the input components. We then recursively differentiate each hidden layer k ∈ 2..n in terms of the previous one:

∂Hk(X)/∂xi = [ ∂fk,p(Wk,p · Hk−1 + bk,p) / ∂xi ]_{p ∈ 1..mk}    (4)

where Hk is the output vector of hidden layer k and fk,p is the activation function of neuron p in layer k. Each neuron p on a hidden or output layer indexed k ∈ 1..n+1 is connected to the previous layer k−1 using weights defined in vector Wk,p. By defining the weight matrix accordingly, we can define fully or sparsely connected interlayers, thus modeling a variety of architectures. Similarly, we write bk,p for the bias of neuron p of layer k. By applying the chain rule, we can write a series of formulae for k ∈ 2..n:

[ ∂Hk(X)/∂xi ]_{p ∈ 1..mk} = ( Wk,p · ∂Hk−1/∂xi ) × (∂fk,p/∂xi)(Wk,p · Hk−1 + bk,p)    (5)

We are thus able to express ∂Hn/∂xi. We know that output neuron […] recursively. By plugging these results for successive layers back in Equation 6, we get an expression of component (i, j) of the DNN's forward derivative. Hence, the forward derivative ∇F of a network F can be computed for any input X by successively differentiating layers starting from the input layer until the output layer is reached. We later discuss in our methodology evaluation the computability of ∇F for state-of-the-art DNN architectures. Notably, the forward derivative can be computed using symbolic differentiation.
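That last remark can be made concrete with a symbolic-differentiation package such as Theano [5], which is also used in the validation setup described in the appendix. The sketch below builds a small stand-in network with random weights and obtains the full N × M forward derivative via theano.gradient.jacobian; it is an illustrative sketch under those assumptions, not the implementation used in our evaluation.

```python
import numpy as np
import theano
import theano.tensor as T

theano.config.floatX = "float32"

# Illustrative stand-in network: M = 2 inputs, one hidden layer, N = 3 outputs.
rng = np.random.RandomState(0)
W1 = theano.shared(rng.normal(size=(2, 5)).astype("float32"))
b1 = theano.shared(np.zeros(5, dtype="float32"))
W2 = theano.shared(rng.normal(size=(5, 3)).astype("float32"))
b2 = theano.shared(np.zeros(3, dtype="float32"))

x = T.vector("x")                              # input X
h = T.nnet.sigmoid(T.dot(x, W1) + b1)          # hidden layer H1
z = T.dot(h, W2) + b2
y = T.exp(z - z.max()) / T.sum(T.exp(z - z.max()))   # softmax output Y = F(X)

# Forward derivative: the N x M Jacobian of F, built symbolically.
jacobian = theano.gradient.jacobian(y, wrt=x)
forward_derivative = theano.function([x], jacobian)

print(forward_derivative(np.array([0.5, -1.0], dtype="float32")).shape)   # (3, 2)
```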
2) Adversarial Saliency Maps: We extend saliency maps, previously introduced as visualization tools [34], to construct adversarial saliency maps. These maps indicate which input features an adversary should perturb in order to effect the desired changes in network output most efficiently, and are thus versatile tools that allow adversaries to generate broad classes of adversarial samples.

Adversarial saliency maps are defined to suit problem-specific adversarial goals. For instance, we later study a network used as a classifier: its output is a probability vector across classes, where the final predicted class value corresponds to the component with the highest probability:

label(X) = arg max_j Fj(X)    (7)
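Putting the three steps of Algorithm 1 together, a simplified rendering of the crafting loop might look as follows. It assumes forward_derivative and saliency_map helpers (such as the sketches given elsewhere in this section), takes the target as a class index rather than a full output vector, measures distortion as the fraction of modified features, and clips pixel values to [0, 1]; these are simplifying assumptions rather than the exact procedure evaluated later.

```python
import numpy as np

def craft(x, target_class, F, forward_derivative, saliency_map,
          theta=1.0, max_distortion=0.15):
    """Sketch of Algorithm 1: greedy, feature-by-feature crafting.

    F:                  callable returning the class probability vector F(X)
    forward_derivative: callable returning the N x M Jacobian at X
    saliency_map:       callable returning the feature(s) to perturb next
    theta:              amount added to each selected feature
    max_distortion:     Upsilon, fraction of features allowed to change
    """
    x_adv = x.astype(float).copy()
    gamma = set(range(x.size))                    # features still in the search space
    max_changes = int(max_distortion * x.size)
    changed = set()

    while np.argmax(F(x_adv)) != target_class and len(changed) < max_changes:
        jac = forward_derivative(x_adv)           # step (1): forward derivative
        picked = saliency_map(jac, gamma, target_class)   # step (2): saliency map
        if picked is None:                        # no admissible features left
            break
        for i in np.atleast_1d(picked):           # step (3): perturb selected features
            x_adv[i] = np.clip(x_adv[i] + theta, 0.0, 1.0)
            changed.add(int(i))
            gamma.discard(int(i))
    return x_adv
```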
Our goal is to report whether we can reach any adversarial target class for a given source class. For instance, if we are given a handwritten 0, we increase some of the pixel intensities to produce 9 adversarial samples respectively classified in each of the classes 1 to 9. All pixel intensities changed are increased by θ = +1. We discuss this choice of parameter in Section V. We allow for an unlimited maximum distortion Υ = ∞. We simply measure, for each of the 90 source-target class pairs, whether an adversarial sample can be produced or not.

The adversarial saliency map used in the crafting algorithm to select pixel pairs that can be increased is an application of the map introduced in the general case of classification in Equation 8. The map aims to find pairs of pixels (p1, p2) using the following heuristic:

arg max_{(p1,p2)} −[ Σ_{i=p1,p2} ∂Ft(X)/∂Xi ] × [ Σ_{i=p1,p2} Σ_{j≠t} ∂Fj(X)/∂Xi ]    (10)

where t is the index of the target class, the left operand of the multiplication operation is constrained to be positive, and the right operand of the multiplication operation is constrained to be negative. This heuristic, introduced in the previous section of this manuscript, searches for pairs of pixels producing an increase in the target class output while reducing the sum of the outputs of all other classes when simultaneously increased. The pseudocode of the corresponding subroutine saliency_map is given in Algorithm 3.

The saliency map considers pairs of pixels and not individual pixels because selecting pixels one at a time is too strict, and very few pixels would meet the heuristic search criteria described in Equation 8. Searching for pairs of pixels is more likely to match the condition because one of the pixels can compensate for a minor flaw of the other pixel. Let us consider a simple example: p1 has a target derivative of 5 but a sum of other-class derivatives equal to 0.1, while p2 has a target derivative equal to −0.5 and a sum of other-class derivatives equal to −6. Individually, these pixels do not match the saliency map criteria stated in Equation 8, but combined, the pair does match the saliency criteria defined in Equation 10.

One could also envision considering larger groups of input features to define saliency maps. However, this comes at a greater computational cost because more combinations need to be considered each time the group size is increased.

In our implementation of these algorithms, we compute the forward derivative of the network using the last hidden layer instead of the output probability layer. This is justified by the extreme variations introduced by the logistic regression computed between these two layers to ensure probabilities sum up to 1, leading to extreme derivative values. This reduces the quality of information on how the neurons are activated by different inputs and causes the forward derivative to lose accuracy when generating saliency maps. Better results are achieved when working with the last hidden layer, also made up of 10 neurons, each corresponding to one digit class 0 to 9. This justifies enforcing constraints on the forward derivative. Indeed, as the output layer used for computing the forward derivative does not sum up to 1, increasing Ft(X) does not imply that Σ_{j≠t} Fj(X) will decrease, and vice-versa.

Algorithm 3 Increasing pixel intensities saliency map
∇F(X) is the forward derivative, Γ the features still in the search space, and t the target class
Input: ∇F(X), Γ, t
1: for each pair (p, q) ∈ Γ do
2:   α = Σ_{i=p,q} ∂Ft(X)/∂Xi
3:   β = Σ_{i=p,q} Σ_{j≠t} ∂Fj(X)/∂Xi
4:   if α > 0 and β < 0 and −α × β > max then
5:     p1, p2 ← p, q
6:     max ← −α × β
7:   end if
8: end for
9: return p1, p2

The algorithm is able to craft successful adversarial samples for all 90 source-target class pairs. Figure 1 shows the 90 adversarial samples obtained as well as the 10 original samples used to craft them. The original samples are found on the diagonal. A sample on row i and column j, when i ≠ j, is a sample crafted from an image originally classified as source class i to be misclassified as target class j.

To verify the validity of our algorithms, and more specifically of our adversarial saliency maps, we run a simple experiment. We run the crafting algorithm on an empty input (all pixels initially set to an intensity of 0) and craft one adversarial sample for each class from 0 to 9. The different samples shown in Figure 9 demonstrate how adversarial saliency maps are able to identify input features relevant to classification in a class.

C. Crafting by decreasing pixel intensities

Instead of increasing pixel intensities to achieve the adversarial targets, the second adversarial strategy decreases pixel intensities by θ = −1. The implementation is identical with the exception of the adversarial saliency map. The formula is the […]
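For reference, the pairwise search of Algorithm 3 (the increasing-intensity saliency map built on Equation 10) can be sketched as follows. The Jacobian layout, feature count, and random values used to exercise the function are illustrative assumptions and not part of the evaluated implementation.

```python
import itertools
import numpy as np

def saliency_map(jacobian, gamma, t):
    """Pairwise saliency map of Algorithm 3 (increasing intensities).

    jacobian: N x M forward derivative, jacobian[j, i] = dF_j(X)/dX_i
    gamma:    iterable of feature indices still in the search space
    t:        index of the target class
    Returns the pair (p1, p2) maximizing -alpha * beta subject to
    alpha > 0 and beta < 0, or None if no pair satisfies the constraints.
    """
    target_grad = jacobian[t]                        # dF_t / dX_i for every feature i
    other_grad = jacobian.sum(axis=0) - target_grad  # sum over classes j != t
    best, best_score = None, 0.0
    for p, q in itertools.combinations(gamma, 2):
        alpha = target_grad[p] + target_grad[q]
        beta = other_grad[p] + other_grad[q]
        if alpha > 0 and beta < 0 and -alpha * beta > best_score:
            best, best_score = (p, q), -alpha * beta
    return best

# Illustrative usage with a random Jacobian (10 classes, 784 pixel features).
rng = np.random.default_rng(0)
J = rng.normal(size=(10, 784))
print(saliency_map(J, gamma=range(784), t=3))
```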
[Figure: 10 × 10 grid of crafted samples; columns indexed by output classification 0–9, rows by source class.]

A. Crafting large amounts of adversarial samples

Now that we have shown the feasibility of crafting adversarial samples for all source-target class pairs, we seek […]
2) Hardness measure: Results indicating that some source-target class pairs are not as easy as others lead us to question the existence of a measure quantifying the distance between two classes. This is relevant to a defender seeking to identify which classes of a DNN are most vulnerable to adversaries. We name this measure the hardness of a target class relative to a given source class. It normalizes the average distortion of a class pair (s, t) relative to its success rate:

H(s, t) = ∫_τ ε(s, t, τ) dτ    (11)

where ε(s, t, τ) is the average distortion of a set of samples for the corresponding success rate τ. In practice, these two quantities are computed over a finite number of samples by fixing a set of K maximum distortion parameter values Υk in the crafting algorithm, where k ∈ 1..K. The set of maximum distortions gives a series of pairs (εk, τk) for k ∈ 1..K. Thus, the practical formula used to compute the hardness of a source-destination class pair can be derived from the trapezoidal rule:

H(s, t) ≈ Σ_{k=1}^{K−1} (τ_{k+1} − τ_k) · [ε(s, t, τ_{k+1}) + ε(s, t, τ_k)] / 2    (12)

We computed the hardness values for all classes using a set of K = 9 maximum distortion values Υ ∈ {0.3, 1.3, 2.6, 5.1, 7.7, 10.2, 12.8, 25.5, 38.3}% in the algorithm. Average distortions ε and success rates τ are averaged over 9,000 adversarial samples for each maximum distortion value Υ. Figure 14 shows the hardness values H(s, t) for all pairs (s, t) ∈ {0..9}². The reader will observe that the matrix has a shape similar to the average distortion matrix plotted in Figure 13. However, the hardness measure is more accurate because it is plotted using a series of maximum distortions.

3) Adversarial distance: The measure just introduced lays the groundwork for finding defenses against adversarial samples. Indeed, if the hardness measure were predictive instead of being computed after adversarial crafting, the defender could identify vulnerable inputs. Furthermore, a predictive measure applicable to a single sample would allow a defender to evaluate the vulnerability of specific samples as well as class pairs. We investigated several complex estimators, including convolutional transformations of the forward derivative or Hessian matrices. However, we found that simply using a formula derived from the intuition behind adversarial saliency maps gave enough accuracy for predicting the hardness of samples in our experimental setup.

We name this predictive measure the adversarial distance of sample X to class t and write it A(X, t). Simply put, it estimates the distance between a sample X and a target class t. We define the distance as:

A(X, t) = 1 − (1/M) Σ_{i ∈ 0..M} 1_{S(X,t)[i] > 0}    (13)

where 1_E is the indicator function for event E (i.e., it is 1 if and only if E is true). In a nutshell, A(X, t) is the normalized number of non-zero elements in the adversarial saliency map of X computed during the first crafting iteration in Algorithm 2. The closer the adversarial distance is to 1, the more likely sample X is to be hard to misclassify in target class t. Figure 15 confirms that this formula is empirically well-founded. It illustrates the value of the adversarial distance averaged per source-destination class pair, making it easy to compare the average value with the hardness matrix computed previously after crafting samples. To compute it, we slightly altered Equation 13 to sum over pairs of features, reflecting the observations made during our validation process.
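Both measures are straightforward to compute once the (success rate, average distortion) pairs and the first-iteration saliency map are available. The sketch below implements Equations 12 and 13 directly; the numerical values used to exercise it are placeholders, not measurements from our experiments.

```python
import numpy as np

def hardness(tau, eps):
    """Equation 12: trapezoidal estimate of H(s, t) from K (success rate,
    average distortion) pairs measured at increasing maximum distortions."""
    tau, eps = np.asarray(tau), np.asarray(eps)
    return float(np.sum((tau[1:] - tau[:-1]) * (eps[1:] + eps[:-1]) / 2.0))

def adversarial_distance(saliency, M):
    """Equation 13: A(X, t) = 1 - (1/M) * #{i : S(X, t)[i] > 0}, computed from
    the saliency map of the first crafting iteration."""
    return 1.0 - np.count_nonzero(np.asarray(saliency) > 0) / float(M)

# Illustrative values only (not measurements from the paper).
tau = [0.40, 0.75, 0.90, 0.97]        # success rates for growing Upsilon
eps = [0.010, 0.025, 0.032, 0.041]    # matching average distortions
print(hardness(tau, eps))

S = np.random.default_rng(0).normal(size=784)   # stand-in saliency map values
print(adversarial_distance(S, M=784))
```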
Fig. 16: Human perception of different distortions ε. [Bar chart: percentage of respondents identifying a digit and correctly classifying it, for distortion bins 0%–1.53%, 1.53%–2.8%, 2.8%–5.61%, 5.61%–14.29%, and 14.29%–100%; vertical axis from 50% to 100%.]
Fig. 17: Human perception of different intensity variations θ. [Bar chart: the same two rates for θ ∈ {−1, −0.7, −0.5, 0.1, 0.3, 0.5, 0.7, 1}; vertical axis from 50% to 100%.]
This notion of distance between classes intuitively defines a metric for the robustness of a network F against adversarial perturbations. We suggest the following definition:

R(F) = min_{(X,t)} A(X, t)    (14)

where the set of samples X considered is sufficiently large to represent the input domain of the network. A good approximation of the robustness can be computed with the training dataset. Note that the min operator used here can be replaced by other relevant operators, like the statistical expectation. The study of various operators is left as future work.

C. Study of human perception of adversarial samples

Recall that adversarial samples must not only be misclassified as the target class by deep neural networks, but also visually appear (be classified) as the source class by humans. To evaluate this property, we ran an experiment using 349 human participants on the Mechanical Turk online service. We presented three original or adversarially altered samples from the MNIST dataset to human participants. To paraphrase, participants were asked for each sample: (a) "is this sample a numeric digit?", and (b) "if yes to (a), what digit is it?". These two questions were designed to determine how distortion and intensity rates affected human perception of the samples.

The first experiment was designed to identify a baseline perception rate for the input data. The 74 participants were presented 3 of 222 unaltered samples randomly picked from the original MNIST data set. Respondents identified 97.4% of the samples as digits and correctly classified 95.3% of them.

Shown in Figure 16, a second set of experiments attempted to evaluate how the amount of distortion (ε) impacts human perception. Here, 184 participants were presented with a total of 1,707 samples with varying levels of distortion (and features altered with an intensity increase θ = +1). The experiments showed that below a threshold (ε = 14.29% distortion), participants were able to identify samples as digits (95%) and correctly classify them (90%) only slightly less accurately than the unaltered samples. The classification rate dropped dramatically (71%) at distortion rates above the threshold.

A final set of experiments evaluates the impact of intensity variations (θ) on perception, as shown in Figure 17. The 203 participants were accurate at identifying the 5,355 samples as digits (96%) and classifying them correctly (95%). At higher absolute intensities (θ = −1 and θ = +1), specific digit classification decreased slightly (90.5% and 90%), but identification as digits was largely unchanged.

While preliminary, these experiments confirm that the overwhelming majority of generated samples retain human recognizability. Note that because we can generate samples with less than the distortion threshold for almost all of the input data (ε ≤ 14.29% for roughly 97% of the MNIST data), we can produce adversarial samples that humans will not misinterpret, thus meeting our adversarial goal. Furthermore, altering feature distortion intensity provides even better results: at −0.7 ≤ θ ≤ +0.7, humans classified the sample data at essentially the same rates as the original sample data.

VI. DISCUSSION

We introduced a new class of algorithms that systematically craft adversarial samples misclassified by a DNN once an adversary possesses knowledge of the DNN architecture. Although we focused our work on DL techniques used in the context of classification and trained with supervised methods, our approach is also applicable to unsupervised architectures. Instead of achieving a given target class, the adversary achieves a target output Y*. Because the output space is more complex, it might be harder or impossible to match Y*. In that case, Equation 1 would need to be relaxed with an acceptable distance between the network output F(X*) and the adversarial target Y*. Thus, the only remaining assumption made in this paper is that DNNs are feedforward. In other words, we did not consider recurrent neural networks, with cycles in their architecture, as the forward derivative must be adapted to accommodate such networks.

One of our key results is reducing the distortion, i.e., the number of features altered, required to craft adversarial samples compared to previous work. We believe this makes adversarial crafting much easier for input domains like malware executables,
which are not as easy to perturb as images [11], [16]. This distortion reduction comes with a performance cost. Indeed, more elaborate but accurate saliency map formulae are more expensive to compute for the attacker. We would like to emphasize that our method's high success rate can be further improved by adversaries only interested in crafting a limited number of samples. Indeed, to lower the distortion of one particular sample, an adversary can use adversarial saliency maps to fine-tune the perturbation introduced. On the other hand, if an adversary wants to craft large amounts of adversarial samples, performance is important. In our evaluation, we balanced these factors to craft adversarial samples against the DNN in less than a second. As far as our algorithm implementation was concerned, the most computationally expensive steps were the matrix manipulations required to construct adversarial saliency maps from the forward derivative matrix. The complexity is dependent on the number of input features. These matrix operations can be made more efficient, notably by making better use of GPU-accelerated computations.

Our efforts so far represent a first but meaningful step towards mitigating adversarial samples: the hardness and adversarial distance metrics lay out bases for defense mechanisms. Although designing such defenses is outside of the scope of this paper, we outline two classes of defenses: (1) adversarial sample detection and (2) improvements of DNN robustness.

Developing techniques for adversarial sample detection is a reactive solution. During our experimental process, we noticed that adversarial samples can, for instance, be detected by evaluating the regularity of samples. More specifically, in our application example, the sum of the squared differences between each pair of neighboring pixels is always higher for adversarial samples than for benign samples. However, there is no a priori reason to assume that this technique will reliably detect adversarial samples in different settings, so extending this approach is one avenue for future work. Another approach was proposed in [19], but it is unsuccessful: by stacking the denoising auto-encoder used for detection with the original DNN, the adversary can again produce adversarial samples.

The second class of solutions seeks to improve training in order to increase the robustness of DNNs. Interestingly, the problem of adversarial samples is closely linked to training. Work on generative adversarial networks showed that a two-player game between two DNNs can lead to the generation of new samples from a training set [17]. This can help augment training datasets. Furthermore, adding adversarial samples to the training set can act like a regularizer [18]. We also observed in our experiments that training with adversarial samples makes crafting additional adversarial samples harder. Indeed, by adding 18,000 adversarial samples to the original MNIST training dataset, we trained a new instance of our DNN. We then ran our algorithms again on this newly trained network and crafted a set of 9,000 adversarial samples. Preliminary analysis of these crafted adversarial samples showed that the success rate was reduced by 7.2% while the average distortion increased by 37.5%, suggesting that training with adversarial samples can make DNNs more robust.

VII. RELATED WORK

The security of machine learning [2] is an active research topic within the security and machine learning communities. A broad taxonomy of attacks and required adversarial capabilities are discussed in [22] and [3], along with considerations for building defense mechanisms. Biggio et al. studied classifiers in adversarial settings and outlined a framework securing them [8]. However, their work does not consider DNNs but rather other techniques used for binary classification like logistic regression or Support Vector Machines. Generally speaking, attacks against machine learning can be separated into two categories, depending on whether they are executed during training [9] or at test time [10].

Prior work on adversarial sample crafting against DNNs derived a simple technique corresponding to the Architecture and Training Tools threat model, based on the backpropagation procedure used during network training [18], [30], [36]. This approach creates adversarial samples by defining an optimization problem based on the DNN's cost function. In other words, instead of computing gradients to update DNN weights, one computes gradients to update the input, which is then misclassified as the target class by a DNN. The alternative approach proposed in this paper is to identify input regions that are most relevant to its classification by a DNN. This is accomplished by computing the saliency map of a given input, as described by Simonyan et al. in the case of DNNs handling images [34]. We extended this concept to create adversarial saliency maps highlighting regions of the input that need to be perturbed in order to accomplish the adversarial goal.

Previous work by Yosinski et al. investigated how features are transferable between deep neural networks [38], while Szegedy et al. showed that adversarial samples can indeed be misclassified across models [36]. They report that once an adversarial sample is generated for a given neural network architecture, it is also likely to be misclassified in neural networks designed differently, which explains why the attack is successful. However, the effectiveness of this kind of attack depends on (1) the quality and size of the surrogate dataset collected by the adversary, and (2) the adequacy of the adversarial network used to craft adversarial samples.

VIII. CONCLUSIONS

Broadly speaking, this paper has explored adversarial behavior in deep learning systems. In addition to exploring the goals and capabilities of DNN adversaries, we introduced a new class of algorithms to craft adversarial samples based on computing forward derivatives. This technique allows an adversary with knowledge of the network architecture to construct adversarial saliency maps that identify features of the input that most significantly impact output classification. These algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample.

Solutions to defend DNNs against adversaries can be divided in two classes: detecting adversarial samples and
improving the training phase. The detection of adversarial samples remains an open problem. Interestingly, the universal approximation theorem formulated by Hornik et al. states that one hidden layer is sufficient to represent arbitrarily accurately a function [21]. Thus, one can intuitively conceive that improving the training phase is key to resisting adversarial samples.

In future work, we plan to address the limitations of DNNs trained in an unsupervised manner as well as cyclical recurrent neural networks (as opposed to the acyclical networks considered throughout this paper). Also, as most models of our taxonomy have yet to be researched, this leaves room for further investigation of DL in various adversarial settings.

ACKNOWLEDGMENT

The authors would like to warmly thank Dr. Damien Octeau and Aline Papernot for insightful discussions about this work. Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

REFERENCES

[1] E. G. Amoroso. Fundamentals of Computer Security Technology. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1994.
[2] M. Barreno, B. Nelson, A. D. Joseph, and J. Tygar. The security of machine learning. Machine Learning, 81(2):121–148, 2010.
[3] M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar. Can machine learning be secure? In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, pages 16–25. ACM, 2006.
[4] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[5] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), volume 4, page 3, Austin, TX, 2010.
[6] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases, pages 387–402. Springer, 2013.
[7] B. Biggio, G. Fumera, and F. Roli. Pattern recognition systems under attack: Design issues and research challenges. International Journal of Pattern Recognition and Artificial Intelligence, 28(07):1460002, 2014.
[8] B. Biggio, G. Fumera, and F. Roli. Security evaluation of pattern classifiers under attack. IEEE Transactions on Knowledge and Data Engineering, 26(4):984–996, 2014.
[9] B. Biggio, B. Nelson, and P. Laskov. Support vector machines under adversarial label noise. In ACML, pages 97–112, 2011.
[10] B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[11] B. Biggio, K. Rieck, D. Ariu, C. Wressnegger, I. Corona, G. Giacinto, and F. Roli. Poisoning behavioral malware clustering. In Proceedings of the 2014 Workshop on Artificial Intelligence and Security, pages 27–36. ACM, 2014.
[12] D. Cireşan, U. Meier, J. Masci, et al. Multi-column deep neural network for traffic sign classification. Neural Networks, 32:333–338, 2012.
[13] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.
[14] G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu. Large-scale malware classification using random projections and neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 3422–3426. IEEE, 2013.
[15] G. E. Dahl, D. Yu, et al. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.
[16] P. Fogla and W. Lee. Evading network anomaly detection systems: formal reasoning and practical techniques. In Proceedings of the 13th ACM Conference on Computer and Communications Security, pages 59–68. ACM, 2006.
[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, et al. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[18] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the 2015 International Conference on Learning Representations. Computational and Biological Learning Society, 2015.
[19] S. Gu and L. Rigazio. Towards deep neural network architectures robust to adversarial examples. In Proceedings of the 2015 International Conference on Learning Representations. Computational and Biological Learning Society, 2015.
[20] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[21] K. Hornik, M. Stinchcombe, et al. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[22] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. Tygar. Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, pages 43–58. ACM, 2011.
[23] E. Knorr. How PayPal beats the bad guys with machine learning. http://www.infoworld.com/article/2907877/machine-learning/how-paypal-reduces-fraud-with-machine-learning.html, 2015.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[25] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. The Journal of Machine Learning Research, 10:1–40, 2009.
[26] Y. LeCun, L. Bottou, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[27] Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998.
[28] LISA lab. http://deeplearning.net/tutorial/lenet.html, 2010.
[29] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[30] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Computer Vision and Pattern Recognition (CVPR 2015). IEEE, 2015.
[31] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5, 1988.
[32] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. Convolutional, long short-term memory, fully connected deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[33] H. Sak, A. Senior, and F. Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
[34] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
[36] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In Proceedings of the 2014 International Conference on Learning Representations. Computational and Biological Learning Society, 2014.
[37] Y. Taigman, M. Yang, et al. Deepface: Closing the gap to human-level performance in face verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1701–1708. IEEE, 2014.
[38] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
APPENDIX
A. Validation setup details
To train and use the deep neural network, we use
Theano [5], a Python package designed to simplify large-
scale scientific computing. Theano allows us to efficiently
implement the network architecture, the training through back-
propagation, and the forward derivative computation. We con-
figure Theano to make computations with float32 precision,
because they can then be accelerated using graphics proces-
sors. Indeed, all our experiments are facilitated using GPU
acceleration on a machine equipped with a Xeon E5-2680 v3
processor and a Nvidia Tesla K5200 graphics processor.
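A minimal sketch of such a Theano setup is given below for concreteness. The logistic-regression model is only a stand-in for the LeNet-5 variant we actually train, and the flag and variable names are standard Theano conventions rather than an excerpt from our code; the batch size, learning rate, and epoch count mirror the values reported above.

```python
import numpy as np
import theano
import theano.tensor as T

theano.config.floatX = "float32"   # float32 so computations can run on the GPU

# Hyper-parameters reported in this appendix.
batch_size, learning_rate, n_epochs = 500, 0.1, 200

# Minimal stand-in for the classifier: logistic regression on flattened MNIST
# digits (the evaluated model is a LeNet-5 variant from the Theano tutorial [28]).
x = T.matrix("x")                  # batch of inputs
y = T.ivector("y")                 # integer labels
W = theano.shared(np.zeros((784, 10), dtype="float32"))
b = theano.shared(np.zeros(10, dtype="float32"))
p_y = T.nnet.softmax(T.dot(x, W) + b)
cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])

# Gradient-descent training step via backpropagation (symbolic gradients).
grads = T.grad(cost, [W, b])
updates = [(p, p - learning_rate * g) for p, g in zip([W, b], grads)]
train_step = theano.function([x, y], cost, updates=updates)
```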
Our deep neural network makes some simplifications, sug-
gested in the Theano Documentation [28], to the original
LeNet-5 architecture. Nevertheless, once trained on batches
of 500 samples taken from the MNIST dataset [27] with a
learning parameter of η = 0.1 for 200 epochs, the learned network exhibits a 98.93% accuracy rate on the MNIST training set and a 99.41% accuracy rate on the MNIST
test set, which are comparable to state-of-the-art accuracies.