Explaining and Harnessing Adversarial Examples
These results suggest that classifiers based on modern machine learning techniques, even those
that obtain excellent performance on the test set, are not learning the true underlying concepts that
determine the correct output label. Instead, these algorithms have built a Potemkin village that works
well on naturally occurring data, but is exposed as a fake when one visits points in space that do not
have high probability in the data distribution. This is particularly disappointing because a popular
approach in computer vision is to use convolutional network features as a space where Euclidean
distance approximates perceptual distance. This resemblance is clearly flawed if images that have an
immeasurably small perceptual distance correspond to completely different classes in the network’s
representation.
These results have often been interpreted as being a flaw in deep networks in particular, even though
linear classifiers have the same problem. We regard the knowledge of this flaw as an opportunity to
fix it. Indeed, Gu & Rigazio (2014) and Chalupka et al. (2014) have already begun the first steps toward designing models that resist adversarial perturbation, though no model has yet successfully done so while maintaining state of the art accuracy on clean inputs.

3 THE LINEAR EXPLANATION OF ADVERSARIAL EXAMPLES

We start with explaining the existence of adversarial examples for linear models.
In many problems, the precision of an individual input feature is limited. For example, digital images often use only 8 bits per pixel so they discard all information below 1/255 of the dynamic range. Because the precision of the features is limited, it is not rational for the classifier to respond differently to an input x than to an adversarial input x̃ = x + η if every element of the perturbation η is smaller than the precision of the features. Formally, for problems with well-separated classes, we expect the classifier to assign the same class to x and x̃ so long as ||η||∞ < ε, where ε is small enough to be discarded by the sensor or data storage apparatus associated with our problem.

Consider the dot product between a weight vector w and an adversarial example x̃:

\[ w^\top \tilde{x} = w^\top x + w^\top \eta \]
The adversarial perturbation causes the activation to grow by w⊤η. We can maximize this increase subject to the max norm constraint on η by assigning η = ε sign(w). If w has n dimensions and the average magnitude of an element of the weight vector is m, then the activation will grow by εmn. Since ||η||∞ does not grow with the dimensionality of the problem but the change in activation caused by perturbation by η can grow linearly with n, then for high dimensional problems, we can make many infinitesimal changes to the input that add up to one large change to the output. We can think of this as a sort of “accidental steganography,” where a linear model is forced to attend exclusively to the signal that aligns most closely with its weights, even if multiple signals are present and other signals have much greater amplitude.

This explanation shows that a simple linear model can have adversarial examples if its input has sufficient dimensionality. Previous explanations for adversarial examples invoked hypothesized properties of neural networks, such as their supposed highly non-linear nature. Our hypothesis based on linearity is simpler, and can also explain why softmax regression is vulnerable to adversarial examples.
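As a quick numerical illustration of this dimensionality argument (my own sketch, not code from the paper; the weight scale and ε are arbitrary choices), the change in activation w⊤η under the max-norm-constrained perturbation η = ε sign(w) grows linearly with n even though ||η||∞ stays fixed at ε:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01  # max-norm budget; arbitrary illustrative value

for n in [10, 1000, 100000]:
    w = rng.normal(size=n)        # weight vector; average |w_i| plays the role of m
    eta = eps * np.sign(w)        # max-norm-constrained adversarial perturbation
    # The activation change w.eta equals eps * ||w||_1 = eps * m * n,
    # which grows linearly with n while ||eta||_inf stays at eps.
    print(n, w @ eta, eps * np.abs(w).mean() * n)
```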
4 LINEAR PERTURBATION OF NON-LINEAR MODELS

The linear view of adversarial examples suggests a fast way of generating them. We hypothesize that neural networks are too linear to resist linear adversarial perturbation. LSTMs (Hochreiter & Schmidhuber, 1997), ReLUs (Jarrett et al., 2009; Glorot et al., 2011), and maxout networks (Goodfellow et al., 2013c) are all intentionally designed to behave in very linear ways, so that they are easier to optimize. More nonlinear models such as sigmoid networks are carefully tuned to spend most of their time in the non-saturating, more linear regime for the same reason. This linear behavior suggests that cheap, analytical perturbations of a linear model should also damage neural networks.
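The cheap, analytical perturbation used throughout the rest of the paper is the fast gradient sign method, η = ε sign(∇x J(θ, x, y)), which appears in Figure 1 below and in the training objective later on. Here is a minimal PyTorch sketch of it, written by me rather than taken from the paper; `model` is a hypothetical classifier and cross-entropy stands in for the cost J:

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps):
    """Return x + eps * sign(grad_x J(theta, x, y)) for a classifier `model`."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)      # J(theta, x, y); cross-entropy assumed here
    grad, = torch.autograd.grad(loss, x)     # gradient of the cost w.r.t. the input
    return (x + eps * grad.sign()).detach()  # a single max-norm step of size eps
```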
Figure 1: A demonstration of fast adversarial example generation. The input x is classified as “panda” with 57.7% confidence. Adding a perturbation .007 · sign(∇x J(θ, x, y)) (itself classified as “nematode” with 8.2% confidence) yields x + ε sign(∇x J(θ, x, y)), which is classified as “gibbon” with 99.3% confidence.
Figure 2: The fast gradient sign method applied to logistic regression (where it is not an approxi-
mation, but truly the most damaging adversarial example in the max norm box). a) The weights of
a logistic regression model trained on MNIST. b) The sign of the weights of a logistic regression
model trained on MNIST. This is the optimal perturbation. Even though the model has low capacity
and is fit well, this perturbation is not readily recognizable to a human observer as having anything
to do with the relationship between 3s and 7s. c) MNIST 3s and 7s. The logistic regression model
has a 1.6% error rate on the 3 versus 7 discrimination task on these examples. d) Fast gradient sign
adversarial examples for the logistic regression model with ε = .25. The logistic regression model
has an error rate of 99% on these examples.
perturbation. Note that the sign of the gradient is just −sign(w), and that w⊤ sign(w) = ||w||₁. The adversarial version of logistic regression is therefore to minimize

\[ \mathbb{E}_{x, y \sim p_{\text{data}}}\; \zeta\!\left(y\!\left(\epsilon \|w\|_1 - w^\top x - b\right)\right) \]

This is somewhat similar to L1 regularization. However, there are some important differences. Most significantly, the L1 penalty is subtracted off the model’s activation during training, rather than added to the training cost. This means that the penalty can eventually start to disappear if the model learns to make confident enough predictions that ζ saturates. This is not guaranteed to happen—in the underfitting regime, adversarial training will simply worsen underfitting. We can thus view L1 weight decay as being more “worst case” than adversarial training, because it fails to deactivate in the case of good margin.

If we move beyond logistic regression to multiclass softmax regression, L1 weight decay becomes even more pessimistic, because it treats each of the softmax’s outputs as independently perturbable, when in fact it is usually not possible to find a single η that aligns with all of the classes’ weight vectors. Weight decay overestimates the damage achievable with perturbation even more in the case of a deep network with multiple hidden units. Because L1 weight decay overestimates the amount of damage an adversary can do, it is necessary to use a smaller L1 weight decay coefficient than the ε associated with the precision of our features. When training maxout networks on MNIST, we obtained good results using adversarial training with ε = .25. When applying L1 weight decay to the first layer, we found that even a coefficient of .0025 was too large, and caused the model to get stuck with over 5% error on the training set. Smaller weight decay coefficients permitted successful training but conferred no regularization benefit.
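To make the contrast concrete, here is a minimal sketch (my own, under the convention y ∈ {−1, +1} and ζ(z) = log(1 + exp z)): plugging the worst-case perturbation x − εy sign(w) into the usual logistic loss places ε||w||₁ inside ζ, whereas L1 weight decay adds the penalty to the cost outside ζ, so it never deactivates regardless of the margin.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)  # numerically stable log(1 + exp(z))

def adversarial_logreg_loss(w, b, X, y, eps):
    # zeta(eps*||w||_1 - y(w^T x + b)): the penalty sits inside the loss,
    # so it fades once the margin is large and zeta saturates.
    margins = y * (X @ w + b)
    return softplus(eps * np.abs(w).sum() - margins).mean()

def l1_logreg_loss(w, b, X, y, lam):
    # Ordinary logistic loss plus an L1 penalty added to the cost,
    # which stays active no matter how good the margin becomes.
    margins = y * (X @ w + b)
    return softplus(-margins).mean() + lam * np.abs(w).sum()
```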
Szegedy et al. (2014b) showed that by training on a mixture of adversarial and clean examples, a neural network could be regularized somewhat. Training on adversarial examples is somewhat different from other data augmentation schemes; usually, one augments the data with transformations such as translations that are expected to actually occur in the test set. This form of data augmentation instead uses inputs that are unlikely to occur naturally but that expose flaws in the ways that the model conceptualizes its decision function. At the time, this procedure was never demonstrated to improve beyond dropout on a state of the art benchmark. However, this was partially because it was difficult to experiment extensively with expensive adversarial examples based on L-BFGS.

We found that training with an adversarial objective function based on the fast gradient sign method was an effective regularizer:

\[ \tilde{J}(\theta, x, y) = \alpha J(\theta, x, y) + (1 - \alpha)\, J\!\left(\theta,\; x + \epsilon\,\mathrm{sign}\!\left(\nabla_x J(\theta, x, y)\right),\; y\right) \]

In all of our experiments, we used α = 0.5. Other values may work better; our initial guess of this hyperparameter worked well enough that we did not feel the need to explore more. This approach means that we continually update our supply of adversarial examples, to make them resist the current version of the model. Using this approach to train a maxout network that was also regularized with dropout, we were able to reduce the error rate from 0.94% without adversarial training to 0.84% with adversarial training.
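A sketch of this objective in PyTorch (my own illustration, not the authors' implementation; `model` is hypothetical, cross-entropy stands in for J, and the defaults α = 0.5 and ε = 0.25 follow the values reported in the text):

```python
import torch
import torch.nn.functional as F

def adversarial_training_loss(model, x, y, eps=0.25, alpha=0.5):
    """alpha * J(theta, x, y) + (1 - alpha) * J(theta, x + eps*sign(grad_x J), y)."""
    x = x.clone().detach().requires_grad_(True)
    clean_loss = F.cross_entropy(model(x), y)                  # J(theta, x, y)
    grad, = torch.autograd.grad(clean_loss, x, retain_graph=True)
    x_adv = (x + eps * grad.sign()).detach()                   # regenerated every step, so the
    adv_loss = F.cross_entropy(model(x_adv), y)                # adversarial examples track the
    return alpha * clean_loss + (1 - alpha) * adv_loss         # current version of the model
```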
We observed that we were not reaching zero error rate on adversarial examples on the training set. We fixed this problem by making two changes. First, we made the model larger, using 1600 units per layer rather than the 240 used by the original maxout network for this problem. Without adversarial training, this causes the model to overfit slightly, and get an error rate of 1.14% on the test set. With adversarial training, we found that the validation set error leveled off over time, and made very slow progress. The original maxout result uses early stopping, and terminates learning after the validation set error rate has not decreased for 100 epochs. We found that while the validation set error was very flat, the adversarial validation set error was not. We therefore used early stopping on the adversarial validation set error. Using this criterion to choose the number of epochs to train
for, we then retrained on all 60,000 examples. Five different training runs using different seeds
for the random number generators used to select minibatches of training examples, initialize model
weights, and generate dropout masks result in four trials that each had an error rate of 0.77% on
the test set and one trial that had an error rate of 0.83%. The average of 0.782% is the best result
reported on the permutation invariant version of MNIST, though statistically indistinguishable from
the result obtained by fine-tuning DBMs with dropout (Srivastava et al., 2014) at 0.79%.
The model also became somewhat resistant to adversarial examples. Recall that without adversarial
training, this same kind of model had an error rate of 89.4% on adversarial examples based on the fast
gradient sign method. With adversarial training, the error rate fell to 17.9%. Adversarial examples
are transferable between the two models but with the adversarially trained model showing greater
robustness. Adversarial examples generated via the original model yield an error rate of 19.6% on
the adversarially trained model, while adversarial examples generated via the new model yield an
error rate of 40.9% on the original model. When the adversarially trained model does misclassify an
adversarial example, its predictions are unfortunately still highly confident. The average confidence
on a misclassified example was 81.4%. We also found that the weights of the learned model changed
significantly, with the weights of the adversarially trained model being significantly more localized
and interpretable (see Fig. 3).
Figure 3: Weight visualizations of maxout networks trained on MNIST. Each row shows the filters for a single maxout unit. Left) Naively trained model. Right) Model with adversarial training.

The adversarial training procedure can be seen as minimizing the worst case error when the data is perturbed by an adversary. That can be interpreted as learning to play an adversarial game, or as minimizing an upper bound on the expected cost over noisy samples with noise from U(−ε, ε) added to the inputs. Adversarial training can also be seen as a form of active learning, where the model is able to request labels on new points. In this case the human labeler is replaced with a heuristic labeler that copies labels from nearby points.

We could also regularize the model to be insensitive to changes in its features that are smaller than the ε precision simply by training on all points within the ε max norm box, or sampling many points within this box. This corresponds to adding noise with max norm ε during training. However, noise with zero mean and zero covariance is very inefficient at preventing adversarial examples. The expected dot product between any reference vector and such a noise vector is zero. This means that in many cases the noise will have essentially no effect rather than yielding a more difficult input. In fact, in many cases the noise will actually result in a lower objective function value. We can think of adversarial training as doing hard example mining among the set of noisy inputs, in order to train more efficiently by considering only those noisy points that strongly resist classification. As control experiments, we trained a maxout network with noise based on randomly adding ±ε to each pixel, or adding noise in U(−ε, ε) to each pixel. These obtained an error rate of 86.2% with confidence 97.3% and an error rate of 90.4% with a confidence of 97.8% respectively on fast gradient sign adversarial examples.
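A quick numerical check of the argument about zero-mean noise (my own sketch; the dimensions and scales are arbitrary): the dot product between a fixed weight vector and uniform noise is near zero on average, while the sign-aligned perturbation achieves the full ε||w||₁.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 784, 0.25                     # MNIST-sized input; eps follows the text
w = rng.normal(size=n)                 # a fixed reference (weight) vector

uniform_noise = rng.uniform(-eps, eps, size=(10000, n))
print("uniform noise, mean w.eta:", (uniform_noise @ w).mean())   # close to 0
print("sign-aligned,       w.eta:", w @ (eps * np.sign(w)))       # eps * ||w||_1
```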
Because the derivative of the sign function is zero or undefined everywhere, gradient descent on
the adversarial objective function based on the fast gradient sign method does not allow the model
to anticipate how the adversary will react to changes in the parameters. If we instead use adversarial
examples based on small rotations or addition of the scaled gradient, then the perturbation process
is itself differentiable and the learning can take the reaction of the adversary into account. However,
we did not find nearly as powerful of a regularizing result from this process, perhaps because these
kinds of adversarial examples are not as difficult to solve.
One natural question is whether it is better to perturb the input or the hidden layers or both. Here
the results are inconsistent. Szegedy et al. (2014b) reported that adversarial perturbations yield the
best regularization when applied to the hidden layers. That result was obtained on a sigmoidal
network. In our experiments with the fast gradient sign method, we find that networks with hidden
units whose activations are unbounded simply respond by making their hidden unit activations very
large, so it is usually better to just perturb the original input. On saturating models such as the Rust
model we found that perturbation of the input performed comparably to perturbation of the hidden
layers. Perturbations based on rotating the hidden layers solve the problem of unbounded activations
growing to make additive perturbations smaller by comparison. We were able to successfully train
maxout networks with rotational perturbations of the hidden layers. However, this did not yield
nearly as strong of a regularizing effect as additive perturbation of the input layer. Our view of adversarial training is that it is only clearly useful when the model has the capacity to learn to resist adversarial examples. This is only clearly the case when a universal approximator theorem applies. Because the last layer of a neural network, the linear-sigmoid or linear-softmax layer, is not a universal approximator of functions of the final hidden layer, this suggests that one is likely to encounter problems with underfitting when applying adversarial perturbations to the final hidden layer. We indeed found this effect. Our best results with training using perturbations of hidden layers never involved perturbations of the final hidden layer.

7 DIFFERENT KINDS OF MODEL CAPACITY

One reason that the existence of adversarial examples can seem counter-intuitive is that most of us have poor intuitions for high dimensional spaces. We live in three dimensions, so we are not used to small effects in hundreds of dimensions adding up to create a large effect. There is another way that our intuitions serve us poorly. Many people think of models with low capacity as being unable to make many different confident predictions. This is not correct. Some models with low capacity do exhibit this behavior. For example, shallow RBF networks with

\[ p(y = 1 \mid x) = \exp\!\left((x - \mu)^\top \beta (x - \mu)\right) \]

are only able to confidently predict that the positive class is present in the vicinity of µ. Elsewhere, they default to predicting the class is absent, or have low-confidence predictions.
RBF networks are naturally immune to adversarial examples, in the sense that they have low con-
fidence when they are fooled. A shallow RBF network with no hidden layers gets an error rate
of 55.4% on MNIST using adversarial examples generated with the fast gradient sign method and
ε = .25. However, its confidence on mistaken examples is only 1.2%. Its average confidence on
clean test examples is 60.6%. We can’t expect a model with such low capacity to get the right an-
swer at all points of space, but it does correctly respond by reducing its confidence considerably on
points it does not “understand.”
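A minimal sketch of such a shallow RBF unit (my own illustration with arbitrary µ and β; I write the exponent with an explicit minus sign and a positive definite β, which corresponds to a negative definite β in the formula above): confidence is near 1 only in the vicinity of µ and collapses elsewhere.

```python
import numpy as np

def rbf_confidence(x, mu, beta):
    # p(y = 1 | x) = exp(-(x - mu)^T beta (x - mu)) with beta positive definite,
    # so confidence is ~1 at the template mu and decays quickly away from it.
    d = x - mu
    return np.exp(-d @ beta @ d)

n = 784
mu = np.zeros(n)
beta = 0.05 * np.eye(n)                                   # arbitrary precision matrix
print(rbf_confidence(mu, mu, beta))                       # 1.0 at the template
print(rbf_confidence(mu + 0.25 * np.ones(n), mu, beta))   # much lower away from it
```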
RBF units are unfortunately not invariant to any significant transformations so they cannot generalize
very well. We can view linear units and RBF units as different points on a precision-recall tradeoff
curve. Linear units achieve high recall by responding to every input in a certain direction, but may
have low precision due to responding too strongly in unfamiliar situations. RBF units achieve high
precision by responding only to a specific point in space, but in doing so sacrifice recall. Motivated
by this idea, we decided to explore a variety of models involving quadratic units, including deep
RBF networks. We found this to be a difficult task—every model with sufficient quadratic inhibition
to resist adversarial perturbation obtained high training set error when trained with SGD.
8 WHY DO ADVERSARIAL EXAMPLES GENERALIZE?

An intriguing aspect of adversarial examples is that an example generated for one model is often misclassified by other models, even when they have different architectures or were trained on disjoint training sets. Moreover, when these different models misclassify an adversarial example, they often agree with each other on its class. Explanations based on extreme non-linearity and overfitting cannot readily account for this behavior—why should multiple extremely non-linear models with excess capacity consistently label out-of-distribution points in the same way? This behavior is especially surprising from the view of the hypothesis that adversarial examples finely tile space like the rational numbers among the reals, because in this view adversarial examples are common but occur only at very precise locations.

Under the linear view, adversarial examples occur in broad subspaces. The direction η need only have positive dot product with the gradient of the cost function, and ε need only be large enough. Fig. 4 demonstrates this phenomenon. By tracing out different values of ε we see that adversarial examples occur in contiguous regions of the 1-D subspace defined by the fast gradient sign method, not in fine pockets. This explains why adversarial examples are abundant and why an example misclassified by one classifier has a fairly high prior probability of being misclassified by another classifier.
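A sketch of the kind of ε sweep shown in Fig. 4 (my own, reusing the hypothetical `model` convention from the earlier sketches): fix the direction sign(∇x J) once and record the logits as ε varies, including negative values.

```python
import torch
import torch.nn.functional as F

def sweep_epsilon(model, x, y, eps_values):
    """Trace the 1-D subspace x + eps * sign(grad_x J) for several values of eps."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    direction = grad.sign()                       # fixed fast-gradient-sign direction
    with torch.no_grad():
        # One row of logits per eps; wrong classes persist over wide ranges of eps.
        return torch.stack([model(x + eps * direction) for eps in eps_values])
```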
To explain why multiple classifiers assign the same class to adversarial examples, we hypothesize that neural networks trained with current methodologies all resemble the linear classifier learned on the same training set. This reference classifier is able to learn approximately the same classification weights when trained on different subsets of the training set, simply because machine learning algorithms are able to generalize. The stability of the underlying classification weights in turn results in the stability of adversarial examples.

To test this hypothesis, we generated adversarial examples on a deep maxout network and classified these examples using a shallow softmax network and a shallow RBF network. On examples that were misclassified by the maxout network, the RBF network predicted the maxout network’s class assignment only 16.0% of the time, while the softmax classifier predicted the maxout network’s class correctly 54.6% of the time. These numbers are largely driven by the differing error rates of the different models though. If we restrict our attention to cases where both models being compared make a mistake, then softmax regression predicts maxout’s class 84.6% of the time, while the RBF network is able to predict maxout’s class only 54.3% of the time. For comparison, the RBF network can predict softmax regression’s class 53.6% of the time, so it does have a strong linear component to its own behavior. Our hypothesis does not explain all of the maxout network’s mistakes or all of the mistakes that generalize across models, but clearly a significant proportion of them are consistent with linear behavior being a major cause of cross-model generalization.
Figure 4: By tracing out different values of ε, we can see that adversarial examples occur reliably for almost any sufficiently large value of ε provided that we move in the correct direction. Correct classifications occur only on a thin manifold where x occurs in the data. Most of R^n consists of adversarial examples and rubbish class examples (see the appendix). This plot was made from a naively trained maxout network. Left) A plot showing the argument to the softmax layer for each of the 10 MNIST classes as we vary ε on a single input example. The correct class is 4. We see that the unnormalized log probabilities for each class are conspicuously piecewise linear with ε and that the wrong classifications are stable across a wide region of ε values. Moreover, the predictions become very extreme as we increase ε enough to move into the regime of rubbish inputs. Right) The inputs used to generate the curve (upper left = negative ε, lower right = positive ε, yellow boxes indicate correctly classified inputs).

9 ALTERNATIVE HYPOTHESES

We now consider and refute some alternative hypotheses for the existence of adversarial examples. First, one hypothesis is that generative training could provide more constraint on the training process, or cause the model to learn to distinguish “real” from “fake” data and be confident only on “real” data. The MP-DBM (Goodfellow et al., 2013a) provides a good model to test this hypothesis. Its inference procedure gets good classification accuracy (an 0.88% error rate) on MNIST. This inference procedure is differentiable. Other generative models either have non-differentiable inference procedures, making it harder to compute adversarial examples, or require an additional non-generative discriminator model to get good classification accuracy on MNIST. In the case of the MP-DBM, we can be sure that the generative model itself is responding to adversarial examples, rather than the non-generative classifier model on top. We find that the model is vulnerable to adversarial examples. With an ε of 0.25, we find an error rate of 97.5% on adversarial examples generated from the MNIST test set. It remains possible that some other form of generative training could confer resistance, but clearly the mere fact of being generative is not alone sufficient.
Another hypothesis about why adversarial examples exist is that individual models have strange quirks but averaging over many models can cause adversarial examples to wash out. To test this hypothesis, we trained an ensemble of twelve maxout networks on MNIST. Each network was trained using a different seed for the random number generator used to initialize the weights, generate dropout masks, and select minibatches of data for stochastic gradient descent. The ensemble gets an error rate of 91.1% on adversarial examples designed to perturb the entire ensemble with ε = .25. If we instead use adversarial examples designed to perturb only one member of the ensemble, the error rate falls to 87.9%. Ensembling provides only limited resistance to adversarial perturbation.
10 SUMMARY AND DISCUSSION

As a summary, this paper has made the following observations:

• Adversarial examples can be explained as a property of high-dimensional dot products. They are a result of models being too linear, rather than too nonlinear.

• The generalization of adversarial examples across different models can be explained as a result of adversarial perturbations being highly aligned with the weight vectors of a model, and different models learning similar functions when trained to perform the same task.
• The direction of perturbation, rather than the specific point in space, matters most. Space is
not full of pockets of adversarial examples that finely tile the reals like the rational numbers.
• Because it is the direction that matters most, adversarial perturbations generalize across
different clean examples.
Some further observations concerning rubbish class examples are presented in the appendix:
Gradient-based optimization is the workhorse of modern AI. Using a network that has been designed
to be sufficiently linear–whether it is a ReLU or maxout network, an LSTM, or a sigmoid network
that has been carefully configured not to saturate too much– we are able to fit most problems we care
about, at least on the training set. The existence of adversarial examples suggests that being able to
explain the training data or even being able to correctly label the test data does not imply that our
models truly understand the tasks we have asked them to perform. Instead, their linear responses are
overly confident at points that do not occur in the data distribution, and these confident predictions
are often highly incorrect. This work has shown we can partially correct for this problem by explic-
itly identifying problematic points and correcting the model at each of these points. However, one
may also conclude that the model families we use are intrinsically flawed. Ease of optimization has
come at the cost of models that are easily misled. This motivates the development of optimization
procedures that are able to train models whose behavior is more locally stable.
ACKNOWLEDGMENTS
We would like to thank Geoffrey Hinton and Ilya Sutskever for helpful discussions. We would
also like to thank Jeff Dean, Greg Corrado, and Oriol Vinyals for their feedback on drafts of this
article. We would like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012),
Pylearn2 (Goodfellow et al., 2013b), and DistBelief (Dean et al., 2012).
REFERENCES
Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian J., Bergeron, Arnaud,
Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. Deep Learning
and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guil-
laume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expres-
sion compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral
Presentation.
Chalupka, K., Perona, P., and Eberhardt, F. Visual Causal Feature Learning. ArXiv e-prints, December 2014.
Dean, Jeffrey, Corrado, Greg S., Monga, Rajat, Chen, Kai, Devin, Matthieu, Le, Quoc V., Mao, Mark Z.,
Ranzato, Marc'Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, and Ng, Andrew Y. Large scale distributed
deep networks. In NIPS, 2012.
Deng, Jia, Dong, Wei, Socher, Richard, jia Li, Li, Li, Kai, and Fei-fei, Li. Imagenet: A large-scale hierarchical
image database. In In CVPR, 2009.
Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep sparse rectifier neural networks. In JMLR W&CP:
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS
2011), April 2011.
Goodfellow, Ian J., Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Multi-prediction deep Boltzmann
machines. In Neural Information Processing Systems, December 2013a.
Goodfellow, Ian J., Warde-Farley, David, Lamblin, Pascal, Dumoulin, Vincent, Mirza, Mehdi, Pascanu, Razvan,
Bergstra, James, Bastien, Frédéric, and Bengio, Yoshua. Pylearn2: a machine learning research library. arXiv
preprint arXiv:1308.4214, 2013b.
Goodfellow, Ian J., Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Maxout net-
works. In Dasgupta, Sanjoy and McAllester, David (eds.), International Conference on Machine Learning,
pp. 1319–1327, 2013c.
Gu, Shixiang and Rigazio, Luca. Towards deep neural network architectures robust to adversarial examples. In
NIPS Workshop on Deep Learning and Representation Learning, 2014.
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Hornik, Kurt, Stinchcombe, Maxwell, and White, Halbert. Multilayer feedforward networks are universal
approximators. Neural Networks, 2:359–366, 1989.
Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato, Marc’Aurelio, and LeCun, Yann. What is the best multi-stage
architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV’09), pp.
2146–2153. IEEE, 2009.
Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. Technical
report, University of Toronto, 2009.
Nguyen, A., Yosinski, J., and Clune, J. Deep Neural Networks are Easily Fooled: High Confidence Predictions
for Unrecognizable Images. ArXiv e-prints, December 2014.
Rust, Nicole, Schwartz, Odelia, Movshon, J. Anthony, and Simoncelli, Eero. Spatiotemporal elements of
macaque V1 receptive fields. Neuron, 46(6):945–956, 2005.
Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout:
A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15
(1):1929–1958, 2014.
Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Du-
mitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. Technical report,
arXiv preprint arXiv:1409.4842, 2014a.
Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian J.,
and Fergus, Rob. Intriguing properties of neural networks. ICLR, abs/1312.6199, 2014b. URL http:
//arxiv.org/abs/1312.6199.
Figure 5: Randomly generated fooling images for a convolutional network trained on CIFAR-
10. These examples were generated by drawing a sample from an isotropic Gaussian, then taking a
gradient sign step in the direction that increases the probability of the “airplane” class. Yellow boxes
indicate samples that successfully fool the model into believing an airplane is present with at least
50% confidence. “Airplane” is the hardest class to construct fooling images for on CIFAR-10, so
this figure represents the worst case in terms of success rate.
These experiments suggest that the optimization algorithms employed by Nguyen et al. (2014) are overkill
(or perhaps only needed on ImageNet), and that the rich geometric structure in their fooling images is due to
the priors encoded in their search procedures, rather than those structures being uniquely able to cause false
positives.
Though Nguyen et al. (2014) focused their attention on deep networks, shallow linear models have the same
problem. A softmax regression model has an error rate of 59.8% on the rubbish examples, with an average
confidence on mistakes of 70.8%. If we use instead an RBF network, which does not behave like a linear
function, we find an error rate of 0%. Note that when the error rate is zero the average confidence on a mistake
is undefined.
Nguyen et al. (2014) focused on the problem of generating fooling images for a specific class, which is a harder
problem than simply finding points that the network confidently classifies as belonging to any one class despite
being defective. The above methods on MNIST and CIFAR-10 tend to have a very skewed distribution over
classes. On MNIST, 45.3% of a naively trained maxout network’s false positives were classified as 5s, and
none were classified as 8s. Likewise, on CIFAR-10, 49.7% of the convolutional network’s false positives were
classified as frogs, and none were classified as airplanes, automobiles, horses, ships, or trucks.
To solve the problem introduced by Nguyen et al. (2014) of generating a fooling image for a particular class,
we propose adding ∇x p(y = i | x) to a Gaussian sample x as a fast method of generating a fooling image
classified as class i. If we repeat this sampling process until it succeeds, we obtain a randomized algorithm with
variable runtime. On CIFAR-10, we found that one sampling step had a 100% success rate for frogs and trucks,
and the hardest class was airplanes, with a success rate of 24.7% per sampling step. Averaged over all ten
classes, the method has an average per-step success rate of 75.3%. We can thus generate any desired class
with a handful of samples and no special priors, rather than tens of thousands of generations of evolution. To
confirm that the resulting examples are indeed fooling images, and not images of real classes rendered by the
gradient sign method, see Fig. 5. The success rate of this method in terms of generating members of class i may
degrade for datasets with more classes, since the risk of inadvertently increasing the activation of a different
class j increases in that case. We found that we were able to train a maxout network to have a zero percent
error rate on Gaussian rubbish examples (it was still vulnerable to rubbish examples generated by applying a
fast gradient sign step to a Gaussian sample) with no negative impact on its ability to classify clean examples.
Unfortunately, unlike training on adversarial examples, this did not result in any significant reduction of the
model’s test set error rate.
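A rough sketch of this randomized procedure (my own illustration; `model` is a hypothetical classifier, the step size is a placeholder, and the 50% confidence threshold follows the caption of Fig. 5): sample Gaussian noise, take one gradient sign step that increases p(y = i | x), and resample until the step succeeds.

```python
import torch
import torch.nn.functional as F

def fooling_image(model, target_class, shape, eps=0.25, threshold=0.5):
    """Sample isotropic Gaussian noise and take one gradient sign step toward
    the target class, resampling until the model's confidence reaches `threshold`."""
    while True:
        x = torch.randn(shape, requires_grad=True)              # Gaussian rubbish sample
        prob = F.softmax(model(x), dim=-1)[0, target_class]     # p(y = i | x)
        grad, = torch.autograd.grad(prob, x)
        x_fool = (x + eps * grad.sign()).detach()               # step that raises p(y = i | x)
        with torch.no_grad():
            conf = F.softmax(model(x_fool), dim=-1)[0, target_class]
        if conf >= threshold:                                    # "success" as in Fig. 5 (>= 50%)
            return x_fool
```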
In conclusion, it appears that a randomly selected input to deep or shallow models built from linear parts is
overwhelmingly likely to be processed incorrectly, and that these models only behave reasonably on a very thin
manifold encompassing the training data.