Random Spiking and Systematic Evaluation of Defenses Against Adversarial Examples

Huangyi Ge, Sze Yiu Chau, Bruno Ribeiro, Ninghui Li
Purdue University
ABSTRACT
Image classifiers often suffer from adversarial examples, which are generated by strategically adding a small amount of noise to input images to trick classifiers into misclassification. Over the years, many defense mechanisms have been proposed, and different researchers have made seemingly contradictory claims on their effectiveness. We present an analysis of possible adversarial models, and propose an evaluation framework for comparing different defense mechanisms. As part of the framework, we introduce a more powerful and realistic adversary strategy. Furthermore, we propose a new defense mechanism called Random Spiking (RS), which generalizes dropout and introduces random noises in the training process in a controlled manner. Evaluations under our proposed framework suggest RS delivers better protection against adversarial examples than many existing schemes.

CCS CONCEPTS
• Computing methodologies → Neural networks; Supervised learning by classification; • Security and privacy → Domain-specific security and privacy architectures.

KEYWORDS
random spiking, adversarial example, neural network

ACM Reference Format:
Huangyi Ge, Sze Yiu Chau, Bruno Ribeiro, and Ninghui Li. 2020. Random Spiking and Systematic Evaluation of Defenses Against Adversarial Examples. In Proceedings of the Tenth ACM Conference on Data and Application Security and Privacy (CODASPY '20), March 16–18, 2020, New Orleans, LA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3374664.3375736

1 INTRODUCTION
Modern society increasingly relies on software systems trained by machine learning (ML) techniques. Many such techniques, however, were designed under the implicit assumption that both the training and test data follow the same static (although possibly unknown) distribution. In the presence of intelligent and resourceful adversaries, this assumption no longer holds. Such an adversary can deliberately manipulate a test instance and cause the trained models to behave unexpectedly. For example, it is found that existing image classifiers based on Deep Neural Networks are highly vulnerable to adversarial examples [16, 33]. Oftentimes, by modifying an image in a way that is barely noticeable by humans, the classifier will confidently classify it as something else. This phenomenon also exists for classifiers that do not use neural networks, and has been called "optical illusions for machines". Understanding why adversarial examples work and how to defend against them is becoming increasingly important, as machine learning techniques are used ubiquitously, for example, in transformative technologies such as autonomous cars, unmanned aerial vehicles, and so on.

Many approaches have been proposed to help defend against adversarial examples. Goodfellow et al. [33] proposed adversarial training, in which one trains a neural network using both the original training dataset and the newly generated adversarial examples. In region-based classification [6], one aggregates predictions on multiple perturbed versions of an input instance to make the final prediction. Some approaches attempt to train additional neural network models to identify and reject adversarial examples [24, 41].

We point out that since the adversary can choose instances and shift the test distribution after a model is trained, adversarial examples exist so long as ML models differ from human perception on some instances. (These instances can be used as adversarial examples.) Thus adversarial examples are unlikely to be completely eliminated. What we can do is to reduce the number of such instances by training ML models that better match human perception, and by making it more difficult for the attacker to find adversarial examples.

While the research community has seen a proliferation of proposals of defense mechanisms, conducting a thorough evaluation and a fair head-to-head comparison of different mechanisms remains challenging. In Section 3, we analyze possible adversarial models, and propose to conduct evaluation in a variety of models, including both white-box and translucent-box attacks. In translucent-box attacks, the adversary is assumed to know the defense mechanism, model architecture, and distribution of training data, but not the precise parameters of the target model. With this knowledge, the adversary can train one or more surrogate models, and then generate adversarial examples leveraging such surrogate models.

While other research efforts have attempted to generate adversarial examples based on surrogate models and then assess transferability, existing applications of this method do not fully exploit the potential of surrogate models. As a result, one can overestimate the effectiveness of defenses. We propose two improvements. First, one can train many surrogate models under the same configuration, and then generate adversarial examples that can simultaneously fool multiple surrogate models at the same time. Second, one can reserve some surrogate models as "validation models".
These validation models are not used when generating adversarial examples; however, generated adversarial examples are first run against them, and only those examples that are able to fool a certain percentage of the validation models are used in evaluation against the target model. This models a more determined and resourceful attacker who is willing to spend more resources to find more effective adversarial examples to deploy, a scenario that is certainly realistic. Our experimental results demonstrate that these more sophisticated adversary strategies lead to significantly higher transferability rates.

Furthermore, in Section 4, we propose a new defense mechanism called Random Spiking, where random noises are added to one or more hidden layers during training. Random Spiking generalizes the idea of dropout [30], where hidden units are randomly dropped during training. In Random Spiking, the outputs of some randomly chosen units are replaced by random noise during training.

In Section 5, we present extensive evaluations of several existing defense mechanisms and Random Spiking (RS) under both white-box and translucent-box attacks, and empirically show that RS, especially when combined with adversarial training, improves the resiliency against adversarial examples.

In summary, we make three contributions. (1) The proposed evaluation methodology, especially the more powerful and realistic adversary strategy of attacking multiple surrogates in parallel and using validation models to filter. (2) The idea of Random Spiking, which is demonstrated to offer additional resistance to adversarial examples. (3) We provide a thorough evaluation of several defense mechanisms against adversarial examples, improving our understanding of them.

2 BACKGROUND
We consider neural networks that are used as m-class classifiers, where the output of a network is computed using the softmax function. Given such a neural network used for classification, let z(x) denote the vector output of the final layer before the softmax activation, and C(x) denote the classifier defined by the neural network. Then,

    C(x) = \arg\max_i \left( \exp(z(x)_i) \Big/ \sum_{j=1}^{m} \exp(z(x)_j) \right).

Oftentimes, \exp(z(x)_i) / \sum_{j=1}^{m} \exp(z(x)_j) is interpreted as the probability that the input x belongs to the i-th category, and the classifier chooses the class with the highest probability. Under this interpretation, the output z(x) is related to the log of odds ratios, and is thus called the logits output.

2.1 Adversarial Examples
Given a dataset D of instances, each of the form (x, y), where x gives the features of the instance and y the label, and a classifier C(·) trained using a subset of D, we say that an instance x′ is an adversarial example if and only if there exists an instance (x, y) ∈ D such that x′ is close to x, C(x) = y, and C(x′) ≠ y.

Note that in the above we did not define what "x′ is close to x" means. Intuitively, when x represents an image, by closeness we mean human perceptual similarity. However, we are unaware of any mathematical distance metric that accurately measures human perceptual similarity. In the literature, Lp norms are often used as the distance metric for closeness. Lp is defined as

    L_p(x, x') = \lVert x - x' \rVert_p = \left( \sum_{i=1}^{n} |x_i - x'_i|^p \right)^{1/p}.

The commonly used Lp metrics include: L0, the number of changed pixels [26]; L1, the sum of absolute values of the changes in all pixels [13]; L2, the Euclidean norm [9, 11, 25, 33]; and L∞, the maximum absolute change [16]. In this paper, we use L2, which reflects both the number of changed pixels and the magnitude of their change. We call L2(x, x′) the distortion of the adversarial example x′.

When generating an adversarial example against a classifier C(·), one typically starts from an existing instance (x, y) and generates x′. In an untargeted attack, one generates x′ such that C(x′) ≠ y. In a targeted attack, one has a desired target label t ≠ y and generates x′ such that C(x′) = t.

Goodfellow et al. [16] proposed the fast gradient sign (FGS) attack, which generates adversarial examples based on the gradient sign of the loss value according to the input image. A more effective attack is proposed by Carlini and Wagner [9], which we call the C&W attack. Given a neural network with logits output z, an input x, and a target class label t, the C&W attack tries to solve the following optimization problem:

    \arg\min_{x'} \; \lVert x - x' \rVert_p + c \cdot l(x')        (1)

where the loss function l is defined as

    l(x') = \max\bigl( \max\{ z(x')_i : i \neq t \} - z(x')_t, \; -K \bigr).

Here, K is called the confidence value, and is a positive number that one can choose. Intuitively, we desire z(x′)_t to be higher than any z(x′)_i where i ≠ t, so that the neural network predicts label t on input x′. Furthermore, we prefer the gap between the logit of the class t and the highest logit of any class other than t to be as large as possible (until the gap is K, at which point we consider the gap to be sufficiently large). In general, choosing a large value K would result in adversarial examples that have a higher distortion, but will be classified to the desired label with higher confidence. The parameter c > 0 in Eq. (1) is a regularization constant to adjust the relative importance of minimizing the distortion versus minimizing the loss function l. In the attack, c is initially set to a small initial value, and then dynamically adjusted based on the progress made by the iterative optimization process.

The C&W attack uses the Adam algorithm [19] to solve the optimization problem in Eq. (1). Adam performs iterative gradient-based optimization, based on adaptive estimates of lower-order moments. Compared to other attacks, such as FGS, the C&W attack is more time-consuming. However, it is able to find more effective adversarial examples.
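To make the objective concrete, the following is a minimal sketch of one evaluation of the C&W objective of Eq. (1), written in Python with PyTorch. It is not the authors' implementation: the logits-returning `model`, the constant `c`, and the confidence `K` are assumed inputs, and a full attack would additionally apply the change-of-variables and the binary search over c described in [9], optimizing x′ with Adam.

```python
import torch

def cw_objective(model, x, x_adv, target, c=1.0, K=0.0, p=2):
    """One evaluation of ||x - x'||_p + c * l(x') for a single image batch of size 1."""
    logits = model(x_adv)                      # shape: (1, num_classes)
    target_logit = logits[0, target]
    other = logits[0].clone()
    other[target] = float("-inf")              # exclude the target class
    max_other = other.max()                    # max_{i != t} z(x')_i
    # l(x') = max( max_{i != t} z(x')_i - z(x')_t , -K )
    l = torch.clamp(max_other - target_logit, min=-K)
    distortion = torch.norm((x - x_adv).flatten(), p=p)
    return distortion + c * l
```

In an attack loop, `x_adv` would be a tensor with `requires_grad=True`, repeatedly updated by `torch.optim.Adam([x_adv])` to minimize this objective.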

2.2 Existing Defenses
Many approaches have been proposed to help defend against adversarial examples. Here we give an overview of some of them.

Adversarial Training. Goodfellow et al. [16] proposed to train a neural network using both the training dataset and newly generated adversarial examples. In [16], it is shown that models that have gone through adversarial training provide some resistance against adversarial examples generated by the FGS method.
Defensive Distillation. Distillation training was originally proposed by Hinton et al. [18] for the purpose of distilling knowledge out of a large model (one with many parameters) to train a more compact model (one with fewer parameters). Given a model whose knowledge one wants to distill, one applies the model to each instance in the training dataset and generates a probability vector, which is used as the new label for the instance. This is called the soft label because, instead of a single class, the label includes probabilities for different classes. A new model is trained using instances with soft labels. The intuition is that the probabilities, even those that are not the largest in a vector, encode valuable knowledge. To make this knowledge more pronounced, the probability vector is generated after dividing the logits output by a temperature constant T > 1. This has the effect of making the smaller probabilities larger and more pronounced. The new model is trained with the same temperature. However, when deploying the model for prediction, the temperature is set to 1.

Defensive Distillation [27] is motivated by the original distillation training proposed by Hinton et al. [18]. The main difference between the two training methods is that defensive distillation uses the same network architecture for both the initial network and the distilled network. This is because the goal of using distillation here is not to train a model that has a smaller size, but to train a more robust model.
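As a small illustration of the temperature-scaled soft labels described above (not the authors' code; the logits and the temperature value 20 are made-up numbers for illustration):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; a larger T makes small probabilities more pronounced."""
    z = logits / T
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([8.0, 2.0, 0.5, -1.0])
soft_label = softmax_with_temperature(teacher_logits, T=20.0)  # used to train the distilled model
hard_pred = softmax_with_temperature(teacher_logits, T=1.0)    # temperature reset to 1 at deployment
```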
Dropout. Dropout [30] was introduced to improve generalization accuracy through the introduction of randomness in training. The term "dropout" refers to dropping out units, i.e., temporarily removing the units along with all their incoming and outgoing connections. In the simplest case, during each training epoch, each unit is retained with a fixed probability p independent of other units, where p can be chosen using a validation set or can simply be set to 0.5, which was suggested by the authors of [30].

There are several intuitions why Dropout is effective in reducing generalization errors. One is that after applying Dropout, the model is always trained with a subset of the units in the neural network. This prevents units from co-adapting too much. That is, a unit cannot depend on the existence of another unit, and needs to learn to do something useful on its own. Another intuition is that training with Dropout approximates the simultaneous training of an exponential number of "thinned" networks. In the original proposal, dropout is applied in training, but not in testing. During testing, without applying Dropout, the prediction approximates an averaged output of all these thinned networks. In Monte Carlo dropout [14], dropout is also applied in testing. The NN is run multiple times, and the resulting prediction probabilities are averaged for making the prediction. This more directly approximates the behavior of using the NN as an ensemble of models.

Since Dropout introduces randomness in the training process, two models that are trained with Dropout are likely to be less similar than two models that are trained without using Dropout. Defensive Dropout [37] explicitly uses dropout for defense against adversarial examples. It applies dropout in testing, but runs the network just once. In addition, it tunes the dropout rate used in testing by iteratively generating adversarial examples and choosing a drop rate to both maximize the testing accuracy and minimize the attack success rate.
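The Monte Carlo dropout decision procedure described above can be sketched as follows in PyTorch. This is an illustrative assumption rather than the authors' code; keeping dropout active at test time is achieved here by leaving the model in training mode, and the number of passes (10) mirrors the setting used later in Section 5.

```python
import torch

def mc_dropout_predict(model, x, n_runs=10):
    """Run the network n_runs times with dropout active and average the softmax outputs."""
    model.train()   # keeps dropout layers active; in a real model, switch only dropout
                    # sub-modules to train mode so batch-norm statistics are unaffected
    with torch.no_grad():
        probs = [torch.softmax(model(x), dim=1) for _ in range(n_runs)]
    return torch.stack(probs).mean(dim=0).argmax(dim=1)
```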
Region-based Classification. Cao and Gong [6] proposed region-based classification to defend against adversarial examples. Given an input, region-based classification first generates m perturbed inputs by adding bounded random noises to the original input, then computes a prediction for each perturbed input, and finally uses voting to make the final prediction. This method slows down prediction by a factor of m. In [6], m = 10,000 was used for MNIST and m = 1,000 was used for CIFAR. Evaluation in [6] shows that this can withstand adversarial examples generated by the C&W attack under a low confidence value K. However, if one slightly increases the confidence value K when generating the adversarial examples, this defense is no longer effective.
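A minimal sketch of the voting step described above is given below. The noise bound `r`, the number of perturbed copies `m`, and the uniform-noise sampling are illustrative assumptions; [6] defines its own noise region and parameter choices.

```python
import torch

def region_based_predict(model, x, m=1000, r=0.02):
    """Classify m randomly perturbed copies of x and return the majority-vote label."""
    with torch.no_grad():
        votes = []
        for _ in range(m):
            noise = (torch.rand_like(x) * 2 - 1) * r      # bounded random noise
            x_pert = torch.clamp(x + noise, 0.0, 1.0)     # keep a valid image
            votes.append(model(x_pert).argmax(dim=1))
    return torch.stack(votes).mode(dim=0).values          # majority vote per input
```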
MagNet. Meng and Chen [24] proposed an approach that is called MagNet. MagNet combines two ideas to defend against adversarial examples. First, one trains detectors that attempt to detect and reject adversarial examples. Each detector uses an autoencoder, which is trained to minimize the L2 distance between the input image and its output. A threshold is then selected using a validation dataset. The detector rejects any image such that the L2 distance between it and the encoded image is above the threshold. Multiple detectors can be used. Second, for each image that passes the detectors, a reformer (another autoencoder) is applied to the image, and the output (reformed image) is sent to the classifier for classification.

The evaluation of MagNet in [24] considers only adversarial examples generated without knowledge of the MagNet defense. Since one can combine all involved neural networks into a single one, one can still apply the C&W attack on the composite network. In [11], an effective attack is carried out against MagNet by adding to the optimization objective a term describing the goal of evading the detectors.
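The detector half of MagNet can be sketched as follows. The trained `autoencoder` and the threshold `tau` (selected on a validation set) are assumed to be given; the reformer step would pass accepted images through a second autoencoder before classification.

```python
import torch

def magnet_filter(autoencoder, x, tau):
    """Reject inputs whose per-image L2 reconstruction error exceeds the threshold tau."""
    with torch.no_grad():
        recon = autoencoder(x)
        err = torch.norm((x - recon).flatten(start_dim=1), dim=1)  # L2 distance per image
    accept = err <= tau
    return accept, recon   # accepted images would then be reformed and classified
```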
3 EVALUATION METHODOLOGY
We discuss several important factors for evaluation, and introduce the translucent-box model to supplement white-box evaluation.

3.1 Adversary Knowledge
The adversary model plays an important role in any security evaluation. One important part of the adversary model is the assumption about the adversary's knowledge.

Knowledge of Model (white-box). The adversary has full knowledge of the target model to be attacked, including the model architecture, defense mechanism, and all the parameters, including those used in the defense mechanisms. We call such an attack a white-box attack.

Complete Knowledge of Process (translucent-box). The adversary does not know the exact parameters of the target model, but knows the training process, including model architecture, defense mechanism, training algorithm, and distribution of the training dataset. With this knowledge, the adversary can use the same training process that is used to generate the target model to train one or more surrogate models. Depending on the degree of randomness involved in the training process, the surrogate models may be similar to or quite different from the target model, and adversarial examples generated by attacking the surrogate model(s) may or may not work very well. The property of whether an adversarial example generated by attacking one or more surrogate models can also work against another target model is known as transferability. We call such an attack a translucent-box attack.

Technically, it is possible for a white-box adversary to know less than a translucent-box adversary in some aspects. For example, a white-box adversary may not know the distribution of the training data. However, for the purpose of generating adversarial examples, knowing all details of the target model (white-box) is strictly more powerful than knowing the training process (translucent-box).

Oracle access only (black-box). Some researchers have considered adversary models where an adversary uses only oracle accesses to the target model. That is, the adversary may be able to query the target model with instances and receive the output. This is also called a "decision-based adversarial attack" [5, 12]. We call such an attack a black-box attack.

Some researchers also use the term black-box attack to refer to what we call translucent-box attacks. We choose to distinguish translucent-box attacks from black-box attacks for two reasons. First, an adversary will have some knowledge about the target model under attack, e.g., the neural network architecture and the training algorithm. Thus the box is not really "black". Second, the two kinds of attacks are very different. One relies on training surrogate models, and the other relies on issuing a large number of oracle queries.

Other researchers use "black-box attack" to refer to the situation where the adversary carries out the attack without specifically targeting the defense mechanism. We argue that such an evaluation has limited value in understanding the security benefits of a defense mechanism, as it is a clear deviation from Kerckhoffs's principle.

Our Choice of Adversary Model. We argue that defense mechanisms should be evaluated under both white-box and translucent-box attacks. While developing attacks that can generate adversarial examples using only oracle access is interesting, for a defense mechanism to be effective, one must assume that the adversary cannot break it even if it has knowledge of the defense mechanism.

Evaluation under white-box attacks can be carried out by measuring the level of distortion needed to attack a model. Effective defense against white-box attacks is the ultimate objective. Until defense in the white-box model is achieved, effective defense against translucent-box attacks is valuable and helps the research community make progress. Translucent-box is a realistic assumption, especially in an academic setting, as published papers generally include descriptions of the architecture, training process, defense mechanisms, and the exact dataset used in their experiments. Robustness and security evaluations under this assumption are also consistent with Kerckhoffs's principle.

We also note that there are two possible flavors of attacks. Focusing on image classifiers, the goal of an untargeted attack is to generate adversarial examples such that the classifier gives any output label different from what human perception would assign. A targeted attack additionally requires the working adversarial examples to induce the classifier into giving specific output labels of the attacker's choosing. In this paper, we consider only targeted attacks when evaluating defense mechanisms, as this models an adversary with a more specific objective.
3.2 Adversary Strategy
Even after the assumption about the adversary's knowledge is made, there are still possibilities regarding what strategy the adversary takes. For example, when evaluating a defense mechanism under the translucent-box assumption, a standard method is to train m models and, for each model, generate n adversarial examples. Then, for each of the m models, treat it as the target model, feed the (m − 1)n adversarial examples generated on the other models to it, and report the percentage of successes among the m(m − 1)n trials.

Such an evaluation method assesses the success probability of the following naive adversary strategy: the adversary trains one surrogate model, generates an adversarial example that works against the surrogate model, and then deploys that adversarial example. We call this a one-surrogate attack. A real adversary, however, can use a more effective strategy. It can try to generate adversarial examples that can fool multiple surrogate models at the same time. After generating them, it can first test whether the adversarial examples can fool surrogate models that are not used in the generation. We call this a multi-surrogate attack.

For any defense mechanism that is more effective against adversarial examples under the translucent-box attack than under the white-box model, the additional effectiveness must be due to the randomness in the training process. When that is the case, the above adversary strategy would have a much higher success rate than the naive adversary strategy. Evaluation should be done against this adversary strategy.

We thus propose the following procedure for evaluating a defense mechanism in the translucent-box setting. One first trains t + v surrogate models. Then a set of t models are randomly selected, and adversarial examples are generated that can simultaneously attack all t of them; that is, the optimization objective of the attack includes all t models. For the remaining v models, we use leave-one-out validation. That is, for each model, we use the other v − 1 models as validation models, and select only adversarial examples that can fool a certain fraction of the validation models. Only for the examples that pass this validation stage do we record whether they successfully transfer to the target model or not. We call such an attack a multi-surrogate with validation attack. The percentage of successful transfers is used for evaluation. In our experiments, we use t = v = 8, and an example is selected when it can successfully attack at least 5 out of 7 validation models.
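The procedure above can be summarized by the following pseudocode-level Python sketch. The helpers `generate_multi_model_attack` (a C&W-style attack run against the sum of the surrogate losses) and `fools` (a check that a model outputs the attacker's target label) are assumptions standing in for attack-specific code, not functions defined in the paper.

```python
import random

def translucent_box_eval(models, instances, t=8, v=8, pass_threshold=5):
    """Multi-surrogate with validation: attack t surrogates jointly, filter examples
    on v-1 validation models, then measure transfer to the held-out target model."""
    surrogates = random.sample(models, t)
    holdout = [m for m in models if m not in surrogates]          # the remaining v models
    adv_examples = [generate_multi_model_attack(surrogates, x, adv_target)   # assumed helper
                    for x, adv_target in instances]
    results = []
    for target in holdout:                                        # leave-one-out validation
        validators = [m for m in holdout if m is not target]
        for x_adv, (_, adv_target) in zip(adv_examples, instances):
            passed = sum(fools(m, x_adv, adv_target) for m in validators)    # assumed helper
            if passed >= pass_threshold:                          # e.g., at least 5 of 7
                results.append(fools(target, x_adv, adv_target))
    return sum(results) / max(len(results), 1)                    # transfer rate
```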
3.3 Parameters and Data Interpretation
Training a defense mechanism often requires multiple parameters as inputs. For example, a defense mechanism may be tuned to be more vigilant against adversarial examples, at the cost of reduced classification accuracy. When comparing defense mechanisms, one should choose parameters in a way that the classification accuracy on the test dataset is similar.

At the same time, when using the C&W attack to generate adversarial examples, an important parameter is the confidence value K. A defense mechanism may be able to resist adversarial examples generated under a low K value, but may prove much less effective against those generated under a higher value of K (see, e.g., [6]). Using the same K value for different defenses, however, may not be sufficient for providing a level playing field for comparison. The K value represents an input to the algorithm, and what really matters is the quality of the adversarial examples. We propose to run the C&W attack against a defense mechanism under multiple K values, and group the resulting adversarial examples based on their distortion. We can then compare how well a defense mechanism performs against adversarial examples with a similar amount of distortion. That is, we group adversarial examples based on the L2 distance and compute the average transferability for each group.

4 PROPOSED DEFENSE
From a statistical point of view, the problem with adversarial examples is that of classification under covariate shifts [29]. A covariate shift happens when the training and test observations follow different distributions. In the case of adversarial examples, this is clearly the case, as new adversarial examples are generated and added to the test distribution. If the test distribution with adversarial examples can be known, a simple and optimal way of dealing with covariate shifts is training the model with samples from the test distribution, rather than using the original training data [17, 29, 31, 32], assuming that we have access to enough such examples. Training with adversarial examples can be viewed as a robust optimization procedure [23] approximating this approach.

Unfortunately, training with adversarial examples does not fully solve the defense problem. Adversaries can adapt the test distribution (a new covariate shift) to make the new classifier perform poorly again on test data. That is, given a model trained with adversarial examples, the adversary can find additional adversarial examples and use them. In this minimax game, where the adversary is looking for a covariate shift and the defender is training with the latest covariate shift, the odds are stacked against the defender, who is always one step behind the attacker [34, 43].

Fundamentally, to win this game, the defender needs to mimic human perception. That is, as long as there are instances (real or fabricated) where humans and ML models classify differently, these can be over-represented in the test data by the adversary's covariate shift. Models that either underfit or overfit make mistakes by definition, and these mistakes can be used in the adversary's covariate shift. Only a model with no training or generalization errors under all covariate shifts is not vulnerable to attacks.

4.1 Motivation of Our Approach
While it is impossible to completely eliminate classification errors, several things can be done to help defend against adversarial examples by making them harder to find.

One approach is to reduce the number of instances on which the ML models disagree with human perception. Training with adversarial examples helps in this regard. Using a more robust model architecture and training procedure can also help. When giving an image to train the model, intuitively we want to say that "all instances that look similar to this instance from a human's perspective should also have the same label". Unfortunately, finding which images humans will consider to be "similar to this instance", and which thus should be of the same class, is not a well-defined procedure. Today, the best we can hope for is that for some mathematical distance measure (such as L2 distance) and a small enough threshold, humans will consider the images to be similar. If we substitute "look similar to ... from a human's perspective" with "within a certain L2 distance", this is a precise statement. This suggests that one training instance should be interpreted as a set of instances (e.g., those within a certain L2 distance of the given one) that all have the same given label. Our proposed defense is to some extent motivated by this intuition.

Another way is to make it more difficult for adversaries to discover adversarial examples, even if they exist. One approach is to use an ensemble model, wherein multiple models are trained and applied to an instance, and the results are aggregated in some fashion. For an adversarial example to work, it must be able to fool a majority of the models in the ensemble.

If we consider defense in the translucent-box adversary model, another approach is to increase the degree of randomness in the training process, so that adversarial examples generated on the surrogate models do not transfer well.

Our proposed new defense against adversarial examples is motivated by these ideas, which are recapped below. First, each training instance should be viewed as a representative of instances within a certain L2 distance. Second, we want to increase the degree of randomness in the training process. Third, we want to approximate the usage of an ensemble of models for decisions.
4.2 Random Spiking
As discussed in Section 2.2, dropout has been proposed as a way to defend against adversarial examples. Dropout can be interpreted as a way of regularizing a neural network by adding noise to its hidden units. The idea of adding noise to the states of units has also been used in the context of Denoising Autoencoders (DAEs) by Vincent et al. [35, 36], where noise is added to the input units of an autoencoder and the network is trained to reconstruct the noise-free input. Dropout changes the behavior of the hidden units. Furthermore, instead of adding random noise, Dropout sets values to zero.

Our proposed approach generalizes both Dropout and Denoising Autoencoders. Instead of training with removed units or injecting random noise into the input units, we inject random activations into some hidden units near the input level. We call this method Random Spiking. Similar to dropout, there are two approaches at inference time. The first is to use random spiking only in training, and not use it at inference time. The second is to use a Monte Carlo decision procedure. That is, at decision time, one runs the NN multiple times with random spiking, and aggregates the results into one decision.

The motivations for random spiking are many-fold. First, we are simulating the interpretation that each training instance should be treated as a set of instances, each with some small changes. Injecting random perturbations at a level near the input simulates the effect of training with a set of instances. Second, adversarial examples make only small perturbations to benign images that do not significantly affect human perception. These perturbations inject noises that will be amplified through multiple layers and change the prediction of the networks. Random Spiking trains the network to be more robust to such noises. Third, if one needs to increase the degree of randomness in the training process beyond Dropout, using random noise instead of setting activations to zero is a natural approach. Fourth, when we use the Monte Carlo decision procedure, we are approximating the behavior of a model ensemble.

More specifically, random spiking adds a filtering layer in between two layers of nodes in a DNN. The effect of the filtering layer may change the output values of units in the earlier layer, affecting the values going into the later layer. With probability p, a unit's value is kept unchanged. With probability 1 − p, a unit's value is set to randomly sampled noise. If a unit has its output value thus randomly perturbed, in back-propagation we do not propagate backward through this unit, since any gradient computed is related to the random noise, and not to the actual behavior of this unit. For layers after the Random Spiking filtering layer, back-propagation updates occur normally.

We use the Random Spiking filtering layer just once, after the first convolutional layer (and before any max pooling layer, if one is used). This is justified by the design intuition. We also experimented with adding the Random Spiking filtering layer later in the NN, and test accuracy drops. There are two explanations for that. First, since units chosen to have random noises stop back-propagation, having them later in the network has more impact on training. Second, when random noises are injected early in the network, there are more layers after it, and there is sufficient capacity in the model to deal with such noises without too much accuracy cost. When random noises are injected late, fewer layers exist to deal with their effect, and the network lacks the capacity to do so.

Generating Random Noises. To implement Random Spiking, we have to decide how to sample the noises that are used to replace the unit outputs. Sampling from a distribution with a fixed range is problematic because the impact of the noise depends on the distribution of other values in the same layer. If a random perturbation is too small compared to other values in the same layer, then its randomization effect is too small. If, on the other hand, the magnitude of the noise is significantly larger than the other values, it overwhelms the network. In our approach, we compute the minimum and maximum value among all values in the layers to be filtered, and sample a value uniformly at random in that range. Since training a NN is often done using mini-batches, the minimum and maximum values are computed from the whole batch.
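The following PyTorch module is a minimal sketch of such a filtering layer under the description above; it is not the authors' implementation, and the keep probability p = 0.8 is an arbitrary illustrative value. Each unit is kept with probability p and otherwise replaced by a value drawn uniformly between the mini-batch minimum and maximum of the incoming values; detaching the noise prevents gradients from flowing backward through spiked units.

```python
import torch
import torch.nn as nn

class RandomSpiking(nn.Module):
    """Keep each unit with probability p; otherwise replace its output with
    uniform noise drawn from [batch min, batch max] of the incoming values."""
    def __init__(self, p=0.8):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training:
            return x               # "single prediction" mode: no spiking at test time
        keep = (torch.rand_like(x) < self.p).float()
        lo, hi = x.min().detach(), x.max().detach()     # range over the whole mini-batch
        noise = lo + (hi - lo) * torch.rand_like(x)
        # Spiked units output pure noise; no gradient reaches earlier layers through them.
        return keep * x + (1 - keep) * noise.detach()
```

Under the Monte Carlo decision procedure described below (Definition 2), the layer would instead stay active at inference time, and the softmax outputs of several runs would be averaged.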
Monte Carlo Random Spiking as a Model Ensemble. For testing, we can use the Monte Carlo decision procedure of running the network multiple times and using the average. This has attractive theoretical guarantees, at the cost of overhead at decision time, since the NN needs to be computed multiple times for one instance. We now show that Monte Carlo Random Spiking approximates a model ensemble. Let (x, y) be a training example, where x is an image and y is the image's one-hot encoded label. Consider an RS neural network with softmax output ŷ(x, b, ε, W), neuron weights W, and spike parameters b and ε, where bit vector entry b_i = 1 indicates that the i-th hidden neuron of the RS layer gives out a noise output ε_i ∈ R sampled with density f(ε); otherwise b_i = 0 and the output of the RS layer is a copy of its i-th input from the previous layer (i.e., the original value of the neuron). By construction, b_i = 1 with probability 1 − p, independent of other RS neurons. Let L(y, ŷ) be a convex loss function over ŷ, such as the cross-entropy loss, the negative log-likelihood, or the square error loss. Then, the following proposition holds:

Proposition 1. Consider the ensemble RS model

    \hat{y}(x, W) \equiv \sum_{\forall b} \int_{\epsilon} \hat{y}(x, b, \epsilon, W)\, p(b)\, f(\epsilon)\, d\epsilon,        (2)

where f is a density function, p(b) is the probability that bit vector b is sampled, and ŷ is an RS neural network with one spike layer. Then, by stochastically optimizing the original RS neural network ŷ by sampling bit vectors and noises, we are performing the minimization

    W^{\star} = \operatorname{argmin}_{W} L(y, \hat{y}(x, W))

through a variational approximation model using an upper bound of the loss L(y, ŷ(x, W)). A proof of Prop. 1 is presented in Appendix A.2.

Definition 2 (MC Avg. Inference). At inference time, we use Monte Carlo sampling to estimate the RS ensemble

    \hat{y}(x, W) = \sum_{\forall b} \int_{\epsilon} \hat{y}(x, b, \epsilon, W)\, p(b)\, f(\epsilon)\, d\epsilon,

where f is a density function and p(b) is the probability that bit vector b is sampled.

Adaptive Attack against Random Spiking. Since Random Spiking introduces randomness during training, an adaptive attacker who knows that Random Spiking has been deployed but is unaware of the exact parameters of the target model can train multiple surrogate models, and try to generate adversarial examples that simultaneously cause all these models to misbehave. That is, the multi-surrogate with validation attack is a natural adaptive attack against Random Spiking, and against any other defense mechanism that relies on randomness during training. In this attack, one uses probabilities from all surrogate models to generate the adversarial example. This is similar to the Expectation over Transformation (EOT) [2] approach for generating adversarial examples.
5 EXPERIMENTAL EVALUATION
We present experimental results comparing the various defense mechanisms using our proposed approach.

5.1 Dataset and Model Training
For our experiments, we use the following 3 datasets: MNIST [21], Fashion-MNIST [38], and CIFAR-10 [20]. Table 1 gives an overview of their characteristics.

Table 1: Overview of datasets
Dataset          Image size   Training instances   Test instances   Color space
MNIST            28 × 28      60,000               10,000           8-bit gray-scale
Fashion-MNIST    28 × 28      60,000               10,000           8-bit gray-scale
CIFAR-10         32 × 32      50,000               10,000           24-bit true-color

We consider 9 schemes equipped with different defense mechanisms, all of which share the same network architectures and training parameters. For MNIST, we follow the architecture given in the C&W paper [9]. Fashion-MNIST was not studied in the literature in an adversarial setting, and the model architectures used for CIFAR-10 in previous papers delivered a fairly low accuracy. Thus for Fashion-MNIST and CIFAR-10, we use the state-of-the-art WRN-28-10 instantiation of the wide residual networks [42]. We are able to achieve state-of-the-art test accuracy using these architectures. Some of these mechanisms have adjustable parameters, and we choose values for these parameters so that the resulting models have a comparable level of accuracy on the testing data. As a result, all 9 schemes incur only a small accuracy drop.

Table 2: Test errors (mean±std).
Scheme         Decision        MNIST          Fashion-MNIST   CIFAR-10
Standard       Single pred.    0.77 ± 0.05%   4.94 ± 0.19%    4.38 ± 0.21%
Dropout        MC Avg.         0.67 ± 0.07%   4.75 ± 0.09%    4.46 ± 0.25%
Distillation   MC Avg.         0.78 ± 0.05%   4.81 ± 0.18%    4.33 ± 0.27%
RS-1           MC Avg.         0.88 ± 0.09%   5.34 ± 0.10%    5.59 ± 0.22%
RS-1-Dropout   MC Avg.         0.71 ± 0.07%   5.32 ± 0.17%    5.81 ± 0.27%
RS-1-Adv       MC Avg.         0.98 ± 0.11%   5.49 ± 0.16%    6.20 ± 0.40%
Magnet         Det. Thrs.      0.001          0.004           0.004
Magnet         MC Avg.         0.87 ± 0.06%   5.36 ± 0.17%    5.52 ± 0.24%
Dropout-Adv    MC Avg.         0.69 ± 0.07%   4.76 ± 0.11%    4.71 ± 0.19%
RC             L2 noise        0.4            0.02            0.02
RC             Voting          0.77 ± 0.11%   5.39 ± 0.23%    5.72 ± 0.46%
(The "Det. Thrs." and "L2 noise" rows list the detection threshold and noise parameters used for Magnet and RC, respectively, rather than test errors.)

Table 2 gives the test errors, and Tables 6 and 7 in the Appendix give details of the model architecture and training parameters. When a scheme uses either Dropout or Random Spiking, we consider 3 possible decision procedures at test time. By "Single pred.", we mean dropout and random spiking are not used at test time. By "Voting", we mean running the network with Dropout and/or Random Spiking 10 times, and using majority voting for decision (with ties decided in favor of the label with the smaller index). By "MC Avg.", we mean using Definition 2 by running the network with Dropout and/or Random Spiking 10 times, and averaging the 10 probability vectors. For each scheme, we train 16 models (with different initial parameter values) on each dataset, and report the mean and standard deviation of their test accuracy. We observe that using Voting or MC Avg., one can typically achieve a slight reduction in test error.

5.1.1 Adversarial training. Two defense mechanisms require training with adversarial examples, which are generated by applying the C&W L2 targeted attack on a target model, using randomly sampled training instances and target class labels.
5.1.2 Upper Bounds on Perturbation. For each dataset, we generated thousands of adversarial examples with varying confidence values for each training scheme, and sorted them according to the added amount of perturbation, measured in L2. We have observed that for instances with a high amount of perturbation, one can visually observe the intention of the adversarial example. We thus chose a cut-off upper bound on L2 distance. The chosen L2 cut-off bounds are included in Table 3, and used as upper limits in many of our later experiments. With the bounds on L2 fixed, we then empirically determine an upper bound for the confidence value to be used in the C&W-L2 attacks for generating adversarial examples for training purposes. To diversify the set of generated adversarial examples, we sample several different confidence values within the bound, which are also reported in Table 3.

Table 3: Parameters used for generating adversarial examples. The values for K reported here were chosen so that the generated examples would fit a predetermined L2 cut-off.
Dataset          L2 cut-off   Working confidence values (K)   Examples for each K (n)
MNIST            3.0          {0, 5, 10, 15}                  3000
Fashion-MNIST    1.0          {0, 20, 40, 60}                 3000
CIFAR-10         1.0          {0, 20, 40, 60, 80, 100}        2000

Appendix A.1 provides additional details on training for each defense scheme.

5.2 White-box Evaluation
We first evaluate the effectiveness of the defense mechanisms under white-box attacks. We apply the C&W white-box attack with confidence 0 to generate targeted adversarial examples, and measure the L2 distance of the generated adversarial examples. We consider both the single-model attack, where the adversarial example targets a single model, and the multi-8 attack, where the adversarial example aims at attacking 8 similarly trained models at the same time. This can be considered a form of ensemble white-box attack [22].

Table 4: C&W targeted adversarial examples, L2 (mean±std), when attacking a single model.
Scheme         MNIST         Fashion-MNIST   CIFAR-10
Standard       2.12 ± 0.69   0.12 ± 0.08     0.17 ± 0.08
Dropout        1.80 ± 0.52   0.14 ± 0.07     0.17 ± 0.08
Distillation   2.02 ± 0.63   0.13 ± 0.07     0.17 ± 0.07
RS-1           2.06 ± 0.76   0.31 ± 0.16     0.32 ± 0.14
RS-1-Dropout   1.79 ± 0.86   0.36 ± 0.21     0.32 ± 0.15
RS-1-Adv       2.36 ± 0.80   0.56 ± 0.30     0.39 ± 0.18
Magnet         2.22 ± 0.65   0.28 ± 0.15     0.29 ± 0.21
Dropout-Adv    2.44 ± 0.66   0.33 ± 0.15     0.18 ± 0.07

Table 5: C&W adversarial examples, L2 (mean±std), with the Multi 8 attack strategy.
Scheme         MNIST         Fashion-MNIST   CIFAR-10
Standard       2.50 ± 0.77   0.22 ± 0.15     0.25 ± 0.10
Dropout        2.29 ± 0.65   0.25 ± 0.13     0.26 ± 0.10
Distillation   2.37 ± 0.71   0.24 ± 0.14     0.33 ± 0.13
RS-1           2.77 ± 0.82   0.54 ± 0.25     0.49 ± 0.18
RS-1-Dropout   2.77 ± 0.93   0.61 ± 0.30     0.51 ± 0.18
RS-1-Adv       3.18 ± 0.88   1.04 ± 0.44     0.64 ± 0.23
Magnet         2.68 ± 0.75   0.54 ± 0.25     0.47 ± 0.24
Dropout-Adv    2.93 ± 0.70   0.57 ± 0.23     0.29 ± 0.10

Tables 4 and 5 present the average L2 distances of the generated adversarial examples. RS-1-Adv results in models that are more difficult to attack, requiring on average the highest perturbations (measured in L2 distance) among all evaluated defenses. Compared to other methods, adversarial examples generated against RS-1 and RS-1-Dropout have either higher or comparable amounts of distortion. These results again suggest that RS offers additional protection against adversarial examples.
5.3 Model Stability
Given a benign image and its variants with added noise, a more robust model should intuitively be able to tolerate a higher level of noise without changing its prediction results. We refer to this property as model stability. Here we evaluate whether models from a defense mechanism can correctly label instances that are perturbed. This serves several purposes. First, in [15], it is suggested that vulnerability to adversarial examples and low performance on randomly corrupted images, such as images with additive Gaussian noise, are two manifestations of the same underlying phenomenon. Hence it is suggested that adversarial defenses should consider robustness under such perturbations, as robustness under such perturbations is also an indication of resistance against adversarial attacks. Second, evaluating stability is identified in [1, 7] as a way to check whether a defense relies on obfuscated gradients to achieve its defense. For such a defense, random perturbation may discover adversarial examples when optimized search based on gradients fails. Third, some defense mechanisms (such as Magnet) rely on detecting whether an instance belongs to the same distribution as the training set, and consider an instance to be an adversarial example if it does not. However, when an input instance goes through some transformation that has little impact on human visual perception (such as JPEG compression), it will be considered an adversarial example by the defense. This will impact the accuracy of deployed systems, as the encountered instances may not always follow the training distribution.

5.3.1 Stability with Added Gaussian Noise. We measure how many predictions would change if a certain amount of Gaussian noise is introduced to a set of benign images. For a given dataset and a model, we use the first 1,000 images from the test dataset. We first make a prediction on those selected images and store the results as reference predictions. Then, for each selected image and chosen L2 distance, we sample Gaussian noise, scale it to the desired L2 value, and add the noise to the image. Pixel values are clipped if necessary, to make sure the new noisy variant is a valid image. We repeat this process 20 times (noise sampled independently per iteration).
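A sketch of this measurement for one image and one L2 level is shown below (pixel values are assumed to lie in [0, 1], and `predict` is an assumed classification callable):

```python
import numpy as np

def stability_under_gaussian_noise(predict, image, l2_target, repeats=20):
    """Fraction of noisy variants whose prediction matches the reference prediction."""
    reference = predict(image)
    unchanged = 0
    for _ in range(repeats):
        noise = np.random.randn(*image.shape)
        noise *= l2_target / np.linalg.norm(noise)     # scale the noise to the desired L2
        noisy = np.clip(image + noise, 0.0, 1.0)       # clip to keep a valid image
        unchanged += (predict(noisy) == reference)
    return unchanged / repeats
```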
Fig. 1 shows the effect of Gaussian noise on prediction stability for each training method (averaged over the 16 models trained in Sec. 5.1). Model stability inevitably drops for each scheme as the amount of Gaussian noise, measured by L2, increases. However, different schemes behave differently as L2 increases.

[Figure 1: Evaluating model stability with Gaussian noise. Three panels, (a) MNIST, (b) Fashion-MNIST, and (c) CIFAR-10, plot the percentage of unchanged predictions against the amount of Gaussian noise added (L2 distance) for Standard, Dropout, Distillation, ADV, RC, Magnet, RS-1, RSD-1, and RS-1-ADV. On MNIST, Magnet's stability approaches 0 when L2 ≥ 1.]

For MNIST, most schemes have stability above 99%, even when L2 is as large as 5. However, Magnet has stability approaching 0 when the L2 distance is greater than 1, because the majority of those instances are rejected by Magnet.

For Fashion-MNIST, we see more interesting differences among the schemes. The two approaches that have the highest stability are the two with adversarial training. When L2 = 2.5, RS-1-ADV has stability 87.4%, and ADV has stability 86%. Other schemes have stability around 60%; among them, RS-1 and RSD-1 have slightly higher stability than the others.

For CIFAR-10, we see that RS-1-ADV, RSD-1, and RS-1 have the highest stability as the amount of noise increases. When L2 = 2.5, they have stability 87.9%, 81.7%, and 83%, respectively. The other schemes have stability 70% or lower.

Furthermore, on all datasets, RS-1-ADV, RSD-1, and RS-1 give consistent results. Recall that we trained 16 models for each scheme; Fig. 1 also plots the standard deviation of the stability results of the 16 models. RS-1-ADV, RSD-1, and RS-1 have very low standard deviation, which in turn also suggests more consistent behavior when facing perturbed images.

5.3.2 Stability with JPEG compression. Given a set of benign images, we measure how many predictions would change if JPEG compression is applied to the images. For a given test dataset and a model, we compare the predictions on the benign test dataset (reference predictions) with the predictions on the JPEG-compressed test dataset with a fixed chosen JPEG compression quality (JCQ). For the sake of time efficiency, for this particular set of experiments, we reduced the number of iterations used by RC to one-tenth of its original algorithm.
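The JPEG variant of the measurement can be sketched with Pillow as follows; the prediction callable and the 8-bit image array are assumed inputs, not part of the paper's code.

```python
import io
import numpy as np
from PIL import Image

def stability_under_jpeg(predict, image_uint8, quality):
    """Check whether the prediction on an image survives JPEG compression (quality 10-100)."""
    reference = predict(image_uint8)
    buf = io.BytesIO()
    Image.fromarray(image_uint8).save(buf, format="JPEG", quality=quality)
    compressed = np.array(Image.open(buf))
    return predict(compressed) == reference
```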
Fig. 2 shows the effect of JPEG compression on prediction stability for each training method (averaged over the 16 models trained). Model stability decreases for each scheme as the JCQ (which ranges from 100 down to 10) decreases.

[Figure 2: Evaluating model stability with JPEG compression. Three panels, (a) MNIST, (b) Fashion-MNIST, and (c) CIFAR-10, plot the percentage of unchanged predictions against the JPEG compression quality for the same nine schemes. On MNIST, Magnet's stability falls to about 50% when the JPEG quality drops to 70.]

For MNIST, most schemes achieve stability over 99%, even if the JCQ is 10. Magnet is the outlier: it has a stability of around 50% when the JCQ is 70, and a stability of less than 20% when the JCQ is less than or equal to 40, because of the high rejection rate of MagNet. We believe that both of these results are related to the fact that MNIST images have black backgrounds that span most of the image. Noises introduced by JPEG compression result mostly in perturbations in the background that are ignored by most NN models. Since Magnet uses autoencoders to detect deviations from the input distribution, these noises trigger detection. Since Magnet aims at detecting perturbed images, this should not be considered a weakness of Magnet.

For Fashion-MNIST, we see that RS-1-ADV, RSD-1, and RS-1 outperform the other schemes on stability as the JCQ decreases. When JCQ = 10, they have stability 85.2%, 80.9%, and 84.4%, respectively. The other schemes have stability 80% or lower; the closest to the RS class among the other schemes is ADV.

For CIFAR-10, we see that RS-1-ADV, RSD-1, and RS-1 have the highest stability as the JCQ decreases. When JCQ = 10, they have stability 60.9%, 55.6%, and 55.4%, respectively. The other schemes have stability 50% or lower; the highest among the other schemes is RC.

5.4 Evaluating Attack Strategies
Here we empirically show that our proposed attack strategy, as presented in Sec. 3.2, can indeed generate adversarial examples that are more transferable. In attacks like the C&W attack, a higher confidence value will typically lead to more transferable examples, but the amount of perturbation would usually increase as well, sometimes making the example noticeably different under human perception.

Intuitively, a better attack strategy should give more transferable adversarial examples using a smaller amount of distortion. Hence we use distortion vs. transferability to compare 3 possible attack strategies. Similar to previous experiments, we measure the amount of distortion using L2 distance. In Fig. 3 we present the effectiveness of each attack strategy, averaged across the 9 schemes.

The first strategy we evaluated is a standard C&W attack which generates adversarial examples using only one surrogate model, dubbed "Single". Recall that for each training/defense method, we have 16 models that are surrogates of each other (Sec. 5.1). For each surrogate model, we randomly select half of the original dataset as the training dataset, since the adversary may not have full knowledge of the training dataset under the transfer attack setting. For the Single strategy, we apply the C&W attack on 4 of the models independently to generate a pool of adversarial examples. The transferability of those examples is then measured and averaged on the remaining 12 target models. Regardless of the training methods and defense mechanisms in place, adversarial examples generated using the Single strategy often have limited transferability, especially when the allowed amount of distortion (L2 distance) is small.

The second attack strategy that we evaluate is to generate adversarial examples using multiple surrogate models. For this, we use 8 of the 16 surrogate models for generating attack examples. The C&W attack can be adapted to handle this case with a slightly different loss function. In our experiments, we use the sum of the loss functions of the 8 surrogate models as the new loss function. We also use slightly lower confidence values than in Sec. 5.1.1 ({0, 10, 20, 30} for Fashion-MNIST, {0, 20, 40, 60} for CIFAR-10). The transferability of the generated adversarial examples is then measured and averaged on the remaining 8 models as the target. We refer to this as "Multi 8". As shown in Fig. 3, given the same limit on the amount of distortion (L2 distance), a significantly higher percentage of examples generated using the Multi 8 strategy are transferable than those found using the Single strategy.

Additionally, we evaluate a third attack strategy that is based on Multi 8. As discussed in Sec. 3.2, given enough surrogate models, one can further use some of them for validating adversarial examples. For those adversarial examples generated by the Multi 8 strategy, we keep them only if they can be transferred to at least 5 of the 7 validation models; hence we refer to this strategy as Multi 8 & Passing 5/7 Validation. The remaining model is used as the attack target, and we measure the transferability of examples that passed the 5/7 validation. For this attack strategy, the measurement shown in Fig. 3 is the average of 8 rotations between the target model and validation models. Compared to Multi 8 and Single, adversarial examples that passed the 5/7 validation are significantly more likely to transfer to the target model, even when the amount of perturbation is small.

This shows that simple strategies like Single are indeed not realizing the full potential of a resourceful attacker, and our proposed attack strategy of using multiple models for the generation and validation of adversarial examples is indeed superior. In the rest of this section, we will be using the most effective attack strategy of Multi 8 & Passing 5/7 Validation.

5.5 Translucent-box Evaluation
Here we evaluate the effectiveness of different schemes based on the transferability of adversarial examples generated using the Multi 8 & Passing 5/7 Validation attack strategy.

The results of our translucent-box evaluation are shown in Fig. 4. Adversarial examples are grouped into buckets based on their L2 distance. For each bucket, we use grayscale to indicate the average validation passing rate for each scheme. Passing rates from 0% to 100% are mapped to pixel values from 0 to 255 on a linear scale. There are four rows, each corresponding to adversarial examples within a certain L2 range. Each column illustrates to what extent a target defense scheme resists adversarial examples generated from attacking different methods.

Examining the columns for Standard and Dropout, we can see that Standard and Dropout are in general the most vulnerable. Distillation and RC are almost equally vulnerable. Magnet can often resist adversarial examples generated by targeting other defenses, but is vulnerable to ones generated specifically targeting it.

Overall, across the three datasets, RS-1-Adv performs the best, and is significantly better than Dropout-Adv. This suggests that Random Spiking offers additional protection against adversarial examples. RS-1 and RS-1-Dropout also perform consistently well across the three datasets. RC performs noticeably well on MNIST and Fashion-MNIST, likely because those images are all in 8-bit grayscale, and its advantages diminish on CIFAR-10, which contains images of 24-bit color.
Single Multi 8 Multi 8 & Passing 5/7 Validation
MNIST Fashion-MNIST CIFAR-10
1.0 1.0 1.0
Transferability

Transferability

Transferability
0.5 0.5 0.5

0.0 0.0 0.0


0 2.25 2.5 2.75 3.0 0 0.4 0.6 0.8 1.0 0 0.4 0.6 0.8 1.0
L2 L2 L2
Figure 3: Transferability of adversarial examples found by the 3 attack strategies

[Figure 4 appears here. Panels: (a) MNIST, (b) Fashion-MNIST, (c) CIFAR-10. Each panel contains four heat-map blocks, one per L2 bucket (MNIST: 0-2.25, 2.25-2.5, 2.5-2.75, 2.75-3; Fashion-MNIST and CIFAR-10: 0-0.4, 0.4-0.6, 0.6-0.8, 0.8-1). Rows correspond to the scheme attacked to generate the examples (Standard, Dropout, Distillation, ADV, Magnet, RS-1, RSD-1, RS-1-ADV, all); columns correspond to the target defense scheme (Standard, Dropout, Distillation, Dropout-Adv, Magnet, RS-1, RS-1-Dropout, RS-1-Adv, RC, All).]

Figure 4: Average passing rate of 5/7 validation. This heat map shows the resilience of each scheme against the adversarial examples generated from all schemes, under a fixed allowance of L2. Each column illustrates to what extent a target defense scheme resists adversarial examples generated from attacking different methods.
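To make the construction of Fig. 4 concrete, the sketch below computes one heat-map cell: it buckets examples by L2 distance, averages the 5/7-validation passing rate within the bucket, and maps the rate linearly to an 8-bit gray level. The data layout (parallel arrays of L2 distances and pass/fail flags) is an assumption made for illustration; this is not the code used to render the figure.

import numpy as np

def heatmap_cell(l2_distances, passed_validation, l2_range):
    # Average 5/7-validation passing rate for one (source, target) pair and
    # one L2 bucket, mapped to a gray level: 0% -> 0, 100% -> 255 (linear).
    lo, hi = l2_range
    in_bucket = (l2_distances >= lo) & (l2_distances < hi)
    if not np.any(in_bucket):
        return 0
    rate = float(np.mean(passed_validation[in_bucket]))
    return int(round(rate * 255))

# L2 buckets used for the MNIST panel of Fig. 4.
mnist_buckets = [(0.0, 2.25), (2.25, 2.5), (2.5, 2.75), (2.75, 3.0)]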

6 RELATED WORK

Other attack algorithms. There are other attack algorithms such as JSMA [26], FGS [16], PGD [23], and DeepFool [25]. The general consensus seems to be that the C&W attack is the current state-of-the-art [8, 9, 13]. Though our evaluation results are based on the C&W attack, our evaluation framework is not tied to a particular attack and can use other algorithms.

Other defense mechanisms. Some other defense mechanisms have been proposed in the literature [6, 24, 27, 39-41]. For example, Xu et al. proposed to use feature squeezing techniques such as color bit-depth reduction and pixel spatial smoothing to detect adversarial examples [41]. Xie et al. proposed to use Feature Denoising [40] to improve the robustness of neural network models. Due to limits in time and space, we selected representative methods from each broad class (e.g., MagNet for the detection approach). We leave comparison with other mechanisms as future work.

Beyond images. Other research efforts have explored possible attacks against neural network models specialized for other purposes, for example, speech to text [10].
We focus on images, although the evaluation methodology and the idea of random spiking should be applicable to these other domains.

Alternative similarity metrics. Some researchers have argued that Lp norms insufficiently capture human perception, and have proposed alternative similarity metrics such as SSIM [28]. It is, however, not immediately clear how to adapt such metrics to the C&W attack. We leave further investigation of the impact of alternative similarity metrics on adversarial examples for future work.

7 CONCLUSION

In this paper, we present a careful analysis of possible adversarial models for studying the phenomenon of adversarial examples. We propose an evaluation methodology that can better illustrate the strengths and limitations of different mechanisms. As part of the method, we introduce a more powerful and meaningful adversary strategy. We also introduce Random Spiking, a randomized technique that generalizes dropout. We have conducted extensive evaluation of Random Spiking and several other defense mechanisms, and demonstrate that Random Spiking, especially when combined with adversarial training, offers better protection against adversarial examples when compared with other existing defenses.

ACKNOWLEDGEMENTS

This work is supported by the Northrop Grumman Cybersecurity Research Consortium under a Grant titled "Defenses Against Adversarial Examples" and by the United States National Science Foundation under Grant No. 1640374.

REFERENCES

[1] Anish Athalye, Nicholas Carlini, and David Wagner. 2018. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In ICML, Vol. 80. PMLR, Stockholmsmässan, Stockholm Sweden, 274-283.
[2] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. 2018. Synthesizing Robust Adversarial Examples. In ICML, Vol. 80. PMLR, 284-293.
[3] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. 2017. Variational inference: A review for statisticians. J. Amer. Statist. Assoc. 112, 518 (2017), 859-877.
[4] Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010. Springer, 177-186.
[5] Wieland Brendel, Jonas Rauber, and Matthias Bethge. 2018. Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models. In ICLR.
[6] Xiaoyu Cao and Neil Zhenqiang Gong. 2017. Mitigating evasion attacks to deep neural networks via region-based classification. In ACSAC. ACM, 278-287.
[7] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian J. Goodfellow, Aleksander Madry, and Alexey Kurakin. 2019. On Evaluating Adversarial Robustness. CoRR (2019). arXiv:1902.06705
[8] Nicholas Carlini and David Wagner. 2017. Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods. In Proceedings of the 10th Workshop on Artificial Intelligence and Security. ACM, 3-14.
[9] Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In 2017 IEEE S&P. IEEE, 39-57.
[10] Nicholas Carlini and David Wagner. 2018. Audio Adversarial Examples: Targeted Attacks on Speech-to-Text. Deep Learning and Security Workshop (2018). arXiv:1801.01944
[11] Nicholas Carlini and David A. Wagner. 2017. MagNet and "Efficient Defenses Against Adversarial Attacks" are Not Robust to Adversarial Examples. CoRR abs/1711.08478 (2017). arXiv:1711.08478 http://arxiv.org/abs/1711.08478
[12] Jianbo Chen and Michael I. Jordan. 2019. Boundary Attack++: Query-Efficient Decision-Based Adversarial Attack. (2019). arXiv:1904.02144
[13] Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. 2018. EAD: Elastic-Net Attacks to Deep Neural Networks via Adversarial Examples. AAAI Conference on Artificial Intelligence (2018).
[14] Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In ICML, Vol. 48. 1050-1059.
[15] Justin Gilmer, Nicolas Ford, Nicholas Carlini, and Ekin Cubuk. 2019. Adversarial Examples Are a Natural Consequence of Test Error in Noise. In ICML, Vol. 97. 2280-2289.
[16] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and Harnessing Adversarial Examples. In ICLR.
[17] James J Heckman. 1977. Sample selection bias as a specification error (with an application to the estimation of labor supply functions). (1977).
[18] G. Hinton, O. Vinyals, and J. Dean. 2014. Distilling the Knowledge in a Neural Network. NIPS 2014 Deep Learning and Representation Learning Workshop (2014).
[19] D. P. Kingma and J. Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
[20] Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Technical Report.
[21] Yann LeCun, Corinna Cortes, and Christopher JC Burges. 1998. The MNIST database of handwritten digits. (1998). http://yann.lecun.com/exdb/mnist/
[22] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. 2017. Delving into Transferable Adversarial Examples and Black-box Attacks. In ICLR.
[23] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In ICLR.
[24] Dongyu Meng and Hao Chen. 2017. Magnet: a two-pronged defense against adversarial examples. In CCS. ACM, 135-147.
[25] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. 2016. Deepfool: a simple and accurate method to fool deep neural networks. In CVPR. 2574-2582.
[26] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. 2016. The limitations of deep learning in adversarial settings. In 2016 EuroS&P. IEEE, 372-387.
[27] Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. 2016. Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks. In 2016 IEEE S&P. 582-597.
[28] Mahmood Sharif, Lujo Bauer, and Michael K. Reiter. 2018. On the Suitability of Lp-norms for Creating and Preventing Adversarial Examples. IEEE CVPRW.
[29] Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90, 2 (2000), 227-244.
[30] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929-1958.
[31] Masashi Sugiyama, Neil D Lawrence, Anton Schwaighofer, et al. 2017. Dataset shift in machine learning. The MIT Press.
[32] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. 2008. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems. 1433-1440.
[33] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In International Conference on Learning Representations.
[34] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. 2019. Robustness may be at odds with accuracy. In ICLR.
[35] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and Composing Robust Features with Denoising Autoencoders. In ICML. ACM, 1096-1103.
[36] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research 11 (2010), 3371-3408.
[37] Siyue Wang, Xiao Wang, Pu Zhao, Wujie Wen, David Kaeli, Peter Chin, and Xue Lin. 2018. Defensive Dropout for Hardening Deep Neural Networks Under Adversarial Attacks. In ICCAD. ACM, Article 71, 8 pages.
[38] Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. CoRR abs/1708.07747 (2017). arXiv:1708.07747
[39] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan L. Yuille. 2018. Mitigating adversarial effects through randomization. In ICLR.
[40] Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan L. Yuille, and Kaiming He. 2019. Feature Denoising for Improving Adversarial Robustness. In CVPR.
[41] Weilin Xu, David Evans, and Yanjun Qi. 2018. Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks. In NDSS.
[42] Sergey Zagoruyko and Nikos Komodakis. 2016. Wide Residual Networks. In BMVC.
[43] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. 2019. Theoretically Principled Trade-off between Robustness and Accuracy. In ICML. 7472-7482.
A APPENDIX

A.1 More Information on Training

Model architecture and training parameters are in Tables 6 and 7. For MNIST, we use the same as in MagNet [9, 24]. For Fashion-MNIST and CIFAR-10, they are identical to the WRN [42]. Test errors of the nine schemes are in Table 2. For each scheme and dataset, we train 16 models and report the mean and standard deviation. Additional information is provided below.

Table 6: Model Architectures. We use WRN-28-10 for Fashion-MNIST and CIFAR-10 (k = 10, N = 4).

MNIST:
  Conv.ReLU 3×3×32
  Conv.ReLU 3×3×32
  Max Pooling 2×2
  Conv.ReLU 3×3×64
  Conv.ReLU 3×3×64
  Max Pooling 2×2
  Dense.ReLU 200
  Dense.ReLU 200
  Softmax 10

Fashion-MNIST and CIFAR-10 (WRN-28-10):
  Group   | Output Size (Fashion-MNIST) | Output Size (CIFAR-10) | Kernel, Feature
  Conv1   | 28×28                       | 32×32                  | [3×3, 16]
  Conv2   | 28×28                       | 32×32                  | [3×3, 16×k; 3×3, 16×k] × N
  Conv3   | 14×14                       | 16×16                  | [3×3, 32×k; 3×3, 32×k] × N
  Conv4   | 7×7                         | 8×8                    | [3×3, 64×k; 3×3, 64×k] × N
  Softmax | 10                          | 10                     |

Table 7: Training Parameters.

  Parameters          | MNIST | Fashion-MNIST & CIFAR-10
  Optimization Method | SGD   | SGD
  Learning Rate       | 0.01  | 0.1 initial, multiplied by 0.2 at 60, 120 and 160 epochs
  Momentum            | 0.9   | 0.9
  Batch Size          | 128   | 128
  Epochs              | 50    | 200
  Dropout (Optional)  | 0.5   | 0.1
  Data Augmentation   | -     | Fashion-MNIST: Shifting + Horizontal Flip; CIFAR-10: Shifting + Rotation + Horizontal Flip + Zooming + Shear
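As an illustrative rendering of Tables 6 and 7, the sketch below writes the MNIST architecture and the two SGD configurations in PyTorch-style code. The padding, the flattened feature size, and the scheduler usage are assumptions made for the sketch; this is not the training code used in our experiments.

import torch.nn as nn
import torch.optim as optim

# MNIST architecture from Table 6 (padding and flatten size are assumed here).
mnist_model = nn.Sequential(
    nn.Conv2d(1, 32, 3), nn.ReLU(),
    nn.Conv2d(32, 32, 3), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3), nn.ReLU(),
    nn.Conv2d(64, 64, 3), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.LazyLinear(200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 10),   # softmax is applied inside the loss
)

# Training parameters from Table 7: MNIST uses SGD with lr 0.01 and momentum
# 0.9 (batch size 128, 50 epochs).
mnist_optimizer = optim.SGD(mnist_model.parameters(), lr=0.01, momentum=0.9)

# For Fashion-MNIST/CIFAR-10 (WRN-28-10), the learning rate starts at 0.1 and
# is multiplied by 0.2 at epochs 60, 120 and 160 over 200 epochs. The WRN
# parameters would go here; mnist_model is reused only to keep the sketch
# self-contained.
wrn_optimizer = optim.SGD(mnist_model.parameters(), lr=0.1, momentum=0.9)
wrn_scheduler = optim.lr_scheduler.MultiStepLR(
    wrn_optimizer, milestones=[60, 120, 160], gamma=0.2)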
Magnet. We use the trained Dropout model as the prediction model, and train the Magnet defensive models (reformers and detectors) [24] based on the publicly released Magnet implementation¹. Identical to the settings² presented in the original Magnet paper [24], for MNIST we use Reformer I, Detector I/L2 and Detector II/L1, with the detection threshold set to 0.001. Since Fashion-MNIST was not studied in [24], we use the same model architecture as CIFAR-10 presented in the original Magnet paper [24]. For Fashion-MNIST and CIFAR-10, we use Reformer II, Detector II/L1, Detector II/T10 and Detector II/T40, with a detection threshold (rate of false positives) of 0.004, which results in test error rates comparable to those of the other schemes.

¹ https://github.com/Trevillie/MagNet
² Regarding the Detector settings, a small discrepancy exists between the paper and the released source code. After confirming with the authors, we follow what is given by the source code.
Random Spiking with standard model (RS-1). A Random Spiking (RS) layer is added after the first convolution layer in the standard architecture. We choose p = 0.8, so that 20% of all neuron outputs are randomly spiked.
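As a rough, training-time-only illustration of where such a layer sits and what it does, the sketch below keeps each activation with probability p = 0.8 and replaces it with random noise otherwise. The noise distribution, its range, and the behavior at inference time are placeholders; the authoritative definition of Random Spiking is the one given in the main text.

import torch
import torch.nn as nn

class RandomSpikingSketch(nn.Module):
    # Illustrative only: with probability 1 - p, a unit's output is replaced
    # by random noise (uniform noise over an assumed range is used here).
    def __init__(self, p=0.8, noise_low=-1.0, noise_high=1.0):
        super().__init__()
        self.p = p
        self.noise_low = noise_low
        self.noise_high = noise_high

    def forward(self, x):
        if not self.training:
            return x  # assumed pass-through at test time
        keep = torch.bernoulli(torch.full_like(x, self.p))
        noise = torch.empty_like(x).uniform_(self.noise_low, self.noise_high)
        return keep * x + (1.0 - keep) * noise

# In RS-1, the layer is inserted right after the first convolution layer.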
Random Spiking with Dropout (RS-1-Dropout). We add the RS layer to the Dropout scheme. All other parameters are identical to what we used for RS-1. We also use RSD-1 as a shorthand to refer to this scheme.

Distillation. We use the same network architecture and parameters as we did for the training of Dropout models. Identical to the configuration used in [9], we train with temperature T = 100 and test with T = 1 for all three datasets.

Region-based Classification (RC). We use the Dropout models for RC. For each test example, we generate t additional examples, where for each pixel a noise value is randomly chosen from (−r, r) and added to it. Prediction is then made with majority voting on the t input examples. Identical to the original RC paper [6], we use t = 10,000 for MNIST and t = 1,000 for CIFAR-10. We also use t = 1,000 for Fashion-MNIST. We choose values for r (r = 0.4 for MNIST, and r = 0.02 for Fashion-MNIST and CIFAR-10) so that the test errors are comparable to those of the other mechanisms.

Adversarial Dropout (Dropout-Adv). To use adversarial training with Dropout, we leverage the trained Dropout model from before as the target model for generating adversarial examples. We generated 12,000 adversarial examples for each Dropout model by perturbing training instances. To ensure that the adversarial examples indeed should be classified under the original label, we sort the adversarial examples according to their L2 distances in ascending order, and add only the first 10,000 examples into the training dataset. These examples have L2 distances lower than the cutoff mentioned earlier. We then apply the Dropout training procedure as described before on the new training dataset.

Adversarial Random Spiking (RS-1-Adv). For this adversarial training method, we use RS-1 as the target model. The training parameters and procedure are largely identical to what was described for Dropout-Adv above.

A.2 Proof of Proposition 1

Proof sketch. The Monte Carlo sampling used in the RS neural network optimization gives an unbiased estimate of the gradient

\sum_{\forall b} \int_{\epsilon} \frac{\partial}{\partial W} L\big(y, \hat{y}(x, b, \epsilon, W)\big)\, p(b) f(\epsilon)\, d\epsilon
= \frac{\partial}{\partial W} \sum_{\forall b} \int_{\epsilon} L\big(y, \hat{y}(x, b, \epsilon, W)\big)\, p(b) f(\epsilon)\, d\epsilon,

with the above equality given by the linearity of the expectation and integral operators. That is, the RS neural network optimization is a Robbins-Monro stochastic optimization [4] that minimizes

W' = \operatorname*{argmin}_{W} \sum_{\forall b} \int_{\epsilon} L\big(y, \hat{y}(x, b, \epsilon, W)\big)\, p(b) f(\epsilon)\, d\epsilon.

Since L is convex in \hat{y}, by Jensen's inequality

\sum_{\forall b} \int_{\epsilon} L\big(y, \hat{y}(x, b, \epsilon, W)\big)\, p(b) f(\epsilon)\, d\epsilon
\ge L\Big(y, \sum_{\forall b} \int_{\epsilon} \hat{y}(x, b, \epsilon, W)\, p(b) f(\epsilon)\, d\epsilon\Big) \equiv L\big(y, \hat{y}(x, W)\big).

Thus, the RS neural network minimizes an upper bound of the loss of the ensemble RS model ŷ(x, W), yielding a proper variational inference procedure [3]. □
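The Jensen step above can be sanity-checked numerically. The toy example below (squared-error loss, Bernoulli b, Gaussian ϵ, and an arbitrary scalar predictor) is not part of the proof; it only illustrates that the average of the per-sample losses upper-bounds the loss of the averaged prediction.

import numpy as np

rng = np.random.default_rng(0)

def y_hat(x, b, eps, w):
    # Toy randomized predictor standing in for the RS network output.
    return w * x * b + eps

def loss(y, pred):
    return (y - pred) ** 2  # squared error: convex in the prediction

x, y, w = 1.5, 2.0, 0.7
b = rng.binomial(1, 0.8, size=100_000)      # Bernoulli mask, p = 0.8
eps = rng.normal(0.0, 0.1, size=100_000)    # spiking noise

preds = y_hat(x, b, eps, w)
avg_of_losses = loss(y, preds).mean()       # E[ L(y, yhat(x, b, eps, W)) ]
loss_of_avg = loss(y, preds.mean())         # L(y, E[ yhat(x, b, eps, W) ])
assert avg_of_losses >= loss_of_avg         # Jensen's inequality, as in Prop. 1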