Random Spiking and Systematic Evaluation of Defenses Against Adversarial Examples

Huangyi Ge, Sze Yiu Chau, Bruno Ribeiro, Ninghui Li
Purdue University
[email protected] [email protected] [email protected] [email protected]
Generating Random Noises. To implement Random Spiking, we have to decide how to sample the noises that are used to replace the unit outputs. Sampling from a distribution with a fixed range is problematic, because the impact of the noise depends on the distribution of the other values in the same layer. If a random perturbation is too small compared to the other values in the same layer, then its randomization effect is too small. If, on the other hand, the magnitude of the noise is significantly larger than the other values, it overwhelms the network. In our approach, we compute the minimum and maximum value among all values in the layers to be filtered, and sample a value uniformly at random in that range. Since NN training is typically done using mini-batches, the minimum and maximum values are computed from the whole batch.
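To make this concrete, the following is a minimal sketch of such a noise-replacement layer in PyTorch-style Python. The class name RandomSpiking, the keep-probability argument p, and the use of the module's training flag are our own illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class RandomSpiking(nn.Module):
    """Sketch of a Random Spiking layer: with probability 1 - p, a unit's
    output is replaced by noise drawn uniformly from the [min, max] range of
    the current mini-batch of this layer's inputs (an assumed reading of the
    description above, not the authors' reference code)."""

    def __init__(self, p=0.8):
        super().__init__()
        self.p = p  # probability of keeping the original activation

    def forward(self, x):
        if not self.training:
            return x  # "Single pred.": no spiking at test time
        # The noise range follows the values observed over the whole batch.
        lo, hi = x.min(), x.max()
        noise = torch.empty_like(x).uniform_(lo.item(), hi.item())
        # b_i = 1 (spike) with probability 1 - p, independently per unit.
        keep = (torch.rand_like(x) < self.p).float()
        return keep * x + (1.0 - keep) * noise
```

In the paper's notation, `keep` plays the role of 1 − b: entries where b_i = 1 are replaced by the uniformly sampled noise ϵ_i.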
Monte Carlo Random Spiking as a Model Ensemble. For testing, we can use the Monte Carlo decision procedure of running the network multiple times and using the average. This has attractive theoretical guarantees, at the cost of overhead at decision time, since the NN needs to be computed multiple times for one instance. We now show that Monte Carlo Random Spiking approximates a model ensemble. Let (x, y) be a training example, where x is an image and y is the image's one-hot encoded label. Consider an RS neural network with softmax output ŷ(x, b, ϵ, W), neuron weights W, and spike parameters b and ϵ, where bit b_i = 1 indicates that the i-th hidden neuron of the RS layer gives out a noise output ϵ_i ∈ ℝ sampled with density f(ϵ); otherwise b_i = 0 and the output of the RS layer is a copy of its i-th input from the previous layer (i.e., the original value of the neuron). By construction, b_i = 1 with probability 1 − p, independent of the other RS neurons. Let L(y, ŷ) be a convex loss function over ŷ, such as the cross-entropy loss, the negative log-likelihood, or the square error loss. Then, the following proposition holds:

Proposition 1. Consider the ensemble RS model

    ŷ(x, W) ≡ Σ_{∀b} ∫_ϵ ŷ(x, b, ϵ, W) p(b) f(ϵ) dϵ,    (2)

where f is a density function and p(b) is the probability that bit vector b is sampled. Training the RS neural network minimizes an upper bound of the loss L(y, ŷ(x, W)) of this ensemble model.
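For illustration, the ensemble output in Eq. (2) can be estimated at test time by plain Monte Carlo averaging of softmax vectors over repeated stochastic forward passes; this is the "MC Avg." procedure used in the experiments below. The helper name mc_average_predict and the way stochastic layers are kept active are our own assumptions.

```python
import torch

def mc_average_predict(model, x, runs=10):
    """Estimate the ensemble prediction of Eq. (2) by averaging the softmax
    outputs of `runs` forward passes, each with freshly sampled b and eps."""
    model.train()  # keep stochastic layers (Random Spiking / Dropout) active
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(runs)]
        )
    return probs.mean(dim=0)  # averaged probability vector for each input
```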
Adaptive Attack against Random Spiking. Since Random Spiking introduces randomness during training, an adaptive attacker who knows that Random Spiking has been deployed, but is unaware of the exact parameters of the target model, can train multiple surrogate models and try to generate adversarial examples that simultaneously cause all of these models to misbehave. That is, the multi-surrogate attack with validation is a natural adaptive attack against Random Spiking, and against any other defense mechanism that relies on randomness during training. In this attack, one uses the probabilities from all surrogate models to generate the adversarial example. This is similar to the Expectation over Transformation (EOT) [2] approach for generating adversarial examples.

5 EXPERIMENTAL EVALUATION
We present experimental results comparing the various defense mechanisms using our proposed approach.

5.1 Dataset and Model Training
For our experiments, we use the following 3 datasets: MNIST [21], Fashion-MNIST [38], and CIFAR-10 [20]. Table 1 gives an overview of their characteristics.

We consider 9 schemes equipped with different defense mechanisms, all of which share the same network architectures and training parameters. For MNIST, we follow the architecture given in the C&W paper [9]. Fashion-MNIST has not been studied in the literature in an adversarial setting, and the model architectures used for CIFAR-10 in previous papers delivered fairly low accuracy. Thus, for Fashion-MNIST and CIFAR-10, we use the state-of-the-art WRN-28-10 instantiation of the wide residual networks [42], and we are able to achieve state-of-the-art test accuracy with these architectures. Some of these mechanisms have adjustable parameters, and we choose values for these parameters so that the resulting models have a comparable level of accuracy on the testing data. As a result, all 9 schemes incur only a small drop in accuracy.
Table 2: Test errors (mean±std).

                              MNIST           Fashion-MNIST    CIFAR-10
Standard      Single pred.    0.77 ± 0.05%    4.94 ± 0.19%     4.38 ± 0.21%
Dropout       MC Avg.         0.67 ± 0.07%    4.75 ± 0.09%     4.46 ± 0.25%
Distillation  MC Avg.         0.78 ± 0.05%    4.81 ± 0.18%     4.33 ± 0.27%
RS-1          MC Avg.         0.88 ± 0.09%    5.34 ± 0.10%     5.59 ± 0.22%
RS-1-Dropout  MC Avg.         0.71 ± 0.07%    5.32 ± 0.17%     5.81 ± 0.27%
RS-1-Adv      MC Avg.         0.98 ± 0.11%    5.49 ± 0.16%     6.20 ± 0.40%
Magnet        Det. Thrs.      0.001           0.004            0.004
              MC Avg.         0.87 ± 0.06%    5.36 ± 0.17%     5.52 ± 0.24%
Dropout-Adv   MC Avg.         0.69 ± 0.07%    4.76 ± 0.11%     4.71 ± 0.19%
RC            L2 noise        0.4             0.02             0.02
              Voting          0.77 ± 0.11%    5.39 ± 0.23%     5.72 ± 0.46%

Table 3: Parameters used for generating adversarial examples. The values for K reported here were chosen so that the generated examples would fit a predetermined L2 cut-off.

Dataset          L2 cut-off    Working confidence values (K)    Examples for each K (n)
MNIST            3.0           {0, 5, 10, 15}                   3000
Fashion-MNIST    1.0           {0, 20, 40, 60}                  3000
CIFAR-10         1.0           {0, 20, 40, 60, 80, 100}         2000
Table 2 gives the test errors, and Tables 6 and 7 in the Appendix give the details of the model architectures and training parameters. When a scheme uses either Dropout or Random Spiking, we consider 3 possible decision procedures at test time. By "Single pred.", we mean that dropout and random spiking are not used at test time. By "Voting", we mean running the network with Dropout and/or Random Spiking 10 times and using majority voting for the decision (with ties decided in favor of the label with the smaller index). By "MC Avg.", we mean using Definition 2: running the network with Dropout and/or Random Spiking 10 times and averaging the 10 probability vectors. For each scheme, we train 16 models (with different initial parameter values) on each dataset, and report the mean and standard deviation of their test accuracy. We observe that using Voting or MC Avg., one can typically achieve a slight reduction in test error.
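A sketch of the "Voting" procedure, including the tie-breaking rule in favor of the smaller label index, is shown below; it follows the same convention as the mc_average_predict sketch above for keeping the stochastic layers active, and the helper name voting_predict is ours.

```python
import torch

def voting_predict(model, x, runs=10, num_classes=10):
    """Majority vote over `runs` stochastic predictions; ties are broken in
    favor of the label with the smaller index ("Voting" in the paper)."""
    model.train()  # keep Random Spiking / Dropout layers active
    with torch.no_grad():
        votes = torch.stack(
            [model(x).argmax(dim=-1) for _ in range(runs)]
        )                                              # shape: (runs, batch)
    # Count the votes per class with a one-hot sum.
    counts = torch.nn.functional.one_hot(votes, num_classes).sum(dim=0)
    # torch.argmax returns the first maximal index, so the smaller label
    # wins when vote counts are tied.
    return counts.argmax(dim=1)
```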
5.1.1 Adversarial training. Two defense mechanisms require training with adversarial examples, which are generated by applying the C&W L2 targeted attack on a target model, using randomly sampled training instances and target class labels.

5.1.2 Upper Bounds on Perturbation. For each dataset, we generated thousands of adversarial examples with varying confidence values for each training scheme, and sorted them according to the amount of added perturbation, measured in L2. We observed that, for instances with a high amount of perturbation, one can visually discern the intention of the adversarial example. We thus chose a cut-off upper bound on the L2 distance. The chosen L2 cut-off bounds are given in Table 3, and are used as upper limits in many of our later experiments. With the bounds on L2 fixed, we then empirically determined an upper bound for the confidence value to be used in the C&W-L2 attacks when generating adversarial examples for training purposes. To diversify the set of generated adversarial examples, we sample several different confidence values within the bound, which are also reported in Table 3.

Appendix A.1 provides additional details on training for each defense scheme.

5.2 White-box Evaluation
We first evaluate the effectiveness of the defense mechanisms under white-box attacks. We apply the C&W white-box attack with confidence 0 to generate targeted adversarial examples, and measure the L2 distance of the generated adversarial examples. We consider both the single-model attack, where the adversarial example targets a single model, and the multi-8 attack, where the adversarial example aims at attacking 8 similarly trained models at the same time. The latter can be considered a form of ensemble white-box attack [22].

Table 4: C&W targeted adversarial examples, L2 (mean±std), when attacking a single model.

                MNIST          Fashion-MNIST    CIFAR-10
Standard        2.12 ± 0.69    0.12 ± 0.08      0.17 ± 0.08
Dropout         1.80 ± 0.52    0.14 ± 0.07      0.17 ± 0.08
Distillation    2.02 ± 0.63    0.13 ± 0.07      0.17 ± 0.07
RS-1            2.06 ± 0.76    0.31 ± 0.16      0.32 ± 0.14
RS-1-Dropout    1.79 ± 0.86    0.36 ± 0.21      0.32 ± 0.15
RS-1-Adv        2.36 ± 0.80    0.56 ± 0.30      0.39 ± 0.18
Magnet          2.22 ± 0.65    0.28 ± 0.15      0.29 ± 0.21
Dropout-Adv     2.44 ± 0.66    0.33 ± 0.15      0.18 ± 0.07

Table 5: C&W adversarial examples, L2 (mean±std), with the Multi 8 attack strategy.

                MNIST          Fashion-MNIST    CIFAR-10
Standard        2.50 ± 0.77    0.22 ± 0.15      0.25 ± 0.10
Dropout         2.29 ± 0.65    0.25 ± 0.13      0.26 ± 0.10
Distillation    2.37 ± 0.71    0.24 ± 0.14      0.33 ± 0.13
RS-1            2.77 ± 0.82    0.54 ± 0.25      0.49 ± 0.18
RS-1-Dropout    2.77 ± 0.93    0.61 ± 0.30      0.51 ± 0.18
RS-1-Adv        3.18 ± 0.88    1.04 ± 0.44      0.64 ± 0.23
Magnet          2.68 ± 0.75    0.54 ± 0.25      0.47 ± 0.24
Dropout-Adv     2.93 ± 0.70    0.57 ± 0.23      0.29 ± 0.10

Tables 4 and 5 present the average L2 distances of the generated adversarial examples. RS-1-Adv results in models that are the most difficult to attack, requiring on average the highest perturbations (measured in L2 distance) among all evaluated defenses. Compared to the other methods, adversarial examples generated against RS-1 and RS-1-Dropout have either higher or comparable amounts of distortion. These results again suggest that RS offers additional protection against adversarial examples.
[Figure 1: Evaluating model stability with Gaussian noise. Three panels, Prediction Stability on (a) MNIST, (b) Fashion-MNIST, and (c) CIFAR-10, plot the percentage of unchanged predictions against the amount of Gaussian noise added (L2 distance), for Standard, Dropout, Distillation, ADV, RC, Magnet, RS-1, RSD-1, and RS-1-ADV. Figure 2, with the same panels and legend, shows prediction stability under JPEG compression.]

5.3 Model Stability
Given a benign image and its variants with added noise, a more robust model should intuitively be able to tolerate a higher level of noise without changing its prediction. We refer to this property as model stability. Here we evaluate whether the models produced by a defense mechanism can correctly label instances that are perturbed. This serves several purposes. First, in [15], it is suggested that vulnerability to adversarial examples and low performance on randomly corrupted images, such as images with additive Gaussian
noise, are two manifestations of the same underlying phenomenon. Hence, it is suggested that defenses against adversarial examples should consider robustness under such perturbations, as robustness under such perturbations is also an indication of resistance against adversarial attacks. Second, evaluating stability is identified in [1, 7] as a way to check whether a defense relies on obfuscated gradients: for such a defense, random perturbation may discover adversarial examples where optimized search based on gradients fails. Third, some defense mechanisms (such as Magnet) rely on detecting whether an instance belongs to the same distribution as the training set, and consider an instance to be an adversarial example if it does not. However, when an input instance goes through some transformation that has little impact on human visual perception (such as JPEG compression), it may be flagged as an adversarial example by such a defense. This will impact the accuracy of deployed systems, as the encountered instances may not always follow the training distribution.

5.3.1 Stability with Added Gaussian Noise. We measure how many predictions would change if a certain amount of Gaussian noise is introduced to a set of benign images. For a given dataset and model, we use the first 1,000 images from the test dataset. We first make predictions on these images and store the results as reference predictions. Then, for each selected image and chosen L2 distance, we sample Gaussian noise, scale it to the desired L2 value, and add the noise to the image. Pixel values are clipped if necessary, to make sure the noisy variant is a valid image. We repeat this process 20 times (noise sampled independently per iteration).
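The noise-generation step can be summarized by the short sketch below, which scales a standard Gaussian sample to an exact L2 norm and clips the result to the valid pixel range; the function name noisy_variant and the [0, 1] pixel range are our assumptions.

```python
import numpy as np

def noisy_variant(image, l2, rng=None):
    """Return a copy of `image` perturbed by Gaussian noise scaled to have
    L2 norm `l2`, clipped back to the valid pixel range [0, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.standard_normal(image.shape)
    noise *= l2 / np.linalg.norm(noise)  # rescale to the desired L2 distance
    return np.clip(image + noise, 0.0, 1.0)
```

Stability at a given L2 is then the fraction of noisy variants whose predicted label matches the reference prediction.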
Fig. 1 shows the effect of Gaussian noise on prediction stability for each training method (averaged over the 16 models trained in Sec. 5.1). Model stability inevitably drops for each scheme as the amount of Gaussian noise, measured by L2, increases. However, different schemes behave differently as L2 increases.

For MNIST, most schemes have stability above 99%, even when L2 is as large as 5. However, Magnet has stability approaching 0 when the L2 distance is greater than 1, because the majority of those instances are rejected by Magnet.

For Fashion-MNIST, we see more interesting differences among the schemes. The two approaches with the highest stability are the two with adversarial training: when L2 = 2.5, RS-1-ADV has stability 87.4%, and ADV has stability 86%. The other schemes have stability around 60%; among them, RS-1 and RSD-1 have slightly higher stability than the others.

For CIFAR-10, we see that RS-1-ADV, RSD-1, and RS-1 have the highest stability as the amount of noise increases. When L2 = 2.5, they have stability 87.9%, 81.7%, and 83%, respectively. The other schemes have stability 70% or lower.

Furthermore, on all datasets, RS-1-ADV, RSD-1, and RS-1 give consistent results. Recall that we trained 16 models for each scheme; Fig. 1 also plots the standard deviation of the stability results of the 16 models. RS-1-ADV, RSD-1, and RS-1 have very low standard deviations, which in turn suggests more consistent behavior when facing perturbed images.

5.3.2 Stability with JPEG compression. Given a set of benign images, we measure how many predictions would change if JPEG compression is applied to the images. For a given test dataset and model, we compare the predictions on the benign test dataset (reference predictions) with the predictions on the JPEG-compressed test dataset at a fixed JPEG compression quality (JCQ). For the sake of time efficiency, for this particular set of experiments, we reduced the number of iterations used by RC to one-tenth of the original number.

Fig. 2 shows the effect of JPEG compression on prediction stability for each training method (averaged over the 16 models trained). Model stability decreases for each scheme as the JCQ (which ranges over 10−100) decreases.
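A minimal sketch of the JPEG re-encoding step is given below, using Pillow to compress each image at a chosen quality; the function name jpeg_variant and the use of Pillow are our own choices, not necessarily the toolchain used by the authors.

```python
import io
import numpy as np
from PIL import Image

def jpeg_variant(image, quality):
    """Re-encode a [0, 1] float image array as JPEG at the given quality
    (the JCQ in the paper) and return the decompressed result."""
    img = Image.fromarray((image * 255).astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf), dtype=np.float32) / 255.0
```

Stability at a given JCQ is then the fraction of test images whose predicted label is unchanged between the original and the re-encoded version.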
For MNIST, most schemes achieve stability over 99%, even when the JCQ is 10. Magnet is the outlier: it has a stability of around 50% when the JCQ is 70, and of less than 20% when the JCQ is less than or equal to 40, because of MagNet's high rejection rate. We believe that both of these results are related to the fact that MNIST images have black backgrounds that span most of the image. The noise introduced by JPEG compression results mostly in perturbations in the background, which are ignored by most NN models. Since Magnet uses autoencoders to detect deviations from the input distribution, these perturbations trigger detection. Since Magnet aims at detecting perturbed images, this should not be considered a weakness of Magnet.

For Fashion-MNIST, we see that RS-1-ADV, RSD-1, and RS-1 outperform the other schemes in stability as the JCQ decreases. When JCQ = 10, they have stability 85.2%, 80.9%, and 84.4%, respectively. The other schemes have stability 80% or lower; the closest to the RS class among the other schemes is ADV.

For CIFAR-10, we see that RS-1-ADV, RSD-1, and RS-1 have the highest stability as the JCQ decreases. When JCQ = 10, they have stability 60.9%, 55.6%, and 55.4%, respectively. The other schemes have stability 50% or lower; the highest among the other schemes is RC.
5.4 Evaluating Attack Strategies
Here we empirically show that our proposed attack strategy, as presented in Sec. 3.2, can indeed generate adversarial examples that are more transferable. In attacks like the C&W attack, a higher confidence value will typically lead to more transferable examples, but the amount of perturbation usually increases as well, sometimes making the example noticeably different under human perception.

Intuitively, a better attack strategy should give more transferable adversarial examples using less distortion. Hence we use distortion versus transferability to compare 3 possible attack strategies. As in previous experiments, we measure the amount of distortion using the L2 distance. In Fig. 3 we present the effectiveness of each attack strategy, averaged across the 9 schemes.

The first strategy we evaluate is a standard C&W attack that generates adversarial examples using only one surrogate model, dubbed "Single". Recall that for each training/defense method, we have 16 models that are surrogates of each other (Sec. 5.1). For each surrogate model, we randomly select half of the original dataset as its training dataset, since the adversary may not have full knowledge of the training dataset in the transfer attack setting. For the Single strategy, we apply the C&W attack on 4 of the models independently to generate a pool of adversarial examples. The transferability of those examples is then measured and averaged over the remaining 12 target models. Regardless of the training methods and defense mechanisms in place, adversarial examples generated using the Single strategy often have limited transferability, especially when the allowed amount of distortion (L2 distance) is small.

The second attack strategy that we evaluate generates adversarial examples using multiple surrogate models. For this, we use 8 of the 16 surrogate models for generating attack examples. The C&W attack can be adapted to handle this case with a slightly different loss function: in our experiments, we use the sum of the loss functions of the 8 surrogate models as the new loss function, as sketched below. We also use slightly lower confidence values than in Sec. 5.1.1 ({0, 10, 20, 30} for Fashion-MNIST, {0, 20, 40, 60} for CIFAR-10). The transferability of the generated adversarial examples is then measured and averaged over the remaining 8 models as targets. We refer to this strategy as "Multi 8". As shown in Fig. 3, given the same limit on the amount of distortion (L2 distance), a significantly higher percentage of the examples generated using the Multi 8 strategy are transferable than of those found using the Single strategy.
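The following sketch shows how the multi-surrogate objective can be formed by summing a C&W-style margin loss over the surrogate models. It is a simplified illustration of the combined loss only (the full C&W optimization loop, box constraints, and binary search over the trade-off constant are omitted), and the function names cw_margin_loss and multi_surrogate_loss are ours.

```python
import torch

def cw_margin_loss(logits, target, confidence=0.0):
    """C&W-style targeted margin: push the target logit above the best
    competing logit by at least `confidence`."""
    target_logit = logits.gather(1, target.unsqueeze(1)).squeeze(1)
    other = logits.clone()
    other.scatter_(1, target.unsqueeze(1), float("-inf"))
    best_other = other.max(dim=1).values
    return torch.clamp(best_other - target_logit + confidence, min=0.0)

def multi_surrogate_loss(models, x_adv, target, confidence=0.0):
    """Sum of the per-model losses, so the perturbation must fool all
    surrogate models simultaneously (the "Multi 8" strategy)."""
    return sum(
        cw_margin_loss(m(x_adv), target, confidence).sum() for m in models
    )
```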
Additionally, we evaluate a third attack strategy that is based on Multi 8. As discussed in Sec. 3.2, given enough surrogate models, one can further use some of them for validating adversarial examples. For the adversarial examples generated by the Multi 8 strategy, we keep only those that transfer to at least 5 of the 7 validation models; hence we refer to this strategy as Multi 8 & Passing 5/7 Validation. The remaining model is used as the attack target, and we measure the transferability of the examples that passed the 5/7 validation. For this attack strategy, the measurements shown in Fig. 3 are the average of 8 rotations between the target model and the validation models. Compared to Multi 8 and Single, adversarial examples that passed the 5/7 validation are significantly more likely to transfer to the target model, even when the amount of perturbation is small.
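A sketch of the validation-filtering step is shown below; the predicate name passes_validation and the threshold argument are illustrative names for the rule just described.

```python
import torch

def passes_validation(x_adv, target_label, validation_models, needed=5):
    """Keep an adversarial example (a single image with batch dimension 1)
    only if at least `needed` of the validation models are fooled, i.e.,
    they predict the attacker's chosen target label."""
    with torch.no_grad():
        fooled = sum(
            int(m(x_adv).argmax(dim=-1).item() == target_label)
            for m in validation_models
        )
    return fooled >= needed
```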
This shows that simple strategies like Single do not realize the full potential of a resourceful attacker, and that our proposed attack strategy of using multiple models for the generation and validation of adversarial examples is indeed superior. In the rest of this section, we use the most effective attack strategy, Multi 8 & Passing 5/7 Validation.

[Figure 3: Transferability of the Single, Multi 8, and Multi 8 & Passing 5/7 Validation strategies as a function of the allowed L2 distortion, with one panel each for MNIST, Fashion-MNIST, and CIFAR-10.]

5.5 Translucent-box Evaluation
Here we evaluate the effectiveness of the different schemes based on the transferability of adversarial examples generated using the Multi 8 & Passing 5/7 Validation attack strategy.

The results of our translucent-box evaluation are shown in Fig. 4. Adversarial examples are grouped into buckets based on their L2 distance. For each bucket, we use grayscale to indicate the average validation passing rate for each scheme; passing rates from 0% to 100% are mapped linearly to pixel values from 0 to 255. There are four rows, each corresponding to adversarial examples within a certain L2 range. Each column illustrates to what extent a target defense scheme resists adversarial examples generated by attacking different methods.

Examining the columns for Standard and Dropout, we can see that Standard and Dropout are in general the most vulnerable. Distillation and RC are almost equally vulnerable. Magnet can often resist adversarial examples generated by targeting other defenses, but is vulnerable to ones generated specifically to target it.

Overall, across the three datasets, RS-1-Adv performs the best, and is significantly better than Dropout-Adv. This suggests that Random Spiking offers additional protection against adversarial examples. RS-1 and RS-1-Dropout also perform consistently well across the three datasets. RC performs noticeably well on MNIST and Fashion-MNIST, likely because those images are all in 8-bit grayscale; its advantage diminishes on CIFAR-10, which contains images in 24-bit color.

[Figure 4: Translucent-box evaluation grids for MNIST, Fashion-MNIST, and CIFAR-10. Each grid groups adversarial examples by L2 range (0−2.25 through 2.75−3 for MNIST; 0−0.4 through 0.8−1 for Fashion-MNIST and CIFAR-10) and crosses the attacked scheme with the target scheme (Standard, Dropout, Distillation, ADV, RC, Magnet, RS-1, RSD-1, RS-1-ADV, all); grayscale encodes the average passing rate.]
6 RELATED WORK
Other attack algorithms. There are other attack algorithms such as JSMA [26], FGS [16], PGD [23], and DeepFool [25]. The general consensus seems to be that the C&W attack is the current state of the art [8, 9, 13]. Though our evaluation results are based on the C&W attack, our evaluation framework is not tied to a particular attack and can use other algorithms.

Other defense mechanisms. Several other defense mechanisms have been proposed in the literature [6, 24, 27, 39–41]. For example, Xu et al. proposed to use feature squeezing techniques, such as color bit-depth reduction and spatial smoothing of pixels, to detect adversarial examples [41]. Xie et al. proposed to use feature denoising [40] to improve the robustness of neural network models. Due to limits in time and space, we selected representative methods from each broad class (e.g., MagNet for the detection approach). We leave comparison with other mechanisms as future work.

Beyond images. Other research efforts have explored possible attacks against neural network models specialized for other purposes, for example, speech-to-text [10]. We focus on images, although the
evaluation methodology and the idea of random spiking should be applicable to these other domains.

Alternative similarity metrics. Some researchers have argued that Lp norms insufficiently capture human perception, and have proposed alternative similarity metrics such as SSIM [28]. It is, however, not immediately clear how to adapt such metrics to the C&W attack. We leave further investigation of the impact of alternative similarity metrics on adversarial examples for future work.

7 CONCLUSION
In this paper, we present a careful analysis of possible adversarial models for studying the phenomenon of adversarial examples. We propose an evaluation methodology that can better illustrate the strengths and limitations of different mechanisms. As part of the method, we introduce a more powerful and meaningful adversary strategy. We also introduce Random Spiking, a randomized technique that generalizes dropout. We have conducted an extensive evaluation of Random Spiking and several other defense mechanisms, and demonstrate that Random Spiking, especially when combined with adversarial training, offers better protection against adversarial examples than other existing defenses.

ACKNOWLEDGEMENTS
This work is supported by the Northrop Grumman Cybersecurity Research Consortium under a grant titled "Defenses Against Adversarial Examples" and by the United States National Science Foundation under Grant No. 1640374.

REFERENCES
[1] Anish Athalye, Nicholas Carlini, and David Wagner. 2018. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In ICML, Vol. 80. PMLR, Stockholmsmässan, Stockholm, Sweden, 274–283.
[2] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. 2018. Synthesizing Robust Adversarial Examples. In ICML, Vol. 80. PMLR, 284–293.
[3] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. 2017. Variational inference: A review for statisticians. J. Amer. Statist. Assoc. 112, 518 (2017), 859–877.
[4] Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010. Springer, 177–186.
[5] Wieland Brendel, Jonas Rauber, and Matthias Bethge. 2018. Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models. In ICLR.
[6] Xiaoyu Cao and Neil Zhenqiang Gong. 2017. Mitigating evasion attacks to deep neural networks via region-based classification. In ACSAC. ACM, 278–287.
[7] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian J. Goodfellow, Aleksander Madry, and Alexey Kurakin. 2019. On Evaluating Adversarial Robustness. CoRR (2019). arXiv:1902.06705
[8] Nicholas Carlini and David Wagner. 2017. Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods. In Proceedings of the 10th Workshop on Artificial Intelligence and Security. ACM, 3–14.
[9] Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In 2017 IEEE S&P. IEEE, 39–57.
[10] Nicholas Carlini and David Wagner. 2018. Audio Adversarial Examples: Targeted Attacks on Speech-to-Text. Deep Learning and Security Workshop (2018). arXiv:cs.LG/1801.01944
[11] Nicholas Carlini and David A. Wagner. 2017. MagNet and "Efficient Defenses Against Adversarial Attacks" are Not Robust to Adversarial Examples. CoRR abs/1711.08478 (2017). arXiv:1711.08478 http://arxiv.org/abs/1711.08478
[12] Jianbo Chen and Michael I. Jordan. 2019. Boundary Attack++: Query-Efficient Decision-Based Adversarial Attack. (2019). arXiv:1904.02144
[13] Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. 2018. EAD: Elastic-Net Attacks to Deep Neural Networks via Adversarial Examples. AAAI Conference on Artificial Intelligence (2018).
[14] Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In ICML, Vol. 48. 1050–1059.
[15] Justin Gilmer, Nicolas Ford, Nicholas Carlini, and Ekin Cubuk. 2019. Adversarial Examples Are a Natural Consequence of Test Error in Noise. In ICML, Vol. 97. 2280–2289.
[16] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and Harnessing Adversarial Examples. In ICLR.
[17] James J Heckman. 1977. Sample selection bias as a specification error (with an application to the estimation of labor supply functions). (1977).
[18] G. Hinton, O. Vinyals, and J. Dean. 2014. Distilling the Knowledge in a Neural Network. NIPS 2014 Deep Learning and Representation Learning Workshop (2014).
[19] D. P. Kingma and J. Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
[20] Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Technical Report.
[21] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. 1998. The MNIST database of handwritten digits. (1998). http://yann.lecun.com/exdb/mnist/
[22] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. 2017. Delving into Transferable Adversarial Examples and Black-box Attacks. In ICLR.
[23] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In ICLR.
[24] Dongyu Meng and Hao Chen. 2017. MagNet: a two-pronged defense against adversarial examples. In CCS. ACM, 135–147.
[25] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. 2016. DeepFool: a simple and accurate method to fool deep neural networks. In CVPR. 2574–2582.
[26] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. 2016. The limitations of deep learning in adversarial settings. In 2016 EuroS&P. IEEE, 372–387.
[27] Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. 2016. Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks. In 2016 IEEE S&P. 582–597.
[28] Mahmood Sharif, Lujo Bauer, and Michael K. Reiter. 2018. On the Suitability of Lp-norms for Creating and Preventing Adversarial Examples. IEEE CVPRW.
[29] Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90, 2 (2000), 227–244.
[30] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[31] Masashi Sugiyama, Neil D Lawrence, Anton Schwaighofer, et al. 2017. Dataset shift in machine learning. The MIT Press.
[32] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V. Buenau, and Motoaki Kawanabe. 2008. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems. 1433–1440.
[33] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In International Conference on Learning Representations.
[34] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. 2019. Robustness may be at odds with accuracy. In ICLR.
[35] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and Composing Robust Features with Denoising Autoencoders. In ICML. ACM, 1096–1103.
[36] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. In ICML, Vol. 11. 3371–3408.
[37] Siyue Wang, Xiao Wang, Pu Zhao, Wujie Wen, David Kaeli, Peter Chin, and Xue Lin. 2018. Defensive Dropout for Hardening Deep Neural Networks Under Adversarial Attacks. In ICCAD. ACM, Article 71, 8 pages.
[38] Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. CoRR abs/1708.07747 (2017). arXiv:1708.07747
[39] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan L. Yuille. 2018. Mitigating adversarial effects through randomization. In ICLR.
[40] Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan L. Yuille, and Kaiming He. 2019. Feature Denoising for Improving Adversarial Robustness. In CVPR.
[41] Weilin Xu, David Evans, and Yanjun Qi. 2018. Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks. In NDSS.
[42] Sergey Zagoruyko and Nikos Komodakis. 2016. Wide Residual Networks. In BMVC.
[43] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. 2019. Theoretically Principled Trade-off between Robustness and Accuracy. In ICML. 7472–7482.
A APPENDIX

A.1 More Information on Training
Model architectures and training parameters are given in Tables 6 and 7. For MNIST, we use the same architecture as in MagNet [9, 24]. For Fashion-MNIST and CIFAR-10, they are identical to the WRN [42]. Test errors of the nine schemes are given in Table 2. For each scheme and dataset, we train 16 models and report the mean and standard deviation. Additional information is provided below.

Table 6: Model Architectures. We use WRN-28-10 for Fashion-MNIST and CIFAR-10 (k = 10, N = 4).

MNIST:
  Conv.ReLU 3 × 3 × 32
  Conv.ReLU 3 × 3 × 32
  Max Pooling 2 × 2
  Conv.ReLU 3 × 3 × 64
  Conv.ReLU 3 × 3 × 64
  Max Pooling 2 × 2
  Dense.ReLU 200
  Dense.ReLU 200
  Softmax 10

Fashion-MNIST / CIFAR-10 (WRN-28-10):
  Group     Output size (F-MNIST / CIFAR-10)    Kernel, Feature
  Conv1     28 × 28 / 32 × 32                   [3 × 3, 16]
  Conv2     28 × 28 / 32 × 32                   [3 × 3, 16 × k; 3 × 3, 16 × k] × N
  Conv3     14 × 14 / 16 × 16                   [3 × 3, 32 × k; 3 × 3, 32 × k] × N
  Conv4     7 × 7 / 8 × 8                       [3 × 3, 64 × k; 3 × 3, 64 × k] × N
  Softmax   10

Table 7: Training Parameters.

Parameters             MNIST    Fashion-MNIST & CIFAR-10
Optimization Method    SGD      SGD
Learning Rate          0.01     0.1 initial, multiplied by 0.2 at 60, 120, and 160 epochs
Momentum               0.9      0.9
Batch Size             128      128
Epochs                 50       200
Dropout (optional)     0.5      0.1
Data Augmentation      -        Fashion-MNIST: shifting + horizontal flip; CIFAR-10: shifting + rotation + horizontal flip + zooming + shear
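For concreteness, the MNIST architecture of Table 6 can be written out as the following PyTorch module; this is a direct transcription of the listed layers, with the unpadded 3 × 3 convolutions and the softmax folded into the loss being our own assumptions.

```python
import torch.nn as nn

# MNIST architecture from Table 6: two 3x3x32 conv layers, max pooling,
# two 3x3x64 conv layers, max pooling, two dense layers of 200 units,
# and a 10-way output (softmax applied by the loss / at inference).
mnist_model = nn.Sequential(
    nn.Conv2d(1, 32, 3), nn.ReLU(),
    nn.Conv2d(32, 32, 3), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3), nn.ReLU(),
    nn.Conv2d(64, 64, 3), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 10),
)
```

For the RS variants described below, a Random Spiking layer would be inserted after the first convolutional layer.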
Magnet. We use the trained Dropout model as the prediction model, and train the MagNet defensive models (reformers and detectors) [24] based on the publicly released MagNet implementation¹. Identical to the settings² presented in the original MagNet paper [24], for MNIST we use Reformer I, Detector I/L2, and Detector II/L1, with the detection threshold set to 0.001. Since Fashion-MNIST was not studied in [24], we use the same model architecture as for CIFAR-10 presented in the original MagNet paper [24]. For Fashion-MNIST and CIFAR-10, we use Reformer II, Detector II/L1, Detector II/T 10, and Detector II/T 40, with a detection threshold (rate of false positives) of 0.004, which results in test error rates comparable to those of the other schemes.

¹ https://github.com/Trevillie/MagNet
² Regarding the Detector settings, a small discrepancy exists between the paper and the released source code. After confirming with the authors, we follow what is given by the source code.

Random Spiking with standard model (RS-1). A Random Spiking (RS) layer is added after the first convolution layer in the standard architecture. We choose p = 0.8, so that 20% of all neuron outputs are randomly spiked.

Random Spiking with Dropout (RS-1-Dropout). We add the RS layer to the Dropout scheme. All other parameters are identical to what we used for RS-1. We also use RSD-1 as a shorthand to refer to this scheme.

Distillation. We use the same network architecture and parameters as we did for the training of the Dropout models. Identical to the configuration used in [9], we train with temperature T = 100 and test with T = 1 for all three datasets.

Region-based Classification (RC). We use the Dropout models for RC. For each test example, we generate t additional examples, where for each pixel a noise value is randomly chosen from (−r, r) and added to it. The prediction is then made by majority voting over the t input examples. Identical to the original RC paper [6], we use t = 10,000 for MNIST and t = 1,000 for CIFAR-10; we also use t = 1,000 for Fashion-MNIST. We choose the values of r (r = 0.4 for MNIST, and r = 0.02 for Fashion-MNIST and CIFAR-10) so that the test errors are comparable to the other mechanisms.
comparable to those of the other schemes. W ϵ
∀b
Random Spiking with standard model (RS-1). A Random Spik- L is convex on ŷ, then by Jensen’s inequality
ing (RS) layer is added after the first convolution layer in the stan- Õ∫
dard architecture. We choose p = 0.8, so that 20% of all neuron L (y, ŷ(x, b, ϵ,W )) p(b)f (ϵ)dϵ ≥
∀b ϵ
outputs are randomly spiked. !
Õ∫
Random Spiking with Dropout (RS-1-Dropout). We add the
L y, ŷ(x, b, ϵ,W )p(b)f (ϵ)dϵ ≡ L y, ŷ(x,W ) .
RS layer to the Dropout scheme. All other parameters are identical ϵ
∀b
1 https://github.com/Trevillie/MagNet
2 Regarding
Thus, the RS neural network minimizes an upper bound of the loss
of the ensemble RS model ŷ(x,W ), yielding a proper variational
the Detector settings, a small discrepancy exists between the paper and
the released source code. After confirming with the authors, we follow what is given
by the source code. inference procedure [3]. □