New Adversarial Image Detection Based On Sentiment Analysis
Abstract—Deep neural networks (DNNs) are vulnerable to adversarial examples, while adversarial attack models, e.g., DeepFool, are on the rise and outrunning adversarial example detection techniques. This article presents a new adversarial example detector that outperforms state-of-the-art detectors in identifying the latest adversarial attacks on image datasets. Specifically, we propose to use sentiment analysis for adversarial example detection, qualified by the progressively manifesting impact of an adversarial perturbation on the hidden-layer feature maps of a DNN under attack. Accordingly, we design a modularized embedding layer with the minimum learnable parameters to embed the hidden-layer feature maps into word vectors and assemble sentences ready for sentiment analysis. Extensive experiments demonstrate that the new detector consistently surpasses the state-of-the-art detection algorithms in detecting the latest attacks launched against ResNet and Inception neural networks on the CIFAR-10, CIFAR-100, and SVHN datasets. The detector only has about 2 million parameters and takes less than 4.6 ms to detect an adversarial example generated by the latest attack models using a Tesla K80 GPU card.

Index Terms—Adversarial example detection, deep learning, neural network, sentiment analysis.

Manuscript received 3 December 2021; revised 17 February 2023; accepted 3 May 2023. This work was supported in part by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China under Grant 61921003, in part by the National Natural Science Foundation of China under Grant 62072092, and in part by the China Scholarship Council under Grant 201906475002. (Yulong Wang and Tianxiang Li contributed equally to this work.) (Corresponding author: Yulong Wang.)

Yulong Wang and Tianxiang Li are with the State Key Laboratory of Networking and Switching Technology, School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: [email protected]; [email protected]).

Shenghong Li, Xin Yuan, and Wei Ni are with Data61, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Marsfield, Sydney, NSW 2122, Australia (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2023.3274538.

Digital Object Identifier 10.1109/TNNLS.2023.3274538

I. INTRODUCTION

Deep neural networks (DNNs) have demonstrated their excellent performance in image classification, voice recognition, and text categorization. However, recent studies indicate that adversarial instances can undermine DNNs. Specifically, intentionally perturbed inputs, also known as adversarial examples, can mislead DNNs into making highly confident erroneous predictions [1]. The perturbation required is typically imperceptible to human eyes, making the perturbation hard to detect [2]. This undesirable property of DNNs has developed into a significant security concern in real-world applications, such as self-driving cars [3] and identity recognition [4].

A recent and effective approach to detecting adversarial attacks takes the feature maps produced by the hidden layers of a DNN (e.g., a DNN-based image classifier) as input and detects adversarial input examples by measuring the difference between benign and adversarial feature maps [5], [6], [7], [8], [9], [10]. For instance, a detection method named local intrinsic dimensionality (LID) [5] uses the difference of dimension between the subspaces surrounding adversarial examples and clean examples. Another detection method, known as deep k-nearest neighbors (DkNN) [6], applies the k-nearest neighbors (k-NN) technique on feature maps to assess the difference between the feature maps of the input example's k-NN and those of benign examples in the predicted class against a predefined threshold. Nearest neighbor influence function (NNIF) [7] is another popular adversarial example detector, which detects adversarial examples by assessing the correlation between the input example's k-NN and the most influential benign examples identified during training. A Mahalanobis-distance-based algorithm developed in [8] fits the feature maps to a class-conditional Gaussian distribution and then detects adversarial examples by measuring the Mahalanobis distances of the feature maps.

Besides the hidden-layer feature maps of a DNN, be your own neighborhood (BEYOND) [9] uses the output of the DNN to detect adversarial examples. It uses the hidden-layer representations provided by self-supervised learning (SSL) and the DNN's predicted label to examine the relationship between adversarial examples and their augmented versions. Moreover, the positive–negative detector (PNDetector) [10] trains a positive–negative classifier on both the benign examples (positive representations) and their negative representations, which complement the benign examples in each pixel, to identify adversarial examples.

The above existing adversarial example detectors [5], [6], [7], [8], [9], [10] depend primarily on machine learning techniques or hand-crafted measures. Despite performing reasonably well against some mild types of attacks (e.g., FGSM [11] and the Jacobian-based saliency map attack (JSMA) [12]), the existing adversarial example detectors are less effective in detecting mighty attacks, such as DeepFool [13] and elastic-net attacks on DNNs (EAD) [14].

In this article, we propose a new and effective adversarial example detector for DNN-based image classification. The new detector is a shallow neural network with only a few layers and a small number of parameters, and it outperforms the state-of-the-art detectors in identifying the latest attacks, including DeepFool and EAD, on widely used image datasets.
The key idea is that we propose to detect adversarial examples by extracting the progressively and increasingly manifesting impact of adversarial perturbations on the hidden-layer feature maps of the DNN (as opposed to the feature maps only). In light of the progressive manifestation of sentiment in a sentence, we propose to embed the hidden-layer feature maps into word vectors (i.e., a sentence) and detect adversarial examples using sentiment analysis.

Another important aspect is a new and efficient embedding layer that embeds the differently sized, 3-D, hidden-layer feature maps into word vectors with consistent lengths and assembles sentences ready for sentiment analysis. Specifically, a modular design is taken to create a trainable module to match the dimensions between the feature maps of successively selected hidden layers. Then, each feature map can be embedded into a word vector via a cascade of modules, thereby minimizing the number of trainable modules and learnable parameters.

The main contributions of this article are as follows.

1) A new sentiment-analysis-based interpretation of adversarial example detection and a meticulous selection of TextCNN for sentiment analysis, through rigorous experimental comparisons with other candidate neural network structures.

2) A new modular design of an embedding layer, which reshapes and embeds the differently sized, 3-D hidden-layer feature maps of a DNN into word vectors with equal length (and assembles sentences for sentiment analysis) using the minimum number of trainable parameters.

3) Extensive experiments that corroborate the superior effectiveness and generalization ability of the proposed adversarial example detector under the latest adversarial example attacks, compared with the state-of-the-art adversarial example detectors.

The experiments demonstrate that the new adversarial example detector consistently outperforms the state-of-the-art detection algorithms, such as LID [5], DkNN [6], NNIF [7], BEYOND [9], PNDetector [10], and the Mahalanobis algorithm [8], in identifying the latest attacks, including AutoAttack [15], DeepFool [13], and EAD [14], on the CIFAR-10, CIFAR-100, and SVHN datasets. We use the Bhattacharyya distance [16], hidden-layer visualization, and an ablation study to shed insight on the gain of the new detector.

The remainder of this article is organized as follows. Section II reviews the state-of-the-art attacks and detectors. Section III provides the system and threat models. In Section IV, we elaborate on the design of the new sentiment-analysis-based adversarial example detector. The new detector is experimentally examined against the cutting-edge attack models and compared with the state-of-the-art detection algorithms in Section V, followed by concluding remarks in Section VI.

II. RELATED WORK

In this section, we briefly review the latest attacks on DNNs and the state-of-the-art adversarial example detectors. These attack models and detectors are used in our comparison studies with the proposed adversarial example detector, as presented in Section V.

A. Adversarial Example Attack Algorithms

Recently, several attack algorithms for maliciously perturbing images have been proposed for off-the-shelf DNNs [2], [11], [12], [13], [14], [17]. Most attack algorithms exploit the DNN's gradients to obtain a small perturbation. The corresponding attack algorithms are classified as targeted or untargeted attacks, depending on whether the adversarial examples are misclassified to a specific target class or simply misclassified to a different class from their source classes.

FGSM [11] perturbs an image by changing its pixel values toward the direction of increasing the DNN-based image classifier's classification loss. FGSM generates an adversarial example using

$$x + \epsilon\,\operatorname{sign}\bigl(\nabla_x L(x, y)\bigr)$$

where ϵ ∈ R+ is the perturbation magnitude, y indicates the ground-truth class, sign(·) takes the sign of a real value, and ∇x L(x, y) is the gradient of the loss function L(x, y) with regard to the input image x. FGSM runs fast since it only perturbs the input once, but it needs a relatively large perturbation magnitude ϵ for a high attack success rate.

Projected gradient descent (PGD) [2] improves on FGSM by generating an adversarial example iteratively:

$$x_{i+1} = \Pi_{x+S_\epsilon}\bigl(x_i + \alpha\,\operatorname{sign}(\nabla_x L(x, y))\bigr) \tag{1}$$

where i is the index of an iteration, α ≤ ϵ is the perturbation step size, Sϵ ⊆ Rd is the set of allowed perturbations under the maximum perturbation magnitude ϵ, and the projector Π_{x+Sϵ}(·) maps its input to the closest element of the set x + Sϵ. PGD conducts a fine-grained perturbation on images and can achieve a higher attack success rate than FGSM under the same maximum perturbation magnitude, at the cost of a longer running time.
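To make the two update rules above concrete, the following is a minimal PyTorch sketch of one FGSM step and an ℓ∞-bounded PGD loop. The generic classifier, the cross-entropy loss, and the clamping to [0, 1] are illustrative assumptions for image inputs, not the exact configuration used in this article.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step FGSM: x + eps * sign(grad_x L(x, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps, alpha, steps):
    """Iterative PGD of (1): signed-gradient step, then projection onto the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)   # projector onto x + S_eps (l_inf ball)
        x_adv = x_adv.clamp(0, 1).detach()
    return x_adv
```

With ϵ = 8/255, α = 2/255, and 20 steps, the loop matches the PGD configuration quoted for the white-box experiments in Section V-F.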
DeepFool [13] perturbs an image toward the region that is the nearest to the image but belongs to a different class. DeepFool generates an adversarial example by iteratively updating its input with

$$x_{i+1} \leftarrow x_i + \frac{|f'_{\hat{c}}|}{\|w'_{\hat{c}}\|_2^2}\,|w'_{\hat{c}}| \odot \operatorname{sign}\bigl(w'_{\hat{c}}\bigr)$$

until the adversarial example is misclassified or the maximum iteration number is reached. x0 is the benign example without any perturbation. f'_ĉ is the difference between the output of the softmax function of the closest different class ĉ and that of the predicted class of the benign example x0. w'_ĉ is the difference between the gradient of the softmax function of class ĉ and that of the softmax function of the predicted class of the benign example x0. ⊙ is the pointwise product. A softmax function of class ĉ takes an input image and outputs a percentage indicating the confidence that the input belongs to class ĉ. Because DeepFool tends to perturb an image to just cross the classification boundary of the image's original class, DeepFool can generate adversarial examples with considerably small perturbations.
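The quantities f'_ĉ and w'_ĉ can be read directly off a classifier's softmax outputs and their gradients. Below is a minimal sketch of one DeepFool iteration for a single image, following the description above; it is an illustration rather than the reference DeepFool implementation of [13], and the loop over classes is kept deliberately simple.

```python
import torch
import torch.nn.functional as F

def deepfool_step(model, x, pred):
    """One DeepFool iteration for a single image x of shape [1, C, H, W].

    `pred` is the class predicted for the benign image x0. Returns the updated image.
    """
    x = x.clone().detach().requires_grad_(True)
    probs = F.softmax(model(x), dim=1)[0]
    grad_pred = torch.autograd.grad(probs[pred], x, retain_graph=True)[0]

    best = None                                   # (ratio, f'_c, w'_c) of the closest class
    for c in range(probs.shape[0]):
        if c == pred:
            continue
        grad_c = torch.autograd.grad(probs[c], x, retain_graph=True)[0]
        w_c = grad_c - grad_pred                  # w'_c
        f_c = (probs[c] - probs[pred]).item()     # f'_c
        ratio = abs(f_c) / (w_c.norm() + 1e-12)   # distance to the class-c boundary
        if best is None or ratio < best[0]:
            best = (ratio, f_c, w_c)

    _, f_best, w_best = best
    # |f'| / ||w'||_2^2 * |w'| ⊙ sign(w')  (equal to |f'| / ||w'||_2^2 * w')
    step = abs(f_best) / (w_best.norm() ** 2 + 1e-12) * w_best
    return (x + step).clamp(0, 1).detach()
```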
Carlini and Wagner's (C&W) algorithm [17] solves the following optimization problem to obtain the perturbation applied to an image:

$$\min_{\delta}\ \|\delta\|_2 + a\,L(x+\delta, t) \quad \text{s.t.}\ x + \delta \in [0, 1]^P$$

where δ is the perturbation on the image x; a is a constant specified a priori by running a variant of binary search; L(·, ·) is one of the seven loss functions specified in C&W, such that x + δ is misclassified to the target class t only if L(x + δ, t) ≤ 0; and P is the dimension of the input image x and the perturbation δ. C&W can deliver a high attack success rate but requires a perturbation with a relatively large magnitude.
EAD [14] is inspired by C&W and crafts an adversarial example by solving the optimization problem

$$\min_{\tilde{x}}\ a\,L(\tilde{x}, t) + b\,\|\tilde{x} - x\|_1 + \|\tilde{x} - x\|_2^2 \quad \text{s.t.}\ \tilde{x} \in [0, 1]^P$$

where a ≥ 0 and b ≥ 0 are the regularization coefficients of the loss function L(·, ·) and the ℓ1-norm penalty, respectively. EAD can reach the same attack success rate as C&W with smaller perturbations.

JSMA [12] extends saliency maps [18] to produce adversarial saliency maps. These maps reveal the input features that an adversary can most effectively perturb to achieve the anticipated misclassification outcome. JSMA determines the perturbation to each pixel using a modified saliency map

$$S(x, t)[i, j] =
\begin{cases}
0, & \text{if } J_{it}(x) < 0 \text{ or } \sum_{j \neq t} J_{ij}(x) > 0\\
J_{it}(x)\,\Bigl|\sum_{j \neq t} J_{ij}(x)\Bigr|, & \text{otherwise}
\end{cases}$$

where i and j are the indexes of elements in the saliency map S; J_ij(x) = ∂f_j(x)/∂x_i is the (i, j)th entry of the Jacobian matrix of the image classifier f; and f_j is the softmax function of the jth class.
AutoAttack (Auto) [15] is a suite of parameter-free attacks. It contains two white-box attacks, i.e., Auto-PGD (APGD) with the cross-entropy loss function and with the difference in logits ratio (DLR) loss function, and two other attacks, i.e., the fast adaptive boundary (FAB) attack [19] and the Square attack [20]. APGD aims to produce adversarial examples inside an ℓp-ball, with the DLR loss function defined as

$$\operatorname{DLR}(x, y) = -\frac{z_y - \max_{i \neq y} z_i}{z_{\pi_1} - z_{\pi_3}}$$

where z_i is the logit of class i after taking x as input; y is the ground-truth label of x; and π is the permutation ordering the components of z in decreasing order. FAB [19] is a white-box attack that does not need to restart for every threshold tϵ if one wants to evaluate the success rate of attacks with perturbations constrained to within {ϵ ∈ R | ‖ϵ‖_p ≤ tϵ}, where ϵ is the perturbation magnitude and R stands for the set of real values. The Square attack [20] produces norm-bounded perturbations to launch score-based black-box attacks. It needs no knowledge of the gradient of the DNN under attack.
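As a small, self-contained illustration of the DLR loss defined above (an assumption-light sketch, not code taken from the AutoAttack suite):

```python
import torch

def dlr_loss(logits, y):
    """Difference-in-logits-ratio loss: -(z_y - max_{i != y} z_i) / (z_pi1 - z_pi3)."""
    z_sorted, _ = logits.sort(dim=1, descending=True)        # z_pi1 >= z_pi2 >= z_pi3 >= ...
    z_y = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    masked = logits.clone()
    masked.scatter_(1, y.unsqueeze(1), float('-inf'))         # exclude the true class
    z_max_other = masked.max(dim=1).values
    return -(z_y - z_max_other) / (z_sorted[:, 0] - z_sorted[:, 2] + 1e-12)

# Example: a batch of four examples with ten logits each.
print(dlr_loss(torch.randn(4, 10), torch.tensor([0, 3, 7, 9])))
```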
B. State-of-the-Art Adversarial Example Detectors

To prevent adversarial example attacks, several detectors have been developed, including DkNN [6], LID [5], Mahalanobis' algorithm [8], and NNIF [7].

1) DkNN [6] combines the k-NN algorithm with the input representation in a DNN's hidden layers (i.e., the feature maps). DkNN identifies an adversarial example when the group of the representations of the example's k-NN in the hidden layers differs from that of examples of the predicted class.

2) LID [5] is built on the assumption that the dimensions of the subspaces surrounding adversarial (perturbed) examples and benign (unperturbed) examples differ. LID estimates the dimension and accordingly detects adversarial examples.

3) Mahalanobis' algorithm [8] assumes that pretrained input features can be fit by a class-conditional Gaussian distribution. The Mahalanobis distance to the closest class-conditional distribution reveals adversarial examples.

4) The NNIF algorithm [7] assumes that the k-NN training samples (i.e., the nearest neighbors in the feature map space) and the most influential training samples (identified using an influence function) correlate for benign examples, but do not correlate for adversarial examples. The correlation is measured to detect if an attack is underway.

5) BEYOND [9] assumes that benign perturbations, i.e., random noises with bounded budgets, cause minor variations in the feature space, and then detects anomalous behaviors by distinguishing an adversarial example's relationship with its augmented versions, or neighbors, in terms of representation similarity and label consistency.

6) PNDetector [10] assumes that the misclassification space is randomly distributed in the ideal feature space of a pretrained classifier. PNDetector is a positive–negative classifier trained on original examples (positive representations) and their negative representations that share the same structural and semantic features.

According to [7], LID [5], Mahalanobis [8], and NNIF [7] yield their respective best detection performance when using all the hidden layers of a DNN, while DkNN [6] achieves its best detection by only using the penultimate layer of the DNN.

III. SYSTEM MODEL

A. System Architecture

The proposed adversarial example detector runs in parallel with a DNN-based image classifier in computer vision applications to protect the image classifier, as illustrated in Fig. 1. When the DNN-based image classifier classifies an input image, the feature maps produced by several hidden layers of the image classifier are copied and sent to the detector for adversarial example detection.

If adversarial perturbations are detected on the input image, the proposed adversarial example detector generates a notification to alert the administrator of the computer vision application. The image classifier is stopped from making any decision, such as granting access based on face recognition [21]. If the detector does not detect any hostile alteration in the input image, the DNN-based computer vision application continues functioning as usual.

B. Threat Model

We adopt the threat model described in [1], where an adversarial attacker attempts to mislead the DNN-based image classifier by feeding the classifier adversarial examples. The attacker can repeatedly perturb the pixels of an image until the DNN-based image classifier misclassifies the image to a different class from the correct one.

Assume that the attacker has complete knowledge of the DNN-based image classifier (or, in other words, the classifier is a white box to the attacker). Accordingly, the attacker can generate adversarial examples that can be misclassified by
the classifier. This is due to the fact that the neural network
architectures of the best-performing image classifiers (e.g., the
ResNet models [22]) are often common knowledge. Even if
the classifiers’ parameters are unknown to the attacker, the
attacker can learn a good surrogate of the classifier by sending
queries to the classifier and collecting responses [4].
We consider the typical situation where the attacker has
no knowledge of the adversarial example detector. In other
words, the detector is a black box to the attacker. This is
because, in most cases, the detection results are generally
inaccessible to the attacker, and hence the attacker can hardly
learn a surrogate of the detector. We also consider a relatively
rare situation where the attacker somehow gets access to the
adversarial example detector and its gradient (e.g., due to
a compromised server or a rogue employee). In this case,
a white-box attack [23] to both the DNN-based image classifier
and the adversarial example detector is evaluated.
Fig. 2. Architecture of the proposed detector.
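The detector consumes feature maps copied from selected hidden blocks of the image classifier (Fig. 1; the block choices are listed in Section V-A). One common way to copy such feature maps out of a running PyTorch classifier is with forward hooks; the sketch below is generic, and the block names in the commented example are hypothetical rather than the module names used in the paper's code.

```python
import torch

def attach_feature_taps(model, block_names):
    """Register forward hooks that copy the outputs of the named sub-modules.

    Returns (features, handles): `features` is refilled on every forward pass,
    and each handle's .remove() detaches its hook again.
    """
    features, handles = {}, []
    modules = dict(model.named_modules())
    for name in block_names:
        def hook(_module, _inputs, output, name=name):
            features[name] = output.detach()
        handles.append(modules[name].register_forward_hook(hook))
    return features, handles

# Hypothetical usage with a torchvision-style ResNet-34:
#   features, handles = attach_feature_taps(resnet34, ["bn1", "layer1", "layer2", "layer3", "layer4"])
#   logits = resnet34(images)   # the forward pass fills `features` with five feature maps
#   # ... hand the collected feature maps to the detector's word embedding layer ...
```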
A. Experiment Setup

Our experimental setup is consistent with [7] in terms of image classifiers, datasets, attack models, benchmark detectors, and performance indicators. Cohen et al. [7] presented the latest study on adversarial example detection in the literature and developed the state-of-the-art detector, namely, NNIF, which is also used as a benchmark in our experiments.

1) Image Classifier: By default, the image classifier (a DNN) under attack is a deep residual network [22] with 34 hidden layers, referred to as ResNet-34. The feature extraction layers of ResNet-34 are divided into five successive hidden blocks, i.e., Batch Normalization 1 (BN1), Residual Block 1 (Res1), Res2, Res3, and Res4. In BN1, a convolutional layer is followed by a batch normalization layer. The rest of the hidden blocks are residual blocks, which are basic building blocks in a deep residual network model.

We also adopt Inception to build another image classifier based on the third version of the Inception network (referred to as Inception-V3). The feature extraction layers of Inception-V3 are divided into seven hidden blocks: the Stem block, Inception-A block (Inception-A), Reduction-A block (Reduction-A), Inception-B, Reduction-B, Inception-C, and the global Avg-pool block. The Stem block is divided into seven successive hidden layers, including five convolution layers and two pooling layers. An Inception block consists of three parallel sub-blocks of convolution layers and pooling layers, whose outputs are later concatenated. The rest of the hidden blocks are Reduction blocks, which are made up of three parallel sub-blocks (two convolution layers and one pooling layer). The feature maps from the following hidden blocks of Inception-V3 are used as inputs to the proposed detector: Stem, Inception-A, Inception-C, Reduction-B, and Avg-pool.

2) Datasets: Three popular image datasets are considered: CIFAR-10, CIFAR-100, and SVHN. Each of the three image datasets is divided into three subsets: a training set of 49 000 images, a validation set of 1000 images, and a testing set of 10 000 images.

3) Attack Models: Seven latest adversarial attack strategies are considered: AutoAttack [15], FGSM [11], JSMA [12], DeepFool [13], C&W [17], PGD [2], and EAD [14] (see Section II-A). The neural network tool used in support of the defense algorithms is PyTorch, except for the case when PNDetector is taken as the defense technique, because PNDetector is based on TensorFlow [28]. In this case, Cleverhans [29], which supports TensorFlow, is used as the toolbox to support the attack strategies in producing adversarial samples against PNDetector. The other parameter configurations of the attacks are summarized in Table I.

Table II illustrates the adversarial examples generated by the latest attacks. All the attacks can mislead ResNet-34 into misclassifying inputs, often with high confidence. The residuals in the third column of the table show that these attack algorithms cause minor perturbation to the benign images and hence may evade human inspection.
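The detailed attack parameters are given in Table I, which is not reproduced in this excerpt. Purely as an illustration of how such adversarial test sets can be produced in a PyTorch pipeline, the sketch below uses the Torchattacks library listed as [32]; the ϵ, α, and step values shown are the PGD (8/255) settings quoted in Section V-F rather than the Table I configuration, and `classifier`/`test_loader` are placeholders.

```python
import torch
import torchattacks

def make_pgd_test_set(classifier, test_loader, eps=8 / 255, alpha=2 / 255, steps=20):
    """Generate PGD adversarial counterparts for every batch in `test_loader`."""
    attack = torchattacks.PGD(classifier, eps=eps, alpha=alpha, steps=steps)
    adv_images, labels_all = [], []
    for images, labels in test_loader:
        adv_images.append(attack(images, labels).cpu())   # perturbed copies of the batch
        labels_all.append(labels.cpu())
    return torch.cat(adv_images), torch.cat(labels_all)
```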
4) Performance Indicator: The area under the receiver operating characteristic (ROC) curve, or AUC, serves as the performance metric to evaluate the adversarial example detectors. The AUC of a model with 100% incorrect predictions is 0; the AUC of a model with 100% correct predictions is 1 (or 100%). AUC is a useful tool because it assesses the accuracy of the model's predictions regardless of the classification threshold [33].
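For reference, the AUC indicator can be computed from a detector's raw scores without fixing a threshold; a minimal sketch with scikit-learn (an illustration of the metric, not of the paper's evaluation code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# 1 = adversarial, 0 = benign; `scores` are the detector's confidence that an input is adversarial.
labels = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.35, 0.20, 0.80, 0.55, 0.90])

auc = roc_auc_score(labels, scores)   # threshold-free: 1.0 is perfect, 0.5 is chance level
print(f"AUC = {100 * auc:.2f}%")
```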
5) State-of-the-Art Detectors: The proposed detector is compared with the state-of-the-art adversarial example detectors, namely, LID [5], DkNN [6], NNIF [7], Mahalanobis [8], BEYOND [9], and PNDetector [10], as summarized in Section II-B. The setups of the benchmark detectors are optimized under each of the considered attack models and datasets. Specifically, we optimize the number of neighbors, denoted by k, for BEYOND, DkNN, and LID; the noise magnitude, denoted by γ, for the Mahalanobis algorithm; the number of high-influence samples, denoted by H, for NNIF; and the false positive ratio (FPR) for PNDetector. Based on the AUC values of the detection ROC curve, all the hyperparameters are validated with the validation set using nested cross entropy validation (except that the original hyperparameters of NNIF in [7] are used because of the significant time needed to retrain NNIF, as also pointed out in [7]). The hyperparameters of the benchmark detectors are summarized in Table III.

6) Setting of the Proposed Detector: We select five hidden layers from the ResNet-34 model as inputs to the word embedding layer of our detector. Each one is the last layer of a hidden block in the ResNet-34 model (i.e., BN1, Res1, Res2, Res3, and Res4). For the Inception-V3 model, we choose the last layers of its five hidden blocks (i.e., Stem, Inception-A, Inception-C, Reduction-B, and Avg-pool) as inputs to the word embedding layer of the proposed detector. The size of the convolutional kernel used in the CP module is 3 × 3. We use one-, two-, three-, and four-gram convolutional kernels in the sentiment analyzer of the proposed detector.
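To make the n-gram terminology concrete: the sentiment analyzer is a TextCNN [24] whose convolutional kernels span one to four consecutive "words" (embedded feature maps) of the assembled sentence. The sketch below is a generic TextCNN of that shape; the word-vector length, filter count, and output head are assumptions for illustration, not the exact architecture of the proposed detector.

```python
import torch
import torch.nn as nn

class TextCNNSentiment(nn.Module):
    """TextCNN-style binary classifier over a 'sentence' of word vectors.

    Kernel heights 1-4 play the role of the one- to four-gram convolutional kernels.
    """

    def __init__(self, word_dim=256, num_filters=64, ngrams=(1, 2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(1, num_filters, kernel_size=(n, word_dim)) for n in ngrams
        )
        self.fc = nn.Linear(num_filters * len(ngrams), 2)    # benign vs. adversarial

    def forward(self, sentence):                 # sentence: [batch, sentence_len, word_dim]
        x = sentence.unsqueeze(1)                # [batch, 1, sentence_len, word_dim]
        pooled = []
        for conv in self.convs:
            feat = torch.relu(conv(x)).squeeze(3)            # [batch, filters, len - n + 1]
            pooled.append(torch.max(feat, dim=2).values)     # max-over-time pooling
        return self.fc(torch.cat(pooled, dim=1))             # two-way logits

# Example: a sentence of five words (one per selected hidden block), each 256-dimensional.
logits = TextCNNSentiment()(torch.randn(8, 5, 256))
```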
TABLE IV
AUC scores (%) of the considered adversarial example detection algorithms under the different attacks on different datasets. The adopted backbones of the image classifier are ResNet-34 and Inception-V3, respectively.

TABLE VI
t-SNE visualization of the word distribution generated from each selected hidden block of ResNet-34 under the DeepFool attack. Each point represents a word, corresponding to either an adversarial (red) or a benign example (blue).

TABLE VII
Comparison of distributions of words generated from each selected hidden block of ResNet-34 under the DeepFool attack on the CIFAR-10 dataset and their transformed version on the proposed detector. The first row corresponds to an image classifier based on a ResNet-34 model. The second row corresponds to the proposed adversarial example detector. The input of the proposed detector consists of the feature maps produced by BN1, Res1, Res2, Res3, and Res4 for detecting adversarial examples fed into the ResNet-34-based image classifier. Each point represents a word corresponding to an adversarial example (red), a benign example of the source class Frog (blue), and a benign example of the target class Bird (green).
word embedding layer with no feature extracted; see Fig. 2. Nevertheless, the word vectors based on the feature maps do contribute to the detection of adversarial examples after becoming part of a sentence, as shown in Section V-D.

D. Ablation Study

To help understand the proposed detector, we conduct two ablation studies. The first is to modify the word embedding layer by masking out some of the generated word vectors. The second ablation study is to replace the sentiment analyzer with different neural network architectures.

1) Ablation of the Word Embedding Layer: Table VIII evaluates the impact of the number of selected hidden layers of the ResNet-34 model on the detection performance of the proposed detector with regard to the adversarial examples generated by the latest attacks (i.e., DeepFool and AutoAttack). The second column indicates the number of selected hidden layers. The third column lists the selected hidden layers.

As shown in Table VIII, the detection of adversarial examples improves steadily with the increasing number of selected hidden layers under the proposed adversarial example detector, across all the three considered datasets. This is due to the fact that more observations of the transformations of the image representation in the classifier (or, in other words, a longer sentence) provide richer information to distinguish the adversarial and benign examples. This validates our design of interpreting a series of hidden-layer feature maps as a sentence for effective adversarial example detection.
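Section IV, which details the word embedding layer, is not included in this excerpt. Based only on the description in Sections I and V-A (a cascade of trainable modules with 3 × 3 convolutional kernels that aligns the dimensions of successive hidden blocks), the sketch below shows one possible arrangement of such a cascade; every shape and module choice here is an assumption for illustration rather than the authors' design.

```python
import torch
import torch.nn as nn

class CPModule(nn.Module):
    """Maps a feature map of shape `in_shape` (C, H, W) to `out_shape` (C', H', W')
    with a single 3x3 convolution followed by adaptive pooling (assumed design)."""

    def __init__(self, in_shape, out_shape):
        super().__init__()
        self.conv = nn.Conv2d(in_shape[0], out_shape[0], kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(out_shape[1:])

    def forward(self, x):
        return self.pool(torch.relu(self.conv(x)))

class WordEmbedding(nn.Module):
    """Cascade of CP modules: the feature map of block i passes through modules
    i, i+1, ..., so every block reaches the last block's shape and flattens to a
    word vector of the same length."""

    def __init__(self, block_shapes):
        super().__init__()
        self.cascade = nn.ModuleList(
            CPModule(block_shapes[i], block_shapes[i + 1])
            for i in range(len(block_shapes) - 1)
        )

    def forward(self, feature_maps):             # one feature map per selected hidden block
        words = []
        for i, fmap in enumerate(feature_maps):
            for module in self.cascade[i:]:
                fmap = module(fmap)
            words.append(fmap.flatten(start_dim=1))
        return torch.stack(words, dim=1)          # [batch, sentence_len, word_dim]

# Hypothetical ResNet-34-like block shapes (channels, height, width):
shapes = [(64, 32, 32), (64, 32, 32), (128, 16, 16), (256, 8, 8), (512, 4, 4)]
maps = [torch.randn(2, c, h, w) for (c, h, w) in shapes]
sentence = WordEmbedding(shapes)(maps)            # shape [2, 5, 512 * 4 * 4]
```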
As also shown in Table VIII, the inclusion (or direct use) of the input image does not always contribute to the improvement
TABLE VIII
AUC score (%) under different numbers of hidden layers selected in the ResNet-34-based image classifier for adversarial example detection by the proposed detector. DeepFool and AutoAttack (0.02) are the attack models.
Fig. 4. t-SNE figures of the words generated at the Res1 hidden block of
the ResNet-34 model on the CIFAR-10 dataset. Red and blue points represent
feature maps corresponding to adversarial and benign examples, respectively.
(a) DeepFool. (b) EAD. (c) FGSM (0.1). (d) Auto (0.02). (e) Auto (8/255).
(f) JSMA (1,0.1). (g) PGD (0.02). (h) PGD (8/255). (i) C&W.
TABLE IX
Impact of n-gram convolutional kernels in the sentiment analyzer on the AUC score (%), where DeepFool and AutoAttack (0.02) are the attack models. ResNet-34 is used to build the image classifier.

TABLE X
AUC scores (%) when detecting DeepFool adversarial examples using different sentiment analyzers.
TABLE XI
Evaluation of the generalization of the new detector when detecting adversarial examples generated by other attack algorithms. The metric is the AUC score (%). The image classifier is ResNet-34, which achieves the classification accuracy of 93.72%, 95.97%, and 75.29% on benign images in the CIFAR-10, SVHN, and CIFAR-100 datasets, respectively.
AutoAttack, the detection performance of the detector trained against EAD falls below 50%. To this end, an ensemble solution with detectors trained against DeepFool and PGD is recommended to ensure adequate generalization in detecting unseen attacks, especially for classification tasks with a large number of classes, e.g., CIFAR-100.

It is also noted that the proposed detector performs better on the CIFAR-10 dataset than it does on the CIFAR-100 dataset, because its detection performance relies on the feature maps produced by the image classifier. The classification accuracy of the ResNet-34 (and Inception-V3) image classifier drops by about 20% when switching from CIFAR-10 to CIFAR-100; in other words, the feature maps of the image classifiers capture more comprehensive features from the CIFAR-10 dataset than they do from the CIFAR-100 dataset.

Fig. 7 reveals that all the attacks can cause the perturbed images to deviate from their unperturbed versions to different degrees (as measured by the Bhattacharyya distance). This leads to the perturbed images being misclassified. Among all the considered attack models, DeepFool and PGD (8/255) are the most representative attacks on the ten-class datasets (i.e., CIFAR-10 and SVHN) and the 100-class dataset (i.e., CIFAR-100), respectively. As a result, the proposed adversarial example detector trained against DeepFool or PGD (8/255) can be generalized effectively to detect attacks launched by the other attack models, as shown in Table XI.
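The Bhattacharyya distance [16] referenced here compares the distributions of the representations produced for adversarial and benign examples. A minimal histogram-based sketch of the measurement in one dimension is shown below; it only illustrates the quantity, and the paper's exact estimator over word vectors is not reproduced.

```python
import numpy as np

def bhattacharyya_distance(a, b, bins=50):
    """Histogram estimate of the Bhattacharyya distance between two 1-D samples."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    p, q = p / p.sum(), q / q.sum()
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient in [0, 1]
    return -np.log(bc + 1e-12)           # larger distance = more separated distributions

# Example with synthetic 1-D projections of benign vs. adversarial representations.
rng = np.random.default_rng(0)
print(bhattacharyya_distance(rng.normal(0.0, 1.0, 2000), rng.normal(1.5, 1.2, 2000)))
```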
TABLE XII
Evaluation of the generalization of the new detector when detecting adversarial examples generated by other attack algorithms. The used metric is the AUC score (%). The image classifier is Inception-V3, which achieves the classification accuracy of 93.83%, 95.79%, and 73.01% on benign images in the CIFAR-10, SVHN, and CIFAR-100 datasets, respectively.

Fig. 7. Bhattacharyya distances between the distributions of adversarial and benign examples. The proposed detector trained on adversarial examples generated by DeepFool is used to produce word vectors. The image classifier is ResNet-34. (a) CIFAR-10. (b) SVHN. (c) CIFAR-100.

F. Defense Against White-Box Attacks

Finally, we consider a relatively rare yet more threatening situation where an attacker can access the gradient of the detector's loss function and adapt the adversarial examples to fool both the image classifier and the proposed detector. We use PGD as the benchmark attack (with the perturbation budget ϵ = 8/255, the iteration step size α = 2/255, and the iteration number 20 [2]), since PGD can iteratively refine its perturbation to an image based on the gradients of the loss functions of both the image classifier and the detector with respect to the image.

When performing the adapted PGD attacks, the attacker can take two strategies to combine the image classifier's loss function with the proposed detector's loss function for the generation of new adversarial examples.

1) The first strategy is to alternate between minimizing the loss functions of the classifier and the detector [38]. That is, the attacker updates the adversarial examples based on the gradient of the classifier's loss function in odd-numbered steps and based on the gradient of the detector's loss function in even-numbered steps.

2) The second strategy is to linearly combine the two loss functions into one by replacing (1) with [39]

$$x_{i+1} = \Pi_{x+S_\epsilon}\Bigl(x_i + \alpha\bigl[(1-\sigma)\operatorname{sign}\bigl(\nabla_x L_c(x, y)\bigr) + \sigma\operatorname{sign}\bigl(\nabla_x L_d(x, y_d)\bigr)\bigr]\Bigr) \tag{2}$$

where y_d is the ground-truth label of the input image x for the detector, i.e., adversarial or benign; and σ ∈ [0, 1] is a weighting coefficient to balance between the image classifier's loss function L_c(·, ·) and the detector's loss function L_d(·, ·).
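A minimal sketch of the combined-loss strategy in (2) is given below, for an ℓ∞ budget. The detector is assumed to expose a differentiable loss L_d with respect to the input, and all names and default values are placeholders rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def adapted_pgd(classifier, detector_loss, x, y, y_d,
                eps=8 / 255, alpha=2 / 255, steps=20, sigma=0.5):
    """Adapted PGD of (2): mixes the signed gradients of the classifier's loss L_c
    and the detector's loss L_d with weight sigma, then projects onto the
    l_inf ball of radius eps around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_req = x_adv.clone().requires_grad_(True)
        grad_c = torch.autograd.grad(F.cross_entropy(classifier(x_req), y), x_req)[0]
        x_req = x_adv.clone().requires_grad_(True)
        grad_d = torch.autograd.grad(detector_loss(x_req, y_d), x_req)[0]
        step = alpha * ((1 - sigma) * grad_c.sign() + sigma * grad_d.sign())
        x_adv = (x + (x_adv + step - x).clamp(-eps, eps)).clamp(0, 1).detach()
    return x_adv
```

Setting σ = 0 in odd-numbered steps and σ = 1 in even-numbered steps recovers the first, alternating strategy evaluated in Table XIII.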
To detect the adapted attacks, the proposed detector is first trained on the original adversarial examples produced by PGD (0.02), and then its detection is improved by training again on the newly generated PGD adversarial examples.

TABLE XIII
Adapted PGD attack on the CIFAR-10 dataset using the classifier's and the new detector's loss functions in an alternating manner, i.e., σ = 0 in odd-numbered steps and σ = 1 in even-numbered steps.

Table XIII shows that the proposed detector remains highly effective under the adapted PGD attack based on the first strategy of alternating minimization of the classifier's and the detector's loss functions. Specifically, when the attacker refines an adversarial example against the pretrained detector for 20 iterations (i.e., one epoch), the detection accuracy (i.e., the ratio of detected adversarial examples) drops to 50.67%. Nonetheless, our detector improves its accuracy to 95.74% after being trained against adversarial examples for five epochs.

TABLE XIV
Adapted PGD attack on CIFAR-10 using the combined classifier's and detector's loss function.

Table XIV shows that the proposed detector is effective under the adapted PGD attack using the second strategy of combining the classifier's and the detector's loss functions.
The adversarial example detector is so robust after being trained for five epochs under the first strategy that it can achieve a detection accuracy of over 98% for σ < 1. When the attacker optimizes the adversarial examples solely on the loss function of the detector, i.e., σ = 1, the detection accuracy drops from 98.15% to 93.04%. Nevertheless, the attack success rate of the newly generated adversarial examples drops dramatically to only 7.79%. The conclusion drawn is that the proposed detector can effectively withstand white-box attacks.

VI. CONCLUSION

In this article, we proposed a new adversarial example detector by recasting adversarial image detection as a text sentiment analysis problem and performing binary classification using a TextCNN model. Extensive tests demonstrated the superiority of the detector in detecting various latest attacks on three popular datasets. The new detector also demonstrated a strong generalization ability by accurately detecting adversarial examples generated by unknown attacks, and it is resistant to white-box attacks in situations where the gradients of the detector are exposed. The new detector has only about 2 million parameters and takes less than 4.6 ms to detect an adversarial example generated by the latest attack models using a Tesla K80 GPU card.

REFERENCES

[1] D. J. Miller, Z. Xiang, and G. Kesidis, "Adversarial learning targeting deep neural network classification: A comprehensive review of defenses against attacks," Proc. IEEE, vol. 108, no. 3, pp. 402–433, Mar. 2020.
[2] A. Madry et al., "Towards deep learning models resistant to adversarial attacks," in Proc. ICLR, Vancouver, BC, Canada, Apr. 2018, pp. 1–11.
[3] A. Chernikova, A. Oprea, C. Nita-Rotaru, and B. Kim, "Are self-driving cars secure? Evasion attacks against deep neural networks for steering angle prediction," in Proc. IEEE Secur. Privacy Workshops (SPW), May 2019, pp. 132–137.
[4] Y. Zhong and W. Deng, "Towards transferable adversarial attack against deep face recognition," IEEE Trans. Inf. Forensics Security, vol. 16, pp. 1452–1466, 2021.
[5] X. Ma et al., "Characterizing adversarial subspaces using local intrinsic dimensionality," in Proc. ICLR, Vancouver, BC, Canada, Apr. 2018, pp. 1–12.
[6] N. Papernot et al., "Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning," 2018, arXiv:1803.04765.
[7] G. Cohen et al., "Detecting adversarial samples using influence functions and nearest neighbors," in Proc. CVPR, Seattle, WA, USA, Jun. 2020, pp. 14453–14462.
[8] K. Lee et al., "A simple unified framework for detecting out-of-distribution samples and adversarial attacks," in Proc. NeurIPS, Dec. 2018, pp. 7167–7177.
[9] Z. He et al., "Be your own neighborhood: Detecting adversarial example by the neighborhood relations built on self-supervised learning," 2022, arXiv:2209.00005.
[10] W. Luo, C. Wu, L. Ni, N. Zhou, and Z. Zhang, "Detecting adversarial examples by positive and negative representations," Appl. Soft Comput., vol. 117, Mar. 2022, Art. no. 108383.
[11] I. J. Goodfellow et al., "Explaining and harnessing adversarial examples," in Proc. ICLR, San Diego, CA, USA, May 2015, pp. 1–11.
[12] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in Proc. IEEE Eur. Symp. Secur. Privacy (EuroS&P), Mar. 2016, pp. 372–387.
[13] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: A simple and accurate method to fool deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016.
[14] P. Chen et al., "EAD: Elastic-net attacks to deep neural networks via adversarial examples," in Proc. AAAI, Feb. 2018, pp. 1–8.
[15] F. Croce et al., "Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks," in Proc. ICML, vol. 119, Jul. 2020, pp. 2206–2216.
[16] A. Mohammadi and K. N. Plataniotis, "Improper complex-valued Bhattacharyya distance," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 5, pp. 1049–1064, May 2016.
[17] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in Proc. IEEE Symp. Secur. Privacy (SP), San Jose, CA, USA, May 2017, pp. 39–57.
[18] G. Jin, S. Shen, D. Zhang, W. Duan, and Y. Zhang, "Deep saliency map estimation of hand-crafted features," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 4262–4266.
[19] F. Croce and M. Hein, "Minimally distorted adversarial examples with a fast adaptive boundary attack," in Proc. Int. Conf. Mach. Learn., vol. 119, 2020, pp. 2196–2205.
[20] M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein, "Square attack: A query-efficient black-box adversarial attack via random search," in Proc. ECCV, vol. 12368, Aug. 2020, pp. 484–501.
[21] J. Y. Choi and B. Lee, "Ensemble of deep convolutional neural networks with Gabor face representations for face recognition," IEEE Trans. Image Process., vol. 29, pp. 3270–3281, 2020.
[22] V. Santhanam and L. S. Davis, "A generic improvement to deep residual networks based on gradient flow," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 7, pp. 2490–2499, Jul. 2020.
[23] K. Alrawashdeh and S. Goldsmith, "Defending deep learning based anomaly detection systems against white-box adversarial examples and backdoor attacks," in Proc. IEEE Int. Symp. Technol. Soc. (ISTAS), Nov. 2020, pp. 294–301.
[24] Y. Kim, "Convolutional neural networks for sentence classification," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Doha, Qatar, 2014, pp. 1746–1751.
[25] D. Deng, L. Jing, J. Yu, and S. Sun, "Sparse self-attention LSTM for sentiment lexicon construction," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 27, no. 11, pp. 1777–1790, Nov. 2019.
[26] L. Yu, J. Wang, K. R. Lai, and X. Zhang, "Refining word embeddings using intensity scores for sentiment analysis," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 3, pp. 671–681, Mar. 2018.
[27] T. Mikolov et al., "Distributed representations of words and phrases and their compositionality," in Proc. Adv. Neural Inf. Process. Syst. 26, Lake Tahoe, NV, USA, Dec. 2013, pp. 3111–3119.
[28] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," 2016, arXiv:1603.04467.
[29] N. Papernot et al., "Cleverhans v2.1.0: An adversarial machine learning library," in Proc. USENIX, Aug. 2018, pp. 1–5.
[30] M.-I. Nicolae et al., "Adversarial robustness toolbox v1.0.0," 2018, arXiv:1807.01069.
[31] J. Rauber, R. Zimmermann, M. Bethge, and W. Brendel, "Foolbox native: Fast adversarial attacks to benchmark the robustness of machine learning models in PyTorch, TensorFlow, and JAX," J. Open Source Softw., vol. 5, no. 53, p. 2607, Sep. 2020.
[32] H. Kim, "Torchattacks: A PyTorch repository for adversarial attacks," 2020, arXiv:2010.01950.
[33] L. Jing, C. Shen, L. Yang, J. Yu, and M. K. Ng, "Multi-label classification by semi-supervised singular value decomposition," IEEE Trans. Image Process., vol. 26, no. 10, pp. 4612–4625, Oct. 2017.
[34] A. Chatzimparmpas, R. M. Martins, and A. Kerren, "t-viSNE: Interactive assessment and interpretation of t-SNE projections," IEEE Trans. Vis. Comput. Graphics, vol. 26, no. 8, pp. 2696–2714, Aug. 2020.
[35] S. Yu, K. Wickstrøm, R. Jenssen, and J. C. Príncipe, "Understanding convolutional neural networks with information theory: An initial exploration," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 1, pp. 435–442, Jan. 2021.
[36] L. Huang and C. Pun, "Audio replay spoof attack detection by joint segment-based linear filter bank feature extraction and attention-enhanced DenseNet-BiLSTM network," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 28, pp. 1813–1825, 2020.
[37] S. Lai et al., "Recurrent convolutional neural networks for text classification," in Proc. AAAI, Austin, TX, USA, Jan. 2015, pp. 2267–2273.
[38] F. Tramèr et al., "On adaptive attacks to adversarial example defenses," in Proc. NeurIPS, Dec. 2020, pp. 1–11.
[39] J. H. Metzen et al., "On detecting adversarial perturbations," in Proc. ICLR, Toulon, France, Apr. 2017, pp. 1–12.

Shenghong Li (Member, IEEE) received the B.S. degree in communication engineering from Nanjing University, Nanjing, Jiangsu, China, in 2008, and the Ph.D. degree in electronic and computer engineering from The Hong Kong University of Science and Technology (HKUST), Hong Kong, in 2014.

He is currently a Senior Research Scientist with Data61, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, NSW, Australia. His research interests include computer vision, deep learning, wireless tracking, data fusion, and wireless communication.

Xin Yuan (Member, IEEE) received the B.E. degree from the Taiyuan University of Technology, Taiyuan, Shanxi, China, in 2013, and the dual Ph.D. degree from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, and the University of Technology Sydney (UTS), Sydney, NSW, Australia, in 2019 and 2020, respectively.

She is currently a Research Scientist with the Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney. Her research interests include machine learning and optimization, and their applications to the Internet of Things and intelligent systems.