

New Adversarial Image Detection Based on Sentiment Analysis

Yulong Wang, Member, IEEE, Tianxiang Li, Shenghong Li, Member, IEEE, Xin Yuan, Member, IEEE, and Wei Ni, Senior Member, IEEE

Abstract— Deep neural networks (DNNs) are vulnerable to adversarial examples, while adversarial attack models, e.g., DeepFool, are on the rise and outrunning adversarial example detection techniques. This article presents a new adversarial example detector that outperforms state-of-the-art detectors in identifying the latest adversarial attacks on image datasets. Specifically, we propose to use sentiment analysis for adversarial example detection, qualified by the progressively manifesting impact of an adversarial perturbation on the hidden-layer feature maps of a DNN under attack. Accordingly, we design a modularized embedding layer with the minimum learnable parameters to embed the hidden-layer feature maps into word vectors and assemble sentences ready for sentiment analysis. Extensive experiments demonstrate that the new detector consistently surpasses the state-of-the-art detection algorithms in detecting the latest attacks launched against ResNet and Inception neural networks on the CIFAR-10, CIFAR-100, and SVHN datasets. The detector only has about 2 million parameters and takes less than 4.6 ms to detect an adversarial example generated by the latest attack models using a Tesla K80 GPU card.

Index Terms— Adversarial example detection, deep learning, neural network, sentiment analysis.

Manuscript received 3 December 2021; revised 17 February 2023; accepted 3 May 2023. This work was supported in part by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China under Grant 61921003, in part by the National Natural Science Foundation of China under Grant 62072092, and in part by the China Scholarship Council under Grant 201906475002. (Yulong Wang and Tianxiang Li contributed equally to this work.) (Corresponding author: Yulong Wang.)
Yulong Wang and Tianxiang Li are with the State Key Laboratory of Networking and Switching Technology, School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: [email protected]; [email protected]).
Shenghong Li, Xin Yuan, and Wei Ni are with Data61, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Marsfield, Sydney, NSW 2122, Australia (e-mail: [email protected]; [email protected]; [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2023.3274538.
Digital Object Identifier 10.1109/TNNLS.2023.3274538

I. INTRODUCTION

DEEP neural networks (DNNs) have demonstrated their excellent performance in image classification, voice recognition, and text categorization. However, recent studies indicate that adversarial instances can undermine DNNs. Specifically, intentionally perturbed inputs, also known as adversarial examples, can mislead DNNs to make highly confident erroneous predictions [1]. The perturbation required is typically imperceptible to human eyes, making the perturbation hard to detect [2]. This undesirable property of DNNs has developed into a significant security concern in real-world applications, such as self-driving cars [3] and identity recognition [4].

A recent and effective approach to detecting adversarial attacks takes the feature maps produced by the hidden layers of a DNN (e.g., a DNN-based image classifier) as input and detects adversarial input examples by measuring the difference between benign and adversarial feature maps [5], [6], [7], [8], [9], [10]. For instance, a detection method named local intrinsic dimensionality (LID) [5] uses the difference of dimension between the subspaces surrounding adversarial examples and clean examples. Another detection method, known as deep k-nearest neighbors (DkNN) [6], applies the k-nearest neighbors (k-NN) technique on feature maps to assess the difference between the feature maps of the input example's k-NN and those of benign examples in the predicted class against a predefined threshold. Nearest neighbor influence function (NNIF) [7] is another popular adversarial example detector, which detects adversarial examples by assessing the correlation between the input example's k-NN and the most influential benign examples identified during training. A Mahalanobis-distance-based algorithm developed in [8] fits the feature maps to a class-conditional Gaussian distribution and then detects adversarial examples by measuring the Mahalanobis distances of the feature maps. Besides the hidden-layer feature maps of a DNN, be your own neighborhood (BEYOND) [9] uses the output of the DNN to detect adversarial examples. It uses the hidden-layer representations provided by self-supervised learning (SSL) and the DNN's predicted label to examine the relationship between adversarial examples and their augmented versions. Moreover, the positive–negative detector (PNDetector) [10] trains a positive–negative classifier against both the benign examples (positive representations) and their negative representations, which complement the benign examples in each pixel, to identify adversarial examples.

The above existing adversarial example detectors [5], [6], [7], [8], [9], [10] depend primarily on machine learning techniques or hand-crafted measures. Despite performing reasonably well against some mild types of attacks (e.g., FGSM [11] and the Jacobian-based salience map attack (JSMA) [12]), the existing adversarial example detectors are less effective in detecting mighty attacks, such as DeepFool [13] and elastic-net attacks on DNNs (EAD) [14].

In this article, we propose a new and effective adversarial example detector for DNN-based image classification. The new detector is a shallow neural network with only a few layers and a small number of parameters, and it outperforms the state-of-the-art detectors in identifying the latest attacks, including DeepFool and EAD, on widely used image datasets.

The key idea is that we propose to detect adversarial examples by extracting the progressively manifesting impact of adversarial perturbations on the hidden-layer feature maps of the DNN (as opposed to the feature maps only). In light of the progressive manifestation of sentiment in a sentence, we propose to embed the hidden-layer feature maps into word vectors (i.e., a sentence) and detect adversarial examples using sentiment analysis.

Another important aspect is a new and efficient embedding layer that embeds the differently sized, 3-D, hidden-layer feature maps into word vectors with consistent lengths and assembles sentences ready for sentiment analysis. Specifically, a modular design is taken to create a trainable module to match the dimensions between the feature maps of successively selected hidden layers. Then, each feature map can be embedded into a word vector via a cascade of modules, thereby minimizing the number of trainable modules and learnable parameters.

The main contributions of this article are as follows.
1) A new sentiment-analysis-based interpretation of adversarial example detection and a meticulous selection of TextCNN for sentiment analysis, through rigorous experimental comparisons with other candidate neural network structures.
2) A new modular design of an embedding layer, which reshapes and embeds the differently sized, 3-D hidden-layer feature maps of a DNN into word vectors with equal length (and assembles sentences for sentiment analysis) using the minimum number of trainable parameters.
3) Extensive experiments that corroborate the superior effectiveness and generalization ability of the proposed adversarial example detector under the latest adversarial example attacks compared with the state-of-the-art adversarial example detectors.

The experiments demonstrate that the new adversarial example detector consistently outperforms the state-of-the-art detection algorithms, such as LID [5], DkNN [6], NNIF [7], BEYOND [9], PNDetector [10], and the Mahalanobis algorithm [8], in identifying the latest attacks, including AutoAttack [15], DeepFool [13], and EAD [14], on the CIFAR-10, CIFAR-100, and SVHN datasets. We use the Bhattacharyya distance [16], hidden layer visualization, and an ablation study to shed insight on the gain of the new detector.

The remainder of this article is organized as follows. Section II reviews the state-of-the-art attacks and detectors. Section III provides the system and threat models. In Section IV, we elaborate on the design of the new sentiment-analysis-based adversarial example detector. The new detector is experimentally examined against the cutting-edge attack models and compared with the state-of-the-art detection algorithms in Section V, followed by concluding remarks in Section VI.

II. RELATED WORK

In this section, we briefly review the latest attacks on DNNs and the state-of-the-art adversarial example detectors. These attack models and detectors are used in our comparison studies with the proposed adversarial example detector, as will be presented in Section V.

A. Adversarial Example Attack Algorithms

Recently, several attack algorithms for maliciously perturbing images have been proposed for off-the-shelf DNNs [2], [11], [12], [13], [14], [17]. Most attack algorithms exploit the DNN's gradients to obtain a small perturbation. The corresponding attack algorithms are classified as targeted attacks or untargeted attacks depending on whether the adversarial examples are misclassified to a specific target class or simply misclassified to a different class from their source classes.

FGSM [11] perturbs an image by changing its pixel values toward the direction of increasing the DNN-based image classifier's classification loss. FGSM generates an adversarial example using

x + ϵ sign(∇_x L(x, y))

where ϵ ∈ R+ is the perturbation magnitude, y indicates the ground-truth class, sign(·) takes the sign of a real value, and ∇_x L(x, y) is the gradient of the loss function L(x, y) with regard to the input image x. FGSM runs fast since it only perturbs the input once, but it needs a relatively large perturbation magnitude ϵ for a high attack success rate.

Projected gradient descent (PGD) [2] improves the FGSM by generating an adversarial example iteratively

x_{i+1} = Π_{x+S_ϵ} ( x_i + α sign(∇_x L(x, y)) )    (1)

where i is the index to an iteration, α ≤ ϵ is the perturbation step size, S_ϵ ⊆ R^d is the set of allowed perturbations under the maximum perturbation magnitude ϵ, and the projector Π_{x+S_ϵ}(·) maps its input to the closest element to the input in the set x + S_ϵ. PGD conducts a fine-grained perturbation on images and can achieve a higher attack success rate than FGSM under the maximum perturbation magnitude, at the cost of a longer running time.
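For concreteness, the FGSM perturbation and the PGD iteration in (1) can be written in a few lines of PyTorch. This is a minimal sketch assuming a pretrained classifier `model` and the cross-entropy loss; the experiments in this article rely on existing toolboxes (ART, Foolbox, and TorchAttack; see Table I) rather than on this sketch.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    # Single-step attack: x + eps * sign(grad_x L(x, y)).
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps, alpha, steps):
    # Iterative attack following (1): each step moves by alpha * sign(grad),
    # and the result is projected back onto the L_inf ball of radius eps
    # around the benign image x (and onto the valid pixel range [0, 1]).
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # projection onto x + S_eps
        x_adv = x_adv.clamp(0, 1)
    return x_adv
```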

DeepFool [13] perturbs an image toward the region that is the nearest to the image but belongs to a different class. DeepFool generates an adversarial example by iteratively updating its input with

x_{i+1} ← x_i + ( |f'_ĉ| / ||w'_ĉ||_2^2 ) |w'_ĉ| ⊙ sign(w'_ĉ)

until the adversarial example is misclassified or the maximum iteration number is reached. x_0 is the benign example without any perturbation. f'_ĉ is the difference between the output of the softmax function of the closest different class ĉ and that of the predicted class of the benign example x_0. w'_ĉ is the difference between the gradients of the softmax function of class ĉ and that of the softmax function of the predicted class of the benign example x_0. ⊙ is the pointwise product. A softmax function of class ĉ takes an input image and outputs a percentage indicating the confidence that the input belongs to class ĉ. Because DeepFool tends to perturb an image to just cross the classification boundary of the image's original class, DeepFool can generate adversarial examples with considerably small perturbations.

Carlini and Wagner's (C&W) algorithm [17] solves the following optimization problem to obtain the perturbation applied to an image:

min_δ  ||δ||_2 + a L(x + δ, t)
s.t.   x + δ ∈ [0, 1]^P

where δ is the perturbation on the image x; a is a constant specified in advance by running a variant of binary search; L(·, ·) is one of the seven loss functions specified in C&W, such that x + δ is misclassified to the target class t only if L(x + δ, t) ≤ 0; and P is the dimension of the input image x and the perturbation δ. C&W can deliver a high attack success rate but requires a perturbation with a relatively large magnitude.

EAD [14] is inspired by C&W and crafts an adversarial example by solving the optimization problem

min  a L(x̃, t) + b ||x̃ − x||_1 + ||x̃ − x||_2^2
s.t. x̃ ∈ [0, 1]^P

where a ≥ 0 and b ≥ 0 are the regularization coefficients of the loss function L(·, ·) and the ℓ1-norm penalty, respectively. EAD can reach the same attack success rate as C&W, with smaller perturbations.

JSMA [12] extends saliency maps [18] to produce adversarial saliency maps. These maps reveal the input features that an adversary can most effectively perturb, to achieve the anticipated misclassification outcome. JSMA determines the perturbation to each pixel using a modified saliency map

S(x, t)[i, j] = 0,                                  if J_it(x) < 0 or Σ_{j≠t} J_ij(x) > 0
                J_it(x) | Σ_{j≠t} J_ij(x) |,        otherwise

where i and j are the indexes of elements in the saliency map S; J_ij(x) = ∂f_j(x)/∂x_i is the (i, j)th entry of the Jacobian matrix of the image classifier f; and f_j is the softmax function of the jth class.

AutoAttack (Auto) [15] is a suite of parameter-free attacks. It contains two white-box attacks, i.e., Auto-PGD (APGD) with the cross-entropy loss function and with the difference in logits ratio (DLR) loss function, and two other attacks, i.e., the fast adaptive boundary (FAB) attack [19] and the Square Attack [20]. APGD aims to produce adversarial examples inside an ℓ_p-ball, with the DLR loss function defined as

DLR(x, y) = − (z_y − max_{i≠y} z_i) / (z_{π1} − z_{π3})

where z_i is the logit of example class i after taking x as input; y is the ground-truth label of x; and π is the permutation ordering the components of z in decreasing order. FAB [19] is a white-box attack that does not need to restart for every threshold t_ϵ if one wants to evaluate the success rate of attacks with perturbations constrained to within {ϵ ∈ R | ∥ϵ∥_p ≤ t_ϵ}. ϵ is the perturbation magnitude. R stands for the set of real values. The Square Attack [20] produces norm-bounded perturbations to launch score-based black-box attacks. It needs no knowledge of the gradient of the DNN under attack.
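As an illustration of the DLR objective optimized by APGD, a short PyTorch sketch is given below. It assumes a batch of logits z produced by the classifier and integer ground-truth labels y; this is not AutoAttack itself, only the loss it maximizes, and the small constant added to the denominator is an assumption for numerical stability.

```python
import torch

def dlr_loss(z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # z: (batch, num_classes) logits, y: (batch,) ground-truth labels.
    # DLR(x, y) = -(z_y - max_{i != y} z_i) / (z_pi1 - z_pi3),
    # where pi orders the logits in decreasing order.
    z_sorted, _ = z.sort(dim=1, descending=True)
    z_y = z.gather(1, y.unsqueeze(1)).squeeze(1)
    # Largest logit excluding the true class:
    z_other = torch.where(z_sorted[:, 0] == z_y, z_sorted[:, 1], z_sorted[:, 0])
    return -(z_y - z_other) / (z_sorted[:, 0] - z_sorted[:, 2] + 1e-12)
```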
B. State-of-the-Art Adversarial Example Detectors

To prevent adversarial example attacks, several detectors have been developed, including DkNN [6], LID [5], Mahalanobis' algorithm [8], and NNIF [7].
1) DkNN [6] combines the k-NN algorithm with the input representation in a DNN's hidden layers (i.e., feature map). DkNN identifies an adversarial example when the group of the representations of the example's k-NN in the hidden layers differs from that of examples of the predicted class.
2) LID [5] is under the assumption that the dimensions of the subspaces surrounding adversarial (perturbed) examples and benign (unperturbed) examples differ. LID estimates the dimension and accordingly detects adversarial examples.
3) Mahalanobis' algorithm [8] assumes that pretrained input features can be fit by a class-conditional Gaussian distribution. The Mahalanobis distance to the closest class-conditional distribution reveals adversarial examples.
4) The NNIF algorithm [7] assumes that the k-NN training samples (i.e., the nearest neighbors in the feature map space) and the most influential training samples (identified using an influence function) correlate for benign examples, but do not correlate for adversarial examples. The correlation is measured to detect if an attack is underway.
5) BEYOND [9] assumes that benign perturbations, i.e., random noises, with bounded budgets cause minor variations in the feature space, and then detects anomalous behaviors by distinguishing an adversarial example's relationship with its augmented versions, or neighbors, in terms of representation similarity and label consistency.
6) PNDetector [10] assumes the misclassification space is randomly distributed in the ideal feature space of a pretrained classifier. PNDetector is a positive–negative classifier trained on original examples (positive representations) and their negative representations that share the same structural and semantic features.

According to [7], LID [5], Mahalanobis [8], and NNIF [7] yield their respective best detection performance when using all the hidden layers of a DNN, while DkNN [6] achieves its best detection by only using the penultimate layer of the DNN.

III. SYSTEM MODEL

A. System Architecture

The proposed adversarial example detector runs in parallel with a DNN-based image classifier in computer vision applications to protect the image classifier, as illustrated in Fig. 1. When the DNN-based image classifier classifies an input image, the feature maps produced by several hidden layers of the image classifier are copied and sent to the detector for adversarial example detection.

Fig. 1. Proposed detector's location when used in a DNN-based computer vision application.

If adversarial perturbations are detected on the input image, the proposed adversarial example detector generates a notification to alert the administrator of the computer vision application. The image classifier's prediction is stopped from making any decision, such as granting access based on face recognition [21]. If the detector does not detect any hostile alteration in the input image, the DNN-based computer vision application continues functioning as usual.

B. Threat Model

We adopt the threat model described in [1], where an adversarial attacker attempts to mislead the DNN-based image classifier by feeding the classifier adversarial examples. The attacker can repeatedly perturb the pixels of an image until the DNN-based image classifier misclassifies the image to a different class from the correct one.

Assume that the attacker has complete knowledge of the DNN-based image classifier (or, in other words, the classifier is a white box to the attacker). Accordingly, the attacker can generate adversarial examples that can be misclassified by
the classifier. This is due to the fact that the neural network
architectures of the best-performing image classifiers (e.g., the
ResNet models [22]) are often common knowledge. Even if
the classifiers’ parameters are unknown to the attacker, the
attacker can learn a good surrogate of the classifier by sending
queries to the classifier and collecting responses [4].
We consider the typical situation where the attacker has
no knowledge of the adversarial example detector. In other
words, the detector is a black box to the attacker. This is
because, in most cases, the detection results are generally
inaccessible to the attacker, and hence the attacker can hardly
learn a surrogate of the detector. We also consider a relatively
rare situation where the attacker somehow gets access to the
adversarial example detector and its gradient (e.g., due to
a compromised server or a rogue employee). In this case,
a white-box attack [23] to both the DNN-based image classifier
and the adversarial example detector is evaluated.
Fig. 2. Architecture of the proposed detector.

IV. NEW SENTIMENT-ANALYSIS-BASED ADVERSARIAL EXAMPLE DETECTOR

We propose to interpret a series of feature maps (of an input image) produced by the different hidden layers of the DNN-based image classifier under an adversarial example attack as a sentence. We detect the presence of adversarial perturbations on the images by embedding the hidden-layer feature maps into a sentence and analyzing the sentiment of the sentence. The presence or absence of adversarial perturbation is translated to the positive or negative sentiment of the sentence, respectively. A sentiment analysis model developed originally for natural language processing (NLP), such as TextCNN [24] or long short-term memory (LSTM) [25], can be applied to detect the perturbations.

The rationale behind our interpretation of hidden-layer feature maps as a sentence for sentiment analysis is that the feature maps account for the subtle transition from a perturbed image of one class to a recognizable sample of another class. The feature map of a perturbed image (i.e., an adversarial example) can be closer to the target class than the correct class of the unperturbed image (i.e., a benign example) at the penultimate layer of an image classifier. On the other hand, the perturbed image is typically indistinguishable from the unperturbed one at the input layer of the image classifier, due to the typical imperceptibility of perturbations [2]. Using sentiment analysis, the progressively manifesting impact of perturbation on the hidden-layer feature maps can be exploited to detect adversarial examples.

The proposed detector is made up of two components: a word embedding layer E and a sentiment analyzer A. As illustrated in Fig. 2, the input of the proposed detector (i.e., a series of feature maps) is first mapped by the word embedding layer E to a sentence, and then analyzed by the sentiment analyzer A.

A. Word Embedding Layer

The word embedding layer translates the hidden-layer feature maps (i.e., the outputs of the hidden layers) of the DNN potentially under attack into a sentence for follow-on sentiment analysis. In sentiment analysis, a fixed-length vector, referred to as a "word vector," is used to represent a word in a sentence [26]. A string of word vectors is used as the input to classify the sentence between positive and negative sentiments or, in other words, benign and adversarial examples. In NLP, word vectors are typically obtained by embedding words in a vector space through unsupervised learning based on a large text dataset. For instance, Mikolov et al. [27] trained a set of word vectors on 100 billion words of Google News. However, there is no trained word vector for the hidden-layer feature maps of images. The feature maps produced at the different hidden layers of the DNN can have different sizes. A new approach is needed to embed the hidden-layer feature maps into word vectors to be used in sentiment analysis.

We design a new convolution-pooling (CP) module, which resizes the feature map produced by a selected hidden layer of the DNN to have the same dimension as the feature map of the next selected hidden layer, as shown in Fig. 2. Suppose that L hidden layers are selected from the DNN under attack for adversarial example detection. There are (L − 1) CP modules in the word embedding layer.

A three-tuple (c_i, w_i, h_i), i = 1, . . . , L, is used to describe the dimension of the feature map produced by the ith selected hidden layer of the DNN, where c_i, w_i, and h_i denote the number of channels, the width, and the height per channel, respectively. For the ith CP module, denoted by CP_i (i = 1, . . . , L − 1), the input and output dimensions are (c_i, w_i, h_i) and (c_{i+1}, w_{i+1}, h_{i+1}), respectively. In other words, CP_i converts a (c_i, w_i, h_i)-dimensional feature map to a (c_{i+1}, w_{i+1}, h_{i+1})-dimensional feature map.


Each CP module, i.e., CP_i (i = 1, . . . , L − 1), comprises a convolutional layer and a max-pooling layer. The convolutional layer of CP_i has c_{i+1} convolutional kernels to convert the c_i feature maps, one per channel, produced by the ith selected hidden layer of the DNN to c_{i+1} feature maps, one per channel. Then, the max-pooling layer of CP_i converts the width w_i and height h_i of each of the c_{i+1} feature maps into w_{i+1} and h_{i+1}, respectively. The convolutional layer and the pooling layer of each CP module are constructed with kernels of appropriate dimensions accordingly.

By concatenating CP_i, . . . , CP_{L−1}, the feature map of the ith selected hidden layer of the classifier is resized to be consistent with the feature map of the last (Lth) selected hidden layer, i.e., (c_L, w_L, h_L), as shown in Fig. 2. Likewise, the sizes of all L selected feature maps are unified to (c_L, w_L, h_L). The feature maps with the unified dimension of c_L × w_L × h_L are passed into a global average pooling layer, as shown in Fig. 2. The global average pooling layer flattens the feature maps by replacing each of the feature maps with the average value of its elements and translates them to word vectors of the same dimension of 1 × c_L. A sentence is constructed by concatenating the word vectors, and then output to the sentiment analyzer.

This modular design allows the CP modules to be reused for resizing different feature maps while keeping the number of CP modules to the minimum of only (L − 1), hence minimizing the number of learnable parameters in the word embedding layer. As a result, the word embedding layer is fast to train and computationally efficient.
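To make the CP-module cascade concrete, the following PyTorch sketch builds one CP module and chains (L − 1) of them into a word embedding layer. The use of adaptive pooling to reach the target width and height, and the padding choice, are illustrative assumptions; the article only specifies a 3 × 3 convolution kernel for the CP module (Section V-A).

```python
import torch
import torch.nn as nn

class CPModule(nn.Module):
    """Convolution-pooling module: (c_in, w_in, h_in) -> (c_out, w_out, h_out)."""
    def __init__(self, c_in, c_out, w_out, h_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
        # Adaptive pooling reaches the target spatial size regardless of input size.
        self.pool = nn.AdaptiveMaxPool2d((w_out, h_out))

    def forward(self, x):
        return self.pool(self.conv(x))

class WordEmbeddingLayer(nn.Module):
    """Maps L differently sized feature maps to L word vectors of length c_L."""
    def __init__(self, dims):
        # dims: list of (c_i, w_i, h_i) triples for the L selected hidden layers.
        super().__init__()
        self.cp = nn.ModuleList(
            [CPModule(dims[i][0], dims[i + 1][0], dims[i + 1][1], dims[i + 1][2])
             for i in range(len(dims) - 1)]
        )
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling -> 1 x c_L

    def forward(self, feature_maps):
        words = []
        for l, f in enumerate(feature_maps):
            for q in range(l, len(self.cp)):       # cascade CP_l ... CP_{L-1}
                f = self.cp[q](f)
            words.append(self.gap(f).flatten(1))   # (batch, c_L) word vector
        return torch.stack(words, dim=1)           # sentence: (batch, L, c_L)
```

As a usage note, for the five ResNet-34 blocks used in Section V, `dims` would hold the five (channels, width, height) triples of BN1 and Res1–Res4, and the layer then outputs a five-word sentence per image.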
B. Sentiment Analyzer

The sentiment analyzer is responsible for classifying the input sentences (i.e., strings of word vectors) into positive sentiments (i.e., perturbed images) or negative sentiments (i.e., unperturbed images). The sentiment analyzer is a shallow neural network, which typically contains a convolutional layer, a global max-pooling layer, and a fully connected layer. We choose TextCNN as the sentiment analyzer, due to its simple architecture and good sentence classification accuracy [24].

Different from a traditional neural network with equal-sized 2-D convolutional kernels in each hidden layer, the sentiment analyzer uses 1-D n-gram convolutional kernels in its convolutional layer [24]. Each of the convolutional kernels can take n word vectors as the input. n ranges from one to the number of word vectors in a sentence generated by the word embedding layer. There are multiple n-gram convolutional kernels with randomly initialized parameters to extract features from the n-length segments of a sentence.

The global max-pooling layer in the sentiment analyzer has two functions. First, it reduces the dimension of the feature maps output by the convolutional layer in the sentiment analyzer, hence helping counteract overfitting. Second, the global max-pooling layer can change the shape of hidden-layer feature maps so that the feature maps are flattened to a vector and fit the input size of the following fully connected layer. Finally, the fully connected layer generates a 1 × 2 row vector that provides the likelihood of the input example being benign or adversarial. To avoid overfitting, the fully connected layer has a dropout parameter of 0.5. The architecture of the sentiment analyzer is illustrated in Fig. 3.

Fig. 3. Architecture of the sentiment analyzer. There are M instances of n-gram convolutional kernels, n = 1, . . . , N.

C. Algorithm Summary

Algorithm 1 summarizes the adversarial example detection using the proposed detector, where L hidden layers are selected from the DNN-based image classifier to output feature maps to the detector. The time complexity of the algorithm is O(L^2). The space complexity of Algorithm 1 is measured by the number of learnable parameters of the proposed detector. Because the pooling layers do not have any learnable parameters, we only consider the learnable parameters in the convolutional layers. The space complexity of the word embedding layer is O(L^2 c_L w_L h_L). The space complexity of the sentiment analyzer is O(M N^2 c_L), where N is the number of types of n-gram convolutional kernels and M is the number of instances of a type of n-gram convolutional kernel. As a result, the space complexity of Algorithm 1 is O(L^2 c_L w_L h_L + M N^2 c_L).

Algorithm 1 Proposed Sentiment-Analysis-Based Adversarial Example Detector
input: {F_l} (l = 1, . . . , L), the set of feature maps output by L selected hidden layers of the image classifier.
output: the probability of the input example being adversarial.
1: ▷ Operation of the Word Embedding Layer E
2: Initialize {W_l} by setting W_l = F_l
3: for l = 1 to L do
4:   for q = l to L − 1 do ▷ unify dimension
5:     W_l ← CP_q(W_l) ▷ alter W_l's dimension
6:   end for
7:   Feed W_l to the global average pooling layer to obtain a 1 × c_L word vector and save it in W_l.
8: end for
9: Concatenate W_l (l = 1, . . . , L) to construct a 1 × Lc_L-dimensional sentence.
10: ▷ Operation of the Sentiment Analyzer A
11: Let T ← ∅ be a set of hidden-layer feature maps.
12: for n = 1 to N do
13:   for i = 1 to M do
14:     Generate a sentiment hidden-layer map by applying the ith n-gram convolutional kernel to the sentence;
15:     Add the sentiment hidden-layer map to T.
16:   end for
17: end for
18: Feed all hidden-layer feature maps in T to the global max-pooling layer and concatenate the outcomes in a vector;
19: Feed the vector to the fully connected layer that outputs a 1 × 2 vector containing the probabilities of the input example being benign and adversarial.

The learnable parameters of the proposed adversarial example detector, namely, the model weights and biases, can be optimized using supervised learning based on unperturbed and perturbed examples. The existing attack algorithms described in Section II-A can be used to perturb images and generate adversarial examples for training.
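A compact PyTorch sketch of the TextCNN sentiment analyzer and the end-to-end detector forward pass (Algorithm 1) is given below. It reuses the hypothetical `WordEmbeddingLayer` sketched in Section IV-A; the number of kernel instances M = 100, the N = 4 n-gram sizes, and the 0.5 dropout follow the settings reported in this article, while the remaining details are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TextCNNAnalyzer(nn.Module):
    """TextCNN sentiment analyzer: n-gram Conv1d kernels + global max pooling + FC."""
    def __init__(self, c_L, ngram_sizes=(1, 2, 3, 4), m_instances=100, dropout=0.5):
        super().__init__()
        # One Conv1d per n-gram size; each has m_instances output channels (kernel instances).
        self.convs = nn.ModuleList(
            [nn.Conv1d(c_L, m_instances, kernel_size=n) for n in ngram_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(m_instances * len(ngram_sizes), 2)  # benign vs. adversarial

    def forward(self, sentence):
        # sentence: (batch, L, c_L); Conv1d expects (batch, c_L, L).
        x = sentence.transpose(1, 2)
        pooled = [conv(x).relu().amax(dim=2) for conv in self.convs]  # global max pooling
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))        # (batch, 2) logits

class AdversarialExampleDetector(nn.Module):
    def __init__(self, dims):
        super().__init__()
        self.embed = WordEmbeddingLayer(dims)       # from the earlier sketch
        self.analyze = TextCNNAnalyzer(dims[-1][0])

    def forward(self, feature_maps):
        return self.analyze(self.embed(feature_maps))
```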

V. EXPERIMENTAL RESULTS

In this section, we evaluate the performance of the proposed detector, and then present a visualized explanation of the detection mechanism. Our code is available at https://github.com/wangfrombupt/adversarial_detector.

A. Experiment Setup

Our experimental setup is consistent with [7] in terms of image classifiers, datasets, attack models, benchmark detectors, and performance indicators. Cohen et al. [7] presented the latest study on adversarial example detection in the literature and developed the state-of-the-art detector, namely, NNIF, which is also used as a benchmark in our experiments.

1) Image Classifier: By default, the image classifier (a DNN) under attack is a deep residual network [22] with 34 hidden layers, referred to as ResNet-34. The feature extraction layers of ResNet-34 are divided into five successive hidden blocks, i.e., Batch Normalization 1 (BN1) and Residual Block 1 (Res1), Res2, Res3, and Res4. A convolutional layer is followed by a batch normalization layer in BN1. The rest of the hidden blocks are residual blocks, which are basic building blocks in a deep residual network model.

We also adopt Inception to build another image classifier based on the third version of the Inception network (referred to as Inception-V3). The feature extraction layers of Inception-V3 are divided into seven hidden blocks: the Stem block, Inception-A block (Inception-A), Reduction-A block (Reduction-A), Inception-B, Reduction-B, Inception-C, and the global Avg-pool block. The Stem block is divided into seven successive hidden layers, including five convolution layers and two pooling layers. An Inception block consists of three parallel sub-blocks of convolution layers and pooling layers, whose outputs are later concatenated. The rest of the hidden blocks are Reduction blocks, which are made up of three parallel sub-blocks (two convolution layers and one pooling layer). The feature maps from the following hidden blocks of Inception-V3 are used as inputs to the proposed detector: Stem, Inception-A, Inception-C, Reduction-B, and Avg-pool.

2) Datasets: Three popular image datasets are considered: CIFAR-10, CIFAR-100, and SVHN. Each of the three image datasets is divided into three subsets: a training set of 49 000 images, a validation set of 1000 images, and a testing set of 10 000 images.

3) Attack Models: Seven of the latest adversarial attack strategies are considered: AutoAttack [15], FGSM [11], JSMA [12], DeepFool [13], C&W [17], PGD [2], and EAD [14] (see Section II-A). The neural network tool used in support of the defense algorithms is PyTorch, except for the case when PNDetector is taken as the defense technique, because PNDetector is based on TensorFlow [28]. In this case, CleverHans [29], which supports TensorFlow, is used as the toolbox to support the attack strategies to produce adversarial samples against PNDetector. The other parameter configurations of the attacks are summarized in Table I.

TABLE I. Hyperparameter setting of the adversarial attacks considered. We use the Adversarial Robustness Toolbox (ART) [30], Foolbox [31], and TorchAttack [32] to launch the attacks.

Table II illustrates the adversarial examples generated by the latest attacks. All the attacks can mislead ResNet-34 into misclassifying inputs, often with high confidence. The residuals in the third column of the table show that these attack algorithms cause minor perturbation to the benign images, and hence may evade human inspection.

4) Performance Indicator: The area under the receiver operating characteristic (ROC) curve, or "AUC," serves as a performance metric to evaluate the adversarial example detectors. The AUC of a model with 100% incorrect predictions is 0. The AUC of a model with 100% correct predictions is 1 (or 100%). AUC is a useful tool because it assesses the accuracy of the model's predictions regardless of the classification threshold [33].
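The AUC can be computed directly from the detector's adversarial-class scores without fixing a threshold; a minimal example using scikit-learn (an assumption of this sketch, not a tool named by the article) is:

```python
from sklearn.metrics import roc_auc_score

# scores: detector's probability of "adversarial" for each test example,
# labels: 1 for adversarial examples, 0 for benign examples.
scores = [0.91, 0.12, 0.77, 0.05, 0.64]
labels = [1, 0, 1, 0, 1]
print(f"AUC = {100 * roc_auc_score(labels, scores):.2f}%")
```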
5) State-of-the-Art Detectors: The proposed detector is compared with the state-of-the-art adversarial example detectors, namely, LID [5], DkNN [6], NNIF [7], Mahalanobis [8], BEYOND [9], and PNDetector [10], as summarized in Section II-B. The setups of the benchmark detectors are optimized under each of the considered attack models and datasets. Specifically, we optimize the number of neighbors, denoted by k, for BEYOND, DkNN, and LID; the noise magnitude, denoted by γ, for the Mahalanobis algorithm; the number of high-influence samples, denoted by H, for the NNIF; and the false positive ratio (FPR) for the PNDetector. Based on the AUC values of the detection ROC curve, all the hyperparameters are validated with the validation set using nested cross entropy validation (except that the original hyperparameters of NNIF in [7] are used because of significant time to retrain the NNIF, as also pointed out in [7]). The hyperparameters of the benchmark detectors are summarized in Table III.

6) Setting of the Proposed Detector: We select five hidden layers from the ResNet-34 model as inputs to the word embedding layer of our detector. Each one is the last layer of a hidden block in the ResNet-34 model (i.e., BN1, Res1, Res2, Res3, and Res4). For the Inception-V3 model, we choose the last layers of its five hidden blocks (i.e., Stem, Inception-A, Inception-C, Reduction-B, and Avg-pool) as inputs to the word embedding layer of the proposed detector. The size of the convolutional kernel used in the CP module is 3 × 3. We use one-, two-, three-, and four-gram convolutional kernels in the sentiment analyzer of the proposed detector. Each of the convolutional kernels has 100 instances with randomly initialized parameters. The proposed detector is trained for ten epochs to minimize the cross-entropy loss, using the Adam optimizer with a learning rate of 0.0001.
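Under these settings, the detector's training loop reduces to standard supervised binary classification; a hedged sketch (assuming a `detector`, a hook-based `extract_feature_maps` helper, and a loader that yields images with benign/adversarial labels, all hypothetical names) is:

```python
import torch
import torch.nn as nn

def train_detector(detector, classifier, loader, extract_feature_maps, device="cuda"):
    # Ten epochs, Adam with learning rate 1e-4, cross-entropy loss (Section V-A).
    optimizer = torch.optim.Adam(detector.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()
    classifier.eval()
    for epoch in range(10):
        for images, is_adversarial in loader:
            images, is_adversarial = images.to(device), is_adversarial.to(device)
            with torch.no_grad():
                # Feature maps of the L selected hidden layers of the classifier.
                feature_maps = extract_feature_maps(classifier, images)
            logits = detector(feature_maps)
            loss = criterion(logits, is_adversarial)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```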


TABLE II. Examples and noises on the CIFAR-10 dataset under the latest attack algorithms. The DNN-based image classifier under attack is ResNet-34. The parameters of the attack algorithms are provided in Table I.

TABLE III. Optimally chosen parameter values of the benchmark detectors. k is the number of neighborhoods. γ is the noise magnitude. FPR stands for false positive rate.

B. Detection Performance

We examine the performance of the proposed detector and the baseline detection algorithms in defending against the considered latest attacks. As shown in Table IV, the new detector consistently outperforms all the existing detectors on all the considered datasets and image classifiers. The table also shows that the proposed detector is effective in defending against the DeepFool and EAD attacks, both of which are particularly destructive on the CIFAR-100 dataset and invalidate all the existing detectors. Particularly, none of the existing detectors can provide an AUC (i.e., the detection rate) of over 91%. In contrast, our new detector is able to achieve an AUC of over 94% toward both the DeepFool and EAD attacks on the CIFAR-100 dataset when the ResNet-34 model is deployed as the image classifier, and achieves an AUC of over 89% toward these attacks when the Inception-V3 model is deployed as the image classifier. On the other hand, the proposed detector is computationally efficient. As shown in Table V, it takes the detector less than 4.6 ms to detect an adversarial example, which is acceptable for many practical applications.

C. Visualization of the New Detector

We use t-distributed stochastic neighbor embedding (t-SNE) [34] to visualize each word vector in a sentence generated as described in Section IV-A. Here, t-SNE is a statistical tool that can visualize high-dimensional data by assigning a position in a low-dimensional map to each data point. Related objects are portrayed as close points, and dissimilar objects are represented by distant points.

Consider the DeepFool attack for its popularity and hard-to-detect property [7]. The visualization results based on DeepFool are shown in Table VI, where red and blue points correspond to adversarial and benign examples, respectively. Here, ResNet-34 serves as the image classifier. The feature maps output by the last layer of the hidden blocks, i.e., BN1, Res1, Res2, Res3, and Res4, are considered.

Table VI shows that, using the proposed interpretation of feature maps as word vectors, the adversarial and benign examples (i.e., the red and blue points) exhibit different levels of separability at the selected hidden layers of the classifier for adversarial example detection. The CIFAR-100 dataset is less visually separable than the CIFAR-10 and SVHN datasets. For this reason, the attacks on CIFAR-100 are generally harder to detect, as shown in Table IV. Nevertheless, the new detector is able to achieve reasonable detection rates, by exploiting the spatio-temporal characteristics of the feature maps for improved separability, as demonstrated in Table IV.

Fig. 4 shows the separability of adversarial and benign examples of the CIFAR-10 dataset under six considered attacks, using the word interpretation of feature maps in the proposed detector (more explicitly, the word embedding layer of the detector). We see that the adversarial examples (red) generated by the DeepFool and EAD attacks are more difficult to distinguish from the benign examples (blue), when compared with four other attacks. This is because DeepFool and EAD produce less perturbation than the other attacks, making their adversarial examples less distinguishable and reducing the detection capability of the existing detectors.
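The t-SNE projections of the word vectors can be reproduced along the following lines; this sketch assumes scikit-learn and matplotlib and hypothetical arrays `words_benign` and `words_adv` of word vectors collected from one hidden block for benign and adversarial inputs.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(words_benign, words_adv):
    # words_benign, words_adv: (num_examples, c_L) word vectors from one hidden block.
    words = np.vstack([words_benign, words_adv])
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(words)
    n = len(words_benign)
    plt.scatter(coords[:n, 0], coords[:n, 1], c="blue", s=4, label="benign")
    plt.scatter(coords[n:, 0], coords[n:, 1], c="red", s=4, label="adversarial")
    plt.legend()
    plt.show()
```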


TABLE IV. AUC scores (%) of the considered adversarial example detection algorithms under the different attacks on different datasets. The adopted backbones of the image classifier are ResNet-34 and Inception-V3, respectively.

TABLE V. Running time (in milliseconds) of our detector to detect an adversarial example on a Tesla K80 GPU card.

In Fig. 5, we use the Bhattacharyya distance [16] to quantify the separability of the red and blue clusters in Table VI. The Bhattacharyya distance is a measure of the amount of overlap between two statistical samples or populations. As shown in Fig. 5, the Bhattacharyya distances of the proposed detector are much larger than those of the image classifier under attack. The Bhattacharyya distances of the classifier are all close to zero at BN1 and gradually increase along the remaining hidden blocks until they are big enough to misclassify the perturbed images. The word embedding layer of our detector enlarges the difference between adversarial and benign examples, improving their separability and detection accuracy.
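Under a multivariate-Gaussian approximation of the two clusters, the Bhattacharyya distance has a closed form; a sketch of that computation (the Gaussian modeling is an assumption of this example, not a detail specified by the article) is:

```python
import numpy as np

def bhattacharyya_distance(x_benign, x_adv):
    # Gaussian approximation of each cluster: means m1, m2 and covariances s1, s2.
    m1, m2 = x_benign.mean(axis=0), x_adv.mean(axis=0)
    s1 = np.cov(x_benign, rowvar=False)
    s2 = np.cov(x_adv, rowvar=False)
    s = (s1 + s2) / 2
    diff = (m1 - m2).reshape(-1, 1)
    term1 = 0.125 * float(diff.T @ np.linalg.solve(s, diff))
    term2 = 0.5 * (np.linalg.slogdet(s)[1]
                   - 0.5 * (np.linalg.slogdet(s1)[1] + np.linalg.slogdet(s2)[1]))
    return term1 + term2
```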
Table VII demonstrates the processes of misclassifying frogs (the source class) into birds (the target class) in the CIFAR-10 dataset, along with the processes of correctly classifying frogs and birds in the dataset. DeepFool is launched to perturb frog images to be misclassified into birds. The t-SNE figures are plotted for the feature maps produced by the hidden blocks of the ResNet-34 image classifier under the DeepFool attack; see the second row of Table VII. The t-SNE figures are also plotted for the word vectors generated by the word embedding layer of the new detector based on the feature maps; see the third row of the table.

We see in Table VII that DeepFool is effective in misleading the classifier to misclassify frog images to bird images since adversarial examples (i.e., perturbed frog images) are not separable from unperturbed bird images and are distantly separate from unperturbed frog images at the last hidden block of the ResNet-34 model, i.e., Res4. The adversarial examples are not separable from the benign (unperturbed) examples at the earlier hidden blocks of the ResNet-34 model.

We also see in Table VII that the adversarial examples can be effectively separated from both the (unperturbed) frog and bird classes, after the feature maps from the hidden blocks of the ResNet-34 model are interpreted as word vectors in the proposed detector. The only exception is the word vector corresponding to the last hidden block of the ResNet-34, i.e., Res4. This is due to the fact that the feature maps of Res4 are directly fed to the global average pooling layer of the word embedding layer with no feature extracted; see Fig. 2. Nevertheless, the word vectors based on the feature maps do contribute to the detection of adversarial examples after becoming part of a sentence, as shown in Section V-D.

TABLE VI. t-SNE visualization of the word distribution generated from each selected hidden block of ResNet-34 under the DeepFool attack. Each point represents a word, corresponding to either an adversarial (red) or a benign example (blue).

TABLE VII. Comparison of distributions of words generated from each selected hidden block of ResNet-34 under the DeepFool attack on the CIFAR-10 dataset and their transformed version on the proposed detector. The first row corresponds to an image classifier based on a ResNet-34 model. The second row corresponds to the proposed adversarial example detector. The input of the proposed detector consists of the feature maps produced by BN1, Res1, Res2, Res3, and Res4 for detecting adversarial examples fed into the ResNet-34-based image classifier. Each point represents a word corresponding to an adversarial example (red), a benign example of the source class frog (blue), or a benign example of the target class bird (green).

D. Ablation Study

To help understand the proposed detector, we conduct two ablation studies. The first is to modify the word embedding layer by masking out some of the generated word vectors. The second ablation study is to replace the sentiment analyzer with different neural network architectures.

1) Ablation of the Word Embedding Layer: Table VIII evaluates the impact of the number of selected hidden layers of the ResNet-34 model on the detection performance of the proposed detector with regard to the adversarial examples generated by the latest attacks (i.e., DeepFool and AutoAttack). The second column indicates the number of selected hidden layers. The third column lists the selected hidden layers.

As shown in Table VIII, the detection of adversarial examples improves steadily with the increasing number of selected hidden layers under the proposed adversarial example detector, across all three considered datasets. This is due to the fact that more observations of the transformations of the image representation in the classifier (or, in other words, a longer sentence) provide richer information to distinguish the adversarial and benign examples. This validates our design of interpreting a series of hidden-layer feature maps as a sentence for effective adversarial example detection.

As also shown in Table VIII, the inclusion (or direct use) of the input image does not always contribute to the improvement of adversarial example detection. As a matter of fact, the table consistently shows that the adversarial example detection depends primarily on the feature maps produced by the later hidden layers of the DNN under attack, where a perturbed image is increasingly transformed to a different class of natural images in the image classifier.


TABLE VIII. AUC score (%) under different numbers of hidden layers selected in the ResNet-34-based image classifier for adversarial example detection by the proposed detector. DeepFool and AutoAttack (0.02) are the attack models.

Fig. 4. t-SNE figures of the words generated at the Res1 hidden block of the ResNet-34 model on the CIFAR-10 dataset. Red and blue points represent feature maps corresponding to adversarial and benign examples, respectively. (a) DeepFool. (b) EAD. (c) FGSM (0.1). (d) Auto (0.02). (e) Auto (8/255). (f) JSMA (1, 0.1). (g) PGD (0.02). (h) PGD (8/255). (i) C&W.

Fig. 5. Bhattacharyya distance between the distributions of adversarial and benign examples across hidden blocks of the ResNet-34 model under DeepFool attacks.

2) Ablation of the Sentiment Analyzer: The sentiment analyzer uses n-gram convolutional kernels to extract features. An n-gram convolutional kernel convolves n words at a time. We first investigate the impact of the number of different n-gram convolutional kernels on the performance of adversarial example detection. The latest typical attacks, AutoAttack and DeepFool, are used to produce adversarial examples. Table IX shows that the best detection accuracy is achieved when all four types of n-gram convolutional kernels are implemented in the convolutional layer of the sentiment analyzer, except for AutoAttack on the CIFAR-10 dataset. The reason is that an n-gram convolutional kernel attempts to learn local features when n is small, or global features when n is large. With all types of n-gram convolutional kernels being used, features are captured at all scales.

TABLE IX. Impact of n-gram convolutional kernels in the sentiment analyzer on the AUC score (%), where DeepFool and AutoAttack (0.02) are the attack models. ResNet-34 is used to build the image classifier.

We proceed to validate our selection of TextCNN, compared with other commonly used sentiment analyzers: CNN [35], the bidirectional LSTM (BiLSTM) network [36], and recurrent convolutional neural networks for text classification (TextRCNN) [37]. The structures of the alternative sentiment analyzers are illustrated in Fig. 6.

Fig. 6. Architectures of neural networks used as the alternative sentiment analyzers. (a) CNN-based detector. (b) BiLSTM-based detector. (c) TextRCNN-based detector.

1) CNN-Based Sentiment Analyzer: The CNN-based sentiment analyzer comprises four convolutional layers, three max-pooling layers, an adapted average-pooling layer, and two fully connected layers. The hyperparameters of the convolutional layers are: the kernel size is 3 × 3, the padding size is 1, and the stride size is 1. The hyperparameters of the max-pooling layers are: the kernel size is 2 × 4, and the stride size is 2 × 4. The output size of the adapted average-pooling layer is 2 × 2. We replicate the word vectors from the word embedding layer ten times to construct an expanded sentence input to the CNN.

TABLE IX TABLE X
I MPACT OF n-G RAM C ONVOLUTIONAL K ERNELS IN THE S ENTIMENT A NA - AUC S CORES (%) W HEN D ETECTING D EEP F OOL A DVERSARIAL E XAM -
LYZER ON THE AUC S CORE (%), W HERE D EEP F OOL AND AUTOAT- PLES U SING D IFFERENT S ENTIMENT A NALYZERS
TACK (0.02) A RE THE ATTACK M ODELS . R ES N ET-34 I S U SED TO
B UILD THE I MAGE C LASSIFIER

layer, and a fully connected layer. The n-gram convolu-


tional layer contains one-, two-, three-, and four-gram
convolutional kernels, each with 100 instances. The
global max-pooling layer outputs four vectors with a
length of 100 per vector (the output of an instance of
an n-gram convolutional kernel is reduced to a scalar
by the global max-pooling layer, and the outputs of the
100 instances are concatenated into a 1 × 100 vector),
which are then concatenated into a 1 × 400 vector and
fed to the fully connected layer.
DeepFool is used to launch the attack, since it is one of the
currently most hard-to-detect attacks [7].
Table X shows comparison of the proposed TextCNN-based
detector and the above alternative sentiment analyzers. It is
observed that the proposed TextCNN-based detector outper-
forms its CNN-based, BLSTM-based, and TextRCNN-based
counterparts in all the considered datasets (i.e., CIFAR-10,
SVHN, and CIFAR-100), despite the fact that the number of
its learnable parameters is more than halved compared with the
CNN-based detector (i.e., 2.06 × 106 in the TextCNN-based
detector versus 4.15 × 106 in the CNN-based detector). More-
over, TextCNN requires a substantially lower number of model
parameters than the other considered sentiment analyzer struc-
tures, e.g., by an order of magnitude, when compared with
BiLSTM and TextRCNN. The TextCNN is not only superior
in adversarial example detection but also much more computa-
tionally efficient. Our sentence interpretation of feature maps
and subsequent adoption of the TextCNN-based sentiment
analysis is effective in detecting adversarial perturbations on
images.

E. Generalization of the New Detector

We further assess the generalization capability of the proposed TextCNN-based adversarial example detector, where the detector is trained on perturbed examples generated by one attack model and tested on examples perturbed by another attack model. This generalization ability is important in situations where the detector has no prior knowledge of the attack.

Tables XI and XII show that the proposed detector generalizes well, with an average detection rate of more than 90% in most cases (see the last columns of the tables). When trained against DeepFool on the CIFAR-10 and SVHN datasets, the detector effectively generalizes to attacks launched by various recent attack models (such as AutoAttack, FGSM, PGD, EAD, C&W, and JSMA). Consistent results are observed whether the image classifier is ResNet-34 or Inception-V3.

In the presence of unseen attacks, the detector achieves an average AUC score of more than 98% on the CIFAR-10 and SVHN datasets. When trained against PGD on the CIFAR-100 dataset, the proposed detector also generalizes well to unseen attacks, with an average AUC score of over 93% for the ResNet-34 image classifier and over 87% for the Inception-V3 image classifier. However, when the unseen attack is AutoAttack, the detection performance of the detector trained against EAD falls below 50%. To this end, an ensemble of detectors trained against DeepFool and PGD is recommended to ensure adequate generalization to unseen attacks, especially for classification tasks with a large number of classes, e.g., CIFAR-100.
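The cross-attack evaluation reported in Tables XI and XII can be run in the following style: a detector trained against one attack scores examples crafted by a different attack, and the AUC is computed over benign and adversarial inputs. This is a minimal sketch; `detector`, `sentences_benign`, and `sentences_adv` are hypothetical placeholders for the trained model and the embedded feature-map sentences of clean and unseen-attack images.

```python
import torch
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def auc_on_unseen_attack(detector, sentences_benign, sentences_adv):
    """Score a detector trained on one attack against examples from another.

    `detector` maps embedded sentences to (benign, adversarial) logits;
    the two sentence tensors are hypothetical inputs for clean images and
    for images perturbed by an attack unseen during training.
    """
    detector.eval()
    scores, labels = [], []
    for sents, label in ((sentences_benign, 0), (sentences_adv, 1)):
        logits = detector(sents)
        # The probability assigned to the "adversarial" class is the detection score.
        prob_adv = torch.softmax(logits, dim=1)[:, 1]
        scores.append(prob_adv.cpu())
        labels.append(torch.full((sents.size(0),), label))
    return roc_auc_score(torch.cat(labels).numpy(), torch.cat(scores).numpy())
```

For example, a detector trained against DeepFool on CIFAR-10 would be scored here against AutoAttack or C&W examples, mirroring the per-column entries of Tables XI and XII.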

[Table XI] Evaluation of the generalization of the new detector when detecting adversarial examples generated by other attack algorithms. The metric is the AUC score (%). The image classifier is ResNet-34, which achieves classification accuracies of 93.72%, 95.97%, and 75.29% on benign images in the CIFAR-10, SVHN, and CIFAR-100 datasets, respectively.

It is also noted that the proposed detector performs better on the CIFAR-10 dataset than on the CIFAR-100 dataset, because its detection performance relies on the feature maps produced by the image classifier. The classification accuracy of the ResNet-34 (and Inception-V3) image classifier drops by about 20% when switching from CIFAR-10 to CIFAR-100; in other words, the feature maps of the image classifiers capture more comprehensive features from the CIFAR-10 dataset than from the CIFAR-100 dataset.

Fig. 7 reveals that all the attacks cause the perturbed images to deviate from their unperturbed versions to different degrees (as measured by the Bhattacharyya distance), which leads to the perturbed images being misclassified. Among all the considered attack models, DeepFool and PGD (8/255) are the most representative attacks on the ten-class datasets (i.e., CIFAR-10 and SVHN) and the 100-class dataset (i.e., CIFAR-100), respectively. As a result, the proposed adversarial example detector trained against DeepFool or PGD (8/255) generalizes effectively to attacks launched by the other attack models, as shown in Table XI.

F. Defense Against White-Box Attacks

Finally, we consider a relatively rare yet more threatening situation in which an attacker can access the gradient of the detector's loss function and adapt the adversarial examples to fool both the image classifier and the proposed detector. We use PGD as the benchmark attack (with the perturbation budget ε = 8/255, the iteration step size α = 2/255, and 20 iterations [2]), since PGD can iteratively refine its perturbation to an image based on the gradients of the loss functions of both the image classifier and the detector with respect to the image.

When performing the adapted PGD attacks, the attacker can take two strategies to combine the image classifier's loss function with the proposed detector's loss function for the generation of new adversarial examples.
1) The first strategy is to alternate between minimizing the loss functions of the classifier and the detector [38]. That is, the attacker updates the adversarial examples based on the gradient of the classifier's loss function in odd-numbered steps and based on the gradient of the detector's loss function in even-numbered steps.
2) The second strategy is to linearly combine the two loss functions into one by replacing (1) with [39]

    x_{i+1} = \Pi_{x + S_\epsilon}\Big( x_i + \alpha \big[ (1-\sigma)\,\mathrm{sign}\big(\nabla_x L_c(x, y)\big) + \sigma\,\mathrm{sign}\big(\nabla_x L_d(x, y_d)\big) \big] \Big)    (2)

    where y_d is the ground-truth class of the input image x, i.e., adversarial or benign, and σ ∈ [0, 1] is a weighting coefficient that balances the image classifier's loss function L_c(·, ·) and the detector's loss function L_d(·, ·). A sketch of this combined update is given below.
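The following is a minimal sketch of the combined-loss PGD update in (2), assuming PyTorch, cross-entropy losses, and a differentiable mapping `to_detector_input` from an image to the detector's input; all three, as well as the clamping to [0, 1], are assumptions for illustration. Setting σ to 0 on odd-numbered steps and to 1 on even-numbered steps recovers the alternating strategy in 1).

```python
import torch
import torch.nn.functional as F

def adapted_pgd(classifier, detector, to_detector_input, x, y, y_d,
                eps=8 / 255, alpha=2 / 255, steps=20, sigma=0.5):
    """Sketch of the adapted PGD attack in (2).

    `classifier`, `detector`, and `to_detector_input` are placeholders;
    `y` is the image's true class, and `y_d` is the detector's ground-truth
    label for the input, as defined after (2). Losses are assumed to be
    cross-entropy.
    """
    x_orig = x.detach()
    x_adv = x_orig.clone()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss_c = F.cross_entropy(classifier(x_adv), y)                      # L_c(x, y)
        loss_d = F.cross_entropy(detector(to_detector_input(x_adv)), y_d)   # L_d(x, y_d)
        grad_c, = torch.autograd.grad(loss_c, x_adv, retain_graph=True)
        grad_d, = torch.autograd.grad(loss_d, x_adv)
        with torch.no_grad():
            # Weighted combination of the two sign directions, as in (2).
            step = (1 - sigma) * grad_c.sign() + sigma * grad_d.sign()
            x_adv = x_adv + alpha * step
            # Projection onto x + S_eps (the eps-ball around the clean image).
            x_adv = x_orig + (x_adv - x_orig).clamp(-eps, eps)
            x_adv = x_adv.clamp(0, 1)   # keep a valid image (assumed pixel range)
    return x_adv.detach()
```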
[Table XII] Evaluation of the generalization of the new detector when detecting adversarial examples generated by other attack algorithms. The metric is the AUC score (%). The image classifier is Inception-V3, which achieves classification accuracies of 93.83%, 95.79%, and 73.01% on benign images in the CIFAR-10, SVHN, and CIFAR-100 datasets, respectively.

[Fig. 7] Bhattacharyya distances between the distributions of adversarial and benign examples. The proposed detector trained on adversarial examples generated by DeepFool is used to produce word vectors. The image classifier is ResNet-34. (a) CIFAR-10. (b) SVHN. (c) CIFAR-100.

To detect the adapted attacks, the proposed detector is first trained on the original adversarial examples produced by PGD (0.02), and its detection is then improved by training it again on the newly generated PGD adversarial examples.

[Table XIII] Adapted PGD attack on the CIFAR-10 dataset using the classifier's and the new detector's loss functions in an alternating manner, i.e., σ = 0 in odd-numbered steps and σ = 1 in even-numbered steps.

Table XIII shows that the proposed detector remains highly effective under the adapted PGD attack based on the first strategy of alternating minimization of the classifier's and detector's loss functions. Specifically, when the attacker refines an adversarial example against the pretrained detector for 20 iterations (i.e., one epoch), the detection accuracy (i.e., the ratio of detected adversarial examples) drops to 50.67%. Nonetheless, our detector improves its accuracy to 95.74% after being trained against the adversarial examples for five epochs.

[Table XIV] Adapted PGD attack on CIFAR-10 using the combined classifier's and detector's loss function.

Table XIV shows that the proposed detector is also effective under the adapted PGD attack using the second strategy of the combined classifier's and detector's loss function. After being trained for five epochs under the first strategy, the adversarial example detector is robust enough to achieve a detection accuracy of over 98% for σ < 1. When the attacker optimizes the adversarial examples solely on the loss function of the detector, i.e., σ = 1, the detection accuracy drops from 98.15% to 93.04%. Nevertheless, the attack success rate of the newly generated adversarial examples drops dramatically to only 7.79%. We conclude that the proposed detector can effectively withstand white-box attacks.

VI. CONCLUSION

In this article, we proposed a new adversarial example detector that recasts adversarial image detection as a text sentiment analysis problem and performs binary classification with a TextCNN model. Extensive tests demonstrated the superiority of the detector in detecting various recent attacks on three popular datasets. The new detector also demonstrated strong generalization by accurately detecting adversarial examples generated by unknown attacks, and it is resistant to white-box attacks in situations where the gradients of the detector are exposed. The new detector has only about 2 million parameters and takes less than 4.6 ms to detect an adversarial example generated by the latest attack models using a Tesla K80 GPU card.

REFERENCES

[1] D. J. Miller, Z. Xiang, and G. Kesidis, "Adversarial learning targeting deep neural network classification: A comprehensive review of defenses against attacks," Proc. IEEE, vol. 108, no. 3, pp. 402-433, Mar. 2020.
[2] A. Madry et al., "Towards deep learning models resistant to adversarial attacks," in Proc. ICLR, Vancouver, BC, Canada, Apr. 2018, pp. 1-11.
[3] A. Chernikova, A. Oprea, C. Nita-Rotaru, and B. Kim, "Are self-driving cars secure? Evasion attacks against deep neural networks for steering angle prediction," in Proc. IEEE Secur. Privacy Workshops (SPW), May 2019, pp. 132-137.
[4] Y. Zhong and W. Deng, "Towards transferable adversarial attack against deep face recognition," IEEE Trans. Inf. Forensics Security, vol. 16, pp. 1452-1466, 2021.
[5] X. Ma et al., "Characterizing adversarial subspaces using local intrinsic dimensionality," in Proc. ICLR, Vancouver, BC, Canada, Apr. 2018, pp. 1-12.
[6] N. Papernot et al., "Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning," 2018, arXiv:1803.04765.
[7] G. Cohen et al., "Detecting adversarial samples using influence functions and nearest neighbors," in Proc. CVPR, Seattle, WA, USA, Jun. 2020, pp. 14453-14462.
[8] K. Lee et al., "A simple unified framework for detecting out-of-distribution samples and adversarial attacks," in Proc. NeurIPS, Dec. 2018, pp. 7167-7177.
[9] Z. He et al., "Be your own neighborhood: Detecting adversarial example by the neighborhood relations built on self-supervised learning," 2022, arXiv:2209.00005.
[10] W. Luo, C. Wu, L. Ni, N. Zhou, and Z. Zhang, "Detecting adversarial examples by positive and negative representations," Appl. Soft Comput., vol. 117, Mar. 2022, Art. no. 108383.
[11] I. J. Goodfellow et al., "Explaining and harnessing adversarial examples," in Proc. ICLR, San Diego, CA, USA, May 2015, pp. 1-11.
[12] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in Proc. IEEE Eur. Symp. Secur. Privacy (EuroS&P), Mar. 2016, pp. 372-387.
[13] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: A simple and accurate method to fool deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016.
[14] P. Chen et al., "EAD: Elastic-net attacks to deep neural networks via adversarial examples," in Proc. AAAI, Feb. 2018, pp. 1-8.
[15] F. Croce et al., "Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks," in Proc. ICML, vol. 119, Jul. 2020, pp. 2206-2216.
[16] A. Mohammadi and K. N. Plataniotis, "Improper complex-valued Bhattacharyya distance," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 5, pp. 1049-1064, May 2016.
[17] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in Proc. IEEE Symp. Secur. Privacy (SP), San Jose, CA, USA, May 2017, pp. 39-57.
[18] G. Jin, S. Shen, D. Zhang, W. Duan, and Y. Zhang, "Deep saliency map estimation of hand-crafted features," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 4262-4266.
[19] F. Croce and M. Hein, "Minimally distorted adversarial examples with a fast adaptive boundary attack," in Proc. Int. Conf. Mach. Learn., vol. 119, 2020, pp. 2196-2205.
[20] M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein, "Square attack: A query-efficient black-box adversarial attack via random search," in Proc. ECCV, vol. 12368, Aug. 2020, pp. 484-501.
[21] J. Y. Choi and B. Lee, "Ensemble of deep convolutional neural networks with Gabor face representations for face recognition," IEEE Trans. Image Process., vol. 29, pp. 3270-3281, 2020.
[22] V. Santhanam and L. S. Davis, "A generic improvement to deep residual networks based on gradient flow," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 7, pp. 2490-2499, Jul. 2020.
[23] K. Alrawashdeh and S. Goldsmith, "Defending deep learning based anomaly detection systems against white-box adversarial examples and backdoor attacks," in Proc. IEEE Int. Symp. Technol. Soc. (ISTAS), Nov. 2020, pp. 294-301.
[24] Y. Kim, "Convolutional neural networks for sentence classification," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Doha, Qatar, 2014, pp. 1746-1751.
[25] D. Deng, L. Jing, J. Yu, and S. Sun, "Sparse self-attention LSTM for sentiment lexicon construction," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 27, no. 11, pp. 1777-1790, Nov. 2019.
[26] L. Yu, J. Wang, K. R. Lai, and X. Zhang, "Refining word embeddings using intensity scores for sentiment analysis," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 3, pp. 671-681, Mar. 2018.
[27] T. Mikolov et al., "Distributed representations of words and phrases and their compositionality," in Proc. Adv. Neural Inf. Process. Syst. 26, Lake Tahoe, NV, USA, Dec. 2013, pp. 3111-3119.
[28] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," 2016, arXiv:1603.04467.
[29] N. Papernot et al., "Cleverhans v2.1.0: An adversarial machine learning library," in Proc. USENIX, Aug. 2018, pp. 1-5.
[30] M.-I. Nicolae et al., "Adversarial robustness toolbox v1.0.0," 2018, arXiv:1807.01069.

[31] J. Rauber, R. Zimmermann, M. Bethge, and W. Brendel, "Foolbox native: Fast adversarial attacks to benchmark the robustness of machine learning models in PyTorch, TensorFlow, and JAX," J. Open Source Softw., vol. 5, no. 53, p. 2607, Sep. 2020.
[32] H. Kim, "Torchattacks: A PyTorch repository for adversarial attacks," 2020, arXiv:2010.01950.
[33] L. Jing, C. Shen, L. Yang, J. Yu, and M. K. Ng, "Multi-label classification by semi-supervised singular value decomposition," IEEE Trans. Image Process., vol. 26, no. 10, pp. 4612-4625, Oct. 2017.
[34] A. Chatzimparmpas, R. M. Martins, and A. Kerren, "T-viSNE: Interactive assessment and interpretation of t-SNE projections," IEEE Trans. Vis. Comput. Graphics, vol. 26, no. 8, pp. 2696-2714, Aug. 2020.
[35] S. Yu, K. Wickstrøm, R. Jenssen, and J. C. Príncipe, "Understanding convolutional neural networks with information theory: An initial exploration," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 1, pp. 435-442, Jan. 2021.
[36] L. Huang and C. Pun, "Audio replay spoof attack detection by joint segment-based linear filter bank feature extraction and attention-enhanced DenseNet-BiLSTM network," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 28, pp. 1813-1825, 2020.
[37] S. Lai et al., "Recurrent convolutional neural networks for text classification," in Proc. AAAI, Austin, TX, USA, Jan. 2015, pp. 2267-2273.
[38] F. Tramèr et al., "On adaptive attacks to adversarial example defenses," in Proc. NeurIPS, Dec. 2020, pp. 1-11.
[39] J. H. Metzen et al., "On detecting adversarial perturbations," in Proc. ICLR, Toulon, France, Apr. 2017, pp. 1-12.

Yulong Wang (Member, IEEE) received the Ph.D. degree in computer science and technology from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2010. He was the Visiting Scientist of the Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, NSW, Australia, from 2019 to 2020. He is currently an Associate Professor and a Ph.D. Supervisor with the School of Computer Science (National Pilot Software Engineering School), BUPT. His research interests include deep learning, software engineering, the Internet of Things, and network security.

Tianxiang Li received the B.E. degree in computer science and technology from Northeastern University, Qinhuangdao, China, in 2020. He is currently pursuing the master's degree in computer science and technology with the Beijing University of Posts and Telecommunications, Beijing, China. His research interests include deep learning, software engineering, and network security.

Shenghong Li (Member, IEEE) received the B.S. degree in communication engineering from Nanjing University, Nanjing, Jiangsu, China, in 2008, and the Ph.D. degree in electronic and computer engineering from The Hong Kong University of Science and Technology (HKUST), Hong Kong, in 2014. He is currently the Senior Research Scientist of Data61, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, NSW, Australia. His research interests include computer vision, deep learning, wireless tracking, data fusion, and wireless communication.

Xin Yuan (Member, IEEE) received the B.E. degree from the Taiyuan University of Technology, Taiyuan, Shanxi, China, in 2013, and the dual Ph.D. degree from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, and the University of Technology Sydney (UTS), Sydney, NSW, Australia, in 2019 and 2020, respectively. She is currently the Research Scientist of Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney. Her research interests include machine learning and optimization, and their applications to the Internet of Things and intelligent systems.

Wei Ni (Senior Member, IEEE) received the B.E. and Ph.D. degrees in communication science and engineering from Fudan University, Shanghai, China, in 2000 and 2005, respectively. He was a Post-Doctoral Research Fellow with Shanghai Jiao Tong University, Shanghai, from 2005 to 2008; the Deputy Project Manager of the Bell Laboratories, Alcatel/Alcatel-Lucent, Shanghai, from 2005 to 2008; and a Senior Researcher with Devices Research and Development, Nokia, from 2008 to 2009. He is currently the Principal Research Scientist of Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, NSW, Australia; a Conjoint Professor with the University of New South Wales, Sydney; an Adjunct Professor with the University of Technology Sydney, Sydney; and an Honorary Professor with Macquarie University, Sydney. He has authored seven book chapters, more than 280 journal articles, 100 conference papers, 25 patents, and ten standard proposals accepted by IEEE. His research interests include machine learning, online learning, stochastic optimization, and their applications to system efficiency and integrity. Dr. Ni has served first as the Secretary and then the Vice-Chair for the IEEE New South Wales (NSW) Vehicular Technology Society (VTS) Chapter from 2015 to 2019, the Track Chair for VTC-Spring 2017, the Track Co-Chair for IEEE VTC-Spring 2016, the Publication Chair for BodyNet 2015, and the Student Travel Grant Chair for WPMC 2014. He has been the Chair of the IEEE VTS NSW Chapter since 2020, an Editor of IEEE Transactions on Wireless Communications since 2018, and an Editor of IEEE Transactions on Vehicular Technology.
