An Overview of Backdoor Attacks Against Deep Neural Networks and Possible Defences
ABSTRACT Together with impressive advances touching every aspect of our society, AI technology based
on Deep Neural Networks (DNN) is bringing increasing security concerns. While attacks operating at test
time have monopolised the initial attention of researchers, backdoor attacks, exploiting the possibility of
corrupting DNN models by interfering with the training process, represent a further serious threat under-
mining the dependability of AI techniques. In backdoor attacks, the attacker corrupts the training data to
induce an erroneous behaviour at test time. Test-time errors, however, are activated only in the presence
of a triggering event. In this way, the corrupted network continues to work as expected for regular inputs,
and the malicious behaviour occurs only when the attacker decides to activate the backdoor hidden within the
network. Backdoor attacks have recently become an intense research domain, focusing both on the development
of new classes of attacks and on the proposal of possible countermeasures. The goal of this overview is to review
the works published until now, classifying the different types of attacks and defences proposed so far. The
classification guiding the analysis is based on the amount of control that the attacker has on the training
process, and the capability of the defender to verify the integrity of the data used for training, and to monitor
the operations of the DNN at training and test time. Hence, the proposed analysis is suited to highlight the
strengths and weaknesses of both attacks and defences with reference to the application scenarios they are
operating in.
INDEX TERMS Backdoor attacks, backdoor defences, AI security, deep learning, deep neural networks.
I. INTRODUCTION
Artificial Intelligence (AI) techniques based on Deep Neural Networks (DNN) are revolutionising the way we process and analyse data, due to their superior capabilities to extract relevant information from complex data, like images or videos, for which precise statistical models do not exist. On the negative side, increasing concerns are being raised regarding the security of DNN architectures when they are forced to operate in an adversarial environment, wherein the presence of an adversary aiming at making the system fail can not be ruled out. In addition to attacks operating at test time, with an increasing amount of works dedicated to the development of suitable countermeasures against adversarial examples [1], [2], attacks carried out at training time have recently attracted the interest of researchers. In most cases, training time attacks involve poisoning the training data as in [3]–[8]. Defences against such attacks have also been studied in [9]–[12]. Among the attacks operating during training, backdoor attacks are raising increasing concerns due to the possibility of stealthily injecting a malevolent behaviour within a DNN model by interfering with the training phase. The malevolent behaviour (e.g., a classification error), however, occurs only in the presence of a triggering event corresponding to a properly crafted input. In this way, the backdoored network continues working as expected for regular inputs, and the malicious behaviour is activated only when the attacker feeds the network with a triggering input.

The earliest works demonstrating the possibility of injecting a backdoor into a DNN have been published in 2017 [4], [6], [13], [14]. Since then, an increasing number of works have been dedicated to such a subject, significantly enlarging the class of available attacks, and the application scenarios
potentially targeted by backdooring attempts. The proposed attacks differ on the basis of the event triggering the backdoor at test time, the malicious behaviour induced by the activation of the backdoor, the stealthiness of the procedure used to inject the backdoor, the modality through which the attacker interferes with the training process, and the knowledge that the attacker has about the attacked network.

As a reaction to the new threats posed by backdoor attacks, researchers have started proposing suitable solutions to mitigate the risk that the dependability of a DNN is undermined by the presence of a hidden backdoor. In addition to methods to reveal the presence of a backdoor, a number of solutions to remove the backdoor from a trained model have also been proposed, with the aim of producing a cleaned model that can be used in place of the infected one [15]–[17]. Roughly speaking, the proposed solutions for backdoor detection can be split into two categories: methods detecting the backdoor injection attempts at training time, e.g. [18], [19], and methods detecting the presence of a backdoor at test time, e.g., [19]–[22]. Each defence targets a specific class of attacks and usually works well only under a specific threat model.

As it always happens when a new research trend appears, the flurry of works published in the early years has explored several directions, with only few and scattered attempts to systematically categorise them. Time is ripe to look at the work done until now, to classify the attacks and defences proposed so far, highlighting their suitability to different application scenarios, and to evaluate their strengths and weaknesses. To the best of our knowledge, only a few papers overviewing backdoor attacks and defences have been published so far [23]–[26], some with a limited scope, and others focusing on a specific attack surface, namely, the outsourced cloud environment. In particular, [26] provides a thorough analysis of the vulnerabilities caused by the difficulty of checking the trustworthiness of the data used to train a DNN, discussing various types of attacks and defences, mostly operating at the training-dataset level. A benchmark study introducing a common evaluation setting for different backdoor and data poisoning attacks, without considering defences, has also been published in [27]. With respect to existing overviews, we make the additional effort to provide a clear definition of the threat models, and use it to classify backdoor attacks by adopting an innovative perspective based on the control that the attacker has on the training process. As to countermeasures, we do not restrict the analysis to defences based on the inspection of the training data (as done by some previous overviews). On the contrary, we also review defences operating at testing time, suitable for scenarios wherein the attacker has full control of the training process and the defender can not access the training data.

To be more specific, the contributions of the present work can be summarised as follows:
• We provide a formalization of backdoor attacks, defining the possible threat models and the corresponding requirements (Section II). A rigorous description of the threat models under which the backdoor attacks and defences operate is, in fact, a necessary step for a proper security analysis. We distinguish between different scenarios depending on the control that the attacker has on the training process. In particular, we propose a novel taxonomy that classifies attacks into i) full control attacks, wherein the attacker is the trainer herself, who, then, controls every step of the training process, and ii) partial control attacks, according to which the attacker can interfere with the training phase only partially. The requirements that attacks and defences must satisfy in the various settings are also described, as they are closely related to the threat models.
• We systematically review the backdoor attacks proposed so far, specifying the control scenario under which they can operate, with particular attention to whether the attacker can corrupt the labels of the training samples or not.
• We provide a thorough review of possible defences, by casting them in the classification framework defined previously. In particular, we propose a novel categorization of defences based on the control that the defender has on the training and testing phases, and on the level at which they operate, that is: i) data level, ii) model level, and iii) training dataset level. The defences within each category are further classified based on the approach followed for the detection and the removal of the backdoor. Thanks to the proposed classification, defence methods can be compared according to the extent by which they satisfy the requirements set by the threat model wherein they operate.
• We point out possible directions for future research, reviewing the most challenging open issues.

To limit the scope and length of the paper, we focus on attacks and defences in the field of image and video classification, leaving aside other application domains, e.g., natural language processing [28], [29]. We also avoid discussing the emerging field of attacks and defences in collaborative learning scenarios, like federated learning [30]–[34]. Finally, we stress that the survey is not intended to review all the methods proposed so far; on the contrary, we describe in detail only the most significant works of each attack and defence category, and provide a pointer to all the other methods we are aware of.

We expect that research on backdoor attacks and corresponding defences will continue to surge in the next years, due to the seriousness of the security threats they pose, and hope that the present overview will help researchers to focus on the most interesting and important challenges in the field.

The rest of this paper is organised as follows: in Section II, we formalize the backdoor attacks, by paying great attention to discuss the attack surface and the possible defence points. Then, in Section III, we review the literature of backdoor attacks. Following the categorization introduced in Section II,
clean-label attack, the labelling process is up to the legitimate trainer.

TABLE 1. List of Symbols.

Let us indicate with $\tilde{y}_i^{tr}$ the label associated to $\tilde{x}_i^{tr}$. The set with the labeled poisoned samples forms the poisoning dataset $\mathcal{D}_{tr}^{p} = \{(\tilde{x}_i^{tr}, \tilde{y}_i^{tr}),\ i = 1, \ldots, |\mathcal{D}_{tr}^{p}|\}$. The poisoning dataset is merged with the benign dataset $\mathcal{D}_{tr}^{b} = \{(x_i^{tr}, y_i^{tr}),\ i = 1, \ldots, |\mathcal{D}_{tr}^{b}|\}$ to generate the poisoned training dataset $\mathcal{D}_{tr}^{\alpha}$, where $\alpha$ denotes the fraction of poisoned samples:

$$\alpha = \frac{|\mathcal{D}_{tr}^{p}|}{|\mathcal{D}_{tr}^{p}| + |\mathcal{D}_{tr}^{b}|} \qquad (4)$$
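To make the bookkeeping behind Eq. (4) concrete, the following is a minimal sketch (not taken from the paper) of how an attacker could size and merge the poisoning dataset; the NumPy arrays and function names are illustrative assumptions.

```python
import numpy as np

def poisoning_ratio(n_poisoned: int, n_benign: int) -> float:
    """Poisoning ratio of Eq. (4): alpha = |D_p| / (|D_p| + |D_b|)."""
    return n_poisoned / (n_poisoned + n_benign)

def poisoned_samples_needed(alpha: float, n_benign: int) -> int:
    """How many poisoned samples must be crafted so that the merged training
    set D_tr^alpha reaches the desired poisoning ratio alpha."""
    return int(round(alpha * n_benign / (1.0 - alpha)))

def merge_datasets(benign_x, benign_y, poisoned_x, poisoned_y, seed=0):
    """Merge (and shuffle) the benign dataset D_tr^b with the poisoning dataset D_tr^p."""
    x = np.concatenate([benign_x, poisoned_x])
    y = np.concatenate([benign_y, poisoned_y])
    perm = np.random.default_rng(seed).permutation(len(x))
    return x[perm], y[perm]

# Example: reaching alpha = 0.05 on top of 10,000 benign images requires 526 poisoned ones.
print(poisoned_samples_needed(0.05, 10_000))   # -> 526
print(round(poisoning_ratio(526, 10_000), 4))  # -> 0.05
```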
she can interfere with data collection and, optionally, with labelling, as shown in Fig. 4. If Eve cannot interfere with the labelling process, we say that backdoor injection is achieved in a clean-label way; otherwise, we say that the attack is carried out in a corrupted-label modality. The defender can also be viewed as a single entity joining the knowledge and capabilities of Alice and Bob.

Attacker's knowledge and capability: even if Eve does not rule the training process, she can still obtain some information about it, like the architecture of the attacked network, the loss function $\mathcal{L}$ used for training, and the hyperparameters $\psi$. By relying on this information, Eve is capable of:
• Poisoning the data: Eve can poison the training dataset in a stealthy way, e.g. by generating a set of poisoned samples $(\tilde{x}_1^{tr}, \tilde{x}_2^{tr}, \ldots)$ and releasing them on the Internet as a bait waiting to be collected by Alice [40].
• Tampering the labels of the poisoned samples (optional): when acting in the corrupted-label modality, Eve can mislabel the poisoned data $\tilde{x}_i^{tr}$ as belonging to any class, while in the clean-label case, labelling is controlled by Alice. Note that, given a target label t for the attack, in the corrupted-label scenario, samples from other classes ($y \in \mathcal{Y} \setminus \{t\}$) are poisoned by Eve and the poisoned samples are mislabelled as t, that is, $\tilde{y}_i^{tr} = t$, while in the clean-label scenario, Eve poisons samples belonging to the target class t. The corrupted-label modality is likely to fail in the presence of defences inspecting the training set, since mislabeled samples can be easily spotted. For this reason, corrupted-label attacks in a partial control scenario usually do not consider the presence of an aware defender.

Defender's knowledge and capability: as shown in Fig. 4, the defender role can be played by both Alice and Bob, who can monitor both the training process and the testing phase. From Bob's perspective, the possible defences are the same as in the full control scenario, with the possibility of acting at data and model levels. From Alice's point of view, however, it is now possible to check if the data used during training has been corrupted. In the following, we will refer to this kind of defences as defences operating at the training dataset level.
• Training dataset level: at this level, Alice can inspect the training dataset $\mathcal{D}_{tr}^{\alpha}$ to detect the presence of poisoned

E. REQUIREMENTS
In this section, we list the different requirements that the attacker and the defender(s) must satisfy in the various settings. Regarding the attacker, in addition to the main goals already listed in Section II-C, she must satisfy the following requirements:
• Poisoned data indistinguishability: in the partial control scenario, Alice may inspect the training dataset to detect the possible presence of poisoned data. Therefore, the samples in the poisoned dataset $\mathcal{D}_{tr}^{p}$ should be as indistinguishable as possible from the samples in the benign dataset. This means that the presence of the triggering pattern υ within the input samples should be as imperceptible as possible. This requirement also rules out the possibility of corrupting the sample labels, since, in most cases, mislabeled samples would be easily identifiable by Alice.
• Trigger robustness: in a physical scenario, where the triggering pattern is added to real-world objects, it is necessary that the presence of υ can activate the backdoor even when υ has been distorted due to the analog-to-digital conversion associated with the acquisition of the input sample from the physical world. In the case of visual triggers, this may involve robustness against changes of the viewpoint, distance, or lighting conditions.
• Backdoor robustness: in many applications (e.g. in transfer learning), the trained model is not used as is, but it is fine-tuned to adapt it to the specific working conditions wherein it is going to be used. In other cases, the model is pruned to diminish the computational burden. In all these cases, it is necessary that the backdoor introduced during training is robust against minor model changes like those associated with fine tuning, retraining, and model pruning.
With regard to the defender, the following requirements must be satisfied:
• Efficiency: at the data level, the detector $Det(\cdot)$ is deployed as a pre-processing component, which filters out the adversarial inputs and allows only benign inputs to enter the classifier. Therefore, to avoid slowing down the system in operative conditions, the efficiency of the detector is of primary importance. For instance, a backdoor detector employed in autonomous-driving applications
corrupted (positive) samples correctly detected as such, FP indicates the number of benign (negative) samples incorrectly detected as corrupted, TN is the number of negative samples correctly detected as such, and FN stands for the number of positive samples detected as negative ones. For a good detector, both TPR and TNR should be close to 1.
• Harmless removal: At different levels, the defender can use the removal function $Rem(\cdot)$ to prevent an undesired behaviour of the model. At the model or training dataset level, $Rem(\cdot)$ directly prunes the model $F_{\theta^{\alpha}}$ or retrains it to obtain a clean model $F_{\theta^{c}}$. At the data level, $Rem(\cdot)$ filters out or cures the adversarial inputs. When equipped with such an input filter, $F_{\theta^{\alpha}}$ will be indicated by $F_{\theta^{c}}$. An eligible $Rem(\cdot)$ should keep the performance of $F_{\theta^{c}}$ similar to that of $F_{\theta^{\alpha}}$, i.e., $A(F_{\theta^{c}}, \mathcal{D}_{ts}^{b}) \approx A(F_{\theta^{\alpha}}, \mathcal{D}_{ts}^{b})$, and meanwhile reduce $ASR(F_{\theta^{c}}, \mathcal{D}_{ts}^{p})$ to a value close to zero.

Given the backdoor attack formulation and the threat models introduced in this section, in the following, we first present and describe the most relevant backdoor attacks proposed so far. Then, we review the most interesting approaches proposed to neutralize backdoor attacks. Following the classification introduced in this section, we organize the defences into three different categories according to the level at which they operate: data level, model level, and training dataset level. Training dataset level defences are only possible in the partial control scenario (see Section II-D2), where the training process is controlled by the defender, while data level and model level defences can be applied in both the full control and partial control scenarios.

The quantities ASR, ACC, TPR, and TNR introduced in this section are defined as fractions (and hence should be represented as decimal numbers); however, in the rest of the paper, we will refer to them as percentages.

III. BACKDOOR INJECTION
In this section, we review the methods proposed so far to inject a backdoor into a target network. Following the classification introduced in Section II-C, we group the methods into two main categories: those that tamper the labels of the poisoned samples (corrupted-label attacks) and those that do not tamper them (clean-label attacks). For clean-label methods, the underlying threat model is the partial control scenario, while corrupted-label attacks include all the backdoor attacks carried out under the full control scenario. Corrupted-label attacks can also be used in the partial control case,² as long as the requirement of poisoned data indistinguishability is met, e.g., when the ratio of corrupted samples is very small (that is, α ≪ 1), in such a way that the presence of the corrupted labels goes unnoticed.

² In principle, clean-label attacks could also be conducted in a full control scenario. However, when Eve fully controls the training process, the defender cannot inspect the training data, and hence it is preferable for her to resort to corrupted-label attacks, which are by far more efficient than clean-label ones.

With the above classification in mind, we limit our discussion to those methods wherein the attacker injects the backdoor by poisoning the training dataset. Indeed, there are some methods, working under the full control scenario, where the attacker directly changes the model parameters θ or the architecture F to inject a backdoor into the classifier, see for instance [38], [39], [41]–[44]. Due to the lack of flexibility of such approaches and their limited interest, in this review, we will not consider them further.

FIGURE 5. Triggering patterns υ adopted in Gu et al.'s work [4]: (a) a digit '7' with the triggering pattern superimposed on the right-bottom corner (the image is labeled as digit '1'); (b) a 'stop sign' (labeled as a 'speed-limit') with a sunflower-like trigger superimposed.

A. CORRUPTED-LABEL ATTACKS
Backdoor attacks were first proposed by Gu et al. [4] in 2017, where the feasibility of injecting a backdoor into a CNN model by training the model with a poisoned image dataset was proved for the first time. According to [4], each poisoned image $\tilde{x}_i^{tr} \in \mathcal{D}_{tr}^{p}$ includes a triggering pattern υ and is mislabelled as belonging to the target class t of the attack, that is, $\tilde{y}_i^{tr} = t$. Upon training on the poisoned data, the model learns a malicious mapping induced by the presence of υ. The poisoned input is generated by a poisoning function $P(x, \upsilon)$, which replaces x with υ in the positions identified by a (binary) mask m. Formally:

$$\tilde{x} = P(x, \upsilon), \quad \tilde{x}_{ij} = \begin{cases} \upsilon_{ij} & \text{if } m_{ij} = 1 \\ x_{ij} & \text{if } m_{ij} = 0 \end{cases} \qquad (7)$$

where i, j indicate the vertical and horizontal positions of x, υ, and m. The authors consider two types of triggering patterns, as shown in Fig. 5, where the digit 7 with the superimposed pixel pattern is labelled as "1," and the 'stop' sign with the sunflower pattern is mislabeled as a 'speed-limit' sign. Based on experiments run on MNIST [45], Eve can successfully embed a backdoor into the target model with a poisoning ratio

on the classification accuracy of benign samples. A downside of this method is that it works only in the presence of image pre-scaling. In addition, it requires that the attacker knows the specific scaling operator $S(\cdot)$ used for image pre-processing.

2) IMPROVING BACKDOOR ROBUSTNESS
A second direction followed by researchers to improve the early backdoor attacks aims at improving the robustness of the backdoor (see Section II-E) against network reuse and other possible defences. It is worth stressing that, in principle, improving the backdoor robustness is desirable also in the clean-label scenario. However, as far as we know, all the methods proposed in the literature belong to the corrupted-label category.

In this vein, Yao et al. [60] have proposed a method to improve the robustness of the backdoor against transfer learning. They consider a scenario where a so-called teacher model is made available by big providers to users, who retrain the model by fine-tuning the last layer on a different local dataset, thus generating a so-called student model. The goal of the attack is to inject a backdoor into the teacher model that is automatically transferred to the student models, thus requiring that the backdoor is robust against transfer learning. Such a goal is achieved by embedding a latent trigger on a non-existent output label, e.g. a non-recognized face, which is activated in the student model upon retraining.

Specifically, given the training dataset $\mathcal{D}_{tr}$ of the teacher model, Eve injects the latent backdoor by solving the following optimization problem:

$$\arg\min_{\theta} \sum_{i=1}^{|\mathcal{D}_{tr}|} \mathcal{L}\big(f_{\theta}(x_i^{tr}), y_i^{tr}\big) + \lambda \, \frac{1}{|\mathcal{D}_{t}|} \sum_{x_t \in \mathcal{D}_{t}} \big\| f_{\theta}^{k}\big(P(x_i^{tr}, \upsilon)\big) - f_{\theta}^{k}(x_t) \big\| \qquad (9)$$

where $\mathcal{D}_{t}$ is the dataset of the target class, and the second term in the loss function ensures that the trigger υ has a representation similar to that of the target class t in the intermediate (k-th) layer. Then, since transfer learning will only update the final FC layer, the latent backdoor will remain hidden in the student model, to be activated by the trigger υ. Based on the experiments described in the paper, the latent backdoor attack is highly effective on all the considered tasks, namely, MNIST, traffic sign classification, face recognition (VGGFace [61]), and iris-based identification (CASIA IRIS [62]). Specifically, by injecting 50 poisoned samples in the training dataset of the teacher model, the backdoor is activated in the student model with an ASR larger than 96%. Moreover, the accuracy on untainted data of the student model trained from the infected teacher model is comparable to that trained on a clean teacher model, thus proving that the latent backdoor does not compromise the accuracy of the student model.

In 2020, Tan et al. [63] designed a defence-aware backdoor attack to bypass existing defence algorithms, including spectral signature [18], activation clustering [19], and pruning [16]. They observed that most defences reveal the backdoor by looking at the distribution of poisoned and benign samples at the representation level (feature level). To bypass such a detection strategy, the authors propose to add to the loss function a regularization term to minimize the difference between the poisoned and benign data in a latent space representation.⁴ In [63], the baseline attacked model (without the proposed regularization) and the defence-aware model (employing the regularization) are compared by running some experiments with VGGNet [64] on the CIFAR10 classification task. Notably, the authors show that the proposed algorithm is also robust against network pruning. Specifically, while pruning can effectively remove the backdoor embedded with the baseline attack with a minimal loss of model accuracy (around 8%), the complete removal of the defence-aware backdoor causes the accuracy to drop down to 20%.

By analyzing existing backdoor attacks, Li et al. [65] show that when the triggering patterns are slightly changed, e.g., their location is changed in the case of local patterns, the attack performance degrades significantly. Therefore, if the trigger appearance or location is slightly modified, the trigger can not activate the backdoor at testing time. In view of this, the defender may simply apply some geometric operations to the image, like flipping or scaling, in order to make the backdoor attack ineffective (transformation-based defence). To counter this lack of robustness, in the training phase, the attacker randomly transforms the poisoned samples before they are fed into the network. Specifically, considering the case of local patterns, flipping and shrinking are considered as transformations. The effectiveness of the approach against a transformation-based defence has been tested by considering VGGNet and ResNet [66] as network architectures and the CIFAR10 dataset. Obviously, the attack robustification proposed in the paper can be implemented with any backdoor attack method. Similarly, Gong et al. [67] adopt a multi-location trigger to design a robust backdoor attack (named RobNet), and claim that the diversity of the triggering pattern can make it more difficult to detect and remove the backdoor.

Finally, in 2021, Cheng et al. [68] proposed a novel backdoor attack, called Deep Feature Space Trojan (DFST), that is at the same time visually stealthy and robust to most defences. The method assumes that Eve can control the training procedure, being then suitable in a full control scenario. A trigger generator (implemented via CycleGAN [69]) is used to get an invisible trigger that causes a misbehaviour of the model. The method resorts to a complex training procedure where the trigger generator and the model are iteratively updated in order to enforce the learning of subtle and complex (more robust) features as the trigger. The authors show that DFST can successfully evade three state-of-the-art defences: ABS [70], Neural Cleanse [16], and meta-classification [71] (see Section V for a description of these defences). Similarly, [72]

⁴ This defence-aware attack assumes that the attacker can interfere with the (re)training process; it thus makes more sense under the full control scenario.
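The optimization in Eq. (9) can be illustrated with a short PyTorch-style sketch. The toy model, the `features()` method and the batch handling below are illustrative assumptions, not details taken from [60]; for simplicity the target-class representations are summarised by their mean, whereas Eq. (9) averages the distance over all target-class samples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    """Toy classifier exposing an intermediate (k-th layer) representation."""
    def __init__(self, in_dim=32 * 32 * 3, hidden=64, n_classes=10):
        super().__init__()
        self.body = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_classes)

    def features(self, x):          # stands in for f_theta^k in Eq. (9)
        return self.body(x)

    def forward(self, x):
        return self.head(self.body(x))

def latent_backdoor_loss(model, x_clean, y_clean, x_poisoned, target_feats, lam=0.1):
    """Cross-entropy on clean samples plus a term pulling the intermediate
    representation of poisoned (trigger-stamped) inputs towards the target-class features."""
    clean_loss = F.cross_entropy(model(x_clean), y_clean)
    latent_loss = torch.norm(model.features(x_poisoned) - target_feats, dim=1).mean()
    return clean_loss + lam * latent_loss

# Toy usage with random tensors standing in for D_tr, the poisoned samples and D_t.
model = TinyNet()
x_clean, y_clean = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
x_poisoned = torch.rand(4, 3, 32, 32)
target_feats = model.features(torch.rand(16, 3, 32, 32)).mean(dim=0).detach()
loss = latent_backdoor_loss(model, x_clean, y_clean, x_poisoned, target_feats)
loss.backward()
```

The same structure also conveys the defence-aware idea of [63]: the second term forces poisoned and benign data to overlap in the latent space, which is what representation-based defences look at.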
FIGURE 9. Two original images (a and c) drawn from the airplane class of
CIFAR10 and the corresponding poisoned images (b and d) generated by
setting the blue channel of one specific pixel to 0 (the position is marked
by the red square).
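Both conspicuous patches (Fig. 5) and nearly invisible triggers such as the single-pixel perturbation of Fig. 9 can be expressed through the masked replacement of Eq. (7). Below is a minimal NumPy sketch of that poisoning function; the 3×3 corner pattern, the image layout, and the relabelling step are illustrative assumptions rather than the exact settings of [4].

```python
import numpy as np

def poison(x: np.ndarray, trigger: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Eq. (7): keep x where mask == 0 and replace it with the trigger where mask == 1.
    x and trigger have shape (H, W, C); mask has shape (H, W)."""
    m = mask[..., None].astype(bool)          # broadcast the mask over the colour channels
    return np.where(m, trigger, x)

def make_corner_trigger(height, width, channels, size=3):
    """A white square of `size` x `size` pixels in the bottom-right corner."""
    trigger = np.zeros((height, width, channels), dtype=np.float32)
    mask = np.zeros((height, width), dtype=np.uint8)
    trigger[-size:, -size:, :] = 1.0
    mask[-size:, -size:] = 1
    return trigger, mask

# Corrupted-label poisoning of a single sample: stamp the trigger and relabel it as class t.
x = np.random.rand(28, 28, 1).astype(np.float32)     # e.g. an MNIST-like digit
trigger, mask = make_corner_trigger(28, 28, 1)
x_poisoned, y_poisoned = poison(x, trigger, mask), 1  # mislabelled as the target class t = 1
```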
FIGURE 13. Schematic representation of feature suppression backdoor attack. Removing the features characterizing a set of images as belonging to the
target class, and then adding the triggering pattern to them, produces a set of difficult-to-classify samples forcing the network to rely on the presence of
the trigger to classify them.
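The feature-suppression procedure summarised in the caption of Fig. 13 can be sketched in a few lines: target-class images are first perturbed so that their class-discriminative features are suppressed, and the trigger is then stamped on them while the (true) label is kept. The sketch below assumes a pre-trained PyTorch classifier with inputs in [0, 1]; the simple iterated gradient-sign loop is a stand-in for a full PGD implementation, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def suppress_class_features(model, x, target_label, eps=8 / 255, steps=10, step_size=2 / 255):
    """Maximise the classification loss for the target class within an L_inf ball,
    so that the images no longer carry the features identifying them as class t."""
    labels = torch.full((x.shape[0],), target_label, dtype=torch.long)
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), labels)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()          # ascend the loss
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)    # project back into the L_inf ball
            x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()

def stamp(x, trigger, mask):
    """Paste the trigger where mask == 1 (the masked replacement of Eq. (7))."""
    return x * (1 - mask) + trigger * mask

def craft_feature_suppression_poison(model, x_target_class, target_label, trigger, mask):
    """Clean-label poisoning: suppress the class features, then add the trigger."""
    x_adv = suppress_class_features(model, x_target_class, target_label)
    return stamp(x_adv, trigger, mask)   # labels stay equal to the (true) target class

# Toy usage with an untrained stand-in model (a real attack would use a pre-trained classifier).
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x_t = torch.rand(4, 3, 32, 32)
mask = torch.zeros(1, 3, 32, 32); mask[..., -3:, -3:] = 1
poisoned = craft_feature_suppression_poison(model, x_t, 0, torch.ones_like(mask), mask)
```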
victim's training set is known to the attacker, and can work in a black-box scenario.

Recently, Saha et al. [91] have proposed a pattern-based feature collision attack to inject a backdoor into the model in such a way that at test time any image containing the triggering pattern activates the backdoor. As in [40], the backdoor is embedded into a pre-trained model in a transfer learning scenario, where the trainer only fine-tunes the last layer of the model. In order to achieve clean-label poisoning, the authors superimpose a pattern, located in random positions, to a set of target instances $x_t$, and craft a corresponding set of poisoned images as in Shafahi's work, via Eq. (10). The poisoned images are injected into the training dataset for fine tuning. To ease the process, the choice of the to-be-poisoned images is optimized, by selecting those samples that are close, in the feature space, to the target instances patched by the trigger. By running their experiments on the ImageNet and CIFAR10 datasets, the authors show that the fine-tuned model correctly associates the presence of the trigger with the target category even though the model has never seen the trigger explicitly during training.

A final example of a feature-collision attack, relying on GAN technology, is proposed in [92]. The architecture in [92] includes one generator and two discriminators. Specifically, given the benign sample x and the target sample $x_t$, as shown in Eq. (10), the generator is responsible for generating a poisoned sample $\tilde{x}$. One discriminator controls the visibility of the difference between the poisoned sample $\tilde{x}$ and the original one, while the other tries to move the poisoned sample $\tilde{x}$ close to the target instance $x_t$ in the feature space.

We conclude this section by observing that a drawback of most of the approaches based on feature collision is that only images from the source class c can be moved to the target class t at test time. This is not the case with the attacks in [83] and [84], where images from any class can be moved to the target class by embedding the trigger within them at test time.

3) SUPPRESSION OF CLASS DISCRIMINATIVE FEATURES
To force the network to look at the presence of the trigger in a clean-label scenario, Turner et al. [82] have proposed a method that suppresses the ground-truth features of the image before embedding the trigger υ. Specifically, given a pre-trained model $\hat{F}_{\theta}$ and an original image x belonging to the target class t, the attacker first builds an adversarial example using the PGD algorithm [93]:

$$x^{adv} = \arg\max_{x': \|x' - x\|_{\infty} \leq \epsilon} \mathcal{L}\big(\hat{f}_{\theta}(x'), t\big) \qquad (11)$$

Then, the trigger υ is superimposed to $x^{adv}$ to generate a poisoned sample $\tilde{x} = P(x^{adv}, \upsilon)$, by pasting the trigger over the right corner of the image. Finally, $(\tilde{x}, t)$ is injected into the training set. The assumption behind the feature suppression attack is that training a new model $F_{\theta}$ with $(\tilde{x}, t)$ samples, built after the typical features of the target class have been removed, forces the network to rely on the trigger υ to correctly classify those samples as belonging to class t. The whole poisoning procedure is illustrated in Fig. 13. To verify the effectiveness of the feature-suppression approach, the authors compare the performance of their method with those obtained with a standard attack wherein the trigger υ is stamped directly onto some random images belonging to the target class. The results obtained on CIFAR10 show that, with a target poisoning ratio equal to β = 0.015, an ASR of 80% can be achieved (with ε = 16/256), while the standard approach is not effective at all.

In [94], Zhao et al. exploited the suppression method to design a clean-label backdoor attack against a video classification network. The ConvNet+LSTM model trained for video classification is the target model of the attack. Given a clean pre-trained model $\hat{F}_{\theta}$, the attacker generates a universal adversarial trigger υ using gradient information through iterative optimization. Specifically, given all the videos $x_i$ from the training dataset, except those belonging to the target class, the universal trigger $\upsilon^{*}$ is generated by minimizing the cross-entropy loss as follows:

$$\upsilon^{*} = \arg\min_{\upsilon} \sum_{i=1}^{N_{\setminus\{t\}}} \mathcal{L}\big(\hat{f}_{\theta}(x_i + \upsilon), t\big) \qquad (12)$$

where $N_{\setminus\{t\}}$ denotes the total number of training samples except those of the target class t, and υ is the triggering pattern superimposed in the bottom-right corner. By minimizing the above loss, the authors determine the universal adversarial trigger $\upsilon^{*}$, leading to a classification in favor of the target
class. Then, the PGD algorithm is used to build an adversarially perturbed video $x^{adv}$ for the target class t, as done in [82]. Finally, the generated universal trigger $\upsilon^{*}$ is stamped on the perturbed video $x^{adv}$ to generate the poisoned data $\tilde{x} = P(x^{adv}, \upsilon^{*})$, and $(\tilde{x}, t)$ is finally injected into the training dataset $\mathcal{D}_{tr}$. The experiments carried out on the UCF101 dataset of human actions [95], with a trigger size equal to 28 × 28 and a poisoning ratio β = 0.3, report an attack success rate equal to 93.7%.

IV. DATA LEVEL DEFENCES
With data level defences, the defender aims at detecting and possibly neutralizing the triggering pattern contained in the network input to prevent the activation of the backdoor. When working at this level, the defender should satisfy the harmless removal requirement while preserving the efficiency of the system (see Section II-E), avoiding that the scrutiny of the input samples slows down the system too much. In the following, we group the approaches working at the data level into three classes: i) saliency map analysis; ii) input modification; and iii) anomaly detection.

With regard to the first category, Bob analyses the saliency maps corresponding to the input image, e.g., by GradCAM [100], to look for the presence of suspicious activation patterns. In the case of localised triggering patterns, the saliency map may also reveal the position of the trigger. Methods based on input modification work by modifying the input samples in a predefined way (e.g. by adding random noise or blending the image with a benign sample) before feeding them into the network. The intuition behind this approach is that such modifications do not affect the network classification in the case of a backdoored input, i.e., an input containing the triggering pattern. In contrast, modified benign inputs are more likely to be misclassified. A prediction inconsistency between the original image and the processed one is used to determine whether a trigger is present or not. Finally, methods based on anomaly detection exploit the availability of a benign dataset $\mathcal{D}_{be}$ to train an anomaly detector that is used during testing to judge the genuineness of the input. Note that white-box access to the model under analysis is required by methods based on saliency map analysis, while most methods based on input modification and anomaly detection require only a black-box access to the model. Some defences following the above three approaches are described in the following.

The methods described in this section are summarized in Table 2, where for each method we report the trigger constraints, the working conditions, the kind of access to the network they require, the necessity of building a dataset of benign images $\mathcal{D}_{be}$, and the performance achieved on the tested datasets.⁷ While some algorithms aim only at detecting the malevolent inputs, others directly try to remove the backdoor without detecting the backdoor first, or without reporting the performance of the detector ('N/A' in the table). A similar table will be provided later in the paper for the methods described in Sections V and VI.

A. SALIENCY MAP ANALYSIS
The work proposed by Chou et al. [20] in 2018, named SentiNet, aims at revealing the presence of the trigger by exploiting the GradCAM saliency map to highlight the parts of the input image that are most relevant for the prediction. The approach works under the assumption that the trigger is a local pattern of small size and has recognizable edges, so that a segmentation algorithm can cut out the triggering pattern υ from the input.

Given a test image $x^{ts}$ and the corresponding prediction $F_{\theta^{\alpha}}(x^{ts})$, the first step of SentiNet consists in applying the GradCAM algorithm to the predicted class. Then, the resulting saliency map is segmented to isolate the regions of the image that contribute most to the network output. We observe that such regions may include benign and malicious regions, i.e. the region(s) corresponding to the triggering pattern (see Fig. 14). At this point, the network is tested again on every segmented region, so as to obtain the potential ground-truth class. For an honest image, in fact, we expect that all the segments will contribute to the same class, namely the class initially predicted by the network, while for a malicious input, the classes predicted on different regions may be different, since some of them correspond to the pristine image content, while others contain the triggering patch. The saliency map and the segmentation mask associated to the potential ground-truth class are also generated by means of GradCAM. Then,

⁷ All the data reported in this and subsequent tables are taken directly from the original papers.
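The input-modification strategy described above (blending the test input with benign images and looking for prediction inconsistency) lends itself to a very compact sketch. The blending weight, the agreement threshold, and the toy stand-in classifier below are illustrative assumptions, not the exact procedure of any specific paper.

```python
import torch

@torch.no_grad()
def looks_backdoored(model, x, benign_images, blend_weight=0.5, agreement_threshold=0.7):
    """Blend the test input with several benign images and compare predictions.

    If x contains a triggering pattern, the trigger tends to dominate the prediction
    even after blending, so the blended copies keep the original label; for a benign x,
    heavy blending is likely to change the predicted class."""
    original = model(x.unsqueeze(0)).argmax(dim=1).item()
    blended = blend_weight * x.unsqueeze(0) + (1 - blend_weight) * benign_images
    agree = (model(blended).argmax(dim=1) == original).float().mean().item()
    return agree >= agreement_threshold     # high agreement -> flag the input as suspicious

# Toy usage: an untrained stand-in classifier and random images in place of D_be.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
benign_images = torch.rand(32, 3, 32, 32)   # a small benign dataset D_be
x_test = torch.rand(3, 32, 32)
print(looks_backdoored(model, x_test, benign_images))
```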
are tested via the model $F_{\theta^{\alpha}}$. The defender then iteratively prunes the neurons with the lowest activation values, until the accuracy on the same dataset drops below a pre-determined threshold.

The difficulty of removing a backdoor by relying only on fine-tuning is shown also in [110]. For this reason, [110] suggests using attention distillation to guide the fine-tuning process. Specifically, Bob first fine-tunes the backdoored model on a benign dataset $\mathcal{D}_{be}$, then he applies attention distillation by setting the backdoored model as the student and the fine-tuned model as the teacher. The empirical results shown in [110] prove that in this way the fine-tuned model is insensitive to the presence of the triggering pattern in the input samples, without causing obvious performance degradation on benign samples.

Recently, Zhao et al. [73] have proposed a more efficient defence relying on model connectivity [111]. In particular, [73] shows that two independently trained networks with the same architecture and loss function can be connected in the coefficient-loss landscape by a simple parametric curve (e.g. a polygonal chain [112] or a Bezier curve [113]). The curve, namely the path connecting the two models (the endpoints of the curve), can be learned with a limited amount of benign data, i.e., a small $\mathcal{D}_{be}$, with all the models in the path having a similar loss value (performance). The authors showed that when two backdoored models are considered as endpoints, the models in the path can attain similar performance on clean data while drastically reducing the success rate of the backdoor attack. The same behavior can be obtained in the case of only one backdoored model, where the set $\mathcal{D}_{be}$ is used to fine-tune the model, and the two models, namely the original backdoored one and the fine-tuned one, are connected.

Model level defences do not introduce a significant computational overhead, given that they operate before the network is actually deployed in operative conditions. As a drawback, to implement these methods, Bob needs white-box access to the model, and the availability of a large benign dataset $\mathcal{D}_{be}$ for fine-tuning.

B. TRIGGER RECONSTRUCTION
The methods belonging to this category specifically assume that the trigger is source-agnostic, i.e., an input from any source class plus the triggering pattern υ can activate the backdoor and induce a misclassification in favour of the target class. The defender tries to reverse-engineer υ either by accessing the internal details of the model $F_{\theta^{\alpha}}$ (white-box setting) or by querying it (black-box setting). For all these methods, once the trigger has been reconstructed, the model is retrained in such a way to unlearn the backdoor.

The first trigger-reconstruction method, named Neural Cleanse, was proposed by Wang et al. [16] in 2019, and is based on the following intuition: a source-agnostic backdoor creates a shortcut to the target class by exploiting the sparsity of the input space. Fig. 15 exemplifies the situation for the case of a 2-dimensional input space. The top figure illustrates a clean model, where a large perturbation is needed to move

FIGURE 15. Simplified representation of the input space of a clean model (top) and a source-agnostic backdoored model (bottom). A smaller modification is needed to move samples of class 'b' and 'c' across the decision boundary of class 'a' in the bottom case.
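The core optimization behind reverse-engineering defences of this kind can be sketched as follows: for a candidate target class, a pattern and a mask are optimized so that benign inputs stamped with them are classified as that class, while an L1 penalty keeps the reconstructed trigger small. This is a simplified illustration in the spirit of Neural Cleanse [16]; the optimizer settings, the toy model, and the single-class loop are assumptions rather than the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, images, target_class, steps=200, lam=0.01, lr=0.1):
    """Optimise a (pattern, mask) pair that moves benign inputs into `target_class`
    while keeping the mask small (L1 regularisation)."""
    _, C, H, W = images.shape
    pattern = torch.rand(1, C, H, W, requires_grad=True)
    mask_logits = torch.zeros(1, 1, H, W, requires_grad=True)
    opt = torch.optim.Adam([pattern, mask_logits], lr=lr)
    target = torch.full((images.shape[0],), target_class, dtype=torch.long)
    for _ in range(steps):
        mask = torch.sigmoid(mask_logits)                       # keep the mask in [0, 1]
        stamped = (1 - mask) * images + mask * torch.sigmoid(pattern)
        loss = F.cross_entropy(model(stamped), target) + lam * mask.abs().sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(pattern).detach(), torch.sigmoid(mask_logits).detach()

# Repeating the optimisation for every class and flagging the class whose reconstructed
# mask is anomalously small (e.g. via a median-absolute-deviation test) mimics the
# outlier-detection step used by these defences.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
benign_images = torch.rand(16, 3, 32, 32)
pattern, mask = reverse_engineer_trigger(model, benign_images, target_class=3, steps=20)
```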
G is trained by minimizing a loss function defined as:

$$\mathcal{L}(x, k) = \mathcal{L}_D(x + G(z, k), k) + \lambda \mathcal{L}_G(z, k) \qquad (15)$$

where $\mathcal{L}_D(x, k) = -\log [f_{\theta^{\alpha}}(x)]_k$¹⁰ and $\mathcal{L}_G(x, k)$ is a regularization term that ensures that the estimated poisoned image $\hat{\tilde{x}} = x + G_{\omega}(z, k)$ can not be distinguished from the original one, and that the magnitude of $G(z, k)$ is limited (to stabilize training). Once the potential triggers $G(z, k)$, $k = 1, \ldots, C$, have been determined, the defender proceeds as in [16] to perform outlier detection, determining the trigger υ, and then removes the backdoor via fine-tuning. With regard to time complexity, the method is 9.7 times faster than Neural Cleanse when the model is trained for a 2622-class classification task on the VGGFace dataset.

¹⁰ We remind that $[f_{\theta^{\alpha}}(x)]_k$ is the predicted probability for class k.

Another black-box defence based on trigger reconstruction and outlier detection, which also resorts to a GAN to reconstruct the trigger, has been proposed by Zhu et al. [118]. Notably, the methods in [22], [104] and [118] have been shown to work with various patterns and sizes of the trigger, and are also capable of reconstructing multiple triggers, whereas Neural Cleanse [16] can detect only a single, small-size, and invariant trigger. Another method based on trigger reconstruction that can effectively work with multiple triggers has also been proposed, under the assumption that the trigger size is known to the defender.

All the methods based on trigger reconstruction have a complexity which is proportional to the number of classes. Therefore, when the classification task has a large number of classes (as in many face recognition applications, for instance), those methods are very time consuming.

C. META-CLASSIFICATION
The approaches resorting to meta-classification aim at training a neural network to judge whether a model is backdoored or not. Given a set of N trained models, half backdoored ($F_{\theta_i^{\alpha}}$) and half benign ($F_{\theta_i}$), $i = 1, \ldots, N$, the goal is to learn a classifier $F_{\theta}^{meta}: \mathcal{F} \rightarrow \{0, 1\}$ to discriminate them. Methods that resort to meta-classification are provided in [105] and [71]. In [105], given the dataset of models, the features to be used for the classification are extracted by querying each model $F_{\theta_i}$ (or $F_{\theta_i^{\alpha}}$) with several inputs and concatenating the extracted features, i.e., the vectors $f_{\theta_i}^{-1}$ (or $f_{\theta_{i,\alpha}}^{-1}$). Eventually, the meta-classifier $F_{\theta}^{meta}$ is trained on these feature vectors. To improve the performance of meta-classification, the meta-classifier and the query set are jointly optimized. A different approach is adopted in [71], where a functional is optimized in order to get universal patterns $z_m$, $m = 1, \ldots, M$, such that looking at the output of the networks in correspondence to such $z_m$'s, that is, $\{f(z_m)\}_{m=1}^{M}$, allows to reveal the presence of the backdoor. Another difference between [105] and [71] is in the way the dataset of the backdoored models $F_{\theta_i^{\alpha}}$ is generated, that is, in the distribution of the triggering patterns. In [105], the poisoned models considered in the training set are obtained by training them on a poisoned set of images where the triggering patterns follow a so-called jumbo distribution, and consist in continuous compact patterns with random shape, size, and transparency. In [71], instead, the triggering patterns used to build the poisoned samples used to train the various models are square-shaped fixed geometrical patterns. In both cases, the patterns have random locations.

Interestingly, both methods generalize well to a variety of triggering patterns that were not considered in the training process. Moreover, while the method in [105] lacks flexibility, as $F_{\theta}^{meta}$ works for a fixed dimension of the feature space of the to-be-tested model, the method in [71] generalizes also to different architectures, with a different number of neurons, different depths and activation functions with respect to those considered during training. The computational complexity is high for off-line training; however, the meta-classification itself is very fast.

VI. TRAINING DATASET LEVEL DEFENCES
With defences operating at the training dataset level, the defender (who now corresponds to Alice) is assumed to control the training process, so she can directly inspect the poisoned training dataset $\mathcal{D}_{tr}^{\alpha}$ and access the possibly backdoored model. The training dataset can be partitioned into subsets $\mathcal{D}_{tr,k}$, including the samples of class k ($k = 1, \ldots, C$). The common assumption made by defence methods working at this level is that among the subsets $\mathcal{D}_{tr,k}$ there exists (at least) one subset $\mathcal{D}_{tr,t}$, containing both benign and poisoned data, while the other subsets include only benign data. Then, the detection algorithm $Det(\cdot)$ and the removal algorithm $Rem(\cdot)$ work directly on $\mathcal{D}_{tr}^{\alpha}$. A summary of all relevant works operating at the training dataset level is given in Table 4.

An obvious defence at this level, at least for the corrupted-label scenario, would consist in checking the consistency of the labels and removing the samples with inconsistent labels from $\mathcal{D}_{tr}^{\alpha}$. Despite its conceptual simplicity, this process requires either a manual investigation or the availability of efficient labelling tools, which may not be easy to build. More general and sophisticated approaches, which are not limited to the case of corrupted-label settings, are described in the following.

In 2018, Tran et al. [18] have proposed to use an anomaly detector to reveal anomalies inside the training set of one or more classes. They employ singular value decomposition (SVD) to design an outlier detector, which detects outliers among the training samples by analyzing their feature representation, that is, the activations of the last hidden layer $f_{\theta^{\alpha}}^{-1}$ of $F_{\theta^{\alpha}}$. Specifically, the defender splits $\mathcal{D}_{tr}^{\alpha}$ into C subsets $\mathcal{D}_{tr,k}^{\alpha}$, each with the samples of class k. Then, for every k, SVD is applied to the covariance matrix of the feature vectors of the images in $\mathcal{D}_{tr,k}$, to get the principal directions. Given the first principal direction $d_1$, the outlier score for each image $x_i$ is calculated as $(x_i \cdot d_1)^2$. Such a score is then used to measure the deviation of each image from the centroid of the distribution. The images are ranked based on the outlier score and the top ranked $1.5p|\mathcal{D}_{tr,k}|$
images are removed for each class, where $p \in [0, 0.5]$. Finally, Alice retrains a cleaned model $F_{\theta^{c}}$ from scratch on the cleaned dataset. No detection function, establishing if the training set is poisoned or not, is actually provided by this method (which aims only at cleaning the possibly poisoned dataset).

In [19], Chen et al. describe a so-called Activation Clustering (AC) method, that analyzes the neural network activations of the last hidden layer (the representation layer) to determine if the training data has been poisoned or not. The intuition behind this method is that a backdoored model assigns poisoned and benign data to the target class based on different features, that is, by relying on the triggering pattern for the poisoned samples, and on the ground-truth features for the benign ones. This difference is reflected in the representation layer. Therefore, for the target class of the attack, the feature representations of the samples will tend to cluster into two groups, while the representations for the other classes will cluster in one group only. Based on this intuition, for each subset $\mathcal{D}_{tr,k}$ of $\mathcal{D}_{tr}^{\alpha}$, the defender feeds the images $x_i$ to the model $F_{\theta^{\alpha}}$, obtaining the corresponding subset of feature representation vectors, or activations, $f_{\theta^{\alpha}}^{-1}(x_i)$. Once the activations have been obtained for each training sample, the subsets are clustered separately for each label. To cluster the activations, the k-means algorithm is applied with k = 2 (after dimensionality reduction). k-means clustering separates the activations into two clusters, regardless of whether the dataset is poisoned or not. Then, in order to determine which, if any, of the clusters corresponds to a poisoned subset, one possible approach is to analyze the relative size of the two clusters. A cluster is considered to be poisoned if it contains less than a fraction p of the data of class k, that is, $p|\mathcal{D}_{tr,k}|$ samples, where $p \in [0, 0.3]$ (the expectation being that poisoned clusters contain no more than a small fraction of class samples, that is, $\beta_k \leq p$). The corresponding class is detected as the target class. As a last step, the defender cleans the training dataset by removing the smallest cluster in the target class, and retraining a new model $F_{\theta^{c}}$ from scratch on the cleaned dataset. As we said, AC can be applied only when the class poisoning ratio $\beta_k$ is lower than p, ensuring that the poisoned data represents a minority subset in the target class. Another method resorting to feature clustering to detect a backdoor attack has been proposed in [122].

Even if k-means clustering with k = 2 can perfectly separate the poisoned data on MNIST and CIFAR-10 when a perceptible triggering pattern is used, Xiang et al. [120] have shown that in many cases, e.g. when the backdoor pattern is more subtle, the representation vectors of poisoned and benign data can not be separated well in the feature space. This is the case, for instance, when CIFAR10 is attacked with the single-pixel backdoor attack. To improve the results in this case, the authors replace k-means clustering with a method based on a Gaussian Mixture Model (GMM), which can also automatically determine the number of clusters. Under the assumption of a subtle (one-pixel) trigger, the authors apply blurring filtering to determine whether a cluster is poisoned or not. After blurring, the samples from the poisoned cluster are assigned to the true class with high probability.

A defence working at the training dataset level designed to cope with clean-label backdoor attacks has been proposed in [121]. The defence relies on a so-called deep k-Nearest Neighbors (k-NN) defence against the feature-collision [40] and convex polytope [90] attacks mentioned in Section III-B. The defence relies on the observation that, in the representation space, the poisoned samples of a feature collision attack are surrounded by samples having a different label (the target label) (see Fig. 12). Then, the authors compare the label of each point $x_i^{tr}$ of the training set with its k nearest neighbors (determined based on the Euclidean distance) in the representation space. If the label of $x^{tr}$ does not correspond to the label of the majority of the k neighbors, $x^{tr}$ is classified as a poisoned sample and removed from the training dataset. Eventually, the network is retrained on the cleaned training dataset to obtain a clean model $F_{\theta^{c}}$.

As the last example of this class of defences, we mention the work proposed in [123]. The defence proposed therein works against source-specific backdoor attacks, that is, attacks for which the triggering pattern causes a misclassification only when it is added to the images of a specific class (also called targeted contamination attacks). The authors show that this kind of backdoor is more stealthy than source-agnostic backdoors. In this case, in fact, poisoned and benign data can not be easily distinguished by looking at the representation level. The approach proposed in [123] is built upon the universal variation assumption, according to which the natural variation of the samples of any uninfected class follows the same distribution of the benign images in the attacked class. For example, in image classification tasks, the natural intra-class variation of each object (e.g., lighting, poses, expressions, etc.) has the same distribution across all labels (this is, for instance, the case of image classification, traffic sign and face recognition tasks). For such tasks, a DNN model tends
to generate a feature representation that can be decomposed into two parts, one related to the object's identity (e.g. a given individual) and the other depending on the intra-class variations, randomly drawn from a distribution. The method described in [123] proposes to separate the identity-related features from those associated with the intra-class variations by running an Expectation-Maximization (EM) algorithm [124] across all the representations of the training samples. Then, if the data distribution of one class is scattered, that class will likely be split into two groups (each group sharing a different identity). If the data distribution is concentrated, the class will be considered as a single cluster sharing the same identity. Finally, the defender will judge the class with two groups as an attacked class.

Other works operating at the training dataset level are described below.

Du et al. [125] have theoretically and empirically proved that applying differential privacy during the training process can efficiently prevent the model from overfitting to atypical samples. Inspired by this, the authors first add Gaussian noise to the poisoned training dataset, and then utilize it to train an auto-encoder outlier detector. Since poisoned samples are atypical ones, the detector judges a sample to be poisoned if its classification is achieved with less confidence.

Finally, Yoshida et al. [126] and Chen et al. [127] share a similar idea for cleaning poisoned data, that is, distilling the clean knowledge from the backdoored model, and further removing poisoned data from the poisoned training dataset by comparing the predictions of the backdoored and distillation models.

VII. FINAL REMARKS AND RESEARCH ROADMAP
In this work, we have given an overview of backdoor attacks against deep neural networks and possible defences. We started the overview by presenting a unifying framework to cast backdoor attacks in. In doing so, we paid particular attention to defining the threat models and the requirements that the attackers and defenders must satisfy under various settings. Then, we reviewed the main attacks and defences proposed so far, casting them in the general framework outlined previously. This allowed us to critically review the strengths and drawbacks of the various approaches with reference to the application scenarios wherein they are operating. At the same time, our analysis helps to identify the main open issues still waiting for a solution, thus contributing to outlining a roadmap for future research, as described in the rest of this section.

A. OPEN ISSUES
Notwithstanding the amount of works published so far, there are several open issues that still remain to be addressed, the most relevant of which are detailed in the following.
• More general defences: existing defences are often tailored solutions that work well only under very specific assumptions about the behavior of the adversary, e.g. on the triggering pattern and its size. In real-life applications, however, these assumptions do not necessarily hold. Future research should, then, focus on the development of more general defences, with minimal working assumptions on the attacker's behaviour.
• Improving the robustness of backdoors: the development of strategies to improve backdoor robustness is another important research line that should occupy the agenda of researchers. Current approaches can resist, up to some extent, parameter pruning and fine-tuning of final layers, while robustness against retraining of all layers and, more in general, transfer learning, is not within reach of current techniques. Achieving such robustness is particularly relevant when backdoors are used for benign purposes (see VII-C). The study of backdoor attacks in the physical domain is another interesting, yet rather unexplored, research direction (see [128] for a preliminary work in this sense), calling for the development of backdoor attacks that can survive the analog-to-digital conversion involved by physical domain applications.
• Development of an underlying theory: we ambitiously advocate the need of an underlying theory that can help to solve some of the fundamental problems behind the development of backdoor attacks, like, for instance, the definition of the optimal triggering pattern (in most of the backdoor attacks proposed so far, the triggering pattern is a prescribed signal, arbitrarily defined). From the defender's side, a theoretical framework can help the development of more general defences that are effective under a given threat model.
• Video backdoor attacks (and defences): backdoor attacks against video processing networks have attracted significantly less interest than attacks working on still images, yet there would be plenty of applications wherein such attacks would be even more relevant than for image-based systems. As a matter of fact, the current literature either focuses on the simple corrupted-label scenario [76], or it merely applies tools developed for images at the video frame level [94]. However, for a proper development of video backdoor attacks (and defences), the temporal dimension has to be taken into account, e.g., by designing a triggering pattern that exploits the time dimension of the problem.

B. EXTENSION TO DOMAINS OTHER THAN COMPUTER VISION
As mentioned in the introduction, although in this survey we focused on image and video classification, backdoor attacks and defences have also been studied in other application domains, e.g., in deep reinforcement learning [129] and natural language processing [28], where, however, the state of the art is less mature.

1) DEEP REINFORCEMENT LEARNING (DRL)
In 2020, Kiourti et al. [129] have presented a backdoor attack against a DRL system. In this scenario, the backdoored
A trapdoor honeypot is similar to a backdoor in that it causes a misclassification error in the presence of a specific, minimum-energy triggering pattern. When building an adversarial example, the attacker will likely, and inadvertently, exploit the weakness introduced within the DNN by the backdoor and come up with an adversarial perturbation which is very close to the triggering pattern purposely introduced by the defender at training time. In this way, the defender may recognize that an adversarial attack is ongoing and react accordingly.
More specifically, given a to-be-protected class t, the defender trains a backdoored model F_{θ*_α} such that F_{θ*_α}(x + υ) = t ≠ F_{θ*_α}(x), where υ is a low-energy triggering pattern, called a loss-minimizing trapdoor, designed in such a way as to minimize the loss for the target label. The presence of an adversarial input can then be detected by looking for the presence of the pattern υ within the input sample, trusting that the algorithm used to construct the adversarial perturbation will exploit the existence of a low-energy pattern υ capable of inducing a misclassification error in favour of class t. Based on the results shown in [142], the trapdoor-enabled defence achieves high accuracy against many state-of-the-art targeted adversarial example attacks.
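One possible way to implement the detection step is sketched below. The use of a feature-space signature of the trapdoored inputs and of the cosine similarity as detection statistic, as well as the threshold value, are illustrative assumptions and are not taken verbatim from [142].

# Sketch of trapdoor-style detection: the defender records the average feature-space
# "signature" produced by inputs containing the trapdoor pattern υ, and flags a test
# input as adversarial when its features are unusually similar to that signature.
import torch
import torch.nn.functional as F

def trapdoor_signature(feature_extractor, clean_x, trapdoor):
    """Average penultimate-layer activation of trapdoored inputs x + υ (clean_x: (N, C, H, W))."""
    with torch.no_grad():
        feats = feature_extractor((clean_x + trapdoor).clamp(0, 1))
    return feats.mean(dim=0)

def looks_adversarial(feature_extractor, x, signature, threshold=0.8):
    """Flag x when its feature vector is close (cosine-wise) to the trapdoor signature.

    The threshold would be calibrated on clean validation data, e.g., as a high
    percentile of the similarities observed on benign inputs.
    """
    with torch.no_grad():
        feat = feature_extractor(x.unsqueeze(0)).squeeze(0)
    sim = F.cosine_similarity(feat, signature, dim=0)
    return sim.item() > threshold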
Such a defence works only against targeted attacks, and trapdoor honeypots against non-targeted adversarial examples have still to be developed. Moreover, how to extend the idea of trapdoor honeypots to defend against black-box adversarial examples, which do not adopt a low-energy pattern, is an open issue deserving further attention.
REFERENCES
[1] C. Szegedy et al., “Intriguing properties of neural networks,” in Proc. 2nd Int. Conf. Learn. Representations, 2014, pp. 1–10.
[2] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” 2014, arXiv:1412.6572.
[3] B. Biggio, B. Nelson, and P. Laskov, “Poisoning attacks against support vector machines,” in Proc. 29th Int. Conf. Mach. Learn., 2012, pp. 1467–1474.
[4] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg, “BadNets: Evaluating backdooring attacks on deep neural networks,” IEEE Access, vol. 7, pp. 47230–47244, 2019, doi: 10.1109/ACCESS.2019.2909068.
[5] L. Muñoz-González et al., “Towards poisoning of deep learning algorithms with back-gradient optimization,” in Proc. 10th ACM Workshop Artif. Intell. Secur., 2017, pp. 27–38.
[6] X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted backdoor attacks on deep learning systems using data poisoning,” 2017, arXiv:1712.05526.
[7] L. Muñoz-González, B. Pfitzner, M. Russo, J. Carnerero-Cano, and E. C. Lupu, “Poisoning attacks with generative adversarial nets,” 2019, arXiv:1906.07773.
[8] P. W. Koh, J. Steinhardt, and P. Liang, “Stronger data poisoning attacks break data sanitization defenses,” Mach. Learn., vol. 111, no. 1, pp. 1–47, 2022.
[9] J. Steinhardt, P. W. W. Koh, and P. S. Liang, “Certified defenses for data poisoning attacks,” in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 3520–3532.
[10] I. Diakonikolas, G. Kamath, D. Kane, J. Li, J. Steinhardt, and A. Stewart, “Sever: A robust meta-algorithm for stochastic optimization,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 1596–1606.
[11] J. Carnerero-Cano, L. Muñoz-González, P. Spencer, and E. C. Lupu, “Regularisation can mitigate poisoning attacks: A novel analysis based on multiobjective bilevel optimisation,” 2020, arXiv:2003.00040.
[12] M. Jagielski, A. Oprea, B. Biggio, C. Liu, C. Nita-Rotaru, and B. Li, “Manipulating machine learning: Poisoning attacks and countermeasures for regression learning,” in Proc. IEEE Symp. Secur. Privacy, 2018, pp. 19–35.
[13] Y. Ji, X. Zhang, and T. Wang, “Backdoor attacks against learning systems,” in Proc. IEEE Conf. Netw. Secur., 2017, pp. 1–9.
[14] Y. Liu, Y. Xie, and A. Srivastava, “Neural trojans,” in Proc. IEEE Int. Conf. Comput. Des., 2017, pp. 45–48.
[15] K. Liu, B. Dolan-Gavitt, and S. Garg, “Fine-pruning: Defending against backdooring attacks on deep neural networks,” in Proc. Int. Symp. Res. Attacks, Intrusions, Defenses, 2018, pp. 273–294.
[16] B. Wang et al., “Neural cleanse: Identifying and mitigating backdoor attacks in neural networks,” in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 707–723.
[17] W. Guo, L. Wang, X. Xing, M. Du, and D. Song, “TABOR: A highly accurate approach to inspecting and restoring trojan backdoors in AI systems,” 2019, arXiv:1908.01763.
[18] B. Tran, J. Li, and A. Madry, “Spectral signatures in backdoor attacks,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 8000–8010.
[19] B. Chen et al., “Detecting backdoor attacks on deep neural networks by activation clustering,” in Proc. Workshop Artif. Intell. Saf., co-located with 33rd AAAI Conf. Artif. Intell., vol. 2301, 2019.
[20] E. Chou, F. Tramèr, and G. Pellegrino, “SentiNet: Detecting localized universal attacks against deep learning systems,” in Proc. IEEE Secur. Privacy Workshops, 2020, pp. 48–54.
[21] Y. Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe, and S. Nepal, “STRIP: A defence against trojan attacks on deep neural networks,” in Proc. 35th Annu. Comput. Secur. Appl. Conf., 2019, pp. 113–125.
[22] H. Chen, C. Fu, J. Zhao, and F. Koushanfar, “DeepInspect: A black-box trojan detection and mitigation framework for deep neural networks,” in Proc. 28th Int. Joint Conf. Artif. Intell., 2019, pp. 4658–4664.
[23] Y. Liu et al., “A survey on neural trojans,” in Proc. IEEE 21st Int. Symp. Qual. Electron. Des., 2020, pp. 33–39, doi: 10.1109/ISQED48828.2020.9137011.
[24] Y. Chen, X. Gong, Q. Wang, X. Di, and H. Huang, “Backdoor attacks and defenses for deep neural networks in outsourced cloud environments,” IEEE Netw., vol. 34, no. 5, pp. 141–147, Sep./Oct. 2020.
[25] Y. Li, B. Wu, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor learning: A survey,” 2020, arXiv:2007.08745.
[26] M. Goldblum et al., “Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses,” IEEE Trans. Pattern Anal. Mach. Intell., early access, Mar. 25, 2022, doi: 10.1109/TPAMI.2022.3162397.
[27] A. Schwarzschild, M. Goldblum, A. Gupta, J. P. Dickerson, and T. Goldstein, “Just how toxic is data poisoning? A unified benchmark for backdoor and data poisoning attacks,” in Proc. 38th Int. Conf. Mach. Learn., vol. 139, 2021, pp. 9389–9398.
[28] J. Dai, C. Chen, and Y. Li, “A backdoor attack against LSTM-based text classification systems,” IEEE Access, vol. 7, pp. 138872–138878, 2019.
[29] H. Kwon and S. Lee, “Textual backdoor attack for the text classification system,” Secur. Commun. Netw., vol. 2021, pp. 1–11, 2021.
[30] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov, “How to backdoor federated learning,” in Proc. Int. Conf. Artif. Intell. Statist., 2020, pp. 2938–2948.
[31] A. N. Bhagoji, S. Chakraborty, P. Mittal, and S. Calo, “Analyzing federated learning through an adversarial lens,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 634–643.
[32] C. Xie, K. Huang, P.-Y. Chen, and B. Li, “DBA: Distributed backdoor attacks against federated learning,” in Proc. Int. Conf. Learn. Representations, 2019, pp. 1–15.
[33] C.-L. Chen, L. Golubchik, and M. Paolieri, “Backdoor attacks on federated meta-learning,” 2020, arXiv:2006.07026.
[34] C. Xie, M. Chen, P. Chen, and B. Li, “CRFL: Certifiably robust federated learning against backdoor attacks,” in Proc. 38th Int. Conf. Mach. Learn., 2021, pp. 11372–11382. [Online]. Available: http://proceedings.mlr.press/v139/xie21a.html
[35] Y. Li, Y. Li, Y. Lv, Y. Jiang, and S.-T. Xia, “Hidden backdoor attack against semantic segmentation models,” 2021, arXiv:2103.04038.
[36] N. Carlini and A. Terzis, “Poisoning and backdooring contrastive learning,” 2021, arXiv:2106.09667.
[81] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “VGGFace2: A dataset for recognising faces across pose and age,” in Proc. 13th IEEE Int. Conf. Autom. Face Gesture Recognit., 2018, pp. 67–74.
[82] A. Turner, D. Tsipras, and A. Madry, “Label-consistent backdoor attacks,” 2019, arXiv:1912.02771.
[83] M. Alberti et al., “Are you tampering with my data?,” in Proc. Comput. Vis. ECCV Workshops, 2018, pp. 296–312.
[84] M. Barni, K. Kallas, and B. Tondi, “A new backdoor attack in CNNs by training set corruption without label poisoning,” in Proc. IEEE Int. Conf. Image Process., 2019, pp. 101–105.
[85] Y. Liu, X. Ma, J. Bailey, and F. Lu, “Reflection backdoor: A natural backdoor attack on deep neural networks,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 182–199.
[86] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot, “Benchmarking single-image reflection removal algorithms,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3922–3930.
[87] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[88] R. Ning, J. Li, C. Xin, and H. Wu, “Invisible poison: A blackbox clean label backdoor attack to deep neural networks,” in Proc. IEEE Int. Conf. Comput. Commun., 2021, pp. 1–10.
[89] O. Suciu, R. Marginean, Y. Kaya, H. D. III, and T. Dumitras, “When does machine learning fail? Generalized transferability for evasion and poisoning attacks,” in Proc. 27th USENIX Secur. Symp., 2018, pp. 1299–1316. [Online]. Available: https://www.usenix.org/conference/usenixsecurity18/presentation/suciu
[90] C. Zhu, W. R. Huang, H. Li, G. Taylor, C. Studer, and T. Goldstein, “Transferable clean-label poisoning attacks on deep neural nets,” in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 7614–7623. [Online]. Available: http://proceedings.mlr.press/v97/zhu19a.html
[91] A. Saha, A. Subramanya, and H. Pirsiavash, “Hidden trigger backdoor attacks,” in Proc. 34th AAAI Conf. Artif. Intell., 2020, pp. 11957–11965. [Online]. Available: https://aaai.org/ojs/index.php/AAAI/article/view/6871
[92] J. Chen, L. Zhang, H. Zheng, X. Wang, and Z. Ming, “DeepPoison: Feature transfer based stealthy poisoning attack for DNNs,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 68, no. 7, pp. 2618–2622, Jul. 2021, doi: 10.1109/TCSII.2021.3060896.
[93] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” 2017, arXiv:1706.06083.
[94] S. Zhao, X. Ma, X. Zheng, J. Bailey, J. Chen, and Y.-G. Jiang, “Clean-label backdoor attacks on video recognition models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 14443–14452.
[95] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” 2012, arXiv:1212.0402.
[96] B. G. Doan, E. Abbasnejad, and D. C. Ranasinghe, “Februus: Input purification defense against trojan attacks on deep neural network systems,” in Proc. Annu. Comput. Secur. Appl. Conf., 2020, pp. 897–912.
[97] E. Sarkar, Y. Alkindi, and M. Maniatakos, “Backdoor suppression in neural networks using input fuzzing and majority voting,” IEEE Des. Test, vol. 37, no. 2, pp. 103–110, Apr. 2020.
[98] H. Kwon, “Detecting backdoor attacks via class difference in deep neural networks,” IEEE Access, vol. 8, pp. 191049–191056, 2020.
[99] H. Fu, A. K. Veldanda, P. Krishnamurthy, S. Garg, and F. Khorrami, “Detecting backdoors in neural networks using novel feature-based anomaly detection,” 2020, arXiv:2011.02526.
[100] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks,” in Proc. IEEE Winter Conf. Appl. Comput. Vis., 2018, pp. 839–847.
[101] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of Wasserstein GANs,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5767–5777.
[102] H. Kwon, “Defending deep neural networks against backdoor attack by using de-trigger autoencoder,” IEEE Access, early access, Oct. 18, 2021, doi: 10.1109/ACCESS.2021.3086529.
[103] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: Identifying density-based local outliers,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2000, pp. 93–104.
[104] A. K. Veldanda et al., “NNoculation: Broad spectrum and targeted treatment of backdoored DNNs,” 2020, arXiv:2002.08313.
[105] X. Xu, Q. Wang, H. Li, N. Borisov, C. A. Gunter, and B. Li, “Detecting AI trojans using meta neural analysis,” in Proc. IEEE Symp. Secur. Privacy, 2021, pp. 103–120.
[106] M. Villarreal-Vasquez and B. Bhargava, “ConFoc: Content-focus protection against trojan attacks on neural networks,” 2020, arXiv:2007.00711.
[107] H. Qiu, Y. Zeng, S. Guo, T. Zhang, M. Qiu, and B. M. Thuraisingham, “DeepSweep: An evaluation framework for mitigating DNN backdoor attacks using data augmentation,” in Proc. ACM Asia Conf. Comput. Commun. Secur., 2021, pp. 363–377, doi: 10.1145/3433210.3453108.
[108] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2414–2423.
[109] L. Truong et al., “Systematic evaluation of backdoor data poisoning attacks on image classifiers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, 2020, pp. 788–789.
[110] Y. Li, X. Lyu, N. Koren, L. Lyu, B. Li, and X. Ma, “Neural attention distillation: Erasing backdoor triggers from deep neural networks,” in Proc. 9th Int. Conf. Learn. Representations, 2021, pp. 1–19. [Online]. Available: https://openreview.net/forum?id=9l0K4OM-oXE
[111] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson, “Loss surfaces, mode connectivity, and fast ensembling of DNNs,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 8803–8812. [Online]. Available: https://proceedings.neurips.cc/paper/2018/hash/be3087e74e9100d4bc4c6268cdbe8456-Abstract.html
[112] S. C. Park and H. Shin, “Polygonal chain intersection,” Comput. Graph., vol. 26, no. 2, pp. 341–350, 2002.
[113] R. T. Farouki, “The Bernstein polynomial basis: A centennial retrospective,” Comput. Aided Geometric Des., vol. 29, no. 6, pp. 379–419, 2012.
[114] F. R. Hampel, “The influence curve and its role in robust estimation,” J. Amer. Stat. Assoc., vol. 69, no. 346, pp. 383–393, 1974.
[115] Z. Xiang, D. J. Miller, and G. Kesidis, “Detection of backdoors in trained classifiers without access to the training set,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 3, pp. 1177–1191, Mar. 2022.
[116] Z. Xiang, D. J. Miller, H. Wang, and G. Kesidis, “Detecting scene-plausible perceptible backdoors in trained DNNs without access to the training set,” Neural Comput., vol. 33, no. 5, pp. 1329–1371, 2021, doi: 10.1162/neco_a_01376.
[117] R. Wang, G. Zhang, S. Liu, P. Chen, J. Xiong, and M. Wang, “Practical detection of trojan neural networks: Data-limited and data-free cases,” in Proc. 16th Eur. Conf. Comput. Vis., vol. 12368, 2020, pp. 222–238.
[118] L. Zhu, R. Ning, C. Wang, C. Xin, and H. Wu, “GangSweep: Sweep out neural backdoors by GAN,” in Proc. 28th ACM Int. Conf. Multimedia, 2020, pp. 3173–3181.
[119] X. Qiao, Y. Yang, and H. Li, “Defending neural backdoors via generative distribution modeling,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 14004–14013.
[120] Z. Xiang, D. J. Miller, and G. Kesidis, “A benchmark study of backdoor data poisoning defenses for deep neural network classifiers and a novel defense,” in Proc. IEEE 29th Int. Workshop Mach. Learn. Signal Process., 2019, pp. 1–6.
[121] N. Peri et al., “Deep k-NN defense against clean-label data poisoning attacks,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 55–70, doi: 10.1007/978-3-030-66415-2_4.
[122] E. Soremekun, S. Udeshi, S. Chattopadhyay, and A. Zeller, “AEGIS: Exposing backdoors in robust machine learning models,” 2020, arXiv:2003.00865.
[123] D. Tang, X. Wang, H. Tang, and K. Zhang, “Demon in the variant: Statistical analysis of DNNs for robust backdoor contamination detection,” 2019, arXiv:1908.00686.
[124] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, “Bayesian face revisited: A joint formulation,” in Proc. 12th Eur. Conf. Comput. Vis., vol. 7574, 2012, pp. 566–579, doi: 10.1007/978-3-642-33712-3_41.
[125] M. Du, R. Jia, and D. Song, “Robust anomaly detection and backdoor attack detection via differential privacy,” in Proc. 8th Int. Conf. Learn. Representations, 2020, pp. 1–11. [Online]. Available: https://openreview.net/forum?id=SJx0q1rtvS